Thesaurus Matching

The TIBCO Patterns string matching algorithm does an excellent job of finding similar strings, but sometimes it is desirable to equate strings that are textually dissimilar. For example, a computer hardware vendor might want customers searching for notebook to be able to find items listed under laptop. Because these two strings bear very little resemblance to each other, standard matching probably won't find these items, but with a thesaurus that lists laptop as a synonym for notebook, it will.

The available thesaurus types are substitution, weighted term, and combined. All of them share the following attributes.

A thesaurus defines a set of classes, where a class is a list of terms that are to be considered equivalent. A term is a list of one or more words (more correctly, tokens: whitespace-separated groups of characters). Because a term can have more than one word, it is possible to set up equivalences between phrases and words or between phrases and phrases, e.g. hypertension and high blood pressure. All terms within a class are considered equivalent, but equivalences are not transitive across classes. That is, if we have two equivalence classes:

john,jonathan,johnny
jean,john

Then john is equivalent to jonathan, johnny and jean, but neither jonathan nor johnny is equivalent to jean.
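
The non-transitive behavior can be illustrated with a minimal Python sketch. The CLASSES list and the equivalent function are hypothetical names used only for illustration, not part of any TIBCO Patterns API, and the sketch ignores error tolerance.

```python
# Hypothetical in-memory model of the two classes shown above.
CLASSES = [
    {"john", "jonathan", "johnny"},
    {"jean", "john"},
]

def equivalent(a, b, classes=CLASSES):
    """Return True if a and b are identical or share at least one class."""
    return a == b or any(a in c and b in c for c in classes)

print(equivalent("john", "jean"))      # share the second class
print(equivalent("jonathan", "jean"))  # never appear in the same class
```

Because membership is checked one class at a time, john links to terms in both classes, while jonathan and jean are never linked through john.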

By default, thesaurus matching supports a small degree of error tolerance. Thus, given the above classes, john also matches jonathin, which is just one letter off from jonathan, but would not match johnithon.

Error tolerance is relative to the length of the term in characters: short terms must match exactly, while long terms might allow up to two character differences. If error tolerance is not desired, you can specify when creating a thesaurus that it allow only exact matches.
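
As a rough illustration of length-relative tolerance, the following Python sketch uses plain edit distance with made-up length thresholds. The actual thresholds and distance measure used by TIBCO Patterns are not stated here, so allowed_errors and its cut-off values are assumptions chosen only to reproduce the jonathin and johnithon examples.

```python
def edit_distance(a, b):
    """Classic Levenshtein distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def allowed_errors(term_length):
    """Hypothetical thresholds: short terms exact, long terms up to two errors."""
    if term_length < 5:
        return 0
    if term_length < 8:
        return 1
    return 2

def fuzzy_term_match(candidate, term):
    """True if candidate is within the tolerance allowed for the thesaurus term."""
    return edit_distance(candidate, term) <= allowed_errors(len(term))

print(fuzzy_term_match("jonathin", "jonathan"))
print(fuzzy_term_match("johnithon", "jonathan"))
```

In this sketch jonathin is one edit away from jonathan and is accepted, while johnithon is three edits away and is rejected.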

Standard Substitution

This type is used for the typical use case described above. If it finds two equivalent, but not identical, terms in the query and the record, the two terms are linked together as a match. A substitution penalty can be defined in the query; it is applied to all such matched thesaurus terms. Essentially, instead of being considered a perfect 1.0 match, the score for the matched term is taken to be the penalty factor. Thus with a query of john and a record of jean, with no penalty the score is 1.0, with a 0.9 penalty the score is 0.9, and with a 0.1 penalty the score is 0.1. This allows terms that match perfectly without the thesaurus to be considered better matches than those matched by the thesaurus.
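
The penalty arithmetic can be sketched in a few lines of Python. The term_score function is a hypothetical simplification: it scores a single pair of terms, ignores error tolerance, and returns 0.0 for unrelated terms, which is an assumption rather than documented behavior.

```python
def term_score(query_term, record_term, classes, penalty=1.0):
    """Score one term pair: 1.0 if identical, the penalty if linked by a class."""
    if query_term == record_term:
        return 1.0
    if any(query_term in c and record_term in c for c in classes):
        return penalty
    return 0.0  # assumption: unrelated terms contribute nothing

classes = [{"john", "jonathan", "johnny"}, {"jean", "john"}]

for p in (1.0, 0.9, 0.1):
    print(p, term_score("john", "jean", classes, penalty=p))
```

With any penalty below 1.0, an exact john-to-john match scores higher than the thesaurus match of john to jean.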

Weighted Term

This is a special kind of thesaurus. Its primary purpose is not to link textually dissimilar terms, but to change the importance of certain terms within the match by applying a weighting factor to the term. A value below 1.0 indicates lower importance, while a value above 1.0 indicates higher importance. The special value -1.0 indicates a stop term: a term that is ignored completely, with matching done as if the term did not appear. For instance, in business name data the terms company and incorporated are very common and thus of little relevance to a match.

Without weighted terms, a search for abc incorporated would match more strongly on a2z incorporated than on abc company. By defining terms such as incorporated and company as weighted terms with very low weights, we tell the TIBCO Patterns server that even though there is a perfect match on the long string incorporated, this match is of little importance. The match of abc to abc then dominates the score and brings the desired record to the top.

Unlike the penalty factor of the substitution thesaurus, the weight on a weighted term does not lower the overall score; it essentially shrinks or enlarges the effective size of the term within the match. Also unlike a substitution thesaurus, a weighted term thesaurus does not necessarily create an equivalence between record and query: it applies the weighting factor to the term whenever and wherever it is found, even if no match is found for it. As with a substitution thesaurus, a weighted term thesaurus can be used to create matches between dissimilar items. For example, "inc" and "incorporated" can be specified as equivalent terms. If equivalent but non-equal terms are matched between query and record, the weighting factor is applied to the average of the lengths of the terms and, as with the substitution thesaurus, any penalty factor specified in the query is also applied, which results in a lower final score.
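
The effect of shrinking a term's effective size can be seen in a toy Python model. This is not the TIBCO Patterns scoring algorithm: toy_score is a hypothetical function that only counts exact token matches and divides the weighted length of matched query tokens by the weighted length of all query tokens. It ignores the partial similarity between abc and a2z, and the weights shown are illustrative.

```python
WEIGHTS = {"incorporated": 0.1, "company": 0.1}  # illustrative low weights

def effective_size(token, weights):
    """Token length scaled by its weight; a -1.0 stop term counts as size 0."""
    w = weights.get(token, 1.0)
    if w == -1.0:
        return 0.0
    return w * len(token)

def toy_score(query, record, weights):
    """Weighted share of the query's characters found as exact tokens in the record."""
    q_tokens = query.split()
    r_tokens = set(record.split())
    total = sum(effective_size(t, weights) for t in q_tokens)
    matched = sum(effective_size(t, weights) for t in q_tokens if t in r_tokens)
    return matched / total if total else 0.0

for rec in ("a2z incorporated", "abc company"):
    print(rec,
          round(toy_score("abc incorporated", rec, {}), 3),
          round(toy_score("abc incorporated", rec, WEIGHTS), 3))
```

Without weights the long token incorporated dominates and a2z incorporated ranks first; with the low weights its effective size shrinks and abc company ranks first.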

Each class within a weighted term thesaurus has its own weighting factor. This factor is a positive floating point number or the special stop term value -1.0. It is given as the first entry within the class. Because weighted terms do not necessarily define synonyms, it is perfectly legitimate to define classes with only a single term in them (in addition to the weight).
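
The class layout can be sketched as a small Python parser. The comma-separated line format is an assumption borrowed from the class examples earlier in this section; the real input format depends on the API or file format in use, and parse_weighted_class is a hypothetical helper.

```python
def parse_weighted_class(line):
    """Split 'weight,term[,term...]' into (weight, [terms]). Format is assumed."""
    fields = [f.strip() for f in line.split(",")]
    weight = float(fields[0])
    # The weight must be positive, or exactly -1.0 to mark a stop term.
    if weight <= 0 and weight != -1.0:
        raise ValueError("weight must be positive or the stop value -1.0")
    terms = fields[1:]
    if not terms:
        raise ValueError("a class needs at least one term after the weight")
    return weight, terms

print(parse_weighted_class("0.1,incorporated,inc"))
print(parse_weighted_class("-1.0,the"))
```

The second call shows a single-term class, which is valid because a weighted class does not have to define synonyms.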

Combined

A combined thesaurus combines the features of a substitution thesaurus and a weighted term thesaurus. Each class is given both a weight and a penalty. It behaves like a weighted term thesaurus in that the weighting factor is applied to all terms found in the query or record, even if they are not matched with equivalent terms. It behaves like a substitution thesaurus in that when equivalent, but not identical, terms are matched between query and record, those terms are linked as matched and the penalty is applied. To continue the example of business names, in addition to reducing the weight for common terms like incorporated, we might want to add equivalences for common names and abbreviations such as American Broadcasting Company and ABC. These could be entered with a weighting factor of 1.0 and whatever penalty is desired. A different penalty factor might be desired for common company nicknames such as IBM and Big Blue. The combined thesaurus gives you the flexibility to do so.

For combined thesauri, the first entry for each class is the weighting factor as defined for a weighted term thesaurus, and the second entry is the penalty, a floating point number between 0.0 and 1.0 inclusive. If a query thesaurus penalty is given with a combined thesaurus, that penalty is multiplied by the class penalty to get the final penalty. Generally, however, there should never be a need to use a query penalty with a combined thesaurus.
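
A short Python sketch shows the combined class layout and the penalty multiplication. As before, the comma-separated format and the helper names parse_combined_class and final_penalty are assumptions for illustration only.

```python
def parse_combined_class(line):
    """Split 'weight,penalty,term[,term...]' into (weight, penalty, [terms])."""
    fields = [f.strip() for f in line.split(",")]
    weight, penalty = float(fields[0]), float(fields[1])
    if not 0.0 <= penalty <= 1.0:
        raise ValueError("penalty must be between 0.0 and 1.0 inclusive")
    return weight, penalty, fields[2:]

def final_penalty(class_penalty, query_penalty=1.0):
    """The query penalty, if given, is multiplied by the class penalty."""
    return class_penalty * query_penalty

weight, penalty, terms = parse_combined_class(
    "1.0,0.9,American Broadcasting Company,ABC")
print(weight, penalty, terms)
print(final_penalty(penalty))                     # no query penalty
print(final_penalty(penalty, query_penalty=0.5))  # class and query penalties combined
```

Because the two penalties multiply, adding a query penalty on top of per-class penalties lowers every thesaurus match further, which is why a query penalty is rarely needed with a combined thesaurus.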

Conflict Resolution

It is possible that thesaurus matching finds more than one possible thesaurus match for a particular word in the query or record. Although the exact resolution rules are complex, the general rule is that such conflicts are resolved so as to maximize the overall score of the thesaurus match.