Classic Thesaurus Tables
A “classic” thesaurus table specifies sets of terms (words or phrases) that the matching algorithm detects and treats as equivalent. This is useful in cases where you want words or phrases to match each other despite having dissimilar spellings. A classic thesaurus table was in operation in the example in Introduction, where the nickname “Peggy” was allowed to match the name “Margaret” with a strong contribution to the match score. Terms in a classic thesaurus table (or the variant tables that are discussed later) can consist of single words or multiple-word phrases, as in the following equivalence classes:
| Equivalence class terms | |
|---|---|
| laptop | notebook |
| high blood pressure | hypertension |
Note the three-word phrase that occurs as the first term in the second row.
The following table represents another set of equivalence classes for color designations:
| Equivalence class terms | | | | |
|---|---|---|---|---|
| yellow | lemon | sunflower | | |
| yellow | canary | cream | ivory | maize |
| yellow | goldenrod | | | |
| green | cyan | aqua | teal | turquoise |
| blue | cyan | aqua | teal | turquoise |
These classes illustrate that the same term might occur in more than one class. For example, the term “yellow” is a member of each of the first three equivalence classes. The presence of a common term does not cause the three classes to merge; they remain distinct. Thus, “goldenrod” is not regarded as equivalent to “lemon” or “canary”, although all three are regarded as equivalent to “yellow”. Using a term common to several classes lets you equate a less-precise term (“yellow”) with several sets of more-precise terms (shades of “yellow” arranged in several groups), without equating the more-precise terms in different groups with each other.
The fourth and fifth classes illustrate a somewhat different use of terms common to multiple classes. There are many intermediate shades between “green” and “blue”; you might want to equate all of these with both “green” and “blue”, without equating the extremes of “green” and “blue” with each other.
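The non-merging behavior of classes that share a term can be sketched in Python. This is a minimal illustration, not the product's implementation: `CLASSES` and `equivalent` are hypothetical names, and the sketch assumes that two terms are equivalent only when at least one single class contains both of them.

```python
# Hypothetical model of the color thesaurus: each equivalence class is a set.
CLASSES = [
    {"yellow", "lemon", "sunflower"},
    {"yellow", "canary", "cream", "ivory", "maize"},
    {"yellow", "goldenrod"},
    {"green", "cyan", "aqua", "teal", "turquoise"},
    {"blue", "cyan", "aqua", "teal", "turquoise"},
]


def equivalent(a: str, b: str) -> bool:
    """Return True if some single class contains both terms.

    Classes are never merged, so equivalence does not carry over
    between classes that merely share a term.
    """
    return any(a in cls and b in cls for cls in CLASSES)


print(equivalent("goldenrod", "yellow"))  # True
print(equivalent("goldenrod", "lemon"))   # False
print(equivalent("teal", "green"))        # True
print(equivalent("teal", "blue"))         # True
print(equivalent("green", "blue"))        # False
```

Checking membership class by class, rather than building one merged set per term, is what keeps “green” and “blue” apart even though both are linked to the same intermediate shades.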
When matching with a thesaurus, you can specify a thesaurus weight. This weight specifies a penalty that is applied whenever equivalences are detected and used in matching. Like other kinds of weight values, the thesaurus weight is a value between 0.0 and 1.0, with 1.0 signifying that no penalty is applied to matches based on an equivalence class. If you want matches that depend on thesaurus equivalence to score strongly, but not to outrank high-quality matches that do not depend on thesaurus equivalence, set the thesaurus weight to a fairly high value below 1.0, such as 0.9 or 0.95.
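The effect of a thesaurus weight slightly below 1.0 can be illustrated with a small Python sketch. The scoring rule here is an assumption made purely for illustration: it treats the thesaurus weight as a simple multiplier on a term's score, which the text above does not specify. `THESAURUS` and `term_score` are hypothetical names.

```python
# Hypothetical equivalence classes, taken from the first table.
THESAURUS = [
    {"laptop", "notebook"},
    {"high blood pressure", "hypertension"},
]


def term_score(query: str, record: str, thesaurus_weight: float = 0.9) -> float:
    """Score a pair of terms under an ASSUMED multiplicative penalty.

    Exact matches score 1.0, thesaurus matches score thesaurus_weight,
    and anything else scores 0.0. A real matcher is more involved.
    """
    if not 0.0 <= thesaurus_weight <= 1.0:
        raise ValueError("thesaurus_weight must be between 0.0 and 1.0")
    if query == record:
        return 1.0
    if any(query in cls and record in cls for cls in THESAURUS):
        return thesaurus_weight
    return 0.0


print(term_score("hypertension", "hypertension"))         # 1.0
print(term_score("hypertension", "high blood pressure"))  # 0.9
print(term_score("hypertension", "laptop"))               # 0.0
```

Under this assumed rule, a weight of 0.9 keeps the thesaurus match strong while still ranking it below the exact match; a weight of 1.0 would make the two indistinguishable.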