Weighted Dictionary

Consider an example of matching company names in which the query is the text string “ABC Corporation”. A simple query against the Company Name field, with normal scoring, returns the following results list:

Score   Record
0.71    Busy Corporation
0.70    XYZ Corporation
0.54    ABC Corp.
0.41    ABC Co.
0.30    ABC
0.25    Busy Corp.

The reason for this unsatisfactory result is that the query contains the term “Corporation”, whose importance in this query is much less than the substantial length of the word suggests. But the matching algorithm, being language-independent, matches “Corporation” as eagerly as it matches “ABC”. Company Name fields contain a small number of such lightweight terms that have little, if any, significance for a match.

ibi Patterns - Search allows you to attach semantic weight values to specific terms like “Corporation” and “Incorporated” by including such terms in a variant of a thesaurus table called a weighted dictionary. A weighted dictionary identifies certain terms (words or phrases) as possessing either less importance in a match, or more importance in a match, than the typical word or phrase. The weights assigned to terms in a weighted dictionary are real values greater than or equal to 0.0. If the term weight is less than 1.0, the term is a lightweight term whose importance (if detected in a query or record) is to be considered less than that of the typical word or phrase. If the term weight is greater than 1.0, the term is a heavyweight term whose importance (if detected in a query or record) is to be considered greater than that of the typical word or phrase. (Consider every term in the query or a record as having a default semantic weight of 1.0.)
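The weight ranges described above can be summarized in a few lines of Python. This is only an illustrative sketch: the function name classify_weight and the constant DEFAULT_WEIGHT are hypothetical and are not part of the product's API.

```python
DEFAULT_WEIGHT = 1.0  # semantic weight of any term not listed in a weighted dictionary


def classify_weight(weight: float) -> str:
    """Label a semantic term weight using the ranges given in the text."""
    if weight < 0.0:
        # The text states that weights are real values >= 0.0.
        raise ValueError("term weights must be greater than or equal to 0.0")
    if weight < DEFAULT_WEIGHT:
        return "lightweight"   # counts for less than a typical term
    if weight > DEFAULT_WEIGHT:
        return "heavyweight"   # counts for more than a typical term
    return "typical"           # exactly the default weight


# The values 0.1 and 2.5 are arbitrary illustrations.
for w in (0.1, 1.0, 2.5):
    print(w, classify_weight(w))
```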

Consider the following weighted dictionary to improve the results of the previous search:

Term weight   Equivalence class terms
0.1           Company, Co
0.1           Corporation, Corp
0.1           Incorporated, Inc

Note that, in addition to its class terms, each equivalence class now has a term weight associated with it. The equivalence class equates a term like “Incorporated” with abbreviations like “Inc.” The weight value is not a thesaurus weight but a semantic term weight. Here the weight value designates the terms in all three classes as lightweight terms with a semantic term weight of 0.1. This means that these terms have roughly one-tenth the importance they would otherwise have in the context of a match.
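One way to picture this dictionary is as a lookup from any class member to its class and weight. The Python sketch below is an assumption-laden illustration, not the product's file format or API: the names WEIGHTED_DICTIONARY, normalize, and lookup are hypothetical, and the punctuation and case handling in normalize is assumed so that “Corp.” and “Corp” compare equal.

```python
import re
from typing import Tuple

# Each entry: (semantic term weight, terms in the equivalence class).
# The three classes and the 0.1 weights come from the table above.
WEIGHTED_DICTIONARY = [
    (0.1, ("Company", "Co")),
    (0.1, ("Corporation", "Corp")),
    (0.1, ("Incorporated", "Inc")),
]

DEFAULT_WEIGHT = 1.0  # weight of any term not listed in the dictionary


def normalize(term: str) -> str:
    """Lowercase and drop punctuation (an assumed normalization)."""
    return re.sub(r"[^\w\s]", "", term).lower().strip()


# Index: normalized term -> (first term of its class, class weight).
_INDEX = {
    normalize(term): (terms[0], weight)
    for weight, terms in WEIGHTED_DICTIONARY
    for term in terms
}


def lookup(term: str) -> Tuple[str, float]:
    """Return (class representative, weight); unlisted terms keep weight 1.0."""
    return _INDEX.get(normalize(term), (term, DEFAULT_WEIGHT))


print(lookup("Corp."))  # ('Corporation', 0.1)
print(lookup("ABC"))    # ('ABC', 1.0)
```

Because every member of a class maps to the same representative, “Corporation” and “Corp.” can be treated as equivalent while both carry the lightweight weight.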

Perform the “ABC Corporation” query again. Here is the new results list:

Score   Record
1.00    ABC Corp.
0.92    ABC Co.
0.90    ABC
0.19    Busy Corporation
0.19    XYZ Corporation
0.16    Busy Corp.

The results show “ABC Corp.” emerging as the top match, followed by “ABC Co.” and “ABC”. The scores for these three records have increased, while the scores for the other three records have dropped. This is the effect of the term weight decreasing the significance of “Corporation” and “Corp”: both the matching of “ABC” and the failure to match “ABC” count for much more than they did. Moreover, with the thesaurus-like equivalence of “Corporation” and “Corp”, the top record receives a perfect score of 1.0.

Note that the record “ABC” has gone from a score of 0.30 to 0.90. Even though “Corporation” does not appear in the record, the “Corporation” string in the query is matched by the “Corporation” term in the weighted dictionary, and its effective length is reduced to one-tenth of its original length. Therefore, instead of most of the query being unmatched, most of it is now matched.
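The effect of the reduced effective length can be illustrated with a deliberately simplified model. The sketch below assumes that each query term contributes its character count multiplied by its weight, and that a term counts as matched only if it appears verbatim in the record. This is not the product's scoring algorithm, which the text does not specify, and it does not reproduce the 0.30 and 0.90 scores; it only shows the shift from "mostly unmatched" to "mostly matched". The names effective_length and matched_share are hypothetical.

```python
WEIGHTS = {"corporation": 0.1}  # from the example dictionary; other terms default to 1.0


def effective_length(term, weights):
    """Character count scaled by the term's semantic weight (assumed model)."""
    return len(term) * weights.get(term.lower(), 1.0)


def matched_share(query, record, weights):
    """Fraction of the query's effective length that is found in the record."""
    record_terms = {t.lower() for t in record.split()}
    query_terms = query.split()
    total = sum(effective_length(t, weights) for t in query_terms)
    if total == 0:
        return 0.0  # empty query, or every term has weight 0.0
    matched = sum(
        effective_length(t, weights)
        for t in query_terms
        if t.lower() in record_terms
    )
    return matched / total


# Without weights, "Corporation" (11 characters) dominates the 14-character query.
print(round(matched_share("ABC Corporation", "ABC", {}), 2))       # 0.21
# With weight 0.1, "Corporation" counts as 1.1 characters out of 4.1.
print(round(matched_share("ABC Corporation", "ABC", WEIGHTS), 2))  # 0.73
```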

Note: Unlike the classic thesaurus, the term weights are applied even if there is no match between query and record for the term. Term weights are applied whenever the term is found in the query, the record, or both.