Match Case Score Combiner

The preceding section on Complex Queries describes the many different querylets that can be combined in complex ways to produce a single overall score that represents the degree of similarity between two records. But sometimes users want to know if the two records represent the same entity. Determining whether two records represent the same entity involves more than just the overall similarity of the two records. It matters a great deal what portions of the record were similar to what degree and what portions were not.

ibi Patterns - Search provides two special score combiners to answer this question. The ibi Patterns - Search Record Linkage (RLINK) score combiner uses an associated Learn model to determine whether two records represent the same entity. When using ibi Patterns - Search Machine Learning Platform, a model can be trained using a set of examples to recognize the difference between matching and non-matching records. See ibi Patterns - Search Machine Learning Platform for a description of this feature. The second alternative is the Match Case Score Combiner feature. This feature uses user-defined rules to determine if two records match.

A standard practice in determining if two records match is to formulate a set of “rules” that define what must match to what degree before declaring that two records represent the same entity. The Match Case Score Combiner feature provides a means of directly implementing these match rules, which might otherwise require complex post-processing or multiple queries.

A standard approach in defining rules is to first divide the information about the entity into a set of categories. The determination of categories is a judgment call based on knowledge of the information and the needs of the business. In the Complex Queries example, the following categories are used:

Name - the first, middle and last name fields form the name category.
Street Address - the street 1 and street 2 fields form the street address category.
Location - the City and Zip fields form the location category.
Phone Number - the Home, Cell, and Fax fields form the phone number category.
Date of Birth - the DOB field forms the Date of Birth category.

A record that has no match on the name information is probably not a match regardless of how well the other portions of the record match. Match rules typically follow this pattern: they define a minimum set of categories that must match, and other categories then either support or refute the match. There might be multiple rules. These rules are determined based on the nature of the data and the needs of the business. The following is an example of a set of rules for the categories described previously. These rules are for illustrative purposes and do not represent a recommended set of rules.

Rule 1: Name, Street Address, and Location must match.
Rule 2: Name, Phone Number, and Date of Birth must match.

A match on the name alone is insufficient to determine a match. There might be many people named “John Smith”, but two John Smiths that live at the same address are probably the same person. Similarly, two John Smiths that have the same date of birth and a matching phone number are probably the same person, even if the street address and location are different. In this case, "John Smith" probably moved.

The categories that must match are called the core categories. The core categories are a set of categories that by themselves are sufficient for users to reasonably judge the two records to be the same entity.

But what does “match” mean when stating the core categories must match? If the aforementioned rules were implemented in SQL, or some other exact matching framework, “match” would be a simple equivalence test. But when using inexact matching, there is no equivalence, just degrees of similarity. Therefore, a degree of similarity that represents the boundary between something considered a match and something considered a non-match must be defined. For each category, you must have a querylet that produces a similarity score for that category. You assign a match threshold score that defines the boundary between matching and non-matching items for that category. If the category similarity score is lower than the threshold, the category does not match. If the category similarity score is at or higher than the threshold, the category matches. Rule one states that name, street address, and location must match, so all three categories must have similarity scores at or higher than their respective match thresholds.
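The threshold test described above can be sketched in Java as follows. This is a minimal illustration and not the product's implementation: the class name, method name, category keys, and sample similarity scores are all hypothetical, and treating a category with no score as a non-match is an assumption made for this sketch.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of the ibi Patterns - Search API.
class CoreCategoryCheck {

    // A rule's core categories match only if every core category's
    // similarity score is at or above that category's match threshold.
    static boolean coreCategoriesMatch(Map<String, Double> scores,
                                       Map<String, Double> coreThresholds) {
        for (Map.Entry<String, Double> core : coreThresholds.entrySet()) {
            // Assumption: a category with no score counts as a non-match.
            double score = scores.getOrDefault(core.getKey(), 0.0);
            if (score < core.getValue()) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Illustrative thresholds for rule 1: name, street address, location.
        Map<String, Double> ruleOne = new LinkedHashMap<>();
        ruleOne.put("name", 0.70);
        ruleOne.put("street", 0.80);
        ruleOne.put("location", 0.75);

        // Made-up similarity scores, standing in for querylet output.
        Map<String, Double> scores = new LinkedHashMap<>();
        scores.put("name", 0.95);
        scores.put("street", 0.80);   // exactly at the threshold: still a match
        scores.put("location", 0.90);
        System.out.println("Rule 1 core match: " + coreCategoriesMatch(scores, ruleOne));

        scores.put("street", 0.79);   // just below the threshold: no match
        System.out.println("Rule 1 core match: " + coreCategoriesMatch(scores, ruleOne));
    }
}
```

A score exactly equal to the threshold counts as a match, which mirrors the "at or higher than" wording above.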

Match Case Score Combiner Rule One: Example 1

Considering the two records in the following table, you might ask if these records are for the same person. With rule 1, the answer would be yes, as there is a near-perfect match on the name, street, and location. But most people would say these are most likely not the same person because the date of birth is completely different. With no date of birth information, most people would likely say these are the same person, but the different date of birth values make most people think otherwise. The difference in the date of birth values refutes the claim made by the match on the core categories.


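Match Case Score Combiner Rule One Example 1

| First | Middle | Last  | Street1     | Street2 | City    | ZIP   | Home | Cell | Fax | DOB        |
|-------|--------|-------|-------------|---------|---------|-------|------|------|-----|------------|
| John  |        | Smith | 123 Main St |         | Trenton | 08690 |      |      |     | 1967/12/23 |
| John  |        | Smith | 133 Main St |         | Trenton | 08690 |      |      |     | 1993/6/14  |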

Match Case Score Combiner Rule One: Example 2

Considering the two records in the following table, you might ask if these records are for the same person. Looking only at the core categories of rule one, these records might not be the same person. Depending on where you have set your match thresholds for the individual core categories, it might be that all three categories are higher than the respective threshold. But because all three are weak matches, these two records are likely to be judged a non-match by most people. Therefore, in addition to the individual category thresholds, an overall threshold is still needed. For the records to qualify as a match, they must meet both the individual core category match criteria as defined by the rule, and the overall match strength requirement. In ibi Patterns - Search, a test for overall match strength is normally supplied by setting a dynamic cutoff. For more information on the dynamic score cutoff, see Dynamic Score Cutoffs.

A typical example is when a perfect match on all of the core categories returns a score higher than the chosen overall threshold. If the match on the core categories is less than perfect, the record might fall below the overall threshold and the record is not a match, even though the individual categories are all higher than their threshold values.

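Match Case Score Combiner Rule One Example 2

| First | Middle | Last   | Street1      | Street2 | City    | ZIP   | Home | Cell | Fax | DOB        |
|-------|--------|--------|--------------|---------|---------|-------|------|------|-----|------------|
| John  |        | Smith  | 123 Main St  |         | Trenton | 08690 |      |      |     | 1967/12/23 |
| John  |        | Smythe | 133 Main Ave |         | Trenton | 08691 |      |      |     | 1967/12/23 |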

In this example, when looking beyond the core categories for rule one, you see a perfect match on the date of birth category. Instead of the date of birth category refuting the match, the category is supporting it. With this additional support, most people would say these two records represent the same person. So the categories outside the core categories for a rule might either support or refute the claim of a match made by the core categories. These are called the secondary categories. Because each secondary category might support or refute, you must define a threshold score that marks the boundary between supporting and refuting.

Some categories, such as date of birth, when they appear as a secondary category, have a very strong impact on the judgment of a match or non-match. Other categories, such as phone number, have a very weak impact on the judgment of a match or non-match. So a supporting and refuting strength must be defined for each secondary category.

Finally, when looking at different rules, you might see that a match on the core categories with certain rules might give far more confidence in the match compared to a match on the core categories with other rules. The core categories of a rule have a match strength.

To summarize:

You must define a querylet for each category that returns a similarity score for the category.
A set of match rules is needed. For each rule you must define:
    A core set of categories that must match.
    A match threshold for each core category.
    An overall match strength for the core categories.
    A supporting strength for each secondary category.
    A refuting strength for each secondary category.
    A match threshold for each secondary category that defines the boundary between supporting and refuting scores for the category.

A Match Case Score Combiner represents one match rule. The full set of rules is implemented by creating a match case combiner for each rule. The records match if any one of the rules is satisfied, so an OR score combiner is used to combine the output of the match case combiners.

The following is an implementation of the two match rules used in the Match Case Score Combiner Rule examples:

// Querylets for each category (implementation not shown).
NetricsQuery name_cat;     // Name category querylet.
NetricsQuery street_cat;   // Street category querylet.
NetricsQuery location_cat; // Location category querylet.
NetricsQuery phone_cat;    // Phone category querylet.
NetricsQuery dob_cat;      // Date of Birth category querylet.
NetricsQuery[] cat_qlets = new NetricsQuery[] {
    name_cat, street_cat, location_cat, phone_cat, dob_cat
};

// Match Case querylet for rule 1.
NetricsQuery rule_1 = new NetricsQuery.MatchCase(
    cat_qlets, // All of our category querylets.
    0.8,       // This is the match strength;
               // this is a moderately strong match case.
    // This defines the thresholds for both core categories and
    // secondary categories. A negative value indicates it is
    // a core category. The threshold is the absolute value.
    new double[] { -0.70, -0.80, -0.75, 0.85, 0.60 },
    // This serves double duty: for the core categories it is a
    // weighting factor, similar to the weight on an AND combiner;
    // for a secondary category it is the supporting strength.
    // Phone number is a weak supporter, DOB is a very strong supporter.
    new double[] { 1.0, 0.80, 0.80, 0.15, 0.40 },
    // This is the refuting strength for a secondary category;
    // entries for core categories are ignored.
    // Phone number is a very weak refuter, DOB is a very strong refuter.
    new double[] { 0.0, 0.0, 0.0, 0.05, 0.50 }
);

// Match Case querylet for rule 2.
NetricsQuery rule_2 = new NetricsQuery.MatchCase(
    cat_qlets, // All of our category querylets.
    0.85,      // This is the match strength;
               // this is a slightly stronger match case.
    // Notice we set the threshold for categories slightly lower when
    // using them as core categories. Remember it can still be
    // rejected if the combination of scores is below our overall
    // cutoff score. So a little leeway helps pick up cases that
    // will be strengthened by strong matches in other categories.
    new double[] { -0.70, 0.85, 0.80, -0.80, -0.50 },
    // Street and location are fairly strong supporters,
    new double[] { 1.0, 0.35, 0.35, 0.60, 0.90 },
    // but street and location are very weak refuters.
    new double[] { 0.0, 0.10, 0.10, 0.0, 0.0 }
);

// The full query: the records match if either rule is satisfied.
NetricsQuery full_query =
    NetricsQuery.Or(null, new NetricsQuery[] { rule_1, rule_2 });
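The comments in the example above say that a negative threshold marks a core category and that the threshold itself is the absolute value. The following sketch shows one way to read such an array. It is a standalone illustration only: the class and method names are hypothetical and are not part of the ibi Patterns - Search API.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the ibi Patterns - Search API.
class SignedThresholds {

    // Indexes of the core categories: the entries with a negative value.
    static List<Integer> coreIndexes(double[] signedThresholds) {
        List<Integer> core = new ArrayList<>();
        for (int i = 0; i < signedThresholds.length; i++) {
            if (signedThresholds[i] < 0.0) {
                core.add(i);
            }
        }
        return core;
    }

    // The threshold that is actually applied is the absolute value.
    static double threshold(double signedThreshold) {
        return Math.abs(signedThreshold);
    }

    public static void main(String[] args) {
        // Same category order and rule 1 thresholds as the example above.
        String[] categories = { "name", "street", "location", "phone", "dob" };
        double[] ruleOne = { -0.70, -0.80, -0.75, 0.85, 0.60 };

        List<Integer> core = coreIndexes(ruleOne);
        for (int i = 0; i < ruleOne.length; i++) {
            String role = core.contains(i) ? "core" : "secondary";
            System.out.println(categories[i] + ": " + role
                    + ", threshold " + threshold(ruleOne[i]));
        }
    }
}
```

Reading rule 1 this way gives name, street, and location as core categories, with phone and date of birth as secondary categories, which agrees with Rule 1 as stated earlier.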

This example is only for illustrative purposes. The proper threshold and strength scores to use depend on the nature of your data and business needs. It is a good practice to use trial queries against real data to tune these values. In addition, querylet references are usually used to improve performance. See the next section on Querylet References to see this example updated to use referenced querylets.

The supporting strength and refuting strength scores are used to define how much the core score is raised or lowered. This represents a percentage of the difference between the raw score from the core categories and a perfect match score of 1.0 or a perfect non-match score of 0.0.

The following table shows some examples:

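Score Examples

| Raw Core Score | Secondary Category Score | Secondary Category Threshold Score | Supporting Strength Score | Refuting Strength Score | Output Score |
|----------------|--------------------------|------------------------------------|---------------------------|-------------------------|--------------|
| any score      | 1.0                      | any score                          | 1.0                       | N/A                     | 1.0          |
| any score      | 0.0                      | any score                          | N/A                       | 1.0                     | 0.0          |
| 0.8            | 1.0                      | any score                          | 0.5                       | N/A                     | 0.9          |
| 0.8            | 0.9                      | 0.8                                | 0.5                       | N/A                     | 0.85         |
| 0.8            | 1.0                      | any score                          | 0.3                       | N/A                     | 0.86         |
| 0.8            | 1.0                      | any score                          | 0.1                       | N/A                     | 0.82         |
| 0.5            | 1.0                      | any score                          | 0.5                       | N/A                     | 0.75         |
| 0.8            | 0.0                      | any score                          | N/A                       | 0.5                     | 0.4          |
| 0.5            | 0.0                      | any score                          | N/A                       | 0.5                     | 0.25         |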

Because the strength scores represent a percentage increase, a supporting or refuting strength of 1.0 always pushes the output score to either 1.0 for a supporting weight or 0.0 for a refuting weight if the secondary category is a perfect match or perfect non-match respectively. In the fourth example, the reward is one half of the reward for a perfect match because the secondary category score of 0.9 is one half of the way between the threshold value of 0.8 and a perfect match of 1.0.

This shows how one secondary category affects the score. Each secondary category supplies a positive (supporting) or negative (refuting) increment to the raw score from the core categories. These are summed to determine the final overall score. A cap of 1.0 and a floor of 0.0 is imposed on the final score.
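The arithmetic described above can be sketched in Java as follows. This is a reconstruction from the Score Examples table and the surrounding text, not the product's actual implementation, and all names are hypothetical. The supporting side follows the text directly: the reward is scaled by how far the secondary score lies between the threshold and 1.0. The refuting side is an assumption made by symmetry (the penalty is scaled by how far the score lies between the threshold and 0.0), because the table only shows refuting examples with a secondary score of 0.0.

```java
// Hypothetical sketch reconstructed from the Score Examples table;
// not the product's implementation.
class SecondaryAdjustment {

    // Signed increment contributed by one secondary category.
    static double increment(double coreScore, double secondaryScore,
                            double threshold, double supporting, double refuting) {
        if (secondaryScore >= threshold) {
            // Fraction of the way from the threshold to a perfect match.
            double fraction = threshold < 1.0
                    ? (secondaryScore - threshold) / (1.0 - threshold)
                    : 1.0;
            // Raise the score by a share of the gap between core score and 1.0.
            return supporting * fraction * (1.0 - coreScore);
        }
        // Assumption: linear scaling from the threshold down to 0.0.
        double fraction = (threshold - secondaryScore) / threshold;
        // Lower the score by a share of the gap between core score and 0.0.
        return -refuting * fraction * coreScore;
    }

    // Sum the increments of all secondary categories, then cap at 1.0
    // and floor at 0.0.
    static double outputScore(double coreScore, double[] secondaryScores,
                              double[] thresholds, double[] supporting,
                              double[] refuting) {
        double total = coreScore;
        for (int i = 0; i < secondaryScores.length; i++) {
            total += increment(coreScore, secondaryScores[i], thresholds[i],
                               supporting[i], refuting[i]);
        }
        return Math.max(0.0, Math.min(1.0, total));
    }

    public static void main(String[] args) {
        // Fourth table row: core 0.8, secondary 0.9, threshold 0.8, supporting 0.5.
        System.out.printf("%.2f%n", outputScore(0.8,
                new double[] { 0.9 }, new double[] { 0.8 },
                new double[] { 0.5 }, new double[] { 0.0 }));
        // Eighth table row: core 0.8, secondary 0.0, refuting 0.5.
        // The threshold is "any score" in the table; 0.5 is an arbitrary choice.
        System.out.printf("%.2f%n", outputScore(0.8,
                new double[] { 0.0 }, new double[] { 0.5 },
                new double[] { 0.0 }, new double[] { 0.5 }));
    }
}
```

Under these assumptions the two calls in `main` correspond to the fourth and eighth rows of the table. With several secondary categories, each increment is computed against the same raw core score before the sum is clamped, which is one reading of "these are summed" in the text above.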

The important thing to note is that large supporting or refuting weights have a very large effect on the final score. In general, supporting and refuting weights should be small. This is especially true of supporting weights. A supporting weight of 0.5 is a very large supporting weight. Generally, a supporting weight should never be higher than the overall record cutoff score. A supporting weight higher than the cutoff score could boost even a 0.0 score from the core categories above the cutoff. This implies that the associated secondary category by itself would indicate a record match. If a single category is strong enough to indicate a match, it should appear as a core category in its own match case.
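The guideline that a supporting weight should not exceed the overall cutoff can be checked mechanically before a rule is used. The sketch below is a hypothetical helper, not part of the ibi Patterns - Search API. It assumes the array layout of the implementation example earlier in this section, where a negative threshold marks a core category and the second array holds supporting strengths for the secondary categories. The cutoff value of 0.8 is an arbitrary illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the ibi Patterns - Search API.
class SupportingWeightCheck {

    // Returns the indexes of secondary categories whose supporting weight
    // is higher than the overall cutoff score. Core categories (negative
    // threshold) are skipped, because for them the entry is a weighting
    // factor rather than a supporting strength.
    static List<Integer> overweightSupporters(double[] signedThresholds,
                                              double[] supportingWeights,
                                              double cutoff) {
        List<Integer> flagged = new ArrayList<>();
        for (int i = 0; i < signedThresholds.length; i++) {
            boolean secondary = signedThresholds[i] >= 0.0;
            if (secondary && supportingWeights[i] > cutoff) {
                flagged.add(i);
            }
        }
        return flagged;
    }

    public static void main(String[] args) {
        double cutoff = 0.8; // illustrative overall cutoff, an assumption
        double[] thresholds = { -0.70, -0.80, -0.75, 0.85, 0.60 };

        // Rule 1 weights from the example: nothing is flagged.
        double[] weights = { 1.0, 0.80, 0.80, 0.15, 0.40 };
        System.out.println(overweightSupporters(thresholds, weights, cutoff));

        // Raising the DOB supporting weight to 0.90 flags index 4.
        double[] tooStrong = { 1.0, 0.80, 0.80, 0.15, 0.90 };
        System.out.println(overweightSupporters(thresholds, tooStrong, cutoff));
    }
}
```

A flagged category is a candidate for becoming a core category in its own match case, as the paragraph above suggests.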

The situation for refuting weights is a little different. If there is a category that, when present, by itself indicates a match or non-match (for example, a trusted customer ID value), you might want it to have a very high refuting score in the other match cases. Essentially, this indicates that these cases apply only when the highly trusted category is not available.