Sample Problem: Record Equivalence

The Learn Model in ibi Patterns - Search performs one simple task: it evaluates a given feature vector and returns a single model score for that vector. When applied to matching problems, the Learn Model functions as a score combiner, like the AND and OR combiners described in Designing Queries for ibi Patterns - Search, but it is more mathematically sophisticated.
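To make the idea of a score combiner concrete, the following minimal sketch contrasts two fixed combiners with a learned one. It is an illustration only: the averaging and maximum functions are simplified stand-ins for the AND and OR combiners, not the formulas the product uses, and the logistic model, its weights, and the feature values are all hypothetical.

```python
import math

def and_combiner(scores):
    # Simplified stand-in for an AND-style combiner: plain average of the scores.
    return sum(scores) / len(scores)

def or_combiner(scores):
    # Simplified stand-in for an OR-style combiner: the best single score.
    return max(scores)

def learned_combiner(scores, weights, bias):
    # Hypothetical learned combiner: a weighted sum squashed into the range 0..1.
    z = bias + sum(w * s for w, s in zip(weights, scores))
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical feature vector of field match scores:
# [name, date of birth, street address, Social Security number]
features = [0.55, 1.0, 0.20, 0.95]

# Made-up weights, as if obtained from training on judged record pairs.
weights = [2.0, 3.0, 0.5, 6.0]
bias = -6.0

print("AND-style:", and_combiner(features))
print("OR-style: ", or_combiner(features))
print("Learned:  ", learned_combiner(features, weights, bias))
```

The fixed combiners treat every field the same way, whereas the learned combiner gives each field its own weight, so a strong match on an important field can outweigh poor matches elsewhere.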

To understand the need for a more sophisticated score combiner, consider judging whether two records containing personal information for two individuals are equivalent, that is, whether the two records represent the same person or not. A human might consider many aspects of the two records that can be relevant to this judgment. Some of these are simple enough to be represented by a standard ibi Patterns - Search query without a Learn Model. Some of the judgments are based on the combination of fields that appear similar enough, while other fields are allowed to be very different. This is the type of human judgment that a Learn model is designed to learn. Other types of judgment are based on external information known only to the human and not explicitly encoded in the match scores of the given field values. These judgments cannot be learned by the model unless appropriate information is added to the records being compared. It is important to understand the types of human judgment that the model can learn, and then provide only examples of such judgment for training the model.

The following are examples of features that do not require a Learn model. While each individual feature can be represented by a straightforward ibi Patterns - Search query, the combination of scores from several such features can be used to train a Learn model:

The name values are similar, despite fields being transposed (for example, the last name in one record is entered as the middle name in another record). A single cognate query can be used to take field transposition into account.
The date of birth field values differ by only a single digit in the month or day of birth. A single date query can properly score such differences.
The Social Security number matches exactly, indicating that it is the same person. This can be determined by a single simple query that uses the Social Security number field.

The following examples demonstrate the kinds of human judgment that a Learn model can learn:

The values of the name fields, such as first, middle, and last names, are very similar, or very different. The importance of the similarity of each field can be learned by the model.
The first name, date of birth, and address are similar, while the last name is different, and the gender is female in both records. Depending on the country, this could mean that the woman has married and changed her last name.
The Social Security number is very similar and the name fields match well, but all other fields match poorly. The match on the Social Security number probably outweighs the poor match on all other fields.
The name fields are very similar and the state and ZIP code match, but the street address is different and the Social Security number is missing. The absence of an important field, such as the Social Security number, might suggest putting more emphasis on the matching information in other field values that are present.
The name, gender and date of birth fields are very similar, but the address, state and ZIP code fields do not match. A strong match in only a few important fields might be enough to identify the equivalence of the two records. A model is able to learn the relative importance of matches in each field.
The last name, street address, city, state and ZIP code fields match well, but the first name and apartment number fields are not very similar. A large amount of similar text in the record does not necessarily mean that the records are equivalent if certain important fields are very different.

The following examples demonstrate human judgments that rely on information unavailable to the Learn model:

The first and middle names are abbreviated differently, for example, "F. Scott" and "Francis S.". It cannot be deduced from the field values what the abbreviations mean. However, a properly constructed thesaurus file could take common abbreviations into account.
The last name in two sparsely populated records is a very common last name. Typically records have no information about how common the last name is. However, if this information is encoded in a separate numerical field, then it might be used for training a model.
If the records are associated with newborns in a hospital and the patient medical record numbers of two very similar records are sequential, this might suggest that the records represent twins. Text comparison of record numbers does not detect when the numbers are sequential, only that they are textually similar.
Perfectly matching names and addresses are accompanied by dates of birth that differ by an entire generation. In such cases, one of the names might carry a suffix such as Jr. or Sr. A Learn model could never pick up on this information from a standard record comparison, because the date query gives no information about the length of time between the two dates. However, a scoring predicate could be created that returns a score based on how close to a generation apart the two dates are. This would enable the Learn model to make such judgments. In addition, no information on name suffixes is available unless they are split out into a separate field, and even then the only information available is how well they match.
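The last two items suggest deriving additional scores that expose information a text comparison hides. The following is a minimal sketch of two such derived features, written in Python for illustration rather than in the product's scoring predicate syntax. The function names, the 25-year generation length, and the 10-year tolerance are all assumptions made for this example.

```python
from datetime import date

def generation_gap_score(dob_a, dob_b, generation_years=25.0, tolerance_years=10.0):
    """Return 1.0 when two dates are exactly one assumed generation apart,
    falling linearly to 0.0 as the gap deviates by tolerance_years or more."""
    gap_years = abs((dob_a - dob_b).days) / 365.25
    deviation = abs(gap_years - generation_years)
    return max(0.0, 1.0 - deviation / tolerance_years)

def sequential_number_score(number_a, number_b):
    """Return 1.0 if two numeric record numbers are consecutive, else 0.0."""
    return 1.0 if abs(int(number_a) - int(number_b)) == 1 else 0.0

# Hypothetical values: a father and son with the same name, and two
# medical record numbers that were issued one after the other.
print(generation_gap_score(date(1940, 5, 1), date(1965, 5, 1)))
print(sequential_number_score("1004518", "1004519"))
```

Scores like these could be added to the feature vector alongside the ordinary field match scores, which gives a model something to learn from in cases where textual similarity alone is uninformative.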

Your judgment on whether records represent the same person is complex, involving a large number of implicit patterns of relevant features. If a pair of records contains one or more of these relevant patterns, you tend to judge the records to be equivalent; if it contains none, you tend to judge them to be different.

Example 1

| NAME | DOB | SEX | STREET ADDRESS | CITY | STATE | ZIP |
| --- | --- | --- | --- | --- | --- | --- |
| Bly, William | 01/03/42 | M | 321 S. Orchard Ave. | Manitowoc | WI | 54220 |
| Lee, William | 10/03/62 | M | 846 N. Orchard Ave. | Manitowoc | WI | 54220 |

Despite the large amount of similar text, these two records represent different people. Not all of the text has equal relevance to the decision. In this example certain key similarities are lacking, such as the Date of Birth.

Example 2

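| NAME | DOB | SEX | STREET ADDRESS | CITY | STATE | ZIP | SSN |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Smith, Jane | 04/21/64 | F | 456 Orchard Av | Manitowoc | WI | 54220 | 378-42-4481 |
| Doering, Jane | 04/21/64 | F | 1456 Willow Rd | Green Bay | WI | 54301 | 378-42-4418 |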

Despite the differences between these two records, they are likely equivalent: the Social Security numbers differ only in the order of the final two digits, the dates of birth match exactly, and women in the U.S. often change their last names when they marry.
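The closeness of the two Social Security numbers in Example 2 can be quantified with an edit distance that counts a swap of two adjacent characters as a single error. The following sketch uses the optimal string alignment distance for this purpose. It is an illustrative stand-in, not the comparison algorithm that ibi Patterns - Search applies, and the function names are hypothetical.

```python
def osa_distance(a, b):
    """Optimal string alignment distance: insertions, deletions, substitutions,
    and transpositions of adjacent characters each cost 1."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # match or substitution
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # adjacent swap
    return d[rows - 1][cols - 1]

def similarity(a, b):
    """Map the distance to a score between 0.0 and 1.0 (1.0 means identical)."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - osa_distance(a, b) / longest

print(osa_distance("378-42-4481", "378-42-4418"))
print(similarity("378-42-4481", "378-42-4418"))
```

Under this measure the two numbers are one edit apart, so the field still scores highly even though an exact comparison would fail.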

 

Example 3

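| NAME | DOB | SEX | STREET ADDRESS | CITY | STATE | ZIP | SSN |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Doering, K |  | M | 456 Orchard Av | Manitowoc | WI | 54220 | 387-12-7780 |
| Doering, Kate | 04/21/64 | F | 450 Orchard St | Manitowoc | WI | 54220 |  |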

In this case, the judgment is more tentative. The two records merit investigation as possible equivalents because the last name and first initial match and the street addresses are close but not identical. (The mismatched gender might be a data entry error.) The missing Social Security number represents a key value that would probably influence the judgment greatly if it were present; in its absence, these other features collectively wield enough influence to sway the judgment in the positive direction.

Collectively these examples illustrate several points:

Even large amounts of similar values do not guarantee the existence of a relevant pattern of features that is sufficient for the two records to match.
The relevance of a feature might depend on the match strength of other relevant features. For example, if name fields match closely, a matching Date of Birth becomes important as well; whereas a matching Date of Birth all by itself is not very significant.
A feature that is usually important might lose its importance entirely in certain contexts – for example, the last name in the case of married females in the U.S.
When features are missing, the criteria of user judgments can shift markedly.
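The last three points can be made concrete by looking at how a feature vector might be prepared for a simple model. The following sketch is only an illustration under stated assumptions: the -1.0 sentinel for missing values, the helper names, and the idea of adding explicit missing-value flags and a name-by-date-of-birth interaction term are choices made for this example, not a description of how the Learn model works internally.

```python
MISSING = -1.0  # sentinel chosen for this sketch to mark an absent value

def exact(a, b):
    # Simplest possible comparison: 1.0 for identical values, else 0.0.
    return 1.0 if a == b else 0.0

def field_score(a, b, compare):
    # A field cannot be scored when either record lacks a value.
    if not a or not b:
        return MISSING
    return compare(a, b)

def expand_features(scores):
    """Turn raw field scores into a vector a simple model could use.

    Each field yields its score (0.0 when missing) plus a flag that records
    whether it was missing, so the model can shift its criteria when a key
    field is absent. The interaction term lets a date of birth match count
    for more when the names also match.
    """
    expanded = {}
    for field, score in scores.items():
        is_missing = score == MISSING
        expanded[field] = 0.0 if is_missing else score
        expanded[field + "_missing"] = 1.0 if is_missing else 0.0
    expanded["name_x_dob"] = expanded.get("name", 0.0) * expanded.get("dob", 0.0)
    return expanded

# Hypothetical scores for Example 3: the name score of 0.8 is made up;
# the date of birth and Social Security number cannot be compared.
scores = {
    "name": 0.8,
    "dob": field_score("", "04/21/64", exact),
    "ssn": field_score("387-12-7780", "", exact),
}
print(expand_features(scores))
```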