Cognate Queries

Queries are often structured: the query contains several separate data items that must be matched against different fields of the record. Each of these data items is referred to as a querylet. A common way to implement a structured query is a targeted query, that is, a set of simple queries in which each simple query matches one querylet of the structured query against a different field or set of fields in the record.

One limitation of a set of simple queries is that the component querylets are restricted to matching only within the field or fields specified. In some cases, however, you might have several querylets that correspond to a group of closely related fields whose values are commonly assigned to a wrong field. In such cases, the cognate query provides an option for a more refined comparison of these related querylets and fields. Like simple queries, the cognate query computes a match score between 0.0 and 1.0 for the group of related fields.

For instance, tables containing name data often have several name fields: first name, last name, middle name, second surname, and so on. It is common for data to be entered in one name field that belongs in another. Therefore, the values given for the querylets are subject to the same confusion. In this situation, construct a cognate query in which each name querylet is targeted at the corresponding (or “cognate”) field.

This sounds like a set of targeted simple queries, but there is a crucial difference. Though each querylet in a cognate query is preferentially matched with data in the cognate field, the cognate query also allows the possibility of cross-matching between a querylet and the other participating non-cognate fields. Cross-matching of this kind occurs without penalty (unless you assign a penalty for it). This makes the cognate query the tool of choice for matching across closely related fields.

Suppose a query consists of first, middle, and last name querylets, for example, “Alfred”, “E”, “Newman”. You can construct a cognate query that matches these querylets against the corresponding First, Middle, and Last Name fields of the table. The search results might include the following four records:

 

Score  First   Middle  Last
1.00   E       Alfred  Newman
0.92   Alfred  E       Neuman
0.46   John    E       Neumann
0.38   Albert  E       Noyes

Note that the top record “E Alfred Newman” scores a perfect 1.0, although the query’s first name “Alfred” and middle initial “E” are fielded differently in the record. This causes the “E Alfred Newman” record to outrank the “Alfred E Neuman” record, which does not score a perfect match due to the different spelling of the last name.

You can alter this behavior by lowering the non-cognate weight from its default value of 1.0 to some lesser value. (Like field weights, the non-cognate weight is a floating point value between 0.0 and 1.0.) You might then obtain the following results:

Score  First   Middle  Last
0.92   Alfred  E       Neuman
0.88   E       Alfred  Newman
0.46   John    E       Neumann
0.38   Albert  E       Noyes

Note that the “E Alfred Newman” record now scores comparably to the “Alfred E Neuman” record – the “imperfection” of misfielding, like misspelling, being reflected in a slightly lower match score.
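The cross-matching behavior and the effect of the non-cognate weight can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `similarity` uses the standard library's `difflib.SequenceMatcher` as a stand-in for the real matcher, and `cognate_score` (a hypothetical name) simply takes the best average over all one-to-one pairings of querylets and fields. The scores it produces will not equal the ones in the tables above.

```python
from difflib import SequenceMatcher
from itertools import permutations


def similarity(a: str, b: str) -> float:
    """Stand-in string similarity in [0, 1]; an assumption, not the product's matcher."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def cognate_score(querylets, fields, non_cognate_weight=1.0):
    """Best average score over every one-to-one pairing of querylets and fields.

    A pairing with a non-cognate field (a different position) is scaled by
    non_cognate_weight; 1.0 means cross-matching carries no penalty.
    Assumes querylets and fields have the same length.
    """
    best = 0.0
    for perm in permutations(range(len(fields))):
        total = 0.0
        for q_idx, f_idx in enumerate(perm):
            score = similarity(querylets[q_idx], fields[f_idx])
            if q_idx != f_idx:
                score *= non_cognate_weight  # penalize cross-matching
            total += score
        best = max(best, total / len(querylets))
    return best


query = ["Alfred", "E", "Newman"]
misfielded = ["E", "Alfred", "Newman"]
misspelled = ["Alfred", "E", "Neuman"]

for weight in (1.0, 0.8):
    print(weight,
          round(cognate_score(query, misfielded, weight), 2),
          round(cognate_score(query, misspelled, weight), 2))
```

With the weight at 1.0, the misfielded record scores a perfect 1.0 in this sketch and outranks the misspelled one. With the weight lowered to 0.8, the misfielded record drops below the misspelled record, which mirrors the reordering shown in the second table.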

After reading the section on Complex Queries, you might think that the cognate query could be simulated by an AND of simple queries, one simple query for each querylet, and each simple query matching all fields in the set. However, there is a critical difference between the cognate and an AND of simple queries.

For example, consider a search for a person with the first name "johnny" and the last name "johns".

An AND of the simple query "johnny" against "first,middle,last" and the simple query "johns" against "first,middle,last" yields:

Score  First   Middle  Last
0.85   johnny  e
0.82   john    s       henny

The second record scores almost as high as the first even though it is not a good match. The reason is that the string "john" in the first name field of the record is matched twice, once for each querylet. This is where the cognate query comes in: while it allows cross-field matching, it ensures that each item is matched only once. The results of the cognate query on the same records are:

Score  First   Middle  Last
0.92   johnny  e
0.48   john    s       henny

You can see that with the cognate query the second record now has a much lower, more appropriate score.
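The double-matching problem can be reproduced with a small sketch. The assumptions here are the same kind as before: `difflib.SequenceMatcher` stands in for the real matcher, the function names `and_of_simple` and `one_to_one` are hypothetical, and the record used is a made-up one that has only a first name, chosen because it makes the effect easy to see.

```python
from difflib import SequenceMatcher
from itertools import permutations


def similarity(a: str, b: str) -> float:
    """Stand-in string similarity in [0, 1]; an assumption, not the product's matcher."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def and_of_simple(querylets, fields):
    """Each querylet independently takes its best field, so a field can be reused."""
    return sum(max(similarity(q, f) for f in fields) for q in querylets) / len(querylets)


def one_to_one(querylets, fields):
    """Each field is paired with at most one querylet (cognate-style matching)."""
    return max(
        sum(similarity(q, fields[i]) for q, i in zip(querylets, perm)) / len(querylets)
        for perm in permutations(range(len(fields)), len(querylets))
    )


querylets = ["johnny", "johns"]
record = ["john", "", ""]  # hypothetical record: first name only

print(round(and_of_simple(querylets, record), 2))  # "john" is counted for both querylets
print(round(one_to_one(querylets, record), 2))     # "john" can serve only one querylet
```

In the first function, the single value "john" satisfies both querylets and the record looks like a strong match. In the second, one querylet is left with nothing to match, so the score falls to roughly half.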

Cognate Queries with an Empty Field Penalty

In the preceding examples, all three fields (First, Middle, and Last) are populated in both the query and the record. In actual name data, the middle name is very often not populated. The following is an example where not all fields are populated:

Query

First  Middle  Last
John   Quincy  Adams

Query Results

Score  First   Middle  Last
1.0            Quincy
0.84   John    Q       Adams
0.84   John Q          Adams
0.83   John    Quick   Adamson
0.81   John            Adams
0.81           John    Adams
0.58                   Adamso

Here, you can see that “John Quick Adamson” scored higher than “John Adams” in the match. Most users would not rank these matches in this order. The reason is that when a name, especially a middle name, is missing, users tend to discount it in the match. Users see “John Adams” as a good match: because the record has no middle name, they discount the unmatched “Quincy”. The standard cognate query, by contrast, penalizes the match for the completely unmatched “Quincy” in the query. The empty field penalty option allows the cognate query to discount unmatched data that can be attributed to an empty field either in the query or the record.

When the number of populated fields in the query and the record differs, the empty field penalty factor adjusts the amount of the penalty applied for the unmatched data in the “extra” fields. In this example, when the query “John, Quincy, Adams” is matched against “John, Adams”, “,John, Adams”, and “John Q,,Adams”, the query has one extra field. In the match against “,,Adamso”, the query has two extra fields. Extra fields exist only if the counts of populated fields differ. For example, in a match of “John, Adams” against “,John, Adams”, no extra fields are present: both have two populated fields and one unpopulated field. It does not matter that different fields are populated in the two records.
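The counting rule in this paragraph can be written as a short sketch. The function name `extra_field_count` is hypothetical, and a field is treated as populated here if it contains any non-whitespace character, which is a simplification.

```python
def populated_count(values):
    """Number of fields that contain at least one non-whitespace character."""
    return sum(1 for value in values if value.strip())


def extra_field_count(query_fields, record_fields):
    """Extra fields = difference between the populated-field counts of the two sides."""
    return abs(populated_count(query_fields) - populated_count(record_fields))


query = ["John", "Quincy", "Adams"]
print(extra_field_count(query, ["John", "", "Adams"]))    # one extra field
print(extra_field_count(query, ["", "John", "Adams"]))    # one extra field
print(extra_field_count(query, ["John Q", "", "Adams"]))  # one extra field
print(extra_field_count(query, ["", "", "Adamso"]))       # two extra fields
# Same populated count on both sides, so no extra fields,
# even though different fields are populated.
print(extra_field_count(["John", "", "Adams"], ["", "John", "Adams"]))
```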

The determination of whether a field is empty or not is made after character mapping is applied. If a character map eliminates all characters in a field, the field is considered empty. Beware of fields that contain only punctuation or special characters; under the standard character map, they are considered empty.
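As a sketch of why a punctuation-only field counts as empty, the following assumes a hypothetical character map that keeps only letters and digits. The actual standard character map is not described in this section, so `apply_char_map` is an illustrative stand-in only.

```python
def apply_char_map(value: str) -> str:
    """Hypothetical character map: keep letters and digits, drop everything else."""
    return "".join(ch.lower() for ch in value if ch.isalnum())


def is_empty_field(value: str) -> bool:
    """A field is empty if nothing survives character mapping."""
    return apply_char_map(value) == ""


print(is_empty_field(""))     # nothing to map
print(is_empty_field("--"))   # punctuation only, mapped away
print(is_empty_field(" . "))  # whitespace and punctuation only
print(is_empty_field("Q."))   # the letter survives mapping
```

Under this assumed map, a placeholder such as "--" in a middle name field is treated the same as a blank, so it changes the populated-field count and therefore the number of extra fields.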

But which fields are the “extra” fields? The cognate query chooses the fields with the poorest match as the extra fields. If there is one extra field, the field with the poorest match is chosen as the extra field. If two extra fields are present, the two poorest matching fields are chosen, and so on. The empty field penalty adjustment is applied to these fields.

An empty field penalty of 1.0 applies the full penalty for unmatched data. This is the default behavior. An empty field penalty of 0.0 completely discounts all unmatched data in the extra fields; it applies no penalty for unmatched data in the extra fields. Factors in between apply a partial penalty.
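The selection of extra fields and the scaling of their penalty can be sketched as follows. The scoring formula is an assumption made for illustration: the overall score is taken as 1 minus the average per-field penalty, and the penalty of each extra field is multiplied by the empty field penalty factor. The function name `apply_empty_field_penalty` is hypothetical, and the real computation is not given in this section.

```python
def apply_empty_field_penalty(field_scores, extra_fields, empty_penalty=1.0):
    """Scale the penalty of the poorest-matching fields by empty_penalty.

    field_scores: per-field match scores in [0, 1] after querylets are paired with fields.
    extra_fields: how many fields are "extra" (difference in populated-field counts).
    empty_penalty: 1.0 keeps the full penalty, 0.0 removes it for the extra fields.
    """
    # Largest penalty first, so the poorest matches are chosen as the extra fields.
    penalties = sorted((1.0 - score for score in field_scores), reverse=True)
    adjusted = [
        penalty * empty_penalty if rank < extra_fields else penalty
        for rank, penalty in enumerate(penalties)
    ]
    return 1.0 - sum(adjusted) / len(field_scores)


# First and last name match perfectly; the middle name is completely unmatched.
scores = [1.0, 0.0, 1.0]
for factor in (1.0, 0.1, 0.0):
    print(factor, round(apply_empty_field_penalty(scores, extra_fields=1, empty_penalty=factor), 2))
```

When `extra_fields` is 0, no penalty is scaled regardless of the factor, which matches the rule that the adjustment applies only when extra fields are present.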

Look at the following example with three different empty field penalties applied:

Query Results, Empty Penalty = 1.0

Score  First   Middle  Last
1.0            Quincy
0.84   John    Q       Adams
0.84   John Q          Adams
0.83   John    Quick   Adamson
0.81   John            Adams
0.81           John    Adams
0.58                   Adamso

With an empty penalty of 1.0, the results are unchanged. “John, Adams” is still fully penalized for the unmatched “Quincy” in the query.

Query Results, Empty Penalty = 0.0

Score  First   Middle  Last
1.0            Quincy
1.0    John            Adams
1.0            John    Adams
0.99                   Adams
0.89                   Adamso
0.84   John    Q       Adams
0.83   John    Quick   Adamson

With an empty penalty of 0.0, “John, Adams” and “,John, Adams” are now considered perfect matches. There is no penalty for the unmatched “Quincy” when the first and last names match perfectly. However, the matches still do not look quite right: the “,,Adamso” record now outscores “John,Q,Adams”. This is because “,,Adamso” incurs no penalty for the unmatched “John, Quincy”, whereas the “John,Q,Adams” record is penalized for the unmatched “Quincy” because no “extra” fields are available in that match. You can compensate for this by applying a small penalty.

Query Results, Empty Penalty = 0.1

Score  First   Middle  Last
1.0            Quincy
0.97   John            Adams
0.97           John    Adams
0.96   John Q          Adams
0.84   John    Q       Adams
0.83   John    Quick   Adamson
0.82                   Adamso

These matches are now much closer to the way most people would judge them. But the score for “,,Adamso” is still too high. When the number of extra fields is a very high proportion of the total number of fields, a low empty field penalty usually results in scores most people would judge as too high. (The other factor here is “Q” vs. “Quincy”: people recognize this as an abbreviation and judge it a good match, but the cognate query does not have such contextual knowledge.)

The check for extra fields works whether the extra fields are in the query or the record. The following are scores for an example where the query is incomplete and the record has extra fields:

 

Query

First  Middle  Last
John

Query Results, Empty Penalty = 0.1

Score  First   Middle  Last
1.0
1.0            John    Adams
0.99   John    Q       Adams
0.97   John    Quincy  Adams
0.91   John Q          Adams
0.88   John    Quick   Adamson
0.86                   Adamso

Notice that “John Q,,Adams” scores much lower than “John,Q,Adams”. In the case of “John Q,,Adams”, no extra fields are available, so the empty field penalty adjustment is not applied. In the case of “John,Q,Adams”, the record has one extra field. The empty field penalty adjustment is applied to the extra field, reducing the penalty for the unmatched “Q”. The presence of an empty field does not by itself mean that the empty field penalty adjustment is applied. Only the presence of an extra field causes the adjustment to be applied.

In cases where a field is frequently unpopulated, it is good practice to use a low empty field penalty (such as 0.1) so that the query gives better overall results. Name matching, where the middle name is frequently empty, is the most common example of this. However, if the majority of fields are often empty, a low empty field penalty might result in scores that appear too high. And of course, where matching on all fields is considered mandatory, the empty field penalty adjustment should not be used.

Additional points on Cognate Queries

The following are some additional facts to note about cognate queries:

Cognate queries can be used in any scenario where you have a set of querylets matching against a corresponding group of closely related fields and you expect frequent misfielding. Name fields are one example; sets of address fields (lines of a street address, apartment/suite number, city, and postal code) are another. In such situations, you always have the option of constructing a set of simple queries (one per querylet) and combining their scores in a complex query. However, doing so gives up the cross-matching behavior that is the cognate query's advantage.
Cognate queries assume a one-to-one correspondence between querylets and fields. If the structure of the querylets is not identical to the structure of the fields, you might need to create this correspondence artificially by adapting the structure of the query. You can concatenate two or more querylets or insert one or more "blank" querylets. For instance, if your query has only a first and last name, but the table has three name fields (First, Middle, Last), allow for the possibility of cross-matching with the Middle Name field by supplying an empty string as a middle name querylet.
Cognate queries support field weights in the same way as simple queries. One difference is that, unlike the field weights for simple queries, cognate field weights specify only the relative importance of the fields. They do not penalize perfect matches in less important fields. For a full description of the behavior of field weights in cognate versus simple queries, see the section Weighting Factors.
Note: Remember in Simple Queries lowering a field weight lowers the final score even for perfect matches. In Cognate Queries, lowering a field weight lowers the relative importance of the field, but perfect matches still get perfect scores.

For example, if you set the cognate field weights for the First, Middle, and Last name fields to 0.8, 0.6, and 1.0, respectively, a match in which the first and middle names are transposed in the record still scores just as highly, no matter how these fields are weighted (unless you choose to penalize cross-matching by setting the non-cognate weight to a value less than 1.0).

Like simple queries, cognate queries support querying on specific attributes in a Variable Attributes type field. See Variable Attributes Queries for details.
If ALL query strings are empty or if all attributes to be searched are empty, there is no information on which to base a match score. As with the Simple query in this case, the query is assigned the "empty score". The behavior and usage of the empty score for Cognate queries is identical to that for Simple queries. However, note that all of the querylets of the Cognate must be empty for it to be considered empty. As long as one querylet has some data in it, the Cognate is not considered to have an empty query.
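Two of the points above, padding the query with blank querylets to obtain a one-to-one correspondence and deciding when a cognate query counts as empty, can be sketched as follows. The names `FIELDS`, `align_querylets`, and `is_empty_query` are hypothetical helpers for illustration and are not part of any product API.

```python
FIELDS = ["First", "Middle", "Last"]


def align_querylets(query: dict, fields=FIELDS) -> list:
    """Return one querylet per field, supplying "" where the query has no value."""
    return [query.get(field, "") for field in fields]


def is_empty_query(querylets) -> bool:
    """The cognate query is empty only if every querylet is empty."""
    return all(not querylet.strip() for querylet in querylets)


# A query with only first and last name gets a blank middle name querylet,
# which still allows cross-matching with the Middle Name field.
querylets = align_querylets({"First": "John", "Last": "Adams"})
print(querylets)
print(is_empty_query(querylets))         # one populated querylet is enough
print(is_empty_query(["", "", ""]))      # all querylets empty
```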

Like simple queries, cognate queries can be combined with simple queries or other cognate queries, in more complex query structures. The complex queries are discussed in detail in the subsequent sections.