Matching Compound Records and Querylet Grouping
A common matching problem for joined tables is the matching of compound records (see Compound Records). Here the input record might have zero, one, or more child records for the same child table. Typically, when two compound records are matched, the match for a particular child table is based on the best possible match of any one of the input child records against any one of the child records for the same table in the compound record being matched. That is, the best match out of all possible combinations of child record comparisons is taken as the match score for the compound record. Consider the following input compound record for our Persons and Addresses tables (the Phones table is left out for brevity):
| Table | Field Values |
| Persons | John,Smith,1968/12/23,111-22-3333 |
| Addresses | 123 Main St.,Gothem,NY |
| Addresses | 456 2ND St.,Smallton,VT |
Because this input record has two Addresses child records, the match query might look like this:
// Parent record (Persons) queries.
NetricsQuery name_qry = NetricsQuery.Cognate(
    new String [] { "John", "Smith" },
    new String [] { "first_name", "last_name" },
    null,
    0.9) ;
NetricsQuery dob_qry = NetricsQuery.Custom(
    "1968/12/23",
    new String [] { "dob" },
    null,
    NetricsQuery.CS_DATE) ;
NetricsQuery ssn_qry = NetricsQuery.Simple(
    "111-22-3333",
    new String [] { "SSN" },
    null) ;
// First Addresses child record.
NetricsQuery street1_qry = NetricsQuery.Simple(
    "123 Main St.",
    new String [] { "Addresses.street" },
    null) ;
NetricsQuery city1_qry = NetricsQuery.Simple(
    "Gothem",
    new String [] { "Addresses.city" },
    null) ;
NetricsQuery addr1_qry = NetricsQuery.And(null, new NetricsQuery [] {
        street1_qry,
        city1_qry
    }
) ;
// Second Addresses child record.
NetricsQuery street2_qry = NetricsQuery.Simple(
    "456 2ND St.",
    new String [] { "Addresses.street" },
    null) ;
NetricsQuery city2_qry = NetricsQuery.Simple(
    "Smallton",
    new String [] { "Addresses.city" },
    null) ;
NetricsQuery addr2_qry = NetricsQuery.And(null, new NetricsQuery [] {
        street2_qry,
        city2_qry
    }
) ;
// Take whichever address child record matches best.
NetricsQuery addr_qry = NetricsQuery.Or(null, new NetricsQuery [] {
        addr1_qry, addr2_qry
    }
) ;
NetricsQuery full_query = NetricsQuery.And(null, new NetricsQuery [] {
        name_qry,
        dob_qry,
        ssn_qry,
        addr_qry
    }
) ;
In this example, there is a separate query to match each child record, and then the "OR" combiner is used to select the one that matches best. This query will find the combination of Persons and Addresses that best matches the given Persons record and any one of the given child records.
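The "best of all combinations" rule can be sketched in plain Java, independent of the Patterns API. Everything in this sketch is hypothetical: the class BestChildMatch, the exact-match fieldScore stand-in, and the assumption that an AND behaves like an equally weighted average while an OR takes the maximum. It is meant only to show the shape of the computation, not how the server actually scores records.

```java
import java.util.List;

// Hypothetical illustration; not part of the Patterns Java API.
final class BestChildMatch {
    // Stand-in field similarity: 1.0 on case-insensitive equality, else 0.0.
    static double fieldScore(String a, String b) {
        return a.equalsIgnoreCase(b) ? 1.0 : 0.0;
    }

    // Score for one pair of child records: the mean of the per-field scores
    // (assumed here to mimic an AND with equal weights).
    // Both arrays are assumed to hold the same fields in the same order.
    static double childScore(String[] input, String[] candidate) {
        double sum = 0.0;
        for (int i = 0; i < input.length; i++) {
            sum += fieldScore(input[i], candidate[i]);
        }
        return sum / input.length;
    }

    // Best score over every (input child, candidate child) combination
    // (mimics the OR that keeps the best-matching child query).
    static double bestScore(List<String[]> inputs, List<String[]> candidates) {
        double best = 0.0;
        for (String[] in : inputs) {
            for (String[] cand : candidates) {
                best = Math.max(best, childScore(in, cand));
            }
        }
        return best;
    }
}
```

With the two input Addresses records above, a candidate compound record needs only one address that matches one of them well to receive a high address score.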
However, full_query as written has a problem that can cause it to perform poorly. The problem lies in the GIP prefilter, which performs the join operation and makes the initial record selection (for more information about the GIP prefilter, see Prefilters and Scaling). The GIP prefilter does not know which querylets are associated with which child record; therefore, it assumes that all querylets are associated with the same child record and attempts to select a child record that best matches all of the querylets. This can lead to very poor selections by the GIP prefilter. Because the GIP prefilter makes only a crude selection of a large candidate pool, this might work well enough for smaller tables. For larger tables, it can result in poor overall match results, because the desired records might be pushed out of the candidate pool by many records that match the full set of querylets as well as or better than records matching the individual querylets.
The querylet grouping feature allows you to group querylets by the child record they are associated with. By knowing which querylets came from the same child record and which came from different child records, the GIP prefilter can make better child record selections.
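The effect of treating all querylets as one child record can be illustrated with a toy model. This is not the GIP prefilter algorithm, which is not described here. The class PrefilterToyModel, its Querylet holder, and the exact-match scoring are all assumptions made for the sketch. It compares scoring a candidate child record against every querylet at once with scoring it against each group separately and keeping the best group.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model only; the real GIP prefilter is far more involved.
final class PrefilterToyModel {
    // A querylet reduced to: index of the child field it targets,
    // the value sought, and the group it belongs to.
    static final class Querylet {
        final int field;
        final String value;
        final String group;

        Querylet(int field, String value, String group) {
            this.field = field;
            this.value = value;
            this.group = group;
        }
    }

    // Ungrouped view: every querylet is scored against the same child record.
    // Returns the fraction of querylets the child record matches exactly.
    static double ungroupedScore(String[] child, List<Querylet> querylets) {
        int hits = 0;
        for (Querylet q : querylets) {
            if (child[q.field].equalsIgnoreCase(q.value)) {
                hits++;
            }
        }
        return (double) hits / querylets.size();
    }

    // Grouped view: each group is scored on its own and the best group wins.
    static double groupedScore(String[] child, List<Querylet> querylets) {
        Map<String, List<Querylet>> byGroup = new LinkedHashMap<>();
        for (Querylet q : querylets) {
            byGroup.computeIfAbsent(q.group, k -> new ArrayList<>()).add(q);
        }
        double best = 0.0;
        for (List<Querylet> members : byGroup.values()) {
            best = Math.max(best, ungroupedScore(child, members));
        }
        return best;
    }
}
```

In this model, a stored address of "123 Main St., Smallton", which mixes fields from both input addresses, ties with the genuine "123 Main St., Gothem" when the querylets are ungrouped. Once the querylets are grouped by input child record, the genuine address scores higher.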
The querylet grouping feature allows you to associate a group name with one or more querylets on the query tree. These querylets can be of any type and can be at any position in the tree. When the GIP prefilter matches child records, it matches querylets with different group names separately. Typically, a group is associated with a particular child record in the input compound record. A standard practice is to use the combination of the child table name and the child record key as the group name. The compound record query from the earlier example, with group names added, looks like this:
... // Parent record queries as above.
NetricsQuery street1_qry = NetricsQuery.Simple(
    "123 Main St.",
    new String [] { "Addresses.street" },
    null) ;
NetricsQuery city1_qry = NetricsQuery.Simple(
    "Gothem",
    new String [] { "Addresses.city" },
    null) ;
NetricsQuery addr1_qry = NetricsQuery.And(null, new NetricsQuery [] {
        street1_qry,
        city1_qry
    }
) ;
// Group all querylets for the first Addresses child record.
addr1_qry.setGroup("Addresses.key1") ;
NetricsQuery street2_qry = NetricsQuery.Simple(
    "456 2ND St.",
    new String [] { "Addresses.street" },
    null) ;
NetricsQuery city2_qry = NetricsQuery.Simple(
    "Smallton",
    new String [] { "Addresses.city" },
    null) ;
NetricsQuery addr2_qry = NetricsQuery.And(null, new NetricsQuery [] {
        street2_qry,
        city2_qry
    }
) ;
// Group all querylets for the second Addresses child record.
addr2_qry.setGroup("Addresses.key2") ;
NetricsQuery addr_qry = NetricsQuery.Or(null, new NetricsQuery [] {
addr1_qry, addr2_qry
}
) ;
NetricsQuery full_query = NetricsQuery.And(null, new NetricsQuery [] {
name_qry,
dob_qry,
ssn_qry,
addr_qry
}
) ;
Notice that for each address query, the group name is set on the AND query. Group names propagate down the query tree, so setting the group on addr1_qry is the same as setting it on street1_qry and city1_qry.
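Downward propagation, together with the rule that a querylet can belong to only one group, can be pictured with a small stand-alone tree. GroupedNode is a hypothetical class written for this illustration and is not the NetricsQuery implementation. In particular, the point at which the real API reports a conflicting assignment is not stated here; the sketch simply fails eagerly.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a node in a query tree.
final class GroupedNode {
    final String name;
    final List<GroupedNode> children = new ArrayList<>();
    String group; // null means the node is in the unnamed group

    GroupedNode(String name, GroupedNode... kids) {
        this.name = name;
        children.addAll(Arrays.asList(kids));
    }

    // Assign a group to this node and to every node below it.
    // Nothing is changed if any affected node already has a different group.
    void setGroup(String g) {
        checkNoConflict(g);
        assign(g);
    }

    // First pass: reject the assignment if it would give a node two groups.
    private void checkNoConflict(String g) {
        if (group != null && !group.equals(g)) {
            throw new IllegalStateException(
                name + " already belongs to group " + group);
        }
        for (GroupedNode c : children) {
            c.checkNoConflict(g);
        }
    }

    // Second pass: push the group name down the subtree.
    private void assign(String g) {
        group = g;
        for (GroupedNode c : children) {
            c.assign(g);
        }
    }
}
```

In this sketch, calling setGroup on a node that stands for addr1_qry leaves its street and city children in the same group, and a later attempt to place one of those children in a different group is rejected.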
Some things to note about querylet groups:
| • | Querylet groups apply only to queries on child records of a joined query. They are quietly ignored if used on querylets against parent records or in non-joined queries. |
| • | A querylet can belong to only one group. An error occurs if two or more different groups are assigned to the same querylet. This includes assignments that were propagated down from a higher level in the query tree. |
| • | A querylet group name can be any valid string. There is a length limit of 999 bytes when encoded as a UTF-8 string value. |
| • | All querylets not assigned a group name belong to the same unnamed group. |
| • | Querylets on different parts of the query tree can be assigned to the same group. |
| • | Grouping only affects how child records are selected in the prefilter. It has no effect on the scoring of the record. |
| • | Because a querylet can be assigned to only one group, a querylet that uses data from two or more different input child records cannot be assigned to a group. |
| • | The primary use case for querylet grouping is compound record matching. It is a good practice to use the querylet grouping feature to ensure accuracy when performing matching of this type. However, there is a performance penalty when querylet grouping is used. The size of the penalty is very hard to predict, but if the load is high and performance is critical, it might be necessary to consider removing the use of the querylet grouping feature. |
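The naming practice described earlier (child table name combined with child record key) and the 999-byte limit can be captured in a small helper. GroupNames is a hypothetical utility, not part of the Patterns API. The "." separator follows the "Addresses.key1" example, and the limit is assumed to be inclusive, that is, a name of exactly 999 UTF-8 bytes is accepted.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for building querylet group names.
final class GroupNames {
    // Limit stated in the notes above; treated as inclusive (an assumption).
    static final int MAX_UTF8_BYTES = 999;

    // Build "<childTable>.<childKey>" and verify its encoded length.
    static String of(String childTable, String childKey) {
        String name = childTable + "." + childKey;
        // The limit is on UTF-8 bytes, not characters, so measure the encoding.
        int len = name.getBytes(StandardCharsets.UTF_8).length;
        if (len > MAX_UTF8_BYTES) {
            throw new IllegalArgumentException(
                "group name is " + len + " UTF-8 bytes; limit is " + MAX_UTF8_BYTES);
        }
        return name;
    }
}
```

The returned string could then be passed to setGroup, for example addr1_qry.setGroup(GroupNames.of("Addresses", "key1")). Checking the encoded byte length rather than the character count matters for keys that contain non-ASCII characters.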