Dynamic Score Cutoffs

Some applications present search results directly to an end user, while others pass results to another application for automatic processing. Depending on the nature of the application, you can minimize the presentation or processing of results that are unlikely to be authentic matches (“false positives”), while avoiding the loss of records of interest (“false negatives”).

This means applying a cutoff method to the results list. ibi Patterns - Search provides several dynamic cutoff methods that can be configured within your application, as illustrated in the sketch that follows this list:

Exact-plus-N — Returns exact matches (with a score of 1.0), plus the highest scoring N inexact matches, where N is a fixed number that you specify.
Percent-of-top — Returns records with scores greater than or equal to a specified percentage of the top score returned (that is, the score of the first record in the results list).
Percent-gap — Returns records until a gap is encountered between consecutive scores that is greater than a specified percentage of the top score returned. This is generally the most effective method for dynamically eliminating false positives.
Absolute — Returns records with scores greater than a specified score value.
Note: When using the Absolute cutoff method, select the cutoff score based on careful evaluation of runs with the real data.
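
The following Python sketch shows one way the four cutoff methods could prune a results list of (record, score) pairs sorted by descending score. It is an illustration only; the function names, parameter values, and data shapes are assumptions and do not represent the ibi Patterns - Search configuration interface.

def exact_plus_n(results, n):
    # Keep exact matches (score 1.0) plus the N highest-scoring inexact matches.
    exact = [r for r in results if r[1] == 1.0]
    inexact = [r for r in results if r[1] < 1.0]
    return exact + inexact[:n]

def percent_of_top(results, pct):
    # Keep records scoring at least pct (for example, 0.80) of the top score.
    if not results:
        return []
    threshold = results[0][1] * pct
    return [r for r in results if r[1] >= threshold]

def percent_gap(results, pct):
    # Keep records until the gap between consecutive scores exceeds pct of the top score.
    if not results:
        return []
    max_gap = results[0][1] * pct
    kept = [results[0]]
    for prev, cur in zip(results, results[1:]):
        if prev[1] - cur[1] > max_gap:
            break
        kept.append(cur)
    return kept

def absolute(results, cutoff):
    # Keep records with scores greater than the specified cutoff value.
    return [r for r in results if r[1] > cutoff]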

When a cutoff method is selected, the “number of matches requested” continues to function as a ceiling on the number of matches returned; dynamic score cutoffs only reduce the size of the results list. For example, if the number of matches requested is 25, no more than 25 records are returned, regardless of the cutoff method selected. Hence, if you select a cutoff method such as Exact-plus-N and the table contains more than 25 exact matches to the query, the results list will contain only a portion (25) of those exact matches.
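
Continuing the sketch above with hypothetical data, the ceiling can be thought of as applying before the cutoff, so the cutoff only ever shrinks the capped list (this ordering is assumed here purely for illustration):

all_scored_records = [("rec%d" % i, 1.0) for i in range(40)]   # 40 exact matches
requested = 25

# The engine returns at most 'requested' records; the cutoff then prunes that list.
capped = sorted(all_scored_records, key=lambda r: r[1], reverse=True)[:requested]
final = exact_plus_n(capped, n=5)   # reuses exact_plus_n from the sketch above
print(len(final))                   # 25, even though the table holds 40 exact matches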

If your application requires a dynamic cutoff method, select it systematically, because the appropriate choice depends on the nature of the data and your application’s tolerance for false positives.

Studying the score profiles of a representative set of queries against the actual data is usually the best way to determine an appropriate method for limiting search results. Look for a score region such that scores above the region almost always represent authentic matches, and scores below the region almost always represent non-matches. If no such region exists, that is, if scores of definite authentic matches frequently overlap with scores of definite inauthentic matches, you must tune the query structure, or the weighting of the parts of a complex query, to create such a “region of separation” between definite authentic and definite inauthentic matches.

Tip: The preceding paragraph refers to definite authentic and definite inauthentic matches for a reason. There is always an intermediate region of results that you regard as possible authentic matches. The required “region of separation” inevitably includes such “maybe” results, and this is precisely the region you manage when you select the cutoff method.
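
As a rough illustration of this kind of profiling, the following sketch checks whether hand-labeled scores from a representative set of queries leave a usable separation region; the labels and data are hypothetical:

def separation_region(match_scores, nonmatch_scores):
    # Scores of results judged to be definite matches and definite non-matches.
    lowest_match = min(match_scores)
    highest_nonmatch = max(nonmatch_scores)
    if lowest_match > highest_nonmatch:
        # Any absolute cutoff inside this interval separates the two classes.
        return (highest_nonmatch, lowest_match)
    # Overlap: tune the query structure or field weighting before choosing a cutoff.
    return None

print(separation_region([0.92, 0.88, 0.95], [0.41, 0.56, 0.60]))   # (0.60, 0.88)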

 

The choice of cutoff method depends on the following factors:

Whether the region of separation corresponds reliably to an absolute range of score values. If it does, select an absolute score cutoff somewhere in the region. Otherwise, select one of the cutoff methods that is relative to the top scoring item in the list, or the Exact-plus-N method.
Your policy regarding tolerance of false positives. This depends on the application. With interactive searches, allowing some false positives at the end of the results helps ensure that no authentic but very inexact matches are missed. On the other hand, in an application that triggers actions, such as the merging of records based on the results of a search, you might want to minimize false positives at the cost of a few more false negatives.

Cutoffs When Using a Learn Model

When using a Learn Model to score records, the scores returned might change each time the model is retrained. This might alter the appropriate cutoff score (the Learn UI application in ibi Patterns - Search has a tool for selecting an appropriate absolute cutoff score for a model). To avoid the need to alter application parameters to update the cutoff score each time a Learn model is retrained, ibi Patterns - Search allows a cutoff score to be embedded in the Learn Model itself. As an option, a query can use this embedded cutoff score as an absolute cutoff score for the query. This ensures the cutoff used for the model is always the correct cutoff for that model.
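
Conceptually, the embedded cutoff lets the calling application avoid hard-coding a threshold that goes stale when the model is retrained. The following sketch is purely illustrative; the class and field names are assumptions and are not the ibi Patterns - Search API:

from dataclasses import dataclass

@dataclass
class LearnModel:
    version: str
    cutoff_score: float          # embedded in the model at training time

def apply_model_cutoff(results, model):
    # Use the model's embedded cutoff as the absolute cutoff for the query.
    return [r for r in results if r[1] > model.cutoff_score]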

Use of Business Criteria to Re-sort the Results List

Intelligent selection of a cutoff method is a prerequisite in any setting where an application re-sorts the final results list by criteria other than the match score, such as business rules or other requirements. In such cases, a stricter dynamic cutoff method and stricter parameters might be required, so that only a small number of high-quality matches, with few false positives, are returned. The reason is that re-sorting the results list by business criteria can easily position less-similar matches ahead of more-similar or even exact matches. This can confuse an end user, especially if clear non-matches are positioned ahead of clear matches. Re-sorting is also a pitfall for a requesting application if that application assumes that high-quality matches necessarily precede lower-quality matches.
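
One common safeguard, sketched below with a hypothetical "priority" business field and cutoff value, is to apply a strict cutoff before re-sorting, so that clear non-matches cannot surface ahead of clear matches:

def resort_by_business_rule(results, cutoff=0.85):
    # Apply a strict score cutoff first, then re-sort the survivors by the business criterion.
    survivors = [r for r in results if r["score"] > cutoff]
    return sorted(survivors, key=lambda r: r["priority"], reverse=True)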