Handling Low Confidence Predictions
In nearly all cases, it is not necessary to be concerned with low confidence predictions from a Learn Model. A Learn Model that is well trained for a data table containing a representative sample of records should have very few, if any, low confidence predictions. The recommended approach is to ensure the data table used to train the model contains data that is representative of all matching situations likely to occur, and that the model is well trained for this data table.
If it is not practical to obtain a representative sample of records, or if it is critical that you detect and handle all cases of low confidence predictions, two approaches are available: handling them in your application, or using the First Valid score combiner. In both cases, it is essential to understand the reliability of the confidence measure used and the meaning of different confidence values.
A few points to know about confidence values:
• Every score generator produces a confidence value of 1.0.
• The RLINK combiner returns the confidence value generated by the Learn Model.
• The First Valid score combiner (as described in Use the First Valid Score Combiner) returns the confidence value of the selected child query.
• All other score combiners return as their confidence value the minimum confidence value of all of their child queries.
From these points, it follows that every query returns a confidence value:
• If RLINK combiners are not used (that is, if a Learn Model is not used), the confidence value is 1.0.
• If one or more RLINK combiners are used, the confidence value is the minimum of the values produced by the RLINK combiners, unless a First Valid score combiner is used, in which case that combiner contributes the confidence value of its selected child query.
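The propagation rules above can be illustrated with a small model of a query tree. This is only a sketch: the `Query` class, its `kind` values, and the `confidence` function are hypothetical names invented for illustration and are not part of the product API. The source does not state what confidence value a First Valid score combiner reports when no child query qualifies, so the sketch raises an error in that case instead of guessing.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Query:
    """Hypothetical query-tree node (illustration only, not the product API)."""
    kind: str                      # "generator", "rlink", "first_valid", or "combiner"
    model_confidence: float = 1.0  # used only when kind == "rlink"
    children: List["Query"] = field(default_factory=list)
    thresholds: List[float] = field(default_factory=list)  # first_valid only


def confidence(q: Query) -> float:
    """Return the confidence value of a query under the rules listed above."""
    if q.kind == "generator":
        # Every score generator produces a confidence value of 1.0.
        return 1.0
    if q.kind == "rlink":
        # The RLINK combiner returns the Learn Model's confidence value.
        return q.model_confidence
    if q.kind == "first_valid":
        # Return the confidence of the first child that meets its threshold.
        for child, threshold in zip(q.children, q.thresholds):
            c = confidence(child)
            if c >= threshold:
                return c
        # Assumption: this case is outside the scope of the sketch.
        raise ValueError("no child query met its threshold")
    # All other score combiners: minimum over the child queries.
    return min(confidence(child) for child in q.children)
```

For example, a combiner over a score generator and two RLINK queries with confidences 0.9 and 0.4 reports 0.4, while a combiner over score generators alone reports 1.0.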
Handle Low Confidence Predictions in Your Application
In this approach, your application checks the confidence value returned for each query. Three confidence values are available; use whichever fits your need.
Confidence Value Name | Description | Usage
Minimum confidence value | This is the lowest confidence value for any prediction made by the Learn Model during the processing of this query. It might be the confidence of a record that is not returned in the result set. | Used to ensure that the result contains all true matches. A low minimum confidence value might indicate that a matching record was incorrectly given a low score, and thus was not returned in the result set. Such a value might also mean that a non-matching record was incorrectly given a high score and thus was returned. The application can flag this match for review, or it can issue an alternative query. You can also reissue the query wrapped in a First Valid score combiner to return those records that had a low confidence value. For information, see Finding Low Confidence Pairs.
Result set confidence value | This is the lowest confidence value of any record in the result set. It is the minimum of the individual record confidence values. | Used as a quick check to see if a result set contains any low confidence matches.
Individual record confidence value | This is the confidence for one of the records returned. | Used to ensure that query results do not contain false matches. Records with low confidence should be flagged for special processing or ignored.
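As a rough illustration of how an application might act on the three confidence values, the following sketch separates accepted records from flagged ones and reports whether matches may have been missed. The `Record` and `QueryResult` types, the `triage` function, and the threshold of 0.5 used below are all assumptions made for this example; they do not reflect the actual result structures returned by the product. Treating an empty result set as having confidence 1.0 is likewise an assumption.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Record:
    """Hypothetical returned record with its score and confidence."""
    key: str
    score: float
    confidence: float


@dataclass
class QueryResult:
    """Hypothetical query result (illustration only)."""
    minimum_confidence: float  # lowest confidence of any prediction made
    records: List[Record]      # records actually returned

    @property
    def result_set_confidence(self) -> float:
        # Minimum of the individual record confidence values.
        # Assumption: an empty result set is treated as confidence 1.0.
        return min((r.confidence for r in self.records), default=1.0)


def triage(result: QueryResult, threshold: float) -> Tuple[List[Record], List[Record], bool]:
    """Return (accepted, flagged, possibly_missed_matches)."""
    # A low minimum confidence may mean a true match was not returned.
    possibly_missed = result.minimum_confidence < threshold
    # Quick check: if the whole result set is confident, accept it as is.
    if result.result_set_confidence >= threshold:
        return list(result.records), [], possibly_missed
    # Otherwise inspect the individual record confidence values.
    accepted = [r for r in result.records if r.confidence >= threshold]
    flagged = [r for r in result.records if r.confidence < threshold]
    return accepted, flagged, possibly_missed
```

An application could route the flagged records to manual review and, when the third return value is true, reissue the query as described in Finding Low Confidence Pairs.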
Use the First Valid Score Combiner
The First Valid score combiner is designed for working with the confidence values output by a Learn Model. It selects a query based on confidence values. Each child query of the First Valid score combiner is assigned a confidence value threshold. The confidence values returned by the child queries are examined in order; the first child query that has a confidence value greater than or equal to its assigned threshold is selected as the result of the First Valid combiner.
A typical usage of the First Valid score combiner is to place an RLINK score combiner as its first child query, with an appropriate confidence value threshold. The second child query is a standard matching query that does not use a Learn Model. If the Learn Model prediction confidence value is low, the standard query is used. Otherwise, the Learn Model query is used. In this way, the Learn Model is applied in those cases where it is well trained, and a fallback standard matching query is applied in those cases where it is not well trained.
The First Valid score combiner allows any number of child queries. Thus it is possible to have several alternative Learn Models in a single query, in which case the combiner selects the first Learn Model that is well trained for the particular match situation. However, the use case for multiple alternative models is exceedingly rare.
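The selection rule can be sketched in a few lines. The `Child` tuple and the `first_valid` function below are hypothetical names used only to illustrate the rule described in this section; they are not the product API. The threshold of 0.0 on the fallback child is an assumption chosen so that the standard query, whose confidence is always 1.0, is always accepted.

```python
from typing import List, NamedTuple, Optional


class Child(NamedTuple):
    """Hypothetical child query of a First Valid score combiner."""
    name: str
    confidence: float  # confidence value returned by the child query
    threshold: float   # confidence threshold assigned to the child


def first_valid(children: List[Child]) -> Optional[Child]:
    """Select the first child whose confidence meets its threshold."""
    for child in children:  # children are examined in order
        if child.confidence >= child.threshold:
            return child
    return None  # no child qualified


# Typical arrangement: Learn Model query first, standard query as fallback.
well_trained = [Child("rlink", 0.85, 0.6), Child("standard", 1.0, 0.0)]
poorly_trained = [Child("rlink", 0.35, 0.6), Child("standard", 1.0, 0.0)]

print(first_valid(well_trained).name)
print(first_valid(poorly_trained).name)
```

In the first case the RLINK child meets its threshold and is selected; in the second it does not, so the standard query is used instead.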
Finding Low Confidence Pairs
There are two reasons why you might want to find low confidence pairs:
1. To find new training pairs to improve a model. This is used in applications that dynamically adapt to changing data by retraining a Learn Model with new examples as they come in.
2. To find records that might have been lost because of an inadequately trained Learn Model.
The First Valid score combiner is the best way to find low confidence pairs. The First Valid score combiner has a flag that reverses its operation; instead of selecting the first child query with a confidence value greater than or equal to its threshold, it selects the first child query with a confidence value less than or equal to its threshold. This is called the invalid only flag. This flag can be used to return only those records that represent a poorly trained matching situation.
To find new training pairs to improve a model, a sample data set is "deduplicated" using an existing Learn Model. The RLINK query is wrapped as the first and only child query of a First Valid score combiner with the invalid only flag set. If no child query of the First Valid score combiner satisfies the confidence value criteria, the -1.0 reject score is returned, causing the record to be rejected. Thus, a First Valid score combiner with a single RLINK child query and the invalid only flag set returns only low confidence records. Because all records returned represent matching situations that are poorly trained in the existing model, they are good candidates for inclusion in the training data set for the Learn Model.
A low minimum confidence value returned in a query indicates there might be one or more records that were not returned because they were assigned an incorrectly low score by a Learn Model that was not trained for the particular example. To find those potentially lost records, the original query is reissued as the first and only child query of a First Valid score combiner with the invalid only flag turned on. The query returns only those records that had low confidence; the high confidence records are filtered out. In this way, the potentially missed records can be found and processed, because they are no longer pushed out of the search results by high confidence, high scoring records.
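The effect of the invalid only flag on a single wrapped RLINK query can be modeled as follows. This is a simplified sketch under stated assumptions: `first_valid_score`, `low_confidence_records`, and the per-record `(score, confidence)` pairs are hypothetical and stand in for whatever the product computes internally. Only the -1.0 reject score and the comparison directions are taken from the description above.

```python
from typing import Dict, List, Tuple

REJECT_SCORE = -1.0  # returned when no child query satisfies the criteria


def first_valid_score(children: List[Tuple[float, float, float]],
                      invalid_only: bool = False) -> float:
    """children: (score, confidence, threshold) per child query, in order."""
    for score, conf, threshold in children:
        # Normal mode selects confidence >= threshold;
        # the invalid only flag reverses this to confidence <= threshold.
        selected = conf <= threshold if invalid_only else conf >= threshold
        if selected:
            return score
    return REJECT_SCORE


def low_confidence_records(predictions: Dict[str, Tuple[float, float]],
                           threshold: float) -> Dict[str, float]:
    """Keep only records whose single RLINK prediction has low confidence."""
    kept = {}
    for key, (score, conf) in predictions.items():
        result = first_valid_score([(score, conf, threshold)], invalid_only=True)
        if result != REJECT_SCORE:  # rejected records are dropped
            kept[key] = result
    return kept


# Hypothetical per-record (score, confidence) pairs from an RLINK query.
predictions = {"r1": (0.92, 0.95), "r2": (0.15, 0.30), "r3": (0.55, 0.50)}
print(low_confidence_records(predictions, threshold=0.5))
```

With a threshold of 0.5, the high confidence record `r1` is rejected and only `r2` and `r3` remain, which makes them candidates either for review as potentially lost matches or for inclusion in the training data.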