Identifying Record Pairs
The Pair Selection tab displays all records that were loaded from the CSV file. Identifying, selecting, and labeling suitable record pairs manually might require a significant amount of time. However, finding new useful pairs automatically simplifies this process to a large extent.
The pairs are created to provide sufficient training examples to the model so that it can learn to recognize all relevant situations where the two records match or do not match. Thus, the record pairs must represent a variety of subsets (refer to Subsets for the definition), a variety of labels for each subset, a variety of feature score values (the degree of match) for each feature, and so on. In particular, you must strive to find pairs for borderline situations that are not obvious, where the correct label of the pair is not apparent at first glance from the values of the two records.
Figure 26: Pair Selection
Selecting Pairs Manually
You can create record pairs manually by first selecting two records and then setting the appropriate label for the pair.
| 1. | Select a record and click Add Record to Pair. Repeat this step to form an appropriate pair. |
| 2. | Select an appropriate label from the available options. To assign the label later, separately from selecting the pair records, skip this step. |
| 3. | Click Save Pair. |
When you click the Save Pair button, the selected pair is randomly added to either the Training dataset or the Validation dataset. The Training dataset is used to train the model. The Validation dataset is used to monitor the performance of the trained model over unseen record pairs and to stop the training process when the validation error rate is the lowest.
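The random assignment described above can be pictured with a minimal sketch. The function name `assign_dataset` and the 20% validation share are assumptions made for illustration only; this guide does not state the ratio the application actually uses.

```python
import random

def assign_dataset(rng: random.Random, validation_fraction: float = 0.2) -> str:
    """Randomly pick the dataset that a newly saved pair goes to."""
    # validation_fraction is an assumed value, not documented product behavior.
    return "Validation" if rng.random() < validation_fraction else "Training"

rng = random.Random(42)  # fixed seed only so that the sketch is repeatable
saved_pairs = [
    ("rec1", "rec2", "True"),
    ("rec3", "rec4", "False"),
    ("rec5", "rec6", "True"),
]
datasets = {"Training": [], "Validation": []}
for pair in saved_pairs:
    datasets[assign_dataset(rng)].append(pair)
```

Every saved pair lands in exactly one of the two datasets, so the Validation pairs remain unseen by the training step.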
Labels
| • | True |
Select the True label if the two records match (represent the same entity).
| • | False |
Select the False label if the two records do not match (represent different entities).
| • | Unsure |
Select the Unsure label if you are not sure whether the records match or not. Record pairs with Unsure labels or without any assigned label do not participate in the learning process, but you can change the label later.
Eventually all pairs are expected to have a label to participate in the learning process.
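As a rough illustration of the rule above, the following sketch keeps only the pairs that can take part in learning. The dictionary layout and the name `pairs_for_learning` are hypothetical and do not reflect how the application stores pairs.

```python
# Only these labels let a pair participate in the learning process.
LEARNING_LABELS = {"True", "False"}

def pairs_for_learning(pairs):
    """Drop pairs labeled Unsure and pairs without any assigned label."""
    return [p for p in pairs if p.get("label") in LEARNING_LABELS]

pairs = [
    {"id": 1, "label": "True"},
    {"id": 2, "label": "Unsure"},
    {"id": 3, "label": None},  # no label assigned yet
    {"id": 4, "label": "False"},
]
usable = pairs_for_learning(pairs)
```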
Finding Useful Pairs Automatically
You can also find record pairs automatically by using the Suggest Pairs button on the Pair Selection tab.
The Suggest Pairs button starts the Low Confidence Pair Finder, which searches for new pairs in the background. You can review and label the already found pairs while the search is running. The pairs that are found are useful for training the Learn model, because they address situations that were not sufficiently trained earlier. Thus, the existing datasets can be augmented to cover new matching scenarios, or a new model can be trained from the very beginning by using only the automatically found pairs.
The pairs are found by examining records in the existing table. The confidence of the model's predictions, which indicates how reliable a prediction is, is used to determine the pairs that are likely to be useful.
The confidence of a model prediction is determined by the similar record pairs that were used during model training, as follows:
| • | If the model has never seen similar pairs during training or it has seen similar pairs with contradictory labels, the confidence is low. |
| • | If the model has seen many similar pairs with consistent labels during training, the confidence is high. |
The Low Confidence Pair Finder focuses on finding pairs with the lowest confidence. After you label a pair and add it to the Training dataset, the retrained model is likely to predict this pair with an increased confidence.
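The guide does not give the formula behind the confidence value. The toy sketch below only mirrors the two rules above: confidence is low when no similar pairs were seen or their labels contradict each other, and high when many similar pairs carry consistent labels. The function `toy_confidence`, the constant `k`, and the candidate data are all assumptions made for illustration.

```python
def toy_confidence(similar_labels, k=5):
    """Toy confidence from the labels of similar training pairs (True/False)."""
    n = len(similar_labels)
    if n == 0:
        return 0.0  # the model has never seen similar pairs
    n_true = sum(1 for label in similar_labels if label)
    agreement = abs(2 * n_true - n) / n  # 0.0 for an even split, 1.0 if unanimous
    coverage = n / (n + k)               # grows toward 1.0 as more pairs are seen
    return agreement * coverage

# Hypothetical candidates: each maps to the labels of similar training pairs.
candidates = {
    "pair A": [],                   # nothing similar was seen during training
    "pair B": [True] * 20,          # many similar pairs, consistent labels
    "pair C": [True, True, False],  # few similar pairs, mixed labels
}
lowest = min(candidates, key=lambda name: toy_confidence(candidates[name]))
```

In this sketch, "pair A" would be suggested first, because nothing similar was seen during training.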
You must assign a True, False, or Unsure label to the found record pair before saving it. You can also mark the subset represented by the record pair as Always False (for more information, see the section Always False Subsets). Once the pair is labeled and saved, the next found pair is automatically displayed for labeling.
To stop automatically finding pairs, click the Stop Suggesting button, then label and save the last found pair that is displayed.
An existing Learn model is required to find pairs automatically. You can train an initial model even when no pairs have been saved (for more information, see the section Training a Learn Model), and then click the Suggest Pairs button. Alternatively, you can simply click the Suggest Pairs button and the application offers to train and save the initial model.
The number of found pairs is displayed on the Pair Selection tab. The Low Confidence Pair Finder tends to find a large number of record pairs when no pairs or just a small number of pairs have been used to train the model. It tends to find fewer pairs when many pairs were already used to train the model. If very few pairs are found, it is recommended to wait several minutes or longer to see if a larger number of pairs is found. Eventually the process no longer finds any record pairs within a reasonable time, which means that the model is already well trained for the given data table. To ensure that a sufficient number of pairs can be found in a reasonable time, the data table should contain a representative sample of at least 100,000 records.
While finding pairs automatically, the system periodically offers to retrain the model. It is recommended to accept, because the retrained model learns the new matching situations, which reduces the total number of pairs that you need to label. After the model is automatically retrained and saved, you can review the training results and then click the Suggest Pairs button again to continue the process.
Subsets
When field values for a certain feature are missing, the criteria used to determine a record match are likely to be different. For example, if a Social Security number is missing, a match on a secondary field, such as birth date or address, is likely to be crucial to establish a match of the two records, whereas if the Social Security number is present, the secondary fields might be almost irrelevant. Therefore, the model learns differently depending on what feature scores are present or absent, and must be trained for each case.
A subset defines which feature scores are present and which are absent. There is a separate subset for each combination of present and absent feature scores. If a Simple or Cognate feature uses multiple fields, its score is absent only if all the fields used are empty in either record of the pair.
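The definition above can be sketched as a function that returns the set of present feature scores for a pair. The feature names, field names, and helper functions are hypothetical. The sketch also assumes one reading of the multi-field rule: a score is absent when at least one of the two records has all of the feature's fields empty.

```python
def is_empty(value):
    """Treat None and blank strings as empty field values."""
    return value is None or str(value).strip() == ""

def feature_present(fields, rec_a, rec_b):
    """Absent only if all fields of the feature are empty in either record."""
    def all_empty(rec):
        return all(is_empty(rec.get(f)) for f in fields)
    return not (all_empty(rec_a) or all_empty(rec_b))

def subset_of(features, rec_a, rec_b):
    """Return the subset (the set of present feature scores) for a pair."""
    return frozenset(
        name for name, fields in features.items()
        if feature_present(fields, rec_a, rec_b)
    )

# Hypothetical features: "Name" uses two fields, the others use one.
features = {"SSN": ["ssn"], "Name": ["first", "last"], "City": ["city"]}
rec_a = {"ssn": "", "first": "Ann", "last": "", "city": "Boston"}
rec_b = {"ssn": "123-45-6789", "first": "", "last": "Lee", "city": "Boston"}
subset = subset_of(features, rec_a, rec_b)
```

Here the SSN score is absent because the field is empty in the first record, while the Name score is present because each record has at least one non-empty name field.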
Always False Subsets
Some subsets of present feature scores can be marked as Always False. Record pairs that belong to these subsets or any of their subsets are always classified as False by the Learn model. Even when the present feature scores in such a record pair represent an exact match in the two records of the pair, this information is still not sufficient to classify the pair as a True match.
For example, in a Learn model for person matching, two records in which only the City and State features are present and match exactly (all other feature scores are empty) are not sufficient to determine that the two records represent the same person (there are many people living in the same city and state). Thus, a subset that only has non-empty City and State feature scores can be marked as Always False. This causes any record pairs that have only a non-empty City feature score, or only a non-empty State feature score, or that have all empty feature scores to also be classified as False, since these are subsets of the original subset that was marked as Always False.
You can use the Always False subsets to make the decision about the subset only once instead of labeling a potentially large number of pairs for the same subset, or trying to find two records with exact matches for the features in this subset to demonstrate that the exact match must still be classified as False. Also, having Always False subsets reduces the size of the model file and speeds up model predictions for these subsets.
To mark a subset as Always False, select a pair of records that represents that subset, select the False label, and then click the Mark as Always False button. Review the list of present feature scores in the confirmation dialog and confirm the Always False subset. You can also mark an automatically found pair as Always False.
The record pairs that represent Always False subsets are stored in a separate Always False Subsets dataset. You can review these pairs in the Pairs tab. Deleting such a pair removes the Always False subset.
A Learn model assigns a score of 0.0 and a confidence of 1.0 for all pairs that belong to one of the saved Always False subsets or their subsets.
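The Always False rule amounts to a set-inclusion check that runs before the model is consulted. The sketch below assumes each subset is represented as a set of present feature names. The function `classify` and the stub `model_predict` are hypothetical and are not the product's API.

```python
def classify(present_features, always_false_subsets, model_predict):
    """Return (score, confidence) for a pair, honoring Always False subsets."""
    present = frozenset(present_features)
    for marked in always_false_subsets:
        if present <= marked:  # same subset, or a subset of a marked subset
            return 0.0, 1.0
    return model_predict(present)

def model_predict(present):
    """Stand-in for the trained model; the returned values are made up."""
    return 0.5, 0.5

always_false = [frozenset({"City", "State"})]

city_only = classify({"City"}, always_false, model_predict)
all_empty = classify(set(), always_false, model_predict)
with_name = classify({"City", "Name"}, always_false, model_predict)
```

The pair with only a City score and the pair with no scores at all both receive a score of 0.0 and a confidence of 1.0, whereas the pair that also has a Name score is passed to the model.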
Supporting Functions
| • | Basic search |
This function searches for the string specified in the search text-box. The basic search function finds the search string anywhere in the text of any field. It does not search for whole words.
| • | Reset |
The Reset button is used to reset all changes made to the record organization on the Pair Selection tab. It removes all sorting orders, searches and filters for the data table. This will make the records go back to the original order in the CSV file, and all records will be shown. In addition, any automatic filters used to process model training suggestions will be removed.
Column Context Menu Functions
The following functions can be accessed by right-clicking a column title:
| • | Sort |
Using sorting, you can find records with the same or similar values in the sorted fields. Sorting each field makes it easy to analyze it individually. Each field can be sorted in ascending or descending order. Multiple columns can be selected for sorting. The last column that is sorted becomes the primary sort column.
| • | Clear Filter |
This function removes any filter that is currently applied for the selected field. It does not remove any filters for other fields.
| • | Filtering blank and non-blank field values |
Filtering by blank and non-blank field values can be used to view subsets of present and empty field values. It helps to narrow down the list of records and focus on a specific subset of present field values.
| • | Custom Filter |
Select the Custom Filter menu item to specify up to two custom filters for the selected field. If both filters are specified, they can be combined with an AND or OR operation. To specify each filter, select a filter type from the drop-down menu and enter a value to be used by the filter. The types of custom filters available in the drop-down depend on the type of the selected field.
| • | Hide Field |
This function makes the selected field invisible in the Pair Selection tab. If you want to show a hidden field again, use the Show/Hide Fields item in the Table Functions drop-down menu.
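Two of the column functions above lend themselves to a short sketch: the Sort rule that the last column sorted becomes the primary sort column, and the Custom Filter that combines up to two filters with AND or OR. The sketch assumes that each sort is stable, which is one way to obtain the described behavior; the guide does not say how the application implements it. The predicates are passed as plain functions because the available filter types are not listed here. All names and sample records are hypothetical.

```python
def sort_in_click_order(records, clicks):
    """Apply one stable sort per click; the last column sorted ends up primary."""
    result = list(records)
    for column, descending in clicks:
        result = sorted(result, key=lambda r: r[column], reverse=descending)
    return result

def custom_filter(records, column, first, second=None, combine="AND"):
    """Keep records whose column value passes one or two predicates."""
    def passes(record):
        value = record[column]
        if second is None:
            return first(value)
        if combine == "AND":
            return first(value) and second(value)
        return first(value) or second(value)
    return [r for r in records if passes(r)]

records = [
    {"last": "Lee", "city": "Boston"},
    {"last": "Adams", "city": "Austin"},
    {"last": "Baker", "city": "Boston"},
    {"last": "Cole", "city": "Austin"},
]

# Sort by last name first, then by city: city becomes the primary sort column.
by_city_then_last = sort_in_click_order(records, [("last", False), ("city", False)])

# Two filters on the same field combined with OR.
a_or_b = custom_filter(
    records, "last",
    lambda v: v.startswith("A"),
    lambda v: v.startswith("B"),
    combine="OR",
)
```

Within each city, the records stay in last-name order from the earlier sort, which is why sorting a second column helps to group records with similar values.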
Table Functions
The following functions can be accessed through the Table Functions drop-down menu:
| • | Advanced Search |
This function provides an additional search functionality that is field-specific. There are several search operations that can be specified for each field: equals, contains (contains the search string anywhere in the field), and contains phrase (contains the specified whole word or a phrase of whole words).
| • | Sort dialog |
This function provides an ability to precisely define and change the sorting by multiple columns. You can add a number of columns to the list of sorted fields, specify ascending or descending order for each column, and move any column up or down the list of sorted fields.
| • | Clear Filter and Sort State |
After searching, filtering, or sorting the field values, you can use this function to make the records go back to the original order in the CSV file and to display all records. Unlike the Reset button, this function does not remove any automatic filters that might have been applied to process model training suggestions.
| • | Show/Hide Fields |
With this function, you can select any field to be hidden or shown again in the Pair Selection tab. If the field is selected, it will show in the Pair Selection tab. If you clear the checkbox, the field will be hidden from the tab. Unlike the Ignore checkbox in the Data tab, this function hides the field only from the Pair Selection tab.
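The difference between the three Advanced Search operations described above can be sketched as follows. Case-insensitive matching and the treatment of word boundaries are assumptions; the guide does not specify either, and the function names are illustrative only.

```python
import re

def equals(value, query):
    """The whole field value matches the search string."""
    return value.casefold() == query.casefold()

def contains(value, query):
    """The search string appears anywhere in the field, even inside a word."""
    return query.casefold() in value.casefold()

def contains_phrase(value, query):
    """The search string appears as a whole word or a phrase of whole words."""
    # Lookarounds require that no word character touches the match on either side.
    pattern = r"(?<!\w)" + re.escape(query) + r"(?!\w)"
    return re.search(pattern, value, flags=re.IGNORECASE) is not None

inside_word = contains("Annapolis", "Ann")                # True
whole_word = contains_phrase("Annapolis", "Ann")          # False
phrase = contains_phrase("Ann Lee Smith", "ann lee")      # True
```

The "contains" operation behaves like the basic search, which matches the string anywhere in the text, whereas "contains phrase" rejects matches that fall inside a longer word.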