Configuring the Data File and Its Fields

The data file must be in CSV format. The first row of the file must contain field names.

The data file should contain a representative sample of the records in the possibly much larger table on the TIBCO Patterns server where the trained model will eventually be used. It should have enough records to create a large variety of record pairs, and at least 100000 records to use the Low Confidence Pair Finder efficiently. Do not assign an empty data file: without records, you cannot create record pairs or train a model.
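Before assigning a file, it can help to confirm that it meets the requirements above. The following is a minimal sketch, not part of the Learn UI; the function name check_data_file and the min_records parameter are hypothetical. It assumes a comma-delimited file and only checks that the first row names every field and that the file contains records.

```python
import csv
import io


def check_data_file(stream, min_records=1):
    """Return (field_names, record_count), or raise ValueError.

    `stream` is any text stream, for example
    open(path, newline="", encoding="utf-8").
    """
    reader = csv.reader(stream)
    header = next(reader, None)
    if not header or any(not name.strip() for name in header):
        raise ValueError("the first row must contain a name for every field")
    # csv.reader yields an empty list for a blank line; do not count those.
    record_count = sum(1 for row in reader if row)
    if record_count < min_records:
        raise ValueError(
            f"found {record_count} records, expected at least {min_records}"
        )
    return header, record_count


# Usage with a small in-memory sample (illustrative data only).
sample = io.StringIO("id,name,zip\n1,Ann Lee,02134\n2,Anne Leigh,02134\n")
fields, count = check_data_file(sample)
```

Passing a larger min_records value turns the same check into a test for the record count recommended for the Low Confidence Pair Finder.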

1. On the Project tab, click Assign.
2. Click Browse and locate the data file. The Learn UI lets you associate the selected data file with the project in one of the following ways:

From the project directory

Select Copy data file to project directory. The file is copied to the project directory, and the project accesses the copy. This makes the project folder completely portable, so it can easily be copied to another computer.

From a different location

Do not select Copy data file to project directory. In this case, the file is linked to the project without copying it. This can be used to create several Learn projects on the same system that use the same data file, without creating multiple copies of the data file.

Warning: Changes to the data file can invalidate the entire project or the labels of existing pairs. When you link the data file instead of copying it, make sure that the file is not modified, renamed, or deleted for as long as the Learn UI project exists. If you change field values in the CSV file in a way that does not invalidate any existing pair labels, you can continue to use the data file. The application detects data file changes and offers to automatically update the field data in the existing pairs from the modified CSV file.

Figure 9: Assign Data File
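When a data file is linked rather than copied, you may want your own record of whether it has changed since it was assigned. The sketch below is one possible external safeguard and does not describe how the Learn UI detects changes; the function names file_fingerprint and has_changed are hypothetical. It computes a SHA-256 digest of the file contents, which you can store when the file is assigned and compare later.

```python
import hashlib


def file_fingerprint(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of the file contents at `path`."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in chunks so that large data files are not loaded at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def has_changed(path, recorded_fingerprint):
    """Return True if the file contents differ from the recorded digest."""
    return file_fingerprint(path) != recorded_fingerprint
```

A digest only reports that the contents differ. It does not tell you whether the change invalidates existing pair labels, and a renamed or deleted file causes open to raise an error instead.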

Reviewing List of Fields

After the data file is assigned, the list of fields from the file is displayed on the Data tab. Statistics for all fields are also displayed. These statistics might help determine the appropriate field type and also whether a field is useful in determining a record match.

Figure 10: Reviewing the List of Fields
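To illustrate how field statistics can guide these decisions, the following sketch computes two simple measures per field: how many records have a value, and how many distinct values occur. These measures are assumptions chosen for illustration and are not necessarily the statistics the Data tab displays; the function name field_statistics is hypothetical.

```python
import csv
import io
from collections import defaultdict


def field_statistics(stream):
    """Return {field: {"records", "non_empty", "distinct"}} for a CSV stream."""
    reader = csv.DictReader(stream)
    total = 0
    non_empty = defaultdict(int)
    distinct = defaultdict(set)
    for record in reader:
        total += 1
        for name in reader.fieldnames:
            # A short row yields None for its missing fields.
            value = (record.get(name) or "").strip()
            if value:
                non_empty[name] += 1
                distinct[name].add(value)
    return {
        name: {
            "records": total,
            "non_empty": non_empty[name],
            "distinct": len(distinct[name]),
        }
        for name in (reader.fieldnames or [])
    }


# Usage with a small in-memory sample (illustrative data only).
sample = io.StringIO(
    "id,name,zip,phone\n"
    "1,Ann Lee,02134,555-0100\n"
    "2,Anne Leigh,02134,\n"
    "3,Bob Ray,02134,555-0199\n"
)
stats = field_statistics(sample)
```

In this sample, a field with as many distinct values as records (id) is a candidate key, while a field with a single distinct value (zip) carries little information for telling records apart.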

You can perform the following operations on this tab:

3. Select key field

Choose the key field for the data table in the Key column. The field selected must contain a unique value for every record.

4. Change field type

The default field type is Searchable Text. To change it, click the field type and choose the new field type from the drop-down list. Fields that contain date values, for example, Date of Birth, should generally be changed to the Date or Searchable Date type. Fields to be compared as numbers, such as size or weight, should be assigned the Integer or Floating Point field type. However, numeric ID fields, such as order numbers, phone numbers, and ZIP codes, are best left as Searchable Text fields so that they are compared as text.

The statistics calculated on the Data tab for a Searchable Date field are identical to the statistics calculated when the Date field type is selected for the same field.

A custom filter applied on the Pair Selection tab to a field of type Date is preserved when the field type is changed to Searchable Date, and vice versa.

5. Ignore fields

Selecting the Ignore checkbox for a field eliminates this field from the pair selection and learning process. Fields that are never useful for matching can be marked as ignored.

Ignored fields are not displayed on any of the other tabs, which avoids clutter when examining records and pairs.
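Two of the points above can be checked outside the Learn UI before you choose a key field or a field type. The sketch below is illustrative only; the function name duplicate_keys and the sample data are hypothetical. It lists key values that occur more than once, and then shows why a numeric ID such as a ZIP code is better kept as text: converting it to a number drops the leading zero.

```python
import csv
import io
from collections import Counter


def duplicate_keys(stream, key_field):
    """Return the sorted key values that appear in more than one record."""
    reader = csv.DictReader(stream)
    if key_field not in (reader.fieldnames or []):
        raise KeyError(f"no field named {key_field!r}")
    counts = Counter((record[key_field] or "").strip() for record in reader)
    return sorted(value for value, n in counts.items() if n > 1)


# Usage with a small in-memory sample (illustrative data only).
sample = io.StringIO("id,name\n1,Ann Lee\n2,Anne Leigh\n2,Bob Ray\n")
repeated = duplicate_keys(sample, "id")  # the value "2" occurs twice

# A ZIP code treated as a number no longer matches its text form.
zip_as_text = "02134"
zip_as_number = int(zip_as_text)
round_trip = str(zip_as_number)  # the leading zero is gone
```

A field for which duplicate_keys returns an empty list satisfies the uniqueness requirement for the sampled records; it says nothing about records that are not in the file.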