Defining Features

After the data file and field types are configured, the next step is to create features. In Learn UI, a feature is a group of fields used in matching that provides one or more input feature scores for training a model.

When creating features, consider the following:

Create features for fields that decide whether any two records match. Start by defining features for the most important fields. Not every table field needs to be included in a feature.
There is a limit on the total number of model features. The current total appears at the bottom of the Features tab. In typical systems, the practical limit is 10 or 11 model features. Using more features can result in very large model files that are likely to require too much memory.
A separate feature for each field is preferred to combining fields into a single feature. However, because the number of features is limited, you can combine several related fields into one feature to stay within the limit.
If fields must be combined into one feature, select fields that make up parts of a whole, such as the components of an address.
Consider removing any features defined for fields that provide very little additional information. For example, remove the feature for a State field if more than 90% of records are from the same state.
Metadata fields, such as the record creation date or the person who last updated the record, are typically not useful for record matching and must not be used to create features.
A higher match score for a feature must always represent a higher (or unchanged, but never lower) probability of a positive match for the record pair as a whole, provided that the scores for all other features remain the same.
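The guideline above about low-information fields (for example, a State field where more than 90% of records share one value) can be checked before a feature is defined. The following is a minimal sketch, not part of Learn UI: the function name `dominant_value_share` and the sample data are hypothetical, and the 0.90 threshold is taken from the example in the text.

```python
from collections import Counter

def dominant_value_share(values):
    """Return (most_common_value, share) over the non-empty values of one field."""
    # Normalize so that "ca" and "CA " count as the same value.
    non_empty = [v.strip().upper() for v in values if v and v.strip()]
    if not non_empty:
        return None, 0.0
    value, count = Counter(non_empty).most_common(1)[0]
    return value, count / len(non_empty)

# Hypothetical sample of a State column: 19 of 20 records share one value.
states = ["CA"] * 19 + ["NV"]
value, share = dominant_value_share(states)
if share > 0.90:
    print(f"State is dominated by {value!r} ({share:.0%}); its feature adds little information")
```

A field that fails this check is a candidate for removal from the feature list, which also frees room under the model feature limit.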

To add a new feature:

1. Open the Features tab.
2. If some features have already been created, click Add Feature. The Add Feature screen is displayed. If no features have been created, that screen is displayed immediately.
3. Specify the feature name. Every feature must have a unique name.
4. Select the feature category.
5. Select the feature type.
6. Click Next.
7. Select one or more fields to be used in the feature and configure the options displayed for the selected feature type.
8. Click Finish.
Note: You cannot create a new feature unless a key field for the data table has been selected.

The features are displayed in a table in the Features tab.

Categories of Features

Features fall into the following categories:

Generic
Data Specific

Figure 18: Selecting Feature Category and Type

Generic Features

Generic features can be used to create any feature for general purpose applications, especially when a data-specific feature is not defined for your application area. Generic features correspond to the basic query types available in TIBCO® Patterns. For more information on queries, see the section "Designing Queries for Patterns" in the TIBCO® Patterns Concepts guide.

Types of Generic Features:

Simple

A Simple feature compares one or more fields in two records. A single text string constructed from the selected field values in one record is compared to the concatenated value of the selected fields in the other record. The feature score reflects the contributions of the whole or partial matches found across the selected fields. You can add a thesaurus file and specify a thesaurus type and weight for a simple feature. If the Match Empty Values checkbox is selected, and all the selected fields are empty in both records in the record pair, the feature score is 1.0 instead of the empty score -1.0.
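The concatenation step and the effect of the Match Empty Values checkbox can be sketched as follows. This is an illustration only: Python's `difflib.SequenceMatcher` stands in for the TIBCO Patterns matching algorithm, which it does not reproduce, and the function names are hypothetical. The handling of a pair in which only one record is empty is an assumption, because the text describes only the case where both are empty.

```python
from difflib import SequenceMatcher

EMPTY_SCORE = -1.0  # the empty score named in the text

def concat_fields(record, fields):
    """Build one text string from the selected field values of a record."""
    values = ((record.get(f) or "").strip() for f in fields)
    return " ".join(v for v in values if v)

def simple_feature_score(rec_a, rec_b, fields, match_empty_values=False):
    text_a = concat_fields(rec_a, fields)
    text_b = concat_fields(rec_b, fields)
    if not text_a and not text_b:
        # All selected fields are empty in both records.
        return 1.0 if match_empty_values else EMPTY_SCORE
    if not text_a or not text_b:
        # Assumption: one-sided emptiness also yields the empty score.
        return EMPTY_SCORE
    # Stand-in similarity in [0, 1]; not the TIBCO Patterns score.
    return SequenceMatcher(None, text_a.lower(), text_b.lower()).ratio()

fields = ["street", "city"]
blank = {"street": "", "city": ""}
print(simple_feature_score(blank, blank, fields))
print(simple_feature_score(blank, blank, fields, match_empty_values=True))
```

With the checkbox cleared, the pair of blank records receives the empty score; with it selected, the same pair is treated as a full match.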

Figure 19: Simple Feature

Cognate

A Cognate feature specifies a structured comparison over a group of fields. It can match a value even if the value was entered into the wrong field. Use cognate features for closely related fields that are subject to frequent misfielding. For example, the fields first_name, middle_name, and last_name can be combined by using a cognate feature. You can add a thesaurus file for a cognate feature and specify the thesaurus type and weight. In addition, you can specify a noncognate weight (a score penalty applied when a value is entered into the wrong field) and an empty field penalty (a score penalty that discounts unmatched data attributable to an empty field in either of the two records in a record pair). If the Match Empty Values checkbox is selected, and all the selected fields are empty in both records in the record pair, the feature score is 1.0 instead of the empty score -1.0.
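The idea of matching misfielded values at a reduced weight can be illustrated with a simplified sketch. It is not the TIBCO Patterns cognate algorithm: `difflib.SequenceMatcher` is a stand-in similarity, the function `cognate_sketch` and its averaging rule are hypothetical, and the empty field penalty and thesaurus options are not modeled.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Stand-in string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def cognate_sketch(rec_a, rec_b, fields, noncognate_weight=0.8):
    """Average, over the non-empty fields of rec_a, of the best weighted match in rec_b."""
    scores = []
    for field_a in fields:
        value_a = rec_a.get(field_a) or ""
        if not value_a:
            continue
        best = 0.0
        for field_b in fields:
            value_b = rec_b.get(field_b) or ""
            if not value_b:
                continue
            # Same-field matches count fully; cross-field matches are discounted.
            weight = 1.0 if field_a == field_b else noncognate_weight
            best = max(best, weight * similarity(value_a, value_b))
        scores.append(best)
    # Assumption: with nothing to compare, return the empty score -1.0.
    return sum(scores) / len(scores) if scores else -1.0

fields = ["first_name", "last_name"]
aligned = {"first_name": "John", "last_name": "Smith"}
swapped = {"first_name": "Smith", "last_name": "John"}
print(cognate_sketch(aligned, aligned, fields))
print(cognate_sketch(aligned, swapped, fields))
```

In this sketch, the swapped record still scores well because each value finds its counterpart in the other field, but the noncognate weight keeps the result below that of a correctly fielded match.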

Figure 20: Cognate Feature

Date

A Date feature compares the similarity of the date values in two records. Select a field from the drop-down list, which contains all fields that have been assigned the Date field type. For example, a Date feature can be used to compare dates in the date_of_birth field. If the Match Empty Values checkbox is selected, and the selected field is empty in both records in the record pair, the feature score is 1.0 instead of the empty score -1.0.

For more information, see the section "Date Comparisons" in TIBCO® Patterns Concepts Guide.

Figure 21: Date Feature

Predicate

A Predicate feature uses an exact-matching predicate expression to compute the feature score. These expressions are written in the TIBCO Patterns predicate expression language, and a Predicate feature requires you to compose a valid expression. For more information about constructing predicate expressions, see the sections "Constructing Predicate Expressions" and "Predicate Queries" in the TIBCO® Patterns Concepts Guide.

Predicate expressions, as described in the Concepts Guide, reference table record field values as $field-name. Predicate expressions used in the Learn UI can also reference fields in the query record. Query record field values are referenced as ${field-name}.

Note: If the query record field value is to be treated as a text value, it must be enclosed in double quotes, for example "${field-name}". For more information about using query record field values in predicate expressions, see the description of the NetricsPredicateMapper class in the Java API documentation.

A predicate expression must refer to the same fields in both records in a record pair, one being the query record, the other the table record. The result of a predicate expression must be the same if the two records in any pair are switched.
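The symmetry requirement can be tested on sample record pairs before a predicate is relied on. The sketch below uses ordinary Python functions as stand-ins for predicate expressions; it does not use TIBCO Patterns predicate syntax, and the names `is_symmetric`, `same_zip`, and `query_is_older` are hypothetical.

```python
def is_symmetric(predicate, record_pairs):
    """True if the predicate gives the same result when each pair is switched."""
    return all(predicate(a, b) == predicate(b, a) for a, b in record_pairs)

def same_zip(query, table):
    # Refers to the same field in both records; order does not matter.
    return query["zip"] == table["zip"]

def query_is_older(query, table):
    # Depends on which record is the query record, so it is not symmetric.
    return query["age"] > table["age"]

pairs = [
    ({"zip": "94301", "age": 40}, {"zip": "94301", "age": 35}),
    ({"zip": "10001", "age": 22}, {"zip": "60601", "age": 22}),
]
print(is_symmetric(same_zip, pairs))
print(is_symmetric(query_is_older, pairs))
```

A predicate that fails this kind of check produces a feature score that depends on which record happens to be the query record, which violates the requirement stated above.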

Figure 22: Predicate Feature

Data-Specific Features

Data-specific features are predefined for a certain domain. Use them first if your data matches the purpose of the feature. A data-specific feature is a predefined combination of underlying model features. It eliminates the need to define the exact parameters for several generic features. The data-specific features use parameters that in most cases are best for the type of data indicated.

Types of data-specific features:

Person Name

This feature compares the similarity of First Name, Last Name, and an optional Middle Name field. You can specify a thesaurus to be used in underlying model features that include the First Name or Middle Name fields.

Figure 23: Person Name Feature

Gender

This feature determines whether a gender field in two records has the same meaning. It also allows the model to predict differently for Male and Female record pairs. The codes used to indicate male and female are defined in this feature. These gender codes are selected from the list of the most frequent field values.

Note: Gender codes are not case sensitive.
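The case-insensitive comparison of gender codes by meaning can be sketched as follows. This is an illustration under stated assumptions: the function `gender_pair_category` and its category labels are hypothetical and do not describe the underlying model features that Learn UI actually generates.

```python
def gender_pair_category(value_a, value_b, male_codes, female_codes):
    """Classify a record pair by the meaning of its gender codes, ignoring case."""
    male = {c.strip().lower() for c in male_codes}
    female = {c.strip().lower() for c in female_codes}

    def meaning(value):
        code = (value or "").strip().lower()
        if code in male:
            return "male"
        if code in female:
            return "female"
        return None  # empty or unrecognized code

    a, b = meaning(value_a), meaning(value_b)
    if a is None or b is None:
        return "unknown"
    if a != b:
        return "mismatch"
    # Separate labels let a model treat male pairs and female pairs differently.
    return f"both_{a}"

male_codes = ["M", "Male"]
female_codes = ["F", "Female"]
print(gender_pair_category("m", "MALE", male_codes, female_codes))
print(gender_pair_category("F", "m", male_codes, female_codes))
```

Here "m" and "MALE" are treated as the same meaning even though the raw values differ, which is the behavior the feature description and the note on case sensitivity imply.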

 

Figure 24: Gender Feature

The list of created Features.