TIBCO EBX®
Match and Merge Documentation > Reference Manual

Matching strategies

Types of matching strategies

The add-on provides two strategies for finding duplicates: phonetic and distance. An additional configuration allows you to mix these two strategies, which improves the quality of matches.

These strategies can be extended with any other form of matching: the add-on accepts the configuration of custom matching algorithms.

Phonetic matching

Phonetic matching relies on how words are pronounced. The more similar the pronunciation of two terms, the higher the match score.

For example, phonetic matching will provide a high score for the following:

However, for these values, phonetic matching fails to find a match:

Moreover, the language used influences the matching sensitivity. For instance, depending on whether the language used is English or French, the matching results are different for the following:

The language used by the matching process is a property in the 'Table configuration'.

When terms to match are not names but phone numbers, email addresses, codes, etc., phonetic matching is likely not the best strategy. In such cases, using a distance matching strategy is generally preferable.
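The idea behind phonetic matching can be sketched with a small phonetic encoder. The example below uses American Soundex only because it is compact. It is not the add-on's implementation, and the function name `soundex` is a hypothetical helper chosen for illustration. Names that sound alike receive the same code, while a value without letters (such as a phone number) produces no usable code.

```python
# Illustrative sketch only: American Soundex, not the add-on's own algorithm.
_CODES = {}
for _letters, _digit in (
    ("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
    ("l", "4"), ("mn", "5"), ("r", "6"),
):
    for _ch in _letters:
        _CODES[_ch] = _digit


def soundex(name: str) -> str:
    """Return the 4-character Soundex code of `name`, or "" if it has no letters."""
    letters = [c for c in name.lower() if c.isascii() and c.isalpha()]
    if not letters:
        return ""
    first = letters[0]
    code = first.upper()
    prev = _CODES.get(first, "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters that share a code
        digit = _CODES.get(ch, "")
        if digit and digit != prev:
            code += digit
        prev = digit  # a vowel resets prev, so a repeated code is kept
    return (code + "000")[:4]


print(soundex("Robert"), soundex("Rupert"))  # both names share one code
print(repr(soundex("555-1234")))             # no letters, so no code
```

In this sketch, two names would be treated as phonetically close when their codes are equal. A real phonetic strategy produces a graded score rather than a yes/no answer.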

Distance matching

Distance matching computes the distance, or number of differences, between two terms. It may not be suitable when the terms to compare are long (that is, more than 30 characters) or differ in length.

Distance matching is a highly efficient method of comparing terms such as phone numbers, email addresses and business codes. Moreover, since distance matching is not language-specific, it can be more efficient for multilingual terms.

For example, distance matching provides correct outcomes for the following case:

For the same example, phonetic matching fails to find a match.
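The "number of differences" between two terms can be illustrated with the Levenshtein distance, which counts the single-character insertions, deletions and substitutions needed to turn one term into the other. The following is a minimal sketch of the standard dynamic-programming formulation. The function name `levenshtein` and the sample values are assumptions made for illustration, and the code is not taken from the add-on.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the Levenshtein edit distance between a and b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


# Hypothetical business codes that differ by two swapped digits.
print(levenshtein("AB-1234", "AB-1243"))
```

Because the computation looks only at characters, it behaves the same way regardless of the language of the terms.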

Using a double matching strategy

Deciding which matching strategy to apply depends on the data being compared. It may be helpful to test different strategies before launching the matching process over the scope of the entire database. Even with the most suitable matching strategy, 'false negative' records may occur.

A 'false negative' record is a record that the matching procedure should have identified as a suspect record, but did not. This situation is problematic because such records are marked as golden even though potential suspect records still exist in the database. To address this issue, the EBX® Match and Merge Add-on can be configured to apply two levels of matching using different strategies.

The second level matching strategy is optional.
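The effect of a second matching level on false negatives can be sketched as follows. This is a hedged illustration, not the add-on's logic: it assumes that the second level is consulted only when the first level does not flag the pair, and that both levels share one threshold. The names `is_suspect`, `exact_score` and `char_similarity` are hypothetical, and the two scorers are simple stand-ins for real strategies.

```python
from difflib import SequenceMatcher
from typing import Callable, Optional

Scorer = Callable[[str, str], float]  # returns a similarity percentage (0-100)


def exact_score(a: str, b: str) -> float:
    """Stand-in first-level scorer: 100 for a case-insensitive equality, else 0."""
    return 100.0 if a.casefold() == b.casefold() else 0.0


def char_similarity(a: str, b: str) -> float:
    """Stand-in second-level scorer based on matching character blocks."""
    return 100.0 * SequenceMatcher(None, a.casefold(), b.casefold()).ratio()


def is_suspect(a: str, b: str, first_level: Scorer,
               second_level: Optional[Scorer] = None,
               threshold: float = 80.0) -> bool:
    """Flag the pair if either matching level reaches the threshold."""
    if first_level(a, b) >= threshold:
        return True
    if second_level is None:  # the second level is optional
        return False
    return second_level(a, b) >= threshold


# With one level the pair is missed (a false negative); with two it is caught.
print(is_suspect("Jon Smith", "John Smith", exact_score))
print(is_suspect("Jon Smith", "John Smith", exact_score, char_similarity))
```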

Matching algorithms by strategy

The table below highlights the most popular matching algorithms used by the phonetic and distance matching strategies.

Matching algorithm | Phonetic | Distance | Use context
------------------ | -------- | -------- | -----------
NYSIIS | X | | Better for European and Hispanic names
Double Metaphone | X | | More generic than Soundex and NYSIIS
Levenshtein, Jaro-Winkler, Fuzzy Full text | | X | For short strings; not reliant on language; best applied to passwords, emails, business codes, phone numbers, postal codes, etc.

Table 55: Matching algorithms by matching strategy

Implementing a custom matching algorithm

Besides the predefined matching algorithms, you can also create your own matching algorithm. The following section describes, step by step, how to implement a custom matching algorithm.

[Figure: new_algorithm_record.png]

Once you finish, this custom algorithm will be displayed in the list of available algorithms.

[Figure: custom_algorithm_config.png]

Using matching algorithms with the add-on

The EBX® Match and Merge Add-on manages matching scores formatted as similarity percentages. The highest similarity is 100% (equality).

[Figure: decision process that uses the similarity percentage to determine whether a record is a suspect]

Depending on a matching process policy's configuration, the similarity percentage is used by the add-on to decide whether the record is a suspect or not. The figure above highlights this decision process.

The minimum and maximum percentages are, respectively, the 'stewardship min score' and 'stewardship max score' properties in the 'Process policy' table.
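As a rough sketch of how a similarity percentage relates to these two properties, the function below places a score in one of three bands. The band names, the function name `score_band` and the treatment of the boundaries (a score equal to either limit falls in the middle band) are assumptions made for illustration. The action the add-on takes for each band is defined by the process policy and is not reproduced here.

```python
def score_band(similarity: float, stewardship_min: float,
               stewardship_max: float) -> str:
    """Classify a similarity percentage against the two stewardship scores."""
    if not 0 <= stewardship_min <= stewardship_max <= 100:
        raise ValueError("expected 0 <= min score <= max score <= 100")
    if not 0 <= similarity <= 100:
        raise ValueError("similarity must be a percentage between 0 and 100")
    if similarity < stewardship_min:
        return "below_min"
    if similarity > stewardship_max:
        return "above_max"
    return "between_min_and_max"  # boundary values land here (assumption)


# Hypothetical policy values: min score 60, max score 90.
for score in (50, 75, 95):
    print(score, score_band(score, 60, 90))
```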

The matching algorithms integrated into the add-on translate their scores into a similarity percentage. When configuring new matching algorithms, the score must likewise always be translated into a similarity percentage.
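For a distance-based algorithm, one common way to perform this translation is to normalize the distance by the length of the longer term. The sketch below assumes that convention. It is one possible mapping, not necessarily the one used by the built-in algorithms, and the function name `distance_to_similarity` is hypothetical.

```python
def distance_to_similarity(distance: int, len_a: int, len_b: int) -> float:
    """Map an edit distance to a similarity percentage between 0 and 100."""
    longest = max(len_a, len_b)
    if longest == 0:
        return 100.0  # two empty terms are considered equal
    # Clamp so that the result always stays within 0-100.
    distance = min(max(distance, 0), longest)
    return 100.0 * (1 - distance / longest)


# A distance of 0 means equality, which maps to the highest similarity.
print(distance_to_similarity(0, 5, 5))
# 3 differences between terms of length 6 and 7.
print(round(distance_to_similarity(3, 6, 7), 1))
```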