Cloud Software Group, Inc. EBX®
Match and Merge Documentation > Administrator Guide

Matching algorithms

Included algorithms

The add-on includes several predefined algorithms to use in matching operations. Each algorithm has advantages and disadvantages depending on the type of data being searched. The following table provides descriptions for the algorithms included with the add-on and any input parameters they might have:

Algorithm

Description

Hybrid fuzzy

This algorithm is designed to match two strings by evaluating their similarity using seven different criteria. Each criterion is assigned a weight value between 0.0 and 1.0. A criterion is also assigned a priority order that is applied during the comparison process.

When comparing two strings, the algorithm will go through all seven criteria in the priority order. If the whole string or parts of it meet the condition of a particular criterion, the weight assigned to that criterion will be used in the final formula. If conditions are not met, that criterion is skipped, and the algorithm moves to the next one.

Once all the criteria have been evaluated, the final similarity score is calculated as the average weight of the remaining criteria, that is, the criteria whose conditions were met. If the resulting score is greater than the threshold value defined by an administrator in the decision tree's comparison node, the two strings are considered a match.

Points of note about this algorithm:

  • Does not return a 100% matching score on synonyms.

  • Strings that are similar but contain characters such as spaces or dashes do not return a 0% matching score.

  • Strings that fall within the acceptable distance are not treated as identical. For example, a distance of 1 returns 0.95, a distance of 2 returns 0.9, and so on.
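The scoring flow described above can be sketched as follows. This is only an illustration of averaging the weights of the criteria that are met. The seven criteria the add-on actually uses are not listed in this guide, so the three criteria, their weights, and the names `Criterion`, `hybrid_score`, and `is_match` are hypothetical.

```python
from typing import Callable, List, NamedTuple


class Criterion(NamedTuple):
    name: str
    priority: int                      # lower value is evaluated first
    weight: float                      # between 0.0 and 1.0
    is_met: Callable[[str, str], bool]


def _strip_separators(value: str) -> str:
    # Hypothetical normalization: ignore case, spaces, and dashes.
    return value.casefold().replace(" ", "").replace("-", "")


# Hypothetical criteria, NOT the add-on's real list.
CRITERIA: List[Criterion] = [
    Criterion("identical", 1, 1.0, lambda a, b: a == b),
    Criterion("same ignoring case", 2, 0.9,
              lambda a, b: a.casefold() == b.casefold()),
    Criterion("same ignoring separators", 3, 0.8,
              lambda a, b: _strip_separators(a) == _strip_separators(b)),
]


def hybrid_score(a: str, b: str, criteria: List[Criterion] = CRITERIA) -> float:
    """Average the weights of the criteria that are met, in priority order."""
    weights = [c.weight
               for c in sorted(criteria, key=lambda c: c.priority)
               if c.is_met(a, b)]      # unmet criteria are skipped
    return sum(weights) / len(weights) if weights else 0.0


def is_match(score: float, threshold: float) -> bool:
    """A pair matches when the score is greater than the node threshold."""
    return score > threshold
```

With these assumed criteria, `hybrid_score("New-York", "new york")` meets only the third criterion, so the score equals that criterion's weight.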

Phonetic full text

An algorithm best used for strings. It can recognize the phonetic equivalence of two words that are spelled differently by using Beider-Morse phonetic tokens. The comparator takes into account the synonyms and stop words defined in the data model.

Beider-morse

A phonetic algorithm for short strings (e.g. proper names) that can recognize the phonetic equivalence of two words that are spelled differently. The "sounds-alike" test is based not only on spelling but also on the linguistic properties of various languages.

Full text

An algorithm best used for strings. It can find the case-insensitive exact matches of the words in the compared values. The comparator takes into account the synonyms and stop words defined in the data model.

Note

Values that contain only stop words are not supported. Additionally, this algorithm is only compatible with String and text data types.

Fuzzy full text

An algorithm best used for strings. It can find the similar and fuzzy matches of the words in the compared values and is based on the Levenshtein distance algorithm. The comparator takes into account the synonyms and stop words defined in the data model.

Note

Values that contain only stop words are not supported. Additionally, this algorithm is only compatible with String and text data types.

Exact

An algorithm to match data that should be exact (e.g. code). This algorithm returns a matching score of 100% for values that are exactly the same. Matching using the Exact algorithm is case-sensitive by default for fields with String and Text data types. You can set it to case-insensitive when configuring data comparison nodes in the decision tree.
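As a minimal sketch of the behavior described above, the comparison below returns 100 for identical values and supports an optional case-insensitive mode. The function name `exact_score`, the `case_sensitive` flag, and the score of 0 for non-identical values are assumptions made for illustration, not the add-on's API.

```python
def exact_score(a: str, b: str, case_sensitive: bool = True) -> float:
    """Return 100.0 when the two values are identical, otherwise 0.0 (assumed)."""
    if not case_sensitive:
        # casefold() is a more aggressive lower() intended for caseless matching
        a, b = a.casefold(), b.casefold()
    return 100.0 if a == b else 0.0
```

For a code field, `exact_score("FR-001", "fr-001")` does not match under the default case-sensitive setting, but matches when `case_sensitive=False`.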

JaroWinkler

A distance algorithm for short strings (e.g. proper names) that tallies the number of characters in common and places a higher emphasis on differences at the start of the string.

Input parameters: The 'Threshold' parameter accepts values from 0.0 to 1.0 and determines when the Winkler bonus is added. Decreasing this parameter might increase the score. For example, with the keyword 'Fra' and the data 'France', a 'Threshold' of 0.7 gives a score of 88.33, whereas a 'Threshold' of 0.9 gives a score of 83.33. Default value: 0.7.
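The 'Fra' and 'France' figures can be reproduced with a textbook Jaro-Winkler implementation in which the prefix bonus is applied only when the plain Jaro score exceeds the threshold. The sketch below is not the add-on's code. The function names, the prefix scale of 0.1, and the maximum prefix length of 4 are the usual textbook values and are assumed here.

```python
def jaro(s1: str, s2: str) -> float:
    """Plain Jaro similarity between 0.0 and 1.0."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters match only if they are within this distance of each other.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    transpositions = out_of_order // 2
    return (matches / len1
            + matches / len2
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1: str, s2: str, threshold: float = 0.7,
                 prefix_scale: float = 0.1) -> float:
    """Jaro score plus the Winkler prefix bonus when the score exceeds threshold."""
    score = jaro(s1, s2)
    if score > threshold:
        prefix = 0
        for a, b in zip(s1[:4], s2[:4]):   # common prefix, at most 4 characters
            if a != b:
                break
            prefix += 1
        score += prefix * prefix_scale * (1.0 - score)
    return score


print(round(jaro_winkler("Fra", "France", threshold=0.7) * 100, 2))  # 88.33
print(round(jaro_winkler("Fra", "France", threshold=0.9) * 100, 2))  # 83.33
```

The plain Jaro score for this pair is about 0.8333. With a threshold of 0.7 the bonus for the three-character common prefix is added. With a threshold of 0.9 it is not, which is why lowering the threshold can raise the score.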

Levenshtein

A distance algorithm for short strings that works well when only a few differences between the values are expected. For example, it works well for dialects spoken in a particular part of a country or by a specific group of people.
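For reference, the Levenshtein distance counts the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into the other. The following is a standard dynamic-programming sketch. How the add-on converts this distance into a matching score is not described here, so only the raw distance is computed.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits that turn a into b."""
    if len(a) < len(b):
        a, b = b, a                       # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))    # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


print(levenshtein("colour", "color"))  # 1
```

Regional spelling variants such as "colour" and "color" differ by a single edit, which is the kind of small difference this algorithm handles well.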

Soundex

A phonetic algorithm for short strings (e.g. proper names) that indexes strings based on the way they sound in English rather than the way they are spelled. Homophones are encoded to the same representation so that they can be matched despite minor differences in spelling.
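The encoding step can be illustrated with the classic American Soundex rules: keep the first letter, map the remaining consonants to digits, collapse adjacent identical codes, and pad the result to four characters. This sketch shows the general technique and is not necessarily identical to the add-on's implementation.

```python
_CODES = {}
for _letters, _digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                         ("L", "4"), ("MN", "5"), ("R", "6")):
    for _ch in _letters:
        _CODES[_ch] = _digit


def soundex(word: str) -> str:
    """Return the four-character American Soundex code of word."""
    letters = [c for c in word.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    encoded = letters[0]
    previous = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue                 # H and W do not separate identical codes
        code = _CODES.get(ch, "")    # vowels and Y have no code
        if code and code != previous:
            encoded += code
        previous = code              # a vowel resets the run of identical codes
    return (encoded + "000")[:4]


print(soundex("Smith"), soundex("Smyth"))  # S530 S530
```

Because "Smith" and "Smyth" share the same code, a comparison on the encoded values treats them as a match even though the spellings differ.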

Range

An algorithm to match values within a predefined range. Two values are considered a match if the distance between them is within the range.

Note

This algorithm is only available for numeric and date/time fields.
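A range comparison of this kind can be sketched as an absolute-difference check. The function name `within_range` and the choice to treat the boundary as inclusive are assumptions. The same function works for numbers and, with a `timedelta` as the range, for dates.

```python
from datetime import date, timedelta


def within_range(a, b, max_distance) -> bool:
    """True when the absolute difference between a and b is at most max_distance."""
    return abs(a - b) <= max_distance


# Numeric field with a range of 5
print(within_range(100, 104, 5))                                            # True
# Date field with a range of 7 days
print(within_range(date(2024, 1, 1), date(2024, 1, 10), timedelta(days=7)))  # False
```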

Search strategies and recommended algorithms

Some matching algorithms are not compatible with certain search strategies and could return inaccurate matching results. Thus, when configuring a comparison node in a matching policy's decision tree, the list of available algorithms is filtered depending on a field's configured search strategy. This helps prevent an incompatible configuration. Additionally, the combination of a field's configured search strategy in the Data Model Assistant and its matching algorithm can impact performance. See Performance recommendations for more information.

The following table shows the compatibility between search strategies and algorithms:

Search strategy

Algorithms

Default search template (set in DMA)

Default full text search strategy or any other configured for the default template.

  • Exact

  • Levenshtein

  • JaroWinkler

  • Full text

  • Fuzzy full text

Not recommended but still available:

  • Beider-morse

  • Soundex

  • Phonetic full text

Non-default search templates

Tese Levenshtein

  • Exact

  • Levenshtein

  • JaroWinkler

  • Full text

  • Fuzzy full text

Tese NGram

  • Exact

  • Levenshtein

  • JaroWinkler

  • Full text

  • Fuzzy full text

Tese Soundex

  • Exact

  • Soundex

  • Beider-morse

  • Phonetic full text

Tese Double Metaphone

  • Exact

  • Soundex

  • Beider-morse

  • Phonetic full text

Jaro-winkler

  • Exact

  • Levenshtein

  • JaroWinkler

  • Full text

  • Fuzzy full text
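The compatibility table can also be read as a lookup from search strategy to algorithms, which is roughly how the filtering of available algorithms behaves. The sketch below only transcribes the table. The dictionary and function names are hypothetical and do not correspond to an add-on API.

```python
DISTANCE_BASED = ["Exact", "Levenshtein", "JaroWinkler", "Full text", "Fuzzy full text"]
PHONETIC = ["Exact", "Soundex", "Beider-morse", "Phonetic full text"]

# Search strategy -> recommended algorithms, transcribed from the table.
RECOMMENDED = {
    "Default search template": DISTANCE_BASED,
    "Tese Levenshtein": DISTANCE_BASED,
    "Tese NGram": DISTANCE_BASED,
    "Tese Soundex": PHONETIC,
    "Tese Double Metaphone": PHONETIC,
    "Jaro-winkler": DISTANCE_BASED,
}

# Algorithms that remain available but are not recommended.
NOT_RECOMMENDED = {
    "Default search template": ["Beider-morse", "Soundex", "Phonetic full text"],
}


def available_algorithms(strategy: str, include_not_recommended: bool = False) -> list:
    """Return the algorithms listed for a search strategy."""
    if strategy not in RECOMMENDED:
        raise KeyError(f"Unknown search strategy: {strategy}")
    result = list(RECOMMENDED[strategy])
    if include_not_recommended:
        result += NOT_RECOMMENDED.get(strategy, [])
    return result
```

For example, `available_algorithms("Tese Soundex")` lists only the exact and phonetic algorithms, while the default template can additionally expose the phonetic algorithms when `include_not_recommended=True`.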