Cloud Software Group, Inc. EBX®
Match and Merge Documentation > Administrator Guide

Performance recommendations

Overview

The EBX Match and Merge Add-on and the EBX® core product work together to handle matching tasks. Because of this shared responsibility, the largest performance gains come from optimizing both the add-on and the core product. For more details on the role each plays during the matching process, see Understanding matching operation processing.

The following topics describe:

EBX® optimization

To ensure the best performance when matching, follow the EBX® performance guidelines, especially for high volumes of data. In particular, allocate only half of the application server's available memory to the JVM, which leaves the other half free for file system caching. Additionally, the pre-processing phase is optimized to benefit from multiple CPUs, so CPU-oriented hardware can improve matching performance.
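As a rough illustration of the memory guideline above, the following sketch computes a heap size equal to half of a server's memory and formats it as the standard JVM options -Xms and -Xmx. The function names are hypothetical, and setting the initial heap equal to the maximum heap is an assumption made here for simplicity, not a recommendation taken from this guide.

```python
def recommended_heap_mb(total_ram_mb: int) -> int:
    """Return half of the server's memory (in MB) for the JVM heap."""
    if total_ram_mb <= 0:
        raise ValueError("total_ram_mb must be positive")
    return total_ram_mb // 2


def jvm_heap_flags(total_ram_mb: int) -> str:
    """Format the heap size as JVM options.

    Assumption: initial heap (-Xms) is set equal to maximum heap (-Xmx).
    """
    heap = recommended_heap_mb(total_ram_mb)
    return f"-Xms{heap}m -Xmx{heap}m"


if __name__ == "__main__":
    # A server with 64 GB of memory leaves 32 GB for file system caching.
    print(jvm_heap_flags(64 * 1024))
```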

Identifying indexing issues

The search strategies applied to fields involved in matching can have a large impact on performance. The following topics present some best practices that can help prevent performance issues:

Stop words

String values that contain common tokens greatly increase the number of irrelevant match candidates found during pre-processing, which slows the matching process. To alleviate this, you can define stop words, which EBX® ignores during pre-processing (that is, they are not indexed). Stop words are registered, then defined in the search strategy of an xs:string field in the Search extension of the Data Model Assistant.

The best candidate terms for stop words are frequently found values. For example, 'street' and 'blvd' make good stop words because they exist in almost all records related to addresses. The Troubleshooting section describes how to get the most frequent values for a field configured to use matching.
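If the field values are available outside EBX®, for example in an export, a quick frequency count can also suggest candidates. The sketch below is a standalone approximation: the regular-expression tokenizer is a simplified stand-in for the product's own tokenization, and the function names and the min_share threshold are hypothetical.

```python
import re
from collections import Counter


def token_frequencies(values):
    """Count, for each lowercase token, how many values contain it."""
    counts = Counter()
    for value in values:
        if not value:
            continue
        # Use a set so a token repeated inside one value is counted once.
        counts.update(set(re.findall(r"\w+", value.lower())))
    return counts


def stop_word_candidates(values, min_share=0.5):
    """Return tokens present in at least min_share of the non-empty values."""
    non_empty = [v for v in values if v]
    counts = token_frequencies(non_empty)
    threshold = min_share * len(non_empty)
    return sorted(t for t, n in counts.items() if n >= threshold)


if __name__ == "__main__":
    addresses = ["12 Main Street", "98 Oak Street", "5 Sunset Blvd", "77 Elm Street"]
    print(stop_word_candidates(addresses))
```

Candidates produced this way still need a manual review, since a frequent token can carry meaning for a given field.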

Improper selection of stop words can negatively impact performance, particularly when too few values are left in a record to establish similarity. If a value is composed solely of stop words, those words carry significant semantic meaning and should not be treated as stop words, despite their frequency.
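One way to check a candidate stop word list against this rule is to look for values that would have nothing left once the stop words are removed. The following is a minimal sketch under the same assumptions as before: a simplified tokenizer and hypothetical function names, run against values held outside EBX®.

```python
import re


def only_stop_words(value, stop_words):
    """True if the value has tokens and every token is a stop word."""
    tokens = re.findall(r"\w+", value.lower())
    return bool(tokens) and all(t in stop_words for t in tokens)


def values_lost_to_stop_words(values, stop_words):
    """Return the values that consist solely of stop words."""
    return [v for v in values if v and only_stop_words(v, stop_words)]


if __name__ == "__main__":
    names = ["The Company", "Acme Company", ""]
    print(values_lost_to_stop_words(names, {"the", "company"}))
```

Any value returned by this check is a sign that at least one of its words should be dropped from the stop word list.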

Note

The result of a matching process is not affected by a set of well-chosen stop words. They are ignored by the core product during pre-processing, but are still taken into account when executing the add-on's decision tree logic.

Prefer code search strategy

The Name and Text search strategies tokenize values to support full-text search. However, some data is ill-suited to tokenization, which can produce an abundance of common tokens and slow the matching process. This is especially true for xs:string fields holding identifiers, dates, phone numbers, or emails. For these cases, the preferred method is to use the Code search strategy, which ensures that the whole value is indexed as a single token.
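To see why tokenization hurts for such fields, consider email addresses. The sketch below uses a simplified regular-expression tokenizer as a stand-in for a full-text strategy (it is not the product's actual analyzer) and compares it with keeping the whole value as one token. All function names are hypothetical.

```python
import re


def tokenize(value):
    """Simplified full-text tokenization: split on non-alphanumeric characters."""
    return re.findall(r"[a-z0-9]+", value.lower())


def as_single_token(value):
    """Keep the whole value as one token, as a code-like strategy would."""
    return [value]


def shared_tokens(a, b, tokenizer):
    """Return the tokens that two values have in common."""
    return sorted(set(tokenizer(a)) & set(tokenizer(b)))


if __name__ == "__main__":
    first, second = "jane.doe@example.com", "john.roe@example.com"
    print(shared_tokens(first, second, tokenize))
    print(shared_tokens(first, second, as_single_token))
```

With tokenization, two unrelated addresses on the same domain share tokens and can surface as candidates for each other. With a single token per value, they share nothing.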

Note

By default, the Code search strategy uses the contains operator when searching. This is expensive in terms of performance on large volumes. Instead, update the strategy to use starts with when matching mid-sized to large volumes.

Avoid empty or null value placeholders

When a value is null or empty, do not use a placeholder such as 'Not available' or 'N/A', or a default value such as '1' or 'xxx'. Such placeholders become common tokens and slow down matching. Leave the value empty instead.
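If placeholders are already present in the source data, they can be cleared before the data is loaded. The following is a minimal sketch of such a cleanup step. The placeholder set and the function name are hypothetical, and the set should be reviewed for each field: a value such as '1' may be a placeholder in one field and legitimate data in another.

```python
# Hypothetical placeholder list; adjust it per field before use.
DEFAULT_PLACEHOLDERS = frozenset({"not available", "n/a", "xxx"})


def clean_placeholder(value, placeholders=DEFAULT_PLACEHOLDERS):
    """Return None for blank values and known placeholders, else the value."""
    if value is None or not value.strip():
        return None
    if value.strip().lower() in placeholders:
        return None
    return value


if __name__ == "__main__":
    raw = ["Paris", " N/A ", "", "xxx", "Not available"]
    print([clean_placeholder(v) for v in raw])
```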

Use fuzzy and phonetic search strategies sparingly

The fuzzy and phonetic search strategies are slower than the default, built-in strategies. In the context of matching, this impacts:

If the conditions on other comparison nodes are met, then a record is considered ‘similar’ during pre-processing, even without applying a fuzzy strategy to a field. Hence, we recommend keeping the default search strategy when using fuzzy or phonetic algorithms in the decision tree, and when the decision tree contains at least 4 fields.

However, if your matching policy still requires a fuzzy or phonetic search strategy, the following information can help to mitigate the performance impact:

Child dataspaces

If you have multiple child dataspaces, do not run matching operations on them simultaneously. Doing so can lead to poor performance and inaccurate results.

Transaction management

When making successive modifications, batch them into a single transaction. Batching minimizes the overhead of running many separate transactional operations, and it is particularly beneficial for performance and resource utilization in environments where numerous modifications occur in succession.

EBX Match and Merge Add-on optimization suggestions

The following matching policy recommendations can help improve performance:

Troubleshooting

Activating the logs can help you investigate matching process performance. The kernel log contains key indicators for each field configured to use matching and offers valuable clues for applying performance recommendations. For example:

The log also provides precise information about matching progress: number of records, average throughput, percentage of completion. This is useful to estimate the duration of the matching phase. It can also help you determine whether pre-processing is slower than it should be, considering the volume of data.
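Given the figures reported in the log, the remaining duration can be approximated with simple arithmetic. The sketch below assumes the throughput is expressed in records per second and stays roughly constant. The function name and parameters are hypothetical, and no log parsing is attempted.

```python
def estimate_remaining_seconds(total_records, processed_records, records_per_second):
    """Estimate the time left, assuming a constant average throughput."""
    if records_per_second <= 0:
        raise ValueError("records_per_second must be positive")
    remaining = max(total_records - processed_records, 0)
    return remaining / records_per_second


if __name__ == "__main__":
    # 1,000,000 records, 25% complete, averaging 500 records per second.
    seconds = estimate_remaining_seconds(1_000_000, 250_000, 500.0)
    print(f"about {seconds / 60:.0f} minutes remaining")
```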