Cloud Software Group, Inc. EBX®
Match and Merge Documentation > Administrator Guide

Performance recommendations

Overview

The EBX Match and Merge Add-on and EBX® core product work together to handle matching tasks. Due to this shared responsibility, the biggest performance gain comes from making optimizations in both the add-on and the core product. For more details on the roles of each during the matching process, see Matching processing overview.

The following topics describe optimizations for the core product and the add-on, and how to troubleshoot matching performance.

EBX® optimization

To ensure the best matching performance, follow the EBX® performance guidelines, especially for high volumes of data. In particular, allocate only half of the application server's available memory to the JVM, which leaves the other half free for file system caching. Additionally, the pre-processing phase is optimized to benefit from multiple CPUs, so CPU-oriented hardware can improve matching performance.
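As a rough illustration of the memory split described above, the following sketch computes a heap size equal to half of the server's memory and formats it as the standard JVM -Xmx option. The function names and the 64 GB figure are assumptions made for this example only; refer to the EBX® performance guidelines for actual sizing.

```python
def suggested_heap_mb(total_memory_mb: int) -> int:
    """Return half of the server's memory, in MB, as a candidate JVM heap size."""
    if total_memory_mb <= 0:
        raise ValueError("total_memory_mb must be positive")
    return total_memory_mb // 2


def jvm_heap_option(total_memory_mb: int) -> str:
    """Format the candidate heap size as a -Xmx JVM option."""
    return f"-Xmx{suggested_heap_mb(total_memory_mb)}m"


if __name__ == "__main__":
    # Assumed example: a server with 64 GB of memory.
    print(jvm_heap_option(64 * 1024))
```

The remaining half is deliberately left unallocated so that the operating system can use it for file system caching.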

Identifying indexing issues

The search strategies applied to fields involved in matching can have a large impact on performance. The following topics present some best practices that can help prevent performance issues:

Stop words

String values that contain common tokens greatly increase the number of irrelevant match candidates found during pre-processing, which slows the matching process. To alleviate this, you can define stop words, which EBX® ignores during pre-processing (that is, they are not indexed). Stop words are registered in a module, then defined in the search strategy of an xs:string field in the Search extension of the Data Model Assistant.

The best candidate terms for stop words are frequently found values. For example, 'street' and 'blvd' make good stop words because they exist in almost all records related to addresses.

Note

The result of a matching process is not affected by a set of well-chosen stop words. They are ignored by the core product during pre-processing, but are still taken into account when executing the add-on's decision tree logic.
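To see why stop words reduce the number of irrelevant candidates, consider the following toy model. It is a simplified sketch, not the EBX® indexing implementation: it assumes that two records become match candidates as soon as they share one indexed token. The function names and sample addresses are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def tokenize(value, stop_words=frozenset()):
    """Lowercase, split on whitespace, and drop stop words."""
    return {t for t in value.lower().split() if t not in stop_words}


def candidate_pairs(records, stop_words=frozenset()):
    """Return pairs of record ids that share at least one indexed token."""
    index = defaultdict(set)
    for record_id, value in records.items():
        for token in tokenize(value, stop_words):
            index[token].add(record_id)
    pairs = set()
    for ids in index.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


# Hypothetical address values.
records = {
    1: "12 Oak Street",
    2: "12 Oak St",
    3: "98 Pine Street",
    4: "7 Elm Street",
}

without_stop_words = candidate_pairs(records)
with_stop_words = candidate_pairs(records, stop_words=frozenset({"street", "st"}))

print(len(without_stop_words))  # 4 candidate pairs, three of them caused only by "street"
print(with_stop_words)          # {(1, 2)}
```

In this model, the only pair that survives is the one that shares distinctive tokens, which is also the only plausible duplicate.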

Prefer code search strategy

The Name and Text search strategies tokenize values to support full-text search. Nevertheless, some data are ill-suited for tokenization. This can lead to an abundance of common tokens, which slows the matching process. This especially holds true for xs:string fields holding identifiers, dates, phone numbers or emails. For these cases, the preferred method is to use the Code search strategy, which ensures that the whole value is indexed as a single token.
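The difference between tokenized and single-token indexing can be sketched as follows. This is an illustrative approximation only: the tokenizer below simply splits on non-alphanumeric characters, which is an assumption and not the actual behavior of the Name, Text, or Code strategies.

```python
import re


def text_tokens(value):
    """Split on non-alphanumeric characters, roughly like a full-text analyzer."""
    return [t for t in re.split(r"[^0-9a-z]+", value.lower()) if t]


def code_tokens(value):
    """Index the whole value as a single token."""
    return [value]


# Hypothetical email values.
emails = ["alice@example.com", "bob@example.com"]

shared_text = set(text_tokens(emails[0])) & set(text_tokens(emails[1]))
shared_code = set(code_tokens(emails[0])) & set(code_tokens(emails[1]))

print(sorted(shared_text))  # ['com', 'example']
print(shared_code)          # set()
```

With tokenization, two unrelated email addresses share the tokens "example" and "com". Indexed as whole values, they share nothing.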

Avoid empty or null value placeholders

When a value is null or empty, do not use a placeholder such as 'Not available' or 'N/A', or a default value such as '1' or 'xxx'. These placeholders are common tokens and slow down matching. Leave the value empty instead.
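If source data already contains such placeholders, one option is to clear them before the data is loaded. The following is a minimal sketch of such a cleanup step, assuming it runs outside EBX® on raw string values. The placeholder list and function name are hypothetical and should be adapted to your data.

```python
# Hypothetical placeholder list, based on the examples above.
# A default such as '1' is left out here because it can also be a legitimate value.
PLACEHOLDERS = {"not available", "n/a", "xxx"}


def normalize(value):
    """Return None for empty or placeholder values, otherwise the stripped value."""
    if value is None:
        return None
    stripped = value.strip()
    if not stripped or stripped.lower() in PLACEHOLDERS:
        return None
    return stripped


print(normalize("N/A"))     # None
print(normalize(" Acme "))  # Acme
```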

Use fuzzy and phonetic search strategies sparingly

The fuzzy and phonetic search strategies are slower than the default, built-in strategies. In the context of matching, this impacts:

If the conditions on other comparison nodes are met, then a record is considered ‘similar’ during pre-processing, even without applying a fuzzy strategy to a field. Hence, we recommend keeping the default search strategy when the decision tree uses fuzzy or phonetic algorithms and contains at least four fields.

However, if your matching policy still requires a fuzzy or phonetic search strategy, the following information can help to mitigate the performance impact:

Child dataspaces

If you have multiple child dataspaces, do not run matching operations on them simultaneously. Doing so can lead to poor performance and inaccurate results.

EBX Match and Merge Add-on optimization suggestions

The following matching policy recommendations can help improve performance:

Troubleshooting

Activating the logs can help you investigate matching process performance. To activate the logs of the pre-processing phase, open the ebx.properties file and set the property ebx.log4j.category.log.kernel to DEBUG. With this setting, the log displays precise information about matching progress: number of records, average throughput, and percentage of completion. This is useful for estimating the duration of the matching phase. It can also help you determine whether pre-processing is slower than it should be, considering the volume of data.
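The progress figures in the log can be turned into a rough duration estimate. The sketch below assumes you read the record count, percentage of completion, and throughput from the log yourself. It does not parse the log, whose exact format is not described here, and the function name and sample figures are hypothetical.

```python
def estimate_remaining_seconds(total_records, percent_complete, records_per_second):
    """Estimate the remaining time from progress figures read in the log."""
    if not 0 <= percent_complete <= 100:
        raise ValueError("percent_complete must be between 0 and 100")
    if records_per_second <= 0:
        raise ValueError("records_per_second must be positive")
    remaining_records = total_records * (100 - percent_complete) / 100
    return remaining_records / records_per_second


# Assumed figures: 1,000,000 records, 40% complete, 500 records per second.
print(estimate_remaining_seconds(1_000_000, 40, 500))  # 1200.0
```

Because the throughput reported is an average, the estimate is only indicative and is more reliable once a significant share of the records has been processed.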