The EBX Match and Merge Add-on and EBX® core product work together to handle matching tasks. Due to this shared responsibility, the biggest performance gain comes from making optimizations in both the add-on and the core product. For more details on the roles of each during the matching process, see Matching processing overview.
The following topics describe how to optimize matching performance.
To ensure the best performance when matching, it is important to follow EBX® performance guidelines, especially for high volumes of data. In particular, allocate only half of the application server's available memory to the JVM; this leaves the other half free for file system caching. Additionally, the pre-processing phase is optimized to benefit from multiple CPUs, so CPU-oriented hardware can improve matching performance.
The search strategies applied to fields involved in matching can have a large impact on performance. The following topics present some best practices that can help prevent performance issues:
String values that contain common tokens greatly increase the number of irrelevant match candidates found during pre-processing and slow the matching process. To alleviate this, you can define stop words, which EBX® ignores during pre-processing (i.e., they are not indexed). Stop words are registered in a module, then defined in the search strategy of an xs:string field in the Search extension of the Data Model Assistant.
The best candidate terms for stop words are frequently found values. For example, 'street' and 'blvd' make good stop words because they exist in almost all records related to addresses.
The result of a matching process is not affected by a set of well-chosen stop words. They are ignored by the core product during pre-processing, but are still taken into account when executing the add-on's decision tree logic.
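The effect described above can be illustrated with a minimal sketch. The following Python code is not the EBX® API; it is a hypothetical inverted index showing why excluding a common token such as 'street' from indexing shrinks the set of irrelevant match candidates.

```python
# Illustrative sketch (not the EBX API): excluding stop words from the
# token index reduces irrelevant match candidates during pre-processing.
from collections import defaultdict

STOP_WORDS = {"street", "blvd"}

def index_records(records, stop_words=frozenset()):
    """Build an inverted index: token -> set of record ids."""
    index = defaultdict(set)
    for rec_id, text in records.items():
        for token in text.lower().split():
            if token not in stop_words:
                index[token].add(rec_id)
    return index

def candidates(index, query):
    """Records sharing at least one indexed token with the query."""
    found = set()
    for token in query.lower().split():
        found |= index.get(token, set())
    return found

records = {
    1: "12 Main street",
    2: "98 Oak street",
    3: "7 Elm blvd",
}

# Without stop words, every address containing 'street' is a candidate.
plain = candidates(index_records(records), "45 Pine street")
# With stop words, only records sharing a meaningful token remain.
filtered = candidates(index_records(records, STOP_WORDS), "45 Pine street")
print(len(plain), len(filtered))  # 2 0
```

As the example shows, the stop word 'street' alone turns two unrelated addresses into candidates; on a large table this multiplies the work the add-on must perform.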
The Name and Text search strategies tokenize values to support full-text search. However, some data is ill-suited to tokenization, which can produce an abundance of common tokens and slow the matching process. This is especially true for xs:string fields holding identifiers, dates, phone numbers, or emails. In these cases, the preferred method is to use the Code search strategy, which ensures that the whole value is indexed as a single token.
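To see why identifiers benefit from single-token indexing, consider a short Python sketch. This is not the EBX® tokenizer; it is an illustrative comparison of splitting a value into tokens versus indexing it whole, as the Code search strategy does.

```python
# Illustrative sketch (not the EBX tokenizer): tokenizing an identifier
# such as an email produces common fragments ('example', 'com'), while
# indexing the whole value as one token keeps the index selective.
import re

def tokenize(value):
    """Split on non-alphanumeric characters, as a full-text strategy might."""
    return [t for t in re.split(r"\W+", value.lower()) if t]

def as_code(value):
    """Index the whole value as a single token (Code-strategy behavior)."""
    return [value.lower()]

email = "john.smith@example.com"
print(tokenize(email))  # ['john', 'smith', 'example', 'com']
print(as_code(email))   # ['john.smith@example.com']
```

Tokens such as 'example' and 'com' would appear in nearly every email in a table, making almost every record a match candidate; the single token produced by the Code strategy matches only identical values.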
When a value is null or empty, do not use a placeholder such as 'Not available' or 'N/A', or a default value such as '1' or 'xxx'. These placeholders are common tokens and slow down matching. It is recommended to leave the value empty instead.
The fuzzy and phonetic search strategies are slower than the default, built-in strategies. In the context of matching, this impacts:
Core product performance when gathering similar records during pre-processing. The cost is mainly due to the increased number of results a fuzzy strategy returns, even when it is applied to only one decision tree node.
Add-on performance, because fuzziness can add many false positives to the results. The increased number of records passed to the add-on can slow the process. Since these additional records might not meet the conditions of other comparison nodes, the add-on must process them with no guarantee of finding matches.
If the conditions on the other comparison nodes are met, a record is considered 'similar' during pre-processing even without a fuzzy strategy applied to a field. Hence, we recommend keeping the default search strategy when using fuzzy or phonetic algorithms in the decision tree, and when the decision tree contains at least 4 fields.
However, if your matching policy still requires a fuzzy or phonetic search strategy, the following information can help to mitigate the performance impact:
The Jaro-Winkler search strategy is the most expensive.
The Levenshtein search strategy accepts parameters to tune its behavior, which impact performance. First, Levenshtein is an edit-distance-based algorithm: if distance 2 is too expensive, try decreasing the distance to 1. Second, a parameter specifies whether values are tokenized. Without tokenization, the algorithm behaves like a fuzzy 'Code' strategy and is less expensive. With tokenization, it behaves like a fuzzy 'Text' strategy: the fuzziness is applied to every token, which greatly increases the number of results returned and is more expensive. The parameters are configured in the 'Search' extension of the Data Model Assistant.
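To make the distance parameter concrete, here is a minimal sketch, assuming nothing about the EBX® implementation: a classic dynamic-programming Levenshtein distance applied to a small set of hypothetical name values, showing how raising the maximum distance from 1 to 2 widens the candidate set.

```python
# Illustrative sketch (not the EBX implementation): a larger maximum edit
# distance accepts more candidate values as "similar", at higher cost.

def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

# Hypothetical field values for illustration.
values = ["smith", "smyth", "smithe", "smiths", "smitty", "schmidt", "jones"]

matches_d1 = [v for v in values if levenshtein("smith", v) <= 1]
matches_d2 = [v for v in values if levenshtein("smith", v) <= 2]
print(matches_d1)  # ['smith', 'smyth', 'smithe', 'smiths']
print(matches_d2)  # ['smith', 'smyth', 'smithe', 'smiths', 'smitty']
```

Distance 2 pulls in an extra candidate ('smitty') that distance 1 rejects; on a large table, each additional candidate is another record the add-on must compare with no guarantee of a match.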
The phonetic parameter of ‘Name’ and ‘Text’ search strategies uses the Beider-Morse strategy. This is the least expensive.
If you have multiple child dataspaces, be sure that you do not run matching operations on these simultaneously. It can lead to poor performance and inaccurate results.
The following matching policy recommendations can help improve performance:
Decision tree: Before running matching on an entire table with a large volume of records, test the matching policy on a subset (fewer than 50,000 records). Then:
Check that the number of matches and suspects is consistent with your dataset. If it is not consistent, start the troubleshooting process with the decision tree. Begin by reviewing the nodes leading to the Match or Suspect output.
When a grouping or merging phase is slower than usual, it is likely the symptom of other configuration issues.
Weight: When a decision tree has a Match output that is reached after 1 or 2 comparison nodes, it means that the fields involved in these nodes are more important than the others. To improve performance and result quality, set the weight parameter higher for these fields to help identify them during pre-processing.
To improve matching performance, set the weight to 0 for fields that have a limited number of possible values, such as booleans or enumerations.
Exclude records: By defining a set of criteria that excludes certain records from the matching operation, you can eliminate a subset of records that the add-on must process. This can lead to performance improvements. See Excluding records for more details.
Activating the logs can help you investigate matching process performance. To activate the logs of the pre-processing phase, open the ebx.properties file and set the property ebx.log4j.category.log.kernel to DEBUG. With this setting, the log displays detailed information about matching progress: the number of records, the average throughput, and the percentage of completion. This is useful for estimating the duration of the matching phase, and for determining whether pre-processing is slower than expected given the volume of data.
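The property name and value come from the text above; as a configuration fragment, the setting looks like this (surrounding properties in your ebx.properties file may vary by installation):

```properties
# Enable DEBUG logging for the kernel category to trace the pre-processing
# phase: record counts, average throughput, and percentage of completion.
ebx.log4j.category.log.kernel=DEBUG
```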