The EBX Match and Merge Add-on and EBX® core product work together to handle matching tasks. Due to this shared responsibility, the biggest performance gain comes from making optimizations in both the add-on and the core product. For more details on the roles of each during the matching process, see Understanding matching operation processing.
To ensure the best performance when matching, follow the EBX® performance guidelines, especially for high volumes of data. In particular, allocate only half of the application server's available memory to the JVM, leaving the other half free for file system caching. Additionally, the pre-processing phase is optimized to benefit from multiple CPUs, so CPU-oriented hardware can improve matching performance.
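For example, on an application server with 32 GB of RAM, a JVM configuration following this guideline might look as follows (hypothetical sizing; adjust to your hardware):

```
# Give the JVM half of the available memory; the rest remains
# free for file system caching.
-Xms16g -Xmx16g
```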
The search strategies applied to fields involved in matching can have a large impact on performance. The following topics present some best practices that can help prevent performance issues:
String values that contain common tokens greatly increase the number of irrelevant match candidates found during pre-processing and slow the matching process. To alleviate this, you can define stop words, which EBX® ignores during pre-processing (i.e. they are not indexed). Stop words are registered, then defined in the search strategy of an xs:string field in the Search extension of the Data Model Assistant.
The best candidate terms for stop words are frequently found values. For example, 'street' and 'blvd' make good stop words because they exist in almost all records related to addresses. The Troubleshooting section describes how to get the most frequent values for a field configured to use matching.
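To illustrate the principle (this is a simplified Java sketch, not the EBX implementation), dropping a frequent token such as 'street' during indexing prevents unrelated addresses from sharing a token, and therefore from being pulled together as match candidates:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public class StopWordDemo {
    // Hypothetical stop-word list for address fields.
    static final Set<String> STOP_WORDS = Set.of("street", "blvd", "avenue");

    // Tokenize a value and drop stop words, keeping only discriminating tokens.
    static List<String> indexTokens(String value) {
        List<String> tokens = new ArrayList<>();
        for (String token : value.toLowerCase().split("\\W+")) {
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // With 'street' ignored, these addresses share no indexed token and are
        // no longer irrelevant match candidates for each other.
        System.out.println(indexTokens("12 Baker Street")); // [12, baker]
        System.out.println(indexTokens("34 Elm Street"));   // [34, elm]
    }
}
```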
Improper selection of stop words can negatively impact performance, particularly when too few values are left in a record to establish similarity. If a value is composed solely of stop words, those words carry significant semantic meaning and should not be considered stop words, despite their frequency.
The result of a matching process is not affected by a set of well-chosen stop words. They are ignored by the core product during pre-processing, but are still taken into account when executing the add-on's decision tree logic.
The Name and Text search strategies tokenize values to support full-text search. However, some data is ill-suited for tokenization, which can produce an abundance of common tokens and slow the matching process. This is especially true for xs:string fields holding identifiers, dates, phone numbers, or emails. In these cases, the preferred method is to use the Code search strategy, which ensures that the whole value is indexed as a single token.
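The following simplified Java sketch (not EBX internals) shows why tokenizing an identifier-like value, such as a date, produces several frequent, low-information tokens, whereas Code-style indexing keeps a single discriminating token:

```java
import java.util.Arrays;

public class TokenizationDemo {
    public static void main(String[] args) {
        String date = "2023-01-15";
        // Name/Text-style tokenization: three tokens shared by many records.
        System.out.println(Arrays.toString(date.split("\\W+"))); // [2023, 01, 15]
        // Code-style indexing: the whole value as one token.
        System.out.println(date); // 2023-01-15
    }
}
```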
By default, the Code search strategy uses the contains operator when searching, which is expensive on large volumes. Instead, update the strategy to use starts with when matching mid-sized to large volumes.
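The reasoning can be illustrated with a simplified Java model (not the product's actual index): a starts with query can jump directly to its candidates in a sorted index, while a contains query must inspect every entry:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class PrefixVsContains {
    public static void main(String[] args) {
        List<String> index = new ArrayList<>(
            List.of("FR-0001", "FR-0002", "US-0001", "US-0002"));
        Collections.sort(index);

        // 'starts with': binary-search for the first candidate, then read
        // forward while the prefix matches -- O(log n) to locate the range.
        int pos = Collections.binarySearch(index, "US-");
        int start = pos >= 0 ? pos : -pos - 1;
        while (start < index.size() && index.get(start).startsWith("US-")) {
            System.out.println("prefix hit: " + index.get(start++));
        }

        // 'contains': ordering does not help, so every entry is scanned -- O(n).
        for (String id : index) {
            if (id.contains("0001")) {
                System.out.println("contains hit: " + id);
            }
        }
    }
}
```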
When a value is null or empty, do not use a placeholder like 'Not available' or 'N/A', or default values like '1', 'xxx', etc. Such placeholders become common tokens and slow down matching. It is recommended to leave the value empty instead.
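If your source data already contains such placeholders, a small data-preparation step can normalize them to null before import. The helper below is hypothetical (the placeholder list and names are illustrative, not part of the add-on):

```java
import java.util.Set;

public class PlaceholderCleaner {
    // Hypothetical placeholder values commonly found in source data.
    private static final Set<String> PLACEHOLDERS =
        Set.of("not available", "n/a", "na", "xxx", "1");

    // Map placeholders and blanks to null so they are never indexed as tokens.
    public static String normalize(String value) {
        if (value == null) return null;
        String trimmed = value.trim();
        if (trimmed.isEmpty() || PLACEHOLDERS.contains(trimmed.toLowerCase())) {
            return null;
        }
        return trimmed;
    }
}
```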
The fuzzy and phonetic search strategies are slower than the default, built-in strategies. In the context of matching, this impacts:
Core product performance when gathering similar records during pre-processing. The cost is mainly due to the increased number of results a fuzzy strategy returns, even when it is applied to only one decision tree node.
Add-on performance, because fuzziness can add many false positives to the results. The increase in the number of records passed to the add-on can slow the process. Since these additional records might not meet the conditions of other comparison nodes, the add-on must process them with no guarantee of finding matches.
A record is considered 'similar' during pre-processing if the conditions on other comparison nodes are met, even without applying a fuzzy strategy to a given field. Hence, we recommend keeping the default search strategy when using fuzzy or phonetic algorithms in the decision tree, provided the decision tree contains at least 4 fields.
However, if your matching policy still requires a fuzzy or phonetic search strategy, the following information can help to mitigate the performance impact:
The Jaro-Winkler search strategy is the most expensive.
The Levenshtein search strategy accepts parameters that tune its behavior and impact performance. First, Levenshtein is an edit-distance-based algorithm; if distance 2 is too expensive, try decreasing the distance to 1. Second, a parameter specifies whether values are tokenized. Without tokenization, the algorithm behaves like a fuzzy Code strategy and is less expensive. With tokenization, it behaves like a fuzzy Text strategy: the fuzziness is applied to all tokens, which greatly increases the number of results returned and the cost. The parameters are configured in the Search extension of the Data Model Assistant (DMA). A sketch illustrating these relative costs follows this list.
The phonetic parameter of Name and Text search strategies uses the Beider-Morse phonetic algorithm. This is the least expensive.
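The following sketch illustrates these cost characteristics using Apache Commons Text and Apache Commons Codec as stand-ins (assumed to be on the classpath); these libraries are not the add-on's internal implementations:

```java
import org.apache.commons.codec.EncoderException;
import org.apache.commons.codec.language.bm.BeiderMorseEncoder;
import org.apache.commons.text.similarity.JaroWinklerSimilarity;
import org.apache.commons.text.similarity.LevenshteinDistance;

public class FuzzyCostDemo {
    public static void main(String[] args) throws EncoderException {
        // Jaro-Winkler: a full pairwise similarity score -- the most expensive.
        JaroWinklerSimilarity jw = new JaroWinklerSimilarity();
        System.out.println(jw.apply("martha", "marhta"));

        // Levenshtein with a bounded distance: the computation stops as soon
        // as the threshold is exceeded (apply() returns -1), which is why
        // distance 1 is cheaper than distance 2.
        System.out.println(new LevenshteinDistance(1).apply("jonhson", "johnson")); // -1
        System.out.println(new LevenshteinDistance(2).apply("jonhson", "johnson")); // 2

        // Beider-Morse: values are reduced to phonetic keys that can be
        // indexed, so lookups stay cheap -- the least expensive.
        BeiderMorseEncoder phonetic = new BeiderMorseEncoder();
        System.out.println(phonetic.encode("Schmidt"));
        System.out.println(phonetic.encode("Smith"));
    }
}
```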
If you have multiple child dataspaces, do not run matching operations on them simultaneously. Doing so can lead to poor performance and inaccurate results.
In scenarios with successive modifications, batch the modifications into a single transaction. Batching minimizes the overhead associated with multiple transactional operations, which is particularly beneficial for performance and resource utilization in environments where numerous modifications occur in succession.
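A minimal sketch of this pattern using the EBX Procedure API follows (class and method names as found in the EBX Java API; verify them against your product version). All modifications performed inside execute() are committed together as one transaction:

```java
import com.onwbp.adaptation.AdaptationHome;
import com.orchestranetworks.service.Procedure;
import com.orchestranetworks.service.ProcedureContext;
import com.orchestranetworks.service.ProgrammaticService;
import com.orchestranetworks.service.Session;

public class BatchUpdates {
    public static void runBatch(Session session, AdaptationHome dataspace) {
        Procedure batch = (ProcedureContext pContext) -> {
            // Perform all successive modifications here (creations, updates,
            // deletions). They share one transaction and one commit, instead
            // of paying transactional overhead per modification.
        };
        ProgrammaticService.createService(session, dataspace).execute(batch);
    }
}
```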
The following matching policy recommendations can help improve performance:
Decision tree: Before performing matching on an entire table with a large volume of records, test the matching policy on a subset of records (fewer than 50k). Then:
Check that the number of matches and suspects is consistent with your dataset. If it is not, start the troubleshooting process with the decision tree, beginning with the nodes leading to the Match or Suspect output.
When a grouping or merging phase is slower than usual, it is likely the symptom of other configuration issues.
Weight: When a decision tree reaches a Match output after only 1 or 2 comparison nodes, the fields involved in those nodes are more important than the others. To improve performance and result quality, set a higher weight parameter for these fields to help identify them during pre-processing.
The pre-processing phase estimates the similarity between records based on the values of their matching fields. Pre-processing is sensitive to frequent tokens and values, as they increase the number of irrelevant match candidates. To improve matching performance, set the weight to 0 for fields that have a limited number of options, such as booleans, enumerations, or flags. A field with a weight of 0 is not considered when estimating similarity during the pre-processing phase, but it is still taken into account when executing the decision tree comparisons.
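The following simplified Java sketch (not EBX internals; the field names and weights are hypothetical) models the effect of a weight of 0 on the pre-processing similarity estimate:

```java
import java.util.Map;

public class WeightedSimilarity {
    // Discriminating fields get a positive weight; low-cardinality fields get 0.
    static final Map<String, Integer> WEIGHTS =
        Map.of("name", 3, "email", 2, "isActive", 0, "countryCode", 0);

    // Estimate similarity between two records from their weighted field matches.
    static double estimateSimilarity(Map<String, String> a, Map<String, String> b) {
        double score = 0, total = 0;
        for (var entry : WEIGHTS.entrySet()) {
            int weight = entry.getValue();
            // Weight 0: skipped when estimating similarity; the decision tree
            // can still compare this field later.
            if (weight == 0) continue;
            total += weight;
            String va = a.get(entry.getKey());
            if (va != null && va.equalsIgnoreCase(b.get(entry.getKey()))) {
                score += weight;
            }
        }
        return total == 0 ? 0 : score / total;
    }
}
```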
Exclude records: By defining a set of criteria that excludes certain records from the matching operation, you can eliminate a subset of records that the add-on must process. This can lead to performance improvements. See Excluding records for more details.
Matching on related tables and business objects: Matching configurations that include relationships between multiple tables, or that are applied to business objects, introduce complexity that significantly impacts performance. For mid to large-sized datasets, it is advisable to denormalize the data model, enabling a streamlined single-table matching policy. This simplifies the comparison logic and reduces the overhead associated with relationship-based queries, enhancing the overall speed and efficiency of matching operations.
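As a minimal illustration of this denormalization (the record shapes are hypothetical), a person/address relationship can be flattened into one table so that a single-table matching policy applies:

```java
public class DenormalizationSketch {
    // Two related source tables.
    record Person(String id, String name, String addressId) {}
    record Address(String id, String street, String city) {}

    // Flattened shape: address fields copied onto the person record, so the
    // matching policy needs no relationship-based queries.
    record PersonFlat(String id, String name, String street, String city) {}

    static PersonFlat flatten(Person p, Address a) {
        return new PersonFlat(p.id(), p.name(), a.street(), a.city());
    }
}
```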
Activating the logs can help you investigate matching process performance. The kernel log contains key indicators for each field configured to use matching and offers valuable clues for applying performance recommendations. For example:
Number of occurrences of the most frequent token: If the count is significantly high compared to the table size, it may indicate the absence of essential stop words. When the property ebx.log4j.category.log.kernel is set to DEBUG, the 10 most frequent tokens are logged and should be considered as stop words. See Stop words for more information.
Number of occurrences of the most frequent value: When a field contains only a few distinct values, each with a high frequency, the field is not suitable for estimating the similarity between records. Consider ignoring such fields during pre-processing. See the bullet point discussing Weight in the EBX Match and Merge Add-on optimization suggestions.
Number of null values: Excessive null values in a field, especially when the matching field is configured to match with other null values or any value, can significantly impact performance. A record with null values appears similar to many others, which leads to a large number of decision tree executions. In this case, it is better to configure null values to no match.
The Code search strategy with the contains parameter: The contains operator is expensive on large volumes. Modify it to starts with when matching mid-sized to large volumes.
The log also provides precise information about matching progress: the number of records, average throughput, and percentage of completion. This is useful for estimating the duration of the matching phase, and can help you determine whether pre-processing is slower than expected given the volume of data.