TIBCO Patterns
From a conceptual viewpoint, TIBCO Patterns functions as an in-memory database that provides inexact matching instead of exact matching.
The data is loaded into in-memory tables with rows and columns, similar to a database management system (DBMS). The biggest difference is that a DBMS is optimized for exact matching (by indexing certain columns), whereas TIBCO Patterns is designed for measuring similarity across any chosen set of fields.
With exact matching, a record either matches or does not. With inexact matching, however, all records match to some degree, and it is a matter of judgment how well one record matches compared to another. The patented technology in TIBCO Patterns is designed to make these judgments much as a human would. In fuzzy matching, however, there is never a guarantee that the selection of best matching records exactly corresponds to the set of records a human would select. There is always a possibility that the returned “best” matches might not include a record a human would judge as a best match (a false negative), or that they might include a record a human would judge as a poor match (a false positive). The goal of inexact matching is to reduce the number of false negatives and false positives to a minimum. That minimum cannot be guaranteed to be zero, which is why TIBCO Patterns does not replace the exact matching capabilities of a DBMS: it performs a fundamentally different operation.
TIBCO Patterns technology can compute a meaningful similarity score between a query and any record – the similarity score is quite small when there is little similarity between a given record and the query.
For example, assume that a table is loaded with data about people, and you want to find records matching the following query:
Query: Maria Kristina Cassandra

Results:

| Score | Record Contents |
| --- | --- |
| 1.0 | Maria Kristina Cassandra |
| 0.89 | Maria Kristna Cassendra |
| 0.48 | Mary K Casand |
| ... | ... |
| 0.05 | Bill Bailey |
The first record in the result set is an exact match, having the highest possible similarity score of 1.0. The second record is very similar to the query; it probably represents the same person. The third and subsequent records are progressively less similar. Note that even the final record in this list, the very dissimilar “Bill Bailey,” still has a computable score, even though it is very small.
This kind of similarity computation is very different from a typical database search or select function, where the answer for each record is always just “yes” or “no” and the result set is all records for which the answer is “yes.” With inexact pattern matching, the answer is a score based on a measure of similarity, and the result set is a list of records ranked by score.
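The contrast between the two kinds of result set can be sketched in a few lines of Python. This is only an illustration: `similarity`, `exact_select`, and `ranked_search` are hypothetical names, and the score comes from the standard library's `difflib.SequenceMatcher`, used here as a stand-in. It is not the TIBCO Patterns algorithm, so the scores it produces differ from those in the table above.

```python
from difflib import SequenceMatcher
from typing import List, Tuple


def similarity(query: str, record: str) -> float:
    """Stand-in similarity in [0.0, 1.0]; NOT the TIBCO Patterns algorithm."""
    return SequenceMatcher(None, query.lower(), record.lower()).ratio()


def exact_select(query: str, records: List[str]) -> List[str]:
    """DBMS-style selection: each record either matches or it does not."""
    return [record for record in records if record == query]


def ranked_search(query: str, records: List[str]) -> List[Tuple[float, str]]:
    """Inexact search: every record gets a score; results are ranked by it."""
    scored = [(similarity(query, record), record) for record in records]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


records = [
    "Bill Bailey",
    "Mary K Casand",
    "Maria Kristna Cassendra",
    "Maria Kristina Cassandra",
]
query = "Maria Kristina Cassandra"

exact = exact_select(query, records)    # only the identical record
ranked = ranked_search(query, records)  # all four records, best first

print("exact:", exact)
for score, record in ranked:
    print(f"{score:.2f}  {record}")
```

The exact selection returns a single record, while the ranked search returns every record with a score attached, including a small but nonzero score for “Bill Bailey.”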
This means that with inexact pattern matching, unlike the typical exact-match search, the set of results is not automatically defined, since there is often no clear division between “yes” and “no.” Instead, there is generally a “maybe” region characterized by similarity scores that are neither very high (near-exact matches) nor very low (almost no similarity between the record and the query). Capturing and creating value from this intermediate region is the whole point of inexact matching.
Matching Records
Generally, matching records are those that are similar enough to the query of interest. In practice, “similar enough” corresponds to some chosen threshold value of the similarity score. Once this threshold value is determined, based on the business impact of the trade-off between false positives (records scoring above the threshold that are not true matches) and false negatives (records scoring below the threshold that are true matches), the result set becomes well-defined. Matching records can then be processed in all the ways that search results typically are processed: consumed by another application, displayed in a Web application, or sorted by attributes.
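The effect of the threshold on false positives and false negatives can be illustrated with a small Python sketch. The scores are taken from the first example; the set of records a human would accept (`human_matches`) is a hypothetical judgment made up for this illustration, and `select_matches` and `count_errors` are hypothetical helper names, not part of any TIBCO Patterns API.

```python
from typing import List, Set, Tuple

Scored = List[Tuple[float, str]]


def select_matches(scored: Scored, threshold: float) -> List[str]:
    """Return the records whose score meets or exceeds the threshold."""
    return [record for score, record in scored if score >= threshold]


def count_errors(scored: Scored, true_matches: Set[str],
                 threshold: float) -> Tuple[int, int]:
    """Return (false_positives, false_negatives) for a given threshold."""
    false_positives = sum(
        1 for score, record in scored
        if score >= threshold and record not in true_matches
    )
    false_negatives = sum(
        1 for score, record in scored
        if score < threshold and record in true_matches
    )
    return false_positives, false_negatives


# Scores from the first example table.
scored = [
    (1.0, "Maria Kristina Cassandra"),
    (0.89, "Maria Kristna Cassendra"),
    (0.48, "Mary K Casand"),
    (0.05, "Bill Bailey"),
]

# Hypothetical human judgment: the first three are the same person.
human_matches = {
    "Maria Kristina Cassandra",
    "Maria Kristna Cassendra",
    "Mary K Casand",
}

for threshold in (0.8, 0.4, 0.01):
    fp, fn = count_errors(scored, human_matches, threshold)
    print(f"threshold={threshold}: false positives={fp}, false negatives={fn}")
```

Under these assumptions, a high threshold misses “Mary K Casand” (a false negative), a very low threshold admits “Bill Bailey” (a false positive), and an intermediate threshold avoids both. On real data no single threshold is guaranteed to eliminate both kinds of error.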
Here is a second example.
Query: Peter Sellars Herbert Lohm Aleck Guiness

Results:

| Score | Movie | Actors |
| --- | --- | --- |
| 0.68 | The Ladykillers | Alec Guinness, Peter Sellers, Cecil Parker, Herbert Lom |
| 0.49 | Revenge of the Pink Panther | Peter Sellers, Herbert Lom, Burt Kwouk |
| 0.45 | A Shot in the Dark | Peter Sellers, Elke Sommer, Herbert Lom |
| 0.43 | Return of the Pink Panther | Peter Sellers, Christopher Plummer, Herbert Lom |
| 0.35 | Murder By Death | Peter Falk, James Coco, Peter Sellers, David Niven, Alec Guinness |
| 0.22 | Lawrence of Arabia | Peter O’Toole, Omar Sharif, Alec Guinness, Jose Ferrer |
In this example of a movie database search, one movie is found with matches to all six query terms, which correspond to the names of three actors. The next four movies each match two of the three actors, and the final movie shown in this list matches only one.
Note the free-form nature of the query. It consists of a single “phrase” with half a dozen terms. The “best matching records” contain all these terms; the “records that may match” contain just some, or only one. Search engines often enforce a Boolean “AND” relationship between terms in a query: all the given terms must match the record. To make the search more flexible, some search engines provide a query language that includes OR and AND operators for specifying these relationships explicitly.
TIBCO Patterns inexact matching uses algorithms that naturally score records that closely match more query terms higher than records that match fewer. It is the total amount of similarity that matters, not “hard-coded” relationships between query terms. So if a query contains multiple terms, the top of the results list generally contains the records with the best-quality similarity to the most terms. In this respect, the beginning of the result list tends to resemble that of an “AND” type search. Later entries in the result list may more closely resemble an “OR” search, as they might have only one or two matching terms. In fact, TIBCO Patterns inexact matching is neither AND nor OR: it selects the records with the best overall match to all of the terms given in the query.
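A toy scorer shows how ranking by total similarity yields AND-like results first and OR-like results later, with no Boolean operators involved. Everything here is an assumption made for illustration: the scorer takes, for each query term, its best `difflib.SequenceMatcher` similarity to any word in the record, discards term matches weaker than an arbitrary `NOISE_FLOOR`, and averages over the query terms. It is not the TIBCO Patterns algorithm, and its scores differ from those in the table above.

```python
import re
from difflib import SequenceMatcher
from typing import List

NOISE_FLOOR = 0.75  # assumed cutoff: weaker term matches contribute nothing


def words(text: str) -> List[str]:
    """Split text into lowercase alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def term_score(term: str, record_words: List[str]) -> float:
    """Best similarity of one query term to any word in the record."""
    best = max(
        (SequenceMatcher(None, term, word).ratio() for word in record_words),
        default=0.0,
    )
    return best if best >= NOISE_FLOOR else 0.0


def record_score(query: str, record_text: str) -> float:
    """Average per-term similarity: total similarity, not AND/OR logic."""
    terms = words(query)
    if not terms:
        return 0.0
    record_words = words(record_text)
    return sum(term_score(term, record_words) for term in terms) / len(terms)


movies = [
    ("Lawrence of Arabia",
     "Peter O'Toole, Omar Sharif, Alec Guinness, Jose Ferrer"),
    ("Murder By Death",
     "Peter Falk, James Coco, Peter Sellers, David Niven, Alec Guinness"),
    ("Revenge of the Pink Panther",
     "Peter Sellers, Herbert Lom, Burt Kwouk"),
    ("The Ladykillers",
     "Alec Guinness, Peter Sellers, Cecil Parker, Herbert Lom"),
]
query = "Peter Sellars Herbert Lohm Aleck Guiness"

results = sorted(
    ((record_score(query, actors), title) for title, actors in movies),
    key=lambda pair: pair[0],
    reverse=True,
)
for score, title in results:
    print(f"{score:.2f}  {title}")
```

The movie that matches all six terms ranks first, and the movie that matches the fewest terms ranks last. A strict Boolean AND on exact terms would return nothing for this query, because four of its six terms are misspelled.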