TIBCO Patterns - Search Deduplication Library Overview.
The TIBCO Patterns - Search Deduplication Library contains utilities to assist in finding duplicate records in a table stored in a TIBCO Patterns - Search server.
This overview covers what is included in the Deduplication Library and briefly describes its usage.
Package Contents
The Deduplication Library package consists of the following:
- com/tibco/patterns/deduplication This directory contains all of the classes implementing the library.
- com/tibco/patterns/dedupe_utils This directory contains the source for simple implementations of certain interfaces from the library.
- TIB_tps_dedupe.jar A jar file containing the complete library and the simple interface implementations (but not the sample files).
- sample This directory contains the source for two sample implementations using the deduplication library.
The documentation for the Deduplication Library is included in the product documentation directory. It is provided in standard javadoc HTML format.
Compatibility
The library is compiled to be compatible with Java release 1.7 or higher. Your Java engine and compiler must support release 1.7 or higher in order to use this library.
Each release of the TIBCO Patterns - Search server includes a new release of the Deduplication Library. The Deduplication Library release should always be the same as both the TIBCO Patterns - Search server and the Java API. The compatibility of a Deduplication Library with a newer or older TIBCO Patterns - Search server or TIBCO Patterns - Search Java API release is not guaranteed.
Related Documentation
This documentation is intended as a reference guide to the Deduplication Library. It is not intended to be a tutorial or a reference on using TIBCO Patterns - Search or the TIBCO Patterns - Search Java API. It is strongly recommended that you review the following documentation before using the Deduplication Library.
- TIBCO Patterns - Search Concepts Guide. This document provides an overview of the TIBCO Patterns - Search server, its features and how to use them. The Deduplication Library reference documents assume the reader is familiar with the basic concepts presented in this guide.
- TIBCO Patterns - Search Java API Reference A javadoc format reference to the TIBCO Patterns - Search Java API. The Deduplication Library reference documents assume the reader is familiar with the Java API.
- TIBCO Patterns - Search Programmer's Guide. This document is a reference manual for the "C" API and also provides more detailed documentation on some of the features of the TIBCO Patterns - Search server. The Deduplication Library documentation focuses on describing the Deduplication Library itself and doesn't fully describe the functionality of the associated TIBCO Patterns - Search server features.
Library Interface Overview
The Deduplication Library is organized around the Deduplicator object type:
- Deduplicator
A Deduplicator binds together all of the information and functionality needed to find duplicates within a table. It scans through the queries produced by a KeyedQuerySource, performing each query on a TIBCO Patterns - Search server. Queries are processed in batches. When a batch is complete the Deduplicator stores the results via the PairStore interface.
- Interface Types
These define the information and functionality which the user must provide for a deduplication.
PairStore All duplicates found by a deduplicator are recorded through a PairStore interface. Also, the PairStore interface can report if a batch has already been completed.
QueryBuilder All queries built by the Deduplicators are obtained through the QueryBuilder interface. Note that a QueryBuilder interface may return null; this indicates no query should be performed for that record.
Keyed Used to enforce presence of keys on types that require it.
KeyedQuerySource An iterator-like interface that produces a sequence of queries.
Logger Defines a basic logging interface used by the Deduplication framework to report errors, warnings and other informative messages. Users can provide an implementation of this interface to control the reporting of messages from the framework.
- Simple Interface Implementations
These provide very basic implementations of some of the required interfaces.
KeyedNetricsRecord A simple wrapper of NetricsBaseRecord that implements the Keyed interface. Because this wraps NetricsBaseRecord, it can wrap either NetricsRecord or NetricsCompoundRecord.
NetricsRecQuerySource An implementation of KeyedQuerySource that uses an implementation of NetricsBaseRecSrc as the source of query data. Because this implementation wraps a NetricsBaseRecSrc it can use either a NetricsRecSrc or a NetricsCompoundRecSrc. So this implementation can be used in either Compound Record deduplication or Standard Record deduplication.
Log4jLogger A simple wrapper of a Log4j logger object as an implementation of the Logger interface. This allows messages generated by the deduplication framework to be routed into a Log4j logger. This implementation does not support setting the log levels through the interface.
The implementations above are considered part of the deduplication framework. The implementations listed below are included as a set of useful utilities. They are under the dedupe_utils package. They are not considered part of the deduplication framework. The source code to these implementations can be found in the deduplication framework src directory.
FileSystemPairStore An implementation of PairStore that places results in files under a specified directory.
GenericCompoundQueryBuilder This is an implementation of the QueryBuilder interface that wraps an implementation of the ANetricsCompoundQueryBuilder interface. This ties the TIBCO Patterns - Search query builder interface into the deduplication framework. As the Query Builder Platform (QBP) is an implementation of ANetricsCompoundQueryBuilder, this allows QBP query definitions to be plugged into the deduplication framework.
GenericQueryBuilder This is an implementation of the QueryBuilder interface that wraps an implementation of the ANetricsQueryBuilder interface. This ties the TIBCO Patterns - Search query builder interface into the deduplication framework.
PrintStreamLogger An implementation of the Logger interface that writes messages to a PrintStream. This is useful for simple applications that wish to log to a file, standard out or standard error.
- Supporting types
BatchStatus An enumeration that lists the possible states for a batch of queries.
BatchStatistics Contains statistics for a batch including: its identifier, when it started processing, the host on which it was processed, how long it took to process, and its state.
DedupeException The exception thrown when a serious error is encountered by the deduplication framework.
QueryBuildException The exception thrown when a query builder or query source can't build a query for an input record.
ErrorHandler This class is used to capture and record error events encountered during a deduplication run. To gain more control over the handling of error events (such as encountering a bad record, the failure of a query, the failure of a searcher thread, or the failure to create or store a batch of queries), the user can create an extension of this class that overrides the default behavior.
QueryErrorAction An enumeration of the actions that can be returned by ErrorHandler. Extensions of ErrorHandler must return one of these codes when an error condition is processed.
Host Holds connection parameters to a TIBCO Patterns - Search server.
KeyedQuery This holds a generated query within the deduplication framework. It associates the object being matched with the query used to find its matches.
Pair Encapsulates a single duplicate found by a Deduplicator.
SearcherInfo The Deduplicator is multi-threaded. Each SearcherInfo contains information about a thread that is performing searches (queries) on a TIBCO Patterns - Search server.
SearcherStatus An enumeration that lists the possible states of a search thread.
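The batch flow described for the Deduplicator above can be pictured with a small self-contained sketch. It is only a conceptual model: SketchQueryBuilder, SketchPairStore and their methods are hypothetical stand-ins defined here for illustration, not the library's real types or signatures, and the "query" is faked rather than sent to a server. It shows three behaviors the text describes: queries are grouped into batches, a builder may return null to skip a record, and a store that reports a batch as already complete lets a rerun skip that batch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Conceptual model only; none of these types belong to the Deduplication Library. */
class BatchFlowSketch {

    /** Hypothetical stand-in for a query builder; null means "skip this record". */
    interface SketchQueryBuilder {
        String buildQuery(String record);
    }

    /** Hypothetical stand-in for a pair store that remembers completed batches. */
    static class SketchPairStore {
        final Set<Integer> completedBatches = new HashSet<>();
        final List<String> pairs = new ArrayList<>();

        boolean isBatchComplete(int batchId) {
            return completedBatches.contains(batchId);
        }

        void storeBatch(int batchId, List<String> batchPairs) {
            pairs.addAll(batchPairs);
            completedBatches.add(batchId);
        }
    }

    /** Processes records in fixed-size batches and returns the number of queries run. */
    static int run(List<String> records, int batchSize,
                   SketchQueryBuilder builder, SketchPairStore store) {
        int queriesRun = 0;
        int batchId = 0;
        for (int start = 0; start < records.size(); start += batchSize, batchId++) {
            if (store.isBatchComplete(batchId)) {
                continue; // results for this batch were stored by an earlier run
            }
            int end = Math.min(start + batchSize, records.size());
            List<String> batchPairs = new ArrayList<>();
            for (String record : records.subList(start, end)) {
                String query = builder.buildQuery(record);
                if (query == null) {
                    continue; // the builder asked for no query on this record
                }
                queriesRun++;
                // A real run would query the server here; this sketch fakes one match.
                batchPairs.add(record + " ~ " + query);
            }
            store.storeBatch(batchId, batchPairs);
        }
        return queriesRun;
    }

    public static void main(String[] args) {
        List<String> records = Arrays.asList("ann", "", "bob", "cy", "");
        SketchQueryBuilder builder = r -> r.isEmpty() ? null : r.toUpperCase();
        SketchPairStore store = new SketchPairStore();
        System.out.println("first run queries:  " + run(records, 2, builder, store));
        System.out.println("second run queries: " + run(records, 2, builder, store));
    }
}
```

In this model the second call performs no queries because every batch is already marked complete in the store, which is the kind of restart behavior a PairStore that tracks completed batches makes possible.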
Advanced Usage Notes
- Multiple search threads running against a single TIBCO Patterns - Search server are supported.
- Use of multiple TIBCO Patterns - Search servers is supported. Each server must have a complete copy of the table or tables to be deduplicated.
- Use of multiple servers and multiple threads may be combined freely.
- Search servers may be added and removed during Deduplication.
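One way to picture why threads and servers can be combined freely is a shared queue of batches drained by a set of workers, each bound to one host. Because every server holds a complete copy of the table, any worker can take any batch. The sketch below is a conceptual model under that assumption, using only the Java standard library; WorkerPoolSketch, the host labels and the queue-based design are hypothetical illustrations and do not describe how the Deduplicator is actually implemented.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/** Conceptual model only; not the Deduplication Library's implementation. */
class WorkerPoolSketch {

    /**
     * Drains batchCount batches using the given number of workers per host
     * (at least one worker in total) and returns how many batches each host handled.
     */
    static Map<String, Integer> run(Map<String, Integer> workersPerHost, int batchCount)
            throws InterruptedException {
        final BlockingQueue<Integer> batches = new LinkedBlockingQueue<>();
        for (int i = 0; i < batchCount; i++) {
            batches.add(i);
        }

        Map<String, AtomicInteger> handled = new LinkedHashMap<>();
        int totalWorkers = 0;
        for (Map.Entry<String, Integer> e : workersPerHost.entrySet()) {
            handled.put(e.getKey(), new AtomicInteger());
            totalWorkers += e.getValue();
        }

        ExecutorService pool = Executors.newFixedThreadPool(totalWorkers);
        for (Map.Entry<String, Integer> e : workersPerHost.entrySet()) {
            final AtomicInteger hostCounter = handled.get(e.getKey());
            for (int w = 0; w < e.getValue(); w++) {
                pool.execute(() -> {
                    // Any worker may take any batch: every host has the full table.
                    while (batches.poll() != null) {
                        hostCounter.incrementAndGet();
                    }
                });
            }
        }
        pool.shutdown();
        if (!pool.awaitTermination(1, TimeUnit.MINUTES)) {
            throw new IllegalStateException("workers did not finish in time");
        }

        Map<String, Integer> result = new LinkedHashMap<>();
        for (Map.Entry<String, AtomicInteger> e : handled.entrySet()) {
            result.put(e.getKey(), e.getValue().get());
        }
        return result;
    }

    public static void main(String[] args) throws InterruptedException {
        Map<String, Integer> workers = new LinkedHashMap<>();
        workers.put("hostA:5051", 4); // hypothetical host labels
        workers.put("hostB:5051", 2);
        System.out.println(run(workers, 100));
    }
}
```

The split of batches between hosts varies from run to run, but the per-host counts always sum to the number of batches, since each batch is taken from the queue exactly once.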
Examples
Here is a simple example of how to use the Deduplication Library.
This example assumes a table named "names", with columns first,
middle, and last, is already loaded into the TIBCO
Patterns - Search server running on the local machine on the
default port of 5051.
Main.java:
import com.tibco.patterns.dedupe_utils.FileSystemPairStore;
import com.tibco.patterns.dedupe_utils.PrintStreamLogger;
import com.tibco.patterns.deduplication.Deduplicator;
import com.tibco.patterns.deduplication.Keyed;
import com.tibco.patterns.deduplication.KeyedNetricsRecord;
import com.tibco.patterns.deduplication.KeyedQuery;
import com.tibco.patterns.deduplication.KeyedQuerySource;
import com.tibco.patterns.deduplication.NetricsRecQuerySource;
import com.tibco.patterns.deduplication.QueryBuilder;
import com.netrics.likeit.NetricsServerInterface;
import com.netrics.likeit.NetricsTableCursor;
import com.netrics.likeit.NetricsTableCursorRecSrc;
import com.netrics.likeit.NetricsQuery;
import com.netrics.likeit.NetricsSearchOpts;

public class Main {

    public static void main(String[] args) throws Exception {
        Deduplicator dd = null;
        FileSystemPairStore ps = null;
        PrintStreamLogger logger = null;
        KeyedQuerySource kqs = null;
        NetricsTableCursor names_cursor;
        NetricsTableCursorRecSrc names_rec_src;
        NetricsServerInterface nsi;

        // Store our pairs as csv file in pairs directory.
        ps = new FileSystemPairStore("./pairs/");

        // Send all log messages to the standard error.
        logger = new PrintStreamLogger(System.err);

        // We will use our names table as our query source.
        String table_name = "names";

        // Create a cursor to scan the records.
        names_cursor = new NetricsTableCursor(table_name, 100);

        // We assume a default server.
        nsi = new NetricsServerInterface("localhost", 5051);

        // Create a record source from the cursor.
        names_rec_src = new NetricsTableCursorRecSrc(names_cursor, nsi);

        // Create our query source.
        kqs = new NetricsRecQuerySource(names_rec_src, new NameQueryBuilder());

        // Create the deduplication engine.
        dd = new Deduplicator(-1, kqs, logger, null, ps);

        // Run four threads against the local server.
        dd.setHostWorkerCount("localhost", 5051, 4);

        // Start the deduplication run.
        dd.start();

        // Wait for it to complete.
        dd.waitWorkComplete();

        // Shut down the engine.
        dd.shutdown();

        // Found pairs are in the "./pairs" directory.
    }

    /** This class implements a simple query builder. */
    static class NameQueryBuilder implements QueryBuilder<KeyedNetricsRecord> {

        static String[] fieldNames = new String[] {"first", "middle", "last"};

        public KeyedQuery buildQuery(KeyedNetricsRecord record) {
            NetricsSearchOpts options;
            NetricsQuery q;

            String[] fieldVals = record.getRecord().getFields();
            String[] queryVals = { fieldVals[0], fieldVals[1], fieldVals[2] };

            // Don't query on empty records; returning null skips the record.
            int querylength = queryVals[0].length()
                            + queryVals[1].length()
                            + queryVals[2].length();
            if (querylength == 0) {
                return null;
            }

            // Keep only matches scoring at least 0.7, using symmetric scoring.
            options = new NetricsSearchOpts();
            options.useAbsoluteCutoff(0.7);
            options.scoreType(NetricsSearchOpts.SCORE_SYMMETRIC);

            // Match the three name values against the three name fields.
            q = NetricsQuery.Cognate(queryVals, fieldNames, null, 0.8);

            return new KeyedQuery((Keyed)record, q, options, null);
        }
    }
}
Samples
The deduplication framework comes with two sample projects. The first project is an extended version of the above example. It shows a number of different options. It also shows how to integrate the deduplication framework with the grouping library.
This example consists of the following files: Main.java, FileSystemPairStoreWithHTMLOutput.java, NameQueryBuilder.java, JoinQueryBuilder.java.
The second sample project shows how to integrate the deduplication framework, the Query Builder Platform and the grouping library into a full deduplication application. This supports using multiple TIBCO Patterns - Search servers. Because it is driven by the QBP configuration files, it can be used to deduplicate almost any table or set of joined tables without code changes.
This example consists of the following files: QBPDedupe.java, QueryRunValidator.java
To compile and run these examples you must have the deduplication framework jar file (TIB_tps_dedupe.jar), the grouping library jar file (TIB_tps_grouping.jar) and the Query Builder Platform jar file (TIB_tps_qbp.jar) on the class path.