TIBCO Patterns - Search Grouping Library Overview

The TIBCO Patterns - Search Grouping Library is a Java library designed to help organize the results of a de-duplication. This overview covers what is included in the library, how to install it, and basic usage.

Package Contents

The TIBCO Patterns - Search Grouping Library is delivered as a JAR file (<install-home>/grouping/lib/TIB_tps_grouping.jar). A sample program is included in <install-home>/grouping/sample.

Installation

Copy the jar file to a location of your choice and add an entry to your Java class path for the jar file.

To run the sample program, copy the executable JAR file from the installed sample bin directory and the sample data file sample_1_intput.csv from the sample data directory to a directory of your choice. In that directory, run:

  java -jar sample1.jar

It writes some statistics to standard output and creates three CSV files.

The library is compiled to be compatible with Java release 1.7 or higher. Your Java Virtual Machine and compiler must support release 1.7 or higher in order to use this interface.

Library Overview

Before describing the library, a brief description of what is meant by "de-duplication" may be helpful.

Given a set of records, the usual method of finding duplicates is to use each record to form a query into the set of records. Each query returns a set of records similar to the query record. To each returned record a score is attached indicating the degree of similarity. This method produces complete results, but has two major drawbacks:

First, the volume of data produced by a deduplication is usually unwieldy. To see why, suppose there are 101 records which will end up being related. Each of these 101 records will form a query, and each query may return up to 100 results; the total number of results can be as high as 10100. Expand this to a data set of millions of records with many large sets of related records, and the volume of results grows very quickly. Using these results, e.g. for reports or database cleansing, can be very difficult just due to data volume.
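The arithmetic behind that figure can be checked with a minimal sketch. It assumes every one of n related records returns all n - 1 others when used as a query; the class and method names are hypothetical and are not part of the library.

```java
// Illustrative arithmetic only; ResultVolume is a hypothetical helper,
// not part of the library.
final class ResultVolume {

    /** Upper bound on results when each of n related records matches the other n - 1. */
    static long maxResults(long relatedRecords) {
        return relatedRecords * (relatedRecords - 1);
    }

    public static void main(String[] args) {
        System.out.println(maxResults(101));   // 101 * 100 = 10100
        System.out.println(maxResults(1000));  // 1000 * 999 = 999000
    }
}
```

The bound grows with the square of the set size, which is why a few large sets of related records can dominate the total volume.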

Second, determining the proper order to process record relationships is difficult because the linkage between records can be very complex. A very simple example: A is linked to B with score 0.75, W is linked to A with score 0.73, R is linked to B with score 0.65, B is linked to F with score 0.80, F is linked to R with score 0.78, and A is linked to R with score 0.80. Even on this very small set of five related records the proper processing order isn't immediately obvious. Additionally, the relationships relevant to processing a set of records could be scattered widely within the complete set of relationships, making it difficult to even detect which relationships belong together.
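To make the example concrete, the sketch below holds the six links in a small stand-in class and sorts them by decreasing score. The Link class and the helper methods are hypothetical illustrations, not library types; the sketch only reorders the links and does not decide how they should be processed.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.TreeSet;

// Hypothetical illustration; none of these types belong to the library.
final class LinkageExample {

    /** Stand-in for one scored relationship between two record keys. */
    static final class Link {
        final String first;
        final String second;
        final double score;

        Link(String first, String second, double score) {
            this.first = first;
            this.second = second;
            this.score = score;
        }

        @Override
        public String toString() {
            return first + "-" + second + " " + score;
        }
    }

    /** The six links from the example, in the order the text lists them. */
    static List<Link> exampleLinks() {
        List<Link> links = new ArrayList<Link>();
        links.add(new Link("A", "B", 0.75));
        links.add(new Link("W", "A", 0.73));
        links.add(new Link("R", "B", 0.65));
        links.add(new Link("B", "F", 0.80));
        links.add(new Link("F", "R", 0.78));
        links.add(new Link("A", "R", 0.80));
        return links;
    }

    /** Returns a copy sorted by decreasing score; ties keep their original order. */
    static List<Link> byDecreasingScore(List<Link> links) {
        List<Link> sorted = new ArrayList<Link>(links);
        Collections.sort(sorted, new Comparator<Link>() {
            @Override
            public int compare(Link a, Link b) {
                return Double.compare(b.score, a.score);
            }
        });
        return sorted;
    }

    /** Collects the distinct record keys mentioned by any link. */
    static TreeSet<String> recordKeys(List<Link> links) {
        TreeSet<String> keys = new TreeSet<String>();
        for (Link link : links) {
            keys.add(link.first);
            keys.add(link.second);
        }
        return keys;
    }

    public static void main(String[] args) {
        for (Link link : byDecreasingScore(exampleLinks())) {
            System.out.println(link);
        }
        System.out.println(recordKeys(exampleLinks())); // [A, B, F, R, W]
    }
}
```

Even after sorting, two links share the top score of 0.80 and every record except W appears in more than one link, which illustrates why the processing order is not obvious.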

The TIBCO Patterns - Search Grouping Library addresses both these problems. Given the results of a deduplication, it outputs sets of related records, ordered by decreasing similarity. For example, from the 10100 pairs in the first example above, the 101 records are output in the order they should be handled.
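The basic idea of turning pairs into sets of related records can be sketched with a textbook union-find (disjoint-set) structure. This is a swapped-in, simplified technique for illustration only: it is not the library's algorithm, it ignores scores, and it produces no ordering or sub-groups. The class name and the extra X-Y pair are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

// Union-find sketch of "records linked by pairs form a group".
// Hypothetical illustration; this is not the library's implementation.
final class UnionFindSketch {

    private final Map<String, String> parent = new HashMap<String, String>();

    /** Returns the representative key of the set containing the given key. */
    String find(String key) {
        if (!parent.containsKey(key)) {
            parent.put(key, key);
        }
        String root = key;
        while (!parent.get(root).equals(root)) {
            root = parent.get(root);
        }
        // Path compression: point every key on the path directly at the root.
        String current = key;
        while (!current.equals(root)) {
            String next = parent.get(current);
            parent.put(current, root);
            current = next;
        }
        return root;
    }

    /** Merges the sets containing the two keys. */
    void link(String a, String b) {
        String rootA = find(a);
        String rootB = find(b);
        if (!rootA.equals(rootB)) {
            parent.put(rootA, rootB);
        }
    }

    /** Returns each group as a sorted set of keys, indexed by its smallest key. */
    Map<String, TreeSet<String>> groups() {
        Map<String, TreeSet<String>> byRoot = new HashMap<String, TreeSet<String>>();
        for (String key : new ArrayList<String>(parent.keySet())) {
            String root = find(key);
            TreeSet<String> members = byRoot.get(root);
            if (members == null) {
                members = new TreeSet<String>();
                byRoot.put(root, members);
            }
            members.add(key);
        }
        Map<String, TreeSet<String>> result = new TreeMap<String, TreeSet<String>>();
        for (TreeSet<String> members : byRoot.values()) {
            result.put(members.first(), members);
        }
        return result;
    }

    public static void main(String[] args) {
        UnionFindSketch sketch = new UnionFindSketch();
        String[][] pairs = {
            {"A", "B"}, {"W", "A"}, {"R", "B"}, {"B", "F"}, {"F", "R"}, {"A", "R"},
            {"X", "Y"} // hypothetical unrelated pair, added to show a second group
        };
        for (String[] pair : pairs) {
            sketch.link(pair[0], pair[1]);
        }
        System.out.println(sketch.groups()); // {A=[A, B, F, R, W], X=[X, Y]}
    }
}
```

The thousands of pairwise results from a large related set collapse into a single group in the same way, which is the reduction in volume described above.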

The TIBCO Patterns - Search Grouping Library is designed to handle data sets from the small to the exceptionally large: hundreds of millions of record relationships. Small data sets (under 20 million relationships) require almost no programming; a simplified API is provided for these. Large data sets may require using the full API, with attention given to application performance.

The TIBCO Patterns - Search Grouping Library is designed to fit naturally with output from TIBCO Patterns - Search queries, including queries that use a machine learning model, and with the TIBCO Patterns - Search Deduplication Library.

Library Organization
The Grouping object is the core of the TIBCO Patterns - Search Grouping Library. It performs grouping on an Iterable of InputPair objects, and outputs groups using an application-provided implementation of IGroupOutput. The application is free to determine where InputPairs come from and where output is saved. Utility classes CsvPairInput and CsvGroupOutput are provided for CSV files, but the application is free to use databases, sockets, or any other means of reading and storing data.

The primary restriction on input is that pairs must be presented in order of decreasing score. Out-of-order scores will result in an exception being thrown.
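An application can check this ordering itself before handing pairs to the library. The sketch below is a minimal pre-check under two assumptions that the overview does not confirm: equal consecutive scores are treated as acceptable, and IllegalArgumentException is used only as a placeholder (the exception type the library throws is not stated here). ScoreOrderCheck and its methods are hypothetical names.

```java
// Hypothetical pre-check; not part of the library.
final class ScoreOrderCheck {

    /** Returns the index of the first score higher than its predecessor, or -1 if none. */
    static int firstOutOfOrder(double[] scores) {
        for (int i = 1; i < scores.length; i++) {
            if (scores[i] > scores[i - 1]) {
                return i;
            }
        }
        return -1;
    }

    /** Throws if the scores are not in non-increasing order (ties assumed acceptable). */
    static void requireDecreasing(double[] scores) {
        int bad = firstOutOfOrder(scores);
        if (bad >= 0) {
            throw new IllegalArgumentException(
                "score at index " + bad + " is higher than the score before it");
        }
    }

    public static void main(String[] args) {
        System.out.println(firstOutOfOrder(new double[] {0.80, 0.80, 0.78, 0.75})); // -1
        System.out.println(firstOutOfOrder(new double[] {0.80, 0.75, 0.78}));       // 2
    }
}
```

Running such a check on a small sample first gives an early, readable error message before a long grouping run is started.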

Tips to get started
Do a trial run on a small section of data using the CSV utility classes and the sample program as a template.
Putting the score column first in the input file will make it easier to sort by decreasing score with just operating system sort utilities (e.g. sort -r on *nix).
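Where an operating system sort is not convenient, the same ordering can be produced in Java. The sketch below assumes a hypothetical layout of score,key,key with no header row and no quoted fields; the actual column layout expected by the CSV utility classes is not described here. It compares parsed numbers rather than text, because a purely textual comparison would place 9.0 ahead of 10.5.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper; not part of the library. Assumes "score,key,key" lines
// with no header row and no quoted fields.
final class CsvScoreSort {

    /** Parses the first comma-separated field of a line as a score. */
    static double leadingScore(String line) {
        int comma = line.indexOf(',');
        String field = comma < 0 ? line : line.substring(0, comma);
        return Double.parseDouble(field.trim());
    }

    /** Returns a copy of the lines sorted by decreasing leading score. */
    static List<String> sortByScoreDescending(List<String> lines) {
        List<String> sorted = new ArrayList<String>(lines);
        Collections.sort(sorted, new Comparator<String>() {
            @Override
            public int compare(String a, String b) {
                return Double.compare(leadingScore(b), leadingScore(a));
            }
        });
        return sorted;
    }

    public static void main(String[] args) {
        List<String> lines = new ArrayList<String>();
        lines.add("0.75,A,B");
        lines.add("0.9,C,D");
        lines.add("10.5,E,F"); // scores are not required to lie in 0.0..1.0
        lines.add("9.0,G,H");
        for (String line : sortByScoreDescending(lines)) {
            System.out.println(line);
        }
    }
}
```

This in-memory approach suits a trial run on a small file; very large inputs would need an external sort instead.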

Library Capabilities
Data Capacity
The library can easily group tens of millions of pairs within minutes using just the simplified interface. The simplified interface runs entirely in physical memory.
There is also a deeper interface capable of handling larger data sets, into the billions of input pairs. It allows the application programmer to optimize library operations for their data set by customizing the usage of physical memory versus disk-backed operations and leveraging pre-calculated information about the input pairs. Use of the deeper interface may require consulting or implementation services from TIBCO.

Glossary
Record: For the TIBCO Patterns - Search Grouping Library, a record is just a key. Responsibility for linking keys back to other data systems lies with the application.
Input Pair: Two record identifiers and a score. Input pairs must be presented by decreasing score.
Score: A floating point value. Scores are typically between 0.0 and 1.0 inclusive, but this is not required.
Group: A set of records linked together by a set of pairs.
Sub-group: A tightly linked subset of records of a group. Sub-groups provide fine-grained information about how the records in a group are linked.
Grouping: The process of assigning records to groups based on a sequence of Input Pairs.
Grouped Record: A record key together with information about its position within its assigned group: which sub-group it was assigned to, the highest score seen for that record, and the record it linked to with that score.
Grouped Pair: An input pair and the sub-groups of its two records.
Primary Pair: A grouped pair that links two records within one sub-group.
Cross-over pair: A pair that links records across two sub-groups. A cross-over pair always has a lower score than all the pairs in both sub-groups.
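As an illustration of two of the Grouped Record fields, the sketch below computes, for each record in the six-link example from the Library Overview, the highest score seen and the record it linked to with that score. It is a hypothetical helper, not the library's data model, and it leaves out sub-group assignment because this overview does not describe how sub-groups are chosen.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration of "highest score seen" per record;
// not the library's Grouped Record type.
final class BestLinkSketch {

    /** The best-scoring partner found so far for one record. */
    static final class Best {
        final String partner;
        final double score;

        Best(String partner, double score) {
            this.partner = partner;
            this.score = score;
        }
    }

    /** For every key in any pair, finds the partner with the highest score. */
    static Map<String, Best> bestLinks(String[][] keys, double[] scores) {
        Map<String, Best> best = new TreeMap<String, Best>();
        for (int i = 0; i < scores.length; i++) {
            offer(best, keys[i][0], keys[i][1], scores[i]);
            offer(best, keys[i][1], keys[i][0], scores[i]);
        }
        return best;
    }

    /** Records the partner if it beats the score currently stored for the key. */
    private static void offer(Map<String, Best> best, String key, String partner, double score) {
        Best current = best.get(key);
        if (current == null || score > current.score) {
            best.put(key, new Best(partner, score));
        }
    }

    public static void main(String[] args) {
        String[][] keys = {
            {"A", "B"}, {"W", "A"}, {"R", "B"}, {"B", "F"}, {"F", "R"}, {"A", "R"}
        };
        double[] scores = {0.75, 0.73, 0.65, 0.80, 0.78, 0.80};
        for (Map.Entry<String, Best> entry : bestLinks(keys, scores).entrySet()) {
            Best b = entry.getValue();
            System.out.println(entry.getKey() + " -> " + b.partner + " (" + b.score + ")");
        }
    }
}
```

In this example A and R are each other's best link, as are B and F, while W's only link is to A.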

Packages 
Package Description
com.tibco.patterns.grouping