TIBCO Patterns - Search Grouping Library Overview
The TIBCO Patterns - Search Grouping Library is a Java library designed to help organize the results of a deduplication. This overview covers what is included in the library, how to install it, and basic usage.
Package Contents
The TIBCO Patterns - Search Grouping Library is delivered as a JAR file (<install-home>/grouping/lib/TIB_tps_grouping.jar). A sample program is included in <install-home>/grouping/sample.
Installation
Copy the jar file to a location of your choice and add an entry to your Java class path for the jar file.
To run the sample program, copy the executable jar file from the installed sample bin directory and the sample data file sample_1_intput.csv from the sample data directory to a directory of your choice. In that directory, run:
java -jar sample1.jar
It writes some statistics to standard output and creates three CSV files.
The library is compiled to be compatible with Java release 1.7 or higher. Your Java Virtual Machine and compiler must support release 1.7 or higher in order to use this interface.
Library Overview
Before describing the library, a brief description of what is meant by
"deduplication" may be helpful.
Given a set of records, the usual method of finding duplicates is to
use each record to form a query into the set of records. Each query
returns a set of records similar to the query record. To each returned
record a score is attached indicating the degree of similarity. This
method produces complete results, but has two major drawbacks:
First, the volume of data produced by a deduplication is usually
unwieldy. To see why, suppose there are 101 records which will end up
being related. Each of these 101 records will form a query, and each
query may return up to 100 results; the total number of results can be
as high as 10100. Expand this to a data set of millions of records with
many large sets of related records, and the volume of results grows
very quickly. Using these results, e.g. for reports or database
cleansing, can be very difficult just due to data volume.
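The 10100 figure follows from 101 queries that each return up to 100 other records. As a minimal sketch of that arithmetic (the class and method names are hypothetical and are not part of the Grouping Library), the upper bound for a set of n related records is n * (n - 1):

```java
/** Illustrative arithmetic only; not part of the Grouping Library API. */
final class PairVolume {

    /**
     * Upper bound on the number of results when each of the related records
     * is used as a query and each query returns all of the other records.
     */
    static long maxResults(long relatedRecords) {
        return relatedRecords * (relatedRecords - 1);
    }

    public static void main(String[] args) {
        System.out.println(maxResults(101));  // 101 * 100 = 10100
        System.out.println(maxResults(1000)); // 1000 * 999 = 999000
    }
}
```

The bound grows roughly with the square of the size of each related set, which is why a few large sets can dominate the result volume.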
Second, determining the proper order to process record relationships
is difficult because the linkage between records can be very complex.
A very simple example: A is linked to B with score 0.75, W is linked
to A with score 0.73, R is linked to B with score 0.65, B is linked to
F with score 0.80, F is linked to R with score 0.78, and A is linked
to R with score 0.80. Even on this very small set of five related
records the proper processing order isn't immediately obvious.
Additionally, the relationships relevant to processing a set of records
could be scattered widely within the complete set of relationships,
making it difficult to even detect which relationships belong together.
The TIBCO Patterns - Search Grouping Library addresses both these problems. Given
the results of a deduplication, it outputs sets of related records,
ordered by decreasing similarity. For example, from the 10100 pairs in
the first example above, the 101 records are output in the order they
should be handled.
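To make "sets of related records" concrete, the sketch below collects the six example links from the text into connected sets using a plain union-find (disjoint-set) structure. This is a textbook technique swapped in for illustration, not the library's algorithm: it ignores scores and computes neither sub-groups nor a processing order. The class ConnectedRecords and its methods are hypothetical names.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

/** Hypothetical illustration: plain union-find, NOT the library's algorithm. */
final class ConnectedRecords {
    // Maps each record key to its parent; a root maps to itself.
    private final Map<String, String> parent = new HashMap<String, String>();

    /** Finds the representative of a key's set, registering unseen keys. */
    private String find(String key) {
        String p = parent.get(key);
        if (p == null) {
            parent.put(key, key);
            return key;
        }
        if (p.equals(key)) {
            return key;
        }
        String root = find(p);
        parent.put(key, root); // path compression
        return root;
    }

    /** Records that two keys are related, merging their sets. */
    void link(String a, String b) {
        String rootA = find(a);
        String rootB = find(b);
        if (!rootA.equals(rootB)) {
            parent.put(rootA, rootB);
        }
    }

    /** Returns every set of related keys, each sorted alphabetically. */
    List<Set<String>> groups() {
        Map<String, Set<String>> byRoot = new TreeMap<String, Set<String>>();
        for (String key : new ArrayList<String>(parent.keySet())) {
            String root = find(key);
            Set<String> members = byRoot.get(root);
            if (members == null) {
                members = new TreeSet<String>();
                byRoot.put(root, members);
            }
            members.add(key);
        }
        return new ArrayList<Set<String>>(byRoot.values());
    }

    public static void main(String[] args) {
        ConnectedRecords records = new ConnectedRecords();
        // The six links from the example in the text (scores omitted here).
        String[][] links = {
            {"A", "B"}, {"W", "A"}, {"R", "B"}, {"B", "F"}, {"F", "R"}, {"A", "R"}
        };
        for (String[] link : links) {
            records.link(link[0], link[1]);
        }
        System.out.println(records.groups()); // [[A, B, F, R, W]]
    }
}
```

All five records land in a single set, which shows that scattered links belong together but says nothing about the order in which to handle them; that ordering is what the library adds.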
The TIBCO Patterns - Search Grouping Library is designed to handle data sets from
the small to the exceptionally large: hundreds of millions of record
relationships. Small data sets (under 20 million relationships) require
almost no programming; a simplified API is provided for these. Large
data sets may require using the full API, with attention given to
application performance.
The TIBCO Patterns - Search Grouping Library is designed to fit naturally with
output from TIBCO Patterns - Search queries, including queries that use a machine learning model,
and with output from the TIBCO Patterns - Search Deduplication Library.
Library Organization
The Grouping object is the core of the TIBCO Patterns - Search Grouping Library.
It performs grouping on an Iterable of InputPair objects, and outputs
groups using an application-provided implementation of IGroupOutput.
The application is free to determine where InputPairs come from and
where output is saved. Utility classes CsvPairInput and CsvGroupOutput
are provided for CSV files, but the application is free to use
databases, sockets, or any other means of reading and storing data.
The primary restriction on input is that pairs must be presented in
order of decreasing score. Out-of-order scores will result in an
exception being thrown.
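Because out-of-order input is rejected, it can be useful to verify the ordering before handing pairs to the library. The following is a minimal sketch of such a pre-check. It does not call the library: ScoredPair is a hypothetical stand-in, not the library's InputPair class, and the sketch assumes that equal consecutive scores are acceptable, which the text above does not state.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical pre-check; it does not call the Grouping Library. */
final class ScoreOrderCheck {

    /** Illustrative stand-in for a scored pair; not the library's InputPair. */
    static final class ScoredPair {
        final String first;
        final String second;
        final double score;

        ScoredPair(String first, String second, double score) {
            this.first = first;
            this.second = second;
            this.score = score;
        }
    }

    /**
     * Returns the index of the first pair whose score is higher than the
     * score of the pair before it, or -1 if scores never increase.
     */
    static int firstOutOfOrder(List<ScoredPair> pairs) {
        for (int i = 1; i < pairs.size(); i++) {
            if (pairs.get(i).score > pairs.get(i - 1).score) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        List<ScoredPair> sorted = Arrays.asList(
            new ScoredPair("B", "F", 0.80),
            new ScoredPair("A", "R", 0.80),
            new ScoredPair("F", "R", 0.78),
            new ScoredPair("A", "B", 0.75));
        List<ScoredPair> unsorted = Arrays.asList(
            new ScoredPair("A", "B", 0.75),
            new ScoredPair("W", "A", 0.73),
            new ScoredPair("B", "F", 0.80));
        System.out.println(firstOutOfOrder(sorted));   // -1
        System.out.println(firstOutOfOrder(unsorted)); // 2
    }
}
```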
Tips to Get Started
Do a trial run on a small section of data using the CSV utility classes and
the sample program as a template.
Putting the score column first in the input file will make it easier to sort
by decreasing score with just operating system sort utilities (e.g. sort -r
on *nix).
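Where an operating system sort utility is not convenient, the same ordering can be produced in Java. The sketch below is illustrative only and makes several assumptions that are not taken from the library: each line has the layout score,firstKey,secondKey, there is no header row, no field is quoted, and the data fits in memory. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

/** Hypothetical helper; assumes lines of the form "score,firstKey,secondKey". */
final class SortPairsByScore {

    /** Parses the score from the first comma-separated column of a line. */
    static double scoreOf(String line) {
        return Double.parseDouble(line.substring(0, line.indexOf(',')).trim());
    }

    /** Returns a copy of the lines ordered by decreasing numeric score. */
    static List<String> sortedByDecreasingScore(List<String> lines) {
        List<String> sorted = new ArrayList<String>(lines);
        Collections.sort(sorted, new Comparator<String>() {
            @Override
            public int compare(String a, String b) {
                // Arguments are swapped so that higher scores come first.
                return Double.compare(scoreOf(b), scoreOf(a));
            }
        });
        return sorted;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "0.75,A,B", "0.73,W,A", "0.65,R,B",
            "0.80,B,F", "0.78,F,R", "0.80,A,R");
        for (String line : sortedByDecreasingScore(lines)) {
            System.out.println(line);
        }
    }
}
```

The comparison parses each score as a number, so the result does not depend on how the scores are formatted. Collections.sort is stable, so lines with equal scores keep their input order.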
Library Capabilities
Data Capacity
The library can easily group tens of millions of pairs within minutes
using just the simplified interface. The simplified interface runs
entirely in physical memory.
There is also a deeper interface capable of handling larger data sets,
into the billions of input pairs. It allows the application programmer
to optimize library operations for their data set by customizing the
usage of physical memory versus disk-backed operations and leveraging
pre-calculated information about the input pairs. Use of the deeper
interface may require consulting or implementation services from TIBCO.
Glossary
Record: For the TIBCO Patterns - Search Grouping Library, a record is just a key.
Responsibility for linking keys back to other data systems lies with
the application.
Input Pair: Two record identifiers and a score. Input pairs must
be presented by decreasing score.
Score: A floating point value. Scores are typically between 0.0
and 1.0 inclusive, but this is not required.
Group: A set of records linked together by a set of pairs.
Sub-group: A tightly linked subset of the records of a group.
Sub-groups provide fine-grained information about how the records in a group
are linked.
Grouping: The process of assigning records to groups based on a
sequence of Input Pairs.
Grouped Record: A record key together with information about its
position within its assigned group: which sub-group it was assigned to, the
highest score seen for that record, and the record it linked to with that
score.
Grouped Pair: An input pair and the sub-groups of its two
records.
Primary Pair: A grouped pair that links two records within one
sub-group.
Cross-over Pair: A pair that links records across two sub-groups.
A cross-over pair always has a lower score than all the pairs in both
sub-groups.
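The distinction between a primary pair and a cross-over pair can be sketched as a comparison of sub-group assignments. The code below is a hypothetical illustration of the two definitions only: the class, the enum, the integer sub-group identifiers, and the sample assignments are all invented here and do not reflect the library's types or its output.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical illustration of the glossary terms; not the library's types. */
final class PairKinds {

    enum Kind { PRIMARY, CROSS_OVER }

    /**
     * Classifies a pair from the sub-group assignments of its two records:
     * same sub-group means primary, different sub-groups means cross-over.
     */
    static Kind classify(Map<String, Integer> subGroupOf, String first, String second) {
        Integer subGroupFirst = subGroupOf.get(first);
        Integer subGroupSecond = subGroupOf.get(second);
        if (subGroupFirst == null || subGroupSecond == null) {
            throw new IllegalArgumentException("record is not assigned to a sub-group");
        }
        return subGroupFirst.equals(subGroupSecond) ? Kind.PRIMARY : Kind.CROSS_OVER;
    }

    public static void main(String[] args) {
        // Invented assignments: r1 and r2 share sub-group 1, r3 is in sub-group 2.
        Map<String, Integer> subGroupOf = new HashMap<String, Integer>();
        subGroupOf.put("r1", 1);
        subGroupOf.put("r2", 1);
        subGroupOf.put("r3", 2);
        System.out.println(classify(subGroupOf, "r1", "r2")); // PRIMARY
        System.out.println(classify(subGroupOf, "r2", "r3")); // CROSS_OVER
    }
}
```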
| Package | Description |
|---|---|
| com.tibco.patterns.grouping |