ibi™ Patterns - Search Grouping Library Overview

The ibi Patterns - Search Grouping Library is a Java library which organizes links between key values. This overview will cover what is included in the libary, how to install it, and basic usage.

Package Contents

The ibi Patterns - Search Grouping Library is delivered as a JAR file (<install-home>/grouping/lib/TIB_tps_grouping.jar). A sample program is included in <install-home>/grouping/sample.

Installation

Copy the jar file to a location of your choice and add an entry to your Java class path for the jar file.

To run the sample program copy the executable jar file from the installed sample bin directory and the sample data file sample_1_intput.csv from the sample data directory to a directory of your choice. In that directory run:

  java -jar sample.jar

It will output some statistics to the standard out and create three CSV files.

The library is compiled to be compatible with Java release 1.8 or higher. Your Java Virtual Machine and compiler must support release 1.8 or higher in order to use this interface.


Library Overview

The ibi Patterns - Search Grouping Library organizes keys into groups using links.

  • A key is just an identifier, such as a record key.
  • A link is two key values and a score from 0.0 to 1.0.
  • A group is set of keys which:
    • are all associated, that is, one can follow links from any key in the group to any other key in the group
    • is self-contained, that is, there are no links to keys outside the group.

The ibi Patterns - Search Grouping Library selects the highest-scoring links which organize the keys into groups. Excess links are discarded. Applications which need to process the links can benefit greatly from this data reduction.

The ibi Patterns - Search Grouping Library can handle data sets from the small to the exceptionally large, up to billions of links. It can easily group tens of millions of links within seconds.


A Common Use-Case

A common source of links is a de-duplication. De-duplication is the process of finding duplicates in a set of records. The usual method is to use each record to form a query into the set of records. Each query returns a set of records similar to the query record. To each returned record a score is attached indicating the degree of similarity. This method produces complete results, but has two major drawbacks:

First, the volume of data is produced by a deduplication is usually unwieldy. To see why, suppose there are 101 records which will end up being related. Each of these 100 records will form a query, and each query will may return up to 100 results; the total results number as high as 10100. Expand this to a data set of millions of records with many large sets of related records, and the volume of results grows very quickly. Using these results, e.g. for reports or database cleansing can be very difficult just due to data volume.

Second, determining the proper order to process record relationships is difficult because the linkage between records can be very complex. A very simple example: A is linked to B with score 0.75, W is linked to A with score 0.73, R is linked to B with score 0.65 B is linked to F with score 0.80, F is linked to R with score 0.78, and A is linked to R with score 0.80. Even on this very small set of five related records the proper processing order isn't immediately obvious. Additionally, the relationships relevent to processing a set of records could be scattered widely with the complete set of relationships, making it difficult to even detect which relationships belong together.

The ibi Patterns - Search Grouping Library addresses both these problems. Given the results of a deduplication, it outputs sets of related record keys, ordered by decreasing similarity. For example, from the 10100 pairs in the first example above, the 101 records are output in the order they should be handled, with only 100 links; a 99% reduction.

When the deduplication is on a join-set, associateKeys() should be called using the parent-record keys not the full compound-key. Also, linkId should be populated, as reporting likely requires both compound records. The ibi Patterns - Search Grouping Library is designed to fit naturally with output from ibi Patterns - Search queries, including queries that use a machine learning model, and the ibi Patterns - Search Deduplication Library.

Library organization
The GroupingEngine object is the core of the ibi Patterns - Search Grouping Library . Tell it the all the key assocications, and it will produce an iterator of the groups. The application is free to determine where key associations come from and where output is saved. A utility class is provided to place output in standard CSV files.


Physical Requirements
Memory: The ibi Patterns - Search Grouping Library operates entirely in physical memory. Precise memory usage depends on several factors (number of redundant links, average group size, etc.). As a general guideline, multiply the number of links by 400 bytes. For example, 20M links should take no more than 8GB.
CPU: The ibi Patterns - Search Grouping Library can run with just a single core available, although it will benefit from the availability of a second core.


Data Privacy
The ibi Patterns - Search Grouping Library operates only on keys and scores. As long as the keys contain no sensitive data, it has no particular privacy requirements.


Changes from 1.0
The ibi Patterns - Search Grouping Library version 2.0 has several major changes:

  • The 2.0 API is streamlined and easier to use.
  • Links (pairs) no longer need to be sorted. Version 1.0 required links in descending score order.
  • Scores are handled only to a specified precision.
  • Scores must be between 0.0 and 1.0 inclusive. Version 1.0 allowed scores outside this range.
  • Version 2.0 has much higher throughput, and can handle larger data-sets.

Tips

  • Do a trial run on a small section of data using the StandardOutput utility class, and the sample program as a template.
  • When running a large data-set, remember to estimate memory usage and give appropriate memory arguments to the Java Virtual Machine.

Glossary

Key: For the ibi Patterns - Search Grouping Library, a key is just an arbitrary string of characters. Keys may contain commas, backslashes, etc., although appropriate care should be taken when reading/writing such values to/from CSV files.

Record: In most applications, keys typically identify database records. The ibi Patterns - Search Grouping Library only uses keys; responsibility for linking keys back to other data systems lies with the application.

Link: Two keys and a score.

Score: A floating point value. Scores must be between 0.0 and 1.0 inclusive.

Group: A set of keys associated by links. Every pair of keys in a group can be associated by following a trail of links. Groups are self-contained; keys in a group are not linked outside the group.

Sub-group A tightly linked subset of records of a group. Sub-groups provide fine-grained information about how the records in a group are linked.

Leader: The key in a sub-group with the highest link-score.

Grouping: The process of assigning records to groups based on links.

Key Info: A key together with information about it's position with it's assigned group: which subgroup it was assigned to, the highest score seen for that key, and the key it's linked to with that score.

Link Info: A link and the subgroups of it's two keys.

Primary Link: A link between two keys within a sub-group.

Cross-over Link: A link between keys in different subgroups.

Packages 
Package Description
com.tibco.patterns.grouping