QBPDedupe (TIBCO Patterns Deduplication Sample Usage)

java.lang.Object
- QBPDedupe

```
public class QBPDedupe
extends java.lang.Object
```
A deduplication application that uses the Query Builder Platform (QBP) to define the query, combining QBP with the deduplication platform. The query is defined by a query definition configuration file, which is a configuration file for the GeneralCompoundQueryBuilder class. The query is used to deduplicate data stored on one or more target TIBCO Patterns - Search servers. The data is assumed to be already loaded on the target servers.
The source data for the deduplication is defined using a RunQuery configuration file. Both the input and output table and field mappings are applied, but the target server information in the file is ignored.

- Constructor Summary
  
  Constructors
  Constructor and Description
  
  QBPDedupe()
- Method Summary
  
  All Methods Static Methods Concrete Methods
  Modifier and Type Method and Description
  
  static void main(java.lang.String[] args)
  Run a deduplication.
  - Methods inherited from class java.lang.Object
    equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Constructor Detail
- QBPDedupe
```
public QBPDedupe()
```

Method Detail

main

public static void main(java.lang.String[] args)

Run a deduplication. This runs a deduplication of a table or joined set of tables. It uses the deduplication framework to run the deduplication and the Query Builder Platform to define the query used in the deduplication. The input is defined separately from the targets to be deduplicated, so it is possible to deduplicate one set of records against another.

The query is defined using the QBP QueryDef.xsd configuration file format. The QBP RunQuery.xsd configuration file is used to define the source for the query data. It also defines any table name and field name mappings for the source and the target tables and the cutoff score for the deduplication. The target servers themselves are specified as command line arguments. The target information in the run query configuration file is ignored.

The output generated is a set of pair files that identify the matching records. If grouping is requested, a set of files is also generated that identifies all groups of records that represent the same entity.
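The relationship between matching pairs and groups can be illustrated with a small union-find sketch. This is not the algorithm used by the grouping jar; it assumes, purely for illustration, that a group is the set of records connected through any chain of matching pairs. The class name `PairGroupingSketch` and the record keys `r1` to `r5` are hypothetical, and the format of the pair files is not modeled here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Illustrative only: merges record keys linked by matching pairs into groups. */
final class PairGroupingSketch {

    /** Returns one sorted list of record keys per connected group of pairs. */
    static List<List<String>> group(List<String[]> pairs) {
        // parent maps each record key to another key in the same group;
        // a key that maps to itself is the representative of its group.
        Map<String, String> parent = new HashMap<>();
        for (String[] pair : pairs) {
            String rootA = find(parent, pair[0]);
            String rootB = find(parent, pair[1]);
            if (!rootA.equals(rootB)) {
                parent.put(rootA, rootB); // merge the two groups
            }
        }
        // Collect members under their representative, ordered by representative key.
        Map<String, List<String>> byRoot = new TreeMap<>();
        for (String key : new ArrayList<>(parent.keySet())) {
            byRoot.computeIfAbsent(find(parent, key), k -> new ArrayList<>()).add(key);
        }
        List<List<String>> groups = new ArrayList<>();
        for (List<String> members : byRoot.values()) {
            Collections.sort(members);
            groups.add(members);
        }
        return groups;
    }

    /** Finds the representative of a key's group, compressing the path on the way. */
    private static String find(Map<String, String> parent, String key) {
        parent.putIfAbsent(key, key);
        String root = key;
        while (!root.equals(parent.get(root))) {
            root = parent.get(root);
        }
        String current = key;
        while (!current.equals(root)) {
            String next = parent.get(current);
            parent.put(current, root);
            current = next;
        }
        return root;
    }

    public static void main(String[] args) {
        // Hypothetical matching pairs: r1~r2, r2~r3, r4~r5.
        List<String[]> pairs = Arrays.asList(
                new String[] {"r1", "r2"},
                new String[] {"r2", "r3"},
                new String[] {"r4", "r5"});
        System.out.println(group(pairs));
    }
}
```

In this sketch, r1 and r3 end up in the same group even though they were never paired directly, which is the practical difference between a pair listing and a group listing.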

Java must have the deduplication jar file (TIB_tps_dedupe.jar), the grouping jar file (TIB_tps_grouping.jar), the QBP jar file (TIB_tps_qbp.jar) and the deduplication sample class files on its class path. The arguments are:

 
 java QBPDedupe -query-def query-def-file
                [-annotated-def annotated-file]
                 -query-src run-query-file
                 { -host host:port[:threads] }+
                 -output-dir output-dir
                 [-batch-size batch-size]
                 [-do-grouping]
  Where:
    query-def-file - is a query configuration file. A
         RecordMatchingQueryDef element must be the top
         level item.
    annotated-file - if this is given and there are errors in
         the query configuration, the annotated definition
         is written to this file.  Otherwise it is not output.
    run-query-file - is a RunQuery configuration file as used
         by the QBP system.  This defines the source for the
         query data and table and field mappings for both the
         input source and target servers.
    host:port[:threads] - is the IP address or URL and port
         number of a TIBCO Patterns - Search server that holds
         the data to be deduplicated.  If the optional threads
         count is provided it is used as the number of query
         threads to run against the server, otherwise the number
         of query threads is set to the number of worker threads
         on the server.  If multiple hosts are given, queries
         are run against all of the hosts.
    output-dir - is an existing directory where all generated
         data is stored.
    batch-size - the number of pairs in each deduplication batch.
         Default is 1000.
    -do-grouping - if this is given a grouping report is run
         against the result of the deduplication.
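
The `host:port[:threads]` form described above can be parsed along the following lines. This is a minimal sketch, not the parsing code used by QBPDedupe: the class name `HostSpecSketch`, the use of `-1` to mean "threads not given", and the example port 5051 are all assumptions made for illustration. It splits on every colon, so it does not handle IPv6 literals or URLs that include a scheme.

```java
/** Illustrative only: holds the parts of one host:port[:threads] argument. */
final class HostSpecSketch {
    final String host;
    final int port;
    final int threads; // -1 means no thread count was given

    private HostSpecSketch(String host, int port, int threads) {
        this.host = host;
        this.port = port;
        this.threads = threads;
    }

    /** Parses "host:port" or "host:port:threads"; throws IllegalArgumentException otherwise. */
    static HostSpecSketch parse(String spec) {
        // Limit -1 keeps trailing empty parts, so "host:5051:" is rejected below.
        String[] parts = spec.split(":", -1);
        if (parts.length < 2 || parts.length > 3 || parts[0].isEmpty()) {
            throw new IllegalArgumentException("expected host:port[:threads], got: " + spec);
        }
        // NumberFormatException is a subclass of IllegalArgumentException.
        int port = Integer.parseInt(parts[1]);
        if (port < 1 || port > 65535) {
            throw new IllegalArgumentException("port out of range: " + port);
        }
        int threads = -1;
        if (parts.length == 3) {
            threads = Integer.parseInt(parts[2]);
            if (threads < 1) {
                throw new IllegalArgumentException("threads must be positive: " + threads);
            }
        }
        return new HostSpecSketch(parts[0], port, threads);
    }

    public static void main(String[] args) {
        // Hypothetical values; a real run would take these from the -host arguments.
        HostSpecSketch spec = parse("localhost:5051:4");
        System.out.println(spec.host + " " + spec.port + " " + spec.threads);
    }
}
```

When the thread count is absent, a caller would fall back to the number of worker threads on the server, as described for the `-host` argument.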

Parameters: args - command line arguments as described above.
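
The class path requirement stated before the argument list can be assembled in a platform-independent way. The sketch below only builds the class path string from the three jar names given above; the class name `DedupeClassPathSketch` and the directory names `lib` and `classes` are hypothetical placeholders for wherever the jars and the compiled sample classes actually reside.

```java
import java.io.File;
import java.util.Arrays;
import java.util.List;

/** Illustrative only: builds a class path containing the jars QBPDedupe needs. */
final class DedupeClassPathSketch {

    /**
     * Joins the deduplication, grouping, and QBP jars plus the directory of
     * sample class files using the platform's path separator.
     */
    static String classPath(String jarDir, String sampleClassesDir) {
        List<String> entries = Arrays.asList(
                jarDir + File.separator + "TIB_tps_dedupe.jar",
                jarDir + File.separator + "TIB_tps_grouping.jar",
                jarDir + File.separator + "TIB_tps_qbp.jar",
                sampleClassesDir);
        return String.join(File.pathSeparator, entries);
    }

    public static void main(String[] args) {
        // "lib" and "classes" are assumed locations, not taken from the product layout.
        System.out.println(classPath("lib", "classes"));
    }
}
```

The resulting string can be passed to `java -cp`, followed by `QBPDedupe` and the arguments described above.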
