public class QBPDedupe
extends java.lang.Object
The source data for the deduplication is defined using a RunQuery configuration file. Both the input and output table and field mappings are applied, but the target server information in the file is ignored.
| Constructor and Description |
|---|
| QBPDedupe() |
| Modifier and Type | Method and Description |
|---|---|
| static void | main(java.lang.String[] args) Run a deduplication. |
public static void main(java.lang.String[] args)
The query is defined using the QBP QueryDef.xsd configuration file format. The QBP RunQuery.xsd configuration file is used to define the source for the query data. It also defines any table name and field name mappings for the source and the target tables and the cutoff score for the deduplication. The target servers themselves are specified as command line arguments. The target information in the run query configuration file is ignored.
The output is a set of pair files that identify the matching records. If grouping is requested, an additional set of files is generated that identifies all groups of records that represent the same entity.
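The relationship between matching pairs and groups can be illustrated with a small union-find sketch. It assumes that a group is the transitive closure of the matching pairs, which is an assumption: this page does not say how TIB_tps_grouping.jar forms its groups, nor what the pair file format is. The class name PairGrouper and the record keys are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative only, not part of the sample: groups record keys that are
// connected, directly or indirectly, by matching pairs (ASSUMED semantics).
final class PairGrouper {
    // Maps each record key to its parent; a root is its own parent.
    private final Map<String, String> parent = new HashMap<>();

    // Returns the representative key of the group containing key.
    private String find(String key) {
        parent.putIfAbsent(key, key);
        String root = key;
        while (!parent.get(root).equals(root)) {
            root = parent.get(root);
        }
        // Path compression: point every visited key directly at the root.
        String cur = key;
        while (!cur.equals(root)) {
            String next = parent.get(cur);
            parent.put(cur, root);
            cur = next;
        }
        return root;
    }

    // Records that keys a and b were reported as a matching pair.
    void addPair(String a, String b) {
        String rootA = find(a);
        String rootB = find(b);
        if (!rootA.equals(rootB)) {
            parent.put(rootA, rootB);
        }
    }

    // Returns the groups keyed by representative, members sorted by key.
    Map<String, List<String>> groups() {
        Map<String, List<String>> out = new TreeMap<>();
        for (String key : new ArrayList<>(parent.keySet())) {
            out.computeIfAbsent(find(key), k -> new ArrayList<>()).add(key);
        }
        for (List<String> members : out.values()) {
            Collections.sort(members);
        }
        return out;
    }
}
```

Under this assumption, adding the pairs (r1, r2) and (r2, r3) places r1, r2 and r3 in one group even though r1 and r3 were never reported as a pair.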
Java must have the deduplication jar file (TIB_tps_dedupe.jar), the grouping jar file (TIB_tps_grouping.jar), the QBP jar file (TIB_tps_qbp.jar) and the deduplication sample class files on its class path. The arguments are:
java QBPDedupe -query-def query-def-file
[-annotated-def annotated-file]
-query-src run-query-file
{ -host host:port[:threads] }+
-output-dir output-dir
[-batch-size batch-size]
[-do-grouping]
Where:
query-def-file - is a query configuration file. A
RecordMatchingQueryDef element must be the top
level item.
annotated-file - if this is given and there are errors in
the query configuration, the annotated definition file
is written to this file. Otherwise it is not output.
run-query-file - is a RunQuery configuration file as used
by the QBP system. This defines the source for the
query data and table and field mappings for both the
input source and target servers.
host:port[:threads] - is the IP address or URL and port
number of a TIBCO Patterns - Search server that holds
the data to be deduplicated. If the optional threads
count is provided it is used as the number of query
threads to run against the server, otherwise the number
of query threads is set to the number of worker threads
on the server. If multiple hosts are given, queries
are run against all of the hosts.
output-dir - is an existing directory where all generated
data is stored.
batch-size - the number of pairs in each deduplication batch.
Default is 1000.
-do-grouping - if this is given a grouping report is run
against the result of the deduplication.
args - command line arguments as described above.
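The host:port[:threads] form described above can be handled as in the following sketch. HostSpec, its fields and the -1 marker are hypothetical names chosen for illustration, not part of the sample. The sketch does not handle IPv6 literals, and it does not contact a server, so the fallback to the server's worker thread count is only signalled, not resolved.

```java
// Illustrative only, not part of the sample: parses one -host value of the
// form host:port[:threads].
final class HostSpec {
    // Marker meaning "no thread count given". Per the description above, the
    // caller would then use the number of worker threads on the server.
    static final int THREADS_NOT_GIVEN = -1;

    final String host;
    final int port;
    final int threads;

    private HostSpec(String host, int port, int threads) {
        this.host = host;
        this.port = port;
        this.threads = threads;
    }

    static HostSpec parse(String value) {
        // A limit of -1 keeps trailing empty fields, so "host:1234:" is
        // rejected by Integer.parseInt instead of being silently accepted.
        String[] parts = value.split(":", -1);
        if (parts.length < 2 || parts.length > 3 || parts[0].isEmpty()) {
            throw new IllegalArgumentException(
                "expected host:port[:threads], got: " + value);
        }
        // NumberFormatException is a subclass of IllegalArgumentException.
        int port = Integer.parseInt(parts[1]);
        if (port < 1 || port > 65535) {
            throw new IllegalArgumentException("port out of range: " + port);
        }
        int threads = THREADS_NOT_GIVEN;
        if (parts.length == 3) {
            threads = Integer.parseInt(parts[2]);
            if (threads < 1) {
                throw new IllegalArgumentException(
                    "threads must be positive: " + threads);
            }
        }
        return new HostSpec(parts[0], port, threads);
    }
}
```

For example, HostSpec.parse("localhost:5051:4") yields a spec with four query threads, while HostSpec.parse("localhost:5051") leaves threads at THREADS_NOT_GIVEN. The port number 5051 is an arbitrary example value.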