Class QBPDedupe


  • public class QBPDedupe
    extends java.lang.Object
    A deduplication application that uses the Query Builder Platform (QBP) to define the query, combining QBP with the deduplication platform. The query is defined by a query definition configuration file, which is a configuration file for the GeneralCompoundQueryBuilder class. The query is used to deduplicate data stored on one or more target TIBCO Patterns - Search servers. The data is assumed to be already loaded on the target servers.

    The source data for the deduplication is defined using a RunQuery configuration file. Both the input and output table and field mappings are applied, but the target server information in the file is ignored.

    • Constructor Summary

      Constructors 
      Constructor Description
      QBPDedupe()  
    • Method Summary

      All Methods Static Methods Concrete Methods 
      Modifier and Type Method Description
      static void main(java.lang.String[] args)
      Run a deduplication.
      • Methods inherited from class java.lang.Object

        equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
    • Constructor Detail

      • QBPDedupe

        public QBPDedupe()
    • Method Detail

      • main

        public static void main(java.lang.String[] args)
        Run a deduplication. This runs a deduplication of a table or joined set of tables. It uses the deduplication framework to run the deduplication and the Query Builder Platform to define the query used in the deduplication. The input is defined separately from the targets to be deduplicated, so it is possible to deduplicate one set of records against another.

        The query is defined using the QBP QueryDef.xsd configuration file format. The QBP RunQuery.xsd configuration file is used to define the source for the query data. It also defines any table name and field name mappings for the source and target tables, as well as the cutoff score for the deduplication. The target servers themselves are specified as command line arguments. The target information in the run query configuration file is ignored.

        The output generated is a set of pair files that identify the matching records. If grouping is requested, a set of files is also generated that identifies all groups of records that represent the same entity.

        Java must have the deduplication jar file (TIB_tps_dedupe.jar), the grouping jar file (TIB_tps_grouping.jar), the QBP jar file (TIB_tps_qbp.jar) and the deduplication sample class files on its class path. The arguments are:

         
         java QBPDedupe -query-def query-def-file
                        [-annotated-def annotated-file]
                         -query-src run-query-file
                         { -host host:port[:threads] }+
                         -output-dir output-dir
                         [-batch-size batch-size]
                         [-do-grouping]
          Where: 
            query-def-file - is a query configuration file. A
                 RecordMatchingQueryDef element must be the top
                 level item.
            annotated-file - if this is given and there are errors in
                 the query configuration, the annotated definition file
                 is written to this file.  Otherwise it is not output.
            run-query-file - is a RunQuery configuration file as used
                 by the QBP system.  This defines the source for the
                 query data and table and field mappings for both the
                 input source and target servers.
            host:port[:threads] - is the IP address or URL and port
                 number of a TIBCO Patterns - Search server that holds
                 the data to be deduplicated.  If the optional threads
                 count is provided it is used as the number of query
                 threads to run against the server, otherwise the number
                 of query threads is set to the number of worker threads
                 on the server.  If multiple hosts are given, queries
                 are run against all of the hosts.
            output-dir - is an existing directory where all generated
                 data is stored.
            batch-size - the number of pairs in each deduplication batch.
                 Default is 1000.
            -do-grouping - if this is given a grouping report is run
                 against the result of the deduplication.
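
The host:port[:threads] format described above can be illustrated with a small parser. This is only a sketch of the argument format, not the code QBPDedupe itself uses: the HostSpec class and its members are hypothetical names, the example values are made up, and the sketch assumes a plain host name or IPv4 address (a host part that itself contains colons is not handled). An empty threads value stands for the documented default, the number of worker threads on the server.

```java
import java.util.OptionalInt;

// Hypothetical helper, for illustration only: models one host:port[:threads] argument.
final class HostSpec {
    final String host;
    final int port;
    final OptionalInt threads; // empty = use the server's worker thread count

    private HostSpec(String host, int port, OptionalInt threads) {
        this.host = host;
        this.port = port;
        this.threads = threads;
    }

    /** Parses "host:port" or "host:port:threads"; rejects anything else. */
    static HostSpec parse(String spec) {
        // Limit -1 keeps trailing empty fields, so "host:5051:" is rejected
        // by Integer.parseInt instead of being silently accepted.
        String[] parts = spec.split(":", -1);
        if (parts.length < 2 || parts.length > 3 || parts[0].isEmpty()) {
            throw new IllegalArgumentException("expected host:port[:threads], got: " + spec);
        }
        int port = Integer.parseInt(parts[1]);
        if (port < 1 || port > 65535) {
            throw new IllegalArgumentException("port out of range: " + port);
        }
        OptionalInt threads = OptionalInt.empty();
        if (parts.length == 3) {
            int n = Integer.parseInt(parts[2]);
            if (n < 1) {
                throw new IllegalArgumentException("threads must be positive: " + n);
            }
            threads = OptionalInt.of(n);
        }
        return new HostSpec(parts[0], port, threads);
    }

    public static void main(String[] args) {
        // Example values are placeholders, not real servers.
        for (String spec : new String[] {"localhost:5051", "localhost:5052:4"}) {
            HostSpec h = parse(spec);
            System.out.println(h.host + " port=" + h.port + " threads=" + h.threads);
        }
    }
}
```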
         
        Parameters:
        args - command line arguments as described above.
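
As an illustration of the argument list and class path requirements described above, the following sketch assembles a complete command line for a run. It is a minimal example under stated assumptions: the QBPDedupeLauncher class, the directory names, the configuration file names, and the host values are all hypothetical, and the output directory flag is assumed to be spelled -output-dir. Only the jar file names and the remaining flags are taken from the description above.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, for illustration only: builds a "java ... QBPDedupe ..." command.
final class QBPDedupeLauncher {

    /** Joins the three required jars and the sample class directory into a class path. */
    static String classPath(String libDir, String sampleClassDir) {
        return String.join(File.pathSeparator,
                libDir + File.separator + "TIB_tps_dedupe.jar",
                libDir + File.separator + "TIB_tps_grouping.jar",
                libDir + File.separator + "TIB_tps_qbp.jar",
                sampleClassDir);
    }

    /** Builds the command; each target server gets its own -host argument. */
    static List<String> buildCommand(String classPath, String queryDef, String runQuery,
                                     List<String> hosts, String outputDir,
                                     boolean doGrouping) {
        if (hosts.isEmpty()) {
            throw new IllegalArgumentException("at least one host is required");
        }
        List<String> cmd = new ArrayList<>(List.of(
                "java", "-cp", classPath, "QBPDedupe",
                "-query-def", queryDef,
                "-query-src", runQuery));
        for (String host : hosts) {
            cmd.add("-host");
            cmd.add(host);
        }
        cmd.add("-output-dir"); // ASSUMPTION: spelling of the output directory flag
        cmd.add(outputDir);     // must already exist, per the description above
        if (doGrouping) {
            cmd.add("-do-grouping");
        }
        return cmd;
    }

    public static void main(String[] args) {
        // All paths and hosts below are placeholders.
        List<String> cmd = buildCommand(
                classPath("lib", "classes"),
                "query-def.xml", "run-query.xml",
                List.of("localhost:5051", "localhost:5052:4"),
                "dedupe-out", true);
        System.out.println(String.join(" ", cmd));
        // To actually launch the run, something like:
        //   new ProcessBuilder(cmd).inheritIO().start().waitFor();
    }
}
```

Building the command as a list rather than a single string avoids quoting problems when paths contain spaces, and the same list can be handed directly to java.lang.ProcessBuilder.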