TIBCO EBX®
Match and Merge Documentation > User Guide
Navigation modeMatch and Merge Documentation > User Guide

Data Cleansing

Introduction

The cleansing process involves detecting and correcting inaccurate or missing data, and deleting obsolete records. You can activate profiling mode to create reports on data quality defects. Based on this report, you can execute the appropriate cleansing procedures to improve data quality.

Enabling cleansing

To enable the data cleansing feature for a table, its Use cleansing property needs to be set to Yes. This property is available in the 'Table' domain located in the TIBCO EBX® Match and Merge Add-on dataspace under the Administration tab. You also need to configure a Cleansing policy containing one or many Cleansing procedures. Once you have completed configuration, the Profiling and Cleansing buttons display in the UI, as illustrated below.

/profiling_and_cleansing_buttons.png

Configuration

The 'Cleansing' domain is used to configure cleansing procedures and policies applied to tables and is available from the TIBCO EBX® Match and Merge Add-on dataspace located under the EBX® 'Administration' tab.

/Cleansing_configuration_1.png

Cleansing policy

The 'Cleansing policy' table displays a list of cleansing policies and their corresponding tables. From this table, you can specify whether or not a cleansing policy is active on a given table. Additionally, you can associate a cleansing policy with a context by specifying a dataset and a dataspace.

The same table can be configured multiple times with different cleansing policies and contexts. But, only one cleansing policy can be active per context.

The 'Cleansing policy content' table lets you choose which cleansing procedures are attached to a cleansing policy.

Source (logical name: CleansingPolicy).

Properties

Definition

Code

Any naming convention is possible.

Name

Any name is possible.

Short description

Description of the cleaning policy

Long description

Long description of the cleaning policy

Table

Table on which the cleansing policy is configured.

Is active

If 'Yes': the cleansing policy is used by the add-on.

If 'No': the cleansing policy is not used by the add-on.

Is context

If 'Yes': the cleansing policy requires a context and you must use the 'Data life cycle context' property to specify the context applied to this cleansing policy.

If 'No': the cleansing policy does not require on a context.

Data life cycle context

The context used by this cleansing policy. When 'Is context' is set to 'Yes', then you provide a context by specifying a dataspace and dataset.

Table 37: Cleansing policy

Cleansing procedure on table

This feature allows you to associate cleansing procedures with specific tables and fields.

Source (logical name: CleansingProcedureOnTable).

Properties

Definition

Cleansing procedure definition

A drop-down list that contains the available cleansing procedures. Any cleansing procedures listed with the prefix of '[ON]' are provided by the add-on as ready-to-use procedures. You can develop and customize your own cleansing procedures via the API.

Table

Table on which the cleansing procedure applies.

Field

Field in the specified 'Table' on which the cleansing procedure applies.

Additional field(s)

When the cleansing procedure is able to manage more than one field, then the additional fields can be specified here.

Table 38: Cleansing procedure on table

Cleansing policy content

The 'Cleansing policy content' table specifies which cleansing procedures are attached to cleansing policies.

Source (logical name: cleansingPolicyContent).

Properties

Definition

Cleansing policy

Reference to a cleansing policy.

Cleansing procedure on table

Reference to a cleansing procedure that has been defined on a table.

Table 39: Cleansing policy content

Cleansing procedure

The 'Cleansing procedure' table lists fully defined cleansing procedures. Cleansing procedures can be defined by referencing a cleansing procedure implementation that has previously been published. Although certain predefined cleansing procedures are provided with the EBX® Match and Merge Add-on, you can use the API to implement your own bespoke cleansing procedures.

Source (logical name: cleansingProcedureDefinition).

Properties

Definition

Code

Any naming convention can be used except the prefix '[ON]' which is reserved by the add-on for predefined cleansing procedures.

Name

Any naming convention can be used.

Cleansing procedure

List of implemented cleansing procedures that have been published through the API provided by the add-on.

You can create multiple cleansing procedures based on the same cleansing procedure implementation. For instance, the '[ON] Missing fields values' cleansing procedure can be used to create two definitions of cleansing procedures with different values for the input parameters.

Input parameters

A list of customized parameters. The available parameters depend on the cleansing procedure configuration.

D.E.C. type

D.E.C. on which the cleansing procedure is applied. For example, this can be a 'Table' or a 'Field'.

Can be used in profiling procedure

If 'Yes': the cleansing procedure can be used in the profiling operation.

If 'No': the cleansing procedure is not usable for the profiling operation.

Can be used for many tables

If 'Yes': the cleansing procedure can be attached to many tables at one time.

If 'No': the cleansing procedure cannot be attached to many tables at one time.

The 'cleansing procedure on table' feature allows you to attach a cleansing procedure to a table.

Can be used for many fields

If 'Yes': the cleansing procedure can be attached to multiple fields simultaneously.

If 'No': the cleansing procedure cannot be attached to multiple fields simultaneously.

The 'cleansing procedure on table' feature allows you to attach a cleansing procedure to a field.

Save last value only

You can either delete the existing results in the Cleansing report tables or keep them over time to make possible data analytics processes.

If 'Yes': All existing Cleansing results for this Cleansing procedure and current table are deleted before the procedure is executed.

If 'No': Keep the previous results in the Cleansing report tables. The execution of the procedure enriches the results over time and allows you to apply data analytics processes.

Cleansing operation definition

Every cleansing procedure can publish several 'Cleansing operation(s)' that are used to fix data quality defects. For instance, the cleansing procedure '[ON] Missing fields values' publishes two cleansing operations that enable management of missing data values. The "Replace all" operation allows you to enter a value that will be used to fill all missing fields simultaneously. The "Replace manually" operation allows you to enter missing values one at a time.

Name

Name of the cleansing operation that displays in the UI.

Cleansing operation

Name of the cleansing operation coming from the implemented procedure.

Input parameters

A list of customized parameters. The available parameters depend on the cleansing procedure configuration.

Is matching active

If 'Yes': matching is executed each time a record has been fixed.

If 'No': matching is not executed after a record has been fixed.

Table 40: Cleansing procedure

Cleansing procedure view

Every cleansing procedure is associated to a data view name that is used to filter the 'Big data report cleansing' table.

Source (logical name: cleansingProcedureView).

Properties

Definition

Cleansing report storage

dataspace and dataset inside which the cleansing report is stored

Cleansing procedure

Reference to the cleansing procedure.

Data view

Name of the data view (publication name) used to filter the results of this cleansing procedure in the table 'Big data report cleansing'.

Table 41: Cleansing procedure view

Cleansing report storage

dataspace and dataset where the cleansing report data is stored

Source (logical name: cleansingReportStorage).

Properties

Definition

Code

Code of the cleansing report storage.

dataspace

dataspace where is located the cleansing report storage

dataset

dataset where is located the cleansing report storage

Is active

Use by default when 'Is active' = 'Yes'. Only one storage can be active at a same time

Table 42: Cleansing report storage

It is possible to declare other custom cleansing report storage environments by following this procedure:

Predefined cleansing procedures

The ready-to-use cleansing procedures are listed in the following table. You also can create your own bespoke cleansing procedures using API.

Name

Description

Deprecated records

The 'Deprecated records' procedure allows you to get the list of the deprecated records (based on a time configuration) and decide to delete or keep the records.

The way to detect a deprecated record is based on the following policies (time configuration):

  • Duration between the last modification date of the record and the current date (in days)

  • Comparison from the last modification date of the record and a selected date (Input parameters → To this date).

  • Comparison from the last modification date of the record and a computed date using a Java class (Input parameters → Java class).

Unused records

The 'Unused records' procedure provides you with a set of rules to detect the unused records and process them (delete, keep, etc.). An unused record is not referenced by any other record in the repository.

Foreign key fixing

The broken foreign keys are automatically identified and a UI allows you to fix the defective relationships.

Missing fields values

The 'Missing fields values' identifies all missing values for the configured fields. A missing value is either an empty value or any other string values that are configured depending on the need. To clean up the missing values, two cleansing operations are provided: 'Replace all' by a same value and 'Replace manually' to decide the value for every missing field

Table 43: Predefined cleansing procedures

Profiling

Clicking the 'Profiling' button executes cleansing procedures-in profiling mode-and generates a report containing table data quality defects. From this report, you can execute cleansing procedures to fix these defects. The following figure illustrates this process:

/profiling_cleansing_result.png

Operations

Clicking the 'Cleansing' button displays a list of the available cleansing procedures enabled for the current table.

/list_cleansing_procedures.png

Special notation:

/100002010000000E0000000E3C009B93.png

In the current version of the add-on the following cleansing procedures are provided: 'Missing fields values', 'Foreign key fixing', 'Unused records', 'Deprecated records'. You can add your own cleansing procedures by using the API provided with the add-on.

Result storage

A dedicated dataspace is used to store and separate cleansing results by type (profiling and cleansing).

/Result_storage_reporting.png

These tables are pre-pended as 'Big data' with flat and generic data structures. You can reuse them no matter what cleansing procedure output parameters are used.

Other tables are used to save detailed results by cleansing procedure, such as the list of unused records.

A data view can be configured to display data that is relevant to a specific cleansing procedure. An example of this using the 'Missing fields values' cleansing procedure is shown below.

/cleansing_result_view.png

Special notation:

/100002010000000E0000000E3C009B93.png

You can set the 'Big data report cleansing' and 'Big data report profiling' tables to keep only the most recent results, or to store a history of results. This behavior is determined by the 'Save last value only' property contained in the 'Cleansing procedure' table.

Example

In this section, you will find the examples of cleansing operations.

Cleansing - missing field values

In the following example, the pre-defined cleansing procedure '[ON] F-CP1 - Missing fields values' was configured to fix records with an empty value in the 'Long label' field.

/missing_fields_values_configuration.png

The 'Missing fields values' option displays in the matching view under the 'Cleansing' drop-down list and allows you to execute the cleansing procedure. Once you enter the cleansing screen, the add-on generates a report containing table data quality defects with a total number of records to fix.

/missing_fields_values_result.png

There are two ways to fix the records:

By executing the 'Replace manually' option, you can narrow your focus by selecting records in the set of records to fix. You can remove the filter and come back to the screen to work on all records using 'Deselect all'.

/replace_manually_1.png

A preview button, available at the end of 'Record link' field, allows you to update the missing field directly in the record view. The 'Replace by this value' updates with the newly added value for the missing field.

/replace_manually_2.png

Instead of entering a value to replace one at a time for each record, you can click on 'Save and apply same on next' button to open the next record with the 'Replace by this value' populated with the previous entry.