The cleansing process involves detecting and correcting inaccurate or missing data, and deleting obsolete records. You can activate profiling mode to create reports on data quality defects. Based on this report, you can execute the appropriate cleansing procedures to improve data quality.
To enable the data cleansing feature for a table, its Use cleansing property needs to be set to Yes. This property is available in the 'Table' domain located in the TIBCO EBX® Match and Merge Add-on dataspace under the Administration tab. You also need to configure a Cleansing policy containing one or many Cleansing procedures. Once you have completed configuration, the Profiling and Cleansing buttons display in the UI, as illustrated below.
The 'Cleansing' domain is used to configure cleansing procedures and policies applied to tables and is available from the TIBCO EBX® Match and Merge Add-on dataspace located under the EBX® 'Administration' tab.
The 'Cleansing policy' table displays a list of cleansing policies and their corresponding tables. From this table, you can specify whether or not a cleansing policy is active on a given table. Additionally, you can associate a cleansing policy with a context by specifying a dataset and a dataspace.
The same table can be configured multiple times with different cleansing policies and contexts. But, only one cleansing policy can be active per context.
The 'Cleansing policy content' table lets you choose which cleansing procedures are attached to a cleansing policy.
Source (logical name: CleansingPolicy).
Properties | Definition |
---|---|
Code | Any naming convention is possible. |
Name | Any name is possible. |
Short description | Description of the cleaning policy |
Long description | Long description of the cleaning policy |
Table | Table on which the cleansing policy is configured. |
Is active | If 'Yes': the cleansing policy is used by the add-on. If 'No': the cleansing policy is not used by the add-on. |
Is context | If 'Yes': the cleansing policy requires a context and you must use the 'Data life cycle context' property to specify the context applied to this cleansing policy. If 'No': the cleansing policy does not require on a context. |
Data life cycle context | The context used by this cleansing policy. When 'Is context' is set to 'Yes', then you provide a context by specifying a dataspace and dataset. |
Table 37: Cleansing policy
This feature allows you to associate cleansing procedures with specific tables and fields.
Source (logical name: CleansingProcedureOnTable).
Properties | Definition |
---|---|
Cleansing procedure definition | A drop-down list that contains the available cleansing procedures. Any cleansing procedures listed with the prefix of '[ON]' are provided by the add-on as ready-to-use procedures. You can develop and customize your own cleansing procedures via the API. |
Table | Table on which the cleansing procedure applies. |
Field | Field in the specified 'Table' on which the cleansing procedure applies. |
Additional field(s) | When the cleansing procedure is able to manage more than one field, then the additional fields can be specified here. |
Table 38: Cleansing procedure on table
The 'Cleansing policy content' table specifies which cleansing procedures are attached to cleansing policies.
Source (logical name: cleansingPolicyContent).
Properties | Definition |
---|---|
Cleansing policy | Reference to a cleansing policy. |
Cleansing procedure on table | Reference to a cleansing procedure that has been defined on a table. |
Table 39: Cleansing policy content
The 'Cleansing procedure' table lists fully defined cleansing procedures. Cleansing procedures can be defined by referencing a cleansing procedure implementation that has previously been published. Although certain predefined cleansing procedures are provided with the EBX® Match and Merge Add-on, you can use the API to implement your own bespoke cleansing procedures.
Source (logical name: cleansingProcedureDefinition).
Properties | Definition |
---|---|
Code | Any naming convention can be used except the prefix '[ON]' which is reserved by the add-on for predefined cleansing procedures. |
Name | Any naming convention can be used. |
Cleansing procedure | List of implemented cleansing procedures that have been published through the API provided by the add-on. You can create multiple cleansing procedures based on the same cleansing procedure implementation. For instance, the '[ON] Missing fields values' cleansing procedure can be used to create two definitions of cleansing procedures with different values for the input parameters. |
Input parameters | A list of customized parameters. The available parameters depend on the cleansing procedure configuration. |
D.E.C. type | D.E.C. on which the cleansing procedure is applied. For example, this can be a 'Table' or a 'Field'. |
Can be used in profiling procedure | If 'Yes': the cleansing procedure can be used in the profiling operation. If 'No': the cleansing procedure is not usable for the profiling operation. |
Can be used for many tables | If 'Yes': the cleansing procedure can be attached to many tables at one time. If 'No': the cleansing procedure cannot be attached to many tables at one time. The 'cleansing procedure on table' feature allows you to attach a cleansing procedure to a table. |
Can be used for many fields | If 'Yes': the cleansing procedure can be attached to multiple fields simultaneously. If 'No': the cleansing procedure cannot be attached to multiple fields simultaneously. The 'cleansing procedure on table' feature allows you to attach a cleansing procedure to a field. |
Save last value only | You can either delete the existing results in the Cleansing report tables or keep them over time to make possible data analytics processes. If 'Yes': All existing Cleansing results for this Cleansing procedure and current table are deleted before the procedure is executed. If 'No': Keep the previous results in the Cleansing report tables. The execution of the procedure enriches the results over time and allows you to apply data analytics processes. |
Cleansing operation definition Every cleansing procedure can publish several 'Cleansing operation(s)' that are used to fix data quality defects. For instance, the cleansing procedure '[ON] Missing fields values' publishes two cleansing operations that enable management of missing data values. The "Replace all" operation allows you to enter a value that will be used to fill all missing fields simultaneously. The "Replace manually" operation allows you to enter missing values one at a time. | |
Name | Name of the cleansing operation that displays in the UI. |
Cleansing operation | Name of the cleansing operation coming from the implemented procedure. |
Input parameters | A list of customized parameters. The available parameters depend on the cleansing procedure configuration. |
Is matching active | If 'Yes': matching is executed each time a record has been fixed. If 'No': matching is not executed after a record has been fixed. |
Table 40: Cleansing procedure
Every cleansing procedure is associated to a data view name that is used to filter the 'Big data report cleansing' table.
Source (logical name: cleansingProcedureView).
Properties | Definition |
---|---|
Cleansing report storage | dataspace and dataset inside which the cleansing report is stored |
Cleansing procedure | Reference to the cleansing procedure. |
Data view | Name of the data view (publication name) used to filter the results of this cleansing procedure in the table 'Big data report cleansing'. |
Table 41: Cleansing procedure view
dataspace and dataset where the cleansing report data is stored
Source (logical name: cleansingReportStorage).
Properties | Definition |
---|---|
Code | Code of the cleansing report storage. |
dataspace | dataspace where is located the cleansing report storage |
dataset | dataset where is located the cleansing report storage |
Is active | Use by default when 'Is active' = 'Yes'. Only one storage can be active at a same time |
Table 42: Cleansing report storage
It is possible to declare other custom cleansing report storage environments by following this procedure:
Create a data model with any names.
Include the data model 'ebx-addon-daqa-cleansing-report-metadata.xsd' from module 'ebx-addon-daqa'.
Create a table with the physical name 'BigDataCleansingReport' (this name cannot be changed).
Inside the BigDataCleansingReport table, create two groups 'CleansingReportGenericFields' and 'CleansingProcedureFields'. The name of these two groups cannot be changed. All the fields inside these two groups are reused from the included data model 'ebx-addon-daqa-cleansing-report-metadata.xsd'.
Create another table with the physical name 'BigDataProfilingReport' to store profiling data (this name cannot be changed).
Inside the table BigDataProfilingReport, create two groups 'CleansingReportGenericFields' and 'ProfilingFields'. The name of these two groups cannot be changed. All the fields inside these two groups are reused from the included data model 'ebx-addon-daqa-cleansing-report-metadata.xsd'.
Publish this data model to any dataspace.
Go to the TIBCO EBX® Match and Merge Add-on dataset in the Administration part, Cleansing report storage table, and declare the new created dataset as new storage.
The ready-to-use cleansing procedures are listed in the following table. You also can create your own bespoke cleansing procedures using API.
Name | Description |
---|---|
Deprecated records | The 'Deprecated records' procedure allows you to get the list of the deprecated records (based on a time configuration) and decide to delete or keep the records. The way to detect a deprecated record is based on the following policies (time configuration):
|
Unused records | The 'Unused records' procedure provides you with a set of rules to detect the unused records and process them (delete, keep, etc.). An unused record is not referenced by any other record in the repository. |
Foreign key fixing | The broken foreign keys are automatically identified and a UI allows you to fix the defective relationships. |
Missing fields values | The 'Missing fields values' identifies all missing values for the configured fields. A missing value is either an empty value or any other string values that are configured depending on the need. To clean up the missing values, two cleansing operations are provided: 'Replace all' by a same value and 'Replace manually' to decide the value for every missing field |
Table 43: Predefined cleansing procedures
Clicking the 'Profiling' button executes cleansing procedures-in profiling mode-and generates a report containing table data quality defects. From this report, you can execute cleansing procedures to fix these defects. The following figure illustrates this process:
Clicking the 'Cleansing' button displays a list of the available cleansing procedures enabled for the current table.
Special notation: | |
---|---|
In the current version of the add-on the following cleansing procedures are provided: 'Missing fields values', 'Foreign key fixing', 'Unused records', 'Deprecated records'. You can add your own cleansing procedures by using the API provided with the add-on. |
A dedicated dataspace is used to store and separate cleansing results by type (profiling and cleansing).
These tables are pre-pended as 'Big data' with flat and generic data structures. You can reuse them no matter what cleansing procedure output parameters are used.
Other tables are used to save detailed results by cleansing procedure, such as the list of unused records.
A data view can be configured to display data that is relevant to a specific cleansing procedure. An example of this using the 'Missing fields values' cleansing procedure is shown below.
Special notation: | |
---|---|
You can set the 'Big data report cleansing' and 'Big data report profiling' tables to keep only the most recent results, or to store a history of results. This behavior is determined by the 'Save last value only' property contained in the 'Cleansing procedure' table. |
In this section, you will find the examples of cleansing operations.
In the following example, the pre-defined cleansing procedure '[ON] F-CP1 - Missing fields values' was configured to fix records with an empty value in the 'Long label' field.
The 'Missing fields values' option displays in the matching view under the 'Cleansing' drop-down list and allows you to execute the cleansing procedure. Once you enter the cleansing screen, the add-on generates a report containing table data quality defects with a total number of records to fix.
There are two ways to fix the records:
Replace all - fill in all missing fields with the same value
Replace manually - fill in missing fields with the same or different values
By executing the 'Replace manually' option, you can narrow your focus by selecting records in the set of records to fix. You can remove the filter and come back to the screen to work on all records using 'Deselect all'.
A preview button, available at the end of 'Record link' field, allows you to update the missing field directly in the record view. The 'Replace by this value' updates with the newly added value for the missing field.
Instead of entering a value to replace one at a time for each record, you can click on 'Save and apply same on next' button to open the next record with the 'Replace by this value' populated with the previous entry.