Analyzing Data Quality

In this section:

Adding Rules
Tracking Analysis Progress
Viewing a Summary of Analysis Results
Viewing Results
Change Rules

Data quality analysis is a process of verifying data values against known rules to ascertain if the values accurately represent the real world entity. Rules are an implementation of an underlying service and rules are generally associated with a data class. There are several built-in rules available for data quality analysis. These rules can be mapped to a single data attribute or a group of data attributes.

During data analysis, one or more of the following steps are executed:

Data values may be cleansed by removing unexpected characters, leading, or trailing spaces.
Data values may be validated against data specifications using regular expression checks, blacklist/whitelist checks, lookups, LUHN validation, etc.
Missing or incomplete data values may be enriched using reference data or other statistical data imputation methods.
Data values may be standardized by reformatting the output values.
Data values may be normalized by detecting and replacing abnormal values.
Tag values may be generated to report data defects or to highlight other data facts.
The overall outcome of the analysis is summarized into a tag category as:
- VALID. Input value is either valid or has defects that were fixed (cleansed) and output value is considered valid.
- INVALID. Input value is invalid and could not be fixed.
- CLEANSED. Input value has deficiencies that were fixed and output value is considered valid.
- MISSING. Input value is missing (blank spaces or null).

Adding Rules

In this section:

Single Variable Rules
Multi-Variable Rules
Preassigned Rules

To add (assign) rules when analyzing data:

Click Add under Rules, as shown in the following image.
In the left pane of the mapping window, click the plus (+) icon next to an input data variable (for example, income$K), as shown in the following image.

The Select Rule dialog opens, as shown in the following image.
Search the Rules Catalog and select the most appropriate rule.
Map input data variables to the rule input.

Single Variable Rules

These rules need one input and once selected, the input data variable is automatically mapped to the rule input (for example, income$K).

Multi-Variable Rules

These rules require multiple inputs.

Map each rule input to an input variable.
Provide a meaningful group name.

Note: The group name is used to generate the output results file name, so make sure you provide a unique group name.

Preassigned Rules

Some columns may have a preassigned rule based on the data class discovered by the profiler. You can delete and reassign a different rule if the assigned rule is not appropriate for the input variable.

Verify all the rules associated with the input variables before submitting the data for processing.

Tracking Analysis Progress

The analysis progress bar displays the current processing status, as shown in the following image.

Viewing a Summary of Analysis Results

You can view a summary of analysis results by clicking View under Summary, as shown in the following image.

Summarized results contain:

Result	Description
Overall DQ Score	Data Quality (DQ) score for the entire data set. This score will be in the range 0 to 100 (100 is the best).
Rule based DQ Score	Data Quality (DQ) score for each rule applied to a variable or a group of variables. This score will be in the range 0 to 100 (100 is the best).
Tags	List of issues or reportable facts identified during the data quality analysis.
Stats & Stats Chart	Summarized stats for final outcome of the analysis. VALID. Represents percentage of valid output values. INVALID. Represents percentage of values that failed validation and could not be fixed. MISSING. Represents percentage of values that are empty or missing. CLEANSED. Represents percentage of values that were fixed - standardized, normalized, or enriched.

The following is a sample summary of analysis report.

Viewing Results

In this section:

Results Output

You can download the detailed results of the Data Quality (DQ) analysis by clicking View under Results.

The .zip file that is generated contains the following folders:

input. Contains the input data set.
profile. Contains data profiling results.
results. Contains DQ analysis results.

For example:

Results Output

The results folder contains analysis results.

One JSON file named rules.json with the summary of rules mapped by the user.
One JSON file with the naming convention <<input_file_name>>.results.json with the summarized results of the rule based DQ analysis.
One CSV file with the naming convention <<input_file_name>>.cleansed.csv with the same number of columns as the input data set.
Multiple CSV files with the naming convention <<variable or group name>>[<<rule name>>].csv representing an output file for each combination of variable/group and rule that is executed.

In the following example, a user submitted eight rules for execution, and the results folder contains:

One rules JSON.
One summarized output JSON.
One cleansed output CSV.
Eight output CSVs (one for each rule).

rules.json - Summary of rules mapped by the user.

Value	Description
Rule Name	Name of the rule selected by the user.
Group Name	Group name provided by the user for mapping multiple variables into a single Rule.
Input Map	Mapping of input variables to rule inputs.
Variable Options	Data expectations by the user for each variable. shouldBeUnique. Set to true if column values are always expected to be unique. allowsNulls. Set to false if column values can never be Null. businessImpact. A number that represents HIGH, MEDIUM, or LOW impact of the variable on business outcomes.

The following is a sample rules.json file for reference.

<<input_data_set_name>>.results.json - JSON output with summarized results of the DQ analysis.

Value	Description
Input File	Input file that contains the input data set uploaded by the user.
Output File	Output file that contains cleansed output values of all the variables in the input data set.
Overall DQ Score	Data Quality (DQ) score for the entire data set in the range 0 to 100 (100 is the best).
Rule based DQ Score	Data Quality (DQ) score for each variable in the range 0 to 100 (100 is the best).
Tags	List of issues or reportable facts identified during the data quality analysis.
Count Processed	Total number of data values processed by a Rule.
Count Valid	Total number of valid output values generated by a Rule.
Count Invalid	Total number of values that failed validation and could not be fixed by a Rule.
Count Missing	Total number of empty or missing values.
Count Cleansed	Total number of cleansed output values generated by a Rule.

The following is a sample <<input_data_set_name>>.results.json file for reference.

Analyzing Data Quality

Adding Rules

Single Variable Rules

Multi-Variable Rules

Preassigned Rules

Tracking Analysis Progress

Viewing a Summary of Analysis Results

Viewing Results

Results Output

Change Rules