Analyzing Data Quality

Data quality analysis is the process of verifying data values against known rules to determine whether the values accurately represent the real-world entity. Each rule is an implementation of an underlying service and is generally associated with a data class. Several built-in rules are available for data quality analysis. These rules can be mapped to a single data attribute or to a group of data attributes.

During data analysis, one or more of the following steps are executed:

  1. Data values may be cleansed by removing unexpected characters and leading or trailing spaces.
  2. Data values may be validated against data specifications using regular expression checks, blacklist/whitelist checks, lookups, Luhn validation, and so on.
  3. Missing or incomplete data values may be enriched using reference data or other statistical data imputation methods.
  4. Data values may be standardized by reformatting the output values.
  5. Data values may be normalized by detecting and replacing abnormal values.
  6. Tag values may be generated to report data defects or to highlight other data facts.
  7. The overall outcome of the analysis is summarized into a tag category as:
    • VALID. The input value is either valid or has defects that were fixed (cleansed), and the output value is considered valid.
    • INVALID. The input value is invalid and could not be fixed.
    • CLEANSED. The input value has deficiencies that were fixed, and the output value is considered valid.
    • MISSING. The input value is missing (blank spaces or null).
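The steps above can be sketched as a single-variable rule. The following Python sketch is a hypothetical illustration (it is not one of the product's built-in rules): it cleanses a credit-card-style value, validates it with the Luhn check mentioned in step 2, and tags the outcome with one of the four categories.

```python
def luhn_valid(number: str) -> bool:
    """Luhn checksum validation (step 2): double every second digit
    from the right, subtract 9 from results above 9, sum everything,
    and check that the total is divisible by 10."""
    total = 0
    for i, ch in enumerate(reversed(number)):
        d = int(ch)
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def card_number_rule(value):
    """Hypothetical single-variable rule: cleanse, validate, tag."""
    # MISSING: blank spaces or null (step 7).
    if value is None or value.strip() == "":
        return {"output": None, "tag": "MISSING"}
    # Step 1: cleanse by stripping spaces and separator characters.
    cleansed = value.strip().replace("-", "").replace(" ", "")
    # Step 2: validate; if the value still fails, it cannot be fixed.
    if not cleansed.isdigit() or not luhn_valid(cleansed):
        return {"output": value, "tag": "INVALID"}
    # VALID if no cleansing was needed, CLEANSED otherwise.
    tag = "VALID" if cleansed == value else "CLEANSED"
    return {"output": cleansed, "tag": tag}
```

For example, `card_number_rule("4539 1488 0343 6467")` removes the spaces and tags the value CLEANSED, while a value that is already clean and passes the checksum is tagged VALID.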

Adding Rules

To add (assign) rules when analyzing data:

  1. Click Add under Rules, as shown in the following image.
  2. In the left pane of the mapping window, click the plus (+) icon next to an input data variable (for example, income$K), as shown in the following image.

    The Select Rule dialog opens, as shown in the following image.

  3. Search the Rules Catalog and select the most appropriate rule.
  4. Map input data variables to the rule input.

Single Variable Rules

These rules require a single input. Once the rule is selected, the input data variable is automatically mapped to the rule input (for example, income$K).

Multi-Variable Rules

These rules require multiple inputs.

  1. Map each rule input to an input variable.
  2. Provide a meaningful group name.

Note: The group name is used to generate the output results file name, so make sure you provide a unique group name.

Preassigned Rules

Some columns may have a preassigned rule based on the data class discovered by the profiler. You can delete and reassign a different rule if the assigned rule is not appropriate for the input variable.

Verify all the rules associated with the input variables before submitting the data for processing.

Tracking Analysis Progress

The analysis progress bar displays the current processing status, as shown in the following image.

Viewing a Summary of Analysis Results

You can view a summary of analysis results by clicking View under Summary, as shown in the following image.

Summarized results contain:

  • Overall DQ Score. Data Quality (DQ) score for the entire data set. This score is in the range 0 to 100 (100 is the best).
  • Rule based DQ Score. Data Quality (DQ) score for each rule applied to a variable or a group of variables. This score is in the range 0 to 100 (100 is the best).
  • Tags. List of issues or reportable facts identified during the data quality analysis.
  • Stats & Stats Chart. Summarized stats for the final outcome of the analysis:
    • VALID. Represents the percentage of valid output values.
    • INVALID. Represents the percentage of values that failed validation and could not be fixed.
    • MISSING. Represents the percentage of values that are empty or missing.
    • CLEANSED. Represents the percentage of values that were fixed (standardized, normalized, or enriched).
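As an illustration of how these percentages relate to the outcome tags, the following sketch (an assumption, not the product's implementation) tallies a list of per-value outcome tags into the four stats buckets:

```python
from collections import Counter

def summarize_tags(tags):
    """Compute the percentage of each outcome category from a list of tags."""
    counts = Counter(tags)
    total = len(tags)
    return {cat: round(100 * counts.get(cat, 0) / total, 1)
            for cat in ("VALID", "INVALID", "MISSING", "CLEANSED")}
```

The four percentages always sum to 100, since every processed value receives exactly one outcome tag.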

The following is a sample analysis summary report.

Viewing Results

You can download the detailed results of the Data Quality (DQ) analysis by clicking View under Results.

The .zip file that is generated contains the following folders:

For example:

Results Output

The results folder contains analysis results.

  1. One JSON file named rules.json with the summary of rules mapped by the user.
  2. One JSON file with the naming convention <<input_file_name>>.results.json with the summarized results of the rule-based DQ analysis.
  3. One CSV file with the naming convention <<input_file_name>>.cleansed.csv with the same number of columns as the input data set.
  4. Multiple CSV files with the naming convention <<variable or group name>>[<<rule name>>].csv representing an output file for each combination of variable/group and rule that is executed.

In the following example, a user submitted eight rules for execution, and the results folder contains:

  • One rules JSON.
  • One summarized output JSON.
  • One cleansed output CSV.
  • Eight output CSVs (one for each rule).

rules.json - Summary of rules mapped by the user.

  • Rule Name. Name of the rule selected by the user.
  • Group Name. Group name provided by the user for mapping multiple variables into a single rule.
  • Input Map. Mapping of input variables to rule inputs.
  • Variable Options. Data expectations set by the user for each variable:
    • shouldBeUnique. Set to true if column values are always expected to be unique.
    • allowsNulls. Set to false if column values can never be null.
    • businessImpact. A number that represents the HIGH, MEDIUM, or LOW impact of the variable on business outcomes.

The following is a sample rules.json file for reference.
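A minimal, hypothetical sketch of the structure described above is shown below. The key names and values here are illustrative assumptions only; the actual schema in a generated rules.json may differ.

```json
{
  "rules": [
    {
      "ruleName": "ValidateCreditCard",
      "groupName": "cc_check",
      "inputMap": { "card_number": "cc_num" },
      "variableOptions": {
        "cc_num": {
          "shouldBeUnique": true,
          "allowsNulls": false,
          "businessImpact": 3
        }
      }
    }
  ]
}
```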

<<input_data_set_name>>.results.json - JSON output with summarized results of the DQ analysis.

  • Input File. Input file that contains the input data set uploaded by the user.
  • Output File. Output file that contains cleansed output values of all the variables in the input data set.
  • Overall DQ Score. Data Quality (DQ) score for the entire data set, in the range 0 to 100 (100 is the best).
  • Rule based DQ Score. Data Quality (DQ) score for each variable, in the range 0 to 100 (100 is the best).
  • Tags. List of issues or reportable facts identified during the data quality analysis.
  • Count Processed. Total number of data values processed by a rule.
  • Count Valid. Total number of valid output values generated by a rule.
  • Count Invalid. Total number of values that failed validation and could not be fixed by a rule.
  • Count Missing. Total number of empty or missing values.
  • Count Cleansed. Total number of cleansed output values generated by a rule.

The following is a sample <<input_data_set_name>>.results.json file for reference.
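The sketch below illustrates how the fields described above might fit together. All file names, rule names, scores, and counts are hypothetical, and the actual key names in a generated results file may differ; note that the four counts sum to the processed total.

```json
{
  "inputFile": "customers.csv",
  "outputFile": "customers.cleansed.csv",
  "overallDQScore": 92,
  "rules": [
    {
      "ruleName": "ValidateCreditCard",
      "dqScore": 88,
      "tags": ["INVALID_CHECKSUM"],
      "countProcessed": 1000,
      "countValid": 850,
      "countInvalid": 60,
      "countMissing": 40,
      "countCleansed": 50
    }
  ]
}
```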

Rerunning an Analysis With Different Rules

After analyzing your data through rules for the first time, you can reanalyze the same data with a different set of rules, which allows you to skip the data upload and profiling steps. For each rerun, a new record is generated and displayed at the top of the home page.

  1. Click View under Rules, as shown in the following image.
  2. Edit the rules by replacing existing rules or adding new ones, as shown in the following image.
  3. When you submit the new request for rules execution, the most recent analysis is shown at the top of the home page. Results from different analysis runs are grouped together, with the most recent results shown first. Click the drop-down button to expand and view results from previous analysis runs.