Column Cleanser

This operator removes columns that fail the specified completeness or variance criteria.

Column Cleanser operator icon

Information at a Glance

Note: This operator can only be used with TIBCO® Data Virtualization and Apache Spark 3.2 or later.

Parameter Description
Category Transform
Data source type TIBCO® Data Virtualization
Send output to other operators Yes
Data processing tool TIBCO® DV, Apache Spark 3.2 or later

Algorithm

This operator applies a set of rules to remove columns, easing the burden of specifying filtering criteria column by column. You select the columns to test and then set one or more filtering conditions. Columns that violate a condition are removed.

Depending on the filtering conditions defined, the operator calculates the Sparsity, High Variance, and Low Variance checks. Multiple filtering conditions can be applied. If the Low Variance check based on the coefficient of variation is applied to a column whose values are all identical and whose mean is zero, the coefficient is undefined (zero divided by zero). In that case, the column is removed and a warning appears in the Summary tab.
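The coefficient-of-variation check and its zero-mean edge case can be sketched as follows. This is a minimal illustration, not the operator's implementation: the function name `low_variance_by_cv`, the use of the population standard deviation, the absolute value of the mean, and the handling of a zero mean with non-identical values are all assumptions.

```python
from statistics import mean, pstdev

def low_variance_by_cv(values, threshold):
    """Return (remove, warning) for one continuous column.

    values: numeric values with missing entries already dropped (assumption).
    threshold: minimum coefficient of variation; -1 disables the check.
    """
    if threshold == -1 or not values:
        return False, None
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (assumption)
    if mu == 0:
        if sigma == 0:
            # All values are zero, so CV is 0/0. The documented behavior is
            # to remove the column and report a warning.
            return True, "coefficient of variation undefined (zero mean, identical values)"
        # Zero mean with spread: CV is unbounded, so the column is kept (assumption).
        return False, None
    cv = sigma / abs(mu)  # abs() guards against negative means (assumption)
    return cv < threshold, None
```

For example, `low_variance_by_cv([0, 0, 0], 0.01)` flags the column for removal and returns a warning message, while a column with values 1, 2, and 3 is kept.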

Input

The input is a single tabular data set.

Configuration

The following table provides the configuration details for the Column Cleanser operator.

Parameter Description
Notes Notes or helpful information about this operator's parameter settings. When you enter content in the Notes field, a yellow asterisk appears on the operator.
Select these columns and treat as continuous Specify the columns to treat as continuous. The columns must be of type Double or Integer. Click Select Columns to select the required columns.
Note: The columns selected in the Select these columns and treat as categorical parameter are not available.
Select these columns and treat as categorical Specify the columns to treat as categorical. Click Select Columns to select the required columns.
Note: The columns selected in the Select these columns and treat as continuous parameter are not available.
Remove columns with a percentage of missing rows higher than (-1=ignore, accepts real number 0-100) Removes the columns whose percentage of missing values is higher than the specified number. This is known as a Sparsity check. The value can be -1, or a real number between 0 and 100.

Default: -1.0

Note: To ignore this filtering condition, the value must be set to -1.
Remove categorical columns with a count of distinct values higher than (-1=ignore, accepts integer number 0-10000) Removes the categorical columns whose count of distinct values is higher than the specified number. This is known as a High Variance check. The value can be -1, or an integer between 0 and 10000.

Default: -1.

Note: To ignore this filtering condition, the value must be set to -1.
Remove categorical columns with a count of distinct values higher than this percentage of the number of rows (-1=ignore, accepts real number 0-100) Removes the categorical columns whose count of distinct values, expressed as a percentage of the number of rows, is higher than the specified number. This is known as a High Variance check. The value can be -1, or a real number between 0 and 100.

Default: -1.

Note: To ignore this filtering condition, the value must be set to -1.
Remove categorical columns with a percentage of the most frequent category higher than (-1=ignore, accepts real number 0-100) Removes the categorical columns whose most frequent category appears in more than the specified percentage of rows. This is known as a Low Variance check. The value can be -1, or a real number between 0 and 100.

Default: -1.

Note: To ignore this filtering condition, the value must be set to -1.
Remove numeric columns with a Coefficient of Variation(Standard Deviation divided by Mean) lower than (-1=ignore, accepts real number 0-0.01) Removes the continuous columns whose coefficient of variation is lower than the specified value. This is known as a Low Variance check. The value can be -1, or a real number between 0 and 0.01.

Default: -1.

Note: To ignore this filtering condition, the value must be set to -1.
Output Schema Specify the schema for the output table or view.
Output Table Specify the table path and name where the results are written. By default, this is a unique table name based on your user ID, workflow ID, and operator.
Store Results When set to Yes, the operator saves the results. If set to No, the operator does not save the results.
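The Sparsity check and the three categorical checks described in the table can be sketched in a few lines. This is an illustration under stated assumptions rather than the operator's actual logic: the function names and check labels are hypothetical, `None` stands in for a missing value, missing values are excluded from the distinct counts, and percentages are taken over the total number of rows.

```python
from collections import Counter

IGNORE = -1  # a threshold of -1 disables the corresponding check

def sparsity_check(values, max_missing_pct):
    """Return True if the column should be removed for having too many missing values."""
    if max_missing_pct == IGNORE or not values:
        return False
    missing = sum(1 for v in values if v is None)
    return 100.0 * missing / len(values) > max_missing_pct

def categorical_checks(values, max_distinct=IGNORE, max_distinct_pct=IGNORE,
                       max_top_pct=IGNORE):
    """Return the (hypothetical) names of the checks that would remove this column."""
    present = [v for v in values if v is not None]  # ignore missing values (assumption)
    failed = []
    if not present:
        return failed
    n_rows = len(values)
    counts = Counter(present)
    # High Variance: too many distinct values, as an absolute count.
    if max_distinct != IGNORE and len(counts) > max_distinct:
        failed.append("high_variance_count")
    # High Variance: too many distinct values, as a percentage of the rows.
    if max_distinct_pct != IGNORE and 100.0 * len(counts) / n_rows > max_distinct_pct:
        failed.append("high_variance_pct")
    # Low Variance: one category dominates the column.
    top_count = counts.most_common(1)[0][1]
    if max_top_pct != IGNORE and 100.0 * top_count / n_rows > max_top_pct:
        failed.append("low_variance_top")
    return failed
```

A column is removed as soon as any active check fails. Returning the list of failed checks, instead of a single Boolean, mirrors the kind of per-column information a summary of removed columns would need.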

Output

Visual Output
  • Output: Preview of the cleaned data, consisting of the columns that passed all the defined data cleaning checks.
  • Summary: Displays information about the removed columns.
Output to Successive operator
A single tabular data set containing the cleaned data, that is, the columns that passed the filtering conditions.

Example

The following example shows the cleansed data for the given data set using the Column Cleanser operator.

Column Cleanser operator workflow
Data
demographic: A demographic data set that contains the following information:
  • Multiple columns such as ID, AGE_IN_YEARS, LEVEL_OF_EDUCATION, YEARS_WITH_CURRENT_EMPLOYER, and YEARS_AT_CURRENT_ADDRESS.
  • Multiple rows (850 rows).
Parameter Setting
The parameter settings for the demographic data set are as follows:
  • Select these columns and treat as continuous: ID, AGE_IN_YEARS

  • Select these columns and treat as categorical: LEVEL_OF_EDUCATION, YEARS_WITH_CURRENT_EMPLOYER, YEARS_AT_CURRENT_ADDRESS

  • Remove columns with a percentage of missing rows higher than (-1=ignore, accepts real number 0-100): 20

  • Remove categorical columns with a count of distinct values higher than (-1=ignore, accepts integer number 0-10000): -1

  • Remove categorical columns with a count of distinct values higher than this percentage of the number of rows (-1=ignore, accepts real number 0-100): -1.0

  • Remove categorical columns with a percentage of the most frequent category higher than (-1=ignore, accepts real number 0-100): 30

  • Remove numeric columns with a Coefficient of Variation(Standard Deviation divided by Mean) lower than (-1=ignore, accepts real number 0-0.01): -1.0

  • Store Results: Yes
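With these settings, only two checks are active: the Sparsity check at 20 percent and the Low Variance check on the most frequent category at 30 percent. The other three conditions are set to -1 and are ignored. The short sketch below makes that explicit. The dictionary keys are hypothetical short names chosen for illustration, not identifiers used by the operator.

```python
# Example parameter settings, keyed by hypothetical short names.
settings = {
    "max_missing_pct": 20,
    "max_distinct_count": -1,
    "max_distinct_pct": -1.0,
    "max_top_category_pct": 30,
    "min_coefficient_of_variation": -1.0,
}

def active_checks(settings):
    """Return the names of the conditions that are not set to -1 (ignore)."""
    return sorted(name for name, value in settings.items() if value != -1)

print(active_checks(settings))  # ['max_missing_pct', 'max_top_category_pct']
```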

Results
The following figures display the results for the demographic data set with these parameter settings.
Output
Column Cleanser operator - Output tab
Summary
Column Cleanser operator - Summary tab