Column Cleanser
This operator removes columns that fail the specified completeness or variance criteria.
Information at a Glance
| Parameter | Description |
|---|---|
| Category | Transform |
| Data source type | TIBCO® Data Virtualization |
| Send output to other operators | Yes |
| Data processing tool | TIBCO® DV, Apache Spark 3.2 or later |
Algorithm
This operator applies a set of rules to remove columns, so that you do not have to specify filtering criteria column by column. You select the columns to test and set one or more filtering conditions. Columns that violate a condition are removed.
Depending on the filtering conditions defined, the operator calculates the Sparsity, High Variance, and Low Variance checks. Multiple filtering conditions can be applied at the same time. If a Low Variance check based on the coefficient of variation is applied to a column that has a zero mean and all identical values, the column is removed and a warning appears in the Summary tab.
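The coefficient-of-variation rule and its zero-mean edge case can be sketched in plain Python. This is a minimal illustration, not the operator's implementation: the function names are hypothetical, the use of the population standard deviation is an assumption, and the source does not say what happens for a zero-mean column whose values differ or for a negative mean (the sketch keeps the former and divides by the mean as written for the latter).

```python
import statistics


def coefficient_of_variation(values):
    """Return standard deviation / mean, or None when the mean is zero."""
    mean = statistics.fmean(values)
    if mean == 0:
        return None  # the ratio is undefined for a zero mean
    # Assumption: population standard deviation; the source does not specify.
    return statistics.pstdev(values) / mean


def fails_low_variance(values, threshold):
    """Return (remove, warn) for the CV-based Low Variance check.

    A threshold of -1 disables the check, as in the operator's parameters.
    """
    if threshold == -1:
        return (False, False)
    cv = coefficient_of_variation(values)
    if cv is None:
        if len(set(values)) == 1:
            # Zero mean and all identical values: remove and warn,
            # which is the case the text above describes.
            return (True, True)
        # Assumption: zero mean with differing values is kept.
        return (False, False)
    return (cv < threshold, False)


print(fails_low_variance([0, 0, 0], 0.01))        # all zeros
print(fails_low_variance([1, 2, 3, 4], 0.01))     # clearly varying
print(fails_low_variance([10, 10.05, 9.95], 0.01))  # nearly constant
```

Returning the warning flag separately from the removal decision mirrors the behavior described above, where the column is removed and the Summary tab also reports a warning.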
Input
The input is a single tabular data set.
Configuration
The following table provides the configuration details for the Column Cleanser operator.
| Parameter | Description |
|---|---|
| Notes | Notes or helpful information about this operator's parameter settings. When you enter content in the Notes field, a yellow asterisk appears on the operator. |
| Select these columns and treat as continuous | Specify the columns to treat as continuous. The columns must be of type Double or Integer. Click Select Columns to select the required columns. Note: The columns selected in the Select these columns and treat as categorical parameter are not available. |
| Select these columns and treat as categorical | Specify the columns to treat as categorical. Click Select Columns to select the required columns. Note: The columns selected in the Select these columns and treat as continuous parameter are not available. |
| Remove columns with a percentage of missing rows higher than (-1=ignore, accepts real number 0-100) | Removes columns whose percentage of missing values is higher than the specified number. This is known as a Sparsity check. The value can be -1, or a real number between 0 and 100. Default: -1.0. Note: To ignore this filtering condition, the value must be set to -1. |
| Remove categorical columns with a count of distinct values higher than (-1=ignore, accepts integer number 0-10000) | Removes categorical columns whose count of distinct values is higher than the specified number. This is known as a High Variance check. The value can be -1, or an integer between 0 and 10000. Default: -1. Note: To ignore this filtering condition, the value must be set to -1. |
| Remove categorical columns with a count of distinct values higher than this percentage of the number of rows (-1=ignore, accepts real number 0-100) | Removes categorical columns whose count of distinct values, expressed as a percentage of the number of rows, is higher than the specified number. This is known as a High Variance check. The value can be -1, or a real number between 0 and 100. Default: -1. Note: To ignore this filtering condition, the value must be set to -1. |
| Remove categorical columns with a percentage of the most frequent category higher than (-1=ignore, accepts real number 0-100) | Removes categorical columns whose most frequent category appears in more than the specified percentage of rows. This is known as a Low Variance check. The value can be -1, or a real number between 0 and 100. Default: -1. Note: To ignore this filtering condition, the value must be set to -1. |
| Remove numeric columns with a Coefficient of Variation(Standard Deviation divided by Mean) lower than (-1=ignore, accepts real number 0-0.01) | Removes continuous columns whose coefficient of variation is lower than the specified value. This is known as a Low Variance check. The value can be -1, or a real number between 0 and 0.01. Default: -1. Note: To ignore this filtering condition, the value must be set to -1. |
| Output Schema | Specify the schema for the output table or view. |
| Output Table | Specify the path and name of the table where the results are written. By default, this is a unique table name based on your user ID, workflow ID, and operator. |
| Store Results | When set to Yes, the operator saves the results. If set to No, the operator does not save the results. |
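The Sparsity, High Variance, and categorical Low Variance conditions in the table above can be sketched as follows. This is an illustrative reading of the parameter descriptions, not the operator's code: all function and argument names are hypothetical, `None` stands in for a missing value, and the sketch assumes that missing values are excluded from the distinct-value counts and that percentages are taken over the total number of rows.

```python
from collections import Counter

IGNORE = -1  # sentinel value that disables a filtering condition


def exceeds(value, threshold):
    """True when the condition is active and value is strictly higher."""
    return threshold != IGNORE and value > threshold


def column_stats(values):
    """Compute the statistics used by the checks for a non-empty column."""
    n_rows = len(values)
    present = [v for v in values if v is not None]
    counts = Counter(present)
    top = counts.most_common(1)[0][1] if counts else 0
    return {
        "missing_pct": 100.0 * (n_rows - len(present)) / n_rows,
        "distinct_count": len(counts),
        "distinct_pct": 100.0 * len(counts) / n_rows,
        "top_category_pct": 100.0 * top / n_rows,
    }


def should_remove_categorical(values, max_missing_pct=IGNORE,
                              max_distinct=IGNORE, max_distinct_pct=IGNORE,
                              max_top_pct=IGNORE):
    """Return True if any active condition is violated."""
    s = column_stats(values)
    return (
        exceeds(s["missing_pct"], max_missing_pct)        # Sparsity
        or exceeds(s["distinct_count"], max_distinct)     # High Variance
        or exceeds(s["distinct_pct"], max_distinct_pct)   # High Variance
        or exceeds(s["top_category_pct"], max_top_pct)    # Low Variance
    )


column = ["a", "a", "a", "b", None]  # 20% missing, top category in 60% of rows
print(should_remove_categorical(column))                      # all ignored
print(should_remove_categorical(column, max_missing_pct=10))  # Sparsity active
print(should_remove_categorical(column, max_top_pct=30))      # Low Variance active
```

Because every condition says "higher than", the sketch uses a strict comparison, so a column with exactly 20% missing values is kept when the Sparsity threshold is 20.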
Output
- Output: Preview of the cleansed data, consisting of the columns that passed all of the defined filtering conditions.
- Summary: Displays information about the removed columns.
Example
The following example applies the Column Cleanser operator to a data set that contains:
- Multiple columns such as ID, AGE_IN_YEARS, LEVEL_OF_EDUCATION, YEARS_WITH_CURRENT_EMPLOYER, and YEARS_AT_CURRENT_ADDRESS.
- 850 rows.
The operator is configured with the following parameter settings:
- Select these columns and treat as continuous: ID, AGE_IN_YEARS
- Select these columns and treat as categorical: LEVEL_OF_EDUCATION, YEARS_WITH_CURRENT_EMPLOYER, YEARS_AT_CURRENT_ADDRESS
- Remove columns with a percentage of missing rows higher than (-1=ignore, accepts real number 0-100): 20
- Remove categorical columns with a count of distinct values higher than (-1=ignore, accepts integer number 0-10000): -1
- Remove categorical columns with a count of distinct values higher than this percentage of the number of rows (-1=ignore, accepts real number 0-100): -1.0
- Remove categorical columns with a percentage of the most frequent category higher than (-1=ignore, accepts real number 0-100): 30
- Remove numeric columns with a Coefficient of Variation(Standard Deviation divided by Mean) lower than (-1=ignore, accepts real number 0-0.01): -1.0
- Store Results: Yes
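Since a value of -1 disables a condition, only two of the five filtering conditions in this example are active. The short sketch below makes that explicit. The dictionary keys are hypothetical short names standing in for the long parameter labels and are not identifiers used by the product.

```python
# Example settings, keyed by hypothetical short names for the parameters.
settings = {
    "max_missing_pct": 20,                 # Sparsity check
    "max_distinct_count": -1,              # High Variance check (count)
    "max_distinct_pct": -1.0,              # High Variance check (percentage)
    "max_top_category_pct": 30,            # Low Variance check (categorical)
    "min_coefficient_of_variation": -1.0,  # Low Variance check (continuous)
}

# A value of -1 (or -1.0) means the condition is ignored.
active = {name: value for name, value in settings.items() if value != -1}
print(active)  # {'max_missing_pct': 20, 'max_top_category_pct': 30}
```

With these settings, the three categorical columns are tested against the most-frequent-category threshold of 30 percent, and the only other active condition is the Sparsity threshold of 20 percent.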