Correlation Filter (DB)
Filters numeric columns so the remaining columns are not correlated strongly with each other.
Information at a Glance
Parameter |
Description |
---|---|
Category | Transform |
Data source type | DB |
Send output to other operators | Yes1 |
Data processing tool | SQL |
Input
The input data is a database source. You can choose the columns from which you want distinct combinations, and the operator performs the calculation.
Restrictions
- If some of the Columns to Filter or Dependent Column contain null values, these values are skipped to compute the correlations.
- If a column contains only null values, the correlation with any other column is 0.
Configuration
Parameter | Description |
---|---|
Notes | Notes or helpful information about this operator's parameter settings. When you enter content in the Notes field, a yellow asterisk appears on the operator. |
Columns to Filter
required |
Select two or more numeric columns. Columns selected in this parameter are compared with each other, and columns are removed from this set until all of the remaining columns have correlations under the threshold defined below. |
Dependent Column
required |
Select a numeric column. When determining which columns to remove due to high correlation with another column, the one with higher correlation with the dependent variable is selected. |
Correlation Threshold
required |
Enter a number greater than 0 and less than or equal to 1. This threshold is used to determine whether each pair of columns are considered collinear. |
Maximum Number of Filtered Columns
required |
Enter an integer greater than 0 or -1. If -1, the operator returns all columns whose correlations are under the threshold. If n > 0, the operator returns the top n columns, ranked by their correlation with the dependent variable. |
Pass Through Other Columns? | Choose yes to include columns not selected in Columns to Filter in the final results. The Dependent Column is always included. |
Output Type |
|
Output Schema | The schema for the output table or view. |
Output Table | Specify the table path and name where the output of the results is generated. By default, this is a unique table name based on your user ID, workflow ID, and operator. |
Drop If Exists | Specifies whether to overwrite an existing table.
|
The final output schema of the Correlation Filter operator is cleared if one of the following occurs:
- You change the configuration properties of the Correlation Filter
- You change the input connected to the Correlation Filter
- You clear the step run results of the Correlation Filter
In this case, the output schema transmitted to subsequent operators again becomes the partial schema defined at design time (hence, subsequent operators can turn invalid). You must run the Correlation Filter operator again to transmit the new output schema.
Outputs
The Output tab displays a preview of the output data set.
The Summary tab displays information about the parameters selected and the output.
The Correlation Results tab displays which columns have been selected with additional details (correlation with dependent variable, reason why columns were not selected).
Example