Correlation Filter (HD)
Filters numeric columns so that the remaining columns are not strongly correlated with each other.
Information at a Glance
Category | Transform |
Data source type | HD |
Sends output to other operators | Yes |
Data processing tool | Spark |
Input
A file on HDFS. You choose the numeric columns to compare, and the operator computes the correlations.
- Bad or Missing Values
- If a row of the input contains a null value in at least one of the selected Columns to Filter, the entire row is skipped when the correlation matrix is computed.
After the columns to keep are determined from the correlations, the null values from the input are conserved in the output data set for the columns concerned.
This behavior differs from the Database version of this operator. Replace null values before running this operator.
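The null handling described above can be illustrated with a minimal sketch. It is plain Python rather than the operator's Spark implementation, and the row data, the `split_on_nulls` helper, and the `kept_columns` result are all hypothetical names used only for illustration.

```python
# Hypothetical input rows; None stands for a null value.
rows = [
    {"a": 1.0, "b": 2.0, "c": 5.0},
    {"a": 2.0, "b": None, "c": 3.0},
    {"a": 3.0, "b": 6.0, "c": 1.0},
]
columns_to_filter = ["a", "b"]


def split_on_nulls(rows, columns):
    """Separate rows usable for the correlation matrix from rows with nulls."""
    complete, with_nulls = [], []
    for row in rows:
        if any(row[c] is None for c in columns):
            with_nulls.append(row)
        else:
            complete.append(row)
    return complete, with_nulls


# Only complete_rows would feed the correlation computation.
complete_rows, null_rows = split_on_nulls(rows, columns_to_filter)

# Assumed outcome of the correlation step, for illustration only.
kept_columns = ["b", "c"]

# The output is built from ALL input rows, so nulls in kept columns survive.
output = [{c: row[c] for c in kept_columns} for row in rows]
```

In this sketch the second row is excluded from the correlation computation, but it still appears in `output` with its null in column `b`, which is why replacing nulls beforehand is recommended.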
Configuration
Parameter | Description |
---|---|
Notes | Any notes or helpful information about this operator's parameter settings. When you enter content in the Notes field, a yellow asterisk is displayed on the operator. |
Columns to Filter *required | Select two or more numeric columns. The columns selected in this parameter are compared with each other, and columns are removed from this set until all of the remaining columns have pairwise correlations under the Correlation Threshold. |
Dependent Column *required | Select a numeric column. When two columns are highly correlated with each other and one must be removed, the column with the higher correlation with the dependent variable is kept. |
Correlation Threshold *required | Enter a number greater than 0 and less than or equal to 1. This threshold determines whether a pair of columns is considered collinear. |
Maximum Number of Filtered Columns *required | Enter -1 or an integer greater than 0. If -1, the operator returns all columns whose correlations are under the threshold. If n > 0, the operator returns the top n columns, ranked by their correlation with the dependent variable. |
Pass Through Other Columns? | Choose yes to include columns not selected in Columns to Filter in the final results. The Dependent Column is always included. |
Correlation Method | Choose the correlation method to compute. Supported methods are Pearson and Spearman. Note (Pearson vs. Spearman): The Pearson correlation coefficient is the most widely used. It measures the strength of the linear relationship between normally distributed variables. When the variables are not normally distributed or the relationship between the variables is not linear, the Spearman rank correlation method might be more appropriate. |
Write Rows Removed Due to Null Data To File | Rows with at least one null value in the Columns to Filter are skipped during the correlation analysis (but kept in the output). This parameter specifies whether rows with null values are written to a file. The file is written to the same directory as the rest of the output. The filename is suffixed with _baddata. |
Storage Format | Select the format in which to store the results. The storage format is determined by your type of operator. Typical formats are Avro, CSV, TSV, or Parquet. |
Compression | Select the type of compression for the output. Available Avro compression options. |
Output Directory | The location to store the output files. |
Output Name | The name to contain the results. |
Overwrite Output | Specifies whether to delete existing data at that path. |
Advanced Spark Settings Automatic Optimization |
|
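To make the interaction between Columns to Filter, Dependent Column, Correlation Threshold, Maximum Number of Filtered Columns, and Correlation Method more concrete, here is a minimal sketch in plain Python. It is not the operator's Spark implementation: the greedy selection order, the function names, and the sample data are assumptions chosen to match the parameter descriptions above, and the operator's exact tie-breaking rules may differ.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # assumption: treat a constant column as uncorrelated
    return cov / sqrt(var_x * var_y)


def ranks(values):
    """1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        for k in range(start, end + 1):
            result[order[k]] = (start + end) / 2 + 1
        start = end + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def correlation_filter(data, columns, dependent, threshold,
                       max_columns=-1, method="pearson"):
    """Greedy sketch: prefer columns most correlated with the dependent column."""
    corr = spearman if method == "spearman" else pearson
    # Rank candidates by absolute correlation with the dependent column.
    ranked = sorted(columns,
                    key=lambda c: abs(corr(data[c], data[dependent])),
                    reverse=True)
    kept = []
    for col in ranked:
        # Keep a column only if it is under the threshold with every kept column.
        if all(abs(corr(data[col], data[k])) < threshold for k in kept):
            kept.append(col)
        if max_columns > 0 and len(kept) == max_columns:
            break
    return kept


# Hypothetical data: x2 is nearly collinear with x1; y is the dependent column.
data = {
    "x1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "x2": [1.0, 2.0, 3.0, 4.0, 6.0],
    "x3": [5.0, 1.0, 4.0, 2.0, 3.0],
    "y": [2.0, 4.0, 6.0, 8.0, 10.0],
}
selected = correlation_filter(data, ["x1", "x2", "x3"], "y", threshold=0.9)
top_one = correlation_filter(data, ["x1", "x2", "x3"], "y", threshold=0.9,
                             max_columns=1)
print(selected)  # ['x1', 'x3']
print(top_one)   # ['x1']

# Pearson vs. Spearman on a monotonic but non-linear relationship.
cubes = [v ** 3 for v in data["x1"]]
print(pearson(data["x1"], cubes) < 1.0)  # True
print(spearman(data["x1"], cubes))       # 1.0
```

In this sketch, `x2` is dropped because its correlation with the already kept `x1` is not under the threshold, and `x1` wins because it is more strongly correlated with `y`. The last two lines show why the Spearman method can be preferable when a relationship is monotonic but not linear.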
Output
- Visual Output
- The Output tab displays a preview of the output data set.
- The Summary tab displays information about the selected parameters and the output.
- The Correlation Results tab displays which columns have been selected with additional details (correlation with dependent variable, reason why columns were not selected).
- Data Output
- The data set created with filtered columns.
Note: A partial schema can be transmitted to subsequent operators at design time, but the user must run the operator for subsequent operators to see the final output schema. The final output schema of the Correlation Filter operator is cleared if one of the following occurs:
In this case, the output schema transmitted to subsequent operators reverts to the partial schema defined at design time (so subsequent operators can become invalid), and the user must run the Correlation Filter operator again to transmit the new output schema.