Correlation Filter (HD)

Filters numeric columns so that the remaining columns are not strongly correlated with each other.

Information at a Glance

Category: Transform
Data source type: HD
Sends output to other operators: Yes (see footnote 1)
Data processing tool: Spark
Note: The Correlation Filter (HD) operator is for Hadoop data only. For database data, use the Correlation Filter (DB) operator.

Input

A file on HDFS. You select the numeric columns to compare, and the operator computes the correlations between them.

Bad or Missing Values
If a row of the input contains null values in at least one of the selected Columns to Filter, the entire row is skipped before computing the correlation matrix.

After the columns to keep are determined based on the correlations, the null values from the input are preserved in the output data set for the retained columns.

This behavior differs from the database version of this operator. If rows with null values should take part in the correlation calculation, replace the null values before running this operator.
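
The null handling described above can be illustrated with a minimal Python sketch. It is not the operator's implementation: the row layout, the column names (x1, x2, note), and the helper functions split_rows and fill_with_mean are hypothetical, and mean replacement is only one possible way to replace null values beforehand.

```python
def split_rows(rows, columns):
    """Separate rows usable for the correlation matrix from rows skipped for nulls."""
    complete, skipped = [], []
    for row in rows:
        has_null = any(row[c] is None for c in columns)
        (skipped if has_null else complete).append(row)
    return complete, skipped


def fill_with_mean(rows, columns):
    """Return copies of the rows with nulls replaced by the column mean.

    Assumes every column has at least one non-null value.
    """
    means = {}
    for c in columns:
        present = [r[c] for r in rows if r[c] is not None]
        means[c] = sum(present) / len(present)
    return [
        {**r, **{c: means[c] if r[c] is None else r[c] for c in columns}}
        for r in rows
    ]


# Hypothetical input: x1 and x2 play the role of Columns to Filter.
rows = [
    {"x1": 1.0, "x2": 2.0, "note": "a"},
    {"x1": None, "x2": 4.0, "note": "b"},
    {"x1": 3.0, "x2": 6.5, "note": "c"},
]
columns_to_filter = ["x1", "x2"]

complete, skipped = split_rows(rows, columns_to_filter)
print(len(complete), "rows used for correlations;", len(skipped), "skipped")

# Replacing nulls first lets every row contribute to the correlations.
filled = fill_with_mean(rows, columns_to_filter)
print(filled[1]["x1"])
```

In this sketch the second row is left out of the correlation calculation because x1 is null, while the original rows list is untouched, which mirrors how null values are carried through to the output.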

Configuration

Parameter Description
Notes Any notes or helpful information about this operator's parameter settings. When you enter content in the Notes field, a yellow asterisk is displayed on the operator.
Columns to Filter

*required

Select two or more numeric columns. Columns selected in this parameter are compared with each other, and columns are removed from this set until all of the remaining columns have correlations under the threshold defined below.
Dependent Column

*required

Select a numeric column. When two columns are highly correlated with each other and one of them must be removed, the column with the higher correlation with the dependent column is kept (selected) and the other is removed.
Correlation Threshold

*required

Enter a number greater than 0 and less than or equal to 1. This threshold determines whether a pair of columns is considered collinear.
Maximum Number of Filtered Columns

*required

Enter -1 or an integer greater than 0. If -1, the operator returns all columns whose correlations are under the threshold. If n > 0, the operator returns the top n of those columns, ranked by their correlation with the dependent variable.
Pass Through Other Columns? Choose yes to include columns not selected in Columns to Filter in the final results. The Dependent Column is always included.
Correlation Method Choose the correlation method to use. The supported methods are Pearson and Spearman.
Note:

Pearson vs Spearman correlations

The Pearson correlation coefficient is the most widely used. It measures the strength of the linear relationship between normally distributed variables. When the variables are not normally distributed or the relationship between the variables is not linear, it might be more appropriate to use the Spearman rank correlation method.

Write Rows Removed Due to Null Data To File Rows with at least one null value in the Columns to Filter are skipped during the correlation analysis (but kept in the output). This parameter specifies whether those rows are counted and whether they are written to a file.

The file is written to the same directory as the rest of the output. The filename is suffixed with _baddata.

  • Do Not Write or Count Null Rows (Fastest) - remove null value data, but do not count it or display it in the result UI.

  • Do Not Write Null Rows to File - remove null value data and display it in the result UI, but do not write it to an external file.

  • Write Up to 1000 Null Rows to File - remove null value data and write the first 1000 rows of that data to the external file.

  • Write All Null Rows to File - remove null value data and write all removed rows to the external file.

Storage Format Select the format in which to store the results. The storage format is determined by your type of operator.

Typical formats are Avro, CSV, TSV, or Parquet.

Compression Select the type of compression for the output.
Available Parquet compression options:
  • GZIP
  • Deflate
  • Snappy
  • no compression

Available Avro compression options:

  • Deflate
  • Snappy
  • no compression
Output Directory The location to store the output files.
Output Name The name under which the results are stored.
Overwrite Output Specifies whether to delete existing data at that path.
  • Yes - if the path exists, delete that file and save the results.
  • No - fail if the path already exists.
Advanced Spark Settings Automatic Optimization
  • Yes - use the default Spark optimization settings.
  • No - provide customized Spark optimization settings. Click Edit Settings to customize them. See Advanced Settings Dialog Box for more information.
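
As a rough illustration of how Columns to Filter, Dependent Column, Correlation Threshold, Maximum Number of Filtered Columns, and Correlation Method interact, the following Python sketch implements one plausible greedy filter. It is an assumption, not the operator's actual Spark implementation: the function names (pearson, ranks, spearman, correlation_filter), the use of absolute correlation values, and the visiting order are all hypothetical, and the input is assumed to contain no null values and no constant columns.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        for k in range(start, end + 1):
            result[order[k]] = (start + end) / 2 + 1
        start = end + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(ranks(x), ranks(y))


def correlation_filter(data, dependent, threshold, max_columns=-1, corr=pearson):
    """Greedy sketch (assumed behavior): return the names of the columns to keep.

    data maps column names to equal-length lists of numbers without nulls.
    """
    target = data[dependent]
    strength = {name: abs(corr(col, target))
                for name, col in data.items() if name != dependent}
    kept = []
    # Visit candidates from strongest to weakest correlation with the dependent column.
    for name in sorted(strength, key=strength.get, reverse=True):
        # Keep a candidate only if it is under the threshold against every kept column.
        if all(abs(corr(data[name], data[k])) < threshold for k in kept):
            kept.append(name)
    return kept if max_columns == -1 else kept[:max_columns]


data = {
    "y": [1, 2, 3, 4, 5, 6],      # dependent column
    "a": [1, 2, 3, 4, 5, 7],      # strongly related to y
    "b": [2, 4, 6, 8, 10, 15],    # nearly a multiple of a
    "c": [1, -1, 1, -1, 1, -1],   # weakly related to the other columns
}
print(correlation_filter(data, "y", threshold=0.8))                 # ['a', 'c']
print(correlation_filter(data, "y", threshold=0.8, max_columns=1))  # ['a']

# A monotonic but nonlinear relationship: Spearman is 1, Pearson is lower.
x = [1, 2, 3, 4, 5]
cubes = [v ** 3 for v in x]
print(round(pearson(x, cubes), 3), round(spearman(x, cubes), 3))
```

In the example data, column b is dropped because it is almost perfectly correlated with column a, and a is more strongly correlated with the dependent column y. The last lines show why Spearman can be the better choice when a relationship is monotonic but not linear.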

Output

Visual Output
  • The Output tab displays a preview of the output data set.
  • The Summary tab displays information about the selected parameters and the output.
  • The Correlation Results tab shows which columns were selected, along with additional details (each column's correlation with the dependent variable and the reason a column was not selected).
Data Output
The data set created with filtered columns.
Note: A partial schema can be transmitted to subsequent operators at design time, but the user must run the operator for subsequent operators to see the final output schema.
The final output schema of the Correlation Filter operator is cleared if one of the following occurs:
  • The user changes the configuration properties of the Correlation Filter.
  • The user changes the input connected to the Correlation Filter.
  • The user clears the step run results of the Correlation Filter.

In any of these cases, the output schema transmitted to subsequent operators reverts to the partial schema defined at design time, so subsequent operators can become invalid. The user must run the Correlation Filter operator again to transmit the new output schema.

1 The full output schema is not available until you step run the operator. After you run this operator, the output schema automatically updates, and subsequent operators either validate or turn red, depending on the structure of the output data.