Correlation Filter (HD)

Filters numeric columns so the remaining columns are not correlated strongly with each other.

Information at a Glance

Category	Transform
Data source type	HD
Sends output to other operators	Yes¹
Data processing tool	Spark

Note: The Correlation Filter (HD) operator is for Hadoop data only. For database data, use the Correlation Filter (DB) operator.

Input

A file on HDFS. Select the numeric columns to filter and a dependent column, and the operator computes the correlations.

Bad or Missing Values

If a row of the input contains null values in at least one of the selected Columns to Filter, the entire row is skipped before computing the correlation matrix.

After the columns to keep are determined from the correlations, null values in the kept columns are preserved in the output data set.

This behavior differs from that of the database version of this operator. Replace null values before running this operator.
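
As an illustration of the row-skipping rule described above, here is a minimal plain-Python sketch. It is not the operator's Spark implementation; the helper names complete_rows and pearson and the sample data are hypothetical, and nulls are represented as None.

```python
from math import sqrt

def complete_rows(rows, columns):
    """Keep only rows that have no null (None) in any selected column."""
    return [r for r in rows if all(r[c] is not None for c in columns)]

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)

# Hypothetical input: the second row has a null in column "b".
rows = [
    {"a": 1.0, "b": 2.0, "c": 5.0},
    {"a": 2.0, "b": None, "c": 3.0},
    {"a": 3.0, "b": 6.0, "c": 4.0},
    {"a": 4.0, "b": 8.0, "c": 1.0},
]
selected = ["a", "b", "c"]

# The whole row is skipped, so it is also ignored for the a/c pair
# even though both "a" and "c" are present in that row.
used = complete_rows(rows, selected)
skipped = len(rows) - len(used)
corr_ab = pearson([r["a"] for r in used], [r["b"] for r in used])

# The output data set would still be built from `rows`, nulls included.
```

Because one null removes the entire row from every pairwise correlation, a sparsely populated column can shrink the sample used for all other columns, which is one reason to replace nulls first.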

Configuration

Parameter	Description
Notes	Any notes or helpful information about this operator's parameter settings. When you enter content in the Notes field, a yellow asterisk is displayed on the operator.
Columns to Filter *required	Select two or more numeric columns. Columns selected in this parameter are compared with each other, and columns are removed from this set until all of the remaining columns have correlations under the threshold defined below.
Dependent Column *required	Select a numeric column. When two columns are highly correlated with each other and one of them must be removed, the column with the higher correlation with the dependent column is kept.
Correlation Threshold *required	Enter a number greater than 0 and less than or equal to 1. This threshold determines whether a pair of columns is considered collinear.
Maximum Number of Filtered Columns *required	Enter -1 or an integer greater than 0. If -1, the operator returns all columns whose correlations are under the threshold. If n > 0, the operator returns the top n columns, ranked by their correlation with the dependent variable.
Pass Through Other Columns?	Choose yes to include columns not selected in Columns to Filter in the final results. The Dependent Column is always included.
Correlation Method	Choose the correlation method to compute: Pearson or Spearman. Note (Pearson vs. Spearman correlations): The Pearson correlation coefficient is the most widely used. It measures the strength of the linear relationship between normally distributed variables. When the variables are not normally distributed or the relationship between the variables is not linear, it might be more appropriate to use the Spearman rank correlation method.
Write Rows Removed Due to Null Data To File	Rows with at least one null value in the Columns to Filter are skipped during the correlation analysis (but kept in the output). This parameter specifies whether rows with null values are written to a file. The file is written to the same directory as the rest of the output, and its filename is suffixed with _baddata. Options: Do Not Write or Count Null Rows (Fastest): remove null value data, but do not count it or display it in the result UI. Do Not Write Null Rows to File: remove null value data and display it in the result UI, but do not write it to an external file. Write Up to 1000 Null Rows to File: remove null value data and write the first 1000 rows of that data to the external file. Write All Null Rows to File: remove null value data and write all removed rows to an external file.
Storage Format	Select the format in which to store the results. The storage format is determined by your type of operator. Typical formats are Avro, CSV, TSV, or Parquet.
Compression	Select the type of compression for the output. Available Parquet compression options: GZIP, Deflate, Snappy, no compression. Available Avro compression options: Deflate, Snappy, no compression.
Output Directory	The location to store the output files.
Output Name	The name of the output that contains the results.
Overwrite Output	Specifies whether to delete existing data at that path. Yes: if the path exists, delete that file and save the results. No: fail if the path already exists.
Advanced Spark Settings Automatic Optimization	Yes: use the default Spark optimization settings. No: provide customized Spark optimization; click Edit Settings to customize it. See Advanced Settings Dialog Box for more information.
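
The way Columns to Filter, Dependent Column, Correlation Threshold, and Maximum Number of Filtered Columns interact can be sketched in plain Python. This is one plausible reading, not the operator's documented algorithm: it assumes absolute correlation values are compared, that columns are considered in order of their correlation with the dependent column, and that the column more correlated with the dependent column wins a conflict. The function name correlation_filter and the sample data are hypothetical.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)

def correlation_filter(data, columns, dependent, threshold, max_columns=-1):
    """Greedy filter over `data` (column name -> list of floats, no nulls).

    Assumption: candidates are ranked by |correlation with dependent| and a
    candidate is kept only if it is under the threshold against every
    column kept so far.
    """
    ranked = sorted(
        columns,
        key=lambda c: abs(pearson(data[c], data[dependent])),
        reverse=True,
    )
    kept = []
    for col in ranked:
        if all(abs(pearson(data[col], data[k])) < threshold for k in kept):
            kept.append(col)
    # -1 means "no cap"; otherwise return only the top n kept columns.
    return kept if max_columns == -1 else kept[:max_columns]

# Hypothetical data: x1 and x2 are nearly collinear; x3 is unrelated to y.
data = {
    "y":  [1.0, 2.0, 3.0, 4.0, 5.0],
    "x1": [1.1, 1.9, 3.2, 3.9, 5.1],
    "x2": [2.0, 4.1, 6.0, 8.2, 9.9],
    "x3": [2.0, 1.0, 2.0, 1.0, 2.0],
}

all_kept = correlation_filter(data, ["x1", "x2", "x3"], "y", threshold=0.9)
top_one = correlation_filter(data, ["x1", "x2", "x3"], "y", 0.9, max_columns=1)
```

In this example x1 and x2 exceed the 0.9 threshold against each other, so only x2 (slightly more correlated with y) survives, together with x3. With a cap of 1, only x2 is returned.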

Output

Visual Output

The Output tab displays a preview of the output data set.
The Summary tab displays information about the selected parameters and the output.
The Correlation Results tab displays which columns have been selected with additional details (correlation with dependent variable, reason why columns were not selected).

Data Output

The data set created with filtered columns.

Note: A partial schema can be transmitted to subsequent operators at design time, but the user must run the operator for subsequent operators to see the final output schema.

The final output schema of the Correlation Filter operator is cleared if one of the following occurs:

The user changes the configuration properties of the Correlation Filter.
The user changes the input connected to the Correlation Filter.
The user clears the step run results of the Correlation Filter.

In this case, the output schema transmitted to subsequent operators again becomes the partial schema defined at design time (hence, subsequent operators can turn invalid), and the user must run the Correlation Filter operator again to transmit the new output schema.

¹ The full output schema is not available until you step run the operator. After you run this operator, the output schema automatically updates, and subsequent operators either validate or turn red, depending on the structure of the output data.
