Correlation Filter (DB)

Filters numeric columns so the remaining columns are not correlated strongly with each other.

Information at a Glance

Category Transform
Sends output to other operators Yes1
Data processing tool SQL
Note: The Correlation Filter (DB) operator is for database data only. For Hadoop data, use the Correlation Filter (HD) operator.

Input

The input data is a database source. You can choose the columns from which you want distinct combinations, and the operator performs the calculation.

Restrictions

  • If some of the Columns to Filter or Dependent Column contain null values, these values are skipped to compute the correlations.
  • If a column contains only null values, the correlation with any other column is 0.

Configuration

Parameter Description
Notes Any notes or helpful information about this operator's parameter settings. When you enter content in the Notes field, a yellow asterisk is displayed on the operator.
Columns to Filter

required

Select two or more numeric columns. Columns selected in this parameter are compared with each other, and columns are removed from this set until all of the remaining columns have correlations under the threshold defined below.
Dependent Column

required

Select a numeric column. When determining which columns to remove due to high correlation with another column, the one with higher correlation with the dependent variable is selected.
Correlation Threshold

required

Enter a number greater than 0 and less than or equal to 1. This threshold is used to determine whether each pair of columns are considered collinear.
Maximum Number of Filtered Columns

required

Enter an integer greater than 0 or -1. If -1, the operator returns all columns whose correlations are under the threshold. If n > 0, the operator returns the top n columns, ranked by their correlation with the dependent variable.
Pass Through Other Columns? Choose yes to include columns not selected in Columns to Filter in the final results. The Dependent Column is always included.
Output Type
  • TABLE outputs a database table. Specifying TABLE enables Storage Parameters.
  • VIEW outputs a database view.
Output Schema The schema for the output table or view.
Output Table The table path and name where the results are output. By default, this is a unique table name based on your user ID, workflow ID, and operator.
Drop If Exists Specifies whether to overwrite an existing table.
  • Yes - If a table with the name exists, it is dropped before storing the results.
  • No - If a table with the name exists, the results window shows an error message.

The final output schema of the Correlation Filter operator is cleared if one of the following occurs:

  • You change the configuration properties of the Correlation Filter
  • You change the input connected to the Correlation Filter
  • You clear the step run results of the Correlation Filter

In this case, the output schema transmitted to subsequent operators again becomes the partial schema defined at design time (hence, subsequent operators can turn invalid). You must run the Correlation Filter operator again to transmit the new output schema.

Outputs

Visual Output

The Output tab displays a preview of the output data set.

The Summary tab displays information about the parameters selected and the output.

The Correlation Results tab displays which columns have been selected with additional details (correlation with dependent variable, reason why columns were not selected).

Data Output
Output is a data set of the newly created table/view.
A partial schema can be transmitted to subsequent operators at design time, but you must run the operator for subsequent operators to see the final output schema.

Example



1 The full output schema is not available until you step run the operator. After you run this operator, the output schema automatically updates, and subsequent operators either validate or turn red, depending on the structure of the output data.