Correlation (HD)

Calculates the correlation (or covariance) between each pair of two or more selected numeric attributes (columns) in a data set, so that the columns can be analyzed relative to one another.

Information at a Glance

Category: Explore
Data source type: HD
Sends output to other operators: Yes
Data processing tool: MapReduce
Note: The Correlation (HD) operator is for Hadoop data only. For database data, use the Correlation (DB) operator.

Algorithm

The covariance between two variables (X and Y) is calculated as shown in the following formula:

$$\operatorname{cov}(X, Y) = E\left[(X - \bar{X})(Y - \bar{Y})\right]$$

where $\bar{X}$ and $\bar{Y}$ are the mean values of X and Y, respectively.

The correlation is calculated by normalizing the covariance, as shown in the following formula:

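$$\operatorname{corr}(X, Y) = \frac{\operatorname{cov}(X, Y)}{\sigma_X \, \sigma_Y}$$

where $\sigma_X$ and $\sigma_Y$ are the standard deviations of X and Y.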
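The calculation described above can be sketched in a few lines of Python. This is an illustration only, not the operator's implementation: the function names are hypothetical, and dividing by n - 1 (the sample covariance) is an assumption, since this page does not state which normalization the operator uses. The correlation result does not depend on that choice, because the factor cancels.

```python
from math import sqrt


def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


def covariance(xs, ys):
    """Sample covariance of two equal-length columns.

    Assumption: divides by n - 1; the operator's own normalization
    is not documented here.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two columns of equal length with at least 2 rows")
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)


def correlation(xs, ys):
    """Covariance normalized by the standard deviations of both columns."""
    return covariance(xs, ys) / sqrt(covariance(xs, xs) * covariance(ys, ys))


# Made-up illustrative values, not taken from any real data set.
x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 6.0, 9.0]

cov_xy = covariance(x, y)
corr_xy = correlation(x, y)
# A positive multiple of a column correlates perfectly with it.
corr_x_3x = correlation(x, [3 * v for v in x])
```

Here `corr_xy` is close to, but below, 1 because `y` is not an exact multiple of `x`, while `corr_x_3x` equals 1 up to floating-point rounding.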

Note: The PCA operator is a multivariate modeling operator that also determines the covariance and correlation between variables. However, it goes a step further by applying a mapping of the variables into a reduced Principal Component space.

For information about correlation and covariance, and the algorithms that describe them, see Correlation and Covariance.

Input

A data set from the preceding operator.

Bad or Missing Values
Team Studio filters out all null values before it runs the correlation analysis.
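As a rough illustration of null filtering, the sketch below drops every row that has a null in any of the selected columns. Whether the operator filters whole rows or only the affected column pairs is not stated on this page, so the row-wise behavior, the function name `drop_null_rows`, and the sample column names are all assumptions.

```python
def drop_null_rows(rows, columns):
    """Keep only rows where every selected column has a non-null value.

    Assumption: row-wise filtering; the operator's exact rule is not documented here.
    """
    return [row for row in rows if all(row.get(c) is not None for c in columns)]


# Made-up illustrative rows; None stands for a null value.
rows = [
    {"a": 1.0, "b": 2.0},
    {"a": None, "b": 3.0},
    {"a": 4.0, "b": None},
    {"a": 5.0, "b": 6.0},
]

clean = drop_null_rows(rows, ["a", "b"])
```

Only the first and last rows remain in `clean`, so both columns have the same number of values when the correlation is computed.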

Restrictions

The algorithm applies only to numeric data.

Configuration

Parameter Description
Notes Any notes or helpful information about this operator's parameter settings. When you enter content in the Notes field, a yellow asterisk is displayed on the operator.
Columns The numeric columns for which the correlation or covariance should be calculated.
Group by Optional. When you select one or more group-by columns, the operator calculates a separate correlation (or covariance) matrix for every combination of values in the group-by columns.
Note: The Group by column selection cannot overlap with the main Columns selection.
Calculate Specify whether to calculate the Correlation (the default) or the Covariance.

Correlation is normalized covariance, scaled so that the correlation between any variable and a positive multiple of itself is always 1.

Store Results? Specifies whether to store the results.
  • true - results are stored.
  • false - the data set is passed to the next operator without storing.
Results Location The HDFS directory where the results of the operator are stored. This is the main directory, the subdirectory of which is specified in Results Name. Click Choose File to open the Hadoop File Explorer Dialog Box and browse to the storage location. Do not edit the text directly.
Results Name The name of the file in which to store the results.
Overwrite Specifies whether to delete existing data at that path and file name.
  • Yes - if the path exists, delete that file and save the results.
  • No - if the path exists, fail.
Storage Format Select the format in which to store the results. The storage format is determined by your type of operator.

Typical formats are Avro, CSV, TSV, or Parquet.

Compression Select the type of compression for the output.
Available Parquet compression options:
  • GZIP
  • Deflate
  • Snappy
  • no compression

Available Avro compression options:
  • Deflate
  • Snappy
  • no compression
Advanced Spark Settings Automatic Optimization
  • Yes - use the default Spark optimization settings.
  • No - provide customized Spark optimization settings. Click Edit Settings to customize Spark optimization. See Advanced Settings Dialog Box for more information.

Output

Visual Output
One correlation (or covariance) matrix for each combination of specified group-by values.
Note: If a group-by requirement is not specified, only one matrix is output.
Data Output
For Hadoop data sets, the matrix shown in the visual output is also passed as the data output to any following operator.
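The one-matrix-per-group behavior can be pictured with the following sketch. It is not the operator's code: `grouped_correlation` and `pearson` are hypothetical names, the rows are made-up values, and each matrix is represented as a dictionary keyed by column pair purely for brevity.

```python
from collections import defaultdict
from itertools import product
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric columns."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def grouped_correlation(rows, columns, group_by):
    """Return one correlation matrix per combination of group-by values."""
    groups = defaultdict(list)
    for row in rows:
        # The group key is the tuple of values in the group-by columns.
        groups[tuple(row[g] for g in group_by)].append(row)

    matrices = {}
    for key, members in groups.items():
        series = {c: [r[c] for r in members] for c in columns}
        matrices[key] = {
            (a, b): pearson(series[a], series[b])
            for a, b in product(columns, repeat=2)
        }
    return matrices


# Made-up illustrative rows with a single group-by column.
rows = [
    {"group": "A", "x": 1.0, "y": 2.0},
    {"group": "A", "x": 2.0, "y": 4.0},
    {"group": "A", "x": 3.0, "y": 6.0},
    {"group": "B", "x": 1.0, "y": 6.0},
    {"group": "B", "x": 2.0, "y": 4.0},
    {"group": "B", "x": 3.0, "y": 2.0},
]

matrices = grouped_correlation(rows, columns=["x", "y"], group_by=["group"])
```

With this input, `matrices` holds two entries, one for group A (where `x` and `y` rise together) and one for group B (where `y` falls as `x` rises). Note that the group-by column is not among the correlated columns, matching the restriction that the two selections cannot overlap.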

Example

The following example shows both the Hadoop correlation matrix and the corresponding covariance matrix output for different group-by classes of Iris flowers. Note that when an attribute is correlated with itself, the resulting correlation coefficient is 1. This is not the case for the covariance data, where the diagonal holds each attribute's variance.
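The relationship between the two matrices can be illustrated by converting a covariance matrix into a correlation matrix. The sketch below uses a made-up 2 x 2 covariance matrix (not the Iris output) and a hypothetical helper name. Each entry is divided by the standard deviations of its row and column attributes, which is why the diagonal becomes 1.

```python
from math import sqrt


def covariance_to_correlation(cov):
    """Normalize a square covariance matrix into a correlation matrix."""
    size = len(cov)
    # The diagonal of a covariance matrix holds each attribute's variance.
    stddev = [sqrt(cov[i][i]) for i in range(size)]
    return [
        [cov[i][j] / (stddev[i] * stddev[j]) for j in range(size)]
        for i in range(size)
    ]


# Made-up covariance matrix: variances 4 and 9 on the diagonal, covariance 3.
cov = [
    [4.0, 3.0],
    [3.0, 9.0],
]

corr = covariance_to_correlation(cov)
```

Here the off-diagonal entries become 3 / (2 * 3) = 0.5, and both diagonal entries become 1.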