Correlation (HD)

Calculates the correlation (or covariance) between each pair of two or more selected numeric columns (attributes) in a data set, so that the columns can be analyzed relative to each other.

Information at a Glance

Category	Explore
Data source type	HD
Sends output to other operators	Yes
Data processing tool	MapReduce

Note: The Correlation (HD) operator is for Hadoop data only. For database data, use the Correlation (DB) operator.

Algorithm

The covariance between two variables (X and Y) is calculated as shown in the following formula:

$$\operatorname{cov}(X, Y) = E\left[(X - \bar{X})(Y - \bar{Y})\right]$$

where $\bar{X}$ and $\bar{Y}$ are the mean values of X and Y, respectively.

The correlation is calculated by normalizing the covariance, as shown in the following formula:

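$$\operatorname{corr}(X, Y) = \frac{\operatorname{cov}(X, Y)}{\sigma_X \, \sigma_Y}$$

where $\sigma_X$ and $\sigma_Y$ are the standard deviations of X and Y, respectively.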
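The covariance and correlation calculations described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the operator's implementation. The function names are hypothetical, and the n - 1 (sample) denominator is an assumption, because this page does not state which normalization the operator uses. The correlation value does not depend on that choice, since the denominator cancels.

```python
import math

def mean(values):
    return sum(values) / len(values)

def covariance(xs, ys):
    """Sample covariance of two equal-length numeric columns.

    Assumption: n - 1 denominator; the operator may normalize differently.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length columns with at least 2 rows")
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)

def correlation(xs, ys):
    """Covariance normalized by the product of the standard deviations."""
    sx = math.sqrt(covariance(xs, xs))
    sy = math.sqrt(covariance(ys, ys))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for a constant column")
    return covariance(xs, ys) / (sx * sy)

# Toy values, made up for illustration: y is a positive multiple of x.
x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 6.0, 8.0]
cov_xy = covariance(x, y)
corr_xy = correlation(x, y)
```

Because y is a positive multiple of x in this toy input, the correlation comes out as 1 (up to floating-point rounding), while the covariance depends on the scale of the data.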

Note: The PCA operator is a multivariate modeling operator that also determines the covariance and correlation between variables. However, it goes a step further by applying a mapping of the variables into a reduced Principal Component space.

For information about correlation and covariance, and the algorithms that describe them, see Correlation and Covariance.

Input

A data set from the preceding operator.

Bad or Missing Values: In Team Studio, all null values are filtered out before the correlation analysis is run.
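As a rough illustration of null filtering, the sketch below drops every row that has a null in any selected column before the calculation. This row-wise behavior is an assumption, since this page does not say whether the operator filters whole rows or filters per column pair. The function name `drop_null_rows` and the sample rows are hypothetical.

```python
def drop_null_rows(rows, columns):
    """Keep only rows where every selected column is non-null.

    Assumption: nulls are removed row-wise across all selected columns.
    """
    return [
        row for row in rows
        if all(row.get(col) is not None for col in columns)
    ]

# Made-up rows for illustration; None stands in for a null value.
rows = [
    {"a": 1.0, "b": 2.0},
    {"a": None, "b": 3.0},
    {"a": 4.0, "b": None},
    {"a": 5.0, "b": 6.0},
]
clean = drop_null_rows(rows, ["a", "b"])
```

Only the first and last rows survive in this example, so any later calculation sees two rows instead of four.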

Restrictions

The algorithm is relevant only when run on numeric data.

Configuration

Parameter	Description
Notes	Any notes or helpful information about this operator's parameter settings. When you enter content in the Notes field, a yellow asterisk is displayed on the operator.
Columns	The numeric columns for which the correlation or covariance should be calculated.
Group by	Optional. When you select one or more group-by columns, the operator calculates a separate correlation (or covariance) matrix for every combination of values in the group-by columns. Note: The Group by column selection cannot overlap with the main Columns selection.
Calculate	Specify whether to calculate the Correlation (the default) or the Covariance. Correlation is normalized covariance, scaled so that the correlation between any variable and a positive multiple of itself is always 1.
Store Results?	Specifies whether to store the results. true: the results are stored. false: the data set is passed to the next operator without being stored.
Results Location	The HDFS directory where the results of the operator are stored. This is the main directory, the subdirectory of which is specified in Results Name. Click Choose File to open the Hadoop File Explorer Dialog Box and browse to the storage location. Do not edit the text directly.
Results Name	The name of the file in which to store the results.
Overwrite	Specifies whether to delete existing data at that path and file name. Yes: if the path exists, the existing file is deleted and the results are saved. No: the operator fails if the path already exists.
Storage Format	Select the format in which to store the results. The storage format is determined by your type of operator. Typical formats are Avro, CSV, TSV, or Parquet.
Compression	Select the type of compression for the output. Available Parquet compression options: GZIP, Deflate, Snappy, no compression. Available Avro compression options: Deflate, Snappy, no compression.
Advanced Spark Settings Automatic Optimization	Yes specifies using the default Spark optimization settings. No enables providing customized Spark optimization. Click Edit Settings to customize Spark optimization. See Advanced Settings Dialog Box for more information.
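The interaction between the Columns, Group by, and Calculate parameters can be sketched as follows. This is a minimal illustration and not the operator's code. The function names, the dictionary-based row format, and the n - 1 covariance denominator are all assumptions, and the sample rows are made up.

```python
import math
from collections import defaultdict

def _cov(xs, ys):
    # Sample covariance; the n - 1 denominator is an assumption.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)

def matrix(rows, columns, calculate="correlation"):
    """Return {(col_a, col_b): value} for every pair of selected columns."""
    data = {c: [r[c] for r in rows] for c in columns}
    result = {}
    for a in columns:
        for b in columns:
            value = _cov(data[a], data[b])
            if calculate == "correlation":
                value /= math.sqrt(_cov(data[a], data[a]) * _cov(data[b], data[b]))
            result[(a, b)] = value
    return result

def grouped_matrices(rows, columns, group_by, calculate="correlation"):
    """One matrix per combination of group-by values."""
    if set(columns) & set(group_by):
        raise ValueError("group-by columns cannot overlap with the selected columns")
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[g] for g in group_by)].append(row)
    return {key: matrix(members, columns, calculate)
            for key, members in groups.items()}

# Made-up rows for illustration.
rows = [
    {"g": "a", "x": 1.0, "y": 2.0},
    {"g": "a", "x": 2.0, "y": 4.0},
    {"g": "a", "x": 3.0, "y": 6.0},
    {"g": "b", "x": 1.0, "y": 3.0},
    {"g": "b", "x": 2.0, "y": 2.0},
    {"g": "b", "x": 3.0, "y": 1.0},
]
corr_by_group = grouped_matrices(rows, ["x", "y"], ["g"])
cov_by_group = grouped_matrices(rows, ["x", "y"], ["g"], calculate="covariance")
```

With an empty `group_by` list, every row falls into a single group keyed by the empty tuple, which mirrors the single-matrix case. In the correlation matrices the diagonal entries are 1, whereas in the covariance matrices the diagonal holds each column's variance.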

Output

Visual Output: One correlation (or covariance) matrix for each combination of specified group-by values.

Note: If a group-by requirement is not specified, only one matrix is output.
Data Output: For Hadoop data sets, the matrix shown in the visual output is also passed as the data output to any following operator.

Example

The following example shows both the Hadoop correlation matrix and the corresponding covariance matrix output, grouped by class of Iris flower. Note that when an attribute is correlated with itself, the resulting correlation coefficient is 1. This is not the case for the covariance matrix, where the diagonal holds each attribute's variance.