Correlation (DB)

Calculates the correlation between each pair of selected numeric columns (attributes) in a data set, so that two or more columns can be analyzed relative to each other.

Information at a Glance

Category	Explore
Data source type	DB
Sends output to other operators	No
Data processing tool	n/a

Note: The Correlation (DB) operator is for database data only. For Hadoop data, use the Correlation (HD) operator.

Algorithm

The covariance between two variables (X and Y) is calculated as shown in the following formula:

$$\operatorname{cov}(X, Y) = E\left[(X - \bar{X})(Y - \bar{Y})\right]$$

where $\bar{X}$ and $\bar{Y}$ are the mean values of X and Y, respectively, and $E[\cdot]$ denotes the expected value.

The correlation is calculated by normalizing the covariance, as shown in the following formula:

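$$\operatorname{corr}(X, Y) = \frac{\operatorname{cov}(X, Y)}{\sigma_X \, \sigma_Y}$$

where $\sigma_X$ and $\sigma_Y$ are the standard deviations of X and Y.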
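As an illustration of the two formulas above, the following is a minimal Python sketch, not the operator's actual implementation. The function names `mean`, `covariance`, and `correlation` and the sample values are hypothetical. The sketch assumes the population form of the covariance (dividing by n); the source does not state whether the operator divides by n or by n - 1, although the correlation is the same either way because the factor cancels.

```python
from math import sqrt


def mean(values):
    """Arithmetic mean of a list of numbers."""
    return sum(values) / len(values)


def covariance(xs, ys):
    """Covariance of two equal-length columns.

    Assumption: population form (divide by n). The sample form
    would divide by n - 1 instead.
    """
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)


def correlation(xs, ys):
    """Covariance normalized by the two standard deviations."""
    std_x = sqrt(covariance(xs, xs))  # variance is cov(X, X)
    std_y = sqrt(covariance(ys, ys))
    return covariance(xs, ys) / (std_x * std_y)


# Hypothetical sample columns.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

print(covariance(x, y))
print(correlation(x, y))
```

Both columns must have the same length and a nonzero variance; a constant column would make the denominator zero.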

Note: The PCA operator is a multivariate modeling operator that also determines the covariance and correlation between variables. However, it goes a step further by applying a mapping of the variables into a reduced Principal Component space.

For information about correlation and covariance, see Correlation and Covariance.

Input

A data set from the preceding operator.

Bad or Missing Values: In Team Studio, all null values are filtered out before the correlation analysis is run.
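The effect of null filtering can be sketched in Python as follows. This is only an illustration under an assumption: it drops a whole row when any selected column is null. The source does not say whether the operator filters per row or per column pair, and the helper `drop_null_rows` and the sample rows are hypothetical.

```python
def drop_null_rows(rows, columns):
    """Keep only rows where every selected column is non-null (not None).

    Assumption: row-wise filtering across all selected columns.
    """
    return [row for row in rows if all(row[c] is not None for c in columns)]


# Hypothetical input rows; None stands in for a database NULL.
rows = [
    {"age": 30, "income": 52000.0},
    {"age": None, "income": 61000.0},
    {"age": 45, "income": None},
    {"age": 52, "income": 88000.0},
]

clean = drop_null_rows(rows, ["age", "income"])
print(clean)
```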

Restrictions

The algorithm applies only to numeric data.

Configuration

Parameter	Description
Notes	Any notes or helpful information about this operator's parameter settings. When you enter content in the Notes field, a yellow asterisk is displayed on the operator.
Columns	The numeric columns for which the correlation should be calculated.

Output

Visual Output: The correlation coefficient table. Each coefficient measures how strongly two columns are linearly related and ranges from -1 to 1. The value is 1 when a column is compared against itself. A negative value indicates an inverse relationship (that is, as one value goes up, the other goes down).
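To make the shape of such a table concrete, here is a small Python sketch that builds a pairwise correlation table for a few columns. It is an illustration only: the column names, values, and the helpers `pearson` and `correlation_table` are hypothetical and are not part of the operator.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length columns."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def correlation_table(data):
    """Map each column name to its correlation with every column."""
    names = list(data)
    return {a: {b: pearson(data[a], data[b]) for b in names} for a in names}


# Hypothetical numeric columns.
data = {
    "price": [10.0, 12.0, 14.0, 16.0],
    "demand": [40.0, 35.0, 33.0, 20.0],
    "cost": [5.0, 6.5, 6.0, 8.0],
}

table = correlation_table(data)
for name, row in table.items():
    print(name, [round(row[other], 3) for other in table])
```

In this sample, the diagonal entries are 1 (each column against itself), the table is symmetric, and the price and demand pair is negative because demand falls as price rises.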

Note: These values are equivalent to the correlation coefficients in a linear regression (with the column name as the dependent variable). This output could be useful, for example, in deciding which variables to include in a linear regression model.

Data Output: None. This is a terminal operator.
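One way to read the note on linear regression in the Output section is through a simple (one-predictor) least-squares fit, where the squared correlation coefficient equals the regression's R² and the slope equals the correlation scaled by the ratio of standard deviations. The Python sketch below checks this on hypothetical data. It is an interpretation offered for illustration, not something the source confirms about the operator, and `simple_regression` is a hypothetical helper.

```python
from math import sqrt


def simple_regression(xs, ys):
    """Least-squares fit of y = intercept + slope * x.

    Returns (slope, intercept, r, r_squared), where r is the Pearson
    correlation and r_squared is computed from the residuals.
    """
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    slope = sxy / sxx
    intercept = my - slope * mx
    r = sxy / sqrt(sxx * syy)
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    r_squared = 1.0 - ss_res / syy
    return slope, intercept, r, r_squared


# Hypothetical predictor (x) and dependent variable (y).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

slope, intercept, r, r_squared = simple_regression(x, y)
print(slope, intercept, r, r_squared)
```

A column with a correlation near zero against the dependent variable explains little of its variance on its own, which is why the table can help when shortlisting candidate predictors.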