Correlation

This operator calculates the correlation (or covariance) between each pair of selected numeric attributes (columns) in a data set, so that the columns can be analyzed relative to each other.

Information at a Glance

Note: This operator can only be used with TIBCO® Data Virtualization and Apache Spark 3.2 or later.

Parameter	Description
Category	Explore
Data source type	TIBCO® Data Virtualization
Send output to other operators	Yes
Data processing tool	TIBCO® DV, Apache Spark 3.2 or later

Algorithm

The covariance between two variables (X and Y) is calculated as given in the following formula:

where x̄ and ȳ are the mean values of X and Y, respectively.

The correlation is calculated by normalizing the covariance, as given in the following formula:
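In standard notation, the covariance and correlation described in this section can be written as shown here. This is a sketch that assumes the sample (n - 1) normalization; the text does not state which denominator the operator uses, and the choice cancels out in the correlation.

```latex
\operatorname{cov}(X, Y) = \frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})

\operatorname{corr}(X, Y) = \frac{\operatorname{cov}(X, Y)}{\sigma_X \, \sigma_Y},
\qquad \sigma_X = \sqrt{\operatorname{cov}(X, X)}, \quad \sigma_Y = \sqrt{\operatorname{cov}(Y, Y)}
```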

For information about correlation and covariance, and the algorithms that describe them, see Correlation and Covariance.
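The two calculations can also be sketched in plain Python. This is an illustration only, not the operator's implementation: the function names are hypothetical, and the sample (n - 1) denominator is an assumption.

```python
import math

def covariance(xs, ys):
    """Sample covariance (n - 1 denominator, an assumption); NaN with fewer than two points."""
    n = len(xs)
    if n != len(ys):
        raise ValueError("xs and ys must have the same length")
    if n < 2:
        return math.nan
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

def correlation(xs, ys):
    """Covariance normalized by the product of the two standard deviations."""
    denom = math.sqrt(covariance(xs, xs) * covariance(ys, ys))
    if math.isnan(denom) or denom == 0:
        return math.nan  # undefined when a column is constant or too short
    return covariance(xs, ys) / denom

# Made-up values: y is a positive multiple of x.
x = [1.0, 2.0, 3.0, 4.0]
y = [3.0 * v for v in x]
print(covariance(x, y))   # 5.0
print(correlation(x, y))  # close to 1.0, up to floating-point rounding
```

Because y is a positive multiple of x, the covariance scales with the multiplier while the correlation stays at 1, which is the normalization the section describes.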

Input

The input is a single tabular data set.

Note: The Correlation operator generates a column named Attribute in its output. The input data set must therefore not contain a column named Attribute; otherwise, the operator returns an error.

Missing or Null Values

The selected columns should not have any null values.

Restrictions

The algorithm is relevant only for numeric data.
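The input requirements above (no column named Attribute, no null values in the selected columns, numeric data only) can be checked before the data reaches the operator. The following is a minimal sketch under the assumption that the data is available as a list of Python dictionaries; validate_input and RESERVED_COLUMN are hypothetical names, not part of the product.

```python
from numbers import Real

RESERVED_COLUMN = "Attribute"  # column name the operator adds to its output

def validate_input(rows, columns):
    """Return a list of problems found in rows (list of dicts) for the selected columns."""
    problems = []
    if rows and RESERVED_COLUMN in rows[0]:
        problems.append(f"input already has a column named '{RESERVED_COLUMN}'")
    for col in columns:
        for i, row in enumerate(rows):
            value = row.get(col)
            if value is None:
                problems.append(f"row {i}: column '{col}' is null")
            elif isinstance(value, bool) or not isinstance(value, Real):
                problems.append(f"row {i}: column '{col}' is not numeric")
    return problems

# Made-up rows for illustration.
good_rows = [{"a": 1.0, "b": 2}, {"a": 3.5, "b": 4}]
bad_rows = [{"Attribute": "x", "a": None, "b": "high"}]

print(validate_input(good_rows, ["a", "b"]))  # []
for problem in validate_input(bad_rows, ["a", "b"]):
    print(problem)
```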

Configuration

The following table provides the configuration details for the Correlation operator.

Parameter	Description
Notes	Notes or helpful information about this operator's parameter settings. When you enter content in the Notes field, a yellow asterisk appears on the operator.
Columns	Specify the numeric columns for which the correlation or covariance is calculated. Click Select Columns to select the required columns. Note: The input data set must not contain a column named Attribute; otherwise, the operator returns an error.
Group by	When you select one or more Group by columns, the operator calculates a separate correlation (or covariance) matrix for every combination of values in those columns. Click Select Columns to select the required columns. Note: The Group by selection cannot overlap with the columns selected in the Columns parameter. Each combination of the Group by columns and the Attribute column must have at least two unique data points to generate a correlation or covariance value; otherwise, the result is NaN.
Calculate	Specify whether to calculate the Correlation or the Covariance. Correlation is normalized covariance, scaled so that the correlation between any variable and a positive multiple of itself is always 1. Default: Correlation
Output Schema	Specify the schema for the output table or view.
Output Table	Specify the path and name of the table where the results are written. By default, this is a unique table name based on your user ID, workflow ID, and operator.
Store Results	When set to Yes, the operator saves the results. If set to No, the operator does not save the results.

Output

Visual Output

Displays the correlation (or covariance) matrices for each combination of the specified Group by values, stacked into one output.

Note: If no Group by column is specified, the output contains a single matrix calculated from the full data set.
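The stacked layout can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the operator's actual output schema: the function names, the sample (n - 1) denominator, the row layout (Group by values, then Attribute, then one value per selected column), and the data values are all hypothetical.

```python
import math
from collections import defaultdict

def covariance(xs, ys):
    """Sample covariance (n - 1 denominator, an assumption); NaN with fewer than two points."""
    n = len(xs)
    if n < 2:
        return math.nan
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

def grouped_matrices(rows, columns, group_by):
    """Compute one covariance matrix per group and stack the matrix rows into one list."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[g] for g in group_by)].append(row)

    stacked = []
    for key in sorted(groups):
        members = groups[key]
        for attr in columns:  # one output row per selected column
            record = dict(zip(group_by, key))
            record["Attribute"] = attr
            xs = [r[attr] for r in members]
            for other in columns:
                record[other] = covariance(xs, [r[other] for r in members])
            stacked.append(record)
    return stacked

# Made-up data: group "a" has three rows, group "b" has only one.
rows = [
    {"site": "a", "x": 1.0, "y": 2.0},
    {"site": "a", "x": 2.0, "y": 4.0},
    {"site": "a", "x": 3.0, "y": 6.0},
    {"site": "b", "x": 5.0, "y": 1.0},
]

for record in grouped_matrices(rows, ["x", "y"], ["site"]):
    print(record)

# With no Group by columns, a single matrix is computed from all rows.
for record in grouped_matrices(rows, ["x", "y"], []):
    print(record)
```

In this sketch, group "b" yields NaN values because it has fewer than two data points, and an empty Group by list produces one matrix for the full data.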

Data Output

The data shown in the visual output is passed to the downstream operator.

Example

The following example illustrates the Correlation operator.

Data

golf: This data set contains the following:

Columns: outlook, temperature, wind, humidity, and play.
Rows: 14.

Parameter Setting

The parameter settings for the golf data set are as follows:

Columns: temperature, humidity
Group by: outlook
Calculate: Covariance
Store Results: Yes

Results

The following figure displays the results for the parameter settings for the golf data set.
