Correlation

This operator takes two or more numeric columns (attributes) of a data set and calculates the correlation between each pair of selected columns, so that the columns can be analyzed relative to each other.

Information at a Glance

Note: This operator can only be used with TIBCO® Data Virtualization and Apache Spark 3.2 or later.

Parameter Description
Category Explore
Data source type TIBCO® Data Virtualization
Send output to other operators Yes
Data processing tool TIBCO® DV, Apache Spark 3.2 or later

Algorithm

The covariance between two variables (X and Y) is calculated as given in the following formula:

[covariance formula]

where $\bar{X}$ and $\bar{Y}$ are the mean values of X and Y, respectively.

The correlation is calculated by normalizing the covariance, as given in the following formula:

[correlation formula]
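The two formulas referenced above can be written out as follows. This is a sketch under an assumption the text does not confirm: it uses the sample normalization $n - 1$ (a population form would divide by $n$ instead). Here $x_i$ and $y_i$ are the paired observations of X and Y, and $n$ is the number of pairs; these symbols are introduced for illustration only.

```latex
\operatorname{cov}(X, Y) = \frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{X})(y_i - \bar{Y})

\operatorname{corr}(X, Y) = \frac{\operatorname{cov}(X, Y)}{\sqrt{\operatorname{cov}(X, X)\,\operatorname{cov}(Y, Y)}}
```

The normalization factor cancels in the ratio, so the correlation value is the same whether the covariance divides by $n - 1$ or by $n$.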

For information about correlation and covariance, and the algorithms that describe them, see Correlation and Covariance.
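As an illustration of the calculation described in this section, the following minimal Python sketch computes the covariance and the correlation for one pair of columns. The function names are hypothetical, and the n - 1 denominator is an assumption; the operator's actual implementation is not shown in this document.

```python
import math

def sample_covariance(xs, ys):
    """Covariance of two equal-length numeric sequences (n - 1 denominator, an assumption)."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least two points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

def correlation(xs, ys):
    """Covariance normalized by the standard deviations of both sequences."""
    std_x = math.sqrt(sample_covariance(xs, xs))
    std_y = math.sqrt(sample_covariance(ys, ys))
    return sample_covariance(xs, ys) / (std_x * std_y)

# Illustrative values only, not taken from any product data set.
x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 1.0, 4.0, 3.0]

print(sample_covariance(x, y))             # 1.0
print(correlation(x, y))                   # approximately 0.6
print(correlation(x, [3 * v for v in x]))  # approximately 1.0
```

The last line checks a column against a positive multiple of itself, for which the correlation is 1 up to floating-point rounding.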

Input

The input is a single tabular data set.

Note: The Correlation operator generates a column named Attribute in its output. The input data set must therefore not contain a column named Attribute; otherwise, an error results.
Missing or Null Values
The selected columns must not contain any null values.

Restrictions

The algorithm applies only to numeric data.
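The input rules above (no column named Attribute, no null values in the selected columns, numeric data only) could be checked before the operator runs. The following is a minimal sketch, assuming the rows are plain Python dictionaries; check_input and the sample column names are hypothetical and not part of the product.

```python
from numbers import Real

RESERVED_COLUMN = "Attribute"  # column name the operator adds to its output

def check_input(rows, columns):
    """Return a list of problems found in `rows` (a list of dicts) for the selected `columns`."""
    problems = []
    if any(RESERVED_COLUMN in row for row in rows):
        problems.append(f"input already has a column named {RESERVED_COLUMN!r}")
    for col in columns:
        values = [row.get(col) for row in rows]
        if any(v is None for v in values):
            problems.append(f"column {col!r} contains null values")
        # bool is excluded explicitly because Python treats it as a number
        if any(v is not None and (isinstance(v, bool) or not isinstance(v, Real))
               for v in values):
            problems.append(f"column {col!r} is not numeric")
    return problems

# Illustrative rows only.
rows = [
    {"height": 1.7, "weight": 65, "label": "a"},
    {"height": None, "weight": 80, "label": "b"},
]

print(check_input(rows, ["height", "weight"]))
print(check_input(rows, ["weight", "label"]))
```

The first call reports the null value in height; the second reports that label is not numeric.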

Configuration

The following table provides the configuration details for the Correlation operator.

Parameter Description
Notes Notes or helpful information about this operator's parameter settings. When you enter content in the Notes field, a yellow asterisk appears on the operator.
Columns Specify the numeric columns for which the correlation or covariance should be calculated. Click Select Columns to select the required columns.
Note: The input data set must not contain a column named Attribute; otherwise, an error results.
Group by When you select one or more Group by columns, the operator calculates a separate correlation (or covariance) matrix for every combination of values in those columns. Click Select Columns to select the required columns.
Note:
  • The Group by columns cannot overlap with the columns selected in the Columns parameter.
  • Each combination of Group by values and Attribute value must have at least two unique data points to generate a correlation or covariance value; otherwise, the result is NaN.

Calculate Specify whether to calculate the Correlation or the Covariance. Correlation is normalized covariance, scaled so that the correlation between any variable and a positive multiple of itself is always 1.

Default: Correlation

Output Schema Specify the schema for the output table or view.
Output Table Specify the path and name of the table where the results are written. By default, this is a unique table name based on your user ID, workflow ID, and operator.
Store Results When set to Yes, the operator saves the results. If set to No, the operator does not save the results.
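The interaction of the Columns, Group by, and Calculate parameters can be sketched in plain Python. This is an illustration under stated assumptions, not the operator's implementation: the function names, the n - 1 covariance denominator, and the output layout (the Group by columns, an Attribute column, then one column per selected column) are all hypothetical. The sketch treats each row as one data point and only checks the row count per group.

```python
import math
from collections import defaultdict

def pair_stat(xs, ys, calculate="Correlation"):
    """Covariance or correlation of two sequences; NaN when it cannot be computed."""
    n = len(xs)
    if n < 2:
        return math.nan  # fewer than two data points in the group
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    if calculate == "Covariance":
        return sxy / (n - 1)
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    denom = math.sqrt(sxx * syy)
    return sxy / denom if denom > 0 else math.nan

def grouped_matrices(rows, columns, group_by, calculate="Correlation"):
    """One matrix per combination of Group by values, stacked into a list of rows."""
    if set(columns) & set(group_by):
        raise ValueError("Group by columns cannot overlap with Columns")
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[g] for g in group_by)].append(row)
    stacked = []
    for key, members in groups.items():
        for attr in columns:
            out = dict(zip(group_by, key))
            out["Attribute"] = attr
            for other in columns:
                out[other] = pair_stat([r[attr] for r in members],
                                       [r[other] for r in members], calculate)
            stacked.append(out)
    return stacked

# Illustrative rows only: site "B" has a single row, so its values are NaN.
rows = [
    {"site": "A", "x": 1.0, "y": 2.0},
    {"site": "A", "x": 2.0, "y": 4.0},
    {"site": "A", "x": 3.0, "y": 6.0},
    {"site": "B", "x": 5.0, "y": 1.0},
]
result = grouped_matrices(rows, ["x", "y"], ["site"], calculate="Covariance")
for line in result:
    print(line)
```

Each group contributes one output row per selected column, so two groups and two columns yield four stacked rows.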

Output

Visual Output
Displays the correlation (or covariance) matrices for each combination of the specified Group by values, stacked into a single output.
Note: If no Group by column is specified, the output is a single matrix calculated from the full data set.
Data Output
The data shown in the visual output is passed to the downstream operator.

Example

The following example illustrates the Correlation operator.

Correlation operator workflow
Data
golf: This data set contains the following:
  • Five columns: outlook, temperature, wind, humidity, and play.
  • 14 rows.
Parameter Setting
The parameter settings for the golf data set are as follows:
  • Columns: temperature, humidity
  • Group by: outlook
  • Calculate: Covariance
  • Store Results: Yes

Results
The following figure displays the results for the parameter settings for the golf data set.
Correlation operator - Output tab