Wide Data Variable Selector - Chi Square / Anova

From a very large data set (that is, one whose variables number in the thousands or millions), produces a new data set with correlations and significance statistics for each predictor (X) variable against a user-specified dependent (Y) variable.

Information at a Glance

Parameter	Description
Category	Transform
Data source type	HD
Send output to other operators	Yes
Data processing tool	Spark SQL

Algorithm

For each predictor (X) variable, the operator computes the correlation against the dependent (Y) variable. If categorical predictors exist, they are converted to continuous predictors using impact coding before the correlations are calculated. The algorithm does two passes through the data, one to collect the dependent values and another to calculate the correlations.
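As a rough illustration of impact coding followed by a correlation, the following Python sketch codes one categorical predictor and correlates it with the dependent variable. It assumes a binary dependent variable encoded as 0/1, and it defines impact coding as the per-level mean of the dependent variable minus the overall mean. Both choices, and the helper names impact_code and pearson, are assumptions made for this example; the operator itself runs on Spark and its exact coding scheme is not described here.

```python
from collections import defaultdict
from math import sqrt


def impact_code(categories, y):
    """Replace each category level with mean(y | level) - mean(y).

    Assumption: y is numeric (here a 0/1 indicator of the dependent class).
    """
    grand_mean = sum(y) / len(y)
    sums = defaultdict(float)
    counts = defaultdict(int)
    for level, value in zip(categories, y):
        sums[level] += value
        counts[level] += 1
    return [sums[level] / counts[level] - grand_mean for level in categories]


def pearson(x, y):
    """Plain Pearson correlation; returns NaN when either side is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy / sqrt(sxx * syy)


# Hypothetical data: one categorical predictor and a 0/1 dependent variable.
y = [1, 0, 1, 0, 0, 1]
color = ["red", "blue", "red", "red", "blue", "green"]

coded = impact_code(color, y)   # categorical levels become continuous values
r = pearson(coded, y)           # correlation of the coded predictor with y
```

Because the coded values are derived from the dependent variable itself, a sketch like this tends to overstate the correlation on small samples; it is meant only to show the mechanics.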

Note: For this operator, the dependent variable must be categorical. If your dependent variable is continuous, use the Wide Data Variable Selector - Correlations operator instead.

The t statistic and corresponding p value calculations use the following formula.

Scalability is limited only by the available cluster resources.

Input

A single tabular data set that contains key-value pairs of variables and values in stacked format, with variable_names, continuous_values, categorical_values, and row_id columns.

Bad or Missing Data

The input table is expected to contain no missing data, and each predictor and dependent variable must have at least two values. If missing data is encountered, it is casewise deleted.
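The following Python sketch shows one way to read casewise deletion for a single predictor. It assumes that a case (row ID) is dropped whenever either the predictor value or the dependent value is absent or null; the function name complete_cases and the dictionary-based layout are hypothetical and are not part of the operator.

```python
def complete_cases(x_by_row, y_by_row):
    """Keep only row IDs that have both a predictor and a dependent value.

    x_by_row, y_by_row: dicts mapping row_id -> value (None means missing).
    Returns (x, y) pairs ordered by row_id.
    """
    shared_ids = x_by_row.keys() & y_by_row.keys()
    kept = sorted(
        rid for rid in shared_ids
        if x_by_row[rid] is not None and y_by_row[rid] is not None
    )
    return [(x_by_row[rid], y_by_row[rid]) for rid in kept]


# Hypothetical values: row 2 has a null predictor, row 4 has no predictor entry.
x_values = {1: 2.0, 2: None, 3: 5.0}
y_values = {1: "a", 2: "b", 3: "b", 4: "a"}

pairs = complete_cases(x_values, y_values)
```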

Error and Exception Handling

The operation checks for validity of the dependent variable specification. See the Algorithm section for more information.

If the dependent variable is categorical, then it should be in a categorical values column and have discrete values (string, long, int).
If the dependent variable is continuous, then use the operator Wide Data Variable Selector - Correlations.

If a variable has fewer than 2 cases, the correlation cannot be calculated and the operation returns NaN.

If a variable has fewer than 3 cases, the t statistic and p value cannot be calculated and the operation returns 0 and 1, respectively.
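The fallback rules above can be sketched in Python as follows. The thresholds and fallback values (NaN, 0, and 1) come from this section; everything else is an assumption for illustration: the function name guarded_statistics, the standard t formula for a correlation coefficient, the handling of a perfect correlation, and the normal approximation used for the two-sided p value in place of an exact t distribution.

```python
import math


def guarded_statistics(r, n):
    """Return (correlation, t statistic, p value) with the documented fallbacks.

    r: correlation already computed from n complete cases.
    """
    if n < 2:
        # Not enough cases for a correlation.
        return math.nan, 0.0, 1.0
    if n < 3:
        # Correlation is defined, but t and p are not.
        return r, 0.0, 1.0
    if abs(r) >= 1.0:
        # Assumption: treat a perfect correlation as infinitely significant.
        return r, math.copysign(math.inf, r), 0.0
    # Assumption: standard t statistic for a correlation coefficient.
    t = r * math.sqrt((n - 2) / (1.0 - r * r))
    # Assumption: normal approximation to the two-sided p value.
    p = math.erfc(abs(t) / math.sqrt(2.0))
    return r, t, p


too_few_for_r = guarded_statistics(0.5, 1)
too_few_for_t = guarded_statistics(0.5, 2)
regular_case = guarded_statistics(0.5, 27)
```

The normal approximation is reasonable only for larger samples; a production calculation would use the t distribution with n - 2 degrees of freedom.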

Configuration

Parameter	Description
Notes	Notes or helpful information about this operator's parameter settings. When you enter content in the Notes field, a yellow asterisk appears on the operator.
Dependent Variable Name	The name of the dependent variable against which the correlation is computed. The dependent variable must be categorical. If it is continuous, then use the operator Wide Data Variable Selector - Correlations. Required.
Variables Column	The name of the column that contains the variable names, including the name of the dependent variable.
Continuous Values Column	The name of the column that contains continuous predictor values.
Categorical Values Column	The name of the column that contains the categorical predictor values. Required.
Row ID Column	The name of the column that contains the row ID numbers. Required.
Number of Bins	The number of bins used for the correlation. The default is 10.
Chi Square Output	Can be one of the following: Anova; Chi-Square; Chi-Square and p values.
Output Directory	The location to store the output files.
Output Name	The name to contain the results.
Overwrite Output	Specifies whether to delete existing data at that path. Yes: if the path exists, delete that file and save the results. No: fail if the path already exists.
Storage Format	Select the format in which to store the results. The storage format is determined by your type of operator. Typical formats are Avro, CSV, TSV, or Parquet.
Compression	Select the type of compression for the output. Available Parquet compression options: GZIP, Deflate, Snappy, no compression. Available Avro compression options: Deflate, Snappy, no compression.
Advanced Spark Settings Automatic Optimization	Yes specifies using the default Spark optimization settings. No enables providing customized Spark optimization. Click Edit Settings to customize Spark optimization. See Advanced Settings dialog for more information.

Output

Visual Output

A tabular preview of the output data set, which includes Output and Summary tabs.

Output

A single tabular data set containing correlations for each predictor along with significance statistics.

Summary

The default summary, which includes parameters selected, input data size, and output location.

Data Output

A single tabular data set that contains correlations for each predictor, along with significance statistics.

Example

The following example shows the relationship between a wide table and the stacked table input the operator requires.
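As a minimal sketch of that relationship, the Python snippet below reshapes a small wide table into stacked rows. The output column names (row_id, variable_names, continuous_values, categorical_values) follow the Input section; the function name stack_wide_table, the sample columns, and the explicit list of categorical columns are hypothetical choices made for this example.

```python
def stack_wide_table(wide_rows, id_column, categorical_columns):
    """Turn one wide row per case into one stacked row per (case, variable)."""
    stacked = []
    for row in wide_rows:
        for name, value in row.items():
            if name == id_column:
                continue
            is_categorical = name in categorical_columns
            stacked.append({
                "row_id": row[id_column],
                "variable_names": name,
                # Exactly one of the two value columns is filled per row.
                "continuous_values": None if is_categorical else float(value),
                "categorical_values": str(value) if is_categorical else None,
            })
    return stacked


# Hypothetical wide table: one row per case, one column per variable.
wide = [
    {"id": 1, "age": 34, "churned": "yes"},
    {"id": 2, "age": 51, "churned": "no"},
]

stacked = stack_wide_table(wide, id_column="id", categorical_columns={"churned"})
```

Here the two wide rows become four stacked rows, and "churned" would be entered as the Dependent Variable Name.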
