Chi Square, Independence Test

Determines whether categorical columns are statistically independent of a categorical dependent variable column.

Information at a Glance

Category	Predict
Data source type	HD
Sends output to other operators	Yes
Data processing tool	Spark

See Pearson's Chi Square Operations for information about the Chi Square operators.

Input

The operator requires a tabular input on Hadoop. The input should contain at least two categorical columns: one that represents the independent variable and one that represents the dependent variable. The operator can run the chi square test on multiple independent columns in one run. In this case, each independent column is compared to the dependent column and produces one row of chi square test metrics in the output data set.
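The pairing of each independent column with the dependent column can be sketched in plain Python. This is only an illustration of the idea, not the operator's Spark implementation; the function name contingency_tables and the sample column names (region, plan, churned) are hypothetical.

```python
from collections import Counter

def contingency_tables(rows, dependent, independents):
    """Build one table of (independent value, dependent value) counts per independent column."""
    return {
        col: Counter((row[col], row[dependent]) for row in rows)
        for col in independents
    }

# Hypothetical sample data: two independent columns and one dependent column.
rows = [
    {"region": "east", "plan": "basic", "churned": "yes"},
    {"region": "east", "plan": "premium", "churned": "no"},
    {"region": "west", "plan": "basic", "churned": "yes"},
    {"region": "west", "plan": "basic", "churned": "no"},
]

tables = contingency_tables(rows, "churned", ["region", "plan"])
for column, table in tables.items():
    print(column, dict(table))
```

Each entry in tables corresponds to one row of the operator's output, because every independent column is tested against the dependent column separately.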

Bad or Missing Values: Before any chi square test is computed, rows with null values in any of the independent or dependent columns are dropped. These rows are reported and written to a file according to the value of the Write Rows Removed Due to Null Data To File parameter.
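As a rough sketch of this null-handling step, the following Python splits rows into those kept for the analysis and those removed. It assumes nulls are represented as None; the helper name drop_null_rows and the sample columns are hypothetical and do not reflect the operator's internals.

```python
def drop_null_rows(rows, columns):
    """Split rows into (kept, removed); a row is removed if any listed column is None."""
    kept, removed = [], []
    for row in rows:
        if any(row.get(col) is None for col in columns):
            removed.append(row)
        else:
            kept.append(row)
    return kept, removed

# Hypothetical sample data with nulls in an independent and a dependent column.
rows = [
    {"region": "east", "plan": "basic", "churned": "yes"},
    {"region": None, "plan": "basic", "churned": "no"},
    {"region": "west", "plan": "premium", "churned": None},
    {"region": "west", "plan": "basic", "churned": "no"},
]

kept, removed = drop_null_rows(rows, ["region", "plan", "churned"])
print(len(kept), "rows kept;", len(removed), "rows removed")
```

The removed list plays the role of the rows that the operator can report or write to a file.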

Restrictions

Scalability issues might result if there are many distinct values in the categorical columns (more than 1,000) or if many independent columns are selected.

Configuration

Parameter	Description
Notes	Any notes or helpful information about this operator's parameter settings. When you enter content in the Notes field, a yellow asterisk is displayed on the operator.
Dependent Column	A categorical column.
Independent Columns	One or more categorical columns to compare to the dependent column. The null hypothesis in this case is that the categories in each independent column and the categories in the dependent column are statistically independent.
Significance Threshold	The significance level (alpha) below which we reject the null hypothesis. Practically, this value is used to determine the Reject Null Hypothesis column in the output. We reject the null hypothesis if the P-Value is less than this value. The P-Value represents the probability, assuming the columns are independent, of observing a relationship at least as strong as the one in the data.
Use Fisher's Exact Test instead of Chi Square	Select to use the Fisher's exact test rather than the more common, and more robust, Pearson's Chi Square test. Note: For computational and theoretical reasons, the Fisher's exact test is appropriate only for 2 x 2 tables (that is, when the independent and dependent variables each have only two possible outcomes) and when the number of observations is extremely small. We cannot perform a Fisher's exact test when the cells in the table have a value greater than five. Yes - Compute a Fisher's exact test for all the independent variables. No (the default) - Use the Chi Square test.
Write Rows Removed Due to Null Data To File	Rows with null values in at least one of the independent columns or the dependent column are removed from the analysis. This parameter allows you to specify whether the rows with null values are written to a file. From the drop-down list, specify one of the following. Do Not Write Null Rows to File - remove null value data and display it in the result UI, but do not write it to an external file. Do Not Write or Count Null Rows (Fastest) - remove null value data, but do not count it or display it in the result UI. Write All Null Rows to File - remove null value data and write all removed rows to an external file. The file is written to the following location: @default_tempdir/alpine_out/@user_name/@flow_name/@operator_name_uuid_bad_data
Output Directory	The location to store the output files.
Output Name	The name of the output that contains the results.
Overwrite Output	Specifies whether to delete existing data at that path. Yes - if the path exists, delete that file and save the results. No - fail if the path already exists.
Storage Format	Select the format in which to store the results. The storage format is determined by your type of operator. Typical formats are Avro, CSV, TSV, or Parquet.
Compression	Select the type of compression for the output. Available Parquet compression options: GZIP, Deflate, Snappy, or no compression. Available Avro compression options: Deflate, Snappy, or no compression.
Advanced Spark Settings Automatic Optimization	Yes specifies using the default Spark optimization settings. No enables providing customized Spark optimization. Click Edit Settings to customize Spark optimization. See Advanced Settings Dialog Box for more information.

Output

Visual Output

Data Output

The results of the statistical test for each independent variable are output to the next operator. However, the format of these results differs depending on whether a Chi Square test or a Fisher's exact test is used.

In the case of a Chi Square test, we output a table with the following columns and one row for each independent column selected:

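Independent Variable: The name of the independent column.
Degrees of Freedom: The number of degrees of freedom. The degrees of freedom differ between the Chi Square test of independence and the Chi Square test for Goodness of Fit. For the test of independence, the value is (r - 1) x (c - 1), where r and c are the numbers of distinct values in the independent and dependent columns.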
Chi Square Statistic: The test statistic, a decimal. The statistic measures the difference between the observed distribution and the distribution expected if the columns were independent.
P-Value: The probability, assuming the independent and dependent variables are unrelated, of observing a chi square statistic at least as large as the one computed. Lower P-Values indicate stronger evidence of a relationship between the independent and dependent variables. The P-Value is a function of the degrees of freedom and the test statistic. In general, a higher chi square statistic for the same degrees of freedom leads to a lower P-Value. By convention, tests that yield P-Values of less than 0.05 (a true null hypothesis would be rejected 5 percent of the time) are considered to show significance.
Reject Null Hypothesis: Whether the P-Value was less than the alpha value set in the parameters. (The default alpha value is 0.05).
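To show how these columns relate to one another, here is a minimal standard-library Python sketch of Pearson's chi square test of independence for one independent column. It is an illustration under stated assumptions, not the operator's code: the function names chi2_sf and chi_square_independence and the sample counts are hypothetical, no continuity correction is applied, and each column is assumed to have at least two distinct values.

```python
import math
from collections import Counter

def chi2_sf(x, df):
    """Upper-tail probability of a chi square distribution with integer df >= 1."""
    y = x / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-y)               # Q(1, y) = exp(-y)
    else:
        a, q = 0.5, math.erfc(math.sqrt(y))    # Q(1/2, y) = erfc(sqrt(y))
    # Recurrence: Q(a + 1, y) = Q(a, y) + y^a * exp(-y) / Gamma(a + 1)
    while a < df / 2.0:
        q += y ** a * math.exp(-y) / math.gamma(a + 1.0)
        a += 1.0
    return q

def chi_square_independence(pairs, alpha=0.05):
    """pairs: list of (independent value, dependent value) tuples with no nulls."""
    observed = Counter(pairs)
    row_totals = Counter(x for x, _ in pairs)
    col_totals = Counter(y for _, y in pairs)
    n = len(pairs)
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    if df < 1:
        raise ValueError("each column needs at least two distinct values")
    stat = 0.0
    for r, r_total in row_totals.items():
        for c, c_total in col_totals.items():
            expected = r_total * c_total / n   # count expected under independence
            stat += (observed[(r, c)] - expected) ** 2 / expected
    p_value = chi2_sf(stat, df)
    return {
        "degrees_of_freedom": df,
        "chi_square": stat,
        "p_value": p_value,
        "reject_null": p_value < alpha,
    }

# Hypothetical counts: plan (independent) against churned (dependent).
pairs = (
    [("basic", "yes")] * 30 + [("basic", "no")] * 20
    + [("premium", "yes")] * 10 + [("premium", "no")] * 40
)
result = chi_square_independence(pairs)
print(result)
```

In this sample, the observed counts are far from the counts expected under independence, so the statistic is large, the P-Value is small, and the null hypothesis is rejected at the default alpha of 0.05.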

In the case of the Fisher's exact test, the output contains only the Independent Variable, P-Value, and Reject Null Hypothesis columns. This is because the Fisher's exact test does not compute a test statistic from which the probability is estimated. Instead, it computes the probability directly and does not use degrees of freedom.
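The direct probability calculation can be sketched for a 2 x 2 table as follows. This example assumes a two-sided test that sums the probabilities of all tables no more likely than the observed one; the source does not state which convention the operator uses, and the function name fisher_exact_2x2 and the sample counts are hypothetical.

```python
import math

def fisher_exact_2x2(a, b, c, d):
    """Two-sided Fisher's exact P-Value for the 2 x 2 table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    n = row1 + row2

    def prob(x):
        # Hypergeometric probability of x in the top-left cell with fixed margins.
        return math.comb(row1, x) * math.comb(row2, col1 - x) / math.comb(n, col1)

    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    p_observed = prob(a)
    # Sum every table that is no more likely than the observed table.
    # The small relative tolerance guards against floating-point ties.
    return sum(
        prob(x)
        for x in range(lowest, highest + 1)
        if prob(x) <= p_observed * (1 + 1e-9)
    )

# Hypothetical 2 x 2 table with very small counts.
p_value = fisher_exact_2x2(3, 1, 1, 3)
print(round(p_value, 4), p_value < 0.05)
```

No test statistic or degrees of freedom appear anywhere in the calculation, which is why the Fisher's exact test output has fewer columns.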