Pearson's Chi Square Operations

Team Studio provides two Chi Square operators for testing hypotheses on a Hadoop data set: Independence Test and Goodness of Fit.

Independence Test

Use the Chi Square, Independence Test operator to determine whether categorical columns are statistically independent of a categorical dependent variable column.

For example, if we have a dataset of user churn that includes the factors gender and browser type, we could use the Independence Test operator to determine whether there is a statistically significant relationship between gender and churn, or between browser and churn. The Chi Square test of independence can be used as the basis for performing inferential statistics on data with many categorical variables, or as an exploratory step, such as determining which factors to include in a logistic regression or decision tree. We leverage the Spark MLlib implementation of Pearson's Chi Square test of independence.
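To make the churn example concrete, the following is a minimal sketch of Pearson's Chi Square statistic for one factor column against the dependent column. It is not the operator's implementation (which uses MLlib); the function name and the counts are hypothetical and chosen only for illustration.

```python
from collections import Counter

def chi_square_independence(rows):
    """Pearson's Chi Square statistic for (factor, outcome) pairs.

    Illustrative sketch only; the name and the input shape are assumptions.
    """
    rows = list(rows)
    n = len(rows)
    cell = Counter(rows)                    # observed count per (factor, outcome)
    row_tot = Counter(f for f, _ in rows)   # totals per factor value
    col_tot = Counter(o for _, o in rows)   # totals per outcome value
    stat = 0.0
    for f in row_tot:
        for o in col_tot:
            # Count expected in this cell if factor and outcome were independent
            expected = row_tot[f] * col_tot[o] / n
            stat += (cell[(f, o)] - expected) ** 2 / expected
    dof = (len(row_tot) - 1) * (len(col_tot) - 1)
    return stat, dof

# Hypothetical data: 100 users per gender, with different churn counts.
data = (
    [("male", "churn")] * 30 + [("male", "stay")] * 70
    + [("female", "churn")] * 50 + [("female", "stay")] * 50
)
stat, dof = chi_square_independence(data)
print(stat, dof)
```

With one degree of freedom, a statistic above 3.841 rejects independence at the 0.05 level, so in this made-up data gender and churn would be reported as related.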

This operator also includes the option to use Fisher's exact test. Fisher's exact test is a test of statistical significance similar to the Chi Square test, with slightly different assumptions. It is recommended when sample sizes are small (cell counts less than five), and it can be calculated only on 2 x 2 tables. The operator is based on the formula described in the article Fisher's exact test.
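As a rough illustration of how an exact test works on a 2 x 2 table, the sketch below computes a two-sided p-value from the hypergeometric distribution. The function name is hypothetical, and the two-sided rule used here (summing every table that is no more probable than the observed one) is one common convention; the operator may define it differently.

```python
from math import comb

def fisher_exact_2x2(a, b, c, d):
    """Two-sided Fisher's exact p-value for the table [[a, b], [c, d]].

    Illustrative sketch; assumes the "sum tables no more likely than the
    observed one" convention for the two-sided p-value.
    """
    r1, r2 = a + b, c + d        # row totals
    c1 = a + c                   # first column total
    n = r1 + r2
    denom = comb(n, c1)

    def prob(x):
        # Hypergeometric probability that the top-left cell equals x
        return comb(r1, x) * comb(r2, c1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, c1 - r2), min(r1, c1)
    # The small tolerance guards against floating-point ties
    return sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))

# Hypothetical small-sample table with every cell count below five.
p_value = fisher_exact_2x2(3, 1, 1, 3)
print(p_value)
```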

To examine whether the frequencies of events differ from a normal distribution or from a known distribution, consider using the Chi Square, Goodness of Fit operator.

To examine statistical significance between quantitative variables, consider using one of the T-Test operators. (See T-Test - Independent Samples, T-Test - Paired Samples, or T-Test - Single Sample for more information.)

Fun fact: Fisher developed the test to analyze the results of an experiment that tested his friend Dr. Muriel Bristol's assertion that she could tell the difference between a cup of tea in which the milk was poured first and a cup of tea with milk added second.

Goodness of Fit

Use the Chi Square, Goodness of Fit operator to test how well observed frequencies fit an expected distribution.

In this case, the Chi Square test is performed on two vectors of event frequencies: the vector of observed frequencies and the vector of expected frequencies. The null hypothesis is that the frequencies in each cell (the frequency of each event occurring) are equal in the observed and expected distributions. This test differs from the test of independence in its degrees of freedom: here they equal the number of possible events minus one, whereas the test of independence uses (rows - 1) x (columns - 1) from the contingency table.
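The calculation on the two vectors can be sketched as follows. This is an illustration, not the operator's code: the function name and the die-roll counts are hypothetical. Spark's MLlib documentation describes rescaling the expected vector when its sum differs from the observed sum; the sketch mirrors that, but whether the operator relies on it is an assumption.

```python
def chi_square_goodness_of_fit(observed, expected):
    """Chi Square statistic and degrees of freedom for two frequency vectors.

    Illustrative sketch; both vectors must list the same events in the same order.
    """
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    # Rescale expected so that both vectors sum to the same total (assumption)
    scale = sum(observed) / sum(expected)
    stat = sum(
        (o - e * scale) ** 2 / (e * scale)
        for o, e in zip(observed, expected)
    )
    dof = len(observed) - 1      # number of possible events minus one
    return stat, dof

# Hypothetical example: a six-sided die rolled 60 times, tested against a fair die.
gof_stat, gof_dof = chi_square_goodness_of_fit(
    [8, 12, 9, 11, 10, 10],
    [10, 10, 10, 10, 10, 10],
)
print(gof_stat, gof_dof)
```

With five degrees of freedom the 0.05 critical value is about 11.07, so these made-up rolls give no evidence against a fair die.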

Use this test to compare observed outcomes against a known theoretical distribution. For that reason, the operator accepts the observed and expected frequency data as two separate datasets, because these datasets might not be the same length. (Most likely, the expected dataset is already aggregated, with the sum of frequencies for each distinct event, while the observed dataset comes from real data with each observation as a row.) Our implementation leverages the Chi Square test in Spark's MLlib. We use the absolute frequencies of the observed and expected datasets as the input vectors for the test.
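One way to picture the preparation step is shown below: raw observations (one row per event) are counted and lined up against a pre-aggregated expected table so that both vectors list the same events in the same order. The function name, the event labels, and the choice to reject events missing from the expected table are assumptions made for this sketch, not documented operator behavior.

```python
from collections import Counter

def align_frequencies(observed_rows, expected_counts):
    """Build aligned observed and expected frequency vectors.

    observed_rows: one event label per observation (raw data).
    expected_counts: mapping of event label to expected frequency (aggregated).
    Illustrative sketch only.
    """
    observed_counts = Counter(observed_rows)
    unknown = set(observed_counts) - set(expected_counts)
    if unknown:
        raise ValueError(f"events missing from expected data: {sorted(unknown)}")
    events = sorted(expected_counts)                  # fixed, shared ordering
    observed = [observed_counts[e] for e in events]   # 0 if an event was never seen
    expected = [expected_counts[e] for e in events]
    return events, observed, expected

# Hypothetical data: six raw observations against an aggregated expectation.
events, observed, expected = align_frequencies(
    ["chrome", "firefox", "chrome", "safari", "chrome", "firefox"],
    {"chrome": 50, "firefox": 30, "safari": 20},
)
print(events, observed, expected)
```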