T-Test - Independent Samples

Computes a test of statistical significance against Student's t-distribution for one measure across two different groups.

Information at a Glance

Category Model Validation
Data source type HD
Sends output to other operators Yes
Data processing tool Spark

The independent samples t-test is used to test whether two groups are significantly different for the same measure.

For information about Student's t-distribution, see https://en.wikipedia.org/wiki/Student%27s_t-distribution

Algorithm

The means and variances used in the test statistics are computed with Spark's MultivariateStatisticalSummary object, while the t-tests themselves are computed with the Apache Commons Math library for Java.
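To illustrate how a t statistic follows from per-group means and variances alone, here is a minimal Python sketch. It is not the operator's implementation (which uses Spark and Commons Math). The function name `t_statistic` and its parameters are hypothetical, and it assumes the variances are sample variances (n - 1 denominator) and that the unequal-variance case uses the Welch form.

```python
import math

def t_statistic(mean1, var1, n1, mean2, var2, n2, equal_variance=False):
    """Return (t, degrees_of_freedom) from per-group summary statistics.

    Hypothetical helper for illustration only.
    var1 and var2 are assumed to be sample variances (n - 1 denominator).
    """
    if equal_variance:
        # Homoscedastic form: pool both groups into one variance estimate.
        df = n1 + n2 - 2
        pooled = ((n1 - 1) * var1 + (n2 - 1) * var2) / df
        se = math.sqrt(pooled * (1.0 / n1 + 1.0 / n2))
    else:
        # Welch form: separate variances, Welch-Satterthwaite degrees of freedom.
        a, b = var1 / n1, var2 / n2
        se = math.sqrt(a + b)
        df = (a + b) ** 2 / (a ** 2 / (n1 - 1) + b ** 2 / (n2 - 1))
    return (mean1 - mean2) / se, df
```

With equal group sizes and equal variances, both forms give the same statistic and the same degrees of freedom. They diverge as the group sizes or variances become unbalanced.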

Input

A tabular data set with numeric columns.

Configuration

Parameter Description
Notes Any notes or helpful information about this operator's parameter settings. When you enter content in the Notes field, a yellow asterisk is displayed on the operator.
Values A numeric column that contains the sample values.
Columns to Test Specify the columns to test.
Column To Group By A categorical column used to separate the two samples.
First Group Value (appears in group by column) The name of the group for the first sample. This must be a value in the Column to Group By list.
Note: For example, to split out people who voted for a particular candidate versus those who did not, and assuming the values in the column were "yes" and "no", you could specify "yes" as the first group value and "no" as the second group value.
Second Group Value (appears in group by column) The name of the group for the second sample. This must be a value in the Column to Group By list.
Sample Means Have Equal Variance (Homoscedastic T-Test) Specify whether to use a homoscedastic t-test (No or Yes).
Write Rows Removed Due to Null Data To File Rows with null values are removed from the analysis. This parameter allows you to specify that the data with null values be written to a file.

The file is written to the same directory as the rest of the output. The filename is suffixed with _baddata.

  • Do Not Write Null Rows to File - removes rows with null values and reports them in the result UI, but does not write them to an external file.
  • Do Not Write or Count Null Rows (Fastest) - removes rows with null values without counting them or reporting them in the result UI.
  • Write All Null Rows to File - removes rows with null values and writes all removed rows to an external file.
Advanced Spark Settings Automatic Optimization
  • Yes specifies using the default Spark optimization settings.
  • No enables providing customized Spark optimization. Click Edit Settings to customize Spark optimization. See Advanced Settings Dialog Box for more information.
Storage Format Select the format in which to store the results. The storage format is determined by your type of operator.

Typical formats are Avro, CSV, TSV, or Parquet.

Compression Select the type of compression for the output.
Available Parquet compression options:
  • GZIP
  • Deflate
  • Snappy
  • no compression

Available Avro compression options:

  • Deflate
  • Snappy
  • no compression
Output Directory The location to store the output files.
Output Name The name of the output that contains the results.
Overwrite Output Specifies whether to delete existing data at that path.
  • Yes - if the path exists, delete that file and save the results.
  • No - fail if the path already exists.
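
The grouping and null-handling parameters above can be pictured with a small Python sketch. It is an illustration only, not the operator's Spark code. The function `split_groups` and the column names `income` and `voted` are hypothetical. The sketch assumes that a row counts as null when either the tested value or the group value is missing, and that rows belonging to neither group are ignored.

```python
def split_groups(rows, value_col, group_col, first, second):
    """Split rows into two samples and count rows dropped for null data.

    Hypothetical helper: `rows` is a list of dicts, one per input row.
    """
    sample_a, sample_b, removed = [], [], 0
    for row in rows:
        value, group = row.get(value_col), row.get(group_col)
        if value is None or group is None:
            removed += 1                     # null row: excluded from the test
        elif group == first:
            sample_a.append(float(value))    # First Group Value
        elif group == second:
            sample_b.append(float(value))    # Second Group Value
        # Assumption: rows in any other group are left out of both samples.
    return sample_a, sample_b, removed

rows = [
    {"income": 52, "voted": "yes"},
    {"income": None, "voted": "no"},
    {"income": 47, "voted": "no"},
    {"income": 61, "voted": None},
]
yes_sample, no_sample, removed = split_groups(rows, "income", "voted", "yes", "no")
```

Here `removed` plays the role of the null-row count that the first and third write options report. The fastest option skips that bookkeeping.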

Output

Visual Output
See Independent Samples T-Test Use Case for example data on a puppy training program that illustrates use of the independent samples t-test. The results contain one row for each column selected in the Columns to Test parameter. In this case, the puppies that Jenny trained do not perform significantly differently from those that Rachel trained, either before or after training, because none of the p-values are close to zero.

Data Output
  • T Statistic - A value computed from the means, variances, and sizes of the two samples. The larger the magnitude of the t-statistic, the larger the difference between the means relative to the variability in the samples.
  • Two Tailed PValue - The sum of the area under the Student's t-distribution above the absolute value of the t-statistic and below the negative of that absolute value. A lower value indicates stronger evidence that the means of the two samples differ. The null hypothesis is usually rejected if p < 0.05.
  • Lower One Tailed PValue - The area under the Student's t-distribution between negative infinity and the t statistic. A lower p-value indicates that the mean of sample a is less than the mean of sample b. The null hypothesis is usually rejected if p < 0.05.
  • Upper One Tailed PValue - The area under the Student's t-distribution between the t statistic and positive infinity. A lower p-value indicates that the mean of sample a is greater than the mean of sample b. The null hypothesis is usually rejected if p < 0.05.
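
The relationship between the three p-values above can be sketched in Python. This is an illustration under stated assumptions, not the operator's method: the names `t_pdf` and `p_values` are hypothetical, the tail areas are approximated with Simpson's rule over the Student's t density rather than with a library routine, and the fixed step count suits moderate t values only.

```python
import math

def t_pdf(x, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))

def p_values(t, df, steps=2000):
    """Approximate the two-tailed, lower, and upper p-values for t.

    Hypothetical helper; `steps` must be even for Simpson's rule.
    """
    a = abs(t)
    h = a / steps
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central = total * h / 3            # area between 0 and |t|
    tail = max(0.0, 0.5 - central)     # area beyond |t| on one side
    lower = tail if t < 0 else 1.0 - tail   # area from -infinity to t
    return {"two_tailed": 2 * tail, "lower": lower, "upper": 1.0 - lower}
```

The lower and upper one-tailed values always sum to 1, and the two-tailed value is twice the smaller of the two. A negative t statistic therefore produces a small lower p-value, and a positive one produces a small upper p-value.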