T-Test - Single Sample

Tests whether the mean of a set of numeric values (from one column) differs significantly from a known mean. A single operator can run the test on several sample columns at once.

Information at a Glance

Category - Model Validation
Data source type - HD
Sends output to other operators - Yes
Data processing tool - Spark

The single sample t-test checks whether the mean of a sample differs significantly from a known population mean.

For information about Student's t-distribution, see https://en.wikipedia.org/wiki/Student%27s_t-distribution.

Algorithm

The means and variances that feed the test statistics are computed with Spark's MultivariateStatisticalSummary object; the t-tests themselves are computed with the Apache Commons Math (commons-math) Java library.
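
As a rough illustration of the calculation (not the operator's actual Spark or commons-math code), the t statistic for each column can be derived from that column's mean, variance, and row count. The function name, column names, and data values below are hypothetical.

```python
import math
import statistics


def single_sample_t(values, assumed_mean):
    """Return (t_statistic, degrees_of_freedom) for one column of numbers."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two values")
    sample_mean = statistics.fmean(values)
    sample_var = statistics.variance(values)  # unbiased: divides by n - 1
    if sample_var == 0:
        raise ValueError("variance is zero; the t statistic is undefined")
    standard_error = math.sqrt(sample_var / n)
    return (sample_mean - assumed_mean) / standard_error, n - 1


# Hypothetical sample columns, all tested against the same assumed mean.
columns = {
    "col_a": [52.0, 55.0, 61.0, 58.0, 54.0],
    "col_b": [60.0, 66.0, 63.0, 71.0, 65.0],
}
assumed_mean = 50.0

# One result per selected column, mirroring one output row per column.
results = {name: single_sample_t(vals, assumed_mean) for name, vals in columns.items()}
```

Each entry in `results` pairs a t statistic with its degrees of freedom (row count minus one), which is what is needed to look up p-values in the Student's t-distribution.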

Input

A tabular data set with numeric columns.

Configuration

Parameter - Description
Notes - Any notes or helpful information about this operator's parameter settings. When you enter content in the Notes field, a yellow asterisk is displayed on the operator.
Sample Columns - Select the column(s) of numeric values on which to compute the t-test.
Assumed Mean - Enter a numeric value (the population mean) against which to compute the t-test.
Write Bad Data To File - Rows with null values are removed from the analysis. This parameter lets you specify whether those rows are written to a file.

The file is written to the same directory as the rest of the output. The filename is suffixed with _baddata.

  • Do Not Write Null Rows to File - remove rows with null values and report them in the result UI, but do not write them to an external file.
  • Do Not Write or Count Null Rows (Fastest) - remove rows with null values, but do not count them or report them in the result UI.
  • Write All Null Rows to File - remove rows with null values and write all removed rows to an external file.
Storage Format - Select the format in which to store the results. The available storage formats are determined by the type of operator.

Typical formats are Avro, CSV, TSV, or Parquet.

Compression - Select the type of compression for the output.
Available Parquet compression options:
  • GZIP
  • Deflate
  • Snappy
  • no compression

Available Avro compression options:

  • Deflate
  • Snappy
  • no compression
Output Directory - The location to store the output files.
Output Name - The name under which to store the results.
Overwrite Output - Specifies whether to delete existing data at that path.
  • Yes - if the path exists, delete that file and save the results.
  • No - fail if the path already exists.

Output

Visual Output
Each row represents a column selected in the Sample Columns parameter.

See Single Sample T-Test Use Case for example data on a puppy training program that illustrates use of the single sample t-test. In that example, the Upper One Tailed PValue for the Score_Before_Training column is very close to zero, so the puppies were above average before training. The Upper One Tailed PValue for the Score_After_Training column is also close to zero, so the puppies are still above average after training.



Data Output
  • T Statistic - A value computed from the sample mean, sample variance, and sample size. The larger the magnitude of the t-statistic, the larger the difference between the sample mean and the assumed mean relative to the standard error.
  • Two Tailed PValue - The total area under the Student's t-distribution above the absolute value of the t-statistic and below the negative of that absolute value. A lower p-value indicates stronger evidence that the sample mean differs from the assumed mean. We usually reject the null hypothesis if p < 0.05.
  • Lower One Tailed PValue - The area under the Student's t-distribution between negative infinity and the t-statistic. A lower p-value indicates that the sample mean is less than the assumed mean. We usually reject the null hypothesis if p < 0.05.
  • Upper One Tailed PValue - The area under the Student's t-distribution between the t-statistic and positive infinity. A lower p-value indicates that the sample mean is greater than the assumed mean. We usually reject the null hypothesis if p < 0.05.
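
The three p-values can be sketched from a t-statistic and its degrees of freedom (row count minus one). This is a minimal illustration that integrates the t density numerically with Simpson's rule. It is not how the operator computes them (the operator relies on the commons-math library), and the function names here are hypothetical.

```python
import math


def t_pdf(x, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))


def t_cdf(t, df, steps=2000):
    """P(T <= t): Simpson's rule on [0, |t|], then symmetry around zero."""
    a = abs(t)
    if a == 0.0:
        return 0.5
    h = a / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = min(total * h / 3, 0.5)  # clamp against rounding error
    return 0.5 + area if t > 0 else 0.5 - area


def p_values(t, df):
    """Lower, upper, and two-tailed p-values for a t-statistic."""
    lower = t_cdf(t, df)           # area from negative infinity to t
    upper = 1.0 - lower            # area from t to positive infinity
    two_tailed = 2.0 * min(lower, upper)
    return {"lower": lower, "upper": upper, "two_tailed": two_tailed}


# Hypothetical t-statistic and degrees of freedom.
p = p_values(3.79, 4)
reject_upper = p["upper"] < 0.05  # evidence the sample mean exceeds the assumed mean
```

The fixed step count keeps the sketch short but loses accuracy for very large t-statistics; a statistics library is the better choice for real analyses.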