T-Test - Single Sample
Tests whether the mean of a set of numeric values (from one column) differs significantly from a known mean. This operator can compute the test across several sample columns at once.
Information at a Glance
| Parameter | Description |
|---|---|
| Category | Model Validation |
| Data source type | HD |
| Send output to other operators | Yes |
| Data processing tool | Spark |
The single sample t-test is used to test whether the mean of a sample differs significantly from a known population mean.
For information about Student's t-distribution, see https://en.wikipedia.org/wiki/Student%27s_t-distribution.
Algorithm
The means and variances used in the test statistics are computed using Spark's MultivariateStatisticalSummary object, but the t-tests themselves are computed with the Apache Commons Math (commons-math) Java library.
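The calculation behind the t statistic can be sketched in plain Python. This is not the operator's Spark or Java code; it is a minimal illustration that assumes the standard single sample formula (sample mean minus assumed mean, divided by the standard error). The function name `one_sample_t_statistic` and the `scores` values are hypothetical.

```python
import math
import statistics


def one_sample_t_statistic(values, assumed_mean):
    """Return (t statistic, degrees of freedom) for a single sample t-test."""
    # Mirror the operator's documented behavior: rows with nulls are dropped.
    clean = [v for v in values if v is not None]
    n = len(clean)
    if n < 2:
        raise ValueError("need at least two non-null values")

    sample_mean = statistics.mean(clean)
    sample_variance = statistics.variance(clean)  # unbiased: divides by n - 1
    if sample_variance == 0:
        raise ValueError("t statistic is undefined when all values are equal")

    standard_error = math.sqrt(sample_variance / n)
    t_statistic = (sample_mean - assumed_mean) / standard_error
    return t_statistic, n - 1


# Hypothetical sample column containing one null value.
scores = [5.1, 4.9, None, 5.6, 5.8, 6.0]
t_stat, dof = one_sample_t_statistic(scores, assumed_mean=5.0)
print(f"t = {t_stat:.3f}, degrees of freedom = {dof}")
```

To cover several sample columns, the same function would be called once per column with the same assumed mean.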
Input
A tabular data set with numeric columns.
Configuration
| Parameter | Description |
|---|---|
| Notes | Notes or helpful information about this operator's parameter settings. When you enter content in the Notes field, a yellow asterisk appears on the operator. |
| Sample Columns | Select the column(s) of numeric values on which to compute the t-test. |
| Assumed Mean | Enter a numeric value (the population mean) against which to compute the t-test. |
| Write Bad Data To File | Rows with null values are removed from the analysis. This parameter allows you to specify that the data with null values be written to a file. The file is written to the same directory as the rest of the output. The file name is suffixed with _baddata. |
| Storage Format | Select the format in which to store the results. The storage format is determined by your type of operator. Typical formats are Avro, CSV, TSV, or Parquet. |
| Compression | Select the type of compression for the output. Separate sets of compression options are available for Parquet and Avro. |
| Output Directory | The location to store the output files. |
| Output Name | The name to contain the results. |
| Overwrite Output | Specifies whether to delete existing data at that path. |
Output
See Single Sample T-Test Use Case for example data on a puppy training program that illustrates use of the single sample t-test. In that example, the Upper One Tailed PValue for the Score_Before_Training column is very close to zero, so the puppies score above average before training. The Upper One Tailed PValue for the Score_After_Training column is also close to zero, so the puppies are still above average after training.

- T Statistic - The difference between the sample mean and the assumed mean, divided by the standard error of the sample mean. The greater the magnitude of the t statistic, the larger the difference between the means relative to the variability of the sample.
- Two Tailed PValue - The sum of the areas under the Student's t-distribution above the absolute value of the t statistic and below the negative of that absolute value. A lower value indicates stronger evidence that the sample mean differs from the assumed mean in either direction. We usually reject the null hypothesis if p < 0.05.
- Lower One Tailed PValue - The area under the Student's t-distribution between negative infinity and the t statistic. A lower p-value indicates that the sample mean is less than the assumed mean. We usually reject the null hypothesis if p < 0.05.
- Upper One Tailed PValue - The area under the Student's t-distribution between the t statistic and positive infinity. A lower p-value indicates that the sample mean is greater than the assumed mean. We usually reject the null hypothesis if p < 0.05.
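The three p-values are all areas under the same Student's t-distribution, so they can be derived from one cumulative area. The sketch below uses only the Python standard library and approximates the area by numerical integration (Simpson's rule); this is an assumption made for illustration, not the method the operator uses. The function names and the example inputs are hypothetical.

```python
import math


def student_t_pdf(x, dof):
    """Density of Student's t-distribution with `dof` degrees of freedom."""
    log_norm = (math.lgamma((dof + 1) / 2) - math.lgamma(dof / 2)
                - 0.5 * math.log(dof * math.pi))
    return math.exp(log_norm - (dof + 1) / 2 * math.log1p(x * x / dof))


def student_t_cdf(t, dof, steps=20000):
    """Area under the density from negative infinity to t."""
    # The density is symmetric about zero, so integrate from 0 to |t|
    # with Simpson's rule (steps must be even) and offset from 0.5.
    upper = abs(t)
    h = upper / steps
    total = student_t_pdf(0.0, dof) + student_t_pdf(upper, dof)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * student_t_pdf(i * h, dof)
    area = total * h / 3
    return 0.5 + area if t >= 0 else 0.5 - area


def t_test_p_values(t, dof):
    """Return the lower, upper, and two tailed p-values for a t statistic."""
    lower = student_t_cdf(t, dof)         # negative infinity up to t
    upper = 1.0 - lower                   # t up to positive infinity
    two_tailed = 2.0 * min(lower, upper)  # both tails beyond |t|
    return {
        "lower_one_tailed": lower,
        "upper_one_tailed": upper,
        "two_tailed": two_tailed,
    }


# Hypothetical inputs: a t statistic of 2.304 with 4 degrees of freedom.
for name, value in t_test_p_values(2.304, 4).items():
    print(f"{name}: {value:.4f}")
```

The lower and upper one tailed values always sum to 1, and the two tailed value is twice the smaller of the two, which is why a positive t statistic gives a small upper one tailed p-value and a large lower one tailed p-value.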