T-Test - Independent Samples
Computes a test of statistical significance against Student's t-distribution for one measure across two different groups.
Information at a Glance
Parameter | Description |
---|---|
Category | Model Validation |
Data source type | HD |
Send output to other operators | Yes |
Data processing tool | Spark |
The independent samples t-test is used to test whether two groups are significantly different for the same measure.
For information about Student's t-distribution, see https://en.wikipedia.org/wiki/Student%27s_t-distribution
Algorithm
The means and variances used in the test statistics are computed using Spark's MultivariateStatisticalSummary object, but the t-tests themselves are computed with the Apache Commons Math (commons-math) Java library.
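As a rough illustration of how a t statistic follows from per-group means and variances, here is a minimal Python sketch using only the standard library. It is not the operator's code: the function name `t_statistic` and the `equal_variance` flag are hypothetical, and the sketch assumes the standard pooled-variance form for the homoscedastic case and the Welch form otherwise.

```python
import math
from statistics import mean, variance


def t_statistic(a, b, equal_variance=False):
    """Return (t, degrees_of_freedom) for two independent samples.

    Hypothetical helper for illustration only; it is not part of the
    operator, Spark, or Commons Math.
    """
    n1, n2 = len(a), len(b)
    m1, m2 = mean(a), mean(b)
    # Unbiased sample variances (n - 1 in the denominator).
    v1, v2 = variance(a), variance(b)

    if equal_variance:
        # Homoscedastic case: pool the two variances.
        df = n1 + n2 - 2
        pooled = ((n1 - 1) * v1 + (n2 - 1) * v2) / df
        std_err = math.sqrt(pooled * (1 / n1 + 1 / n2))
    else:
        # Unequal variances (Welch): Welch-Satterthwaite degrees of freedom.
        std_err = math.sqrt(v1 / n1 + v2 / n2)
        df = (v1 / n1 + v2 / n2) ** 2 / (
            (v1 / n1) ** 2 / (n1 - 1) + (v2 / n2) ** 2 / (n2 - 1)
        )

    return (m1 - m2) / std_err, df
```

For example, `t_statistic([1, 2, 3, 4, 5], [2, 4, 6, 8, 10], equal_variance=True)` uses 8 degrees of freedom, while the default call produces the same t value for these equal-sized samples but a smaller, fractional number of degrees of freedom.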
Input
A tabular data set with numeric columns.
Configuration
Parameter | Description |
---|---|
Notes | Notes or helpful information about this operator's parameter settings. When you enter content in the Notes field, a yellow asterisk appears on the operator. |
Values | A numeric column that contains the sample values. |
Columns to Test | Specify the columns to test. |
Column To Group By | A categorical column used to separate the two samples. |
First Group Value (appears in group by column) | The name of the group for the first sample. This must be a value in the Column to Group By list. Note: For example, to compare people who voted for a particular candidate with those who did not, and assuming the values in the column are "yes" and "no", specify "yes" as the first group value and "no" as the second group value. |
Second Group Value (appears in group by column) | The name of the group for the second sample. This must be a value in the Column to Group By list. |
Sample Means Have Equal Variance (Homoscedastic T-Test) | Specify whether to assume that the two samples have equal variance and use a homoscedastic (pooled-variance) t-test (No or Yes). |
Write Rows Removed Due to Null Data To File | Rows with null values are removed from the analysis. This parameter allows you to specify that the data with null values be written to a file. The file is written to the same directory as the rest of the output. The filename is suffixed with _baddata. |
Advanced Spark Settings Automatic Optimization |
|
Storage Format | Select the format in which to store the results. The storage format is determined by your type of operator. Typical formats are Avro, CSV, TSV, or Parquet. |
Compression | Select the type of compression for the output. Available Parquet compression options. Available Avro compression options. |
Output Directory | The location to store the output files. |
Output Name | The name to contain the results. |
Overwrite Output | Specifies whether to delete existing data at that path. |
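To make the grouping parameters concrete, the following Python sketch shows one way the two samples could be separated from a tabular input. It is an illustration under assumptions, not the operator's logic: the function `split_samples` and its arguments are hypothetical, a row is treated as "null" when either the value or the group column is missing, and rows whose group value matches neither configured value are simply ignored.

```python
def split_samples(rows, value_col, group_col, first_group, second_group):
    """Split rows into two samples by group value.

    Hypothetical helper: returns (first_sample, second_sample, removed_rows),
    where removed_rows are the rows dropped because of null data.
    """
    first, second, removed = [], [], []
    for row in rows:
        value = row.get(value_col)
        group = row.get(group_col)
        if value is None or group is None:
            # Assumed null rule: a null in either column removes the row.
            removed.append(row)
        elif group == first_group:
            first.append(value)
        elif group == second_group:
            second.append(value)
        # Assumption: rows with any other group value are left out of both samples.
    return first, second, removed
```

With a group-by column containing "yes" and "no", calling `split_samples(rows, "age", "vote", "yes", "no")` would put the "yes" values in the first sample and the "no" values in the second; the `removed` list corresponds to the kind of rows the _baddata file is meant to capture.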
Output
- T Statistic - A value computed from the difference between the sample means, scaled by the sample variances. The greater the magnitude of the t statistic, the greater the difference between the means relative to the variability in the samples.
- Two Tailed PValue - The sum of the area under the Student's t-distribution above the absolute value of the t statistic and below the negative of that absolute value. A lower value indicates stronger evidence that the means of the two samples differ. The null hypothesis is usually rejected if p < 0.05.
- Lower One Tailed PValue - The area under the Student's t-distribution between negative infinity and the t statistic. A lower p-value indicates that the mean of sample a is less than the mean of sample b. The null hypothesis is usually rejected if p < 0.05.
- Upper One Tailed PValue - The area under the Student's t-distribution between the t statistic and positive infinity. A lower p-value indicates that the mean of sample a is greater than the mean of sample b. The null hypothesis is usually rejected if p < 0.05.
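The relationship between the three p-values can be sketched in a few lines of Python. This is an illustrative approximation, not the operator's implementation: the helper names `t_pdf`, `t_cdf`, and `p_values` are hypothetical, and the cumulative distribution is approximated with Simpson's rule so that the example needs only the standard library (a statistics library would normally be used instead).

```python
import math


def t_pdf(x, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    log_norm = (
        math.lgamma((df + 1) / 2)
        - math.lgamma(df / 2)
        - 0.5 * math.log(df * math.pi)
    )
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))


def t_cdf(t, df, steps=2000):
    """Approximate P(T <= t) with Simpson's rule on [0, |t|] plus symmetry.

    steps must be even for Simpson's rule.
    """
    bound = abs(t)
    if bound == 0:
        return 0.5
    h = bound / steps
    total = t_pdf(0.0, df) + t_pdf(bound, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3  # area under the density between 0 and |t|
    return 0.5 + area if t > 0 else 0.5 - area


def p_values(t, df):
    """Return the two-tailed, lower one-tailed, and upper one-tailed p-values."""
    lower = t_cdf(t, df)   # area from negative infinity up to t
    upper = 1.0 - lower    # area from t up to positive infinity
    return {
        "two_tailed": 2.0 * min(lower, upper),
        "lower_one_tailed": lower,
        "upper_one_tailed": upper,
    }
```

The lower and upper one-tailed values always sum to 1, and the two-tailed value is twice the smaller of the two, which is why a strongly negative t statistic gives a small lower one-tailed p-value and a large upper one.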