Summary Statistics
The Summary Statistics operator loads a data set and calculates basic statistics for each of the selected columns, either over the entire data set or grouped by another set of columns.
Information at a Glance
| Parameter | Description |
|---|---|
| Category | Explore |
| Data source type | TIBCO® Data Virtualization |
| Send output to other operators | Yes |
| Data processing tool | TIBCO® DV, Apache Spark 3.2 or later |
Algorithm
The Summary Statistics operator takes an input data set and performs basic statistical calculations.
For each selected column, the operator computes Count, Distinct, Min, Max, Mean, the numbers of positive, negative, zero, empty, and null values, Lower Quartile, Upper Quartile, Median, Standard Deviation, Coefficient of Variation, and the n most common values with their counts, where n is set by the Number of Most Common Values to Display parameter.
When Group By columns are selected, these statistics are calculated for each unique combination of values of the Group By columns, and corresponding Group_By_<col> columns are added to the output data set.
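The statistics listed above can be sketched in plain Python. This is a minimal illustration, not the operator's implementation: the function names `summarize` and `summarize_by_group` are hypothetical, `None` and `""` stand in for null and empty values, Count here means the number of values that remain after skipping those, and the quartiles and standard deviation use the defaults of Python's `statistics` module, which may differ from the method the operator uses.

```python
import statistics
from collections import Counter


def summarize(values, n_most_common=3):
    """Summarize one numeric column.

    Assumptions of this sketch: None marks a null value, "" marks an
    empty value, and both are skipped before any statistic is computed.
    """
    present = [v for v in values if v is not None and v != ""]
    # Quartiles use the statistics module's default (exclusive) method;
    # the operator's own quartile method may differ.
    lower_q, median, upper_q = statistics.quantiles(present, n=4)
    mean = statistics.fmean(present)
    std_dev = statistics.stdev(present)  # sample standard deviation
    return {
        "count": len(present),
        "distinct": len(set(present)),
        "min": min(present),
        "max": max(present),
        "mean": mean,
        "positive": sum(v > 0 for v in present),
        "negative": sum(v < 0 for v in present),
        "zero": sum(v == 0 for v in present),
        "empty": sum(v == "" for v in values),
        "null": sum(v is None for v in values),
        "lower_quartile": lower_q,
        "upper_quartile": upper_q,
        "median": median,
        "std_dev": std_dev,
        # Standard deviation divided by the mean; undefined when the mean is 0.
        "coeff_of_variation": std_dev / mean if mean else None,
        "most_common": Counter(present).most_common(n_most_common),
    }


def summarize_by_group(rows, column, group_by, n_most_common=3):
    """Apply summarize() to `column` once per distinct `group_by` value."""
    groups = {}
    for row in rows:
        groups.setdefault(row[group_by], []).append(row[column])
    return {key: summarize(vals, n_most_common) for key, vals in groups.items()}
```

Calling `summarize_by_group(rows, "x", "g")` on a list of dictionaries returns one dictionary of statistics per distinct value of `g`, which mirrors how the grouped output holds one set of measures per group.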
Input
The input is a single tabular data set.
Missing or Null Values
The operator skips missing or null values when it calculates the statistics, including the number of distinct values.
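The effect of skipping missing values can be shown with a small Python sketch. The helper `skip_missing` and the sample readings are hypothetical, and treating both `None` and NaN as missing is an assumption made for this illustration.

```python
import statistics


def skip_missing(values):
    """Drop entries treated as missing: None and NaN (an assumption here)."""
    # NaN is the only float that does not compare equal to itself.
    return [v for v in values if v is not None and v == v]


readings = [70, None, 90, float("nan"), 70]
kept = skip_missing(readings)        # only the three real readings remain
mean = statistics.fmean(kept)        # averaged over 3 values, not 5
distinct = len(set(kept))            # missing entries are not counted as a value
```

Here the mean is taken over the three remaining readings, and the distinct count is 2. Filling the two gaps with 0 instead would pull the mean down to 46, which is why the missing entries are dropped rather than substituted.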
Configuration
The following table provides the configuration details for the Summary Statistics operator.
| Parameter | Description |
|---|---|
| Notes | Notes or helpful information about this operator's parameter settings. When you enter content in the Notes field, a yellow asterisk appears on the operator. |
| Columns | Specify the columns for which the summary statistics should be displayed. Click Select Columns to select the available columns from the input data set for analysis. The selected columns cannot be used in the Group By parameter box. |
| Group By | Specify the columns in the input data set to determine grouped results. Click Select Columns to select the available columns from the input data set for analysis. |
| Calculate the Number of Distinct Values (slower) | Specify whether to calculate the number of distinct values for the selected columns. Default: Yes. Note: Calculating distinct values can add significant processing time. |
| Number of Most Common Values to Display | Specify the maximum number of most common values, with their corresponding counts, to output for each column. |
| Output Schema | Specify the schema for the output table or view. |
| Output Table | Specify the path and name of the table where the results are written. By default, this is a unique table name based on your user ID, workflow ID, and operator. |
| Store Results | When set to Yes, the operator saves the results. If set to No, the operator does not save the results. |
Output
- Output: A table that displays the analysis results of the selected fields, limited by the maximum display of rows and columns.
The default contents of a table are as follows:
- Group By column name
- Column name
- Data type
- Count
- Distinct value
- Min value
- Max value
- Mean value
- Positive value count
- Negative value count
- Zero value count
- Null value count
- Empty value count
- Lower Quartile
- Upper Quartile
- Median (approx.) - Approximate median value for numerical columns.
- Standard Deviation
- Coefficient of Variation
- Most Common Value - The most common value for the column.
- Most Common Count - The number of occurrences of the most common value for the column.
- Parameter Summary Info: Displays a list of the input parameters and their current settings.
A tabular data set containing one row per selected column and, if Group By columns are selected, per combination of values of those columns. The columns hold the statistical measures calculated for each input column.
Example
The following example demonstrates the Summary Statistics operator.
Data
golf: This data set contains the following:
- Five columns: outlook, temperature, wind, humidity, and play.
- 14 rows.
The parameter settings for the golf data set are as follows:
- Columns: outlook, temperature, humidity, play
- Group By: wind
- Calculate the Number of Distinct Values (slower): Yes
- Number of Most Common Values to Display: 3
- Store Results: Yes
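As a rough illustration, the settings above can be written as a Python dictionary together with a check for the rule from the Configuration table that a column selected under Columns cannot also be used in Group By. The dictionary keys and the `validate_settings` helper are hypothetical names chosen for this sketch; they are not part of the product.

```python
def validate_settings(settings):
    """Raise ValueError if a column is both summarized and used for grouping."""
    overlap = set(settings["columns"]) & set(settings["group_by"])
    if overlap:
        raise ValueError(f"Columns also used in Group By: {sorted(overlap)}")
    return settings


# Settings from the golf example; the key names are illustrative only.
golf_settings = validate_settings({
    "columns": ["outlook", "temperature", "humidity", "play"],
    "group_by": ["wind"],
    "calculate_distinct": True,
    "most_common_values": 3,
    "store_results": True,
})
```

Because wind appears only under Group By, the check passes. Adding wind to the `columns` list as well would raise a `ValueError`.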
The following figure displays the results for the parameter settings for the golf data set.