Batch Aggregation
This operator performs aggregations on multiple columns using the Batch Aggregation algorithm from Spark MLlib.
Information at a Glance
| Parameter | Description |
|---|---|
| Category | Transform |
| Data source type | TIBCO® Data Virtualization |
| Send output to other operators | Yes |
| Data processing tool | TIBCO® DV, Apache Spark 3.2 or later |
Algorithm
The Batch Aggregation operator takes an input data set and performs multiple aggregations on multiple columns. Each row of the output data set contains the aggregation results for one group, where the groups are determined by the Group By columns.
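The grouping behavior described above can be sketched in plain Python. This is a minimal illustration of the semantics only, not the operator's Spark implementation. The function name `batch_aggregate`, the sample rows, and the `column_aggregation` output names are assumptions made for this sketch.

```python
from collections import defaultdict
from statistics import mean


def batch_aggregate(rows, group_by, aggregations):
    """Group rows by the group_by columns and apply each (column, label, func)."""
    groups = defaultdict(list)
    for row in rows:
        key = tuple(row[col] for col in group_by)
        groups[key].append(row)

    output = []
    for key, members in groups.items():
        # One output row per group: the group key plus one value per aggregation.
        out_row = dict(zip(group_by, key))
        for column, label, func in aggregations:
            values = [r[column] for r in members if r[column] is not None]
            out_row[f"{column}_{label}"] = func(values) if values else None
        output.append(out_row)
    return output


# Hypothetical rows, for illustration only (not the data set from the Example section).
rows = [
    {"outlook": "sunny", "temperature": 85, "humidity": 85},
    {"outlook": "sunny", "temperature": 80, "humidity": 90},
    {"outlook": "rain", "temperature": 70, "humidity": 96},
    {"outlook": "rain", "temperature": 68, "humidity": 80},
    {"outlook": "overcast", "temperature": 83, "humidity": 78},
]

result = batch_aggregate(
    rows,
    group_by=["outlook"],
    aggregations=[
        ("humidity", "max", max),
        ("temperature", "min", min),
        ("temperature", "sum", sum),
        ("temperature", "mean", mean),
    ],
)
for row in result:
    print(row)
```

Each printed dictionary corresponds to one output row: the Group By value followed by one entry per requested aggregation.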
Input
The input is a single tabular data set.
Configuration
The following table provides the configuration details for the Batch Aggregation operator.
| Parameter | Description |
|---|---|
| Notes | Notes or helpful information about this operator's parameter settings. When you enter content in the Notes field, a yellow asterisk appears on the operator. |
| Group By | Specify the columns in the input data set to determine grouped results. Click Select Columns to open the dialog for selecting the available columns from the input data set for analysis. |
| Find Maximum | Maximum value for each of these columns for each group. Click Select Columns to open the dialog for selecting the available columns from the input data set for analysis. |
| Find Minimum | Minimum value for each of these columns for each group. Click Select Columns to open the dialog for selecting the available columns from the input data set for analysis. |
| Calculate Sum | Sums for each of these columns for each group. |
| Calculate Mean | Mean value for each of these columns for each group. |
| Calculate Variance | Variance for each of these columns for each group. See Aggregation Methods for Batch Aggregation for implementation and performance details. |
| Calculate Standard Deviation | Standard deviation for each of these columns for each group. See Aggregation Methods for Batch Aggregation for implementation and performance details. |
| Calculate Number of Distinct (slower) | Number of distinct values (excluding null values) for each of these columns for each group. See Aggregation Methods for Batch Aggregation for implementation and performance details. |
| Calculate Median (slower) | Median for each of these columns for each group. See Aggregation Methods for Batch Aggregation for implementation and performance details. |
| Column Name Format | Specify whether the aggregation type is added to the beginning or the end of the column name in the output. The available options are suffix and prefix. Default: suffix |
| Output Schema | Specify the schema for the output table or view. |
| Output Table | Specify the table path and name where the output of the results is generated. By default, this is a unique table name based on your user ID, workflow ID, and operator. |
| Store Results | When set to Yes, the operator saves the results. If set to No, the operator does not save the results. |
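To make a few of these parameters concrete, the sketch below shows how a suffix or prefix name could be built, how a distinct count that skips null values behaves, and how sample and population variance differ. The helper names, the `_` separator, and the `max` label are assumptions for illustration; the operator's actual output names and its variance definition are not stated here (see Aggregation Methods for Batch Aggregation).

```python
from statistics import pvariance, variance


def output_column_name(column, aggregation, name_format="suffix"):
    """Build an output column name. The '_' separator is an assumption."""
    if name_format == "suffix":
        return f"{column}_{aggregation}"
    if name_format == "prefix":
        return f"{aggregation}_{column}"
    raise ValueError(f"unknown Column Name Format: {name_format!r}")


def count_distinct(values):
    """Count distinct values, excluding nulls (None)."""
    return len({v for v in values if v is not None})


print(output_column_name("humidity", "max"))            # humidity_max
print(output_column_name("humidity", "max", "prefix"))  # max_humidity
print(count_distinct(["weak", "strong", None, "weak"]))  # 2

# Hypothetical values; the two definitions give different results.
humidity = [85, 90]
print(variance(humidity))   # sample variance (n - 1 denominator)
print(pvariance(humidity))  # population variance (n denominator)
```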
Output
- Output: Displays the specified aggregates for specified columns of the input data set.
- Parameter Summary Info: Displays a list of the input parameters and their current settings.
- Column Data Sizes: Displays the size of the entire data set.
Example
The following example demonstrates the Batch Aggregation operator on an input data set that has:
- Five columns: outlook, temperature, wind, humidity, and play.
- 14 rows.
The operator is configured with the following parameter settings:
- Group By: outlook
- Find Maximum: humidity
- Find Minimum: temperature
- Calculate Sum: temperature
- Calculate Mean: temperature, humidity
- Calculate Variance: humidity
- Calculate Standard Deviation: humidity
- Calculate Number of Distinct (slower): temperature, wind, play
- Column Name Format: suffix
- Store Results: Yes
The groups in the output are not sorted alphabetically. To order the aggregated rows, connect this operator to a sorting operator.
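The effect of a downstream sort can be sketched as follows. The aggregated rows, their values, and the `humidity_max` column name are hypothetical and stand in for the operator's output; in a workflow, the sort is done by a separate sorting operator, not by this code.

```python
# Hypothetical aggregated rows, in arbitrary group order.
aggregated = [
    {"outlook": "sunny", "humidity_max": 95},
    {"outlook": "overcast", "humidity_max": 90},
    {"outlook": "rain", "humidity_max": 96},
]

# Order the rows alphabetically by the Group By column.
sorted_rows = sorted(aggregated, key=lambda row: row["outlook"])
print([row["outlook"] for row in sorted_rows])  # ['overcast', 'rain', 'sunny']
```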