Batch Aggregation

This operator performs aggregations on multiple columns using the Batch Aggregation algorithm from Spark MLlib.

Batch Aggregate operator icon

Information at a Glance

Note: This operator can only be used with TIBCO® Data Virtualization and Apache Spark 3.2 or later.

Parameter: Description
Category: Transform
Data source type: TIBCO® Data Virtualization
Send output to other operators: Yes
Data processing tool: TIBCO® DV, Apache Spark 3.2 or later

Algorithm

The Batch Aggregation operator takes an input data set and performs multiple aggregations on multiple columns. Each row of the output data set contains the aggregation results for one group, where the groups are determined by the Group By columns.
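The grouping behavior described above can be sketched in plain Python. This is a minimal illustration of one output row per group, not the operator's Spark implementation. The function name batch_aggregate, the input rows, and the output column names are all hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical input rows; column names and values are illustrative only.
rows = [
    {"region": "east", "sales": 10.0, "units": 1},
    {"region": "west", "sales": 4.0, "units": 2},
    {"region": "east", "sales": 6.0, "units": 5},
]


def batch_aggregate(rows, group_by, aggregations):
    """Return one output row per distinct combination of group_by values.

    aggregations maps an output column name to (input column, function).
    """
    groups = defaultdict(list)
    for row in rows:
        key = tuple(row[col] for col in group_by)
        groups[key].append(row)

    output = []
    for key, members in groups.items():
        out = dict(zip(group_by, key))  # start with the group key columns
        for out_name, (col, func) in aggregations.items():
            out[out_name] = func([m[col] for m in members])
        output.append(out)
    return output


result = batch_aggregate(
    rows,
    group_by=["region"],
    aggregations={
        "sales_max": ("sales", max),
        "sales_mean": ("sales", mean),
        "units_sum": ("units", sum),
    },
)
for row in result:
    print(row)
```

Several aggregations over several columns are computed in a single pass over the groups, which is the sense in which the aggregation is "batched" in this sketch.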

Input

The input is a single tabular data set.

Missing or Null Values
The operator skips missing or null values when performing aggregations or counting the number of distinct values.
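As a rough illustration of the null handling, the following plain-Python sketch drops null values before aggregating, so they affect neither the mean nor the distinct count. The helper non_null and the sample values are hypothetical, and None stands in for a missing or null value.

```python
from statistics import mean


def non_null(values):
    """Keep only the values that are not null (None in this sketch)."""
    return [v for v in values if v is not None]


# Hypothetical column values for one group; None marks a missing value.
humidity = [85, None, 90, 85, None]

kept = non_null(humidity)
mean_humidity = mean(kept)        # averaged over the 3 non-null values only
distinct_count = len(set(kept))   # nulls are not counted as a distinct value

print(len(kept), mean_humidity, distinct_count)
```

Note that the mean is divided by the number of non-null values, not by the total number of rows in the group.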

Configuration

The following table provides the configuration details for the Batch Aggregation operator.

Parameter: Description
Notes: Notes or helpful information about this operator's parameter settings. When you enter content in the Notes field, a yellow asterisk appears on the operator.
Group By: Columns in the input data set that define the groups. Click Select Columns to open the dialog for selecting from the available columns of the input data set.
Find Maximum: Maximum value for each of these columns for each group. Click Select Columns to open the dialog for selecting from the available columns of the input data set.
Find Minimum: Minimum value for each of these columns for each group. Click Select Columns to open the dialog for selecting from the available columns of the input data set.
Calculate Sum: Sum for each of these columns for each group.
Calculate Mean: Mean value for each of these columns for each group.
Calculate Variance: Variance for each of these columns for each group. See Aggregation Methods for Batch Aggregation for implementation and performance details.
Calculate Standard Deviation: Standard deviation for each of these columns for each group. See Aggregation Methods for Batch Aggregation for implementation and performance details.
Calculate Number of Distinct (slower): Number of distinct values (excluding null values) for each of these columns for each group. See Aggregation Methods for Batch Aggregation for implementation and performance details.
Calculate Median (slower): Median for each of these columns for each group. See Aggregation Methods for Batch Aggregation for implementation and performance details.
Column Name Format: Specify whether the aggregation type is added to the beginning or the end of the column name in the output. The available options are suffix and prefix. Default: suffix
Output Schema: Specify the schema for the output table or view.
Output Table: Specify the path and name of the table where the results are written. By default, this is a unique table name based on your user ID, workflow ID, and operator.
Store Results: When set to Yes, the operator saves the results. When set to No, the operator does not save the results.
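The effect of the Column Name Format parameter can be sketched as follows. The exact output names are not specified in this topic, so the underscore separator and the short aggregation labels (such as max) are assumptions made for illustration, and the function output_column_name is hypothetical.

```python
def output_column_name(column, aggregation, name_format="suffix", sep="_"):
    """Build an output column name from an input column and an aggregation label.

    The separator and the aggregation labels are assumptions; the operator's
    actual naming may differ.
    """
    if name_format == "suffix":
        return f"{column}{sep}{aggregation}"
    if name_format == "prefix":
        return f"{aggregation}{sep}{column}"
    raise ValueError(f"unknown name format: {name_format!r}")


print(output_column_name("humidity", "max"))                        # suffix (the default)
print(output_column_name("humidity", "max", name_format="prefix"))  # prefix
```

With suffix, the columns derived from the same input column sort together by name; with prefix, the columns that share an aggregation type sort together.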

Output

Visual Output
  • Output: Displays the specified aggregates for specified columns of the input data set.
  • Parameter Summary Info: Displays a list of the input parameters and their current settings.
  • Column Data Sizes: Displays the size of the entire data set.
Output to successive operators
A model object that can be used with operators.

Example

The following example demonstrates the Batch Aggregation operator.

Batch Aggregation operator workflow
Data
golf: This data set contains the following information:
  • Five columns: outlook, temperature, wind, humidity, and play.
  • 14 rows.
Parameter Setting
The parameter settings for the golf data set are as follows:
  • Group By: outlook

  • Find Maximum: humidity

  • Find Minimum: temperature

  • Calculate Sum: temperature

  • Calculate Mean: temperature, humidity

  • Calculate Variance: humidity

  • Calculate Standard Deviation: humidity

  • Calculate Number of Distinct (slower): temperature, wind, play

  • Column Name Format: suffix

  • Store Results: Yes
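To see how many aggregates these settings request, the following sketch expands them into (column, parameter) pairs. It assumes that the operator produces one output column per pair plus one column per Group By column, which this topic does not state explicitly. The variable names are illustrative.

```python
# The aggregation settings from the example above, keyed by parameter name.
settings = {
    "Find Maximum": ["humidity"],
    "Find Minimum": ["temperature"],
    "Calculate Sum": ["temperature"],
    "Calculate Mean": ["temperature", "humidity"],
    "Calculate Variance": ["humidity"],
    "Calculate Standard Deviation": ["humidity"],
    "Calculate Number of Distinct (slower)": ["temperature", "wind", "play"],
}
group_by = ["outlook"]

# One (column, parameter) pair per requested aggregate.
pairs = [
    (column, parameter)
    for parameter, columns in settings.items()
    for column in columns
]

# Assumption: one output column per pair, plus the Group By columns.
expected_column_count = len(group_by) + len(pairs)

for column, parameter in pairs:
    print(f"{parameter}: {column}")
print("aggregates:", len(pairs), "expected output columns:", expected_column_count)
```

Under that assumption, the example requests 10 aggregates, giving 11 output columns once the outlook column is included.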

Results
These figures display the results for the parameter settings for the golf data set.
Output
Batch Aggregation output tab

In the above output, the groups are not sorted alphabetically. To order the aggregations, connect this operator to a sorting operator.
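The effect of a downstream sort can be sketched in plain Python. The rows below are hypothetical placeholders rather than the actual golf results, and the column names group and value_sum are illustrative.

```python
from operator import itemgetter

# Hypothetical aggregated rows in arbitrary group order; values are placeholders.
aggregated = [
    {"group": "sunny", "value_sum": 3},
    {"group": "overcast", "value_sum": 1},
    {"group": "rain", "value_sum": 2},
]

# Sort the output rows alphabetically by the group key.
sorted_rows = sorted(aggregated, key=itemgetter("group"))

for row in sorted_rows:
    print(row)
```

The aggregate values are unchanged by the sort; only the order of the rows differs.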

Parameter Summary Info
Batch Aggregation - Parameter Summary Info tab
Column Data Sizes
Batch Aggregation - column data sizes tab