Aggregation Methods for Batch Aggregation
Unlike the Aggregation operator, which requires you to configure each aggregation separately, the Batch Aggregation operator lets you select many numeric columns for each aggregation method and compute all of these aggregations at once. The result is a wide data set that contains the grouping column and one column for each aggregation.
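
To illustrate the shape of the output, here is a minimal pure-Python sketch of the idea. It is not the operator's implementation. The function name `batch_aggregate`, the sample data, and the `method_column` naming of output columns are all assumptions made for this example; the operator's real output column names depend on its prefix/suffix setting.

```python
from collections import defaultdict


def batch_aggregate(rows, group_col, methods):
    """Hypothetical helper: apply several aggregation methods to several
    columns in one pass over the groups.

    rows    -- list of dicts, one per input row
    methods -- dict mapping a method name to the list of columns it applies to
    """
    funcs = {
        "min": min,
        "max": max,
        "sum": sum,
        "mean": lambda vals: sum(vals) / len(vals),
    }

    # Collect the rows that belong to each group.
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_col]].append(row)

    # Build one wide output row per group.
    result = []
    for key, members in groups.items():
        out = {group_col: key}
        for method, columns in methods.items():
            for col in columns:
                vals = [r[col] for r in members if r[col] is not None]
                # Output column naming is an assumption for this sketch.
                out[f"{method}_{col}"] = funcs[method](vals) if vals else None
        result.append(out)
    return result


rows = [
    {"region": "east", "sales": 10.0, "units": 1},
    {"region": "east", "sales": 30.0, "units": 3},
    {"region": "west", "sales": 5.0, "units": 2},
]
wide = batch_aggregate(rows, "region", {"sum": ["sales", "units"], "mean": ["sales"]})
```

Each row of `wide` holds the grouping value plus one entry per requested (method, column) pair, which mirrors the wide layout described above.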
| Aggregation Parameter Name | Prefix/Suffix | Formula | Performance Implications |
|---|---|---|---|
| Count | group_size | The number of non-null members of this group. | |
| Minimum | min | The lowest value in each group. | |
| Maximum | max | The highest value in each group. | |
| Sum | sum | The sum of all of the values in each group. | |
| Mean | mean | Sum/Count for each group. | Online calculation performed using Spark SQL. |
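| Variance | var | The population variance: $\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2$, where $N$ is the Count of the group and $\mu$ is its Mean. | Online calculation performed using Spark SQL (v 1.5.1). |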
| Standard Deviation | sd | The square root of the population variance. | Online calculation performed using Spark SQL (v 1.5.1). |
| Distinct | distinct | The number of distinct values in the group. | Calculated using Spark SQL (v 1.5.1). Slow if there are many distinct values within each group, or if no group was selected. |
| Median | median | The middle element of the group. Specifically, the median is calculated as the nth largest element in the group, where n is given by a formula (image not recovered). | Expensive. Unlike the other aggregations, the median cannot be computed by the highly performant Spark SQL optimizer. It requires an additional shuffle step and might reach memory limits if there are many groups. |
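
The performance notes in the table distinguish "online" aggregations from the median. The sketch below shows why, under stated assumptions: it uses Welford's single-pass method as one common way to compute mean and population variance online (whether Spark SQL uses this exact method is not confirmed here), and it computes the median by collecting and sorting all values in the group. The class name `OnlineStats` and the sample values are hypothetical.

```python
import math


class OnlineStats:
    """Single-pass count, mean, and population variance (Welford's method).

    Only three numbers are kept per group, no matter how many values arrive.
    """

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def add(self, x):
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        # Population variance: divide by N, not N - 1.
        return self._m2 / self.count if self.count else float("nan")

    @property
    def sd(self):
        return math.sqrt(self.variance)


values = [3.0, 1.0, 4.0, 2.0, 5.0]

stats = OnlineStats()
for v in values:
    stats.add(v)

# The median cannot be updated from a fixed-size summary: every value in the
# group must be held and ordered. With an odd group size the middle element
# is unambiguous; the operator's rule for even-sized groups is not shown here.
ordered = sorted(values)
group_median = ordered[len(ordered) // 2]
```

Because the online statistics need only constant memory per group, they scale to many groups, whereas the median must keep every value of a group available for ordering, which is consistent with the extra shuffle step and memory pressure noted in the table.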
