Aggregation Methods for Batch Aggregation

Unlike the Aggregation operator, which requires you to configure each aggregation separately, the Batch Aggregation operator lets you select many numeric columns for each aggregation method and compute all of the aggregations at once. The result is a wide data set that contains the grouping column and one column for each aggregation.
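The shape of that wide result can be illustrated with a minimal pure-Python sketch. It is not the operator's implementation: the function `batch_aggregate`, the sample rows, and the output naming scheme (method name used as a prefix, as in `min_sales`) are all assumptions made for illustration.

```python
from collections import defaultdict


def batch_aggregate(rows, group_col, methods):
    """Apply several aggregation methods to several columns in one call.

    rows:    list of dicts, one per input row
    methods: {method_name: (function, [column names])}
    Returns one wide dict per group.
    """
    # Collect the rows that belong to each group.
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_col]].append(row)

    result = []
    for key, members in groups.items():
        wide_row = {group_col: key}
        for name, (func, columns) in methods.items():
            for col in columns:
                # Skip nulls so they do not affect the aggregate.
                values = [m[col] for m in members if m[col] is not None]
                # Hypothetical naming: method name as a prefix.
                wide_row[f"{name}_{col}"] = func(values)
        result.append(wide_row)
    return result


# Hypothetical sample data.
rows = [
    {"region": "east", "sales": 10.0, "units": 1},
    {"region": "east", "sales": 30.0, "units": 3},
    {"region": "west", "sales": 5.0, "units": 2},
]
methods = {
    "min": (min, ["sales", "units"]),
    "max": (max, ["sales"]),
    "sum": (sum, ["sales", "units"]),
}
wide = batch_aggregate(rows, "region", methods)
```

Each entry in `wide` holds the grouping column plus one column per selected method and column pair, so three methods over five column selections yield five aggregate columns per group.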

Available Aggregation Methods
| Aggregation | Parameter Name Prefix/Suffix | Formula | Performance Implications |
|---|---|---|---|
| Count | group_size | The number of non-null members of this group. | |
| Minimum | min | The lowest value in each group. | |
| Maximum | max | The highest value in each group. | |
| Sum | sum | The sum of all of the values in each group. | |
| Mean | mean | Sum/Count for each group. | Online calculation performed using Spark SQL |
| Variance | var | The population variance: avg(col*col)-avg(col)*avg(col) | Online calculation performed using Spark SQL (v 1.5.1) |
| Standard Deviation | sd | The square root of the above. | Online calculation performed using Spark SQL (v 1.5.1) |
| Distinct | distinct | The number of distinct values in the group. | Calculated using Spark SQL (v 1.5.1). Slow if there are many distinct values within each group, or if no group was selected. |
| Median¹ | median | The middle element of the group. Specifically, we calculate median as the nth largest element in the group where [formula image: median for batch aggregation] | Expensive. Unlike the other values, cannot be calculated using highly performant Spark SQL optimizer. Requires an additional shuffle step. Might reach memory limitations if there are many groups. |
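
The variance formula listed above, avg(col*col)-avg(col)*avg(col), can be checked with a short Python sketch. The function names and sample values here are hypothetical, and the code only illustrates the arithmetic; it makes no claim about how Spark SQL evaluates the expression internally.

```python
import math
import statistics


def population_variance(values):
    """Population variance as avg(col*col) - avg(col)*avg(col)."""
    n = len(values)
    mean = sum(values) / n
    mean_of_squares = sum(v * v for v in values) / n
    return mean_of_squares - mean * mean


def population_sd(values):
    """Standard deviation as the square root of the population variance."""
    # Clamp at zero: floating-point rounding can produce a tiny negative value.
    return math.sqrt(max(population_variance(values), 0.0))


# Hypothetical values for a single group.
values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

variance = population_variance(values)
sd = population_sd(values)

# Cross-check against the standard library's population variance.
reference = statistics.pvariance(values)
```

This form of the formula needs only a count, a running sum, and a running sum of squares, so it can be computed in a single pass over each group. The trade-off is that it can lose precision when the mean is large relative to the spread of the values.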