Pivot (HD)

Lets you transform the categorical data contained in a column of a table into columns of a new table, by means of subtotals (or other calculations) that might be defined by another column in the same list. The other calculations might be averages and counts.

Information at a Glance

Category Transform
Data source type HD
Sends output to other operators Yes
Data processing tool MapReduce

The Pivot (HD) operator is for Hadoop data only. For database data, use the Pivot (DB) operator.

Algorithm

For typical data entry and storage, data usually appear in flat tables, meaning that it consists of only columns and rows. While such data can contain a lot of information, it can be difficult to get summarized information. A pivot table can help quickly summarize the flat data, giving it depth, and highlighting the desired information.

The usage of a pivot table is extremely broad and depends on the situation. The first question to ask is "What am I looking for?" A pivot table usually consists of row, column, and data (or fact) fields. These fields allow several kinds of aggregations including: sum, average, count, max, min, etc.

The Pivot Column is usually a categorical column, and in the output table it is transformed into multiple columns, one for each category.

  • The results are also grouped by a selected column.
  • The values in the new columns are aggregates of a third column (or of the presence of the category if no aggregate column is chosen).
  • The columns are listed (in the table, file, or array) in descending order of the value of the category they represent.
Note: This operator can only connect to subsequent operators in certain situations - see the Output section below for details.

Input

A data set from the preceding operator.

Configuration

Parameter Description
Notes Any notes or helpful information about this operator's parameter settings. When you enter content in the Notes field, a yellow asterisk is displayed on the operator.
Pivot Column The column for pivot transformation. You can choose any data type column. Limit: 1,500 distinct values.
Group By The column to group by.
Aggregate Column The pivot column's value column.
Aggregation The aggregate function. Options:
  • sum
  • average
  • count
  • max
  • min
Store Results? Specifies whether to store the results.
  • true - results are stored.
  • false - the data set is passed to the next operator without storing.
Results Location The HDFS directory where the results of the operator are stored. This is the main directory, the subdirectory of which is specified in Results Name. Click Choose File to open the Hadoop File Explorer Dialog Box and browse to the storage location. Do not edit the text directly.
Results Name The name of the file in which to store the results.
Overwrite Specifies whether to delete existing data at that path and file name.
  • Yes - if the path exists, delete that file and save the results.
  • No - Fail if the path already exists.
Storage Format Select the format in which to store the results. The storage format is determined by your type of operator.

Typical formats are Avro, CSV, TSV, or Parquet.

Compression Select the type of compression for the output.
Available Parquet compression options.
  • GZIP
  • Deflate
  • Snappy
  • no compression

Available Avro compression options.

  • Deflate
  • Snappy
  • no compression
Use Spark If Yes (the default), uses Spark to optimize calculation time.
Advanced Spark Settings Automatic Optimization
  • Yes specifies using the default Spark optimization settings.
  • No enables providing customized Spark optimization. Click Edit Settings to customize Spark optimization. See Advanced Settings Dialog Box for more information.

Output

Visual Output
If Use Array is true, pivoted values are stored in an array in one column.



The values in the array are arranged in descending order of the value of the category they represent.

If Use Array is false or if the operator is operating on a Hadoop dataset, pivoted values are each placed in their own column.



Data Output
If Use Array is true, no output is sent to the succeeding operator.