Unstack

Takes an HDFS data set in stacked format and produces an unstacked (wide) HDFS data set using user-specified grouping and pivot columns.

Information at a Glance

Category Transform
Data source type HD
Sends output to other operators Yes
Data processing tool Spark SQL

The operation of Unstack is similar to a pivot operation, except that it is capable of operating on text data as well as numeric data, whereas a pivot operation works on numeric data only. Unstack takes multiple Columns to Keep (group by), a single column that contains the column names (pivot column), and a single column that contains the values of the new columns in the output data set (aggregate column) as inputs.

The maximum number of new columns that can be produced is 10,000. If the pivot column contains more than 10,000 distinct values, an error results.

Missing data appears as null values in the output data set. Unstack handles all column types as inputs except date/time and Boolean.

Input

A single tabular data set.

Configuration

Parameter Description
Notes Any notes or helpful information about this operator's parameter settings. When you enter content in the Notes field, a yellow asterisk is displayed on the operator.
New Column Names A column name that contains the new column header.
New Column Values A column name that contains the values.
Aggregation Method The method that aggregates the new values.
Columns to Keep Any number of non-date/time and Boolean columns.
Storage Format Select the format in which to store the results. The storage format is determined by your type of operator.

Typical formats are Avro, CSV, TSV, or Parquet.

Compression Select the type of compression for the output.
Available Parquet compression options.
  • GZIP
  • Deflate
  • Snappy
  • no compression

Available Avro compression options.

  • Deflate
  • Snappy
  • no compression
Output Directory The location to store the output files.
Output Name The name to contain the results.
Overwrite Output Specifies whether to delete existing data at that path.
  • Yes - if the path exists, delete that file and save the results.
  • No - fail if the path already exists.
Advanced Spark Settings Automatic Optimization
  • Yes specifies using the default Spark optimization settings.
  • No enables providing customized Spark optimization. Click Edit Settings to customize Spark optimization. See Advanced Settings Dialog Box for more information.

Output

Visual Output
A tabular preview of the output data set. Summary output includes parameters selected, input data size, and output location.
Data Output
A single tabular data set.
Note: Unstack is a 'semi-terminal' operator. A partial schema can be transmitted to subsequent operators at design time, but you must run the operator for subsequent operators to see the final output schema.

Example

In the following illustration, name and date are the row identifiers, while header and value are the columns to pivot. To complete the operation, Unstack pivots the text values (text value 1 and text value 2), and then places the values in the proper row. Where name equals cc, there is only one date (date4) and two headers. This results in one row. Where name equals bb, there are three dates, each with one header. This results in three rows. Columns in which the header values do not exist for the dates are left empty.