Unstack
Takes an HDFS data set in stacked format and produces an unstacked (wide) HDFS data set using user-specified grouping and pivot columns.
Information at a Glance
The operation of Unstack is similar to a pivot operation, except that it is capable of operating on text data as well as numeric data, whereas a pivot operation works on numeric data only. Unstack takes multiple Columns to Keep (group by), a single column that contains the column names (pivot column), and a single column that contains the values of the new columns in the output data set (aggregate column) as inputs.
The maximum number of new columns that can be produced is 10,000. If the pivot column contains more than 10,000 distinct values, an error results.
Missing data appears as null values in the output data set. Unstack handles all column types as inputs except date/time and Boolean.
Configuration
Parameter | Description |
---|---|
Notes | Any notes or helpful information about this operator's parameter settings. When you enter content in the Notes field, a yellow asterisk is displayed on the operator. |
New Column Names | A column name that contains the new column header. |
New Column Values | A column name that contains the values. |
Aggregation Method | The method that aggregates the new values. |
Columns to Keep | Any number of non-date/time and Boolean columns. |
Storage Format | Select the format in which to store the results. The storage format is determined by your type of operator.
Typical formats are Avro, CSV, TSV, or Parquet. |
Compression | Select the type of compression for the output.
Available Avro compression options. |
Output Directory | The location to store the output files. |
Output Name | The name to contain the results. |
Overwrite Output | Specifies whether to delete existing data at that path. |
Advanced Spark Settings Automatic Optimization |
|
Example
In the following illustration, name and date are the row identifiers, while header and value are the columns to pivot. To complete the operation, Unstack pivots the text values (text value 1 and text value 2), and then places the values in the proper row. Where name equals cc, there is only one date (date4) and two headers. This results in one row. Where name equals bb, there are three dates, each with one header. This results in three rows. Columns in which the header values do not exist for the dates are left empty.