Transpose

Allows you to rearrange data so that rows and columns are switched.

Information at a Glance

Category Transform
Data source type HD
Sends output to other operators Yes (see footnote 1)
Data processing tool Spark

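You can choose which input column is used to define the new header. If the input has X columns and Y rows, the transposed data has X rows and Y columns, plus a first column that holds the original column names. A column selected as the new header supplies the header values and does not appear as a row in the output.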

In the following example, the Name column is selected to be the output header.

Age Name Grade
12 Jenny A
14 Mary A
13 Emily B

After Transpose is run with Name as the header column, the data set looks like the following example.

Name Jenny Mary Emily
Age 12 14 13
Grade A A B
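
The example above can be reproduced with a minimal Python sketch. This is an illustration only, not the operator's Spark implementation; the function name transpose and the list-based table representation are hypothetical, and the number of default header names (one per output column) is an assumption.

```python
def transpose(rows, columns, header_column=None):
    """Transpose a small table given as a list of rows (lists of values).

    Returns (new_header, new_rows). Illustrative only.
    """
    if header_column is None:
        # Assumed default header: Column1, Column2, ... (one name per output column).
        new_header = [f"Column{i + 1}" for i in range(len(rows) + 1)]
        kept = list(range(len(columns)))
    else:
        h = columns.index(header_column)
        # The header column's name and values become the new header.
        new_header = [header_column] + [str(row[h]) for row in rows]
        # The header column itself does not become an output row.
        kept = [i for i in range(len(columns)) if i != h]
    # Each remaining input column becomes one output row,
    # starting with the original column name.
    new_rows = [[columns[i]] + [row[i] for row in rows] for i in kept]
    return new_header, new_rows


columns = ["Age", "Name", "Grade"]
rows = [[12, "Jenny", "A"], [14, "Mary", "A"], [13, "Emily", "B"]]
header, data = transpose(rows, columns, header_column="Name")
print(header)
for r in data:
    print(r)
```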

Input

A data set from HDFS. To define the new header from the data, the input must contain at least one categorical column.

Bad or Missing Values
Missing values are kept unless they occur in the column selected to define the new header. If that column contains missing values, the job fails at runtime.

Restrictions

This operator cannot transpose an input larger than 5,000 rows. If the input has a single column and you select this column to be the new header, an error occurs while the operator is being configured.
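
The input checks described in this topic (the 5,000-row limit, the single-column restriction, and the rule that the header column must not contain missing or duplicate values) can be sketched as a pre-flight check in Python. The function name, the message texts, and the use of None for a missing value are assumptions made for illustration; the operator's own error messages may differ.

```python
MAX_ROWS = 5000  # row limit stated in the Restrictions section


def check_transpose_input(rows, columns, header_column=None):
    """Return a list of problems; an empty list means all checks passed."""
    problems = []
    if len(rows) > MAX_ROWS:
        problems.append("input has more than 5000 rows")
    if header_column is not None:
        if len(columns) == 1:
            problems.append("a single input column cannot be used as the header")
        h = columns.index(header_column)
        values = [row[h] for row in rows]
        # None stands in for a missing value in this sketch.
        if any(v is None for v in values):
            problems.append("header column contains missing values")
        if len(set(values)) != len(values):
            problems.append("header column contains duplicate values")
    return problems


print(check_transpose_input([[12, "Jenny"], [14, "Mary"]], ["Age", "Name"], "Name"))
```

Running a check like this on a sample of the data before the operator runs can surface problems that would otherwise appear only at runtime.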

Configuration

Parameter Description
Notes Any notes or helpful information about this operator's parameter settings. When you enter content in the Notes field, a yellow asterisk is displayed on the operator.
Column for New Header Optionally, select a categorical (chararray) column whose name and values define the new header. If no column is selected, the output uses the default header (Column1, Column2, ..., ColumnX).
Note: If the selected column contains null or duplicate values, the job fails at runtime with a meaningful error message.

If some values contain non-alphanumeric characters, they are replaced by an underscore in the new header.

If some values start with a non-letter character, the letter "a" is prepended to match the column name regex "^[A-Za-z]+\w*$".

New Name for First Column Optional new name for the first column in the output, matching the regular expression "^[A-Za-z]+\w*$".

If you do not want to specify a name, keep the default empty box.

Note: If you specify a value here, it overrides the name of the first output column. That name is otherwise either the name of the column selected for the new header (previous parameter) or, if no input column was selected, the default value Column1.
Storage Format Select the format in which to store the results. The storage format is determined by your type of operator.

Typical formats are Avro, CSV, TSV, or Parquet.

Compression Select the type of compression for the output.
Available Parquet compression options:
  • GZIP
  • Deflate
  • Snappy
  • no compression

Available Avro compression options:

  • Deflate
  • Snappy
  • no compression
Output Directory The location to store the output files.
Output Name The name of the output that contains the results.
Overwrite Output Specifies whether to delete existing data at that path.
  • Yes - if the path exists, delete that file and save the results.
  • No - fail if the path already exists.
Advanced Spark Settings Automatic Optimization
  • Yes specifies using the default Spark optimization settings.
  • No enables providing customized Spark optimization. Click Edit Settings to customize Spark optimization. See Advanced Settings Dialog Box for more information.
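
The header naming rules described for Column for New Header can be illustrated with a short Python sketch. It assumes the column name pattern is ^[A-Za-z]+\w*$ and that "alphanumeric" means ASCII letters and digits; the function name sanitize_header_value is hypothetical, and the operator's actual implementation may handle edge cases differently.

```python
import re

# Column name pattern quoted in the parameter description (assumed form).
COLUMN_NAME_PATTERN = re.compile(r"^[A-Za-z]+\w*$")


def sanitize_header_value(value):
    """Turn a header-column value into a valid output column name."""
    # Replace every character that is not an ASCII letter or digit with "_".
    name = re.sub(r"[^A-Za-z0-9]", "_", str(value))
    # Prepend "a" when the name is empty or does not start with a letter.
    if not name or not name[0].isalpha():
        name = "a" + name
    return name


for raw in ["Jenny", "Mary-Ann", "2nd place"]:
    print(raw, "->", sanitize_header_value(raw))
```

Under these assumptions, "Mary-Ann" becomes "Mary_Ann" and "2nd place" becomes "a2nd_place". Two distinct input values can map to the same sanitized name; how the operator treats that case is not stated here.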

Output

Visual Output


Data Output
This is a semi-terminal operator: it can be connected to any subsequent operator at design time, but it does not transmit the full output schema until the user runs it. At design time, the partial output schema contains only the first column of the output. After the operator runs, the output schema is updated automatically, and subsequent operators turn red if their parameter selections are no longer valid.
Note: The final output schema of the Transpose operator is cleared if one of the following events occurs.
  • The user changes the configuration properties of the Transpose operator.
  • The user changes the input connected to the Transpose operator.
  • The user clears the step run results of the Transpose operator.
In this case, the output schema transmitted to subsequent operators again becomes the partial schema defined at design time (hence, subsequent operators can turn invalid), and the user must run the Transpose operator again to transmit the new output schema.
Note: The first column of the output is always chararray, because it is created from the input header. All other output columns are double if every input column (except the column chosen to define the new header, if one was specified) is numeric; otherwise they are chararray.
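
The type rule in the preceding note can be expressed as a small Python sketch. The type names chararray and double come from this topic; the set of names treated as numeric (int, long, float, double) and the function name are assumptions for illustration.

```python
# Type names assumed to count as numeric in this sketch.
NUMERIC_TYPES = {"int", "long", "float", "double"}


def output_column_types(input_schema, header_column=None):
    """Return (first_column_type, other_columns_type) for the transposed output.

    input_schema maps each input column name to its type name.
    """
    # The header column, if any, is excluded from the numeric check.
    considered = [t for name, t in input_schema.items() if name != header_column]
    all_numeric = all(t in NUMERIC_TYPES for t in considered)
    return "chararray", ("double" if all_numeric else "chararray")


mixed = {"Age": "int", "Name": "chararray", "Grade": "chararray"}
numeric = {"Age": "int", "Name": "chararray", "Score": "double"}
print(output_column_types(mixed, header_column="Name"))
print(output_column_types(numeric, header_column="Name"))
```

With the first schema, the Grade column forces every non-first output column to chararray; with the second, all remaining columns are numeric, so the output columns are double.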
1 The full output schema is not available until you step run the operator. After you run this operator, the output schema automatically updates, and subsequent operators either validate or turn red, depending on the structure of the output data.