Sort By Multiple Columns

Sorts a data set by up to three columns that you choose and returns the sorted data set. Optionally adds a column called row_index, which enables you to filter the output based on the sorting results.

Information at a Glance

Category Transform
Data source type HD
Sends output to other operators Yes
Data processing tool Spark

Input

A tabular data set from HDFS.

Configuration

Parameter Description
Notes Any notes or helpful information about this operator's parameter settings. When you enter content in the Notes field, a yellow asterisk is displayed on the operator.
Primary Sort Column First column to sort by. While Secondary Sort Column and Tertiary Sort Column can be left blank, this column is required.
Primary Column Sort Order Order by which to sort the first column: Ascending (the default) or Descending.
Secondary Sort Column Second column to sort by. To sort by one column only, leave this column and the Tertiary Sort Column blank.
Secondary Column Sort Order Order by which to sort the second column: Ascending (the default) or Descending.
Tertiary Sort Column Third column to sort by. To sort by two columns only, leave this column blank.
Tertiary Column Sort Order Order by which to sort the third column: Ascending (the default) or Descending.
Create 'row_index' Column Specifies whether to add an extra column named row_index to the data set that shows each row's sort index.

Default value: No.

Write Rows Removed Due to Null Data to File Rows with null values in the columns selected to sort by are removed from the analysis. This parameter specifies whether the removed rows are counted and whether they are written to a file.

The file is written to the same directory as the rest of the output. The filename is suffixed with _baddata.

  • Do Not Write Null Rows to File (the default) - removes rows with null values and displays the count in the result UI, but does not write them to an external file.
  • Do Not Write or Count Null Rows (Fastest) - removes rows with null values, but does not count them or display a count in the result UI.
  • Write All Null Rows to File - removes rows with null values and writes all removed rows to an external file.
Storage Format Select the format in which to store the results. The storage format is determined by your type of operator.

Typical formats are Avro, CSV, TSV, or Parquet.

Compression Select the type of compression for the output.

Available Parquet compression options:

  • GZIP
  • Deflate
  • Snappy
  • no compression

Available Avro compression options:

  • Deflate
  • Snappy
  • no compression
Output Directory The location to store the output files.
Output Name The name of the output that contains the results.
Overwrite Output Specifies whether to delete existing data at the output path.
  • Yes - if the path exists, delete the existing data and save the results.
  • No - fail if the path already exists.
Advanced Spark Settings Automatic Optimization
  • Yes specifies using the default Spark optimization settings.
  • No enables providing customized Spark optimization. Click Edit Settings to customize Spark optimization. See Advanced Settings Dialog Box for more information.
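
The sort and null-handling parameters above can be illustrated with a minimal sketch in plain Python. This is not the operator's Spark implementation. The function name sort_by_columns, the (column, ascending) pair format, and the zero-based row_index are assumptions made for illustration; the documentation does not state the starting value of row_index.

```python
from typing import Any, Dict, List, Sequence, Tuple

Row = Dict[str, Any]


def sort_by_columns(
    rows: Sequence[Row],
    sort_spec: Sequence[Tuple[str, bool]],  # (column name, ascending?)
    add_row_index: bool = False,
) -> Tuple[List[Row], List[Row]]:
    """Return (sorted rows, rows removed because a sort column was null)."""
    if not 1 <= len(sort_spec) <= 3:
        raise ValueError("expected one to three sort columns")

    columns = [column for column, _ in sort_spec]

    # Rows with a null in any selected sort column are set aside.
    kept = [dict(r) for r in rows if all(r.get(c) is not None for c in columns)]
    removed = [dict(r) for r in rows if any(r.get(c) is None for c in columns)]

    # Python's sort is stable, so sorting by the least significant column
    # first and the primary column last gives a multi-column ordering.
    for column, ascending in reversed(sort_spec):
        kept.sort(key=lambda r, c=column: r[c], reverse=not ascending)

    if add_row_index:
        # Assumption: the index starts at 0.
        for position, row in enumerate(kept):
            row["row_index"] = position

    return kept, removed


rows = [
    {"name": "a", "age": 30, "income": 50000},
    {"name": "b", "age": 25, "income": 40000},
    {"name": "c", "age": 30, "income": 70000},
    {"name": "d", "age": None, "income": 10000},
]

# Primary: age ascending. Secondary: income descending.
sorted_rows, null_rows = sort_by_columns(
    rows, [("age", True), ("income", False)], add_row_index=True
)
```

In this sketch, the row with a null age is returned separately, which corresponds loosely to the rows that the operator can write to the _baddata file.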

Output

Visual Output
The following example is sorted by age, then income.



Data Output
A data set that contains the sorted rows and, if selected, the extra row_index column.
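
As one way to use the row_index column downstream, the following minimal sketch keeps only the first n rows of the sorted output. The helper name first_n_sorted_rows and the sample rows are hypothetical, and the sketch assumes a zero-based row_index, which the documentation does not confirm.

```python
from typing import Any, Dict, List, Sequence

Row = Dict[str, Any]


def first_n_sorted_rows(rows: Sequence[Row], n: int) -> List[Row]:
    """Keep rows whose row_index falls in the first n sort positions."""
    # Assumption: row_index starts at 0, so the first n rows have index < n.
    return [row for row in rows if row["row_index"] < n]


# Hypothetical operator output, sorted by age and then income.
sorted_output = [
    {"age": 25, "income": 40000, "row_index": 0},
    {"age": 30, "income": 50000, "row_index": 1},
    {"age": 30, "income": 70000, "row_index": 2},
]

top_two = first_n_sorted_rows(sorted_output, 2)
```

If the index turns out to start at 1, the comparison would need to be row_index <= n instead.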