One-Hot Encoding

Performs one-hot encoding on a set of categorical columns selected: it encodes categorical features using a one-hot scheme (also known as "one-of-K" scheme), and outputs a binary column for each distinct category in the input column.

Information at a Glance

Category Transform
Data source type HD
Sends output to other operators Yes1
Data processing tool Spark

The One-Hot Encoding operator is useful for turning categorical predictors into numeric (binary) predictors for algorithms that do not support categorical variables natively.

Note: Most of the ML algorithms available in Team Studio already include one-hot encoding as a preprocessing step directly.

Input

A single HDFS tabular data set.

Bad or Missing Values
Dirty data: When parsing delimited data, the One-Hot Encoding operator removes dirty data (such as strings in numeric columns, doubles in integer columns, or rows with the incorrect number of values) as it parses. These rows are silently removed because Spark is incapable of handling them.

Null values: Before performing one-hot encoding, the operator filters any rows that contain null values in the specified Columns to Encode. The operator then processes these rows with null values according to the value of the Write Rows Removed Due to Null Data To File parameter. The number of rows removed due to null data is reported in the Summary tab of the visual output.

Restrictions

For the maximum number of categories for each column selected on Columns to Encode, the default values is 30; this value can be modified in the Advanced Spark Settings menu (in the parameter Max Column Distinct Categories).

Configuration

Parameter Description
Notes Any notes or helpful information about this operator's parameter settings. When you enter content in the Notes field, a yellow asterisk is displayed on the operator.
Columns to Encode

*required

Select the categorical column(s) on which to perform one-hot encoding.
Keep Encoded Columns Define whether the input columns to encode should be kept in the output - yes or no (the default).
Drop Last Category Select Yes (the default) to indicate that the last category in the column to encode should be dropped. Otherwise, select No.

For example, if a column to encode "categ" contains three categories ("a", "b", "c") and Yes is selected for this parameter, the output data set contains only two encoded binary columns: "categ_a" and "categ_b" ("categ_c" is dropped).

Output Column Prefix (Optional) Specify a string to prepend to all the output encoded column names. This option can be useful if you want to select all of the encoded columns in a subsequent operator, because they all start with the same prefix, which simplifies filtering on the first letters and then selecting them.
Write Rows Removed Due to Null Data to File Rows with null values (only in the Columns to Encode) are removed from the analysis. Use this parameter to specify that the data with null values is written to a file. The file is written to the same directory as the rest of the output. The filename is appended with the suffix _baddata.
  • Do Not Write Null Rows to File (the default) - remove null value data and display in the result UI, but do not write to an external file.

  • Do Not Write or Count Null Rows (Fastest) - remove null value data, but do not count and display in the result UI.

  • Write All Null Rows to File - remove null value data and write all removed rows to an external file.

Storage Format Select the format in which to store the results. The storage format is determined by your type of operator.

Typical formats are Avro, CSV, TSV, or Parquet.

Compression Select the type of compression for the output.
Available Parquet compression options.
  • GZIP
  • Deflate
  • Snappy
  • no compression

Available Avro compression options.

  • Deflate
  • Snappy
  • no compression
Output Directory The location to store the output files.
Output Name The name to contain the results.
Overwrite Output Specifies whether to delete existing data at that path.
  • Yes - if the path exists, delete that file and save the results.
  • No - fail if the path already exists.
Advanced Spark Settings Automatic Optimization
  • Yes specifies using the default Spark optimization settings.
  • No enables providing customized Spark optimization. Click Edit Settings to customize Spark optimization. See Advanced Settings Dialog Box for more information.

Output

Visual Output
  • Output preview:



  • Summary: parameters selected, null data removed from input and output location:



Data Output
The data set that contains the encoded columns.

Additional Notes

"Semi-terminal" operator
A partial schema can be transmitted to subsequent operators at design time, but you must run the operator for subsequent operators to see the final output schema.
Note: The final output schema of the Correlation Filter operator is cleared if one of the following occurs.
  • The user changes the configuration properties of the One-Hot Encoding operator.
  • The user changes the input connected to the One-Hot Encoding operator.
  • The user clears the step run results of the One-Hot Encoding operator.

In this case, the output schema transmitted to subsequent operators again becomes the partial schema defined at design time (hence, subsequent operators can turn invalid), and you must run the One-Hot Encoding operator again to transmit the new output schema.

1 The full output schema is not available until you step run the operator. After you run this operator, the output schema automatically updates, and subsequent operators either validate or turn red, depending on the structure of the output data.