Resampling

Changes the distribution of values in a single column. You can use this operator to either balance all values in the selected column or change the proportion of only one value. You can use it to up-sample or down-sample.

Information at a Glance

Category Sample
Data source type HD
Sends output to other operators Yes
Data processing tool Spark

Input

A Hadoop file that has at least one categorical column.

Bad or Missing Values
Rows with null values in the selected Column to Resample are removed from the dataset prior to resampling. Null values in other columns do not affect the result.

Restrictions

Input data must have at least one categorical column with less than 100 distinct values.

Configuration

Parameter Description
Notes Any notes or helpful information about this operator's parameter settings. When you enter content in the Notes field, a yellow asterisk is displayed on the operator.
Column to Resample

*required

A categorical column with less than 100 distinct values.
Balance All Values in Selected Column Yes balances all values in the selected column by up-sampling rows to match the number of rows of the most common value.

Sample with Replacement must be Yes, and any text entered into Single Value from Selected Column for Resampling is ignored.

For example, if a dataset has 3 distinct values in the selected column with the following distribution

Value Count
A 100
B 75
C 50

the output has the following distribution.

Value Count
A 100
B 100
C 100
No resamples only one value in the chosen column. The user must enter values for Single Value from Selected Column for Resampling and Multiplier for Up-Sampling or Down-Sampling. Given the same input as above, if the user chooses to resample the value B with a multiplier of 3, output distribution is:
Value Count
A 100
B 225
C 50
Single Value from Selected Column for Resampling Required when Balance All Values in Selected Column is No.

A character string or a numeric value that appears in the column selected in Column to Resample. An error occurs when running the operator if the value does not occur in the column.

Multiplier for Up-Sampling or Down-Sampling Required when Balance All Values in Selected Column is No.

A positive decimal number that is the multiplicative factor by which to resample the selected column and value.

Sample with Replacement
  • For multipliers less than or equal to 1, specify whether samples are with or without replacement.
  • For multipliers greater than 1, click Yes to sample rows with replacement.
Exact (Slower) An exact calculation requires an additional pass through the data and results in slower operator execution. For non-exact resampling, the output distribution of values can vary from the expected distribution.
Use Random Seed
  • Click Yes to use the specified random seed and get repeatable results.
  • Click No to use a system-generated seed value.
Random Seed An integer value that is used as the seed for the pseudo-random row extraction. Only used if Use Random Seed is Yes.
Write Rows Removed Due to Null Data To File Rows with at least one null value in the Column to Resample are removed from the dataset prior to resampling. This parameter allows you to specify whether rows with null values are written to a file.
The file is written to the same directory as the rest of the output. The filename is given a suffix of _baddata.
  • Do Not Write Null Rows to File - remove null value data and display in the result UI, but do not write to an external file.
  • Do Not Write or Count Null Rows (Fastest) - remove null value data but do not count and display in the result UI.
  • Write All Null Rows to File - remove null value data and write all removed rows to an external file.
Storage Format Select the format in which to store the results. The storage format is determined by your type of operator.

Typical formats are Avro, CSV, TSV, or Parquet.

Compression Select the type of compression for the output.
Available Parquet compression options.
  • GZIP
  • Deflate
  • Snappy
  • no compression

Available Avro compression options.

  • Deflate
  • Snappy
  • no compression
Output Directory The location to store the output files.
Output Name The name to contain the results.
Overwrite Output Specifies whether to delete existing data at that path.
  • Yes - if the path exists, delete that file and save the results.
  • No - fail if the path already exists.
Advanced Spark Settings Automatic Optimization
  • Yes specifies using the default Spark optimization settings.
  • No enables providing customized Spark optimization. Click Edit Settings to customize Spark optimization. See Advanced Settings Dialog Box for more information.

Outputs

Visual Output

The Output tab displays a preview of the output dataset.

The Summary tab displays information about the parameters selected, the output value distribution, and the information about rows removed from the data due to null values in the selected column.



Data Output
The resampled data set.

Example