Resampling

This operator changes the distribution of values in a single column. You can use this operator to either balance all values in the selected column or change the proportion of only one value.

Information at a Glance

Note: This operator can only be used with TIBCO® Data Virtualization and Apache Spark 3.2 or later.

Parameter Description
Category Sample
Data source type TIBCO® Data Virtualization
Send output to other operators Yes
Data processing tool TIBCO® DV, Apache Spark 3.2 or later

Input

A single tabular data set that has at least one categorical column.

Bad or Missing Values
Rows with null values in the selected Column to Resample are removed from the data set prior to resampling. Null values in other columns do not affect the result.
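The null-handling rule can be pictured with a minimal Python sketch. It is only an illustration on in-memory rows, not the operator's implementation; the helper name `drop_null_targets` and the sample rows are hypothetical.

```python
# Hypothetical in-memory rows; the operator itself runs on TIBCO DV and Spark.
rows = [
    {"outlook": "sunny", "play": "yes"},
    {"outlook": None, "play": "no"},    # null in another column: row is kept
    {"outlook": "rain", "play": None},  # null in the resampled column: row is dropped
]


def drop_null_targets(rows, column):
    """Keep only the rows whose value in `column` is not null (None)."""
    return [row for row in rows if row[column] is not None]


cleaned = drop_null_targets(rows, "play")
print(len(cleaned))  # 2
```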

Restrictions

Input data must have at least one categorical column with fewer than 100 distinct values.
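As a rough pre-check, the distinct-value limit can be tested before running the operator. The sketch below is an assumption-laden illustration in plain Python: the function name `has_few_enough_distinct_values` is hypothetical, and ignoring nulls when counting distinct values is an assumption of this sketch.

```python
def has_few_enough_distinct_values(values, limit=100):
    """Return True if the column holds fewer than `limit` distinct non-null values."""
    distinct = {v for v in values if v is not None}  # nulls ignored (assumption)
    return len(distinct) < limit


print(has_few_enough_distinct_values(["yes", "no", "yes", None]))  # True
print(has_few_enough_distinct_values(list(range(100))))            # False
```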

Configuration

The following table provides the configuration details for the Resampling operator.

Parameter Description
Notes Notes or helpful information about this operator's parameter settings. When you enter content in the Notes field, a yellow asterisk appears on the operator.
Column to Resample Specify a categorical column with fewer than 100 distinct values. The column must be of the String, Boolean, Integer, or Long data type. If the column has a different data type, you must cast it to one of these types before running this operator.
Balance All Values in Selected Column Specify whether to balance all values in the selected column.

Select Yes to balance all values in the selected column by up-sampling rows to match the number of rows of the most common value. Sample with Replacement must be set to Yes, and any text entered in Single Value from Selected Column for Resampling is ignored.

Select No to resample only one value in the selected column. You must enter values for Single Value from Selected Column for Resampling and Multiplier for Up-Sampling or Down-Sampling.

For example, when the data set contains 3 distinct values in the selected column with the following distribution:

Value Count
A 100
B 75
C 50

When Yes is selected, the output has the following distribution:

Value Count
A 100
B 100
C 100

When No is selected and you resample the value B with a multiplier of 3, the output has the following distribution:

Value Count
A 100
B 225
C 50

Note: The counts in the output are exact or approximate, depending on the setting of the Exact (Slower) parameter.
Single Value from Selected Column for Resampling Specify a character string or a numeric value that appears in the column selected in Column to Resample. If the value does not exist in the selected column, an error appears when running the operator.

This parameter is required when Balance All Values in Selected Column is set to No.

Multiplier for Up-Sampling or Down-Sampling Specify a positive decimal number to use as the multiplicative factor for resampling the selected value in the selected column. A multiplier greater than 1 up-samples the value, and a multiplier less than 1 down-samples it.

This parameter is required when Balance All Values in Selected Column is set to No.

Sample with Replacement Specify whether samples are with or without replacement. The following settings are recommended:
  • For multipliers less than or equal to 1, you can sample either with or without replacement.
  • For multipliers greater than 1, select Yes to sample rows with replacement.
Exact (Slower) Specify whether exact resampling is required.

For small data sets, we strongly recommend setting Exact (Slower) to Yes because, with non-exact resampling, the output distribution of values can vary from the expected distribution.

An exact calculation requires an additional pass through the data and results in slower operator execution.

Use Random Seed Specify whether to use a random seed. Select Yes to use the specified random seed and get repeatable results. Select No to use a system-generated seed value.
Random Seed The seed used for pseudo-random number generation. It must be an integer and is used only if Use Random Seed is set to Yes.
Output Schema Specify the schema for the output table or view.
Output Table Specify the path and name of the table where the output is written. By default, this is a unique table name based on your user ID, workflow ID, and operator.
Store Results When set to Yes, the operator saves the results. If set to No, the operator does not save the results.
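The interaction between the balancing, multiplier, replacement, and seed parameters can be illustrated with a minimal Python sketch that reproduces the A/B/C distributions described under Balance All Values in Selected Column. This is not the operator's implementation. All function names (`target_counts`, `draw_exact`, `draw_approximate`, `resample`) are hypothetical, rounding the multiplied count to the nearest integer is an assumption, and redrawing a whole group rather than appending extra rows is also an assumption. The `draw_approximate` helper only suggests why non-exact resampling can deviate from the expected counts.

```python
import random
from collections import Counter


def target_counts(counts, balance_all, value=None, multiplier=None):
    """Return the expected output row count for each distinct value."""
    if balance_all:
        # Up-sample every value to the size of the most common one.
        largest = max(counts.values())
        return {v: largest for v in counts}
    if value not in counts:
        raise ValueError(f"{value!r} does not appear in the selected column")
    targets = dict(counts)
    # Rounding to the nearest integer is an assumption of this sketch.
    targets[value] = round(counts[value] * multiplier)
    return targets


def draw_exact(group, target, with_replacement, rng):
    """Draw exactly `target` rows from one group."""
    if with_replacement:
        return [rng.choice(group) for _ in range(target)]
    if target > len(group):
        raise ValueError("up-sampling requires sampling with replacement")
    return rng.sample(group, target)


def draw_approximate(group, fraction, rng):
    """Keep each row with probability `fraction` (for fractions <= 1 only).

    The number of rows returned varies around fraction * len(group).
    """
    return [row for row in group if rng.random() < fraction]


def resample(rows, column, balance_all, value=None, multiplier=None,
             with_replacement=True, seed=None):
    """Exact resampling of `rows` (a list of dicts) on `column`."""
    rng = random.Random(seed)  # seed=None falls back to a system-generated seed
    groups = {}
    for row in rows:
        if row[column] is not None:  # rows with a null target are dropped
            groups.setdefault(row[column], []).append(row)
    counts = {v: len(g) for v, g in groups.items()}
    targets = target_counts(counts, balance_all, value, multiplier)
    output = []
    for v, group in groups.items():
        if targets[v] == len(group):
            output.extend(group)  # this value is left unchanged
        else:
            output.extend(draw_exact(group, targets[v], with_replacement, rng))
    return output


rows = [{"id": i, "label": v}
        for i, v in enumerate(["A"] * 100 + ["B"] * 75 + ["C"] * 50)]

balanced = resample(rows, "label", balance_all=True, seed=7)
tripled = resample(rows, "label", balance_all=False, value="B",
                   multiplier=3, seed=7)

print(Counter(r["label"] for r in balanced))  # A, B, and C each appear 100 times
print(Counter(r["label"] for r in tripled))   # A: 100, B: 225, C: 50
```

Passing the same `seed` twice returns the same rows, which mirrors the repeatability that Use Random Seed provides. Calling `draw_exact` with `with_replacement=False` and a target larger than the group raises an error, which is why a multiplier greater than 1 calls for sampling with replacement.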

Outputs

Visual Output
  • Parameters Summary Info: Displays information about the input parameters and their current settings.
  • Output: A table that displays the resampled data.

  • Distribution Summary: Displays information about the output value distribution such as class, value, and the distribution type.

Output to successive operators
A tabular data set that contains the resampled data.

Example

The following example demonstrates the Resampling operator.

Resampling operator workflow

Data
golf: This data set contains the following:
  • Five columns: outlook, temperature, wind, humidity, and play.
  • 14 rows.
Parameter Setting
The parameter settings for the golf data set are as follows:
  • Column to Resample: play

  • Balance All Values in Selected Column: No

  • Single Value from Selected Column for Resampling: yes

  • Multiplier for Up-Sampling or Down-Sampling: 2.2

  • Sample with Replacement: Yes

  • Exact (Slower): Yes

  • Use Random Seed: No

  • Store Results: Yes

Results
The following figures display the results for the parameter settings for the golf data set.
Parameters Summary Info
Resampling operator - Parameter Summary Info tab
Output
Resampling operator - Output tab
Distribution Summary
Resampling operator - Distribution Summary tab