Resampling
This operator changes the distribution of values in a single column. You can use this operator to either balance all values in the selected column or change the proportion of only one value.
Information at a Glance
| Parameter | Description |
|---|---|
| Category | Sample |
| Data source type | TIBCO® Data Virtualization |
| Send output to other operators | Yes |
| Data processing tool | TIBCO® DV, Apache Spark 3.2 or later |
Input
A single tabular data set that has at least one categorical column.
Restrictions
The input data must have at least one categorical column with fewer than 100 distinct values.
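As a rough pre-check for this restriction, the following sketch lists the columns of an in-memory table whose distinct-value count is below the limit. The function name `candidate_columns` and the list-of-dictionaries row format are illustrative assumptions and are not part of the operator.

```python
def candidate_columns(rows, limit=100):
    """Return the names of columns with fewer than `limit` distinct values.

    `rows` is a list of dictionaries that all share the same keys
    (a hypothetical stand-in for the tabular input data set).
    """
    if not rows:
        return []
    return [
        name
        for name in rows[0]
        if len({row[name] for row in rows}) < limit
    ]


# Hypothetical data: "id" has 150 distinct values, "label" has 2.
sample_rows = [{"id": i, "label": "yes" if i % 2 else "no"} for i in range(150)]
print(candidate_columns(sample_rows))
```

Only `label` passes the check here, because `id` has 150 distinct values.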
Configuration
The following table provides the configuration details for the Resampling operator.
| Parameter | Description |
|---|---|
| Notes | Notes or helpful information about this operator's parameter settings. When you enter content in the Notes field, a yellow asterisk appears on the operator. |
| Column to Resample | Specify a categorical column with fewer than 100 distinct values. The column must be of the String, Boolean, Integer, or Long data type. If the column has any other data type, you must cast it before running this operator. |
| Balance All Values in Selected Column | Specify whether to balance all values in the selected column. Select Yes to balance all values in the selected column by up-sampling rows to match the number of rows of the most common value. In this case, Sample with Replacement must be set to Yes, and any text entered in Single Value from Selected Column for Resampling is ignored. Select No to resample only one value in the selected column. In this case, you must enter values for Single Value from Selected Column for Resampling and Multiplier for Up-Sampling or Down-Sampling. For example, when the data set contains 3 distinct values in the selected column with the following distribution:
When Yes is selected, the output has the following distribution:
Note: Whether the output counts are exact or approximate depends on the setting of the Exact (Slower) parameter. |
| Single Value from Selected Column for Resampling | Specify a character string or a numeric value that appears in the column selected in Column to Resample. If the value does not exist in the selected column, an error appears when running the operator. This parameter is required when Balance All Values in Selected Column is set to No. |
| Multiplier for Up-Sampling or Down-Sampling | Specify a positive decimal number to use as the multiplicative factor for resampling the selected value in the selected column. This parameter is required when Balance All Values in Selected Column is set to No. |
| Sample with Replacement | Specify whether to sample with or without replacement. The following settings are recommended: |
| Exact (Slower) | Specify whether exact resampling is required. For small data sets, we strongly recommend setting Exact (Slower) to Yes because, with non-exact resampling, the output distribution of values can vary from the expected distribution. An exact calculation requires an additional pass through the data and results in slower operator execution. |
| Use Random Seed | Specify whether to use a random seed. Select Yes to use the specified random seed and get repeatable results. Select No to use a system-generated seed value. |
| Random Seed | The seed for pseudo-random number generation. It is an integer value that is used only if Use Random Seed is set to Yes. |
| Output Schema | Specify the schema for the output table or view. |
| Output Table | Specify the table path and name where the results are written. By default, this is a unique table name based on your user ID, workflow ID, and operator. |
| Store Results | When set to Yes, the operator saves the results. If set to No, the operator does not save the results. |
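The two resampling modes described in the table can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the operator's implementation: the function names, the list-of-dictionaries row format, the choice to keep the original rows and draw only the missing ones when balancing, and the use of `round` to turn the multiplied count into a whole number are all hypothetical. The sketch models only exact counts and does not cover the approximate behavior of Exact (Slower) set to No.

```python
import random
from collections import Counter


def balance_all(rows, column, seed=None):
    """Up-sample every value of `column` to the count of the most common value."""
    if not rows:
        return []
    rng = random.Random(seed)  # seed=None falls back to a system-generated seed
    groups = {}
    for row in rows:
        groups.setdefault(row[column], []).append(row)
    target = max(len(group) for group in groups.values())
    balanced = []
    for group in groups.values():
        balanced.extend(group)
        # Assumption: draw only the missing rows, with replacement.
        balanced.extend(rng.choices(group, k=target - len(group)))
    return balanced


def resample_value(rows, column, value, multiplier, replacement=True, seed=None):
    """Scale the number of rows where `column == value` by `multiplier`."""
    if multiplier <= 0:
        raise ValueError("multiplier must be a positive number")
    matching = [row for row in rows if row[column] == value]
    if not matching:
        raise ValueError(f"value {value!r} not found in column {column!r}")
    others = [row for row in rows if row[column] != value]
    rng = random.Random(seed)
    k = round(len(matching) * multiplier)  # assumption: round to nearest integer
    if replacement:
        sampled = rng.choices(matching, k=k)
    elif k <= len(matching):
        sampled = rng.sample(matching, k)
    else:
        raise ValueError("up-sampling requires sampling with replacement")
    return others + sampled


# Hypothetical data: 6 rows of "a", 3 of "b", 1 of "c".
data = [{"id": i, "label": v} for i, v in enumerate("aaaaaabbbc")]
print(Counter(row["label"] for row in balance_all(data, "label", seed=42)))
print(Counter(row["label"] for row in resample_value(data, "label", "b", 2.0, seed=42)))
```

Passing the same `seed` twice yields identical output, which mirrors the purpose of Use Random Seed. Down-sampling (a multiplier below 1) works with or without replacement in this sketch, while up-sampling without replacement is rejected because there are not enough distinct rows to draw from.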
Outputs
- Parameters Summary Info: Displays information about the input parameters and their current settings.
- Output: A table that contains the resampled data set.
- Distribution Summary: Displays information about the output value distribution such as class, value, and the distribution type.
Example
The following example demonstrates the Resampling operator.
The input data set has the following properties:
- Multiple columns, namely outlook, temperature, wind, humidity, and play.
- Multiple rows (14 rows).
The operator uses the following parameter settings:
- Column to Resample: play
- Balance All Values in Selected Column: No
- Single Value from Selected Column for Resampling: yes
- Multiplier for Up-Sampling or Down-Sampling: 2.2
- Sample with Replacement: Yes
- Exact (Slower): Yes
- Use Random Seed: No
- Store Results: Yes
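To see what a multiplier of 2.2 means for row counts, the arithmetic can be sketched as below. The example does not state how the 14 rows are split between the values of play, so the split of 9 yes and 5 no used here is a hypothetical assumption, as are the function name `expected_distribution` and the rounding to the nearest whole row. The counts that the operator actually produces may differ.

```python
def expected_distribution(counts, value, multiplier):
    """Scale the count of one value by `multiplier`; other values stay unchanged."""
    if value not in counts:
        raise KeyError(f"value {value!r} does not appear in the column")
    result = dict(counts)
    # Assumption: the multiplied count is rounded to the nearest whole row.
    result[value] = round(counts[value] * multiplier)
    return result


# Hypothetical split of the 14 input rows; the example does not state it.
play_counts = {"yes": 9, "no": 5}
resampled_counts = expected_distribution(play_counts, "yes", 2.2)
print(resampled_counts)
print(sum(resampled_counts.values()), "rows in total")
```

Under these assumptions, only the yes rows change in number, and the no rows pass through unchanged.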