Random Sampling

This operator extracts data rows from the input data set and generates sample tables or views according to the sample properties (percentage or row count) that the user specifies.

Information at a Glance

Note: This operator can only be used with TIBCO® Data Virtualization and Apache Spark 3.2 or later.

Parameter Description
Category Sample
Data source type TIBCO® Data Virtualization
Send output to other operators Yes
Data processing tool TIBCO® DV, Apache Spark 3.2 or later

Input

The input is a single tabular data set.

Bad or Missing Values

Null values are not allowed and result in an error.

Configuration

Parameter Description
Notes Notes or helpful information about this operator's parameter settings. When you enter content in the Notes field, a yellow asterisk appears on the operator.
Number of Samples Specify the number of samples to generate. The samples are in the form of either database tables or views. For example, if the user inputs 3 in this field, 3 sample tables or views are generated.
Sample By

Specify how the sample size is expressed. The following values are available:

  • Percentage

  • Number of Rows

Sample Size Specify the size of each sample data set. This property is interpreted in conjunction with the Sample By property.

  • Percentage - Specify the size of each sample as a percentage of the number of rows in the input data set. For example, if the user inputs 20%, 30%, and 40% for three samples and the input data set contains 10,000 rows, the sample data sets contain 2,000, 3,000, and 4,000 rows respectively, and 9,000 rows are selected in total.

    If the Disjoint property is true, the sum of all the sample percentages should not be greater than 100%.

  • Number of Rows - Specify the exact number of rows to include in each sample data set.

See the Define Sample Size dialog help for more information.
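The percentage arithmetic described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the operator's implementation: the function name rows_per_sample is hypothetical, and rounding fractional row counts down is an assumption, because the rounding rule is not stated here.

```python
def rows_per_sample(total_rows, percentages):
    """Return the row count of each sample for the given whole-number percentages."""
    # Integer arithmetic avoids floating-point error.
    # Rounding down (//) is an assumption, not documented operator behavior.
    return [total_rows * pct // 100 for pct in percentages]


counts = rows_per_sample(10_000, [20, 30, 40])
print(counts, sum(counts))  # prints: [2000, 3000, 4000] 9000
```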
Consistent Determines whether the operator always creates the same set of random rows for each sample data generation.
  • true - sample data generation is consistent, provided that the number of samples, sample size, and the value of the random seed remain unchanged. If Consistent is set to true, then Replacement must be false. Consistent must be true before Key Columns can be set.
  • false - a different random sample is created each time the operator is run. If set to false, then Random Seed is disabled.

Default: false

Random Seed The seed used for the pseudo-random generation.

  • The seed is the number with which the random sampling algorithm starts to generate the pseudo-random numbers.
  • The range of this value is from 0 to 1.
  • A different system-generated seed value is used if the Random Seed value is not specified.

Replacement Specify whether the sampling is with or without replacement.

  • true - sampling with replacement.
  • false - sampling without replacement. It is the default selection.

If Replacement is selected, both the Consistent and Disjoint properties are set to false and disabled.
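As a rough illustration of the difference between the two Replacement settings, the following Python sketch uses the standard random module. It is not how the operator samples internally. The population of ten row IDs and the seed value are made up for the example.

```python
import random

rng = random.Random(1)        # fixed seed so the sketch is repeatable
population = list(range(10))  # hypothetical row IDs 0..9

# With replacement: the same row can be drawn more than once,
# so the sample may even be larger than the population.
with_replacement = rng.choices(population, k=15)

# Without replacement: every row appears at most once in the sample.
without_replacement = rng.sample(population, k=8)

print(with_replacement)
print(without_replacement)
```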

Disjoint Specify whether each sample should be drawn from the entire data set, or from the remaining rows after previous samples are excluded.

  • If you select Disjoint, the same data does not appear in different samples.
  • If you specify the Percentage type for the Sample By property, the sum of all the sample percentages should not be greater than 100%.

If Disjoint is set to true, then the Replacement parameter must be false.
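The effect of Disjoint can be sketched as follows. This is a minimal illustration under stated assumptions, not the operator's algorithm: draw_samples is a hypothetical helper, rows are represented by unique row IDs, and sampling is done without replacement using Python's random module.

```python
import random


def draw_samples(rows, sizes, disjoint, seed=None):
    """Draw one sample per entry in sizes, without replacement within a sample."""
    rng = random.Random(seed)
    pool = list(rows)
    samples = []
    for size in sizes:
        chosen = rng.sample(pool, size)
        samples.append(chosen)
        if disjoint:
            # Exclude rows already used so later samples cannot repeat them.
            used = set(chosen)
            pool = [row for row in pool if row not in used]
    return samples


first, second = draw_samples(range(100), [20, 30], disjoint=True, seed=1)
print(set(first).isdisjoint(second))  # prints: True
```

In this sketch, requesting more rows in total than the input contains raises a ValueError when disjoint is true, which mirrors the rule that the sample percentages should not add up to more than 100%.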

Key Columns Used in conjunction with the Consistent property.

  • Click Select Columns to display the Select Columns dialog, which is used for selecting columns to ensure the ordering of the data before generating the pseudo-random sample data set.
  • The Random Sampling operator uses these key columns to guarantee the order of the rows from the input data set, so that the generation of pseudo-random sample data sets is consistent every time.
  • If no key columns are specified, the Random Sampling operator assumes that the row ordering of the input data set is consistent.

See Key Columns dialog for more information.
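The role of key columns in consistent sampling can be illustrated with a short Python sketch. It assumes, hypothetically, that rows are dictionaries and that the key columns identify each row uniquely. The helper name consistent_sample is not part of the product. Sorting by the key columns before drawing with a fixed seed yields the same sample regardless of the order in which the rows arrive.

```python
import random


def consistent_sample(rows, key_columns, size, seed):
    """Sort rows by the key columns, then draw a seeded sample without replacement."""
    # Assumes the key columns are unique per row; ties would keep input order.
    ordered = sorted(rows, key=lambda row: tuple(row[c] for c in key_columns))
    return random.Random(seed).sample(ordered, size)


rows = [{"id": i, "value": i * i} for i in range(20)]
reversed_rows = rows[::-1]  # same data, different arrival order

a = consistent_sample(rows, ["id"], size=5, seed=1)
b = consistent_sample(reversed_rows, ["id"], size=5, seed=1)
print(a == b)  # prints: True
```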
Output Schema Specify the schema for the output table or view.
Output Table Specify the table path and name where the output of the results is generated. By default, this is a unique table name based on your user ID, workflow ID, and operator.
Store Results When set to Yes, the operator saves the results. If set to No, the operator does not save the results.

Output

Visual Output
  • Output: The data rows of the output table or view of each generated sample are displayed.
Output to successive operators

A tabular data set of the sample data tables is produced. Typically, the data set is passed on to a Sample Selector operator, such as Train and Test, to select a sample to use with subsequent operators. An additional column is produced in the output as a result of the operator execution.

  • tds_sample_column: This column indicates the sample to which the row has been assigned and is used by the Sample Selector operator.
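To make the purpose of this column concrete, the following Python sketch shows how a downstream step might pick one sample by filtering on tds_sample_column. The helper select_sample, the row contents, and the integer labels 1 and 2 are hypothetical; the actual label values written by the operator are not specified here.

```python
def select_sample(rows, sample_label, column="tds_sample_column"):
    """Return only the rows assigned to the given sample."""
    return [row for row in rows if row[column] == sample_label]


# Hypothetical output rows; the labels 1 and 2 are assumptions for illustration.
output_rows = [
    {"outlook": "sunny", "play": "no", "tds_sample_column": 1},
    {"outlook": "rain", "play": "yes", "tds_sample_column": 2},
    {"outlook": "overcast", "play": "yes", "tds_sample_column": 2},
]

print(len(select_sample(output_rows, 2)))  # prints: 2
```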

Example

The following example extracts the data rows from the input data set and displays them according to the sample properties defined by the user.

Random Sampling operator workflow

Data

golf: This data set contains the following information:

  • Multiple columns, namely outlook, temperature, wind, humidity, and play.
  • 14 rows.

Parameter Setting

The parameter settings for the golf data set are as follows:

  • Number of Samples: 2

  • Sample By: Percentage

  • Sample Size: 10%, 20%

  • Consistent: true

  • Random Seed: 1

  • Replacement: false

  • Disjoint: true

  • Key Columns: outlook, temperature, humidity, wind

  • Store Results: Yes

Output

The following figure displays the output for the parameter settings for the golf data set.

Random Sampling operator Output