Random Sampling

This operator extracts data rows from the input data set and generates sample tables or views according to the sample properties (percentage or row count) that the user specifies.

Information at a Glance

Note: This operator can only be used with TIBCO® Data Virtualization and Apache Spark 3.2 or later.

Parameter	Description
Category	Sample
Data source type	TIBCO® Data Virtualization
Send output to other operators	Yes
Data processing tool	TIBCO® DV, Apache Spark 3.2 or later

Input

The input is a single tabular data set.

Bad or Missing Values

Null values are not allowed and result in an error.

Configuration

Parameter	Description
Notes	Notes or helpful information about this operator's parameter settings. When you enter content in the Notes field, a yellow asterisk appears on the operator.
Number of Samples	Specify the number of samples to generate. The samples are in the form of either database tables or views. For example, if the user inputs 3 in this field, 3 sample tables or views are generated.
Sample By	Specify how the size of the samples is defined. The following values are available: Percentage, Number of Rows.
Sample Size	Specify the size of each sample data set. This property is interpreted in conjunction with the Sample By property. Percentage - Specify the number of rows to include in each sample as a percentage of the number of rows in the input data set. For example, if the user enters 20%, 30%, and 40% for three samples and the input data set contains 10,000 rows, the sample data sets contain 2,000, 3,000, and 4,000 rows respectively, and 9,000 rows are selected in total. If the Disjoint property is true, the sum of the percentages should not exceed 100%. Number of Rows - Specify the exact number of rows to include in each sample data set. See the Define Sample Size dialog help for more information.
Consistent	Determines whether the operator always creates the same set of random rows for each sample data generation. true - sample data generation is consistent, provided that the number of samples, sample size, and the value of the random seed remain unchanged. If set to true, then Replacement must be false. It must be true to set Key Columns. false - a different random sample is created each time the operator is run. If set to false, then Random Seed is disabled. Default: false
Random Seed	The seed used for the pseudo-random generation. The seed is the number with which the random sampling algorithm starts to generate the pseudo-random numbers. The range of this value is from 0 to 1. A different system-generated seed value is used if the Random Seed value is not specified.
Replacement	Specify whether the sampling is with or without replacement. true - sampling with replacement. false - sampling without replacement. Default: false. If Replacement is set to true, both the Consistent and Disjoint properties are set to false and disabled.
Disjoint	Specify whether each sample should be drawn from the entire data set, or from the remaining rows after previous samples are excluded. If you select Disjoint, the same row does not appear in more than one sample. If you specify Percentage for the Sample By property, the sum of all the sample percentages should not be greater than 100%. If Disjoint is set to true, Replacement must be false.
Key Columns	Used in conjunction with the Consistent property. Click Select Columns to display the Select Columns dialog, which is used for selecting columns to ensure the ordering of the data before generating the pseudo-random sample data set. The Random Sampling operator uses these key columns to guarantee the order of the rows from the input data set, so that the generation of pseudo-random sample data sets is consistent every time. If no key columns are specified, the Random Sampling operator assumes that the row ordering of the input data set is consistent. See Key Columns dialog for more information.
Output Schema	Specify the schema for the output table or view.
Output Table	Specify the table path and name where the output is stored. By default, this is a unique table name based on your user ID, workflow ID, and operator.
Store Results	When set to Yes, the operator saves the results. If set to No, the operator does not save the results.
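
The percentage arithmetic described for Sample Size, together with the Disjoint limit on the total percentage, can be sketched as follows. This is an illustrative sketch only, not the operator's implementation: the function name sample_row_counts is hypothetical, and rounding fractional row counts down is an assumption, because the rounding rule is not stated here.

```python
def sample_row_counts(total_rows, percentages, disjoint=False):
    """Return the number of rows each percentage-based sample would contain.

    Hypothetical helper for illustration; not part of the product.
    """
    if any(p < 0 or p > 100 for p in percentages):
        raise ValueError("each percentage must be between 0 and 100")
    if disjoint and sum(percentages) > 100:
        raise ValueError("disjoint samples cannot total more than 100%")
    # Assumption: fractional row counts are rounded down.
    return [total_rows * p // 100 for p in percentages]


# The example from the Sample Size description: 10,000 input rows,
# three samples of 20%, 30%, and 40%.
counts = sample_row_counts(10_000, [20, 30, 40], disjoint=True)
print(counts, sum(counts))  # [2000, 3000, 4000] 9000
```

With Disjoint enabled, a request such as 60% and 50% is rejected in this sketch because the two samples could not be drawn without sharing rows.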

Output

Visual Output

Output: The data rows of the output table or view of each generated sample are displayed.

Output to successive operators

A tabular data set that contains the generated samples. Typically, the data set is passed on to a Sample Selector operator, such as Train and Test, to select a sample to use with subsequent operators. The operator adds the following column to the output:

tds_sample_column: This column indicates the sample to which the row has been assigned and is used by the Sample Selector operator.
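
To illustrate how a sample-assignment column of this kind can be produced and then consumed downstream, the following sketch draws consistent, disjoint samples without replacement and labels each selected row. It is a sketch under stated assumptions, not the operator's algorithm: the functions assign_samples and select_sample are hypothetical, and the use of integer sample indexes as the values of tds_sample_column is an assumption made for the example.

```python
import random


def assign_samples(rows, sizes, key_columns, seed):
    """Label sampled rows with a sample index; unsampled rows are omitted."""
    # Sort on the key columns so the result does not depend on input order
    # (this is only fully deterministic if the key values are unique).
    ordered = sorted(rows, key=lambda r: tuple(r[k] for k in key_columns))
    if sum(sizes) > len(ordered):
        raise ValueError("disjoint samples cannot exceed the number of rows")
    rng = random.Random(seed)  # fixed seed gives a repeatable selection
    # Draw all positions at once without replacement, then slice them into
    # consecutive groups so that no row lands in two samples.
    picked = rng.sample(range(len(ordered)), sum(sizes))
    labeled, start = [], 0
    for sample_index, size in enumerate(sizes):
        for i in picked[start:start + size]:
            # Assumption: the column holds an integer sample index.
            labeled.append({**ordered[i], "tds_sample_column": sample_index})
        start += size
    return labeled


def select_sample(labeled, sample_index):
    """Mimic a downstream selector: keep the rows of one sample."""
    return [r for r in labeled if r["tds_sample_column"] == sample_index]


rows = [{"id": i, "value": i * 10} for i in range(10)]
labeled = assign_samples(rows, sizes=[2, 3], key_columns=["id"], seed=1)
first = select_sample(labeled, 0)
second = select_sample(labeled, 1)
print(len(first), len(second))  # 2 3
```

Because the rows are ordered by the key column before the seeded draw, passing the same rows in a different order yields the same labeled output in this sketch, which mirrors the role that Key Columns plays for the Consistent property.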

Example

The following example extracts the data rows from the input data set and displays them according to the sample properties defined by the user.

Random Sampling operator workflow

Data

golf: This data set contains the following information:

Multiple columns, namely outlook, temperature, wind, humidity, and play.
14 rows.

Parameter Setting

The parameter settings for the golf data set are as follows:

Number of Samples: 2
Sample By: Percentage
Sample Size: 10%, 20%
Consistent: true
Random Seed: 1
Replacement: false
Disjoint: true
Key Columns: outlook, temperature, humidity, wind
Store Results: Yes
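
The settings above satisfy the constraints between Consistent, Replacement, Disjoint, and Key Columns that are described under Configuration. A minimal sketch of such a check is shown below. The check_settings function and the dictionary keys are hypothetical names used only for illustration, and the rule that one Sample Size entry is needed per sample is an assumption inferred from the Sample Size description.

```python
def check_settings(s):
    """Return a list of constraint violations (empty if none are found)."""
    problems = []
    if s["replacement"] and (s["consistent"] or s["disjoint"]):
        problems.append("Replacement requires Consistent and Disjoint to be false")
    if s["key_columns"] and not s["consistent"]:
        problems.append("Key Columns require Consistent to be true")
    if (s["disjoint"] and s["sample_by"] == "Percentage"
            and sum(s["sample_size"]) > 100):
        problems.append("Disjoint percentages must not total more than 100%")
    # Assumption: one Sample Size entry is given for each sample.
    if len(s["sample_size"]) != s["number_of_samples"]:
        problems.append("one Sample Size entry is needed per sample")
    return problems


golf_settings = {
    "number_of_samples": 2,
    "sample_by": "Percentage",
    "sample_size": [10, 20],
    "consistent": True,
    "random_seed": 1,
    "replacement": False,
    "disjoint": True,
    "key_columns": ["outlook", "temperature", "humidity", "wind"],
}
print(check_settings(golf_settings))  # []
```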

Output

The following figure displays the output for the parameter settings for the golf data set.

Random Sampling operator Output
