Random Sampling (DB)

Extracts data rows from the input data set and generates sample tables/views according to the sample properties (percentage or row count) the user specifies.

Information at a Glance

Category Sample
Data source type DB
Sends output to other operators Yes
Data processing tool n/a

The Random Sampling (DB) operator is for database data only. For Hadoop data, use the Random Sampling (HD) operator.

Input

A data set from the preceding operator.

Configuration

Parameter Description
Notes Any notes or helpful information about this operator's parameter settings. When you enter content in the Notes field, a yellow asterisk is displayed on the operator.
Number of Samples The number of samples to generate. The samples are in the form of either database tables or views. For example, if the user inputs 3 in this field, 3 sample tables/views are generated.
Sample By Whether the sample size is specified by Percentage or by Number of Rows.
Sample Size The number of rows to generate for each sample data set. This property is interpreted in conjunction with the Sample By property.
  • Percentage - Specify the number of rows to include in each sample as a percentage of the number of rows in the input data set. For example, if the user inputs 20%, 30%, and 40% for three samples and the input data set contains 10,000 rows, the sample data sets contain 2,000, 3,000, and 4,000 rows respectively, and 9,000 rows are selected in total.

    The total aggregate percentage should be less than 100% if the Disjoint property is true.

  • Row - Specify the exact number of rows to include in each sample data set.
See Define Sample Sizes dialog box help for more information.
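
The Percentage arithmetic above can be checked with a short Python sketch. The function name is hypothetical, and rounding down fractional row counts is an assumption; the operator's actual rounding behavior is not stated here.

```python
def sample_sizes_from_percentages(total_rows, percentages):
    """Return the row count of each sample, given whole-number percentages.

    Assumption: fractional row counts are rounded down.
    """
    return [total_rows * p // 100 for p in percentages]


sizes = sample_sizes_from_percentages(10_000, [20, 30, 40])
print(sizes)       # [2000, 3000, 4000]
print(sum(sizes))  # 9000
```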
Random Seed The seed used for the pseudo-random row extraction.
  • The seed is the number with which the random sampling algorithm starts to generate the pseudo-random numbers.
  • The range of this value is from 0 to 1.
  • If no Random Seed value is specified, a different system-generated seed value is used for each run.
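
As a rough illustration of what a seed does, the following Python sketch uses the standard library generator rather than the database's own; `draw_sample` is a hypothetical helper, not part of the operator. The same seed reproduces the same rows, while omitting the seed lets the system choose one.

```python
import random


def draw_sample(rows, size, seed=None):
    # seed=None lets Python pick a system-generated seed,
    # so repeated calls generally return different rows.
    rng = random.Random(seed)
    return rng.sample(rows, size)


rows = list(range(100))

# A fixed seed in the 0 to 1 range gives a repeatable sample.
first = draw_sample(rows, 5, seed=0.42)
second = draw_sample(rows, 5, seed=0.42)
print(first == second)  # True
```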
Consistent Determines whether the operator always creates the same set of random rows for each sample data generation.
  • true - sample data generation is consistent, provided that the number of samples, the sample size, and the value of the Random Seed remain unchanged. If Consistent is set to true, then Replacement must be false. Consistent must be true in order to set Key Columns.
  • false - a different random sample is created each time the operator is run. If set to false, then Random Seed is disabled.

Default value: false.

Replacement Specifies whether this is sampling with or without replacement.
  • true - sampling with replacement.
  • false (the default) - sampling without replacement.

If Replacement is selected, both the Consistent and Disjoint properties are set to false and disabled.
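
The difference between the two modes can be sketched in Python with the standard library. This illustrates the concept only and is not how the operator queries the database; the row labels are made up.

```python
import random

rng = random.Random(7)
rows = ["r1", "r2", "r3", "r4", "r5"]

# With replacement: each draw is made from the full set,
# so the same row may appear more than once.
with_replacement = rng.choices(rows, k=5)

# Without replacement: a row is removed from the pool once drawn,
# so every row appears at most once.
without_replacement = rng.sample(rows, k=5)

print(with_replacement)
print(without_replacement)
```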

Disjoint Specifies whether each sample is drawn from the entire data set, or from the rows that remain after previous samples are excluded.
  • If you select Disjoint, the same row does not appear in more than one sample.
  • If, for Sample By, you specify the Percentage type, the sum of all the sample percentages should not be greater than 100.

If set to true, then Replacement must be false.
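
A minimal Python sketch of the two behaviors follows. Both function names are hypothetical, and shuffling once and then slicing is only one way to obtain non-overlapping samples; the operator's own method is not described here.

```python
import random


def disjoint_samples(rows, sizes, seed=None):
    """Draw samples that share no rows: shuffle once, then cut slices."""
    if sum(sizes) > len(rows):
        raise ValueError("disjoint samples cannot exceed the input size")
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    samples, start = [], 0
    for size in sizes:
        samples.append(shuffled[start:start + size])
        start += size
    return samples


def independent_samples(rows, sizes, seed=None):
    """Draw every sample from the entire data set; samples may overlap."""
    rng = random.Random(seed)
    return [rng.sample(list(rows), size) for size in sizes]


rows = list(range(100))
parts = disjoint_samples(rows, [20, 30, 40], seed=0.5)
overlap = set(parts[0]) & set(parts[1]) | set(parts[1]) & set(parts[2]) | set(parts[0]) & set(parts[2])
print([len(p) for p in parts])  # [20, 30, 40]
print(len(overlap))             # 0
```

The size check mirrors the rule that disjoint percentage samples cannot add up to more than the whole input.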

Key Columns Used in conjunction with the Consistent property.
  • Click Select Columns to display the Select Columns dialog box, which is used for selecting columns to ensure the ordering of the data before generating the pseudo-random sample data set.
  • The Random Sampling operator uses these key columns to guarantee the order of the rows from the input data set, so that the generation of pseudo-random sample data sets is consistent every time.
  • If no key columns are specified, the Random Sampling operator assumes that the row ordering of the input data set is consistent.
See Key Columns Dialog Box for more information.
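
Why ordering matters can be sketched in Python: a seeded generator picks the same positions every time, so the rows at those positions must also be the same. `consistent_sample` and the `id` column are hypothetical names for illustration, and the sketch assumes the key columns identify each row uniquely.

```python
import random


def consistent_sample(rows, size, seed, key_columns):
    # Sort by the key columns first so the result does not depend on
    # the order in which the rows happen to arrive.
    ordered = sorted(rows, key=lambda r: tuple(r[c] for c in key_columns))
    return random.Random(seed).sample(ordered, size)


rows = [{"id": i, "value": i * i} for i in range(50)]
scrambled = rows[::-1]  # same rows, different arrival order

a = consistent_sample(rows, 5, seed=0.5, key_columns=["id"])
b = consistent_sample(scrambled, 5, seed=0.5, key_columns=["id"])
print(a == b)  # True
```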
Output Schema The schema for the output table or view.
Output Table The table path and name where the results are output. By default, this is a unique table name based on your user ID, workflow ID, and operator.
Storage Parameters Advanced database settings for the operator output. Available only for TABLE output.

See Storage Parameters Dialog Box for more information.

Drop If Exists Specifies whether to overwrite an existing table.
  • Yes - If a table with the name exists, it is dropped before storing the results.
  • No - If a table with the name exists, the results window shows an error message.
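
The two Drop If Exists behaviors resemble the following Python sketch, which uses an in-memory SQLite database as a stand-in. The `write_sample` helper, the table name, and the single `id` column are all hypothetical; the SQL that the operator actually issues depends on the target database.

```python
import sqlite3


def write_sample(conn, table, ids, drop_if_exists):
    # The table name is interpolated directly; it is a trusted,
    # hard-coded identifier in this sketch.
    if drop_if_exists:
        conn.execute(f'DROP TABLE IF EXISTS "{table}"')
    # Without the drop, CREATE TABLE fails if the table already exists.
    conn.execute(f'CREATE TABLE "{table}" (id INTEGER)')
    conn.executemany(f'INSERT INTO "{table}" VALUES (?)', [(i,) for i in ids])


conn = sqlite3.connect(":memory:")
write_sample(conn, "sample_0", [1, 2, 3], drop_if_exists=True)
write_sample(conn, "sample_0", [4, 5], drop_if_exists=True)  # replaces the table
count = conn.execute('SELECT COUNT(*) FROM "sample_0"').fetchone()[0]
print(count)  # 2

try:
    write_sample(conn, "sample_0", [6], drop_if_exists=False)
    failed = False
except sqlite3.OperationalError:
    failed = True  # the existing table is left untouched
print(failed)  # True
```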

Output

Visual Output
The data rows of the output table/view for each generated sample are displayed (up to 2,000 rows per sample).
Data Output
A data set consisting of the sample tables/views that were created. Typically, the data set is passed on to a Sample Selector operator, such as Train and Test, to select a sample to use with subsequent operators.

Example

random sampling example