Stratified Sampling

Extracts data rows from the input data set and generates sample tables/views according to the sample properties specified by users.

Information at a Glance

Category Sample
Data source type DB
Sends output to other operators Yes
Data processing tool n/a

The user chooses a sampling column. Every generated sample preserves the proportion of each distinct value found in that column of the input data set.

Input

A data set from the preceding operator.

Configuration

Parameter Description
Notes Any notes or helpful information about this operator's parameter settings. When you enter content in the Notes field, a yellow asterisk is displayed on the operator.
Sampling Column The column whose distinct values keep the same proportions in every generated sample.

For example, suppose the column "gender" is chosen as the sampling column and it contains two distinct values, "male" and "female", in proportions of 40% and 60%. Every generated sample then contains 40% "male" values and 60% "female" values in the "gender" column.
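The proportion-preserving behavior described above can be sketched in plain Python. This is only an illustration of the general technique, not the operator's actual implementation (which runs in the database); the function name `stratified_sample` and the rounding rule are assumptions made for this example.

```python
import random
from collections import Counter, defaultdict

def stratified_sample(rows, column, fraction, seed=None):
    """Draw the same fraction of rows from each distinct value of `column`.

    Hypothetical helper: the name and the use of round() are assumptions.
    """
    rng = random.Random(seed)
    # Group the rows into strata, one per distinct value of the column.
    strata = defaultdict(list)
    for row in rows:
        strata[row[column]].append(row)
    sample = []
    for group in strata.values():
        k = round(len(group) * fraction)  # rows to take from this stratum
        sample.extend(rng.sample(group, k))
    return sample

# 100 input rows: 40% "male", 60% "female".
rows = [{"id": i, "gender": "male" if i < 40 else "female"} for i in range(100)]
sample = stratified_sample(rows, "gender", 0.2, seed=0)
counts = Counter(r["gender"] for r in sample)
# The 20-row sample keeps the 40/60 split: 8 "male" rows and 12 "female" rows.
```

Because each stratum is sampled at the same fraction, the split in the sample matches the split in the input, up to rounding when a stratum size is not evenly divisible.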

Number of Samples The number of samples to generate.

The samples are in the form of database tables/views. For example, if the number of samples is 3, 3 sample tables/views are generated.

Sample by Specifies whether sample sizes are defined by Percentage or by Number of Rows.

If Sample by is set to Percentage, the sum of all of the sample percentages should not be greater than 100.

Sample Size Number of rows to generate for each sample data set. This property is interpreted in conjunction with the Sample by property.
  • Percentage - Specify the number of rows to include in each sample as a percentage of the number of rows in the input data set. For example, if the user enters 20%, 30%, and 40% for three samples and the input data set contains 10,000 rows, the sample data sets contain 2,000, 3,000, and 4,000 rows respectively, and 9,000 rows are selected in total.

    The total percentage should not exceed 100% if the Disjoint property is true.

  • Row - Specify the exact number of rows to include in each sample data set.
See Define Sample Sizes dialog box help for more information.
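The percentage arithmetic in the example above can be checked with a short Python sketch. The helper name `sample_row_counts` is hypothetical, and rounding down for fractional results is an assumption; the documentation does not state how the operator rounds.

```python
def sample_row_counts(total_rows, percentages):
    """Convert per-sample percentages into per-sample row counts.

    Hypothetical helper; rounding down is an assumption.
    """
    if sum(percentages) > 100:
        raise ValueError("the sum of the sample percentages must not exceed 100")
    # Integer arithmetic avoids floating-point rounding surprises.
    return [total_rows * p // 100 for p in percentages]

counts = sample_row_counts(10_000, [20, 30, 40])
total = sum(counts)
# counts is [2000, 3000, 4000] and total is 9000, matching the example above.
```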
Random Seed The seed used for the pseudo-random row extraction. The seed is the number with which the Random Sampling algorithm starts to generate the pseudo-random numbers.

The range of this value is from 0 to 1.

If no Random Seed value is specified, a different system-generated seed value is used for each run.

Consistent Specify whether the operator always creates the same set of rows for each sample data generation.
  • If Consistent is set to true, sample data generation is consistent, provided that the number of samples, sample size, and value of the Random Seed remain unchanged.
    Note: If a Random Seed value is specified, the Consistent property is set automatically to true.
  • If Consistent is set to false (the default), a different random sample is created each time.
Disjoint Specify whether each sample should be drawn from the entire data set, or from the remaining rows after previous samples are excluded.

If Disjoint is set to true, the same data does not appear in different samples.

Default value: false.
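The difference between the two Disjoint settings can be illustrated with a simplified Python sketch. Stratification is omitted here for brevity, and the function name `draw_samples` is hypothetical; this is not the operator's implementation.

```python
import random

def draw_samples(rows, sizes, disjoint, seed=None):
    """Draw one sample per entry in `sizes`.

    Hypothetical helper. When `disjoint` is true, rows already picked are
    removed from the pool before the next sample is drawn.
    """
    rng = random.Random(seed)
    pool = list(rows)
    samples = []
    for size in sizes:
        picked = rng.sample(pool, size)
        samples.append(picked)
        if disjoint:
            taken = set(picked)
            pool = [r for r in pool if r not in taken]
    return samples

rows = list(range(100))  # row identifiers stand in for full rows
first, second = draw_samples(rows, [30, 30], disjoint=True, seed=1)
no_overlap = set(first).isdisjoint(second)  # True when Disjoint is true
```

With `disjoint=False`, every sample is drawn from the full pool, so the same row may appear in more than one sample.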

Key Columns Used in conjunction with the Consistent property.
  • Click Select Columns to display the Select Columns dialog box, which is used for selecting columns to ensure the ordering of the data before generating the pseudo-random sample data set.
  • The Stratified Sampling operator uses these key columns to guarantee the order of the rows from the input data set, so that the generation of pseudo-random sample data sets is consistent every time.
  • If no key columns are specified, the Stratified Sampling operator assumes that the row ordering of the input data set is consistent.
See Key Columns Dialog Box for more information.
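The following Python sketch shows why ordering by key columns matters for consistency: a seeded pseudo-random draw is only repeatable if the rows are presented in the same order each time. The function name `consistent_sample` and the column names are assumptions for illustration, not part of the operator.

```python
import random

def consistent_sample(rows, key_columns, size, seed):
    """Sort by the key columns, then draw a seeded sample (hypothetical helper)."""
    ordered = sorted(rows, key=lambda r: tuple(r[c] for c in key_columns))
    return random.Random(seed).sample(ordered, size)

rows = [{"id": i, "value": i * i} for i in range(50)]

# Simulate an input whose physical row order differs between runs.
shuffled = rows[:]
random.Random(99).shuffle(shuffled)

a = consistent_sample(rows, ["id"], 5, seed=0.42)
b = consistent_sample(shuffled, ["id"], 5, seed=0.42)
same = a == b  # True: same seed and same key ordering give the same sample
```

Without the sort step, the two calls would generally return different rows even with an identical seed, which is the situation the Key Columns property is meant to prevent.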
Output Schema The schema for the output table or view.
Output Table The table path and name where the results are output. By default, this is a unique table name based on your user ID, workflow ID, and operator.
Storage Parameters Advanced database settings for the operator output. Available only for TABLE output.

See Storage Parameters Dialog Box for more information.

Drop If Exists Specifies whether to overwrite an existing table.
  • Yes - If a table with the same name exists, it is dropped before the results are stored.
  • No - If a table with the same name exists, the results window shows an error message.

Output

Visual Output
The data rows of the output table/view of each generated sample are displayed (up to 2,000 rows per sample).
Data Output
One table/view is created for each sample data set. The output is typically connected to a Sample Selector operator, which selects the sample to pass to succeeding operators.