Stratified Sampling
Extracts rows from the input data set and generates sample tables/views according to the sample properties that the user specifies.
Information at a Glance
The user chooses a sampling column. The proportion of each distinct value in that column is the same in every generated sample as in the input data set.
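The idea of keeping per-value proportions can be illustrated with a minimal Python sketch. It is not the operator's implementation: the function name `stratified_sample`, its parameters, and the in-memory list of dictionaries are all assumptions made for illustration only.

```python
import random
from collections import Counter, defaultdict


def stratified_sample(rows, column, percent, seed=None):
    """Draw `percent` % of the rows from each distinct value of `column`.

    Hypothetical helper: sampling every group at the same rate keeps the
    proportions of the distinct values unchanged in the sample.
    """
    rng = random.Random(seed)
    # Group the rows by the value found in the sampling column.
    strata = defaultdict(list)
    for row in rows:
        strata[row[column]].append(row)
    sample = []
    for value in sorted(strata):
        group = strata[value]
        k = int(len(group) * percent / 100)  # rows to take from this group
        sample.extend(rng.sample(group, k))
    return sample


# 40% "male" and 60% "female", as in the "gender" example.
rows = [{"id": i, "gender": "male" if i < 40 else "female"} for i in range(100)]
sample = stratified_sample(rows, "gender", percent=10, seed=0.5)
counts = Counter(row["gender"] for row in sample)
print(counts["male"], counts["female"])  # 4 6
```

The 10-row sample contains 4 "male" rows and 6 "female" rows, the same 40/60 split as the input.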
Configuration
Parameter | Description |
---|---|
Notes | Any notes or helpful information about this operator's parameter settings. When you enter content in the Notes field, a yellow asterisk is displayed on the operator. |
Sampling Column | The column whose distinct values keep the same proportions in every generated sample.
For example, if the column "gender" is chosen as the sampling column and it contains two distinct values, "male" (40% of the rows) and "female" (60% of the rows), every generated sample contains 40% "male" values and 60% "female" values in the "gender" column. |
Number of Samples | The number of samples to generate.
Each sample is created as a database table or view. For example, if Number of Samples is 3, three sample tables/views are generated. |
Sample by | Specifies whether the sample size is given as a Percentage or as a Number of Rows.
If Sample by is set to Percentage, the sum of all of the sample percentages should not be greater than 100. |
Sample Size | The size of each sample data set. This property is interpreted in conjunction with the Sample by property: as a percentage of rows or as a number of rows. |
Random Seed | The seed used for the pseudo-random row extraction, that is, the number from which the Random Sampling algorithm starts generating pseudo-random numbers.
The value must be between 0 and 1. If no Random Seed value is specified, a different, system-generated seed value is used. |
Consistent | Specifies whether the operator always produces the same set of rows each time the sample data is generated. |
Disjoint | Specifies whether each sample is drawn from the entire data set or from the rows that remain after previous samples are excluded.
If Disjoint is set to true, the same row does not appear in more than one sample. Default value: false. |
Key Columns | Used in conjunction with the Consistent property. |
Output Schema | The schema for the output table or view. |
Output Table | The table path and name where the results are output. By default, this is a unique table name based on your user ID, workflow ID, and operator. |
Storage Parameters | Advanced database settings for the operator output. Available only for TABLE output.
See Storage Parameters Dialog Box for more information. |
Drop If Exists | Specifies whether to overwrite an existing table. |
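The interaction between Disjoint, Random Seed, and the percentage limit can be sketched in Python. This is only an illustration under stated assumptions, not the operator's actual algorithm: `draw_samples` and its parameters are hypothetical names, rows are held in memory, and the check on the percentage total is enforced as an error here although the table only says the total should not exceed 100.

```python
import random
from collections import defaultdict


def draw_samples(rows, column, percents, disjoint=False, seed=None):
    """Return one stratified sample per entry in `percents` (hypothetical)."""
    if sum(percents) > 100:
        raise ValueError("sample percentages must not sum to more than 100")
    rng = random.Random(seed)  # the same seed reproduces the same samples
    strata = defaultdict(list)
    for row in rows:
        strata[row[column]].append(row)
    samples = [[] for _ in percents]
    for value in sorted(strata):
        group = strata[value]
        pool = list(group)
        if disjoint:
            rng.shuffle(pool)  # shuffle once, then hand out consecutive slices
        for i, percent in enumerate(percents):
            k = int(len(group) * percent / 100)
            if disjoint:
                # Take rows from what previous samples have left over.
                picked, pool = pool[:k], pool[k:]
            else:
                # Every sample is drawn from the entire group.
                picked = rng.sample(group, k)
            samples[i].extend(picked)
    return samples


rows = [{"id": i, "gender": "male" if i < 40 else "female"} for i in range(100)]
first, second = draw_samples(rows, "gender", [10, 20], disjoint=True, seed=0.5)
shared = {r["id"] for r in first} & {r["id"] for r in second}
print(len(first), len(second), len(shared))  # 10 20 0
```

With `disjoint=True` no row is shared between the two samples. With `disjoint=False` each sample is drawn independently, so the same row may appear in more than one sample.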
Output
- Visual Output
  - The data rows of the output table/view of each generated sample are displayed (up to 2000 rows).
- Data Output
  - A table or view is created for each sample. The output is typically connected to a Sample Selector operator, which selects the sample that succeeding operators use.