Random Sampling (HD)

Extracts data rows from the input data set and generates sample tables/views according to the sample properties (percentage or row count) the user specifies.

Information at a Glance

Category Sample
Data source type HD
Sends output to other operators Yes
Data processing tool MapReduce

The Random Sampling (HD) operator is for Hadoop data only. For database data, use the Random Sampling (DB) operator.

Input

A data set from the preceding operator.

Configuration

Parameter Description
Notes Any notes or helpful information about this operator's parameter settings. When you enter content in the Notes field, a yellow asterisk is displayed on the operator.
Number of Samples The number of samples to generate. The samples are in the form of Hadoop files. For example, if the user inputs 3 in this field, 3 sample files are generated.
Sample By Specifies whether sample sizes are given by Percentage or by Number of Rows.
Sample Size The number of rows to generate for each sample data set. This property is interpreted in conjunction with the Sample By property.
  • Percentage - Specify the number of rows to include in each sample as a percentage of the number of rows in the input data set. For example, if the user inputs 20%, 30%, and 40% for three samples and the input data set contains 10,000 rows, the sample data sets contain 2,000, 3,000, and 4,000 rows respectively, and 9,000 rows are selected in total.

    The total aggregate percentage should not exceed 100% if the Disjoint property is true.

  • Number of Rows - Specify the exact number of rows to include in each sample data set.
See Define Sample Sizes dialog box help for more information.
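
The Percentage arithmetic above can be checked with a short Python sketch. The function name rows_per_sample is hypothetical, and rounding down to a whole row is an assumption; this page does not state how the operator rounds fractional row counts.

```python
def rows_per_sample(total_rows, percentages):
    """Convert per-sample percentages into row counts.

    Assumption: fractional rows are rounded down. The operator's
    actual rounding rule is not documented here.
    """
    return [total_rows * p // 100 for p in percentages]


counts = rows_per_sample(10_000, [20, 30, 40])
print(counts)       # [2000, 3000, 4000]
print(sum(counts))  # 9000
```
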
Random Seed The seed used for the pseudo-random row extraction.
  • The seed is the number with which the random sampling algorithm starts to generate the pseudo-random numbers.
  • The range of this value is from 0 to 1.
  • If no Random Seed value is specified, a different system-generated seed value is used.
Consistent Determines whether the operator always creates the same set of random rows for each sample data generation.
  • true - sample data generation is consistent, provided that the number of samples, sample size, and the value of the Random Seed remain unchanged. If set to true, then Replacement must be false. Must be true to set Key Columns.
  • false (the default) - a different random sample is created each time the operator is run. If set to false, then Random Seed is disabled.
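
The effect of a fixed seed can be illustrated with a minimal Python sketch. It uses Python's own random module as a stand-in; it is not the algorithm the operator runs on Hadoop, and the helper name draw_sample is hypothetical.

```python
import random


def draw_sample(rows, size, seed=None):
    """Draw `size` rows without replacement.

    With a fixed seed the same rows come back on every call, which
    mirrors Consistent = true. With seed=None the generator is seeded
    by the system, so repeated calls usually differ.
    """
    rng = random.Random(seed)
    return rng.sample(rows, size)


rows = list(range(1000))
first = draw_sample(rows, 5, seed=0.42)
second = draw_sample(rows, 5, seed=0.42)
print(first == second)  # True: same seed, same input, same sample size
```

Changing the seed, the input rows, or the sample size is enough to change the result, which is why consistency holds only while those settings remain unchanged.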
Replacement Specifies whether one row of data can be selected multiple times.
  • true - sampling with replacement.
  • false (the default) - sampling without replacement, where one row can be selected only once.

If set to true, then both the Consistent and Disjoint properties are set to false and disabled.
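
The difference between the two Replacement settings can be sketched in Python as follows. This is an illustration using the standard random module, not the operator's implementation; both function names are hypothetical.

```python
import random


def sample_without_replacement(rows, k, seed=None):
    # Each row can be picked at most once, so k cannot exceed len(rows).
    return random.Random(seed).sample(rows, k)


def sample_with_replacement(rows, k, seed=None):
    # Every draw is made from the full list, so a row can be picked repeatedly.
    return random.Random(seed).choices(rows, k=k)


rows = ["r1", "r2", "r3", "r4", "r5"]
print(sample_without_replacement(rows, 5, seed=7))  # a permutation of the 5 rows
print(sample_with_replacement(rows, 8, seed=7))     # 8 draws from 5 rows, so at least one row repeats
```

Because rows can repeat when sampling with replacement, the samples cannot be guaranteed to be free of shared rows, which is consistent with Disjoint being disabled in that mode.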

Disjoint Specify whether each sample should be drawn from the entire data set, or from the remaining rows after previous samples are excluded.
  • If you select Disjoint, then the same data does not appear in different samples.
  • If you specify the Percentage type for Sample By, then the sum of all the sample percentages should not be greater than 100.

If set to true, then Replacement must be false.
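
One way to picture Disjoint sampling is to shuffle the rows once and cut consecutive, non-overlapping slices. The Python sketch below does that under stated assumptions: the function name disjoint_samples is hypothetical, row counts are rounded down, and the shuffle-and-slice approach is only an illustration, not necessarily how the operator's MapReduce job works.

```python
import random


def disjoint_samples(rows, percentages, seed=None):
    """Split off non-overlapping samples sized by percentage."""
    if sum(percentages) > 100:
        raise ValueError("sample percentages must not sum to more than 100")
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    samples, start = [], 0
    for p in percentages:
        n = len(shuffled) * p // 100  # assumption: round down
        samples.append(shuffled[start:start + n])
        start += n  # the next sample starts where this one ended
    return samples


samples = disjoint_samples(range(10_000), [20, 30, 40], seed=1)
print([len(s) for s in samples])               # [2000, 3000, 4000]
print(len(set().union(*map(set, samples))))    # 9000: no row appears twice
```

The percentage check follows directly from the slicing: once the slices cover every row, there is nothing left for a further sample to draw from.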

Store Results? Specifies whether to store the results.
  • true - results are stored.
  • false - the data set is passed to the next operator without storing.
Results Location The HDFS directory where the results of the operator are stored. This is the main directory, the subdirectory of which is specified in Results Name. Click Choose File to open the Hadoop File Explorer Dialog Box and browse to the storage location. Do not edit the text directly.
Results Name The name of the file in which to store the results.
Overwrite Specifies whether to delete existing data at that path and file name.
  • Yes - if the path exists, delete that file and save the results.
  • No - Fail if the path already exists.
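
The Overwrite behavior can be sketched in Python against a local file system. This is only an analogy: the operator writes to HDFS, the function name resolve_output_path is hypothetical, and joining Results Location with Results Name to form the target path is an assumption based on the descriptions above.

```python
import shutil
from pathlib import Path


def resolve_output_path(results_location, results_name, overwrite):
    """Return the target path, applying the Yes/No overwrite rule."""
    target = Path(results_location) / results_name
    if target.exists():
        if not overwrite:
            # Overwrite = No: fail if the path already exists.
            raise FileExistsError(f"{target} already exists")
        # Overwrite = Yes: delete the existing data before saving.
        if target.is_dir():
            shutil.rmtree(target)
        else:
            target.unlink()
    return target
```
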
Compression Select the type of compression for the output.
Available Parquet compression options are the following.
  • GZIP
  • Deflate
  • Snappy
  • no compression

Available Avro compression options are the following.

  • Deflate
  • Snappy
  • no compression

Output

Visual Output
The data rows of each generated sample are displayed (up to 2000 rows per sample).
Data Output
Data sets of sample files created. Typically, the data set is passed on to a Sample Selector operator, such as Train and Test, to select a sample to use with subsequent operators.

Example

random sampling example