WS Node - Random Sample Filtering - Specifications Tab / In-Database Random Sample Filtering - Specifications Tab
The Random Sample Filtering workspace node can be accessed from the Feature Finder, ribbon bar, or Node Browser. Double-click the node to display the specifications dialog box.
In-Database Random Sample Filtering
The In-Database Random Sample Filtering node follows the functionality of the Random Sampling module in Statistica, with a few important differences. It does not support sampling with replacement, oversampling, or split-node random sampling. The transformations to the data are not applied until the process reaches the In-Database analytics node, which operates on the modified query. As a result, if multiple downstream nodes are connected, they might produce different results, because random sampling is performed only when the analytics procedure itself is executed.
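The deferred behavior described above can be illustrated with a minimal sketch. It uses Python's built-in sqlite3 module and a hypothetical table named cases; the query text is an assumption made for illustration and is not the SQL that Statistica generates. Because the sampling lives inside the query, each consumer that executes the query draws its own sample.

```python
import sqlite3

# Hypothetical in-memory table standing in for the database source.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cases (id INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO cases (id) VALUES (?)", [(i,) for i in range(1000)])

# The "modified query": sampling is part of the query text, so no rows are
# drawn until something actually executes it. (Illustrative SQL only.)
SAMPLED_QUERY = "SELECT id FROM cases ORDER BY random() LIMIT 10"


def run_downstream_node(connection):
    """Stand-in for one downstream analytics node executing the query."""
    return {row[0] for row in connection.execute(SAMPLED_QUERY)}


node_a = run_downstream_node(conn)
node_b = run_downstream_node(conn)

print(len(node_a), len(node_b))  # each node receives 10 cases
print(node_a == node_b)          # almost certainly False: two separate draws
```

If two downstream analyses need the same sample, one possible workaround is to materialize the sample once and have both analyses read the stored result; whether that fits a given workflow depends on the database and the nodes involved.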
Variables. Click this button to display a variable selection dialog box, which is used to choose the variables from the current spreadsheet to be included in the random sample.
Simple Sampling. Select this tab to access the options described below.
- Simple random sampling
- Select this option button to create a probability sample (subset) via random sampling. You have two choices regarding how the sampling fraction for drawing the sample will be determined: either via the percentage of cases within the original spreadsheet, or as an approximate number of cases; select the respective option (Calculate based on percentage of cases or Calculate based on approximate N) on the Options tab to select either method of determining the sampling fraction.
- % =
- Specify the approximate percentage of cases or the approximate number of cases to be used when creating the subset according to the respective option (Calculate based on percentage of cases or Calculate based on approximate N) specified on the Options tab.
- With replacement
- (This option is not available in the In-Database Random Sample Filtering node dialog box.) When you select this check box, once a case is selected to be included into the subset, that case will be placed back into the pool of available choices for the remaining cases in the subset (hence, an individual case can appear more than once in the resulting subset).
- Exact
- Select this check box to ensure that exactly the specified % of cases is returned. Oversampling makes it possible to request more cases than exist in the input; for example, if the input contains 50 cases, you can request 75 cases, or 150%.
- Systematic random sampling
- Select this option button to create the probability sample (subset) via systematic random sampling. For instance, if you enter 5 in the K= box, Statistica randomly selects a case within the first five cases and then completes the subset by selecting every fifth case in the spreadsheet after the originally selected case.
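The sampling schemes on this tab can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Statistica's implementation: the function names are hypothetical, "approximate" sampling is assumed to mean that each case is included independently with the given probability, and oversampling (more than 100%) is modeled only for sampling with replacement.

```python
import random


def simple_random_sample(cases, pct, exact=False, with_replacement=False, rng=None):
    """Draw roughly (or exactly) pct percent of the cases. Hypothetical helper."""
    rng = rng or random.Random()
    if with_replacement:
        # A case can be drawn more than once, so pct may exceed 100.
        n = round(len(cases) * pct / 100)
        return rng.choices(cases, k=n)
    if exact:
        # Exactly n distinct cases; this sketch requires pct <= 100 here.
        n = round(len(cases) * pct / 100)
        return rng.sample(cases, n)
    # Assumed meaning of "approximate": each case is kept with probability
    # pct/100, so the sample size varies around the requested percentage.
    return [case for case in cases if rng.random() < pct / 100]


def systematic_sample(cases, k, rng=None):
    """Pick a random start within the first k cases, then every k-th case."""
    rng = rng or random.Random()
    start = rng.randrange(k)
    return cases[start::k]


cases = list(range(1, 51))  # 50 cases
rng = random.Random(0)

approx_sample = simple_random_sample(cases, 20, rng=rng)        # about 10 cases
exact_sample = simple_random_sample(cases, 20, exact=True, rng=rng)
oversample = simple_random_sample(cases, 150, with_replacement=True, rng=rng)
every_fifth = systematic_sample(cases, 5, rng=rng)

print(len(exact_sample))  # 10
print(len(oversample))    # 75
print(every_fifth)        # 10 cases, each 5 positions after the previous one
```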
Stratified Sampling. Select this tab to access the options described below.
- Strata Variables
- Select one or more stratification variables. The stratified sample will be drawn from the combinations of all codes for all stratification variables.
- Stratification Groups
- In the % column or N column (depending on whether you select the Calculate based on percentage of cases option button or the Calculate based on approximate N option button on the Options tab), specify the sampling fraction or number of cases to sample from each stratum. You can also select the Uniform probability check box (see below), in which case the same sampling fraction is applied to all strata.
- Codes
- Click this button to display the Select codes for stratification variables dialog box, where you specify codes for the Strata variables. By default, all distinct integer values will be used to define the strata for stratified sampling.
- Uniform probability/% =/N =
- Select this check box to apply identical sampling fractions to all strata; then specify either the common (to all strata) percentage of cases to be used when drawing the samples, or the approximate numbers of cases; use the respective option (Calculate based on percentage of cases or Calculate based on approximate N) on the Options tab to select either method for determining the sampling fractions. Note that if sample sizes (N) are requested that are greater than the actual number of cases belonging to some strata in the population (in the input file), all cases from those strata will be selected into the final sample.
- Exact
- Select this check box to ensure that exactly the specified N or % of cases is returned. Oversampling enables you to request more cases than exist in the input; for example, if the input contains 50 cases, you can request 75 cases, or 150%.
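As a rough illustration of the stratified options above, the following Python sketch groups cases by the combination of codes of the strata variables and draws a fixed number from each group. The function name, the row layout, and the variable names region and group are hypothetical; the only behavior taken from the description is that a request larger than a stratum returns every case in that stratum.

```python
import random
from collections import defaultdict


def stratified_sample(rows, strata_variables, n_per_stratum, rng=None):
    """Draw n cases from each stratum. Hypothetical helper.

    rows: list of dicts, one per case.
    strata_variables: names of the stratification variables.
    n_per_stratum: an int applied to every stratum (uniform), or a dict
        mapping a tuple of codes to the number of cases for that stratum.
    """
    rng = rng or random.Random()
    strata = defaultdict(list)
    for row in rows:
        # A stratum is one combination of codes across all strata variables.
        strata[tuple(row[name] for name in strata_variables)].append(row)

    sample = []
    for codes, members in strata.items():
        if isinstance(n_per_stratum, int):
            n = n_per_stratum
        else:
            n = n_per_stratum.get(codes, 0)
        # A request larger than the stratum selects all of its cases.
        n = min(n, len(members))
        sample.extend(rng.sample(members, n))
    return sample


# 12 cases: region N has 5 cases per group, region S has 1 case per group.
rows = [
    {"id": i, "region": "N" if i < 10 else "S", "group": i % 2}
    for i in range(12)
]
sample = stratified_sample(rows, ["region", "group"], 3, rng=random.Random(0))
print(len(sample))  # 8: three from each N stratum, one from each S stratum
```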
Options. Select this tab to access the options described below.
- Use case selection condition expression
- (This option is not available in the In-Database Random Sample Filtering node dialog box.) When this check box is selected, the case selection conditions specified via the Cases button will be applied before any further sampling is performed; clear this check box to ignore any case selection conditions.
- Options for random sampling
- The options in this group box pertain to simple random sampling, stratified random sampling, and random splitting of the data file only.
- Create an output spreadsheet/Max Rows
- (These options are available only in the In-Database Random Sample Filtering node dialog box.) Select this check box to extract the first N rows, as specified by the Max Rows parameter, and make them available as a downstream document (setting Max Rows to 0 extracts all of the data). This functionality is intended for initial design and troubleshooting, not for production runs; enabling it can significantly decrease the performance of the workflow.
- Use Diehard-certified random number generator (note: this algorithm is slower)
- (This option is not available in the In-Database Random Sample Filtering node dialog box.) Statistica uses a carefully designed and tested random number generator (see DIEHARD Suite of Tests and Random Number Generation) whenever random numbers are required for certain operations or procedures; this default, highest-quality generator is suitable for even the most demanding modeling and simulation projects and Monte Carlo experiments. However, for most simple random or stratified random sampling, simpler and faster methods for randomly selecting the cases (observations) for the final sample can be used. For very large data sets and samples in particular, clear this check box to use the simpler random number generator and draw samples more efficiently.
- Calculate based on approximate percentage of cases
- Select this option button to specify the sampling fraction(s) for simple random or stratified sampling, or for splitting the data file, in terms of percentages.
- Calculate based on approximate N
- Select this option button to specify the sampling fractions for simple random or stratified sampling, or for splitting the data file, in terms of the approximate number of cases in the final sample (or strata). Note that if sample sizes (Approximate N) are requested that are greater than the actual number of cases belonging to the respective strata in the population (in the input file), then all cases from those strata will be selected into the final sample.
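The relationship between the two ways of specifying the sample size can be sketched as a small conversion to a sampling fraction. The function below is a hypothetical helper, not part of Statistica; it assumes that a request for more cases than are available corresponds to a fraction of 1, that is, all cases are selected, and it does not model oversampling.

```python
def sampling_fraction(total_cases, pct=None, approx_n=None):
    """Convert a percentage or an approximate N into a sampling fraction."""
    if (pct is None) == (approx_n is None):
        raise ValueError("specify exactly one of pct or approx_n")
    if total_cases <= 0:
        return 0.0
    fraction = pct / 100 if pct is not None else approx_n / total_cases
    # Requests beyond the available cases select all of them.
    return min(fraction, 1.0)


print(sampling_fraction(200, pct=25))        # 0.25
print(sampling_fraction(200, approx_n=50))   # 0.25
print(sampling_fraction(200, approx_n=500))  # 1.0
```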
Options. See Common Options.
OK. Click this button to accept all the specifications made in the dialog box and to close it. To view the new spreadsheet, click the icon in the lower-right corner of the Random Sample Filtering node.
See also, Home tab.