Use Sample data
|
Use the options in this group box to create a new output data spreadsheet that is a sample of the input data set. Select the sampling method, and then click the More options button to view the sampling options specific to that method.
- Systematic random sampling: This method creates a new output data spreadsheet consisting of all selected variables and a random subset of the cases.
When you select Systematic random sampling and click the More options button, the Systematic random sampling dialog box is displayed, where you can specify the K-value (the fixed interval between successive cases selected from the original data set).
- Stratified random sampling: This method creates a new output data spreadsheet as a stratified random sample of the input data. You can use this option to systematically over-sample rare events, for example, in predictive classification projects.
When you select Stratified random sampling and click the More options button, the Stratified sampling dialog box is displayed. You can select one or more stratification variables and specify either sampling percentages or approximate numbers of cases for each stratum. You can also request a constant sampling rate for all strata and additional subsetting of variables and cases.
- Simple random sampling: This method creates a new output data spreadsheet as a random sample of the input data.
When you select Simple random sampling and click the More options button, the Simple random sampling dialog box is displayed, where you can choose to use either a specified percentage of the cases or an approximate constant number of cases. You can also set the seed for sampling.
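
The three sampling methods above can be illustrated with a minimal Python sketch. It is not the product's implementation: the function names, the random starting offset for systematic sampling, and the rounding of per-stratum sample sizes are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def systematic_sample(cases, k, seed=None):
    """Keep every k-th case, starting at a random offset in [0, k) (assumed)."""
    if k < 1:
        raise ValueError("k must be at least 1")
    start = random.Random(seed).randrange(k)
    return cases[start::k]


def simple_random_sample(cases, percent, seed=None):
    """Draw the given percentage of cases without replacement."""
    rng = random.Random(seed)
    n = round(len(cases) * percent / 100)
    return rng.sample(cases, n)


def stratified_sample(cases, stratum_of, percent_by_stratum, seed=None):
    """Sample each stratum at its own rate; unlisted strata are skipped (assumed)."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for case in cases:
        groups[stratum_of(case)].append(case)
    sample = []
    for stratum, members in groups.items():
        percent = percent_by_stratum.get(stratum, 0)
        n = round(len(members) * percent / 100)
        sample.extend(rng.sample(members, n))
    return sample
```

For example, `stratified_sample(cases, lambda c: c[0], {"rare": 100, "common": 10})` keeps every case in the hypothetical "rare" stratum and about 10% of the "common" one, which is the over-sampling pattern described for rare events.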
|
Remove duplicate records (cases)
|
Select this check box to detect and remove duplicate records during the run and validation process.
Select the Remove duplicate records (cases) check box and click the Duplicate records (cases) button to display the Select variables to define duplicate records dialog box. Use this single variable selection dialog box to select any number of variables that define which records count as duplicates.
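
As a rough illustration of de-duplication on selected variables, the following Python sketch treats two cases as duplicates when they match on every key variable. Keeping the first occurrence is an assumption; the function name and the dictionary-per-case layout are hypothetical.

```python
def remove_duplicates(cases, key_variables):
    """Keep the first case for each distinct combination of key values."""
    seen = set()
    unique = []
    for case in cases:
        key = tuple(case[v] for v in key_variables)
        if key not in seen:
            seen.add(key)
            unique.append(case)
    return unique
```

Calling `remove_duplicates(cases, ["id"])` would compare cases on a hypothetical `id` variable only, while passing several variable names requires a match on all of them.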
|
Valid data range
|
Use this option to specify a minimum and maximum value for each of the selected variables.
Cases with values outside the specified range are treated as invalid data. Select the Valid data range check box and click the Valid data range button to display the Missing data and Invalid Case Definition dialog box.
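
The range check can be sketched in Python as follows. Replacing an out-of-range value with `None` is only one possible reading of "treated as invalid data" and is an assumption here, as are the function name and the data layout.

```python
def flag_out_of_range(cases, valid_ranges):
    """Return copies of the cases with out-of-range values set to None.

    valid_ranges maps a variable name to an inclusive (minimum, maximum) pair.
    """
    cleaned = []
    for case in cases:
        row = dict(case)  # copy so the input data is left untouched
        for variable, (low, high) in valid_ranges.items():
            value = row.get(variable)
            if value is not None and not (low <= value <= high):
                row[variable] = None
        cleaned.append(row)
    return cleaned
```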
|
Remove outlier
|
Use this option to select the variables for outlier analysis and to specify how outliers are treated once they are detected.
The Set to boundary option (located in the Outlier and Extreme Value dialog box) iteratively recodes outliers and extreme values to the +/-3 standard deviation limits. Select the Remove outlier check box and click the Outlier button to display the Outlier and Extreme Value dialog box.
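
One way to picture the iterative Set to boundary recoding is the Python sketch below. It recomputes the mean and standard deviation on each pass and clips values to the resulting limits until nothing changes by more than a small tolerance. The stopping rule, the use of the sample standard deviation, and the function name are assumptions; the product's exact procedure may differ.

```python
import statistics


def set_to_boundary(values, n_sd=3.0, tol=1e-9, max_passes=200):
    """Repeatedly clip values to mean +/- n_sd standard deviations."""
    data = [float(v) for v in values]
    for _ in range(max_passes):
        mean = statistics.fmean(data)
        sd = statistics.stdev(data)  # sample standard deviation (assumed)
        low, high = mean - n_sd * sd, mean + n_sd * sd
        clipped = [min(max(v, low), high) for v in data]
        largest_change = max(abs(a - b) for a, b in zip(clipped, data))
        data = clipped
        if largest_change <= tol:
            break
    return data
```

Repeating the pass matters because a single extreme value inflates the standard deviation, so the limits tighten once that value has been pulled in.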
|
Missing data
|
You can specify the algorithm for handling missing data, such as mean substitution or casewise deletion, for each variable in the analysis.
Select the Missing data check box and click the Missing data definition button to display the Missing data definition dialog box.
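
The two methods named above can be sketched in Python as follows, assuming missing values are represented as `None` and each case is a dictionary. The function names are hypothetical.

```python
import statistics


def casewise_deletion(cases, variables):
    """Drop every case that is missing a value on any listed variable."""
    return [c for c in cases if all(c.get(v) is not None for v in variables)]


def mean_substitution(cases, variable):
    """Replace missing values of one variable with the mean of its observed values."""
    observed = [c[variable] for c in cases if c.get(variable) is not None]
    if not observed:
        raise ValueError(f"no observed values for {variable!r}")
    fill = statistics.fmean(observed)
    return [
        {**c, variable: fill if c.get(variable) is None else c[variable]}
        for c in cases
    ]
```

Casewise deletion shrinks the data set, whereas mean substitution keeps every case but reduces the variance of the filled variable.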
|