Data Preparation - Advanced Tab

For the Data Preparation step, the following options are available on the Advanced tab.

Element Name Description
Use Sample data: Use the options in this group box to create a new output data spreadsheet that is a sample of the input data set. Select the sampling method, and then click the More options button to view sampling options specific to the selected method.
  • Systematic random sampling: This method computes and creates a new output data spreadsheet consisting of all selected variables and a random subset of the cases.

    When you select Systematic random sampling and click the More options button, the Systematic random sampling dialog box is displayed, where you can specify the K-value (the fixed interval between cases selected from the original data set).

  • Stratified random sampling: This method computes and creates a new output data spreadsheet as a stratified random sample of the input data. You can use this option to systematically over-sample rare events, for example, for predictive classification projects.

    When you select Stratified random sampling and click the More options button, the Stratified sampling dialog box is displayed. You can select one or more stratification variables and specify either sampling percentages or approximate numbers of cases for each stratum. You can also request a constant sampling rate for all strata and additional subsetting of variables and cases.

  • Simple random sampling: This method computes and creates a new output data spreadsheet as a random sample of the input data.

    When you select Simple random sampling and click the More options button, the Simple random sampling dialog box is displayed, where you can choose to either use a specified percentage of the cases or an approximate constant number of cases. You can also set the seed for sampling.
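
The three sampling methods above can be sketched in a few lines of Python. This is only an illustration of the general techniques, not the product's implementation: the function names, the list-of-dictionaries data layout, and the `fraud` variable are hypothetical, and `k` stands in for the K-value.

```python
import random
from collections import defaultdict

def systematic_sample(rows, k, seed=None):
    """Keep every k-th row, starting from a random offset in [0, k)."""
    if k < 1:
        raise ValueError("k must be a positive integer")
    start = random.Random(seed).randrange(k)
    return rows[start::k]

def simple_random_sample(rows, percent, seed=None):
    """Draw about `percent` percent of the rows without replacement."""
    n = round(len(rows) * percent / 100)
    return random.Random(seed).sample(rows, n)

def stratified_sample(rows, key, percent_by_stratum, seed=None):
    """Sample each stratum (rows sharing a `key` value) at its own rate."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for row in rows:
        strata[row[key]].append(row)
    sample = []
    for value, members in strata.items():
        n = round(len(members) * percent_by_stratum[value] / 100)
        sample.extend(rng.sample(members, n))
    return sample

# Hypothetical input: 100 cases, 10 of which are the rare "yes" class.
rows = [{"id": i, "fraud": "yes" if i % 10 == 0 else "no"} for i in range(100)]

every_fifth = systematic_sample(rows, k=5, seed=1)
ten_percent = simple_random_sample(rows, percent=10, seed=1)
# Over-sample the rare class: keep all "yes" cases but only 10% of "no" cases.
oversampled = stratified_sample(rows, "fraud", {"yes": 100, "no": 10}, seed=1)
```

Setting the seed makes a sample reproducible, which is useful when the same subset has to be drawn again in a later run.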

Remove duplicate records (cases): Select this check box to detect and remove duplicate records during the run and validation process.

Select the Remove duplicate records (cases) check box and click the Duplicate records (cases) button to display the Select variables to define duplicate records dialog box. Use the single variable selection dialog box to select any number of variables that define which records count as duplicates.
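
As a rough illustration of de-duplicating on a chosen set of variables, the sketch below treats two records as duplicates when they match on every selected variable. Keeping the first occurrence is an assumption made for this example, and the function and variable names are hypothetical.

```python
def remove_duplicates(rows, key_vars):
    """Keep the first record for each distinct combination of key_vars."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row[v] for v in key_vars)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique

rows = [
    {"customer": "A", "date": "2024-01-05", "amount": 10},
    {"customer": "A", "date": "2024-01-05", "amount": 12},  # same customer and date
    {"customer": "B", "date": "2024-01-05", "amount": 10},
]

# Only customer and date define a duplicate here; amount is ignored.
deduped = remove_duplicates(rows, ["customer", "date"])
```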

Valid data range: Use the options in this dialog box to specify a minimum and maximum value for each of the selected variables.

Cases with values outside the specified range are treated as invalid data. Select the Valid data range check box and click the Valid data range button to display the Missing data and Invalid Case Definition dialog box.
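
A minimal sketch of a valid-range check follows. It assumes that the bounds are inclusive and that an out-of-range value is marked as missing (`None`); both are choices made for this example, and the names are hypothetical.

```python
def apply_valid_range(rows, ranges):
    """Replace values outside [minimum, maximum] with None (treated as invalid)."""
    cleaned = []
    for row in rows:
        new_row = dict(row)  # leave the input rows untouched
        for var, (minimum, maximum) in ranges.items():
            value = new_row.get(var)
            if value is not None and not (minimum <= value <= maximum):
                new_row[var] = None
        cleaned.append(new_row)
    return cleaned

rows = [{"age": 34, "temp": 36.6}, {"age": -2, "temp": 37.1}, {"age": 51, "temp": 98.6}]
checked = apply_valid_range(rows, {"age": (0, 120), "temp": (30.0, 45.0)})
```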

Remove outlier: Use the options in this dialog box to select the variables for outlier analysis and specify how to treat outliers once they are detected.

The Set to boundary option (located in the Outlier and Extreme Value dialog box) iteratively recodes outliers and extreme values to +/-3 standard deviation limits. Select the Remove outlier check box and click the Outlier button to display the Outlier and Extreme Value dialog box.
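
One way to picture iterative recoding to +/-3 standard deviation limits is the sketch below. It is an assumption about the general approach, not the product's algorithm: the limits are recomputed from the recoded data on each pass, and the loop stops when no value moves by more than a small tolerance (`tol`) or after `max_iter` passes. Both stopping parameters are hypothetical.

```python
import statistics

def set_to_boundary(values, n_sd=3.0, max_iter=100, tol=1e-9):
    """Repeatedly clip values to mean +/- n_sd standard deviations."""
    data = list(values)
    for _ in range(max_iter):
        mean = statistics.fmean(data)
        sd = statistics.stdev(data)
        low, high = mean - n_sd * sd, mean + n_sd * sd
        clipped = [min(max(x, low), high) for x in data]
        change = max(abs(a - b) for a, b in zip(clipped, data))
        data = clipped
        if change <= tol:  # limits have stabilised
            break
    return data

values = [10, 11, 9, 10, 12, 8, 10, 11, 9, 10,
          10, 11, 9, 10, 12, 8, 10, 11, 9, 1000]
recoded = set_to_boundary(values)
```

A single pass is not enough here because the extreme value inflates the standard deviation it is compared against; after each recode the limits tighten, so the value is pulled in further on the next pass.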

Missing data: You can specify the algorithm used to handle missing data, including mean substitution and casewise deletion, for each variable in the analysis.

Select the Missing data check box and click the Missing data definition button to display the Missing data definition dialog box.
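
The two methods named above can be sketched as follows. This is a generic illustration in which a missing value is represented by `None`; the function and variable names are hypothetical.

```python
import statistics

def casewise_deletion(rows, variables):
    """Drop every case that has a missing value in any listed variable."""
    return [r for r in rows if all(r.get(v) is not None for v in variables)]

def mean_substitution(rows, variables):
    """Replace each missing value with the mean of that variable's observed values."""
    means = {}
    for v in variables:
        observed = [r[v] for r in rows if r.get(v) is not None]
        means[v] = statistics.fmean(observed)
    return [
        {**r, **{v: means[v] for v in variables if r.get(v) is None}}
        for r in rows
    ]

rows = [
    {"age": 30, "income": 50.0},
    {"age": None, "income": 70.0},
    {"age": 50, "income": None},
]

complete = casewise_deletion(rows, ["age", "income"])
filled = mean_substitution(rows, ["age", "income"])
```

Casewise deletion keeps the remaining data untouched but can discard many cases, while mean substitution keeps every case at the cost of reducing the variable's variance.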