Data Preparation Tab

In the Data preparation step, use the options on the Data preparation tab to open or connect to a data file for the analysis from a local machine or from a URL server. You can also use this tab to apply data transformations, identify and review the variables for the analysis, select a labeling variable that identifies specific cases by name, remove duplicate cases from the data, and specify the use of a sample data set.

Element Name Description
Open/Connect data file Displays the Select Data Source dialog box, where you select the data file for the analysis.

Data Miner Recipes data files are saved in the standard Statistica format with the extension *.sta.

If you have completed subsequent steps in this project, opening a different data file invalidates those steps. Therefore, if you click the Open/Connect data file button while a project is in progress, you are asked whether you want to delete the subsequent steps of the project (if any). Click No to cancel or Yes to proceed; if you proceed, a Save As dialog box is displayed, prompting you to save the current project.
Apply data transformations Displays the Batch Transformation Formulas dialog box, which contains options to supplement the data transformation formulas built into the Statistica spreadsheet.

You can enter several transformation formulas into a text editor and evaluate them in sequence, one by one. Any transformation you choose is also applied to new data during deployment.
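The idea of evaluating formulas in sequence and reusing them at deployment can be illustrated outside Statistica. The following is a minimal Python sketch, not Statistica's formula syntax or API: the variable names (`income`, `log_income`, `log_income_sq`) and the helper `apply_transformations` are hypothetical. It shows that a later formula can use a variable created by an earlier one, and that the same ordered list is applied unchanged to new data.

```python
import math

# Hypothetical transformation list: (new variable name, formula over a row).
# Order matters: a later formula may use a variable created by an earlier one.
TRANSFORMATIONS = [
    ("log_income", lambda row: math.log(row["income"])),
    ("log_income_sq", lambda row: row["log_income"] ** 2),
]


def apply_transformations(rows, transformations=TRANSFORMATIONS):
    """Return copies of the rows with each formula evaluated in sequence."""
    result = []
    for row in rows:
        new_row = dict(row)  # copy so the input rows are left untouched
        for name, formula in transformations:
            new_row[name] = formula(new_row)
        result.append(new_row)
    return result


# The same ordered list is applied to the analysis data and to new data.
training = apply_transformations([{"income": 1.0}, {"income": math.e}])
new_data = apply_transformations([{"income": math.e ** 2}])
```

Keeping the formulas in one ordered list is what makes it possible to replay them on new cases exactly as they were applied to the original data.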

Select variables Displays a five-list variable selection dialog box, which is used to choose variables for the analysis.

You can select continuous and categorical targets, continuous and categorical predictors, and a validation sample variable.

Select label(s) Displays a single-list variable selection dialog box, which is used to select one or more labeling variables that identify specific cases in the data set for use in the analysis.

Note that case labels (IDs) must be unique. If the selected variable contains duplicate labels (for example, two or more cases with the same label (ID)), or if the case names contain duplicate labels, the Next step fails and you are prompted to review the data set for duplicate names before continuing to the next step.
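If you want to check for duplicate labels before loading the data, a short script can list them. This is a minimal Python sketch under the assumption that the labels have been exported to a plain list; the function name `find_duplicate_labels` and the sample labels are hypothetical and are not part of Statistica.

```python
from collections import Counter


def find_duplicate_labels(labels):
    """Return the labels that occur more than once, in sorted order."""
    counts = Counter(labels)
    return sorted(label for label, n in counts.items() if n > 1)


# Hypothetical case labels; "A002" appears twice.
labels = ["A001", "A002", "A003", "A002"]
duplicates = find_duplicate_labels(labels)
print(duplicates)
```

An empty result means every label is unique, which is the condition described above.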

Use sample dataset You can select this check box to extract a random sample from the original data set and use that sample as the data for the analysis.

By default, this check box is selected for large data files and cleared for smaller data sets. The default sampling method is Systematic random sampling with K = 1. You can specify other sampling methods on the Advanced tab.
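For readers unfamiliar with the term, the textbook form of systematic random sampling picks a random starting case and then takes every k-th case after it. The Python sketch below illustrates that generic definition only; how Statistica interprets its K parameter is not described here, and the function name `systematic_sample` and its `seed` argument are hypothetical.

```python
import random


def systematic_sample(cases, k, seed=None):
    """Take every k-th case, starting from a random offset in [0, k)."""
    if k < 1:
        raise ValueError("k must be a positive integer")
    rng = random.Random(seed)
    start = rng.randrange(k)  # random starting position within the first k cases
    return cases[start::k]


cases = list(range(100))
sample = systematic_sample(cases, k=10, seed=0)
```

Under this generic definition, a step of 1 keeps every case, so any reduction in data size comes from choosing a larger step.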

Remove duplicate record(s) You can select this check box to remove duplicate records from the data set.

Options for defining duplicate records are available on the Advanced tab.
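What counts as a duplicate depends on which variables are compared. The Python sketch below shows one possible definition, assumed for illustration: two records are duplicates when they agree on a chosen set of key variables, and the first occurrence is kept. The function `remove_duplicates`, its `key_vars` parameter, and the sample records are hypothetical and do not reflect the options on the Advanced tab.

```python
def remove_duplicates(rows, key_vars=None):
    """Keep the first record for each distinct combination of key variables.

    If key_vars is None, all variables in a record are compared.
    """
    seen = set()
    unique = []
    for row in rows:
        names = key_vars if key_vars is not None else sorted(row)
        key = tuple(row[name] for name in names)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


# Hypothetical records: the first two agree on every variable.
records = [
    {"id": 1, "age": 34},
    {"id": 1, "age": 34},
    {"id": 2, "age": 34},
]
exact = remove_duplicates(records)                     # compare all variables
by_age = remove_duplicates(records, key_vars=["age"])  # compare age only
```

Comparing all variables removes only exact copies, while comparing a subset is stricter and can discard records that differ elsewhere.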

Variables Displays each variable's type (continuous or categorical) and role (input, output, or validation sample).

Changes to a variable's type or role are made by clicking the Select variables button.