Select Dependent Variables and Predictors - Quick Tab
Option | Description |
---|---|
Variables | Displays a standard variable selection dialog box. With Statistica Data Miner, you can designate your major variables for analysis according to the following two criteria. |
Dependent (Output, Criterion) vs. Independent (Input, Predictor) Variables | For predictive data mining, variables can be designated as independent (input, predictor) variables or as dependent (output, criterion) variables. For example, if you are trying to predict the price of a real estate property from its square footage, the price is the dependent variable (the one you are trying to predict), and the square footage is the independent or predictor variable. Note that this distinction is not meaningful in exploratory data analysis (EDA), where the goal is data exploration rather than prediction. Most analytic nodes for exploratory data analysis (for example, the Descriptive Statistics node) process both dependent and independent variables and apply the requested types of analyses to them. |
Continuous and categorical variables | Continuous variables are those that are measured on some scale (for example, height and weight); categorical variables typically contain information about the group, class, or code to which each observation belongs (for example, Gender with classes Male and Female). The distinction between continuous and categorical variables determines which types of analyses are applicable to each; for example, cross-tabulation can yield a meaningful summary for categorical variables, while descriptive statistics such as the mean and standard deviation are only meaningful for continuous variables. Most analytic nodes of Statistica Data Miner automatically apply the proper methods and statistical procedures to continuous and categorical variables. In general, Statistica Data Miner expects categorical variables to contain integer codes (for example, 1, 2) that identify the group or class to which each observation belongs. Because of the manner in which Statistica generally handles text values, categorical variables may also contain text values (for example, Male, Female). |
Using text variables or text values in data miner projects | When using text variables or variables with text values in data miner projects with deployment, or projects that compare predicted classifications from different nodes using Goodness of Fit nodes, you should be careful to review the results to ensure that the coding of categorical variables (with text values) is consistent across nodes. Generally, the program automatically ensures that identical coding is used across all nodes in the same data miner project; however, when using numeric variables with text labels as input data sources marked for deployment, or in some special data analysis scenarios, it is important that you understand how such values are handled in Statistica Data Miner. |
Codes (for categorical dependent and predictor variables) | Displays the standard Codes selection dialog box, where you can select the specific groups or codes you want to consider in your analyses. Selecting specific codes is optional; Statistica usually picks up all (integer) codes found in the data file for categorical variables automatically. Because of the manner in which Statistica generally handles text values, categorical variables might also contain text values (for example, Male, Female). Note that some nodes of Statistica Data Miner (for example, the Neural Networks nodes) do not allow you to select specific codes but instead automatically process and use all codes (groups, classes) found in the data. Those nodes usually issue a warning message if specific codes are specified; in that case, you can use the case selection conditions to select only those groups and observations that you want to include in the particular analyses. |
Select Cases | Displays the Spreadsheet Case Selection Conditions dialog box, which contains options to specify case selection conditions for the current data source. Case selection conditions are not available for data sources that reside on remote servers and are connected for in-place database processing. Case selection conditions in Statistica Data Miner are a property of the input data source, not of the data file. In other words, if case selection conditions are specified for a data file, those conditions are generally ignored or overridden by the case selection conditions specified using this option. |
W | Displays the Spreadsheet Case Weights dialog box, which contains options to specify case weights for the current data source. Case weights are not available for data sources that reside on remote servers and are connected for in-place database processing. Case weights in Statistica Data Miner are a property of the input data source, not of the data file. In other words, if case weights are specified for a data file, those weights are generally ignored or overridden by the case weights specified using this option. |
Always use these selections, overriding any selections the generating node may make | This option is only available for generated data sources, that is, those that are produced as the result of an analysis by an analytic node or a data filtering and cleaning node. Select this check box to make the selections on this dialog box permanent, so that they do not depend on the default variable selections generated by the node that created the data source. A typical operation in data mining is to specify some transformations or data cleaning operations, and to connect the filtered or transformed data to subsequent analyses. When such a system of consecutive operations is updated, by default, the variable selections for each data source that is created by a node (for example, Out 1, Out 2, and Out 3 in the illustration) are automatically replaced or overwritten. This might be desirable when the nodes that create the data sources for further analyses automatically set the correct variables of interest. However, in other situations this may not be desirable. For example, if you want to perform a particular analysis on prediction residuals from a regression analysis, you might want to select those residuals (created by the Regression node) as a dependent variable of interest for subsequent analyses, and you do not want those specifications to be overwritten each time the data mining project is updated (recomputed). Select the Always use these selections, overriding any selections the generating node may make check box if you want to specify a fixed selection of variables for subsequent analyses. |
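
The distinction between continuous and categorical variables, and the advice to keep the coding of text values consistent, can be illustrated outside Statistica with a minimal Python sketch. It does not use any Statistica API and does not reproduce how Statistica Data Miner assigns codes internally. The data, the variable names, and the helper functions (`summarize_continuous`, `summarize_categorical`, `build_codes`) are hypothetical and exist only for illustration.

```python
from collections import Counter
from statistics import mean, stdev

# Hypothetical toy data; not taken from any Statistica example file.
cases = [
    {"Height": 170.0, "Gender": "Male"},
    {"Height": 160.0, "Gender": "Female"},
    {"Height": 180.0, "Gender": "Male"},
    {"Height": 165.0, "Gender": "Female"},
]


def summarize_continuous(rows, name):
    """Mean and sample standard deviation suit continuous variables."""
    values = [row[name] for row in rows]
    return {"mean": mean(values), "sd": stdev(values)}


def summarize_categorical(rows, name):
    """Frequency counts suit categorical variables."""
    return Counter(row[name] for row in rows)


def build_codes(rows, name):
    """Map text values to integer codes 1, 2, ... in sorted label order.

    Sorting makes the mapping independent of case order, so every step
    that reuses this mapping codes the same text value identically.
    This is an illustrative convention, not Statistica's own rule.
    """
    labels = sorted({row[name] for row in rows})
    return {label: code for code, label in enumerate(labels, start=1)}


height_summary = summarize_continuous(cases, "Height")
gender_counts = summarize_categorical(cases, "Gender")
gender_codes = build_codes(cases, "Gender")
coded_gender = [gender_codes[row["Gender"]] for row in cases]

print(height_summary)
print(dict(gender_counts))
print(gender_codes, coded_gender)
```

Building the code mapping once and reusing it everywhere is the point of the sketch: if two steps derived their own mappings from differently ordered data, the same text value could receive different integer codes, which is the kind of inconsistency the table advises you to check for.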