Select Dependent Variables and Predictors
Option | Description |
---|---|
Variable selections for different types of analyses | Because the types and classes of analyses differ in nature, a typical analysis node does not use all of the variables and specifications selected in this dialog box. For example, a Censoring indicator variable can be specified on the Advanced tab of this dialog box, but it applies only to analyses involving censored observations (for example, Survival Analysis). Data sources and their descriptions (variable selections, etc.) in Statistica Data Miner are best thought of as data objects that can be freely moved around a data mining project, connected to and disconnected from various nodes, or dragged from one data miner workspace into another. |
Specification of variables for data sources created from analyses | A typical operation in data mining is to specify transformations or data cleaning operations and to connect the filtered or transformed data to subsequent analyses. When such a system of consecutive operations is updated, the variable selections for each data source created by a node (such as Out 1, Out 2, and Out 3 in the illustration) are, by default, automatically replaced (overwritten). This is desirable when the nodes that create the data sources for further analyses automatically set the correct variables of interest. In other situations it may not be. For example, if you want to perform a particular analysis on prediction residuals from a regression analysis, you might select those residuals (created by the Regression node) as the dependent variable of interest for subsequent analyses, and you would not want those specifications to be overwritten each time the data mining project is updated (recomputed). For data sources that are created from computations performed by a node, the Select dependent variables and predictors dialog box contains the option Always use these selections, overriding any selections the generating node may make. Set this option if you want to specify a fixed selection of variables for subsequent analyses. |
Data for deployment project; do not re-estimate models | This option applies only to analyses that can automatically generate deployed solutions, which can be applied to new data (for example, all analytic nodes in the Classification and Discrimination or Regression Modeling and Multivariate Exploration folder of the All Procedures Node Browser configuration). Typically, all nodes that support automatic deployment of models or solutions are named {Type of Method or Model} with Deployment. Analytic nodes that automatically generate information for deployment can either use the input data to fit or estimate the respective type of model (for example, perform a multiple regression analysis), or apply a previously estimated or fitted model to new data to compute predicted values or classifications (for example, apply a linear multiple regression equation to compute predicted values for new observations or measurements). For those analytic nodes, use this option to specify whether the respective data source is used for estimating or fitting the model(s), or for deployed projects or models only. |
Processing order for data marked for deployed projects | When multiple data sources are connected to the same data cleaning, filtering, or analytic node, the order in which that node evaluates them is generally not fixed (not predictable). However, data sources that are marked for deployed projects are always evaluated after all data sources that are not marked for deployed projects have been processed. This is an important feature, because it allows you to connect both training data and data sources marked for deployment to the same analytic nodes that automatically generate information for deployment; the program then estimates model parameters from the training data (not marked for deployment) and applies the model to the testing data (marked for deployed projects). |
Predicting new observations, when observed values are not (yet) available | One of the main purposes of predictive data mining is to allow for accurate prediction (or predicted classification) of new observations for which observed values or classifications are not available. An example of such an application is presented in Example 3 (see also Example 4). When connecting data for deployment (prediction or predicted classification) to the nodes for Classification and Discrimination or Regression Modeling and Multivariate Exploration, ensure that the structure of the input file for deployment is the same as that of the file used for building the models. Specifically, make sure that the same number and types of predictor variables are specified, that a (continuous or categorical) dependent variable is specified (even if all values for that variable are missing), and that the variable names match those in the data file used to build the models (this is particularly important for the deployment of neural networks, which rely on this information). |
OK | Accepts the selections you have made and exits this dialog box. |
Cancel | Exits the dialog box without making any selections. |
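
The processing-order rule and the structure requirements for deployment data described in the table can be sketched in Python. This is an illustration only, not Statistica's implementation or API: `DataSource`, `processing_order`, and `structure_problems` are hypothetical names, and the `"continuous"` / `"categorical"` strings are an assumed way of recording variable types.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class DataSource:
    """Hypothetical stand-in for a data source and its variable selections."""
    name: str
    predictors: Dict[str, str]       # variable name -> "continuous" or "categorical"
    dependent: Optional[str] = None  # name of the selected dependent variable
    for_deployment: bool = False     # marked as data for a deployed project


def processing_order(sources: List[DataSource]) -> List[DataSource]:
    """Place sources marked for deployment after all unmarked sources.

    sorted() is stable, so the order inside each group is simply kept;
    the text above makes no promise about that inner order.
    """
    return sorted(sources, key=lambda s: s.for_deployment)


def structure_problems(training: DataSource, deployment: DataSource) -> List[str]:
    """List ways the deployment source differs from the training source."""
    problems = []
    # A dependent variable must be selected even if all its values are missing.
    if deployment.dependent is None:
        problems.append("no dependent variable selected")
    elif deployment.dependent != training.dependent:
        problems.append("dependent variable name differs")
    # Predictor names must match those used to build the model.
    missing = sorted(set(training.predictors) - set(deployment.predictors))
    extra = sorted(set(deployment.predictors) - set(training.predictors))
    if missing:
        problems.append(f"missing predictors: {missing}")
    if extra:
        problems.append(f"unexpected predictors: {extra}")
    # Predictor types must match as well.
    for name in sorted(set(training.predictors) & set(deployment.predictors)):
        if training.predictors[name] != deployment.predictors[name]:
            problems.append(f"type mismatch for predictor '{name}'")
    return problems


train = DataSource("Training", {"age": "continuous", "region": "categorical"},
                   dependent="income")
new = DataSource("New cases", {"age": "continuous", "region": "continuous"},
                 dependent="income", for_deployment=True)

print([s.name for s in processing_order([new, train])])
print(structure_problems(train, new))
```

In this sketch the training source is always placed before the source marked for deployment, regardless of the order in which they were connected, and the type mismatch on `region` is reported before any model would be applied.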