Feature and Method Selection Computational Details
Feature Selection with Interaction Effects
The STATISTICA Process Optimization module contains various powerful methods for screening large numbers of continuous and categorical predictor variables, as well as their interactions to a user-specified degree (see also the Introductory Overview and the descriptions of the options for the Generalized Feature Selection Analysis dialog). The basic algorithms for screening variables and effects are identical to those described in the documentation for the Feature Selection and Variable Screening module. These options are augmented so that the analyses are performed not only on the original variables, but also on their products (interactions).
Specifically, the program will generate a design matrix with main effects and interactions for the effects selected for the analyses. The basic Feature Selection and Variable Screening algorithm will then be applied to the columns of the design matrix rather than to the variables (data columns) of the input file.
The final list of variables that is extracted when you request the best predictors is assembled from all variables involved in the important ("strongest") effects. So, for example, if a two-way interaction between variables 7 and 9 was found to be strongly related to the dependent variable in the analyses, then both variables 7 and 9 would be reported as important predictors.
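The logic of assembling the final predictor list can be sketched as follows (illustrative Python, not STATISTICA code; the effect labels are hypothetical):

```python
# Hypothetical list of the strongest effects (illustration only, not STATISTICA code).
important_effects = ["Var7*Var9", "Var3", "Var2*Var7*Var11"]

# The final predictor list is the union of all variables involved in these effects.
selected = set()
for effect in important_effects:
    selected.update(effect.split("*"))   # every variable in the effect is kept

print(sorted(selected))   # ['Var11', 'Var2', 'Var3', 'Var7', 'Var9']
```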
- Design terms
- The columns of the design matrix (design terms) for interaction effects are created as follows:
Continuous-by-continuous predictor interactions. The program will create a single column in the design matrix for each product of the continuous predictor columns.
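For illustration, a minimal sketch of this construction (illustrative Python, not STATISTICA code; the predictor values are hypothetical):

```python
import numpy as np

# Hypothetical continuous predictors (illustration only, not STATISTICA code).
x1 = np.array([1.2, 0.7, 3.1, 2.4])
x2 = np.array([0.5, 1.9, 0.8, 1.1])

# The continuous-by-continuous interaction term is a single design-matrix
# column: the elementwise product of the two predictor columns.
interaction_column = x1 * x2
print(interaction_column)
```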
Continuous-by-categorical predictor interactions. The program will first determine the number of unique values (classes) in the categorical predictor, and then generate as many columns as there are unique values in that predictor; for each column j of the k columns (unique values), the program will generate a 1 if the respective observation belongs to class j, and a 0 otherwise; each column (with the 0/1 indicator codes) will then be multiplied by the continuous predictor variable. Hence, for continuous-by-categorical predictor interactions the program will generate as many columns in the design matrix as there are unique values in the categorical predictor.
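A minimal sketch of this coding, assuming a hypothetical categorical predictor with classes A, B, and C (illustrative Python, not STATISTICA code):

```python
import numpy as np

# Hypothetical predictors (illustration only, not STATISTICA code).
x_cont = np.array([1.2, 0.7, 3.1, 2.4, 0.9])
x_cat = np.array(["A", "B", "A", "C", "B"])

# One 0/1 indicator column per unique class, each multiplied by the
# continuous predictor; this yields k interaction columns for k classes.
classes = np.unique(x_cat)
columns = [(x_cat == c).astype(float) * x_cont for c in classes]
design_block = np.column_stack(columns)
print(design_block)   # 5 observations x 3 columns (classes A, B, C)
```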
Categorical-by-categorical predictor interactions. The program will enumerate the unique combinations of groups or classes into a single column in the design matrix; for example, the interaction between two categorical predictors with two unique values (classes) each would result in a single column with (2*2 =) 4 values. As described in the Feature Selection and Variable Screening Introductory Overview, the program will attempt to find the combination of groups or classes that best predicts the dependent variable; hence, no information is lost by combining the classes found in the different categorical predictors. However, note that these coded columns in the design matrix are technically "confounded" with the main effects. In other words, if one of the categorical predictors is strongly related to the dependent variable in the analysis, then it is likely that some of the interactions with other categorical predictors will show strong relationships with the dependent variable as well.
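The combination coding can be sketched as follows (illustrative Python, not STATISTICA code; the class labels are hypothetical):

```python
import numpy as np

# Hypothetical categorical predictors (illustration only, not STATISTICA code).
a = np.array(["A", "A", "B", "B", "A"])
b = np.array(["X", "Y", "X", "Y", "Y"])

# Enumerate the unique class combinations into a single coded column.
combined = np.char.add(np.char.add(a, ":"), b)            # "A:X", "A:Y", ...
codes, column = np.unique(combined, return_inverse=True)
print(codes)    # the (2*2 =) 4 possible combinations
print(column)   # single design-matrix column with the combination codes
```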
Higher-order interactions (e.g., three-way interactions) are created accordingly, i.e., they are generated as the products of continuous and categorical predictors following the rules outlined above. For example, a three-way interaction column would be generated by multiplying a two-way interaction with another effect.
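For example (illustrative Python, not STATISTICA code; values are hypothetical):

```python
import numpy as np

# Hypothetical continuous predictors (illustration only, not STATISTICA code).
x1 = np.array([1.2, 0.7, 3.1])
x2 = np.array([0.5, 1.9, 0.8])
x3 = np.array([2.0, 1.1, 0.6])

two_way = x1 * x2          # two-way interaction column
three_way = two_way * x3   # three-way column: product of a two-way interaction and a third effect
print(three_way)
```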
Feature and Method Selection
The options for feature and method selection enable you to screen both the predictor variables for regression and classification problems and the methods that can be used to identify the important predictors.
In general, the program will compute the predictor statistics for the respective method, and then rank the predictors based on the method-specific measure of predictor importance. The following methods are used:
- Predictor Importance
- The following methods can be used to evaluate the importance of the predictors in the analyses.
Linear model. By default, the program will fit a linear model using stepwise selection of predictors; for classification tasks, a stepwise linear discriminant function analysis is computed [see General Discriminant Analysis (GDA)]; for regression problems a stepwise linear regression is computed [see General Regression Models (GRM)]. For classification analyses, predictor importance is computed by ranking the values of the Wilks' lambda statistic for each predictor (see also the General Discriminant Analysis (GDA) Results dialog); for regression analyses, predictor importance is computed by ranking the p values for each predictor effect (for tied p values, the rankings are based on the ranking of the F values; see also the General Regression Models (GRM) Results dialog).
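The regression-case ranking rule can be sketched as follows (illustrative Python, not STATISTICA code; the effect names, p values, and F values are hypothetical):

```python
# Hypothetical effects as (name, p value, F value) tuples (illustration only).
effects = [
    ("Var1", 0.0004, 18.2),
    ("Var2", 0.0300,  5.1),
    ("Var3", 0.0004, 22.7),
]

# Rank by p value (ascending); break ties by F value (descending).
ranked = sorted(effects, key=lambda e: (e[1], -e[2]))
for rank, (name, p, f) in enumerate(ranked, start=1):
    print(rank, name, p, f)
```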
Classification and regression trees. For classification and regression trees, the standard rankings for predictor importance are used; see the Classification and Regression Trees Results dialog for details.
Boosted trees. For boosted trees models (stochastic gradient boosting), the standard rankings for predictor importance are used; see the Boosting Trees Results dialog for details.
MARSplines. For multivariate adaptive regression splines (MARSplines), the program will compute rankings based on the number of times each predictor was used (referenced) in a basis function. The more frequently a predictor is used (referenced by a basis function), the greater its importance. See also the Multivariate Adaptive Regression Splines (MARSplines) Results dialog for details.
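The counting rule can be sketched as follows (illustrative Python, not STATISTICA code; the basis functions listed are hypothetical):

```python
from collections import Counter

# Hypothetical basis functions, each listing the predictors it references
# (illustration only, not STATISTICA code).
basis_functions = [
    ["Var2"],
    ["Var2", "Var5"],
    ["Var7"],
    ["Var2", "Var7"],
]

# Count how often each predictor is referenced; more references = more important.
counts = Counter(v for bf in basis_functions for v in bf)
for name, n in counts.most_common():
    print(name, n)
```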
Neural networks. For neural networks, the program will determine the best 5 neural networks for the analysis problem (i.e., for the respective dependent variable and predictors). The final importance rankings for the predictors are then computed by averaging the importance rankings for each predictor over all 5 networks.
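The averaging of rankings can be sketched as follows (illustrative Python, not STATISTICA code; the per-network ranks are hypothetical):

```python
import numpy as np

# Hypothetical importance ranks assigned to three predictors by the 5 best
# networks (illustration only, not STATISTICA code).
predictors = np.array(["Var1", "Var2", "Var3"])
ranks_per_network = np.array([
    [1, 2, 3],   # network 1
    [1, 3, 2],   # network 2
    [2, 1, 3],   # network 3
    [1, 2, 3],   # network 4
    [2, 1, 3],   # network 5
])

# Average each predictor's rank over the 5 networks; a smaller mean rank
# indicates a more important predictor.
mean_ranks = ranks_per_network.mean(axis=0)
order = np.argsort(mean_ranks)
for name, r in zip(predictors[order], mean_ranks[order]):
    print(name, r)
```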
- Method rankings
- The method rankings for regression problems are computed based on the magnitudes of the Pearson correlation coefficients relating the predicted values to the observed values for all observations (cases); for classification problems, the rankings are computed from the overall misclassification rates for each model and for all observations (cases).
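Both criteria can be sketched as follows (illustrative Python, not STATISTICA code; the observed values, predicted values, and class labels are hypothetical):

```python
import numpy as np

# Hypothetical observed and predicted values for two methods
# (illustration only, not STATISTICA code).
observed = np.array([10.0, 12.5, 9.8, 14.2, 11.1])
predicted = {
    "Linear model":  np.array([10.4, 12.0, 10.1, 13.5, 11.6]),
    "Boosted trees": np.array([ 9.9, 12.6,  9.7, 14.0, 11.2]),
}

# Regression: rank methods by the Pearson correlation between predicted
# and observed values (higher is better).
correlations = {m: np.corrcoef(observed, p)[0, 1] for m, p in predicted.items()}
for method, r in sorted(correlations.items(), key=lambda kv: -kv[1]):
    print(method, round(r, 3))

# Classification: rank methods by the overall misclassification rate
# (lower is better), shown here for one hypothetical model.
true_class = np.array([0, 1, 1, 0, 1])
pred_class = np.array([0, 1, 0, 0, 1])
misclassification_rate = np.mean(true_class != pred_class)
print(misclassification_rate)   # 0.2
```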
- Types of methods; choosing a method
The automatic analyses performed by the STATISTICA Feature and Method Selection options usually use the respective default parameters for each analysis. Experience with these methods has generally shown that useful summaries regarding which predictors are important, and which methods work best, can usually be extracted very quickly. The methods used in this program are quite diverse, and different types of data problems can best be solved (i.e., models fit) using, for example, linear models, tree building techniques, more sophisticated extensions of those techniques (e.g., boosted trees), MARSplines, or neural networks.
As a general rule, always consider the simplest technique first: the linear model. In practice, and in particular in automated manufacturing contexts, this method unfortunately often does not work very well. Next look at tree methods; these techniques have the advantage that they often generate interpretable and relatively simple solutions. If the best model (e.g., highest Pearson correlation between predicted and observed values; lowest misclassification rate) is clearly obtained by more complex methods, then the relationships between the predictors and the dependent variable in the analyses are complex, interactive, and highly nonlinear.