Generalized Feature Selection Analysis - Advanced Tab

Select the Advanced tab of the Generalized Feature Selection Analysis dialog box to access options for selecting the results statistics and parameters for the analyses. See also the Feature Selection and Variable Screening Introductory Overview for additional details.

Also, on this tab you can determine the degree or level of interactions that you want to evaluate (screen) during the analyses. By default, the program will screen for the two-way interactions along with the main effects. Be careful when increasing this value, i.e., before requesting 3-way, 4-way, and higher-order interactions. Even with a modest number of predictors, requesting all possible interactions (e.g., up to 10-way interactions for 100 predictors) quickly creates predictor lists of "astronomical" size.
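To get a feel for how quickly the list of screened terms grows, the following minimal Python sketch counts main effects plus interactions up to a given order. It assumes that each interaction is formed from distinct predictors and counts as one term; the program's actual bookkeeping (for example, for categorical predictors) may differ, and the function name count_screened_terms is purely illustrative.

```python
from math import comb

def count_screened_terms(n_predictors: int, max_order: int) -> int:
    """Count main effects (order 1) plus all interactions up to max_order.

    Assumption: one term per combination of distinct predictors.
    """
    return sum(comb(n_predictors, order) for order in range(1, max_order + 1))

# Growth of the term list for 100 predictors as the interaction level rises.
for order in (1, 2, 3, 4):
    print(order, count_screened_terms(100, order))
```

Under these assumptions, 100 predictors yield 5,050 terms at the default level of 2, and more than four million terms at level 4.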

Reports

In this group box, select the respective check box for the results that you want to compute. In general, the program will screen the predictors one by one and sort the list of predictors in terms of the strength of their (linear or nonlinear) relationships with the dependent or outcome variable(s). Separate analyses will be performed and results reported for each dependent variable.

Summary of best k predictors

Select this check box to compute results spreadsheets for the predictors, for each dependent variable. The predictors will be listed in order according to the strength of their relationships with the respective dependent variables. "Strength of relationship" is expressed either as the magnitude of the F or chi-square statistic computed for each predictor (for regression and classification tasks, respectively), or as the statistical significance p of those statistics (see also the documentation for the Feature Selection and Variable Screening module). Select the desired criterion in the Criterion for selecting predictors box on the Quick tab.

Report of best k predictors

Select this check box to create a report with the variable numbers of the best k predictors. The best predictors are selected from among the most important main effects and interaction effects that were found. For example, if the interaction between variables 3 and 8 was found to be among the strongest predictor effects, then both variables 3 and 8 would be included in the reported list of best k predictors. You can copy the list of variable numbers and paste it directly into other variable selection dialogs, for example, to select variables for subsequent analyses, graphs, etc.
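The way interaction effects contribute their component variables to the report can be sketched as follows. This is an illustration only: it assumes that k counts the top-ranked effects (not the resulting variables), which the description above does not state explicitly, and the function name variables_for_best_effects and the ranked list are hypothetical.

```python
def variables_for_best_effects(ranked_effects, k):
    """Collect the variable numbers involved in the k strongest effects.

    ranked_effects: list of tuples of variable numbers, strongest first;
    a main effect is a 1-tuple, an interaction a longer tuple.
    Assumption: k counts effects, not variables.
    """
    selected = set()
    for effect in ranked_effects[:k]:
        selected.update(effect)  # an interaction contributes all of its variables
    return sorted(selected)

# Hypothetical ranking: the 3-by-8 interaction is the strongest effect.
ranked = [(3, 8), (5,), (3,), (2, 5)]
print(variables_for_best_effects(ranked, 3))  # [3, 5, 8]
```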

Histogram of importance for best k predictors

Select this check box to display histograms of the best or most important predictors for each dependent variable, using the criterion of importance specified in the Criterion for selecting predictors box on the Quick tab. Note that the value of importance in this histogram (plotted along the vertical y axis) will always be the respective F or chi-square values, regardless of the chosen Criterion for selecting predictors.

Design terms

Select this check box to display the design terms that were created for the variable screening analysis. For example, if two-way interactions were requested, the program will create the products of all pairs of continuous predictors and evaluate their relationships to the dependent variable(s) in the analysis. In general, depending on the type of predictor (continuous or categorical), the terms that will be created in the design matrix are:

Continuous-by-continuous predictor interactions. STATISTICA will create a single column in the design matrix for each product of the continuous predictor columns.

Continuous-by-categorical predictor interactions. STATISTICA will first determine the number of unique values (classes) in the categorical predictor, and then generate one column for each unique value. For each column j of the k columns (unique values), the program will generate a 1 if the respective observation belongs to class j, and a 0 otherwise; each column of 0/1 indicator codes will then be multiplied by the continuous predictor variable. Hence, for continuous-by-categorical predictor interactions, the program will generate as many columns in the design matrix as there are unique values in the categorical predictor.

Categorical-by-categorical predictor interactions. STATISTICA will enumerate the unique combinations of groups or classes into a single column in the design matrix; for example, the interaction between two categorical predictors with two unique values (classes) each would result in a single column with (2*2 =) 4 values. As described in the Feature Selection and Variable Screening Introductory Overview, the program will attempt to find the combination of groups or classes that best predicts the dependent variable; hence, no information is lost by combining the classes found in the different categorical predictors. However, note that these coded columns in the design matrix are technically "confounded" with the main effects. In other words, if one of the categorical predictors is strongly related to the dependent variable in the analysis, then it is likely that some of its interactions with other categorical predictors will show strong relationships with the dependent variable as well.

The spreadsheet summarizing the terms in the coded design matrix will contain detailed information about the specific assignment of columns to the various effects.
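The three coding rules described above can be sketched in plain Python as shown below. This is a simplified illustration, not the program's implementation: the function names are hypothetical, classes are ordered by sorting, and only class combinations that actually occur in the data are enumerated.

```python
def cont_by_cont(x1, x2):
    """One design column: the product of two continuous predictors."""
    return [[a * b] for a, b in zip(x1, x2)]

def cont_by_cat(x, c):
    """One column per class of c: 0/1 indicator times the continuous value."""
    classes = sorted(set(c))
    return [[xi if ci == cls else 0.0 for cls in classes]
            for xi, ci in zip(x, c)]

def cat_by_cat(c1, c2):
    """A single column whose values enumerate the observed class combinations."""
    combos = sorted(set(zip(c1, c2)))
    code = {combo: i for i, combo in enumerate(combos)}
    return [[code[(a, b)]] for a, b in zip(c1, c2)]

# Made-up data: one continuous predictor and two 2-class categorical predictors.
x = [1.0, 2.0, 3.0, 4.0]
g = ["a", "b", "a", "b"]
h = ["u", "u", "v", "v"]

print(cont_by_cont(x, x))  # [[1.0], [4.0], [9.0], [16.0]]
print(cont_by_cat(x, g))   # [[1.0, 0.0], [0.0, 2.0], [3.0, 0.0], [0.0, 4.0]]
print(cat_by_cat(g, h))    # [[0], [2], [1], [3]]
```

In the last call, the two 2-class predictors produce a single column with four distinct codes, matching the (2*2 =) 4 values mentioned above.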

Spreadsheet with interactions

Select this check box to create a spreadsheet containing the design matrix and the dependent variable values. This spreadsheet is useful if you want to review plots (e.g., scatterplots) for particular coded interactions to see how they are related to the dependent variable.

Feature Selection with Interactions

In general, the results that can be computed via the options on this dialog are very similar to those of the Feature Selection and Variable Screening module. The main generalization implemented here is that you can also screen the interactions (e.g., products) of predictor variables. See also the Generalized Feature Selection Analysis dialog box topic. In the Interaction Level field, specify the degree of interaction you want to consider, e.g., enter 2 to consider all 2-way interactions among predictors, 3 to consider all 3-way interactions, etc.

Number of cuts for continuous predictors

In this field, enter a value to specify the "coarseness of the grid" applied to the continuous predictors. As described in Feature Selection and Variable Screening Computational Details, STATISTICA will divide the range of values for each continuous predictor into k intervals (and the combinations of the k intervals), and compute statistics based on the means or frequencies in those intervals for regression or classification-type problems, respectively. Therefore, if you specify k=2 in this edit field, the range of values for each continuous predictor variable will be split into 2 categories, and the variable screening will only detect simple monotone (e.g., linear) relationships to the dependent variables. If you specify k=3, simple non-monotone (e.g., quadratic) relationships will be picked up as well. The default value (10) is well suited to screening for practically all types of monotone or complex non-monotone relationships, and hence will not bias the selection of variables in favor of any particular subsequent analysis that may be applied for predictive data mining.
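The effect of the number of cuts can be illustrated with a small sketch for a regression-type problem. It assumes equal-width intervals and an ordinary one-way ANOVA F statistic computed across the intervals; the exact interval boundaries and statistics used by the program are described in the Computational Details and may differ. The function names bin_index and binned_f_statistic are hypothetical.

```python
def bin_index(value, lo, hi, k):
    """Index (0..k-1) of the equal-width interval that contains value."""
    if hi == lo:
        return 0
    return min(int((value - lo) * k / (hi - lo)), k - 1)

def binned_f_statistic(x, y, k):
    """One-way ANOVA F for y across k equal-width intervals of x.

    Assumes at least two non-empty intervals and some within-interval
    variation in y; otherwise the ratio is undefined.
    """
    lo, hi = min(x), max(x)
    groups = {}
    for xi, yi in zip(x, y):
        groups.setdefault(bin_index(xi, lo, hi, k), []).append(yi)
    n, g = len(y), len(groups)
    grand_mean = sum(y) / n
    means = {j: sum(v) / len(v) for j, v in groups.items()}
    ss_between = sum(len(v) * (means[j] - grand_mean) ** 2
                     for j, v in groups.items())
    ss_within = sum((yi - means[j]) ** 2
                    for j, v in groups.items() for yi in v)
    return (ss_between / (g - 1)) / (ss_within / (n - g))

# A purely quadratic (non-monotone) relationship on made-up data.
x = list(range(-10, 11))
y = [xi ** 2 for xi in x]

f2 = binned_f_statistic(x, y, 2)  # two intervals: the lower and upper halves look alike
f3 = binned_f_statistic(x, y, 3)  # three intervals: the middle interval stands out
print(round(f2, 3), round(f3, 3))
```

With these data, the F value for k=2 is close to zero, while the F value for k=3 is substantially larger, which mirrors the statement that at least three intervals are needed to pick up a quadratic relationship.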
