Variable Selection (HD)
Identifies and prioritizes the variables of interest to a prediction task or model. This is especially helpful when there are a large number of potential variables for a model, enabling the modeler to focus on only a subset of those that show the strongest relevance.
Information at a Glance
|
Parameter |
Description |
|---|---|
| Category | Explore |
| Data source type | HD |
| Send output to other operators | Yes |
| Data processing tool | MapReduce |
Algorithm
For Hadoop, there are two choices: Information Gain and R2. The Hadoop-based information gain option calculates both information gain and information gain ratio.
Information Gain is a measure of the change in the entropy (or uncertainty) of a random variable Y when it is conditioned on another (categorical) variable X. In our case, Y is the class to be predicted (the dependent variable), and X is a candidate driver.
The entropy of a categorical random variable Y with n possible values (classes) is given by
The conditional entropy of Y given the values of a discrete variable X that takes on the m values is given by
The information gain about Y, given that we know X measures how much more we know about Y because we know X:
Score threshold by chance is a score we can get just by chance, even if X is not truly predictive of Y. We generate X that are designed to be independent of Y according to the distribution of Y and then calculate the score. We can generate a lower-bound threshold T. Any candidate feature that scores lower than T is almost certainly not predictive of the output variable, and can be eliminated. In practice, T is quite small, and probably does not eliminate too many variables. However, it still gives a useful sense of scores that correspond to meaningful and less-meaningful variables.
For numerical columns in Hadoop, we compute mutual information between dependent and independent variables without discretization. The Hadoop Variable Operator does not perform Minimum Description Length (MDL) discretization like the database version, because it is extremely expensive when dealing with big data.
For each independent variable X, R2 is the coefficient of determination for a simple linear regression between X and the dependent variable Y.
Input
A data set from the preceding operator.
Configuration
| Parameter | Description |
|---|---|
| Notes | Notes or helpful information about this operator's parameter settings. When you enter content in the Notes field, a yellow asterisk appears on the operator. |
| Dependent Column | Define the column to be used as the class Variable.
For R2 scoring, the dependent column must be numeric. |
| Score Type | Hadoop options:
|
| Columns | Use Select Columns to open the dialog for selecting the available columns from the input dataset for analysis. |
| Score Threshold | Indicates what a good value is for your score type. Variables above the score threshold are stored and passed to subsequent operators.
Default value: 0.5. |
| Always included columns | Columns in this list are automatically passed to subsequent operators, whether or not their score is above the threshold. |
| Do power transformations? | If
Yes, and if an independent variable is numerical, the Variable Selection operator determines the score for the original variable plus a series of power transforms.
The power transforms currently supported are:
If data in a column does not make sense for a particular transform, that transform's score is not determined. For example, if a column includes negative numbers, we do not determine the score for natural log for that variable. Power transforms are not performed on Always Included Columns, unless that column is also selected in Columns. Default value: No. |
| Variable Mask Directory | Directory where the variable mask is stored. This variable mask is a file that stores which columns (with what transforms) should be passed to subsequent operators. |
| Variable Mask Name | File name where the variable mask is stored. |
Output
| Status | Meaning |
|---|---|
| Approved | R2 or info gain value is above the set threshold for this variable. |
| Mandatory | R2 or info gain value is not above the set threshold, but this column is in the "always included" list. |
| Rejected | R2 or info gain value is not above the set threshold. |
| Selected | If checked, this variable is added to the variable mask. |
For R2

For info gain
