Variable Selection (HD)

Identifies and prioritizes the variables of interest to a prediction task or model. This is especially helpful when there are a large number of potential variables for a model, enabling the modeler to focus on only a subset of those that show the strongest relevance.

Variable Selection

Information at a Glance

Category	Explore
Data source type	HD
Sends output to other operators	Yes
Data processing tool	MapReduce

Note: The Variable Selection (HD) operator is for Hadoop data only. For database data, use the Variable Selection (DB) operator.

Algorithm

For Hadoop, there are two choices: Information Gain and R2. The Hadoop-based information gain option calculates both information gain and information gain ratio.

Information Gain

Information Gain is a measure of the change in the entropy (or uncertainty) of a random variable Y when it is conditioned on another (categorical) variable X. In our case, Y is the class to be predicted (the dependent variable), and X is a candidate driver.

The entropy of a categorical random variable Y with n possible values (classes) is given by

The conditional entropy of Y given the values of a discrete variable X that takes on the m values is given by

The information gain about Y, given that we know X measures how much more we know about Y because we know X:

Score Threshold by Chance

Score threshold by chance is a score we can get just by chance, even if X is not truly predictive of Y. We generate X that are designed to be independent of Y according the distribution of Yand then calculate the score. We can generate a lower-bound threshold T. Any candidate feature that scores lower than T is almost certainly not predictive of the output variable, and can be eliminated. In practice, T is quite small, and probably does not eliminate too many variables. However, it still gives a useful sense of scores that correspond to meaningful and less-meaningful variables.

Handling numerical columns with Information Gain

For numerical columns in Hadoop, we compute mutual information between dependent and independent variables without discretization. The Hadoop Variable Operator does not perform Minimum Description Length (MDL) discretization like the database version, because it is extremely expensive when dealing with big data.

R2

For each independent variable X, R² is the coefficient of determination for a simple linear regression between X and the dependent variable Y.

Note: R² indicates how well a linear model f(X) = α + βX fits the given data.

Input

A data set from the preceding operator.

Configuration

Parameter	Description
Notes	Any notes or helpful information about this operator's parameter settings. When you enter content in the Notes field, a yellow asterisk is displayed on the operator.
Dependent Column	Define the column to be used as the class Variable. For R2 scoring, the dependent column must be numeric.
Score Type	Hadoop options: R2 Info Gain
Columns	Use Select Columns to open the dialog box for selecting the available columns from the input dataset for analysis.
Score Threshold	Indicates what a good value is for your score type. Variables above the score threshold are stored and passed to subsequent operators. Default value: 0.5.
Always included columns	Columns in this list are automatically passed to subsequent operators, whether or not their score is above the threshold.
Do power transformations?	If Yes, and if an independent variable is numerical, the Variable Selection operator determines the score for the original variable plus a series of power transforms. The power transforms currently supported are: exponent natural log square square root inverse If data in a column does not make sense for a particular transform, that transform's score is not determined. For example, if a column includes negative numbers, we do not determine the score for natural log for that variable. Power transforms are not performed on Always Included Columns, unless that column is also selected in Columns. Default value: No.
Variable Mask Directory	Directory where the variable mask is stored. This variable mask is a file that stores which columns (with what transforms) should be passed to subsequent operators.
Variable Mask Name	File name where the variable mask is stored.

Output

Visual Output

Column name, R2 or info gain, and status values, as shown in the following table.

Status	Meaning
Approved	R2 or info gain value is above the set threshold for this variable.
Mandatory	R2 or info gain value is not above the set threshold, but this column is in the "always included" list.
Rejected	R2 or info gain value is not above the set threshold.
Selected	If checked, this variable is added to the variable mask.

For R2

For info gain

Data Output

The subset of suggested variables from the Variable Selection operator can be passed to the Naive Bayes, Linear Regression, and Logistic Regression operators. These operators use the results of the variable selection to automatically choose the appropriate independent values.

Related reference

Variable Selection (DB)

Summary Statistics (HD)

Summary Statistics (DB)