Optimal Binning for Predictive Data Mining Overview

Overview

The purpose of the Combining Groups facilities is to allow for the pre-processing of categorical predictors with many classes in predictive data mining projects. Many of the analytic methods available in Statistica Data Miner become inefficient when applied to analyses that involve categorical predictors with thousands or tens of thousands of classes each. Such variables, however, are commonly found in many domains where data mining techniques can yield important insights.

Typical examples of categorical predictor variables with many classes are postal zip codes or Standard Industrial Classification codes (SIC, or the newer 6-digit NAICS codes), which classify all industrial activity. Manufacturers of industrial machinery or service providers to industry routinely record the SIC or NAICS codes for their customers in their data warehouses to utilize that information in their marketing.

The problem is that many useful analytic procedures, such as linear models (see GLM) and logistic regression (see GLZ), cannot handle categorical predictor variables with 10,000 classes. For example, as described in GLM Introductory Overview - Summary of Computations, the design matrices constructed from such predictors become very large and are generally unusable for building linear models that are valid for prediction.
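To give a sense of the scale involved, the following Python sketch estimates the memory needed for a dense design matrix built from a single categorical predictor. It assumes k - 1 indicator columns plus an intercept and 8-byte floating-point values. The function name and the row count of 100,000 are illustrative assumptions and are not taken from Statistica.

```python
def dense_design_matrix_bytes(n_rows, n_classes, bytes_per_value=8):
    """Rough size of a dense design matrix for one categorical predictor.

    Assumption: the predictor is coded with (n_classes - 1) indicator
    columns plus one intercept column, stored as 8-byte values.
    """
    n_cols = (n_classes - 1) + 1
    return n_rows * n_cols * bytes_per_value


# Hypothetical data set: 100,000 cases, one predictor with 10,000 classes
size_bytes = dense_design_matrix_bytes(100_000, 10_000)
print(f"{size_bytes / 1e9:.1f} GB")  # 8.0 GB
```

Under the same assumptions, recoding the predictor into 10 groups shrinks the matrix to 8 MB.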

Combining Groups for Prediction

The solution to this problem is to combine the classes in such predictors (with thousands of categories) into a much smaller set of aggregated groups, each consisting of many individual classes from the original categorical predictor. Specifically, the Combining Groups module and nodes of Statistica Data Miner apply a CHAID-like algorithm to find a good combination of classes with respect to a particular continuous or categorical outcome variable of interest. The general principle is fairly simple: the program tries various combinations of classes to find the one that maximizes the relationship of the newly recoded variable to the outcome variable.
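The idea of merging classes with respect to an outcome can be illustrated with a minimal Python sketch for a continuous outcome. This is not Statistica's algorithm: it is a simplified greedy stand-in that repeatedly merges the two groups whose merger loses the least between-group sum of squares, and it stops at a fixed number of groups rather than using a significance test as CHAID does. The function name, the stopping rule, and the toy data are all assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations


def combine_groups(classes, outcome, n_groups):
    """Greedily merge classes until n_groups remain (illustrative only)."""
    # Accumulate the count and the outcome sum for each original class
    stats = defaultdict(lambda: [0, 0.0])
    for c, y in zip(classes, outcome):
        stats[c][0] += 1
        stats[c][1] += y

    # Each group is a tuple: (set of member classes, count, outcome sum)
    groups = [({c}, n, s) for c, (n, s) in stats.items()]

    def merge_cost(a, b):
        # Between-group sum of squares lost if groups a and b are merged
        (_, na, sa), (_, nb, sb) = a, b
        return na * nb / (na + nb) * (sa / na - sb / nb) ** 2

    while len(groups) > n_groups:
        # Pick the pair of groups that is cheapest to merge
        i, j = min(
            combinations(range(len(groups)), 2),
            key=lambda p: merge_cost(groups[p[0]], groups[p[1]]),
        )
        a, b = groups[i], groups[j]
        merged = (a[0] | b[0], a[1] + b[1], a[2] + b[2])
        groups = [g for k, g in enumerate(groups) if k not in (i, j)] + [merged]

    # Map every original class to the index of its final group
    return {c: idx for idx, (members, _, _) in enumerate(groups) for c in members}


# Toy data: four classes whose outcome means fall into two clusters
classes = ["A", "A", "B", "B", "C", "C", "D", "D"]
outcome = [1.0, 1.2, 1.1, 0.9, 5.0, 5.2, 4.9, 5.1]
recode = combine_groups(classes, outcome, n_groups=2)
print(recode)
```

On the toy data, classes A and B end up in one group and C and D in the other. Because each step commits to the locally cheapest merge, a procedure of this kind is not guaranteed to find the single best grouping overall.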

In practice, determining the best recoding of classes to predict a particular variable can be computationally difficult, which is why Statistica uses a CHAID-like algorithm for this purpose. This algorithm will generally find a very good set of groupings for the respective categorical predictor variables. Note, however, that the final recoding (re-grouping) may not represent a global optimum (the single best recoding), but only a good recoding (a local optimum) that is sufficient to allow for useful subsequent analyses.