Ensemble Decision Tree Modeling with Alpine Forest

Team Studio provides a number of forest modeling operators for Hadoop and database data sources.

Important: If you are using the Alpine Forest Classification operator with database sources from version 6.0 or earlier, you must remove the operator from your workflow and replace it with Alpine Forest - MADlib, used together with Alpine Forest Predictor - MADlib.

Like the Decision Tree operator, Alpine Forest Classification creates a tree-like branching series of computational steps or logic "tests" that lead to an ultimate decision value. The difference is that the Alpine Forest Classification operator creates multiple decision trees, with each tree differing slightly. Specifically, each tree works on a random subset of the training data and uses a random subset of variables at the decision nodes.
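These two sources of randomness are easy to express directly. The following is a minimal Python/NumPy sketch of how a single tree's view of the data might be drawn; it is not Team Studio's implementation, and the matrix shape and the sqrt(p) feature count are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical training matrix: 1,000 observations, 20 predictors.
X = rng.normal(size=(1000, 20))
n_rows, n_features = X.shape

# Each tree trains on a bootstrap sample of the rows (drawn with
# replacement, so some rows repeat and others are left out) ...
bootstrap_rows = rng.integers(0, n_rows, size=n_rows)

# ... and at each decision node considers only a random subset of the
# predictors; sqrt(p) candidates is a common default for classification.
# (A real forest redraws this subset at every node; one draw is shown
# here for brevity.)
candidate_features = rng.choice(n_features,
                                size=int(np.sqrt(n_features)),
                                replace=False)

one_tree_view = X[bootstrap_rows][:, candidate_features]
print(one_tree_view.shape)  # (1000, 4)
```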

Alpine Forest Classification is, therefore, an "ensemble" method that combines a "forest" of individual decision trees, each generated using a random selection of attributes at each node to determine the split. The final classification is decided by a "vote count": the most frequent classification across all of the resulting trees wins.
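For a single row, the vote count reduces to taking the most common prediction. A minimal sketch, assuming five hypothetical trees have already classified the row:

```python
from collections import Counter

# Hypothetical class predictions from five trees for one input row.
tree_votes = ["yes", "no", "yes", "yes", "no"]

# The forest's final answer is the most frequent class across the trees.
prediction, votes = Counter(tree_votes).most_common(1)[0]
print(f"{prediction} ({votes} of {len(tree_votes)} votes)")  # yes (3 of 5 votes)
```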

The main idea behind Alpine Forests is that, by creating many different decision trees and assuming that each individual tree makes its mistakes in different places, the group of trees should, on average, agree on the right answer in most places. The aggregated results should therefore be more accurate than any single tree's results.
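A small worked calculation illustrates why the voting helps. Suppose, purely as an idealized assumption, that each tree is correct 65% of the time and that the trees err independently (in practice tree errors are correlated, which is exactly why the random row and variable subsets matter). The probability that a majority votes correctly then rises quickly with the number of trees:

```python
from math import comb

def majority_accuracy(n_trees: int, p_correct: float) -> float:
    """Probability that a strict majority of n independent trees,
    each correct with probability p_correct, picks the right class."""
    majority = n_trees // 2 + 1
    return sum(comb(n_trees, k) * p_correct**k * (1 - p_correct)**(n_trees - k)
               for k in range(majority, n_trees + 1))

print(majority_accuracy(1, 0.65))    # 0.65  -- a single tree
print(majority_accuracy(25, 0.65))   # ~0.94
print(majority_accuracy(101, 0.65))  # ~0.999
```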

Alpine Forest Classification Modeling Advantages

Alpine Forest Classification is considered one of the most accurate learning algorithms currently available for categorical classification. Additional advantages of Alpine Forest Classification modeling include:

  • Ability to automatically select variables from a large set of predictors, without the modeler first having to reduce the set to only strong predictors.
  • Ability to work well "off-the-shelf" without major configuration. A modeler can get relatively accurate results within minutes.
  • Ability to accept thousands of input predictor variables without variable deletion. In other words, it handles "wide" data, where there are more predictors than observations, without the preliminary variable-reduction step that most other methods require.
  • Ability to indicate which variables are important for the classification.
  • Ability to generate built-in, cross-validation (out-of-bag) error estimates of model accuracy as the forest building progresses. (Both this and variable importance are demonstrated in the sketch after this list.)
  • Ability to capture highly non-linear boundaries and interactions between variables.
  • Ability to handle large datasets efficiently.
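Two of these advantages, variable importance and the built-in error estimate, can be seen in any random forest implementation. Below is a minimal sketch using scikit-learn's RandomForestClassifier on synthetic data as a stand-in for the Alpine operators; the dataset shape and parameter values are illustrative assumptions:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for a Team Studio data source: 25 predictors,
# of which only 5 actually carry signal.
X, y = make_classification(n_samples=2000, n_features=25,
                           n_informative=5, random_state=0)

forest = RandomForestClassifier(n_estimators=200,
                                oob_score=True,  # built-in error estimate
                                random_state=0).fit(X, y)

# Out-of-bag accuracy: each tree is scored on the rows excluded from
# its bootstrap sample, so no separate validation split is required.
print(f"OOB accuracy: {forest.oob_score_:.3f}")

# Relative importance of each predictor for the classification.
ranked = sorted(enumerate(forest.feature_importances_),
                key=lambda pair: pair[1], reverse=True)
print("Top predictors:", ranked[:5])
```

The built-in estimate works because each tree's bootstrap sample leaves out roughly a third of the rows; scoring every row only with the trees that never saw it yields the "out-of-bag" accuracy reported above.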

Some disadvantages of the Alpine Forest Classification method include a tendency to overfit on some datasets (that is, if the number of trees is set too high) and the fact that the resulting forests are difficult for humans to interpret and visualize. Also, for data that includes categorical variables with varying numbers of levels, Alpine Forests tend to be biased in favor of the attributes with more levels.