Alpine Forest Classification

An Alpine Forest Classification model is an ensemble classification method that builds a collection of decision trees with controlled variation. Ensemble modeling applies many models, each operating on a subset of the data, and combines their predictions.

Information at a Glance

Category: Modeling
Data source type: HD
Sends output to other operators: Yes
Data processing tool: MapReduce, Spark

You usually do not need to change the default configuration settings for this operator. The main properties that are specific to Alpine Forest Classification modeling are Number of Features per Node, Number of Trees, Sampling with Replacement, and Sampling Percentage.

For information about the advantages of using this model, see Ensemble Decision Tree Modeling with Alpine Forest.

Algorithm

The Alpine Forest Classification operator implements the algorithm by building multiple decision trees, each trained on a randomly selected subset of the data observations (rows) and, at each node split, a randomly selected subset of m of the available attributes. The final "ensemble" classification is performed by querying each tree in the "Alpine Forest" for its prediction (essentially a "vote"). The overall prediction of the Alpine Forest Classification model is the mode of the votes, that is, the most common prediction among the individual tree models.
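
For example, the final vote tally can be expressed in a few lines of Python (a minimal sketch; each tree here is a hypothetical callable that returns a class label for a row):

    from collections import Counter

    def forest_predict(trees, row):
        """Classify one row by majority vote over the trees in the forest."""
        votes = [tree(row) for tree in trees]       # each tree casts one "vote"
        return Counter(votes).most_common(1)[0][0]  # the mode of the votes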

The Alpine Forest of trees is grown from a dataset consisting of data observations (rows) and classifiers (independent variables). For each tree of the forest (see the sketch following this list):

  • The value of n is specified by the Number of Trees configuration property. This is the number of decision trees to create, each with its own randomly selected subset of data rows.
  • For each individual decision tree, m of the total available Independent Variables are randomly chosen to determine the optimum decision tree node split. This is referred to as the random input selection methodology. The value of m is specified by the Number of Features per Node configuration property.
  • Sampling can be done with or without replacement. This option controls whether a data row can be chosen more than once (that is, be replaced) in the dataset used for each of the n decision trees created. It is specified by the Sampling with Replacement configuration property.
  • The remaining data rows not included in the n decision tree dataset samples are used for automatically generated cross-validation error estimates of the model. Note: These are called the OOB (Out of Bag) Error Estimates.
  • Each individual decision tree is grown according to the specified tree growth configuration parameters set for the Alpine Forest Classification operator.
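
The per-tree row sampling described above can be sketched as follows (illustrative Python assuming NumPy; sample_pct and with_replacement stand in for the Sampling Percentage and Sampling with Replacement properties):

    import numpy as np

    def sample_rows_for_tree(n_rows, sample_pct, with_replacement, rng):
        """Draw one tree's training rows; rows never drawn form its OOB set."""
        k = max(1, int(n_rows * sample_pct))
        in_bag = rng.choice(n_rows, size=k, replace=with_replacement)
        oob = np.setdiff1d(np.arange(n_rows), in_bag)  # used for OOB error estimates
        return in_bag, oob

    rng = np.random.default_rng(seed=42)
    in_bag, oob = sample_rows_for_tree(n_rows=1000, sample_pct=0.8,
                                       with_replacement=True, rng=rng)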

In summary, the Alpine Forest Classification algorithm combines an ensemble classification, or "bagging", approach with random input selection of features to construct a collection of CART decision trees with controlled variation. The individual models are combined by voting to produce the final classification or prediction.
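
Team Studio's implementation runs on MapReduce or Spark, but the recipe is the same one implemented by open-source random forest libraries. As a rough analogue only (the mapping of operator properties onto scikit-learn parameters below is an assumption, not an exact equivalence):

    from sklearn.ensemble import RandomForestClassifier

    # Approximate mapping of the operator's default configuration:
    model = RandomForestClassifier(
        n_estimators=10,      # Number of Trees
        max_features="sqrt",  # Number of Features Function: Square Root
        bootstrap=True,       # Sampling with Replacement
        max_depth=None,       # Maximum Depth of -1 ("no bound")
        min_samples_split=2,  # Minimum Size for Split
        min_samples_leaf=1,   # Minimum Leaf Size
        criterion="entropy",  # Information Gain as the impurity function
    )
    # model.fit(X_train, y_train); model.predict(X_test)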

Input

A dataset that contains the dependent and independent variables for modeling.

Configuration

Minimal Configuration
  • Dependent Column: the dataset column to be predicted (the Dependent Variable). For classification models, the dependent variable must be categorical.
  • Columns: the expected Independent Variable data columns, or properties, to use for model training.
  • Sampling Percentage: the fraction of the data rows to be used as randomly-selected datasets for each decision tree.
Parameter Descriptions
Notes Any notes or helpful information about this operator's parameter settings. When you enter content in the Notes field, a yellow asterisk is displayed on the operator.
Dependent Column The quantity to model or predict.

Select the data column to be considered the dependent variable. Alpine Forest Classification on Hadoop only supports classification models.

The dependent column must be a categorical variable.

To perform a regression instead, use the Alpine Forest Regression operator.

Columns Select the Independent Variable column(s) to include for the decision tree training.

At least one column must be specified.

Click Select Columns to open the dialog.

See Select Columns Dialog Box for more details.

Number of Trees This specifies how many individual decision trees to train in the Alpine Forest.
Note: Increasing the number of trees created generally increases the accuracy of the model. However, as long as enough trees are created, the Alpine Forest Classification model is not very sensitive to changing this property.

The user interface displays a maximum of 20 tree results, even if more are generated.

Default value: 10.

Use Automatic Configuration Specifies that Team Studio should determine all the required Alpine Forest Classification configuration properties except for the Number of Trees property.

Default value: true.

Number of Features Function Automatically sets a value for Number of Features per Node.
  • Square Root - The number of features per node is set to the square root of the number of columns (truncated to an integer), or at least to 1.
  • 1/3 - The number of features per node is set to the (number of columns)/3 (truncated to an integer), or at least to 1.
  • All - The number of features per node is set to the number of columns.
  • User Defined - Set the number of features per node value in Number of Features per Node.

Default value: Square Root.
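
A sketch of how the four options resolve to a feature count (illustrative Python; n_cols is the number of selected Columns):

    import math

    def features_per_node(n_cols, option, user_value=None):
        """Resolve the Number of Features Function setting to a count m."""
        if option == "Square Root":
            return max(1, int(math.sqrt(n_cols)))  # truncated to an integer, at least 1
        if option == "1/3":
            return max(1, n_cols // 3)             # truncated to an integer, at least 1
        if option == "All":
            return n_cols
        return user_value                          # "User Defined"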

Number of Features per Node Specifies m, the number of predictors to consider at each node during the tree building process. The Alpine Forest Classification algorithm calculates the best split for the tree based on these m variables that are selected randomly from the training set.
  • The Number of Features per Node should be much less than the number of columns specified for the Columns property.

The number of features per node is the configuration property to which an Alpine Forest Classification model is most sensitive. Increasing the number of features per split makes each decision tree larger and provides more information at each node, but it also makes the model harder for the modeler to interpret.

Default value: 1.

Sampling with Replacement Specifies whether to use replacement when selecting the training data row samples from the input dataset. This property controls whether a data row can be selected more than once across the n training samples drawn from the available dataset rows.
  • Setting this value to true increases training time because there are more possible random dataset combinations.
  • Setting this value to false specifies that the system does not choose a data row more than once for each decision tree. This setting is appropriate when taking a small sample of data rows from a large dataset; in that case, sampling without replacement behaves approximately the same as sampling with replacement (because the odds of randomly choosing the same data point twice are low).

Default value: true
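
The practical difference between the two settings can be seen in a quick NumPy sketch (illustrative only):

    import numpy as np

    rng = np.random.default_rng(seed=7)
    rows = np.arange(10)

    with_repl = rng.choice(rows, size=10, replace=True)      # a row may appear several times
    without_repl = rng.choice(rows, size=10, replace=False)  # every row appears exactly once

    print(np.unique(with_repl).size)     # typically < 10: some rows repeat, others stay out of bag
    print(np.unique(without_repl).size)  # always 10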

Sampling Percentage Specifies the fraction of the input data rows to be randomly sampled as the training dataset for each decision tree. Whether a sampled row can be selected more than once is controlled by the Sampling with Replacement property.

Default value: 1.

Maximum Depth Specifies the "depth" of the tree, or the maximum number of decision nodes it can branch out to beneath the root node. A tree stops growing any deeper either if a node becomes empty (that is, there are no more examples to split in the current node), or if the depth of the tree exceeds this limit.
  • Maximum Depth is used during the tree-growth phase.
  • The value can be -1 or any integer greater than 0. A value of -1 represents "no bound": the tree can grow to any size or an unlimited number of decision nodes until the nodes become empty.

Default value: -1

Minimum Size For Split Specifies the minimum size (number of members) of a node in the decision tree to allow a further split. If the node has fewer data members than the minimum size for split, it must become a leaf (end) node in the tree. When individual trees are being trained, this is a criterion for stopping tree training.
  • The range of possible values is any integer ≥ 2.
  • Minimum Size for Split is referenced during the pre-pruning phase.

Default value: 2.

Minimum Leaf Size Specifies the minimum number of data instances that can exist within a terminal leaf node of a decision tree. This property pre-prunes the tree by constraining each leaf to at least this number of training samples.
  • The range of possible values is any integer value ≥ 1.
  • This property limits the tree depth, based on the size of the leaf nodes, ensuring enough data makes it to each part of the tree.

This setting is useful when the model construction is taking too long, or when the model shows very good ROC on training data but not nearly as good performance on hold-out or cross-validation data (due to over-fitting). For example, if the minimum leaf size is 2, then each terminal leaf node must contain at least 2 training data points.

Default value: 1.
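
Taken together, Maximum Depth, Minimum Size for Split, and Minimum Leaf Size act as pre-pruning stopping rules. A simplified sketch of the test (illustrative Python; node_size is the number of training rows that reached the node):

    def can_split(node_size, depth, max_depth=-1, min_split=2, min_leaf=1):
        """Pre-pruning test: return True if this node may be split further."""
        if node_size == 0:
            return False  # empty node: nothing left to split
        if max_depth != -1 and depth >= max_depth:
            return False  # Maximum Depth reached
        if node_size < min_split:
            return False  # Minimum Size for Split not met; node becomes a leaf
        # Any candidate split must also leave at least min_leaf rows on
        # each side (Minimum Leaf Size), checked when splits are evaluated.
        return node_size >= 2 * min_leaf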

Maximum JVM Heap Size Determines the amount of virtual memory assigned to an individual tree trainer. This limit constrains the number of training samples available to a single tree.
  • A value of -1 automatically sets the Maximum JVM Heap Size to avoid out-of-memory issues.

Default value: 1024

Use Spark If Yes (the default), uses Spark to optimize calculation time.
Advanced Spark Settings Automatic Optimization
  • Yes specifies using the default Spark optimization settings.
  • No enables providing customized Spark optimization. Click Edit Settings to customize Spark optimization. See Advanced Settings Dialog Box for more information.

Outputs

Visual Output

When run against Hadoop, the Alpine Forest Classification operator does not display the individual decision trees generated within the model, because with large datasets the trees are assumed to be much larger than can be displayed visually. Instead, summary statistics are presented.

  • Variable Importance - Results for Hadoop data sources display the Variable Importance value, which provides a way of measuring the impact each variable has on the model.

    At each split, the purity gain (how much the split reduces node impurity) is calculated. For each variable, these gains are summed over all splits where the variable is used, weighted by the number of samples in the node, across all trees. Each variable's total is then divided by the largest such total, so the most important variable has a value of 1 (see the sketch after this list).

    For Alpine Forest Classification, we use Information Gain as the impurity function.

    Note: The variable importance values are also stored as a CSV file in the following HDFS directory:

    @default_tempdir/tsds_model/@user_name/@flow_name/AlpineForest_<uniqueFlowRunID>/varImp.csv

  • Individual Tree Statistics - The Individual Tree Statistics (for up to 20 trees in the model) are displayed, providing the number of training samples, dropped training samples, non-leaf nodes, and leaves for each tree.

  • Average Tree Statistics - Displays the average statistical values across all the individual trees in the model, providing a sense of the overall size of the decision trees in the model.
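
The variable importance normalization described above can be sketched as follows (illustrative Python; splits is a hypothetical flattened record of every split in every tree):

    def variable_importance(splits):
        """splits: iterable of (variable, purity_gain, n_node_samples) tuples
        collected over all splits in all trees. Returns importances scaled
        so that the most important variable has a value of 1."""
        totals = {}
        for var, gain, n_samples in splits:
            totals[var] = totals.get(var, 0.0) + gain * n_samples  # weighted sum
        top = max(totals.values())
        return {var: total / top for var, total in totals.items()}
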
Output to Succeeding Operators
Connect this operator to succeeding operators.