Alpine Forest - MADlib

Uses the MADlib built-in function forest_train() to generate multiple decision trees, which are combined to make predictions based on one or more independent columns.

Information at a Glance

Category: Model
Data source type: DB
Sends output to other operators: Yes
Data processing tool: MADlib
Note: This operator works only with MADlib 1.8 or higher.

Each decision tree is generated from a bootstrapped sample of the data and a random subset of the feature columns. The output of this operator must be sent to an Alpine Forest Predictor (MADlib) operator. MADlib 1.8 or higher must be installed on the database. For more information, see the official MADlib documentation.
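
Conceptually, the operator's configuration maps onto a call to madlib.forest_train() run directly on the database. The following is a rough, hypothetical sketch: the schema, table, and column names are placeholders, and the exact argument list differs between MADlib versions, so check the MADlib documentation for your installed release.

  SELECT madlib.forest_train(
      'my_schema.patients',            -- source table (placeholder name)
      'my_schema.rf_model',            -- model output table; _summary and _group tables are also created
      'id',                            -- unique numeric ID column
      'outcome',                       -- dependent variable
      'age, weight, blood_pressure',   -- comma-separated feature list
      NULL,                            -- features to exclude
      NULL,                            -- grouping columns (not used by this operator)
      100,                             -- number of trees
      NULL,                            -- random features per split (NULL = default)
      TRUE,                            -- calculate variable importance
      1,                               -- permutations per feature for importance
      10,                              -- maximum tree depth
      20,                              -- minimum observations before splitting
      6,                               -- minimum observations in terminal nodes
      100                              -- number of bins for split boundaries
  );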

Input

The input table must have a single column to predict, either categorical (string or integer) for classification or continuous (floating point) for regression, and one or more independent columns to serve as input features.
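
For illustration, a table shaped like the following (all names are hypothetical) would be a valid input for a classification forest:

  CREATE TABLE my_schema.patients (
      id             INTEGER PRIMARY KEY,   -- unique numeric ID column (required; see Restrictions)
      outcome        TEXT,                  -- column to predict; string/integer yields classification trees
      age            INTEGER,               -- independent feature columns
      weight         DOUBLE PRECISION,
      blood_pressure DOUBLE PRECISION
  );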

Bad or Missing Values
Any rows in the source table that contain NULL values for the predicted or independent columns are ignored.

Restrictions

This operator works only on databases with MADlib 1.8+ installed. Source data tables must have a numeric ID column that uniquely identifies each row in the source table. The prediction column must be integer or string for classification trees, or floating point for regression trees.
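
If a source table lacks a unique numeric ID, one common workaround (shown here with placeholder table names) is to materialize a copy that adds one with row_number():

  -- Create a copy of the source data with a generated unique numeric ID column.
  CREATE TABLE my_schema.patients AS
  SELECT row_number() OVER () AS id, t.*
  FROM   my_schema.patients_raw t;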

Configuration

Parameter Description
Notes Any notes or helpful information about this operator's parameter settings. When you enter content in the Notes field, a yellow asterisk is displayed on the operator.
MADlib Schema The name of the schema where MADlib is installed. By default, the schema name is madlib.
Model Output Schema The name of the schema to use for MADlib-generated output tables.
Model Output Table
The name of the MADlib-generated output table created by the forest trainer. The following additional tables are also generated:
  • A table with the same name, appended with _summary.
  • A table with the same name, appended with _group.
Drop If Exists
  • If Yes (the default), drop the existing table of the same name and create a new one.
  • If No, stop the flow and alert the user that an error has occurred.
ID Column All source tables must have a numeric ID column to uniquely identify each row.
Dependent Variable The name of the column to predict. If the data type of the column is floating point, the generated trees are regression trees. Otherwise, the generated trees are classification trees.
Feature List The selection of one or more columns to use as independent variables to predict the dependent variable. Note that the execution time increases as more columns are selected.
Number of Trees The maximum number of trees to generate. MADlib usually generates this number of trees, but the actual number may be slightly less. The default number is 100. Note that the execution time increases as more trees are generated.
Number of Random Features The number of features to randomly select at each split. If none is specified, the default is sqrt(n) for classification trees or n/3 for regression trees, where n is the number of feature columns.
Calculate Variable Importance Whether or not to calculate variable importance. The default is true. If false, the execution time decreases.
Number of Permutations for Each Feature (for Var. Importance) Variable importance is calculated by randomly permuting each variable's values and computing the resulting drop in predictive accuracy. The default value is 1, and higher values lead to longer execution times. A value of 1 is usually sufficient.
Maximum Tree Depth The generated trees do not exceed this depth, where the root node is at depth 0. If not specified, the default is 10. Longer tree depths might lead to longer execution times.
Minimum Observations Before Splitting The number of observations that must occur at a particular node before considering a split. If not specified, the default is 20.
Minimum Observations in Terminal Nodes The minimum number of observations in any terminal node. If not specified, the default is n/3, where n is the minimum observations before splitting.
Number of Bins for Split Boundaries For continuous-valued features, values are quantized into bins to determine split boundaries. If not specified, the default is 100. Higher values result in longer execution times.
Sampling Ratio The fraction of input rows to use for training, specified as an integer percentage between 1 and 100. Smaller values result in faster execution because less data is sampled. The default is 100, meaning all input rows are used.

Output

Visual Output
This operator has three sets of output tabs.
  • The first set of tabs contains a text representation of each of the generated decision trees.
  • The second set of tabs contains the DOT notation for each of the generated decision trees. DOT notation can be exported to third-party tools such as GraphViz (see the sketch after this list).
  • The third set of tabs contains the raw output tables generated by MADlib.
    • The first is the model output table. The gid column represents the group ID. Grouping is not supported at this time, so this value is always 1. The sample_id column represents the tree ID. The tree column encodes each generated decision tree in binary format.
    • The second is the output summary table, which contains information about how the trees were generated. Many parameters passed to the MADlib training function appear here as columns.
    • The third is the grouping table, which contains one row for each combination of grouping values. Because grouping is not supported, it contains only one row.
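
If you work with the raw MADlib tables directly, MADlib also exposes a helper for rendering a single tree. The following is a hedged sketch: the model table name and IDs are placeholders, and argument details can vary by MADlib version.

  -- Render tree sample_id = 2 from group gid = 1 (grouping is unused, so gid is always 1).
  -- Passing TRUE as the final argument requests DOT notation for GraphViz;
  -- FALSE returns the plain-text representation shown in the first set of tabs.
  SELECT madlib.get_tree('my_schema.rf_model', 1, 2, TRUE);
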
Data Output
The output of this operator must be sent to an Alpine Forest Predictor operator.
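
The predictor operator presumably invokes MADlib's forest_predict() function under the hood; as a rough sketch (table names are placeholders and the exact signature depends on the MADlib version), the prediction step looks approximately like this:

  SELECT madlib.forest_predict(
      'my_schema.rf_model',          -- model table produced by the forest trainer
      'my_schema.new_patients',      -- table of rows to score
      'my_schema.rf_predictions',    -- output table to create
      'response'                     -- 'response' for predicted values, 'prob' for class probabilities
  );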

Example