Decision Tree - MADlib

Team Studio supports the MADlib Decision Tree model implementation.

Information at a Glance

Category: Model
MADlib version: < 1.8
Data source type: DB
Sends output to other operators: Yes
Data processing tool: n/a

For more information about working with decision trees, see Classification Modeling with Decision Tree.

Algorithm

The Decision Tree (MADlib) Operator supports the C4.5 deterministic method for constructing the decision tree structure, allowing users to choose information gain, Gini coefficient, or gain ratio as the split criterion. The MADlib implementation also supports decision tree pruning and missing-value handling.
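
To make the three split criteria concrete, the following Python sketch computes information gain, Gini gain, and gain ratio for a single candidate split. It illustrates the standard formulas only; it is not MADlib's implementation, and the toy data is invented.

```python
# Illustrative sketch (not MADlib's code) of the three split criteria,
# computed for one candidate split of a class column.
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def split_scores(parent, children):
    """parent: class labels at the node; children: label lists after the split."""
    n = len(parent)
    weights = [len(child) / n for child in children]
    info_gain = entropy(parent) - sum(w * entropy(c) for w, c in zip(weights, children))
    gini_gain = gini(parent) - sum(w * gini(c) for w, c in zip(weights, children))
    # Gain ratio normalizes information gain by the entropy of the split
    # itself, penalizing splits that scatter rows across many small branches.
    split_info = -sum(w * log2(w) for w in weights if w > 0)
    gain_ratio = info_gain / split_info if split_info > 0 else 0.0
    return info_gain, gini_gain, gain_ratio

# Invented toy data: 14 rows split into two branches.
parent = ["yes"] * 9 + ["no"] * 5
children = [["yes"] * 6 + ["no"] * 2, ["yes"] * 3 + ["no"] * 3]
print(split_scores(parent, children))
```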

Note that the MADlib Decision Tree is considered an 'Early Stage Development' algorithm.

More information, including general principles, can be found in the official MADlib documentation.

Input

A data set that contains the dependent and independent variables for modeling.

Configuration

Parameter Description
Notes: Any notes or helpful information about this operator's parameter settings. When you enter content in the Notes field, a yellow asterisk is displayed on the operator.
MADlib Schema Name: The schema where MADlib is installed in the database. MADlib must be installed in the same database as the input data set. If a madlib schema exists in the database, this parameter defaults to madlib.
Split Criterion: The criterion used to determine the split of the data at each node of the tree. The Split Criterion can be Information Gain, Gini Coefficient, or Information Gain Ratio.
Model Output Schema Name: The name of the schema where the output is stored.
Model Output Table Name: The name of the table that is created to store the decision tree model. The model output table stores:

id | tree_location | feature | probability | ebp_coeff | maxclass | scv | live | sample_size | parent_id | lmc_nid | lmc_fval | is_continuous | split_value | tid | dp_ids

See the official MADlib decision tree documentation for more information.

Drop If Exists:
  • If Yes (the default), drop the existing table of the same name and create a new one.
  • If No, stop the flow and alert the user that an error has occurred.
Validation Table Name: The table name for a validation data set against which to score the learned decision tree model. The ratio of correctly classified items in the validation set is reported.

Default value: null (or no validation table).

Continuous Features: Select the continuous Independent Variable data columns to include for decision tree training.

At least one Continuous Features column or one Categorical Features column must be specified.

Click Column Names to open the dialog box for selecting the available columns from the input data set for analysis.

Categorical Features: Select the categorical Independent Variable data columns to include for decision tree training.

At least one Continuous Features column or one Categorical Features column must be specified.

Class Column: Required. The data column to use as the Dependent Variable. This is the quantity to model or predict.
Confidence Level: Specifies the confidence percentage boundary to use for the pessimistic-error pruning algorithm.

Confidence Level controls the pruning phase of the Decision Tree algorithm.

  • The pruning phase uses confidence intervals to estimate the "worst case" error rate of the node.
  • The confidence level is the certainty factor or upper limit of the chance of an error being found in a leaf node.
  • If the node has an error rate greater than this Confidence limit, it is pruned. Consider this as the probability of there being an incorrect classification in the leaf node.
  • Setting a higher Confidence Level value allows the model to use nodes with higher individual error rates (less pruning).
  • Setting a lower Confidence Level value indicates less tolerance for error, therefore more pruning.

Default value: 25, representing a 25% probability of there being an error in the leaf node classification set.
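
For reference, the textbook C4.5 pruning rule estimates a node's worst-case error as the upper confidence limit of its observed error rate. The Python sketch below shows that classic formula; MADlib's internal implementation may differ in detail.

```python
# Sketch of the textbook C4.5 pessimistic-error estimate (not necessarily
# MADlib's exact code): the upper confidence limit on a node's error rate.
from statistics import NormalDist

def pessimistic_error(errors, n, confidence=0.25):
    """Worst-case error rate of a node holding n rows with `errors` mistakes.

    confidence=0.25 corresponds to the operator's default Confidence Level
    of 25. Higher values shrink the bound (less pruning); lower values
    inflate it (more pruning).
    """
    f = errors / n                            # observed error rate
    z = NormalDist().inv_cdf(1 - confidence)  # one-sided deviate, ~0.674 at 25%
    return (f + z * z / (2 * n)
            + z * (f * (1 - f) / n + z * z / (4 * n * n)) ** 0.5) / (1 + z * z / n)

# A leaf with 2 mistakes in 12 rows: the worst-case estimate (~0.25) is
# noticeably higher than the observed rate (~0.17).
print(pessimistic_error(2, 12))
```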

Handle Missing Values: Specifies how to handle missing values in the data set.
  • ignore - Missing values are ignored.
  • explicit - Missing values are explicitly replaced with the average value for the feature.

Default value: ignore
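
As a minimal sketch of the explicit strategy (with invented values), replacing missing entries in a continuous feature with the feature's average looks like this:

```python
# Minimal sketch of the "explicit" strategy: missing entries in a
# continuous feature are replaced by the feature's average.
values = [3.2, None, 4.8, None, 5.0]
mean = sum(v for v in values if v is not None) / sum(v is not None for v in values)
imputed = [mean if v is None else v for v in values]
print(imputed)  # [3.2, 4.333..., 4.8, 4.333..., 5.0]
```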

Maximum Tree Depth: Sets the "depth" of the tree, or the maximum number of decision nodes it can branch out to beneath the root node. A tree stops growing any deeper if either a node becomes empty (that is, there are no more examples to split in the current node) or the depth of the tree exceeds this Maximum Tree Depth limit.

Maximum Tree Depth is used during the tree-growth phase.

Values must be greater than 0.

Default value: 10

Node Prune Threshold: The minimum percentage of the number of records required in a child node. This threshold applies only to the non-root nodes.
The value must be in [0, 1].
  • If the value is 1, the trained tree has only one node (the root node).
  • If the value is 0, no nodes are pruned by this parameter.
Note: You can use pruning to avoid overfitting the decision tree.
Node Split Threshold: The minimum percentage of the number of records required in a node for a further split to be possible.
The value must be in [0, 1].
  • If the value is 1, the trained tree has only two levels, because only the root node can split.
  • If the value is 0, the tree can grow extensively (see the sketch after this list).
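
The following sketch shows how Maximum Tree Depth, Node Split Threshold, and Node Prune Threshold could gate growth and pruning as described above. It mirrors the rules in this table, not MADlib's source, and the 0.01 threshold values are illustrative assumptions, not documented defaults.

```python
# Sketch of how the growth and pruning parameters above interact.
def can_split(node_size, total_size, depth,
              max_tree_depth=10, node_split_threshold=0.01):
    if node_size == 0:                  # empty node: stop growing
        return False
    if depth >= max_tree_depth:         # Maximum Tree Depth reached: stop
        return False
    # Node Split Threshold: the node must hold at least this fraction of
    # all training rows before it is allowed to split further.
    return node_size / total_size >= node_split_threshold

def keep_child(child_size, total_size, node_prune_threshold=0.01):
    # Node Prune Threshold: a non-root child below this fraction is pruned.
    return child_size / total_size >= node_prune_threshold

print(can_split(node_size=5, total_size=1000, depth=3))  # False: 0.5% < 1%
print(keep_child(child_size=50, total_size=1000))        # True: 5% >= 1%
```
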
Verbosity: A Boolean value that indicates whether to log all output of the training results. Default: false.

Output

Visual Output
The Decision Tree (MADlib) Operator has an intuitive output: the classification tree structure, with leaf nodes that indicate the count of data set rows (members) they contain.



Double-click a decision tree node if its sub-nodes are not displayed in the UI.

Additional Notes

Output Details

Connect this operator to the following succeeding operators.

  • Predictor operators
  • Scoring operators (such as ROC)

Decision trees need succeeding operators to assess their effectiveness. A Predictor operator provides the prediction value for each data row, compared against the actual training value in the data set, along with the associated confidence level.

Adding additional scoring operators, such as a ROC graph, is also helpful in immediately assessing how predictive the Decision Tree model is. For the ROC graph, any AUC value over 0.80 is typically considered a "good" model. A value of 0.5 means the model is no better than a "dumb" model that guesses the right answer half the time.
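
As a quick, hypothetical illustration of that rule of thumb (the labels and confidences are invented, and scikit-learn is used here purely for demonstration; the flow computes ROC within the ROC operator itself):

```python
# Invented labels and C(Yes) confidences: an AUC above 0.80 suggests a
# "good" model, while 0.5 would be no better than random guessing.
from sklearn.metrics import roc_auc_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]                  # actual dependent values
c_yes = [0.9, 0.2, 0.7, 0.6, 0.4, 0.65, 0.8, 0.5]  # model confidence per row
print(roc_auc_score(y_true, c_yes))                # 0.9375
```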

The output from the Predictor operator appears as follows:



  • The prediction value (Yes or No) uses a threshold assumption of greater than 50% confidence that the prediction will happen (see the sketch after this list).
  • The C(Yes) column indicates the confidence that the Dependent value is 1.
    Note: Usually this is a decimal value. In this case, the data set is small and created as an example.
  • The C(No) column indicates the confidence that the Dependent value is 0.
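
A minimal sketch of the thresholding rule in these bullets, with C(No) derived as the complement of C(Yes):

```python
# Predict Yes when the confidence that the dependent value is 1 exceeds 0.5.
def predict(c_yes, threshold=0.5):
    return "Yes" if c_yes > threshold else "No"

for c_yes in (0.82, 0.50, 0.31):
    c_no = 1 - c_yes                  # C(No) is the complement of C(Yes)
    print(f"C(Yes)={c_yes:.2f}  C(No)={c_no:.2f}  prediction={predict(c_yes)}")
```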

Example

The following example illustrates a typical analytic flow configuration for Decision Tree operators.