Gradient Boosting Classification

A predictive method in which a series of shallow decision trees incrementally reduces the prediction errors of the preceding trees. This method can be used for both classification and regression.
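
To illustrate the idea, the following is a minimal conceptual sketch in Python, not the operator's Spark implementation: each shallow tree is fit to the pointwise negative gradient of the loss left by the ensemble built so far, and is added with a small shrinkage weight. The data, tree count, depth, and shrinkage values here are illustrative only.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def boost(X, y, n_trees=100, shrinkage=0.01, max_depth=4):
    """Fit a sequence of shallow trees, each correcting the ensemble so far."""
    F = np.zeros(len(y), dtype=float)     # current raw scores of the additive model
    trees = []
    for _ in range(n_trees):
        residual = y - sigmoid(F)          # negative gradient of the log loss
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, residual)
        F += shrinkage * tree.predict(X)   # add a small, shrunken correction
        trees.append(tree)
    return trees

# Illustrative usage on synthetic data.
X = np.random.default_rng(0).normal(size=(200, 3))
y = (X[:, 0] + X[:, 1] > 0).astype(float)
ensemble = boost(X, y, n_trees=50)
```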

Information at a Glance

Category: Model
Data source type: HD
Sends output to other operators: Yes
Data processing tool: Spark

See Gradient Boosting for more information.

Input

A tabular data set.

Configuration

Parameter Description
Notes Any notes or helpful information about this operator's parameter settings. When you enter content in the Notes field, a yellow asterisk is displayed on the operator.
Dependent Column Specify the dependent column for the gradient boosting classification model. Select the data column to treat as the dependent variable for the classification.
Note: Be sure that the columns chosen as the Independent and Dependent variables are different. If a column selected as the Dependent variable is also selected in the Independent variable list, an error occurs.
  • This is the quantity to model or predict.
  • For the Gradient Boosting Classification operator, the dependent column must be binary: either a categorical column with exactly two values, or a numerical column containing only 0 and 1. Multi-class classification is currently not supported.
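
If the outcome column is categorical with two values, it can be recoded to 0/1 before being used as the dependent column. The following PySpark sketch is illustrative only; the column names and values are hypothetical.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical input with a two-valued categorical outcome column.
df = spark.createDataFrame(
    [("a", 12.0, "yes"), ("b", 7.5, "no")],
    ["id", "usage", "churned"],
)

# Recode the two categories to 0/1 so the column can serve as the dependent column.
df = df.withColumn("label", F.when(F.col("churned") == "yes", 1).otherwise(0))
df.show()
```
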
Independent Columns The independent variable data columns to include in the gradient boosting tree training. At least one column must be specified. Click Select Values to open the dialog box for selecting the available columns from the input data set for analysis.

Check/Uncheck the box in front of the column names to select/de-select the columns.

Loss Function Choose the loss function to use to calculate the gradient boosting trees. Choosing a different loss function leads to different interpretations for trained models and might have an effect on the final prediction accuracy as well. For mathematical details of different loss functions, see various online resources; for example, http://www.saedsayad.com/docs/gbm2.pdf has good technical definitions of different loss functions.

AdaBoost: Uses the exponential loss function.

Logistic: Uses the log loss function.

TruncatedHinge: Uses a truncated hinge loss function. It can be somewhat more accurate on problems with many outlier training samples. Intuitively, this means there is a cap on the penalty imposed on grossly misclassified examples, which outliers tend to produce. Currently, the cap is fixed, but in the future it might be exposed as a configurable parameter.
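
For reference, standard textbook forms of these three losses are sketched below in Python, with labels in {-1, +1} and f denoting the model's raw score. The operator's exact formulations, and in particular the fixed cap used by the truncated hinge loss, are not documented here, so the cap value shown is an assumption.

```python
import numpy as np

def exponential_loss(y, f):                 # AdaBoost
    return np.exp(-y * f)

def logistic_loss(y, f):                    # Logistic (log loss)
    return np.log1p(np.exp(-y * f))

def truncated_hinge_loss(y, f, cap=2.0):    # cap value is illustrative only
    return np.minimum(np.maximum(0.0, 1.0 - y * f), cap)
```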

Number of Trees Number of trees used to train the gradient boosting classification model. Boosting uses results from previous trees to find training samples that need more attention (that is, have larger losses). The more trees, the more accurate the training predictions; however, validation accuracy might decline if there are too many trees. The number of trees also depends highly on the shrinkage parameter: typically, a smaller shrinkage value means more trees are needed, and vice versa.
Important: The runtime of this operator is linear in the number of trees. For example, selecting 200 trees takes roughly twice as long as selecting 100 trees.

Default value: 100

Maximum Tree Depth Sets the "depth" of the tree: the maximum number of levels of nodes it can branch out to beneath the root node. A tree stops growing deeper when either a node becomes empty (that is, there are no more examples to split in the current node) or the depth of the tree exceeds this Maximum Tree Depth limit. The smaller the depth, the shallower and weaker the individual trees; smaller trees also often necessitate a larger number of trees.
  • Valid values are -1 or any integer greater than 0.
  • A value of -1 means "no bound": the tree can grow to any depth or number of decision nodes until its nodes become empty.

Default value: 4.

Minimum Node Split Size Specifies the minimum size (number of members) a node in the tree must have to allow a further split. If a node has fewer data members than the Minimum Node Split Size, it becomes a leaf (end) node in the tree. As with the Maximum Tree Depth parameter above, a larger node split size means smaller trees.
  • The range of possible values is any integer ≥ 1.

Default value: 10.

Bagging Rate The approximate fraction of training data that is sampled (without replacement) when training each tree. For example, if this value is 0.5, the first tree is trained on a random 50% of the training data set, the second on a different random 50%, and so on. An appropriate value might improve model performance by mitigating overfitting.

Default value: 0.5.

Shrinkage Weight given to individual trees. The smaller this number is, the more trees one might need. The larger the number is, the fewer trees one might need.

Default value: 0.01.
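
To see how the tree-shaping parameters above fit together, the following sketch configures Spark MLlib's GBTClassifier with roughly analogous settings. This is an analogy only, not the operator's implementation; the data and column names are hypothetical, and the parameter mapping is approximate.

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import GBTClassifier

spark = SparkSession.builder.getOrCreate()

# Hypothetical training data with two numeric features and a 0/1 label.
rows = [(float(i % 20), float((i * 7) % 13), i % 2) for i in range(200)]
df = spark.createDataFrame(rows, ["usage", "tenure", "label"])
features = VectorAssembler(inputCols=["usage", "tenure"], outputCol="features")

gbt = GBTClassifier(
    labelCol="label",
    featuresCol="features",
    maxIter=100,              # roughly: Number of Trees
    maxDepth=4,               # roughly: Maximum Tree Depth
    minInstancesPerNode=10,   # roughly: Minimum Node Split Size
    subsamplingRate=0.5,      # roughly: Bagging Rate
    stepSize=0.01,            # roughly: Shrinkage
)
model = gbt.fit(features.transform(df))
```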

Fraction of Data for Training The fraction of the data set to use for training the gradient boosting trees. The rest of the data set is used to measure validation performance while training is in progress. This allows the training algorithm to estimate the optimal number of trees with respect to validation accuracy.

Default value: 0.8.

Return the Optimal Number of Trees When enabled, returns the optimal number of trees for the gradient boosting classification model. This is the optimal number of trees as measured against the validation data set (if the training fraction is less than 1).

Default value: yes.
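
The combined effect of the training fraction and the optimal-tree search can be illustrated with scikit-learn, which exposes per-stage validation scoring. This is an analogy only, not the operator's implementation, and the data here is synthetic.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

# Hold out 20% for validation, mirroring a training fraction of 0.8.
X, y = make_classification(n_samples=2000, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

model = GradientBoostingClassifier(n_estimators=300, learning_rate=0.01, max_depth=4)
model.fit(X_tr, y_tr)

# Score the ensemble after each additional tree and keep the best tree count.
val_losses = [log_loss(y_val, proba) for proba in model.staged_predict_proba(X_val)]
optimal_trees = int(np.argmin(val_losses)) + 1
print("optimal number of trees:", optimal_trees)
```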

Finetune Terminal Nodes Fine-tuning decision tree nodes might improve accuracy. The mathematical details are described as 'TreeBoost' in Jerome Friedman's original gradient boosting paper (https://statweb.stanford.edu/~jhf/ftp/trebst.pdf).

Default value: yes.

Maximum Number of Bins (2-65536) The maximum number of bins to use during the classification. A larger number might improve accuracy in some cases, particularly if the number of unique values in categorical features exceeds the default value. If the number of unique values in a categorical column exceeds this number, feature hashing is automatically performed on that column (see https://en.wikipedia.org/wiki/Feature_hashing).

The range of available values is 2-65536, inclusive.

Default value: 256.
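
The fallback hashing behavior can be previewed with Spark MLlib's FeatureHasher, which maps high-cardinality categorical values into a fixed number of slots. The column names and data below are hypothetical, and 256 simply mirrors the default bin count.

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import FeatureHasher

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("NYC", "red"), ("SF", "blue"), ("LA", "green")],
    ["city", "color"],
)

# Hash two categorical columns into a fixed-width vector of 256 slots.
hasher = FeatureHasher(inputCols=["city", "color"], outputCol="hashed", numFeatures=256)
hasher.transform(df).show(truncate=False)
```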

Maximum Number of Samples for Bin Finding The number of samples used to determine the numeric feature discretization. A larger number might improve accuracy in some cases.

Default value: 5000.

Discretization Type The method to use to group variable values into bins. If Equal Width, the values are divided into intervals of equal widths. If Equal Frequency, the values are sorted in ascending order and divided into a number of intervals that contain an equal number of sorted values.

Default value: Equal Width.
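
The difference between the two discretization types can be illustrated with pandas, where pd.cut produces equal-width intervals and pd.qcut produces equal-frequency intervals. The data below is synthetic and purely illustrative.

```python
import numpy as np
import pandas as pd

values = pd.Series(np.random.default_rng(0).exponential(size=1000))

# Equal Width: 4 intervals of identical width spanning the value range.
equal_width = pd.cut(values, bins=4)

# Equal Frequency: 4 intervals each holding roughly 250 of the sorted values.
equal_frequency = pd.qcut(values, q=4)

print(equal_width.value_counts().sort_index())
print(equal_frequency.value_counts().sort_index())
```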

Verbose Training If yes, the algorithm prints many more messages to the console and the log. This can be useful when troubleshooting.

Default value: no.

Spark Checkpoint Directory The HDFS path where various intermediate Spark calculations are stored. Typically, there is no need to change this.

Default value: @default_tempdir/tsds_runtime/@user_name/@flow_name.

See Workflow Variables for more information about the default value variable.
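
If you need to set an equivalent checkpoint location in your own Spark jobs, the corresponding call is SparkContext.setCheckpointDir. The path in this sketch is illustrative and is not the operator's default.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# Point Spark at an HDFS (or local) path for intermediate checkpoints.
spark.sparkContext.setCheckpointDir("hdfs:///tmp/gbt_checkpoints")
```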

Advanced Spark Settings Automatic Optimization
  • Yes specifies using the default Spark optimization settings.
  • No enables providing customized Spark optimization. Click Edit Settings to customize Spark optimization. See Advanced Settings Dialog Box for more information.

Output

Visual Output
Results display the Variable Importance value, which shows the impact each variable has on the model.
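
As an illustration of what variable importance measures, scikit-learn's gradient boosting model exposes an analogous per-feature score. This sketch is an analogy only, using synthetic data and hypothetical feature names.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=500, n_features=5, random_state=0)
model = GradientBoostingClassifier().fit(X, y)

# Per-feature importance scores, analogous to the Variable Importance output.
for name, score in zip([f"f{i}" for i in range(5)], model.feature_importances_):
    print(name, round(score, 3))
```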