Gradient Boosting Regression

A predictive method in which a series of shallow decision trees incrementally reduces the prediction errors of the preceding trees. This method can be used for both regression and classification.

Information at a Glance

Category: Model
Data source type: HD
Sends output to other operators: Yes
Data processing tool: Spark

For more information, see Gradient Boosting.

Input

A tabular data set.

Configuration

Notes

Any notes or helpful information about this operator's parameter settings. When you enter content in the Notes field, a yellow asterisk is displayed on the operator.
Dependent Column

Specify the data column to be treated as the dependent variable for the Gradient Boosting Regression model.
Note: Be sure that the columns chosen for the independent and dependent variables are different. If a field chosen as the dependent variable is also selected in the Independent Columns list, an error occurs.
  • This is the quantity to model or predict.
  • For the companion Gradient Boosting Classification operator, the dependent column must be binary: either a categorical value, or a numerical value that takes only 0 or 1. Multi-class classification is currently not supported.
Independent Columns

Select the independent variable data columns to include in the gradient boosting tree training.
  • At least one column must be specified. Click Select Values to open the dialog box for selecting the available columns from the input data set for analysis.
  • Check or uncheck the box in front of a column name to select or deselect that column.
Loss Function

The loss function used for fitting the Gradient Boosting trees. The choice of loss function changes how the trained model is interpreted and might also affect final prediction accuracy. For mathematical details of the different loss functions, see the literature; for example, http://www.saedsayad.com/docs/gbm2.pdf gives good technical definitions of the Gaussian, Laplacian, and Poisson loss functions.
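For reference, one common parameterization of these losses (following standard gradient boosting treatments; the operator's exact internal definitions are an assumption here), where y is the observed value and F is the model's current prediction:

    L_{Gaussian}(y, F) = \tfrac{1}{2}(y - F)^2
    L_{Laplace}(y, F)  = |y - F|
    L_{Poisson}(y, F)  = e^{F} - yF    (negative log-likelihood under a log link, up to an additive constant)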
Number of Trees

*required

The number of trees used to train the Gradient Boosting Regression model. Boosting uses the results from previous trees to find training samples that need more attention (that is, have larger losses). The more trees, the more accurate the training predictions become, but validation accuracy might decline if there are too many trees. The appropriate number of trees also depends strongly on the shrinkage parameter: typically, a smaller shrinkage value requires more trees, and vice versa.
Note: The runtime of this operator scales linearly with the number of trees. For example, training with 200 trees takes roughly twice as long as training with 100 trees.

Default value: 100.

Maximum Tree Depth

*required

Sets the maximum depth of each tree, that is, the maximum number of levels it can branch out to beneath the root node. A tree stops growing deeper when either a node becomes empty (there are no more examples to split in the current node) or the depth of the tree reaches this Maximum Tree Depth limit. The smaller the depth, the shallower and 'weaker' the individual trees are; shallower trees also often require a larger number of trees. (See the sizing note after this entry.)
  • Valid values are -1 or any integer greater than 0.
  • A value of -1 means no bound: the tree can take on any size or number of decision nodes until its nodes become empty.

Default value: 4.
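For a sense of scale, assuming binary splits (standard for gradient boosted trees, though an assumption about this operator's internals), a tree of depth d has at most 2^d leaf nodes:

    leaves_max = 2^d, so the default depth of 4 allows at most 2^4 = 16 leaves per tree.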

Minimum Node Split Size

*required

Specifies the minimum size (number of members) that a node in the tree must have to allow a further split. If a node has fewer data members than the Minimum Node Split Size, it becomes a leaf (end) node of the tree. As with the Maximum Tree Depth parameter above, a larger node split size means smaller trees.

The range of possible values is any integer ≥ 1.

Default value: 10.

Bagging Rate

*required

The approximate fraction of the training data that is sampled (without replacement) when training each tree. For example, if this value is 0.5, the first tree is trained on a random 50% of the training data set, the second tree on a different random 50%, and so on. An appropriate value can improve model performance by mitigating overfitting.

Default value: 0.5.

Shrinkage

*required

The weight given to each individual tree, also known as the learning rate. The smaller this value is, the more trees are typically needed; the larger it is, the fewer. (A sketch of this trade-off follows this entry.)

Default value: 0.01.
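To illustrate the trade-off between Shrinkage and Number of Trees, here is a brief sketch using Spark MLlib's GBTRegressor, whose stepSize and maxIter parameters are analogous to those two settings. The analogy and the column names are assumptions for illustration, not this operator's actual implementation.

    from pyspark.ml.regression import GBTRegressor

    # Two configurations that often reach a similar training loss (a rule of
    # thumb, not a guarantee): cutting the shrinkage by 10x typically calls
    # for roughly 10x more trees.
    coarse = GBTRegressor(labelCol="y", featuresCol="features",
                          stepSize=0.1, maxIter=100)    # larger shrinkage, fewer trees
    fine = GBTRegressor(labelCol="y", featuresCol="features",
                        stepSize=0.01, maxIter=1000)    # smaller shrinkage, more trees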

Fraction of Data for Training

*required

The fraction of the data used for training the Gradient Boosting trees. The remainder of the data set is used to measure validation performance during training, which lets the training algorithm estimate the number of trees that is optimal for validation accuracy. (A sketch follows this entry.)

Default value: 0.8.
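To make the idea concrete, the following sketch performs an 80/20 split with Spark MLlib, whose validationIndicatorCol mechanism stops boosting once validation error stops improving. Whether this operator uses the same mechanism internally is an assumption, and the DataFrame `data`, its columns, and the seed are hypothetical.

    from pyspark.sql import functions as F
    from pyspark.ml.regression import GBTRegressor

    # Assumes `data` has a "features" vector column and a label column "y".
    train, valid = data.randomSplit([0.8, 0.2], seed=7)
    both = (train.withColumn("is_val", F.lit(False))
                 .unionByName(valid.withColumn("is_val", F.lit(True))))

    gbt = GBTRegressor(labelCol="y", featuresCol="features", maxIter=500,
                       validationIndicatorCol="is_val")  # early stop on validation error
    model = gbt.fit(both)
    print(model.getNumTrees)  # the number of trees actually kept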

Return the Optimal Number of Trees

When enabled, returns the optimal number of trees for the Gradient Boosting Regression model, as measured against the validation data set (if the training fraction is less than 1).

Default value: yes.

Finetune Terminal Nodes

Fine-tuning the decision trees' terminal nodes might improve accuracy. The mathematical details are described as TreeBoost in Jerome Friedman's original gradient boosting paper (https://statweb.stanford.edu/~jhf/ftp/trebst.pdf).

Default value: yes.

Maximum Number of Bins (2-65536)

The maximum number of bins to use when discretizing feature values. A larger number might improve accuracy in some cases, particularly if the number of unique values in categorical features exceeds the default value. If the number of unique values in a categorical column exceeds this number, feature hashing is automatically performed on that column.

The range of available values is 2-65536, inclusive.

Default value: 256.

Maximum Number of Samples for Bin Finding

*required

The number of samples used to determine the numeric feature discretization. A larger number might improve accuracy in some cases.

Default value: 5000.

Discretization Type

The method used to group variable values into bins; a small sketch follows this entry. If Equal Width (the default), the values are divided into intervals of equal width. If Equal Frequency, the values are sorted in ascending order and divided into intervals that each contain an equal number of values.
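The following NumPy snippet (illustrative only; not the operator's code, and the toy values are hypothetical) shows how the two strategies produce different bin edges for the same column:

    import numpy as np

    x = np.array([1, 2, 2, 3, 4, 8, 9, 20, 40, 100], dtype=float)

    # Equal Width: four intervals of equal width spanning [min, max].
    width_edges = np.linspace(x.min(), x.max(), num=5)
    # -> [  1.  ,  25.75,  50.5 ,  75.25, 100.  ]

    # Equal Frequency: four intervals holding (roughly) equal numbers of points.
    freq_edges = np.quantile(x, [0.0, 0.25, 0.5, 0.75, 1.0])
    # -> [  1.  ,   2.25,   6.  ,  17.25, 100.  ]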
Repartition Data

A Spark operation that might improve training performance (speed) in certain cases. Typically, there is no need to change this parameter.

Default value: no.

Verbose Training

If set to yes, the algorithm prints many more messages to the console and log, which can be useful when troubleshooting.

Default value: no.

Spark Checkpoint Directory

*required

An HDFS path where various intermediate Spark calculations are stored. Typically, there is no need to change this parameter.

Default value: @default_tempdir/tsds_runtime/@user_name/@flow_name.

Advanced Spark Settings Automatic Optimization
  • Yes specifies using the default Spark optimization settings.
  • No enables providing customized Spark optimization. Click Edit Settings to customize Spark optimization. See Advanced Settings Dialog Box for more information.
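As an end-to-end illustration, the sketch below maps this operator's main settings onto the closest equivalents in Spark MLlib's GBTRegressor. The toy data, the column names, and the parameter correspondences are assumptions made for illustration; this is not the operator's own code.

    from pyspark.sql import SparkSession
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.regression import GBTRegressor

    spark = SparkSession.builder.appName("gbt-regression-sketch").getOrCreate()

    # Toy data: two independent columns (x1, x2) and a dependent column (y).
    df = spark.createDataFrame(
        [(1.0, 2.0, 3.5), (2.0, 1.0, 4.0), (3.0, 0.5, 5.5), (4.0, 2.5, 7.0)],
        ["x1", "x2", "y"])
    assembled = (VectorAssembler(inputCols=["x1", "x2"], outputCol="features")
                 .transform(df))

    gbt = GBTRegressor(
        labelCol="y",            # Dependent Column
        featuresCol="features",  # Independent Columns
        lossType="squared",      # Loss Function (Gaussian analogue)
        maxIter=100,             # Number of Trees
        maxDepth=4,              # Maximum Tree Depth
        minInstancesPerNode=10,  # Minimum Node Split Size
        subsamplingRate=0.5,     # Bagging Rate
        stepSize=0.01,           # Shrinkage
        maxBins=256)             # Maximum Number of Bins

    # (On data this small, minInstancesPerNode=10 prevents any splits, so the
    # trees degenerate to stumps; realistic data sets allow real splits.)
    model = gbt.fit(assembled)
    model.transform(assembled).select("y", "prediction").show()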