Gradient Boosting Classification

A predictive method in which a series of shallow decision trees incrementally reduces the prediction errors of the preceding trees. This method can be used for both classification and regression.
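
To illustrate the idea, the following is a minimal conceptual sketch in Python, not the operator's Spark implementation: each shallow tree is fit to the pointwise negative gradient of the loss left by the ensemble built so far, and is added with a small shrinkage weight. The data, tree count, depth, and shrinkage values here are illustrative only.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def boost(X, y, n_trees=100, shrinkage=0.01, max_depth=4):
    """Fit a sequence of shallow trees, each correcting the ensemble so far."""
    F = np.zeros(len(y), dtype=float)     # current raw scores of the additive model
    trees = []
    for _ in range(n_trees):
        residual = y - sigmoid(F)          # negative gradient of the log loss
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, residual)
        F += shrinkage * tree.predict(X)   # add a small, shrunken correction
        trees.append(tree)
    return trees

# Illustrative usage on synthetic data.
X = np.random.default_rng(0).normal(size=(200, 3))
y = (X[:, 0] + X[:, 1] > 0).astype(float)
ensemble = boost(X, y, n_trees=50)
```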

Information at a Glance

Category: Model
Data source type: HD
Sends output to other operators: Yes
Data processing tool: Spark

See Gradient Boosting for more information.

Input

A tabular data set.

Configuration

Parameter Description
Notes Any notes or helpful information about this operator's parameter settings. When you enter content in the Notes field, a yellow asterisk is displayed on the operator.
Dependent Column Specify the dependent column for the gradient boosting classification model. Select the data column to treat as the dependent variable for the classification.
Note: Be sure that the columns chosen as the Independent and Dependent variables are different. If a column selected as the Dependent variable is also selected in the Independent variable list, an error occurs.
  • This is the quantity to model or predict.
  • For the Gradient Boosting Classification operator, the dependent column must be binary: either a categorical column with exactly two values, or a numerical column containing only 0 and 1. Multi-class classification is currently not supported.
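
If the outcome column is categorical with two values, it can be recoded to 0/1 before being used as the dependent column. The following PySpark sketch is illustrative only; the column names and values are hypothetical.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical input with a two-valued categorical outcome column.
df = spark.createDataFrame(
    [("a", 12.0, "yes"), ("b", 7.5, "no")],
    ["id", "usage", "churned"],
)

# Recode the two categories to 0/1 so the column can serve as the dependent column.
df = df.withColumn("label", F.when(F.col("churned") == "yes", 1).otherwise(0))
df.show()
```
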
Independent Columns The independent variable data columns to include in the gradient boosting tree training. At least one column must be specified. Click Select Values to open the dialog box for selecting the available columns from the input data set for analysis.

Check/Uncheck the box in front of the column names to select/de-select the columns.

Loss Function Choose the loss function to use to calculate the gradient boosting trees. Choosing a different loss function leads to different interpretations for trained models and might have an effect on the final prediction accuracy as well. For mathematical details of different loss functions, see various online resources; for example, http://www.saedsayad.com/docs/gbm2.pdf has good technical definitions of different loss functions.

AdaBoost: Uses the exponential loss function.

Logistic: Uses the log loss function.

TruncatedHinge: Uses a truncated hinge loss function. It can be somewhat more accurate on problems with many outlier training samples. Intuitively, this means there is a cap on the penalty imposed on grossly misclassified examples, which outliers tend to produce. Currently, the cap is fixed, but in the future it might be exposed as a configurable parameter.
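
For reference, standard textbook forms of these three losses are sketched below in Python, with labels in {-1, +1} and f denoting the model's raw score. The operator's exact formulations, and in particular the fixed cap used by the truncated hinge loss, are not documented here, so the cap value shown is an assumption.

```python
import numpy as np

def exponential_loss(y, f):                 # AdaBoost
    return np.exp(-y * f)

def logistic_loss(y, f):                    # Logistic (log loss)
    return np.log1p(np.exp(-y * f))

def truncated_hinge_loss(y, f, cap=2.0):    # cap value is illustrative only
    return np.minimum(np.maximum(0.0, 1.0 - y * f), cap)
```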

Number of Trees Number of trees used to train the gradient boosting classification model. Boosting uses results from previous trees to find training samples that need more attention (that is, have larger losses). The more trees, the more accurate the training predictions; however, validation accuracy might decline if there are too many trees. The number of trees also depends highly on the shrinkage parameter: typically, a smaller shrinkage value means more trees are needed, and vice versa.
Important: The runtime of this operator is linear in the number of trees. For example, selecting 200 trees takes roughly twice as long as selecting 100 trees.

Default value: 100

Maximum Tree Depth Sets the "depth" of the tree: the maximum number of levels of nodes it can branch out to beneath the root node. A tree stops growing deeper when either a node becomes empty (that is, there are no more examples to split in the current node) or the depth of the tree exceeds this Maximum Tree Depth limit. The smaller the depth, the shallower and weaker the individual trees; smaller trees also often necessitate a larger number of trees.
  • Valid values are -1 or any integer greater than 0.
  • A value of -1 means "no bound": the tree can grow to any depth or number of decision nodes until its nodes become empty.

Default value: 4.

Minimum Node Split Size Specifies the minimum size (number of members) a node in the tree must have to allow a further split. If a node has fewer data members than the Minimum Node Split Size, it becomes a leaf (end) node in the tree. As with the Maximum Tree Depth parameter above, a larger node split size means smaller trees.
  • The range of possible values is any integer ≥ 1.

Default value: 10.

Bagging Rate The approximate fraction of training data that is sampled (without replacement) when training each tree. For example, if this value is 0.5, the first tree is trained on a random 50% of the training data set, the second on a different random 50%, and so on. An appropriate value might improve model performance by mitigating overfitting.

Default value: 0.5.

Shrinkage Weight given to individual trees. The smaller this number is, the more trees one might need. The larger the number is, the fewer trees one might need.

Default value: 0.01.
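
To see how the tree-shaping parameters above fit together, the following sketch configures Spark MLlib's GBTClassifier with roughly analogous settings. This is an analogy only, not the operator's implementation; the data and column names are hypothetical, and the parameter mapping is approximate.

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import GBTClassifier

spark = SparkSession.builder.getOrCreate()

# Hypothetical training data with two numeric features and a 0/1 label.
rows = [(float(i % 20), float((i * 7) % 13), i % 2) for i in range(200)]
df = spark.createDataFrame(rows, ["usage", "tenure", "label"])
features = VectorAssembler(inputCols=["usage", "tenure"], outputCol="features")

gbt = GBTClassifier(
    labelCol="label",
    featuresCol="features",
    maxIter=100,              # roughly: Number of Trees
    maxDepth=4,               # roughly: Maximum Tree Depth
    minInstancesPerNode=10,   # roughly: Minimum Node Split Size
    subsamplingRate=0.5,      # roughly: Bagging Rate
    stepSize=0.01,            # roughly: Shrinkage
)
model = gbt.fit(features.transform(df))
```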

Fraction of Data for Training The fraction of the data set to use for training the gradient boosting trees. The rest of the data set is used to measure validation performance while training is in progress. This allows the training algorithm to estimate the optimal number of trees with respect to validation accuracy.

Default value: 0.8.

Return the Optimal Number of Trees When enabled, returns the optimal number of trees for the gradient boosting classification model. This is the optimal number of trees as measured against the validation data set (if the training fraction is less than 1).

Default value: yes.
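
The combined effect of the training fraction and the optimal-tree search can be illustrated with scikit-learn, which exposes per-stage validation scoring. This is an analogy only, not the operator's implementation, and the data here is synthetic.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

# Hold out 20% for validation, mirroring a training fraction of 0.8.
X, y = make_classification(n_samples=2000, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

model = GradientBoostingClassifier(n_estimators=300, learning_rate=0.01, max_depth=4)
model.fit(X_tr, y_tr)

# Score the ensemble after each additional tree and keep the best tree count.
val_losses = [log_loss(y_val, proba) for proba in model.staged_predict_proba(X_val)]
optimal_trees = int(np.argmin(val_losses)) + 1
print("optimal number of trees:", optimal_trees)
```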

Finetune Terminal Nodes Fine-tuning decision tree nodes might improve accuracy. The mathematical details are described as 'TreeBoost' in Jerome Friedman's original gradient boosting paper (https://statweb.stanford.edu/~jhf/ftp/trebst.pdf).

Default value: yes.

Maximum Number of Bins (2-65536) The maximum number of bins to use during the classification. A larger number might improve accuracy in some cases, particularly if the number of unique values in categorical features exceeds the default value. If the number of unique values in a categorical column exceeds this number, feature hashing is automatically performed on that column (see https://en.wikipedia.org/wiki/Feature_hashing).

The range of available values is 2-65536, inclusive.

Default value: 256.
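
The fallback hashing behavior can be previewed with Spark MLlib's FeatureHasher, which maps high-cardinality categorical values into a fixed number of slots. The column names and data below are hypothetical, and 256 simply mirrors the default bin count.

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import FeatureHasher

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("NYC", "red"), ("SF", "blue"), ("LA", "green")],
    ["city", "color"],
)

# Hash two categorical columns into a fixed-width vector of 256 slots.
hasher = FeatureHasher(inputCols=["city", "color"], outputCol="hashed", numFeatures=256)
hasher.transform(df).show(truncate=False)
```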

Maximum Number of Samples for Bin Finding The number of samples used to determine the numeric feature discretization. A larger number might improve accuracy in some cases.

Default value: 5000.

Discretization Type The method to use to group variable values into bins. If Equal Width, the values are divided into intervals of equal widths. If Equal Frequency, the values are sorted in ascending order and divided into a number of intervals that contain an equal number of sorted values.

Default value: Equal Width.
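
The difference between the two discretization types can be illustrated with pandas, where pd.cut produces equal-width intervals and pd.qcut produces equal-frequency intervals. The data below is synthetic and purely illustrative.

```python
import numpy as np
import pandas as pd

values = pd.Series(np.random.default_rng(0).exponential(size=1000))

# Equal Width: 4 intervals of identical width spanning the value range.
equal_width = pd.cut(values, bins=4)

# Equal Frequency: 4 intervals each holding roughly 250 of the sorted values.
equal_frequency = pd.qcut(values, q=4)

print(equal_width.value_counts().sort_index())
print(equal_frequency.value_counts().sort_index())
```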

Verbose Training If yes, the algorithm prints many more messages to the console and the log. This can be useful when troubleshooting.

Default value: no.

Spark Checkpoint Directory The HDFS path where various intermediate Spark calculations are stored. Typically, there is no need to change this.

Default value: @default_tempdir/tsds_runtime/@user_name/@flow_name.

See Workflow Variables for more information about the default value variable.
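
If you need to set an equivalent checkpoint location in your own Spark jobs, the corresponding call is SparkContext.setCheckpointDir. The path in this sketch is illustrative and is not the operator's default.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# Point Spark at an HDFS (or local) path for intermediate checkpoints.
spark.sparkContext.setCheckpointDir("hdfs:///tmp/gbt_checkpoints")
```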

Advanced Spark Settings Automatic Optimization
  • Yes specifies using the default Spark optimization settings.
  • No enables providing customized Spark optimization. Click Edit Settings to customize Spark optimization. See Advanced Settings Dialog Box for more information.

Output

Visual Output
Results display the Variable Importance value, which shows the impact each variable has on the model.
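
As an illustration of what variable importance measures, scikit-learn's gradient boosting model exposes an analogous per-feature score. This sketch is an analogy only, using synthetic data and hypothetical feature names.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=500, n_features=5, random_state=0)
model = GradientBoostingClassifier().fit(X, y)

# Per-feature importance scores, analogous to the Variable Importance output.
for name, score in zip([f"f{i}" for i in range(5)], model.feature_importances_):
    print(name, round(score, 3))
```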