Random Forest Classification

This operator implements the Random Forest Classification algorithm from Spark MLlib.

Information at a Glance

Note: This operator can only be used with TIBCO® Data Virtualization and Apache Spark 3.2 or later.

Parameter	Description
Category	Model
Data source type	TIBCO® Data Virtualization
Send output to other operators	Yes
Data processing tool	TIBCO® DV, Apache Spark 3.2 or later

Algorithm

Random Forest Classification is an ensemble tree algorithm for classification tasks. It makes a categorical prediction by averaging the numerical predictions of the individual classification trees in the ensemble. You can fine-tune the hyper-parameters of interest with the cross-validation training method: the operator evaluates each candidate with the specified metric, and its output is the model object with the best validation performance.
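
The averaging step can be illustrated with a minimal Python sketch. It assumes each tree returns one probability per class for a row; `forest_predict` and the sample values are hypothetical and are not part of the operator or of Spark MLlib.

```python
def forest_predict(tree_probabilities):
    """Average per-tree class probabilities; return (predicted class index, averages)."""
    n_trees = len(tree_probabilities)
    n_classes = len(tree_probabilities[0])
    # Mean probability of each class across all trees in the ensemble.
    averaged = [
        sum(tree[c] for tree in tree_probabilities) / n_trees
        for c in range(n_classes)
    ]
    # The categorical prediction is the class with the highest averaged value.
    predicted = max(range(n_classes), key=lambda c: averaged[c])
    return predicted, averaged


# Hypothetical output of three trees for one row with two classes (0 and 1).
votes = [[0.9, 0.1], [0.4, 0.6], [0.3, 0.7]]
label, averaged = forest_predict(votes)
print(label, averaged)
```

In this example, averaging selects class 0 even though two of the three trees individually favor class 1, which shows how averaging numerical predictions can differ from a simple majority vote.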

Input

The input is a single tabular data set.

Bad or Missing Values

Null values are not allowed and result in an error.

Configuration

The following table provides the configuration details for the Random Forest Classification operator.

Parameter	Description
Notes	Notes or helpful information about this operator's parameter settings. When you enter content in the Notes field, a yellow asterisk appears on the operator.
Dependent Variable	Specify the categorical data column to use as the dependent column. The column must be numerical; its values cannot be text labels or class names.
Use all available columns as Predictors	When set to Yes, the operator uses all the available columns as predictors and ignores the Continuous Predictors and Categorical Predictors parameters. When set to No, the user must select at least one of the Continuous or Categorical Predictors.
Continuous Predictors	Specify the numerical data columns as independent columns. Each column must be numerical. Click Select Columns to select the required columns. Note: The columns selected in the Categorical Predictors parameter are not available.
Categorical Predictors	Specify the categorical data columns as independent columns. Note: The columns selected in the Continuous Predictors parameter are not available.
Impurity	Specify the criterion for calculating the information gain when training the Random Forest model. The following values are available: Gini, Entropy. Default: Gini
Evaluation Metric	Specify the metric for evaluating model performance during cross-validation training. The following values are available: Auto, FMeasure, Accuracy. If the user selects Auto, the operator uses Accuracy. Note: The value of the beta parameter for FMeasure is set to 1. For more information, see the Apache Spark documentation on Classification and Regression. Default: Auto
Number of Feature Functions	Specify the function to determine the number of features for building each decision tree. The following values are available: All, ⅓, Square Root, log2, User Defined. Default: Square Root
Feature Sampling Ratio	Specify the fraction of features to consider at each node when Number of Feature Functions is set to the User Defined option. The input for this parameter should be a comma-separated sequence of double values in `(0,1)`. Default: `0.5,0.7`
Max Depth	Specify the maximum depth of each tree. The input for this parameter should be a comma-separated sequence of integer values. Default: `2,3`
Number of Trees	Specify the total number of trees. The input for this parameter should be a comma-separated sequence of integer values. Default: `10,100`
Row Sampling Ratio	Specify the fraction of training data for building each decision tree. The input for this parameter should be a comma-separated sequence of double values in `(0,1]`. Default: `1`
Min Leaf Size	Specify the smallest number of data instances that can exist within a terminal leaf node of a decision tree. The input for this parameter should be a comma-separated sequence of integer values (for example, `1,2`). Default: `1`
Max Bins	Specify the maximum number of bins used for discretizing and splitting continuous features. The input for this parameter should be a comma-separated sequence of integer values (for example, 256). Note: Max Bins must be at least as large as the number of unique levels in any selected categorical column, so increase it to the maximum cardinality of the categorical features. However, depending on the available resources, the system might not be able to handle very high values, and an error might occur. Default: `32`
Number of Cross Validation Folds	Specify the number of folds used for cross-validation. Default: `3`
Random Seed	Specify the seed used for pseudo-random number generation. Default: `1`
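
Two of the options in the table reduce to short formulas. The Python sketch below shows the standard Gini and entropy impurity definitions, and one plausible mapping from each Number of Feature Functions value to a feature count. The function names are hypothetical. Rounding up is an assumption modeled on Spark MLlib's feature-subset strategies, and the operator's exact behavior is not confirmed here.

```python
import math


def gini(counts):
    """Gini impurity of a node, given the number of rows in each class."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)


def entropy(counts):
    """Entropy (base 2) of a node, given the number of rows in each class."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def features_per_split(option, n_features, ratio=None):
    """Number of features considered per split (assumed rounding: ceiling)."""
    if option == "All":
        return n_features
    if option == "⅓":
        return math.ceil(n_features / 3)
    if option == "Square Root":
        return math.ceil(math.sqrt(n_features))
    if option == "log2":
        return max(1, math.ceil(math.log2(n_features)))
    if option == "User Defined":
        if ratio is None:
            raise ValueError("User Defined requires a Feature Sampling Ratio")
        return math.ceil(ratio * n_features)
    raise ValueError(f"unknown option: {option}")


# A node with 5 rows of each of two classes is maximally impure.
print(gini([5, 5]), entropy([5, 5]))
# With 16 predictors, the default Square Root option considers 4 per split.
print(features_per_split("Square Root", 16))
```

Both impurity measures are 0 for a node that contains a single class, so a split that produces purer child nodes yields a larger information gain under either criterion.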

Output

Visual Output

Parameter Summary Info: Displays information about the input parameters and their current settings.
Variable Importance: Displays the importance of each predictor as evaluated in the training process. The second column shows the predictor's variable importance for the model, which indicates how much impact that predictor has.
Training Summary: Displays a table with a row for each tested combination of hyper-parameters. For each combination, the value of the chosen metric is displayed, and the best model is marked. This information shows which parameter values resulted in the best model.
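
The number of rows in the Training Summary follows from the comma-separated values entered for the hyper-parameters. The Python sketch below assumes that the operator tests every combination of the listed values (a Cartesian product, as Spark's parameter grids do); this is not confirmed for the operator. It uses the default values from the configuration table and omits Feature Sampling Ratio, because that parameter applies only when Number of Feature Functions is set to User Defined. The helper names are hypothetical.

```python
from itertools import product


def parse_values(text, cast):
    """Turn a comma-separated parameter string into a list of typed values."""
    return [cast(v.strip()) for v in text.split(",") if v.strip()]


def expand_grid(spec):
    """Return one dict per combination of the listed hyper-parameter values."""
    names = list(spec)
    return [
        dict(zip(names, combo))
        for combo in product(*(spec[name] for name in names))
    ]


# Default values from the configuration table.
grid_spec = {
    "Max Depth": parse_values("2,3", int),
    "Number of Trees": parse_values("10,100", int),
    "Row Sampling Ratio": parse_values("1", float),
    "Min Leaf Size": parse_values("1", int),
    "Max Bins": parse_values("32", int),
}

combinations = expand_grid(grid_spec)
folds = 3  # default Number of Cross Validation Folds
total_fits = len(combinations) * folds
print(len(combinations), "combinations,", total_fits, "cross-validation fits")
# prints: 4 combinations, 12 cross-validation fits
```

Under this assumption, each additional value in any list multiplies the number of combinations, so long lists in several parameters can increase training time considerably.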

Output to successive operators

A model object that can be used with a Predictor operator. Three additional columns are produced in the Predictor operator: