Predictor

This operator applies an input model (for example, regression, classification, or clustering) to an input data set in order to predict a target value.

Information at a Glance

Note: This operator can only be used with TIBCO® Data Virtualization and Apache Spark 3.2 or later.

Parameter	Description
Category	Predict
Data source type	TIBCO® Data Virtualization
Send output to other operators	Yes
Data processing tool	TIBCO® DV, Apache Spark 3.2 or later

Algorithm

The Predictor operator is used to generate predictions based on the model(s) developed from the input model operator(s).

Input Model	What the Predictor Calculates
Classification algorithms	Class with the highest probability
Numeric regression algorithms	Predicted value
Clustering algorithms	Predicted cluster
Anomaly detection algorithms	Anomaly class
PCA	Principal components

This operator takes one or more model objects and an input data set from upstream. It then applies each model object to the input data and returns the predictions. The prediction columns that the Predictor operator generates depend on the model type. For each additional input model, an index number is added to the generated column names to distinguish them.

The operator includes the following models and prediction columns in the user-specified output table.

Model Type	Model	Model Abbreviation (key)	Prediction Columns
Classification	Naive Bayes	NB	PRED_<key>: The value predicted by the classification model (returns the most probable class). CONF_<key>: The probability of the predicted classification. INFO_<key>: The probability of each class prediction.
	Elastic-Net Logistic Regression	LOR
	Random Forest Classification	RFC
	Gradient-Boosted Classification	GBTC
Regression	Elastic-Net Linear Regression	LR	PRED_<key>: The value predicted by the regression model.
	Random Forest Regression	RFR
	Gradient-Boosted Regression	GBR
Clustering	K-Means Clustering	KM	PRED_KM: The value of the predicted cluster. DIST_KM: The distance between the cluster centroid and the observation.
Principal Component Analysis	Principal Component Analysis	PCA	y_i_PCA: The i-th principal component (i starts from zero).
Anomaly Detection	Isolation Forest	ISF	PRED_ISF: Indicates whether an observation is an anomaly. By default, 1 denotes an anomaly and 0 denotes a normal observation. CONF_ISF: The anomaly score returned.
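
To make the column definitions in the table more concrete, the following is a minimal Python sketch of how PRED, CONF, and INFO values relate to per-class probabilities, and how PRED_KM and DIST_KM relate to cluster centroids. It is an illustration only, not the operator's implementation: the function names, the dictionary layout used for INFO, the zero-based cluster numbering, and the use of Euclidean distance are all assumptions.

```python
import math


def classification_columns(class_probs, model_key):
    """Build PRED/CONF/INFO values for one row (hypothetical helper).

    class_probs maps each class label to its predicted probability.
    """
    # PRED is the most probable class; CONF is the probability of that class.
    predicted = max(class_probs, key=class_probs.get)
    return {
        f"PRED_{model_key}": predicted,
        f"CONF_{model_key}": class_probs[predicted],
        # Assumption: INFO is shown here as a plain dict of all probabilities.
        f"INFO_{model_key}": dict(class_probs),
    }


def kmeans_columns(point, centroids):
    """Build PRED_KM/DIST_KM for one row (hypothetical helper).

    Assumption: distance is Euclidean and clusters are numbered from zero.
    """
    distances = [math.dist(point, centroid) for centroid in centroids]
    nearest = min(range(len(centroids)), key=distances.__getitem__)
    return {"PRED_KM": nearest, "DIST_KM": distances[nearest]}


# Made-up probabilities and centroids, for illustration only.
nb_row = classification_columns({"yes": 0.7, "no": 0.3}, "NB")
km_row = kmeans_columns((1.0, 1.0), [(0.0, 0.0), (1.0, 2.0)])
print(nb_row)
print(km_row)
```

In this sketch, the Naive Bayes row predicts the class with probability 0.7, and the K-Means row is assigned to the second centroid because it is the closer of the two.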

Input

One or more input TIBCO Data Virtualization modeling operators (for example, regression, classification, or clustering) and one input data set against which the models are applied.

This operator is limited by the cluster resources and Spark data frame size.

Bad or Missing Values

Null values are not allowed and result in an error.
If the input column names do not match the column names in the data set selected for model training, an error is reported.
An input data set (tabular data) and at least one model object must be connected to this operator; otherwise, an error results.
The input data set must contain the dependent variable; otherwise, the operator produces an error.
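
The conditions listed above can be checked before the data reaches the operator. The following Python sketch shows one way such a pre-check might look for rows held as dictionaries. The function name, its parameters, and the message wording are hypothetical and are not part of the product; the sample rows are made up for illustration.

```python
def check_predictor_input(rows, training_columns, dependent_column):
    """Return a list of problems; an empty list means all checks passed.

    rows: list of dicts, one per record (hypothetical in-memory layout).
    training_columns: column names used when the model was trained.
    dependent_column: name of the dependent variable.
    """
    problems = []
    columns = set(rows[0]) if rows else set()

    # Column names must match those used for model training.
    missing = sorted(set(training_columns) - columns)
    if missing:
        problems.append(f"columns missing from input: {missing}")

    # The dependent variable must be present in the input data set.
    if dependent_column not in columns:
        problems.append(f"dependent variable '{dependent_column}' not in input")

    # Null values are not allowed.
    for index, row in enumerate(rows):
        nulls = sorted(name for name, value in row.items() if value is None)
        if nulls:
            problems.append(f"row {index} has null values in: {nulls}")
    return problems


# Made-up rows using the golf column names, for illustration only.
features = ["outlook", "temperature", "wind", "humidity"]
good = [{"outlook": "sunny", "temperature": 85, "wind": "false",
         "humidity": 85, "play": "no"}]
bad = [{"outlook": None, "temperature": 85, "wind": "false", "humidity": 85}]

print(check_predictor_input(good, features, "play"))
print(check_predictor_input(bad, features, "play"))
```

Collecting all problems in a list, rather than stopping at the first one, lets a single pass report every issue in the data set.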

Configuration

The following table provides the configuration details for the Predictor operator.

Parameter	Description
Notes	Notes or helpful information about this operator's parameter settings. When you enter content in the Notes field, a yellow asterisk appears on the operator.
Output Schema	Specify the schema for the output table or view.
Output Table	Specify the path and name of the table where the results are written. By default, this is a unique table name based on your user ID, workflow ID, and operator.
Store Results	When set to Yes, the operator saves the results. If set to No, the operator does not save the results.

Output

Visual Output

Output: Displays a table of the predicted data set.
Summary: Displays a list of the TIBCO DV modeling operators and their selected columns.

Output to successive operators

A table output that can be used by the downstream operator.

Example

The following example builds a Naive Bayes model and a Gradient-Boosted Tree Classification model, then combines the models with the Predictor operator.

Predictor operator workflow

Data

golf: This data set contains the following information:

Columns: outlook, temperature, wind, humidity, and play.
Rows: 14.

Parameter Setting

The parameter settings for the golf data set are as follows:

Store Results: Yes

Results

These figures display the results for the parameter settings for the golf data set.

Output

Predictor operator Output tab

Summary

Predictor operator Summary tab
