Predictor

This operator applies an input model (for example, regression, classification, or clustering) to an input data set in order to predict a target value.

Information at a Glance

Note: This operator can only be used with TIBCO® Data Virtualization and Apache Spark 3.2 or later.

Parameter Description
Category Predict
Data source type TIBCO® Data Virtualization
Send output to other operators Yes
Data processing tool TIBCO® DV, Apache Spark 3.2 or later

Algorithm

The Predictor operator generates predictions from the models produced by the upstream modeling operators.

Input Model What the Predictor Calculates
Classification algorithms Class with the highest probability
Numeric regression algorithms Predicted value
Clustering algorithms Predicted cluster
Anomaly detection algorithms Anomaly class
PCA Principal components
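
For classification models, "class with the highest probability" amounts to an argmax over the per-class probabilities. The following is a minimal sketch of that idea in plain Python, not the operator's actual Spark implementation. The function name `predict_class` and the sample probabilities are hypothetical.

```python
from typing import Dict, Tuple


def predict_class(probabilities: Dict[str, float]) -> Tuple[str, float]:
    """Return the most probable class and its probability."""
    if not probabilities:
        raise ValueError("at least one class probability is required")
    # Pick the class label whose probability is largest.
    label = max(probabilities, key=probabilities.get)
    return label, probabilities[label]


# Hypothetical class probabilities for a single row.
row_probabilities = {"yes": 0.7, "no": 0.3}
pred, conf = predict_class(row_probabilities)
print(pred, conf)  # yes 0.7
```

In this sketch, `pred` plays the role of the predicted class, `conf` the probability of that class, and `row_probabilities` the full per-class breakdown.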

This operator takes one or more model objects and an input data set from upstream. Then it applies each model object to the input data and returns the prediction. Depending on the model types, the Predictor operator generates different prediction columns. For each additional input model, an index number is added to generated column names to separate them.
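
The flow described above can be sketched in plain Python as follows. This is an illustration only, not the operator's code. The function `apply_models`, the sample rows, and the toy models are hypothetical. In particular, this page does not state the exact format of the index added to column names, so the `_<n>` suffix used here is an assumption.

```python
from typing import Callable, Dict, List, Tuple

Row = Dict[str, object]
Model = Tuple[str, Callable[[Row], object]]  # (model key, predict function)


def apply_models(rows: List[Row], models: List[Model]) -> List[Row]:
    """Append one prediction column per model to a copy of each row."""
    output = []
    for row in rows:
        scored = dict(row)  # leave the input row unchanged
        seen: Dict[str, int] = {}
        for key, predict in models:
            count = seen.get(key, 0)
            seen[key] = count + 1
            # Assumed naming: the first model of a type yields PRED_<key>,
            # later ones yield PRED_<key>_<n>. The real format may differ.
            name = f"PRED_{key}" if count == 0 else f"PRED_{key}_{count}"
            scored[name] = predict(row)
        output.append(scored)
    return output


# Two toy "models" sharing the key NB (hypothetical rules and data).
models = [
    ("NB", lambda r: "yes" if r["humidity"] < 80 else "no"),
    ("NB", lambda r: "yes"),
]
rows = [{"humidity": 70}, {"humidity": 90}]
result = apply_models(rows, models)
print(result[1])  # {'humidity': 90, 'PRED_NB': 'no', 'PRED_NB_1': 'yes'}
```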

The operator supports the following models and writes the corresponding prediction columns to the user-specified output table.

Model Type Model Model Abbreviation (key) Prediction Columns

Classification Naive Bayes NB
Classification Elastic-Net Logistic Regression LOR
Classification Random Forest Classification RFC
Classification Gradient-Boosted Classification GBTC
  • PRED_<key>: The value predicted by the classification model (returns the most probable class).
  • CONF_<key>: The probability of the predicted classification.
  • INFO_<key>: The probability of each class prediction.

Regression Elastic-Net Linear Regression LR
Regression Random Forest Regression RFR
Regression Gradient-Boosted Regression GBR
  • PRED_<key>: The value predicted by the regression model.

Clustering K-Means Clustering KM
  • PRED_KM: The value of the predicted cluster.

  • DIST_KM: The distance between the cluster centroid and the observation.

Principal Component Analysis Principal Component Analysis PCA
  • y_i_PCA: The ith principal component, with i starting from zero.

Anomaly Detection Isolation Forest ISF
  • PRED_ISF: Specifies whether an observation is an anomaly. By default, 1 indicates an anomaly and 0 indicates a normal observation.

  • CONF_ISF: The anomaly score returned.
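
Two of the derived columns above can be illustrated with a short sketch: the K-Means pair (predicted cluster and distance to its centroid) and the PCA columns indexed from zero. This is plain Python for illustration, not the operator's Spark code. It assumes Euclidean distance for K-Means and a plain dot product with no centering or scaling for PCA. The function names, centroids, and component vectors are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple


def assign_cluster(point: Sequence[float],
                   centroids: List[Sequence[float]]) -> Tuple[int, float]:
    """Return (index of the nearest centroid, Euclidean distance to it)."""
    distances = [math.dist(point, c) for c in centroids]
    nearest = min(range(len(centroids)), key=distances.__getitem__)
    return nearest, distances[nearest]


def project(row: Sequence[float],
            components: List[Sequence[float]]) -> Dict[str, float]:
    """Project one observation onto each component vector (dot product)."""
    return {
        f"y_{i}_PCA": sum(x * w for x, w in zip(row, component))
        for i, component in enumerate(components)
    }


# Hypothetical centroids and component vectors.
centroids = [(0.0, 0.0), (10.0, 10.0)]
pred_km, dist_km = assign_cluster((3.0, 4.0), centroids)
print(pred_km, dist_km)  # 0 5.0

components = [(1.0, 0.0), (0.0, 1.0)]
print(project((2.0, 3.0), components))  # {'y_0_PCA': 2.0, 'y_1_PCA': 3.0}
```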

Input

One or more input TIBCO Data Virtualization modeling operators (for example, regression, classification, or clustering) and one input data set against which the models are applied.

This operator is limited by the cluster resources and Spark data frame size.

Bad or Missing Values

  • Null values are not allowed and result in an error.

  • If the input column names do not match the column names in the data set selected for model training, an error is reported.

  • A tabular input data set and at least one model object must be connected to this operator; otherwise, an error is reported.

  • The dependent variable must be present in the input data set; otherwise, the operator produces an error.
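
The data checks listed above can be approximated before running a workflow. The following is a minimal pre-flight sketch in plain Python, not the operator's own validation logic. The function `validate_input`, its message texts, and the sample rows are hypothetical.

```python
from typing import Dict, List, Optional


def validate_input(rows: List[Dict[str, Optional[object]]],
                   training_columns: List[str],
                   dependent_variable: str) -> List[str]:
    """Return a list of problems; an empty list means all checks passed."""
    problems = []
    columns = set(rows[0]) if rows else set()
    # The dependent variable must be present in the input data set.
    if dependent_variable not in columns:
        problems.append(f"dependent variable '{dependent_variable}' is missing")
    # Input column names must match those used for model training.
    missing = [c for c in training_columns if c not in columns]
    if missing:
        problems.append(f"columns missing from input: {missing}")
    # Null values are not allowed.
    for i, row in enumerate(rows):
        nulls = [c for c, v in row.items() if v is None]
        if nulls:
            problems.append(f"row {i} has null values in {nulls}")
    return problems


# Hypothetical rows: one training column is absent and one value is null.
sample = [
    {"outlook": "sunny", "play": "yes"},
    {"outlook": None, "play": "no"},
]
for problem in validate_input(sample, ["outlook", "wind"], "play"):
    print(problem)
```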

Configuration

The following table provides the configuration details for the Predictor operator.

Parameter Description
Notes Notes or helpful information about this operator's parameter settings. When you enter content in the Notes field, a yellow asterisk appears on the operator.
Output Schema Specify the schema for the output table or view.
Output Table Specify the table path and name where the results are written. By default, this is a unique table name based on your user ID, workflow ID, and operator.
Store Results When set to Yes, the operator saves the results. If set to No, the operator does not save the results.

Output

Visual Output
  • Output: Displays a table of the predicted data set.
  • Summary: Displays a list of the TIBCO DV modeling operators and their selected columns.
Output to successive operators

A table output that can be used by the downstream operator.

Example

The following example builds a Naive Bayes model and a Gradient-Boosted Tree Classification model, then combines the models with the Predictor operator.

Predictor operator workflow

Data

golf: This data set contains the following information:

  • Five columns: outlook, temperature, wind, humidity, and play.
  • 14 rows.

Parameter Setting

The parameter settings for the golf data set are as follows:

  • Store Results: Yes

Results

The following figures display the results for the golf data set with these parameter settings.

Output

Predictor operator Output tab

Summary

Predictor operator Summary tab