Running Predictive Analytics on Your Data

When creating a Data Flow, you can easily run predictive analytics on your data sets using Machine Learning functions, without prior knowledge of advanced statistics.

Train and run multiple iterations of predictive models in parallel, evaluate and compare the models, and select the model that you want to save. You can then re-run your saved model against new data sets.

Note: For more information, see the TIBCO WebFOCUS Installation and Configuration manual for your platform.

Procedure: How to Access Predictive Models

After you create a Data Flow, you can select from different model algorithms to run against your data set.

  1. Create a Data Flow in one of the following ways:

    • From the WebFOCUS Hub, click the plus menu, and then click Create Data Flow.
    • From the WebFOCUS Hub, click Application Directories, click an application, right-click a data set, and select Flow.
    • From the WebFOCUS Reporting Server browser interface, click an application, right-click a data set, and select Flow.

    The Data Flow page opens, as shown in the following image.

  2. From the Data panel, drag a data source onto the canvas, as shown in the following image.

    Note: Double-click a data source to display sample data.

  3. From the side panel, click Train Models.

    The Train Models panel opens, as shown in the following image.

    The following models display within the Train Models panel:

    • Binary Classification
    • Regression
    • Clustering
    • Anomaly Detection
    • Time-series Forecasting

    Now you can select a model to train and run against your data.

Procedure: How to Train Binary Classification Models

These models predict binary values based on four different algorithms: Random Forest, K-Nearest-Neighbors, Logistic Regression, and Extreme Gradient Boosting.

Note: When running the Binary Classification model algorithms, smaller data sets may not generate a model. Larger data sets are recommended for best results.
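WebFOCUS trains these algorithms for you, but the idea behind one of them, K-Nearest-Neighbors, can be sketched in a few lines of plain Python. This is an illustrative toy with made-up data, not the implementation the product uses:

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, x, k=3):
    """Label x by a majority vote of its k nearest training points."""
    nearest = sorted(range(len(train_X)),
                     key=lambda i: math.dist(train_X[i], x))[:k]
    votes = Counter(train_y[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Made-up two-feature training data with a binary target.
train_X = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1),
           (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
train_y = [0, 0, 0, 1, 1, 1]

print(knn_predict(train_X, train_y, (1.1, 1.0)))  # → 0
print(knn_predict(train_X, train_y, (5.1, 5.0)))  # → 1
```

With k=3, each prediction is the majority label among the three closest training points, which is why a new point near either blob inherits that blob's label.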

  1. Double-click or drag and drop the Binary Classification model to the canvas.

    The Configure dialog box displays, as shown in the following image.

    You can click the Target dropdown menu to select a different target. All numeric Field measures are selected by default as Predictors. You can add or remove Predictors by selecting or unselecting the check boxes.

  2. Click Apply.

    Your selected model type appears on the dataflow canvas, as shown in the following image.

  3. Click the Train and Predict icon to train your model.

    The Compare Model Evaluation Results dialog opens, as shown in the following image.

    The model algorithms run in parallel, allowing you to easily compare results and determine which model is best. You can filter which model comparisons you want to see by selecting or deselecting the model check boxes.

  4. Close the Compare Model Evaluation Results dialog box to return to the canvas.

    Note: To re-open the Compare Model Evaluation Results dialog box, click the Compare icon on the canvas toolbar.

    Your model data displays in the following tabs. You can select different model algorithm options from the model drop-down menu. The best model is selected by default.

    • Result. A preview of the first 50 rows of your new data set. Target and predicted columns are highlighted in yellow.
    • Evaluation. A chart that demonstrates the accuracy of the selected model.
    • Confusion Matrix. A table that compares predicted values to actual values, showing the counts of true and false positives and negatives.
    • ROC. A Receiver Operating Characteristic curve that shows the effect of the decision threshold on the true-positive and false-positive rates.
    • Precision-Recall. A Precision-Recall curve which, unlike the ROC curve, is independent of the number of true negatives.
    • Feature Importances. The most important features in your data set.

      Note: Feature Importances is available for the Random Forest model only.

    • Training Log. A report that includes the performance metrics and hyperparameter values.
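As background for the Confusion Matrix, ROC, and Precision-Recall tabs, the following plain-Python sketch computes the underlying counts and rates from hypothetical scores and labels. The actual values in the tabs come from your trained model; this only illustrates the arithmetic:

```python
# Hypothetical predicted scores and true binary labels, for illustration only.
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
y_score = [0.9, 0.2, 0.8, 0.6, 0.4, 0.1, 0.7, 0.3, 0.45, 0.55]

def confusion_counts(y_true, y_pred):
    """Counts of true/false positives and negatives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn

# A decision threshold of 0.5 turns scores into class predictions.
y_pred = [1 if s >= 0.5 else 0 for s in y_score]
tp, fp, fn, tn = confusion_counts(y_true, y_pred)

precision = tp / (tp + fp)  # of predicted positives, how many are real
recall = tp / (tp + fn)     # of real positives, how many were found
fpr = fp / (fp + tn)        # false-positive rate, the x-axis of the ROC curve

print(tp, fp, fn, tn)          # → 4 1 1 4
print(precision, recall, fpr)  # → 0.8 0.8 0.2
```

Sweeping the threshold from 1 down to 0 and plotting recall (the true-positive rate) against the false-positive rate traces out the ROC curve; plotting precision against recall traces the Precision-Recall curve, which never touches the true-negative count.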

Procedure: How to Train Regression Models

These models predict numeric values based on four different regression algorithms: Random Forest, K-Nearest-Neighbors, Polynomial Regression, and Extreme Gradient Boosting.

  1. Double-click or drag and drop the Regression model to the canvas.

    The Configure dialog box displays, as shown in the following image.

    You can click the Target dropdown menu to select a different target. All numeric Field measures are selected by default as Predictors. You can add or remove Predictors by selecting or unselecting the check boxes.

  2. Click Apply.

    Your selected model type appears on the dataflow canvas, as shown in the following image.

  3. Click the Train and Predict icon to train your model.

    The Compare Model Evaluation Results dialog opens, as shown in the following image.

    The model algorithms run in parallel, allowing you to easily compare results and determine which model is best. The best model has the lowest Root Mean Square Error value, and a scatter plot with dots closest to the red line. You can filter which model comparisons you want to see by selecting or deselecting the model check boxes.

  4. Close the Compare Model Evaluation Results dialog box to return to the canvas.

    Note: To re-open the Compare Model Evaluation Results dialog box, click the Compare icon on the canvas toolbar.

    Your model data displays in the following tabs. You can select different model algorithm options from the model drop-down menu. The best model is selected by default.

    • Result. A preview of the first 50 rows of your new data set. Target and predicted columns are highlighted in yellow.
    • Evaluation. A chart that demonstrates the accuracy of the selected model.
    • Feature Importances. The most important features in your data set.

      Note: Feature Importances is available for the Random Forest model only.

    • Training Log. A report that includes the performance metrics and hyperparameter values.
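The Root Mean Square Error used to rank regression models is straightforward to compute. The data and model predictions in this sketch are made up for illustration:

```python
import math

def rmse(actual, predicted):
    """Root Mean Square Error: the square root of the mean squared residual."""
    n = len(actual)
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)

actual = [10.0, 12.0, 15.0, 11.0]
model_a = [9.0, 13.0, 14.0, 12.0]  # hypothetical predictions from one model
model_b = [6.0, 16.0, 18.0, 8.0]   # hypothetical predictions from another

print(rmse(actual, model_a))  # → 1.0
print(rmse(actual, model_b))  # → ~3.54, so model A fits better
```

Smaller residuals produce a lower RMSE, which is why the best model's scatter plot has dots closest to the diagonal reference line.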

Procedure: How to Train Clustering Models

These models produce cluster assignments, based on two different clustering algorithms: K-Means and BIRCH. K-Means assigns each data point to the cluster with the nearest centroid, grouping points by geometric similarity. BIRCH is a hierarchical method that allows data points to be in the same cluster if they are separated by a distance smaller than a set threshold distance. Both clustering model types run at the same time with default hyperparameters.
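As an illustration of what the K-Means algorithm does internally, here is a minimal plain-Python version (not the product's implementation) run on two made-up blobs of 2-D points:

```python
import math
import random

def kmeans(points, k, iters=20, seed=0):
    """Minimal K-Means: assign each point to the nearest centroid, move
    each centroid to the mean of its assigned points, and repeat."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[j].append(p)
        for j, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster goes empty
                centroids[j] = tuple(sum(v) / len(members)
                                     for v in zip(*members))
    return centroids, clusters

# Two well-separated blobs of made-up 2-D points.
points = [(1.0, 1.1), (0.9, 1.0), (1.2, 0.8),
          (8.0, 8.1), (7.9, 8.0), (8.2, 7.8)]
centroids, clusters = kmeans(points, k=2)
print(sorted(len(c) for c in clusters))  # → [3, 3]
```

On well-separated data like this, the centroids settle on the two blob means after a few iterations, and each blob becomes one cluster.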

  1. Double-click or drag and drop the Clustering model to the canvas.

    The Configure dialog box displays, as shown in the following image.

    All numeric Field measures are selected by default as Predictors. You can add or remove Predictors by selecting or unselecting the check boxes.

  2. Click Apply.

    Your selected model type appears on the dataflow canvas, as shown in the following image.

  3. Click the Train and Predict icon to train your model.

    Your model data displays according to your model algorithm. You can select the K-Means Clustering or BIRCH Clustering algorithm from the drop-down menu. The best algorithm is selected by default.

    Your model data displays in the following tabs, using the K-Means algorithm.

    • Result. A preview of the first 50 rows of your new data set. Target and predicted columns are highlighted in yellow.
    • WCSS. Within Cluster Sum of Squares. Indicates the optimal number of clusters after which a further division into more clusters does not lead to a significant further gain.
    • Silhouette. Uses the mean intra-cluster distances and the mean nearest-cluster distances. The larger the score, the better defined the clusters. The highest possible score is 1.
    • Calinski-Harabasz. The ratio of the between-cluster dispersion and the within-cluster dispersion. The higher the score, the better defined the clusters.
    • Davies-Bouldin. Uses the cluster diameters and the inter-cluster distances. The lower the score, the better defined the clusters. The lowest possible score is zero.
    • Parallel coordinates. Shows the locations of the individual observations or centroids of the K clusters in dataspace. Clarifies along which directions the clusters are separated, and which variables are most important in separating the clusters. Locations of centroids are shown in the image below.

      Locations of individual observations are shown in the image below.

    • Cardinality. Shows the number of members, or rows in the data, per cluster. It may reveal that some very small clusters are outliers rather than useful clusters.
    • Projections. Helps you explore whether the projected K clusters overlap or are distinct.
    • Training Log. A report that includes the performance metrics and hyperparameter values.
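The WCSS statistic itself is simple to compute. This sketch evaluates it for a small hand-made clustering; in the elbow plot, the product computes this for a range of cluster counts:

```python
import math

def wcss(clusters, centroids):
    """Within-Cluster Sum of Squares: the squared distance of every point
    to its own cluster centroid, summed over all clusters."""
    return sum(math.dist(p, c) ** 2
               for members, c in zip(clusters, centroids)
               for p in members)

# A tiny hand-made clustering: two clusters, each with its centroid.
clusters = [[(1.0, 1.0), (1.0, 2.0)], [(8.0, 8.0), (9.0, 8.0)]]
centroids = [(1.0, 1.5), (8.5, 8.0)]
print(wcss(clusters, centroids))  # → 1.0 (four points, each 0.5 away)
```

WCSS always decreases as the number of clusters grows; the point where the decrease levels off (the elbow) suggests the useful number of clusters.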

    Your model data displays in the following tabs, using the BIRCH algorithm.

    • Result. A preview of the first 50 rows of your new data set. Target and predicted columns are highlighted in yellow.
    • Distance Threshold Plot. Shows the number of clusters found as a function of the threshold value. If for a wide range of threshold values the number of clusters remains approximately constant, then that number of clusters may be a good choice.
    • Parallel coordinates. Shows the locations of the individual observations or centroids of the K clusters in dataspace. Clarifies along which directions the clusters are separated, and which variables are most important in separating the clusters. Locations of centroids are shown in the image below.

      Locations of individual observations are shown in the image below.

    • Cardinality. Shows the number of members, or rows in the data, per cluster. It may reveal that some very small clusters are outliers rather than useful clusters.
    • Training Log. A report that includes the performance metrics and hyperparameter values.

Procedure: How to Train Anomaly Detection Models

These models detect anomalies, based on one algorithm: Isolation Forest.
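The intuition behind Isolation Forest is that anomalies can be separated from the rest of the data in fewer random splits. This 1-D toy captures that idea; the product's implementation builds an ensemble of random trees over all predictors, so treat this only as a sketch:

```python
import random

def isolation_depth(values, x, rng, depth=0, max_depth=10):
    """Number of random splits needed to isolate x from the other values."""
    if len(values) <= 1 or depth >= max_depth:
        return depth
    lo, hi = min(values), max(values)
    if lo == hi:
        return depth
    split = rng.uniform(lo, hi)
    # Recurse into whichever side of the split x falls on.
    side = [v for v in values if (v < split) == (x < split)]
    return isolation_depth(side, x, rng, depth + 1, max_depth)

def anomaly_score(values, x, trees=200, seed=0):
    """Average isolation depth; smaller means easier to isolate (more anomalous)."""
    rng = random.Random(seed)
    return sum(isolation_depth(values, x, rng) for _ in range(trees)) / trees

data = [10.0, 10.5, 9.8, 10.2, 10.1, 9.9, 10.3, 50.0]  # 50.0 is an outlier
print(anomaly_score(data, 50.0) < anomaly_score(data, 10.1))  # → True
```

The outlier at 50.0 is usually isolated by the very first random split, while a point inside the dense cluster needs several splits, so the outlier gets the lower (more anomalous) average depth.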

  1. Double-click or drag and drop the Anomaly Detection model to the canvas.

    The Configure dialog box displays, as shown in the following image.

    All numeric Field measures are selected by default as Predictors. You can add or remove Predictors by selecting or unselecting the check boxes.

  2. Click Apply.

    Your selected model type appears on the dataflow canvas, as shown in the following image.

  3. Click the Train and Predict icon to train your model.

    Your model data displays in the following tabs, using the Isolation Forest model algorithm.

    • Result. A preview of the first 50 rows of your new data set. Target and predicted columns are highlighted in yellow.
    • Anomaly Scores. A chart of the anomaly scores that the selected model assigns to your data points.
    • Training Log. A report that includes the performance metrics and hyperparameter values.

Procedure: How to Train Time-Series Forecasting Models

These models produce time-series forecasts based on the forecasting algorithm: Auto-SARIMA.

  1. Double-click or drag and drop the Time-Series Forecasting model to the canvas.

    The Configure dialog box displays, as shown in the following image.

    You can click the Forecast dropdown menu to select a different Forecast variable. You can choose a Date/Datetime variable by selecting its radio button.

  2. Click Apply.

    Your selected model type appears on the dataflow canvas, as shown in the following image.

  3. Click the Train and Predict icon to train your model.

    Your model data displays in the following tabs.

    • Result. Shows the forecast results of your datetime variables.
    • Forecast. SARIMA finds patterns of various kinds in the historic data. If these patterns persist over a time period that extends beyond the historic data, predictions can be made, with some degree of uncertainty. The most likely future values are displayed with a solid curve, and the 95% confidence interval is displayed as a shaded area.
    • Training Log. A report that includes the performance metrics and hyperparameter values.
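Auto-SARIMA automatically selects seasonal ARIMA orders, which is beyond a short example, but the solid-curve-plus-shaded-band output of the Forecast tab can be illustrated with a much simpler autoregressive model. This sketch fits y[t] = a + b*y[t-1] by least squares and widens the 95% interval as forecast variance accumulates (all data hypothetical):

```python
import math

def ar1_forecast(series, steps):
    """Fit y[t] = a + b*y[t-1] by least squares, then forecast `steps`
    values ahead with a 95% interval from the accumulated variance."""
    x, y = series[:-1], series[1:]
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
         / sum((xi - mx) ** 2 for xi in x))
    a = my - b * mx
    sigma2 = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y)) / n
    out, last, var = [], series[-1], 0.0
    for _ in range(steps):
        last = a + b * last            # most likely next value
        var = sigma2 + (b ** 2) * var  # forecast variance recursion
        half = 1.96 * math.sqrt(var)   # 95% confidence half-width
        out.append((last, last - half, last + half))
    return out

# A made-up upward-trending series.
series = [100.0, 102.0, 101.0, 104.0, 106.0, 105.0, 108.0, 110.0]
for point, lo, hi in ar1_forecast(series, steps=3):
    print(round(point, 1), "in", (round(lo, 1), round(hi, 1)))
```

The shaded area in the Forecast tab expresses the same idea: the confidence band widens as the forecast extends further beyond the historic data.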

Procedure: How to Edit Predictive Models

Before or after your model is trained, you can edit your model target, predictors, or datetime variables, depending on your model type. You can also edit the default parameters unique to each model.

To edit your model target and predictors, right-click the canvas model node, point to Edit Settings, and then click Target and Predictors, or for Time-Series forecasting models, click Forecast and Date/Datetime variables.

To edit your model parameters, right-click the canvas model node, point to Edit Settings, point to Parameters and Hyperparameters, and then click your model algorithm type. For Time-Series forecasting models, the Week setting of the Sampling frequency parameter is an Advanced option and may not work with your data set.

Note: For Time-Series forecasting models, if the chosen parameter for Sampling frequency is too long, it may result in too few data points to produce reliable statistics. In this case, the algorithm will modify the sampling to a shorter frequency, for example, from Year to Quarter, and redo the analysis. Sampling frequency modifications are reported in the training log.

You can also click the Model Editor icon to change targets, predictors, and parameters.

Procedure: How to Save Predictive Models

When training a model, you can save it from the Compare Model Evaluation Results dialog box. After running a model, you can save it from the tabbed panel beneath the canvas. You can then re-run your saved model against new data sets.

  1. Click the Save icon to save your model.

    The Save dialog opens, as shown in the following image.

    You can change the model algorithm, name, or location, and add a description.

  2. Click Save.

    Your model is saved to your selected folder location.

    Your saved model can be run later against new data that is similar to the data it was trained on.