Rapid Deployment of Predictive Models Overview

The Statistica Rapid Deployment of Predictive Models module quickly generates predictions from one or more previously trained models, based on information stored in industry-standard PMML (Predictive Model Markup Language) deployment code. PMML is an XML-based language for encoding information (results) from data mining projects. The output predictions or class probabilities can optionally be written into the current input data file (or database, if the current input data is a query into an external database via Streaming Database Connector) for subsequent analyses involving other variables in the current input data file or data warehouse. The module is particularly well suited for generating predictions for a large number of observations (cases) because it reads through the data only once, storing the data for only a single observation at a time.
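The single-pass idea described above can be illustrated with a minimal Python sketch. It is not Statistica's implementation: the PMML document, the field names (x1, x2, y), and the helper functions load_linear_model and score_stream are all hypothetical and exist only for this illustration. The sketch reads the coefficients of a simple PMML regression model and then scores input rows one at a time, so only one observation is held in memory at any moment.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical PMML 4 document for the linear model y = 1.5 + 2.0*x1 - 0.5*x2.
PMML_TEXT = """<PMML version="4.2" xmlns="http://www.dmg.org/PMML-4_2">
  <DataDictionary>
    <DataField name="x1" optype="continuous" dataType="double"/>
    <DataField name="x2" optype="continuous" dataType="double"/>
    <DataField name="y" optype="continuous" dataType="double"/>
  </DataDictionary>
  <RegressionModel functionName="regression">
    <MiningSchema>
      <MiningField name="x1"/>
      <MiningField name="x2"/>
      <MiningField name="y" usageType="target"/>
    </MiningSchema>
    <RegressionTable intercept="1.5">
      <NumericPredictor name="x1" coefficient="2.0"/>
      <NumericPredictor name="x2" coefficient="-0.5"/>
    </RegressionTable>
  </RegressionModel>
</PMML>"""

# Namespace of the PMML 4.2 schema used in the document above.
NS = {"p": "http://www.dmg.org/PMML-4_2"}


def load_linear_model(pmml_text):
    """Return (intercept, {predictor: coefficient}) from a PMML regression table."""
    root = ET.fromstring(pmml_text)
    table = root.find(".//p:RegressionTable", NS)
    intercept = float(table.get("intercept"))
    coefs = {
        p.get("name"): float(p.get("coefficient"))
        for p in table.findall("p:NumericPredictor", NS)
    }
    return intercept, coefs


def score_stream(rows, intercept, coefs):
    """Yield one prediction per row; rows are consumed lazily, one at a time."""
    for row in rows:
        yield intercept + sum(c * float(row[name]) for name, c in coefs.items())


# Hypothetical input data; csv.DictReader reads it row by row.
DATA = "x1,x2\n1.0,2.0\n0.0,4.0\n3.0,0.0\n"
intercept, coefs = load_linear_model(PMML_TEXT)
# list() is used here only to display the results; a real deployment would
# write each prediction out as soon as it is produced.
predictions = list(score_stream(csv.DictReader(io.StringIO(DATA)), intercept, coefs))
print(predictions)
```

Because score_stream is a generator, memory use does not grow with the number of cases, which is the property that makes a single-pass design suitable for very large inputs.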

Rapid Deployment now supports deployment of legacy PMML (i.e., PMML versions 2 and 3 generated by Statistica) as well as PMML 4.

Deploying multiple models

The Rapid Deployment of Predictive Models module can evaluate multiple models simultaneously. Predictions from all models are output to a summary spreadsheet, with a voted prediction for classification models and an average prediction for regression models. All models being evaluated must have the same dependent variable but can have different predictors; models with different dependent variables cannot be connected to the same Rapid Deployment node. The evaluated models can be of different PMML versions (e.g., PMML 2.0, PMML 3.0, PMML 4.0) and from different vendors (e.g., KNIME, RapidMiner, R, Python). You can also save the predictions for further processing, along with other variables in the current input data file. This capability is extremely useful when performing detailed analyses of the predictive power of different models.
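The aggregation described above (a voted prediction for classification models, an average prediction for regression models) can be sketched in a few lines of Python. This is an illustration only: the function names and the sample predictions are hypothetical, and the tie-breaking rule (the label encountered first wins) is an assumption, since the way the module resolves ties is not stated here.

```python
from collections import Counter
from statistics import mean


def voted_prediction(labels):
    """Return the most frequent label among the models' predictions.

    Assumption: when two labels tie, the one encountered first wins.
    """
    return Counter(labels).most_common(1)[0][0]


def average_prediction(values):
    """Return the arithmetic mean of the models' numeric predictions."""
    return mean(values)


# Hypothetical predictions: one inner list per case, one entry per model.
class_preds = [["good", "bad", "good"], ["bad", "bad", "good"]]
reg_preds = [[10.0, 12.0, 14.0], [1.0, 2.0, 6.0]]

voted = [voted_prediction(case) for case in class_preds]
averaged = [average_prediction(case) for case in reg_preds]
print(voted)     # ['good', 'bad']
print(averaged)  # [12.0, 3.0]
```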

Writing Statistics to an External Database

With the Rapid Deployment of Predictive Models module, you can write computed statistics (predictions, predicted classifications, classification probabilities, residuals) back into the current input data file. This option, on the Rapid Deployment of Predictive Models dialog box - Save results tab, is available for Statistica Spreadsheets as well as for external databases connected via Streaming Database Connector. The ability to merge, for example, classification probabilities computed by various models into an existing database or data warehouse is extremely useful in data mining applications that deploy models for extremely large data sets (e.g., to compute the probabilities that particular customers in a large customer database will purchase from a mail-order catalogue). Classification probabilities are computed for each model, and a class label (for example, good or bad) is determined for each case and each model based on those probabilities. A majority-voted label is provided as the aggregated prediction for each case. Because the processing of large data sets in (remote) external databases via Streaming Database Connector is extremely efficient (e.g., it requires very little memory on the computer running the Rapid Deployment of Predictive Models module), this method of deploying fully trained models for data mining scales easily to even extremely large data sets.

Configuring the Streaming Database Connector for writing
To take advantage of the ability to write computed statistics for observations back into the database, the Streaming Database Connector must be properly configured (e.g., for read/write access in the Query Options dialog box). Also, the database fields (variables) to which you want to write must already exist in the database and must be of the correct type (e.g., you cannot write numeric information into data fields of type Text). To learn more about the options to configure the connection, refer to Streaming Database Connector Technology and the Query Options dialog box.
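The two requirements above (the target field must already exist, and it must have a compatible type) can be illustrated with a small Python sketch. It uses an in-memory SQLite database as a stand-in and does not use or emulate the Streaming Database Connector. The table customers, the column p_purchase, and the function write_predictions are hypothetical names chosen for this example.

```python
import sqlite3


def write_predictions(conn, table, key_col, target_col, predictions):
    """Write numeric predictions into an existing numeric column.

    Table and column names are trusted in this sketch; do not pass
    untrusted identifiers to a function written this way.
    """
    # PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk).
    columns = {
        row[1]: row[2].upper()
        for row in conn.execute(f"PRAGMA table_info({table})")
    }
    if target_col not in columns:
        raise ValueError(f"column {target_col!r} must already exist in {table!r}")
    if columns[target_col] not in ("REAL", "INTEGER", "NUMERIC"):
        raise TypeError(f"column {target_col!r} is not numeric")
    conn.executemany(
        f"UPDATE {table} SET {target_col} = ? WHERE {key_col} = ?",
        [(value, key) for key, value in predictions.items()],
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
# The target column p_purchase is created up front, with a numeric type.
conn.execute(
    "CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, p_purchase REAL)"
)
conn.executemany(
    "INSERT INTO customers (id, name) VALUES (?, ?)", [(1, "Ann"), (2, "Bob")]
)

# Hypothetical purchase probabilities keyed by customer id.
write_predictions(conn, "customers", "id", "p_purchase", {1: 0.82, 2: 0.17})
rows = conn.execute("SELECT id, p_purchase FROM customers ORDER BY id").fetchall()
print(rows)
```

Attempting to write into the text column name, or into a column that does not exist, raises an error in this sketch instead of silently altering the table.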

Analysis Modules (Models) that Generate PMML Code

The following analytic modules for predictive data mining generate deployment code in PMML and are therefore compatible with the Rapid Deployment of Predictive Models module:

Module                                                  PMML    PMML 4
Generalized Linear (GLZ)                                Yes     Yes
    Note:
    - Not supported for sigma-restricted type parameterization
    - Not supported for Beta distribution models
General Linear (GLM)                                    Yes     Yes
    Note: Not supported for sigma-restricted type parameterization
General Regression                                      Yes     Yes
General ANOVA                                           Yes     Yes
    Note: Not supported for sigma-restricted type parameterization
General CHAID                                           Yes     Yes
General Classification & Regression Trees               Yes     Yes
Advanced Classification and Regression Trees (ITrees)   Yes     Yes
Boosted Trees                                           Yes     Yes
Random Forests                                          Yes     Yes
Statistica Automated Neural Networks (SANN)             Yes     Yes
    Note: Not supported for SANN time series models
Support Vector Machines (SVM)                           Yes     Yes
Generalized Cluster Analysis                            Yes     Yes
Multiple Regression                                     Yes     No
Multivariate Adaptive Regression Splines (MARS)         Yes     No
General Discriminant Analysis                           Yes     No
Cox Proportional Hazards                                Yes     No
Naive Bayes                                             Yes     No
K-Nearest Neighbors                                     Yes     No
PCA (NIPALS)                                            Yes     No
PLS (NIPALS)                                            Yes     No
Sequence, Association and Link Analysis (SAL)           Yes     No
Text Mining                                             Yes     No

Neural Networks

Neural network models can be saved in PMML format and evaluated by the Rapid Deployment of Predictive Models module if the respective model or ensemble of models predicts only a single continuous or categorical dependent (outcome) variable. To simultaneously predict multiple continuous and/or categorical outcomes, use the respective features for applying fully trained networks in Statistica Automated Neural Networks (SANN) (see also the deployment of models in Statistica Automated Neural Networks).

PMML Extensions

Although the PMML standard is a promising step toward cross-platform and cross-application compatibility in data mining, it currently can accommodate only fairly simple implementations of the methods that it defines. Therefore, in most cases, special extensions had to be added to the standard to allow users to take advantage of the advanced implementations of the respective methods available in Statistica.

Program Overview

The Statistica Rapid Deployment of Predictive Models module can read one or more PMML files to compute predicted values or classes for test data based on trained models. This information can optionally be written into the current input data file (or database, if the current input data is a query into an external database via Streaming Database Connector) for subsequent analyses involving other variables in the current input data file or data warehouse. PMML code can be generated by practically all modules for predictive data mining available in Statistica, including the clustering methods (EM, K-means, and Tree) available in the Cluster Analysis module. When applicable, the program computes predicted values, the misclassification error rate for classification models, the mean squared error for regression models, and simple or overlaid lift and gains charts for binomial or multinomial classification problems. Input data must be provided so that it can be scored by the PMML model(s).
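The summary statistics named above can be written out as a short Python sketch. The definitions used here are the common textbook ones (fraction of wrong labels, mean of squared residuals, cumulative share of positive cases captured when cases are ranked by descending score, and lift as gain divided by the share of cases examined). Whether the module computes its charts in exactly this way is an assumption; the function names and data are hypothetical.

```python
def misclassification_rate(actual, predicted):
    """Fraction of cases whose predicted label differs from the actual label."""
    wrong = sum(a != p for a, p in zip(actual, predicted))
    return wrong / len(actual)


def mean_squared_error(actual, predicted):
    """Mean of squared differences between actual and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def cumulative_gains(actual, scores, positive):
    """Share of all positive cases captured after each case, ranked by score."""
    ranked = sorted(zip(scores, actual), key=lambda pair: pair[0], reverse=True)
    total = sum(1 for _, label in ranked if label == positive)
    gains, hits = [], 0
    for _, label in ranked:
        hits += label == positive
        gains.append(hits / total)
    return gains


def lift(gains):
    """Gain divided by the fraction of cases examined so far."""
    n = len(gains)
    return [g / ((i + 1) / n) for i, g in enumerate(gains)]


# Hypothetical classification results; scores are predicted P(good).
actual = ["good", "bad", "good", "bad"]
predicted = ["good", "good", "good", "bad"]
scores = [0.9, 0.6, 0.7, 0.2]

error_rate = misclassification_rate(actual, predicted)
gains = cumulative_gains(actual, scores, positive="good")
lifts = lift(gains)

# Hypothetical regression results.
mse = mean_squared_error([3.0, 5.0], [2.0, 7.0])

print(error_rate, gains, mse)
```

In this example both good cases receive the highest scores, so the gains curve reaches 1.0 after half of the cases and the lift in the top half is 2.0.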

For more information on scoring PMML 4 models, see Evaluating latest version PMML.