Rapid Deployment of Predictive Models Overview
The Statistica Rapid Deployment of Predictive Models module quickly generates predictions from one or more previously trained models, using information stored in industry-standard PMML (Predictive Model Markup Language) deployment code. PMML is an XML-based language for encoding information (results) from data mining projects. The output predictions or class probabilities can optionally be written into the current input data file (or into the database, if the current input data is a query into an external database via the Streaming Database Connector) for subsequent analyses involving other variables in the current input data file or data warehouse. The module is particularly well suited for generating predictions for a large number of observations (cases) because it makes a single pass through the data, holding only one observation in memory at a time.
Rapid Deployment supports legacy PMML (that is, PMML versions 2 and 3 generated by Statistica) as well as PMML 4.
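The single-pass idea can be illustrated with a minimal Python sketch. It is not Statistica code: the PMML fragment is a hand-written, stripped-down regression model (a schema-complete PMML document would also carry Header, DataDictionary, and MiningSchema elements), and the field names `age` and `income` and the helper functions are hypothetical. The sketch reads the coefficients once, then scores a CSV stream one row at a time so that only the current observation is held in memory.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical, stripped-down PMML 4 regression fragment (illustration only).
PMML_TEXT = """<PMML version="4.4" xmlns="http://www.dmg.org/PMML-4_4">
  <RegressionModel functionName="regression">
    <RegressionTable intercept="1.5">
      <NumericPredictor name="age" coefficient="0.2"/>
      <NumericPredictor name="income" coefficient="0.001"/>
    </RegressionTable>
  </RegressionModel>
</PMML>"""

NS = {"p": "http://www.dmg.org/PMML-4_4"}


def load_linear_model(pmml_text):
    """Read the intercept and numeric coefficients from the regression table."""
    root = ET.fromstring(pmml_text)
    table = root.find(".//p:RegressionTable", NS)
    intercept = float(table.get("intercept"))
    coefs = {
        node.get("name"): float(node.get("coefficient"))
        for node in table.findall("p:NumericPredictor", NS)
    }
    return intercept, coefs


def score_stream(rows, intercept, coefs):
    """Yield one prediction per row; rows are consumed lazily, one at a time."""
    for row in rows:
        yield intercept + sum(c * float(row[name]) for name, c in coefs.items())


if __name__ == "__main__":
    # A small in-memory CSV stands in for a large input file.
    data = io.StringIO("age,income\n30,1000\n40,2000\n")
    intercept, coefs = load_linear_model(PMML_TEXT)
    for prediction in score_stream(csv.DictReader(data), intercept, coefs):
        print(prediction)
```

Because `score_stream` is a generator over `csv.DictReader`, memory use does not grow with the number of cases, which is the property the module description emphasizes.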
Deploying multiple models
The Rapid Deployment of Predictive Models module can evaluate multiple models simultaneously. Predictions from all models are output to a summary spreadsheet, together with a voted prediction for classification models or an averaged prediction for regression models. All models being evaluated must have the same dependent variable, but they can have different predictors; models with different dependent variables cannot be connected to the same Rapid Deployment node. The evaluated models can be of different PMML versions (e.g., PMML 2.0, PMML 3.0, PMML 4.0) and from different vendors (e.g., KNIME, RapidMiner, R, Python). You can also save the predictions for further processing, along with other variables in the current input data file. This capability is extremely useful when performing detailed analyses of the predictive power of different models.
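The aggregation described above (a vote for classification, an average for regression) can be sketched in a few lines of Python. This is an illustration under assumptions, not the module's implementation: the function names and model names are hypothetical, and the tie-breaking rule (the label seen first wins) is chosen here only because the description does not specify one.

```python
from collections import Counter
from statistics import mean


def vote(labels):
    """Return the most frequent label; ties go to the label seen first (assumption)."""
    return Counter(labels).most_common(1)[0][0]


def aggregate(per_model_predictions, task):
    """Combine predictions case by case.

    per_model_predictions maps a model name to its list of predictions,
    one entry per case, all for the same dependent variable.
    """
    if task not in ("classification", "regression"):
        raise ValueError("task must be 'classification' or 'regression'")
    columns = list(per_model_predictions.values())
    n_cases = len(columns[0])
    if any(len(column) != n_cases for column in columns):
        raise ValueError("every model must score the same number of cases")
    combine = vote if task == "classification" else mean
    # zip(*columns) regroups the per-model lists into one tuple per case.
    return [combine(case) for case in zip(*columns)]


if __name__ == "__main__":
    classes = {
        "tree": ["good", "bad"],
        "svm": ["good", "good"],
        "glz": ["bad", "bad"],
    }
    print(aggregate(classes, "classification"))
    print(aggregate({"glm": [1.0, 2.0], "mars": [3.0, 4.0]}, "regression"))
```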
Writing Statistics to an External Database
With the Rapid Deployment of Predictive Models module, you can write computed statistics (predictions, predicted classifications, classification probabilities, residuals) back into the current input data file. This option, on the Rapid Deployment of Predictive Models dialog box - Save results tab, is available for Statistica Spreadsheets as well as for external databases connected via the Streaming Database Connector. The ability to merge, for example, classification probabilities computed by various models into an existing database or data warehouse is extremely useful in data mining applications that deploy models to extremely large data sets (e.g., to compute the probabilities that particular customers in a large customer database will purchase from a mail-order catalogue). Classification probabilities are computed for each model, and a class label (for example, good or bad) is determined for each case and each model based on those probabilities. A majority-voted label is then provided as the aggregated prediction for each case. Because the processing of large data sets in (remote) external databases via the Streaming Database Connector is extremely efficient (e.g., it requires very little memory on the computer running the Rapid Deployment of Predictive Models module), this method of deploying fully trained models scales easily to even extremely large data sets.
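The write-back pattern can be sketched as follows. This is only an analogy: Python's built-in `sqlite3` module stands in for an external database, it does not use the Streaming Database Connector, and the `customers` table, its columns, the logistic coefficients, and the 0.5 cutoff for the good/bad label are all invented for illustration. Rows are read in small batches keyed by `id`, scored, and updated in place, so memory use stays bounded regardless of table size.

```python
import math
import sqlite3


def purchase_probability(recency_days, orders):
    """Hypothetical logistic model; the coefficients are invented for this sketch."""
    z = 0.5 - 0.01 * recency_days + 0.3 * orders
    return 1.0 / (1.0 + math.exp(-z))


def write_back(conn, batch_size=1000):
    """Score the customers table batch by batch and store the results in place."""
    last_id = 0
    while True:
        rows = conn.execute(
            "SELECT id, recency_days, orders FROM customers "
            "WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, batch_size),
        ).fetchall()
        if not rows:
            break
        updates = []
        for customer_id, recency_days, orders in rows:
            p = purchase_probability(recency_days, orders)
            label = "good" if p >= 0.5 else "bad"  # assumed cutoff
            updates.append((p, label, customer_id))
        conn.executemany(
            "UPDATE customers SET p_purchase = ?, label = ? WHERE id = ?", updates
        )
        conn.commit()
        last_id = rows[-1][0]  # resume after the last id seen


def make_demo_db():
    """Build a tiny in-memory table standing in for a large customer database."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE customers (id INTEGER PRIMARY KEY, recency_days REAL, "
        "orders INTEGER, p_purchase REAL, label TEXT)"
    )
    conn.executemany(
        "INSERT INTO customers (id, recency_days, orders) VALUES (?, ?, ?)",
        [(1, 10, 5), (2, 300, 0), (3, 100, 1)],
    )
    conn.commit()
    return conn


if __name__ == "__main__":
    conn = make_demo_db()
    write_back(conn, batch_size=2)
    for row in conn.execute("SELECT id, p_purchase, label FROM customers ORDER BY id"):
        print(row)
```

Paging by primary key rather than updating rows under an open cursor keeps the read and write steps cleanly separated, which is a design choice of this sketch rather than a statement about how the module works internally.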
Analysis Modules (Models) that Generate PMML Code
The following analytic modules for predictive data mining generate deployment code in PMML and are therefore compatible with the Rapid Deployment of Predictive Models module:
Module | PMML | PMML 4 |
---|---|---|
Generalized Linear (GLZ) | Yes | Yes. Note: Not supported for sigma-restricted type parametrization; not supported for Beta distribution models |
General Linear (GLM) | Yes | Yes. Note: Not supported for sigma-restricted type parametrization |
General Regression | Yes | Yes |
General ANOVA | Yes | Yes. Note: Not supported for sigma-restricted type parametrization |
General CHAID | Yes | Yes |
General Classification & Regression Trees | Yes | Yes |
Advanced Classification and Regression Trees (ITrees) | Yes | Yes |
Boosted Trees | Yes | Yes |
Random Forests | Yes | Yes |
Statistica Automated Neural Networks (SANN) | Yes | Yes. Note: Not supported for SANN time series models |
Support Vector Machines (SVM) | Yes | Yes |
Generalized Cluster Analysis | Yes | Yes |
Multiple Regression | Yes | No |
Multivariate Adaptive Regression Splines (MARS) | Yes | No |
General Discriminant Analysis | Yes | No |
Cox Proportional Hazards | Yes | No |
Naive Bayes | Yes | No |
K-Nearest Neighbors | Yes | No |
PCA (NIPALS) | Yes | No |
PLS (NIPALS) | Yes | No |
Sequence, Association and Link Analysis (SAL) | Yes | No |
Text Mining | Yes | No |
Neural Networks
Neural network models can be saved in PMML format and evaluated by the Rapid Deployment of Predictive Models module only if the respective model or ensemble of models predicts a single continuous or categorical dependent (outcome) variable. To predict multiple continuous and/or categorical outcomes simultaneously, use the features for applying fully trained networks in Statistica Automated Neural Networks (SANN); see also the deployment of models in Statistica Automated Neural Networks.
PMML Extensions
Even though the PMML standard is a promising development to bring cross-platform and cross-application compatibility to data mining, it currently can accommodate only fairly simple implementations of the methods that are defined. Therefore, in most cases, special extensions had to be added to the standard in order to allow users to take advantage of the advanced implementations of the respective methods available in Statistica.
Program Overview
The Statistica Rapid Deployment of Predictive Models module can read single or multiple PMML files to compute predicted values or classes for test data based on trained models. This information can optionally be written into the current input data file (or into the database, if the current input data is a query into an external database via the Streaming Database Connector) for subsequent analyses involving other variables in the current input data file or data warehouse. PMML code can be generated by practically all modules for predictive data mining available in Statistica, including the clustering methods (EM, K-means & Tree) available in the Cluster Analysis module. Input data must be provided to be scored by the PMML model(s). When applicable, the program computes predicted values, the misclassification error rate for classification models, the mean squared error for regression models, and simple or overlaid lift and gains charts for binomial or multinomial classification problems.
For more information on scoring PMML 4 models, see Evaluating latest version PMML.
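The summary statistics named above have simple definitions, sketched here in plain Python. The function names are hypothetical and the binning scheme for the gains points (equal-sized fractions of the cases, ranked by descending score) is one common convention assumed for illustration; it is not taken from the module itself.

```python
def misclassification_rate(actual, predicted):
    """Fraction of cases whose predicted class differs from the observed class."""
    return sum(a != p for a, p in zip(actual, predicted)) / len(actual)


def mean_squared_error(actual, predicted):
    """Average squared difference between observed and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def cumulative_gains(actual, scores, positive, n_bins=10):
    """Return (fraction of cases, fraction of positives captured) per bin.

    Cases are ranked by descending score, so early bins hold the cases the
    model considers most likely to belong to the positive class.
    """
    ranked = [a for _, a in sorted(zip(scores, actual), key=lambda t: t[0], reverse=True)]
    total_positive = sum(a == positive for a in ranked)
    if total_positive == 0:
        raise ValueError("no cases of the positive class in the data")
    points = []
    for b in range(1, n_bins + 1):
        cutoff = round(b * len(ranked) / n_bins)
        captured = sum(a == positive for a in ranked[:cutoff])
        points.append((cutoff / len(ranked), captured / total_positive))
    return points


def lift(points):
    """Lift is the gain divided by the fraction of cases examined."""
    return [(fraction, gain / fraction) for fraction, gain in points if fraction > 0]


if __name__ == "__main__":
    actual = ["good", "bad", "good", "bad"]
    scores = [0.9, 0.2, 0.8, 0.1]  # predicted probability of "good"
    gains = cumulative_gains(actual, scores, positive="good", n_bins=2)
    print(gains)
    print(lift(gains))
```

For a multinomial problem, the same gains computation can be repeated with each class in turn as `positive`, which corresponds to overlaying one curve per class.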