Rapid Deployment of Predictive Models Overview
The Statistica Rapid Deployment of Predictive Models module quickly generates predictions from one or more previously trained models based on information stored in industry-standard PMML (Predictive Model Markup Language) deployment code. PMML is an XML-based language for encoding information (results) from data mining projects. The output predictions or class probabilities can optionally be written into the current input data file (or into the database, if the current input data is a query into an external database via Streaming Database Connector) for subsequent analyses involving other variables in the current input data file or data warehouse. The module is particularly well suited for generating predictions for a large number of observations (cases) because it makes a single pass through the data, storing only the data for one observation at a time.
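The single-pass behavior described above can be illustrated with a minimal Python sketch. It is not Statistica code: `score_stream` and `toy_model` are hypothetical names, and the toy linear model only stands in for a scorer that would normally be built from a PMML file.

```python
import csv
import io
from typing import Callable, Dict, Iterable, Iterator


def score_stream(rows: Iterable[Dict[str, str]],
                 model: Callable[[Dict[str, str]], float]) -> Iterator[float]:
    """Yield one prediction per row, holding only the current row in memory."""
    for row in rows:
        yield model(row)


def toy_model(row: Dict[str, str]) -> float:
    # Hypothetical linear model standing in for a scorer built from PMML.
    return 1.0 + 2.0 * float(row["x"])


# A small in-memory CSV stands in for a large input file or database query.
data = io.StringIO("x\n0\n1\n2.5\n")

# list() is used here only to display the results; a real deployment would
# write each prediction out as it is produced instead of collecting them.
predictions = list(score_stream(csv.DictReader(data), toy_model))
print(predictions)
```

Because the reader and the scorer are both lazy, memory use in this sketch does not grow with the number of observations, which is the property the module relies on for large data sets.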
Rapid Deployment now supports deployment of legacy PMML (that is, PMML versions 2 and 3 generated by Statistica) as well as PMML 4.
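As a rough illustration of how legacy and PMML 4 files can be told apart, the sketch below reads the `version` attribute that the PMML standard places on the root `PMML` element. The function names and the two tiny example documents are assumptions made for this example; they do not describe how Statistica itself inspects files.

```python
import xml.etree.ElementTree as ET


def pmml_version(pmml_text: str) -> str:
    """Return the version attribute of the root PMML element."""
    root = ET.fromstring(pmml_text)
    # Documents may declare an XML namespace, so drop any "{namespace}" prefix.
    local_name = root.tag.rsplit("}", 1)[-1]
    if local_name != "PMML":
        raise ValueError(f"not a PMML document: root element is {local_name!r}")
    return root.attrib["version"]


def is_legacy(version: str) -> bool:
    """Treat major versions below 4 as legacy, following the text above."""
    return int(version.split(".")[0]) < 4


# Hypothetical, heavily trimmed documents used only to exercise the functions.
legacy_doc = '<PMML version="3.0"><Header/></PMML>'
current_doc = ('<PMML version="4.2" xmlns="http://www.dmg.org/PMML-4_2">'
               '<Header/></PMML>')

for doc in (legacy_doc, current_doc):
    version = pmml_version(doc)
    print(version, "legacy" if is_legacy(version) else "PMML 4")
```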
Deploying multiple models
The Rapid Deployment of Predictive Models module can evaluate multiple models simultaneously. Predictions from all models are output to a summary spreadsheet, with a voted prediction for classification models and an average prediction for regression models. All the models being evaluated must have the same dependent variable but can have different predictors. Models with different dependent variables cannot be connected to the same Rapid Deployment node. The evaluated models can be of different versions (e.g., PMML 2.0, PMML 3.0, PMML 4.0) and from different vendors (e.g., KNIME, RapidMiner, R, Python). You can also save the predictions for further processing, along with other variables in the current input data file. This capability is extremely useful when performing detailed analyses of the predictive power of different models.
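The way predictions from several models are combined can be sketched as follows. This is a minimal illustration, not the module's implementation: the function names are hypothetical, and the tie-breaking rule (the label seen first among tied labels wins) is an assumption, since the text does not say how ties are resolved.

```python
from collections import Counter
from statistics import mean
from typing import Sequence


def voted_prediction(labels: Sequence[str]) -> str:
    """Return the most frequent label among the models' predictions.

    Assumption: on a tie, the label that appears first in the input wins.
    """
    return Counter(labels).most_common(1)[0][0]


def average_prediction(values: Sequence[float]) -> float:
    """Return the arithmetic mean of the models' numeric predictions."""
    return mean(values)


# One case scored by three classification models and three regression models.
class_votes = ["good", "bad", "good"]
regression_outputs = [10.0, 12.0, 14.0]

print(voted_prediction(class_votes))
print(average_prediction(regression_outputs))
```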
Writing Statistics to an External Database
With the Rapid Deployment of Predictive Models module, you can write computed statistics (predictions, predicted classifications, classification probabilities, residuals) back into the current input data file. This option, on the Rapid Deployment of Predictive Models dialog box - Save results tab, is available for Statistica Spreadsheets as well as for external databases connected via Streaming Database Connector. The ability to merge, for example, classification probabilities computed by various models into an existing database or data warehouse is extremely useful in data mining applications that deploy models for extremely large data sets (e.g., to compute the probabilities that particular customers in a large customer database will purchase from a mail-order catalogue). In addition, classification probabilities are computed for each model, and a class label (for example, good or bad) is determined for each case for each model based on those probabilities. A majority-voted label is provided as the aggregated prediction for each case. Because the processing of large data sets in (remote) external databases via Streaming Database Connector is extremely efficient (e.g., requiring very little memory on the computer running the Rapid Deployment of Predictive Models module), this method of deploying fully trained models for data mining scales easily to even extremely large data sets.
- Configuring the Streaming Database Connector for writing
- In order to take advantage of the ability to write computed statistics for observations back into the database, the Streaming Database Connector must be properly configured (e.g., for read/write access in the Query Options dialog box). Also, the database fields (variables) to which you want to write must already exist in the database, and must be of the correct type (e.g., you cannot write numeric information into data fields of type Text). To learn more about the options to configure the connection, refer to Streaming Database Connector Technology and the Query Options dialog box.
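The requirement that the target field already exists before predictions are written back can be illustrated with a generic database sketch. Python's built-in `sqlite3` module stands in for the external database here; this is not the Streaming Database Connector API, and the table name `customers`, the column `p_purchase`, and both function names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# The numeric target column p_purchase is created in advance, mirroring the
# requirement that the field must already exist with a suitable type.
conn.execute(
    "CREATE TABLE customers (id INTEGER PRIMARY KEY, spend REAL, p_purchase REAL)"
)
conn.executemany(
    "INSERT INTO customers (id, spend) VALUES (?, ?)", [(1, 20.0), (2, 80.0)]
)


def column_exists(connection: sqlite3.Connection, table: str, column: str) -> bool:
    """Check the table schema for the named column (SQLite-specific PRAGMA)."""
    names = [row[1] for row in connection.execute(f"PRAGMA table_info({table})")]
    return column in names


def write_back(connection: sqlite3.Connection, predictions: dict) -> None:
    """Write one predicted probability per customer id into p_purchase."""
    if not column_exists(connection, "customers", "p_purchase"):
        raise RuntimeError("target field p_purchase does not exist")
    connection.executemany(
        "UPDATE customers SET p_purchase = ? WHERE id = ?",
        [(prob, cust_id) for cust_id, prob in predictions.items()],
    )
    connection.commit()


write_back(conn, {1: 0.15, 2: 0.72})
rows = conn.execute("SELECT id, p_purchase FROM customers ORDER BY id").fetchall()
print(rows)
```

The sketch only updates existing rows and never alters the schema, which matches the described behavior of writing into fields that are already present.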
Analysis Modules (Models) that Generate PMML Code
The following analytic modules for predictive data mining generate deployment code in PMML and are therefore compatible with the Rapid Deployment of Predictive Models module:
Module                                                  PMML    PMML 4
Generalized Linear (GLZ)                                Yes     Yes
General Linear (GLM)                                    Yes     Yes
General Regression                                      Yes     Yes
General ANOVA                                           Yes     Yes
General CHAID                                           Yes     Yes
General Classification & Regression Trees               Yes     Yes
Advanced Classification and Regression Trees (ITrees)   Yes     Yes
Boosted Trees                                           Yes     Yes
Random Forests                                          Yes     Yes
Statistica Automated Neural Networks (SANN)             Yes     Yes
Support Vector Machines (SVM)                           Yes     Yes
Generalized Cluster Analysis                            Yes     Yes
Multiple Regression                                     Yes     No
Multivariate Adaptive Regression Splines (MARS)         Yes     No
General Discriminant Analysis                           Yes     No
Cox Proportional Hazards                                Yes     No
Naive Bayes                                             Yes     No
K-Nearest Neighbors                                     Yes     No
PCA (NIPALS)                                            Yes     No
PLS (NIPALS)                                            Yes     No
Sequence, Association and Link Analysis (SAL)           Yes     No
Text Mining                                             Yes     No
Neural Networks
Neural network models can be saved in PMML format and evaluated by the Rapid Deployment of Predictive Models module if the respective model or ensemble of models predicts only a single continuous or categorical dependent (outcome) variable. To simultaneously predict multiple continuous and/or categorical outcomes, use the respective features for applying fully trained networks in Statistica Automated Neural Networks (SANN) (see also the deployment of models in Statistica Automated Neural Networks).
PMML Extensions
Even though the PMML standard is a promising development to bring cross-platform and cross-application compatibility to data mining, it currently can accommodate only fairly simple implementations of the methods that are defined. Therefore, in most cases, special extensions had to be added to the standard in order to allow users to take advantage of the advanced implementations of the respective methods available in Statistica.
Program Overview
The Statistica Rapid Deployment of Predictive Models module can read single or multiple PMML files to compute predicted values or classes for test data based on trained models. This information can optionally be written into the current input data file (or into the database, if the current input data is a query into an external database via Streaming Database Connector) for subsequent analyses involving other variables in the current input data file or data warehouse. PMML code can be generated by practically all modules for predictive data mining available in Statistica, including the clustering methods (EM, K-means & Tree) available in the Cluster Analysis module. When applicable, the program will compute predicted values, the misclassification error rate for classification models, the mean squared error for regression models (the input data must be provided in order to be scored by the PMML model or models), and simple or overlaid lift and gains charts for binomial or multinomial classification problems.
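The two error measures named above follow their standard definitions and can be sketched in a few lines of Python. The function names and sample values are illustrative only and are not taken from Statistica.

```python
from typing import Sequence


def misclassification_rate(observed: Sequence[str], predicted: Sequence[str]) -> float:
    """Fraction of cases whose predicted class differs from the observed class."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    wrong = sum(o != p for o, p in zip(observed, predicted))
    return wrong / len(observed)


def mean_squared_error(observed: Sequence[float], predicted: Sequence[float]) -> float:
    """Average of the squared differences between observed and predicted values."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed)


# Illustrative data: one of four classifications is wrong; one of three
# regression predictions is off by 2.
error_rate = misclassification_rate(
    ["good", "bad", "good", "bad"], ["good", "good", "good", "bad"]
)
mse = mean_squared_error([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
print(error_rate, mse)
```

Both measures need the observed values of the dependent variable alongside the predictions, so they can only be computed when the input data contains that variable.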
For more information on scoring PMML 4 models, see Evaluating latest version PMML.