Machine Learning Program Overview
Statistica Machine Learning provides a number of advanced statistical methods for handling regression and classification tasks with multiple dependent and independent variables. These methods include Support Vector Machines (SVM) for regression and classification, Naive Bayes for classification, and K-Nearest Neighbors (KNN) for regression and classification. Detailed discussions of these techniques can be found in Hastie, Tibshirani, & Friedman (2001); a specialized, comprehensive introduction to support vector machines can be found in Cristianini and Shawe-Taylor (2000).
These programs are scalable and, hence, capable of handling data sets with a large number of cases. In addition, they can handle both continuous and categorical independent (predictor) variables. When appropriate, data pre-processing in the form of scaling is provided to enhance model predictive ability (i.e., the ability to correctly predict unseen data). All programs provide three distinct methods for partitioning the data set into training and test subsets. Furthermore, when relevant, cross-validation can be performed on the training data to select model parameters from a set of given values. These options address the problem of over-fitting, either by restricting model complexity or by providing an independent check of model performance on the test set.
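The workflow described above (scaling, a train/test partition, and cross-validation over a set of candidate parameter values) can be illustrated with a minimal sketch in plain Python. This is not Statistica code and does not reflect how Statistica implements these steps. The synthetic data, the 75/25 split, the min-max scaling, the five folds, and all function names (min_max_scale, knn_predict, accuracy, cross_validate) are assumptions chosen for illustration, and K-Nearest Neighbors is used only because it is the simplest of the three methods to write out.

```python
import random
from collections import Counter

def min_max_scale(train, test):
    """Scale features using ranges learned from the training rows only."""
    cols = list(zip(*train))
    lo = [min(c) for c in cols]
    span = [(max(c) - min(c)) or 1.0 for c in cols]  # guard against zero range

    def scale(rows):
        return [[(v - l) / s for v, l, s in zip(r, lo, span)] for r in rows]

    return scale(train), scale(test)

def knn_predict(train_x, train_y, query, k):
    """Majority vote among the k training rows nearest to query."""
    order = sorted(
        range(len(train_x)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_x[i], query)),
    )
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]

def accuracy(train_x, train_y, test_x, test_y, k):
    """Fraction of test rows classified correctly after scaling."""
    train_s, test_s = min_max_scale(train_x, test_x)
    hits = sum(knn_predict(train_s, train_y, q, k) == y
               for q, y in zip(test_s, test_y))
    return hits / len(test_y)

def cross_validate(xs, ys, k, folds=5):
    """Mean accuracy over held-out folds of the training data."""
    scores = []
    for f in range(folds):
        held = list(range(f, len(xs), folds))
        held_set = set(held)
        kept = [i for i in range(len(xs)) if i not in held_set]
        scores.append(accuracy([xs[i] for i in kept], [ys[i] for i in kept],
                               [xs[i] for i in held], [ys[i] for i in held], k))
    return sum(scores) / folds

# Synthetic two-class data (illustrative assumption); the second feature
# has a much larger range than the first, which is why scaling matters.
rng = random.Random(0)
data = []
for label, (cx, cy) in (("a", (0.0, 0.0)), ("b", (3.0, 300.0))):
    for _ in range(60):
        data.append(([rng.gauss(cx, 1.0), rng.gauss(cy, 100.0)], label))
rng.shuffle(data)

# Partition into training and test subsets (75% / 25%).
split = int(0.75 * len(data))
train, test = data[:split], data[split:]
train_x, train_y = [x for x, _ in train], [y for _, y in train]
test_x, test_y = [x for x, _ in test], [y for _, y in test]

# Select k by cross-validation on the training data only.
candidates = [1, 3, 5, 7, 9]
cv_scores = {k: cross_validate(train_x, train_y, k) for k in candidates}
best_k = max(candidates, key=lambda k: cv_scores[k])

# The untouched test set gives an independent check of the chosen model.
test_accuracy = accuracy(train_x, train_y, test_x, test_y, best_k)
print("best k:", best_k, "test accuracy:", round(test_accuracy, 3))
```

Note that the scaling ranges are computed from the training rows alone and then applied to the held-out rows, so neither the cross-validation folds nor the test set influence the pre-processing. Under these assumptions, that separation is what makes the test-set accuracy an independent check.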
A large number of graphs and spreadsheets can be produced to evaluate the quality of the fit and to aid with the interpretation of results. Various code generator options are available for saving estimated (fully parameterized) models for deployment in C/C++/C#, Visual Basic, or PMML (see also Using C/C++/C# Code for Deployment). In addition, Statistica Machine Learning is fully automated and is an integral part of Statistica Data Miner, which you can use to construct tailor-made applications.
See the Machine Learning Index, Support Vector Machines Introductory Overview, Naive Bayes Classifier Introductory Overview, and K-Nearest Neighbors Introductory Overview.