Machine Learning Program Overview
STATISTICA Machine Learning provides a number of advanced statistical methods for handling regression and classification tasks with multiple dependent and independent variables. These methods include Support Vector Machines (SVM) for regression and classification, Naive Bayes for classification, and K-Nearest Neighbors (KNN) for regression and classification. Detailed discussions of these techniques can be found in Hastie, Tibshirani, & Friedman (2001); a comprehensive introduction devoted specifically to support vector machines can be found in Cristianini and Shawe-Taylor (2000).
- Support Vector Machines (SVM)
- This method performs regression and classification tasks by constructing nonlinear decision boundaries. Because of the nature of the feature space in which these boundaries are found, Support Vector Machines can exhibit a large degree of flexibility in handling classification and regression tasks of varied complexity. STATISTICA SVM supports four types of Support Vector models and a variety of kernel functions as basis expansions, including linear, polynomial, RBF (radial basis function), and sigmoid kernels. It also provides a facility for handling imbalanced data (see the SVM sketch after this list).
- Naive Bayes
- This is a well-established Bayesian method formulated primarily for classification tasks. Its defining simplification is the assumption that the predictor variables are conditionally independent of one another given the class; this makes Naive Bayes models effective classification tools that are easy to use and interpret. Naive Bayes is particularly appropriate when the dimensionality of the input space (i.e., the number of input variables) is high, a setting where the curse of dimensionality hampers more flexible methods. For these reasons, Naive Bayes can often outperform more sophisticated classification methods. STATISTICA Naive Bayes provides a variety of distributions for modeling the conditional distributions of the inputs, including normal, lognormal, gamma, and Poisson (see the Naive Bayes sketch after this list).
- K-Nearest Neighbors (KNN)
- STATISTICA K-Nearest Neighbors is a memory-based method that, in contrast to other statistical methods, requires no training (i.e., there is no model to fit). It falls into the category of Prototype Methods and rests on the intuitive idea that objects that are close together are likely to belong to the same category. Thus, in KNN, predictions for new (i.e., unseen) data are based on a majority vote (for classification tasks) or an average (for regression tasks) over the K nearest prototype examples, hence the name K-nearest neighbors (see the KNN sketch after this list).
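To illustrate the SVM item above, here is a minimal sketch using scikit-learn as a stand-in (STATISTICA itself is operated through its own dialogs and deployment code, not this API); the synthetic data set and all parameter values are assumptions chosen only for the example:

```python
# Minimal scikit-learn analogue of an RBF-kernel SVM classifier,
# with class weighting as one common way to handle imbalanced data.
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic, deliberately imbalanced two-class data (90% / 10%).
X, y = make_classification(n_samples=500, n_features=10,
                           weights=[0.9, 0.1], random_state=0)

# kernel may be "linear", "poly", "rbf", or "sigmoid", mirroring the
# kernel choices listed above; class_weight="balanced" reweights
# misclassification costs inversely to class frequency.
clf = SVC(kernel="rbf", C=1.0, gamma="scale", class_weight="balanced")
clf.fit(X, y)
print(clf.score(X, y))  # training accuracy, for illustration only
```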
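For the Naive Bayes item, a comparable sketch uses scikit-learn's GaussianNB, which corresponds to the normal conditional-distribution option mentioned above (the lognormal, gamma, and Poisson options have no direct scikit-learn counterpart; the iris data set is assumed purely for the example):

```python
# Naive Bayes with normally distributed inputs: each feature is modeled
# independently within each class (the conditional-independence assumption).
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
nb = GaussianNB().fit(X, y)
print(nb.predict_proba(X[:3]))  # class posteriors for the first 3 cases
```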
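And for the KNN item, a sketch of the memory-based idea (again a scikit-learn stand-in with an assumed data set and an assumed K = 5):

```python
# KNN keeps the training cases in memory; "fitting" only stores the
# prototypes. Prediction is a majority vote over the K nearest neighbors
# (an average would be used for regression).
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
knn = KNeighborsClassifier(n_neighbors=5)  # K = 5
knn.fit(X, y)
print(knn.predict(X[:3]))
```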
These programs are scalable and hence capable of handling data sets with a large number of cases. In addition, they can handle both continuous and categorical independent (predictor) variables. When appropriate, data pre-processing in the form of scaling is provided to enhance predictive ability (i.e., the ability to correctly predict unseen data). All programs provide three distinct methods for partitioning the data set into training and testing subsets. Furthermore, cross-validation can be performed, when relevant, on the training data to select model parameters from a set of candidate values. Together, these options address the problem of over-fitting, both by restricting model complexity and by providing an independent check of model performance on the test set (see the sketch below).
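The following sketch mirrors that partitioning-plus-cross-validation workflow, once more using scikit-learn as a stand-in rather than STATISTICA's interface; the 25% hold-out fraction and the candidate K values are assumptions for the example:

```python
# Hold-out split for an independent test check, plus 5-fold cross-validation
# on the training data to select K from a set of candidate values.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25,
                                          random_state=0)

# Scaling lives inside the pipeline so the test data never leaks
# into pre-processing (cf. the scaling option described above).
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())
grid = GridSearchCV(pipe,
                    {"kneighborsclassifier__n_neighbors": [1, 3, 5, 7, 9]},
                    cv=5)
grid.fit(X_tr, y_tr)
print(grid.best_params_)          # parameter chosen by cross-validation
print(grid.score(X_te, y_te))     # independent check on the test set
```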
A large number of graphs and spreadsheets can be computed to evaluate the quality of the fit and to aid with the interpretation of results. Various code generator options are available for saving estimated (fully parameterized) models for deployment in C/C++/C#, Visual Basic, or PMML (see also, Using C/C++/C# Code for Deployment). Also, STATISTICA Machine Learning is fully automated and is an integral part of STATISTICA Data Miner, which you can use to construct tailor-made applications.
See the Machine Learning Index, Support Vector Machines Introductory Overview, Naive Bayes Classifier Introductory Overview, and K-Nearest Neighbors Introductory Overview.