SVM Classification

Classifies data (both linearly and non-linearly separable) by separating it into the most distant and distinct groups possible.

Information at a Glance

Category: Model
Data source type: HD
Sends output to other operators: Yes
Data processing tool: MapReduce

Algorithm

The basic premise behind SVM Classification is finding the maximum-margin hyperplane (MMH) in order to minimize the chance of misclassifying the training points. The SVM algorithm runs through the possible ways to divide a data set into classes using a "hyperplane"; it then chooses the division with the largest margin, or separation, between the classes' boundary points, called support vectors.

  • Support vectors are the data points closest to the optimal decision boundary between classes.
  • The margin is the distance from the hyperplane to the support vectors.
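
The following minimal sketch (using scikit-learn as an illustrative stand-in for the Team Studio operator, with invented toy data) shows how a maximum-margin fit exposes its support vectors and margin width:

```python
# Minimal illustration of the maximum-margin idea. scikit-learn stands in
# for the Team Studio operator; the toy data is invented for illustration.
import numpy as np
from sklearn.svm import SVC

# Two linearly separable toy classes.
X = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.5],    # class 0
              [5.0, 5.0], [5.5, 6.0], [6.0, 5.5]])   # class 1
y = np.array([0, 0, 0, 1, 1, 1])

model = SVC(kernel="linear", C=1.0).fit(X, y)

# The support vectors are the boundary points that define the margin.
print("Support vectors:\n", model.support_vectors_)

# For a linear SVM, the margin width is 2 / ||w||.
w = model.coef_[0]
print("Margin width:", 2.0 / np.linalg.norm(w))
```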

The following shows a simplified example of this maximal margin division process, with the defining support vector data points circled.



Figures (a) & (b): Support Vector Machine Algorithm Overview Diagram

In Figure (a), the data is separated by a vertical "hyperplane," resulting in a small margin between the two classes.

In Figure (b), the data is separated by a diagonal "hyperplane," resulting in a larger margin between the two classes.

The SVM Classification algorithm would choose the classification of Figure (b) over the classification of Figure (a) due to the larger margin.

The algorithm iteratively calculates the geometric distances of the data points to the candidate boundaries, which reduces to a well-understood quadratic programming (optimization) problem.

The goal of the common SVM formulation is to find the optimal linear decision boundary in terms of the available training points. The above example shows a data set that is considered to be linearly separable. However, the following diagram shows a data set where it is not possible to draw a straight line to separate the classes. This is referred to as the data set having a "nonlinear decision boundary."



Figure (c): Example of non-linearly separable data, meaning it is not possible to draw a straight line to separate the classes.

In addition to acting as a linear model, the SVM algorithm can handle noisy and non-linearly separable situations using the concepts of the Soft Margin and the Kernel Trick.

  • The Soft Margin idea relaxes the original SVM optimization problem (the so-called primal hard margin) so that it behaves more flexibly, tolerating a limited amount of non-linearity (noise).
  • The Kernel Trick is an easy and efficient mathematical way of mapping data sets into a higher-dimensional space, where the algorithm looks for a "hyperplane" in the hope that the data has a linearly separable representation there. This is simply the method of mapping observations from a general set, S, into an inner product space, V (equipped with its natural norm), without ever having to compute the mapping explicitly.
  • The three Kernel Methods currently supported by the Team Studio SVM Classification operator are the Dot Product/Linear, the Polynomial Kernel, and the Gaussian Kernel. A sketch comparing the three follows this list.
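
To illustrate why the Kernel Trick matters, the following minimal sketch (using scikit-learn as an illustrative stand-in for the Team Studio operator; the concentric-circles data set is generated, not from this documentation) fits the three supported kernel types to data with a nonlinear decision boundary:

```python
# Comparing the three supported kernel types on non-linearly separable data.
# scikit-learn is used as an illustrative stand-in for the Team Studio operator.
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=400, factor=0.3, noise=0.1, random_state=0)

for kernel in ("linear", "poly", "rbf"):       # rbf == Gaussian
    acc = SVC(kernel=kernel, degree=2, gamma="scale").fit(X, y).score(X, y)
    print(f"{kernel:>6} kernel training accuracy: {acc:.2f}")
# The linear kernel cannot separate the circles; poly and rbf can.
```
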
Normalization Type
Within the SVM algorithm, the Normalization type sets the mathematical approach to normalizing the SVM Classification data. The following formulas are used for normalization options:
Option | Description | Formula
Standardization | The default normalization approach. | FeatureValue' = (FeatureValue − Mean) / StdDev
Unit Vector | Helpful when the modeler needs the data to be in a smaller scale (that is, have smaller values). | FeatureValue' = FeatureValue / SampleNorm
Mixed | Performs both Standardization and Unit Vector normalization on the data. | -
None | No normalization. | -
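
As a small numeric sketch of these formulas (assuming Standardization operates per feature column, Unit Vector per sample row, and Mixed applies Standardization first; NumPy is used for illustration):

```python
import numpy as np

X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 400.0]])

# Standardization: x' = (x - mean) / stddev, per feature column.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# Unit Vector: x' = x / ||x||, per sample row, yielding smaller-scale values.
X_unit = X / np.linalg.norm(X, axis=1, keepdims=True)

# Mixed (assumed order): Standardization first, then Unit Vector.
X_mixed = X_std / np.linalg.norm(X_std, axis=1, keepdims=True)
print(X_std, X_unit, X_mixed, sep="\n\n")
```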

The SVM Classification algorithm is currently only supported for Hadoop data sources, as follows:

  • For the Team Studio Hadoop (MapReduce) version, the Team Studio SVM Classification operator implements the primal form of the L2-regularized, multi-class, cost-sensitive Hinge loss function introduced by Crammer and Singer (2001) [3], along with the idea of explicit random feature mappings to approximate the kernel trick. The random feature mappings are established based on classical theorems of harmonic analysis (Bochner 1994 and Schoenberg 1942).
  • The supported Kernel functions for Hadoop sources are the Linear/Dot Product, Polynomial, and Gaussian functions.
  • For the Gaussian function, as a shift invariant kernel, Team Studio implements the Random Fourier transformation presented by Rahimi and Recht (2007).
  • For the Polynomial function, Team Studio implements the Random Maclaurin Feature Maps presented by Kar and Karnick (2012).
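
The following is a compact, self-contained sketch of the Random Fourier transformation idea from Rahimi and Recht (2007): inner products of randomized cosine features approximate the Gaussian kernel. It is an independent illustration, not the Team Studio implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
d, D, gamma = 5, 2000, 0.5          # input dim, random feature dim, kernel width

# Random Fourier features for the (shift-invariant) Gaussian kernel:
# z(x) = sqrt(2/D) * cos(W x + b), with W ~ N(0, 2*gamma) and b ~ U[0, 2*pi],
# so that E[z(x) . z(y)] = exp(-gamma * ||x - y||^2).
W = rng.normal(scale=np.sqrt(2.0 * gamma), size=(D, d))
b = rng.uniform(0.0, 2.0 * np.pi, size=D)

def z(x):
    return np.sqrt(2.0 / D) * np.cos(W @ x + b)

x, y = rng.normal(size=d), rng.normal(size=d)
exact = np.exp(-gamma * np.sum((x - y) ** 2))   # Gaussian kernel value
approx = z(x) @ z(y)                            # inner product of random features
print(f"exact={exact:.4f}  approx={approx:.4f}")
```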

Input

A Hadoop data set that contains the dependent and independent variables for modeling.

Configuration

The following configuration parameters are the minimum settings required to use SVM Classification.

  • Dependent Column
  • Maximum Number of Iterations
  • Columns
  • Kernel Type
Parameter: Description
Notes: Any notes or helpful information about this operator's parameter settings. When you enter content in the Notes field, a yellow asterisk is displayed on the operator.
Dependent Column: A dependent column must be specified for the SVM Classification. Select the data column to be considered the dependent variable for the classification. This is the quantity to model or predict. The dependent column can be either categorical or continuous.

The cardinality of the dependent column should not be greater than 1000. Therefore, there can be at most 1000 classes in the model.

Maximum Number of Iterations: Controls the number of times the SVM Classification algorithm is run against the data.
  • When choosing the dot product (linear) kernel, a higher number of iterations generally yields higher training accuracy. For other kernel types, a smaller number of iterations generally gives good results, both in terms of accuracy and computational cost.
  • When using a Kernel Type other than Linear (Gaussian or Polynomial), the modeler typically should not set Maximum Number of Iterations above 40, because those kernels converge quickly and more iterations do not typically make the model more accurate. When using the Linear type, it is reasonable to set Maximum Number of Iterations higher, such as 100.

Default value: 100.

Columns: Sets the independent variable data columns to include for the SVM training. You must specify at least one column.
  • Either sparse columns (output from the Collapse operator) or non-sparse columns are supported, but not a mix of the two.
  • The data dimensionality limits are 10 million values for a column of datatype = "sparse" and a total of 4000 independent data columns.

Click Select Columns to open the dialog box for selecting the available columns from the input data set for analysis. For more information, see Select Column Dialog Box.

Kernel Type: Specifies which implementation of a data transformation algorithm, or "kernel trick," to use to represent data in a higher-dimensional space. Kernel Methods (KMs) are a class of data transformation algorithms (equations) for pattern analysis, where the goal is to find and study general types of relations (for example, clusters, rankings, principal components, correlations, and classifications) in general types of data (such as sequences, text documents, sets of points, vectors, and images).

According to Mercer's Theorem (1909), the inner product of two data vectors in a higher-dimensional space can be represented in the original space by a mathematical formula, or kernel representation. Specifically, it states that every positive semi-definite symmetric function is a kernel. A kernel is a primal-space representation of an inner product of two data vectors in the higher-dimensional space: K(xi, xj) = φ(xi) · φ(xj)
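
As a quick numeric check of this identity (illustrative only), the degree-2 polynomial kernel (x · y)² equals the dot product of explicitly mapped vectors under the feature map φ(x) = (x₁², √2·x₁x₂, x₂²):

```python
import numpy as np

def phi(v):
    # Explicit degree-2 feature map for 2-D input: (v1^2, sqrt(2)*v1*v2, v2^2).
    return np.array([v[0] ** 2, np.sqrt(2) * v[0] * v[1], v[1] ** 2])

x = np.array([1.0, 2.0])
y = np.array([3.0, 4.0])

print((x @ y) ** 2)      # kernel evaluated in the original (primal) space: 121.0
print(phi(x) @ phi(y))   # inner product after the explicit mapping: 121.0
```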

Note: If the independent variables are in sparse column format, the only choice for Kernel Type is Linear.

Kernel type options are as follows.

  • Gaussian - Implements the Gaussian radial basis function exp(−γ‖x − y‖²).

  • Linear (the default) - For the Hadoop MapReduce data sources, this is the explicit default kernel method, used even for linearly separable data in order to save on the computational cost of solving the optimization problem. It implements the standard dot product of two data vectors. This option is chosen when the modeler believes the data source to be linearly separable (or to have "smooth" decision boundaries).

  • Polynomial - Represents the dot product of two vectors raised to some exponential degree, d: (x · y)^d. The Polynomial Kernel Method is popular for natural language processing (NLP) applications.

The Polynomial and Gaussian kernel types should be chosen when the modeler believes the data source is not linearly separable, or has a rough decision boundary. When this is the case, it is recommended that the modeler try implementing the SVM Classification model with both the Polynomial and Gaussian Kernel Types and compare the results for best fit.

Polynomial Degree: Specifies the value of the exponent, d, for the polynomial kernel method implementations.



For Natural Language Processing (NLP) problems, the most common degree is 2, because larger degrees tend to lead to overfitting (http://en.wikipedia.org/wiki/Polynomial_kernel). However, for other very sparse, large data sources, such as the Forest Cover use case, a good degree is 10, because many support vectors are needed to represent the non-smooth decision boundaries.

Default value: 2.

This parameter is enabled only when the Polynomial option of the Kernel Type parameter is selected.

Gamma (γ): Specifies the value of the γ variable in the Gaussian Kernel Type radial basis function:

K(x, y) = exp(−γ‖x − y‖²)

The same kernel is sometimes written as exp(−‖x − y‖² / (2σ²)) and parameterized using σ, where γ = 1/(2σ²).

Changing the Gamma value changes the accuracy of the Gaussian model. The modeler should use trial and error with cross validation in order to assess what value is best for the given data set.

This variable is typically set relative to 1 over the number of independent variables, or dimensionality, of the model. For example, if there were 50 variables, it might be set to 1/50 = 0.02.

Default value: 1.

This parameter is enabled only when the Gaussian option of the Kernel Type parameter is selected.
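
The following is a hedged sketch of the trial-and-error-with-cross-validation advice above, using scikit-learn's grid search as a stand-in (the candidate γ values and generated data are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=50, random_state=0)

# Start the search near the 1/(number of features) heuristic: 1/50 = 0.02.
grid = GridSearchCV(SVC(kernel="rbf"),
                    param_grid={"gamma": [0.002, 0.02, 0.2, 2.0]},
                    cv=5)
grid.fit(X, y)
print("Best gamma:", grid.best_params_["gamma"])
```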

Lambda (λ): Represents an optimization parameter for SVM Classification. It controls the tradeoff between model bias (significance of the loss function) and the regularization portion of the minimization function (variance).

The higher the Lambda, the lower the chance of over-fitting. Over-fitting is the situation where the model does a good job "learning," or converging to a low error, on the training data but does not do as well on new, non-training data. If a model does well on the training data but does not predict well, the modeler might increase the Lambda value.

Default value: 0.0001.
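
For intuition about this bias/variance tradeoff (an illustrative analogy, not the operator's internals), scikit-learn's SGDClassifier with hinge loss is a primal linear SVM whose alpha parameter plays the same regularization role as Lambda:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# loss="hinge" gives a primal linear SVM; alpha is the Lambda-like
# regularization weight. Larger alpha -> more regularization, less overfitting.
for alpha in (1e-6, 1e-4, 1e-1):
    acc = SGDClassifier(loss="hinge", alpha=alpha, random_state=0).fit(X, y).score(X, y)
    print(f"alpha={alpha:g}  training accuracy={acc:.2f}")
```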

Normalization Type: Sets the mathematical approach to normalizing the SVM Classification data.

The options are as follows.

  • Standardization (the default): Implements the normalization equation FeatureValue' = (FeatureValue − Mean) / StdDev.

  • Unit Vector: Helpful when the modeler needs the data to be in a smaller scale (that is, have smaller values). Implements the equation FeatureValue' = FeatureValue / SampleNorm.

  • Mixed: Performs both Standardization and Unit Vector normalization on the data.

  • None: Does not perform any normalization.

Max JVM Heap Size (MB) (-1=Automatic): A Java Virtual Machine data storage setting for Hadoop.
  • The default value is 1024.
  • If the value is -1, the system automatically sets the Heap Size.

Output

Visual Output



Important: To learn more about the visualization available in this operator, see Explore Visual Results.

The ideal results curve would show the Training Error gradually reducing until it plateaus and flattens out at the minimal attained error rate. This shows the number of iterations at which the model has stopped getting more accurate overall.

In the above example, the Training Error appears to reach its minimum after 60 iterations. The modeler could rerun the model with Maximum Number of Iterations reduced to 60, since going beyond that does not seem to improve the accuracy of the model.
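
A modeler could reproduce this error-versus-iterations check outside the operator with a sketch like the following (illustrative; SGDClassifier with hinge loss is a primal linear SVM, and the data is generated):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Track training error as the iteration budget grows; look for the plateau.
for n_iter in (5, 20, 60, 100):
    clf = SGDClassifier(loss="hinge", max_iter=n_iter, tol=None, random_state=0)
    clf.fit(X, y)
    print(f"iterations={n_iter:3d}  training error={1 - clf.score(X, y):.3f}")
```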

Note: If the results curve continues to oscillate, it is an indication of too much noise in the model. The modeler might try adjusting the Lambda configuration parameter or choosing a different Kernel Type option.

SVM Classification models need additional operators added to them in order to analyze their effectiveness. Often, they are followed by an SVM Predictor which provides the prediction value for each data row compared against the actual data set training value and the associated confidence level. It is also common to add a ROC operator to quickly assess the quality of the model. See the Example section for an illustration.

The output from the SVM Predictor operator shows the predicted Dependent Value for each data instance, along with its associated confidence values.

  • The P(Diagnosis) column uses a threshold of greater than 50% confidence. For example, if the confidence C(M) is over 0.50, the prediction value displayed is "Malignant," indicating a cancerous cell. (A sketch of this thresholding rule follows the list.)
  • The C(M) column indicates the confidence that the dependent value is M, for a malignant cell. Note: confidence values are usually decimal values that represent the percentage of confidence in the result.
  • The C(B) column indicates the confidence that the dependent value is B, for a benign cell.
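
A small sketch of this thresholding rule (the column names come from the example above; the function and confidence values are illustrative):

```python
# Illustrative thresholding of a confidence value into a predicted label,
# mirroring the P(Diagnosis) rule described above.
def predict_diagnosis(c_malignant: float) -> str:
    """Return 'M' (malignant) when confidence exceeds 0.5, else 'B' (benign)."""
    return "M" if c_malignant > 0.5 else "B"

for c in (0.92, 0.55, 0.31):
    print(f"C(M)={c:.2f} -> P(Diagnosis)={predict_diagnosis(c)}")
```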



Adding additional Scoring operators, such as a ROC graph, is also helpful for immediately assessing how predictive the SVM Classification model is. For the ROC graph, any AUC value over 0.80 is typically considered a "good" model. A value of 0.5 means the model is no better than a "dumb" model that guesses the right answer half the time.
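
A minimal sketch of computing the AUC mentioned above (scikit-learn as a stand-in; the labels and confidence values are invented):

```python
from sklearn.metrics import roc_auc_score

# Actual labels (1 = malignant) and the model's C(M) confidence values
# (invented numbers for illustration).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_conf = [0.9, 0.2, 0.7, 0.6, 0.4, 0.1, 0.8, 0.3]

print("AUC:", roc_auc_score(y_true, y_conf))  # > 0.80 is typically "good"
```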
Note: When analyzing the results of an SVM Classification model, the most important assessment is that it scores well outside of the training data (indicated by the ROC, LIFT, and Goodness of Fit Scoring operators).
Data Output
None.

Example