User manual¶
The Libsvm library provides efficient algorithms to build Support Vector Machine (SVM) models.
Introduction to support vector machine¶
A support vector machine (SVM) is a concept in statistics and computer science for a set of related supervised learning methods that analyze data and recognize patterns, used for classification and regression analysis. The standard SVM takes a set of input data and predicts, for each given input, which of two possible classes the input belongs to. Given a set of training examples, each marked as belonging to one of two categories, an SVM training algorithm builds a model that assigns new examples to one category or the other.
More formally, a SVM constructs a hyperplane in a high or infinite-dimensional space, which can be used for classification or regression. A good separation is achieved by the hyperplane that has the largest distance to the nearest training data point of any class.
Whereas the original problem may be stated in a finite dimensional space, it often happens that the sets to discriminate are not linearly separable in that space. For this reason, it was proposed that the original finite-dimensional space be mapped into a much higher-dimensional space, presumably making the separation easier in that space. To keep the computational load reasonable, the mappings used by SVM schemes are designed to ensure that dot products may be computed easily in terms of the variables in the original space, by defining them in terms of a kernel function selected to suit the problem.
Linear SVM¶
Given some training data $\mathcal{D}$, a set of $n$ points of the form

$$\mathcal{D} = \{ (x_i, y_i) \mid x_i \in \mathbb{R}^p,\; y_i \in \{-1, 1\} \}_{i=1}^{n}$$

where $y_i$ is either 1 or -1, indicating the class to which the point $x_i$ belongs. Each $x_i$ is a $p$-dimensional real vector. We want to find the maximum-margin hyperplane that divides the points having $y_i = 1$ from those having $y_i = -1$.
Any hyperplane can be written as the set of points $x$ satisfying $w \cdot x - b = 0$, where $\cdot$ denotes the dot product and $w$ the normal vector to the hyperplane. We want to choose $w$ and $b$ to maximize the margin, i.e. the distance between the parallel hyperplanes that are as far apart as possible while still separating the data. These hyperplanes can be described by the equations

$$w \cdot x - b = 1$$

and

$$w \cdot x - b = -1$$

The distance between these two hyperplanes is $\frac{2}{\|w\|}$, so we want to minimize $\|w\|$. As we also have to prevent data points from falling into the margin, we add the following constraint: for each $i$, either

$$w \cdot x_i - b \geq 1 \quad \text{for } y_i = 1, \qquad \text{or} \qquad w \cdot x_i - b \leq -1 \quad \text{for } y_i = -1,$$

which can be written compactly as $y_i (w \cdot x_i - b) \geq 1$ for all $1 \leq i \leq n$. So we get the optimization problem:

$$\min_{w, b} \|w\| \quad \text{subject to} \quad y_i (w \cdot x_i - b) \geq 1, \quad i = 1, \dots, n.$$
Primal form¶
The optimization problem presented in the preceding section is difficult to solve because it depends on $\|w\|$, which involves a square root. It is possible to alter the problem by substituting $\|w\|$ with $\frac{1}{2}\|w\|^2$ without changing the solution. This is a quadratic programming optimization problem:

$$\min_{w, b} \frac{1}{2}\|w\|^2 \quad \text{subject to} \quad y_i (w \cdot x_i - b) \geq 1, \quad i = 1, \dots, n.$$

By introducing Lagrange multipliers $\alpha_i \geq 0$, the previous constrained problem can be expressed as

$$\min_{w, b} \max_{\alpha \geq 0} \left\{ \frac{1}{2}\|w\|^2 - \sum_{i=1}^{n} \alpha_i \left[ y_i (w \cdot x_i - b) - 1 \right] \right\}.$$

This problem can now be solved by standard quadratic programming techniques and programs. The stationary Karush-Kuhn-Tucker condition implies that the solution can be expressed as a linear combination of the training vectors:

$$w = \sum_{i=1}^{n} \alpha_i y_i x_i.$$

Only a few $\alpha_i$ will be greater than zero. The corresponding $x_i$ are exactly the support vectors, which lie on the margin and satisfy $y_i (w \cdot x_i - b) = 1$.
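As an illustration of the expansion $w = \sum_i \alpha_i y_i x_i$, here is a minimal sketch that fits a linear SVM on a separable toy set and checks the relation numerically. It assumes scikit-learn (whose SVC class wraps LibSVM) and NumPy; it illustrates the formula rather than this manual's own classes:

    # Minimal sketch (assumes scikit-learn and NumPy); SVC wraps LibSVM.
    import numpy as np
    from sklearn.svm import SVC

    # Two linearly separable clusters in R^2, labels y_i in {-1, +1}.
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(loc=[-2, -2], size=(20, 2)),
                   rng.normal(loc=[+2, +2], size=(20, 2))])
    y = np.array([-1] * 20 + [+1] * 20)

    # A very large C approximates the hard-margin problem.
    clf = SVC(kernel="linear", C=1e6).fit(X, y)

    # dual_coef_ stores y_i * alpha_i for the support vectors only, so
    # w = sum_i alpha_i y_i x_i reduces to dual_coef_ @ support_vectors_.
    w_from_alphas = clf.dual_coef_ @ clf.support_vectors_
    print(np.allclose(w_from_alphas, clf.coef_))   # True
    print("support vectors:", clf.support_vectors_)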
Dual form¶
Using the fact that $\|w\|^2 = w \cdot w$ and substituting $w = \sum_{i=1}^{n} \alpha_i y_i x_i$, one can show that the dual of the SVM reduces to the following optimization problem:

$$\max_{\alpha} \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i, j} \alpha_i \alpha_j y_i y_j \, x_i \cdot x_j \quad \text{subject to} \quad \alpha_i \geq 0,$$

and to the constraint from the minimization in $b$:

$$\sum_{i=1}^{n} \alpha_i y_i = 0.$$

$b$ can be computed thanks to the $\alpha_i$ terms:

$$b = \frac{1}{N_{SV}} \sum_{i=1}^{N_{SV}} \left( w \cdot x_i - y_i \right),$$

where the sum runs over the $N_{SV}$ support vectors.
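Continuing the sketch from the previous section, $b$ can be estimated from the support vectors with the formula above. Note, as an assumption of this illustration only, that scikit-learn's intercept_ follows the opposite sign convention ($w \cdot x + b = 0$):

    # Continuation of the previous sketch: estimate b from the support vectors.
    # With the convention w.x - b = 0 used above, b is the mean over support
    # vectors of (w . x_i - y_i).  scikit-learn's intercept_ uses the opposite
    # sign convention, so intercept_ is approximately -b.
    w = clf.coef_.ravel()
    sv = clf.support_vectors_
    y_sv = y[clf.support_]
    b = np.mean(sv @ w - y_sv)
    print(b, -clf.intercept_[0])   # the two values should agree closely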
Soft margin¶
If there exists no hyperplane that can split the "yes" and "no" examples, the soft margin method will choose a hyperplane that splits the examples as cleanly as possible, while still maximizing the distance to the nearest cleanly split examples. The method introduces non-negative slack variables $\xi_i$, which measure the degree of misclassification of the data $x_i$:

$$y_i (w \cdot x_i - b) \geq 1 - \xi_i, \quad 1 \leq i \leq n.$$

The objective function is then increased by a function which penalizes non-zero $\xi_i$, and the optimization becomes a trade-off between a large margin and a small error penalty. If the penalty function is linear, the optimization problem becomes:

$$\min_{w, \xi, b} \left\{ \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \xi_i \right\} \quad \text{subject to} \quad y_i (w \cdot x_i - b) \geq 1 - \xi_i, \quad \xi_i \geq 0.$$

This constraint, together with the objective of minimizing $\|w\|$, can be solved using Lagrange multipliers as done above. One then has to solve the following problem:

$$\min_{w, \xi, b} \max_{\alpha, \beta} \left\{ \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \xi_i - \sum_{i=1}^{n} \alpha_i \left[ y_i (w \cdot x_i - b) - 1 + \xi_i \right] - \sum_{i=1}^{n} \beta_i \xi_i \right\}, \quad \alpha_i, \beta_i \geq 0.$$

The dual form becomes:

$$\max_{\alpha} \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i, j} \alpha_i \alpha_j y_i y_j \, x_i \cdot x_j \quad \text{subject to} \quad 0 \leq \alpha_i \leq C \quad \text{and} \quad \sum_{i=1}^{n} \alpha_i y_i = 0.$$
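To make the dual concrete, the sketch below solves this box-constrained QP directly with SciPy's general-purpose SLSQP solver on a small 2D toy problem and recovers $w$ and $b$ from the multipliers. LibSVM itself uses a dedicated SMO-type solver; this code only mirrors the mathematical formulation and assumes NumPy and SciPy are available:

    # Minimal sketch (assumes NumPy and SciPy): solve the soft-margin dual
    # with a general-purpose solver on a small 2D toy problem.
    import numpy as np
    from scipy.optimize import minimize

    rng = np.random.default_rng(1)
    X = np.vstack([rng.normal([-1, -1], 1.0, (15, 2)),
                   rng.normal([+1, +1], 1.0, (15, 2))])
    y = np.array([-1.0] * 15 + [+1.0] * 15)
    n, C = len(y), 1.0

    # Label-weighted Gram matrix of the linear kernel: Q_ij = y_i y_j x_i.x_j
    Q = (y[:, None] * X) @ (y[:, None] * X).T

    def neg_dual(a):               # minimize the negative of the dual objective
        return 0.5 * a @ Q @ a - a.sum()

    res = minimize(neg_dual, x0=np.zeros(n), method="SLSQP",
                   bounds=[(0.0, C)] * n,
                   constraints=[{"type": "eq", "fun": lambda a: a @ y}])
    alpha = res.x

    # Recover w and b; support vectors are the points with alpha_i > 0, and
    # b is averaged over the free support vectors (0 < alpha_i < C), assuming
    # at least one exists for this toy data.
    w = (alpha * y) @ X
    on_margin = (alpha > 1e-6) & (alpha < C - 1e-6)
    b = np.mean(X[on_margin] @ w - y[on_margin])
    print("w =", w, " b =", b)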
Nonlinear SVM¶
The algorithm is formally similar, except that every dot product is replaced by a nonlinear kernel function. This allows the algorithm to fit the maximum-margin hyperplane in a transformed feature space. The transformation may be nonlinear and the transformed space high-dimensional; thus, although the classifier is a hyperplane in the high-dimensional feature space, it may be nonlinear in the original input space.
Some common kernels include:
- Polynomial: $k(x_i, x_j) = (x_i \cdot x_j + c)^d$
- Gaussian Radial Basis Function: $k(x_i, x_j) = \exp\left( -\gamma \|x_i - x_j\|^2 \right)$, for $\gamma > 0$
- Hyperbolic tangent: $k(x_i, x_j) = \tanh\left( \kappa \, x_i \cdot x_j + c \right)$, for some $\kappa > 0$ and $c < 0$

The kernel is related to the transform $\varphi(x_i)$ by the equation $k(x_i, x_j) = \varphi(x_i) \cdot \varphi(x_j)$. The value $w$ is also in the transformed space, with $w = \sum_i \alpha_i y_i \varphi(x_i)$.
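For concreteness, the three kernels listed above can be written as plain functions; the parameter values $d$, $c$, $\gamma$ and $\kappa$ below are arbitrary illustrative choices, not defaults of this library:

    # Minimal sketch (assumes NumPy): the three kernels above as plain functions
    # of two sample vectors.  The parameter values are arbitrary examples.
    import numpy as np

    def polynomial_kernel(xi, xj, d=3, c=1.0):
        return (np.dot(xi, xj) + c) ** d

    def rbf_kernel(xi, xj, gamma=0.5):
        return np.exp(-gamma * np.sum((xi - xj) ** 2))

    def sigmoid_kernel(xi, xj, kappa=0.1, c=-1.0):
        return np.tanh(kappa * np.dot(xi, xj) + c)

    xi, xj = np.array([1.0, 2.0]), np.array([0.5, -1.0])
    print(polynomial_kernel(xi, xj), rbf_kernel(xi, xj), sigmoid_kernel(xi, xj))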
The effectiveness of SVM depends on the selection of the kernel, the kernel's parameters, and the soft margin parameter $C$. A common choice is a Gaussian kernel, which has a single parameter $\gamma$. The best combination of $C$ and $\gamma$ is selected by a grid search with exponentially growing sequences of $C$ and $\gamma$. Each combination of parameter choices is checked using cross-validation, and the parameters with the best cross-validation accuracy are picked. The final model, which is used for testing and for classifying data, is then trained on the whole training set using the selected parameters.
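One possible way to carry out this grid search, sketched here with scikit-learn's GridSearchCV and its LibSVM-backed SVC (an assumption made purely for illustration; the dataset and parameter ranges are arbitrary):

    # Minimal sketch (assumes scikit-learn and NumPy): grid search over
    # exponentially growing sequences of C and gamma, scored by
    # cross-validation, then refit on the whole training set.
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.model_selection import GridSearchCV
    from sklearn.svm import SVC

    X, y = make_classification(n_samples=200, n_features=5, random_state=0)

    param_grid = {"C": 2.0 ** np.arange(-5, 16, 2),      # 2^-5, 2^-3, ..., 2^15
                  "gamma": 2.0 ** np.arange(-15, 4, 2)}  # 2^-15, 2^-13, ..., 2^3
    search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
    search.fit(X, y)   # refit=True by default: final model trained on all data

    print(search.best_params_, search.best_score_)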
Classification¶
Given training vectors $x_i \in \mathbb{R}^p$, $i = 1, \dots, l$, in two classes, and a vector $y \in \mathbb{R}^l$ with $y_i \in \{-1, 1\}$, C-SVC (the algorithm that we use for classification) solves the following dual problem:

$$\min_{\alpha} \frac{1}{2} \alpha^T Q \alpha - e^T \alpha \quad \text{subject to} \quad y^T \alpha = 0, \quad 0 \leq \alpha_i \leq C, \quad i = 1, \dots, l,$$

where $e$ is the vector of all ones, $C > 0$ is the upper bound, $Q$ is an $l$ by $l$ positive semidefinite matrix with $Q_{ij} = y_i y_j K(x_i, x_j)$, and $K(x_i, x_j) = \varphi(x_i) \cdot \varphi(x_j)$ is the kernel. The decision function is

$$\operatorname{sgn}\left( \sum_{i=1}^{l} y_i \alpha_i K(x_i, x) + b \right).$$

For some classification problems, the numbers of data in the different classes are unbalanced. We can then use different penalty parameters in the SVM formulation; the dual of C-SVC becomes:

$$\min_{\alpha} \frac{1}{2} \alpha^T Q \alpha - e^T \alpha \quad \text{subject to} \quad y^T \alpha = 0, \quad 0 \leq \alpha_i \leq C_+ \text{ if } y_i = 1, \quad 0 \leq \alpha_i \leq C_- \text{ if } y_i = -1,$$

where $C_+ = w_+ C$ and $C_- = w_- C$ depend on $C$ and on the weights $w_+$ and $w_-$ that we can fix for each class.
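A sketch of this weighted formulation, again using scikit-learn's LibSVM-backed SVC for illustration: its class_weight argument multiplies $C$ per class, which plays the role of the per-class penalty parameters above. The dataset and weights are arbitrary:

    # Minimal sketch (assumes scikit-learn): per-class penalty parameters for
    # an unbalanced problem.  class_weight multiplies C for each class.
    from sklearn.datasets import make_classification
    from sklearn.svm import SVC

    # Roughly 90% of samples in one class, 10% in the other.
    X, y = make_classification(n_samples=300, weights=[0.9, 0.1], random_state=0)

    # Heavier penalty on errors in the minority class (weights chosen by hand
    # here; class_weight="balanced" derives them from class frequencies instead).
    clf = SVC(kernel="rbf", C=1.0, class_weight={0: 1.0, 1: 9.0}).fit(X, y)
    print(clf.n_support_)   # number of support vectors per class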
Regression¶
Up to now, we have presented SVM for classification. It can also be used for regression, in a way similar to the nonlinear case: we replace the soft margin by an $\varepsilon$-insensitive loss function, defined as:

$$|y - f(x)|_{\varepsilon} = \max\left( 0, |y - f(x)| - \varepsilon \right),$$

where $f$ is the regression function and $\varepsilon$ a precision parameter.

We get the following optimization problem if we introduce the slack variables $\xi_i$ and $\xi_i^*$:

$$\min_{w, b, \xi, \xi^*} \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \left( \xi_i + \xi_i^* \right) \quad \text{subject to} \quad y_i - w \cdot x_i - b \leq \varepsilon + \xi_i, \quad w \cdot x_i + b - y_i \leq \varepsilon + \xi_i^*, \quad \xi_i, \xi_i^* \geq 0,$$

with $C$ a control parameter, as in the soft margin case.

To solve this problem, we introduce once again Lagrange multipliers, $\alpha_i$ and $\alpha_i^*$, and we get the regression function:

$$f(x) = \sum_{i=1}^{n} \left( \alpha_i - \alpha_i^* \right) K(x_i, x) + b.$$
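A minimal regression sketch, assuming scikit-learn's SVR (which wraps LibSVM's epsilon-SVR) and NumPy; the toy data and the values of $C$ and $\varepsilon$ are arbitrary:

    # Minimal sketch (assumes scikit-learn and NumPy): fit a noisy sine curve;
    # epsilon is the width of the insensitive tube and C the trade-off
    # parameter discussed above.
    import numpy as np
    from sklearn.svm import SVR

    rng = np.random.default_rng(0)
    X = np.sort(rng.uniform(0, 2 * np.pi, 100)).reshape(-1, 1)
    y = np.sin(X).ravel() + rng.normal(0, 0.1, 100)

    reg = SVR(kernel="rbf", C=10.0, epsilon=0.1).fit(X, y)
    print("support vectors:", len(reg.support_),
          " first predictions:", reg.predict(X[:3]))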
Choose a kernel¶
- Normal RBF kernel.
- Exponential RBF kernel.
- Linear kernel.
- Polynomial kernel.
- Polynomial kernel.
- Sigmoid kernel.
Classification¶
- Classification algorithm using LibSVM.
Regression¶
- Regression algorithm using LibSVM.