User manual¶
The LibSVM library provides efficient algorithms to build Support Vector Machine models.
Introduction to support vector machine¶
A support vector machine is a concept in statistics and computer science for a set of related supervised learning methods that analyze data and recognize patterns, used for classification and regression analysis. The standard SVM takes a set of input data and predicts, for each given input, which of two possible classes the input belongs to. Given a set of training examples, each marked as belonging to one of two categories, an SVM training algorithm builds a model that assigns new examples to one category or the other.
More formally, an SVM constructs a hyperplane in a high- or infinite-dimensional space, which can be used for classification or regression. A good separation is achieved by the hyperplane that has the largest distance to the nearest training data point of any class.
Whereas the original problem may be stated in a finite dimensional space,
it often happens that the sets to discriminate are not linearly separable in
that space. For this reason, it was proposed that the original
finite-dimensional space be mapped into a much higher-dimensional space,
presumably making the separation easier in that space. To keep the
computational load reasonable, the mappings used by SVM schemes are designed to
ensure that dot products may be computed easily in terms of the variables in
the original space, by defining them in terms of a kernel function
selected to suit the problem.
Linear SVM¶
Given some training data $\mathcal{D}$, a set of $n$ points of the form

$$\mathcal{D} = \left\{ (x_i, y_i) \mid x_i \in \mathbb{R}^p, \; y_i \in \{-1, 1\} \right\}_{i=1}^{n}$$

where the $y_i$ is either 1 or -1, indicating the class to which the point $x_i$
belongs. Each $x_i$ is a $p$-dimensional real vector.
We want to find the maximum-margin hyperplane that divides the points having
$y_i = 1$ from those having $y_i = -1$.
Any hyperplane can be written as the set of points $x$ satisfying

$$w \cdot x - b = 0$$

where $\cdot$ denotes the dot product and $w$ the normal vector to the hyperplane.
We want to choose $w$ and $b$ to maximize the margin, or distance,
between the parallel hyperplanes that are as far apart as possible while still
separating the data. These hyperplanes can be described by the equations

$$w \cdot x - b = 1 \qquad \text{and} \qquad w \cdot x - b = -1.$$

The distance between these two hyperplanes is $\frac{2}{\|w\|}$, so we
want to minimize $\|w\|$. As we also have to prevent data points from
falling into the margin, we add the following constraint: for each $i$, either

$$w \cdot x_i - b \geq 1 \quad \text{if } y_i = 1, \qquad \text{or} \qquad w \cdot x_i - b \leq -1 \quad \text{if } y_i = -1,$$

which can be written compactly as $y_i (w \cdot x_i - b) \geq 1$ for all $1 \leq i \leq n$.
So we get the optimization problem:

$$\min_{w, b} \|w\| \quad \text{subject to } y_i (w \cdot x_i - b) \geq 1, \quad i = 1, \dots, n.$$
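For example, with two one-dimensional points $x_1 = 1$ ($y_1 = 1$) and $x_2 = -1$ ($y_2 = -1$), both constraints are active at the optimum, so $w - b = 1$ and $w + b = 1$, hence $w = 1$ and $b = 0$: the separating hyperplane is $x = 0$, the margin is $2 / \|w\| = 2$, and both points are support vectors.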
Primal form¶
The optimization problem presented in the preceding section is difficult to
solve because it depends on $\|w\|$, which involves a
square root. It is possible to alter the equation by substituting $\|w\|$
with $\frac{1}{2} \|w\|^2$
without changing the solution. This is a quadratic programming optimization problem:

$$\min_{w, b} \frac{1}{2} \|w\|^2 \quad \text{subject to } y_i (w \cdot x_i - b) \geq 1, \quad i = 1, \dots, n.$$

By introducing Lagrange multipliers $\alpha_i \geq 0$, the previous constrained problem can be expressed as

$$\min_{w, b} \max_{\alpha \geq 0} \left\{ \frac{1}{2} \|w\|^2 - \sum_{i=1}^{n} \alpha_i \left[ y_i (w \cdot x_i - b) - 1 \right] \right\}.$$

This problem can now be solved by standard quadratic programming techniques and
programs. The stationary Karush-Kuhn-Tucker condition implies that the solution
can be expressed as a linear combination of the training vectors:

$$w = \sum_{i=1}^{n} \alpha_i y_i x_i.$$

Only a few $\alpha_i$ will be greater than zero. The corresponding $x_i$
are exactly the support vectors, which lie on the margin and
satisfy $y_i (w \cdot x_i - b) = 1$.
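Indeed, setting the derivatives of the Lagrangian with respect to $w$ and $b$ to zero gives

$$\frac{\partial}{\partial w} \left\{ \frac{1}{2} \|w\|^2 - \sum_{i=1}^{n} \alpha_i \left[ y_i (w \cdot x_i - b) - 1 \right] \right\} = w - \sum_{i=1}^{n} \alpha_i y_i x_i = 0, \qquad \frac{\partial}{\partial b} \{ \cdot \} = \sum_{i=1}^{n} \alpha_i y_i = 0.$$

The second equality is the constraint used in the dual form below.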
Dual form¶
Using the fact that $\|w\|^2 = w \cdot w$ and substituting
$w = \sum_{i=1}^{n} \alpha_i y_i x_i$, one can show that the dual of the
SVM reduces to the following optimization problem:

$$\max_{\alpha} \; \tilde{L}(\alpha) = \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i, j} \alpha_i \alpha_j y_i y_j \, x_i \cdot x_j \quad \text{subject to } \alpha_i \geq 0, \quad i = 1, \dots, n,$$

and to the constraint from the minimization in $b$:

$$\sum_{i=1}^{n} \alpha_i y_i = 0.$$

$w$ can be computed thanks to the $\alpha$ terms:

$$w = \sum_{i=1}^{n} \alpha_i y_i x_i.$$
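For illustration, this dual problem can be solved on a toy data set with a general-purpose solver (here SciPy; LibSVM itself uses a dedicated SMO-type decomposition method, and the data and names below are only a sketch):

    import numpy as np
    from scipy.optimize import minimize

    # Toy linearly separable data: two points per class in R^2.
    X = np.array([[2.0, 2.0], [2.5, 3.0], [-2.0, -2.0], [-3.0, -2.5]])
    y = np.array([1.0, 1.0, -1.0, -1.0])
    G = (y[:, None] * X) @ (y[:, None] * X).T        # G_ij = y_i y_j x_i . x_j

    # Negated dual objective, since 'minimize' minimizes.
    def neg_dual(alpha):
        return 0.5 * alpha @ G @ alpha - alpha.sum()

    constraints = ({"type": "eq", "fun": lambda a: a @ y},)   # sum_i alpha_i y_i = 0
    bounds = [(0.0, None)] * len(y)                           # alpha_i >= 0

    res = minimize(neg_dual, np.zeros(len(y)), bounds=bounds, constraints=constraints)
    alpha = res.x
    w = (alpha * y) @ X                    # w = sum_i alpha_i y_i x_i
    sv = alpha > 1e-6                      # indices of the support vectors
    b = np.mean(X[sv] @ w - y[sv])         # from y_i (w . x_i - b) = 1
    print(np.round(alpha, 3), np.round(w, 3), round(b, 3))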
Soft margin¶
If there exists no hyperplane that can split the “yes” and “no” examples, the
soft margin method will choose a hyperplane that splits the examples as cleanly
as possible, while still maximizing the distance to the nearest cleanly split
examples. The method introduces slack variables $\xi_i \geq 0$, which measure the
degree of misclassification of the data $x_i$:

$$y_i (w \cdot x_i - b) \geq 1 - \xi_i, \quad 1 \leq i \leq n.$$

The objective function is then increased by a function which penalizes non-zero
$\xi_i$, and the optimization becomes a trade-off between a large margin
and a small error penalty. If the penalty function is linear, the optimization
problem becomes:

$$\min_{w, \xi, b} \left\{ \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{n} \xi_i \right\} \quad \text{subject to } y_i (w \cdot x_i - b) \geq 1 - \xi_i, \quad \xi_i \geq 0.$$

This constraint, together with the objective of minimizing $\|w\|$, can be solved
using Lagrange multipliers as done above. One has then to solve the problem:

$$\min_{w, \xi, b} \max_{\alpha, \beta} \left\{ \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{n} \xi_i - \sum_{i=1}^{n} \alpha_i \left[ y_i (w \cdot x_i - b) - 1 + \xi_i \right] - \sum_{i=1}^{n} \beta_i \xi_i \right\} \quad \text{with } \alpha_i, \beta_i \geq 0.$$

The dual form becomes:

$$\max_{\alpha} \; \tilde{L}(\alpha) = \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i, j} \alpha_i \alpha_j y_i y_j \, x_i \cdot x_j \quad \text{subject to } 0 \leq \alpha_i \leq C \text{ and } \sum_{i=1}^{n} \alpha_i y_i = 0.$$
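Compared with the hard-margin sketch given after the dual form above, the only change needed to solve this soft-margin dual numerically is the upper bound $C$ on each multiplier, for instance:

    C = 1.0                           # error penalty (illustrative value)
    bounds = [(0.0, C)] * len(y)      # 0 <= alpha_i <= C instead of alpha_i >= 0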
Nonlinear SVM¶
The algorithm is formally similar, except that every dot product is replaced by a nonlinear kernel function. This allows the algorithm to fit the maximum-margin hyperplane in a transformed feature space. The transformation may be nonlinear and the transformed space high dimensional, thus though the classifier is a hyperplane in the high-dimensional feature space, it may be nonlinear in the original input space.
Some common kernels include:

- Polynomial: $k(x_i, x_j) = (x_i \cdot x_j + c)^d$
- Gaussian Radial Basis Function: $k(x_i, x_j) = \exp\left( -\gamma \|x_i - x_j\|^2 \right)$, for $\gamma > 0$
- Hyperbolic tangent: $k(x_i, x_j) = \tanh\left( \kappa \, x_i \cdot x_j + c \right)$, for some $\kappa > 0$ and $c < 0$

The kernel is related to the transform $\varphi(x_i)$ by the equation
$k(x_i, x_j) = \varphi(x_i) \cdot \varphi(x_j)$.
The value $w$ is also in the transformed space, with

$$w = \sum_{i} \alpha_i y_i \varphi(x_i).$$
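These kernels can be evaluated directly with NumPy; the function names and parameter values below are purely illustrative:

    import numpy as np

    def polynomial_kernel(x, z, c=1.0, d=3):
        # k(x, z) = (x . z + c)^d
        return (np.dot(x, z) + c) ** d

    def rbf_kernel(x, z, gamma=0.5):
        # k(x, z) = exp(-gamma * ||x - z||^2)
        return np.exp(-gamma * np.sum((np.asarray(x) - np.asarray(z)) ** 2))

    def sigmoid_kernel(x, z, kappa=0.1, c=-1.0):
        # k(x, z) = tanh(kappa * x . z + c)
        return np.tanh(kappa * np.dot(x, z) + c)

    x, z = np.array([1.0, 2.0]), np.array([0.5, -1.0])
    print(polynomial_kernel(x, z), rbf_kernel(x, z), sigmoid_kernel(x, z))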
The effectiveness of SVM depends on the selection of the kernel, the kernel’s
parameters, and the soft margin parameter $C$. A common choice is a Gaussian kernel,
which has a single parameter $\gamma$. The best combination of $C$ and $\gamma$
is selected by a grid search with exponentially growing
sequences of $C$ and $\gamma$. Each combination of parameter choices is
checked using cross-validation, and the parameters with the best cross-validation
accuracy are picked.
The final model, which is used for testing and for classifying data, is then
trained on the whole training set using the selected parameters.
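Such a grid search can be sketched with scikit-learn's SVC, which is also built on LibSVM; this is only an illustration of the procedure, with arbitrary toy data and grid values:

    from sklearn.datasets import make_classification
    from sklearn.model_selection import GridSearchCV
    from sklearn.svm import SVC

    X, y = make_classification(n_samples=200, n_features=4, random_state=0)

    # Exponentially growing sequences of C and gamma, as recommended above.
    param_grid = {"C": [2.0 ** k for k in range(-5, 6, 2)],
                  "gamma": [2.0 ** k for k in range(-7, 4, 2)]}

    search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
    search.fit(X, y)
    print(search.best_params_, search.best_score_)

    # Refit on the whole training set with the selected parameters.
    final_model = SVC(kernel="rbf", **search.best_params_).fit(X, y)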
Classification¶
Given training vectors $x_i \in \mathbb{R}^p$, $i = 1, \dots, l$, in two classes and a vector
$y \in \mathbb{R}^l$ such that $y_i \in \{-1, 1\}$,
C-SVC (the algorithm that we use for classification) solves the following dual
problem:

$$\min_{\alpha} \frac{1}{2} \alpha^T Q \alpha - e^T \alpha \quad \text{subject to } y^T \alpha = 0, \quad 0 \leq \alpha_i \leq C, \quad i = 1, \dots, l,$$

where $e$ is the vector of all ones, $C > 0$ is the upper bound, $Q$ is an
$l$ by $l$ positive semidefinite matrix with $Q_{ij} = y_i y_j K(x_i, x_j)$,
and $K(x_i, x_j) = \varphi(x_i) \cdot \varphi(x_j)$
is the kernel. The decision function is

$$\operatorname{sgn}\left( \sum_{i=1}^{l} y_i \alpha_i K(x_i, x) + b \right).$$
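The decision function can be cross-checked against the quantities exposed by scikit-learn's SVC (also built on LibSVM), whose dual_coef_ attribute stores the products $y_i \alpha_i$; the data and parameter values in this sketch are illustrative only:

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.metrics.pairwise import rbf_kernel
    from sklearn.svm import SVC

    X, y = make_classification(n_samples=100, n_features=4, random_state=0)
    clf = SVC(kernel="rbf", C=1.0, gamma=0.5).fit(X, y)

    # Decision values rebuilt as sum_i y_i alpha_i K(x_i, x) + b over the support vectors.
    K = rbf_kernel(clf.support_vectors_, X, gamma=0.5)
    manual = clf.dual_coef_ @ K + clf.intercept_
    print(np.allclose(manual.ravel(), clf.decision_function(X)))   # True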
For some classification problems, numbers of data in different classes are unbalanced. We can use different penalty parameters in the SVM formulation; the dual of C-SVC becomes:

$$\min_{\alpha} \frac{1}{2} \alpha^T Q \alpha - e^T \alpha \quad \text{subject to } y^T \alpha = 0, \quad 0 \leq \alpha_i \leq C_+ \text{ if } y_i = 1, \quad 0 \leq \alpha_i \leq C_- \text{ if } y_i = -1,$$

where $C_+ = w_+ C$ and $C_- = w_- C$ depend on $C$
and on the weights $w_+$ and $w_-$ that we can fix for each class.
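For illustration, scikit-learn's SVC exposes such per-class penalties through its class_weight parameter, which multiplies $C$ for each class; the toy data and weights below are arbitrary:

    from sklearn.datasets import make_classification
    from sklearn.svm import SVC

    # Imbalanced toy data: roughly 90% of the samples belong to class 0.
    X, y = make_classification(n_samples=300, weights=[0.9, 0.1], random_state=0)

    # The penalty of class k becomes class_weight[k] * C.
    clf = SVC(kernel="rbf", C=1.0, class_weight={0: 1.0, 1: 9.0}).fit(X, y)
    # class_weight="balanced" derives the weights from the class frequencies.
    clf_bal = SVC(kernel="rbf", C=1.0, class_weight="balanced").fit(X, y)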
Regression¶
Up to now, we have presented SVM for classification. We can use it for regression too.
This is similar to the nonlinear case. We replace the soft margin by
an $\varepsilon$-insensitive loss function, which is defined as:

$$|y - f(x)|_{\varepsilon} = \max\left( 0, |y - f(x)| - \varepsilon \right),$$

where $f(x)$ is the value predicted by the regression function and $\varepsilon$ a precision parameter.
We get this optimization problem if we introduce the slack variables $\xi_i$, $\xi_i^*$:

$$\min_{w, b, \xi, \xi^*} \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{n} (\xi_i + \xi_i^*) \quad \text{subject to } y_i - w \cdot \varphi(x_i) - b \leq \varepsilon + \xi_i, \quad w \cdot \varphi(x_i) + b - y_i \leq \varepsilon + \xi_i^*, \quad \xi_i, \xi_i^* \geq 0,$$

with $C$ a control parameter as in the soft margin.
To solve this problem, we introduce Lagrange multipliers $\alpha_i$, $\alpha_i^*$ once again, and we get this regression function:

$$f(x) = \sum_{i=1}^{n} (\alpha_i - \alpha_i^*) K(x_i, x) + b.$$
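For illustration, this $\varepsilon$-insensitive regression (ε-SVR in LibSVM) can be exercised through scikit-learn's SVR on a small one-dimensional problem; the data and parameter values are only a sketch:

    import numpy as np
    from sklearn.svm import SVR

    # Noisy samples of a sine function.
    rng = np.random.default_rng(0)
    X = np.linspace(0.0, 2.0 * np.pi, 80).reshape(-1, 1)
    y = np.sin(X).ravel() + 0.1 * rng.standard_normal(80)

    # epsilon is the half-width of the insensitive tube, C the error penalty.
    reg = SVR(kernel="rbf", C=10.0, epsilon=0.1, gamma=1.0).fit(X, y)
    print(reg.predict([[np.pi / 2.0]]))   # close to sin(pi/2) = 1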
Choose a kernel¶
The following kernels are available:

- Normal RBF kernel.
- Exponential RBF kernel.
- Linear kernel.
- Polynomial kernel.
- Sigmoid kernel.
Classification¶
- Classification algorithm using LibSVM.
Regression¶
- Regression algorithm using LibSVM.