User manual
===========

.. currentmodule:: otsvm

The LibSVM library provides efficient algorithms to build Support Vector Machine models.

Introduction to support vector machine
--------------------------------------

A support vector machine is a concept in statistics and computer science for a set of related supervised learning methods that analyze data and recognize patterns, used for classification and regression analysis. The standard SVM takes a set of input data and predicts, for each given input, which of two possible classes the input belongs to. Given a set of training examples, each marked as belonging to one of two categories, an SVM training algorithm builds a model that assigns new examples to one category or the other.

More formally, an SVM constructs a hyperplane in a high- or infinite-dimensional space, which can be used for classification or regression. A good separation is achieved by the hyperplane that has the largest distance to the nearest training data point of any class.

Whereas the original problem may be stated in a finite-dimensional space, it often happens that the sets to discriminate are not linearly separable in that space. For this reason, it was proposed that the original finite-dimensional space be mapped into a much higher-dimensional space, presumably making the separation easier in that space. To keep the computational load reasonable, the mappings used by SVM schemes are designed to ensure that dot products may be computed easily in terms of the variables in the original space, by defining them in terms of a kernel function :math:`K(x,y)` selected to suit the problem.

Linear SVM
----------

Given some training data :math:`D`, a set of :math:`n` points of the form

.. math::

    D\ =\ \{(x_{i},y_{i})\ |\ x_{i} \in \mathbb{R}^{p},\ y_{i} \in \{-1,1\}\}^{n}_{i=1}

where :math:`y_i` is either 1 or -1, indicating the class to which the point :math:`x_i` belongs. Each :math:`x_{i}` is a :math:`p`-dimensional real vector. We want to find the maximum-margin hyperplane that divides the points having :math:`y_i=1` from those having :math:`y_i=-1`.

Any hyperplane can be written as the set of points :math:`x` satisfying :math:`w \cdot x - b = 0`, where :math:`\cdot` denotes the dot product and :math:`w` the normal vector to the hyperplane.

We want to choose :math:`w` and :math:`b` so as to maximize the margin, that is the distance between two parallel hyperplanes that are as far apart as possible while still separating the data. These hyperplanes can be described by the equations

.. math::

    w \cdot x - b = 1

and

.. math::

    w \cdot x - b = -1

The distance between these two hyperplanes is :math:`\frac{2}{||w||}`, so we want to minimize :math:`||w||`. As we also have to prevent data points from falling into the margin, we add the following constraint: for each :math:`i`,

.. math::

    y_{i}(w \cdot x_{i} - b)\geq 1 \quad \text{for all } 1\leq i\leq n

So we get the optimization problem:

.. math::

    \min\ ||w|| \quad \text{subject to} \quad y_{i}(w \cdot x_{i} - b)\geq 1 \quad \text{for any } i=1,\dots,n

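As a small numerical illustration of the margin constraints and of the distance :math:`\frac{2}{||w||}`, here is a minimal NumPy sketch with a hand-chosen hyperplane and data (illustrative values only, not produced by otsvm):

.. code-block:: python

    import numpy as np

    # Hand-chosen separating hyperplane w.x - b = 0 (illustrative values only)
    w = np.array([1.0, 1.0])
    b = 1.0

    # Two small classes, with labels y in {-1, +1}
    X = np.array([[2.0, 2.0], [3.0, 1.5], [-1.0, 0.0], [0.0, -1.0]])
    y = np.array([1, 1, -1, -1])

    # Margin constraints y_i (w.x_i - b) >= 1 for every training point
    constraints = y * (X @ w - b)
    print(constraints >= 1)           # all True: the data are separated with margin

    # Distance between the two hyperplanes w.x - b = +1 and w.x - b = -1
    print(2.0 / np.linalg.norm(w))
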
Primal form
-----------

The optimization problem presented in the preceding section is difficult to solve because it depends on :math:`||w||`, which involves a square root. It is possible to alter the problem by substituting :math:`||w||` with :math:`\frac{1}{2}||w||^{2}` without changing the solution. This is a quadratic programming optimization problem:

.. math::

    \min\ \frac{1}{2}||w||^{2} \quad \text{subject to} \quad y_{i}(w \cdot x_{i} - b)\geq 1 \quad \text{for any } i=1,\dots,n

By introducing Lagrange multipliers :math:`\alpha`, the previous constrained problem can be expressed as

.. math::

    \underset{w,b}{\text{min}}\ \underset{\alpha \geq 0}{\text{max}}\left\{\frac{1}{2}||w||^{2}-\sum_{i=1}^n \alpha_{i}[y_{i}(w \cdot x_{i} - b)-1]\right\}

This problem can now be solved by standard quadratic programming techniques and programs. The stationarity Karush-Kuhn-Tucker condition implies that the solution can be expressed as a linear combination of the training vectors: :math:`w = \sum_{i=1}^n \alpha_i y_i x_i`. Only a few :math:`\alpha_i` are greater than zero. The corresponding :math:`x_i` are exactly the support vectors, which lie on the margin and satisfy :math:`y_i (w\cdot x_i -b)=1`.

Dual form
---------

Using the fact that :math:`||w||^2 = w \cdot w` and substituting :math:`w = \sum_{i=1}^n \alpha_i y_i x_i`, one can show that the dual of the SVM reduces to the following optimization problem:

.. math::

    \max\ L(\alpha) = \sum_{i=1}^n \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j x_i^T x_j = \sum_{i=1}^n \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j k(x_i,x_j)

subject to (for any :math:`i = 1,\dots,n`)

.. math::

    \alpha_i \geq 0

and to the constraint coming from the minimization in :math:`b`:

.. math::

    \sum_{i=1}^n \alpha_i y_i = 0

:math:`w` can then be computed from the :math:`\alpha` terms: :math:`w=\sum_i \alpha_i y_i x_i`.

Soft margin
-----------

If there exists no hyperplane that can split the "yes" and "no" examples, the soft margin method will choose a hyperplane that splits the examples as cleanly as possible, while still maximizing the distance to the nearest cleanly split examples. The method introduces slack variables :math:`\xi_i`, which measure the degree of misclassification of the data :math:`x_i`:

.. math::

    y_i(w \cdot x_i - b) \geq 1 - \xi_i \quad 1 \leq i \leq n

The objective function is then increased by a function which penalizes non-zero :math:`\xi_i`, and the optimization becomes a trade-off between a large margin and a small error penalty. If the penalty function is linear, the optimization problem becomes:

.. math::

    \underset{w,b,\xi}{\text{min}}\left\{\frac{1}{2} ||w||^2 + C\sum_{i=1}^n \xi_i \right\}

subject to (for any :math:`i=1,\dots,n`)

.. math::

    y_i ( w \cdot x_i - b) \geq 1 - \xi_i, \quad \xi_i \geq 0

This constraint, together with the objective of minimizing :math:`||w||`, can be handled using Lagrange multipliers as done above. One then has to solve the problem:

.. math::

    \underset{w,b,\xi}{\text{min}}\ \underset{\alpha,\beta}{\text{max}}\left\{\frac{1}{2} ||w||^2 + C\sum_{i=1}^n \xi_i - \sum_{i=1}^n \alpha_i[y_i(w \cdot x_i - b)-1 +\xi_i]- \sum_{i=1}^n \beta_i\xi_i \right\}

with :math:`\alpha_i, \beta_i \geq 0`. The dual form becomes:

.. math::

    \underset{\alpha_i}{\text{max}}\ \left\{L(\alpha) = \sum_{i=1}^n \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j k(x_i,x_j)\right\}

subject to (for any :math:`i=1,\dots,n`)

.. math::

    0 \leq \alpha_i \leq C \quad \text{and} \quad \sum_{i=1}^n \alpha_i y_i = 0

Nonlinear SVM
-------------

The algorithm is formally similar, except that every dot product is replaced by a nonlinear kernel function. This allows the algorithm to fit the maximum-margin hyperplane in a transformed feature space. The transformation may be nonlinear and the transformed space high-dimensional; thus, although the classifier is a hyperplane in the high-dimensional feature space, it may be nonlinear in the original input space.

Some common kernels include:

- Polynomial: :math:`k(x_i,x_j)=(x_i\cdot x_j+c)^d`
- Gaussian radial basis function: :math:`k(x_i,x_j)=\exp(-\gamma ||x_i-x_j||^2)`
- Hyperbolic tangent: :math:`k(x_i,x_j)=\tanh(\gamma x_i\cdot x_j + c)`

The kernel is related to the transform :math:`\varphi(x_i)` by the equation

.. math::

    k(x_i,x_j)=\varphi(x_i)\cdot\varphi(x_j)

The value :math:`w` also lives in the transformed space, with :math:`w=\sum_i\alpha_i y_i\varphi(x_i)`.

The effectiveness of an SVM depends on the selection of the kernel, the kernel's parameters, and the soft margin parameter :math:`C`. A common choice is the Gaussian kernel, which has a single parameter :math:`\gamma`. The best combination of :math:`C` and :math:`\gamma` is selected by a grid search with exponentially growing sequences of :math:`C` and :math:`\gamma`. Each combination of parameter choices is checked using cross-validation, and the parameters with the best cross-validation accuracy are picked. The final model, which is used for testing and for classifying new data, is then trained on the whole training set using the selected parameters.

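To illustrate the Gaussian kernel and the exponentially growing search grids mentioned above, here is a minimal NumPy sketch; the grid ranges are conventional choices assumed for the example, not otsvm defaults:

.. code-block:: python

    import numpy as np

    def gaussian_kernel(X, Y, gamma):
        """Kernel matrix k(x_i, y_j) = exp(-gamma * ||x_i - y_j||^2) for all pairs."""
        sq_dists = (
            np.sum(X ** 2, axis=1)[:, None]
            + np.sum(Y ** 2, axis=1)[None, :]
            - 2.0 * X @ Y.T
        )
        return np.exp(-gamma * sq_dists)

    # Exponentially growing sequences for the (C, gamma) grid search;
    # each pair would then be scored by cross-validation.
    C_grid = 2.0 ** np.arange(-5, 16, 2)       # 2^-5, 2^-3, ..., 2^15
    gamma_grid = 2.0 ** np.arange(-15, 4, 2)   # 2^-15, 2^-13, ..., 2^3

    X = np.random.default_rng(0).normal(size=(5, 2))
    K = gaussian_kernel(X, X, gamma=0.5)       # 5 x 5 symmetric kernel matrix
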
Classification
--------------

Given training vectors :math:`x_i` in two classes and a label vector :math:`y` with :math:`y_i\in\{-1,1\}`, C-SVC (the algorithm that we use for classification) solves the following dual problem:

.. math::

    \min_{\alpha}\ \frac{1}{2} \alpha^T Q\alpha-e^T\alpha

    0\leq \alpha_i \leq C

    y^T\alpha=0

where :math:`e` is the vector of all ones, :math:`C>0` is the upper bound, :math:`Q` is an :math:`l` by :math:`l` positive semidefinite matrix with :math:`Q_{ij}=y_iy_jK(x_i,x_j)`, and :math:`K(x_i,x_j)` is the kernel. The decision function is

.. math::

    \operatorname{sign}\left(\sum_{i=1}^l y_i\alpha_iK(x_i,x)+b\right)

For some classification problems, the numbers of data in the different classes are unbalanced. We can then use different penalty parameters in the SVM formulation, and the dual of C-SVC becomes:

.. math::

    \min_{\alpha}\ \frac{1}{2} \alpha^T Q\alpha-e^T\alpha

    0\leq \alpha_i \leq C_+ \quad \text{if } y_i=1

    0\leq \alpha_i \leq C_- \quad \text{if } y_i=-1

    y^T\alpha=0

where :math:`C_+` and :math:`C_-` are the penalty parameters of the two classes, obtained from :math:`C` and the weights that we can fix for each class.

Regression
----------

Up to now, we have presented SVM for classification; it can also be used for regression. This is similar to the nonlinear case: we replace the soft margin by an :math:`\varepsilon`-insensitive loss function, defined as

.. math::

    |y-f(x)|_\varepsilon = \max(0,|y-f(x)|-\varepsilon)

where :math:`f(x)` is the predicted value and :math:`\varepsilon` a precision parameter. Introducing the slack variables :math:`\xi_i` and :math:`\xi_i^*`, we get the optimization problem:

.. math::

    \min_{w} \left\{\frac{||w||^2}{2}+C\sum_{i=1}^n(\xi_i+\xi_i^*) \right\}

subject to (for any :math:`i=1,\dots,n`)

.. math::

    l_i-w \cdot x_i+b\leq \varepsilon+\xi_i

    w \cdot x_i-b-l_i\leq \varepsilon +\xi_i^*

    \xi_i,\xi_i^* \geq 0

where :math:`l_i` is the target value associated with :math:`x_i` and :math:`C` is a trade-off parameter, as in the soft margin case. Solving this problem by again introducing Lagrange multipliers yields the regression function:

.. math::

    f(x)=\sum_{i=1}^n(\alpha_i-\alpha_i^*)K(x_i,x)-b

Choose a kernel
---------------

.. autosummary::
    :toctree: _generated/
    :template: class.rst_t

    NormalRBF
    ExponentialRBF
    LinearKernel
    PolynomialKernel
    RationalKernel
    SigmoidKernel

Classification
--------------

.. autosummary::
    :toctree: _generated/
    :template: class.rst_t

    SVMClassification

Regression
----------

.. autosummary::
    :toctree: _generated/
    :template: class.rst_t

    SVMRegression
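
To make the decision and regression functions above concrete, here is a small NumPy sketch that evaluates them from given dual coefficients. The coefficients, offset and kernel parameter are hand-picked for illustration; this does not use the otsvm classes listed above, which compute these quantities during training:

.. code-block:: python

    import numpy as np

    def gaussian_kernel(x, X, gamma=0.5):
        """k(x_i, x) = exp(-gamma * ||x_i - x||^2) for every training point x_i."""
        return np.exp(-gamma * np.sum((X - x) ** 2, axis=1))

    # Hand-picked training points, labels and dual coefficients (illustration only)
    X_train = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.5]])
    y_train = np.array([-1.0, 1.0, 1.0])
    alpha = np.array([0.8, 0.5, 0.3])
    b = 0.1

    def classify(x):
        # C-SVC decision function: sign(sum_i y_i alpha_i K(x_i, x) + b)
        return np.sign(np.sum(y_train * alpha * gaussian_kernel(x, X_train)) + b)

    # epsilon-SVR prediction f(x) = sum_i (alpha_i - alpha_i*) K(x_i, x) - b,
    # with its own (again hand-picked) pair of dual coefficient vectors
    alpha_r = np.array([0.6, 0.2, 0.4])
    alpha_r_star = np.array([0.1, 0.5, 0.2])

    def predict(x):
        return np.sum((alpha_r - alpha_r_star) * gaussian_kernel(x, X_train)) - b

    x_new = np.array([1.5, 0.8])
    print(classify(x_new), predict(x_new))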