Support vector machines are a super star in machine learning and data mining in the past decade. It reheats statistical learning in machine learning community. It is also one of the classifiers that work practically with applications to other disciplines such as bioinformatics.
In this post, I will give a brief review of the theory of this classifier and show an F# implementation using the quadratic programming (QP) solver in Microsoft Solver Foundation (MSF).
It may seem dull to start with mathematics. But the math in SVMs is actually quite intuitive and would be quite easy if one has some training in linear algebra and calculus. Also understanding the intuition behind the formulae is more important in the formulation itself. There are quite a lot of good SVM tutorials written by researchers. Thus I only write the important formulas that occur in the implementation. The reader could link the F# code with formulas directly.
My note is based on the machine learning course by Prof. Jaakkola [Jaakkola] and the Matlab implementation therein. The course material is highly comprehensive yet accurate.
Background review for SVMs
Given a binary labeled dataset, where and is 1 or -1. Think about the data points plotted in a d dimensional (Euclidean) space, the linear SVM classifier is a hyperplane in the space and it “best” separates the two kinds of data points. Here “best” means that the hyperplane should have the largest margin, which is the distance from the plane to the sides of labeled points. Intuitively the larger the margin is, the more robust and confident the classifier is. As if the hyperplane shakes a little, it still classifies well for its being far from both sides.
Let’s take d=2 as the discussing example: data points are plotted in a 2-D plane. The aim is to draw a line to separate the two kinds of points such that the margin of the separation is maximized. In the following Figure, the line with maximum margin is shown. After some geometry work, the margin is calculated as , thus minimizing with the constraints that the two kinds of points are well separated () gives the max-margin hyperplane.
Two important extensions to the above simplest SVM are:
1. Allowing imperfect separation, that is when a hyperplane could not separate the points perfectly, it is allowed to mis-separate some data points. This introduces a slack variable for each data point , when the point is on the correct side of the plane, the slack variable is 0, otherwise it is measures the distance it goes from the plane minus 1, if the values goes to below zero set it zero. (Please read the Note and reference section for more explanation.)
2. Kernel. The simplest SVM is a linear classifier in the original space. However, there are situations where linear classifier is not sufficient (even the best), one strategy is to do a non-linear transformation or mapping that maps a data point to another space, so that a linear classifier in that space, which is non-linear in the original space, does a better separation of the data points. The kernel trick is that we don’t need to define explicitly; only the definition of inner product of that space is required: .
With the two extensions, the new maximum margin objective becomes:
The dual form:
The data points with its alpha value greater than 0 are called support vectors.
The solver is actually an Interior Point Solver, which could solve linear and quadratic programming with linear constraints.
Page 14-17 of MSF-SolverProgrammingPrimer.pdf show an example usage of the QP solver in Microsoft Solver Foundation. But the example is not complete and it does not demonstrate one very subtle part: how to set the coefficients for the quadratic terms.
The following code example shows a simple example with the the comment regarding the coefficient setting:
The implementation of SVM is a straightforward translation of equations (1) to (5). The following shows the svm learning (bulidSvm) and testing (svmClassify) functions:
Try on the iris data set we used in the Logistic Regression post:
let svm = buildSVM iris 10.0 Kernel.linear
let classifyResult = svmClassify svm iris
val svm : svm =
[|[|1.0; 6.2; 2.2; 4.5; 1.5|]; [|1.0; 5.9; 3.2; 4.8; 1.8|];
[|1.0; 6.3; 2.5; 4.9; 1.5|]; [|1.0; 6.7; 3.0; 5.0; 1.7|];
[|1.0; 6.0; 2.7; 5.1; 1.6|]; [|1.0; 5.4; 3.0; 4.5; 1.5|];
[|1.0; 4.9; 2.5; 4.5; 1.7|]; [|1.0; 6.0; 2.2; 5.0; 1.5|];
[|1.0; 6.3; 2.7; 4.9; 1.8|]; [|1.0; 6.2; 2.8; 4.8; 1.8|];
[|1.0; 6.1; 3.0; 4.9; 1.8|]; [|1.0; 6.3; 2.8; 5.1; 1.5|];
[|1.0; 6.0; 3.0; 4.8; 1.8|]|];
labels = [|-1; -1; -1; -1; -1; -1; 1; 1; 1; 1; 1; 1; 1|];};
[|6.72796421; 10.0; 10.0; 10.0; 10.0; 6.475497298; 1.490719712;
8.547262902; 3.165478894; 10.0; 10.0; 10.0; 10.0|];
kernel = <fun:svm@226-32>;
w0 = -13.63716815;}
Real: 00:00:00.002, CPU: 00:00:00.000, GC gen0: 0, gen1: 0, gen2: 0
val classifyResult : float * float  =
[|-2.787610618; -2.380530973; -1.42477876; -2.929203541; -1.681415929;
-1.96460177; -1.24778761; -6.106194692; -2.761061946; -2.973451328;
…. 4.203539824; 1.82300885|])
Notes and references
The optimization in SVMs could be treated as a general QP problem as shown in our implementation. However, when the number of the data points is big, the QP grows too big to handle by the QP solver.
As a special QP problem, SVM has a lot of novel solutions proposed in the machine learning research area, e.g. the SMO approach [Platt] in LibSVM implementation and the Ball Vector Machine approach [Tsang] transforming the special QP into a (discrete) computational geometry problem, minimum enclosing ball. For practical usage of SVM, one usually uses these dedicated approaches.
is viewed as minimizing the error term while controlling the complexity of the function and C is a tradeoff parameter between these two. The reader is referred to the teaching notes and papers by Prof. Poggio [Poggio].
[Burges] A Tutorial on Support Vector Machines for Pattern Recognition. http://research.microsoft.com/en-us/um/people/cburges/papers/svmtutorial.pdf
[Platt] Sequential Minimal Optimization: A Fast Algorithm for Training Support Vector Machines. ftp://ftp.research.microsoft.com/pub/tr/tr-98-14.pdf