Note 4: Support Vector Machine SVM - Tech It Yourself

## Saturday, 2 June 2018

1. Introduction
This algorithm is a different type of intuition of logistic regression. Assume that we have data set, in which green color represent positive training examples and brown color represent negative training examples. A line (or hyperplane) is trying to separate these training examples.
Figure: Separate training examples using a line (or hyperplane)
There are many solutions (black or blue line) for this problem. SVM supports to find the optimal solution.
Figure: The optimal line is red line
The problem of finding the optimal hyperplane is an optimization problem and can be solved by optimization techniques (use Lagrange multipliers).
2. Notation
Let 's define a binary classification problem with labels y and features x. We will use $y\in {-1,1}$, use w instead of $[\theta_{0}...\theta_{n}]^{T}$ and b instead of $\theta_{0}$. So:
$y=h(x)=g(w^{T}x + b)\\ \text{y=1 if }w^{T}x + b \geq 1\\ \text{y=-1 if }w^{T}x + b \leq -1\\ \text{These can be combined into}\\y^{(i)}(w^{T}x^{(i)}+b)\\ \text{if }y^{(i)}(w^{T}x^{(i)+b}) \ geq 1 \Rightarrow \text{ our prediction on this example is correct}\\ \text{Note: }w^{T}x + b=0 \text{ is the equation of separating hyperplane.}$
We define support vectors are data points that satisfy:
$w^{T}x+b=1 \text{ or } w^{T}x+b=-1$
They are data points that just touch the boundary of the margin circled below, there are 3 of them ($v_{1},v_{2},v_{3}$). d is 1/2 of the 'width' and is the distance between margin and separating hyperplane. (w,b) only depends on these support vectors.
Figure: 3 support vectors
3. Problem
$\text{We define the hyperplane such that for training examples (i=1,...,m) we have:}\\ w^{T}x^{(i)}+b\geq 1\text{ if }y^{(i)}= 1\\ w^{T}x^{(i)}+b\leq 1\text{ if }y^{(i)}= -1\\ \text{and }H_{1} \text{ and }H_{2} \text{ are hyperplane that}\\ H_{1}:w^{T}x^{(i)}+b= 1\\ H_{2}:w^{T}x^{(i)}+b= -1\\ \text{Of course, support vectors lie on these hyperplanes}\\ \text{and}\\ \text{d+: is the shortest distance to the positive example}\\ \text{d-: is the shortest distance to the negative example}\\ \text{The margin of separating hyperplane is (d+) + (d-) }\\ \text{Recall that the distance from point } (x^{(i)}) \text{ to hyperplane } w^{T}x^{(i)}+b =0\\ \text{is calculated by formula:}\\ \frac{|w^{T}x^{(i)}+b|}{\left \| w \right \|}=\frac{1}{\left \| w \right \|}$

The total distance between $H_{1}, H_{2}$ is $\frac{2}{\left \| w \right \|}$.
Our problem is maximize $\frac{2}{\left \| w \right \|\\}$ with constraint that no data points between $H_{1}, H_{2}$. It means:
$w^{T}x^{(i)}+b\geq 1\text{ if }y^{(i)}= 1\\ w^{T}x^{(i)}+b\leq -1\text{ if }y^{(i)}= -1$
Or we can convert it to minimizing problem:
Minimize: $\left \| w \right \|\\$ or $\frac{1}{2}\left \| w \right \|^{2}$
subject to: $y^{(i)}(w^{T}x^{(i)}+b)\geq 1\\ \text{ and }\lambda _{i}\geq 0$
We can solve this problem using KKT. Let 's make an example with very simple data set below and use KKT to solve it.
Figure: 2 data examples (2,2) and (1,1)
Our problem:
Minimize: $\frac{1}{2}\left \| w \right \|^{2}$
subject to: $y^{(i)}(w^{T}x^{(i)}+b)\geq 1\\$
We have:
$w=\begin{bmatrix} w_{1}\\ w_{2} \end{bmatrix}, x=\begin{bmatrix} x_{1}\\ x_{2} \end{bmatrix}$
Substitute data set into constraints:
$+1(2w_{1}+2w_{2}+b)\geq 1\\ -1(w_{1}+w_{2}+b)\geq -1\\ \text{or}\\ 1-(2w_{1}+2w_{2}+b)\leq 0\\ 1+(w_{1}+w_{2}+b)\leq 0$
The Lagrangian:
$L=\frac{1}{2}w_{1}^{2}+\frac{1}{2}w_{2}^{2}+\lambda _{1}(1-(2w_{1}+2w_{2}+b))+\lambda _{2}(1+(w_{1}+w_{2}+b))$
Applying KKT we have:
$w_{1}-2\lambda _{1}+\lambda _{2}=0\\ w_{2}-2\lambda _{1}+\lambda _{2}=0\\ -\lambda _{1}+\lambda _{2}=0\\ \lambda _{1}(1-(2w_{1}+2w_{2}+b))=0\\ \lambda _{2}(1+(w_{1}+w_{2}+b))=0\\ \lambda _{1}\geq 0,\lambda _{2}\geq 0$
There are 4 cases to check. I just check the case $\lambda _{1}> 0,\lambda _{2}> 0$ and solution is: $w_{1}=w_{2}=1, b=-3$
The separating hyperplane is: $x_{1}+x_{2}-3=0$
Figure: Separating hyperplane
We can check that the data points are support vectors.
$(2,2)\Rightarrow y=2+2-3=1\\ (1,1)\Rightarrow y=1+1-3=-1$
If you are familiar with Python cxvopt package that is used for solving convex optimization problem, you can model our problem like below:
Note: The standard form of our problem in CVXOPT is:
$min_{x}\frac{1}{2}x^{T}Px+q^{T}x$
subject to: $Gx\leq h\\Ax=b$
We can re-write our problem under new form:
$\begin{bmatrix} w_{1}\\ w_{2}\\ b \end{bmatrix}^{T} \begin{bmatrix} 1.0 & 0.0 & 0.0\\ 0.0 & 1.0 & 0.0\\ 0.0 & 0.0 & 0.0 \end{bmatrix} \begin{bmatrix} w_{1}\\ w_{2}\\ b \end{bmatrix}+\begin{bmatrix} 0.0\\ 0.0\\ 0.0 \end{bmatrix} \begin{bmatrix} w_{1}\\ w_{2}\\ b \end{bmatrix}$
subject to:
$\begin{bmatrix} -2.0 & -2.0 & -1.0\\ 1.0 & 1.0 & 1.0 \end{bmatrix} \begin{bmatrix} w_{1}\\ w_{2}\\ b \end{bmatrix}\leq \begin{bmatrix} -1.0\\ -1.0 \end{bmatrix}$
  1 2 3 4 5 6 7 8 9 10 11 from cvxopt import matrix from cvxopt import solvers import numpy as np P = matrix(np.array([[1.0,0.0,0.0],[0.0,1.0,0.0],[0.0,0.0,0.0]]), tc='d') q = matrix(np.array([0.0,0.0,0.0]), tc='d') G = matrix(np.array([[-2.0,-2.0,-1.0],[1.0,1.0,1.0]]), tc='d') h = matrix(np.array([-1.0,-1.0]), tc='d') sol = solvers.qp(P,q,G,h) print(sol['x']) 
Figure: Using cvxopt to solve the problem
4. Kernels
We knew how to apply SVM for linear problem. So how to apply SVM for non-linearly separable classification problems. Let 's review our problem and change perspective.
We have the Lagrangian of problem:
$L = \frac{1}{2}\left \| w \right \|^{2} + \sum_{i=1}^{m}\lambda_{i}(1-y^{(i)}(w^{T}x^{(i)}+b))$
We have:
$\bigtriangledown _{w}L=w-\sum_{i=1}^{m}\lambda _{i}x^{(i)}y^{(i)}=0$
and
$\bigtriangledown _{b}L=\sum_{i=1}^{m}\lambda _{i}y^{(i)}=0$
We plug these back to Lagrangian:
$L=\sum_{i=1}^{m}\lambda _{i}-\frac{1}{2}\sum_{i,j=1}^{m}y^{(i)}y^{(j)}\lambda _{i}\lambda _{j}(x^{(i)})^{T}x^{(j)}$
we obtain the dual problem:
$max_f=\sum_{i=1}^{m}\lambda _{i}-\frac{1}{2}\sum_{i,j=1}^{m}y^{(i)}y^{(j)}\lambda _{i}\lambda _{j}<x^{(i)},x^{(j)}>$
subject to: $\lambda _{i}\geq 0,i=1,...,m\\ \sum_{i=1}^{m}\lambda _{i}y^{(i)}=0$
By changing perspective, we can gain insight the problem. And we are able to write the algorithm by only using inner products between input feature vectors.
+ If two feature vectors $x^{i},x^{j}$ are completely dissimilar their dot product is 0, and they don’t contribute to f.
+ If two feature vectors $x^{i},x^{j}$ are completely alike, their inner product is 1. We have 2 cases:
++ If both predict the same output value (+1 or -1) so the $y^{(i)}y^{(j)}\lambda _{i}\lambda _{j}<x^{(i)},x^{(j)}>$ always positive. This decreases f, so the algorithm downgrades the similar feature vectors that make the same prediction.
++ If both predict the different output value (+1 and -1) so the $y^{(i)}y^{(j)}\lambda _{i}\lambda _{j}<x^{(i)},x^{(j)}>$ always negative. This increases f, so these points contribute to the solution.
Note:
- At optimal point, the solution of primal and dual is equivalent
- The dual form is useful for an algorithm that supports to solve our problem on computer. It is SMO (sequential minimal optimization) algorithm.
Assume that we have a data set like below:
Figure: non-linear data set
We can not separate this data set using linear problem. In order to solve this problem, we will re-use the idea in linear regression. That is mapping the features to a higher dimensional space (e.g: $x^{2} or x^{3}$) via $\varphi$ a feature mapping function. So our problem becomes linear problem.
Figure: Feature mapping
Now we replace x by $\varphi (x)$. We have:
$max_f=\sum_{i=1}^{m}\lambda _{i}-\frac{1}{2}\sum_{i,j=1}^{m}y^{(i)}y^{(j)}\lambda _{i}\lambda _{j}<\varphi(x^{(i)})^{T},\varphi(x^{(j)})>$. But computing $<\varphi(x^{(i)})^{T},\varphi(x^{(j)})>$ is expensive and time consuming (in case $\varphi$ is a quartic polynomial). If there is a function k such that $K(x^{(i)}, x^{(j)})=<\varphi(x^{(i)})^{T},\varphi(x^{(j)})>$ we do not need to compute $\varphi$ anymore. We call K is kernel function. It defines similarity in the transformed space. Our problem becomes:
$max_f=\sum_{i=1}^{m}\lambda _{i}-\frac{1}{2}\sum_{i,j=1}^{m}y^{(i)}y^{(j)}\lambda _{i}\lambda _{j}K(x^{(i)}, x^{(j)})$. let 's see some examples:
The polynomial kernel:
$K(x,z)=(x^{T}z+c)^{2}\\ =\sum_{i,j=1}^{n}(x_{i}x_{j})(z_{i}z_{j})+\sum_{i=1}^{n}(\sqrt{2cx_{i}})(\sqrt{2cz_{i}})+c^{2}$
The Gaussian kernel:
$K(x,z)=exp(-\frac{\left \| x-z \right \|^{2}}{2\sigma ^{2}})$
You can see the computation is inexpensive comparing to using $\varphi$.
4.2 Validate Kernel
Given Kernel function K(x,z) how to check whether it is valid or not (there exists feature mapping function $\varphi$ such that $K(x^{(i)}, x^{(j)})=<\varphi(x^{(i)})^{T},\varphi(x^{(j)})>$).
Assume that we have m training sets ${x^{1},...,x^{m}}$ and square m-by-m matrix K with $K_{ij}=K(x^{(i)},x^{(j)})$. This matrix is called Kernel matrix. Moreover, we have:
$K_{ij}=\varphi (x^{(i)})^{T}\varphi (x^{(j)})=\varphi (x^{(j)})^{T}\varphi (x^{(i)})=K_{ji}$
$\Rightarrow$ K is symmetric.
Mercer theorem: For K to be a valid (Mercel) kernel, it is necessary and sufficient that the corresponding kernel matrix is symmetric positive semi-definite.
5. Regularization
Let 's look the figure below:
Figure: Left-relaxed margin and Right-much smaller margin
The left picture shows a 'relaxed' margin classifier. In the left picture, when a outlier is added in the upper-left region, it causes the resulting classifier has a smaller margin.
In order to cover for the case in the right picture, we reformulate the problem using $l_{1}$ regularization:
$min_{w,b,\gamma }=\frac{1}{2}\left \| w \right \|^{2}+C\sum_{i=1}^{m}\gamma _{i}$
subject to: $y^{(i)}(w^{T}x^{(i)}+b)\geq 1-\gamma _{i},i=1,...m\\ \gamma _{i}\geq 0,i=1,...m$
For training example that $\gamma _{i} > 0$, its margin is less than 1 and the cost function increases $C\gamma _{i}$.
The new Lagrangian:
$L(w,b,\lambda ,\gamma ,r )=\frac{1}{2}\left \| w \right \|^{2}+C\sum_{i=1}^{m}\gamma _{i}-\sum_{i=1}^{m}\lambda _{i}[y^{(i)}(w^{T}x+b)-1+\gamma _{i}]-\sum_{i=1}^{m}r_{i}\gamma _{i}$
$\lambda _{i}, r_{i}$ is Lagrange multipliers.
The new dual problem:
$max_{\lambda }W(\lambda)=\sum_{i=1}^{m}\lambda _{i}-\frac{1}{2}y^{(i)}y^{(j)}\lambda _{i}\lambda _{j}<x^{(i)},x^{(j)}>$
subject to: $0\leq \lambda _{i}\leq C, i=1,...,m\\ \sum_{i=1}^{m}\lambda _{i}y^{(i)}=0$
The new KKT conditions:
$\lambda _{i}=0\Rightarrow y^{(i)}(w^{T}x+b)\geq 1\\ \lambda _{i}=C\Rightarrow y^{(i)}(w^{T}x+b)\leq 1\\ 0<\lambda _{i}<C\Rightarrow y^{(i)}(w^{T}x+b)= 1$
6. The implementation notes
Simple SVM
from sklearn.datasets.samples_generator import make_blobs
import matplotlib.pyplot as plt
import numpy as np
from sklearn.svm import SVC

X, y = make_blobs(n_samples=50, centers=2, random_state=0, cluster_std=0.60)
plt.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='autumn');

model = SVC(kernel='linear', C=1E10)
model.fit(X, y)

plt.scatter(model.support_vectors_[:,0], model.support_vectors_[:,1], s=30)
ax = plt.gca()
xlim = ax.get_xlim()
ylim = ax.get_ylim()
x = np.linspace(xlim, xlim, 30)
y = np.linspace(ylim, ylim, 30)
Y, X = np.meshgrid(y, x)
xy = np.vstack([X.ravel(), Y.ravel()]).T
predicted = model.decision_function(xy).reshape(X.shape)
ax.contour( X, Y, predicted, colors='k', levels=[-1, 0, 1], linestyles=['--', '-', '--'])

plt.show()
Kernel SVMIt is not possible to separate this dataset using linear discrimination.
We projects our data into higher-dimensional space defined by Gaussian basis functions, and thereby were able to fit for nonlinear relationships with a linear classifier.
import matplotlib.pyplot as plt
import numpy as np
from sklearn.svm import SVC
from sklearn.datasets.samples_generator import make_circles
from mpl_toolkits import mplot3d
from sklearn.gaussian_process.kernels import RBF

X, y = make_circles(100, factor=.1, noise=.1)
#plt.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='autumn');
#r = np.exp(-(X ** 2).sum(1))
r = RBF(1.0).__call__(X).sum(1)
ax = plt.subplot(projection='3d')
ax.scatter3D(X[:, 0], X[:, 1], r, c=y, s=50, cmap='autumn')
ax.set_xlabel('x')
ax.set_ylabel('y')
ax.set_zlabel('r')

plt.show()
Using this kernelized support vector machine, we learn a suitable nonlinear decision boundary. This kernel transformation strategy is used often in machine learning to turn fast linear methods into fast nonlinear methods, especially for models in which the kernel trick can be used.
import matplotlib.pyplot as plt
import numpy as np
from sklearn.svm import SVC
from sklearn.datasets.samples_generator import make_circles

X, y = make_circles(100, factor=.1, noise=.1)
plt.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='autumn');

model = SVC(kernel='rbf', C=1E6)
model.fit(X, y)

plt.scatter(model.support_vectors_[:,0], model.support_vectors_[:,1], s=30)
ax = plt.gca()
xlim = ax.get_xlim()
ylim = ax.get_ylim()
x = np.linspace(xlim, xlim, 30)
y = np.linspace(ylim, ylim, 30)
Y, X = np.meshgrid(y, x)
xy = np.vstack([X.ravel(), Y.ravel()]).T
predicted = model.decision_function(xy).reshape(X.shape)
ax.contour( X, Y, predicted, colors='k', levels=[-1, 0, 1], linestyles=['--', '-', '--'])

plt.show()
SVM is a powerful classification method for a number of reasons:
• It depends on relatively few support vectors, so they are very compact models.
• The prediction phase is very fast.
• It works well with high-dimensional data. Even data with more dimensions than samples, which is a challenging regime for other algorithms.
• Their integration with kernel methods makes them very versatile, able to adapt to many types of data.
• The scaling with the number of samples N is N3 at worst, or N2 for efficient implementations. For large numbers of training samples, this computational cost can be prohibitive.