
Support Vector Machines

Master maximum margin classification, kernel methods, and optimization techniques

Maximum Margin Principle

Finding the Optimal Hyperplane

SVM finds the hyperplane that maximizes the margin between two classes:

Decision Hyperplane:

$$\mathbf{w}^T\mathbf{x} + b = 0$$

Margin Boundaries:

$$\mathbf{w}^T\mathbf{x} + b = +1 \quad \text{(positive class)}$$
$$\mathbf{w}^T\mathbf{x} + b = -1 \quad \text{(negative class)}$$

Margin Width:

$$\text{Margin} = \frac{2}{\|\mathbf{w}\|}$$

Hard-Margin SVM (Linearly Separable)

Optimization problem:

$$\min_{\mathbf{w}, b} \frac{1}{2}\|\mathbf{w}\|^2$$

Subject to:

$$y_i(\mathbf{w}^T\mathbf{x}_i + b) \geq 1, \quad \forall i$$

Minimizing $\frac{1}{2}\|\mathbf{w}\|^2$ maximizes the margin $2/\|\mathbf{w}\|$

Support Vectors

Samples on the margin boundaries:

$$y_i(\mathbf{w}^T\mathbf{x}_i + b) = 1$$
  • Only support vectors determine the decision boundary
  • Removing other points doesn't change the model
  • Typically only a small fraction of the samples are support vectors
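
To make this concrete, here is a minimal sketch (assuming NumPy and scikit-learn are available) that fits a linear SVC with a very large C as a stand-in for the hard-margin case and reads off the support vectors and the margin width; the toy data are illustrative, not from the text.

```python
# Minimal sketch: hard-margin behaviour via scikit-learn's SVC with a large C.
import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.0],       # class +1
              [-1.0, -1.0], [-2.0, -1.5], [-1.5, -2.0]])  # class -1
y = np.array([1, 1, 1, -1, -1, -1])

clf = SVC(kernel="linear", C=1e6)   # very large C approximates the hard margin
clf.fit(X, y)

w, b = clf.coef_[0], clf.intercept_[0]
print("w =", w, " b =", b)
print("support vectors:\n", clf.support_vectors_)
print("margin width 2/||w|| =", 2 / np.linalg.norm(w))
```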

Soft Margin SVM

Handling Non-Separable Data

Real data is often not linearly separable. Soft-margin SVM allows controlled violations:

Objective Function:

$$\min_{\mathbf{w}, b, \boldsymbol{\xi}} \frac{1}{2}\|\mathbf{w}\|^2 + C\sum_{i=1}^{n} \xi_i$$

Constraints:

$$y_i(\mathbf{w}^T\mathbf{x}_i + b) \geq 1 - \xi_i, \quad \xi_i \geq 0, \quad \forall i$$

• $\xi_i$ = slack variable (amount of violation for sample $i$)

• $C$ = regularization parameter (penalty for violations)

  • Small C: wide margin, more tolerance for errors → underfitting risk
  • Moderate C: balanced trade-off, good generalization → usually optimal
  • Large C: narrow margin, less tolerance for errors → overfitting risk
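
A small experiment makes this trade-off visible. The sketch below (assuming scikit-learn; the blob data and the three C values are arbitrary choices) shows how the margin narrows and the support-vector count typically shrinks as C grows.

```python
# Sketch: effect of C on margin width and support-vector count
# for a linear soft-margin SVM on noisy, overlapping 2-D data.
import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=200, centers=2, cluster_std=2.5, random_state=0)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    w = clf.coef_[0]
    print(f"C={C:>6}: margin width={2 / np.linalg.norm(w):.3f}, "
          f"support vectors={len(clf.support_)}, "
          f"train accuracy={clf.score(X, y):.3f}")
```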

The Kernel Trick

Non-Linear Classification

Kernel functions implicitly map data to higher dimensions where it becomes linearly separable:

$$K(\mathbf{x}_i, \mathbf{x}_j) = \phi(\mathbf{x}_i)^T \phi(\mathbf{x}_j)$$

Compute the dot product in the high-dimensional space without explicitly computing $\phi(\mathbf{x})$

Linear Kernel
$$K(\mathbf{x}, \mathbf{z}) = \mathbf{x}^T\mathbf{z}$$

Standard SVM, no transformation. Use for linearly separable data.

Polynomial Kernel
$$K(\mathbf{x}, \mathbf{z}) = (\mathbf{x}^T\mathbf{z} + c)^d$$

Degree-$d$ polynomial features; $c$ shifts the influence of higher-order versus lower-order terms. Good for image processing.

RBF (Gaussian) Kernel ⭐
$$K(\mathbf{x}, \mathbf{z}) = \exp\left(-\gamma \|\mathbf{x} - \mathbf{z}\|^2\right)$$

• Most popular kernel
• Parameter γ controls influence radius
• Maps to infinite-dimensional space
• Smooth, flexible decision boundaries

Common choice: γ = 1/number of features

Sigmoid Kernel
$$K(\mathbf{x}, \mathbf{z}) = \tanh(\kappa \mathbf{x}^T\mathbf{z} + c)$$

Inspired by neural networks. A valid (positive semi-definite) kernel only for certain values of $\kappa$ and $c$.
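
The kernels above can be evaluated directly as Gram matrices. The sketch below uses scikit-learn's pairwise helpers; parameter names such as degree, gamma, and coef0 follow scikit-learn's conventions, with gamma playing the role of $\kappa$ and coef0 the role of $c$.

```python
# Sketch: Gram matrices for the four kernels on a tiny illustrative dataset.
import numpy as np
from sklearn.metrics.pairwise import (linear_kernel, polynomial_kernel,
                                      rbf_kernel, sigmoid_kernel)

X = np.array([[0.0, 1.0], [1.0, 2.0], [2.0, 0.5]])

print(linear_kernel(X))                                  # x^T z
print(polynomial_kernel(X, degree=2, gamma=1.0, coef0=1.0))  # (x^T z + c)^d
print(rbf_kernel(X, gamma=1.0 / X.shape[1]))             # exp(-gamma ||x - z||^2)
print(sigmoid_kernel(X, gamma=0.5, coef0=0.0))           # tanh(kappa x^T z + c)
```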

Dual Formulation

Lagrangian Dual Problem

The dual formulation allows applying the kernel trick and is easier to solve:

Dual Objective (Maximize):

$$L_D(\boldsymbol{\alpha}) = \sum_{i=1}^{n} \alpha_i - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n} \alpha_i\alpha_j y_i y_j K(\mathbf{x}_i, \mathbf{x}_j)$$

Constraints:

$$0 \leq \alpha_i \leq C, \quad \sum_{i=1}^{n} \alpha_i y_i = 0$$

Decision Function:

$$f(\mathbf{x}) = \text{sign}\left(\sum_{i=1}^{n} \alpha_i y_i K(\mathbf{x}_i, \mathbf{x}) + b\right)$$

Advantages of Dual

  • Enables the kernel trick (no explicit $\phi(\mathbf{x})$)
  • Quadratic programming problem (convex)
  • Only depends on dot products
  • Sparse solution ($\alpha_i = 0$ for non-support vectors)

KKT Conditions

$$\alpha_i[y_i(\mathbf{w}^T\mathbf{x}_i + b) - 1 + \xi_i] = 0$$

• $\alpha_i = 0$: not a support vector
• $0 < \alpha_i < C$: on the margin boundary
• $\alpha_i = C$: inside the margin or misclassified
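
These cases can be checked on a fitted model. The sketch below assumes scikit-learn's SVC, whose dual_coef_ attribute stores $\alpha_i y_i$ for the support vectors, so its absolute value recovers $\alpha_i$; the dataset and C value are illustrative.

```python
# Sketch: reading the dual variables off a fitted SVC and bucketing them.
import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=100, centers=2, cluster_std=2.0, random_state=1)
C = 1.0
clf = SVC(kernel="linear", C=C).fit(X, y)

alpha = np.abs(clf.dual_coef_[0])          # alpha_i, support vectors only
on_margin = np.sum(alpha < C - 1e-8)       # 0 < alpha_i < C
at_bound = np.sum(alpha >= C - 1e-8)       # alpha_i = C (inside margin or misclassified)
print("support vectors:", len(alpha),
      " on margin:", on_margin, " at bound C:", at_bound)
```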

Multi-class & Implementation

Multi-class Strategies

One-vs-Rest (OvR)

Train $K$ binary classifiers (each class vs. all others). Predict the class with the highest confidence.

One-vs-One (OvO)

Train $K(K-1)/2$ classifiers, one for each pair of classes, and combine them by voting. Often more accurate, but more training time.
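
Both strategies are available as wrappers in scikit-learn; the sketch below is illustrative and uses the iris dataset only because it has K = 3 classes.

```python
# Sketch: one-vs-rest and one-vs-one multi-class SVMs via scikit-learn wrappers
# (SVC itself already uses a one-vs-one scheme internally).
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)   # 3 classes

ovr = OneVsRestClassifier(SVC(kernel="rbf", gamma="scale")).fit(X, y)  # K classifiers
ovo = OneVsOneClassifier(SVC(kernel="rbf", gamma="scale")).fit(X, y)   # K(K-1)/2 classifiers

print("OvR estimators:", len(ovr.estimators_))   # 3
print("OvO estimators:", len(ovo.estimators_))   # 3
```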

Practical Considerations

Feature Scaling

Essential! Standardize features to mean 0, std 1

Hyperparameter Tuning

Grid search over $C$ and the kernel parameters ($\gamma$, $d$); see the sketch after the complexity note below

Computational Complexity

Training: roughly $O(n^2)$ to $O(n^3)$ in the number of samples $n$. Prediction: proportional to the number of support vectors
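
The sketch below combines the two practical points above, scaling inside a pipeline and grid-searching $C$ and $\gamma$; scikit-learn is assumed, and the dataset and grid values are illustrative rather than recommendations.

```python
# Sketch: StandardScaler + RBF SVC in a pipeline, tuned with GridSearchCV.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

pipe = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
grid = GridSearchCV(pipe,
                    {"svc__C": [0.1, 1, 10, 100],
                     "svc__gamma": [0.001, 0.01, 0.1, 1]},
                    cv=5)
grid.fit(X_tr, y_tr)

print("best params:", grid.best_params_)
print("test accuracy:", grid.score(X_te, y_te))
```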

When to Use SVM

  • High-dimensional data (text, images)
  • Clear margin of separation
  • Small to medium datasets

Example: Non-Linear Classification

RBF Kernel SVM for Circular Decision Boundary

Problem Setup

Dataset: Points inside circle (class +1) vs outside (class -1). Centered at origin, radius 2.

Sample Data:

Class +1 (inside):

(0.5, 0.5), (1, 1), (-0.8, 1.2), ...

Class -1 (outside):

(3, 0), (0, 3.5), (-2.5, 2), ...

Model Configuration

Kernel & Parameters

  • Kernel: RBF
  • $C$: moderate regularization
  • $\gamma$: moderate influence radius

Results

  • Training accuracy: 100%
  • Test accuracy: 98%
  • Support vectors: 12 / 100
  • Decision boundary: smooth circle

Key Observation

Linear kernel would fail (accuracy ~50%). RBF kernel implicitly maps to infinite dimensions where the circular pattern becomes linearly separable. The decision function creates a smooth, circular boundary with minimal support vectors.
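
The exact dataset and hyperparameters behind these numbers are not given, so the following is only a sketch that recreates a similar setup: uniform points labeled by whether they fall inside a radius-2 circle, fit with an RBF SVC whose C and gamma values are assumptions. Your accuracies and support-vector counts will differ somewhat.

```python
# Sketch: RBF SVM on a synthetic circular classification problem.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.uniform(-4, 4, size=(200, 2))
y = np.where(np.linalg.norm(X, axis=1) < 2.0, 1, -1)   # inside radius 2 -> +1

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = SVC(kernel="rbf", C=1.0, gamma=0.5).fit(X_tr, y_tr)

print("train accuracy:", clf.score(X_tr, y_tr))
print("test accuracy:", clf.score(X_te, y_te))
print("support vectors:", len(clf.support_), "of", len(X_tr))
```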

Margin Maximization Complete Derivation

From Geometric Margin to Optimization Problem

Geometric Margin Definition

Hyperplane: $\mathbf{w}^T\mathbf{x} + b = 0$

Distance from point x_i to hyperplane:

$$\text{distance} = \frac{|\mathbf{w}^T\mathbf{x}_i + b|}{\|\mathbf{w}\|}$$

For correctly classified point: y_i(w^T x_i + b) > 0, so:

$$\gamma_i = \frac{y_i(\mathbf{w}^T\mathbf{x}_i + b)}{\|\mathbf{w}\|}$$

Margin: Minimum distance over all points

$$\gamma = \min_{i=1,\ldots,n} \gamma_i = \min_{i} \frac{y_i(\mathbf{w}^T\mathbf{x}_i + b)}{\|\mathbf{w}\|}$$

Optimization Formulation

Goal: Maximize γ

$$\max_{\mathbf{w},b} \min_{i} \frac{y_i(\mathbf{w}^T\mathbf{x}_i + b)}{\|\mathbf{w}\|}$$
$$\text{subject to } y_i(\mathbf{w}^T\mathbf{x}_i + b) > 0 \quad \forall i$$

Simplification via Scaling

Key insight: Scaling (w, b) by constant doesn't change the hyperplane!

Choose scale such that for the closest points (support vectors):

$$\min_i y_i(\mathbf{w}^T\mathbf{x}_i + b) = 1$$

Then margin becomes:

$$\gamma = \frac{1}{\|\mathbf{w}\|}$$

Simplified Optimization:

$$\max_{\mathbf{w},b} \frac{1}{\|\mathbf{w}\|} \quad \iff \quad \min_{\mathbf{w},b} \frac{1}{2}\|\mathbf{w}\|^2$$
$$\text{subject to } y_i(\mathbf{w}^T\mathbf{x}_i + b) \geq 1 \quad \forall i$$
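
This scaling convention can be verified numerically: after fitting a (near) hard-margin linear SVM, the smallest functional margin is 1 and the geometric margin equals $1/\|\mathbf{w}\|$. A sketch, assuming scikit-learn and an illustrative toy dataset:

```python
# Sketch: checking min_i y_i(w^T x_i + b) = 1 and margin = 1/||w|| numerically.
import numpy as np
from sklearn.svm import SVC

X = np.array([[2.0, 2.0], [3.0, 3.0], [2.5, 3.5],
              [-2.0, -2.0], [-3.0, -3.0], [-2.5, -3.5]])
y = np.array([1, 1, 1, -1, -1, -1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)   # large C approximates hard margin
w, b = clf.coef_[0], clf.intercept_[0]

functional_margins = y * (X @ w + b)
print("min_i y_i(w^T x_i + b) =", functional_margins.min())   # approximately 1
print("geometric margin 1/||w|| =", 1 / np.linalg.norm(w))
```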

Lagrangian & KKT Conditions

Constrained Optimization Framework

Lagrangian Formulation

Introduce Lagrange multipliers α_i ≥ 0 for each constraint:

$$L(\mathbf{w}, b, \boldsymbol{\alpha}) = \frac{1}{2}\|\mathbf{w}\|^2 - \sum_{i=1}^{n} \alpha_i [y_i(\mathbf{w}^T\mathbf{x}_i + b) - 1]$$

Primal problem: min_w,b max_α L(w, b, α) where α_i ≥ 0

KKT Conditions

At optimal solution (w*, b*, α*), the following must hold:

1. Stationarity:

$$\frac{\partial L}{\partial \mathbf{w}} = \mathbf{w} - \sum_i \alpha_i y_i \mathbf{x}_i = 0 \implies \mathbf{w}^* = \sum_i \alpha_i y_i \mathbf{x}_i$$
$$\frac{\partial L}{\partial b} = -\sum_i \alpha_i y_i = 0 \implies \sum_i \alpha_i y_i = 0$$

2. Primal Feasibility:

$$y_i(\mathbf{w}^T\mathbf{x}_i + b) \geq 1 \quad \forall i$$

3. Dual Feasibility:

$$\alpha_i \geq 0 \quad \forall i$$

4. Complementary Slackness:

$$\alpha_i [y_i(\mathbf{w}^T\mathbf{x}_i + b) - 1] = 0 \quad \forall i$$

Either α_i = 0 OR point is on margin boundary (support vector)

Dual Problem Derivation

Substitute w = Σα_i y_i x_i into Lagrangian:

$$L = \frac{1}{2}\left\|\sum_i \alpha_i y_i \mathbf{x}_i\right\|^2 - \sum_i \alpha_i y_i \left(\sum_j \alpha_j y_j \mathbf{x}_j^T\mathbf{x}_i\right) - b\sum_i \alpha_i y_i + \sum_i \alpha_i$$

Using Σα_i y_i = 0 and simplifying:

$$L = \sum_i \alpha_i - \frac{1}{2}\sum_i\sum_j \alpha_i\alpha_j y_i y_j \mathbf{x}_i^T\mathbf{x}_j$$

Dual Optimization Problem:

$$\max_{\boldsymbol{\alpha}} \sum_i \alpha_i - \frac{1}{2}\sum_i\sum_j \alpha_i\alpha_j y_i y_j \mathbf{x}_i^T\mathbf{x}_j$$
$$\text{subject to } \alpha_i \geq 0, \quad \sum_i \alpha_i y_i = 0$$

Quadratic programming problem in α. Only involves dot products x_i^T x_j!
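
Because the dual is a small QP, it can be solved with a generic solver for illustration (real SVM libraries use specialised algorithms such as SMO). The sketch below assumes SciPy and a tiny separable dataset, then recovers $\mathbf{w}$ and $b$ from the KKT conditions.

```python
# Sketch: solving the hard-margin dual QP directly with SciPy's SLSQP.
import numpy as np
from scipy.optimize import minimize

X = np.array([[2.0, 2.0], [3.0, 1.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
n = len(y)

Q = (y[:, None] * y[None, :]) * (X @ X.T)      # Q_ij = y_i y_j x_i^T x_j

def neg_dual(a):                               # maximise L_D == minimise -L_D
    return 0.5 * a @ Q @ a - a.sum()

res = minimize(neg_dual, x0=np.zeros(n), method="SLSQP",
               bounds=[(0.0, None)] * n,                            # alpha_i >= 0
               constraints={"type": "eq", "fun": lambda a: a @ y})  # sum alpha_i y_i = 0

alpha = res.x
w = (alpha * y) @ X                            # w = sum_i alpha_i y_i x_i
sv = alpha > 1e-6
b = np.mean(y[sv] - X[sv] @ w)                 # from y_i(w^T x_i + b) = 1 on SVs
print("alpha =", alpha.round(4))
print("w =", w, " b =", round(b, 4))
```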

Kernel Trick Mathematical Justification

Implicit Feature Mapping via Mercer's Theorem

Motivation: Non-linear Decision Boundaries

Problem: Data may not be linearly separable in original space

Solution: Map to higher-dimensional space φ: R^d → R^D where D > d

$$\mathbf{x} \to \phi(\mathbf{x})$$

Find a linear separator in the feature space φ(x), which corresponds to a non-linear boundary in the original space.

Kernel Function Definition

$$k(\mathbf{x}, \mathbf{z}) = \phi(\mathbf{x})^T \phi(\mathbf{z})$$

Kernel computes inner product in feature space without explicitly computing φ!

Example: Polynomial Kernel

$$k(\mathbf{x}, \mathbf{z}) = (\mathbf{x}^T\mathbf{z} + c)^d$$

For d=2 and x ∈ R^2, this implicitly maps into a 6-dimensional feature space (constant, linear, and quadratic terms), yet evaluating the kernel costs only a 2-dimensional dot product!
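
This can be verified numerically: for d = 2 and c = 1, the explicit 6-dimensional map φ(x) = (1, √2·x₁, √2·x₂, x₁², x₂², √2·x₁x₂) reproduces the kernel value exactly. A sketch assuming NumPy:

```python
# Sketch: polynomial kernel value vs. explicit feature map for d=2, c=1.
import numpy as np

def phi(x):
    x1, x2 = x
    return np.array([1.0, np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1**2, x2**2, np.sqrt(2) * x1 * x2])

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

kernel_value = (x @ z + 1.0) ** 2      # (x^T z + c)^d with c=1, d=2
explicit_value = phi(x) @ phi(z)       # same value via the explicit mapping
print(kernel_value, explicit_value)    # both 4.0
```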

Mercer's Theorem

Theorem: A symmetric function k(x, z) can be expressed as an inner product k(x, z) = φ(x)^T φ(z) for some φ if and only if k is positive semi-definite:

$$\sum_i \sum_j c_i c_j k(\mathbf{x}_i, \mathbf{x}_j) \geq 0 \quad \forall \{c_i\}, \{\mathbf{x}_i\}$$

This guarantees that using k as a kernel corresponds to some feature mapping, even if we don't know what φ is explicitly!
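
An empirical sanity check (not a proof) of Mercer's condition: the Gram matrix of a valid kernel on any point set has no negative eigenvalues beyond numerical noise. A sketch, assuming NumPy and scikit-learn:

```python
# Sketch: empirical positive semi-definiteness check for the RBF kernel.
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
K = rbf_kernel(X, gamma=0.5)               # K_ij = exp(-gamma ||x_i - x_j||^2)

eigvals = np.linalg.eigvalsh(K)            # symmetric => real eigenvalues
print("smallest eigenvalue:", eigvals.min())   # non-negative up to round-off
```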

RBF (Gaussian) Kernel Derivation

$$k(\mathbf{x}, \mathbf{z}) = \exp\left(-\frac{\|\mathbf{x}-\mathbf{z}\|^2}{2\sigma^2}\right)$$

Expansion:

$$k(\mathbf{x}, \mathbf{z}) = \exp\left(-\frac{\|\mathbf{x}\|^2 + \|\mathbf{z}\|^2 - 2\mathbf{x}^T\mathbf{z}}{2\sigma^2}\right) = \exp\left(-\frac{\|\mathbf{x}\|^2}{2\sigma^2}\right) \exp\left(-\frac{\|\mathbf{z}\|^2}{2\sigma^2}\right) \exp\left(\frac{\mathbf{x}^T\mathbf{z}}{\sigma^2}\right)$$

Using the Taylor expansion of exp(x^T z/σ²), this corresponds to an infinite-dimensional feature space!

RBF kernel = infinite-dimensional feature mapping, but computed in O(d) time!

Kernelized SVM

Replace all x_i^T x_j with k(x_i, x_j) in dual problem:

$$\max_{\boldsymbol{\alpha}} \sum_i \alpha_i - \frac{1}{2}\sum_i\sum_j \alpha_i\alpha_j y_i y_j k(\mathbf{x}_i, \mathbf{x}_j)$$

Prediction for new point x:

$$f(\mathbf{x}) = \text{sign}\left(\sum_{i \in SV} \alpha_i y_i k(\mathbf{x}_i, \mathbf{x}) + b\right)$$

Only support vectors (SV) contribute to prediction!
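
This prediction formula can be reproduced by hand from a fitted model. The sketch below assumes scikit-learn, whose SVC exposes support_vectors_, dual_coef_ (holding $\alpha_i y_i$), and intercept_; the circular dataset is illustrative.

```python
# Sketch: rebuilding SVC's decision function from its support vectors.
import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import make_circles
from sklearn.metrics.pairwise import rbf_kernel

X, y = make_circles(n_samples=200, factor=0.4, noise=0.1, random_state=0)
gamma = 1.0
clf = SVC(kernel="rbf", gamma=gamma, C=1.0).fit(X, y)

X_new = X[:5]
K = rbf_kernel(X_new, clf.support_vectors_, gamma=gamma)   # k(x, x_i) over the SVs
manual = K @ clf.dual_coef_[0] + clf.intercept_[0]          # sum_i alpha_i y_i k(x_i, x) + b
print(np.allclose(manual, clf.decision_function(X_new)))    # True
```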

Soft Margin SVM & C Parameter

Handling Non-Separable Data

Motivation

Problem: Data may have:

  • Outliers
  • Noise
  • Truly overlapping classes

Hard margin SVM (y_i(w^T x_i + b) ≥ 1 strictly) has no solution!

Slack Variables

Introduce slack variables ξ_i ≥ 0 to allow constraint violations:

$$y_i(\mathbf{w}^T\mathbf{x}_i + b) \geq 1 - \xi_i$$

Interpretation:

  • ξ_i = 0: Correctly classified, outside margin
  • 0 < ξ_i < 1: Correctly classified, inside margin
  • ξ_i ≥ 1: Misclassified
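
The slack values are not returned by typical solvers, but they can be recovered from the decision function as ξ_i = max(0, 1 − y_i f(x_i)) and bucketed into the three cases above. A sketch, assuming scikit-learn and an illustrative noisy dataset:

```python
# Sketch: recovering slack variables xi_i from a fitted soft-margin SVM.
import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=200, centers=2, cluster_std=2.5, random_state=0)
y_pm = np.where(y == 1, 1, -1)             # map labels to {-1, +1}

clf = SVC(kernel="linear", C=1.0).fit(X, y)
xi = np.maximum(0.0, 1.0 - y_pm * clf.decision_function(X))

print("xi = 0   (outside margin):", np.sum(xi == 0))
print("0 < xi < 1 (inside margin):", np.sum((xi > 0) & (xi < 1)))
print("xi >= 1  (misclassified): ", np.sum(xi >= 1))
```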

Soft Margin Optimization

$$\min_{\mathbf{w},b,\boldsymbol{\xi}} \frac{1}{2}\|\mathbf{w}\|^2 + C\sum_{i=1}^{n} \xi_i$$
$$\text{subject to } y_i(\mathbf{w}^T\mathbf{x}_i + b) \geq 1 - \xi_i, \quad \xi_i \geq 0$$

Tradeoff via C:

  • Large C: Heavy penalty for violations → narrow margin, fewer errors
  • Small C: Light penalty → wider margin, more errors tolerated

Dual Form with Box Constraints

The dual problem is similar, but with an upper bound on α_i:

$$\max_{\boldsymbol{\alpha}} \sum_i \alpha_i - \frac{1}{2}\sum_i\sum_j \alpha_i\alpha_j y_i y_j k(\mathbf{x}_i, \mathbf{x}_j)$$
$$\text{subject to } 0 \leq \alpha_i \leq C, \quad \sum_i \alpha_i y_i = 0$$

Support Vector Categories:

  • α_i = 0: Not support vectors
  • 0 < α_i < C: On margin boundary (ξ_i = 0)
  • α_i = C: Inside margin or misclassified (ξ_i > 0)

Hinge Loss Interpretation

Soft margin SVM can be viewed as minimizing regularized hinge loss:

$$\min_{\mathbf{w},b} \frac{1}{2}\|\mathbf{w}\|^2 + C\sum_i \max(0, 1 - y_i(\mathbf{w}^T\mathbf{x}_i + b))$$

Hinge loss: ℓ(z) = max(0, 1-z) is convex, differentiable almost everywhere, and encourages margin ≥ 1.
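
This view suggests an alternative training route: minimise the regularised hinge loss directly by (sub)gradient descent instead of solving the constrained QP. The sketch below is a bare-bones illustration, not a production solver; the step size, iteration count, and dataset are arbitrary choices.

```python
# Sketch: subgradient descent on the regularised hinge loss for a linear SVM.
import numpy as np
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=200, centers=2, cluster_std=1.5, random_state=0)
y = np.where(y == 1, 1.0, -1.0)            # labels in {-1, +1}

C, lr, n_iters = 1.0, 1e-4, 5000
w, b = np.zeros(X.shape[1]), 0.0

for _ in range(n_iters):
    margins = y * (X @ w + b)
    violated = margins < 1                 # samples with non-zero hinge loss
    grad_w = w - C * (y[violated, None] * X[violated]).sum(axis=0)
    grad_b = -C * y[violated].sum()
    w -= lr * grad_w
    b -= lr * grad_b

objective = 0.5 * w @ w + C * np.maximum(0, 1 - y * (X @ w + b)).sum()
print("train accuracy:", np.mean(np.sign(X @ w + b) == y), " objective:", objective)
```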

Practice Quiz

Test your understanding with 10 multiple-choice questions

1. What is the margin in SVM?
2. For a linearly separable dataset, hard-margin SVM solves:
3. What are support vectors?
4. Soft-margin SVM introduces slack variables to:
5. What is the purpose of the kernel trick?
6. The RBF (Gaussian) kernel is defined as:
7. In the dual formulation, the decision function becomes:
8. What happens when the SVM parameter C is very large?
9. For multi-class SVM with K classes, the One-vs-Rest approach trains:
10. Which kernel parameter in the RBF kernel leads to overfitting?