
Core Concepts

Maximum margin, dual formulation, and optimization fundamentals

Margins and Support Vectors

Basic Idea

In the sample space, we seek a hyperplane that separates different classes. When multiple separating hyperplanes exist, we should choose the one that is "right in the middle" because it has the best tolerance, robustness, and generalization ability.

Mathematical Formulation

Hyperplane Equation

w^T x + b = 0
  • w: normal vector (determines the hyperplane's direction)
  • b: displacement term (determines the hyperplane's distance from the origin)
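
As a quick illustration, here is a minimal sketch (with made-up values for w, b, and x; nothing here comes from this page's examples) of evaluating the hyperplane equation and turning its sign into a class prediction:

```python
import numpy as np

# Hypothetical hyperplane parameters in 2D (illustrative values only)
w = np.array([2.0, -1.0])   # normal vector: fixes the hyperplane's orientation
b = -0.5                    # displacement: shifts the hyperplane away from the origin

def decision_value(x):
    """Signed value of w^T x + b; its sign is the predicted class in {-1, +1}."""
    return w @ x + b

x = np.array([1.0, 0.5])
print(decision_value(x))            # 2*1.0 - 1*0.5 - 0.5 = 1.0
print(np.sign(decision_value(x)))   # predicted label: +1
```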

Margin

\gamma = \frac{2}{||w||}
  • Geometric margin: the actual distance from a sample point to the hyperplane
  • Functional margin: \hat{\gamma}_i = y_i(w^T x_i + b)
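
Reusing the hypothetical w, b, and sample from the sketch above (again purely illustrative values), both margin notions can be computed in a few lines:

```python
import numpy as np

w = np.array([2.0, -1.0])                 # hypothetical normal vector
b = -0.5                                  # hypothetical displacement
x_i, y_i = np.array([1.0, 0.5]), +1       # one toy labeled sample

functional_margin = y_i * (w @ x_i + b)                     # \hat{\gamma}_i
distance_to_plane = functional_margin / np.linalg.norm(w)   # geometric margin of x_i
margin_width = 2 / np.linalg.norm(w)                        # \gamma = 2/||w|| (when the min functional margin is 1)

print(functional_margin, distance_to_plane, margin_width)
```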

Support Vectors

Sample points closest to the separating hyperplane. These points determine the final classification boundary and are the only points that matter for the solution.

Geometric Intuition

Imagine drawing a line (in 2D) or plane (in 3D) to separate two groups of points. The optimal line/plane is the one that maximizes the minimum distance to points from both groups. This creates a "safety buffer" that makes the classifier more robust to noise and better at generalizing to new data.

Support Vector Machine Basic Formulation

Maximum Margin Optimization

Original Form

\arg\max_{w,b} \frac{2}{||w||}
subject to: y_i(w^T x_i + b) \geq 1, \quad i = 1,2,...,m

Equivalent Form (Easier to Optimize)

\arg\min_{w,b} \frac{1}{2}||w||^2
subject to: y_i(w^T x_i + b) \geq 1, \quad i = 1,2,...,m

Why This Transformation?

  • Maximizing \frac{2}{||w||} is equivalent to minimizing \frac{1}{2}||w||^2
  • The constraints ensure all samples are correctly classified with functional margin ≥ 1
  • This is a convex optimization problem with a global optimal solution (a solver sketch follows below)
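
Because this is a small convex quadratic program, it can be handed directly to an off-the-shelf solver. The sketch below assumes the cvxpy package and a tiny linearly separable toy dataset (both are assumptions for illustration, not part of the formulation above):

```python
import numpy as np
import cvxpy as cp

# Tiny linearly separable toy dataset: X has shape (m, d), y takes values in {-1, +1}
X = np.array([[2.0, 2.0], [2.5, 1.5], [-2.0, -1.5], [-1.5, -2.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
m, d = X.shape

w = cp.Variable(d)
b = cp.Variable()

# minimize (1/2)||w||^2  subject to  y_i (w^T x_i + b) >= 1
objective = cp.Minimize(0.5 * cp.sum_squares(w))
constraints = [cp.multiply(y, X @ w + b) >= 1]
cp.Problem(objective, constraints).solve()

print("w =", w.value, " b =", b.value)
print("margin width 2/||w|| =", 2 / np.linalg.norm(w.value))
```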

Dual Problem

Using the method of Lagrange multipliers, we can transform the primal optimization problem into its dual form, which has several advantages.

Derivation Steps

Step 1: Lagrangian Function

L(w,b,\alpha) = \frac{1}{2}||w||^2 - \sum_{i=1}^m \alpha_i[y_i(w^T x_i + b) - 1]

where \alpha_i \geq 0 are the Lagrange multipliers

Step 2: Partial Derivatives

\frac{\partial L}{\partial w} = w - \sum_{i=1}^m \alpha_i y_i x_i = 0 \;\Rightarrow\; w = \sum_{i=1}^m \alpha_i y_i x_i
\frac{\partial L}{\partial b} = -\sum_{i=1}^m \alpha_i y_i = 0 \;\Rightarrow\; \sum_{i=1}^m \alpha_i y_i = 0
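
Substituting these two conditions back into L(w,b,\alpha) is what produces the dual. Written out (a standard manipulation, included only to bridge Steps 2 and 3):

L(w,b,\alpha) = \frac{1}{2}||w||^2 - \sum_{i=1}^m \alpha_i y_i w^T x_i - b\sum_{i=1}^m \alpha_i y_i + \sum_{i=1}^m \alpha_i

With w = \sum_i \alpha_i y_i x_i, both \frac{1}{2}||w||^2 and \sum_i \alpha_i y_i w^T x_i become sums of \alpha_i\alpha_j y_i y_j x_i^T x_j terms, and the term with b vanishes because \sum_i \alpha_i y_i = 0, leaving

L = \sum_{i=1}^m \alpha_i - \frac{1}{2}\sum_{i=1}^m\sum_{j=1}^m \alpha_i\alpha_j y_i y_j x_i^T x_j

which is maximized over \alpha \geq 0, or equivalently minimized with the sign flipped, as written in Step 3.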

Step 3: Dual Form

\min_{\alpha} \frac{1}{2}\sum_{i=1}^m\sum_{j=1}^m \alpha_i\alpha_j y_i y_j x_i^T x_j - \sum_{i=1}^m \alpha_i
subject to: \sum_{i=1}^m \alpha_i y_i = 0, \quad \alpha_i \geq 0, \quad i = 1,2,...,m
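
A minimal sketch on assumed toy data (this evaluates the dual objective and its constraints for a candidate \alpha rather than actually solving the problem):

```python
import numpy as np

# Toy data: X has shape (m, d), y takes values in {-1, +1}
X = np.array([[2.0, 2.0], [2.5, 1.5], [-2.0, -1.5], [-1.5, -2.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
alpha = np.array([0.1, 0.0, 0.1, 0.0])   # arbitrary candidate multipliers, for illustration

G = X @ X.T                              # Gram matrix of inner products x_i^T x_j

def dual_objective(alpha, G, y):
    """(1/2) sum_ij alpha_i alpha_j y_i y_j x_i^T x_j - sum_i alpha_i (to be minimized)."""
    ay = alpha * y
    return 0.5 * ay @ G @ ay - alpha.sum()

print("objective:", dual_objective(alpha, G, y))
print("equality constraint sum_i alpha_i y_i =", alpha @ y)    # must be 0
print("inequality constraints alpha_i >= 0:", bool(np.all(alpha >= 0)))
```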

Advantages of Dual Formulation

  • Kernelization: Only involves inner products x_i^T x_j, which are easy to replace with kernel functions (see the sketch below)
  • Simpler constraints: Linear equality and inequality constraints
  • Reveals sparsity: Most \alpha_i will be zero in the solution
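
To make the kernelization point concrete: the dual touches the data only through the Gram matrix, so swapping x_i^T x_j for a kernel evaluation is a one-line change. A sketch assuming scikit-learn's pairwise kernel helper:

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

X = np.array([[2.0, 2.0], [2.5, 1.5], [-2.0, -1.5], [-1.5, -2.5]])

G_linear = X @ X.T                 # linear SVM: plain inner products x_i^T x_j
G_rbf = rbf_kernel(X, gamma=0.5)   # kernel SVM: k(x_i, x_j) replaces each inner product

# The dual objective and constraints stay exactly the same;
# only the matrix fed into them changes.
print(G_linear.shape, G_rbf.shape)
```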

KKT Conditions & Solution Sparsity

The Karush-Kuhn-Tucker (KKT) conditions characterize optimality in constrained optimization; for a convex problem like the SVM they are both necessary and sufficient. For SVM, they reveal the special structure of the solution.

KKT Conditions for SVM

1. Primal Feasibility
y_i(w^T x_i + b) \geq 1
2. Dual Feasibility
\alpha_i \geq 0
3. Complementary Slackness (Key!)
\alpha_i[y_i(w^T x_i + b) - 1] = 0
4. Stationarity
\frac{\partial L}{\partial w} = 0 and \frac{\partial L}{\partial b} = 0, i.e. w = \sum_{i=1}^m \alpha_i y_i x_i and \sum_{i=1}^m \alpha_i y_i = 0

Sparsity Analysis

From the complementary slackness condition, with f(x) = w^T x + b, we can deduce:

  • When y_i f(x_i) > 1 (sample strictly beyond the margin), we must have \alpha_i = 0
  • Only support vectors (samples with y_i f(x_i) = 1) can have \alpha_i > 0
  • Most training samples have \alpha_i = 0; the final model depends only on the support vectors

Practical Significance

This sparsity gives SVM excellent generalization ability and efficient prediction: to classify a new point, we only need kernel values between that point and the support vectors, not all training samples.
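
A short sketch that makes this visible, assuming scikit-learn's SVC on made-up blob data, with a very large C to approximate the hard-margin setting discussed here:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Synthetic, well-separated data stands in for a real training set
X, y = make_blobs(n_samples=200, centers=2, cluster_std=1.0, random_state=0)
y = 2 * y - 1                        # relabel to {-1, +1}

clf = SVC(kernel="linear", C=1e6)    # very large C approximates the hard-margin SVM
clf.fit(X, y)
print("support vectors:", len(clf.support_), "out of", len(X))   # typically a small fraction

# Manual prediction using only the support vectors:
# f(x) = sum_{i in SV} alpha_i y_i <x_i, x> + b   (dual_coef_ stores alpha_i * y_i)
x_new = X[:1]
f = clf.dual_coef_ @ (clf.support_vectors_ @ x_new.T) + clf.intercept_
print(f.item(), clf.decision_function(x_new)[0])                 # the two agree
```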

SMO Algorithm

Sequential Minimal Optimization (SMO) is an efficient algorithm for solving the SVM dual problem, developed by John Platt in 1998.

Basic Strategy

1. Variable Selection

Select a pair of variables \alpha_i and \alpha_j to update

2. Subproblem Solving

Fix other parameters, solve the quadratic programming subproblem for these two variables

3. Iterative Update

Repeat until convergence

Selection Strategy

  • First variable: Choose the sample that most severely violates the KKT conditions
  • Second variable: Choose the sample expected to produce the largest change in the objective (heuristically, the one with the largest error difference |E_i - E_j|)

Convergence Criterion

All samples satisfy KKT conditions (within tolerance). This ensures we've found the optimal solution.
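
The per-sample check behind both the selection rule and this stopping criterion can be sketched as below. It is written for the soft-margin form with box constraint 0 ≤ α_i ≤ C (a very large C approximates the hard-margin case above), and tol is the numerical tolerance mentioned here:

```python
def violates_kkt(alpha_i, y_i, f_xi, C=1e6, tol=1e-3):
    """True if sample i violates its KKT condition by more than tol.

    f_xi is the current decision value w^T x_i + b for sample i.
    """
    r = y_i * f_xi - 1.0
    # alpha_i < C requires y_i f(x_i) >= 1; alpha_i > 0 requires y_i f(x_i) <= 1
    return (alpha_i < C and r < -tol) or (alpha_i > 0 and r > tol)
```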

Computational Efficiency

Compared to directly solving the standard quadratic program, whose cost is on the order of O(m^3), SMO decomposes the problem into a series of smallest-possible subproblems (updating just two variables at a time), which in practice reduces the cost to roughly O(m^2).
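
The heart of SMO is the closed-form update for one pair (\alpha_i, \alpha_j) with all other multipliers held fixed. The sketch below assumes a soft-margin box constraint 0 ≤ α ≤ C for the clipping step (again, a very large C approximates the hard-margin case) and omits the selection heuristics and the bias update:

```python
import numpy as np

def smo_pair_update(alpha, i, j, K, y, E, C=1e6):
    """One analytic SMO step: jointly optimize alpha_i and alpha_j, all other alphas fixed.

    K is the kernel (Gram) matrix and E[k] = f(x_k) - y_k are the current prediction errors.
    """
    if i == j:
        return alpha
    # Feasible segment [L, H] for alpha_j imposed by 0 <= alpha <= C and sum_k alpha_k y_k = const
    if y[i] != y[j]:
        L, H = max(0.0, alpha[j] - alpha[i]), min(C, C + alpha[j] - alpha[i])
    else:
        L, H = max(0.0, alpha[i] + alpha[j] - C), min(C, alpha[i] + alpha[j])
    eta = K[i, i] + K[j, j] - 2.0 * K[i, j]   # curvature along the constraint line
    if eta <= 0 or L >= H:
        return alpha                          # skip degenerate pairs in this sketch
    new = alpha.copy()
    a_j = alpha[j] + y[j] * (E[i] - E[j]) / eta           # unconstrained optimum for alpha_j
    new[j] = np.clip(a_j, L, H)                           # project onto the feasible segment
    new[i] = alpha[i] + y[i] * y[j] * (alpha[j] - new[j]) # preserve sum_k alpha_k y_k
    return new
```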

Example: Credit Card Fraud Detection

Scenario: A bank wants to build a binary classifier to detect fraudulent credit card transactions based on transaction amount, time, location, and merchant category.

Dataset

  • 10,000 transactions (150 fraudulent)
  • Features: amount, time, GPS location, merchant category
  • Labels: legitimate (y = -1), fraud (y = +1)

SVM Approach

  • Find maximum margin separator
  • Support vectors: "boundary" transactions
  • Sparse solution: only ~200 support vectors

Result: The trained SVM finds that only 200 transactions (2%) serve as support vectors—these "boundary cases" fully define the decision boundary. This sparse representation enables fast fraud detection on new transactions while maintaining high accuracy.
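
A rough end-to-end sketch of this workflow on synthetic stand-in data (the feature construction, class balance, and hyperparameters below are assumptions for illustration; the counts quoted above come from the scenario, not from this code):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic imbalanced stand-in for the transactions: 4 features, ~1.5% positives
X, y = make_classification(n_samples=10_000, n_features=4, n_informative=4,
                           n_redundant=0, weights=[0.985, 0.015], random_state=42)
y = 2 * y - 1   # relabel to -1 (legitimate) / +1 (fraud)

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

scaler = StandardScaler().fit(X_train)
clf = SVC(kernel="rbf", class_weight="balanced")   # class_weight helps with the imbalance
clf.fit(scaler.transform(X_train), y_train)

print("support vectors:", len(clf.support_), "of", len(X_train), "training transactions")
print("test accuracy:", clf.score(scaler.transform(X_test), y_test))
```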