
Core Concepts

Maximum margin, dual formulation, and optimization fundamentals

Margins and Support Vectors

Basic Idea

In the sample space, we seek a hyperplane that separates different classes. When multiple separating hyperplanes exist, we should choose the one that is "right in the middle" because it has the best tolerance, robustness, and generalization ability.

Mathematical Formulation

Hyperplane Equation

w^T x + b = 0
  • w: normal vector (determines the hyperplane's direction)
  • b: displacement term (determines the hyperplane's distance from the origin)
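
As a quick illustration, here is a minimal sketch (with made-up values for w, b, and x; nothing here comes from this page's examples) of evaluating the hyperplane equation and turning its sign into a class prediction:

```python
import numpy as np

# Hypothetical hyperplane parameters in 2D (illustrative values only)
w = np.array([2.0, -1.0])   # normal vector: fixes the hyperplane's orientation
b = -0.5                    # displacement: shifts the hyperplane away from the origin

def decision_value(x):
    """Signed value of w^T x + b; its sign is the predicted class in {-1, +1}."""
    return w @ x + b

x = np.array([1.0, 0.5])
print(decision_value(x))            # 2*1.0 - 1*0.5 - 0.5 = 1.0
print(np.sign(decision_value(x)))   # predicted label: +1
```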

Margin

\gamma = \frac{2}{||w||}
  • Geometric margin: the actual distance from a sample point to the hyperplane
  • Functional margin: \hat{\gamma}_i = y_i(w^T x_i + b)
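
Reusing the hypothetical w, b, and sample from the sketch above (again purely illustrative values), both margin notions can be computed in a few lines:

```python
import numpy as np

w = np.array([2.0, -1.0])                 # hypothetical normal vector
b = -0.5                                  # hypothetical displacement
x_i, y_i = np.array([1.0, 0.5]), +1       # one toy labeled sample

functional_margin = y_i * (w @ x_i + b)                     # \hat{\gamma}_i
distance_to_plane = functional_margin / np.linalg.norm(w)   # geometric margin of x_i
margin_width = 2 / np.linalg.norm(w)                        # \gamma = 2/||w|| (when the min functional margin is 1)

print(functional_margin, distance_to_plane, margin_width)
```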

Support Vectors

Sample points closest to the separating hyperplane. These points determine the final classification boundary and are the only points that matter for the solution.

Geometric Intuition

Imagine drawing a line (in 2D) or plane (in 3D) to separate two groups of points. The optimal line/plane is the one that maximizes the minimum distance to points from both groups. This creates a "safety buffer" that makes the classifier more robust to noise and better at generalizing to new data.

Support Vector Machine Basic Formulation

Maximum Margin Optimization

Original Form

\arg\max_{w,b} \frac{2}{||w||}
subject to: y_i(w^T x_i + b) \geq 1, \quad i = 1,2,...,m

Equivalent Form (Easier to Optimize)

\arg\min_{w,b} \frac{1}{2}||w||^2
subject to: y_i(w^T x_i + b) \geq 1, \quad i = 1,2,...,m

Why This Transformation?

  • Maximizing \frac{2}{||w||} is equivalent to minimizing \frac{1}{2}||w||^2
  • The constraints ensure all samples are correctly classified with functional margin ≥ 1
  • This is a convex optimization problem with a global optimal solution (a solver sketch follows below)
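
Because this is a small convex quadratic program, it can be handed directly to an off-the-shelf solver. The sketch below assumes the cvxpy package and a tiny linearly separable toy dataset (both are assumptions for illustration, not part of the formulation above):

```python
import numpy as np
import cvxpy as cp

# Tiny linearly separable toy dataset: X has shape (m, d), y takes values in {-1, +1}
X = np.array([[2.0, 2.0], [2.5, 1.5], [-2.0, -1.5], [-1.5, -2.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
m, d = X.shape

w = cp.Variable(d)
b = cp.Variable()

# minimize (1/2)||w||^2  subject to  y_i (w^T x_i + b) >= 1
objective = cp.Minimize(0.5 * cp.sum_squares(w))
constraints = [cp.multiply(y, X @ w + b) >= 1]
cp.Problem(objective, constraints).solve()

print("w =", w.value, " b =", b.value)
print("margin width 2/||w|| =", 2 / np.linalg.norm(w.value))
```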

Dual Problem

Using the method of Lagrange multipliers, we can transform the primal optimization problem into its dual form, which has several advantages.

Derivation Steps

Step 1: Lagrangian Function

L(w,b,\alpha) = \frac{1}{2}||w||^2 - \sum_{i=1}^m \alpha_i[y_i(w^T x_i + b) - 1]

where \alpha_i \geq 0 are the Lagrange multipliers

Step 2: Partial Derivatives

\frac{\partial L}{\partial w} = w - \sum_{i=1}^m \alpha_i y_i x_i = 0 \;\Rightarrow\; w = \sum_{i=1}^m \alpha_i y_i x_i
\frac{\partial L}{\partial b} = -\sum_{i=1}^m \alpha_i y_i = 0 \;\Rightarrow\; \sum_{i=1}^m \alpha_i y_i = 0
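
Substituting these two conditions back into L(w,b,\alpha) is what produces the dual. Written out (a standard manipulation, included only to bridge Steps 2 and 3):

L(w,b,\alpha) = \frac{1}{2}||w||^2 - \sum_{i=1}^m \alpha_i y_i w^T x_i - b\sum_{i=1}^m \alpha_i y_i + \sum_{i=1}^m \alpha_i

With w = \sum_i \alpha_i y_i x_i, both \frac{1}{2}||w||^2 and \sum_i \alpha_i y_i w^T x_i become sums of \alpha_i\alpha_j y_i y_j x_i^T x_j terms, and the term with b vanishes because \sum_i \alpha_i y_i = 0, leaving

L = \sum_{i=1}^m \alpha_i - \frac{1}{2}\sum_{i=1}^m\sum_{j=1}^m \alpha_i\alpha_j y_i y_j x_i^T x_j

which is maximized over \alpha \geq 0, or equivalently minimized with the sign flipped, as written in Step 3.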

Step 3: Dual Form

\min_{\alpha} \frac{1}{2}\sum_{i=1}^m\sum_{j=1}^m \alpha_i\alpha_j y_i y_j x_i^T x_j - \sum_{i=1}^m \alpha_i
subject to: \sum_{i=1}^m \alpha_i y_i = 0, \quad \alpha_i \geq 0, \quad i = 1,2,...,m
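
A minimal sketch on assumed toy data (this evaluates the dual objective and its constraints for a candidate \alpha rather than actually solving the problem):

```python
import numpy as np

# Toy data: X has shape (m, d), y takes values in {-1, +1}
X = np.array([[2.0, 2.0], [2.5, 1.5], [-2.0, -1.5], [-1.5, -2.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
alpha = np.array([0.1, 0.0, 0.1, 0.0])   # arbitrary candidate multipliers, for illustration

G = X @ X.T                              # Gram matrix of inner products x_i^T x_j

def dual_objective(alpha, G, y):
    """(1/2) sum_ij alpha_i alpha_j y_i y_j x_i^T x_j - sum_i alpha_i (to be minimized)."""
    ay = alpha * y
    return 0.5 * ay @ G @ ay - alpha.sum()

print("objective:", dual_objective(alpha, G, y))
print("equality constraint sum_i alpha_i y_i =", alpha @ y)    # must be 0
print("inequality constraints alpha_i >= 0:", bool(np.all(alpha >= 0)))
```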

Advantages of Dual Formulation

  • Kernelization: Only involves inner products x_i^T x_j, which are easy to replace with kernel functions (see the sketch below)
  • Simpler constraints: Linear equality and inequality constraints
  • Reveals sparsity: Most \alpha_i will be zero in the solution
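
To make the kernelization point concrete: the dual touches the data only through the Gram matrix, so swapping x_i^T x_j for a kernel evaluation is a one-line change. A sketch assuming scikit-learn's pairwise kernel helper:

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

X = np.array([[2.0, 2.0], [2.5, 1.5], [-2.0, -1.5], [-1.5, -2.5]])

G_linear = X @ X.T                 # linear SVM: plain inner products x_i^T x_j
G_rbf = rbf_kernel(X, gamma=0.5)   # kernel SVM: k(x_i, x_j) replaces each inner product

# The dual objective and constraints stay exactly the same;
# only the matrix fed into them changes.
print(G_linear.shape, G_rbf.shape)
```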

KKT Conditions & Solution Sparsity

The Karush-Kuhn-Tucker (KKT) conditions characterize optimality in constrained optimization; for a convex problem like the SVM they are both necessary and sufficient. For SVM, they reveal the special structure of the solution.

KKT Conditions for SVM

1. Primal Feasibility
y_i(w^T x_i + b) \geq 1
2. Dual Feasibility
\alpha_i \geq 0
3. Complementary Slackness (Key!)
\alpha_i[y_i(w^T x_i + b) - 1] = 0
4. Stationarity
\frac{\partial L}{\partial w} = 0 and \frac{\partial L}{\partial b} = 0, i.e. w = \sum_{i=1}^m \alpha_i y_i x_i and \sum_{i=1}^m \alpha_i y_i = 0

Sparsity Analysis

From the complementary slackness condition, with f(x) = w^T x + b, we can deduce:

  • When y_i f(x_i) > 1 (sample strictly beyond the margin), we must have \alpha_i = 0
  • Only support vectors (samples with y_i f(x_i) = 1) can have \alpha_i > 0
  • Most training samples have \alpha_i = 0; the final model depends only on the support vectors

Practical Significance

This sparsity gives SVM excellent generalization ability and efficient prediction: to classify a new point, we only need kernel values between that point and the support vectors, not all training samples.
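
A short sketch that makes this visible, assuming scikit-learn's SVC on made-up blob data, with a very large C to approximate the hard-margin setting discussed here:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Synthetic, well-separated data stands in for a real training set
X, y = make_blobs(n_samples=200, centers=2, cluster_std=1.0, random_state=0)
y = 2 * y - 1                        # relabel to {-1, +1}

clf = SVC(kernel="linear", C=1e6)    # very large C approximates the hard-margin SVM
clf.fit(X, y)
print("support vectors:", len(clf.support_), "out of", len(X))   # typically a small fraction

# Manual prediction using only the support vectors:
# f(x) = sum_{i in SV} alpha_i y_i <x_i, x> + b   (dual_coef_ stores alpha_i * y_i)
x_new = X[:1]
f = clf.dual_coef_ @ (clf.support_vectors_ @ x_new.T) + clf.intercept_
print(f.item(), clf.decision_function(x_new)[0])                 # the two agree
```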

SMO Algorithm

Sequential Minimal Optimization (SMO) is an efficient algorithm for solving the SVM dual problem, developed by John Platt in 1998.

Basic Strategy

1. Variable Selection

Select a pair of variables \alpha_i and \alpha_j to update

2. Subproblem Solving

Fix other parameters, solve the quadratic programming subproblem for these two variables

3. Iterative Update

Repeat until convergence

Selection Strategy

  • First variable: Choose the sample that most severely violates the KKT conditions
  • Second variable: Choose the sample expected to produce the largest change in the objective (heuristically, the one with the largest error difference |E_i - E_j|)

Convergence Criterion

All samples satisfy KKT conditions (within tolerance). This ensures we've found the optimal solution.
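
The per-sample check behind both the selection rule and this stopping criterion can be sketched as below. It is written for the soft-margin form with box constraint 0 ≤ α_i ≤ C (a very large C approximates the hard-margin case above), and tol is the numerical tolerance mentioned here:

```python
def violates_kkt(alpha_i, y_i, f_xi, C=1e6, tol=1e-3):
    """True if sample i violates its KKT condition by more than tol.

    f_xi is the current decision value w^T x_i + b for sample i.
    """
    r = y_i * f_xi - 1.0
    # alpha_i < C requires y_i f(x_i) >= 1; alpha_i > 0 requires y_i f(x_i) <= 1
    return (alpha_i < C and r < -tol) or (alpha_i > 0 and r > tol)
```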

Computational Efficiency

Compared to directly solving the standard quadratic program, whose cost is on the order of O(m^3), SMO decomposes the problem into a series of smallest-possible subproblems (updating just two variables at a time), which in practice reduces the cost to roughly O(m^2).
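
The heart of SMO is the closed-form update for one pair (\alpha_i, \alpha_j) with all other multipliers held fixed. The sketch below assumes a soft-margin box constraint 0 ≤ α ≤ C for the clipping step (again, a very large C approximates the hard-margin case) and omits the selection heuristics and the bias update:

```python
import numpy as np

def smo_pair_update(alpha, i, j, K, y, E, C=1e6):
    """One analytic SMO step: jointly optimize alpha_i and alpha_j, all other alphas fixed.

    K is the kernel (Gram) matrix and E[k] = f(x_k) - y_k are the current prediction errors.
    """
    if i == j:
        return alpha
    # Feasible segment [L, H] for alpha_j imposed by 0 <= alpha <= C and sum_k alpha_k y_k = const
    if y[i] != y[j]:
        L, H = max(0.0, alpha[j] - alpha[i]), min(C, C + alpha[j] - alpha[i])
    else:
        L, H = max(0.0, alpha[i] + alpha[j] - C), min(C, alpha[i] + alpha[j])
    eta = K[i, i] + K[j, j] - 2.0 * K[i, j]   # curvature along the constraint line
    if eta <= 0 or L >= H:
        return alpha                          # skip degenerate pairs in this sketch
    new = alpha.copy()
    a_j = alpha[j] + y[j] * (E[i] - E[j]) / eta           # unconstrained optimum for alpha_j
    new[j] = np.clip(a_j, L, H)                           # project onto the feasible segment
    new[i] = alpha[i] + y[i] * y[j] * (alpha[j] - new[j]) # preserve sum_k alpha_k y_k
    return new
```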

Example: Credit Card Fraud Detection

Scenario: A bank wants to build a binary classifier to detect fraudulent credit card transactions based on transaction amount, time, location, and merchant category.

Dataset

  • 10,000 transactions (150 fraudulent)
  • Features: amount, time, GPS location, merchant category
  • Labels: legitimate (y = -1), fraud (y = +1)

SVM Approach

  • Find maximum margin separator
  • Support vectors: "boundary" transactions
  • Sparse solution: only ~200 support vectors

Result: The trained SVM finds that only 200 transactions (2%) serve as support vectors—these "boundary cases" fully define the decision boundary. This sparse representation enables fast fraud detection on new transactions while maintaining high accuracy.
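
A rough end-to-end sketch of this workflow on synthetic stand-in data (the feature construction, class balance, and hyperparameters below are assumptions for illustration; the counts quoted above come from the scenario, not from this code):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic imbalanced stand-in for the transactions: 4 features, ~1.5% positives
X, y = make_classification(n_samples=10_000, n_features=4, n_informative=4,
                           n_redundant=0, weights=[0.985, 0.015], random_state=42)
y = 2 * y - 1   # relabel to -1 (legitimate) / +1 (fraud)

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

scaler = StandardScaler().fit(X_train)
clf = SVC(kernel="rbf", class_weight="balanced")   # class_weight helps with the imbalance
clf.fit(scaler.transform(X_train), y_train)

print("support vectors:", len(clf.support_), "of", len(X_train), "training transactions")
print("test accuracy:", clf.score(scaler.transform(X_test), y_test))
```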