Master maximum margin classification, kernel methods, and optimization techniques
SVM finds the hyperplane that maximizes the margin between two classes:
Decision Hyperplane: w^T x + b = 0
Margin Boundaries: w^T x + b = +1 and w^T x + b = -1
Margin Width: 2 / ||w||
Optimization problem: min_w,b (1/2)||w||²
Subject to: y_i(w^T x_i + b) ≥ 1 for all i
Minimizing ||w|| maximizes the margin.
Support vectors: the samples that lie exactly on the margin boundaries, satisfying y_i(w^T x_i + b) = 1.
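As a concrete check of these formulas, the sketch below (a minimal example assuming scikit-learn and NumPy are available; the toy points are made up) fits a linear SVM with a very large C to approximate the hard-margin solution and reports the margin width 2/||w||:

```python
import numpy as np
from sklearn.svm import SVC

# Toy linearly separable data (made-up points), labels in {-1, +1}
X = np.array([[1.0, 1.0], [2.0, 2.5], [1.5, 0.5],        # class +1
              [-1.0, -1.0], [-2.0, -1.5], [-1.5, -2.5]])  # class -1
y = np.array([1, 1, 1, -1, -1, -1])

# A very large C approximates the hard-margin SVM
clf = SVC(kernel="linear", C=1e6).fit(X, y)

w = clf.coef_[0]          # weight vector w
b = clf.intercept_[0]     # bias b
margin_width = 2.0 / np.linalg.norm(w)  # distance between the two margin boundaries

print("w =", w, "b =", b)
print("margin width 2/||w|| =", margin_width)
print("support vectors:\n", clf.support_vectors_)
```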
Real data is often not linearly separable. Soft-margin SVM allows controlled violations:
Objective Function: min_w,b,ξ (1/2)||w||² + C Σ ξ_i
Constraints: y_i(w^T x_i + b) ≥ 1 - ξ_i and ξ_i ≥ 0 for all i
• ξ_i = slack variable (amount of violation for sample i)
• C = regularization parameter (penalty for violations)
• Small C: wide margin, more tolerance for errors
• Moderate C: balanced trade-off, good generalization
• Large C: narrow margin, less tolerance for errors
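The effect of C can be seen directly by refitting the same data with different values; a minimal sketch, assuming scikit-learn and a synthetic two-blob dataset:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two blobs with enough spread that some points may violate the margin
X, y = make_blobs(n_samples=200, centers=2, cluster_std=2.5, random_state=0)

for C in [0.01, 1.0, 100.0]:
    clf = SVC(kernel="linear", C=C).fit(X, y)
    margin_width = 2.0 / np.linalg.norm(clf.coef_[0])
    print(f"C={C:>6}: margin width={margin_width:.3f}, "
          f"support vectors={clf.support_vectors_.shape[0]}, "
          f"train accuracy={clf.score(X, y):.3f}")
```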
Kernel functions implicitly map data to higher dimensions where it becomes linearly separable:
Compute the dot product in a high-dimensional feature space without explicitly computing the mapping φ(x).
Linear kernel: k(x, z) = x^T z. Standard SVM, no transformation. Use for linearly separable data.
Polynomial kernel: k(x, z) = (x^T z + c)^d. Degree-d polynomial features; the constant c shifts the relative influence of higher-order terms. Good for image processing.
RBF (Gaussian) kernel: k(x, z) = exp(-γ ||x - z||²)
• Most popular kernel
• Parameter γ controls influence radius
• Maps to infinite-dimensional space
• Smooth, flexible decision boundaries
Common choice: γ = 1/number of features
Sigmoid kernel: k(x, z) = tanh(γ x^T z + r). Inspired by neural networks. A valid (positive semi-definite) kernel only for certain values of γ and r.
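To make these formulas concrete, scikit-learn's pairwise-kernel helpers (assumed available; the two sample points are arbitrary) evaluate each kernel on a pair of vectors:

```python
import numpy as np
from sklearn.metrics.pairwise import (linear_kernel, polynomial_kernel,
                                      rbf_kernel, sigmoid_kernel)

x = np.array([[1.0, 2.0]])
z = np.array([[0.5, -1.0]])

print("linear     k(x,z) =", linear_kernel(x, z)[0, 0])                        # x^T z
print("polynomial k(x,z) =", polynomial_kernel(x, z, degree=3,
                                               gamma=1.0, coef0=1.0)[0, 0])    # (x^T z + 1)^3
print("RBF        k(x,z) =", rbf_kernel(x, z, gamma=0.5)[0, 0])                # exp(-0.5 ||x-z||^2)
print("sigmoid    k(x,z) =", sigmoid_kernel(x, z, gamma=0.1,
                                            coef0=0.0)[0, 0])                  # tanh(0.1 x^T z)
```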
The dual formulation allows applying the kernel trick and is easier to solve:
Dual Objective (Maximize): Σ α_i - (1/2) ΣΣ α_i α_j y_i y_j x_i^T x_j
Constraints: 0 ≤ α_i ≤ C and Σ α_i y_i = 0
Decision Function: f(x) = sign( Σ α_i y_i k(x_i, x) + b )
• α_i = 0: not a support vector
• 0 < α_i < C: support vector exactly on the margin
• α_i = C: support vector inside the margin or misclassified
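In scikit-learn the fitted α values can be inspected through dual_coef_, which stores y_i·α_i for the support vectors only; a sketch on synthetic data (C and γ are arbitrary choices here):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=100, n_features=2, n_redundant=0,
                           n_clusters_per_class=1, random_state=0)

C = 1.0
clf = SVC(kernel="rbf", C=C, gamma=0.5).fit(X, y)

alpha = np.abs(clf.dual_coef_[0])   # |y_i * alpha_i| = alpha_i for the support vectors
print("number of support vectors:", len(clf.support_))
print("alphas strictly between 0 and C (on the margin):", np.sum(alpha < C - 1e-8))
print("alphas at the bound C (inside margin / misclassified):", np.sum(alpha >= C - 1e-8))
# Points not listed in clf.support_ have alpha_i = 0 and do not affect the decision function.
```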
One-vs-Rest (OvR)
Train K binary classifiers (each class vs. all the others). Predict the class with the highest confidence.
One-vs-One (OvO)
Train K(K-1)/2 classifiers, one for each pair of classes. Use voting. More accurate but more training time.
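A sketch comparing the two strategies with scikit-learn's explicit wrappers (the iris dataset is used purely for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)   # 3 classes
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

ovr = OneVsRestClassifier(SVC(kernel="rbf", gamma="scale")).fit(X_tr, y_tr)  # K classifiers
ovo = OneVsOneClassifier(SVC(kernel="rbf", gamma="scale")).fit(X_tr, y_tr)   # K(K-1)/2 classifiers

print("OvR accuracy:", ovr.score(X_te, y_te))
print("OvO accuracy:", ovo.score(X_te, y_te))
```

With K = 3 both strategies happen to train three classifiers; the difference in training cost only becomes noticeable for larger K.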
Feature Scaling
Essential! Standardize features to mean 0, std 1
Hyperparameter Tuning
Grid search for C and the kernel parameters (e.g., γ, degree)
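A common way to combine both tips is to put the scaler inside a pipeline so the grid search never leaks statistics from the validation folds; a sketch (the dataset and parameter grid are arbitrary choices):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

# Scaling happens inside each CV fold, so no information leaks into the search
pipe = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
param_grid = {
    "svc__C": [0.1, 1, 10, 100],          # regularization strength
    "svc__gamma": [0.001, 0.01, 0.1, 1],  # RBF influence radius
}
search = GridSearchCV(pipe, param_grid, cv=5).fit(X, y)

print("best parameters:", search.best_params_)
print("best CV accuracy:", search.best_score_)
```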
Computational Complexity
Training: O(n²) to O(n³) in the number of samples n. Prediction: O(n_SV · d) per sample, where n_SV is the number of support vectors and d the number of features.
When to Use SVM
Dataset: Points inside circle (class +1) vs outside (class -1). Centered at origin, radius 2.
Sample Data:
Class +1 (inside):
(0.5, 0.5), (1, 1), (-0.8, 1.2), ...
Class -1 (outside):
(3, 0), (0, 3.5), (-2.5, 2), ...
Kernel & Parameters
Results
Linear kernel would fail (accuracy ~50%). RBF kernel implicitly maps to infinite dimensions where the circular pattern becomes linearly separable. The decision function creates a smooth, circular boundary with minimal support vectors.
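The exact kernel settings and accuracy figures of this example are not reproduced above, so the sketch below regenerates a similar circular dataset and uses assumed values for C and γ:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic version of the circle dataset: label +1 inside radius 2, -1 outside.
# Sampling the radius uniformly keeps the two classes roughly balanced.
rng = np.random.default_rng(0)
r = rng.uniform(0.0, 4.0, size=400)
theta = rng.uniform(0.0, 2 * np.pi, size=400)
X = np.column_stack([r * np.cos(theta), r * np.sin(theta)])
y = np.where(r < 2.0, 1, -1)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

linear = SVC(kernel="linear", C=1.0).fit(X_tr, y_tr)
rbf = SVC(kernel="rbf", C=1.0, gamma=0.5).fit(X_tr, y_tr)   # gamma chosen by hand

print("linear kernel accuracy:", linear.score(X_te, y_te))  # roughly chance level
print("RBF kernel accuracy:   ", rbf.score(X_te, y_te))     # close to 1.0
print("RBF support vectors:   ", rbf.support_vectors_.shape[0])
```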
Hyperplane: w^T x + b = 0
Distance from point x_i to the hyperplane: |w^T x_i + b| / ||w||
For a correctly classified point, y_i(w^T x_i + b) > 0, so the distance equals y_i(w^T x_i + b) / ||w||.
Margin: minimum distance over all points, γ = min_i y_i(w^T x_i + b) / ||w||
Goal: Maximize γ
Key insight: Scaling (w, b) by a constant doesn't change the hyperplane!
Choose the scale such that for the closest points (support vectors): y_i(w^T x_i + b) = 1
Then the margin becomes: γ = 1/||w|| (so the full width between the two boundaries is 2/||w||)
Simplified Optimization: min_w,b (1/2)||w||² subject to y_i(w^T x_i + b) ≥ 1 for all i
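This simplified problem is a small quadratic program; a sketch that solves it directly with cvxpy (a general-purpose convex solver not mentioned above, assumed installed) on made-up separable points:

```python
import cvxpy as cp
import numpy as np

# Tiny separable dataset (made up), labels in {-1, +1}
X = np.array([[2.0, 2.0], [2.5, 1.0], [3.0, 3.0],
              [-1.0, -1.0], [-2.0, 0.0], [-1.5, -2.0]])
y = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])

w = cp.Variable(2)
b = cp.Variable()

# min (1/2)||w||^2  subject to  y_i (w^T x_i + b) >= 1
constraints = [cp.multiply(y, X @ w + b) >= 1]
prob = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(w)), constraints)
prob.solve()

print("w =", w.value, "b =", b.value)
print("margin width 2/||w|| =", 2.0 / np.linalg.norm(w.value))
```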
Introduce Lagrange multipliers α_i ≥ 0 for each constraint: L(w, b, α) = (1/2)||w||² - Σ α_i [ y_i(w^T x_i + b) - 1 ]
Primal problem: min_w,b max_α L(w, b, α) where α_i ≥ 0
At optimal solution (w*, b*, α*), the following must hold:
1. Stationarity: ∂L/∂w = 0 ⇒ w = Σ α_i y_i x_i, and ∂L/∂b = 0 ⇒ Σ α_i y_i = 0
2. Primal Feasibility: y_i(w^T x_i + b) ≥ 1
3. Dual Feasibility: α_i ≥ 0
4. Complementary Slackness: α_i [ y_i(w^T x_i + b) - 1 ] = 0
Either α_i = 0 OR point is on margin boundary (support vector)
Substitute w = Σα_i y_i x_i into Lagrangian:
Using Σα_i y_i = 0 and simplifying: L(α) = Σ α_i - (1/2) ΣΣ α_i α_j y_i y_j x_i^T x_j
Dual Optimization Problem: max_α Σ α_i - (1/2) ΣΣ α_i α_j y_i y_j x_i^T x_j, subject to α_i ≥ 0 and Σ α_i y_i = 0
Quadratic programming problem in α. Only involves dot products x_i^T x_j!
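Because the dual is a quadratic program in α, even a general-purpose constrained optimizer can solve small instances; a sketch using SciPy's SLSQP method for the hard-margin case (the toy data is made up, and this is a pedagogical stand-in for specialized SVM solvers):

```python
import numpy as np
from scipy.optimize import minimize

# Tiny separable dataset (made up), labels in {-1, +1}
X = np.array([[2.0, 2.0], [3.0, 1.0], [-2.0, -1.0], [-1.0, -3.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
n = len(y)

Q = (y[:, None] * y[None, :]) * (X @ X.T)   # Q_ij = y_i y_j x_i^T x_j

def neg_dual(alpha):
    # Negative of the dual objective (we minimize instead of maximize)
    return 0.5 * alpha @ Q @ alpha - alpha.sum()

res = minimize(neg_dual, x0=np.zeros(n), method="SLSQP",
               bounds=[(0, None)] * n,                                 # alpha_i >= 0
               constraints=[{"type": "eq", "fun": lambda a: a @ y}])   # sum alpha_i y_i = 0

alpha = res.x
w = (alpha * y) @ X               # recover w = sum_i alpha_i y_i x_i
sv = alpha > 1e-6                 # support vectors have alpha_i > 0
b = np.mean(y[sv] - X[sv] @ w)    # b from the margin condition y_i(w^T x_i + b) = 1
print("alpha =", np.round(alpha, 4))
print("w =", w, "b =", b)
```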
Problem: Data may not be linearly separable in original space
Solution: Map to higher-dimensional space φ: R^d → R^D where D > d
Find a linear separator in the feature space φ(x), which corresponds to a non-linear boundary in the original space.
The kernel k(x, z) = φ(x)^T φ(z) computes the inner product in feature space without explicitly computing φ!
Example: Polynomial Kernel
For degree d=2 and x ∈ R^2, k(x, z) = (x^T z + 1)² implicitly maps to φ(x) = (1, √2 x_1, √2 x_2, x_1², x_2², √2 x_1 x_2) (a ~6D space) but requires only a dot product in the original 2-dimensional space!
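A quick numeric check of this identity, writing out the explicit degree-2 feature map (NumPy assumed; the sample vectors are arbitrary):

```python
import numpy as np

def phi(x):
    # Explicit feature map for k(x, z) = (x^T z + 1)^2 with x in R^2
    x1, x2 = x
    return np.array([1.0,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1**2, x2**2,
                     np.sqrt(2) * x1 * x2])

x = np.array([1.0, 2.0])
z = np.array([0.5, -1.0])

kernel_value = (x @ z + 1.0) ** 2       # computed in the 2-D input space
explicit_value = phi(x) @ phi(z)        # computed in the 6-D feature space

print(kernel_value, explicit_value)     # both equal (0.25 for these vectors)
```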
Theorem: A symmetric function k(x, z) can be expressed as an inner product k(x, z) = φ(x)^T φ(z) for some φ if and only if k is positive semi-definite: for every finite set of points x_1, ..., x_n, the Gram matrix K with K_ij = k(x_i, x_j) has no negative eigenvalues.
This guarantees that using k as a kernel corresponds to some feature mapping, even if we don't know what φ is explicitly!
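This condition can be checked empirically by looking at the eigenvalues of a Gram matrix; a sketch for the RBF kernel on random points (NumPy and scikit-learn assumed):

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))          # 50 random points in R^3

K = rbf_kernel(X, gamma=0.5)          # Gram matrix K_ij = k(x_i, x_j)
eigenvalues = np.linalg.eigvalsh(K)   # symmetric matrix -> real eigenvalues

# Positive semi-definite: no eigenvalue should be (meaningfully) negative
print("smallest eigenvalue:", eigenvalues.min())
```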
Expansion: k(x, z) = exp(-||x - z||²/(2σ²)) = exp(-||x||²/(2σ²)) · exp(-||z||²/(2σ²)) · exp(x^T z/σ²)
Using the Taylor expansion of exp(x^T z/σ²), this corresponds to an infinite-dimensional feature space!
RBF kernel = infinite-dimensional feature mapping, but computed in O(d) time!
Replace all x_i^T x_j with k(x_i, x_j) in the dual problem: max_α Σ α_i - (1/2) ΣΣ α_i α_j y_i y_j k(x_i, x_j)
Prediction for a new point x: f(x) = sign( Σ_{i ∈ SV} α_i y_i k(x_i, x) + b )
Only support vectors (SV) contribute to prediction!
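This prediction formula can be verified against a fitted scikit-learn model, whose dual_coef_ attribute stores α_i·y_i for the support vectors; a sketch on synthetic data (γ is an arbitrary choice):

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)
gamma = 0.7
clf = SVC(kernel="rbf", C=1.0, gamma=gamma).fit(X, y)

X_new = X[:5]                                             # a few query points
K = rbf_kernel(clf.support_vectors_, X_new, gamma=gamma)  # k(x_i, x) for support vectors only

# f(x) = sum_{i in SV} alpha_i y_i k(x_i, x) + b
manual = clf.dual_coef_ @ K + clf.intercept_
print(np.allclose(manual.ravel(), clf.decision_function(X_new)))  # True
```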
Problem: Data may have noise, outliers, or overlapping classes.
Hard-margin SVM (requiring y_i(w^T x_i + b) ≥ 1 strictly) then has no feasible solution!
Introduce slack variables ξ_i ≥ 0 to allow constraint violations: min_w,b,ξ (1/2)||w||² + C Σ ξ_i subject to y_i(w^T x_i + b) ≥ 1 - ξ_i and ξ_i ≥ 0
Interpretation: ξ_i = 0 means the point is outside the margin and correctly classified; 0 < ξ_i ≤ 1 means it is inside the margin but still correctly classified; ξ_i > 1 means it is misclassified.
Tradeoff via C: a small C tolerates many violations (wider margin), while a large C penalizes violations heavily (narrower margin).
The dual problem is similar but with an upper bound on α_i: 0 ≤ α_i ≤ C
Support Vector Categories: α_i = 0 (not a support vector), 0 < α_i < C (on the margin boundary), α_i = C (inside the margin or misclassified).
Soft margin SVM can be viewed as minimizing the regularized hinge loss: min_w,b (1/2)||w||² + C Σ max(0, 1 - y_i(w^T x_i + b))
Hinge loss: ℓ(z) = max(0, 1-z) is convex, differentiable almost everywhere, and encourages margin ≥ 1.
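This loss view leads directly to gradient-based training; a minimal subgradient-descent sketch for a linear soft-margin SVM (learning rate, epoch count, and data are assumed purely for illustration):

```python
import numpy as np
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=200, centers=2, cluster_std=1.5, random_state=0)
y = np.where(y == 1, 1.0, -1.0)            # labels in {-1, +1}

C, lr, epochs = 1.0, 0.001, 200
w, b = np.zeros(X.shape[1]), 0.0

for _ in range(epochs):
    margins = y * (X @ w + b)
    active = margins < 1                   # points with non-zero hinge loss
    # Subgradient of (1/2)||w||^2 + C * sum_i max(0, 1 - y_i(w^T x_i + b))
    grad_w = w - C * (y[active, None] * X[active]).sum(axis=0)
    grad_b = -C * y[active].sum()
    w -= lr * grad_w
    b -= lr * grad_b

accuracy = np.mean(np.sign(X @ w + b) == y)
print("training accuracy:", accuracy)
```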
Test your understanding with 10 multiple-choice questions