In the sample space, we seek a hyperplane that separates the different classes. When multiple separating hyperplanes exist, we should choose the one that sits "right in the middle" of the two classes, because it tolerates the most perturbation in the data and therefore has the best robustness and generalization ability.
Support vectors are the sample points closest to the separating hyperplane. These points determine the final classification boundary and are the only points that matter for the solution.
Imagine drawing a line (in 2D) or plane (in 3D) to separate two groups of points. The optimal line/plane is the one that maximizes the minimum distance to points from both groups. This creates a "safety buffer" that makes the classifier more robust to noise and better at generalizing to new data.
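To make the "safety buffer" concrete, here is a minimal sketch that fits a (near) hard-margin linear SVM on a toy 2D dataset with scikit-learn (an assumed dependency) and reads off the margin width, which for a linear SVM equals $2/\lVert \mathbf{w} \rVert$. The data points and parameter values are illustrative assumptions, not from the text.

```python
# A minimal sketch of the maximum-margin idea; toy data is illustrative.
import numpy as np
from sklearn.svm import SVC

# Two linearly separable clusters in 2D
X = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.0],   # class -1
              [4.0, 4.0], [5.0, 4.5], [4.5, 5.0]])  # class +1
y = np.array([-1, -1, -1, 1, 1, 1])

clf = SVC(kernel="linear", C=1e6)   # very large C approximates a hard margin
clf.fit(X, y)

w = clf.coef_[0]                    # normal vector of the hyperplane
margin = 2.0 / np.linalg.norm(w)    # width of the "safety buffer"
print("margin width:", margin)
print("support vectors:", clf.support_vectors_)
```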
Using the method of Lagrange multipliers, we can transform the primal optimization problem into its dual form. The dual has two key advantages: the data enter only through inner products $\mathbf{x}_i^\top \mathbf{x}_j$, which makes the kernel trick possible, and its solution is sparse.
$$
\max_{\boldsymbol{\alpha}} \; \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j \, \mathbf{x}_i^\top \mathbf{x}_j
$$
$$
\text{s.t.} \quad \sum_{i=1}^{n} \alpha_i y_i = 0, \qquad \alpha_i \ge 0, \; i = 1, \dots, n
$$
where $\alpha_i$ are the Lagrange multipliers.
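As a sanity check, the dual above can be solved numerically for a tiny dataset. The sketch below uses `scipy.optimize.minimize` with SLSQP (an assumed dependency); this is fine for toy sizes but is not how production SVM solvers work. All data values are illustrative assumptions.

```python
# A minimal sketch: solve the hard-margin dual directly with a generic solver.
import numpy as np
from scipy.optimize import minimize

X = np.array([[1.0, 1.0], [2.0, 1.5], [4.0, 4.0], [5.0, 4.5]])
y = np.array([-1.0, -1.0, 1.0, 1.0])
n = len(y)

Q = (y[:, None] * y[None, :]) * (X @ X.T)   # Q_ij = y_i y_j x_i . x_j

def neg_dual(alpha):
    # Minimize the negative of the dual objective (i.e., maximize the dual).
    return 0.5 * alpha @ Q @ alpha - alpha.sum()

res = minimize(
    neg_dual,
    x0=np.zeros(n),
    method="SLSQP",
    bounds=[(0, None)] * n,                                # alpha_i >= 0
    constraints=[{"type": "eq", "fun": lambda a: a @ y}],  # sum alpha_i y_i = 0
)
alpha = res.x
w = (alpha * y) @ X            # recover w = sum_i alpha_i y_i x_i
print("alpha:", np.round(alpha, 4))
print("w:", w)
```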
The Karush-Kuhn-Tucker (KKT) conditions are necessary conditions for optimality in constrained optimization; for convex problems such as the SVM, they are also sufficient. For SVM, they reveal the special structure of the solution.
From the complementary slackness condition $\alpha_i \bigl( y_i(\mathbf{w}^\top \mathbf{x}_i + b) - 1 \bigr) = 0$, we can analyze two cases: if $\alpha_i = 0$, the sample makes no contribution to the final model; if $\alpha_i > 0$, then $y_i(\mathbf{w}^\top \mathbf{x}_i + b) = 1$, so the sample lies exactly on the maximum-margin boundary and is a support vector.
Practical Significance
This sparsity gives the SVM excellent generalization ability and efficient prediction: classifying a new point only requires computing kernel values against the support vectors, not against all training samples.
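The sketch below checks complementary slackness numerically, reusing the toy data from the earlier sketch: support vectors sit exactly on the margin ($y_i(\mathbf{w}^\top \mathbf{x}_i + b) = 1$), while all other samples have functional margin greater than 1 and zero multipliers. The data and the large-C setting are illustrative assumptions.

```python
# A minimal sketch verifying complementary slackness on a toy problem.
import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.0],
              [4.0, 4.0], [5.0, 4.5], [4.5, 5.0]])
y = np.array([-1, -1, -1, 1, 1, 1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)   # large C ~ hard margin
w, b = clf.coef_[0], clf.intercept_[0]

margins = y * (X @ w + b)                     # y_i (w.x_i + b) for each sample
for i, m in enumerate(margins):
    role = "support vector (margin = 1)" if i in clf.support_ else "alpha_i = 0"
    print(f"sample {i}: functional margin {m:.3f} -> {role}")
```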
Sequential Minimal Optimization (SMO) is an efficient algorithm for solving the SVM dual problem, developed by John Platt in 1998.
Select a pair of variables $\alpha_i$ and $\alpha_j$ to update
Fix all other multipliers and solve the two-variable quadratic programming subproblem, which has a closed-form analytical solution
Repeat until convergence
Convergence is reached when all samples satisfy the KKT conditions (within a tolerance). Because the dual problem is convex, this certifies that we have found the global optimum.
Directly solving the full dual as a standard quadratic program costs roughly $O(n^3)$ time and $O(n^2)$ memory in the number of training samples $n$. SMO sidesteps this by decomposing the problem into a series of smallest possible subproblems (updating just 2 variables at a time), each solved analytically; in practice its runtime scales between roughly $O(n)$ and $O(n^2)$. A minimal implementation sketch follows.
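The sketch below implements a simplified SMO with a linear kernel: it keeps Platt's two-variable analytical update but picks the second multiplier at random instead of using his working-set heuristics, so it is a teaching sketch, not a faithful reproduction of the 1998 algorithm. All parameter values are illustrative assumptions.

```python
# A minimal sketch of simplified SMO (random second-multiplier choice,
# linear kernel); not Platt's full heuristic algorithm.
import numpy as np

def simplified_smo(X, y, C=1.0, tol=1e-3, max_passes=10, seed=0):
    rng = np.random.default_rng(seed)
    n = len(y)
    K = X @ X.T                    # linear kernel matrix
    alpha, b, passes = np.zeros(n), 0.0, 0
    while passes < max_passes:
        changed = 0
        for i in range(n):
            Ei = (alpha * y) @ K[:, i] + b - y[i]   # prediction error on i
            # Does alpha_i violate its KKT condition (within tolerance)?
            if (y[i] * Ei < -tol and alpha[i] < C) or (y[i] * Ei > tol and alpha[i] > 0):
                j = rng.integers(n - 1)
                j = j + 1 if j >= i else j          # random j != i
                Ej = (alpha * y) @ K[:, j] + b - y[j]
                ai_old, aj_old = alpha[i], alpha[j]
                # Box bounds that keep sum alpha_i y_i = 0 and 0 <= alpha <= C
                if y[i] != y[j]:
                    L, H = max(0, aj_old - ai_old), min(C, C + aj_old - ai_old)
                else:
                    L, H = max(0, ai_old + aj_old - C), min(C, ai_old + aj_old)
                if L == H:
                    continue
                eta = 2 * K[i, j] - K[i, i] - K[j, j]   # curvature along the pair
                if eta >= 0:
                    continue
                alpha[j] = np.clip(aj_old - y[j] * (Ei - Ej) / eta, L, H)
                if abs(alpha[j] - aj_old) < 1e-5:
                    continue
                alpha[i] = ai_old + y[i] * y[j] * (aj_old - alpha[j])
                # Update threshold b from whichever multiplier is interior
                b1 = (b - Ei - y[i] * (alpha[i] - ai_old) * K[i, i]
                      - y[j] * (alpha[j] - aj_old) * K[i, j])
                b2 = (b - Ej - y[i] * (alpha[i] - ai_old) * K[i, j]
                      - y[j] * (alpha[j] - aj_old) * K[j, j])
                if 0 < alpha[i] < C:
                    b = b1
                elif 0 < alpha[j] < C:
                    b = b2
                else:
                    b = (b1 + b2) / 2
                changed += 1
        passes = passes + 1 if changed == 0 else 0
    return alpha, b

# Usage on toy data from the earlier sketches (illustrative):
#   alpha, b = simplified_smo(X, y.astype(float), C=1e6)
#   w = (alpha * y) @ X   # linear-kernel weight vector
```

After convergence, the samples with $\alpha_i > 0$ are the support vectors, exactly as the complementary slackness analysis predicts.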
Scenario: A bank wants to build a binary classifier to detect fraudulent credit card transactions based on transaction amount, time, location, and merchant category.
Result: The trained SVM finds that only 200 transactions (2%) serve as support vectors—these "boundary cases" fully define the decision boundary. This sparse representation enables fast fraud detection on new transactions while maintaining high accuracy.
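A sketch of how such a pipeline might look with scikit-learn is below. Synthetic data stands in for the bank's transactions, and every number and parameter here (dataset size, class imbalance, C, kernel) is an illustrative assumption rather than the results quoted above.

```python
# A minimal sketch of the fraud-detection setup on synthetic data.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# 4 features standing in for amount, time, location, merchant category;
# heavily imbalanced classes to mimic rare fraud.
X, y = make_classification(n_samples=10_000, n_features=4, n_informative=4,
                           n_redundant=0, weights=[0.97, 0.03], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = make_pipeline(StandardScaler(),
                    SVC(kernel="rbf", C=10.0, class_weight="balanced"))
clf.fit(X_tr, y_tr)

svc = clf.named_steps["svc"]
print("support vectors:", svc.n_support_.sum(), "of", len(X_tr))  # sparsity
print("test accuracy  :", clf.score(X_te, y_te))
```

Only the support vectors need to be kept and scored against at prediction time, which is what makes deployment on a high-volume transaction stream practical.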