
Support Vector Machines

Master maximum margin classification, kernel methods, and optimization techniques

Maximum Margin Principle

Finding the Optimal Hyperplane

SVM finds the hyperplane that maximizes the margin between two classes:

Decision Hyperplane:

$$\mathbf{w}^T\mathbf{x} + b = 0$$

Margin Boundaries:

$$\mathbf{w}^T\mathbf{x} + b = +1 \quad \text{(positive class)}$$
$$\mathbf{w}^T\mathbf{x} + b = -1 \quad \text{(negative class)}$$

Margin Width:

$$\text{Margin} = \frac{2}{\|\mathbf{w}\|}$$

Hard-Margin SVM (Linearly Separable)

Optimization problem:

$$\min_{\mathbf{w}, b} \frac{1}{2}\|\mathbf{w}\|^2$$

Subject to:

$$y_i(\mathbf{w}^T\mathbf{x}_i + b) \geq 1, \quad \forall i$$

Minimizing $\frac{1}{2}\|\mathbf{w}\|^2$ maximizes the margin $2/\|\mathbf{w}\|$

Support Vectors

Samples on the margin boundaries:

$$y_i(\mathbf{w}^T\mathbf{x}_i + b) = 1$$
  • Only support vectors determine the decision boundary
  • Removing other points doesn't change the model
  • Typically only a small fraction of the samples are support vectors
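
To make this concrete, here is a minimal sketch (assuming NumPy and scikit-learn are available) that fits a linear SVC with a very large C as a stand-in for the hard-margin case and reads off the support vectors and the margin width; the toy data are illustrative, not from the text.

```python
# Minimal sketch: hard-margin behaviour via scikit-learn's SVC with a large C.
import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.0],       # class +1
              [-1.0, -1.0], [-2.0, -1.5], [-1.5, -2.0]])  # class -1
y = np.array([1, 1, 1, -1, -1, -1])

clf = SVC(kernel="linear", C=1e6)   # very large C approximates the hard margin
clf.fit(X, y)

w, b = clf.coef_[0], clf.intercept_[0]
print("w =", w, " b =", b)
print("support vectors:\n", clf.support_vectors_)
print("margin width 2/||w|| =", 2 / np.linalg.norm(w))
```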

Soft Margin SVM

Handling Non-Separable Data

Real data is often not linearly separable. Soft-margin SVM allows controlled violations:

Objective Function:

$$\min_{\mathbf{w}, b, \boldsymbol{\xi}} \frac{1}{2}\|\mathbf{w}\|^2 + C\sum_{i=1}^{n} \xi_i$$

Constraints:

$$y_i(\mathbf{w}^T\mathbf{x}_i + b) \geq 1 - \xi_i, \quad \xi_i \geq 0, \quad \forall i$$

• $\xi_i$ = slack variable (amount of violation for sample $i$)

• $C$ = regularization parameter (penalty for violations)

  • Small C: wide margin, more tolerance for errors → underfitting risk
  • Moderate C: balanced trade-off, good generalization → usually optimal
  • Large C: narrow margin, less tolerance for errors → overfitting risk
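
A small experiment makes this trade-off visible. The sketch below (assuming scikit-learn; the blob data and the three C values are arbitrary choices) shows how the margin narrows and the support-vector count typically shrinks as C grows.

```python
# Sketch: effect of C on margin width and support-vector count
# for a linear soft-margin SVM on noisy, overlapping 2-D data.
import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=200, centers=2, cluster_std=2.5, random_state=0)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    w = clf.coef_[0]
    print(f"C={C:>6}: margin width={2 / np.linalg.norm(w):.3f}, "
          f"support vectors={len(clf.support_)}, "
          f"train accuracy={clf.score(X, y):.3f}")
```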

The Kernel Trick

Non-Linear Classification

Kernel functions implicitly map data to higher dimensions where it becomes linearly separable:

$$K(\mathbf{x}_i, \mathbf{x}_j) = \phi(\mathbf{x}_i)^T \phi(\mathbf{x}_j)$$

Compute the dot product in the high-dimensional space without explicitly computing $\phi(\mathbf{x})$

Linear Kernel
$$K(\mathbf{x}, \mathbf{z}) = \mathbf{x}^T\mathbf{z}$$

Standard SVM, no transformation. Use for linearly separable data.

Polynomial Kernel
$$K(\mathbf{x}, \mathbf{z}) = (\mathbf{x}^T\mathbf{z} + c)^d$$

Degree-$d$ polynomial features; $c$ shifts the influence of higher-order versus lower-order terms. Good for image processing.

RBF (Gaussian) Kernel ⭐
$$K(\mathbf{x}, \mathbf{z}) = \exp\left(-\gamma \|\mathbf{x} - \mathbf{z}\|^2\right)$$

• Most popular kernel
• Parameter γ controls influence radius
• Maps to infinite-dimensional space
• Smooth, flexible decision boundaries

Common choice: γ = 1/number of features

Sigmoid Kernel
$$K(\mathbf{x}, \mathbf{z}) = \tanh(\kappa \mathbf{x}^T\mathbf{z} + c)$$

Inspired by neural networks. A valid (positive semi-definite) kernel only for certain values of $\kappa$ and $c$.
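
The kernels above can be evaluated directly as Gram matrices. The sketch below uses scikit-learn's pairwise helpers; parameter names such as degree, gamma, and coef0 follow scikit-learn's conventions, with gamma playing the role of $\kappa$ and coef0 the role of $c$.

```python
# Sketch: Gram matrices for the four kernels on a tiny illustrative dataset.
import numpy as np
from sklearn.metrics.pairwise import (linear_kernel, polynomial_kernel,
                                      rbf_kernel, sigmoid_kernel)

X = np.array([[0.0, 1.0], [1.0, 2.0], [2.0, 0.5]])

print(linear_kernel(X))                                  # x^T z
print(polynomial_kernel(X, degree=2, gamma=1.0, coef0=1.0))  # (x^T z + c)^d
print(rbf_kernel(X, gamma=1.0 / X.shape[1]))             # exp(-gamma ||x - z||^2)
print(sigmoid_kernel(X, gamma=0.5, coef0=0.0))           # tanh(kappa x^T z + c)
```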

Dual Formulation

Lagrangian Dual Problem

The dual formulation allows applying the kernel trick and is easier to solve:

Dual Objective (Maximize):

$$L_D(\boldsymbol{\alpha}) = \sum_{i=1}^{n} \alpha_i - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n} \alpha_i\alpha_j y_i y_j K(\mathbf{x}_i, \mathbf{x}_j)$$

Constraints:

$$0 \leq \alpha_i \leq C, \quad \sum_{i=1}^{n} \alpha_i y_i = 0$$

Decision Function:

$$f(\mathbf{x}) = \text{sign}\left(\sum_{i=1}^{n} \alpha_i y_i K(\mathbf{x}_i, \mathbf{x}) + b\right)$$

Advantages of Dual

  • Enables the kernel trick (no explicit $\phi(\mathbf{x})$)
  • Quadratic programming problem (convex)
  • Only depends on dot products
  • Sparse solution ($\alpha_i = 0$ for non-support vectors)

KKT Conditions

$$\alpha_i[y_i(\mathbf{w}^T\mathbf{x}_i + b) - 1 + \xi_i] = 0$$

• $\alpha_i = 0$: not a support vector
• $0 < \alpha_i < C$: on the margin boundary
• $\alpha_i = C$: inside the margin or misclassified
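
These cases can be checked on a fitted model. The sketch below assumes scikit-learn's SVC, whose dual_coef_ attribute stores $\alpha_i y_i$ for the support vectors, so its absolute value recovers $\alpha_i$; the dataset and C value are illustrative.

```python
# Sketch: reading the dual variables off a fitted SVC and bucketing them.
import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=100, centers=2, cluster_std=2.0, random_state=1)
C = 1.0
clf = SVC(kernel="linear", C=C).fit(X, y)

alpha = np.abs(clf.dual_coef_[0])          # alpha_i, support vectors only
on_margin = np.sum(alpha < C - 1e-8)       # 0 < alpha_i < C
at_bound = np.sum(alpha >= C - 1e-8)       # alpha_i = C (inside margin or misclassified)
print("support vectors:", len(alpha),
      " on margin:", on_margin, " at bound C:", at_bound)
```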

Multi-class & Implementation

Multi-class Strategies

One-vs-Rest (OvR)

Train $K$ binary classifiers (each class vs. all others). Predict the class with the highest confidence.

One-vs-One (OvO)

Train $K(K-1)/2$ classifiers, one for each pair of classes, and combine them by voting. Often more accurate, but more training time.
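
Both strategies are available as wrappers in scikit-learn; the sketch below is illustrative and uses the iris dataset only because it has K = 3 classes.

```python
# Sketch: one-vs-rest and one-vs-one multi-class SVMs via scikit-learn wrappers
# (SVC itself already uses a one-vs-one scheme internally).
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)   # 3 classes

ovr = OneVsRestClassifier(SVC(kernel="rbf", gamma="scale")).fit(X, y)  # K classifiers
ovo = OneVsOneClassifier(SVC(kernel="rbf", gamma="scale")).fit(X, y)   # K(K-1)/2 classifiers

print("OvR estimators:", len(ovr.estimators_))   # 3
print("OvO estimators:", len(ovo.estimators_))   # 3
```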

Practical Considerations

Feature Scaling

Essential! Standardize features to mean 0, std 1

Hyperparameter Tuning

Grid search over $C$ and the kernel parameters ($\gamma$, $d$); see the sketch after the complexity note below

Computational Complexity

Training: roughly $O(n^2)$ to $O(n^3)$ in the number of samples $n$. Prediction: proportional to the number of support vectors
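
The sketch below combines the two practical points above, scaling inside a pipeline and grid-searching $C$ and $\gamma$; scikit-learn is assumed, and the dataset and grid values are illustrative rather than recommendations.

```python
# Sketch: StandardScaler + RBF SVC in a pipeline, tuned with GridSearchCV.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

pipe = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
grid = GridSearchCV(pipe,
                    {"svc__C": [0.1, 1, 10, 100],
                     "svc__gamma": [0.001, 0.01, 0.1, 1]},
                    cv=5)
grid.fit(X_tr, y_tr)

print("best params:", grid.best_params_)
print("test accuracy:", grid.score(X_te, y_te))
```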

When to Use SVM

  • High-dimensional data (text, images)
  • Clear margin of separation
  • Small to medium datasets

Example: Non-Linear Classification

RBF Kernel SVM for Circular Decision Boundary

Problem Setup

Dataset: Points inside circle (class +1) vs outside (class -1). Centered at origin, radius 2.

Sample Data:

Class +1 (inside):

(0.5, 0.5), (1, 1), (-0.8, 1.2), ...

Class -1 (outside):

(3, 0), (0, 3.5), (-2.5, 2), ...

Model Configuration

Kernel & Parameters

  • Kernel: RBF
  • $C$: moderate regularization
  • $\gamma$: moderate influence radius

Results

  • Training accuracy: 100%
  • Test accuracy: 98%
  • Support vectors: 12 / 100
  • Decision boundary: smooth circle

Key Observation

Linear kernel would fail (accuracy ~50%). RBF kernel implicitly maps to infinite dimensions where the circular pattern becomes linearly separable. The decision function creates a smooth, circular boundary with minimal support vectors.
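
The exact dataset and hyperparameters behind these numbers are not given, so the following is only a sketch that recreates a similar setup: uniform points labeled by whether they fall inside a radius-2 circle, fit with an RBF SVC whose C and gamma values are assumptions. Your accuracies and support-vector counts will differ somewhat.

```python
# Sketch: RBF SVM on a synthetic circular classification problem.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.uniform(-4, 4, size=(200, 2))
y = np.where(np.linalg.norm(X, axis=1) < 2.0, 1, -1)   # inside radius 2 -> +1

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = SVC(kernel="rbf", C=1.0, gamma=0.5).fit(X_tr, y_tr)

print("train accuracy:", clf.score(X_tr, y_tr))
print("test accuracy:", clf.score(X_te, y_te))
print("support vectors:", len(clf.support_), "of", len(X_tr))
```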

Margin Maximization Complete Derivation

From Geometric Margin to Optimization Problem

Geometric Margin Definition

Hyperplane: $\mathbf{w}^T\mathbf{x} + b = 0$

Distance from point x_i to hyperplane:

$$\text{distance} = \frac{|\mathbf{w}^T\mathbf{x}_i + b|}{\|\mathbf{w}\|}$$

For correctly classified point: y_i(w^T x_i + b) > 0, so:

$$\gamma_i = \frac{y_i(\mathbf{w}^T\mathbf{x}_i + b)}{\|\mathbf{w}\|}$$

Margin: Minimum distance over all points

$$\gamma = \min_{i=1,\ldots,n} \gamma_i = \min_{i} \frac{y_i(\mathbf{w}^T\mathbf{x}_i + b)}{\|\mathbf{w}\|}$$

Optimization Formulation

Goal: Maximize γ

$$\max_{\mathbf{w},b} \min_{i} \frac{y_i(\mathbf{w}^T\mathbf{x}_i + b)}{\|\mathbf{w}\|}$$
$$\text{subject to } y_i(\mathbf{w}^T\mathbf{x}_i + b) > 0 \quad \forall i$$

Simplification via Scaling

Key insight: Scaling (w, b) by constant doesn't change the hyperplane!

Choose scale such that for the closest points (support vectors):

$$\min_i y_i(\mathbf{w}^T\mathbf{x}_i + b) = 1$$

Then margin becomes:

$$\gamma = \frac{1}{\|\mathbf{w}\|}$$

Simplified Optimization:

$$\max_{\mathbf{w},b} \frac{1}{\|\mathbf{w}\|} \quad \iff \quad \min_{\mathbf{w},b} \frac{1}{2}\|\mathbf{w}\|^2$$
$$\text{subject to } y_i(\mathbf{w}^T\mathbf{x}_i + b) \geq 1 \quad \forall i$$
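
This scaling convention can be verified numerically: after fitting a (near) hard-margin linear SVM, the smallest functional margin is 1 and the geometric margin equals $1/\|\mathbf{w}\|$. A sketch, assuming scikit-learn and an illustrative toy dataset:

```python
# Sketch: checking min_i y_i(w^T x_i + b) = 1 and margin = 1/||w|| numerically.
import numpy as np
from sklearn.svm import SVC

X = np.array([[2.0, 2.0], [3.0, 3.0], [2.5, 3.5],
              [-2.0, -2.0], [-3.0, -3.0], [-2.5, -3.5]])
y = np.array([1, 1, 1, -1, -1, -1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)   # large C approximates hard margin
w, b = clf.coef_[0], clf.intercept_[0]

functional_margins = y * (X @ w + b)
print("min_i y_i(w^T x_i + b) =", functional_margins.min())   # approximately 1
print("geometric margin 1/||w|| =", 1 / np.linalg.norm(w))
```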

Lagrangian & KKT Conditions

Constrained Optimization Framework

Lagrangian Formulation

Introduce Lagrange multipliers α_i ≥ 0 for each constraint:

$$L(\mathbf{w}, b, \boldsymbol{\alpha}) = \frac{1}{2}\|\mathbf{w}\|^2 - \sum_{i=1}^{n} \alpha_i [y_i(\mathbf{w}^T\mathbf{x}_i + b) - 1]$$

Primal problem: min_w,b max_α L(w, b, α) where α_i ≥ 0

KKT Conditions

At optimal solution (w*, b*, α*), the following must hold:

1. Stationarity:

$$\frac{\partial L}{\partial \mathbf{w}} = \mathbf{w} - \sum_i \alpha_i y_i \mathbf{x}_i = 0 \implies \mathbf{w}^* = \sum_i \alpha_i y_i \mathbf{x}_i$$
$$\frac{\partial L}{\partial b} = -\sum_i \alpha_i y_i = 0 \implies \sum_i \alpha_i y_i = 0$$

2. Primal Feasibility:

$$y_i(\mathbf{w}^T\mathbf{x}_i + b) \geq 1 \quad \forall i$$

3. Dual Feasibility:

$$\alpha_i \geq 0 \quad \forall i$$

4. Complementary Slackness:

$$\alpha_i [y_i(\mathbf{w}^T\mathbf{x}_i + b) - 1] = 0 \quad \forall i$$

Either α_i = 0 OR point is on margin boundary (support vector)

Dual Problem Derivation

Substitute w = Σα_i y_i x_i into Lagrangian:

$$L = \frac{1}{2}\left\|\sum_i \alpha_i y_i \mathbf{x}_i\right\|^2 - \sum_i \alpha_i y_i \left(\sum_j \alpha_j y_j \mathbf{x}_j^T\mathbf{x}_i\right) - b\sum_i \alpha_i y_i + \sum_i \alpha_i$$

Using Σα_i y_i = 0 and simplifying:

$$L = \sum_i \alpha_i - \frac{1}{2}\sum_i\sum_j \alpha_i\alpha_j y_i y_j \mathbf{x}_i^T\mathbf{x}_j$$

Dual Optimization Problem:

$$\max_{\boldsymbol{\alpha}} \sum_i \alpha_i - \frac{1}{2}\sum_i\sum_j \alpha_i\alpha_j y_i y_j \mathbf{x}_i^T\mathbf{x}_j$$
$$\text{subject to } \alpha_i \geq 0, \quad \sum_i \alpha_i y_i = 0$$

Quadratic programming problem in α. Only involves dot products x_i^T x_j!
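
Because the dual is a small QP, it can be solved with a generic solver for illustration (real SVM libraries use specialised algorithms such as SMO). The sketch below assumes SciPy and a tiny separable dataset, then recovers $\mathbf{w}$ and $b$ from the KKT conditions.

```python
# Sketch: solving the hard-margin dual QP directly with SciPy's SLSQP.
import numpy as np
from scipy.optimize import minimize

X = np.array([[2.0, 2.0], [3.0, 1.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
n = len(y)

Q = (y[:, None] * y[None, :]) * (X @ X.T)      # Q_ij = y_i y_j x_i^T x_j

def neg_dual(a):                               # maximise L_D == minimise -L_D
    return 0.5 * a @ Q @ a - a.sum()

res = minimize(neg_dual, x0=np.zeros(n), method="SLSQP",
               bounds=[(0.0, None)] * n,                            # alpha_i >= 0
               constraints={"type": "eq", "fun": lambda a: a @ y})  # sum alpha_i y_i = 0

alpha = res.x
w = (alpha * y) @ X                            # w = sum_i alpha_i y_i x_i
sv = alpha > 1e-6
b = np.mean(y[sv] - X[sv] @ w)                 # from y_i(w^T x_i + b) = 1 on SVs
print("alpha =", alpha.round(4))
print("w =", w, " b =", round(b, 4))
```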

Kernel Trick Mathematical Justification

Implicit Feature Mapping via Mercer's Theorem

Motivation: Non-linear Decision Boundaries

Problem: Data may not be linearly separable in original space

Solution: Map to higher-dimensional space φ: R^d → R^D where D > d

$$\mathbf{x} \to \phi(\mathbf{x})$$

Find a linear separator in the feature space φ(x), which corresponds to a non-linear boundary in the original space.

Kernel Function Definition

$$k(\mathbf{x}, \mathbf{z}) = \phi(\mathbf{x})^T \phi(\mathbf{z})$$

Kernel computes inner product in feature space without explicitly computing φ!

Example: Polynomial Kernel

$$k(\mathbf{x}, \mathbf{z}) = (\mathbf{x}^T\mathbf{z} + c)^d$$

For d=2 and x ∈ R^2, this implicitly maps into a 6-dimensional feature space (constant, linear, and quadratic terms), yet evaluating the kernel costs only a 2-dimensional dot product!
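
This can be verified numerically: for d = 2 and c = 1, the explicit 6-dimensional map φ(x) = (1, √2·x₁, √2·x₂, x₁², x₂², √2·x₁x₂) reproduces the kernel value exactly. A sketch assuming NumPy:

```python
# Sketch: polynomial kernel value vs. explicit feature map for d=2, c=1.
import numpy as np

def phi(x):
    x1, x2 = x
    return np.array([1.0, np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1**2, x2**2, np.sqrt(2) * x1 * x2])

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

kernel_value = (x @ z + 1.0) ** 2      # (x^T z + c)^d with c=1, d=2
explicit_value = phi(x) @ phi(z)       # same value via the explicit mapping
print(kernel_value, explicit_value)    # both 4.0
```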

Mercer's Theorem

Theorem: A symmetric function k(x, z) can be expressed as an inner product k(x, z) = φ(x)^T φ(z) for some φ if and only if k is positive semi-definite:

$$\sum_i \sum_j c_i c_j k(\mathbf{x}_i, \mathbf{x}_j) \geq 0 \quad \forall \{c_i\}, \{\mathbf{x}_i\}$$

This guarantees that using k as a kernel corresponds to some feature mapping, even if we don't know what φ is explicitly!
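
An empirical sanity check (not a proof) of Mercer's condition: the Gram matrix of a valid kernel on any point set has no negative eigenvalues beyond numerical noise. A sketch, assuming NumPy and scikit-learn:

```python
# Sketch: empirical positive semi-definiteness check for the RBF kernel.
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
K = rbf_kernel(X, gamma=0.5)               # K_ij = exp(-gamma ||x_i - x_j||^2)

eigvals = np.linalg.eigvalsh(K)            # symmetric => real eigenvalues
print("smallest eigenvalue:", eigvals.min())   # non-negative up to round-off
```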

RBF (Gaussian) Kernel Derivation

$$k(\mathbf{x}, \mathbf{z}) = \exp\left(-\frac{\|\mathbf{x}-\mathbf{z}\|^2}{2\sigma^2}\right)$$

Expansion:

$$k(\mathbf{x}, \mathbf{z}) = \exp\left(-\frac{\|\mathbf{x}\|^2 + \|\mathbf{z}\|^2 - 2\mathbf{x}^T\mathbf{z}}{2\sigma^2}\right) = \exp\left(-\frac{\|\mathbf{x}\|^2}{2\sigma^2}\right) \exp\left(-\frac{\|\mathbf{z}\|^2}{2\sigma^2}\right) \exp\left(\frac{\mathbf{x}^T\mathbf{z}}{\sigma^2}\right)$$

Using the Taylor expansion of exp(x^T z/σ²), this corresponds to an infinite-dimensional feature space!

RBF kernel = infinite-dimensional feature mapping, but computed in O(d) time!

Kernelized SVM

Replace all x_i^T x_j with k(x_i, x_j) in dual problem:

$$\max_{\boldsymbol{\alpha}} \sum_i \alpha_i - \frac{1}{2}\sum_i\sum_j \alpha_i\alpha_j y_i y_j k(\mathbf{x}_i, \mathbf{x}_j)$$

Prediction for new point x:

$$f(\mathbf{x}) = \text{sign}\left(\sum_{i \in SV} \alpha_i y_i k(\mathbf{x}_i, \mathbf{x}) + b\right)$$

Only support vectors (SV) contribute to prediction!
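
This prediction formula can be reproduced by hand from a fitted model. The sketch below assumes scikit-learn, whose SVC exposes support_vectors_, dual_coef_ (holding $\alpha_i y_i$), and intercept_; the circular dataset is illustrative.

```python
# Sketch: rebuilding SVC's decision function from its support vectors.
import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import make_circles
from sklearn.metrics.pairwise import rbf_kernel

X, y = make_circles(n_samples=200, factor=0.4, noise=0.1, random_state=0)
gamma = 1.0
clf = SVC(kernel="rbf", gamma=gamma, C=1.0).fit(X, y)

X_new = X[:5]
K = rbf_kernel(X_new, clf.support_vectors_, gamma=gamma)   # k(x, x_i) over the SVs
manual = K @ clf.dual_coef_[0] + clf.intercept_[0]          # sum_i alpha_i y_i k(x_i, x) + b
print(np.allclose(manual, clf.decision_function(X_new)))    # True
```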

Soft Margin SVM & C Parameter

Handling Non-Separable Data

Motivation

Problem: Data may have:

  • Outliers
  • Noise
  • Truly overlapping classes

Hard margin SVM (y_i(w^T x_i + b) ≥ 1 strictly) has no solution!

Slack Variables

Introduce slack variables ξ_i ≥ 0 to allow constraint violations:

$$y_i(\mathbf{w}^T\mathbf{x}_i + b) \geq 1 - \xi_i$$

Interpretation:

  • ξ_i = 0: Correctly classified, outside margin
  • 0 < ξ_i < 1: Correctly classified, inside margin
  • ξ_i ≥ 1: Misclassified
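
The slack values are not returned by typical solvers, but they can be recovered from the decision function as ξ_i = max(0, 1 − y_i f(x_i)) and bucketed into the three cases above. A sketch, assuming scikit-learn and an illustrative noisy dataset:

```python
# Sketch: recovering slack variables xi_i from a fitted soft-margin SVM.
import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=200, centers=2, cluster_std=2.5, random_state=0)
y_pm = np.where(y == 1, 1, -1)             # map labels to {-1, +1}

clf = SVC(kernel="linear", C=1.0).fit(X, y)
xi = np.maximum(0.0, 1.0 - y_pm * clf.decision_function(X))

print("xi = 0   (outside margin):", np.sum(xi == 0))
print("0 < xi < 1 (inside margin):", np.sum((xi > 0) & (xi < 1)))
print("xi >= 1  (misclassified): ", np.sum(xi >= 1))
```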

Soft Margin Optimization

$$\min_{\mathbf{w},b,\boldsymbol{\xi}} \frac{1}{2}\|\mathbf{w}\|^2 + C\sum_{i=1}^{n} \xi_i$$
$$\text{subject to } y_i(\mathbf{w}^T\mathbf{x}_i + b) \geq 1 - \xi_i, \quad \xi_i \geq 0$$

Tradeoff via C:

  • Large C: Heavy penalty for violations → narrow margin, fewer errors
  • Small C: Light penalty → wider margin, more errors tolerated

Dual Form with Box Constraints

The dual problem is similar, but with an upper bound on α_i:

$$\max_{\boldsymbol{\alpha}} \sum_i \alpha_i - \frac{1}{2}\sum_i\sum_j \alpha_i\alpha_j y_i y_j k(\mathbf{x}_i, \mathbf{x}_j)$$
$$\text{subject to } 0 \leq \alpha_i \leq C, \quad \sum_i \alpha_i y_i = 0$$

Support Vector Categories:

  • α_i = 0: Not support vectors
  • 0 < α_i < C: On margin boundary (ξ_i = 0)
  • α_i = C: Inside margin or misclassified (ξ_i > 0)

Hinge Loss Interpretation

Soft margin SVM can be viewed as minimizing regularized hinge loss:

$$\min_{\mathbf{w},b} \frac{1}{2}\|\mathbf{w}\|^2 + C\sum_i \max(0, 1 - y_i(\mathbf{w}^T\mathbf{x}_i + b))$$

Hinge loss: ℓ(z) = max(0, 1-z) is convex, differentiable almost everywhere, and encourages margin ≥ 1.
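
This view suggests an alternative training route: minimise the regularised hinge loss directly by (sub)gradient descent instead of solving the constrained QP. The sketch below is a bare-bones illustration, not a production solver; the step size, iteration count, and dataset are arbitrary choices.

```python
# Sketch: subgradient descent on the regularised hinge loss for a linear SVM.
import numpy as np
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=200, centers=2, cluster_std=1.5, random_state=0)
y = np.where(y == 1, 1.0, -1.0)            # labels in {-1, +1}

C, lr, n_iters = 1.0, 1e-4, 5000
w, b = np.zeros(X.shape[1]), 0.0

for _ in range(n_iters):
    margins = y * (X @ w + b)
    violated = margins < 1                 # samples with non-zero hinge loss
    grad_w = w - C * (y[violated, None] * X[violated]).sum(axis=0)
    grad_b = -C * y[violated].sum()
    w -= lr * grad_w
    b -= lr * grad_b

objective = 0.5 * w @ w + C * np.maximum(0, 1 - y * (X @ w + b)).sum()
print("train accuracy:", np.mean(np.sign(X @ w + b) == y), " objective:", objective)
```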

Practice Quiz

Test your understanding with 10 multiple-choice questions

1. What is the margin in SVM?
2. For a linearly separable dataset, hard-margin SVM solves:
3. What are support vectors?
4. Soft-margin SVM introduces slack variables to:
5. What is the purpose of the kernel trick?
6. The RBF (Gaussian) kernel is defined as:
7. In the dual formulation, the decision function becomes:
8. What happens when the SVM parameter C is very large?
9. For multi-class SVM with K classes, the One-vs-Rest approach trains:
10. Which kernel parameter in the RBF kernel leads to overfitting?