Kernel Functions

The kernel trick: handling non-linear problems elegantly

The Kernel Trick

Core Idea

Through a mapping $\phi: \mathcal{X} \to \mathbb{F}$, the original input space $\mathcal{X}$ is transformed into a feature space $\mathbb{F}$, so that problems that are not linearly separable in the input space become linearly separable in the feature space.

Mathematical Foundation

Kernel Function Definition

$\kappa(x_i, x_j) = \phi(x_i)^T \phi(x_j)$

The kernel function computes the inner product in feature space without explicitly computing the mapping.
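As a quick illustration (a minimal NumPy sketch, not part of the original text), the degree-2 homogeneous polynomial kernel $(x_i^T x_j)^2$ corresponds in two dimensions to the explicit feature map $\phi(x) = (x_1^2, x_2^2, \sqrt{2}\,x_1 x_2)$; both routes give the same inner product, but the kernel never constructs $\phi(x)$:

```python
import numpy as np

def phi(x):
    """Explicit degree-2 feature map for 2-D input x = (x1, x2)."""
    return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2) * x[0] * x[1]])

def poly2_kernel(xi, xj):
    """Degree-2 homogeneous polynomial kernel: (xi^T xj)^2."""
    return np.dot(xi, xj) ** 2

xi = np.array([1.0, 2.0])
xj = np.array([3.0, 0.5])

# Both routes give the same value (16.0), but the kernel never builds phi(x).
print(phi(xi) @ phi(xj))      # explicit mapping
print(poly2_kernel(xi, xj))   # kernel trick
```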

Why This is Powerful

  • No explicit mapping needed: Never compute $\phi(x)$ directly
  • Avoids curse of dimensionality: Feature space can be infinite-dimensional
  • Efficient computation: Kernel evaluation is often simple

Intuitive Example

Imagine trying to separate two classes of points arranged in concentric circles in 2D. No straight line can separate them. But if we map to 3D space by adding a third coordinate $z = x^2 + y^2$, they become separable by a plane! The kernel trick lets us work in this higher-dimensional space without ever explicitly computing the 3D coordinates.
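A minimal sketch of this picture, assuming scikit-learn is available (the dataset parameters are illustrative): the explicit 3-D lift with a linear SVM and an RBF-kernel SVM on the raw 2-D data both separate the two rings.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Two concentric rings: not linearly separable in 2-D.
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

# Explicit lift to 3-D by appending z = x^2 + y^2; a linear SVM now separates them.
Z = np.c_[X, (X ** 2).sum(axis=1)]
print(SVC(kernel="linear").fit(Z, y).score(Z, y))   # close to 1.0

# Kernel trick: RBF SVM on the raw 2-D data, no explicit lift needed.
print(SVC(kernel="rbf").fit(X, y).score(X, y))      # close to 1.0
```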

Common Kernel Functions

Linear Kernel

Formula:
$\kappa(x_i, x_j) = x_i^T x_j$
Parameters:
None
Best for:
Linearly separable data, high-dimensional sparse data (e.g., text)

Polynomial Kernel

Formula:
$\kappa(x_i, x_j) = (x_i^T x_j)^d$
Parameters:
d ≥ 1: polynomial degree
Best for:
Problems requiring feature interactions, image processing

Gaussian (RBF) Kernel

Most Popular
Formula:
$\kappa(x_i, x_j) = \exp\left(-\frac{||x_i - x_j||^2}{2\sigma^2}\right)$
Parameters:
σ > 0: bandwidth parameter
Best for:
Non-linear problems, most commonly used, general purpose

Laplacian Kernel

Formula:
$\kappa(x_i, x_j) = \exp\left(-\frac{||x_i - x_j||}{\sigma}\right)$
Parameters:
σ > 0: scale parameter
Best for:
Lower sensitivity to noise, alternative to RBF

Sigmoid Kernel

Formula:
$\kappa(x_i, x_j) = \tanh(\beta x_i^T x_j + \theta)$
Parameters:
β > 0, θ < 0
Best for:
Mimicking a neural network activation; mainly of theoretical interest (it is not a valid positive semi-definite kernel for all parameter values)
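For concreteness, here is a minimal NumPy sketch of the five kernels above as plain functions (the parameter defaults are illustrative only; the Laplacian uses the unsquared Euclidean distance, matching the formula above):

```python
import numpy as np

def linear_kernel(xi, xj):
    return xi @ xj

def polynomial_kernel(xi, xj, d=3):
    return (xi @ xj) ** d

def gaussian_rbf_kernel(xi, xj, sigma=1.0):
    return np.exp(-np.sum((xi - xj) ** 2) / (2 * sigma ** 2))

def laplacian_kernel(xi, xj, sigma=1.0):
    # Like the RBF kernel, but with the unsquared distance.
    return np.exp(-np.linalg.norm(xi - xj) / sigma)

def sigmoid_kernel(xi, xj, beta=1.0, theta=-1.0):
    return np.tanh(beta * (xi @ xj) + theta)

x1, x2 = np.array([1.0, 2.0]), np.array([0.5, -1.0])
for k in (linear_kernel, polynomial_kernel, gaussian_rbf_kernel,
          laplacian_kernel, sigmoid_kernel):
    print(k.__name__, k(x1, x2))
```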

Kernel Selection Guidelines

Choosing the right kernel is crucial for SVM performance. Here are practical guidelines:

Data Dimensionality

  • High-dimensional data (roughly 1000 or more features): Linear kernel
  • Low-dimensional: RBF or polynomial kernel
  • Sparse features: Linear kernel (e.g., text data)

Data Size

  • Large dataset: Linear or low-degree polynomial
  • Medium dataset: RBF kernel preferred
  • Small dataset: Try multiple kernels via cross-validation

Non-linearity

  • Linear relationships: Linear kernel
  • Moderate non-linearity: Polynomial (d=2 or 3)
  • Strong non-linearity: RBF kernel

Computational Efficiency

  • Fastest: Linear kernel
  • Moderate: RBF and Laplacian
  • Slower: High-degree polynomial

Rule of Thumb

When in doubt, start with the RBF (Gaussian) kernel and tune its bandwidth $\sigma$ (together with the regularization parameter $C$). It is versatile, handles non-linearity well, and corresponds to an infinite-dimensional feature space. If performance or speed is unsatisfactory, try the linear kernel (for high-dimensional data) or the polynomial kernel (for explicit feature interactions).
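A hedged sketch of this workflow with scikit-learn (the dataset and parameter grid are illustrative only; note that scikit-learn parameterizes the RBF kernel by $\gamma = 1/(2\sigma^2)$ rather than by $\sigma$):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

# Tune C and gamma jointly; large sigma corresponds to small gamma.
pipe = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
grid = GridSearchCV(
    pipe,
    param_grid={"svc__C": [0.1, 1, 10, 100],
                "svc__gamma": [1e-3, 1e-2, 1e-1, 1]},
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```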

Mercer's Theorem

Theorem Statement

A symmetric function can be used as a kernel function if and only if its corresponding kernel matrix is positive semi-definite.

In other words: for any finite set of samples $\{x_1, x_2, \ldots, x_m\}$, the kernel matrix $K$ with entries $K_{ij} = \kappa(x_i, x_j)$ must be positive semi-definite.
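A minimal NumPy sketch of this check (the sample data, tolerance, and test kernels are illustrative): build the kernel matrix for a sample set and verify that its eigenvalues are non-negative.

```python
import numpy as np

def is_psd_kernel_matrix(kernel, X, tol=1e-10):
    """Build K_ij = kernel(x_i, x_j) for the samples in X and check that
    all eigenvalues are (numerically) non-negative."""
    m = len(X)
    K = np.array([[kernel(X[i], X[j]) for j in range(m)] for i in range(m)])
    eigvals = np.linalg.eigvalsh(K)  # K is symmetric
    return bool(np.all(eigvals >= -tol))

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))

rbf = lambda xi, xj: np.exp(-np.sum((xi - xj) ** 2) / 2.0)
print(is_psd_kernel_matrix(rbf, X))                    # True: valid kernel
print(is_psd_kernel_matrix(lambda a, b: -(a @ b), X))  # False: not a valid kernel
```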

Why It Matters

  • Provides theoretical foundation for valid kernels
  • Guarantees convex optimization problem
  • Ensures global optimal solution exists

Practical Implications

  • Can design custom kernels for domain-specific problems
  • Kernel composition rules (sum, product of kernels)
  • Mathematical verification of kernel validity

Kernel Composition

Mercer's theorem allows us to create new valid kernels from existing ones:

  • Sum: If $\kappa_1$ and $\kappa_2$ are valid, so is $\kappa_1 + \kappa_2$
  • Product: If $\kappa_1$ and $\kappa_2$ are valid, so is $\kappa_1 \cdot \kappa_2$
  • Scaling: If $\kappa$ is valid and $c > 0$, then $c\kappa$ is valid
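A short scikit-learn sketch of the sum rule (the dataset, $\gamma$ value, and choice of base kernels are illustrative): SVC accepts a callable kernel that returns the Gram matrix between two sample sets, so a composed kernel can be plugged in directly.

```python
from sklearn.datasets import make_moons
from sklearn.metrics.pairwise import linear_kernel, rbf_kernel
from sklearn.svm import SVC

def combined_kernel(A, B):
    """Sum of a linear and an RBF kernel -- valid by the composition rules above."""
    return linear_kernel(A, B) + rbf_kernel(A, B, gamma=0.5)

X, y = make_moons(n_samples=200, noise=0.1, random_state=0)

clf = SVC(kernel=combined_kernel).fit(X, y)
print(clf.score(X, y))
```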

Example: Handwritten Digit Recognition

Scenario: Classify handwritten digits (0-9) from 28×28 pixel grayscale images (784 features).

Linear Kernel

Accuracy: ~88%

Fast but misses non-linear patterns in pixel relationships

Polynomial (d=3)

Accuracy: ~93%

Captures feature interactions but computationally expensive

RBF (σ=5.0)

Accuracy: ~97%

Best balance of accuracy and speed for this task

Result: The RBF kernel with a properly tuned $\sigma$ parameter achieved the best performance. It effectively captured the non-linear patterns in how pixels combine to form digits, while maintaining reasonable training and prediction times. The infinite-dimensional feature space of the RBF kernel provided sufficient expressiveness without explicit feature engineering.
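For readers who want to run a comparison of this kind, the sketch below uses scikit-learn's built-in 8×8 digits dataset as a stand-in for the 28×28 images described above, so the resulting accuracies will not match the figures quoted here:

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Stand-in dataset: 8x8 images (64 features), not the 28x28 images above.
X, y = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

for name, clf in [
    ("linear", SVC(kernel="linear")),
    ("poly d=3", SVC(kernel="poly", degree=3)),
    ("rbf", SVC(kernel="rbf")),
]:
    model = make_pipeline(StandardScaler(), clf).fit(X_tr, y_tr)
    print(name, round(model.score(X_te, y_te), 3))
```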