Kernel Functions

The kernel trick: handling non-linear problems elegantly

The Kernel Trick

Core Idea

Through a mapping $\phi: \mathcal{X} \to \mathbb{F}$, the original input space $\mathcal{X}$ is transformed into a feature space $\mathbb{F}$, so that problems that are not linearly separable in the input space become linearly separable in the feature space.

Mathematical Foundation

Kernel Function Definition

$\kappa(x_i, x_j) = \phi(x_i)^T \phi(x_j)$

The kernel function computes the inner product in feature space without explicitly computing the mapping.
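As a quick illustration (a minimal NumPy sketch, not part of the original text), the degree-2 homogeneous polynomial kernel $(x_i^T x_j)^2$ corresponds in two dimensions to the explicit feature map $\phi(x) = (x_1^2, x_2^2, \sqrt{2}\,x_1 x_2)$; both routes give the same inner product, but the kernel never constructs $\phi(x)$:

```python
import numpy as np

def phi(x):
    """Explicit degree-2 feature map for 2-D input x = (x1, x2)."""
    return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2) * x[0] * x[1]])

def poly2_kernel(xi, xj):
    """Degree-2 homogeneous polynomial kernel: (xi^T xj)^2."""
    return np.dot(xi, xj) ** 2

xi = np.array([1.0, 2.0])
xj = np.array([3.0, 0.5])

# Both routes give the same value (16.0), but the kernel never builds phi(x).
print(phi(xi) @ phi(xj))      # explicit mapping
print(poly2_kernel(xi, xj))   # kernel trick
```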

Why This is Powerful

  • No explicit mapping needed: Never compute $\phi(x)$ directly
  • Avoids curse of dimensionality: Feature space can be infinite-dimensional
  • Efficient computation: Kernel evaluation is often simple

Intuitive Example

Imagine trying to separate two classes of points arranged in concentric circles in 2D. No straight line can separate them. But if we map to 3D space by adding a third coordinate $z = x^2 + y^2$, they become separable by a plane! The kernel trick lets us work in this higher-dimensional space without ever explicitly computing the 3D coordinates.
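A minimal sketch of this picture, assuming scikit-learn is available (the dataset parameters are illustrative): the explicit 3-D lift with a linear SVM and an RBF-kernel SVM on the raw 2-D data both separate the two rings.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Two concentric rings: not linearly separable in 2-D.
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

# Explicit lift to 3-D by appending z = x^2 + y^2; a linear SVM now separates them.
Z = np.c_[X, (X ** 2).sum(axis=1)]
print(SVC(kernel="linear").fit(Z, y).score(Z, y))   # close to 1.0

# Kernel trick: RBF SVM on the raw 2-D data, no explicit lift needed.
print(SVC(kernel="rbf").fit(X, y).score(X, y))      # close to 1.0
```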

Common Kernel Functions

Linear Kernel

Formula:
$\kappa(x_i, x_j) = x_i^T x_j$
Parameters:
None
Best for:
Linearly separable data, high-dimensional sparse data (e.g., text)

Polynomial Kernel

Formula:
$\kappa(x_i, x_j) = (x_i^T x_j)^d$
Parameters:
d ≥ 1: polynomial degree
Best for:
Problems requiring feature interactions, image processing

Gaussian (RBF) Kernel

Most Popular
Formula:
$\kappa(x_i, x_j) = \exp\left(-\frac{||x_i - x_j||^2}{2\sigma^2}\right)$
Parameters:
σ > 0: bandwidth parameter
Best for:
Non-linear problems, most commonly used, general purpose

Laplacian Kernel

Formula:
$\kappa(x_i, x_j) = \exp\left(-\frac{||x_i - x_j||}{\sigma}\right)$
Parameters:
σ > 0: scale parameter
Best for:
Lower sensitivity to noise, alternative to RBF

Sigmoid Kernel

Formula:
$\kappa(x_i, x_j) = \tanh(\beta x_i^T x_j + \theta)$
Parameters:
β > 0, θ < 0
Best for:
Mimicking a neural network activation; mainly of theoretical interest (it is not a valid positive semi-definite kernel for all parameter values)
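For concreteness, here is a minimal NumPy sketch of the five kernels above as plain functions (the parameter defaults are illustrative only; the Laplacian uses the unsquared Euclidean distance, matching the formula above):

```python
import numpy as np

def linear_kernel(xi, xj):
    return xi @ xj

def polynomial_kernel(xi, xj, d=3):
    return (xi @ xj) ** d

def gaussian_rbf_kernel(xi, xj, sigma=1.0):
    return np.exp(-np.sum((xi - xj) ** 2) / (2 * sigma ** 2))

def laplacian_kernel(xi, xj, sigma=1.0):
    # Like the RBF kernel, but with the unsquared distance.
    return np.exp(-np.linalg.norm(xi - xj) / sigma)

def sigmoid_kernel(xi, xj, beta=1.0, theta=-1.0):
    return np.tanh(beta * (xi @ xj) + theta)

x1, x2 = np.array([1.0, 2.0]), np.array([0.5, -1.0])
for k in (linear_kernel, polynomial_kernel, gaussian_rbf_kernel,
          laplacian_kernel, sigmoid_kernel):
    print(k.__name__, k(x1, x2))
```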

Kernel Selection Guidelines

Choosing the right kernel is crucial for SVM performance. Here are practical guidelines:

Data Dimensionality

  • High-dimensional data (roughly 1000 or more features): Linear kernel
  • Low-dimensional: RBF or polynomial kernel
  • Sparse features: Linear kernel (e.g., text data)

Data Size

  • Large dataset: Linear or low-degree polynomial
  • Medium dataset: RBF kernel preferred
  • Small dataset: Try multiple kernels via cross-validation

Non-linearity

  • Linear relationships: Linear kernel
  • Moderate non-linearity: Polynomial (d=2 or 3)
  • Strong non-linearity: RBF kernel

Computational Efficiency

  • Fastest: Linear kernel
  • Moderate: RBF and Laplacian
  • Slower: High-degree polynomial

Rule of Thumb

When in doubt, start with the RBF (Gaussian) kernel and tune its bandwidth $\sigma$ (together with the regularization parameter $C$). It is versatile, handles non-linearity well, and corresponds to an infinite-dimensional feature space. If performance or speed is unsatisfactory, try the linear kernel (for high-dimensional data) or the polynomial kernel (for explicit feature interactions).
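A hedged sketch of this workflow with scikit-learn (the dataset and parameter grid are illustrative only; note that scikit-learn parameterizes the RBF kernel by $\gamma = 1/(2\sigma^2)$ rather than by $\sigma$):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

# Tune C and gamma jointly; large sigma corresponds to small gamma.
pipe = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
grid = GridSearchCV(
    pipe,
    param_grid={"svc__C": [0.1, 1, 10, 100],
                "svc__gamma": [1e-3, 1e-2, 1e-1, 1]},
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```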

Mercer's Theorem

Theorem Statement

A symmetric function can be used as a kernel function if and only if its corresponding kernel matrix is positive semi-definite.

In other words: for any finite set of samples $\{x_1, x_2, \ldots, x_m\}$, the kernel matrix $K$ with entries $K_{ij} = \kappa(x_i, x_j)$ must be positive semi-definite.
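A minimal NumPy sketch of this check (the sample data, tolerance, and test kernels are illustrative): build the kernel matrix for a sample set and verify that its eigenvalues are non-negative.

```python
import numpy as np

def is_psd_kernel_matrix(kernel, X, tol=1e-10):
    """Build K_ij = kernel(x_i, x_j) for the samples in X and check that
    all eigenvalues are (numerically) non-negative."""
    m = len(X)
    K = np.array([[kernel(X[i], X[j]) for j in range(m)] for i in range(m)])
    eigvals = np.linalg.eigvalsh(K)  # K is symmetric
    return bool(np.all(eigvals >= -tol))

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))

rbf = lambda xi, xj: np.exp(-np.sum((xi - xj) ** 2) / 2.0)
print(is_psd_kernel_matrix(rbf, X))                    # True: valid kernel
print(is_psd_kernel_matrix(lambda a, b: -(a @ b), X))  # False: not a valid kernel
```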

Why It Matters

  • Provides theoretical foundation for valid kernels
  • Guarantees convex optimization problem
  • Ensures global optimal solution exists

Practical Implications

  • Can design custom kernels for domain-specific problems
  • Kernel composition rules (sum, product of kernels)
  • Mathematical verification of kernel validity

Kernel Composition

Mercer's theorem allows us to create new valid kernels from existing ones:

  • Sum: If $\kappa_1$ and $\kappa_2$ are valid, so is $\kappa_1 + \kappa_2$
  • Product: If $\kappa_1$ and $\kappa_2$ are valid, so is $\kappa_1 \cdot \kappa_2$
  • Scaling: If $\kappa$ is valid and $c > 0$, then $c\kappa$ is valid
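A short scikit-learn sketch of the sum rule (the dataset, $\gamma$ value, and choice of base kernels are illustrative): SVC accepts a callable kernel that returns the Gram matrix between two sample sets, so a composed kernel can be plugged in directly.

```python
from sklearn.datasets import make_moons
from sklearn.metrics.pairwise import linear_kernel, rbf_kernel
from sklearn.svm import SVC

def combined_kernel(A, B):
    """Sum of a linear and an RBF kernel -- valid by the composition rules above."""
    return linear_kernel(A, B) + rbf_kernel(A, B, gamma=0.5)

X, y = make_moons(n_samples=200, noise=0.1, random_state=0)

clf = SVC(kernel=combined_kernel).fit(X, y)
print(clf.score(X, y))
```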

Example: Handwritten Digit Recognition

Scenario: Classify handwritten digits (0-9) from 28×28 pixel grayscale images (784 features).

Linear Kernel

Accuracy: ~88%

Fast but misses non-linear patterns in pixel relationships

Polynomial (d=3)

Accuracy: ~93%

Captures feature interactions but computationally expensive

RBF (σ=5.0)

Accuracy: ~97%

Best balance of accuracy and speed for this task

Result: The RBF kernel with a properly tuned $\sigma$ parameter achieved the best performance. It effectively captured the non-linear patterns in how pixels combine to form digits, while maintaining reasonable training and prediction times. The infinite-dimensional feature space of the RBF kernel provided sufficient expressiveness without explicit feature engineering.
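For readers who want to run a comparison of this kind, the sketch below uses scikit-learn's built-in 8×8 digits dataset as a stand-in for the 28×28 images described above, so the resulting accuracies will not match the figures quoted here:

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Stand-in dataset: 8x8 images (64 features), not the 28x28 images above.
X, y = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

for name, clf in [
    ("linear", SVC(kernel="linear")),
    ("poly d=3", SVC(kernel="poly", degree=3)),
    ("rbf", SVC(kernel="rbf")),
]:
    model = make_pipeline(StandardScaler(), clf).fit(X_tr, y_tr)
    print(name, round(model.score(X_te, y_te), 3))
```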