Through a mapping $\phi: \mathcal{X} \rightarrow \mathcal{H}$, the original input space is transformed into a higher-dimensional feature space, so that problems that are linearly non-separable in the input space become linearly separable in the feature space.
The kernel function $\kappa(\boldsymbol{x}_i, \boldsymbol{x}_j) = \langle \phi(\boldsymbol{x}_i), \phi(\boldsymbol{x}_j) \rangle$ computes the inner product in feature space without explicitly computing the mapping $\phi$.
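To see this concretely, here is a minimal sketch in Python (the names `phi` and `poly_kernel` are illustrative, not from any library) verifying that the degree-2 polynomial kernel $(\boldsymbol{x}^\top \boldsymbol{z})^2$ equals the inner product under an explicit feature map:

```python
import numpy as np

# Explicit degree-2 feature map for 2-D inputs:
# phi(x) = (x1^2, sqrt(2)*x1*x2, x2^2)
def phi(x):
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

# Homogeneous polynomial kernel of degree 2: kappa(x, z) = (x . z)^2
def poly_kernel(x, z):
    return np.dot(x, z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, 0.5])

print(np.dot(phi(x), phi(z)))  # inner product in feature space: 16.0
print(poly_kernel(x, z))       # same value, without ever computing phi
```

The kernel evaluation touches only the original 2-D vectors, yet produces exactly the inner product of the 3-D feature vectors.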
Imagine trying to separate two classes of points arranged in concentric circles in 2D. No straight line can separate them. But if we map to 3D space by adding a third coordinate $z = x_1^2 + x_2^2$, they become separable by a plane! The kernel trick lets us work in this higher-dimensional space without ever explicitly computing the 3D coordinates.
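A minimal sketch of this idea, assuming scikit-learn and NumPy are available (`make_circles` stands in for the concentric-circles data), shows a linear classifier failing in 2D but succeeding once the extra coordinate is added:

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Two classes arranged in concentric circles: not linearly separable in 2-D.
X, y = make_circles(n_samples=400, factor=0.4, noise=0.05, random_state=0)

# Add a third coordinate z = x1^2 + x2^2; the inner circle now sits below
# the outer one along the new axis, so a plane can separate the classes.
X_3d = np.column_stack([X, X[:, 0] ** 2 + X[:, 1] ** 2])

print(SVC(kernel="linear").fit(X, y).score(X, y))        # roughly 0.5 in 2-D
print(SVC(kernel="linear").fit(X_3d, y).score(X_3d, y))  # close to 1.0 in 3-D
```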
Choosing the right kernel is crucial for SVM performance. Here are practical guidelines:
When in doubt, start with the RBF (Gaussian) kernel and tune its parameters ($C$ and $\gamma$). It is versatile, handles non-linearity well, and corresponds to an infinite-dimensional feature space. If performance or speed is unsatisfactory, then try a linear kernel (for high-dimensional data) or a polynomial kernel (for explicit feature interactions); a tuning sketch follows below.
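One possible way to carry out that tuning is a grid search with cross-validation; the sketch below uses scikit-learn's `GridSearchCV`, and the dataset and grid values are illustrative choices, not prescriptions:

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)

# Scale features first: RBF kernels are sensitive to feature scales.
pipe = make_pipeline(StandardScaler(), SVC(kernel="rbf"))

# Tune C (regularization strength) and gamma (RBF width) on a small log grid.
param_grid = {
    "svc__C": [0.1, 1, 10, 100],
    "svc__gamma": ["scale", 0.001, 0.01, 0.1],
}
search = GridSearchCV(pipe, param_grid, cv=5, n_jobs=-1)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```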
A symmetric function $\kappa(\cdot, \cdot)$ can be used as a kernel function if and only if its corresponding kernel matrix is positive semi-definite.
In other words: for any finite set of samples $D = \{\boldsymbol{x}_1, \boldsymbol{x}_2, \ldots, \boldsymbol{x}_m\}$, the kernel matrix $\mathbf{K} \in \mathbb{R}^{m \times m}$, where $K_{ij} = \kappa(\boldsymbol{x}_i, \boldsymbol{x}_j)$, must be positive semi-definite.
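A quick numerical sanity check of this condition, sketched with NumPy (the RBF kernel and the random sample set are just examples): build the kernel matrix for a sample set and confirm it is symmetric with non-negative eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))  # 50 samples, 5 features

# Candidate kernel: RBF, kappa(x, z) = exp(-gamma * ||x - z||^2)
gamma = 0.5
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-gamma * sq_dists)

# Symmetric and all eigenvalues >= 0 (up to numerical error) => PSD,
# so kappa satisfies the kernel condition on this sample set.
eigvals = np.linalg.eigvalsh(K)
print(np.allclose(K, K.T), eigvals.min() >= -1e-10)
```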
Mercer's theorem also allows us to create new valid kernels from existing ones: if $\kappa_1$ and $\kappa_2$ are valid kernels, then any non-negative combination $\gamma_1 \kappa_1 + \gamma_2 \kappa_2$ (with $\gamma_1, \gamma_2 \geq 0$), the product $\kappa_1(\boldsymbol{x}, \boldsymbol{z})\,\kappa_2(\boldsymbol{x}, \boldsymbol{z})$, and $g(\boldsymbol{x})\,\kappa_1(\boldsymbol{x}, \boldsymbol{z})\,g(\boldsymbol{z})$ for any function $g(\cdot)$ are also valid kernels.
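As one illustration, scikit-learn's `SVC` accepts a callable kernel, so a kernel combined in this way can be plugged in directly; the 0.5 weights, the `combined_kernel` name, and the dataset below are illustrative choices:

```python
from sklearn.datasets import load_iris
from sklearn.metrics.pairwise import linear_kernel, rbf_kernel
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# A non-negative combination of two valid kernels is itself a valid kernel.
def combined_kernel(A, B):
    return 0.5 * linear_kernel(A, B) + 0.5 * rbf_kernel(A, B)

# SVC accepts a callable that returns the Gram matrix between two sample sets.
clf = SVC(kernel=combined_kernel)
print(cross_val_score(clf, X, y, cv=5).mean())
```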
Scenario: Classify handwritten digits (0-9) from 28×28 pixel grayscale images (784 features).
| Kernel | Accuracy | Notes |
| --- | --- | --- |
| Linear | ~88% | Fast, but misses non-linear patterns in pixel relationships |
| Polynomial | ~93% | Captures feature interactions, but computationally expensive |
| RBF (Gaussian) | ~97% | Best balance of accuracy and speed for this task |
Result: The RBF kernel with a properly tuned $\gamma$ parameter achieved the best performance. It effectively captured the non-linear patterns in how pixels combine to form digits, while maintaining reasonable training and prediction times. The infinite-dimensional feature space of the RBF kernel provided sufficient expressiveness without explicit feature engineering.
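A minimal sketch of such a comparison, using scikit-learn's built-in 8×8 digits dataset as a small stand-in for the 28×28 images (so the exact accuracies will differ from those above):

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# 8x8 digits (64 features) as a small stand-in for 28x28 MNIST images.
X, y = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0, stratify=y)

for name, clf in [
    ("linear", SVC(kernel="linear", C=1)),
    ("poly",   SVC(kernel="poly", degree=3, C=1)),
    ("rbf",    SVC(kernel="rbf", C=10, gamma="scale")),
]:
    model = make_pipeline(StandardScaler(), clf).fit(X_tr, y_tr)
    print(name, round(model.score(X_te, y_te), 3))
```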