
Support Vector Regression

Applying SVM principles to continuous prediction tasks

SVR vs Classification SVM

While classification SVM finds a decision boundary to separate classes, SVR finds a function that fits data points within a tolerance margin.

| Aspect | Classification SVM | Regression (SVR) |
| --- | --- | --- |
| Goal | Find the maximum-margin separating hyperplane | Find a function whose $\epsilon$-tube contains most samples |
| Output | Discrete class labels | Continuous real values |
| Margin | Distance between classes | Width of the $\epsilon$-tube around the regression function |
| Loss function | Hinge loss | $\epsilon$-insensitive loss |
| Support vectors | Points on or violating the margin | Points on or outside the $\epsilon$-tube |

Key Insight

In classification, we want to push samples apart. In regression, we want to fit samples within a tube while keeping the function as flat (simple) as possible.

Epsilon-Insensitive Loss

The $\epsilon$-insensitive loss allows prediction errors within $\epsilon$ to incur zero loss, creating a "tube" around the regression function.

Mathematical Definition

$$\ell_\epsilon(z) = \begin{cases} 0 & \text{if } |z| \leq \epsilon \\ |z| - \epsilon & \text{otherwise} \end{cases}$$

When $|f(x) - y| \leq \epsilon$

Loss $= 0$
The prediction is "good enough"; no penalty.

When $|f(x) - y| > \epsilon$

Loss $= |f(x) - y| - \epsilon$
Linear penalty beyond the tube.
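
As a quick illustration, here is a minimal NumPy sketch of this piecewise loss (the function name and toy numbers are made up for the example):

```python
import numpy as np

def epsilon_insensitive_loss(y_true, y_pred, epsilon=0.1):
    """Zero loss inside the epsilon-tube, linear loss beyond it."""
    residual = np.abs(y_pred - y_true)
    return np.maximum(0.0, residual - epsilon)

y_true = np.array([1.00, 2.00, 3.00])
y_pred = np.array([1.05, 1.92, 3.30])   # errors 0.05 and 0.08 fall inside the tube; 0.30 does not
print(epsilon_insensitive_loss(y_true, y_pred, epsilon=0.1))   # approximately [0. 0. 0.2]
```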

Why This Loss Function?

  • Robustness: Tolerates small errors, making SVR robust to noise
  • Sparsity: Only samples outside the tube become support vectors
  • Simplicity: Encourages flat (simple) functions by not penalizing small deviations
  • Interpretability: The $\epsilon$ parameter has a clear meaning: the acceptable error tolerance

SVR Formulation

Primal Problem with Two Slack Variables

Objective Function

$$\min_{w,b,\xi,\hat{\xi}} \; \frac{1}{2}\|w\|^2 + C\sum_{i=1}^m \left(\xi_i + \hat{\xi}_i\right)$$

Constraints (Upper and Lower Bounds)

$$w^T\phi(x_i) + b - y_i \leq \epsilon + \xi_i \quad \text{(prediction too far above the target)}$$
$$y_i - w^T\phi(x_i) - b \leq \epsilon + \hat{\xi}_i \quad \text{(prediction too far below the target)}$$
$$\xi_i,\ \hat{\xi}_i \geq 0$$

Why two slack variables? Because predictions can deviate above or below the target: $\xi_i$ measures how far the prediction overshoots the upper tube boundary, and $\hat{\xi}_i$ how far it undershoots the lower one.
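
To make the optimization concrete, here is a minimal sketch of this primal problem for a linear SVR (no kernel map), written with cvxpy on made-up toy data; `eps`, `xi`, and `xi_hat` mirror the symbols $\epsilon$, $\xi_i$, $\hat{\xi}_i$ above.

```python
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(0)
m, d = 50, 2
X = rng.normal(size=(m, d))
y = X @ np.array([1.5, -0.7]) + 0.3 + rng.normal(scale=0.05, size=m)

C, eps = 1.0, 0.1
w, b = cp.Variable(d), cp.Variable()
xi, xi_hat = cp.Variable(m, nonneg=True), cp.Variable(m, nonneg=True)

pred = X @ w + b
objective = cp.Minimize(0.5 * cp.sum_squares(w) + C * cp.sum(xi + xi_hat))
constraints = [pred - y <= eps + xi,       # prediction too far above the target
               y - pred <= eps + xi_hat]   # prediction too far below the target
cp.Problem(objective, constraints).solve()

print("w =", np.round(w.value, 3), " b =", round(float(b.value), 3))
```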

Dual Problem & Prediction

After solving the dual problem:

$$f(x) = \sum_{i=1}^m \left(\hat{\alpha}_i - \alpha_i\right)\kappa(x_i, x) + b$$

Lagrange multipliers:

$\alpha_i$ is attached to the first constraint (prediction above the tube), $\hat{\alpha}_i$ to the second (prediction below the tube)

Kernel trick:

Works just like in classification SVM; $\kappa(x_i, x)$ can be any valid kernel
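
Where the $(\hat{\alpha}_i - \alpha_i)$ coefficients come from (a brief sketch of the standard Lagrangian derivation, using the constraint-to-multiplier pairing above): setting the derivative of the Lagrangian with respect to $w$ to zero gives

$$\frac{\partial L}{\partial w} = 0 \;\Rightarrow\; w = \sum_{i=1}^m \left(\hat{\alpha}_i - \alpha_i\right)\phi(x_i),$$

and substituting this back into $f(x) = w^T\phi(x) + b$ with $\kappa(x_i, x) = \phi(x_i)^T\phi(x)$ recovers the prediction formula above.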

Support Vectors in Regression

In SVR, support vectors are samples that either lie on the boundary of the epsilon-tube or outside it.

Boundary Support Vectors

  • $0 < |\hat{\alpha}_i - \alpha_i| < C$
  • Prediction error $= \epsilon$
  • Lie exactly on the tube boundary
  • Help determine $b$

Error Support Vectors

  • $|\hat{\alpha}_i - \alpha_i| = C$
  • Prediction error $> \epsilon$
  • Lie outside the tube
  • Strongly influence the solution

Inside the Tube

Samples strictly inside the tube (prediction error $< \epsilon$) have $\alpha_i = \hat{\alpha}_i = 0$ and do not contribute to the final model. This creates the desired sparse representation: the regression function depends only on the "difficult" samples on or outside the tube boundary.
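
These categories can be inspected on a fitted model. The sketch below assumes scikit-learn's SVR, whose `dual_coef_` attribute holds the signed dual coefficient of each support vector (its magnitude is $|\hat{\alpha}_i - \alpha_i|$ and is bounded by $C$); the toy data is hypothetical.

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 5, size=(80, 1)), axis=0)
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=80)

C, eps = 1.0, 0.15
svr = SVR(kernel="rbf", C=C, epsilon=eps).fit(X, y)

coef = np.abs(svr.dual_coef_.ravel())        # |alpha_hat_i - alpha_i| for each support vector
n_sv = len(svr.support_)
n_error = int(np.sum(np.isclose(coef, C)))   # at the box bound C: error SVs, outside the tube
n_boundary = n_sv - n_error                  # strictly between 0 and C: on the tube boundary
print(f"{n_sv}/{len(X)} samples are support vectors: "
      f"{n_boundary} on the boundary, {n_error} outside the tube")
```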

Advantages of SVR

1. Sparsity

Only support vectors (samples on or outside the epsilon-tube) affect the final model, leading to efficient predictions

2. Robustness

Epsilon-insensitive loss makes SVR less sensitive to outliers and noise than squared error loss (see the sketch after this list)

3. Kernel Capability

Can handle non-linear regression through kernel trick without explicit feature transformation

4. Theoretical Guarantees

Based on statistical learning theory with provable generalization bounds
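
To illustrate point 2, here is a small comparison sketch on synthetic data (all names and values are made up): an ordinary least-squares fit and a linear-kernel SVR on the same data after a few large outliers are injected. The comments describe the typical outcome, not a guaranteed result.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR

rng = np.random.default_rng(1)
X = np.linspace(0, 10, 60).reshape(-1, 1)
y = 2.0 * X.ravel() + 1.0 + rng.normal(scale=0.2, size=60)   # true slope is 2.0
y[-5:] += 25.0                                                # a few large outliers at high x

ols = LinearRegression().fit(X, y)
svr = SVR(kernel="linear", C=1.0, epsilon=0.2).fit(X, y)

print(f"OLS slope: {ols.coef_[0]:.2f}")      # squared loss: typically pulled well above 2 by the outliers
print(f"SVR slope: {svr.coef_[0, 0]:.2f}")   # linear, capped penalty: typically stays close to 2
```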

Example: California Housing Price Prediction

Scenario: Predict median house prices in California districts based on location, demographics, and housing characteristics.

Dataset

  • 20,640 districts in California
  • 8 features: latitude, longitude, median age, total rooms, population, median income, etc.
  • Target: median house value

SVR Configuration

  • Kernel: RBF ($\sigma = 2.0$)
  • $\epsilon = 0.1$ (about $10,000 of tolerance, since the target is measured in units of $100,000)
  • $C = 1.0$

Results

  • Support vectors: 2,456 (12%)
  • Test RMSE: $48,200
  • R² score: 0.78

Result: SVR with RBF kernel captured non-linear relationships between location and price. Only 12% of districts became support vectors—these were areas with unusual price patterns (e.g., luxury neighborhoods, areas with recent development). The sparse model predicted efficiently while remaining robust to outliers like extremely expensive coastal properties.
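
The following is a rough sketch of how such an experiment could be set up with scikit-learn's built-in copy of this dataset. The numbers reported above will not be reproduced exactly (they depend on the split, preprocessing, and kernel-width convention), and mapping the text's $\sigma = 2.0$ to scikit-learn's `gamma` is an assumption.

```python
import numpy as np
from sklearn.datasets import fetch_california_housing
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

X, y = fetch_california_housing(return_X_y=True)   # target = median house value in $100k units
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# gamma = 1 / (2 * sigma^2) assumes the RBF form exp(-||x - x'||^2 / (2 * sigma^2)) with sigma = 2.0
model = make_pipeline(
    StandardScaler(),
    SVR(kernel="rbf", C=1.0, epsilon=0.1, gamma=1.0 / (2 * 2.0 ** 2)),
)
model.fit(X_train, y_train)                          # ~16k training samples: may take a few minutes

y_pred = model.predict(X_test)
rmse_dollars = np.sqrt(mean_squared_error(y_test, y_pred)) * 100_000
n_sv = len(model.named_steps["svr"].support_)
print(f"Support vectors: {n_sv} ({n_sv / len(X_train):.0%})")
print(f"Test RMSE: ${rmse_dollars:,.0f}   R^2: {r2_score(y_test, y_pred):.2f}")
```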