
Support Vector Regression

Applying SVM principles to continuous prediction tasks

SVR vs Classification SVM

While classification SVM finds a decision boundary to separate classes, SVR finds a function that fits data points within a tolerance margin.

| Aspect | Classification SVM | Regression (SVR) |
| --- | --- | --- |
| Goal | Find the maximum-margin separating hyperplane | Find a function whose $\epsilon$-tube contains most samples |
| Output | Discrete class labels | Continuous real values |
| Margin | Distance between classes | Width of the $\epsilon$-tube around the regression function |
| Loss function | Hinge loss | $\epsilon$-insensitive loss |
| Support vectors | Points on or violating the margin | Points on or outside the $\epsilon$-tube |

Key Insight

In classification, we want to push samples apart. In regression, we want to fit samples within a tube while keeping the function as flat (simple) as possible.

Epsilon-Insensitive Loss

The $\epsilon$-insensitive loss allows prediction errors within $\epsilon$ to incur zero loss, creating a "tube" around the regression function.

Mathematical Definition

$$\ell_\epsilon(z) = \begin{cases} 0 & \text{if } |z| \leq \epsilon \\ |z| - \epsilon & \text{otherwise} \end{cases}$$

When $|f(x) - y| \leq \epsilon$

Loss $= 0$
The prediction is "good enough"; no penalty.

When $|f(x) - y| > \epsilon$

Loss $= |f(x) - y| - \epsilon$
Linear penalty beyond the tube.
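
As a quick illustration, here is a minimal NumPy sketch of this piecewise loss (the function name and toy numbers are made up for the example):

```python
import numpy as np

def epsilon_insensitive_loss(y_true, y_pred, epsilon=0.1):
    """Zero loss inside the epsilon-tube, linear loss beyond it."""
    residual = np.abs(y_pred - y_true)
    return np.maximum(0.0, residual - epsilon)

y_true = np.array([1.00, 2.00, 3.00])
y_pred = np.array([1.05, 1.92, 3.30])   # errors 0.05 and 0.08 fall inside the tube; 0.30 does not
print(epsilon_insensitive_loss(y_true, y_pred, epsilon=0.1))   # approximately [0. 0. 0.2]
```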

Why This Loss Function?

  • Robustness: Tolerates small errors, making SVR robust to noise
  • Sparsity: Only samples outside the tube become support vectors
  • Simplicity: Encourages flat (simple) functions by not penalizing small deviations
  • Interpretability: The $\epsilon$ parameter has a clear meaning: the acceptable error tolerance

SVR Formulation

Primal Problem with Two Slack Variables

Objective Function

$$\min_{w,b,\xi,\hat{\xi}} \; \frac{1}{2}\|w\|^2 + C\sum_{i=1}^m \left(\xi_i + \hat{\xi}_i\right)$$

Constraints (Upper and Lower Bounds)

$$w^T\phi(x_i) + b - y_i \leq \epsilon + \xi_i \quad \text{(prediction too far above the target)}$$
$$y_i - w^T\phi(x_i) - b \leq \epsilon + \hat{\xi}_i \quad \text{(prediction too far below the target)}$$
$$\xi_i,\ \hat{\xi}_i \geq 0$$

Why two slack variables? Because predictions can deviate above or below the target: $\xi_i$ measures how far the prediction overshoots the upper tube boundary, and $\hat{\xi}_i$ how far it undershoots the lower one.
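
To make the optimization concrete, here is a minimal sketch of this primal problem for a linear SVR (no kernel map), written with cvxpy on made-up toy data; `eps`, `xi`, and `xi_hat` mirror the symbols $\epsilon$, $\xi_i$, $\hat{\xi}_i$ above.

```python
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(0)
m, d = 50, 2
X = rng.normal(size=(m, d))
y = X @ np.array([1.5, -0.7]) + 0.3 + rng.normal(scale=0.05, size=m)

C, eps = 1.0, 0.1
w, b = cp.Variable(d), cp.Variable()
xi, xi_hat = cp.Variable(m, nonneg=True), cp.Variable(m, nonneg=True)

pred = X @ w + b
objective = cp.Minimize(0.5 * cp.sum_squares(w) + C * cp.sum(xi + xi_hat))
constraints = [pred - y <= eps + xi,       # prediction too far above the target
               y - pred <= eps + xi_hat]   # prediction too far below the target
cp.Problem(objective, constraints).solve()

print("w =", np.round(w.value, 3), " b =", round(float(b.value), 3))
```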

Dual Problem & Prediction

After solving the dual problem:

$$f(x) = \sum_{i=1}^m \left(\hat{\alpha}_i - \alpha_i\right)\kappa(x_i, x) + b$$

Lagrange multipliers:

$\alpha_i$ is attached to the first constraint (prediction above the tube), $\hat{\alpha}_i$ to the second (prediction below the tube)

Kernel trick:

Works just like in classification SVM; $\kappa(x_i, x)$ can be any valid kernel
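
Where the $(\hat{\alpha}_i - \alpha_i)$ coefficients come from (a brief sketch of the standard Lagrangian derivation, using the constraint-to-multiplier pairing above): setting the derivative of the Lagrangian with respect to $w$ to zero gives

$$\frac{\partial L}{\partial w} = 0 \;\Rightarrow\; w = \sum_{i=1}^m \left(\hat{\alpha}_i - \alpha_i\right)\phi(x_i),$$

and substituting this back into $f(x) = w^T\phi(x) + b$ with $\kappa(x_i, x) = \phi(x_i)^T\phi(x)$ recovers the prediction formula above.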

Support Vectors in Regression

In SVR, support vectors are samples that either lie on the boundary of the epsilon-tube or outside it.

Boundary Support Vectors

  • $0 < |\hat{\alpha}_i - \alpha_i| < C$
  • Prediction error $= \epsilon$
  • Lie exactly on the tube boundary
  • Help determine $b$

Error Support Vectors

  • $|\hat{\alpha}_i - \alpha_i| = C$
  • Prediction error $> \epsilon$
  • Lie outside the tube
  • Strongly influence the solution

Inside the Tube

Samples strictly inside the tube (prediction error $< \epsilon$) have $\alpha_i = \hat{\alpha}_i = 0$ and do not contribute to the final model. This creates the desired sparse representation: the regression function depends only on the "difficult" samples on or outside the tube boundary.
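
These categories can be inspected on a fitted model. The sketch below assumes scikit-learn's SVR, whose `dual_coef_` attribute holds the signed dual coefficient of each support vector (its magnitude is $|\hat{\alpha}_i - \alpha_i|$ and is bounded by $C$); the toy data is hypothetical.

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 5, size=(80, 1)), axis=0)
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=80)

C, eps = 1.0, 0.15
svr = SVR(kernel="rbf", C=C, epsilon=eps).fit(X, y)

coef = np.abs(svr.dual_coef_.ravel())        # |alpha_hat_i - alpha_i| for each support vector
n_sv = len(svr.support_)
n_error = int(np.sum(np.isclose(coef, C)))   # at the box bound C: error SVs, outside the tube
n_boundary = n_sv - n_error                  # strictly between 0 and C: on the tube boundary
print(f"{n_sv}/{len(X)} samples are support vectors: "
      f"{n_boundary} on the boundary, {n_error} outside the tube")
```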

Advantages of SVR

1. Sparsity

Only support vectors (samples on or outside the epsilon-tube) affect the final model, leading to efficient predictions

2. Robustness

Epsilon-insensitive loss makes SVR less sensitive to outliers and noise than squared error loss (see the sketch after this list)

3. Kernel Capability

Can handle non-linear regression through kernel trick without explicit feature transformation

4. Theoretical Guarantees

Based on statistical learning theory with provable generalization bounds
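
To illustrate point 2, here is a small comparison sketch on synthetic data (all names and values are made up): an ordinary least-squares fit and a linear-kernel SVR on the same data after a few large outliers are injected. The comments describe the typical outcome, not a guaranteed result.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR

rng = np.random.default_rng(1)
X = np.linspace(0, 10, 60).reshape(-1, 1)
y = 2.0 * X.ravel() + 1.0 + rng.normal(scale=0.2, size=60)   # true slope is 2.0
y[-5:] += 25.0                                                # a few large outliers at high x

ols = LinearRegression().fit(X, y)
svr = SVR(kernel="linear", C=1.0, epsilon=0.2).fit(X, y)

print(f"OLS slope: {ols.coef_[0]:.2f}")      # squared loss: typically pulled well above 2 by the outliers
print(f"SVR slope: {svr.coef_[0, 0]:.2f}")   # linear, capped penalty: typically stays close to 2
```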

Example: California Housing Price Prediction

Scenario: Predict median house prices in California districts based on location, demographics, and housing characteristics.

Dataset

  • 20,640 districts in California
  • 8 features: latitude, longitude, median age, total rooms, population, median income, etc.
  • Target: median house value

SVR Configuration

  • Kernel: RBF ($\sigma = 2.0$)
  • $\epsilon = 0.1$ (about $10,000 of tolerance, since the target is measured in units of $100,000)
  • $C = 1.0$

Results

  • Support vectors: 2,456 (12%)
  • Test RMSE: $48,200
  • R² score: 0.78

Result: SVR with RBF kernel captured non-linear relationships between location and price. Only 12% of districts became support vectors—these were areas with unusual price patterns (e.g., luxury neighborhoods, areas with recent development). The sparse model predicted efficiently while remaining robust to outliers like extremely expensive coastal properties.
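
The following is a rough sketch of how such an experiment could be set up with scikit-learn's built-in copy of this dataset. The numbers reported above will not be reproduced exactly (they depend on the split, preprocessing, and kernel-width convention), and mapping the text's $\sigma = 2.0$ to scikit-learn's `gamma` is an assumption.

```python
import numpy as np
from sklearn.datasets import fetch_california_housing
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

X, y = fetch_california_housing(return_X_y=True)   # target = median house value in $100k units
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# gamma = 1 / (2 * sigma^2) assumes the RBF form exp(-||x - x'||^2 / (2 * sigma^2)) with sigma = 2.0
model = make_pipeline(
    StandardScaler(),
    SVR(kernel="rbf", C=1.0, epsilon=0.1, gamma=1.0 / (2 * 2.0 ** 2)),
)
model.fit(X_train, y_train)                          # ~16k training samples: may take a few minutes

y_pred = model.predict(X_test)
rmse_dollars = np.sqrt(mean_squared_error(y_test, y_pred)) * 100_000
n_sv = len(model.named_steps["svr"].support_)
print(f"Support vectors: {n_sv} ({n_sv / len(X_train):.0%})")
print(f"Test RMSE: ${rmse_dollars:,.0f}   R^2: {r2_score(y_test, y_pred):.2f}")
```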