Linear Regression - Examples & Exercises

Master linear regression with concrete examples


📊 Example 1: Least Squares Method

Data: Study Hours vs Exam Score

Student   Hours Studied (x)   Exam Score (y)
1         1                   50
2         2                   55
3         3                   65
4         4                   70
5         5                   80

Task: Fit a line y = wx + b using least squares, then predict the exam score for 6 hours of study.

Solution

Step 1: Calculate means

x̄ = (1+2+3+4+5)/5 = 3

ȳ = (50+55+65+70+80)/5 = 64

Step 2: Calculate slope w

w = Σ(xᵢ - x̄)(yᵢ - ȳ) / Σ(xᵢ - x̄)²

Numerator:

(1-3)(50-64) + (2-3)(55-64) + (3-3)(65-64) + (4-3)(70-64) + (5-3)(80-64)

= (-2)(-14) + (-1)(-9) + (0)(1) + (1)(6) + (2)(16)

= 28 + 9 + 0 + 6 + 32 = 75

Denominator:

(1-3)² + (2-3)² + (3-3)² + (4-3)² + (5-3)²

= 4 + 1 + 0 + 1 + 4 = 10

w = 75/10 = 7.5

Step 3: Calculate intercept b

b = ȳ - w·x̄

b = 64 - 7.5×3 = 64 - 22.5 = 41.5

Step 4: Final model and prediction

y = 7.5x + 41.5

For x = 6 hours:

y = 7.5×6 + 41.5 = 45 + 41.5 = 86.5 points

Step 5: Calculate R²

Predictions: ŷ = [49, 56.5, 64, 71.5, 79]

SS_res = Σ(yᵢ - ŷᵢ)² = (50-49)² + (55-56.5)² + (65-64)² + (70-71.5)² + (80-79)² = 7.5

SS_tot = Σ(yᵢ - ȳ)² = (50-64)² + (55-64)² + ... + (80-64)² = 570

R² = 1 - 7.5/570 ≈ 0.987

Excellent fit! Model explains 98.7% of variance.
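
The whole calculation can be reproduced in a few lines of NumPy. A minimal sketch (variable names are illustrative; NumPy is the only dependency assumed):

    import numpy as np

    x = np.array([1, 2, 3, 4, 5], dtype=float)       # hours studied
    y = np.array([50, 55, 65, 70, 80], dtype=float)  # exam scores

    # least-squares slope and intercept
    w = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    b = y.mean() - w * x.mean()
    print(w, b)          # 7.5 41.5

    # prediction for 6 hours of study
    print(w * 6 + b)     # 86.5

    # coefficient of determination R²
    y_hat = w * x + b
    ss_res = np.sum((y - y_hat) ** 2)      # 7.5
    ss_tot = np.sum((y - y.mean()) ** 2)   # 570.0
    print(1 - ss_res / ss_tot)             # ≈ 0.987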

📈 Example 2: Multiple Linear Regression

Data: Student Performance Prediction

Predict final grade using three factors:

Student   Study Hours/week   Attendance %   Homework %   Final Grade
1         5                  80             70           65
2         10                 95             90           85
3         8                  90             85           80
4         3                  70             60           55
5         12                 98             95           92
6         7                  85             80           75

Matrix Solution

Model: y = w₀ + w₁·x₁ + w₂·x₂ + w₃·x₃

Normal Equation: w = (XᵀX)⁻¹Xᵀy

Design Matrix X:

    ⎡ 1   5  80  70 ⎤
    ⎢ 1  10  95  90 ⎥
X = ⎢ 1   8  90  85 ⎥
    ⎢ 1   3  70  60 ⎥
    ⎢ 1  12  98  95 ⎥
    ⎣ 1   7  85  80 ⎦

Target Vector y:

y = [65, 85, 80, 55, 92, 75]ᵀ

Learned Coefficients:

w₀ (intercept) = 5.2

w₁ (study hours) = 2.1

w₂ (attendance) = 0.4

w₃ (homework) = 0.3

Interpretation (holding the other features fixed):

  • Each additional study hour increases the predicted grade by 2.1 points
  • Each 1% increase in attendance increases the predicted grade by 0.4 points
  • Each 1% increase in homework completion increases the predicted grade by 0.3 points
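
A minimal NumPy sketch of the same solve. np.linalg.lstsq is used instead of forming (XᵀX)⁻¹ explicitly, which is numerically safer; because the three features are strongly correlated in this tiny sample, the coefficients it returns may differ somewhat from the rounded values quoted above:

    import numpy as np

    # design matrix: bias column, study hours/week, attendance %, homework %
    X = np.array([
        [1,  5, 80, 70],
        [1, 10, 95, 90],
        [1,  8, 90, 85],
        [1,  3, 70, 60],
        [1, 12, 98, 95],
        [1,  7, 85, 80],
    ], dtype=float)
    y = np.array([65, 85, 80, 55, 92, 75], dtype=float)

    # normal equation w = (XᵀX)⁻¹Xᵀy, solved as a least-squares problem
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    print(w)        # [intercept, study hours, attendance, homework]
    print(X @ w)    # fitted final grades for the six students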

🎯 Example 3: Ridge Regression (L2 Regularization)

Problem: High-dimensional Data

Dataset: 50 samples, 100 features (p > n situation)

Ordinary Least Squares: Overfits severely

Training R² = 0.99, Test R² = 0.12 ❌

Ridge Solution

Objective Function:

min ||y - Xw||² + λ||w||²

Solution: w = (XᵀX + λI)⁻¹Xᵀy

Results for different λ:

λ      Training R²   Test R²   ||w||
0      0.99          0.12      156.3
0.1    0.92          0.68      42.1
1.0    0.88          0.82      18.5
10     0.75          0.74      5.2
100    0.65          0.63      1.8

Optimal λ = 1.0: Best test performance (82%)
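
The 50-sample, 100-feature dataset is only described above, not listed, so the sketch below uses synthetic data of the same shape; it applies the closed-form ridge solution w = (XᵀX + λI)⁻¹Xᵀy for several λ and reports training/test R² and ||w|| (the exact numbers will differ from the illustrative table above):

    import numpy as np

    rng = np.random.default_rng(0)
    n, p = 50, 100                            # p > n: plain OLS overfits severely
    X = rng.normal(size=(n, p))
    true_w = np.zeros(p)
    true_w[:5] = [3.0, -2.0, 1.5, 1.0, -1.0]  # only a few informative features
    y = X @ true_w + rng.normal(scale=1.0, size=n)

    X_tr, X_te, y_tr, y_te = X[:35], X[35:], y[:35], y[35:]

    def ridge_fit(X, y, lam):
        # closed-form ridge solution w = (XᵀX + λI)⁻¹Xᵀy
        return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

    def r2(y, y_hat):
        return 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)

    for lam in [1e-6, 0.1, 1.0, 10.0, 100.0]:  # 1e-6 stands in for λ = 0
        w = ridge_fit(X_tr, y_tr, lam)
        print(lam, r2(y_tr, X_tr @ w), r2(y_te, X_te @ w), np.linalg.norm(w))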

💡 Key Insights

  • Regularization prevents overfitting by penalizing large weights
  • λ controls bias-variance tradeoff: larger λ = higher bias, lower variance
  • Use cross-validation to select optimal λ
  • Ridge never sets coefficients exactly to zero (unlike LASSO)

✏️ Practice Problems

Problem 1: Gradient Descent

Given loss function L(w) = (y - wx)² where y=5, x=2, and current w=1:

  1. Calculate the current loss L(1)
  2. Calculate gradient dL/dw at w=1
  3. Update w using learning rate α=0.1
  4. Repeat for 3 iterations
Solution

Iteration 1:

L(1) = (5 - 1×2)² = (5-2)² = 9

dL/dw = 2(y - wx)(-x) = 2(5 - 2)(-2) = -12

w₁ = w₀ - α·dL/dw = 1 - 0.1×(-12) = 1 + 1.2 = 2.2

Iteration 2:

L(2.2) = (5 - 2.2×2)² = 0.36

dL/dw = 2(5 - 4.4)(-2) = -2.4

w₂ = 2.2 - 0.1×(-2.4) = 2.44

Iteration 3:

L(2.44) = (5 - 2.44×2)² = (5 - 4.88)² = 0.0144

dL/dw = 2(5 - 4.88)(-2) = -0.48

w₃ = 2.44 - 0.1×(-0.48) = 2.44 + 0.048 = 2.488

Convergence: the iterates approach w = 2.5, the optimum (L = 0 when wx = y, i.e. w = 5/2)
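
The same update loop in a few lines of Python (no libraries needed); it reproduces the iterates above:

    # gradient descent on L(w) = (y - w*x)^2 with y = 5, x = 2
    y, x = 5.0, 2.0
    w, alpha = 1.0, 0.1

    for i in range(3):
        loss = (y - w * x) ** 2
        grad = 2 * (y - w * x) * (-x)   # dL/dw
        w -= alpha * grad
        print(i + 1, loss, grad, w)     # w ≈ 2.2, 2.44, 2.488 after iterations 1-3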

Problem 2: Feature Scaling Impact

Dataset with unscaled features:

Feature 1 (Income): range [20,000 - 200,000]

Feature 2 (Age): range [20 - 65]

Feature 3 (Credit Score): range [300 - 850]

Tasks:

  1. Apply standardization (z-score) to an income of 60,000, given μ = 80,000 and σ = 40,000
  2. Apply min-max normalization to an age of 35
  3. Explain why scaling matters for gradient descent
Solution

1. Standardization (Income = 60,000):

z = (x - μ) / σ = (60,000 - 80,000) / 40,000 = -0.5

2. Min-Max Normalization (Age = 35):

x_norm = (x - min) / (max - min) = (35 - 20) / (65 - 20) = 15/45 = 0.333

3. Why scaling matters:

  • Features on different scales → gradient magnitudes differ greatly
  • Large-scale features dominate updates → slow convergence
  • Scaling ensures all features contribute equally
  • Makes learning rate easier to tune
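
Both transforms can be checked numerically with a short sketch (plain Python; the values are taken from the solution above):

    # z-score standardization: income of 60,000 with μ = 80,000, σ = 40,000
    income, mu, sigma = 60_000, 80_000, 40_000
    print((income - mu) / sigma)                   # -0.5

    # min-max normalization: age of 35 over the range [20, 65]
    age, age_min, age_max = 35, 20, 65
    print((age - age_min) / (age_max - age_min))   # ≈ 0.333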