Master linear regression with concrete examples
Simple linear regression example:

| Student | Hours Studied (x) | Exam Score (y) |
|---|---|---|
| 1 | 1 | 50 |
| 2 | 2 | 55 |
| 3 | 3 | 65 |
| 4 | 4 | 70 |
| 5 | 5 | 80 |
Task: Fit a line y = wx + b by least squares, then predict the score for 6 hours of study.
Step 1: Calculate means
x̄ = (1+2+3+4+5)/5 = 3
ȳ = (50+55+65+70+80)/5 = 64
Step 2: Calculate slope w
w = Σ(xᵢ - x̄)(yᵢ - ȳ) / Σ(xᵢ - x̄)²
Numerator:
(1-3)(50-64) + (2-3)(55-64) + (3-3)(65-64) + (4-3)(70-64) + (5-3)(80-64)
= (-2)(-14) + (-1)(-9) + (0)(1) + (1)(6) + (2)(16)
= 28 + 9 + 0 + 6 + 32 = 75
Denominator:
(1-3)² + (2-3)² + (3-3)² + (4-3)² + (5-3)²
= 4 + 1 + 0 + 1 + 4 = 10
w = 75/10 = 7.5
Step 3: Calculate intercept b
b = ȳ - w·x̄
b = 64 - 7.5×3 = 64 - 22.5 = 41.5
Step 4: Final model and prediction
y = 7.5x + 41.5
For x = 6 hours:
y = 7.5×6 + 41.5 = 45 + 41.5 = 86.5 points
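A minimal NumPy sketch of Steps 1-4 (the array names are illustrative):

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)       # hours studied
y = np.array([50, 55, 65, 70, 80], dtype=float)  # exam scores

# Steps 1-2: means and least-squares slope
w = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
# Step 3: intercept
b = y.mean() - w * x.mean()
print(w, b)        # 7.5 41.5

# Step 4: prediction for 6 hours of study
print(w * 6 + b)   # 86.5
```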
Step 5: Calculate R²
Predictions: ŷ = [49, 56.5, 64, 71.5, 79]
SS_res = Σ(yᵢ - ŷᵢ)² = (50-49)² + (55-56.5)² + (65-64)² + (70-71.5)² + (80-79)² = 7.5
SS_tot = Σ(yᵢ - ȳ)² = (50-64)² + (55-64)² + (65-64)² + (70-64)² + (80-64)² = 570
R² = 1 - 7.5/570 ≈ 0.987
Excellent fit! The model explains 98.7% of the variance.
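Continuing the sketch above, the R² computation:

```python
y_hat = w * x + b                      # fitted values: [49, 56.5, 64, 71.5, 79]
ss_res = np.sum((y - y_hat) ** 2)      # 7.5
ss_tot = np.sum((y - y.mean()) ** 2)   # 570.0
print(1 - ss_res / ss_tot)             # ≈ 0.987
```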
Multiple linear regression example: predict the final grade using three factors.
| Student | Study Hours/week | Attendance % | Homework % | Final Grade |
|---|---|---|---|---|
| 1 | 5 | 80 | 70 | 65 |
| 2 | 10 | 95 | 90 | 85 |
| 3 | 8 | 90 | 85 | 80 |
| 4 | 3 | 70 | 60 | 55 |
| 5 | 12 | 98 | 95 | 92 |
| 6 | 7 | 85 | 80 | 75 |
Model: y = w₀ + w₁·x₁ + w₂·x₂ + w₃·x₃
Normal Equation: w = (XᵀX)⁻¹Xᵀy
Design Matrix X (first column of ones for the intercept):

    1   5  80  70
    1  10  95  90
X = 1   8  90  85
    1   3  70  60
    1  12  98  95
    1   7  85  80

Target Vector y: [65, 85, 80, 55, 92, 75]ᵀ
Learned Coefficients:
w₀ (intercept) = 5.2
w₁ (study hours) = 2.1
w₂ (attendance) = 0.4
w₃ (homework) = 0.3
Interpretation:
Holding the other features fixed, each additional weekly study hour adds about 2.1 points to the final grade, each attendance percentage point adds about 0.4 points, and each homework percentage point adds about 0.3 points; the intercept w₀ = 5.2 is the predicted grade when all three features are zero.
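A sketch of the normal-equation fit with NumPy. Note that np.linalg.lstsq solves the same least-squares problem without forming an explicit inverse; because this tiny dataset's features are highly correlated, the solver's coefficients may differ from the rounded values above:

```python
import numpy as np

# Design matrix: intercept column plus the three features
X = np.array([
    [1,  5, 80, 70],
    [1, 10, 95, 90],
    [1,  8, 90, 85],
    [1,  3, 70, 60],
    [1, 12, 98, 95],
    [1,  7, 85, 80],
], dtype=float)
y = np.array([65, 85, 80, 55, 92, 75], dtype=float)

# Equivalent to w = (XᵀX)⁻¹Xᵀy, but numerically more stable
w, *_ = np.linalg.lstsq(X, y, rcond=None)
print(w)   # [w0, w1, w2, w3]
```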
Ridge regression example. Dataset: 50 samples, 100 features (a p > n situation).
Ordinary least squares overfits severely:
Training R² = 0.99, Test R² = 0.12 ❌
Objective Function:
min ||y - Xw||² + λ||w||²
Solution: w = (XᵀX + λI)⁻¹Xᵀy
Results for different λ:
| λ | Training R² | Test R² | ‖w‖ |
|---|---|---|---|
| 0 | 0.99 | 0.12 | 156.3 |
| 0.1 | 0.92 | 0.68 | 42.1 |
| 1.0 | 0.88 | 0.82 | 18.5 |
| 10 | 0.75 | 0.74 | 5.2 |
| 100 | 0.65 | 0.63 | 1.8 |
Optimal λ = 1.0: best test performance (Test R² = 0.82), trading a little training fit for much better generalization.
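A sketch of the closed-form ridge solve on synthetic p > n data; the random dataset and the sparse true weights here are assumptions for illustration, so the norms and scores will not match the table above:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 100                             # p > n: XᵀX is singular, OLS is ill-posed
X = rng.normal(size=(n, p))
w_true = np.zeros(p)
w_true[:5] = [3.0, -2.0, 1.5, 1.0, -1.0]   # assumed sparse ground truth
y = X @ w_true + rng.normal(scale=0.5, size=n)

def ridge_fit(X, y, lam):
    """Closed form w = (XᵀX + λI)⁻¹Xᵀy, computed via a linear solve."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

for lam in (0.1, 1.0, 10.0, 100.0):
    w = ridge_fit(X, y, lam)
    print(f"λ={lam:>5}: ‖w‖ = {np.linalg.norm(w):.2f}")   # ‖w‖ shrinks as λ grows
```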
Gradient descent example. Given the loss function L(w) = (y - wx)² with y = 5, x = 2, learning rate α = 0.1, and initial weight w₀ = 1:
Iteration 1:
L(1) = (5 - 1×2)² = (5-2)² = 9
dL/dw = 2(y - wx)(-x) = 2(5 - 2)(-2) = -12
w₁ = w₀ - α·dL/dw = 1 - 0.1×(-12) = 1 + 1.2 = 2.2
Iteration 2:
L(2.2) = (5 - 2.2×2)² = 0.36
dL/dw = 2(5 - 4.4)(-2) = -2.4
w₂ = 2.2 - 0.1×(-2.4) = 2.44
Iteration 3:
L(2.44) = (5 - 4.88)² = 0.0144
dL/dw = 2(5 - 4.88)(-2) = -0.48
w₃ = 2.44 - 0.1×(-0.48) = 2.44 + 0.048 = 2.488
Convergence: w → 2.5, the optimum where y = wx holds exactly (5 = 2.5×2).
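A minimal sketch of the same update loop in Python (the variable names are illustrative):

```python
# Gradient descent on L(w) = (y - w*x)² with the constants above.
y_target, x_in = 5.0, 2.0
w, alpha = 1.0, 0.1      # initial weight, learning rate

for step in range(1, 26):
    grad = 2 * (y_target - w * x_in) * (-x_in)   # dL/dw
    w = w - alpha * grad
    if step <= 3:
        print(step, round(w, 3))   # 2.2, 2.44, 2.488

print(round(w, 6))   # ≈ 2.5, the optimum
```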
Feature scaling example. Dataset with unscaled features:
Feature 1 (Income): range [20,000 - 200,000]
Feature 2 (Age): range [20 - 65]
Feature 3 (Credit Score): range [300 - 850]
Tasks:
1. Standardization (Income = 60,000; assume μ = 80,000 and σ = 40,000):
z = (x - μ) / σ = (60,000 - 80,000) / 40,000 = -0.5
2. Min-Max Normalization (Age = 35):
x_norm = (x - min) / (max - min) = (35 - 20) / (65 - 20) = 15/45 = 0.333
3. Why scaling matters:
Without scaling, the income feature (range ≈ 180,000) dominates the squared-error gradient, so gradient descent zig-zags across an elongated loss surface and converges slowly, and a penalty like λ‖w‖² punishes coefficients unevenly because their magnitudes depend on feature scale. After scaling, all features contribute on comparable scales and a single learning rate works for every weight.
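A small sketch of both transformations using the example's numbers (the statistics μ and σ are taken as given above):

```python
# 1. Standardization with the stated statistics
mu, sigma = 80_000, 40_000
z = (60_000 - mu) / sigma                      # -0.5

# 2. Min-max normalization of Age = 35 over the range [20, 65]
age_min, age_max = 20, 65
x_norm = (35 - age_min) / (age_max - age_min)  # 15/45 ≈ 0.333

print(z, round(x_norm, 3))   # -0.5 0.333
```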