Master linear regression with OLS, gradient descent, and regularization techniques
Simple linear regression models the relationship between one predictor variable x and a response variable y:

y = β₀ + β₁x + ε

• β₁: slope (weight)
• β₀: intercept (bias)
• ε: error term
Slope: β₁ = Σ(xᵢ - x̄)(yᵢ - ȳ) / Σ(xᵢ - x̄)²
Intercept: β₀ = ȳ - β₁x̄
R-squared: R² = 1 - SS_res / SS_tot = 1 - Σ(yᵢ - ŷᵢ)² / Σ(yᵢ - ȳ)²
MSE: MSE = (1/n) Σ(yᵢ - ŷᵢ)²
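As a concrete illustration, here is a minimal NumPy sketch of these four formulas (the function and variable names are my own, not from the original):

```python
import numpy as np

def simple_ols(x, y):
    """Fit y = b0 + b1 * x by ordinary least squares and report R^2 and MSE."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)   # slope
    b0 = y.mean() - b1 * x.mean()                                                # intercept
    y_hat = b0 + b1 * x
    r2 = 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)
    mse = np.mean((y - y_hat) ** 2)
    return b0, b1, r2, mse
```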
Multiple linear regression extends to multiple predictors:

y = Xw + ε

Design Matrix X (n × p): each row is one observation, each column is one feature (including a column of 1s for the intercept).
Weights w (p × 1): one coefficient per feature.
Derivation: Set the gradient of the squared-error loss to zero and solve for the optimal weights, which gives w* = (X^T X)^(-1) X^T y (derived step by step below).
Computational Considerations
Computing (X^T X)^(-1) costs O(p³). For high-dimensional data (large p), use iterative methods like gradient descent instead.
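A short NumPy sketch of the closed-form fit under these considerations; the synthetic data is illustrative, and solving the normal equations (or using np.linalg.lstsq) is preferred over forming the inverse explicitly:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 5
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])   # design matrix with intercept column
y = X @ rng.normal(size=p) + 0.1 * rng.normal(size=n)

# Solve the normal equations (X^T X) w = X^T y instead of forming the inverse
w_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Numerically safer general-purpose alternative (SVD-based least squares)
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.allclose(w_normal, w_lstsq))   # True for a well-conditioned design matrix
```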
Gradient descent iteratively updates weights in the direction of steepest descent:

w_(t+1) = w_t - η ∇L(w_t)

where η is the learning rate and ∇L(w_t) is the gradient of the loss function.
Mini-batch gradient descent is the usual compromise: update on small batches (e.g., 32, 64, or 128 samples).
This balances computational efficiency and convergence stability.
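A minimal mini-batch gradient descent sketch for the squared-error loss; the learning rate, batch size, and epoch count are illustrative assumptions, not values from the original:

```python
import numpy as np

def minibatch_gd(X, y, lr=0.01, batch_size=32, epochs=100, seed=0):
    """Mini-batch gradient descent on the mean squared error."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    w = np.zeros(p)
    for _ in range(epochs):
        idx = rng.permutation(n)                   # reshuffle each epoch
        for start in range(0, n, batch_size):
            batch = idx[start:start + batch_size]
            Xb, yb = X[batch], y[batch]
            grad = -2 / len(batch) * Xb.T @ (yb - Xb @ w)   # gradient of the batch MSE
            w -= lr * grad
    return w
```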
Adds an L2 penalty to shrink coefficients:

L(w) = ||y - Xw||² + λ||w||²

Closed-form solution:

w_ridge = (X^T X + λI)^(-1) X^T y
When to use: when features are highly correlated (multicollinearity) or X^T X is ill-conditioned, and you want to shrink coefficients without eliminating any.
Adds an L1 penalty for sparse solutions:

L(w) = ||y - Xw||² + λ||w||₁
Key property:
Can set coefficients exactly to zero, performing automatic feature selection
When to use: when many features are likely irrelevant and you want automatic feature selection.
Combines L1 and L2 penalties:

L(w) = ||y - Xw||² + λ₁||w||₁ + λ₂||w||²
Gets benefits of both: feature selection (L1) and stability (L2). Useful when features are correlated.
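All three penalties are available in scikit-learn; a brief sketch of typical usage follows (the alpha values and synthetic data are illustrative and should be tuned, e.g., by cross-validation):

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso, ElasticNet

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
w_true = np.array([3.0, -2.0, 0, 0, 0, 0, 0, 0, 0, 1.5])   # mostly irrelevant features
y = X @ w_true + 0.5 * rng.normal(size=100)

ridge = Ridge(alpha=1.0).fit(X, y)                     # L2 penalty: shrinks all coefficients
lasso = Lasso(alpha=0.1).fit(X, y)                     # L1 penalty: some coefficients exactly zero
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)   # mix of L1 and L2

print("LASSO nonzero coefficients:", np.sum(lasso.coef_ != 0))
```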
The true relationship between X and y is linear: E[y | X] = Xw
Check: Residual plot should show random scatter. Patterns indicate non-linearity.
Observations are independent: Cov(εᵢ, εⱼ) = 0 for i ≠ j
Check: Durbin-Watson test for autocorrelation. Important for time series data.
Constant error variance (homoscedasticity): Var(εᵢ) = σ² for all i
Check: Plot residuals vs fitted values. Fan shape indicates heteroscedasticity. Use Breusch-Pagan test.
Errors follow a normal distribution: ε ~ N(0, σ²)
Check: Q-Q plot or Shapiro-Wilk test. Required for inference (confidence intervals, hypothesis tests).
Features are not highly correlated: X^T X is invertible
Check: VIF (Variance Inflation Factor). VIF > 10 indicates problematic collinearity.
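A sketch of how the checks above are commonly run with statsmodels and scipy; the synthetic data is illustrative, and the usual rules of thumb (Durbin-Watson near 2, VIF > 10, small p-values flagging violations) apply:

```python
import numpy as np
import statsmodels.api as sm
from scipy.stats import shapiro
from statsmodels.stats.stattools import durbin_watson
from statsmodels.stats.diagnostic import het_breuschpagan
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
X = sm.add_constant(rng.normal(size=(200, 3)))         # design matrix with intercept column
y = X @ np.array([1.0, 2.0, -1.0, 0.5]) + rng.normal(size=200)

resid = sm.OLS(y, X).fit().resid

print("Durbin-Watson:", durbin_watson(resid))               # near 2 => little autocorrelation
print("Breusch-Pagan p:", het_breuschpagan(resid, X)[1])    # small p => heteroscedasticity
_, shapiro_p = shapiro(resid)
print("Shapiro-Wilk p:", shapiro_p)                         # small p => non-normal residuals
print("VIFs:", [variance_inflation_factor(X, i) for i in range(1, X.shape[1])])  # VIF > 10 is a red flag
```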
| Sqft (x) | Price (y, in $1000s) |
|---|---|
| 1200 | 285 |
| 1500 | 350 |
| 1800 | 420 |
| 2100 | 485 |
| 2400 | 575 |
Fitting OLS to this data: β₁ = Σ(xᵢ - x̄)(yᵢ - ȳ) / Σ(xᵢ - x̄)² = 214,500 / 900,000 ≈ 0.2383 and β₀ = 423 - 0.2383 × 1800 = -6, so ŷ ≈ -6 + 0.238x (price in $1000s).
Interpretation: Each additional square foot increases the predicted price by about $238.
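A quick NumPy cross-check of this fit on the table above (np.polyfit is used only as an independent check of the hand computation):

```python
import numpy as np

sqft = np.array([1200, 1500, 1800, 2100, 2400])
price = np.array([285, 350, 420, 485, 575])           # in $1000s

b1 = np.sum((sqft - sqft.mean()) * (price - price.mean())) / np.sum((sqft - sqft.mean()) ** 2)
b0 = price.mean() - b1 * sqft.mean()
print(b1, b0)                        # ~0.2383 and -6.0  ->  price ≈ -6 + 0.238 * sqft
print(np.polyfit(sqft, price, 1))    # same slope and intercept from an independent routine
```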
Minimize the sum of squared residuals (RSS):

L(w) = ||y - Xw||² = (y - Xw)^T(y - Xw) = y^T y - y^T Xw - w^T X^T y + w^T X^T X w

Since scalars equal their transpose, y^T Xw = w^T X^T y, so:

L(w) = y^T y - 2w^T X^T y + w^T X^T X w
Using matrix calculus rules:
Rule 1: ∂(w^T a)/∂w = a
Rule 2: ∂(w^T A w)/∂w = 2Aw (for symmetric A)
Applying these rules:

∇L(w) = -2X^T y + 2X^T X w
For optimality (first-order condition), set the gradient to zero: X^T X w = X^T y
These are called the Normal Equations
If X^T X is invertible: w* = (X^T X)^(-1) X^T y
This is the closed-form solution for OLS regression
The Hessian (second derivative matrix) is: ∇²L(w) = 2X^T X
Conclusion: X^T X is positive semi-definite (all eigenvalues ≥ 0), confirming this is a minimum. If X has full column rank, X^T X is positive definite, making it a unique global minimum.
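A small numerical sanity check of this derivation (synthetic data, illustrative only): the gradient vanishes at the closed-form solution and the Hessian 2X^T X has non-negative eigenvalues:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 4))
y = rng.normal(size=50)

w_star = np.linalg.solve(X.T @ X, X.T @ y)     # closed-form OLS solution
grad = -2 * X.T @ (y - X @ w_star)             # gradient of ||y - Xw||^2 at w*
hessian = 2 * X.T @ X

print(np.allclose(grad, 0))                            # True: first-order condition holds
print(np.all(np.linalg.eigvalsh(hessian) >= -1e-10))   # True: Hessian is positive semi-definite
```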
Statement: An OLS solution always exists (possibly not unique).
Proof Sketch: L(w) = ||y - Xw||² is a convex quadratic bounded below by 0. Equivalently, the projection of y onto the column space of X always exists; any w with Xw equal to that projection satisfies the normal equations X^T X w = X^T y, so at least one minimizer exists.
Statement: The OLS solution is unique if and only if X has full column rank.
Proof:
⇒ Direction (Full rank ⇒ Unique): If X has full column rank, then X^T X is positive definite and hence invertible, so the normal equations have exactly one solution, w* = (X^T X)^(-1) X^T y.
⇐ Direction (Not full rank ⇒ Not unique): If X does not have full column rank, there is a v ≠ 0 with Xv = 0. For any minimizer w*, X(w* + cv) = Xw* for every scalar c, so w* + cv attains the same RSS and the minimizer is not unique.
Multicollinearity (rank-deficient)
Solution: Ridge regression or feature selection
Full Rank Case
Standard case for OLS: X^T X is invertible and the closed-form solution is unique.
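A short demonstration of the rank-deficient case (synthetic data): duplicating a column makes X^T X singular, and two different weight vectors produce exactly the same predictions, hence the same RSS:

```python
import numpy as np

rng = np.random.default_rng(2)
x1 = rng.normal(size=100)
X = np.column_stack([x1, x1])            # two perfectly collinear columns -> rank 1
y = 3 * x1 + rng.normal(size=100)

print(np.linalg.matrix_rank(X.T @ X))    # 1: X^T X is singular

w_a = np.array([3.0, 0.0])
w_b = np.array([1.0, 2.0])               # any weights with w1 + w2 = 3 give the same fit
print(np.allclose(X @ w_a, X @ w_b))     # True: identical predictions, identical RSS
```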
Initialize: w₀ ← random or zeros
For t = 0, 1, 2, ... until convergence:
g_t ← ∇L(w_t) = -2X^T(y - Xw_t)
w_(t+1) ← w_t - η * g_t
Return: w_final
η is the learning rate (step size)
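A runnable NumPy version of this pseudocode; the default step size, tolerance, and iteration cap are illustrative choices (the step-size bound itself is derived in the convergence analysis below):

```python
import numpy as np

def gradient_descent_ols(X, y, lr=None, tol=1e-8, max_iter=10_000):
    """Full-batch gradient descent for L(w) = ||y - Xw||^2."""
    if lr is None:
        # Illustrative safe default: half of the 1/lambda_max stability limit
        lr = 0.5 / np.linalg.eigvalsh(X.T @ X).max()
    w = np.zeros(X.shape[1])
    for _ in range(max_iter):
        g = -2 * X.T @ (y - X @ w)            # gradient, matching the pseudocode above
        w_new = w - lr * g
        if np.linalg.norm(w_new - w) < tol:   # stop once updates become negligible
            return w_new
        w = w_new
    return w
```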
Theorem: For linear regression with loss L(w) = ||y - Xw||², if the step size satisfies 0 < η < 1/λ_max, where λ_max is the largest eigenvalue of X^T X,
then gradient descent converges to the global optimum w*.
Step 1: Define Error
Let e_t = w_t - w* be the error at iteration t.
Step 2: Substitute Gradient
Since ∇L(w) = 2X^T X(w - w*):

e_(t+1) = w_(t+1) - w* = w_t - 2ηX^T X(w_t - w*) - w* = (I - 2ηX^T X) e_t
Step 3: Spectral Analysis
Let A = I - 2ηX^T X. For convergence, we need ||A|| < 1 (spectral norm).
Eigenvalues of A: μ_i = 1 - 2ηλ_i where λ_i are eigenvalues of X^T X.
Step 4: Convergence Condition
For convergence, |μ_i| < 1 for all i:

|1 - 2ηλ_i| < 1  ⟺  0 < ηλ_i < 1

This gives: 0 < η < 1/λ_max
Convergence Rate
Linear convergence with rate determined by condition number κ = λ_max/λ_min. Optimal step size: η = 1/(λ_max + λ_min) gives fastest convergence.
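A quick numerical illustration of these conditions (synthetic data, step counts chosen only for display): a step size below 1/λ_max converges to the closed-form solution, the value 1/(λ_max + λ_min) converges fastest, and a step size above the bound diverges:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 3))
y = rng.normal(size=100)
w_star = np.linalg.solve(X.T @ X, X.T @ y)        # closed-form optimum

lam = np.linalg.eigvalsh(X.T @ X)
lam_max, lam_min = lam.max(), lam.min()

def run_gd(eta, steps):
    w = np.zeros(3)
    for _ in range(steps):
        w = w - eta * (-2 * X.T @ (y - X @ w))    # gradient step on ||y - Xw||^2
    return np.linalg.norm(w - w_star)             # distance to the optimum

print(run_gd(1.0 / (lam_max + lam_min), 500))     # optimal step size: tiny error
print(run_gd(0.9 / lam_max, 500))                 # below the bound: converges
print(run_gd(1.1 / lam_max, 50))                  # above the bound: error grows
```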
Objective: Add a penalty proportional to the sum of squared coefficients:

L_ridge(w) = ||y - Xw||² + λ||w||²
Closed-Form Solution:
Taking the gradient and setting it to zero:

∇L_ridge(w) = -2X^T(y - Xw) + 2λw = 0  ⟹  (X^T X + λI)w = X^T y  ⟹  w_ridge = (X^T X + λI)^(-1) X^T y
Key Property:
X^T X + λI is always invertible (positive definite) even if X^T X is singular. This solves multicollinearity problems!
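A minimal sketch of the ridge closed-form solution, including the singular X^T X case where plain OLS would fail (the λ value and data are illustrative):

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """w_ridge = (X^T X + lam * I)^(-1) X^T y, computed via a linear solve."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

# Even with perfectly collinear columns (singular X^T X), the ridge solution is well-defined:
rng = np.random.default_rng(4)
x1 = rng.normal(size=100)
X = np.column_stack([x1, x1])
y = 3 * x1 + 0.1 * rng.normal(size=100)
print(ridge_closed_form(X, y, lam=1.0))   # finite coefficients, shared across the duplicate columns
```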
Objective: Add a penalty proportional to the sum of absolute values:

L_lasso(w) = ||y - Xw||² + λ||w||₁
No Closed Form, but Subdifferential:
The L1 norm is not differentiable at 0. Subdifferential of each |w_j|:

∂|w_j| = {sign(w_j)} if w_j ≠ 0, and [-1, 1] if w_j = 0
Solved iteratively using coordinate descent or proximal gradient methods.
Sparsity Property:
L1 penalty produces exact zeros in the solution. This is because the L1 ball has corners at axes, so the optimal point often lies on an axis (sparse solution).
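A brief scikit-learn sketch of this sparsity property (the alpha values and synthetic data are illustrative): with many irrelevant features, LASSO sets most coefficients exactly to zero while ridge does not:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(5)
X = rng.normal(size=(200, 20))
w_true = np.zeros(20)
w_true[:3] = [2.0, -3.0, 1.5]             # only the first three features matter
y = X @ w_true + 0.5 * rng.normal(size=200)

lasso = Lasso(alpha=0.2).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

print("LASSO coefficients set to zero:", np.sum(lasso.coef_ == 0))   # many exact zeros
print("Ridge coefficients set to zero:", np.sum(ridge.coef_ == 0))   # typically none
```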
Ridge (L2)
Constraint: ||w||² ≤ t (circle/sphere)
LASSO (L1)
Constraint: ||w||₁ ≤ t (diamond/cross-polytope)
Theorem: Regularization introduces bias but reduces variance.
For Ridge, the expected value of the estimator (with true weights w) is:

E[w_ridge] = (X^T X + λI)^(-1) X^T X w ≠ w for λ > 0
The bias increases with λ, but for noisy data, the reduction in variance often outweighs the bias increase, leading to better test performance.
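A small simulation sketch of this trade-off (all settings are illustrative assumptions): over repeated noisy datasets from the same design, the ridge estimator is biased toward zero but has lower variance than OLS; whether the total error improves depends on λ and the noise level:

```python
import numpy as np

rng = np.random.default_rng(6)
n, p, lam, trials = 40, 10, 5.0, 2000
X = rng.normal(size=(n, p))
w_true = rng.normal(size=p)

ols_est, ridge_est = [], []
for _ in range(trials):
    y = X @ w_true + 2.0 * rng.normal(size=n)                        # fresh noise each trial
    ols_est.append(np.linalg.solve(X.T @ X, X.T @ y))
    ridge_est.append(np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y))

for name, est in [("OLS", np.array(ols_est)), ("Ridge", np.array(ridge_est))]:
    bias2 = np.sum((est.mean(axis=0) - w_true) ** 2)                 # squared bias of the estimator
    var = np.sum(est.var(axis=0))                                    # total variance across coefficients
    print(f"{name}: bias^2={bias2:.3f}  variance={var:.3f}  total={bias2 + var:.3f}")
```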
Test your understanding with 10 multiple-choice questions