
Linear Regression

Master linear regression with OLS, gradient descent, and regularization techniques

Simple Linear Regression

Model Formulation

Simple linear regression models the relationship between one predictor variable x and the response variable y:

y = wx + b + \epsilon

• w: slope (weight)

• b: intercept (bias)

• \epsilon: error term

Objective: Minimize Sum of Squared Residuals

\min_{w,b} \sum_{i=1}^{n} (y_i - (wx_i + b))^2

Closed-Form Solution

Slope:

w = \frac{\text{Cov}(X,Y)}{\text{Var}(X)} = \frac{\sum(x_i-\bar{x})(y_i-\bar{y})}{\sum(x_i-\bar{x})^2}

Intercept:

b = \bar{y} - w\bar{x}

Model Evaluation

R-squared:

R^2 = 1 - \frac{SS_{res}}{SS_{tot}} = 1 - \frac{\sum(y_i-\hat{y}_i)^2}{\sum(y_i-\bar{y})^2}

MSE:

MSE = \frac{1}{n}\sum(y_i-\hat{y}_i)^2
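
As a quick sanity check, these formulas can be reproduced in a short NumPy sketch (the sample data below is the five-house example used later on this page):

import numpy as np

x = np.array([1200, 1500, 1800, 2100, 2400], dtype=float)  # predictor (sqft)
y = np.array([285, 350, 420, 485, 575], dtype=float)       # response (price, $1000s)

# Closed-form slope and intercept
w = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b = y.mean() - w * x.mean()

# Evaluation: R-squared and MSE
y_hat = w * x + b
r2 = 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)
mse = np.mean((y - y_hat) ** 2)
print(f"w={w:.4f}, b={b:.2f}, R^2={r2:.4f}, MSE={mse:.2f}")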

Multiple Linear Regression

Matrix Formulation

Multiple linear regression extends to multiple predictors:

y=Xw+ϵ\mathbf{y} = \mathbf{Xw} + \epsilon

Design Matrix (n × (p+1), with a leading column of ones for the intercept):

\mathbf{X} = \begin{bmatrix} 1 & x_{11} & \cdots & x_{1p} \\ 1 & x_{21} & \cdots & x_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n1} & \cdots & x_{np} \end{bmatrix}

Weights ((p+1) × 1):

\mathbf{w} = \begin{bmatrix} b \\ w_1 \\ w_2 \\ \vdots \\ w_p \end{bmatrix}

Normal Equation

\mathbf{w} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}

Derivation: Set gradient to zero and solve for optimal weights

Computational Considerations

Computing (\mathbf{X}^T\mathbf{X})^{-1} costs O(p^3) (plus O(np^2) to form \mathbf{X}^T\mathbf{X}). For high-dimensional data (large p), use iterative methods like gradient descent instead.
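
A minimal NumPy sketch of the normal equation (the synthetic data, shapes, and coefficients are illustrative); solving the linear system with np.linalg.solve is preferred over explicitly inverting X^T X:

import numpy as np

rng = np.random.default_rng(0)
X_raw = rng.normal(size=(100, 3))                       # hypothetical features (n=100, p=3)
y = X_raw @ np.array([1.5, -2.0, 0.5]) + 3.0 + rng.normal(scale=0.1, size=100)

X = np.column_stack([np.ones(len(X_raw)), X_raw])       # prepend intercept column of ones
w = np.linalg.solve(X.T @ X, X.T @ y)                   # solve X^T X w = X^T y
print(w)                                                # approximately [3.0, 1.5, -2.0, 0.5]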

Gradient Descent

Iterative Optimization

Gradient descent iteratively updates weights in the direction of steepest descent:

\mathbf{w}^{(t+1)} = \mathbf{w}^{(t)} - \alpha \nabla L(\mathbf{w}^{(t)})

where \alpha is the learning rate and \nabla L(\mathbf{w}^{(t)}) is the gradient of the loss function

Batch Gradient Descent

\nabla L = \frac{2}{n}\mathbf{X}^T(\mathbf{Xw} - \mathbf{y})
  • ✓ Uses all training samples
  • ✓ Stable convergence
  • ✗ Slow for large datasets

Stochastic Gradient Descent

\nabla L_i = 2\mathbf{x}_i(\mathbf{x}_i^T\mathbf{w} - y_i)
  • ✓ Fast updates
  • ✓ Can escape local minima
  • ✗ Noisy convergence

Mini-Batch Gradient Descent

Compromise: use small batches (e.g., 32, 64, 128 samples)

\nabla L_{\text{batch}} = \frac{2}{|B|}\sum_{i \in B} \mathbf{x}_i(\mathbf{x}_i^T\mathbf{w} - y_i)

Balances computation efficiency and convergence stability
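
The sketch below implements the batch and mini-batch updates with this gradient (function names, learning rates, and iteration counts are illustrative choices, and X is assumed to already contain an intercept column):

import numpy as np

def batch_gd(X, y, alpha=0.1, iters=1000):
    # Full-gradient update: uses all n samples every step.
    n, p = X.shape
    w = np.zeros(p)
    for _ in range(iters):
        grad = (2 / n) * X.T @ (X @ w - y)
        w -= alpha * grad
    return w

def minibatch_gd(X, y, alpha=0.05, iters=1000, batch_size=32, seed=0):
    # Mini-batch update: averages the gradient over a random subset each step.
    rng = np.random.default_rng(seed)
    n, p = X.shape
    w = np.zeros(p)
    for _ in range(iters):
        idx = rng.choice(n, size=min(batch_size, n), replace=False)
        Xb, yb = X[idx], y[idx]
        grad = (2 / len(idx)) * Xb.T @ (Xb @ w - yb)
        w -= alpha * grad
    return w

Setting batch_size=1 in minibatch_gd recovers stochastic gradient descent.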

Regularization

Ridge Regression (L2)

Adds L2 penalty to shrink coefficients:

L_{\text{ridge}} = \sum_{i=1}^{n}(y_i - \mathbf{x}_i^T\mathbf{w})^2 + \lambda\sum_{j=1}^{p}w_j^2

Closed-form solution:

\mathbf{w}_{\text{ridge}} = (\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^T\mathbf{y}

When to use:

  • All features potentially relevant
  • Multicollinearity present
  • Want smooth shrinkage

Lasso Regression (L1)

Adds L1 penalty for sparse solutions:

L_{\text{lasso}} = \sum_{i=1}^{n}(y_i - \mathbf{x}_i^T\mathbf{w})^2 + \lambda\sum_{j=1}^{p}|w_j|

Key property:

Can set coefficients exactly to zero, performing automatic feature selection

When to use:

  • Many irrelevant features
  • Need interpretability
  • Want feature selection

Elastic Net

Combines L1 and L2 penalties:

L_{\text{elastic}} = \sum_{i=1}^{n}(y_i - \mathbf{x}_i^T\mathbf{w})^2 + \lambda_1\sum_{j=1}^{p}|w_j| + \lambda_2\sum_{j=1}^{p}w_j^2

Gets benefits of both: feature selection (L1) and stability (L2). Useful when features are correlated.
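
One convenient way to try all three penalties is scikit-learn's Ridge, Lasso, and ElasticNet estimators; in the sketch below (synthetic data, illustrative alpha values playing the role of λ) only a few coefficients are truly nonzero:

import numpy as np
from sklearn.linear_model import Ridge, Lasso, ElasticNet

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
true_w = np.array([3.0, -2.0, 0.0, 0.0, 1.5, 0.0, 0.0, 0.0, 0.0, 0.0])
y = X @ true_w + rng.normal(scale=0.5, size=200)

for model in (Ridge(alpha=1.0), Lasso(alpha=0.1), ElasticNet(alpha=0.1, l1_ratio=0.5)):
    model.fit(X, y)
    print(type(model).__name__, np.round(model.coef_, 2))
# Lasso and ElasticNet typically set the coefficients of the irrelevant features to exactly 0,
# while Ridge only shrinks them toward 0.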

OLS Assumptions

1. Linearity

True relationship between the predictors and the response is linear: \mathbb{E}[y \mid \mathbf{x}] = \mathbf{x}^T\mathbf{w}

Check: Residual plot should show random scatter. Patterns indicate non-linearity.

2. Independence

Observations are independent: \text{Cov}(\epsilon_i, \epsilon_j) = 0 for i \neq j

Check: Durbin-Watson test for autocorrelation. Important for time series data.

3. Homoscedasticity

Constant error variance: \text{Var}(\epsilon_i) = \sigma^2 for all i

Check: Plot residuals vs fitted values. Fan shape indicates heteroscedasticity. Use Breusch-Pagan test.

4. Normality of Errors

Errors follow a normal distribution: \epsilon_i \sim N(0, \sigma^2)

Check: Q-Q plot or Shapiro-Wilk test. Required for inference (confidence intervals, hypothesis tests).

5. No Multicollinearity

Features are not highly correlated: \mathbf{X}^T\mathbf{X} is invertible

Check: VIF (Variance Inflation Factor). VIF > 10 indicates problematic collinearity.
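
These diagnostics can be scripted; a sketch using statsmodels and scipy on synthetic data (variable names and data are illustrative) might look like this:

import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson
from statsmodels.stats.diagnostic import het_breuschpagan
from statsmodels.stats.outliers_influence import variance_inflation_factor
from scipy.stats import shapiro

rng = np.random.default_rng(0)
X = sm.add_constant(rng.normal(size=(200, 3)))            # design matrix with intercept
y = X @ np.array([1.0, 2.0, -1.0, 0.5]) + rng.normal(scale=0.3, size=200)

fit = sm.OLS(y, X).fit()
resid = fit.resid

print("Durbin-Watson:", durbin_watson(resid))              # near 2 suggests no autocorrelation
print("Breusch-Pagan p:", het_breuschpagan(resid, X)[1])   # small p suggests heteroscedasticity
print("Shapiro-Wilk p:", shapiro(resid).pvalue)            # small p suggests non-normal errors
print("VIFs:", [variance_inflation_factor(X, i) for i in range(1, X.shape[1])])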

Example: Housing Price Prediction

Simple Linear Regression: Sqft vs Price

Sample Data (5 houses)

Sqft (x)    Price (y, $1000s)
1200        285
1500        350
1800        420
2100        485
2400        575

Step 1: Calculate means

\bar{x} = \frac{1200+1500+1800+2100+2400}{5} = 1800
\bar{y} = \frac{285+350+420+485+575}{5} = 423

Step 2: Calculate covariance and variance

\text{Cov}(X,Y) = \frac{\sum(x_i-\bar{x})(y_i-\bar{y})}{n} = \frac{214500}{5} = 42900
\text{Var}(X) = \frac{\sum(x_i-\bar{x})^2}{n} = \frac{900000}{5} = 180000

Step 3: Calculate coefficients

w = \frac{\text{Cov}(X,Y)}{\text{Var}(X)} = \frac{42900}{180000} \approx 0.2383
b = \bar{y} - w\bar{x} = 423 - \frac{42900}{180000} \times 1800 = 423 - 429 = -6

Final Model

\text{Price} \approx 0.2383 \times \text{Sqft} - 6

Interpretation: Each additional square foot increases the predicted price by about 0.2383 thousand dollars, i.e. roughly $238.
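
The arithmetic above can be verified in a few lines of NumPy:

import numpy as np

sqft  = np.array([1200, 1500, 1800, 2100, 2400], dtype=float)
price = np.array([285, 350, 420, 485, 575], dtype=float)      # in $1000s

cov = np.mean((sqft - sqft.mean()) * (price - price.mean()))  # 42900
var = np.mean((sqft - sqft.mean()) ** 2)                      # 180000
w = cov / var                                                 # ≈ 0.2383
b = price.mean() - w * sqft.mean()                            # ≈ -6.0
print(w, b)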

Complete OLS Derivation

Ordinary Least Squares: Mathematical Derivation

Step 1: Define Objective Function

Minimize the sum of squared residuals (RSS):

L(\mathbf{w}) = \|\mathbf{y} - \mathbf{Xw}\|^2 = (\mathbf{y} - \mathbf{Xw})^T(\mathbf{y} - \mathbf{Xw})

Step 2: Expand the Quadratic Form

L(\mathbf{w}) = \mathbf{y}^T\mathbf{y} - \mathbf{y}^T\mathbf{Xw} - \mathbf{w}^T\mathbf{X}^T\mathbf{y} + \mathbf{w}^T\mathbf{X}^T\mathbf{Xw}

Since scalars equal their transpose: y^T Xw = w^T X^T y

L(\mathbf{w}) = \mathbf{y}^T\mathbf{y} - 2\mathbf{w}^T\mathbf{X}^T\mathbf{y} + \mathbf{w}^T\mathbf{X}^T\mathbf{Xw}

Step 3: Take Gradient with Respect to w

Using matrix calculus rules:

Rule 1:

\frac{\partial}{\partial \mathbf{w}}(\mathbf{a}^T\mathbf{w}) = \mathbf{a}

Rule 2:

\frac{\partial}{\partial \mathbf{w}}(\mathbf{w}^T\mathbf{Aw}) = 2\mathbf{Aw} \quad \text{(for symmetric } \mathbf{A}\text{)}

Applying these rules:

\frac{\partial L}{\partial \mathbf{w}} = -2\mathbf{X}^T\mathbf{y} + 2\mathbf{X}^T\mathbf{Xw}

Step 4: Set Gradient to Zero

For optimality (first-order condition):

\frac{\partial L}{\partial \mathbf{w}} = 0 \implies -2\mathbf{X}^T\mathbf{y} + 2\mathbf{X}^T\mathbf{Xw} = 0
\mathbf{X}^T\mathbf{Xw} = \mathbf{X}^T\mathbf{y}

These are called the Normal Equations

Step 5: Solve for Optimal Weights

If X^T X is invertible:

\mathbf{w}^* = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}

This is the closed-form solution for OLS regression

Step 6: Verify Second-Order Condition

The Hessian (second derivative matrix) is:

\frac{\partial^2 L}{\partial \mathbf{w}^2} = 2\mathbf{X}^T\mathbf{X}

Conclusion: X^T X is positive semi-definite (all eigenvalues ≥ 0), confirming this is a minimum. If X has full column rank, X^T X is positive definite, making it a unique global minimum.
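
A quick numeric check of the derivation (synthetic data, illustrative shapes): the closed-form weights make the gradient from Step 3 vanish and agree with np.linalg.lstsq:

import numpy as np

rng = np.random.default_rng(1)
X = np.column_stack([np.ones(50), rng.normal(size=(50, 2))])  # full-column-rank design
y = X @ np.array([1.0, 2.0, -3.0]) + rng.normal(scale=0.1, size=50)

w_closed = np.linalg.solve(X.T @ X, X.T @ y)                  # normal-equation solution
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

grad = -2 * X.T @ y + 2 * X.T @ X @ w_closed                  # gradient from Step 3
print(np.allclose(w_closed, w_lstsq))                         # True
print(np.allclose(grad, np.zeros_like(grad)))                 # True: first-order condition holds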

Existence & Uniqueness of OLS Solution

When Does OLS Solution Exist and When Is It Unique?

Theorem: Existence

Statement: An OLS solution always exists (possibly not unique).

Proof Sketch:

  1. The predictions Xw range over the column space of X, which is a closed subspace of R^n
  2. L(w) = ||y - Xw||² is minimized exactly when Xw equals the orthogonal projection of y onto that column space
  3. The orthogonal projection of y onto a closed subspace always exists (and is unique)
  4. Any w whose image Xw equals that projection is a global minimizer, so an OLS solution exists

Theorem: Uniqueness

Statement: The OLS solution is unique if and only if X has full column rank.

Proof:

⇒ Direction (Full rank ⇒ Unique):

  • If rank(X) = p (full column rank), then X^T X is positive definite
  • A positive definite matrix is invertible
  • Therefore (X^T X)^(-1) exists and is unique
  • Hence w* = (X^T X)^(-1) X^T y is the unique solution

⇐ Direction (Not full rank ⇒ Not unique):

  • If rank(X) < p, then null space of X is non-trivial
  • If w* is a solution and v ∈ null(X) (i.e., Xv = 0), then Xw* = X(w* + v)
  • Both w* and w* + v give same predictions and same RSS
  • Therefore, infinitely many solutions exist

Practical Implications

Multicollinearity (rank-deficient)

  • Features are linearly dependent
  • X^T X is singular (non-invertible)
  • Infinite solutions exist
  • Unstable: small data changes → large coefficient changes

Solution: Ridge regression or feature selection

Full Rank Case

  • Features are linearly independent
  • X^T X is invertible
  • Unique solution exists
  • Stable and well-defined

Standard case for OLS
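
The rank-deficient case is easy to reproduce numerically; in this sketch (illustrative data) the third column is an exact multiple of the second, so infinitely many weight vectors give identical predictions:

import numpy as np

rng = np.random.default_rng(2)
x1 = rng.normal(size=100)
X = np.column_stack([np.ones(100), x1, 2 * x1])     # third column = 2 × second column
y = 3 + 4 * x1 + rng.normal(scale=0.1, size=100)

print(np.linalg.matrix_rank(X))                     # 2 < 3: not full column rank
w_min_norm, *_ = np.linalg.lstsq(X, y, rcond=None)  # minimum-norm least-squares solution
v = np.array([0.0, 2.0, -1.0])                      # X @ v = 0: a null-space direction
w_alt = w_min_norm + v
print(np.allclose(X @ w_min_norm, X @ w_alt))       # True: same predictions, same RSS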

Gradient Descent Convergence Analysis

Convergence Proof and Step Size Selection

Gradient Descent Algorithm

Initialize: w₀ ← random or zeros

For t = 0, 1, 2, ... until convergence:

g_t ← ∇L(w_t) = -2X^T(y - Xw_t)

w_(t+1) ← w_t - η * g_t

Return: w_final

η is the learning rate (step size)

Convergence Theorem

Theorem: For linear regression with loss L(w) = ||y - Xw||², if:

0 < \eta < \frac{1}{\lambda_{\max}(\mathbf{X}^T\mathbf{X})}

then gradient descent converges to the global optimum w*.

Proof Sketch

Step 1: Define Error

Let e_t = w_t - w* be the error at iteration t.

\mathbf{e}_{t+1} = \mathbf{w}_{t+1} - \mathbf{w}^* = (\mathbf{w}_t - \eta \nabla L(\mathbf{w}_t)) - \mathbf{w}^*

Step 2: Substitute Gradient

Since ∇L(w) = 2X^T X(w - w*):

\mathbf{e}_{t+1} = \mathbf{e}_t - 2\eta \mathbf{X}^T\mathbf{X}\mathbf{e}_t = (\mathbf{I} - 2\eta \mathbf{X}^T\mathbf{X})\mathbf{e}_t

Step 3: Spectral Analysis

Let A = I - 2ηX^T X. For convergence, we need ||A|| < 1 (spectral norm).

Eigenvalues of A: μ_i = 1 - 2ηλ_i where λ_i are eigenvalues of X^T X.

Step 4: Convergence Condition

For convergence, |μ_i| < 1 for all i:

-1 < 1 - 2\eta\lambda_i < 1 \quad \forall i

This gives η < 1/λ_i for every i; the binding constraint is the largest eigenvalue, so 0 < η < 1/λ_max

Convergence Rate

\|\mathbf{e}_{t+1}\| \leq (1 - 2\eta\lambda_{\min})\|\mathbf{e}_t\|

Linear convergence with rate determined by condition number κ = λ_max/λ_min. Optimal step size: η = 1/(λ_max + λ_min) gives fastest convergence.
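
The bound can be observed empirically. With the gradient ∇L = -2X^T(y - Xw) used in this section, the critical step size is 1/λ_max; the sketch below (synthetic, noise-free data so w* is known exactly) runs one step size just below it and one just above it:

import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 5))
w_star = rng.normal(size=5)
y = X @ w_star                                   # noise-free: w_star is the exact optimum

lam_max = np.linalg.eigvalsh(X.T @ X).max()

def run_gd(eta, iters=500):
    w = np.zeros(5)
    for _ in range(iters):
        w -= eta * (-2 * X.T @ (y - X @ w))      # gradient of ||y - Xw||^2
    return np.linalg.norm(w - w_star)            # distance to the optimum

print(run_gd(0.9 / lam_max))                     # below the bound: error is essentially 0
print(run_gd(1.1 / lam_max))                     # above the bound: error blows up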

Regularization: Ridge & LASSO

Mathematical Analysis of L2 and L1 Penalties

Ridge Regression (L2)

Objective: Add penalty proportional to sum of squared coefficients:

L_{\text{Ridge}}(\mathbf{w}) = \|\mathbf{y} - \mathbf{Xw}\|^2 + \lambda\|\mathbf{w}\|^2

Closed-Form Solution:

Taking gradient and setting to zero:

\frac{\partial L_{\text{Ridge}}}{\partial \mathbf{w}} = -2\mathbf{X}^T\mathbf{y} + 2(\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})\mathbf{w} = 0
\mathbf{w}_{\text{Ridge}}^* = (\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^T\mathbf{y}

Key Property:

For any λ > 0, X^T X + λI is positive definite and therefore always invertible, even if X^T X is singular. This solves the multicollinearity problem!

LASSO Regression (L1)

Objective: Add penalty proportional to sum of absolute values:

L_{\text{LASSO}}(\mathbf{w}) = \|\mathbf{y} - \mathbf{Xw}\|^2 + \lambda\|\mathbf{w}\|_1

No Closed Form, but Subdifferential:

The L1 norm is not differentiable at 0. Subdifferential:

\partial\|\mathbf{w}\|_1 = \{\mathbf{g} : g_i \in \begin{cases} \{1\} & w_i > 0 \\ [-1, 1] & w_i = 0 \\ \{-1\} & w_i < 0 \end{cases}\}

Solved iteratively using coordinate descent or proximal gradient methods.

Sparsity Property:

L1 penalty produces exact zeros in the solution. This is because the L1 ball has corners at axes, so the optimal point often lies on an axis (sparse solution).
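
A minimal proximal-gradient (ISTA) sketch makes the sparsity mechanism concrete: each iteration takes a gradient step on the squared loss and then applies soft-thresholding, which sets small coordinates exactly to zero (the helper names, data, and λ below are illustrative):

import numpy as np

def soft_threshold(z, t):
    # Elementwise prox of t * ||.||_1
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_ista(X, y, lam, iters=2000):
    step = 1.0 / (2 * np.linalg.eigvalsh(X.T @ X).max())   # 1 / Lipschitz constant of the gradient
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        grad = -2 * X.T @ (y - X @ w)
        w = soft_threshold(w - step * grad, step * lam)
    return w

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 8))
y = X @ np.array([2.0, 0, 0, -1.5, 0, 0, 0, 0.5]) + rng.normal(scale=0.2, size=100)
print(np.round(lasso_ista(X, y, lam=20.0), 3))   # irrelevant features typically end up exactly 0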

Geometric Interpretation

Ridge (L2)

Constraint: ||w||² ≤ t (circle/sphere)

  • Smooth boundary
  • Solutions rarely exactly zero
  • Shrinks all coefficients proportionally

LASSO (L1)

Constraint: ||w||₁ ≤ t (diamond/cross-polytope)

  • Sharp corners at axes
  • Solutions often have exact zeros
  • Performs automatic feature selection

Bias-Variance Tradeoff in Regularization

Theorem: Regularization introduces bias but reduces variance.

\mathbb{E}[\mathbf{w}_{\text{Ridge}}] \neq \mathbf{w}^* \quad \text{(biased)}
\text{Var}(\mathbf{w}_{\text{Ridge}}) < \text{Var}(\mathbf{w}_{\text{OLS}}) \quad \text{(lower variance)}

For Ridge, expected value:

\mathbb{E}[\mathbf{w}_{\text{Ridge}}] = (\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^T\mathbf{X}\,\mathbf{w}^* \neq \mathbf{w}^*

The bias increases with λ, but for noisy data, the reduction in variance often outweighs the bias increase, leading to better test performance.
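
A small Monte Carlo sketch (illustrative design, noise level, and λ) comparing the empirical mean and total variance of OLS and ridge estimates over repeated noisy samples of the same design:

import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(60, 5))
w_true = np.array([2.0, -1.0, 0.5, 1.5, -2.5])
lam = 10.0

ols, ridge = [], []
for _ in range(2000):
    y = X @ w_true + rng.normal(scale=1.0, size=60)
    ols.append(np.linalg.solve(X.T @ X, X.T @ y))
    ridge.append(np.linalg.solve(X.T @ X + lam * np.eye(5), X.T @ y))

ols, ridge = np.array(ols), np.array(ridge)
print("OLS   mean:", np.round(ols.mean(0), 2), "total var:", round(ols.var(0).sum(), 4))
print("Ridge mean:", np.round(ridge.mean(0), 2), "total var:", round(ridge.var(0).sum(), 4))
# The OLS mean is close to w_true (unbiased); the ridge mean is shrunk toward zero (biased)
# but its total variance is smaller.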

Practice Quiz

Test your understanding with 10 multiple-choice questions

1. What is the objective of Ordinary Least Squares (OLS)?
2. For simple linear regression y = wx + b, what is the closed-form solution for w?
3. In multiple linear regression with matrix notation y = Xw + ε, the OLS solution is:
4. What is R-squared (R²)?
5. Ridge regression adds which penalty term to the OLS objective?
6. What is the main advantage of Lasso regression over Ridge?
7. In gradient descent for linear regression, how is the weight updated?
8. Which OLS assumption is violated if error variance increases with predicted values?
9. For a dataset with 100 samples and 5 features, what is the computational complexity of computing the OLS solution using the normal equation?
10. Multicollinearity occurs when: