Learn how to predict continuous values using the least squares method, from simple regression to multiple variables with real housing price examples
Linear regression is a supervised learning algorithm for predicting continuous numerical values. Given a dataset D = {(x₁, y₁), (x₂, y₂), ..., (xₘ, yₘ)}, the goal is to learn a linear function that maps input features to output values with minimal error.
Simple Linear Regression (one feature):
y = wx + b
Multiple Linear Regression (multiple features):
y = w₁x₁ + w₂x₂ + ... + w_d x_d + b = wᵀx + b
Find the weights (w) and bias (b) that minimize the difference between predicted values (ŷ = wx + b) and actual values (y) across all training samples.
We measure error using Mean Squared Error (MSE), which penalizes large errors more heavily. Squaring ensures every error contributes a non-negative amount and makes the loss smooth and differentiable, which is convenient for optimization.
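As a concrete illustration, here is a minimal NumPy sketch (all array values are made up for the example) of computing ŷ = wᵀx + b for a batch of samples and the resulting MSE:

```python
import numpy as np

# Illustrative data: m = 4 samples, d = 2 features
X = np.array([[1.0, 2.0],
              [2.0, 1.0],
              [3.0, 4.0],
              [4.0, 3.0]])
y = np.array([5.0, 4.0, 11.0, 10.0])

# Candidate parameters (made up for illustration)
w = np.array([1.5, 1.0])
b = 0.5

# Predictions: y_hat_i = w . x_i + b
y_hat = X @ w + b

# Mean squared error over the m samples
mse = np.mean((y - y_hat) ** 2)
print(y_hat, mse)
```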
The mathematical foundation for finding optimal regression parameters
The goal is to minimize the sum of squared errors between predictions and actual values:
E(w,b) = Σᵢ₌₁ᵐ (yᵢ - ŷᵢ)² = Σᵢ₌₁ᵐ (yᵢ - wxᵢ - b)²
where m is the number of samples, yᵢ is the actual value, and ŷᵢ is the predicted value
For single-variable regression (one feature), we can find the optimal w and b by taking partial derivatives of E(w,b) and setting them to zero:
Closed-form solution for w:
w = Σᵢ yᵢ(xᵢ - x̄) / [Σᵢ xᵢ² - (1/m)(Σᵢ xᵢ)²]
Closed-form solution for b:
b = (1/m) Σᵢ (yᵢ - wxᵢ) = ȳ - wx̄
where x̄ is the mean of x values and ȳ is the mean of y values
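These two formulas translate directly into a few lines of NumPy; a minimal sketch, assuming x and y are 1-D arrays of equal length:

```python
import numpy as np

def fit_simple_ols(x, y):
    """Closed-form w and b for single-feature least squares."""
    m = len(x)
    x_bar = x.mean()
    # w = sum_i y_i (x_i - x_bar) / (sum_i x_i^2 - (1/m)(sum_i x_i)^2)
    w = np.sum(y * (x - x_bar)) / (np.sum(x ** 2) - (np.sum(x) ** 2) / m)
    # b = y_bar - w * x_bar
    b = y.mean() - w * x_bar
    return w, b

# Tiny illustrative example where y is roughly 2x + 1
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.1, 4.9, 7.2, 9.0])
w, b = fit_simple_ols(x, y)
print(w, b)  # roughly 2 and 1
```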
For multiple features, we use matrix notation. Define the augmented weight vector ŵ = (w; b) and create an augmented feature matrix X by adding a column of ones for the bias term:
ŵ* = (XᵀX)⁻¹Xᵀy
This is the Normal Equation or Ordinary Least Squares (OLS) solution. It provides the exact optimal solution when XᵀX is invertible.
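A sketch of the normal equation in NumPy on illustrative data; solving the linear system is used instead of forming the inverse explicitly, which is numerically safer but mathematically equivalent when XᵀX is invertible:

```python
import numpy as np

# Illustrative data: m = 5 samples, d = 2 features
X = np.array([[1.0, 2.0],
              [2.0, 0.5],
              [3.0, 1.5],
              [4.0, 3.0],
              [5.0, 2.5]])
y = np.array([4.0, 4.5, 7.0, 10.5, 12.0])

# Augment X with a column of ones so the bias b is folded into w_hat
X_aug = np.hstack([X, np.ones((X.shape[0], 1))])

# Normal equation: w_hat = (X^T X)^{-1} X^T y
w_hat = np.linalg.solve(X_aug.T @ X_aug, X_aug.T @ y)

w, b = w_hat[:-1], w_hat[-1]
print("weights:", w, "bias:", b)
```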
The normal equation requires computing (XᵀX)⁻¹, which takes O(d³) time to invert (plus O(md²) to form XᵀX), where d is the number of features. For large d (thousands of features), this becomes expensive.
Alternative: Use gradient descent or stochastic gradient descent (SGD) for large-scale problems, which have better computational complexity and can handle massive datasets.
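For comparison, a minimal batch gradient descent sketch for the same MSE objective; the learning rate and iteration count are illustrative, and standardized features are assumed:

```python
import numpy as np

def gd_linear_regression(X, y, lr=0.01, n_iters=1000):
    """Batch gradient descent on MSE; X is (m, d), y is (m,)."""
    m, d = X.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(n_iters):
        y_hat = X @ w + b
        error = y_hat - y
        # Gradients of (1/m) * sum_i (y_hat_i - y_i)^2
        grad_w = (2.0 / m) * (X.T @ error)
        grad_b = (2.0 / m) * error.sum()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Usage (illustrative): w, b = gd_linear_regression(X_standardized, y)
```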
Real-world application of multiple linear regression
We have data from 200 house sales in suburban markets. Here's a sample of 8 properties:
| ID | Sqft | Bedrooms | Bathrooms | Location Score | Year Built | Price |
|---|---|---|---|---|---|---|
| 1 | 1,200 | 2 | 1 | 7/10 | 2005 | $285,000 |
| 2 | 1,800 | 3 | 2 | 8.5/10 | 2010 | $425,000 |
| 3 | 2,400 | 4 | 2.5 | 9/10 | 2015 | $575,000 |
| 4 | 1,500 | 3 | 2 | 6/10 | 2000 | $325,000 |
| 5 | 2,100 | 3 | 2 | 8/10 | 2012 | $485,000 |
| 6 | 950 | 2 | 1 | 5.5/10 | 1998 | $225,000 |
| 7 | 2,800 | 4 | 3 | 9.5/10 | 2018 | $695,000 |
| 8 | 1,650 | 3 | 2 | 7.5/10 | 2008 | $385,000 |
Square Footage: Continuous feature, typically 900-3000 sqft for this market
Bedrooms: Discrete feature, usually 2-5 bedrooms
Bathrooms: Can be fractional (half-baths), 1-3.5 typical range
Location Score: Neighborhood quality rating from 1-10 based on schools, safety, amenities
Year Built: Construction year, 1995-2020 range
Price (Target): Sale price in USD, $200k-$750k range
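As a sketch of how such a model can be fit in practice, the eight sample rows above can be passed to scikit-learn's LinearRegression; with only eight rows the coefficients will not match the illustrative 200-sample model shown below:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# The eight sample rows: sqft, bedrooms, bathrooms, location score, year built
X = np.array([
    [1200, 2, 1.0, 7.0, 2005],
    [1800, 3, 2.0, 8.5, 2010],
    [2400, 4, 2.5, 9.0, 2015],
    [1500, 3, 2.0, 6.0, 2000],
    [2100, 3, 2.0, 8.0, 2012],
    [ 950, 2, 1.0, 5.5, 1998],
    [2800, 4, 3.0, 9.5, 2018],
    [1650, 3, 2.0, 7.5, 2008],
])
y = np.array([285000, 425000, 575000, 325000, 485000, 225000, 695000, 385000])

model = LinearRegression().fit(X, y)
print(model.coef_, model.intercept_)
```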
After applying OLS to the full 200-sample dataset, we might obtain a model like:
price = 145.5 × sqft
+ 22,500 × bedrooms
+ 18,000 × bathrooms
+ 12,000 × location_score
+ 1,200 × year_built
- 2,350,000
Example: Predict the price of a house with 2,000 sqft, 3 bedrooms, 2.5 bathrooms, a location score of 8.5, built in 2015:
price = 145.5(2000) + 22500(3) + 18000(2.5) + 12000(8.5) + 1200(2015) - 2350000
price = 291,000 + 67,500 + 45,000 + 102,000 + 2,418,000 - 2,350,000
Predicted Price: $573,500
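The arithmetic can be checked with a single dot product using the same illustrative coefficients:

```python
import numpy as np

coef = np.array([145.5, 22500, 18000, 12000, 1200])  # sqft, bedrooms, bathrooms, location, year
x_new = np.array([2000, 3, 2.5, 8.5, 2015])
intercept = -2_350_000

price = coef @ x_new + intercept
print(price)  # 573500.0
```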
Conditions required for OLS to provide reliable estimates
Linearity: The relationship between the features and the target is linear, i.e., the true relationship is captured by y = wᵀx + b + noise.
How to check: Plot residuals vs predicted values. Should show random scatter with no patterns.
Independence: Observations are independent of each other. One data point doesn't influence another.
How to check: Especially important for time series data; use the Durbin-Watson test to detect autocorrelation in the residuals.
Homoscedasticity: The variance of the errors is constant across all levels of the independent variables (no heteroscedasticity).
How to check: Residual plot should show consistent spread. Breusch-Pagan test can formally test this.
Normality of residuals: Residuals (errors) should follow a normal distribution with mean zero.
How to check: Q-Q plot or Shapiro-Wilk test. Histogram of residuals should be bell-shaped.
No multicollinearity: Independent variables should not be highly correlated with each other.
How to check: Calculate VIF (Variance Inflation Factor). VIF > 10 indicates problematic multicollinearity.
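A sketch of how these checks might be run with statsmodels and scipy; the data here is randomly generated purely to make the snippet self-contained:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson
from statsmodels.stats.diagnostic import het_breuschpagan
from statsmodels.stats.outliers_influence import variance_inflation_factor
from scipy.stats import shapiro

# Placeholder data standing in for the housing feature matrix and target
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([1.0, 2.0, 0.5, -1.0, 0.3]) + rng.normal(size=200)

X_const = sm.add_constant(X)              # add intercept column
results = sm.OLS(y, X_const).fit()
residuals = results.resid

print("Durbin-Watson:", durbin_watson(residuals))            # ~2 means little autocorrelation
print("Breusch-Pagan p-value:", het_breuschpagan(residuals, X_const)[1])
stat, pval = shapiro(residuals)
print("Shapiro-Wilk p-value:", pval)
vif = [variance_inflation_factor(X_const, i) for i in range(1, X_const.shape[1])]
print("VIF per feature:", vif)
```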
Handling exponential growth and multiplicative relationships
When the target variable grows or shrinks exponentially, taking the logarithm can transform the relationship into a linear one. This is a special case of Generalized Linear Models (GLM).
ln(y) = wᵀx + b
Equivalently: y = e^(wᵀx + b)
If house prices show multiplicative effects (e.g., doubling square footage roughly doubles price in the luxury market), we might use:
ln(price) = 0.65 × ln(sqft) + 0.12 × bedrooms + 0.08 × bathrooms + ...
The coefficient 0.65 on ln(sqft) means a 1% increase in square footage leads to approximately a 0.65% increase in price (elasticity).
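A sketch of fitting this kind of model by log-transforming the relevant columns before running ordinary least squares (only two features from the housing example are used, for brevity):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Illustrative arrays; in practice these would come from the housing dataset
sqft = np.array([1200, 1800, 2400, 1500, 2100, 950, 2800, 1650], dtype=float)
bedrooms = np.array([2, 3, 4, 3, 3, 2, 4, 3], dtype=float)
price = np.array([285000, 425000, 575000, 325000, 485000, 225000, 695000, 385000], dtype=float)

# Log-transform the target and (for an elasticity interpretation) sqft
X = np.column_stack([np.log(sqft), bedrooms])
y = np.log(price)

model = LinearRegression().fit(X, y)
print(model.coef_)  # coefficient on ln(sqft) reads as % change in price per 1% change in sqft
```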
Create scatter plots of each feature vs the target variable. If you see roughly linear trends, linear regression is a good starting point. Check the correlation coefficient (r) - values above 0.3 or below -0.3 indicate meaningful linear relationships. Always validate using residual plots.
R² (coefficient of determination) measures the proportion of variance explained by the model, ranging from 0 to 1. However, R² always increases when adding more features, even irrelevant ones. Adjusted R² penalizes the addition of features that don't improve the model, making it better for model comparison and selection.
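The adjustment is a simple correction based on the sample size m and feature count d; a sketch:

```python
def adjusted_r2(r2, m, d):
    """Adjusted R^2 = 1 - (1 - R^2) * (m - 1) / (m - d - 1)."""
    return 1 - (1 - r2) * (m - 1) / (m - d - 1)

print(adjusted_r2(0.85, m=200, d=5))  # slightly below 0.85
```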
For the normal equation (closed-form OLS), normalization isn't strictly necessary but recommended for numerical stability. For gradient descent, normalization/standardization is crucial because features on different scales can cause the algorithm to converge slowly or oscillate. Standardization (z-score: subtract mean, divide by std) is preferred over min-max normalization for regression.
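A minimal sketch of z-score standardization on a feature matrix (column values taken loosely from the housing example):

```python
import numpy as np

X = np.array([[1200.0, 2, 2005],
              [1800.0, 3, 2010],
              [2400.0, 4, 2015]])

# Z-score standardization: subtract the column mean, divide by the column std
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
print(X_std)
```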
Yes, but be careful about the independence assumption. Time series data is often autocorrelated (today's value depends on yesterday's). You can include lagged values as features, but consider specialized time series methods like ARIMA or exponential smoothing. Always test for autocorrelation in residuals using the Durbin-Watson statistic.
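A sketch of building lagged features with pandas; the column name sales and its values are purely illustrative:

```python
import pandas as pd

df = pd.DataFrame({"sales": [100, 120, 130, 125, 140, 150]})

# Previous-period values as features; drop rows made incomplete by the shift
df["sales_lag1"] = df["sales"].shift(1)
df["sales_lag2"] = df["sales"].shift(2)
df = df.dropna()
print(df)
```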
Use one-hot encoding (also called dummy variable encoding). For a categorical feature with k categories, create k-1 binary features to avoid multicollinearity (the "dummy variable trap"). For example, for house_type with values (apartment, townhouse, detached), create two features: is_townhouse and is_detached, with apartment as the reference category.
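A sketch with pandas, where drop_first=True implements the k-1 dummy encoding described above:

```python
import pandas as pd

df = pd.DataFrame({"house_type": ["apartment", "townhouse", "detached", "apartment"]})

# drop_first=True keeps k-1 dummies; 'apartment' becomes the reference category
dummies = pd.get_dummies(df["house_type"], prefix="is", drop_first=True)
print(dummies)  # columns: is_detached, is_townhouse
```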