Learn how to predict continuous values using the least squares method, from simple regression to multiple variables with real housing price examples
Linear regression is a supervised learning algorithm for predicting continuous numerical values. Given a dataset D = {(x₁, y₁), (x₂, y₂), ..., (xₘ, yₘ)}, the goal is to learn a linear function that maps input features to output values with minimal error.
Simple Linear Regression (one feature):
y = wx + b
Multiple Linear Regression (multiple features):
y = w₁x₁ + w₂x₂ + ... + w_d x_d + b = wᵀx + b
Find the weights (w) and bias (b) that minimize the difference between predicted values (ŷ = wx + b) and actual values (y) across all training samples.
We measure error using Mean Squared Error (MSE), which penalizes large errors more heavily. Squaring ensures every error contributes a non-negative amount and makes the loss smooth and differentiable, which is convenient for optimization.
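As a concrete illustration, here is a minimal NumPy sketch (all array values are made up for the example) of computing ŷ = wᵀx + b for a batch of samples and the resulting MSE:

```python
import numpy as np

# Illustrative data: m = 4 samples, d = 2 features
X = np.array([[1.0, 2.0],
              [2.0, 1.0],
              [3.0, 4.0],
              [4.0, 3.0]])
y = np.array([5.0, 4.0, 11.0, 10.0])

# Candidate parameters (made up for illustration)
w = np.array([1.5, 1.0])
b = 0.5

# Predictions: y_hat_i = w . x_i + b
y_hat = X @ w + b

# Mean squared error over the m samples
mse = np.mean((y - y_hat) ** 2)
print(y_hat, mse)
```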
The mathematical foundation for finding optimal regression parameters
The goal is to minimize the sum of squared errors between predictions and actual values:
E(w,b) = Σᵢ₌₁ᵐ (yᵢ - ŷᵢ)² = Σᵢ₌₁ᵐ (yᵢ - wxᵢ - b)²
where m is the number of samples, yᵢ is the actual value, and ŷᵢ is the predicted value
For single-variable regression (one feature), we can find the optimal w and b by taking partial derivatives of E(w,b) and setting them to zero:
Closed-form solution for w:
w = Σᵢ yᵢ(xᵢ - x̄) / [Σᵢ xᵢ² - (1/m)(Σᵢ xᵢ)²]
Closed-form solution for b:
b = (1/m) Σᵢ (yᵢ - wxᵢ) = ȳ - wx̄
where x̄ is the mean of x values and ȳ is the mean of y values
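These two formulas translate directly into a few lines of NumPy; a minimal sketch, assuming x and y are 1-D arrays of equal length:

```python
import numpy as np

def fit_simple_ols(x, y):
    """Closed-form w and b for single-feature least squares."""
    m = len(x)
    x_bar = x.mean()
    # w = sum_i y_i (x_i - x_bar) / (sum_i x_i^2 - (1/m)(sum_i x_i)^2)
    w = np.sum(y * (x - x_bar)) / (np.sum(x ** 2) - (np.sum(x) ** 2) / m)
    # b = y_bar - w * x_bar
    b = y.mean() - w * x_bar
    return w, b

# Tiny illustrative example where y is roughly 2x + 1
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.1, 4.9, 7.2, 9.0])
w, b = fit_simple_ols(x, y)
print(w, b)  # roughly 2 and 1
```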
For multiple features, we use matrix notation. Define the augmented weight vector ŵ = (w; b) and create an augmented feature matrix X by adding a column of ones for the bias term:
ŵ* = (XᵀX)⁻¹Xᵀy
This is the Normal Equation or Ordinary Least Squares (OLS) solution. It provides the exact optimal solution when XᵀX is invertible.
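A sketch of the normal equation in NumPy on illustrative data; solving the linear system is used instead of forming the inverse explicitly, which is numerically safer but mathematically equivalent when XᵀX is invertible:

```python
import numpy as np

# Illustrative data: m = 5 samples, d = 2 features
X = np.array([[1.0, 2.0],
              [2.0, 0.5],
              [3.0, 1.5],
              [4.0, 3.0],
              [5.0, 2.5]])
y = np.array([4.0, 4.5, 7.0, 10.5, 12.0])

# Augment X with a column of ones so the bias b is folded into w_hat
X_aug = np.hstack([X, np.ones((X.shape[0], 1))])

# Normal equation: w_hat = (X^T X)^{-1} X^T y
w_hat = np.linalg.solve(X_aug.T @ X_aug, X_aug.T @ y)

w, b = w_hat[:-1], w_hat[-1]
print("weights:", w, "bias:", b)
```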
The normal equation requires computing (XᵀX)⁻¹, which takes O(d³) time to invert (plus O(md²) to form XᵀX), where d is the number of features. For large d (thousands of features), this becomes expensive.
Alternative: Use gradient descent or stochastic gradient descent (SGD) for large-scale problems, which have better computational complexity and can handle massive datasets.
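For comparison, a minimal batch gradient descent sketch for the same MSE objective; the learning rate and iteration count are illustrative, and standardized features are assumed:

```python
import numpy as np

def gd_linear_regression(X, y, lr=0.01, n_iters=1000):
    """Batch gradient descent on MSE; X is (m, d), y is (m,)."""
    m, d = X.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(n_iters):
        y_hat = X @ w + b
        error = y_hat - y
        # Gradients of (1/m) * sum_i (y_hat_i - y_i)^2
        grad_w = (2.0 / m) * (X.T @ error)
        grad_b = (2.0 / m) * error.sum()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Usage (illustrative): w, b = gd_linear_regression(X_standardized, y)
```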
Real-world application of multiple linear regression
We have data from 200 house sales in suburban markets. Here's a sample of 8 properties:
| ID | Sqft | Bedrooms | Bathrooms | Location Score | Year Built | Price |
|---|---|---|---|---|---|---|
| 1 | 1,200 | 2 | 1 | 7/10 | 2005 | $285,000 |
| 2 | 1,800 | 3 | 2 | 8.5/10 | 2010 | $425,000 |
| 3 | 2,400 | 4 | 2.5 | 9/10 | 2015 | $575,000 |
| 4 | 1,500 | 3 | 2 | 6/10 | 2000 | $325,000 |
| 5 | 2,100 | 3 | 2 | 8/10 | 2012 | $485,000 |
| 6 | 950 | 2 | 1 | 5.5/10 | 1998 | $225,000 |
| 7 | 2,800 | 4 | 3 | 9.5/10 | 2018 | $695,000 |
| 8 | 1,650 | 3 | 2 | 7.5/10 | 2008 | $385,000 |
Square Footage: Continuous feature, typically 900-3000 sqft for this market
Bedrooms: Discrete feature, usually 2-5 bedrooms
Bathrooms: Can be fractional (half-baths), 1-3.5 typical range
Location Score: Neighborhood quality rating from 1-10 based on schools, safety, amenities
Year Built: Construction year, 1995-2020 range
Price (Target): Sale price in USD, $200k-$750k range
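As a sketch of how such a model can be fit in practice, the eight sample rows above can be passed to scikit-learn's LinearRegression; with only eight rows the coefficients will not match the illustrative 200-sample model shown below:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# The eight sample rows: sqft, bedrooms, bathrooms, location score, year built
X = np.array([
    [1200, 2, 1.0, 7.0, 2005],
    [1800, 3, 2.0, 8.5, 2010],
    [2400, 4, 2.5, 9.0, 2015],
    [1500, 3, 2.0, 6.0, 2000],
    [2100, 3, 2.0, 8.0, 2012],
    [ 950, 2, 1.0, 5.5, 1998],
    [2800, 4, 3.0, 9.5, 2018],
    [1650, 3, 2.0, 7.5, 2008],
])
y = np.array([285000, 425000, 575000, 325000, 485000, 225000, 695000, 385000])

model = LinearRegression().fit(X, y)
print(model.coef_, model.intercept_)
```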
After applying OLS to the full 200-sample dataset, we might obtain a model like:
price = 145.5 × sqft
+ 22,500 × bedrooms
+ 18,000 × bathrooms
+ 12,000 × location_score
+ 1,200 × year_built
- 2,350,000
Example: Predict the price of a house with 2,000 sqft, 3 bedrooms, 2.5 bathrooms, a location score of 8.5, built in 2015:
price = 145.5(2000) + 22500(3) + 18000(2.5) + 12000(8.5) + 1200(2015) - 2350000
price = 291,000 + 67,500 + 45,000 + 102,000 + 2,418,000 - 2,350,000
Predicted Price: $573,500
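The arithmetic can be checked with a single dot product using the same illustrative coefficients:

```python
import numpy as np

coef = np.array([145.5, 22500, 18000, 12000, 1200])  # sqft, bedrooms, bathrooms, location, year
x_new = np.array([2000, 3, 2.5, 8.5, 2015])
intercept = -2_350_000

price = coef @ x_new + intercept
print(price)  # 573500.0
```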
Conditions required for OLS to provide reliable estimates
Linearity: The relationship between the features and the target is linear, i.e., the true relationship is captured by y = wᵀx + b + noise.
How to check: Plot residuals vs predicted values. Should show random scatter with no patterns.
Independence: Observations are independent of each other. One data point doesn't influence another.
How to check: Especially important for time series data; use the Durbin-Watson test to detect autocorrelation in the residuals.
Homoscedasticity: The variance of the errors is constant across all levels of the independent variables (no heteroscedasticity).
How to check: Residual plot should show consistent spread. Breusch-Pagan test can formally test this.
Normality of residuals: Residuals (errors) should follow a normal distribution with mean zero.
How to check: Q-Q plot or Shapiro-Wilk test. Histogram of residuals should be bell-shaped.
No multicollinearity: Independent variables should not be highly correlated with each other.
How to check: Calculate VIF (Variance Inflation Factor). VIF > 10 indicates problematic multicollinearity.
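A sketch of how these checks might be run with statsmodels and scipy; the data here is randomly generated purely to make the snippet self-contained:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson
from statsmodels.stats.diagnostic import het_breuschpagan
from statsmodels.stats.outliers_influence import variance_inflation_factor
from scipy.stats import shapiro

# Placeholder data standing in for the housing feature matrix and target
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([1.0, 2.0, 0.5, -1.0, 0.3]) + rng.normal(size=200)

X_const = sm.add_constant(X)              # add intercept column
results = sm.OLS(y, X_const).fit()
residuals = results.resid

print("Durbin-Watson:", durbin_watson(residuals))            # ~2 means little autocorrelation
print("Breusch-Pagan p-value:", het_breuschpagan(residuals, X_const)[1])
stat, pval = shapiro(residuals)
print("Shapiro-Wilk p-value:", pval)
vif = [variance_inflation_factor(X_const, i) for i in range(1, X_const.shape[1])]
print("VIF per feature:", vif)
```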
Handling exponential growth and multiplicative relationships
When the target variable grows or shrinks exponentially, taking the logarithm can transform the relationship into a linear one. This is a special case of Generalized Linear Models (GLM).
ln(y) = wᵀx + b
Equivalently: y = e^(wᵀx + b)
If house prices show multiplicative effects (e.g., doubling square footage roughly doubles price in the luxury market), we might use:
ln(price) = 0.65 × ln(sqft) + 0.12 × bedrooms + 0.08 × bathrooms + ...
The coefficient 0.65 on ln(sqft) means a 1% increase in square footage leads to approximately a 0.65% increase in price (elasticity).
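A sketch of fitting this kind of model by log-transforming the relevant columns before running ordinary least squares (only two features from the housing example are used, for brevity):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Illustrative arrays; in practice these would come from the housing dataset
sqft = np.array([1200, 1800, 2400, 1500, 2100, 950, 2800, 1650], dtype=float)
bedrooms = np.array([2, 3, 4, 3, 3, 2, 4, 3], dtype=float)
price = np.array([285000, 425000, 575000, 325000, 485000, 225000, 695000, 385000], dtype=float)

# Log-transform the target and (for an elasticity interpretation) sqft
X = np.column_stack([np.log(sqft), bedrooms])
y = np.log(price)

model = LinearRegression().fit(X, y)
print(model.coef_)  # coefficient on ln(sqft) reads as % change in price per 1% change in sqft
```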
Create scatter plots of each feature vs the target variable. If you see roughly linear trends, linear regression is a good starting point. Check the correlation coefficient (r) - values above 0.3 or below -0.3 indicate meaningful linear relationships. Always validate using residual plots.
R² (coefficient of determination) measures the proportion of variance explained by the model, ranging from 0 to 1. However, R² always increases when adding more features, even irrelevant ones. Adjusted R² penalizes the addition of features that don't improve the model, making it better for model comparison and selection.
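The adjustment is a simple correction based on the sample size m and feature count d; a sketch:

```python
def adjusted_r2(r2, m, d):
    """Adjusted R^2 = 1 - (1 - R^2) * (m - 1) / (m - d - 1)."""
    return 1 - (1 - r2) * (m - 1) / (m - d - 1)

print(adjusted_r2(0.85, m=200, d=5))  # slightly below 0.85
```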
For the normal equation (closed-form OLS), normalization isn't strictly necessary but recommended for numerical stability. For gradient descent, normalization/standardization is crucial because features on different scales can cause the algorithm to converge slowly or oscillate. Standardization (z-score: subtract mean, divide by std) is preferred over min-max normalization for regression.
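A minimal sketch of z-score standardization on a feature matrix (column values taken loosely from the housing example):

```python
import numpy as np

X = np.array([[1200.0, 2, 2005],
              [1800.0, 3, 2010],
              [2400.0, 4, 2015]])

# Z-score standardization: subtract the column mean, divide by the column std
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
print(X_std)
```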
Yes, but be careful about the independence assumption. Time series data is often autocorrelated (today's value depends on yesterday's). You can include lagged values as features, but consider specialized time series methods like ARIMA or exponential smoothing. Always test for autocorrelation in residuals using the Durbin-Watson statistic.
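A sketch of building lagged features with pandas; the column name sales and its values are purely illustrative:

```python
import pandas as pd

df = pd.DataFrame({"sales": [100, 120, 130, 125, 140, 150]})

# Previous-period values as features; drop rows made incomplete by the shift
df["sales_lag1"] = df["sales"].shift(1)
df["sales_lag2"] = df["sales"].shift(2)
df = df.dropna()
print(df)
```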
Use one-hot encoding (also called dummy variable encoding). For a categorical feature with k categories, create k-1 binary features to avoid multicollinearity (the "dummy variable trap"). For example, for house_type with values (apartment, townhouse, detached), create two features: is_townhouse and is_detached, with apartment as the reference category.
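A sketch with pandas, where drop_first=True implements the k-1 dummy encoding described above:

```python
import pandas as pd

df = pd.DataFrame({"house_type": ["apartment", "townhouse", "detached", "apartment"]})

# drop_first=True keeps k-1 dummies; 'apartment' becomes the reference category
dummies = pd.get_dummies(df["house_type"], prefix="is", drop_first=True)
print(dummies)  # columns: is_detached, is_townhouse
```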