
Why Linear Regression Uses Squared Error

The loss looks simple because the statistical story behind it is simple.

Tags: Linear Regression, Squared Error, Maximum Likelihood, Gradient Descent, Statistics

A house is worth $480,000. Your model predicts $500,000. Why square the $20,000 error instead of just taking its absolute size and moving on?

That question sounds cosmetic, but it is not. The loss function is not decoration. It encodes what kinds of mistakes you care about and what kind of noise you believe your data contains. In linear regression, squared error shows up for a very specific reason: it is the objective implied by a Gaussian noise model.

Once you see that link, squared error stops looking like an arbitrary classroom choice. It becomes the natural consequence of a statistical assumption.

The model is simple. The assumptions are not.

Linear regression says the prediction is an affine function of the input features:

$$\hat{y} = w_1 x_1 + \cdots + w_d x_d + b = \mathbf{w}^\top\mathbf{x} + b$$

That compact formula hides the usual machine-learning vocabulary:

  • A sample is one row of data.
  • A feature is one input variable such as square footage, age, or years of experience.
  • A label is the target value you want to predict.
  • The weights tell you how strongly each feature influences the prediction.
  • The bias shifts the whole prediction line or hyperplane up and down.

The model itself is straightforward. What matters is the gap between prediction and reality. That gap is the residual:

$$r^{(i)} = \hat{y}^{(i)} - y^{(i)}$$

Training means choosing $\mathbf{w}$ and $b$ so those residuals are as small as possible across the dataset.
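In code, the prediction and the residuals are one line each. A minimal NumPy sketch with made-up house data and hand-picked parameters (the numbers are illustrative, not fitted):

```python
import numpy as np

# Hypothetical toy data: square footage and age for three houses.
X = np.array([[1400.0, 10.0],
              [2000.0,  3.0],
              [1100.0, 25.0]])
y = np.array([480_000.0, 650_000.0, 330_000.0])  # labels (sale prices)

# Hand-picked parameters, purely for illustration.
w = np.array([300.0, -1500.0])  # dollars per square foot, per year of age
b = 50_000.0                    # bias

y_hat = X @ w + b               # affine prediction for every sample at once
residuals = y_hat - y           # the gap between prediction and reality
```

Training is then the search for the `w` and `b` that shrink `residuals` across the whole dataset, which is exactly what the loss function will score.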

Squared error measures more than "wrongness"

For one sample, the standard loss is

$$\ell^{(i)}(\mathbf{w}, b) = \frac{1}{2}\left(\hat{y}^{(i)} - y^{(i)}\right)^2$$

The factor of $\frac{1}{2}$ is just algebraic housekeeping. It cancels the $2$ that appears during differentiation.

Squaring the residual does three useful things at once:

  • It makes positive and negative errors equally costly.
  • It penalizes large misses more heavily than small ones.
  • It produces a smooth objective with an easy gradient.
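The first two properties are easy to check directly. A quick sketch of the per-sample loss, using the $480,000 house from the opening example:

```python
def squared_error(y_hat, y):
    """Per-sample squared-error loss with the conventional 1/2 factor."""
    return 0.5 * (y_hat - y) ** 2

# Overshooting and undershooting by $20,000 cost exactly the same.
assert squared_error(500_000, 480_000) == squared_error(460_000, 480_000)

# Doubling the miss quadruples the penalty: large errors dominate the loss.
assert squared_error(520_000, 480_000) == 4 * squared_error(500_000, 480_000)
```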

Those are practical benefits. They are not the deepest reason squared error became the default. The deeper reason comes from probability.

There is a closed-form answer

Linear regression is one of the rare cases where you can sometimes solve for the optimal parameters directly instead of using an iterative optimizer. If you fold the bias term into the design matrix by appending a column of ones, you can write the whole model as

$$\tilde{\mathbf{X}} = [\mathbf{X} \;\; \mathbf{1}], \qquad \boldsymbol{\theta} = \begin{bmatrix} \mathbf{w} \\ b \end{bmatrix}, \qquad \hat{\mathbf{y}} = \tilde{\mathbf{X}}\boldsymbol{\theta}$$

The least-squares objective becomes

$$L(\boldsymbol{\theta}) = \|\mathbf{y} - \tilde{\mathbf{X}}\boldsymbol{\theta}\|_2^2$$

Setting its gradient to zero gives the normal equation:

$$\boldsymbol{\theta}^* = (\tilde{\mathbf{X}}^\top \tilde{\mathbf{X}})^{-1}\tilde{\mathbf{X}}^\top\mathbf{y}$$

That is the analytic, or closed-form, solution. If $\tilde{\mathbf{X}}^\top \tilde{\mathbf{X}}$ is singular, you use a pseudoinverse or regularization instead. Either way, linear regression is one of the few core models where the math can sometimes jump straight to the answer.
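The whole recipe fits in a few lines of NumPy. This sketch fabricates a small synthetic dataset, appends the column of ones, and solves the normal equation (using `np.linalg.solve` rather than forming the inverse explicitly, which is numerically safer):

```python
import numpy as np

# Synthetic data with known ground-truth parameters, for illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_w, true_b = np.array([2.0, -1.0, 0.5]), 3.0
y = X @ true_w + true_b + 0.1 * rng.normal(size=100)

# Append a column of ones so the bias rides along inside theta.
X_tilde = np.hstack([X, np.ones((100, 1))])

# Normal equation: solve (X~^T X~) theta = X~^T y in one shot.
theta = np.linalg.solve(X_tilde.T @ X_tilde, X_tilde.T @ y)
w, b = theta[:-1], theta[-1]
```

If `X_tilde.T @ X_tilde` is singular or badly conditioned, `np.linalg.lstsq` or `np.linalg.pinv` gives the pseudoinverse-based answer instead.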

Why we still use gradient descent anyway

Closed-form solutions are elegant, but they are not always the best engineering choice. Large datasets make the matrix operations expensive. Deep models do not have a closed form at all. And once you add constraints, custom losses, or nonlinear structure, you are back in the world of iterative optimization.

That is where gradient descent enters. For mini-batch training, the parameter updates look like this:

$$\mathbf{w} \leftarrow \mathbf{w} - \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \nabla_{\mathbf{w}} \ell^{(i)}(\mathbf{w}, b), \qquad b \leftarrow b - \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \frac{\partial \ell^{(i)}(\mathbf{w}, b)}{\partial b}$$

Here $\eta$ is the learning rate and $\mathcal{B}$ is the mini-batch. The idea is the same one used everywhere in deep learning: estimate the slope of the loss, then step downhill.
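The update rule translates almost symbol for symbol into NumPy. A minimal mini-batch sketch on synthetic data (the learning rate, batch size, and epoch count are arbitrary illustrative choices):

```python
import numpy as np

# Synthetic data with known ground truth, for illustration.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = X @ np.array([1.5, -2.0]) + 0.7 + 0.05 * rng.normal(size=200)

w, b = np.zeros(2), 0.0
eta, batch_size = 0.1, 20

for epoch in range(200):
    perm = rng.permutation(len(y))          # reshuffle each epoch
    for start in range(0, len(y), batch_size):
        idx = perm[start:start + batch_size]
        Xb, yb = X[idx], y[idx]
        err = Xb @ w + b - yb               # residuals on the mini-batch
        w -= eta * (Xb.T @ err) / len(idx)  # average gradient w.r.t. w
        b -= eta * err.mean()               # average gradient w.r.t. b
```

The gradient of $\frac{1}{2}(\hat{y} - y)^2$ with respect to $\mathbf{w}$ is just the residual times the input, which is why `Xb.T @ err` is the entire gradient computation.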

Analytic solutions give you the exact minimizer in one shot. Numerical methods give you a sequence of improving guesses.

Linear regression happens to support both viewpoints. That makes it a great model for understanding the bridge between classical statistics and modern deep learning.

The real reason squared error appears

Now for the statistical core. Suppose the data are generated by

$$y = \mathbf{w}^\top\mathbf{x} + b + \epsilon, \qquad \epsilon \sim \mathcal{N}(0, \sigma^2)$$

That assumption says the target is a linear signal plus Gaussian noise. Under that model, the conditional density of one observation is

$$p(y \mid \mathbf{x}; \mathbf{w}, b) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(y - \mathbf{w}^\top\mathbf{x} - b)^2}{2\sigma^2}\right)$$

If you maximize the likelihood of the whole dataset, you multiply those probabilities over all samples. Taking the negative log turns that product into a sum:

$$-\log p(\mathbf{y} \mid \mathbf{X}; \mathbf{w}, b) = \sum_{i=1}^{n} \left[\frac{1}{2}\log(2\pi\sigma^2) + \frac{1}{2\sigma^2}\left(y^{(i)} - \mathbf{w}^\top\mathbf{x}^{(i)} - b\right)^2\right]$$

The first term is constant with respect to the parameters. The factor $\frac{1}{2\sigma^2}$ is also irrelevant to the minimizer. What remains is the sum of squared residuals.

$$\arg\max_{\mathbf{w}, b}\, p(\mathbf{y} \mid \mathbf{X}; \mathbf{w}, b) = \arg\min_{\mathbf{w}, b} \sum_{i=1}^{n} \left(y^{(i)} - \mathbf{w}^\top\mathbf{x}^{(i)} - b\right)^2$$

That is why squared error is so standard. Under Gaussian noise, least squares is maximum likelihood.
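You can check the equivalence numerically: the negative log-likelihood is an affine function of the sum of squared residuals, so both rank any set of candidate parameters identically. A small sketch with synthetic 1-D data and $\sigma^2$ fixed at 1:

```python
import numpy as np

# Synthetic 1-D data, for illustration: y = 2x + 1 + unit Gaussian noise.
rng = np.random.default_rng(2)
x = rng.normal(size=50)
y = 2.0 * x + 1.0 + rng.normal(size=50)
sigma2 = 1.0

def sse(w, b):
    """Sum of squared residuals."""
    return np.sum((y - w * x - b) ** 2)

def neg_log_likelihood(w, b):
    """Gaussian negative log-likelihood of the dataset."""
    r = y - w * x - b
    return np.sum(0.5 * np.log(2 * np.pi * sigma2) + r ** 2 / (2 * sigma2))

# NLL = constant + SSE / (2 sigma^2), so both losses rank parameters the same.
candidates = [(2.0, 1.0), (1.5, 0.5), (0.0, 0.0)]
sse_order = sorted(candidates, key=lambda p: sse(*p))
nll_order = sorted(candidates, key=lambda p: neg_log_likelihood(*p))
assert sse_order == nll_order
```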

If you change the noise model, the loss changes too

This is the part that clarifies almost everything. Loss functions are not chosen in isolation. They correspond to assumptions.

If you assume Laplace noise instead of Gaussian noise, the negative log-likelihood becomes absolute error, not squared error. That makes the estimator more resistant to outliers, but it also changes the optimization behavior because the loss is no longer smooth at zero.
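The contrast is starkest in the simplest possible model, fitting a single constant: squared error is minimized by the mean, absolute error by the median. One outlier makes the difference obvious:

```python
import numpy as np

# One tight cluster of observations plus a single wild outlier.
y = np.array([10.0, 11.0, 9.5, 10.5, 100.0])

mean_fit = y.mean()        # minimizes squared error; dragged toward the outlier
median_fit = np.median(y)  # minimizes absolute error; barely moves
```

The Gaussian-implied estimate lands near 28 while the Laplace-implied estimate stays near 10.5, which is the robustness-versus-smoothness trade-off in miniature.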

So the question is not "Which loss is mathematically prettier?" It is "What kind of data-generating story do I believe, and what training behavior can I tolerate?"

The takeaway

Linear regression teaches three ideas that show up everywhere else in machine learning:

  • The model class reflects a structural assumption.
  • The loss function reflects a statistical assumption.
  • The optimizer tells you how you will search for the best parameters.

Squared error wins in linear regression because the Gaussian noise model leads straight to it. The fact that it is smooth, differentiable, and easy to optimize is a major bonus. But the real reason is probabilistic, not aesthetic.
