
Why Linear Regression Uses Squared Error

The loss looks simple because the statistical story behind it is simple.

Tags: Linear Regression, Squared Error, Maximum Likelihood, Gradient Descent, Statistics

A house is worth $480,000. Your model predicts $500,000. Why square the $20,000 error instead of just taking its absolute size and moving on?

That question sounds cosmetic, but it is not. The loss function is not decoration. It encodes what kinds of mistakes you care about and what kind of noise you believe your data contains. In linear regression, squared error shows up for a very specific reason: it is the objective implied by a Gaussian noise model.

Once you see that link, squared error stops looking like an arbitrary classroom choice. It becomes the natural consequence of a statistical assumption.

The model is simple. The assumptions are not.

Linear regression says the prediction is an affine function of the input features:

$$\hat{y} = w_1 x_1 + \cdots + w_d x_d + b = \mathbf{w}^\top\mathbf{x} + b$$

That compact formula hides the usual machine-learning vocabulary:

  • A sample is one row of data.
  • A feature is one input variable such as square footage, age, or years of experience.
  • A label is the target value you want to predict.
  • The weights tell you how strongly each feature influences the prediction.
  • The bias shifts the whole prediction line or hyperplane up and down.

The model itself is straightforward. What matters is the gap between prediction and reality. That gap is the residual:

$$r^{(i)} = \hat{y}^{(i)} - y^{(i)}$$

Training means choosing $\mathbf{w}$ and $b$ so those residuals are as small as possible across the dataset.
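In code, the prediction and the residuals are one line each. A minimal NumPy sketch with made-up house data and hand-picked parameters (the numbers are illustrative, not fitted):

```python
import numpy as np

# Hypothetical toy data: square footage and age for three houses.
X = np.array([[1400.0, 10.0],
              [2000.0,  3.0],
              [1100.0, 25.0]])
y = np.array([480_000.0, 650_000.0, 330_000.0])  # labels (sale prices)

# Hand-picked parameters, purely for illustration.
w = np.array([300.0, -1500.0])  # dollars per square foot, per year of age
b = 50_000.0                    # bias

y_hat = X @ w + b               # affine prediction for every sample at once
residuals = y_hat - y           # the gap between prediction and reality
```

Training is then the search for the `w` and `b` that shrink `residuals` across the whole dataset, which is exactly what the loss function will score.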

Squared error measures more than "wrongness"

For one sample, the standard loss is

$$\ell^{(i)}(\mathbf{w}, b) = \frac{1}{2}\left(\hat{y}^{(i)} - y^{(i)}\right)^2$$

The factor of $\frac{1}{2}$ is just algebraic housekeeping. It cancels the $2$ that appears during differentiation.

Squaring the residual does three useful things at once:

  • It makes positive and negative errors equally costly.
  • It penalizes large misses more heavily than small ones.
  • It produces a smooth objective with an easy gradient.
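The first two properties are easy to check directly. A quick sketch of the per-sample loss, using the $480,000 house from the opening example:

```python
def squared_error(y_hat, y):
    """Per-sample squared-error loss with the conventional 1/2 factor."""
    return 0.5 * (y_hat - y) ** 2

# Overshooting and undershooting by $20,000 cost exactly the same.
assert squared_error(500_000, 480_000) == squared_error(460_000, 480_000)

# Doubling the miss quadruples the penalty: large errors dominate the loss.
assert squared_error(520_000, 480_000) == 4 * squared_error(500_000, 480_000)
```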

Those are practical benefits. They are not the deepest reason squared error became the default. The deeper reason comes from probability.

There is a closed-form answer

Linear regression is one of the rare cases where you can sometimes solve for the optimal parameters directly instead of using an iterative optimizer. If you fold the bias term into the design matrix by appending a column of ones, you can write the whole model as

$$\tilde{\mathbf{X}} = [\mathbf{X} \;\; \mathbf{1}], \qquad \boldsymbol{\theta} = \begin{bmatrix} \mathbf{w} \\ b \end{bmatrix}, \qquad \hat{\mathbf{y}} = \tilde{\mathbf{X}}\boldsymbol{\theta}$$

The least-squares objective becomes

$$L(\boldsymbol{\theta}) = \|\mathbf{y} - \tilde{\mathbf{X}}\boldsymbol{\theta}\|_2^2$$

Setting its gradient to zero gives the normal equation:

$$\boldsymbol{\theta}^* = (\tilde{\mathbf{X}}^\top \tilde{\mathbf{X}})^{-1}\tilde{\mathbf{X}}^\top\mathbf{y}$$

That is the analytic, or closed-form, solution. If $\tilde{\mathbf{X}}^\top \tilde{\mathbf{X}}$ is singular, you use a pseudoinverse or regularization instead. Either way, linear regression is one of the few core models where the math can sometimes jump straight to the answer.
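The whole recipe fits in a few lines of NumPy. This sketch fabricates a small synthetic dataset, appends the column of ones, and solves the normal equation (using `np.linalg.solve` rather than forming the inverse explicitly, which is numerically safer):

```python
import numpy as np

# Synthetic data with known ground-truth parameters, for illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_w, true_b = np.array([2.0, -1.0, 0.5]), 3.0
y = X @ true_w + true_b + 0.1 * rng.normal(size=100)

# Append a column of ones so the bias rides along inside theta.
X_tilde = np.hstack([X, np.ones((100, 1))])

# Normal equation: solve (X~^T X~) theta = X~^T y in one shot.
theta = np.linalg.solve(X_tilde.T @ X_tilde, X_tilde.T @ y)
w, b = theta[:-1], theta[-1]
```

If `X_tilde.T @ X_tilde` is singular or badly conditioned, `np.linalg.lstsq` or `np.linalg.pinv` gives the pseudoinverse-based answer instead.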

Why we still use gradient descent anyway

Closed-form solutions are elegant, but they are not always the best engineering choice. Large datasets make the matrix operations expensive. Deep models do not have a closed form at all. And once you add constraints, custom losses, or nonlinear structure, you are back in the world of iterative optimization.

That is where gradient descent enters. For mini-batch training, the parameter updates look like this:

$$\mathbf{w} \leftarrow \mathbf{w} - \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \nabla_{\mathbf{w}} \ell^{(i)}(\mathbf{w}, b), \qquad b \leftarrow b - \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \frac{\partial \ell^{(i)}(\mathbf{w}, b)}{\partial b}$$

Here $\eta$ is the learning rate and $\mathcal{B}$ is the mini-batch. The idea is the same one used everywhere in deep learning: estimate the slope of the loss, then step downhill.
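The update rule translates almost symbol for symbol into NumPy. A minimal mini-batch sketch on synthetic data (the learning rate, batch size, and epoch count are arbitrary illustrative choices):

```python
import numpy as np

# Synthetic data with known ground truth, for illustration.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = X @ np.array([1.5, -2.0]) + 0.7 + 0.05 * rng.normal(size=200)

w, b = np.zeros(2), 0.0
eta, batch_size = 0.1, 20

for epoch in range(200):
    perm = rng.permutation(len(y))          # reshuffle each epoch
    for start in range(0, len(y), batch_size):
        idx = perm[start:start + batch_size]
        Xb, yb = X[idx], y[idx]
        err = Xb @ w + b - yb               # residuals on the mini-batch
        w -= eta * (Xb.T @ err) / len(idx)  # average gradient w.r.t. w
        b -= eta * err.mean()               # average gradient w.r.t. b
```

The gradient of $\frac{1}{2}(\hat{y} - y)^2$ with respect to $\mathbf{w}$ is just the residual times the input, which is why `Xb.T @ err` is the entire gradient computation.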

Analytic solutions give you the exact minimizer in one shot. Numerical methods give you a sequence of improving guesses.

Linear regression happens to support both viewpoints. That makes it a great model for understanding the bridge between classical statistics and modern deep learning.

The real reason squared error appears

Now for the statistical core. Suppose the data are generated by

$$y = \mathbf{w}^\top\mathbf{x} + b + \epsilon, \qquad \epsilon \sim \mathcal{N}(0, \sigma^2)$$

That assumption says the target is a linear signal plus Gaussian noise. Under that model, the conditional density of one observation is

$$p(y \mid \mathbf{x}; \mathbf{w}, b) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(y - \mathbf{w}^\top\mathbf{x} - b)^2}{2\sigma^2}\right)$$

If you maximize the likelihood of the whole dataset, you multiply those probabilities over all samples. Taking the negative log turns that product into a sum:

$$-\log p(\mathbf{y} \mid \mathbf{X}; \mathbf{w}, b) = \sum_{i=1}^{n} \left[\frac{1}{2}\log(2\pi\sigma^2) + \frac{1}{2\sigma^2}\left(y^{(i)} - \mathbf{w}^\top\mathbf{x}^{(i)} - b\right)^2\right]$$

The first term is constant with respect to the parameters. The factor $\frac{1}{2\sigma^2}$ is also irrelevant to the minimizer. What remains is the sum of squared residuals.

$$\arg\max_{\mathbf{w}, b}\, p(\mathbf{y} \mid \mathbf{X}; \mathbf{w}, b) = \arg\min_{\mathbf{w}, b} \sum_{i=1}^{n} \left(y^{(i)} - \mathbf{w}^\top\mathbf{x}^{(i)} - b\right)^2$$

That is why squared error is so standard. Under Gaussian noise, least squares is maximum likelihood.
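You can check the equivalence numerically: the negative log-likelihood is an affine function of the sum of squared residuals, so both rank any set of candidate parameters identically. A small sketch with synthetic 1-D data and $\sigma^2$ fixed at 1:

```python
import numpy as np

# Synthetic 1-D data, for illustration: y = 2x + 1 + unit Gaussian noise.
rng = np.random.default_rng(2)
x = rng.normal(size=50)
y = 2.0 * x + 1.0 + rng.normal(size=50)
sigma2 = 1.0

def sse(w, b):
    """Sum of squared residuals."""
    return np.sum((y - w * x - b) ** 2)

def neg_log_likelihood(w, b):
    """Gaussian negative log-likelihood of the dataset."""
    r = y - w * x - b
    return np.sum(0.5 * np.log(2 * np.pi * sigma2) + r ** 2 / (2 * sigma2))

# NLL = constant + SSE / (2 sigma^2), so both losses rank parameters the same.
candidates = [(2.0, 1.0), (1.5, 0.5), (0.0, 0.0)]
sse_order = sorted(candidates, key=lambda p: sse(*p))
nll_order = sorted(candidates, key=lambda p: neg_log_likelihood(*p))
assert sse_order == nll_order
```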

If you change the noise model, the loss changes too

This is the part that clarifies almost everything. Loss functions are not chosen in isolation. They correspond to assumptions.

If you assume Laplace noise instead of Gaussian noise, the negative log-likelihood becomes absolute error, not squared error. That makes the estimator more resistant to outliers, but it also changes the optimization behavior because the loss is no longer smooth at zero.
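The contrast is starkest in the simplest possible model, fitting a single constant: squared error is minimized by the mean, absolute error by the median. One outlier makes the difference obvious:

```python
import numpy as np

# One tight cluster of observations plus a single wild outlier.
y = np.array([10.0, 11.0, 9.5, 10.5, 100.0])

mean_fit = y.mean()        # minimizes squared error; dragged toward the outlier
median_fit = np.median(y)  # minimizes absolute error; barely moves
```

The Gaussian-implied estimate lands near 28 while the Laplace-implied estimate stays near 10.5, which is the robustness-versus-smoothness trade-off in miniature.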

So the question is not "Which loss is mathematically prettier?" It is "What kind of data-generating story do I believe, and what training behavior can I tolerate?"

The takeaway

Linear regression teaches three ideas that show up everywhere else in machine learning:

  • The model class reflects a structural assumption.
  • The loss function reflects a statistical assumption.
  • The optimizer tells you how you will search for the best parameters.

Squared error wins in linear regression because the Gaussian noise model leads straight to it. The fact that it is smooth, differentiable, and easy to optimize is a major bonus. But the real reason is probabilistic, not aesthetic.
