A multilayer perceptron can approximate an enormous class of functions, so it is tempting to conclude that image recognition should simply be another function-approximation problem. Feed in pixels, predict a label, and let the model learn the rest.
The catch is that images are not arbitrary vectors. They are spatial objects. Once you forget that geometry, you force the model to relearn basic visual structure from scratch. That is why convolutional networks are not just "more powerful MLPs." They encode the right structural assumptions for image data.
Images are not tables
In tabular data, feature order is often incidental. Swapping two columns does not change what the columns mean. In an image, the opposite is true. A pixel's meaning depends heavily on where it sits and which pixels surround it.
An $H \times W$ image can be written as a matrix

$$X \in \mathbb{R}^{H \times W}, \quad \text{with entries } X_{i,j},$$

but an MLP typically flattens it into a vector

$$x \in \mathbb{R}^{HW}.$$
The raw numbers are preserved, but the local neighborhood structure is gone. The model no longer knows which pixels were horizontal neighbors, vertical neighbors, or part of the same local patch.
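A tiny NumPy sketch makes the loss concrete; the 4×4 image with coordinate-encoded values is purely illustrative:

```python
import numpy as np

# A toy 4x4 "image": the entry at (i, j) encodes its own flat index.
H, W = 4, 4
image = np.arange(H * W).reshape(H, W)

# Flattening preserves the values but discards the 2-D index structure.
flat = image.reshape(-1)

# In the image, pixel (1, 1) has its vertical neighbor (2, 1) directly
# below it. In the flattened vector those two pixels sit W positions
# apart, while the horizontal neighbor (1, 2) sits 1 position away.
print(flat[1 * W + 1], flat[2 * W + 1])  # flat indices 5 and 9

# Both adjacencies collapse into "some offset in a 1-D vector"; nothing
# tells the model that offset W means "directly below".
```

The values survive, but the fact that offset $W$ encodes vertical adjacency is exactly the structure a fully connected layer must rediscover on its own.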
Why flattening creates three immediate problems
First, flattening destroys locality. Visual patterns such as edges, corners, and textures are defined by relationships among nearby pixels, not isolated pixel values. A flattened MLP can still learn those relationships, but it must infer them with no architectural help.
Second, flattening makes translation expensive to learn. If a dog ear shifts from the upper-left region of the image to the lower-right region, humans still perceive the same local structure. A fully connected network sees a different set of coordinates lighting up and must essentially relearn the same feature in many positions.
Third, parameter counts explode. A single $H \times W$ grayscale image becomes a vector in $\mathbb{R}^{HW}$. Connecting that vector to just one hidden layer of width $k$ already requires on the order of $HWk$ weights.
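A back-of-the-envelope count shows the scale, using illustrative sizes (a 1-megapixel image and a hidden width of 1000 are assumptions, not numbers from the text):

```python
# Parameter count for one fully connected layer on a flattened
# grayscale image. Sizes below are illustrative assumptions.
H, W = 1000, 1000          # a 1-megapixel image
hidden_width = 1000        # one modest hidden layer

inputs = H * W                       # HW input features after flattening
weights = inputs * hidden_width      # one weight per (input, unit) pair
biases = hidden_width                # one bias per hidden unit
print(f"{weights + biases:,} parameters")  # on the order of 1e9
```

A billion parameters for a single layer, before the network has learned anything about edges or textures.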
What the fully connected image model is really assuming
If we keep the two-dimensional indexing explicit, a fully connected image-to-image layer looks like

$$H_{i,j} = u_{i,j} + \sum_{k}\sum_{l} W_{i,j,k,l}\, X_{k,l}.$$

This equation says that the output at position $(i,j)$ depends on every input position $(k,l)$, each with its own learned coefficient $W_{i,j,k,l}$. That is an extremely flexible model, but it makes three dubious commitments.
- It treats absolute location as essential, because the weight $W_{i,j,k,l}$ depends on both the input coordinate $(k,l)$ and the output coordinate $(i,j)$.
- It assumes distant pixels should be given the same modeling privilege as nearby pixels.
- It spends separate parameters learning what is often the same local pattern in many different places.
Those are exactly the inefficiencies convolution is designed to remove.
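The fully connected layer above can be written out directly; this sketch uses assumed 8×8 shapes to keep the fourth-order weight tensor small enough to instantiate:

```python
import numpy as np

# A fully connected image-to-image layer with explicit 2-D indices:
#   H[i, j] = u[i, j] + sum_{k, l} W[i, j, k, l] * X[k, l]
# Shapes are illustrative assumptions, not fixed by the text.
rng = np.random.default_rng(0)
h, w = 8, 8

X = rng.standard_normal((h, w))
Wt = rng.standard_normal((h, w, h, w))  # one weight per (output, input) pair
u = rng.standard_normal((h, w))

# einsum contracts over every input position (k, l) for each output (i, j).
H_out = u + np.einsum("ijkl,kl->ij", Wt, X)

print(Wt.size)  # h*w*h*w = 4096 weights just for an 8x8 -> 8x8 map
```

Even at 8×8 the weight tensor is quartic in the side length; this is the object that convolution will collapse.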
Two simple visual priors change everything
CNNs are built around two assumptions that fit natural images remarkably well.
The first is locality: to decide whether a feature is present at $(i,j)$, nearby pixels usually matter far more than distant ones.
The second is translation reuse: the same kind of local edge, corner, or texture should be recognized with the same detector no matter where it appears.
Convolution does not become useful because it is mysterious. It becomes useful because it formalizes two ordinary facts about vision: nearby matters most, and reusable patterns should be reused.
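Both priors show up in a toy example. A hypothetical 2×2 vertical-edge detector (the values below are illustrative) responds identically wherever its pattern appears:

```python
import numpy as np

# A hypothetical 2x2 vertical-edge detector, applied by hand at two
# different locations of the same image.
detector = np.array([[1.0, -1.0],
                     [1.0, -1.0]])

def respond(X, i, j):
    # Detector response at (i, j): a dot product with the local patch.
    return np.sum(detector * X[i:i + 2, j:j + 2])

X = np.zeros((6, 6))
X[:, 2] = 1.0  # a vertical edge running down column 2

# Same local pattern, two positions, one shared detector:
print(respond(X, 0, 1), respond(X, 3, 1))  # identical responses
```

One detector, reused at every position, suffices for a pattern that a flattened MLP would have to relearn at each location separately.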
From full connectivity to convolution in three steps
Start from the full expression

$$H_{i,j} = u_{i,j} + \sum_{k}\sum_{l} W_{i,j,k,l}\, X_{k,l}.$$

First, rewrite input indices as offsets from the output location. Let $a = k - i$ and $b = l - j$, and define $V_{i,j,a,b} = W_{i,j,i+a,j+b}$. Then the same equation becomes

$$H_{i,j} = u_{i,j} + \sum_{a}\sum_{b} V_{i,j,a,b}\, X_{i+a,j+b}.$$
Second, impose parameter sharing by removing the dependence on $(i,j)$:

$$H_{i,j} = u + \sum_{a}\sum_{b} V_{a,b}\, X_{i+a,j+b}.$$
Now the same local detector is reused everywhere. Third, impose locality by restricting the offsets to a small window $|a| \le \Delta$, $|b| \le \Delta$:

$$H_{i,j} = u + \sum_{a=-\Delta}^{\Delta}\sum_{b=-\Delta}^{\Delta} V_{a,b}\, X_{i+a,j+b}.$$
That is the essential form of convolution. The operation no longer tries to learn a separate visual rule for every coordinate pair. It learns one local detector and slides it across the image.
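The final formula translates almost line for line into code. This naive sketch uses "valid" borders (an assumption; the text does not fix a padding scheme):

```python
import numpy as np

# Direct implementation of
#   H[i, j] = u + sum_{a, b in [-Δ, Δ]} V[a, b] * X[i+a, j+b]
# with "valid" borders: the window never leaves the image.
def conv2d_valid(X, V, u=0.0):
    kh, kw = V.shape                    # window of size (2Δ+1) x (2Δ+1)
    oh, ow = X.shape[0] - kh + 1, X.shape[1] - kw + 1
    H = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            # One shared detector V, slid across every location (i, j).
            H[i, j] = u + np.sum(V * X[i:i + kh, j:j + kw])
    return H

X = np.arange(25, dtype=float).reshape(5, 5)
V = np.ones((3, 3)) / 9.0               # a simple local averaging detector
print(conv2d_valid(X, V))
```

Because the same $V$ is applied at every $(i,j)$, shifting the input simply shifts the output. (Strictly speaking, the operation as written is cross-correlation, which is what deep learning libraries call convolution; the two differ only by flipping the detector.)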
Why channels fit naturally into this picture
Real images are not single matrices. RGB images have a third channel dimension indexed by $c$, and convolution extends naturally:

$$H_{i,j,d} = \sum_{a=-\Delta}^{\Delta}\sum_{b=-\Delta}^{\Delta}\sum_{c} V_{a,b,c,d}\, X_{i+a,j+b,c}.$$

Each output channel $d$ is a learned detector built from all input channels $c$. That lets the model combine local shape, contrast, and color information while still preserving the central idea of local shared structure.
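The channel formula maps onto code the same way; the shapes here (a 6×6×3 input, eight output channels) are illustrative assumptions:

```python
import numpy as np

# Multi-channel convolution, directly from
#   H[i, j, d] = sum_{a, b, c} V[a, b, c, d] * X[i+a, j+b, c]
# with "valid" borders, as in the single-channel sketch.
def conv2d_multichannel(X, V):
    kh, kw, cin, cout = V.shape
    oh, ow = X.shape[0] - kh + 1, X.shape[1] - kw + 1
    H = np.empty((oh, ow, cout))
    for i in range(oh):
        for j in range(ow):
            patch = X[i:i + kh, j:j + kw, :]      # local window, all channels
            # Each output channel d mixes every input channel c.
            H[i, j] = np.einsum("abc,abcd->d", patch, V)
    return H

rng = np.random.default_rng(0)
X = rng.standard_normal((6, 6, 3))       # an RGB-like image
V = rng.standard_normal((3, 3, 3, 8))    # 8 learned local detectors
print(conv2d_multichannel(X, V).shape)   # (4, 4, 8)
```

Note that locality applies only to the spatial offsets $a$ and $b$; every output channel still sees all input channels at each window.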
The main takeaway
MLPs struggle with images not because they are too weak in principle, but because they ignore the geometry that makes visual data efficient to model. Flattening erases locality, forces the model to relearn translation reuse, and makes parameter counts explode.
CNNs succeed because they build the right priors into the architecture: local connectivity and parameter sharing. Once those assumptions are in place, the network can spend its capacity learning visual concepts instead of rediscovering the two-dimensional structure of the image itself.