
Why CNNs Beat MLPs on Images: Locality, Weight Sharing, and Pooling

Once you flatten an image, you force the model to relearn basic visual structure from scratch.

CNN · Computer Vision · Convolution · Pooling · Padding

A multilayer perceptron can in principle approximate any function from pixels to labels. The reason CNNs are everywhere in computer vision is not that MLPs cannot work — it is that they refuse to acknowledge the geometry of an image, and pay for that ignorance in parameter count, sample efficiency, and translation robustness.

This article walks the full path from why fully connected layers fail on images, through the two visual priors that make convolution natural, to the mechanics of padding, stride, channels, and pooling that turn convolution into a working CNN.

Images are not tables

In tabular data, feature order is often incidental. Swapping two columns does not change what the columns mean. In an image, the opposite is true. A pixel's meaning depends heavily on where it sits and which pixels surround it.

A $3 \times 3$ image can be written as

$$\begin{bmatrix} x_{11} & x_{12} & x_{13} \\ x_{21} & x_{22} & x_{23} \\ x_{31} & x_{32} & x_{33} \end{bmatrix}$$

but an MLP typically flattens it into

$$[x_{11}, x_{12}, x_{13}, x_{21}, x_{22}, x_{23}, x_{31}, x_{32}, x_{33}]$$

The raw numbers are preserved, but the local neighborhood structure is gone. The model no longer knows which pixels were horizontal neighbors, vertical neighbors, or part of the same local patch.
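
As a quick illustration (a minimal NumPy sketch with an arbitrary $3 \times 3$ example), flattening places pixels that were never spatial neighbors right next to each other in the vector:

```python
import numpy as np

# A 3x3 image laid out as a 2D array: rows preserve horizontal adjacency,
# columns preserve vertical adjacency.
img = np.arange(1, 10).reshape(3, 3)   # [[1,2,3],[4,5,6],[7,8,9]]

flat = img.reshape(-1)                  # [1,2,3,4,5,6,7,8,9]

# In the 2D layout, 3 (end of row 0) and 4 (start of row 1) are not neighbors,
# yet they become adjacent indices after flattening, while true vertical
# neighbors such as 1 and 4 end up three positions apart.
print(flat[2], flat[3])   # 3 4  -> adjacent in the vector, distant in the image
print(flat[0], flat[3])   # 1 4  -> vertical neighbors, now three apart
```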

Three immediate problems with flattening

Locality is destroyed. Visual patterns such as edges, corners, and textures are defined by relationships among nearby pixels. A flattened MLP can still learn those relationships, but it must infer them with no architectural help from the input format.

Translation must be relearned everywhere. If an object shifts from one corner of the image to the opposite corner, humans still perceive the same local structure. A fully connected network sees a different set of input coordinates and must essentially learn the same feature in many positions independently.

Parameter counts explode. A $1000 \times 1000$ RGB image becomes a vector in $\mathbb{R}^{3 \times 10^6}$. Connecting that vector to a single hidden layer of width 1000 already requires on the order of $3 \times 10^9$ weights: more than will fit on most GPUs, before training has even started.
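
To make the parameter gap concrete, here is a back-of-the-envelope calculation in plain Python, assuming a single fully connected hidden layer of width 1000 versus one $3 \times 3$ convolution layer with 64 output channels (the convolutional sizes are illustrative, not taken from the text above):

```python
# Fully connected: every input pixel connects to every hidden unit.
inputs = 1000 * 1000 * 3             # 1000x1000 RGB image, flattened
hidden = 1000
fc_weights = inputs * hidden         # biases excluded

# Convolution: one 3x3 kernel per (input channel, output channel) pair.
k, c_in, c_out = 3, 3, 64
conv_weights = k * k * c_in * c_out

print(f"fully connected: {fc_weights:,}")    # 3,000,000,000
print(f"convolution:     {conv_weights:,}")  # 1,728
```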

Two visual priors that change everything

CNNs are built around two assumptions that fit natural images remarkably well.

The first is locality: to decide whether a feature is present at $(i,j)$, nearby pixels usually matter far more than distant ones.

The second is translation reuse: the same kind of local edge, corner, or texture should be recognized with the same detector no matter where it appears.

Convolution does not become useful because it is mysterious. It becomes useful because it formalizes two ordinary facts about vision: nearby matters most, and reusable patterns should be reused.

From full connectivity to convolution in three steps

A fully connected image-to-image layer can be written with explicit two-dimensional indexing:

$$H_{i,j} = U_{i,j} + \sum_k \sum_l W_{i,j,k,l} X_{k,l}$$

The output at position $(i,j)$ depends on every input position with its own learned coefficient. This makes three dubious commitments: it treats absolute location as essential, gives distant pixels the same modeling privilege as nearby ones, and spends separate parameters learning the same local pattern in different places.

Reformulating the input indices as offsets from the output location, $a = k - i$ and $b = l - j$, gives:

$$H_{i,j} = U_{i,j} + \sum_a \sum_b V_{i,j,a,b} X_{i+a, j+b}$$

Imposing translation reuse removes the dependence on $(i,j)$ from the kernel: $V_{i,j,a,b} = V_{a,b}$. The same local detector is now reused everywhere. Imposing locality restricts the offsets to a small window:

$$H_{i,j} = u + \sum_{a=-\Delta}^{\Delta} \sum_{b=-\Delta}^{\Delta} V_{a,b} X_{i+a, j+b}$$

That is the essential form of convolution. The operation no longer tries to learn a separate visual rule for every coordinate pair: it learns one local detector and slides it across the image. Parameter count drops from $O(H^2 W^2)$ to $O(K^2)$, where $K$ is the kernel size.
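
The sliding-window form above translates almost line for line into code. Here is a minimal single-channel NumPy sketch (no padding, stride one, following the usual deep-learning convention of cross-correlation; the input and kernel are arbitrary examples):

```python
import numpy as np

def conv2d(X, V, u=0.0):
    """Slide one local detector V over X; the same weights are used at every position."""
    H, W = X.shape
    K_h, K_w = V.shape
    out = np.empty((H - K_h + 1, W - K_w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # The output at (i, j) only sees the K_h x K_w window around it.
            out[i, j] = u + np.sum(V * X[i:i + K_h, j:j + K_w])
    return out

X = np.random.randn(5, 5)
V = np.array([[1.0, 0.0, -1.0]] * 3)   # a simple hand-written edge detector
print(conv2d(X, V).shape)              # (3, 3)
```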

Padding preserves spatial support at the boundary

With no padding and stride one, a kernel can only sit where it fully overlaps the input. For an input of height $H$ and width $W$ with a kernel of size $K_h \times K_w$, the output size is $(H - K_h + 1) \times (W - K_w + 1)$. Stack several such layers and feature maps shrink quickly.

Padding places extra values, usually zeros, around the border of the input. With padding $P_h, P_w$ and stride $S_h, S_w$, the output size becomes

$$H_{\text{out}} = \left\lfloor \frac{H + 2P_h - K_h}{S_h} \right\rfloor + 1, \qquad W_{\text{out}} = \left\lfloor \frac{W + 2P_w - K_w}{S_w} \right\rfloor + 1$$

For a $3 \times 3$ kernel with stride one, choosing $P = 1$ keeps input and output sizes equal. That is why "same padding" is so common: it lets a network add depth without erasing spatial resolution too aggressively. Padding is less about inventing new information than about preserving fair access to boundary information.
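
The output-size formula is easy to sanity-check. A small helper using the floor form above (the sizes are chosen only for illustration):

```python
def conv_out_size(n, k, p, s):
    """Output length along one spatial dimension: floor((n + 2p - k) / s) + 1."""
    return (n + 2 * p - k) // s + 1

# 3x3 kernel, stride 1: padding 1 keeps the spatial size unchanged ("same").
print(conv_out_size(32, k=3, p=1, s=1))   # 32

# No padding: each layer shaves off k - 1 = 2 pixels per dimension.
print(conv_out_size(32, k=3, p=0, s=1))   # 30
```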

Stride is deliberate downsampling

Stride controls how far the kernel moves between evaluations. A stride of two means every other spatial position is skipped. The output becomes smaller, the computation becomes cheaper, and each later unit corresponds to a larger region of the original input.

Stride discards spatial detail. It does so deliberately: later stages of the network need larger effective receptive fields and lower computational cost if they are going to model object-level structure rather than only local edges.

CNNs do not downsample because detail is unimportant. They downsample because semantics often require aggregating over larger spatial context than pixel-level resolution can support efficiently.

Aggressive stride can hurt dense prediction tasks like segmentation and keypoint localization, while helping classification tasks that care more about what is present than the exact pixel location.
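
The same formula shows what repeated stride-two convolutions do to resolution. A quick sketch, assuming a $3 \times 3$ kernel with padding 1 and an illustrative 224-pixel input:

```python
def conv_out_size(n, k=3, p=1, s=2):
    """One spatial dimension after a stride-2, padding-1, 3x3 convolution."""
    return (n + 2 * p - k) // s + 1

size = 224
for stage in range(5):
    size = conv_out_size(size)
    print(size)   # 112, 56, 28, 14, 7: each unit covers ever more of the input
```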

Multi-channel convolution is feature fusion

A color image has multiple input channels. A standard convolution layer with $C_{\text{in}}$ input channels and one output channel computes:

$$Y = \sum_{c=1}^{C_{\text{in}}} X_c \star K_c + b$$

Each input channel has its own spatial kernel. The per-channel responses are summed at each spatial location — the layer is not making separate channel decisions and averaging them later. It is fusing channel evidence into one response map as part of the linear operation itself.

If the layer has $C_{\text{out}}$ output channels, then it learns $C_{\text{out}}$ separate kernel banks. Each bank produces one feature map. The network is learning many local detectors in parallel: edges, color transitions, texture fragments, corners, and increasingly abstract patterns at higher layers.
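
A minimal NumPy sketch of the multi-channel sum above, with one output channel (extending to $C_{\text{out}}$ channels just repeats this with a separate kernel bank per output map; the shapes here are arbitrary):

```python
import numpy as np

def conv2d_multi_in(X, K, b=0.0):
    """X: (C_in, H, W), K: (C_in, K_h, K_w). Per-channel responses are summed into one map."""
    C_in, H, W = X.shape
    _, K_h, K_w = K.shape
    out = np.zeros((H - K_h + 1, W - K_w + 1))
    for c in range(C_in):
        for i in range(out.shape[0]):
            for j in range(out.shape[1]):
                # Each input channel has its own spatial kernel; evidence is fused by the sum.
                out[i, j] += np.sum(K[c] * X[c, i:i + K_h, j:j + K_w])
    return out + b

X = np.random.randn(3, 8, 8)           # e.g. an RGB patch
K = np.random.randn(3, 3, 3)           # one 3x3 kernel per input channel
print(conv2d_multi_in(X, K).shape)     # (6, 6): one fused response map
```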

Pooling compresses space while adding translation tolerance

Pooling summarizes a local window without learning new weights. Max pooling takes the strongest activation in the window. Average pooling takes the mean. A $2 \times 2$ max-pooling layer with stride two maps:

$$\begin{bmatrix} 1 & 3 & 2 & 4 \\ 5 & 2 & 1 & 3 \\ 4 & 1 & 6 & 2 \\ 0 & 3 & 1 & 5 \end{bmatrix} \longrightarrow \begin{bmatrix} 5 & 4 \\ 4 & 6 \end{bmatrix}$$

Max pooling asks "did the feature appear strongly anywhere in this local region?" Average pooling asks "what was the average activation strength across this region?" These are different questions and produce different inductive biases.

Pooling also creates limited translation tolerance. If a strong edge detector shifts by one or two pixels but stays inside the same pooling window, the pooled response barely changes. That is often desirable in classification.
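
A small NumPy sketch that reproduces the max-pooling example above and adds the average-pooling variant for comparison (the helper name and its defaults are illustrative):

```python
import numpy as np

def pool2d(X, size=2, stride=2, mode="max"):
    """Summarize each size x size window with max or mean; no weights are learned."""
    H, W = X.shape
    out_h = (H - size) // stride + 1
    out_w = (W - size) // stride + 1
    out = np.empty((out_h, out_w))
    reduce = np.max if mode == "max" else np.mean
    for i in range(out_h):
        for j in range(out_w):
            window = X[i * stride:i * stride + size, j * stride:j * stride + size]
            out[i, j] = reduce(window)
    return out

X = np.array([[1, 3, 2, 4],
              [5, 2, 1, 3],
              [4, 1, 6, 2],
              [0, 3, 1, 5]], dtype=float)
print(pool2d(X, mode="max"))   # [[5. 4.], [4. 6.]], matching the example above
print(pool2d(X, mode="avg"))   # [[2.75 2.5], [2. 3.5]]
```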

How the pieces fit together

The standard CNN progression is not accidental:

  • Early layers keep relatively high resolution and detect simple patterns like edges and small textures.
  • Middle layers combine those responses into local parts and motifs.
  • Later layers operate on lower-resolution maps, but each unit sees a much larger portion of the original image and can respond to larger object structures.

Spatial resolution goes down, but semantic scope goes up. That is the organizing principle linking stride, pooling, padding, and multi-channel convolution. The architectural choices are not miscellaneous engineering knobs — they are the mechanisms through which a CNN decides how much local detail to keep, how much space to compress, and how quickly to move from pixels toward object-level representation.
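
The whole progression is easy to see in a toy stack. A minimal PyTorch sketch with hypothetical layer sizes, chosen only to show resolution shrinking while channel count grows:

```python
import torch
import torch.nn as nn

# Channels grow as spatial resolution shrinks: fewer positions, richer detectors.
stack = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),   # 32 -> 16
    nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # 16 -> 8
    nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # 8 -> 4
)

x = torch.randn(1, 3, 32, 32)   # one 32x32 RGB image
print(stack(x).shape)           # torch.Size([1, 64, 4, 4])
```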

The main takeaway

MLPs struggle with images not because they are too weak in principle, but because they ignore the geometry that makes visual data efficient to model. Flattening erases locality, forces the model to relearn translation reuse, and makes parameter counts explode.

CNNs succeed because they build the right priors into the architecture: local connectivity and parameter sharing. Padding protects spatial support, stride controls intentional downsampling, channels let the network learn many detectors and fuse evidence across inputs, and pooling summarizes local evidence while adding translation robustness. Once those assumptions are in place, the network can spend its capacity learning visual concepts instead of rediscovering the two-dimensional structure of the image.
