Padding, stride, channels, and pooling can feel like four disconnected details the first time you meet convolutional networks. They are not. Together they determine how a CNN turns raw pixels into progressively more abstract, more spatially compressed representations.
If you understand what these four ingredients do, the architecture of a CNN stops looking like a bag of layer names and starts looking like a controlled pipeline from local detail to high-level structure.
Why convolution shrinks feature maps by default
With no padding and stride one, a kernel can only be placed where it fully overlaps the input. For an input of height n_h and width n_w with a kernel of size k_h × k_w, the output size is (n_h − k_h + 1) × (n_w − k_w + 1).
So, for example, an 8 × 8 input convolved with a 3 × 3 kernel becomes 6 × 6. Stack several such layers and feature maps shrink quickly.
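The shrinking-output rule is easy to verify with a minimal NumPy sketch of valid cross-correlation (the function name `corr2d` is illustrative, not from any library):

```python
import numpy as np

def corr2d(X, K):
    """Valid 2-D cross-correlation: the kernel is only placed where it
    fully overlaps the input, so the output is smaller than the input."""
    n_h, n_w = X.shape
    k_h, k_w = K.shape
    Y = np.zeros((n_h - k_h + 1, n_w - k_w + 1))
    for i in range(Y.shape[0]):
        for j in range(Y.shape[1]):
            Y[i, j] = (X[i:i + k_h, j:j + k_w] * K).sum()
    return Y

X = np.random.rand(8, 8)   # 8x8 input
K = np.random.rand(3, 3)   # 3x3 kernel
assert corr2d(X, K).shape == (6, 6)  # (8 - 3 + 1, 8 - 3 + 1)
```
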
That is not a bug. It is the geometric consequence of using a finite window without extending the input. But it does mean edge pixels participate in fewer receptive fields than central pixels, which is one reason padding matters.
Padding preserves spatial support at the boundary
Padding places extra values, usually zeros, around the border of the input. With p_h total rows and p_w total columns of padding and stride one, the output size becomes (n_h − k_h + p_h + 1) × (n_w − k_w + p_w + 1).
For a k_h × k_w kernel with stride one, choosing p_h = k_h − 1 and p_w = k_w − 1 keeps input and output sizes equal. That is why "same padding" is so common: it lets you add depth without erasing spatial resolution too aggressively.
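A quick sketch of the stride-one size rule makes the "same padding" choice concrete (the helper name `conv_out_size` is mine; p counts the total rows or columns added):

```python
def conv_out_size(n, k, p):
    """Stride-one output size for input size n, kernel size k,
    and p total rows/columns of padding."""
    return n - k + p + 1

# "Same" padding for a 5x5 kernel: p = k - 1 = 4 keeps the size at 32
assert conv_out_size(32, 5, 4) == 32
# Without padding the map shrinks to 28
assert conv_out_size(32, 5, 0) == 28
```
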
Conceptually, padding is less about inventing new information than about preserving fair access to boundary information. It prevents the edges from being ignored simply because the kernel has fewer valid landing spots there.
Stride is deliberate downsampling
Stride controls how far the kernel moves between evaluations. A stride of two means every other spatial position is skipped. With strides s_h and s_w, the output size becomes ⌊(n_h − k_h + p_h + s_h) / s_h⌋ × ⌊(n_w − k_w + p_w + s_w) / s_w⌋. The output becomes smaller, the computation becomes cheaper, and each later unit corresponds to a larger region of the original input.
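The strided size rule is a one-liner worth checking against a familiar case (the helper name `strided_out_size` is mine; the 224/7/2 numbers below are just an illustrative stem configuration):

```python
def strided_out_size(n, k, p, s):
    """Output size with kernel k, total padding p, and stride s:
    floor((n - k + p + s) / s)."""
    return (n - k + p + s) // s

# A 7x7 kernel, 3 pixels of padding per side (p = 6), stride 2:
# a 224x224 input maps to 112x112, roughly halving each dimension.
assert strided_out_size(224, 7, 6, 2) == 112
# Stride one with no padding recovers the shrinking rule from earlier.
assert strided_out_size(8, 3, 0, 1) == 6
```
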
That is a real information tradeoff. Stride discards spatial detail. But it does so for a reason: later stages of the network need larger effective receptive fields and lower computational cost if they are going to model object-level structure instead of only local edges.
CNNs do not downsample because detail is unimportant. They downsample because semantics often require aggregating over larger spatial context than pixel-level resolution can support efficiently.
This is why aggressive stride can hurt dense prediction tasks like segmentation and keypoint localization, while helping classification tasks that care more about what is present than the exact pixel location.
Multi-channel convolution is feature fusion, not separate voting
A color image has multiple input channels. A standard convolution layer with c_in input channels and one output channel computes Y = (X_1 ⋆ K_1) + (X_2 ⋆ K_2) + … + (X_c_in ⋆ K_c_in), where ⋆ denotes two-dimensional cross-correlation of input channel X_c with its kernel K_c.
Each input channel has its own spatial kernel. The per-channel responses are summed at each spatial location. So the layer is not making separate channel decisions and averaging them later. It is fusing channel evidence into one response map as part of the linear operation itself.
If the layer has c_out output channels, then it learns c_out separate kernel banks, each containing one kernel per input channel. Each bank produces one feature map. In practice that means the network is learning many local detectors in parallel: edges, color transitions, texture fragments, corners, and increasingly abstract patterns.
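The channel-fusion behavior can be made explicit in a few lines of NumPy (function names are illustrative; the point is that per-channel responses are summed, and each output channel gets its own kernel bank):

```python
import numpy as np

def corr2d(X, K):
    # Single-channel valid cross-correlation.
    k_h, k_w = K.shape
    Y = np.zeros((X.shape[0] - k_h + 1, X.shape[1] - k_w + 1))
    for i in range(Y.shape[0]):
        for j in range(Y.shape[1]):
            Y[i, j] = (X[i:i + k_h, j:j + k_w] * K).sum()
    return Y

def corr2d_multi_in(X, K):
    # X: (c_in, n_h, n_w), K: (c_in, k_h, k_w).
    # Per-channel responses are SUMMED into one map: channel fusion,
    # not separate per-channel votes averaged afterwards.
    return sum(corr2d(x, k) for x, k in zip(X, K))

def corr2d_multi_in_out(X, K):
    # K: (c_out, c_in, k_h, k_w) -> one feature map per kernel bank.
    return np.stack([corr2d_multi_in(X, bank) for bank in K])

X = np.random.rand(3, 8, 8)        # 3 input channels
K = np.random.rand(16, 3, 3, 3)    # 16 banks of 3 kernels each
assert corr2d_multi_in_out(X, K).shape == (16, 6, 6)
```
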
Pooling compresses space while increasing local invariance
Pooling summarizes a local window without learning new weights. Max pooling takes the strongest activation in the window. Average pooling takes the mean.
For example, a 2 × 2 max-pooling layer with stride two maps an n_h × n_w input to roughly ⌊n_h / 2⌋ × ⌊n_w / 2⌋, halving each spatial dimension.
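Both pooling modes fit in one small sketch (the function name `pool2d` is mine; this version assumes non-overlapping windows whose size divides the input evenly):

```python
import numpy as np

def pool2d(X, k, mode="max"):
    """Non-overlapping k x k pooling with stride k. Assumes k divides
    the input size evenly, for simplicity."""
    n_h, n_w = X.shape
    Y = np.zeros((n_h // k, n_w // k))
    for i in range(Y.shape[0]):
        for j in range(Y.shape[1]):
            window = X[i * k:(i + 1) * k, j * k:(j + 1) * k]
            Y[i, j] = window.max() if mode == "max" else window.mean()
    return Y

X = np.arange(16, dtype=float).reshape(4, 4)
assert pool2d(X, 2).shape == (2, 2)           # 4x4 -> 2x2
assert pool2d(X, 2)[0, 0] == 5.0              # max of [[0,1],[4,5]]
assert pool2d(X, 2, "avg")[0, 0] == 2.5       # mean of [[0,1],[4,5]]
```
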
Max pooling says, "did the feature appear strongly anywhere in this local region?" Average pooling says, "what was the average activation strength across this local region?" Those are different questions, so they create different inductive biases.
Pooling also creates limited translation tolerance. If a strong edge detector shifts by one or two pixels but stays inside the same pooling window, the pooled response may barely change. That is often desirable in classification.
Why later feature maps get smaller but more semantic
The usual CNN progression is not accidental.
- Early layers keep relatively high resolution and detect simple patterns like edges and small textures.
- Middle layers combine those responses into local parts and motifs.
- Later layers operate on lower-resolution maps, but each unit sees a much larger portion of the original image and can respond to larger object structures.
Spatial resolution goes down, but semantic scope goes up. That is the organizing principle linking stride, pooling, padding, and multi-channel convolution.
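How quickly "semantic scope goes up" can be quantified with the standard receptive-field recurrence (a sketch under my own naming; the layer stack below is hypothetical, not a specific architecture):

```python
def receptive_field(layers):
    """Receptive field of one output unit after a stack of
    (kernel_size, stride) layers, via the standard recurrence
    r <- r + (k - 1) * jump, where jump is the product of all
    earlier strides."""
    r, jump = 1, 1
    for k, s in layers:
        r += (k - 1) * jump
        jump *= s
    return r

# One 3x3 conv sees a 3x3 patch of the input.
assert receptive_field([(3, 1)]) == 3
# Add a stride-2 downsampling and another 3x3 conv: each later
# unit now covers an 8x8 patch of the original image.
assert receptive_field([(3, 1), (2, 2), (3, 1)]) == 8
```

This is the mechanism behind the progression above: every stride-2 stage doubles how fast subsequent kernels expand their view of the original input.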
The main takeaway
Padding protects spatial support, stride controls intentional downsampling, channels let the network learn many detectors and fuse evidence across inputs, and pooling summarizes local evidence while adding limited translation robustness.
These are not miscellaneous engineering knobs. They are the mechanisms through which a CNN decides how much local detail to keep, how much space to compress, and how quickly to move from pixels toward object-level representation.