Padding, stride, channels, and pooling can feel like four disconnected details the first time you meet convolutional networks. They are not. Together they determine how a CNN turns raw pixels into progressively more abstract, more spatially compressed representations.
If you understand what these four ingredients do, the architecture of a CNN stops looking like a bag of layer names and starts looking like a controlled pipeline from local detail to high-level structure.
Why convolution shrinks feature maps by default
With no padding and stride one, a kernel can only be placed where it fully overlaps the input. For an input of height n_h and width n_w with a kernel of size k_h × k_w, the output size is (n_h − k_h + 1) × (n_w − k_w + 1).
So, for example, an 8 × 8 input convolved with a 3 × 3 kernel becomes 6 × 6. Stack several such layers and feature maps shrink quickly.
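The shrinking-output rule is easy to verify with a minimal NumPy sketch of valid cross-correlation (the function name `corr2d` is illustrative, not from any library):

```python
import numpy as np

def corr2d(X, K):
    """Valid 2-D cross-correlation: the kernel is only placed where it
    fully overlaps the input, so the output is smaller than the input."""
    n_h, n_w = X.shape
    k_h, k_w = K.shape
    Y = np.zeros((n_h - k_h + 1, n_w - k_w + 1))
    for i in range(Y.shape[0]):
        for j in range(Y.shape[1]):
            Y[i, j] = (X[i:i + k_h, j:j + k_w] * K).sum()
    return Y

X = np.random.rand(8, 8)   # 8x8 input
K = np.random.rand(3, 3)   # 3x3 kernel
assert corr2d(X, K).shape == (6, 6)  # (8 - 3 + 1, 8 - 3 + 1)
```
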
That is not a bug. It is the geometric consequence of using a finite window without extending the input. But it does mean edge pixels participate in fewer receptive fields than central pixels, which is one reason padding matters.
Padding preserves spatial support at the boundary
Padding places extra values, usually zeros, around the border of the input. With p_h total rows and p_w total columns of padding and stride one, the output size becomes (n_h − k_h + p_h + 1) × (n_w − k_w + p_w + 1).
For a k_h × k_w kernel with stride one, choosing p_h = k_h − 1 and p_w = k_w − 1 keeps input and output sizes equal. That is why "same padding" is so common: it lets you add depth without erasing spatial resolution too aggressively.
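A quick sketch of the stride-one size rule makes the "same padding" choice concrete (the helper name `conv_out_size` is mine; p counts the total rows or columns added):

```python
def conv_out_size(n, k, p):
    """Stride-one output size for input size n, kernel size k,
    and p total rows/columns of padding."""
    return n - k + p + 1

# "Same" padding for a 5x5 kernel: p = k - 1 = 4 keeps the size at 32
assert conv_out_size(32, 5, 4) == 32
# Without padding the map shrinks to 28
assert conv_out_size(32, 5, 0) == 28
```
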
Conceptually, padding is less about inventing new information than about preserving fair access to boundary information. It prevents the edges from being ignored simply because the kernel has fewer valid landing spots there.
Stride is deliberate downsampling
Stride controls how far the kernel moves between evaluations. A stride of two means every other spatial position is skipped. With strides s_h and s_w, the output size becomes ⌊(n_h − k_h + p_h + s_h) / s_h⌋ × ⌊(n_w − k_w + p_w + s_w) / s_w⌋. The output becomes smaller, the computation becomes cheaper, and each later unit corresponds to a larger region of the original input.
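The strided size rule is a one-liner worth checking against a familiar case (the helper name `strided_out_size` is mine; the 224/7/2 numbers below are just an illustrative stem configuration):

```python
def strided_out_size(n, k, p, s):
    """Output size with kernel k, total padding p, and stride s:
    floor((n - k + p + s) / s)."""
    return (n - k + p + s) // s

# A 7x7 kernel, 3 pixels of padding per side (p = 6), stride 2:
# a 224x224 input maps to 112x112, roughly halving each dimension.
assert strided_out_size(224, 7, 6, 2) == 112
# Stride one with no padding recovers the shrinking rule from earlier.
assert strided_out_size(8, 3, 0, 1) == 6
```
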
That is a real information tradeoff. Stride discards spatial detail. But it does so for a reason: later stages of the network need larger effective receptive fields and lower computational cost if they are going to model object-level structure instead of only local edges.
CNNs do not downsample because detail is unimportant. They downsample because semantics often require aggregating over larger spatial context than pixel-level resolution can support efficiently.
This is why aggressive stride can hurt dense prediction tasks like segmentation and keypoint localization, while helping classification tasks that care more about what is present than the exact pixel location.
Multi-channel convolution is feature fusion, not separate voting
A color image has multiple input channels. A standard convolution layer with c_in input channels and one output channel computes Y = (X_1 ⋆ K_1) + (X_2 ⋆ K_2) + … + (X_c_in ⋆ K_c_in), where ⋆ denotes two-dimensional cross-correlation of input channel X_c with its kernel K_c.
Each input channel has its own spatial kernel. The per-channel responses are summed at each spatial location. So the layer is not making separate channel decisions and averaging them later. It is fusing channel evidence into one response map as part of the linear operation itself.
If the layer has c_out output channels, then it learns c_out separate kernel banks, each containing one kernel per input channel. Each bank produces one feature map. In practice that means the network is learning many local detectors in parallel: edges, color transitions, texture fragments, corners, and increasingly abstract patterns.
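The channel-fusion behavior can be made explicit in a few lines of NumPy (function names are illustrative; the point is that per-channel responses are summed, and each output channel gets its own kernel bank):

```python
import numpy as np

def corr2d(X, K):
    # Single-channel valid cross-correlation.
    k_h, k_w = K.shape
    Y = np.zeros((X.shape[0] - k_h + 1, X.shape[1] - k_w + 1))
    for i in range(Y.shape[0]):
        for j in range(Y.shape[1]):
            Y[i, j] = (X[i:i + k_h, j:j + k_w] * K).sum()
    return Y

def corr2d_multi_in(X, K):
    # X: (c_in, n_h, n_w), K: (c_in, k_h, k_w).
    # Per-channel responses are SUMMED into one map: channel fusion,
    # not separate per-channel votes averaged afterwards.
    return sum(corr2d(x, k) for x, k in zip(X, K))

def corr2d_multi_in_out(X, K):
    # K: (c_out, c_in, k_h, k_w) -> one feature map per kernel bank.
    return np.stack([corr2d_multi_in(X, bank) for bank in K])

X = np.random.rand(3, 8, 8)        # 3 input channels
K = np.random.rand(16, 3, 3, 3)    # 16 banks of 3 kernels each
assert corr2d_multi_in_out(X, K).shape == (16, 6, 6)
```
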
Pooling compresses space while increasing local invariance
Pooling summarizes a local window without learning new weights. Max pooling takes the strongest activation in the window. Average pooling takes the mean.
For example, a 2 × 2 max-pooling layer with stride two maps an n_h × n_w input to roughly ⌊n_h / 2⌋ × ⌊n_w / 2⌋, halving each spatial dimension.
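Both pooling modes fit in one small sketch (the function name `pool2d` is mine; this version assumes non-overlapping windows whose size divides the input evenly):

```python
import numpy as np

def pool2d(X, k, mode="max"):
    """Non-overlapping k x k pooling with stride k. Assumes k divides
    the input size evenly, for simplicity."""
    n_h, n_w = X.shape
    Y = np.zeros((n_h // k, n_w // k))
    for i in range(Y.shape[0]):
        for j in range(Y.shape[1]):
            window = X[i * k:(i + 1) * k, j * k:(j + 1) * k]
            Y[i, j] = window.max() if mode == "max" else window.mean()
    return Y

X = np.arange(16, dtype=float).reshape(4, 4)
assert pool2d(X, 2).shape == (2, 2)           # 4x4 -> 2x2
assert pool2d(X, 2)[0, 0] == 5.0              # max of [[0,1],[4,5]]
assert pool2d(X, 2, "avg")[0, 0] == 2.5       # mean of [[0,1],[4,5]]
```
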
Max pooling says, "did the feature appear strongly anywhere in this local region?" Average pooling says, "what was the average activation strength across this local region?" Those are different questions, so they create different inductive biases.
Pooling also creates limited translation tolerance. If a strong edge detector shifts by one or two pixels but stays inside the same pooling window, the pooled response may barely change. That is often desirable in classification.
Why later feature maps get smaller but more semantic
The usual CNN progression is not accidental.
- Early layers keep relatively high resolution and detect simple patterns like edges and small textures.
- Middle layers combine those responses into local parts and motifs.
- Later layers operate on lower-resolution maps, but each unit sees a much larger portion of the original image and can respond to larger object structures.
Spatial resolution goes down, but semantic scope goes up. That is the organizing principle linking stride, pooling, padding, and multi-channel convolution.
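How quickly "semantic scope goes up" can be quantified with the standard receptive-field recurrence (a sketch under my own naming; the layer stack below is hypothetical, not a specific architecture):

```python
def receptive_field(layers):
    """Receptive field of one output unit after a stack of
    (kernel_size, stride) layers, via the standard recurrence
    r <- r + (k - 1) * jump, where jump is the product of all
    earlier strides."""
    r, jump = 1, 1
    for k, s in layers:
        r += (k - 1) * jump
        jump *= s
    return r

# One 3x3 conv sees a 3x3 patch of the input.
assert receptive_field([(3, 1)]) == 3
# Add a stride-2 downsampling and another 3x3 conv: each later
# unit now covers an 8x8 patch of the original image.
assert receptive_field([(3, 1), (2, 2), (3, 1)]) == 8
```

This is the mechanism behind the progression above: every stride-2 stage doubles how fast subsequent kernels expand their view of the original input.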
The main takeaway
Padding protects spatial support, stride controls intentional downsampling, channels let the network learn many detectors and fuse evidence across inputs, and pooling summarizes local evidence while adding limited translation robustness.
These are not miscellaneous engineering knobs. They are the mechanisms through which a CNN decides how much local detail to keep, how much space to compress, and how quickly to move from pixels toward object-level representation.