The operation inside a modern Conv2d layer is almost never a true mathematical convolution. In practical deep learning libraries, the forward pass uses cross-correlation: the kernel is applied without flipping.
That sounds like trivia until you derive the gradients. Then something beautiful happens: the weight gradient takes the form of another cross-correlation. The implementation choice is not arbitrary history. It lines up naturally with backpropagation.
Convolution and cross-correlation are not the same operation
Let $X$ be a two-dimensional input and $K$ a two-dimensional kernel.

Then cross-correlation computes

$$(X \star K)[i, j] = \sum_{m} \sum_{n} X[i + m,\, j + n]\, K[m, n],$$

while true convolution first flips the kernel by $180^{\circ}$ and then computes

$$(X * K)[i, j] = \sum_{m} \sum_{n} X[i - m,\, j - n]\, K[m, n].$$
Most deep learning code uses the first expression. The name "convolution" survived, but the actual tensor operation is cross-correlation.
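The difference is easy to see in code. Below is a minimal pure-Python sketch (the names `cross_correlate2d` and `convolve2d` are illustrative, not a framework API): true convolution is just cross-correlation with a 180°-flipped kernel, so the two disagree whenever the kernel is not symmetric under that flip.

```python
def cross_correlate2d(x, k):
    """Slide the kernel over the input without flipping it
    (this is what deep learning frameworks call 'convolution')."""
    kh, kw = len(k), len(k[0])
    return [[sum(x[i + m][j + n] * k[m][n]
                 for m in range(kh) for n in range(kw))
             for j in range(len(x[0]) - kw + 1)]
            for i in range(len(x) - kh + 1)]

def convolve2d(x, k):
    """True convolution: flip the kernel 180 degrees, then cross-correlate."""
    flipped = [row[::-1] for row in k[::-1]]
    return cross_correlate2d(x, flipped)

x = [[1, 2, 3],
     [4, 5, 6],
     [7, 8, 9]]
k = [[1, 0],
     [0, -1]]  # not symmetric under a 180-degree flip

print(cross_correlate2d(x, k))  # [[-4, -4], [-4, -4]]
print(convolve2d(x, k))         # [[4, 4], [4, 4]]
```

For a kernel that equals its own 180° flip, the two functions would agree, which is why the distinction is invisible with symmetric filters like a Gaussian blur.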
That is exactly what the CNN forward formula says
For a single-channel input and kernel, the basic forward rule is

$$z_{ij} = \sum_{m} \sum_{n} x_{i + m,\, j + n}\, w_{mn}.$$

The kernel indices $(m, n)$ move in the same direction as the input indices $(i + m, j + n)$. There is no reversal. So the forward pass is cross-correlation by construction.
This is computationally convenient because the kernel is learned anyway. If the best detector happens to look like a flipped version of some other detector, gradient descent can simply learn that version directly.
Set up a tiny example and expand every output
Take a $3 \times 3$ input $X$, a $2 \times 2$ kernel $W$, and the resulting $2 \times 2$ output $Z$:

$$X = \begin{pmatrix} x_{11} & x_{12} & x_{13} \\ x_{21} & x_{22} & x_{23} \\ x_{31} & x_{32} & x_{33} \end{pmatrix}, \qquad W = \begin{pmatrix} w_{11} & w_{12} \\ w_{21} & w_{22} \end{pmatrix}, \qquad Z = \begin{pmatrix} z_{11} & z_{12} \\ z_{21} & z_{22} \end{pmatrix}.$$

The forward pass produces

$$\begin{aligned} z_{11} &= x_{11} w_{11} + x_{12} w_{12} + x_{21} w_{21} + x_{22} w_{22}, \\ z_{12} &= x_{12} w_{11} + x_{13} w_{12} + x_{22} w_{21} + x_{23} w_{22}, \\ z_{21} &= x_{21} w_{11} + x_{22} w_{12} + x_{31} w_{21} + x_{32} w_{22}, \\ z_{22} &= x_{22} w_{11} + x_{23} w_{12} + x_{32} w_{21} + x_{33} w_{22}. \end{aligned}$$
This expansion makes the key fact obvious: each kernel entry participates in several output positions. For example, $w_{11}$ appears in all four output formulas, so its gradient must accumulate all four downstream contributions.
Backpropagation explains the kernel gradient cleanly
Suppose the loss is any scalar $L$ computed from the output map $Z$.

Then the output error is

$$\delta_{ij} = \frac{\partial L}{\partial z_{ij}}.$$

For $w_{11}$, the chain rule gives

$$\frac{\partial L}{\partial w_{11}} = \sum_{i, j} \frac{\partial L}{\partial z_{ij}} \frac{\partial z_{ij}}{\partial w_{11}} = \sum_{i, j} \delta_{ij}\, \frac{\partial z_{ij}}{\partial w_{11}}.$$

From the expanded formulas, the local derivatives are

$$\frac{\partial z_{11}}{\partial w_{11}} = x_{11}, \qquad \frac{\partial z_{12}}{\partial w_{11}} = x_{12}, \qquad \frac{\partial z_{21}}{\partial w_{11}} = x_{21}, \qquad \frac{\partial z_{22}}{\partial w_{11}} = x_{22},$$

so

$$\frac{\partial L}{\partial w_{11}} = \delta_{11} x_{11} + \delta_{12} x_{12} + \delta_{21} x_{21} + \delta_{22} x_{22}.$$
The same pattern holds for every kernel entry. The full kernel gradient is the cross-correlation of the input $X$ with the output error map $\delta$:

$$\frac{\partial L}{\partial w_{mn}} = \sum_{i} \sum_{j} \delta_{ij}\, x_{i + m,\, j + n}, \qquad \text{i.e.} \qquad \frac{\partial L}{\partial W} = X \star \delta.$$
In other words, the backward pass asks: which local input patches were present where the output error was large, and how should the kernel change to reduce that error next time?
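This identity is easy to sanity-check numerically. The pure-Python sketch below (illustrative values; `corr2d` is a hand-rolled valid cross-correlation, not a framework call) computes the kernel gradient two ways, assuming a squared-error loss so that $\delta = Z - Y$: once as the cross-correlation of the input with the error map, and once by central finite differences on the loss.

```python
def corr2d(x, k):
    """Valid cross-correlation of 2-D lists."""
    kh, kw = len(k), len(k[0])
    return [[sum(x[i + m][j + n] * k[m][n]
                 for m in range(kh) for n in range(kw))
             for j in range(len(x[0]) - kw + 1)]
            for i in range(len(x) - kh + 1)]

def loss(x, w, y):
    """Squared-error loss L = 1/2 * sum((z - y)^2)."""
    z = corr2d(x, w)
    return 0.5 * sum((z[i][j] - y[i][j]) ** 2
                     for i in range(len(z)) for j in range(len(z[0])))

x = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
w = [[1.0, 0.0], [0.0, 1.0]]
y = [[5, 7], [11, 13]]

# Output error delta_ij = dL/dz_ij = z_ij - y_ij for squared error.
z = corr2d(x, w)
delta = [[z[i][j] - y[i][j] for j in range(2)] for i in range(2)]

# Analytic kernel gradient: cross-correlate the input with the error map.
analytic = corr2d(x, delta)

# Independent check: central finite differences on the loss.
eps = 1e-6
numeric = [[0.0, 0.0], [0.0, 0.0]]
for m in range(2):
    for n in range(2):
        w[m][n] += eps
        up = loss(x, w, y)
        w[m][n] -= 2 * eps
        down = loss(x, w, y)
        w[m][n] += eps
        numeric[m][n] = (up - down) / (2 * eps)

assert all(abs(analytic[m][n] - numeric[m][n]) < 1e-4
           for m in range(2) for n in range(2))
```

The finite-difference gradient knows nothing about cross-correlation; it only perturbs each weight and watches the loss, so the agreement confirms the derivation.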
A concrete numeric pass makes the structure obvious
Let

$$X = \begin{pmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 8 & 9 \end{pmatrix}, \qquad W = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}, \qquad Y = \begin{pmatrix} 5 & 7 \\ 11 & 13 \end{pmatrix},$$

with the squared-error loss $L = \tfrac{1}{2} \sum_{i, j} (z_{ij} - y_{ij})^{2}$, so that $\delta_{ij} = z_{ij} - y_{ij}$.

The forward pass gives

$$Z = X \star W = \begin{pmatrix} 6 & 8 \\ 12 & 14 \end{pmatrix},$$

so the error is

$$\delta = Z - Y = \begin{pmatrix} 1 & 1 \\ 1 & 1 \end{pmatrix}.$$

Now cross-correlate $X$ with $\delta$ to get the kernel gradient:

$$\frac{\partial L}{\partial W} = X \star \delta = \begin{pmatrix} 1 + 2 + 4 + 5 & 2 + 3 + 5 + 6 \\ 4 + 5 + 7 + 8 & 5 + 6 + 8 + 9 \end{pmatrix} = \begin{pmatrix} 12 & 16 \\ 24 & 28 \end{pmatrix}.$$

With learning rate $\eta = 0.01$, gradient descent updates the kernel to

$$W' = W - \eta\, \frac{\partial L}{\partial W} = \begin{pmatrix} 0.88 & -0.16 \\ -0.24 & 0.72 \end{pmatrix}.$$
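A full numeric pass of this kind takes only a few lines of pure Python. The values below are illustrative, and `corr2d` is a hand-rolled valid cross-correlation, not a framework call:

```python
def corr2d(x, k):
    """Valid cross-correlation of 2-D lists."""
    kh, kw = len(k), len(k[0])
    return [[sum(x[i + m][j + n] * k[m][n]
                 for m in range(kh) for n in range(kw))
             for j in range(len(x[0]) - kw + 1)]
            for i in range(len(x) - kh + 1)]

x = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]   # 3x3 input
w = [[1, 0], [0, 1]]                    # 2x2 kernel
y = [[5, 7], [11, 13]]                  # 2x2 target

z = corr2d(x, w)                        # forward pass: [[6, 8], [12, 14]]
delta = [[z[i][j] - y[i][j] for j in range(2)]
         for i in range(2)]             # output error: [[1, 1], [1, 1]]
grad = corr2d(x, delta)                 # kernel gradient: [[12, 16], [24, 28]]

lr = 0.01                               # learning rate
w_new = [[w[m][n] - lr * grad[m][n] for n in range(2)]
         for m in range(2)]             # approx [[0.88, -0.16], [-0.24, 0.72]]
```

Note that the kernel gradient line is just another call to `corr2d`, this time with the error map playing the role of the kernel.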
Why frameworks keep the unflipped kernel convention
The deep learning convention is mostly about clarity and efficiency. Since the kernel is learned from data, there is no need to impose the classical convolution flip during the forward pass. The model can learn whichever orientation is useful.
More importantly, the unflipped convention makes the implementation story clean: the same sliding-window logic appears in both the forward operation and the kernel-gradient calculation.
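A minimal sketch of that symmetry, assuming a single channel and a hypothetical `corr2d` helper: one routine serves both passes, with the error map simply taking the kernel's place on the backward side.

```python
def corr2d(x, k):
    """One sliding-window routine, reused for both passes below."""
    kh, kw = len(k), len(k[0])
    return [[sum(x[i + m][j + n] * k[m][n]
                 for m in range(kh) for n in range(kw))
             for j in range(len(x[0]) - kw + 1)]
            for i in range(len(x) - kh + 1)]

x = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
w = [[1, 0], [0, 1]]

z = corr2d(x, w)            # forward pass: slide the kernel over the input
delta = [[1, 1], [1, 1]]    # some upstream error arriving at z
grad_w = corr2d(x, delta)   # kernel gradient: slide the error map over the input
```

With the classical flipped convention, one of the two calls would need an extra kernel reversal; the unflipped convention removes that asymmetry.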
The main takeaway
CNN layers are historically called convolutions, but the standard forward pass is cross-correlation. That choice is not a mathematical compromise in practice. It is a natural parameterization of the learned kernel.
Once you expand the computation graph and apply the chain rule, the reason becomes intuitive: the kernel gradient is another cross-correlation, this time between the input and the output error map. That is the small derivation that turns a naming oddity into a coherent piece of backpropagation.