MathIsimple
CNNs & Computer Vision

Convolution Under the Hood: Cross-Correlation, Channels, and CNN Backpropagation

The operation we call 'convolution' is technically cross-correlation, and the backward pass reveals a clean mathematical symmetry.

Tags: CNN · Cross-Correlation · Backpropagation · Matrix Math

The operation inside a modern Conv2d layer is almost never a true mathematical convolution. In practical deep learning libraries, the forward pass uses cross-correlation: the kernel is applied without flipping.

That sounds like trivia until you derive the gradients. Then something coherent happens — the weight gradient takes the form of another cross-correlation. The implementation choice is not arbitrary history. It lines up naturally with backpropagation.

Convolution and cross-correlation are not the same operation

Let

$$X = \begin{bmatrix} a_1 & b_1 \\ c_1 & d_1 \end{bmatrix}, \qquad K = \begin{bmatrix} a_2 & b_2 \\ c_2 & d_2 \end{bmatrix}$$

Then cross-correlation computes

$$a_1 a_2 + b_1 b_2 + c_1 c_2 + d_1 d_2$$

while true convolution first flips the kernel by $180^\circ$ and then computes

$$a_1 d_2 + b_1 c_2 + c_1 b_2 + d_1 a_2$$

Most deep learning code uses the first expression. The name "convolution" survived, but the actual tensor operation is cross-correlation.
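The difference is easy to check directly. A minimal NumPy sketch (illustrative values, not from the article) computes both quantities for a fully overlapping $2 \times 2$ window; true convolution is just cross-correlation with the kernel rotated $180^\circ$:

```python
import numpy as np

# Cross-correlation at full overlap is an elementwise product-sum;
# true convolution flips the kernel 180 degrees first.
X = np.array([[1.0, 2.0],
              [3.0, 4.0]])    # a1 b1 / c1 d1
K = np.array([[10.0, 20.0],
              [30.0, 40.0]])  # a2 b2 / c2 d2

cross_corr = np.sum(X * K)             # a1*a2 + b1*b2 + c1*c2 + d1*d2
true_conv = np.sum(X * K[::-1, ::-1])  # a1*d2 + b1*c2 + c1*b2 + d1*a2

print(cross_corr)  # 300.0
print(true_conv)   # 200.0
```

The `K[::-1, ::-1]` slice is the $180^\circ$ rotation: it reverses both the row and column order of the kernel.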

The forward formula is cross-correlation by construction

For a single-channel input and kernel, the basic forward rule is

$$Y_{i,j} = \sum_a \sum_b X_{i+a,\, j+b}\, W_{a,b}$$

The kernel indices $(a, b)$ move in the same direction as the input indices $(i+a, j+b)$. There is no reversal. So the forward pass is cross-correlation directly.

This is computationally convenient because the kernel is learned anyway. If the best detector happens to look like a flipped version of some other detector, gradient descent can simply learn that version directly.
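The forward rule above can be sketched as a plain sliding window. This is a minimal single-channel, stride-1, no-padding version for illustration (the helper name `correlate2d_valid` is my own, not a framework API); the matrices are the ones used in the worked example later in the article:

```python
import numpy as np

def correlate2d_valid(X, W):
    """'Valid' cross-correlation: slide W over X, no kernel flip."""
    kh, kw = W.shape
    oh = X.shape[0] - kh + 1   # output height
    ow = X.shape[1] - kw + 1   # output width
    Y = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            # Y[i, j] = sum_{a,b} X[i+a, j+b] * W[a, b]
            Y[i, j] = np.sum(X[i:i+kh, j:j+kw] * W)
    return Y

X = np.array([[1., 2., 0.],
              [0., 1., 1.],
              [1., 0., 1.]])
W = np.array([[1., 0.],
              [0., 1.]])
print(correlate2d_valid(X, W))  # [[2. 3.] [0. 2.]]
```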

Setting up a tiny example

Take a $3 \times 3$ input, a $2 \times 2$ kernel, and the resulting $2 \times 2$ output:

$$X = \begin{bmatrix} X_{11} & X_{12} & X_{13} \\ X_{21} & X_{22} & X_{23} \\ X_{31} & X_{32} & X_{33} \end{bmatrix}, \qquad W = \begin{bmatrix} W_{11} & W_{12} \\ W_{21} & W_{22} \end{bmatrix}$$

The forward pass produces

$$\begin{aligned}
Y_{11} &= W_{11}X_{11} + W_{12}X_{12} + W_{21}X_{21} + W_{22}X_{22} \\
Y_{12} &= W_{11}X_{12} + W_{12}X_{13} + W_{21}X_{22} + W_{22}X_{23} \\
Y_{21} &= W_{11}X_{21} + W_{12}X_{22} + W_{21}X_{31} + W_{22}X_{32} \\
Y_{22} &= W_{11}X_{22} + W_{12}X_{23} + W_{21}X_{32} + W_{22}X_{33}
\end{aligned}$$

This expansion makes the key fact obvious: each kernel entry participates in several output positions. $W_{11}$ appears in all four output formulas, so its gradient must accumulate all four downstream contributions.

Backpropagation: the kernel gradient is a cross-correlation

Suppose the loss is

$$\mathcal{L} = \frac{1}{2} \sum_{i,j} \left(Y_{i,j} - Y^*_{i,j}\right)^2$$

Then the output error is

$$\frac{\partial \mathcal{L}}{\partial Y_{i,j}} = Y_{i,j} - Y^*_{i,j} \equiv dY_{i,j}$$

For $W_{11}$, the chain rule gives

$$\frac{\partial \mathcal{L}}{\partial W_{11}} = dY_{11}\frac{\partial Y_{11}}{\partial W_{11}} + dY_{12}\frac{\partial Y_{12}}{\partial W_{11}} + dY_{21}\frac{\partial Y_{21}}{\partial W_{11}} + dY_{22}\frac{\partial Y_{22}}{\partial W_{11}}$$

From the expanded formulas, the local derivatives are

$$\frac{\partial Y_{11}}{\partial W_{11}} = X_{11}, \quad \frac{\partial Y_{12}}{\partial W_{11}} = X_{12}, \quad \frac{\partial Y_{21}}{\partial W_{11}} = X_{21}, \quad \frac{\partial Y_{22}}{\partial W_{11}} = X_{22}$$

so

$$\frac{\partial \mathcal{L}}{\partial W_{11}} = dY_{11}X_{11} + dY_{12}X_{12} + dY_{21}X_{21} + dY_{22}X_{22}$$

The same pattern holds for every kernel entry. The full kernel gradient is the cross-correlation of the input $X$ with the output error $dY$.

In other words, the backward pass asks: which local input patches were present where the output error was large, and how should the kernel change to reduce that error next time?
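This identity is easy to verify numerically. The sketch below (random test values, helper loops written out explicitly for clarity) computes $\partial \mathcal{L} / \partial W$ twice: once by summing the chain-rule terms entry by entry, exactly as derived above, and once by sliding $dY$ over $X$ as a cross-correlation. The two agree:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((3, 3))    # arbitrary 3x3 input
dY = rng.standard_normal((2, 2))   # arbitrary 2x2 output error

# Chain rule directly: dL/dW[a,b] = sum_{i,j} dY[i,j] * X[i+a, j+b]
dW_chain = np.zeros((2, 2))
for a in range(2):
    for b in range(2):
        for i in range(2):
            for j in range(2):
                dW_chain[a, b] += dY[i, j] * X[i + a, j + b]

# The same quantity as a cross-correlation: slide dY over X, no flip
dW_corr = np.zeros((2, 2))
for a in range(2):
    for b in range(2):
        dW_corr[a, b] = np.sum(X[a:a+2, b:b+2] * dY)

print(np.allclose(dW_chain, dW_corr))  # True
```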

A concrete numeric pass

Let

$$X = \begin{bmatrix} 1 & 2 & 0 \\ 0 & 1 & 1 \\ 1 & 0 & 1 \end{bmatrix}, \qquad W = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}, \qquad Y^* = \begin{bmatrix} 1 & 1 \\ 1 & 1 \end{bmatrix}$$

The forward pass gives

$$Y = \begin{bmatrix} 2 & 3 \\ 0 & 2 \end{bmatrix}$$

so the error is

dY=YY=[1211]dY = Y - Y^* = \begin{bmatrix} 1 & 2 \\ -1 & 1 \end{bmatrix}

Cross-correlate $X$ with $dY$ to get the kernel gradient:

$$\frac{\partial \mathcal{L}}{\partial W} = \begin{bmatrix} 6 & 2 \\ 1 & 4 \end{bmatrix}$$

With learning rate $\eta = 0.1$, gradient descent updates the kernel to

Wnew=WηLW=[0.40.20.10.6]W_{\text{new}} = W - \eta \frac{\partial \mathcal{L}}{\partial W} = \begin{bmatrix} 0.4 & -0.2 \\ -0.1 & 0.6 \end{bmatrix}

Why frameworks keep the unflipped convention

The deep learning convention is mostly about clarity and efficiency. Since the kernel is learned from data, there is no need to impose the classical convolution flip during the forward pass. The model can learn whichever orientation is useful.

More importantly, the unflipped convention makes the implementation story clean: the same sliding-window logic appears in both the forward operation and the kernel-gradient calculation. PyTorch's torch.nn.functional.conv2d and the autograd-generated backward pass share the same primitive, just with different operand roles.

The main takeaway

CNN layers are historically called convolutions, but the standard forward pass is cross-correlation. That choice is not a mathematical compromise in practice — it is a natural parameterization of the learned kernel.

Once you expand the computation graph and apply the chain rule, the reason becomes intuitive: the kernel gradient is another cross-correlation, this time between the input and the output error map. That is the small derivation that turns a naming oddity into a coherent piece of backpropagation.

