MathIsimple
CNNs & Computer Vision

Convolution Under the Hood: Cross-Correlation, Channels, and CNN Backpropagation

The operation we call 'convolution' is technically cross-correlation, and the backward pass reveals a clean mathematical symmetry.

Tags: CNN · Cross-Correlation · Backpropagation · Matrix Math

The operation inside a modern Conv2d layer is almost never a true mathematical convolution. In practical deep learning libraries, the forward pass uses cross-correlation: the kernel is applied without flipping.

That sounds like trivia until you derive the gradients. Then something coherent happens — the weight gradient takes the form of another cross-correlation. The implementation choice is not arbitrary history. It lines up naturally with backpropagation.

Convolution and cross-correlation are not the same operation

Let

$$X = \begin{bmatrix} a_1 & b_1 \\ c_1 & d_1 \end{bmatrix}, \qquad K = \begin{bmatrix} a_2 & b_2 \\ c_2 & d_2 \end{bmatrix}$$

Then cross-correlation computes

$$a_1 a_2 + b_1 b_2 + c_1 c_2 + d_1 d_2$$

while true convolution first flips the kernel by $180^\circ$ and then computes

$$a_1 d_2 + b_1 c_2 + c_1 b_2 + d_1 a_2$$

Most deep learning code uses the first expression. The name "convolution" survived, but the actual tensor operation is cross-correlation.
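The difference is easy to check directly. A minimal NumPy sketch (illustrative values, not from the article) computes both quantities for a fully overlapping $2 \times 2$ window; true convolution is just cross-correlation with the kernel rotated $180^\circ$:

```python
import numpy as np

# Cross-correlation at full overlap is an elementwise product-sum;
# true convolution flips the kernel 180 degrees first.
X = np.array([[1.0, 2.0],
              [3.0, 4.0]])    # a1 b1 / c1 d1
K = np.array([[10.0, 20.0],
              [30.0, 40.0]])  # a2 b2 / c2 d2

cross_corr = np.sum(X * K)             # a1*a2 + b1*b2 + c1*c2 + d1*d2
true_conv = np.sum(X * K[::-1, ::-1])  # a1*d2 + b1*c2 + c1*b2 + d1*a2

print(cross_corr)  # 300.0
print(true_conv)   # 200.0
```

The `K[::-1, ::-1]` slice is the $180^\circ$ rotation: it reverses both the row and column order of the kernel.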

The forward formula is cross-correlation by construction

For a single-channel input and kernel, the basic forward rule is

$$Y_{i,j} = \sum_a \sum_b X_{i+a,\, j+b}\, W_{a,b}$$

The kernel indices $(a, b)$ move in the same direction as the input indices $(i+a, j+b)$. There is no reversal. So the forward pass is cross-correlation directly.

This is computationally convenient because the kernel is learned anyway. If the best detector happens to look like a flipped version of some other detector, gradient descent can simply learn that version directly.
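The forward rule above can be sketched as a plain sliding window. This is a minimal single-channel, stride-1, no-padding version for illustration (the helper name `correlate2d_valid` is my own, not a framework API); the matrices are the ones used in the worked example later in the article:

```python
import numpy as np

def correlate2d_valid(X, W):
    """'Valid' cross-correlation: slide W over X, no kernel flip."""
    kh, kw = W.shape
    oh = X.shape[0] - kh + 1   # output height
    ow = X.shape[1] - kw + 1   # output width
    Y = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            # Y[i, j] = sum_{a,b} X[i+a, j+b] * W[a, b]
            Y[i, j] = np.sum(X[i:i+kh, j:j+kw] * W)
    return Y

X = np.array([[1., 2., 0.],
              [0., 1., 1.],
              [1., 0., 1.]])
W = np.array([[1., 0.],
              [0., 1.]])
print(correlate2d_valid(X, W))  # [[2. 3.] [0. 2.]]
```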

Setting up a tiny example

Take a $3 \times 3$ input, a $2 \times 2$ kernel, and the resulting $2 \times 2$ output:

$$X = \begin{bmatrix} X_{11} & X_{12} & X_{13} \\ X_{21} & X_{22} & X_{23} \\ X_{31} & X_{32} & X_{33} \end{bmatrix}, \qquad W = \begin{bmatrix} W_{11} & W_{12} \\ W_{21} & W_{22} \end{bmatrix}$$

The forward pass produces

$$\begin{aligned}
Y_{11} &= W_{11}X_{11} + W_{12}X_{12} + W_{21}X_{21} + W_{22}X_{22} \\
Y_{12} &= W_{11}X_{12} + W_{12}X_{13} + W_{21}X_{22} + W_{22}X_{23} \\
Y_{21} &= W_{11}X_{21} + W_{12}X_{22} + W_{21}X_{31} + W_{22}X_{32} \\
Y_{22} &= W_{11}X_{22} + W_{12}X_{23} + W_{21}X_{32} + W_{22}X_{33}
\end{aligned}$$

This expansion makes the key fact obvious: each kernel entry participates in several output positions. $W_{11}$ appears in all four output formulas, so its gradient must accumulate all four downstream contributions.

Backpropagation: the kernel gradient is a cross-correlation

Suppose the loss is

$$\mathcal{L} = \frac{1}{2} \sum_{i,j} \left(Y_{i,j} - Y^*_{i,j}\right)^2$$

Then the output error is

$$\frac{\partial \mathcal{L}}{\partial Y_{i,j}} = Y_{i,j} - Y^*_{i,j} \equiv dY_{i,j}$$

For $W_{11}$, the chain rule gives

$$\frac{\partial \mathcal{L}}{\partial W_{11}} = dY_{11}\frac{\partial Y_{11}}{\partial W_{11}} + dY_{12}\frac{\partial Y_{12}}{\partial W_{11}} + dY_{21}\frac{\partial Y_{21}}{\partial W_{11}} + dY_{22}\frac{\partial Y_{22}}{\partial W_{11}}$$

From the expanded formulas, the local derivatives are

$$\frac{\partial Y_{11}}{\partial W_{11}} = X_{11}, \quad \frac{\partial Y_{12}}{\partial W_{11}} = X_{12}, \quad \frac{\partial Y_{21}}{\partial W_{11}} = X_{21}, \quad \frac{\partial Y_{22}}{\partial W_{11}} = X_{22}$$

so

$$\frac{\partial \mathcal{L}}{\partial W_{11}} = dY_{11}X_{11} + dY_{12}X_{12} + dY_{21}X_{21} + dY_{22}X_{22}$$

The same pattern holds for every kernel entry. The full kernel gradient is the cross-correlation of the input $X$ with the output error $dY$.

In other words, the backward pass asks: which local input patches were present where the output error was large, and how should the kernel change to reduce that error next time?
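This identity is easy to verify numerically. The sketch below (random test values, helper loops written out explicitly for clarity) computes $\partial \mathcal{L} / \partial W$ twice: once by summing the chain-rule terms entry by entry, exactly as derived above, and once by sliding $dY$ over $X$ as a cross-correlation. The two agree:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((3, 3))    # arbitrary 3x3 input
dY = rng.standard_normal((2, 2))   # arbitrary 2x2 output error

# Chain rule directly: dL/dW[a,b] = sum_{i,j} dY[i,j] * X[i+a, j+b]
dW_chain = np.zeros((2, 2))
for a in range(2):
    for b in range(2):
        for i in range(2):
            for j in range(2):
                dW_chain[a, b] += dY[i, j] * X[i + a, j + b]

# The same quantity as a cross-correlation: slide dY over X, no flip
dW_corr = np.zeros((2, 2))
for a in range(2):
    for b in range(2):
        dW_corr[a, b] = np.sum(X[a:a+2, b:b+2] * dY)

print(np.allclose(dW_chain, dW_corr))  # True
```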

A concrete numeric pass

Let

$$X = \begin{bmatrix} 1 & 2 & 0 \\ 0 & 1 & 1 \\ 1 & 0 & 1 \end{bmatrix}, \qquad W = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}, \qquad Y^* = \begin{bmatrix} 1 & 1 \\ 1 & 1 \end{bmatrix}$$

The forward pass gives

$$Y = \begin{bmatrix} 2 & 3 \\ 0 & 2 \end{bmatrix}$$

so the error is

dY=YY=[1211]dY = Y - Y^* = \begin{bmatrix} 1 & 2 \\ -1 & 1 \end{bmatrix}

Cross-correlate $X$ with $dY$ to get the kernel gradient:

$$\frac{\partial \mathcal{L}}{\partial W} = \begin{bmatrix} 6 & 2 \\ 1 & 4 \end{bmatrix}$$

With learning rate $\eta = 0.1$, gradient descent updates the kernel to

Wnew=WηLW=[0.40.20.10.6]W_{\text{new}} = W - \eta \frac{\partial \mathcal{L}}{\partial W} = \begin{bmatrix} 0.4 & -0.2 \\ -0.1 & 0.6 \end{bmatrix}

Why frameworks keep the unflipped convention

The deep learning convention is mostly about clarity and efficiency. Since the kernel is learned from data, there is no need to impose the classical convolution flip during the forward pass. The model can learn whichever orientation is useful.

More importantly, the unflipped convention makes the implementation story clean: the same sliding-window logic appears in both the forward operation and the kernel-gradient calculation. PyTorch's torch.nn.functional.conv2d and the autograd-generated backward pass share the same primitive, just with different operand roles.

The main takeaway

CNN layers are historically called convolutions, but the standard forward pass is cross-correlation. That choice is not a mathematical compromise in practice — it is a natural parameterization of the learned kernel.

Once you expand the computation graph and apply the chain rule, the reason becomes intuitive: the kernel gradient is another cross-correlation, this time between the input and the output error map. That is the small derivation that turns a naming oddity into a coherent piece of backpropagation.

