MathIsimple
Deep Learning

CNN Forward and Backward Passes: Why Convolution is actually Cross-Correlation

The operation we call 'convolution' is technically cross-correlation, and the backward pass reveals a beautiful mathematical symmetry.

CNN · Cross-Correlation · Backpropagation · Matrix Math

The operation inside a modern Conv2d layer is almost never a true mathematical convolution. In practical deep learning libraries, the forward pass uses cross-correlation: the kernel is applied without flipping.

That sounds like trivia until you derive the gradients. Then something beautiful happens: the weight gradient takes the form of another cross-correlation. The implementation choice is not arbitrary history. It lines up naturally with backpropagation.

Convolution and cross-correlation are not the same operation

Let

X = \begin{bmatrix} a_1 & b_1 \\ c_1 & d_1 \end{bmatrix}, \qquad K = \begin{bmatrix} a_2 & b_2 \\ c_2 & d_2 \end{bmatrix}

Then cross-correlation computes

a_1 a_2 + b_1 b_2 + c_1 c_2 + d_1 d_2

while true convolution first flips the kernel by $180^\circ$ and then computes

a_1 d_2 + b_1 c_2 + c_1 b_2 + d_1 a_2

Most deep learning code uses the first expression. The name "convolution" survived, but the actual tensor operation is cross-correlation.
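The distinction is easy to check numerically. A minimal NumPy sketch (with arbitrarily chosen values for the two matrices) computes both expressions for a single window:

```python
import numpy as np

X = np.array([[1., 2.],
              [3., 4.]])      # input patch
K = np.array([[10., 20.],
              [30., 40.]])    # kernel

# Cross-correlation: elementwise product, no flip.
cross_corr = float((X * K).sum())          # a1*a2 + b1*b2 + c1*c2 + d1*d2

# True convolution: flip the kernel 180 degrees first.
true_conv = float((X * np.flip(K)).sum())  # a1*d2 + b1*c2 + c1*b2 + d1*a2

print(cross_corr, true_conv)  # 300.0 200.0
```

`np.flip` with no axis argument reverses both axes, which is exactly the 180-degree rotation that distinguishes the two operations.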

That is exactly what the CNN forward formula says

For a single-channel input and kernel, the basic forward rule is

Y_{i,j} = \sum_a \sum_b X_{i+a, j+b} W_{a,b}

The kernel indices $(a,b)$ move in the same direction as the input indices $(i+a, j+b)$. There is no reversal. So the forward pass is cross-correlation by construction.

This is computationally convenient because the kernel is learned anyway. If the best detector happens to look like a flipped version of some other detector, gradient descent can simply learn that version directly.
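The forward rule translates directly into code. Here is a minimal NumPy sketch for the single-channel, stride-1, valid-padding case (the function name is our own), run on the $3 \times 3$ example used later in this article:

```python
import numpy as np

def cross_correlate2d(X, W):
    """Valid-mode cross-correlation: Y[i, j] = sum_{a,b} X[i+a, j+b] * W[a, b]."""
    kh, kw = W.shape
    oh, ow = X.shape[0] - kh + 1, X.shape[1] - kw + 1
    Y = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            # The kernel slides without flipping: indices move with the input.
            Y[i, j] = (X[i:i + kh, j:j + kw] * W).sum()
    return Y

X = np.array([[1., 2., 0.],
              [0., 1., 1.],
              [1., 0., 1.]])
W = np.array([[1., 0.],
              [0., 1.]])
print(cross_correlate2d(X, W))  # [[2. 3.]
                                #  [0. 2.]]
```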

Set up a tiny example and expand every output

Take a $3 \times 3$ input, a $2 \times 2$ kernel, and the resulting $2 \times 2$ output:

X = \begin{bmatrix} X_{11} & X_{12} & X_{13} \\ X_{21} & X_{22} & X_{23} \\ X_{31} & X_{32} & X_{33} \end{bmatrix}, \qquad W = \begin{bmatrix} W_{11} & W_{12} \\ W_{21} & W_{22} \end{bmatrix}

The forward pass produces

Y_{11} = W_{11}X_{11} + W_{12}X_{12} + W_{21}X_{21} + W_{22}X_{22}
Y_{12} = W_{11}X_{12} + W_{12}X_{13} + W_{21}X_{22} + W_{22}X_{23}
Y_{21} = W_{11}X_{21} + W_{12}X_{22} + W_{21}X_{31} + W_{22}X_{32}
Y_{22} = W_{11}X_{22} + W_{12}X_{23} + W_{21}X_{32} + W_{22}X_{33}

This expansion makes the key fact obvious: each kernel entry participates in several output positions. For example, $W_{11}$ appears in all four output formulas, so its gradient must accumulate all four downstream contributions.

Backpropagation explains the kernel gradient cleanly

Suppose the loss is

\mathcal{L} = \frac{1}{2} \sum_{i,j} (Y_{i,j} - Y^*_{i,j})^2

Then the output error is

\frac{\partial \mathcal{L}}{\partial Y_{i,j}} = Y_{i,j} - Y^*_{i,j} \equiv dY_{i,j}

For $W_{11}$, the chain rule gives

\frac{\partial \mathcal{L}}{\partial W_{11}} = dY_{11}\frac{\partial Y_{11}}{\partial W_{11}} + dY_{12}\frac{\partial Y_{12}}{\partial W_{11}} + dY_{21}\frac{\partial Y_{21}}{\partial W_{11}} + dY_{22}\frac{\partial Y_{22}}{\partial W_{11}}

From the expanded formulas, the local derivatives are

\frac{\partial Y_{11}}{\partial W_{11}} = X_{11}, \quad \frac{\partial Y_{12}}{\partial W_{11}} = X_{12}, \quad \frac{\partial Y_{21}}{\partial W_{11}} = X_{21}, \quad \frac{\partial Y_{22}}{\partial W_{11}} = X_{22}

so

\frac{\partial \mathcal{L}}{\partial W_{11}} = dY_{11}X_{11} + dY_{12}X_{12} + dY_{21}X_{21} + dY_{22}X_{22}

The same pattern holds for every kernel entry. The full kernel gradient is the cross-correlation of the input $X$ with the output error $dY$.

In other words, the backward pass asks: which local input patches were present where the output error was large, and how should the kernel change to reduce that error next time?
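This identity is easy to sanity-check numerically: slide $dY$ over $X$ exactly the way the forward pass slides $W$, and compare one entry of the result against a finite-difference estimate. A minimal NumPy sketch (helper names and random test values are our own, not from the article):

```python
import numpy as np

def cross_correlate2d(X, W):
    # Valid-mode cross-correlation, used for both the forward pass and the gradient.
    kh, kw = W.shape
    oh, ow = X.shape[0] - kh + 1, X.shape[1] - kw + 1
    return np.array([[(X[i:i + kh, j:j + kw] * W).sum() for j in range(ow)]
                     for i in range(oh)])

def loss(X, W, Y_target):
    return 0.5 * ((cross_correlate2d(X, W) - Y_target) ** 2).sum()

rng = np.random.default_rng(0)
X = rng.standard_normal((3, 3))
W = rng.standard_normal((2, 2))
Y_target = rng.standard_normal((2, 2))

# Analytic gradient: cross-correlate the input with the output error dY.
dY = cross_correlate2d(X, W) - Y_target
dW = cross_correlate2d(X, dY)

# Finite-difference estimate of dL/dW[0, 0] for comparison.
eps = 1e-6
W_pert = W.copy()
W_pert[0, 0] += eps
numeric = (loss(X, W_pert, Y_target) - loss(X, W, Y_target)) / eps
print(abs(numeric - dW[0, 0]) < 1e-4)  # True
```

Note that the same `cross_correlate2d` function computes both the forward output and the kernel gradient; only its arguments change.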

A concrete numeric pass makes the structure obvious

Let

X = \begin{bmatrix} 1 & 2 & 0 \\ 0 & 1 & 1 \\ 1 & 0 & 1 \end{bmatrix}, \qquad W = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}, \qquad Y^* = \begin{bmatrix} 1 & 1 \\ 1 & 1 \end{bmatrix}

The forward pass gives

Y = \begin{bmatrix} 2 & 3 \\ 0 & 2 \end{bmatrix}

so the error is

dY = Y - Y^* = \begin{bmatrix} 1 & 2 \\ -1 & 1 \end{bmatrix}

Now cross-correlate $X$ with $dY$ to get the kernel gradient:

\frac{\partial \mathcal{L}}{\partial W} = \begin{bmatrix} 6 & 2 \\ 1 & 4 \end{bmatrix}

With learning rate $\eta = 0.1$, gradient descent updates the kernel to

W_{\text{new}} = W - \eta \frac{\partial \mathcal{L}}{\partial W} = \begin{bmatrix} 0.4 & -0.2 \\ -0.1 & 0.6 \end{bmatrix}
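The whole numeric pass fits in a few lines of NumPy, which makes each intermediate value easy to verify (the helper name is our own):

```python
import numpy as np

def cross_correlate2d(X, W):
    # Valid-mode cross-correlation: the kernel slides without flipping.
    kh, kw = W.shape
    oh, ow = X.shape[0] - kh + 1, X.shape[1] - kw + 1
    return np.array([[(X[i:i + kh, j:j + kw] * W).sum() for j in range(ow)]
                     for i in range(oh)])

X = np.array([[1., 2., 0.], [0., 1., 1.], [1., 0., 1.]])
W = np.array([[1., 0.], [0., 1.]])
Y_target = np.ones((2, 2))

Y = cross_correlate2d(X, W)     # forward pass:      [[2. 3.] [0. 2.]]
dY = Y - Y_target               # output error:      [[ 1.  2.] [-1.  1.]]
dW = cross_correlate2d(X, dY)   # kernel gradient:   [[6. 2.] [1. 4.]]
W_new = W - 0.1 * dW            # gradient step:     [[ 0.4 -0.2] [-0.1  0.6]]
print(W_new)
```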

Why frameworks keep the unflipped kernel convention

The deep learning convention is mostly about clarity and efficiency. Since the kernel is learned from data, there is no need to impose the classical convolution flip during the forward pass. The model can learn whichever orientation is useful.

More importantly, the unflipped convention makes the implementation story clean: the same sliding-window logic appears in both the forward operation and the kernel-gradient calculation.

The main takeaway

CNN layers are historically called convolutions, but the standard forward pass is cross-correlation. That choice is not a mathematical compromise in practice. It is a natural parameterization of the learned kernel.

Once you expand the computation graph and apply the chain rule, the reason becomes intuitive: the kernel gradient is another cross-correlation, this time between the input and the output error map. That is the small derivation that turns a naming oddity into a coherent piece of backpropagation.
