MathIsimple
Deep Learning

CNN Forward and Backward Passes: Why Convolution is actually Cross-Correlation

The operation we call 'convolution' is technically cross-correlation, and the backward pass reveals a beautiful mathematical symmetry.

CNN · Cross-Correlation · Backpropagation · Matrix Math

The operation inside a modern Conv2d layer is almost never a true mathematical convolution. In practical deep learning libraries, the forward pass uses cross-correlation: the kernel is applied without flipping.

That sounds like trivia until you derive the gradients. Then something beautiful happens: the weight gradient takes the form of another cross-correlation. The implementation choice is not arbitrary history. It lines up naturally with backpropagation.

Convolution and cross-correlation are not the same operation

Let

X = \begin{bmatrix} a_1 & b_1 \\ c_1 & d_1 \end{bmatrix}, \qquad K = \begin{bmatrix} a_2 & b_2 \\ c_2 & d_2 \end{bmatrix}

Then cross-correlation computes

a_1 a_2 + b_1 b_2 + c_1 c_2 + d_1 d_2

while true convolution first flips the kernel by $180^\circ$ and then computes

a_1 d_2 + b_1 c_2 + c_1 b_2 + d_1 a_2

Most deep learning code uses the first expression. The name "convolution" survived, but the actual tensor operation is cross-correlation.
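The distinction is easy to check numerically. A minimal NumPy sketch (with arbitrarily chosen values for the two matrices) computes both expressions for a single window:

```python
import numpy as np

X = np.array([[1., 2.],
              [3., 4.]])      # input patch
K = np.array([[10., 20.],
              [30., 40.]])    # kernel

# Cross-correlation: elementwise product, no flip.
cross_corr = float((X * K).sum())          # a1*a2 + b1*b2 + c1*c2 + d1*d2

# True convolution: flip the kernel 180 degrees first.
true_conv = float((X * np.flip(K)).sum())  # a1*d2 + b1*c2 + c1*b2 + d1*a2

print(cross_corr, true_conv)  # 300.0 200.0
```

`np.flip` with no axis argument reverses both axes, which is exactly the 180-degree rotation that distinguishes the two operations.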

That is exactly what the CNN forward formula says

For a single-channel input and kernel, the basic forward rule is

Y_{i,j} = \sum_a \sum_b X_{i+a, j+b} W_{a,b}

The kernel indices $(a,b)$ move in the same direction as the input indices $(i+a, j+b)$. There is no reversal. So the forward pass is cross-correlation by construction.

This is computationally convenient because the kernel is learned anyway. If the best detector happens to look like a flipped version of some other detector, gradient descent can simply learn that version directly.
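The forward rule translates directly into code. Here is a minimal NumPy sketch for the single-channel, stride-1, valid-padding case (the function name is our own), run on the $3 \times 3$ example used later in this article:

```python
import numpy as np

def cross_correlate2d(X, W):
    """Valid-mode cross-correlation: Y[i, j] = sum_{a,b} X[i+a, j+b] * W[a, b]."""
    kh, kw = W.shape
    oh, ow = X.shape[0] - kh + 1, X.shape[1] - kw + 1
    Y = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            # The kernel slides without flipping: indices move with the input.
            Y[i, j] = (X[i:i + kh, j:j + kw] * W).sum()
    return Y

X = np.array([[1., 2., 0.],
              [0., 1., 1.],
              [1., 0., 1.]])
W = np.array([[1., 0.],
              [0., 1.]])
print(cross_correlate2d(X, W))  # [[2. 3.]
                                #  [0. 2.]]
```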

Set up a tiny example and expand every output

Take a $3 \times 3$ input, a $2 \times 2$ kernel, and the resulting $2 \times 2$ output:

X = \begin{bmatrix} X_{11} & X_{12} & X_{13} \\ X_{21} & X_{22} & X_{23} \\ X_{31} & X_{32} & X_{33} \end{bmatrix}, \qquad W = \begin{bmatrix} W_{11} & W_{12} \\ W_{21} & W_{22} \end{bmatrix}

The forward pass produces

Y_{11} = W_{11}X_{11} + W_{12}X_{12} + W_{21}X_{21} + W_{22}X_{22}
Y_{12} = W_{11}X_{12} + W_{12}X_{13} + W_{21}X_{22} + W_{22}X_{23}
Y_{21} = W_{11}X_{21} + W_{12}X_{22} + W_{21}X_{31} + W_{22}X_{32}
Y_{22} = W_{11}X_{22} + W_{12}X_{23} + W_{21}X_{32} + W_{22}X_{33}

This expansion makes the key fact obvious: each kernel entry participates in several output positions. For example, $W_{11}$ appears in all four output formulas, so its gradient must accumulate all four downstream contributions.

Backpropagation explains the kernel gradient cleanly

Suppose the loss is

\mathcal{L} = \frac{1}{2} \sum_{i,j} (Y_{i,j} - Y^*_{i,j})^2

Then the output error is

\frac{\partial \mathcal{L}}{\partial Y_{i,j}} = Y_{i,j} - Y^*_{i,j} \equiv dY_{i,j}

For $W_{11}$, the chain rule gives

\frac{\partial \mathcal{L}}{\partial W_{11}} = dY_{11}\frac{\partial Y_{11}}{\partial W_{11}} + dY_{12}\frac{\partial Y_{12}}{\partial W_{11}} + dY_{21}\frac{\partial Y_{21}}{\partial W_{11}} + dY_{22}\frac{\partial Y_{22}}{\partial W_{11}}

From the expanded formulas, the local derivatives are

\frac{\partial Y_{11}}{\partial W_{11}} = X_{11}, \quad \frac{\partial Y_{12}}{\partial W_{11}} = X_{12}, \quad \frac{\partial Y_{21}}{\partial W_{11}} = X_{21}, \quad \frac{\partial Y_{22}}{\partial W_{11}} = X_{22}

so

\frac{\partial \mathcal{L}}{\partial W_{11}} = dY_{11}X_{11} + dY_{12}X_{12} + dY_{21}X_{21} + dY_{22}X_{22}

The same pattern holds for every kernel entry. The full kernel gradient is the cross-correlation of the input $X$ with the output error $dY$.

In other words, the backward pass asks: which local input patches were present where the output error was large, and how should the kernel change to reduce that error next time?
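This identity is easy to sanity-check numerically: slide $dY$ over $X$ exactly the way the forward pass slides $W$, and compare one entry of the result against a finite-difference estimate. A minimal NumPy sketch (helper names and random test values are our own, not from the article):

```python
import numpy as np

def cross_correlate2d(X, W):
    # Valid-mode cross-correlation, used for both the forward pass and the gradient.
    kh, kw = W.shape
    oh, ow = X.shape[0] - kh + 1, X.shape[1] - kw + 1
    return np.array([[(X[i:i + kh, j:j + kw] * W).sum() for j in range(ow)]
                     for i in range(oh)])

def loss(X, W, Y_target):
    return 0.5 * ((cross_correlate2d(X, W) - Y_target) ** 2).sum()

rng = np.random.default_rng(0)
X = rng.standard_normal((3, 3))
W = rng.standard_normal((2, 2))
Y_target = rng.standard_normal((2, 2))

# Analytic gradient: cross-correlate the input with the output error dY.
dY = cross_correlate2d(X, W) - Y_target
dW = cross_correlate2d(X, dY)

# Finite-difference estimate of dL/dW[0, 0] for comparison.
eps = 1e-6
W_pert = W.copy()
W_pert[0, 0] += eps
numeric = (loss(X, W_pert, Y_target) - loss(X, W, Y_target)) / eps
print(abs(numeric - dW[0, 0]) < 1e-4)  # True
```

Note that the same `cross_correlate2d` function computes both the forward output and the kernel gradient; only its arguments change.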

A concrete numeric pass makes the structure obvious

Let

X = \begin{bmatrix} 1 & 2 & 0 \\ 0 & 1 & 1 \\ 1 & 0 & 1 \end{bmatrix}, \qquad W = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}, \qquad Y^* = \begin{bmatrix} 1 & 1 \\ 1 & 1 \end{bmatrix}

The forward pass gives

Y = \begin{bmatrix} 2 & 3 \\ 0 & 2 \end{bmatrix}

so the error is

dY = Y - Y^* = \begin{bmatrix} 1 & 2 \\ -1 & 1 \end{bmatrix}

Now cross-correlate $X$ with $dY$ to get the kernel gradient:

\frac{\partial \mathcal{L}}{\partial W} = \begin{bmatrix} 6 & 2 \\ 1 & 4 \end{bmatrix}

With learning rate $\eta = 0.1$, gradient descent updates the kernel to

W_{\text{new}} = W - \eta \frac{\partial \mathcal{L}}{\partial W} = \begin{bmatrix} 0.4 & -0.2 \\ -0.1 & 0.6 \end{bmatrix}
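The whole numeric pass fits in a few lines of NumPy, which makes each intermediate value easy to verify (the helper name is our own):

```python
import numpy as np

def cross_correlate2d(X, W):
    # Valid-mode cross-correlation: the kernel slides without flipping.
    kh, kw = W.shape
    oh, ow = X.shape[0] - kh + 1, X.shape[1] - kw + 1
    return np.array([[(X[i:i + kh, j:j + kw] * W).sum() for j in range(ow)]
                     for i in range(oh)])

X = np.array([[1., 2., 0.], [0., 1., 1.], [1., 0., 1.]])
W = np.array([[1., 0.], [0., 1.]])
Y_target = np.ones((2, 2))

Y = cross_correlate2d(X, W)     # forward pass:      [[2. 3.] [0. 2.]]
dY = Y - Y_target               # output error:      [[ 1.  2.] [-1.  1.]]
dW = cross_correlate2d(X, dY)   # kernel gradient:   [[6. 2.] [1. 4.]]
W_new = W - 0.1 * dW            # gradient step:     [[ 0.4 -0.2] [-0.1  0.6]]
print(W_new)
```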

Why frameworks keep the unflipped kernel convention

The deep learning convention is mostly about clarity and efficiency. Since the kernel is learned from data, there is no need to impose the classical convolution flip during the forward pass. The model can learn whichever orientation is useful.

More importantly, the unflipped convention makes the implementation story clean: the same sliding-window logic appears in both the forward operation and the kernel-gradient calculation.

The main takeaway

CNN layers are historically called convolutions, but the standard forward pass is cross-correlation. That choice is not a mathematical compromise in practice. It is a natural parameterization of the learned kernel.

Once you expand the computation graph and apply the chain rule, the reason becomes intuitive: the kernel gradient is another cross-correlation, this time between the input and the output error map. That is the small derivation that turns a naming oddity into a coherent piece of backpropagation.
