Ever wondered how your phone recognizes faces, or how self-driving cars spot pedestrians? At the heart of most modern image recognition systems is a type of neural network called a Convolutional Neural Network (CNN).
The name sounds intimidating, but the core idea is surprisingly intuitive. Let's walk through exactly what happens when a CNN looks at a picture and decides what's in it—step by step, layer by layer.
Our example: teaching a CNN to tell whether an image contains an apple or an orange.
Step 1: The Raw Material — Input Image
First, we feed the CNN a regular color photo. Let's say it's a 32×32 pixel image of a red apple with a green leaf in the background.
Here's what the computer actually sees: not a "picture," but three grids of numbers.
A color image is stored as three separate layers (called channels):
- Red channel: A 32×32 grid where each number represents how red that pixel is
- Green channel: Same size, representing green intensity
- Blue channel: Same size, representing blue intensity
In our apple image:
- The red channel has high values where the apple is (it's red) and lower values in the leaf/background areas
- The green channel has high values in the leaf area
- The blue channel is relatively low throughout
This RGB representation is the CNN's starting point.
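To make this concrete, here's a minimal NumPy sketch of that representation. The specific values are made up for illustration; a real photo would come from an image file:

```python
import numpy as np

# A 32x32 color image stored as three stacked channels: shape (3, 32, 32).
# Each number is a pixel intensity, here scaled to the range 0.0-1.0.
image = np.zeros((3, 32, 32))

image[0, 8:24, 8:24] = 0.9   # red channel: bright where the apple is
image[1, 4:8, 12:16] = 0.8   # green channel: bright on the leaf
image[2, :, :] = 0.1         # blue channel: low everywhere

print(image.shape)  # (3, 32, 32): this grid of numbers is all the CNN sees
```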
Step 2: Finding Clues — The Convolution Layer
This is where the magic happens. The convolution layer slides small "detectors" (called kernels or filters) across the image, looking for specific patterns.
Think of each kernel as a specialized magnifying glass. One might look for red patches. Another might look for curved edges. Each kernel produces a feature map—a new grid showing where that pattern appears in the image.
Let's use 4 kernels designed to distinguish apples from oranges:
Kernel 1: Red Detector
- Scans for red regions
- Output: Apple area lights up bright
- Meaning: "Something red here"
Kernel 2: Yellow Detector
- Looks for orange/yellow tones
- Output: Mostly dark (no orange)
- Meaning: "Not much yellow here"
Kernel 3: Edge Detector
- Finds curved boundaries
- Output: Round outline highlighted
- Meaning: "Round shape here"
Kernel 4: Stem Detector
- Looks for small protrusions
- Output: Small bright spot at top
- Meaning: "Stem-like bump here"
After this layer, we have 4 feature maps (each 32×32, assuming the convolution pads the image borders so the output keeps the input's size), each encoding a different type of visual clue.
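Here's a minimal sketch of that sliding operation, assuming 3×3 kernels and zero-padding at the borders so the output stays 32×32. The "red detector" weights are hand-written for illustration; in a real network they would be learned:

```python
import numpy as np

def convolve(image, kernel):
    """Slide a 3x3 kernel across a multi-channel image (zero-padded),
    producing one feature map with the same height/width as the input."""
    c, h, w = image.shape
    padded = np.pad(image, ((0, 0), (1, 1), (1, 1)))  # zero-pad the borders
    feature_map = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            # Multiply the local window by the kernel element-wise, then sum.
            window = padded[:, i:i+3, j:j+3]
            feature_map[i, j] = np.sum(window * kernel)
    return feature_map

# A hypothetical "red detector": positive weights on the red channel,
# negative on green/blue, so red regions produce high responses.
red_kernel = np.zeros((3, 3, 3))
red_kernel[0] = 1.0   # reward red
red_kernel[1] = -0.5  # penalize green
red_kernel[2] = -0.5  # penalize blue

image = np.random.rand(3, 32, 32)   # stand-in for the apple photo
red_map = convolve(image, red_kernel)
print(red_map.shape)                # (32, 32): one feature map
```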
Step 3: Keeping What Matters — The Pooling Layer
The feature maps are still large—32×32 values each. That's a lot of data, and much of it is redundant. The pooling layer compresses each feature map while preserving the important information.
How Max Pooling Works
We divide each feature map into small 2×2 regions. For each region, we keep only the maximum value and discard the other three.
Result: Each 32×32 feature map becomes 16×16—half the size in each dimension, one-quarter the total data.
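A minimal sketch of 2×2 max pooling; it's pure bookkeeping, with no learned parameters:

```python
import numpy as np

def max_pool_2x2(feature_map):
    """Keep only the maximum of each non-overlapping 2x2 block."""
    h, w = feature_map.shape
    # Put each 2x2 block on its own pair of axes, then take the max over them.
    blocks = feature_map.reshape(h // 2, 2, w // 2, 2)
    return blocks.max(axis=(1, 3))

feature_map = np.random.rand(32, 32)   # stand-in for one feature map
pooled = max_pool_2x2(feature_map)
print(pooled.shape)  # (16, 16): one quarter of the original data
```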
What gets preserved:
- The bright spots indicating "red area," "curved edge," and "stem" are retained
- Minor noise and background details get filtered out
Key Benefit: Translation Invariance
Even if the apple shifts slightly in the image (say, 1 pixel to the left), the pooled features remain nearly identical. This is why CNNs can recognize objects regardless of their exact position.
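You can check this claim directly: shift a bright blob by one pixel, pool both versions, and the pooled maps agree almost everywhere. A sketch, redefining the max_pool_2x2 helper from above so the snippet stands alone:

```python
import numpy as np

def max_pool_2x2(fm):
    h, w = fm.shape
    return fm.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

# One bright blob, standing in for the "red area" response.
original = np.zeros((32, 32))
original[10:20, 10:20] = 1.0

# The same blob shifted one pixel to the left.
shifted = np.roll(original, -1, axis=1)

# Despite the shift, the pooled maps match at nearly every position.
agreement = np.mean(max_pool_2x2(original) == max_pool_2x2(shifted))
print(f"{agreement:.0%} of pooled values unchanged")  # ~98% in this example
```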
Step 4: Making the Decision — The Fully Connected Layer
So far, our clues are scattered across separate feature maps. The fully connected layer brings everything together to make a final judgment.
Flattening
First, we reshape all 4 feature maps (each 16×16 = 256 numbers) into a single long list: 4 × 256 = 1024 numbers.
Weighted Voting
The fully connected layer contains neurons that assign weights to each clue and compute scores for each possible class:
| Clue | Apple Score | Orange Score |
|---|---|---|
| Red detector (high) | +++ | − |
| Yellow detector (low) | 0 | 0 |
| Curved edge (high) | + | + |
| Stem detector (high) | ++ | 0 |
In this example:
- "Red" strongly supports apple, penalizes orange
- "Yellow" is neutral (not detected)
- "Curved edge" supports both (both fruits are round)
- "Stem" supports apple (oranges have less prominent stems)
Final Output
The network converts the scores into probabilities using a function called softmax: "Apple: 98%, Orange: 2%"
The CNN concludes: This image contains an apple.
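In code, this whole step boils down to a flatten, a weighted sum per class, and a softmax. The weights below are random placeholders; a trained network would have learned values playing the roles sketched in the table:

```python
import numpy as np

# Four pooled 16x16 feature maps, flattened into one list of 1024 numbers.
pooled_maps = np.random.rand(4, 16, 16)   # stand-in for real pooled features
x = pooled_maps.reshape(-1)               # shape: (1024,)

# Fully connected layer: one weight per (clue, class) pair, plus a bias.
rng = np.random.default_rng(0)
W = rng.normal(size=(2, 1024)) * 0.01     # 2 rows: apple, orange
b = np.zeros(2)
scores = W @ x + b                        # raw class scores

# Softmax squashes the scores into probabilities that sum to 1.
probs = np.exp(scores - scores.max())
probs /= probs.sum()
print(dict(zip(["apple", "orange"], probs.round(3))))
```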
The Complete Pipeline
| Layer | Input | What It Does | Output |
|---|---|---|---|
| Input | Photo | Split into RGB | 3 × 32×32 |
| Convolution | RGB grids | Apply 4 kernels | 4 × 32×32 |
| Pooling | Feature maps | 2×2 max pooling | 4 × 16×16 |
| Flatten | Compressed maps | Reshape to 1D | 1024 numbers |
| Fully Connected | 1D list | Weight & sum | 2 probabilities |
| Output | Probabilities | Pick highest | "Apple" |
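In a modern framework, the whole pipeline is a handful of lines. Here's a minimal PyTorch sketch matching the table, plus a ReLU activation that the walkthrough glossed over but real CNNs place after each convolution. Untrained, its guesses are essentially random:

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 4, kernel_size=3, padding=1),  # 3x32x32 -> 4x32x32
    nn.ReLU(),                                  # standard activation
    nn.MaxPool2d(2),                            # 4x32x32 -> 4x16x16
    nn.Flatten(),                               # -> 1024 numbers
    nn.Linear(1024, 2),                         # -> 2 class scores
)

image = torch.rand(1, 3, 32, 32)       # one random stand-in "photo"
probs = model(image).softmax(dim=1)    # roughly 50/50 before training
print(probs)
```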
Why CNNs Work So Well for Images
Three key properties make CNNs powerful:
1. Local Pattern Detection
Kernels look at small regions, capturing local features like edges and textures. They don't need to see the whole image at once.
2. Parameter Sharing
The same kernel slides across the entire image. This means far fewer parameters to learn compared to fully connected networks—and the ability to detect a pattern anywhere it appears.
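A quick count shows how big the savings are, assuming the 3×3 kernels from our example:

```python
# Convolution layer: 4 kernels, each 3x3 across 3 channels, plus one bias each.
conv_params = 4 * (3 * 3 * 3 + 1)             # = 112 parameters

# A fully connected layer producing the same 4x32x32 output from the
# 3x32x32 input would need a weight for every input-output pair.
dense_params = (3 * 32 * 32) * (4 * 32 * 32)  # = 12,582,912 parameters

print(conv_params, dense_params)  # 112 vs. about 12.6 million
```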
3. Hierarchical Features
In deeper networks, early layers detect simple patterns (edges, colors), while later layers combine these into complex patterns (eyes, wheels, faces). Our simple example only has one convolution layer, but real CNNs stack many.
What Happens During Training?
We didn't cover training in detail, but here's the key idea:
Initially, the kernel values are random—they don't know what to look for. During training, the network sees thousands of labeled images ("this is an apple," "this is an orange"). After each image, it adjusts the kernel values slightly to reduce its mistakes.
Over time, the kernels learn to detect features that actually help distinguish the classes. The "red detector" and "stem detector" kernels we described didn't come built-in—the network discovered they were useful by trial and error.
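Here's a minimal sketch of that loop in PyTorch, reusing the small model from the pipeline section. The images and labels are random placeholders standing in for a real labeled dataset:

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 4, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2), nn.Flatten(), nn.Linear(1024, 2),
)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

# Placeholder batch: 8 random "images" labeled apple (0) or orange (1).
images = torch.rand(8, 3, 32, 32)
labels = torch.randint(0, 2, (8,))

for step in range(100):
    scores = model(images)          # forward pass: the network's current guesses
    loss = loss_fn(scores, labels)  # how wrong were they?
    optimizer.zero_grad()
    loss.backward()                 # work out how to nudge every kernel value
    optimizer.step()                # nudge them slightly to reduce the mistakes
```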
Key Takeaways
- CNNs process images as grids of numbers (RGB channels for color images)
- Convolution layers slide small kernels across the image to detect local patterns
- Pooling layers compress feature maps while preserving important signals
- Fully connected layers combine all detected features to make a classification decision
- Training teaches the kernels what patterns are useful for the task