
How CNNs See Images: A Step-by-Step Visual Guide

Following a single image through convolution, pooling, and classification

2026-01-19
Tags: CNN, Deep Learning, Image Classification, Computer Vision, Neural Networks

Ever wondered how your phone recognizes faces, or how self-driving cars spot pedestrians? At the heart of most modern image recognition systems is a type of neural network called a Convolutional Neural Network (CNN).

The name sounds intimidating, but the core idea is surprisingly intuitive. Let's walk through exactly what happens when a CNN looks at a picture and decides what's in it—step by step, layer by layer.

Our example: teaching a CNN to tell whether an image contains an apple or an orange.

Step 1: The Raw Material — Input Image

First, we feed the CNN a regular color photo. Let's say it's a 32×32 pixel image of a red apple with a green leaf in the background.

Here's what the computer actually sees: not a "picture," but three grids of numbers.

A color image is stored as three separate layers (called channels):

  • Red channel: A 32×32 grid where each number represents how red that pixel is
  • Green channel: Same size, representing green intensity
  • Blue channel: Same size, representing blue intensity

In our apple image:

  • The red channel has high values where the apple is (it's red) and lower values in the leaf/background areas
  • The green channel has high values in the leaf area
  • The blue channel is relatively low throughout

This RGB representation is the CNN's starting point.
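To make this concrete, here is a minimal NumPy sketch of splitting an image into its three channels (the file name apple.png is a placeholder for illustration):

```python
import numpy as np
from PIL import Image

# Load a color photo and scale it to 32x32 (the file name is a placeholder).
img = Image.open("apple.png").convert("RGB").resize((32, 32))
pixels = np.asarray(img)    # shape: (32, 32, 3) -- three stacked grids of numbers

red   = pixels[:, :, 0]     # 32x32 grid: how red each pixel is (0-255)
green = pixels[:, :, 1]     # 32x32 grid: green intensity
blue  = pixels[:, :, 2]     # 32x32 grid: blue intensity

print(red[10, 10], green[10, 10], blue[10, 10])  # one pixel's three numbers
```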

Step 2: Finding Clues — The Convolution Layer

This is where the magic happens. The convolution layer slides small "detectors" (called kernels or filters) across the image, looking for specific patterns.

Think of each kernel as a specialized magnifying glass. One might look for red patches. Another might look for curved edges. Each kernel produces a feature map—a new grid showing where that pattern appears in the image.

Let's use 4 kernels designed to distinguish apples from oranges:

Kernel 1: Red Detector

  • Scans for red regions
  • Output: Apple area lights up bright
  • Meaning: "Something red here"

Kernel 2: Yellow Detector

  • Looks for orange/yellow tones
  • Output: Mostly dark (no orange)
  • Meaning: "Not much yellow here"

Kernel 3: Edge Detector

  • Finds curved boundaries
  • Output: Round outline highlighted
  • Meaning: "Round shape here"

Kernel 4: Stem Detector

  • Looks for small protrusions
  • Output: Small bright spot at top
  • Meaning: "Stem-like bump here"

After this layer, we have 4 feature maps (each 32×32), each encoding a different type of visual clue.
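To make "sliding a kernel" concrete, here is a minimal NumPy sketch of one kernel moving across a single channel. The 3×3 edge kernel is a classic hand-made example for illustration; a real CNN learns its kernel values during training.

```python
import numpy as np

def convolve2d(channel, kernel):
    """Slide a kernel over a 2D grid and record the response at each position."""
    kh, kw = kernel.shape
    h, w = channel.shape
    # Zero-pad the border so the output stays the same size (32x32).
    pad_h, pad_w = kh // 2, kw // 2
    padded = np.pad(channel, ((pad_h, pad_h), (pad_w, pad_w)))
    out = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            patch = padded[i:i + kh, j:j + kw]
            out[i, j] = np.sum(patch * kernel)  # "how well does this patch match?"
    return out

# A classic vertical-edge kernel (illustrative; trained CNNs learn their own).
edge_kernel = np.array([[-1, 0, 1],
                        [-1, 0, 1],
                        [-1, 0, 1]])

red_channel = np.random.rand(32, 32)   # stand-in for the apple's red channel
feature_map = convolve2d(red_channel, edge_kernel)
print(feature_map.shape)               # (32, 32): one feature map per kernel
```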

Step 3: Keeping What Matters — The Pooling Layer

The feature maps are still large: 32×32 values each. That's a lot of data, and much of it is redundant. The pooling layer compresses each feature map while preserving the important information.

How Max Pooling Works

We divide each feature map into small 2×2 regions. For each region, we keep only the maximum value and discard the other three.

Result: Each 32×32 feature map becomes 16×16—half the size in each dimension, one-quarter the total data.
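In code, 2×2 max pooling is just a reshape and a max. A minimal NumPy sketch:

```python
import numpy as np

def max_pool_2x2(feature_map):
    """Keep the largest value in each non-overlapping 2x2 block."""
    h, w = feature_map.shape
    # Group the grid into 2x2 blocks, then take the max within each block.
    blocks = feature_map.reshape(h // 2, 2, w // 2, 2)
    return blocks.max(axis=(1, 3))

fm = np.random.rand(32, 32)   # stand-in for one feature map
pooled = max_pool_2x2(fm)
print(pooled.shape)           # (16, 16): one-quarter of the original data
```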

What gets preserved:

  • The bright spots indicating "red area," "curved edge," and "stem" are retained
  • Minor noise and background details get filtered out

Key Benefit: Translation Invariance

Even if the apple shifts slightly in the image (say, 1 pixel to the left), the pooled features remain nearly identical. This is why CNNs can recognize objects regardless of their exact position.

Step 4: Making the Decision — The Fully Connected Layer

So far, our clues are scattered across separate feature maps. The fully connected layer brings everything together to make a final judgment.

Flattening

First, we reshape all 4 feature maps (each 16×16 = 256 numbers) into a single long list: 4 × 256 = 1024 numbers.

Weighted Voting

The fully connected layer contains neurons that assign weights to each clue and compute scores for each possible class:

Clue                    Apple Score    Orange Score
Red detector (high)     ++             −
Yellow detector (low)   0              0
Curved edge (high)      +              +
Stem detector (high)    ++             0

In this example:

  • "Red" strongly supports apple, penalizes orange
  • "Yellow" is neutral (not detected)
  • "Curved edge" supports both (both fruits are round)
  • "Stem" supports apple (oranges have less prominent stems)

Final Output

The network converts the class scores into probabilities (typically with a softmax function): "Apple: 98%, Orange: 2%"

The CNN concludes: This image contains an apple.

The Complete Pipeline

Layer            Input            What It Does       Output
Input            Photo            Split into RGB     3 × 32×32
Convolution      RGB grids        Apply 4 kernels    4 × 32×32
Pooling          Feature maps     2×2 max pooling    4 × 16×16
Flatten          Compressed maps  Reshape to 1D      1024 numbers
Fully Connected  1D list          Weight & sum       2 probabilities
Output           Probabilities    Pick highest       "Apple"
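The whole pipeline fits in a few lines of PyTorch. This is a sketch under the article's dimensions (3-channel 32×32 input, 4 kernels, 2 classes); it also includes a ReLU nonlinearity after the convolution, which real CNNs use but the walkthrough above glosses over.

```python
import torch
import torch.nn as nn

# Mirrors the table above: conv -> pool -> flatten -> fully connected.
model = nn.Sequential(
    nn.Conv2d(in_channels=3, out_channels=4, kernel_size=3, padding=1),  # 4 x 32x32
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2),                                         # 4 x 16x16
    nn.Flatten(),                                                        # 1024 numbers
    nn.Linear(4 * 16 * 16, 2),                                           # 2 class scores
)

image = torch.rand(1, 3, 32, 32)             # a batch of one 32x32 RGB image
probs = torch.softmax(model(image), dim=1)   # e.g. "Apple: ...%, Orange: ...%"
print(probs)
```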

Why CNNs Work So Well for Images

Three key properties make CNNs powerful:

1. Local Pattern Detection

Kernels look at small regions, capturing local features like edges and textures. They don't need to see the whole image at once.

2. Parameter Sharing

The same kernel slides across the entire image. This means far fewer parameters to learn compared to fully connected networks—and the ability to detect a pattern anywhere it appears.
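The savings are easy to quantify. One rough comparison for our 32×32 RGB image: producing four 32×32 output maps either with shared 3×3 kernels, or with a dedicated weight for every input-output pair:

```python
# Four 3x3 kernels shared across the whole image (3 input channels, plus 4 biases):
conv_params = 4 * (3 * 3 * 3) + 4                             # 112 parameters

# A fully connected layer producing the same four 32x32 outputs, with one
# weight per (input pixel, output value) pair:
dense_params = (32 * 32 * 3) * (4 * 32 * 32) + (4 * 32 * 32)  # 12,587,008 parameters

print(conv_params, dense_params)
```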

3. Hierarchical Features

In deeper networks, early layers detect simple patterns (edges, colors), while later layers combine these into complex patterns (eyes, wheels, faces). Our simple example only has one convolution layer, but real CNNs stack many.

What Happens During Training?

We didn't cover training in detail, but here's the key idea:

Initially, the kernel values are random—they don't know what to look for. During training, the network sees thousands of labeled images ("this is an apple," "this is an orange"). After each image, it adjusts the kernel values slightly to reduce its mistakes.

Over time, the kernels learn to detect features that actually help distinguish the classes. The "red detector" and "stem detector" kernels we described didn't come built-in—the network discovered they were useful by trial and error.
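Here is a minimal sketch of that loop in PyTorch, with random stand-ins for the dataset and labels: show images, measure the mistakes, nudge the kernels.

```python
import torch
import torch.nn as nn

# The same little network as before (its kernel values start out random).
model = nn.Sequential(
    nn.Conv2d(3, 4, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2), nn.Flatten(), nn.Linear(1024, 2),
)

# Stand-in data: 8 random "images" with labels (0 = apple, 1 = orange).
images = torch.rand(8, 3, 32, 32)
labels = torch.randint(0, 2, (8,))

optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

for step in range(100):
    scores = model(images)          # forward pass: the network's current guesses
    loss = loss_fn(scores, labels)  # how wrong were those guesses?
    optimizer.zero_grad()
    loss.backward()                 # work out how each kernel value contributed
    optimizer.step()                # nudge every kernel slightly to reduce the mistake
```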

Key Takeaways

  1. CNNs process images as grids of numbers (RGB channels for color images)
  2. Convolution layers slide small kernels across the image to detect local patterns
  3. Pooling layers compress feature maps while preserving important signals
  4. Fully connected layers combine all detected features to make a classification decision
  5. Training teaches the kernels what patterns are useful for the task

Ready to learn more?

Our deep learning courses cover CNN architectures in depth, including ResNet, VGG, and practical applications in computer vision.
