Ever wondered how your phone recognizes faces, or how self-driving cars spot pedestrians? At the heart of most modern image recognition systems is a type of neural network called a Convolutional Neural Network (CNN).
The name sounds intimidating, but the core idea is surprisingly intuitive. Let's walk through exactly what happens when a CNN looks at a picture and decides what's in it—step by step, layer by layer.
Our example: teaching a CNN to tell whether an image contains an apple or an orange.
Step 1: The Raw Material — Input Image
First, we feed the CNN a regular color photo. Let's say it's a 32×32 pixel image of a red apple with a green leaf in the background.
Here's what the computer actually sees: not a "picture," but three grids of numbers.
A color image is stored as three separate layers (called channels):
- Red channel: A 32×32 grid where each number represents how red that pixel is
- Green channel: Same size, representing green intensity
- Blue channel: Same size, representing blue intensity
In our apple image:
- The red channel has high values where the apple is (it's red) and lower values in the leaf/background areas
- The green channel has high values in the leaf area
- The blue channel is relatively low throughout
This RGB representation is the CNN's starting point.
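To make this concrete, here's a minimal NumPy sketch of that representation. The specific values are made up for illustration; a real photo would come from an image file:

```python
import numpy as np

# A 32x32 color image stored as three stacked channels: shape (3, 32, 32).
# Each number is a pixel intensity, here scaled to the range 0.0-1.0.
image = np.zeros((3, 32, 32))

image[0, 8:24, 8:24] = 0.9   # red channel: bright where the apple is
image[1, 4:8, 12:16] = 0.8   # green channel: bright on the leaf
image[2, :, :] = 0.1         # blue channel: low everywhere

print(image.shape)  # (3, 32, 32): this grid of numbers is all the CNN sees
```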
Step 2: Finding Clues — The Convolution Layer
This is where the magic happens. The convolution layer slides small "detectors" (called kernels or filters) across the image, looking for specific patterns.
Think of each kernel as a specialized magnifying glass. One might look for red patches. Another might look for curved edges. Each kernel produces a feature map—a new grid showing where that pattern appears in the image.
Let's use 4 kernels designed to distinguish apples from oranges:
Kernel 1: Red Detector
- Scans for red regions
- Output: Apple area lights up bright
- Meaning: "Something red here"
Kernel 2: Yellow Detector
- Looks for orange/yellow tones
- Output: Mostly dark (no orange)
- Meaning: "Not much yellow here"
Kernel 3: Edge Detector
- Finds curved boundaries
- Output: Round outline highlighted
- Meaning: "Round shape here"
Kernel 4: Stem Detector
- Looks for small protrusions
- Output: Small bright spot at top
- Meaning: "Stem-like bump here"
After this layer, we have 4 feature maps (each 32×32, assuming the convolution pads the image borders so the output keeps the input's size), each encoding a different type of visual clue.
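Here's a minimal sketch of that sliding operation, assuming 3×3 kernels and zero-padding at the borders so the output stays 32×32. The "red detector" weights are hand-written for illustration; in a real network they would be learned:

```python
import numpy as np

def convolve(image, kernel):
    """Slide a 3x3 kernel across a multi-channel image (zero-padded),
    producing one feature map with the same height/width as the input."""
    c, h, w = image.shape
    padded = np.pad(image, ((0, 0), (1, 1), (1, 1)))  # zero-pad the borders
    feature_map = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            # Multiply the local window by the kernel element-wise, then sum.
            window = padded[:, i:i+3, j:j+3]
            feature_map[i, j] = np.sum(window * kernel)
    return feature_map

# A hypothetical "red detector": positive weights on the red channel,
# negative on green/blue, so red regions produce high responses.
red_kernel = np.zeros((3, 3, 3))
red_kernel[0] = 1.0   # reward red
red_kernel[1] = -0.5  # penalize green
red_kernel[2] = -0.5  # penalize blue

image = np.random.rand(3, 32, 32)   # stand-in for the apple photo
red_map = convolve(image, red_kernel)
print(red_map.shape)                # (32, 32): one feature map
```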
Step 3: Keeping What Matters — The Pooling Layer
The feature maps are still large—32×32 values each. That's a lot of data, and much of it is redundant. The pooling layer compresses each feature map while preserving the important information.
How Max Pooling Works
We divide each feature map into small 2×2 regions. For each region, we keep only the maximum value and discard the other three.
Result: Each 32×32 feature map becomes 16×16—half the size in each dimension, one-quarter the total data.
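A minimal sketch of 2×2 max pooling; it's pure bookkeeping, with no learned parameters:

```python
import numpy as np

def max_pool_2x2(feature_map):
    """Keep only the maximum of each non-overlapping 2x2 block."""
    h, w = feature_map.shape
    # Put each 2x2 block on its own pair of axes, then take the max over them.
    blocks = feature_map.reshape(h // 2, 2, w // 2, 2)
    return blocks.max(axis=(1, 3))

feature_map = np.random.rand(32, 32)   # stand-in for one feature map
pooled = max_pool_2x2(feature_map)
print(pooled.shape)  # (16, 16): one quarter of the original data
```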
What gets preserved:
- The bright spots indicating "red area," "curved edge," and "stem" are retained
- Minor noise and background details get filtered out
Key Benefit: Translation Invariance
Even if the apple shifts slightly in the image (say, 1 pixel to the left), the pooled features remain nearly identical. This is why CNNs can recognize objects regardless of their exact position.
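You can check this claim directly: shift a bright blob by one pixel, pool both versions, and the pooled maps agree almost everywhere. A sketch, redefining the max_pool_2x2 helper from above so the snippet stands alone:

```python
import numpy as np

def max_pool_2x2(fm):
    h, w = fm.shape
    return fm.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

# One bright blob, standing in for the "red area" response.
original = np.zeros((32, 32))
original[10:20, 10:20] = 1.0

# The same blob shifted one pixel to the left.
shifted = np.roll(original, -1, axis=1)

# Despite the shift, the pooled maps match at nearly every position.
agreement = np.mean(max_pool_2x2(original) == max_pool_2x2(shifted))
print(f"{agreement:.0%} of pooled values unchanged")  # ~98% in this example
```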
Step 4: Making the Decision — The Fully Connected Layer
So far, our clues are scattered across separate feature maps. The fully connected layer brings everything together to make a final judgment.
Flattening
First, we reshape all 4 feature maps (each 16×16 = 256 numbers) into a single long list: 4 × 256 = 1024 numbers.
Weighted Voting
The fully connected layer contains neurons that assign weights to each clue and compute scores for each possible class:
| Clue | Apple Score | Orange Score |
|---|---|---|
| Red detector (high) | +++ | − |
| Yellow detector (low) | 0 | 0 |
| Curved edge (high) | + | + |
| Stem detector (high) | ++ | 0 |
In this example:
- "Red" strongly supports apple, penalizes orange
- "Yellow" is neutral (not detected)
- "Curved edge" supports both (both fruits are round)
- "Stem" supports apple (oranges have less prominent stems)
Final Output
The network converts the scores into probabilities using a function called softmax: "Apple: 98%, Orange: 2%"
The CNN concludes: This image contains an apple.
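In code, this whole step boils down to a flatten, a weighted sum per class, and a softmax. The weights below are random placeholders; a trained network would have learned values playing the roles sketched in the table:

```python
import numpy as np

# Four pooled 16x16 feature maps, flattened into one list of 1024 numbers.
pooled_maps = np.random.rand(4, 16, 16)   # stand-in for real pooled features
x = pooled_maps.reshape(-1)               # shape: (1024,)

# Fully connected layer: one weight per (clue, class) pair, plus a bias.
rng = np.random.default_rng(0)
W = rng.normal(size=(2, 1024)) * 0.01     # 2 rows: apple, orange
b = np.zeros(2)
scores = W @ x + b                        # raw class scores

# Softmax squashes the scores into probabilities that sum to 1.
probs = np.exp(scores - scores.max())
probs /= probs.sum()
print(dict(zip(["apple", "orange"], probs.round(3))))
```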
The Complete Pipeline
| Layer | Input | What It Does | Output |
|---|---|---|---|
| Input | Photo | Split into RGB | 3 × 32×32 |
| Convolution | RGB grids | Apply 4 kernels | 4 × 32×32 |
| Pooling | Feature maps | 2×2 max pooling | 4 × 16×16 |
| Flatten | Compressed maps | Reshape to 1D | 1024 numbers |
| Fully Connected | 1D list | Weight & sum | 2 probabilities |
| Output | Probabilities | Pick highest | "Apple" |
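In a modern framework, the whole pipeline is a handful of lines. Here's a minimal PyTorch sketch matching the table, plus a ReLU activation that the walkthrough glossed over but real CNNs place after each convolution. Untrained, its guesses are essentially random:

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 4, kernel_size=3, padding=1),  # 3x32x32 -> 4x32x32
    nn.ReLU(),                                  # standard activation
    nn.MaxPool2d(2),                            # 4x32x32 -> 4x16x16
    nn.Flatten(),                               # -> 1024 numbers
    nn.Linear(1024, 2),                         # -> 2 class scores
)

image = torch.rand(1, 3, 32, 32)       # one random stand-in "photo"
probs = model(image).softmax(dim=1)    # roughly 50/50 before training
print(probs)
```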
Why CNNs Work So Well for Images
Three key properties make CNNs powerful:
1. Local Pattern Detection
Kernels look at small regions, capturing local features like edges and textures. They don't need to see the whole image at once.
2. Parameter Sharing
The same kernel slides across the entire image. This means far fewer parameters to learn compared to fully connected networks—and the ability to detect a pattern anywhere it appears.
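A quick count shows how big the savings are, assuming the 3×3 kernels from our example:

```python
# Convolution layer: 4 kernels, each 3x3 across 3 channels, plus one bias each.
conv_params = 4 * (3 * 3 * 3 + 1)             # = 112 parameters

# A fully connected layer producing the same 4x32x32 output from the
# 3x32x32 input would need a weight for every input-output pair.
dense_params = (3 * 32 * 32) * (4 * 32 * 32)  # = 12,582,912 parameters

print(conv_params, dense_params)  # 112 vs. about 12.6 million
```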
3. Hierarchical Features
In deeper networks, early layers detect simple patterns (edges, colors), while later layers combine these into complex patterns (eyes, wheels, faces). Our simple example only has one convolution layer, but real CNNs stack many.
What Happens During Training?
We didn't cover training in detail, but here's the key idea:
Initially, the kernel values are random—they don't know what to look for. During training, the network sees thousands of labeled images ("this is an apple," "this is an orange"). After each image, it adjusts the kernel values slightly to reduce its mistakes.
Over time, the kernels learn to detect features that actually help distinguish the classes. The "red detector" and "stem detector" kernels we described didn't come built-in—the network discovered they were useful by trial and error.
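Here's a minimal sketch of that loop in PyTorch, reusing the small model from the pipeline section. The images and labels are random placeholders standing in for a real labeled dataset:

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 4, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2), nn.Flatten(), nn.Linear(1024, 2),
)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

# Placeholder batch: 8 random "images" labeled apple (0) or orange (1).
images = torch.rand(8, 3, 32, 32)
labels = torch.randint(0, 2, (8,))

for step in range(100):
    scores = model(images)          # forward pass: the network's current guesses
    loss = loss_fn(scores, labels)  # how wrong were they?
    optimizer.zero_grad()
    loss.backward()                 # work out how to nudge every kernel value
    optimizer.step()                # nudge them slightly to reduce the mistakes
```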
Key Takeaways
- CNNs process images as grids of numbers (RGB channels for color images)
- Convolution layers slide small kernels across the image to detect local patterns
- Pooling layers compress feature maps while preserving important signals
- Fully connected layers combine all detected features to make a classification decision
- Training teaches the kernels what patterns are useful for the task