LeNet showed that convolutional networks could solve vision problems. AlexNet showed that deep convolutional networks could win on large, messy, real-world vision problems. VGG turned that empirical lead into a design methodology — small kernels, repeatable blocks, disciplined scaling. Tracing the path from LeNet through AlexNet to VGG explains how CNN architecture moved from one-off engineering to a reusable language.
What LeNet actually proved
LeNet was built for handwritten digit recognition. That matters. Digits are grayscale, low resolution, centered, and visually constrained. The dataset distribution is far simpler than modern natural-image collections.
LeNet's success established something important but limited: convolution plus pooling is a useful idea for structured image data. It did not yet prove that a deeper CNN could scale to cluttered backgrounds, color images, many categories, large intraclass variation, and object appearance under real lighting and viewpoint changes.
In that sense, LeNet was a proof of concept for the convolutional paradigm, not yet a proof of dominance for large-scale visual recognition. Three constraints held the paradigm back. The task domain was too simple to settle the broader question. Model capacity was modest. The surrounding ecosystem was missing critical ingredients: large labeled corpora, modern GPUs, and stable training practices for deeper nets.
What AlexNet changed
AlexNet mattered because it moved CNNs from tidy academic examples into a benchmark that looked much more like the real visual world. ImageNet provided scale. GPUs provided feasible training time. ReLU, dropout, and data augmentation provided a far healthier optimization and generalization story.
The architecture itself was much larger and deeper than LeNet, with more channels and more hierarchical capacity. That gave the network room to learn a progression from low-level edges to mid-level parts to high-level visual categories on a much richer dataset.
LeNet made the case that convolution was a good idea. AlexNet made the case that deep convolution could be the winning strategy.
AlexNet is often summarized as "a deeper LeNet," but that compresses away the important details. ReLU made optimization far easier than older sigmoid-heavy pipelines. Dropout reduced overfitting in the large classifier head. Data augmentation helped the model generalize beyond the raw training images. GPU training made the whole recipe operationally feasible. AlexNet was not only a bigger network — it was a new training stack.
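As one illustration of that stack, here is a minimal augmentation pipeline of the kind AlexNet relied on, sketched with torchvision transforms. The crop size and normalization statistics are illustrative assumptions, not the paper's exact settings.

from torchvision import transforms

# Illustrative AlexNet-style augmentation: random crops and horizontal flips
# expand the effective training set. Values are assumptions for a
# single-channel dataset, not the original paper's configuration.
train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.5], std=[0.5]),  # single-channel stats, assumed
])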
A compact AlexNet implementation
import torch
from torch import nn

class AlexNet(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        # Convolutional front end: a large early kernel, then 3x3 stages.
        self.features = nn.Sequential(
            nn.Conv2d(1, 96, kernel_size=11, stride=4, padding=2),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=3, stride=2),
            nn.Conv2d(96, 256, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=3, stride=2),
            nn.Conv2d(256, 384, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(384, 384, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(384, 256, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=3, stride=2),
        )
        # Fix the spatial size so the classifier input shape is predictable.
        self.avgpool = nn.AdaptiveAvgPool2d((2, 2))
        self.classifier = nn.Sequential(
            nn.Dropout(0.5),
            nn.Linear(256 * 2 * 2, 1024),
            nn.ReLU(),
            nn.Dropout(0.5),
            nn.Linear(1024, 512),
            nn.ReLU(),
            nn.Linear(512, num_classes),
        )

    def forward(self, x):
        x = self.features(x)
        x = self.avgpool(x)
        x = torch.flatten(x, 1)  # collapse (N, 256, 2, 2) to (N, 1024)
        return self.classifier(x)

The convolutional front end extracts increasingly abstract visual features. The classifier head turns those features into logits for the final categories. AdaptiveAvgPool2d((2, 2)) fixes the spatial size before the linear layers, removing the need to manually derive the exact post-convolution shape each time the feature extractor changes. The dropout layers matter because the fully connected section contains a large fraction of the model's parameters.
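A quick sanity check of the implementation above, assuming a batch of single-channel 224x224 inputs (the input size is an assumption; the 1-channel first convolution suggests a grayscale dataset):

model = AlexNet(num_classes=10)
x = torch.randn(8, 1, 224, 224)  # batch of 8 grayscale images, assumed size
print(model(x).shape)            # torch.Size([8, 10])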
VGG isolates one question: how much does depth matter?
AlexNet was a breakthrough but still looked like a collection of historical choices: large early kernels, aggressive stride, and a heavy classifier head. The VGG paper tried to isolate one variable: depth. Instead of changing everything at once, it held the overall design relatively steady and explored whether making the network deeper in a structured way improved accuracy.
That is why VGG matters conceptually. It strengthened the claim that representation depth itself, not just ad hoc architectural luck, was an important source of visual performance.
Why small kernels were a big idea
A single 7×7 convolution sees a large local region, but it is not the only way to obtain that effective receptive field. Stacking three 3×3 convolutions yields a similar spatial reach while introducing more nonlinearities and using fewer parameters.
Ignoring channels for a moment, one 7×7 filter has 49 weights, while three 3×3 filters have 27 weights total per channel path. With C channels per layer, the comparison is even more striking: 49C² versus 27C² parameters. And those three layers introduce three ReLU nonlinearities instead of one, increasing the function class the network can express.
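To check that arithmetic concretely, here is a minimal sketch that counts parameters for both options, assuming C = 64 channels in and out and ignoring bias terms:

from torch import nn

C = 64  # assumed channel width, in and out

# One 7x7 convolution versus a stack of three 3x3 convolutions.
big = nn.Conv2d(C, C, kernel_size=7, padding=3, bias=False)
small = nn.Sequential(
    *[nn.Conv2d(C, C, kernel_size=3, padding=1, bias=False) for _ in range(3)]
)

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(big))    # 49 * C**2 = 200704
print(count(small))  # 3 * 9 * C**2 = 110592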
VGG's philosophy was not "big kernels are wrong." It was "many small, consistent steps can be more expressive and easier to scale than one large, irregular jump."
The block is the real abstraction
The most durable VGG idea is the block. Instead of designing a network layer by layer as a one-off structure, VGG builds a small repeatable unit and stacks it:
3x3 conv -> ReLU -> 3x3 conv -> ReLU -> max pool

Repeated blocks create a predictable pattern: spatial resolution gradually falls, channel width gradually rises, and the model becomes deeper without becoming architecturally chaotic.
A compact VGG-style implementation
def vgg_block(in_channels, out_channels, num_convs):
    # One VGG block: num_convs 3x3 convolutions, each followed by ReLU,
    # then a 2x2 max pool that halves the spatial resolution.
    layers = []
    for i in range(num_convs):
        layers += [
            nn.Conv2d(
                in_channels if i == 0 else out_channels,
                out_channels,
                kernel_size=3,
                padding=1,
            ),
            nn.ReLU(),
        ]
    layers.append(nn.MaxPool2d(kernel_size=2, stride=2))
    return nn.Sequential(*layers)


class VGG(nn.Module):
    def __init__(self, num_classes=10,
                 block_specs=((1, 16), (1, 32), (2, 64), (2, 128), (2, 128))):
        super().__init__()
        blocks, channels = [], 1
        for num_convs, out_channels in block_specs:
            blocks.append(vgg_block(channels, out_channels, num_convs))
            channels = out_channels
        self.features = nn.Sequential(*blocks)
        self.avgpool = nn.AdaptiveAvgPool2d((3, 3))
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(channels * 3 * 3, 256),
            nn.ReLU(),
            nn.Dropout(0.5),
            nn.Linear(256, num_classes),
        )

    def forward(self, x):
        x = self.features(x)
        x = self.avgpool(x)
        return self.classifier(x)  # classifier begins with nn.Flatten

The important line here is not a specific layer width. It is the fact that the architecture is specified in terms of reusable block definitions. Once the block is the unit of design, the network becomes easier to scale, reason about, and compare.
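Because the whole architecture is driven by block_specs, scaling the network is a configuration change rather than a redesign. A sketch, again assuming grayscale 224x224 inputs:

tiny = VGG()  # default five-block configuration
wider = VGG(block_specs=((1, 32), (1, 64), (2, 128), (2, 256), (2, 256)))  # same depth, doubled widths

x = torch.randn(4, 1, 224, 224)
print(tiny(x).shape, wider(x).shape)  # torch.Size([4, 10]) torch.Size([4, 10])

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(tiny) < count(wider))     # True: wider blocks, more parameters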
Depth means more than "more layers"
If kernel size tells you how far one step can see, depth tells you how many representational steps the model is allowed to take.
Early layers respond to edges and simple texture cues. Middle layers combine them into local motifs and parts. Later layers integrate those parts into larger semantic structures. More depth gives the network more chances to refine and recombine information rather than trying to capture everything in one leap.
That progression also increases the effective receptive field. Later units see a larger part of the original image, but they do so through a chain of intermediate abstractions rather than one oversized filter.
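A back-of-the-envelope version of that receptive-field argument, assuming stride-1 convolutions so each layer extends the field by kernel_size - 1 pixels:

def receptive_field(kernel_sizes):
    # For stride-1 convolutions, each k x k layer adds k - 1 pixels of
    # reach on top of the layers below it.
    rf = 1
    for k in kernel_sizes:
        rf += k - 1
    return rf

print(receptive_field([7]))        # 7 -- one large kernel
print(receptive_field([3, 3, 3]))  # 7 -- three small kernels, same reach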
Why VGG still matters as a design language
VGG is no longer the most efficient CNN family, but it remains one of the clearest architectural teaching tools because its design principles are so legible:
- Use consistent local operators (3×3 convolution everywhere).
- Build networks from reusable blocks rather than one-off layer sequences.
- Trade one large step for multiple smaller, nonlinear steps.
- Let spatial resolution fall gradually as channel complexity rises.
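Those last two principles are easy to see directly: pushing a dummy input through each block of the VGG sketch above and printing the shapes shows resolution halving while channel width climbs.

model = VGG()
x = torch.randn(1, 1, 224, 224)  # assumed grayscale 224x224 input
for i, block in enumerate(model.features):
    x = block(x)
    print(f"block {i}: {tuple(x.shape)}")
# block 0: (1, 16, 112, 112)
# block 1: (1, 32, 56, 56)
# block 2: (1, 64, 28, 28)
# block 3: (1, 128, 14, 14)
# block 4: (1, 128, 7, 7)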
Many later architectures depart from VGG, but they do so in dialogue with the standard it established. The Inception, ResNet, and DenseNet families all rely on the block abstraction VGG made standard. VGG's legacy is not a particular leaderboard result — it is the demonstration that disciplined block design can turn CNN architecture into something coherent, scalable, and easy to reason about.
The main takeaway
The path from LeNet to AlexNet to VGG is not about three networks competing. It is about three lessons compounding. LeNet showed that convolution plus pooling is a viable architectural idea. AlexNet showed that scale, ReLU, dropout, and GPU training change which architectures are practically trainable. VGG showed that stacking many small, identical operations is more scalable and more expressive than designing each layer ad hoc.
Once the block pattern was clear, the question shifted from "what layers should this network have?" to "what should the block look like, and how should it scale?" That shift opened the door to NIN's 1×1 convolutions, GoogLeNet's Inception modules, and ResNet's residual connections — each a different answer to the same now-standardized question.