From optimization intuition to output-layer probability, this section turns deep learning's core math into something you can actually reason about.
Optimization intuition
Gradient descent, backpropagation, and why signals can move through deep stacks.
Modeling foundations
Activation functions, loss design, and the assumptions hidden inside familiar formulas.
Probability at the output layer
Softmax, cross-entropy, and the gradient identities that make classification trainable.
Imagine you're blindfolded and dropped into a valley. Your only goal: reach the bottom.
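A toy version of that blindfolded descent, as a minimal sketch: a one-dimensional loss f(w) = w², with a starting point and learning rate chosen purely for illustration, not taken from the article.

```python
# Gradient descent on the toy loss f(w) = w**2, whose gradient is 2*w.
# The "valley" is the parabola; each step follows the local slope downhill.
w = 5.0             # where you are dropped on the hillside (illustrative)
lr = 0.1            # how far each blind step goes (illustrative)
for _ in range(50):
    grad = 2 * w    # local slope under your feet
    w -= lr * grad  # step in the direction that decreases f
print(w)            # ends up near 0.0, the bottom of the valley
```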
How local derivatives multiply along a path and add across branches in a neural network.
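In symbols, for a chain x → h → L and for a node x that feeds two branches u and v before they rejoin at the loss L (generic names, not necessarily the article's):

```latex
\frac{dL}{dx} = \frac{dL}{dh}\,\frac{dh}{dx}
\quad\text{(multiply along a path)},
\qquad
\frac{\partial L}{\partial x}
  = \frac{\partial L}{\partial u}\,\frac{\partial u}{\partial x}
  + \frac{\partial L}{\partial v}\,\frac{\partial v}{\partial x}
\quad\text{(add across branches)}
```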
The loss looks simple because the statistical story behind it is simple.
Stack enough affine layers without a nonlinearity and the whole network collapses into one layer.
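One line of algebra makes the collapse concrete, for two generic affine layers (notation mine):

```latex
W_2 (W_1 x + b_1) + b_2 = (W_2 W_1)\,x + (W_2 b_1 + b_2) = W' x + b'
```

Composing any number of affine maps yields another affine map, so the extra depth buys nothing until a nonlinearity sits between the layers.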
Why logits become probabilities, why the exponential shows up, and why the gradient becomes prediction minus target.
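Written out for a one-hot (or probability) target y and logits z, in standard notation:

```latex
p_i = \operatorname{softmax}(z)_i = \frac{e^{z_i}}{\sum_j e^{z_j}},
\qquad
L = -\sum_i y_i \log p_i,
\qquad
\frac{\partial L}{\partial z_i} = p_i - y_i
```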
Why logits go straight into CrossEntropyLoss, and why that is the numerically stable thing to do.
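A minimal PyTorch sketch of the point, with illustrative tensor shapes: nn.CrossEntropyLoss fuses log-softmax and negative log-likelihood, so it expects raw logits and applies the log-sum-exp trick internally instead of exponentiating first.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

logits = torch.randn(4, 10)           # raw scores for 4 samples, 10 classes (illustrative)
targets = torch.tensor([3, 1, 0, 7])  # class indices

# Recommended: pass logits directly; log_softmax + NLL are fused and stable.
stable = nn.CrossEntropyLoss()(logits, targets)

# Equivalent, written out: log_softmax uses the log-sum-exp shift internally.
manual = F.nll_loss(F.log_softmax(logits, dim=1), targets)

# What to avoid: softmax followed by log can underflow/overflow for extreme logits.
risky = F.nll_loss(torch.log(F.softmax(logits, dim=1)), targets)

print(stable.item(), manual.item(), risky.item())  # first two match; the third can break down
```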
How XOR exposed the limits of linear perceptrons, and why nonlinear activations made deep learning viable.
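The impossibility argument fits in two lines. Suppose a single unit sign(w₁x₁ + w₂x₂ + b) had to reproduce the XOR truth table (notation mine):

```latex
\begin{aligned}
&b < 0,\qquad w_1 + b > 0,\qquad w_2 + b > 0,\qquad w_1 + w_2 + b < 0,\\
&\text{but}\quad (w_1 + b) + (w_2 + b) > 0 \;\Rightarrow\; w_1 + w_2 + b > -b > 0,
\end{aligned}
```

which contradicts the last requirement, so no linear decision boundary separates the XOR labels; a hidden layer with a nonlinearity removes the obstacle.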
From weight initialization to backward() and optimizer.step(), what the training loop is doing, mathematically.
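A skeletal PyTorch loop with placeholder model, data, and hyperparameters (all illustrative, not the article's setup), mapping each call to the math: the forward pass computes a scalar loss, backward() fills gradients via the chain rule, and optimizer.step() applies the gradient-descent update.

```python
import torch
import torch.nn as nn

# Illustrative model and fake data; the article's actual setup may differ.
model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 10))
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

x = torch.randn(32, 20)           # a fake mini-batch of 32 samples
y = torch.randint(0, 10, (32,))   # fake class labels

for epoch in range(5):
    optimizer.zero_grad()         # clear gradients from the previous step
    loss = loss_fn(model(x), y)   # forward pass: compute the scalar loss
    loss.backward()               # backprop: populate p.grad for every parameter
    optimizer.step()              # update: p <- p - lr * p.grad
```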
Two regularizers, two mechanisms, one goal: reducing the gap between training performance and real generalization.
A compact reference for losses, Softmax curvature, LogSumExp, and the calculus facts that keep resurfacing.
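One of those facts is worth writing out here, since it is what keeps the Softmax and cross-entropy formulas finite in floating point (the standard LogSumExp shift, in my notation):

```latex
\operatorname{LSE}(z) = \log\sum_i e^{z_i} = m + \log\sum_i e^{z_i - m},
\qquad m = \max_i z_i,
\qquad \log\operatorname{softmax}(z)_i = z_i - \operatorname{LSE}(z)
```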
Tracing forward propagation, branching gradients, and parameter updates by hand.
A deep multilayer perceptron can fail long before the optimizer gets a fair chance.
The model did not get worse; the world just changed.
Once you flatten an image, you force the model to relearn basic visual structure from scratch.
The operation we call 'convolution' is technically cross-correlation, and the backward pass reveals a beautiful mathematical symmetry.
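A tiny 1-D check of the naming point, in NumPy with a made-up asymmetric kernel: what frameworks compute slides the kernel as-is (cross-correlation), while true convolution flips it first, so the two only coincide for symmetric kernels.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])  # illustrative signal
w = np.array([1.0, 0.0, -1.0])           # illustrative (asymmetric) kernel

# "Convolution" as deep learning frameworks compute it: slide w without flipping.
cross_corr = np.array([np.dot(x[i:i + 3], w) for i in range(len(x) - 2)])

# True convolution flips the kernel before sliding.
true_conv = np.array([np.dot(x[i:i + 3], w[::-1]) for i in range(len(x) - 2)])

print(cross_corr)  # [-2. -2. -2.]
print(true_conv)   # [ 2.  2.  2.]
```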
How CNNs purposefully compress spatial information to build high-level semantic understanding.
LeNet proved CNNs worked on simple tasks, but AlexNet proved they could conquer the real visual world.
VGG proved that stacking small 3x3 filters was a scalable, systematic way to build deep networks.
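The parameter arithmetic behind that claim, assuming C input and C output channels and ignoring biases (my simplification):

```latex
\underbrace{2\,(3 \cdot 3 \cdot C \cdot C)}_{\text{two stacked } 3\times3 \text{ convs}} = 18C^2
\;<\;
\underbrace{5 \cdot 5 \cdot C \cdot C}_{\text{one } 5\times5 \text{ conv}} = 25C^2
```

Both options cover the same 5×5 receptive field, but the stacked version is cheaper and gets an extra nonlinearity in between.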
NIN asked a simple question: What if each local convolution filter was actually a tiny neural network?