From optimization intuition to output-layer probability, this section turns deep learning's core math into something you can actually reason about.
Optimization intuition
Gradient descent, backpropagation, and why signals can move through deep stacks.
Modeling foundations
Activation functions, loss design, and the assumptions hidden inside familiar formulas.
Probability at the output layer
Softmax, cross-entropy, and the gradient identities that make classification trainable.
Imagine you're blindfolded and dropped into a valley. Your only goal: reach the bottom.
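A toy version of that blindfolded descent, as a minimal sketch: a one-dimensional loss f(w) = w², with a starting point and learning rate chosen purely for illustration, not taken from the article.

```python
# Gradient descent on the toy loss f(w) = w**2, whose gradient is 2*w.
# The "valley" is the parabola; each step follows the local slope downhill.
w = 5.0             # where you are dropped on the hillside (illustrative)
lr = 0.1            # how far each blind step goes (illustrative)
for _ in range(50):
    grad = 2 * w    # local slope under your feet
    w -= lr * grad  # step in the direction that decreases f
print(w)            # ends up near 0.0, the bottom of the valley
```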
How local derivatives multiply along a path and add across branches in a neural network.
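In symbols, for a chain x → h → L and for a node x that feeds two branches u and v before they rejoin at the loss L (generic names, not necessarily the article's):

```latex
\frac{dL}{dx} = \frac{dL}{dh}\,\frac{dh}{dx}
\quad\text{(multiply along a path)},
\qquad
\frac{\partial L}{\partial x}
  = \frac{\partial L}{\partial u}\,\frac{\partial u}{\partial x}
  + \frac{\partial L}{\partial v}\,\frac{\partial v}{\partial x}
\quad\text{(add across branches)}
```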
The loss looks simple because the statistical story behind it is simple.
Stack enough affine layers without a nonlinearity and the whole network collapses into one layer.
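One line of algebra makes the collapse concrete, for two generic affine layers (notation mine):

```latex
W_2 (W_1 x + b_1) + b_2 = (W_2 W_1)\,x + (W_2 b_1 + b_2) = W' x + b'
```

Composing any number of affine maps yields another affine map, so the extra depth buys nothing until a nonlinearity sits between the layers.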
Why logits become probabilities, why the exponential shows up, and why the gradient becomes prediction minus target.
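Written out for a one-hot (or probability) target y and logits z, in standard notation:

```latex
p_i = \operatorname{softmax}(z)_i = \frac{e^{z_i}}{\sum_j e^{z_j}},
\qquad
L = -\sum_i y_i \log p_i,
\qquad
\frac{\partial L}{\partial z_i} = p_i - y_i
```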
Why logits go straight into CrossEntropyLoss, and why that is the numerically stable thing to do.
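A minimal PyTorch sketch of the point, with illustrative tensor shapes: nn.CrossEntropyLoss fuses log-softmax and negative log-likelihood, so it expects raw logits and applies the log-sum-exp trick internally instead of exponentiating first.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

logits = torch.randn(4, 10)           # raw scores for 4 samples, 10 classes (illustrative)
targets = torch.tensor([3, 1, 0, 7])  # class indices

# Recommended: pass logits directly; log_softmax + NLL are fused and stable.
stable = nn.CrossEntropyLoss()(logits, targets)

# Equivalent, written out: log_softmax uses the log-sum-exp shift internally.
manual = F.nll_loss(F.log_softmax(logits, dim=1), targets)

# What to avoid: softmax followed by log can underflow/overflow for extreme logits.
risky = F.nll_loss(torch.log(F.softmax(logits, dim=1)), targets)

print(stable.item(), manual.item(), risky.item())  # first two match; the third can break down
```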
How XOR exposed the limits of linear perceptrons, and why nonlinear activations made deep learning viable.
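The impossibility argument fits in two lines. Suppose a single unit sign(w₁x₁ + w₂x₂ + b) had to reproduce the XOR truth table (notation mine):

```latex
\begin{aligned}
&b < 0,\qquad w_1 + b > 0,\qquad w_2 + b > 0,\qquad w_1 + w_2 + b < 0,\\
&\text{but}\quad (w_1 + b) + (w_2 + b) > 0 \;\Rightarrow\; w_1 + w_2 + b > -b > 0,
\end{aligned}
```

which contradicts the last requirement, so no linear decision boundary separates the XOR labels; a hidden layer with a nonlinearity removes the obstacle.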
From weight initialization to backward() and optimizer.step(), what the training loop is doing, mathematically.
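A skeletal PyTorch loop with placeholder model, data, and hyperparameters (all illustrative, not the article's setup), mapping each call to the math: the forward pass computes a scalar loss, backward() fills gradients via the chain rule, and optimizer.step() applies the gradient-descent update.

```python
import torch
import torch.nn as nn

# Illustrative model and fake data; the article's actual setup may differ.
model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 10))
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

x = torch.randn(32, 20)           # a fake mini-batch of 32 samples
y = torch.randint(0, 10, (32,))   # fake class labels

for epoch in range(5):
    optimizer.zero_grad()         # clear gradients from the previous step
    loss = loss_fn(model(x), y)   # forward pass: compute the scalar loss
    loss.backward()               # backprop: populate p.grad for every parameter
    optimizer.step()              # update: p <- p - lr * p.grad
```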
Two regularizers, two mechanisms, one goal: reducing the gap between training performance and real generalization.
A compact reference for losses, Softmax curvature, LogSumExp, and the calculus facts that keep resurfacing.
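One of those facts is worth writing out here, since it is what keeps the Softmax and cross-entropy formulas finite in floating point (the standard LogSumExp shift, in my notation):

```latex
\operatorname{LSE}(z) = \log\sum_i e^{z_i} = m + \log\sum_i e^{z_i - m},
\qquad m = \max_i z_i,
\qquad \log\operatorname{softmax}(z)_i = z_i - \operatorname{LSE}(z)
```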
Tracing forward propagation, branching gradients, and parameter updates by hand.
A deep multilayer perceptron can fail long before the optimizer gets a fair chance.
The model did not get worse; the world just changed.
Once you flatten an image, you force the model to relearn basic visual structure from scratch.
The operation we call 'convolution' is technically cross-correlation, and the backward pass reveals a beautiful mathematical symmetry.
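A tiny 1-D check of the naming point, in NumPy with a made-up asymmetric kernel: what frameworks compute slides the kernel as-is (cross-correlation), while true convolution flips it first, so the two only coincide for symmetric kernels.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])  # illustrative signal
w = np.array([1.0, 0.0, -1.0])           # illustrative (asymmetric) kernel

# "Convolution" as deep learning frameworks compute it: slide w without flipping.
cross_corr = np.array([np.dot(x[i:i + 3], w) for i in range(len(x) - 2)])

# True convolution flips the kernel before sliding.
true_conv = np.array([np.dot(x[i:i + 3], w[::-1]) for i in range(len(x) - 2)])

print(cross_corr)  # [-2. -2. -2.]
print(true_conv)   # [ 2.  2.  2.]
```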
How CNNs purposefully compress spatial information to build high-level semantic understanding.
LeNet proved CNNs worked on simple tasks, but AlexNet proved they could conquer the real visual world.
VGG proved that stacking small 3x3 filters was a scalable, systematic way to build deep networks.
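The parameter arithmetic behind that claim, assuming C input and C output channels and ignoring biases (my simplification):

```latex
\underbrace{2\,(3 \cdot 3 \cdot C \cdot C)}_{\text{two stacked } 3\times3 \text{ convs}} = 18C^2
\;<\;
\underbrace{5 \cdot 5 \cdot C \cdot C}_{\text{one } 5\times5 \text{ conv}} = 25C^2
```

Both options cover the same 5×5 receptive field, but the stacked version is cheaper and gets an extra nonlinearity in between.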
NIN asked a simple question: What if each local convolution filter was actually a tiny neural network?