From convolutional feature extraction to recurrent sequence modeling, this section builds an understanding of deep learning through rigorous mathematics, concrete implementations, and real design tradeoffs.
CNNs & Computer Vision
Convolutional networks, image-recognition architectures, and the engineering choices behind modern visual models.
RNNs & Sequence Modeling
Recurrent networks, gated cells, attention, and sequence-to-sequence models for language and time series.
Foundations
Gradients, losses, activations, and the math you keep needing once you go past the textbook chapter.
Optimization & Training
Why deep loss surfaces are hard, and how SGD, momentum, adaptive methods, and schedulers tame them.
CNNs & Computer Vision
Once you flatten an image, you force the model to relearn basic visual structure from scratch.
The operation we call 'convolution' is technically cross-correlation, and the backward pass reveals a clean mathematical symmetry.
Three networks, three lessons that turned CNN architecture from one-off engineering into a reusable design language.
When you cannot know which filter size is best, run them all in parallel — then let the network decide.
How three architectural ideas — normalization, residual learning, and dense connectivity — fixed the gradient and feature-flow problems that capped CNN depth.
Building a multi-task fruit classifier on a small dataset, with a frozen backbone, weighted losses, cosine annealing, and early stopping.
RNNs & Sequence Modeling
From the i.i.d. assumption to hidden states, Markov approximations, and the chain rule of probability.
Time-axis weight sharing, the BPTT chain rule, and why handling exploding gradients is not optional.
How element-wise gates turn the multiplicative cascade of vanilla RNNs into an additive memory highway.
Embeddings, batch_first, detach, gradient clipping, perplexity, and warmup-based generation.
How attention removed the encoder bottleneck, and how beam search produces better translations than greedy decoding.
Foundations
Imagine you're blindfolded and dropped into a valley. Your only goal: reach the bottom.
How local derivatives multiply along a path and add across branches in a neural network.
The loss looks simple because the statistical story behind it is simple.
Stack enough affine layers without a nonlinearity and the whole network collapses into one layer.
Why logits become probabilities, why the exponential shows up, and why the gradient becomes prediction minus target.
Why logits go straight into CrossEntropyLoss, and why that is the numerically stable thing to do.
How XOR exposed the limits of linear perceptrons, and why nonlinear activations made deep learning viable.
From weight initialization to backward() and optimizer.step(), what the training loop is mathematically doing.
Two regularizers, two mechanisms, one goal: reducing the gap between training performance and real generalization.
A compact reference for losses, Softmax curvature, LogSumExp, and the calculus facts that keep resurfacing.
Tracing forward propagation, branching gradients, and parameter updates by hand.
A deep multilayer perceptron can fail long before the optimizer gets a fair chance.
The model did not get worse; the world just changed.
Optimization & Training
Non-convex optimization, saddle points, mode connectivity, and why high dimensions are friendlier than they look.
Momentum, AdaGrad, RMSProp, Adam, AdamW — the math behind every optimizer you might pick.
Why a constant learning rate is almost never right, and which schedule fits which task.