Princeton: LN, 2023. — 227 p.
Basic Setup and some math notions.
List of useful math facts.
Basics of Optimization.
Gradient descent (GD).
Stochastic gradient descent (SGD).
Accelerated Gradient Descent.
Running time: Learning Rates and Update Directions.
Convergence rates under smoothness conditions.
Correspondence of theory with practice.
Note on overparametrized linear regression and kernel regression.
Overparametrized least squares linear regression.
Kernel least-squares regression.
Note on Backpropagation and its Variants.
Problem Setup.
Backpropagation (Linear Time).
Auto-differentiation.
Notable Extensions.
Basics of generalization theory.
Occam's razor formalized for ML.
Some simple upper bounds on generalization error.
Data-dependent complexity measures.
Understanding limitations of the union-bound approach.
A Compression-based framework.
PAC-Bayes bounds.
Exercises.
Tractable Landscapes for Nonconvex Optimization.
Preliminaries and challenges in nonconvex landscapes.
Cases with a unique global minimum.
Symmetry, saddle points, and locally optimizable functions.
Case study: top eigenvector of a matrix.
Escaping Saddle Points.
Preliminaries.
Perturbed Gradient Descent.
Saddle-Point Escaping Lemma.
Algorithmic Regularization.
Linear models in regression: squared loss.
Matrix factorization.
Linear Models in Classification.
Homogeneous Models with Exponential-Tailed Loss.
Induced bias in function space.
Ultra-wide Neural Networks and Neural Tangent Kernels.
The evolution equation for net parameters.
NTK: Simple 2-layer example.
Explaining Optimization and Generalization of Ultra-wide Neural Networks via NTK.
NTK formula for Multilayer Fully Connected Neural Network.
NTK in Practice.
Exercises.
Interpreting the output of Deep Nets: Credit Attribution.
Influence Functions.
Shapley Values.
Data Models.
Saliency Maps.
Inductive Biases due to Algorithmic Regularization.
Matrix Sensing.
Deep neural networks.
Landscape of the Optimization Problem.
Role of Parametrization.
SDE approximation of SGD and its implications.
Understanding gradient noise in SGD.
Stochastic processes: Informal Treatment.
The notion of closeness between stochastic processes.
Stochastic Variance Amplified Gradient (SVAG).
Effect of Normalization in Deep Learning.
Warmup Example: How Normalization Helps Optimization.
Normalization schemes and scale invariance.
Exponential learning rate schedules.
Convergence analysis for GD on Scale-Invariant Loss.
Unsupervised learning: Distribution Learning.
Possible goals of unsupervised learning.
Training Objective for Learning Distributions: Log Likelihood.
Variational method.
Autoencoders and Variational Autoencoders (VAEs).
Normalizing Flows.
Stable Diffusion.
Language Models (LMs).
Transformer Architecture.
Explanation of Cross-Entropy Loss.
Scaling Laws and Emergence.
(Mis)understanding, Excess entropy, and Cloze Questions.
How to generate text from an LM.
Instruction tuning.
Aligning LLMs with human preferences.
Mathematical Framework for Skills and Emergence.
Analysis of Emergence (uniform cluster).
Generative Adversarial Nets.
Distance between Distributions.
Introducing GANs.
"Generalization" for GANs vs Mode Collapse.
Self-supervised Learning.
Adversarial Examples and efforts to combat them.
Basic Definitions.
Provable defense via randomized smoothing.
Examples of Theorems, Proofs, Algorithms, Tables, Figures.
Example of Theorems and Lemmas.
Example of Long Equation Proofs.
Example of Algorithms.
Example of Figures.
Example of Tables.
Exercise.