Lecture Videos

Lecture 0: Course Overview and Logistics
Overview: This lecture gives an overview on the course structure and the logistics.
[link]

Lecture Notes:

Chapter 0

Lecture 1: Tokenization and Embedding
LMs - Part 1: We start with LMs and understand how we can feed a text into it by doing the so-called "Tokenization" and "Embedding".
[link]

Lecture Notes:

Chapter 1 - Section 1 Pgs 1:18

Further Reads:

Tokenization: Chapter 2 of [JM]
Embedding: Chapter 6 of [JM]
Original BPE Algorithm: Original BPE Algorithm proposed by Philip Gage in 1994
BPE for Tokenization: Paper Neural machine translation of rare words with subword units by Rico Sennrich, Barry Haddow, and Alexandra Birch presented in ACL 2016 that adapted BPE for NLP

Lecture 2: Language Distribution and Bi-Gram Model
LMs - Part 2: In this lecture, we define the LMs concretely via the notion of language distribution. We then build a simple Bi-gram LM.
[link]

Lecture Notes:

Chapter 1 - Section 1 Pgs 18:32

Further Reads:

LMs: Chapter 12 of [BB] Section 12.2
N-Gram LMs: Chapter 3 of Speech and Language Processing; Section 3.1 on N-gram LM
Maximum Likelihood: Chapter 2 of [BB] Sections 12.1 – 12.3

Lecture 3: Recurrent LMs
LMs - Part 3: This lecture explains how we can build a LM using RNNs. We take a look at LSTM-based LMs.
[link]

Lecture Notes:

Chapter 1 - Section 1 Pgs 32:42

Further Reads:

Recurrent LMs: Chapter 8 of [JM]
LSTM LMs: Paper Regularizing and Optimizing LSTM Language Models by Stephen Merity, Nitish Shirish Keskar, and Richard Socher published in ICLR 2018 enabling LSTMs to perform strongly on word-level language modeling
High-Rank Recurrent LMs: Paper Breaking the Softmax Bottleneck: A High-Rank RNN Language Model by Zhilin Yang, Zihang Dai, Ruslan Salakhutdinov, and William W. Cohen presented at ICLR 2018 proposing Mixture of Softmaxes (MoS) and achieving state-of-the-art results at the time

Lecture 4: Context Extraction via Self-Attention
Transformer LMs - Part 1: In this lecture we use self-attention mechanism to extract context from a token sequence. Introduction to self-attention is given through the lecture.
[link]

Lecture Notes:

Chapter 1 - Section 2

Further Reads:

Transformer Paper: Paper Attention Is All You Need! published in 2017 that made a great turn in sequence processing
Transformers: Chapter 9 of [JM]
Transformers: Chapter 12 of [BB] Section 12.1

Lecture 5: Transformer LM
Transformer LMs - Part 2: We use self-attention-based context to build a LM via a Transformer.
[link]

Lecture Notes:

Chapter 1 - Section 2

Further Reads:

Transformer LMs: Chapter 12 of [BB] Section 12.3
LLMs via Transformers: Chapter 10 of [JM]

Lecture 6: LLM Examples
LLMs - Part 1: We take a look at GPT examples to see how deep they are, how large the data is, how to evaluate them, and what challenges are in the way of pre-training them.
[link]

Lecture Notes:

Chapter 1 - Section 3

Further Reads:

GPT-1: Paper Improving Language Understanding by Generative Pre-Training by Alec Radford et al. (OpenAI, 2018) that introduced GPT-1 and revived the idea of pretraining transformers as LMs followed by supervised fine-tuning
GPT-2: Paper Language Models are Unsupervised Multitask Learners by Alec Radford et al. (OpenAI, 2019) that introduces GPT-2 with 1.5B parameter trained on web text
GPT-3: Paper Language Models are Few-Shot Learners by Tom B. Brown et al. (OpenAI, 2020) that introduces GPT-3, a 175B-parameter transformer LM
GPT-4: GPT-4 Technical Report by OpenAI (2023) that provides an overview of GPT-4’s capabilities
The Pile: Paper The Pile: An 800GB Dataset of Diverse Text for Language Modeling by Leo Gao et al. presented in 2020 introductin dataset The Pile
Documentation Debt: Paper Addressing “Documentation Debt” in Machine Learning Research: A Retrospective Datasheet for BookCorpus by Jack Bandy and Nicholas Vincent published in 2021 discussing the efficiency and legality of data collection by looking into BookCorpus

Lecture 7: Pre-training vs Fine-tuning
LLMs - Part 2: We discuss the idea of pre-training and fine-tuning, and take a look into the example of GPT-1.
[link]

Lecture Notes:

Chapter 1 - Section 3

Further Reads:

SSL: Paper Semi-supervised Sequence Learning by Andrew M. Dai et al. published in 2015 that explores using unsupervised pretraining followed by supervised fine-tuning; this was an early solid work advocating pre-training idea for LMs
GPT-1: Paper Improving Language Understanding by Generative Pre-Training by Alec Radford et al. (OpenAI, 2018) that introduced GPT-1 and revived the idea of pretraining transformers as LMs followed by supervised fine-tuning

Lecture 8: Statistical View and LoRA
LLMs - Part 3: We practice statistical view on the idea of fine-tuning, discuss full and selective fine-tuning, and study the low-rank adaptation (LoRA) method.
[link]

Lecture Notes:

Chapter 1 - Section 3

Further Reads:

LMs: Chapter 12 of [BB] Section 12.3.5
LoRA: Paper LoRA: Low-Rank Adaptation of Large Language Models by Edward J. Hu et al. presented at ICLR in 2022 introducing LoRA

Lecture 9: Prompt Design
LLMs - Part 4: In this lecture, we discuss the idea prompt design, i.e., using LLMs to sample directly from a task specific distribution. We get to overview ideas like "few-shot" and "zero-shot" learning and get to know what "foundation models" are.
[link]

Lecture Notes:

Chapter 1 - Section 3

Further Reads:

Chain-of-Thought: Paper Chain-of-Thought Prompting Elicits Reasoning in Large Language Models by Jason Wei et al. presented at NeurIPS in 2022 introducing chain-of-thought prompting
Prefix-Tuning: Paper Prefix-Tuning: Optimizing Continuous Prompts for Generation by Xiang Lisa Li et al. presented at ACL in 2021 proposing prefix-tuning approach for prompting
Prompt-Tuning: Paper The Power of Scale for Parameter-Efficient Prompt Tuning by B. Lester et al. presented at EMNLP in 2021 proposing the prompt tuning idea, i.e., learning to prompt
Zero-Shot LLMs: Paper Large Language Models are Zero-Shot Reasoners by T. Kojima et al. presented at NeurIPS in 2022 studying zero-shot learning with LLMs

Lecture 10: Data Generation Problem - Basic Definitions
Formulation - Part 1: In this lecture, we start with the formulating the generic problem of data generation. We review the concept of "data distribution" and see that we essentially need to learn "how to sample from it".
[link]

Lecture Notes:

Chapter 2 - Section 1

Further Reads:

Probabilistic Model: Chapter 2 of [BB] Sections 2.4 to 2.6
Statistics: Chapter 3 of [M] Sections 3.1 to 3.3

Lecture 11: Discriminative vs Generative Learning
Formulation - Part 2: In this lecture we study the discriminative and generative models. We see that many earlier computational models we learned are indeed discriminative. We further learn how we could use a generative model for a discriminative learning task.
[link]

Lecture Notes:

Chapter 2 - Section 2

Further Reads:

Discriminative and Generative Models: Chapter 5 of [BB]

Lecture 12: Naive Bayes - Most Basic Generative Model
Formulation - Part 3: This lecture investigates Naive Bayes, the most basic generative model we can think of. This model enables us understand the idea of generative modeling clearly.
[link]

Lecture Notes:

Chapter 2 - Section 3

Further Reads:

Naive Bayes: Paper Idiot’s Bayes—Not So Stupid After All? by D. Hand and K. Yu published at Statistical Review in 2001 discussing the efficiency of Naive Bayes for classification
Naive Bayes vs Linear Regression: Paper On Discriminative vs. Generative Classifiers: A Comparison of Logistic Regression and Naive Bayes by A. Ng and M. Jordan presented at NeurIPS in 2001 elaborating the data-efficiency efficiency of Naive Bayes and asymptotic superiority of Logistic Regression
Generative Models – Overview: Chapter 20 of [M] Sections 20.1 to 20.3

Lecture 13: Explicit Distribution Learning - Sampling
Explicit Methods - Part 1: In this lecture, we start with "explicit learning methods"; the term we use to refer to approaches that learn data distribution explicitly. We see that direct sampling from them is not easy and we need to develop an structured modeling.
[link]

Lecture Notes:

Chapter 3 - Section 1

Further Reads:

Sampling Overview: Chapter 14 of [BB]
Sampling The book Pattern Recognition and Machine Learning by Christopher Bishop. Read Chapter 11 to know about how challenging sampling from a distribution is
Sampling Methods: Chapter 17 of [GYC] Sections 17.1 and 17.2

Lecture 14: Maximum Likelihood Learning
Explicit Methods - Part 2: We discuss the general recipe for distribution learning, i.e., maximum likelihood estimation (MLE). We see that MLE essentially minimizes the estimated KL divergence between our model and the data distribution.
[link]

Lecture Notes:

Chapter 3 - Section 1

Further Reads:

KL Divergence and MLE: Chapter 5 of [M] Sections 5.1 to 5.2
MLE: Chapter 5 of [GYC] Section 5.5
Maximum Likelihood Learning The book Information Theory, Inference, and Learning Algorithms by David MacKay which discusses MLE for clustering in Chapter 22

Lecture 15: Autoregressive Modeling
AR Models - Part 1: This lecture talks about the idea of autoregressive (AR) modeling. We see that using this approach, we can design an efficient model for data distribution. The price that we pay though is slow generation.
[link]

Lecture Notes:

Chapter 3 - Section 2

Further Reads:

Autoregressive Models: Chapter 22 of [M]

Lecture 16: Computational AR Models
AR Models - Part 2: This lectures goes through a general framework for developing a computational AR model. These models extract a masked content and compute a conditional distribution based on that. Generation in these model is always slow.
[link]

Lecture Notes:

Chapter 3 - Section 2

Further Reads:

Autoregressive Models: Chapter 22 of [M]

Lecture 17: PixelRNN
AR Models - Part 3: In this lecture, we look at PixelRNN; a visual autoregressive model which builds context by recursion. We take a look at its training and sampling.
[link]

Lecture Notes:

Chapter 3 - Section 3

Further Reads:

PixelRNN and PixelCNN: Paper Pixel Recurrent Neural Networks by A. Oord et al. presented at ICML in 2016 proposing PixelRNN and PixelCNN

Lecture 18: Masked AR Models - PixelCNN and ImageGPT
AR Models - Part 4: This lecture goes through the idea of masked AR models. These models give us the benefit of parallel computation, and hence can be efficiently trained using Teacher-Forcing Training. We look into the examples of PixelCNN and ImageGPT. The former uses masked convolution to realize an AR model, while the latter use masked decoding.
[link]

Lecture Notes:

Chapter 3 - Section 3

Further Reads:

PixelRNN and PixelCNN: Paper Pixel Recurrent Neural Networks by A. Oord et al. presented at ICML in 2016 proposing PixelRNN and PixelCNN
ImageGPT: Paper Generative Pretraining from Pixels by M. Chen et al. presented at ICML in 2020 proposing ImageGPT

Lecture 19: Energy Based Models - Boltzmann Distribution
EBMs - Part 1: This is the first lecture on EBMs. We talk about Boltzmann distribution and how we could use it to build a distribution model from an arbitrary computational model. We call such models EBMs.
[link]

Lecture Notes:

Chapter 3 - Section 4

Further Reads:

EBMs: Chapter 24 of [M]
Partition Function and Normalizing: Chapter 16 of [GYC] Section 16.2
Universality of EBMs Paper Representational power of restricted Boltzmann machines and deep belief networks, by N. Le Roux and Y. Bengio published at Neural Computation in 2008 elaborating the representational power of EBMs *Tutorial on EBMs Survey A Tutorial on Energy-Based Learning, by Y. LeCun et al. published in 2006

Lecture 20: Computational EBMs - Training and Sampling
EBMs - Part 2: In this lecture, we discuss computational EBMs. We see how we can train them if we know how to sample them. This motivates us to study MCMC algorithms for sampling.
[link]

Lecture Notes:

Chapter 3 - Section 4

Further Reads:

EBMs: Chapter 24 of [M]
Partition Function and Normalizing: Chapter 16 of [GYC] Section 16.2 *Tutorial on EBMs Survey A Tutorial on Energy-Based Learning, by Y. LeCun et al. published in 2006

Lecture 21: MCMC Algorithms - Gibbs Sampling
EBMs - Part 3: In this lecture, we study the Gibbs sampling algorithm. We learn how we can use it to train and sample a basic Boltzmann machine.
[link]

Lecture Notes:

Chapter 3 - Section 4

Further Reads:

MCMC Algorithms: Chapter 12 of [M] Sections 12.3, 12.6 and 12.7
Gibbs Sampling and Langevin: Chapter 14 of [BB]
Anatomy of MCMC Paper On the Anatomy of MCMC-Based Maximum Likelihood Learning of Energy-Based Models published by E. Nijkamp et al. in AAAI 2020 looking on the stability of training by MCMC algorithms

Lecture 22: MCMC - Langevin and Conservative Divergence
EBMs - Part 4: In this lecture, we study the Langevin Algorithm, a first-order MCMC algorithm for sampling. We learn how we can use this algorithm to develop the so-called conservative divergence approach for EBM training. This leads us to the idea of score-matching, which we will introduce briefly and leave its details for Chapter 6.
[link]

Lecture Notes:

Chapter 3 - Section 4

Further Reads:

Gibbs Sampling and Langevin: Chapter 14 of [BB]
Conservative Divergence Paper Training Products of Experts by Minimizing Contrastive Divergence, by G. Hinton published at Neural Computation in 2002 proposing the idea of Conservative Divergence
Training by MCMC Paper Implicit Generation and Generalization in Energy-Based Models published by Y. Du and I. Mordatch in NeurIPS 2019 discussing efficiency of MCMC algorithms for EBM training
Improved CD Paper Improved Contrastive Divergence Training of Energy-Based Models published by Y. Du et al. in ICML 2021 proposing an efficient training based on Hinton’s CD ideal

Lecture 23: Latent Space
Notion of Latent: In this lecture, we study the notion of Latent Space. This notion enables us to develop latent-space generative models which include most current state-of-the-art generative AI approaches.
[link]

Lecture Notes:

Chapter 3 - Section 5

Further Reads:

Latent Variable: Chapter 16 of [BB] Sections 16.2

Lecture 24: Normalizing Flow
Flow-based Models - Part I: In this section, we start with flow-based models. To this end, we study the idea of normalizing flow.
[link]

Lecture Notes:

Chapter 3 - Section 5

Further Reads:

Normalizing Flow: Chapter 18 of [BB]

Lecture 25: Learning Flow
Flow-based Models - Part II: We study the flow-based models which use normalizing flow to learn data distribution. We discuss their training and sampling and find out how complex they are.
[link]

Lecture Notes:

Chapter 3 - Section 5

Further Reads:

Flow-based Models: Chapter 23 of [M]
Tutorial on Normalizing Flow Paper Normalizing Flows for Probabilistic Modeling and Inference published by G. Papamakarios et al. at JMLR in 2021 discussing the training and inference of flow-based models

Lecture 26: NICE, RealNVP and Glow
Flow-based Models - Part III: We discuss the Real NVP architecture, a classical flow model which has inspired many other flow-based models.
[link]

Lecture Notes:

Chapter 3 - Section 5

Further Reads:

NICE Paper NICE: Non-linear Independent Components Estimation published by L. Dinh et al. at ICLR in 2015 proposing the NICE model
Real NVP Paper Density estimation using Real NVP published by L. Dinh et al. at ICLR in 2017 proposing the Real NVP model
Glow Paper Glow: Generative Flow with Invertible 1x1 Convolutions published by D. Kingma and P. Dhariwal at NeurIPS in 2018 proposing the Glow model

Lecture 27: Introduction to GAN
GANs - Part I: We start with GANs. We see that though looking like a flow model, they are unable to use direct MLE due to challenges involved in likelihood computation. We then intuitively discuss adversarial mechanism used to train the generator.
[link]

Lecture Notes:

Chapter 4 - Section 1

Further Reads:

Tutorial on GANs Tutorial Generative Adversarial Networks given by I. Goodfellow at NeurIPS in 2016

Lecture 28: Vanilla GAN
GANs - Part II: In this lecture, we study the vanilla GAN. We see how we can train it by implementing a min-max game. We discuss its training and sampling.
[link]

Lecture Notes:

Chapter 4 - Section 1

Further Reads:

GANs Paper Generative Adversarial Nets published by I. Goodfellow et al. at NeurIPS in 2014 proposing GANs

Lecture 29: Implicit MLE via GAN
GANs - Part III: In this lecture, we see how GAN training can be interpreted as an implicit maximum-likelihood learning. This will serve us as a background to understand how Wasserstein GAN is working.
[link]

Lecture Notes:

Chapter 4 - Section 2

Further Reads:

GANs Paper Generative Adversarial Nets published by I. Goodfellow et al. at NeurIPS in 2014 proposing GANs
Tutorial on GANs Tutorial Generative Adversarial Networks given by I. Goodfellow at NeurIPS in 2016

Lecture 30: Wasserstein Distance
W-GANs - Part I: In this lecture, we understand the notion of Wasserstein distance. This is going to help us develop W-GAN as an alternative approach to the vanilla GAN.
[link]

Lecture Notes:

Chapter 4 - Section 3

Further Reads:

W-GANs Paper Wasserstein GAN published by M. Arjovsky et al. at ICML in 2017 proposing Wasserstein GANs
Tutorial on GANs Tutorial Generative Adversarial Networks given by I. Goodfellow at NeurIPS in 2016

Lecture 31: Wasserstein GAN
W-GANs - Part II: In this lecture, we develop W-GAN framework, which trains the generator and discriminator to minimize the Wasserstein distance between the data and model distributions.
[link]

Lecture Notes:

Chapter 4 - Section 3

Further Reads:

W-GANs Paper Wasserstein GAN published by M. Arjovsky et al. at ICML in 2017 proposing Wasserstein GANs

Lecture 32: GAN Samples
Summary: We quickly go through well-known GAN architectures and summarize Chapter 4.
[link]

Lecture Notes:

Chapter 4 - Section 4

Further Reads:

DCGAN Paper Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks published by A. Radford et al. at ICLR in 2016 proposing DCGAN
StyleGAN Paper A Style-Based Generator Architecture for Generative Adversarial Networks published by T. Karras et al. at IEEE CVF in 2019 proposing Style GAN
BigGAN Paper Large Scale GAN Training for High Fidelity Natural Image Synthesis published by A. Brock et al. at ICLR in 2019 proposing BigGAN
SAGAN Paper Self-Attention Generative Adversarial Networks published by H. Zhang et al. at ICML in 2019 proposing Self-Attention GAN

Lecture 33: Probabilistic Latent-Space Generation
VI - Part I: In this lecture, we discuss an alternative generator design in which data samples are computed from latent samples via a probabilistic model. This is the base approach used in VAEs and Diffusion models. We see that computing the likelihood in this case is not tractable. This motivates us to learn Variational Inference.
[link]

Lecture Notes:

Chapter 5 - Section 1

Further Reads:

Probabilistic Latent: Chapter 16 of [BB] Sections 16.1 and 16.2
Mixture Models Paper On the number of components in a Gaussian mixture model published by G. McLachlan and S. Rathnayake in 2014 reviewing some key properties of Gaussian mixtures and their approximation power

Lecture 34: Variational Inference
VI - Part II: In this lecture, we study Variational Inference. This framework enables us to develop an implicit approach for estimating the likelihood of probabilistic generators.
[link]

Lecture Notes:

Chapter 5 - Section 2

Further Reads:

ELBO: Chapter 16 of [BB] Section 16.3
VI for Likelihood The early paper Computing Upper and Lower Bounds on Likelihoods in Intractable Networks published by T. Jaakkola and M. Jordan at UAI in 1996
Tutorials on VI Review paper Variational Inference: A Review for Statisticians published by D. Blei, A. Kucukelbir, and J. McAuliffe in 2016 giving a good overview on VI framework
Introduction to VI Book An Introduction to Variational Autoencoders written by D. Kingma and M. Welling and published by NOW in 2019

Lecture 35: Variational Autoencoding
VAE - Part I: In this lecture, we discuss VAE architecture and see how we could use VI framework to build an AE architecture for generation.
[link]

Lecture Notes:

Chapter 5 - Section 3

Further Reads:

AE with VI Paper Auto-Encoding Variational Bayes published by D. Kingma and M. Welling in 2014 proposing VAE
Stachastic Generation by VAE Paper Stochastic Backpropagation and Approximate Inference in Deep Generative Models published by D. Rezende et al in 2014 proposing VAE in parallel

Lecture 36: Encoding and Decoding in VAEs
VAE - Part II: In this lecture, we learn how the Encoder and Decoder of a VAE look like.
[link]

Lecture Notes:

Chapter 5 - Section 3

Further Reads:

AE with VI Paper Auto-Encoding Variational Bayes published by D. Kingma and M. Welling in 2014 proposing VAE
Introduction to VAE Book An Introduction to Variational Autoencoders written by D. Kingma and M. Welling and published by NOW in 2019

Lecture 37: Training VAEs by ELBO Maximization
VAE - Part III: In this lecture, we discuss how we can define a risk function out of the ELBO to implicitly maximize the likelihood in VAEs.
[link]

Lecture Notes:

Chapter 5 - Section 3

Further Reads:

AE with VI Paper Auto-Encoding Variational Bayes published by D. Kingma and M. Welling in 2014 proposing VAE
Introduction to VAE Book An Introduction to Variational Autoencoders written by D. Kingma and M. Welling and published by NOW in 2019

Lecture 38: Gradient on Stochastic Graph by Importance Sampling and Reparameterization
VAE - Part IV: In this lecture, we discuss the Importance Sampling Trick and Reparameterization which can be used to compute gradients on a stochastic computation graph.
[link]

Lecture Notes:

Chapter 5 - Section 3

Further Reads:

AE with VI Paper Auto-Encoding Variational Bayes published by D. Kingma and M. Welling in 2014 proposing VAE
Introduction to VAE Book An Introduction to Variational Autoencoders written by D. Kingma and M. Welling and published by NOW in 2019 read Reparameterization section

Lecture 39: Training Computational VAE
VAE - Part V: In this lecture we complete the training loop for a computational VAE using the reparameterization trick.
[link]

Lecture Notes:

Chapter 5 - Section 3

Further Reads:

AE with VI Paper Auto-Encoding Variational Bayes published by D. Kingma and M. Welling in 2014 proposing VAE
Stachastic Generation by VAE Paper Stochastic Backpropagation and Approximate Inference in Deep Generative Models published by D. Rezende et al in 2014 proposing VAE in parallel

Lecture 40: VAE Challenges and Variants of VAE
VAE - Part VI: In this lecture, we discuss the main challenges of VAE in practice. We also discuss the variants of VAE which can help overcoming these challenges.
[link]

Lecture Notes:

Chapter 5 - Section 4

Further Reads:

DCVAE Paper Semi-Supervised Learning with Deep Generative Models published by D. Kingma et al. in 2014 implementing a Deep Convolutional VAE
Transformer VAE Paper Transformer VAE: A Hierarchical Model for Structure-Aware and Interpretable Music Representation Learning published by J. Jiang et al in ICASSP 2020 proposing a Transformer based VAE
VAE with VampPrior Paper VAE with a VampPrior published by J. Tomczak and M. Welling in 2017 proposing VAE with general latent prior

Lecture 41: Vector-Quantized VAE
VAE - Part VII: In this lecture we discuss VAEs with discrete latent, in particular the VQ-VAE. We see the idea behind quantizing the latent and its implementational framework.
[link]

Lecture Notes:

Chapter 5 - Section 4

Further Reads:

VQ-VAE Paper Neural Discrete Representation Learning published by D. Kingma and M. Welling in NeurIPS 2017 proposing VQ-VAE

Lecture 42: Generation by Langevin Dynamics
Diffusion by Score - Part I: In this lecture we start with the most basic diffusion-based generation framework which uses Langevin Dynamics to sample from data distribution. We understand that we only need to know the score function to generate by this approach.
[link]

Lecture Notes:

Chapter 6 - Section 1

Further Reads:

Score Matching Paper Estimation of non-normalized statistical models by score matching published by A. Hyvärinen in 2005 proposing the computational score matching
Langevin Generation Paper Generative Modeling by Estimating Gradients of the Data Distribution published by Song and Ermon in NeurIPS 2019 explaining the score matching and its application to generation by Langevin Dynamics

Lecture 43: Score Function
Preliminaries - Part I: In this short lecture, we learn about score function and its key properties. This function is an important entity in diffusion-based models.
[link]

Lecture Notes:

Chapter 6 - Section 1

Further Reads:

Score Matching Paper Estimation of non-normalized statistical models by score matching published by A. Hyvärinen in 2005 proposing the computational score matching

Lecture 44: Learning to Diffuse by Score Matching
Diffusion by Score - Part II: In this lecture, we learn the idea of score-matching. We learn how we could use sampling to train a computational model that performs score matching, and how we could use the Langevin dynamics to sample from data distribution using our trained score network.
[link]

Lecture Notes:

Chapter 6 - Section 1

Further Reads:

Score Matching Paper Estimation of non-normalized statistical models by score matching published by A. Hyvärinen in 2005 proposing the computational score matching

Lecture 45: Bayes Optimal and Computational Denoising
Preliminaries - Part II: In this short lecture, we talk about the concept of denoising, what the optimal approach is, and how we could do this computationally.
[link]

Lecture Notes:

Chapter 6 - Section 2

Lecture 46: Generic Diffusion SDE and Its Reverse
Preliminaries - Part III: In this lecture, we learn about a general diffusion that is described by a SDE. We also learn how we could reverse it in time. This takes us towards more practical diffusion-based models, i.e., DSM and DDPM.
[link]

Lecture Notes:

Chapter 6 - Section 2

Further Reads:

Reverse-time Diffusion Paper Reverse-time diffusion equation models published in Elsevier by B. Anderson in 1982 esplaining the reverse-time diffusion process
SDE Approach Paper Maximum Likelihood Training of Score-Based Diffusion Models by Song et al. in NeurIPS 2021 explaining the DPM and DDPM from inverse Diffusion viewpoint

Lecture 47: Building SDE and Its Reverse for Data Generation
Diffusion by Score - Part III: In this lecture, we design a SDE and its reverse which is able to send us from data sample to Gaussian Latent and back from Latent to Data. We also find out what we need to learn in order to be able to get back from Latent to Data.
[link]

Lecture Notes:

Chapter 6 - Section 2

Further Reads:

SDE Approach Paper Maximum Likelihood Training of Score-Based Diffusion Models by Song et al. in NeurIPS 2021 explaining the DPM and DDPM from inverse Diffusion viewpoint

Lecture 48: Diffusion Score Matching
Diffusion by Score - Part IV: In this lecture, we go through diffusion score matching idea which enables us to learn the required scores for data generation on a reverse SDE. This enables us to sample from data distribution via a diffusion process without learning the data score.
[link]

Lecture Notes:

Chapter 6 - Section 2

Further Reads:

DSM Paper Estimation of non-normalized statistical models by score matching published in Neural Computation by Pascal Vincent in 2011 proposing the denoising approach for score estimation (DSM)
SDE Approach Paper Maximum Likelihood Training of Score-Based Diffusion Models by Song et al. in NeurIPS 2021 explaining the DPM and DDPM from inverse Diffusion viewpoint

Lecture 49: Summary of Score Matching and SDE Limitations
Diffusion by Score - Part V: In this lecture, we summarize what we learned on SDE-based diffusion models. We see that due to the lack of guarantee on higher order differentials, such models are limited on accuracy. This motivates us looking at diffusion from a probabilistic viewpoint.
[link]

Lecture Notes:

Chapter 6 - Section 3

Further Reads:

SDE Approach Paper Maximum Likelihood Training of Score-Based Diffusion Models by Song et al. in NeurIPS 2021 explaining the DPM and DDPM from inverse Diffusion viewpoint

Lecture 50: Probabilistic Diffusion - Forward Process
DPMs - Part I: Given the limitation of SDE-based diffusion building, we no move towards probabilistic diffusion models which have led to the current trends, e.g., DPMs and DDPMs. We first build the forward process and understand its features.
[link]

Lecture Notes:

Chapter 6 - Section 3

Further Reads:

DPM Paper Deep Unsupervised Learning using Nonequilibrium Thermodynamics published by J. Sohl-Dickstein et al. in ICML 2015 proposing DPM framework for generation (from Variational Inference)

Lecture 51: Reverse Probabilistic Diffusion via MLE
DPMs - Part II: In this lecture, we try to build a reverse Markov chain that builds the reverse trajectory from noise to data. We see that this problem can be formulated as a classic distribution learning that can be addressed by maximum likelihood estimation. Due to computational complexity though, we need to invoke variational inference to implicitly perform MLE.
[link]

Lecture Notes:

Chapter 6 - Section 3

Further Reads:

DPM Paper Deep Unsupervised Learning using Nonequilibrium Thermodynamics published by J. Sohl-Dickstein et al. in ICML 2015 proposing DPM framework for generation (from Variational Inference)

Lecture 52: Risk Function for Probabilistic Diffusion
DPMs - Part III: In this lecture, we derive the learning risk function that learns the reverse trajectory. To this end, we derive the ELBO on the log-likelihood of the reverse process and compute it for a general learnable Markov-chain. We then focus on the particular case of Gaussian denoising. This gives us a computable loss function, which can be used for training of probabilistic diffusion models.
[link]

Lecture Notes:

Chapter 6 - Section 3

Further Reads:

DPM Paper Deep Unsupervised Learning using Nonequilibrium Thermodynamics published by J. Sohl-Dickstein et al. in ICML 2015 proposing DPM framework for generation (from Variational Inference)

Lecture 53: Computational DPMs
DPMs - Part IV: In this lecture, we build a computational DPM. We see how we can compute the sample loss using time sampling. We sketch the training loop and Generation algorithm. This completes the chain of the last couple of lectures.
[link]

Lecture Notes:

Chapter 6 - Section 4

Further Reads:

DPM Paper Deep Unsupervised Learning using Nonequilibrium Thermodynamics published by J. Sohl-Dickstein et al. in ICML 2015 proposing DPM framework for generation (from Variational Inference)

Lecture 54: Denoising DPMs
DDPMs: In this lecture, we learn the well-known case of DPMs, i.e., Denoising DPM (DDPM). We see how we build the denoiser in these models and how we can train and sample from them. This describes the most conventional form of diffusion models used in practice.
[link]

Lecture Notes:

Chapter 6 - Section 4

Further Reads:

DDPM Paper Denoising Diffusion Probabilistic Models published by J. Ho et al. in NeurIPS 2020 proposing DDPM framework
Improved DDPM Paper Improved Denoising Diffusion Probabilistic Models published by A. Nichol and P. Dhariwal in ICML 2021 proposing improvements to DDPM

Lecture 55: Denoising Diffusion Implicit Models
DDIMs: In this lecture, we study the Denoising Diffusion Implicit Models (DDIMs) which enable us to sample much faster as compared to DDPMs.
[link]

Lecture Notes:

Chapter 6 - Section 4

Further Reads:

DDIM Paper Denoising Diffusion Implicit Models published by J. Song et al. in ICLR 2021 proposing DDIM framework

Lecture 56: Some Known Diffusion Models
Commercial Examples: In this lecture, we study some well-known examples for diffusion models that are currently being used in large scale.
[link]

Lecture Notes:

Chapter 6 - Section 5

Further Reads:

Stable Diffusion Paper High-Resolution Image Synthesis with Latent Diffusion Models published by R. Rombach et al. in IEEE CVPR 2022 proposing Stable Diffusion
CVL Group Page of the research group Computer Vision & Learning Group in LMU Munich which developed Stable Diffusion
Imagen Paper Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding published by C. Saharia et al. in 2022 at Google proposing Imagen model
DALL-E Page of the DALL-E project by OpenAI

Lecture 57: Multimodality and Conditional Generative Models
Conditional Models I: In this lecture, we discuss the notion of multi-modality. We see that this notion is closely related to conditional generative models.
[link]

Lecture Notes:

Chapter 7

Further Reads:

Text-to-Image Paper Learning Transferable Visual Models From Natural Language Supervision published by A. Radford et al. in ICML 2021 proposing visual data generation from raw text
Survey on Multimodal Models Paper Foundations and Trends in Multimodal Machine Learning: Principles, Challenges, and Open Questions published by P. Liang et al. in 2023

Lecture 58: Computational Conditioning via Embedding
Conditional Models II: In this lecture, we discuss how we could computationally condition a generative model via condition embedding. We further discuss the two particular techniques of Cross-Attention and FiLM.
[link]

Lecture Notes:

Chapter 7

Further Reads:

FiLM Paper FiLM: Visual Reasoning with a General Conditioning Layer published by E. Perez et al. in AAAI 2018 proposing FiLM
Cross-Attention Conditioning Paper Multi-Modality Cross Attention Network for Image and Sentence Matching published by X. Wei et al. in IEEE CVPR 2020 proposing a cross-attention based approach for conditioning

Lecture 59: Final Words
The End: This is the last part of the last lecture. A few important notes on what you know and what you need to pay attention to.
[link]

Lecture Notes:

Chapter 7

Lectures

Review Lectures