Lectures

Here, you can find the recordings of the lecture videos.

  • Lecture 0: Course Overview and Logistics
    Overview: This lecture gives an overview of the course structure and logistics.
    [link]

    Lecture Notes:

  • Lecture 1: Tokenization and Embedding
    LMs - Part 1: We start with LMs and see how we can feed text into them through the so-called "Tokenization" and "Embedding" steps; a minimal code sketch is included below.
    [link]

    Lecture Notes:

    Further Reads:
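
    Code Sketch:

    As a companion to this lecture, here is a minimal sketch of the tokenize-then-embed pipeline: a toy whitespace tokenizer over a hand-made vocabulary, followed by a lookup into a randomly initialized embedding table. The vocabulary, dimensions, and random initialization are illustrative assumptions; real LMs use learned subword tokenizers (e.g., BPE) and trained embeddings.

        import numpy as np

        # Toy vocabulary and whitespace tokenizer (illustrative; real LMs use subword tokenizers such as BPE).
        vocab = {"<unk>": 0, "the": 1, "cat": 2, "sat": 3, "on": 4, "mat": 5}

        def tokenize(text):
            """Map a string to a list of token ids, falling back to <unk> for unknown words."""
            return [vocab.get(w, vocab["<unk>"]) for w in text.lower().split()]

        # Randomly initialized embedding table: one d_model-dimensional vector per vocabulary entry.
        d_model = 8
        rng = np.random.default_rng(0)
        embedding_table = rng.normal(scale=0.02, size=(len(vocab), d_model))

        token_ids = tokenize("The cat sat on the mat")
        token_embeddings = embedding_table[token_ids]   # shape: (sequence length, d_model)
        print(token_ids)                                # [1, 2, 3, 4, 1, 5]
        print(token_embeddings.shape)                   # (6, 8)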

  • Lecture 2: Language Distribution and Bi-Gram Model
    LMs - Part 2: In this lecture, we define LMs concretely via the notion of a language distribution. We then build a simple Bi-gram LM; a minimal sketch is included below.
    [link]

    Lecture Notes:

    Further Reads:
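
    Code Sketch:

    Here is a minimal sketch of a count-based bi-gram LM on a toy corpus: estimate P(next token | current token) from bi-gram counts. The tiny corpus and the unsmoothed maximum-likelihood estimate are illustrative choices, not the lecture's exact construction.

        from collections import Counter, defaultdict

        # Toy corpus; "<s>" and "</s>" mark sentence boundaries.
        corpus = ["<s> the cat sat </s>", "<s> the cat ran </s>", "<s> the dog ran </s>"]

        # Count how often each token follows each other token.
        bigram_counts = defaultdict(Counter)
        for sentence in corpus:
            tokens = sentence.split()
            for prev, curr in zip(tokens, tokens[1:]):
                bigram_counts[prev][curr] += 1

        def bigram_prob(prev, curr):
            """Maximum-likelihood estimate of P(curr | prev) from the counts."""
            total = sum(bigram_counts[prev].values())
            return bigram_counts[prev][curr] / total if total else 0.0

        print(bigram_prob("the", "cat"))   # 2/3: "the" is followed by "cat" in 2 of 3 sentences
        print(bigram_prob("cat", "ran"))   # 1/2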

  • Lecture 3: Recurrent LMs
    LMs - Part 3: This lecture explains how we can build an LM using RNNs. We take a look at LSTM-based LMs; a minimal sketch follows the reading list below.
    [link]

    Lecture Notes:

    Further Reads:

    • Recurrent LMs: Chapter 8 of [JM]
    • LSTM LMs: Paper Regularizing and Optimizing LSTM Language Models by Stephen Merity, Nitish Shirish Keskar, and Richard Socher presented at ICLR 2018, enabling LSTMs to perform strongly on word-level language modeling
    • High-Rank Recurrent LMs: Paper Breaking the Softmax Bottleneck: A High-Rank RNN Language Model by Zhilin Yang, Zihang Dai, Ruslan Salakhutdinov, and William W. Cohen presented at ICLR 2018 proposing Mixture of Softmaxes (MoS) and achieving state-of-the-art results at the time
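
    Code Sketch:

    As a companion to this lecture, here is a minimal PyTorch sketch of an LSTM LM: embed the tokens, run them through an LSTM, and project each hidden state to next-token logits. The sizes are placeholders, and the training loop (cross-entropy on shifted targets) is omitted.

        import torch
        import torch.nn as nn

        class LSTMLanguageModel(nn.Module):
            """Minimal LSTM LM: embedding -> LSTM -> projection to next-token logits."""
            def __init__(self, vocab_size=1000, d_model=128, hidden_size=256, num_layers=2):
                super().__init__()
                self.embed = nn.Embedding(vocab_size, d_model)
                self.lstm = nn.LSTM(d_model, hidden_size, num_layers, batch_first=True)
                self.proj = nn.Linear(hidden_size, vocab_size)

            def forward(self, token_ids):
                x = self.embed(token_ids)      # (batch, seq_len, d_model)
                h, _ = self.lstm(x)            # (batch, seq_len, hidden_size)
                return self.proj(h)            # (batch, seq_len, vocab_size) next-token logits

        model = LSTMLanguageModel()
        logits = model(torch.randint(0, 1000, (4, 16)))   # a batch of 4 sequences of length 16
        print(logits.shape)                               # torch.Size([4, 16, 1000])
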
  • Lecture 4: Context Extraction via Self-Attention
    Transformer LMs - Part 1: In this lecture, we use the self-attention mechanism to extract context from a token sequence. An introduction to self-attention is given throughout the lecture; a minimal code sketch is included below.
    [link]

    Lecture Notes:

    Further Reads:
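
    Code Sketch:

    Here is a minimal NumPy sketch of single-head scaled dot-product self-attention: each token's query is compared against all keys, and the softmax-weighted values form that token's context vector. The random projection matrices stand in for learned parameters; a Transformer LM (next lecture) would additionally apply a causal mask and use multiple heads.

        import numpy as np

        def self_attention(X, W_q, W_k, W_v):
            """Single-head scaled dot-product self-attention.
            X: (seq_len, d_model); W_q, W_k, W_v: (d_model, d_head)."""
            Q, K, V = X @ W_q, X @ W_k, X @ W_v
            scores = Q @ K.T / np.sqrt(Q.shape[-1])           # pairwise query-key similarities
            weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
            weights /= weights.sum(axis=-1, keepdims=True)    # row-wise softmax -> attention weights
            return weights @ V                                # one context vector per token

        rng = np.random.default_rng(0)
        d_model, d_head, seq_len = 16, 8, 5
        X = rng.normal(size=(seq_len, d_model))               # a toy sequence of 5 token embeddings
        W_q, W_k, W_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))
        out = self_attention(X, W_q, W_k, W_v)
        print(out.shape)                                      # (5, 8)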

  • Lecture 5: Transformer LM
    Transformer LMs - Part 2: We use the self-attention-based context to build an LM via a Transformer.
    [link]

    Lecture Notes:

    Further Reads:

  • Lecture 6: LLM Examples
    LLMs - Part 1: We take a look at GPT examples to see how deep they are, how large the data is, how to evaluate them, and what challenges stand in the way of pre-training them.
    [link]

    Lecture Notes:

    Further Reads:

    • GPT-1: Paper Improving Language Understanding by Generative Pre-Training by Alec Radford et al. (OpenAI, 2018) that introduced GPT-1 and revived the idea of pretraining transformers as LMs followed by supervised fine-tuning
    • GPT-2: Paper Language Models are Unsupervised Multitask Learners by Alec Radford et al. (OpenAI, 2019) that introduces GPT-2, a 1.5B-parameter LM trained on web text
    • GPT-3: Paper Language Models are Few-Shot Learners by Tom B. Brown et al. (OpenAI, 2020) that introduces GPT-3, a 175B-parameter transformer LM
    • GPT-4: GPT-4 Technical Report by OpenAI (2023) that provides an overview of GPT-4’s capabilities

    • The Pile: Paper The Pile: An 800GB Dataset of Diverse Text for Language Modeling by Leo Gao et al. published in 2020 introducing The Pile dataset
    • Documentation Debt: Paper Addressing “Documentation Debt” in Machine Learning Research: A Retrospective Datasheet for BookCorpus by Jack Bandy and Nicholas Vincent published in 2021 discussing the efficiency and legality of data collection by looking into BookCorpus
  • Lecture 7: Pre-training vs Fine-tuning
    LLMs - Part 2: We discuss the idea of pre-training and fine-tuning, and take a look at the example of GPT-1.
    [link]

    Lecture Notes:

    Further Reads:

    • SSL: Paper Semi-supervised Sequence Learning by Andrew M. Dai et al. published in 2015 that explores using unsupervised pre-training followed by supervised fine-tuning; this was an early solid work advocating the pre-training idea for LMs
    • GPT-1: Paper Improving Language Understanding by Generative Pre-Training by Alec Radford et al. (OpenAI, 2018) that introduced GPT-1 and revived the idea of pretraining transformers as LMs followed by supervised fine-tuning
  • Lecture 8: Statistical View and LoRA
    LLMs - Part 3: We take a statistical view of the idea of fine-tuning, discuss full and selective fine-tuning, and study the low-rank adaptation (LoRA) method; a minimal LoRA sketch follows the reading list below.
    [link]

    Lecture Notes:

    Further Reads:

    • LMs: Chapter 12 of [BB] Section 12.3.5
    • LoRA: Paper LoRA: Low-Rank Adaptation of Large Language Models by Edward J. Hu et al. presented at ICLR in 2022 introducing LoRA
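
    Code Sketch:

    Here is a minimal PyTorch sketch of the LoRA idea: keep the pre-trained weight matrix frozen and learn only a low-rank update, so the adapted layer computes W0 x + (alpha/r) B A x. The wrapped layer, rank, and scaling values are placeholder choices for illustration.

        import torch
        import torch.nn as nn

        class LoRALinear(nn.Module):
            """A frozen pre-trained linear layer plus a trainable low-rank update (LoRA)."""
            def __init__(self, base: nn.Linear, rank=8, alpha=16):
                super().__init__()
                self.base = base
                for p in self.base.parameters():
                    p.requires_grad_(False)                             # freeze pre-trained weights
                d_out, d_in = base.weight.shape
                self.A = nn.Parameter(torch.randn(rank, d_in) * 0.01)   # low-rank factor A
                self.B = nn.Parameter(torch.zeros(d_out, rank))         # zero init: no change at start
                self.scale = alpha / rank

            def forward(self, x):
                return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

        layer = LoRALinear(nn.Linear(512, 512))
        out = layer(torch.randn(2, 512))                                # same interface as the base layer
        trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
        print(out.shape, trainable)                                     # torch.Size([2, 512]) 8192
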
  • Lecture 9: Prompt Design
    LLMs - Part 4: In this lecture, we discuss the idea of prompt design, i.e., using LLMs to sample directly from a task-specific distribution. We overview ideas like "few-shot" and "zero-shot" learning and get to know what "foundation models" are; a minimal prompting sketch follows the reading list below.
    [link]

    Lecture Notes:

    Further Reads:

    • Chain-of-Thought: Paper Chain-of-Thought Prompting Elicits Reasoning in Large Language Models by Jason Wei et al. presented at NeurIPS in 2022 introducing chain-of-thought prompting
    • Prefix-Tuning: Paper Prefix-Tuning: Optimizing Continuous Prompts for Generation by Xiang Lisa Li et al. presented at ACL in 2021 proposing prefix-tuning approach for prompting
    • Prompt-Tuning: Paper The Power of Scale for Parameter-Efficient Prompt Tuning by B. Lester et al. presented at EMNLP in 2021 proposing the prompt tuning idea, i.e., learning to prompt
    • Zero-Shot LLMs: Paper Large Language Models are Zero-Shot Reasoners by T. Kojima et al. presented at NeurIPS in 2022 studying zero-shot learning with LLMs
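
    Code Sketch:

    Here is a minimal sketch of few-shot prompt design for sentiment classification: the task-specific distribution is targeted purely through the prompt text, with no weight updates. The demonstrations and the template are invented for illustration, and query_llm is a hypothetical placeholder for whatever LLM interface you use.

        # Few-shot prompt: in-context demonstrations steer the LLM toward the task distribution.
        demonstrations = [
            ("The movie was a delight from start to finish.", "positive"),
            ("I walked out halfway through.", "negative"),
        ]

        def build_few_shot_prompt(query, demos=demonstrations):
            lines = ["Classify the sentiment of each review as positive or negative.", ""]
            for text, label in demos:
                lines += [f"Review: {text}", f"Sentiment: {label}", ""]
            lines += [f"Review: {query}", "Sentiment:"]
            return "\n".join(lines)

        prompt = build_few_shot_prompt("The plot dragged, but the acting was superb.")
        print(prompt)
        # answer = query_llm(prompt)   # hypothetical call to an LLM API of your choice
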
  • Lecture 10: Data Generation Problem - Basic Definitions
    Formulation - Part 1: In this lecture, we start by formulating the generic problem of data generation. We review the concept of "data distribution" and see that we essentially need to learn "how to sample from it".
    [link]

    Lecture Notes:

    Further Reads:

  • Lecture 11: Discriminative vs Generative Learning
    Formulation - Part 2: In this lecture, we study discriminative and generative models. We see that many of the computational models we learned earlier are in fact discriminative. We further learn how we could use a generative model for a discriminative learning task.
    [link]

    Lecture Notes:

    Further Reads:

  • Lecture 12: Naive Bayes - Most Basic Generative Model
    Formulation - Part 3: This lecture investigates Naive Bayes, the most basic generative model we can think of. This model enables us to understand the idea of generative modeling clearly; a minimal sketch follows the reading list below.
    [link]

    Lecture Notes:

    Further Reads:

    • Naive Bayes: Paper Idiot’s Bayes—Not So Stupid After All? by D. Hand and K. Yu published in International Statistical Review in 2001 discussing the effectiveness of Naive Bayes for classification
    • Naive Bayes vs Logistic Regression: Paper On Discriminative vs. Generative Classifiers: A Comparison of Logistic Regression and Naive Bayes by A. Ng and M. Jordan presented at NeurIPS in 2001 elaborating on the data efficiency of Naive Bayes and the asymptotic superiority of Logistic Regression
    • Generative Models – Overview: Chapter 20 of [M] Sections 20.1 to 20.3
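
    Code Sketch:

    Here is a minimal NumPy sketch of Naive Bayes as a generative model over binary features (the Bernoulli variant): fit class priors and per-class feature probabilities, then classify by Bayes' rule under the assumption that features are conditionally independent given the class. The tiny dataset and the Laplace smoothing constant are illustrative assumptions.

        import numpy as np

        # Toy binary dataset: rows are samples, columns are binary features; y holds class labels.
        X = np.array([[1, 0, 1], [1, 1, 1], [0, 0, 1], [0, 1, 0], [0, 0, 0]])
        y = np.array([1, 1, 1, 0, 0])

        def fit_bernoulli_nb(X, y, alpha=1.0):
            """Estimate class priors and Laplace-smoothed per-class feature probabilities."""
            classes = np.unique(y)
            priors = np.array([(y == c).mean() for c in classes])
            theta = np.array([(X[y == c].sum(axis=0) + alpha) / ((y == c).sum() + 2 * alpha)
                              for c in classes])
            return classes, priors, theta

        def predict(x, classes, priors, theta):
            """Pick the class maximizing log p(c) + sum_j log p(x_j | c)."""
            log_post = np.log(priors) + (x * np.log(theta) + (1 - x) * np.log(1 - theta)).sum(axis=1)
            return classes[np.argmax(log_post)]

        classes, priors, theta = fit_bernoulli_nb(X, y)
        print(predict(np.array([1, 0, 1]), classes, priors, theta))   # -> 1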

Review Lectures

Here, you can find review lectures on some key deep learning topics. It is strongly suggested that you watch these videos to recap those key concepts, as they are frequently used in the course.