Materials
Lecture Notes
The lecture notes are uploaded throughout the semester. For each chapter, the notes are provided section by section.
Chapter 0: Course Overview and Logistics
- Handouts: All Sections included in a single file
Chapter 1: Text Generation via Language Models
- Section 1: Fundamentals of Language Modeling - Primary LMs
- Section 2: Transformer-based LMs
- Section 3: Large Language Models
Chapter 2: Data Generation Problem
- Section 1: Basic Definitions
- Section 2: Generative and Discriminative Learning
- Section 3: Generative Modeling
Book
There is no single textbook for this course; we use various resources, most of which are research papers included in the reading list below and added throughout the semester. The following textbooks, however, cover some key notions and related topics.
- [BB] Bishop, Christopher M., and Hugh Bishop. Deep Learning: Foundations and Concepts. Springer Nature, 2023.
- [M] Murphy, Kevin P. Probabilistic Machine Learning: Advanced Topics. MIT Press, 2023.
- [GYC] Goodfellow, Ian, et al. Deep Learning. MIT Press, 2016.
For the first part of the course, the following book is a good read:
- [JM] Jurafsky, Dan, and James H. Martin. Speech and Language Processing. 3rd ed. draft.
The following recent textbooks are also good resources for practicing hands-on skills. Note that the course is not only about implementation: we study the fundamentals that led to the development of the framework nowadays known as generative AI. Of course, we also get our hands dirty and learn how to implement these models.
- Sanseviero, Omar, et al. Hands-On Generative AI with Transformers and Diffusion Models. O’Reilly Media, Inc., 2024.
- Alammar, Jay, and Maarten Grootendorst. Hands-On Large Language Models: Language Understanding and Generation. O’Reilly Media, Inc., 2024.
Reading List
This section will be completed gradually throughout the semester. I will try to break down the essence of each item so that you can go over them easily.
Review
You may review the idea of Seq2Seq learning in the following references:
- SimpleLM: Initial ideas on making a language model
- SeqGen: Sequence generation via RNNs; an old idea, but still worth thinking about!
- Seq2Seq: How we can do sequence-to-sequence learning via neural networks
You may review the idea of transformers in the following resources:
- Transformer Paper: Paper Attention Is All You Need published in 2017 that marked a turning point in sequence processing; a minimal attention sketch follows this list
- Transformers: Chapter 9 of [JM]
- Transformers: Chapter 12 of [BB] Section 12.1
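As a quick refresher on the core operation behind the Transformer paper above, here is a minimal NumPy sketch of scaled dot-product attention; the function and variable names are my own, not those of any particular implementation.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Scaled dot-product attention as described in Attention Is All You Need.

    Q, K: (sequence_length, d_k) arrays; V: (sequence_length, d_v) array.
    Returns the attended values and the attention weights.
    """
    d_k = Q.shape[-1]
    # Similarity of every query with every key, scaled by sqrt(d_k)
    scores = Q @ K.T / np.sqrt(d_k)
    # Row-wise softmax turns scores into attention weights
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

# Toy usage: 4 tokens with 8-dimensional queries, keys, and values
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
out, attn = scaled_dot_product_attention(Q, K, V)
print(out.shape, attn.shape)  # (4, 8) (4, 4)
```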
Chapter 1: Text Generation and Language Models
Tokenization and Embedding
- Tokenization: Chapter 2 of [JM]
- Original BPE Algorithm: The byte pair encoding compression algorithm proposed by Philip Gage in 1994; a minimal sketch of BPE merging follows this list
- BPE for Tokenization: Paper Neural machine translation of rare words with subword units by Rico Sennrich, Barry Haddow, and Alexandra Birch presented in ACL 2016 that adapted BPE for NLP
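To make the BPE entries above concrete, here is a minimal sketch of the merge-learning loop in the spirit of the Sennrich et al. paper; the toy corpus and the simplified string-based merging are mine, not the paper's reference implementation.

```python
from collections import Counter

def get_pair_counts(vocab):
    """Count adjacent symbol pairs over a vocabulary of word -> frequency."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Replace every occurrence of the pair with its concatenation."""
    a, b = pair
    return {word.replace(f"{a} {b}", f"{a}{b}"): freq for word, freq in vocab.items()}

# Toy corpus: each word is a sequence of characters plus an end-of-word marker
vocab = {"l o w </w>": 5, "l o w e r </w>": 2, "n e w e s t </w>": 6, "w i d e s t </w>": 3}
for _ in range(5):  # learn five merge operations
    pairs = get_pair_counts(vocab)
    best = max(pairs, key=pairs.get)  # most frequent adjacent pair
    vocab = merge_pair(best, vocab)
    print("merged:", best)
```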
Other Embedding Approaches
- Word2Vec: Paper Efficient Estimation of Word Representations in Vector Space by Mikolov et al. published in 2013 introducing Word2Vec
- GloVe: Paper GloVe: Global Vectors for Word Representation by Pennington et al. published in 2014 introducing GloVe
- WordPiece: Paper Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation by Yonghui Wu et al. published in 2016 introducing WordPiece (used in BERT)
- SentencePiece: Paper SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing by Taku Kudo and John Richardson presented in EMNLP 2018 that introduces a language-independent tokenizer
- ELMo: Paper Deep contextualized word representations by Peters et al. presented at NAACL in 2018 introducing ELMo, a context-sensitive embedding
- ByT5: Paper ByT5: Towards a token-free future with pre-trained byte-to-byte models by Xue et al. presented in ACL 2022 proposing ByT5
Language Modelling
- LMs: Chapter 12 of [BB] Section 12.2
- N-Gram LMs: Chapter 3 of [JM] (Speech and Language Processing), Section 3.1 on N-gram LMs; a minimal bigram sketch follows this list
- Maximum Likelihood: Chapter 2 of [BB] Sections 2.1 – 2.3
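The N-gram and maximum-likelihood entries above go together: an n-gram LM is nothing but a relative-frequency (maximum-likelihood) estimate of next-token probabilities. A minimal bigram sketch on a toy corpus of my own:

```python
from collections import Counter

corpus = "the cat sat on the mat . the dog sat on the rug .".split()

# Count bigrams and the contexts they condition on
bigram_counts = Counter(zip(corpus, corpus[1:]))
context_counts = Counter(corpus[:-1])

def bigram_prob(w_prev, w):
    """Maximum-likelihood estimate P(w | w_prev) = count(w_prev, w) / count(w_prev)."""
    if context_counts[w_prev] == 0:
        return 0.0
    return bigram_counts[(w_prev, w)] / context_counts[w_prev]

print(bigram_prob("the", "cat"))  # 0.25: "the" occurs 4 times as a context, once before "cat"
print(bigram_prob("sat", "on"))   # 1.0: "sat" is always followed by "on"
```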
Recurrent LMs
- Recurrent LMs: Chapter 8 of [JM]
- LSTM LMs: Paper Regularizing and Optimizing LSTM Language Models by Stephen Merity, Nitish Shirish Keskar, and Richard Socher published in ICLR 2018 enabling LSTMs to perform strongly on word-level language modeling
- High-Rank Recurrent LMs: Paper Breaking the Softmax Bottleneck: A High-Rank RNN Language Model by Zhilin Yang, Zihang Dai, Ruslan Salakhutdinov, and William W. Cohen presented at ICLR 2018 proposing Mixture of Softmaxes (MoS) and achieving state-of-the-art results at the time
Transformer-based LMs and LLMs
- Transformer LMs: Chapter 12 of [BB] Section 12.3
- LLMs via Transformers: Chapter 10 of [JM]
GPTs
- GPT-1: Paper Improving Language Understanding by Generative Pre-Training by Alec Radford et al. (OpenAI, 2018) that introduced GPT-1 and revived the idea of pretraining transformers as LMs followed by supervised fine-tuning
- GPT-2: Paper Language Models are Unsupervised Multitask Learners by Alec Radford et al. (OpenAI, 2019) that introduces GPT-2, a 1.5B-parameter LM trained on web text
- GPT-3: Paper Language Models are Few-Shot Learners by Tom B. Brown et al. (OpenAI, 2020) that introduces GPT-3, a 175B-parameter transformer LM
- GPT-4: GPT-4 Technical Report by OpenAI (2023) that provides an overview of GPT-4’s capabilities
Other LLMs
- BERT: Paper BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding by Jacob Devlin et al. presented at NAACL 2019 that introduced BERT
- RoBERTa: Paper RoBERTa: A Robustly Optimized BERT Pretraining Approach by Yinhan Liu, et al. (Facebook AI, 2019) that shows BERT’s performance can be significantly improved by more data, longer training, and removing next sentence prediction
- T5: Paper Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer by Colin Raffel et al. (JMLR 2020) that reformulates all NLP tasks as text-to-text problems introducing the T5 model
Data for LLMs
- The Pile: Paper The Pile: An 800GB Dataset of Diverse Text for Language Modeling by Leo Gao et al. published in 2020 introducing the dataset The Pile
- RACE: Paper RACE: Large-scale Reading Comprehension Dataset from Examinations by Guokun Lai et al. presented at EMNLP in 2017 introducing a large-scale dataset of English reading comprehension questions from real-world exams
- BookCorpus: Paper Aligning Books and Movies: Towards Story-like Visual Explanations by Watching Movies and Reading Books by Yukun Zhu et al. presented at ICCV in 2015 introducing the dataset BookCorpus. It was used to pre-train GPT-1 and BERT; nevertheless, it turned out that the dataset was collected without the authors' consent; see the Wikipedia article. It was hence later replaced with BookCorpusOpen
- Documentation Debt: Paper Addressing “Documentation Debt” in Machine Learning Research: A Retrospective Datasheet for BookCorpus by Jack Bandy and Nicholas Vincent published in 2021 discussing the efficiency and legality of data collection by looking into BookCorpus
Earlier Work on Pretraining
- SSL: Paper Semi-supervised Sequence Learning by Andrew M. Dai et al. published in 2015 that explores using unsupervised pretraining followed by supervised fine-tuning; an early, solid work advocating the pre-training idea for LMs
- ULMFiT: Paper Universal Language Model Fine-tuning for Text Classification by Jeremy Howard et al. presented at ACL in 2018 introducing ULMFiT that uses pre-trained LMs with task-specific fine-tuning
Fine-tuning
- LMs: Chapter 12 of [BB] Section 12.3.5
- LoRA: Paper LoRA: Low-Rank Adaptation of Large Language Models by Edward J. Hu et al. presented at ICLR in 2022 introducing LoRA; a minimal sketch of the idea follows this list
- ReFT: Paper ReFT: Representation Finetuning for Language Models by Z. Wu et al. presented at NeurIPS in 2024 proposing an alternative fine-tuning algorithm
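To illustrate the core idea of the LoRA paper above (freeze the pretrained weight W and learn a low-rank update BA), here is a minimal PyTorch sketch; the class and variable names are mine and not the paper's reference implementation.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen pretrained linear layer plus a trainable low-rank update B @ A."""

    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)  # freeze pretrained weights
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)
        in_f, out_f = base.in_features, base.out_features
        self.A = nn.Parameter(torch.randn(rank, in_f) * 0.01)  # down-projection (Gaussian init)
        self.B = nn.Parameter(torch.zeros(out_f, rank))        # up-projection (zero init)
        self.scaling = alpha / rank

    def forward(self, x):
        # Original output plus the scaled low-rank correction
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scaling

# Toy usage: wrap a "pretrained" projection and confirm only A and B are trainable
layer = LoRALinear(nn.Linear(64, 64))
print([name for name, p in layer.named_parameters() if p.requires_grad])  # ['A', 'B']
```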
Prompt Design
- Zero-Shot: Paper Zero-shot Learning — A Comprehensive Evaluation of the Good, the Bad and the Ugly by Yongqin Xian et al. published in IEEE Trans. PAMI in 2018 presenting an overview of zero-shot learning
- Chain-of-Thought: Paper Chain-of-Thought Prompting Elicits Reasoning in Large Language Models by Jason Wei et al. presented at NeurIPS in 2022 introducing chain-of-thought prompting
- Prefix-Tuning: Paper Prefix-Tuning: Optimizing Continuous Prompts for Generation by Xiang Lisa Li et al. presented at ACL in 2021 proposing the prefix-tuning approach for prompting
- Prompt-Tuning: Paper The Power of Scale for Parameter-Efficient Prompt Tuning by B. Lester et al. presented at EMNLP in 2021 proposing the prompt tuning idea, i.e., learning to prompt
- Zero-Shot LLMs: Paper Large Language Models are Zero-Shot Reasoners by T. Kojima et al. presented at NeurIPS in 2022 studying zero-shot reasoning with LLMs; a toy prompt example follows this list
- Prompt Engineering is Dead: Article AI Prompt Engineering Is Dead: Long Live AI Prompt Engineering by Dina Genkina published in IEEE Spectrum in 2024
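As a concrete illustration of the prompting entries above, the toy sketch below contrasts a plain zero-shot prompt with the zero-shot chain-of-thought trigger phrase from the Kojima et al. paper; the question is only an example and query_llm is a placeholder, not a real API.

```python
question = ("A cafeteria had 23 apples. It used 20 for lunch and bought 6 more. "
            "How many apples does it have now?")

# Plain zero-shot prompt: ask for the answer directly.
zero_shot = f"Q: {question}\nA:"

# Zero-shot chain-of-thought (Kojima et al., 2022): append a reasoning trigger
# so the model spells out intermediate steps before the final answer.
zero_shot_cot = f"Q: {question}\nA: Let's think step by step."

def query_llm(prompt: str) -> str:
    """Placeholder for a call to whichever LLM API you use in the assignments."""
    raise NotImplementedError

print(zero_shot)
print(zero_shot_cot)
```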
Foundation Models
- CRFM: The Center for Research on Foundation Models (Stanford), which coined the term Foundation Model
Chapter 2: Data Generation Problem
Basic Definitions
- Probabilistic Model: Chapter 2 of [BB] Sections 2.4 to 2.6
- Statistics: Chapter 3 of [M] Sections 3.1 to 3.3
- Bayesian Statistics: Chapter 5 of [GYC] Section 5.6; a short summary of the key formulas follows this list
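Since the three pointers above revolve around the same handful of definitions, here is a compact summary in standard notation (not tied to any one of the books) of the likelihood, the maximum-likelihood estimate, and the Bayesian posterior:

```latex
% Likelihood of parameters \theta given i.i.d. data \mathcal{D} = \{x_1, \dots, x_N\}
p(\mathcal{D} \mid \theta) = \prod_{n=1}^{N} p(x_n \mid \theta)

% Maximum-likelihood estimate
\hat{\theta}_{\mathrm{ML}} = \arg\max_{\theta} \, \log p(\mathcal{D} \mid \theta)

% Bayesian posterior via Bayes' theorem, with prior p(\theta)
p(\theta \mid \mathcal{D}) = \frac{p(\mathcal{D} \mid \theta)\, p(\theta)}
                                  {\int p(\mathcal{D} \mid \theta')\, p(\theta')\, \mathrm{d}\theta'}

% Maximum a posteriori (MAP) estimate adds the log-prior to the objective
\hat{\theta}_{\mathrm{MAP}} = \arg\max_{\theta} \, \big[ \log p(\mathcal{D} \mid \theta) + \log p(\theta) \big]
```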
Generative and Discriminative Learning
- Discriminative and Generative Models: Chapter 5 of [BB]
Generative Models
- Naive Bayes: Paper Idiot’s Bayes—Not So Stupid After All? by D. Hand and K. Yu published in International Statistical Review in 2001 discussing the effectiveness of Naive Bayes for classification
- Naive Bayes vs Logistic Regression: Paper On Discriminative vs. Generative Classifiers: A Comparison of Logistic Regression and Naive Bayes by A. Ng and M. Jordan presented at NeurIPS in 2001 elaborating on the data efficiency of Naive Bayes and the asymptotic superiority of Logistic Regression; a small experiment sketch follows this list
- Generative Models – Overview: Chapter 20 of [M] Sections 20.1 to 20.3
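To see the Ng and Jordan comparison above in practice, here is a small experiment sketch that trains both classifiers on increasing amounts of synthetic Gaussian data; the dataset sizes and parameters are arbitrary choices for illustration, not the paper's setup.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def make_data(n):
    """Two Gaussian classes in 10 dimensions with slightly shifted means."""
    X0 = rng.normal(0.0, 1.0, size=(n // 2, 10))
    X1 = rng.normal(0.5, 1.0, size=(n // 2, 10))
    X = np.vstack([X0, X1])
    y = np.array([0] * (n // 2) + [1] * (n // 2))
    return X, y

X_test, y_test = make_data(4000)

# Generative (Naive Bayes) tends to reach its asymptote with little data;
# discriminative (Logistic Regression) tends to win as the training set grows.
for n_train in (20, 100, 1000):
    X_train, y_train = make_data(n_train)
    nb_acc = GaussianNB().fit(X_train, y_train).score(X_test, y_test)
    lr_acc = LogisticRegression(max_iter=1000).fit(X_train, y_train).score(X_test, y_test)
    print(f"n={n_train:5d}  NaiveBayes={nb_acc:.3f}  LogisticRegression={lr_acc:.3f}")
```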