Materials
Lecture Notes
The lecture notes are uploaded throughout the semester. For each chapter, the notes are provided section by section.
Chapter 0: Course Overview and Logistics
- Handouts: All Sections included in a single file
Chapter 1: RL Framework
- Section 1: Multi-armed Bandit
- Section 2: RL Problem: Definitions and Basics
- Section 3: Playing in RL Framework
- Section 4: Our First Game
Chapter 2: Model-based RL
- Section 1: Markov Decision Process and Bellman Equation
- Section 2: Bellman Optimality
- Section 3: Policy and Value Iteration
Chapter 3: Model-free Tabular RL
- Section 1: Evaluation via Monte-Carlo
- Section 2: Temporal Difference
- Section 3: Deep Bootstrapping and Credit Assignment
- Section 4: Online Control via Monte-Carlo
- Section 5: Online Control via TD - SARSA and Q-Learning
Chapter 4: RL with Function Approximation
- Section 1: Tabular RL vs RL with Function Approximation
- Section 2: Prediction with Function Approximation
- Section 3: Control with Function Approximation
- Section 4: Deep Q-Learning
Chapter 5: Policy Gradient Methods (PGMs)
- Section 1: Policy Network
- Section 2: Training Policy Net and PGM
- Section 3: PGM with Trust Region
- Section 4: Basic Form of TRPO and PPO
Chapter 6: Actor-Critic Approaches
- Section 1: Basics of AC - A2C Algorithm
- Section 2: AC Version of TRPO and PPO
- Section 3: Deterministic Policy Gradient and DDPG
Chapter 7: Wrap-up
Tutorial Notebooks and Videos
The tutorial notebooks can be accessed below.
- Tutorial 1: Overview of the RL Environment and Introduction to Gymnasium, by Eason Qu (Video)
- Tutorial 2: Policy and Value Iteration, by Eason Qu (Video)
- Tutorial 3: Monte Carlo Learning, by Eason Qu (Video)
- Tutorial 4: TD Learning, by Eason Qu (Video)
- Tutorial 5: Midterm Review, by Eason Qu (Video)
- Tutorial 6: Tabular vs Deep RL, by Eason Qu (Video)
- Tutorial 7: PGM, by Eason Qu (Video)
- Tutorial 8: TRPO and PPO, by Eason Qu (Video)
- Tutorial 9: AC Version of PPO, by Eason Qu (Video)
Book
Most of the content covered in the first two parts of the course can be explored further in
- [SB] Sutton, R., and Barto, A. Reinforcement Learning: An Introduction, 2nd Edition, MIT Press, 2018
The following old preprint is also a good summary of important RL algorithms.
Most materials in the third part, i.e., deep RL, are collected from research papers. The following textbook is also a good resource for practicing hands-on skills.
Reading List
This section will be completed gradually throughout the semester.
Chapter 1: RL Framework
Introduction
- Intro to RL: Chapter 1 - Sections 1.1 and 1.2 of [SB]
Multi-armed Bandit
- k-armed Bandit: Chapter 2 - Section 2.1 of [SB]
- Robbins’ Paper: Paper Some aspects of the sequential design of experiments by H. Robbins published in the Bulletin of the American Mathematical Society in 1952 formulating the multi-armed bandit problem as we know it today
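As a hands-on companion to the bandit readings above, here is a minimal ε-greedy k-armed bandit sketch using the incremental sample-average update from the bandit chapter of [SB]; the Gaussian reward model and all names are illustrative assumptions rather than course code.

```python
import numpy as np

def run_bandit(true_means, steps=1000, eps=0.1, seed=0):
    # epsilon-greedy action selection with incremental sample-average estimates
    rng = np.random.default_rng(seed)
    k = len(true_means)
    Q = np.zeros(k)   # action-value estimates
    N = np.zeros(k)   # pull counts
    for _ in range(steps):
        # explore with probability eps, otherwise act greedily
        a = rng.integers(k) if rng.random() < eps else int(np.argmax(Q))
        r = rng.normal(true_means[a], 1.0)   # noisy reward from the chosen arm
        N[a] += 1
        Q[a] += (r - Q[a]) / N[a]            # incremental sample-average update
    return Q
```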
RL Problem Formulation
- RL Setting: Chapter 1 - Sections 1.3 and 1.5 [SB]
Terminal State and Episode
- RL Setting: Chapter 3 - Section 3.4 of [SB]
Chapter 2: Model-based RL
MDPs
- Intro to RL: Chapter 3 - Sections 3.1 and 3.5 of [SB]
Bellman Equation and Optimal Policy
- Bellman and Optimality: Chapter 3 - Sections 3.6 and 3.7 of [SB]
Policy Iteration
- Policy Evaluation: Chapter 4 - Section 4.1 of [SB]
- Policy Improvement and Iteration: Chapter 4 - Sections 4.2 and 4.3 of [SB]
Value Iteration
- Value Iteration: Chapter 4 - Sections 4.4, 4.6 and 4.7 of [SB]
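The policy and value iteration readings above reduce to a short dynamic-programming loop once the model is known. Below is a minimal value-iteration sketch, assuming the transition model P[s][a] is a list of (prob, next_state, reward, done) tuples as exposed by Gymnasium's toy-text environments (e.g. FrozenLake's env.unwrapped.P); names and defaults are illustrative.

```python
import numpy as np

def value_iteration(P, n_states, n_actions, gamma=0.99, tol=1e-8):
    V = np.zeros(n_states)
    while True:
        Q = np.zeros((n_states, n_actions))
        for s in range(n_states):
            for a in range(n_actions):
                # one-step Bellman backup over the known transition model
                for prob, s_next, r, done in P[s][a]:
                    Q[s, a] += prob * (r + gamma * V[s_next] * (not done))
        V_new = Q.max(axis=1)                  # Bellman optimality update
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)     # values and a greedy policy
        V = V_new
```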
Chapter 3: Model-free Tabular RL
Monte Carlo Approach
- Monte-Carlo: Chapter 5 - Sections 5.1, 5.2 and 5.3 of [SB]
Temporal Difference
Deep Temporal Difference
Credit Assignment
Eligibility Trace
- Eligibility Tracing: Chapter 12 - Sections 12.4 and 12.5 of [SB]
- RL with Eligibility Paper Reinforcement Learning with Replacing Eligibility Traces published in 1996 by S. Singh and R. Sutton proposing the idea of including eligibility traces in RL
Monte-Carlo Control
- MC Control: Chapter 5 - Sections 5.3 and 5.4 of [SB]
ε-Greedy Improvement
Temporal-Difference Control
- TD Control: Chapter 6 - Section 6.2 of [SB]
Sarsa Algorithm
- Online Q-Learning Article On-Line Q-Learning Using Connectionist Systems published in 1994 by G. Rummery and M. Niranjan proposing SARSA as an online version of Q-Learning
- Sarsa: Chapter 6 - Section 6.4 of [SB]
- Sarsa: Chapter 10 - Sections 10.2 and 10.5 of [SB]
- Sarsa: Chapter 12 - Section 12.7 of [SB]
Importance Sampling and Off-policy Learning
- Importance Sampling: Chapter 5 - Section 5.5 of [SB]
- Off Policy Learning: Chapter 12 - Sections 12.9 and 12.11 of [SB]
Q-Learning
- Q-Learning Paper Paper Q-learning published in 1992 by C. Watkins and P. Dayan proposing the off-policy learning as in the Q-learning algorithm
- Q-Learning: Chapter 6 - Section 6.5 of [SB]
- Q-Learning: Chapter 12 - Section 12.10 of [SB]
Q-Learning vs SARSA
- Convergence Paper The O.D.E. Method for Convergence of Stochastic Approximation and Reinforcement Learning published in 2000 by V. Borkar and S. Meyn studying convergence of Q-Learning and SARSA
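The Q-Learning vs SARSA contrast in the readings above comes down to a single term in the update target. Here is a minimal tabular sketch of the two updates, assuming Q is a NumPy array of shape (n_states, n_actions); names and step sizes are illustrative.

```python
import numpy as np

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99, done=False):
    # On-policy: bootstrap with the action the behaviour policy actually takes next.
    target = r + gamma * Q[s_next, a_next] * (not done)
    Q[s, a] += alpha * (target - Q[s, a])

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99, done=False):
    # Off-policy: bootstrap with the greedy action, regardless of what is taken next.
    target = r + gamma * Q[s_next].max() * (not done)
    Q[s, a] += alpha * (target - Q[s, a])
```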
Chapter 4: RL with Function Approximation
Function Approximation for Value Estimation
- Function Approximation for RL: Chapter 9 of [SB]
- Neuro-dynamic Programming Paper Neuro-dynamic programming: an overview published in 1995 by D. Bertsekas and J. Tsitsiklis discussing function approximation for value learning
Prediction via Function Approximation
- Prediction with FA: Chapter 3 of [CS]
- TD with FA Paper Analysis of Temporal-Difference Learning with Function Approximation published in 1996 by J. Tsitsiklis and B. Van Roy analyzing prediction with parameterized models
- FA vs Tabular Paper Analyzing feature generation for value-function approximation published in 2008 by R. Parr et al. discussing connections of RL with FA to tabular RL
- RL with FA Paper Residual Algorithms: Reinforcement Learning with Function Approximation published in 1995 by L. Baird giving some critiques of RL with FA
Control via Function Approximation
- Control with FA: Chapters 10 and 11 of [SB]
- Control with FA: Chapter 4 of [CS]
Deep Q-Learning
- Off-Policy with FA Paper A Convergent O(n) Algorithm for Off-policy Temporal-difference Learning with Linear Function Approximation published in 2008 by R. Sutton et al. analyzing off-policy learning via linear models
- DQL: Chapter 4 - Section 4.3 of [CS]
- Deep Q-Learning Paper Human-level control through deep reinforcement learning published in 2015 by V. Mnih et al. proposing the legendary idea of Deep Q-Learning
- DQL Paper I Paper Playing Atari with Deep Reinforcement Learning published in 2013 by V. Mnih et al. describing DQL details
- DQL Paper II Paper Deep Reinforcement Learning with Double Q-learning published in 2015 by H. van Hasselt et al. proposing Double DQL
- DQL Paper III Paper Dueling Network Architectures for Deep Reinforcement Learning published in 2016 by Z. Wang et al. proposing Dueling DQL
- DQL Paper IV Paper Prioritized Experience Replay published in 2016 by T. Schaul et al. proposing a prioritizing experience replay scheme
- DQL Paper V Paper Deep Recurrent Q-Learning for Partially Observable MDPs published in 2017 by M. Hausknecht and P. Stone extending DQL to more realistic cases
- Gorila Paper Massively Parallel Methods for Deep Reinforcement Learning published in 2015 by A. Nair et al. proposing Gorila
- Why Policy Net Article Deep Deterministic Policy Gradient at OpenAI Spinning Up
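To make the DQL and Double DQL papers listed above concrete, here is a minimal PyTorch sketch of the bootstrapped regression targets; q_net and target_net are assumed to be modules mapping a batch of states to per-action values, and all tensor names are illustrative.

```python
import torch

def dqn_targets(target_net, rewards, next_states, dones, gamma=0.99):
    # Standard DQN: the target network both selects and evaluates the next action.
    with torch.no_grad():
        next_q = target_net(next_states).max(dim=1).values
        return rewards + gamma * next_q * (1.0 - dones)

def double_dqn_targets(q_net, target_net, rewards, next_states, dones, gamma=0.99):
    # Double DQN: the online network selects the action, the target network evaluates it.
    with torch.no_grad():
        best_a = q_net(next_states).argmax(dim=1, keepdim=True)
        next_q = target_net(next_states).gather(1, best_a).squeeze(1)
        return rewards + gamma * next_q * (1.0 - dones)
```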
Chapter 5: Policy Gradient Method
Basic PGM
- REINFORCE Paper Simple statistical gradient-following algorithms for connectionist reinforcement learning published by R. Williams in 1992 introducing the REINFORCE algorithm
Policy Gradient Theorem
- PGM Theorem Paper Policy Gradient Methods for Reinforcement Learning with Function Approximation published by R. Sutton et al. in 1999 developing the Policy Gradient Theorem
Vanilla and Baseline PGM
- Baseline Paper Policy invariance under reward transformations: Theory and application to reward shaping published by A. Ng et al. in 1999
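Putting the policy-gradient theorem and the baseline reading above together, here is a minimal REINFORCE-with-baseline loss sketch in PyTorch; the policy is assumed to return a torch.distributions.Categorical over actions, and the constant mean-return baseline is just one illustrative choice.

```python
import torch

def reinforce_loss(policy, states, actions, returns):
    dist = policy(states)                       # e.g. Categorical(logits=...)
    log_probs = dist.log_prob(actions)
    baseline = returns.mean()                   # crude constant baseline
    advantages = (returns - baseline).detach()  # variance reduction; gradient stays unbiased
    return -(log_probs * advantages).mean()     # minimize the negative surrogate
```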
Trust Region PGM
- Nat PGM Paper A Natural Policy Gradient published by S. Kakade in 2001 proposing a basic natural PGM
- TRPO Paper Trust Region Policy Optimization published by J. Schulman et al. in 2015 proposing TRPO
- PPO Paper Proximal Policy Optimization Algorithms published by J. Schulman et al. in 2017 proposing PPO
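The TRPO and PPO papers above replace the vanilla surrogate with a trust-region-style objective; below is a minimal PyTorch sketch of PPO's clipped surrogate loss, with tensor names and the clipping coefficient as illustrative assumptions.

```python
import torch

def ppo_clip_loss(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    ratio = torch.exp(new_log_probs - old_log_probs)                       # pi_new / pi_old
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()                           # maximize the clipped surrogate
```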
Chapter 6: Actor-Critic Approaches
Basics of AC
- AC Paper Neuronlike adaptive elements that can solve difficult learning control problems published by A. Barto, R. Sutton, and C. Anderson in 1983 coining the actor-critic (AC) terminology
- AC Paper Actor-Critic Algorithms published by V. Konda and J. Tsitsiklis in 2000 formulating AC more generally
Back to TRPO and PPO
- TRPO Paper Trust Region Policy Optimization published by J. Schulman et al. in 2015 proposing TRPO
- PPO Paper Proximal Policy Optimization Algorithms published by J. Schulman et al. in 2017 proposing PPO
Deterministic Policy Gradient
- DPGs Paper Deterministic Policy Gradient Algorithms published by D. Silver et al. in 2014 developing DPGs
- DDPG Paper Continuous Control with Deep Reinforcement Learning published by T. Lillicrap et al. in 2015 proposing Deep DPG
- TD3 Paper Addressing Function Approximation Error in Actor-Critic Methods published by S. Fujimoto et al. in 2018 proposing TD3
Soft Actor-Critic
- Soft QL Paper Reinforcement Learning with Deep Energy-Based Policies published by T. Haarnoja et al. in 2017 developing Soft Q-Learning with entropy-based regularization of the reward
- SAC Paper Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor published by T. Haarnoja et al. in 2018 proposing SAC
- SAC II Paper Soft Actor-Critic Algorithms and Applications published by T. Haarnoja et al. in 2019 giving more context to SAC
Chapter 7: History and Advances
Imitation Learning and Behavior Cloning
- IL - DAgger Paper A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning published by S. Ross et al. in 2011 proposing DAgger for imitation learning
- BCO Paper Behavioral Cloning from Observation (BCO) published by F. Torabi et al. in 2018 proposing BCO
Inverse RL
- Inverse RL Paper Algorithms for Inverse Reinforcement Learning published by A. Ng and S. Russell in 2000 discussing Inverse RL
- Max Entropy Inverse RL Paper Maximum Entropy Inverse Reinforcement Learning published by B. Ziebart et al. in 2008 proposing the Max Entropy approach for Inverse RL
Reinforcement Learning with Human Feedback (RLHF)
- RLHF Paper Deep Reinforcement Learning from Human Preferences published by P. Christiano et al. in 2017 discussing RLHF
- Fine-Tuning via RLHF Paper Fine-Tuning Language Models from Human Preferences published by D. Ziegler et al. in 2019 discussing fine-tuning via RLHF
- RLHF for LLM Paper Training language models to follow instructions with human feedback (InstructGPT) published by L. Ouyang et al. in 2022 discussing RLHF for LLMs
Multi-agent RL
- MA-RL Paper Multi-Agent Actor-Critic for Mixed Cooperative-Competitive Environments published by R. Lowe et al. in 2017 discussing Multi-agent RL
- Deep MA-RL Paper Learning to Communicate with Deep Multi-Agent Reinforcement Learning published by J. Foerster et al. in 2016 discussing Multi-agent RL
- Adversarial RL Paper Robust Adversarial Reinforcement Learning published by L. Pinto et al. in 2017 discussing Adversarial RL
