Materials
Lecture Notes
The lecture notes are uploaded throughout the semester. For each chapter, the notes are provided section by section.
Chapter 0: Course Overview and Logistics
- Handouts: All Sections included in a single file
Chapter 1: RL Framework
- Section 1: Multi-armed Bandit
- Section 2: RL Problem: Definitions and Basics
- Section 3: Playing in RL Framework
- Section 4: Our First Game
Chapter 2: Model-based RL
- Section 1: Markov Decision Process and Bellman Equation
- Section 2: Bellman Optimality
- Section 3: Policy and Value Iteration
Chapter 3: Model-free Tabular RL
- Section 1: Evaluation via Monte-Carlo
- Section 2: Temporal Difference
- Section 3: Deep Bootstrapping and Credit Assignment
- Section 4: Online Control via Monte-Carlo
- Section 5: Online Control via TD - SARSA and Q-Learning
Chapter 4: RL with Function Approximation
- Section 1: Tabular RL vs RL with Function Approximation
- Section 2: Prediction with Function Approximation
- Section 3: Control with Function Approximation
- Section 4: Deep Q-Learning
Chapter 5: Policy Gradient Methods (PGMs)
- Section 1: Policy Network
- Section 2: Training Policy Net and PGM
- Section 3: PGM with Trust Region
- Section 4: Basic Forms of TRPO and PPO
Chapter 6: Actor-Critic Approaches
- Section 1: Basics of AC - A2C Algorithm
- Section 2: AC Version of TRPO and PPO
- Section 3: Deterministic Policy Gradient and DDPG
Chapter 7: Wrap-up
Tutorial Notebooks and Videos
The tutorial notebooks and recorded videos can be accessed below.
- Tutorial 1: Overview of RL Environments and Introduction to Gymnasium, by Eason Qu (Video); a minimal interaction loop is sketched after this list
- Tutorial 2: Policy and Value Iteration, by Eason Qu (Video)
- Tutorial 3: Monte Carlo Learning, by Eason Qu (Video)
- Tutorial 4: TD Learning, by Eason Qu (Video)
- Tutorial 5: Midterm Review, by Eason Qu (Video)
- Tutorial 6: Tabular vs Deep RL, by Eason Qu (Video)
- Tutorial 7: PGM, by Eason Qu (Video)
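As a quick taste of what Tutorial 1 covers, below is a minimal Gymnasium interaction loop. This is only an illustrative sketch: the environment name (CartPole-v1) and the uniformly random action choice are assumptions made for the example, not necessarily what the tutorial notebook uses.

# Minimal agent-environment interaction loop with Gymnasium (illustrative sketch)
import gymnasium as gym

env = gym.make("CartPole-v1")           # any registered environment works; CartPole is a common starter
observation, info = env.reset(seed=0)   # start the first episode

for _ in range(1000):
    action = env.action_space.sample()  # a random policy, just to show the interface
    observation, reward, terminated, truncated, info = env.step(action)
    if terminated or truncated:         # episode ended: start a new one
        observation, info = env.reset()

env.close()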
Book
Most of the content covered in the first two parts of the course can be studied further in
- [SB] Sutton, R., and Barto, A. Reinforcement Learning: An Introduction, 2nd Edition, MIT Press, 2018
The following old preprint is also a good summary of important RL algorithms.
Most materials in the third part, i.e., deep RL, are collected from research papers. The following textbook is also a good resource for practicing hands-on skills.
Reading List
This section will be completed gradually throughout the semester.
Chapter 1: RL Framework
Introduction
- Intro to RL: Chapter 1 - Sections 1.1 and 1.2 of [SB]
Multi-armed Bandit
- k-armed Bandit: Chapter 2 - Section 2.1 of [SB]
- Robbins’ Paper: "Some aspects of the sequential design of experiments" by H. Robbins, published in the Bulletin of the American Mathematical Society in 1952, formulating the multi-armed bandit problem as we know it today
RL Problem Formulation
- RL Setting: Chapter 1 - Sections 1.3 and 1.5 of [SB]
Terminal State and Episode
- RL Setting: Chapter 3 - Section 3.4 of [SB]
Chapter 2: Model-based RL
MDPs
- Intro to RL: Chapter 3 - Sections 3.1 and 3.5 of [SB]
Bellman Equation and Optimal Policy
- Bellman and Optimality: Chapter 3 - Sections 3.6 and 3.7 of [SB]
Policy Iteration
- Policy Evaluation: Chapter 4 - Section 4.1 of [SB]
- Policy Improvement and Iteration: Chapter 4 - Sections 4.2 and 4.3 of [SB]
Value Iteration
- Value Iteration: Chapter 4 - Sections 4.4, 4.6 and 4.7 of [SB]
Chapter 3: Model-free Tabular RL
Monte Carlo Approach
- Monte-Carlo: Chapter 5 - Sections 5.1, 5.2 and 5.3 of [SB]
Temporal Difference
Deep Temporal Difference
Credit Assignment
Eligibility Trace
- Eligibility Tracing: Chapter 12 - Sections 12.4 and 12.5 of [SB]
- RL with Eligibility: "Reinforcement Learning with Replacing Eligibility Traces" published in 1996 by S. Singh and R. Sutton, proposing the idea of including eligibility traces in RL
Monte-Carlo Control
- MC Control: Chapter 5 - Sections 5.3 and 5.4 of [SB]
ε-Greedy Improvement
Temporal-Difference Control
- TD Control: Chapter 6 - Section 6.2 of [SB]
Sarsa Algorithm
- Online Q-Learning: "On-Line Q-Learning Using Connectionist Systems" published in 1994 by G. Rummery and M. Niranjan, proposing SARSA as an online version of Q-Learning
- Sarsa: Chapter 6 - Section 6.4 of [SB]
- Sarsa: Chapter 10 - Sections 10.2 and 10.5 of [SB]
- Sarsa: Chapter 12 - Section 12.7 of [SB]
Importance Sampling and Off-policy Learning
- Importance Sampling: Chapter 5 - Section 5.5 of [SB]
- Off Policy Learning: Chapter 12 - Sections 12.9 and 12.11 of [SB]
Q-Learning
- Q-Learning Paper: "Q-learning" published in 1992 by C. Watkins and P. Dayan, proposing the off-policy learning used in the Q-learning algorithm
- Q-Learning: Chapter 6 - Section 6.5 of [SB]
- Q-Learning: Chapter 12 - Section 12.10 of [SB]
Q-Learning vs SARSA
- Convergence Paper: "The O.D.E. Method for Convergence of Stochastic Approximation and Reinforcement Learning" published in 2000 by V. Borkar and S. Meyn, studying convergence of Q-Learning and SARSA
Chapter 4: RL with Function Approximation
Function Approximation for Value Estimation
- Function Approximation for RL: Chapter 9 of [SB]
- Neuro-dynamic Programming: "Neuro-dynamic programming: an overview" published in 1995 by D. Bertsekas and J. Tsitsiklis, discussing function approximation for value learning
Prediction via Function Approximation
- Prediction with FA: Chapter 3 of [CS]
- TD with FA: "Analysis of Temporal-Difference Learning with Function Approximation" published in 1996 by J. Tsitsiklis and B. Van Roy, analyzing prediction with parameterized models
- FA vs Tabular: "Analyzing feature generation for value-function approximation" published in 2008 by R. Parr et al., discussing connections of RL with FA to tabular RL
- RL with FA: "Residual Algorithms: Reinforcement Learning with Function Approximation" published in 1995 by L. Baird, offering some criticism of RL with FA
Control via Function Approximation
- Control with FA: Chapters 10 and 11 of [SB]
- Control with FA: Chapter 4 of [CS]
Deep Q-Learning
- Off-Policy with FA: "A Convergent O(n) Algorithm for Off-policy Temporal-difference Learning with Linear Function Approximation" published in 2008 by R. Sutton et al., analyzing off-policy learning via linear models
- DQL: Chapter 4 - Section 4.3 of [CS]
- Deep Q-Learning: "Human-level control through deep reinforcement learning" published in 2015 by V. Mnih et al., proposing the legendary idea of Deep Q-Learning
- DQL Paper I: "Playing Atari with Deep Reinforcement Learning" published in 2013 by V. Mnih et al., describing DQL details
- DQL Paper II: "Deep Reinforcement Learning with Double Q-learning" published in 2015 by H. van Hasselt et al., proposing Double DQL
- DQL Paper III: "Dueling Network Architectures for Deep Reinforcement Learning" published in 2016 by Z. Wang et al., proposing Dueling DQL
- DQL Paper IV: "Prioritized Experience Replay" published in 2016 by T. Schaul et al., proposing a prioritized experience replay scheme
- DQL Paper V: "Deep Recurrent Q-Learning for Partially Observable MDPs" published in 2017 by M. Hausknecht and P. Stone, extending DQL to more realistic cases
- Gorila: "Massively Parallel Methods for Deep Reinforcement Learning" published in 2015 by A. Nair et al., proposing Gorila
- Why Policy Net: Article "Deep Deterministic Policy Gradient" at OpenAI Spinning Up
Chapter 5: Policy Gradient Method
Basic PGM
- REINFORCE: "Simple statistical gradient-following algorithms for connectionist reinforcement learning" published by R. Williams in 1992, introducing the REINFORCE algorithm
Policy Gradient Theorem
- PGM Theorem: "Policy Gradient Methods for Reinforcement Learning with Function Approximation" published by R. Sutton et al. in 1999, developing the Policy Gradient Theorem
Vanilla and Baseline PGM
- Baseline: "Policy invariance under reward transformations: Theory and application to reward shaping" published by A. Ng et al. in 1999
Trust Region PGM
- Nat PGM: "A Natural Policy Gradient" published by S. Kakade in 2001, proposing a basic natural PGM
- TRPO: "Trust Region Policy Optimization" published by J. Schulman et al. in 2015, proposing TRPO
- PPO: "Proximal Policy Optimization Algorithms" published by J. Schulman et al. in 2017, proposing PPO
