Materials
Lecture Notes
The lecture notes are uploaded throughout the semester. For each chapter, the notes are posted section by section.
Chapter 0: Course Overview and Logistics
- Handouts: all sections in a single file
Chapter 1: RL Framework
- Section 1: Multi-armed Bandit
- Section 2: RL Problem: Definitions and Basics
- Section 3: Playing in RL Framework
- Section 4: Our First Game
Chapter 2: Model-based RL
- Section 1: Markov Decision Process and Bellman Equation
- Section 2: Bellman Optimality
- Section 3: Policy and Value Iteration
Chapter 3: Model-free Tabular RL
- Section 1: Evaluation via Monte-Carlo
- Section 2: Temporal Difference
- Section 3: Deep Bootstrapping and Credit Assignment
- Section 4: Online Control via Monte-Carlo
- Section 5: Online Control via TD - SARSA and Q-Learning
Tutorial Notebooks and Videos
The tutorial notebooks, together with links to the recorded videos, can be accessed below.
- Tutorial 1: Overview of the RL Environment and Introduction to Gymnasium, by Eason Qu (Video); a minimal Gymnasium interaction sketch follows this list
- Tutorial 2: Policy and Value Iteration, by Eason Qu (Video)
- Tutorial 3: Monte Carlo Learning, by Eason Qu (Video)
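For quick reference, here is a minimal sketch of the Gymnasium interaction loop that Tutorial 1 walks through. The environment name ("CartPole-v1") and the random action choice are illustrative assumptions; the tutorial notebook remains the authoritative version.

```python
# Minimal Gymnasium interaction loop (illustrative sketch; see the Tutorial 1 notebook for details).
import gymnasium as gym

env = gym.make("CartPole-v1")      # any registered environment works here
obs, info = env.reset(seed=0)      # reset returns the initial observation and an info dict

episode_return = 0.0
done = False
while not done:
    action = env.action_space.sample()               # random policy as a placeholder
    obs, reward, terminated, truncated, info = env.step(action)
    episode_return += reward
    done = terminated or truncated                   # episode ends on either condition

print(f"Return of one random episode: {episode_return}")
env.close()
```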
Book
Most of the content covered in the first two parts of the course can be studied further in
- [SB] Sutton, R., and Barto, A. Reinforcement Learning: An Introduction, 2nd Edition, MIT Press, 2018
The following old preprint is also a good summary of important RL algorithms.
Most of the material in the third part, i.e., deep RL, is collected from research papers. The following textbook is also a good resource for practicing hands-on skills.
Reading List
This section will be completed gradually throughout the semester.
Chapter 1: RL Framework
Introduction
- Intro to RL: Chapter 1 - Sections 1.1 and 1.2 of [SB]
Multi-armed Bandit
- k-armed Bandit: Chapter 2 - Section 2.1 of [SB]
- Robbins’ Paper: "Some aspects of the sequential design of experiments" by H. Robbins, published in the Bulletin of the American Mathematical Society in 1952; this paper formulates the multi-armed bandit problem as we know it today
RL Problem Formulation
- RL Setting: Chapter 1 - Sections 1.3 and 1.5 of [SB]
Terminal State and Episode
- RL Setting: Chapter 3 - Section 3.4 of [SB]
Chapter 2: Model-based RL
MDPs
- Intro to RL: Chapter 3 - Sections 3.1 and 3.5 of [SB]
Bellman Equation and Optimal Policy
- Bellman and Optimality: Chapter 3 - Sections 3.6 and 3.7 of [SB]
Policy Iteration
- Policy Evaluation: Chapter 4 - Section 4.1 of [SB]
- Policy Improvement and Iteration: Chapter 4 - Sections 4.2 and 4.3 of [SB]
Value Iteration
- Value Iteration: Chapter 4 - Sections 4.4, 4.6 and 4.7 of [SB] (a short value-iteration sketch is given below)
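To accompany the Chapter 2 readings, here is a minimal tabular value-iteration sketch. The MDP encoding (a nested dictionary mapping each state and action to (probability, next state, reward) triples) is an assumption made for illustration, not the notation of the notes or of [SB].

```python
# Tabular value iteration (illustrative sketch).
# P[s][a] is assumed to be a list of (prob, next_state, reward) triples;
# terminal states are assumed to carry a zero-reward self-loop.

def value_iteration(P, gamma=0.9, tol=1e-8):
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            # One Bellman optimality backup: V(s) <- max_a sum_{s'} p * (r + gamma * V(s'))
            best = max(
                sum(p * (r + gamma * V[s_next]) for p, s_next, r in P[s][a])
                for a in P[s]
            )
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:          # stop once the backup barely changes any value
            break
    # Extract a greedy policy from the converged value function
    policy = {}
    for s in P:
        policy[s] = max(
            P[s],
            key=lambda a: sum(p * (r + gamma * V[s_next]) for p, s_next, r in P[s][a]),
        )
    return V, policy
```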
Chapter 3: Model-free Tabular RL
Monte Carlo Approach
- Monte-Carlo: Chapter 5 - Sections 5.1, 5.2 and 5.3 of [SB] (a first-visit Monte Carlo sketch is given below)
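To accompany Sections 5.1 and 5.2, here is a minimal first-visit Monte Carlo prediction sketch. The episode format, a list of (state, reward) pairs where the reward is the one received on leaving that state, is an assumption made for illustration.

```python
# First-visit Monte Carlo policy evaluation (illustrative sketch).
# An episode is assumed to be a list of (state, reward) pairs, in time order.
from collections import defaultdict

def mc_prediction(episodes, gamma=0.9):
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    V = defaultdict(float)
    for episode in episodes:
        # Record the index of the first visit to each state
        first_visit = {}
        for t, (s, _) in enumerate(episode):
            if s not in first_visit:
                first_visit[s] = t
        # Walk backwards accumulating the return G_t = r_{t+1} + gamma * G_{t+1}
        G = 0.0
        for t in reversed(range(len(episode))):
            s, r = episode[t]
            G = r + gamma * G
            if first_visit[s] == t:        # only the first visit contributes
                returns_sum[s] += G
                returns_count[s] += 1
                V[s] = returns_sum[s] / returns_count[s]
    return dict(V)
```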
Temporal Difference
Deep Temporal Difference
Credit Assignment
Eligibility Traces
- Eligibility Traces: Chapter 12 - Sections 12.4 and 12.5 of [SB]
Monte-Carlo Control
- MC Control: Chapter 5 - Sections 5.3 and 5.4 of [SB]
ε-Greedy Improvement
Temporal-Difference Control
- TD Control: Chapter 6 - Section 6.2 of [SB]
Sarsa Algorithm
- Sarsa: Chapter 6 - Section 6.4 of [SB]
- Sarsa: Chapter 10 - Sections 10.2 and 10.5 of [SB]
- Sarsa: Chapter 12 - Section 12.7 of [SB]
Importance Sampling and Off-policy Learning
- Importance Sampling: Chapter 5 - Section 5.5 of [SB]
- Off Policy Learning: Chapter 12 - Sections 12.9 and 12.11 of [SB]
Q-Learning
- Q-Learning: Chapter 6 - Section 6.5 of [SB] (a tabular Q-learning sketch is given below)
- Q-Learning: Chapter 12 - Section 12.10 of [SB]
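To accompany Section 6.5, here is a minimal tabular Q-learning sketch with ε-greedy exploration against a Gymnasium-style environment with discrete, hashable observations (e.g., FrozenLake-v1). The hyperparameters and the fixed ε are illustrative assumptions.

```python
# Tabular Q-learning with epsilon-greedy exploration (illustrative sketch).
import random
from collections import defaultdict

def q_learning(env, num_episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    n_actions = env.action_space.n
    Q = defaultdict(lambda: [0.0] * n_actions)   # Q[s][a], zero-initialised

    for _ in range(num_episodes):
        state, _ = env.reset()
        done = False
        while not done:
            # Epsilon-greedy behaviour policy
            if random.random() < epsilon:
                action = env.action_space.sample()
            else:
                action = max(range(n_actions), key=lambda a: Q[state][a])

            next_state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated

            # Off-policy TD target: bootstrap with the greedy value of the next state
            target = reward + (0.0 if terminated else gamma * max(Q[next_state]))
            Q[state][action] += alpha * (target - Q[state][action])
            state = next_state

    return Q
```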