Materials
Lecture Notes
The lecture notes are uploaded throughout the semester. For each chapter, the notes are provided section by section.
Chapter 0: Course Overview and Logistics
- Handouts: All Sections included in a single file
Chapter 1: RL Framework
- Section 1: Multi-armed Bandit
- Section 2: RL Problem: Definitions and Basics
- Section 3: Playing in RL Framework
- Section 4: Our First Game
Chapter 2: Model-based RL
- Section 1: Markov Decision Process and Bellman Equation
- Section 2: Bellman Optimality
- Section 3: Policy and Value Iteration
Chapter 3: Model-free Tabular RL
- Section 1: Evaluation via Monte-Carlo
- Section 2: Temporal Difference
- Section 3: Deep Bootstrapping and Credit Assignment
- Section 4: Online Control via Monte-Carlo
- Section 5: Online Control via TD - SARSA and Q-Learning
Chapter 4: RL with Function Approximation
- Section 1: Tabular RL vs RL with Function Approximation
- Section 2: Prediction with Function Approximation
- Section 3: Control with Function Approximation
- Section 4: Deep Q-Learning
Chapter 5: Policy Gradient Methods (PGMs)
- Section 1: Policy Network
- Section 2: Training Policy Net and PGM
- Section 3: PGM with Trust Region
- Section 4: Basic Forms of TRPO and PPO
Chapter 6: Actor-Critic Approaches
- Section 1: Basics of AC - A2C Algorithm
- Section 2: AC Version of TRPO and PPO
- Section 3: Deterministic Policy Gradient and DDPG
Chapter 7: Wrap-up
Tutorial Notebooks and Videos
The tutorial notebooks and recorded videos can be accessed below.
- Tutorial 1: Overview of RL Environments and Introduction to Gymnasium, by Eason Qu (Video); a minimal interaction loop is sketched after this list
- Tutorial 2: Policy and Value Iteration, by Eason Qu (Video)
- Tutorial 3: Monte Carlo Learning, by Eason Qu (Video)
- Tutorial 4: TD Learning, by Eason Qu (Video)
- Tutorial 5: Midterm Review, by Eason Qu (Video)
- Tutorial 6: Tabular vs Deep RL, by Eason Qu (Video)
- Tutorial 7: PGM, by Eason Qu (Video)
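As a quick taste of what Tutorial 1 covers, below is a minimal Gymnasium interaction loop. This is only an illustrative sketch: the environment name (CartPole-v1) and the uniformly random action choice are assumptions made for the example, not necessarily what the tutorial notebook uses.

# Minimal agent-environment interaction loop with Gymnasium (illustrative sketch)
import gymnasium as gym

env = gym.make("CartPole-v1")           # any registered environment works; CartPole is a common starter
observation, info = env.reset(seed=0)   # start the first episode

for _ in range(1000):
    action = env.action_space.sample()  # a random policy, just to show the interface
    observation, reward, terminated, truncated, info = env.step(action)
    if terminated or truncated:         # episode ended: start a new one
        observation, info = env.reset()

env.close()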
Book
Most of the content covered in the first two parts of the course can be studied further in
- [SB] Sutton, R., and Barto, A. Reinforcement Learning: An Introduction, 2nd Edition, MIT Press, 2018
The following old preprint is also a good summary of important RL algorithms.
Most materials in the third part, i.e., deep RL, are collected from research papers. The following textbook is also a good resource for practicing hands-on skills.
Reading List
This section will be completed gradually throughout the semester.
Chapter 1: RL Framework
Introduction
- Intro to RL: Chapter 1 - Sections 1.1 and 1.2 of [SB]
Multi-armed Bandit
- k-armed Bandit: Chapter 2 - Section 2.1 of [SB]
- Robbins’ Paper: "Some aspects of the sequential design of experiments" by H. Robbins, published in the Bulletin of the American Mathematical Society in 1952, formulating the multi-armed bandit problem as we know it today
RL Problem Formulation
- RL Setting: Chapter 1 - Sections 1.3 and 1.5 of [SB]
Terminal State and Episode
- RL Setting: Chapter 3 - Section 3.4 of [SB]
Chapter 2: Model-based RL
MDPs
- Intro to RL: Chapter 3 - Sections 3.1 and 3.5 of [SB]
Bellman Equation and Optimal Policy
- Bellman and Optimality: Chapter 3 - Sections 3.6 and 3.7 of [SB]
Policy Iteration
- Policy Evaluation: Chapter 4 - Section 4.1 of [SB]
- Policy Improvement and Iteration: Chapter 4 - Sections 4.2 and 4.3 of [SB]
Value Iteration
- Value Iteration: Chapter 4 - Sections 4.4, 4.6 and 4.7 of [SB]
Chapter 3: Model-free Tabular RL
Monte Carlo Approach
- Monte-Carlo: Chapter 5 - Sections 5.1, 5.2 and 5.3 of [SB]
Temporal Difference
Deep Temporal Difference
Credit Assignment
Eligibility Trace
- Eligibility Tracing: Chapter 12 - Sections 12.4 and 12.5 of [SB]
- RL with Eligibility: "Reinforcement Learning with Replacing Eligibility Traces" published in 1996 by S. Singh and R. Sutton, proposing the idea of including eligibility traces in RL
Monte-Carlo Control
- MC Control: Chapter 5 - Sections 5.3 and 5.4 of [SB]
ε-Greedy Improvement
Temporal-Difference Control
- TD Control: Chapter 6 - Section 6.2 of [SB]
Sarsa Algorithm
- Online Q-Learning: "On-Line Q-Learning Using Connectionist Systems" published in 1994 by G. Rummery and M. Niranjan, proposing SARSA as an online version of Q-Learning
- Sarsa: Chapter 6 - Section 6.4 of [SB]
- Sarsa: Chapter 10 - Sections 10.2 and 10.5 of [SB]
- Sarsa: Chapter 12 - Section 12.7 of [SB]
Importance Sampling and Off-policy Learning
- Importance Sampling: Chapter 5 - Section 5.5 of [SB]
- Off Policy Learning: Chapter 12 - Sections 12.9 and 12.11 of [SB]
Q-Learning
- Q-Learning Paper: "Q-learning" published in 1992 by C. Watkins and P. Dayan, proposing the off-policy learning used in the Q-learning algorithm
- Q-Learning: Chapter 6 - Section 6.5 of [SB]
- Q-Learning: Chapter 12 - Section 12.10 of [SB]
Q-Learning vs SARSA
- Convergence Paper: "The O.D.E. Method for Convergence of Stochastic Approximation and Reinforcement Learning" published in 2000 by V. Borkar and S. Meyn, studying convergence of Q-Learning and SARSA
Chapter 4: RL with Function Approximation
Function Approximation for Value Estimation
- Function Approximation for RL: Chapter 9 of [SB]
- Neuro-dynamic Programming: "Neuro-dynamic programming: an overview" published in 1995 by D. Bertsekas and J. Tsitsiklis, discussing function approximation for value learning
Prediction via Function Approximation
- Prediction with FA: Chapter 3 of [CS]
- TD with FA: "Analysis of Temporal-Difference Learning with Function Approximation" published in 1996 by J. Tsitsiklis and B. Van Roy, analyzing prediction with parameterized models
- FA vs Tabular: "Analyzing feature generation for value-function approximation" published in 2008 by R. Parr et al., discussing connections of RL with FA to tabular RL
- RL with FA: "Residual Algorithms: Reinforcement Learning with Function Approximation" published in 1995 by L. Baird, offering some criticism of RL with FA
Control via Function Approximation
- Control with FA: Chapters 10 and 11 of [SB]
- Control with FA: Chapter 4 of [CS]
Deep Q-Learning
- Off-Policy with FA: "A Convergent O(n) Algorithm for Off-policy Temporal-difference Learning with Linear Function Approximation" published in 2008 by R. Sutton et al., analyzing off-policy learning via linear models
- DQL: Chapter 4 - Section 4.3 of [CS]
- Deep Q-Learning: "Human-level control through deep reinforcement learning" published in 2015 by V. Mnih et al., proposing the legendary idea of Deep Q-Learning
- DQL Paper I: "Playing Atari with Deep Reinforcement Learning" published in 2013 by V. Mnih et al., describing DQL details
- DQL Paper II: "Deep Reinforcement Learning with Double Q-learning" published in 2015 by H. van Hasselt et al., proposing Double DQL
- DQL Paper III: "Dueling Network Architectures for Deep Reinforcement Learning" published in 2016 by Z. Wang et al., proposing Dueling DQL
- DQL Paper IV: "Prioritized Experience Replay" published in 2016 by T. Schaul et al., proposing a prioritized experience replay scheme
- DQL Paper V: "Deep Recurrent Q-Learning for Partially Observable MDPs" published in 2017 by M. Hausknecht and P. Stone, extending DQL to more realistic cases
- Gorila: "Massively Parallel Methods for Deep Reinforcement Learning" published in 2015 by A. Nair et al., proposing Gorila
- Why Policy Net: Article "Deep Deterministic Policy Gradient" at OpenAI Spinning Up
Chapter 5: Policy Gradient Method
Basic PGM
- REINFORCE: "Simple statistical gradient-following algorithms for connectionist reinforcement learning" published by R. Williams in 1992, introducing the REINFORCE algorithm
Policy Gradient Theorem
- PGM Theorem: "Policy Gradient Methods for Reinforcement Learning with Function Approximation" published by R. Sutton et al. in 1999, developing the Policy Gradient Theorem
Vanilla and Baseline PGM
- Baseline: "Policy invariance under reward transformations: Theory and application to reward shaping" published by A. Ng et al. in 1999
Trust Region PGM
- Nat PGM: "A Natural Policy Gradient" published by S. Kakade in 2001, proposing a basic natural PGM
- TRPO: "Trust Region Policy Optimization" published by J. Schulman et al. in 2015, proposing TRPO
- PPO: "Proximal Policy Optimization Algorithms" published by J. Schulman et al. in 2017, proposing PPO
