Lecture Notes

The lecture notes are uploaded throughout the semester. For each chapter, the notes are provided section by section.

Chapter 0: Course Overview and Logistics

  • Handouts: All Sections included in a single file

Chapter 1: RL Framework

Chapter 2: Model-based RL

Chapter 3: Model-free Tabular RL

Chapter 4: RL with Function Approximation

  • Section 1: Tabular RL vs RL with Function Approximation
  • Section 2: Prediction with Function Approximation
  • Section 3: Control with Function Approximation
  • Section 4: Deep Q-Learning

Chapter 5: Policy Gradient Methods (PGMs)

Chapter 6: Actor-Critic Approaches

Chapter 7: Wrap-up

Tutorial Notebooks and Videos

The tutorial notebooks can be accessed below.

Book

Most of the content covered in the first two parts of the course can be explored further in

The following older preprint is also a good summary of important RL algorithms.

Most of the material in the third part, i.e., deep RL, is collected from research papers. The following textbook is also a good resource for practicing hands-on skills.

Reading List

This section will be completed gradually through the semester.

Chapter 1: RL Framework

Introduction

Multi-armed Bandit

  • k-armed Bandit: Chapter 2 - Section 2.1 of [SB]
  • Robbins’ Paper: Some aspects of the sequential design of experiments by H. Robbins, published in the Bulletin of the American Mathematical Society in 1952, formulating the multi-armed bandit problem as we know it today (an ε-greedy bandit sketch follows this list)
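
As a quick companion to these readings, here is a minimal sketch of a k-armed Gaussian bandit with ε-greedy action selection and incremental sample-average estimates; the reward model, the number of arms, and all parameter values are illustrative assumptions, not anything specified in [SB] or the lecture notes.

```python
import numpy as np

def run_bandit(k=10, steps=1000, eps=0.1, seed=0):
    """Epsilon-greedy agent on a k-armed Gaussian bandit (illustrative only)."""
    rng = np.random.default_rng(seed)
    q_true = rng.normal(0.0, 1.0, size=k)   # unknown true action values
    q_est = np.zeros(k)                      # sample-average estimates
    counts = np.zeros(k, dtype=int)
    rewards = []
    for _ in range(steps):
        # explore with probability eps, otherwise act greedily
        a = int(rng.integers(k)) if rng.random() < eps else int(np.argmax(q_est))
        r = rng.normal(q_true[a], 1.0)       # noisy reward from the chosen arm
        counts[a] += 1
        q_est[a] += (r - q_est[a]) / counts[a]   # incremental mean update
        rewards.append(r)
    return q_est, float(np.mean(rewards))

if __name__ == "__main__":
    estimates, avg_reward = run_bandit()
    print(avg_reward)
```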

RL Problem Formulation

Terminal State and Episode

Chapter 2: Model-based RL

MDPs

Bellman Equation and Optimal Policy

Policy Iteration

Value Iteration

Chapter 3: Model-free Tabular RL

Monte Carlo Approach

Temporal Difference

  • TD-0: Chapter 6 - Sections 6.1, 6.2 and 6.3 of [SB] (the TD(0) update is recalled below)
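
For quick reference, the tabular TD(0) update discussed in these sections, written in the notation of [SB], is

```latex
V(S_t) \leftarrow V(S_t) + \alpha \bigl[ R_{t+1} + \gamma V(S_{t+1}) - V(S_t) \bigr],
```

where α is the step size and the bracketed term is the TD error δ_t.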

Deep Temporal Difference

  • TD-n: Chapter 7 - Sections 7.1 and 7.2 of [SB] (the n-step return is recalled below)
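
The n-step return that these sections bootstrap from, again in the notation of [SB], is

```latex
G_{t:t+n} = R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{n-1} R_{t+n} + \gamma^{n} V(S_{t+n}),
\qquad
V(S_t) \leftarrow V(S_t) + \alpha \bigl[ G_{t:t+n} - V(S_t) \bigr].
```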

Credit Assignment

Eligibility Trace

  • Eligibility Traces: Chapter 12 - Sections 12.4 and 12.5 of [SB]
  • RL with Eligibility Paper Reinforcement Learning with Replacing Eligibility Traces published in 1996 by S. Singh and R. Sutton, proposing the idea of including eligibility traces in RL (a tabular TD(λ) sketch follows this list)
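
To make the mechanism concrete, here is a minimal sketch of one episode of tabular TD(λ) with accumulating traces; the `env_step` interface, the state encoding, and all parameter values are assumptions for illustration, not the course's code.

```python
import numpy as np

def td_lambda_episode(env_step, V, alpha=0.1, gamma=0.99, lam=0.9, s0=0, max_steps=1000):
    """One episode of tabular TD(lambda) with accumulating eligibility traces.

    `env_step(s) -> (next_state, reward, done)` is an assumed placeholder interface.
    """
    z = np.zeros_like(V)                # one eligibility trace per state
    s = s0
    for _ in range(max_steps):
        s_next, r, done = env_step(s)
        delta = r + gamma * (0.0 if done else V[s_next]) - V[s]   # TD error
        z *= gamma * lam                # decay all traces
        z[s] += 1.0                     # accumulating trace (use z[s] = 1.0 for replacing traces)
        V += alpha * delta * z          # credit recent states in proportion to their trace
        if done:
            break
        s = s_next
    return V
```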

Monte-Carlo Control

ε-Greedy Improvement

Temporal-Difference Control

Sarsa Algorithm

  • Online Q-Learning Article On-Line Q-Learning Using Connectionist Systems published in 1994 by G. Rummery and M. Niranjan, proposing SARSA as an on-line version of Q-Learning (a tabular Sarsa sketch follows this list)
  • Sarsa: Chapter 6 - Section 6.4 of [SB]
  • Sarsa: Chapter 10 - Sections 10.2 and 10.5 of [SB]
  • Sarsa: Chapter 12 - Section 12.7 of [SB]
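
As a compact companion to the Sarsa readings above, here is a minimal tabular Sarsa sketch; `env_reset`/`env_step` are assumed placeholder interfaces and all hyperparameters are illustrative choices.

```python
import numpy as np

def sarsa(env_reset, env_step, n_states, n_actions,
          episodes=500, alpha=0.1, gamma=0.99, eps=0.1, seed=0):
    """Tabular Sarsa (on-policy TD control).

    `env_reset() -> state` and `env_step(state, action) -> (next_state, reward, done)`
    are assumed placeholder interfaces, not the course's environment API.
    """
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))

    def eps_greedy(s):
        # behave epsilon-greedily with respect to the current estimates
        return int(rng.integers(n_actions)) if rng.random() < eps else int(np.argmax(Q[s]))

    for _ in range(episodes):
        s = env_reset()
        a = eps_greedy(s)
        done = False
        while not done:
            s_next, r, done = env_step(s, a)
            a_next = eps_greedy(s_next)
            # on-policy target: bootstrap from the action actually taken next
            target = r + (0.0 if done else gamma * Q[s_next, a_next])
            Q[s, a] += alpha * (target - Q[s, a])
            s, a = s_next, a_next
    return Q
```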

Importance Sampling and Off-policy Learning

Q-Learning

  • Q-Learning Paper: Q-learning published in 1992 by C. Watkins and P. Dayan, proposing the off-policy learning behind the Q-learning algorithm (a tabular sketch follows this list)
  • Q-Learning: Chapter 6 - Section 6.5 of [SB]
  • Q-Learning: Chapter 12 - Section 12.10 of [SB]
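
For comparison with Sarsa, here is a minimal tabular Q-learning sketch; again, `env_reset`/`env_step` are assumed placeholder interfaces and the hyperparameters are illustrative.

```python
import numpy as np

def q_learning(env_reset, env_step, n_states, n_actions,
               episodes=500, alpha=0.1, gamma=0.99, eps=0.1, seed=0):
    """Tabular Q-learning (off-policy TD control).

    `env_reset() -> state` and `env_step(state, action) -> (next_state, reward, done)`
    are assumed placeholder interfaces, not the course's environment API.
    """
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s = env_reset()
        done = False
        while not done:
            # behaviour policy: epsilon-greedy
            a = int(rng.integers(n_actions)) if rng.random() < eps else int(np.argmax(Q[s]))
            s_next, r, done = env_step(s, a)
            # target policy: greedy (max over next actions), hence off-policy
            target = r + (0.0 if done else gamma * np.max(Q[s_next]))
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
    return Q
```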

Q-Learning vs SARSA

  • Convergence Paper The O.D.E. Method for Convergence of Stochastic Approximation and Reinforcement Learning published in 2000 by V. Borkar and S. Meyn, studying the convergence of Q-Learning and SARSA

Chapter 4: RL with Function Approximation

Function Approximation for Value Estimation

Prediction via Function Approximation

  • Prediction with FA: Chapter 3 of [CS]
  • TD with FA Paper Analysis of Temporal-Difference Learning with Function Approximation published in 1996 by J. Tsitsiklis and B. Van Roy, analyzing prediction with parameterized models
  • FA vs Tabular Paper Analyzing feature generation for value-function approximation published in 2008 by R. Parr et al., discussing connections of RL with FA to tabular RL
  • RL with FA Paper Residual Algorithms: Reinforcement Learning with Function Approximation published in 1995 by L. Baird, offering some criticisms of RL with FA (a semi-gradient TD(0) sketch follows this list)
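
To connect these papers with the tabular case, here is a minimal sketch of semi-gradient TD(0) with a linear value function v̂(s) = wᵀφ(s); the `env_step` interface and the feature map `phi` are assumed placeholders, not part of the course code.

```python
import numpy as np

def semi_gradient_td0(env_step, phi, dim, alpha=0.01, gamma=0.99, s0=0, max_steps=10_000):
    """Semi-gradient TD(0) prediction with a linear value function.

    `env_step(s) -> (next_state, reward, done)` and the feature map `phi(s) -> np.ndarray`
    are assumed placeholder interfaces.
    """
    w = np.zeros(dim)
    s = s0
    for _ in range(max_steps):
        s_next, r, done = env_step(s)
        v_s = w @ phi(s)
        v_next = 0.0 if done else w @ phi(s_next)
        delta = r + gamma * v_next - v_s       # TD error
        w += alpha * delta * phi(s)            # semi-gradient: the bootstrap target is treated as constant
        if done:
            break
        s = s_next
    return w
```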

Control via Function Approximation

Deep Q-Learning

  • Off-Policy with FA Paper A Convergent O(n) Algorithm for Off-policy Temporal-difference Learning with Linear Function Approximation published in 2008 by R. Sutton et al. analyzing off-policy learning via linear models
  • DQL: Chapter 4 - Section 4.3 of [CS]
  • Deep Q-Learning Paper Human-level control through deep reinforcement learning published in 2015 by V. Mnih et al., proposing the legendary idea of Deep Q-Learning (a minimal replay-buffer and target-network sketch follows this list)
  • DQL Paper I Paper Playing Atari with Deep Reinforcement Learning published in 2013 by V. Mnih et al. describing DQL details
  • DQL Paper II Paper Deep Reinforcement Learning with Double Q-learning published in 2015 by H. van Hasselt et al., proposing Double DQL
  • DQL Paper III Paper Dueling Network Architectures for Deep Reinforcement Learning published in 2016 by Z. Wang et al. proposing Dueling DQL
  • DQL Paper IV Paper Prioritized Experience Replay published in 2016 by T. Schaul et al. proposing a prioritizing experience replay scheme
  • DQL Paper V Paper Deep Recurrent Q-Learning for Partially Observable MDPs published in 2017 by M. Hausknecht and P. Stone, extending DQL to partially observable settings
  • Gorila Paper Massively Parallel Methods for Deep Reinforcement Learning published in 2015 by A. Nair et al. proposing Gorila
  • Why Policy Net Article Deep Deterministic Policy Gradient at OpenAI Spinning Up
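
To make the core mechanics of these papers concrete, here is a minimal PyTorch sketch of the two ingredients DQL relies on, namely experience replay and a frozen target network; the network architecture, buffer capacity, tensor layout, and hyperparameters are illustrative assumptions, not the configurations used in the papers.

```python
import random
from collections import deque

import torch
import torch.nn as nn


class ReplayBuffer:
    """Fixed-size experience replay storing (s, a, r, s', done) tuples."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        batch = random.sample(self.buffer, batch_size)
        s, a, r, s_next, done = zip(*batch)
        return (torch.as_tensor(s, dtype=torch.float32),
                torch.as_tensor(a, dtype=torch.int64),
                torch.as_tensor(r, dtype=torch.float32),
                torch.as_tensor(s_next, dtype=torch.float32),
                torch.as_tensor(done, dtype=torch.float32))


def make_q_net(obs_dim, n_actions, hidden=128):
    """Small fully connected Q-network mapping observations to action values."""
    return nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU(),
                         nn.Linear(hidden, n_actions))


def dqn_update(q_net, target_net, optimizer, batch, gamma=0.99):
    """One gradient step on the DQN loss using a frozen target network."""
    s, a, r, s_next, done = batch
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)        # Q(s, a) for taken actions
    with torch.no_grad():
        # bootstrap from the target network; zero out terminal transitions
        target = r + gamma * (1.0 - done) * target_net(s_next).max(dim=1).values
    loss = nn.functional.mse_loss(q_sa, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In a training loop one would push transitions into the buffer while acting ε-greedily, call dqn_update on sampled batches with, e.g., torch.optim.Adam(q_net.parameters(), lr=1e-3), and periodically sync the target network via target_net.load_state_dict(q_net.state_dict()).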

Chapter 5: Policy Gradient Methods (PGMs)

Basic PGM

  • REINFORCE Paper Simple statistical gradient-following algorithms for connectionist reinforcement learning published by R. Williams in 1992, introducing the REINFORCE algorithm (a minimal loss sketch follows)
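
As a minimal illustration of the REINFORCE estimator, the sketch below turns one episode's log-probabilities and rewards into a loss whose gradient is the Monte Carlo policy gradient; the function name and interface are assumptions for illustration, not the paper's code.

```python
import torch

def reinforce_loss(log_probs, rewards, gamma=0.99):
    """REINFORCE objective for one episode.

    `log_probs` is a list of scalar tensors log pi(a_t | s_t) collected while acting;
    `rewards` is the list of scalar rewards. Minimizing the returned loss
    ascends the Monte Carlo policy gradient.
    """
    # compute discounted returns G_t by sweeping the episode backwards
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    returns = torch.tensor(list(reversed(returns)), dtype=torch.float32)
    log_probs = torch.stack(log_probs)
    # gradient estimator: - sum_t log pi(a_t | s_t) * G_t
    return -(log_probs * returns).sum()
```

In practice one samples actions from, e.g., a torch.distributions.Categorical over the policy network's outputs, stores dist.log_prob(action) at every step, and takes an optimizer step on this loss after each episode.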

Policy Gradient Theorem

  • PGM Theorem Paper Policy Gradient Methods for Reinforcement Learning with Function Approximation published by R. Sutton et al. in 1999, developing the Policy Gradient Theorem (recalled below)
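
For quick reference, an episodic form of the theorem can be written as

```latex
\nabla_{\theta} J(\theta)
= \mathbb{E}_{\pi_{\theta}}\!\left[ \nabla_{\theta} \log \pi_{\theta}(A_t \mid S_t)\, Q^{\pi_{\theta}}(S_t, A_t) \right],
```

with the expectation taken over the states and actions visited while following π_θ. Replacing Q^{π_θ} with sampled returns recovers REINFORCE, and subtracting a state-dependent baseline leaves the gradient unbiased.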

Vanilla and Baseline PGM

  • Baseline Paper Policy invariance under reward transformations: Theory and application to reward shaping published by A. Ng et al. in 1999

Trust Region PGM

  • Nat PGM Paper A Natural Policy Gradient published by S. Kakade in 2001 proposing a basic natural PGM
  • TRPO Paper Trust Region Policy Optimization published by J. Schulman et al. in 2015 proposing TRPO
  • PPO Paper Proximal Policy Optimization Algorithms published by J. Schulman et al. in 2017, proposing PPO (its clipped surrogate objective is recalled below)
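
For quick reference, the clipped surrogate objective proposed in the PPO paper, with probability ratio r_t(θ) and advantage estimate Â_t, is

```latex
r_t(\theta) = \frac{\pi_{\theta}(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)},
\qquad
L^{\mathrm{CLIP}}(\theta)
= \hat{\mathbb{E}}_t\!\left[ \min\!\bigl( r_t(\theta)\,\hat{A}_t,\ \mathrm{clip}(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon)\,\hat{A}_t \bigr) \right].
```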