Lecture Videos - Reinforcement Learning / Fall 2025

Lecture 0: Course Overview and Logistics
Overview: In this lecture, we go through the course logistics. The audio quality is poor, as the teaching station did not work.
[link]

Lecture Notes:

Chapter 0

Lecture 1: RL as a Learning Problem
Introduction: We introduce the notion of reinforcement learning and understand how it differs to classic learning tasks in its nature.
[link]

Lecture Notes:

Chapter 1 - Section 1

Further Reads:

Intro to RL: Chapter 1 - Sections 1.1 and 1.2 of [SB]

Lecture 2: Optimal and Random Playing of Multi-armed Bandit
Bandit - Part I: We take a look at a very first example, the multi-armed bandit problem, and see how optimally or randomly playing could change the collected reward.
[link]

Lecture Notes:

Chapter 1 - Section 1

Further Reads:

k-armed Bandit: Chapter 2 - Section 2.1 of [SB]
Robbins’ Paper: Paper Some aspects of the sequential design of experiments by H. Robbins published in the Bulletin of the American Mathematical Society in 1952 formulating multi-armed bandit as we know it nowadays

Lecture 3: Exploiting Explorations in Multi-armed Bandit
Bandit - Part II: We now take a look at a mix strategy, first explore the environment for some limited time and then exploit this exploration in the real-time. We see that we could get very close to the optimal reward if we play for a long enough time.
[link]

Lecture Notes:

Chapter 1 - Section 1

Further Reads:

k-armed Bandit: Chapter 2 - Section 2.1 of [SB]
Robbins’ Paper: Paper Some aspects of the sequential design of experiments by H. Robbins published in the Bulletin of the American Mathematical Society in 1952 formulating multi-armed bandit as we know it nowadays

Lecture 4: Formulating the RL Framework
Principle RL Problem: We now formulate the RL setting for a general interactive problem. We get to know what the agent is, what the environment is, and how the interactions between them is formulated.
[link]

Lecture Notes:

Chapter 1 - Section 2

Further Reads:

RL Setting: Chapter 3 - Sections 3.1 to 3.3 of [SB]

Lecture 5: Environment as State-Dependent System
Return and State: We see that the best way to present the environment mathematically is to look at it as a state-dependent system. This provides us the Markovity and enables us concretely formulate the problem of RL.
[link]

Lecture Notes:

Chapter 1 - Section 2

Further Reads:

RL Setting: Chapter 3 - Sections 3.1 to 3.3 of [SB]

Lecture 6: Examples of RL Setting
Example: We look into a few sample environment, namely Tic-Tac-Toe, Cart-Pole, and Trading examples. For each, we formulate the State, Action, and Reward.
[link]

Lecture Notes:

Chapter 1 - Section 2

Lecture 7: Policy and Its Value
Value - Part I: We define mathematically a policy. We see how we can determine the value of a given policy in a target environment.
[link]

Lecture Notes:

Chapter 1 - Section 3

Further Reads:

RL Setting: Chapter 1 - Sections 1.3 and 1.4 of [SB]

Lecture 8: Playing Tic-Tac-Toe
Value - Part II: We take a look at the Tic-Tac-Toe example and play with two different policies. For each policy, we evaluate the value function. We further define the action-value function.
[link]

Lecture Notes:

Chapter 1 - Section 3

Further Reads:

RL Setting: Chapter 1 - Section 1.5 of [SB]

Lecture 9: Optimal Policy
Example: The value function enables us to define the notion of Optimal Policy. This formulates concretely the main objective in RL. We give a short overview on different approaches to find the optimal policy. This indeed is the rough categorization of RL approaches.
[link]

Lecture Notes:

Chapter 1 - Section 3

Lecture 10: Frozen Lake Example -- Terminal State and Episode
Terminal and Episode: We take a look at the example of Frozen Lake Game to practice the definitions of action, state, policy and value function. Through this example, we get to know the notion of Terminal State and Episode. We further quickly introduce the Gymnasium API for Implementation.
[link]

Lecture Notes:

Chapter 1 - Section 4

Further Reads:

RL Setting: Chapter 3 - Section 3.4 of [SB]

Lecture 11: Markov Decision Processes
Model-based I: We start with model-based RL. We see that in this case, we can think of the environment as an MDP. Assuming to know the MDP, we can evaluate a given policy simply using statistical analysis. We look into a few simple examples to get the concept clear.
[link]

Lecture Notes:

Chapter 2 - Section 1

Further Reads:

RL Setting: Chapter 3 - Sections 3.1 and 3.3 of [SB]

Lecture 12: Value Function Calculation via MDPs -- Naive Approach
Model-based II: We use the known MDP to evaluate the value function of a given policy. We see that this is theoretically feasible; however, imposes a lot of analytical challenges. This motivates us to look into an alternative approach, i.e., Bellman Equation.
[link]

Lecture Notes:

Chapter 2 - Section 1

Further Reads:

RL Setting: Chapter 3 - Section 3.5 [SB]

Lecture 13: Bellman Equation
Model-based III: We derive the Bellman equation for the value function. This equation enables us to compute the value function from a set of fixed-point equations without needing to take nested expectations.
[link]

Lecture Notes:

Chapter 2 - Section 1

Further Reads:

Bellman: Chapter 3 - Section 3.6 [SB]

Lecture 14: Bellman Equation for Action-Value and Backup Diagram
Model-based IV: We extend the derivation of Bellman equation to the action-value function. We see that the Bellman equations describe a simple graphic diagram which can relate the value of each state or action-state to other values in the environment.
[link]

Lecture Notes:

Chapter 2 - Section 1

Further Reads:

Bellman: Chapter 3 - Section 3.6 [SB]

Lecture 15: Bellman Optimality Equation
Model-based V: We derive the Bellman Optimality equation. This equation enables us to compute the optimal value function without needing to know the optimal policy. This further let us derive the optimal policy for an environment with known MDP.
[link]

Lecture Notes:

Chapter 2 - Section 2

Further Reads:

Optimal Policy: Chapter 3 - Section 3.7 [SB]

Lecture 16: Back-Tracking Optimal Policy
Optimal Policy: Using Bellman Optimality Equation, we can backtrack the optimal policy. We learn this algorithm in this lecture. We see that the key issue is to find the optimal value solution. This motivates us to use dynamic programming.
[link]

Lecture Notes:

Chapter 2 - Section 2

Further Reads:

Optimal Policy: Chapter 3 - Section 3.7 [SB]

Lecture 17: Policy Evaluation by Dynamic Programming
Policy Iteration - Part I: We now use dynamic programming to evaluate a given policy. This enables us to develop an iterative algorithm to find values.
[link]

Lecture Notes:

Chapter 2 - Section 3

Further Reads:

Policy Evaluation: Chapter 4 - Section 4.1 of [SB]

Lecture 18: Policy Improvement and Policy Iteration
Policy Iteration - Part II: Given the optimality constraint, we can use a simple mechanism to improve a given policy. This idea along with policy evaluation leads us to what is known as Policy Iteration algorithm.
[link]

Lecture Notes:

Chapter 2 - Section 3

Further Reads:

Policy Improvement and Iteration: Chapter 4 - Sections 4.2 and 4.3 of [SB]

Lecture 19: Value Iteration
Value Iteration: We can use dynamic programming to directly solve the Bellman optimality equation, and then backtrack the optimal policy. This is what we call Value Iteration and we see it in this lecture.
[link]

Lecture Notes:

Chapter 2 - Section 3

Further Reads:

Value Iteration: Chapter 4 - Section 4.4 of [SB]

Lecture 20: Generalized Policy Iteration
Value Iteration: We compare policy and value iteration algorithms. While VI could converge faster PI could give more robust result in realistic environment. This motivates us for what is known as Generalized Policy Iteration.
[link]

Lecture Notes:

Chapter 2 - Section 3

Further Reads:

Generalized Policy Iteration: Chapter 4 - Sections 4.6 and 4.7 of [SB]

Lecture 21: Model-free Policy Evaluation via Monte-Carlo
Monte-Carlo - Part I: We start with model-free approaches for RL. As the first stop, we look into Monte-Carlo approach and see how we could use it to estimate the value function.
[link]

Lecture Notes:

Chapter 3 - Section 1

Further Reads:

Monte-Carlo: Chapter 5 - Sections 5.1 and 5.2 of [SB]

Lecture 22: GPI via Monte-Carlo
Monte-Carlo - Part II: We now use the MC evaluation algorithm to develop a generalized policy iteration (GPI) loop based on MC evaluation. This loop implements GPI by using MC estimate to evaluate the policy in each iteration.
[link]

Lecture Notes:

Chapter 3 - Section 1

Further Reads:

Monte-Carlo: Chapter 5 - Section 5.3 of [SB]

Lecture 23: Bootstrapping
TD - Part I: Using the recursive property of the value function, we can develop an alternative estimator for the value function via bootstrapping, which estimates the value of each state from the estimate of the other states.
[link]

Lecture Notes:

Chapter 3 - Section 2

Further Reads:

TD-0: Chapter 6 - Section 6.1 of [SB]

Lecture 24: GPI via Temporal Difference
TD - Part II: Using bootstrapping, we can evaluate a policy. This approach is called temporal difference which can lead to better sample efficiency as compared to MC. We learn how to implement GPI with this approach, and discuss its bias-variance trad-off as compared with the MC approach.
[link]

Chapter 3 - Section 2

Further Reads:

TD-0: Chapter 6 - Sections 6.2 and 6.3 of [SB]

Lecture 25: Deep Bootstrapping and TD-n
TD - Part III: We extend the idea of bootstrapping to estimate values from deeper temporal differences. This is called TD-n. We study the impact of depth on the learning quality.
[link]

Chapter 3 - Section 3

Further Reads:

TD-n: Chapter 7 - Sections 7.1 and 7.2 of [SB]

Lecture 26: TD-λ
TD - Part III: We can improve sample efficiency by averaging TD over depth. This is called TD-λ. We learn this approach and understand its implication.
[link]

Chapter 3 - Section 3

Further Reads:

TD-lambda: Chapter 12 - Sections 12.1 to 12.3 of [SB]

Lecture 27: TD with Eligibility Tracing
TD - Part IV: TD-λ is not causal and hence not very efficient for online control. We hence study its causal equivalent known as TD with eligibility tracing. It enables us to implement the efficient TD-λ without the need to know the future rewards in the trajectory.
[link]

Chapter 3 - Section 3

Further Reads:

Eligibility Tracing: Chapter 12 - Sections 12.4 and 12.5 of [SB]

Lecture 28: Control Loop with Monte Carlo
MC Control: We now move to the problem of online RL. We want to learn optimal policy while sampling, not by basic policy iteration which evaluates policies. We understand the difference and develop our first basic loop, i.e. Monte-Carlo Control Loop.
[link]

Chapter 3 - Section 4

Further Reads:

MC Control: Chapter 5 - Sections 5.3 and 5.4 of [SB]

Lecture 29: Adding Exploration to Control Loop
MC Control: We discuss ε-greedy approach which enables us adding exploration to our control loop. We show that this approach still improves the policy and hence it can ultimately converge to optimal policy. We also define the notion of GLIE algorithm.
[link]

Chapter 3 - Section 4

Further Reads:

ε-Greedy: Chapter 2 - Sections 2.5 and 2.6 of [SB]

Lecture 30: On-Policy RL via SARSA
SARSA: We build a control loop using TD estimation. This enables us to learn the optimal policy while we are playing with it. The final algorithm is called State-Action Reward State-Action or simply SARSA. SARSA is the most used form of on-policy RL.
[link]

Chapter 3 - Section 5

Further Reads:

Online Q-Learning Article On-Line Q-Learning Using Connectionist Systems published in 1994 by G. Rummery and M. Niranjan proposing SARSA as an online version of Q-Learning
Sarsa: Chapter 6 - Section 6.4 of [SB]
Sarsa: Chapter 10 - Sections 10.2 and 10.5 of [SB]
Sarsa: Chapter 12 - Section 12.7 of [SB]

Lecture 31: Off-Policy RL via Importance Sampling
Importance Sampling: We can learn the optimal policy while playing with another policy. This is what we call off-policy learning. The key statistical toll that let us do that is Importance Sampling. We learn how we can do it for a general case.
[link]

Chapter 3 - Section 5

Further Reads:

Importance Sampling: Chapter 5 - Section 5.5 of [SB]
Off Policy Learning: Chapter 12 - Sections 12.9 and 12.11 of [SB]

Lecture 32: Q-Learning
Q-Learning: We discuss a specific form of off-policy RL, namely Q-learning. Here, we play ε-greedy, but we learn the value of the greedy policy. This leads to a simple control loop, often called the Q-Learning Algorithm.
[link]

Chapter 3 - Section 5

Further Reads:

Q-Learning Paper Paper Q-learning published in 1992 by _C Watkins and P. Dayan_proposing the off-policy learning as in Q-learning algorithm
Q-Learning: Chapter 6 - Section 6.5 of [SB]
Q-Learning: Chapter 12 - Section 12.10 of [SB]

Lecture 33: Convergence of Q-Learning and SARSA
Convergence: We discuss the requirements for both algorithms to converge. This is consistent with the notion of GLIE algorithms.
[link]

Chapter 3 - Section 5

Further Reads: *Convergence Paper The O.D.E. Method for Convergence of Stochastic Approximation and Reinforcement Learning published in 2000 by V. Borkar and S. Meyn studying convergence of Q-Learning and SARSA

Lecture 34: Why Deep RL?
Deep RL - Intro: We discuss the space size of a realistic environment to see that classical tabular RL is not working. We hence have to go for Deep RL.
[link]

Chapter 4 - Section 1

Further Reads:

Neuro-dynamic Programming Paper Neuro-dynamic programming: an overview published in 1995 by D. Bertsekas and J. Tsitsiklis discussing function approximation for value learning

Lecture 35: Using Function Approximation in RL
Function Approximation: We see how function approximation can get its way into RL. This enables us to use Deep Learning in RL, i.e. Deep RL.
[link]

Chapter 4 - Section 1

Further Reads:

Function Approximation for RL: Chapter 9 of [SB]
Neuro-dynamic Programming Paper Neuro-dynamic programming: an overview published in 1995 by D. Bertsekas and J. Tsitsiklis discussing function approximation for value learning

Lecture 36: Flexibility of RL via Function Approximation
Function Approximation: We take a look at the example of Mountain Car to see how using function approximation gives us more flexibility as compared to tabular RL.
[link]

Chapter 4 - Section 1

Further Reads:

Prediction with FA: Chapter 3 of [CS]

Lecture 37: Training Value Model for Prediction
Value Learning: We see how using a parameterized model, we can train the model to learn the value of a given policy. We can use both Monte-Carlo and TD approaches.
[link]

Chapter 4 - Section 2

Further Reads:

TD with FA Paper Analysis of Temporal-Diffference Learning with Function Approximation published in 1996 by J. Tsitsiklis and B. Roy analyzing Prediction with parameterized models

Lecture 38: Back to Tabular RL
Value Learning: We show that tabular RL is a special case of RL with function approximation, where the state representation is given by a one-hot vector and the model is linear.
[link]

Chapter 4 - Section 2

Further Reads:

FA vs Tabular Paper Analyzing feature generation for value-function approximation published in 2008 by R. Parr et al. discussing connections of RL with FA to tabular RL

Lecture 39: Learning Action-Value via Function Approximation
Action-Value Learning: We extend the value learning idea to learning of Action-Value function, where we use a model to approximate the action-value of each state. This is indeed what is known as the Q-Network.
[link]

Chapter 4 - Section 2

Further Reads:

RL with FA Paper Residual Algorithms: Reinforcement Learning with Function Approximation published in 1995 by L. BAird giving some critics to RL with FA

Lecture 40: Control via Function Approximation and Deep Q-Learning
Action-Value Learning: We now use the developed training loop to train a Q-network a control process. We look into both on-policy and off-policy cases, and conclude the vanilla form of the Deep Q-Learning algorithm.
[link]

Chapter 4 - Section 4

Further Reads:

Control with FA: Chapter 4 of [CS]

Lecture 41: Experience Replay in DQL
Action-Value Learning: The fact that we are doing off-policy control in DQL enables us to boost the sample efficiency. We can use the same samples multiple times. This is what is known as experience replay and is a key component of DQL.
[link]

Chapter 4 - Section 4

Further Reads:

DQL: Chapter 4 - Section 4.3 of [CS]
Deep Q-Learning Paper Human-level control through deep reinforcement learning published in 2015 by V. Mnih et al. proposing the legendary idea of Deep Q-Learning
DQL Paper I Paper Playing Atari with Deep Reinforcement Learning published in 2013 by V. Mnih et al. describing DQL details

Lecture 42: Target Network
Action-Value Learning: DQL in vanilla form has a second problem which is the fact that the labels computed by TD change by network update. This can deteriorate the convergence significantly. To address this issue, we could train via a nested loop in which we copy the value net into a Target Network and keep it fixed for multiple iterations. This helps significantly with convergence.
[link]

Chapter 4 - Section 4

Further Reads:

DQL Paper I Paper Playing Atari with Deep Reinforcement Learning published in 2013 by V. Mnih et al. describing DQL details

Lecture 43: Double DQL and Gorila
Action-Value Learning: Due to maximization operation in DQL, the network always overestimate the value. A solution to this is to use double DQL. We can further perform DQL in a distributed fashion as in Gorila. We discuss these extensions in this lecture.
[link]

Chapter 4 - Section 4

Further Reads:

DQL Paper II Paper Deep Reinforcement Learning with Double Q-learning published in 2015 by H. Haasselt et al. proposing Double DQL
DQL Paper III Paper Dueling Network Architectures for Deep Reinforcement Learning published in 2016 by Z. Wang et al. proposing Dueling DQL
DQL Paper IV Paper Prioritized Experience Replay published in 2016 by T. Schaul et al. proposing a prioritizing experience replay scheme
Gorila Paper Massively Parallel Methods for Deep Reinforcement Learning published in 2015 by A. Nair et al. proposing Gorila

Lecture 44: Why Policy Net?
Policy Learning: With continuous action space, value-based Deep RL fails! We see this issue and try to find a simple solution via deep learning for it. This introduces the idea of Policy Network which is studied in the next chapter.
[link]

Chapter 4 - Section 4

Further Reads:

Why Policy Net Article Deep Deterministic Policy Gradient at OpenAI Spinning Up

Lecture 45: Policy Net and Its Learning Objective
PGM I: We learn policy networks and their learning objectives. We see how er can formulate their objective to train a computational policy network.
[link]

Chapter 5 - Section 1

Further Reads:

REINFORCE Paper Simple statistical gradient-following algorithms for connectionist reinforcement learning published by R. Williams in 1992 introducing REINFORCE algorithm

Lecture 46: Training Policy Net via SGD
PGM II: We develop an SGD-based training loop for the policy network. This describes the most basic form of PGM.
[link]

Chapter 5 - Section 1

Further Reads:

PGM Theorem Paper Policy Gradient Methods for Reinforcement Learning with Function Approximation published by R. Sutton et al. in 1999 developing the Policy Gradient Theorem

Lecture 47: Policy Gradient Theorem
PGM III: We learn the policy gradient theorem, which is the basis of all Policy Gradient Methods.
[link]

Chapter 5 - Section 1

Further Reads:

PGM Theorem Paper Policy Gradient Methods for Reinforcement Learning with Function Approximation published by R. Sutton et al. in 1999 developing the Policy Gradient Theorem

Lecture 48: Vanilla and Baseline PGM
PGM IV: Using the policy gradient theorem, we can develop efficient algorithm for training of a policy network via SGD. This algorithm is however sensitive to reward shift. This leads us to the baseline PGM.
[link]

Chapter 5 - Section 2

Further Reads:

Baseline Paper Policy invariance under reward transformations: Theory and application to reward shaping published by A. Ng et al. in 1999

Lecture 49: PGM as Sequential Surrogate Optimization
Trust Region PGM I: Vanilla and baseline PGM still suffer from instability. We see in this lecture that we can interpret PGM update as sequential surrogate function optimization. This reveals us that the key challenge is unrestricted update in each iteration. This motivates us to study trust-region PGM approaches.
[link]

Chapter 5 - Section 3

Further Reads:

Nat PGM Paper A Natural Policy Gradient published by S. Kakade in 2001 proposing a basic natural PGM
TRPO Paper Trust Region Policy Optimization published by J. Schulman et al. in 2015 proposing TRPO

Lecture 50: Trust Region and Natural PGM
Trust Region PGM II: We define PGM update with a trust region. We solve the problem using an approximation of the objective. This leads to natural PGM which modifies the policy gradient with a trust-region transform. We further discuss the complexity and challenges of implementing Natural PGM.
[link]

Chapter 5 - Section 3

Further Reads:

Nat PGM Paper A Natural Policy Gradient published by S. Kakade in 2001 proposing a basic natural PGM
TRPO Paper Trust Region Policy Optimization published by J. Schulman et al. in 2015 proposing TRPO

Lecture 51: TRPO Algorithm
Trust Region PGM III: We discuss the TRPO algorithm as a principal approach for trust region PGM with tractable complexity.
[link]

Chapter 5 - Section 4

Further Reads:

TRPO Paper Trust Region Policy Optimization published by J. Schulman et al. in 2015 proposing TRPO

Lecture 52: PPO Algorithm
Trust Region PGM IV: We discuss the PPO algorithm as a an alternative solution to the trust-region PGM problem. PPO imposes significantly lower complexity making trust region policy update more feasible.
[link]

Chapter 5 - Section 4

Further Reads:

PPO Paper Proximal Policy Optimization Algorithms published by J. Schulman et al. in 2017 proposing PPO

Lectures