Lecture Videos
Lectures
Here, you can find the recordings of the lecture videos.
-
Lecture 1: RL as a Learning Problem
Introduction: We introduce the notion of reinforcement learning and understand how it differs in nature from classic learning tasks.
[link]
-
Lecture 2: Optimal and Random Playing of Multi-armed Bandit
Bandit - Part I: We take a look at a first example, the multi-armed bandit problem, and see how playing optimally or randomly changes the collected reward (the comparison is written out after the reading list below).
[link]
Lecture Notes:
Further Reads:
- k-armed Bandit: Chapter 2 - Section 2.1 of [SB]
- Robbins’ Paper: “Some aspects of the sequential design of experiments” by H. Robbins, published in the Bulletin of the American Mathematical Society in 1952, which formulates the multi-armed bandit problem as we know it today
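As a quick reference for this lecture, the gap between the two strategies can be written down directly: with k arms whose (unknown) mean rewards are μ_1, ..., μ_k, the per-round expected reward is (notation is ours, following the convention of [SB]):
    \mathbb{E}[R_{\mathrm{random}}] = \frac{1}{k} \sum_{a=1}^{k} \mu_a
    \qquad \text{vs.} \qquad
    \mathbb{E}[R_{\mathrm{optimal}}] = \max_{a} \mu_a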
-
Lecture 3: Exploiting Explorations in Multi-armed Bandit
Bandit - Part II: We now take a look at a mixed strategy: first explore the environment for a limited time, and then exploit what was learned during this exploration. We see that we can get very close to the optimal reward if we play long enough (a short sketch of this strategy follows the reading list below).
[link]
Lecture Notes:
Further Reads:
- k-armed Bandit: Chapter 2 - Section 2.1 of [SB]
- Robbins’ Paper: “Some aspects of the sequential design of experiments” by H. Robbins, published in the Bulletin of the American Mathematical Society in 1952, which formulates the multi-armed bandit problem as we know it today
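A minimal sketch of the explore-then-exploit strategy discussed in this lecture, assuming Gaussian arms; the horizon, exploration budget, and variable names are our own illustrative choices, not course code.
    import numpy as np

    rng = np.random.default_rng(0)
    k, T, T_explore = 10, 10_000, 500               # arms, horizon, exploration budget
    mu = rng.normal(0.0, 1.0, size=k)               # unknown true mean reward of each arm

    counts, sums, total = np.zeros(k), np.zeros(k), 0.0
    for t in range(T):
        if t < T_explore:                           # exploration phase: try arms uniformly at random
            a = int(rng.integers(k))
        else:                                       # exploitation phase: commit to the best estimate
            a = int(np.argmax(sums / np.maximum(counts, 1)))
        r = rng.normal(mu[a], 1.0)                  # noisy reward from the chosen arm
        counts[a] += 1
        sums[a] += r
        total += r

    print(f"average reward: {total / T:.3f}   best arm's mean: {mu.max():.3f}")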
-
Lecture 4: Formulating the RL Framework
Principal RL Problem: We now formulate the RL setting for a general interactive problem. We get to know what the agent is, what the environment is, and how the interactions between them are formulated.
[link]
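A self-contained sketch of the agent-environment interaction loop formulated in this lecture; the two-state toy environment and the random agent below are our own placeholders, not course material.
    import random

    class ToyEnv:
        def reset(self):
            self.state = 0
            return self.state
        def step(self, action):
            reward = 1.0 if action == self.state else 0.0   # reward for "matching" the current state
            self.state = random.choice([0, 1])              # environment moves to a new state
            return self.state, reward

    class RandomAgent:
        def act(self, state):
            return random.choice([0, 1])                    # the agent's (here: random) policy

    env, agent = ToyEnv(), RandomAgent()
    state = env.reset()
    total_reward = 0.0
    for t in range(100):                                    # the interaction loop itself
        action = agent.act(state)                           # agent acts based on the observed state
        state, reward = env.step(action)                    # environment returns new state and reward
        total_reward += reward
    print("collected reward:", total_reward)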
-
Lecture 5: Environment as State-Dependent System
Return and State: We see that the best way to represent the environment mathematically is to look at it as a state-dependent system. This provides us with the Markov property and enables us to concretely formulate the problem of RL.
[link]
-
Lecture 6: Examples of RL Setting
Example: We look into a few sample environments, namely the Tic-Tac-Toe, Cart-Pole, and Trading examples. For each, we formulate the State, Action, and Reward.
[link]
Lecture Notes:
-
Lecture 7: Policy and Its Value
Value - Part I: We mathematically define a policy and see how we can determine the value of a given policy in a target environment.
[link]
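For reference, the two objects defined in this lecture, written in the standard notation of [SB] (policy as a conditional distribution over actions, value as the expected discounted return):
    \pi(a \mid s) = \Pr\{A_t = a \mid S_t = s\},
    \qquad
    v_\pi(s) = \mathbb{E}_\pi\!\left[\, G_t \mid S_t = s \,\right]
    \quad \text{with} \quad
    G_t = \sum_{k \ge 0} \gamma^{k} R_{t+k+1}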
-
Lecture 8: Playing Tic-Tac-Toe
Value - Part II: We take a look at the Tic-Tac-Toe example and play with two different policies. For each policy, we evaluate the value function. We further define the action-value function.
[link]
-
Lecture 9: Optimal Policy
Value - Part III: The value function enables us to define the notion of the Optimal Policy, which concretely formulates the main objective of RL. We give a short overview of different approaches to finding the optimal policy; this also gives a rough categorization of RL approaches (the definition is written out below).
[link]
Lecture Notes:
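For reference, the notion introduced in this lecture, in the standard notation of [SB]: a policy is optimal if it is at least as valuable as every other policy in every state,
    \pi_\ast \ \text{is optimal} \iff v_{\pi_\ast}(s) \ge v_{\pi}(s) \ \ \text{for all } s \text{ and all } \pi,
    \qquad
    v_\ast(s) = \max_{\pi} v_\pi(s)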
-
Lecture 10: Frozen Lake Example -- Terminal State and Episode
Terminal and Episode: We take a look at the Frozen Lake game to practice the definitions of action, state, policy, and value function. Through this example, we get to know the notions of Terminal State and Episode. We further quickly introduce the Gymnasium API for implementation.
[link]
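A minimal sketch of the Gymnasium API mentioned in this lecture, running one episode of Frozen Lake under a random policy; it assumes the gymnasium package is installed, and the random policy is only for illustration.
    import gymnasium as gym

    env = gym.make("FrozenLake-v1", is_slippery=True)
    obs, info = env.reset(seed=0)                 # initial state of the environment
    terminated = truncated = False
    episode_return = 0.0
    while not (terminated or truncated):          # an episode ends in a terminal state
        action = env.action_space.sample()        # random policy, for illustration only
        obs, reward, terminated, truncated, info = env.step(action)
        episode_return += reward
    print("episode return:", episode_return)
    env.close()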
-
Lecture 11: Markov Decision Processes
Model-based I: We start with model-based RL. We see that in this case, we can think of the environment as an MDP. Assuming the MDP is known, we can evaluate a given policy simply using statistical analysis. We look into a few simple examples to make the concept clear.
[link]
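A sketch of how a known MDP can be written down for the model-based lectures: a transition table listing p(s', r | s, a). The two-state example and the layout P[s][a] = [(probability, next_state, reward), ...] are our own illustration (Gymnasium's toy-text environments expose a similar table).
    # A made-up two-state, two-action MDP: P[s][a] = [(probability, next_state, reward), ...]
    P = {
        0: {0: [(0.9, 0, 0.0), (0.1, 1, 1.0)],
            1: [(0.2, 0, 0.0), (0.8, 1, 1.0)]},
        1: {0: [(1.0, 1, 0.0)],
            1: [(0.5, 0, 2.0), (0.5, 1, 0.0)]},
    }

    # Expected one-step reward in state 0 under a uniform policy over the two actions:
    expected_r = sum(0.5 * p * r for a in (0, 1) for p, _, r in P[0][a])
    print(expected_r)   # 0.5 * 0.1 + 0.5 * 0.8 = 0.45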
-
Lecture 12: Value Function Calculation via MDPs -- Naive Approach
Model-based II: We use the known MDP to evaluate the value function of a given policy. We see that this is theoretically feasible; however, it imposes many analytical challenges. This motivates us to look into an alternative approach, i.e., the Bellman equation.
[link]
-
Lecture 13: Bellman Equation
Model-based III: We derive the Bellman equation for the value function. This equation enables us to compute the value function from a set of fixed-point equations without needing to take nested expectations.
[link]
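For reference, the Bellman equation derived in this lecture, in the notation of [SB]: one fixed-point equation per state s,
    v_\pi(s) = \sum_{a} \pi(a \mid s) \sum_{s',\, r} p(s', r \mid s, a) \, \bigl[\, r + \gamma \, v_\pi(s') \,\bigr]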
-
Lecture 14: Bellman Equation for Action-Value and Backup Diagram
Model-based IV: We extend the derivation of the Bellman equation to the action-value function. We see that the Bellman equations can be drawn as a simple graphical diagram, the backup diagram, which relates the value of each state or state-action pair to the other values in the environment.
[link]
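For reference, the corresponding Bellman equation for the action-value function, in the notation of [SB]:
    q_\pi(s, a) = \sum_{s',\, r} p(s', r \mid s, a) \, \Bigl[\, r + \gamma \sum_{a'} \pi(a' \mid s') \, q_\pi(s', a') \,\Bigr]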
-
Lecture 15: Bellman Optimality Equation
Model-based V: We derive the Bellman optimality equation. This equation enables us to compute the optimal value function without needing to know the optimal policy. This further lets us derive the optimal policy for an environment with a known MDP.
[link]
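For reference, the Bellman optimality equation in the notation of [SB]; note that the average over the policy is replaced by a maximum over actions:
    v_\ast(s) = \max_{a} \sum_{s',\, r} p(s', r \mid s, a) \, \bigl[\, r + \gamma \, v_\ast(s') \,\bigr]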
-
Lecture 16: Back-Tracking Optimal Policy
Optimal Policy: Using the Bellman optimality equation, we can backtrack the optimal policy. We learn this algorithm in this lecture. We see that the key issue is to solve for the optimal value function, which motivates us to use dynamic programming.
[link]
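The back-tracking step itself, in the notation of [SB]: once v_* is known, a greedy one-step look-ahead recovers an optimal policy,
    \pi_\ast(s) = \arg\max_{a} \sum_{s',\, r} p(s', r \mid s, a) \, \bigl[\, r + \gamma \, v_\ast(s') \,\bigr]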
-
Lecture 17: Policy Evaluation by Dynamic Programming
Policy Iteration - Part I: We now use dynamic programming to evaluate a given policy. This enables us to develop an iterative algorithm to find values.
[link]
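A minimal sketch of iterative policy evaluation by dynamic programming, run on a made-up two-state MDP stored as P[s][a] = [(probability, next_state, reward), ...]; the policy, discount factor, and tolerance are illustrative choices of ours.
    P = {
        0: {0: [(0.9, 0, 0.0), (0.1, 1, 1.0)], 1: [(0.2, 0, 0.0), (0.8, 1, 1.0)]},
        1: {0: [(1.0, 1, 0.0)],                1: [(0.5, 0, 2.0), (0.5, 1, 0.0)]},
    }
    gamma, theta = 0.9, 1e-8
    pi = {s: {0: 0.5, 1: 0.5} for s in P}            # the (uniform) policy being evaluated
    V = {s: 0.0 for s in P}

    while True:                                      # sweep all states until values stop changing
        delta = 0.0
        for s in P:
            v_new = sum(pi[s][a] * sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                        for a in P[s])
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new                             # in-place (Gauss-Seidel style) update
        if delta < theta:
            break
    print(V)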
-
Lecture 18: Policy Improvement and Policy Iteration
Policy Iteration - Part II: Given the optimality constraint, we can use a simple mechanism to improve a given policy. This idea, together with policy evaluation, leads us to what is known as the Policy Iteration algorithm (a compact sketch follows the reading list below).
[link]
Lecture Notes:
Further Reads:
- Policy Improvement and Iteration: Chapter 4 - Sections 4.2 and 4.3 of [SB]
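A compact sketch of the resulting Policy Iteration algorithm (evaluation followed by greedy improvement, repeated until the policy is stable), again on a made-up two-state MDP; all numbers and tolerances are illustrative.
    P = {
        0: {0: [(0.9, 0, 0.0), (0.1, 1, 1.0)], 1: [(0.2, 0, 0.0), (0.8, 1, 1.0)]},
        1: {0: [(1.0, 1, 0.0)],                1: [(0.5, 0, 2.0), (0.5, 1, 0.0)]},
    }
    gamma, theta = 0.9, 1e-8

    def q(s, a, V):                                   # one-step look-ahead value of (s, a)
        return sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])

    pi = {s: 0 for s in P}                            # start from an arbitrary deterministic policy
    V = {s: 0.0 for s in P}
    while True:
        while True:                                   # policy evaluation for the current policy
            delta = 0.0
            for s in P:
                v_new = q(s, pi[s], V)
                delta, V[s] = max(delta, abs(v_new - V[s])), v_new
            if delta < theta:
                break
        stable = True                                 # policy improvement: act greedily w.r.t. V
        for s in P:
            best = max(P[s], key=lambda a: q(s, a, V))
            if best != pi[s]:
                pi[s], stable = best, False
        if stable:                                    # no state changed its action: converged
            break
    print(pi, V)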
-
Lecture 19: Value Iteration
Value Iteration: We can use dynamic programming to directly solve the Bellman optimality equation and then backtrack the optimal policy. This is what we call Value Iteration, and we see it in this lecture.
[link]
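A minimal sketch of Value Iteration on the same kind of toy MDP: solve the Bellman optimality equation by repeated backups, then backtrack the greedy policy; all numbers are illustrative.
    P = {
        0: {0: [(0.9, 0, 0.0), (0.1, 1, 1.0)], 1: [(0.2, 0, 0.0), (0.8, 1, 1.0)]},
        1: {0: [(1.0, 1, 0.0)],                1: [(0.5, 0, 2.0), (0.5, 1, 0.0)]},
    }
    gamma, theta = 0.9, 1e-8
    V = {s: 0.0 for s in P}

    def q(s, a):                                      # one-step look-ahead value of (s, a)
        return sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])

    while True:
        delta = 0.0
        for s in P:
            v_new = max(q(s, a) for a in P[s])        # Bellman optimality backup
            delta, V[s] = max(delta, abs(v_new - V[s])), v_new
        if delta < theta:
            break
    pi = {s: max(P[s], key=lambda a: q(s, a)) for s in P}   # backtrack the greedy policy
    print(pi, V)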
-
Lecture 20: Generalized Policy Iteration
Generalized Policy Iteration: We compare the policy and value iteration algorithms. While VI can converge faster, PI can give more robust results in realistic environments. This motivates what is known as Generalized Policy Iteration.
[link]
Lecture Notes:
Further Reads:
- Generalized Policy Iteration: Chapter 4 - Sections 4.6 and 4.7 of [SB]
-
Lecture 21: Model-free Policy Evaluation via Monte-Carlo
Monte-Carlo - Part I: We start with model-free approaches to RL. As the first stop, we look into the Monte-Carlo approach and see how we can use it to estimate the value function.
[link]
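A minimal sketch of first-visit Monte-Carlo policy evaluation for a random policy on Frozen Lake, using the Gymnasium API from Lecture 10; the discount factor and episode count are illustrative, and the course may use a different variant (e.g., every-visit MC).
    import gymnasium as gym
    from collections import defaultdict

    env = gym.make("FrozenLake-v1", is_slippery=True)
    gamma, n_episodes = 0.99, 20_000
    returns_sum, returns_cnt = defaultdict(float), defaultdict(int)

    for _ in range(n_episodes):
        obs, _ = env.reset()
        episode, done = [], False
        while not done:                               # sample one full episode
            action = env.action_space.sample()        # the (random) policy being evaluated
            next_obs, reward, terminated, truncated, _ = env.step(action)
            episode.append((obs, reward))             # (state, reward received after leaving it)
            obs, done = next_obs, terminated or truncated
        first_visit = {}
        for t, (s, _) in enumerate(episode):
            first_visit.setdefault(s, t)              # first time step each state was visited
        G = 0.0
        for t in reversed(range(len(episode))):       # backward pass: accumulate discounted returns
            s, r = episode[t]
            G = r + gamma * G
            if first_visit[s] == t:                   # first-visit update only
                returns_sum[s] += G
                returns_cnt[s] += 1

    V = {s: returns_sum[s] / returns_cnt[s] for s in returns_sum}
    print(V.get(0, 0.0))                              # MC estimate of the start state's value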
-
Lecture 22: GPI via Monte-Carlo
Monte-Carlo - Part II: We now use the MC evaluation algorithm to develop a generalized policy iteration (GPI) loop, which implements GPI by using the MC estimate to evaluate the policy in each iteration.
[link]
-
Lecture 23: Bootstrapping
TD - Part I: Using the recursive property of the value function, we can develop an alternative estimator for the value function via bootstrapping, which estimates the value of each state from the estimates of the other states.
[link]
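The bootstrapped update introduced here is, in the notation of [SB], the TD(0) rule (α denotes a step size): the value of the current state is nudged toward a target built from the estimate of the next state,
    V(S_t) \leftarrow V(S_t) + \alpha \, \bigl[\, R_{t+1} + \gamma \, V(S_{t+1}) - V(S_t) \,\bigr]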
-
Lecture 24: GPI via Temporal Difference
TD - Part II: Using bootstrapping, we can evaluate a policy. This approach is called temporal difference (TD) and can lead to better sample efficiency than MC. We learn how to implement GPI with this approach and discuss its bias-variance trade-off compared with the MC approach.
[link]
-
Lecture 25: Deep Bootstrapping and TD-n
TD - Part III: We extend the idea of bootstrapping to estimate values from deeper temporal differences. This is called TD-n. We study the impact of depth on the learning quality.
[link]
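For reference, the n-step (TD-n) target and update, in the notation of [SB]: bootstrap only after n real rewards,
    G_{t:t+n} = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{\,n-1} R_{t+n} + \gamma^{\,n} V(S_{t+n}),
    \qquad
    V(S_t) \leftarrow V(S_t) + \alpha \, \bigl[\, G_{t:t+n} - V(S_t) \,\bigr]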
-
Lecture 26: TD-λ
TD - Part IV: We can improve sample efficiency by averaging TD over depth. This is called TD-λ. We learn this approach and understand its implications.
[link]
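The depth-averaged target behind TD-λ, in the notation of [SB], where G_{t:t+n} is the n-step return from the previous lecture:
    G_t^{\lambda} = (1 - \lambda) \sum_{n = 1}^{\infty} \lambda^{\,n-1} \, G_{t:t+n}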
-
Lecture 27: TD with Eligibility Tracing
TD - Part V: TD-λ is not causal and hence not well suited for online control. We therefore study its causal equivalent, known as TD with eligibility tracing. It enables us to implement TD-λ efficiently without needing to know the future rewards in the trajectory (the update rule is written out after the reading list below).
[link]
Further Reads:
- Eligibility Tracing: Chapter 12 - Sections 12.4 and 12.5 of [SB]
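For reference, the causal (backward-view) update in its standard tabular form, in the notation of [SB], using an accumulating eligibility trace E_t (the course's exact formulation may differ):
    \delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t),
    \qquad
    E_t(s) = \gamma \lambda \, E_{t-1}(s) + \mathbf{1}\{S_t = s\},
    \qquad
    V(s) \leftarrow V(s) + \alpha \, \delta_t \, E_t(s) \quad \text{for all } s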
-
Lecture 28: Control Loop with Monte Carlo
MC Control: We now move to the problem of online RL: we want to learn the optimal policy while sampling, not by basic policy iteration, which evaluates policies. We understand the difference and develop our first basic loop, i.e., the Monte-Carlo control loop.
[link]
-
Lecture 29: Adding Exploration to Control Loop
MC Control: We discuss the ε-greedy approach, which enables us to add exploration to our control loop. We show that this approach still improves the policy and hence can ultimately converge to the optimal policy. We also define the notion of a GLIE algorithm.
[link]
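A minimal sketch of ε-greedy action selection over an action-value table Q; the table size, ε, and the 1/k decay schedule shown for the GLIE property are our own illustrative choices.
    import numpy as np

    rng = np.random.default_rng(0)
    n_states, n_actions, epsilon = 16, 4, 0.1
    Q = np.zeros((n_states, n_actions))               # action-value estimates (here: all zero)

    def epsilon_greedy(state, eps=epsilon):
        if rng.random() < eps:                        # explore: uniformly random action w.p. ε
            return int(rng.integers(n_actions))
        return int(np.argmax(Q[state]))               # exploit: greedy action w.p. 1 - ε

    # A GLIE-style schedule decays ε over episodes, e.g. ε_k = 1 / k:
    for k in range(1, 6):
        print(k, epsilon_greedy(state=0, eps=1.0 / k))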