Lectures

Here you can find the recordings of the lectures.

  • Lecture 0: Course Overview and Logistics
    Overview: In this lecture, we go through the course logistics. The audio quality is poor, as the teaching station did not work.
    [link]

    Lecture Notes:

  • Lecture 1: RL as a Learning Problem
    Introduction: We introduce the notion of reinforcement learning and understand how it differs in nature from classical learning tasks.
    [link]

    Lecture Notes:

    Further Reads:

  • Lecture 2: Optimal and Random Playing of Multi-armed Bandit
    Bandit - Part I: We take a look at a first example, the multi-armed bandit problem, and see how playing optimally or randomly changes the collected reward; a small simulation sketch is included at the end of this entry.
    [link]

    Lecture Notes:

    Further Reads:

    • k-armed Bandit: Chapter 2 - Section 2.1 of [SB]
    • Robbins’ Paper: The paper Some aspects of the sequential design of experiments by H. Robbins, published in the Bulletin of the American Mathematical Society in 1952, which formulates the multi-armed bandit problem as we know it today
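
    As a quick companion to this lecture, here is a minimal simulation sketch (not taken from the course material) comparing purely random play against an oracle that always pulls the best arm of a Gaussian 3-armed bandit; the arm means and horizon are made-up illustration values.

    ```python
    import numpy as np

    rng = np.random.default_rng(0)
    arm_means = np.array([0.1, 0.5, 0.9])   # hypothetical true mean rewards of k = 3 arms
    horizon = 10_000                        # number of pulls

    # Random play: pick an arm uniformly at random at every step.
    random_arms = rng.integers(len(arm_means), size=horizon)
    random_total = rng.normal(arm_means[random_arms], 1.0).sum()

    # Optimal play: always pull the arm with the largest true mean (known only to an oracle).
    optimal_total = rng.normal(arm_means.max(), 1.0, size=horizon).sum()

    print(f"average reward, random play : {random_total / horizon:.3f}")
    print(f"average reward, optimal play: {optimal_total / horizon:.3f}")
    ```
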
  • Lecture 3: Exploiting Explorations in Multi-armed Bandit
    Bandit - Part II: We now take a look at a mixed strategy: first explore the environment for a limited time, and then exploit this exploration for the rest of the play. We see that we can get very close to the optimal reward if we play for a long enough time; a small sketch of this explore-then-exploit strategy follows below.
    [link]

    Lecture Notes:

    Further Reads:

    • k-armed Bandit: Chapter 2 - Section 2.1 of [SB]
    • Robbins’ Paper: The paper Some aspects of the sequential design of experiments by H. Robbins, published in the Bulletin of the American Mathematical Society in 1952, which formulates the multi-armed bandit problem as we know it today
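
    A minimal sketch of the explore-then-exploit idea discussed in this lecture (all numbers are illustrative assumptions): explore each arm a fixed number of times, then commit to the empirically best arm for the remaining pulls.

    ```python
    import numpy as np

    rng = np.random.default_rng(1)
    arm_means = np.array([0.1, 0.5, 0.9])       # hypothetical true mean rewards
    horizon, pulls_per_arm = 10_000, 100        # illustrative horizon and exploration budget
    k = len(arm_means)

    # Exploration phase: pull every arm the same number of times and estimate its mean.
    explore_rewards = [rng.normal(arm_means[a], 1.0, size=pulls_per_arm) for a in range(k)]
    estimates = np.array([r.mean() for r in explore_rewards])

    # Exploitation phase: commit to the empirically best arm for the rest of the horizon.
    best_arm = int(np.argmax(estimates))
    exploit_rewards = rng.normal(arm_means[best_arm], 1.0, size=horizon - k * pulls_per_arm)

    total = sum(r.sum() for r in explore_rewards) + exploit_rewards.sum()
    print(f"average reward, explore-then-exploit: {total / horizon:.3f}")
    ```
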
  • Lecture 4: Formulating the RL Framework
    Principal RL Problem: We now formulate the RL setting for a general interactive problem. We get to know what the agent is, what the environment is, and how the interaction between them is formulated; a minimal sketch of this interaction loop is included below.
    [link]

    Lecture Notes:

    Further Reads:
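
    A minimal sketch of the agent-environment interaction loop formulated in this lecture; the toy agent and environment below are hypothetical stand-ins, not part of the course material.

    ```python
    import random

    class RandomAgent:
        """A placeholder agent that ignores the observed state and acts at random."""
        def __init__(self, actions):
            self.actions = actions

        def act(self, state):
            return random.choice(self.actions)

    class CoinFlipEnv:
        """A toy environment: the agent guesses a coin flip and is rewarded when correct."""
        def reset(self):
            return 0                         # a single dummy state

        def step(self, action):
            outcome = random.randint(0, 1)
            reward = 1.0 if action == outcome else 0.0
            return 0, reward                 # (next state, reward)

    # The generic interaction loop: the agent observes the state and picks an action;
    # the environment responds with the next state and a reward.
    env, agent = CoinFlipEnv(), RandomAgent(actions=[0, 1])
    state, total_reward = env.reset(), 0.0
    for t in range(100):
        action = agent.act(state)
        state, reward = env.step(action)
        total_reward += reward
    print("total reward over 100 steps:", total_reward)
    ```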

  • Lecture 5: Environment as State-Dependent System
    Return and State: We see that the best way to represent the environment mathematically is to look at it as a state-dependent system. This provides us with the Markov property and enables us to concretely formulate the problem of RL; the discounted return is written out below.
    [link]

    Lecture Notes:

    Further Reads:
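
    The discounted return mentioned in this lecture, written in the standard textbook notation of [SB]:

    ```latex
    G_t \;=\; R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots
          \;=\; \sum_{k=0}^{\infty} \gamma^k R_{t+k+1},
    \qquad 0 \le \gamma \le 1 .
    ```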

  • Lecture 6: Examples of RL Setting
    Example: We look into a few sample environments, namely the Tic-Tac-Toe, Cart-Pole, and Trading examples. For each, we formulate the State, Action, and Reward.
    [link]

    Lecture Notes:

  • Lecture 7: Policy and Its Value
    Value - Part I: We mathematically define a policy and see how we can determine the value of a given policy in a target environment; the standard definition is recalled below.
    [link]

    Lecture Notes:

    Further Reads:
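
    The standard definition recalled in this lecture, in the notation of [SB]: the value of a state under policy π is the expected return when starting from that state and following π.

    ```latex
    v_\pi(s) \;=\; \mathbb{E}_\pi\!\left[ G_t \mid S_t = s \right]
             \;=\; \mathbb{E}_\pi\!\left[ \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\middle|\, S_t = s \right].
    ```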

  • Lecture 8: Playing Tic-Tac-Toe
    Value - Part II: We take a look at the Tic-Tac-Toe example and play with two different policies. For each policy, we evaluate the value function. We further define the action-value function.
    [link]

    Lecture Notes:

    Further Reads:

  • Lecture 9: Optimal Policy
    Example: The value function enables us to define the notion of Optimal Policy. This concretely formulates the main objective in RL. We give a short overview of the different approaches to finding the optimal policy, which indeed gives a rough categorization of RL approaches.
    [link]

    Lecture Notes:

  • Lecture 10: Frozen Lake Example -- Terminal State and Episode
    Terminal and Episode: We take a look at the Frozen Lake game to practice the definitions of action, state, policy, and value function. Through this example, we get to know the notions of Terminal State and Episode. We further give a quick introduction to the Gymnasium API for implementation; a minimal example is included below.
    [link]

    Lecture Notes:

    Further Reads:
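
    A minimal Gymnasium sketch for this lecture, assuming the gymnasium package is installed: one episode of FrozenLake under a random policy, ending when a terminal state is reached or the episode is truncated.

    ```python
    import gymnasium as gym

    env = gym.make("FrozenLake-v1", is_slippery=True)
    state, info = env.reset(seed=0)

    episode_return, done = 0.0, False
    while not done:
        action = env.action_space.sample()                    # random policy
        state, reward, terminated, truncated, info = env.step(action)
        episode_return += reward
        done = terminated or truncated                        # terminal state or step limit

    print("episode return:", episode_return)
    env.close()
    ```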

  • Lecture 11: Markov Decision Processes
    Model-based I: We start with model-based RL. We see that in this case, we can think of the environment as an MDP. Assuming the MDP is known, we can evaluate a given policy simply using statistical analysis. We look into a few simple examples to make the concept clear; a toy MDP in code is sketched below.
    [link]

    Lecture Notes:

    Further Reads:
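
    A toy, hand-made MDP as a sketch of the model-based view in this lecture; the states, actions, and probabilities are made up purely for illustration.

    ```python
    # Transition model: P[state][action] = list of (probability, next_state, reward).
    P = {
        "s0": {"stay": [(1.0, "s0", 0.0)],
               "go":   [(0.8, "s1", 1.0), (0.2, "s0", 0.0)]},
        "s1": {"stay": [(1.0, "s1", 2.0)],
               "go":   [(1.0, "s0", 0.0)]},
    }

    # With the MDP known, expectations follow directly from the model, e.g. the
    # expected immediate reward of taking an action in a state:
    def expected_reward(state, action):
        return sum(prob * reward for prob, _, reward in P[state][action])

    print(expected_reward("s0", "go"))   # 0.8 * 1.0 + 0.2 * 0.0 = 0.8
    ```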

  • Lecture 12: Value Function Calculation via MDPs -- Naive Approach
    Model-based II: We use the known MDP to evaluate the value function of a given policy. We see that this is theoretically feasible; however, it imposes a lot of analytical challenges. This motivates us to look into an alternative approach, i.e., the Bellman equation.
    [link]

    Lecture Notes:

    Further Reads:

  • Lecture 13: Bellman Equation
    Model-based III: We derive the Bellman equation for the value function. This equation enables us to compute the value function from a set of fixed-point equations without needing to take nested expectations.
    [link]

    Lecture Notes:

    Further Reads:
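
    The Bellman equation derived in this lecture, in the notation of [SB]: for every state, the value under π is a fixed point of

    ```latex
    v_\pi(s) \;=\; \sum_{a} \pi(a \mid s) \sum_{s',\, r} p(s', r \mid s, a)
                   \bigl[\, r + \gamma\, v_\pi(s') \,\bigr]
    \qquad \text{for all } s .
    ```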

  • Lecture 14: Bellman Equation for Action-Value and Backup Diagram
    Model-based IV: We extend the derivation of the Bellman equation to the action-value function. We see that the Bellman equations describe a simple graphical diagram, the backup diagram, which relates the value of each state or state-action pair to the other values in the environment.
    [link]

    Lecture Notes:

    Further Reads:

  • Lecture 15: Bellman Optimality Equation
    Model-based V: We derive the Bellman optimality equation, written out below. This equation enables us to compute the optimal value function without needing to know the optimal policy. It further lets us derive the optimal policy for an environment with a known MDP.
    [link]

    Lecture Notes:

    Further Reads:
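
    The Bellman optimality equations discussed in this lecture, in the notation of [SB], for the optimal value and action-value functions:

    ```latex
    v_*(s) \;=\; \max_{a} \sum_{s',\, r} p(s', r \mid s, a) \bigl[\, r + \gamma\, v_*(s') \,\bigr],
    \qquad
    q_*(s, a) \;=\; \sum_{s',\, r} p(s', r \mid s, a) \bigl[\, r + \gamma \max_{a'} q_*(s', a') \,\bigr].
    ```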

  • Lecture 16: Back-Tracking Optimal Policy
    Optimal Policy: Using the Bellman optimality equation, we can backtrack the optimal policy. We learn this algorithm in this lecture. We see that the key issue is to find the optimal value function, which motivates us to use dynamic programming.
    [link]

    Lecture Notes:

    Further Reads:

  • Lecture 17: Policy Evaluation by Dynamic Programming
    Policy Iteration - Part I: We now use dynamic programming to evaluate a given policy. This enables us to develop an iterative algorithm to find the values; a minimal sketch is included below.
    [link]

    Lecture Notes:

    Further Reads:
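
    A minimal sketch of iterative policy evaluation on a made-up three-state chain MDP; the transition model, rewards, and the uniform policy below are illustrative assumptions.

    ```python
    # P[s][a] = list of (probability, next_state, reward); state 2 is terminal.
    P = {
        0: {"left": [(1.0, 0, 0.0)], "right": [(1.0, 1, 0.0)]},
        1: {"left": [(1.0, 0, 0.0)], "right": [(1.0, 2, 1.0)]},
        2: {"left": [(1.0, 2, 0.0)], "right": [(1.0, 2, 0.0)]},
    }
    policy = {s: {"left": 0.5, "right": 0.5} for s in P}       # uniform random policy
    gamma, theta = 0.9, 1e-8

    # Sweep the Bellman expectation update over all states until the values settle.
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            v_new = sum(policy[s][a] * sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                        for a in P[s])
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            break

    print(V)
    ```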

  • Lecture 18: Policy Improvement and Policy Iteration
    Policy Iteration - Part II: Given the optimality constraint, we can use a simple mechanism to improve a given policy. This idea, along with policy evaluation, leads us to what is known as the Policy Iteration algorithm, sketched below.
    [link]

    Lecture Notes:

    Further Reads:
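
    A minimal sketch of the policy iteration loop on the same style of toy MDP (all numbers illustrative): evaluate the current policy, improve it greedily, and stop once the policy no longer changes.

    ```python
    # P[s][a] = list of (probability, next_state, reward); state 2 is terminal.
    P = {
        0: {"left": [(1.0, 0, 0.0)], "right": [(1.0, 1, 0.0)]},
        1: {"left": [(1.0, 0, 0.0)], "right": [(1.0, 2, 1.0)]},
        2: {"left": [(1.0, 2, 0.0)], "right": [(1.0, 2, 0.0)]},
    }
    gamma, theta = 0.9, 1e-8

    def evaluate(policy):
        """Iterative policy evaluation for a deterministic policy {state: action}."""
        V = {s: 0.0 for s in P}
        while True:
            delta = 0.0
            for s in P:
                v_new = sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][policy[s]])
                delta = max(delta, abs(v_new - V[s]))
                V[s] = v_new
            if delta < theta:
                return V

    def improve(V):
        """Policy improvement: act greedily with respect to the current value estimate."""
        return {s: max(P[s], key=lambda a: sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a]))
                for s in P}

    policy = {s: "left" for s in P}
    while True:
        V = evaluate(policy)
        new_policy = improve(V)
        if new_policy == policy:                 # policy is stable -> stop
            break
        policy = new_policy

    print(policy, V)
    ```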

  • Lecture 19: Value Iteration
    Value Iteration: We can use dynamic programming to directly solve the Bellman optimality equation and then backtrack the optimal policy. This is what we call Value Iteration, and we see it in this lecture; a short sketch follows below.
    [link]

    Lecture Notes:

    Further Reads:
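
    A minimal sketch of value iteration on the same style of toy MDP (all numbers illustrative): solve the Bellman optimality equation by repeated sweeps, then backtrack the greedy policy.

    ```python
    # P[s][a] = list of (probability, next_state, reward); state 2 is terminal.
    P = {
        0: {"left": [(1.0, 0, 0.0)], "right": [(1.0, 1, 0.0)]},
        1: {"left": [(1.0, 0, 0.0)], "right": [(1.0, 2, 1.0)]},
        2: {"left": [(1.0, 2, 0.0)], "right": [(1.0, 2, 0.0)]},
    }
    gamma, theta = 0.9, 1e-8

    # Value iteration: apply the Bellman optimality update until the values stop changing.
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            v_new = max(sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a]) for a in P[s])
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            break

    # Backtrack the optimal policy by acting greedily with respect to the optimal values.
    policy = {s: max(P[s], key=lambda a: sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a]))
              for s in P}
    print(V, policy)
    ```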

  • Lecture 20: Generalized Policy Iteration
    Value Iteration: We compare the policy iteration and value iteration algorithms. While VI can converge faster, PI can give more robust results in realistic environments. This motivates what is known as Generalized Policy Iteration.
    [link]

    Lecture Notes:

    Further Reads:

  • Lecture 21: Model-free Policy Evaluation via Monte-Carlo
    Monte-Carlo - Part I: We start with model-free approaches for RL. As a first stop, we look into the Monte-Carlo approach and see how we can use it to estimate the value function; a minimal sketch is given below.
    [link]

    Lecture Notes:

    Further Reads:
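
    A minimal sketch of (every-visit) Monte-Carlo value estimation for a random policy on FrozenLake, assuming the gymnasium package is installed; the discount factor and episode count are illustrative.

    ```python
    import gymnasium as gym
    from collections import defaultdict

    env = gym.make("FrozenLake-v1", is_slippery=True)
    gamma, n_episodes = 0.99, 5000
    returns = defaultdict(list)                       # sampled returns per state

    for _ in range(n_episodes):
        # Sample one full episode under the random policy.
        trajectory, done = [], False
        state, _ = env.reset()
        while not done:
            action = env.action_space.sample()
            next_state, reward, terminated, truncated, _ = env.step(action)
            trajectory.append((state, reward))
            state, done = next_state, terminated or truncated

        # Walk backwards, accumulating the discounted return of every visited state.
        G = 0.0
        for state, reward in reversed(trajectory):
            G = reward + gamma * G
            returns[state].append(G)

    V = {s: sum(g) / len(g) for s, g in returns.items()}   # MC estimate = sample average
    print(V)
    ```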

  • Lecture 22: GPI via Monte-Carlo
    Monte-Carlo - Part II: We now use the MC evaluation algorithm to develop a generalized policy iteration (GPI) loop, which implements GPI by using the MC estimate to evaluate the policy in each iteration.
    [link]

    Lecture Notes:

    Further Reads:

  • Lecture 23: Bootstrapping
    TD - Part I: Using the recursive property of the value function, we can develop an alternative estimator for the value function via bootstrapping, which estimates the value of each state from the estimates of the other states; a minimal TD(0) sketch is given below.
    [link]

    Lecture Notes:

    Further Reads:
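
    A minimal TD(0) sketch on FrozenLake (gymnasium assumed installed; step size, discount, and episode count are illustrative): each observed transition updates V(s) towards the bootstrapped target r + γ·V(s').

    ```python
    import gymnasium as gym
    from collections import defaultdict

    env = gym.make("FrozenLake-v1", is_slippery=True)
    gamma, alpha, n_episodes = 0.99, 0.05, 5000
    V = defaultdict(float)

    for _ in range(n_episodes):
        state, _ = env.reset()
        done = False
        while not done:
            action = env.action_space.sample()                # random policy
            next_state, reward, terminated, truncated, _ = env.step(action)
            target = reward + gamma * (0.0 if terminated else V[next_state])
            V[state] += alpha * (target - V[state])           # TD(0) update
            state, done = next_state, terminated or truncated

    print(dict(V))
    ```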

  • Lecture 24: GPI via Temporal Difference
    TD - Part II: Using bootstrapping, we can evaluate a policy. This approach is called temporal difference (TD) learning, which can lead to better sample efficiency compared to MC. We learn how to implement GPI with this approach and discuss its bias-variance trade-off compared with the MC approach.
    [link]

    Further Reads:

    • TD-0: Chapter 6 - Sections 6.2 and 6.3 of [SB]
  • Lecture 25: Deep Bootstrapping and TD-n
    TD - Part III: We extend the idea of bootstrapping to estimate values from deeper temporal differences. This is called TD-n. We study the impact of depth on the learning quality.
    [link]

    Further Reads:

    • TD-n: Chapter 7 - Sections 7.1 and 7.2 of [SB]
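
    The n-step (TD-n) target discussed in this lecture, in the notation of [SB] Chapter 7: the first n rewards are used, and the tail is bootstrapped from the current value estimate.

    ```latex
    G_{t:t+n} \;=\; R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{\,n-1} R_{t+n}
                    + \gamma^{\,n}\, V(S_{t+n}),
    \qquad
    V(S_t) \,\leftarrow\, V(S_t) + \alpha \bigl[\, G_{t:t+n} - V(S_t) \,\bigr].
    ```
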
  • Lecture 26: TD-λ
    TD - Part III: We can improve sample efficiency by averaging TD over depth. This is called TD-λ. We learn this approach and understand its implications.
    [link]

    Further Reads:

  • Lecture 27: TD with Eligibility Tracing
    TD - Part IV: TD-λ is not causal and hence not very efficient for online control. We therefore study its causal equivalent, known as TD with eligibility traces. It enables us to implement TD-λ efficiently without needing to know the future rewards in the trajectory; a small sketch is included below.
    [link]

    Further Reads:
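
    A minimal sketch of TD(λ) policy evaluation with accumulating eligibility traces on FrozenLake (gymnasium assumed installed; λ, step size, discount, and episode count are illustrative): each TD error is credited back, online, to all previously visited states in proportion to their decaying traces.

    ```python
    import gymnasium as gym
    from collections import defaultdict

    env = gym.make("FrozenLake-v1", is_slippery=True)
    gamma, alpha, lam, n_episodes = 0.99, 0.05, 0.9, 5000
    V = defaultdict(float)

    for _ in range(n_episodes):
        traces = defaultdict(float)                  # eligibility trace per state
        state, _ = env.reset()
        done = False
        while not done:
            action = env.action_space.sample()       # random policy
            next_state, reward, terminated, truncated, _ = env.step(action)
            td_error = reward + gamma * (0.0 if terminated else V[next_state]) - V[state]
            traces[state] += 1.0                     # accumulating trace for the visited state
            for s in traces:
                V[s] += alpha * td_error * traces[s] # causal, online credit assignment
                traces[s] *= gamma * lam             # decay every trace
            state, done = next_state, terminated or truncated

    print(dict(V))
    ```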

  • Lecture 28: Control Loop with Monte Carlo
    MC Control: We now move to the problem of online RL. We want to learn the optimal policy while sampling, rather than by basic policy iteration, which evaluates given policies. We understand the difference and develop our first basic loop, i.e., the Monte-Carlo control loop.
    [link]

    Further Reads:

  • Lecture 29: Adding Exploration to Control Loop
    MC Control: We discuss the ε-greedy approach, which enables us to add exploration to our control loop. We show that this approach still improves the policy and hence can ultimately converge to the optimal policy. We also define the notion of a GLIE algorithm; a small ε-greedy sketch is included below.
    [link]

    Further Reads:
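
    A minimal sketch of ε-greedy action selection with a decaying ε (a simple GLIE-style schedule); the action values and schedule below are made-up illustration values.

    ```python
    import numpy as np

    rng = np.random.default_rng(0)

    def epsilon_greedy(q_values, epsilon):
        """Pick a random action with probability epsilon, otherwise the greedy one."""
        if rng.random() < epsilon:
            return int(rng.integers(len(q_values)))
        return int(np.argmax(q_values))

    # Decaying epsilon over episodes so that exploration vanishes in the limit while
    # every action keeps being tried -- one ingredient of a GLIE schedule.
    q = np.array([0.2, 0.5, 0.1])                    # made-up action values for one state
    for episode in range(1, 6):
        epsilon = 1.0 / episode                      # e.g. epsilon_k = 1 / k
        print(episode, epsilon, epsilon_greedy(q, epsilon))
    ```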