Reinforcement Learning / Fall 2025
Updates
- New Assignment released: [Assignment #2 - Tabular RL]
- New Lecture is up: Lecture 29: Adding Exploration to the Control Loop
- New Lecture is up: Lecture 28: Control Loop with Monte Carlo
- New Lecture is up: Lecture 27: TD with Eligibility Traces
- New Lecture is up: Lecture 26: TD-λ
- New Lecture is up: Lecture 25: Deep Bootstrapping and TD-n
- New Lecture is up: Lecture 24: GPI via Temporal Difference
For the course's Quercus page, please click here.
Course Description
This course provides a concrete understanding of reinforcement learning and its applications. Its ultimate goal is to develop hands-on skills in deep reinforcement learning: the fundamentals of reinforcement learning are discussed first, and deep reinforcement learning algorithms are studied afterwards. The course is designed in three major parts. Part I welcomes students by taking them through the basic definitions and fundamental concepts. Part II explains fundamental reinforcement learning methods, touching on the key model-based and model-free techniques and providing a deep understanding of each. Part III explores deep reinforcement learning, where deep neural networks are employed to efficiently approximate the techniques developed in Part II; this part looks into several algorithms, such as deep Q-learning, policy gradient methods (e.g., trust-region and proximal policy optimization), and actor-critic methods.
Time and Place
Lectures
Lectures start on September 2, 2025. Please note that the lecture hall differs between Tuesdays and Fridays.
| Day | Time | Place |
| --- | --- | --- |
| Tuesdays | 5 PM - 7 PM | BA-1170, Bahen Centre for Information Technology |
| Fridays | 5 PM - 7 PM | BA-1180, Bahen Centre for Information Technology |
Tutorials
Tutorial sessions start on September 16, 2025.
| Day | Time | Place |
| --- | --- | --- |
| Tuesdays | 4 PM - 5 PM | BA-1160, Bahen Centre for Information Technology |
Course Office Hours
| Day | Time |
| --- | --- |
| Thursdays | 12 PM - 1 PM |
Course Outline
Part I: First Things in Reinforcement Learning
- General framework of reinforcement learning
  - The multi-armed bandit problem
  - Components: Agent, Environment, State, Action, Reward, Policy
  - Comparison to supervised learning
  - Value function and policy design
  - The problem of credit assignment
- Exploration versus exploitation
  - Revisiting the multi-armed bandit problem
  - The trade-off between exploration and exploitation
- Introduction to the Gymnasium library
  - Generating an environment in Gymnasium
  - Our first try: a simple game (see the sketch after this list)
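As a taste of the Gymnasium material above, here is a minimal sketch of creating an environment and playing one episode with random actions. The environment name `FrozenLake-v1` is only an illustrative stand-in; the course's own "simple game" may differ.

```python
# Minimal Gymnasium loop: reset an environment and act randomly until the
# episode ends. "FrozenLake-v1" is an illustrative choice of environment.
import gymnasium as gym

env = gym.make("FrozenLake-v1")
obs, info = env.reset(seed=0)
done = False
while not done:
    action = env.action_space.sample()   # random policy, for now
    obs, reward, terminated, truncated, info = env.step(action)
    done = terminated or truncated
env.close()
```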
Part II: Fundamentals of Reinforcement Learning
- Model-based reinforcement learning
  - Markov Decision Processes (MDPs)
  - Value and policy with MDPs
  - Dynamic programming and the Bellman equation
  - Value iteration and policy iteration algorithms
- Model-free reinforcement learning
  - On-policy versus off-policy approaches
  - Differences and properties of on-policy and off-policy methods
  - On-policy approach 1: Monte Carlo (MC) learning
  - On-policy approach 2: Temporal Difference (TD) learning
  - From value function to Q-function
  - On-policy approach 3: State-Action-Reward-State-Action (SARSA)
  - Off-policy approach: Q-learning (see the sketch after this list)
- Revisiting our simple game
  - Implementing value and policy iteration in Gymnasium
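To make the tabular methods concrete, below is a hedged sketch of Q-learning, the off-policy method listed above, on a small Gymnasium environment. The environment and hyperparameters are illustrative assumptions, not the course's official settings.

```python
# Tabular Q-learning sketch: epsilon-greedy behavior policy, greedy target.
import numpy as np
import gymnasium as gym

env = gym.make("FrozenLake-v1")                  # illustrative environment
Q = np.zeros((env.observation_space.n, env.action_space.n))
alpha, gamma, epsilon = 0.1, 0.99, 0.1           # step size, discount, exploration rate

for episode in range(5000):
    s, _ = env.reset()
    done = False
    while not done:
        # Epsilon-greedy: explore with probability epsilon, else act greedily.
        a = env.action_space.sample() if np.random.rand() < epsilon else int(np.argmax(Q[s]))
        s_next, r, terminated, truncated, _ = env.step(a)
        # Off-policy update: bootstrap from the greedy action in s_next.
        target = r + gamma * np.max(Q[s_next]) * (not terminated)
        Q[s, a] += alpha * (target - Q[s, a])
        s = s_next
        done = terminated or truncated
```

Replacing the greedy target with the action actually taken in the next state would turn this update into SARSA, the on-policy variant listed above.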
Part III: Deep Reinforcement Learning
- Reviewing main concepts in deep learning
  - Universal approximation theorem
  - Deep neural networks
  - Training a neural network via gradient descent
  - Reviewing neural network implementation in PyTorch
- Preliminaries of deep reinforcement learning
  - Function approximation
  - Space reduction via function approximation
  - A simple function approximator
  - Deep neural networks as function approximators
  - Looking into a new example
- Deep off-policy reinforcement learning
  - Value networks: value function approximation via deep neural networks
  - Deep Q-learning and deep Q-networks (DQNs)
  - Properties of deep Q-learning: sample efficiency and instability
  - Visiting our new example
- Deep on-policy methods
  - Policy networks: policy approximation via deep neural networks
  - Policy gradient methods
  - Direct policy updates
  - Properties of deep policy networks: sample inefficiency versus stability
  - Trust Region Policy Optimization (TRPO)
    - Constraining the policy update via the Kullback-Leibler divergence
    - The idea of a surrogate objective function
  - Proximal Policy Optimization (PPO)
    - Clipping (see the sketch after this list)
    - Complexity of PPO
  - Revisiting our new example
- Actor-critic methods
  - Advantage Actor-Critic (A2C)
  - TRPO and PPO with a value network
  - Deterministic policy gradient
  - Deep Deterministic Policy Gradient (DDPG)
  - Soft Actor-Critic (SAC)
  - Extensions and modifications
- Applications and advancements of deep reinforcement learning
  - Looking into some successful examples: AlphaGo, AlphaZero, Pluribus, and OpenAI Five
  - Sample applications of deep reinforcement learning and the project poster session
  - Recent advancements in deep reinforcement learning
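As one concrete fragment from this part, here is an illustrative PyTorch implementation of PPO's clipped surrogate objective (the "Clipping" item above). The function and tensor names are assumptions made for this sketch, not the course's reference code.

```python
# PPO clipped surrogate loss (to be minimized with a gradient step).
import torch

def ppo_clip_loss(log_probs_new: torch.Tensor,
                  log_probs_old: torch.Tensor,
                  advantages: torch.Tensor,
                  clip_eps: float = 0.2) -> torch.Tensor:
    ratio = torch.exp(log_probs_new - log_probs_old)   # pi_new / pi_old
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # Taking the elementwise minimum keeps the update conservative,
    # mirroring TRPO's trust-region idea without a KL-constrained solver.
    return -torch.min(unclipped, clipped).mean()
```

The clipping plays the role of TRPO's KL constraint at a fraction of the computational cost, which is why PPO follows TRPO in the outline above.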
Course Evaluation
Your final grade consists of three components:
| Component | Grade | Details |
| --- | --- | --- |
| Assignments | 42% | Three assignment sets |
| Exam | 25% | Midterm exam |
| Project | 33% | Selected from a predefined set of projects, or open-ended |
Assignments
This is the most important part! This is an applied course, so we implement whatever we learn. There will be three assignment sets. Roughly speaking, the first goes through the fundamentals of reinforcement learning, the second is more demanding in terms of implementation and develops your knowledge of tabular reinforcement learning methods, and the last gives you the chance to implement a mini-project on deep reinforcement learning. The assignments count for 42% of your final mark.
Exam
The exam covers the theory side of the course and takes place once Part II is complete. There will, of course, be no programming questions: the exam only evaluates your understanding of fundamental concepts and reinforcement learning methods, through questions that can be answered in words or solved simply by hand. It comprises 25% of the total mark.
Course Project
This is the exciting part of the course, where you can challenge yourself and test the skills you have developed. The project topic is an application of deep reinforcement learning. The predefined projects are designed in advance, with their milestones and tasks made clear, and the required documents will be shared. You then use your knowledge to complete the required tasks gradually by the end of the semester. You also have the option to define your own project: an open-ended project must be of the same level as the predefined projects, with its milestones clearly specified. The final projects will be presented in a seminar session held in the last week of the semester.