Project
Code of Honor
This project is intended to deepen your understanding and develop your skills, and it forms a substantial part of your final evaluation. It must be completed collaboratively as a group. Any form of academic dishonesty is a violation of the Code of Honor. You are encouraged to use publicly available resources, provided that all sources are clearly cited and your individual contributions are clearly explained. Failure to properly acknowledge your individual contribution may be considered a lack of participation, and projects without meaningful individual contributions will be deemed incomplete.

The course project starts in earnest in the second half of the course. In this project, you choose a topic from the list of available topics and work through the semester to deliver the requested outcomes. Regardless of the topic, you will need to follow these steps:
- Form a group of 3 or 4. Due to the course size, smaller groups are accepted only under special circumstances, e.g., working on an open-ended topic of your own or a group member dropping the course in the middle of the semester.
- Submit your topic by the end of Week 5. It is strongly suggested that you choose as early as possible so that you can get into the problem and start on the preliminaries.
- You will be allocated to a TA, who can help you throughout the project.
- Deliver the initial milestones of the project in a progress briefing by Week 10. The progress briefing will serve as the basis for your final report.
- Deliver your final results by the end of the semester. This includes the final report, the source code, and a final presentation in our internal seminar.
Submission Procedure
The main body of work is submitted through Git. In addition, each group submits a final report and gives a presentation. To this end, please follow these steps:
- Each group must maintain a Git repository, e.g., on GitHub or GitLab, for the project. By the time of final submission, the repository should have:
  - A well-documented codebase
  - A clear README.md with setup and usage instructions
  - A requirements.txt file listing all required packages, or an environment.yaml file with a reproducible environment setup
  - A demo script or notebook showing sample input and output
  - If applicable, a /doc folder with extended documentation
- A final report (maximum 5 pages) must be submitted in PDF format. The report should be written using the provided template in a formal style, including an abstract, introduction, method, experiments, results, and conclusion.
  Important: Please make sure to complete the section Consent for Information Sharing.
  Important: Submissions that do not use the template are considered incomplete.
- A 5-minute presentation (maximum 5 slides including the title slide) is given by the group at the internal seminar in Week 14, i.e., Dec 1 to Dec 5. Any template can be used for the presentation.
Project Topics
Category A: Deep RL from Scratch
Topic A-1: From REINFORCE to Advantage Actor-Critic
- See Complete Project Description
- Objective: Design and implement a progression of policy gradient algorithms starting from REINFORCE, extending it with a baseline, and finally implementing Advantage Actor-Critic (A2C). The project aims to compare the stability, sample efficiency, and learning dynamics across these algorithms in classic control environments.
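For illustration only, a minimal REINFORCE update on CartPole-v1 might look like the sketch below. It assumes Gymnasium and PyTorch; the network size, learning rate, discount factor, and the mean-return baseline are illustrative choices, not requirements of the project.

```python
# Minimal REINFORCE sketch on CartPole-v1 (illustrative hyperparameters).
import gymnasium as gym
import torch
import torch.nn as nn

env = gym.make("CartPole-v1")
policy = nn.Sequential(
    nn.Linear(env.observation_space.shape[0], 64),
    nn.Tanh(),
    nn.Linear(64, env.action_space.n),
)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)
gamma = 0.99

for episode in range(500):
    obs, _ = env.reset()
    log_probs, rewards = [], []
    done = False
    while not done:
        logits = policy(torch.as_tensor(obs, dtype=torch.float32))
        dist = torch.distributions.Categorical(logits=logits)
        action = dist.sample()
        log_probs.append(dist.log_prob(action))
        obs, reward, terminated, truncated, _ = env.step(action.item())
        rewards.append(reward)
        done = terminated or truncated

    # Discounted returns G_t, computed backwards over the episode.
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.insert(0, g)
    returns = torch.tensor(returns, dtype=torch.float32)
    returns = returns - returns.mean()  # simple baseline to reduce variance

    loss = -(torch.stack(log_probs) * returns).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```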
Topic A-2: Trust Region versus Proximal Policy Optimization
- See Complete Project Description
- Objective: Design and implement two advanced actor–critic policy optimization algorithms: Trust Region Policy Optimization (TRPO) and Proximal Policy Optimization (PPO). Compare their stability, sample efficiency, and robustness in classic continuous control tasks.
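For orientation, the clipped surrogate loss that defines PPO can be written in a few lines, as in the hedged sketch below; TRPO instead enforces a KL-divergence trust region and requires a more involved (conjugate-gradient) update. The names ratio, advantages, and clip_eps follow the standard PPO formulation; the surrounding training loop is left to your design.

```python
import torch

def ppo_clipped_loss(log_probs_new, log_probs_old, advantages, clip_eps=0.2):
    """Clipped surrogate objective of PPO (returned as a loss to minimize)."""
    ratio = torch.exp(log_probs_new - log_probs_old)              # pi_new / pi_old
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.minimum(unclipped, clipped).mean()
```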
Topic A-3: Deterministic versus Stochastic Policy Gradients: DDPG and SAC
- See Complete Project Description
- Objective: Design and implement two advanced actor–critic algorithms for continuous control: Deep Deterministic Policy Gradient (DDPG) and Soft Actor-Critic (SAC). Compare their performance in terms of stability, sample efficiency, and robustness on continuous-action Gym environments.
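One way to see the deterministic/stochastic distinction is in the critic targets: DDPG bootstraps through a deterministic target actor, while SAC samples the next action from a stochastic policy and adds an entropy bonus. The sketch below is only schematic; target_actor, target_critic, policy.sample, and alpha are placeholders for your own modules, and SAC in practice also uses twin critics, which is omitted here.

```python
import torch

def ddpg_target(reward, next_obs, done, target_actor, target_critic, gamma=0.99):
    # DDPG: bootstrap through the deterministic target actor mu'(s').
    with torch.no_grad():
        next_action = target_actor(next_obs)
        q_next = target_critic(next_obs, next_action)
        return reward + gamma * (1.0 - done) * q_next

def sac_target(reward, next_obs, done, policy, target_critic, alpha=0.2, gamma=0.99):
    # SAC: sample a' ~ pi(.|s') and add the entropy term -alpha * log pi(a'|s').
    # (In practice SAC takes the minimum of two target critics; omitted here.)
    with torch.no_grad():
        next_action, log_prob = policy.sample(next_obs)  # assumed (action, log_prob) interface
        q_next = target_critic(next_obs, next_action) - alpha * log_prob
        return reward + gamma * (1.0 - done) * q_next
```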
Category B: Applications of RL
Topic B-1: Using RL to Finetune a Language Model
- See Complete Project Description
- Objective: Design and implement a simplified pipeline for RL-based finetuning of language models (LMs). Starting from a small pretrained LM, e.g., DistilGPT2, you will implement a policy optimization method (PPO) to improve LM outputs according to a reward model, e.g., a sentiment classifier. The project aims to compare supervised finetuning with RL-based finetuning, and to analyze the effect of different reward models and modifications.
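Before any policy optimization, a sensible first milestone is to wire up generation and reward scoring. The sketch below assumes the Hugging Face transformers library, DistilGPT2, and the default sentiment-analysis pipeline as a stand-in reward model; it only illustrates that interface, not the RL update itself.

```python
# Sketch: score DistilGPT2 continuations with a sentiment classifier as the reward.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

tokenizer = AutoTokenizer.from_pretrained("distilgpt2")
lm = AutoModelForCausalLM.from_pretrained("distilgpt2")
reward_model = pipeline("sentiment-analysis")  # default model used as a stand-in reward model

prompt = "The movie was"
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    output_ids = lm.generate(**inputs, max_new_tokens=20, do_sample=True,
                             pad_token_id=tokenizer.eos_token_id)
text = tokenizer.decode(output_ids[0], skip_special_tokens=True)

# Reward: probability of the positive class for the generated continuation.
result = reward_model(text)[0]
reward = result["score"] if result["label"] == "POSITIVE" else 1.0 - result["score"]
print(text, reward)
```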
Topic B-2: Using RL for Portfolio Management
- See Complete Project Description
- Objective: Design and implement an RL agent for portfolio management. The agent should allocate capital across a small set of assets, e.g., 3–5 stocks or ETFs, to maximize long-term returns while managing risk. The project aims to compare RL-based strategies, e.g., DQN, PPO, with heuristic baselines such as Buy-and-Hold or Equal Weighting.
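One possible formulation, sketched below, exposes (unnormalized) portfolio weights as the action and the log-return of the resulting portfolio as the reward. The synthetic return data, the absence of transaction costs, and the Gymnasium-style interface are assumptions for illustration; the data source and cost model are yours to choose.

```python
# Sketch of a toy portfolio environment (synthetic returns, no transaction costs).
import numpy as np
import gymnasium as gym
from gymnasium import spaces

class PortfolioEnv(gym.Env):
    def __init__(self, returns):
        self.returns = np.asarray(returns, dtype=np.float32)   # shape (T, n_assets)
        n_assets = self.returns.shape[1]
        self.action_space = spaces.Box(0.0, 1.0, shape=(n_assets,), dtype=np.float32)
        self.observation_space = spaces.Box(-np.inf, np.inf, shape=(n_assets,), dtype=np.float32)
        self.t = 0

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        self.t = 0
        return self.returns[self.t], {}

    def step(self, action):
        weights = action / (action.sum() + 1e-8)                  # normalize to portfolio weights
        reward = float(np.log1p(weights @ self.returns[self.t]))  # log portfolio return
        self.t += 1
        terminated = self.t >= len(self.returns) - 1
        return self.returns[self.t], reward, terminated, False, {}

# Example with synthetic data; replace with real daily returns of 3-5 assets.
env = PortfolioEnv(np.random.default_rng(0).normal(0.0005, 0.01, size=(250, 4)))
```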
Topic B-3: Using RL for Traffic Signal Control
- See Complete Project Description
- Objective: Design and implement an RL agent to optimize traffic signal control at one or more intersections. The agent should minimize congestion and vehicle waiting times by dynamically adjusting traffic light phases. The project aims to compare RL-based policies with simple heuristics such as fixed-time or round-robin controllers.
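As a reference point, the fixed-time baseline is trivial to state: it cycles through the phases on a fixed schedule regardless of the queues. In the hedged sketch below, the phase names and the queue-based state are placeholders; a simulator such as SUMO or a custom queue model would provide the actual dynamics.

```python
# Sketch: fixed-time baseline for a single intersection (phase names are placeholders).
def fixed_time_controller(step, phases=("NS_green", "EW_green"), period=30):
    """Switch phase every `period` simulation steps, ignoring the traffic state."""
    return phases[(step // period) % len(phases)]

# An RL agent would instead map the observed queues to a phase, e.g.
# state = [queue_north, queue_south, queue_east, queue_west] -> action = phase index.
```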
Topic B-4: Using RL to Build a Recommendation System
- See Complete Project Description
- Objective: Design and implement a reinforcement learning agent for sequential recommendation. The agent should recommend items (e.g., movies or products) to maximize long-term user engagement. The project aims to compare RL-based policies with traditional recommendation baselines such as collaborative filtering or random recommendation.
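To make the comparison concrete, even a toy simulated user exposes the gap between random recommendation and a policy that exploits feedback. The sketch below is a bandit simplification (no sequential user state) with synthetic click probabilities; a real dataset and a sequential formulation are expected in the project.

```python
# Toy comparison: random recommendation vs. epsilon-greedy on simulated click rates.
import numpy as np

rng = np.random.default_rng(0)
n_items, n_steps, eps = 20, 5000, 0.1
click_prob = rng.uniform(0.01, 0.3, size=n_items)     # hidden per-item engagement

clicks, shows = np.zeros(n_items), np.ones(n_items)   # counts (ones avoid division by zero)
total = {"random": 0, "eps_greedy": 0}

for _ in range(n_steps):
    # Random baseline.
    total["random"] += rng.random() < click_prob[rng.integers(n_items)]
    # Epsilon-greedy on the empirical click rate.
    item = rng.integers(n_items) if rng.random() < eps else int(np.argmax(clicks / shows))
    clicked = rng.random() < click_prob[item]
    clicks[item] += clicked
    shows[item] += 1
    total["eps_greedy"] += clicked

print(total)
```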
Category C: Robotics and Automation
Topic C-1: Controlling Robotic Arm with RL
- See Complete Project Description
- Objective: Design and implement an RL agent to control a robotic arm in a reaching task. The agent should learn to move the robot’s end-effector to a target location in continuous action space. The project aims to compare RL algorithms such as PPO and SAC, and evaluate their robustness under environment modifications.
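Whatever simulator you pick, the heart of the reaching task is the reward; a common choice is the negative distance between the end-effector and the target plus a small control penalty, as in the hedged sketch below. The function and argument names are placeholders for your environment's interface.

```python
import numpy as np

def reaching_reward(end_effector_pos, target_pos, action, dist_weight=1.0, ctrl_weight=0.01):
    """Dense reaching reward: get close to the target, penalize large control inputs.

    end_effector_pos, target_pos: positions reported by the simulator (placeholders).
    action: the commanded joint torques or velocities.
    """
    distance = np.linalg.norm(np.asarray(end_effector_pos) - np.asarray(target_pos))
    control_cost = ctrl_weight * float(np.square(action).sum())
    return -dist_weight * distance - control_cost
```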
Topic C-2: RL for Robotic Locomotion
- See Complete Project Description
- Objective: Design and implement RL agents to train a simulated robot to perform locomotion tasks such as hopping or walking. The project should compare algorithms like PPO and SAC in terms of stability, sample efficiency, and robustness. The environment can be selected from lightweight simulators such as PyBullet (Hopper, Ant, or Walker).
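Since the comparison is about stability and sample efficiency, it pays to fix an evaluation protocol early. The sketch below averages evaluation returns over several seeds using Stable-Baselines3; the environment id is a placeholder for whichever locomotion task and simulator you end up using.

```python
# Sketch: multi-seed evaluation protocol ("Hopper-v4" is a placeholder environment id).
import gymnasium as gym
import numpy as np
from stable_baselines3 import PPO
from stable_baselines3.common.evaluation import evaluate_policy

returns = []
for seed in (0, 1, 2):
    env = gym.make("Hopper-v4")
    model = PPO("MlpPolicy", env, seed=seed, verbose=0)
    model.learn(total_timesteps=100_000)
    mean_return, _ = evaluate_policy(model, env, n_eval_episodes=10)
    returns.append(mean_return)

print(f"PPO: {np.mean(returns):.1f} +/- {np.std(returns):.1f} over seeds")
```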
Topic C-3: RL for Autonomous Driving
- See Complete Project Description
- Objective: Design and implement an RL agent for basic autonomous driving tasks such as lane-keeping, overtaking, or collision avoidance. The project aims to compare RL-based driving policies with simple rule-based baselines in lightweight simulators such as Highway-env.
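A useful sanity check before training is a rule-based baseline; for lane-keeping, steering proportionally against the lateral offset and heading error is enough. In the sketch below, the observation fields and gains are placeholders; in Highway-env you would read them from the observation you configure.

```python
# Sketch: proportional lane-keeping baseline (observation fields are placeholders).
def rule_based_steering(lateral_offset, heading_error, k_offset=0.5, k_heading=1.0):
    """Steer back toward the lane center; clip to the action range [-1, 1]."""
    steer = -k_offset * lateral_offset - k_heading * heading_error
    return max(-1.0, min(1.0, steer))
```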
Category D: Resource Allocation
Topic D-1: RL for Job Scheduling in a Compute Cluster
- See Complete Project Description
- Objective: Design and implement an RL agent for job scheduling in a simulated compute cluster. The agent should decide which jobs to execute in order to optimize performance metrics such as throughput, average waiting time, or fairness. The project aims to compare RL-based scheduling policies with classical heuristics such as First-Come-First-Served (FCFS) or Shortest-Job-First (SJF).
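The classical baselines can be implemented exactly, which makes them a good first milestone. The sketch below compares average waiting time under FCFS and SJF on synthetic job durations, assuming for simplicity that all jobs arrive at time zero.

```python
# Sketch: average waiting time of FCFS vs. SJF on synthetic jobs (all arrive at t = 0).
import random

def avg_waiting_time(durations):
    waiting, elapsed = 0.0, 0.0
    for d in durations:
        waiting += elapsed       # this job waits for everything scheduled before it
        elapsed += d
    return waiting / len(durations)

random.seed(0)
jobs = [random.expovariate(1.0) for _ in range(100)]    # job durations
print("FCFS:", avg_waiting_time(jobs))
print("SJF :", avg_waiting_time(sorted(jobs)))          # shortest-job-first
```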
Topic D-2: RL for Wireless Resource Allocation
- See Complete Project Description
- Objective: Design and implement an RL agent for wireless resource allocation in a small multi-user system. The agent should allocate power or bandwidth among users to maximize throughput while ensuring fairness. The project aims to compare RL-based allocation policies with traditional heuristics such as round-robin or max-SINR scheduling.
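Similarly, the scheduling baselines can be written down directly. The sketch below compares round-robin and max-SINR user selection using Shannon rates on random channel realizations; the single-cell, one-user-per-slot model is a simplification for illustration, not part of the project specification.

```python
# Sketch: round-robin vs. max-SINR scheduling with Shannon rates (toy single-cell model).
import numpy as np

rng = np.random.default_rng(0)
n_users, n_slots = 4, 1000
snr = rng.exponential(scale=4.0, size=(n_slots, n_users))   # per-slot SNR of each user

rates = np.log2(1.0 + snr)                                  # achievable rate if scheduled
rr_choice = np.arange(n_slots) % n_users                    # round-robin user selection
greedy_choice = rates.argmax(axis=1)                        # max-SINR (greedy) selection

print("round-robin throughput:", rates[np.arange(n_slots), rr_choice].mean())
print("max-SINR    throughput:", rates[np.arange(n_slots), greedy_choice].mean())
```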
Topic D-3: Cache Management via RL
- See Complete Project Description
- Objective: Design and implement an RL agent for cache management in a simulated system. The agent should learn eviction and admission strategies to minimize cache misses given a stream of content requests. The project aims to compare RL-based caching policies with classical baselines such as Least Recently Used (LRU) and Least Frequently Used (LFU).
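The classical eviction policies are the yardstick here. The sketch below measures the LRU miss rate on a synthetic Zipf-like request stream; LFU can be implemented analogously with a frequency counter, and the catalog size and cache capacity are illustrative.

```python
# Sketch: LRU miss rate on a synthetic Zipf-like request stream.
from collections import OrderedDict
import random

def lru_miss_rate(requests, capacity):
    cache, misses = OrderedDict(), 0
    for item in requests:
        if item in cache:
            cache.move_to_end(item)          # mark as most recently used
        else:
            misses += 1
            if len(cache) >= capacity:
                cache.popitem(last=False)    # evict the least recently used item
            cache[item] = True
    return misses / len(requests)

random.seed(0)
catalog = list(range(100))
weights = [1.0 / (i + 1) for i in catalog]   # Zipf-like popularity
requests = random.choices(catalog, weights=weights, k=10_000)
print("LRU miss rate:", lru_miss_rate(requests, capacity=10))
```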
Category E: Open-ended
- You can develop your own proposal
- The proposal should meet the technical level of the pre-defined projects
- Please prepare your proposal in the same format as the pre-defined project descriptions
- You may use the provided proposal template
Templates for Proposal, Report and Presentation
- Proposal Template: This is the template for the project proposal. You can use another template as well.
- Report Template (LaTeX): Other templates are not accepted!
- Presentation Template: You can use another template as well.