The Lunar Lander environment is a rocket trajectory optimization problem. The goal is to touch down on the landing pad, as close to its center as possible. The rocket starts at the top center of the screen with a random initial force applied to its center of mass.
There are four discrete actions: do nothing, fire the left engine, fire the main engine, and fire the right engine.
Each observation is an 8-dimensional vector containing: the lander position in x & y, its linear velocity in x & y, its angle, its angular velocity, and two boolean flags indicating whether each leg has contact with the ground.
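For readability, the eight observation components can be given names. A small illustrative sketch; the field names are my own, not part of the environment API:

```python
from typing import NamedTuple

class LanderObs(NamedTuple):
    """Named view of the 8-dimensional LunarLander observation (field names are illustrative)."""
    x: float              # horizontal position
    y: float              # vertical position
    vx: float             # horizontal velocity
    vy: float             # vertical velocity
    angle: float          # lander angle
    angular_vel: float    # angular velocity
    left_contact: float   # 1.0 if the left leg touches the ground
    right_contact: float  # 1.0 if the right leg touches the ground

# Wrap a raw observation vector for named access:
obs = LanderObs(0.0, 1.4, 0.0, -0.5, 0.02, 0.0, 0.0, 0.0)
print(obs.vy)  # -0.5
```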
Landing on the pad yields a positive reward of 100-140 points, depending on the position, with an additional +100 if the lander comes to rest (and -100 for crashing). Firing the main engine incurs a small penalty of -0.3 per frame (-0.03 for the side engines). The problem is considered solved at 200 points.
The following RL algorithms were implemented:
- Neural Fitted Q Iteration (NFQ)
- Deep Q-Network (DQN)
- REINFORCE with baseline / Vanilla Policy Gradient (VPG)
- Advantage Actor Critic (AC)
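The value-based agents (NFQ, DQN) select actions from estimated Q-values, typically epsilon-greedily. A minimal sketch of that selection rule; the function and parameter names are illustrative, not the repository's API:

```python
import random

def epsilon_greedy(q_values, epsilon):
    """With probability epsilon pick a random action, otherwise the greedy one."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    # argmax over the action indices
    return max(range(len(q_values)), key=lambda a: q_values[a])

# With epsilon = 0 the choice is purely greedy:
action = epsilon_greedy([0.1, 0.5, 2.3, -0.4], epsilon=0.0)  # -> 2 (highest Q-value)
```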
For a fair comparison, all algorithms use a 2-layer MLP with hidden sizes (128, 64) and a discount factor of 0.999. The learning rate is set individually per algorithm.
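With a discount factor of 0.999, the return at each time step can be computed in a single backward pass over an episode's rewards. A minimal sketch of that computation, not the repository's actual implementation:

```python
def discounted_returns(rewards, gamma=0.999):
    """Compute G_t = r_t + gamma * G_{t+1} for every step, back to front."""
    returns = []
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    returns.reverse()
    return returns

discounted_returns([1.0, 1.0, 1.0])  # -> [2.997001, 1.999, 1.0] (up to float rounding)
```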
Install dependencies with `pip install -r requirements.txt`.

Run `main.py train <agent> <episodes>` to train an agent.

Run `main.py evaluate <agent> <episodes> <render>` to evaluate a pre-trained agent.
- `<agent>` (string): NFQ, DQN, VPG or AC
- `<episodes>` (int): number of episodes
- `<render>` (bool): display episodes on screen
*(Training plot and rollout after 2000 episodes)*
*(Training plot and rollout after 1000 episodes)*
*(Training plot and rollout after 5000 episodes)*
Reference: R. Sutton and A. Barto (2018), Reinforcement Learning: An Introduction, p. 328
Reference: OpenAI: Spinning Up in Deep RL!, Vanilla Policy Gradient
*(Training plot and rollout after 1000 episodes)*
Reference: RL Course by David Silver - Lecture 7: Policy Gradient Methods
The score is the average return over 100 episodes of the trained agent.
Agent | Score
---|---
Neural Fitted Q Iteration | -24.90 |
Deep Q-Network | 271.47 |
Vanilla Policy Gradient | 172.49 |
Advantage Actor Critic | 205.77 |
- Python v3.10.9
- Gym v0.26.2
- Matplotlib v3.6.2
- Numpy v1.24.1
- Pandas v1.5.2
- PyTorch v1.13.1
- Tqdm v4.64.1
- Typer v0.7.0