In this project you will implement the following RL algorithms, in this order, all for the Cart-Pole task from Gymnasium (formerly OpenAI Gym):
- Sarsa with action-value function approximation
- Monte Carlo with action-value function approximation
- REINFORCE with soft-max policy
- REINFORCE with baseline and soft-max policy
There are a lot of Python files. All require editing; grep for `BEGIN` to see where your code is required. It is much less work than it seems at first glance; read on for details.
The `*Agent.py` files define two hierarchies of base classes for generic RL algorithms, and the `CartPole*.py` files each instantiate a specific method for this task. At a glance, these are the dependencies (by inheritance or inclusion) between the files:
```
DiscreteAgent
    DiscreteSarsaAgent
        CartPoleSarsa
    DiscreteMonteCarloAgent
        CartPoleMonteCarlo
ReinforceAgent
    CartPoleReinforce
    ReinforceBaselineAgent
        CartPoleReinforceBaseline
```
In addition, see `run.sh` and `plots.py` for evaluation.
- Since all q and h functions operate on the same state (= observation) space, you can probably use the same neural network architecture everywhere (a sketch follows this list).
- The `trainEpisode()` method of the `ReinforceAgent` is almost identical to that of the `DiscreteMonteCarloAgent` (see the second sketch below).
- The `update()` method of the `ReinforceBaselineAgent` includes an almost verbatim copy of the `update()` method of the `ReinforceAgent` (see the third sketch below).
- Don't expect perfect results; no agent will learn perfectly. However, each agent should achieve perfect results on multiple consecutive episodes.
- See the lecture notes for further hints on implementation with PyTorch.
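
Regarding the shared architecture: a small fully connected network is usually sufficient for CartPole. The following is only a minimal sketch under assumed sizes and names; it is not part of the supplied files:

```python
import torch.nn as nn

class MLP(nn.Module):
    """Small fully connected network for CartPole observations (4 floats).
    All sizes here are assumptions; tune them as needed."""

    def __init__(self, n_in=4, n_out=2, n_hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_in, n_hidden), nn.ReLU(),
            nn.Linear(n_hidden, n_hidden), nn.ReLU(),
            nn.Linear(n_hidden, n_out),
        )

    def forward(self, x):
        return self.net(x)
```

The same class could then serve as a q network or soft-max policy head with `n_out=2` (one output per action) and as a baseline/value network with `n_out=1`.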
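The similarity between the two `trainEpisode()` methods presumably reflects that both are Monte Carlo methods: they roll out one episode and then compute the discounted returns backwards. A hedged sketch of that shared computation (the function name and signature are assumptions, not the supplied interface):

```python
def discounted_returns(rewards, gamma=0.99):
    """Compute G_t = R_{t+1} + gamma * G_{t+1} backwards over one episode.
    gamma = 0.99 is an assumed discount factor; use the agent's own setting."""
    returns = [0.0] * len(rewards)
    g = 0.0
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g
        returns[t] = g
    return returns
```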
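As for the shared part of `update()`: the policy-gradient step is the same in plain REINFORCE and in REINFORCE with baseline, except that the return G_t is replaced by the advantage G_t - v(s_t). A minimal sketch of that common step (names and tensor conventions are assumptions):

```python
import torch

def policy_gradient_loss(log_probs, returns, baselines=None):
    """REINFORCE loss: -sum_t log pi(a_t|s_t) * G_t.
    With a baseline, G_t is replaced by the advantage G_t - v(s_t);
    detaching keeps baseline gradients out of the policy update."""
    returns = torch.as_tensor(returns)
    if baselines is not None:
        returns = returns - baselines.detach()
    return -(torch.stack(log_probs) * returns).sum()
```

The `ReinforceBaselineAgent` would additionally fit the baseline itself, e.g. by minimizing a mean squared error between v(s_t) and G_t.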
Upload this entire directory as an archive file to OLAT. Besides its original content (including your edits) it should contain plots generated by `plots.py`, 8 in total, for at least one run of each method. Write any observations or comments into `submission.md`. Do not include any bulky log or `.npy` files.
The above task explores discrete actions on episodic tasks. In class we also discussed (or will discuss) continuous action spaces and non-episodic tasks, i.e., tasks that may potentially continue forever and only start a new episode when they fail. The classic control tasks of Gymnasium include tasks with continuous action spaces, but these are not suitable for our basic methods, for reasons (to be) discussed in class. Of these, the Pendulum task can easily be turned into a non-episodic task, but again, it is not useful for our purposes.
Instead of doing Your Task as specified above, you may choose to do the following:
- Adapt or create a task (as a class derived from `gymnasium.Env`) that either involves continuous actions, is non-episodic, or both, and that is solvable by the methods we discussed in class.
- Implement one such method, building as much as possible on the supplied Python files, and demonstrate that it works, analogously to the above instructions. In particular, implement one of the following methods:
  - Differential Semi-Gradient Sarsa or Continuing Actor-Critic for a continuing task with discrete actions
  - REINFORCE or Episodic Actor-Critic with a policy parametrized as a Normal distribution for an episodic task with continuous actions (a sketch of such a policy follows this list)
  - Continuing Actor-Critic for a non-episodic task with continuous actions
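
For the continuous-action options, the policy is typically parametrized as a Normal distribution whose mean (and possibly standard deviation) come from a network. A minimal sketch of such a policy head in PyTorch; all names and sizes are illustrative assumptions:

```python
import torch
import torch.nn as nn

class NormalPolicy(nn.Module):
    """Policy parametrizing a Normal distribution over a 1-D action."""

    def __init__(self, n_in, n_hidden=64):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(n_in, n_hidden), nn.ReLU())
        self.mean = nn.Linear(n_hidden, 1)
        # State-independent log standard deviation, learned directly.
        self.log_std = nn.Parameter(torch.zeros(1))

    def forward(self, obs):
        h = self.body(obs)
        return torch.distributions.Normal(self.mean(h), self.log_std.exp())
```

During an episode, `dist = policy(obs)`, `action = dist.sample()`, and `dist.log_prob(action)` provide everything the REINFORCE or actor-critic update needs.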