
recurrent-ppo-truncated-bptt's Introduction

Recurrent Proximal Policy Optimization using Truncated BPTT

This repository features a PyTorch-based implementation of PPO using a recurrent policy that supports truncated backpropagation through time. It is intended as a clean baseline/reference implementation showing how to successfully employ recurrent neural networks alongside PPO and similar policy gradient algorithms.

We also offer a clean TransformerXL + PPO baseline repository.

Latest Updates (February 2023)

  • Added support for Memory Gym
  • Added yaml configs
  • Added max grad norm hyperparameter to the config
  • Gymnasium is used instead of gym
  • Only model inputs are padded now
  • Buffer tensors are freed from memory after optimization
  • Fixed dynamic sequence length

Features

  • Recurrent Policy
    • GRU
    • LSTM
    • Truncated BPTT
  • Environments
    • Proof-of-concept Memory Task (PocMemoryEnv)
    • CartPole
      • Masked velocity
    • Minigrid Memory
      • Visual Observation Space: 3x84x84
      • Egocentric Agent View Size: 3x3 (default 7x7)
      • Action Space: forward, rotate left, rotate right
    • MemoryGym
      • Mortar Mayhem
      • Mystery Path
      • Searing Spotlights (WIP)
  • Tensorboard
  • Enjoy (watch a trained agent play)

Citing this Work

@inproceedings{
  pleines2023memory,
  title={Memory Gym: Partially Observable Challenges to Memory-Based Agents},
  author={Marco Pleines and Matthias Pallasch and Frank Zimmer and Mike Preuss},
  booktitle={International Conference on Learning Representations},
  year={2023},
  url={https://openreview.net/forum?id=jHc8dCx6DDr}
}

Installation

Install PyTorch depending on your platform. We recommend using Anaconda.

Create Anaconda environment:

conda create -n recurrent-ppo python=3.11 --yes
conda activate recurrent-ppo

CPU:

conda install pytorch torchvision torchaudio cpuonly -c pytorch

CUDA:

conda install pytorch torchvision torchaudio pytorch-cuda=11.7 -c pytorch -c nvidia

Install the remaining requirements and you are good to go:

pip install -r requirements.txt

Train a model

The training is launched via the command python train.py.

Usage:
    train.py [options]
    train.py --help

Options:
    --run-id=<path>            Specifies the tag of the tensorboard summaries and the model's filename [default: run].
    --cpu                      Whether to enforce training on the CPU, otherwise an available GPU will be used. [default: False].

Hyperparameters are configured inside configs.py. The config to be used has to be selected inside train.py. Once training is done, the final model will be saved to ./models/run-id.nn. Training statistics are stored inside the ./summaries directory.

python train.py --run-id=my-training-run

Enjoy a model

To watch an agent exploit its trained model, execute python enjoy.py. Some already trained models can be found inside the models directory!

Usage:
    enjoy.py [options]
    enjoy.py --help

Options:
    --model=<path>              Specifies the path to the trained model [default: ./models/minigrid.nn].

The path to the desired model has to be specified using the --model flag:

python enjoy.py --model=./models/minigrid.nn

Recurrent Policy

Implementation Concept

Flow of processing the training data

  1. Training data
    1. Training data is sampled from the current policy
    2. Sampled data is split into episodes
    3. Episodes are split into sequences (based on the sequence_length hyperparameter)
    4. Zero padding is applied to retrieve sequences of fixed length (see the sketch after this list)
    5. Recurrent cell states are collected from the beginning of the sequences (truncated bptt)
  2. Forward pass of the model
    1. For the optimization forward pass, the data is flattened so that the entire batch can be fed to the model at once (faster)
    2. Before feeding it to the recurrent layer, the data is reshaped to (num_sequences, sequence_length, data)
  3. Loss computation
    1. Zero padded values are masked during the computation of the losses
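
To make steps 1.3 to 1.5 concrete, the following is a minimal, hypothetical sketch of splitting one episode into fixed-length sequences, zero padding the trailing sequence, and keeping only the recurrent cell state at the start of each sequence. It is not the repository's buffer code, just an illustration of the idea.

import torch

def split_episode(obs, hxs, sequence_length):
    """Split one episode into fixed-length sequences for truncated BPTT.

    obs: tensor of shape (episode_length, *obs_shape)
    hxs: tensor of shape (episode_length, hidden_state_size), collected while sampling
    Returns the padded sequences and the hidden state at the start of each sequence.
    """
    sequences, start_hxs = [], []
    for start in range(0, len(obs), sequence_length):
        seq = obs[start:start + sequence_length]
        pad = sequence_length - len(seq)
        if pad > 0:  # zero pad the trailing (shorter) sequence
            seq = torch.cat([seq, torch.zeros((pad, *seq.shape[1:]), dtype=seq.dtype)])
        sequences.append(seq)
        start_hxs.append(hxs[start])  # truncated BPTT: gradients stop at the sequence start
    return torch.stack(sequences), torch.stack(start_hxs)

During the loss computation, a mask derived from the original (unpadded) sequence lengths ensures that the padded steps do not contribute to the losses (step 3.1).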

Found & Fixed Bugs

As a reinforcement learning engineer, one has to have high endurance. Therefore, we are providing some information on the bugs that slowed us down for months.

Feeding None to nn.GRU/nn.LSTM

We observed an exploding value function. This was due to unintentionally feeding None to the recurrent layer. In this case, PyTorch uses zeros for the hidden states as shown by its source code.

if hx is None:
    num_directions = 2 if self.bidirectional else 1
    real_hidden_size = self.proj_size if self.proj_size > 0 else self.hidden_size
    h_zeros = torch.zeros(self.num_layers * num_directions,
                          max_batch_size, real_hidden_size,
                          dtype=input.dtype, device=input.device)
    c_zeros = torch.zeros(self.num_layers * num_directions,
                          max_batch_size, self.hidden_size,
                          dtype=input.dtype, device=input.device)
    hx = (h_zeros, c_zeros)
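
A simple way to avoid this pitfall is to always create and pass the initial recurrent cell states explicitly. A minimal sketch (sizes are arbitrary; this is not the repository's exact code):

import torch
import torch.nn as nn

num_layers, batch_size, hidden_size = 1, 8, 64
lstm = nn.LSTM(input_size=32, hidden_size=hidden_size, num_layers=num_layers, batch_first=True)

# Explicitly create the initial hidden and cell states instead of passing None
h0 = torch.zeros(num_layers, batch_size, hidden_size)
c0 = torch.zeros(num_layers, batch_size, hidden_size)

x = torch.randn(batch_size, 5, 32)   # (batch, sequence_length, features)
out, (hn, cn) = lstm(x, (h0, c0))    # hidden states are passed explicitly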

Reshaping an Entire Batch into Sequences

Training an agent using a sequence length greater than 1 caused the agent to achieve only the performance of a random agent. The issue behind this bug was found in the way the data was reshaped right before feeding it to the recurrent layer. In general, the desire is to feed the entire training batch, instead of individual sequences, to the encoder (e.g. convolutional layers). Before feeding the processed batch to the recurrent layer, it has to be rearranged into sequences. At the time of this bug, the recurrent layer was initialized with batch_first=False. Hence, the data was reshaped using h.reshape(sequence_length, num_sequences, data). This messed up the structure of the sequences and ultimately caused the bug. We fixed this by setting batch_first to True and therefore reshaping the data with h.reshape(num_sequences, sequence_length, data).
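
The shapes involved, as a minimal illustration (the sizes and variable names are made up for this sketch):

import torch
import torch.nn as nn

num_sequences, sequence_length, feature_dim, hidden_size = 16, 8, 128, 64
gru = nn.GRU(feature_dim, hidden_size, batch_first=True)

# Encoder output for the flattened batch: (num_sequences * sequence_length, feature_dim)
h = torch.randn(num_sequences * sequence_length, feature_dim)

# Correct: with batch_first=True the sequence dimension comes second
h_seq = h.reshape(num_sequences, sequence_length, feature_dim)
out, hxs = gru(h_seq, torch.zeros(1, num_sequences, hidden_size))

# Wrong (the original bug): h.reshape(sequence_length, num_sequences, feature_dim)
# scrambles which time steps belong to which sequence.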

Hidden States were not reset

This is considered a feature rather than a bug. Environments that produce rather short episodes are likely to benefit from not resetting the hidden states upon commencing a new episode. This is the case for MinigridMemory-S9. Resetting hidden states is now controlled by the hyperparameter reset_hidden_state inside configs.py. The actual mistake was the mixed-up order of saving the recurrent cell state to its respective placeholder and resetting it.
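
A minimal sketch of the intended order of operations during sampling (the config key reset_hidden_state comes from configs.py; the function and variable names here are hypothetical):

import torch

def save_and_maybe_reset(hxs, buffer_hxs, worker_id, step, done, reset_hidden_state):
    """Save the worker's recurrent cell state, then optionally reset it on episode end.

    The crucial detail is the order: the state is written to the buffer *before*
    it is reset, so the stored value still belongs to the finished episode.
    """
    buffer_hxs[worker_id, step] = hxs[worker_id]
    if done and reset_hidden_state:
        hxs[worker_id] = torch.zeros_like(hxs[worker_id])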

Hyperparameters (configs.py)

Recurrence

Hyperparameter Description
sequence_length Length of the trained sequences; if set to 0 or smaller, the sequence length is fit dynamically to the episode lengths
hidden_state_size Size of the recurrent layer's hidden state
layer_type Supported recurrent layers: gru, lstm
reset_hidden_state Whether to reset the hidden state upon starting a new episode. This can be beneficial for environments that produce short episodes like MinigridMemory-S9.
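
For example, the recurrence settings of a config could look like this (only the key names come from the table above; the nesting and the values are assumptions for this sketch, not recommended defaults):

recurrence_config = {
    "sequence_length": 8,        # 0 or smaller: fit dynamically to the episode lengths
    "hidden_state_size": 64,
    "layer_type": "gru",         # "gru" or "lstm"
    "reset_hidden_state": True,  # whether to reset upon starting a new episode
}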

General

Hyperparameter Description
gamma Discount factor
lamda Regularization parameter used when calculating the Generalized Advantage Estimation (GAE)
updates Number of cycles (policy updates) for which the entire PPO algorithm is run
n_workers Number of environments that are used to sample training data
worker_steps Number of steps an agent samples in each environment (batch_size = n_workers * worker_steps)
epochs Number of times the whole batch of data is used for optimization by PPO
n_mini_batch Number of mini batches that are trained during one epoch
value_loss_coefficient Multiplier of the value function loss to constrain it
hidden_layer_size Number of hidden units in each linear hidden layer
max_grad_norm Gradients are clipped by the specified max norm
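
As a quick worked example of how the batch dimensions relate (the numbers are arbitrary):

n_workers, worker_steps = 16, 256
n_mini_batch = 4

batch_size = n_workers * worker_steps         # 4096 transitions per update
mini_batch_size = batch_size // n_mini_batch  # 1024 transitions per mini batch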

Schedules

These schedules can be used to polynomially decay the learning rate, the entropy bonus coefficient and the clip range.

Hyperparameter Description
learning_rate_schedule The learning rate used by the AdamW optimizer
beta_schedule Beta is the entropy bonus coefficient that is used to encourage exploration
clip_range_schedule Strength of the clipping applied by the PPO surrogate objective
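
A minimal sketch of such a polynomial decay (the function name and signature are assumptions for illustration, not necessarily the repository's exact schedule implementation):

def polynomial_decay(initial, final, max_decay_steps, power, current_step):
    """Decay a value polynomially from initial to final over max_decay_steps updates."""
    if current_step >= max_decay_steps or initial == final:
        return final
    progress = current_step / max_decay_steps
    return (initial - final) * (1.0 - progress) ** power + final

# Example: decay the learning rate from 3e-4 to 1e-5 over 1000 updates (power=1.0 is linear)
lr = polynomial_decay(3.0e-4, 1.0e-5, 1000, 1.0, current_step=250)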

Model Architecture

[Figure: model architecture]

The figure above illustrates the model architecture in the case of training Minigrid. The visual observation is processed by 3 convolutional layers. The flattened result is then divided into sequences before feeding it to the recurrent layer. After passing the recurrent layer's result to one hidden layer, the network is split into two streams. One computes the value function and the other one the policy. All layers use the ReLU activation.

In the case of training an environment that utilizes vector observations only, the visual encoder is omitted and the observation is fed directly to the recurrent layer.
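
A condensed, hypothetical PyTorch sketch of this architecture (layer sizes, kernel sizes and names are illustrative assumptions; see the repository's model code for the actual implementation):

import torch
import torch.nn as nn

class RecurrentActorCritic(nn.Module):
    """Illustrative sketch: conv encoder -> recurrent layer -> hidden layer -> policy & value heads."""
    def __init__(self, num_actions, hidden_state_size=256, hidden_layer_size=256):
        super().__init__()
        self.encoder = nn.Sequential(                  # processes 3x84x84 visual observations
            nn.Conv2d(3, 32, 8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, 3, stride=1), nn.ReLU(),
            nn.Flatten(),
        )
        self.rnn = nn.GRU(64 * 7 * 7, hidden_state_size, batch_first=True)
        self.hidden = nn.Sequential(nn.Linear(hidden_state_size, hidden_layer_size), nn.ReLU())
        self.policy = nn.Linear(hidden_layer_size, num_actions)  # policy stream (logits)
        self.value = nn.Linear(hidden_layer_size, 1)             # value function stream

    def forward(self, obs, hxs, sequence_length=1):
        x = self.encoder(obs)                            # feed the whole (flattened) batch at once
        x = x.reshape(-1, sequence_length, x.shape[-1])  # rearrange into sequences for the RNN
        x, hxs = self.rnn(x, hxs)
        x = self.hidden(x.reshape(-1, x.shape[-1]))      # flatten back for the two heads
        return self.policy(x), self.value(x).squeeze(-1), hxs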

Add environment

Follow these steps to train another environment:

  1. Extend the create_env() function in utils.py by adding another if-statement that queries the environment's name
  2. At this point you could simply use gym.make() or use a custom environment that builds on top of the gym interface.
  3. Adjust the "env" key inside the config dictionary to match the name of the new environment
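
A hedged sketch of step 1, assuming create_env() receives the environment's name from the config (the exact signature in utils.py and the environment name used here are placeholders):

import gymnasium as gym

def create_env(env_name, render=False):
    # ... existing if-statements for the environments shipped with this repository ...
    if env_name == "MyCustomEnv-v0":  # hypothetical new environment
        return gym.make("MyCustomEnv-v0", render_mode="human" if render else None)
    raise ValueError(f"Unknown environment: {env_name}")

Afterwards, set the "env" key of the chosen config dictionary to "MyCustomEnv-v0".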

Tensorboard

During training, tensorboard summaries are saved to summaries/run-id/timestamp.

Run tensorboard --logdir=summaries to watch the training statistics in your browser using the URL http://localhost:6006/.

Results

The code for plotting the results can be found in the results directory. Results on Memory Gym can be found in our TransformerXL + PPO baseline repository.

MinigridMemory-S9

Minigrid Memory Result

MinigridMemoryRandom-S17

(only trained on MinigridMemory-S9 using unlimited seeds)

Minigrid Memory S17

PoC Memory Task

PoC Result

Cartpole Masked Velocity

CartPole Masked Velocity Result

recurrent-ppo-truncated-bptt's People

Contributors

horrible22232, marcometer


recurrent-ppo-truncated-bptt's Issues

Adapting the repo to my specific problem

Hi! Your repo is a very good guide to LSTM with PPO. I have been trying to adapt it to my personal gym env.
My env:

  • Obs: Dict {time-series-data: 60 last entries (like a window moving each timestep) of 15 features : 60x15 + 3 1d vectors}
  • Actions: Box(2,)
  • Approx. steps per episode: 400

What I plan to use as a solution:

  • An LSTM (just for the time-series data) + a linear layer (fed with the previous result, flattened, plus the remaining vectors of the obs), which then feeds PPO's actor and critic policies.

Problems and doubts:

  1. You have proposed to flatten the 60x15 input and then use attention on it. Can you provide a simple code example? I haven't worked with attention and there is not much info out there.
  2. Could I use an LSTM with attention, or maybe a dual encoder-decoder LSTM with or without attention? I want a simple but powerful approach. I don't think the agent has to 'predict' the next values of the time series (which, as with stocks, are difficult to predict); maybe it can figure out trend changes so that it adapts to those with actions that turn triggers on or off in the env. I wonder which architecture suits this best.
  3. Right now I am trying to adapt your work to my env. I get the error Expected hidden[0] size (3, 1260, 128), got [3, 21, 128] in _train_mini_batch when feeding the model. The issue is that recurrent_mini_batch_generator only collects states at the beginning of a sequence. I don't know how to handle it, as I have already made some changes to shapes that may not be correct, and I feel very new to this. What do I use as batches? Fixed sequences of observations? I'm messing up how to feed the observations into the whole learning architecture. Any insight will be appreciated :)
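
For reference, the shape mismatch in the error above follows from what PyTorch's nn.LSTM expects: with batch_first=True and an input of shape (num_sequences, sequence_length, features), the hidden and cell states must have shape (num_layers, num_sequences, hidden_size). A minimal sketch (the sizes mirror the error message and are otherwise arbitrary):

import torch
import torch.nn as nn

num_layers, hidden_size, features = 3, 128, 32
num_sequences, sequence_length = 1260, 8

lstm = nn.LSTM(features, hidden_size, num_layers=num_layers, batch_first=True)
x = torch.randn(num_sequences, sequence_length, features)
h0 = torch.zeros(num_layers, num_sequences, hidden_size)  # (3, 1260, 128), not (3, 21, 128)
c0 = torch.zeros(num_layers, num_sequences, hidden_size)
out, (hn, cn) = lstm(x, (h0, c0))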

Suggestions for training on multiple environments simultaneously?

Hi, thanks for making this nice repo! I am interested in testing whether recurrent policy gradients can be used for image classification (sounds strange, but basically each image is extremely large, like 100,000 by 100,000, so it's impossible to process the entire image at once; instead I think I can perhaps train an agent to selectively navigate parts of the image sequentially, pick up on landmarks of interest, and then make a classification).

So naturally, the goal here is not to solve a single environment (i.e. one image) since the model will just memorize the label. The goal is to train on e.g. 10,000 different environments (images), and test how well the agent + classifier will perform on held-out data. Do you have any suggestions for whether it makes sense to use your repo as a starting point, and modify it such that we either use a new environment in each mini-batch or sample from multiple environments in the same mini-batch? thanks!

Question Regarding Sequence Length

Firstly, I wanted to thank you for the great project, it helped me understand Recurrent PPO better.

I mostly have one main question, regarding how the training is done, especially regarding the Actor and Critic hidden states.

In my situation, to better understand the workflow of the project, I use only 1 worker.

If I understand correctly, this worker will collect data (observations, actions, hidden states etc.) from the environment until termination. In the example of CartPole it would be something like this:
Step 1:
Observations: [-0.0058, -0.0000, -0.0079, 0.0000]
Hidden States (Actor): a tuple of two zero tensors, each of shape (1, 1, 64)

Then for Step 2 we get new Observations and Hidden States and so forth.

My question is regarding the training phase and the actual sequence length. If my batch contains 4 episodes (padded to the length of the longest episode), then sequence_length = longest_episode_length, right? Assuming, of course, that I want my sequence length to be the length of my episodes.

Moreover, during training, are we using only the first hidden state for each episode? If that is the case, then for each episode that is fed to the networks, the hidden states are always going to be the initial states, i.e. again a tuple of two zero tensors of shape (1, 1, 64).

Hence if this is the case, what is the point of saving the hidden states during the episodes?

I might also have misunderstood this part, so I apologize in advance if this is a nonsense question :)

about sequence_length

This issue is mainly to say thanks for this great project.

Following this code, I wrote a PPO algorithm with an RNN and trained it in the Hallway environment of ml-agents.
I found that the parameter sequence_length has a great influence on the training results. I don't know why; I think sequence_length does not affect the memory length of the agent. Is this correct?

The following is the reward curve, the max reward is 1:
green line: sequence_length = 8
gray line: sequence_length = 16
[Figure: reward curves]

Pre-trained Models Do Not Work

When I run python enjoy.py --model=./models/mortar_mayhem_grid.nn, I get this python pickle error:

pygame 2.5.1 (SDL 2.28.2, Python 3.7.16)
Hello from the pygame community. https://www.pygame.org/contribute.html
Traceback (most recent call last):
  File "enjoy.py", line 71, in <module>
    main()
  File "enjoy.py", line 26, in main
    state_dict, config = pickle.load(open(model_path, "rb"))
_pickle.UnpicklingError: invalid load key, 'v'.

I followed the installation steps listed in the README for CUDA. The commands I ran were:

conda create -n recurrent-ppo python=3.7 --yes
conda activate recurrent-ppo
conda install pytorch==1.12.1 torchvision==0.13.1 torchaudio==0.12.1 cudatoolkit=11.6 -c pytorch -c conda-forge
git clone https://github.com/MarcoMeter/recurrent-ppo-truncated-bptt.git
cd recurrent-ppo-truncated-bptt
pip install -r requirements.txt
python enjoy.py --model=./models/mortar_mayhem_grid.nn

I am running these commands on Ubuntu 20.04.6.

Could you please look into this? Thanks!

Calculation of the Generalized Advantage Estimation

Hi! Thanks for sharing the RNN-PPO implementation, it's very insightful. I have a question about the GAE calculation calc_advantages as part of the rollout buffer.

My understanding is that the GAE is calculated over a single episode at a time. However, with the current approach the GAE is calculated over all the episodes combined per worker, if I am not mistaken about the implementation of the data sampling.

The exact line I'm talking about:

self.buffer.calc_advantages(last_value, self.config["gamma"], self.config["lamda"])
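
For reference, a minimal sketch of GAE with episode masking (this assumes flat per-worker arrays of rewards, values and done flags; it is not the repository's exact calc_advantages implementation):

import numpy as np

def calc_advantages(rewards, values, dones, last_value, gamma, lamda):
    """Compute GAE over a flat rollout; the done mask stops bootstrapping across episode ends."""
    advantages = np.zeros_like(rewards)
    gae = 0.0
    next_value = last_value
    for t in reversed(range(len(rewards))):
        mask = 1.0 - dones[t]
        delta = rewards[t] + gamma * next_value * mask - values[t]
        gae = delta + gamma * lamda * mask * gae
        advantages[t] = gae
        next_value = values[t]
    return advantages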
