relax's Introduction

ReLAx

ReLAx - Reinforcement Learning Applications

ReLAx is an object-oriented library for deep reinforcement learning built on top of PyTorch.

Implemented Algorithms

The ReLAx library contains implementations of the following algorithms:

Special Features

ReLAx offers a set of special features:

And other options for building non-standard RL architectures:

Usage With Custom Environments

Custom, user-defined environments can also be written and used with ReLAx; a minimal sketch of one is shown below.
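The sketch assumes the classic gym.Env interface (reset/step plus action and observation spaces), which is what the Sampler wrapper consumes in the examples in this README. The environment itself (BitFlipEnv, its reward and dynamics) is purely illustrative and not part of ReLAx:

import numpy as np
import gym
from gym import spaces

class BitFlipEnv(gym.Env):
    """Toy environment: flip bits one at a time until all of them are ones."""

    def __init__(self, n_bits=4):
        super().__init__()
        self.n_bits = n_bits
        self.action_space = spaces.Discrete(n_bits)
        self.observation_space = spaces.Box(low=0.0, high=1.0,
                                            shape=(n_bits,), dtype=np.float32)
        self.state = np.zeros(n_bits, dtype=np.float32)

    def reset(self):
        # Start from a random bit configuration
        self.state = np.random.randint(0, 2, size=self.n_bits).astype(np.float32)
        return self.state.copy()

    def step(self, action):
        # Flip the chosen bit
        self.state[action] = 1.0 - self.state[action]
        done = bool(self.state.all())
        reward = 1.0 if done else -0.1
        return self.state.copy(), reward, done, {}

An instance of such an environment can then be wrapped the same way as the built-in tasks, e.g. sampler = Sampler(BitFlipEnv()).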

Minimal Examples

On Policy

import torch
import gym

from relax.rl.actors import VPG
from relax.zoo.policies import CategoricalMLP
from relax.data.sampling import Sampler

# Create training and eval envs
env = gym.make("CartPole-v1")
eval_env = gym.make("CartPole-v1")

# Wrap them into Sampler
sampler = Sampler(env)
eval_sampler = Sampler(eval_env)

# Define Vanilla Policy Gradient actor
actor = VPG(
    device=torch.device('cuda'), # torch.device('cpu') if no gpu available
    policy_net=CategoricalMLP(acs_dim=2, obs_dim=4,
                              nlayers=2, nunits=64),
    learning_rate=0.01
)

# Run training loop:
for i in range(100):
    
    # Sample training data
    train_batch = sampler.sample(n_transitions=1000,
                                 actor=actor,
                                 train_sampling=True)
    
    # Update VPG actor
    actor.update(train_batch)
    
    # Collect evaluation episodes
    eval_batch = eval_sampler.sample_n_episodes(n_episodes=5,
                                                actor=actor,
                                                train_sampling=False)
    
    # Print average return per iteration
    print(f"Iter: {i}, eval score: {eval_batch.create_logs()['avg_return']}")
    

Off Policy

import torch
import gym

from relax.rl.actors import ArgmaxQValue
from relax.rl.critics import DQN

from relax.exploration import EpsilonGreedy
from relax.schedules import PiecewiseSchedule
from relax.zoo.critics import DiscQMLP

from relax.data.sampling import Sampler
from relax.data.replay_buffer import ReplayBuffer

# Create training and eval envs
env = gym.make("CartPole-v1")
eval_env = gym.make("CartPole-v1")

# Wrap them into Sampler
sampler = Sampler(env)
eval_sampler = Sampler(eval_env)

# Define schedules
# For the first 5k steps lr is 0 and eps is 1:
# no learning, only random sampling; afterwards lr=5e-5 and eps=1e-3
lr_schedule = PiecewiseSchedule({0: 5000}, 5e-5)
eps_schedule = PiecewiseSchedule({1: 5000}, 1e-3)

# Define actor
actor = ArgmaxQValue(
    exploration=EpsilonGreedy(eps=eps_schedule)
)

# Define critic
critic = DQN(
    device=torch.device('cuda'), # torch.device('cpu') if no gpu available
    critic_net=DiscQMLP(obs_dim=4, acs_dim=2, 
                        nlayers=2, nunits=64),
    learning_rate=lr_schedule,
    batch_size=100,
    target_updates_freq=3000
)

# Provide actor with critic
actor.set_critic(critic)

# Run q-iteration training loop:
print_every = 1000
replay_buffer = ReplayBuffer(100000)

for i in range(100000):
    
    # Sample training data (one transition)
    train_batch = sampler.sample(n_transitions=1,
                                 actor=actor,
                                 train_sampling=True)
                                 
    # Add it to buffer                             
    replay_buffer.add_paths(train_batch)
    
    # Update DQN critic
    critic.update(replay_buffer)
    
    # Update ArgmaxQValue actor (only to step schedules)
    actor.update()
    
    if i > 0 and i % print_every == 0:
        # Collect evaluation episodes
        eval_batch = eval_sampler.sample_n_episodes(n_episodes=5,
                                                    actor=actor,
                                                    train_sampling=False)

        # Print average return per iteration
        print(f"Iter: {i}, "
              f"eval score: {eval_batch.create_logs()['avg_return']}, "
              f"buffer score: {replay_buffer.create_logs()['avg_return']}")

Installation

Building from GitHub Source

Installing into a separate virtual environment:

git clone https://github.com/nslyubaykin/relax
cd relax
conda create -n relax python=3.6
conda activate relax
pip install -r requirements.txt
pip install -e .
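
To confirm the editable install resolved correctly, a quick smoke test is to import one of the classes used in the minimal examples above (an informal check, not an official one; it only verifies that the package and its dependencies import):

python -c "from relax.rl.actors import VPG; print('ReLAx imported OK')"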

MuJoCo

To install MuJoCo, do the following steps:

mkdir ~/.mujoco
cd ~/.mujoco
wget http://www.roboti.us/download/mujoco200_linux.zip
unzip mujoco200_linux.zip
mv mujoco200_linux mujoco200
rm mujoco200_linux.zip
wget http://www.roboti.us/file/mjkey.txt

Then, add the following line to the bottom of your .bashrc:

export LD_LIBRARY_PATH=~/.mujoco/mujoco200/bin/

Finally, install mujoco_py itself:

pip install mujoco-py==2.0.2.2

Note: the installation often crashes with the error "command 'gcc' failed with exit status 1". To fix this, run:

sudo apt-get install gcc
sudo apt-get install build-essential

Then try installing mujoco-py==2.0.2.2 again.
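
Once mujoco-py builds, a reasonable smoke test is the one-liner below (an assumption, not an official check; it relies on the standard Gym MuJoCo task HalfCheetah-v2). The first import of mujoco_py triggers its compilation, so this is also where build problems surface:

python -c "import mujoco_py, gym; gym.make('HalfCheetah-v2'); print('MuJoCo OK')"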

Atari Environments

The ReLAx package was developed and tested with gym[atari]==0.17.2. Newer versions should also work; however, their compatibility with the provided Atari wrappers is uncertain.

To install Gym Atari:

pip install gym[atari]==0.17.2

In case of a "ROMs not found" error, do the following steps (a quick check that the ROMs were registered is shown after them):

  1. Download the ROMs archive
wget http://www.atarimania.com/roms/Roms.rar
  2. Unpack it
unrar x Roms.rar
  3. Install atari_py
pip install atari_py
  4. Provide atari_py with the ROMs
python -m atari_py.import_roms ROMS
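
To verify that the ROMs were registered, a simple informal check is the one-liner below; it assumes the stock PongNoFrameskip-v4 environment id shipped with gym[atari]==0.17.2:

python -c "import gym; gym.make('PongNoFrameskip-v4'); print('Atari OK')"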

Further Developments

The following functionality is planned to be added in the future:

  • Curiosity (RND)
  • Offline RL (CQL, BEAR, BCQ, SAC-N, EDAC)
  • Decision Transformers
  • PPG
  • QR-DQN
  • IQN
  • FQF
  • Discrete SAC
  • NAF
  • Stochastic environment models
  • Improving documentation

Known Issues

  • Lack of documentation (currently compensated for with usage examples)
  • On some systems relax.zoo.layers.NoisyLinear seems to leak memory. The issue is very unpredictable and not yet fully understood. Sometimes installing different versions of PyTorch and CUDA fixes it. If the problem persists, consider not using noisy linear layers as a workaround.
  • Filtering & Reward Weighted Refinement does not yet reach the performance declared in its paper
  • DYNA-Q is not compatible with PER, as it is unclear which priority to assign to synthetic branched transitions (a possible option: the same priority as the parent transition)


relax's Issues

Solved

Hello Nikita,

first of all, I would like to acknowledge that this is probably the cleanest ML code I have seriously ever seen. Thank you for that!

So I am using FRWR and Random Shooting, and I have a question about the horizon.
I am running on limited resources in real time (GTX 1060 + Windows + CUDA; overall I would say RS is quite fast, CEM slow, FRWR a bit slow), and I see that the horizon length scales poorly with respect to overall performance. So I experimented with lowering the iterations per second and reduced, for instance, the ensemble size to 3, candidate_sequences = 300, horizon = 3-5. I thought it would be nicer to spread the sampling into the future by 2^x (so horizon = [1, 2, 4, 8, 16]), maybe also weighting the later steps even higher, and discarding the intermediate steps when evaluating candidates. That way I could span a bigger time horizon with the same number of steps. (Maybe this approach is already used in the MPC community, but I am fairly new to it, so I don't know.)
I see you have a very interesting scheduler system, but I am confused about how to initialize an exponential schedule that lags the horizon in its tensor dimensions. I made a drawing for this.

[Image: mpc horizon mod]

Update: in the data utils' get_next_lag_obs, I tried:

    lag_obs_split = np.split(lag_obs, 
                             indices_or_sections=1+nlags**2,   # pow  or   =1+int(nlags**1.25)
                             axis=concat_axis)

I am also thinking of converting all lists to np.arrays and JIT-ing the buffers with JAX (and trying Hugging Face Accelerate), but I am not sure.

Also, a short question: would it be possible to replace RS/CEM with (NN-approximated) MPPI, as they suggest (a combination of MPPI/FRWR)?

Anyway, thank you for your beautiful and very instructive codebase! (I am wondering what, or which place, taught you this way of ('purely pythonic'?) coding.)

Best

Lee
Living Computation Foundation Member
