vwxyzjn / cleanrl

High-quality single file implementation of Deep Reinforcement Learning algorithms with research-friendly features (PPO, DQN, C51, DDPG, TD3, SAC, PPG)

Home Page: http://docs.cleanrl.dev

License: Other

Python 93.31% Shell 5.53% Dockerfile 0.14% HCL 1.02%

wandb reinforcement-learning pytorch python gym machine-learning deep-reinforcement-learning deep-learning atari ale


cleanrl's Issues

Refactor the `argparse` parameters so that `learning_rate` and `total_timesteps` move down to the algorithm-specific arguments.

Also refactor gym-id to env

Dict observation space

Hi, is the PPO implementation in this repo able to handle the Dict observation space? Many thanks!

C51 does a cross-entropy loss which could have numerical instability depending on the implementation. See link for an overview. Usually calculating the cross-entropy loss directly from the logits is more numerically stable. However, I am not sure how to do it exactly.

Deepmind's dqn_zoo has an implementation that seems to use the logits directly:

https://github.com/deepmind/dqn_zoo/blob/f011d683529d8d23b017a95194ebbb41a4962fe8/dqn_zoo/c51/agent.py#L35
https://github.com/deepmind/rlax/blob/42bbcf97a69ef9b21cb88322b83169ade7930363/rlax/_src/value_learning.py#L703
https://github.com/deepmind/rlax/blob/42bbcf97a69ef9b21cb88322b83169ade7930363/rlax/_src/value_learning.py#L543

I am recording this issue for reference, but in practice it is often enough to do

loss = (-(target_pmfs * old_pmfs.clamp(min=1e-5, max=1-1e-5).log()).sum(-1)).mean()
# instead of 
# unstable_loss = (-(target_pmfs * old_pmfs.log()).sum(-1)).mean()

The more stable loss results in

and unstable_loss results in

See more at #102

If anyone is interested in digging into this, that will be fantastic.
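One possible way to work from the logits is to take a log-softmax instead of clamping probabilities and calling log(). The following is only a minimal NumPy sketch under the assumption that the network exposes raw logits; `old_logits`, `log_softmax`, and `cross_entropy_from_logits` are hypothetical names, and this is not the dqn_zoo or rlax code.

```python
import numpy as np

def log_softmax(logits, axis=-1):
    # Subtract the max before exponentiating so exp() cannot overflow.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def cross_entropy_from_logits(target_pmfs, old_logits):
    # Same quantity as (-(target_pmfs * old_pmfs.log()).sum(-1)).mean(),
    # but log-probabilities come from the logits, so no log(0) can occur.
    return float((-(target_pmfs * log_softmax(old_logits)).sum(-1)).mean())

# Hypothetical batch: 2 distributions over 3 atoms, the first with extreme logits.
old_logits = np.array([[1000.0, 0.0, -1000.0], [0.0, 0.0, 0.0]])
target_pmfs = np.array([[0.0, 1.0, 0.0], [1 / 3, 1 / 3, 1 / 3]])
loss = cross_entropy_from_logits(target_pmfs, old_logits)
```

With these inputs a plain softmax followed by log() would hit log(0) for the first row, while the log-softmax route stays finite.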

GitPod link not working

Problem Description

The link to the GitPod dev environment on the instructions page is not working. The dev environment does not open in GitPod; the error message says that there is no GitPod file in the repository.

Current Behavior

See above.

Expected Behavior

Dev environment opening in GitPod.

Possible Solution

Fix GitPod link

Steps to Reproduce

Try to click on GitPod banner-link on instructions page.

AWS example?

Do you guys have an example of how to run this on AWS? I've never done it but it sounds intriguing.
Also, what kind of costs do you pay for training a single game, for example?

Support gym.wrappers.Monitor Wrapper

Let's add the Monitor integration to support things like https://youtu.be/UYBkrJBVvys

Normalized Env Bug

There has been an issue with the NormalizedEnv in ppo2_continuous_action.py. It uses the underlying RunningMeanStd, which is incorrectly implemented as illustrated below:

import numpy as np
# taken from https://github.com/openai/baselines/blob/master/baselines/common/vec_env/vec_normalize.py
class RunningMeanStd(object):
    def __init__(self, epsilon=1e-4, shape=()):
        self.mean = np.zeros(shape, 'float64')
        self.var = np.ones(shape, 'float64')
        self.count = epsilon

    def update(self, x):
        batch_mean = np.mean(x, axis=0)
        batch_var = np.var(x, axis=0)
        batch_count = 1
        self.update_from_moments(batch_mean, batch_var, batch_count)

    def update_from_moments(self, batch_mean, batch_var, batch_count):
        self.mean, self.var, self.count = update_mean_var_count_from_moments(
            self.mean, self.var, self.count, batch_mean, batch_var, batch_count)

def update_mean_var_count_from_moments(mean, var, count, batch_mean, batch_var, batch_count):
    delta = batch_mean - mean
    tot_count = count + batch_count

    new_mean = mean + delta * batch_count / tot_count
    m_a = var * count
    m_b = batch_var * batch_count
    M2 = m_a + m_b + np.square(delta) * count * batch_count / tot_count
    new_var = M2 / tot_count
    new_count = tot_count

    return new_mean, new_var, new_count

print("incorrect RunningMeanStd uses the same mean across data array")
ob_rms = RunningMeanStd(shape=(2,))
state = np.array([0.52191359, 0.24749929])
print(ob_rms.mean)
print(ob_rms.var)
ob_rms.update(state)
print(ob_rms.mean)
print(ob_rms.var)

class RunningMeanStd(object):
    def __init__(self, epsilon=1e-4, shape=()):
        self.mean = np.zeros(shape, 'float64')
        self.var = np.ones(shape, 'float64')
        self.count = epsilon

    def update(self, x):
        batch_mean = np.mean([x], axis=0)
        batch_var = np.var([x], axis=0)
        batch_count = 1
        self.update_from_moments(batch_mean, batch_var, batch_count)

    def update_from_moments(self, batch_mean, batch_var, batch_count):
        self.mean, self.var, self.count = update_mean_var_count_from_moments(
            self.mean, self.var, self.count, batch_mean, batch_var, batch_count)

print("correct RunningMeanStd uses different means across dimensions")
ob_rms = RunningMeanStd(shape=(2,))
state = np.array([0.52191359, 0.24749929])
print(ob_rms.mean)
print(ob_rms.var)
ob_rms.update(state)
print(ob_rms.mean)
print(ob_rms.var)

And the output is

incorrect RunningMeanStd uses the same mean across data array
[0. 0.]
[1. 1.]
[0.38466797 0.38466797]
[0.01893871 0.01893871]
correct RunningMeanStd uses different means across dimensions
[0. 0.]
[1. 1.]
[0.5218614  0.24747454]
[0.00012722 0.00010611]
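A slightly more general variant of the fix, sketched under the assumption that callers may pass either a single observation or a batch: reshape the input to (batch, *shape) and take the batch count from the number of rows. This is a self-contained illustration, not the code that ended up in the repository.

```python
import numpy as np

class RunningMeanStd:
    def __init__(self, epsilon=1e-4, shape=()):
        self.mean = np.zeros(shape, "float64")
        self.var = np.ones(shape, "float64")
        self.count = epsilon

    def update(self, x):
        # Promote a single observation of shape (d,) to a batch of shape (1, d)
        # so that statistics are always taken over the batch axis only.
        x = np.asarray(x, dtype="float64").reshape((-1,) + self.mean.shape)
        batch_mean, batch_var, batch_count = x.mean(axis=0), x.var(axis=0), x.shape[0]
        delta = batch_mean - self.mean
        tot_count = self.count + batch_count
        # Parallel-variance combination, as in update_mean_var_count_from_moments.
        m2 = (self.var * self.count + batch_var * batch_count
              + np.square(delta) * self.count * batch_count / tot_count)
        self.mean = self.mean + delta * batch_count / tot_count
        self.var = m2 / tot_count
        self.count = tot_count

ob_rms = RunningMeanStd(shape=(2,))
ob_rms.update(np.array([0.52191359, 0.24749929]))
```

For the single state used above, this should reproduce the per-dimension mean and variance printed by the "correct" version.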

This is great work. How did you plot the following curves: with seaborn or wandb? If with wandb, how can I edit them? Thanks.

Deprecating `apex_dqn_atari.py`

Problem Description

The current apex_dqn_atari.py cannot meet the level of performance in published results. Given about 4 hours of training time, ApeX-DQN published result significantly outperforms ours.

Environment	`apex_dqn_atari.py` result	ApeX-DQN Published result
BreakoutNoFrameskip-v4	356.95 ± 46.40	~450
PongNoFrameskip-v4	19.61 ± 0.54	~20
BeamRiderNoFrameskip-v4	2852.69 ± 706.75	~35000
SpaceInvadersNoFrameskip-v4	927.29 ± 146.49	~12000
QbertNoFrameskip-v4	2613.47 ± 796.08	~16000

ApeX-DQN Published result:

Cause

Admittedly, the hardware used is drastically different: ApeX-DQN uses 360 actors while our apex_dqn_atari.py uses 4 actors. There are many other implementation differences, too. apex_dqn_atari.py started out as my toy script, but as we hold a high bar for CleanRL's implementation, we may have to deprecate it, or at least remove it from the current repository and re-submit.

Refactoring on Class Arguments

Problem Description

Since CleanRL has been using single-file implementations with no main() function, a lot of global variables are created, and sometimes we have code within a class access global variables when those global variables should be passed to the classes.

Current Behavior

As an example, in https://github.com/vwxyzjn/cleanrl/blob/master/cleanrl/dqn_atari_visual.py,

class QNetwork(nn.Module):
    def __init__(self, frames=4):
        super(QNetwork, self).__init__()
        self.network = nn.Sequential(
            Scale(1/255),
            nn.Conv2d(frames, 32, 8, stride=4),
            nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2),
            nn.ReLU(),
            nn.Conv2d(64, 64, 3, stride=1),
            nn.ReLU(),
            nn.Flatten(),
            nn.Linear(3136, 512),
            nn.ReLU(),
            Linear0(512, env.action_space.n)
        )

    def forward(self, x):
        x = torch.Tensor(x).to(device)
        return self.network(x)

The forward function uses the global variable device (and the constructor likewise reads the global env). This is slightly undesirable.

Expected Behavior

The forward function should be

    def forward(self, x, device):
        x = torch.Tensor(x).to(device)
        return self.network(x)

It would be great if someone is willing to take some time and refactor the files. This problem is present in almost all of them.

Proper entropy regularized PPO

Problem Description

It seems the current implementation of PPO uses only a one-step entropy bonus (the entropy bonus is not included in the overall return). I see this as an implementation convenience passed along from other popular repos. Would you consider implementing proper entropy regularization in this repo? The performance gain might be crucial in some cases, as shown in this paper (section 6.1). The main difference is shown in eq. 79-80 of section 6.1.

Current Behavior

Using one-step entropy bonus

Expected Behavior

Using proper entropy bonus

Possible Solution

The entropy bonus should be added to the rewards before computing the advantage. This should be simple to implement as it changes r to r + entropy then the rest of the process should be the same if I am not mistaken.
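To make the proposal concrete, here is a rough sketch of "change r to r + entropy, then compute the advantage as usual". It assumes the bonus is ent_coef times the policy entropy at each step and a single trajectory segment without terminations; `gae`, `entropy_augmented_rewards`, and all numbers are hypothetical, and the paper's exact equations are not reproduced here.

```python
def gae(rewards, values, next_value, gamma=0.99, gae_lambda=0.95):
    # Standard GAE over one trajectory segment with no terminations inside.
    advantages, last = [0.0] * len(rewards), 0.0
    for t in reversed(range(len(rewards))):
        v_next = next_value if t == len(rewards) - 1 else values[t + 1]
        delta = rewards[t] + gamma * v_next - values[t]
        last = delta + gamma * gae_lambda * last
        advantages[t] = last
    return advantages

def entropy_augmented_rewards(rewards, entropies, ent_coef):
    # Assumption: the bonus added to r_t is ent_coef * H(pi(.|s_t)).
    return [r + ent_coef * h for r, h in zip(rewards, entropies)]

rewards, values, entropies = [1.0, 0.0, 1.0], [0.5, 0.4, 0.3], [0.7, 0.6, 0.5]
plain = gae(rewards, values, next_value=0.2)
shaped = gae(entropy_augmented_rewards(rewards, entropies, ent_coef=0.01),
             values, next_value=0.2)
```

Because the bonus enters through the rewards, it is discounted and propagated to earlier steps by GAE, which is the difference from adding a one-step entropy term to the loss.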

GAE bug with PPO2

The following results using GAE are clearly incorrect. The last value of the advantages array is very off.

The reason for the bug might be related to episode_lengths = [-1], where the GAE-based calculation sets the last value of the advantages array incorrectly. This requires a better implementation and a fix.

Investigate `nn.utils.clip_grad_norm_` for DQN, DDPG, and TD3

Problem Description

Compared to the original implementations, our DQN, DDPG, and TD3 implementations additionally do global gradient clipping, a code-level optimization done in PPO. It is unclear if global gradient clipping offers real performance benefits, so we should look into it and remove it if necessary.

dqn_atari.py
ddpg_continuous_action.py
td3_continuous_action.py
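For readers unfamiliar with the term, "global" gradient clipping rescales all gradients by one shared factor based on their combined norm. The following framework-free sketch shows the idea as I understand it for the 2-norm case; `clip_by_global_norm` is a hypothetical helper, not the PyTorch function itself.

```python
import math

def clip_by_global_norm(grads, max_norm, eps=1e-6):
    # grads: list of per-parameter gradients, each a flat list of floats.
    total_norm = math.sqrt(sum(g * g for grad in grads for g in grad))
    # Scale every gradient by the same factor, and never scale up.
    scale = min(1.0, max_norm / (total_norm + eps))
    return [[g * scale for g in grad] for grad in grads], total_norm

# Two hypothetical parameters whose combined gradient norm is 5.
clipped, norm = clip_by_global_norm([[3.0], [0.0, 4.0]], max_norm=0.5)
```

Gradients whose combined norm is already below max_norm pass through unchanged, so the clipping only matters on updates with unusually large gradients.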

0.3 Release

Here is a list of TODOs for 0.2 release

[ ] Include a benchmark PNG file of all the algorithms using seaborn
[ ] Consider using a wrapper to replace the functions in common.py
[ ] Consider wrapping the env and only use torch arrays
[ ] Benchmark Atari games
[ ] Better documentation for cloud support
[ ] Optimize the DQN memory usage

Something potentially not for 0.2 but more likely for future releases:
[ ] Evaluate using the VecEnv

Improving offline RL scripts

Problem Description // Current behavior

A reminder for the cleanrl/offline related issues mentioned in #130

1. SPS logging is missing (attempted to match #126 #130)
2. Monitor cannot be imported anymore due to the gym=0.23.0 update
3. Tests for those two scripts are missing
4. Unlike dqn_atari, the wrappers are not imported from SB3
5. The offline-env-id that is required to load the dataset does not seem to work anymore. Is there any dependency missing, such as d4rl or d4rl_atari for example?

(cleanrl) d055@kara:~/random/rl/cleanrl/cleanrl/offline$ python offline_dqn_atari_visual.py 

A.L.E: Arcade Learning Environment (version 0.7.4+069f8bd)
[Powered by Stella]
Traceback (most recent call last):
  File "offline_dqn_atari_visual.py", line 559, in <module>
    data_loader = iter(torch.utils.data.DataLoader(ExperienceReplayDataset(), batch_size=args.batch_size, num_workers=2))
  File "offline_dqn_atari_visual.py", line 544, in __init__
    self.dataset_env = gym.make(args.offline_env_id)
  File "/home/d055/anaconda3/envs/cleanrl/lib/python3.8/site-packages/gym/envs/registration.py", line 676, in make
    return registry.make(id, **kwargs)
  File "/home/d055/anaconda3/envs/cleanrl/lib/python3.8/site-packages/gym/envs/registration.py", line 490, in make
    versions = self.env_specs.versions(namespace, name)
  File "/home/d055/anaconda3/envs/cleanrl/lib/python3.8/site-packages/gym/envs/registration.py", line 220, in versions
    self._assert_name_exists(namespace, name)
  File "/home/d055/anaconda3/envs/cleanrl/lib/python3.8/site-packages/gym/envs/registration.py", line 297, in _assert_name_exists
    raise error.NameNotFound(message)
gym.error.NameNotFound: Environment `breakout-expert` doesn't exist. Did you mean: `Breakout-ram`?

Possible Solution

1. Straightforward to add, but did not want to overload #130
2. and 4. Make do without Monitor, use SB3's wrapper instead
3. Straightforward to add
5. Investigate the missing dependencies, as well as the generation of the args.offline_env_id

Print out episode reward for debugging without tensorboard

The current implementation doesn't print out anything once the scripts start running. Perhaps it would be more beginner-friendly if we print out something like the following just to let the user know that the script is actually running.

global_step=3442, episode_reward=15.527101137763129
global_step=3456, episode_reward=23.907788285943155
global_step=3472, episode_reward=19.012161566288178
global_step=3483, episode_reward=15.243719686337442
global_step=3497, episode_reward=16.92203202540712
global_step=3529, episode_reward=30.636879754445644
global_step=3553, episode_reward=28.04640999748334

An additional metric worth printing is the episode length.
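A small sketch of what such a log line could look like, including the episode length. `format_episode_log` and the `episode_length` field name are hypothetical; the scripts may end up using a different format.

```python
def format_episode_log(global_step, episode_reward, episode_length=None):
    # episode_length is the extra metric requested above; it is optional here.
    line = f"global_step={global_step}, episode_reward={episode_reward}"
    if episode_length is not None:
        line += f", episode_length={episode_length}"
    return line

# Hypothetical values, printed once per finished episode.
print(format_episode_log(3442, 15.5, episode_length=14))
```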

Generally Support Griddly Environments

Problem Description

Support Griddly (https://griddly.readthedocs.io/en/latest/) in the Open RL Benchmark (http://benchmark.cleanrl.dev/)

Current Problems

The games have different resolutions, which makes it difficult to write a generic script that handles all the Griddly games once and for all. @Bam4d is working on methods such as global/average pooling to address this.

TypeError: can't assign a list to a torch.cuda.FloatTensor

I got the above error in the line

cleanrl/cleanrl/ppo.py

Line 217 in bb18a39

obs[step] = next_obs

as my state space is a list [Box(4,)]. In order to address it, I converted next_obs to a tensor before assigning it to the obs tensor's selected index. I am not sure if this is the optimal solution, but it worked.

Implementing PPG (Phasic Policy Gradient)

Problem Description

PPG (Phasic Policy Gradient, https://arxiv.org/abs/2009.04416) seems to be a major update to PPO that improves its sample efficiency. PPG is examined on Procgen (#29) in the paper, which we should replicate in the Open RL Benchmark. In addition, it might be interesting to examine its performance in Atari and continuous control tasks as well.

Possible mistake in normalization of returns

cleanrl/cleanrl/ppo_continuous_action.py

Line 68 in a1a9021

    
rews = np.clip(rews / np.sqrt(self.ret_rms.var + self.epsilon), -self.cliprew, self.cliprew)

Shouldn't we subtract the mean, i.e. (rews - self.ret_rms.mean)?

I have closed the issue. Subtracting the mean seems to break it.
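For context, here is a rough, dependency-free sketch of the scaling in the quoted line: rewards are divided by the running standard deviation of the discounted return and then clipped. One plausible reason subtracting the mean hurts (not verified here) is that it shifts rewards and can flip their sign, whereas pure scaling preserves it. `ReturnScaler` and its defaults are hypothetical, and it uses a Welford-style variance rather than the repository's RunningMeanStd.

```python
import math

class ReturnScaler:
    def __init__(self, gamma=0.99, epsilon=1e-8, cliprew=10.0):
        self.gamma, self.epsilon, self.cliprew = gamma, epsilon, cliprew
        self.ret = 0.0  # running discounted return
        self.n, self.mean, self.m2 = 0, 0.0, 0.0  # Welford accumulators

    def __call__(self, rew, done=False):
        self.ret = self.ret * self.gamma + rew
        # Welford update of the variance of the discounted return.
        self.n += 1
        delta = self.ret - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (self.ret - self.mean)
        var = self.m2 / self.n if self.n > 1 else 1.0
        # Divide by the standard deviation only; the mean is NOT subtracted,
        # so the sign of every reward is preserved.
        scaled = rew / math.sqrt(var + self.epsilon)
        if done:
            self.ret = 0.0
        return max(-self.cliprew, min(self.cliprew, scaled))

scaler = ReturnScaler()
scaled_rewards = [scaler(r) for r in [1.0, 2.0, -1.0, 0.0]]
```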

KeyError: "terminal_observation" in dqn.py

Problem Description

When I run cleanrl/dqn.py, an error occurs at line 174:

My package version is here:
torch==1.7.1
gym==0.18.3
stable_baselines3==1.10

Re-think Open RL Benchmark.

I am thinking of re-doing the Open RL Benchmark that also includes benchmarks from other popular RL libraries, and CleanRL is just one of them. So we would have wandb and github projects like

openrlbenchmark/cleanrl
openrlbenchmark/baselines
openrlbenchmark/sb3
openrlbenchmark/tianshou
openrlbenchmark/rllib

And the good thing is that anyone can use the recorded metrics in the Open RL Benchmark like explained here: wandb/wandb#3231; so the contribution is that no one has to re-run the baselines experiments if they just want to compare the results.

`dqn.py` does not respect seed

Problem Description

python dqn.py --seed 1 could yield different results. See the following demo, where the first run yields global_step=24830, episodic_return=245.0 and the second run yields global_step=24975, episodic_return=178.0.

PPO: Shouldn't advantages be recomputed after every minibatch update?

https://github.com/openai/baselines/blob/master/baselines/ppo2/runner.py - here, lines 65-66
I was trying to reproduce the ppo paper results myself, and I noticed that in openai/baselines repo they compute GAE for each trajectory segment, but they don't use these advantages directly. They do some trick regarding the fact that Adv = Return - Value. Thus Return = Adv + Value.
And then at https://github.com/openai/baselines/blob/master/baselines/ppo2/ppo2.py lines 165-166 they only pass returns to train method.
And finally they compute advantages again regarding equation Adv = Return - Value here at line 136: https://github.com/openai/baselines/blob/master/baselines/ppo2/model.py

In your implementations you compute advantages only once, if I'm not mistaken. But to be honest, I'm not sure if it is actually crucial
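A quick numerical check of the identity being discussed, under the assumption that the values subtracted later are the same rollout-time values that were added earlier (which is how I read the baselines code, where the stored values are passed to train). In that case recomputing Adv = Return - Value yields the original GAE advantages. The helper and numbers below are hypothetical.

```python
def gae(rewards, values, next_value, gamma=0.99, gae_lambda=0.95):
    # Standard GAE over one trajectory segment with no terminations inside.
    advantages, last = [0.0] * len(rewards), 0.0
    for t in reversed(range(len(rewards))):
        v_next = next_value if t == len(rewards) - 1 else values[t + 1]
        delta = rewards[t] + gamma * v_next - values[t]
        last = delta + gamma * gae_lambda * last
        advantages[t] = last
    return advantages

rewards, values = [1.0, 0.0, 1.0], [0.5, 0.4, 0.3]
advantages = gae(rewards, values, next_value=0.2)
returns = [a + v for a, v in zip(advantages, values)]    # Return = Adv + Value
recomputed = [r - v for r, v in zip(returns, values)]    # Adv = Return - Value
```

The two would differ only if the value subtracted at training time were a fresh prediction from the updated network rather than the stored rollout value.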

LOMPO and COMBO implementation for visual offline RL

Hi,

I really like this repo and use it in my research. I found the visual CQL implementation very interesting and wonder if there is a plan to add COMBO and LOMPO algorithms for visual and offline RL?

Thank you

GAE Calculation for PPO

This issue will serve as a record to verify the correctness of the change.

[WinError 193] %1 is not a valid Win32 application

I am trying to install this package on Windows. After installing poetry, running poetry run ppo.py gives the following error:

[WinError 193] %1 is not a valid Win32 application

at c:\users\username\onedrive\anaconda3\lib\subprocess.py:1311 in _execute_child
1307│ sys.audit("subprocess.Popen", executable, args, cwd, env)
1308│
1309│ # Start the process
1310│ try:
→ 1311│ hp, ht, pid, tid = _winapi.CreateProcess(executable, args,
1312│ # no special security
1313│ None, None,
1314│ int(not close_fds),
1315│ creationflags,

Investigate DQN's regression in `MountainCar-v0`

Problem Description

In the previous version of Open RL Benchmark, we clearly observed that our dqn.py was able to solve MountainCar-v0 (see link). However, I could no longer reproduce this result with the latest dqn.py using the exact same hyperparameters. See here for the regression report.

Looking into the root cause

After looking into this further, it turns out the "culprit" is SB3's replay buffer. Our upstream SB3's replay buffer starts to properly handle truncation vs termination (see DLR-RM/stable-baselines3#243), and by disabling the proper handling of truncation via handle_timeout_termination=False I was able to reproduce past performance... ironically (see https://wandb.ai/costa-huang/cleanRL/reports/MountainCar-v0-Regression-Investigation--VmlldzoxODEyMzgw).
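To illustrate what changes between the two settings, here is a hedged sketch of a one-step TD target with and without timeout handling. `td_target` is a hypothetical helper, not SB3's replay buffer code; only the flag name is borrowed from the text above.

```python
def td_target(reward, next_q_max, done, timeout, gamma=0.99,
              handle_timeout_termination=True):
    # done: the episode ended at this step.
    # timeout: it ended only because a time limit was reached (truncation).
    if handle_timeout_termination and timeout:
        # A truncated episode is not a true terminal state: keep bootstrapping.
        done = False
    return reward + gamma * (0.0 if done else next_q_max)

# Hypothetical last step of an episode cut off by the time limit.
with_handling = td_target(-1.0, -10.0, done=True, timeout=True)
without_handling = td_target(-1.0, -10.0, done=True, timeout=True,
                             handle_timeout_termination=False)
```

Without the handling, the truncated step is treated as terminal and its target is just the reward, so in an environment with a negative per-step reward the targets at timeouts come out less negative than the bootstrapped ones.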

Where to go from here

I don't think finding proper hyperparameters for dqn.py should block #121, but this is something we can look into in the future.

Issues with applying PPO Impala on Retro Env in regards to running multiple environment

So what I essentially need is something like
"venv = ProcgenEnv(num_envs=" ... but for retro.make(). Running multiple retro environments is causing issues for me, and retrowrapper isn't helping. Thank you!

Roadmap for CleanRL

As CleanRL gets more mature, it's time to re-think the future. With CleanRL 1.0, we'd hope to further improve documentation and design better contribution guidelines. This issue tracks a few desired items for CleanRL 1.0.

1.0 1.1+

Support Procgen Environments

Problem Description

Procgen Environments (https://github.com/openai/procgen) are new environments to test out the generalization ability of agents. It would be nice to include some of the games into the Open RL Benchmark (http://benchmark.cleanrl.dev/)

This is a good first issue for contributors. I think contributors can simply modify the network model slightly (

cleanrl/cleanrl/ppo_atari_visual.py

Line 514 in db00739

self.network = nn.Sequential(

) to handle the Procgen Environments.

Merge the replay buffer implementations

Various minor PPO refactors

Problem Description

A lot of the formatting changes are suggested by @Howuhh

1. Refactor on `next_done`

The current code to handle done looks like this

            next_obs, reward, done, info = envs.step(action.cpu().numpy())
            rewards[step] = torch.tensor(reward).to(device).view(-1)
            next_obs, next_done = torch.Tensor(next_obs).to(device), torch.Tensor(done).to(device)

which is fine, but when I tried to adapt isaacgym it became an issue. Specifically, I thought the to(device) code is no longer needed so just did

            next_obs, reward, done, info = envs.step(action)

but this is wrong because I should have done next_done = done. The current next_done = torch.Tensor(done).to(device) just does not make a lot of sense.

We should refactor it to

            next_obs, reward, next_done, info = envs.step(action.cpu().numpy())
            rewards[step] = torch.tensor(reward).to(device).view(-1)
            next_obs, next_done = torch.Tensor(next_obs).to(device), torch.Tensor(next_done).to(device)

2. `make_env` refactor

if capture_video:
    if idx == 0:
        env = gym.wrappers.RecordVideo(env, f"videos/{run_name}")

if capture_video and idx == 0:
    env = gym.wrappers.RecordVideo(env, f"videos/{run_name}")

3. flatten batch

        b_obs = obs.reshape((-1,) + envs.single_observation_space.shape)
        b_logprobs = logprobs.reshape(-1)
        b_actions = actions.reshape((-1,) + envs.single_action_space.shape)
        b_advantages = advantages.reshape(-1)
        b_returns = returns.reshape(-1)
        b_values = values.reshape(-1)

        b_obs = obs.flatten(0, 1)
        b_actions = actions.flatten(0, 1)
        b_logprobs = logprobs.reshape(-1)
        b_returns = returns.reshape(-1)
        b_advantages = advantages.reshape(-1)
        b_values = values.reshape(-1)

4.


            if args.target_kl is not None:
                if approx_kl > args.target_kl:
                    break

            if args.target_kl is not None and approx_kl > args.target_kl:
                break

5.

cleanrl/cleanrl/ppo_atari.py

Line 209 in 9a74142

global_step += 1 * args.num_envs

global_step += args.num_envs

6.

move

cleanrl/cleanrl/ppo.py

Line 183 in 9a74142

num_updates = args.total_timesteps // args.batch_size

to the argparse.

Support StarCraft II Mini-game Environments (pysc2)

Problem Description

StarCraft II Environments (https://github.com/deepmind/pysc2) have some challenging mini-games. It would be nice to include some of the games into the Open RL Benchmark (http://benchmark.cleanrl.dev/)

For experienced researchers, this might be a good issue. I have already created a gym wrapper for sc2 (https://github.com/vwxyzjn/gym-pysc2), and here is an example run (https://wandb.ai/cleanrl/cleanrl.benchmark/runs/2qy45w8y?workspace=)

Broken links on README

Problem Description

Broken links on Algorithms Implemented section of README

Current Behavior

Links are pointing to unavailable pages

Expected Behavior

Link should point to desired code

Possible Solution

Remove links to removed code or change links to new location

Steps to Reproduce

The link to the following codes on README implemented algorithms section are pointing to old locations, resulting in page 404:

experiments/ppo_self_play.py
experiments/ppo_microrts.py
experiments/ppo_simple.py
experiments/ppo_simple_continuous_action.py

I am not sure whether there are more broken links; I found these because they point to the experiments folder, which no longer exists.

Add `rnd_ppo.py` documentation and refactor

rnd_ppo.py is a bit dated, and I recommend refactoring it to match other PPO style, which would include:

change the name from rnd_ppo.py to ppo_rnd.py
use from gym.wrappers.normalize import RunningMeanStd instead of implementing it ourselves (note the implementation might be a bit different).

create a make_env function like

cleanrl/cleanrl/ppo_atari.py

Lines 88 to 103 in 0b3f8ea

    
def make_env(env_id, seed, idx, capture_video, run_name):
    def thunk():
        env = gym.make(env_id)
        env = gym.wrappers.RecordEpisodeStatistics(env)
        if capture_video:
            if idx == 0:
                env = gym.wrappers.RecordVideo(env, f"videos/{run_name}")
        env = NoopResetEnv(env, noop_max=30)
        env = MaxAndSkipEnv(env, skip=4)
        env = EpisodicLifeEnv(env)
        if "FIRE" in env.unwrapped.get_action_meanings():
            env = FireResetEnv(env)
        env = ClipRewardEnv(env)
        env = gym.wrappers.ResizeObservation(env, (84, 84))
        env = gym.wrappers.GrayScaleObservation(env)
        env = gym.wrappers.FrameStack(env, 4)

remove the visualization (i.e., ProbsVisualizationWrapper)
use def get_value and def get_action_and_value for the Agent class

remove

cleanrl/cleanrl/rnd_ppo.py

Lines 706 to 708 in 0b3f8ea

    
class Flatten(nn.Module):
    def forward(self, input):
        return input.view(input.size(0), -1)

maybe log the average curiosity_reward instead?

cleanrl/cleanrl/rnd_ppo.py

Line 848 in 0b3f8ea

f"global_step={global_step}, episodic_return={info['episode']['r']}, curiosity_reward={curiosity_rewards[step][idx]}"

rename total_reward_per_env to curiosity_return

cleanrl/cleanrl/rnd_ppo.py

Line 854 in 0b3f8ea

total_reward_per_env = np.array(
Add SPS (steps per second) metric.

Overall I suggest selecting ppo_atari.py and rnd_ppo.py and use Compare Selected on VSCode to see the file difference and minimize the file difference:

Types of changes

Bug fix
New feature
New algorithm
Documentation

Checklist:

I've read the CONTRIBUTION guide (required).
I have ensured pre-commit run --all-files passes (required).
I have updated the documentation accordingly.
I have updated the tests accordingly (if applicable).

If you are adding new algorithms or your change could result in performance difference, you may need to (re-)run tracked experiments.

Loading benchmarked hyperparams?

Is there a way to load the reproducibility hyperparams automatically for each environment and alg?

HuggingFace's model hub integration

Problem Description

HuggingFace's model hub has become a standard place to host trained models. Now it is expanding coverage to the RL space (see huggingface.co/sb3 as an example and DLR-RM/rl-baselines3-zoo#198) & it would be nice for us to integrate, too. I spoke with @ThomasSimonini today, and he expressed interest in working with this.

10/6/22 update: I would like to rethink how we can support saved models in CleanRL as they have become increasingly relevant (e.g., recent research on using models to bootstrap RL; see reincarnate RL)

Challenges

Huggingface and SB3 make a great fit because SB3 already provides a uniform API for training and evaluation. With CleanRL, this is tricky since CleanRL is more of a repository for educational and prototyping purposes: we don't have uniform APIs as SB3 does.

Desired Features:

save model
evaluate model
upload model to HF
load model from HF

Possible Solution

I think we can start the integration in a few selected files:

We can add an optional flag like the following

    parser.add_argument("--save-model-hf", type=lambda x: bool(strtobool(x)), default=False, nargs="?", const=True,
        help="whether to save model to hugging face")

To add utilities like https://github.com/DLR-RM/rl-baselines3-zoo/blob/master/rl_zoo3/push_to_hub.py, we could add a function upload_to_hf in the cleanrl_utils folder and import it. Importantly, we should only import it when the flag is turned on, so we don't make the single-file implementation dependent on the cleanrl_utils.upload_to_hf.
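The "only import when the flag is on" idea could look roughly like this. It is a sketch with assumptions: the module path `cleanrl_utils.upload_to_hf` is a guess at where the proposed helper would live, `load_uploader` is a hypothetical name, and the flag is simplified to a plain store_true switch.

```python
import argparse
import importlib

def parse_args(argv=None):
    parser = argparse.ArgumentParser()
    # Simplified stand-in for the strtobool-based flag above.
    parser.add_argument("--save-model-hf", action="store_true",
        help="whether to save model to hugging face")
    return parser.parse_args(argv)

def load_uploader(args, module_name="cleanrl_utils.upload_to_hf"):
    # Import the helper only when the flag is on, so the script still runs
    # without cleanrl_utils installed when the flag is off.
    if not args.save_model_hf:
        return None
    return importlib.import_module(module_name)

uploader = load_uploader(parse_args([]))  # flag off: nothing is imported
```

With the flag off, the single-file script has no dependency on the utilities package; with the flag on, a missing package fails loudly at the import.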

@ThomasSimonini has a demo here https://colab.research.google.com/github/huggingface/deep-rl-class/blob/main/unit8/unit8.ipynb

SAC Consistency

Problem Description

Currently, the SAC script has debugging outputs that are inconsistent with other scripts in the repository (see here). It outputs
print(f"Episode: {global_episode} Step: {global_step}, Ep. Reward: {episode_reward}")
instead of
print(f"global_step={global_step}, episode_reward={episode_reward}")

Additionally, the parameter args.learning_start is also used differently. In sac_continuous_action.py, it has

if len(rb.buffer) > args.batch_size: # starts update as soon as there is enough data.

whereas in other scripts we had

if global_step > args.learning_start:

@dosssman would you mind looking into this? Thanks!

Friendlier `CONTRIBUTING.md`

We should make contribution guidelines clearer. We got the feedback that "Although the current project seems welcoming to new developers, there is little concrete information on how to get involved. For example, what code style does the project follow? What are the criteria for including a new algorithm: what documentation does it need to have, test cases, etc? There is little in the way of API docs either, this admittedly is not critical as the files are themselves quite readable, but I'd suggest adding at least one-line docstrings to classes, methods, etc (and consider splitting up the scripts into more separate methods as well)"

This is great feedback. To address it, let's re-think the checklist for including a new algorithm. Such a checklist should include:

pre-commit utilities: sort dependencies, remove unused variables and imports, format code using black, and check word spelling #107.
Empirical analysis and benchmark: we should adopt a similar guide from sb3-contrib with a bit of our spin. The implemented algorithm should come with tracked experiments that
- match the reported performance in the paper (if applicable)
- match the reported performance in a high-quality reference implementation (SB3, Tianshou, and others) (if applicable).
- We should also add documentation on how exactly we want the tracked experiments to be done (i.e., what W&B project? should they capture video recording?)
Documentation: the proposed algorithm should also come with documentation at https://docs.cleanrl.dev/rl-algorithms/ to
- explain crucial implementation details
- add links to the original paper and related papers (if applicable)
- add links to the PR related to the algorithm
- add links to the tracked experiments and benchmark results.
Tests: the proposed algorithm should come with an e2e test that makes sure the algorithm does not crash.

I will try to make some examples next week.

Cloud Integration Support

It is desirable to be able to run experiments on scale by leveraging cloud providers such as AWS, GCP, Azure, or even on-premise servers. This issue will track all of the commits that are related to cloud integrations.

Work with AWS Preemptible Instance

Problem Description

For the AWS integrations, we usually run experiments on AWS spot instances to save cost. However, sometimes there is a need to run experiments for a long time. Real use cases include running Montezuma's Revenge by @yooceii and certain microrts tasks by myself. So we should look more into this issue.

By consulting this resource, I am considering storing the models periodically on the associated wandb run of certain run_id, and should the aws instance terminate, we basically pull the associated models from the run with run_id and continue training.
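The save-periodically-and-resume loop could be sketched as follows. To stay self-contained, a local JSON file stands in for the storage attached to the wandb run, and the state is just a step counter; `save_checkpoint`, `load_checkpoint`, and the file location are hypothetical, and no wandb API is used here.

```python
import json
import os
import tempfile

def save_checkpoint(path, state):
    # Write to a temp file, then rename, so an interruption mid-write
    # cannot leave a half-written checkpoint behind.
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)

def load_checkpoint(path, default):
    # A fresh run has no checkpoint yet and starts from the default state.
    if not os.path.exists(path):
        return default
    with open(path) as f:
        return json.load(f)

ckpt = os.path.join(tempfile.mkdtemp(), "run_id.json")  # hypothetical location
state = load_checkpoint(ckpt, {"global_step": 0})
for _ in range(3):  # stand-in for training iterations
    state["global_step"] += 100
    save_checkpoint(ckpt, state)
resumed = load_checkpoint(ckpt, {"global_step": 0})  # what a restarted instance would see
```

A real version would also need to store model and optimizer weights, and would key the checkpoint by run_id so a replacement instance can find it.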

Both dqn_atari and dqn_atari_visual use different ReplayBuffers compared to other implementations

Thanks for creating this repo - I always have difficulty understanding RL algorithms when they're spread across huge libraries.

One question: dqn_atari.py and dqn_atari_visual.py both use the following ReplayBuffer class:

# modified from https://github.com/openai/baselines/blob/master/baselines/deepq/replay_buffer.py
class ReplayBuffer(object):
    def __init__(self, size):
        self._storage = []
        self._maxsize = size
        self._next_idx = 0

    def put(self, data):
        if self._next_idx >= len(self._storage):
            self._storage.append(data)
        else:
            self._storage[self._next_idx] = data
        self._next_idx = (self._next_idx + 1) % self._maxsize

    def sample(self, batch_size):
        idxes = np.random.choice(len(self._storage), batch_size, replace=True)
        obses_t, actions, rewards, obses_tp1, dones = [], [], [], [], []
        for i in idxes:
            data = self._storage[i]
            obs_t, action, reward, obs_tp1, done = data
            obses_t.append(np.array(obs_t, copy=False))
            actions.append(np.array(action, copy=False))
            rewards.append(reward)
            obses_tp1.append(np.array(obs_tp1, copy=False))
            dones.append(done)
        return np.array(obses_t), np.array(actions), np.array(rewards), np.array(obses_tp1), np.array(dones)

However, dqn.py, c51.py, c51_atari.py, c51_atari_visual.py, sac_continuous_action.py, and td3_continuous_action.py all use the following ReplayBuffer:

# modified from https://github.com/seungeunrho/minimalRL/blob/master/dqn.py#
class ReplayBuffer():
    def __init__(self, buffer_limit):
        self.buffer = collections.deque(maxlen=buffer_limit)
    
    def put(self, transition):
        self.buffer.append(transition)
    
    def sample(self, n):
        mini_batch = random.sample(self.buffer, n)
        s_lst, a_lst, r_lst, s_prime_lst, done_mask_lst = [], [], [], [], []
        
        for transition in mini_batch:
            s, a, r, s_prime, done_mask = transition
            s_lst.append(s)
            a_lst.append(a)
            r_lst.append(r)
            s_prime_lst.append(s_prime)
            done_mask_lst.append(done_mask)

        return np.array(s_lst), np.array(a_lst), \
               np.array(r_lst), np.array(s_prime_lst), \
               np.array(done_mask_lst)

As far as I am aware, they look functionally identical, but I wanted to know if there is an implementation reason for the difference.
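
One difference that can be read off the two listings is the sampling call: the first buffer draws indices with `replace=True`, while the second uses `random.sample`, which draws without replacement. The sketch below illustrates that difference using only the standard library; `random.choices` is a stand-in for `np.random.choice(..., replace=True)` and is an assumption of this example, not what either file uses. It also shows that a `deque` with `maxlen` evicts the oldest item, which matches the circular overwrite in the first buffer.

```python
import collections
import random

random.seed(0)
storage = list(range(5))  # five stored transitions, identified by index

# With replacement: duplicates are possible and the batch may exceed the buffer size.
with_replacement = random.choices(storage, k=8)

# Without replacement: every sampled transition is distinct.
without_replacement = random.sample(storage, 5)

# Without replacement, asking for more samples than stored items fails.
try:
    random.sample(storage, 8)
    oversample_error = None
except ValueError as exc:
    oversample_error = exc

# Eviction: a bounded deque drops the oldest entries once it is full.
ring = collections.deque(maxlen=3)
for transition in range(5):
    ring.append(transition)
```

Whether this difference matters in practice depends on the batch size relative to the buffer size; with a large buffer and small batches, duplicates are rare.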

Documentation Site

Problem Description

Although CleanRL's implementations are generally simple, a documentation site would be desirable in some situations. For example, I'm not sure where to put instructions on how to start and resume runs with CleanRL's scripts. See #33, #14.

Implementing IMPALA

Problem Description

It would be nice to include IMPALA (Importance Weighted Actor-Learner Architectures) in the Open RL Benchmark.

An implementation that looks nice is https://github.com/facebookresearch/torchbeast.

Problems with PPO value loss

cleanrl/cleanrl/ppo_continuous_action.py

Line 379 in a1a9021

v_loss = 0.5 *((new_values - b_returns[minibatch_ind]) ** 2)

I believe this line is missing a .mean()

Also, are you meant to be multiplying the value loss by 0.5 in lines 377 and 379? Isn't that the purpose of args.vf_coef?

I notice there are a number of PPO implementations, and it looks like many of them have the same issue.
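
To make the two points concrete, here is a minimal sketch in plain Python (lists instead of tensors, so it is not the repository's code). The function names and the coefficient defaults are assumptions for illustration. It shows the value loss reduced to a scalar with a mean, and how a hard-coded 0.5 combines multiplicatively with `vf_coef`.

```python
def value_loss(new_values, returns):
    """Scalar value loss: 0.5 * mean squared error over the minibatch."""
    squared_errors = [(v - r) ** 2 for v, r in zip(new_values, returns)]
    return 0.5 * sum(squared_errors) / len(squared_errors)


def total_loss(pg_loss, v_loss, entropy, vf_coef=0.5, ent_coef=0.01):
    """Common PPO loss combination; the default coefficients are illustrative."""
    return pg_loss - ent_coef * entropy + vf_coef * v_loss


new_values = [1.0, 3.0]
returns = [0.0, 1.0]

mse = sum((v - r) ** 2 for v, r in zip(new_values, returns)) / len(returns)
v_loss = value_loss(new_values, returns)
loss = total_loss(pg_loss=0.0, v_loss=v_loss, entropy=0.0, vf_coef=0.5)

# Weight actually applied to the mean squared error: 0.5 (hard-coded) * vf_coef.
effective_weight = loss / mse
```

With both factors present, the mean squared error is effectively weighted by `0.5 * vf_coef`, so changing either one changes the balance against the policy loss.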

GPU Implementation runs no faster than CPU counterparts

For example, here is the timing of dqn.py running for 2e4 total timesteps.

GPU:

time python dqn.py
real    0m35.960s
user    0m31.977s
sys     0m1.758s

CPU:

time python dqn.py --no-cuda
real    0m29.083s
user    5m31.594s
sys     0m9.362s

DDPG Actor missing 1 argument: 'env'

Identified while testing changes for PR #67

cleanrl/cleanrl/ddpg_continuous_action.py

Line 167 in 502f0f3

actor = Actor().to(device)

When running:

python ddpg_continuous_action.py

returns the error:

$ python ddpg_continuous_action.py 
/home/d055/anaconda3/envs/cleanrl-py3.7.1/lib/python3.7/site-packages/ale_py/roms/utils.py:90: DeprecationWarning: SelectableGroups dict interface is deprecated. Use select.
  for external in metadata.entry_points().get(self.group, []):
pybullet build time: Oct 11 2021 20:59:00
/home/d055/anaconda3/envs/cleanrl-py3.7.1/lib/python3.7/site-packages/gym/spaces/box.py:74: UserWarning: WARN: Box bound precision lowered by casting to float32
  "Box bound precision lowered by casting to {}".format(self.dtype)
Traceback (most recent call last):
  File "ddpg_continuous_action.py", line 167, in <module>
    actor = Actor().to(device)
TypeError: __init__() missing 1 required positional argument: 'env'
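
The failure can be reproduced in isolation with a stand-in class. The `Actor` and `DummyEnv` below are hypothetical simplifications, not the classes from ddpg_continuous_action.py; they only mirror the constructor signature implied by the error message. The likely fix is to pass the environment object the script already creates, though which variable that is cannot be seen from the traceback alone.

```python
class DummyEnv:
    """Hypothetical stand-in for the environment object the script creates."""
    observation_dim = 3
    action_dim = 1


class Actor:
    """Stand-in with the same constructor shape as in the traceback."""

    def __init__(self, env):
        # The real class presumably reads space shapes from env to size its layers.
        self.input_dim = env.observation_dim
        self.output_dim = env.action_dim


# Reproduces the reported error: the required 'env' argument is missing.
try:
    Actor()
    error_message = None
except TypeError as exc:
    error_message = str(exc)

# Passing the environment satisfies the constructor.
actor = Actor(DummyEnv())
```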

Refactor documentation

What is the problem

The current documentation requires more work. First, some of the implemented algorithms, such as Apex-DQN, TD3, and SAC, are not documented at https://docs.cleanrl.dev. Second, even documented algorithms such as PPO do not have complete documentation: for example, ppo_atari_envpool.py is not really documented. Third, there doesn't seem to be a single-source place to put documentation.

Going forward, I'd like to impose a specific documentation style and improve the overall workflow, which will also help #117.

Proposed solution

I was thinking we could put a documentation link at the beginning of each file. For example, we could add these two lines at the top of ppo.py.

https://github.com/vwxyzjn/cleanrl/blob/c8faef93fc8dbc9528183840ab75b8962df7b9c4/cleanrl/ppo.py#L1-L7

The link https://cleanrl-553u0zazz-vwxyzjn.vercel.app/rl-algorithms/ppo/#ppopy would then point to the corresponding documentation, which has:

- Brief overview of the algorithm
- Original paper and relevant resources
- Short description of what ppo.py specifically does
- Explanation of important implementation details
- Experiment results (and how they compare to the original paper and/or other reference implementations)
- Learning curves
- Tracked experiments

It roughly looks like the recording below (the tracked experiments have not been added yet):

ppodemo.mp4

List of files needed to add documentation

Refactor PPO Buffer

The idea is to create a buffer to which you can add an entire episode and which calculates the corresponding advantages. Until some limit is reached, keep adding episodes to the buffer, reusing the same returns, obs, and actions arrays inside the buffer.
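
A rough sketch of such a buffer, under several assumptions that the description above does not settle: the class name `EpisodeBuffer` is hypothetical, plain lists replace preallocated arrays, the limit is a step count, and advantages are computed as discounted return minus value estimate rather than with a specific estimator such as GAE.

```python
class EpisodeBuffer:
    """Accumulates whole episodes into shared obs/actions/returns/advantages lists."""

    def __init__(self, max_steps, gamma=0.99):
        self.max_steps = max_steps
        self.gamma = gamma
        self.obs, self.actions, self.returns, self.advantages = [], [], [], []

    def is_full(self):
        """True once the number of stored steps reaches the limit."""
        return len(self.obs) >= self.max_steps

    def add_episode(self, obs, actions, rewards, values):
        """Append one finished episode and compute its returns and advantages."""
        running = 0.0
        episode_returns = [0.0] * len(rewards)
        # Walk backwards so each return reuses the return of the next step.
        for t in reversed(range(len(rewards))):
            running = rewards[t] + self.gamma * running
            episode_returns[t] = running
        self.obs.extend(obs)
        self.actions.extend(actions)
        self.returns.extend(episode_returns)
        # Assumed advantage definition for this sketch: return minus value estimate.
        self.advantages.extend(ret - v for ret, v in zip(episode_returns, values))


buffer = EpisodeBuffer(max_steps=4, gamma=0.5)
buffer.add_episode(obs=["s0", "s1", "s2"], actions=[0, 1, 0],
                   rewards=[1.0, 1.0, 1.0], values=[0.0, 0.0, 0.0])
full_after_first = buffer.is_full()
buffer.add_episode(obs=["s3", "s4"], actions=[1, 1],
                   rewards=[2.0, 0.0], values=[1.0, 0.0])
full_after_second = buffer.is_full()
```

Because whole episodes are appended, the buffer can overshoot the limit by up to one episode length; a fixed-size variant would need to truncate or carry over the remainder.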

def make_env(env_id, seed, idx, capture_video, run_name):
    def thunk():
        env = gym.make(env_id)
        env = gym.wrappers.RecordEpisodeStatistics(env)
        # Only the first environment records video.
        if capture_video:
            if idx == 0:
                env = gym.wrappers.RecordVideo(env, f"videos/{run_name}")
        # Standard Atari preprocessing wrappers.
        env = NoopResetEnv(env, noop_max=30)
        env = MaxAndSkipEnv(env, skip=4)
        env = EpisodicLifeEnv(env)
        if "FIRE" in env.unwrapped.get_action_meanings():
            env = FireResetEnv(env)
        env = ClipRewardEnv(env)
        env = gym.wrappers.ResizeObservation(env, (84, 84))
        env = gym.wrappers.GrayScaleObservation(env)
        env = gym.wrappers.FrameStack(env, 4)
        return env

    return thunk

class Flatten(nn.Module):
    def forward(self, input):
        # Collapse every dimension after the batch dimension.
        return input.view(input.size(0), -1)
