
drq's Introduction

DrQ: Data regularized Q

This is a PyTorch implementation of DrQ from

Image Augmentation Is All You Need: Regularizing Deep Reinforcement Learning from Pixels by

Denis Yarats*, Ilya Kostrikov*, Rob Fergus.

*Equal contribution. Author ordering determined by coin flip.

[Paper] [Webpage]

Update: we have released a newer version, DrQ-v2; please check it out here.

Implementations in other frameworks: jax/flax.

Citation

If you use this repo in your research, please consider citing the paper as follows:

@inproceedings{yarats2021image,
  title={Image Augmentation Is All You Need: Regularizing Deep Reinforcement Learning from Pixels},
  author={Denis Yarats and Ilya Kostrikov and Rob Fergus},
  booktitle={International Conference on Learning Representations},
  year={2021},
  url={https://openreview.net/forum?id=GY6-6sTvGaf}
}

Requirements

We assume you have access to a GPU that can run CUDA 9.2. The simplest way to install all required dependencies is to create an Anaconda environment by running

conda env create -f conda_env.yml

After the installation ends, you can activate your environment with

conda activate drq

Instructions

To train the DrQ agent on the Cartpole Swingup task run

python train.py env=cartpole_swingup

You can reach state-of-the-art performance in under 3 hours.

To reproduce the results from the paper run

python train.py env=cartpole_swingup batch_size=512 action_repeat=8

This will produce a runs folder, where all outputs are stored, including train/eval logs, TensorBoard blobs, and evaluation episode videos. To launch TensorBoard, run

tensorboard --logdir runs

The console output is also available in the following form:

| train | E: 5 | S: 5000 | R: 11.4359 | D: 66.8 s | BR: 0.0581 | ALOSS: -1.0640 | CLOSS: 0.0996 | TLOSS: -23.1683 | TVAL: 0.0945 | AENT: 3.8132

A training entry decodes as

train - training episode
E - total number of episodes
S - total number of environment steps
R - episode return
D - duration in seconds
BR - average reward of a sampled batch
ALOSS - average loss of the actor
CLOSS - average loss of the critic
TLOSS - average loss of the temperature parameter
TVAL - the value of temperature
AENT - the actor's entropy

while an evaluation entry

| eval  | E: 20 | S: 20000 | R: 10.9356

contains

E - evaluation was performed after E episodes
S - evaluation was performed after S environment steps
R - average episode return computed over `num_eval_episodes` (usually 10)
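
To post-process these console logs programmatically, the following is a minimal parsing sketch; it is not part of the repository and only assumes the pipe-separated `KEY: value` layout shown above.

import re

def parse_console_line(line):
    """Parse a pipe-separated console entry such as
    '| train | E: 5 | S: 5000 | R: 11.4359 | D: 66.8 s | ...' into a dict."""
    fields = [f.strip() for f in line.strip().strip('|').split('|')]
    entry = {'mode': fields[0]}  # 'train' or 'eval'
    for field in fields[1:]:
        key, value = field.split(':', 1)
        # keep only the leading number, ignoring units such as the 's' after duration
        entry[key.strip()] = float(re.search(r'[-+]?\d*\.?\d+', value).group())
    return entry

print(parse_console_line("| eval  | E: 20 | S: 20000 | R: 10.9356"))
# {'mode': 'eval', 'E': 20.0, 'S': 20000.0, 'R': 10.9356}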

The PlaNet Benchmark

DrQ demonstrates state-of-the-art performance on a set of challenging image-based tasks from the DeepMind Control Suite (Tassa et al., 2018). We compare against PlaNet (Hafner et al., 2018), SAC-AE (Yarats et al., 2019), SLAC (Lee et al., 2019), CURL (Srinivas et al., 2020), and an upper-bound performance, SAC States (Haarnoja et al., 2018). This follows the benchmark protocol established in PlaNet (Hafner et al., 2018).

[Figure: results on the PlaNet benchmark]

The Dreamer Benchmark

DrQ demonstrates state-of-the-art performance on an extended set of challenging image-based tasks from the DeepMind Control Suite (Tassa et al., 2018), following the benchmark protocol from Dreamer (Hafner et al., 2019). We compare against Dreamer (Hafner et al., 2019) and an upper-bound performance, SAC States (Haarnoja et al., 2018).

[Figure: results on the Dreamer benchmark]

Acknowledgements

We used kornia for data augmentation.
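
For reference, the augmentation at the core of DrQ is a random shift: each observation is padded with replicated border pixels and cropped back to its original size at a random offset. The sketch below illustrates the idea in plain PyTorch and assumes the 84x84 frames and +/-4 pixel shifts described in the paper; it is not the repository's implementation, which performs the same operation with kornia.

import torch
import torch.nn.functional as F

def random_shift(imgs, pad=4):
    # imgs: (B, C, H, W) float tensor of stacked frames
    b, c, h, w = imgs.shape
    padded = F.pad(imgs, (pad, pad, pad, pad), mode='replicate')
    out = torch.empty_like(imgs)
    for i in range(b):
        # pick an independent crop offset for every image in the batch
        top = torch.randint(0, 2 * pad + 1, (1,)).item()
        left = torch.randint(0, 2 * pad + 1, (1,)).item()
        out[i] = padded[i, :, top:top + h, left:left + w]
    return out

obs = torch.rand(32, 9, 84, 84)  # a batch of 3 stacked RGB frames
aug_obs = random_shift(obs)      # augmented copy used in the Q-function updates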

drq's People

Contributors: denisyarats, ikostrikov, ikostrikov2


drq's Issues

Not learning on dm_control's pendulum swingup

Upon running the command

python train.py env=pendulum_swingup

I'm getting

[attached screenshot]

I understand that pendulum swingup uses sparse reward, but the agent should be able to earn some reward (although very little) initially by exploration.

Figures

May I know how these figures are generated and which code file is used for that? Sorry for my ignorance, thank you.

walker_stand critic loss explosion

Hi, thank you for the high-quality code, but I wonder why the critic loss on the walker_stand task gets so high (up to 1e+3) in my experiments. I used your conda environment file and set env=walker_stand, action_repeat=2, and batch_size=512 as mentioned in the paper. How can I get a stable critic loss (for example, via reward scaling)?

Thank you for reading.

Non-deterministic runs

Hi,

I get different train and eval curves every time I start training, even with the same seed. Is that supposed to happen even after setting all seeds, i.e. random, numpy, and torch (both CUDA and non-CUDA)? Did you observe similar behaviour?

About environment steps

Hello, I'm trying to replicate the results of the Dreamer and DrQ papers with PyTorch.

While the DrQ code works fine, I am concerned that the environment steps (the x-axis in the figures) are counted differently from Dreamer's.

The Dreamer implementation increments the step count by 1000 environment steps per episode (no matter what the action repeat is).
However, in the DrQ implementation, the step count (self.step in train.py) is incremented by 1000/action_repeat per episode.

I believe this makes DrQ consume more episodes to reach the same training_max_steps.

Am I missing something here?
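
For concreteness, the two counting conventions described in this question can be written down as a small sketch; the 'dreamer' and 'drq' labels below simply restate the question and are not code from either repository.

EPISODE_LEN = 1000  # a DeepMind Control episode lasts 1000 low-level steps

def steps_counted_per_episode(action_repeat, convention):
    if convention == 'dreamer':
        # count every low-level environment step, regardless of action repeat
        return EPISODE_LEN
    if convention == 'drq':
        # count one step per agent decision, i.e. 1000 / action_repeat per episode
        return EPISODE_LEN // action_repeat

for ar in (2, 4, 8):
    print(ar, steps_counted_per_episode(ar, 'dreamer'), steps_counted_per_episode(ar, 'drq'))
# under the second convention, 500k counted steps with action_repeat=4
# correspond to 2M low-level environment steps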

critic update frequency

In DRQAgent.update it seems that the critic is updated at every environment step, which makes sampling from the environment rather slow.

Do you think it is safe to add a separate update frequency for the critic?

reproduce problem

Can I get all of the DMC benchmark results if I use batch_size=512 and action_repeat=8?
I tried batch_size=128 and action_repeat=2 on the finger environment (turn_easy task), but the result was bad (a mean score under 500 up to 200k steps).

How to run code without augmentation?

Hi, thanks for sharing the code! I'm interested in applying DrQ to a dm_control domain, and I see from the README that I can readily do that using the following command:

python train.py env=cartpole_swingup

However, is there a way to turn off augmentation (e.g., via command line options)? I'd like to compare the performance with and without augmentation.

eval frequency

In line 127 of train.py you check whether you should evaluate (if self.step % self.cfg.eval_frequency == 0:).
However, this happens inside the if clause checking for done. Shouldn't it be independent of that? As it is, a step that is a multiple of eval_frequency rarely coincides with done being True, which means no evaluation is performed.
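
A minimal toy loop illustrating the two placements being discussed; the names here are hypothetical stand-ins, not the repository's train.py.

import random

class ToyEnv:  # stand-in environment so the loop runs on its own
    def reset(self):
        return 0.0
    def step(self, action):
        return 0.0, random.random(), random.random() < 0.01, {}

def evaluate(step):
    print(f'eval at step {step}')

env, eval_frequency, max_steps = ToyEnv(), 1000, 5000
obs, done, step = env.reset(), False, 0
while step < max_steps:
    if done:
        obs, done = env.reset(), False
        # original placement: evaluation only fires when a multiple of
        # eval_frequency happens to coincide with an episode boundary
        # if step % eval_frequency == 0:
        #     evaluate(step)
    # independent placement: evaluation fires every eval_frequency steps
    if step % eval_frequency == 0:
        evaluate(step)
    obs, _, done, _ = env.step(None)
    step += 1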

hydra 'strict' argument error

python train.py env=cartpole_swingup

result :

Traceback (most recent call last):
  File "train.py", line 170, in <module>
    @hydra.main(config_path='config.yaml', strict=True)
TypeError: main() got an unexpected keyword argument 'strict'

If I delete the strict argument, then I get:

ValueError: Using config_path to specify the config name is not supported, specify the config name via config_name.
See https://hydra.cc/docs/next/upgrades/0.11_to_1.0/config_path_changes

How can I fix it?
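
Both messages suggest the code is being run with a newer Hydra release than it was written for: newer versions no longer accept the strict argument and require the config name to be passed via config_name, with config_path pointing at a directory. A hedged sketch of the newer-style decorator, assuming config.yaml sits next to train.py (the alternative is to install the exact Hydra version the repository was developed against):

import hydra

# newer Hydra style: config_path is the directory containing the YAML file
# and config_name is the file name without the .yaml extension
@hydra.main(config_path='.', config_name='config')
def main(cfg):
    print(cfg)

if __name__ == '__main__':
    main()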

Unused code

While modifying this code for research, I found a lot of the code to be redundant:

  • Code for train() or eval(): only BatchNorm and DropOut are affected by this.
  • Done flags in replay buffer: they are stored but never used.
  • output_dim in Encoder is unused
  • output_logits is always false

sparse reward tasks

I wanted to ask whether any tweaks to your implementation might be needed for sparse-reward tasks.

Getting the code to run deterministically

I am trying to get the code to run deterministically, i.e. repeat behavior exactly when running the same seed multiple times. However, I'm having some issues. I've tried to disable the cudnn benchmarking:

torch.backends.cudnn.benchmark = False

I've also added

torch.use_deterministic_algorithms(True)

Still, I am not able to repeat experiments exactly for fixed seeds. Any ideas about what further sources of non-determinism might be in the code base? Thanks!
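
For reference, a minimal sketch of the usual PyTorch determinism switches; note that cudnn benchmarking is disabled by setting benchmark to False, and fully deterministic CUDA kernels may additionally require the CUBLAS_WORKSPACE_CONFIG environment variable. Even with all of these, GPU ops without deterministic implementations and the environment's own rendering can remain sources of nondeterminism.

import os
import random
import numpy as np
import torch

def seed_everything(seed=0):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)                 # also seeds all CUDA devices
    torch.backends.cudnn.benchmark = False  # disable cudnn autotuning
    torch.backends.cudnn.deterministic = True
    # some CUDA ops require this when deterministic algorithms are enforced
    os.environ.setdefault('CUBLAS_WORKSPACE_CONFIG', ':4096:8')
    torch.use_deterministic_algorithms(True)

seed_everything(0)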

Action_repeat settings to reproduce your paper results

In the README, you say that to reproduce the results I just need to run
python train.py env=cartpole_swingup batch_size=512

But I notice that the action_repeat value in config.yaml is not 8 for cartpole_swingup.

Maybe you should check this point.

Recurrent controller instead of stacked image observation?

This is amazing work, thanks a lot for sharing!

The paper states that stacking the last 3 image frames converts the POMDP into an MDP. While I understand this is common practice, I wonder whether you have tried using a GRU/LSTM controller. Does it typically perform better or worse than frame stacking in your experience?

How to limit number of threads spawned?

I want to run DrQ on a larger node with multiple GPUs and 18 cores (36 with hyperthreading). When I try to run multiple DrQ jobs in parallel on the node, each job seems to spawn 41 threads, which seems to be too much for the CPU to handle. Is there any way to limit the number of threads that DrQ launches? Thanks!
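
As a sketch (not repository code), the usual way to cap the thread count is to set the OpenMP/MKL environment variables before the heavy libraries are imported and to limit PyTorch's intra-op pool; the same variables can equally be exported in the shell before launching train.py.

import os

# must happen before numpy / torch spin up their thread pools
os.environ['OMP_NUM_THREADS'] = '1'
os.environ['MKL_NUM_THREADS'] = '1'

import torch

torch.set_num_threads(1)  # cap PyTorch's own intra-op parallelism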

Hi!

Do you have a TensorFlow version of the code?

Confused with CSV files

Great work!
I am a little confused by the dmc_planet_bench.csv file. Why are the steps negative?
To produce results comparable with this CSV, should I set eval_frequency in the config file to 2000?
I want to plot this file using TensorBoard. Just to make sure: should I set action_repeat to the corresponding value from Table 2 when logging this CSV with the provided logger, and do I plot it as eval/episode_reward like the following?

logger.log('eval/episode_reward', float(row['episode_reward']), -1 * int(row['step']))

macOS Catalina can't run due to Hydra

After following the installation instructions, I run into a problem with Hydra:

HYDRA_FULL_ERROR=1 python train.py env=cartpole_swingup                                                
Traceback (most recent call last):
  File "/usr/local/Caskroom/miniconda/base/envs/drq/lib/python3.7/site-packages/hydra/_internal/config_loader_impl.py", line 153, in load_configuration
    from_shell=from_shell,
  File "/usr/local/Caskroom/miniconda/base/envs/drq/lib/python3.7/site-packages/hydra/_internal/config_loader_impl.py", line 256, in _load_configuration
    run_mode=run_mode,
  File "/usr/local/Caskroom/miniconda/base/envs/drq/lib/python3.7/site-packages/hydra/_internal/config_loader_impl.py", line 796, in _merge_defaults_into_config
    hydra_cfg = merge_defaults_list_into_config(hydra_cfg, system_list)
  File "/usr/local/Caskroom/miniconda/base/envs/drq/lib/python3.7/site-packages/hydra/_internal/config_loader_impl.py", line 764, in merge_defaults_list_into_config
    merged_cfg.merge_with(job_cfg)
  File "/usr/local/Caskroom/miniconda/base/envs/drq/lib/python3.7/site-packages/omegaconf/basecontainer.py", line 325, in merge_with
    self._format_and_raise(key=None, value=None, cause=e)
  File "/usr/local/Caskroom/miniconda/base/envs/drq/lib/python3.7/site-packages/omegaconf/base.py", line 101, in _format_and_raise
    type_override=type_override,
  File "/usr/local/Caskroom/miniconda/base/envs/drq/lib/python3.7/site-packages/omegaconf/_utils.py", line 610, in format_and_raise
    _raise(ex, cause)
  File "/usr/local/Caskroom/miniconda/base/envs/drq/lib/python3.7/site-packages/omegaconf/_utils.py", line 591, in _raise
    raise ex  # set end OC_CAUSE=1 for full backtrace
  File "/usr/local/Caskroom/miniconda/base/envs/drq/lib/python3.7/site-packages/omegaconf/basecontainer.py", line 323, in merge_with
    self._merge_with(*others)
  File "/usr/local/Caskroom/miniconda/base/envs/drq/lib/python3.7/site-packages/omegaconf/basecontainer.py", line 341, in _merge_with
    BaseContainer._map_merge(self, other)
  File "/usr/local/Caskroom/miniconda/base/envs/drq/lib/python3.7/site-packages/omegaconf/basecontainer.py", line 288, in _map_merge
    dest_node._merge_with(src_value)
  File "/usr/local/Caskroom/miniconda/base/envs/drq/lib/python3.7/site-packages/omegaconf/basecontainer.py", line 341, in _merge_with
    BaseContainer._map_merge(self, other)
  File "/usr/local/Caskroom/miniconda/base/envs/drq/lib/python3.7/site-packages/omegaconf/basecontainer.py", line 308, in _map_merge
    dest[key] = src._get_node(key)
  File "/usr/local/Caskroom/miniconda/base/envs/drq/lib/python3.7/site-packages/omegaconf/dictconfig.py", line 251, in __setitem__
    key=key, value=value, type_override=ConfigKeyError, cause=e
  File "/usr/local/Caskroom/miniconda/base/envs/drq/lib/python3.7/site-packages/omegaconf/base.py", line 101, in _format_and_raise
    type_override=type_override,
  File "/usr/local/Caskroom/miniconda/base/envs/drq/lib/python3.7/site-packages/omegaconf/_utils.py", line 610, in format_and_raise
    _raise(ex, cause)
  File "/usr/local/Caskroom/miniconda/base/envs/drq/lib/python3.7/site-packages/omegaconf/_utils.py", line 591, in _raise
    raise ex  # set end OC_CAUSE=1 for full backtrace
omegaconf.errors.ConfigKeyError: Key 'name' not in 'HydraConf'
	full_key: hydra.name
	reference_type=Optional[HydraConf]
	object_type=HydraConf

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "train.py", line 178, in <module>
    main()
  File "/usr/local/Caskroom/miniconda/base/envs/drq/lib/python3.7/site-packages/hydra/main.py", line 37, in decorated_main
    strict=strict,
  File "/usr/local/Caskroom/miniconda/base/envs/drq/lib/python3.7/site-packages/hydra/_internal/utils.py", line 356, in _run_hydra
    lambda: hydra.run(
  File "/usr/local/Caskroom/miniconda/base/envs/drq/lib/python3.7/site-packages/hydra/_internal/utils.py", line 210, in run_and_report
    raise ex
  File "/usr/local/Caskroom/miniconda/base/envs/drq/lib/python3.7/site-packages/hydra/_internal/utils.py", line 207, in run_and_report
    return func()
  File "/usr/local/Caskroom/miniconda/base/envs/drq/lib/python3.7/site-packages/hydra/_internal/utils.py", line 359, in <lambda>
    overrides=args.overrides,
  File "/usr/local/Caskroom/miniconda/base/envs/drq/lib/python3.7/site-packages/hydra/_internal/hydra.py", line 104, in run
    run_mode=RunMode.RUN,
  File "/usr/local/Caskroom/miniconda/base/envs/drq/lib/python3.7/site-packages/hydra/_internal/hydra.py", line 512, in compose_config
    from_shell=from_shell,
  File "/usr/local/Caskroom/miniconda/base/envs/drq/lib/python3.7/site-packages/hydra/_internal/config_loader_impl.py", line 156, in load_configuration
    raise ConfigCompositionException() from e
hydra.errors.ConfigCompositionException

I have looked through the stack trace but am not sufficiently familiar with Hydra or OmegaConf to decipher what is actually causing the issue. Maybe we have different versions of the packages installed from the conda_env.yml file?

Would it be possible to include code for state-sac?

It seems like the paper shows results for SAC trained on the underlying state; however, I cannot find that code in the repo. Would it be possible to include it? I'd be interested in reproducing your experiments! Thanks!

The purpose of layer norm after CNN

Hi, I wonder what the purpose of the layer norm after the convolutional layers is. Does it improve stability?

I understand that your actor and critic share the convolutional layers. Is the layer norm there for that purpose?
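
For context, a minimal sketch of the pattern being asked about: a convolutional trunk followed by a linear projection, LayerNorm, and tanh, with the resulting features shared by the actor and critic heads. The layer sizes below are illustrative and not copied from the repository.

import torch
import torch.nn as nn

class PixelEncoder(nn.Module):
    # conv trunk -> linear projection -> LayerNorm -> tanh
    def __init__(self, in_channels=9, feature_dim=50):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv2d(in_channels, 32, 3, stride=2), nn.ReLU(),
            nn.Conv2d(32, 32, 3, stride=1), nn.ReLU(),
            nn.Conv2d(32, 32, 3, stride=1), nn.ReLU(),
            nn.Conv2d(32, 32, 3, stride=1), nn.ReLU(),
        )
        self.proj = nn.Sequential(
            nn.Flatten(),
            nn.LazyLinear(feature_dim),   # infers the flattened conv size
            nn.LayerNorm(feature_dim),    # normalize the shared features
            nn.Tanh(),
        )

    def forward(self, obs):
        return self.proj(self.convs(obs / 255.0))

enc = PixelEncoder()
print(enc(torch.rand(2, 9, 84, 84) * 255.0).shape)  # torch.Size([2, 50])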

Replicating table 1 of the paper

Dear Denis,

Thanks for open-sourcing this, the paper is really cool! I am trying to replicate Table 1 on the PlaNet benchmark and ran into some problems with the SAC-state baseline. I am using your implementation of SAC-state (github.com/denisyarats/pytorch_sac) but fail to reach the reported performance. Was action repeat applied to SAC-state in Table 1? For each environment, I am using frame_skip = action_repeat, where action_repeat comes from Table 2 in the paper. To use only 500,000 environment steps, I set num_train_steps = 500,000 // action_repeat. Am I missing something here? Once I figure this out, I will replicate the DrQ experiments. Thanks!

Class DRQAgent is not in module drq

Hi,
I would like to try a custom gym environment but have encountered this error:

Error instantiating drq.DRQAgent : Class DRQAgent is not in module drq
Traceback (most recent call last):
  File "/home/alireza/.local/share/virtualenvs/SimulationFramework-19OjgRmc/lib/python3.6/site-packages/hydra/utils.py", line 23, in get_class
    klass = getattr(mod, class_name)
AttributeError: module 'drq' has no attribute 'DRQAgent'

which happens on the line self.agent = hydra.utils.instantiate(cfg.agent) in train.py.

Do you know what might be the reason?
