
acme's People

Contributors

abefriesen, acassirer, agarwl, albercm, alexis-jacq, aslanides, bshahr, ddsh, dulacarnold, ethanluoyc, fastturtle, ferretj, gjtucker, hamzamerzic, hawkinsp, iosband, katebaumli, leonardhussenot, mikimowski, mwhoffman, nikolamomchev, nino-vieillard, qstanczyk, rchen152, ruba128, sabelaraga, sergomezcol, sgirgin, sinopalnikov, yilei

acme's Issues

Unable to run the tutorial: "from acme.tf import networks" raises ModuleNotFoundError: No module named 'acme.tf'

I'm trying to run the tutorial but I get the following error:


ModuleNotFoundError Traceback (most recent call last)
in ()
7
8 from acme import environment_loop
----> 9 from acme.tf import networks
10 from acme.adders import reverb as adders
11 from acme.agents.tf import actors as actors

ModuleNotFoundError: No module named 'acme.tf'


NOTE: If your import is failing due to a missing package, you can
manually install dependencies using either !pip or !apt.

To view examples of installing some common dependencies, click the
"Open Examples" button below.

I have tried this in the following environments and they all failed with the same error:
tried on Colab,
tried on AWS SageMaker,
tried on a local machine (Windows),
tried on a Linux VM.

Removing the line just generates other errors on subsequent imports.

All of these environments failed with the same error. I will keep trying, but I would really appreciate some help here.

Regards
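
For reference, the TF-specific modules are only importable when the corresponding extras are installed; the install cells used elsewhere in this thread (and in the quickstart) are:

!pip install dm-acme
!pip install dm-acme[reverb]
!pip install dm-acme[tf]
!pip install dm-acme[envs]

If acme.tf is still missing after that, it is worth checking that the installed dm-acme version actually ships the acme/tf package; on Colab, a runtime restart is often needed after the install before the import succeeds.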

Contributing

Hi,

I am currently a master's student in France, and I would like to contribute to your platform, which is exactly the kind of project I have been waiting to invest time in!

However, I haven't found any guidelines or issues that could provide a starting point, and I saw in the paper that new algorithms may be added in the future.

I am thus asking you (the maintainers) about possible points I could tackle. I would be glad to add algorithms from the literature, but starting with smaller issues would also be great.
You can contact me on [email protected] if needed.

Thank you,

PS: Sorry if this message felt awkward...

Error when installing reverb

Hi, when I install reverb on Windows, it spits out an error like this:

ERROR: Could not find a version that satisfies the requirement dm-reverb-nightly==0.1.0.dev20200529; extra == "reverb" (from dm-acme[reverb]) (from versions: none)

ERROR: No matching distribution found for dm-reverb-nightly==0.1.0.dev20200529; extra == "reverb" (from dm-acme[reverb])

Distributed training examples?

It looks like all of the scripts in the examples are single-process. Are there any examples for distributed training?
The paper mentions "Launchpad", which manages distributed processes; when do you plan to open-source it?

ImportError: libpython3.7m.so.1.0: cannot open shared object file: No such file or directory

I am having trouble running a Python script on a computing cluster, and the problem is reproduced when I run my job with Slurm:

I get the error


2020-07-17 19:30:57.289439: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libcudart.so.10.1'; dlerror: libcudart.so.10.1: cannot open shared object file: No such file or directory
2020-07-17 19:30:57.289506: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
Traceback (most recent call last):
  File "test.py", line 4, in <module>
    from acme.agents.tf import dqn
  File "/home/armas/temp/dist_rl/lib/python3.7/site-packages/acme/agents/tf/dqn/__init__.py", line 18, in <module>
    from acme.agents.tf.dqn.agent import DQN
  File "/home/armas/temp/dist_rl/lib/python3.7/site-packages/acme/agents/tf/dqn/agent.py", line 20, in <module>
    from acme import datasets
  File "/home/armas/temp/dist_rl/lib/python3.7/site-packages/acme/datasets/__init__.py", line 17, in <module>
    from acme.datasets.reverb import make_reverb_dataset
  File "/home/armas/temp/dist_rl/lib/python3.7/site-packages/acme/datasets/reverb.py", line 22, in <module>
    from acme.adders import reverb as adders
  File "/home/armas/temp/dist_rl/lib/python3.7/site-packages/acme/adders/reverb/__init__.py", line 21, in <module>
    from acme.adders.reverb.base import DEFAULT_PRIORITY_TABLE
  File "/home/armas/temp/dist_rl/lib/python3.7/site-packages/acme/adders/reverb/base.py", line 26, in <module>
    import reverb
  File "/home/armas/temp/dist_rl/lib/python3.7/site-packages/reverb/__init__.py", line 27, in <module>
    from reverb import item_selectors as selectors
  File "/home/armas/temp/dist_rl/lib/python3.7/site-packages/reverb/item_selectors.py", line 19, in <module>
    from reverb import pybind
  File "/home/armas/temp/dist_rl/lib/python3.7/site-packages/reverb/pybind.py", line 1, in <module>
    import tensorflow as _tf; from .libpybind import *; del _tf
ImportError: libpython3.7m.so.1.0: cannot open shared object file: No such file or directory

On my local machine, I was struggling with the same issue when running in a virtual environment; I solved it simply with sudo apt-get install libpython3.7.

Here are some other things that may be helpful to know.

$which libpython
/usr/bin/which: no libpython in (/home/armas/temp/dist_rl/bin:/om2/user/armas/anaconda/bin:/om2/user/armas/anaconda/condabin:/usr/lib64/qt-3.3/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin)

$echo $PATH
/home/armas/temp/dist_rl/bin:/om2/user/armas/anaconda/bin:/om2/user/armas/anaconda/condabin:/usr/lib64/qt-3.3/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin

If there's anything I missed please let me know and I'll be sure to add, thank you again!

Very high CPU usage

First, thanks for the open-source code! But when I run the example run_dqn.py under the directory examples/atari/, I find the CPU usage is very high, about 1900%; details are listed below:

VIRT        RES        SHR      S    %CPU    %MEM    TIME+        COMMAND
74.715g     0.011t     839564   R    1962    36.6    107:35.79    run_dqn.py

And the info of CPU is listed below:

cat /proc/cpuinfo| grep "physical id"| sort| uniq| wc -l
1

cat /proc/cpuinfo | grep "cpu cores" | uniq
cpu cores	: 10

cat /proc/cpuinfo| grep "processor" | wc -l
20

So can somebody tell me whether this situation is normal or not? I also find that the program uses the GPU. If this situation is abnormal, how can I fix the code?
BTW, I want to know how much memory programs need when I run them on Atari games. When I run run_dqn.py on my PC with 32 GB of RAM, it always breaks down because memory is exhausted.
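
Not an Acme-specific answer, but if the goal is simply to cap how many cores TensorFlow grabs, its threading configuration can be set before any op runs; a minimal sketch (the thread counts below are illustrative, not recommended values):

import tensorflow as tf

# Must be called before any TensorFlow operation has executed.
tf.config.threading.set_intra_op_parallelism_threads(4)
tf.config.threading.set_inter_op_parallelism_threads(2)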

Some dependencies missing in quickstart

Finally got it running! Works great guys. Looking forward to using this in the future.

The only issue I encountered was that running the quickstart example notebook failed due to (a lot of) missing dependencies.

I hunted them down and made a sample project that includes all of the necessary dependencies in a conda environment.yml file:

https://github.com/drozzy/acme_rl_example

(I omitted jax because I've never used it and mujoco because I didn't need it).

Some new users may find it useful. I also suggest you provide a conda environment file for people instead of pip instructions, as it is much more reliable/reproducible.

Training becomes extremely slow after 20k steps

Hello,

I ran into a problem when using the D4PG agent with my customized environment: the training process becomes very slow after around 20k steps. I ran multiple experiments and attached the agent walltime. I think the customized environment is fine, since there was no problem when I tested it with Ray and RLlib.
[attached plot: agent walltime]

I then profiled the run loop. It turned out that self._actor.select_action(timestep.observation) and self._actor.update() are 10 times slower after 20k steps.

Do you have any ideas?

dm-acme[reverb] installation fails

ERROR: Could not find a version that satisfies the requirement dm-reverb-nightly==0.1.0.dev20200605; extra == "reverb" (from dm-acme[reverb]) (from versions: none)
ERROR: No matching distribution found for dm-reverb-nightly==0.1.0.dev20200605; extra == "reverb" (from dm-acme[reverb])
I'm now using Python 3.7.5.

documentation for IMPALA

Hi all,
I am currently trying to implement/run the IMPALA agent using Acme. It would be helpful if documentation for IMPALA were added here.

D4PG's vmin and vmax paramaters

Hi @fastturtle

Would it be possible to share more info about D4PG's hyperparameters? For example, the D4PG paper doesn't include information about all of the important hyperparameters, such as vmin and vmax. Although there is a note saying how to find them, it can still lead to very different results and is somewhat unclear.
If it is not possible to share them, could you clarify the note about vmin and vmax a bit?
"A good rule of thumb is to set vmax to the discounted sum of the maximum instantaneous rewards for the maximum episode length;" What policy should be used to gather this information, and when? A random policy? Or should we train an agent for a while with another method and then use that policy to estimate the maximum reward? How are vmax/vmin related to the agent's final returns and the optimal policy? For example, the final return of Acrobot (Swingup) is ~400 (Figure 2 in the paper), so should vmax be very close to 400?

Thanks.
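
To make the rule of thumb concrete, here is a small worked example; the reward bound, discount, and episode length are assumptions for illustration, not values taken from the paper:

# Suppose the per-step reward is bounded by r_max and episodes last at most T steps.
r_max = 1.0     # assumed maximum instantaneous reward
gamma = 0.99    # discount factor
T = 1000        # maximum episode length

# Discounted sum of the maximum instantaneous reward over a full episode.
vmax = r_max * (1 - gamma ** T) / (1 - gamma)   # roughly 100 for these numbers
vmin = -vmax                                    # a symmetric range is a common choice
print(vmin, vmax)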

Tracking Memory Consumption?

Hello,

I was running into some trouble with memory usage when running the acme/examples/atari/run_dqn.py agent for 1000 training episodes on a computing cluster. Are there any utilities or recommendations for tracking memory consumption in Acme?

Originally I ran sbatch file.sh with 8 GB of memory, and 90 episodes in I got this error:

[23280.579/var/slurm/slurmd/job17362439/slurm_script: line 13: 43348 Killed                  python test.py
slurmstepd: error: Detected 1 oom-kill event(s) in step 17362439.batch cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.

When I switched to 16 GB I got to 135 episodes before hitting the same error.

With 32 GB I got the same error, but at 265 episodes.

With 64 GB I did not get an OOM error for the first time and had a successful run. However, I am now getting a segmentation fault (core dump) when I try to reproduce it.

cat slurm-17566431.out
2020-08-07 15:39:34.132474: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
[reverb/cc/platform/tfrecord_checkpointer.cc:143] Initializing TFRecordCheckpointer in /tmp/tmpkhjkbjht
[reverb/cc/platform/tfrecord_checkpointer.cc:320] Loading latest checkpoint from /tmp/tmpkhjkbjht
[reverb/cc/platform/default/server.cc:55] Started replay server on port 16149
2020-08-07 15:39:41.985553: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcuda.so.1
2020-08-07 15:39:41.991144: E tensorflow/stream_executor/cuda/cuda_driver.cc:314] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
2020-08-07 15:39:41.991195: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:169] retrieving CUDA diagnostic information for host: node028.cm.cluster
2020-08-07 15:39:41.991212: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:176] hostname: node028.cm.cluster
2020-08-07 15:39:41.991319: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:200] libcuda reported version is: 440.82.0
2020-08-07 15:39:41.991382: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:204] kernel reported version is: 440.82.0
2020-08-07 15:39:41.991399: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:310] kernel version seems to match DSO: 440.82.0
2020-08-07 15:39:42.003454: I tensorflow/core/platform/profile_utils/cpu_utils.cc:104] CPU Frequency: 2199940000 Hz
2020-08-07 15:39:42.003669: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x5644082ad740 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-08-07 15:39:42.003696: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2020-08-07 15:39:42.739048: W tensorflow/python/util/util.cc:348] Sets are not currently considered sequences, but this may change in the future, so consider avoiding using them.
I0807 15:39:44.747882 47865664093888 savers.py:165] Attempting to restoring checkpoint: None
I0807 15:39:44.755933 47865664093888 csv.py:39] Logging to environment_loop/b8102e22-d8e5-11ea-a399-50465dec5138/logs/logs.csv
I0807 15:39:45.245096 47865664093888 savers.py:156] Saving checkpoint: /home/armas/acme/b8102e22-d8e5-11ea-a399-50465dec5138/checkpoints/dqn_learner
[Environment Loop] Episode Length = 824 | Episode Return = -21.000 | Episodes = 1 | Steps = 824 | Steps Per Second = 61.497
Fatal Python error: Segmentation fault

Thread 0x00002b8897e7bec0 (most recent call first):
  File "/om2/user/armas/anaconda/envs/dist_rl/lib/python3.8/site-packages/tensorflow/python/eager/execute.py", line 59 in quick_execute
  File "/om2/user/armas/anaconda/envs/dist_rl/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 545 in call
  File "/om2/user/armas/anaconda/envs/dist_rl/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 1923 in _call_flat
  File "/om2/user/armas/anaconda/envs/dist_rl/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 1843 in _filtered_call
  File "/om2/user/armas/anaconda/envs/dist_rl/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 2829 in __call__
  File "/om2/user/armas/anaconda/envs/dist_rl/lib/python3.8/site-packages/tensorflow/python/eager/def_function.py", line 840 in _call
  File "/om2/user/armas/anaconda/envs/dist_rl/lib/python3.8/site-packages/tensorflow/python/eager/def_function.py", line 780 in __call__
  File "/om2/user/armas/anaconda/envs/dist_rl/lib/python3.8/site-packages/acme/agents/tf/dqn/learning.py", line 172 in step
  File "/om2/user/armas/anaconda/envs/dist_rl/lib/python3.8/site-packages/acme/agents/agent.py", line 87 in update
  File "/om2/user/armas/anaconda/envs/dist_rl/lib/python3.8/site-packages/acme/agents/tf/dqn/agent.py", line 166 in update
  File "/om2/user/armas/anaconda/envs/dist_rl/lib/python3.8/site-packages/acme/environment_loop.py", line 99 in run
  File "test1.py", line 61 in main
  File "/om2/user/armas/anaconda/envs/dist_rl/lib/python3.8/site-packages/absl/app.py", line 250 in _run_main
  File "/om2/user/armas/anaconda/envs/dist_rl/lib/python3.8/site-packages/absl/app.py", line 299 in run
  File "test1.py", line 65 in <module>

Interestingly, on my local machine my computer crashes just 7 episodes in.

@fastturtle @aslanides Do you know what the issue could be or if you could direct me to a person who may help, thank you both so much!
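
There is no built-in Acme utility for this that I know of, but a simple way to watch memory growth from inside the training script is to log the process RSS periodically; a sketch (psutil is an extra dependency here, and the hook point is up to you):

import os
import psutil

_process = psutil.Process(os.getpid())

def log_memory(label=''):
  # Resident set size of the current process, in megabytes.
  rss_mb = _process.memory_info().rss / 1e6
  print(f'{label} RSS = {rss_mb:.1f} MB')

# For example, call log_memory(f'episode {i}') every few episodes in the loop.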

Does Acme support Gym Tuple Spaces?

I am having trouble using observation Tuple spaces with multiple Discrete spaces and then having the framework execute.
Specifically, I always get an error:

    agent = dqn.DQN(
  File "/home/cah/acme/lib/python3.8/site-packages/acme/agents/tf/dqn/agent.py", line 128, in __init__
    tf2_utils.create_variables(network, [environment_spec.observations])
  File "/home/cah/acme/lib/python3.8/site-packages/acme/tf/utils.py", line 105, in create_variables
    dummy_output = network(*add_batch_dim(dummy_input))
  File "/home/cah/acme/lib/python3.8/site-packages/sonnet/src/utils.py", line 89, in _decorate_unbound_method
    return decorator_fn(bound_method, self, args, kwargs)
  File "/home/cah/acme/lib/python3.8/site-packages/sonnet/src/base.py", line 272, in wrap_with_name_scope
    return method(*args, **kwargs)
  File "/home/cah/acme/lib/python3.8/site-packages/sonnet/src/sequential.py", line 72, in __call__
    outputs = mod(outputs, *args, **kwargs)
  File "/home/cah/acme/lib/python3.8/site-packages/sonnet/src/utils.py", line 89, in _decorate_unbound_method
    return decorator_fn(bound_method, self, args, kwargs)
  File "/home/cah/acme/lib/python3.8/site-packages/sonnet/src/base.py", line 272, in wrap_with_name_scope
    return method(*args, **kwargs)
  File "/home/cah/acme/lib/python3.8/site-packages/sonnet/src/reshape.py", line 147, in __call__
    self._initialize(inputs)
  File "/home/cah/acme/lib/python3.8/site-packages/sonnet/src/utils.py", line 89, in _decorate_unbound_method
    return decorator_fn(bound_method, self, args, kwargs)
  File "/home/cah/acme/lib/python3.8/site-packages/sonnet/src/once.py", line 93, in wrapper
    _check_no_output(wrapped(*args, **kwargs))
  File "/home/cah/acme/lib/python3.8/site-packages/sonnet/src/utils.py", line 89, in _decorate_unbound_method
    return decorator_fn(bound_method, self, args, kwargs)
  File "/home/cah/acme/lib/python3.8/site-packages/sonnet/src/base.py", line 272, in wrap_with_name_scope
    return method(*args, **kwargs)
  File "/home/cah/acme/lib/python3.8/site-packages/sonnet/src/reshape.py", line 120, in _initialize
    if inputs.shape.rank < self._preserve_dims:
AttributeError: 'tuple' object has no attribute 'shape'

An example I am trying to run is Acme with the OpenAI Gym Blackjack environment: 3 Discrete observation spaces in a Tuple space.
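
I'm not sure Tuple observation specs are supported out of the box, but one workaround (a sketch, not an official recommendation, assuming a recent enough gym) is to flatten the Tuple space on the Gym side before handing the environment to Acme; gym's FlattenObservation wrapper one-hot-encodes and concatenates the Discrete components into a single Box:

import gym
from gym.wrappers import FlattenObservation
from acme import wrappers

# Blackjack-v0 has observation_space = Tuple(Discrete(32), Discrete(11), Discrete(2)).
environment = gym.make('Blackjack-v0')
environment = FlattenObservation(environment)  # Tuple -> flat Box (one-hot concat)
environment = wrappers.GymWrapper(environment)
environment = wrappers.SinglePrecisionWrapper(environment)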

Callback system

Hi everyone!
Thank you for sharing this project.
I have a few ideas that I think might contribute to your library, and before making a PR I thought I would ask here first to check whether you are interested.

The idea would be to take one more abstraction step: instead of having an Agent with an Actor and a Learner, an Agent would have an Actor and a list of Callbacks. The Learner would itself be a Callback, but so would the Adder, the Logger, and even the Noise. The Callbacks would react to certain events, similar to the fastai callback system ('on_episode_begin', 'on_episode_end', 'on_feedback', 'before_select_action', 'after_select_action').
Callbacks may be disabled on demand, so after running the env_loop one could simply call agent._noise.disable() to remove noise.
The environment loop looks quite a bit cleaner this way too:

iterator = range(num_episodes) if num_episodes else itertools.count()

for _ in iterator:
    timestep = self._environment.reset()

    self._callbacks.call('on_episode_begin', timestep=timestep)

    # Run an episode.
    while not timestep.last():
        # Generate an action from the agent's policy and step the environment.
        action = self._agent.select_action(timestep.observation)
        timestep = self._environment.step(action)

        self._callbacks.call('on_feedback', action=action, next_timestep=timestep)

    self._callbacks.call('on_episode_end')

I forked your repo and started implementing it (it is actually almost done). It involves quite a few changes, so let me know if you would be willing to take a look at it.

Padded sequences in running Impala

I have a question regarding padded sequences when running IMPALA. In SequenceAdder, sequences are zero-padded to the intended length, and snt.static_unroll takes a sequence_length argument (of batch size) to account for the padding. But it seems the learning step of the IMPALA agent does not handle this explicitly; for example, sequence_length is not used in the unrolling step of the agent. I was wondering whether this has an impact on performance and whether it should be taken into account when computing the loss, e.g., by masking the gradient on the padded experience?

Thanks for sharing this great library!
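
For what it's worth, a generic way to keep padded steps out of the loss, assuming the per-timestep loss has shape [B, T] and the true (unpadded) lengths are available, is to mask before averaging; a sketch:

import tensorflow as tf

def masked_mean(per_timestep_loss, sequence_lengths):
  # per_timestep_loss: [batch, time]; sequence_lengths: [batch] true lengths.
  # mask[b, t] is 1.0 for real steps and 0.0 on the zero padding.
  mask = tf.sequence_mask(sequence_lengths,
                          maxlen=per_timestep_loss.shape[1],
                          dtype=per_timestep_loss.dtype)
  return tf.reduce_sum(per_timestep_loss * mask) / tf.reduce_sum(mask)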

NLP

Hi,
I want to ask whether this project can be applied to NLP tasks or not.
I would appreciate it if you could tell me. Thank you.

Questions re: Custom Distributional Agents in ACME?

I have four questions towards creating custom distributional RL algorithms using Acme.

  1. The only distributional agent I can find (in TF and jax) is D4PG. Is this correct?

  2. When will other distributional agents (e.g. C51, Quantile Regression, Expectile Regression) be released?

  3. If I want to implement my own distributional agent, I glanced at d4pg.D4PGLearner to get a sense of how this might be done. I'm a bit puzzled by a few things. Coming from PyTorch, I'm used to optimizers stepping. It looks like here, the Learner itself steps. Is this correct? What was the reason for this implementation choice?

  4. If I want to implement expectile regression for discrete control, what would the recommended approach be? I imagine I'd need to start with DQN, change the output dimension and change the loss function, but how do I specify which output element (i.e. the output units corresponding to tau = 0.5) is used for control?

Data type errors with LunarLanderContinuous, MountainCar

I'm running into an internal error complaining about the wrong data type (expected float32, got int32) when I try running the quickstart code (the D4PG agent) on LunarLander and MountainCar. Here is the code that generates the error:

# python3
# Copyright 2018 DeepMind Technologies Limited. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

"""Example running D4PG on the OpenAI Gym."""

from typing import Mapping, Sequence
import sys
from absl import app
from absl import flags
import acme
from acme import specs
from acme import types
from acme import wrappers
from acme.agents.tf import actors
from acme.agents.tf import d4pg
from acme.tf import networks
from acme.tf import utils as tf2_utils
import dm_env
import gym
import numpy as np
import sonnet as snt

FLAGS = flags.FLAGS
flags.DEFINE_integer('num_episodes', 100,
                     'Number of training episodes to run for.')
flags.DEFINE_integer('num_episodes_per_eval', 10,
                     'Number of training episodes to run between evaluation '
                     'episodes.')


def make_environment(
    task: str = 'LunarLanderContinuous-v2') -> dm_env.Environment:
  """Creates an OpenAI Gym environment."""

  # Load the gym environment.
  environment = gym.make(task)

  # Make sure the environment obeys the dm_env.Environment interface.
  environment = wrappers.GymWrapper(environment)
  environment = wrappers.SinglePrecisionWrapper(environment)

  return environment


# The default settings in this network factory will work well for the
# MountainCarContinuous-v0 task but may need to be tuned for others. In
# particular, the vmin/vmax and num_atoms hyperparameters should be set to
# give the distributional critic a good dynamic range over possible discounted
# returns. Note that this is very different than the scale of immediate rewards.
def make_networks(
    action_spec: specs.BoundedArray,
    policy_layer_sizes: Sequence[int] = (256, 256, 256),
    critic_layer_sizes: Sequence[int] = (512, 512, 256),
    vmin: float = -150.,
    vmax: float = 150.,
    num_atoms: int = 51,
) -> Mapping[str, types.TensorTransformation]:
  """Creates the networks used by the agent."""

  # Get total number of action dimensions from action spec.
  num_dimensions = np.prod(action_spec.shape, dtype=int)

  # Create the shared observation network; here simply a state-less operation.
  observation_network = tf2_utils.batch_concat

  # Create the policy network.
  policy_network = snt.Sequential([
      networks.LayerNormMLP(policy_layer_sizes, activate_final=True),
      networks.NearZeroInitializedLinear(num_dimensions),
      networks.TanhToSpec(action_spec),
  ])

  # Create the critic network.
  critic_network = snt.Sequential([
      # The multiplexer concatenates the observations/actions.
      networks.CriticMultiplexer(),
      networks.LayerNormMLP(critic_layer_sizes, activate_final=True),
      networks.DiscreteValuedHead(vmin, vmax, num_atoms),
  ])

  return {
      'policy': policy_network,
      'critic': critic_network,
      'observation': observation_network,
  }


def main(_):

  nbits = int(sys.argv[1])
  iteration = int(sys.argv[2])
  dirpath = str(sys.argv[3])
  
  # Create an environment, grab the spec, and use it to create networks.
  environment = make_environment()
  environment_spec = specs.make_environment_spec(environment)
  agent_networks = make_networks(environment_spec.actions)

  # Construct the agent.
  agent = d4pg.D4PG(
      environment_spec=environment_spec,
      policy_network=agent_networks['policy'],
      critic_network=agent_networks['critic'],
      observation_network=agent_networks['observation'],
      sigma=1.0,
      nbits=nbits
  )

  # Create the environment loop used for training.
  train_loop = acme.EnvironmentLoop(environment, agent, label='%s/nbits=%d_rep=%d/train_loop_nbits=%d_rep=%d' % (dirpath, nbits, iteration, nbits, iteration))

  # Create the evaluation policy.
  eval_policy = snt.Sequential([
      agent_networks['observation'],
      agent_networks['policy'],
  ])

  # Create the evaluation actor and loop.
  eval_actor = actors.FeedForwardActor(policy_network=eval_policy)
  eval_env = make_environment()
  eval_loop = acme.EnvironmentLoop(eval_env, eval_actor, label='%s/nbits=%d_rep=%d/eval_loop_nbits=%d_rep=%d' % (dirpath, nbits, iteration, nbits, iteration))

  for _ in range(FLAGS.num_episodes // FLAGS.num_episodes_per_eval):
    train_loop.run(num_episodes=FLAGS.num_episodes_per_eval)
    eval_loop.run(num_episodes=1)


if __name__ == '__main__':
  app.run(main)

The error output I see is:

W0619 05:19:35.448845 47712755015296 backprop.py:1021] Calling GradientTape.gradient on a persistent tape inside its context is significantly less efficient than calling it outside the context (it causes the gradient ops to be recorded on the tape, leading to increased CPU and memory usage). Only call GradientTape.gradient inside the context if you actually want to trace the gradient in order to compute higher order derivatives.
[Learner] Critic Loss = 3.948 | Policy Loss = 0.474 | Steps = 1 | Walltime = 0
[Data Dbg/Nbits=32 Rep=1/Train Loop Nbits=32 Rep=1] Episode Length = 87 | Episode Return = -212.909 | Episodes = 10 | Steps = 1073 | Steps Per Second = 3.896
[Learner] Critic Loss = 3.893 | Policy Loss = 0.385 | Steps = 21 | Walltime = 1.026
[Data Dbg/Nbits=32 Rep=1/Train Loop Nbits=32 Rep=1] Episode Length = 113 | Episode Return = -252.268 | Episodes = 12 | Steps = 1273 | Steps Per Second = 195.374
[Learner] Critic Loss = 3.892 | Policy Loss = 0.403 | Steps = 46 | Walltime = 2.032
[Data Dbg/Nbits=32 Rep=1/Train Loop Nbits=32 Rep=1] Episode Length = 83 | Episode Return = -418.158 | Episodes = 15 | Steps = 1480 | Steps Per Second = 190.304
[Learner] Critic Loss = 3.853 | Policy Loss = 0.382 | Steps = 71 | Walltime = 3.044
[Data Dbg/Nbits=32 Rep=1/Train Loop Nbits=32 Rep=1] Episode Length = 53 | Episode Return = -151.977 | Episodes = 18 | Steps = 1692 | Steps Per Second = 196.198
[Learner] Critic Loss = 3.850 | Policy Loss = 0.340 | Steps = 97 | Walltime = 4.077
[Data Dbg/Nbits=32 Rep=1/Train Loop Nbits=32 Rep=1] Episode Length = 63 | Episode Return = -261.436 | Episodes = 21 | Steps = 1905 | Steps Per Second = 198.534
[Learner] Critic Loss = 3.807 | Policy Loss = 0.412 | Steps = 122 | Walltime = 5.090
[Data Dbg/Nbits=32 Rep=1/Train Loop Nbits=32 Rep=1] Episode Length = 101 | Episode Return = -604.001 | Episodes = 24 | Steps = 2146 | Steps Per Second = 197.962
[Learner] Critic Loss = 3.751 | Policy Loss = 0.423 | Steps = 148 | Walltime = 6.128
Traceback (most recent call last):
  File "run_d4pg.py", line 143, in <module>
    app.run(main)
  File "/n/janapa_reddi_lab/maxlam/conda_quarl/lib/python3.6/site-packages/absl/app.py", line 299, in run
    _run_main(main, args)
  File "/n/janapa_reddi_lab/maxlam/conda_quarl/lib/python3.6/site-packages/absl/app.py", line 250, in _run_main
    sys.exit(main(argv))
  File "run_d4pg.py", line 138, in main
    train_loop.run(num_episodes=FLAGS.num_episodes_per_eval)
  File "/n/janapa_reddi_lab/maxlam/conda_quarl/lib/python3.6/site-packages/acme/environment_loop.py", line 99, in run
    self._actor.update()
  File "/n/janapa_reddi_lab/maxlam/conda_quarl/lib/python3.6/site-packages/acme/agents/agent.py", line 87, in update
    self._learner.step()
  File "/n/janapa_reddi_lab/maxlam/conda_quarl/lib/python3.6/site-packages/acme/agents/tf/d4pg/learning.py", line 251, in step
    fetches = self._step()
  File "/n/janapa_reddi_lab/maxlam/conda_quarl/lib/python3.6/site-packages/tensorflow/python/eager/def_function.py", line 767, in __call__
    result = self._call(*args, **kwds)
  File "/n/janapa_reddi_lab/maxlam/conda_quarl/lib/python3.6/site-packages/tensorflow/python/eager/def_function.py", line 794, in _call
    return self._stateless_fn(*args, **kwds)  # pylint: disable=not-callable
  File "/n/janapa_reddi_lab/maxlam/conda_quarl/lib/python3.6/site-packages/tensorflow/python/eager/function.py", line 2811, in __call__
    return graph_function._filtered_call(args, kwargs)  # pylint: disable=protected-access
  File "/n/janapa_reddi_lab/maxlam/conda_quarl/lib/python3.6/site-packages/tensorflow/python/eager/function.py", line 1838, in _filtered_call
    cancellation_manager=cancellation_manager)
  File "/n/janapa_reddi_lab/maxlam/conda_quarl/lib/python3.6/site-packages/tensorflow/python/eager/function.py", line 1914, in _call_flat
    ctx, args, cancellation_manager=cancellation_manager))
  File "/n/janapa_reddi_lab/maxlam/conda_quarl/lib/python3.6/site-packages/tensorflow/python/eager/function.py", line 549, in call
    ctx=ctx)
  File "/n/janapa_reddi_lab/maxlam/conda_quarl/lib/python3.6/site-packages/tensorflow/python/eager/execute.py", line 60, in quick_execute
    inputs, attrs, num_outputs)
tensorflow.python.framework.errors_impl.InternalError: 2 root error(s) found.
  (0) Internal:  Output 6 of type int32 does not match declared output type float for node node IteratorGetNext (defined at /n/janapa_reddi_lab/maxlam/conda_quarl/lib/python3.6/site-packages/acme/agents/tf/d4pg/learning.py:178) 
  (1) Cancelled:  Function was cancelled before it was started
0 successful operations.
0 derived errors ignored. [Op:__inference__step_9555]

Function call stack:
_step -> _step

[reverb/cc/platform/default/server.cc:64] Shutting down replay server
W0619 05:20:06.683977 47712755015296 client.py:112] Writer-object deleted without calling .close explicitly.
[reverb/cc/writer.cc:231] Received error when closing the stream: [14] Socket closed

I'm not sure where the int32 is coming from; any assistance with this issue would be appreciated. Thanks a lot!
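
One way to narrow this down (just a diagnostic sketch, reusing the make_environment defined above) is to print the environment spec and see which field ends up as int32; that field is what the learner's dataset iterator will declare:

from acme import specs

environment = make_environment()
environment_spec = specs.make_environment_spec(environment)

# Whichever of these reports an int32 dtype is the likely culprit.
print('observations:', environment_spec.observations)
print('actions:', environment_spec.actions)
print('rewards:', environment_spec.rewards)
print('discounts:', environment_spec.discounts)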

MCTS agent: environment in simulation out of sync

I appreciate the initiative with this framework a lot; in general I find the code very clean and much easier to read and adapt than that of other frameworks!
Currently, I am focusing on the MCTS agent and strongly believe that there is a bug concerning the separate environment inside the simulator: It runs out of sync with the main environment as it is not reset properly.

More specifically, when a new episode is started in the environment loop after the first episode has finished, the main environment is reset, but the environment in the simulator is not. During action selection, the tree search starts by creating a root node and selecting an action, and this action should then be performed by step. At this moment, the simulator's environment is still in a stale, un-reset state. It seems that this is usually a LAST state, so an auto-reset is performed; however, this happens too late, as the observation of the initial state is then returned instead of the intended result of the first action.

Issues importing acme.tf

Hi all,

Despite running pip install dm-acme[tf], using import acme.tf raises an error. This appears to be an issue in both the Quickstart and Tutorial examples.

This is the reproducible code sample:

!pip install dm-acme
!pip install dm-acme[reverb]
!pip install dm-acme[tf]
!pip install dm-acme[envs]

import acme.tf

This is on the Google Colab version.

dm-acme[reverb] installation fails

ERROR: Could not find a version that satisfies the requirement dm-reverb-nightly==0.1.0.dev20200529; extra == "reverb" (from dm-acme[reverb]) (from versions: 0.1.0.dev20200605)
ERROR: No matching distribution found for dm-reverb-nightly==0.1.0.dev20200529; extra == "reverb" (from dm-acme[reverb])

Overlapping sequences vs. period and sequence length

I'm not sure about this one, maybe I'm wrong...

Under the reverb adders section in the docs, for the SequenceAdder you say:

sequences can be overlapping (if the period parameter = sequence_length n) or non-overlapping (if period < sequence_length)

Do you mean the opposite maybe?

sequences can be non-overlapping (if the period parameter = sequence_length n) or overlapping (if period < sequence_length)
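
A small illustration of why the second reading seems right, assuming the adder starts a new sequence every period steps:

# Steps 0..7 with sequence_length = 3.
steps = list(range(8))

def sequences(period, sequence_length=3):
  return [steps[i:i + sequence_length]
          for i in range(0, len(steps) - sequence_length + 1, period)]

# period == sequence_length: [[0, 1, 2], [3, 4, 5]]            -> non-overlapping
print(sequences(period=3))
# period < sequence_length:  [[0, 1, 2], [2, 3, 4], [4, 5, 6]] -> overlapping
print(sequences(period=2))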

Agent evaluation and results

Hi,

Thanks for sharing this repo.
I have a suggestion: it would be super useful if you could provide examples of how to evaluate an agent and report results (e.g., for DQN or DDPG). I see an example in the tutorial, but it is more about displaying videos than reporting results.
This way, one could compare directly with DeepMind's papers, and the comparison would be consistent and meaningful.
Thanks.

Using annealing epsilon greedy policy

I found out about this amazing library last week (such a great job, and thank you for making it public!) and I have been working on adapting my own environments and agents to Acme's classes. I have one small question.

I noticed that the DQN agent uses a constant epsilon of 0.05 for its epsilon-greedy action selection: https://github.com/deepmind/acme/blob/eb63054ec19525fa8320fb323327cf13495615e3/acme/agents/tf/dqn/agent.py#L119
Now, if I want to use a linearly annealed epsilon based on the total number of steps, what is the proper way to code it?

In the code, a layer is added below the network to apply epsilon-greedy. I guess it's not impossible for the network to take the total number of steps as an extra input and feed it all the way down to a modified trfl.epsilon_greedy with annealing, but it seems I would need to change many different components, and I really doubt whether this is the proper way by design. I also don't know whether this already exists as a function in the library.

Again, I haven't been working with it for very long. If this turns out to be a silly question, I apologize. Thank you!
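
One lightweight approach (a sketch, not an official Acme API) is to hold epsilon in a non-trainable tf.Variable that the policy reads and that the training script anneals from outside, so the network never needs the step count as an input; names like q_network and num_actions below are placeholders:

import sonnet as snt
import tensorflow as tf
import trfl

num_actions = 4  # placeholder; use environment_spec.actions.num_values
q_network = snt.nets.MLP([50, 50, num_actions])  # stand-in for the real Q-network

# Epsilon lives in a variable so it can be changed after the policy is built.
epsilon = tf.Variable(1.0, trainable=False, dtype=tf.float32)

# Same structure as the agent's default policy, but reading the variable.
policy_network = snt.Sequential([
    q_network,
    lambda q: trfl.epsilon_greedy(q, epsilon=epsilon).sample(),
])

def anneal_epsilon(step, start=1.0, end=0.05, anneal_steps=100_000):
  # Linear schedule; call this once per actor/environment step.
  frac = min(step / anneal_steps, 1.0)
  epsilon.assign(start + frac * (end - start))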

Rendering atari environment

Hi, thanks for open sourcing this framework

I was trying to visualize what the agent has learned, but I could not render the Atari environment.

After the GymAtariAdapter wrapper is applied, when I try to render the environment I get:

AttributeError: 'AtariWrapper' object has no attribute 'render'

And I cannot unwrap it either:

'AtariWrapper' object has no attribute 'unwrapped'

I apologize if this is the intended behavior.
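
One workaround (not an officially supported path, just a sketch) is to keep a handle on the raw Gym environment before wrapping it and render through that handle; the underlying emulator advances when the wrapped environment steps, so the frames stay in sync:

import functools
import gym
from acme import wrappers

gym_env = gym.make('PongNoFrameskip-v4', full_action_space=True)
env = wrappers.wrap_all(gym_env, [
    wrappers.GymAtariAdapter,
    functools.partial(wrappers.AtariWrapper, to_float=True),
    wrappers.SinglePrecisionWrapper,
])

timestep = env.reset()
frame = gym_env.render(mode='rgb_array')  # or mode='human' for an on-screen window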

TRPO / PPO / SAC implementations

First off, thank you for this great library!

Is the team planning on implementing TRPO/PPO/SAC as well?

I find it immensely valuable to be able to compare different families of algorithms within the same framework to reduce errors and have my results be more robust.

Acme agents hyperparameters

Hi,

Do the Acme agents in "acme/examples/" follow the same setups and hyperparameters as their corresponding papers? Can that exact code be used to reproduce their results, or are some changes needed, such as reward scaling, hyperparameter tuning, etc.?

Thanks.

AZLearner: TD vs MSE-to-outcome

The AZLearner implementation here appears to do TD on transitions.

As far as I can tell from page 2 of the AlphaZero paper, the original AlphaZero optimizes MSE between value-predictions and game-outcomes.

Is this an intentional variation? (Or am I misunderstanding either of the files?)

(Sorry in case this is the wrong place to put this)

Why is the loss for the trajectories in each episode not decreasing for DQN example?

Hello everyone, I ran the acme/run_dqn.py example script and noticed that, while the episode return increases, the learner loss stops decreasing as the episodes continue. Why is that? What logs are at my disposal to see whether agents are improving in later episodes?

Here is the output after a few episodes as opposed to 895 episodes in.

[Environment Loop] Episode Length = 764 | Episode Return = -21.000 | Episodes = 2 | Steps = 1528 | Steps Per Second = 37.277
[Learner] Loss = 0.058 | Steps = 70 | Walltime = 4.277
[Learner] Loss = 0.058 | Steps = 87 | Walltime = 5.307
[Learner] Loss = 0.163 | Steps = 104 | Walltime = 6.357
[Learner] Loss = 0.073 | Steps = 121 | Walltime = 7.372
[Learner] Loss = 0.068 | Steps = 138 | Walltime = 8.380
[Learner] Loss = 0.061 | Steps = 155 | Walltime = 9.391
[Learner] Loss = 0.052 | Steps = 173 | Walltime = 10.452
[Environment Loop] Episode Length = 876 | Episode Return = -21.000 | Episodes = 3 | Steps = 2404 | Steps Per Second = 133.777
[Learner] Loss = 0.054 | Steps = 190 | Walltime = 11.485
[Learner] Loss = 0.060 | Steps = 207 | Walltime = 12.543

Very quickly the loss decreases...

[Environment Loop] Episode Length = 764 | Episode Return = -21.000 | Episodes = 6 | Steps = 4696 | Steps Per Second = 140.916
[Learner] Loss = 0.009 | Steps = 467 | Walltime = 27.883
[Learner] Loss = 0.014 | Steps = 485 | Walltime = 28.910
[Learner] Loss = 0.014 | Steps = 503 | Walltime = 29.950
[Learner] Loss = 0.013 | Steps = 520 | Walltime = 30.955
[Learner] Loss = 0.020 | Steps = 537 | Walltime = 31.968
[Learner] Loss = 0.016 | Steps = 554 | Walltime = 33.019
[Learner] Loss = 0.011 | Steps = 571 | Walltime = 34.030
[Environment Loop] Episode Length = 912 | Episode Return = -21.000 | Episodes = 7 | Steps = 5608 | Steps Per Second = 135.071
[Learner] Loss = 0.010 | Steps = 588 | Walltime = 35.087
[Learner] Loss = 0.014 | Steps = 604 | Walltime = 36.109
[Learner] Loss = 0.015 | Steps = 615 | Walltime = 37.127
[Learner] Loss = 0.016 | Steps = 629 | Walltime = 38.193

But it seems not to improve much for the rest of the training loop:

[Environment Loop] Episode Length = 2154 | Episode Return = 12.000 | Episodes = 895 | Steps = 1838193 | Steps Per Second = 6.382
[Learner] Loss = 0.011 | Steps = 229651 | Walltime = 304473.760
[Learner] Loss = 0.009 | Steps = 229652 | Walltime = 304475.012
[Learner] Loss = 0.012 | Steps = 229653 | Walltime = 304476.288
[Learner] Loss = 0.008 | Steps = 229654 | Walltime = 304477.543
[Learner] Loss = 0.009 | Steps = 229655 | Walltime = 304478.783
[Learner] Loss = 0.008 | Steps = 229656 | Walltime = 304480.020
[Learner] Loss = 0.010 | Steps = 229657 | Walltime = 304481.291
[Learner] Loss = 0.017 | Steps = 229658 | Walltime = 304482.534
[Learner] Loss = 0.009 | Steps = 229659 | Walltime = 304483.770

GymWrapper: deepcopy fails due to its __getattr__ method

MCTS needs to deepcopy the environment. When used with a Gym environment wrapped with GymWrapper, deepcopy fails with an infinite recursion due to GymWrapper's __getattr__ method.

I tried to resolve the problem with my own __deepcopy__ method in my Gym environment, but this also did not work, as then only my Gym environment, without the wrapper, is copied.

Commenting out GymWrapper.__getattr__ resolves the problem in my case, but this is clearly not what was originally intended.
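
In case it helps others: the usual pattern for this kind of recursion is to stop __getattr__ from forwarding dunder lookups, because copy.deepcopy probes names like __deepcopy__ and __getstate__ before the wrapper's own attributes exist. A sketch, assuming the wrapper keeps the wrapped environment on self._environment (check the actual attribute name in gym_wrapper.py):

from acme.wrappers import GymWrapper

class DeepcopyableGymWrapper(GymWrapper):
  """GymWrapper variant whose attribute forwarding won't recurse under deepcopy."""

  def __getattr__(self, name):
    # Refuse to forward special-method lookups; deepcopy/pickle probe these
    # before instance attributes are set, which triggers infinite recursion.
    if name.startswith('__') and name.endswith('__'):
      raise AttributeError(name)
    return getattr(self._environment, name)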

How to modify acme.adders.reverb.base.Adder for adding several environments' transitions in Acme?

I want to use Acme's algorithms to interact with vectorized environments (just like baselines' VecEnv). However, maintaining N adders for N environments reduces efficiency when collecting samples and sending them to the reverb server (when I use my VecEnv(num_envs=64), sending transitions to the replay buffer takes 60% of the total sampling time). How should I modify the code of acme.adders.reverb.base.Adder and its subclasses so that one adder can add transitions from several environments?
Thank you very much~

Cannot run r2d2

Thanks for the repo. I wrote the code for starting R2D2, modeled on the DQN example, as shown in the following block:

# python3
# Copyright 2018 DeepMind Technologies Limited. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

"""Run DQN on Atari."""

import functools

from absl import app
from absl import flags
import acme
from acme import wrappers
from acme.agents.tf import r2d2
from acme.tf import networks
import dm_env
import gym

flags.DEFINE_string('level', 'PongNoFrameskip-v4', 'Which Atari level to play.')
flags.DEFINE_integer('num_episodes', 1000, 'Number of episodes to train for.')

FLAGS = flags.FLAGS


def make_environment(evaluation: bool = False) -> dm_env.Environment:

  env = gym.make(FLAGS.level, full_action_space=True)

  max_episode_len = 108000 if evaluation else 50000

  return wrappers.wrap_all(env, [
      wrappers.GymAtariAdapter,
      functools.partial(
          wrappers.AtariWrapper,
          to_float=True,
          max_episode_len=max_episode_len,
          zero_discount_on_life_loss=True,
      ),
      wrappers.SinglePrecisionWrapper,
  ])


def main(_):
  env = make_environment()
  env_spec = acme.make_environment_spec(env)
  network = networks.R2D2AtariNetwork(env_spec.actions.num_values)

  agent = r2d2.R2D2(env_spec, network, burn_in_length=40, trace_length=40, replay_period=1)

  loop = acme.EnvironmentLoop(env, agent)
  loop.run(FLAGS.num_episodes)


if __name__ == '__main__':
  app.run(main)

However, I got an error like this:

Traceback (most recent call last):
  File "run_r2d2.py", line 65, in <module>
    app.run(main)
  File "/home/mxfeng/anaconda3/envs/acme/lib/python3.6/site-packages/absl/app.py", line 299, in run
    _run_main(main, args)
  File "/home/mxfeng/anaconda3/envs/acme/lib/python3.6/site-packages/absl/app.py", line 250, in _run_main
    sys.exit(main(argv))
  File "run_r2d2.py", line 58, in main
    agent = r2d2.R2D2(env_spec, network, burn_in_length=10, trace_length=10, replay_period=1)
  File "/home/mxfeng/acme/acme/agents/tf/r2d2/agent.py", line 103, in __init__
    tf2_utils.create_variables(network, [environment_spec.observations])
  File "/home/mxfeng/acme/acme/tf/utils.py", line 105, in create_variables
    dummy_output = network(*add_batch_dim(dummy_input))
  File "/home/mxfeng/anaconda3/envs/acme/lib/python3.6/site-packages/sonnet/src/utils.py", line 89, in _decorate_unbound_method
    return decorator_fn(bound_method, self, args, kwargs)
  File "/home/mxfeng/anaconda3/envs/acme/lib/python3.6/site-packages/sonnet/src/base.py", line 272, in wrap_with_name_scope
    return method(*args, **kwargs)
  File "/home/mxfeng/acme/acme/tf/networks/atari.py", line 91, in __call__
    embeddings = self._embed(inputs)
  File "/home/mxfeng/anaconda3/envs/acme/lib/python3.6/site-packages/sonnet/src/utils.py", line 89, in _decorate_unbound_method
    return decorator_fn(bound_method, self, args, kwargs)
  File "/home/mxfeng/anaconda3/envs/acme/lib/python3.6/site-packages/sonnet/src/base.py", line 272, in wrap_with_name_scope
    return method(*args, **kwargs)
  File "/home/mxfeng/acme/acme/tf/networks/embedding.py", line 37, in __call__
    if len(inputs.reward.shape.dims) == 1:
AttributeError: 'tensorflow.python.framework.ops.EagerTensor' object has no attribute 'reward'
[reverb/cc/platform/default/server.cc:64] Shutting down replay server

How should I deal with this?
Thanks a lot!
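
I can't be certain this is the whole story, but the traceback suggests R2D2AtariNetwork expects an (observation, action, reward) structure rather than a bare frame. If your Acme version exports it, adding the ObservationActionRewardWrapper to the wrapper stack may resolve the error; its exact position in the stack is an assumption on my part:

def make_environment(evaluation: bool = False) -> dm_env.Environment:
  env = gym.make(FLAGS.level, full_action_space=True)
  max_episode_len = 108000 if evaluation else 50000
  return wrappers.wrap_all(env, [
      wrappers.GymAtariAdapter,
      functools.partial(
          wrappers.AtariWrapper,
          to_float=True,
          max_episode_len=max_episode_len,
          zero_discount_on_life_loss=True,
      ),
      wrappers.SinglePrecisionWrapper,
      wrappers.ObservationActionRewardWrapper,  # packs (obs, action, reward)
  ])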

QuickStart throws attribute error

Hi.
Thank you for providing this amazing open-source library.

Running your quickstart notebook with a gym environment, I found that your D4PG tutorial throws an attribute error saying

AttributeError: 'DiscreteValuedDistribution' object has no attribute 'probs_parameter'

DiscreteValuedDistribution is a subclass of tfp.distributions.Categorical, which has the method probs_parameter() (see the official docs).

Is there any version restriction for tensorflow-probability or any bug in my code?

Here is my code (though it's nearly the same as the notebook):

from acme import environment_loop
from acme import specs
from acme import wrappers
from acme.agents.tf import d4pg
from acme.tf import networks
from acme.tf import utils as tf2_utils
from acme.utils import loggers
import numpy as np
import sonnet as snt
import gym

# Imports required for visualization
import pyvirtualdisplay
import imageio
import base64

# Set up a virtual display for rendering.
display = pyvirtualdisplay.Display(visible=0, size=(1400, 900)).start()

environment = gym.make('MountainCarContinuous-v0')
environment = wrappers.GymWrapper(environment)  # To dm_env interface.

# Make sure the environment outputs single-precision floats.
environment = wrappers.SinglePrecisionWrapper(environment)

# Grab the spec of the environment.
environment_spec = specs.make_environment_spec(environment)

# Get total number of action dimensions from action spec.
num_dimensions = np.prod(environment_spec.actions.shape, dtype=int)

# Create the shared observation network; here simply a state-less operation.
observation_network = tf2_utils.batch_concat

# Create the deterministic policy network.
policy_network = snt.Sequential([
    networks.LayerNormMLP((256, 256, 256), activate_final=True),
    networks.NearZeroInitializedLinear(num_dimensions),
    networks.TanhToSpec(environment_spec.actions),
])

# Create the distributional critic network.
critic_network = snt.Sequential([
    # The multiplexer concatenates the observations/actions.
    networks.CriticMultiplexer(),
    networks.LayerNormMLP((512, 512, 256), activate_final=True),
    networks.DiscreteValuedHead(vmin=-150., vmax=150., num_atoms=51),
])

agent_logger = loggers.TerminalLogger(label='agent', time_delta=10.)
env_loop_logger = loggers.TerminalLogger(label='env_loop', time_delta=10.)

# Create the D4PG agent.
agent = d4pg.D4PG(
    environment_spec=environment_spec,
    policy_network=policy_network,
    critic_network=critic_network,
    observation_network=observation_network,
    sigma=1.0,
    logger=agent_logger,
    checkpoint=False
)

# Create a loop connecting this agent to the environment created above.
env_loop = environment_loop.EnvironmentLoop(
    environment, agent, logger=env_loop_logger)

# Run a `num_episodes` training episodes.
# Rerun this cell until the agent has learned the given task.
env_loop.run(num_episodes=100)

I'm running this on Ubuntu 18.04.4 in Docker on OS X Catalina 10.15.5.
I pip-installed the following libraries:

reverb
tf-nightly==2.3.0.dev20200604
dm-reverb-nightly
tensorflow-probability==0.7.0
wrapt 
dm-sonnet
graphs
trfl
dm-acme
dm-acme[reverb]
dm-acme[tf]    
dm-acme[envs]
ffmpeg
gym
imageio
PILLOW 
pyglet
pyvirtualdisplay
imageio-ffmpeg
xvfbwrapper

Since it worked when I modified the tutorial to use DDPG (not D4PG), which I think doesn't use the tfp libraries, I believe it's a matter of the tensorflow-probability version.

Thanks for your cooperation.
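
For what it's worth, the pinned tensorflow-probability==0.7.0 predates probs_parameter() as far as I know (it appears in later releases), so upgrading tfp would be the first thing to try; a quick way to check in the same environment:

import tensorflow_probability as tfp

dist = tfp.distributions.Categorical(logits=[0.0, 0.0])
print(tfp.__version__, hasattr(dist, 'probs_parameter'))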

Non-Mandatory CSVLogger UID (or custom UID)?

The CSVLogger class always adds a mandatory UID:

directory = paths.process_path(directory, 'logs', label, add_uid=True)

Would it be possible to add an argument to the CSVLogger init function to disable UIDs? It would look like:

class CSVLogger(base.Logger):
  """Standard CSV logger."""

  _open = open

  def __init__(self,
               directory: str = '~/acme',
               label: str = '',
               time_delta: float = 0.,
               add_uid=True,
):

    directory = paths.process_path(directory, 'logs', label, add_uid=add_uid)

DQN crashes on tensorflow 2.3rc

First of all, I run everything on Google Colab. I don't know if this issue is worth reporting, because right now the stable version of tensorflow is 2.2 and DQN has no problems on it. But in certain circumstances I got errors saying tensorflow 2.2 is not high enough (for details please see the end of this post; let me know if you want me to open a separate issue and describe it in more detail). In any case, the crash happens if you install tensorflow 2.3.0rc0. The problem can be reproduced in the following very simple Colab notebook (run the 1st cell, restart the runtime, and run the 2nd-4th cells):

https://colab.research.google.com/drive/1jcOWwdHPqqrA0Yp97SYyeuHiotDAdU0W?usp=sharing

Because there was no error message, I really had no idea what was going on. Using some crude debugging, I was able to figure out that the problem was possibly (though I'm not 100% sure) caused by reverb.TFClient.update_priorities in the following line:

https://github.com/deepmind/acme/blob/eb63054ec19525fa8320fb323327cf13495615e3/acme/agents/tf/dqn/learning.py#L153

For my own project, I finally realized that the solution is very simple: if I load the packages in the right order, the tensorflow version error goes away, and I can stay on tensorflow 2.2 with nothing going wrong. I don't know if this is worth noting in your project, so I'm posting it here just in case. I would also like to know what caused it, just out of curiosity.


About the tensorflow version error:
In Google Colab, if I have already imported tensorflow and then install the acme packages, I come across the following error when I try to import acme.adders.reverb:

ImportError: This version of Reverb requires TensorFlow version >= 2.3.0; Detected an installation of version 2.2.0. Please upgrade TensorFlow to proceed.

I also have problems loading acme.tf.networks, but the error is a different one. All errors go away if I force-install tensorflow 2.3.0rc0. Let me know if you need more details.

Policy update in D4PG

Hi,

In the D4PG implementation, the policy update uses s_{t+1} (i.e. the next state); shouldn't the current state s_t be used to update the policy?

      dpg_a_t = self._policy_network(o_t)
      dpg_z_t = self._critic_network(o_t, dpg_a_t)
      dpg_q_t = dpg_z_t.mean()

whereas I would expect:

      dpg_a_t = self._policy_network(o_tm1)
      dpg_z_t = self._critic_network(o_tm1, dpg_a_t)
      dpg_q_t = dpg_z_t.mean()

Thanks.

Error: Casting error when running D4PG

Hi,

I am integrating a custom Gym environment which runs on top of another environment. When trying to run it in Acme, I get the following error:

File "/mnt/d/AI Project/acme/acme/examples/gym/run_d4pg.py", line 140, in
app.run(main)
File "/home/win888/.local/lib/python3.8/site-packages/absl/app.py", line 299, in run
_run_main(main, args)
File "/home/win888/.local/lib/python3.8/site-packages/absl/app.py", line 250, in _run_main
sys.exit(main(argv))
File "/mnt/d/AI Project/acme/acme/examples/gym/run_d4pg.py", line 112, in main
agent = d4pg.D4PG(
File "/mnt/d/AI Project/acme/acme/acme/agents/tf/d4pg/agent.py", line 143, in init
tf2_utils.create_variables(critic_network, [emb_spec, act_spec])
File "/mnt/d/AI Project/acme/acme/acme/tf/utils.py", line 105, in create_variables
dummy_output = network(*add_batch_dim(dummy_input))
File "/home/win888/.local/lib/python3.8/site-packages/sonnet/src/utils.py", line 89, in _decorate_unbound_method
return decorator_fn(bound_method, self, args, kwargs)
File "/home/win888/.local/lib/python3.8/site-packages/sonnet/src/base.py", line 272, in wrap_with_name_scope
return method(*args, **kwargs)
File "/home/win888/.local/lib/python3.8/site-packages/sonnet/src/sequential.py", line 72, in call
outputs = mod(outputs, *args, **kwargs)
File "/home/win888/.local/lib/python3.8/site-packages/sonnet/src/utils.py", line 89, in _decorate_unbound_method
return decorator_fn(bound_method, self, args, kwargs)
File "/home/win888/.local/lib/python3.8/site-packages/sonnet/src/base.py", line 272, in wrap_with_name_scope
return method(*args, **kwargs)
File "/mnt/d/AI Project/acme/acme/acme/tf/networks/multiplexers.py", line 65, in call
outputs = tf2_utils.batch_concat([observation, action])
File "/mnt/d/AI Project/acme/acme/acme/tf/utils.py", line 56, in batch_concat
return tf.concat(tree.flatten(flat_leaves), axis=-1)
File "/home/win888/.local/lib/python3.8/site-packages/tensorflow/python/util/dispatch.py", line 201, in wrapper
return target(*args, **kwargs)
File "/home/win888/.local/lib/python3.8/site-packages/tensorflow/python/ops/array_ops.py", line 1643, in concat
return gen_array_ops.concat_v2(values=values, axis=axis, name=name)
File "/home/win888/.local/lib/python3.8/site-packages/tensorflow/python/ops/gen_array_ops.py", line 1207, in concat_v2
_ops.raise_from_not_ok_status(e, name)
File "/home/win888/.local/lib/python3.8/site-packages/tensorflow/python/framework/ops.py", line 6851, in raise_from_not_ok_status
six.raise_from(core._status_to_exception(e.code, message), None)
File "", line 3, in raise_from
tensorflow.python.framework.errors_impl.InvalidArgumentError: cannot compute ConcatV2 as input #1(zero-based) was expected to be a float tensor but is a int32 tensor [Op:ConcatV2] name: concat
[reverb/cc/platform/default/server.cc:64] Shutting down replay server

Missing pip install dm-env in Quickstart?

Hi! I love the library. Thanks for releasing this!

I'm walking through the quickstart notebook and I received the following error:

[attached screenshot of the error]

Should there be a !pip install dm-env added to the quickstart notebook?

dm_control environments and difference with dm_env

Hi,

  1. Where can I find a list of all available environments (i.e. task names and domain names)? Say, in the following example, what are the possible domain names and task names (see the snippet at the end of this issue):
    from dm_control import suite
    environment = suite.load(domain_name, task_name)
  2. What is the difference between dm_env and dm_control?

Thanks.
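
For the first question, the suite itself exposes its registry, so every (domain_name, task_name) pair can be enumerated programmatically; a minimal sketch:

from dm_control import suite

# All registered (domain_name, task_name) pairs.
for domain_name, task_name in suite.ALL_TASKS:
  print(domain_name, task_name)

# suite.BENCHMARKING is the commonly used benchmark subset.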
