
Introduction

TF-Agents: A reliable, scalable, and easy-to-use TensorFlow library for Contextual Bandits and Reinforcement Learning.


TF-Agents makes implementing, deploying, and testing new Bandits and RL algorithms easier. It provides well tested and modular components that can be modified and extended. It enables fast code iteration, with good test integration and benchmarking.

To get started, we recommend checking out one of our Colab tutorials. If you need an intro to RL (or a quick recap), start here. Otherwise, check out our DQN tutorial to get an agent up and running in the Cartpole environment. API documentation for the current stable release is on tensorflow.org.

TF-Agents is under active development and interfaces may change at any time. Feedback and comments are welcome.

Table of contents

Agents
Tutorials
Multi-Armed Bandits
Examples
Installation
Contributing
Releases
Principles
Contributors
Citation
Disclaimer

Agents

In TF-Agents, the core elements of RL algorithms are implemented as Agents. An Agent has two main responsibilities: defining a Policy with which to interact with the Environment, and learning/training that Policy from collected experience.
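
As a quick illustration of that split, below is a minimal sketch of constructing a DQN agent on CartPole (a sketch only; it assumes Gym is installed, and the hyperparameters are illustrative):

import tensorflow as tf
from tf_agents.agents.dqn import dqn_agent
from tf_agents.environments import suite_gym, tf_py_environment
from tf_agents.networks import q_network

# The environment the policy will interact with.
env = tf_py_environment.TFPyEnvironment(suite_gym.load('CartPole-v0'))

# The network backing the policy.
q_net = q_network.QNetwork(
    env.observation_spec(), env.action_spec(), fc_layer_params=(100,))

# The agent bundles the policy with the logic for training it.
agent = dqn_agent.DqnAgent(
    env.time_step_spec(),
    env.action_spec(),
    q_network=q_net,
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3))
agent.initialize()

# agent.policy             -> greedy policy for evaluation/deployment
# agent.collect_policy     -> exploratory policy for data collection
# agent.train(experience)  -> one update step from a batch of collected Trajectories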

Currently the following algorithms are available under TF-Agents: DQN, DDQN, C51 (categorical DQN), DDPG, TD3, REINFORCE, PPO, and SAC, each with its own module under tf_agents/agents/.

Tutorials

See docs/tutorials/ for tutorials on the major components provided.

Multi-Armed Bandits

The TF-Agents library contains a comprehensive Multi-Armed Bandits suite, including bandit environments and agents. RL agents can also be used on bandit environments. There is a tutorial in bandits_tutorial.ipynb, and ready-to-run examples in tf_agents/bandits/agents/examples/v2.
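
For a flavor of the bandits API, here is a minimal sketch along the lines of the tutorial (the specs are illustrative; see bandits_tutorial.ipynb for a complete, runnable walkthrough):

import tensorflow as tf
from tf_agents.bandits.agents import lin_ucb_agent
from tf_agents.specs import tensor_spec
from tf_agents.trajectories import time_step as ts

# Illustrative problem: 2-dimensional contexts, 3 arms.
observation_spec = tensor_spec.TensorSpec([2], tf.float32)
action_spec = tensor_spec.BoundedTensorSpec(
    dtype=tf.int32, shape=(), minimum=0, maximum=2)

# LinUCB is one of the bandit agents shipped with the library.
agent = lin_ucb_agent.LinearUCBAgent(
    time_step_spec=ts.time_step_spec(observation_spec),
    action_spec=action_spec)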

Examples

End-to-end examples that train agents can be found under each agent's directory, e.g. tf_agents/agents/dqn/examples/.
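
For instance, the DQN CartPole example can typically be launched like this (the exact script path may differ between releases):

$ python tf_agents/agents/dqn/examples/v2/train_eval.py --root_dir=$HOME/tmp/dqn/gym/CartPole-v0/ --alsologtostderr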

Installation

TF-Agents publishes nightly and stable builds. For a list of releases read the Releases section. The commands below cover installing TF-Agents stable and nightly from pypi.org as well as from a GitHub clone.

⚠️ If you use Reverb (the replay buffer backend), which is very common, TF-Agents only works on Linux.

Note: Python 3.11 requires pygame 2.1.3+.

Stable

Run the commands below to install the most recent stable release. API documentation for the release is on tensorflow.org.

$ pip install --user tf-agents[reverb]

# Use keras-2
$ export TF_USE_LEGACY_KERAS=1
# Use this tag to get the matching examples and colabs.
$ git clone https://github.com/tensorflow/agents.git
$ cd agents
$ git checkout v0.18.0

If you want to install TF-Agents with versions of TensorFlow or Reverb that are flagged as incompatible by the pip dependency check, use the following pattern at your own risk.

$ pip install --user tensorflow
$ pip install --user tf-keras
$ pip install --user dm-reverb
$ pip install --user tf-agents

If you want to use TF-Agents with TensorFlow 1.15 or 2.0, install version 0.3.0:

# Newer versions of tensorflow-probability require newer versions of TensorFlow.
$ pip install tensorflow-probability==0.8.0
$ pip install tf-agents==0.3.0

Nightly

Nightly builds include newer features, but may be less stable than the versioned releases. The nightly build is pushed as tf-agents-nightly. We suggest installing nightly versions of TensorFlow (tf-nightly) and TensorFlow Probability (tfp-nightly), as those are the versions TF-Agents nightly is tested against.

To install the nightly build version, run the following:

# Use keras-2
$ export TF_USE_LEGACY_KERAS=1

# `--force-reinstall` helps guarantee the right versions.
$ pip install --user --force-reinstall tf-nightly
$ pip install --user --force-reinstall tf-keras-nightly
$ pip install --user --force-reinstall tfp-nightly
$ pip install --user --force-reinstall dm-reverb-nightly

# Installing with the `--upgrade` flag ensures you'll get the latest version.
$ pip install --user --upgrade tf-agents-nightly

From GitHub

After cloning the repository, the dependencies can be installed by running pip install -e .[tests]. TensorFlow needs to be installed independently: pip install --user tf-nightly.

Contributing

We're eager to collaborate with you! See CONTRIBUTING.md for a guide on how to contribute. This project adheres to TensorFlow's code of conduct. By participating, you are expected to uphold this code.

Releases

TF-Agents has stable and nightly releases. The nightly releases are often fine, but they can have issues due to upstream libraries being in flux. The table below lists the TensorFlow version(s) that align with each TF-Agents release. Release versions of interest:

  • 0.19.0 supports tensorflow-2.15.0.
  • 0.18.0 dropped Python 3.8 support.
  • 0.16.0 is the first version to support Python 3.11.
  • 0.15.0 is the last release compatible with Python 3.7.
  • If using numpy < 1.19, then use TF-Agents 0.15.0 or earlier.
  • 0.9.0 is the last release compatible with Python 3.6.
  • 0.3.0 is the last release compatible with Python 2.x.
Release   Branch / Tag   TensorFlow Version   dm-reverb Version
Nightly   master         tf-nightly           dm-reverb-nightly
0.19.0    v0.19.0        2.15.0               0.14.0
0.18.0    v0.18.0        2.14.0               0.13.0
0.17.0    v0.17.0        2.13.0               0.12.0
0.16.0    v0.16.0        2.12.0               0.11.0
0.15.0    v0.15.0        2.11.0               0.10.0
0.14.0    v0.14.0        2.10.0               0.9.0
0.13.0    v0.13.0        2.9.0                0.8.0
0.12.0    v0.12.0        2.8.0                0.7.0
0.11.0    v0.11.0        2.7.0                0.6.0
0.10.0    v0.10.0        2.6.0
0.9.0     v0.9.0         2.6.0
0.8.0     v0.8.0         2.5.0
0.7.1     v0.7.1         2.4.0
0.6.0     v0.6.0         2.3.0
0.5.0     v0.5.0         2.2.0
0.4.0     v0.4.0         2.1.0
0.3.0     v0.3.0         1.15.0 and 2.0.0

Principles

This project adheres to Google's AI Principles. By participating in, using, or contributing to this project, you are expected to adhere to these principles.

Contributors

We would like to recognize the following individuals for their code contributions, discussions, and other work that helped make the TF-Agents library what it is.

  • James Davidson
  • Ethan Holly
  • Toby Boyd
  • Summer Yue
  • Robert Ormandi
  • Kuang-Huei Lee
  • Alexa Greenberg
  • Amir Yazdanbakhsh
  • Yao Lu
  • Gaurav Jain
  • Christof Angermueller
  • Mark Daoust
  • Adam Wood

Citation

If you use this code, please cite it as:

@misc{TFAgents,
  title = {{TF-Agents}: A library for Reinforcement Learning in TensorFlow},
  author = {Sergio Guadarrama and Anoop Korattikara and Oscar Ramirez and
     Pablo Castro and Ethan Holly and Sam Fishman and Ke Wang and
     Ekaterina Gonina and Neal Wu and Efi Kokiopoulou and Luciano Sbaiz and
     Jamie Smith and Gábor Bartók and Jesse Berent and Chris Harris and
     Vincent Vanhoucke and Eugene Brevdo},
  howpublished = {\url{https://github.com/tensorflow/agents}},
  url = "https://github.com/tensorflow/agents",
  year = 2018,
  note = "[Online; accessed 25-June-2019]"
}

Disclaimer

This is not an official Google product.


Issues

MaskedCategorical distribution example?

I have an existing RL agent that I'd like to re-express with tf_agents, so I can test out a number of different algorithms that I'd struggle to implement myself. Thanks for this work! 👏

My environment has an action space where not all actions are valid at all timesteps. It looks like the MaskedCategorical class is capable of supporting this use case, but there are no examples of its use within the larger library.

I'm currently working with the sac/examples/v2/train_eval.py script and a custom environment. The training runs as expected, but I (unsurprisingly) receive invalid actions.

Can someone point me in the right direction?

ParallelPyEnvironment is slow to start

ParallelPyEnvironment starts all environments in different processes, but it does so sequentially.

# ParallelPyEnvironment.start(): every worker environment is started one after another.
  def start(self):
    logging.info('Starting all processes.')
    for env in self._envs:
      env.start()
    logging.info('All processes started.')

# ProcessPyEnvironment.start(): blocks until the worker reports that it is ready.
  def start(self):
    ....
    self._process.start()
    result = self._conn.recv()

# Worker process: the environment is constructed before READY is sent back.
      env = env_constructor()
      action_spec = env.action_spec()
      conn.send(self._READY)  # Ready.

If env_constructor takes, for example, 10 seconds to finish, then for 64 agents I have to wait 640 seconds. It makes it hard to experiment with heavy environments.

It would be better if we first spawned all start processes, and then waited for all of them to finish.
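
A two-phase start along those lines might look like the sketch below (wait_to_start and wait_start are hypothetical names used for illustration, not the actual TF-Agents API):

import logging

def start_all(envs):
  """Spawn every worker first, then wait for all of them to report ready."""
  logging.info('Spawning all processes.')
  for env in envs:
    env.start(wait_to_start=False)   # hypothetical: spawn without blocking
  for env in envs:
    env.wait_start()                 # hypothetical: block until READY is received
  logging.info('All processes started.')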

Using RNN Q Networks

In order to use RNN Q-networks, I just swapped one in for the QNetwork in the DQN colab notebook.
Here is the link to the colab document.
Am I doing something wrong?
Here is the error message that I get:

TypeError                                 Traceback (most recent call last)
/usr/local/lib/python3.6/dist-packages/tensorflow/python/util/nest.py in assert_same_structure(nest1, nest2, check_types, expand_composites)
    287     _pywrap_tensorflow.AssertSameStructure(nest1, nest2, check_types,
--> 288                                            expand_composites)
    289   except (ValueError, TypeError) as e:

TypeError: The two structures don't have the same nested structure.

First structure: type=tuple str=()

Second structure: type=_ListWrapper str=ListWrapper([TensorSpec(shape=(40,), dtype=tf.float32, name='network_state_spec'), TensorSpec(shape=(40,), dtype=tf.float32, name='network_state_spec')])

More specifically: The two namedtuples don't have the same sequence type. First structure type=tuple str=() has type tuple, while second structure type=_ListWrapper str=ListWrapper([TensorSpec(shape=(40,), dtype=tf.float32, name='network_state_spec'), TensorSpec(shape=(40,), dtype=tf.float32, name='network_state_spec')]) has type _ListWrapper

During handling of the above exception, another exception occurred:

TypeError                                 Traceback (most recent call last)
<ipython-input-5-0f243dffb929> in <module>()
----> 1 get_ipython().run_cell_magic('time', '', "\n# (Optional) Optimize by wrapping some of the code in a graph using TF function.\ntf_agent.train = common.function(tf_agent.train)\n\n# Reset the train step\ntf_agent.train_step_counter.assign(0)\n\n# Evaluate the agent's policy once before training.\navg_return = compute_avg_return(eval_env, tf_agent.policy, num_eval_episodes)\nreturns = [avg_return]\n\nfor _ in range(num_iterations):\n\n  # Collect a few steps using collect_policy and save to the replay buffer.\n  for _ in range(collect_steps_per_iteration):\n    collect_step(train_env, tf_agent.collect_policy)\n\n  # Sample a batch of data from the buffer and update the agent's network.\n  experience, unused_info = next(iterator)\n  train_loss = tf_agent.train(experience)\n\n  step = tf_agent.train_step_counter.numpy()\n\n  if step % log_interval == 0:\n    print('step = {0}: loss = {1}'.format(step, train_loss.loss))\n\n  if step % eval_interval == 0:\n    avg_return = compute_avg_return(eval_env, tf_agent.policy, num_eval_episodes)\n    print('step = {0}: Average Return = {1}'.format(step, avg_return))\n    returns.append(avg_return)")

/usr/local/lib/python3.6/dist-packages/IPython/core/interactiveshell.py in run_cell_magic(self, magic_name, line, cell)
   2115             magic_arg_s = self.var_expand(line, stack_depth)
   2116             with self.builtin_trap:
-> 2117                 result = fn(magic_arg_s, cell)
   2118             return result
   2119 

</usr/local/lib/python3.6/dist-packages/decorator.py:decorator-gen-60> in time(self, line, cell, local_ns)

/usr/local/lib/python3.6/dist-packages/IPython/core/magic.py in <lambda>(f, *a, **k)
    186     # but it's overkill for just that one bit of state.
    187     def magic_deco(arg):
--> 188         call = lambda f, *a, **k: f(*a, **k)
    189 
    190         if callable(arg):

/usr/local/lib/python3.6/dist-packages/IPython/core/magics/execution.py in time(self, line, cell, local_ns)
   1191         else:
   1192             st = clock2()
-> 1193             exec(code, glob, local_ns)
   1194             end = clock2()
   1195             out = None

<timed exec> in <module>()

<ipython-input-4-ec088ef9d1d3> in compute_avg_return(environment, policy, num_episodes)
    107 
    108     while not time_step.is_last():
--> 109       action_step = policy.action(time_step)
    110       time_step = environment.step(action_step.action)
    111       episode_return += time_step.reward

/usr/local/lib/python3.6/dist-packages/tf_agents/policies/tf_policy.py in action(self, time_step, policy_state, seed)
    179     """
    180     tf.nest.assert_same_structure(time_step, self._time_step_spec)
--> 181     tf.nest.assert_same_structure(policy_state, self._policy_state_spec)
    182     with tf.control_dependencies(tf.nest.flatten([time_step, policy_state])):
    183       # TODO(ebrevdo,sfishman): Perhaps generate a seed stream here and pass

/usr/local/lib/python3.6/dist-packages/tensorflow/python/util/nest.py in assert_same_structure(nest1, nest2, check_types, expand_composites)
    293                   "Entire first structure:\n%s\n"
    294                   "Entire second structure:\n%s"
--> 295                   % (str(e), str1, str2))
    296 
    297 

TypeError: The two structures don't have the same nested structure.

First structure: type=tuple str=()

Second structure: type=_ListWrapper str=ListWrapper([TensorSpec(shape=(40,), dtype=tf.float32, name='network_state_spec'), TensorSpec(shape=(40,), dtype=tf.float32, name='network_state_spec')])

More specifically: The two namedtuples don't have the same sequence type. First structure type=tuple str=() has type tuple, while second structure type=_ListWrapper str=ListWrapper([TensorSpec(shape=(40,), dtype=tf.float32, name='network_state_spec'), TensorSpec(shape=(40,), dtype=tf.float32, name='network_state_spec')]) has type _ListWrapper
Entire first structure:
()
Entire second structure:
ListWrapper([., .])
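
The mismatch above (an empty () policy state versus the RNN state spec) usually means the evaluation loop is not passing a policy state. With recurrent networks, the loop typically has to create and thread the state through explicitly, along these lines (a sketch mirroring the colab's compute_avg_return, not the exact notebook code):

def compute_avg_return(environment, policy, num_episodes=10):
  total_return = 0.0
  for _ in range(num_episodes):
    time_step = environment.reset()
    # Stateful (RNN) policies need an initial state matching their state spec.
    policy_state = policy.get_initial_state(environment.batch_size)
    episode_return = 0.0
    while not time_step.is_last():
      action_step = policy.action(time_step, policy_state)
      policy_state = action_step.state          # carry the RNN state forward
      time_step = environment.step(action_step.action)
      episode_return += time_step.reward
    total_return += episode_return
  return total_return / num_episodes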


Invalid actions in a custom ActorDistributionNetwork for a DdpgAgent

Hello team, I will try to generalize my question as much as I can, so maybe it could be useful also for other users. The simplified version of my custom actor network looks like this:

@gin.configurable
class Actor(DistributionNetwork):

  def __init__(self, spec, name='ActorNetwork'):
    super(Actor, self).__init__(
        input_tensor_spec=spec.input_tensor_spec,
        state_spec=spec.state_spec,
        output_spec=spec.output_spec,
        name=name)

    self._forward.add(layers.Dense(50))
    self._forward.add(layers.Dense(10))

  def call(self, observation, step_type, network_state=None):
    logits = self._forward(observation)
    return logits, network_state

ENVIRONMENT DEFINITION:
The environment has a continuous action space, and at each timestep my state is represented by N features + 1 binary feature; let's call this last one "transition". By definition the transition feature can only be 0 or 1, and every time we reset the environment the sequence always looks like this:
0101010101010101010101 .....

At each timestep only 5 actions are allowed for the agent,
let's say:
transition = 0 ---> valid actions are logits[:5]
transition = 1 ---> valid actions are logits[5:]

In my previous code in TF 1.x I was handling the illegal actions
simply by filtering them in this way:

def valid_actions(logits, transition):
  return tf.cond(tf.equal(transition, 0),
                 lambda: tf.slice(logits, [0], [5]),
                 lambda: tf.slice(logits, [5], [5]))

so the input for my critic network _action_layers is:
a = valid_actions(logits, transition)

Now my question:
is it correct to implement something like the valid_actions function above
directly at the end of the self._forward() pass?
Or does tf_agents provide some method to handle this specific case, and also environments with a more complicated structure of valid actions?

Training PPO in non-episodic environments

After reviewing the PPOAgent code, I found the following piece of code:

    valid_mask = ppo_utils.make_timestep_mask(next_time_steps)

    if weights is None:
      weights = valid_mask
    else:
      weights *= valid_mask

....

    value_preds, unused_policy_state = self._collect_policy.apply_value_network(
        time_steps.observation, time_steps.step_type, policy_state=policy_state)
    value_estimation_error = tf.math.squared_difference(returns, value_preds)
    value_estimation_error *= weights

This means that if an agent runs in a non-episodic environment (the environment never resets and can be run infinitely), the agent will never learn anything, because the weights will always be zero.

The agent can get some reward, but because the weights are zero, training will not happen.

This also makes it harder to train the agent with DynamicStepDriver. Only the last steps before a reset will be used for training.

@abc.abstractmethod

I have never seen this syntax before. I was reading the TensorFlow Environments code and came across @abc.abstractmethod in the following context:

  @abc.abstractmethod
  def _reset(self):
    """Returns the current `TimeStep` after resetting the Environment."""

  @abc.abstractmethod
  def _current_time_step(self):
    """Returns the current `TimeStep`."""

What is the purpose and meaning of this syntax? Thanks!
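
For reference, abc.abstractmethod is standard-library Python: it marks methods that concrete subclasses of an abstract base class must override, and instantiating a class that has not overridden them raises a TypeError. A minimal, self-contained example (unrelated to TF-Agents):

import abc

class Environment(abc.ABC):

  @abc.abstractmethod
  def _reset(self):
    """Subclasses must implement this."""

class MyEnv(Environment):

  def _reset(self):
    return 'a fresh TimeStep'

# Environment() would raise TypeError; MyEnv() works because _reset is overridden.
env = MyEnv()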

What is the recommended way to `render` a TensorFlow Agents TF-Agent?

I tried to post this question at StackOverflow, but I lack the reputation to create a 'tensorflow-agents' tag. So...

OpenAI Gym environments carry a .render() method that is directly accessible in the TF-Agents Python environment created by gym_wrapper.GymWrapper.

However, when training with an agent in the TensorFlow environment created by tf_py_environment.TFPyEnvironment, calling .render() throws a not-implemented exception.

If you dig a bit, you find the environment underneath the TensorFlow environment is a batched Python env, and you can cheat your way to the Gym environment at the bottom with something like:

tf_env._env.envs[-1]._env.render()

Where the -1 index is showing the position in the batch. However, no matter what index I provide, the render never updates.

What is the recommended way to get a TF-Agents TensorFlow environment to render?

Thanks for any thoughts!

API documentation

It would be a great help if the APIs and their parameters could be hosted on a site. Currently, we need to go to the specific Python files to understand the definitions.

Sample action outside the valid range

Hi,

NormalProjectionNetwork only normalizes the mean of the tfp.distributions.Normal to be within [action_spec.minimum, action_spec.maximum].

means = self._mean_transform(means, self._sample_spec)

However, since the standard deviation is non-zero, it's possible for the actor to sample an action that is outside the valid range. For example, say my action spec is

BoundedTensorSpec(shape=(2,), dtype=tf.float32, name=None, minimum=array([-0.02, -0.02], dtype=float32), maximum=array([0.02, 0.02], dtype=float32))

When I trained PPO, it used the default init_action_stddev=0.35

def _normal_projection_net(action_spec,
                           init_action_stddev=0.35,
                           init_means_output_factor=0.1):
  std_initializer_value = np.log(np.exp(init_action_stddev) - 1)
  return normal_projection_network.NormalProjectionNetwork(
      action_spec,
      init_means_output_factor=init_means_output_factor,
      std_initializer_value=std_initializer_value)

Therefore, most of the sampled actions are outside the valid [-0.02, 0.02] range.

Here are a few solutions that I could think of:

  1. Change from tfp.distributions.Normal to tfp.distributions.TruncatedNormal.
    Change
    return self.output_spec.build_distribution(loc=means, scale=stds)

    to
    return self.output_spec.build_distribution(loc=means,
                                               scale=stds,
                                               low=tf.constant(self._sample_spec.minimum),
                                               high=tf.constant(self._sample_spec.maximum))

However, I got the following error. I spent some time debugging but no luck so far.

Traceback (most recent call last):
  File "/cvgl2/u/chengshu/anaconda3/envs/py35/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 1819, in _create_c_op
    c_op = c_api.TF_FinishOperation(op_desc)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Shapes must be equal rank, but are 1 and 2 for 'tf_uniform_replay_buffer/TFUniformReplayBuffer_1/ResourceScatterUpdate_6' (op: 'ResourceScatterUpdate') with input shapes: [], [1], [3].

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "train_eval_p2p_nav_ppo.py", line 397, in <module>
    tf.app.run()
  File "/cvgl2/u/chengshu/anaconda3/envs/py35/lib/python3.5/site-packages/tensorflow/python/platform/app.py", line 40, in run
    _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
  File "/cvgl2/u/chengshu/anaconda3/envs/py35/lib/python3.5/site-packages/absl/app.py", line 300, in run
    _run_main(main, args)
  File "/cvgl2/u/chengshu/anaconda3/envs/py35/lib/python3.5/site-packages/absl/app.py", line 251, in _run_main
    sys.exit(main(argv))
  File "train_eval_p2p_nav_ppo.py", line 392, in main
    use_rnns=FLAGS.use_rnns)
  File "train_eval_p2p_nav_ppo.py", line 258, in train_eval
    num_episodes=collect_episodes_per_iteration).run()
  File "/cvgl2/u/chengshu/anaconda3/envs/py35/lib/python3.5/site-packages/tf_agents/drivers/dynamic_episode_driver.py", line 159, in run
    name='driver_loop'
  File "/cvgl2/u/chengshu/anaconda3/envs/py35/lib/python3.5/site-packages/tensorflow/python/ops/control_flow_ops.py", line 3477, in while_loop
    return_same_structure)
  File "/cvgl2/u/chengshu/anaconda3/envs/py35/lib/python3.5/site-packages/tensorflow/python/ops/control_flow_ops.py", line 2998, in BuildLoop
    pred, body, original_loop_vars, loop_vars, shape_invariants)
  File "/cvgl2/u/chengshu/anaconda3/envs/py35/lib/python3.5/site-packages/tensorflow/python/ops/control_flow_ops.py", line 2923, in _BuildLoop
    body_result = body(*packed_vars_for_body)
  File "/cvgl2/u/chengshu/anaconda3/envs/py35/lib/python3.5/site-packages/tf_agents/drivers/dynamic_episode_driver.py", line 98, in loop_body
    observer_ops = [observer(traj) for observer in self._observers]
  File "/cvgl2/u/chengshu/anaconda3/envs/py35/lib/python3.5/site-packages/tf_agents/drivers/dynamic_episode_driver.py", line 98, in <listcomp>
    observer_ops = [observer(traj) for observer in self._observers]
  File "/cvgl2/u/chengshu/anaconda3/envs/py35/lib/python3.5/site-packages/tensorflow/python/module/module.py", line 120, in enter_name_scope
    return unbound_method(self, *args, **kwargs)
  File "/cvgl2/u/chengshu/anaconda3/envs/py35/lib/python3.5/site-packages/tf_agents/replay_buffers/replay_buffer.py", line 65, in add_batch
    return self._add_batch(items)
  File "/cvgl2/u/chengshu/anaconda3/envs/py35/lib/python3.5/site-packages/tensorflow/python/module/module.py", line 120, in enter_name_scope
    return unbound_method(self, *args, **kwargs)
  File "/cvgl2/u/chengshu/anaconda3/envs/py35/lib/python3.5/site-packages/tf_agents/replay_buffers/tf_uniform_replay_buffer.py", line 136, in _add_batch
    write_data_op = self._data_table.write(write_rows, items)
  File "/cvgl2/u/chengshu/anaconda3/envs/py35/lib/python3.5/site-packages/tf_agents/replay_buffers/table.py", line 132, in write
    for (slot, value) in zip(flattened_slots, flattened_values)
  File "/cvgl2/u/chengshu/anaconda3/envs/py35/lib/python3.5/site-packages/tf_agents/replay_buffers/table.py", line 132, in <listcomp>
    for (slot, value) in zip(flattened_slots, flattened_values)
  File "/cvgl2/u/chengshu/anaconda3/envs/py35/lib/python3.5/site-packages/tensorflow/python/ops/state_ops.py", line 302, in scatter_update
    name=name))
  File "/cvgl2/u/chengshu/anaconda3/envs/py35/lib/python3.5/site-packages/tensorflow/python/ops/gen_resource_variable_ops.py", line 1271, in resource_scatter_update
    updates=updates, name=name)
  File "/cvgl2/u/chengshu/anaconda3/envs/py35/lib/python3.5/site-packages/tensorflow/python/framework/op_def_library.py", line 800, in _apply_op_helper
    op_def=op_def)
  File "/cvgl2/u/chengshu/anaconda3/envs/py35/lib/python3.5/site-packages/tensorflow/python/util/deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "/cvgl2/u/chengshu/anaconda3/envs/py35/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 3473, in create_op
    op_def=op_def)
  File "/cvgl2/u/chengshu/anaconda3/envs/py35/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 1983, in __init__
    control_input_ops)
  File "/cvgl2/u/chengshu/anaconda3/envs/py35/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 1822, in _create_c_op
    raise ValueError(str(e))
ValueError: Shapes must be equal rank, but are 1 and 2 for 'tf_uniform_replay_buffer/TFUniformReplayBuffer_1/ResourceScatterUpdate_6' (op: 'ResourceScatterUpdate') with input shapes: [], [1], [3].
  2. Change init_action_stddev to a smaller number, e.g. 0.001.
  3. Clip the action within the step function of my environment (see the sketch below).
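
For illustration, option 3 could be as simple as clamping inside the environment's step function (a sketch of hypothetical environment code, not a TF-Agents API):

import numpy as np

def clip_to_spec(action, action_spec):
  # Clamp a sampled action into the [minimum, maximum] range of a bounded spec.
  return np.clip(action, action_spec.minimum, action_spec.maximum)

# Inside a custom PyEnvironment's _step:
#   action = clip_to_spec(action, self._action_spec)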

Which solution do you think is the best to solve this issue?

Thanks a ton. Any help will be greatly appreciated.

PyEnvironmentBaseWrapper doesn't forward batch_size and batched calls.

Hi,

We found what seems to be a bug in ActionDiscretizeWrapper. The following piece of code won't raise any exceptions even though we believe it should:

class TestBug(tf.test.TestCase):

    def testBugWrapper(self):
        obs_spec = array_spec.BoundedArraySpec((2, 3), np.int32, -10, 10)
        action_spec = array_spec.BoundedArraySpec((1,), np.int32, -10, 10)
        nested_env = random_py_environment.RandomPyEnvironment(
            obs_spec,
            action_spec,
            reward_fn=lambda *_: np.array([1.0]),
            batch_size=2)
        env = wrappers.ActionDiscretizeWrapper(nested_env, num_actions=100)
        self.assertNotEqual(env.batched, nested_env.batched)  # Doesn't raise, but it should
        self.assertNotEqual(nested_env.batch_size, env.batch_size)  # Doesn't raise, but it should

The issue comes from:

class PyEnvironmentBaseWrapper(py_environment.PyEnvironment):
  """PyEnvironment wrapper forwards calls to the given environment."""

  def __getattr__(self, name):
    """Forward all other calls to the base environment."""
    return getattr(self._env, name)

which doesn't forward the batched and batch_size properties to the wrapped environment as they are defined in the parent class.

Adding explicit property overrides would fix the issue:

  @property
  def batched(self):
    return self._env.batched

  @property
  def batch_size(self):
    return self._env.batch_size

Getting import error in TF-Agents Policies Tutorial.

When running the Policies tutorial colab, executing this cell:

import abc
import tensorflow as tf
import tensorflow_probability as tfp
import numpy as np

from tf_agents.specs import array_spec
from tf_agents.specs import tensor_spec
from tf_agents.environments import time_step as ts
from tf_agents.networks import network

from tf_agents.policies import py_policy
from tf_agents.policies import random_py_policy
from tf_agents.policies import scripted_py_policy

from tf_agents.policies import tf_policy
from tf_agents.policies import random_tf_policy
from tf_agents.policies import actor_policy
from tf_agents.policies import q_policy
from tf_agents.policies import greedy_policy

tf.compat.v1.enable_v2_behavior()

outputs this:

AttributeErrorTraceback (most recent call last)
<ipython-input-6-d447654fa582> in <module>()
      9 from tf_agents.networks import network
     10 
---> 11 from tf_agents.policies import py_policy
     12 from tf_agents.policies import random_py_policy
     13 from tf_agents.policies import scripted_py_policy

/usr/local/lib/python2.7/dist-packages/tf_agents/policies/py_policy.py in <module>()
     28 from tf_agents.environments import trajectory
     29 from tf_agents.policies import policy_step
---> 30 from tf_agents.utils import common
     31 
     32 

/usr/local/lib/python2.7/dist-packages/tf_agents/utils/common.py in <module>()
    289 
    290 
--> 291 class Periodically(tf.Module):
    292   """Periodically performs the ops defined in `body`."""
    293 

AttributeError: 'module' object has no attribute 'Module'

How to load trained model or resume training?

Hi,

I understand this issue has been raised here (#4), but I think I still have trouble loading a trained model or resuming training.

For instance, I ran train_eval_gym.py until convergence. Then I killed the processes and reran train_eval_gym.py with the same --root_dir to resume training. However, based on the TensorBoard plots below, I don't think the checkpoints are properly loaded.

Train:
image

Eval:
image

Disclaimer: I am using an older fork of your repo (beedf60). If this issue has already been resolved in the latest version, please let me know. Thanks a lot!

Issue test SAC

I have been trying to use tf-agents and tf-nightly for a few days now without any luck. Right now I am trying to implement a SAC model, but the test continues to fail on my machine. The error I am receiving is:

ValueError: A configurable matching 'tf_agents.policies.actor_policy.ActorPolicy' already exists.

Above the error is the following line:

File "/anaconda3/lib/python3.7/site-packages/gin/config.py", line 891, in _make_configurable raise ValueError(err_str.format(selector))

Generally, I am not sure what is causing the error. I would sincerely appreciate any help or advice. My ultimate goal is to implement a SAC in the OpenAI CarRacing environment. I was not able to find a way to do that with OpenAI's code and I figured agents may provide a solution to my issue.

Observation summary may cause OOM

In some agents, for instance PPOAgent, when debug_summaries is enabled, the summaries contain observations as well. This sometimes causes an OOM error, especially when the observation dimension is large (e.g. an image).
My suggestion is to add an additional flag, summarize_observation, to handle this case.
Below is the error message I got:

tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[3145728,30] 
and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator 
GPU_0_bfc [Op:OneHot] name: epoch_9/observations/buckets/cond/cond/one_hot/
  In call to configurable 'train_eval' (<function train_eval at 0x7fe756fdf840>)

Should we check for termination in collect step in the dqn_tutorial colab?

I am looking into this colab document.

I was wondering if we should check for termination in the Data Collection section:

def collect_step(environment, policy):
  time_step = environment.current_time_step()
  action_step = policy.action(time_step)
  next_time_step = environment.step(action_step.action)
  traj = trajectory.from_transition(time_step, action_step, next_time_step)

  # Add trajectory to the replay buffer
  replay_buffer.add_batch(traj)


for _ in range(initial_collect_steps):
  collect_step(train_env, random_policy)

I guess we need to check if the environment is terminated here. Something like this:

def collect_step(environment, policy):
  time_step = environment.current_time_step()
  action_step = policy.action(time_step)
  next_time_step = environment.step(action_step.action)
  traj = trajectory.from_transition(time_step, action_step, next_time_step)
  if next_time_step.is_last():
    environment.reset()

  # Add trajectory to the replay buffer
  replay_buffer.add_batch(traj)


for _ in range(initial_collect_steps):
  collect_step(train_env, random_policy)

Intrinsic rewards

Thank you for such amazing work; it is interesting to see what TF-Agents has to offer!

It would be great if we had the ability to encourage exploration with intrinsic motivators.
I would like to see Exploration by Random Network Distillation (https://arxiv.org/abs/1810.12894) as an example of such a motivator.

It would be interesting to see your ideas on how one could implement it. It should be a somewhat general mechanism, not tied to any specific policy implementation.

The original implementation of RND uses PPO with the following modifications of the policy:

  1. An intrinsic reward is added. It is computed as the distance between predicted features and target features. Target features are computed by a randomly initialized, fixed neural network applied to the next observation after an action. The predictor is a neural network that outputs a tensor of the same length as the target network's output. After each sampling episode, a fraction of the experience is used to update the predictor network's weights.
  2. There are two critics, not one. The first one predicts the extrinsic value function, the second predicts the intrinsic value function. Advantages are computed separately using both models.
  3. The advantage in PPO should be a weighted combination of the extrinsic and intrinsic critics' predictions.
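
For concreteness, the intrinsic reward itself might be sketched like this (illustrative Keras-only code, independent of any particular agent or policy):

import tensorflow as tf

# Randomly initialized, frozen target network and a trainable predictor network.
target_net = tf.keras.Sequential([tf.keras.layers.Dense(64), tf.keras.layers.Dense(32)])
target_net.trainable = False
predictor_net = tf.keras.Sequential([tf.keras.layers.Dense(64), tf.keras.layers.Dense(32)])

def intrinsic_reward(next_observation):
  target_features = target_net(next_observation)
  predicted_features = predictor_net(next_observation)
  # Per-example squared prediction error serves as the intrinsic reward.
  return tf.reduce_mean(tf.square(predicted_features - target_features), axis=-1)

# Example: a batch of 4 next observations with 8 features each.
rewards = intrinsic_reward(tf.random.normal([4, 8]))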

For Deep Q-learning, it should also be similar.

  1. We need another network that predicts the Q function for the intrinsic reward; it is used only during training.
  2. The sampling strategy should be modified to use the second Q-function. It should sample actions based on a total Q function computed as a weighted sum of the two others.

We can continue the list for other algorithms. I would like to see your ideas on how I could implement RND using TF-Agents in a way that can be reused across multiple policies. I also want to know whether you would accept such a contribution if I made it.

Potential memory issue with tf_py_environment

Hi,

First of all, my environment is the following:
Tensorflow version: 1.13.0-dev20190205 (pip install tf-nightly-gpu)
tf-agents version: 0.2.0.dev20190123 (pip install tf-agents-nightly)
CUDA version: 10.0
cuDNN version: 7.4.1
Ubuntu version: 16.04

When I wrapped my customized Python environment using tf_py_environment, it seemed to consume more and more CPU memory as time passed, until the memory ran out and the program got stuck. This problem is particularly evident if my observation is large (say an RGB image or a large vector).

Here is a toy example:

import tensorflow as tf
from tf_agents.environments import tf_py_environment, py_environment
from tf_agents.specs import array_spec
from tf_agents.environments import time_step as ts
import numpy as np

img_size = 5000

class MyEnv(py_environment.Base):
    def __init__(self):
        self._action_spec = array_spec.BoundedArraySpec(shape=(), dtype=np.float32)
        self._observation_spec = array_spec.BoundedArraySpec(shape=(img_size, img_size, 3), dtype=np.float32)

    def action_spec(self):
        return self._action_spec

    def observation_spec(self):
        return self._observation_spec

    def reset(self):
        return ts.restart(np.zeros(shape=(img_size, img_size, 3), dtype=np.float32))

    def step(self, action):
        return ts.transition(np.zeros(shape=(img_size, img_size, 3), dtype=np.float32), reward=0.0, discount=1.0)

tf_py_env = MyEnv()
tf_env = tf_py_environment.TFPyEnvironment(tf_py_env)
i = 0
while True:
    if i % 10000 == 0:
        print(i)
        tf_env.reset()
    action = tf.constant([0.0])
    time_step = tf_env.step(action)
    i += 1

After a few minutes of running, it drained almost all the memory until the program got stuck. The last print out is 850000.
image

I have also run tf_agents/agents/dqn/examples/train_eval_atari.py for a while and it has the same symptom.
image
image
The memory fluctuated between 40% and 90%, and due to time/compute limits I didn't get the chance to run it until convergence or until it crashed / got stuck.

In both cases, running the program makes my machine pretty slow. Is this expected?

I am very new to tf-agents, so I suspect I did something wrong (maybe I am supposed to free memory somewhere in my code?). I would really appreciate it if someone could point me in the right direction. Thanks!

Eric

OOM after a couple of iterations

I am running DQN on an Atari game (BeamRider-v0). I just take the input image, flatten it, and connect it to a fully connected layer with 32 neurons. It runs for 14000 iterations on a Tesla V100 GPU. After 14000 iterations, I get OOM. Is there a memory leak? I am using tf-nightly-gpu-2.0-preview. I have also tried tf-nightly-gpu and the same problem exists. My question is: why don't I get the error in the very first iterations? What causes memory usage to grow for 14000 iterations?

TF Agents & Dopamine

There seems to be a lot of overlap between the objectives of TF-Agents and the Dopamine framework. Are there any plans to share or combine these frameworks, or are they intentionally kept separate?

Dueling DDQN?

Hi,

I can see that DQN and DDQN have been implemented; any plans for the Dueling DDQN variant?
I am very excited about what TF-Agents has to offer and would love to contribute!

SAC for Car Control

Problem: How to create a simulation where a SAC algorithm controls a car to navigate an environment.

Goal: Create a simulation where a SAC agent controls a car in an environment.

Generally, I am familiar with deep reinforcement learning frameworks and autonomous car control. I wrote a paper on the topic and have read a lot about the subject matter. Some important papers: paper 1, paper 2, paper 3, paper 4, paper 5. However, I am struggling to develop the code for my models. Below is a list of steps I am taking to try to solve the problem. I would really appreciate any help or advice.

Steps:

1. Pick packages

I want to use tf.agents SAC. However, I may also need a package to create an environment.

Issues:
I am unsure how to integrate the SAC with an environment. My thought was to use OpenAI's CarRacing-v0, but there are many problems with the environment because of the way the different packages are layered, and the inconsistency of action and observation spaces is complex to resolve. Indeed, as written, the CarRacing-v0 environment does not work. See Issue 1267 and Issue 120.

I looked at other environment generators like Donkey Car and Unity, but generally I would like to work with TensorFlow because it provides the ability to create both environments and agents. Further, TensorFlow seems to be the simplest, best-tested, and most mathematically sound option. Another option would be to create a new environment from scratch with TensorFlow.

2. Pick interface

Generally, there are a lot of different interfaces I have been trying. For example, the command line, Spyder, and Idle.

Issues:
The packages are hard to find and correct. I don't know how these different interfaces interact with my machine, nor the relation between the files and the interface. So I get a lot of package-not-found and double-loading errors, which makes it difficult to move forward.

3. Write code to develop agent.

To do this I need to map out the objects: agent and environment. I also need to map out the agent's functions for perceiving its environment and making decisions. I think a 3D observation space makes the most sense, however a 2D space may also suffice. A convolutional neural network would likely be the best method of perception. Essentially, the SAC would be dueling CNNs, with the actor acting as the agent in a Markov decision process. SAC would be the method of decision making, and the math is here.

Issues:
Writing the code from scratch is difficult because I do not understand how the code and the math line up. However, I think TensorFlow will allow me to develop everything I need to solve the problem. But I want to write the code in a more simplified way. So there would need to be a critic function and an actor function working to minimize loss or maximize reward. The reward would be defined by optimal driving metrics like lane following and speed.

4. Run code in environment.

Issues:
Integrating the agent in the environment is difficult because there are a lot of dependencies and packages, which need to work in harmony to successfully execute the code.

Moving Forward
One way of figuring out how to do this may be building a dataset and asking a BERT model for help. But, I am not sure how good BERT is yet. Any advice or suggestions would be greatly appreciated. Thanks.

I also thought about making this a project rather than an issue. However, I cannot figure out how the project tab works or how to start a new project.

Multi-GPU training support?

Hi tf-agents team,

Thanks for answering all of my previous questions.

Does tf-agents support multi-GPU training? If so, is there any recommendation you could give me to achieve this (i.e., which files to consider modifying)? I found these two resources:
https://www.tensorflow.org/guide/using_gpu#using_multiple_gpus
https://github.com/tensorflow/models/blob/master/tutorials/image/cifar10/cifar10_multi_gpu_train.py

I have only done multi-GPU training in PyTorch and Keras, and I am still learning how it works in TF.

Thanks again.
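
For what it's worth, the standard TF 2.x pattern for multi-GPU training is tf.distribute.MirroredStrategy; whether a given TF-Agents version's agents and replay buffers work inside a strategy scope is version-dependent, so treat this only as a sketch of the general pattern:

import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()
print('Number of devices:', strategy.num_replicas_in_sync)

with strategy.scope():
  # Build networks, the agent, and the optimizer here so variables are mirrored.
  optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3)
  # q_net = ...; agent = dqn_agent.DqnAgent(..., optimizer=optimizer)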

Would you like to see a MultiCategorical projection network?

I see that the existing CategoricalProjectionNetwork only supports action spaces with the same number of actions along all dimensions.

So, for example, discrete actions: [3, 3, 3, 3] -- good.
Discrete actions: [3, 3, 2, 3] -- not supported.

I see in the code of CategoricalProjectionNetwork that you advise implementing a more flexible distribution myself. This is not actually that hard to support (see the sketch below).

  1. We can have a single dense layer that outputs the sum of all action counts as logits. For example, the [3, 3, 2, 3] action counts will be converted to 3 + 3 + 2 + 3 logits.
  2. When we need the output distribution, we apply the dense layer and split the logits (in the example: [n_batch, 3], [n_batch, 3], [n_batch, 2], [n_batch, 3]).
  3. Return a custom distribution that takes the split logits and creates multiple tfd.Categorical distributions internally. Implement mode() and sample() to call all internal Categoricals.
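
For concreteness, the three steps above might be sketched like this (illustrative only, not a drop-in CategoricalProjectionNetwork replacement):

import tensorflow as tf
import tensorflow_probability as tfp

action_sizes = [3, 3, 2, 3]                        # unique actions per dimension
logits_layer = tf.keras.layers.Dense(sum(action_sizes))

def multi_categorical(hidden):
  logits = logits_layer(hidden)                    # [batch, 3 + 3 + 2 + 3]
  split_logits = tf.split(logits, action_sizes, axis=-1)
  # One Categorical per action dimension; a full solution would wrap these in a
  # single joint distribution implementing mode() and sample().
  return [tfp.distributions.Categorical(logits=l) for l in split_logits]

dists = multi_categorical(tf.random.normal([4, 16]))
samples = [d.sample() for d in dists]              # one [batch]-shaped sample per dimension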

I would like to know why it was decided to restrict this to action spaces with the same number of unique actions along all dimensions.

What does the return statement do in the _step function?

I am having trouble understanding this piece of code:

def _step(self, action):
    """Apply the action and return the next time_step (reward, observation)."""
    if is_final(self.state):
        return self.reset()
    observation, reward = self._apply_action(action)
    return TimeStep(observation, reward)

So, I know the purpose of _step is to apply the action and return the next time_step, which contains a reward and an observation. (self, action) are parameters, to which arguments may be passed. Here, the agent is traveling to the next state in the environment with its previous action and self. And, if the current state is the final state in the environment, the reset() function of the environment is called. Here, the (observation, reward) pair is set equal to self._apply_action(action). However, I do not understand what self._apply_action(action) means or does. Additionally, what does the return statement do?

Performance benchmarks

Some RL libraries add performance benchmarks on Atari or MuJoCo. Some also have comparisons between typical algorithms across different implementations.
See this for an example: https://github.com/ray-project/rl-experiments

The performance benchmarks will make it more compelling for someone outside Google to use the library. If you had them, someone wouldn't need to do the benchmarks themselves. They wouldn't need to invest time learning the library before getting the performance numbers. It will make even more sense after you have distributed training (as discussed in #14). It will be hard for someone to set up distributed training infra just to make benchmarks.

Moreover, I've personally tried RLlib because of their comparison charts :-)

Also, a TF-Agents performance benchmark would make it easier for you and everyone else to profile the code and find bottlenecks.

I would have made the comparisons myself, if I had had a GPU at home. Testing on cloud would cost me money to make the comparisons. And I don't want to spend anything on this.

Inconsistency between tf_env.action_spec().is_compatible_with(action) and tf_env.step(action)

Hi,

As described in https://stackoverflow.com/q/55537069/4282745, I'm a bit confused by the following pieces of code:

The following piece of code works fine even though tf_env.action_spec().is_compatible_with(action) is False.

import tensorflow as tf
import tf_agents.environments.tf_py_environment as tf_py_environment
from tf_agents.environments.tf_py_environment_test import PYEnvironmentMock

py_env = PYEnvironmentMock()
tf_env = tf_py_environment.TFPyEnvironment(py_env)
assert tf_env.batch_size == 1
action = tf.constant(2, shape=(1,), dtype=tf.int32)
assert tf_env.action_spec().is_compatible_with(action) is False
obs = tf_env.step(action)

If I now change the shape of the action, it is indeed compatible with the spec, but calling tf_env.step(action)

action = tf.constant(2, shape=(), dtype=tf.int32)
assert tf_env.action_spec().is_compatible_with(action) # Now ok
obs = tf_env.step(action) # raises an IndexError: list index out of range

raises the IndexError below as it expects a 1-dim action array:

Traceback (most recent call last):
  File "/lib/python3.6/site-packages/IPython/core/interactiveshell.py", line 3291, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-42-f3211de6b571>", line 1, in <module>
    tf_env.step(action)
  File "/lib/python3.6/site-packages/tf_agents/environments/tf_environment.py", line 232, in step
    return self._step(action)
  File "/lib/python3.6/site-packages/tensorflow/python/autograph/impl/api.py", line 147, in graph_wrapper
    return f(*args, **kwargs)
  File "/lib/python3.6/site-packages/tf_agents/environments/tf_py_environment.py", line 209, in _step
    dim_value = tensor_shape.dimension_value(action.shape[0])
  File "/lib/python3.6/site-packages/tensorflow/python/framework/tensor_shape.py", line 837, in __getitem__
    return self._dims[key]
IndexError: list index out of range

Is there anything wrong in my code?

py_driver constructor, question

max_steps = max_steps or 0
max_episodes = max_episodes or 0
if max_steps < 1 and max_episodes < 1:
  raise ValueError(
      'Either "max_steps" or "max_episodes" should be greater than 0.')
super(PyDriver, self).__init__(env, policy, observers)
self._max_steps = max_steps or np.inf
self._max_episodes = max_episodes or np.inf

Is there a scenario where max_steps will be assigned np.inf?
I could not understand the logic behind it.

Sorry if this is too basic. :)

EncodingNetwork breaks with only one preprocessing layer.

Currently the EncodingNetwork class cannot accept a single preprocessing layer.

When preprocessing_layers are passed to the EncodingNetwork class, the outputs of the preprocessing layers are stored in a list, states.
states is passed to the preprocessing_combiner layer, which reduces states to a single tensor.
This tensor is then passed to the _postprocessing_layers.

The issue arises when passing a single preprocessing layer.
Because we only have one preprocessing layer, we do not need a preprocessing combiner. But without a preprocessing_combiner layer, states is passed to _postprocessing_layers as a list, which causes an error.

I propose adding an elif to the call function as follows:

if self._preprocessing_layers is None:
  processed = observation
elif len(self._preprocessing_layers) == 1:
  processed = self._preprocessing_layers[0](observation)
else:
  processed = []
  for obs, layer in zip(
      nest.flatten_up_to(self.input_tensor_spec, observation),
      self._preprocessing_layers):
    processed.append(layer(obs))

What is your plan about the Agent design and TF 2.0 ?

Currently, most of the algorithms available in tf_agents work in graph mode, and some can be run in eager mode as well. We are trying to run PPOAgent in eager mode because we found it easier, for example, to set the network phase (#6).
The problem is that it is not easy, and it is error-prone, to maintain graph and eager mode at the same time.
Since TF 2.0 will default to eager mode, is it necessary to maintain graph mode in tf_agents?

Layer sharing between networks

Hi,

This is more of a clarification than a bug report, I guess.

I am trying to share my feature extraction / encoder layers across different networks (e.g. ValueNetwork and ActorDistributionNetwork). Here is a minimum example:

class Encoder(network.Network):
    def __init__(self, name='Encoder'):
        fc_layers = tf.keras.Sequential([tf.keras.layers.Dense(64), tf.keras.layers.Dense(32)])
        super(Encoder, self).__init__(
            input_tensor_spec=None,
            state_spec=(),
            name=name
        )
        self._fc_layers = fc_layers

    def call(self, observation, step_type=None, network_state=()):
        del step_type  # unused.
        states = self._fc_layers(observation)
        return states, network_state

class MyNetwork(network.Network):
    def __init__(self, encoder, name='MyNetwork'):
        output_layer = tf.keras.Sequential([tf.keras.layers.Dense(1)])
        super(MyNetwork, self).__init__(
            input_tensor_spec=None,
            state_spec=(),
            name=name
        )
        self._encoder = encoder
        self._output_layer = output_layer

    def call(self, observation, step_type=None, network_state=()):
        del step_type  # unused.
        states, _ = self._encoder(observation)
        states = self._output_layer(states)
        return states, network_state


def test_layer_sharing():
    batch_size = 4
    input_dim = 16
    encoder = Encoder()
    my_network = MyNetwork(encoder=encoder)
    input = tf.zeros(shape=(batch_size, input_dim))
    output, _ = my_network(input)
    print(my_network.layers)

test_layer_sharing()

The output is the following:

[<__main__.Encoder object at 0x7f46e9cb3c88>,
 <tensorflow.python.keras.engine.sequential.Sequential object at 0x7f46e9c04668>,
 <__main__.Encoder object at 0x7f46e9cb3c88>]

I don't quite understand why there are two Encoder objects in my_network.layers. Is this intended behavior? Would this cause any problems? During execution, it also duplicates the trainable_weights of the network.

On the other hand, if I instantiate Encoder inside MyNetwork instead of passing it as an argument:

class MyNetwork(network.Network):
    def __init__(self, name='MyNetwork'):
        output_layer = tf.keras.Sequential([tf.keras.layers.Dense(1)])
        super(MyNetwork, self).__init__(
            input_tensor_spec=None,
            state_spec=(),
            name=name
        )
        self._encoder = Encoder()
        self._output_layer = output_layer

Then the output becomes:

[<tf_agents.networks.encoding_network.EncodingNetwork object at 0x7f80c5280da0>,
 <tensorflow.python.keras.layers.core.Dense object at 0x7f80c49c1208>]

Which is what I expected. However, I won't be able to share this encoder with other networks, e.g. MyFirstNetwork, MySecondNetwork, ValueNetwork, ActorNetwork, etc.

I also tried to instantiate Encoder inside MyNetwork and also pass another Encoder into it:

class MyNetwork(network.Network):
    def __init__(self, encoder, name='MyNetwork'):
        del encoder
        output_layer = tf.keras.Sequential([tf.keras.layers.Dense(1)])
        super(MyNetwork, self).__init__(
            input_tensor_spec=None,
            state_spec=(),
            name=name
        )
        self._encoder = Encoder()
        self._output_layer = output_layer

The output is:

[<__main__.Encoder object at 0x7f1bb01227f0>,
 <tensorflow.python.keras.engine.sequential.Sequential object at 0x7f1bb0122630>,
 <__main__.Encoder object at 0x7f1bb0112c50>]

It seems like by just passing the encoder into the constructor of MyNetwork without actually using it (del encoder in the very first line), Keras will nevertheless include it in its layers.

Any help would be greatly appreciated. Thanks a lot!

no attribute 'register_symbolic_tensor_type'

I'm trying to run the example: tf_agents/agents/dqn/examples/train_eval_gym.py

The error I get:

# python3 /usr/local/lib/python3.6/dist-packages/tf_agents/agents/dqn/examples/train_eval_gym.py --root_dir=$HOME/tmp/dqn/gym/cart-pole/ --alsologtostderr
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/tf_agents/agents/dqn/examples/train_eval_gym.py", line 37, in <module>
    from tf_agents.agents.dqn import dqn_agent
  File "/usr/local/lib/python3.6/dist-packages/tf_agents/agents/__init__.py", line 17, in <module>
    from tf_agents.agents.ddpg.ddpg_agent import DdpgAgent
  File "/usr/local/lib/python3.6/dist-packages/tf_agents/agents/ddpg/ddpg_agent.py", line 29, in <module>
    from tf_agents.agents import tf_agent
  File "/usr/local/lib/python3.6/dist-packages/tf_agents/agents/tf_agent.py", line 28, in <module>
    from tf_agents.environments import trajectory
  File "/usr/local/lib/python3.6/dist-packages/tf_agents/environments/trajectory.py", line 28, in <module>
    from tf_agents.environments import time_step as ts
  File "/usr/local/lib/python3.6/dist-packages/tf_agents/environments/time_step.py", line 28, in <module>
    from tf_agents.specs import array_spec
  File "/usr/local/lib/python3.6/dist-packages/tf_agents/specs/__init__.py", line 19, in <module>
    from tf_agents.specs.tensor_spec import BoundedTensorSpec
  File "/usr/local/lib/python3.6/dist-packages/tf_agents/specs/tensor_spec.py", line 24, in <module>
    import tensorflow_probability as tfp
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_probability/__init__.py", line 78, in <module>
    from tensorflow_probability.python import *  # pylint: disable=wildcard-import
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_probability/python/__init__.py", line 25, in <module>
    from tensorflow_probability.python import layers
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_probability/python/layers/__init__.py", line 30, in <module>
    from tensorflow_probability.python.layers.distribution_layer import CategoricalMixtureOfOneHotCategorical
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_probability/python/layers/distribution_layer.py", line 46, in <module>
    keras_tf_utils.register_symbolic_tensor_type(
AttributeError: module 'tensorflow.python.keras.utils.tf_utils' has no attribute 'register_symbolic_tensor_type'

The runtime environment (as a Dockerfile):

FROM ubuntu:bionic

RUN apt update && apt install -y python3 python3-pip

RUN pip3 install --upgrade tensorflow tf-agents-nightly

Reusability for PyTorch

Is there something similar available for PyTorch? (I believe there is not.) Can someone point me to which parts of the project would be reusable for implementing agents in PyTorch?

How to load and use a trained policy?

Hi all,
I'm trying to figure out how to use a trained policy in a "demo" mode. In the following example, I trained a PPO agent for PyBullet HalfCheetah and want to load the policy checkpoint to see how it performs:

import gym
import os
import pybullet_envs
import tensorflow as tf
from tf_agents.networks import actor_distribution_network
from tf_agents.environments import suite_pybullet
from tf_agents.environments import tf_py_environment
from tf_agents.environments import time_step as ts
from tf_agents.policies import actor_policy, py_tf_policy
from tf_agents.utils import common as common_utils
from tf_agents.utils import tensor_normalizer
from tf_agents.agents.ppo import ppo_policy

tf.enable_eager_execution()

tf.logging.set_verbosity(tf.logging.INFO)

train_dir = '/tmp/half_cheetah_ppo/train'


env = gym.make("HalfCheetahBulletEnv-v0")
env.render(mode="human")

# disable rendering during reset, makes loading much faster
obs = env.reset()


tf_env = tf_py_environment.TFPyEnvironment(suite_pybullet.load('HalfCheetahBulletEnv-v0'))
actor_net = actor_distribution_network.ActorDistributionNetwork(tf_env.observation_spec(),
                                                                tf_env.action_spec(),
                                                                fc_layer_params=(200, 100))

time_step_spec = tf_env.time_step_spec()
obs_normalizer = tensor_normalizer.StreamingTensorNormalizer(
    time_step_spec.observation, scope='normalize_observations')

policy = ppo_policy.PPOPolicy(
    time_step_spec=time_step_spec,
    action_spec=tf_env.action_spec(),
    actor_network=actor_net,
    observation_normalizer=obs_normalizer,
    clip=False,
    collect=False)

time_step = ts.restart(obs, 1)
policy.action(time_step)

print(actor_net.trainable_weights[0])

policy_checkpointer = common_utils.Checkpointer(ckpt_dir=os.path.join(train_dir, 'policy'),
                                                policy=policy)

print(actor_net.trainable_weights[0])
r = 0
for _ in range(1000):
    policy_step = policy.action(time_step)
    obs, rew, _, _ = env.step(policy_step.action.numpy())
    time_step = ts.transition(obs, rew)
    env.render()
    r += rew
print(r)

It seems that this code is not enough, since the weights of my actor net don't change after the checkpoint is loaded.
Any ideas?
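
A likely missing step here (a hedged sketch, not verified against this exact setup) is explicitly asking the checkpointer to restore the latest checkpoint after construction, e.g. via its initialize_or_restore method:

# Hedged sketch: explicitly trigger the restore after constructing the
# checkpointer; whether this extra call is needed may depend on the
# TF-Agents version.
policy_checkpointer = common_utils.Checkpointer(
    ckpt_dir=os.path.join(train_dir, 'policy'),
    policy=policy)
policy_checkpointer.initialize_or_restore()

# The actor network weights should now reflect the checkpoint.
print(actor_net.trainable_weights[0])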

tf.function throws error when used in PPO TF2.0 GPU example

I get the following error when setting use_tf_functions = True in the PPO v2 example.
This error appears in the line

collect_driver.run = common.function(collect_driver.run, autograph=False)

with TF 2.0 GPU installed with tf-nightly-gpu-2.0-preview:

Cannot place the graph because a reference or resource edge connects colocation groups with incompatible assigned devices: /job:localhost/replica:0/task:0/device:CPU:0 vs /job:localhost/replica:0/task:0/device:GPU:0. The edge src node is driver_loop/exit/_130, and the dst node is driver_loop [Op:__inference_run_5502]
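
One workaround sometimes suggested for colocation errors of this kind (a sketch only, not verified against this exact example) is to enable soft device placement so TensorFlow may fall back to a compatible device:

import tensorflow as tf

# Hedged workaround sketch: allow ops to be placed on another device when
# the requested colocation cannot be satisfied. Set this before building
# the driver / tf.function.
tf.config.set_soft_device_placement(True)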

Recommendations and best practices for pre-training and transfer learning?

Hi all,
We're looking for recommendations on how to do pre-training and transfer learning with tf_agents.
The idea is to perform pre-training on another task domain and use the pre-trained model as an encoder for an agent. For instance, the pre-training task could be ImageNet classification.
Ideally, we would like to use the original Keras model definition for the encoder network, since we have a full stack of TensorFlow Keras pipelines for the pre-training task and we don't want to recode everything to follow tf_agents network definitions.
We know that tf_agents.Network derives from keras.Network, and the question is: will a pre-trained Keras model work out of the box when we use it as a sub-network in tf_agents? If not, what is the best way to copy weights from a pre-trained Keras model?
Below is the pseudo-code of what we would like to achieve:

# pre-training
keras_model = build_keras_model(...)
keras_model.fit(pre_train_dataset)

# tf_agents_network uses keras_model as an encoder
tf_agents_network = build_tf_agent_network(keras_model, ...)

tf_agent = Agent(tf_agents_network)

...
tf_agent.train(...)
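
One possible direction (a minimal sketch, assuming the pre-trained Keras model can be used as a Keras layer; observation_spec and action_spec are assumed to come from the environment) is to pass the Keras model as a preprocessing layer of a TF-Agents network, which accepts arbitrary Keras layers there:

from tf_agents.networks import actor_distribution_network

# Hedged sketch: reuse the pre-trained Keras model (with its weights) as the
# encoder of a TF-Agents actor network via `preprocessing_layers`.
# `keras_model` is the pre-trained model from the pseudo-code above.
actor_net = actor_distribution_network.ActorDistributionNetwork(
    observation_spec,
    action_spec,
    preprocessing_layers=keras_model,  # any Keras layer/model should work here
    fc_layer_params=(200, 100))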

Add a folder for examples?

How about adding a folder for examples and tutorials like tensorflow/tensorflow/contrib/eager/python/examples?

How should we set the network to `training` mode?

When the network contains layers with different behaviour in the training and test phases (e.g. BatchNormalization), we need to set this mode explicitly. Coming from tf.keras.Model, we usually set this via the training argument of the call method.
How can we achieve the same thing with tf_agents? Currently the API of the call function in the subclasses of network.Network does not seem to offer such a phase setting.
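
One pattern that should work regardless of the exact version (a hedged sketch, not the official API) is to subclass network.Network with a call method that accepts a training flag and forwards it to the phase-dependent Keras layers:

import tensorflow as tf
from tf_agents.networks import network

# Hedged sketch: a custom network whose call() takes a `training` flag and
# forwards it to BatchNormalization. Layer sizes and names are illustrative.
class MyNet(network.Network):

  def __init__(self, input_tensor_spec, name='MyNet'):
    super(MyNet, self).__init__(
        input_tensor_spec=input_tensor_spec, state_spec=(), name=name)
    self._dense = tf.keras.layers.Dense(64, activation='relu')
    self._bn = tf.keras.layers.BatchNormalization()

  def call(self, observation, step_type=None, network_state=(), training=False):
    x = self._dense(observation)
    x = self._bn(x, training=training)  # phase-dependent behaviour
    return x, network_state

The flag can then be passed when the network is invoked, e.g. my_net(obs, step_type, training=True); whether the agent's own training loop forwards such a flag may depend on the TF-Agents version.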

MultiDiscrete spaces support

Hello. Thank you for an amazing product; I am experimenting with it right now.

I would like to know why MultiDiscrete spaces are not supported by the Gym wrappers. I don't see any reason why it would be difficult to support this space. It should be similar to a tuple of Discrete spaces, which you already support.

Could you please provide some clarification?
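
For reference, a MultiDiscrete space could in principle be mapped to a bounded integer spec along these lines (a hedged sketch of the idea, not the actual gym_wrapper code; space stands for a gym.spaces.MultiDiscrete instance):

import numpy as np
from tf_agents.specs import array_spec

# Hedged sketch: a MultiDiscrete space with per-dimension sizes `nvec`
# roughly corresponds to an integer BoundedArraySpec with an array-valued
# maximum.
spec = array_spec.BoundedArraySpec(
    shape=space.nvec.shape,
    dtype=np.int32,
    minimum=np.zeros_like(space.nvec),
    maximum=space.nvec - 1,
    name='action')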

debug_summaries makes PPOAgent fail when observation is a tuple

When an agent observation is a tuple and debug_summaries is set to True, the agent crashes in a tf.compat.v2.summary.histogram call.

Let's have a look at the code that creates the problem:

    observation = time_steps.observation
    if debug_summaries:
      tf.compat.v2.summary.histogram(
          name='observations', data=observation, step=self.train_step_counter)

If observation is a tuple, the call crashes.

Before fixing this, we should probably identify other instances of this pattern so it can be fixed everywhere.
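
A possible direction for a fix (a hedged sketch, not the actual patch) is to write one histogram per component of the observation nest instead of passing the tuple directly:

import tensorflow as tf

# Hedged sketch: summarize each component of a (possibly nested) observation
# separately, so tuples/dicts never reach tf.summary.histogram directly.
if debug_summaries:
  for i, component in enumerate(tf.nest.flatten(observation)):
    tf.compat.v2.summary.histogram(
        name='observations/component_%d' % i,
        data=component,
        step=self.train_step_counter)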

Issue with observation normalization when len(observation_spec) > 1

Hi,

I am trying to train PPO with an observation spec that has len > 1.

E.g. my gym environment looks something like this:

self.observation_space = gym.spaces.Dict({
    'observation': gym.spaces.Box(low=0.0, high=1.0,
                                  shape=(128, 128, 4),
                                  dtype=np.float32),
    'state': gym.spaces.Box(low=-np.inf, high=np.inf, shape=(self.sensor_dim,), dtype=np.float32)
})

And the corresponding tf env looks something like this:

observation_spec = collections.OrderedDict([
    ('observation', tensor_spec.BoundedTensorSpec(shape=(128, 128, 4),
                                                  dtype=tf.float32,
                                                  minimum=0.0,
                                                  maximum=1.0)),
    ('state', tensor_spec.BoundedTensorSpec(shape=(19,),
                                            dtype=tf.float32,
                                            minimum=-np.inf,
                                            maximum=np.inf))
    ])

I modified tf_agents/networks/actor_distribution_network.py and tf_agents/networks/value_network.py to accept multiple inputs and concatenate the results (drawing inspiration from the preprocessing_layers and preprocessing_combiner of tf_agents/networks/encoding_network.py).

If I set normalize_observations=False for my ppo_agent.PPOAgent, the training (tf_agents/agents/ppo/examples/train_eval.py) works fine.

However, when I set normalize_observations=True, which is the default choice, the training breaks. The problem occurs in tf_agents/utils/tensor_normalizer.py. I modified the _update_ops function to update each component of the observation separately.
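
For illustration, the kind of per-component update described above might look roughly like this (a sketch only; the real _update_ops internals differ by version, and mean_sum_vars/obs are hypothetical stand-ins for the normalizer's state and the incoming batch):

import tensorflow as tf

# Hedged sketch: apply the running-statistics update to each component of a
# nested observation instead of to the nest as a whole.
def update_per_component(mean_sum_vars, obs):
  return tf.nest.map_structure(
      lambda var, o: var.assign_add(tf.reduce_sum(o, axis=0)),
      mean_sum_vars, obs)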

Then I got a new error (see the error log below). The gist is that when calling initialize() for eval_py_policy = py_tf_policy.PyTFPolicy(tf_agent.policy()), it complains that some of the variables are DictWrappers of tf.Variable instead of plain tf.Variable. For example, the following is a variable of the TF policy:
DictWrapper({'state': <tf.Variable 'normalize_observations/normalize_observations/var_sum_1:0' shape=(19,) dtype=float32>, 'observation': <tf.Variable 'normalize_observations/normalize_observations/var_sum:0' shape=(128, 128, 4) dtype=float32>}).

I would really appreciate it if anyone could give me some suggestions on how to resolve this. Thanks a ton!

Error log:

Traceback (most recent call last):
  File "train_eval_p2p_nav_ppo.py", line 413, in <module>
    tf.app.run()
  File "/cvgl2/u/chengshu/anaconda3/envs/py35/lib/python3.5/site-packages/tensorflow/python/platform/app.py", line 40, in run
    _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
  File "/cvgl2/u/chengshu/anaconda3/envs/py35/lib/python3.5/site-packages/absl/app.py", line 300, in run
    _run_main(main, args)
  File "/cvgl2/u/chengshu/anaconda3/envs/py35/lib/python3.5/site-packages/absl/app.py", line 251, in _run_main
    sys.exit(main(argv))
  File "train_eval_p2p_nav_ppo.py", line 408, in main
    use_rnns=FLAGS.use_rnns)
  File "train_eval_p2p_nav_ppo.py", line 341, in train_eval
    callback=eval_metrics_callback,
  File "/cvgl2/u/chengshu/anaconda3/envs/py35/lib/python3.5/site-packages/tf_agents/metrics/metric_utils.py", line 94, in compute_summaries
    results = compute(metrics, environment, policy, num_episodes)
  File "/cvgl2/u/chengshu/anaconda3/envs/py35/lib/python3.5/site-packages/tf_agents/metrics/metric_utils.py", line 55, in compute
    policy_state = policy.get_initial_state(environment.batch_size)
  File "/cvgl2/u/chengshu/anaconda3/envs/py35/lib/python3.5/site-packages/tf_agents/policies/py_policy.py", line 104, in get_initial_state
    return self._get_initial_state(batch_size)
  File "/cvgl2/u/chengshu/anaconda3/envs/py35/lib/python3.5/site-packages/tf_agents/policies/py_tf_policy.py", line 157, in _get_initial_state
    self.initialize(batch_size)
  File "/cvgl2/u/chengshu/anaconda3/envs/py35/lib/python3.5/site-packages/tf_agents/policies/py_tf_policy.py", line 104, in initialize
    self.session.run(tf.initializers.variables(self._tf_policy.variables()))
  File "/cvgl2/u/chengshu/anaconda3/envs/py35/lib/python3.5/site-packages/tensorflow/python/ops/variables.py", line 2834, in variables_initializer
    return control_flow_ops.group(*[v.initializer for v in var_list], name=name)
  File "/cvgl2/u/chengshu/anaconda3/envs/py35/lib/python3.5/site-packages/tensorflow/python/ops/variables.py", line 2834, in <listcomp>
    return control_flow_ops.group(*[v.initializer for v in var_list], name=name)
AttributeError: '_DictWrapper' object has no attribute 'initializer'
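
One possible workaround for the _DictWrapper error above (a hedged sketch, untested; tf_policy and session stand for the policy and session used inside py_tf_policy) is to flatten any nested variable containers before building the initializer, so that only plain tf.Variable objects reach variables_initializer:

import tensorflow as tf

# Hedged sketch: flatten nested (dict/list) variable containers returned by
# the policy so the initializer only sees tf.Variable objects.
flat_variables = tf.nest.flatten(tf_policy.variables())
session.run(tf.compat.v1.variables_initializer(flat_variables))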

tf-agents pip is no longer available

You have probably noticed that there is currently no release of the tf-agents pip library available to install. A number of other libraries have been broken by this issue, including magenta, tensor2tensor, etc. Would you consider fixing it soon, or what's the plan?
