danijar / dreamerv2

Mastering Atari with Discrete World Models

Home Page: https://danijar.com/dreamerv2

License: MIT License

Python 99.12% Dockerfile 0.88%

reinforcement-learning world-models artificial-intelligence robotics deep-learning video-prediction atari research machine-learning

dreamerv2's Issues

Input shape incompatible

Hi authors, thanks for your paper and code. I was trying to test dreamerv2 on retro games, and I spent a really long time looking at the code and trying to debug, but I have no clue what's going on.

I ran python3 dreamerv2/train.py --logdir ~/logdir/atari_pong/dreamerv2/1 --configs defaults retro --task retro_Airstriker-Genesis, and the output seemed good for a while:

Logdir /Users/ryantjj/logdir/atari_pong/dreamerv2/1
Create envs.
make_env(): suite is retro.
task: Airstriker-Genesis

This shows that I parsed the arguments correctly, and also hooked up gym-retro, and edited the configs.yaml and envs.py files to support retro.

But after what seems to be a few iterations, I run into this error:

    /dreamerv2/agent.py:79 train  *
        metrics.update(self._task_behavior.train(self.wm, start, reward))
    /dreamerv2/agent.py:212 train  *
        feat, state, action, disc = world_model.imagine(self.actor, start, hor)
    /dreamerv2/agent.py:150 step  *
        succ = self.rssm.img_step(state, action)
    ./common/other.py:41 static_scan  *
        last = fn(last, inp)
    ./common/nets.py:105 img_step  *
        x = self.get('img_in', tfkl.Dense, self._hidden, self._act)(x)
    /Users/ryantjj/.local/lib/python3.7/site-packages/tensorflow/python/keras/engine/base_layer.py:1013 __call__  **
        input_spec.assert_input_compatibility(self.input_spec, inputs, self.name)
    /Users/ryantjj/.local/lib/python3.7/site-packages/tensorflow/python/keras/engine/input_spec.py:255 assert_input_compatibility
        ' but received input with shape ' + display_shape(x.shape))

    ValueError: Input 0 of layer dense is incompatible with the layer: expected axis -1 of input shape to have value 1042 but received input with shape (2450, 1036)

So I printed the shapes of these variables in nets.py:105 img_step by inserting the print statements in this function as shown:

  @tf.function
  def img_step(self, prev_state, prev_action, sample=True):
    prev_stoch = self._cast(prev_state['stoch'])
    prev_action = self._cast(prev_action)
    if self._discrete:
      shape = prev_stoch.shape[:-2] + [self._stoch * self._discrete]
      prev_stoch = tf.reshape(prev_stoch, shape)
    x = tf.concat([prev_stoch, prev_action], -1)
    print("prev_stoch.shape: " + str(prev_stoch.shape))    # OVER HERE 
    print("prev_action.shape: " + str(prev_action.shape))   # OVER HERE
    x = self.get('img_in', tfkl.Dense, self._hidden, self._act)(x)
    deter = prev_state['deter']
    x, deter = self._cell(x, [deter])
    deter = deter[0]  # Keras wraps the state in a list.
    x = self.get('img_out', tfkl.Dense, self._hidden, self._act)(x)
    stats = self._suff_stats_layer('img_dist', x)
    dist = self.get_dist(stats)
    stoch = dist.sample() if sample else dist.mode()
    prior = {'stoch': stoch, 'deter': deter, **stats}
    return prior

And these are the terminal outputs when I run the code:

Create agent.
prev_stoch.shape: (50, 1024)
prev_action.shape: (50, 18)
prev_stoch.shape: (50, 1024)
prev_action.shape: (50, 18)
prev_stoch.shape: (50, 1024)
prev_action.shape: (50, 18)
Found 19975379 model parameters.
prev_stoch.shape: (2450, 1024)
prev_action.shape: (2450, 12)
Traceback (most recent call last):

Any clue as to why the prev_action.shape changed from 18 to 12? Thanks for getting through this really long post. I really appreciate your help! :)
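The two widths in the error message match the printed shapes exactly. The sketch below reproduces that arithmetic with NumPy, assuming the img_in Dense layer fixes its expected input width the first time it is called (the usual Keras behaviour); `concat_width` is a hypothetical helper written for this illustration.

```python
import numpy as np

STOCH_WIDTH = 1024  # width of prev_stoch in the printed output above


def concat_width(batch, action_dim):
    """Width of concat([prev_stoch, prev_action], -1), reproduced with NumPy."""
    prev_stoch = np.zeros((batch, STOCH_WIDTH))
    prev_action = np.zeros((batch, action_dim))
    return np.concatenate([prev_stoch, prev_action], axis=-1).shape[-1]


built_width = concat_width(50, 18)      # shapes printed while the agent is created
imagine_width = concat_width(2450, 12)  # shapes printed just before the traceback
print(built_width, imagine_width)       # the two widths named in the ValueError
```

This suggests the layer was built for 18-dimensional actions while the actions fed in during imagination are 12-dimensional, so the place to look is wherever those two action sizes are derived.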

Discount predictor invalid log_prob targets?

Hi,
there seems to be an issue with the discount predictor log likelihood targets.

dreamerv2/dreamerv2/agent.py

Line 168 in e783832

obs['discount'] *= self.config.discount

dreamerv2/dreamerv2/agent.py

Line 126 in e783832

like = tf.cast(head(inp).log_prob(data[name]), tf.float32)

If I understand this correctly, this tries to compute the log probability of a Bernoulli distribution for values other than 0 or 1, as the discount will be < 1 for non-terminal steps.
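For reference, the expression that a Bernoulli log-likelihood evaluates is still well defined for targets strictly between 0 and 1, where it behaves like a cross-entropy against a soft label. The NumPy sketch below illustrates this for a target of 0.999. It only illustrates the formula; whether the binary head in this repository computes its log_prob this way is an assumption, and `bernoulli_log_prob` is a hypothetical helper.

```python
import numpy as np


def bernoulli_log_prob(p, x):
    """x*log(p) + (1-x)*log(1-p), evaluated for any target x in [0, 1]."""
    p = np.clip(p, 1e-8, 1 - 1e-8)  # avoid log(0)
    return x * np.log(p) + (1 - x) * np.log(1 - p)


target = 0.999  # e.g. a non-terminal step scaled by a discount of 0.999
probs = np.linspace(0.9, 0.9999, 1000)  # candidate predicted probabilities
best = probs[np.argmax(bernoulli_log_prob(probs, target))]
print(best)  # the maximiser sits next to the soft target
```

Under this formula the objective is maximised when the predicted probability equals the soft target, so the head would learn to output the discount itself rather than a hard 0/1 label.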

TypeError: unsupported operand type(s) for //=: 'str' and 'int'

I ran this code

python dreamer.py --logdir ~/logdir/atari_pong/dreamerv2/1 --configs defaults atari --task atari_pong

But I got this error

Commented version of the code

Hi Danijar,

I was just wondering if there is any commented version of the code by any chance?

Tuple Actions Space

This might be impossible but is it possible to run this algorithm on an environment with a tuple action space?

If so then how?

Thanks,
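One generic workaround, not confirmed by the authors, is to flatten a tuple of discrete sub-actions into a single discrete index so the environment looks like an ordinary discrete-action task. The sketch below shows the mapping with NumPy; the sub-action sizes and helper names are hypothetical.

```python
import numpy as np

SIZES = (3, 4)  # hypothetical Tuple(Discrete(3), Discrete(4)) action space


def flatten_action(action):
    """Map a tuple of sub-actions to one index in range(prod(SIZES))."""
    return int(np.ravel_multi_index(action, SIZES))


def unflatten_action(index):
    """Inverse mapping: recover the tuple of sub-actions from the flat index."""
    return tuple(int(i) for i in np.unravel_index(index, SIZES))


num_actions = int(np.prod(SIZES))
flat = flatten_action((2, 1))
print(num_actions, flat, unflatten_action(flat))
```

The flat action count grows as the product of the sub-action sizes, so this only stays practical for small tuples.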

Minimal evaluation/example using gym observation

Hi,

How do I create an agent, load the weights, and then call a prediction function to get an action?

I'm trying to write one myself, but many details are missing.
I'm stuck on this error:

python validate.py --model_path ~/logdir/trader
Loading config.
Loading config. Done
Resizing keys image to (64, 64).
Create agent (step: 481310).
Encoder CNN inputs: ['image']
Encoder MLP inputs: []
Decoder CNN outputs: ['image']
Decoder MLP outputs: []
Create agent. Done!
Loading checkpoint.
Load checkpoint with 85 tensors and 32342130 parameters.
Traceback (most recent call last):
  File "/opt/homebrew/Caskroom/miniconda/base/envs/ml/lib/python3.8/site-packages/tensorflow/python/util/nest.py", line 568, in assert_same_structure
    _pywrap_utils.AssertSameStructure(nest1, nest2, check_types,
ValueError: The two structures don't have the same nested structure.

First structure: type=tuple str=(<tf.Variable 'Variable:0' shape=() dtype=int32, numpy=481310>, <tf.Variable 'Variable:0' shape=() dtype=int32, numpy=0>, <tf.Variable 'Variable:0' shape=() dtype=float64, numpy=1.0>)

Second structure: type=tuple str=(483399, 48242, array([[-0.03663844,  0.02114336, -0.01451669, ..., -0.00666128,
        -0.01674761,  0.07526544],
       [-0.04041671,  0.02768614, -0.01707186, ..., -0.00505101,

My agent code:

import gym
import logging
import random
from typing import Sequence
import numpy as np
import tensorflow as tf
from dreamerv2.api import defaults
from dreamerv2 import common
from dreamerv2.agent import Agent

from pathlib import Path
from agents import BaseAgent


logger = logging.getLogger('root')


class Dreamerv2Agent(BaseAgent):
    def __init__(self,
                 conf_file: Path,
                 env: str,
                 test_mode: bool,
                 prefix: str,
                 batch: int,
                 model_path: Path,
                 seed: bool):
        super().__init__(env, test_mode, prefix, batch, model_path, seed)

        if self.seed:
            random.seed(0)
            np.random.seed(0)
            tf.random.set_seed(0)

        logger.info("Loading config.")
        config = common.Config.load(
            str(model_path.absolute() / 'config.yaml')
            )
        logger.info("Loading config. Done")

        # config = defaults.parse_flags()

        env = gym.make(env)
        env = common.GymWrapper(env)
        env = common.ResizeImage(env)
        if hasattr(env.act_space['action'], 'n'):
            env = common.OneHotAction(env)
        else:
            env = common.NormalizeAction(env)
        env = common.TimeLimit(env, config.time_limit)

        replay = common.Replay(
            model_path.absolute() / 'train_episodes',
            **config.replay
            )
        step = common.Counter(replay.stats['total_steps'])
        logger.info(f'Create agent (step: {step.value}).')
        self.agent = Agent(config, env.obs_space, env.act_space, step)
        logger.info('Create agent. Done!')

        logger.info('Loading checkpoint.')
        if (model_path.absolute() / 'variables.pkl').exists():
            self.agent.load(model_path.absolute() / 'variables.pkl')
        logger.info('Loading checkpoint. Done!')

    def get_action(self, observation: Sequence):
        # Receive the Gym observation to get action
        output, _ = self.agent.policy(observation)
        return output.get('action')

Questions about atari evaluation protocol

Hi @danijar, thank you for this great work.

I have some questions about evaluation protocol used in this code and dreamerV2 paper.

Q1. In the paper, it is mentioned that you followed the evaluation protocol of Machado et al. (2018), who use "evaluation during training": the score is the average over the last 100 training episodes before the agent reaches 200M frames, without an explicit evaluation phase. Is this "evaluation during training" protocol used for DreamerV2 as well, or did you run separate evaluation episodes?
Q2. Is there a standard Atari evaluation protocol? For instance, the IMPALA paper states that they used a standard evaluation protocol in which the scores over 200 evaluation episodes are averaged. So they used a separate evaluation phase, while Machado et al. did not. Also, sticky actions are not applied in the IMPALA evaluation, while they are used in Machado et al. and DreamerV2. So I wonder whether there is any evaluation protocol that we can call "standard". What is your opinion on this, and why does DreamerV2 follow Machado et al.'s evaluation protocol rather than IMPALA's?
Q3. In Machado et al., 5~24 different trials are averaged for evaluation. How many trials did you use in DreamerV2?
Q4. In the code (https://github.com/danijar/dreamerv2/blob/main/dreamerv2/train.py#L153-L155), the number of evaluation episodes (config.eval_eps) is 1 and the evaluation interval (config.eval_every) is 1e5. How can I relate these settings to the evaluation protocol of Machado et al.?
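To make the "evaluation during training" protocol from Q1 concrete, here is a minimal sketch that averages the returns of the last 100 training episodes finished within a frame budget. The `training_score` helper and the layout of the episode log are assumptions for illustration and are not taken from the repository.

```python
def training_score(episodes, frame_budget, k=100):
    """Mean return of the last k training episodes that ended within the budget.

    episodes: list of (total_frames_when_episode_ended, episode_return),
    in the order the episodes were collected.
    """
    finished = [ret for frames, ret in episodes if frames <= frame_budget]
    window = finished[-k:]
    if not window:
        raise ValueError('no episode finished within the frame budget')
    return sum(window) / len(window)


# Hypothetical log: episode i ends at frame 1000 * i with return i.
episodes = [(i * 1000, float(i)) for i in range(1, 301)]
score = training_score(episodes, frame_budget=250_000)
print(score)
```

No separate evaluation episodes are run in this scheme; the score is read directly off the training log.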

Should policy state be reset after every episode?

It seems like the state of the agent (self._state) is not initialized to 0 on reset. Only in the very first episode, it is None, so it will be set to 0s. Since driver.reset() is never called again in api.py, self._state will be carried over from previous episodes on episode reset.
Is this intentional?

dreamerv2/dreamerv2/common/driver.py

Lines 32 to 40 in 07d906e

    
      obs = {
          i: self._envs[i].reset()
          for i, ob in enumerate(self._obs) if ob is None or ob['is_last']}
      for i, ob in obs.items():
        self._obs[i] = ob() if callable(ob) else ob
        act = {k: np.zeros(v.shape) for k, v in self._act_spaces[i].items()}
        tran = {k: self._convert(v) for k, v in {**ob, **act}.items()}
        [fn(tran, worker=i, **self._kwargs) for fn in self._on_resets]
        self._eps[i] = [tran]
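One way a carried-over recurrent state can still start every episode from zeros is to mask it with a per-environment "first step" flag instead of calling reset(). The NumPy sketch below shows that idea. The `is_first` flag and the `mask_state` helper are assumptions for illustration, and whether the agent in this repository does this is not established by the driver snippet.

```python
import numpy as np


def mask_state(state, is_first):
    """Zero the rows of every state array whose environment just started an episode."""
    keep = 1.0 - np.asarray(is_first, dtype=np.float32)  # 0 where the episode is new
    return {
        key: value * keep.reshape((-1,) + (1,) * (value.ndim - 1))
        for key, value in state.items()}


# Two environments: the first one has just reset, the second is mid-episode.
state = {
    'deter': np.ones((2, 4), np.float32),
    'stoch': np.ones((2, 3, 5), np.float32)}
masked = mask_state(state, is_first=[True, False])
print(masked['deter'])
```

With such a mask the driver can keep passing the previous state along, and only the rows belonging to freshly reset environments are cleared.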

procgen env

Dumb question here, but how does this algorithm perform in the Procgen environments, especially compared to PPG?

Thank you

Render episodes

Is there a way to render eval episodes for the OpenAI Atari envs?

Questions on Imagination MDP and imagination horizon H = 15

Dear author,

After reading the code and the paper, I am confused about why the imagination MDP is introduced and why an imagination horizon is needed. For example, with a trained world model and a given trajectory $\tau$, we can sample an initial state and simulate a trajectory with the world model. In DreamerV2, each state in the sampled trajectory is used to simulate a sub-trajectory of length 15, which is then used to update the policy. Why is this approach feasible for training model-based RL? It looks like magic. Could you help me understand it?
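Part of the answer is that a short horizon does not have to ignore rewards beyond it: a learned value estimate at the last imagined state can stand in for everything that follows. The sketch below computes a λ-return over an imagined rollout with such a bootstrap. It is a minimal NumPy illustration under that assumption, not the repository's implementation; the default values mirror the discount: 0.99, discount_lambda: 0.95, and imag_horizon: 15 entries in configs quoted elsewhere on this page.

```python
import numpy as np


def lambda_return(reward, value, discount=0.99, lam=0.95):
    """Lambda-return over a rollout of length H = len(reward).

    reward[t] for t = 0..H-1; value[t] for t = 0..H, where value[H] is the
    critic's estimate at the last imagined state (the bootstrap).
    """
    horizon = len(reward)
    returns = np.zeros(horizon)
    next_return = value[horizon]  # everything beyond the horizon
    for t in reversed(range(horizon)):
        mixed = (1 - lam) * value[t + 1] + lam * next_return
        next_return = reward[t] + discount * mixed
        returns[t] = next_return
    return returns


H = 15
# No reward inside the rollout, but the critic predicts 10 at the final state.
bootstrapped = lambda_return(np.zeros(H), np.r_[np.zeros(H), 10.0], lam=1.0)
print(bootstrapped[0])
```

In this example the first state still receives credit for value that lies entirely beyond the 15 imagined steps, which is why short rollouts can be enough to train the policy.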

Does the actor-critic train using only the stochastic state?

Hi,

I'm very interested in your work, but it is unclear to me whether the actor-critic is trained using only the stochastic state as its observation or whether it also uses the recurrent state. What's the reasoning behind this choice?

Thanks for all your work and for putting it on Github!

Pickle and shape issues

Hi, I have been training an agent using this for a while now but today I have been getting these errors:

Logdir X:\Dreamer_log\logdir\ai\dreamerv2\1
Could not load episode: Object arrays cannot be loaded when allow_pickle=False
Create envs.
.\common\driver.py:64: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray
Train episode has 15691 steps and return 1.4.
Traceback (most recent call last):
  File "dreamerv2/train.py", line 119, in <module>
  File ".\common\driver.py", line 56, in __call__
  File ".\common\driver.py", line 56, in <listcomp>
  File "dreamerv2/train.py", line 110, in <lambda>
  File "dreamerv2/train.py", line 102, in per_episode
  File "C:\Users\Rob\Anaconda2\envs\muzero\lib\site-packages\elements\logger.py", line 36, in video
    self.add({name: value})
  File "C:\Users\Rob\Anaconda2\envs\muzero\lib\site-packages\elements\logger.py", line 25, in add
    f"Shape {value.shape} for name '{name}' cannot be "
ValueError: Shape (15692,) for name 'train_policy' cannot be interpreted as scalar, image, or video.

I haven't changed anything in my system or env.

Any ideas would be great.

Thanks

Batch size = 16?

dreamerv2/dreamerv2/configs.yaml

Line 24 in 912ec5d

dataset: {batch: 16, length: 50}

Hi Danijar,
do I understand correctly that this line should have batch = 50 to have the same hyperparameters as in the paper? I am asking because I want to investigate why my own PyTorch implementation is slower.

Request for the hyperparameters of Humanoid-Walk

When using your code, I only found the hyperparameters for Walker-Walk, not for Humanoid-Walk. The paper reports results on the Humanoid-Walk environment, so could you please provide the hyperparameters for Humanoid-Walk in your config.yaml file?

KeyError: 'dmc' while trying to run walker?

This is the command line and the output I get:

(tf2) marten@dpserver:~/rl/dreamerv2$ python3 dreamerv2/train.py --logdir ~/logdir/dmc_walker_walk/dreamerv2/1 --configs dmc --task dmc_walker_walk
Traceback (most recent call last):
  File "dreamerv2/train.py", line 196, in <module>
    main()
  File "dreamerv2/train.py", line 37, in main
    config = config.update(configs[name])
KeyError: 'dmc'
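The traceback shows a plain lookup, configs[name], failing for the name dmc, which means no config block with that name exists in the loaded file. A minimal sketch of a lookup that reports the available names is shown below. The `configs` dictionary here is a hypothetical stand-in that uses only names appearing in commands on this page, and `select_configs` is not a function of the repository.

```python
# Hypothetical stand-in for the blocks loaded from configs.yaml.
configs = {'defaults': {}, 'atari': {}, 'dmc_vision': {}}


def select_configs(names):
    """Return the requested config blocks, or fail with the list of valid names."""
    missing = [name for name in names if name not in configs]
    if missing:
        raise KeyError(
            f'Unknown config name(s) {missing}; available: {sorted(configs)}')
    return [configs[name] for name in names]


try:
    select_configs(['dmc'])
except KeyError as error:
    message = str(error)
    print(message)
```

Printing the keys of the loaded configs in the same way is a quick check of which names the installed version actually accepts.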

How does dreamerv2 perform on feature-based tasks?

Hello. Thanks for your interesting work!

I'm planning to use dreamerv2 on some feature-based tasks. After some searching, I found no one who has tried this before. I'm wondering whether there is any difficulty in doing so. What problems would you anticipate?

The result for atari enduro in the paper is not reproduced

Hi Danijar,

I am trying to work with your codebase, but I found that your latest code is not working for the Atari Enduro game.

(I checked that it works for Atari Bank Heist.)

Could you check it?

I leave the Tensorboard logs here.

Bests, Jaesik.

Can't reproduce riverraid's results

Hello, danijar! First of all, thanks for your work :)

I've been trying out dreamerv2 this past week and tried to reproduce riverraid's results. However, I was unsuccessful, and the agent only reaches a reward of about 5k after almost 1e6 train steps. This is the latest result I got. If you need, I can attach tensorboard graphs later this week.

train_return 5190 / train_length 982 / train_total_steps 9.5e5 / train_total_episodes 1220 / train_loaded_steps 9.5e5 / train_loaded_episodes 1220

I made a small modification to the original code so it runs on multiple GPUs (tf.distribute.MirroredStrategy). Then I trained the agent to play Pong, and the return plot was similar to the one you posted in #8, so I figured it was OK. Also, in the riverraid output above, half of the run used precision=16 and half used precision=32, since a few other issues, especially #30, mentioned that precision 32 helped. I did not do a full run with precision=32, though.

Do you have any tips on what could be going wrong or what could I do to debug it?

Thanks so much!

Default setting doesn't seem to be learning

Thanks for the updated release. I just downloaded the code and created a fresh environment as described in the readme. I tried training with everything set to default by simply running "python dreamerv2/train.py --logdir ./logdir/atari_pong --configs defaults atari --task atari_pong". After 50k steps, the return doesn't seem to increase at all. A random policy on Atari Pong should get a return of around -20, and that is all I have got so far. Any suggestions as to why this is the case?

Here is the configs.yaml in case you need it. The only thing I changed is the number of steps in lines 8 and 77, which I reduced to 1e7. Even with fewer steps, I would expect some improvement in return.

defaults:

  # Train Script
  logdir: /dev/null
  seed: 0
  task: dmc_walker_walk
  num_envs: 1
  steps: 1e7
  eval_every: 1e5
  action_repeat: 1
  time_limit: 0
  prefill: 10000
  image_size: [64, 64]
  grayscale: False
  replay_size: 2e6
  dataset: {batch: 50, length: 50, oversample_ends: True}
  train_gifs: False
  precision: 16
  jit: True

  # Agent
  log_every: 1e4
  train_every: 5
  train_steps: 1
  pretrain: 0
  clip_rewards: identity
  expl_noise: 0.0
  expl_behavior: greedy
  expl_until: 0
  eval_noise: 0.0
  eval_state_mean: False

  # World Model
  pred_discount: True
  grad_heads: [image, reward, discount]
  rssm: {hidden: 400, deter: 400, stoch: 32, discrete: 32, act: elu, std_act: sigmoid2, min_std: 0.1}
  encoder: {depth: 48, act: elu, kernels: [4, 4, 4, 4], keys: [image]}
  decoder: {depth: 48, act: elu, kernels: [5, 5, 6, 6]}
  reward_head: {layers: 4, units: 400, act: elu, dist: mse}
  discount_head: {layers: 4, units: 400, act: elu, dist: binary}
  loss_scales: {kl: 1, reward: 1, discount: 1}
  kl: {free: 0.0, forward: False, balance: 0.8, free_avg: True}
  model_opt: {opt: adam, lr: 3e-4, eps: 1e-5, clip: 100, wd: 1e-6}

  # Actor Critic
  actor: {layers: 4, units: 400, act: elu, dist: trunc_normal, min_std: 0.1}
  critic: {layers: 4, units: 400, act: elu, dist: mse}
  actor_opt: {opt: adam, lr: 1e-4, eps: 1e-5, clip: 100, wd: 1e-6}
  critic_opt: {opt: adam, lr: 1e-4, eps: 1e-5, clip: 100, wd: 1e-6}
  discount: 0.99
  discount_lambda: 0.95
  imag_horizon: 15
  actor_grad: both
  actor_grad_mix: '0.1'
  actor_ent: '1e-4'
  slow_target: True
  slow_target_update: 100
  slow_target_fraction: 1

  # Exploration
  expl_extr_scale: 0.0
  expl_intr_scale: 1.0
  expl_opt: {opt: adam, lr: 3e-4, eps: 1e-5, clip: 100, wd: 1e-6}
  expl_head: {layers: 4, units: 400, act: elu, dist: mse}
  disag_target: stoch
  disag_log: True
  disag_models: 10
  disag_offset: 1
  disag_action_cond: True
  expl_model_loss: kl

atari:

  task: atari_pong
  time_limit: 108000 # 30 minutes of game play.
  action_repeat: 4
  steps: 1e7
  eval_every: 1e5
  log_every: 1e5
  prefill: 200000
  grayscale: True
  train_every: 16
  clip_rewards: tanh
  rssm: {hidden: 600, deter: 600, stoch: 32, discrete: 32}
  actor.dist: onehot
  model_opt.lr: 2e-4
  actor_opt.lr: 4e-5
  critic_opt.lr: 1e-4
  actor_ent: 1e-3
  discount: 0.999
  actor_grad: reinforce
  actor_grad_mix: 0
  loss_scales.kl: 0.1
  loss_scales.discount: 5.0
  .*.wd$: 1e-6

dmc:

  task: dmc_walker_walk
  time_limit: 1000
  action_repeat: 2
  eval_every: 1e4
  log_every: 1e4
  prefill: 5000
  train_every: 5
  pretrain: 100
  pred_discount: False
  grad_heads: [image, reward]
  rssm: {hidden: 200, deter: 200}
  model_opt.lr: 3e-4
  actor_opt.lr: 8e-5
  critic_opt.lr: 8e-5
  actor_ent: 1e-4
  discount: 0.99
  actor_grad: dynamics
  kl.free: 1.0
  dataset.oversample_ends: False

debug:

  jit: False
  time_limit: 100
  eval_every: 300
  log_every: 300
  prefill: 100
  pretrain: 1
  train_steps: 1
  dataset.batch: 10
  dataset.length: 10

Plot.py not working properly

When running python3 common/plot.py --indir ~/logdir/exp --outdir ~/plots --xaxis step --yaxis eval_return --bins 1e6 I get:

NotADirectoryError: [Errno 20] Not a directory: '/home/USER/logdir/exp/variables.pkl'

It seems that the code is treating files as folders. If I make --indir ~/logdir instead (one level up) I get:

Traceback (most recent call last):
  File "/home/USER/code/dreamerv2/dreamerv2/common/plot.py", line 571, in <module>
    main(parse_args())
  File "/home/USER/code/dreamerv2/dreamerv2/common/plot.py", line 482, in main
    runs = load_runs(args)
  File "/home/USER/code/dreamerv2/dreamerv2/common/plot.py", line 72, in load_runs
    task, method, seed = filename.relative_to(indir).parts[:-1]
ValueError: not enough values to unpack (expected 3, got 1)

I ran training with: python dreamerv2/train.py --logdir ~/logdir/exp --configs dmc_vision --task dmc_walker_walk

Packages:

python=3.9.12
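The line in the traceback, task, method, seed = filename.relative_to(indir).parts[:-1], implies that the plotting script expects three directory levels between --indir and each metrics file. The sketch below replays that unpacking with pathlib for a nested and a flat log directory. The file name metrics.jsonl and the `split_run` helper are assumptions for illustration.

```python
from pathlib import PurePosixPath


def split_run(indir, filename):
    """Unpack <indir>/<task>/<method>/<seed>/<file> the way the traceback line does."""
    parts = PurePosixPath(filename).relative_to(indir).parts[:-1]
    if len(parts) != 3:
        raise ValueError(
            f'expected task/method/seed below {indir}, got {parts}')
    task, method, seed = parts
    return task, method, seed


indir = '/home/USER/logdir'
nested = split_run(indir, indir + '/dmc_walker_walk/dreamerv2/1/metrics.jsonl')
print(nested)

try:
    split_run(indir, indir + '/exp/metrics.jsonl')  # layout produced by --logdir ~/logdir/exp
except ValueError as error:
    flat_error = str(error)
    print(flat_error)
```

Under this reading, training into a directory such as ~/logdir/dmc_walker_walk/dreamerv2/1 and passing --indir ~/logdir would give the script the three levels it unpacks.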

Understanding re-clipping in Truncated Normal distribution

Hi,

I was looking at the TruncNormalDist code and was wondering why the samples are re-clipped ('re' because they are already in [-1, 1] as a result of tfd.TruncatedNormal's sampling).
In practice it seems to me that this wouldn't create an issue as it is only re-clipped by 1e-6, but I am curious if I'm missing something.

Thanks!

Why share states across random batches for training the world model?

From my understanding, the posterior of the last timestep from a batch is used as the start state for the next batch.
Is this intended? If so, is it just to avoid always initializing the start state to zeros and have it model some random sample from the current latent distribution?

dreamerv2/dreamerv2/agent.py

Line 60 in 07d906e

state, outputs, mets = self.wm.train(data, state)

AssertionError and AttributeError dreamerv2 in jupyter-notebook

Hi! I am trying to run the minigrid and crafter examples in a Jupyter notebook, but I keep getting this error when running the config.

Command (minigrid):

config = dv2.defaults.update({
    'logdir': '~/logdir/crafter',
    'log_every': 1e3,
    'train_every': 10,
    'prefill': 1e5,
    'actor_ent': 3e-3,
    'loss_scales.kl': 1.0,
    'discount': 0.99,
}).parse_flags()

Assertion Error:

~/miniconda3/envs/hacking/lib/python3.8/site-packages/dreamerv2/common/flags.py in parse(self, argv, known_only, help_exists)
     45         if flag.startswith('--'):
     46           raise ValueError(f"Flag '{flag}' did not match any config keys.")
---> 47       assert not remaining, remaining
     48       return parsed
     49 

AssertionError: ['-f', '/home/balloch/.local/share/jupyter/runtime/kernel-244efe8b-5d51-499b-bfa2-7611c02c8e5b.json']

Command (crafter)

config = dv2.defaults.update().parse_flags()

Attribute Error:

AttributeError: 'dict' object has no attribute 'crafter'
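The AssertionError above lists '-f' and a kernel JSON path, which are the arguments Jupyter passes to its own kernel process and which therefore show up in the command line that a flag parser sees inside a notebook. The sketch below reproduces the effect with the standard library's argparse rather than the repository's parser; the flag names and the kernel path are hypothetical.

```python
import argparse


def parse_config_flags(argv):
    """Parse known flags and return the leftovers instead of failing on them."""
    parser = argparse.ArgumentParser()
    parser.add_argument('--logdir', default='~/logdir/crafter')
    parser.add_argument('--train_every', type=int, default=10)
    known, remaining = parser.parse_known_args(argv)
    return vars(known), remaining


# What a notebook kernel typically adds, followed by one real flag.
jupyter_argv = ['-f', '/path/to/kernel.json', '--train_every', '5']
config, remaining = parse_config_flags(jupyter_argv)
print(config, remaining)
```

A parser that insists on consuming every argument fails on these leftovers, so in a notebook the arguments need to be passed explicitly or the leftovers tolerated.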

Have you considered using a PPO actor instead of a normal Actor-Critic?

I think a lot of improvement could be made by using a PPO actor.

How many environment steps per update?

Hi danijar,
how many environment steps are you running per update?
In the paper it is 4 (so the agent is updated after every step it takes, because of action repeat?), but the config here says train_every: 16. Which number is correct?

Best,
Tim

Why stop-grad on actor's input state in imagine() function ?

Hi,

While taking a close look at the imagine() function in the world model,
I wondered why the gradient from the input feature to the actor is stopped.

WorldModel's imagine function (agent.py)

def imagine(self, policy, start, is_terminal, horizon):
  flatten = lambda x: x.reshape([-1] + list(x.shape[2:]))
  start = {k: flatten(v) for k, v in start.items()}
  start['feat'] = self.rssm.get_feat(start)
  start['action'] = tf.zeros_like(policy(start['feat']).mode())
  seq = {k: [v] for k, v in start.items()}
  for _ in range(horizon):
    action = policy(tf.stop_gradient(seq['feat'][-1])).sample()

In my opinion, to get the full gradient from the initial state to the last step of the sequence, 'feat' should flow through the computation graph without the stop gradient. I just wonder why there is a stop gradient. Have you tried the code without it? What was the result like?

I'm struggling to find out the reason for the stop gradient and ask it here for help.
Thanks!

What does "openl" do / mean?

Just curious about the variable naming in this snippet of code.

Questions about expl.py and updating the batch dataset

Hi Danijar,
I'm currently doing a project where I'm running DreamerV2 on some of the alternative exploration agents. I have two questions:

How does train_dataset update to include samples from the collected data in the training episodes? As far as I know, under the default settings for Pong, we have this line which creates the train_dataset:

print('Create agent.')
train_dataset = iter(train_replay.dataset(**config.dataset))

And this line in the for loop which iterates over the batches.

for _ in range(config.train_steps):
  mets = train_agent(next(train_dataset))

I just wanted to sanity check with you that the next(train_dataset) batch is pulled from the entire buffer in train_replay._complete_eps, and that the buffer keeps being updated, since I don't see train_dataset being updated after its initialisation. I also wanted to confirm that if expl_behavior is set to something other than greedy, the training episodes use the exploratory agent, and that data collected by this agent is sampled in subsequent batches of next(train_dataset). Possibly a silly question, but in case I was missing something, I tried the following modification:

for _ in range(config.train_steps):
  train_dataset = iter(train_replay.dataset(**config.dataset))
  mets = train_agent(next(train_dataset))

Where train_dataset was re-initialised and I got worse results than the default behaviour.
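Whether an iterator created once can see later episodes depends on when it reads the buffer. The sketch below shows the general Python behaviour: a generator that looks at a shared dictionary on every next() call picks up entries added after the iterator was created. That the repository's dataset is backed by such a generator is an assumption, and `make_dataset` is a hypothetical stand-in.

```python
import random


def make_dataset(episodes, rng):
    """Endless generator that samples from `episodes` at the time of each next()."""
    while True:
        yield rng.choice(list(episodes.values()))


episodes = {'ep0': 'old'}               # stand-in for the replay buffer's episodes
dataset = iter(make_dataset(episodes, random.Random(0)))

first = next(dataset)                   # only 'old' exists at this point
episodes['ep1'] = 'new'                 # an episode collected after the iterator was created
later = {next(dataset) for _ in range(200)}
print(first, later)
```

If the dataset works this way, re-creating the iterator inside the training loop changes nothing about which episodes can be sampled and only adds setup cost.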

I set expl_behavior to 'Plan2Explore' and tested on some toy games I'd based on the dSprites dataset, and got worse results than default (greedy) Dreamer over one million episodes. The train_policy does indeed seem to explore the space quite well, but I was wondering whether:

a) any steps needed to be done in order for Plan2Explore to work properly, other than just updating configs.yaml with expl_behavior: Plan2Explore (this is what I currently have)

b) it takes more than a few million steps for Plan2Explore to perform as well as default Dreamer. Here's a graph of the situation:

Note: I'd accidentally had action_repeat set to 4 in both these games, so divide by 4 to get the true number of steps on the x-axis.

Thanks in advance!

No Improvement in Pong Scores after 18M+ Steps

[Edited to update to 18M steps; images below are from 12M]

Starting a new thread with more relevant detail here. Please feel free to close if you don't think it's appropriate.

We've now trained several instances to at least 10M+ steps with no improvement in Pong scores. This is using the default Pong settings on V100 machines in Colab Pro.

All training settings are the default in the repo, no modifications have been made to the code base as this was a first "test run" of dreamer.

Below are performance graphs. Happy to provide Colab copy or log files if it would be helpful. Would appreciate any insight, even if it's that we need to allow longer training (though the chart in Appendix F appears to show Pong improving by this point in training?).

Will keep training in the meantime and update if anything changes.

Thank you.

[Below images are from 12M steps. However issue persists beyond 18M+ steps]

Reward different on evaluation

Hi Danijar,

I'm training using dreamerv2 with success, and I'm getting this result:

[5847513] return 6.12 / length 151 / total_steps 5.8e6 / total_episodes 3.9e4 / loaded_steps 1e5 / loaded_episodes 662
Save checkpoint with 85 tensors and 32333580 parameters.
[5847536] kl_loss 0.67 / image_loss 1.1e4 / reward_loss 0.92 / discount_loss 0.06 / model_kl 0.67 / prior_ent 1.92 / post_ent 1.14 / model_loss 1.1e4 / model_grad_norm 133.28 / actor_loss -1.5e-5 / actor_grad_norm 1.5e-3 / critic_loss 0.82 / critic_grad_norm 0.07 / reward_mean 0.04 / reward_std 0.03 / reward_normed_mean 0.04 / reward_normed_std 0.03 / critic_slow 1.37 / critic_target 1.35 / actor_ent 2e-3 / actor_ent_scale 2e-3 / critic 1.37 / fps 44.95
Episode has 151 steps and return 6.1.

Is the return 6.1 the cumulative sum of the episode's rewards?

After training, I run the code below and get a cumulative reward of 2.153186.

I'm using this code to evaluate.

import re
import warnings
import gym
import logging
import random
from absl import logging
from typing import Sequence
import numpy as np
import tensorflow as tf
import dreamerv2.api as dv2
from dreamerv2 import common
from dreamerv2.agent import Agent

from pathlib import Path
from agents import BaseAgent


# logger = logging.getLogger('root')
# warnings.filterwarnings('ignore', '.*box bound precision lowered.*')


class Dreamerv2Agent(BaseAgent):
    def __init__(self,
                 conf_file: Path,
                 env: str,
                 test_mode: bool,
                 prefix: str,
                 batch: int,
                 model_path: Path = Path("~/logdir/trader"),
                 seed: bool = False):
        super().__init__(env, test_mode, prefix, batch, model_path, seed)

        if self.seed:
            random.seed(0)
            np.random.seed(0)
            tf.random.set_seed(0)

        model_path = model_path.expanduser().absolute()
        logging.error(f"Model Path: {model_path}")

        logging.error("Loading config.")
        config_path = (model_path / 'config.yaml')
        config = common.Config.load(config_path)
        self.config = config

        logging.error("Loading config. Done")

        env = gym.make(env)

        replay = common.Replay(
            model_path / 'train_episodes',
            **config.replay
        )
        step = common.Counter(replay.stats['total_steps'])
        env = self.wrapper(env)

        def per_episode(ep):
            length = len(ep['reward']) - 1
            score = float(ep['reward'].astype(np.float64).sum())
            logging.error(f'Episode has {length} steps and return {score:.1f}.')
            # logger.scalar('return', score)
            # logger.scalar('length', length)
            # logging.error is a plain function without a .scalar method,
            # so the aggregated values are logged as text instead.
            for key, value in ep.items():
                if re.match(config.log_keys_sum, key):
                    logging.error(f'sum_{key} {ep[key].sum()}')
                if re.match(config.log_keys_mean, key):
                    logging.error(f'mean_{key} {ep[key].mean()}')
                if re.match(config.log_keys_max, key):
                    logging.error(f'max_{key} {ep[key].max(0).mean()}')
            # logger.add(replay.stats)
            # logger.write()

        driver = common.Driver([env])
        driver.on_episode(per_episode)
        driver.on_step(lambda tran, worker: step.increment())
        driver.on_step(replay.add_step)
        driver.on_reset(replay.add_step)

        prefill = max(0, config.prefill - replay.stats['total_steps'])
        if prefill:
            print(f'Prefill dataset ({prefill} steps).')
            random_agent = common.RandomAgent(env.act_space)
            driver(random_agent, steps=prefill, episodes=1)
            driver.reset()

        logging.error(f'Create agent (step: {step.value}).')
        logging.error(f"Action Space: {env.act_space}")
        logging.error(f"Observation Space: {env.obs_space}")
        self.agent = Agent(config, env.obs_space, env.act_space, step)
        dataset = iter(replay.dataset(**config.dataset))
        train_agent = common.CarryOverState(self.agent.train)
        train_agent(next(dataset))
        logging.error('Create agent. Done!')

        logging.error('Loading checkpoint.')
        vars = (model_path / 'variables.pkl').absolute()
        if vars.exists():
            self.agent.load(vars)
        logging.error('Loading checkpoint. Done!')

    def wrapper(self, env):
        env = common.GymWrapper(env)
        env = common.ResizeImage(env)
        if hasattr(env.act_space['action'], 'n'):
            env = common.OneHotAction(env)
        else:
            env = common.NormalizeAction(env)
        env = common.TimeLimit(env, self.config.time_limit)
        return env

    def get_action(self, observation: Sequence):
        obs = {k: np.expand_dims(v, 0) for k, v in observation.items()}
        output, _ = self.agent.policy(obs, mode='eval')
        output['action'] = tf.squeeze(output['action'])
        return output

My config.

action_repeat: 1
actor: {act: elu, dist: auto, layers: 4, min_std: 0.1, norm: none, units: 400}
actor_ent: 0.002
actor_grad: auto
actor_grad_mix: 0.1
actor_opt: {clip: 100, eps: 1e-05, lr: 8e-05, opt: adam, wd: 1e-06}
atari_grayscale: false
clip_rewards: tanh
critic: {act: elu, dist: mse, layers: 4, norm: none, units: 400}
critic_opt: {clip: 100, eps: 1e-05, lr: 0.0002, opt: adam, wd: 1e-06}
dataset: {batch: 16, length: 50}
decoder:
  act: elu
  cnn_depth: 48
  cnn_kernels: [5, 5, 6, 6]
  cnn_keys: .*
  mlp_keys: .*
  mlp_layers: [400, 400, 400, 400]
  norm: none
disag_action_cond: true
disag_log: false
disag_models: 10
disag_offset: 1
disag_target: stoch
discount: 0.99
discount_head: {act: elu, dist: binary, layers: 4, norm: none, units: 400}
discount_lambda: 0.95
dmc_camera: -1
encoder:
  act: elu
  cnn_depth: 48
  cnn_kernels: [4, 4, 4, 4]
  cnn_keys: .*
  mlp_keys: .*
  mlp_layers: [400, 400, 400, 400]
  norm: none
envs: 1
envs_parallel: none
eval_eps: 1
eval_every: 1000.0
eval_noise: 0.0
eval_state_mean: false
expl_behavior: greedy
expl_extr_scale: 0.0
expl_head: {act: elu, dist: mse, layers: 4, norm: none, units: 400}
expl_intr_scale: 1.0
expl_model_loss: kl
expl_noise: 0.0
expl_opt: {clip: 100, eps: 1e-05, lr: 0.0003, opt: adam, wd: 1e-06}
expl_reward_norm: {eps: 1e-08, momentum: 1.0, scale: 1.0}
expl_until: 0
grad_heads: [decoder, reward, discount]
imag_horizon: 15
jit: true
kl: {balance: 0.8, forward: false, free: 0.0, free_avg: true}
log_every: 10000.0
log_keys_max: ^$
log_keys_mean: ^$
log_keys_sum: ^$
log_keys_video: [image]
logdir: ~/logdir/trader
loss_scales: {discount: 1.0, kl: 1.0, proprio: 1.0, reward: 1.0}
model_opt: {clip: 100, eps: 1e-05, lr: 0.0001, opt: adam, wd: 1e-06}
precision: 16
pred_discount: true
prefill: 10000
pretrain: 1
render_size: [64, 64]
replay: {capacity: 100000.0, maxlen: 50, minlen: 50, ongoing: false, prioritize_ends: true}
reward_head: {act: elu, dist: mse, layers: 4, norm: none, units: 400}
reward_norm: {eps: 1e-08, momentum: 1.0, scale: 1.0}
rssm: {act: elu, deter: 1024, discrete: 32, ensemble: 1, hidden: 1024, min_std: 0.1,
  norm: none, std_act: sigmoid2, stoch: 32}
seed: 0
slow_baseline: true
slow_target: true
slow_target_fraction: 1
slow_target_update: 100
steps: 100000000.0
task: dmc_walker_walk
time_limit: 0
train_every: 1000
train_steps: 1

Is there something I'm missing to get a better evaluation?

Best regards,
Fernando Ribeiro

Lambda Target Equation

Hi,

I have a question about how you calculate the lambda_target as seen in the equation below.

I've been implementing it to work directly in the environment rather than with the model states, to test how it works, and something occurred to me. On your final step, i.e. when t = H, are you not accounting for the reward twice, since the value network is already trained to incorporate the reward of a state into the value of that state? Would it not be more valid to stop the calculation at H-1 and use the final model state at H only for bootstrapping, so that the target would become V(s_{H-1}) = r_{H-1} + γ_{H-1} * V(s_H)?

Thanks again,
Lewis
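
To make the question above concrete, here is a minimal sketch of a λ-return in which the final state is used only for bootstrapping, as the post proposes. It assumes the recursive form V^λ_t = r_t + γ_t ((1 - λ) v(s_{t+1}) + λ V^λ_{t+1}); the function name `lambda_return` and the toy numbers are illustrative and are not taken from the repository.

```python
def lambda_return(rewards, values, discounts, lam):
    """Compute lambda-returns for steps 0..T-1.

    rewards, discounts: lists of length T.
    values: list of length T + 1; values[T] is used only as the
    bootstrap for the final step, so its reward is never added.
    """
    assert len(values) == len(rewards) + 1 == len(discounts) + 1
    returns = [0.0] * len(rewards)
    next_return = values[-1]  # bootstrap from the value of the last state
    for t in reversed(range(len(rewards))):
        # Mix the one-step value estimate with the longer lambda-return.
        mixed = (1.0 - lam) * values[t + 1] + lam * next_return
        returns[t] = rewards[t] + discounts[t] * mixed
        next_return = returns[t]
    return returns


# Two imagined steps plus one bootstrap state.
print(lambda_return([1.0, 2.0], [0.0, 0.0, 10.0], [0.5, 0.5], lam=1.0))
print(lambda_return([1.0, 2.0], [0.0, 0.0, 10.0], [0.5, 0.5], lam=0.0))
```

In this sketch the last target is always the one-step target r_{T-1} + γ_{T-1} v(s_T), regardless of λ, so the reward of the bootstrap state is counted only through its value estimate.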

How to run dreamerv2 on atari games

We successfully ran dreamerv2 on the Minigrid environment by following the README.md. We are now trying to run dreamerv2 on Atari games, but the environment loaded for an Atari game, specifically SpaceInvaders-v0, does not seem to be compatible with the agent's input. Could you tell us how to run dreamerv2 on SpaceInvaders-v0? Or should we modify some code, such as agent.py and envs.py, so that the agent and environment are compatible with each other?

Setting random seed

Dear Danijar,

I'm running into an issue that may be a non-issue, and I thought it was worth checking.

Are you able to reproduce training runs using a fixed random seed?
There is a 'seed' flag in the config file, but I cannot find where it is actually being used, and my runs do not appear fixed to a seed.

Additionally, I am running into what appears to be a weird bug, and I am wondering if you have insight. The model does not train properly if I try to manually set the random seeds by adding, before any other code:
np.random.seed(config.seed)
tf.random.set_seed(config.seed)
print(f'--> Setting random seed to {config.seed}')

For example, using dmc_walker_run, here is a training curve if I do not set the seed

Whereas here is a training curve if the only change I make is to add the above three lines.

This has been a consistent finding. I also see it if I try setting the random seed at other locations in the code.
Otherwise, I am getting consistent success training without setting a random seed ( --> congratulations and thank you for the wonderful codebase and algorithm :-)

Is this a known problem? And/or do you have any insight into why this might be the case? Is there a reason to give up trying to set a random seed?

Note: I have been using the original version of your repo (i.e. from March). Is this something you have knowingly fixed with subsequent updates?

Thank you so much.

Best,

Isaac

Prediction returning the same action from different observations

Hi again,

Congrats on the excellent work.

My model is improving.

I'm loading the checkpoint successfully and trying to predict an action (by calling the policy function) using this observation format:

{'image': array([[[161, 255,   0],
        [161, 255,   0],
        [161, 255,   0],
        ...,
        [155, 255,   0],
        [155, 255,   0],
        [155, 255,   0]],

       [[161, 255,   0],
        [161, 255,   0],
        [161, 255,   0],
        ...,
        [155, 255,   0],
        [155, 255,   0],
        [155, 255,   0]],

       [[161, 255,   0],
        [161, 255,   0],
        [161, 255,   0],
        ...,
        [155, 255,   0],
        [155, 255,   0],
        [155, 255,   0]],

       ...,

       [[182, 255,   0],
        [182, 255,   0],
        [182, 255,   0],
        ...,
        [183, 255,   0],
        [183, 255,   0],
        [183, 255,   0]],

       [[182, 255,   0],
        [182, 255,   0],
        [182, 255,   0],
        ...,
        [183, 255,   0],
        [183, 255,   0],
        [183, 255,   0]],

       [[182, 255,   0],
        [182, 255,   0],
        [182, 255,   0],
        ...,
        [183, 255,   0],
        [183, 255,   0],
        [183, 255,   0]]], dtype=uint8), 'reward': 0.0, 'is_first': True, 'is_last': False, 'is_terminal': False}

My code:

import re
import warnings
import gym
import logging
import random
from typing import Sequence
import numpy as np
import tensorflow as tf
import dreamerv2.api as dv2
from dreamerv2 import common
from dreamerv2.agent import Agent

from pathlib import Path
from agents import BaseAgent


logger = logging.getLogger('root')
warnings.filterwarnings('ignore', '.*box bound precision lowered.*')


class Dreamerv2Agent(BaseAgent):
    def __init__(self,
                 conf_file: Path,
                 env: str,
                 test_mode: bool,
                 prefix: str,
                 batch: int,
                 model_path: Path = Path("~/logdir/trader"),
                 seed: bool = False):
        super().__init__(env, test_mode, prefix, batch, model_path, seed)

        if self.seed:
            random.seed(0)
            np.random.seed(0)
            tf.random.set_seed(0)

        model_path = model_path.expanduser().absolute()
        print(f"Model Path: {model_path}")

        print("Loading config.")
        config_path = (model_path / 'config.yaml')
        config = common.Config.load(config_path)
        self.config = config

        print("Loading config. Done")

        env = gym.make(env)

        replay = common.Replay(
            model_path / 'train_episodes',
            **config.replay
        )
        step = common.Counter(replay.stats['total_steps'])
        env = self.wrapper(env)

        def per_episode(ep):
            length = len(ep['reward']) - 1
            score = float(ep['reward'].astype(np.float64).sum())
            print(f'Episode has {length} steps and return {score:.1f}.')
            logger.scalar('return', score)
            logger.scalar('length', length)
            for key, value in ep.items():
                if re.match(config.log_keys_sum, key):
                    logger.scalar(f'sum_{key}', ep[key].sum())
                if re.match(config.log_keys_mean, key):
                    logger.scalar(f'mean_{key}', ep[key].mean())
                if re.match(config.log_keys_max, key):
                    logger.scalar(f'max_{key}', ep[key].max(0).mean())
            logger.add(replay.stats)
            logger.write()

        driver = common.Driver([env])
        driver.on_episode(per_episode)
        driver.on_step(lambda tran, worker: step.increment())
        driver.on_step(replay.add_step)
        driver.on_reset(replay.add_step)

        prefill = max(0, config.prefill - replay.stats['total_steps'])
        if prefill:
            print(f'Prefill dataset ({prefill} steps).')
            random_agent = common.RandomAgent(env.act_space)
            driver(random_agent, steps=prefill, episodes=1)
            driver.reset()

        print(f'Create agent (step: {step.value}).')
        print(f"Action Space: {env.act_space}")
        print(f"Observation Space: {env.obs_space}")
        self.agent = Agent(config, env.obs_space, env.act_space, step)
        dataset = iter(replay.dataset(**config.dataset))
        train_agent = common.CarryOverState(self.agent.train)
        train_agent(next(dataset))
        print('Create agent. Done!')

        print('Loading checkpoint.')
        vars = (model_path / 'variables.pkl').absolute()
        if vars.exists():
            self.agent.load(vars)
        print('Loading checkpoint. Done!')

    def wrapper(self, env):
        env = common.GymWrapper(env)
        env = common.ResizeImage(env)
        if hasattr(env.act_space['action'], 'n'):
            env = common.OneHotAction(env)
        else:
            env = common.NormalizeAction(env)
        env = common.TimeLimit(env, self.config.time_limit)
        return env

    def get_action(self, observation: Sequence):
        obs = {k: np.expand_dims(v, 0) for k, v in observation.items()}
        output, _ = self.agent.policy(obs, mode='eval')
        output['action'] = tf.squeeze(output['action'])
        return output

I call get_action and pass the returned action to the step function of my gym environment (wrapped by dreamerv2), and this works inside the loop.

But I always get the same action from different observations.

Is something missing from my evaluation method?

Thanks in advance.

ValueError: . Tensor must have rank 4. Received rank 3, shape (208, 64, 64)

Hi,

Congrats on the excellent project.

I'm using a custom gym environment (images with shape [21, 4]; the channel dimension is absent because it's a grayscale image) and I'm getting this error.

Episode has 207 steps and return -19.3.
[2679] return -19.31 / length 207 / total_steps 2679 / total_episodes 13 / loaded_steps 2691 / loaded_episodes 13
Traceback (most recent call last):
  File "/usr/local/Caskroom/miniconda/base/envs/ml/lib/python3.9/site-packages/tensorflow/python/ops/check_ops.py", line 1231, in assert_rank
    assert_op = _assert_rank_condition(x, rank, static_condition,
  File "/usr/local/Caskroom/miniconda/base/envs/ml/lib/python3.9/site-packages/tensorflow/python/ops/check_ops.py", line 1131, in _assert_rank_condition
    raise ValueError(
ValueError: ('Static rank condition failed', 3, 4)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/Fernando/dev/dreamerv2/examples/test.py", line 16, in <module>
    dv2.train(env, config)
  File "/usr/local/Caskroom/miniconda/base/envs/ml/lib/python3.9/site-packages/dreamerv2/api.py", line 94, in train
    driver(random_agent, steps=prefill, episodes=1)
  File "/usr/local/Caskroom/miniconda/base/envs/ml/lib/python3.9/site-packages/dreamerv2/common/driver.py", line 57, in __call__
    [fn(ep, **self._kwargs) for fn in self._on_episodes]
  File "/usr/local/Caskroom/miniconda/base/envs/ml/lib/python3.9/site-packages/dreamerv2/common/driver.py", line 57, in <listcomp>
    [fn(ep, **self._kwargs) for fn in self._on_episodes]
  File "/usr/local/Caskroom/miniconda/base/envs/ml/lib/python3.9/site-packages/dreamerv2/api.py", line 74, in per_episode
    logger.write()
  File "/usr/local/Caskroom/miniconda/base/envs/ml/lib/python3.9/site-packages/dreamerv2/common/logger.py", line 44, in write
    output(self._metrics)
  File "/usr/local/Caskroom/miniconda/base/envs/ml/lib/python3.9/site-packages/dreamerv2/common/logger.py", line 117, in __call__
    tf.summary.image(name, value, step)
  File "/usr/local/Caskroom/miniconda/base/envs/ml/lib/python3.9/site-packages/tensorboard/plugins/image/summary_v2.py", line 140, in image
    return tf.summary.write(
  File "/usr/local/Caskroom/miniconda/base/envs/ml/lib/python3.9/site-packages/tensorflow/python/ops/summary_ops_v2.py", line 762, in write
    op = smart_cond.smart_cond(
  File "/usr/local/Caskroom/miniconda/base/envs/ml/lib/python3.9/site-packages/tensorflow/python/framework/smart_cond.py", line 56, in smart_cond
    return true_fn()
  File "/usr/local/Caskroom/miniconda/base/envs/ml/lib/python3.9/site-packages/tensorflow/python/ops/summary_ops_v2.py", line 750, in record
    summary_tensor = tensor() if callable(tensor) else array_ops.identity(
  File "/usr/local/Caskroom/miniconda/base/envs/ml/lib/python3.9/site-packages/tensorboard/util/lazy_tensor_creator.py", line 66, in __call__
    self._tensor = self._tensor_callable()
  File "/usr/local/Caskroom/miniconda/base/envs/ml/lib/python3.9/site-packages/tensorboard/plugins/image/summary_v2.py", line 112, in lazy_tensor
    tf.debugging.assert_rank(data, 4)
  File "/usr/local/Caskroom/miniconda/base/envs/ml/lib/python3.9/site-packages/tensorflow/python/util/dispatch.py", line 206, in wrapper
    return target(*args, **kwargs)
  File "/usr/local/Caskroom/miniconda/base/envs/ml/lib/python3.9/site-packages/tensorflow/python/ops/check_ops.py", line 1178, in assert_rank_v2
    return assert_rank(x=x, rank=rank, message=message, name=name)
  File "/usr/local/Caskroom/miniconda/base/envs/ml/lib/python3.9/site-packages/tensorflow/python/util/dispatch.py", line 206, in wrapper
    return target(*args, **kwargs)
  File "/usr/local/Caskroom/miniconda/base/envs/ml/lib/python3.9/site-packages/tensorflow/python/ops/check_ops.py", line 1236, in assert_rank
    raise ValueError(
ValueError: .  Tensor  must have rank 4.  Received rank 3, shape (208, 64, 64)

How to fix this?

My test file is:

import gym
import dreamerv2.api as dv2

config = dv2.defaults.update({
    'logdir': '~/logdir/test',
    'log_every': 1e3,
    'train_every': 10,
    'prefill': 1e5,
    'actor_ent': 3e-3,
    'loss_scales.kl': 1.0,
    'discount': 0.99,
}).parse_flags()


env = gym.make('Test-v0')
dv2.train(env, config)

Change `eval_envs` to `num_eval_envs`

dreamerv2/dreamerv2/train.py

Line 129 in e02ceb9

eval_envs = [make_async_env('eval') for _ in range(eval_envs)]

Since the number of evaluation environments is num_eval_envs, I think there should be a change at the end of this line.

Difference in the KL loss terms in the paper and the code

The algorithm for the KL balancing in the paper has the posterior and prior terms given as kl_loss = compute_kl(stop_grad(posterior), prior). So I had assumed that the code would have computed the loss as value = kld(dist(sg(post)), dist(prior)).

But instead the code has the terms reversed, with the KL loss formulated as (in networks.py, line 168) value = kld(dist(prior), dist(sg(post))).

Does that have something to do with the implementation of the kl divergence function in tensorflow_probability?
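
Because the question above hinges on argument order, a small numeric sketch may help. It assumes the convention that a function `kl(a, b)` computes KL(a || b), which is also how `tfp.distributions.kl_divergence(a, b)` is documented; the helper names `categorical_kl` and `balanced_kl_value` and the example probabilities are made up for illustration.

```python
import math


def categorical_kl(p, q):
    """Return KL(p || q) for categorical distributions given as probability lists.

    Assumes every entry of q is strictly positive.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def balanced_kl_value(post, prior, alpha=0.8):
    """Numeric value of the KL-balancing loss as written in the paper.

    Stop-gradients only decide which side receives gradients, so both
    terms have the same value here. A real implementation would wrap one
    argument of each term in a stop-gradient.
    """
    train_prior = categorical_kl(post, prior)  # KL(sg(post) || prior)
    train_post = categorical_kl(post, prior)   # KL(post || sg(prior))
    return alpha * train_prior + (1.0 - alpha) * train_post


post = [0.5, 0.5]
prior = [0.9, 0.1]
print(categorical_kl(post, prior))  # KL(post || prior)
print(categorical_kl(prior, post))  # KL(prior || post), a different number
```

KL divergence is not symmetric, so swapping the two arguments changes the objective itself and not just the direction of the gradients.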

Skipped short episode of length 10.

May I ask what this means?

Is there something wrong with my env?

Thanks,

replay data memory usage?

Hi, thank you for the good code base.
I wonder whether a normal PC can hold all the replay data in memory once the agent goes beyond 1 million steps. If I have about 16 GB of memory, can this agent be trained until the end?
It seems that the replay data size keeps increasing as training proceeds (without truncation). Do you have any idea how the agent could be trained with a small amount of memory?
Thanks!
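
A rough back-of-the-envelope estimate for the memory question above, assuming 64x64 RGB observations stored as uint8 (one byte per value). The image size is an assumption based on the `render_size: [64, 64]` setting in a config posted elsewhere on this page; the helper name `replay_image_bytes` is hypothetical, and the estimate ignores rewards, actions, other observation keys, and Python overhead.

```python
def replay_image_bytes(steps, height=64, width=64, channels=3, bytes_per_value=1):
    """Bytes needed to hold one image per step, without any overhead."""
    return steps * height * width * channels * bytes_per_value


one_million = replay_image_bytes(1_000_000)
print(f"{one_million / 1e9:.1f} GB")  # 12.3 GB
```

Whether all of this actually sits in RAM depends on the replay capacity setting and on how the buffer is implemented, which this sketch does not model.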

How to save and reload trained dreamerv2 models

How can I save a trained dreamerv2 model?
How can I reuse or load a previously trained dreamerv2 model for (i) evaluation or (ii) as a base for further training, at a later time?

Two questions about the paper

Hi @danijar,

Thank you for the great work of DreamerV2 and for sharing the code. It's great news that DreamerV2 achieves SOTA performance on Atari games. But I have two questions about the paper, hoping you can help me clarify them.

On page 6, in the actor loss function section, you mention that Reinforce is unbiased. To the best of my knowledge, Reinforce is unbiased only when it is used with Monte Carlo returns. DreamerV2 uses TD(𝝀) to approximate the return, so bias should be introduced by the error of the function approximator. Am I right? Furthermore, I have a conjecture about why maximizing TD(𝝀) performs worse than Reinforce: when maximizing TD(𝝀), the gradients first flow through the world model $p(s'|s,a)$ and then through the policy model. This could cause two problems: 1) the error of the world model may be propagated to the policy model, resulting in inaccurate gradient estimates; 2) vanishing or exploding gradients may occur, because the prediction graph becomes very deep and there is no mechanism such as skip connections to deal with it. I acknowledge that the second problem may be partially addressed by the intermediate policy losses at each imagination step, but I'd like to bring it up here for discussion. What do you think about my conjecture?
On page 9, when discussing the potential advantage of categorical latents, you mention "a term that would otherwise scale the gradient." Which term do you mean? Is it the $\epsilon$ used in reparameterization?

Intrinsic Rewards

Is it possible to add the use of intrinsic rewards to this method?

Thanks

Question about Plan2explore

For Plan2Explore, the class Plan2Explore in expl.py holds a world model.

class Plan2Explore(common.Module):

  def __init__(self, config, act_space, wm, tfstep, reward):
    self.config = config
    self.reward = reward
    self.wm = wm

And this model will be WorldModel which is the same as dreamerv2.

class Agent(common.Module):

  def __init__(self, config, obs_space, act_space, step):
    self.config = config
    self.obs_space = obs_space
    self.act_space = act_space['action']
    self.step = step
    self.tfstep = tf.Variable(int(self.step), tf.int64)
    self.wm = WorldModel(config, obs_space, self.tfstep)
    self._task_behavior = ActorCritic(config, self.act_space, self.tfstep)
    if config.expl_behavior == 'greedy':
      self._expl_behavior = self._task_behavior
    else:
      self._expl_behavior = getattr(expl, config.expl_behavior)(
          self.config, self.act_space, self.wm, self.tfstep,
          lambda seq: self.wm.heads['reward'](seq['feat']).mode())

For world model training, the code passes all information, including the reward, into the encoder

def loss(self, data, state=None):
    data = self.preprocess(data)
    embed = self.encoder(data)

def preprocess(self, obs):
    dtype = prec.global_policy().compute_dtype
    obs = obs.copy()
    for key, value in obs.items():
      if key.startswith('log_'):
        continue
      if value.dtype == tf.int32:
        value = value.astype(dtype)
      if value.dtype == tf.uint8:
        value = value.astype(dtype) / 255.0 - 0.5
      obs[key] = value
    obs['reward'] = {
        'identity': tf.identity,
        'sign': tf.sign,
        'tanh': tf.tanh,
    }[self.config.clip_rewards](obs['reward'])
    obs['discount'] = 1.0 - obs['is_terminal'].astype(dtype)
    obs['discount'] *= self.config.discount
    return obs

class Encoder(common.Module):
  def _cnn(self, data):
    x = tf.concat(list(data.values()), -1)

But Plan2Explore says there should be no environment reward.

How to reproduce DayDreamer's results in A1 simulator?

Hi,

I find your work really fascinating and I am trying to reproduce DayDreamer's results in an A1 robot dog simulator. The simulator is the A1 robot in Google motion imitation, and I use the same Dreamer parameters as in the default config for the A1 robot. However, after training for a day (about 0.7M steps), the robot has merely learned not to trip over and can hardly walk or run. In the end, the dog walks somewhat like this.

I am wondering what I may be missing for the reproduction. I notice that you have filtered out high frequency motor commands, could it be the main deficiency in my reproduction? Also, how many steps did you train on real A1 robots? About 20Hz * 60sec * 60min =72k steps?

I appreciate any advice from your experience. Thanks a lot!

Performance difference between TruncNormal and TanhNormal

Hey @danijar.

I just noticed that the code uses TruncNormal as the actor distribution instead of TanhNormal as in v1. Did you run ablations on these two choices and find that TruncNormal gives better results? Or was the change made only because the entropy of TruncNormal is easier to compute than that of TanhNormal for the entropy regularizer?

Offsets in actor loss calculation

Hi Danijar,
the critic loss is calculated without the offset, exactly as stated in the paper.

dreamerv2/dreamerv2/agent.py

Lines 299 to 302 in 52fc568

dist = self.critic(seq['feat'][:-1])
target = tf.stop_gradient(target)
weight = tf.stop_gradient(seq['weight'])
critic_loss = -(dist.log_prob(target) * weight[:-1]).mean()

However, for the actor loss there is this offset by 1 (skip first target). Could you explain why this is the case?

dreamerv2/dreamerv2/agent.py

Line 272 in 52fc568

advantage = tf.stop_gradient(target[1:] - baseline)

This is how I imagine the advantage should be calculated (simplified without lambda-target). s_t is the current state of the agent. r_t is the reward of this state and should be ignored, since we are already in the state.

A = target(s_t) - baseline(s_t) = (r_t + r_{t+1} + E[r_{t+2} + ...]) - (r_t + E[r_{t+1} + r_{t+2} + ...]) = (r_{t+1} + E[r_{t+2} + ...]) - (E[r_{t+1} + r_{t+2} + ...]) = Q(a_t,s_t) - V(s_t)

If I understand your code correctly, as a result of the offset, the reward r_t will not cancel and the advantage will be wrong?

Outdated dependencies and broken examples

I wanted to try out dreamerv2 on our own environment (or at least run the examples) but unfortunately ran into some issues along the way.

The README example & Dockerfile use TensorFlow (and other library) versions that are outdated, in some cases pip doesn't even distribute the older versions anymore.

I attempted to run the minigrid example with a recent TensorFlow version.

The import from tensorflow.keras.mixed_precision import experimental as prec in nets.py causes an error as that API is no longer listed under experimental.

MiniGrid has migrated elsewhere and now uses Gymnasium: https://github.com/Farama-Foundation/Minigrid

Using the new minigrid env throws an error:

  ...
  File ".../dreamerv2/api.py", line 77, in train
    env = common.ResizeImage(env)
  File ".../dreamerv2/common/envs.py", line 447, in __init__
    self._keys = [
  File ".../dreamerv2/common/envs.py", line 449, in <listcomp>
    if len(v.shape) > 1 and v.shape[:2] != size]
TypeError: object of type 'NoneType' has no len()

Attempting to call train.py with our own environment (with an observation of an intensity image of size 42x30, contained in a NumPy array) manages to collect a prefill dataset but then fails with the error

File ".../dreamerv2/api.py", line 101, in train
    train_agent(next(dataset))
  File ".../dreamerv2/common/other.py", line 201, in __call__
    self._state, out = self._fn(*args, self._state)
  File ".../tensorflow/python/util/traceback_utils.py", line 153, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File ".../dreamerv2/agent.py", line 60, in train
    state, outputs, mets = self.wm.train(data, state)
  File ".../dreamerv2/agent.py", line 100, in train
    model_loss, state, outputs, metrics = self.loss(data, state)
  File ".../dreamerv2/agent.py", line 108, in loss
    post, prior = self.rssm.observe(
  File ".../dreamerv2/common/nets.py", line 50, in observe
    post, prior = common.static_scan(
  File ".../dreamerv2/common/other.py", line 41, in static_scan
    last = fn(last, inp)
  File ".../dreamerv2/common/nets.py", line 51, in <lambda>
    lambda prev, inputs: self.obs_step(prev[0], *inputs),
  File ".../dreamerv2/common/nets.py", line 96, in obs_step
    prior = self.img_step(prev_state, prev_action, sample)
  File ".../dreamerv2/common/nets.py", line 124, in img_step
    dist = self.get_dist(stats)
  File ".../dreamerv2/common/nets.py", line 81, in get_dist
    dist = tfd.Independent(common.OneHotDist(logit), 1)
  File ".../decorator.py", line 232, in fun
    return caller(func, *(extras + args), **kw)
  File ".../tensorflow_probability/python/distributions/distribution.py", line 342, in wrapped_init
    default_init(self_, *args, **kwargs)
  File ".../tensorflow_probability/python/distributions/independent.py", line 162, in __init__
    super(_Independent, self).__init__(
  File ".../tensorflow_probability/python/distributions/distribution.py", line 603, in __init__
    d for d in self._parameter_control_dependencies(is_init=True)
  File ".../tensorflow_probability/python/distributions/independent.py", line 337, in _parameter_control_dependencies
    raise ValueError('reinterpreted_batch_ndims({}) cannot exceed '
ValueError: reinterpreted_batch_ndims(1) cannot exceed distribution.batch_ndims(0)

I was unable to figure out what exactly the problem is here, and whether it is caused by the updated dependency versions, a problem with our environment or something else entirely.

Could you update the dependencies and examples to make it possible again to try out dreamerv2?

no GPU error

Hi!

After running:
python dreamer.py --logdir ~/logdir/atari_pong/dreamerv2/1 --configs defaults atari --task atari_pong

I got this error:

Traceback (most recent call last):
  File "dreamer.py", line 324, in <module>
    main(parser.parse_args(remaining))
  File "dreamer.py", line 239, in main
    assert tf.config.experimental.list_physical_devices('GPU'), message
AssertionError: No GPU found. To actually train on CPU remove this assert.

Can you help me find the problem?

Straight-thru gradients vs Gumbel Softmax

I'm curious if you considered trying the gumbel softmax as an alternative to the way you implemented straight-thru gradients in this paper/code. It seems like it might be a less-biased way of backpropagating through the operation of sampling from a categorical distribution. The "hard" variant allows you to retain a purely discrete one-hot output in the forward pass, as you did here.

As I understand it, you implemented:

forward: one_hot(draw(logits))
backward: softmax(logits, temp=1)

And the (hard version of the) gumbel softmax is:

forward: one_hot(arg_max(log_softmax(logits) + sample_from_gumbel_dist))
backward: softmax(log_softmax(logits) + sample_from_gumbel_dist, temp=temp_hyperparam)

The forwards in both versions are equivalent - the second is just a reparameterization of the first. By altering the temperature hyperparameter, you can trade off bias and variance.
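
The claim that the two forward passes are equivalent can be checked empirically. The sketch below covers only the forward sampling (the Gumbel-max trick) in plain Python; it does not implement either gradient estimator, which would need an autodiff framework. The function names and the example logits are illustrative.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def gumbel_max_sample(logits, rng):
    """Return argmax_i(logits[i] + g_i) with g_i drawn from Gumbel(0, 1)."""
    noisy = []
    for x in logits:
        u = rng.random()
        while u == 0.0:  # avoid log(0)
            u = rng.random()
        noisy.append(x - math.log(-math.log(u)))  # x + Gumbel noise
    return max(range(len(logits)), key=noisy.__getitem__)


def empirical_frequencies(logits, num_samples, seed=0):
    """Fraction of Gumbel-max samples that land on each category."""
    rng = random.Random(seed)
    counts = [0] * len(logits)
    for _ in range(num_samples):
        counts[gumbel_max_sample(logits, rng)] += 1
    return [c / num_samples for c in counts]


logits = [1.0, 0.0, -1.0]
print(softmax(logits))
print(empirical_frequencies(logits, 20000))
```

With enough samples, the empirical frequencies should approach the softmax probabilities, which is what drawing directly from the categorical distribution would give as well.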
