danijar / dreamerv2
Mastering Atari with Discrete World Models
Home Page: https://danijar.com/dreamerv2
License: MIT License
Hi authors, thanks for your paper and code. I was trying to test dreamerv2 on retro games, and I spent a really long time looking at the code and trying to debug, but I have no clue what's going on.
I ran python3 dreamerv2/train.py --logdir ~/logdir/atari_pong/dreamerv2/1 --configs defaults retro --task retro_Airstriker-Genesis, and the output seemed good for a while:
Logdir /Users/ryantjj/logdir/atari_pong/dreamerv2/1
Create envs.
make_env(): suite is retro.
task: Airstriker-Genesis
This shows that I parsed the arguments correctly, hooked up gym-retro, and edited the configs.yaml and envs.py files to support retro.
But after what seems to be a few iterations, I run into this error:
/dreamerv2/agent.py:79 train *
metrics.update(self._task_behavior.train(self.wm, start, reward))
/dreamerv2/agent.py:212 train *
feat, state, action, disc = world_model.imagine(self.actor, start, hor)
/dreamerv2/agent.py:150 step *
succ = self.rssm.img_step(state, action)
./common/other.py:41 static_scan *
last = fn(last, inp)
./common/nets.py:105 img_step *
x = self.get('img_in', tfkl.Dense, self._hidden, self._act)(x)
/Users/ryantjj/.local/lib/python3.7/site-packages/tensorflow/python/keras/engine/base_layer.py:1013 __call__ **
input_spec.assert_input_compatibility(self.input_spec, inputs, self.name)
/Users/ryantjj/.local/lib/python3.7/site-packages/tensorflow/python/keras/engine/input_spec.py:255 assert_input_compatibility
' but received input with shape ' + display_shape(x.shape))
ValueError: Input 0 of layer dense is incompatible with the layer: expected axis -1 of input shape to have value 1042 but received input with shape (2450, 1036)
So I printed the shapes of these variables in nets.py:105 img_step by inserting the print statements in this function as shown:
@tf.function
def img_step(self, prev_state, prev_action, sample=True):
    prev_stoch = self._cast(prev_state['stoch'])
    prev_action = self._cast(prev_action)
    if self._discrete:
        shape = prev_stoch.shape[:-2] + [self._stoch * self._discrete]
        prev_stoch = tf.reshape(prev_stoch, shape)
    x = tf.concat([prev_stoch, prev_action], -1)
    print("prev_stoch.shape: " + str(prev_stoch.shape))  # OVER HERE
    print("prev_action.shape: " + str(prev_action.shape))  # OVER HERE
    x = self.get('img_in', tfkl.Dense, self._hidden, self._act)(x)
    deter = prev_state['deter']
    x, deter = self._cell(x, [deter])
    deter = deter[0]  # Keras wraps the state in a list.
    x = self.get('img_out', tfkl.Dense, self._hidden, self._act)(x)
    stats = self._suff_stats_layer('img_dist', x)
    dist = self.get_dist(stats)
    stoch = dist.sample() if sample else dist.mode()
    prior = {'stoch': stoch, 'deter': deter, **stats}
    return prior
And these are the terminal outputs when I run the code:
Create agent.
prev_stoch.shape: (50, 1024)
prev_action.shape: (50, 18)
prev_stoch.shape: (50, 1024)
prev_action.shape: (50, 18)
prev_stoch.shape: (50, 1024)
prev_action.shape: (50, 18)
Found 19975379 model parameters.
prev_stoch.shape: (2450, 1024)
prev_action.shape: (2450, 12)
Traceback (most recent call last):
Any clue as to why the prev_action.shape changed from 18 to 12? Thanks for getting through this really long post. I really appreciate your help! :)
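One reading of the error, offered as a guess rather than a diagnosis: the img_in layer was built for 1024 + 18 = 1042 inputs, but the imagination rollout fed it 1024 + 12 = 1036, so two parts of the pipeline disagree about the action size. The sketch below only reproduces that arithmetic. The helper name img_in_width is made up for illustration, and the 32 x 32 latent size is inferred from the printed (50, 1024) shape.

```python
# Hypothetical helper (not part of dreamerv2): width of the vector that
# img_step builds by concatenating the flattened stochastic state and the action.
STOCH = 32      # assumed number of categorical latents
DISCRETE = 32   # assumed classes per latent, so 32 * 32 = 1024 features


def img_in_width(action_dim: int) -> int:
    return STOCH * DISCRETE + action_dim


built_for = img_in_width(18)   # width seen when the layer was first created
received = img_in_width(12)    # width seen inside the imagination rollout
print(built_for, received)
```

If the wrapped environment really exposes 12 actions, it may be worth checking where an 18-dimensional action enters, for example episodes left in the logdir from an earlier run or a hard-coded action size in the edited envs.py.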
Hi,
there seems to be an issue with the discount predictor log likelihood targets.
Line 168 in e783832
Line 126 in e783832
If I understand this correctly, this computes the log probability under a Bernoulli distribution for targets other than 0 or 1, since the discount will be < 1 for non-terminal steps.
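For context, the Bernoulli log-likelihood formula stays well defined when the target lies strictly between 0 and 1: it becomes a negative cross-entropy whose maximum sits at p = target. Whether the repository's binary head evaluates exactly this expression is an assumption. The sketch uses only the standard library and made-up function names.

```python
import math


def log_sigmoid(x: float) -> float:
    # Numerically stable log(sigmoid(x)).
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def bernoulli_log_prob(logit: float, target: float) -> float:
    # target * log(p) + (1 - target) * log(1 - p), with p = sigmoid(logit).
    return target * log_sigmoid(logit) + (1.0 - target) * log_sigmoid(-logit)


def best_logit(target: float) -> float:
    # Logit that maximises the expression above for a target in (0, 1).
    return math.log(target / (1.0 - target))


soft_target = 0.999  # e.g. a discount used as a soft label
peak = best_logit(soft_target)
print(bernoulli_log_prob(peak, soft_target))
```

Under this reading, a target of 0.999 simply pulls the predicted probability toward 0.999 instead of 1.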
I ran this code
python dreamer.py --logdir ~/logdir/atari_pong/dreamerv2/1 --configs defaults atari --task atari_pong
But I got this error
Hi Danijar,
I was just wondering if there is any commented version of the code by any chance?
This might be impossible but is it possible to run this algorithm on an environment with a tuple action space?
If so then how?
Thanks,
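One possible workaround, sketched under the assumption that the tuple consists of one discrete choice and one continuous vector: flatten the tuple into a single vector for the agent and split it again inside an environment wrapper. The function names are hypothetical, and whether DreamerV2 learns well on such a mixed vector is untested here.

```python
def flatten_action(discrete_index: int, num_discrete: int, continuous: list) -> list:
    """Encode (discrete, continuous) as one flat vector: one-hot part, then floats."""
    one_hot = [1.0 if i == discrete_index else 0.0 for i in range(num_discrete)]
    return one_hot + [float(x) for x in continuous]


def unflatten_action(flat: list, num_discrete: int) -> tuple:
    """Invert flatten_action; the discrete part is decoded by argmax."""
    head = flat[:num_discrete]
    discrete_index = max(range(num_discrete), key=lambda i: head[i])
    return discrete_index, list(flat[num_discrete:])


flat = flatten_action(2, num_discrete=4, continuous=[0.5, -0.25])
print(flat, unflatten_action(flat, num_discrete=4))
```

The argmax decoding means the agent's output for the discrete part does not need to be an exact one-hot vector.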
Hi,
How do I create an agent, load the weights, and then call a prediction function to get the action?
I'm trying to recreate one, but many details are missing.
I'm stuck on this error:
python validate.py --model_path ~/logdir/trader
Loading config.
Loading config. Done
Resizing keys image to (64, 64).
Create agent (step: 481310).
Encoder CNN inputs: ['image']
Encoder MLP inputs: []
Decoder CNN outputs: ['image']
Decoder MLP outputs: []
Create agent. Done!
Loading checkpoint.
Load checkpoint with 85 tensors and 32342130 parameters.
Traceback (most recent call last):
File "/opt/homebrew/Caskroom/miniconda/base/envs/ml/lib/python3.8/site-packages/tensorflow/python/util/nest.py", line 568, in assert_same_structure
_pywrap_utils.AssertSameStructure(nest1, nest2, check_types,
ValueError: The two structures don't have the same nested structure.
First structure: type=tuple str=(<tf.Variable 'Variable:0' shape=() dtype=int32, numpy=481310>, <tf.Variable 'Variable:0' shape=() dtype=int32, numpy=0>, <tf.Variable 'Variable:0' shape=() dtype=float64, numpy=1.0>)
Second structure: type=tuple str=(483399, 48242, array([[-0.03663844, 0.02114336, -0.01451669, ..., -0.00666128,
-0.01674761, 0.07526544],
[-0.04041671, 0.02768614, -0.01707186, ..., -0.00505101,
My agent code:
import gym
import logging
import random
from typing import Sequence
import numpy as np
import tensorflow as tf
from dreamerv2.api import defaults
from dreamerv2 import common
from dreamerv2.agent import Agent
from pathlib import Path
from agents import BaseAgent

logger = logging.getLogger('root')


class Dreamerv2Agent(BaseAgent):
    def __init__(self,
                 conf_file: Path,
                 env: str,
                 test_mode: bool,
                 prefix: str,
                 batch: int,
                 model_path: Path,
                 seed: bool):
        super().__init__(env, test_mode, prefix, batch, model_path, seed)
        if self.seed:
            random.seed(0)
            np.random.seed(0)
            tf.random.set_seed(0)
        logger.info("Loading config.")
        config = common.Config.load(
            str(model_path.absolute() / 'config.yaml')
        )
        logger.info("Loading config. Done")
        # config = defaults.parse_flags()
        env = gym.make(env)
        env = common.GymWrapper(env)
        env = common.ResizeImage(env)
        if hasattr(env.act_space['action'], 'n'):
            env = common.OneHotAction(env)
        else:
            env = common.NormalizeAction(env)
        env = common.TimeLimit(env, config.time_limit)
        replay = common.Replay(
            model_path.absolute() / 'train_episodes',
            **config.replay
        )
        step = common.Counter(replay.stats['total_steps'])
        logger.info(f'Create agent (step: {step.value}).')
        self.agent = Agent(config, env.obs_space, env.act_space, step)
        logger.info('Create agent. Done!')
        logger.info('Loading checkpoint.')
        if (model_path.absolute() / 'variables.pkl').exists():
            self.agent.load(model_path.absolute() / 'variables.pkl')
            logger.info('Loading checkpoint. Done!')

    def get_action(self, observation: Sequence):
        # Receive the Gym observation to get action
        output, _ = self.agent.policy(observation)
        return output.get('action')
Hi @danijar, thank you for this great work.
I have some questions about the evaluation protocol used in this code and in the DreamerV2 paper.
It seems like the state of the agent (self._state) is not initialized to 0 on reset. Only in the very first episode is it None, so it is set to 0s. Since driver.reset() is never called again in api.py, self._state will be carried over from previous episodes on episode reset.
Is this intentional?
dreamerv2/dreamerv2/common/driver.py
Lines 32 to 40 in 07d906e
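If carrying the state across episodes turns out to be unintended, one common remedy is to zero the recurrent state wherever the observation is flagged as the first step of an episode. The sketch below illustrates that idea on plain lists. mask_state_on_reset and the layout of the is_first flags are assumptions for illustration, not the code at the referenced commit.

```python
def mask_state_on_reset(state: list, is_first: list) -> list:
    """Return a copy of `state` with entries zeroed where an episode starts.

    state: one state vector (list of floats) per environment.
    is_first: one boolean per environment, True right after a reset.
    """
    return [
        [0.0] * len(vec) if first else list(vec)
        for vec, first in zip(state, is_first)
    ]


carried = [[1.0, 2.0], [3.0, 4.0]]
print(mask_state_on_reset(carried, is_first=[True, False]))
```

Applied before every policy call, such a mask would make an explicit driver.reset() unnecessary for this purpose.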
Dumb question here, but how does this algorithm perform on the Procgen environments, especially compared to PPG?
Thank you
Is there a way to render eval episodes for the OpenAI Gym Atari envs?
Dear author,
After reading the code and the paper, I am confused about why the imagination MDP is introduced and why an imagination horizon is needed. For example, with a trained world model and given a trajectory:
Hi,
I'm very interested in your work but I am unclear if the actor-critic is trained only using the stochastic state as its observation or if it also uses the recurrent state? What's the reasoning behind this choice?
Thanks for all your work and for putting it on Github!
Hi, I have been training an agent using this for a while now but today I have been getting these errors:
Logdir X:\Dreamer_log\logdir\ai\dreamerv2\1
Could not load episode: Object arrays cannot be loaded when allow_pickle=False
Create envs.
.\common\driver.py:64: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray
Train episode has 15691 steps and return 1.4.
Traceback (most recent call last):
File "dreamerv2/train.py", line 119, in <module>
File ".\common\driver.py", line 56, in __call__
File ".\common\driver.py", line 56, in <listcomp>
File "dreamerv2/train.py", line 110, in <lambda>
File "dreamerv2/train.py", line 102, in per_episode
File "C:\Users\Rob\Anaconda2\envs\muzero\lib\site-packages\elements\logger.py", line 36, in video
self.add({name: value})
File "C:\Users\Rob\Anaconda2\envs\muzero\lib\site-packages\elements\logger.py", line 25, in add
f"Shape {value.shape} for name '{name}' cannot be "
ValueError: Shape (15692,) for name 'train_policy' cannot be interpreted as scalar, image, or video.
I haven't changed anything in my system or env.
Any ideas would be great.
Thanks
dreamerv2/dreamerv2/configs.yaml
Line 24 in 912ec5d
Hi Danijar,
do I understand correctly that this line should have batch = 50 to match the hyperparameters in the paper? I am asking because I want to investigate why my own PyTorch implementation is slower.
When using your code, I only found the hyperparameters for Walker-Walk, but not for Humanoid-Walk. The paper reports results for the Humanoid-Walk environment, so could you please add the Humanoid-Walk hyperparameters to your configs.yaml file?
This is the command line and output I get:
(tf2) marten@dpserver:~/rl/dreamerv2$ python3 dreamerv2/train.py --logdir ~/logdir/dmc_walker_walk/dreamerv2/1 --configs dmc --task dmc_walker_walk
Traceback (most recent call last):
  File "dreamerv2/train.py", line 196, in <module>
    main()
  File "dreamerv2/train.py", line 37, in main
    config = config.update(configs[name])
KeyError: 'dmc'
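The KeyError means that no top-level block named dmc exists in the configs.yaml being loaded. Another report on this page runs with --configs dmc_vision, which suggests the block may have been renamed in some versions, though that is an inference rather than something confirmed here. A small check like the following, with a hypothetical select_configs helper and toy data, lists the names that are actually available before merging:

```python
def select_configs(configs: dict, names: list) -> dict:
    """Merge the named config blocks, failing early on unknown names."""
    unknown = [name for name in names if name not in configs]
    if unknown:
        raise KeyError(
            f"Unknown config name(s) {unknown}; available: {sorted(configs)}")
    merged = {}
    for name in names:
        merged.update(configs[name])  # shallow merge, a simplification
    return merged


# Toy stand-in for the parsed configs.yaml; the block names are assumptions.
configs = {'defaults': {'steps': 1e8}, 'dmc_vision': {'action_repeat': 2}}
print(select_configs(configs, ['defaults', 'dmc_vision']))
```

Opening configs.yaml and reading its top-level keys gives the same information without any code.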
Hello. Thanks for your interesting work!
I'm planning to use dreamerv2 on some feature-based tasks. After some searching, I found that no one seems to have tried this before. I'm wondering if there is any difficulty in doing so. What problems would you anticipate?
Hello, danijar! First of all, thanks for your work :)
I've been trying out dreamerv2 this past week and tried to reproduce the Riverraid results. However, I was unsuccessful: the agent only reaches ~5k reward after almost 1e6 train steps. This is the latest result I got. If you need, I can attach TensorBoard graphs later this week.
train_return 5190 / train_length 982 / train_total_steps 9.5e5 / train_total_episodes 1220 / train_loaded_steps 9.5e5 / train_loaded_episodes 1220
I made a small modification to the original code so it runs on multiple GPUs (tf.distribute.MirroredStrategy). Then I trained the agent to play Pong, and the return plot was similar to the one you posted in #8, so I figured it was OK. Also, in the Riverraid output above, half of the run used precision=16 and half used precision=32, since a few other issues, especially #30, mentioned that precision 32 helped. I did not do a full run with precision=32, though.
Do you have any tips on what could be going wrong or what I could do to debug it?
Thanks so much!
Thanks for the updated release. I just downloaded the code and made a fresh environment as detailed in the readme. I tried to train with everything set to default by simply running "python dreamerv2/train.py --logdir ./logdir/atari_pong --configs defaults atari --task atari_pong". After 50k steps, the return doesn't seem to increase at all. A random policy on Atari Pong should score around -20, and that is all I have got so far. Any suggestion on why this is the case?
Here is the configs.yaml in case you need it. The only thing I changed is the steps value on lines 8 and 77, which I reduced to 1e7. Even with fewer steps, I think I should expect some improvement in return.
defaults:
  logdir: /dev/null
  seed: 0
  task: dmc_walker_walk
  num_envs: 1
  steps: 1e7
  eval_every: 1e5
  action_repeat: 1
  time_limit: 0
  prefill: 10000
  image_size: [64, 64]
  grayscale: False
  replay_size: 2e6
  dataset: {batch: 50, length: 50, oversample_ends: True}
  train_gifs: False
  precision: 16
  jit: True
  log_every: 1e4
  train_every: 5
  train_steps: 1
  pretrain: 0
  clip_rewards: identity
  expl_noise: 0.0
  expl_behavior: greedy
  expl_until: 0
  eval_noise: 0.0
  eval_state_mean: False
  pred_discount: True
  grad_heads: [image, reward, discount]
  rssm: {hidden: 400, deter: 400, stoch: 32, discrete: 32, act: elu, std_act: sigmoid2, min_std: 0.1}
  encoder: {depth: 48, act: elu, kernels: [4, 4, 4, 4], keys: [image]}
  decoder: {depth: 48, act: elu, kernels: [5, 5, 6, 6]}
  reward_head: {layers: 4, units: 400, act: elu, dist: mse}
  discount_head: {layers: 4, units: 400, act: elu, dist: binary}
  loss_scales: {kl: 1, reward: 1, discount: 1}
  kl: {free: 0.0, forward: False, balance: 0.8, free_avg: True}
  model_opt: {opt: adam, lr: 3e-4, eps: 1e-5, clip: 100, wd: 1e-6}
  actor: {layers: 4, units: 400, act: elu, dist: trunc_normal, min_std: 0.1}
  critic: {layers: 4, units: 400, act: elu, dist: mse}
  actor_opt: {opt: adam, lr: 1e-4, eps: 1e-5, clip: 100, wd: 1e-6}
  critic_opt: {opt: adam, lr: 1e-4, eps: 1e-5, clip: 100, wd: 1e-6}
  discount: 0.99
  discount_lambda: 0.95
  imag_horizon: 15
  actor_grad: both
  actor_grad_mix: '0.1'
  actor_ent: '1e-4'
  slow_target: True
  slow_target_update: 100
  slow_target_fraction: 1
  expl_extr_scale: 0.0
  expl_intr_scale: 1.0
  expl_opt: {opt: adam, lr: 3e-4, eps: 1e-5, clip: 100, wd: 1e-6}
  expl_head: {layers: 4, units: 400, act: elu, dist: mse}
  disag_target: stoch
  disag_log: True
  disag_models: 10
  disag_offset: 1
  disag_action_cond: True
  expl_model_loss: kl
atari:
  task: atari_pong
  time_limit: 108000  # 30 minutes of game play.
  action_repeat: 4
  steps: 1e7
  eval_every: 1e5
  log_every: 1e5
  prefill: 200000
  grayscale: True
  train_every: 16
  clip_rewards: tanh
  rssm: {hidden: 600, deter: 600, stoch: 32, discrete: 32}
  actor.dist: onehot
  model_opt.lr: 2e-4
  actor_opt.lr: 4e-5
  critic_opt.lr: 1e-4
  actor_ent: 1e-3
  discount: 0.999
  actor_grad: reinforce
  actor_grad_mix: 0
  loss_scales.kl: 0.1
  loss_scales.discount: 5.0
  .*.wd$: 1e-6
dmc:
  task: dmc_walker_walk
  time_limit: 1000
  action_repeat: 2
  eval_every: 1e4
  log_every: 1e4
  prefill: 5000
  train_every: 5
  pretrain: 100
  pred_discount: False
  grad_heads: [image, reward]
  rssm: {hidden: 200, deter: 200}
  model_opt.lr: 3e-4
  actor_opt.lr: 8e-5
  critic_opt.lr: 8e-5
  actor_ent: 1e-4
  discount: 0.99
  actor_grad: dynamics
  kl.free: 1.0
  dataset.oversample_ends: False
debug:
  jit: False
  time_limit: 100
  eval_every: 300
  log_every: 300
  prefill: 100
  pretrain: 1
  train_steps: 1
  dataset.batch: 10
  dataset.length: 10
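One thing that may explain a flat return at 50k steps: the atari block above sets prefill: 200000 and action_repeat: 4, and elsewhere on this page the prefill phase is shown collecting data with a random agent. The sketch below compares the step count against the prefill length under both possible step units. Which unit the step counter uses in this version is an assumption, so both cases are shown, and the function names are hypothetical.

```python
def prefill_in_agent_steps(prefill_frames: int, action_repeat: int) -> int:
    # Each agent step advances the environment by `action_repeat` frames.
    return prefill_frames // action_repeat


def training_started(steps_seen: int, prefill_steps: int) -> bool:
    # Learning can only show up after the random prefill phase is over.
    return steps_seen > prefill_steps


steps_seen = 50_000
# Case 1: the reported steps and prefill are both counted in frames.
same_unit = training_started(steps_seen, 200_000)
# Case 2: the reported steps are agent steps, prefill is given in frames.
converted = training_started(steps_seen, prefill_in_agent_steps(200_000, 4))
print(same_unit, converted)
```

In both cases the run has not yet moved past prefill, so a return near the random-policy level would be expected at that point.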
When running python3 common/plot.py --indir ~/logdir/exp --outdir ~/plots --xaxis step --yaxis eval_return --bins 1e6
I get:
NotADirectoryError: [Errno 20] Not a directory: '/home/USER/logdir/exp/variables.pkl'
It seems that the code is treating files as folders. If I make --indir ~/logdir instead (one level up) I get:
Traceback (most recent call last):
File "/home/USER/code/dreamerv2/dreamerv2/common/plot.py", line 571, in <module>
main(parse_args())
File "/home/USER/code/dreamerv2/dreamerv2/common/plot.py", line 482, in main
runs = load_runs(args)
File "/home/USER/code/dreamerv2/dreamerv2/common/plot.py", line 72, in load_runs
task, method, seed = filename.relative_to(indir).parts[:-1]
ValueError: not enough values to unpack (expected 3, got 1)
I ran training with: python dreamerv2/train.py --logdir ~/logdir/exp --configs dmc_vision --task dmc_walker_walk
Packages:
python=3.9.12
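The second traceback shows plot.py unpacking three directory levels (task, method, seed) between --indir and each file, which matches the ~/logdir/<task>/<method>/<seed> layout used in training commands elsewhere on this page. A logdir of ~/logdir/exp provides only one level. The sketch below reproduces the unpacking with pathlib. split_run_path and the metrics.jsonl file name are assumptions for illustration.

```python
from pathlib import PurePosixPath


def split_run_path(filename: str, indir: str) -> tuple:
    """Split <indir>/<task>/<method>/<seed>/<file> into its three run labels."""
    parts = PurePosixPath(filename).relative_to(indir).parts[:-1]
    if len(parts) != 3:
        raise ValueError(
            f"Expected <task>/<method>/<seed> below {indir}, got {parts}")
    task, method, seed = parts
    return task, method, seed


# 'metrics.jsonl' is a placeholder file name for illustration.
print(split_run_path(
    '/home/USER/logdir/dmc_walker_walk/dreamerv2/1/metrics.jsonl',
    '/home/USER/logdir'))
```

Under that assumption, training with a logdir such as ~/logdir/dmc_walker_walk/dreamerv2/1 and plotting with --indir ~/logdir would satisfy the expected layout.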
Hi,
I was looking at the TruncNormalDist code and was wondering why the samples are re-clipped ('re' because they are already in [-1, 1] due to tfd.TruncatedNormal's sampling).
In practice it seems to me that this wouldn't create an issue, as they are only re-clipped by 1e-6, but I am curious whether I'm missing something.
Thanks!
From my understanding, the posterior of the last timestep from a batch is used as the start state for the next batch.
Is this intended? If so, is it just to avoid always initializing the start state to zeros and have it model some random sample from the current latent distribution?
Line 60 in 07d906e
Hi! I am trying to run the minigrid and crafter examples in a Jupyter notebook, but I keep getting this error when running the config.
Command (minigrid):
config = dv2.defaults.update({
    'logdir': '~/logdir/crafter',
    'log_every': 1e3,
    'train_every': 10,
    'prefill': 1e5,
    'actor_ent': 3e-3,
    'loss_scales.kl': 1.0,
    'discount': 0.99,
}).parse_flags()
Assertion Error:
~/miniconda3/envs/hacking/lib/python3.8/site-packages/dreamerv2/common/flags.py in parse(self, argv, known_only, help_exists)
45 if flag.startswith('--'):
46 raise ValueError(f"Flag '{flag}' did not match any config keys.")
---> 47 assert not remaining, remaining
48 return parsed
49
AssertionError: ['-f', '/home/balloch/.local/share/jupyter/runtime/kernel-244efe8b-5d51-499b-bfa2-7611c02c8e5b.json']
Command (crafter)
config = dv2.defaults.update().parse_flags()
Attribute Error:
AttributeError: 'dict' object has no attribute 'crafter'
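The list in the AssertionError, '-f' followed by a kernel JSON path, is the argument pair that Jupyter passes to its kernel process, so the flag parser appears to be reading the notebook's own command line. A minimal sketch of one workaround is to drop that pair before parsing. strip_kernel_args is a hypothetical helper, and the sketch assumes the parser falls back to sys.argv when called without arguments, which should be verified against the installed version.

```python
import sys


def strip_kernel_args(argv: list) -> list:
    """Remove Jupyter's '-f <connection-file>' pair from an argument list."""
    cleaned, skip_next = [], False
    for arg in argv:
        if skip_next:
            skip_next = False
            continue
        if arg == '-f':
            skip_next = True
            continue
        cleaned.append(arg)
    return cleaned


# In a notebook, this leaves only the program name and any real flags.
sys.argv = sys.argv[:1] + strip_kernel_args(sys.argv[1:])
```

Running this cell once before building the config should keep the kernel arguments away from the parser.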
I think a lot of improvement could be made by using a PPO actor.
Hi danijar,
how many environment steps are you running per update?
In the paper it is 4 (so after every step the agent makes it is updated because of action repeat?), but here in the config it says train_every: 16. What is the correct number?
Best,
Tim
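One way the two numbers could be reconciled, assuming train_every counts raw environment frames while the paper counts agent decisions: with action_repeat: 4, 16 frames correspond to 4 agent steps. This reading is an assumption and is not confirmed in this thread. The helper name is made up.

```python
def agent_steps_between_updates(train_every: int, action_repeat: int) -> float:
    # Each agent step advances the environment by `action_repeat` frames.
    return train_every / action_repeat


print(agent_steps_between_updates(train_every=16, action_repeat=4))
```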
Hi,
While taking a close look at the imagine() function in the world model,
I wonder why the gradient from the input feature to the actor should be stopped.
WorldModel's imagine function (agent.py):
def imagine(self, policy, start, is_terminal, horizon):
    flatten = lambda x: x.reshape([-1] + list(x.shape[2:]))
    start = {k: flatten(v) for k, v in start.items()}
    start['feat'] = self.rssm.get_feat(start)
    start['action'] = tf.zeros_like(policy(start['feat']).mode())
    seq = {k: [v] for k, v in start.items()}
    for _ in range(horizon):
        action = policy(tf.stop_gradient(seq['feat'][-1])).sample()
In my opinion, to get the full gradient from the initial state to the last step of the sequence, shouldn't 'feat' flow through the computation graph without the stop gradient? I just wonder why there is a stop gradient. Have you tried the code without it? What was the result like?
I'm struggling to find the reason for the stop gradient, so I'm asking here for help.
Thanks!
Hi Danijar,
I'm currently doing a project where I'm running DreamerV2 on some of the alternative exploration agents. I have two questions:
print('Create agent.')
train_dataset = iter(train_replay.dataset(**config.dataset))
And this line in the for loop which iterates over the batches.
for _ in range(config.train_steps):
    mets = train_agent(next(train_dataset))
I just wanted to sanity check with you that the next(train_dataset) batch is pulled from the entire buffer in train_replay._complete_eps, and that it's being updated as such, since I don't see train_dataset being updated after its initialisation. I also wanted to clarify that if expl_behavior is set to something other than greedy, the training episodes use the exploratory agent, and that data collected by this agent is sampled in subsequent batches of next(train_dataset). Possibly a silly question, but in case I was missing something I tried the following modification:
for _ in range(config.train_steps):
    train_dataset = iter(train_replay.dataset(**config.dataset))
    mets = train_agent(next(train_dataset))
Where train_dataset was re-initialised and I got worse results than the default behaviour.
a) any steps needed to be done in order for Plan2Explore to work properly, other than just updating configs.yaml with expl_behavior: Plan2Explore (this is what I currently have)
b) it takes more than a few million steps for Plan2Explore to perform as well as default Dreamer. Here's a graph of the situation:
Note: I'd accidentally had action_repeat set to 4 in both these games, so divide by 4 to get the true number of steps on the x-axis.
Thanks in advance!
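On the first question, a plain-Python analogy may help, under the assumption that train_replay.dataset(...) wraps a generator that samples from the live episode store each time a batch is requested. In that case the iterator created once before the loop still sees episodes added later, and re-creating it every step should not be needed. All names below are illustrative.

```python
import random


def sample_batches(buffer: list, batch_size: int, rng: random.Random):
    """Endless generator that samples from `buffer` at the time of each request."""
    while True:
        yield [rng.choice(buffer) for _ in range(batch_size)]


buffer = ['old_episode']
batches = sample_batches(buffer, batch_size=4, rng=random.Random(0))
first = next(batches)            # only the old episode is available

buffer[:] = ['new_episode']      # the buffer changes after the iterator exists
second = next(batches)           # drawn from the updated contents
print(first, second)
```

The iterator holds a reference to the buffer rather than a snapshot, which is why the second batch reflects the later change.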
[Edited to update to 18M steps; images below are from 12M]
Starting a new thread with more relevant detail here. Please feel free to close if you don't think it's appropriate.
We've now trained several instances to at least 10M+ steps with no improvement in Pong scores. This is using the default Pong settings on V100 machines in Colab Pro.
All training settings are the default in the repo, no modifications have been made to the code base as this was a first "test run" of dreamer.
Below are performance graphs. Happy to provide Colab copy or log files if it would be helpful. Would appreciate any insight, even if it's that we need to allow longer training (though the chart in Appendix F appears to show Pong improving by this point in training?).
Will keep training in the meantime and update if anything changes.
Thank you.
[Below images are from 12M steps. However issue persists beyond 18M+ steps]
Hi Danijar,
I'm training using dreamerv2 with success, and I'm getting this result:
[5847513] return 6.12 / length 151 / total_steps 5.8e6 / total_episodes 3.9e4 / loaded_steps 1e5 / loaded_episodes 662
Save checkpoint with 85 tensors and 32333580 parameters.
[5847536] kl_loss 0.67 / image_loss 1.1e4 / reward_loss 0.92 / discount_loss 0.06 / model_kl 0.67 / prior_ent 1.92 / post_ent 1.14 / model_loss 1.1e4 / model_grad_norm 133.28 / actor_loss -1.5e-5 / actor_grad_norm 1.5e-3 / critic_loss 0.82 / critic_grad_norm 0.07 / reward_mean 0.04 / reward_std 0.03 / reward_normed_mean 0.04 / reward_normed_std 0.03 / critic_slow 1.37 / critic_target 1.35 / actor_ent 2e-3 / actor_ent_scale 2e-3 / critic 1.37 / fps 44.95
Episode has 151 steps and return 6.1.
Is the return of 6.1 the cumulative sum of rewards over the episode?
After training, I run the code below and get a cumulative reward of 2.153186.
I'm using this code to evaluate:
import re
import warnings
import gym
import random
from absl import logging  # absl logging; a stdlib `import logging` would be shadowed by it
from typing import Sequence
import numpy as np
import tensorflow as tf
import dreamerv2.api as dv2
from dreamerv2 import common
from dreamerv2.agent import Agent
from pathlib import Path
from agents import BaseAgent

# logger = logging.getLogger('root')
# warnings.filterwarnings('ignore', '.*box bound precision lowered.*')


class Dreamerv2Agent(BaseAgent):
    def __init__(self,
                 conf_file: Path,
                 env: str,
                 test_mode: bool,
                 prefix: str,
                 batch: int,
                 model_path: Path = Path("~/logdir/trader"),
                 seed: bool = False):
        super().__init__(env, test_mode, prefix, batch, model_path, seed)
        if self.seed:
            random.seed(0)
            np.random.seed(0)
            tf.random.set_seed(0)
        model_path = model_path.expanduser().absolute()
        logging.error(f"Model Path: {model_path}")
        logging.error("Loading config.")
        config_path = (model_path / 'config.yaml')
        config = common.Config.load(config_path)
        self.config = config
        logging.error("Loading config. Done")
        env = gym.make(env)
        replay = common.Replay(
            model_path / 'train_episodes',
            **config.replay
        )
        step = common.Counter(replay.stats['total_steps'])
        env = self.wrapper(env)

        def per_episode(ep):
            length = len(ep['reward']) - 1
            score = float(ep['reward'].astype(np.float64).sum())
            logging.error(f'Episode has {length} steps and return {score:.1f}.')
            # logger.scalar('return', score)
            # logger.scalar('length', length)
            for key, value in ep.items():
                # logging.error is a function and has no .scalar attribute,
                # so the statistics are written as plain log messages.
                if re.match(config.log_keys_sum, key):
                    logging.error(f'sum_{key}: {ep[key].sum()}')
                if re.match(config.log_keys_mean, key):
                    logging.error(f'mean_{key}: {ep[key].mean()}')
                if re.match(config.log_keys_max, key):
                    logging.error(f'max_{key}: {ep[key].max(0).mean()}')
            # logger.add(replay.stats)
            # logger.write()

        driver = common.Driver([env])
        driver.on_episode(per_episode)
        driver.on_step(lambda tran, worker: step.increment())
        driver.on_step(replay.add_step)
        driver.on_reset(replay.add_step)
        prefill = max(0, config.prefill - replay.stats['total_steps'])
        if prefill:
            print(f'Prefill dataset ({prefill} steps).')
            random_agent = common.RandomAgent(env.act_space)
            driver(random_agent, steps=prefill, episodes=1)
            driver.reset()
        logging.error(f'Create agent (step: {step.value}).')
        logging.error(f"Action Space: {env.act_space}")
        logging.error(f"Observation Space: {env.obs_space}")
        self.agent = Agent(config, env.obs_space, env.act_space, step)
        dataset = iter(replay.dataset(**config.dataset))
        train_agent = common.CarryOverState(self.agent.train)
        train_agent(next(dataset))
        logging.error('Create agent. Done!')
        logging.error('Loading checkpoint.')
        vars = (model_path / 'variables.pkl').absolute()
        if vars.exists():
            self.agent.load(vars)
            logging.error('Loading checkpoint. Done!')

    def wrapper(self, env):
        env = common.GymWrapper(env)
        env = common.ResizeImage(env)
        if hasattr(env.act_space['action'], 'n'):
            env = common.OneHotAction(env)
        else:
            env = common.NormalizeAction(env)
        env = common.TimeLimit(env, self.config.time_limit)
        return env

    def get_action(self, observation: Sequence):
        obs = {k: np.expand_dims(v, 0) for k, v in observation.items()}
        output, _ = self.agent.policy(obs, mode='eval')
        output['action'] = tf.squeeze(output['action'])
        return output
My config.
action_repeat: 1
actor: {act: elu, dist: auto, layers: 4, min_std: 0.1, norm: none, units: 400}
actor_ent: 0.002
actor_grad: auto
actor_grad_mix: 0.1
actor_opt: {clip: 100, eps: 1e-05, lr: 8e-05, opt: adam, wd: 1e-06}
atari_grayscale: false
clip_rewards: tanh
critic: {act: elu, dist: mse, layers: 4, norm: none, units: 400}
critic_opt: {clip: 100, eps: 1e-05, lr: 0.0002, opt: adam, wd: 1e-06}
dataset: {batch: 16, length: 50}
decoder:
  act: elu
  cnn_depth: 48
  cnn_kernels: [5, 5, 6, 6]
  cnn_keys: .*
  mlp_keys: .*
  mlp_layers: [400, 400, 400, 400]
  norm: none
disag_action_cond: true
disag_log: false
disag_models: 10
disag_offset: 1
disag_target: stoch
discount: 0.99
discount_head: {act: elu, dist: binary, layers: 4, norm: none, units: 400}
discount_lambda: 0.95
dmc_camera: -1
encoder:
  act: elu
  cnn_depth: 48
  cnn_kernels: [4, 4, 4, 4]
  cnn_keys: .*
  mlp_keys: .*
  mlp_layers: [400, 400, 400, 400]
  norm: none
envs: 1
envs_parallel: none
eval_eps: 1
eval_every: 1000.0
eval_noise: 0.0
eval_state_mean: false
expl_behavior: greedy
expl_extr_scale: 0.0
expl_head: {act: elu, dist: mse, layers: 4, norm: none, units: 400}
expl_intr_scale: 1.0
expl_model_loss: kl
expl_noise: 0.0
expl_opt: {clip: 100, eps: 1e-05, lr: 0.0003, opt: adam, wd: 1e-06}
expl_reward_norm: {eps: 1e-08, momentum: 1.0, scale: 1.0}
expl_until: 0
grad_heads: [decoder, reward, discount]
imag_horizon: 15
jit: true
kl: {balance: 0.8, forward: false, free: 0.0, free_avg: true}
log_every: 10000.0
log_keys_max: ^$
log_keys_mean: ^$
log_keys_sum: ^$
log_keys_video: [image]
logdir: ~/logdir/trader
loss_scales: {discount: 1.0, kl: 1.0, proprio: 1.0, reward: 1.0}
model_opt: {clip: 100, eps: 1e-05, lr: 0.0001, opt: adam, wd: 1e-06}
precision: 16
pred_discount: true
prefill: 10000
pretrain: 1
render_size: [64, 64]
replay: {capacity: 100000.0, maxlen: 50, minlen: 50, ongoing: false, prioritize_ends: true}
reward_head: {act: elu, dist: mse, layers: 4, norm: none, units: 400}
reward_norm: {eps: 1e-08, momentum: 1.0, scale: 1.0}
rssm: {act: elu, deter: 1024, discrete: 32, ensemble: 1, hidden: 1024, min_std: 0.1,
  norm: none, std_act: sigmoid2, stoch: 32}
seed: 0
slow_baseline: true
slow_target: true
slow_target_fraction: 1
slow_target_update: 100
steps: 100000000.0
task: dmc_walker_walk
time_limit: 0
train_every: 1000
train_steps: 1
Is there something I'm missing to get a better evaluation?
Best regards,
Fernando Ribeiro
Hi,
I have a question about how you calculate the lambda_target as seen in the equation below.
I've been implementing it to work directly in the environment rather than with the model states to test out how it works and something occurred to me. On your final step, i.e. when t = H, are you not accounting for the reward twice since the Value network is already trained to incorporate the reward of a state into the Value for a state? Would it not be more valid to instead stop calculation at H-1 and use the final H model_state only for bootstrapping, so that the target calculation would become V(s_{H-1}) = r_{H-1} + γ_{H-1} * V(s_H)?
Thanks again,
Lewis
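For reference, the textbook TD(lambda) recursion can be written so that the value of the final state is used only for bootstrapping, which is the variant suggested above. The sketch below is a generic pure-Python version with hypothetical names. It is not taken from the repository, and whether it matches the repository's indexing is not verified here.

```python
def lambda_returns(rewards, values, bootstrap, discount=0.99, lam=0.95):
    """Compute lambda-returns for steps t = 0..H-1.

    rewards[t] and values[t] belong to step t; `bootstrap` is v(s_H), the
    value of the state after the last reward, which contributes no reward.
    """
    next_values = list(values[1:]) + [bootstrap]
    returns = [0.0] * len(rewards)
    agg = bootstrap  # the return "after" the last step is just v(s_H)
    for t in reversed(range(len(rewards))):
        # Mix the one-step target with the longer return from step t + 1.
        agg = rewards[t] + discount * ((1 - lam) * next_values[t] + lam * agg)
        returns[t] = agg
    return returns


print(lambda_returns([1.0, 1.0], [5.0, 7.0], bootstrap=10.0, discount=0.5, lam=0.0))
```

With lam=0 every target reduces to the one-step form r_t + discount * v(s_{t+1}), and with lam=1 it becomes the discounted sum of rewards plus the discounted bootstrap value.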
We successfully ran dreamerv2 on the MiniGrid environment by following the README.md. We are now trying to run dreamerv2 on Atari games, but the environment loaded for an Atari game, specifically SpaceInvaders-v0, does not seem to be compatible with the agent's input. Could you tell us how to run dreamerv2 on SpaceInvaders-v0? Or should we modify some code, such as agent.py and envs.py, so that the agent and environment are compatible with each other?
Dear Danijar,
I'm running into an issue that may be a non-issue, and I thought it was worth checking.
Are you able to reproduce training runs using a fixed random seed?
There is a 'seed' flag in the config file, but I cannot find where it is actually being used, and my runs do not appear fixed to a seed.
Additionally, I am running into what appears to be a weird bug, and I am wondering if you have insight. The model does not train properly if I try to manually set the random seeds by adding, before any other code:
np.random.seed(config.seed)
tf.random.set_seed(config.seed)
print(f'--> Setting random seed to {config.seed}')
For example, using dmc_walker_run, here is a training curve if I do not set the seed
Whereas here is a training curve if the only change I make is to add the above three lines.
This has been a consistent finding. I also see it if I try setting the random seed at other locations in the code.
Otherwise, I am getting consistent success training without setting a random seed ( --> congratulations and thank you for the wonderful codebase and algorithm :-)
Is this a known problem? And/or do you have any insight into why this might be the case. Is there a reason to give up trying to set a random seed?
Note: I have been using the original version of your repo (i.e. from March). Is this something you have knowingly fixed with subsequent updates?
Thank you so much.
Best,
Isaac
Hi again,
Congrats on the excellent work.
My model is improving.
I'm loading the checkpoint successfully and trying to get an action by calling the policy function with an observation in this format:
{'image': array([[[161, 255, 0],
[161, 255, 0],
[161, 255, 0],
...,
[155, 255, 0],
[155, 255, 0],
[155, 255, 0]],
[[161, 255, 0],
[161, 255, 0],
[161, 255, 0],
...,
[155, 255, 0],
[155, 255, 0],
[155, 255, 0]],
[[161, 255, 0],
[161, 255, 0],
[161, 255, 0],
...,
[155, 255, 0],
[155, 255, 0],
[155, 255, 0]],
...,
[[182, 255, 0],
[182, 255, 0],
[182, 255, 0],
...,
[183, 255, 0],
[183, 255, 0],
[183, 255, 0]],
[[182, 255, 0],
[182, 255, 0],
[182, 255, 0],
...,
[183, 255, 0],
[183, 255, 0],
[183, 255, 0]],
[[182, 255, 0],
[182, 255, 0],
[182, 255, 0],
...,
[183, 255, 0],
[183, 255, 0],
[183, 255, 0]]], dtype=uint8), 'reward': 0.0, 'is_first': True, 'is_last': False, 'is_terminal': False}
My code:
import re
import warnings
import gym
import logging
import random
from typing import Sequence
import numpy as np
import tensorflow as tf
import dreamerv2.api as dv2
from dreamerv2 import common
from dreamerv2.agent import Agent
from pathlib import Path
from agents import BaseAgent

logger = logging.getLogger('root')
warnings.filterwarnings('ignore', '.*box bound precision lowered.*')


class Dreamerv2Agent(BaseAgent):

    def __init__(self,
                 conf_file: Path,
                 env: str,
                 test_mode: bool,
                 prefix: str,
                 batch: int,
                 model_path: Path = Path("~/logdir/trader"),
                 seed: bool = False):
        super().__init__(env, test_mode, prefix, batch, model_path, seed)
        if self.seed:
            random.seed(0)
            np.random.seed(0)
            tf.random.set_seed(0)
        model_path = model_path.expanduser().absolute()
        print(f"Model Path: {model_path}")
        print("Loading config.")
        config_path = (model_path / 'config.yaml')
        config = common.Config.load(config_path)
        self.config = config
        print("Loading config. Done")
        env = gym.make(env)
        replay = common.Replay(
            model_path / 'train_episodes',
            **config.replay
        )
        step = common.Counter(replay.stats['total_steps'])
        env = self.wrapper(env)

        def per_episode(ep):
            length = len(ep['reward']) - 1
            score = float(ep['reward'].astype(np.float64).sum())
            print(f'Episode has {length} steps and return {score:.1f}.')
            logger.scalar('return', score)
            logger.scalar('length', length)
            for key, value in ep.items():
                if re.match(config.log_keys_sum, key):
                    logger.scalar(f'sum_{key}', ep[key].sum())
                if re.match(config.log_keys_mean, key):
                    logger.scalar(f'mean_{key}', ep[key].mean())
                if re.match(config.log_keys_max, key):
                    logger.scalar(f'max_{key}', ep[key].max(0).mean())
            logger.add(replay.stats)
            logger.write()

        driver = common.Driver([env])
        driver.on_episode(per_episode)
        driver.on_step(lambda tran, worker: step.increment())
        driver.on_step(replay.add_step)
        driver.on_reset(replay.add_step)
        prefill = max(0, config.prefill - replay.stats['total_steps'])
        if prefill:
            print(f'Prefill dataset ({prefill} steps).')
            random_agent = common.RandomAgent(env.act_space)
            driver(random_agent, steps=prefill, episodes=1)
            driver.reset()
        print(f'Create agent (step: {step.value}).')
        print(f"Action Space: {env.act_space}")
        print(f"Observation Space: {env.obs_space}")
        self.agent = Agent(config, env.obs_space, env.act_space, step)
        dataset = iter(replay.dataset(**config.dataset))
        train_agent = common.CarryOverState(self.agent.train)
        train_agent(next(dataset))
        print('Create agent. Done!')
        print('Loading checkpoint.')
        vars = (model_path / 'variables.pkl').absolute()
        if vars.exists():
            self.agent.load(vars)
        print('Loading checkpoint. Done!')

    def wrapper(self, env):
        env = common.GymWrapper(env)
        env = common.ResizeImage(env)
        if hasattr(env.act_space['action'], 'n'):
            env = common.OneHotAction(env)
        else:
            env = common.NormalizeAction(env)
        env = common.TimeLimit(env, self.config.time_limit)
        return env

    def get_action(self, observation: Sequence):
        obs = {k: np.expand_dims(v, 0) for k, v in observation.items()}
        output, _ = self.agent.policy(obs, mode='eval')
        output['action'] = tf.squeeze(output['action'])
        return output
I call get_action and pass the returned action to step on my gym environment (wrapped by dreamerv2), and this works inside the loop.
But I always get the same action for different observations.
Is something missing from my evaluation method?
Thanks in advance.
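One thing that may be worth checking in the code above: get_action discards the second value returned by agent.policy. The following is a minimal sketch, assuming that second value is a recurrent state that should be passed back in on the next call and reset at the start of an episode. StubAgent and StatefulPolicy are hypothetical names used only for illustration; the stub is not the dreamerv2 Agent, and its positional state argument is an assumption about the policy signature.

```python
import numpy as np


class StubAgent:
    """Hypothetical stand-in for a recurrent policy; not the dreamerv2 Agent."""

    def policy(self, obs, state=None, mode='eval'):
        # Start from a zero state when none is given, as a fresh episode would.
        if state is None:
            state = np.zeros(1)
        # Fold the new observation into the running state.
        state = state + obs['image'].mean()
        action = int(state[0]) % 3
        return {'action': action}, state


class StatefulPolicy:
    """Keeps the recurrent state between calls and resets it on is_first."""

    def __init__(self, agent):
        self.agent = agent
        self.state = None

    def get_action(self, observation):
        if observation.get('is_first', False):
            self.state = None  # New episode: drop the old state.
        obs = {k: np.expand_dims(v, 0) for k, v in observation.items()}
        output, self.state = self.agent.policy(obs, self.state, mode='eval')
        return output


policy = StatefulPolicy(StubAgent())
frame = {'image': np.ones((2, 2, 3)), 'is_first': True}
first = policy.get_action(frame)['action']
frame['is_first'] = False
second = policy.get_action(frame)['action']
```

With the stub, the same frame yields different actions on consecutive calls only because the state is carried over. Whether this explains the constant actions in the real agent would need to be confirmed against the actual policy implementation.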
Hi,
Congrats on the excellent project.
I'm using a custom gym environment (images with shape [21, 4]; the channel dimension is absent because it's a grayscale image) and I'm getting this error.
Episode has 207 steps and return -19.3.
[2679] return -19.31 / length 207 / total_steps 2679 / total_episodes 13 / loaded_steps 2691 / loaded_episodes 13
Traceback (most recent call last):
File "/usr/local/Caskroom/miniconda/base/envs/ml/lib/python3.9/site-packages/tensorflow/python/ops/check_ops.py", line 1231, in assert_rank
assert_op = _assert_rank_condition(x, rank, static_condition,
File "/usr/local/Caskroom/miniconda/base/envs/ml/lib/python3.9/site-packages/tensorflow/python/ops/check_ops.py", line 1131, in _assert_rank_condition
raise ValueError(
ValueError: ('Static rank condition failed', 3, 4)
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/Users/Fernando/dev/dreamerv2/examples/test.py", line 16, in <module>
dv2.train(env, config)
File "/usr/local/Caskroom/miniconda/base/envs/ml/lib/python3.9/site-packages/dreamerv2/api.py", line 94, in train
driver(random_agent, steps=prefill, episodes=1)
File "/usr/local/Caskroom/miniconda/base/envs/ml/lib/python3.9/site-packages/dreamerv2/common/driver.py", line 57, in __call__
[fn(ep, **self._kwargs) for fn in self._on_episodes]
File "/usr/local/Caskroom/miniconda/base/envs/ml/lib/python3.9/site-packages/dreamerv2/common/driver.py", line 57, in <listcomp>
[fn(ep, **self._kwargs) for fn in self._on_episodes]
File "/usr/local/Caskroom/miniconda/base/envs/ml/lib/python3.9/site-packages/dreamerv2/api.py", line 74, in per_episode
logger.write()
File "/usr/local/Caskroom/miniconda/base/envs/ml/lib/python3.9/site-packages/dreamerv2/common/logger.py", line 44, in write
output(self._metrics)
File "/usr/local/Caskroom/miniconda/base/envs/ml/lib/python3.9/site-packages/dreamerv2/common/logger.py", line 117, in __call__
tf.summary.image(name, value, step)
File "/usr/local/Caskroom/miniconda/base/envs/ml/lib/python3.9/site-packages/tensorboard/plugins/image/summary_v2.py", line 140, in image
return tf.summary.write(
File "/usr/local/Caskroom/miniconda/base/envs/ml/lib/python3.9/site-packages/tensorflow/python/ops/summary_ops_v2.py", line 762, in write
op = smart_cond.smart_cond(
File "/usr/local/Caskroom/miniconda/base/envs/ml/lib/python3.9/site-packages/tensorflow/python/framework/smart_cond.py", line 56, in smart_cond
return true_fn()
File "/usr/local/Caskroom/miniconda/base/envs/ml/lib/python3.9/site-packages/tensorflow/python/ops/summary_ops_v2.py", line 750, in record
summary_tensor = tensor() if callable(tensor) else array_ops.identity(
File "/usr/local/Caskroom/miniconda/base/envs/ml/lib/python3.9/site-packages/tensorboard/util/lazy_tensor_creator.py", line 66, in __call__
self._tensor = self._tensor_callable()
File "/usr/local/Caskroom/miniconda/base/envs/ml/lib/python3.9/site-packages/tensorboard/plugins/image/summary_v2.py", line 112, in lazy_tensor
tf.debugging.assert_rank(data, 4)
File "/usr/local/Caskroom/miniconda/base/envs/ml/lib/python3.9/site-packages/tensorflow/python/util/dispatch.py", line 206, in wrapper
return target(*args, **kwargs)
File "/usr/local/Caskroom/miniconda/base/envs/ml/lib/python3.9/site-packages/tensorflow/python/ops/check_ops.py", line 1178, in assert_rank_v2
return assert_rank(x=x, rank=rank, message=message, name=name)
File "/usr/local/Caskroom/miniconda/base/envs/ml/lib/python3.9/site-packages/tensorflow/python/util/dispatch.py", line 206, in wrapper
return target(*args, **kwargs)
File "/usr/local/Caskroom/miniconda/base/envs/ml/lib/python3.9/site-packages/tensorflow/python/ops/check_ops.py", line 1236, in assert_rank
raise ValueError(
ValueError: . Tensor must have rank 4. Received rank 3, shape (208, 64, 64)
How to fix this?
My test file is:
import gym
import dreamerv2.api as dv2
config = dv2.defaults.update({
    'logdir': '~/logdir/test',
    'log_every': 1e3,
    'train_every': 10,
    'prefill': 1e5,
    'actor_ent': 3e-3,
    'loss_scales.kl': 1.0,
    'discount': 0.99,
}).parse_flags()
env = gym.make('Test-v0')
dv2.train(env, config)
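The last error in the traceback says the image summary requires a rank-4 tensor but received shape (208, 64, 64), that is, a sequence of frames without a channel axis. A minimal sketch of one possible workaround, assuming the environment can be changed to emit a trailing channel axis. The helper name add_channel_axis is hypothetical and not part of dreamerv2.

```python
import numpy as np


def add_channel_axis(image):
    """Return an (H, W, 1) array for an (H, W) grayscale image."""
    image = np.asarray(image)
    if image.ndim == 2:
        image = image[..., None]
    return image


# A grayscale observation with the shape mentioned in the report.
gray = np.zeros((21, 4), dtype=np.uint8)
with_channel = add_channel_axis(gray)

# Stacking frames over time now gives a rank-4 array (T, H, W, C).
episode = np.stack([with_channel] * 5)
```

Applying this inside the environment's observation method, before the image reaches any dreamerv2 wrapper, would be one place to try it.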
Line 129 in e02ceb9
Since the number of evaluation environments is num_eval_envs, I think there should be a change at the end of this line.
The algorithm for the KL balancing in the paper has the posterior and prior terms given as kl_loss = compute_kl(stop_grad(posterior), prior). So I had assumed that the code would compute the loss as value = kld(dist(sg(post)), dist(prior)).
But instead the code has the terms reversed, with the KL loss formulated (in networks.py, line 168) as value = kld(dist(prior), dist(sg(post))).
Does that have something to do with the implementation of the KL divergence function in tensorflow_probability?
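The argument order matters because KL divergence is not symmetric. The sketch below only illustrates that general fact with NumPy on two made-up categorical distributions. It does not reproduce the repository's loss, and it does not settle which order the code should use. The helper kl_categorical and the example probabilities are assumptions for illustration.

```python
import numpy as np


def kl_categorical(p, q):
    """KL(p || q) for two categorical distributions given as probability vectors."""
    p = np.asarray(p, dtype=np.float64)
    q = np.asarray(q, dtype=np.float64)
    return float(np.sum(p * (np.log(p) - np.log(q))))


# Made-up example distributions, not values from the model.
post = [0.7, 0.2, 0.1]
prior = [0.4, 0.4, 0.2]

kl_post_prior = kl_categorical(post, prior)  # KL(post || prior)
kl_prior_post = kl_categorical(prior, post)  # KL(prior || post)
```

The two values differ, so swapping the arguments changes the quantity being minimized, independently of where the stop-gradient is placed.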
May I ask what this means?
Is there something wrong with my env?
Thanks,
Hi, thank you for the good code base.
I wonder whether a normal PC can hold all the replay data in memory once the agent goes beyond 1 million steps. If I have about 16 GB of memory, can this agent be trained until the end?
It seems like the replay data size keeps increasing as training proceeds (without truncation). Do you have any idea how the agent could be trained with a small amount of memory?
Thanks!
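For a rough sense of scale, here is a back-of-the-envelope estimate. It assumes uncompressed 64x64 uint8 frames and counts only the image key, ignoring actions, rewards, and Python overhead. Whether the replay buffer actually keeps every frame in RAM is not confirmed here, and the helper replay_bytes is hypothetical.

```python
def replay_bytes(steps, height=64, width=64, channels=3, bytes_per_value=1):
    """Bytes needed to store one uncompressed image per step."""
    return steps * height * width * channels * bytes_per_value


# One million steps of RGB and of grayscale frames, in gigabytes (1e9 bytes).
rgb_gb = replay_bytes(1_000_000) / 1e9
gray_gb = replay_bytes(1_000_000, channels=1) / 1e9
```

Under these assumptions, one million RGB frames need about 12.3 GB and one million grayscale frames about 4.1 GB, so 16 GB would be tight for RGB once the rest of the process is included.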
Hi @danijar,
Thank you for the great work of DreamerV2 and for sharing the code. It's great news that DreamerV2 achieves SOTA performance on Atari games. But I have two questions about the paper, hoping you can help me clarify them.
Is it possible to add the use of intrinsic rewards to this method?
Thanks
For Plan2Explore, the Plan2Explore class in expl.py holds a world model.
class Plan2Explore(common.Module):
  def __init__(self, config, act_space, wm, tfstep, reward):
    self.config = config
    self.reward = reward
    self.wm = wm
And this model is the WorldModel, the same one used by dreamerv2.
class Agent(common.Module):
  def __init__(self, config, obs_space, act_space, step):
    self.config = config
    self.obs_space = obs_space
    self.act_space = act_space['action']
    self.step = step
    self.tfstep = tf.Variable(int(self.step), tf.int64)
    self.wm = WorldModel(config, obs_space, self.tfstep)
    self._task_behavior = ActorCritic(config, self.act_space, self.tfstep)
    if config.expl_behavior == 'greedy':
      self._expl_behavior = self._task_behavior
    else:
      self._expl_behavior = getattr(expl, config.expl_behavior)(
          self.config, self.act_space, self.wm, self.tfstep,
          lambda seq: self.wm.heads['reward'](seq['feat']).mode())
For world model training, the code appears to feed all the information, including the reward, into the encoder:
  def loss(self, data, state=None):
    data = self.preprocess(data)
    embed = self.encoder(data)
  def preprocess(self, obs):
    dtype = prec.global_policy().compute_dtype
    obs = obs.copy()
    for key, value in obs.items():
      if key.startswith('log_'):
        continue
      if value.dtype == tf.int32:
        value = value.astype(dtype)
      if value.dtype == tf.uint8:
        value = value.astype(dtype) / 255.0 - 0.5
      obs[key] = value
    obs['reward'] = {
        'identity': tf.identity,
        'sign': tf.sign,
        'tanh': tf.tanh,
    }[self.config.clip_rewards](obs['reward'])
    obs['discount'] = 1.0 - obs['is_terminal'].astype(dtype)
    obs['discount'] *= self.config.discount
    return obs
class Encoder(common.Module):
  def _cnn(self, data):
    x = tf.concat(list(data.values()), -1)
But Plan2Explore says there should be no env reward.
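Whether the reward actually reaches the encoder depends on which keys are passed to _cnn, and the excerpt above does not show that. A minimal sketch of how an encoder could restrict its inputs, assuming selection by key pattern and array rank. The helper select_image_inputs, the pattern, and the batch contents are hypothetical and should be checked against the real Encoder in nets.py.

```python
import re

import numpy as np


def select_image_inputs(data, pattern=r'image'):
    """Keep only entries whose key matches the pattern and that look like images."""
    return {
        key: value for key, value in data.items()
        if re.match(pattern, key) and value.ndim >= 3
    }


# Made-up batch of two observations with an image, a reward, and a flag.
batch = {
    'image': np.zeros((2, 64, 64, 3), dtype=np.float32),
    'reward': np.zeros((2,), dtype=np.float32),
    'is_terminal': np.zeros((2,), dtype=bool),
}
selected = select_image_inputs(batch)

# Only the selected entries are concatenated along the channel axis.
encoder_input = np.concatenate(list(selected.values()), -1)
```

With a filter like this in front of the concatenation, the reward is present in the data dictionary (for the reward head's loss) without being an encoder input.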
Hi,
I find your work really fascinating and I am trying to reproduce DayDreamer's results in an A1 robot dog simulator. The simulator is the A1 robot from Google motion imitation, and I use the same Dreamer parameters as the default config for the A1 robot. However, after training for a day (about 0.7M steps), the robot has merely learned not to trip over; it can hardly walk or run. In the end, the dog walks somewhat like this.
I am wondering what I may be missing in the reproduction. I notice that you filter out high-frequency motor commands; could that be the main deficiency in my reproduction? Also, how many steps did you train on the real A1 robot? About 20 Hz * 60 s * 60 min = 72k steps?
I appreciate any advice from your experience. Thanks a lot!
Hey @danijar.
I just noticed that the code uses TruncNormal as the actor distribution instead of TanhNormal as in v1. Did you run any ablations on these two choices and find that TruncNormal gives better results? Or is the change only because the entropy of TruncNormal is easier to compute than that of TanhNormal for the entropy regularizer?
Hi Danijar,
the critic loss is calculated without the offset, identical to how it is stated in the paper.
Lines 299 to 302 in 52fc568
However, for the actor loss there is this offset by 1 (skip first target). Could you explain why this is the case?
Line 272 in 52fc568
This is how I imagine the advantage should be calculated (simplified without lambda-target). s_t is the current state of the agent. r_t is the reward of this state and should be ignored, since we are already in the state.
A = target(s_t) - baseline(s_t) = (r_t + r_{t+1} + E[r_{t+2} + ...]) - (r_t + E[r_{t+1} + r_{t+2} + ...]) = (r_{t+1} + E[r_{t+2} + ...]) - (E[r_{t+1} + r_{t+2} + ...]) = Q(a_t,s_t) - V(s_t)
If I understand your code correctly, as a result of the offset, the reward r_t will not cancel and the advantage will be wrong?
I wanted to try out dreamerv2 on our own environment (or at least the examples) but unfortunately ran into some issues along the way.
The README example & Dockerfile use TensorFlow (and other library) versions that are outdated, in some cases pip doesn't even distribute the older versions anymore.
I attempted to run the minigrid example with a recent TensorFlow version.
The import from tensorflow.keras.mixed_precision import experimental as prec in nets.py causes an error, as that API is no longer listed under experimental.
MiniGrid has migrated elsewhere and now uses Gymnasium: https://github.com/Farama-Foundation/Minigrid
Using the new minigrid env throws an error:
...
File ".../dreamerv2/api.py", line 77, in train
env = common.ResizeImage(env)
File ".../dreamerv2/common/envs.py", line 447, in __init__
self._keys = [
File ".../dreamerv2/common/envs.py", line 449, in <listcomp>
if len(v.shape) > 1 and v.shape[:2] != size]
TypeError: object of type 'NoneType' has no len()
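The TypeError above is raised when len() is applied to a shape that is None. A minimal sketch of a guard, assuming at least one observation space in the new environment reports shape None. The helper keys_to_resize and the stand-in spaces (including their key names) are hypothetical and not part of dreamerv2 or MiniGrid.

```python
from types import SimpleNamespace


def keys_to_resize(spaces, size=(64, 64)):
    """Return keys of image-like spaces whose leading two dims differ from size."""
    keys = []
    for key, space in spaces.items():
        shape = getattr(space, 'shape', None)
        if shape is None:
            continue  # Skip spaces without a shape instead of calling len(None).
        if len(shape) > 1 and tuple(shape[:2]) != tuple(size):
            keys.append(key)
    return keys


# Stand-in observation spaces: an image, a scalar, and one without a shape.
spaces = {
    'image': SimpleNamespace(shape=(7, 7, 3)),
    'direction': SimpleNamespace(shape=()),
    'mission': SimpleNamespace(shape=None),
}
resize = keys_to_resize(spaces)
```

This only avoids the crash in key selection. Later stages may still need the shapeless observation to be dropped or encoded some other way.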
Attempting to call train.py with our own environment (with an observation of an intensity image of size 42x30, contained in a NumPy array) manages to collect a prefill dataset but then fails with the error
File ".../dreamerv2/api.py", line 101, in train
train_agent(next(dataset))
File ".../dreamerv2/common/other.py", line 201, in __call__
self._state, out = self._fn(*args, self._state)
File ".../tensorflow/python/util/traceback_utils.py", line 153, in error_handler
raise e.with_traceback(filtered_tb) from None
File ".../dreamerv2/agent.py", line 60, in train
state, outputs, mets = self.wm.train(data, state)
File ".../dreamerv2/agent.py", line 100, in train
model_loss, state, outputs, metrics = self.loss(data, state)
File ".../dreamerv2/agent.py", line 108, in loss
post, prior = self.rssm.observe(
File ".../dreamerv2/common/nets.py", line 50, in observe
post, prior = common.static_scan(
File ".../dreamerv2/common/other.py", line 41, in static_scan
last = fn(last, inp)
File ".../dreamerv2/common/nets.py", line 51, in <lambda>
lambda prev, inputs: self.obs_step(prev[0], *inputs),
File ".../dreamerv2/common/nets.py", line 96, in obs_step
prior = self.img_step(prev_state, prev_action, sample)
File ".../dreamerv2/common/nets.py", line 124, in img_step
dist = self.get_dist(stats)
File ".../dreamerv2/common/nets.py", line 81, in get_dist
dist = tfd.Independent(common.OneHotDist(logit), 1)
File ".../decorator.py", line 232, in fun
return caller(func, *(extras + args), **kw)
File ".../tensorflow_probability/python/distributions/distribution.py", line 342, in wrapped_init
default_init(self_, *args, **kwargs)
File ".../tensorflow_probability/python/distributions/independent.py", line 162, in __init__
super(_Independent, self).__init__(
File ".../tensorflow_probability/python/distributions/distribution.py", line 603, in __init__
d for d in self._parameter_control_dependencies(is_init=True)
File ".../tensorflow_probability/python/distributions/independent.py", line 337, in _parameter_control_dependencies
raise ValueError('reinterpreted_batch_ndims({}) cannot exceed '
ValueError: reinterpreted_batch_ndims(1) cannot exceed distribution.batch_ndims(0)
I was unable to figure out what exactly the problem is here, and whether it is caused by the updated dependency versions, a problem with our environment or something else entirely.
Could you update the dependencies and examples to make it possible again to try out dreamerv2?
Hi!
After running:
python dreamer.py --logdir ~/logdir/atari_pong/dreamerv2/1 --configs defaults atari --task atari_pong
I got this error:
Traceback (most recent call last):
File "dreamer.py", line 324, in <module>
main(parser.parse_args(remaining))
File "dreamer.py", line 239, in main
assert tf.config.experimental.list_physical_devices('GPU'), message
AssertionError: No GPU found. To actually train on CPU remove this assert.
Can you help me find the problem?
I'm curious if you considered trying the gumbel softmax as an alternative to the way you implemented straight-thru gradients in this paper/code. It seems like it might be a less-biased way of backpropagating through the operation of sampling from a categorical distribution. The "hard" variant allows you to retain a purely discrete one-hot output in the forward pass, as you did here.
As I understand it, you implemented:
forward: one_hot(draw(logits))
backward: softmax(logits, temp=1)
And the (hard version of the) gumbel softmax is:
forward: one_hot(arg_max(log_softmax(logits) + sample_from_gumbel_dist))
backward: softmax(log_softmax(logits) + sample_from_gumbel_dist, temp=temp_hyperparam)
The forwards in both versions are equivalent - the second is just a reparameterization of the first. By altering the temperature hyperparameter, you can trade off bias and variance.
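The forward equivalence mentioned above rests on the Gumbel-max trick. The NumPy sketch below shows only the forward values (the hard one-hot sample and the soft relaxation built from the same noise). It has no autodiff, so it does not implement the straight-through gradient itself. The function names and the example logits are assumptions for illustration.

```python
import numpy as np


def softmax(x, temp=1.0):
    """Numerically stable softmax with a temperature."""
    z = (x - np.max(x)) / temp
    e = np.exp(z)
    return e / e.sum()


def gumbel_softmax_sample(logits, temp, rng):
    """Return (hard one-hot, soft relaxation) built from the same Gumbel noise."""
    log_probs = np.log(softmax(logits))
    gumbel = -np.log(-np.log(rng.uniform(size=logits.shape)))
    perturbed = log_probs + gumbel
    hard = np.eye(len(logits))[np.argmax(perturbed)]
    soft = softmax(perturbed, temp)
    return hard, soft


rng = np.random.default_rng(0)
logits = np.array([1.0, 0.0, -1.0])
hard, soft = gumbel_softmax_sample(logits, temp=0.5, rng=rng)

# Empirical check that the hard samples follow softmax(logits).
counts = sum(gumbel_softmax_sample(logits, 0.5, rng)[0] for _ in range(20000))
freqs = counts / 20000
```

In an autodiff framework, the hard variant would return the one-hot value in the forward pass while taking gradients through the soft relaxation. Lower temperatures bring the relaxation closer to the one-hot sample at the cost of higher gradient variance.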