
muzero-general's People

Contributors

ahainaut, dribnet, fidel-schaposnik, fred-drake, goshawk22, littlev, lukewood, manuel-delverme, mokemokechicken, sanjosolutions, sergiovieri, sondrelg, tfzee, theword, vincentzhang, werner-duvaud, xuxiyang1993


muzero-general's Issues

Better Reward assignment at the end of Games

Suppose, for example, that you want to implement a chess game and give a reward of 1 to the winner and -1 to the loser at the end of the game.

In the current implementation this wouldn't really work, because you only know that a game is finished when somebody wins, and at that point it isn't possible to give the loser their reward (in hindsight), unless you iterate over all players with a dummy observation and then hand out their rewards. So without that, only the last player to move gets a reward.

I would propose changing the AbstractGame wrapper to include a get_reward method which returns a reward for each player, and then assigning that reward to the last time step at which the player acted. This would only happen at the end of the game and shouldn't affect gym-like games which give a reward at each step, I think.
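A rough sketch of what I have in mind, assuming a hypothetical get_reward(player) hook on the game and a game history that records which player acted at each step (the attribute names here are illustrative, not the repository's):

    def assign_terminal_rewards(game, game_history):
        # Hypothetical sketch: at the end of the game, back-fill each
        # player's final reward onto the last step at which they acted.
        # game.get_reward(player) is the proposed hook (e.g. +1 / -1 / 0).
        for player in set(game_history.to_play_history):
            for i in reversed(range(len(game_history.to_play_history))):
                if game_history.to_play_history[i] == player:
                    game_history.reward_history[i] = game.get_reward(player)
                    break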

But I'm not sure about this, so I thought I'd open an issue to get feedback before actually changing it and making a pull request.

Question about "reward" values in new games

For any new games which I might add, is there a min/max range the reward has to stay within? For the gym Atari games, I am not aware of any clipping.

Specifically:

    def step(self, action):
        observation, reward, done = self.env.step(action)  # <-- does this reward need to be in some range?
        return observation, reward, done
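For context on why I'm asking: the config comments say values and rewards are scaled and encoded on a vector ranging from -support_size to support_size, so I'm unsure whether very large raw rewards are a problem. If clipping turns out to be needed, a minimal Atari-style sketch (purely illustrative, not something the repository already does as far as I know):

    import numpy

    def clip_reward(reward, low=-1.0, high=1.0):
        # Atari-style reward clipping; keeps reward targets in a small,
        # fixed range before they are scaled and encoded.
        return float(numpy.clip(reward, low, high))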
Thanks!

Last edit broke something?

When I run a freshly cloned version of this program to train connect4, all graphs in the Total_reward-section of tensorboard become flat lines (Total reward=10, Mean value=0, Episode length=11, MuZero reward=0, Opponent reward=0). The graphs in all of the other sections look ok I think. I get this result both locally and in Google colab.

It worked before 91afb1d so maybe something broke there?

Connect 4 bug

There is a bug in the connect 4 game logic, but it's small enough that I'm sure it hasn't affected training. If a player (parity says it must be player 2) wins on the last move of the game (move 42), it will be rewarded as a draw.
From connect4.py:

    reward = 1 if done and 0 < len(self.legal_actions()) else 0
No legal actions doesn't mean someone didn't just win.

Example:

    a = self.Game()
    mv = [2,1,1,1,3,1,4,1,6,5,7,2,3,4,5,6,7,7,6,5,4,3,2,7,6,5,4,3,2,2,3,4,5,6,7,7,6,5,4,3,2,1]
    for m in mv:
        x = a.step(m-1)[1:]
        print(x)
        a.render()

Last step prints (0, True) even though the second player won. https://connect4.gamesolver.org/?pos=211131416572345677654327654322345677654321
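A hedged sketch of how the terminal condition could be written instead, assuming the game exposes some kind of winner check (I'm calling it have_winner() here; the actual method name in connect4.py may differ):

    # Sketch only: reward the player who just moved whenever they have won,
    # even if the board is now full (covers a win on move 42).
    done = self.have_winner() or len(self.legal_actions()) == 0
    reward = 1 if self.have_winner() else 0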

Question about learning rate

Hi,
I trained cartpole and tictactoe, and used TensorBoard to analyze the results. I found that the learning rate for cartpole changed over the training process, but the learning rate for tictactoe remained unchanged during all steps. Could you tell me why the learning rate remains unchanged? Thanks

Screenshot 2020-05-08 at 11:30:51 AM

Screenshot 2020-05-08 at 11:30:39 AM
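In case it is relevant, my understanding is that the learning rate follows roughly this exponential schedule, so a config with lr_decay_rate = 1 would give a constant learning rate and therefore a flat curve (a standalone illustration, not necessarily the repository's exact code):

    def exponential_lr(lr_init, lr_decay_rate, lr_decay_steps, training_step):
        # lr_decay_rate == 1 keeps the learning rate constant;
        # lr_decay_rate < 1 decays it exponentially over training.
        return lr_init * lr_decay_rate ** (training_step / lr_decay_steps)

    print(exponential_lr(0.05, 0.9, 10000, 50000))  # decays over training
    print(exponential_lr(0.05, 1.0, 10000, 50000))  # stays at 0.05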

Place for communication?

Is there a Slack, Discord, Discourse, or other forum for communicating about this project? I'd love to help contribute, but my ML skills aren't very strong. I was wondering if I could assist with a CI/CD system or help create a grid search for hyperparameter tuning.

Default Q value for unexplored nodes

Here:

    value_score = 0

you assign value 0 to nodes that haven't yet been visited. However, since values are normalized to the range [0, 1], a better estimate for unexplored nodes would be 0.5. Otherwise the exploration-exploitation trade-off will lean towards exploitation (after the first arbitrary choice, that branch will likely have the maximum value in min_max_stats, and therefore value 1, particularly when values and rewards are positive).
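A minimal sketch of the change I'm suggesting (details of the surrounding ucb_score code are from memory and may differ):

    # Sketch: unvisited children get a neutral 0.5 instead of 0, since
    # normalized Q values live in [0, 1].
    if child.visit_count > 0:
        value_score = min_max_stats.normalize(
            child.reward + self.config.discount * child.value()
        )
    else:
        value_score = 0.5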

Masking allowed actions at root node

Hi Werner, I've really enjoyed tinkering with the codebase as I learn all aspects of MuZero. I see in the MuZero paper they describe how they mask the policy logits to allowable moves in the root node of the MCTS:

AlphaZero used the set of legal actions obtained from the simulator to mask the prior
produced by the network everywhere in the search tree. MuZero only masks legal actions at the root of the search tree where the environment can be queried, but does not perform any masking within the search tree. This is possible because the network rapidly learns not to predict actions that never occur in the trajectories it is trained on.

Do you think the MCTS should mask out illegal moves if it is at the root node (if the game being played supports it)? It might speed up the learning process for these types of games. If so, do you want me to send you a pull request for it?
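If it helps the discussion, a hedged sketch of what root-only masking could look like (function and variable names here are illustrative, not from the repository):

    import torch

    def mask_root_policy(policy_logits, legal_actions, action_space_size):
        # Put -inf on illegal actions before the softmax so the root prior
        # assigns them zero probability; inner tree nodes stay unmasked,
        # as described in the MuZero paper.
        mask = torch.full((action_space_size,), float("-inf"))
        mask[legal_actions] = 0.0
        return policy_logits + mask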

Competitive play?

Hi,

Thanks for making this, it looks really nice to use! I'm curious: are there any extra modifications I need to make to my game file for competitive play between two agents?

Cheers,
Miles

Question about ucb_score

I have a question about the calculation of ucb_score.
In particular value_score is calculated as:

value_score = min_max_stats.normalize( 
                 child.reward + self.config.discount * child.value() 
             ) 

In two player games like Tic-Tac-Toe the players alternate turns. The value of a given state for me is the exact negative of the value of a given state for my opponent. I see this is represented in backpropagate():

node.value_sum += value if node.to_play == to_play else -value

My question is: Doesn't this information need to be taken into account when calculating ucb_score? That is, don't we have to negate the value of child.value() since the child is a state from the perspective of our opponent?
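To make the question concrete, this is the variant I have in mind (just illustrating the question, not claiming this is what the code should do):

    # Variant being asked about: view the child's value from the parent's
    # perspective by negating it in a two-player, zero-sum game.
    value_score = min_max_stats.normalize(
        child.reward + self.config.discount * -child.value()
    )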

For that matter, I see expand always uses the unmodified reward from recurrent_inference but backpropagate negates the value depending on the value of virtual_to_play?

Can you help me better understand how this works in muzero?

Graph Based Games

How would I go about using this for a graph-based game like Risk?

"RuntimeError: CUDA out of memory" while running default train() for Connect4 on 4GB GPU

Hi, just trying the code out. Selecting [2] Connect4, then [0] train, it logs this error in the terminal after a few seconds

Any ideas on how to avoid this error? Reduce the batch size? Tensorboard seems to keep running despite the error, although the terminal output is stuck on Last test reward: 10.00. Training step: 0/100000. Played games: 27. Loss: 0.00, with only the played-games number changing.

2020-07-13 18:07:45,104.ERROR worker.py:987 -- Possible unhandled error from worker: ray::Trainer.continuous_update_weights() (pid=100153, ip=192.168.0.2)
  File "python/ray/_raylet.pyx", line 446, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 400, in ray._raylet.execute_task.function_executor
  File "/home/user/muzero-general/trainer.py", line 66, in continuous_update_weights
    ) = self.update_weights(batch)
  File "/home/user/muzero-general/trainer.py", line 138, in update_weights
    observation_batch
  File "/home/user/muzero-general/models.py", line 541, in initial_inference
    policy_logits, value = self.prediction(encoded_state)
  File "/home/user/muzero-general/models.py", line 461, in prediction
    policy, value = self.prediction_network(encoded_state)
  File "/home/user/anaconda3/envs/muzero/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/user/muzero-general/models.py", line 370, in forward
    out = block(out)
  File "/home/user/anaconda3/envs/muzero/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/user/muzero-general/models.py", line 204, in forward
    out = self.conv1(x)
  File "/home/user/anaconda3/envs/muzero/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/user/anaconda3/envs/muzero/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 353, in forward
    return self._conv_forward(input, self.weight)
  File "/home/user/anaconda3/envs/muzero/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 350, in _conv_forward
    self.padding, self.dilation, self.groups)
RuntimeError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 3.82 GiB total capacity; 191.73 MiB already allocated; 19.31 MiB free; 210.00 MiB reserved in total by PyTorch)
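For reference, the first thing I plan to try is lowering the batch size in the Connect4 config (a guess on my part, not an official fix), e.g.:

    # games/connect4.py, inside the config (sketch):
    self.batch_size = 64  # smaller batches to fit a 4 GB GPU

A smaller network should also reduce memory usage if lowering the batch size alone is not enough.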


I trained Connect4 for 4,624,102 games over 9 days, but MuZero still plays poorly.

I have changed some parameters:

    ### Game
    self.observation_shape = 6 * 7  # Dimensions of the game observation
    self.action_space = [i for i in range(7)]  # Fixed list of all possible actions
    self.players = [i for i in range(2)]  # List of players
    self.stacked_observations = 2  # Number of previous observations to add to the current observation


    ### Self-Play
    # Number of simultaneous threads self-playing to feed the replay buffer
    self.num_actors = 8
    # Maximum number of moves if game is not finished before
    self.max_moves = 50
    # Number of future moves self-simulated
    self.num_simulations = 30
    # Chronological discount of the reward
    self.discount = 0.997
    # Number of seconds to wait after each played game to adjust the self play / training ratio to avoid over/underfitting
    self.self_play_delay = 0

    # Root prior exploration noise
    self.root_dirichlet_alpha = 0.25
    self.root_exploration_fraction = 0.25

    # UCB formula
    self.pb_c_base = 19652
    self.pb_c_init = 1.25


    ### Network
    self.encoding_size = 32
    self.hidden_layers = [64]

    # Value and reward are scaled (with almost sqrt) and encoded on a vector with a range of -support_size to support_size
    self.support_size = 10

    ### Training
    # Path to store the model weights
    self.results_path = "./pretrained"

    # Total number of training steps (i.e. weight updates according to a batch)
    self.training_steps = 1000000

    # Number of parts of games to train on at each training step
    self.batch_size = 512

    # Number of game moves to keep for every batch element
    self.num_unroll_steps = 10

    # Number of training steps before using the model for self-playing
    self.checkpoint_interval = 10

    # Number of self-play games to keep in the replay buffer
    self.window_size = 1000

    # Number of steps in the future to take into account for calculating the target value
    self.td_steps = 10

    # Number of seconds to wait after each training step to adjust the self play / training ratio to avoid over/underfitting
    self.training_delay = 0

    # Train on GPU if available
    self.training_device = "cuda" if torch.cuda.is_available() else "cpu"

    self.weight_decay = 1e-4  # L2 weights regularization
    self.momentum = 0.9

    # Exponential learning rate schedule
    self.lr_init = 0.05  # Initial learning rate
    self.lr_decay_rate = 1
    self.lr_decay_steps = 10000


    ### Test
    self.test_episodes = 2  # Number of games played to evaluate the network

Screenshot from 2020-02-27 11-21-37

There was an error during training, maybe because I installed gnome-tweaks-tool while training; the error appeared after I installed that software.

This is me playing against MuZero (MuZero always goes first):

muzero-test

My environment is:
Ubuntu 18.04
i7-7700 16G
RTX 2080 8G

How can I train a strong MuZero for Connect4?

Average pooling after residual tower

Hi! Could you provide a reference for the average pooling layer inserted in the residual networks after the tower and before the value, reward and policy heads? Can't seem to find any sign of it in the papers...

muzero on Google Colab

Hi there,

Great work on the repository.

Keen to try running it on Google Colab, but I run into an error in muzero.py.

Has anyone had any luck getting it to run on Google Colab?

Thanks!


Welcome to MuZero! Here's a list of games:
0. cartpole
1. connect4
2. gomoku
3. lunarlander
4. tictactoe
Enter a number to choose the game: 0

0. Train
1. Load pretrained model
2. Render some self play games
3. Play against MuZero
4. Exit
Enter a number to choose an action: 0
2020-03-26 19:58:46,830	INFO resource_spec.py:212 -- Starting Ray with 6.64 GiB memory available for workers and up to 3.34 GiB for objects. You can adjust these settings with ray.init(memory=<bytes>, object_store_memory=<bytes>).
2020-03-26 19:58:47,246	INFO services.py:1123 -- View the Ray dashboard at localhost:8265
2020-03-26 19:58:50,509	WARNING worker.py:1072 -- The dashboard on node 2fdf91b82296 failed with the following error:
Traceback (most recent call last):
  File "/usr/lib/python3.6/asyncio/base_events.py", line 1062, in create_server
    sock.bind(sa)
OSError: [Errno 99] Cannot assign requested address

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/ray/dashboard/dashboard.py", line 920, in <module>
    dashboard.run()
  File "/usr/local/lib/python3.6/dist-packages/ray/dashboard/dashboard.py", line 368, in run
    aiohttp.web.run_app(self.app, host=self.host, port=self.port)
  File "/usr/local/lib/python3.6/dist-packages/aiohttp/web.py", line 433, in run_app
    reuse_port=reuse_port))
  File "/usr/lib/python3.6/asyncio/base_events.py", line 484, in run_until_complete
    return future.result()
  File "/usr/local/lib/python3.6/dist-packages/aiohttp/web.py", line 359, in _run_app
    await site.start()
  File "/usr/local/lib/python3.6/dist-packages/aiohttp/web_runner.py", line 104, in start
    reuse_port=self._reuse_port)
  File "/usr/lib/python3.6/asyncio/base_events.py", line 1066, in create_server
    % (sa, err.strerror.lower()))
OSError: [Errno 99] error while attempting to bind on address ('::1', 8265, 0, 0): cannot assign requested address


Training...
Run tensorboard --logdir ./results and go to http://localhost:6006/ to see in real time the training performance.

2020-03-26 19:58:52,250	WARNING worker.py:1072 -- Failed to unpickle actor class 'Trainer' for actor ID 45b95b1c0100. Traceback:
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/ray/function_manager.py", line 494, in _load_actor_class_from_gcs
    actor_class = pickle.loads(pickled_class)
  File "/usr/local/lib/python3.6/dist-packages/ray/cloudpickle/cloudpickle.py", line 1136, in subimport
    __import__(name)
ModuleNotFoundError: No module named 'models'

(pid=630) 2020-03-26 19:58:52,240	ERROR function_manager.py:496 -- Failed to load actor class Trainer.
(pid=630) Traceback (most recent call last):
(pid=630)   File "/usr/local/lib/python3.6/dist-packages/ray/function_manager.py", line 494, in _load_actor_class_from_gcs
(pid=630)     actor_class = pickle.loads(pickled_class)
(pid=630)   File "/usr/local/lib/python3.6/dist-packages/ray/cloudpickle/cloudpickle.py", line 1136, in subimport
(pid=630)     __import__(name)
(pid=630) ModuleNotFoundError: No module named 'models'
2020-03-26 19:58:52,733	WARNING worker.py:1072 -- WARNING: 6 PYTHON workers have been started. This could be a result of using a large number of actors, or it could be a consequence of using nested tasks (see https://github.com/ray-project/ray/issues/3644) for some a discussion of workarounds.
---------------------------------------------------------------------------
RayTaskError(ModuleNotFoundError)         Traceback (most recent call last)
<ipython-input-6-b332aadb407c> in <module>()
    226         choice = int(choice)
    227         if choice == 0:
--> 228             muzero.train()
    229         elif choice == 1:
    230             path = input("Enter a path to the model.weights: ")

1 frames
/usr/local/lib/python3.6/dist-packages/ray/worker.py in get(object_ids, timeout)
   1500                     worker.core_worker.dump_object_store_memory_usage()
   1501                 if isinstance(value, RayTaskError):
-> 1502                     raise value.as_instanceof_cause()
   1503                 else:
   1504                     raise value

RayTaskError(ModuleNotFoundError): ray::IDLE (pid=631, ip=172.28.0.2)
  File "python/ray/_raylet.pyx", line 430, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 433, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 434, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 306, in ray._raylet.deserialize_args
  File "/usr/local/lib/python3.6/dist-packages/ray/serialization.py", line 323, in deserialize_objects
    self._deserialize_object(data, metadata, object_id))
  File "/usr/local/lib/python3.6/dist-packages/ray/serialization.py", line 271, in _deserialize_object
    return self._deserialize_pickle5_data(data)
  File "/usr/local/lib/python3.6/dist-packages/ray/serialization.py", line 262, in _deserialize_pickle5_data
    obj = pickle.loads(in_band)
ModuleNotFoundError: No module named 'games'

---------------------------------------------------------------------------
NOTE: If your import is failing due to a missing package, you can
manually install dependencies using either !pip or !apt.

To view examples of installing some common dependencies, click the
"Open Examples" button below.
---------------------------------------------------------------------------

OutofMemoryError: Actors use too much memory

Cheers for the wonderful code! I am using it for an Atari game, but the worker processes use too much memory, resulting in an out-of-memory error.

RayTaskError(RayOutOfMemoryError): ray::SharedStorage (pid=2986, ip=172.28.0.2)
File "python/ray/_raylet.pyx", line 431, in ray._raylet.execute_task
File "/usr/local/lib/python3.6/dist-packages/ray/memory_monitor.py", line 120, in raise_if_low_memory
self.error_threshold))
ray.memory_monitor.RayOutOfMemoryError: More than 95% of the memory on node 8ff9db8651eb is used (12.13 / 12.72 GB). The top 10 memory consumers are:

PID MEM COMMAND
3039 4.53GiB ray::SelfPlay.continuous_self_play()
3054 4.5GiB ray::SelfPlay.continuous_self_play()
2985 1.26GiB ray::Trainer.continuous_update_weights()
126 0.38GiB /usr/bin/python3 -m ipykernel_launcher -f /root/.local/share/jupyter/runtime/kernel-4404f6dc-8f78-4a
2986 0.14GiB ray::SharedStorage
3026 0.08GiB ray::ReplayBuffer
2966 0.08GiB /usr/bin/python3 -u /usr/local/lib/python3.6/dist-packages/ray/dashboard/dashboard.py --host=localho
26 0.07GiB /usr/bin/python2 /usr/local/bin/jupyter-notebook --ip="172.28.0.2" --port=9000 --FileContentsManager
2960 0.04GiB /usr/local/lib/python3.6/dist-packages/ray/core/src/ray/thirdparty/redis/src/redis-server *:44319
3048 0.04GiB ray::IDLE

In addition, up to 0.24 GiB of shared memory is currently being used by the Ray object store. You can set the object store size with the object_store_memory parameter when starting Ray.

Question: Why use negative board for observation in connect4 training?

Hi,

I'm trying to write a Gomoku game with MuZero. I'm learning from Connect4 since it's also a two player game. However I noticed the following code:

    def get_observation(self):
        if self.player == 1:
            return self.board
        else:
            return -self.board

Why do you return the negated board for one of the players instead of letting both players learn from the same board?
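My guess is that the negation puts the board into a canonical form so the network always sees the position from the current player's point of view. For Gomoku I'm also considering an explicit to-play plane instead; a standalone sketch of that alternative encoding (purely illustrative):

    import numpy

    def get_observation(board, to_play):
        # One plane per player plus a constant plane marking whose turn it
        # is, instead of negating the board for the second player.
        board = numpy.asarray(board, dtype=numpy.float32)
        player1_plane = (board == 1).astype(numpy.float32)
        player2_plane = (board == -1).astype(numpy.float32)
        to_play_plane = numpy.full_like(board, float(to_play == 1))
        return numpy.stack([player1_plane, player2_plane, to_play_plane])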

Loss going up - is it ok?

Training the model for 10h (RTX6000) on Connect4.

Is it ok that only the policy loss goes down over time while the others go up? If I understand correctly, lowering the learning rate might help? What other hyperparameters would be worth adjusting? What would be a quicker way to select hyperparameters?

P.S. A related question: is there any estimate of how long it would take to get somewhat good? The performance is not good at the moment. Maybe I should embed a performance test (say, 20 matches against a conventional algorithmic opponent such as negamax) every several epochs or so?


Including reward in Node.value()

I see in

    child.reward + self.config.discount * child.value()

that you are now including the Node.reward when computing the UCB score (as opposed to the original pseudocode). Can you explain why this shouldn't be extended to the other places where the value is used, e.g. when updating normalizations (min_max_stats.update(node.value())) or when storing statistics (self.root_values.append(root.value()))?

value/reward transform issue

I think the way you transform value/reward is slightly mismatched with the original paper at this line:

    x = torch.sign(x) * (torch.sqrt(torch.abs(x) + 1) - 1 + 0.001 * x)

From the referenced paper (https://arxiv.org/abs/1805.11593), the transformation should be h(x) = sign(x) * (sqrt(|x| + 1) - 1) + eps * x with eps = 0.001.

So instead of

    x = torch.sign(x) * (torch.sqrt(torch.abs(x) + 1) - 1 + 0.001 * x)

the correct formula should be

    x = torch.sign(x) * (torch.sqrt(torch.abs(x) + 1) - 1) + 0.001 * x
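A quick standalone check of the difference (the two expressions agree for x >= 0 but differ in the eps term for x < 0):

    import torch

    eps = 0.001
    x = torch.tensor([-5.0, -1.0, 0.0, 1.0, 5.0])

    # Paper: h(x) = sign(x) * (sqrt(|x| + 1) - 1) + eps * x
    paper = torch.sign(x) * (torch.sqrt(torch.abs(x) + 1) - 1) + eps * x
    # Current code: the eps * x term sits inside the sign(x) factor
    current = torch.sign(x) * (torch.sqrt(torch.abs(x) + 1) - 1 + eps * x)

    print(paper - current)  # non-zero only for the negative entries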

Tracking for support of 2-player games with interrupts

Per #26 (comment) and #19 (comment) the code only supports two-player games with strict alternating actions. What would it take to encode the "next player" for the available actions into the game state? I'm thinking of games with rules where the state of the board may require an action from a player out of turn order (e.g., in GIPF, making a line of 4 pieces compels the owner of those pieces to remove them from the board regardless of whose turn the line(s) appeared on). For such a case, the action space could encode "this action compels removal from the board", but it breaks down when the opponent needs to make that decision before their turn and having the current player "decide" for them is…not ideal.

The above-mentioned issues were closed without any comment as to why. Resolved? Discussion over? There are no links to the code, and I don't see anything in the game class about determining the next player.

Slow playback, Trainable and lunar

  1. For some reason the playback is really slow (even if I remove the "press enter to continue" prompt).

  2. Also, I'm curious why you didn't implement the Trainable interface from Ray - it would have enabled a more standardized experience (e.g. the ability to run with tune or manually, automatic checkpointing, etc.)?

  3. My lunar lander is not training that well yet. Did you manage to get it to work with at least 150-200 reward consistently (e.g. say 100 episodes without dipping below 150)? If so, how long did you train and on what hardware?

I tried to train gomoku, but I got an error

WechatIMG19

Welcome to MuZero! Here's a list of games:
0. breakout
1. cartpole
2. connect4
3. gomoku
4. lunarlander
5. tictactoe
Enter a number to choose the game: 3

0. Train
1. Load pretrained model
2. Render some self play games
3. Play against MuZero
4. Exit
Enter a number to choose an action: 0
2020-04-22 15:45:12,693	INFO resource_spec.py:212 -- Starting Ray with 8.01 GiB memory available for workers and up to 4.01 GiB for objects. You can adjust these settings with ray.init(memory=<bytes>, object_store_memory=<bytes>).
2020-04-22 15:45:13,024	INFO services.py:1093 -- View the Ray dashboard at localhost:8265

Training...
Run tensorboard --logdir ./results and go to http://localhost:6006/ to see in real time the training performance.

(pid=3934) /home/zbf/Documents/muzero/muzero-general/replay_buffer.py:132: RuntimeWarning: invalid value encountered in true_divide
(pid=3934)   game_probs /= numpy.sum(game_probs)
2020-04-22 15:49:02,252 ERROR worker.py:1003 -- Possible unhandled error from worker: ray::ReplayBuffer.get_batch() (pid=3934, ip=192.168.1.12)
  File "python/ray/_raylet.pyx", line 643, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 623, in function_executor
  File "/home/zbf/Documents/muzero/muzero-general/replay_buffer.py", line 74, in get_batch
    game_id, game_history, game_prob = self.sample_game(self.buffer)
  File "/home/zbf/Documents/muzero/muzero-general/replay_buffer.py", line 133, in sample_game
    game_index = numpy.random.choice(len(self.buffer), p=game_probs)
  File "mtrand.pyx", line 922, in numpy.random.mtrand.RandomState.choice
ValueError: probabilities contain NaN
2020-04-22 15:49:02,252	ERROR worker.py:1003 -- Possible unhandled error from worker: ray::Trainer.continuous_update_weights() (pid=3932, ip=192.168.1.12)
  File "python/ray/_raylet.pyx", line 643, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 623, in function_executor
  File "/home/zbf/Documents/muzero/muzero-general/trainer.py", line 56, in continuous_update_weights
    index_batch, batch = ray.get(replay_buffer.get_batch.remote(self.model.get_weights()))
ray.exceptions.RayTaskError(ValueError): ray::ReplayBuffer.get_batch() (pid=3934, ip=192.168.1.12)
  File "python/ray/_raylet.pyx", line 643, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 623, in function_executor
  File "/home/zbf/Documents/muzero/muzero-general/replay_buffer.py", line 74, in get_batch
    game_id, game_history, game_prob = self.sample_game(self.buffer)
  File "/home/zbf/Documents/muzero/muzero-general/replay_buffer.py", line 133, in sample_game
    game_index = numpy.random.choice(len(self.buffer), p=game_probs)
  File "mtrand.pyx", line 922, in numpy.random.mtrand.RandomState.choice
ValueError: probabilities contain NaN
Last test reward: 0.00. Training step: 8/10. Played games: 29. Loss: 26.00

I did not change any code.

My environment is:
Ubuntu 18.04
i7-7700 16G
RTX 2080 8G

Unknown error related to ray each time on exiting the run

Hi @werner-duvaud

When I'm running the latest code, I always get an error when exiting the process. Please see the attached screen output below.

I deleted the old repo and did a fresh checkout, and I'm still seeing this.

So I did some research and found this: https://github.com/ray-project/ray/issues/5042
and this: https://github.com/ray-project/ray/issues/6239
Hope they help.

Welcome to MuZero! Here's a list of games:
0. cartpole
1. connect4
2. gomoku
3. lunarlander
Enter a number to choose the game: 2

0. Train
1. Load pretrained model
2. Render some self play games
3. Play against MuZero
4. Exit
Enter a number to choose an action: 0
2020-02-25 15:21:47,620 INFO resource_spec.py:212 -- Starting Ray with 3.91 GiB memory available for workers and up to 1.97 GiB for objects. You can adjust these settings with ray.init(memory=<bytes>, object_store_memory=<bytes>).
2020-02-25 15:21:47,998 INFO services.py:1093 -- View the Ray dashboard at localhost:8265

Training...
Run tensorboard --logdir ./results and go to http://localhost:6006/ to see in real time the training performance.

Done test reward: 1.00. Training step: 11/10. Played games: 1. Loss: 33.26

0. Train
1. Load pretrained model
2. Render some self play games
3. Play against MuZero
4. Exit
Enter a number to choose an action: 4
Exception ignored in: <function ActorHandle.__del__ at 0x1115289e0>
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/ray/actor.py", line 655, in __del__
AttributeError: 'NoneType' object has no attribute 'get_global_worker'
Exception ignored in: <function ActorHandle.__del__ at 0x1115289e0>
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/ray/actor.py", line 655, in __del__
AttributeError: 'NoneType' object has no attribute 'get_global_worker'
Exception ignored in: <function ActorHandle.__del__ at 0x1115289e0>
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/ray/actor.py", line 655, in __del__
AttributeError: 'NoneType' object has no attribute 'get_global_worker'
Exception ignored in: <function ActorHandle.__del__ at 0x1115289e0>
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/ray/actor.py", line 655, in __del__
AttributeError: 'NoneType' object has no attribute 'get_global_worker'
Exception ignored in: <function ActorHandle.__del__ at 0x1115289e0>
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/ray/actor.py", line 655, in __del__
AttributeError: 'NoneType' object has no attribute 'get_global_worker'
Exception ignored in: <function ActorHandle.__del__ at 0x1115289e0>
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/ray/actor.py", line 655, in __del__
AttributeError: 'NoneType' object has no attribute 'get_global_worker'

Breakout

Thank you very much for a comprehensive implementation.

I ran Breakout with the current configuration, except that I changed the number of actors from 350 to 4 since I ran into memory problems with Ray. I am using the same setup the code was tested on, except with a GTX 1060.

The code reports one line but never updates it. On Tensorboard I see progress, but the reward stays at zero.

Any advice?

Hi, I tried to train gomoku but did not get the expected results.

After the release of AlphaZero, I trained Gomoku based on AlphaZero, and I didn't need much computation to get a good model.

The code I used is this one.

I trained a simple game (6x6 board, 4 in a row to win) on a CPU (i7, 2.5 GHz). It only took about 10 hours, and that code is single-threaded.

I trained a simple game (8x8 board, 5 in a row to win) on a CPU (i7, 2.5 GHz). It only took about 50 hours, and that code is single-threaded.

In muzero, I copied games/gomoku.py to games/gobang.py and changed some parameters.
Changes:

board size from 11 to 6;
num in line from 5 to 4;
num_simulations changed to 400;
training steps changed to 2000;
batch size changed to 512;
played games per training step ratio set to 1;

For the full code see gobang.py.

I trained for about 47 hours and the result is still very poor. Training was on an i7-7700 3.6 GHz with 16 GB RAM and an RTX 2080 8 GB, and the MuZero implementation is multi-threaded.

What went wrong? Can you help me?

My final goal is a 15x15 board, but I want to train a simple one first; if that works, I will then train a complex one. After all, the complex one requires a lot of computation.

Forgive my poor English; some of this comes from Google Translate.

Loss: nan

I often get this error after training for a few hours. It has happened in all the games I've tried (though I've only tried two-player games). The error message below is from tictactoe. If this only happens to me, maybe it has something to do with my low ratio of self-played games per training step. Training continues, but self-play stops after the error.

2020-04-27 19:00:35,290.ERROR worker.py:1011 -- Possible unhandled error from worker: ray::ReplayBuffer.get_batch() (pid=10953, ip=192.168.0.113)
  File "python/ray/_raylet.pyx", line 452, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 407, in ray._raylet.execute_task.function_executor
  File "/home/gustav/Desktop/muzero-general/replay_buffer.py", line 74, in get_batch
    game_id, game_history, game_prob = self.sample_game(self.buffer)
  File "/home/gustav/Desktop/muzero-general/replay_buffer.py", line 133, in sample_game
    game_index = numpy.random.choice(len(self.buffer), p=game_probs)
  File "mtrand.pyx", line 920, in numpy.random.mtrand.RandomState.choice
ValueError: probabilities contain NaN
2020-04-27 19:00:35,290	ERROR worker.py:1011 -- Possible unhandled error from worker: ray::Trainer.continuous_update_weights() (pid=10952, ip=192.168.0.113)
  File "python/ray/_raylet.pyx", line 452, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 407, in ray._raylet.execute_task.function_executor
  File "/home/gustav/Desktop/muzero-general/trainer.py", line 56, in continuous_update_weights
    index_batch, batch = ray.get(replay_buffer.get_batch.remote(self.model.get_weights()))
ray.exceptions.RayTaskError(ValueError): ray::ReplayBuffer.get_batch() (pid=10953, ip=192.168.0.113)
  File "python/ray/_raylet.pyx", line 452, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 407, in ray._raylet.execute_task.function_executor
  File "/home/gustav/Desktop/muzero-general/replay_buffer.py", line 74, in get_batch
    game_id, game_history, game_prob = self.sample_game(self.buffer)
  File "/home/gustav/Desktop/muzero-general/replay_buffer.py", line 133, in sample_game
    game_index = numpy.random.choice(len(self.buffer), p=game_probs)
  File "mtrand.pyx", line 920, in numpy.random.mtrand.RandomState.choice
ValueError: probabilities contain NaN
Last test reward: 20.00. Training step: 87849/100000. Played games: 25959. Loss: nan

Alternating reward signs in backpropagate

In

    value = node.reward + self.config.discount * value

the value to be backpropagated along the search path incorporates the rewards from all nodes with the same sign, which seems to contradict

    node.value_sum += value if node.to_play == to_play else -value

and the way value targets are computed in compute_target_value(self, game_history, index).

Is this intentional? Or should this line perhaps be corrected to something like:

    value = node.reward + self.config.discount * value if node.to_play == to_play else -node.reward + self.config.discount * value
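To make the proposal concrete, a standalone sketch of a backpropagate loop with the same sign flip applied to the immediate reward (names are mine, purely illustrative):

    def backpropagate_sketch(search_path, value, to_play, discount):
        # Walk from leaf to root; accumulate each node's value from its own
        # player's perspective, flipping the reward sign the same way the
        # value sign is flipped.
        for node in reversed(search_path):
            node.value_sum += value if node.to_play == to_play else -value
            node.visit_count += 1
            reward = node.reward if node.to_play == to_play else -node.reward
            value = reward + discount * value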

AttributeError: 'NoneType' object has no attribute 'get_global_worker'

After about 40mins (it varies) of lunar lander training I get this error:

Exception ignored in: <function ActorHandle.__del__ at 0x7fcaa87c17b8>1791. Loss: 589.988
Traceback (most recent call last):
  File "/home/andriy/miniconda3/envs/muzero/lib/python3.7/site-packages/ray/actor.py", line 652, in __del__
AttributeError: 'NoneType' object has no attribute 'get_global_worker'
(the same ActorHandle.__del__ / get_global_worker traceback repeats about a dozen more times)

Gradient scaling for hidden state

This is not necessarily a bug, but as far as I can see, your code omits the hidden-state gradient scaling when training the dynamics function, as suggested by the MuZero paper. Here is a snippet of the paper's pseudocode:

    for action in actions:
        value, reward, policy_logits, hidden_state = network.recurrent_inference(hidden_state, action)
        predictions.append((1.0 / len(actions), value, reward, policy_logits))
        hidden_state = scale_gradient(hidden_state, 0.5)

Is there a specific reason why you left it out? Also, when I read the paper I wondered why this scaling "ensures that the total gradient applied to the dynamics function stays constant", as the authors state. I would be thankful if you could share your insight into the effectiveness of this operation.
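For reference, a common PyTorch idiom for the scale_gradient operation in that pseudocode (a sketch, not taken from this repository):

    import torch

    def scale_gradient(tensor, scale):
        # Identity in the forward pass; multiplies the gradient flowing
        # back into `tensor` by `scale` (0.5 in the MuZero pseudocode).
        return tensor * scale + tensor.detach() * (1 - scale)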

Determining who is next to play inside the MCST

In

    # Players play turn by turn

you essentially assume that actions are taken by the players in alternating order in two-player games. In a way, this is a rule ("players take turns making moves") leaking into the MCTS, whereas we would like to assume we can only know who is next to play at the root of the tree, where we can query the environment. Inside the tree, it would be more consistent to have the next player computed by the dynamics function, right?

This is also a limitation in the way actions are encoded: for example, my understanding is that castling in chess is encoded as two separate, consecutive moves made by the same player, but this would break the MCTS logic as it stands here. Any idea how this was handled by the original authors?

Training on ended games

    position_probs = numpy.array(game_history.priorities) / sum(game_history.priorities)

Shouldn't this be:

    position_probs = numpy.array(game_history.priorities[:-1]) / sum(game_history.priorities[:-1])

It makes no sense to learn from the last step, because we would be training against illegal moves (and a zero reward, as per my last issue).

Issue with image-based environments

Hi, thanks a lot for this well-organized code.

I tried your code on CartPole and it works fine. However, I cannot get any results with other environments (e.g. Breakout), especially those with an image as input. In those cases it keeps taking the same action throughout the episode. Do you have any suggestions for which hyper-parameters I should start tuning?

Retrying to connect to socket for pathname tcp://127.0.0.1:54380

Hi,

I cloned your repo and tried to use it in a fresh environment. Unfortunately, it does not seem to work properly, as I get the following error when trying to train or play against MuZero:

(muzero) C:\Users\...\Desktop\RLUnity\python\muzero-general>python muzero.py

Welcome to MuZero! Here's a list of games:
0. breakout
1. cartpole
2. connect4
3. gomoku
4. gridworld
5. lunarlander
6. tictactoe
7. twentyone
Enter a number to choose the game: 0

0. Train
1. Load pretrained model
2. Diagnose model
3. Render some self play games
4. Play against MuZero
5. Test the game manually
6. Exit
Enter a number to choose an action: 4

Testing...
2020-07-03 13:48:15,208 INFO resource_spec.py:212 -- Starting Ray with 6.49 GiB memory available for workers and up to 3.26 GiB for objects. You can adjust these settings with ray.init(memory=<bytes>, object_store_memory=<bytes>).
2020-07-03 13:48:15,637 INFO services.py:1165 -- View the Ray dashboard at localhost:8265
E0703 13:48:20.680589 13144 11520 raylet_client.cc:69] Retrying to connect to socket for pathname tcp://127.0.0.1:60437 (num_attempts = 1, num_retries = 10)
E0703 13:48:23.682262 13144 11520 raylet_client.cc:69] Retrying to connect to socket for pathname tcp://127.0.0.1:60437 (num_attempts = 2, num_retries = 10)
E0703 13:48:26.682521 13144 11520 raylet_client.cc:69] Retrying to connect to socket for pathname tcp://127.0.0.1:60437 (num_attempts = 3, num_retries = 10)
E0703 13:48:29.685930 13144 11520 raylet_client.cc:69] Retrying to connect to socket for pathname tcp://127.0.0.1:60437 (num_attempts = 4, num_retries = 10)
E0703 13:48:32.687705 13144 11520 raylet_client.cc:69] Retrying to connect to socket for pathname tcp://127.0.0.1:60437 (num_attempts = 5, num_retries = 10)
E0703 13:48:35.690719 13144 11520 raylet_client.cc:69] Retrying to connect to socket for pathname tcp://127.0.0.1:60437 (num_attempts = 6, num_retries = 10)
E0703 13:48:38.692248 13144 11520 raylet_client.cc:69] Retrying to connect to socket for pathname tcp://127.0.0.1:60437 (num_attempts = 7, num_retries = 10)
E0703 13:48:41.694942 13144 11520 raylet_client.cc:69] Retrying to connect to socket for pathname tcp://127.0.0.1:60437 (num_attempts = 8, num_retries = 10)
E0703 13:48:44.696983 13144 11520 raylet_client.cc:69] Retrying to connect to socket for pathname tcp://127.0.0.1:60437 (num_attempts = 9, num_retries = 10)
F0703 13:48:45.697768 13144 11520 raylet_client.cc:78] Could not connect to socket tcp://127.0.0.1:60437
*** Check failure stack trace: ***
    @   00007FFDB9633A8C  public: __cdecl google::LogMessage::~LogMessage(void) __ptr64
    @   00007FFDB94A8954  public: virtual __cdecl google::NullStreamFatal::~NullStreamFatal(void) __ptr64
    @   00007FFDB94E351B  public: void __cdecl google::NullStreamFatal::`vbase destructor'(void) __ptr64
    @   00007FFDB94E5B5E  public: void __cdecl google::NullStreamFatal::`vbase destructor'(void) __ptr64
    @   00007FFDB93F3B98  public: class google::LogMessageVoidify & __ptr64 __cdecl google::LogMessageVoidify::operator=(class google::LogMessageVoidify const & __ptr64) __ptr64
    @   00007FFDB93F1C00  public: class google::LogMessageVoidify & __ptr64 __cdecl google::LogMessageVoidify::operator=(class google::LogMessageVoidify const & __ptr64) __ptr64
    @   00007FFDB93F00ED  public: class google::LogMessageVoidify & __ptr64 __cdecl google::LogMessageVoidify::operator=(class google::LogMessageVoidify const & __ptr64) __ptr64
    @   00007FFDB93EF9C3  public: class google::LogMessageVoidify & __ptr64 __cdecl google::LogMessageVoidify::operator=(class google::LogMessageVoidify const & __ptr64) __ptr64
    @   00007FFDB936F179  public: virtual __cdecl google::LogSink::~LogSink(void) __ptr64
    @   00007FFDF395F9CF  _PyObject_FastCallKeywords
    @   00007FFDF395F7DA  _PyObject_FastCallKeywords
    @   00007FFDF3967939  _PyMethodDef_RawFastCallKeywords
    @   00007FFDF3968322  _PyEval_EvalFrameDefault
    @   00007FFDF3951286  _PyEval_EvalCodeWithName
    @   00007FFDF3967907  _PyMethodDef_RawFastCallKeywords
    @   00007FFDF3968A69  _PyEval_EvalFrameDefault
    @   00007FFDF3A523A3  _PyStack_UnpackDict
    @   00007FFDF39AB431  PyErr_NoMemory
    @   00007FFDF3968322  _PyEval_EvalFrameDefault
    @   00007FFDF3951286  _PyEval_EvalCodeWithName
    @   00007FFDF3967907  _PyMethodDef_RawFastCallKeywords
    @   00007FFDF3968A69  _PyEval_EvalFrameDefault
    @   00007FFDF3951286  _PyEval_EvalCodeWithName
    @   00007FFDF3932A93  PyEval_EvalCodeEx
    @   00007FFDF39329F1  PyEval_EvalCode
    @   00007FFDF393299B  PyArena_Free
    @   00007FFDF3AC614D  PyRun_FileExFlags
    @   00007FFDF3AC6974  PyRun_SimpleFileExFlags
    @   00007FFDF3AC601B  PyRun_AnyFileExFlags
    @   00007FFDF3A11AAF  _Py_UnixMain
    @   00007FFDF3A11B57  _Py_UnixMain
    @   00007FFDF3980D5A  PyErr_NoMemory

Also, I ran into an error when trying to run breakout: I needed to install gym[atari] in addition to the requirements.

(muzero) C:\Users\...\Desktop\RLUnity\python\muzero-general>python muzero.py

Welcome to MuZero! Here's a list of games:
0. breakout
1. cartpole
2. connect4
3. gomoku
4. gridworld
5. lunarlander
6. tictactoe
7. twentyone
Enter a number to choose the game: 0
breakout is not a supported game name, try "cartpole" or refer to the documentation for adding a new game.
Traceback (most recent call last):
  File "C:\Users\...\Desktop\RLUnity\python\muzero-general\games\breakout.py", line 11, in <module>
    import cv2
ModuleNotFoundError: No module named 'cv2'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "muzero.py", line 286, in <module>
    muzero = MuZero(games[choice])
  File "muzero.py", line 48, in __init__
    raise err
  File "muzero.py", line 39, in __init__
    game_module = importlib.import_module("games." + self.game_name)
  File "C:\Users\...\.conda\envs\muzero\lib\importlib\__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1006, in _gcd_import
  File "<frozen importlib._bootstrap>", line 983, in _find_and_load
  File "<frozen importlib._bootstrap>", line 967, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 677, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 728, in exec_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "C:\Users\...\Desktop\RLUnity\python\muzero-general\games\breakout.py", line 13, in <module>
    raise ModuleNotFoundError('Please run "pip install gym[atari]"')
ModuleNotFoundError: Please run "pip install gym[atari]"

RuntimeError: Error(s) in loading state_dict for MuZeroFullyConnectedNetwork

Below is the relevant dialogue from python muzero.py.

I have trained a lunarlander model.
When loading the pretrained model, an error occurs:
RuntimeError: Error(s) in loading state_dict for MuZeroFullyConnectedNetwork: Missing key(s) in state_dict: "representation_network.0.weight", "representation_network.0.bias", "dynamics_encoded_state_network.0.weight", "dynamics_encoded_state_network.0.bias", "dynamics_encoded_state_network.2.weight", "dynamics_encoded_state_network.2.bias", "dynamics_reward_network.0.weight", "dynamics_reward_network.0.bias", "dynamics_reward_network.2.weight", "dynamics_reward_network.2.bias", "prediction_policy_network.0.weight", "prediction_policy_network.0.bias", "prediction_policy_network.2.weight", "prediction_policy_network.2.bias", "prediction_value_network.0.weight", "prediction_value_network.0.bias", "prediction_value_network.2.weight", "prediction_value_network.2.bias". Unexpected key(s) in state_dict: "representation_network.layers.0.weight", "representation_network.layers.0.bias", "dynamics_encoded_state_network.layers.0.weight", "dynamics_encoded_state_network.layers.0.bias", "dynamics_encoded_state_network.layers.2.weight", "dynamics_encoded_state_network.layers.2.bias", "dynamics_reward_network.layers.0.weight", "dynamics_reward_network.layers.0.bias", "dynamics_reward_network.layers.2.weight", "dynamics_reward_network.layers.2.bias", "prediction_policy_network.layers.0.weight", "prediction_policy_network.layers.0.bias", "prediction_value_network.layers.0.weight", "prediction_value_network.layers.0.bias", "prediction_value_network.layers.2.weight", "prediction_value_network.layers.2.bias".

How can I fix it? Thanks!

Priority Sampling Formula

In Appendix G of the muzero paper, they define the priority of a sample as p_i = | nu_i - z_i |, and write "nu is the search value and z the observed n-step return." (I'll use "nu" in place of ν for clarity when comparing with v)

However, in self_play.py, line 335, you seem to calculate priority as | v_i - nu_i |.

There are three related but distinct quantities here:

v - the output of the value head for a position
nu (ν) - the value estimate returned by MCTS for a position
z - the bootstrapped value of a position calculated using the next k observed rewards and the nu value k steps in the future, and discounting appropriately.

See Section 3 of the paper for definitions of these three terms.

It seems to me that the code currently does not match the priority calculation in the paper. Is this intentional? The | v - nu | formulation in the code has the advantage that priorities can be quickly updated whenever a position is used in training, because a new v is obtained. The | nu - z | version in the paper is not amenable to priority updates, because that would seem to require at minimum re-running MCTS to re-estimate nu, and at most re-playing the game for k rounds to re-estimate z, both of which depend on the current weights.
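To spell the comparison out, the two formulas side by side (function and variable names are mine):

    def priority_paper(search_value, n_step_return):
        # Appendix G: p_i = |nu_i - z_i|  (search value vs. observed n-step return)
        return abs(search_value - n_step_return)

    def priority_code(value_head_output, search_value):
        # What self_play.py appears to compute: |v_i - nu_i|
        return abs(value_head_output - search_value)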

Not training using GPU

Hello!
Thanks for sharing your implementation. However, the loss went to NaN on the latest commit, so, following #40, I used this tree: https://github.com/werner-duvaud/muzero-general/tree/f5dd3d2a3fd2e0c354731112b23c2d4c55811914. However, when I check nvidia-smi, it shows that only around 100 MB of my GPU memory is used:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.64.00 Driver Version: 440.64.00 CUDA Version: 10.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 108... On | 00000000:02:00.0 On | N/A |
| 25% 60C P2 94W / 250W | 1553MiB / 11177MiB | 30% Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 3234 G /usr/lib/xorg/Xorg 265MiB |
| 0 15010 G ...uest-channel-token=17125473231410870467 115MiB |
| 0 15854 G ...AAAAAAAAAAAACAAAAAAAAAA= --shared-files 404MiB |
| 0 18818 C ray::Trainer.continuous_update_weights() 741MiB |
| 0 27228 G ...charm-community-2018.1.4/jre64/bin/java 11MiB |
+-----------------------------------------------------------------------------+
and the program relies heavily on the CPU. I noticed this was fixed in the latest commit. Any ideas on how I can get the model to run on the GPU?
Thanks,
Weichen

[Connect4] Default settings result in NaN loss, "ValueError 'a' cannot be empty unless no samples are taken"

Training Connect4 with default settings results in many errors.

Reproduction

  1. python muzero.py
  2. Select ConnectX (2)
  3. Select Train (0)
  4. Wait several minutes

Branch master, commit "c046c03 Fix backpropagate"
Python 3.7, Ubuntu 20.04 LTS, RTX6000 GPU (24GB), 8 CPU, 32GB RAM

Documentation

Recording of terminal session

https://asciinema.org/a/QR0bM7MH4TR2ulgGddZ4luJ1n

Terminal screenshot

[screenshot]

Tensorboard

[screenshot]

Terminal copied errors:

Loss: nan
Warning : Extreme values (nan) in game priorities. Could be underfitting or overfitting.
2020-07-14 20:24:39,425.ERROR worker.py:987 -- Possible unhandled error from worker: ray::SelfPlay.continuous_self_play() (pid=10473, ip=45.79.123.77)
  File "python/ray/_raylet.pyx", line 446, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 400, in ray._raylet.execute_task.function_executor
  File "/root/muzero-general/self_play.py", line 48, in continuous_self_play
    0,
  File "/root/muzero-general/self_play.py", line 142, in play_game
    False if temperature == 0 else True,
  File "/root/muzero-general/self_play.py", line 312, in run
    action, node = self.select_child(node, min_max_stats)
  File "/root/muzero-general/self_play.py", line 359, in select_child
    for action, child in node.children.items()
  File "mtrand.pyx", line 907, in numpy.random.mtrand.RandomState.choice
ValueError: 'a' cannot be empty unless no samples are taken
