
simple's Introduction




Selfplay In MultiPlayer Environments
Report Bug · Request Feature


Table of Contents

  1. About The Project
  2. Getting Started
  3. Tutorial
  4. Roadmap
  5. Contributing
  6. License
  7. Contact
  8. Acknowledgements


About The Project

SIMPLE Diagram

This project allows you to train AI agents on custom-built multiplayer environments through self-play reinforcement learning.

It implements Proximal Policy Optimisation (PPO), with a built-in wrapper around the multiplayer environments that handles the loading and action-taking of opponents. The wrapper delays the reward to the PPO agent until all opponents have taken their turn. In essence, it converts the multiplayer environment into a single-player environment that is constantly evolving as new versions of the policy network are added to the network bank.
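To make the mechanics concrete, here is a rough, purely illustrative sketch of the idea (the class, the opponent bank and the choose_action method are invented for this example and are not the actual SIMPLE wrapper code):

import gym

class SelfPlaySketch(gym.Wrapper):
    def __init__(self, env, opponent_bank, agent_player_num=0):
        super().__init__(env)
        self.opponent_bank = opponent_bank        # frozen copies of earlier policies
        self.agent_player_num = agent_player_num  # seat occupied by the learning agent

    def step(self, action):
        # The learning agent takes its turn...
        obs, reward, done, info = self.env.step(action)
        # ...then frozen opponents play until control returns to the agent,
        # so the reward PPO finally sees already reflects the opponents' replies.
        while not done and self.env.current_player_num != self.agent_player_num:
            opponent = self.opponent_bank[self.env.current_player_num]
            obs, reward, done, info = self.env.step(opponent.choose_action(obs))
        return obs, reward, done, info

In the real system the bank of opponents is refreshed with new checkpoints as training progresses, which is what makes the effective single-player environment keep evolving.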

To learn more, check out the accompanying blog post.

This guide explains how to get started with the repo, add new custom environments and tune the hyperparameters of the system.

Have fun!


Getting Started

To get a local copy up and running, follow these simple steps.

Prerequisites

Install Docker and Docker Compose to make use of the docker-compose.yml file

Installation

  1. Clone the repo
    git clone https://github.com/davidADSP/SIMPLE.git
    cd SIMPLE
  2. Build the image and 'up' the container.
    docker-compose up -d
  3. Choose an environment to install in the container (tictactoe, connect4, sushigo, geschenkt, butterfly, and flamme rouge are currently implemented)
    bash ./scripts/install_env.sh sushigo

Tutorial

This is a quick tutorial to allow you to start using the two entrypoints into the codebase: test.py and train.py.

TODO - I'll be adding more substantial documentation for both of these entrypoints in due course! For now, descriptions of each command line argument can be found at the bottom of the files themselves.


Quickstart

test.py

This entrypoint allows you to play against a trained AI, pit two AIs against each other, or play against a baseline random model.

For example, try the following command to play against a baseline random model in the Sushi Go environment.

docker-compose exec app python3 test.py -d -g 1 -a base base human -e sushigo 

train.py

This entrypoint allows you to start training the AI using selfplay PPO. The underlying PPO engine is from the Stable Baselines package.

For example, you can start training the agent to learn how to play SushiGo with the following command:

docker-compose exec app python3 train.py -r -e sushigo 

After 30 or 40 iterations, the process should have exceeded the default threshold score of 0.2 and output a new best_model.zip to the /zoo/sushigo folder.

Training runs until you kill the process manually (e.g. with Ctrl-C), so do that now.

You can now use the test.py entrypoint to play 100 games silently between the current best_model.zip and the random baseline models as follows:

docker-compose exec app python3 test.py -g 100 -a best_model base base -e sushigo 

You should see that the best_model scores better than the two baseline model opponents.

Played 100 games: {'best_model_btkce': 31.0, 'base_sajsi': -15.5, 'base_poqaj': -15.5}

You can continue training the agent by dropping the -r reset flag from the train.py entrypoint arguments - it will just pick up from where it left off.

docker-compose exec app python3 train.py -e sushigo 

Congratulations, you've just completed one training cycle for the game Sushi Go! The PPO agent will now have to work out a way to beat the model it has just created...


Tensorboard

To monitor training, you can start Tensorboard with the following command:

bash scripts/tensorboard.sh

Navigate to localhost:6006 in a browser to view the output.

In the /zoo/pretrained/ folder there is a pre-trained /<game>/best_model.zip for each game, which can be copied up a directory (e.g. to /zoo/sushigo/best_model.zip) if you want to play against a pre-trained agent right away.
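For example, for Sushi Go you could run something along these lines from the repository root (adjust the paths to the /zoo/ layout described above):

cp zoo/pretrained/sushigo/best_model.zip zoo/sushigo/best_model.zip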


Custom Environments

You can add a new environment by copying and editing an existing environment in the /environments/ folder.

For the environment to work with the SIMPLE self-play wrapper, the class must contain the following methods (expanding on the standard methods from the OpenAI Gym framework):

__init__

In the initialisation method, you need to define the usual action_space and observation_space, as well as two additional variables (a minimal sketch follows the list below):

  • n_players - the number of players in the game
  • current_player_num - an integer that tracks which player is currently active  
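A minimal, purely illustrative sketch of such an initialiser for a hypothetical three-player game (the space sizes are placeholders; follow the existing environments for the real structure):

import gym
import numpy as np

class MyGameEnv(gym.Env):
    def __init__(self):
        super().__init__()
        self.n_players = 3                                          # players in the game
        self.current_player_num = 0                                 # whose turn it is
        self.action_space = gym.spaces.Discrete(10)                 # placeholder size
        self.observation_space = gym.spaces.Box(-1.0, 1.0, (24,))   # placeholder shape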

step

The step method accepts an action from the current active player and performs the necessary steps to update the game environment. It should also update current_player_num to the next player and check whether an end state of the game has been reached.
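Continuing the hypothetical MyGameEnv sketch above (apply_action, check_game_over and score_game are invented helpers standing in for your game logic; the per-player reward list mirrors the reward array convention discussed elsewhere on this page, but check the existing environments for the exact return format):

    def step(self, action):
        reward = [0] * self.n_players
        self.apply_action(action)            # update the game state (game logic omitted)
        done = self.check_game_over()
        if done:
            reward = self.score_game()       # e.g. +1 for the winner, -1 for the losers
        else:
            self.current_player_num = (self.current_player_num + 1) % self.n_players
        return self.observation, reward, done, {}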

reset

The reset method is called to reset the game to the starting state, ready to accept the first action.

render

The render function is called to output a visual or human-readable summary of the current game state to the log file.
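Continuing the same hypothetical sketch (initial_board is an invented helper, and the print calls simply stand in for whatever logging the existing environments use):

    def reset(self):
        self.current_player_num = 0
        self.board = self.initial_board()    # rebuild decks / boards / hands here
        return self.observation

    def render(self, mode='human', close=False):
        print(f'Current player: {self.current_player_num}')
        print(f'Board: {self.board}')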

observation

The observation function returns a numpy array that can be fed as input to the PPO policy network. It should return a numeric representation of the current game state, from the perspective of the current player, where each element of the array is in the range [-1,1].
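Continuing the sketch; whether this is exposed as a property or a plain method, follow the existing environments. The encoding below is invented purely to show the output shape and the [-1, 1] scaling:

    @property
    def observation(self):
        obs = np.zeros(self.observation_space.shape)
        # e.g. scale the current player index into [-1, 1]
        obs[0] = 2 * self.current_player_num / (self.n_players - 1) - 1
        return obs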

legal_actions

The legal_actions function returns a numpy vector of the same length as the action space, where 1 indicates that the action is valid and 0 indicates that the action is invalid.
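Continuing the sketch (is_legal is an invented, game-specific helper):

    @property
    def legal_actions(self):
        mask = np.zeros(self.action_space.n)
        for action_num in range(self.action_space.n):
            if self.is_legal(action_num):
                mask[action_num] = 1
        return mask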

Please refer to existing environments for examples of how to implement each method.

You will also need to add the environment to the two functions in /utils/register.py - follow the existing examples of environments for the structure.


Parallelisation

The training process can be parallelised using MPI across multiple cores.

For example, to run 10 parallel workers that contribute games to the current iteration, you can simply run:

docker-compose exec app mpirun -np 10 python3 train.py -e sushigo 

Roadmap

See the open issues for a list of proposed features (and known issues).


Contributing

Any contributions you make are greatly appreciated.

  1. Fork the Project
  2. Create your Feature Branch (git checkout -b feature/AmazingFeature)
  3. Commit your Changes (git commit -m 'Add some AmazingFeature')
  4. Push to the Branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

License

Distributed under the GPL-3.0. See LICENSE for more information.


Contact

David Foster - @davidADSP - [email protected]

Project Link: https://github.com/davidADSP/SIMPLE


Acknowledgements

There are many repositories and blogs that have helped me to put together this repository. One that deserves particular acknowledgement is David Ha's Slime Volleyball Gym, which also implements multi-agent reinforcement learning. It helped me to understand how to adapt the callback function to a self-play setting and how to implement MPI so that the codebase can be highly parallelised. Definitely worth checking out!



simple's Issues

Permissions not granted on zoo/sushigo/...

When running
docker-compose exec app python3 test.py -d -g 1 -a base base human -e sushigo
I get the error
Logging to logs
Saving base.zip PPO model...
Permissions not granted on zoo/sushigo/...
ERROR: 1

M1 Mac: Building wheel for opencv-python (pyproject.toml): still running..

Hey y'all, not exactly an issue with this repository so much as with opencv-python, but I'm just poking to see if anyone's solved this.

Running docker compose up -d in the top level of the repository gets all the way through the Dockerfile, then freezes permanently at "Building wheel for opencv-python (pyproject.toml): still running..". Ran it overnight just to be totally sure.

It seems other people have had issues with opencv-python on M1, but I've been researching this for weeks to no avail.

Is there any way this can be run without Docker, or some known solution to skip the opencv wheel?

Add a way to communicate with game engines written in other languages

I do most of my game engines in Dart now, since my apps are written in Dart/Flutter, and it's quite a lot of work to write and debug a game engine in both Dart and Python (especially given how many games I plan on implementing), so I'd like to create a generic interface in SIMPLE that can be used to call out to a game engine over HTTP. I'm also aware of other people who would benefit from an open-source stub that calls out to a remote game engine.

Here's my current rough idea for how this could be done (a rough client sketch follows the list):

  • Model definition will still be in Python.
  • Everything else will be done in a way that calls out to a server which implements the following endpoints:
    • /newgame endpoint returns {"id": "unique game identifier used by the server to map game IDs to in-memory game states", "player_count": X, "action_space_size": X, "observation_space_size": X, "current_player": X, "observation": [], "legal_actions": []}
    • /step/{id} POSTed to with {"action": integer of action selected} returns the following: {"observation": [one hot encoding of current game state], "next_player": integer representing the player whose turn it is after the move is made, "reward": [array of rewards for each player], "done": true|false}
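A rough sketch of what the Python side of such a stub could look like (the endpoint names come from the proposal above; the class and everything else is purely illustrative):

import requests

class RemoteEngineClient:
    def __init__(self, base_url):
        self.base_url = base_url

    def new_game(self):
        # returns the game id, space sizes, current player, observation and legal actions
        return requests.post(f'{self.base_url}/newgame').json()

    def step(self, game_id, action):
        # returns the new observation, next player, per-player rewards and done flag
        return requests.post(f'{self.base_url}/step/{game_id}', json={'action': int(action)}).json()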

evojax?

gym was recently [May 2022] ported over to EvoJAX allowing hardware-acceleration. Did anyone try using this in connection with SIMPLE?

Permission Denied on environment install

Hi all,

Been trying to run the install steps as indicated in the Readme. I have encountered issues on step 3 of the install process.

On bash ./scripts/install_env.sh sushigo I get the message Defaulting to user installation because normal site-packages is not writeable. The script keeps running, but a few lines later I get the following error:

Installing collected packages: sushigo
  Running setup.py develop for sushigo
    ERROR: Command errored out with exit status 1:
     command: /usr/bin/python3 -c 'import io, os, sys, setuptools, tokenize; sys.argv[0] = '"'"'/app/environments/sushigo/setup.py'"'"'; __file__='"'"'/app/environments/sushigo/setup.py'"'"';f = getattr(tokenize, '"'"'open'"'"', open)(__file__) if os.path.exists(__file__) else io.StringIO('"'"'from setuptools import setup; setup()'"'"');code = f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' develop --no-deps --user --prefix=
         cwd: /app/environments/sushigo/
    Complete output (4 lines):
    running develop
    running egg_info
    creating sushigo.egg-info
    error: could not create 'sushigo.egg-info': Permission denied
    ----------------------------------------
ERROR: Command errored out with exit status 1: /usr/bin/python3 -c 'import io, os, sys, setuptools, tokenize; sys.argv[0] = '"'"'/app/environments/sushigo/setup.py'"'"'; __file__='"'"'/app/environments/sushigo/setup.py'"'"';f = getattr(tokenize, '"'"'open'"'"', open)(__file__) if os.path.exists(__file__) else io.StringIO('"'"'from setuptools import setup; setup()'"'"');code = f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' develop --no-deps --user --prefix= Check the logs for full command output.

Has anyone else experienced this?

Question: Does SIMPLE support board games with simultaneous action selection?

Hi,

I am new to ML but have been researching MuZero. Thanks to your article I found SIMPLE. My question is: is it possible to create a policy network that takes into account that both players choose a move simultaneously? Because the next state relies on both players' choices, there is a bit of nuance in the policy network: it has to account for the opponents, and there might be a dependency between the choices.

I am not sure if this is supported out of the box or if there is any research in this regard.

Thanks

Question on PPO entropy coefficient

Hi,
I have a question about the entropy coefficient c2 of PPO and its standard value in SIMPLE.
In the original paper, the "standard" value is c2=0.01, but in SIMPLE it's set to c2=0.1:
"parser.add_argument("--entcoeff", "-ent", type = float, default = 0.1, help="The entropy coefficient in PPO")"

Is there a reason the default value is set so high in SIMPLE? I am currently trying to tune that value and I am just curious.

Kind regards,
Markus

(Paper: https://arxiv.org/pdf/1707.06347.pdf)
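For reference, c2 is the coefficient on the entropy bonus in the combined PPO objective from the linked paper:

L_t^{CLIP+VF+S}(\theta) = \hat{\mathbb{E}}_t\left[ L_t^{CLIP}(\theta) - c_1 L_t^{VF}(\theta) + c_2\, S[\pi_\theta](s_t) \right]

A larger c2 simply weights the exploration-encouraging entropy term more heavily relative to the clipped surrogate and value-function losses.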

Multiprocessor Training Resulted in Corrupted best_model.zip

I left a model training overnight with

docker-compose exec app mpirun -np 8 python3 train.py -e connect4

But after just an hour or so it crashed with error:

A load persistent id instruction was encountered, but no persistent_load function was specified.

Subsequently, I could not restart training, as whenever the program attempted to load best_model.zip it produced the same error. Investigation revealed that the best_model.zip file had somehow become malformed/corrupted. I had to replace it with a prior saved model in order to resume training.

Feature request : allow variable player number on a game/env

Hi !
It would be nice to be able to choose the number of players for a game, although I'm not sure of the impact on the training loop (one model per player count?).
Is there any other limitation to implement this feature ?
Thanks for your answer and for all the work done.

Potential memory leak in training PPO agents

Hi,

I followed the tutorial to set up the Docker container and ran train.py with all default hyperparameters on tictactoe. Here is my command:
sudo docker-compose exec app python3 train.py -r -e tictactoe
I did not use parallelization, and I noticed that RAM usage grows linearly with the training steps: roughly a 700 MB increase after 0.2M steps. After training for ~20M steps my computer's 32 GB of RAM will be fully occupied.

Has anyone encountered a similar issue, and is there a way to resolve this?

Training Flamme Rouge model for 5 players stops

When I try to train my frouge model after setting the number of players to 5, it starts training. However, when it gets to Optimizing..., it just stops and returns to the command prompt. Any ideas on how this can be fixed would be a great help for me. Thanks!

I can't successfully follow the tutorial

I'm probably just doing something wrong, but I've just spent two hours getting the environment set up (I'm new to Docker) and when I run:

docker-compose exec app python3 test.py -d -g 1 -a base base human -e sushigo

I get the following on the console:

Logging to logs
/home/selfplay/.local/lib/python3.6/site-packages/gym/logger.py:30: UserWarning: WARN: Box bound precision lowered by casting to float32
  warnings.warn(colorize('%s: %s'%('WARN', msg % args), 'yellow'))
Saving base.zip PPO model...
Check zoo/sushigo/ exists and read/write permission granted to user

Nothing else happens, but I'm assuming this is supposed to start an interactive game against one of the pretrained networks?

No Limit to Number of Models Attempted to Load - Memory Issue

After leaving the program to train overnight, it had generated in total some 128+ models. Restarting training, the program attempts to load all of them at the start (and fails due to memory constraints). Solved by deleting some older models, but it seems like there should be a check not to try to load more models than can fit into memory.

Exporting to TF SavedModel/TFLite

I'm trying to export the resulting model to TFLite so I can run inference on another device, but I'm hitting some issues. I found instructions on how to export a model in the Stable Baselines documentation and tried adapting it for PPO1 instead of PPO2; however, when I try to load the resulting SavedModel I get an exception about the Tensor not existing.

Here's the code:

  ppo_model = load_model(env, 'best_model.zip')

  tf.saved_model.simple_save(ppo_model.sess, "TEST_OUTPUT", inputs={"obs": ppo_model.policy_pi.obs_ph},
                                   outputs={"action": ppo_model.policy_pi._policy_proba})

  converter = tf.lite.TFLiteConverter.from_saved_model("TEST_OUTPUT")
  tflite_model = converter.convert()

And the full error message:
KeyError: "The name 'input/Ob:0' refers to a Tensor which does not exist. The operation, 'input/Ob', does not exist in the graph."

I've verified that ppo_model is being loaded correctly by running the inference (using ppo_model.action_probability()), so I don't believe there's an issue there. The SavedModel directory does get created on the tf.saved_model.simple_save step, however I believe it may not be a complete export as the size is very small.

I'm rather new to the ML side of things, so there might be something obvious that I'm missing; any help would be greatly appreciated!

Thanks for putting together this great library!

AttributeError: module 'contextlib' has no attribute 'nullcontext'

Exactly as the title says. I followed the steps from the README.

> sudo docker-compose exec app python3 test.py -d -g 1 -a base base human -e sushigo 
/home/selfplay/.local/lib/python3.6/site-packages/ale_py/roms/utils.py:90: DeprecationWarning: SelectableGroups dict interface is deprecated. Use select.
  for external in metadata.entry_points().get(self.group, []):
Traceback (most recent call last):
  File "test.py", line 13, in <module>
    from stable_baselines import logger
  File "/home/selfplay/.local/lib/python3.6/site-packages/stable_baselines/__init__.py", line 3, in <module>
    from stable_baselines.a2c import A2C
  File "/home/selfplay/.local/lib/python3.6/site-packages/stable_baselines/a2c/__init__.py", line 1, in <module>
    from stable_baselines.a2c.a2c import A2C
  File "/home/selfplay/.local/lib/python3.6/site-packages/stable_baselines/a2c/a2c.py", line 3, in <module>
    import gym
  File "/home/selfplay/.local/lib/python3.6/site-packages/gym/__init__.py", line 13, in <module>
    from gym.envs import make, spec, register
  File "/home/selfplay/.local/lib/python3.6/site-packages/gym/envs/__init__.py", line 10, in <module>
    _load_env_plugins()
  File "/home/selfplay/.local/lib/python3.6/site-packages/gym/envs/registration.py", line 269, in load_env_plugins
    context = contextlib.nullcontext()
AttributeError: module 'contextlib' has no attribute 'nullcontext'

All output:

> sudo docker-compose up -d                                                          
[+] Running 1/1
 ⠿ Container selfplay  Started                                                                                                      0.5s
 > sudo bash ./scripts/install_env.sh sushigo                         
Defaulting to user installation because normal site-packages is not writeable
Obtaining file:///app/environments/sushigo
  Preparing metadata (setup.py) ... done
Requirement already satisfied: gym>=0.9.4 in /home/selfplay/.local/lib/python3.6/site-packages (from sushigo==0.1.0) (0.21.0)
Requirement already satisfied: numpy>=1.13.0 in /home/selfplay/.local/lib/python3.6/site-packages (from sushigo==0.1.0) (1.19.5)
Requirement already satisfied: opencv-python>=3.4.2.0 in /home/selfplay/.local/lib/python3.6/site-packages (from sushigo==0.1.0) (4.5.4.58)
Requirement already satisfied: cloudpickle>=1.2.0 in /home/selfplay/.local/lib/python3.6/site-packages (from gym>=0.9.4->sushigo==0.1.0) (2.0.0)
Requirement already satisfied: importlib-metadata>=4.8.1 in /home/selfplay/.local/lib/python3.6/site-packages (from gym>=0.9.4->sushigo==0.1.0) (4.8.1)
Requirement already satisfied: zipp>=0.5 in /home/selfplay/.local/lib/python3.6/site-packages (from importlib-metadata>=4.8.1->gym>=0.9.4->sushigo==0.1.0) (3.6.0)
Requirement already satisfied: typing-extensions>=3.6.4 in /home/selfplay/.local/lib/python3.6/site-packages (from importlib-metadata>=4.8.1->gym>=0.9.4->sushigo==0.1.0) (3.10.0.2)
Installing collected packages: sushigo
  Running setup.py develop for sushigo
Successfully installed sushigo-0.1.0
> sudo docker-compose exec app python3 test.py -d -g 1 -a base base human -e sushigo 
/home/selfplay/.local/lib/python3.6/site-packages/ale_py/roms/utils.py:90: DeprecationWarning: SelectableGroups dict interface is deprecated. Use select.
  for external in metadata.entry_points().get(self.group, []):
Traceback (most recent call last):
  File "test.py", line 13, in <module>
    from stable_baselines import logger
  File "/home/selfplay/.local/lib/python3.6/site-packages/stable_baselines/__init__.py", line 3, in <module>
    from stable_baselines.a2c import A2C
  File "/home/selfplay/.local/lib/python3.6/site-packages/stable_baselines/a2c/__init__.py", line 1, in <module>
    from stable_baselines.a2c.a2c import A2C
  File "/home/selfplay/.local/lib/python3.6/site-packages/stable_baselines/a2c/a2c.py", line 3, in <module>
    import gym
  File "/home/selfplay/.local/lib/python3.6/site-packages/gym/__init__.py", line 13, in <module>
    from gym.envs import make, spec, register
  File "/home/selfplay/.local/lib/python3.6/site-packages/gym/envs/__init__.py", line 10, in <module>
    _load_env_plugins()
  File "/home/selfplay/.local/lib/python3.6/site-packages/gym/envs/registration.py", line 269, in load_env_plugins
    context = contextlib.nullcontext()
AttributeError: module 'contextlib' has no attribute 'nullcontext'

Issue when legal actions mask is dependent on current player

I have a custom environment where the legal actions depend on the state of the board and the current player, and when I try to train my first agent the legal_actions mask isn't computed correctly for the agent, but it is for the opponent. I'm guessing the issue comes from the code below (found in SelfPlayWrapper). Since the legal_actions depend on current_player_num and agent_player_num != current_player_num, it cannot calculate the correct mask for the agent. Please let me know if you have any ideas on how to fix this.

def continue_game(self):
    observation = None
    reward = None
    done = None
    while self.current_player_num != self.agent_player_num:
        action = self.current_agent.choose_action(self, choose_best_action = False, mask_invalid_actions = True)
        observation, reward, done, _ = super(SelfPlayEnv, self).step(action)
        logger.debug(f'Rewards: {reward}')
        logger.debug(f'Done: {done}')
        if done:
            break

    return observation, reward, done, None

"Permissions not granted"

Following the README. After bringing up the Docker image and installing the sushigo env, I try to run test.py and encounter this error:

> docker-compose exec app python3 test.py -d -g 1 -a base base human -e sushigo
Logging to logs
Saving base.zip PPO model...
Permissions not granted on zoo/sushigo/...

Error with training

It seems like there's an error when I try to use the module with a custom env, which occurs after the first iteration:

Optimizing...
     pol_surr |    pol_entpen |       vf_loss |            kl |           ent
Traceback (most recent call last):
  File "/home/selfplay/.local/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1365, in _do_call
    return fn(*args)
  File "/home/selfplay/.local/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1350, in _run_fn
    target_list, run_metadata)
  File "/home/selfplay/.local/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1443, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Incompatible shapes: [1024,2] vs. [1024]
         [[{{node gradients/loss/sub_8_grad/BroadcastGradientArgs}}]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "train.py", line 184, in <module>
    cli()
  File "train.py", line 179, in cli
    main(args)
  File "train.py", line 118, in main
    model.learn(total_timesteps=int(1e9), callback=[eval_callback], reset_num_timesteps = False, tb_log_name="tb")
  File "/home/selfplay/.local/lib/python3.6/site-packages/stable_baselines/ppo1/pposgd_simple.py", line 297, in learn
    cur_lrmult, sess=self.sess)
  File "/home/selfplay/.local/lib/python3.6/site-packages/stable_baselines/common/tf_util.py", line 330, in __call__
    results = sess.run(self.outputs_update, feed_dict=feed_dict, **kwargs)[:-1]
  File "/home/selfplay/.local/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 956, in run
    run_metadata_ptr)
  File "/home/selfplay/.local/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1180, in _run
    feed_dict_tensor, options, run_metadata)
  File "/home/selfplay/.local/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1359, in _do_run
    run_metadata)
  File "/home/selfplay/.local/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1384, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Incompatible shapes: [1024,2] vs. [1024]
         [[node gradients/loss/sub_8_grad/BroadcastGradientArgs (defined at /home/selfplay/.local/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py:1748) ]]

Original stack trace for 'gradients/loss/sub_8_grad/BroadcastGradientArgs':
  File "train.py", line 184, in <module>
    cli()
  File "train.py", line 179, in cli
    main(args)
  File "train.py", line 82, in main
    model = PPO1.load(os.path.join(model_dir, 'base.zip'), env, **params)
  File "/home/selfplay/.local/lib/python3.6/site-packages/stable_baselines/common/base_class.py", line 947, in load
    model.setup_model()
  File "/home/selfplay/.local/lib/python3.6/site-packages/stable_baselines/ppo1/pposgd_simple.py", line 193, in setup_model
    [self.summary, tf_util.flatgrad(total_loss, self.params)] + losses)
  File "/home/selfplay/.local/lib/python3.6/site-packages/stable_baselines/common/tf_util.py", line 381, in flatgrad
    grads = tf.gradients(loss, var_list)
  File "/home/selfplay/.local/lib/python3.6/site-packages/tensorflow_core/python/ops/gradients_impl.py", line 158, in gradients
    unconnected_gradients)
  File "/home/selfplay/.local/lib/python3.6/site-packages/tensorflow_core/python/ops/gradients_util.py", line 679, in _GradientsHelper
    lambda: grad_fn(op, *out_grads))
  File "/home/selfplay/.local/lib/python3.6/site-packages/tensorflow_core/python/ops/gradients_util.py", line 350, in _MaybeCompile
    return grad_fn()  # Exit early
  File "/home/selfplay/.local/lib/python3.6/site-packages/tensorflow_core/python/ops/gradients_util.py", line 679, in <lambda>
    lambda: grad_fn(op, *out_grads))
  File "/home/selfplay/.local/lib/python3.6/site-packages/tensorflow_core/python/ops/math_grad.py", line 1144, in _SubGrad
    SmartBroadcastGradientArgs(x, y, grad))
  File "/home/selfplay/.local/lib/python3.6/site-packages/tensorflow_core/python/ops/math_grad.py", line 99, in SmartBroadcastGradientArgs
    rx, ry = gen_array_ops.broadcast_gradient_args(sx, sy)
  File "/home/selfplay/.local/lib/python3.6/site-packages/tensorflow_core/python/ops/gen_array_ops.py", line 830, in broadcast_gradient_args
    "BroadcastGradientArgs", s0=s0, s1=s1, name=name)
  File "/home/selfplay/.local/lib/python3.6/site-packages/tensorflow_core/python/framework/op_def_library.py", line 794, in _apply_op_helper
    op_def=op_def)
  File "/home/selfplay/.local/lib/python3.6/site-packages/tensorflow_core/python/util/deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "/home/selfplay/.local/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py", line 3357, in create_op
    attrs, op_def, compute_device)
  File "/home/selfplay/.local/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py", line 3426, in _create_op_internal
    op_def=op_def)
  File "/home/selfplay/.local/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py", line 1748, in __init__
    self._traceback = tf_stack.extract_stack()

...which was originally created as op 'loss/sub_8', defined at:
  File "train.py", line 184, in <module>
    cli()
[elided 2 identical lines from previous traceback]
  File "/home/selfplay/.local/lib/python3.6/site-packages/stable_baselines/common/base_class.py", line 947, in load
    model.setup_model()
  File "/home/selfplay/.local/lib/python3.6/site-packages/stable_baselines/ppo1/pposgd_simple.py", line 147, in setup_model
    vf_loss = tf.reduce_mean(tf.square(self.policy_pi.value_flat - ret))
  File "/home/selfplay/.local/lib/python3.6/site-packages/tensorflow_core/python/ops/math_ops.py", line 899, in binary_op_wrapper
    return func(x, y, name=name)
  File "/home/selfplay/.local/lib/python3.6/site-packages/tensorflow_core/python/ops/gen_math_ops.py", line 11086, in sub
    "Sub", x=x, y=y, name=name)
  File "/home/selfplay/.local/lib/python3.6/site-packages/tensorflow_core/python/framework/op_def_library.py", line 794, in _apply_op_helper
    op_def=op_def)
  File "/home/selfplay/.local/lib/python3.6/site-packages/tensorflow_core/python/util/deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "/home/selfplay/.local/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py", line 3357, in create_op
    attrs, op_def, compute_device)
  File "/home/selfplay/.local/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py", line 3426, in _create_op_internal
    op_def=op_def)
  File "/home/selfplay/.local/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py", line 1748, in __init__
    self._traceback = tf_stack.extract_stack()

How could this be caused? I defined my action space as a discrete space with 11 possible values, and my observation as 2 values each with 100 discrete values. I repurposed the Tic Tac Toe model, with a few changes shown below:

import tensorflow as tf
tf.get_logger().setLevel('INFO')
tf.compat.v1.logging.set_verbosity(tf.compat.v1.logging.ERROR)

from tensorflow.keras.layers import BatchNormalization, Activation, Flatten, Conv2D, Add, Dense, Dropout

from stable_baselines.common.policies import ActorCriticPolicy
from stable_baselines.common.distributions import CategoricalProbabilityDistributionType, CategoricalProbabilityDistribution


class CustomPolicy(ActorCriticPolicy):
    def __init__(self, sess, ob_space, ac_space, n_env, n_steps, n_batch, reuse=False, **kwargs):
        super(CustomPolicy, self).__init__(sess, ob_space, ac_space, n_env, n_steps, n_batch, reuse=reuse, scale=True)

        with tf.variable_scope("model", reuse=reuse):
            
            self._policy = policy_head(self.processed_obs)
            self._value_fn, self.q_value = value_head(self.processed_obs)

            self._proba_distribution  = CategoricalProbabilityDistribution(self._policy)

            
        self._setup_init()

    def step(self, obs, state=None, mask=None, deterministic=True):
        if deterministic:
            action, value, neglogp = self.sess.run([self.deterministic_action, self.value_flat, self.neglogp],
                                                   {self.obs_ph: obs})
        else:
            action, value, neglogp = self.sess.run([self.action, self.value_flat, self.neglogp],
                                                   {self.obs_ph: obs})
        return action, value[0], self.initial_state, neglogp

    def proba_step(self, obs, state=None, mask=None):
        return self.sess.run(self.policy_proba, {self.obs_ph: obs})

    def value(self, obs, state=None, mask=None):
        return self.sess.run(self.value_flat, {self.obs_ph: obs})



def value_head(y):
    vf = dense(y, 2, batch_norm = False, activation = 'tanh', name='vf')
    q = dense(y, 11, batch_norm = False, activation = 'tanh', name='q')
    return vf, q


def policy_head(y):
    policy = dense(y, 11, batch_norm = False, activation = None, name='pi')
    return policy


def resnet_extractor(y, **kwargs):

    y = convolutional(y, 32, 3)
    y = residual(y, 32, 3)

    return y



def convolutional(y, filters, kernel_size):
    y = Conv2D(filters, kernel_size=kernel_size, strides=1, padding='same')(y)
    y = BatchNormalization(momentum = 0.9)(y)
    y = Activation('relu')(y)
    return y

def residual(y, filters, kernel_size):
    shortcut = y

    y = Conv2D(filters, kernel_size=kernel_size, strides=1, padding='same')(y)
    y = BatchNormalization(momentum = 0.9)(y)
    y = Activation('relu')(y)

    y = Conv2D(filters, kernel_size=kernel_size, strides=1, padding='same')(y)
    y = BatchNormalization(momentum = 0.9)(y)
    y = Add()([shortcut, y])
    y = Activation('relu')(y)

    return y


def dense(y, filters, batch_norm = True, activation = 'relu', name = None):

    if batch_norm or activation:
        y = Dense(filters)(y)
    else:
        y = Dense(filters, name = name)(y)
    
    if batch_norm:
        if activation:
            y = BatchNormalization(momentum = 0.9)(y)
        else:
            y = BatchNormalization(momentum = 0.9, name = name)(y)

    if activation:
        y = Activation(activation, name = name)(y)
    
    return y

Allow "gym.spaces.Box" action space in custom environments

Per documentation on the SIMPLE blog post, the current implementation of SIMPLE only allows "Discrete" action spaces in custom environments. It would be useful to be able to use "Box" action spaces as well.

When attempting to use a "Box" action space such as this:
self.action_space = gym.spaces.Box(low=0, high=1, shape=(10, 10), dtype=np.int)

The following output is generated:
Value passed to parameter 'indices' as DataType float32 not in list of allowed values: uint8, int32, int64

Please let me know how I can help.

Adding other dependencies

Our environment depends on the (external) library pycatan, which I tried to install by adding it to both the requirements.txt and the Dockerfile, yet I always get an error saying the container can't find the library:

> sudo docker-compose exec app python3 train.py -r -e catan
Logging to logs

Setting up the selfplay training environment opponents...
Traceback (most recent call last):
  File "/app/utils/register.py", line 24, in get_environment
    from catan.envs.catan import CatanEnv
  File "/app/environments/catan/catan/envs/__init__.py", line 1, in <module>
    from catan.envs.catan import CatanEnv
  File "/app/environments/catan/catan/envs/catan.py", line 3, in <module>
    from pycatan import Resource, Coords, Path, Intersection, Hex
ModuleNotFoundError: No module named 'pycatan'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "train.py", line 188, in <module>
    cli()
  File "train.py", line 183, in cli
    main(args)
  File "train.py", line 58, in main
    base_env = get_environment(args.env_name)
  File "/app/utils/register.py", line 32, in get_environment
    raise Exception(f'Install the environment first using: \nbash scripts/install_env.sh {env_name}\nAlso ensure the environment is added to /utils/register.py')
Exception: Install the environment first using: 
bash scripts/install_env.sh catan
Also ensure the environment is added to /utils/register.py

However, the environment is in /utils/register.py, and the install script has run:

> sudo docker-compose exec app pip3 install pycatan        
Defaulting to user installation because normal site-packages is not writeable
Requirement already satisfied: pycatan in /home/selfplay/.local/lib/python3.6/site-packages (0.13)
Requirement already satisfied: quotequail in /home/selfplay/.local/lib/python3.6/site-packages (from pycatan) (0.2.3)

The setup clearly states that the requirement is satisfied, yet I can't seem to get it loaded?

Training and eval env are not of the same type

When executing docker-compose exec app python3 train.py
I'm getting a warning message:

/home/selfplay/.local/lib/python3.6/site-packages/stable_baselines/common/callbacks.py:287: UserWarning: Training and eval env are not of the same type<SelfPlayEnv instance> != <stable_baselines.common.vec_env.dummy_vec_env.DummyVecEnv object at 0x7f1db36e9c50>
  "{} != {}".format(self.training_env, self.eval_env))

Besides that, the training works fine. Is this something I should worry about, or is it normal and I should ignore it?

Policy loss

Is the policy loss supposed to be fluctuating here between negative and positive? What could be the problem?
(screenshot of the policy loss chart)
