
alphazero-general's Introduction

AlphaZero General

This is an implementation of AlphaZero based on the following repositories:

This project is still work-in-progress, so expect frequent fixes, updates, and much more detailed documentation soon.

You may join the Discord server if you wish to join the community and discuss this project, ask questions, or contribute to the framework's development.

Current differences from the above repos

  1. Cython: The most computationally intensive components are written in Cython and compiled for a runtime speedup of up to 30x compared to pure Python.
  2. GUI: Includes a graphical user interface for easier training and arena comparisons. It also allows games to be played visually (agent-agent, agent-human, human-human) instead of through a command line interface (work-in-progress). Naturally, custom environments must implement their own GUI.
  3. Node-based MCTS: Uses an improved MCTS implementation that stores the search tree in nodes instead of dictionary lookups. This gives a large performance increase and far lower RAM usage than the previous implementation: roughly a 30-50% speed increase and 95% less RAM usage based on experimental data. The base code for this was provided by bhansconnect.
  4. Model Gating: After each iteration, the new model is compared to the previous iteration. The better-performing model continues forward, based on an adjustable minimum win rate parameter.
  5. Batched MCTS: bhansconnect's repo already includes this for self play, but it has been expanded to Arena as well for faster comparison of models.
  6. N-Player Support: Any number of players is supported! This allows training on a greater variety of games, such as many types of card games or games like Catan.
  7. Warmup Iterations: A few self play iterations at the beginning of training can optionally be done using a random policy and value to speed up the initial generation of training data, instead of using a model that is initially random anyway. This makes these iterations purely CPU-bound.
  8. Root Dirichlet Noise & Root Temperature, Discount: These allow for better exploration, and MCTS doesn't get stuck in local minima as often. Discount allows AlphaZero to "understand" the concept of time and choose actions that lead to a win more quickly/efficiently, as opposed to a win that would only occur later in the game (a sketch of the idea follows this list).
  9. More Adjustable Parameters: This implementation exposes numerous hyperparameters, allowing substantial control over the training process. More on hyperparameters below, where the usage of some of them is discussed.
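To illustrate the discount idea, here is a minimal sketch of how a discount factor could shape the value targets of a finished game. The names (GAMMA, assign_value_targets) are hypothetical and are not taken from this repository's code:

# Hypothetical sketch: discounting value targets by distance from the game's end.
GAMMA = 0.98  # discount per remaining move; 1.0 recovers undiscounted targets

def assign_value_targets(game_length, final_result):
    """Return one value target per move of a finished game.

    A win that is only a few moves away keeps a target close to +/-1,
    while a distant win is discounted toward 0, so the search learns to
    prefer faster wins.
    """
    targets = []
    for move_index in range(game_length):
        remaining_moves = game_length - move_index
        targets.append(final_result * GAMMA ** remaining_moves)
    return targets

print(assign_value_targets(10, 1.0))  # later positions get stronger targets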

Getting Started

Install required packages

Make sure you have Python 3 installed. Then run:

pip3 install -r requirements.txt

GUI (work-in-progress)

[Screenshot: the AlphaZeroGUI training view]

AlphaZeroGUI, built with PyQt5, is intended to simplify the training, hyperparameter selection, and deployment/inference processes, as opposed to modifying different files and running them from the command line. It can be run with the following command:

python -m AlphaZeroGUI.main

After that, the controls are generally intuitive. Default/saved arguments can be loaded, the environment can be selected (see the section Create your own game to train on for implementing an environment for the GUI), arguments can be edited/created, and TensorBoard can be opened. Simple training stats are shown on the left side, and progress is shown at the bottom.

[Screenshot: the Arena tab with the brandubh environment loaded]

At the top left, the Arena tab can be toggled, as seen above. Here, a separate set of args & env can be loaded, and the type of each player can be selected. For example, in the image above the brandubh environment is loaded and an MCTS player with a model is pitted against a human player.

For now, Arena is still displayed in the console, but eventually there will be support for each environment to implement its own graphical interface to play games (agent-agent, agent-player, player-player).

Try one of the existing examples

  1. Adjust the hyperparameters in one of the examples to your preference (in the GUI editor, or at the path alphazero/envs/<env name>/train.py). Take a look at Coach.py, where the default arguments are stored, to see the available options. For example, edit alphazero/envs/connect4/train.py.

  2. After that, you can start training AlphaZero on your chosen environment by pressing the 'play' button in the GUI, or running the following in the console:

python3 -m alphazero.envs.<env name>.train

Make sure that your working directory is the root of the repo.

  3. You can observe how training is progressing in the GUI or from the console output, or you can run tensorboard for a visual representation. To start tensorboard in the console, run:

tensorboard --logdir ./runs

Run this from the project root as well. runs is the default directory for tensorboard data, but it can be changed in the hyperparameters.

  4. Once you have trained a model and want to test it, either against itself or against yourself, use the Arena tab in the GUI as described above; alternatively, in the console, change alphazero/pit.py (or alphazero/envs/<env name>/pit.py) to your needs and run it with:

python3 -m alphazero.pit

(once again, this will be easier to accomplish in future updates). You may also modify roundrobin.py to run a tournament with different iterations of models to rank them using a rating system.

Create your own game to train on

More detailed documentation is on the way, but essentially you must subclass GameState from alphazero/Game.py and implement its abstract methods correctly. Your game engine subclass of GameState must be named Game and located in alphazero/envs/<env name>/<env name>.py in order for the GUI to recognize it. Once that is done, create a train file, choose hyperparameters accordingly, and start training, or use the GUI to train and pit. It may also be helpful to use and subclass the boardgame module to create a new game engine more easily, as it implements some functions that can be useful. A rough skeleton of such a subclass is sketched below.
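The skeleton below is hypothetical, using a tic-tac-toe-like game for illustration. The method names are guesses; the exact abstract methods that must be implemented are declared in alphazero/Game.py and should be taken from there:

# Hypothetical skeleton for alphazero/envs/<env name>/<env name>.py.
# Method names are illustrative; check alphazero/Game.py for the real interface.
import numpy as np
from alphazero.Game import GameState


class Game(GameState):  # must be named `Game` for the GUI to recognize it
    def __init__(self):
        super().__init__()  # the base class may require arguments; check Game.py
        self.board = np.zeros((3, 3), dtype=np.intc)
        self._turn = 0  # simple turn tracking for this sketch

    def valid_moves(self):
        # Boolean mask over the action space: True where a move is legal.
        return self.board.flatten() == 0

    def play_action(self, action):
        # Apply `action` for the current player and advance the turn.
        row, col = divmod(action, 3)
        self.board[row, col] = 1 if self._turn == 0 else -1
        self._turn = 1 - self._turn

    def win_state(self):
        # Return the per-player result once the game has ended.
        ...

    def observation(self):
        # Encode the position as the tensor the network consumes.
        ...

    def clone(self):
        # Deep copy used by MCTS when expanding and simulating nodes.
        ...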

As a general guideline, game engine files and other potential bottlenecks should be implemented in Cython, or at least stored as .pyx files to be compiled at runtime, for increased performance.

Description of some hyperparameters

workers: Number of processes used for self play, Arena comparisons, and training the model. Should generally be set to the number of processors minus 1.

process_batch_size: The size of the batches used for batched MCTS during self play. Equivalent to the number of games that are played at the same time in each worker. For example, a batch size of 128 with 4 workers results in 128*4 = 512 games being played simultaneously.

minTrainHistoryWindow, maxTrainHistoryWindow, trainHistoryIncrementIters: The number of past iterations to load self play training data from. Starts at min and increments once every trainHistoryIncrementIters iterations until it reaches max.
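A small sketch of the presumed schedule (the function name is hypothetical):

def history_window(iteration, min_window, max_window, increment_iters):
    # Grow the window by one every `increment_iters` iterations, clamped at max.
    return min(max_window, min_window + iteration // increment_iters)

# e.g. min=4, max=20, increment every 2 iterations:
# iteration 0 -> 4, iteration 2 -> 5, ..., iteration 32 and beyond -> 20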

max_moves: Number of moves in the game before it ends in a draw (for now this should be implemented manually in getGameEnded of your Game class; automatic draws are planned). Also used in the calculation of the default_temp_scaling function.

num_stacked_observations: The number of past observations from the game to stack as a single observation. Should also be done manually for now, but take a look at how it was implemented in alphazero/envs/tafl/tafl.pyx.

numWarmupIters: The number of warmup iterations to perform. Warmup is self play but with a random policy and value, to speed up the initial generation of self play data. Should only be 1-3; otherwise the neural net is only receiving random moves as training data. This can be done at the beginning because the model's actions are random anyway, so it is done purely for performance.

skipSelfPlayIters: The number of self play data generation iterations to skip. This assumes that training data already exists for those iterations and can be used for training. This is useful, for example, when training is interrupted, because the data is saved on disk and doesn't have to be generated from scratch.

symmetricSamples: Add symmetric samples to the training data from self play, based on the user-implemented symmetries method in the Game class. Assumes that this method is implemented. For example, in Viking Chess the board can be rotated 90 degrees and mirror-flipped any number of times while still being a valid game state, so these symmetries can be used to generate more training data (see the sketch below).
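As an illustration only, a symmetries hook for Connect Four might return the original sample plus its horizontal mirror. The standalone function below is a sketch; the real hook is a method on your Game class, whose exact signature should be taken from the repo:

import numpy as np

def connect4_symmetries(observation, policy):
    # Connect Four is symmetric under a left-right mirror of the columns,
    # so every sample can be duplicated with the board and the per-column
    # policy both flipped.
    flipped_obs = observation[..., ::-1]   # mirror the column axis
    flipped_policy = policy[::-1]          # mirror the 7 column probabilities
    return [(observation, policy), (flipped_obs, flipped_policy)]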

numMCTSSims: Number of Monte Carlo Tree Search simulations to execute for each move in self play. A higher number is much slower, but also produces better value and policy estimates.

probFastSim: The probability of a fast MCTS simulation to occur in self play, in which case numFastSims simulations are done instead of numMCTSSims. However, fast simulations are not saved to training history.

max_gating_iters: If a model fails to beat its own past iteration this many times, gating is temporarily reset and the model is allowed to move on to the next iteration. Use None to disable this feature.

min_next_model_winrate: The minimum win rate the new iteration must achieve against the last model in order to move on. If it doesn't reach this number, the previous model is used again (model gating).
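The pseudo-logic below summarizes how these two gating parameters might interact. It is a sketch of the behaviour described above, not the repo's actual code, and the default values shown are only examples:

def accept_new_model(winrate, failed_gates, min_next_model_winrate=0.52,
                     max_gating_iters=3):
    if winrate >= min_next_model_winrate:
        return True   # new model passes gating and replaces the previous one
    if max_gating_iters is not None and failed_gates + 1 >= max_gating_iters:
        return True   # gating reset: allow progress despite the low win rate
    return False      # keep the previous model and try again next iteration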

cpuct: A constant for balancing exploration vs. exploitation in the MCTS algorithm. A higher value promotes more exploration of new actions, whereas a lower one promotes exploitation of previously known good actions. A typical range is 1-4, depending on the environment; a game with fewer possible moves on each turn generally needs a lower cpuct.

fpu_reduction: "First Play Urgency" reduction decreases the initial Q value of an unvisited node by this factor; it must be in the range [-1, 1]. The closer this value is to 1, the more MCTS is discouraged from exploring unvisited nodes, which (hopefully) lets it focus on paths that are more familiar. If this is set to 0, no reduction is done and unvisited nodes inherit their parent's Q value. Closer to -1 (not recommended to go below 0), unvisited nodes become more preferred, which can lead to more exploration. The sketch below shows how both parameters typically enter the selection formula.
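To make both parameters concrete, here is a generic PUCT-style selection score with an FPU reduction applied to unvisited children. This is a sketch of the general technique; the exact formula used in this repo's MCTS may differ:

import math

def puct_score(parent_visits, child_visits, child_q, prior, parent_q,
               cpuct=2.0, fpu_reduction=0.2):
    # Unvisited children fall back to the parent's Q minus the FPU reduction;
    # a larger (positive) reduction makes them less attractive to explore.
    if child_visits == 0:
        q = parent_q - fpu_reduction
    else:
        q = child_q
    # cpuct scales the exploration bonus driven by the prior and visit counts.
    u = cpuct * prior * math.sqrt(parent_visits) / (1 + child_visits)
    return q + u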

num_channels: The number of channels each ResNet convolution block has.

depth: The number of stacked ResNet blocks to use in the network.

value_head_channels/policy_head_channels: The number of channels to use for the 1x1 value and policy convolution heads respectively. The value and policy heads pass data onto their respective dense layers.

value_dense_layers/policy_dense_layers: These arguments define the sizes and number of layers in the dense network of the value and policy head. This must be a list of integers where each element defines the number of neurons in the layer and the number of elements defines how many layers there should be.
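To put the hyperparameters above together, here is a hypothetical excerpt of what an env's train.py arguments might look like. The keys mirror the parameters described in this section, but the values are placeholders and the surrounding boilerplate (imports, Coach setup) is omitted; consult an existing env's train.py and Coach.py for the real defaults:

# Hypothetical args excerpt for alphazero/envs/<env name>/train.py.
args = {
    'workers': 7,                      # CPU cores minus one
    'process_batch_size': 128,         # concurrent games per self play worker
    'numMCTSSims': 250,
    'probFastSim': 0.75,
    'numWarmupIters': 1,
    'min_next_model_winrate': 0.52,    # model gating threshold
    'max_gating_iters': 3,
    'cpuct': 2.0,
    'fpu_reduction': 0.2,
    # network architecture
    'num_channels': 128,
    'depth': 8,
    'value_head_channels': 32,
    'policy_head_channels': 32,
    'value_dense_layers': [1024, 256],
    'policy_dense_layers': [1024],
}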

Results

Connect Four

envs/connect4

AlphaZero was previously trained on the connect4 env for 208 iterations, but unfortunately the specific args used for that run were lost. The args were quite close to the current defaults for the connect4 env (but with a lower batch size and fewer games per iteration, hence the large number of iterations), so the trained model can still be loaded with some trial and error.

This training instance was very successful and was never beaten in any human trial. Here are the Tensorboard logs: [Tensorboard screenshots] It can be seen that over time, as total loss decreases, the model plays increasingly better against the baseline tester (which I believe was a raw MCTS player at the time). Note that the average game length and number of draws also increase as the model understands the dynamics of the game better and struggles more to beat itself as it improves.

Towards the end of training, the win rate against the past model suddenly decreases; I believe this is because the model has learnt to play a perfect game and begins to overfit as it continues to generate very similar data via self-play. This overfitting makes it less general and adaptable to dynamic situations, and therefore its past self can defeat it because it adapts better.

The model file for the most successful iteration (193) can be downloaded here. As mentioned above, subsequent iterations underperformed most likely due to overfitting.

Another instance was trained later using the current default arguments. It was trained using more recent features such as FPU value, root temperature/dirichlet noise, etc. Only 35 iterations were trained, as it was intended just to test these new features.

I was surprised to see that even in only 35 iterations (which took approximately 8 hours on a GTX 1070 and an i5-4690 CPU) it had reached superhuman capabilities. Take a look at the Tensorboard logs: [Tensorboard screenshots] For unknown reasons, it does not perform as well against the baseline tester as the instance above, but this is probably due to the use of Dirichlet noise and root temperature in Arena, which can cause AlphaZero to make a 'mistake' by random chance (intended for further exploration in self-play). However, if these are turned off, the temperature is set to a low value (0-0.25), and more 'thinking' time is allowed (the number of MCTS simulations is increased), then even this undertrained model can essentially play a perfect game.

The model for the latest iteration (35) of this instance can be downloaded here.

Viking Chess - Brandubh

envs/brandubh

The tensorboard logs for the best trained instance have been corrupted, so they cannot be included here. It was trained for 48 iterations with the default args included in the GUI (AlphaZeroGUI/args/brandubh.json), and achieved human-level results when testing.

However, the model does occasionally have a strange tendency to disregard obvious opportunities, such as a victory in one move or blocking a defeat. Also, the game length seems to even out around 25 moves - despite the players' nearly even win rate - instead of increasing to the maximum as expected. This is being investigated, but it is either due to inappropriate hyperparameters or a bug in the MCTS code related to recent changes.

Iteration 48 of the model can be downloaded here.

alphazero-general's People

Contributors

bhansconnect, kevaday


alphazero-general's Issues

Error in default code from default download

Process:

  1. Downloaded zip, extracted into a useful directory. --> OK
  2. Opened README.md, started working through "Getting Started"
  3. In the terminal, in that directory, ran "pip3 install -r requirements.txt" --> all requirements satisfied
  4. ran "python -m AlphaZeroGUI.main" --> Got ERROR

Here is the text for that error:

Traceback (most recent call last):
  File "/home/<username>/anaconda3/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/<username>/anaconda3/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/<username>/<useful_directory>/AlphaZeroGUI/main.py", line 2, in <module>
    from PySide2.QtWidgets import QApplication, QMessageBox, QInputDialog, QTableWidgetItem, QLineEdit
ImportError: /usr/lib/x86_64-linux-gnu/libgssapi_krb5.so.2: symbol krb5_ser_context_init version krb5_3_MIT not defined in file libkrb5.so.3 with link time reference

It looks like requirements.txt may need something involving "PySide2".

I spun up a conda environment and am manually going through the requirements... and now I get a GUI interface!

Strange "bug?" with model gating

I keep getting model gating at version 4 even though the win rate against the past model is 114/45.
Is this a bug or is this intended? It seemingly plays well against the MCTS player, with a model vs. baseline rate of 154/18. Not sure what this means.

A simple attempt at distributed training

Could it be possible to have multiple machines all generate training games over the same network and send the generated games to a "master" machine, which uses the training data to train a new version of the model and then sends the new model (after applying gating, if enabled) back to the other machines to start over again? I'm 99% sure I can implement this using sockets fairly easily, but I need to know a few things.

  • How are training samples stored and sampled for training?
  • How can I "merge" training samples from multiple files into the base 3 .pkl files (data, policy, value)?

I am reluctant to use Ray for this since it sounds like overkill for a simple task of generation and file transfer. Scheduling would be very straightforward: just count how many samples have been transferred, and once they reach a threshold of, say, 1 million, tell the other machines to stop generation and start training on the master machine. After this, send the (gated) model back and have the machines run baseline testing if needed.

I think this is basically what Lc0 does but with distributed training as well which would probably need Ray

Exception ignored in: 'alphazero.MCTS.MCTS._add_root_noise'

I'm training gobang (10x10 board, connect 5 to win), and I got this error:

FloatingPointError: underflow encountered in cast
Exception ignored in: 'alphazero.MCTS.MCTS._add_root_noise'
Traceback (most recent call last):
  File "/usr/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()

Training is not interrupted, but I don't know what will happen.
I'm training it with Python 3.8.

Training not working on macOS due to multiprocessing queue missing implementation

Hi,
I'm struggling to run the training on macOS, while on Windows everything works just fine.
The following error is raised when I run python -m alphazero.envs.tictactoe.train in a macOS terminal.

Traceback (most recent call last):
  File "/Users/User/anaconda3/envs/alphazero/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/Users/User/anaconda3/envs/alphazero/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/Users/User/Projects/alphazero-general/alphazero/envs/tictactoe/train.py", line 37, in <module>
    c.learn()
  File "/Users/User/Projects/alphazero-general/alphazero/Coach.py", line 250, in learn
    self.saveIterationSamples(self.model_iter)
  File "/Users/User/Projects/alphazero-general/alphazero/Coach.py", line 146, in wrapper
    ret = func(self, *args, **kwargs)
  File "/Users/User/Projects/alphazero-general/alphazero/Coach.py", line 365, in saveIterationSamples
    num_samples = self.file_queue.qsize()
  File "/Users/User/anaconda3/envs/alphazero/lib/python3.9/multiprocessing/queues.py", line 126, in qsize
    return self._maxsize - self._sem._semlock._get_value()
NotImplementedError
/Users/User/anaconda3/envs/alphazero/lib/python3.9/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 10 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '

I think that the issue is related to this issue.

AttributeError: 'function' object has no attribute 'supports_process'

Dear Kevi Aday,
Thanks for making your code public. There is an error when I try to train the connect4 game. Can you help me solve this error?
Thank you in advance!
Albert

PITTING AGAINST BASELINE: RawMCTSPlayer
Traceback (most recent call last):
  File "D:\ana3\envs\muzero\lib\runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "D:\ana3\envs\muzero\lib\runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "D:\88Projects\alphazero-general\alphazero\envs\connect4\train.py", line 58, in <module>
    c.learn()
  File "D:\88Projects\alphazero-general\alphazero\Coach.py", line 267, in learn
    self.compareToBaseline(self.model_iter)
  File "D:\88Projects\alphazero-general\alphazero\Coach.py", line 148, in wrapper
    ret = func(self, *args, **kwargs)
  File "D:\88Projects\alphazero-general\alphazero\Coach.py", line 589, in compareToBaseline
    self.arena = Arena(players, self.game_cls, use_batched_mcts=can_process, args=self.args)
  File "alphazero\Arena.pyx", line 66, in alphazero.Arena._set_state.decorator.wrapper
    ret = func(self, *args, **kwargs)
  File "alphazero\Arena.pyx", line 108, in alphazero.Arena.Arena.__init__
    self.players = players
  File "alphazero\Arena.pyx", line 129, in alphazero.Arena.Arena.players
    self.__check_players_valid()
  File "alphazero\Arena.pyx", line 132, in genexpr
    if self.use_batched_mcts and not all(p.player.supports_process() for p in self.players):
  File "alphazero\Arena.pyx", line 132, in genexpr
    if self.use_batched_mcts and not all(p.player.supports_process() for p in self.players):
AttributeError: 'function' object has no attribute 'supports_process'

Self-play - ValueError: Invalid action encountered while updating root

Hi,

Attempting to use this as instructed (subclassing GameState) seems to work at first, but during the self-play step the console is filled with the same errors:

Exception ignored in: 'alphazero.MCTS.MCTS.update_root'
Traceback (most recent call last):
  File "/home/tyto/miniconda3/envs/torch/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
ValueError: Invalid action encountered while updating root: 155
ValueError: Invalid action encountered while updating root: 91

'PySide2.QtCore.Qt.WindowType' object cannot be interpreted as an integer

When I run the command python -m AlphaZeroGUI.main, I get this error:

  File "/home/zbf/Desktop/git/github/alphazero-general/AlphaZeroGUI/_gui.py", line 86, in __init__
    self.setWindowFlags(QtCore.Qt.WindowMinimizeButtonHint | QtCore.Qt.WindowCloseButtonHint)
TypeError: 'PySide2.QtCore.Qt.WindowType' object cannot be interpreted as an integer

code is:

class Ui_FormMainMenu(QtWidgets.QMainWindow):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.setWindowFlags(QtCore.Qt.WindowMinimizeButtonHint | QtCore.Qt.WindowCloseButtonHint)
        self.setupUi(self)

    def setupUi(self, FormMainMenu):
        FormMainMenu.setObjectName("FormMainMenu")
        FormMainMenu.resize(876, 647)
        ......

I can't fix this bug. I'm running it on Python 3.11.

Getting a segfault after appending agents

Hey, I just pulled your repo and have a custom game interface. I get a segfault after appending agents, with plenty of RAM available, ulimit 32k, and stack and recursion size maxed out. It worked fine on bhansconnect's original repo - any ideas as to what could be causing it?

Running cpuonly on windows gives "pinned memory requires CUDA"

I don't have a fancy gpu on this computer.

I used these commands to create my environment:

conda create --name agz_kevaday
conda activate agz_kevaday
conda install pytorch torchvision torchaudio cpuonly -c pytorch
conda install -c anaconda numpy cython 
conda install -c conda-forge tensorboard tensorboardx choix

I navigated to the main alphazero-general directory and then executed this:
python -m alphazero.envs.tictactoe.train

Here is a names-redacted version of the output:

Because of batching, it can take a long time before any games finish.
------ITER 1------
Warmup: random policy and value
Traceback (most recent call last):
  File "C:\Users\_redacted_\Anaconda3\envs\agz_kevaday\lib\runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Users\_redacted_\Anaconda3\envs\agz_kevaday\lib\runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "C:\Users\_redacted_\Documents\Personal\alphazero-general\alphazero\envs\tictactoe\train.py", line 32, in <module>
    c.learn()
  File "C:\Users\_redacted_\Documents\Personal\alphazero-general\alphazero\Coach.py", line 180, in learn
    self.generateSelfPlayAgents()
  File "C:\Users\_redacted_\Documents\Personal\alphazero-general\alphazero\Coach.py", line 226, in generateSelfPlayAgents
    self.input_tensors[i].pin_memory()
RuntimeError: Pinned memory requires CUDA. PyTorch splits its backend into two shared libraries: a CPU library and a CUDA library; this error has occurred because you are trying to use some CUDA functionality, but the CUDA library has not been loaded by the dynamic linker for some reason.  The CUDA library MUST be loaded, EVEN IF you don't directly use any symbols from the CUDA library! One common culprit is a lack of -INCLUDE:?warp_size@cuda@at@@YAHXZ in your link arguments; many dynamic linkers will delete dynamic library dependencies if you don't depend on any of their symbols.  You can check if this has occurred by using link on your binary to see if there is a dependency on *_cuda.dll library.

Given the error, I think the code doesn't work for cpu-only. It seems to be saying "CUDA required".
