
sai's People

Contributors

akdtg, alreadydone, amato-gianluca, apetresc, barrybecker4, bittsitt, cglemon, chinchangyang, earthengine, gcp, godmoves, hersmunch, ihavnoid, killerducky, kuba97531, marcocalignano, mkolman, nerai, parton69, roy7, sethtroisi, tfifie, thynson, trinetra75, ttl, tux3, vandertic, wonderingabout, ywrt, zediir


sai's Issues

Invalid authentication error

I tried setting up an account and am now trying to connect a CPU autogtp instance, but I keep getting "invalid authentication", followed by a JSON parse error ("illegal value").
I'm pretty sure I entered my auth details correctly on the hta page, so I'm not sure what might be going wrong. Is anyone else having a similar issue?

elo display

Why not show the Elo ratings in the Best Network Hash table?

Help request: implement integer komi and draws

As was noted by afalturki and confirmed by gcp on the LZ forum, it would be really neat to have integer komi and draws implemented.

It shouldn't be too hard: the low-level management of the board comes from the Leela code and hence appears to be ready for these features, but the rest of the code is not.

I don't have much time to implement this now, so I wanted to ask whether someone has the time and is willing to help.

Thank you in advance.

Why recent matches started earlier

The wiki says:

when the game count reaches 3072 self-play games, training starts, based on the self-play games of the last n generations

but recent nets all started their matches after only about 1000 self-play games.

autogtp error of sai-0.17.4-gpu.exe

AutoGTP v18
Using 1 game thread(s) per device.
Starting tuning process, please wait...
Network connection to server failed.
NetworkException: Curl returned non-zero exit code 7
Retrying in 30 s.

On the same machine, running leelaz's autogtp works fine:

AutoGTP v18
Using 1 game thread(s) per device.
Starting tuning process, please wait...
net: 6ee288bdde21cc080f8be9db33dd179385736cc3852bcc7a2206eaa4a6cf0538.
./leelaz --batchsize=5 --tune-only -w networks/6ee288bdde21cc080f8be9db33dd179385736cc3852bcc7a2206eaa4a6cf0538.gz
Leela Zero 0.17  Copyright (C) 2017-2019  Gian-Carlo Pascutto and contributors

Future

Will the network size be increased in the future?
Will this project continue until it becomes very strong?

What should I read to implement this for Blooms?

I am wondering if some documentation or a manual exists. Suggestions and help would be appreciated.

Go with 2 colors per player on a hexagonal board.
https://boardspace.net/blooms/english/rules.html


Separate a move into 2 moves.

Neural Network Inputs:
A hexagon-tiled board (which has 12 symmetries) for the last 4 board states:
  • an input set of the current player's blooms
  • an input set of the ally's blooms
  • 2 input sets for the opponent's 2 bloom colors
plus 1 bit for whether the ally or the opponent moves next.

Expressing hexagons as sums of multiples of (-1)^(2/3) and multiples of 1 allows easy symmetry computation by taking the complex conjugate or multiplying by (-1)^(1/3).
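Here is a minimal Python sketch of this symmetry computation (my own illustration; the coordinates and names are not taken from any existing code), assuming a cell is written as a + b*w with w = (-1)^(2/3):

import cmath

w = cmath.exp(2j * cmath.pi / 3)   # (-1)^(2/3): the second basis direction
r = cmath.exp(1j * cmath.pi / 3)   # (-1)^(1/3): rotation by 60 degrees

def symmetries(a, b):
    """All 12 images of the hex cell a + b*w (6 rotations, with/without mirror)."""
    z = a + b * w
    images = []
    for mirror in (False, True):
        p = z.conjugate() if mirror else z     # complex conjugate = reflection
        for k in range(6):
            images.append(p * r**k)            # multiply by r = rotate by 60 degrees
    return images

for s in symmetries(2, 1):
    print(round(s.real, 3), round(s.imag, 3))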

The network has the symmetry of swapping the opponent's colors.
If an ally color moves next, then the network has a further symmetry of swapping the current color with the allied color, because capturing only happens when the color switches teams.

A board with edge length n has 3n^2 - 3n + 1 territories.

Use a Tromp-Taylor analog for scoring.

This seems strange to me (why +2?) in FastBoard.h:
static constexpr int MAXSQ = ((BOARD_SIZE + 2) * (BOARD_SIZE + 2));

Do I need to have an extra layer of territories around the board?

Compute Joseki

Now that the network starts to play around 4-4, it may be the right time to activate the joseki script, so that the existing http://sai.unich.it/opening looks like https://zero.sjeng.org/opening.

It will be quite interesting to see whether SAI finds the same joseki as Leela Zero and, if so, after how many accumulated games.

Here is some guidance:

I didn't test locally, but I think it should be sufficient to run https://github.com/sai-dev/sai-server/blob/sainext/scripts/opening_analysis.js and add the generated static file into static, as per https://github.com/leela-zero/leela-zero-server/pull/120/files.

Maybe @rchs819 or @roy7 can say more about that.

What do the numbers mean in comments?

I found numbers like these in the comments:

-1.24, -9.17, 0.0556, 0.375, 0.491, 0, 1
1.07, 2.61, 0.0583, 0.538, 0.497, 0, 0

Can anyone tell me what they mean?

Feature explanation: symmetries exploitation

I believe symmetries have also been added to the current next branch of LZ, but I did not check whether the implementation is the same as ours.
What we have done in SAI is here.

The new command-line option --symm modifies the agent's behaviour in the following way.

For every node in the search tree, the policy probabilities of moves which are equivalent by symmetry are summed up and concentrated onto a single one of them, which is randomly chosen. The other symmetrical moves are completely excluded from the UCT tree.

The same option modifies the training info in the following way.

The total number of visits for the chosen move is split again (evenly, this time) among the symmetry-equivalent moves at the moment of writing the training information.

In our experiments on 7x7 we observed that the progression at the beginning was much faster with symmetries enabled and that the optimal first move (tengen) is typically discovered sooner and with a lower number of visits than before.

Notice that this will affect the probabilities appearing in the training info: instead of being the ratio v/N between move visits and total visits, for symmetric moves it may be V/N/2^n, where V is the total number of visits for the symmetry-equivalence class and n is 1, 2 or 3 according to the order of the symmetry involved. So one cannot simply deduce the number of visits from the common denominator of the probabilities any more.
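As an illustration of the first behaviour, here is a small Python sketch (my own, not the SAI implementation; the move encoding and the orbit function are invented for the example) of how policy mass can be summed over a symmetry class and concentrated on one random representative:

import random
from collections import defaultdict

def concentrate_policy(policy, orbit_of):
    """policy: dict move -> prior probability.
    orbit_of: maps a move to a canonical label of its symmetry class.
    Returns one randomly chosen representative per class, carrying the
    summed probability of the whole class; the other members are dropped."""
    classes = defaultdict(list)
    for move, p in policy.items():
        classes[orbit_of(move)].append((move, p))
    concentrated = {}
    for members in classes.values():
        total = sum(p for _, p in members)
        chosen, _ = random.choice(members)
        concentrated[chosen] = total
    return concentrated

# Toy example on an empty 3x3 board: the four corners are equivalent.
policy = {(0, 0): 0.1, (0, 2): 0.1, (2, 0): 0.1, (2, 2): 0.1, (1, 1): 0.6}
corner = lambda m: 'corner' if m in {(0, 0), (0, 2), (2, 0), (2, 2)} else m
print(concentrate_policy(policy, corner))  # e.g. {(2, 0): 0.4, (1, 1): 0.6}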

The wrong network was promoted.

2019-11-06 06:42 e9c845c5 VS 9df1b56e 42 : 0 : 18 (70.00%) 60 / 50 promotion
2019-11-06 05:39 f40513c4 VS 9df1b56e 38 : 3 : 13 (73.15%) 54 / 50 promotion
f40513c4 has a higher win percentage, yet e9c845c5 was promoted instead.

Perhaps the bug is that e9c845c5 had more wins?
38/54 = 0.7037, so it seems ties are not the problem; rather, the raw number of wins seems to be the reason.

So many match games?

The number of self-play games is only about twice the number of match games, while for LZ the number of self-play games is about five times the number of match games. Why?
I guess you also use the match games for training, is that right?

Idea: branch from common handicap

As far as I understand, SAI brings a different concept that dramatically improves the value network in unfair situations.

If we consider an extreme situation like 9 handicap stones, Leela Zero considers the 1-1 point as valuable as 6-4 for white, which is insane for a human.


The reasoning is simple: since the winrate is 0% for both, there is no difference. I would say the more moves are played, the worse it becomes, since the initial policy (the 3-3 point) starts to have less importance.

Conversely, in the endgame Leela often plays sloppy moves because she is already far ahead.


To solve this issue, SAI (and other improvements on Leela) bring in some new concepts:

  1. the concept of dynamic komi (which requires having a monotonic winrate as a function of komi): by varying the komi, the actor can trick the engine into a fairer situation and let it make better choices;

  2. training not only to win but to win by the maximum number of points. We can think of KataGo's score-maximization function (see below), but also of the alpha and beta parameters of SAI (which still look weird to me since, contrary to KataGo, SAI doesn't finish every game and thus I don't see how those parameters are trained, but that is another problem).

(Image: Appendix F, Score Maximization, from https://arxiv.org/pdf/1902.10565.pdf by @lightvector)

Those improvements have a positive impact on play (endgame, handicap games) but are also a good way to improve the user experience.
For instance, KataGo can output ownership.

(Image: Figure 3, visualization of ownership predictions by the trained neural net, from https://arxiv.org/pdf/1902.10565.pdf by @lightvector)


However, that only improves the value head. AFAIK, the policy network is still lost, since those situations are super rare (there is no way white passes 8 times while black plays hoshi).

(Image: KataGo default policy)

My idea is to take advantage of the branching code to branch an SGF every 10 games.
So 1% of the games would start with one normal handicap stone, 1% with two handicap stones, etc., and the last percent with free placement. The network would then pick an appropriate komi so that the game is fair.

I think this may improve the joseki for handicap games.


Note: maybe this is a crazy idea (we would introduce too much bias), or it is too risky.

I don't have as much knowledge as you have (I read papers from time to time, looked at the Leela and GoNN code and did my small experiments on 7x7, but I have no ML background).

However, it may be interesting :) What do you think?

Feature explanation: current player color removal

LZ input planes include two bit fields which are alternately flat 0 and 1 and code for the color of the current player. This is necessary for LZ, which learns to play without any information on the komi: the two colors play with different komi (-7.5 for black and +7.5 for white) and they need to know this.
In fact, toying with these bit fields is the best option for having variable komi in regular LZ.

On the contrary, SAI has information on the relative komi of the current player. This value may change wildly in unbalanced situations, when playing game branches. So telling the net the color of the current player is not needed and may well be detrimental.

Thus SAI may use nets that do not take the color of the current player as an input. These nets are detected by the parity of the number of input planes: if it is even, the last two planes code the color as usual; if it is odd, the last plane has to be flat 1. (This is needed so that the first convolution layer has information on where the borders of the board are.)

After experimenting, we can confirm that using this feature improves the training and the playing strength a bit on our 7x7 pipeline.

Notice that in this way the policy is not expected to specialize for the two players. This is intended, as we believe that the policy should always show a variety of promising moves, rather than converge to the one best move for the current player. (In fact the best move for the current player may well not be unique, and may change as the precision of the value head increases.)

On the training side, no-color nets can be trained by setting the following option in config.py:
INPUT_STM = 0  # 1: both side to move and komi in input (18 input planes)
               # 0: only komi in input (17 input planes)
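For illustration only, here is a rough Python sketch of the two input layouts (the function name, the 8-position history and the 19x19 board size are my assumptions for the example, not SAI's actual code):

import numpy as np

def build_input(history_black, history_white, to_move, input_stm):
    """history_black / history_white: lists of 8 binary 19x19 boards (most
    recent first) for black and white stones; to_move: 'black' or 'white';
    input_stm: 1 for the LZ-style layout, 0 for the SAI no-color layout."""
    planes = history_black + history_white            # 16 history planes
    if input_stm:
        black_to_move = float(to_move == 'black')
        planes.append(np.full((19, 19), black_to_move))        # plane 17
        planes.append(np.full((19, 19), 1.0 - black_to_move))  # plane 18
    else:
        planes.append(np.ones((19, 19)))               # single flat-1 plane
    return np.stack(planes)

empty = [np.zeros((19, 19)) for _ in range(8)]
print(build_input(empty, empty, 'black', 1).shape)  # (18, 19, 19)
print(build_input(empty, empty, 'black', 0).shape)  # (17, 19, 19)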

Feature explanation: advanced input planes

This feature is motivated by the fact that AlphaGo had richer input planes than AlphaGo Zero and AlphaZero, the latter two using only the board configuration for the last 8 turns.
One could argue that AlphaGo was experimental and that the other programs are its evolution, but it must be noted that even if they eventually overcame AlphaGo's strength statistically, they were never exposed to the public as the first one was: it was the only program trusted to play in public, against Lee Sedol, Ke Jie and, as Master, against several other professionals.
We believe the reason may be that go-playing programs made with this general approach are universally affected by sporadic weaknesses on rare game events, like ladders, huge groups without eyes, or sekis. Putting extra information in the input planes may mitigate these problems, and AlphaGo had a huge amount of such information in its input planes.
One possible issue is that complex information deduced from the board configuration may sometimes be wrong. AlphaGo had an input field saying whether a stone may be captured in a ladder. This is not obvious to compute with zero errors, and many people thought that when AG lost to Lee Sedol in game 4 it was because it was blind to a complicated ladder, due to an error in the implementation of this input field.
To avoid such situations, while keeping some of the benefits from the advanced input features, we have chosen only two very simple additional bit fields:

  • empty intersections that are the last liberty of a group (of either player)
  • intersections where it is illegal to play

In fact, we observed in previous 7x7 runs that sometimes even very strong networks would lose to a much weaker one because of an unseen atari on a large group, and we felt that ko could also be easier to learn in this way.
After experimenting with this feature, we can conclude that it greatly accelerates learning in the beginning. It does not make much difference when the nets are strong, but the occasional lost games against weak nets disappeared.
Experiments were done shortening the history from 8 to 4 turns, in order to keep the number of input planes constant (8×2 = 4×4).

As for the implementation, nets that expect advanced input planes are coded with version number 17 in the first line of the weights file. The program automatically understands this and gathers the correct information for playing.
The command-line option --adv_features instead changes the recording of the training information, putting the correct input planes in the output files.
The training with TF is no different with or without these features, as long as the training data is the correct one. The option in config.py
WEIGHTS_FILE_VER = "17"  # 1: LZ, 17: 'advanced features'
just writes 17 in the first line.
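To illustrate the first of the two extra bit fields, here is a self-contained Python sketch (my own simplified code with an assumed board encoding, not SAI's implementation) that marks empty intersections which are the last liberty of some group; the illegal-move plane is omitted because it needs full rule handling (ko, suicide):

import numpy as np

def groups(board, color):
    """Yield groups of `color` stones as sets of (row, col); 0 = empty."""
    seen = set()
    n = board.shape[0]
    for r in range(n):
        for c in range(n):
            if board[r, c] == color and (r, c) not in seen:
                stack, group = [(r, c)], set()
                while stack:
                    y, x = stack.pop()
                    if (y, x) in group:
                        continue
                    group.add((y, x))
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if 0 <= ny < n and 0 <= nx < n and board[ny, nx] == color:
                            stack.append((ny, nx))
                seen |= group
                yield group

def liberties(board, group):
    """Empty intersections adjacent to the group."""
    n = board.shape[0]
    libs = set()
    for y, x in group:
        for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            ny, nx = y + dy, x + dx
            if 0 <= ny < n and 0 <= nx < n and board[ny, nx] == 0:
                libs.add((ny, nx))
    return libs

def last_liberty_plane(board):
    """Bit plane of empty points that are the only liberty of some group."""
    plane = np.zeros(board.shape, dtype=np.float32)
    for color in (1, 2):                       # 1 = black, 2 = white
        for group in groups(board, color):
            libs = liberties(board, group)
            if len(libs) == 1:
                (y, x), = libs
                plane[y, x] = 1.0
    return plane

board = np.zeros((9, 9), dtype=int)
board[0, 0] = 1                                # black stone in the corner...
board[0, 1] = 2                                # ...with a white stone next to it
print(last_liberty_plane(board)[1, 0])         # 1.0: black's last liberty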

Feature explanation: blunder threshold

Randomness is at the core of LZ's way of learning. The first nets are really random and may try anything, but over time the policy of better networks converges and eventually they would stop exploring different moves. This would be particularly undesirable, since their value head is also evolving, and moves that were excluded before may become more interesting after some time.
To make the program always consider some policy-unexpected moves, Dirichlet noise is introduced at the root node. Nevertheless, this doesn't mean that the weird moves will actually be chosen and that the board configurations following that branch in the game tree will be played and tested for the game winner. So configurations following the weird move will not enter the training data and the value head will not be able to train on them.
To avoid this, the -m command-line option allows the first moves to be played randomly, by choosing proportionally to the number of visits. (This is even tunable with a temperature parameter.)
These two randomness sources together promise to make exploration always possible and to avoid convergence to sub-optimal play.

Nevertheless, from our many experiments on 7x7 it appears that there is an important trade-off that may be underestimated and deserves closer investigation. Every time a weird move is actually played because of -m, we get useful exploration of atypical branches of the game; but if this weird move is in fact a bad blunder, then the positions that precede it are poor for value training, because the winner at the end of the game will typically be randomly altered by the blunder.
The positions following the blunder are always good for training (unless another blunder follows after some moves), but the positions before have a wrong (or random) winner and will make the training of the value head more difficult.

The neat way to avoid this would be for the randomness in the moves to be chosen at the server: if we want a particular weird sub-branch to be explored, then the server should decide this and give the corresponding starting position to the client, which would then play with -m 0. In this way you get exploration of random sub-trees but also training data with reasonable winners.

Anyway, this is somewhat difficult to implement, because server-side one would need to analyze the games coming from clients and randomly choose positions at which to branch by randomly choosing a different move. It can be done, but with a lot of work and by introducing new parameters (magic numbers again!). So we chose an easier and less beautiful strategy.

The random choice of weird sub-branches is left to the client as always, but if the command-line option --blunder_thr x is used (x is a number between 0 and 1), then the program looks for blunders, defined as randomly chosen moves whose number of visits is less than a fraction x·V, where V is the number of visits of the most visited move. If at least one blunder is found in a self-play game, then all positions before the last blunder are NOT stored in the training data, lest they ruin the training of the value head.
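A minimal Python sketch of this filter, under my reading of the rule above (in particular, the choice to drop the blunder position itself is an assumption; this is not the SAI source):

def filter_training_positions(positions, blunder_thr):
    """positions: list of dicts with keys 'visits' (dict move -> visit count
    of the root children), 'chosen_move' and 'was_random' (True if the move
    was picked randomly among the first -m moves).
    Returns only the positions that follow the last detected blunder."""
    last_blunder = -1
    for i, pos in enumerate(positions):
        if not pos['was_random']:
            continue
        best = max(pos['visits'].values())
        if pos['visits'][pos['chosen_move']] < blunder_thr * best:
            last_blunder = i
    return positions[last_blunder + 1:]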

This of course has the defect of wasting some positions, but we believe that overall it is an important improvement, in particular when -m is high.

Some experimentation showed that -m 15 --blunder_thr 0.25 is a reasonable choice for the 7x7 learning pipeline and that, together with temperature tuning (see a future post), it greatly improved the performance of SAI.

Training a 25x1 zero AI using SAI

hi everyone ! 👍

25x1 is a little-known yet very fun board size!

Many players, especially dan players, enjoyed playing it once they were introduced to it, and then wanted to know more about it.

A friend of mine also had a plan to start training a bot for 25x1, several months ago.

You can find an example of how 25x1 is played in these 2 game samples.

It's a game heavily revolving around ko reading, but also using concepts of influence, territory, sente, etc.
So I want to run it on OGS (I'm already running PhoenixGo on OGS).

Zero bot (no previous knowledge), Chinese rules, 1.5 komi, all other rules the same as 19x19.

As for me, I have no past experience training an AI, but I'm very motivated to do it, so I would appreciate any guidance, even the slightest; step by step, let's make it!

I have a general idea of how AIs are trained (self-play until enough game data is produced, then train a new network on this data; if it is stronger than the previous best, promote it, and repeat until it gets strong).

Some quick things I thought about (will be updated over time):

  • 25x1 is not a square board size, but gtp2ogs is compatible with any custom board size now
  • 25x1 is not a square board size, so the code may need to be modified to support custom width and height (who said go can't be played on non-square board sizes?)
  • if 1.5 komi is too good for black, I think the SAI approach allows me to change the komi; does that make the training slower or weaker compared to a single-komi approach?
  • symmetries are unneeded considering how small the board is
  • my feeling is to start with a 10b network size, then maybe increase to a 15-20b size
  • I want to start training it alone, then depending on how successful I am, I may open source it and run it in autogtp mode (not urgent)

@Vandertic @francesco Morandin @parton69 @amato-gianluca @Nazgand @bood

sai79 v400 vs lz250 v1

net games win
77 48 11
80 36 5
82 74 24
83 38 6
86 45 9
87 46 10
89 36 5
101 54 14
102 91 33
104 77 26
105 66 20
107 46 10
108 70 22
110 41 7
111 95 35
113 56 15
114 70 22
119 74 24
120 102 39
121 77 26
122 83 29
124 68 21
125 36 5
126 54 14
127 72 23
129 131 54
139 278 131
165 108 42
166 236 109
169 466 230

validation -k sai77_lz250 -n af4.gz -o "-g -v 400 -r 5 -w" -n networks/3d415846183a7f51a40aaa80a007d886668e759c37be8712febf09ec2823d257.gz -o "-g -v 1 -r 5 -w" -- sai -- sai

11 wins, 37 losses
The first net is worse than the second
af4.gz   v networks ( 48 games)
              wins        black       white
af4.gz     11 22.92%    5 21.74%    6 24.00%
networks   37 77.08%   18 78.26%   19 76.00%
                       23 47.92%   25 52.08%

More participants?

If the goal of this project is to reach superhuman strength, the current level of participation is obviously not enough.
Does anyone have ideas on how to let more people know about this project and get them to help?

recommendations on text on the invite page

After a second piece of feedback, I suggest we modify "Download the latest release of SAI" -> "Open this page and download the latest release of SAI that fits the characteristics of your computer".

Let's collect here other recommendations on the text of the invite page.

get new task error

I often get the following message:

Getting task returned: <!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<title>Error</title>
</head>
<body>
<pre>Cannot POST /get-task/18/Copyright</pre>
</body>
</html>

Network connection to server failed.
NetworkException: JSON parse error: illegal value
Retrying in 30 s.
Getting task returned: <!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<title>Error</title>
</head>
<body>
<pre>Cannot POST /get-task/18/Copyright</pre>
</body>
</html>

Network connection to server failed.
NetworkException: JSON parse error: illegal value
Retrying in 45 s.
Getting task returned: <!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<title>Error</title>
</head>
<body>
<pre>Cannot POST /get-task/18/Copyright</pre>
</body>
</html>

Network connection to server failed.
NetworkException: JSON parse error: illegal value
Retrying in 67 s.

OpenCL.dll missing from release

I didn't compile from source; I used the installer (which was in Italian, for whatever reason).

It did not put the OpenCL.dll into the folder, so the program did not run until I copied the correct one from the LZ folder.

I'm on Intel integrated graphics, Win 8

Feature explanation: flexible net structure

As many will know, LZ recognizes two types of net weights: the standard v1 and the ELF-like v2, which encodes the winrate differently (IIRC). The type is encoded in a number in the first line of the weights file.
Apart from that, the internal function Network::load_v1_network is able to work out the number of layers and filters by itself, by counting the number of weights in the third line and the number of lines in the file. In this way one can seamlessly use 40x256 nets, 15x192 nets or even 5x64 nets without telling leelaz anything.
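The line-counting idea can be sketched roughly like this (Python, my own reconstruction based on the description above, so the exact line budget is an assumption):

def infer_structure(path):
    with open(path) as f:
        lines = f.read().splitlines()
    version = int(lines[0])
    # The third line holds one number per output channel of the input
    # convolution, so its length gives the number of filters.
    filters = len(lines[2].split())
    # Assumed budget: 1 version line, 4 lines for the input convolution,
    # 8 lines per residual block, 14 lines for the policy and value heads.
    blocks = (len(lines) - 1 - 4 - 14) // 8
    return version, blocks, filters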

In SAI we generalized this behaviour as much as possible. Our version of this function is more complex and somewhat slower, because it must count the numbers on several lines, but it can understand whether it is loading:

  1. a Leela Zero network, with a single value head, in which case it behaves like the LZ master branch, with which it is fully compatible;
  2. a SAI network of structure V, Y, T or I. These differ in the way the two value heads are attached to the resconv tower.

Moreover, we generalized the number of 1x1 convolutional filters and the number of units in the fully connected layers, and the function loads any network, understanding its structure and reading these parameters automatically.

In fact, by AlphaGo Zero defaults, Leela Zero has:

  • 2 filters (1x1) for the policy,
  • 1 filter for the value,
  • 256 units in the f.c. layer of the value head.

But when using SAI, one can train and test a network that changes these numbers. For example, in the paper there is one test (the augmented network) with a 7x7 Leela Zero net with 2 filters for the value head.

So presently we are using a Y head, in which the same 1x1 convolution is used for both alpha and beta. This convolution has 2 filters, like the policy one (which is independent). After this convolutional layer, there are two parallel fully connected layers, with 256 units (for alpha) and 128 units (for beta).
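To make the Y structure concrete, here is a hedged TF/Keras sketch (layer choices, activations and the positive-beta softplus are my assumptions for illustration; this is not the SAI training code):

import tensorflow as tf

def y_value_head(tower):
    shared = tf.keras.layers.Conv2D(2, 1, activation='relu')(tower)  # shared 1x1 conv, 2 filters
    flat = tf.keras.layers.Flatten()(shared)
    alpha = tf.keras.layers.Dense(256, activation='relu')(flat)      # alpha branch, 256 units
    alpha = tf.keras.layers.Dense(1)(alpha)
    beta = tf.keras.layers.Dense(128, activation='relu')(flat)       # beta branch, 128 units
    beta = tf.keras.layers.Dense(1, activation='softplus')(beta)     # keep beta positive (assumed)
    return alpha, beta

# Example with a dummy residual tower output for a 7x7 board and 128 filters:
tower = tf.keras.Input(shape=(7, 7, 128))
alpha, beta = y_value_head(tower)
model = tf.keras.Model(tower, [alpha, beta])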

What do you think about this setting? Are there any proposals for net structures which are expected to perform better? We made some tests, but we couldn't see many differences, maybe because we are still on 7x7, but before scaling to 9x9 it would be best to decide the net structure to use.

Document network size change when the progression stalls

Recently the Elo rating has started to stall.


While it is far too early to say whether it has really stalled (the last 5 promotions are statistically not enough), it is hard, as an external reader / participant, to answer these two questions:

  • When are we going to switch to a bigger network?
  • How are we going to switch to a bigger network?

Indeed, the SAI pipeline (https://github.com/sai-dev/sai/wiki/Progress#sai-pipeline) does not document the network change process.

I recall having seen that not only 6x128 but also bigger networks are trained. Is that true? I can't recall where I found that :)
(Edit: found it here, #15. This method of training new nets at the same time, instead of using net2net like Leela Zero, is an important divergence from Leela and could be documented in the wiki.)

I would expect something like:

The cycle is as follows.
1. `gen=0`, `current_net=random`, `n=1`;
2. `current_net` plays **2560** whole self-play games, with variable komi, distributed according to `current_net` [[evaluation|Evaluation of fair komi]];
3. `current_net` starts playing **[[branches|Branching games]]** of self-play games, from random positions of previous games;
4. when the game count reaches **3072** self-play games, **training starts**, based on the self-play games of the last `n` generations;
+ The following networks are trained:
+ - 6x128
+ - 10x128
+ - 15x192
- 5. during training, a variable number of **candidate networks** are generated (currently, 10 networks at 2000 training steps one from the other); 
+ 5. during training, a variable number of **candidate networks** are generated:
+ - if the rate "elo/plays" of the linear interpolation of the last ten promotions is larger than 10 Elo gained per 40000 games, 10 candidate networks of the current size (6x128), at 2000 training steps one from the other, are generated
+ - if the rate "elo/plays" of the linear interpolation of the last ten promotions is less than 10 Elo gained per 40000 games, 10 candidate networks of the current size (6x128) and 10 candidate networks of the size just above (10x128), each at 2000 training steps one from the other, are generated
6. as soon as candidates are available, **promotion matches** are added between the new candidate networks and `current_net`. These matches can be identified because they are **50** games long;
7. when promotion matches end, the best candidate network is identified; denote it by `chosen_net`;
8. `current_net` finishes playing branches of self-play games until count reaches **3840**;
9. **reference matches** are added between several recent networks (the ones promoted at generations `gen-k`, with `k` in `{1, 2, 5, 8, 11}`) and `chosen_net`, to get a more precise evaluation of `chosen_net` Elo. These matches can be identified because they are **40** games long;
10. if `gen` is a multiple of 4, **panel matches** are added between the 16 networks in the [[panel|Panel of reference networks]] and `chosen_net`, again to get an even more precise evaluation of `chosen_net` Elo. These matches can be identified because they are **30** games long;
11. `gen++`, `current_net=chosen_net`, if [[reasonable|Generations for training]] then `n++`;
12. go to step 2;

Pipeline description: self-play temperatures

This post is about how, in SAI, we use two standard LZ tuning parameters that are generally left at their default values in the LZ learning pipeline: --randomtemp and --softmax_temp.

These are temperatures in the sense of probability distributions, or more precisely as the term is used in Gibbs measures. Basically, a probability distribution over a finite set may be perturbed with a temperature parameter: a value of 1 leaves the distribution as it is, values larger than 1 flatten the distribution towards uniform probability for all points, and values less than 1 sharpen the distribution, up to the limit (when the temperature is 0) where the point with the highest original probability gets probability 1 and all other points get probability 0.
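The following tiny Python sketch (illustrative only; it is not taken from the engine code) shows the effect numerically: raising a distribution to the power 1/T and renormalizing flattens it for T > 1 and sharpens it for T < 1:

import numpy as np

def apply_temperature(probs, temp):
    if temp == 0:                        # limit case: all mass on the argmax
        out = np.zeros_like(probs)
        out[np.argmax(probs)] = 1.0
        return out
    scaled = np.power(probs, 1.0 / temp)
    return scaled / scaled.sum()

p = np.array([0.6, 0.3, 0.1])
print(apply_temperature(p, 1.0))   # unchanged
print(apply_temperature(p, 2.0))   # flatter
print(apply_temperature(p, 0.5))   # sharper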

  • randomtemp (default 1) is a perturbation of the probability for choosing one move among the visited ones. By default, if the option -m x is used, the first x moves are chosen with temperature 1 (probability proportional to the number of visits), and the following moves are chosen with temperature 0 (the move with the most visits is chosen with probability 1).
  • softmax_temp (default 1) is a perturbation of the policy probability, applied just after the computation of the policy itself, so before Dirichlet noise.

What we observed in our 7x7 experiments, both with LZ and SAI, is that there is a peculiar problem when these parameters are used at their default values. The policy concentrates too much and too fast, as generations of nets go on, to the point where, in the opening of the game, only one move is really considered.
In principle this is the expected good behaviour, since we would like play to converge to the perfect game, where the policy knows exactly the best move in every situation.
Nevertheless, we observed that the policy would typically concentrate too much before the value estimation gets very good. In this way, the policy may converge to a suboptimal move, which is then really hard to correct.

Value and policy are the two estimators that evolve between generations of nets. Both should converge to a theoretical limit which is the perfect game. They may converge at very different speeds, and in particular it appears that the policy is much faster.
The policy is trained on UCT visits of root node's children, which depend on the policy of the previous generation at the same node and on the value of the previous generation at subsequent nodes.
We have a mathematical argument (not fully rigorous) that shows that if the average value of children does not change between generations, the move with the highest average value will have all the policy concentrated on it very quickly. This is true even if the best move is just barely better than the second one. Convergence is exponential.
We suppose that when the value is not fixed but changes slowly between generations, this behaviour can still be present.
Notice moreover that if the policy converges to one move, then the other subtrees are not explored much and hence the value head cannot be trained much on them, so it becomes difficult for the learning pipeline to correct this situation.

The proposed solution is to use a higher temperature for softmax_temp. It must be remarked that the mathematical argument above says that the limit policy distribution is concentrated on a single move if softmax_temp is less than or equal to 1. Values above 1 give a limit policy that has nonzero probability for the second-best move, so it makes sense to experiment with this parameter to correct the unwanted behaviour. We have experimented with 1.5 and 1.25, and the latter seems to be able to correct the excess of convergence while still letting the policy be trained well.

Of course, one unwanted consequence of this choice is that not only the policy gets flatter, but also the visits to the root node's children. This may be a problem when we are choosing the move to play with randomness. To this end, when setting --softmax_temp 1.25 we also add --randomtemp 0.8, so that the best move is chosen more often. It is also recommended to use --blunder_thr 0.25 together with these, in particular with high values of -m.

Finally, let us remark that our experiments on 7x7 with the official komi 9.5 showed that in the first part of the training, the nets learn perfect play for white, hence winning with higher and higher probability (fair komi is 9 on 7x7 go with area scoring). After that, the way black plays oscillates a lot, while the nets learn that first one strategy and then another doesn't work.
When the temperatures are left at 1, this also ruins the estimates and the playing style of white, which periodically unlearns how to win. One can observe that the strength of new nets oscillates widely over a period of tens of nets.

When the temperatures are corrected, this does not happen. The perfect play by white is very stable over hundreds of generations, while the strength of black oscillates somewhat (a consequence of playing with komi 9.5). Moreover, the final policy, while still allowing the search to find the better move, is not completely concentrated, and in particular if there are two different moves in the perfect game tree, both get a reasonable fraction of the probability.

Progress stalled?

I just wanted to reassure everyone: if the progress stalls, we are going to increase the number of visits, and we believe that within a few generations the upgrade will restore a good rate of improvement.
