
lola's Introduction

Learning with Opponent-Learning Awareness

Implements the LOLA (AAMAS'18) and LOLA-DiCE (ICML'18) algorithms.

Further resources:

Installation

To run the code, you need to pip-install it as follows:

$ pip install -e .

After installation, you can run different experiments using the run scripts provided in scripts/. Use run_lola.py and run_tournament.py for running experiments from the AAMAS'18 paper. Use run_lola_dice.py for reproducing experiments from the ICML'18 paper. Check out notebooks/ for IPython notebooks with plots.

Note: this code has not been tested on GPU, so there might be unexpected issues.

Disclaimer: This is a research code release that has not been tested beyond the use cases and experiments discussed in the original papers.

Contribution

Contributions to further enhance and improve the code are welcome. Please email jakob.foerster at cs.ox.ac.uk and alshedivat at cs.cmu.edu with comments and suggestions.

Citations

LOLA:

@inproceedings{foerster2018lola,
  title={Learning with opponent-learning awareness},
  author={Foerster, Jakob and Chen, Richard Y and Al-Shedivat, Maruan and Whiteson, Shimon and Abbeel, Pieter and Mordatch, Igor},
  booktitle={Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems},
  pages={122--130},
  year={2018},
  organization={International Foundation for Autonomous Agents and Multiagent Systems}
}

DiCE:

@inproceedings{foerster2018dice,
  title={{D}i{CE}: The Infinitely Differentiable {M}onte {C}arlo Estimator},
  author={Foerster, Jakob and Farquhar, Gregory and Al-Shedivat, Maruan and Rockt{\"a}schel, Tim and Xing, Eric and Whiteson, Shimon},
  booktitle={Proceedings of the 35th International Conference on Machine Learning},
  pages={1524--1533},
  year={2018},
  volume={80},
  series={Proceedings of Machine Learning Research},
  address={Stockholmsmässan, Stockholm, Sweden},
  month={10--15 Jul},
  publisher={PMLR},
  pdf={http://proceedings.mlr.press/v80/foerster18a/foerster18a.pdf},
  url={http://proceedings.mlr.press/v80/foerster18a.html},
}

License

MIT

lola's Issues

No policy loss

In the update step of Pnetwork, only the value loss is included in the total loss. Why is the policy loss not included as well?
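
For reference, a minimal generic sketch of how the two losses are usually combined in an actor-critic update (plain numpy; the names actor_critic_loss, policy_loss, value_loss, and value_coef are illustrative and are not the variables used in this repository):

import numpy as np

def actor_critic_loss(log_probs, advantages, values, returns, value_coef=0.5):
    # Policy-gradient term: advantage-weighted negative log-likelihood.
    policy_loss = -np.mean(log_probs * advantages)
    # Value-function regression term.
    value_loss = np.mean((returns - values) ** 2)
    # The total objective usually includes both terms.
    return policy_loss + value_coef * value_loss

print(actor_critic_loss(np.log([0.5, 0.9]), np.array([1.0, -0.5]),
                        np.array([0.2, 0.1]), np.array([1.0, 0.0])))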

Set of hyper-parameters to reproduce LOLA DICE

Are the current default hyper-parameters the ones used to produce the results in the DiCE paper?
The current defaults (from scripts/run_lola_dice.py) are:

batch-size=64
runs=5
epochs=200
use_dice=True

gamma=.96,
lr_inner=.1,
lr_outer=.2,
lr_value=.1,
lr_om=.1,
inner_asymm=True,
n_agents=2,
n_inner_steps=2,
value_batch_size=16,
value_epochs=0,
om_batch_size=16,
om_epochs=0,
use_baseline=False,

Or should we use the default from lola_dice/rpg.py?

epochs=100,
gamma=.96,
lr_inner=1.,          # lr for the inner loop steps
lr_outer=1.,          # lr for the outer loop steps
lr_value=.1,          # lr for the value function estimator
lr_om=.1,             # lr for opponent modeling
n_agents=2,
n_inner_steps=1,
inner_asymm=True,
om_batch_size=64,     # batch size used for fitting opponent models
om_epochs=5,          # epochs per iteration to fit opponent models
value_batch_size=64,  # batch size used for fitting the values
value_epochs=5,       # epochs per iteration to fit value functions
use_baseline=True,
use_dice=True,
use_opp_modeling=False,

Transpose of input

In the Coin Game, the input shape is [4, grid_size, grid_size]. Shouldn't the input be transposed before being passed to the model in networks.py, so that the channel dimension (4) becomes the last dimension?
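
For illustration, a minimal numpy sketch of the transposition being suggested (assuming a channels-first observation of shape [4, grid_size, grid_size]):

import numpy as np

grid_size = 3
obs = np.zeros((4, grid_size, grid_size))    # channels-first, as produced by the environment
obs_channels_last = np.transpose(obs, (1, 2, 0))
print(obs_channels_last.shape)               # (3, 3, 4): channel dimension moved to the end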

batch size = trace_length

Line 75 of run_lola.py reads:
"batch_size = 4000 if trace_length is None else trace_length"
It should be:

"batch_size = 4000 if batch_size is None else batch_size"

Players blue and red are not currently symmetric

In https://github.com/alshedivat/lola/blob/master/lola/envs/coin_game.py

The symmetry is broken in favor of player red: when the two players move onto the coin's cell at the same time, player red always picks up the coin before player blue.

In my re-implementation (where I do not use batching), the current logic is:

        if self.red_coin:
            if self._same_pos(self.red_pos, self.coin_pos):
                generate = True
                reward_red = 1
                reward_blue = 0
            elif self._same_pos(self.blue_pos, self.coin_pos):
                generate = True
                reward_red = -2
                reward_blue = 1
            else:
                reward_red = 0
                reward_blue = 0

        else:
            if self._same_pos(self.red_pos, self.coin_pos):
                generate = True
                reward_red = 1
                reward_blue = -2
            elif self._same_pos(self.blue_pos, self.coin_pos):
                generate = True
                reward_red = 0
                reward_blue = 1
            else:
                reward_red = 0
                reward_blue = 0

To make red and blue symmetric, this should be changed to:

        if self.red_coin:
            if self._same_pos(self.red_pos, self.coin_pos) and self._same_pos(self.blue_pos, self.coin_pos):
                if np.random.randint(0, 2):
                    generate = True
                    reward_red = 1
                    reward_blue = 0
                else:
                    generate = True
                    reward_red = -2
                    reward_blue = 1
            elif self._same_pos(self.red_pos, self.coin_pos):
                generate = True
                reward_red = 1
                reward_blue = 0
            elif self._same_pos(self.blue_pos, self.coin_pos):
                generate = True
                reward_red = -2
                reward_blue = 1
            else:
                reward_red = 0
                reward_blue = 0

        else:
            if self._same_pos(self.red_pos, self.coin_pos) and self._same_pos(self.blue_pos, self.coin_pos):
                if np.random.randint(0, 2):
                    generate = True
                    reward_red = 1
                    reward_blue = -2
                else:
                    generate = True
                    reward_red = 0
                    reward_blue = 1
            elif self._same_pos(self.red_pos, self.coin_pos):
                generate = True
                reward_red = 1
                reward_blue = -2
            elif self._same_pos(self.blue_pos, self.coin_pos):
                generate = True
                reward_red = 0
                reward_blue = 1
            else:
                reward_red = 0
                reward_blue = 0

I can open a pull request if needed.
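
For what it's worth, a more compact way to express the same symmetric tie-break (a sketch only, not a drop-in patch for coin_game.py; it assumes numpy is available as np, as in the snippet above, and that generate and the reward variables are handled by the surrounding method as before):

red_hit = self._same_pos(self.red_pos, self.coin_pos)
blue_hit = self._same_pos(self.blue_pos, self.coin_pos)
if red_hit and blue_hit:
    # Break the tie uniformly at random instead of always letting red pick first.
    red_hit, blue_hit = (True, False) if np.random.randint(0, 2) else (False, True)
if red_hit:
    generate = True
    reward_red = 1
    reward_blue = 0 if self.red_coin else -2
elif blue_hit:
    generate = True
    reward_red = -2 if self.red_coin else 0
    reward_blue = 1
else:
    reward_red = 0
    reward_blue = 0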

LOLA breaks when changing number of actions and/or states

I tried to modify IPD into a setup with four actions. This yields a 4x4 payoff matrix and a 17-dimensional input, which breaks both the LOLA and LOLA-DiCE implementations.

  • train_exact.py assumes that NUM_ACTIONS = 4 and NUM_STATES = 5 in the environment.
  • Doesn't the number of states also depend on the number of actions, i.e. NUM_STATES = NUM_ACTIONS ** 2 + 1 (see the sketch after this list)?
  • Since the payoff for agent 2 is simply the transposed payoff matrix: does the game have to be symmetric, or are different payoffs per agent possible in the current implementation?
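
A quick check of that relation (a tiny illustrative sketch; num_states is a hypothetical helper, not a function from the repository): one state per joint last-action pair plus the initial empty-history state.

def num_states(num_actions):
    # One state per (own last action, opponent last action) pair, plus the initial state.
    return num_actions ** 2 + 1

print(num_states(2))  # 5 for the two-action IPD, matching the NUM_STATES = 5 noted above
print(num_states(4))  # 17 for four actions, matching the 17-dimensional input mentioned above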

LOLA Policy Gradient Target Computation

Hello, thank you for open-sourcing the code! :-)
The code is really helpful in understanding the papers deeper.

I am interested in LOLA, especially its policy gradient method (lola/train_pg.py).
As mentioned in the paper, this implementation uses an actor-critic method.

However, I could not fully understand the target computation:
self.target = self.sample_return + self.next_v.
According to the reference (chapter 13, page 274, one-step actor-critic pseudocode), I wonder whether the target should use the step reward (i.e., the reward at timestep t) rather than the full return.

Thank you for your time and consideration!
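
For comparison, a minimal numpy sketch of the one-step actor-critic target being referred to (the name one_step_target is illustrative and does not correspond to a variable in lola/train_pg.py):

import numpy as np

def one_step_target(rewards, values, gamma):
    # One-step actor-critic target: r_t + gamma * V(s_{t+1}).
    # rewards has shape [T]; values has shape [T + 1] and includes the bootstrap
    # value of the final state, so values[1:] lines up with V(s_{t+1}).
    return rewards + gamma * values[1:]

rewards = np.array([1.0, 0.0, -2.0])
values = np.array([0.5, 0.4, 0.3, 0.0])
print(one_step_target(rewards, values, gamma=0.96))  # [1.384, 0.288, -2.0]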

Coin Game

Can you suggest a sample command line to run Coin Game?

I tried running just:

python scripts/run_lola.py --exp_name=CoinGame --no-exact

and it seems to be updating parameters and using up all the CPUs, but gives no indication of progress.

Logging to logs/CoinGame/seed-0
values (600000, 240)
main0/input_proc/Conv/weights:0 (3, 3, 3, 20)
main0/input_proc/Conv/BatchNorm/beta:0 (20,)
main0/input_proc/Conv_1/weights:0 (3, 3, 20, 20)
main0/input_proc/Conv_1/BatchNorm/beta:0 (20,)
main0/input_proc/fully_connected/weights:0 (240, 1)
main0/input_proc/fully_connected/biases:0 (1,)
main0/rnn/wx:0 (240, 128)
main0/rnn/wh:0 (32, 128)
main0/rnn/b:0 (128,)
main0/fully_connected/weights:0 (32, 4)
main0/fully_connected/biases:0 (4,)
values (4000, 240)
main0/input_proc/Conv/weights:0 (3, 3, 3, 20)
main0/input_proc/Conv/BatchNorm/beta:0 (20,)
main0/input_proc/Conv_1/weights:0 (3, 3, 20, 20)
main0/input_proc/Conv_1/BatchNorm/beta:0 (20,)
main0/input_proc/fully_connected/weights:0 (240, 1)
main0/input_proc/fully_connected/biases:0 (1,)
main0/rnn/wx:0 (240, 128)
main0/rnn/wh:0 (32, 128)
main0/rnn/b:0 (128,)
main0/fully_connected/weights:0 (32, 4)
main0/fully_connected/biases:0 (4,)
values (600000, 240)
main1/input_proc/Conv/weights:0 (3, 3, 3, 20)
main1/input_proc/Conv/BatchNorm/beta:0 (20,)
main1/input_proc/Conv_1/weights:0 (3, 3, 20, 20)
main1/input_proc/Conv_1/BatchNorm/beta:0 (20,)
main1/input_proc/fully_connected/weights:0 (240, 1)
main1/input_proc/fully_connected/biases:0 (1,)
main1/rnn/wx:0 (240, 128)
main1/rnn/wh:0 (32, 128)
main1/rnn/b:0 (128,)
main1/fully_connected/weights:0 (32, 4)
main1/fully_connected/biases:0 (4,)
values (4000, 240)
main1/input_proc/Conv/weights:0 (3, 3, 3, 20)
main1/input_proc/Conv/BatchNorm/beta:0 (20,)
main1/input_proc/Conv_1/weights:0 (3, 3, 20, 20)
main1/input_proc/Conv_1/BatchNorm/beta:0 (20,)
main1/input_proc/fully_connected/weights:0 (240, 1)
main1/input_proc/fully_connected/biases:0 (1,)
main1/rnn/wx:0 (240, 128)
main1/rnn/wh:0 (32, 128)
main1/rnn/b:0 (128,)
main1/fully_connected/weights:0 (32, 4)
main1/fully_connected/biases:0 (4,)
2018-11-04 16:36:10.603357: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
update params
update params
update params
update params
^C
Aborted!

Possible error in reported confidence interval used in the DICE paper

In the notebook notebooks/dice/analysis.ipynb, which is used to analyse the results and reproduce Fig. 5 from the paper DiCE: The Infinitely Differentiable Monte Carlo Estimator, the confidence interval used is 68% rather than the 95% reported in the paper. The figure generated by the notebook only matches the one in the paper when the confidence interval is set to 68%; changing this parameter to 95% produces a figure with much wider confidence intervals.
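
For context, a small sketch of how much the two interval widths differ under a normal approximation (plain numpy/scipy; none of these names come from the notebook itself):

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
samples = rng.normal(loc=0.0, scale=1.0, size=30)
sem = stats.sem(samples)

# Half-widths of the confidence interval around the mean (normal approximation).
half_68 = stats.norm.ppf(0.5 + 0.68 / 2) * sem  # roughly 1.0 * sem
half_95 = stats.norm.ppf(0.5 + 0.95 / 2) * sem  # roughly 1.96 * sem
print(half_95 / half_68)  # about 2x wider, consistent with the difference described above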

Pervasive reshape bugs in train_cg?

Unless I misunderstand what the code is trying to do, the following pattern in train_cg.py is a bug:

trainBatch1 = [[], [], [], [], [], []]  # line 137
...
while j < max_epLength:  # line 149
  ...
  trainBatch1[3].append(s1)  # line 180
  ...
...
# line 236:
last_state = np.reshape(
  np.concatenate(trainBatch1[3], axis=0),
  [batch_size, trace_length, env.ob_space_shape[0],
   env.ob_space_shape[1], env.ob_space_shape[2]])[:,-1,:,:,:]

The issue is that np.concatenate(trainBatch1[3], axis=0) is stored in memory with the time (trace_length) axis first and the batch axis second, and should be reshaped to [trace_length, batch_size, ...] and then transposed to move the batch axis forward. Reshaping straight to [batch_size, trace_length, ...] will silently misinterpret the order in which the elements are stored in memory.

The same buggy append-and-reshape pattern occurs for essentially everything stored in trainBatch0 and trainBatch1, with the offending reshapes happening in various places in other files that expect [batch, time] storage order. I think the easiest fix would be to establish the desired storage order of trainBatch1 right after the loop over j < max_epLength, e.g.

trainBatch1 = [np.stack(seq, axis=1) for seq in trainBatch1]

and similar for trainBatch0. Now trainBatch1[3] has exactly the shape you want it to have at line 236, so last_state = trainBatch1[3][:, -1, :, :, :] will do. You can still trainBatch1[i].reshape([batch_size * trace_length, ...]) if you need the batch and time axes flattened, and this will correctly reshape back to [batch_size, trace_length, ...].
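
To illustrate the point, here is a tiny self-contained numpy example (arbitrary shapes, not code from the repository) showing how the straight reshape scrambles batch and time, while stacking along axis 1 preserves the intended layout:

import numpy as np

batch_size, trace_length = 2, 3
# One entry per timestep, each of shape [batch_size] (analogue of trainBatch1[3]).
per_step = [np.array([f"b{b}t{t}" for b in range(batch_size)]) for t in range(trace_length)]

flat = np.concatenate(per_step, axis=0)          # time-major order: t0b0, t0b1, t1b0, ...
wrong = flat.reshape(batch_size, trace_length)   # silently mixes batch and time
right = np.stack(per_step, axis=1)               # shape [batch_size, trace_length]

print(wrong)  # [['b0t0' 'b1t0' 'b0t1'], ['b1t1' 'b0t2' 'b1t2']]  -- scrambled
print(right)  # [['b0t0' 'b0t1' 'b0t2'], ['b1t0' 'b1t1' 'b1t2']]  -- as intended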
