
Replicating-DeepMind

Reproducing the results of "Playing Atari with Deep Reinforcement Learning" by DeepMind. All the information is in our Wiki.

Progress: The system is up and running on a GPU cluster with cuda-convnet2. It can learn to play better than random, but not much better yet :) It is rather fast, but still about 2x slower than DeepMind's original system. RMSprop is not implemented at the moment; that is our next goal.
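For reference, the RMSprop update mentioned above can be sketched as follows. This is a minimal, generic form of the rule, not the project's code; the learning rate, decay, and epsilon values are illustrative:

```python
import numpy as np

def rmsprop_update(w, grad, cache, lr=0.0002, decay=0.9, eps=1e-6):
    """One RMSprop step: divide the gradient by a running RMS of past gradients.

    `cache` holds the exponential moving average of squared gradients and must
    be carried over between calls (hyperparameters here are illustrative).
    """
    cache = decay * cache + (1 - decay) * grad ** 2
    w = w - lr * grad / (np.sqrt(cache) + eps)
    return w, cache
```

Each parameter array gets its own `cache` array of the same shape, initialized to zeros before training starts.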

Note 1: You can also check out a popular-science article we wrote about the system for Robohub.

Note 2: Nathan Sprague has an implementation based on Theano. It can do fairly well. See his GitHub for more details.

replicating-deepmind's People

Contributors

kristjankorjus · kuz · neurocsut-gpu · rdtm · taivop · tambetm


replicating-deepmind's Issues

Decouple ALE from MemoryD

The ALE class should encapsulate only the connection with ALE and should be decoupled from MemoryD. In particular, all references to memory should be removed and moved to main.py.

Also, I see no reason why ALE should be in a separate directory; why not in src with main.py?

Use named fields for minibatch

Minibatch components are addressed using indices; they should be addressed by name. Also rename the variables in NeuralNet.train() to prestate and poststate.
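A minimal way to do this with the standard library (the field names below are suggestions, not the repo's existing names):

```python
from collections import namedtuple
import numpy as np

# Hypothetical named container for minibatch components; the prestate/poststate
# naming follows the renaming suggested above.
Minibatch = namedtuple('Minibatch', ['prestates', 'actions', 'rewards', 'poststates'])

batch = Minibatch(
    prestates=np.zeros((32, 4, 84, 84)),
    actions=np.zeros(32, dtype=int),
    rewards=np.zeros(32),
    poststates=np.zeros((32, 4, 84, 84)),
)

# Components are now addressed by name instead of by magic index:
rewards = batch.rewards
```

Code that previously did `minibatch[2]` would then read `minibatch.rewards`, which is self-documenting and robust to reordering.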

Question about function φ

I am still not clear about the function φ in Algorithm 1. It is obvious from the paper that, by using the function φ, the input to the Q-network is preprocessed into an 84×84×4 image. But how does it do that?

In Algorithm 1 we found that

s_{t+1} = s_t, a_t, x_{t+1}

and

φ_{t+1} = φ(s_{t+1})

This confuses me. What exactly is s_{t+1}? Does that mean:

s1 = x1
s2 = s1,a1,x2 = x1,a1,x2
s3 = s2,a2,x3 = x1,a1,x2,a2,x3
s4 = s3,a3,x4 = x1,a1,x2,a2,x3,a3,x4
......

So how does φ process s3, for instance? φ3 should equal φ(s3) = φ(x1,a1,x2,a2,x3)? I find this hard to understand.

I would appreciate it if anyone could help.
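For what it is worth: in the paper, φ does not need the whole growing sequence s_t. It preprocesses each screen image to 84×84 and stacks only the last 4 of them, so φ(s_3) ignores the actions and the older frames. A minimal sketch of that reading, assuming frames are already preprocessed to 84×84 (the padding-by-repetition at the start of an episode is one common convention, not something the paper spells out):

```python
import numpy as np

def phi(frames, history=4):
    """Stack the last `history` preprocessed 84x84 frames into an
    84x84x`history` input; when the episode is younger than `history`
    frames, pad by repeating the earliest available frame."""
    recent = list(frames[-history:])
    while len(recent) < history:
        recent = [recent[0]] + recent
    return np.stack(recent, axis=-1)
```

So φ(s_3) = stack(x1, x1, x2, x3) under this convention, and for t ≥ 4 it is simply the last four screens.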

Memory overwrite

The memory has a fixed length, so when we reach 100000 transitions in memory we need to start overwriting the first transitions. This is not implemented yet.

Attention should be paid to how transitions are extracted for a minibatch once part of the transitions have been overwritten. For example: if we have overwritten transitions up to position 10 in memory and the minibatch asks for transition nr 11, then the 3 previous "images" in memory no longer correspond to what actually happened before transition 11. So we either:
1) give a repetition of the 11th image instead of images 10, 9 and 8, as we do when transitions are requested at the beginning of a new game. The downside is that such a transition (the same image for 4 frames followed by an action) never actually takes place;
or
2) forbid the minibatch from sampling transitions at that location. Considering we have another 1M transitions to choose from, forbidding the selection of 3 of them seems like no problem.
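Option 2 can be sketched with a circular buffer whose sampler rejects indices whose history has been overwritten. This is a stand-in for illustration, not the actual MemoryD code:

```python
import random

class CircularMemory:
    """Fixed-size transition memory that overwrites the oldest entries
    (illustrative stand-in for MemoryD)."""

    def __init__(self, size, history=4):
        self.size = size
        self.history = history
        self.data = [None] * size
        self.count = 0                          # transitions added so far

    def add(self, transition):
        self.data[self.count % self.size] = transition
        self.count += 1

    def get(self, i):
        return self.data[i % self.size]

    def sample_index(self):
        """Option 2 above: draw until the index and its `history - 1`
        predecessors are all still intact. Assumes at least `history`
        transitions have been added."""
        while True:
            lo = max(0, self.count - self.size)  # oldest surviving transition
            i = random.randrange(lo, self.count)
            if i - (self.history - 1) >= lo:     # predecessors not overwritten
                return i
```

Since at most `history - 1` indices per draw window are rejected, the resampling loop almost always terminates on the first try.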

Simplify NeuralNet interface

The NeuralNet class should have only the methods train() and predict(); everything else (in particular predict_best_action() and minibatch processing) should be moved to main.py. NeuralNet should be a simple wrapper around ConvNet; this would allow it to be used in other projects too.

Define ALE and cuda-convnet2 as submodules of DeepMind

ALE and cuda-convnet2 should be defined as submodules of DeepMind. This way you get the latest versions of both ALE and cuda-convnet2 when doing a checkout. Also, we can push our fixes directly to those projects.
http://stackoverflow.com/questions/5252450/using-someone-elses-repo-as-a-git-submodule-on-github

What to do with the cuda-convnet2 patches? We can leave them as manual work, prepare them as patch files to be applied automatically during make, or hope that they are included in the next release.
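A sketch of the submodule setup (the upstream URLs and target paths below are illustrative; substitute the forks that carry our patches):

```shell
# Register the two projects as submodules of this repository
git submodule add https://github.com/mgbellemare/Arcade-Learning-Environment ale
git submodule add https://github.com/akrizhevsky/cuda-convnet2 cuda-convnet2
git commit -m "Add ALE and cuda-convnet2 as submodules"

# A fresh clone then picks both up with:
git clone --recursive <repository-url>
# or, inside an existing checkout:
git submodule update --init
```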

Saving/loading network

We need to create a function in main.py that saves the learned network parameters to a file after a desired number of games has been played.

We need to add a constructor to NeuralNet that builds the neural net from given weight values.

AI

Wait a minute... are you saying you have copied DeepMind's functionality and written your own Atari game agent?

Increment frames_played

I downloaded this about a month ago and ran it on my GPU.

Is frames_played incremented anywhere? I was printing out the value of epsilon on every frame and it seemed to stay at 0.9.
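If frames_played is never incremented, a linearly annealed epsilon would indeed stay frozen at its starting value. For reference, the paper anneals epsilon linearly from 1.0 to 0.1 over the first million frames; a minimal sketch of that schedule (the function name is ours, and the counter must be incremented once per frame in the main loop):

```python
def epsilon(frames_played, start=1.0, end=0.1, anneal_frames=1_000_000):
    """Linearly anneal the exploration rate from `start` to `end` over
    `anneal_frames` frames, then hold it constant at `end`."""
    if frames_played >= anneal_frames:
        return end
    return start + (end - start) * frames_played / anneal_frames
```

Without a `frames_played += 1` somewhere in the per-frame loop, this function is always called with the same argument and epsilon never decays.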

Weight initialization

The weights should be initialized so that the initial expected-reward values (when feeding input to the initial network) are of the same order of magnitude as, or rather a few orders of magnitude smaller than, the reward we give when a tile is broken (reward=1). At the moment, the rewards from the randomly initialized network go as far as -200 or +200.
We need to decrease the weight values, because then adding a reward of 1 to a desired transition/state would really make us choose that same transition next time.

This should be done in the constructors of the individual layers (the way we initialize W and B).

Also, biases are all initialized to zero at the moment; we need to change that.
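A sketch of small-scale initialization for a single layer. The scale values are illustrative, not tuned; the point is only that tiny weights keep the initial Q-value outputs far below reward=1:

```python
import numpy as np

_rng = np.random.default_rng(0)

def init_layer(n_in, n_out, weight_scale=1e-4, bias_value=0.1):
    """Small Gaussian weights plus a small constant bias (both scales are
    illustrative assumptions, not the repo's chosen values)."""
    W = _rng.normal(0.0, weight_scale, size=(n_in, n_out))
    b = np.full(n_out, bias_value)   # small nonzero bias instead of all zeros
    return W, b
```

With weights of order 1e-4, the network's initial outputs stay orders of magnitude below 1, so a single observed reward can actually dominate the predicted value.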

Write tests for MemoryD

The MemoryD class contains some non-trivial logic. As it is a completely independent class, it should be easy to write tests for it. This would ensure that it works correctly. Maybe do some profiling as well.

Pre-processing too slow

For each frame, the loop in preprocessor.py must run 210×160 times, which is a bit inefficient:

Fill the PIL image object with the correct pixel values

    for i in range(len(image_string)/2):
        num_rows = i % width
        num_cols = i / width
        hex1 = int(image_string[i*2], 16)

        # Division by 2 because: http://en.wikipedia.org/wiki/List_of_video_game_console_palettes
        hex2 = int(image_string[i*2+1], 16)/2
        gray_val = int(arr[hex2, hex1])
        pixels[num_rows, num_cols] = (gray_val, gray_val, gray_val)

    # Crop and downscale image
    roi = (0, 33, 160, 193)  # region of interest is lines 33 to 193
    img = img.crop(roi)
    new_size = 84, 84
    img.thumbnail(new_size)
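The per-pixel loop can be replaced by whole-array NumPy operations. A hedged sketch, assuming the raw frame has already been parsed into a (210, 160) array of Atari palette bytes (hue in the first hex digit, luminance in the second, halved as in the loop above) and that `palette` is the same grayscale lookup table as `arr`:

```python
import numpy as np

def preprocess(frame, palette):
    """Vectorized stand-in for the per-pixel loop: palette lookup, crop to
    the region of interest, nearest-neighbour downscale to 84x84."""
    hue = frame >> 4                 # first hex digit, like hex1 in the loop
    lum = (frame & 0x0F) // 2        # second hex digit halved, like hex2
    gray = palette[lum, hue]         # one table lookup for all pixels at once

    gray = gray[33:193, :]           # region of interest: rows 33 to 192

    # Nearest-neighbour downscale from 160x160 to 84x84 via index arrays
    idx = (np.arange(84) * 160) // 84
    return gray[np.ix_(idx, idx)]
```

The palette lookup and crop match the loop; the downscaling is plain nearest-neighbour rather than PIL's antialiased thumbnail, so treat the last step as an approximation.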
