
ga3c's Introduction

GA3C: Reinforcement Learning through Asynchronous Advantage Actor-Critic on a GPU

A hybrid CPU/GPU version of the Asynchronous Advantage Actor-Critic (A3C) algorithm, currently the state-of-the-art method in reinforcement learning for various gaming tasks. This CPU/GPU implementation, based on TensorFlow, achieves a significant speed-up compared to a similar CPU implementation.

How do I get set up?

How to train a model from scratch?

Run sh _clean.sh first, and then sh _train.sh. The script _clean.sh cleans the checkpoints folder, which contains the network models saved during the training process, as well as removing results.txt, which is a log of the scores achieved during training.

Remember to save your trained models and scores in a different folder if needed before cleaning.

_train.sh launches the training procedure, following the parameters in Config.py. You can modify the training parameters directly in Config.py, or pass them as arguments to _train.sh. E.g., launching sh _train.sh LEARNING_RATE_START=0.001 overrides the starting value of the learning rate in Config.py with the one passed as an argument (see below). You may want to modify _train.sh for your particular needs.
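For reference, a KEY=VALUE override of this kind can be applied with just a few lines of Python (a sketch of the idea, not necessarily the exact code used by GA3C.py):

import sys
from Config import Config

# Apply KEY=VALUE command-line overrides, casting each value to the type
# of the corresponding Config attribute.
for arg in sys.argv[1:]:
    key, value = arg.split('=')
    setattr(Config, key, type(getattr(Config, key))(value))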

The output should look like below:

...
[Time: 33] [Episode: 26 Score: -19.0000] [RScore: -20.5000 RPPS: 822] [PPS: 823 TPS: 183] [NT: 2 NP: 2 NA: 32]
[Time: 33] [Episode: 27 Score: -20.0000] [RScore: -20.4815 RPPS: 855] [PPS: 856 TPS: 183] [NT: 2 NP: 2 NA: 32]
[Time: 35] [Episode: 28 Score: -20.0000] [RScore: -20.4643 RPPS: 854] [PPS: 855 TPS: 185] [NT: 2 NP: 2 NA: 32]
[Time: 35] [Episode: 29 Score: -19.0000] [RScore: -20.4138 RPPS: 877] [PPS: 878 TPS: 185] [NT: 2 NP: 2 NA: 32]
[Time: 36] [Episode: 30 Score: -20.0000] [RScore: -20.4000 RPPS: 899] [PPS: 900 TPS: 186] [NT: 2 NP: 2 NA: 32]
...

PPS (predictions per second) measures how fast frames are processed, while Score is the score achieved in that episode.
RPPS and RScore are rolling averages of these values.

To stop the training procedure, adjust EPISODES in Config.py accordingly, or simply use Ctrl+C.

How to continue training a model?

If you want to continue training a model, set LOAD_CHECKPOINTS=True in Config.py, and set LOAD_EPISODE to the episode number you want to load. Be sure that the corresponding model has been saved in the checkpoints folder (the model name includes the number of the episode).
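Concretely, the relevant lines in Config.py would look something like this (the episode number is just an example):

# Excerpt from Config.py for resuming training (values are examples):
LOAD_CHECKPOINTS = True   # load a saved model instead of starting from scratch
LOAD_EPISODE = 26000      # must match an episode number saved in the checkpoints folder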

Be sure not to use _clean.sh if you want to stop and then continue training!

How to play a game with a trained agent?

Run sh _play.sh. You may want to modify this script for your particular needs.

How to change the game, configurations, etc.?

All the configurations are in Config.py.
As mentioned before, one useful way of modifying a configuration is to pass it as an argument to _train.sh. For example, to use four trainer threads, run: sh _train.sh TRAINERS=4.

Sample learning curves

Typical learning curves for Pong and Boxing are shown here; they are easily obtained from the results.txt file. (Figure: convergence curves for Pong and Boxing.)
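For example, a score curve can be extracted from the training log with a few lines of Python (a sketch that parses console-style lines like the ones shown above; results.txt may use a slightly different layout, so the regex may need adjusting):

import re
import matplotlib.pyplot as plt

episodes, scores = [], []
pattern = re.compile(r'Episode:\s*(\d+)\s+Score:\s*(-?\d+\.?\d*)')
with open('results.txt') as f:
    for line in f:
        match = pattern.search(line)
        if match:
            episodes.append(int(match.group(1)))
            scores.append(float(match.group(2)))

plt.plot(episodes, scores)
plt.xlabel('Episode')
plt.ylabel('Score')
plt.savefig('learning_curve.png')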

References

If you use this code, please cite our ICLR 2017 paper:

@conference{babaeizadeh2017ga3c,
  title={Reinforcement Learning through Asynchronous Advantage Actor-Critic on a GPU},
  author={Babaeizadeh, Mohammad and Frosio, Iuri and Tyree, Stephen and Clemons, Jason and Kautz, Jan},
  booktitle={ICLR},
  url={https://openreview.net/forum?id=r1VGvBcxl},
  year={2017}
}

This work was first presented in an oral talk at the 1st International Workshop on Efficient Methods for Deep Neural Networks, NIPS Workshop, Barcelona (Spain), Dec. 9, 2016:

@article{babaeizadeh2016ga3c,
  title={{GA3C:} {GPU}-based {A3C} for Deep Reinforcement Learning},
  author={Babaeizadeh, Mohammad and Frosio, Iuri and Tyree, Stephen and Clemons, Jason and Kautz, Jan},
  journal={NIPS Workshop},
  note={arXiv preprint arXiv:1611.06256},
  year={2016}
}

ga3c's People

Contributors

ifrosio, mbz, swtyree


ga3c's Issues

How to run the CPU version of A3C

Hi,

I want to reproduce the comparison between A3C and GA3C in Table 2 in your paper.

I wonder if the A3C experiment can be done using this repo?

Thanks

Cannot learn problems with a single, terminal reward

Thank you for the easy-to-use and fast A3C implementation. I created a simple problem for rapid testing that rewards 0 on all steps except the terminal step, where it rewards either -1 or 1. GA3C cannot learn this problem because of line 107 in ProcessAgent.py:

terminal_reward = 0 if done else value

which causes the agent to ignore the only meaningful reward in this environment, and line 63 in ProcessAgent.py:

return experiences[:-1]

which causes the agent to ignore the only meaningful experience in this environment.

This is easily fixed by changing line 107 in ProcessAgent.py to

terminal_reward = reward if done else value

and _accumulate_rewards() in ProcessAgent.py to return all experiences if the agent has taken a terminal step. These changes should generally increase performance as terminal steps often contain valuable reward signal.
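For concreteness, a sketch of the second change (the signature and discounting details of _accumulate_rewards here are assumptions; the return statement is the point):

def _accumulate_rewards(experiences, discount_factor, terminal_reward, done):
    # Walk backwards through the rollout, accumulating discounted returns.
    reward_sum = terminal_reward
    for t in reversed(range(0, len(experiences) - 1)):
        reward_sum = discount_factor * reward_sum + experiences[t].reward
        experiences[t].reward = reward_sum
    # On a terminal step, return every experience so the final (and possibly
    # only meaningful) reward is not discarded; otherwise drop the last one,
    # whose return is not yet known.
    return experiences if done else experiences[:-1]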

Is 'enabled' unused?

In Server.py, line 108, setting trainer.enabled = False may be useless. Here 'trainer' is an instance of ThreadTrainer.
It does not seem to be checked anywhere.
The 'enabled' attribute only appears in ThreadDynamicAdjustment; I cannot find it in ThreadTrainer.

Segmentation fault

./_train.sh: line 3: 3010 Segmentation fault (core dumped) python GA3C.py "$@"

Does anybody get a segmentation problem like this?

LSTM version

This is great work. Is there any plan to develop an LSTM version?

Why do we need x as an argument of train()?

In NetworkVP and ThreadTrainer, we see that train_model is called with x, r, a (states, rewards, actions).

Wouldn't it be simpler to just maintain a history of p, v for each agent number inside NetworkVP, and compute the loss and backprop when the rewards come back in the train function with the agent indices?

Thus we would avoid recomputing the forward pass (already done in predict), possibly even accelerating the whole process.

TRAINING_MIN_BATCH_SIZE does not seem to affect anything

In ThreadTrainer.py, I don't understand how the following lines are supposed to affect the batch size:

np.concatenate((x__, x_))
np.concatenate((r__, r_))
np.concatenate((a__, a_))

np.concatenate returns the merged array, but does not affect x__ or x_.

However, I do measure the TPS dropping. What sorcery is this?

[Time: 404] [Episode: 213 Score: -1.0642] [RScore: 7.5345 RPPS: 281] [PPS: 282 TPS: 4] [NT: 2 NP: 3 NA: 4]

(The PPS/TPS is overall low in my case because the game is a costly one running on a remote desktop.)

EDIT: I suggest modifying this to:

x__ = np.concatenate((x__, x_))
r__ = np.concatenate((r__, r_))
a__ = np.concatenate((a__, a_))

but this does not change the TPS compared to the original.
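For reference, an accumulating loop that actually grows the batch until TRAINING_MIN_BATCH_SIZE is reached could look roughly like this (a sketch of a ThreadTrainer.run() body; the queue and method names training_q and train_model are assumptions based on my reading of the code):

import numpy as np

# Sketch of the inner loop of ThreadTrainer.run(); names are assumed.
batch_size = 0
while batch_size <= Config.TRAINING_MIN_BATCH_SIZE:
    x_, r_, a_ = self.server.training_q.get()
    if batch_size == 0:
        x__, r__, a__ = x_, r_, a_
    else:
        # assign the results back: np.concatenate returns a new array
        # and does not modify its inputs
        x__ = np.concatenate((x__, x_))
        r__ = np.concatenate((r__, r_))
        a__ = np.concatenate((a__, a_))
    batch_size += x_.shape[0]
self.server.train_model(x__, r__, a__, self.id)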

Action Repeat Option

I've just finished reading the GA3C paper, and I didn't notice any mention of action repeat (as implemented in original Deep-Q paper, and A3C paper). From my understanding, action repeat means that, not only are frames stacked together (into an effective single state) in groups of 4, but actions are also only selected every 4 frames, with these action selections then being carried through (i.e. repeated) until another 4 frames have passed, ready for another action selection.

These correspond to two different hyperparameters from my understanding: one for how many frames to stack (referred to as "m" in the Deep-Q methods section), and one for how often to select actions (referred to as "k" in the Deep-Q methods section). These are then also explicitly referenced in the A3C experimental setup section.

I may be wrong, but I've gone through this source code, and from my understanding, different actions are selected every single frame, not every 4th frame (or kth frame, to be more general). It seems as though there is no option to change this to more closely replicate DeepMind's A3C approach. I was wondering if this was a design decision? If not, is there any plan to incorporate action repeat options into the model, with a parameter in the Config.py file, etc.?

Alternatively, if I've misinterpreted the code (which is entirely possible), I would be grateful if you could point out where/how the action repeat is incorporated.

I will try implementing it efficiently myself in the meantime, which you are welcome to use if helpful.
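Something along these lines, perhaps (a minimal, hypothetical wrapper that is not part of GA3C; the repeat count corresponds to "k" above):

class ActionRepeatWrapper:
    """Repeat each selected action for a fixed number of environment steps."""

    def __init__(self, env, repeat=4):
        self.env = env
        self.repeat = repeat

    def reset(self):
        return self.env.reset()

    def step(self, action):
        total_reward, observation, done, info = 0.0, None, False, {}
        for _ in range(self.repeat):
            observation, reward, done, info = self.env.step(action)
            total_reward += reward
            if done:
                break
        return observation, total_reward, done, info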

P.s. this code is proving to be extremely useful!
Thanks a lot for sharing!

Dan

Playing hangs after last episode

Hi,
I am running testing (playing) for 500 iterations, and then I want to automatically start another job. But I cannot do that, because 2 of the 3 GA3C jobs do not end and need to be killed.

The reason seems to be that the predictor gets stuck and waits for an agent that has already terminated (see the end of Server.py / ThreadPredictor.py). Could you please look at that?

Thanks,
Ernst

Issues with learning in custom environment

Hello,

I'm writing to discuss a problem that is not directly related to your code and application, but that affects my own efforts to apply RL to a more custom type of problem, where the environment is not Atari-like.

In my custom environment I have noisy, difficult-to-interpret images as observations, and I can take one of a handful of actions. For each image, between 1 and 4 actions can be considered correct, and between 5 and 8 actions are always incorrect.

This problem can also be formulated in a fully supervised manner, as a classification problem, where we ignore the fact that more than one action can be considered correct at a time and that these actions are related to each other over time, defining a "trajectory" of actions.
When we use the supervised approach in the way I just described, the system works well, meaning there is no struggle interpreting those noisy images that are difficult for humans to understand.

When these images are organized in a structured manner, in an environment one can play with, it's possible to use an RL algorithm to solve the problem. We have tried DQN with satisfactory results; it works okay. In that case the reward signal is provided continuously: for an action that goes in the right direction we assign a reward of 0.05, for an action that goes in the wrong one -0.15, for a "done" action issued correctly +1, and for a "done" action issued incorrectly -0.25. A "done" action doesn't terminate the episode (it is terminated after 50 steps). In these settings DQN converges very slowly and shows nice validation results.

When we employ A3C, the behavior is either:

  • Reward goes up for a bit, then stabilizes at a very poor value and never moves again (we also tried damping beta as we optimize and got an identical behavior), or
  • Reward fluctuates and then drifts such that, for all 50 steps of the episode, a correct action is NEVER taken, meaning the -0.15 reward is obtained at every step, as if the network had learned perfectly how not to do things.

I am very puzzled by this behavior. I have checked and re-checked every moving piece: the environment, the inputs to the network in terms of images, the rewards, the distribution of different actions over time (to see whether the network is just learning to always issue the same action for some reason). All these problems seem to be absent. I thought it was a problem of exploration vs. exploitation, and therefore I first reduced and then increased beta. I have also damped beta over time to see what happens, but the most I obtained was a sinusoidal kind of behavior of the reward.

I have also tried an epsilon-greedy strategy (just for debugging purposes) instead of sampling from the policy distribution, with no success (the network converges to the worst possible scenario rather quickly).

I tried reducing the learning rate with no success.

Now, the policy gradient loss is not exactly the same as the cross-entropy loss, but it resembles it quite a bit. With an epsilon-greedy policy I would expect that, for each image (we have a limited number of images/observations that are re-proposed when the environment reaches a similar state), all the possible actions are actually explored, and therefore that the policy is learned in a way that is not so far from the supervised case. If I set the discount factor to zero (which I have tried), the value part of the network does not really play a role (I might be mistaken though), and if I give a reward for each step I take, I should more or less converge to something that resembles the classification approach I described above.

Maybe the fact that multiple actions can generate the same reward or penalty is the problem?

I would immensely appreciate any help or thoughts from you. Although I'm really motivated about applying RL to my specific problem, I really don't know what to do to improve the situation.

Thanks a lot,

Fausto

Why is pytorch-a3c implementation so much faster?

https://github.com/ikostrikov/pytorch-a3c has an implementation (CPU only) that can converge on PongDeterministic-v3 within 15 minutes, while the GPU-powered GA3C appears to take 2-3 hours to achieve the same?

Based on my (limited) comparison, they use Adam instead of RMSProp and PongDeterministic-v3 instead of PongDeterministic-v0.

Maybe there is an incredible amount of overhead in pushing data to the GPU, so only large models would see a true speedup?

How do you make sure that sess.run is not called concurrently or while in use?

This is more a question than an issue. I was looking at the code, and I realized that you use functions like train, predict_p_and_v, predict_p and some others from NetworkVP.py. All these functions use the method sess.run; my question is: how do you know that this code runs without a problem? I ask because, as far as I can see, there is nothing preventing sess.run from being called while another sess.run is in use. I thought that I needed to use a coordinator to do something like that. If you have a reference, just so I can understand, that'd be great. Thanks!

Meaning of the RScore

Hi,

What is the meaning of the negative RScore in the output I get? Should I apply abs(RScore) to get the actual reward?
And how do I know when it is close to converging?

[Time:      943] [Episode:     1046 Score:   -20.0000] [RScore:   -20.3550 RPPS:  1255] [PPS:  1228 TPS:   208] [NT:  2 NP:  2 NA: 33]
[Time:      943] [Episode:     1047 Score:   -20.0000] [RScore:   -20.3550 RPPS:  1256] [PPS:  1229 TPS:   208] [NT:  2 NP:  2 NA: 33]
[Time:      943] [Episode:     1048 Score:   -19.0000] [RScore:   -20.3530 RPPS:  1259] [PPS:  1230 TPS:   208] [NT:  2 NP:  2 NA: 33]
[Time:      944] [Episode:     1049 Score:   -21.0000] [RScore:   -20.3530 RPPS:  1259] [PPS:  1231 TPS:   208] [NT:  2 NP:  2 NA: 33]
[Time:      944] [Episode:     1050 Score:   -19.0000] [RScore:   -20.3520 RPPS:  1258] [PPS:  1231 TPS:   208] [NT:  2 NP:  2 NA: 33]
[Time:      944] [Episode:     1051 Score:   -20.0000] [RScore:   -20.3530 RPPS:  1259] [PPS:  1232 TPS:   208] [NT:  2 NP:  2 NA: 33]
[Time:      945] [Episode:     1052 Score:   -20.0000] [RScore:   -20.3530 RPPS:  1259] [PPS:  1233 TPS:   208] [NT:  2 NP:  2 NA: 33]
[Time:      947] [Episode:     1053 Score:   -20.0000] [RScore:   -20.3530 RPPS:  1257] [PPS:  1231 TPS:   208] [NT:  2 NP:  2 NA: 33]
[Time:      947] [Episode:     1054 Score:   -20.0000] [RScore:   -20.3530 RPPS:  1257] [PPS:  1232 TPS:   208] [NT:  2 NP:  2 NA: 33]

Thanks!

Test with Pong-v0: not converging?

I am running a training right now. I replaced PongDeterministic-v0 with Pong-v0 (the former does not seem to exist in my install); other than that, everything is the same in Config.py and the other files.

After 2 hours:

[Time: 7795] [Episode: 7923 Score: -20.0000] [RScore: -20.2960 RPPS: 1488] [PPS: 1499 TPS: 251] [NT: 5 NP: 4 NA: 28]

(Figure: score plot for Pong-v0.)

Am I missing something here? Do I need to modify Config.py?


EDIT: I needed to update gym; I retried with PongDeterministic-v0.

Here it is with learning rate = 1e-3 and the game PongDeterministic-v0:

[Time: 4177] [Episode: 4397 Score: -9.0000] [RScore: -10.5960 RPPS: 1645] [PPS: 1646 TPS: 278] [NT: 4 NP: 4 NA: 33]

Any idea what the difference between the two games is?

(Figure: score plot for PongDeterministic-v0.)

Suggested Config.py settings for a DGX-1

After running _train.sh with the default Config.py on a DGX-1 for about an hour, I see that the CPU usage stays pretty constant at about 15%, and one GPU is used at about 40%.

The settings in Config.py are unchanged: DYNAMIC_SETTINGS = True. The number of trainers varies between 2 and 6, the number of predictors varies between 1 and 2, and the number of agents varies from 34 to 39. I would have expected them to grow to use the available CPU resources.

  1. Are there settings that will better leverage the cores on a DGX-1?
  2. It looks like the code in NetworkVP.py is written for a single GPU. With TensorFlow's support for multiple GPUs, do you have plans to add it? On the surface it seems pretty easy to add (a fuller sketch follows the snippet below):
for d in ['/gpu:0', '/gpu:1', '/gpu:2', '/gpu:3']:
    with tf.device(d):
        ...  # calcs here
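For what it's worth, a data-parallel version could be sketched roughly as below (hypothetical TF 1.x code, not part of the repository; averaging and applying the gradients are omitted):

import tensorflow as tf

def tower_gradients(build_loss, x, num_gpus=4, learning_rate=1e-4):
    # Split the batch across GPUs, build one loss per device with shared
    # variables, and collect the per-tower gradients.
    opt = tf.train.RMSPropOptimizer(learning_rate)
    splits = tf.split(x, num_gpus, axis=0)
    grads = []
    for i, x_i in enumerate(splits):
        with tf.device('/gpu:%d' % i):
            with tf.variable_scope('model', reuse=True if i > 0 else None):
                grads.append(opt.compute_gradients(build_loss(x_i)))
    return grads, opt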

Release date

Hi!
Thanks for a great article! I'm interested in it and want to try some ideas.
When are you planning to release the framework? Can you provide an estimated release date?

./_play.sh not working on OSx

After training a model successfully using ./_train.sh, I am attempting to run and render my game using this model on OSx with ./_play.sh. When I run this, I receive the errors:

The process has forked and you cannot use this CoreFoundation functionality safely. You MUST exec().
Break on __THE_PROCESS_HAS_FORKED_AND_YOU_CANNOT_USE_THIS_COREFOUNDATION_FUNCTIONALITY___YOU_MUST_EXEC__() to debug.
The process has forked and you cannot use this CoreFoundation functionality safely. You MUST exec().
Break on __THE_PROCESS_HAS_FORKED_AND_YOU_CANNOT_USE_THIS_COREFOUNDATION_FUNCTIONALITY___YOU_MUST_EXEC__() to debug.
The process has forked and you cannot use this CoreFoundation functionality safely. You MUST exec().
Break on __THE_PROCESS_HAS_FORKED_AND_YOU_CANNOT_USE_THIS_COREFOUNDATION_FUNCTIONALITY___YOU_MUST_EXEC__() to debug.

After researching online, this seems to be an issue stemming from multiple processes attempting to render in parallel on macOS. However, the numbers of trainers, predictors, and agents are all 1. Additionally, I believe the error stems from lines 37 and 38 of ThreadDynamicAdjustment.py.

I am on macOS High Sierra version 10.13.3.

Python 3.6.4 :: Anaconda, Inc.

# packages in environment at /Users/<User>/miniconda3/envs/missionplanner:
#
absl-py                   0.1.11                    <pip>
astor                     0.6.2                     <pip>
bleach                    1.5.0                     <pip>
ca-certificates           2017.08.26           ha1e5d58_0  
certifi                   2018.1.18                py36_0  
chardet                   3.0.4                     <pip>
cycler                    0.10.0                    <pip>
decorator                 4.2.1                     <pip>
future                    0.16.0                    <pip>
gast                      0.2.0                     <pip>
grpcio                    1.10.0                    <pip>
gym                       0.9.7                     <pip>
gym-cap                   0.2                       <pip>
html5lib                  0.9999999                 <pip>
idna                      2.6                       <pip>
kiwisolver                1.0.1                     <pip>
libcxx                    4.0.1                h579ed51_0  
libcxxabi                 4.0.1                hebd6815_0  
libedit                   3.1                  hb4e282d_0  
libffi                    3.2.1                h475c297_4  
Markdown                  2.6.11                    <pip>
matplotlib                2.2.0                     <pip>
ncurses                   6.0                  hd04f020_2  
networkx                  2.1                       <pip>
numpy                     1.14.0                    <pip>
openssl                   1.0.2n               hdbc3d79_0  
pandas                    0.22.0                    <pip>
Pillow                    5.0.0                     <pip>
pip                       9.0.1            py36h1555ced_4  
protobuf                  3.5.2                     <pip>
pygame                    1.9.3                     <pip>
pyglet                    1.4.0a1                   <pip>
pyparsing                 2.2.0                     <pip>
python                    3.6.4                hc167b69_1  
python-dateutil           2.6.1                     <pip>
pytz                      2018.3                    <pip>
PyWavelets                0.5.2                     <pip>
readline                  7.0                  hc1231fa_4  
requests                  2.18.4                    <pip>
scikit-image              0.13.1                    <pip>
scipy                     1.0.0                     <pip>
setuptools                38.4.0                   py36_0  
six                       1.11.0                    <pip>
sqlite                    3.22.0               h3efe00b_0  
tensorboard               1.6.0                     <pip>
tensorflow                1.6.0                     <pip>
termcolor                 1.1.0                     <pip>
tk                        8.6.7                h35a86e2_3  
urllib3                   1.22                      <pip>
Werkzeug                  0.14.1                    <pip>
wheel                     0.30.0           py36h5eb2c71_1  
xz                        5.2.3                h0278029_2  
zlib                      1.2.11               hf3cbc9b_2  

Training Slowdown

The issue is documented here, but I was wondering if you ever had any problems, receiving messages like this during training:

tensorflow/core/common_runtime/gpu/pool_allocator.cc:247] PoolAllocator: After 1648707130 get requests, put_count=1648707127 evicted_count=2741000 eviction_rate=0.00166251 and unsatisfied allocation rate=0.00166258

I get this message quite often when cloning this repo and running it, untouched, on Pong. The issue seems worse when training on a custom pygame I made, and one time the training ground to a halt completely, with no more output to results.txt and the console full of these messages.

If you have never had problems like this with your network, then I will close the issue. Otherwise, any advice would be greatly appreciated.

Trying to compare this to universe-starter-agent (A3C)

Setting up openai/universe, I used the "universe starter agent" as a smoke test.

After adjusting the number of workers to better utilize my CPU, I saw the default PongDeterministic-v3 start winning after about 45 minutes.

Then I wanted to try GA3C on the same machine; given that you quote results of 6x or better, I expected it to perform at least as well as that.

However, it turns out that with GA3C the agent only starts winning after roughly 90 minutes.

I'm assuming that either my first (few) run(s) on the starter agent were just lucky, or that my runs on GA3C were unlucky. I also assume that the starter agent differs from the A3C that you compared GA3C against, at least in parameters and possibly in the algorithm.

So, what can I (an experienced software engineer, but with no background in ML) do to make the two methods more comparable on my machine? Is it just a matter of tweaking a few parameters? Is Pong not a good choice for the comparison?

I have an i7-3930k, a GTX 1060 (6 GB) and 32 GB of RAM.

Why are the RPPS, PPS, TPS consistently increasing?

Hi,

From running the experiment, I found that the displayed values of RPPS, PPS, and TPS keep increasing for many episodes. How can I tell whether one configuration is faster than another using only a small number of episodes?

Thanks!

GA3C has high CPU usage, causing system freeze or crash

The code runs fine but leaks CPU and memory and will crash your system. I am using the Glances monitoring tool (pip install glances). You will notice that if you leave the code running for a long time, the CPU context switches increase substantially and the CPU and memory usage keep growing until the code hangs or crashes. CPU usage increased from 6.7% to 64% and memory from 10% to 79%, at which point the system froze. When I look at the Nvidia TITAN X (Maxwell, 12 GB memory) usage, it is only using about 300 MB out of 12 GB. So, while most of the heavy lifting should be offloaded to the GPU, that does not seem to be the case here. I have 8 TITAN Maxwell GPUs with 2 Intel Xeon 2660 v3 CPUs (40 cores total) and 128 GB of DDR4 memory, and I can use any of them. Still, I get the same results: CPU usage keeps increasing.

Any insights?

The original A3C and various other hybrid (CPU & GPU) versions seem to offload most of the heavy lifting to the GPU and cause no system freezes, but not GA3C.

I am testing it on various amounts of data and various games.

Need an action trigger for 'press to continue' kind of situations

For some Atari games including Breakout, the environment sometimes waits for user input to continue (when the user loses a life, for example).
Gameplay may be stuck forever if the 'no-op' action is selected in such situations. To prevent this, ProcessAgent may need to track sequences of repeated 'no-op' actions and take a 'real' action if the run length goes beyond a limit, for example as sketched below.
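A minimal sketch of such a guard (hypothetical; the action indices and the limit are assumptions, with FIRE standing in for the 'real' action):

NOOP_ACTION = 0             # assumed index of the 'no-op' action
FIRE_ACTION = 1             # assumed index of a 'real' action (FIRE in many Atari games)
MAX_CONSECUTIVE_NOOPS = 30  # could become a Config.py parameter

def guard_action(policy_action, noop_count):
    # Count consecutive no-ops and force a real action once the limit is hit,
    # so play cannot stall on a 'press to continue' screen.
    if policy_action != NOOP_ACTION:
        return policy_action, 0
    noop_count += 1
    if noop_count > MAX_CONSECUTIVE_NOOPS:
        return FIRE_ACTION, 0
    return NOOP_ACTION, noop_count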

Frame Preprocessing Step (for flickering)

I've just finished reading the GA3C paper, and I didn't notice any mention of the frame processing to remove flickering artifacts, as explained in DeepMind's original Deep-Q paper methods section, and referenced in the experimental setup for the A3C paper.

I've just had a look through the source code, and it seems as though the Environment._preprocess(image) function is applied on a frame-by-frame basis, with no joint thresholding across the previous frame. I was wondering if this was a design decision? If not, is there any plan to incorporate this preprocessing step into the model, to more closely resemble DeepMind's A3C approach?

Alternatively, if I've misinterpreted the code (which is entirely possible), I would be grateful if you could point out where/how this preprocessing step is incorporated.

I will try implementing it efficiently myself in the meantime, which you are welcome to use if helpful.
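A minimal sketch of what the step could look like (not the repository's actual _preprocess; the grayscale conversion is deliberately simplified):

import numpy as np

def preprocess_with_flicker_removal(previous_frame, current_frame):
    # Element-wise max over two consecutive raw RGB frames removes sprites
    # that are only drawn on every other frame (DeepMind's flicker fix),
    # followed by a naive grayscale conversion.
    fused = np.maximum(previous_frame, current_frame)
    return fused.mean(axis=2).astype(np.uint8)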

P.s. this code is proving to be extremely useful!
Thanks a lot for sharing!

Dan

Wrong A3C implementation

I believe there is a bug in the A3C algorithm implementation, in the file ProcessAgent.py on line 107: the sub-episode return should be bootstrapped with the value of the next state, not the previous one.

I suggest replacing:

prediction, value = self.predict(self.env.current_state)
...
if done or time_count == Config.TIME_MAX:
    terminal_reward = 0 if done else value

with:

prediction, value = self.predict(self.env.current_state)
...
if done or time_count == Config.TIME_MAX:
    terminal_reward = 0
    if not done:
        _, terminal_reward = self.predict(self.env.current_state)

Training on environments with long episode length

Hello!

I'm currently trying to train on a problem that requires anywhere from 500 to 10,000 steps per episode. The training for this is excruciatingly slow when using the default config values. I've been messing around with some of the parameters but haven't been making any headway. Any recommended ways to improve the training speed?

Edit: the main thing I tried was setting TIME_MAX to a very large number to try to batch each episode into a single update. This helped, but not as much as I hoped.

pyTorch

Is there any plan on the horizon to port this code to PyTorch?

Memory usage grows after a while

I have tested the code on a GTX 1080 with 32 GB of RAM, but when I run it, memory usage increases over time; after about 30 hours it takes all 32 GB of RAM and the system dies.
