
Comments (83)

ashern avatar ashern commented on June 26, 2024 1

The OpenAI agent uses an LSTM policy network & GAE for the loss function.

This repo has a far simpler implementation of A3C, using a vanilla feed forward network for the policy & I'm pretty sure using a less recent loss function (though I haven't confirmed that last point recently).

While I personally had high hopes that this implementation would be useful for speeding things up, I've recently gone back to working with the OpenAI framework for my testing. I think some people have been working to get the LSTM policy working w/ GPU based A3C, but I haven't seen any working code that improves on the OpenAI type model....

I'd love to be corrected if I'm incorrect on any of the above.

from ga3c.

nczempin avatar nczempin commented on June 26, 2024 1

So after around 24000 seconds (400 minutes, 6.6667 hours), here's what I get with GA3C with my 3930k, 32 GB and GTX 1060 (6GB):
[Time: 23999] [Episode: 20379 Score: 350.0000] [RScore: 268.7030 RPPS: 966] [PPS: 1258 TPS: 210] [NT: 16 NP: 2 NA: 9]

[Time: 23999] [Episode: 20380 Score: 317.0000] [RScore: 268.7230 RPPS: 966] [PPS: 1258 TPS: 210] [NT: 16 NP: 2 NA: 9]

[Time: 24001] [Episode: 20381 Score: 355.0000] [RScore: 268.8350 RPPS: 966] [PPS: 1258 TPS: 210] [NT: 16 NP: 2 NA: 9]

[Time: 24004] [Episode: 20382 Score: 295.0000] [RScore: 268.8870 RPPS: 966] [PPS: 1258 TPS: 210] [NT: 16 NP: 2 NA: 9]

[Time: 24008] [Episode: 20383 Score: 247.0000] [RScore: 268.8910 RPPS: 965] [PPS: 1258 TPS: 210] [NT: 16 NP: 3 NA: 8]

It seemed to make progress right from the start, unlike with Pong, where both algorithms seemed to be clueless for a while, then "suddenly got it" and no longer lost, followed by a very long period of very slow growth in average score (the points it conceded always seemed to be the very first few ones; once it had won a single point it seemed to settle into very similar states).

GA3C on Amidar seems to be stuck just under 270; I will now see what I get on the same machine with universe-starter-agent.

from ga3c.

ifrosio avatar ifrosio commented on June 26, 2024 1

The improvement with TRAINING_MIN_BATCH_SIZE should be observed for all games (although we only tested a few of them).

from ga3c.

nczempin avatar nczempin commented on June 26, 2024 1

./_clean.sh;./_train.sh ATARI_GAME='BoxingDeterministic-v3' TRAINING_MIN_BATCH_SIZE=40 without the changes from the GAE PR

[Time: 3601] [Episode: 2634 Score: 5.0000] [RScore: 1.6630 RPPS: 1560] [PPS: 1565 TPS: 30] [NT: 6 NP: 2 NA: 35]
[Time: 5885] [Episode: 5200 Score: 81.0000] [RScore: 68.2340 RPPS: 1583] [PPS: 1579 TPS: 30] [NT: 8 NP: 3 NA: 36]
[Time: 7200] [Episode: 8165 Score: 88.0000] [RScore: 88.1870 RPPS: 1575] [PPS: 1579 TPS: 30] [NT: 7 NP: 3 NA: 34]
[Time: 8045] [Episode: 10298 Score: 100.0000] [RScore: 92.0040 RPPS: 1564] [PPS: 1577 TPS: 30] [NT: 7 NP: 3 NA: 34]
[Time: 24041] [Episode: 55034 Score: 96.0000] [RScore: 99.0730 RPPS: 649] [PPS: 1505 TPS: 28] [NT: 8 NP: 5 NA: 20]

from ga3c.

etienne87 avatar etienne87 commented on June 26, 2024 1

@nczempin @4SkyNet have you implemented the version with LSTM in TF? I am currently trying in PyTorch but I had to add the c, h states to the experience queues. Also note that for GAE, I left the parameter self.tau at 1, which is perhaps not the best choice & in theory should not change the performance.
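For reference, a minimal sketch of the GAE recursion in question (plain NumPy, not the PR's actual code; tau here plays the role of the GAE lambda). With tau = 1 the weighted sum of TD residuals telescopes into the full discounted return minus the value baseline, which is why changing it mainly trades variance against bias:

import numpy as np

def gae_advantages(rewards, values, gamma=0.99, tau=1.0):
    # values holds V(s_0) .. V(s_T): one more entry than rewards, where the
    # last entry is the bootstrap value (0 if the episode terminated)
    advantages = np.zeros(len(rewards), dtype=np.float32)
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * tau * running
        advantages[t] = running
    return advantages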

from ga3c.

nczempin avatar nczempin commented on June 26, 2024 1

@nczempin if I see everything correctly --> there are no improvements from 6 to 12 agents in terms of time >> global_step/sec is a bit over 600 in both cases

I thought I saw a slight improvement in wall-clock time, but I didn't look at it in detail. I guess I should have included the fps images as well.

from ga3c.

nczempin avatar nczempin commented on June 26, 2024 1

Okay, I think I need to clarify what I was referring to:

I don't understand this statement:

but it's hard to properly control it with the current A3C:

It sounded to me like you're saying that episodes terminating early is somehow a problem, because some trait of "current A3C" somehow optimizes for episodes that don't terminate early.

from ga3c.

nczempin avatar nczempin commented on June 26, 2024 1

Regular GA3C stuck near 1700 points on Seaquest after 12 hours (but still better than universe starter agent):

[Time:     3600] [Episode:     4009 Score:   520.0000] [RScore:   573.0000 RPPS:  1647] [PPS:  1636 TPS:    31] [NT: 10 NP:  2 NA: 32]

[Time:     7201] [Episode:     6712 Score:   560.0000] [RScore:   650.4000 RPPS:  1589] [PPS:  1625 TPS:    31] [NT: 15 NP:  4 NA: 34]

[Time:     9271] [Episode:     8357 Score:  1220.0000] [RScore:  1000.7400 RPPS:  1614] [PPS:  1626 TPS:    31] [NT: 15 NP:  2 NA: 28]

[Time:    16630] [Episode:    13066 Score:  1720.0000] [RScore:  1700.0000 RPPS:  1648] [PPS:  1641 TPS:    31] [NT: 10 NP:  6 NA: 41]

[Time:    27096] [Episode:    19523 Score:  1760.0000] [RScore:  1700.1600 RPPS:  1639] [PPS:  1647 TPS:    31] [NT: 17 NP:  3 NA: 36]

[Time:    45286] [Episode:    30468 Score:  1680.0000] [RScore:  1705.9800 RPPS:  1620] [PPS:  1636 TPS:    31] [NT:  8 NP:  4 NA: 34]

from ga3c.

nczempin avatar nczempin commented on June 26, 2024 1

I've just added Adventure to ALE; that might be even harder than Montezuma's Revenge with current algorithms.

I wonder if intrinsic motivation would help it, like it did Montezuma's (a little bit; Adventure is not quite as dangerous as Montezuma's, but the rewards are even sparser).

from ga3c.

nczempin avatar nczempin commented on June 26, 2024 1

Still not sure I understand: Are you saying shooting a regular space invader gives as much reward as shooting a mothership?

The discounting is orthogonal to that question.

Apart from that, I'm not really in a position to argue about any of this. When I have a better understanding of actor-critic and all that I need to know before that, I may revisit this. So far I've watched the David Silver lectures, and I have some catching up to do.

In my engineer mindset I also like to implement all these different techniques like TD(lambda) etc., and obviously there is a lot I have to do before I even get to regular AC.

from ga3c.

nczempin avatar nczempin commented on June 26, 2024 1

Not sure I follow.

The rewards come from the environment; the algorithm is trying to figure out how to get those rewards, and algorithms are ranked by how well they score compared to other algorithms (or humans). Whether you treat a regular space invader the same as a mothership only matters indirectly: your algorithm knows nothing about the different types, it just knows that in state s1 it was better to move slightly to the right and then shoot, to get the (points for the) mothership, rather than to move left and get the points for the regular invader.

That is completely general as long as the environment gives out rewards.

As I said, I know what discounting reward is and what it is for; like in finance, getting a reward now is better than getting a reward tomorrow, and how much better is determined by the discount factor, which is usually a measure of uncertainty; in finance it is dynamic, based on risk vs. reward.

But the discount factor doesn't have anything to do with the reward coming from a mothership or not, unless your algorithm takes into account that to get the higher score it also risks dying more often.

And when you have very sparse rewards, it makes sense to discount less heavily (i.e. use a high gamma), because otherwise the reward might effectively disappear after enough steps due to rounding. Although technically, if the rewards are really sparse (like in Adventure, only +1 right at the end), it shouldn't make any difference as long as you don't round to 0: the 1 will always be more than any other reward.

I guess in that case (rounding it away) it may even make sense to adjust gamma dynamically: if you keep finishing episodes without getting any rewards, gamma should eventually be increased so that later rewards eventually get counted in the present.
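To make the "rounding away" point concrete, a quick back-of-the-envelope check (plain Python, nothing specific to either implementation):

# how much of a terminal reward of 1 survives N steps of discounting
for gamma in (0.95, 0.99):
    for n in (100, 500, 1000):
        print(gamma, n, gamma ** n)
# e.g. 0.99**1000 is roughly 4.3e-05: tiny, but still nonzero, so the final +1
# in Adventure would still dominate a signal that is otherwise all zeros.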

from ga3c.

nczempin avatar nczempin commented on June 26, 2024

ok, that explains it.

Is "get LSTM policy working with GA3C" an open research problem or merely a matter of implementation details?

from ga3c.

nczempin avatar nczempin commented on June 26, 2024

And does Pong happen to be particularly sensitive to LSTM or would it be no different in the other Atari games?

from ga3c.

swtyree avatar swtyree commented on June 26, 2024

I did a few tests with the universe starter agent when it was just released. Based on that limited experience, it seemed that the setup was a bit overfit to Pong--performance was reasonable for other games, but exceptionally fast for Pong. But as the previous commenter mentioned, it also uses an LSTM and GAE, which are helpful in some cases. If you run more extensive tests, I'd be curious to know how it performs on a wider suite of games.

from ga3c.

ashern avatar ashern commented on June 26, 2024

The appendix of the original A3C paper has a ton of comparisons across different games & models, which should help you avoid some testing.

LSTM A3C is widely implemented open-source - a quick search should turn up a few options. The Universe & Miyosuda implementations seem to be the most commonly used.

from ga3c.

nczempin avatar nczempin commented on June 26, 2024

The appendix of the original A3C paper has a ton of comparisons across different games & models, which should help you avoid some testing.

Not sure what this refers to; are you saying I could have avoided wasting time on CoasterRacer by being more aware of the comparisons? My goal was just to "play around with openai universe" rather than get deep into testing. If anything, I'd be interested in adding an environment such as MAME or one of the other emulators, which is more obviously an engineering task.

LSTM A3C is widely implemented open-source - a quick search should turn up a few options. The Universe & Miyosuda implementations seem to be the most commonly used

Is this a response to my question about GA3C with LSTM? If so, the implicit assumption is that there are no fundamental issues that would complicate an endeavour to do so, for example by looking at the A3C implementations. Is this what you're saying? My understanding from the GA3C paper is that they consider it to be a fundamental approach and that A3C just happened to perform the best, so adding LSTM should not be a big deal.

from ga3c.

nczempin avatar nczempin commented on June 26, 2024

also, what would be a better venue to have discussions such as this one? Don't really want to clutter up the project issues.

from ga3c.

ashern avatar ashern commented on June 26, 2024

I simply meant - there exists a readily available corpus of tests conducted by professional researchers. Use it as you wish.

Implementing LSTM policy is simply an engineering issue, albeit a moderately difficult one in this case. Have at it - and please publish if you get good results! There are other issues open in this repo, I believe, where there are already discussions around LSTM/GPU.

from ga3c.

4SkyNet avatar 4SkyNet commented on June 26, 2024

@nczempin you should add GAE, cuz it's the most crucial part and easy to implement. LSTM doesn't really affect your results that much (though it can help a bit with the dynamics of the game; it's more policy oriented).
PS> see some results from the vanilla article (last page)

from ga3c.

nczempin avatar nczempin commented on June 26, 2024

I simply meant - there exists a readily available corpus of tests conducted by professional researchers. Use it as you wish.

Well, the "which ones should I try" was really offering my "services" to @swtyree: in case I make some more comparisons with my setup anyway it doesn't make a big difference to me which other roms I try, so if someone does have a preference, I might as well support that.

Implementing LSTM policy is simply an engineering issue, albeit a moderately difficult one in this case. Have at it - and please publish if you get good results!

"Publish" sounds intimidating to me, but if I do get anything off the ground, I promise to put the source up on github; perhaps fork and PR here. I probably have to brush up my background in this area a little first (and I definitely have some things I'd like to do first, as mentioned before), so don't hold your breath.

There are other issues open in this repo, I believe, where there are already discussions around LSTM/GPU.

I saw an issue on the universe starter agent, asking about GPU. It doesn't seem to have gone anywhere.

from ga3c.

mbz avatar mbz commented on June 26, 2024

Please check out the pull requests section. GAE has already been implemented by @etienne87 in this pull request. He also implemented a specific pre-processing and configuration which provides a better comparison with the starter agent.

from ga3c.

nczempin avatar nczempin commented on June 26, 2024

@nczempin you should add GAE, cuz it's the most crucial part and easy to implement. LSTM doesn't really affect your results that much (though it can help a bit with the dynamics of the game).
PS> see some results from the vanilla article (last page)

GAE being? All I get is Google App Engine, and I don't find a reference to the term in the A3C paper.

Edit: Generalized Advantage Estimation.

Please check out the pull requests section. GAE has already been implemented by @etienne87 in this pull request. He also implemented a specific pre-processing and configuration which provides a better comparison with the starter agent.

I'll have a look at that. Should I use a different game from the purportedly overfitted Pong, or would it be fine? I guess we'd know the answer when/if I try...

from ga3c.

mbz avatar mbz commented on June 26, 2024

GAE stands for Generalized Advantage Estimation. It's always a good idea to start with Pong (since it's usually the fastest to converge), and as long as you avoid Pong-specific logic, things should generalize to other games as well.

from ga3c.

nczempin avatar nczempin commented on June 26, 2024

Okay, I checked out the PR, but it breaks the dependencies on the vanilla openai-universe.

I'm willing to give it a whirl once the PR is in a usable state, more or less as-is.

from ga3c.

4SkyNet avatar 4SkyNet commented on June 26, 2024

If you look at the results from the original paper, there are some good environments, such as Amidar, Berzerk and Krull, for faster convergence. But DeepMind trained all of these games with the same parameters, whereas the gamma (discount factor) could be chosen for each environment individually to get better results.

from ga3c.

nczempin avatar nczempin commented on June 26, 2024

So am I right in assessing that my issue #22 essentially boils down to issue #3?

from ga3c.

nczempin avatar nczempin commented on June 26, 2024

or should I rename it to something that specifically references GAE?

from ga3c.

4SkyNet avatar 4SkyNet commented on June 26, 2024

@nczempin you can try Seaquest, Boxing or other similar, more policy-oriented games, rather than value-oriented ones (like Breakout).
PS> I prefer Boxing cuz it's simple enough, but it takes a lot more time to see any distinction from random play (8 million for me for an almost vanilla A3C) than Breakout, for example

from ga3c.

ifrosio avatar ifrosio commented on June 26, 2024

Based on the latest version of our paper, we get more stable and faster convergence for TRAINING_MIN_BATCH_SIZE = 20 ... 40 in Config.py. If you haven't done it yet, you can try this.
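As I understand it, TRAINING_MIN_BATCH_SIZE makes a trainer thread accumulate experience until it has at least that many samples before doing a single, larger GPU update; roughly like this (a simplified sketch, not the repo's exact ThreadTrainer code):

import numpy as np

TRAINING_MIN_BATCH_SIZE = 40  # one of the values suggested above

def next_training_batch(training_q):
    # blocking get of one agent's (states, returns, actions) batch,
    # then keep appending until the minimum batch size is reached
    x, r, a = training_q.get()
    while x.shape[0] < TRAINING_MIN_BATCH_SIZE:
        x2, r2, a2 = training_q.get()
        x = np.concatenate((x, x2))
        r = np.concatenate((r, r2))
        a = np.concatenate((a, a2))
    return x, r, a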

from ga3c.

nczempin avatar nczempin commented on June 26, 2024

Based on the latest version of our paper, we get more stable and faster convergence for TRAINING_MIN_BATCH_SIZE = 20 ... 40 in Config.py. If you haven't done it yet, you can try this.

On Pong again or on any of the other ones I'll try?

from ga3c.

4SkyNet avatar 4SkyNet commented on June 26, 2024

@nczempin DeepMind reaches almost 284 within 1 day (80 million). Your result isn't so bad, given that DeepMind selects the 5 best runs out of 50 and averages them. You can also run into some saturation or exploration problems after a while. If you use RMSProp as the optimizer you can anneal the learning rate a bit more slowly.
PS> and, as you can see, DeepMind has some instability in training. It seems that Hogwild! can cause such issues, but they also occur with more synchronous approaches.

from ga3c.

nczempin avatar nczempin commented on June 26, 2024

Here's the situation with (the universe starter agent) python3 train.py --num-workers 6 --env-id Amidar-v0 --log-dir /tmp/amidar
after roughly 8 hours:

[image]

from ga3c.

nczempin avatar nczempin commented on June 26, 2024

I picked 6 workers because that's how many cores my CPU has, but perhaps up to 12 could have helped, given Hyperthreading etc. But naive "analysis" suggests that GA3C still wins in this particular case, because it gets more than double the score.

It would be interesting to know how much the speedup is due to using the CPU cores more efficiently because of the dynamic load balancing vs. including the GPU.

Even just getting a dynamic number of threads, without any specific GPU improvements, is a big convenience over having to pick them yourself statically.

from ga3c.

nczempin avatar nczempin commented on June 26, 2024

GA3C PongDeterministic-v3 again with TRAINING_MIN_BATCH_SIZE=20:
[Time: 2702] [Episode: 3682 Score: -21.0000] [RScore: -20.2860 RPPS: 1513] [PPS: 1513 TPS: 51] [NT: 5 NP: 4 NA: 26]
...
[Time: 8993] [Episode: 7191 Score: -7.0000] [RScore: -14.3570 RPPS: 1514] [PPS: 1286 TPS: 44] [NT: 6 NP: 3 NA: 43]

from ga3c.

nczempin avatar nczempin commented on June 26, 2024

GA3C PongDeterministic-v3 again with TRAINING_MIN_BATCH_SIZE=40:

[Time: 2701] [Episode: 3988 Score: -21.0000] [RScore: -20.1950 RPPS: 1663] [PPS: 1637 TPS: 31] [NT: 7 NP: 3 NA: 44]

[Time: 5402] [Episode: 6053 Score: -13.0000] [RScore: -17.0820 RPPS: 1628] [PPS: 1512 TPS: 29] [NT: 11 NP: 2 NA: 35]

[Time: 8996] [Episode: 7551 Score: -10.0000] [RScore: -13.0080 RPPS: 1609] [PPS: 1494 TPS: 28] [NT: 15 NP: 4 NA: 32]

from ga3c.

nczempin avatar nczempin commented on June 26, 2024

GA3C PongDeterministic-v3 again with TRAINING_MIN_BATCH_SIZE=40, with GAE changes from #18:
[Time: 2701] [Episode: 3118 Score: -12.0000] [RScore: -16.1090 RPPS: 1939] [PPS: 1915 TPS: 36] [NT: 6 NP: 1 NA: 41]

Still not reaching that "starting to win after 45 minutes" I get with universe-starter-agent.

[Time: 4759] [Episode: 3968 Score: 8.0000] [RScore: -8.3390 RPPS: 1966] [PPS: 1939 TPS: 37] [NT: 9 NP: 1 NA: 45]
...
[Time: 5192] [Episode: 4251 Score: 19.0000] [RScore: 0.0220 RPPS: 2009] [PPS: 1950 TPS: 37] [NT: 7 NP: 3 NA: 46]
...
[Time: 5401] [Episode: 4405 Score: 17.0000] [RScore: 4.4630 RPPS: 2018] [PPS: 1955 TPS: 37] [NT: 9 NP: 3 NA: 46]

from ga3c.

nczempin avatar nczempin commented on June 26, 2024

GA3C Amidar-v0 with TRAINING_MIN_BATCH_SIZE=40, with GAE changes from #18:

[Time: 10374] [Episode: 13975 Score: 296.0000] [RScore: 219.0260 RPPS: 1919] [PPS: 1882 TPS: 35] [NT: 16 NP: 8 NA: 55]
...
[Time: 10801] [Episode: 14456 Score: 2.0000] [RScore: 154.3180 RPPS: 1713] [PPS: 1867 TPS: 35] [NT: 14 NP: 5 NA: 52] (trying something new? there was a string of 2.00 scores)

from ga3c.

nczempin avatar nczempin commented on June 26, 2024

BTW I should probably compile my own Tensorflow, not sure how much effect it will have though:

W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE3 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.

from ga3c.

nczempin avatar nczempin commented on June 26, 2024

Hm, there seems to be something wrong with _play.sh. I decided I wanted to pause the training and have a look at an agent playing, perhaps seeing why it only got 2 points all of a sudden.

Naively, I thought I could just let one agent run in parallel to the training; it should not affect the big picture overall.

But I got an error:

tensorflow.python.framework.errors_impl.InvalidArgumentError: Assign requires shapes of both tensors to match. lhs shape= [256,6] rhs shape= [256,10]
	 [[Node: save/Assign_22 = Assign[T=DT_FLOAT, _class=["loc:@logits_p/w"], use_locking=true, validate_shape=true, _device="/job:localhost/replica:0/task:0/gpu:0"](logits_p/w/RMSProp_1, save/RestoreV2_22/_11)]]

Caused by op u'save/Assign_22', defined at:
  File "GA3C.py", line 59, in <module>
    Server().main()
  File "/home/nczempin/git/ml/GA3C/ga3c/Server.py", line 48, in __init__
    self.model = NetworkVP(Config.DEVICE, Config.NETWORK_NAME, Environment().get_num_actions())
  File "/home/nczempin/git/ml/GA3C/ga3c/NetworkVP.py", line 65, in __init__
    self.saver = tf.train.Saver({var.name: var for var in vars}, max_to_keep=0)
  File "/home/nczempin/.local/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 1040, in __init__
    self.build()
  File "/home/nczempin/.local/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 1070, in build
    restore_sequentially=self._restore_sequentially)
  File "/home/nczempin/.local/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 675, in build
    restore_sequentially, reshape)
  File "/home/nczempin/.local/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 414, in _AddRestoreOps
    assign_ops.append(saveable.restore(tensors, shapes))
  File "/home/nczempin/.local/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 155, in restore
    self.op.get_shape().is_fully_defined())
  File "/home/nczempin/.local/lib/python2.7/site-packages/tensorflow/python/ops/gen_state_ops.py", line 47, in assign
    use_locking=use_locking, name=name)
  File "/home/nczempin/.local/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 763, in apply_op
    op_def=op_def)
  File "/home/nczempin/.local/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 2327, in create_op
    original_op=self._default_original_op, op_def=op_def)
  File "/home/nczempin/.local/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1226, in __init__
    self._traceback = _extract_stack()
InvalidArgumentError (see above for traceback): Assign requires shapes of both tensors to match. lhs shape= [256,6] rhs shape= [256,10]
	 [[Node: save/Assign_22 = Assign[T=DT_FLOAT, _class=["loc:@logits_p/w"], use_locking=true, validate_shape=true, _device="/job:localhost/replica:0/task:0/gpu:0"](logits_p/w/RMSProp_1, save/RestoreV2_22/_11)]]

So I thought that perhaps you're not meant to run _train.sh and _play.sh concurrently; so, thinking that _train.sh would just pick up from a checkpoint, I stopped it and then tried running _play.sh.

Turns out I got the same error.

So I thought, perhaps it's because of the GAE changes, so I checked out the master branch and tried again. Same result.

So right now I'm not sure what's going on; there may be something in the GAE changes that modifies the written data so that there is a problem when reading it back?

Then I tried continuing the _train.sh process, and was slightly surprised that the time started back at 0, not at the point where I had left it.

Right now it's a little hard for me to tell if there was an inadvertent _clean.sh thrown in somewhere, or if this is expected behaviour and I just need to add the time at which I stopped to the new time value, or if this is an error caused by the GAE changes.

Edit: I notice that in the stack trace python 2.7 is mentioned. I try to run everything with python3.5, but my default ubuntu setup seems to link /usr/bin/python to python2.7; when I change it, some other programs no longer work.
Edit 2: Scratch that, making sure to run it with 3.5 I get the equivalent error message, just with references to 3.5
Edit 3: And the time value is nothing to worry about, ProcessStats.py doesn't measure the time relative to the start of the training, but of the "session"

from ga3c.

nczempin avatar nczempin commented on June 26, 2024

Okay, I'll stop that Amidar run for now, because I need the machine for other things. There was some wonky stuff going on; I'll save the results.txt (what's left of it after the restart) and the checkpoints/ directory, just in case anyone wants to have a look.

[Time: 1327] [Episode: 18465 Score: 2.0000] [RScore: 10.8540 RPPS: 1866] [PPS: 1857 TPS: 35] [NT: 8 NP: 5 NA: 33]

from ga3c.

4SkyNet avatar 4SkyNet commented on June 26, 2024

If you use gym for your experiments -> try to run everything in Deterministic mode; that is, Amidar-v0 should be AmidarDeterministic-v3, or wrap v0 manually to make it deterministic with constant frame skipping.
PS> DeepMind's 1 day is equal to 80 million and they use 16 parallel workers.

from ga3c.

nczempin avatar nczempin commented on June 26, 2024

If you use gym for your experiments -> try to run everything in Deterministic mode; that is, Amidar-v0 should be AmidarDeterministic-v3, or wrap v0 manually to make it deterministic with constant frame skipping.
PS> DeepMind's 1 day is equal to 80 million and they use 16 parallel workers.

Oh, okay, I'll do that. Would that explain the sudden meltdown?

from ga3c.

4SkyNet avatar 4SkyNet commented on June 26, 2024

@nczempin no, mostly not --> but it can affect the learning process in terms of time and quality >> gym's stochastic v0 environments can take your agent from the same state to different ones, since the action is repeated for a random number of frames (the 2..5 range in the linked code):
https://github.com/openai/gym/blob/master/gym/envs/atari/atari_env.py#L80
You can also control it manually with something like this (or use v3):

from gym.wrappers.frame_skipping import SkipWrapper
...
# repeat each chosen action for a fixed number of frames instead of gym's random v0 frame skip
frame_skip = 4
self.gym = gym.make(env)
if frame_skip is not None:
    skip_wrapper = SkipWrapper(frame_skip)
    self.gym = skip_wrapper(self.gym)

from ga3c.

nczempin avatar nczempin commented on June 26, 2024

Hm.

with ./_clean.sh;./_train.sh ATARI_GAME='BoxingDeterministic-v3' TRAINING_MIN_BATCH_SIZE=40 with GAE I seem to get a similarly strange behaviour; I'm guessing it's a bug in the GAE code (I will let the non-GAE version run overnight for comparison):

[Time: 11415] [Episode: 12890 Score: 4.0000] [RScore: 3.2790 RPPS: 1912] [PPS: 1909 TPS: 36] [NT: 25 NP: 2 NA: 31]

Here's the "high water mark" of RScore (average score over the default 1000 episodes):
[Time: 7652] [Episode: 9265 Score: 68.0000] [RScore: 60.2050 RPPS: 1905] [PPS: 1910 TPS: 36] [NT: 22 NP: 4 NA: 34]

A wild guess would be that some integer overflows somewhere.

from ga3c.

4SkyNet avatar 4SkyNet commented on June 26, 2024

Thx for the experiments! I also noticed some meltdowns through the training process with Atari Boxing, for example:
A3C-FF_without-GAE [image: boxing-8th-35mil]
A3C-LSTM1_without-GAE [image: da3c_cur-lstm_8ag_gym_boxing]
A3C-LSTM2_without-GAE [image: da3c_tf-lstm_8ag_gym_boxing]

from ga3c.

nczempin avatar nczempin commented on June 26, 2024

Thx for the experiments! I also noticed some meltdowns through the training process with Atari Boxing

Hm.
That would indicate that I should leave it running for longer and not look at the score.

from ga3c.

nczempin avatar nczempin commented on June 26, 2024

And perhaps that I need to clarify my intuition about what the score represents.

So just using your (@4SkyNet) bottom diagram "A3C-LSTM2_without-GAE", my understanding was that at around 20 M, we would have had a network that would have given us 80 points on average, within some variation (but none that would ever, within probabilistic confidence, cause us to go below 40 or so points).

And between 20 and 22.5, we are exploring and not finding improvement, but we could always go back to what we had at 20 M.

And it does not mean that we have found a configuration that turns out after more exploration to only be worth 20 points on average (which is sorta what it feels like when the reported scores start going down).

Is my intuition reasonable?

[and I think by network and configuration above I mean policy]

from ga3c.

4SkyNet avatar 4SkyNet commented on June 26, 2024

@nczempin I think these meltdowns can be caused by the agent's generalization issues wrt exploration, and to my mind they are rather environment dependent --> I don't see such behavior for Breakout in comparison to Boxing, since the former has more stable environment dynamics.

For example (as I see from my visual output):
the Boxing agent starts to beat the opponent and put some pressure on it.
It has to keep up the pressure to get more rewards over time.
And (suddenly) if your agent ends up a little behind the opponent,
the game unfolds in the opposite direction:
your agent was trained to box from left to right,
but now it has to box from right to left.

PS> wrt the original results, DeepMind also reports somewhat worse results from 4-day training compared to 1-day (FF).

from ga3c.

nczempin avatar nczempin commented on June 26, 2024

PS> wrt the original results, DeepMind also reports somewhat worse results from 4-day training compared to 1-day (FF).

That is confusing me; how can the results be worse after 4 days than after 1 day? Or are you saying this is with different algorithms?

Presumably you can always go back to a previous policy that was better, or is it only that we "thought" it was better and now it turns out that it wasn't?

from ga3c.

nczempin avatar nczempin commented on June 26, 2024

Universe starter agent with
python3 train.py --num-workers 6 --env-id BoxingDeterministic-v3 --log-dir /tmp/boxingd3

[image]

from ga3c.

nczempin avatar nczempin commented on June 26, 2024

@nczempin have you implemented the version with LSTM in TF? I am currently trying in PyTorch but I had to add the c, h states to the experience queues. Also note that for GAE, I left the parameter self.tau at 1, which is perhaps not the best choice & in theory should not change the performance.

@etienne87 I haven't implemented anything; I just used the GA3C from the head here, from your PR plus the 2 changes, and I'm comparing them to https://github.com/openai/universe-starter-agent (which supposedly has GAE, LSTM, but no GA3C) with the same environments on the same machine (and wondering whether there are any other parameters I should tweak to get the comparisons to be fairer).

It also turns out that the fact that I did not see the improvement of GA3C is most likely due to the universe starter agent being tuned towards Pong (which they state in the Readme).

Maybe I misunderstood the question?

I was scared for a moment that there may be a bug in your GAE, but @4SkyNet clarified that the observations are more likely to be independent of your GAE changes.

from ga3c.

4SkyNet avatar 4SkyNet commented on June 26, 2024

That is confusing me; how can the results be worse after 4 days than after 1 day? Or are you saying this is with different algorithms?

No, the algorithms are the same (A3C-FF 1-day & A3C-FF 4-day), but the results can be worse after 4 days:
[image: table of per-game results]
It seems like we reach the highest result (the algorithm's limit, perhaps; see the bold entries) and then diverge a bit.
For Pong we see the following (from that table):
A3C-FF 1-day: 11.4
A3C-FF 4-day: 5.6

Presumably you can always go back to a previous policy that was better, or is it only that we "thought" it was better and now it turns out that it wasn't?

It's hard to make a strict statement in this case. Sometimes the new policy really is better than the old one (the results mentioned above), sometimes we just "thought" it was better (the Boxing meltdown example).

@4SkyNet have you implemented the version with LSTM in TF? I am currently trying in PyTorch but I had to add the c, h states to the experience queues.

@etienne87 unfortunately not. The LSTM results I show are from another version of A3C (similar to vanilla, but a bit more synchronous and distributed). I haven't done anything with GA3C yet, but it would be good to have more versions with LSTM / GAE / exploration bonus / etc. (maybe I'll get to it later, I don't really know...)

Also note that for GAE, I left the parameter self.tau at 1

Thx for pointing it out! @nczempin what tau (lambda) do you use in the starter agent?
PS> I usually set it to 0.97 for TRPO for example

from ga3c.

nczempin avatar nczempin commented on June 26, 2024

@nczempin what tau do you use in starter agent?
PS> I usually set it to 0.97 for TRPO for example

I use defaults for anything I don't specifically mention

from ga3c.

etienne87 avatar etienne87 commented on June 26, 2024

@nczempin, it is my fault, I did not take the time to put the parameter in Config.py. The parameter is just lying there in ProcessAgent.py

from ga3c.

4SkyNet avatar 4SkyNet commented on June 26, 2024

I think we can avoid some meltdowns, or at least make them smoother, for some environments if my intuition is right.
This technique is called augmentation and it is widely used in deep learning:
we can flip the input observation from left to right (with np.fliplr(image), for example).

A3C stacks images into a state representation as a kind of motion,
so we can't randomly flip the input image from left to right at arbitrary points within an episode.
But we can do such flipping for a whole episode: start each game with some probability (0.5) of being flipped or not for the entire episode.

It makes sense for some environments, and it could make training a bit longer, but I hope we can avoid some meltdowns with this technique.

from ga3c.

etienne87 avatar etienne87 commented on June 26, 2024

Hum, if you flip the image you need to remap the actions as well, or we need label-safe augmentation techniques.
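For concreteness, a minimal sketch of the episode-level flip with action remapping (a hypothetical wrapper, not code from this repo; action_swap must map each left-ish action id to its right-ish counterpart for the particular game's action set):

import numpy as np

class EpisodeFlip:
    # decide once per episode whether to mirror, then flip every observation
    # and swap the mirrored actions for the rest of that episode
    def __init__(self, env, action_swap, p=0.5):
        self.env, self.action_swap, self.p = env, action_swap, p
        self.flip = False

    def reset(self):
        self.flip = np.random.rand() < self.p
        obs = self.env.reset()
        return np.fliplr(obs) if self.flip else obs

    def step(self, action):
        if self.flip:
            action = self.action_swap.get(action, action)
        obs, reward, done, info = self.env.step(action)
        if self.flip:
            obs = np.fliplr(obs)
        return obs, reward, done, info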

from ga3c.

nczempin avatar nczempin commented on June 26, 2024

@nczempin, it is my fault, I did not take the time to put the parameter in Config.py. The parameter is just lying there in ProcessAgent.py

It wouldn't have made any difference if you had put it in Config.py; I would not have touched it. Or are you saying that you should not have included it?

Also, @4SkyNet asked about the "starter agent"; by that I was assuming he meant the universe-starter-agent. Does that even have this parameter?

from ga3c.

nczempin avatar nczempin commented on June 26, 2024

Universe starter agent with
python3 train.py --num-workers 12 --env-id BoxingDeterministic-v3 --log-dir /tmp/boxingd3_12 to see how much effect using 12 workers on my 6-core, 12 threads (hyperthreading) i7 would have. Wall clock time was roughly 7.5 hours.
[image]

from ga3c.

4SkyNet avatar 4SkyNet commented on June 26, 2024

Hum, if you flip the image you need to remap the actions as well, or we need label-safe augmentation techniques.

You are right @etienne87 >> actions should be remapped too where applicable --> and it's simple enough for left-right flipping, where we have to remap left to right && right to left (for Atari games)

Does that even have this parameter?

Yes, they call it lambda and it's equal to 1.0 by default

...using 12 workers on my 6-core, 12 threads

@nczempin if I see everything correctly --> there are no improvements from 6 to 12 agents in terms of time >> global_step/sec is a bit over 600 in both cases

from ga3c.

4SkyNet avatar 4SkyNet commented on June 26, 2024

@nczempin hm, I've also noticed that the episode time becomes a bit shorter for your Boxing over the course of training --> it's a good sign, but it's hard to properly control it with the current A3C:
A common episode is about 2 min, but when your agent reaches 100 points --> you get a KO and the game goes into a terminal state, starting a new one from scratch

from ga3c.

nczempin avatar nczempin commented on June 26, 2024

@nczempin hm, I've also noticed that the episode time becomes a bit shorter for your Boxing over the course of training --> it's a good sign, but it's hard to properly control it with the current A3C:
A common episode is about 2 min, but when your agent reaches 100 points --> you get a KO and the game goes into a terminal state, starting a new one from scratch

Are you saying either of the A3Cs has implicit assumptions about episode lengths being more uniform? Or just the original one?

In general, agents will eventually reach maximum scores. In ALE, for the many games that roll the score over, this is not actually possible; an episode could potentially play on to infinity. It is an open question there how to handle score wrapping. Depending on how the Python interface is used, agents might be discouraged from wrapping the score. IMHO it's pointless to keep going once you can wrap the score (for an agent; not necessarily for a human).

from ga3c.

4SkyNet avatar 4SkyNet commented on June 26, 2024

Are you saying either of the A3Cs have implicit assumptions about episode lengths being more uniform? Or just the original one?

No, you just control the training quality through the reward you acquire (the discounted reward).
There is no information about your lives or KOs (in the Boxing case) for the current A3C

from ga3c.

4SkyNet avatar 4SkyNet commented on June 26, 2024

It sounded to me like you're saying that episodes terminating early is somehow a problem, because some trait of "current A3C" somehow optimizes for episodes that don't terminate early.

A3C sets the last terminal discounted reward to 0, which is not so good for Boxing.
It would be better to set this value to a relatively big value to encourage the agent to reach KOs
as soon as possible (we clip all rewards into -1..1, so we could only set it to 1 for the current algo).
On the other hand, if we have lives and we lose a life, we should set the reward
signal to some negative value --> but we don't handle such cases in A3C
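As a minimal illustration of the clipping described here (assuming the usual Atari preprocessing, not necessarily this repo's exact code), every raw score delta is squashed into [-1, 1] before training, which is why a mothership and a regular invader can end up worth the same +1:

def clip_reward(raw_reward):
    # squash any raw game score delta into the [-1, 1] band
    return max(-1.0, min(1.0, raw_reward))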

from ga3c.

nczempin avatar nczempin commented on June 26, 2024

A3C sets the last terminal discounted reward to 0

Really? What is the motivation behind this?
I can see how it would make a difference if you had just one final reward at the end and that would be set to 0 (although I'm not sure I understand this correctly), but in Boxing you simply terminate early when one of the players reaches 100, but it is not like one big punch that gets you from 0 to 100.

we clip all rewards into -1..1, so we could only set it to 1 for the current algo

Wouldn't normalizing be better than clipping? And for both options, wouldn't knowing the max score be helpful? I can't even imagine how you'd clip or normalize without knowing the max (other than the max-seen-so-far), to be honest.

Once (if this gets implemented) (some of) the Atari games get maximum values in ALE, e. g. Boxing has 100, Pong 21, perhaps there could/should be some way to take advantage of this in algorithms; but wouldn't this count as domain knowledge?

if we have lives and we lose a life, we should set the reward
signal to some negative value --> but we don't handle such cases in A3C

I also thought about including other reward signals in ALE, but in the end the number of lives is just part of the state, and with behaviour that minimizes losing lives presumably you'd maximize the score. Or maximizing the rewards will turn out to involve avoiding losing lives.

from ga3c.

nczempin avatar nczempin commented on June 26, 2024

Universe Starter Agent:
python3 train.py --num-workers 12 --env-id SeaquestDeterministic-v3 --log-dir /tmp/seaquestd3_12
Wall clock time: just over 10 hours.
[image]

from ga3c.

nczempin avatar nczempin commented on June 26, 2024

after a promising start, GAE gets stuck near 212 points:

[Time: 19148] [Episode: 36972 Score: 200.0000] [RScore: 212.2800 RPPS: 1936] [PPS: 1996 TPS: 38] [NT: 9 NP: 2 NA: 38]

from ga3c.

nczempin avatar nczempin commented on June 26, 2024

Montezuma's Revenge, basic GA3C (one can hope):

[Time: 36385] [Episode: 77507 Score: 0.0000] [RScore: 0.3000 RPPS: 1219] [PPS: 1473 TPS: 28] [NT: 35 NP: 4 NA: 35]

from ga3c.

nczempin avatar nczempin commented on June 26, 2024

I'm trying to move my changes to ALE into gym; it's quite tedious because they have diverged, and it's not immediately obvious in what way.

from ga3c.

nczempin avatar nczempin commented on June 26, 2024

Okay, I seem to have managed to get it to work; here's Adventure running:

[Time: 34] [Episode: 1 Score: -1.0000] [RScore: -1.0000 RPPS: 43] [PPS: 44 TPS: 26] [NT: 1 NP: 2 NA: 33]

Really wondering if it will ever get a +1.

Any tips on which implementation I should pick to make this more likely would be appreciated.

Would it be of any help to step in and control manually (sorta off-policy learning)?

Change to openai/atari-py
Change to openai/gym

from ga3c.

nczempin avatar nczempin commented on June 26, 2024

Huh, the agent reached a score of 0. That's only possible by timing out of the episode. I hope it doesn't learn to sit idly at home forever...

[Time: 2369] [Episode: 182 Score: 0.0000] [RScore: -0.9945 RPPS: 921] [PPS: 922 TPS: 30] [NT: 2 NP: 4 NA: 31]

Which parameters do I need to set so it will eventually explore to bringing the chalice back to the golden castle?

I'm guessing there's no hope yet; it may require custom rewards to encourage exploration, opening castle doors, etc.

I'm currently looking into providing custom rewarders for openai/universe. The docs are sparse...

from ga3c.

nczempin avatar nczempin commented on June 26, 2024

Unsurprisingly, the GA3C agent got nowhere on Adventure. Perhaps "distance of chalice to yellow castle" and "have discovered chalice" should somehow be added as rewards, but the values would be somewhat arbitrary.

[Time: 50487] [Episode: 2815 Score: 0.0000] [RScore: -0.9160 RPPS: 1425] [PPS: 1456 TPS: 28] [NT: 6 NP: 2 NA: 22]

from ga3c.

nczempin avatar nczempin commented on June 26, 2024

In any case, I think I've answered my original question about GA3C vs. universe-starter-agent and will close this huge thread now.

from ga3c.

etienne87 avatar etienne87 commented on June 26, 2024

So ... what are the conclusions?

from ga3c.

nczempin avatar nczempin commented on June 26, 2024

GA3C indeed makes better use of available resources; the GAE can help a lot but needs some parameter tweaking that I'm not ready for.

But mainly the conclusion is that I need to spend more time learning about how all of these things work before I can actually give a conclusion.

So I'll continue to try and help in areas where I can bring in my skills (e. g. adding more games to ALE, perhaps add an environment to Gym, other engineering-focused tasks that will satisfy my curiosity and perhaps help the researchers or others), and go right back to Supervised Learning, with maybe a little bit of generic (not general) AI thrown in, plus work my way through Tensorflow tutorials (possibly look at the other libraries), maybe implement some of the classic algorithms from scratch myself, etc.

from ga3c.

4SkyNet avatar 4SkyNet commented on June 26, 2024

A3C sets the last terminal discounted reward to 0

Really? What is the motivation behind this?

Indeed - I think the main motivation is that the terminal state isn't good (for most Atari games, especially since we use the same set of parameters for all games) and we also can't do any estimation with the value "function"

we clip all rewards into -1..1, so we could only set it to 1 for the current algo

Wouldn't normalizing be better than clipping?

ALE, e. g. Boxing has 100, Pong 21, perhaps there could/should be some way to take advantage of this in algorithms;

Perhaps, but we have to do some investigation of pros and cons wrt rewards changes.
There are some good thoughts from Hado van Hasselt.

I also thought about including other reward signals in ALE

It's good to do some reasonable investigation in this way.

the number of lives is just part of the state

Hm, I don't think so, cuz we just get the raw image as the state.
But gym has some access to the lives; I'm not sure about ALE, but I think it does too.

maximizing the rewards will turn out to involve avoiding losing lives

Mostly yup, especially if losing them slows down the reward gain. But in some games AIs will also try to lose a life if they can get some advantage from it.

I wonder if intrinsic motivation would help it, like it did Montezuma's (a little bit; Adventure is not quite as dangerous as Montezuma's, but the rewards are even sparser).

You are right. Intrinsic motivation helps enormously in such game types > you can add it to some A3C implementation (as an exploration bonus)

Any tips on which implementation I should pick to make this more likely would be appreciated.

So I'll continue to try and help in areas where I can bring in my skills (e. g. adding more games to ALE, perhaps add an environment to Gym, other engineering-focused tasks that will satisfy my curiosity and perhaps help the researchers or others)

It's great to hear. I also recommend looking at the Retro-Learning-Environment (RLE) - it covers not only Atari but also SNES and perhaps even Sega.

I'm guessing there's no hope yet; it may require custom rewards to encourage exploration, opening castle doors, etc.

Yup, or do something like this

GA3C indeed makes better use of available resources;

Yeah - the main reason is that it organizes your data workflow in a more efficient way.
It's vanilla A3C, but we can add some reasonable improvements to it.

from ga3c.

nczempin avatar nczempin commented on June 26, 2024

the number of lives is just part of the state

Hm, I don't think so, cuz we just get the raw image as the state.
But gym has some access to the lives; I'm not sure about ALE, but I think it does too.

Okay, I have to be careful here; what I meant was, the number of lives is (in many cases; ALE does indeed allow you to query this, but it's not strictly necessary) part of the internal state of the game, not necessarily of the state that's observed by the agent (which is just pixels).

from ga3c.

nczempin avatar nczempin commented on June 26, 2024

A3C sets the last terminal discounted reward to 0

Really? What is the motivation behind this?

Indeed - I think the main motivation is that the terminal state isn't good (for most Atari games, especially since we use the same set of parameters for all games) and we also can't do any estimation with the value "function"

But then how does GA3C even see the -1 on Adventure? Or are you saying the original A3C does it, while GA3C doesn't?

Adventure provides a good reason not simply to use 0 upon "fail", because there is a difference between "failing because eaten" and "failing because we timed out".

Hm, or maybe there isn't.

from ga3c.

4SkyNet avatar 4SkyNet commented on June 26, 2024

But then how does GA3C even see the -1 on Adventure? Or are you saying the original A3C does it, while GA3C doesn't?

No, for both of them all rewards should be represented (after clipping) as -1, 1 or 0, if we're talking about vanilla A3C.
But we don't use these raw rewards --> we use their discounted sum wrt gamma.
It looks as follows:

import numpy as np

def discounted_reward(real_rewards, gamma):
    discounted_r = np.zeros_like(real_rewards, dtype=np.float32)
    running_add = 0
    for t in reversed(range(0, discounted_r.size)):
        running_add = running_add * gamma + real_rewards[t]
        discounted_r[t] = running_add
    return discounted_r

rew_test = [1, -1, 0, 1, 0]
print(discounted_reward(rew_test, .95))

# -> approximately [ 0.907375 -0.0975  0.95  1.  0. ]

And if a terminal state is reached we just set running_add to 0 for A3C;
if not --> we estimate it with the value "function".
I may have put some things not quite accurately; it's not that the terminal reward itself is strictly zero.
We just skip the estimation of the future, cuz the episode ends, but we keep the last received reward.
Anyway, this value could be higher for some games, but we lose some kind of
generality if we do that.
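To make the terminal handling explicit, a rough sketch (not the exact code of either implementation): the backward pass starts from 0 on a terminal state and from the critic's estimate V(s_last) otherwise:

import numpy as np

def n_step_returns(rewards, gamma, terminal, v_last):
    # start the backup from 0 on a terminal state,
    # otherwise bootstrap from the value estimate of the last state
    running_add = 0.0 if terminal else v_last
    returns = np.zeros(len(rewards), dtype=np.float32)
    for t in reversed(range(len(rewards))):
        running_add = running_add * gamma + rewards[t]
        returns[t] = running_add
    return returns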

from ga3c.

4SkyNet avatar 4SkyNet commented on June 26, 2024

Are you saying shooting a regular space invader gives as much reward as shooting a mothership?
The discounting is orthogonal to that question.

Yes, but I don't really know the Space Invaders scores > all of them are clipped into -1..1 for generalization, and that's not a good way if we want to get more out of specific environments (perhaps; we would have to do some investigation, cuz "blind" changes can also have a negative effect).

Reward discounting is just a technique concerning the horizon over which we view received rewards. It could be more optimistic, for example, if we hit "motherships". And it also affects behavior, for example:
gamma=0.95 is preferable to gamma=0.99 for Pac-Man, since in the latter case the agent is more afraid to do certain things and sits in the corner.
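A quick way to see the difference is the effective horizon of the discount, roughly 1/(1 - gamma):

for gamma in (0.95, 0.99):
    print(gamma, "-> effective horizon of about", int(round(1.0 / (1.0 - gamma))), "steps")
# 0.95 looks about 20 steps ahead, 0.99 about 100, which matches the intuition
# that the gamma=0.99 agent worries more about distant consequences.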

from ga3c.

4SkyNet avatar 4SkyNet commented on June 26, 2024

@nczempin you are definitely right.
And it's not so simple with gammas, but you can try.
PS> for chess we always get +1, -1 or 0 at the end, and hand-crafting intermediate rewards often hurts more than it helps, cuz it is really hard for a human to estimate such things

from ga3c.
