rlcode / reinforcement-learning
Minimal and Clean Reinforcement Learning Examples
License: MIT License
I am trying to implement A3C for Gridworld by appropriately modifying the run() method of the A3C CartPole example. However, I am getting the error below:
Exception ignored in: <bound method PhotoImage.__del__ of <PIL.ImageTk.PhotoImage object at 0x7f7b807077b8>>
Traceback (most recent call last):
  File "/home/akb/.local/lib/python3.4/site-packages/PIL/ImageTk.py", line 130, in __del__
    name = self.__photo.name
AttributeError: 'PhotoImage' object has no attribute '_PhotoImage__photo'
Exception in thread Thread-3:
Traceback (most recent call last):
  File "/usr/lib/python3.4/threading.py", line 920, in _bootstrap_inner
    self.run()
  File "deep_a3c.py", line 159, in run
    env = Env()
  File "/home/akb/reinforcement-learning/1-grid-world/6-deep-sarsa/environment_a3c.py", line 21, in __init__
    self.shapes = self.load_images()
  File "/home/akb/reinforcement-learning/1-grid-world/6-deep-sarsa/environment_a3c.py", line 58, in load_images
    Image.open("../img/rectangle.png").resize((30, 30)))
  File "/home/akb/.local/lib/python3.4/site-packages/PIL/ImageTk.py", line 124, in __init__
    self.__photo = tkinter.PhotoImage(**kw)
  File "/usr/lib/python3.4/tkinter/__init__.py", line 3419, in __init__
    Image.__init__(self, 'photo', name, cnf, master, **kw)
  File "/usr/lib/python3.4/tkinter/__init__.py", line 3375, in __init__
    self.tk.call(('image', 'create', imgtype, name,) + options)
RuntimeError: main thread is not in main loop
Any help with this error would be appreciated. If possible, could you also provide an implementation of A3C on Gridworld?
Thanks,
Akilesh
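For what it's worth, this RuntimeError usually means a tkinter object is being created or used outside the main thread, and tkinter GUI objects only work from the main thread. Below is a minimal, hypothetical sketch of keeping GUI construction out of the A3C worker threads; GridWorldEnv and its render_enabled flag are my own names, not the repo's.

import threading

class GridWorldEnv:
    def __init__(self, render_enabled=False):
        # Build tkinter/PIL objects only when rendering is enabled,
        # and only ever from the main thread.
        self.render_enabled = render_enabled
        if render_enabled:
            import tkinter
            self.root = tkinter.Tk()

    def step(self, action):
        # State transition with no GUI calls, so it is safe in any thread.
        next_state, reward, done = [0.0] * 15, 0.0, False
        return next_state, reward, done

def worker():
    env = GridWorldEnv(render_enabled=False)  # no tkinter in worker threads
    state, reward, done = env.step(0)

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()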
The URL is embedded in the given line:
Minimal and clean examples of reinforcement learning algorithms presented by RLCode team.
Are there any variants of A2C that use mini-batch updates instead of training at every time step? If so, could you explain the pros and cons of such an approach?
Thanks,
Akilesh
You are doing Q-Learning:
# get action for the current state and go one step in environment
action = agent.get_action(state)
next_state, reward, done, info = env.step(action)
But isn't that SARSA?
a = np.argmax(target_next[i])
target[i][action[i]] = reward[i] + self.discount_factor * (target_val[i][a])
Is that a mistake or is that a valid approach? I'm new to RL...
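For reference, here is a small sketch of my own (variable names such as q_next_online are illustrative, not the repo's) contrasting the one-step targets used by Q-learning, SARSA, and Double DQN. The snippet above looks closest to the Double DQN form, where the greedy action is selected with one network and evaluated with the other.

import numpy as np

gamma = 0.99
reward, done = 1.0, False
q_next_online = np.array([0.2, 0.7, 0.1])  # online net's Q(s', .)
q_next_target = np.array([0.3, 0.6, 0.4])  # target net's Q(s', .)
next_action = 2                            # action actually taken in s' (SARSA only)

# Q-learning: bootstrap from the greedy action under the target network
q_learning_target = reward + gamma * np.max(q_next_target) * (not done)
# SARSA: bootstrap from the action the behaviour policy actually took in s'
sarsa_target = reward + gamma * q_next_target[next_action] * (not done)
# Double DQN: select the argmax with the online net, evaluate it with the target net
a = np.argmax(q_next_online)
double_dqn_target = reward + gamma * q_next_target[a] * (not done)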
I don't understand the effect of the moving obstacles in the grid world (Deep SARSA and REINFORCE), since in environment.py the negative rewards are hard-coded for obstacles at coordinates [0, 1], [1, 2], [2, 3].
Thanks,
Akilesh
Hi,
It's a really great repo for learning RL. However, if you could provide some links/blogs explaining each algorithm, that would be even more beneficial for users.
Thanks,
Akilesh
Any idea how to go about implementing diagonal movement in the grid world?
How many days/episodes did it take until breakout_a3c converged? Did you try using an LSTM for faster convergence?
I tried to run Pong Policy Gradient for 2000 episodes on the original file with no results whatsoever. Then I boosted the reward for positive points (points scored by the learner, right side) to 20 and got this result:
I boosted the learner's point reward to 100 and after around 1500 episodes got a slight improvement, similar to that in the picture. I ran it to 8100 episodes and there was no improvement except for slightly smaller variance. Forgive my being naive, but having successfully run three versions of CartPole I was expecting some logical results.
As you can see from the picture, the variance is large and after an improvement around episodes 800-900 the results seem stagnant.
Has anybody run it for more episodes, tried to tweak the rewards, and brought the results up and the variance down?
Given the policy, should I boost the penalty for the opponent's (left side) scoring points?
Any guidance will be appreciated. Thanks.
Hi all,
Thanks for your amazing project!
I have a question. If I want to add dropout to the network for policy gradient, how can I do that?
I think that in order to do that, I would need to completely change the code. Right now the workflow is as follows:
Have a state -> do a forward computation -> get the output -> compute the gradient -> create a new <input, output> pair to train the network -> train the network on that <input, output> pair for one epoch -> repeat.
However, to add dropout we would need to change the workflow as follows:
Have a state -> do a forward computation -> get the output -> compute the gradient -> backpropagate the gradient -> modify the network parameters -> repeat.
This seems really complicated for an automatic differentiation system like Keras, I think. Any ideas?
Thanks a lot for your help!
Best,
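Regarding the dropout question above: as far as I understand, a Keras Dropout layer is active during fit/train_on_batch and disabled during predict, so the REINFORCE-style workflow should not need to change. A minimal sketch of my own (layer sizes and hyperparameters are illustrative, not the repo's):

from keras.layers import Dense, Dropout
from keras.models import Sequential
from keras.optimizers import Adam

state_size, action_size = 4, 2

# Policy network with dropout between the hidden layers
actor = Sequential([
    Dense(24, input_dim=state_size, activation='relu'),
    Dropout(0.2),
    Dense(24, activation='relu'),
    Dropout(0.2),
    Dense(action_size, activation='softmax'),
])
actor.compile(loss='categorical_crossentropy', optimizer=Adam(lr=0.001))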
Dqn_per applies no importance-sampling weights during training; shouldn't that be a problem?
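For reference, one common way to apply prioritized-replay importance-sampling weights in Keras is to pass them as per-sample weights to fit. The sketch below is my own illustration, not the repo's implementation; the beta annealing schedule and the replay buffer itself are omitted.

import numpy as np

def importance_sampling_weights(priorities, beta, eps=1e-6):
    # w_i = (N * P(i))^(-beta), normalized so the largest weight is 1
    probs = (priorities + eps) / np.sum(priorities + eps)
    weights = (len(priorities) * probs) ** (-beta)
    return weights / weights.max()

# usage inside a training step (model, states, targets come from the agent):
# weights = importance_sampling_weights(sampled_priorities, beta)
# model.fit(states, targets, sample_weight=weights,
#           batch_size=len(states), epochs=1, verbose=0)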
Could you please provide examples of how to use the saved models (.h5 files) at test time for the Grid world and CartPole environments?
Thanks,
Akilesh
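As an illustration, loading a saved CartPole model for evaluation could look roughly like the sketch below. It assumes the network is rebuilt with the same architecture that was used for training and that the weight file sits at the repo's usual save path; both of these are assumptions on my part.

import gym
import numpy as np
from keras.layers import Dense
from keras.models import Sequential

env = gym.make('CartPole-v1')
state_size = env.observation_space.shape[0]
action_size = env.action_space.n

# Rebuild the same architecture that was trained, then load the weights
model = Sequential([
    Dense(24, input_dim=state_size, activation='relu'),
    Dense(24, activation='relu'),
    Dense(action_size, activation='linear'),
])
model.load_weights('./save_model/cartpole_dqn.h5')  # path is an assumption

state = env.reset()
done, score = False, 0
while not done:
    q = model.predict(np.reshape(state, [1, state_size]))[0]
    state, reward, done, _ = env.step(np.argmax(q))  # greedy, no exploration
    score += reward
print('score:', score)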
PC1
CPU: Intel i5
no graphics card
Python 3.5
TensorFlow 1.14
Keras 2.3.0
PC2
CPU: Intel i7
RTX 2070
Python 3.5
TensorFlow 1.14
Keras 2.3.0
When I execute breakout_a3c.py,
the following problem occurs on both computers.
I guess the issue is related to the threading library...
Model: "model_18"
Layer (type)                 Output Shape              Param #
input_9 (InputLayer)         (None, 84, 84, 4)         0
conv2d_17 (Conv2D)           (None, 20, 20, 16)        4112
conv2d_18 (Conv2D)           (None, 9, 9, 32)          8224
flatten_9 (Flatten)          (None, 2592)              0
dense_25 (Dense)             (None, 256)               663808
dense_27 (Dense)             (None, 1)                 257
Total params: 676,401
Trainable params: 676,401
Non-trainable params: 0
Exception in thread Thread-2:
Traceback (most recent call last):
  File "/usr/lib/python3.5/threading.py", line 914, in _bootstrap_inner
    self.run()
  File "/home/name/AdvRL/reinforcement-learning/3-atari/1-breakout/breakout_a3c.py", line 207, in run
    action, policy = self.get_action(history)
  File "/home/name/AdvRL/reinforcement-learning/3-atari/1-breakout/breakout_a3c.py", line 327, in get_action
    policy = self.local_actor.predict(history)[0]
  File "/home/name/.local/lib/python3.5/site-packages/keras/engine/training.py", line 1462, in predict
    callbacks=callbacks)
  File "/home/name/.local/lib/python3.5/site-packages/keras/engine/training_arrays.py", line 276, in predict_loop
    callbacks.model.stop_training = False
  File "/home/name/.local/lib/python3.5/site-packages/keras/engine/network.py", line 323, in __setattr__
    super(Network, self).__setattr__(name, value)
  File "/home/name/.local/lib/python3.5/site-packages/keras/engine/base_layer.py", line 1215, in __setattr__
    if not _DISABLE_TRACKING.value:
AttributeError: '_thread._local' object has no attribute 'value'
(The identical traceback is raised in Thread-3 through Thread-9.)
I want to run another Atari game, but its performance doesn't look good. Could anyone help me? Can I achieve this just by changing gym.make('ENV_NAME') and its real_action? Any help is much appreciated.
I changed the code as described above, but its performance is not good and I get errors like this:
Traceback (most recent call last):
  File "ddqn_spaceinvaders.py", line 372, in <module>
    agent.train_replay(step)
  File "ddqn_spaceinvaders.py", line 235, in train_replay
    history[i] = np.float32(mini_batch[i][0] / 255.)
TypeError: unsupported operand type(s) for /: 'tuple' and 'float'
Excuse me,
I cannot read Korean.
Could you translate the comments into English?
I was surprised to see this loss function because it is generally used when the target is a distribution (i.e., it sums to 1). That is not the case for the advantage estimate. However, I worked out the math and it does appear to be doing the right thing, which is neat!
I think this trick should be mentioned in the code.
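A small numeric check of the trick (my own sketch, not from the repo): with the target set to advantage * one_hot(action), categorical cross-entropy H(p, q) = -sum(p_i * log(q_i)) reduces to -advantage * log(pi(a|s)), which is exactly the per-sample policy-gradient objective.

import numpy as np

policy = np.array([0.2, 0.5, 0.3])  # pi(.|s) from the actor network
action, advantage = 1, 2.0

p = advantage * np.eye(len(policy))[action]  # advantage * one-hot target
cross_entropy = -np.sum(p * np.log(policy))
assert np.isclose(cross_entropy, -advantage * np.log(policy[action]))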
First, thanks for making this. It's very easy to get started with and has really helped me move things forward on a personal project of mine I've been struggling with for months. This is really awesome work. Thanks again.
In my efforts to tweak the code from your A3C cartpole implementation to work with my own custom OpenAI environment, I've discovered a few things that I think can help make it generalize a bit more.
def get_action(self, state, actionfilter):
    policy = self.actor.predict(np.reshape(state, [1, self.state_size]))[0]
    policy = np.multiply(policy, actionfilter)
    probs = policy / np.sum(policy)
    action = np.random.choice(self.action_size, 1, p=probs)[0]
    return action
I found this GitHub repo while studying RL at https://www.gitbook.com/book/dnddnjs/rl/details.
There were hardly any RL resources, which made things difficult, but thanks to the translated material you provided on gitbook I was able to study quickly. Thank you.
can not open
Hi,
First of all I just want to say awesome work on the library overall, really love the concept 👍
I have an issue where cartpole_a3c converges relatively quickly (around episode 300-400), then keeps doing well for a while, and then suddenly collapses and does not recover. Has anyone else experienced this?
Is there any code to play Breakout by loading the saved model?
When I try to run Breakout_DQN I get the following error:
gym.error.DeprecatedEnv: Env BreakoutDeterministic-v4 not found (valid versions include ['BreakoutDeterministic-v3', 'BreakoutDeterministic-v0'])
What version of gym are you using? (I'm using 0.8.1)
Hi,
Is it possible to give an image as input in the Gridworld environment? Can you suggest ways in which this could be done? Is there a way of converting the tkinter Canvas into a numpy array which could then be fed into a ConvNet?
Thanks,
Akilesh
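One possible route (an assumption on my part, and it requires Ghostscript for PIL's EPS support) is to export the Canvas as PostScript and rasterize it with PIL:

import io
import numpy as np
from PIL import Image

def canvas_to_array(canvas, size=(84, 84)):
    # Export the canvas as PostScript, rasterize it with PIL,
    # then convert it to a normalized grayscale numpy array.
    ps = canvas.postscript(colormode='color')
    img = Image.open(io.BytesIO(ps.encode('utf-8')))
    img = img.convert('L').resize(size)
    return np.asarray(img, dtype=np.float32) / 255.0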
Hi Shangtong, I am new to reinforcement learning. I have a scenario in which a machine learning model predicts a target properly. I want to figure out the input parameters needed to attain a particular target value. Any suggestions would be great.
The input parameters may range from 4 to 20, and since they are discrete numeric values there may be very many combinations of them.
I have trained the model for around 4 days; the episode count is now 14278, while the score is still only 40-50. What could be the problem?
Hi,
Could anyone please elaborate on the use of memory in the CartPole A3C implementation? The saved samples don't appear to be used during training.
Thanks,
Akilesh
References a function that doesn't exist.
Just want to know whether it makes sense to apply the technique in
https://github.com/rlcode/reinforcement-learning/blob/master/2-cartpole/3-reinforce/cartpole_reinforce.py#L45
to the A3C implementations, for CartPole and Breakout? Thanks.
Hi, I would like to test some hyperparameters, and using threading would make that much faster. But when I run threading on the DQN and DDQN algorithms, I get the error:
<Tensor Tensor("dense_1/kernel:0", shape=(2, 32), dtype=float32_ref) is not an element of this graph>
It seems Keras can't support threading, but your A3C works, which is strange to me.
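A workaround that is often suggested for Keras with the TF1 backend is to finalize the model's predict function in the main thread and re-enter the same default graph inside each worker thread. Whether it works depends on the exact Keras/TensorFlow versions, so treat the sketch below as an assumption rather than a guaranteed fix (and note that _make_predict_function is a private Keras helper).

import threading
import numpy as np
import tensorflow as tf
from keras.layers import Dense
from keras.models import Sequential

model = Sequential([Dense(32, input_dim=2, activation='relu'),
                    Dense(2, activation='linear')])
model._make_predict_function()   # build the predict graph eagerly, in the main thread
graph = tf.get_default_graph()

def worker():
    with graph.as_default():     # reuse the main thread's graph
        q_values = model.predict(np.zeros((1, 2)))
        print(q_values)

threads = [threading.Thread(target=worker) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()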
Hello,
I am aware of the smart trick for implementing policy gradients (see this for a reference: https://github.com/rlcode/reinforcement-learning/blob/master/2-cartpole/3-reinforce/cartpole_reinforce.py). Specifically, categorical cross-entropy is defined as H(p, q) = -sum(p_i * log(q_i)). For the action taken, a, we can set p_a = advantage * (one-hot vector indexing action a). Meanwhile, q_a is the output of the policy network, which is the probability of taking action a, i.e. policy(s, a).
However, when the number of output classes is huge (e.g. in machine translation or language modeling), I simply cannot convert the output into a one-hot vector in the first place using the to_categorical(output, num_classes=output_class) function in Keras.
Because of this, I cannot apply the trick to compute p_a.
So how can I implement the policy gradient in this case?
I hope I have made my question clear!
Many thanks for your help!
Best,
Cuong
@fredcallaway: I saw you commented on the code so I tagged you here as well. If you can give me an answer, I would really appreciate it ...
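One possible way around the one-hot issue (a sketch under my own assumptions, not the repo's approach) is to use sparse categorical cross-entropy with the integer action id as the label and the advantage as a per-sample weight; the resulting loss is again -advantage * log(pi(a|s)) without materializing huge one-hot vectors. Sizes and names below are illustrative.

import numpy as np
from keras.layers import Dense
from keras.models import Sequential
from keras.optimizers import Adam

state_size, vocab_size = 128, 50000  # e.g. a language-model-sized action space

policy_net = Sequential([
    Dense(256, input_dim=state_size, activation='relu'),
    Dense(vocab_size, activation='softmax'),
])
policy_net.compile(loss='sparse_categorical_crossentropy', optimizer=Adam(lr=1e-3))

states = np.random.randn(4, state_size).astype(np.float32)
actions = np.array([7, 42, 3, 19999])          # integer action ids, no one-hot needed
advantages = np.array([1.0, -0.5, 2.0, 0.3])   # per-sample weights

policy_net.train_on_batch(states, actions, sample_weight=advantages)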
Dear authors,
Do you have any tutorial for the code listed in this GitHub repo, or have you created your own tutorial for the code?
Thanks
If I increase both the HEIGHT and WIDTH from 5 to 10, keeping the obstacles and the final goal at the same positions, the Deep SARSA network doesn't seem to converge. What do you think the problem is? Should I increase the depth or the dimensions of the hidden layers in the actor and critic networks?
Thanks,
Akilesh
Hello, the trained agent plays CartPole-v1 with a score of 500, but when I restart it with
self.load_model = True and the correct file name, it starts learning again with low scores.
How can I load the weights and have the trained agent play without learning?
I understand that state_size is 15. Could you please elaborate on what each of the 15 values denotes or signifies?
Thanks,
Akilesh
Running python Gridworld_DQN.py gives the error:
  File "/home/wangdawei/anaconda2/envs/py3/lib/python3.6/site-packages/dask/dataframe/core.py", line 38, in <module>
    pd.computation.expressions.set_use_numexpr(False)
AttributeError: module 'pandas' has no attribute 'computation'
At first I thought it was a pandas problem, but it turned out to be a dask version problem:
updating dask to a newer version solved it.
Hi,
In the CartPole DDQN, the following Q(s,a) formula uses target_val; is it the one-step reward or the expected future reward?
target[i][action[i]] = reward[i] + self.discount_factor * (target_val[i][a])
Can I ask how long it takes to train the Breakout DQN model, and what graphics card do you have? Thanks!
Amazing work!!
I tried running the A3C algorithm for Breakout and it works great!
Where did you get the background information in order to write the code? It's a little different from what is explained in the "Asynchronous Methods for Deep Reinforcement Learning" paper.
Thanks :)
Why does the implementation (deep_sarsa_agent.py) have action_space = [0, 1, 2, 3, 4] when there are only 4 possible actions that the agent can take (as specified in the environment.py)?
Thanks,
Akilesh
First of all, great tutorials! I've been basing my own projects on this repo to better understand RL, but through the process I found that persisting the Q-learning agent turns out to be really difficult because of its final size.
I tried pickle, json, jsonpickle, cPickle, marshal, klepto, dbm, and finally h5py, and I noticed it might not be as easy as it seems, because none of these worked. My 64-bit Linux Mint system kills the process and leaves a 0-byte file where the q_table should be.
The agent actually works, rewards getting better and all, but once it's trained past a certain point it seems to become impossible to persist it back to disk. I tried creating swap space, from the intuition that it was running out of memory, to no avail.
I would be glad if anyone has a fix for this. Thanks!
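One thing that might help (a sketch using only the standard library's shelve module, under the assumption that q_table maps hashable states to lists of action values) is to persist the table record by record instead of pickling it in one piece:

import shelve

def save_q_table(q_table, path='q_table.db'):
    # Write each state's action values as its own record so the whole
    # table never has to be serialized in memory at once.
    with shelve.open(path) as db:
        for state, values in q_table.items():
            db[str(state)] = values

def load_q_table(path='q_table.db'):
    with shelve.open(path) as db:
        return {state: values for state, values in db.items()}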
I have compared the implementation with the book "RL: An Introduction". It seems the MSE loss and cross-entropy loss do not yield the same update rules as Actor-Critic. The book's rules are w = w + alpha * I * delta * grad(v_hat(s, w)) for the value function, and theta = theta + alpha * I * delta * grad(ln pi(action)) for the policy. For the value function in particular, the MSE loss seems to pick up an extra factor of v_hat.
Firstly, thanks for the great collection of code and articles. The articles were very useful in understanding DQN and implementing it.
However, my code is very bad at learning, and I am not sure what is wrong with it. I am using DDQN and passing rewards based on different criteria. The state is just a normalized version of the board itself.
My code repo is here: https://github.com/codetiger/MachineLearning-2048
Let me know if you can review it and help me understand why my code doesn't learn anything even after 1000 episodes.
Hi,
Could you provide an implementation of prioritized experience replay for either Gridworld or Cartpole environment?
Thanks,
Akilesh
Great stuff! This has been extremely helpful! My only suggestion would be, in line 78, changing mini_batch = random.sample(self.memory, batch_size) to mini_batch = random.sample(list(self.memory), batch_size); otherwise you get the following error: "TypeError: Population must be a sequence or set. For dicts, use list(d)."
I was looking at the code for Breakout and saw various saved models, but the code only saves one model, so how were the other models saved? I want to know whether they were saved after making some changes to the code.
Hi.
Really nice job. This is the most readable and "easiest" code I have found for an A3C implementation. With regular TensorFlow on the CPU the code works fine, but with tensorflow-gpu I get the error below.
Do you know why this is happening, and is it possible to get the A3C code working with GPU acceleration?
Thanks in advance!
Caused by op 'IsVariableInitialized_16/IsVariableInitialized_22/IsVariableInitialized/IsVariableInitialized_13/IsVariableInitialized_6/IsVariableInitialized_7', defined at:
File "C:\Users\trek\.vscode\extensions\ms-python.python-2018.3.1\pythonFiles\PythonTools\visualstudio_py_debugger.py", line 2068, in new_thread_wrapper
func(*posargs, **kwargs)
File "C:\Users\trek\Anaconda3\envs\tensorflow\lib\threading.py", line 882, in _bootstrap
self._bootstrap_inner()
File "C:\Users\trek\Anaconda3\envs\tensorflow\lib\threading.py", line 914, in _bootstrap_inner
self.run()
File "d:\Thesis\Code\examples\cartpole\a3c2.py", line 159, in run
action = self.get_action(state)
File "d:\Thesis\Code\examples\cartpole\a3c2.py", line 209, in get_action
policy = self.actor.predict(np.reshape(state, [1, self.state_size]))[0]
File "C:\Users\trek\Anaconda3\envs\tensorflow\lib\site-packages\keras\engine\training.py", line 1835, in predict
verbose=verbose, steps=steps)
File "C:\Users\trek\Anaconda3\envs\tensorflow\lib\site-packages\keras\engine\training.py", line 1330, in _predict_loop
batch_outs = f(ins_batch)
File "C:\Users\trek\Anaconda3\envs\tensorflow\lib\site-packages\keras\backend\tensorflow_backend.py", line 2476, in __call__
session = get_session()
File "C:\Users\trek\Anaconda3\envs\tensorflow\lib\site-packages\keras\backend\tensorflow_backend.py", line 192, in get_session
[tf.is_variable_initialized(v) for v in candidate_vars])
File "C:\Users\trek\Anaconda3\envs\tensorflow\lib\site-packages\keras\backend\tensorflow_backend.py", line 192, in <listcomp>
[tf.is_variable_initialized(v) for v in candidate_vars])
File "C:\Users\trek\Anaconda3\envs\tensorflow\lib\site-packages\tensorflow\python\ops\variables.py", line 1203, in is_variable_initialized
return state_ops.is_variable_initialized(variable)
File "C:\Users\trek\Anaconda3\envs\tensorflow\lib\site-packages\tensorflow\python\ops\state_ops.py", line 180, in is_variable_initialized
return gen_state_ops.is_variable_initialized(ref=ref, name=name)
File "C:\Users\trek\Anaconda3\envs\tensorflow\lib\site-packages\tensorflow\python\ops\gen_state_ops.py", line 175, in is_variable_initialized
result = _op_def_lib.apply_op("IsVariableInitialized", ref=ref, name=name)
File "C:\Users\trek\Anaconda3\envs\tensorflow\lib\site-packages\tensorflow\python\framework\op_def_library.py", line 768, in apply_op
op_def=op_def)
File "C:\Users\trek\Anaconda3\envs\tensorflow\lib\site-packages\tensorflow\python\framework\ops.py", line 2336, in create_op
original_op=self._default_original_op, op_def=op_def)
File "C:\Users\trek\Anaconda3\envs\tensorflow\lib\site-packages\tensorflow\python\framework\ops.py", line 1228, in __init__
self._traceback = _extract_stack()
InvalidArgumentError (see above for traceback): Cannot colocate nodes 'IsVariableInitialized_16/IsVariableInitialized_22/IsVariableInitialized/IsVariableInitialized_13/IsVariableInitialized_6/IsVariableInitialized_7' and 'Adam_1/iterations': Cannot merge devices with incompatible types: '/job:localhost/replica:0/task:0/device:GPU:0' and '/job:localhost/replica:0/task:0/device:CPU:0'
[[Node: IsVariableInitialized_16/IsVariableInitialized_22/IsVariableInitialized/IsVariableInitialized_13/IsVariableInitialized_6/IsVariableInitialized_7 = IsVariableInitialized[_class=["loc:@Adam/beta_1", "loc:@Adam/beta_2", "loc:@Adam_1/iterations", "loc:@Variable_27", "loc:@Variable_4", "loc:@dense_3/bias"], dtype=DT_FLOAT](Variable_4)]]
From reinforcement-learning/2-cartpole/1-dqn/cartpole_dqn.py, train_model:
def train_model(self):
    if len(self.memory) < self.train_start:
        return
    batch_size = min(self.batch_size, len(self.memory))
    mini_batch = random.sample(self.memory, batch_size)

    update_input = np.zeros((batch_size, self.state_size))
    update_target = np.zeros((batch_size, self.state_size))
    action, reward, done = [], [], []

    for i in range(self.batch_size):
        update_input[i] = mini_batch[i][0]
        action.append(mini_batch[i][1])
        reward.append(mini_batch[i][2])
        update_target[i] = mini_batch[i][3]
        done.append(mini_batch[i][4])

    target = self.model.predict(update_input)
    target_val = self.target_model.predict(update_target)

    for i in range(self.batch_size):
        # Q Learning: get maximum Q value at s' from target model
        if done[i]:
            target[i][action[i]] = reward[i]
        else:
            target[i][action[i]] = reward[i] + self.discount_factor * (
                np.amax(target_val[i]))

    # and do the model fit!
    self.model.fit(update_input, target, batch_size=self.batch_size,
                   epochs=1, verbose=0)
In this part of the code, why do you use self.batch_size after taking the minimum of self.batch_size and the length of the memory? Wouldn't batch_size be better?
The actor net takes the state as input and outputs a policy containing the probability of each action. In train_model(), the ground truth for training the actor net is 'advantages', which is not a probability distribution over the possible actions. So how does the categorical cross-entropy computation between the predicted output of the actor net and 'advantages' work?
Thanks,
Akilesh
# initialize target model with same weights as the model, in case we load a model
# shouldn't this be done after load_model?
self.update_target_model()

if self.load_model:
    self.model.load_weights("./save_model/cartpole_dqn.h5")

could be

if self.load_model:
    self.model.load_weights("./save_model/cartpole_dqn.h5")
self.update_target_model()

so that if we load a saved model, the target_model gets the saved weights rather than starting with the Keras-initialized weights.
I'm going to test this, but it seems like the loaded model would be using an inferior target_model for at least the first episode, and the model weights could get adjusted in the wrong way in that first episode, slightly slowing down its learning.