keon / deep-q-learning

Minimal Deep Q Learning (DQN & DDQN) implementations in Keras

Home Page: https://keon.io/deep-q-learning

License: MIT License

Python 100.00%

ddqn deep-learning deep-q-network deep-reinforcement-learning dqn reinforcement-learning

deep-q-learning's Issues

What is the purpose of "done"?
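
In a Q-learning update, `done` usually marks a terminal transition: once the episode has ended there is no next state to bootstrap from, so the target is just the reward. A minimal sketch of that idea follows. The helper name `td_target` and the discount `gamma=0.95` are assumptions for illustration, not names taken from the repository.

```python
def td_target(reward, next_q_values, done, gamma=0.95):
    """Return the Q-learning target for a single transition.

    `next_q_values` are the predicted Q-values of the next state.
    `td_target` and `gamma=0.95` are illustrative assumptions.
    """
    if done:
        # Terminal transition: no future reward to bootstrap from.
        return reward
    # Non-terminal transition: reward plus discounted best next Q-value.
    return reward + gamma * max(next_q_values)


non_terminal = td_target(1.0, [0.5, 2.0], done=False)
terminal = td_target(-10.0, [0.5, 2.0], done=True)
print(non_terminal, terminal)
```

Without the flag, the update would keep adding an estimate of future reward to a state from which no further reward can be collected.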

missing the initialization of target action value and refreshing the Qhat

I have several questions:
1- Comparing this code with the algorithm presented in "Human-level control through deep reinforcement learning", I cannot find the third initialization step (the initial target action-value function). I also cannot find the last step, "every C steps reset Qhat = Q". Would you please explain where they are, or what replaces them? These steps seem essential!
2- I have my own environment. If I want to feed a state=[a,b,c] into the DQN instead of a single input value, what should I do?
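
The two steps asked about in question 1 (initialise Qhat as a copy of Q, then reset Qhat = Q every C steps) can be sketched without any deep-learning library. Everything here is hypothetical: `TargetSync` is an illustrative name, and the weights are plain lists standing in for network parameters. With Keras models, the copy would typically be done with `target_model.set_weights(model.get_weights())`.

```python
import copy


class TargetSync:
    """Keep a periodically refreshed copy of the online weights (hypothetical helper)."""

    def __init__(self, online_weights, sync_every):
        self.online_weights = online_weights
        # Initialisation step: the target network starts as a copy of Q.
        self.target_weights = copy.deepcopy(online_weights)
        self.sync_every = sync_every  # the "C" in "every C steps"
        self.steps = 0

    def step(self):
        """Call once per training step; returns True when Qhat was refreshed."""
        self.steps += 1
        if self.steps % self.sync_every == 0:
            self.target_weights = copy.deepcopy(self.online_weights)
            return True
        return False


sync = TargetSync([1.0, 2.0], sync_every=3)
sync.online_weights[0] = 5.0  # pretend a gradient step changed Q
refreshed = [sync.step() for _ in range(3)]
print(refreshed, sync.target_weights)
```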

model save and load does not work

On reloading, the model performs very poorly compared to training.

Making new predictions

This is extremely helpful code, thanks for sharing! I have a bit of a hypothetical question. Let's say that after training the agent using your code I want to be able to predict the q-values for moving to the right or left given a new combination of inputs. (i.e. do some type of model.predict(new_input), or test the code on new data). Where in the code would this go? Could you do model.predict(new_input) at the end of your main function outside of the for loop?

I ask because I wonder where the model parameters are being saved and if this affects where you call model.predict(new_input) for new data. Let me know if anything is unclear!
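
One way to picture the inference step asked about here, as a sketch only: once training has finished (or saved weights have been restored), the network is queried with a batch containing one state and the action with the highest Q-value is chosen. `predict_fn` is a stand-in for something like `model.predict`, and `fake_predict` is a made-up function so the example runs without Keras.

```python
def greedy_action(predict_fn, state):
    """Return (best_action, q_values) for a single state.

    `predict_fn` is assumed to take a batch of states and return one
    row of Q-values per state, as a Keras `model.predict` call would.
    """
    q_values = predict_fn([state])[0]  # batch of one, so take the first row
    best = max(range(len(q_values)), key=lambda a: q_values[a])
    return best, q_values


def fake_predict(batch):
    # Hypothetical stand-in for a trained network: prefers action 1
    # ("right") when the third state component is positive.
    return [[-s[2], s[2]] for s in batch]


action, q_values = greedy_action(fake_predict, [0.0, 0.0, 0.1, 0.0])
print(action, q_values)
```

Because no exploration (epsilon) is involved, the same input always yields the same action.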

Possible mistake in DQN & DDQN files

The DQN algorithm from the Nature paper uses a target network to compute the target Q value for training.

So I think the code in ddqn.py should be code for the DQN algorithm.
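
To make the distinction in this issue concrete, here is a sketch of the two targets as they are commonly defined. DQN with a target network takes the maximum of the target network's Q-values; Double DQN picks the action with the online network and evaluates it with the target network. The function names and `gamma` default are assumptions for illustration, not code from the repository.

```python
def argmax(values):
    """Index of the largest value in a list."""
    return max(range(len(values)), key=lambda i: values[i])


def dqn_target(reward, done, target_q_next, gamma=0.95):
    """DQN with a target network: max over the target network's Q-values."""
    if done:
        return reward
    return reward + gamma * max(target_q_next)


def double_dqn_target(reward, done, online_q_next, target_q_next, gamma=0.95):
    """Double DQN: online network selects the action, target network evaluates it."""
    if done:
        return reward
    best_action = argmax(online_q_next)
    return reward + gamma * target_q_next[best_action]


online_q_next = [1.0, 3.0]   # online network prefers action 1
target_q_next = [4.0, 2.0]   # target network prefers action 0
print(dqn_target(0.0, False, target_q_next, gamma=0.5))
print(double_dqn_target(0.0, False, online_q_next, target_q_next, gamma=0.5))
```

Both variants use a target network; they differ only in how the next action is selected.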

k frame

@keon Thanks for your useful code. Just one question: how can we add K-frame stacking to this, as described in the last sentences of the first paragraph of Section 4.1 of Mnih et al., Nature 2015?
4.1 Preprocessing and Model Architecture
Working directly with raw Atari frames, which are 210 × 160 pixel images with a 128 color palette,
can be computationally demanding, so we apply a basic preprocessing step aimed at reducing the
input dimensionality. The raw frames are preprocessed by first converting their RGB representation
to gray-scale and down-sampling it to a 110 × 84 image. The final input representation is obtained by
cropping an 84 × 84 region of the image that roughly captures the playing area. The final cropping
stage is only required because we use the GPU implementation of 2D convolutions from [11], which
expects square inputs. For the experiments in this paper, the function φ from algorithm 1 applies this
preprocessing to the last 4 frames of a history and stacks them to produce the input to the Q-function.
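
The "last 4 frames" stacking described in the quoted passage can be sketched with a fixed-length `collections.deque`. The class name `FrameStack` is hypothetical, and padding the stack with copies of the first frame at the start of an episode is an assumption of this sketch, not something the quoted text specifies.

```python
from collections import deque


class FrameStack:
    """Keep the k most recent (already preprocessed) frames. Hypothetical helper."""

    def __init__(self, k):
        self.k = k
        self.frames = deque(maxlen=k)  # oldest frame is dropped automatically

    def reset(self, first_frame):
        # Assumption: pad the history with copies of the first frame.
        self.frames.clear()
        for _ in range(self.k):
            self.frames.append(first_frame)
        return list(self.frames)

    def push(self, frame):
        # Append the newest frame and return the stacked history.
        self.frames.append(frame)
        return list(self.frames)


stack = FrameStack(4)
print(stack.reset("f0"))
print(stack.push("f1"))
```

The returned list would then be converted to an array and used as the network input in place of a single state.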

Why are we training the neural network for only 1 epoch

The neural network would not converge after only 1 epoch, right?

a hidden bug in your code

`for e in range(EPISODES):
    state = env.reset()
    state = np.reshape(state, [1, state_size])
    for time in range(500):
        # env.render()
        action = agent.act(state)
        next_state, reward, done, _ = env.step(action)
        reward = reward if not done else -10
        next_state = np.reshape(next_state, [1, state_size])
        agent.remember(state, action, reward, next_state, done)
        state = next_state
        if done:
            print("episode: {}/{}, score: {}, e: {:.2}"
                  .format(e, EPISODES, time, agent.epsilon))
            break
    if len(agent.memory) > batch_size:
        agent.replay(batch_size)`

Hi, I found a bug in your code.

The agent.replay(batch_size) call should be in the inner loop, meaning train_on_batch runs at each time step, not once per episode.

Your version can solve cart-pole, but not lunar-lander (also from OpenAI Gym).

The formal algorithm follows.

The image is from Human-level control through deep reinforcement learning, FYI.

Go jackets!
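
The change proposed in this issue, moving the replay call inside the per-step loop, can be sketched as follows. `run_episode` is a hypothetical helper; `env` and `agent` are assumed to expose the same methods as in the loop quoted in the issue (`reset`, the older 4-tuple `step`, `act`, `remember`, `replay`, `memory`), and state reshaping is left out for brevity.

```python
def run_episode(env, agent, batch_size, max_steps=500):
    """Run one episode, training once per time step. Returns the step count."""
    state = env.reset()
    steps = 0
    for _ in range(max_steps):
        action = agent.act(state)
        next_state, reward, done, _ = env.step(action)
        agent.remember(state, action, reward, next_state, done)
        state = next_state
        steps += 1
        # Replay inside the inner loop: one update per environment step,
        # as soon as the memory holds more than one batch.
        if len(agent.memory) > batch_size:
            agent.replay(batch_size)
        if done:
            break
    return steps
```

Compared with replaying once per episode, this performs many more updates per unit of experience, which costs more compute per episode.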

ddqn_batch

Hi. I tried to change the ddqn code to update in batches like dqn_batch, but with this change the agent no longer learns at all. I have no idea why: it is a simple change, and I even set the batch size to 1, so it should behave exactly as it does without batching.

What am I doing wrong?

I'm trying to test this out using a minor TicTacToe game but I'm failing miserably for days. DQN keeps choosing negative rewards.

Plot Image

First of all, thank you very much for your work. It was really helpful for me to understand RL. I would like to ask how you got the image display of the game. I didn't find it in the code. Thank you very much!

Predict the action for new environment - Inference

Hi,

Thanks for the excellent repository. Extremely useful.

I have trained a model and saved the weight file in .h5 format. How would I predict the action for the new environment?

Thank you,
KK

IndexError

I added 2 convolutional layers and trained this on miniworld (another gym environment), but I keep getting this:
`IndexError Traceback (most recent call last)
in
39
40 if len(agent.memory) > batch_size:
---> 41 agent.replay(batch_size)
42
43 if e % 10 == 0:

in replay(self, batch_size)
59
60
---> 61 target_f[0][action] = target
62 self.model.fit(state, target_f, epochs=1, verbose=0)

IndexError: index 17447 is out of bounds for axis 0 with size 60

`
I don't know why I got the index 17447...

memory for state

thanks Keon for your great code!
I have two questions:
1- What does [0] mean in self.model.predict(next_state)[0] and return np.argmin(act_values[0])? Does it select the first element of the batch?
2- If, in addition to batching, I need my state to include the states from the K previous time steps, what change is necessary? I want to send state = state[i-k+1], ..., state[i-1], state[i], not only one state. How can I do this?

Thanks again

Speeding the replay

First, thank you for this wonderful code.

In the replay function, there is one model.fit(state, target_f) call per sample in the minibatch (i.e. if there are 32 samples, then there are 32 fit calls).

I think all samples of the minibatch could be used in a single update with one single train_on_batch(states, targets_f), which would speed up the processing time.
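
A sketch of what the batched version could look like: build the targets for the whole minibatch with two predict calls, then hand them to a single update. `build_batch` is a hypothetical helper, `predict_fn` stands in for `model.predict` on a batch, and transitions are assumed to be `(state, action, reward, next_state, done)` tuples.

```python
def build_batch(minibatch, predict_fn, gamma=0.95):
    """Return (states, targets) for one batched update.

    `predict_fn` is assumed to map a batch of states to one row of
    Q-values per state, as `model.predict` would.
    """
    states = [s for s, _, _, _, _ in minibatch]
    next_states = [ns for _, _, _, ns, _ in minibatch]
    q_now = predict_fn(states)        # one call for all current states
    q_next = predict_fn(next_states)  # one call for all next states
    targets = []
    for row, next_row, (_, action, reward, _, done) in zip(q_now, q_next, minibatch):
        row = list(row)
        # Only the taken action's Q-value is moved toward the TD target.
        row[action] = reward if done else reward + gamma * max(next_row)
        targets.append(row)
    return states, targets
```

With Keras, `states` and `targets` would be converted to arrays and passed to one `train_on_batch(states, targets)` call, replacing the per-sample `fit` calls.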

ValueError: cannot reshape array of size 2 into shape (1,4)

I get this numpy error while running the script - dqn.py

    2022-10-06 23:47:28.547558: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1934] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform. Skipping registering GPU devices...
    2022-10-06 23:47:28.547772: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA
    To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
    /home/akshayparanjape/PhD/deep-q-learning/venv_dqn/lib/python3.8/site-packages/keras/optimizers/optimizer_v2/adam.py:114: UserWarning: The `lr` argument is deprecated, use `learning_rate` instead.
      super().__init__(name, **kwargs)
    /home/akshayparanjape/PhD/deep-q-learning/venv_dqn/lib/python3.8/site-packages/numpy/core/_asarray.py:102: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray.
      return array(a, dtype, copy=False, order=order)
    Traceback (most recent call last):
      File "ddqn.py", line 100, in <module>
        state = np.reshape(state, [1, state_size])
      File "<__array_function__ internals>", line 5, in reshape
      File "/home/akshayparanjape/PhD/deep-q-learning/venv_dqn/lib/python3.8/site-packages/numpy/core/fromnumeric.py", line 299, in reshape
        return _wrapfunc(a, 'reshape', newshape, order=order)
      File "/home/akshayparanjape/PhD/deep-q-learning/venv_dqn/lib/python3.8/site-packages/numpy/core/fromnumeric.py", line 55, in _wrapfunc
        return _wrapit(obj, method, *args, **kwds)
      File "/home/akshayparanjape/PhD/deep-q-learning/venv_dqn/lib/python3.8/site-packages/numpy/core/fromnumeric.py", line 44, in _wrapit
        result = getattr(asarray(obj), method)(*args, **kwds)
    ValueError: cannot reshape array of size 2 into shape (1,4)
Has anybody encountered the same issue?
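
A likely cause, judging from the "ragged nested sequences" warning and the "size 2" in the error: newer Gym releases (0.26 and later) return an `(observation, info)` pair from `env.reset()` and a 5-tuple from `env.step()`, while the script expects the older return values. The helpers below are a hypothetical compatibility sketch, not part of the repository; they accept either form.

```python
def reset_observation(reset_result):
    """Return only the observation from env.reset(), for old or new Gym APIs."""
    is_new_api = (
        isinstance(reset_result, tuple)
        and len(reset_result) == 2
        and isinstance(reset_result[1], dict)
    )
    # New API: (observation, info). Old API: the observation itself.
    return reset_result[0] if is_new_api else reset_result


def unpack_step(step_result):
    """Normalise env.step() output to (next_state, reward, done, info)."""
    if len(step_result) == 5:
        # New API: (obs, reward, terminated, truncated, info).
        obs, reward, terminated, truncated, info = step_result
        return obs, reward, terminated or truncated, info
    # Old API already has four values.
    return tuple(step_result)


print(reset_observation(([0.1, 0.2, 0.3, 0.4], {})))
print(unpack_step(("s", 1.0, False, True, {})))
```

In the training script this would mean wrapping the reset call as `state = reset_observation(env.reset())` before the `np.reshape` line. Pinning an older Gym version is another option.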

Question: Is this some form of reward engineering?

This would break in environments that return the state as more/less than 4 values for unpacking.

If not essential can we just remove this?
If it's essential, would someone explain why and/or reference the paper for this?
This seems specific to CartPole. I wasn't sure if the implementation's goal was to only solve CartPole.

r1 = (env.x_threshold - abs(x)) / env.x_threshold - 0.8  
r2 = (env.theta_threshold_radians - abs(theta)) / env.theta_threshold_radians - 0.5  
reward = r1 + r2
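
To see what the three lines above compute, here they are wrapped in a function and evaluated at two points. This is only a sketch: `shaped_reward` is a hypothetical name, and the threshold values (2.4 for position, 12 degrees in radians for angle) are assumed to match CartPole's defaults.

```python
import math


def shaped_reward(x, theta, x_threshold, theta_threshold):
    """Reward that is highest when the cart is centred and the pole upright."""
    r1 = (x_threshold - abs(x)) / x_threshold - 0.8
    r2 = (theta_threshold - abs(theta)) / theta_threshold - 0.5
    return r1 + r2


# Assumed CartPole limits; other environments expose different attributes.
X_THRESHOLD = 2.4
THETA_THRESHOLD = 12 * 2 * math.pi / 360

centred = shaped_reward(0.0, 0.0, X_THRESHOLD, THETA_THRESHOLD)
at_limits = shaped_reward(X_THRESHOLD, THETA_THRESHOLD, X_THRESHOLD, THETA_THRESHOLD)
print(centred, at_limits)
```

The reward falls linearly from about 0.7 at the centre to about -1.3 at both limits, so it is a form of reward shaping tied to CartPole's state layout.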

Not learning

Hi, is it just me, or is the algorithm not learning? I collected the rewards for all episodes, and they converge to 10.

add cnn to dqn

https://yanpanlau.github.io/2016/07/10/FlappyBird-Keras.html

I think you may be interested in this URL.

Would it make sense to restrict the action to what's possible?

If the cartpole is already all the way at the right, we can't really select that action. So would it make sense to disallow that from either the random case (by sampling again) or the network case (by choosing the next highest Q value that the network predicts)?
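
The idea in this question, often called action masking, can be sketched for an epsilon-greedy policy as below. `masked_act` and the `allowed` list are hypothetical; how the set of allowed actions is determined is left to the environment. Note that in CartPole itself both actions (push left, push right) remain selectable at every step, so this is mainly relevant to other environments.

```python
import random


def masked_act(q_values, allowed, epsilon, rng=random):
    """Epsilon-greedy choice restricted to the `allowed` action indices."""
    if not allowed:
        raise ValueError("at least one action must be allowed")
    if rng.random() < epsilon:
        # Random case: sample only among the allowed actions.
        return rng.choice(list(allowed))
    # Network case: highest predicted Q-value among the allowed actions.
    return max(allowed, key=lambda a: q_values[a])


print(masked_act([0.2, 0.5, 0.9], allowed=[0, 1], epsilon=0.0))
```

Sampling directly from the allowed set avoids the resampling loop mentioned in the question.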

Minor issue with globally scoped variable `env`

I found a minor issue on line 42.

Currently:

    return env.action_space.sample()

Should be:

    return self.env.action_space.sample()

p.s. It's better practice to not put a bunch of stuff in the global namespace (e.g., under if __name__ == '__main__':). It's safer to use an actual main() method.

Kind regards,
Sylvain.