dennybritz / reinforcement-learning Goto Github PK

Implementation of Reinforcement Learning Algorithms. Python, OpenAI Gym, Tensorflow. Exercises and Solutions to accompany Sutton's Book and David Silver's course.

Home Page: http://www.wildml.com/2016/10/learning-reinforcement-learning/

License: MIT License

Jupyter Notebook 97.01% Python 2.99%

reinforcement-learning's Introduction

Overview

This repository provides code, exercises and solutions for popular Reinforcement Learning algorithms. These are meant to serve as a learning tool to complement the theoretical materials from

Each folder in corresponds to one or more chapters of the above textbook and/or course. In addition to exercises and solution, each folder also contains a list of learning goals, a brief concept summary, and links to the relevant readings.

All code is written in Python 3 and uses RL environments from OpenAI Gym. Advanced techniques use Tensorflow for neural network implementations.

Introduction to RL problems & OpenAI Gym
MDPs and Bellman Equations
Dynamic Programming: Model-Based RL, Policy Iteration and Value Iteration
Monte Carlo Model-Free Prediction & Control
Temporal Difference Model-Free Prediction & Control
Function Approximation
Deep Q Learning (WIP)
Policy Gradient Methods (WIP)
Learning and Planning (WIP)
Exploration and Exploitation (WIP)

List of Implemented Algorithms

Dynamic Programming Policy Evaluation
Dynamic Programming Policy Iteration
Dynamic Programming Value Iteration
Monte Carlo Prediction
Monte Carlo Control with Epsilon-Greedy Policies
Monte Carlo Off-Policy Control with Importance Sampling
SARSA (On Policy TD Learning)
Q-Learning (Off Policy TD Learning)
Q-Learning with Linear Function Approximation
Deep Q-Learning for Atari Games
Double Deep-Q Learning for Atari Games
Deep Q-Learning with Prioritized Experience Replay (WIP)
Policy Gradient: REINFORCE with Baseline
Policy Gradient: Actor Critic with Baseline
Policy Gradient: Actor Critic with Baseline for Continuous Action Spaces
Deterministic Policy Gradients for Continuous Action Spaces (WIP)
Deep Deterministic Policy Gradients (DDPG) (WIP)
Asynchronous Advantage Actor Critic (A3C)

Resources

Textbooks:

Reinforcement Learning: An Introduction (2nd Edition)

Classes:

Talks/Tutorials:

Other Projects:

Selected Papers:

reinforcement-learning's People

Contributors

Stargazers

Watchers

Forkers

mateuszb timwee fxfactorial squiba sunits mjk276 shaoweipng posenhuang arnabbir wavelets alfiyazi ukituki kattmingming kenhollandwhy li-ch olveirap ml-ai-nlp-ir iamsile allentran eternonq wuzhongdehua sjchoi86 innovarul wenbo2018 hejunbok choxi vibster haohaohaohaohaohaozhang shihmengli tonydeep hulalazz g-wang bneiluj vastsoun j-min chang810249 antonini zachlungu caomw knhuq ankush96 mungobungo benjamesbabala v7t-codes pepsalehi frank-xie jaehlee mattare2 empia akzaidi ramseydsilva ryadzenine pwichmann kastnerkyle cinneesol ahn19 supwest milestonesvn intellifora andrewliao11 jude2014 cynsithia shmuma bradparks avicennax shadowen binbinbian davidrapoport robi56 tanged hearthstoneboss kczxl prerakmody yuchen8807 mlzxy jeff-lewis swathimystery tkinzer winggyn saurav111 xcbat collector-m longlong-jing rcuevass yiiwood jos666 frankszn hipeace86 amberlan1001 aojiang1991 chrsitinass leezqcst codeaudit hdasappinc andreas-koukorinis laventura angelo337 cadrev u4lr451 yaokaichun

reinforcement-learning's Issues

Solution for dp policy evaluation possibly wrong

I'm just starting to learn about reinforcement-learning and I found that this is a great resource, but I notice the answer on dp policy evaluation could possibly be misleading.

the answer update each state value by overwriting the current state value on this line
V[s] = v
even though this approach still helps the value array converge to its final value, it doesn't seem to be consistent with what's in the book.
I believe what we could do is keep another value array for the new value and assign it to the original value after you got all the new state value.
I'm really sorry if this is wrong. I just thought that I could do some help.

Effects of experience replay haven't been documented.

Thanks a lot for this Denny, have learnt a lot from your blogs. The literature on FA (without using Deep Learning methods) is extremely scarce.

I've been trying to solve a linear optimization/control problem using RL for sometime now and have had troubles getting my SGD approximators to converge with state + action as my features. Is there any particular reason why having a separate FA for each action does better? Damien Ernst et al also use a similar approach in this paper. Additionally in your experience, what non-incremental learning methods like ExtraTreesRegressor etc work well with such problems?

Also although stated in the Read.me, the notebooks haven't used experience replay with SGD. Maybe I can take this up.

SARSA and Value Function Approximation

Dear

Regarding:

reinforcement-learning/FA/Q-Learning with Value Function Approximation Solution.ipynb

and regarding the commented code:

SARSA TD Target for on policy-training
next_action_probs = policy(next_state)
next_action = np.random.choice(np.arange(len(next_action_probs)), p=next_action_probs)
td_target = reward + discount_factor * q_values_next[next_action]

I think you want to say that: it is ok to use SARSA instead of Q-Learning in this specific example.

However, In case of non linear approximator (neural network as in DQN), we usually use reply memory to break the correlation. Using reply memory will limit us with off-policy algorithms (such as Q-learning), and we can not/should not use on-policy algorithms (such as SARSA) while using reply-memory. (I believe the reason is that: When using reply memory, the sampled transitions used for policy update taken from the reply-memory, are actually generated using old-version of the policy not the current version)

If this is correct, then the commented code may work, only because in this specific code, you are using a simple linear approximator and not using reply memory, correct?

Thank you
Final note: I cannot wait for the A3C implementation :)

Gradient clipping in A3C

Hi there,

I noticed that even though policy net and value net share some parameters (in a3c/estimators.py), their gradient were clipped separately (in a3c/worker.py).

I was wondering whether that could be a problem (clip before add v.s. clip after add)

Suppose we clip gradient by norm at a threshold of 5.

local_grads, _ = tf.clip_by_norm(local_grads, 5.0)

(to make it simple, I choose clip_by_norm instead of clip_by_global_norm)

If for some shared parameter, gradient from policy net is +10 and gradient from value net is -7, the net gradient should be +10 -7 = +3 (no clipping needed). But if we clip before summing them up, it becomes +5 -5 = 0.

Thanks

The kernel appears to have died. It will restart automatically.

I am using python2.7, tensorflow 1.0 on ubuntu 14.04 LTS.

in the part of " # For Testing ... ",

the kernel dies.... showing "The kernel appears to have died. It will restart automatically."
because of ' print(e.predict(sess, observations))' and
' print(e.update(sess, observations, a, y))'

What shoud I do???

SARSA vs TD in FA question

Denny, in the Function Approx you have created the TD targets for both SARSA (commented out) and Q-Learning, but it may escape many people that you can't just uncomment the lines and comment the Q-Learning to make it SARSA but you also need to change the last assignment so SARSA sticks to the next_action it chose in the commented out code in the next step iteration, while Q-Learning discards it (ie. no change from what it does right now)

Does this make sense to you?

Hacky (ugly) version of what I mean:

    next_action = None 
    # One step in the environment
    for t in itertools.count():            
        # Take a step
        if (next_action == None):
            # print("I am doing Q-Learning")
            action_probs = policy(state)
            action = np.random.choice(np.arange(len(action_probs)), p=action_probs)
        else:
            # print("I am doing SARSA")
            action = next_action

a3c broken by gym TimeLimitWrapper changes

changes to TimeLimitWrapper in gym break the current A3C atari wrapper.

A3C results peak at only ~22 reward

Dear Denny

I would like to share the results of A3C code. (after more than 5 Million steps)

What do you think went wrong?

Regards

GPU misuse

Hello Denny,
thanks a lot for all the code you have made available. I have been trying to implement a DQN myself using tensorflow, but I have noticed (also when testing your code) that during training only a small fraction of the GPU is used (around 10-15%). This is of course not very efficient and I do not see how this is going to be better than using a CPU... Do you have any thoughts on this?

I am training using a Titan X and with the current progress I estimate it will take me a little more than a week to train the model as described in the original paper.

Please let me know what do you think about all this.

image_resize()

In DQN.py (and the notebook)an error is raised and the fix is to provide the new size as an array with2 scalars). Now:

self.output = tf.image.resize_images(
                self.output, 84, 84, method=tf.image.ResizeMethod.NEAREST_NEIGHBOR)

... but it wants:

self.output = tf.image.resize_images(
                self.output, [84, 84], method=tf.image.ResizeMethod.NEAREST_NEIGHBOR)

DQN solution results peak at ~35 reward

Hi Denny,

Thanks for this wonderful resource. It's been hugely helpful. Can you say what your results are when training the DQN solution? I've been unable to reproduce the results of the DeepMind paper. I'm using your DQN solution below, although I did try mine first :)

While training, the TensorBoard "episode_reward" graph peaks at an average reward of ~35, and then tapers off. When I run the final checkpoint with a fully greedy policy (no epsilon), I get similar rewards.

The DeepMind paper cites rewards around 400.

I have tried 1) implementing the error clipping from the paper, 2) changing the RMSProp arguments to (I think) reflect those from the paper, 3) changing the "replay_memory_size" and "epsilon_decay_steps" to the paper's 1,000,000, 4) enabling all six agent actions from the Open AI Gym. Still average reward peaks at 35.

Any pointers would be greatly appreciated.

Screenshot is of my still-running training session at episode 4878 with all above modifications to dqn.py:

MC Off-policy solution fixes

I think there are two typos in MC off-policy code with solutions, regarding the target policy.

--current code--
def policy_fn(observation):
A = np.zeros_like(Q[state], dtype=float)
best_acton = np.argmax(Q[state])
A[best_action] = 1.0
return A
return policy_fa
-- end --

Var "best_acton" should be "best_action"
argument "observation" should be named "state"

A3C: state_processor.py

I wanted to conduct some extra pre-processing for the state (games image). And see how this may speed up the training or enhance the agent performance.

For example, in Break-out, I thought that the most important part for the agent (specially at the beginning of the training) is to neglect the wall (breaks) at all, and just concentrate on the ball and try to hit it back (This is not the best assumption, because an advanced agent wants to make the ball behind the breaks to get much more rewards).

So, I decided to remove the wall as in the following image:

In state_process.py, I made this minor modification:

def process(self, state, sess=None):
state[30:95] = 0
sess = sess or tf.get_default_session()
return sess.run(self.output, { self.input_state: state })

My questions are:

Should I conduct any pre-processing in "process" function as above, or in --init--(self) and use tf.image? and why?
How to use tf.image to set state[30:95] = 0
If I want to re-shape the reward by giving a small reward to the agent if the bat hits the ball, Where to do this in the code? I think I need to detect myself if the ball hits the bat using some processing of the state, and then change the reward.

Here are the results if the normal agent and when there is no wall...

It seems that, it is better for agent in the beginning not to see the wall and just concentrate on the ball

Thank you

cliff_walking reinforce with baseline, itertools.count()

when i run this code, it step over 1000 more steps and still walking, whether need to give a max steps?

Hi! I have got a problem... Could you help me out?

I am using python 2.7

and I guess because of this version difference,

I keep getting error in the part of 'def deep_q_learning(~)'
SyntaxError: 'return' with argument inside generator

How can I solve this????

Question regarding tf

Is there a better way to ask a question that issues? If so I won't ask again here, but if it's ok I may ask from time to time.

I am reading through the DQN code, and I am not sure to understand what this part computes or outputs, other than (generally) knowing that we don't care about actions not taken, and we won't affect their classes gradients.

# Get the predictions for the chosen actions only
gather_indices = tf.range(batch_size) * tf.shape(self.predictions)[1] + self.actions_pl
self.action_predictions = tf.gather(tf.reshape(self.predictions, [-1]), gather_indices)

I am trying to understand how the backprop is done over the q(s, a) pairs (outputs of the net), and I follow the code again and again and can't find where it does step 3 and 4 like described below. In particular, each prediction will have an error in the action that is exactly what is calculated, and 0 otherwise, but it requires a vector cross all actions (ie. outputs) to let the optimizer know that the gradient only runs across one particular output.

Formulated in this way, the update algorithm becomes:

Experience a st,at,rt,st+1 transition in the environment.

Forward st+1 to evaluate the (fixed) target y=maxafθ(st+1). This quantity is interpreted to be maxaQ(st+1).

Forward fθ(st) and apply a simple L2 regression loss on the dimension at of the output, to be y. This gradient will have a very simple form where the predicted value is simply subtracted from y. The other output dimensions have zero gradient.

Backpropagate the gradient and perform a parameter update.
Karpathy:
http://cs.stanford.edu/people/karpathy/reinforcejs/puckworld.html

In the current code it appears it calculates an average loss given the actions for each replay and the resulting errors, calculates an average error across that play, and then updates the entire outputs based on that general error. Here's another example DQN doing the same (fixing error only to the output in the selected action):

http://neuro.cs.ut.ee/demystifying-deep-reinforcement-learning/

What am I reading wrong? Any ideas? Is the current code only actually back propagating the error only in the actions that got selected, and not the others (ie. gradient 0 on the other actions)

Chaning Sarsa/Q-learning to deal with multiple enviroments

Hi,
In the most of RL implementations at the start of each episode, the environment (in SARSA code for instance: state = env.reset() ) is reset to the initial states (i.e. same start point and goals states). More specifically, they learn a policy for a given environment. But how about multiple environments at the same time?

In the other words, is it possible to apply SARSA/Q-Learning in a scenario where we have multiple environments? For examples, in 5*5 grid world, we have the following two cases:
env1 : [0,0] is start state/agent start position and [3,2] is the goal state.
env2 : [2,1] is start state/agent start position and [1,4] is the goal state.
So at the each episode, env1 and env2 are the inputs to the main loop of SARSA.
Can the current version of SARSA/Q-learning be changed to learn policies for both environments at the same times? it is more like a multi-task learning with RL.

Any help would be appreciated.

Thanks,
J
PS0: It is not fair to call this an issue but it is more like an extension to the current implementations.
PS1: Thanks @dennybritz for your wonderful job of sharing the codes. It is really helpful,

Some confusions about `BlackjackEnv`

According to the game rules explained in the beginning of class BlackjackEnv of lib/envs/blackjack.py, and the observations returned as follows:

def _get_obs(self):
    return (sum_hand(self.player), self.dealer[0], usable_ace(self.player))

Why return self.dealer[0] as the total points of dealer? I think it just represent the points of first card that the dealer have got.

And maybe this is the reason for the weird 'output' of observation as shown below:

Player Score: 20 (Usable Ace:False), Dealer Score: 10
Taking action:Stick
score(self.player), score(self.dealer) 20 20
Player Score: 20 (Usable Ace:False), Dealer Score: 10
Game end. Reward:0.0

The dealer's score keep the same all the time, while actually he was hitting, and that's why the game end with draw.

MC predict's solution

There might be a problem in the solution. One way to see it is the plots for both 'no usable ace' never have any point that the players score is over 21, which should be a sharp drop. I think the problem is

        for t in range(100):
            probs = policy(state)
            action = np.random.choice(np.arange(len(probs)), p=probs)
            next_state, reward, done, _ = env.step(action)
            episode.append((state, action, reward))
            if done:
                break
            state = next_state

it should be

        for t in range(100):
            action = policy(state)
            next_state, reward, done, _ = env.step(action)
            episode.append((state, action, reward))
            if done:
                break
            state = next_state

Paper Investigation

Hi Denny!,
Big Fan!
Was wondering if you had any thoughts about reproducing this paper for one
of your tutorials.
https://arxiv.org/abs/1603.01121.
Just super interested in this specific topic.
Many thanks,
Andrew

Action repeat in Atari environments

Are there any plans to implement action repeat in Atari environments? In the original DQN paper, they repeated each action taken by the network for 4 frames instead of asking the network for an action every frame. This is supposed to improve the training time without sacrificing (much) performance, since Atari does not require frame-perfect inputs anyways.

I tried to add this. I saw faster wall-clock training time but decreased final performance (possibly issue #30)? Any ideas? I can run a few more experiments (though these take about a day each run for me) and set up a pull request if you think this is good.

Lambdas in a loop (closure)

Hi, thank you. I like you codes and I'm having fun with it.

However, I noticed that in A3C sometimes some workers failed to start. I printed out the name of the worker in the run method and found out that some of the run methods were called twice, and some of them were not called at all, which led me to this (line 125 in a3c/train.py)

for worker in workers:
  worker_fn = lambda: worker.run(sess, coord, FLAGS.t_max)
  t = threading.Thread(target=worker_fn)
  t.start()
  worker_threads.append(t)

My guess is because the loop variable worker is not local to lambda. Please refer to this and this. The reason it works for the most of the time but not always is because t.start() get executed quick enough before loop variable worker changes.

Thanks

Experience Replay and Targets Network

Dear

First of all, Thank you so much for this great work!

Regarding: reinforcement-learning/FA/Q-Learning with Value Function Approximation Solution.ipynb

The approximator is trained on correlated samples from the online environment. My question is: Will this affect the training process? slowing it down or making it un-stable?

Should we use Experience Replay (and maybe Target Network)? or there is no need, and why?

Thank you and best regards
Sobh

frames per state on each step

Update - I just reviewed estimators.py and saw it uses the 4 frames already.

Denny, I was reading the paper, and not sure if they are feeding the past 4 frames as the "current" state in A3C. I thought they were following the protocols of previous papers, which might mean they feed the past 4 frames. I've also seen some other implementations using 4 frames (current + past 3) as current state.

My question is if your A3C you created allows or include past frames be passed as part of the current state. Karpathy also tried more creative things, like passing the difference of frames (which I liked, since you don't need to feed 2x or n times the features, but it nulls anything that didn't change).

No attribute 'wrappers'

In Deep Q-Learning for Atari Games(Deep Q Learning solution.py), there is some issues about version.. I guess..

When I run it, the following came up....

Populating replay memory...

Error Traceback (most recent call last)
in ()
31 epsilon_decay_steps=500000,
32 discount_factor=0.99,
---> 33 batch_size=32):
34
35 print("\nEpisode Reward: {}".format(stats.episode_rewards[-1]))

in deep_q_learning(sess, env, q_estimator, target_estimator, state_processor, num_episodes, experiment_dir, replay_memory_size, replay_memory_init_size, update_target_estimator_every, discount_factor, epsilon_start, epsilon_end, epsilon_decay_steps, batch_size, record_video_every)
107 """
108 # Record videos
--> 109 env.monitor.start(monitor_path,
110 resume=True,
111 video_callable=lambda count: count % record_video_every == 0)

/home/wonchul/gym/gym/core.py in monitor(self)
90 @Property
91 def monitor(self):
---> 92 raise error.Error("env.monitor has been deprecated as of 12/23/2016. Remove your call to env.monitor.start(directory) and instead wrap your env with env = gym.wrappers.Monitor(env, directory) to record data.")
93
94 def step(self, action):

Error: env.monitor has been deprecated as of 12/23/2016. Remove your call to env.monitor.start(directory) and instead wrap your env with env = gym.wrappers.Monitor(env, directory) to record data.

==> So, i've changed 'env.monitor.start(directory)' to 'env = gym.wrappers.Monitor(env, directory)'.
However, in this time, the following came up...

Populating replay memory...

AttributeError Traceback (most recent call last)
in ()
31 epsilon_decay_steps=500000,
32 discount_factor=0.99,
---> 33 batch_size=32):
34
35 print("\nEpisode Reward: {}".format(stats.episode_rewards[-1]))

in deep_q_learning(sess, env, q_estimator, target_estimator, state_processor, num_episodes, experiment_dir, replay_memory_size, replay_memory_init_size, update_target_estimator_every, discount_factor, epsilon_start, epsilon_end, epsilon_decay_steps, batch_size, record_video_every)
108 # Record videos
109 #env.monitor.start
--> 110 env = gym.wrappers.Monitor(env, monitor_path,
111 resume=True,
112 video_callable=lambda count: count % record_video_every == 0)

AttributeError: 'module' object has no attribute 'wrappers'

==> so, I googled this error to solve it... but no one had answer,,,
( I found that some one solved this problem by upgrading gym,,, but I don't know how to upgrade... moreover I installed gym 5 days ago... kind of recent one...)

Could you help me out???

A3C: We add entropy to the loss to encourage exploration

I do not understand how adding entropy to loss will encourage exploration

I understand that Entropy is a measure of unpredictability, or measure of randomness.

H(X) = -Sum P(x) log(P(x))

While training, we want to reduce the loss.
By adding the Entropy (of possible actions ) to the loss, we will reduce the Entropy too (reduce unpredictability)

When all actions have almost same probability, then Entropy will be high
When one action has near 1 probability, Entropy will be low

In the beginning of training, almost all actions have same probability. After some training, some actions get higher probability (in the direction of getting more rewards), and entropy is reduced over time.

However, I am confused, how adding entropy to loss will encourage exploration?

A3C actions

Denny, I just saw the merge of all the a3c goodies. This is really so exciting. I am not going to look at the code until I review the simpler ones because it's too awesome. However -and if I am testing to early no worries just discard- I am trying it out with Breakout-v0:

./train.py --model_dir /tmp/a3c --env Breakout-v0 --t_max 10 --eval_every 30 --parallelism 2 --reset

And every video seems to only have the action "right" or "do nothing" working. It never seems to try "left" even once. I tried resetting 3 times, and have run about 5k, 12k and now at 4k and all videos up into 32 (longest run) all have "right" (and very seldom "nothing") as the only action.

I am not sure how tensorboard works, but it only ever shows worker_0/ (in addition to eval/). Not sure if it should show as many workers as cpus I assign (I tried assigning 2 and 4).

The a3c code seems to be shaping beautifully.

Actor Critic for Atari games

Dear Danny

Thank you for the great work! I have two questions:

1- Is it possible to change the “CliffWalk Actor Critic Solution.ipynb” code to implement Actor-Critic for Gym Arari games?

I believe Actor Critic is on-policy algorithm where value based and policy based methods are used together.

Using experience reply is important to de-correlate samples for non-linear approximators. On the other hand, experience reply requires off-policy learning algorithms that can update from data generated by an older policy. ("Minh 2016")

I was thinking think that, it is possible to do the following:
• Change the “Policy estimator” state to be 4 stacked observations (similar to DQN code)
• Change the “Value Estimator” state to be 4 stacked observations (similar to DQN code)
• For both, “Policy estimator” and “Value Estimator”, we use non linear Function approximator, Convolution Network similar to DQN code
• Then, use Experience Replay and use batches.

However, I am not sure about the results

2- How does A3C work? I think:
• Using multiple agents, each agent has his own environment.
• No using of experience reply.
• Then what?
• How do samples become de correlated? (Because they come from different agents?)
• But each agent has its own model and has to “update” its own model params before sending to the higher level. But the agent produces correlated transitions on its environment. (I am confused)

Thank you

n-step

Denny, is there any particular reason you are not including n-step methods? Any plans or interest in them? Or maybe about Eligibility Traces.

discount_factor is unused

In Montecarlo prediction and control, when calculating G you are not using discount_factor, shouldn't it written:

...
G = sum([x[2]*(discount_factor**i) for i,x in enumerate(episode[first_occurence_idx:]]))
...

Questions about A3C

I read through your code and have some questions about the implementation of A3C.

Where is epsilon greedy policy? I can't find it here.
Where is accumulation of the gradients?
Every update need same forward calculation which is done in prediction and this will slow down the speed of learning. I once tried to solve this problem by using partial_run but did you solve this problem?
Did this code replicate the results of the paper?

A3C: Score and Performance

Hi!,
I've made some long running tests and here are the results:

As you can see the score plateaus too quickly. In addition, the performance does not improve with the increase of threads number...

I have a ImportError.... Could you help me ?

ImportError Traceback (most recent call last)
in ()
12 sys.path.append("../")
13
---> 14 from lib import plotting
15 from collections import deque, namedtuple

ImportError: No module named 'lib'

When I follow the case for 'Deep Q-learning for atari game', there happened this error...

What should I do>??
( I'm using python3 on Ubuntu 14.04 LTS and jupyter notebook)

CliffWalk REINFORCE Comments

Dear

Thanks for your keen efforts.

1- I think the comments below are misleading (These comments could belong to Q-Learning not to REINFORCE)

def reinforce(env, estimator_policy, estimator_value, num_episodes, discount_factor=1.0):
"""
Q-Learning algorithm for fff-policy TD control using Function Approximation.
Finds the optimal greedy policy while following an epsilon-greedy policy.

Args:
    env: OpenAI environment.
    estimator: Action-Value function estimator
    num_episodes: Number of episodes to run for.
    discount_factor: Lambda time discount factor.
    epsilon: Chance the sample a random action. Float betwen 0 and 1.
    epsilon_decay: Each episode, epsilon is decayed by this factor

Returns:
    An EpisodeStats object with two numpy arrays for episode_lengths and episode_rewards.
"""

REINFORCE PolicyEstimator loss

Dear

The code is very clear :) thank you

I have Three comments/questions:

1-I have a question regarding the REINFORCE PolicyEstimator loss:

self.loss = -tf.log(self.picked_action_prob) * self.target

where self.target is the advantage = total_return - baseline_value

As you mentioned: Basically, we move our policy into a direction of more reward.

I think that, what we actually do is applying the Policy Gradient Theorem:

grad(J(theta)) = Ex[grad(log(pi(s, a))) * Q(s, a)].

where:
log(pi(s, a)) is tf.log(self.picked_action_prob)
and
Q(s, a) could be "total_return" or even better "advantage"

(Correct?)
I wonder how and why this works. could you please elaborate

2- Total Return
Because it is Monte Carlo, we wait until the episode end and make updates. regarding total_return

total_return = sum(discount_factor**i * t.reward for i, t in enumerate(episode[t:]))

In English, what you do is adding up the discounted actual rewards for each state in the episode given the future states of that state, correct?

for example: the episode has number of transitions (say 5), each transition has a reward, then the

empirical total_return of state_3 in transition_3 = reward 3 + discount_factor^2 * reward_4 + discount_factor^3 * reward_5

And this applies even if a state is visited more than once during the episode. In this case the same state can be updated more than once?please elaborate

3- In your comment for "Using TD methods"

Q-Value TD Target (for off policy training)
SARSA TD Target (for on policy training)

I think you want to say that we can use Q-Learning OR SARSA in this example (not both), correct?

I wonder if you will implement eligibility traces soon or at least give a hint how to implement in the simplest way.

Thank you very much

DQN and Dyna-Q

Hi Denny

Again, I do appreciate your work!

I was thinking of implementing DQN with Dyna-Q Algorithm where the Q(s,a) is updated not only by real experience, but also by simulated experiences generated from model M.

Dyna-Q : http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching_files/dyna.pdf
Slide 27

However, I think it will be hard to train a model M for the environment:

P(s_t1 | s_t0 , a_t0)
P(r_t1 | s_t0 , a_t0)

This may require a trainable function approximator of the model (nonlinear Neural Network for example).

My idea, instead of having a model M that generates simulated experiences, we can simply use real experience of another parallel agent!

Then, two agents will help each other by providing experience.

This is inspired by A3C that uses multiple agents on different threads to explore the state spaces and make de-correlated updates to the actor and the critic.

Do you think it is a good/new/simple idea, that may speed up the training and use “real simulations” without even having a model of the environment?

Your opinion is very important

Thank you

Actor Critic Comments

Dear

Need to correct the comments for both:

Continuous Mountain Car Actor Critic Solution.ipynb
CliffWalk Actor Critic Solution.ipynb

def actor_critic(env, estimator_policy, estimator_value, num_episodes, discount_factor=1.0):
"""
Q-Learning algorithm for fff-policy TD control using Function Approximation.
Finds the optimal greedy policy while following an epsilon-greedy policy.

Args:
    env: OpenAI environment.
    estimator: Action-Value function estimator
    num_episodes: Number of episodes to run for.
    discount_factor: Lambda time discount factor.
    epsilon: Chance the sample a random action. Float betwen 0 and 1.
    epsilon_decay: Each episode, epsilon is decayed by this factor

Returns:
    An EpisodeStats object with two numpy arrays for episode_lengths and episode_rewards.
"""

Thank you

A3C: Safe way to terminate training

What is the safe way to terminate the training (and be able to continue again)? (is Ctrl C safe? or Kill the process)

Is Continuous Mountain Car solved?

Hi @dennybritz,
I'm here to bother you again LOL.

I was working on Continuous MountainCar Actor Critic Solution.ipynb yesterday and really having fun.
However, I accidentally came across this: According to OpenAI Gym documentation

MountainCarContinuous-v0 defines "solving" as getting average reward of 90.0 over 100 consecutive trials.

This is the Episode Reward over Time plot copied from your .ipynb file. Seems like (by eyeballing) it achieved an average reward around 80 to 85 but not above 90, which is very similar to what I got on my computer.

Not sure which part I did wrong. Any clue? Thanks!

Not really an issue

I am not sure how to send a comment, but I found your code VERY readable, almost like reading english, how it's structured also very beautiful, the names of every variable make sense. It's also highly appreciated that you follow the same terminology as David and Richard in a pristine way, and that due to AIGym, the code is just very focused. But you also put the right comments.

Also, the README has great resources, very up to date and relevant. I didn't know about many of them. And the learning objectives and summary is very good as a refresh of the long reads (book) or David slides.

I hope you find enough time and energy to complete some of the algorithms that are missing! Thank you so much for sharing your work!

TypeError: resize_images()

Dear all,
I have some trouble when running the code, here is the debug information:
Traceback (most recent call last):
File "train.py", line 100, in
max_global_steps=FLAGS.max_global_steps)
File "/home/sk/sk/reinforcement-learning-master-new/PolicyGradient/a3c/worker.py", line 76, in init
self.sp = StateProcessor()
File "/home/sk/sk/reinforcement-learning-master-new/lib/atari/state_processor.py", line 15, in init
self.output, [84, 84], method=tf.image.ResizeMethod.NEAREST_NEIGHBOR)
TypeError: resize_images() takes at least 3 arguments (3 given)
It means the arguments numbers are not correct. But I can not solve it.
Could anyone give me some suggestions?

A3C: estimators.py

Line 17

Comment Three convolutional layers

I think it should be Two convolutional layers

Line 24:

fc1 = tf.contrib.layers.fully_connected(
inputs=tf.contrib.layers.flatten(conv2),
num_outputs=256,
scope="fc1")

I think we should add:
activation_fn=tf.nn.relu

A3C train.py

Thanks for the great work!

1- I faced an importing error when running train.py
The error was solved after adding --init--.py in directory lib/atari/

2- Moreover, I had to upgrade to TensorFlow 0.11 and Cuda 8.0

3- At the end of train.py "# Wait for all workers to finish": Is this synchronous or asynchronous?

4- Generally, some variables ended with op, what does "op" stand for?

5- Finally, I am running the training now, some output videos "/tmp/a3c/videos/" seems corrupted and cannot be rendered (and some videos looks fine and rendered)

Thanks again

Lambda function in a3c train.py appears broken

We ran the function with t = threading.Thread(target=worker.run, args = (sess,coord,FLAGS.t_max)) as opposed to using the lambda function.

Occasionally the threads were dying/(repeating?) with the lambda function but we haven't (yet) come across the same issue with the standard approach.

May be worth having a look!

Implementation of Value Iteration Networks paper - NIPS 2016 best paper

Do you have plans to implement Value Iteration Networks paper - NIPS 2016 best paper in tensorflow . It would be great and fantastic

Potential bug in tf.contrib.distributions.Normal

Hi Denny,

Recently I'm working on continuous control reinforcement learning task.
I fillowed the steps in Continuous MountainCar Actor Critic Solution to construct PolicyEstimator().
However the log probability of self.normal_dist.log_prob() become positive when self.mu become a small value (<0.2).
I'm wondering that if this is the bug of Tensorflow it self since they calculate the pdf by
f(x) = sqrt(1/(2*pi*sigma^2)) exp(-(x-mu)^2/(2*sigma^2)).
Did you face the same problem while implementing the policy?

Best,
James

ImportError: No module named atari

rzai@rzai00:/media/rzai/ai_data/prj/reinforcement-learning/PolicyGradient/a3c$ python train.py
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcurand.so locally
Traceback (most recent call last):
File "train.py", line 21, in
from lib.atari import helpers as atari_helpers
ImportError: No module named atari

Forums/Community/Questions?

Is there a forum for this github or Silva's class or Sutton's book?

I'm willing to ask questions here using Issues, but that doesn't seem ideal even though my current question is pretty specific to this github. Could a github repo's Wiki be used for this?

I'm getting the correct answer in the Policy Evaluation notebook, but I'm using nested for loops. Is it possible to vectorize that to eliminate one or more of the loops? With the environment's transition probabilities being tuples, I'm having a hard time seeing it.

No module named gym

After installing gym using the command pip install 'gym[all]' , while running the python file , I amgetting an error of ' no module named gym

Tensorflow Function Error: DQN Deep Q learning Solution.ipynb, new version 0.12.0 resize_images()

Hi, I am using latest tensorflow release 0.12.0 and running the Deep Q Learning Solution.ipynb
code. One error pop up:
TypeError: resize_images() got multiple values for argument 'method'

The error is in tf.image.resize_images() function of StateProcessor() definition.

class StateProcessor():
    def __init__(self):
        self.output = tf.image.resize_images(
                self.output, 84, 84, method=tf.image.ResizeMethod.NEAREST_NEIGHBOR)

Cause:
It seems like tf.image.resize_images() function changed in the latest tensorflow 0.12.0, which takes size as one parameter with shape [new_height, new_width], rather than 2 separate args.

https://www.tensorflow.org/api_docs/python/image/resizing#resize_images

size: A 1-D int32 Tensor of 2 elements: new_height, new_width. The new size for the images.

The fix should be quite easy:

self.output = tf.image.resize_images(
      self.output, [84, 84], method=tf.image.ResizeMethod.NEAREST_NEIGHBOR)

Do you think there is need to add version compatibility to support future tensorflow release? Let me know if I can help.

dennybritz / reinforcement-learning Goto Github PK

reinforcement-learning's Introduction

Overview

Table of Contents

List of Implemented Algorithms

Resources

reinforcement-learning's People

Contributors

Stargazers

Watchers

Forkers

reinforcement-learning's Issues

Recommend Projects

Recommend Topics

Recommend Org