tensorflow-reinforce's People

Contributors

billy-inn, islamelnabarawy, junrushao, yukezhu

tensorflow-reinforce's Issues

Policy output tends to increase/decrease to 1 and freeze there (Using with TORCS)

Hi,

Thank you for your code. I also have some problems and questions that I hope you can help me with.

The first problem is that I tried to use your code with TORCS, and unfortunately the algorithm does not seem to learn. Looking at the policy output, it very quickly goes to ±1 and stays there almost forever. Here is a sample output:

Action: [-0.87523234 0.6581533 -0.11968148]
Rewards: [[ 0.03384038 0.03384038 0.03384038]
[ 0.35649604 0.35649604 0.35649604]
[ 0.32099473 0.32099473 0.32099473]
[ 0.35958865 0.35958865 0.35958865]]
Policy out: [[-0.99999988 -0.99999821 0.99996996]
[-1. -1. 1. ]
[-1. -1. 1. ]
[-1. -1. 1. ]]
next_state_scores: [[-0.63144624 0.52066976 0.46819916]
[-0.94268441 0.87833565 0.83462358]
[-0.93066931 0.85972118 0.8131395 ]
[-0.96144539 0.90986496 0.8721453 ]]
ys: [[-0.59129143 0.54930341 0.49735758]
[-0.57676154 1.22604835 1.18277335]
[-0.6003679 1.17211866 1.12600279]
[-0.59224224 1.260355 1.22301245]]
qvals: [[-0.53078967 0.67661011 0.5653314 ]
[-0.82462442 0.92710859 0.85478604]
[-0.75297546 0.87841618 0.78751284]
[-0.87785703 0.95719421 0.90265679]]
temp_diff: [[ 0.06050175 0.1273067 0.06797382]
[-0.24786288 -0.29893976 -0.32798731]
[-0.15260756 -0.29370248 -0.33848995]
[-0.28561479 -0.30316079 -0.32035565]]
critic_loss: 1.71624
action_grads: [[ 2.43733026e-04 -1.14602779e-04 -1.56897455e-04]
[ 9.41888866e-05 -3.77293654e-05 -7.07318031e-05]
[ 1.33200549e-04 -5.56089872e-05 -9.60492107e-05]
[ 6.47661946e-05 -2.49565346e-05 -5.03367264e-05]]

My next question is about optimizing the critic network: since its learning rate is different from the actor network's, how did you actually optimize it? And can you tell me why your initialization of the actor and critic network layers differs from the one described in the paper?
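
To make my question concrete, here is a rough TF 1.x-style sketch (with my own made-up names, sizes, and learning rates, not the repo's actual code) of how I imagined two optimizers with different learning rates being used for the critic and the actor:

import tensorflow as tf

# Hypothetical sketch only: names, sizes, and learning rates are mine.
ACTOR_LR, CRITIC_LR = 1e-4, 1e-3
STATE_DIM, ACTION_DIM = 29, 3  # TORCS-like dimensions, just for illustration

states = tf.placeholder(tf.float32, [None, STATE_DIM])
actions = tf.placeholder(tf.float32, [None, ACTION_DIM])
td_targets = tf.placeholder(tf.float32, [None, 1])

with tf.variable_scope("actor"):
    h = tf.layers.dense(states, 64, activation=tf.nn.relu)
    policy_out = tf.layers.dense(h, ACTION_DIM, activation=tf.nn.tanh)

with tf.variable_scope("critic"):
    h = tf.layers.dense(tf.concat([states, actions], axis=1), 64,
                        activation=tf.nn.relu)
    q_values = tf.layers.dense(h, 1)

actor_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, "actor")
critic_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, "critic")

# Critic: plain TD regression, trained with its own (larger) learning rate.
critic_loss = tf.reduce_mean(tf.square(td_targets - q_values))
critic_train_op = tf.train.AdamOptimizer(CRITIC_LR).minimize(
    critic_loss, var_list=critic_vars)

# Actor: deterministic policy gradient with its own (smaller) learning rate.
# dQ/da is chained through the actor; at run time the actions placeholder
# would be fed with the actor's current outputs for this update.
action_grads = tf.gradients(q_values, actions)[0]
actor_grads = tf.gradients(policy_out, actor_vars, grad_ys=-action_grads)
actor_train_op = tf.train.AdamOptimizer(ACTOR_LR).apply_gradients(
    list(zip(actor_grads, actor_vars)))

Is this roughly what your two training ops do, or is the critic optimized differently?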

Thanks

Several gradient steps on recorded trajectory

Hi. Please correct me if I am missing something, but I think that there is a mistake in this line.

Notice that "train_op" calls "apply_gradients", which updates the parameters of the policy network using the log probabilities that the policy outputs when the recorded batch of states is fed in. The problem is that the parameters are updated at every time step of the recorded trajectory ("train_op" is called inside a temporal for loop), and because the log probabilities are computed on the fly during the training loop, they are no longer the probabilities that were followed during the recorded trajectory. This produces a distribution mismatch that violates the on-policy nature of the REINFORCE algorithm and, I believe, may lead to incorrect gradients. I think the proper way to do the update is to accumulate the gradients of the log probabilities over the entire batch and then perform a single update, as in the sketch below.
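
Here is a rough sketch of what I mean (hypothetical names and sizes, not the repo's code): all time steps of the recorded trajectory go into one batch, so only one parameter update is made and the log probabilities still belong to the policy that actually generated the data.

import tensorflow as tf

# Hypothetical REINFORCE update: one gradient step per recorded episode.
STATE_DIM, NUM_ACTIONS = 4, 2

states_ph = tf.placeholder(tf.float32, [None, STATE_DIM])
actions_ph = tf.placeholder(tf.int32, [None])
returns_ph = tf.placeholder(tf.float32, [None])  # discounted return per step

hidden = tf.layers.dense(states_ph, 32, activation=tf.nn.relu)
logits = tf.layers.dense(hidden, NUM_ACTIONS)

# -log pi(a_t | s_t) for every recorded step, weighted by the return from t.
neg_logprob = tf.nn.sparse_softmax_cross_entropy_with_logits(
    logits=logits, labels=actions_ph)
loss = tf.reduce_sum(neg_logprob * returns_ph)
train_op = tf.train.AdamOptimizer(1e-2).minimize(loss)

# One update per episode, e.g.:
# sess.run(train_op, feed_dict={states_ph: ep_states,
#                               actions_ph: ep_actions,
#                               returns_ph: ep_returns})

Summing the per-step losses and minimizing once is equivalent to accumulating the per-step gradients and applying them in a single step.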

The same issue applies to the "all_rewards" buffer that is used to normalize the return: as the policy changes, the mean and standard deviation of the return may change as well, so the normalization that is being performed is no longer valid (at least in a strict sense).

As stated before, please correct me if I am wrong. I was simply looking for a REINFORCE implementation in TensorFlow and came across this issue.

PG Reinforce

While using policy gradients for reinforcement learning, you are using the discounted reward. But I think in the David Silver lecture he says the rewards are sampled from a distribution. Why do you use the discounted reward?
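
For reference, this is the kind of discounted return I am referring to (just a small NumPy sketch of my own, not the repo's code):

import numpy as np

def discounted_returns(rewards, gamma=0.99):
    # G_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ... for each step t
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

print(discounted_returns([1.0, 1.0, 1.0], gamma=0.9))
# [2.71 1.9  1.  ]  -- each entry is the discounted sum of future rewards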

Compute policy gradient using cross entropy loss

This is not an issue with the code per se, but I am learning RL and am wondering how the policy gradient is calculated in pg_reinforce.py. In this line:

self.cross_entropy_loss = tf.nn.sparse_softmax_cross_entropy_with_logits(
    logits=self.logprobs, labels=self.taken_actions)

a loss is defined as the cross entropy between the action distribution produced by the policy and the action actually taken. But standard texts such as Sutton define the policy gradient in the REINFORCE algorithm simply as the 'log grad' of the policy function.

What is the difference between the two definitions? It seems the cross-entropy loss is a more general definition because it involves not just the action taken but all of the remaining actions. Can you give a reference for your method?
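
To check my own understanding, I ran a tiny numerical example (made-up numbers, nothing from the repo): with the taken action as the label, the sparse softmax cross entropy is exactly -log pi(a|s), i.e. the REINFORCE log term before it is weighted by the return, so the other actions only enter through the softmax normalization.

import tensorflow as tf

logits = tf.constant([[2.0, 0.5, -1.0]])  # unnormalized scores for 3 actions
taken_action = tf.constant([0])           # suppose action 0 was taken

xent = tf.nn.sparse_softmax_cross_entropy_with_logits(
    logits=logits, labels=taken_action)
neg_log_prob = -tf.log(tf.nn.softmax(logits)[0, 0])

with tf.Session() as sess:
    print(sess.run([xent, neg_log_prob]))  # both values are ~0.2413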

This is a great project. Thank you for sharing it.

Modifying for Atari gym envs?

I want to get a PG implementation working with Atari gym envs. I've spent a while trying to modify pg_reinforce with no luck. Has anyone had any luck modifying any of the PG implementations to work with Atari (such as Pong-v0 or Breakout-v0)? The frame preprocessing I have been trying is sketched below.
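
This is roughly the preprocessing I have been feeding into the network (my own sketch of the common crop/downsample/binarize recipe for Pong, not code from this repo):

import numpy as np

def preprocess_pong(obs):
    # obs is a raw Pong-v0 frame of shape (210, 160, 3)
    img = obs[35:195]                          # crop out the score bar and borders
    img = img[::2, ::2, 0].astype(np.float32)  # downsample by 2, keep one channel
    img[img == 144] = 0                        # erase background (Pong-specific value)
    img[img == 109] = 0                        # erase another background shade
    img[img != 0] = 1                          # paddles and ball become 1
    return img.ravel()                         # flat (80*80,) feature vector

# Hypothetical usage:
# env = gym.make("Pong-v0")
# state = preprocess_pong(env.reset())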

Thanks.
