tensorflow-reinforce's People

Contributors

billy-inn, islamelnabarawy, junrushao, yukezhu

tensorflow-reinforce's Issues

Policy output tends to increase/decrease to 1 and freeze there (Using with TORCS)

Hi,

Thank you for your code. I also have some problems and questions that I hope you can help me with.

The first problem is that I tried to use your code with TORCS, and unfortunately the algorithm does not seem to learn. Looking at the policy output, it very quickly goes to ±1 and stays there almost forever. Here is a sample output:

Action: [-0.87523234 0.6581533 -0.11968148]
Rewards: [[ 0.03384038 0.03384038 0.03384038]
[ 0.35649604 0.35649604 0.35649604]
[ 0.32099473 0.32099473 0.32099473]
[ 0.35958865 0.35958865 0.35958865]]
Policy out: [[-0.99999988 -0.99999821 0.99996996]
[-1. -1. 1. ]
[-1. -1. 1. ]
[-1. -1. 1. ]]
next_state_scores: [[-0.63144624 0.52066976 0.46819916]
[-0.94268441 0.87833565 0.83462358]
[-0.93066931 0.85972118 0.8131395 ]
[-0.96144539 0.90986496 0.8721453 ]]
ys: [[-0.59129143 0.54930341 0.49735758]
[-0.57676154 1.22604835 1.18277335]
[-0.6003679 1.17211866 1.12600279]
[-0.59224224 1.260355 1.22301245]]
qvals: [[-0.53078967 0.67661011 0.5653314 ]
[-0.82462442 0.92710859 0.85478604]
[-0.75297546 0.87841618 0.78751284]
[-0.87785703 0.95719421 0.90265679]]
temp_diff: [[ 0.06050175 0.1273067 0.06797382]
[-0.24786288 -0.29893976 -0.32798731]
[-0.15260756 -0.29370248 -0.33848995]
[-0.28561479 -0.30316079 -0.32035565]]
critic_loss: 1.71624
action_grads: [[ 2.43733026e-04 -1.14602779e-04 -1.56897455e-04]
[ 9.41888866e-05 -3.77293654e-05 -7.07318031e-05]
[ 1.33200549e-04 -5.56089872e-05 -9.60492107e-05]
[ 6.47661946e-05 -2.49565346e-05 -5.03367264e-05]]

My next question is about optimizing the critic network: since its learning rate is different from the actor network's, how did you actually optimize it? And can you tell me why your initialization of the actor and critic network layers differs from the one described in the paper?
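
To make my question concrete, here is a rough TF 1.x-style sketch (with my own made-up names, sizes, and learning rates, not the repo's actual code) of how I imagined two optimizers with different learning rates being used for the critic and the actor:

import tensorflow as tf

# Hypothetical sketch only: names, sizes, and learning rates are mine.
ACTOR_LR, CRITIC_LR = 1e-4, 1e-3
STATE_DIM, ACTION_DIM = 29, 3  # TORCS-like dimensions, just for illustration

states = tf.placeholder(tf.float32, [None, STATE_DIM])
actions = tf.placeholder(tf.float32, [None, ACTION_DIM])
td_targets = tf.placeholder(tf.float32, [None, 1])

with tf.variable_scope("actor"):
    h = tf.layers.dense(states, 64, activation=tf.nn.relu)
    policy_out = tf.layers.dense(h, ACTION_DIM, activation=tf.nn.tanh)

with tf.variable_scope("critic"):
    h = tf.layers.dense(tf.concat([states, actions], axis=1), 64,
                        activation=tf.nn.relu)
    q_values = tf.layers.dense(h, 1)

actor_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, "actor")
critic_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, "critic")

# Critic: plain TD regression, trained with its own (larger) learning rate.
critic_loss = tf.reduce_mean(tf.square(td_targets - q_values))
critic_train_op = tf.train.AdamOptimizer(CRITIC_LR).minimize(
    critic_loss, var_list=critic_vars)

# Actor: deterministic policy gradient with its own (smaller) learning rate.
# dQ/da is chained through the actor; at run time the actions placeholder
# would be fed with the actor's current outputs for this update.
action_grads = tf.gradients(q_values, actions)[0]
actor_grads = tf.gradients(policy_out, actor_vars, grad_ys=-action_grads)
actor_train_op = tf.train.AdamOptimizer(ACTOR_LR).apply_gradients(
    list(zip(actor_grads, actor_vars)))

Is this roughly what your two training ops do, or is the critic optimized differently?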

Thanks

Several gradient steps on recorded trajectory

Hi. Please correct me if I am missing something, but I think that there is a mistake in this line.

Notice that "train_op" calls "apply_gradients", which updates the parameters of the policy network using the log probabilities that the policy outputs when the recorded batch of states is fed in. The problem is that the parameters are updated at every time step of the recorded trajectory ("train_op" is called inside a temporal for loop), and because the log probabilities are computed on the fly during the training loop, they are no longer the probabilities that were followed during the recorded trajectory. This produces a distribution mismatch that violates the on-policy nature of the REINFORCE algorithm and, I believe, may lead to incorrect gradients. I think the proper way to do the update is to accumulate the gradients of the log probabilities over the entire batch and then perform a single update, as in the sketch below.
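
Here is a rough sketch of what I mean (hypothetical names and sizes, not the repo's code): all time steps of the recorded trajectory go into one batch, so only one parameter update is made and the log probabilities still belong to the policy that actually generated the data.

import tensorflow as tf

# Hypothetical REINFORCE update: one gradient step per recorded episode.
STATE_DIM, NUM_ACTIONS = 4, 2

states_ph = tf.placeholder(tf.float32, [None, STATE_DIM])
actions_ph = tf.placeholder(tf.int32, [None])
returns_ph = tf.placeholder(tf.float32, [None])  # discounted return per step

hidden = tf.layers.dense(states_ph, 32, activation=tf.nn.relu)
logits = tf.layers.dense(hidden, NUM_ACTIONS)

# -log pi(a_t | s_t) for every recorded step, weighted by the return from t.
neg_logprob = tf.nn.sparse_softmax_cross_entropy_with_logits(
    logits=logits, labels=actions_ph)
loss = tf.reduce_sum(neg_logprob * returns_ph)
train_op = tf.train.AdamOptimizer(1e-2).minimize(loss)

# One update per episode, e.g.:
# sess.run(train_op, feed_dict={states_ph: ep_states,
#                               actions_ph: ep_actions,
#                               returns_ph: ep_returns})

Summing the per-step losses and minimizing once is equivalent to accumulating the per-step gradients and applying them in a single step.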

The same issue applies to the "all_rewards" buffer that is used to normalize the return: as the policy changes, the mean and standard deviation of the return may change as well, so the normalization that is being performed is no longer valid (at least in a strict sense).

As stated before, please correct me if I am wrong. I was simply looking for a REINFORCE implementation in TensorFlow and came across this issue.

PG Reinforce

While using policy gradients for reinforcement learning, you are using the discounted reward. But I think in the David Silver lecture he says the rewards are sampled from a distribution. Why do you use the discounted reward?
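
For reference, this is the kind of discounted return I am referring to (just a small NumPy sketch of my own, not the repo's code):

import numpy as np

def discounted_returns(rewards, gamma=0.99):
    # G_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ... for each step t
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

print(discounted_returns([1.0, 1.0, 1.0], gamma=0.9))
# [2.71 1.9  1.  ]  -- each entry is the discounted sum of future rewards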

Compute policy gradient using cross entropy loss

This is not an issue with the code per se, but I am learning RL and am wondering how the policy gradient is calculated in pg_reinforce.py. In this line:

self.cross_entropy_loss = tf.nn.sparse_softmax_cross_entropy_with_logits(
    logits=self.logprobs, labels=self.taken_actions)

a loss is defined as the cross entropy between the action distribution produced by the policy and the action actually taken. But standard texts such as Sutton define the policy gradient in the REINFORCE algorithm simply as the 'log grad' of the policy function.

What is the difference between the two definitions? It seems the cross-entropy loss is a more general definition because it involves not just the action taken but all of the remaining actions. Can you give a reference for your method?
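
To check my own understanding, I ran a tiny numerical example (made-up numbers, nothing from the repo): with the taken action as the label, the sparse softmax cross entropy is exactly -log pi(a|s), i.e. the REINFORCE log term before it is weighted by the return, so the other actions only enter through the softmax normalization.

import tensorflow as tf

logits = tf.constant([[2.0, 0.5, -1.0]])  # unnormalized scores for 3 actions
taken_action = tf.constant([0])           # suppose action 0 was taken

xent = tf.nn.sparse_softmax_cross_entropy_with_logits(
    logits=logits, labels=taken_action)
neg_log_prob = -tf.log(tf.nn.softmax(logits)[0, 0])

with tf.Session() as sess:
    print(sess.run([xent, neg_log_prob]))  # both values are ~0.2413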

This is a great project. Thank you for sharing it.

Modifying for Atari gym envs?

I want to get a PG implementation working with Atari gym envs. I've spent a while trying to modify pg_reinforce with no luck. Has anyone had any luck modifying any of the PG implementations to work with Atari (such as Pong-v0 or Breakout-v0)? The frame preprocessing I have been trying is sketched below.
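
This is roughly the preprocessing I have been feeding into the network (my own sketch of the common crop/downsample/binarize recipe for Pong, not code from this repo):

import numpy as np

def preprocess_pong(obs):
    # obs is a raw Pong-v0 frame of shape (210, 160, 3)
    img = obs[35:195]                          # crop out the score bar and borders
    img = img[::2, ::2, 0].astype(np.float32)  # downsample by 2, keep one channel
    img[img == 144] = 0                        # erase background (Pong-specific value)
    img[img == 109] = 0                        # erase another background shade
    img[img != 0] = 1                          # paddles and ball become 1
    return img.ravel()                         # flat (80*80,) feature vector

# Hypothetical usage:
# env = gym.make("Pong-v0")
# state = preprocess_pong(env.reset())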

Thanks.
