wojzaremba / trpo


trpo's Issues

About kl_firstfixed

Thanks for the implementation of TRPO. A few details do not make sense to me yet.

I can't see why kl_firstfixed is defined as follows:

    kl_firstfixed = tf.reduce_sum(tf.stop_gradient(
        action_dist_n) * tf.log(tf.stop_gradient(action_dist_n + eps) / (action_dist_n + eps))) / Nf

It does not seem to use oldaction_dist at all. Shouldn't it be

    kl_firstfixed = tf.reduce_sum(tf.stop_gradient(
        oldaction_dist) * tf.log(tf.stop_gradient(oldaction_dist + eps) / (action_dist_n + eps))) / Nf

Besides, why does losses contain the entropy of action_dist_n? Why must it be minimized?
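For readers following this question, here is a minimal NumPy sketch of one common reading of a "first argument fixed" KL term. It is an interpretation, not something confirmed by the repository author. At the current parameters the term has zero value, yet its second derivative with respect to the parameters is the Fisher information matrix, which is the quantity TRPO-style methods use to measure step size. The softmax policy over raw logits, the helper names, and the finite-difference Hessian are all assumptions made for illustration.

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over a 1-D logit vector.
    z = logits - np.max(logits)
    e = np.exp(z)
    return e / e.sum()

def kl_first_fixed(theta, theta_fixed):
    # KL(p_fixed || p_theta); p_fixed plays the role of the stop_gradient copy.
    p_fixed = softmax(theta_fixed)
    p = softmax(theta)
    return float(np.sum(p_fixed * (np.log(p_fixed) - np.log(p))))

def numerical_hessian(f, x, h=1e-3):
    # Central finite-difference Hessian (hypothetical helper, illustration only).
    n = x.size
    hess = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            ei = np.zeros(n)
            ej = np.zeros(n)
            ei[i] = h
            ej[j] = h
            hess[i, j] = (f(x + ei + ej) - f(x + ei - ej)
                          - f(x - ei + ej) + f(x - ei - ej)) / (4 * h * h)
    return hess

theta0 = np.array([0.2, -0.5, 1.0])
p0 = softmax(theta0)

# The value is zero because both arguments describe the same distribution.
value_at_theta0 = kl_first_fixed(theta0, theta0)

# The curvature is not zero: it matches the Fisher matrix of a softmax policy.
hessian = numerical_hessian(lambda t: kl_first_fixed(t, theta0), theta0)
fisher = np.diag(p0) - np.outer(p0, p0)
```

Under this reading, the term would not need oldaction_dist as long as the old policy and the current parameters coincide at the point where the curvature is evaluated.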

Normalize advantage function

Hi, thanks for your implementation of TRPO.

In https://github.com/wojzaremba/trpo/blob/master/main.py#L128-L132 you normalize the advantage function.
I couldn't find any description of this operation in the paper (https://arxiv.org/abs/1502.05477).
Why did you do that?
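For context, advantage normalization usually means standardizing the batch of advantage estimates to zero mean and unit variance before they enter the surrogate loss. The sketch below shows that generic operation only. The linked lines are not reproduced here, so the function name and the eps constant are assumptions rather than the repository's code.

```python
import numpy as np

def standardize_advantages(advantages, eps=1e-8):
    # Shift to zero mean and scale to unit standard deviation.
    # eps (an assumed constant) guards against division by zero
    # when all advantages in the batch are equal.
    adv = np.asarray(advantages, dtype=np.float64)
    return (adv - adv.mean()) / (adv.std() + eps)

raw = np.array([1.0, 2.0, 3.0, 10.0])
normalized = standardize_advantages(raw)
```

Shifting the advantages by a constant leaves the expected policy gradient unchanged, and rescaling them changes only the gradient's magnitude, so this is typically described as a variance and scale control trick rather than part of the algorithm itself.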

The necessity of kl_firstfixed

Hi Wojciech,

    kl = tf.reduce_sum(oldaction_dist * tf.log((oldaction_dist + eps) / (action_dist_n + eps))) / Nf
    # KL divergence where first arg is fixed
    # replace old->tf.stop_gradient from previous kl
    kl_firstfixed = tf.reduce_sum(tf.stop_gradient(
        action_dist_n) * tf.log(tf.stop_gradient(action_dist_n + eps) / (action_dist_n + eps))) / Nf

I think kl_firstfixed is exactly the same as kl, since the feed is

    feed = {self.obs: obs_n,
            self.action: action_n,
            self.advant: advant_n,
            self.oldaction_dist: action_dist_n}

Why not just use kl instead of kl_firstfixed, both for simplicity and to save computation?

Parameter update does not utilize the result from linesearch()

Hi Wojciech,

In https://github.com/wojzaremba/trpo/blob/master/main.py#L168-L170

    theta = linesearch(loss, thprev, fullstep, neggdotstepdir / lm)
    theta = thprev + fullstep
    self.sff(theta)

The theta returned by linesearch is immediately overwritten by thprev + fullstep, so it does not affect the result. Is there something wrong here? Thanks.
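To illustrate what using the line search result would look like, here is a generic backtracking line search in NumPy. It is a sketch under assumptions, not the repository's linesearch function: the name backtracking_linesearch, the max_backtracks and accept_ratio parameters, and the quadratic toy loss are all hypothetical.

```python
import numpy as np

def backtracking_linesearch(f, x, fullstep, expected_improve_rate,
                            max_backtracks=10, accept_ratio=0.1):
    # Try the full step first, then halve it until the loss f improves
    # by at least accept_ratio times the improvement predicted by the
    # linear model. Fall back to x if no step is accepted.
    fval = f(x)
    for stepfrac in 0.5 ** np.arange(max_backtracks):
        xnew = x + stepfrac * fullstep
        actual_improve = fval - f(xnew)
        expected_improve = expected_improve_rate * stepfrac
        if (expected_improve > 0 and actual_improve > 0
                and actual_improve / expected_improve > accept_ratio):
            return xnew
    return x

def toy_loss(x):
    # Hypothetical stand-in for the surrogate loss.
    return float(np.sum(x ** 2))

thprev = np.array([1.0, 1.0])
fullstep = np.array([-4.0, -4.0])   # deliberately overshoots the minimum
grad = 2.0 * thprev                 # gradient of toy_loss at thprev
theta = backtracking_linesearch(toy_loss, thprev, fullstep, -grad.dot(fullstep))
```

In this toy case the full step makes the loss worse, so the search shrinks it twice and returns a different point than thprev + fullstep. Keeping the returned theta, rather than reassigning it, is what lets the search affect the update.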

Can't reproduce result on RepeatCopy

Hi, I tried your code and ran it multiple times. My agents tend to get stuck at 4, even after more than 10k iterations.
Do you have any insight into what the problem could be?

KL divergence always bigger than constraint

I'm trying to reproduce results on Copy-v0.

    surrafter, kloldnew, entropy = self.session.run(
        self.losses, feed_dict=feed)
    if kloldnew > 2.0 * config.max_kl:
        self.sff(thprev)

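The branch under this if statement is always taken: the KL between the old and new distributions is always greater than 0.01 (max_kl), and in fact above the 2 * max_kl threshold. The parameters are therefore reset to thprev every time, so no changes are made to the policy.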
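To make the check concrete, here is a small NumPy sketch of a batch-averaged categorical KL and the rollback test against 2 * max_kl. It mirrors the shape of the snippet above but is not the repository's code: the function names, the eps constant, and averaging over the batch dimension are assumptions.

```python
import numpy as np

def mean_categorical_kl(old_dist, new_dist, eps=1e-8):
    # Rows are per-state action distributions. Returns KL(old || new)
    # averaged over the batch; eps (assumed) avoids log(0).
    old_dist = np.asarray(old_dist, dtype=np.float64)
    new_dist = np.asarray(new_dist, dtype=np.float64)
    per_sample = np.sum(
        old_dist * np.log((old_dist + eps) / (new_dist + eps)), axis=1)
    return float(np.mean(per_sample))

def should_roll_back(kl_old_new, max_kl=0.01, slack=2.0):
    # True when the step violated the trust region by more than the slack.
    return kl_old_new > slack * max_kl

old = np.array([[0.5, 0.5], [0.9, 0.1]])
kl_same = mean_categorical_kl(old, old)
rolled_back = should_roll_back(0.0506147)  # KL value from the reported log
```

With the KL reported in the issue, the check fires, which is consistent with the parameters being reset on every iteration.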

    ********** Iteration 1 ************
    Total number of episodes:                 784
    KL between old and new distribution:      0.0506147 (this is greater than 2 * 0.01)
    Entropy:                                  2.912
    Surrogate loss:                           -0.210527
    Average sum of rewards per episode:       -0.309113300493
    Baseline explained:                       -0.0618615207653
    Time elapsed:                             0.07 mins

I am running the script with python main.py Copy-v0
