trpo's People
Forkers
gdb ilyasu123 parisilabs zhongwen programmertp etotheipluspi floodsung tigerneil vyraun yongduek atgambardella ruotianluo peterzcc complyue tianzhuwang07 chensy1992 ericdanz wsjeon stanfordvl mansimov ajaytalati wilsonwangthu jtoyama4 jxwuyi andrewliao11 dotrado kastnerkyle dragonfzj picopoco zhudejun1985 lkh-1 syx528911137 yychrzh yzy1015 afcarl fdsmlhn jianjunchang wangyy161 maksim-vatkin zxhsama joneswong fagan2888
trpo's Issues
About kl_firstfixed
Thanks for the implementation of TRPO. There are some details that do not make sense to me so far.
I can't see why kl_firstfixed is defined as follows:
kl_firstfixed = tf.reduce_sum(tf.stop_gradient( action_dist_n) * tf.log(tf.stop_gradient(action_dist_n + eps) / (action_dist_n + eps))) / Nf
It seems that this does not use oldaction_dist at all.
Shouldn't it be
kl_firstfixed = tf.reduce_sum(tf.stop_gradient( oldaction_dist) * tf.log(tf.stop_gradient(oldaction_dist + eps) / (action_dist_n + eps))) / Nf
?
Besides, why does losses contain the entropy of action_dist_n? Why must it be minimized?
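The following is a minimal sketch, not the repository's code, of what a KL term with its first argument fixed at the current distribution computes. It assumes a hypothetical softmax policy over three logits and uses finite differences in place of TensorFlow gradients. At the current parameters the value and gradient are zero, while the Hessian equals the Fisher information matrix of the categorical distribution, which is the quantity a Fisher-vector product needs. Whether that is the intent behind kl_firstfixed here is an assumption.

```python
import math

def softmax(logits):
    """Convert logits to a categorical distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_first_fixed(fixed_probs, logits):
    """KL(fixed_probs || softmax(logits)); fixed_probs stands in for the stop_gradient term."""
    probs = softmax(logits)
    return sum(p * math.log(p / q) for p, q in zip(fixed_probs, probs))

def numerical_gradient(f, x, h=1e-4):
    """Central-difference gradient of f at x."""
    grad = []
    for i in range(len(x)):
        up, down = list(x), list(x)
        up[i] += h
        down[i] -= h
        grad.append((f(up) - f(down)) / (2 * h))
    return grad

def numerical_hessian(f, x, h=1e-4):
    """Central-difference Hessian of f at x."""
    n = len(x)
    hess = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            def shifted(di, dj):
                y = list(x)
                y[i] += di
                y[j] += dj
                return f(y)
            hess[i][j] = (shifted(h, h) - shifted(h, -h)
                          - shifted(-h, h) + shifted(-h, -h)) / (4 * h * h)
    return hess

# Hypothetical current policy parameters (logits).
theta0 = [0.5, -1.0, 2.0]
p0 = softmax(theta0)  # fixed copy of the current distribution

def f(logits):
    return kl_first_fixed(p0, logits)

kl_value = f(theta0)                      # zero: both arguments coincide
kl_grad = numerical_gradient(f, theta0)   # approximately zero: theta0 is a minimum
kl_hess = numerical_hessian(f, theta0)    # approximately the Fisher matrix

# Fisher information of a categorical distribution w.r.t. its logits: diag(p) - p p^T.
fisher = [[p0[i] * ((1.0 if i == j else 0.0) - p0[j]) for j in range(3)]
          for i in range(3)]
```

Under this reading, the term matters through its curvature rather than its value, so it does not need a separate old distribution.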
Normalize advantage function
Hi, thanks for your implementation of TRPO.
In https://github.com/wojzaremba/trpo/blob/master/main.py#L128-L132 you normalize the advantage function.
I couldn't find any description of this operation in the paper (https://arxiv.org/abs/1502.05477).
Why did you do that?
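For readers unfamiliar with the operation, here is a minimal sketch of advantage standardization as it is commonly done in policy-gradient code. The function name normalize_advantages, the eps value, and the sample numbers are assumptions for illustration, not a copy of the linked lines.

```python
import statistics

def normalize_advantages(advantages, eps=1e-8):
    """Shift advantages to zero mean and scale them to unit standard deviation."""
    mean = sum(advantages) / len(advantages)
    std = statistics.pstdev(advantages)  # population standard deviation
    # eps guards against division by zero when all advantages are equal.
    return [(a - mean) / (std + eps) for a in advantages]

# Hypothetical batch of advantage estimates.
advant_n = [1.0, 2.0, 3.0, 10.0]
normalized = normalize_advantages(advant_n)
```

The transform keeps the ordering of the advantages and only changes their offset and scale, which is usually motivated as a variance-reduction and conditioning heuristic rather than something required by the TRPO derivation.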
The necessity of kl_firstfixed
Hi Wojciech,
kl = tf.reduce_sum(oldaction_dist * tf.log((oldaction_dist + eps) / (action_dist_n + eps))) / Nf
# KL divergence where first arg is fixed
# replace old->tf.stop_gradient from previous kl
kl_firstfixed = tf.reduce_sum(tf.stop_gradient(
action_dist_n) * tf.log(tf.stop_gradient(action_dist_n + eps) / (action_dist_n + eps))) / Nf
I think kl_firstfixed is exactly the same as kl, since the feed is
feed = {self.obs: obs_n,
        self.action: action_n,
        self.advant: advant_n,
        self.oldaction_dist: action_dist_n}
Why not just use kl instead of kl_firstfixed, both for simplicity and to save computation?
Parameter update does not utilize the result from linesearch()
Hi Wojciech,
In https://github.com/wojzaremba/trpo/blob/master/main.py#L168-L170
theta = linesearch(loss, thprev, fullstep, neggdotstepdir / lm)
theta = thprev + fullstep
self.sff(theta)
the theta obtained from linesearch is overwritten on the next line, so it does not affect the result. Is there something wrong here? Thanks.
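To make the concern concrete, here is a minimal sketch of a backtracking line search and of the difference between using its result and taking the full step. The body of linesearch, its default parameters, and the toy quadratic loss are assumptions for illustration and may differ from the repository's helper.

```python
def linesearch(f, x, fullstep, expected_improve_rate,
               max_backtracks=10, accept_ratio=0.1):
    """Halve the step until the actual improvement is a fair share of the expected one."""
    fval = f(x)
    for k in range(max_backtracks):
        stepfrac = 0.5 ** k
        xnew = [xi + stepfrac * si for xi, si in zip(x, fullstep)]
        actual_improve = fval - f(xnew)
        expected_improve = expected_improve_rate * stepfrac
        if actual_improve > 0 and actual_improve / expected_improve > accept_ratio:
            return xnew
    return list(x)  # no acceptable step found: keep the old parameters

def loss(th):
    """Toy loss; its gradient at th is 2 * th[0]."""
    return th[0] ** 2

thprev = [1.0]
fullstep = [-3.0]            # deliberately overshoots the minimum at 0
expected_improve_rate = 6.0  # -(gradient . fullstep) = -(2.0 * -3.0)

# Using the line-search result: the step is halved once.
theta_searched = linesearch(loss, thprev, fullstep, expected_improve_rate)

# Ignoring it and taking the full step, as in the snippet quoted above.
theta_full = [t + s for t, s in zip(thprev, fullstep)]
```

In this toy case the full step increases the loss while the searched step decreases it, which is the behavior the overwrite would discard.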
Can't reproduce result on RepeatCopy
Hi, I tried your code and ran it multiple times. My agents tend to get stuck at 4, even after more than 10k iterations.
Do you have any insight into what the problem could be?
KL divergence always bigger than constraint
I'm trying to reproduce results on Copy-v0.
surrafter, kloldnew, entropy = self.session.run(
    self.losses, feed_dict=feed)
if kloldnew > 2.0 * config.max_kl:
    self.sff(thprev)
The if branch here is always taken: the KL between the old and new distributions always exceeds the threshold (max_kl is 0.01), so the parameters are reset to thprev and no changes are made to the policy.
********** Iteration 1 ************
Total number of episodes: 784
KL between old and new distribution: 0.0506147 (this is greater than 2 * 0.01)
Entropy: 2.912
Surrogate loss: -0.210527
Average sum of rewards per episode: -0.309113300493
Baseline explained: -0.0618615207653
Time elapsed: 0.07 mins
I am running the script with: python main.py Copy-v0
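As background for the check quoted above, here is a minimal sketch of how a TRPO step is typically scaled and then guarded. The step is scaled so that the quadratic approximation of the KL equals max_kl, and the measured KL after the update can still come out larger, which is what the rollback condition tests. The diagonal Hessian, the step direction, and the helper names are hypothetical.

```python
import math

def quadratic_kl(step, hessian_vector_product):
    """Second-order estimate of the KL: 0.5 * step^T H step."""
    hs = hessian_vector_product(step)
    return 0.5 * sum(s * h for s, h in zip(step, hs))

def scale_step(stepdir, hessian_vector_product, max_kl):
    """Rescale stepdir so that its quadratic KL estimate equals max_kl."""
    shs = quadratic_kl(stepdir, hessian_vector_product)
    lm = math.sqrt(shs / max_kl)
    return [s / lm for s in stepdir]

def should_rollback(measured_kl, max_kl):
    """Same condition as the quoted check: reject if KL exceeds twice the limit."""
    return measured_kl > 2.0 * max_kl

# Hypothetical diagonal Hessian of the KL (a stand-in for the Fisher matrix).
diag = [2.0, 0.5]

def hvp(v):
    return [d * x for d, x in zip(diag, v)]

max_kl = 0.01
fullstep = scale_step([1.0, -4.0], hvp, max_kl)
predicted_kl = quadratic_kl(fullstep, hvp)  # matches max_kl up to rounding

# The measured value from the log above triggers the rollback.
rolled_back = should_rollback(0.0506147, max_kl)
```

A measured KL several times larger than the predicted one suggests that the quadratic estimate is poor for that step, or that the step is not being shortened before the check.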