Comments (4)
Hello,
The minus sign appears because we are minimizing a cost instead of maximizing a reward. In the code, the minus sign is included in the reward definition; see line 344: `-torch.log(discrim_net(partial_state_action)).squeeze()`
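A minimal sketch of that line's sign convention, assuming only that the discriminator outputs a probability B(s, a) in (0, 1); `discrim_out` is an illustrative stand-in for `discrim_net(partial_state_action)`, not a name from the repository:

```python
import math

def cost_to_reward(discrim_out: float) -> float:
    # Cost C(s, a) = -log B(s, a). Handing this to TRPO's ascent step as a
    # "reward" is the same as performing descent on log B(s, a).
    return -math.log(discrim_out)

# A discriminator output near 1 yields a cost near 0; an output near 0
# yields a large positive cost, so ascent pushes the policy toward
# state-actions the discriminator scores highly.
print(cost_to_reward(0.9) < cost_to_reward(0.1))  # True
```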
from logo.
Thank you very much for your reply. In your paper, C_π(s,a) = log(π(s,a)/π_b(s,a)), B*(s,a) = ρ_{π_b}(s,a)/(ρ_{π_b}(s,a) + ρ_π(s,a)), and C_π(s,a) = −log B*(s,a). So the minus sign here is used to recover your definition C_π(s,a) = log((ρ_{π_b}(s,a) + ρ_π(s,a))/ρ_{π_b}(s,a)). Is my understanding right?
I can understand the plus sign in line 11 of the LOGO algorithm in the code. May I ask how the minus sign in line 14 is reflected in the code? About the code `-torch.log(discrim_net(partial_state_action)).squeeze()`: I understand it is used to convert B(s,a) = ρ_{π_b}(s,a)/(ρ_{π_b}(s,a) + ρ_π(s,a)) into C_π(s,a) = log((ρ_{π_b}(s,a) + ρ_π(s,a))/ρ_{π_b}(s,a)), i.e. the minus sign swaps the numerator and denominator.
Here is what puzzles me: when computing C_π, both the guidance update arg min_π E_{s∼d^{π_{k+1/2}}, a∼π(s,·)}[A_π^{C_π}] and the reward update π_{k+1/2} = arg max_π E_{s∼d^{π_k}, a∼π}[A_π^R] use the same function, trpo_step. I do not understand how the minus sign of line 14 shows up in the code when minimizing to update the parameters, because the TRPO code only contains the maximization (with a plus sign).
I am looking forward to your guidance; I am really confused.
Hello,
Thank you very much for reviewing the code and paper. Here is the clarification to your question.
In the policy guidance step, the goal is to minimize D_{KL}(\pi, \pi_b). This is akin to solving a reinforcement learning problem with cost C_\pi = log(\pi / \pi_b). That is the same as performing reinforcement learning with reward C_\pi = log(\pi / \pi_b) (line 5) but doing gradient descent instead of gradient ascent (line 14).
Now, coming to your confusion: when we have data, and not the behavior policy, we estimate C_\pi using a discriminator. Note that if the discriminator is implemented as represented by eq. 18 of the paper, we get C_\pi = -log B(s,a), which grows with log(\rho_\pi / \rho_{\pi_b}); thus we perform reinforcement learning using it as a cost, which is the same as performing gradient descent treating it as a reward.
Now, coming to the implementation, there is a slight difference, and it comes from equation 18. The code is implemented using equation (16) of the GAIL paper (https://arxiv.org/pdf/1606.03476.pdf) without the entropy term. Note that equation 16 (of the GAIL paper) and equation 18 (of LOGO) are the same with the expectation terms swapped. This simply boils down to switching the labels of the behavior data and the data obtained from the current policy during discriminator training (it does not change the results).
Thus, following this, performing gradient ascent using -log B(s,a) in the code (with the discriminator trained via equation 16 of GAIL) is the same as performing gradient descent using -log B(s,a) in the paper (with the discriminator trained via equation 18 of LOGO).
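A sketch of the label-swap remark, again with toy densities. Under the GAIL (eq. 16) convention the optimal discriminator is D = ρ_π/(ρ_π + ρ_b); swapping the labels of the behavior data and the current-policy data trains its complement, which is exactly the B(s,a) of eq. 18:

```python
import math

rho_b, rho_pi = 0.3, 0.7

D_gail = rho_pi / (rho_pi + rho_b)   # policy samples labeled positive (eq. 16)
B_logo = rho_b / (rho_pi + rho_b)    # behavior samples labeled positive (eq. 18)

# Swapping the training labels complements the optimal discriminator,
# which is why the two conventions lead to the same policy update up to
# the sign convention discussed above.
print(math.isclose(B_logo, 1.0 - D_gail))  # True
```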
Thank you very much for pointing this out; we will make sure to clarify this.
Thank you so much for your clarification. You did a great job!