
Comments (4)

DesikRengarajan commented on June 11, 2024

Hello,
The minus sign appears because we are minimizing a cost instead of maximizing a reward. In the code, the minus sign is included in the reward definition; see line 344: -torch.log(discrim_net(partial_state_action)).squeeze()

from logo.

yongpan0715 commented on June 11, 2024

Thank you very much for your reply. In your paper, C_\pi(s,a) = log(\pi(s,a) / \pi_b(s,a)), B^*(s,a) = \rho_{\pi_b}(s,a) / (\rho_{\pi_b}(s,a) + \rho_\pi(s,a)), and C_\pi(s,a) = -log B(s,a). So the minus sign here is used to obtain your definition C_\pi(s,a) = log((\rho_{\pi_b}(s,a) + \rho_\pi(s,a)) / \rho_{\pi_b}(s,a)). Is my understanding right?

I can understand the plus sign in line 11 of the LOGO algorithm and how it appears in the code. May I ask how the minus sign in line 14 is reflected in the code? Regarding the code -torch.log(discrim_net(partial_state_action)).squeeze(): as I understand it, this converts B(s,a) = \rho_{\pi_b}(s,a) / (\rho_{\pi_b}(s,a) + \rho_\pi(s,a)) into C_\pi(s,a) = log((\rho_{\pi_b}(s,a) + \rho_\pi(s,a)) / \rho_{\pi_b}(s,a)), so the minus sign only swaps the numerator and denominator.

Here is what puzzles me. Once C_\pi is obtained, both arg min_\pi E_{s \sim d^{\pi_{k+1/2}}, a \sim \pi(s,\cdot)} [A^\pi_{C_\pi}] and \pi_{k+1/2} = arg max_\pi E_{s \sim d^{\pi_k}, a \sim \pi} [A^\pi_R] are computed with the same function, trpo_step. I don't understand how the minus sign on line 14 shows up in the code for the minimizing update, because the TRPO code only contains the plus sign of the maximization.
I am looking forward to your guidance; I am really confused.

DesikRengarajan commented on June 11, 2024

Hello,
Thank you very much for reviewing the code and the paper. Here is the clarification to your question.
In the policy guidance step, the goal is to minimize D_{KL}(\pi, \pi_b). This is akin to solving a reinforcement learning problem with cost C_\pi = log(\pi / \pi_b). Equivalently, we perform reinforcement learning with reward C_\pi = log(\pi / \pi_b) (line 5) but use gradient descent instead of gradient ascent (line 14).
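The equivalence between the two views can be checked on a toy problem. The sketch below is only an illustration under stated assumptions: the quadratic cost, its gradient, the step size, and the function names are all hypothetical and are not taken from the LOGO code. It runs gradient descent on a cost and gradient ascent on the negated cost from the same starting point, and the two sequences of iterates coincide.

```python
def cost_grad(theta):
    """Gradient of the hypothetical cost c(theta) = (theta - 3)^2."""
    return 2.0 * (theta - 3.0)


def descend_on_cost(theta, lr=0.1, steps=50):
    """Gradient descent on the cost: theta <- theta - lr * dc/dtheta."""
    path = [theta]
    for _ in range(steps):
        theta = theta - lr * cost_grad(theta)
        path.append(theta)
    return path


def ascend_on_reward(theta, lr=0.1, steps=50):
    """Gradient ascent on the reward r = -c: theta <- theta + lr * dr/dtheta."""
    path = [theta]
    for _ in range(steps):
        reward_grad = -cost_grad(theta)  # the minus sign lives in the reward
        theta = theta + lr * reward_grad
        path.append(theta)
    return path


descent_path = descend_on_cost(0.0)
ascent_path = ascend_on_reward(0.0)
```

In this toy setting the minus sign can sit either in the update rule (descent) or in the quantity being optimized (a negated cost used as a reward); the parameter updates are the same.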
Now, coming to your confusion: when we have data but not the behavior policy, we estimate C_\pi using a discriminator. Note that if the discriminator is implemented as in eq. 18 of the paper, we get C_\pi = -log B(s,a), which is proportional to log(\rho_\pi / \rho_{\pi_b}); thus we perform reinforcement learning using it as a cost, which is the same as performing gradient descent while treating it as a reward.
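As a quick numerical check of the relation quoted earlier in this thread, B^*(s,a) = \rho_{\pi_b}(s,a) / (\rho_{\pi_b}(s,a) + \rho_\pi(s,a)) and C_\pi(s,a) = -log B(s,a), the following minimal sketch evaluates the cost at the optimal discriminator and compares it with log(1 + \rho_\pi / \rho_{\pi_b}). The density values and function names are made up for illustration and do not come from the paper or the code.

```python
import math


def optimal_discriminator(rho_b, rho_pi):
    """B*(s,a) = rho_b / (rho_b + rho_pi), as quoted in this thread."""
    return rho_b / (rho_b + rho_pi)


def cost_from_discriminator(b):
    """C(s,a) = -log B(s,a)."""
    return -math.log(b)


# Hypothetical (rho_b, rho_pi) density values at three state-action pairs.
samples = [(0.30, 0.05), (0.30, 0.30), (0.30, 0.90)]

costs = []
for rho_b, rho_pi in samples:
    cost = cost_from_discriminator(optimal_discriminator(rho_b, rho_pi))
    closed_form = math.log(1.0 + rho_pi / rho_b)
    costs.append((rho_pi / rho_b, cost, closed_form))
```

Under these assumptions the cost is an increasing function of the ratio \rho_\pi / \rho_{\pi_b}, so it orders state-action pairs the same way log(\rho_\pi / \rho_{\pi_b}) does.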

Now, coming to the implementation, there is a slight difference. The difference comes from equation 18. The code is implemented using equation (16) of the GAIL paper (https://arxiv.org/pdf/1606.03476.pdf) without the entropy term. Note that equation 16 (of the GAIL paper) and equation 18 (of LOGO) are the same with the expectation terms swapped. This simply boils down to switching the labels of the behavior data and the data obtained from the current policy during training (this does not change the results).
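The effect of swapping the two expectation terms can be illustrated pointwise. The sketch below assumes a discriminator objective of the form w1 * log(D) + w2 * log(1 - D) at a single state-action pair, where w1 and w2 are the densities of the two data sources. The densities, the grid search, and the function name are hypothetical choices for illustration, and the sketch does not assert which labeling corresponds to which paper's equation.

```python
import math


def best_discriminator(weight_log_d, weight_log_one_minus_d, grid_size=9999):
    """Grid-search the d in (0, 1) that maximizes w1*log(d) + w2*log(1 - d)."""
    best_d, best_value = None, -math.inf
    for i in range(1, grid_size + 1):
        d = i / (grid_size + 1)
        value = (weight_log_d * math.log(d)
                 + weight_log_one_minus_d * math.log(1.0 - d))
        if value > best_value:
            best_d, best_value = d, value
    return best_d


rho_pi, rho_b = 0.2, 0.6  # hypothetical densities at one (s, a)

# Policy samples carry the log(D) term, behavior data the log(1 - D) term.
d_policy_labeled = best_discriminator(rho_pi, rho_b)

# Labels swapped: behavior data carries log(D), policy samples log(1 - D).
d_behavior_labeled = best_discriminator(rho_b, rho_pi)
```

With these numbers the two maximizers are complements of each other (about 0.25 and 0.75), which is what switching the labels amounts to: the swapped discriminator is one minus the original.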

Thus, performing gradient ascent using -log B(s,a) in the code (with the discriminator trained using equation 16 of GAIL) is the same as performing gradient descent using -log B(s,a) in the paper (with the discriminator trained using equation 18 of LOGO).
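One way to see why the two conventions push the policy in the same direction is to compare how each quantity ranks state-action pairs. The sketch below assumes optimal discriminators and that the code's discriminator is the label-swapped complement of the paper's B, in line with the label switch described above. The function names and the ratio values are hypothetical.

```python
import math


def paper_cost(ratio):
    """-log B with B = rho_b / (rho_b + rho_pi), where ratio = rho_pi / rho_b."""
    return math.log(1.0 + ratio)


def code_reward(ratio):
    """-log D with D = rho_pi / (rho_pi + rho_b), i.e. labels swapped (D = 1 - B)."""
    return math.log(1.0 + 1.0 / ratio)


# Hypothetical density ratios rho_pi / rho_b at five state-action pairs.
ratios = [0.1, 0.5, 1.0, 2.0, 10.0]

# Preference order when minimizing the paper-style cost.
by_lowest_cost = sorted(ratios, key=paper_cost)

# Preference order when maximizing the code-style reward.
by_highest_reward = sorted(ratios, key=code_reward, reverse=True)
```

Both orderings favor small \rho_\pi / \rho_{\pi_b}, so under these assumptions minimizing the first quantity and maximizing the second prefer the same state-action pairs, although the two values are not numerically identical.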

Thank you very much for pointing this out; we will make sure to clarify this.


yongpan0715 commented on June 11, 2024

Thank you so much for your clarification. You did a great job!
