You summarize the LOGO algorithm in formula (10) in your paper, but I don't understand why the step that minimizes the upper bound carries a minus sign. Why can the same function be used to update the parameters in the code?

desikrengarajan commented on June 11, 2024

You summarize the LOGO algorithm in formula (10) in your paper, but I don't understand why the step that minimizes the upper bound carries a minus sign. Why can the same function be used to update the parameters in the code?

Comments (4)

DesikRengarajan commented on June 11, 2024

Hello,
The minus sign appears because we are minimizing a cost instead of maximizing a reward. In the code, the minus sign is included in the reward definition. See line 344: -torch.log(discrim_net(partial_state_action)).squeeze()
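
As a minimal sketch of why moving the minus sign into the reward changes nothing, the toy example below compares gradient descent on a cost with gradient ascent on its negation. The quadratic cost and the function names are hypothetical illustrations, not part of the LOGO code.

```python
# Toy illustration (not the LOGO code): descending a cost and ascending
# its negation produce identical parameter updates.

def cost_grad(theta):
    # Gradient of the hypothetical cost c(theta) = (theta - 3)^2.
    return 2.0 * (theta - 3.0)

def reward_grad(theta):
    # Gradient of the reward r(theta) = -c(theta).
    return -cost_grad(theta)

def descend(theta, lr=0.1, steps=50):
    # Gradient descent on the cost.
    for _ in range(steps):
        theta = theta - lr * cost_grad(theta)
    return theta

def ascend(theta, lr=0.1, steps=50):
    # Gradient ascent on the negated cost (the "reward").
    for _ in range(steps):
        theta = theta + lr * reward_grad(theta)
    return theta

theta_descent = descend(0.0)
theta_ascent = ascend(0.0)
print(theta_descent, theta_ascent)
```

Both loops follow the same trajectory toward the minimizer of the cost, which is why a single maximizing update routine can serve both steps once the sign is part of the reward.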

yongpan0715 commented on June 11, 2024

Thank you very much for your reply. In your paper, C_\pi(s,a) = \log(\pi(s,a) / \pi_b(s,a)), B^*(s,a) = \rho_{\pi_b}(s,a) / (\rho_{\pi_b}(s,a) + \rho_\pi(s,a)), and C_\pi(s,a) = -\log B(s,a). So the minus sign here is used to obtain your definition C_\pi(s,a) = \log((\rho_{\pi_b}(s,a) + \rho_\pi(s,a)) / \rho_{\pi_b}(s,a)). Is my understanding right?

I can understand the plus sign in line 11 of the LOGO algorithm in the code. May I ask how the minus sign in line 14 is reflected in the code? Regarding the code -torch.log(discrim_net(partial_state_action)).squeeze(): as I understand it, this converts B(s,a) = \rho_{\pi_b}(s,a) / (\rho_{\pi_b}(s,a) + \rho_\pi(s,a)) into C_\pi(s,a) = \log((\rho_{\pi_b}(s,a) + \rho_\pi(s,a)) / \rho_{\pi_b}(s,a)), so the minus sign only swaps the numerator and denominator.

Here is what puzzles me. Once C_\pi is available, the step \arg\min_\pi E_{s \sim d^{\pi_{k+1/2}}, a \sim \pi(s,\cdot)} [A^\pi_{C_\pi}] and the step \pi_{k+1/2} = \arg\max_\pi E_{s \sim d^{\pi_k}, a \sim \pi} [A^\pi_R] both use the function trpo_step. I don't understand how the minus sign on line 14 shows up in the code for the minimizing parameter update, because the TRPO code only contains the maximizing (plus sign) update.
I am looking forward to your guidance. I am really confused.

DesikRengarajan commented on June 11, 2024

Hello,
Thank you very much for reviewing the code and paper. Here is the clarification to your question.
In the policy guidance step, the goal is to minimize D_{KL}(\pi, \pi_b). This is akin to solving a reinforcement learning problem with cost C_\pi = \log(\pi / \pi_b). This is the same as performing reinforcement learning with reward C_\pi = \log(\pi / \pi_b) (line 5) but using gradient descent instead of gradient ascent (line 14).
Now, coming to your confusion: when we have data, and not the behavior policy, we try to estimate C_\pi, which we do using a discriminator. Note that if the discriminator is implemented as represented by eq. 18 of the paper, we get C_\pi = -\log B(s,a), which is proportional to \log(\rho_\pi / \rho_{\pi_b}); thus we perform reinforcement learning using it as a cost, which is the same as performing gradient descent while treating it as a reward.
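
For reference, the discriminator-based cost can be worked out explicitly. This is a sketch that assumes the optimal discriminator B^*(s,a) quoted earlier in this thread; it is not copied from the paper.

```latex
C_\pi(s,a) = -\log B^*(s,a)
           = -\log \frac{\rho_{\pi_b}(s,a)}{\rho_{\pi_b}(s,a) + \rho_\pi(s,a)}
           = \log \frac{\rho_{\pi_b}(s,a) + \rho_\pi(s,a)}{\rho_{\pi_b}(s,a)}
           = \log\!\left(1 + \frac{\rho_\pi(s,a)}{\rho_{\pi_b}(s,a)}\right)
```

Under that assumption the cost is increasing in the ratio \rho_\pi / \rho_{\pi_b}, so lowering it moves the policy's occupancy toward state-action pairs that the behavior data covers.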

Now, coming to the implementation, there is a slight difference, which comes from equation 18. The code is implemented using equation (16) of the GAIL paper (https://arxiv.org/pdf/1606.03476.pdf) without the entropy term. Note that equation 16 (of the GAIL paper) and equation 18 (of LOGO) are the same with the expectation terms swapped. This simply boils down to switching the labels of the behavior data and the data obtained from the current policy during training (it does not change the results).

Thus, performing gradient ascent using -\log B(s,a) in the code (with the discriminator trained using equation 16 of GAIL) is the same as performing gradient descent using -\log B(s,a) in the paper (with the discriminator trained using equation 18 of LOGO).
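
The label swap can be illustrated numerically. The sketch below assumes the optimal discriminator B^*(s,a) quoted earlier in this thread and assumes that swapping the labels yields 1 - B^*(s,a); the occupancy values and function names are made up for illustration. It only shows that the cost under one labeling and the reward under the other rank state-action pairs the same way, which is one way to read the equivalence described above.

```python
import math

def logo_discriminator(rho_pi, rho_b):
    # B*(s,a) as quoted in the thread: behavior data carries the positive label.
    return rho_b / (rho_b + rho_pi)

def swapped_discriminator(rho_pi, rho_b):
    # Assumed optimum after swapping labels (policy data positive): 1 - B*.
    return rho_pi / (rho_b + rho_pi)

# Hypothetical (rho_pi, rho_b) occupancy values for three state-action pairs.
samples = {"a": (0.1, 0.6), "b": (0.3, 0.3), "c": (0.6, 0.1)}

# Paper-style signal: a cost to be minimized.
cost = {k: -math.log(logo_discriminator(*v)) for k, v in samples.items()}
# Code-style signal: a reward to be maximized.
reward = {k: -math.log(swapped_discriminator(*v)) for k, v in samples.items()}

# Most preferred pair first under each convention.
best_by_cost = sorted(samples, key=lambda k: cost[k])
best_by_reward = sorted(samples, key=lambda k: reward[k], reverse=True)
print(best_by_cost, best_by_reward)
```

In this toy setting the pair best covered by the behavior data has both the lowest cost and the highest reward, so descending the first signal and ascending the second push the policy in the same direction.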

Thank you very much for pointing this out, we will make sure to clarify this.

yongpan0715 commented on June 11, 2024

Thank you so much for your clarification. You did a great job!
