zaiyan-x commented on September 1, 2024

Hi Linh,

I did not run into this before. It seems that the ETA network just gave up. The asynchronous updates between the ETA network and the rest of the model could be the reason. You can notice that once the ETA network gives up, the critic loss becomes high, i.e., your value network no longer estimates the robust value correctly.

My suggestion is to tune the ETA network hyper-parameters a bit. Hope this helps.

Regards,

ZX
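
Not from the thread or the rfqi repository, but as a rough illustration of what "the ETA network just gave up" looks like in practice, here is a minimal monitoring sketch; the eta_net interface (a module taking concatenated state-action pairs) and the thresholds are assumptions.

```python
import torch

def check_training_health(eta_net, critic_loss, states, actions,
                          eta_floor=1e-3, critic_loss_ceiling=1e3):
    """Flag the failure mode described above: eta collapsing toward zero
    while the critic loss climbs. Thresholds are arbitrary and need tuning."""
    with torch.no_grad():
        etas = eta_net(torch.cat([states, actions], dim=-1))
        max_eta = etas.max().item()
    eta_collapsed = max_eta < eta_floor
    critic_diverged = critic_loss > critic_loss_ceiling
    if eta_collapsed:
        print(f"warning: max_eta={max_eta:.2e}, the ETA network may have given up")
    if critic_diverged:
        print(f"warning: critic loss {critic_loss:.2e}, robust value estimate is likely off")
    return {"max_eta": max_eta, "eta_collapsed": eta_collapsed, "critic_diverged": critic_diverged}
```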

zaiyan-x commented on September 1, 2024

> Thanks @zaiyan-x, let me try this. One more question about the choice of data-generation method: in the paper, you say you trained SAC with the model parameter actuator_ctrlrange set to [−0.85, 0.85], and that this leads to a more diverse dataset. I'm curious about this specific choice. Could you please explain the intuition and reasoning behind training with actuator_ctrlrange set to [−0.85, 0.85]?
>
> Thank you so much and have a good weekend :D
> Best, Linh

Hi Linh,

The reason we used a perturbed actuator_ctrlrange is that we wanted to see whether RFQI still outperforms FQI when we let FQI foresee the perturbation, i.e., during training rather than only at test time. In other words, the diversity of the training dataset is meant to "help" FQI.

Regards,

Zaiyan
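
For concreteness, here is a minimal sketch of how such a perturbation could be applied to Hopper-v3 before collecting data. This is not the authors' data-generation script; it assumes the older gym + mujoco_py bindings, where env.unwrapped.model.actuator_ctrlrange is a writable (n_actuators, 2) array, rather than editing the XML model file.

```python
import gym

# Assumption: gym's Hopper-v3 backed by mujoco_py, where actuator_ctrlrange
# is a writable (n_actuators, 2) numpy array on the loaded model.
env = gym.make("Hopper-v3")
model = env.unwrapped.model

# Narrow the default [-1, 1] control range to [-0.85, 0.85], so the behavior
# policy (SAC) is trained and rolled out under a perturbed actuator range,
# which is what makes the resulting offline dataset more diverse.
model.actuator_ctrlrange[:, 0] = -0.85
model.actuator_ctrlrange[:, 1] = 0.85

obs = env.reset()  # train SAC / collect the offline dataset from here
```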

zaiyan-x commented on September 1, 2024

> Hi @zaiyan-x,
>
> I have tried to tune the ETA params a bit, but it didn't work. I realized that in Hopper-v3 the agent can be terminated before the end of the episode (1000 steps), so I incorporated the not_done signal (the current code on GitHub doesn't have it).
>
> Current version:
> target_Q = reward - gamma * torch.maximum(etas - target_Q, etas.new_tensor(0)) + (1 - rho) * etas * gamma
>
> not_done version:
> target_Q = reward - not_done * (gamma * torch.maximum(etas - target_Q, etas.new_tensor(0)) + (1 - rho) * etas * gamma)
> self.loss_log['robust_target_Q'] = target_Q.mean().item()
>
> I found that the critic loss no longer gets too high, but the eval reward is smaller than with the current version.
>
> [training-curve plot: the red and orange lines are the not_done version; the green one is the current version.]
>
> I'm quite confused by this result. Do you think the missing not_done signal is the reason for the critic-loss problem? And what could cause the lower eval reward?
>
> Thank you so much :D. Best, Linh

It could be. I think we had a discussion on this before haha ;) I am glad you found this issue. Yes, I recommend you fix it this way. As for whether this fixes the whole issue, I am not sure. I still think there is something wrong with the training (not with you, just that this algorithm is very difficult to get working empirically). One thing I am certain of is that max_eta should not decrease to zero. In my implementation, max_eta usually fluctuated within a reasonable range (which kind of made sense to me). You can use that as a signal for whether the training has gone catastrophic. I apologize that I can't give you a definitive suggestion for how to fix this.
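
To make the fix concrete, here is a self-contained sketch of the termination-masked target, assembled from the two lines quoted above rather than taken from the rfqi source; the gamma and rho defaults are placeholders. One detail worth double-checking: the not_done version quoted in the question puts the (1 - rho) * etas * gamma term inside the subtracted bracket, which flips its sign relative to the current version even on non-terminal transitions, and that alone could change the learned values. The sketch below instead masks the whole bootstrapped part while keeping the current version's signs.

```python
import torch

def robust_target(reward, not_done, etas, target_Q, gamma=0.99, rho=0.5):
    """Termination-masked robust target (a sketch, not the rfqi source).

    All inputs are tensors of shape (batch, 1). not_done is 1.0 on
    non-terminal transitions and 0.0 where Hopper-v3 terminated early,
    so the gamma-discounted part is dropped at terminal states.
    """
    zero = etas.new_tensor(0.0)
    # Same terms as the "current version" above, grouped so that the whole
    # bootstrapped part can be masked by not_done.
    bootstrapped = (1 - rho) * gamma * etas - gamma * torch.maximum(etas - target_Q, zero)
    return reward + not_done * bootstrapped
```

With not_done equal to 1 everywhere, this reduces exactly to the current version above.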

zaiyan-x commented on September 1, 2024

> Ohh, I understand. Just one follow-up question to make it clearer for me (of course :D). I see in the paper that you used epsilon-greedy during the data-generation process. Did you add the random actions to make the dataset more diverse, or is there another reason for this choice?

Yes, it is for making the dataset more diverse. :d

Hope this helps,

Zaiyan
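
A short sketch of the epsilon-greedy data collection described here (assumed details, not the authors' script): with probability epsilon a uniformly random action replaces the policy's action, purely to diversify the dataset, and the not_done flag from the earlier discussion is recorded with each transition. It assumes the old gym API (4-tuple step and the TimeLimit.truncated info flag), matching Hopper-v3 with mujoco_py, and that policy(obs) returns an action.

```python
import numpy as np

def collect_episode(env, policy, epsilon=0.1, rng=None):
    """Epsilon-greedy rollout around a trained policy (e.g. SAC)."""
    rng = rng or np.random.default_rng(0)
    transitions = []
    obs, done = env.reset(), False
    while not done:
        if rng.random() < epsilon:
            action = env.action_space.sample()  # random action, for dataset diversity
        else:
            action = policy(obs)
        next_obs, reward, done, info = env.step(action)
        # not_done is 0 only at true terminations, not at the time limit,
        # which matters for the termination-masked target discussed above.
        truncated = info.get("TimeLimit.truncated", False)
        not_done = 0.0 if (done and not truncated) else 1.0
        transitions.append((obs, action, reward, next_obs, not_done))
        obs = next_obs
    return transitions
```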

linhlpv commented on September 1, 2024

Thank you for your suggestion. For now, it seems that using the not_done signal during training keeps the etas stable and within a reasonable range.

linhlpv commented on September 1, 2024

Yup. Thank you so much 👍

Linh
