zaiyan-x commented on September 1, 2024

Hi Linh,

I did not run into this before. It seems that the ETA network just gave up. The asynchronous updates between the ETA network and the rest of the model could be the reason. You can notice that once the ETA network gives up, the critic loss becomes high, i.e., your value network no longer estimates the robust value correctly.

My suggestion is to tune the ETA network hyper-parameters a bit. Hope this helps.

Regards,

ZX
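
Not from the thread or the rfqi repository, but as a rough illustration of what "the ETA network just gave up" looks like in practice, here is a minimal monitoring sketch; the eta_net interface (a module taking concatenated state-action pairs) and the thresholds are assumptions.

```python
import torch

def check_training_health(eta_net, critic_loss, states, actions,
                          eta_floor=1e-3, critic_loss_ceiling=1e3):
    """Flag the failure mode described above: eta collapsing toward zero
    while the critic loss climbs. Thresholds are arbitrary and need tuning."""
    with torch.no_grad():
        etas = eta_net(torch.cat([states, actions], dim=-1))
        max_eta = etas.max().item()
    eta_collapsed = max_eta < eta_floor
    critic_diverged = critic_loss > critic_loss_ceiling
    if eta_collapsed:
        print(f"warning: max_eta={max_eta:.2e}, the ETA network may have given up")
    if critic_diverged:
        print(f"warning: critic loss {critic_loss:.2e}, robust value estimate is likely off")
    return {"max_eta": max_eta, "eta_collapsed": eta_collapsed, "critic_diverged": critic_diverged}
```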

zaiyan-x commented on September 1, 2024

> Thanks @zaiyan-x, let me try this. One more question about the choice of data-generation method: in the paper, you say you trained SAC with the model parameter actuator_ctrlrange set to [−0.85, 0.85], and that this leads to a more diverse dataset. I'm curious about this specific choice. Could you please explain the intuition and reasoning behind training with actuator_ctrlrange set to [−0.85, 0.85]?
>
> Thank you so much and have a good weekend :D
> Best, Linh

Hi Linh,

The reason we used a perturbed actuator_ctrlrange is that we wanted to see whether RFQI still outperforms FQI when we let FQI foresee the perturbation, i.e., during training rather than only at test time. In other words, the diversity of the training dataset is meant to "help" FQI.

Regards,

Zaiyan
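
For concreteness, here is a minimal sketch of how such a perturbation could be applied to Hopper-v3 before collecting data. This is not the authors' data-generation script; it assumes the older gym + mujoco_py bindings, where env.unwrapped.model.actuator_ctrlrange is a writable (n_actuators, 2) array, rather than editing the XML model file.

```python
import gym

# Assumption: gym's Hopper-v3 backed by mujoco_py, where actuator_ctrlrange
# is a writable (n_actuators, 2) numpy array on the loaded model.
env = gym.make("Hopper-v3")
model = env.unwrapped.model

# Narrow the default [-1, 1] control range to [-0.85, 0.85], so the behavior
# policy (SAC) is trained and rolled out under a perturbed actuator range,
# which is what makes the resulting offline dataset more diverse.
model.actuator_ctrlrange[:, 0] = -0.85
model.actuator_ctrlrange[:, 1] = 0.85

obs = env.reset()  # train SAC / collect the offline dataset from here
```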

zaiyan-x commented on September 1, 2024

> Hi @zaiyan-x,
>
> I have tried to tune the ETA params a bit, but it didn't work. I realized that in Hopper-v3 the agent can be terminated before the end of the episode (1000 steps), so I incorporated the not_done signal (the current code on GitHub doesn't have it).
>
> Current version:
> target_Q = reward - gamma * torch.maximum(etas - target_Q, etas.new_tensor(0)) + (1 - rho) * etas * gamma
>
> not_done version:
> target_Q = reward - not_done * (gamma * torch.maximum(etas - target_Q, etas.new_tensor(0)) + (1 - rho) * etas * gamma)
> self.loss_log['robust_target_Q'] = target_Q.mean().item()
>
> I found that the critic loss no longer gets too high, but the eval reward is smaller than with the current version.
>
> [training-curve plot: the red and orange lines are the not_done version; the green one is the current version.]
>
> I'm quite confused by this result. Do you think the missing not_done signal is the reason for the critic-loss problem? And what could cause the lower eval reward?
>
> Thank you so much :D. Best, Linh

It could be. I think we had a discussion on this before haha ;) I am glad you found this issue. Yes, I recommend you fix it this way. As for whether this fixes the whole issue, I am not sure. I still think there is something wrong with the training (not with you, just that this algorithm is very difficult to get working empirically). One thing I am certain of is that max_eta should not decrease to zero. In my implementation, max_eta usually fluctuated within a reasonable range (which kind of made sense to me). You can use that as a signal for whether the training has gone catastrophic. I apologize that I can't give you a definitive suggestion for how to fix this.
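
To make the fix concrete, here is a self-contained sketch of the termination-masked target, assembled from the two lines quoted above rather than taken from the rfqi source; the gamma and rho defaults are placeholders. One detail worth double-checking: the not_done version quoted in the question puts the (1 - rho) * etas * gamma term inside the subtracted bracket, which flips its sign relative to the current version even on non-terminal transitions, and that alone could change the learned values. The sketch below instead masks the whole bootstrapped part while keeping the current version's signs.

```python
import torch

def robust_target(reward, not_done, etas, target_Q, gamma=0.99, rho=0.5):
    """Termination-masked robust target (a sketch, not the rfqi source).

    All inputs are tensors of shape (batch, 1). not_done is 1.0 on
    non-terminal transitions and 0.0 where Hopper-v3 terminated early,
    so the gamma-discounted part is dropped at terminal states.
    """
    zero = etas.new_tensor(0.0)
    # Same terms as the "current version" above, grouped so that the whole
    # bootstrapped part can be masked by not_done.
    bootstrapped = (1 - rho) * gamma * etas - gamma * torch.maximum(etas - target_Q, zero)
    return reward + not_done * bootstrapped
```

With not_done equal to 1 everywhere, this reduces exactly to the current version above.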

zaiyan-x commented on September 1, 2024

> Ohh, I understand. Just one follow-up question to make it clearer for me (of course :D). I see in the paper that you used epsilon-greedy during the data-generation process. Did you add the random actions to make the dataset more diverse, or is there another reason for this choice?

Yes, it is for making the dataset more diverse. :d

Hope this helps,

Zaiyan
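
A short sketch of the epsilon-greedy data collection described here (assumed details, not the authors' script): with probability epsilon a uniformly random action replaces the policy's action, purely to diversify the dataset, and the not_done flag from the earlier discussion is recorded with each transition. It assumes the old gym API (4-tuple step and the TimeLimit.truncated info flag), matching Hopper-v3 with mujoco_py, and that policy(obs) returns an action.

```python
import numpy as np

def collect_episode(env, policy, epsilon=0.1, rng=None):
    """Epsilon-greedy rollout around a trained policy (e.g. SAC)."""
    rng = rng or np.random.default_rng(0)
    transitions = []
    obs, done = env.reset(), False
    while not done:
        if rng.random() < epsilon:
            action = env.action_space.sample()  # random action, for dataset diversity
        else:
            action = policy(obs)
        next_obs, reward, done, info = env.step(action)
        # not_done is 0 only at true terminations, not at the time limit,
        # which matters for the termination-masked target discussed above.
        truncated = info.get("TimeLimit.truncated", False)
        not_done = 0.0 if (done and not truncated) else 1.0
        transitions.append((obs, action, reward, next_obs, not_done))
        obs = next_obs
    return transitions
```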

linhlpv commented on September 1, 2024

Thank you for your suggestion. For now, it seems that using the not_done signal during training keeps the etas stable and within a reasonable range.

linhlpv commented on September 1, 2024

Yup. Thank you so much 👍

Linh
