
awr's People

Contributors

tchlux, xbpeng


awr's Issues

Offline version of AWR

Hi, I am trying to modify AWR into an offline version (or a fully off-policy version). The paper states that one can simply treat the dataset as the replay buffer, with no other modifications needed. But I notice that if I remove the sampling call in rl_agent.train, line 105 in rl_agent.py:
train_return, train_path_count, new_sample_count = self._rollout_train(self._samples_per_iter)
then new_sample_count remains 0, so the number of update steps is also 0.
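For context, here is a minimal sketch of the kind of loop I have in mind: instead of deriving the number of update steps from new_sample_count (which stays 0 without rollouts), run a fixed number of gradient steps per iteration over the static dataset. The function and parameter names here are hypothetical placeholders, not the repo's actual API:

```python
import numpy as np

def offline_train_iter(buffer, update_fn, steps_per_iter=256, batch_size=256, rng=None):
    """Run a fixed number of update steps by sampling a static dataset.

    buffer: list of transitions (the fixed offline dataset).
    update_fn: placeholder for a combined critic/actor update on a batch;
               assumed to return a scalar loss.
    """
    rng = rng or np.random.default_rng(0)
    losses = []
    for _ in range(steps_per_iter):
        # sample uniformly with replacement from the fixed dataset
        idx = rng.integers(0, len(buffer), size=batch_size)
        batch = [buffer[i] for i in idx]
        losses.append(update_fn(batch))
    return float(np.mean(losses))
```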

Could you point out a proper way to modify the code to obtain offline AWR?

Train_Return vs Test_Return

Hi, thank you for sharing the repo!

I was wondering how Train_Return and Test_Return are calculated,
and what the difference between the two is.
I see that one uses norm_a_tf and the other uses sample_a_tf in the code.
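My guess (not confirmed from the code) is that the difference is stochastic vs. deterministic action selection: training rollouts sample from the policy distribution, while evaluation uses the mean action. A toy illustration of that distinction, with made-up numbers:

```python
import numpy as np

rng = np.random.default_rng(0)
# hypothetical Gaussian policy output for one state
mean = np.array([0.1, -0.2])
std = np.array([0.3, 0.3])

# stochastic action, as typically used when collecting training returns
sampled_action = rng.normal(mean, std)
# deterministic (mean) action, as typically used for evaluation returns
mean_action = mean
```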

Why Normalization of vf

Hello,

thanks for the code. While re-implementing the program, I found that there is a step that normalizes the value function vf here. It is implemented as v_predict = v(s; \theta) * (1 - \gamma), and the critic update is implemented as
min_\theta [v(s; \theta) * (1 - \gamma) - v_estimate]^2.

Is there any reason to normalize the value function's output? I tried removing the normalization term and rescaling the learning rate (by 1 - \gamma), and there appears to be no problem on HalfCheetah-v2.

It achieves performance similar to the original version.
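For plain SGD on a squared loss, scaling the prediction by a constant w = 1 - \gamma multiplies the gradient by w^2 relative to fitting the unscaled target directly, so an exact learning-rate compensation would be a factor of (1 - \gamma)^2 rather than (1 - \gamma); with adaptive optimizers the effective factor differs, which may be why rescaling by 1 - \gamma works fine in practice. A scalar check of the gradient identity, using a toy linear "value network" v(s) = \theta * s with made-up numbers:

```python
import numpy as np

gamma = 0.99
w = 1.0 - gamma
theta = 0.5          # toy linear value function: v(s) = theta * s
s, y = 2.0, 0.3      # hypothetical state feature and return target

# gradient of the scaled loss (w * v(s) - y)^2 w.r.t. theta
g_scaled = 2.0 * w * (w * theta * s - y) * s
# gradient of the unscaled loss (v(s) - y / w)^2 w.r.t. theta
g_plain = 2.0 * (theta * s - y / w) * s

# the scaled-prediction gradient equals the plain gradient times w**2
assert np.isclose(g_scaled, g_plain * w**2)
```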

Best,

Parameters used for motion imitation

Hello,

I am trying to use this algorithm (rewritten in PyTorch with Gym vectorized envs) for motion imitation, starting with the PyBullet implementation of the DeepMimic environment. In the paper, section 5.3, there is a comparison of DeepMimic's modified off-policy PPO with AWR and RWR on some of DeepMimic's tasks, but no further information is given on which hyperparameters were used there.

The appendix gives some parameters which I think apply to the usual MuJoCo benchmarks, but I'm not sure whether they also apply to the DeepMimic tasks (for instance, the MLP hidden dimensions of (128, 64) don't seem right for DeepMimic, since the original paper uses (1024, 512)).

Beta (temperature) set to 1.0

In the paper it says the beta value is 0.05, but in the code here it is 1.0 for all environments. Can you provide an explanation for this, please?
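For context, beta enters the AWR actor update through the exponentiated-advantage weights w_i = exp(A_i / beta), so the two values behave very differently: a small beta makes the weights extremely peaked on high-advantage samples. A quick sketch with hypothetical advantage values (the clipping threshold of 20 is also just an illustrative choice):

```python
import numpy as np

beta = 0.05                              # temperature from the paper
advantages = np.array([-1.0, 0.0, 1.0])  # hypothetical advantage estimates

# exponentiated-advantage weights of the AWR actor update
weights = np.exp(advantages / beta)
# weights explode for small beta, so they are typically clipped in practice
weights = np.minimum(weights, 20.0)
```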

Thank you.
Trenton
