Hi,
Thanks for sharing the code!
I have a few questions regarding how to reproduce the results in the paper.
- The README says to run the following command to train a Preference Transformer model:
```bash
# Preference Transformer (PT)
CUDA_VISIBLE_DEVICES=0 python -m JaxPref.new_preference_reward_main --use_human_label True --comment {experiment_name} --transformer.embd_dim 256 --transformer.n_layer 1 --transformer.n_head 4 --env {D4RL env name} --logging.output_dir './logs/pref_reward' --batch_size 256 --num_query {number of query} --query_len 100 --n_epochs 10000 --skip_flag 0 --seed {seed} --model_type PrefTransformer
```
However, I notice that this command leaves `config.use_weighted_sum` at its default value (`config.use_weighted_sum = False`), so the preference attention layer is not used at all and the model falls through to the plain MLP head:
```python
else:
    x = nn.Dense(features=self.inner_dim)(hidden_output)
    x = ops.apply_activation(x, activation=self.activation)
    output = nn.Dense(features=1)(x)
    if self.activation_final != 'none':
        output = ops.apply_activation(output, activation=self.activation_final)
```
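For context, here is my reading of what the weighted-sum branch is supposed to do, based on the paper's description of preference attention: per-timestep rewards are combined with normalized importance weights instead of being predicted by the plain MLP head above. This is only a sketch; `WeightedSumHeadSketch` and both `Dense` scorers are my own placeholders, not the actual implementation in this repo:

```python
import jax
import jax.numpy as jnp
import flax.linen as nn

class WeightedSumHeadSketch(nn.Module):
    """Placeholder module (not from this repo): reward as a weighted sum."""

    @nn.compact
    def __call__(self, hidden_output):
        # hidden_output: (batch, seq_len, embd_dim) transformer features.
        r = nn.Dense(features=1)(hidden_output)       # per-timestep reward r_t
        scores = nn.Dense(features=1)(hidden_output)  # unnormalized importance
        w = jax.nn.softmax(scores, axis=1)            # weights sum to 1 over time
        return jnp.sum(w * r, axis=1)                 # convex combination of r_t
```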
Is it correct that you do not need the `--transformer.use_weighted_sum` flag?
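For the runs with the flag enabled (reported below), I simply appended it to the README command, assuming it can be overridden like the other `--transformer.*` options:

```bash
CUDA_VISIBLE_DEVICES=0 python -m JaxPref.new_preference_reward_main --use_human_label True --comment {experiment_name} --transformer.embd_dim 256 --transformer.n_layer 1 --transformer.n_head 4 --transformer.use_weighted_sum True --env {D4RL env name} --logging.output_dir './logs/pref_reward' --batch_size 256 --num_query {number of query} --query_len 100 --n_epochs 10000 --skip_flag 0 --seed {seed} --model_type PrefTransformer
```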
- I tried to reproduce the paper's results both with and without the `--transformer.use_weighted_sum` flag. Other than this flag, I strictly followed the guidelines in the README. In detail, when training the reward model, I set `--num_query` to 500 for the `*-medium-replay` datasets and 100 for the `*-medium-expert` datasets. When running IQL with the learned reward model, I set `--seq_len=100`, `--eval_interval=5000`, `--config=configs/mujoco_config.py`, and `--eval_episodes=10`. Below are the IQL results over 8 seeds (0–7):
| use_weighted_sum | hopper-medium-replay-v2 | hopper-medium-expert-v2 | walker2d-medium-replay-v2 | walker2d-medium-expert-v2 |
|---|---|---|---|---|
| False | 70.03 (24.06) | 87.31 (13.15) | 75.82 (2.37) | 109.93 (0.83) |
| True | 68.83 (23.34) | 68.54 (32.64) | 76.48 (3.30) | 109.78 (0.47) |
(Values in parentheses denote the standard deviation of the normalized return.)
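For reference, each table entry is the mean (std) over the 8 seeds; a minimal sketch of the aggregation, with placeholder numbers:

```python
import numpy as np

# One final normalized return per seed (placeholder values, 8 seeds).
returns = np.array([70.1, 65.3, 72.8, 68.0, 71.5, 69.9, 74.2, 68.4])
print(f"{returns.mean():.2f} ({returns.std():.2f})")  # -> mean (std), as in the table
```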
Surprisingly, I find that *not* using the preference attention layer performs better. Does this mean that the preference attention layer is not helpful, or am I missing something? Also, do you have any ablation results for each component of your method?
In addition, I failed to reproduce the result on hopper-medium-replay (84.54) either with or without the preference attention layer. Could you take a look at this issue?