
preferencetransformer's People

Contributors

csmile-1006


preferencetransformer's Issues

problem with maze2d video generation

Hello, I'm having a problem generating videos in the maze2d environment. When I call gym_env.render() on maze2d, it doesn't show anything, and when I follow your antmaze approach of calling gym_env.physics.render(), I get an error: AttributeError: 'MazeEnv' object has no attribute 'physics'. Any help, please?
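
For what it's worth, a minimal sketch of a workaround, assuming D4RL's maze2d environments wrap mujoco_py (so offscreen frames come from env.sim.render rather than the dm_control-style env.physics.render used for antmaze); the environment name, image size, and video writer here are illustrative and may need adjusting for your gym/d4rl versions:

import gym
import d4rl  # registers the maze2d environments
import imageio

env = gym.make('maze2d-umaze-v1')
env.reset()

frames = []
for _ in range(100):
    _, _, done, _ = env.step(env.action_space.sample())
    # mujoco_py's MjSim.render returns an RGB array when given a size;
    # offscreen frames come back vertically flipped, hence the [::-1]
    frames.append(env.sim.render(width=256, height=256)[::-1])
    if done:
        break

imageio.mimsave('maze2d.mp4', frames, fps=30)  # .mp4 output needs imageio-ffmpeg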

Question about 'reverse'

Hello, thanks for your great work. I have a question about the parameter "reverse": in your code, if "reverse" is set to True, the data sequence will be (s, a, s, a, …); otherwise it will be (a, s, a, s, …), and you choose the latter. Is there a reason why you put the action before the state? In the paper, it seems the data sequence should be (s, a, s, a, …).
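For concreteness, here is a minimal sketch (not the repository's code; the helper name and shapes are hypothetical) of the two interleavings the question refers to, assuming state and action embeddings of equal dimension:

import numpy as np

def interleave(states, actions, reverse):
    # states:  (T, d) state embeddings
    # actions: (T, d) action embeddings
    # reverse=True  -> (s_0, a_0, s_1, a_1, ...)
    # reverse=False -> (a_0, s_0, a_1, s_1, ...)
    first, second = (states, actions) if reverse else (actions, states)
    T, d = states.shape
    tokens = np.stack([first, second], axis=1)  # (T, 2, d)
    return tokens.reshape(T * 2, d)             # (2T, d) interleaved token sequence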

The proposed preference attention layer does not seem necessary

Hi,

Thanks for sharing the code!

I have a few questions regarding how to reproduce the results in the paper.

  1. The README says to run the following command to train a Preference Transformer model:
# Preference Transformer (PT)
CUDA_VISIBLE_DEVICES=0 python -m JaxPref.new_preference_reward_main --use_human_label True --comment {experiment_name} --transformer.embd_dim 256 --transformer.n_layer 1 --transformer.n_head 4 --env {D4RL env name} --logging.output_dir './logs/pref_reward' --batch_size 256 --num_query {number of query} --query_len 100 --n_epochs 10000 --skip_flag 0 --seed {seed} --model_type PrefTransformer

However, I notice that this command leaves config.use_weighted_sum at its default value:

config.use_weighted_sum = False

which means the preference attention layer is not used at all, and the model falls through to the plain MLP head (see the sketch after this list for what the weighted-sum branch would compute instead):
else:
    x = nn.Dense(features=self.inner_dim)(hidden_output)
    x = ops.apply_activation(x, activation=self.activation)
    output = nn.Dense(features=1)(x)
    if self.activation_final != 'none':
        output = ops.apply_activation(output, activation=self.activation_final)

Is it correct that you do not need the --transformer.use_weighted_sum flag?

  2. I tried to reproduce the paper's results both with and without the --transformer.use_weighted_sum flag. Other than this flag, I strictly followed the guidelines in the README. In detail, when training the reward model, I set --num_query to 500 for *-medium-replay datasets and 100 for *-medium-expert datasets. When running IQL with the learned reward model, I set --seq_len=100, --eval_interval=5000, --config=configs/mujoco_config.py, and --eval_episodes=10. Below are the IQL results for 8 seeds (0-7):
use_weighted_sum | hopper-medium-replay-v2 | hopper-medium-expert-v2 | walker2d-medium-replay-v2 | walker2d-medium-expert-v2
False            | 70.03 (24.06)           | 87.31 (13.15)           | 75.82 (2.37)              | 109.93 (0.83)
True             | 68.83 (23.34)           | 68.54 (32.64)           | 76.48 (3.30)              | 109.78 (0.47)

(Values in the parentheses denote the std of the normalized return.)

Surprisingly, I find that not using the preference attention layer actually performs better. Does this mean that the preference attention layer is not helpful? Or am I missing something? Also, do you have any ablation results regarding each component of your method?

In addition, I failed to reproduce the paper's reported result on hopper-medium-replay (84.54), either with or without the preference attention layer. Could you take a look at this?
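
As referenced above, here is a rough sketch of what I understand the use_weighted_sum branch to compute, based on my reading of the paper rather than the repository's exact code (module and variable names are illustrative): per-timestep rewards are combined through attention-style importance weights instead of the plain MLP head quoted earlier.

import jax.numpy as jnp
import flax.linen as nn

class WeightedSumHead(nn.Module):
    """Preference-attention-style head: a softmax over timesteps yields
    importance weights w_t, and the output is the weighted sum of
    per-timestep rewards r_t."""

    @nn.compact
    def __call__(self, hidden_output):                # hidden_output: (B, T, D)
        r = nn.Dense(features=1)(hidden_output)       # (B, T, 1) per-step reward
        logits = nn.Dense(features=1)(hidden_output)  # (B, T, 1) importance logits
        w = nn.softmax(logits, axis=1)                # weights summing to 1 over time
        return jnp.sum(w * r, axis=1)                 # (B, 1) weighted-sum score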

Do you have any plans to release a human-labeled dataset? If so, when would that be?

The problem that the paper addresses is very interesting, with the surprising difference in performance between human-labeled and synthetic data being particularly noteworthy. The release of a human-labeled dataset would undoubtedly have a significant impact on the advancement of research in this field, and I am eagerly looking forward to it.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.