Hi,
Thanks for sharing the code!
I have a few questions regarding how to reproduce the results in the paper.
- The README says to run the following command to train a Preference Transformer model:
```bash
# Preference Transformer (PT)
CUDA_VISIBLE_DEVICES=0 python -m JaxPref.new_preference_reward_main --use_human_label True --comment {experiment_name} --transformer.embd_dim 256 --transformer.n_layer 1 --transformer.n_head 4 --env {D4RL env name} --logging.output_dir './logs/pref_reward' --batch_size 256 --num_query {number of query} --query_len 100 --n_epochs 10000 --skip_flag 0 --seed {seed} --model_type PrefTransformer
```
However, I notice that this command leaves `config.use_weighted_sum` at its default value (`config.use_weighted_sum = False`), so the preference attention layer is not used at all and the model falls through to the plain MLP head:
```python
else:
    x = nn.Dense(features=self.inner_dim)(hidden_output)
    x = ops.apply_activation(x, activation=self.activation)
    output = nn.Dense(features=1)(x)
    if self.activation_final != 'none':
        output = ops.apply_activation(output, activation=self.activation_final)
```
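For context, here is my reading of what the weighted-sum branch is supposed to do, based on the paper's description of preference attention: per-timestep rewards are combined with normalized importance weights instead of being predicted by the plain MLP head above. This is only a sketch; `WeightedSumHeadSketch` and both `Dense` scorers are my own placeholders, not the actual implementation in this repo:

```python
import jax
import jax.numpy as jnp
import flax.linen as nn

class WeightedSumHeadSketch(nn.Module):
    """Placeholder module (not from this repo): reward as a weighted sum."""

    @nn.compact
    def __call__(self, hidden_output):
        # hidden_output: (batch, seq_len, embd_dim) transformer features.
        r = nn.Dense(features=1)(hidden_output)       # per-timestep reward r_t
        scores = nn.Dense(features=1)(hidden_output)  # unnormalized importance
        w = jax.nn.softmax(scores, axis=1)            # weights sum to 1 over time
        return jnp.sum(w * r, axis=1)                 # convex combination of r_t
```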
Is it correct that you do not need the `--transformer.use_weighted_sum` flag?
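For the runs with the flag enabled (reported below), I simply appended it to the README command, assuming it can be overridden like the other `--transformer.*` options:

```bash
CUDA_VISIBLE_DEVICES=0 python -m JaxPref.new_preference_reward_main --use_human_label True --comment {experiment_name} --transformer.embd_dim 256 --transformer.n_layer 1 --transformer.n_head 4 --transformer.use_weighted_sum True --env {D4RL env name} --logging.output_dir './logs/pref_reward' --batch_size 256 --num_query {number of query} --query_len 100 --n_epochs 10000 --skip_flag 0 --seed {seed} --model_type PrefTransformer
```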
- I tried to reproduce the paper's results both with and without the `--transformer.use_weighted_sum` flag. Other than this flag, I strictly followed the guidelines in the README. In detail, when training the reward model, I set `--num_query` to 500 for the `*-medium-replay` datasets and 100 for the `*-medium-expert` datasets. When running IQL with the learned reward model, I set `--seq_len=100`, `--eval_interval=5000`, `--config=configs/mujoco_config.py`, and `--eval_episodes=10`. Below are the IQL results over 8 seeds (0–7):
| use_weighted_sum | hopper-medium-replay-v2 | hopper-medium-expert-v2 | walker2d-medium-replay-v2 | walker2d-medium-expert-v2 |
|---|---|---|---|---|
| False | 70.03 (24.06) | 87.31 (13.15) | 75.82 (2.37) | 109.93 (0.83) |
| True | 68.83 (23.34) | 68.54 (32.64) | 76.48 (3.30) | 109.78 (0.47) |
(Values in parentheses denote the standard deviation of the normalized return.)
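For reference, each table entry is the mean (std) over the 8 seeds; a minimal sketch of the aggregation, with placeholder numbers:

```python
import numpy as np

# One final normalized return per seed (placeholder values, 8 seeds).
returns = np.array([70.1, 65.3, 72.8, 68.0, 71.5, 69.9, 74.2, 68.4])
print(f"{returns.mean():.2f} ({returns.std():.2f})")  # -> mean (std), as in the table
```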
Surprisingly, I find that *not* using the preference attention layer performs better. Does this mean that the preference attention layer is not helpful, or am I missing something? Also, do you have any ablation results for each component of your method?
In addition, I failed to reproduce the result on hopper-medium-replay (84.54) either with or without the preference attention layer. Could you take a look at this issue?