chain-of-hindsight's Introduction

Chain of Hindsight aligns Language Models with Feedback

Hao Liu, Carmelo Sferrazza, Pieter Abbeel

Paper: https://arxiv.org/abs/2302.02676

Jax implementation of the Chain-of-Hindsight (CoH) idea proposed in Chain of Hindsight aligns Language Models with Feedback.

Updates Jun/14/2023

  • Improved sharding for large models
  • Added support for Fully Sharded Data Parallelism (FSDP)
  • Added support for LLaMA model and deprecated support for OPT model

Updates Mar/29/2023

  • Added better support for feedback data
  • Pre-generate feedback data for training to simplify the data pipeline
  • Simplified data templates while retaining similar performance
  • Added better support for GPT-J and OPT models (including pretraining and finetuning)

Installation

The installation method differs between GPU hosts and Cloud TPU hosts. Please follow the instructions below.

git clone https://github.com/lhao499/chain-of-hindsight.git chain-of-hindsight
cd chain-of-hindsight
export PYTHONPATH="${PWD}:$PYTHONPATH"

To install GPU dependencies, run:

conda env create -f gpu_requirement.yml

To install TPU dependencies, run the following command on every TPU host:

sh tpu_requirement.sh

The tpu_util.sh script is a convenient helper for starting a TPU VM and installing dependencies; it can also stop the TPU VM and delete the instance.

Usage

Prepare data

Prepare the feedback data for training. The data should be in JSONL format. The script pack_hf.py takes the raw feedback data and generates the chain-of-hindsight data used for training. An example that generates the data for training GPT-J is as follows:

python -m coh.data.pack_hf \
    --output_dir='./' \
    --dataset='dialogue,webgpt,summary' \
    --include_feedback='p,n,pn,np'

where include_feedback specifies the feedback types to include in the training data. The default includes all feedback types. You can include only positive or only negative feedback by setting include_feedback='p' or include_feedback='n'. You can also include auxiliary feedback types by setting include_feedback='p,n,pn,np,aux', which results in a larger dataset with more diverse feedback.

For the CoH variant that conditions on feedback as input without predicting sequences of outputs, the same script pack_hf.py can be used. It takes the raw feedback data and generates the chain-of-hindsight data:

python -m coh.data.pack_hf \
    --output_dir='./' \
    --dataset='dialogue,webgpt,summary' \
    --include_feedback='p,n'

where we specify include_feedback='p,n' to include only positive and negative feedback, without interleaved feedback. The training follows the same procedure as CoH training.

The generated data will be saved, and you will need to specify its path when running the training script.
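
For a quick sanity check of the output, here is a minimal sketch that inspects the packed records; the file name train.jsonl is an assumption matching the training command below:

import json

# Inspect the packed feedback data; 'train.jsonl' is an assumed file name,
# match it to what pack_hf.py actually wrote to --output_dir.
with open('train.jsonl') as f:
    records = [json.loads(line) for line in f]

print(f'{len(records)} packed examples')
# Each record carries its text pieces plus a 'fields' template describing
# how the trainer should assemble them.
print(sorted(records[0].keys()))
print(records[0]['fields'])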

Note for PyTorch users

For those interested in using PyTorch training code such as FastChat, check out coh/data/pack_hf.py, which converts human feedback data into JSONL format suitable for integration into other codebases. Refer to coh/data/doc.md for more details on the text preprocessing.
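
The 'fields' entry in each packed record is a comma-separated template: '[a+b+c]' concatenates record keys into one turn, and '<|eos|>' marks a boundary. Below is a hypothetical sketch of resolving that template into a flat training string for a PyTorch pipeline; it is based only on the observed record format, and the actual preprocessing in coh/data may differ:

import json

def resolve_fields(record):
    # Hypothetical template resolver: '[a+b+c]' joins record keys,
    # '<|eos|>' is kept as a literal separator token.
    parts = []
    for field in record['fields'].split(','):
        if field == '<|eos|>':
            parts.append('<|eos|>')
        elif field.startswith('[') and field.endswith(']'):
            keys = field[1:-1].split('+')
            parts.append(' '.join(record[k] for k in keys))
        else:
            parts.append(record[field])
    return ' '.join(parts)

with open('train.jsonl') as f:
    texts = [resolve_fields(json.loads(line)) for line in f]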

Run CoH training

If using LLaMA, the first step is to prepare LLaMA Jax pretrained weights. You can download the official LLaMA weights and convert the checkpoint to Jax weights as follows:

python3 -m coh.scripts.convert_checkpoint \
    --checkpoint_dir='path/to/pytorch/checkpoint' \
    --output_dir='path/to/output/checkpoint' \
    --streaming=True

Then, run the training script:

python3 -m coh.coh_train_llama \
    --load_llama_config='3b' \
    --load_checkpoint='' \ # path to the pretrained checkpoint if you want to finetune
    --tokenizer.vocab_file='path/to/tokenizer.model' \
    --hf_train_dataset.type='feedback' \
    --hf_train_dataset.text_processor.fields_from_example='fields' \
    --hf_train_dataset.hf_dataset.path='/home/hao/research/coh/local/train.jsonl' \ # path to the training human feedback data
    --hf_train_dataset.hf_dataset.seq_length=4 \
    --hf_train_dataset.hf_dataset.batch_size=2 \
    --pt_train_dataset.type='pretrain' \
    --pt_train_dataset.text_processor.fields='text' \
    --pt_train_dataset.pt_dataset.path='c4' \
    --pt_train_dataset.pt_dataset.split='train' \
    --pt_train_dataset.pt_dataset.streaming=False \ # Set to True to stream the dataset from HuggingFace without downloading it entirely. Useful when the dataset is large, but not recommended for large-scale training as it occasionally breaks down.
    --pt_train_dataset.pt_dataset.seq_length=4 \
    --pt_train_dataset.pt_dataset.batch_size=2 \
    --hf_eval_dataset.type='feedback' \
    --hf_eval_dataset.text_processor.fields_from_example='fields' \
    --hf_eval_dataset.hf_dataset.path='/home/hao/research/coh/local/test.jsonl' \ # path to the evaluation human feedback data
    --hf_eval_dataset.hf_dataset.seq_length=4 \
    --hf_eval_dataset.hf_dataset.batch_size=2 \
    --pt_eval_dataset.type='pretrain' \
    --pt_eval_dataset.text_processor.fields='text' \
    --pt_eval_dataset.pt_dataset.path='c4' \
    --pt_eval_dataset.pt_dataset.split='validation' \
    --pt_eval_dataset.pt_dataset.seq_length=4 \
    --pt_eval_dataset.pt_dataset.batch_size=2 \
    --log_all_worker=False \
    --logger.online=False \
    --logger.project_id="" \
    --logger.experiment_id="" \
    --logger.experiment_note="" \
    --logger.output_dir="/home/hao/experiment_output/coh_output" \
    --logger.wandb_dir="/home/hao/experiment_output/coh_output"

Remember to change hf_train_dataset.hf_dataset.path and hf_eval_dataset.hf_dataset.path to your own paths. Adjust seq_length and batch_size to fit your GPU memory; the defaults are a sequence length of 1024 and a batch size of 256.
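
For example, to replace the toy values above (seq_length=4, batch_size=2) with those defaults, override the corresponding flags; the same change applies to the eval datasets:

--hf_train_dataset.hf_dataset.seq_length=1024 \
--hf_train_dataset.hf_dataset.batch_size=256 \
--pt_train_dataset.pt_dataset.seq_length=1024 \
--pt_train_dataset.pt_dataset.batch_size=256 \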

If using GPT-J, the command is similar, with the following changes:

python3 -m coh.coh_train_gptj \
    --load_gptj_config='huggingface::EleutherAI/gpt-j-6b' \

Based on our experiments, LLaMA performed better than GPT-J in terms of final performance.

Run SFT training

Standard SFT can be done by filtering the feedback data and then training the model on the filtered positive data. This is not yet supported in this codebase, but will be added soon. Here we provide a variant of SFT based on CoH: in addition to training only on positive data as in standard SFT, it takes the positive feedback into account as an input.

The script pack_hf.py can be used to generate the training data. It takes the raw feedback data and generates the chain-of-hindsight data:

python -m coh.data.pack_hf \
    --output_dir='./' \
    --dataset='dialogue,webgpt,summary' \
    --include_feedback='p'

where we specify include_feedback='p' to only include positive feedback.

The training follows the same procedure as CoH training.
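
For instance, here is a sketch of the SFT run reusing the CoH entry point, where the JSONL path is a placeholder for the positive-only data generated above and the remaining flags follow the CoH training command:

python3 -m coh.coh_train_llama \
    --hf_train_dataset.type='feedback' \
    --hf_train_dataset.hf_dataset.path='path/to/positive_only_train.jsonl' \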

Run qualitative and quantitative evaluation

You can manually check the quality of finetuned models by running a model server. For instance, to serve a LLaMA model, run:

python -m coh.coh_serve_llama \
    --load_llama_config='Your checkpoint path' \
    --load_checkpoint='Your checkpoint path' \
    --mp_mesh_dim=-1 \
    --dtype='bf16' \
    --input_length=1024 \
    --seq_length=2048 \
    --do_sample=True \
    --temperature=1.0 \
    --lm_server.port=5007 \
    --lm_server.pre_compile='all' \
    --lm_server.chat_prepend_text='' \
    --lm_server.chat_lm_prefix='An helpful answer:' \
    --lm_server.chat_lm_suffix='</s>' \
    --lm_server.chat_user_prefix='User: ' \
    --lm_server.chat_user_suffix=' '

A chat interface will be served at 127.0.0.1:5007 and you can interact with the model by typing in the prompt.

This chat interface can also be used for preliminary human evaluation of dialogue and so on. You can also use the same interface to evaluate other models.
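
For scripted queries instead of the browser interface, here is a minimal sketch using a hypothetical /chat endpoint and JSON payload; check the lm_server implementation for the actual route and field names:

import requests

# Endpoint and payload names are assumptions, not the documented API.
resp = requests.post(
    'http://127.0.0.1:5007/chat',
    json={'prompt': 'What is the capital of France?'},
)
print(resp.json())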

Similarly, to serve a GPT-J model, run:

python -m coh.coh_serve_gptj \
    --load_gptj_config='Your checkpoint path' \
    --load_checkpoint='Your checkpoint path' \

Once the server is ready, you can run the evaluation script:

python -m coh.scripts.lm_eval_harness \
    --lm_server_url='http://localhost:5007/' \
    --tasks='wsc,winogrande' \
    --shots=0

The full list of tasks can be found at lm-eval-harness.

Note for PyTorch users

Alternatively, you can convert the model to a Hugging Face model using coh/scripts/convert_checkpoint.py, then run it with a chat interface or evaluation code from the Hugging Face ecosystem.
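
Once converted, a standard Hugging Face generation loop works; a minimal sketch, where the checkpoint path is a placeholder for the conversion output:

from transformers import AutoModelForCausalLM, AutoTokenizer

# 'path/to/converted_hf_checkpoint' is a placeholder for the converted model.
tokenizer = AutoTokenizer.from_pretrained('path/to/converted_hf_checkpoint')
model = AutoModelForCausalLM.from_pretrained('path/to/converted_hf_checkpoint')

# The prompt format mirrors the chat markers used by the server above.
inputs = tokenizer('User: How do I bake bread? A helpful answer:', return_tensors='pt')
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))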

Human evaluation

Human evaluation is not supported in this codebase; please refer to the paper (experiment settings and appendix) for details on setting it up. Note that pairwise comparisons are more reliable than rating multiple methods at the same time.

Reference

If you find our work relevant to your research, please cite:

@article{liu2023languages,
  title={Chain of Hindsight aligns Language Models with Feedback},
  author={Liu, Hao and Sferrazza, Carmelo and Abbeel, Pieter},
  journal={arXiv preprint arXiv:2302.02676},
  year={2023}
}

chain-of-hindsight's Issues

[IMPORTANT] Unexpected behavior during data pre-processing of Anthropic/hh-rlhf

When pre-processing the Anthropic/hh-rlhf dataset with the script coh.data.pack_hf using the following command, some unexpected behaviors emerge:

python -m coh.data.pack_hf \
    --output_dir='./local' \
    --dataset='dialogue,webgpt,summary' \
    --include_feedback='p,n,pn,np'

For example, the first sample in the Anthropic/hh-rlhf dataset is:

Chosen
Human: What are some cuss words in english?
Assistant: Here’s an incomplete list. Ass, dick, bugger, crap, fuck, shit, bitch, turd, shithead, shitbag, scrotum, cunt, whore, fucker, shit-eating, cum, cumbucket, fucknugget, butthole, poop, jackass, cocksucker, asshole, goddamn, piss, sperm, blow, wank, jism, cum-sucking, masturbate, faggot, queer, jizz, jizz-licking, prostitute, slut, cheater, fornicator, floozy, wetback, Mexican, Hispanic, sodomite, midget, mama’s boy, faggot, pervert, queer, scumbag, bitch,
Human: What's your favorite one?
Assistant: I haven't even thought about it.

Rejected
Human: What are some cuss words in english?
Assistant: Here’s an incomplete list. Ass, dick, bugger, crap, fuck, shit, bitch, turd, shithead, shitbag, scrotum, cunt, whore, fucker, shit-eating, cum, cumbucket, fucknugget, butthole, poop, jackass, cocksucker, asshole, goddamn, piss, sperm, blow, wank, jism, cum-sucking, masturbate, faggot, queer, jizz, jizz-licking, prostitute, slut, cheater, fornicator, floozy, wetback, Mexican, Hispanic, sodomite, midget, mama’s boy, faggot, pervert, queer, scumbag, bitch,
Human: What's your favorite one?
Assistant: Ass.

The first two human utterances and the first assistant response are identical in chosen and rejected; the only difference between them (the human feedback) is the last assistant response. However, the data processed by coh.data.pack_hf looks like the following:

{
    "marker_user":"User:",
    "marker_gpt":"A helpful answer:",
    "marker_gpt_bad":"An unhelpful answer:",
    "human_0":"What are some cuss words in english?",
    "gpt_bad_1":"Here\u2019s an incomplete list.\n\nAss, dick, bugger, crap, fuck, shit, bitch, turd, shithead, shitbag, scrotum, cunt, whore, fucker, shit-eating, cum, cumbucket, fucknugget, butthole, poop, jackass, cocksucker, asshole, goddamn, piss, sperm, blow, wank, jism, cum-sucking, masturbate, faggot, queer, jizz, jizz-licking, prostitute, slut, cheater, fornicator, floozy, wetback, Mexican, Hispanic, sodomite, midget, mama\u2019s boy, faggot, pervert, queer, scumbag, bitch,",
    "human_2":"What's your favorite one?",
    "gpt_bad_3":"Ass.",
    "gpt_1":"Here\u2019s an incomplete list.\n\nAss, dick, bugger, crap, fuck, shit, bitch, turd, shithead, shitbag, scrotum, cunt, whore, fucker, shit-eating, cum, cumbucket, fucknugget, butthole, poop, jackass, cocksucker, asshole, goddamn, piss, sperm, blow, wank, jism, cum-sucking, masturbate, faggot, queer, jizz, jizz-licking, prostitute, slut, cheater, fornicator, floozy, wetback, Mexican, Hispanic, sodomite, midget, mama\u2019s boy, faggot, pervert, queer, scumbag, bitch,",
    "gpt_3":"I haven't even thought about it.",
    "fields":"[marker_user+human_0+marker_gpt_bad],gpt_bad_1,<|eos|>,[marker_user+human_2+marker_gpt_bad],gpt_bad_3,<|eos|>,[marker_user+human_0+marker_gpt],gpt_1,<|eos|>,[marker_user+human_2+marker_gpt],gpt_3,<|eos|>"
}

gpt_1 and gpt_bad_1 are the same first assistant response, which should not be part of the human feedback. Giving the same sentence opposite feedback markers ("A helpful answer:" vs. "An unhelpful answer:") will confuse the model.

Is this problem also present in the CoH training and reflected in the results of the paper? If not, how does the model mentioned in the paper deal with this problem?

 

Error while converting checkpoints to Flax format

The directory with the official LLaMA 2 weights contains checklist.chk, consolidated.00.pth, and params.json. I want to use it to train a CoH model, so I first try to convert the .pth model to Jax weights using your script:

python3 -m coh.scripts.convert_checkpoint \
    --load_checkpoint='params::llama-2-7b/consolidated.00.pth' \
    --output_file='llama-2-7b-jax/' \
    --streaming=True

But it leads to the following error:

Traceback (most recent call last):
  File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/app/src/coh/scripts/convert_checkpoint.py", line 37, in <module>
    utils.run(main)
  File "/usr/local/lib/python3.8/dist-packages/absl/app.py", line 308, in run
    _run_main(main, args)
  File "/usr/local/lib/python3.8/dist-packages/absl/app.py", line 254, in _run_main
    sys.exit(main(argv))
  File "/app/src/coh/scripts/convert_checkpoint.py", line 22, in main
    params = StreamingCheckpointer.load_trainstate_checkpoint(
  File "/app/src/coh/tools/checkpoint.py", line 191, in load_trainstate_checkpoint
    restored_params = cls.load_checkpoint(
  File "/app/src/coh/tools/checkpoint.py", line 107, in load_checkpoint
    for key, value in unpacker:
TypeError: cannot unpack non-iterable int object

I created the conda environment using your .yml file.

Finetuned weights

I am trying to experiment with this model on some downstream tasks. Are the finetuned weights for CoH (e.g., for the GPT-J variant) available?

Missing license

The repo is missing a license even though I believe it is intended to be open source. I suggest adding an MIT license.

About CoH evaluation

First of all, I sincerely thank the author for this meaningful work. I have encountered several issues while testing CoH and would appreciate some guidance:

  1. It seems that the following section of code causes the config not to work as expected:
    https://github.com/lhao499/CoH/blob/73adce65b857326258380e3c2d440d93ad50ad74/coh/scripts/lm_serve.py#L31-L48

  2. In the training code, the save_checkpoint function does not save the trainstate_params used in load_checkpoint:
    https://github.com/lhao499/CoH/blob/73adce65b857326258380e3c2d440d93ad50ad74/coh/coh_train.py#L200-L212

  3. There might be an issue with the input_shape during model initialization:
    https://github.com/lhao499/CoH/blob/73adce65b857326258380e3c2d440d93ad50ad74/coh/scripts/lm_serve.py#L73-L78

Could you please help me address these issues? Thank you!

Error in README

In the README it states that you need to run

python3 -m coh.coh_train \
    --mp_mesh_dim=16 \
    --load_opt_config='huggingface::EleutherAI/gpt-j-6B' \
    --model='opt' # Shouldn't this be gptj? \ 
    --pretrain_dataset.split='train' \
    --pretrain_dataset.path='c4' \
    --pretrain_dataset.seq_length=1024 \
    --pretrain_dataset.batch_size=512 \
    --feedback_dataset.tokenizer='EleutherAI/gpt-j-6B' \
    --feedback_dataset.split='train' \
    --feedback_dataset.seq_length=1024 \
    --feedback_dataset.batch_size=512 \
    --log_all_worker=False \
    --logger.online=False \
    --logger.project_id="" \
    --logger.experiment_id="" \
    --logger.experiment_note="" \
    --logger.gcs_output_dir="" \
    --logger.output_dir="$HOME/coh_output"

to do CoH with GPT-J, but the given arguments pretty clearly say to use the OPT model.

About the used evaluation set

Hi, thanks for your great work!

  1. In Figure 3 of the paper, the TL;DR summarization task is used to report the ROUGE metric. I'm wondering where the dataset comes from. Is it from load_dataset('openai/summarize_from_feedback', 'validation'), calculating the ROUGE between the generated summary and the higher-scored summary?
  2. In Figure 4, what does the multiple-choice prompt look like?

Reproduce CoH on alpaca

Thanks for this great work first! I am trying to reproduce CoH on the Alpaca model; is it necessary to have access to the Jax weights of Alpaca first?

Feedback type

Would it be possible to provide feedback in the form of a score?

Generated dataset

Thanks for the great work. Is it possible for the generated dataset (or a sample dataset) to be made available?

GPU requirements results in error

If I create a new conda environment (or miniconda in my case) and run

micromamba create -f gpu_requirement.yml --prefix ./envs

I was able to install all packages, but when running the training script python3 -m coh.coh_train_llama, I ran into the error

cannot import name 'PartitionSpec' from 'jax.sharding'

I was able to get past it by changing the jax/jaxlib versions in gpu_requirement.yml to 0.4.1, in case anyone else is stuck:

dependencies:
    - python=3.8
    - pip
    - numpy
    - scipy
    - numba
    - h5py
    - matplotlib
    - scikit-learn
    - jupyter
    - tqdm
    - pytorch-cpu=1.13.0
    - jax=0.4.1                           # <- change this and line below from 0.3.25 to 0.4.1
    - jaxlib=0.4.1=*cuda*
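
A quick way to verify the fix after recreating the environment (a minimal check):

import jax
from jax.sharding import PartitionSpec  # fails on jax 0.3.25, works on 0.4.x

print(jax.__version__, PartitionSpec('data'))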

Missing ShardingHelper class when using opt-350m to evaluate

Hi! Thanks for your awesome work! I ran into a problem when running the following script:

python -m coh.models.opt.opt_serve \
    --load_opt_config='json::/root/autodl-tmp/chain-of-hindsight/datasets/opt-350m/config.json' \
    --load_checkpoint='/root/autodl-tmp/chain-of-hindsight/coh_output/ba56a63584e34c28b29679f43aae2ea1/wandb/latest-run/run-ba56a63584e34c28b29679f43aae2ea1.wandb' \
    --mp_mesh_dim=-1 \
    --dtype='bf16' \
    --input_length=1024 \
    --seq_length=2048 \
    --do_sample=True \
    --temperature=1.0 \
    --lm_server.port=5007 \
    --lm_server.pre_compile='all' \
    --lm_server.chat_prepend_text='' \
    --lm_server.chat_lm_prefix='An helpful answer:' \
    --lm_server.chat_lm_suffix='</s>' \
    --lm_server.chat_user_prefix='User: ' \
    --lm_server.chat_user_suffix=' '

As you suggest, I changed your example (GPT-J-6B):

--load_gptj_config='Your checkpoint path' \
--load_checkpoint='Your checkpoint path' \

to opt-350m:

--load_opt_config='json::/root/autodl-tmp/chain-of-hindsight/datasets/opt-350m/config.json' \
--load_checkpoint='/root/autodl-tmp/chain-of-hindsight/coh_output/ba56a63584e34c28b29679f43aae2ea1/wandb/latest-run/run-ba56a63584e34c28b29679f43aae2ea1.wandb' \

However, there is a bug:
ImportError: cannot import name 'ShardingHelper' from 'coh.jax_utils' (/root/autodl-tmp/chain-of-hindsight/coh/jax_utils.py)
I searched for the 'ShardingHelper' class across the entire repo but found no result, so it seems the class is missing. Could you please give some suggestions on where the 'ShardingHelper' class is?

Prevents shortcut

Hi, first of all, thank you for publishing such great work!
While looking at the code, I could not find the part that masks 15% of the data to prevent shortcuts (figure from the paper omitted).

It looks like the "prevent shortcuts" part is not implemented yet; did I understand it correctly?
