open-instruct

Training Open Instruction-Following Language Models

This repo serves as an open effort on instruction-tuning popular pretrained language models on publicly available datasets. We release this repo and will keep updating it with:

  1. Code for finetuning language models with the latest techniques and instruction datasets in a unified format.
  2. Code for running standard evaluation on a range of benchmarks targeting different capabilities of these language models.
  3. Checkpoints or other useful artifacts that we build in our exploration.

Please see our first paper How Far Can Camels Go? Exploring the State of Instruction Tuning on Open Resources for more thoughts behind this project and our initial findings.

Tülu (a hybrid camel) represents a suite of LLaMa models that we built by fully-finetuning them on a strong mix of datasets.

News

  • [2023-08-18] Added support for ToxiGen/TruthfulQA evaluation. See scripts/eval/ for examples of running them.
  • [2023-08-08] Supported several new instruction datasets, including LIMA, WizardLM, and Open-Orca. See the preparation script for details. Performance has not been evaluated yet.
  • [2023-08-06] Supported LLaMa 2 finetuning and FlashAttention-2 by bumping the version of transformers and many other dependencies.
  • [2023-06-29] Added licensing info for our released models.
  • [2023-06-09] Released Tülu (a suite of LLaMa models fully-finetuned on a strong mix of datasets) and many other checkpoints on HuggingFace [Links].
  • [2023-06-09] Initial release of the codebase containing the training and evaluation code for our arXiv paper.

Setup

To run training, evaluation, or inference for our finetuned models, you need to install the required packages by running the following command (after installing PyTorch):

pip install -r requirements.txt

If you just want the dependencies for the weight diff script, use:

pip install -r weight-diff-requirements.txt

Training

Dataset preparation

We include a collection of representative instruction datasets in our exploration and are adding new ones to our list. We unify them into the same chat format. To download and prepare these datasets, simply run the following command:

./scripts/prepare_train_data.sh

Please check these datasets for licenses and restrictions around their use!
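As a rough sketch of what the unified chat format looks like, each example is a list of role-tagged messages. The field names below are an assumption for illustration; check the preparation script for the exact schema used by the repo:

```python
import json

# Hypothetical record in the unified chat format; exact field names
# may differ from the preparation script's output.
example = {
    "dataset": "dolly",
    "id": "dolly_0",
    "messages": [
        {"role": "user", "content": "What is instruction tuning?"},
        {"role": "assistant",
         "content": "Finetuning a pretrained model on instruction-response pairs."},
    ],
}

# Such datasets are commonly stored one JSON object per line (JSONL).
line = json.dumps(example)
record = json.loads(line)
```
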

Model preparation

Generally, most huggingface-compatible causal language models should work fine with our codebase, potentially with some adjustments for different tokenizers, etc. Some models may require additional requests to download. E.g., for LLaMa 1 and 2, please consult the Hugging Face documentation for requesting access and converting them to a huggingface-compatible format.

Finetuning

You can use the following command to run instruction tuning (finetuning a pretrained model to follow instructions):

./scripts/finetune_with_accelerate.sh

Make sure to adjust model_name_or_path, tokenizer_name, train_file, and output_dir to match your model, data, and setting. By default, this uses DeepSpeed with accelerate.
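Under the hood, instruction tuning is ordinary causal-LM training; in many setups (an assumption here, not confirmed by this README) the loss is computed only on response tokens, with prompt positions masked via a sentinel label such as -100, following the Hugging Face convention. A pure-Python sketch of that masked cross-entropy:

```python
import math

def masked_cross_entropy(token_logprobs, labels, ignore_index=-100):
    """Average negative log-likelihood over non-masked positions.

    token_logprobs: log-probability the model assigned to each target token.
    labels: target token ids, with prompt positions set to ignore_index so
            they contribute no loss (assumed convention, as in Hugging Face).
    """
    losses = [-lp for lp, y in zip(token_logprobs, labels) if y != ignore_index]
    return sum(losses) / len(losses)

# Toy sequence: two prompt tokens (masked out) and two response tokens.
logprobs = [math.log(0.9), math.log(0.8), math.log(0.5), math.log(0.25)]
labels = [-100, -100, 17, 42]
loss = masked_cross_entropy(logprobs, labels)
# Only the last two positions count: (-ln 0.5 - ln 0.25) / 2
```
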

Released Checkpoints

We provide a number of model checkpoints that we trained. You can find them on Hugging Face here. Here are some quick links to the checkpoints that are finetuned from LLaMa 1:

| Datasets ↓ / Model Sizes → | 7B   | 13B  | 30B  | 65B  |
|----------------------------|------|------|------|------|
| SuperNI                    | link | link |      |      |
| CoT                        | link | link |      |      |
| Flan V2                    | link | link |      |      |
| Dolly                      | link | link |      |      |
| Open Assistant 1           | link | link |      |      |
| ShareGPT                   | link | link | link | link |
| Self-instruct (original)   | link | link |      |      |
| Unnatural Instructions     | link | link |      |      |
| Alpaca                     | link | link |      |      |
| Code-Alpaca                | link | link |      |      |
| GPT4-Alpaca                | link | link |      |      |
| Baize                      | link | link |      |      |
| Human-Mix                  | link | link | link | link |
| Tulu                       | link | link | link | link |

We also trained Pythia and OPT models on the Tulu mixture (aka the Human+GPT mixture); they are available here.

Weight diff script

Some of the checkpoints are released as weight diffs against the base model (mostly for LLaMa 1). We use a slightly modified form of the Alpaca weight diff script, which works the same way.

To merge a model:

  1. Download the relevant LLaMa model and convert it to Hugging Face format (see above).
  2. Download our repository and install the right dependencies (see above).
  3. Download the model diff you want.
  4. Run the command below:
python scripts/weight_diff.py recover --path_raw ${hf_llama_path} --path_tuned ${output_path} --path_diff ${diff_location}
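Conceptually, the recover step just adds the released diff back onto the base weights, parameter by parameter. A simplified sketch using plain lists in place of model tensors (the real script operates on checkpoint tensors; names here are illustrative):

```python
def recover(raw_weights, diff_weights):
    """Reconstruct tuned weights as raw + diff for each named parameter."""
    assert raw_weights.keys() == diff_weights.keys(), "parameter names must match"
    return {
        name: [r + d for r, d in zip(raw_weights[name], diff_weights[name])]
        for name in raw_weights
    }

# Toy "checkpoints": one parameter with two scalar weights.
base = {"layer0.weight": [0.1, -0.2]}
diff = {"layer0.weight": [0.05, 0.3]}
tuned = recover(base, diff)
```
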

Evaluation

Benchmark-based eval

We provide scripts for running evaluation of Huggingface/OpenAI models on a list of standard benchmarks targeting the core capabilities of large language models. These benchmarks include MMLU, ToxiGen, and TruthfulQA, among others; see scripts/eval/ for the full set.

We are working on including more promising benchmarks into this list. Please stay tuned!

You can use the following script to download all the evaluation data:

./scripts/prepare_eval_data.sh

Evaluation scripts for different datasets are under ./scripts/eval. For example, you can use the following command to run the MMLU evaluation script:

./scripts/eval/mmlu.sh
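Most benchmark scripts ultimately reduce to comparing model predictions against references; for a multiple-choice benchmark like MMLU, the headline metric is plain accuracy. A toy sketch (not the repo's actual implementation):

```python
def accuracy(predictions, references):
    """Fraction of examples where the predicted choice matches the reference."""
    assert len(predictions) == len(references), "one prediction per example"
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)

# Hypothetical multiple-choice outputs for four questions.
preds = ["A", "C", "B", "D"]
refs = ["A", "B", "B", "D"]
acc = accuracy(preds, refs)  # 3 of 4 correct
```
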

Model-based eval

We support using GPT-4 to evaluate the quality of a model's responses, following the GPT-4 evaluation protocol proposed in AlpacaFarm. To run this AlpacaFarm eval, please make sure you install our fork of AlpacaFarm (https://github.com/hamishivi/alpaca_farm) and use the following script:

python eval/alpaca_farm_eval.py --model <model> --batch_size 8

Please check the script for more details.
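The AlpacaFarm-style protocol asks the judge model to pick the better of two responses per prompt; the headline number is the model's win rate over a reference model. A toy sketch, counting ties as half a win (one common convention, and an assumption here rather than the repo's documented choice):

```python
def win_rate(judgments):
    """judgments: list of 'win', 'loss', or 'tie' from the pairwise judge.

    Returns the fraction of comparisons won, with ties counted as 0.5.
    """
    score = sum(1.0 if j == "win" else 0.5 if j == "tie" else 0.0
                for j in judgments)
    return score / len(judgments)

# Hypothetical judge verdicts over four prompts.
rate = win_rate(["win", "loss", "tie", "win"])  # (1 + 0 + 0.5 + 1) / 4
```
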

Human evaluation

We will release our human evaluation interface and data soon!

Licensing

This codebase is licensed under Apache 2.0 as given in LICENSE.

The license we use for the models released (along with the base model licenses) can be found in model_licenses/tulu_license.txt - just replace <MODELNAME> with the actual model name (i.e., the name on HuggingFace).

Citation

If you used this repository or our models, please cite our work:

@misc{wang2023far,
   title={How Far Can Camels Go? Exploring the State of Instruction Tuning on Open Resources}, 
   author={Yizhong Wang and Hamish Ivison and Pradeep Dasigi and Jack Hessel and Tushar Khot and Khyathi Raghavi Chandu and David Wadden and Kelsey MacMillan and Noah A. Smith and Iz Beltagy and Hannaneh Hajishirzi},
   year={2023},
   eprint={2306.04751},
   archivePrefix={arXiv},
   primaryClass={cs.CL}
}

Contributors

  • yizhongw
  • hamishivi
  • dsttsd
  • eltociear
