
Crosslingual Generalization through Multitask Finetuning

This repository provides an overview of all components used to create BLOOMZ, mT0, and xP3, introduced in the paper Crosslingual Generalization through Multitask Finetuning.

Data

| Name | Explanation | Example models |
|---|---|---|
| xP3x | Mixture of 17 tasks in 277 languages with English prompts | WIP - Join us at Project Aya @C4AI to help! |
| xP3 | Mixture of 13 training tasks in 46 languages with English prompts | BLOOMZ & mT0-13B |
| xP3mt | Mixture of 13 training tasks in 46 languages with prompts in 20 languages (machine-translated from English) | BLOOMZ-MT & mT0-13B-MT |
| xP3all | xP3 + our evaluation datasets, adding an additional 3 tasks for a total of 16 tasks in 46 languages with English prompts | |
| xP3megds | Megatron-DeepSpeed processed version of xP3 | BLOOMZ |
| P3 | Repreprocessed version of the English-only P3 with 8 training tasks | BLOOMZ-P3 & mT0-13B-P3 |

Models

| Parameters | 300M | 580M | 1.2B | 3.7B | 13B | 560M | 1.1B | 1.7B | 3B | 7.1B | 176B |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Multitask finetuned on xP3 (recommended for prompting in English) | mt0-small | mt0-base | mt0-large | mt0-xl | mt0-xxl | bloomz-560m | bloomz-1b1 | bloomz-1b7 | bloomz-3b | bloomz-7b1 | bloomz |
| Multitask finetuned on xP3mt (recommended for prompting in non-English) | | | | | mt0-xxl-mt | | | | | bloomz-7b1-mt | bloomz-mt |
| Multitask finetuned on P3 (released for research purposes only; strictly inferior to the models above) | | | | | mt0-xxl-p3 | | | | | bloomz-7b1-p3 | bloomz-p3 |
| Original pretrained checkpoints (not recommended) | mt5-small | mt5-base | mt5-large | mt5-xl | mt5-xxl | bloom-560m | bloom-1b1 | bloom-1b7 | bloom-3b | bloom-7b1 | bloom |
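The finetuned checkpoints can be queried directly with Hugging Face transformers. Below is a minimal prompting sketch for the smallest BLOOMZ checkpoint; any bloomz-* model above can be swapped in, and the mt0-* models work the same way with AutoModelForSeq2SeqLM instead of AutoModelForCausalLM.

# Minimal prompting sketch for a BLOOMZ checkpoint (decoder-only).
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "bigscience/bloomz-560m"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)

inputs = tokenizer.encode("Translate to English: Je t'aime.", return_tensors="pt")
outputs = model.generate(inputs, max_new_tokens=10)
print(tokenizer.decode(outputs[0]))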

Create xP3(x)

We have processed & uploaded xP3. If you want to recreate it, follow these steps:

  1. Get promptsource: for xP3mt, git clone -b xp3mt https://github.com/Muennighoff/promptsource.git; for xP3, git clone -b tr13 https://github.com/Muennighoff/promptsource.git. Then install it: cd promptsource; pip install -e .
  2. Get the required packages: pip install -q datasets iso-639
  3. Get the creation script & edit it if necessary:
  • For xP3mt, set USE_ENGLISH_PROMPTS = False at the beginning
  • For xP3, set USE_ENGLISH_PROMPTS = True at the beginning
  4. Run the script, e.g. via python prepare_xp3.py or a SLURM script

For the new extension of xP3, xP3x, the process is largely the same except:

  1. Install the xp3x branch instead, i.e. pip install git+https://github.com/Muennighoff/promptsource.git@xp3x
  2. The creation script is in this repository & named create_xp3x.py.

xP3x is a superset of xP3, so unless you want to reproduce the paper, we recommend always using xP3x (or xP3mt if you want machine-translated prompts).
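Whether you recreate the data or use the uploaded version, the records can be inspected with the datasets library. A minimal sketch, assuming the hub repository exposes per-language folders of jsonl files (the data_files glob below is illustrative; check the dataset repo for the actual layout):

# Minimal sketch for streaming a slice of xP3 from the Hugging Face Hub.
# The "en/*.jsonl" glob is an assumed layout; adjust to the actual repo structure.
from datasets import load_dataset

ds = load_dataset(
    "bigscience/xP3",
    data_files={"train": "en/*.jsonl"},
    split="train",
    streaming=True,
)
for example in ds.take(3):
    # Each record has an "inputs" (prompted source) and a "targets" (answer) field.
    print(example["inputs"][:200], "->", example["targets"][:100])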

Train models

BLOOMZ

  1. Download the pretrained model checkpoint, which is of shape PP=12, TP=4, DP=4. If you'd like to reshape the model you will also need to download the universal checkpoint. If you want to continue finetuning, you should use our finetuned checkpoint, which is of shape PP=72, TP=1, DP=4.
  2. Setup the training code: git clone -b t0loading https://github.com/bigscience-workshop/Megatron-DeepSpeed & follow its setup guide to create an environment with necessary packages.
  3. Download the Megatron-DeepSpeed processed xP3megds or repreprocess it for Megatron-DeepSpeed yourself by downloading xP3, removing the merged_{lang}.jsonl files & preprocessing it using the script here.
  4. Setup & run the training script: We use SLURM scripts available at bigscience-workshop/bigscience/train/tr13-mtf and referred to as xp3capmixnewcodelonglossseq. E.g. this is the script launched to train bloomz. Important parts of the script to modify are:
  • #SBATCH variables, such as nodes, gpus, time, etc. - Our SLURM guide is here
  • source $six_ALL_CCFRWORK/start-tr13f-6B3-ml-t0 to point to your own conda environment setup via Megatron-DeepSpeed
  • PATH environment variables, notably
    • TRAIN_DATA_PATH & VALID_DATA_PATH, which point to text files listing your processed training and validation data. We provide our files in this repository (xp3capmixnewcodelong_train.txt & xp3capmixnewcodelong_validation.txt), but you will likely want to change the paths inside (a small path-rewriting sketch follows this list). The percentages per language are based on how much each language makes up in xP3, with code being slightly upsampled.
  • PP_SIZE=72, TP_SIZE=1 & BATCH SIZE & co, which specify the layout. This will depend on the hardware available to you. If you change them, you may have to reshape the model. For reshaping you need to use the universal checkpoint and pass the --universal flag in the script. We recommend saving a new checkpoint right after & then continuing training without --universal, which will be faster.
  • If you want to restart from a saved checkpoint (e.g. after training a few steps like above), make sure to remove the --no-load-optim & --reset-progress flags
  • After training, you can convert the checkpoint to transformers format using the script here
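As mentioned above, the provided data-path files will likely need their paths changed. A minimal sketch for rewriting the path prefixes, assuming the files contain whitespace-separated sampling weights and dataset path prefixes (the OLD_PREFIX / NEW_PREFIX values are purely illustrative):

# Rewrite dataset path prefixes inside the provided data-path files so they point
# at your own preprocessed xP3megds data. Assumes whitespace-separated tokens where
# anything containing "/" is a path prefix; the sampling weights are left untouched.
OLD_PREFIX = "/old/cluster/path/"      # hypothetical prefix used in the shipped files
NEW_PREFIX = "/my/cluster/xp3megds/"   # wherever your preprocessed data lives

for name in ["xp3capmixnewcodelong_train.txt", "xp3capmixnewcodelong_validation.txt"]:
    with open(name) as f:
        tokens = f.read().split()
    rewritten = [t.replace(OLD_PREFIX, NEW_PREFIX) if "/" in t else t for t in tokens]
    with open(name.replace(".txt", "_local.txt"), "w") as f:
        f.write(" ".join(rewritten) + "\n")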

Helpful resources:

mT0

Follow the finetuning instructions here, making sure to use the pretrained mT5 models & the xP3 dataset.
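If you would rather stay in PyTorch than follow the T5X recipe, a minimal finetuning sketch with Hugging Face transformers on xP3-style (inputs, targets) records might look as follows. The data path, sequence lengths and hyperparameters are illustrative assumptions, not the configuration used in the paper:

# Minimal sketch: finetune an mT5 checkpoint on xP3-style jsonl records with
# "inputs"/"targets" fields. All paths and hyperparameters are placeholders.
from datasets import load_dataset
from transformers import (AutoModelForSeq2SeqLM, AutoTokenizer,
                          DataCollatorForSeq2Seq, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments)

checkpoint = "google/mt5-small"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

ds = load_dataset("json", data_files="xp3_en/*.jsonl", split="train")  # hypothetical path

def tokenize(batch):
    model_inputs = tokenizer(batch["inputs"], max_length=1024, truncation=True)
    labels = tokenizer(text_target=batch["targets"], max_length=256, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized = ds.map(tokenize, batched=True, remove_columns=ds.column_names)

trainer = Seq2SeqTrainer(
    model=model,
    args=Seq2SeqTrainingArguments(
        output_dir="mt0-small-xp3",
        per_device_train_batch_size=8,
        learning_rate=1e-4,
        num_train_epochs=1,
    ),
    train_dataset=tokenized,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()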

Helpful resources:

Evaluate models

Evaluation results are all available in this repository: https://huggingface.co/datasets/bigscience/evaluation-results under the respective models. Below we explain how to run evaluation.

Rank Evaluation

We evaluate the models on Rank Evaluation on XCOPA, XNLI, XStoryCloze & XWinograd:

  1. Get promptsource fork: git clone -b xp3mt https://github.com/Muennighoff/promptsource.git & cd promptsource; pip install -e .
  2. Get t-zero fork: git clone -b muennighoff/upgrdps https://github.com/Muennighoff/t-zero.git & cd t-zero; pip install -e .
  3. Download model & run evaluation script, for example for bloomz.
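For reference, rank evaluation simply scores every candidate answer under the model and picks the highest-scoring one. A minimal sketch for a decoder model is below; the t-zero fork above handles batching, prompts and the encoder-decoder case for mT0, so this is only to illustrate the idea (the prompt and options are made up):

# Rank-evaluation sketch: pick the answer option with the highest log-likelihood.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "bigscience/bloomz-560m"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint).eval()

def option_logprob(prompt: str, option: str) -> float:
    # Assumes the prompt/option boundary falls on a token boundary.
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(prompt + option, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    logprobs = torch.log_softmax(logits[0, :-1], dim=-1)  # predictions for tokens 1..n
    targets = full_ids[0, 1:]
    scores = logprobs[torch.arange(targets.shape[0]), targets]
    return scores[prompt_len - 1:].sum().item()            # keep only the option tokens

prompt = "The movie was great. The sentiment of the previous sentence is"
options = [" positive", " negative"]
print(max(options, key=lambda o: option_logprob(prompt, o)))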

Generation Evaluation

We evaluate generation on translation & summarization during training for validation:

  1. Get promptsource fork: git clone -b xp3mt https://github.com/Muennighoff/promptsource & cd promptsource; pip install -e .
  2. Get bigscience-workshop/lm-evaluation-harness: git clone https://github.com/bigscience-workshop/lm-evaluation-harness. The script for the 7.1B model, for example, is here.

We also evaluate code generation on HumanEval:

  1. Get code evaluation code git clone https://github.com/loubnabnl/bloom-code-evaluation & go through its setup.
  2. Set prepend_eos to False in code_eval.py at complete_code(model, tokenizer, prompt, num_completions=1, prepend_eos=True, **gen_kwargs) i.e. complete_code(model, tokenizer, prompt, num_completions=1, prepend_eos=False, **gen_kwargs).
  3. Download the model & run the evaluation script, swapping out MODEL_CKPT for your path; for example, for bloomz use this.
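Results for code generation are typically reported as pass@k. For reference, here is a minimal sketch of the standard unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k); this is not taken from bloom-code-evaluation itself:

# pass@k: probability that at least one of k sampled completions passes the tests,
# given that c of n sampled completions pass. Standard unbiased estimator.
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:
        return 1.0
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

print(pass_at_k(n=20, c=3, k=1))  # 0.15 when 3 of 20 samples pass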

Plots & Tables

Plots

  • Figure 1: plotstables/xp3_taxonomy.drawio & plotstables/xp3_taxonomy.pdf
  • Figure 2: plotstables/xp3_languages.ipynb & colab
  • Figure 3: plotstables/xp3_variants.pdf & drawings
  • Figure 4: plotstables/xp3_generalization_bar.pdf & colab
  • Figure 5: plotstables/lang_generalization & colab
  • Figure 6: plotstables/scale.pdf & colab
  • Figure 7: plotstables/validation.pdf & colab
  • Figure 8: plotstables/pretraining_sizes.pdf & colab
  • Figure 9: plotstables/english_task_generalization.pdf & colab
  • Figure 10: plotstables/task_generalization.pdf & colab
  • Figure 11: plotstables/roots_xp3_languages.pdf & colab requiring some of the files in plotstables/contamination
  • Figure 12: plotstables/examples/bloom_code_example.py & plotstables/examples/bloom_code_light.pdf & plotstables/examples/bloomz_code_light.pdf; The raw code files can be found here & here
  • Figure 13 - Figure 16: plotstables/examples/*.pdf & plotstables/examples/generations.drawio

Tables

Citation

@article{muennighoff2022crosslingual,
  title={Crosslingual generalization through multitask finetuning},
  author={Muennighoff, Niklas and Wang, Thomas and Sutawika, Lintang and Roberts, Adam and Biderman, Stella and Scao, Teven Le and Bari, M Saiful and Shen, Sheng and Yong, Zheng-Xin and Schoelkopf, Hailey and others},
  journal={arXiv preprint arXiv:2211.01786},
  year={2022}
}


xmtf's Issues

Why does the number of templates differ between languages?

Hello, thank you for your inspiring work!

I assumed that for xP3mt, all languages would have the same number of templates within a dataset, as they are all machine-translated from the English templates. However, while taking a look at xP3mt, I noticed that the number of templates differs between languages within the same dataset.
For example, XCOPA has 12, 10 and 5 templates for English, Chinese and Italian, respectively.

It seems like,

  1. Not all English templates were MT'd to other languages. For XCOPA, only 5 out of 12 were MT'd from English to other languages.
  2. Some languages have paraphrased duplicates. For XCOPA-zh, each template has a paraphrased version, resulting in 10 (5x2) templates.

I checked that XQuAD is also like this.

Your paper explains the experiments in great detail; however, I believe this detail was not mentioned. Could you please provide some additional information about this decision? Thank you in advance.

Questions about the data

Hi 😀

First of all, thank you for your very interesting work 🚀

I was wondering about two points for which I couldn't find an answer myself (maybe I didn't search well), and I would need your help.

  1. For a given task and language, I would like to know which prompt was used for finetuning. For example, take French summarization: I searched for which prompts were used for French summarization but didn't find a list summarizing this information. PromptSource provides 2085 prompts in English, but nothing about translations into other languages. Does such a list exist? 🤔

  2. To work around the previous point, I thought I would download the xP3mt dataset and read directly which prompts were used. The problem is that you can download all the data for a selected language, but you can't apply an additional filter on the task/(sub)dataset. Would this be something that could be added?
    Or even better, create individual multilingual datasets of the translations you have done. For example, being able to upload an "mSamSum", a multilingual version of "SamSum", which is purely English originally. This would probably allow it to be reused in other works, especially monolingual ones. To return to the example of French summarization, little data is currently available: Orangesum, XLSum and Wiki-lingua. Having easy access to the translations of CNN Daily Mail, Gigaword, MultiNews, SamSum and XSum would enable very interesting things 🤯

Some datasets are not in xP3all

Hello! It seems that some datasets like xnli and xwinogrande-ru are not in xP3all on Hugging Face. Will they be uploaded later? Thank you!

What is the training config?

Hello, thanks for your work! I want to try to reproduce this work myself, but I can't reach the high performance of xP3 and mT0-xxl shown in the paper Crosslingual Generalization through Multitask Finetuning. I'm wondering about the training details: how many steps did you train the model for, and what is your lr-decay-ratio? Could I get the config file to reproduce your results? Thank you very much!

How to reproduce bloomz-*

Thank you for contributing such excellent work.
I notice that bloomz-* outperforms bloom-* via instruction tuning. I want to build a new bloomz-* model on top of a bloom model (e.g. bloom-1b7 -> bloomz-1b7-mt), but after finetuning the bloom-1b7 model on some instruction data from xP3mt, the performance drops a lot.
I use a batch size of 2048 and a learning rate of 2e-5, and labels on inputs are masked.
What else do I need to pay attention to? Or are there scripts to do this?

Controlled generation

Hi!
Thanks for the amazing job!

I have a couple of quick questions. I'm trying to use mT0-xxl-mt for QA. When I provide a context and ask a question whose subject is not present in the context, the model still provides something from the context even if it's totally wrong. The ideal scenario in this case would be for the model to output something like "I cannot answer this question with this context."

  1. Is that possible without heavy training on additional data?
  2. A bias question: if I train the model on additional data, would it still provide "good" answers when the subject of the question is in the context?

Export mt0-xxl-mt to ONNX fails

Hello, guys!
As the title says, I'm trying to export mt0-xxl-mt (with some adjustments, which I specify later) to ONNX, but the export fails every time.
Regarding the model adjustments: I loaded the model from Hugging Face in 8-bit precision, fine-tuned it on my downstream task with LoRA/PEFT, and am now trying to export it to ONNX.
I've just realized that in the state_dict of both the base model from Hugging Face and the model after LoRA/PEFT finetuning there is a curious layer named 'weight_format' with the value 'row' instead of a weight tensor. The export to ONNX fails because the export function tries to apply the detach() method to that value, which obviously raises an error.
So my questions are:

  1. What is the 'weight_format' layer and what does it stand for?
  2. If I just remove this layer from the state_dict and the model architecture, will it cause further errors or model instability?
  3. Is there a "good" way to export this model to ONNX without adjusting the state_dict and the model architecture?

bloomz-mt universal checkpoint

Hello!
Thanks a lot for your work!
I want to finetune bloomz-mt with your Megatron-DeepSpeed, but I cannot find a universal checkpoint of bloomz-mt or bloomz. I only found the bloom universal checkpoint below:
https://huggingface.co/bigscience/bloom-optimizer-states/tree/global_step95000_universal

With limited GPUs, I have to use TP 4, PP 12 to finetune, but I found that you suggest not merging TP in the document below. So I want to find the bloomz-mt universal checkpoint:
https://github.com/bigscience-workshop/bigscience/blob/master/train/tr11-176B-ml/finetune.md

mT0-xxl finetuning

Hello!
Thanks a lot for your work! I'm using mT0-xxl for a question answering task, but it doesn't perform as well as I expected, so I'm trying to finetune the model a little. If I understood correctly, I first need the checkpoint and gin file for the model I want to finetune. Could you please share these?
Also, is it possible to finetune it with torch, or is tf the only way?

Is mT0 suitable for continued training on span corruption task?

Is mT0 suitable / recommended for continued training on a mixture of denoising tasks (span corruption, extreme span corruption, prefix LM) similar to UL2? For example:

# span_corruption
{
"text_input": "The <extra_id_0> walks in <extra_id_1> park", 
"text_output": "<extra_id_0> cute dog <extra_id_1> the <extra_id_2>"
}

# extreme_span_corruption
{
"text_input": "The <extra_id_0> park", 
"text_output": "<extra_id_0> cute dog walks in the <extra_id_1>"
}

# prefix LM
{
"text_input": "The cute <extra_id_0>", 
"text_output": "<extra_id_0> dog walks in the park"
}

My domain text is quite different from internet text, so I assume a span corruption task would help mT0 learn the special syntax/semantics of my domain.

Were the checkpoints selected based on the held-out performance or seen task performance?

Hi, thank you again for your awesome work.

Your paper states that "We select the final checkpoint based on validation performance."
Does the "validation performance" mean held-out performance, or seen task performance measured on their available eval subsets?

It seems like there are mixed approaches in the literature. While T0 checkpoints were picked solely based on seen task performance, Flan-T5 checkpoints were picked based on held-out performance.

When I first read your paper, I assumed they were picked based on held-out performance, but I recently found that prepare_xp3_train.py saves seen task validation sets separately when available.

It would help us a lot if you could please provide additional information on this. Thank you.

Questions on creating instruction data

Thanks for the great work!

I have a few questions regarding data creation of xP3 after following the guide here to create instruction data on the code language subset.

  1. I noticed that the total number of samples in the public processed data (from here) for the code split is 2707724. However, the data I produced following the GitHub guide above contains many more samples (approximately >3M). I wonder if there was any additional post-processing to get the final instruction data for tuning?

  2. Following the GitHub guide above, I noticed there was no prompt for the dataset State Changes. I got this warning when running the creation code:
    Tried instantiating `DatasetTemplates` for Fraser/python-state-changes, but no prompts found. Please ignore this warning if you are creating new prompts for this dataset.

Is this dataset not assigned any prompt (similar to how HumanEval was treated), or is the version of PromptSource I used below incorrect?
git clone -b tr13 https://github.com/Muennighoff/promptsource.git, then cd promptsource; pip install -e .

Getting machine-translated prompts of xP3mt

Hi,

Thank you for the very interesting work and releasing the code. It is very helpful!

Is there a way I can get the machine-translated prompts per task?

For example, how would I get the Spanish (es) prompt for Paws-x only?

  • The HuggingFace repo bigscience/xP3mt seems to contain the input/output pairs in Spanish for all the training tasks. Is there a way I can get the input/output pairs for Paws-x only?
  • In the creation script data/xp3/prepare_xp3_train.py, setting USE_ENGLISH_PROMPTS to False seems to load prompts in different languages from PromptSource, but PromptSource only has prompts in English for Paws-x (https://github.com/bigscience-workshop/promptsource/tree/main/promptsource/templates/paws-x)

Also, more generally, how do you machine-translate prompts if the language is written right-to-left instead of left-to-right, or has a different word order such as subject-object-verb instead of subject-verb-object? Would the target come before the input, or would you reorder the sentences in the input (i.e. premise or hypothesis) in the prompt? And if the target comes before the input, how would the model work, since it generates from left to right?

Thank you,
Derek

Parsing the xP3 dataset

Q1: I am trying to extract the Arabic instructions from the xP3 dataset, and I want to put them in the format: “Instruction”, “Input”, and “Output”. Currently, the data is in this format: “inputs” and “targets”.

I found that the instructions are sometimes in the last part of the "inputs", preceded by \n, and sometimes have no delimiter at all. In other cases, the instructions are at the beginning of the "inputs", etc.

Here is an example where the instruction is at the end of the input, but without any delimiter to recognize it.
File: xp3_GEM_xlsum_arabic_train_xp3longrest.jsonl
{"inputs":"...\nووسط هذه القلة يقف أيضا شقيقيها الفنان فيصل لعيبي، الذي أثر كثيرا في تطورها الفني وباتت تشكل معه ثنائيا فنيا مميزا، يجعل من أعمالهما في حوار دائم، فتحمل كثيرا من الوشائج والتشابهات الأسلوبية والشكلية لكنها تفترق في التوجه. ففي الوقت الذي يسعى فيصل إلى تأصيل فنه في قلب منجز الرسم العراقي بالتركيز على الخصوصية العراقية واللمسة المحلية والنهل من التراث الفني الرافديني في مراحله المختلفة وعكسه بلغة فنية حداثوية معاصرة، تسعى عفيفة إلى تمييز نفسها عنه بالتحليق في فضاء إنساني عام، مبعدة لوحاتها عن أي ... Write the rest of the article:","targets":"حوارا مع نظرتها ويركز عليها ليكتشف أنها تنظر في مكان آخر أو ربما في ماضٍ بعيد. \n\nوتعنى عفيفة باختيار

Q2: I found many incomplete inputs and outputs, e.g. containing this string:
"... Continue the article for another 4000 characters max:","targets":"."}
What should we do in such cases?
Thanks
Hamdy Mubarak
