
mgpt's Introduction

mGPT

Multilingual Generative Pretrained Transformer

MIT license

We introduce mGPT, a multilingual variant of GPT-3, pretrained on 61 languages from 25 linguistically diverse language families using Wikipedia and the C4 corpus. We detail the design and pretraining procedure. The models undergo intrinsic and extrinsic evaluation: language modeling in all languages, downstream evaluation on cross-lingual NLU datasets and benchmarks in 33 languages, and world knowledge probing in 23 languages. The in-context learning abilities are on par with contemporaneous language models while covering a larger number of languages, including underrepresented and low-resource languages of the Commonwealth of Independent States and the small peoples of Russia. The source code and the language models are available under the MIT license.

[Paper] [Habr (Russian)] [Hugging Face mGPT-1.3B Model Card] [Hugging Face mGPT-13B Model Card] [Papers With Code]

Setting up environment

pip install -r requirements.txt

Pretrain data

The model was pretrained on 600 GB of texts, mainly from MC4 and Wikipedia.

  • MC4
  • Wikipedia (version 20201101)

The Wikipedia texts are extracted from the dumps (v. 20201101) with WikiExtractor (Attardi, 2015). The training data was deduplicated: each text in the corpus is hashed with a 64-bit hash, and only texts with a unique hash are kept. We also filter the documents based on their text compression rate using zlib; the most strongly and the most weakly compressing deduplicated texts are discarded.
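As an illustration of this pipeline, the sketch below applies 64-bit hashing for deduplication and zlib-based compression-rate filtering; the hash function and the thresholds are assumptions chosen for demonstration, not the exact values used for mGPT.

import hashlib
import zlib

def dedup_and_filter(docs, low=0.1, high=0.9):
    # docs: iterable of document strings; low/high thresholds are illustrative.
    seen = set()
    kept = []
    for text in docs:
        # 64-bit hash: keep only the first document with a given hash.
        digest = hashlib.blake2b(text.encode("utf-8"), digest_size=8).digest()
        if digest in seen:
            continue
        seen.add(digest)
        # Compression rate = compressed size / original size (via zlib).
        raw = text.encode("utf-8")
        rate = len(zlib.compress(raw)) / max(len(raw), 1)
        # Discard the most strongly and most weakly compressing texts.
        if low < rate < high:
            kept.append(text)
    return kept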

Transformers usage 🤗

from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("sberbank-ai/mGPT")
model = GPT2LMHeadModel.from_pretrained("sberbank-ai/mGPT")
model.cuda()  # move the model to the GPU

text = "Александр Сергеевич Пушкин родился в "
input_ids = tokenizer.encode(text, return_tensors="pt").cuda()
out = model.generate(
        input_ids,
        min_length=100,
        max_length=100,
        eos_token_id=5,
        pad_token_id=1,
        top_k=10,
        top_p=0.0,
        no_repeat_ngram_size=5,
)
generated_text = list(map(tokenizer.decode, out))[0]
print(generated_text)
# Output: Александр Сергеевич Пушкин родился в  г. Санкт-Петербурге.

Choosing the best parameters:

In general:

eos_token_id=5, 
pad_token=1,
do_sample=True,
top_k=0,
top_p=0.8,
no_repeat_ngram_size=4

For English generation: top_p=0.95, top_k=0
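As a sketch, these settings plug into generate roughly as follows, reusing the model, tokenizer, and input_ids from the snippet above (pad_token_id is used here instead of pad_token, which recent transformers versions reject):

out = model.generate(
    input_ids,
    do_sample=True,
    max_length=100,
    eos_token_id=5,
    pad_token_id=1,            # pad_token in older transformers versions
    top_k=0,
    top_p=0.8,                 # use top_p=0.95 for English generation
    no_repeat_ngram_size=4,
)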

Examples

mGPT Generation Examples

Open In Colab

mGPT Fine-tuning example

Open In Colab

Languages supported

Afrikaans (af), Arabic (ar), Armenian (hy), Azerbaijani (az), Basque (eu), Bashkir (ba), Belarusian (be), Bengali (bn), Bulgarian (bg), Burmese (my), Buryat (bxr), Chuvash (cv), Danish (da), English (en), Estonian (et), Finnish (fi), French (fr), Georgian (ka), German (de), Greek (el), Hebrew (he), Hindi (hi), Hungarian (hu), Indonesian (id), Italian (it), Japanese (ja), Javanese (jv), Kalmyk (xal), Kazakh (kk), Korean (ko), Kyrgyz (ky), Latvian (lv), Lithuanian (lt), Malay (ms), Malayalam (ml), Marathi (mr), Mongolian (mn), Ossetian (os), Persian (fa), Polish (pl), Portuguese (pt), Romanian (ro), Russian (ru), Spanish (es), Swedish (sv), Swahili (sw), Tatar (tt), Telugu (te), Thai (th), Turkish (tr), Turkmen (tk), Tuvan (tyv), Ukrainian (uk), Uzbek (uz), Vietnamese (vi), Yakut (sah), Yoruba (yo)

Pretraining

[mGPT-1.3B Model Card] [mGPT-13B Model Card]

We utilize the DeepSpeed library and Megatron-LM. We pretrain our LMs with a total batch size of 2048 and a context window of 512 tokens. The total number of training steps is 600k, and the models saw 400B tokens during pretraining. Pretraining took 14 days on a cluster of 256 V100 GPUs for mGPT-1.3B and 22 days on 512 V100 GPUs for mGPT-13B.
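For illustration only, the stated hyperparameters map onto a DeepSpeed-style configuration roughly as below; the per-GPU micro-batch size, precision setting, and ZeRO stage are assumptions, not the published mGPT training configuration.

# Illustrative sketch: restates the numbers above in a DeepSpeed-style config.
ds_config = {
    "train_batch_size": 2048,               # total batch size across all GPUs
    "train_micro_batch_size_per_gpu": 8,    # assumed value
    "fp16": {"enabled": True},              # assumed mixed-precision setting
    "zero_optimization": {"stage": 2},      # assumed ZeRO stage
}
seq_length = 512         # context window from the description above
total_steps = 600_000    # ~400B tokens seen during pretraining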

Monolingual models:

Habr article about the monolingual mGPT-1.3B models (Russian)

Monolingual models on Hugging Face:

Contributing

We welcome community contributions to the model and celebrate enhancements to both its inference and training techniques.

Cite Us

@article{shliazhko2024mgpt,
 title={mGPT: Few-Shot Learners Go Multilingual},
 author={Shliazhko, Oleh and Fenogenova, Alena and Tikhonova, Maria and Kozlova, Anastasia and Mikhailov, Vladislav and Shavrina, Tatiana},
 journal={Transactions of the Association for Computational Linguistics},
 volume={12},
 pages={58--79},
 year={2024},
 publisher={MIT Press}
}

mgpt's People

Contributors

ollmer, tatianashavrina, vmkhlv


mgpt's Issues

add web demo to Huggingface

Hi, would you be interested in adding an mGPT web demo to Hugging Face using Gradio in the https://huggingface.co/sberbank-ai organization?

Here is a guide for adding Spaces to your org:

How to add a Space: https://huggingface.co/blog/gradio-spaces

Example spaces with repos:
github: https://github.com/salesforce/BLIP
Spaces: https://huggingface.co/spaces/salesforce/BLIP

github: https://github.com/facebookresearch/omnivore
Spaces: https://huggingface.co/spaces/akhaliq/omnivore

A Gradio demo can be set up in two lines of code using the Inference API integration through Hugging Face:

import gradio as gr
gr.Interface.load("huggingface/sberbank-ai/mGPT").launch()

This would launch the demo.

Please let us know if you would be interested and if you have any questions.

Can you share the script for training your tokenizer?

Hello @TatianaShavrina and @ollmer, I need to extend the vocabulary of this model. Can you share the script you used for training your tokenizer? From my experience, using ByteLevelBPETokenizer is a bit complex, especially when working with non-Latin scripts such as Arabic or Cyrillic, as shown here. So I am wondering how you did that without this issue, and I would like to use your script to extend this model's vocab. I am extending the vocab for some scripts not yet covered, such as Devanagari and Ol Chiki.
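(Not an official answer.) A generic ByteLevel BPE training sketch with the tokenizers library looks roughly like the following; the corpus files, vocabulary size, and special tokens are placeholders rather than the settings actually used for mGPT.

import os
from tokenizers import ByteLevelBPETokenizer

# Placeholder corpus files and hyperparameters, not the actual mGPT settings.
tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["corpus_part1.txt", "corpus_part2.txt"],
    vocab_size=100_000,
    min_frequency=2,
    special_tokens=["<pad>", "<s>", "</s>", "<unk>", "<mask>"],
)
os.makedirs("tokenizer_out", exist_ok=True)
tokenizer.save_model("tokenizer_out")  # writes vocab.json and merges.txt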

Fine-tuning with specific words for a non-English language

Hi

Is it possible to fine-tune the LLM (mGPT) with specific words?
I want the model to answer with just a few available words from my dictionary, not all the possible words for a specific language (I'm talking about Persian).

My dictionary of around 2000 words is in a CSV or text file.

Would it be possible to do this? If yes, would you please guide me through it?

Thanks
Best regards
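(Not an official answer.) One possible approach, sketched under the assumption that the word list can be mapped to token ids and reusing the model, tokenizer, and input_ids from the README example above, is to constrain decoding with generate's prefix_allowed_tokens_fn; the word list below is a placeholder.

# Sketch: restrict generation to token ids derived from an allowed word list.
# `allowed_words` would be loaded from the user's CSV/text file.
allowed_words = ["سلام", "دنیا"]  # placeholder entries
allowed_ids = set()
for word in allowed_words:
    allowed_ids.update(tokenizer.encode(word))
    allowed_ids.update(tokenizer.encode(" " + word))  # mid-sentence variant
allowed_ids = sorted(allowed_ids)

def allowed_tokens(batch_id, input_ids):
    # Allow the same restricted set of token ids at every decoding step.
    return allowed_ids

out = model.generate(
    input_ids,
    max_length=50,
    prefix_allowed_tokens_fn=allowed_tokens,
)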

add new language to the model?

Hi All,

Thank you for releasing the model.

I was wondering: is it possible to add support for a new language (Czech) through fine-tuning?
I mean, I can collect a textual corpus in Czech and fine-tune the mGPT model for several epochs; do you think it can work?
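(Not an official answer; the repository's own fine-tuning Colab above is the reference.) A minimal causal-LM fine-tuning sketch with the transformers Trainer, where the Czech corpus path and the hyperparameters are placeholders, could look like this:

from datasets import load_dataset
from transformers import (DataCollatorForLanguageModeling, GPT2LMHeadModel,
                          GPT2Tokenizer, Trainer, TrainingArguments)

tokenizer = GPT2Tokenizer.from_pretrained("sberbank-ai/mGPT")
model = GPT2LMHeadModel.from_pretrained("sberbank-ai/mGPT")
if tokenizer.pad_token is None:            # make sure padding is defined
    tokenizer.pad_token = tokenizer.eos_token

# Placeholder: a plain-text file with one Czech sentence or paragraph per line.
dataset = load_dataset("text", data_files={"train": "czech_corpus.txt"})
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True,
    remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="mgpt-czech",
        num_train_epochs=3,                # placeholder
        per_device_train_batch_size=2,     # placeholder
        learning_rate=1e-5,                # placeholder
    ),
    train_dataset=dataset["train"],
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()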

Cannot run the generation example notebook - which transformers version to use?

(Using Colab Free with T4)

Running !pip install transformers==4.10.3
results in

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers==4.10.3
Using cached transformers-4.10.3-py3-none-any.whl (2.8 MB)
Requirement already satisfied: filelock in /usr/local/lib/python3.10/dist-packages (from transformers==4.10.3) (3.12.0)
Collecting huggingface-hub>=0.0.12 (from transformers==4.10.3)
Using cached huggingface_hub-0.15.1-py3-none-any.whl (236 kB)
Requirement already satisfied: numpy>=1.17 in /usr/local/lib/python3.10/dist-packages (from transformers==4.10.3) (1.22.4)
Requirement already satisfied: packaging in /usr/local/lib/python3.10/dist-packages (from transformers==4.10.3) (23.1)
Requirement already satisfied: pyyaml>=5.1 in /usr/local/lib/python3.10/dist-packages (from transformers==4.10.3) (6.0)
Requirement already satisfied: regex!=2019.12.17 in /usr/local/lib/python3.10/dist-packages (from transformers==4.10.3) (2022.10.31)
Requirement already satisfied: requests in /usr/local/lib/python3.10/dist-packages (from transformers==4.10.3) (2.27.1)
Collecting sacremoses (from transformers==4.10.3)
Using cached sacremoses-0.0.53-py3-none-any.whl
Collecting tokenizers<0.11,>=0.10.1 (from transformers==4.10.3)
Using cached tokenizers-0.10.3.tar.gz (212 kB)
Installing build dependencies ... done
Getting requirements to build wheel ... done
Preparing metadata (pyproject.toml) ... done
Requirement already satisfied: tqdm>=4.27 in /usr/local/lib/python3.10/dist-packages (from transformers==4.10.3) (4.65.0)
Requirement already satisfied: fsspec in /usr/local/lib/python3.10/dist-packages (from huggingface-hub>=0.0.12->transformers==4.10.3) (2023.4.0)
Requirement already satisfied: typing-extensions>=3.7.4.3 in /usr/local/lib/python3.10/dist-packages (from huggingface-hub>=0.0.12->transformers==4.10.3) (4.5.0)
Requirement already satisfied: urllib3<1.27,>=1.21.1 in /usr/local/lib/python3.10/dist-packages (from requests->transformers==4.10.3) (1.26.15)
Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.10/dist-packages (from requests->transformers==4.10.3) (2022.12.7)
Requirement already satisfied: charset-normalizer~=2.0.0 in /usr/local/lib/python3.10/dist-packages (from requests->transformers==4.10.3) (2.0.12)
Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.10/dist-packages (from requests->transformers==4.10.3) (3.4)
Requirement already satisfied: six in /usr/local/lib/python3.10/dist-packages (from sacremoses->transformers==4.10.3) (1.16.0)
Requirement already satisfied: click in /usr/local/lib/python3.10/dist-packages (from sacremoses->transformers==4.10.3) (8.1.3)
Requirement already satisfied: joblib in /usr/local/lib/python3.10/dist-packages (from sacremoses->transformers==4.10.3) (1.2.0)
Building wheels for collected packages: tokenizers
error: subprocess-exited-with-error

× Building wheel for tokenizers (pyproject.toml) did not run successfully.
│ exit code: 1
╰─> See above for output.

note: This error originates from a subprocess, and is likely not a problem with pip.
Building wheel for tokenizers (pyproject.toml) ... error
ERROR: Failed building wheel for tokenizers
Failed to build tokenizers
ERROR: Could not build wheels for tokenizers, which is required to install pyproject.toml-based projects

I looked in notebooks/mgpt_huggingface_generation_example.ipynb and saw that you used transformers==4.23.1. So I tried the above notebook (from your README) with !pip install transformers==4.23.1. This also did not run:

model.cuda()
model.eval()
transformers.set_seed(1337)
for text in texts:
    input_ids = tokenizer.encode(text, return_tensors="pt").cuda()
    out = model.generate(
        input_ids,
        min_length=100,
        max_length=100,
        eos_token_id=5,
        pad_token=1,
        do_sample=True,
        top_k=0,
        top_p=0.9,
        no_repeat_ngram_size=4)
    generated_text = list(map(tokenizer.decode, out))[0]
    print('---')
    print(generated_text)

results in

---------------------------------------------------------------------------

ValueError Traceback (most recent call last)

in <cell line: 4>()
4 for text in texts:
5 input_ids = tokenizer.encode(text, return_tensors="pt").cuda()
----> 6 out = model.generate(
7 input_ids,
8 min_length=100,

2 frames

/usr/local/lib/python3.10/dist-packages/transformers/generation_utils.py in _validate_model_kwargs(self, model_kwargs)
907
908 if unused_model_args:
--> 909 raise ValueError(
910 f"The following model_kwargs are not used by the model: {unused_model_args} (note: typos in the"
911 " generate arguments will also show up in this list)"

ValueError: The following model_kwargs are not used by the model: ['pad_token'] (note: typos in the generate arguments will also show up in this list)

Which transformers version to use for generation?
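Going by the traceback, recent transformers versions validate the generate kwargs and reject pad_token; the same call with that argument renamed to pad_token_id should pass validation (a suggested fix, not an official answer):

out = model.generate(
    input_ids,
    min_length=100,
    max_length=100,
    eos_token_id=5,
    pad_token_id=1,   # renamed from pad_token
    do_sample=True,
    top_k=0,
    top_p=0.9,
    no_repeat_ngram_size=4,
)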

Can you share the data stat?

You have provided the languages covered in the paper. Could you also provide the amount of data per language that was used for pretraining?

mGPT 13B parameters

Hello! Impressive work, especially since mGPT might be the only multilingual LM for text generation so far. The paper mentions that there is another version with 13 billion parameters. I assume its performance will be even better. Is this version available somewhere, even for research purposes? Thank you so much!

"Loss: nan" in training

When trying to train mGPT through a notebook, after a while the loss becomes NaN.
[screenshot: training loss]
In this case, generation is disrupted:
[screenshot: generated output]
Training file size: 45 MB
Learning rate: 1e-5
GPU: Nvidia Tesla P100

1.3B model

When I load the 1.3B model, it takes more than 50 GB. Any idea why?

Non-English prompts

Hi,

could you share the non-English prompts (as in Appendix G but for the other languages)?

Thanks!
