ikergarcia1996 / easy-translate

License: Apache License 2.0

Languages: Python 100.00%

Topics: 4-bit, 8-bit, cpu, easy-to-use, gpu, huggingface, huggingface-hub, huggingface-transformers, m2m100, machine-translation

easy-translate's Introduction




Easy-Translate is a script for translating large text files with a 💥SINGLE COMMAND💥. Easy-Translate is designed to be as easy as possible for beginners and as seamless and customizable as possible for advanced users. We support almost any model, including M2M100, NLLB200, SeamlessM4T, LLaMA, Bloom, and more 🥳. We also provide a script for Easy-Evaluation of your translations 📋

Easy-Translate is built on top of 🤗HuggingFace's Transformers and 🤗HuggingFace's Accelerate library.

We currently support:

  • CPU / multi-CPU / GPU / multi-GPU / TPU acceleration
  • BF16 / FP16 / FP32 / 8-bit / 4-bit precision
  • Automatic batch size finder: forget CUDA OOM errors. Set an initial batch size; if it doesn't fit, we will automatically adjust it.
  • Multiple decoding strategies: Greedy Search, Beam Search, Top-K Sampling, Top-p (nucleus) sampling, etc. See Decoding Strategies for more information.
  • Load huge models in a single GPU with 8-bits / 4-bits quantization and support for splitting the model between GPU and CPU. See Loading Huge Models for more information.
  • LoRA models support
  • Support for any Seq2SeqLM or CausalLM model from HuggingFace's Hub.
  • Prompt support! See Prompting for more information.
  • 🆕 Added support for SeamlessM4T!

Test the 🔌 Online Demo here: https://huggingface.co/spaces/Iker/Translate-100-languages

Supported Models

💥 EasyTranslate now supports any Seq2SeqLM (m2m100, nllb200, SeamlessM4T, small100, mbart, MarianMT, T5, FlanT5, etc.) and any CausalLM (GPT2, LLaMA, Vicuna, Falcon, etc.) model from the 🤗 Hugging Face Hub! We still recommend M2M100, NLLB200 or SeamlessM4T for the best results, but you can experiment with any other MT model, as well as prompt LLMs to generate translations (see the Prompting section for more details). You can also check the examples folder for examples of how to use EasyTranslate with different models.

M2M100

M2M100 is a multilingual encoder-decoder (sequence-to-sequence) model trained for Many-to-Many multilingual translation, introduced in this paper and first released in this repository.

M2M100 can directly translate between 9,900 directions of 100 languages.

NLLB200

No Language Left Behind (NLLB) open-sources models capable of delivering high-quality translations directly between any pair of 200+ languages — including low-resource languages like Asturian, Luganda, Urdu and more. It aims to help people communicate with anyone, anywhere, regardless of their language preferences. It was introduced in this paper and first released in this repository.

NLLB can directly translate between 200+ languages across more than 40,000 translation directions.

SeamlessM4T

SeamlessM4T is a collection of models designed to provide high-quality translation, allowing people from different linguistic communities to communicate effortlessly through speech and text. It was introduced in this paper and first released in this repository.

SeamlessM4T can directly translate between 196 languages for text input/output.
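For reference, a minimal text-to-text run could look like the following sketch. The facebook/hf-seamless-m4t-large checkpoint name and the three-letter language codes are assumptions on our part; see the examples folder for tested configurations.

# Model name and language codes below are assumptions; check the examples folder.
python3 translate.py \
--sentences_path sample_text/en.txt \
--output_path sample_text/en2es.translation.seamless-m4t-large.txt \
--source_lang eng \
--target_lang spa \
--model_name facebook/hf-seamless-m4t-large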

Other MT Models supported

We support every MT model on the 🤗 Hugging Face Hub. If you find a model that doesn't work, please open an issue for us to fix it or submit a PR with the fix.

See the Supported languages table for the supported languages and their IDs.

Citation

If you use this software, please cite:

@inproceedings{garcia-ferrero-etal-2022-model,
    title = "Model and Data Transfer for Cross-Lingual Sequence Labelling in Zero-Resource Settings",
    author = "Garc{\'\i}a-Ferrero, Iker  and
      Agerri, Rodrigo  and
      Rigau, German",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2022",
    month = dec,
    year = "2022",
    address = "Abu Dhabi, United Arab Emirates",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.findings-emnlp.478",
    pages = "6403--6416",
}

Requirements

PyTorch >= 1.10.0
See: https://pytorch.org/get-started/locally/

Accelerate >= 0.12.0
pip install accelerate

HuggingFace Transformers 
If you plan to use NLLB200, please use >= 4.28.0, as an important bug was fixed in this version. 
If you plan to use SeamlessM4T, please use >= 4.35.0. 
pip install --upgrade transformers

BitsAndBytes (Optional, required for 8-bits / 4-bits quantization)
pip install bitsandbytes

PEFT (Optional, required for loading LoRA models)
pip install peft
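As a sketch of how a LoRA adapter could be loaded on top of a base model: the --lora_weights_name_or_path flag, the base model, and the adapter placeholder below are assumptions; run python3 translate.py -h to confirm the exact flag names in your version.

# Hypothetical example: the LoRA flag name and adapter path are assumptions.
python3 translate.py \
--sentences_path sample_text/en.txt \
--output_path sample_text/en2es.translation.lora.txt \
--model_name meta-llama/Llama-2-7b-hf \
--lora_weights_name_or_path <your-LoRA-adapter> \
--prompt "Translate English to Spanish: %%SENTENCE%%"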

Translate a file

Run python translate.py -h for more info.
See the examples folder for examples of how to run different models.

Using a single CPU / GPU

python3 translate.py \
--sentences_path sample_text/en.txt \
--output_path sample_text/en2es.translation.m2m100_1.2B.txt \
--source_lang en \
--target_lang es \
--model_name facebook/m2m100_1.2B

If you want to translate all the files in a directory, use the --sentences_dir flag instead of --sentences_path.

# We use --files_extension txt to translate only files with this extension.
# Use an empty string to translate all files in the directory.

python3 translate.py \
--sentences_dir sample_text/ \
--output_path sample_text/translations \
--files_extension txt \
--source_lang en \
--target_lang es \
--model_name facebook/m2m100_1.2B

Multi-GPU

See Accelerate documentation for more information (multi-node, TPU, Sharded model...): https://huggingface.co/docs/accelerate/index
You can use the Accelerate CLI to configure the Accelerate environment (Run accelerate config in your terminal) instead of using the --multi_gpu and --num_processes flags.

# Use 2 GPUs
accelerate launch --multi_gpu --num_processes 2 --num_machines 1 translate.py \
--sentences_path sample_text/en.txt \
--output_path sample_text/en2es.translation.m2m100_1.2B.txt \
--source_lang en \
--target_lang es \
--model_name facebook/m2m100_1.2B

Automatic batch size finder

We will automatically find a batch size that fits in your GPU memory. The default initial batch size is 128 (you can change it with the --starting_batch_size flag). If we hit an out-of-memory error, we will automatically decrease the batch size until we find one that works, as in the example below.
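For example, on a smaller GPU you can start the search lower so the first attempts are cheaper. This sketch only changes the documented --starting_batch_size flag:

# Start the automatic batch size search at 32 instead of the default 128.
python3 translate.py \
--sentences_path sample_text/en.txt \
--output_path sample_text/en2es.translation.m2m100_1.2B.txt \
--source_lang en \
--target_lang es \
--model_name facebook/m2m100_1.2B \
--starting_batch_size 32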

Loading Huge Models

Huge models such as LLaMA 65B or nllb-moe-54b can be loaded on a single GPU with 8-bit or 4-bit quantization with minimal performance degradation. See BitsAndBytes. Set precision to 8 or 4 with the --precision flag.

pip install bitsandbytes

python3 translate.py \
--sentences_path sample_text/en.txt \
--output_path sample_text/en2es.translation.nllb200-moe-54B.txt \
--source_lang eng_Latn \
--target_lang spa_Latn \
--model_name facebook/nllb-moe-54b \
--precision 8 \
--force_auto_device_map \
--starting_batch_size 8

If even the quantized model does not fit in your GPU memory, you can set the --force_auto_device_map flag. The model will be split across GPUs and CPU to fit it in memory. CPU offloading is slow, but it allows you to run huge models that would not otherwise fit in your GPU memory.

Prompting

You can use LLMs such as LLaMA, Vicuna, GPT2, FlanT5, etc., instead of a translation model. These models require a prompt to define the task. You can either include the prompt directly in the input file (each sentence already contains the prompt) or use the --prompt flag to add it to each sentence. In that case, you need to include the token %%SENTENCE%% in the prompt; this token will be replaced by the sentence to translate. You do not need to specify the --source_lang and --target_lang flags in this case.

python3 translate.py \
--sentences_path sample_text/en.txt \
--output_path sample_text/en2es.FlanT5.translation.txt \
--model_name google/flan-t5-large \
--prompt "Translate English to Spanish: %%SENTENCE%%" 

Decoding/Sampling strategies

You can choose the decoding/sampling strategy to use and the number of candidate translations to output for each input sentence. By default, we will use beam-search with num_beams set to 5, and we will output the most likely candidate translation. This should be the best configuration for most use cases. You can change this behaviour with the following flags:

--num_beams: Number of beams to use for beam-search decoding (default: 5)
--do_sample: Whether to use sampling instead of beam-search decoding (default: False)
--temperature: Sampling temperature (default: 0.8)
--top_k: Top k sampling (default: 100)
--top_p: Top p sampling (default: 0.75)
--repetition_penalty: Repetition penalty (default: 1.0)
--keep_special_tokens: Whether to keep special tokens (default: False)
--keep_tokenization_spaces: Whether to keep tokenization spaces (default: False)
--num_return_sequences: Number of candidate translations to output for each input sentence (default: 1)

Please note that running --do_sample with --num_beams > 1 under 8-bit or 4-bit quantization may be numerically unstable and produce an error.
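For example, to generate three sampled candidate translations per input sentence, here is a sketch using only the flags documented above (we assume --do_sample is a bare on/off switch):

# Sampling instead of beam search; flag values match the documented defaults.
python3 translate.py \
--sentences_path sample_text/en.txt \
--output_path sample_text/en2es.translation.sampling.txt \
--source_lang en \
--target_lang es \
--model_name facebook/m2m100_1.2B \
--do_sample \
--temperature 0.8 \
--top_k 100 \
--top_p 0.75 \
--num_return_sequences 3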

Evaluate translations

To run the evaluation script you need to install bert_score (pip install bert_score) and 🤗HuggingFace's Evaluate library (pip install evaluate).

The evaluation script calculates several automatic translation metrics for your predictions.

Run the following command to evaluate the translations:

python3 eval.py \
--pred_path sample_text/en2es.translation.m2m100_1.2B.txt \
--gold_path sample_text/es.txt 

If you want to save the results to a file use the --output_path flag.
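For example, the same evaluation as above, writing the metrics to a JSON file:

python3 eval.py \
--pred_path sample_text/en2es.translation.m2m100_1.2B.txt \
--gold_path sample_text/es.txt \
--output_path sample_text/en2es.m2m100_1.2B.json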

See sample_text/en2es.m2m100_1.2B.json for a sample output.

easy-translate's People

Contributors

alquist4121 · ikergarcia1996 · kalebu · ruanchaves


easy-translate's Issues

Small 100 not working anymore.

I was previously using this command from the samples, and it no longer works after recent updates:

python3 translate.py \
--sentences_path sample_text/en.txt \
--output_path sample_text/en2es.translation.small100.txt \
--source_lang en \
--target_lang es \
--model_name alirezamsh/small100

This is the log:
2023-12-06T12:38:42.665144096Z Detected kernel version 5.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
2023-12-06T12:38:43.268330478Z The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization.
2023-12-06T12:38:43.268375418Z The tokenizer class you load from this checkpoint is 'M2M100Tokenizer'.
2023-12-06T12:38:43.268383058Z The class this function is called from is 'SMALL100Tokenizer'.
2023-12-06T12:38:43.269837017Z Loading model from alirezamsh/small100
2023-12-06T12:38:43.269851767Z Loading custom small100 tokenizer for utils.tokenization_small100
2023-12-06T12:38:43.270022978Z Traceback (most recent call last):
2023-12-06T12:38:43.270132589Z File "//Easy-Translate/translate.py", line 538, in <module>
2023-12-06T12:38:43.270510501Z main(
2023-12-06T12:38:43.270625462Z File "//Easy-Translate/translate.py", line 134, in main
2023-12-06T12:38:43.270820843Z model, tokenizer = load_model_for_inference(
2023-12-06T12:38:43.270937283Z File "/Easy-Translate/model.py", line 90, in load_model_for_inference
2023-12-06T12:38:43.271180565Z tokenizer: PreTrainedTokenizerBase = AutoTokenizer.from_pretrained(
2023-12-06T12:38:43.271340276Z File "/usr/local/lib/python3.10/dist-packages/transformers/tokenization_utils_base.py", line 2045, in from_pretrained
2023-12-06T12:38:43.271888289Z return cls._from_pretrained(
2023-12-06T12:38:43.271961479Z File "/usr/local/lib/python3.10/dist-packages/transformers/tokenization_utils_base.py", line 2256, in _from_pretrained
2023-12-06T12:38:43.272524103Z tokenizer = cls(*init_inputs, **init_kwargs)
2023-12-06T12:38:43.272535813Z File "/Easy-Translate/utils/tokenization_small100.py", line 153, in __init__
2023-12-06T12:38:43.272685754Z super().__init__(
2023-12-06T12:38:43.272714444Z File "/usr/local/lib/python3.10/dist-packages/transformers/tokenization_utils.py", line 366, in __init__
2023-12-06T12:38:43.272931705Z self._add_tokens(self.all_special_tokens_extended, special_tokens=True)
2023-12-06T12:38:43.272935405Z File "/usr/local/lib/python3.10/dist-packages/transformers/tokenization_utils.py", line 462, in _add_tokens
2023-12-06T12:38:43.273221467Z current_vocab = self.get_vocab().copy()
2023-12-06T12:38:43.273318857Z File "/Easy-Translate/utils/tokenization_small100.py", line 289, in get_vocab
2023-12-06T12:38:43.273603199Z vocab = {self.convert_ids_to_tokens(i): i for i in range(self.vocab_size)}
2023-12-06T12:38:43.273645449Z File "/Easy-Translate/utils/tokenization_small100.py", line 192, in vocab_size
2023-12-06T12:38:43.273817970Z return len(self.encoder) + len(self.lang_token_to_id) + self.num_madeup_words
2023-12-06T12:38:43.274002501Z AttributeError: 'SMALL100Tokenizer' object has no attribute 'encoder'. Did you mean: 'encode'?

This model's own Space has the same error:
https://huggingface.co/spaces/alirezamsh/small100

OSError: It looks like the config file at 'models/pytorch_model.bin' is not a valid JSON file

Hello,
Tested with Debian 11/12, CUDA 11.7/11.8, different models, different precisions, with and without Accelerate, etc. Other projects based on torch and transformers work well on the same machine.

I get these errors when running the script:

python3 translate.py \
--sentences_path sample_text/en.txt \
--output_path sample_text/en2es.translation.m2m100_1.2B.txt \
--source_lang en \
--target_lang es \
--model_name models/pytorch_model.bin

Loading model from models/pytorch_model.bin
Traceback (most recent call last):
File "Easy-Translate/.env/lib/python3.11/site-packages/transformers/configuration_utils.py", line 702, in _get_config_dict
config_dict = cls._dict_from_json_file(resolved_config_file)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "Easy-Translate/.env/lib/python3.11/site-packages/transformers/configuration_utils.py", line 793, in _dict_from_json_file
text = reader.read()
^^^^^^^^^^^^^
File "", line 322, in decode
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 64: invalid start byte

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "Easy-Translate/translate.py", line 443, in
main(
File "Easy-Translate/translate.py", line 115, in main
model, tokenizer = load_model_for_inference(
^^^^^^^^^^^^^^^^^^^^^^^^^
File "Easy-Translate/model.py", line 75, in load_model_for_inference
config = AutoConfig.from_pretrained(
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "Easy-Translate/.env/lib/python3.11/site-packages/transformers/models/auto/configuration_auto.py", line 983, in from_pretrained
config_dict, unused_kwargs = PretrainedConfig.get_config_dict(pretrained_model_name_or_path, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "Easy-Translate/.env/lib/python3.11/site-packages/transformers/configuration_utils.py", line 617, in get_config_dict
config_dict, kwargs = cls._get_config_dict(pretrained_model_name_or_path, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "Easy-Translate/.env/lib/python3.11/site-packages/transformers/configuration_utils.py", line 705, in _get_config_dict
raise EnvironmentError(

OSError: It looks like the config file at 'models/pytorch_model.bin' is not a valid JSON file.

API plans?

Are there any plans to add a simple API, even one served from Python on a server?

BaGRoS

Usable with variables, not files?

Hello,
I have a great many headlines and articles, most not in English, stored in a database, which I can load into Python lists.
I would like to loop through the lists and translate each variable without writing the contents to a file first.
This is part of a website that displays a great many headlines for users to choose from.
Writing each headline to a file, translating it, and reading the file back into a list would take too much time.
Thank you in advance for your help.
Baruch

How to translate .srt subtitles

I use this command

python3 translate.py \
--sentences_path input.srt \
--output_path result.srt \
--source_lang eng_Latn \
--target_lang ind_Latn \
--model_name facebook/nllb-200-distilled-600M \
--precision fp16

with input.srt

1
00:00:07,312 --> 00:00:09,993
Hello.

2
00:00:09,994 --> 00:00:11,227
Where are you right now?

3
00:00:11,228 --> 00:00:13,360
Right now I am on my way
to South Dakota.

4
00:00:13,361 --> 00:00:16,093
Gonna do a little camping,
do a little fishing.

5
00:00:16,094 --> 00:00:17,426
Good for you, Colter.

but the output result.srt has problems:

  • wrong line order
  • empty lines replaced with "(dalam bahasa Inggris)" ("in English")
  • unknown text appended
1
00:00:07,312 --> 00:00:09,993
Hei, apa yang kau lakukan?
(dalam bahasa Inggris) <-- this should be an empty line
2 (satu) <-- the '(satu)' should not exist
00:00:09,994 --> 00:00:11,227
Di mana kau sekarang?
(dalam bahasa Inggris) ....
3 Pemberantasan Korupsi <-- this also should not exist
00:00:11,228 --> 00:00:13,360
Saat ini aku sedang dalam perjalanan
ke Dakota Selatan.
(dalam bahasa Inggris) ...
4
00:00:13,361 --> 00:00:16,093
Akan pergi berkemah sedikit,
lakukan sedikit memancing.
(dalam bahasa Inggris) ...
5
00:00:16,094 --> 00:00:17,426
Bagus untukmu, Colter.
(dalam bahasa Inggris) ...
