
mRASP's Introduction

Pre-training Multilingual Neural Machine Translation by Leveraging Alignment Information (EMNLP 2020)

This is the repository for the EMNLP 2020 paper Pre-training Multilingual Neural Machine Translation by Leveraging Alignment Information.

[paper]

News

mRASP has evolved into mRASP2/mCOLT, a much stronger many-to-many multilingual translation model. mRASP2 has been accepted to the ACL 2021 main conference. We encourage you to try mRASP2.

Introduction

mRASP, short for multilingual Random Aligned Substitution Pre-training, is a pre-trained multilingual neural machine translation model. It is pre-trained on a large-scale multilingual corpus covering 32 language pairs, and the resulting model can be further fine-tuned on downstream language pairs. To bring words and phrases with similar meanings closer together in the representation space across languages, we introduce the Random Aligned Substitution (RAS) technique. Extensive experiments across a range of scenarios demonstrate the efficacy of mRASP. For details, please refer to the paper.

Structure

.
├── experiments                             # Example files: including configs and data
├── preprocess                              # The preprocess step
│   ├── tools/
│   │   ├── __init__.py
│   │   ├── common.sh           
│   │   ├── data_preprocess/                # clean + tokenize
│   │   │   ├── __init__.py
│   │   │   ├── clean_scripts/
│   │   │   ├── tokenize_scripts/
│   │   │   ├── clean_each.sh
│   │   │   ├── prep_each.sh
│   │   │   ├── prep_mono.sh                # preprocess a monolingual corpus
│   │   │   ├── prep_parallel.sh            # preprocess a parallel corpus
│   │   │   └── tokenize_each.sh
│   │   ├── misc/
│   │   │   ├── __init__.py
│   │   │   ├── multilingual_preprocess_yml_generator.py
│   │   │   └── multiprocess.sh
│   │   ├── ras/
│   │   │   ├── __init__.py
│   │   │   ├── random_alignment_substitution.sh
│   │   │   ├── random_alignment_substitution_w_multi.sh 
│   │   │   ├── replace_word.py  # RAS using MUSE bilingual dict
│   │   │   └── replace_word_w_multi.py  # RAS using multi-way parallel dict
│   │   └── subword/
│   │       ├── __init__.py
│   │       ├── multilingual_apply_subword_vocab.sh     # script to only apply subword (w/o learning new vocab)
│   │       ├── multilingual_learn_apply_subword_vocab_joint.sh     # script to learn new vocab and apply subword
│   │       └── scripts/
│   ├── __init__.py
│   ├── multilingual_merge.sh               # script to merge multiple parallel datasets
│   ├── multilingual_preprocess_main.sh     # main entry for preprocess
│   └── README.md    
├── train                        
│   ├── __init__.py
│   ├── misc/
│   │   ├── load_config.sh
│   │   └── monitor.sh                  # script to monitor the generation of checkpoint and evaluate them
│   ├── scripts/
│   │   ├── __init__.py
│   │   ├── average_checkpoints_from_file.py
│   │   ├── average_ckpt.sh             # checkpoint average
│   │   ├── common_scripts.sh
│   │   ├── get_worst_ckpt.py
│   │   ├── keep_top_ckpt.py
│   │   ├── remove_bpe.py
│   │   └── rerank_utils.py
│   ├── pre-train.sh                    # main entry for pre-train
│   ├── fine-tune.sh                    # main entry for fine-tune
│   └── README.md
├── requirements.txt
└── README.md

Prerequisites

pip install -r requirements.txt
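
In a fresh checkout, the setup might look like the following minimal sketch; the repository URL, the use of a Python 3 virtual environment, and setting PROJECT_ROOT to the checkout directory are assumptions, so adjust to your own setup:

# Minimal setup sketch (repository URL and virtualenv are assumptions)
git clone https://github.com/linzehui/mRASP.git
cd mRASP
python3 -m venv mrasp-env && source mrasp-env/bin/activate
pip install -r requirements.txt
export PROJECT_ROOT=$(pwd)    # ${PROJECT_ROOT} is referenced by the commands below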

Pipeline

The pipeline consists of two stages: pre-training and fine-tuning. We first pre-train the model jointly on multiple language pairs, then fine-tune it on each downstream language pair.

Preprocess

The preprocessing pipeline consists of the following four separate steps:

  • Data filtering and cleaning

  • Tokenization

  • Learning/applying a joint BPE subword vocabulary

  • Random Aligned Substitution (RAS) (optional, applied to the training set only)

We provide a script to run all the above steps in one command:

bash ${PROJECT_ROOT}/preprocess/multilingual_preprocess_main.sh ${config_yaml_file}

Pre-train

step1: preprocess train data and learn a joint BPE subword vocabulary across all languages.

bash ${PROJECT_ROOT}/preprocess/multilingual_preprocess_main.sh ${PROJECT_ROOT}/experiments/example/configs/preprocess/train.yml

The command above performs cleaning, subword segmentation, merging, and RAS, step by step. The result is a BPE vocabulary and a RAS-processed multilingual dataset merged from multiple language pairs.

step2: preprocess development data

bash ${PROJECT_ROOT}/preprocess/multilingual_preprocess_main.sh ${PROJECT_ROOT}/experiments/example/configs/preprocess/dev.yml

We create a multilingual development set to help choose the best pre-trained checkpoint.

step3: binarize data

bash ${PROJECT_ROOT}/experiments/example/bin_pretrain.sh
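
Under the hood, bin_pretrain.sh binarizes the preprocessed text with fairseq. Roughly, the step amounts to something like the sketch below; the path variables and the src/trg suffixes are placeholders of ours, not taken from the actual script:

# Sketch only: binarize the merged corpus against the joint BPE vocabulary learned in step 1
fairseq-preprocess \
    --source-lang src --target-lang trg \
    --srcdict ${BPE_VOCAB} --tgtdict ${BPE_VOCAB} \
    --trainpref ${PREP_DIR}/train --validpref ${PREP_DIR}/dev \
    --destdir ${DATA_BIN} --workers 8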

step4: pre-train on RASed multilingual corpus

export CUDA_VISIBLE_DEVICES=0,1,2,3 && bash ${PROJECT_ROOT}/train/pre-train.sh ${PROJECT_ROOT}/experiments/example/configs/train/pre-train/transformer_big.yml

You can modify the configs to change the model architecture or the dataset used.

Fine-tune

step1: preprocess train/test data

bash ${PROJECT_ROOT}/preprocess/multilingual_preprocess_main.sh ${PROJECT_ROOT}/experiments/example/configs/preprocess/train_en2de.yml
bash ${PROJECT_ROOT}/preprocess/multilingual_preprocess_main.sh ${PROJECT_ROOT}/experiments/example/configs/preprocess/test_en2de.yml

The commands above perform cleaning and subword segmentation.

step2: binarize data

bash ${PROJECT_ROOT}/experiments/example/bin_finetune.sh

step3: fine-tune on specific language pairs

export CUDA_VISIBLE_DEVICES=0,1,2 && export EVAL_GPU_INDEX=${eval_gpu_index} && bash ${PROJECT_ROOT}/train/fine-tune.sh ${PROJECT_ROOT}/experiments/example/configs/train/fine-tune/en2de_transformer_big.yml ${PROJECT_ROOT}/experiments/example/configs/eval/en2de_eval.yml
  • eval_gpu_index denotes the index of the GPU on your machine that will be used to evaluate the model during training. If set to -1, the CPU is used for evaluation instead (see the example below).
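
For example, to train on GPUs 0-2 and evaluate checkpoints on the CPU:

export CUDA_VISIBLE_DEVICES=0,1,2
export EVAL_GPU_INDEX=-1    # -1 means checkpoints are evaluated on the CPU
bash ${PROJECT_ROOT}/train/fine-tune.sh \
    ${PROJECT_ROOT}/experiments/example/configs/train/fine-tune/en2de_transformer_big.yml \
    ${PROJECT_ROOT}/experiments/example/configs/eval/en2de_eval.yml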

Multilingual Pre-trained Model

Dataset

We merge 32 English-centric language pairs, resulting in 64 directed translation pairs in total. The original corpus of 32 language pairs contains about 197M sentence pairs. After applying RAS we obtain about 262M sentence pairs, since both the original and the substituted sentences are kept. We release both the original dataset and the RAS-processed dataset. (If you cannot download the files, replace the host prefix "sf3-ttcdn-tos.pstatp.com" in the download links with "lf3-nlp-opensource.bytetos.com".)
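
For example, in a shell the host swap can be done like this (original_url stands for whichever download link failed; only the host prefix changes):

fixed_url="${original_url/sf3-ttcdn-tos.pstatp.com/lf3-nlp-opensource.bytetos.com}"
wget "${fixed_url}"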

| Dataset                 | #Pairs    |
| ----------------------- | --------- |
| 32-lang-pairs-TRAIN     | 197603294 |
| 32-lang-pairs-RAS-TRAIN | 262662792 |
| 32-lang-pairs-DEV       | 156587    |
| Vocab                   | -         |
| BPE Code                | -         |
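
The training sets are distributed as split tar archives named 32-lang-pairs.tar.* (see the download questions in the issues below); users typically unpack them with:

cat 32-lang-pairs.tar.* | tar -xzf -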

Checkpoints

We release checkpoints trained on 32-lang-pairs and 32-lang-pairs-RAS. We also extend our model to 58 language pairs.

| Dataset          | Checkpoint            |
| ---------------- | --------------------- |
| Baseline-w/o-RAS | mTransformer-6enc6dec |
| mRASP-PC32       | mRASP-PC32-6enc6dec   |
| mRASP-PC58       | -                     |
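
To translate with a released checkpoint without fine-tuning, the sketch below is adapted from a command discussed in the issues further down this page; DATA_PATH must point to data binarized with the released vocabulary, ckpt to the downloaded checkpoint, and repo_dir to this repository:

fairseq-generate ${DATA_PATH} \
    --user-dir ${repo_dir}/user_dir \
    -s de -t en \
    --task translation_w_langtok \
    --lang-prefix-tok LANG_TOK_EN \
    --path ${ckpt} \
    --beam 5 --batch-size 16 --max-len-a 0 --max-len-b 256 \
    --skip-invalid-size-inputs-valid-test --nbest 1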

Fine-tuning Model

We release benchmark checkpoints and the corresponding configs for En-Ro (both directions), En2De and En2Fr.

| Lang-Pair | Datasource               | Checkpoint | Config       | tok-BLEU | detok-BLEU |
| --------- | ------------------------ | ---------- | ------------ | -------- | ---------- |
| En2Ro     | WMT16 En-Ro, dev, test   | en2ro      | en2ro_config | 39.0     | 37.6       |
| Ro2En     | WMT16 Ro-En, dev, test   | ro2en      | ro2en_config | 37.7     | 36.9       |
| En2De     | WMT16 En-De, newstest16  | en2de      | en2de_config | 30.3     | -          |
| En2Fr     | WMT14 En-Fr, newstest14  | en2fr      | en2fr_config | 44.3     | -          |
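
tok-BLEU is computed on tokenized output and detok-BLEU on detokenized output. The sketch below shows one common way to score detokenized hypotheses; it is not necessarily the exact evaluation path used by the scripts under train/scripts/, and the file names and Moses path are placeholders:

# Strip the target language token and BPE joiners, detokenize, then score with sacrebleu
sed -e 's/^LANG_TOK_[A-Z]* //' -e 's/@@ //g' hyp.bpe.en \
    | perl mosesdecoder/scripts/tokenizer/detokenizer.perl -l en \
    | sacrebleu ref.detok.en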

Comparison with mBART

mBART is a pre-trained model trained on large-scale multilingual corpora. To illustrate the superiority of mRASP, we also compare our results with mBART, choosing language pairs of different scales and using the same test sets as mBART.

| Lang-pairs | Size | Datasource  | Testset    | Direction / Checkpoint | mBART | mRASP |
| ---------- | ---- | ----------- | ---------- | ---------------------- | ----- | ----- |
| En-Gu      | 10K  | en_gu_train | newstest19 | en2gu                  | 0.1   | 3.2   |
| En-Gu      | 10K  | en_gu_train | newstest19 | gu2en                  | 0.3   | 0.6   |
| En-Kk      | 128K | en_kk_train | newstest19 | en2kk                  | 2.5   | 8.2   |
| En-Kk      | 128K | en_kk_train | newstest19 | kk2en                  | 7.4   | 12.3  |
| En-Tr      | 388K | en_tr_train | newstest17 | en2tr                  | 17.8  | 20.0  |
| En-Tr      | 388K | en_tr_train | newstest17 | tr2en                  | 22.5  | 23.4  |
| En-Et      | 2.3M | en_et_train | newstest18 | en2et                  | 21.4  | 20.9  |
| En-Et      | 2.3M | en_et_train | newstest18 | et2en                  | 27.8  | 26.8  |
| En-Fi      | 4M   | en_fi_train | newstest17 | en2fi                  | 22.4  | 24.0  |
| En-Fi      | 4M   | en_fi_train | newstest17 | fi2en                  | 28.5  | 28.0  |
| En-Lv      | 5.5M | en_lv_train | newstest17 | en2lv                  | 15.9  | 21.6  |
| En-Lv      | 5.5M | en_lv_train | newstest17 | lv2en                  | 19.3  | 24.4  |
| En-Cs      | 978K | en_cs_train | newstest19 | en2cs                  | 18.0  | 19.9  |
| En-De      | 4.5M | en_de_train | newstest19 | en2de                  | 30.5  | 35.2  |
| En-Fr      | 40M  | en_fr_train | newstest14 | en2fr                  | 41.0  | 44.3  |

Citation

If you are interested in mRASP, please consider citing our paper:

@inproceedings{lin-etal-2020-pre,
    title = "Pre-training Multilingual Neural Machine Translation by Leveraging Alignment Information",
    author = "Lin, Zehui  and
      Pan, Xiao  and
      Wang, Mingxuan  and
      Qiu, Xipeng  and
      Feng, Jiangtao  and
      Zhou, Hao  and
      Li, Lei",
    booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.emnlp-main.210",
    pages = "2649--2663",
}


mRASP's Issues

Korean preprocessing?

Hello, sorry to bother you.
For Korean preprocessing, do you split into characters, or use the Moses tokenizer?

Are Mongolian, Tibetan, Uyghur and other minority-language translations supported?

Hello, and thank you for your work! I have a few questions:

  1. Does the model support translation for Mongolian (as used in Inner Mongolia), Tibetan, and Uyghur?
  2. Is the Kazakh text in mRASP written in the Cyrillic script or the Arabic script?
    Thank you for taking the time to answer my questions!

Sorry, I couldn't quite follow the fairseq and shell scripts; a question

When fine-tuning on a specific language pair, I see that the code in user_dir.TranslationWithLangtokTask.setup_task reads dict.{}.txt. From what I found online, this file is produced by the fairseq-preprocess step, and it seems the vocabulary is later built from the dict.{}.txt of the target language pair. If that is the case, wouldn't it be inconsistent with the pre-training vocabulary? What am I misunderstanding?

About the validation and test procedures

At test time we call the fairseq-generate command you described and add --lang-prefix-tok followed by the language token of the target side. How should we compute the BLEU score during validation? Or should the model be selected by loss? If we want to select the model by BLEU, how should fairseq-train be modified? Or is it the same as the bilingual setting, i.e. simply prepend the language tags to both the src and tgt sides of the dev set?

AttributeError: 'MMapIndexedDataset' object has no attribute '_bin_buffer_mmap'

My fairseq version is 0.12.0, and running bash train/pre-train.sh /data01/code/python_project/nlp_project/mRASP/experiments/example/configs/train/pre-train/transformer_big.yml

raises the following error:
Traceback (most recent call last):
File "/home/stary/miniconda3/envs/mrasp/bin/fairseq-train", line 8, in <module>
sys.exit(cli_main())
File "/home/stary/miniconda3/envs/mrasp/lib/python3.8/site-packages/fairseq_cli/train.py", line 557, in cli_main
distributed_utils.call_main(cfg, main)
File "/home/stary/miniconda3/envs/mrasp/lib/python3.8/site-packages/fairseq/distributed/utils.py", line 369, in call_main
main(cfg, **kwargs)
File "/home/stary/miniconda3/envs/mrasp/lib/python3.8/site-packages/fairseq_cli/train.py", line 133, in main
task.load_dataset(valid_sub_split, combine=False, epoch=1)
File "/home/stary/miniconda3/envs/mrasp/lib/python3.8/site-packages/fairseq/tasks/translation.py", line 338, in load_dataset
self.datasets[split] = load_langpair_dataset(
File "/home/stary/miniconda3/envs/mrasp/lib/python3.8/site-packages/fairseq/tasks/translation.py", line 85, in load_langpair_dataset
src_dataset = data_utils.load_indexed_dataset(
File "/home/stary/miniconda3/envs/mrasp/lib/python3.8/site-packages/fairseq/data/data_utils.py", line 106, in load_indexed_dataset
dataset = indexed_dataset.make_dataset(
File "/home/stary/miniconda3/envs/mrasp/lib/python3.8/site-packages/fairseq/data/indexed_dataset.py", line 86, in make_dataset
return MMapIndexedDataset(path)
File "/home/stary/miniconda3/envs/mrasp/lib/python3.8/site-packages/fairseq/data/indexed_dataset.py", line 494, in __init__
self._do_init(path)
File "/home/stary/miniconda3/envs/mrasp/lib/python3.8/site-packages/fairseq/data/indexed_dataset.py", line 507, in _do_init
self._bin_buffer_mmap = np.memmap(
File "/home/stary/miniconda3/envs/mrasp/lib/python3.8/site-packages/numpy/core/memmap.py", line 267, in __new__
mm = mmap.mmap(fid.fileno(), bytes, access=acc, offset=start)
ValueError: cannot mmap an empty file
Exception ignored in: <function MMapIndexedDataset.__del__ at 0x7ff3fc5ce1f0>
Traceback (most recent call last):
File "/home/stary/miniconda3/envs/mrasp/lib/python3.8/site-packages/fairseq/data/indexed_dataset.py", line 513, in __del__
self._bin_buffer_mmap._mmap.close()
AttributeError: 'MMapIndexedDataset' object has no attribute '_bin_buffer_mmap'

May I ask which fairseq version you used?

Pre-training data preprocessing question

Thanks for releasing the code. I downloaded all of the pre-training data and extracted it with cat 32-lang-pairs.tar.* | tar -xzf -. When running bash ${PROJECT_ROOT}/preprocess/multilingual_preprocess_main.sh ${config_yaml_file} with ${config_yaml_file} set to ${PROJECT_ROOT}/experiments/example/configs/preprocess/train.yml: in train.yml, is raw_data_path the directory where the downloaded pre-training data is stored? And can the subsequent paths such as merged_output_path, output_main_path, ..., dict_path be set to directories of my own choosing? @linzehui

Pre-trained model download

Hello, does mRASP-PC32-6enc6dec support Chinese-Arabic translation in both directions? And where can the mRASP-PC58 pre-trained model be downloaded? Thanks.

Cannot reproduce the no-fine-tuning result on the En-De dataset

I downloaded pretrain_checkpoint_last_RAS.pt from the README page and used this checkpoint to reproduce the De-to-En result without fine-tuning.
For data processing, after preprocessing and segmenting with the mRASP BPE vocabulary, I prepended LANG_TOK_DE and LANG_TOK_EN to the beginnings of the German and English sentences respectively.
At generation time, I used generate's --prefix-size 1 argument to ensure that LANG_TOK_EN is present on the target side.
I used the official fairseq code at tag v0.10.2; the generation parameters were as follows:
CUDA_VISIBLE_DEVICES=8 python fairseq_cli/generate.py $DATA_PATH --path ~/mRASP/pretrain_checkpoint_last_RAS.pt --gen-subset valid --beam 5 --batch-size 16 --max-len-a 0 --max-len-b 256 --skip-invalid-size-inputs-valid-test --prefix-size 1

However, I only obtained BLEU 17.53 on the De-to-En test set. Some of the generated translations and their scores are shown below:
S-2633 LANG_TOK_DE Z@@ u viel Stre@@ ss
T-2633 LANG_TOK_EN To@@ o st@@ res@@ sed
H-2633 -0.26273542642593384 LANG_TOK_EN To@@ o Mu@@ ch Stre@@ ss

S-1076 LANG_TOK_DE Be@@ zir@@ k Pil@@ @@@ @ sen
T-1076 LANG_TOK_EN Pil@@ @@@ @ sen region
H-1076 -1.3391966819763184 LANG_TOK_EN Pil@@ son@@ g Dist@@ ric@@ t

S-2911 LANG_TOK_DE Dar@@ auf bin ich stol@@ z .
T-2911 LANG_TOK_EN That ple@@ @@@ @ as@@ es me .
H-2911 -0.3221472203731537 LANG_TOK_EN I am pro@@ ud of this .

I would appreciate any suggestions for improvement.

I saw in the paper that the pre-trained model can perform multilingual translation with good results. A question: when using the fairseq-interactive command with the pre-trained model for bilingual translation, do the source and target languages need to be marked with special identifiers?

I saw in the paper that the pre-trained model can perform multilingual translation with good results. A question: when using the fairseq-interactive command with the pre-trained model for translation, do the source and target languages need to be marked with special identifiers? After all, the pre-trained model was trained on 32 different language pairs and has not been fine-tuned, so when translating, don't the source and target languages need identifier tokens to mark the language? Could you give an example?

Question about reproducing results

Thanks for releasing everything! When reproducing the en2gu result, we found it differs somewhat from the number reported in the paper (even though we used the released checkpoints and test set, we measured 2.58, so there is still a slight mismatch). Is there any special processing involved in evaluating this language pair?

I saw in the paper that the pre-trained model alone already achieves decent results; when translating with the pre-trained model, how is the input marked to indicate which two languages are being translated?

If I want to use the pre-trained model and the fairseq-interactive command to translate an en2de sentence, what should the fairseq-interactive command look like?
Should I just input "how are you?",
or "LANG_TOK_DE how are you?",
or something else to indicate the target language?
Could you give me an example of the input command and the input English sentence? Thanks very much!

Dataset download problem

The 32-lang-pairs-TRAIN and 32-lang-pairs-RAS-TRAIN datasets can no longer be downloaded. Could you provide new links?

Arch type

I can't load the pre-trained 32-lang-pairs-RAS checkpoint with the tagged fairseq version 0.9.0:

| model transformer_wmt_en_de_big, criterion LabelSmoothedCrossEntropyCriterion
| num. model params: 243313664 (num. trained: 243313664)
| training on 1 GPUs
| max tokens per GPU = 2048 and max sentences per GPU = None
Traceback (most recent call last):
  File "/media/kalle/Sprachdaten/mRASP/train_environment/lib/python3.8/site-packages/fairseq/trainer.py", line 194, in load_checkpoint
    self.get_model().load_state_dict(
  File "/media/kalle/Sprachdaten/mRASP/train_environment/lib/python3.8/site-packages/fairseq/models/fairseq_model.py", line 71, in load_state_dict
    return super().load_state_dict(new_state_dict, strict)
  File "/media/kalle/Sprachdaten/mRASP/train_environment/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1044, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for TransformerModel:
	size mismatch for encoder.embed_positions.weight: copying a param with shape torch.Size([302, 1024]) from checkpoint, the shape in current model is torch.Size([258, 1024]).
	size mismatch for decoder.embed_positions.weight: copying a param with shape torch.Size([302, 1024]) from checkpoint, the shape in current model is torch.Size([258, 1024]).

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/media/kalle/Sprachdaten/mRASP/train_environment/bin/fairseq-train", line 8, in <module>
    sys.exit(cli_main())
  File "/media/kalle/Sprachdaten/mRASP/train_environment/lib/python3.8/site-packages/fairseq_cli/train.py", line 333, in cli_main
    main(args)
  File "/media/kalle/Sprachdaten/mRASP/train_environment/lib/python3.8/site-packages/fairseq_cli/train.py", line 70, in main
    extra_state, epoch_itr = checkpoint_utils.load_checkpoint(args, trainer)
  File "/media/kalle/Sprachdaten/mRASP/train_environment/lib/python3.8/site-packages/fairseq/checkpoint_utils.py", line 115, in load_checkpoint
    extra_state = trainer.load_checkpoint(
  File "/media/kalle/Sprachdaten/mRASP/train_environment/lib/python3.8/site-packages/fairseq/trainer.py", line 202, in load_checkpoint
    raise Exception(
Exception: Cannot load model parameters from checkpoint /media/kalle/Sprachdaten/mRASP/checkpoint_best.pt; please ensure that the architectures match.

The checkpoint reports its architecture as transformer_vaswani_wmt_en_de_big. Have there been changes to the architecture? Could the incompatibility be due to facebookresearch/fairseq#2664?

Thanks for your promising work!

Question about the results in Table 1

Thanks for releasing the code! Regarding the experimental results in Table 1: does all of the fine-tuning data come from the pre-training corpus (or do only En-De, En-Fr and En-Ro come from the public WMT datasets, while all the others use the pre-training training sets)?

Fine-tuning mRASP for Ar-En

Hello. Thanks for sharing this repository!
I am currently trying to fine-tune the mRASP checkpoint available in the repository on a parallel Ar-En dataset using fairseq.

I first perform pre-processing on the raw dataset using this code:
fairseq-preprocess \
    --source-lang ar --target-lang en \
    --srcdict vocab.bpe.32000.txt --tgtdict vocab.bpe.32000.txt \
    --trainpref data/train \
    --validpref data/valid \
    --testpref data/test \
    --destdir data_bin_ar_en

And this is my training command:
fairseq-train data_bin_ar_en \
    --encoder-normalize-before --decoder-normalize-before \
    --layernorm-embedding \
    --task translation --arch transformer \
    --encoder-attention-heads 16 --decoder-attention-heads 16 \
    --source-lang ar --target-lang en \
    --encoder-embed-dim 1024 --decoder-embed-dim 1024 \
    --restore-file mRASP-PC32-6enc6dec.pt \
    --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
    --optimizer adam --adam-eps 1e-06 --adam-betas '(0.9, 0.98)' --share-all-embeddings \
    --lr-scheduler polynomial_decay --warmup-updates 2500 --max-update 40000 \
    --dropout 0.3 --attention-dropout 0.1 --weight-decay 0.0 \
    --max-tokens 4000 --update-freq 8 --encoder-learned-pos --decoder-learned-pos \
    --no-epoch-checkpoints --save-dir checkpoints \
    --seed 9823843 --log-format simple --log-interval 2 \
    --skip-invalid-size-inputs-valid-test --fp16 \
    --lr-scheduler inverse_sqrt --reset-lr-scheduler --reset-dataloader --reset-meters \
    --activation-fn gelu --lr '5e-4' --reset-optimizer --ddp-backend c10d

However, after several trials I keep getting these error messages:
Traceback (most recent call last):
File "/usr/local/lib/python3.7/dist-packages/hydra/_internal/config_loader_impl.py", line 513, in _apply_overrides_to_config
OmegaConf.update(cfg, key, value, merge=True)
File "/usr/local/lib/python3.7/dist-packages/omegaconf/omegaconf.py", line 613, in update
root.__setattr__(last_key, value)
File "/usr/local/lib/python3.7/dist-packages/omegaconf/dictconfig.py", line 286, in __setattr__
self._format_and_raise(key=key, value=value, cause=e)
File "/usr/local/lib/python3.7/dist-packages/omegaconf/base.py", line 101, in _format_and_raise
type_override=type_override,
File "/usr/local/lib/python3.7/dist-packages/omegaconf/_utils.py", line 694, in format_and_raise
_raise(ex, cause)
File "/usr/local/lib/python3.7/dist-packages/omegaconf/_utils.py", line 610, in _raise
raise ex # set end OC_CAUSE=1 for full backtrace
omegaconf.errors.ValidationError: Invalid value 'False', expected one of [hard, soft]
full_key: generation.print_alignment
reference_type=GenerationConfig
object_type=GenerationConfig

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/usr/local/bin/fairseq-train", line 8, in
sys.exit(cli_main())
File "/usr/local/lib/python3.7/dist-packages/fairseq_cli/train.py", line 557, in cli_main
distributed_utils.call_main(cfg, main)
File "/usr/local/lib/python3.7/dist-packages/fairseq/distributed/utils.py", line 369, in call_main
main(cfg, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/fairseq_cli/train.py", line 168, in main
disable_iterator_cache=task.has_sharded_data("train"),
File "/usr/local/lib/python3.7/dist-packages/fairseq/checkpoint_utils.py", line 253, in load_checkpoint
reset_meters=reset_meters,
File "/usr/local/lib/python3.7/dist-packages/fairseq/trainer.py", line 478, in load_checkpoint
filename, load_on_all_ranks=load_on_all_ranks
File "/usr/local/lib/python3.7/dist-packages/fairseq/checkpoint_utils.py", line 339, in load_checkpoint_to_cpu
state = _upgrade_state_dict(state)
File "/usr/local/lib/python3.7/dist-packages/fairseq/checkpoint_utils.py", line 677, in _upgrade_state_dict
state["cfg"] = convert_namespace_to_omegaconf(state["args"])
File "/usr/local/lib/python3.7/dist-packages/fairseq/dataclass/utils.py", line 389, in convert_namespace_to_omegaconf
composed_cfg = compose("config", overrides=overrides, strict=False)
File "/usr/local/lib/python3.7/dist-packages/hydra/experimental/compose.py", line 37, in compose
with_log_configuration=False,
File "/usr/local/lib/python3.7/dist-packages/hydra/_internal/hydra.py", line 512, in compose_config
from_shell=from_shell,
File "/usr/local/lib/python3.7/dist-packages/hydra/_internal/config_loader_impl.py", line 156, in load_configuration
from_shell=from_shell,
File "/usr/local/lib/python3.7/dist-packages/hydra/_internal/config_loader_impl.py", line 277, in _load_configuration
ConfigLoaderImpl._apply_overrides_to_config(config_overrides, cfg)
File "/usr/local/lib/python3.7/dist-packages/hydra/_internal/config_loader_impl.py", line 522, in _apply_overrides_to_config
) from ex
hydra.errors.ConfigCompositionException: Error merging override generation.print_alignment=False

I searched for a solution but could not find any. My question is how exactly did you fine-tune the pre-trained model on Ar-En?
Also, I didn't apply bpe on the dataset before pre-processing because when I applied them and pre-processed the data, the number of resulting parallel sentences was not equal. Can you provide a way on how to apply the bpe codes? This was the code I used to apply bpe:
subword-nmt apply-bpe -c codes.bpe.32000.txt < data/valid.ar.txt > data/valid.bpe.ar
subword-nmt apply-bpe -c codes.bpe.32000.txt < data/valid.en.txt > data/valid.bpe.en
subword-nmt apply-bpe -c codes.bpe.32000.txt < data/test.ar.txt > data/test.bpe.ar
subword-nmt apply-bpe -c codes.bpe.32000.txt < data/test.en.txt > data/test.bpe.en
subword-nmt apply-bpe -c codes.bpe.32000.txt < data/train.ar.txt > data/train.bpe.ar
subword-nmt apply-bpe -c codes.bpe.32000.txt < data/train.en.txt > data/train.bpe.en

Question about the development set

The released training set does not come with a corresponding development set. Should we simply sample part of the training set at random to use as the dev set?

Where are the global variable paths set?

In which file are the following global variables set?
PROJECT_ROOT
CKPT
CURRENT_VOCAB
NEW_VOCAB
OUTPUT_DIR
Thank you very much!

Downloaded mRASP-PC32-6enc6dec.pt and fine-tuned on De-En data: multi-head attention computation error

I am using experiments/fine-tune-configs/en2de_config.yml.

Traceback (most recent call last):
File "/data/anaconda3/envs/ymz_vecalign/bin/fairseq-train", line 8, in
sys.exit(cli_main())
File "/data/anaconda3/envs/ymz_vecalign/lib/python3.8/site-packages/fairseq_cli/train.py", line 557, in cli_main
distributed_utils.call_main(cfg, main)
File "/data/anaconda3/envs/ymz_vecalign/lib/python3.8/site-packages/fairseq/distributed/utils.py", line 369, in call_main
main(cfg, **kwargs)
File "/data/anaconda3/envs/ymz_vecalign/lib/python3.8/site-packages/fairseq_cli/train.py", line 190, in main
valid_losses, should_stop = train(cfg, trainer, task, epoch_itr)
File "/data/anaconda3/envs/ymz_vecalign/lib/python3.8/contextlib.py", line 75, in inner
return func(*args, **kwds)
File "/data/anaconda3/envs/ymz_vecalign/lib/python3.8/site-packages/fairseq_cli/train.py", line 316, in train
log_output = trainer.train_step(samples)
File "/data/anaconda3/envs/ymz_vecalign/lib/python3.8/contextlib.py", line 75, in inner
return func(*args, **kwds)
File "/data/anaconda3/envs/ymz_vecalign/lib/python3.8/site-packages/fairseq/trainer.py", line 857, in train_step
raise e
File "/data/anaconda3/envs/ymz_vecalign/lib/python3.8/site-packages/fairseq/trainer.py", line 824, in train_step
loss, sample_size_i, logging_output = self.task.train_step(
File "/data/anaconda3/envs/ymz_vecalign/lib/python3.8/site-packages/fairseq/tasks/fairseq_task.py", line 515, in train_step
loss, sample_size, logging_output = criterion(model, sample)
File "/data/anaconda3/envs/ymz_vecalign/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/data/anaconda3/envs/ymz_vecalign/lib/python3.8/site-packages/fairseq/criterions/label_smoothed_cross_entropy.py", line 79, in forward
net_output = model(**sample["net_input"])
File "/data/anaconda3/envs/ymz_vecalign/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/data/anaconda3/envs/ymz_vecalign/lib/python3.8/site-packages/fairseq/models/transformer/transformer_base.py", line 144, in forward
encoder_out = self.encoder(
File "/data/anaconda3/envs/ymz_vecalign/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/data/anaconda3/envs/ymz_vecalign/lib/python3.8/site-packages/fairseq/models/transformer/transformer_encoder.py", line 165, in forward
return self.forward_scriptable(
File "/data/anaconda3/envs/ymz_vecalign/lib/python3.8/site-packages/fairseq/models/transformer/transformer_encoder.py", line 294, in forward_scriptable
lr = layer(x, encoder_padding_mask=encoder_padding_mask_out)
File "/data/anaconda3/envs/ymz_vecalign/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/data/anaconda3/envs/ymz_vecalign/lib/python3.8/site-packages/fairseq/modules/transformer_layer.py", line 351, in forward
x, _ = self.self_attn(
File "/data/anaconda3/envs/ymz_vecalign/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/data/anaconda3/envs/ymz_vecalign/lib/python3.8/site-packages/fairseq/modules/multihead_attention.py", line 544, in forward
return F.multi_head_attention_forward(
File "/data/anaconda3/envs/ymz_vecalign/lib/python3.8/site-packages/torch/nn/functional.py", line 5075, in multi_head_attention_forward
q, k, v = _in_projection(query, key, value, q_proj_weight, k_proj_weight, v_proj_weight, b_q, b_k, b_v)
File "/data/anaconda3/envs/ymz_vecalign/lib/python3.8/site-packages/torch/nn/functional.py", line 4813, in _in_projection
return linear(q, w_q, b_q), linear(k, w_k, b_k), linear(v, w_v, b_v)
RuntimeError: CUDA error: CUBLAS_STATUS_INVALID_VALUE when calling cublasGemmEx( handle, opa, opb, m, n, k, &falpha, a, CUDA_R_16F, lda, b, CUDA_R_16F, ldb, &fbeta, c, CUDA_R_16F, ldc, CUDA_R_32F, CUBLAS_GEMM_DFALT_TENSOR_OP)

How to compute the BLEU score?

I have not used the fairseq framework before. The end of the README points to the evaluation config files, but there does not seem to be a corresponding script to run the evaluation. How should I compute BLEU?

Vocabulary and embedding matrix mismatch between fine-tuning and the pre-trained model

Hello, I followed the fine-tuning steps in the README with my own corpus and got the following error:
RuntimeError: Error(s) in loading state_dict for TransformerModel:
size mismatch for encoder.embed_tokens.weight: copying a param with shape torch.Size([64871, 1024]) from checkpoint, the shape in current model is torch.Size([4130, 1024]).
size mismatch for decoder.embed_tokens.weight: copying a param with shape torch.Size([64871, 1024]) from checkpoint, the shape in current model is torch.Size([4130, 1024]).

It looks like the vocabulary generated during fine-tuning is inconsistent with the one used in pre-training, but the README does not describe how to align the fine-tuning corpus vocabulary with the pre-trained model's vocabulary (64871 entries). Did I miss a step?

About override issue

Hi, I'm trying to run inference with the mRASP-PC32-6enc6dec model, but it fails with the error below.

hydra.errors.ConfigCompositionException: Error merging override generation.print_alignment=False

and here is my running command
fairseq-generate ./newstest_test/de-en/ -s de -t en --skip-invalid-size-inputs-valid-test --path /hdd1/jaehyo/checkpoints/mRASP/mRASP-PC32-6enc6dec.pt --beam 5 --batch-size 128

Should I use a specific matching fairseq version, or am I missing something?
Thank you.

Fine-tuning on the En-De dataset: running the final fine-tune.sh raises KeyError: 0

@linzehui @PANXiao1994 Hello, when fine-tuning on the En-De dataset and running the final fine-tune.sh, I get KeyError: 0. The detailed error is below; could you please take a look?
-----> checkpoint_2_500 | en2de__wmt14_head100 <-----
Traceback (most recent call last):
File "/data1/home/nmt/sungege/mRASP/train/scripts/rerank_utils.py", line 152, in
get_hypo_and_ref(filename, hypo_file, ref_in, ref_out, r2l=r2l)
File "/data1/home/nmt/sungege/mRASP/train/scripts/rerank_utils.py", line 111, in get_hypo_and_ref
assert rank < len(hypo_dict[0])
KeyError: 0

How does the pre-trained model, without fine-tuning, work on specific language-pair translations?

In the paper, the authors mention that the model without fine-tuning works surprisingly well on all datasets.
I want to test the pre-trained model on en2af translation without any fine-tuning via the fairseq-interactive command, with the target language set to af. When I type "hello", it returns a result starting with LANG_TOK_FR. How can I get a translation result with LANG_TOK_AF, as intended?

Do you have dictionaries for all languages or just some of them?

I downloaded the dictionaries published by MUSE and found there are only 20+ dictionaries covering the languages in 32-lang-pairs.
For example, I could not find dictionaries for en-gu or en-kk.
So I was wondering whether you have some other dictionaries. If not, does RAS still benefit the remaining languages for which no dictionary is available for code-switching?

Data preprocessing never runs through

Hello, could you please describe the whole workflow in more detail? For example: downloading the datasets (which config files to set up, which scripts to run, where to put the downloaded data, and how to extract it), then data preprocessing (which config files to edit, which scripts to run, and what outputs to expect), and so on. Perhaps I am just inexperienced, but a step-by-step beginner-friendly guide would be greatly appreciated. Thanks in advance!

confused about preprocess pipeline

The output path in train.yml is /data00/home/panxiao.94/experiments/pmnmt/experiments/toy/data/prep/train.
But the data path in transformer_big.yml is /data00/home/panxiao.94/experiments/pmnmt/experiments/toy/data/pre-train.
So, are there some steps missing from the description of the preprocessing phase?

About 32-lang-pairs-TRAIN

Hi,

I tried to download 32-lang-pairs-TRAIN using the provided scripts.
For all the 32-lang-pairs.tar.* files,
is it correct to use the following command to decompress them?

cat 32-lang-pairs.tar.* | tar -xzf -

The above command gave me these errors:
gzip: stdin: invalid compressed data--format violated
tar: Unexpected EOF in archive
tar: Unexpected EOF in archive
tar: Error is not recoverable: exiting now

Did I miss anything? Thank you.

How is the file "dict.{}.txt".format(args.source_lang) produced?

I tried to run prediction directly with the no-fine-tuning command you provided:
fairseq-generate $DATA_PATH \
    --user-dir ${repo_dir}/user_dir \
    -s de \
    -t en \
    --skip-invalid-size-inputs-valid-test \
    --beam 5 --batch-size 16 --max-len-a 0 --max-len-b 256 \
    --path ${ckpt} \
    --task translation_w_langtok \
    --lang-prefix-tok LANG_TOK_EN \
    --nbest 1
Since I am not familiar with fairseq, I do not know what structure DATA_PATH has or how it is produced. Does it contain the dict.{}.txt files?

Problem reproducing en-de on the toy data

Hello, when I run your en-de model directly on the binarized files built from the toy data, I get the following error. What is the cause? hydra.errors.ConfigCompositionException: Error merging override generation.print_alignment=False

Could you share which tokenizers you used for less common languages?

When processing Hindi, I saw a warning in the log that is printed by your code. But the Moses tokenizer should be able to tokenize Hindi normally, shouldn't it? So could you tell us which tokenizers you used for languages such as Arabic, Japanese, Burmese, and Hindi?

Dataset question

Hello, I would like to ask what domains the English-Chinese data you collected comes from. For example, does the WMT portion include the three domains provided on the WMT website?
Thank you!

Error loading the model

I downloaded the model directly and loaded it with arch set to transformer_wmt_en_de_big, and got the following error:

RuntimeError: Error(s) in loading state_dict for TransformerModel:
size mismatch for encoder.embed_tokens.weight: copying a param with shape torch.Size([64871, 1024]) from checkpoint, the shape in current model is torch.Size([14737, 1024]).
size mismatch for encoder.embed_positions.weight: copying a param with shape torch.Size([302, 1024]) from checkpoint, the shape in current model is torch.Size([258, 1024]).
size mismatch for decoder.embed_tokens.weight: copying a param with shape torch.Size([64871, 1024]) from checkpoint, the shape in current model is torch.Size([14737, 1024]).
size mismatch for decoder.embed_positions.weight: copying a param with shape torch.Size([302, 1024]) from checkpoint, the shape in current model is torch.Size([258, 1024]).
size mismatch for decoder.output_projection.weight: copying a param with shape torch.Size([64871, 1024]) from checkpoint, the shape in current model is torch.Size([14737, 1024]).

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/home/student/lichengzhang/anaconda3/envs/nmt2/bin/fairseq-train", line 8, in
sys.exit(cli_main())
File "/home/student/lichengzhang/anaconda3/envs/nmt2/lib/python3.8/site-packages/fairseq_cli/train.py", line 352, in cli_main
distributed_utils.call_main(args, main)
File "/home/student/lichengzhang/anaconda3/envs/nmt2/lib/python3.8/site-packages/fairseq/distributed_utils.py", line 301, in call_main
main(args, **kwargs)
File "/home/student/lichengzhang/anaconda3/envs/nmt2/lib/python3.8/site-packages/fairseq_cli/train.py", line 110, in main
extra_state, epoch_itr = checkpoint_utils.load_checkpoint(
File "/home/student/lichengzhang/anaconda3/envs/nmt2/lib/python3.8/site-packages/fairseq/checkpoint_utils.py", line 188, in load_checkpoint
extra_state = trainer.load_checkpoint(
File "/home/student/lichengzhang/anaconda3/envs/nmt2/lib/python3.8/site-packages/fairseq/trainer.py", line 291, in load_checkpoint
raise Exception(
Exception: Cannot load model parameters from checkpoint /home/student/lichengzhang/zhoushuhui/nmt/mRASP-master/pretrain_ckpt/checkpoint_best.pt; please ensure that the architectures match.

Dataset download problem

32-lang-pairs-TRAIN can no longer be downloaded. Could you provide a new link? Thanks!

Format for adding new vocabulary

Hello, if I want to add new vocabulary entries, what format does the file need to be in so that it can be merged properly with the original vocab?

About the reported results: tokenized or detokenized BLEU

Hello! When comparing with other methods, the paper states that the reported results are detokenized BLEU (Table 4), but mBART reports tokenized BLEU (I asked the mBART authors and they confirmed this). So I would like to confirm which was used for this comparison. Thanks!
