mass's Introduction

MASS

MASS: Masked Sequence to Sequence Pre-training for Language Generation, by Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, and Tie-Yan Liu, is a novel pre-training method for sequence-to-sequence language generation tasks. It randomly masks a sentence fragment in the encoder input and then predicts it in the decoder.

[Figure: the MASS framework. A sentence fragment is masked in the encoder input and predicted by the decoder.]
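As a rough illustration of this objective (a hypothetical sketch, not the repository's implementation), the encoder sees the sentence with a contiguous fragment replaced by mask tokens, and the decoder is trained to reconstruct exactly that fragment:

import random

MASK = "[MASK]"

def mass_example(tokens, mask_ratio=0.5):
    # Mask a contiguous fragment of the sentence for the encoder input;
    # the decoder's target is the masked fragment itself.
    span_len = max(1, round(len(tokens) * mask_ratio))
    start = random.randint(0, len(tokens) - span_len)
    encoder_input = [MASK if start <= i < start + span_len else t
                     for i, t in enumerate(tokens)]
    decoder_target = tokens[start:start + span_len]
    return encoder_input, decoder_target

enc_in, dec_out = mass_example("the quick brown fox jumps over the lazy dog".split())
# enc_in  -> e.g. ['the', 'quick', '[MASK]', '[MASK]', '[MASK]', '[MASK]', 'the', 'lazy', 'dog']
# dec_out -> e.g. ['brown', 'fox', 'jumps', 'over']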

MASS can be applied to cross-lingual tasks such as neural machine translation (NMT), and to monolingual tasks such as text summarization. The current codebase supports unsupervised NMT (implemented based on XLM), as well as supervised NMT, text summarization and conversational response generation (all implemented based on Fairseq). We will release our implementation for other sequence-to-sequence generation tasks in the future.

What is New!

We release MPNet, a new pre-training method for language understanding. GitHub: https://github.com/microsoft/MPNet

Unsupervised NMT

Unsupervised neural machine translation uses only monolingual data to train the models. During MASS pre-training, the source and target languages are pre-trained in one model, with the corresponding language embeddings used to differentiate the languages. During MASS fine-tuning, back-translation is used to train the unsupervised models (a conceptual sketch of this loop follows the table below). Code is under MASS-unsupNMT. We provide pre-trained and fine-tuned models:

Languages Pre-trained Model Fine-tuned Model BPE codes Vocabulary
EN - FR MODEL MODEL BPE codes Vocabulary
EN - DE MODEL MODEL BPE codes Vocabulary
EN - RO MODEL MODEL BPE codes Vocabulary
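For readers unfamiliar with the fine-tuning step, back-translation alternates between translating monolingual sentences with the current model and training on the resulting synthetic pairs. The loop below is purely conceptual, with hypothetical translate and train_step helpers; it is not how the XLM/MASS trainer is actually organized:

def back_translation_epoch(model, mono_en, mono_fr, translate, train_step):
    # Purely conceptual: `translate` and `train_step` are hypothetical helpers
    # supplied by the caller, standing in for beam search and an optimizer step.
    for en_batch in mono_en:
        fr_synthetic = translate(model, en_batch, src="en", tgt="fr")         # en -> fr'
        train_step(model, src=fr_synthetic, tgt=en_batch, direction="fr-en")  # train fr' -> en
    for fr_batch in mono_fr:
        en_synthetic = translate(model, fr_batch, src="fr", tgt="en")         # fr -> en'
        train_step(model, src=en_synthetic, tgt=fr_batch, direction="en-fr")  # train en' -> fr

This corresponds to the --bt_steps 'en-fr-en,fr-en-fr' setting used in the fine-tuning command below.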

We are also preparing larger models on more language pairs, and will release them in the future.

Dependencies

Currently we implement MASS for unsupervised NMT based on the codebase of XLM. The dependencies are as follows:

  • Python 3
  • NumPy
  • PyTorch (version 0.4 and 1.0)
  • fastBPE (for BPE codes)
  • Moses (for tokenization)
  • Apex (for fp16 training)

Data Ready

We use the same BPE codes and vocabulary as XLM. Here we take English-French as an example.

cd MASS

wget https://dl.fbaipublicfiles.com/XLM/codes_enfr
wget https://dl.fbaipublicfiles.com/XLM/vocab_enfr

./get-data-nmt.sh --src en --tgt fr --reload_codes codes_enfr --reload_vocab vocab_enfr

Pre-training:

python train.py                                      \
--exp_name unsupMT_enfr                              \
--data_path ./data/processed/en-fr/                  \
--lgs 'en-fr'                                        \
--mass_steps 'en,fr'                                 \
--encoder_only false                                 \
--emb_dim 1024                                       \
--n_layers 6                                         \
--n_heads 8                                          \
--dropout 0.1                                        \
--attention_dropout 0.1                              \
--gelu_activation true                               \
--tokens_per_batch 3000                              \
--optimizer adam_inverse_sqrt,beta1=0.9,beta2=0.98,lr=0.0001 \
--epoch_size 200000                                  \
--max_epoch 100                                      \
--eval_bleu true                                     \
--word_mass 0.5                                      \
--min_len 5                                          \

During the pre-training process, even without any back-translation, you can observe that the model achieves some initial BLEU scores:

epoch -> 4
valid_fr-en_mt_bleu -> 10.55
valid_en-fr_mt_bleu ->  7.81
test_fr-en_mt_bleu  -> 11.72
test_en-fr_mt_bleu  ->  8.80

Distributed Training

To use multiple GPUs, e.g. 3 GPUs on the same node:

export NGPU=3; CUDA_VISIBLE_DEVICES=0,1,2 python -m torch.distributed.launch --nproc_per_node=$NGPU train.py [...args]

To use multiple GPUs across many nodes, use Slurm to request a multi-node job and launch the above command. The code automatically detects the SLURM_* environment variables to distribute the training; a rough sketch of how such detection typically works is shown below.
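For reference, SLURM-based distributed setup usually boils down to mapping SLURM_* variables onto torch.distributed settings, roughly as in the illustrative sketch below (this is not the repository's actual initialization code, and the exact variables XLM/MASS reads may differ):

import os
import torch

def init_distributed_from_slurm():
    # Illustrative only: map common SLURM_* variables to torch.distributed settings.
    local_rank = int(os.environ.get("SLURM_LOCALID", 0))    # GPU index on this node
    global_rank = int(os.environ.get("SLURM_PROCID", 0))    # rank across all nodes
    world_size = int(os.environ.get("SLURM_NTASKS", 1))     # total number of processes

    torch.cuda.set_device(local_rank)
    torch.distributed.init_process_group(
        backend="nccl",
        init_method="env://",   # assumes MASTER_ADDR / MASTER_PORT are exported
        rank=global_rank,
        world_size=world_size,
    )
    return global_rank, world_size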

Fine-tuning

After pre-training, we use back-translation to fine-tune the pre-trained model on unsupervised machine translation:

MODEL=mass_enfr_1024.pth

python train.py \
  --exp_name unsupMT_enfr                              \
  --data_path ./data/processed/en-fr/                  \
  --lgs 'en-fr'                                        \
  --bt_steps 'en-fr-en,fr-en-fr'                       \
  --encoder_only false                                 \
  --emb_dim 1024                                       \
  --n_layers 6                                         \
  --n_heads 8                                          \
  --dropout 0.1                                        \
  --attention_dropout 0.1                              \
  --gelu_activation true                               \
  --tokens_per_batch 2000                              \
  --batch_size 32	                                     \
  --bptt 256                                           \
  --optimizer adam_inverse_sqrt,beta1=0.9,beta2=0.98,lr=0.0001 \
  --epoch_size 200000                                  \
  --max_epoch 30                                       \
  --eval_bleu true                                     \
  --reload_model "$MODEL,$MODEL"                       \

We also provide a demo of using the MASS pre-trained model on the WMT16 en-ro bilingual dataset. We provide pre-trained and fine-tuned models:

Model Ro-En BLEU (with BT)
Baseline 34.0
XLM 38.5
MASS 39.1

Download the dataset with the commands below:

wget https://dl.fbaipublicfiles.com/XLM/codes_enro
wget https://dl.fbaipublicfiles.com/XLM/vocab_enro

./get-data-bilingual-enro-nmt.sh --src en --tgt ro --reload_codes codes_enro --reload_vocab vocab_enro

After downloading the MASS pre-trained model from the link above, use the following command to fine-tune:

MODEL=mass_enro_1024.pth

python train.py \
	--exp_name unsupMT_enro                              \
	--data_path ./data/processed/en-ro                   \
	--lgs 'en-ro'                                        \
	--bt_steps 'en-ro-en,ro-en-ro'                       \
	--encoder_only false                                 \
	--mt_steps 'en-ro,ro-en'                             \
	--emb_dim 1024                                       \
	--n_layers 6                                         \
	--n_heads 8                                          \
	--dropout 0.1                                        \
	--attention_dropout 0.1                              \
	--gelu_activation true                               \
	--tokens_per_batch 2000                              \
	--batch_size 32                                      \
	--bptt 256                                           \
	--optimizer adam_inverse_sqrt,beta1=0.9,beta2=0.98,lr=0.0001 \
	--epoch_size 200000                                  \
	--max_epoch 50                                       \
	--eval_bleu true                                     \
	--reload_model "$MODEL,$MODEL"

Supervised NMT

We also implement MASS on fairseq, in order to support pre-training and fine-tuning for large-scale supervised tasks such as neural machine translation and text summarization. Unsupervised pre-training usually works better on zero-resource or low-resource downstream tasks. However, large-scale supervised NMT has plenty of bilingual data, which brings challenges for conventional unsupervised pre-training. Therefore, we design a new pre-training loss to support large-scale supervised NMT. The code is under MASS-supNMT.

We extend MASS to the supervised setting, where the supervised sentence pair (X, Y) is leveraged for pre-training. The sentence X is masked and fed into the encoder, and the decoder predicts the whole sentence Y. Some discrete tokens in the decoder input are also masked, to encourage the decoder to extract more information from the encoder side.
[Figure: supervised MASS pre-training. The masked sentence X is fed into the encoder, and the decoder predicts the whole sentence Y.]

During pre-training, we combine the original MASS pre-training loss and the new supervised pre-training loss. During fine-tuning, we directly use supervised sentence pairs to fine-tune the pre-trained model. Beyond NMT, this pre-training paradigm can also be applied to other supervised sequence-to-sequence tasks.
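To make the masking scheme concrete, here is a hypothetical sketch of constructing one supervised pre-training example (illustrative only, not the repository's data pipeline): a contiguous span of X is masked for the encoder, some decoder-input tokens of Y are masked at random, and the decoder target is the whole sentence Y.

import random

MASK = "[MASK]"

def supervised_mass_example(x_tokens, y_tokens, span_ratio=0.5, dec_mask_prob=0.15):
    # Mask a contiguous span of the source sentence X for the encoder input.
    span_len = max(1, round(len(x_tokens) * span_ratio))
    start = random.randint(0, len(x_tokens) - span_len)
    encoder_input = [MASK if start <= i < start + span_len else t
                     for i, t in enumerate(x_tokens)]
    # Randomly mask some decoder-input tokens of the target sentence Y, so the
    # decoder has to rely more on the encoder; it still predicts all of Y.
    decoder_input = [MASK if random.random() < dec_mask_prob else t for t in y_tokens]
    decoder_target = list(y_tokens)
    return encoder_input, decoder_input, decoder_target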

We release the pre-trained model and example code for how to pre-train and fine-tune on WMT Chinese<->English (Zh<->En) translation:

Languages Pre-trained Model BPE codes English-Dict Chinese-Dict
Zh - En MODEL CODE VOCAB VOCAB

Prerequisites

After downloading the repository, install fairseq via pip:

pip install fairseq==0.7.1

Data Ready

We first prepare the monolingual and bilingual sentences for Chinese and English respectively. The data directory looks like:

- data/
  ├─ mono/
  |  ├─ train.en
  |  ├─ train.zh
  |  ├─ valid.en
  |  ├─ valid.zh
  |  ├─ dict.en.txt
  |  └─ dict.zh.txt
  └─ para/
     ├─ train.en
     ├─ train.zh
     ├─ valid.en
     ├─ valid.zh
     ├─ dict.en.txt
     └─ dict.zh.txt

The files under mono/ are monolingual data, while those under para/ are bilingual data. dict.en.txt (and dict.zh.txt) should be identical across the two directories, although the dictionaries for different languages can differ; a quick sanity check is sketched after the preprocessing commands below. Run the following commands to generate the binarized data:

# Ensure the output directory exists
data_dir=data/
mono_data_dir=$data_dir/mono/
para_data_dir=$data_dir/para/
save_dir=$data_dir/processed/

# set this to the relative path of MASS on your server
user_dir=mass

mkdir -p $data_dir $save_dir $mono_data_dir $para_data_dir


# Generate Monolingual Data
for lg in en zh
do

  fairseq-preprocess \
  --task cross_lingual_lm \
  --srcdict $mono_data_dir/dict.$lg.txt \
  --only-source \
  --trainpref $mono_data_dir/train --validpref $mono_data_dir/valid \
  --destdir $save_dir \
  --workers 20 \
  --source-lang $lg

  # Since we only have a source language, the output file has a None for the
  # target language. Remove this

  for stage in train valid
  do
    mv $save_dir/$stage.$lg-None.$lg.bin $save_dir/$stage.$lg.bin
    mv $save_dir/$stage.$lg-None.$lg.idx $save_dir/$stage.$lg.idx
  done
done

# Generate Bilingual Data
fairseq-preprocess \
  --user-dir $user_dir \
  --task xmasked_seq2seq \
  --source-lang en --target-lang zh \
  --trainpref $para_data_dir/train --validpref $para_data_dir/valid \
  --destdir $save_dir \
  --srcdict $para_data_dir/dict.en.txt \
  --tgtdict $para_data_dir/dict.zh.txt
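Since the mono/ and para/ dictionaries for each language must match, a quick check like the following Python snippet (a hypothetical helper, not part of the repository) can catch mismatches before binarization:

# Hypothetical sanity check: verify that the dict files under mono/ and para/ are identical.
for lg in ("en", "zh"):
    mono = open(f"data/mono/dict.{lg}.txt", encoding="utf-8").read()
    para = open(f"data/para/dict.{lg}.txt", encoding="utf-8").read()
    assert mono == para, f"dict.{lg}.txt differs between data/mono/ and data/para/"
print("dictionaries match")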

Pre-training

We provide a simple demo showing how to run MASS pre-training.

save_dir=checkpoints/mass/pre-training/
user_dir=mass
data_dir=data/processed/

mkdir -p $save_dir

fairseq-train $data_dir \
    --user-dir $user_dir \
    --save-dir $save_dir \
    --task xmasked_seq2seq \
    --source-langs en,zh \
    --target-langs en,zh \
    --langs en,zh \
    --arch xtransformer \
    --mass_steps en-en,zh-zh \
    --memt_steps en-zh,zh-en \
    --optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
    --lr-scheduler inverse_sqrt --lr 0.00005 --min-lr 1e-09 \
    --criterion label_smoothed_cross_entropy \
    --max-tokens 4096 \
    --dropout 0.1 --relu-dropout 0.1 --attention-dropout 0.1 \
    --max-update 100000 \
    --share-decoder-input-output-embed \
    --valid-lang-pairs en-zh \

We also provide a pre-training script which is used for our released model.

Fine-tuning

After the pre-training stage, we fine-tune the model on bilingual sentence pairs:

data_dir=data/processed
save_dir=checkpoints/mass/fine_tune/
user_dir=mass
model=checkpoints/mass/pre-training/checkpoint_last.pt # The path of the pre-trained model

mkdir -p $save_dir

fairseq-train $data_dir \
    --user-dir $user_dir \
    --task xmasked_seq2seq \
    --source-langs zh --target-langs en \
    --langs en,zh \
    --arch xtransformer \
    --mt_steps zh-en \
    --save-dir $save_dir \
    --optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
    --lr-scheduler inverse_sqrt --lr-shrink 0.5 --lr 0.00005 --min-lr 1e-09 \
    --criterion label_smoothed_cross_entropy \
    --max-tokens 4096 \
    --max-update 100000 --max-epoch 50 \
    --dropout 0.1 --relu-dropout 0.1 --attention-dropout 0.1 \
    --share-decoder-input-output-embed \
    --valid-lang-pairs zh-en \
    --reload_checkpoint $model

We also provide a fine-tuning script which is used for our pre-trained model.

Inference

After the fine-tuning stage, you can generate translation results using the script below:

model=checkpoints/mass/fine_tune/checkpoint_best.pt
data_dir=data/processed
user_dir=mass

fairseq-generate $data_dir \
    --user-dir $user_dir \
    -s zh -t en \
    --langs en,zh \
    --source-langs zh --target-langs en \
    --mt_steps zh-en \
    --gen-subset valid \
    --task xmasked_seq2seq \
    --path $model \
    --beam 5 --remove-bpe 

Text Summarization

MASS for text summarization is also implemented on fairseq. The code is under MASS-summarization.

Dependency

pip install torch==1.0.0 
pip install fairseq==0.8.0

MODEL

MASS uses the default Transformer structure. We denote L, H, and A as the number of layers, the hidden size, and the number of attention heads, respectively.

Model Encoder Decoder Download
MASS-base-uncased 6L-768H-12A 6L-768H-12A MODEL
MASS-middle-uncased 6L-1024H-16A 6L-1024H-16A MODEL

Results on Abstractive Summarization (12/03/2019)

Dataset RG-1 RG-2 RG-L
CNN/Daily Mail 43.05 20.02 40.08
Gigaword 38.93 20.20 36.20
XSum 39.75 17.24 31.95

Evaluated by files2rouge.

Pipeline for Pre-Training

Download data

Our model is trained on Wikipedia + BookCorpus. Here we use wikitext-103 to demonstrate how to process the data.

wget https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-103-raw-v1.zip
unzip wikitext-103-raw-v1.zip

Tokenize corpus

We use the WordPiece vocabulary (from BERT) to tokenize the original text data directly. We provide a script (encode.py) to process the data. You need to pip install pytorch_transformers first to generate the tokenized data; a minimal sketch of such an encoding script is shown after the loop below.

mkdir -p mono
for SPLIT in train valid test; do 
    python encode.py \
        --inputs wikitext-103-raw/wiki.${SPLIT}.raw \
        --outputs mono/${SPLIT}.txt \
        --workers 60; \
done 
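The core of such an encoding script boils down to running the BERT WordPiece tokenizer over each line. The Python snippet below is a simplified single-process sketch of that idea using the pytorch_transformers BertTokenizer; the repository's encode.py presumably adds the --inputs/--outputs/--workers handling seen in the loop above on top of it:

import sys
from pytorch_transformers import BertTokenizer

# Simplified sketch: read raw text from stdin, write WordPiece-tokenized lines to stdout.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

for line in sys.stdin:
    line = line.strip()
    if not line:
        continue
    print(" ".join(tokenizer.tokenize(line)))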

Binarized data

wget -c https://modelrelease.blob.core.windows.net/mass/mass-base-uncased.tar.gz
tar -zxvf mass-base-uncased.tar.gz
# Move dict.txt from tar file to the data directory 

fairseq-preprocess \
    --user-dir mass --only-source \
    --trainpref mono/train.txt --validpref mono/valid.txt --testpref mono/test.txt \
    --destdir processed --srcdict dict.txt --workers 60

Pre-training

TOKENS_PER_SAMPLE=512
WARMUP_UPDATES=10000
PEAK_LR=0.0005
TOTAL_UPDATES=125000
MAX_SENTENCES=8
UPDATE_FREQ=16

fairseq-train processed \
    --user-dir mass --task masked_s2s --arch transformer_mass_base \
    --sample-break-mode none \
    --tokens-per-sample $TOKENS_PER_SAMPLE \
    --criterion masked_lm \
    --optimizer adam --adam-betas '(0.9, 0.98)' --adam-eps 1e-6 --clip-norm 0.0 \
    --lr-scheduler polynomial_decay --lr $PEAK_LR --warmup-updates $WARMUP_UPDATES --total-num-update $TOTAL_UPDATES \
    --dropout 0.1 --attention-dropout 0.1 --weight-decay 0.01 \
    --max-sentences $MAX_SENTENCES --update-freq $UPDATE_FREQ \
    --ddp-backend=no_c10d \

Pipeline for Fine-tuning (CNN / Daily Mail)

Data

Download, tokenize and truncate the data from this link, and use the tokenization above to generate wordpiece-level data. Rename the suffixes article and title to src and tgt. Assume the tokenized data is under cnndm/para:

fairseq-preprocess \
    --user-dir mass --task masked_s2s \
    --source-lang src --target-lang tgt \
    --trainpref cnndm/para/train --validpref cnndm/para/valid --testpref cnndm/para/test \
    --destdir cnndm/processed --srcdict dict.txt --tgtdict dict.txt \
    --workers 20

dict.txt is included in mass-base-uncased.tar.gz. A copy of binarized data can be obtained from here.

Running

fairseq-train cnndm/processed/ \
    --user-dir mass --task translation_mass --arch transformer_mass_base \
    --optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
    --lr 0.0005 --min-lr 1e-09 \
    --lr-scheduler inverse_sqrt --warmup-init-lr 1e-07 --warmup-updates 4000 \
    --weight-decay 0.0 \
    --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
    --update-freq 8 --max-tokens 4096 \
    --ddp-backend=no_c10d --max-epoch 25 \
    --max-source-positions 512 --max-target-positions 512 \
    --skip-invalid-size-inputs-valid-test \
    --load-from-pretrained-model mass-base-uncased.pt \

lr=0.0005 is not necessarily the optimal choice for every task; it should be tuned on the dev set (among 1e-4, 2e-4, 5e-4).

Inference

MODEL=checkpoints/checkpoint_best.pt
fairseq-generate $DATADIR --path $MODEL \
    --user-dir mass --task translation_mass \
    --batch-size 64 --beam 5 --min-len 50 --no-repeat-ngram-size 3 \
    --lenpen 1.0 \

--min-len is sensitive across tasks, and --lenpen needs to be tuned on the dev set.

Reference

If you find MASS useful in your work, you can cite the paper as below:

@inproceedings{song2019mass,
    title={MASS: Masked Sequence to Sequence Pre-training for Language Generation},
    author={Song, Kaitao and Tan, Xu and Qin, Tao and Lu, Jianfeng and Liu, Tie-Yan},
    booktitle={International Conference on Machine Learning},
    pages={5926--5936},
    year={2019}
}

mass's People

Contributors

microsoft-github-policy-service[bot], microsoftopensource, msftgits, stillkeeptry, tan-xu, thammegowda, tobyoup


mass's Issues

Results on CNN/DM dataset ?

Thanks for the great codebase !

I was wondering if you tried to get the results on CNN / DM dataset (summarization).

If so, can you share them?

[Data Ready] How to get /data/processed/en-fr ?

Hi, thank you for releasing your code of your paper :)
But I encountered some problems when I typed 'python train.py'

Problem: Data Ready

I typed the commands as follows:

cd MASS
wget https://dl.fbaipublicfiles.com/XLM/codes_enfr
wget https://dl.fbaipublicfiles.com/XLM/vocab_enfr
./get-data-nmt.sh --src en --tgt fr --reload_codes codes_enfr --reload_vocab vocab_enfr

But it says,
bash: ./get-data-nmt.sh: Permission denied

Do you know how to resolve this? and I want to get 'data/processed/en-fr' :)
Thank you so much for reading this issue.

Unexpected key(s) in state_dict: "lang_embeddings.weight". When loading models for NMT.

I processed the data, then did unsupervised NMT fine-tuning:

MODEL=mass_enfr_1024.pth

python train.py
--exp_name unsupMT_enfr
--data_path ./data/processed/en-fr/
--lgs 'en-fr'
--bt_steps 'en-fr-en,fr-en-fr'
--encoder_only false
--emb_dim 1024
--n_layers 6
--n_heads 8
--dropout 0.1
--attention_dropout 0.1
--gelu_activation true
--tokens_per_batch 2000
--batch_size 32
--bptt 256
--optimizer adam_inverse_sqrt,beta1=0.9,beta2=0.98,lr=0.0001
--epoch_size 200000
--max_epoch 30
--eval_bleu true
--reload_model "$MODEL,$MODEL"

However, when reloading the model, I got:

File "train.py", line 349, in <module>
    main(params)
File "train.py", line 239, in main
    encoder, decoder = build_model(params, data['dico'])
File "/data/lily/zl379/Projects/ANLP/MASS/MASS/src/model/__init__.py", line 134, in build_model
    encoder.load_state_dict(enc_reload)
File "/home/lily/zl379/anaconda2/envs/py36/lib/python3.6/site-packages/torch/nn/modules/module.py", line 769, in load_state_dict
    self.__class__.__name__, "\n\t".join(error_msgs)))
RuntimeError: Error(s) in loading state_dict for TransformerModel: Unexpected key(s) in state_dict: "lang_embeddings.weight".

fast.cc: No such file or directory

I am trying to run bash file "get-data-nmt.sh".

Compiling fastBPE...
g++: error: fast.cc: No such file or directory
g++: fatal error: no input file

It looks like the build command for fastBPE has changed:

from:
g++ -std=c++11 -pthread -O3 fast.cc -o fast

to:
g++ -std=c++11 -pthread -O3 fastBPE/main.cc -IfastBPE -o fast

please take this into consideration.

Typo in MASS-fairseq

In the fine-tuning script, --reload-checkpoint should be corrected to --reload_checkpoint.
In addition, why is --max-update set to 10? That is too small.

MASS Fairseq - A request

Hi, it's exciting that you have added the task to fairseq. One request would be to use fairseq as a library (pip installed) and register the tasks, etc., instead of forking.

If there is some inflexibility in fairseq that prevents you from doing that, I will try to help fix it in fairseq.

chinese model

Thank you for your great work.
Can you release a Chinese model in the future, for Chinese text summarization?

ZhEn pretraining model

How should we process our parallel data with the provided BPE codes? I ran the fastBPE tools and got some problems: the provided BPE codes have two columns, while fastBPE needs three. Could you give some advice?

About Fine-tuning for Text Summarization

Hi,

Thank you for the great work. Recently I tried fine-tuning based on the pre-trained model (https://modelrelease.blob.core.windows.net/mass/mass_summarization_1024.pth).

I followed the instructions (https://github.com/microsoft/MASS#fine-tuning-2) in the readme file and ran this command on a single GPU machine. After the command finished, I tested the output by running python translate_ensemble.py --exp_name giga_test --src_lang ar --tgt_lang ti --beam 5 --batch_size 1 --model_path ./dumped/mass_summarization/bvk6g6f9xl/checkpoint.pth --output_path ./dumped/mass_summarization/bvk6g6f9xl/output.txt.beam5 < ./data/processed/giga/test.ar-ti.ar. Then I processed the output to remove the BPE mark @@ and tested the ROUGE scores. The ROUGE scores are ROUGE-1 F1=37.2 and ROUGE-2 F1=18.8.

I think I have missed something important here. Could you please instruct me how to correctly fine-tune the model?

File Error When Finetuning Gigawords

Hi,

I downloaded the pre-trained monolingual model for text summarization and preprocessed Gigawords using get-data-gigaword.sh.
Then I am trying to finetune it following https://github.com/microsoft/MASS#fine-tuning-2
However, I got a file related error:

Traceback (most recent call last):
  File "train.py", line 345, in <module>
    check_data_params(params)
  File "/workspace/MASS/code/MASS/MASS/src/data/loader.py", line 359, in check_data_params
    assert all([all([os.path.isfile(p1) and os.path.isfile(p2) for p1, p2 in paths.values()]) for paths in params.para_dataset.values()])
AssertionError

I changed the data path to ./data/processed/giga/ since in get-data-gigaword.sh the output folder is giga instead of summarization (PROC_PATH=$DATA_PATH/processed/giga/).

Could you please help with this?

Thanks.

How to use multi-gpu to pretrain mass-fairseq model?

I tried to use two GPUs to run the MASS-fairseq model, and got a ModuleNotFoundError:

Traceback (most recent call last):
File "", line 1, in
File "/opt/miniconda3/lib/python3.7/multiprocessing/spawn.py", line 105, in spawn_main
exitcode = _main(fd)
File "/opt/miniconda3/lib/python3.7/multiprocessing/spawn.py", line 115, in _main
self = reduction.pickle.load(from_parent)
ModuleNotFoundError: No module named 'mass'

I just used the script for text summarization; my script is as follows:

CUDA_VISIBLE_DEVICES=0,1 fairseq-train $data_dir \
--user-dir $user_dir \
--save-dir $save_dir \
--task xmasked_seq2seq \
--source-langs ar,ti \
--target-langs ar,ti \
--langs ar,ti \
--arch xtransformer \
--mass_steps ar-ar \
--optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
--lr-scheduler inverse_sqrt --lr 0.0001 --min-lr 1e-09 \
--criterion label_smoothed_cross_entropy \
--max-tokens 4096 \
--dropout 0.1 --relu-dropout 0.1 --attention-dropout 0.1 \
--max-update 300000 \
--max-epoch 50 \
--share-decoder-input-output-embed \
--word_mask 0.15 \
--dataset-impl lazy \
--valid-lang-pairs ar-ar \
--num-workers 4 \
--save-interval 20

So how do I use multiple GPUs with MASS-fairseq?

how to decode with the mass_ft_enfr_1024.pth

I ran the script like this:
cat $test | python translate.py --exp_name translate --src_lang en --tgt_lang fr --model_path mass_ft_enfr_1024.pth --output_path output

but got the error:

AttributeError: 'AttrDict' object has no attribute 'attention_setting'

Why isn't the best model saved?

This code doesn't save the best model based on the validation-set score during training.
Is this my mistake or a problem with the code?

fp16 training

Turning on the --fp16 True flag, I receive the following error:

    assert len(modules) == 1, "fp16 not implemented for more than one module"
AssertionError: fp16 not implemented for more than one module

Any idea why this could be? Is fp16 supported?

Quick question about pre-training with one sentence

I have a quick question about your MASS paper. Thanks for the good work and for sharing your code.

My question is to confirm that in your pre-training, the masked input is always ONE sentence, instead of several. That's my understanding from the paper.

I'm asking this because when thinking about some down-stream tasks such as summarization or dialogue, the input is usually several sentences, so it could be interesting to try pre-training with multiple sentences as input.

The mass_ende_1024.pth and mass_ft_ende_1024.pth models don't have all the parameters

I met the same problem. I used translate.py and translate_ensemble.py, and the problem is the same. Is the en-de model old, or may I use the pre-trained en-de model?
Do I need to fine-tune first before I can use translate.py?

CUDA_VISIBLE_DEVICES=4 cat test |python3 translate.py --exp_name translate --src_lang en --tgt_lang de --model_path ./model_ende/finetune_model/mass_ft_ende_1024.pth --output_path output
INFO - 08/09/19 16:30:19 - 0:00:00 - ============ Initialized logger ============
INFO - 08/09/19 16:30:19 - 0:00:00 - batch_size: 32
beam: 1
command: python translate.py --exp_name translate --src_lang en --tgt_lang de --model_path './model_ende/finetune_model/mass_ft_ende_1024.pth' --output_path output --exp_id "875c0m2q9t"
dump_path: ./dumped/translate/875c0m2q9t
exp_id: 875c0m2q9t
exp_name: translate
fp16: False
length_penalty: 1
model_path: ./model_ende/finetune_model/mass_ft_ende_1024.pth
output_path: output
src_lang: en
tgt_lang: de
INFO - 08/09/19 16:30:19 - 0:00:00 - The experiment will be stored in ./dumped/translate/875c0m2q9t

INFO - 08/09/19 16:30:19 - 0:00:00 - Running command: python translate.py --exp_name translate --src_lang en --tgt_lang de --model_path './model_ende/finetune_model/mass_ft_ende_1024.pth' --output_path output

INFO - 08/09/19 16:30:26 - 0:00:07 - Supported languages: en, de
Traceback (most recent call last):
File "translate.py", line 160, in
main(params)
File "translate.py", line 77, in main
setattr(params, name, getattr(model_params, name))
AttributeError: 'AttrDict' object has no attribute 'bos_index'

Rouge score Text Summarization

Hi, how can I calculate the ROUGE score for text summarization after fine-tuning? Currently, it shows acc, BLEU and other scores after fine-tuning.

Replicating en-fr UNMT with a smaller emb_dim

I have to use a smaller emb_dim of 512 because my GPU Mem is 12.8G. As a result, the hidden_dim of Transformer would be 4 * 512 = 2048. However, the results seem to be much worse than yours.

My command of pre-training is:
python train.py --exp_name unsupMT_enfr --data_path './data/processed/en-fr/' --lgs 'en-fr' --bt_steps 'en-fr-en,fr-en-fr' --mass_steps 'en,fr' --lambda_bt '0:0,10:0' --encoder_only false --emb_dim 512 --n_layers 6 --n_heads 8 --dropout '0.1' --attention_dropout '0.1' --gelu_activation true --tokens_per_batch 3000 --optimizer 'adam_inverse_sqrt,beta1=0.9,beta2=0.98,lr=0.0001' --epoch_size 200000 --max_epoch 100 --eval_bleu true --word_mass '0.5' --min_len 5 --exp_id "ht6lz6ziu1"

My command of fine-tuning is:
python train.py --exp_name unsupMT_enfr --data_path './data/processed/en-fr/' --lgs 'en-fr' --bt_steps 'en-fr-en,fr-en-fr' --encoder_only false --emb_dim 512 --n_layers 6 --n_heads 8 --dropout '0.1' --attention_dropout '0.1' --gelu_activation true --tokens_per_batch 2000 --batch_size 32 --bptt 256 --optimizer 'adam_inverse_sqrt,beta1=0.9,beta2=0.98,lr=0.0001' --epoch_size 200000 --max_epoch 30 --eval_bleu true --reload_model 'mass_enfr_512.pth,mass_enfr_512.pth' --exp_id "w83s8z4nkx"

Only the emb_dim is changed, but I get the following result:

pre-train.log:

2 days, 12:15:39 - log:{"epoch": 99, "valid_fr-en_mt_ppl": 249.58064788484583, "valid_fr-en_mt_acc": 23.941589780961678, "valid_fr-en_mt_bleu": 2.28, "valid_en-fr_mt_ppl": 209.35770915056546, "valid_en-fr_mt_acc": 23.806986740933585, "valid_en-fr_mt_bleu": 2.6, "test_fr-en_mt_ppl": 216.85211556087498, "test_fr-en_mt_acc": 24.96583143507973, "test_fr-en_mt_bleu": 2.53, "test_en-fr_mt_ppl": 180.5567727326117, "test_en-fr_mt_acc": 24.483515007722094, "test_en-fr_mt_bleu": 2.71}

fine-tune.log:

2 days, 0:07:54 - log:{"epoch": 29, "valid_fr-en_mt_ppl": 206.69015785463145, "valid_fr-en_mt_acc": 40.49940187275702, "valid_fr-en_mt_bleu": 10.23, "valid_en-fr_mt_ppl": 156.52663532211892, "valid_en-fr_mt_acc": 40.216746382514984, "valid_en-fr_mt_bleu": 10.53, "test_fr-en_mt_ppl": 145.12240020417357, "test_fr-en_mt_acc": 43.62819539357125, "test_fr-en_mt_bleu": 11.71, "test_en-fr_mt_ppl": 109.9388737280739, "test_en-fr_mt_acc": 43.34333102043557, "test_en-fr_mt_bleu": 12.17}

Do I need to change other hyperparams? Is an emb_dim of 1024 necessary?

EN-only pre-trained model

Hello,

Thank you for releasing all these models and code.

I want to fine-tune the EN-only pre-trained model for another monolingual task. I imagine the process is similar to that for the summarization task. As such, I was checking the commands used there and I noticed that the pre-trained model used is named MODEL=mass_en_1024.pth. However, the pre-trained model available in the Summarization section is named mass_summarization_1024.pth. Should I assume that that's the EN-only pre-trained model that I need to fine-tune?

Thank you.

func self.pred_layer.get_scores params error

In MASS/src/model/transformer.py, line 587, scores = self.pred_layer.get_scores(tensor, lang_id=tgt_lang_id) has a parameter error.
It should be scores = self.pred_layer.get_scores(tensor).

evaluator

During evaluation, do you need to mask the input sentences? I don't find this operation in the code. So is the PPL reported in the paper computed when the input is a whole sentence?

Illegal division by zero at multi-bleu.perl

Hi, we tried to pre-train the model with the command below, but we got the above message. Does anyone know what's wrong?
Is max_epoch too small? We just wanted to run a quick check as fast as we can.

python train.py
--exp_name unsupMT_enfr
--data_path ./data/processed/en-fr/
--lgs 'en-fr'
--mass_steps 'en,fr'
--encoder_only false
--emb_dim 1024
--n_layers 6
--n_heads 8
--dropout 0.1
--attention_dropout 0.1
--gelu_activation true
--tokens_per_batch 1000
--optimizer adam_inverse_sqrt,beta1=0.9,beta2=0.98,lr=0.0001
--epoch_size 1000
--max_epoch 1
--eval_bleu true
--word_mass 0.5
--min_len 5 \

About mask_sent in mass_step: the code and the paper description seem to be inconsistent.

trainer.py

def mask_sent(self, x, l):
    max_len = 0
    positions, inputs, targets, outputs = [], [], [], []
    mask_len = round(len(x[:, 0]) * self.params.word_mass)
    len2 = [mask_len for i in range(l.size(0))]

    for i in range(l.size(0)):
        words = x[:l[i], i].tolist()
        start = self.random_start(l[i] - mask_len + 1)

        pos_i, target_i, output_i, input_i = [], [], [], []
        prev_w = None
        for j, w in enumerate(words):
            if j >= start and j < start + mask_len:
                output_i.append(w)
                target_i.append(prev_w)
                pos_i.append(j - 1)
                input_i.append(self.mask_word(w))
            else:
                input_i.append(w)
            prev_w = w

        #################################
        # I have read the code and the paper.
        # In the mass_step, the input of the decoder is "targets".
        #
        # According to the paper,
        # should this "target_i[0]" be set to <mask_index>?
        # For example, like this:
        #
        # target_i[0] = self.params.mask_index
        #################################

        inputs.append(input_i)
        targets.append(target_i)
        outputs.append(output_i)
        positions.append(pos_i)

    x1  = torch.LongTensor(max(l), l.size(0)).fill_(self.params.pad_index)
    x2  = torch.LongTensor(mask_len, l.size(0)).fill_(self.params.pad_index)
    y   = torch.LongTensor(mask_len, l.size(0)).fill_(self.params.pad_index)
    pos = torch.LongTensor(mask_len, l.size(0))
    l1  = l.clone()
    l2  = torch.LongTensor(len2)
    for i in range(l.size(0)):
        x1[:l1[i], i].copy_(torch.LongTensor(inputs[i]))
        x2[:l2[i], i].copy_(torch.LongTensor(targets[i]))
        y[:l2[i], i].copy_(torch.LongTensor(outputs[i]))
        pos[:l2[i], i].copy_(torch.LongTensor(positions[i]))
    pred_mask = y != self.params.pad_index
    y = y.masked_select(pred_mask)
    return x1, l1, x2, l2, y, pred_mask, pos

[MEMORY ERROR] RuntimeError: CUDA out of memory.

Hi,

When I try to fine-tune the unsupervised NMT task, I encounter the message below. Do you know any other way to solve this error rather than reducing the batch size?

RuntimeError: CUDA out of memory. Tried to allocate 367.12 MiB (GPU 0; 10.91 GiB total capacity; 9.64 GiB already allocated; 203.38 MiB free; 249.00 MiB cached)

Thank you for reading this issue :)

StopIteration Error with fairseq-interactive

I can generate outputs with fairseq-generate but fail with fairseq-interactive. I would appreciate it if you have any ideas. Thanks!

Traceback (most recent call last):
File "/usr/local/python3/bin/fairseq-interactive", line 11, in
load_entry_point('fairseq==0.7.1', 'console_scripts', 'fairseq-interactive')()
File "/usr/local/python3/lib/python3.6/site-packages/fairseq_cli/interactive.py", line 185, in cli_main
main(args)
File "/usr/local/python3/lib/python3.6/site-packages/fairseq_cli/interactive.py", line 121, in main
task.max_positions(),
File "/home/user/MASS/MASS-fairseq/mass/xmasked_seq2seq.py", line 487, in max_positions
for key in next(iter(self.datasets.values())).datasets.keys()
StopIteration

run get-data-gigaword.sh failed

Can someone explain how to run the code for summarization?
Running get-data-gigaword.sh gives me millions of errors:

Learning BPE codes...
Loading vocabulary from /home/lily/zl379/Projects/ANLP/MASS/MASS/data/para//train.ar-ti.ar ...
Read 0 words (0 unique) from text file.
Loading vocabulary from /home/lily/zl379/Projects/ANLP/MASS/MASS/data/para//train.ar-ti.ti ...
Read 0 words (0 unique) from text file.
Segmentation fault (core dumped)
BPE learned in /home/lily/zl379/Projects/ANLP/MASS/MASS/data/processed/giga//codes
Applying article BPE codes...
Loading codes from /home/lily/zl379/Projects/ANLP/MASS/MASS/data/processed/giga//codes ...
Read 0 codes from the codes file.
Loading vocabulary from /home/lily/zl379/Projects/ANLP/MASS/MASS/data/para//train.ar-ti.ar ...
Read 0 words (0 unique) from text file.
Applying BPE to /home/lily/zl379/Projects/ANLP/MASS/MASS/data/para//train.ar-ti.ar ...
Output memory map failed : 22.
Applying title BPE codes...
Loading codes from /home/lily/zl379/Projects/ANLP/MASS/MASS/data/processed/giga//codes ...
Read 0 codes from the codes file.
Loading vocabulary from /home/lily/zl379/Projects/ANLP/MASS/MASS/data/para//train.ar-ti.ti ...
Read 0 words (0 unique) from text file.
Applying BPE to /home/lily/zl379/Projects/ANLP/MASS/MASS/data/para//train.ar-ti.ti ...
Output memory map failed : 22.
Extracting vocabulary...
Loading vocabulary from /home/lily/zl379/Projects/ANLP/MASS/MASS/data/processed/giga//train.ar-ti.ar ...
Read 0 words (0 unique) from text file.
Loading vocabulary from /home/lily/zl379/Projects/ANLP/MASS/MASS/data/processed/giga//train.ar-ti.ti ...
Read 0 words (0 unique) from text file.
Full vocab in: /home/lily/zl379/Projects/ANLP/MASS/MASS/data/processed/giga//vocab.ar-ti
Loading codes from /home/lily/zl379/Projects/ANLP/MASS/MASS/data/processed/giga//codes ...
Read 0 codes from the codes file.
Loading vocabulary from /home/lily/zl379/Projects/ANLP/MASS/MASS/data/para//valid.ar-ti.ar ...
Read 0 words (0 unique) from text file.
Applying BPE to /home/lily/zl379/Projects/ANLP/MASS/MASS/data/para//valid.ar-ti.ar ...
Output memory map failed : 22.
Loading codes from /home/lily/zl379/Projects/ANLP/MASS/MASS/data/processed/giga//codes ...
Read 0 codes from the codes file.
Loading vocabulary from /home/lily/zl379/Projects/ANLP/MASS/MASS/data/para//valid.ar-ti.ti ...
Read 0 words (0 unique) from text file.
Applying BPE to /home/lily/zl379/Projects/ANLP/MASS/MASS/data/para//valid.ar-ti.ti ...
Output memory map failed : 22.
Loading codes from /home/lily/zl379/Projects/ANLP/MASS/MASS/data/processed/giga//codes ...
Read 0 codes from the codes file.
Loading vocabulary from /home/lily/zl379/Projects/ANLP/MASS/MASS/data/para//test.ar-ti.ar ...
Read 0 words (0 unique) from text file.
Applying BPE to /home/lily/zl379/Projects/ANLP/MASS/MASS/data/para//test.ar-ti.ar ...
Output memory map failed : 22.
Loading codes from /home/lily/zl379/Projects/ANLP/MASS/MASS/data/processed/giga//codes ...
Read 0 codes from the codes file.
Loading vocabulary from /home/lily/zl379/Projects/ANLP/MASS/MASS/data/para//test.ar-ti.ar ...
Read 0 words (0 unique) from text file.
Applying BPE to /home/lily/zl379/Projects/ANLP/MASS/MASS/data/para//test.ar-ti.ar ...
Output memory map failed : 22.
INFO - 07/05/19 22:21:53 - 0:00:00 - Read 14 words from the vocabulary file.

Traceback (most recent call last):
File "/home/lily/zl379/Projects/ANLP/MASS/MASS/preprocess.py", line 37, in
data = Dictionary.index_data(txt_path, bin_path, dico)
File "/data/lily/zl379/Projects/ANLP/MASS/MASS/src/data/dictionary.py", line 217, in index_data
assert sentences.min() >= 0
File "/home/lily/zl379/anaconda2/envs/py36/lib/python3.6/site-packages/numpy/core/_methods.py", line 32, in _amin
return umr_minimum(a, axis, None, out, keepdims, initial)
ValueError: zero-size array to reduction operation minimum which has no identity
INFO - 07/05/19 22:21:54 - 0:00:00 - Read 14 words from the vocabulary file.

Traceback (most recent call last):
File "/home/lily/zl379/Projects/ANLP/MASS/MASS/preprocess.py", line 37, in
data = Dictionary.index_data(txt_path, bin_path, dico)
File "/data/lily/zl379/Projects/ANLP/MASS/MASS/src/data/dictionary.py", line 217, in index_data
assert sentences.min() >= 0
File "/home/lily/zl379/anaconda2/envs/py36/lib/python3.6/site-packages/numpy/core/_methods.py", line 32, in _amin
return umr_minimum(a, axis, None, out, keepdims, initial)
ValueError: zero-size array to reduction operation minimum which has no identity
INFO - 07/05/19 22:21:54 - 0:00:00 - Read 14 words from the vocabulary file.

Traceback (most recent call last):
File "/home/lily/zl379/Projects/ANLP/MASS/MASS/preprocess.py", line 37, in
data = Dictionary.index_data(txt_path, bin_path, dico)
File "/data/lily/zl379/Projects/ANLP/MASS/MASS/src/data/dictionary.py", line 217, in index_data
assert sentences.min() >= 0
File "/home/lily/zl379/anaconda2/envs/py36/lib/python3.6/site-packages/numpy/core/_methods.py", line 32, in _amin
return umr_minimum(a, axis, None, out, keepdims, initial)
ValueError: zero-size array to reduction operation minimum which has no identity
INFO - 07/05/19 22:21:55 - 0:00:00 - Read 14 words from the vocabulary file.

Traceback (most recent call last):
File "/home/lily/zl379/Projects/ANLP/MASS/MASS/preprocess.py", line 37, in
data = Dictionary.index_data(txt_path, bin_path, dico)
File "/data/lily/zl379/Projects/ANLP/MASS/MASS/src/data/dictionary.py", line 217, in index_data
assert sentences.min() >= 0
File "/home/lily/zl379/anaconda2/envs/py36/lib/python3.6/site-packages/numpy/core/_methods.py", line 32, in _amin
return umr_minimum(a, axis, None, out, keepdims, initial)
ValueError: zero-size array to reduction operation minimum which has no identity
INFO - 07/05/19 22:21:55 - 0:00:00 - Read 14 words from the vocabulary file.

Traceback (most recent call last):
File "/home/lily/zl379/Projects/ANLP/MASS/MASS/preprocess.py", line 37, in
data = Dictionary.index_data(txt_path, bin_path, dico)
File "/data/lily/zl379/Projects/ANLP/MASS/MASS/src/data/dictionary.py", line 217, in index_data
assert sentences.min() >= 0
File "/home/lily/zl379/anaconda2/envs/py36/lib/python3.6/site-packages/numpy/core/_methods.py", line 32, in _amin
return umr_minimum(a, axis, None, out, keepdims, initial)
ValueError: zero-size array to reduction operation minimum which has no identity
INFO - 07/05/19 22:21:56 - 0:00:00 - Read 14 words from the vocabulary file.

Traceback (most recent call last):
File "/home/lily/zl379/Projects/ANLP/MASS/MASS/preprocess.py", line 37, in
data = Dictionary.index_data(txt_path, bin_path, dico)
File "/data/lily/zl379/Projects/ANLP/MASS/MASS/src/data/dictionary.py", line 217, in index_data
assert sentences.min() >= 0
File "/home/lily/zl379/anaconda2/envs/py36/lib/python3.6/site-packages/numpy/core/_methods.py", line 32, in _amin
return umr_minimum(a, axis, None, out, keepdims, initial)
ValueError: zero-size array to reduction operation minimum which has no identity

I downloaded the tools and giga dataset, what else should I do?

Typo

MASS: Masked Sequence to Sequence Pre-training for Langauge Generation
In the GitHub repository description.
"Langauge" should be "Language".

How to decide how many GPUs are good for fine-tuning?

I want to fine-tune a MASS model on three different datasets, whose sentence counts are respectively 2M (en-de), 1M (en-fr) and 0.4M (en-ro). I plan to use the public MASS fine-tuning settings, and I want to know how many GPUs are suited for each dataset scale.

Any effective suggestions are welcome, and thanks a lot.

Why is the data in the hypotheses folder abnormal?

In my code, both the valid and test set are normal.

INFO - 06/26/19 16:40:35 - 0:00:07 - ============ Data summary
INFO - 06/26/19 16:40:35 - 0:00:07 - Monolingual data   - train -           en:   5000000
INFO - 06/26/19 16:40:35 - 0:00:07 - Monolingual data   - valid -           en:      3000
INFO - 06/26/19 16:40:35 - 0:00:07 - Monolingual data   -  test -           en:      2999
INFO - 06/26/19 16:40:35 - 0:00:07 - Monolingual data   - train -           de:   5000000
INFO - 06/26/19 16:40:35 - 0:00:07 - Monolingual data   - valid -           de:      3000
INFO - 06/26/19 16:40:35 - 0:00:07 - Monolingual data   -  test -           de:      2999
INFO - 06/26/19 16:40:35 - 0:00:07 - Parallel data      - valid -        de-en:      3000
INFO - 06/26/19 16:40:35 - 0:00:07 - Parallel data      -  test -        de-en:      2999

However, why is the data in folder hypotheses abnormal ?

$ wc -l  hypotheses/*
   50 hyp0.de-en.test.txt
   50 hyp0.de-en.valid.txt
   50 hyp0.en-de.test.txt
   50 hyp0.en-de.valid.txt
   50 hyp1.de-en.test.txt
   50 hyp1.de-en.valid.txt
   50 hyp1.en-de.test.txt
   50 hyp1.en-de.valid.txt
   50 ref.de-en.test.txt
   50 ref.de-en.valid.txt
   50 ref.en-de.test.txt
   50 ref.en-de.valid.txt

Why do they only have 50 sentences ?

Hyperparameters to reproduce the paper results

Could you share the full command (and default hyperparameters) you used? For example, I found --word_mask_keep_rand is "0.8,0.1,0.1" in MASS but "0,0,1" in MASS-fairseq.

How can I reproduce the en-fr results of unsupervised NMT?

I used the MASS model you provided to fine-tune the en-fr unsupervised system, but the results are more than 1 BLEU lower than the scores reported in the paper.
My training script is as follows.

data=/search/odin/mmyin/XLM/data/processed/en-fr
mass_model=/search/odin/mmyin/MASS/Unsupervised/MASS-EN-FR/mass_enfr_1024.pth
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
export NGPU=8

python -m torch.distributed.launch --nproc_per_node=$NGPU train.py \
    --exp_name unsupMT_enfr                              \
    --exp_id en-fr_1024                                  \
    --data_path $data                                    \
    --reload_model $mass_model,$mass_model               \
    --lgs 'en-fr'                                        \
    --bt_steps 'en-fr-en,fr-en-fr'                       \
    --encoder_only false                                 \
    --emb_dim 1024                                       \
    --n_layers 6                                         \
    --n_heads 8                                          \
    --dropout 0.1                                        \
    --attention_dropout 0.1                              \
    --gelu_activation true                               \
    --tokens_per_batch 1000                              \
    --optimizer adam_inverse_sqrt,beta1=0.9,beta2=0.98,lr=0.0001 \
    --epoch_size 200000                                  \
    --max_epoch 50                                       \
    --eval_bleu true                                     \
    --word_mass 0.5                                      \
    --min_len 5                                          \
    --stopping_criterion 'valid_en-fr_mt_bleu,10'        \
    --validation_metrics 'valid_en-fr_mt_bleu'

I used beam search for decoding, and beam-size was set to 10.
Here are the reproduced scores.

UNMT en-fr fr-en
MASS 37.5 34.6
MASS (our reproduction) 36.16 33.67

Baseline Implementations

Hi, thanks for sharing this work! Would really appreciate if you could point me out to the baseline implementations for unsupervised NMT -- specifically the BERT+LM and DAE methods with perhaps more details on the experimental parameters for the baseline tasks -- since I couldn't find them in the paper. Thanks.

size mismatch when reloading pretrained model

I am trying to fine-tune the pre-trained summarization model on my own dataset. After pre-processing, my vocabulary is not the same as the one in the pre-trained model, and it gives me an error:

size mismatch for embeddings.weight: copying a param with shape torch.Size([32268, 1024]) from checkpoint, the shape in current model is torch.Size([29598, 1024])

How can I ensure the vocabularies of different datasets match each other? Thanks a lot.

Missing key(s) in state_dict: "module.lang_embeddings.weight". when resuming the NMT training

I am trying to resume training from a checkpoint, but am getting an exception.

How to reproduce.

  1. Pre-train MASS. Note the MODEL path.
  2. Fine-tune the NMT model with --reload_model "$MODEL,$MODEL". Fine-tune training stopped after a certain number of epochs.
  3. Try to resume fine-tune training. This time use the same --exp_name and --exp_id as step 2, with no need to set the --reload_model argument.
    (I just want to resume training from a previous checkpoint; if these steps are incorrect, please let me know the correct steps.)

The logs show that the trainer tries to restore the earlier checkpoint,

Reloading checkpoint from ./dumped/mt-models/<exp_name>/<exp_id>/checkpoint.pth

however, an exception is raised:

  File "MASS/MASS/src/trainer.py", line 99, in __init__
    super().__init__(data, params)
  File "MASS/MASS/src/trainer.py", line 99, in __init__
    self.reload_checkpoint()
  File "MASS/MASS/src/trainer.py", line 465, in reload_checkpoint
    self.reload_checkpoint()
  File "MASS/MASS/src/trainer.py", line 465, in reload_checkpoint
    getattr(self, name).load_state_dict(data[name])
  File "/nas/home/tg/libs/miniconda3/envs/fb-unsupmt/lib/python3.7/site-packages/torch/nn/modules/module.py", line 777, in load_state_dict
    getattr(self, name).load_state_dict(data[name])
  File "/nas/home/tg/libs/miniconda3/envs/fb-unsupmt/lib/python3.7/site-packages/torch/nn/modules/module.py", line 777, in load_state_dict
    self.__class__.__name__, "\n\t".join(error_msgs)))
RuntimeError: Error(s) in loading state_dict for DistributedDataParallel:
        Missing key(s) in state_dict: "module.lang_embeddings.weight".
    self.__class__.__name__, "\n\t".join(error_msgs)))
RuntimeError: Error(s) in loading state_dict for DistributedDataParallel:
        Missing key(s) in state_dict: "module.lang_embeddings.weight".

It looks like the issue is related to #15
