
bert-nmt's Introduction

Introduction

This repository contains the code for BERT-fused NMT, which is introduced in the ICLR2020 paper Incorporating BERT into Neural Machine Translation.

If you find this work helpful in your research, please cite as:

@inproceedings{
Zhu2020Incorporating,
title={Incorporating BERT into Neural Machine Translation},
author={Jinhua Zhu and Yingce Xia and Lijun Wu and Di He and Tao Qin and Wengang Zhou and Houqiang Li and Tieyan Liu},
booktitle={International Conference on Learning Representations},
year={2020},
url={https://openreview.net/forum?id=Hyl7ygStwB}
}

NOTE: We have updated our code so that you can use the more powerful pretrained models available in huggingface/transformers. With bert-base-german-dbmdz-uncased, we obtain a new result of 37.34 BLEU on the IWSLT'14 de->en task.

Requirements and Installation

  • PyTorch version == 1.0.0/1.1.0
  • Python version >= 3.5

Installing from source

To install bert-nmt from source and develop locally:

git clone https://github.com/bert-nmt/bert-nmt
cd bert-nmt
pip install --editable .
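
A quick way to confirm that your Python and PyTorch versions match the requirements above (this is only a convenience check, not part of the original setup):

python3 -c "import sys, torch; print(sys.version.split()[0], torch.__version__)"
# expect Python >= 3.5 and torch 1.0.0 or 1.1.0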

Getting Started

Data Preprocessing

First, you should run the Fairseq prepare-xxx.sh script (e.g., prepare-iwslt14.sh) to get tokenized and BPE-processed files like:

train.en train.de valid.en valid.de test.en test.de

Then you can use makedataforbert.sh to get the input files for the BERT model (please make sure the paths in the script are correct). You will get:

train.en train.de valid.en valid.de test.en test.de train.bert.en valid.bert.en test.bert.en
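
For example, assuming your tokenized data sits in iwslt14.tokenized.de-en and makedataforbert.sh has been copied into that directory (the paths here are illustrative), you could run:

cd iwslt14.tokenized.de-en
# strip BPE and detokenize the English side to produce train/valid/test.bert.en
bash makedataforbert.sh en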

Then preprocess the data as you would with Fairseq:

python preprocess.py --source-lang src_lng --target-lang tgt_lng \
  --trainpref $TEXT/train --validpref $TEXT/valid --testpref $TEXT/test \
  --destdir destdir  --joined-dictionary --bert-model-name bert-base-uncased

Note: For more language pairs used in our paper, please refer to another repo.

Train a vanilla NMT model using Fairseq

Using the data above and the standard Fairseq repository, you can train a vanilla NMT model to serve as the pretrained model.

Note: The update_freq for IWSLT en->zh translation is set to 2; the other hyper-parameters are the same as for de<->en.
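
For reference, a vanilla IWSLT'14 de->en pretraining run with the standard Fairseq toolkit might look like the sketch below; $DATAPATH and $PRETRAIN_DIR are placeholders, and the hyper-parameters follow the usual Fairseq IWSLT recipe rather than anything prescribed by this repo:

# pretrain a vanilla Transformer on the binarized data (illustrative settings)
CUDA_VISIBLE_DEVICES=0 python train.py $DATAPATH \
  --arch transformer_iwslt_de_en --share-all-embeddings \
  --optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
  --lr 5e-4 --lr-scheduler inverse_sqrt --warmup-updates 4000 \
  --dropout 0.3 --weight-decay 0.0001 \
  --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
  --max-tokens 4096 --save-dir $PRETRAIN_DIR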

Train a BERT-fused NMT model

The important options we add:

        parser.add_argument('--bert-model-name', default='bert-base-uncased', type=str)
        parser.add_argument('--warmup-from-nmt', action='store_true', )
        parser.add_argument('--warmup-nmt-file', default='checkpoint_nmt.pt', )
        parser.add_argument('--encoder-bert-dropout', action='store_true',)
        parser.add_argument('--encoder-bert-dropout-ratio', default=0.25, type=float)
  1. --bert-model-name specifies the BERT model name; the supported names are provided in file.
  2. --warmup-from-nmt indicates that you will use a pretrained NMT model to warm up the BERT-fused NMT model. If you use this option, we suggest you also use --reset-lr-scheduler.
  3. --warmup-nmt-file specifies the pretrained NMT checkpoint file name (in your $savedir).
  4. --encoder-bert-dropout indicates that you will use the drop-net trick.
  5. --encoder-bert-dropout-ratio specifies the ratio ($\in [0, 0.5]$) used in drop-net.

This is a training script example:
#!/usr/bin/env bash
nvidia-smi

cd /yourpath/bertnmt
python3 -c "import torch; print(torch.__version__)"

src=en
tgt=de
bedropout=0.5
ARCH=transformer_s2_iwslt_de_en
DATAPATH=/yourdatapath
SAVEDIR=checkpoints/iwed_${src}_${tgt}_${bedropout}
mkdir -p $SAVEDIR
if [ ! -f $SAVEDIR/checkpoint_nmt.pt ]
then
    cp /your_pretrained_nmt_model $SAVEDIR/checkpoint_nmt.pt
fi
if [ ! -f "$SAVEDIR/checkpoint_last.pt" ]
then
warmup="--warmup-from-nmt --reset-lr-scheduler"
else
warmup=""
fi

python train.py $DATAPATH \
-a $ARCH --optimizer adam --lr 0.0005 -s $src -t $tgt --label-smoothing 0.1 \
--dropout 0.3 --max-tokens 4000 --min-lr '1e-09' --lr-scheduler inverse_sqrt --weight-decay 0.0001 \
--criterion label_smoothed_cross_entropy --max-update 150000 --warmup-updates 4000 --warmup-init-lr '1e-07' \
--adam-betas '(0.9,0.98)' --save-dir $SAVEDIR --share-all-embeddings $warmup \
--encoder-bert-dropout --encoder-bert-dropout-ratio $bedropout | tee -a $SAVEDIR/training.log

Generate

Using generate.py to test the model is the same as in Fairseq, except that you should add --bert-model-name to indicate your BERT model name.
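
For example, a test run on the binarized data used above might look like this sketch, where $DATAPATH, $SAVEDIR, $src and $tgt are the same placeholders as in the training script:

python generate.py $DATAPATH -s $src -t $tgt \
  --path $SAVEDIR/checkpoint_best.pt \
  --batch-size 128 --beam 5 --remove-bpe \
  --bert-model-name bert-base-uncased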

Using interactive.py to test the model is a little different from Fairseq. You should follow this procedure:

sed -r 's/(@@ )|(@@ ?$)//g' $bpefile > $bpefile.debpe
$MOSES/scripts/tokenizer/detokenizer.perl -l $src < $bpefile.debpe > $bpefile.debpe.detok
paste -d "\n" $bpefile $bpefile.debpe.detok > $bpefile.in
cat $bpefile.in | python interactive.py  -s $src -t $tgt \
--buffer-size 1024 --batch-size 128 --beam 5 --remove-bpe  > output.log
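
After these steps, $bpefile.in interleaves each BPE-segmented line with its detokenized counterpart (two lines per source sentence); the BPE line is consumed by the NMT encoder and the detokenized line by BERT, mirroring how makedataforbert.sh prepares the *.bert.* files. A quick way to check the layout (the sentence shown is made up):

head -n 2 $bpefile.in
# new@@ jer@@ sey is sometimes cloudy .   <- BPE line for the NMT encoder
# new jersey is sometimes cloudy.         <- detokenized line for BERT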

bert-nmt's People

Contributors

bert-nmt, teslacool


bert-nmt's Issues

No such file or directory 'squad/dict.en.txt'

I am trying to use this model to generate questions from sentences. The model trains, but when I try to test with generate.py, I get this error:

FileNotFoundError: [Errno 2] No such file or directory: 'squaden/dict.en.txt'

My command is
python generate.py squaden --path checkpoints/iwed_en_tgt_0.5/checkpoint_best.pt --batch-size 128 --beam=5 --bert-model-name "bert-base-uncased" --cpu --source-lang en --target-lang tgt

I have also tried using interactive.py, but it gives me an error saying "data is a required argument". When I pass in data as an arg instead of through stdin, it gives me the same error as generate.py. Any advice would be appreciated.

num_updates

Hi @bert-nmt
What caused this error, and which parameters should I add or modify to fix it?

File "train.py", line 167
progress.print(stats, tag='train', step=stats['num_updates'])

thanks

Are all configurable parameters included in the training script example?

python train.py $DATAPATH
-a $ARCH --optimizer adam --lr 0.0005 -s $src -t $tgt --label-smoothing 0.1
--dropout 0.3 --max-tokens 4000 --min-lr '1e-09' --lr-scheduler inverse_sqrt --weight-decay 0.0001
--criterion label_smoothed_cross_entropy --max-update 150000 --warmup-updates 4000 --warmup-init-lr '1e-07'
--adam-betas '(0.9,0.98)' --save-dir $SAVEDIR --share-all-embeddings $warmup
--encoder-bert-dropout --encoder-bert-dropout-ratio $bedropout | tee -a $SAVEDIR/training.log
Are the options above all the parameters that can be set? If I want to unfreeze the BERT parameters during training, how should I do that?

--encoder-bert-dropout-ratio meaning?

The paper mentions a range of [0, 1.0] for the drop-net probability, but the code uses [0, 0.5]. Is there any difference between them? And what happens when the ratio is lower, e.g. 0.1 or 0.0?

RuntimeError: index out of range: Tried to access index 512 out of table with 511 rows.

I am receiving the error RuntimeError: index out of range: Tried to access index 512 out of table with 511 rows. when translating long sentences (full stack trace attached below).
It looks like a sample with 329 src_tokens produces a bert_input tensor of length 568, which is too long.
What can I do to relax this source sentence length constraint?

{'net_input': 
  {
    'src_tokens': tensor([[   38,     6,     5,    34,     5,    ...]]), 
    'src_lengths': tensor([329]),
    'bert_input': tensor([[   101,  10105,    167,    115,  12100,   ...]])
  }
}

'src_tokens' tensor length is 329
'bert_input' tensor length is 568

Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/flask/app.py", line 2463, in __call__
    return self.wsgi_app(environ, start_response)
  File "/usr/local/lib/python3.7/site-packages/flask/app.py", line 2449, in wsgi_app
    response = self.handle_exception(e)
  File "/usr/local/lib/python3.7/site-packages/flask/app.py", line 1866, in handle_exception
    reraise(exc_type, exc_value, tb)
  File "/usr/local/lib/python3.7/site-packages/flask/_compat.py", line 39, in reraise
    raise value
  File "/usr/local/lib/python3.7/site-packages/flask/app.py", line 2446, in wsgi_app
    response = self.full_dispatch_request()
  File "/usr/local/lib/python3.7/site-packages/flask/app.py", line 1951, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/usr/local/lib/python3.7/site-packages/flask/app.py", line 1820, in handle_user_exception
    reraise(exc_type, exc_value, tb)
  File "/usr/local/lib/python3.7/site-packages/flask/_compat.py", line 39, in reraise
    raise value
  File "/usr/local/lib/python3.7/site-packages/flask/app.py", line 1949, in full_dispatch_request
    rv = self.dispatch_request()
  File "/usr/local/lib/python3.7/site-packages/flask/app.py", line 1935, in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
  File "/Users/erikchan/Workspace/nda-ai/postedit_models/tf-serving/flask_app/app/main.py", line 52, in tagatag_v2
    outputs = translate_v2(inputs, src_id, targ_id, TAGATAG_MODELS, args, utils, src_dict, tgt_dict)
  File "/Users/erikchan/Workspace/nda-ai/postedit_models/tf-serving/flask_app/app/query_tagatag_v2.py", line 58, in translate_v2
    translations = task.inference_step(generator, TAGATAG_MODELS, sample)
  File "/Users/erikchan/Workspace/nda-ai/postedit_models/tf-serving/flask_app/app/fairseq/tasks/fairseq_task.py", line 246, in inference_step
    return generator.generate(models, sample, prefix_tokens=prefix_tokens)
  File "/usr/local/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 49, in decorate_no_grad
    return func(*args, **kwargs)
  File "/Users/erikchan/Workspace/nda-ai/postedit_models/tf-serving/flask_app/app/fairseq/sequence_generator.py", line 152, in generate
    bert_outs, _ = model.models[0].bert_encoder(bertinput, output_all_encoded_layers=True, attention_mask=~bert_encoder_padding_mask)
  File "/usr/local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/Users/erikchan/Workspace/nda-ai/postedit_models/tf-serving/flask_app/app/bert/modeling.py", line 736, in forward
    embedding_output = self.embeddings(input_ids, token_type_ids)
  File "/usr/local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/Users/erikchan/Workspace/nda-ai/postedit_models/tf-serving/flask_app/app/bert/modeling.py", line 272, in forward
    position_embeddings = self.position_embeddings(position_ids)
  File "/usr/local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/torch/nn/modules/sparse.py", line 114, in forward
    self.norm_type, self.scale_grad_by_freq, self.sparse)
  File "/usr/local/lib/python3.7/site-packages/torch/nn/functional.py", line 1484, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: index out of range: Tried to access index 512 out of table with 511 rows. at ../aten/src/TH/generic/THTensorEvenMoreMath.cpp:418

problems with obtaining pretrained model

Can you tell me exactly how you obtained the pretrained model? I followed the steps on the readme page and ran prepare-iwslt14.sh (provided in examples) and then makedataforbert.sh. Then I ran preprocess.py, making sure to give it --joined-dictionary and that the encoder/decoder have the same dimensions that the transformer_s2 architecture expects. My problem is that the transformer model that I should pretrain is not training properly and the validation loss keeps increasing.

My prepare-iwslt14.sh file:

echo 'Cloning Moses github repository (for tokenization scripts)...'
git clone https://github.com/moses-smt/mosesdecoder.git

echo 'Cloning Subword NMT repository (for BPE pre-processing)...'
git clone https://github.com/rsennrich/subword-nmt.git

SCRIPTS=mosesdecoder/scripts
TOKENIZER=$SCRIPTS/tokenizer/tokenizer.perl
LC=$SCRIPTS/tokenizer/lowercase.perl
CLEAN=$SCRIPTS/training/clean-corpus-n.perl
BPEROOT=subword-nmt/subword_nmt
BPE_TOKENS=10000

URL="https://wit3.fbk.eu/archive/2014-01/texts/de/en/de-en.tgz"
GZ=de-en.tgz

if [ ! -d "$SCRIPTS" ]; then
echo "Please set SCRIPTS variable correctly to point to Moses scripts."
exit
fi

src=de
tgt=en
lang=de-en
prep=iwslt14.tokenized.de-en
tmp=$prep/tmp
orig=orig

mkdir -p $orig $tmp $prep

echo "Downloading data from ${URL}..."
cd $orig
wget "$URL"

if [ -f $GZ ]; then
echo "Data successfully downloaded."
else
echo "Data not successfully downloaded."
exit
fi

tar zxvf $GZ
cd ..

echo "pre-processing train data..."
for l in $src $tgt; do
f=train.tags.$lang.$l
tok=train.tags.$lang.tok.$l

cat $orig/$lang/$f | \
grep -v '<url>' | \
grep -v '<talkid>' | \
grep -v '<keywords>' | \
sed -e 's/<title>//g' | \
sed -e 's/<\/title>//g' | \
sed -e 's/<description>//g' | \
sed -e 's/<\/description>//g' | \
perl $TOKENIZER -threads 8 -l $l > $tmp/$tok
echo ""

done
perl $CLEAN -ratio 1.5 $tmp/train.tags.$lang.tok $src $tgt $tmp/train.tags.$lang.clean 1 175
for l in $src $tgt; do
perl $LC < $tmp/train.tags.$lang.clean.$l > $tmp/train.tags.$lang.$l
done

echo "pre-processing valid/test data..."
for l in $src $tgt; do
for o in `ls $orig/$lang/IWSLT14.TED*.$l.xml`; do
fname=${o##*/}
f=$tmp/${fname%.*}
echo $o $f
grep '<seg id' $o | \
    sed -e 's/<seg id="[0-9]*">\s*//g' | \
    sed -e 's/\s*<\/seg>\s*//g' | \
    sed -e "s/\’/\'/g" | \
perl $TOKENIZER -threads 8 -l $l | \
perl $LC > $f
echo ""
done
done

echo "creating train, valid, test..."
for l in $src $tgt; do
awk '{if (NR%23 == 0) print $0; }' $tmp/train.tags.de-en.$l > $tmp/valid.$l
awk '{if (NR%23 != 0) print $0; }' $tmp/train.tags.de-en.$l > $tmp/train.$l

cat $tmp/IWSLT14.TED.dev2010.de-en.$l \
    $tmp/IWSLT14.TEDX.dev2012.de-en.$l \
    $tmp/IWSLT14.TED.tst2010.de-en.$l \
    $tmp/IWSLT14.TED.tst2011.de-en.$l \
    $tmp/IWSLT14.TED.tst2012.de-en.$l \
    > $tmp/test.$l

done

TRAIN=$tmp/train.en-de
BPE_CODE=$prep/code
rm -f $TRAIN
for l in $src $tgt; do
cat $tmp/train.$l >> $TRAIN
done

echo "learn_bpe.py on ${TRAIN}..."
python $BPEROOT/learn_bpe.py -s $BPE_TOKENS < $TRAIN > $BPE_CODE

for L in $src $tgt; do
for f in train.$L valid.$L test.$L; do
echo "apply_bpe.py to ${f}..."
python $BPEROOT/apply_bpe.py -c $BPE_CODE < $tmp/$f > $prep/$f
done
done

My makedataforbert.sh file (which is properly placed):
#!/usr/bin/env bash
lng=$1
echo "src lng $lng"
for sub in train valid test
do
    sed -r 's/(@@ )|(@@ ?$)//g' ${sub}.${lng} > ${sub}.bert.${lng}.tok
    ../mosesdecoder/scripts/tokenizer/detokenizer.perl -l $lng < ${sub}.bert.${lng}.tok > ${sub}.bert.${lng}
    rm ${sub}.bert.${lng}.tok
done

My preprocessing command (which I ran from the bert-nmt directory):

TEXT=iwslt14.tokenized.de-en
python preprocess.py --source-lang de --target-lang en \
  --trainpref $TEXT/train --validpref $TEXT/valid --testpref $TEXT/test \
  --destdir destdir --joined-dictionary --bert-model-name bert-base-uncased

My training command for obtaining a pretrained model (run from the fairseq directory):

CUDA_VISIBLE_DEVICES=0 python train.py ../iwslt_de_en --arch transformer_iwslt_de_en --share-decoder-input-output-embed --optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 --lr 5e-4 --lr-scheduler inverse_sqrt --warmup-updates 4000 --dropout 0.3 --weight-decay 0.0001 --criterion label_smoothed_cross_entropy --label-smoothing 0.1 --max-tokens 4096 --share-all-embeddings

My logs:
Namespace(activation_dropout=0.0, activation_fn='relu', adam_betas='(0.9, 0.98)', adam_eps=1e-08, adaptive_input=False, adaptive_softmax_cutoff=None, adaptive_softmax_dropout=0, arch='transformer_iwslt_de_en', attention_dropout=0.0, bucket_cap_mb=25, clip_norm=0.0, cpu=False, criterion='label_smoothed_cross_entropy', curriculum=0, data='../iwslt_de_en', dataset_impl='cached', ddp_backend='c10d', decoder_attention_heads=4, decoder_embed_dim=512, decoder_embed_path=None, decoder_ffn_embed_dim=1024, decoder_input_dim=512, decoder_layers=6, decoder_learned_pos=False, decoder_normalize_before=False, decoder_output_dim=512, device_id=0, disable_validation=False, distributed_backend='nccl', distributed_init_method=None, distributed_no_spawn=False, distributed_port=-1, distributed_rank=0, distributed_world_size=1, dropout=0.3, encoder_attention_heads=4, encoder_embed_dim=512, encoder_embed_path=None, encoder_ffn_embed_dim=1024, encoder_layers=6, encoder_learned_pos=False, encoder_normalize_before=False, find_unused_parameters=False, fix_batches_to_gpus=False, fp16=False, fp16_init_scale=128, fp16_scale_tolerance=0.0, fp16_scale_window=None, keep_interval_updates=-1, keep_last_epochs=-1, label_smoothing=0.1, lazy_load=False, left_pad_source='True', left_pad_target='False', log_format=None, log_interval=1000, lr=[0.0005], lr_scheduler='inverse_sqrt', max_epoch=0, max_sentences=None, max_sentences_valid=None, max_source_positions=1024, max_target_positions=1024, max_tokens=4096, max_update=0, memory_efficient_fp16=False, min_loss_scale=0.0001, min_lr=-1, no_epoch_checkpoints=False, no_progress_bar=False, no_save=False, no_token_positional_embeddings=False, num_workers=0, optimizer='adam', optimizer_overrides='{}', raw_text=False, required_batch_size_multiple=8, reset_dataloader=False, reset_lr_scheduler=False, reset_meters=False, reset_optimizer=False, restore_file='checkpoint_last.pt', save_dir='checkpoints', save_interval=1, save_interval_updates=0, seed=1, sentence_avg=False, share_all_embeddings=True, share_decoder_input_output_embed=True, skip_invalid_size_inputs_valid_test=False, source_lang=None, target_lang=None, task='translation', tbmf_wrapper=False, tensorboard_logdir='', threshold_loss_scale=None, train_subset='train', update_freq=[1], upsample_primary=1, user_dir=None, valid_subset='valid', validate_interval=1, warmup_init_lr=-1, warmup_updates=4000, weight_decay=0.0001)
| [de] dictionary: 10152 types
| [en] dictionary: 10152 types
| ../iwslt_de_en valid de-en 7283 examples
TransformerModel(
(encoder): TransformerEncoder(
(embed_tokens): Embedding(10152, 512, padding_idx=1)
(embed_positions): SinusoidalPositionalEmbedding()
(layers): ModuleList(
(0): TransformerEncoderLayer(
(self_attn): MultiheadAttention(
(out_proj): Linear(in_features=512, out_features=512, bias=True)
)
(self_attn_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
(fc1): Linear(in_features=512, out_features=1024, bias=True)
(fc2): Linear(in_features=1024, out_features=512, bias=True)
(final_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
)
(1): TransformerEncoderLayer(
(self_attn): MultiheadAttention(
(out_proj): Linear(in_features=512, out_features=512, bias=True)
)
(self_attn_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
(fc1): Linear(in_features=512, out_features=1024, bias=True)
(fc2): Linear(in_features=1024, out_features=512, bias=True)
(final_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
)
(2): TransformerEncoderLayer(
(self_attn): MultiheadAttention(
(out_proj): Linear(in_features=512, out_features=512, bias=True)
)
(self_attn_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
(fc1): Linear(in_features=512, out_features=1024, bias=True)
(fc2): Linear(in_features=1024, out_features=512, bias=True)
(final_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
)
(3): TransformerEncoderLayer(
(self_attn): MultiheadAttention(
(out_proj): Linear(in_features=512, out_features=512, bias=True)
)
(self_attn_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
(fc1): Linear(in_features=512, out_features=1024, bias=True)
(fc2): Linear(in_features=1024, out_features=512, bias=True)
(final_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
)
(4): TransformerEncoderLayer(
(self_attn): MultiheadAttention(
(out_proj): Linear(in_features=512, out_features=512, bias=True)
)
(self_attn_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
(fc1): Linear(in_features=512, out_features=1024, bias=True)
(fc2): Linear(in_features=1024, out_features=512, bias=True)
(final_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
)
(5): TransformerEncoderLayer(
(self_attn): MultiheadAttention(
(out_proj): Linear(in_features=512, out_features=512, bias=True)
)
(self_attn_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
(fc1): Linear(in_features=512, out_features=1024, bias=True)
(fc2): Linear(in_features=1024, out_features=512, bias=True)
(final_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
)
)
)
(decoder): TransformerDecoder(
(embed_tokens): Embedding(10152, 512, padding_idx=1)
(embed_positions): SinusoidalPositionalEmbedding()
(layers): ModuleList(
(0): TransformerDecoderLayer(
(self_attn): MultiheadAttention(
(out_proj): Linear(in_features=512, out_features=512, bias=True)
)
(self_attn_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
(encoder_attn): MultiheadAttention(
(out_proj): Linear(in_features=512, out_features=512, bias=True)
)
(encoder_attn_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
(fc1): Linear(in_features=512, out_features=1024, bias=True)
(fc2): Linear(in_features=1024, out_features=512, bias=True)
(final_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
)
(1): TransformerDecoderLayer(
(self_attn): MultiheadAttention(
(out_proj): Linear(in_features=512, out_features=512, bias=True)
)
(self_attn_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
(encoder_attn): MultiheadAttention(
(out_proj): Linear(in_features=512, out_features=512, bias=True)
)
(encoder_attn_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
(fc1): Linear(in_features=512, out_features=1024, bias=True)
(fc2): Linear(in_features=1024, out_features=512, bias=True)
(final_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
)
(2): TransformerDecoderLayer(
(self_attn): MultiheadAttention(
(out_proj): Linear(in_features=512, out_features=512, bias=True)
)
(self_attn_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
(encoder_attn): MultiheadAttention(
(out_proj): Linear(in_features=512, out_features=512, bias=True)
)
(encoder_attn_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
(fc1): Linear(in_features=512, out_features=1024, bias=True)
(fc2): Linear(in_features=1024, out_features=512, bias=True)
(final_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)

  )
  (3): TransformerDecoderLayer(
    (self_attn): MultiheadAttention(
      (out_proj): Linear(in_features=512, out_features=512, bias=True)
    )
    (self_attn_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
    (encoder_attn): MultiheadAttention(
      (out_proj): Linear(in_features=512, out_features=512, bias=True)
    )
    (encoder_attn_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
    (fc1): Linear(in_features=512, out_features=1024, bias=True)
    (fc2): Linear(in_features=1024, out_features=512, bias=True)
    (final_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
  )
  (4): TransformerDecoderLayer(
    (self_attn): MultiheadAttention(
      (out_proj): Linear(in_features=512, out_features=512, bias=True)
    )
    (self_attn_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
    (encoder_attn): MultiheadAttention(
      (out_proj): Linear(in_features=512, out_features=512, bias=True)
    )
    (encoder_attn_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
    (fc1): Linear(in_features=512, out_features=1024, bias=True)
    (fc2): Linear(in_features=1024, out_features=512, bias=True)
    (final_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
  )
  (5): TransformerDecoderLayer(
    (self_attn): MultiheadAttention(
      (out_proj): Linear(in_features=512, out_features=512, bias=True)
    )
    (self_attn_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
    (encoder_attn): MultiheadAttention(
      (out_proj): Linear(in_features=512, out_features=512, bias=True)
    )
    (encoder_attn_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
    (fc1): Linear(in_features=512, out_features=1024, bias=True)
    (fc2): Linear(in_features=1024, out_features=512, bias=True)
    (final_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
  )
      )

)
)
| model transformer_iwslt_de_en, criterion LabelSmoothedCrossEntropyCriterion
| num. model params: 36741120 (num. trained: 36741120)
| training on 1 GPUs
| max tokens per GPU = 4096 and max sentences per GPU = None
| no existing checkpoint found checkpoints/checkpoint_last.pt
| loading train data for epoch 0
| ../iwslt_de_en train de-en 160239 examples
| NOTICE: your device may support faster training with --fp16
| epoch 001 | loss 9.987 | nll_loss 9.399 | ppl 675.32 | wps 23729 | ups 6 | wpb 3586.843 | bsz 145.540 | num_updates 1101 | lr 0.0005 | gnorm 0.893 | c$
ip 0.000 | oom 0.000 | wall 172 | train_wall 155
| epoch 001 | valid on 'valid' subset | loss 11.076 | nll_loss 10.470 | ppl 1417.97 | num_updates 1101
| saved checkpoint checkpoints/checkpoint1.pt (epoch 1 @ 1101 updates) (writing took 1.1984376907348633 seconds)
| epoch 002 | loss 9.853 | nll_loss 9.253 | ppl 610.28 | wps 23901 | ups 7 | wpb 3586.843 | bsz 145.540 | num_updates 2202 | lr 0.0005 | gnorm 0.710 | c$
ip 0.000 | oom 0.000 | wall 342 | train_wall 305
| epoch 002 | valid on 'valid' subset | loss 11.158 | nll_loss 10.556 | ppl 1504.99 | num_updates 2202 | best_loss 11.0761
| saved checkpoint checkpoints/checkpoint2.pt (epoch 2 @ 2202 updates) (writing took 3.039834976196289 seconds)
| epoch 003 | loss 9.814 | nll_loss 9.211 | ppl 592.59 | wps 24035 | ups 6 | wpb 3586.843 | bsz 145.540 | num_updates 3303 | lr 0.0005 | gnorm 0.677 | c$
ip 0.000 | oom 0.000 | wall 512 | train_wall 455
| epoch 003 | valid on 'valid' subset | loss 11.777 | nll_loss 11.146 | ppl 2265.77 | num_updates 3303 | best_loss 11.0761
| saved checkpoint checkpoints/checkpoint3.pt (epoch 3 @ 3303 updates) (writing took 2.8751749992370605 seconds)
| epoch 004 | loss 9.795 | nll_loss 9.190 | ppl 584.09 | wps 24097 | ups 6 | wpb 3586.843 | bsz 145.540 | num_updates 4404 | lr 0.000476515 | gnorm 0.64$
| clip 0.000 | oom 0.000 | wall 681 | train_wall 605
| epoch 004 | valid on 'valid' subset | loss 11.232 | nll_loss 10.616 | ppl 1569.04 | num_updates 4404 | best_loss 11.0761
| saved checkpoint checkpoints/checkpoint4.pt (epoch 4 @ 4404 updates) (writing took 2.5432913303375244 seconds)
| epoch 005 | loss 9.778 | nll_loss 9.170 | ppl 576.20 | wps 24204 | ups 7 | wpb 3586.843 | bsz 145.540 | num_updates 5505 | lr 0.000426208 | gnorm 0.61$
| clip 0.000 | oom 0.000 | wall 850 | train_wall 754
| epoch 005 | valid on 'valid' subset | loss 11.640 | nll_loss 11.013 | ppl 2066.59 | num_updates 5505 | best_loss 11.0761
| saved checkpoint checkpoints/checkpoint5.pt (epoch 5 @ 5505 updates) (writing took 3.2354724407196045 seconds)
| epoch 006 | loss 9.768 | nll_loss 9.160 | ppl 572.06 | wps 24162 | ups 6 | wpb 3586.843 | bsz 145.540 | num_updates 6606 | lr 0.000389073 | gnorm 0.60$
| clip 0.000 | oom 0.000 | wall 1019 | train_wall 903
| epoch 006 | valid on 'valid' subset | loss 12.176 | nll_loss 11.635 | ppl 3179.38 | num_updates 6606 | best_loss 11.0761
| saved checkpoint checkpoints/checkpoint6.pt (epoch 6 @ 6606 updates) (writing took 2.5028364658355713 seconds)
| epoch 007 | loss 9.763 | nll_loss 9.155 | ppl 570.00 | wps 24147 | ups 7 | wpb 3586.843 | bsz 145.540 | num_updates 7707 | lr 0.000360211 | gnorm 0.59$
| clip 0.000 | oom 0.000 | wall 1188 | train_wall 1053
| epoch 007 | valid on 'valid' subset | loss 11.722 | nll_loss 11.073 | ppl 2154.92 | num_updates 7707 | best_loss 11.0761
| saved checkpoint checkpoints/checkpoint7.pt (epoch 7 @ 7707 updates) (writing took 2.8729090690612793 seconds)
| epoch 008 | loss 9.759 | nll_loss 9.150 | ppl 568.17 | wps 24257 | ups 7 | wpb 3586.843 | bsz 145.540 | num_updates 8808 | lr 0.000336947 | gnorm 0.590
| clip 0.000 | oom 0.000 | wall 1356 | train_wall 1201
| epoch 008 | valid on 'valid' subset | loss 11.776 | nll_loss 11.143 | ppl 2261.35 | num_updates 8808 | best_loss 11.0761
| saved checkpoint checkpoints/checkpoint8.pt (epoch 8 @ 8808 updates) (writing took 2.8396410942077637 seconds)
| epoch 009 | loss 9.753 | nll_loss 9.143 | ppl 565.41 | wps 24271 | ups 7 | wpb 3586.843 | bsz 145.540 | num_updates 9909 | lr 0.000317676 | gnorm 0.618
| clip 0.000 | oom 0.000 | wall 1524 | train_wall 1350
| epoch 009 | valid on 'valid' subset | loss 11.559 | nll_loss 10.923 | ppl 1941.17 | num_updates 9909 | best_loss 11.0761
| saved checkpoint checkpoints/checkpoint9.pt (epoch 9 @ 9909 updates) (writing took 2.829507827758789 seconds)
| epoch 010 | loss 9.748 | nll_loss 9.138 | ppl 563.51 | wps 24342 | ups 7 | wpb 3586.843 | bsz 145.540 | num_updates 11010 | lr 0.000301374 | gnorm 0.58
0 | clip 0.000 | oom 0.000 | wall 1692 | train_wall 1499
| epoch 010 | valid on 'valid' subset | loss 11.753 | nll_loss 11.096 | ppl 2188.33 | num_updates 11010 | best_loss 11.0761
| saved checkpoint checkpoints/checkpoint10.pt (epoch 10 @ 11010 updates) (writing took 2.639819622039795 seconds)
| epoch 011 | loss 9.744 | nll_loss 9.133 | ppl 561.48 | wps 24320 | ups 7 | wpb 3586.843 | bsz 145.540 | num_updates 12111 | lr 0.000287349 | gnorm 0.57
3 | clip 0.000 | oom 0.000 | wall 1860 | train_wall 1647
| epoch 011 | valid on 'valid' subset | loss 11.856 | nll_loss 11.232 | ppl 2406.01 | num_updates 12111 | best_loss 11.0761
| saved checkpoint checkpoints/checkpoint11.pt (epoch 11 @ 12111 updates) (writing took 2.7617385387420654 seconds)
| epoch 012: 68%|▋| 749/1101 [02:04<00:51, 6.82it/s, loss=9.744, nll_loss=9.134, ppl=561.65, wps=21628, ups=6, wpb=3583.718, bsz=144.725, num_updates=1
My problem is that the validation perplexity does not go below 1000, which is far too high.
Did I do anything differently?

RuntimeError: cublas runtime error : the GPU program failed to execute at /opt/conda/conda-bld/pytorch_1556653183467/work/aten/src/THC/THCBlas.cu:259

My env :
python=3.5
cuda=9.0
pytorch=1.1.0
torchvision=0.3.0

## Step(1). Download WMT16 English-German

## Step(2). makedataforbert.sh

## Step(3).

TEXT=examples/translation/wmt16_en_de_test

python preprocess.py --source-lang en --target-lang de \
  --trainpref $TEXT/train --validpref $TEXT/valid --testpref $TEXT/test \
  --destdir destdir --joined-dictionary --bert-model-name bert-base-uncased

## Step(4). Download WMT16 English-German Model from fairseq

## Step(5).

#!/usr/bin/env bash
nvidia-smi
python3 -c "import torch; print(torch.__version__)"

src=en
tgt=de
bedropout=0.5
ARCH=transformer_vaswani_wmt_en_de_big
DATAPATH=destdir/
SAVEDIR=checkpoints/wmt16_${src}_${tgt}_${bedropout}
mkdir -p $SAVEDIR

if [ ! -f $SAVEDIR/checkpoint_nmt.pt ]; then     cp wmt16.en-de.joined-dict.transformer/model.pt $SAVEDIR/checkpoint_nmt.pt; fi
if [ ! -f "$SAVEDIR/checkpoint_last.pt" ]; then warmup="--warmup-from-nmt --reset-lr-scheduler"; else warmup=""; fi

python train.py $DATAPATH \
-a $ARCH --optimizer adam --lr 0.0005 -s $src -t $tgt --label-smoothing 0.1 \
--dropout 0.3 --max-tokens 4000 --min-lr '1e-09' --lr-scheduler inverse_sqrt --weight-decay 0.0001 \
--criterion label_smoothed_cross_entropy --max-update 150000 --warmup-updates 4000 --warmup-init-lr '1e-07' \
--adam-betas '(0.9,0.98)' --save-dir $SAVEDIR --share-all-embeddings $warmup \
--encoder-bert-dropout --encoder-bert-dropout-ratio $bedropout | tee -a $SAVEDIR/training.log

After that I got the log and error message ...

(bertNMT) blue90211@AI:~/Storage01/bert-nmt$ python train.py $DATAPATH -a $ARCH --optimizer adam --lr 0.0005 -s $src -t $tgt --label-smoothing 0.1 --dropout 0.3 --max-tokens 4000 --min-lr '1e-09' --lr-scheduler inverse_sqrt --weight-decay 0.0001 --criterion label_smoothed_cross_entropy --max-update 150000 --warmup-updates 4000 --warmup-init-lr '1e-07' --adam-betas '(0.9,0.98)' --save-dir $SAVEDIR --share-all-embeddings $warmup --encoder-bert-dropout --encoder-bert-dropout-ratio $bedropout | tee -a $SAVEDIR/training.log
| distributed init (rank 1): tcp://localhost:14689
| distributed init (rank 0): tcp://localhost:14689
| initialized host AI as rank 1
| initialized host AI as rank 0
Namespace(activation_dropout=0.0, activation_fn='relu', adam_betas='(0.9,0.98)', adam_eps=1e-08, adaptive_input=False, adaptive_softmax_cutoff=None, adaptive_softmax_dropout=0, arch='transformer_vaswani_wmt_en_de_big', attention_dropout=0.0, bert_first=True, bert_gates=[1, 1, 1, 1, 1, 1], bert_model_name='bert-base-uncased', bert_output_layer=-1, bert_ratio=1.0, bucket_cap_mb=25, clip_norm=25, cpu=False, criterion='label_smoothed_cross_entropy', curriculum=0, data='destdir/', dataset_impl='cached', ddp_backend='c10d', decoder_attention_heads=16, decoder_embed_dim=1024, decoder_embed_path=None, decoder_ffn_embed_dim=4096, decoder_input_dim=1024, decoder_layers=6, decoder_learned_pos=False, decoder_no_bert=False, decoder_normalize_before=False, decoder_output_dim=1024, device_id=0, disable_validation=False, distributed_backend='nccl', distributed_init_method='tcp://localhost:14689', distributed_no_spawn=False, distributed_port=-1, distributed_rank=0, distributed_world_size=2, dropout=0.3, encoder_attention_heads=16, encoder_bert_dropout=True, encoder_bert_dropout_ratio=0.5, encoder_bert_mixup=False, encoder_embed_dim=1024, encoder_embed_path=None, encoder_ffn_embed_dim=4096, encoder_layers=6, encoder_learned_pos=False, encoder_normalize_before=False, encoder_ratio=1.0, find_unused_parameters=False, finetune_bert=False, fix_batches_to_gpus=False, fp16=False, fp16_init_scale=128, fp16_scale_tolerance=0.0, fp16_scale_window=None, keep_interval_updates=-1, keep_last_epochs=-1, label_smoothing=0.1, lazy_load=False, left_pad_source='True', left_pad_target='False', log_format=None, log_interval=1000, lr=[0.0005], lr_scheduler='inverse_sqrt', mask_cls_sep=False, max_epoch=0, max_sentences=None, max_sentences_valid=None, max_source_positions=1024, max_target_positions=1024, max_tokens=4000, max_update=150000, memory_efficient_fp16=False, min_loss_scale=0.0001, min_lr=1e-09, no_epoch_checkpoints=False, no_progress_bar=False, no_save=False, no_token_positional_embeddings=False, num_workers=0, optimizer='adam', optimizer_overrides='{}', raw_text=False, required_batch_size_multiple=8, reset_dataloader=False, reset_lr_scheduler=True, reset_meters=False, reset_optimizer=False, restore_file='checkpoint_last.pt', save_dir='checkpoints/wmt16_en_de_0.5', save_interval=1, save_interval_updates=0, seed=1, sentence_avg=False, share_all_embeddings=True, share_decoder_input_output_embed=False, skip_invalid_size_inputs_valid_test=False, source_lang='en', target_lang='de', task='translation', tbmf_wrapper=False, tensorboard_logdir='', threshold_loss_scale=None, train_subset='train', update_freq=[1], upsample_primary=1, user_dir=None, valid_subset='valid', validate_interval=1, warmup_from_nmt=True, warmup_init_lr=1e-07, warmup_nmt_file='checkpoint_nmt.pt', warmup_updates=4000, weight_decay=0.0001)
| [en] dictionary: 32768 types
| [de] dictionary: 32768 types
| destdir/ valid en-de 3000 examples
bert_gates [True, True, True, True, True, True]
TransformerModel(
  (encoder): TransformerEncoder(
    (embed_tokens): Embedding(32768, 1024, padding_idx=1)
    (embed_positions): SinusoidalPositionalEmbedding()
    (layers): ModuleList(
      (0): TransformerEncoderLayer(
        (self_attn): MultiheadAttention(
          (out_proj): Linear(in_features=1024, out_features=1024, bias=True)
        )
        (self_attn_layer_norm): LayerNorm(torch.Size([1024]), eps=1e-05, elementwise_affine=True)
        (fc1): Linear(in_features=1024, out_features=4096, bias=True)
        (fc2): Linear(in_features=4096, out_features=1024, bias=True)
        (final_layer_norm): LayerNorm(torch.Size([1024]), eps=1e-05, elementwise_affine=True)
      )
      (1): TransformerEncoderLayer(
        (self_attn): MultiheadAttention(
          (out_proj): Linear(in_features=1024, out_features=1024, bias=True)
        )
        (self_attn_layer_norm): LayerNorm(torch.Size([1024]), eps=1e-05, elementwise_affine=True)
        (fc1): Linear(in_features=1024, out_features=4096, bias=True)
        (fc2): Linear(in_features=4096, out_features=1024, bias=True)
        (final_layer_norm): LayerNorm(torch.Size([1024]), eps=1e-05, elementwise_affine=True)
      )
      (2): TransformerEncoderLayer(
        (self_attn): MultiheadAttention(
          (out_proj): Linear(in_features=1024, out_features=1024, bias=True)
        )
        (self_attn_layer_norm): LayerNorm(torch.Size([1024]), eps=1e-05, elementwise_affine=True)
        (fc1): Linear(in_features=1024, out_features=4096, bias=True)
        (fc2): Linear(in_features=4096, out_features=1024, bias=True)
        (final_layer_norm): LayerNorm(torch.Size([1024]), eps=1e-05, elementwise_affine=True)
      )
      (3): TransformerEncoderLayer(
        (self_attn): MultiheadAttention(
          (out_proj): Linear(in_features=1024, out_features=1024, bias=True)
        )
        (self_attn_layer_norm): LayerNorm(torch.Size([1024]), eps=1e-05, elementwise_affine=True)
        (fc1): Linear(in_features=1024, out_features=4096, bias=True)
        (fc2): Linear(in_features=4096, out_features=1024, bias=True)
        (final_layer_norm): LayerNorm(torch.Size([1024]), eps=1e-05, elementwise_affine=True)
      )
      (4): TransformerEncoderLayer(
        (self_attn): MultiheadAttention(
          (out_proj): Linear(in_features=1024, out_features=1024, bias=True)
        )
        (self_attn_layer_norm): LayerNorm(torch.Size([1024]), eps=1e-05, elementwise_affine=True)
        (fc1): Linear(in_features=1024, out_features=4096, bias=True)
        (fc2): Linear(in_features=4096, out_features=1024, bias=True)
        (final_layer_norm): LayerNorm(torch.Size([1024]), eps=1e-05, elementwise_affine=True)
      )
      (5): TransformerEncoderLayer(
        (self_attn): MultiheadAttention(
          (out_proj): Linear(in_features=1024, out_features=1024, bias=True)
        )
        (self_attn_layer_norm): LayerNorm(torch.Size([1024]), eps=1e-05, elementwise_affine=True)
        (fc1): Linear(in_features=1024, out_features=4096, bias=True)
        (fc2): Linear(in_features=4096, out_features=1024, bias=True)
        (final_layer_norm): LayerNorm(torch.Size([1024]), eps=1e-05, elementwise_affine=True)
      )
    )
  )
  (decoder): TransformerDecoder(
    (embed_tokens): Embedding(32768, 1024, padding_idx=1)
    (embed_positions): SinusoidalPositionalEmbedding()
    (layers): ModuleList(
      (0): TransformerDecoderLayer(
        (self_attn): MultiheadAttention(
          (out_proj): Linear(in_features=1024, out_features=1024, bias=True)
        )
        (self_attn_layer_norm): LayerNorm(torch.Size([1024]), eps=1e-05, elementwise_affine=True)
        (encoder_attn): MultiheadAttention(
          (out_proj): Linear(in_features=1024, out_features=1024, bias=True)
        )
        (bert_attn): MultiheadAttention(
          (out_proj): Linear(in_features=1024, out_features=1024, bias=True)
        )
        (encoder_attn_layer_norm): LayerNorm(torch.Size([1024]), eps=1e-05, elementwise_affine=True)
        (fc1): Linear(in_features=1024, out_features=4096, bias=True)
        (fc2): Linear(in_features=4096, out_features=1024, bias=True)
        (final_layer_norm): LayerNorm(torch.Size([1024]), eps=1e-05, elementwise_affine=True)
      )
      (1): TransformerDecoderLayer(
        (self_attn): MultiheadAttention(
          (out_proj): Linear(in_features=1024, out_features=1024, bias=True)
        )
        (self_attn_layer_norm): LayerNorm(torch.Size([1024]), eps=1e-05, elementwise_affine=True)
        (encoder_attn): MultiheadAttention(
          (out_proj): Linear(in_features=1024, out_features=1024, bias=True)
        )
        (bert_attn): MultiheadAttention(
          (out_proj): Linear(in_features=1024, out_features=1024, bias=True)
        )
        (encoder_attn_layer_norm): LayerNorm(torch.Size([1024]), eps=1e-05, elementwise_affine=True)
        (fc1): Linear(in_features=1024, out_features=4096, bias=True)
        (fc2): Linear(in_features=4096, out_features=1024, bias=True)
        (final_layer_norm): LayerNorm(torch.Size([1024]), eps=1e-05, elementwise_affine=True)
      )
      (2): TransformerDecoderLayer(
        (self_attn): MultiheadAttention(
          (out_proj): Linear(in_features=1024, out_features=1024, bias=True)
        )
        (self_attn_layer_norm): LayerNorm(torch.Size([1024]), eps=1e-05, elementwise_affine=True)
        (encoder_attn): MultiheadAttention(
          (out_proj): Linear(in_features=1024, out_features=1024, bias=True)
        )
        (bert_attn): MultiheadAttention(
          (out_proj): Linear(in_features=1024, out_features=1024, bias=True)
        )
        (encoder_attn_layer_norm): LayerNorm(torch.Size([1024]), eps=1e-05, elementwise_affine=True)
        (fc1): Linear(in_features=1024, out_features=4096, bias=True)
        (fc2): Linear(in_features=4096, out_features=1024, bias=True)
        (final_layer_norm): LayerNorm(torch.Size([1024]), eps=1e-05, elementwise_affine=True)
      )
      (3): TransformerDecoderLayer(
        (self_attn): MultiheadAttention(
          (out_proj): Linear(in_features=1024, out_features=1024, bias=True)
        )
        (self_attn_layer_norm): LayerNorm(torch.Size([1024]), eps=1e-05, elementwise_affine=True)
        (encoder_attn): MultiheadAttention(
          (out_proj): Linear(in_features=1024, out_features=1024, bias=True)
        )
        (bert_attn): MultiheadAttention(
          (out_proj): Linear(in_features=1024, out_features=1024, bias=True)
        )
        (encoder_attn_layer_norm): LayerNorm(torch.Size([1024]), eps=1e-05, elementwise_affine=True)
        (fc1): Linear(in_features=1024, out_features=4096, bias=True)
        (fc2): Linear(in_features=4096, out_features=1024, bias=True)
        (final_layer_norm): LayerNorm(torch.Size([1024]), eps=1e-05, elementwise_affine=True)
      )
      (4): TransformerDecoderLayer(
        (self_attn): MultiheadAttention(
          (out_proj): Linear(in_features=1024, out_features=1024, bias=True)
        )
        (self_attn_layer_norm): LayerNorm(torch.Size([1024]), eps=1e-05, elementwise_affine=True)
        (encoder_attn): MultiheadAttention(
          (out_proj): Linear(in_features=1024, out_features=1024, bias=True)
        )
        (bert_attn): MultiheadAttention(
          (out_proj): Linear(in_features=1024, out_features=1024, bias=True)
        )
        (encoder_attn_layer_norm): LayerNorm(torch.Size([1024]), eps=1e-05, elementwise_affine=True)
        (fc1): Linear(in_features=1024, out_features=4096, bias=True)
        (fc2): Linear(in_features=4096, out_features=1024, bias=True)
        (final_layer_norm): LayerNorm(torch.Size([1024]), eps=1e-05, elementwise_affine=True)
      )
      (5): TransformerDecoderLayer(
        (self_attn): MultiheadAttention(
          (out_proj): Linear(in_features=1024, out_features=1024, bias=True)
        )
        (self_attn_layer_norm): LayerNorm(torch.Size([1024]), eps=1e-05, elementwise_affine=True)
        (encoder_attn): MultiheadAttention(
          (out_proj): Linear(in_features=1024, out_features=1024, bias=True)
        )
        (bert_attn): MultiheadAttention(
          (out_proj): Linear(in_features=1024, out_features=1024, bias=True)
        )
        (encoder_attn_layer_norm): LayerNorm(torch.Size([1024]), eps=1e-05, elementwise_affine=True)
        (fc1): Linear(in_features=1024, out_features=4096, bias=True)
        (fc2): Linear(in_features=4096, out_features=1024, bias=True)
        (final_layer_norm): LayerNorm(torch.Size([1024]), eps=1e-05, elementwise_affine=True)
      )
    )
  )
  (bert_encoder): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): BertLayerNorm()
      (dropout): Dropout(p=0.1)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): BertLayerNorm()
              (dropout): Dropout(p=0.1)
            )
          )
          (intermediate): BertIntermediate(
            (dense): Linear(in_features=768, out_features=3072, bias=True)
          )
          (output): BertOutput(
            (dense): Linear(in_features=3072, out_features=768, bias=True)
            (LayerNorm): BertLayerNorm()
            (dropout): Dropout(p=0.1)
          )
        )
        (1): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): BertLayerNorm()
              (dropout): Dropout(p=0.1)
            )
          )
          (intermediate): BertIntermediate(
            (dense): Linear(in_features=768, out_features=3072, bias=True)
          )
          (output): BertOutput(
            (dense): Linear(in_features=3072, out_features=768, bias=True)
            (LayerNorm): BertLayerNorm()
            (dropout): Dropout(p=0.1)
          )
        )
        (2): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): BertLayerNorm()
              (dropout): Dropout(p=0.1)
            )
          )
          (intermediate): BertIntermediate(
            (dense): Linear(in_features=768, out_features=3072, bias=True)
          )
          (output): BertOutput(
            (dense): Linear(in_features=3072, out_features=768, bias=True)
            (LayerNorm): BertLayerNorm()
            (dropout): Dropout(p=0.1)
          )
        )
        (3): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): BertLayerNorm()
              (dropout): Dropout(p=0.1)
            )
          )
          (intermediate): BertIntermediate(
            (dense): Linear(in_features=768, out_features=3072, bias=True)
          )
          (output): BertOutput(
            (dense): Linear(in_features=3072, out_features=768, bias=True)
            (LayerNorm): BertLayerNorm()
            (dropout): Dropout(p=0.1)
          )
        )
        (4): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): BertLayerNorm()
              (dropout): Dropout(p=0.1)
            )
          )
          (intermediate): BertIntermediate(
            (dense): Linear(in_features=768, out_features=3072, bias=True)
          )
          (output): BertOutput(
            (dense): Linear(in_features=3072, out_features=768, bias=True)
            (LayerNorm): BertLayerNorm()
            (dropout): Dropout(p=0.1)
          )
        )
        (5): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): BertLayerNorm()
              (dropout): Dropout(p=0.1)
            )
          )
          (intermediate): BertIntermediate(
            (dense): Linear(in_features=768, out_features=3072, bias=True)
          )
          (output): BertOutput(
            (dense): Linear(in_features=3072, out_features=768, bias=True)
            (LayerNorm): BertLayerNorm()
            (dropout): Dropout(p=0.1)
          )
        )
        (6): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): BertLayerNorm()
              (dropout): Dropout(p=0.1)
            )
          )
          (intermediate): BertIntermediate(
            (dense): Linear(in_features=768, out_features=3072, bias=True)
          )
          (output): BertOutput(
            (dense): Linear(in_features=3072, out_features=768, bias=True)
            (LayerNorm): BertLayerNorm()
            (dropout): Dropout(p=0.1)
          )
        )
        (7): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): BertLayerNorm()
              (dropout): Dropout(p=0.1)
            )
          )
          (intermediate): BertIntermediate(
            (dense): Linear(in_features=768, out_features=3072, bias=True)
          )
          (output): BertOutput(
            (dense): Linear(in_features=3072, out_features=768, bias=True)
            (LayerNorm): BertLayerNorm()
            (dropout): Dropout(p=0.1)
          )
        )
        (8): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): BertLayerNorm()
              (dropout): Dropout(p=0.1)
            )
          )
          (intermediate): BertIntermediate(
            (dense): Linear(in_features=768, out_features=3072, bias=True)
          )
          (output): BertOutput(
            (dense): Linear(in_features=3072, out_features=768, bias=True)
            (LayerNorm): BertLayerNorm()
            (dropout): Dropout(p=0.1)
          )
        )
        (9): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): BertLayerNorm()
              (dropout): Dropout(p=0.1)
            )
          )
          (intermediate): BertIntermediate(
            (dense): Linear(in_features=768, out_features=3072, bias=True)
          )
          (output): BertOutput(
            (dense): Linear(in_features=3072, out_features=768, bias=True)
            (LayerNorm): BertLayerNorm()
            (dropout): Dropout(p=0.1)
          )
        )
        (10): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): BertLayerNorm()
              (dropout): Dropout(p=0.1)
            )
          )
          (intermediate): BertIntermediate(
            (dense): Linear(in_features=768, out_features=3072, bias=True)
          )
          (output): BertOutput(
            (dense): Linear(in_features=3072, out_features=768, bias=True)
            (LayerNorm): BertLayerNorm()
            (dropout): Dropout(p=0.1)
          )
        )
        (11): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): BertLayerNorm()
              (dropout): Dropout(p=0.1)
            )
          )
          (intermediate): BertIntermediate(
            (dense): Linear(in_features=768, out_features=3072, bias=True)
          )
          (output): BertOutput(
            (dense): Linear(in_features=3072, out_features=768, bias=True)
            (LayerNorm): BertLayerNorm()
            (dropout): Dropout(p=0.1)
          )
        )
      )
    )
    (pooler): BertPooler(
      (dense): Linear(in_features=768, out_features=768, bias=True)
      (activation): Tanh()
    )
  )

| model transformer_vaswani_wmt_en_de_big, criterion LabelSmoothedCrossEntropyCriterion
| num. model params: 341438720 (num. trained: 231956480)
| training on 2 GPUs
| max tokens per GPU = 4000 and max sentences per GPU = None
Model will load checkpoint from checkpoints/wmt16_en_de_0.5/checkpoint_nmt.pt
| NOTICE: your device may support faster training with --fp16
| loaded checkpoint checkpoints/wmt16_en_de_0.5/checkpoint_nmt.pt (epoch 31 @ 0 updates)
| loading train data for epoch 31
| destdir/ train en-de 4500966 examples
| saved checkpoint checkpoints/wmt16_en_de_0.5/checkpoint31.pt (epoch 31 @ 0 updates) (writing took 1.965510606765747 seconds)
Traceback (most recent call last):
  File "train.py", line 315, in <module>
    cli_main()
  File "train.py", line 307, in cli_main
    nprocs=args.distributed_world_size,
  File "/tools/anaconda3/envs/bertNMT/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 167, in spawn
    while not spawn_context.join():
  File "/tools/anaconda3/envs/bertNMT/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 114, in join
    raise Exception(msg)
Exception:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/tools/anaconda3/envs/bertNMT/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
    fn(i, *args)
  File "/mnt/Storage01/blue90211/bert-nmt/train.py", line 274, in distributed_main
    main(args, init_distributed=True)
  File "/mnt/Storage01/blue90211/bert-nmt/train.py", line 89, in main
    train(args, trainer, task, epoch_itr)
  File "/mnt/Storage01/blue90211/bert-nmt/train.py", line 130, in train
    log_output = trainer.train_step(samples)
  File "/mnt/Storage01/blue90211/bert-nmt/fairseq/trainer.py", line 289, in train_step
    raise e
  File "/mnt/Storage01/blue90211/bert-nmt/fairseq/trainer.py", line 266, in train_step
    ignore_grad
  File "/mnt/Storage01/blue90211/bert-nmt/fairseq/tasks/fairseq_task.py", line 232, in train_step
    loss, sample_size, logging_output = criterion(model, sample)
  File "/tools/anaconda3/envs/bertNMT/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/mnt/Storage01/blue90211/bert-nmt/fairseq/criterions/label_smoothed_cross_entropy.py", line 38, in forward
    net_output = model(**sample['net_input'])
  File "/tools/anaconda3/envs/bertNMT/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/tools/anaconda3/envs/bertNMT/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 376, in forward
    output = self.module(*inputs[0], **kwargs[0])
  File "/tools/anaconda3/envs/bertNMT/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/mnt/Storage01/blue90211/bert-nmt/fairseq/models/fairseq_model.py", line 239, in forward
    encoder_out = self.encoder(src_tokens, src_lengths=src_lengths, **kwargs)
  File "/tools/anaconda3/envs/bertNMT/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/mnt/Storage01/blue90211/bert-nmt/fairseq/models/transformer.py", line 564, in forward
    x = layer(x, encoder_padding_mask)
  File "/tools/anaconda3/envs/bertNMT/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/mnt/Storage01/blue90211/bert-nmt/fairseq/models/transformer.py", line 1245, in forward
    x, _ = self.self_attn(query=x, key=x, value=x, key_padding_mask=encoder_padding_mask)
  File "/tools/anaconda3/envs/bertNMT/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/mnt/Storage01/blue90211/bert-nmt/fairseq/modules/multihead_attention.py", line 117, in forward
    q, k, v = self.in_proj_qkv(query)
  File "/mnt/Storage01/blue90211/bert-nmt/fairseq/modules/multihead_attention.py", line 240, in in_proj_qkv
    return self._in_proj(query).chunk(3, dim=-1)
  File "/mnt/Storage01/blue90211/bert-nmt/fairseq/modules/multihead_attention.py", line 277, in _in_proj
    return F.linear(input, weight, bias)
  File "/tools/anaconda3/envs/bertNMT/lib/python3.6/site-packages/torch/nn/functional.py", line 1408, in linear
    output = input.matmul(weight.t())
RuntimeError: cublas runtime error : the GPU program failed to execute at /opt/conda/conda-bld/pytorch_1556653183467/work/aten/src/THC/THCBlas.cu:259

Interactive evaluation script

Neat work. Congratulations.

In the paper, you mentioned that you are using multi-bleu.perl to evaluate IWSLT’14 En↔De, but in the script provided ("iwslt-interactive.sh"), you are evaluating with sacrebleu.

I tried it myself, but somehow generate.py outperforms interactive.py by about 1 BLEU point. I guess the gap comes from a tokenization difference?

Would you mind updating the evaluation script for iwslt14?
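
For reference, a minimal scoring sketch (not the authors' official script; $MOSES and gen.out are placeholders): extract hypotheses and references from generate.py output and score them with Moses multi-bleu.perl instead of sacrebleu.

# sketch: H-/T- lines are tab-separated in generate.py output, and the H/T
# orderings match because both come from the same file
grep ^H gen.out | cut -f3- > hyp.tok
grep ^T gen.out | cut -f2- > ref.tok
perl $MOSES/scripts/generic/multi-bleu.perl ref.tok < hyp.tok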

AttributeError: 'BertTokenizerFast' object has no attribute 'encode_line'

I ran into the following problem: when I preprocessed a short Chinese-English corpus with the command line shown in the screenshot, I got an error at line 60 of /root/bert-nmt/fairseq/binarizer.py, in ids = dict.encode_line():
AttributeError: 'BertTokenizerFast' object has no attribute 'encode_line'.

I don't know why this error occurs; is it because I am using Chinese and English sentences? Looking forward to your reply. Thank you very much.

Bugs? when finetune BERT

When we want to fine-tune BERT during translation, we turn on --finetune_bert.
However, I noticed that in train.py L54-56 you disable the update of BERT's pooler. Can you explain why you turn off updates for the pooling layer?

Where can I get a pretrained_nmt_model that matches transformer_s2_iwslt_de_en?

Hello,

Sorry to ask such a basic question.
I ran train.ipynb and got "Exception: Cannot load model parameters from checkpoint, please ensure that the architectures match.".
Where can I get a pretrained NMT model that matches transformer_s2_iwslt_de_en?

I got model4.pt from
https://github.com/pytorch/fairseq/tree/master/examples/translation
transformer.wmt19.en-de

src = 'en'
tgt = 'de'
bedropout = '0.5'
ARCH = 'transformer_s2_iwslt_de_en'
DATAPATH = 'examples/translation/data_preprocess'
SAVEDIR = 'checkpoints/iwed_' + src + '_' + tgt + '_' + bedropout
your_pretrained_nmt_model = 'wmt19.en-de.joined-dict.ensemble/model4.pt'

!time python train.py $DATAPATH \
-a $ARCH --optimizer adam --lr 0.0005 -s $src -t $tgt --label-smoothing 0.1 \
--dropout 0.3 --max-tokens 4000 --min-lr '1e-09' --lr-scheduler inverse_sqrt --weight-decay 0.0001 \
--criterion label_smoothed_cross_entropy --max-update 150000 --warmup-updates 4000 --warmup-init-lr '1e-07' \
--adam-betas '(0.9,0.98)' --save-dir $SAVEDIR --share-all-embeddings $warmup \
--encoder-bert-dropout --encoder-bert-dropout-ratio $bedropout | tee -a $SAVEDIR/training.log
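
The wmt19.en-de checkpoint is presumably a big transformer trained on its own dictionaries, so it is unlikely to match transformer_s2_iwslt_de_en. A hedged sketch of the usual route instead: train a small vanilla model on the same preprocessed data (transformer_iwslt_de_en appears to pair with transformer_s2_iwslt_de_en) and use that checkpoint as checkpoint_nmt.pt. The save directory name is a placeholder and the remaining hyper-parameters are only illustrative.

# sketch: vanilla model trained on the SAME binarized data, then copied in as
# the warmup checkpoint for the s2 run
CUDA_VISIBLE_DEVICES=0 fairseq-train $DATAPATH \
    --arch transformer_iwslt_de_en --share-all-embeddings \
    --optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
    --lr 5e-4 --lr-scheduler inverse_sqrt --warmup-updates 4000 \
    --dropout 0.3 --weight-decay 0.0001 \
    --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
    --max-tokens 4096 --save-dir checkpoints/vanilla_en_de
cp checkpoints/vanilla_en_de/checkpoint_best.pt $SAVEDIR/checkpoint_nmt.pt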

Question regarding datapath

Hello, I am currently trying to use your code for an NMT task, working through the demo with fairseq/examples/translation/prepare-iwslt14.sh and following your data-preprocessing instructions.

After that, I tried to train the BERT-fused NMT model, but I encountered the error below. I am not sure where it comes from; does it have something to do with my data path?

For the data path, I put all of the files produced by your preprocess.py into: /content/bert-nmt/bert-nmt-files/Bert-NMT-files

I would sincerely appreciate it if you could take a look at the following error message and give me some ideas on what to modify. Thank you!

Traceback (most recent call last):
Namespace(activation_dropout=0.0, activation_fn='relu', adam_betas='(0.9,0.98)', adam_eps=1e-08, adaptive_input=False, adaptive_softmax_cutoff=None, adaptive_softmax_dropout=0, arch='transformer_s2_iwslt_de_en', attention_dropout=0.0, bert_first=True, bert_gates=[1, 1, 1, 1, 1, 1], bert_model_name='bert-base-uncased', bert_output_layer=-1, bert_ratio=1.0, bucket_cap_mb=25, clip_norm=25, cpu=False, criterion='label_smoothed_cross_entropy', curriculum=0, data='/content/bert-nmt/bert-nmt-files/Bert-NMT-files', dataset_impl='cached', ddp_backend='c10d', decoder_attention_heads=4, decoder_embed_dim=512, decoder_embed_path=None, decoder_ffn_embed_dim=1024, decoder_input_dim=512, decoder_layers=6, decoder_learned_pos=False, decoder_no_bert=False, decoder_normalize_before=False, decoder_output_dim=512, device_id=0, disable_validation=False, distributed_backend='nccl', distributed_init_method=None, distributed_no_spawn=False, distributed_port=-1, distributed_rank=0, distributed_world_size=1, dropout=0.3, encoder_attention_heads=4, encoder_bert_dropout=True, encoder_bert_dropout_ratio=0.5, encoder_bert_mixup=False, encoder_embed_dim=512, encoder_embed_path=None, encoder_ffn_embed_dim=1024, encoder_layers=6, encoder_learned_pos=False, encoder_normalize_before=False, encoder_ratio=1.0, find_unused_parameters=False, finetune_bert=False, fix_batches_to_gpus=False, fp16=False, fp16_init_scale=128, fp16_scale_tolerance=0.0, fp16_scale_window=None, keep_interval_updates=-1, keep_last_epochs=-1, label_smoothing=0.1, lazy_load=False, left_pad_source='True', left_pad_target='False', log_format=None, log_interval=1000, lr=[0.0005], lr_scheduler='inverse_sqrt', mask_cls_sep=False, max_epoch=0, max_sentences=None, max_sentences_valid=None, max_source_positions=1024, max_target_positions=1024, max_tokens=4000, max_update=150000, memory_efficient_fp16=False, min_loss_scale=0.0001, min_lr=1e-09, no_epoch_checkpoints=False, no_progress_bar=False, no_save=False, no_token_positional_embeddings=False, num_workers=0, optimizer='adam', optimizer_overrides='{}', raw_text=False, required_batch_size_multiple=8, reset_dataloader=False, reset_lr_scheduler=False, reset_meters=False, reset_optimizer=False, restore_file='checkpoint_last.pt', save_dir='checkpoints/iwed_${src}${tgt}${bedropout}', save_interval=1, save_interval_updates=0, seed=1, sentence_avg=False, share_all_embeddings=True, share_decoder_input_output_embed=False, skip_invalid_size_inputs_valid_test=False, source_lang='en', target_lang='de', task='translation', tbmf_wrapper=False, tensorboard_logdir='', threshold_loss_scale=None, train_subset='train', update_freq=[1], upsample_primary=1, user_dir=None, valid_subset='valid', validate_interval=1, warmup_from_nmt=False, warmup_init_lr=1e-07, warmup_nmt_file='checkpoint_nmt.pt', warmup_updates=4000, weight_decay=0.0001)
| [en] dictionary: 10152 types
| [de] dictionary: 10152 types
| /content/bert-nmt/bert-nmt-files/Bert-NMT-files valid en-de 7283 examples
File "/content/bert-nmt/train.py", line 315, in
cli_main()
File "/content/bert-nmt/train.py", line 311, in cli_main
main(args)
File "/content/bert-nmt/train.py", line 46, in main
task.load_dataset(valid_sub_split, combine=True, epoch=0)
File "/content/bert-nmt/fairseq/tasks/translation.py", line 213, in load_dataset
bert_model_name = self.bert_model_name
File "/content/bert-nmt/fairseq/tasks/translation.py", line 80, in load_langpair_dataset
srcbert_datasets, srcbert_datasets.sizes, berttokenizer,
AttributeError: 'NoneType' object has no attribute 'sizes'
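
The 'NoneType' object has no attribute 'sizes' error suggests the BERT-side dataset for the valid split was not found. A quick check (a sketch; the expected file names follow the listing shown in a later issue for an en-de setup):

# sketch: the data directory should contain BERT-side binaries for every split,
# e.g. train.bert.en-de.en.{bin,idx}, valid.bert.en-de.en.{bin,idx}, test.bert.en-de.en.{bin,idx}
ls /content/bert-nmt/bert-nmt-files/Bert-NMT-files | grep bert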

Problems using Interactive.py

I am following the instructions in the readme regarding using interactive.py:

sed -r 's/(@@ )|(@@ ?$)//g' $bpefile > $bpefile.debpe
$MOSE/scripts/tokenizer/detokenizer.perl -l $src < $bpefile.debpe > $bpefile.debpe.detok
paste -d "\n" $bpefile $bpefile.debpe.detok > $bpefile.in
cat $bpefile.in | python interactive.py  -s $src -t $tgt \
--buffer-size 1024 --batch-size 128 --beam 5 --remove-bpe  > output.log

Is $bpefile a file containing the source sentences after BPE has been applied (e.g. test.en)?

Also, the following execution gives an error as the 'data' argument is missing:
cat $bpefile.in | python interactive.py -s $src -t $tgt --buffer-size 1024 --batch-size 128 --beam 5 --remove-bpe > output.log
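
A sketch of an invocation with the missing pieces filled in: the positional data directory (for the dictionaries) and a checkpoint via --path. The paths and the --bert-model-name value are placeholders, and the flags mirror the generate.py example elsewhere on this page.

# sketch: interactive.py also needs the binarized data directory and a checkpoint
cat $bpefile.in | python interactive.py $DATAPATH --path $SAVEDIR/checkpoint_best.pt \
    --bert-model-name bert-base-uncased -s $src -t $tgt \
    --buffer-size 1024 --batch-size 128 --beam 5 --remove-bpe > output.log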

Running Preliminary Explorations

Hey there -- wanted to start by saying thanks for the excellent work!

I'm trying to run some of the preliminary experiments, but I'm not sure how I'd go about doing the following:

(1): Initializing the Encoder with BERT: The transformer_iwslt_de_en architecture has only 6 encoder layers, but BERT-base has 12 encoder layers. I'm not sure which layers were selected to initialize the NMT encoder. I think I see code for doing this with XLM (https://github.com/bert-nmt/bert-nmt/blob/master/fairseq/models/transformer_from_pretrained_xlm.py), but not with BERT.

(2): I'm also not sure if there's code for running this case: "Leveraging the output of BERT as embedding;" if you could point me in the right direction, I'd really appreciate it.

Thanks a ton!
Omar

edit: My guess is this has something to do with bert_gates and bert_ratio/encoder_ratio, but I'm not entirely sure.
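
For what it is worth, the Namespace dumps elsewhere on this page expose options that look related to both questions: bert_gates, bert_ratio, encoder_ratio, and bert_output_layer. A hedged sketch (the command-line spellings are assumed from those argparse dests and are untested; the remaining training flags are as in the readme script):

# sketch, flag spellings assumed from the dests seen in the training logs:
#   --encoder-ratio / --bert-ratio  weight the NMT encoder output vs. the BERT output
#   --bert-output-layer             select which BERT layer is exposed (-1 = last)
python train.py $DATAPATH -a transformer_s2_iwslt_de_en -s $src -t $tgt \
    --encoder-ratio 1.0 --bert-ratio 1.0 --bert-output-layer -1 \
    --encoder-bert-dropout --encoder-bert-dropout-ratio 0.5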

What are the development sets and test sets in the direction of IWSLT 2017 en-zh?

Hello, for the IWSLT en-zh direction, is the development set from IWSLT 2016 and the test set from IWSLT 2017?
If it is convenient, could you provide the pre-processing script used before BPE? If you are using the prepare-iwslt14.sh script from fairseq, there is an echo "creating train, valid, test ..." step in the preprocessing; I don't understand why it is there. Do you need this step when doing en-zh?

Thank you very much and look forward to your reply.
@teslacool

Longer time to train a BERT-fused model

Hi,
I trained an NMT model for a low-resource language in fairseq; an epoch takes about 30 seconds on 8x2080Ti GPUs for a dataset of 0.3M sentences with the transformer-base architecture.

Using the same data and the transformer_s2 architecture, I am getting around 20 minutes per epoch.
I used the training command given here.

Note: I enabled --update-freq 16, --fp16 and --ddp-backend no_c10d.
Can you please help me understand why the training time is so much higher? Thanks.

Training gets stuck after few epochs with 100% GPU utilization

Hello hello,

I am currently experimenting with the BERT-fused model. I could successfully run training with a small dataset. Now that I am attempting training with a 100+ million-token dataset, the training gets stuck with 100% GPU utilization after a few epochs. I tried reducing the batch size, but the issue remains. Has anyone faced a similar issue? Do you have any suggestions for solving this?

Thank you in advance.

Please tell me how to use bin and idx files.

Hello,

Sorry to ask such a basic question.
I ran preprocess.py and I got bin and idx files.
I don't know how to use .bin and .idx files after this.
Please tell me how to use bin and idx files.

total 1687755
-rw------- 1 root root 534956 Apr 16 05:52 dict.de.txt
-rw------- 1 root root 534956 Apr 16 05:52 dict.en.txt
-rw------- 1 root root 324412 Apr 16 07:25 test.bert.en-de.en.bin
-rw------- 1 root root 72136 Apr 16 07:25 test.bert.en-de.en.idx
-rw------- 1 root root 338516 Apr 16 06:36 test.en-de.de.bin
-rw------- 1 root root 72136 Apr 16 06:36 test.en-de.de.idx
-rw------- 1 root root 324740 Apr 16 06:13 test.en-de.en.bin
-rw------- 1 root root 72136 Apr 16 06:13 test.en-de.en.idx
-rw------- 1 root root 479599392 Apr 16 07:24 train.bert.en-de.en.bin
-rw------- 1 root root 95068360 Apr 16 07:24 train.bert.en-de.en.idx
-rw------- 1 root root 477476928 Apr 16 06:36 train.en-de.de.bin
-rw------- 1 root root 95068360 Apr 16 06:36 train.en-de.de.idx
-rw------- 1 root root 466401152 Apr 16 06:13 train.en-de.en.bin
-rw------- 1 root root 95068360 Apr 16 06:13 train.en-de.en.idx
-rw------- 1 root root 4855892 Apr 16 07:25 valid.bert.en-de.en.bin
-rw------- 1 root root 961456 Apr 16 07:25 valid.bert.en-de.en.idx
-rw------- 1 root root 4838976 Apr 16 06:36 valid.en-de.de.bin
-rw------- 1 root root 961456 Apr 16 06:36 valid.en-de.de.idx
-rw------- 1 root root 4721140 Apr 16 06:13 valid.en-de.en.bin
-rw------- 1 root root 961456 Apr 16 06:13 valid.en-de.en.idx
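
A minimal sketch (checkpoint path and directory are placeholders): the directory holding these .bin/.idx files is simply the positional data argument for train.py and generate.py, e.g. for decoding:

# sketch: point generate.py at the directory that contains the .bin/.idx files
DESTDIR=/path/to/this/directory
python generate.py $DESTDIR --path checkpoints/iwed_en_de_0.5/checkpoint_best.pt \
    --bert-model-name bert-base-uncased --beam 5 --remove-bpe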

NMT model architecture

Is it not possible to use a non-fairseq NMT model for the saved checkpoint? I get a KeyError for state['best_loss'] in checkpoint_utils.py when I try to warm up with an OpenNMT transformer model.

RuntimeError: CUDA error: device-side assert triggered

I encountered the following error while trying to train the BERT-fused NMT model using BERT-Base Multilingual Cased. Kindly help!

File "/mnt/beegfs/home/abdulrauf/alector/nrpu-July/installations/bert-nmt/train.py", line 315, in <module> cli_main() File "/mnt/beegfs/home/abdulrauf/alector/nrpu-July/installations/bert-nmt/train.py", line 311, in cli_main main(args) File "/mnt/beegfs/home/abdulrauf/alector/nrpu-July/installations/bert-nmt/train.py", line 89, in main train(args, trainer, task, epoch_itr) File "/mnt/beegfs/home/abdulrauf/alector/nrpu-July/installations/bert-nmt/train.py", line 130, in train log_output = trainer.train_step(samples) File "/mnt/beegfs/projects/alector/nrpu-July/installations/bert-nmt/fairseq/trainer.py", line 289, in train_step raise e File "/mnt/beegfs/projects/alector/nrpu-July/installations/bert-nmt/fairseq/trainer.py", line 266, in train_step ignore_grad File "/mnt/beegfs/projects/alector/nrpu-July/installations/bert-nmt/fairseq/tasks/fairseq_task.py", line 232, in train_step loss, sample_size, logging_output = criterion(model, sample) File "/mnt/beegfs/home/abdulrauf/miniconda/envs/bertNMT/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__ result = self.forward(*input, **kwargs) File "/mnt/beegfs/projects/alector/nrpu-July/installations/bert-nmt/fairseq/criterions/label_smoothed_cross_entropy.py", line 38, in forward net_output = model(**sample['net_input']) File "/mnt/beegfs/home/abdulrauf/miniconda/envs/bertNMT/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__ result = self.forward(*input, **kwargs) File "/mnt/beegfs/projects/alector/nrpu-July/installations/bert-nmt/fairseq/models/fairseq_model.py", line 241, in forward bert_encoder_out, _ = self.bert_encoder(bert_input, output_all_encoded_layers=True, attention_mask= 1. - bert_encoder_padding_mask) File "/mnt/beegfs/home/abdulrauf/miniconda/envs/bertNMT/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__ result = self.forward(*input, **kwargs) File "/mnt/beegfs/projects/alector/nrpu-July/installations/bert-nmt/bert/modeling.py", line 736, in forward embedding_output = self.embeddings(input_ids, token_type_ids) File "/mnt/beegfs/home/abdulrauf/miniconda/envs/bertNMT/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__ result = self.forward(*input, **kwargs) File "/mnt/beegfs/projects/alector/nrpu-July/installations/bert-nmt/bert/modeling.py", line 272, in forward position_embeddings = self.position_embeddings(position_ids) File "/mnt/beegfs/home/abdulrauf/miniconda/envs/bertNMT/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__ result = self.forward(*input, **kwargs) File "/mnt/beegfs/home/abdulrauf/miniconda/envs/bertNMT/lib/python3.6/site-packages/torch/nn/modules/sparse.py", line 118, in forward self.norm_type, self.scale_grad_by_freq, self.sparse) File "/mnt/beegfs/home/abdulrauf/miniconda/envs/bertNMT/lib/python3.6/site-packages/torch/nn/functional.py", line 1454, in embedding return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse) RuntimeError: CUDA error: device-side assert triggered

Is there a memory leak in the code?

I got the error "Exception: process 1 terminated with signal SIGKILL" when training a model on 25M sentences. It seems to be caused by running out of memory, for the following reasons:

  • The error appears after "loading train data" and "saved checkpoint", so I think it happens during training.
  • It is not GPU OOM.
  • Memory usage keeps increasing during this period.
  • It runs fine when I train on a small dataset (4k sentences) with the same parameters, but I only trained for 5 epochs.
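
One hedged thing to try (an assumption, not a confirmed fix): the logs on this page show dataset_impl='cached', which keeps the whole binarized corpus in host memory, while the lazy_load option visible in the same logs should read data from disk instead and may keep memory flat on a 25M-sentence corpus.

# sketch: flag spelling assumed from the lazy_load dest in the training logs
python train.py $DATAPATH -a $ARCH -s $src -t $tgt --lazy-load --num-workers 0 \
    --encoder-bert-dropout --encoder-bert-dropout-ratio $bedropout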

Error in pre-processing module with update-20-10 branch

Hi Team,

I really appreciate your research work. I have started trying out your code base, using the update-20-10 branch. When my pipeline reaches the preprocessing stage, I get the following error trace. Can you please help me resolve this? Have I missed something?

  File "bert-nmt/preprocess.py", line 274, in <module>
    cli_main()
  File "bert-nmt/preprocess.py", line 270, in cli_main
    main(args)
  File "bert-nmt/preprocess.py", line 191, in main
    make_all(args.source_lang, berttokenizer)
  File "bert-nmt/preprocess.py", line 176, in make_all
    make_dataset(vocab, args.trainpref, "train", lang, num_workers=args.workers)
  File "bert-nmt/preprocess.py", line 172, in make_dataset
    make_binary_dataset(vocab, input_prefix, output_prefix, lang, num_workers)
  File "bert-nmt/preprocess.py", line 138, in make_binary_dataset
    offset=0, end=offsets[1]
  File "bert-nmt/fairseq/binarizer.py", line 60, in binarize
    ids = dict.encode_line(
AttributeError: 'BertTokenizerFast' object has no attribute 'encode_line'

I installed the dependencies using the provided Docker files.

Thanks,

Using Pretrained NMT

Hello! Strictly speaking this is not an issue but a question. I have read through your paper, but I do not understand the purpose and effect of using a pretrained NMT model. In the BERT-fused model, how much of a difference in performance is there between using a pretrained NMT model and a randomly initialized one, specifically in a low-resource scenario with unlabeled data?

RuntimeError: Subtraction, the `-` operator, with a bool tensor is not supported. If you are trying to invert a mask, use the `~` or `bitwise_not()` operator instead.

Ran into the following error while training a BERT-fused NMT model:

/usr/lib/python3.6/multiprocessing/semaphore_tracker.py:143: UserWarning: semaphore_tracker: There appear to be 1 leaked semaphores to clean up at shutdown                                                                              
  len(cache))
Traceback (most recent call last):
  File "train.py", line 315, in <module>
    cli_main()
  File "train.py", line 307, in cli_main
    nprocs=args.distributed_world_size,
  File "/usr/local/lib/python3.6/dist-packages/torch/multiprocessing/spawn.py", line 171, in spawn
    while not spawn_context.join():
  File "/usr/local/lib/python3.6/dist-packages/torch/multiprocessing/spawn.py", line 118, in join
    raise Exception(msg)
Exception: 

-- Process 2 terminated with the following error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
    fn(i, *args)
  File "/home/eee/bert-nmt/train.py", line 274, in distributed_main
    main(args, init_distributed=True)
  File "/home/eee/bert-nmt/train.py", line 89, in main
    train(args, trainer, task, epoch_itr)
  File "/home/eee/bert-nmt/train.py", line 130, in train
    log_output = trainer.train_step(samples)
  File "/home/eee/bert-nmt/fairseq/trainer.py", line 289, in train_step
    raise e
  File "/home/eee/bert-nmt/fairseq/trainer.py", line 266, in train_step
    ignore_grad
  File "/home/eee/bert-nmt/fairseq/tasks/fairseq_task.py", line 232, in train_step
    loss, sample_size, logging_output = criterion(model, sample)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/eee/bert-nmt/fairseq/criterions/label_smoothed_cross_entropy.py", line 38, in forward
    net_output = model(**sample['net_input'])
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/parallel/distributed.py", line 442, in forward
    output = self.module(*inputs[0], **kwargs[0])
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/eee/bert-nmt/fairseq/models/transformer.py", line 339, in forward
    bert_encoder_out, _ =  self.bert_encoder(bert_input, output_all_encoded_layers=True, attention_mask= 1. - bert_encoder_padding_mask)
  File "/usr/local/lib/python3.6/dist-packages/torch/tensor.py", line 325, in __rsub__
    return _C._VariableFunctions.rsub(self, other)
RuntimeError: Subtraction, the `-` operator, with a bool tensor is not supported. If you are trying to invert a mask, use the `~` or `bitwise_not()` operator instead.

Should transformer.py line 339 be changed from:
bert_encoder_out, _ = self.bert_encoder(bert_input, output_all_encoded_layers=True, attention_mask= 1. - bert_encoder_padding_mask)

to:
bert_encoder_out, _ = self.bert_encoder(bert_input, output_all_encoded_layers=True, attention_mask= ~bert_encoder_padding_mask)

makedataforbert.sh doesn't run out of the box

Running bash makedataforbert.sh does not work out of the box.

To reproduce,

Run bash makedataforbert.sh 'fr-en.fr'

src lng fr-en.fr
sed: can't read train.fr-en.fr: No such file or directory
makedataforbert.sh: line 7: ../mosesdecoder/scripts/tokenizer/detokenizer.perl: No such file or directory
sed: can't read valid.fr-en.fr: No such file or directory
makedataforbert.sh: line 7: ../mosesdecoder/scripts/tokenizer/detokenizer.perl: No such file or directory
sed: can't read test.fr-en.fr: No such file or directory
makedataforbert.sh: line 7: ../mosesdecoder/scripts/tokenizer/detokenizer.perl: No such file or directory

The reason is that the previous steps download the files into the translation/iwslt17.de_fr.en.bpe16k folder instead of into the same folder as the makedataforbert.sh script.
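
A workaround sketch under that assumption (paths are placeholders): run the script from the directory that actually contains the train./valid./test.* files, with a mosesdecoder checkout reachable at ../mosesdecoder.

# sketch: run makedataforbert.sh from inside the data folder produced by the
# prepare script, so train.fr-en.fr etc. are in the working directory and
# ../mosesdecoder resolves to the mosesdecoder clone one level up
cd examples/translation/iwslt17.de_fr.en.bpe16k
cp /path/to/makedataforbert.sh .
bash makedataforbert.sh fr-en.fr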

File with Exact Architecture

Hello,
I've been trying to locate the file that defines the exact BERT-fused NMT architecture; could you please mention its name?
Thanks a lot for the clarification.

Using our own data for an NMT task with bert-nmt

Hello, I would like to know whether there is a way to change the dataset for the NMT task. I am trying to use bert-nmt on my own dataset (English to English, description generation) and would like to know whether there is a better way to plug in my own data, rather than using code like the line below (which builds someone else's dataset):

!bash fairseq/examples/translation/prepare-iwslt14.sh

I am currently trying to replace the files in the folder created by the script above, but is that the correct way to do it? Is there another way to switch to my own dataset (a CSV file) and still use bert-nmt for the task?
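
One possible route (a sketch, assuming the CSV can be exported as a two-column tab-separated file with one sentence pair per line): split it into one plain-text file per side, then run the usual tokenization/BPE, makedataforbert.sh and preprocess.py steps on those files.

# sketch: data.tsv is assumed to be source<TAB>target, one pair per line;
# file names are placeholders
cut -f1 data.tsv > train.src
cut -f2 data.tsv > train.tgt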

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xed in position 7346: invalid continuation byte

I'm trying to follow the 'Data Preprocessing' example and am receiving a UTF-8 decoding error as shown below:

sudo python3 preprocess.py --source-lang en --target-lang de \
  --trainpref $TEXT/train --validpref $TEXT/valid --testpref $TEXT/test \
  --destdir data-bin/wmt17_en_de  --joined-dictionary --bert-model-name bert-base-multilingual-cased


Namespace(alignfile=None, bert_model_name='bert-base-multilingual-cased', cpu=False, criterion='cross_entropy', dataset_impl='cached', destdir='data-bin/wmt17_en_de', fp16=False, fp16_init_scale=128, fp16_scale_tolerance=0.0, fp16_scale_window=None, joined_dictionary=True, log_format=None, log_interval=1000, lr_scheduler='fixed', memory_efficient_fp16=False, min_loss_scale=0.0001, no_progress_bar=False, nwordssrc=-1, nwordstgt=-1, only_source=False, optimizer='nag', padding_factor=8, seed=1, source_lang='en', srcdict=None, target_lang='de', task='translation', tbmf_wrapper=False, tensorboard_logdir='', testpref='examples/translation/wmt17_en_de/test', tgtdict=None, threshold_loss_scale=None, thresholdsrc=0, thresholdtgt=0, trainpref='examples/translation/wmt17_en_de/train', user_dir=None, validpref='examples/translation/wmt17_en_de/valid', workers=1)
Traceback (most recent call last):
  File "preprocess.py", line 274, in <module>
    cli_main()
  File "preprocess.py", line 270, in cli_main
    main(args)
  File "preprocess.py", line 75, in main
    {train_path(lang) for lang in [args.source_lang, args.target_lang]}, src=True
  File "preprocess.py", line 56, in build_dictionary
    padding_factor=args.padding_factor,
  File "/bert-nmt/fairseq/tasks/fairseq_task.py", line 54, in build_dictionary
    Dictionary.add_file_to_dictionary(filename, d, tokenizer.tokenize_line, workers)
  File "/bert-nmt/fairseq/data/dictionary.py", line 284, in add_file_to_dictionary
    merge_result(Dictionary._add_file_to_dictionary_single_worker(filename, tokenize, dict.eos_word))
  File "/bert-nmt/fairseq/data/dictionary.py", line 262, in _add_file_to_dictionary_single_worker
    line = f.readline()
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.7/lib/python3.7/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xed in position 7346: invalid continuation byte

My guess is that this happens because I am using the bert-base-multilingual-cased model?

Since I only care about EN and DE in this case:
I am changing line 247 in https://github.com/bert-nmt/bert-nmt/blob/master/fairseq/data/dictionary.py FROM:
with open(filename, 'r', encoding='utf-8') as f:
TO:
with open(filename, 'r', encoding='utf-8', errors='ignore') as f:

Does this sound correct?
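
An alternative sketch that avoids patching dictionary.py: clean the raw corpus files before preprocessing, since iconv -c silently drops byte sequences that are not valid UTF-8 (file names are placeholders).

# sketch: strip invalid UTF-8 byte sequences from each split before preprocess.py
for f in train.en train.de valid.en valid.de test.en test.de; do
    iconv -f utf-8 -t utf-8 -c "$f" > "$f.clean" && mv "$f.clean" "$f"
done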

transformer_s2_iwslt_de_en vs. transformer architecture

The BERT-fused NMT model uses transformer_s2_iwslt_de_en, which seems to be the same as transformer_iwslt_de_en.

Is it fair to assume using a larger model such as transformer_s2_vaswani_wmt_en_de_big would improve accuracy at the cost of requiring more resources and increasing training & inference time?

BERT-NMT generation hangs and is killed unexpectedly

I preprocessed my data for bert-base-uncased and used the transformer_iwslt_de_en architecture to train the vanilla NMT model with the following command:

CUDA_VISIBLE_DEVICES=0 fairseq-train \
    data-bin/sample \
    --arch transformer_iwslt_de_en --share-decoder-input-output-embed \
    --optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
    --lr 5e-4 --lr-scheduler inverse_sqrt --warmup-updates 4000 \
    --dropout 0.3 --weight-decay 0.0001 \
    --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
    --max-tokens 4096 \
    --best-checkpoint-metric bleu --maximize-best-checkpoint-metric

Then I tried to train the BERT-fused NMT using:

python train.py $DATAPATH \
    -a $ARCH --optimizer adam --lr 0.0005 -s $src -t $tgt --label-smoothing 0.1 \
    --dropout 0.3 --max-tokens 4000 --min-lr '1e-09' --lr-scheduler inverse_sqrt --weight-decay 0.0001 \
    --criterion label_smoothed_cross_entropy --max-update 150000 --warmup-updates 4000 --warmup-init-lr '1e-07' \
    --adam-betas '(0.9,0.98)' --save-dir $SAVEDIR $warmup \
    --cpu --encoder-bert-dropout --encoder-bert-dropout-ratio $bedropout | tee -a $SAVEDIR/training.log

Training is successful and I can see the trained model checkpoint. But when I try to run generate.py with:

python generate.py data-bin/sample \
    --path checkpoints/sample_iwslt_input_output_0.5/checkpoint293.pt --bert-model-name bert-base-uncased \
    --beam 5 --remove-bpe --cpu

Generation hangs indefinitely and when I stop it forcefully I get the following error stack:

^CTraceback (most recent call last):
  File "generate.py", line 195, in <module>
    cli_main()
  File "generate.py", line 191, in cli_main
    main(args)
  File "generate.py", line 109, in main
    hypos = task.inference_step(generator, models, sample, prefix_tokens)
  File "/home/skhurana/bert-nmt/fairseq/tasks/fairseq_task.py", line 246, in inference_step
    return generator.generate(models, sample, prefix_tokens=prefix_tokens)
  File "/home/skhurana/bert-nmt/venv/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 43, in decorate_no_grad
    return func(*args, **kwargs)
  File "/home/skhurana/bert-nmt/fairseq/sequence_generator.py", line 329, in generate
    tokens[:, :step + 1], encoder_outs, bert_outs, temperature=self.temperature,
  File "/home/skhurana/bert-nmt/venv/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 43, in decorate_no_grad
    return func(*args, **kwargs)
  File "/home/skhurana/bert-nmt/fairseq/sequence_generator.py", line 596, in forward_decoder
    temperature=temperature,
  File "/home/skhurana/bert-nmt/fairseq/sequence_generator.py", line 626, in _decode_one
    decoder_out = list(model.decoder(tokens, encoder_out, bert_out, incremental_state=self.incremental_states[model]))
  File "/home/skhurana/bert-nmt/venv/lib/python3.7/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/skhurana/bert-nmt/fairseq/models/transformer.py", line 855, in forward
    x, extra = self.extract_features(prev_output_tokens, encoder_out, bert_encoder_out, incremental_state)
  File "/home/skhurana/bert-nmt/fairseq/models/transformer.py", line 904, in extract_features
    self_attn_mask=self.buffered_future_mask(x) if incremental_state is None else None,
  File "/home/skhurana/bert-nmt/venv/lib/python3.7/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/skhurana/bert-nmt/fairseq/models/transformer.py", line 1511, in forward
    attn_mask=self_attn_mask,
  File "/home/skhurana/bert-nmt/venv/lib/python3.7/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/skhurana/bert-nmt/fairseq/modules/multihead_attention.py", line 163, in forward
    v = torch.cat((prev_value, v), dim=1)
KeyboardInterrupt

Not sure what is happening.

Does it matter if the fairseq version used to train the vanilla NMT model is different from the one used here?

Looking for a detailed explanation

I am a student, and I am very excited about the paper "Incorporating BERT into Neural Machine Translation". I would appreciate it if you could explain the code in more detail or guide me through running the code you have provided.

Why do I get the error "AttributeError: 'NoneType' object has no attribute 'hidden_size'"?

The log is as follows:
0%| | 33792/407873900 [00:29<94:07:20, 1203.63B/s]
Model name 'bert-base-uncased' was not found in model name list (bert-base-uncased, bert-large-uncased, bert-base-cased, bert-large-cased, bert-base-multilingual-uncased, bert-base-multilingual-cased, bert-base-chinese, bert-base-german-cased). We assumed 'https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased.tar.gz' was a path or url but couldn't find any file associated to this path or url.
Traceback (most recent call last):
Namespace(activation_dropout=0.0, activation_fn='relu', adam_betas='(0.9,0.98)', adam_eps=1e-08, adaptive_input=False, adaptive_softmax_cutoff=None, adaptive_softmax_dropout=0, arch='transformer_s2_iwslt_de_en', attention_dropout=0.0, bert_first=True, bert_gates=[1, 1, 1, 1, 1, 1], bert_model_name='bert-base-uncased', bert_output_layer=-1, bert_ratio=1.0, bucket_cap_mb=25, clip_norm=25, cpu=False, criterion='label_smoothed_cross_entropy', curriculum=0, data='data-bin/iwslt14.tokenized.de-en', dataset_impl='cached', ddp_backend='c10d', decoder_attention_heads=4, decoder_embed_dim=512, decoder_embed_path=None, decoder_ffn_embed_dim=1024, decoder_input_dim=512, decoder_layers=6, decoder_learned_pos=False, decoder_no_bert=False, decoder_normalize_before=False, decoder_output_dim=512, device_id=0, disable_validation=False, distributed_backend='nccl', distributed_init_method=None, distributed_no_spawn=False, distributed_port=-1, distributed_rank=0, distributed_world_size=1, dropout=0.3, encoder_attention_heads=4, encoder_bert_dropout=True, encoder_bert_dropout_ratio=0.5, encoder_bert_mixup=False, encoder_embed_dim=512, encoder_embed_path=None, encoder_ffn_embed_dim=1024, encoder_layers=6, encoder_learned_pos=False, encoder_normalize_before=False, encoder_ratio=1.0, find_unused_parameters=False, finetune_bert=False, fix_batches_to_gpus=False, fp16=False, fp16_init_scale=128, fp16_scale_tolerance=0.0, fp16_scale_window=None, keep_interval_updates=-1, keep_last_epochs=-1, label_smoothing=0.1, lazy_load=False, left_pad_source='True', left_pad_target='False', log_format=None, log_interval=1000, lr=[0.0005], lr_scheduler='inverse_sqrt', mask_cls_sep=False, max_epoch=0, max_sentences=None, max_sentences_valid=None, max_source_positions=1024, max_target_positions=1024, max_tokens=50000, max_update=150000, memory_efficient_fp16=False, min_loss_scale=0.0001, min_lr=1e-09, no_epoch_checkpoints=False, no_progress_bar=False, no_save=False, no_token_positional_embeddings=False, num_workers=0, optimizer='adam', optimizer_overrides='{}', raw_text=False, required_batch_size_multiple=8, reset_dataloader=False, reset_lr_scheduler=True, reset_meters=False, reset_optimizer=False, restore_file='checkpoint_last.pt', save_dir='checkpoints/iwed_en_de_0.5', save_interval=1, save_interval_updates=0, seed=1, sentence_avg=False, share_all_embeddings=True, share_decoder_input_output_embed=False, skip_invalid_size_inputs_valid_test=False, source_lang='en', target_lang='de', task='translation', tbmf_wrapper=False, tensorboard_logdir='', threshold_loss_scale=None, train_subset='train', update_freq=[1], upsample_primary=1, user_dir=None, valid_subset='valid', validate_interval=1, warmup_from_nmt=True, warmup_init_lr=1e-07, warmup_nmt_file='checkpoint_nmt.pt', warmup_updates=4000, weight_decay=0.0001)
| [en] dictionary: 10152 types
| [de] dictionary: 10152 types
| data-bin/iwslt14.tokenized.de-en valid en-de 7283 examples
File "train.py", line 315, in
cli_main()
File "train.py", line 311, in cli_main
main(args)
File "train.py", line 49, in main
model = task.build_model(args)
File "/home/alex/bert-nmt-master/fairseq/tasks/fairseq_task.py", line 169, in build_mod el
return models.build_model(args, self)
File "/home/alex/bert-nmt-master/fairseq/models/init.py", line 50, in build_model
return ARCH_MODEL_REGISTRY[args.arch].build_model(args, task)
File "/home/alex/bert-nmt-master/fairseq/models/transformer.py", line 301, in build_mod el
args.bert_out_dim = bertencoder.hidden_size
AttributeError: 'NoneType' object has no attribute 'hidden_size'
0%| | 33792/407873900 [05:15<1058:16:27, 107.05B/s]
Please help me, thanks.
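
A debugging sketch: the first log line shows the bert-base-uncased download failing, which would presumably leave the BERT encoder as None and produce exactly this error. Checking that the archive quoted in the log is reachable from the training machine (or downloading it manually behind a proxy) may be enough.

# sketch: the URL is the one quoted in the log above
wget https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased.tar.gz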

Cannot resume training

Hi, thank you for your work.

I encountered a problem when resuming training: somehow the training restarts from the first epoch. The log is below:

| model transformer_s2_vaswani_wmt_en_de_big, criterion LabelSmoothedCrossEntropyCriterion
| num. model params: 354184960 (num. trained: 245874688)
| training on 1 GPUs
| max tokens per GPU = 3050 and max sentences per GPU = None
Model will load checkpoint from ../model/bert-base-cased/2222/checkpoint_last.pt
| loaded checkpoint ../model/bert-base-cased/2222/checkpoint_last.pt (epoch 8 @ 0 updates)
| loading train data for epoch 0
| ../process/bin train src-trg 1159547 examples
| saved checkpoint ../model/bert-base-cased/2222/checkpoint0.pt (epoch 0 @ 0 updates) (writing took 530.7144210338593 seconds)

Even though I have already trained it for 8 epochs, it restarts from the first epoch. Do you have any idea why this happens? Thank you very much.

Warm regards,
Reza Qorib

TypeError: __init__() missing 3 required positional arguments: 'bertencoder', 'berttokenizer', and 'mask_cls_sep'

I trained a lightconv model via fairseq.
When I used it in bert-nmt, I got this error.
(bertNMT) blue90211@AI02:~/Storage01/bert-nmt$ CUDA_VISIBLE_DEVICES=0 python train.py $DATAPATH -a $ARCH --optimizer adam --lr 0.0005 -s $src -t $tgt --label-smoothing 0.1 --dropout 0.3 --max-tokens 4000 --min-lr '1e-09' --lr-scheduler inverse_sqrt --weight-decay 0.0001 --criterion label_smoothed_cross_entropy --max-update 150000 --warmup-updates 4000 --warmup-init-lr '1e-07' --adam-betas '(0.9,0.98)' --save-dir $SAVEDIR --share-all-embeddings $warmup --encoder-bert-dropout --encoder-bert-dropout-ratio $bedropout --bert-model-name bert-base-uncased | tee -a $SAVEDIR/training.log
Traceback (most recent call last):
Namespace(adam_betas='(0.9,0.98)', adam_eps=1e-08, adaptive_softmax_cutoff=None, adaptive_softmax_dropout=0, arch='lightconv', attention_dropout=0.0, bert_first=True, bert_gates=[1, 1, 1, 1, 1, 1], bert_model_name='bert-base-uncased', bert_output_layer=-1, bert_ratio=1.0, bucket_cap_mb=25, clip_norm=25, cpu=False, criterion='label_smoothed_cross_entropy', curriculum=0, data='databin/wmt17_enzh_join', dataset_impl='cached', ddp_backend='c10d', decoder_attention_heads=8, decoder_conv_dim=512, decoder_conv_type='dynamic', decoder_embed_dim=512, decoder_embed_path=None, decoder_ffn_embed_dim=2048, decoder_glu=True, decoder_input_dim=512, decoder_kernel_size_list=[3, 7, 15, 31, 31, 31], decoder_layers=6, decoder_learned_pos=False, decoder_no_bert=False, decoder_normalize_before=False, decoder_output_dim=512, device_id=0, disable_validation=False, distributed_backend='nccl', distributed_init_method=None, distributed_no_spawn=False, distributed_port=-1, distributed_rank=0, distributed_world_size=1, dropout=0.3, encoder_attention_heads=8, encoder_bert_dropout=True, encoder_bert_dropout_ratio=0.5, encoder_bert_mixup=False, encoder_conv_dim=512, encoder_conv_type='dynamic', encoder_embed_dim=512, encoder_embed_path=None, encoder_ffn_embed_dim=2048, encoder_glu=True, encoder_kernel_size_list=[3, 7, 15, 31, 31, 31, 31], encoder_layers=7, encoder_learned_pos=False, encoder_normalize_before=False, encoder_ratio=1.0, find_unused_parameters=False, finetune_bert=False, fix_batches_to_gpus=False, fp16=False, fp16_init_scale=128, fp16_scale_tolerance=0.0, fp16_scale_window=None, input_dropout=0.1, keep_interval_updates=-1, keep_last_epochs=-1, label_smoothing=0.1, lazy_load=False, left_pad_source='True', left_pad_target='False', log_format=None, log_interval=1000, lr=[0.0005], lr_scheduler='inverse_sqrt', mask_cls_sep=False, max_epoch=0, max_sentences=None, max_sentences_valid=None, max_source_positions=1024, max_target_positions=1024, max_tokens=4000, max_update=150000, memory_efficient_fp16=False, min_loss_scale=0.0001, min_lr=1e-09, no_epoch_checkpoints=False, no_progress_bar=False, no_save=False, no_token_positional_embeddings=False, num_workers=0, optimizer='adam', optimizer_overrides='{}', raw_text=False, relu_dropout=0.0, required_batch_size_multiple=8, reset_dataloader=False, reset_lr_scheduler=True, reset_meters=False, reset_optimizer=False, restore_file='checkpoint_last.pt', save_dir='checkpoints', save_interval=1, save_interval_updates=0, seed=1, sentence_avg=False, share_all_embeddings=True, share_decoder_input_output_embed=False, skip_invalid_size_inputs_valid_test=False, source_lang='en', target_lang='zh', task='translation', tbmf_wrapper=False, tensorboard_logdir='', threshold_loss_scale=None, train_subset='train', update_freq=[1], upsample_primary=1, user_dir=None, valid_subset='valid', validate_interval=1, warmup_from_nmt=True, warmup_init_lr=1e-07, warmup_nmt_file='checkpoint_nmt.pt', warmup_updates=4000, weight_decay=0.0001, weight_dropout=0.0, weight_softmax=True)
| [en] dictionary: 73104 types
| [zh] dictionary: 73104 types
| databin/wmt17_enzh_join valid en-zh 2001 examples
File "train.py", line 315, in
cli_main()
File "train.py", line 311, in cli_main
main(args)
File "train.py", line 49, in main
model = task.build_model(args)
File "/mnt/Storage01/blue90211/bert-nmt/fairseq/tasks/fairseq_task.py", line 169, in build_model
return models.build_model(args, self)
File "/mnt/Storage01/blue90211/bert-nmt/fairseq/models/init.py", line 50, in build_model
return ARCH_MODEL_REGISTRY[args.arch].build_model(args, task)
File "/mnt/Storage01/blue90211/bert-nmt/fairseq/models/lightconv.py", line 176, in build_model
return LightConvModel(encoder, decoder)
File "/mnt/Storage01/blue90211/bert-nmt/fairseq/models/lightconv.py", line 53, in init
super().init(encoder, decoder)
TypeError: init() missing 3 required positional arguments: 'bertencoder', 'berttokenizer', and 'mask_cls_sep'`

Always get gradient exploding error.

I tried to use a pretrained XLM with my own code, and it failed to train. So I tried the source code and the pretrained BERT from the URL in the code, and I get the same error.
Is there anything I should be aware of here?
FloatingPointError: Minimum loss scale reached (0.0001). Your loss is probably exploding. Try lowering the learning rate, using gradient clipping or increasing the batch size.
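
A hedged sketch of the mitigations the error message itself suggests (values are illustrative, not tuned): a lower learning rate, gradient clipping, and a larger effective batch via --update-freq. The minimum-loss-scale check is part of the --fp16 path, so training in fp32 also sidesteps this particular failure.

# sketch: illustrative values only; the flags are the ones used in the training
# commands elsewhere on this page
python train.py $DATAPATH -a $ARCH -s $src -t $tgt \
    --lr 0.0003 --clip-norm 0.1 --update-freq 4 \
    --encoder-bert-dropout --encoder-bert-dropout-ratio $bedropout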

How to use interactive.py

Can you provide some examples?
What is $bpefile?

sed -r 's/(@@ )|(@@ ?$)//g' $bpefile > $bpefile.debpe
$MOSE/scripts/tokenizer/detokenizer.perl -l $src < $bpefile.debpe > $bpefile.debpe.detok
paste -d "\n" $bpefile $bpefile.debpe.detok > $bpefile.in
cat $bpefile.in | python interactive.py -s $src -t $tgt \
--buffer-size 1024 --batch-size 128 --beam 5 --remove-bpe > output.log

Architectures match problem in en-zh data

I tried to use en-zh data in bert-nmt, but I ran into an architecture-match problem.

My env:
cuda 9.0
pytorch 1.0.0
python 3.6

My preprocessing:
<Step 1> Tokenize/clean and generate BPE format
English tokenization: NLTK
Chinese tokenization: Jieba
following this guideline: https://github.com/twairball/fairseq-zh-en

<Step 2> Generate BERT input
makedataforbert.sh
<Step 3> Generate binary files

TEXT=examples/translation/fairseq-zh-en/data/wmt17_en_zh
DATADIR=data-bin/wmt17_en_zh
NUM_OPS=32000

fairseq-preprocess \
--source-lang en \
--target-lang zh \
--trainpref $TEXT/train.${NUM_OPS}.bpe \
--validpref $TEXT/valid.${NUM_OPS}.bpe \
--testpref $TEXT/test.${NUM_OPS}.bpe \
--thresholdsrc 3 \
--thresholdtgt 3 \
--destdir $DATADIR

My pretrained model
I trained a model via fairseq.

CUDA_VISIBLE_DEVICES=0 \
fairseq-train data-bin/wmt17_en_zh \
--arch transformer_vaswani_wmt_en_de_big \
--share-decoder-input-output-embed \
--optimizer adam \
--adam-betas '(0.9, 0.98)' \
--clip-norm 0.0 \
--lr 5e-4 --lr-scheduler inverse_sqrt \
--warmup-updates 4000 \
--dropout 0.3 \
--weight-decay 0.0001 \
--criterion label_smoothed_cross_entropy \
--label-smoothing 0.1 \
--max-tokens 1024 \
--eval-bleu \
--eval-bleu-args '{"beam": 5, "max_len_a": 1.2, "max_len_b": 10}' \
--eval-bleu-detok moses \
--eval-bleu-remove-bpe \
--eval-bleu-print-samples \
--best-checkpoint-metric bleu \
--maximize-best-checkpoint-metric \
--save-dir checkpoints/fconv_wmt17_en_zh

After that, I chose a checkpoint .pt file to use as my pretrained model.

Training

src=en
tgt=zh
bedropout=0.5
ARCH=transformer_vaswani_wmt_en_de_big 
DATAPATH=destdir/
SAVEDIR=checkpoints/wmt17_${src}_${tgt}_${bedropout}
mkdir -p $SAVEDIR
if [ ! -f $SAVEDIR/checkpoint_nmt.pt ]; then     cp /home/blue90211/Storage01/fairseq/checkpoints/fconv_wmt17_en_zh/test_best.pt $SAVEDIR/checkpoint_nmt.pt; fi
if [ ! -f "$SAVEDIR/checkpoint_last.pt" ]; then warmup="--warmup-from-nmt --reset-lr-scheduler"; else warmup=""; fi

CUDA_VISIBLE_DEVICES=1 python train.py $DATAPATH \
-a $ARCH --optimizer adam --lr 0.0005 -s $src -t $tgt --label-smoothing 0.1 \
--dropout 0.3 --max-tokens 4000 --min-lr '1e-09' --lr-scheduler inverse_sqrt --weight-decay 0.0001 \
--criterion label_smoothed_cross_entropy --max-update 150000 --warmup-updates 4000 --warmup-init-lr '1e-07' \
--adam-betas '(0.9,0.98)' --save-dir $SAVEDIR --share-all-embeddings $warmup \
--encoder-bert-dropout --encoder-bert-dropout-ratio $bedropout | tee -a $SAVEDIR/training.log
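
Before the log below, one hedged check: the vanilla checkpoint was built from data-bin/wmt17_en_zh (a separate fairseq-preprocess run with its own thresholds) while the BERT-fused run reads destdir/, so the two runs may be using different dictionaries; the vanilla model also used --share-decoder-input-output-embed while the fused run uses --share-all-embeddings. Either difference would make the parameter shapes, and thus the architectures, fail to match.

# sketch: the embedding shapes can only match if both runs share the same dictionaries
wc -l data-bin/wmt17_en_zh/dict.en.txt destdir/dict.en.txt
wc -l data-bin/wmt17_en_zh/dict.zh.txt destdir/dict.zh.txt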

My log file:

Namespace(activation_dropout=0.0, activation_fn='relu', adam_betas='(0.9,0.98)', adam_eps=1e-08, adaptive_input=False, adaptive_softmax_cutoff=None, adaptive_softmax_dropout=0, arch='transformer_vaswani_wmt_en_de_big', attention_dropout=0.0, bert_first=True, bert_gates=[1, 1, 1, 1, 1, 1], bert_model_name='bert-base-uncased', bert_output_layer=-1, bert_ratio=1.0, bucket_cap_mb=25, clip_norm=25, cpu=False, criterion='label_smoothed_cross_entropy', curriculum=0, data='destdir/', dataset_impl='cached', ddp_backend='c10d', decoder_attention_heads=16, decoder_embed_dim=1024, decoder_embed_path=None, decoder_ffn_embed_dim=4096, decoder_input_dim=1024, decoder_layers=6, decoder_learned_pos=False, decoder_no_bert=False, decoder_normalize_before=False, decoder_output_dim=1024, device_id=0, disable_validation=False, distributed_backend='nccl', distributed_init_method=None, distributed_no_spawn=False, distributed_port=-1, distributed_rank=0, distributed_world_size=1, dropout=0.3, encoder_attention_heads=16, encoder_bert_dropout=True, encoder_bert_dropout_ratio=0.5, encoder_bert_mixup=False, encoder_embed_dim=1024, encoder_embed_path=None, encoder_ffn_embed_dim=4096, encoder_layers=6, encoder_learned_pos=False, encoder_normalize_before=False, encoder_ratio=1.0, find_unused_parameters=False, finetune_bert=False, fix_batches_to_gpus=False, fp16=False, fp16_init_scale=128, fp16_scale_tolerance=0.0, fp16_scale_window=None, keep_interval_updates=-1, keep_last_epochs=-1, label_smoothing=0.1, lazy_load=False, left_pad_source='True', left_pad_target='False', log_format=None, log_interval=1000, lr=[0.0005], lr_scheduler='inverse_sqrt', mask_cls_sep=False, max_epoch=0, max_sentences=None, max_sentences_valid=None, max_source_positions=1024, max_target_positions=1024, max_tokens=4000, max_update=150000, memory_efficient_fp16=False, min_loss_scale=0.0001, min_lr=1e-09, no_epoch_checkpoints=False, no_progress_bar=False, no_save=False, no_token_positional_embeddings=False, num_workers=0, optimizer='adam', optimizer_overrides='{}', raw_text=False, required_batch_size_multiple=8, reset_dataloader=False, reset_lr_scheduler=True, reset_meters=False, reset_optimizer=False, restore_file='checkpoint_last.pt', save_dir='checkpoints/wmt17_en_zh_0.5', save_interval=1, save_interval_updates=0, seed=1, sentence_avg=False, share_all_embeddings=True, share_decoder_input_output_embed=False, skip_invalid_size_inputs_valid_test=False, source_lang='en', target_lang='zh', task='translation', tbmf_wrapper=False, tensorboard_logdir='', threshold_loss_scale=None, train_subset='train', update_freq=[1], upsample_primary=1, user_dir=None, valid_subset='valid', validate_interval=1, warmup_from_nmt=True, warmup_init_lr=1e-07, warmup_nmt_file='checkpoint_nmt.pt', warmup_updates=4000, weight_decay=0.0001)
| [en] dictionary: 65912 types
| [zh] dictionary: 65912 types
| destdir/ valid en-zh 2001 examples
bert_gates [True, True, True, True, True, True]
TransformerModel(
  (encoder): TransformerEncoder(
    (embed_tokens): Embedding(65912, 1024, padding_idx=1)
    (embed_positions): SinusoidalPositionalEmbedding()
    (layers): ModuleList(
      (0): TransformerEncoderLayer(
        (self_attn): MultiheadAttention(
          (out_proj): Linear(in_features=1024, out_features=1024, bias=True)
        )
        (self_attn_layer_norm): LayerNorm(torch.Size([1024]), eps=1e-05, elementwise_affine=True)
        (fc1): Linear(in_features=1024, out_features=4096, bias=True)
        (fc2): Linear(in_features=4096, out_features=1024, bias=True)
        (final_layer_norm): LayerNorm(torch.Size([1024]), eps=1e-05, elementwise_affine=True)
      )
      (1): TransformerEncoderLayer(
        (self_attn): MultiheadAttention(
          (out_proj): Linear(in_features=1024, out_features=1024, bias=True)
        )
        (self_attn_layer_norm): LayerNorm(torch.Size([1024]), eps=1e-05, elementwise_affine=True)
        (fc1): Linear(in_features=1024, out_features=4096, bias=True)
        (fc2): Linear(in_features=4096, out_features=1024, bias=True)
        (final_layer_norm): LayerNorm(torch.Size([1024]), eps=1e-05, elementwise_affine=True)
      )
      (2): TransformerEncoderLayer(
        (self_attn): MultiheadAttention(
          (out_proj): Linear(in_features=1024, out_features=1024, bias=True)
        )
        (self_attn_layer_norm): LayerNorm(torch.Size([1024]), eps=1e-05, elementwise_affine=True)
        (fc1): Linear(in_features=1024, out_features=4096, bias=True)
        (fc2): Linear(in_features=4096, out_features=1024, bias=True)
        (final_layer_norm): LayerNorm(torch.Size([1024]), eps=1e-05, elementwise_affine=True)
      )
      (3): TransformerEncoderLayer(
        (self_attn): MultiheadAttention(
          (out_proj): Linear(in_features=1024, out_features=1024, bias=True)
        )
        (self_attn_layer_norm): LayerNorm(torch.Size([1024]), eps=1e-05, elementwise_affine=True)
        (fc1): Linear(in_features=1024, out_features=4096, bias=True)
        (fc2): Linear(in_features=4096, out_features=1024, bias=True)
        (final_layer_norm): LayerNorm(torch.Size([1024]), eps=1e-05, elementwise_affine=True)
      )
      (4): TransformerEncoderLayer(
        (self_attn): MultiheadAttention(
          (out_proj): Linear(in_features=1024, out_features=1024, bias=True)
        )
        (self_attn_layer_norm): LayerNorm(torch.Size([1024]), eps=1e-05, elementwise_affine=True)
        (fc1): Linear(in_features=1024, out_features=4096, bias=True)
        (fc2): Linear(in_features=4096, out_features=1024, bias=True)
        (final_layer_norm): LayerNorm(torch.Size([1024]), eps=1e-05, elementwise_affine=True)
      )
      (5): TransformerEncoderLayer(
        (self_attn): MultiheadAttention(
          (out_proj): Linear(in_features=1024, out_features=1024, bias=True)
        )
        (self_attn_layer_norm): LayerNorm(torch.Size([1024]), eps=1e-05, elementwise_affine=True)
        (fc1): Linear(in_features=1024, out_features=4096, bias=True)
        (fc2): Linear(in_features=4096, out_features=1024, bias=True)
        (final_layer_norm): LayerNorm(torch.Size([1024]), eps=1e-05, elementwise_affine=True)
      )
    )
  )
  (decoder): TransformerDecoder(
    (embed_tokens): Embedding(65912, 1024, padding_idx=1)
    (embed_positions): SinusoidalPositionalEmbedding()
    (layers): ModuleList(
      (0): TransformerDecoderLayer(
        (self_attn): MultiheadAttention(
          (out_proj): Linear(in_features=1024, out_features=1024, bias=True)
        )
        (self_attn_layer_norm): LayerNorm(torch.Size([1024]), eps=1e-05, elementwise_affine=True)
        (encoder_attn): MultiheadAttention(
          (out_proj): Linear(in_features=1024, out_features=1024, bias=True)
        )
        (bert_attn): MultiheadAttention(
          (out_proj): Linear(in_features=1024, out_features=1024, bias=True)
        )
        (encoder_attn_layer_norm): LayerNorm(torch.Size([1024]), eps=1e-05, elementwise_affine=True)
        (fc1): Linear(in_features=1024, out_features=4096, bias=True)
        (fc2): Linear(in_features=4096, out_features=1024, bias=True)
        (final_layer_norm): LayerNorm(torch.Size([1024]), eps=1e-05, elementwise_affine=True)
      )
      (1): TransformerDecoderLayer(
        (self_attn): MultiheadAttention(
          (out_proj): Linear(in_features=1024, out_features=1024, bias=True)
        )
        (self_attn_layer_norm): LayerNorm(torch.Size([1024]), eps=1e-05, elementwise_affine=True)
        (encoder_attn): MultiheadAttention(
          (out_proj): Linear(in_features=1024, out_features=1024, bias=True)
        )
        (bert_attn): MultiheadAttention(
          (out_proj): Linear(in_features=1024, out_features=1024, bias=True)
        )
        (encoder_attn_layer_norm): LayerNorm(torch.Size([1024]), eps=1e-05, elementwise_affine=True)
        (fc1): Linear(in_features=1024, out_features=4096, bias=True)
        (fc2): Linear(in_features=4096, out_features=1024, bias=True)
        (final_layer_norm): LayerNorm(torch.Size([1024]), eps=1e-05, elementwise_affine=True)
      )
      (2): TransformerDecoderLayer(
        (self_attn): MultiheadAttention(
          (out_proj): Linear(in_features=1024, out_features=1024, bias=True)
        )
        (self_attn_layer_norm): LayerNorm(torch.Size([1024]), eps=1e-05, elementwise_affine=True)
        (encoder_attn): MultiheadAttention(
          (out_proj): Linear(in_features=1024, out_features=1024, bias=True)
        )
        (bert_attn): MultiheadAttention(
          (out_proj): Linear(in_features=1024, out_features=1024, bias=True)
        )
        (encoder_attn_layer_norm): LayerNorm(torch.Size([1024]), eps=1e-05, elementwise_affine=True)
        (fc1): Linear(in_features=1024, out_features=4096, bias=True)
        (fc2): Linear(in_features=4096, out_features=1024, bias=True)
        (final_layer_norm): LayerNorm(torch.Size([1024]), eps=1e-05, elementwise_affine=True)
      )
      (3): TransformerDecoderLayer(
        (self_attn): MultiheadAttention(
          (out_proj): Linear(in_features=1024, out_features=1024, bias=True)
        )
        (self_attn_layer_norm): LayerNorm(torch.Size([1024]), eps=1e-05, elementwise_affine=True)
        (encoder_attn): MultiheadAttention(
          (out_proj): Linear(in_features=1024, out_features=1024, bias=True)
        )
        (bert_attn): MultiheadAttention(
          (out_proj): Linear(in_features=1024, out_features=1024, bias=True)
        )
        (encoder_attn_layer_norm): LayerNorm(torch.Size([1024]), eps=1e-05, elementwise_affine=True)
        (fc1): Linear(in_features=1024, out_features=4096, bias=True)
        (fc2): Linear(in_features=4096, out_features=1024, bias=True)
        (final_layer_norm): LayerNorm(torch.Size([1024]), eps=1e-05, elementwise_affine=True)
      )
      (4): TransformerDecoderLayer(
        (self_attn): MultiheadAttention(
          (out_proj): Linear(in_features=1024, out_features=1024, bias=True)
        )
        (self_attn_layer_norm): LayerNorm(torch.Size([1024]), eps=1e-05, elementwise_affine=True)
        (encoder_attn): MultiheadAttention(
          (out_proj): Linear(in_features=1024, out_features=1024, bias=True)
        )
        (bert_attn): MultiheadAttention(
          (out_proj): Linear(in_features=1024, out_features=1024, bias=True)
        )
        (encoder_attn_layer_norm): LayerNorm(torch.Size([1024]), eps=1e-05, elementwise_affine=True)
        (fc1): Linear(in_features=1024, out_features=4096, bias=True)
        (fc2): Linear(in_features=4096, out_features=1024, bias=True)
        (final_layer_norm): LayerNorm(torch.Size([1024]), eps=1e-05, elementwise_affine=True)
      )
      (5): TransformerDecoderLayer(
        (self_attn): MultiheadAttention(
          (out_proj): Linear(in_features=1024, out_features=1024, bias=True)
        )
        (self_attn_layer_norm): LayerNorm(torch.Size([1024]), eps=1e-05, elementwise_affine=True)
        (encoder_attn): MultiheadAttention(
          (out_proj): Linear(in_features=1024, out_features=1024, bias=True)
        )
        (bert_attn): MultiheadAttention(
          (out_proj): Linear(in_features=1024, out_features=1024, bias=True)
        )
        (encoder_attn_layer_norm): LayerNorm(torch.Size([1024]), eps=1e-05, elementwise_affine=True)
        (fc1): Linear(in_features=1024, out_features=4096, bias=True)
        (fc2): Linear(in_features=4096, out_features=1024, bias=True)
        (final_layer_norm): LayerNorm(torch.Size([1024]), eps=1e-05, elementwise_affine=True)
      )
    )
  )
  (bert_encoder): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): BertLayerNorm()
      (dropout): Dropout(p=0.1)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): BertLayerNorm()
              (dropout): Dropout(p=0.1)
            )
          )
          (intermediate): BertIntermediate(
            (dense): Linear(in_features=768, out_features=3072, bias=True)
          )
          (output): BertOutput(
            (dense): Linear(in_features=3072, out_features=768, bias=True)
            (LayerNorm): BertLayerNorm()
            (dropout): Dropout(p=0.1)
          )
        )
        (1): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): BertLayerNorm()
              (dropout): Dropout(p=0.1)
            )
          )
          (intermediate): BertIntermediate(
            (dense): Linear(in_features=768, out_features=3072, bias=True)
          )
          (output): BertOutput(
            (dense): Linear(in_features=3072, out_features=768, bias=True)
            (LayerNorm): BertLayerNorm()
            (dropout): Dropout(p=0.1)
          )
        )
        (2): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): BertLayerNorm()
              (dropout): Dropout(p=0.1)
            )
          )
          (intermediate): BertIntermediate(
            (dense): Linear(in_features=768, out_features=3072, bias=True)
          )
          (output): BertOutput(
            (dense): Linear(in_features=3072, out_features=768, bias=True)
            (LayerNorm): BertLayerNorm()
            (dropout): Dropout(p=0.1)
          )
        )
        (3): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): BertLayerNorm()
              (dropout): Dropout(p=0.1)
            )
          )
          (intermediate): BertIntermediate(
            (dense): Linear(in_features=768, out_features=3072, bias=True)
          )
          (output): BertOutput(
            (dense): Linear(in_features=3072, out_features=768, bias=True)
            (LayerNorm): BertLayerNorm()
            (dropout): Dropout(p=0.1)
          )
        )
        (4): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): BertLayerNorm()
              (dropout): Dropout(p=0.1)
            )
          )
          (intermediate): BertIntermediate(
            (dense): Linear(in_features=768, out_features=3072, bias=True)
          )
          (output): BertOutput(
            (dense): Linear(in_features=3072, out_features=768, bias=True)
            (LayerNorm): BertLayerNorm()
            (dropout): Dropout(p=0.1)
          )
        )
        (5): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): BertLayerNorm()
              (dropout): Dropout(p=0.1)
            )
          )
          (intermediate): BertIntermediate(
            (dense): Linear(in_features=768, out_features=3072, bias=True)
          )
          (output): BertOutput(
            (dense): Linear(in_features=3072, out_features=768, bias=True)
            (LayerNorm): BertLayerNorm()
            (dropout): Dropout(p=0.1)
          )
        )
        (6): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): BertLayerNorm()
              (dropout): Dropout(p=0.1)
            )
          )
          (intermediate): BertIntermediate(
            (dense): Linear(in_features=768, out_features=3072, bias=True)
          )
          (output): BertOutput(
            (dense): Linear(in_features=3072, out_features=768, bias=True)
            (LayerNorm): BertLayerNorm()
            (dropout): Dropout(p=0.1)
          )
        )
        (7): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): BertLayerNorm()
              (dropout): Dropout(p=0.1)
            )
          )
          (intermediate): BertIntermediate(
            (dense): Linear(in_features=768, out_features=3072, bias=True)
          )
          (output): BertOutput(
            (dense): Linear(in_features=3072, out_features=768, bias=True)
            (LayerNorm): BertLayerNorm()
            (dropout): Dropout(p=0.1)
          )
        )
        (8): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): BertLayerNorm()
              (dropout): Dropout(p=0.1)
            )
          )
          (intermediate): BertIntermediate(
            (dense): Linear(in_features=768, out_features=3072, bias=True)
          )
          (output): BertOutput(
            (dense): Linear(in_features=3072, out_features=768, bias=True)
            (LayerNorm): BertLayerNorm()
            (dropout): Dropout(p=0.1)
          )
        )
        (9): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): BertLayerNorm()
              (dropout): Dropout(p=0.1)
            )
          )
          (intermediate): BertIntermediate(
            (dense): Linear(in_features=768, out_features=3072, bias=True)
          )
          (output): BertOutput(
            (dense): Linear(in_features=3072, out_features=768, bias=True)
            (LayerNorm): BertLayerNorm()
            (dropout): Dropout(p=0.1)
          )
        )
        (10): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): BertLayerNorm()
              (dropout): Dropout(p=0.1)
            )
          )
          (intermediate): BertIntermediate(
            (dense): Linear(in_features=768, out_features=3072, bias=True)
          )
          (output): BertOutput(
            (dense): Linear(in_features=3072, out_features=768, bias=True)
            (LayerNorm): BertLayerNorm()
            (dropout): Dropout(p=0.1)
          )
        )
        (11): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): BertLayerNorm()
              (dropout): Dropout(p=0.1)
            )
          )
          (intermediate): BertIntermediate(
            (dense): Linear(in_features=768, out_features=3072, bias=True)
          )
          (output): BertOutput(
            (dense): Linear(in_features=3072, out_features=768, bias=True)
            (LayerNorm): BertLayerNorm()
            (dropout): Dropout(p=0.1)
          )
        )
      )
    )
    (pooler): BertPooler(
      (dense): Linear(in_features=768, out_features=768, bias=True)
      (activation): Tanh()
    )
  )
)
| model transformer_vaswani_wmt_en_de_big, criterion LabelSmoothedCrossEntropyCriterion
| num. model params: 375378176 (num. trained: 265895936)
| training on 1 GPUs
| max tokens per GPU = 4000 and max sentences per GPU = None
Model will load checkpoint from checkpoints/wmt17_en_zh_0.5/checkpoint_nmt.pt
Traceback (most recent call last):
  File "/mnt/Storage01/blue90211/bert-nmt/fairseq/trainer.py", line 150, in load_checkpoint
    self.get_model().load_state_dict(state['model'], strict=False if warmup_from_nmt else True)
  File "/mnt/Storage01/blue90211/bert-nmt/fairseq/models/fairseq_model.py", line 72, in load_state_dict
    return super().load_state_dict(state_dict, strict)
  File "/tools/anaconda3/envs/bertNMT/lib/python3.6/site-packages/torch/nn/modules/module.py", line 769, in load_state_dict
    self.__class__.__name__, "\n\t".join(error_msgs)))
RuntimeError: Error(s) in loading state_dict for TransformerModel:
        size mismatch for encoder.embed_tokens.weight: copying a param with shape torch.Size([29248, 1024]) from checkpoint, the shape in current model is torch.Size([65912, 1024]).
        size mismatch for decoder.embed_tokens.weight: copying a param with shape torch.Size([33864, 1024]) from checkpoint, the shape in current model is torch.Size([65912, 1024]).

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "train.py", line 315, in <module>
    cli_main()
  File "train.py", line 311, in cli_main
    main(args)
  File "train.py", line 75, in main
    extra_state, epoch_itr = checkpoint_utils.load_checkpoint(args, trainer)
  File "/mnt/Storage01/blue90211/bert-nmt/fairseq/checkpoint_utils.py", line 115, in load_checkpoint
    warmup_from_nmt=args.warmup_from_nmt,
  File "/mnt/Storage01/blue90211/bert-nmt/fairseq/trainer.py", line 153, in load_checkpoint
    'Cannot load model parameters from checkpoint, '
Exception: Cannot load model parameters from checkpoint, please ensure that the architectures match.
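The size mismatch above suggests that the pretrained NMT checkpoint and the current bert-nmt model were built from different dictionaries (29248/33864 source/target entries in the checkpoint vs. a 65912-entry joined dictionary in the current model). A minimal diagnostic sketch to confirm this, assuming the checkpoint path from the log above (the parameter names are taken from the error message; adjust paths to your own setup):

import torch

# Hypothetical check: load the vanilla NMT checkpoint on CPU and inspect
# the embedding shapes it actually stores.
ckpt = torch.load('checkpoints/wmt17_en_zh_0.5/checkpoint_nmt.pt', map_location='cpu')
for name in ('encoder.embed_tokens.weight', 'decoder.embed_tokens.weight'):
    print(name, tuple(ckpt['model'][name].shape))
# If these shapes disagree with the dictionary sizes reported by preprocess.py for
# the data-bin you built for bert-nmt, the two models come from different
# dictionaries, and load_state_dict() fails with exactly the size mismatch shown above.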

Should I choose another architecture?
Why do the same architecture and hyper-parameters that work in Fairseq fail in bert-nmt?
Can you give me some suggestions?

Thank you
