
mpnet's Introduction

MPNet

MPNet (Masked and Permuted Pre-training for Language Understanding), by Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, and Tie-Yan Liu, is a novel pre-training method for language understanding tasks. It addresses the problems of MLM (masked language modeling) in BERT and PLM (permuted language modeling) in XLNet and achieves better accuracy.

News: We have updated the pre-trained models.

Supported Features

  • A unified view and implementation of several pre-training models including BERT, XLNet, MPNet, etc.
  • Code for pre-training and fine-tuning on a variety of language understanding tasks (GLUE, SQuAD, RACE, etc.).

Installation

We implement MPNet and this pre-training toolkit on top of the fairseq codebase. The installation is as follows:

pip install --editable pretraining/
pip install pytorch_transformers==1.0.0 transformers scipy sklearn

Pre-training MPNet

Our model is pre-trained with the BERT dictionary, so you first need to pip install transformers to use the BERT tokenizer. We provide a script encode.py and a dictionary file dict.txt to tokenize your corpus. You can modify encode.py if you want to use other tokenizers (such as the RoBERTa one).
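
For reference, the encoding step is roughly equivalent to the sketch below (a simplification: the provided encode.py adds multiprocessing and command-line handling, and the exact output format, wordpiece strings vs. token IDs, should follow that script):

# Rough sketch of the per-line tokenization performed by encode.py
# (simplified; the file paths and output format here are assumptions,
# follow the provided MPNet/encode.py and dict.txt for the real setup).
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

with open("wiki.train.raw", encoding="utf-8") as fin, \
     open("wiki.train.bpe", "w", encoding="utf-8") as fout:
    for line in fin:
        # Convert each raw line into space-separated wordpiece tokens.
        tokens = tokenizer.tokenize(line.strip())
        fout.write(" ".join(tokens) + "\n")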

1) Preprocess data

We use WikiText-103 as a demo. The running script is as follows:

wget https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-103-raw-v1.zip
unzip wikitext-103-raw-v1.zip

for SPLIT in train valid test; do \
    python MPNet/encode.py \
        --inputs wikitext-103-raw/wiki.${SPLIT}.raw \
        --outputs wikitext-103-raw/wiki.${SPLIT}.bpe \
        --keep-empty \
        --workers 60; \
done

Then, we need to binarize the data. The binarization command is as follows:

fairseq-preprocess \
    --only-source \
    --srcdict MPNet/dict.txt \
    --trainpref wikitext-103-raw/wiki.train.bpe \
    --validpref wikitext-103-raw/wiki.valid.bpe \
    --testpref wikitext-103-raw/wiki.test.bpe \
    --destdir data-bin/wikitext-103 \
    --workers 60

2) Pre-train MPNet

Use the command below to train an MPNet model:

TOTAL_UPDATES=125000    # Total number of training steps
WARMUP_UPDATES=10000    # Warmup the learning rate over this many updates
PEAK_LR=0.0005          # Peak learning rate, adjust as needed
TOKENS_PER_SAMPLE=512   # Max sequence length
MAX_POSITIONS=512       # Num. positional embeddings (usually same as above)
MAX_SENTENCES=16        # Number of sequences per batch (batch size)
UPDATE_FREQ=16          # Increase the batch size 16x

DATA_DIR=data-bin/wikitext-103

fairseq-train --fp16 $DATA_DIR \
    --task masked_permutation_lm --criterion masked_permutation_cross_entropy \
    --arch mpnet_base --sample-break-mode complete --tokens-per-sample $TOKENS_PER_SAMPLE \
    --optimizer adam --adam-betas '(0.9,0.98)' --adam-eps 1e-6 --clip-norm 0.0 \
    --lr-scheduler polynomial_decay --lr $PEAK_LR --warmup-updates $WARMUP_UPDATES --total-num-update $TOTAL_UPDATES \
    --dropout 0.1 --attention-dropout 0.1 --weight-decay 0.01 \
    --max-sentences $MAX_SENTENCES --update-freq $UPDATE_FREQ \
    --max-update $TOTAL_UPDATES --log-format simple --log-interval 1 --input-mode 'mpnet'

Note: you can replace --arch mpnet_base with mpnet_rel_base and add --mask-whole-words --bpe bert to use relative position embeddings and whole-word masking.

Note: you can set --input-mode to mlm or plm to train a masked language model or a permuted language model instead.
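
For reference, the effective batch size implied by these settings is MAX_SENTENCES x UPDATE_FREQ x (number of GPUs) sequences per optimizer step; a quick sanity check (the GPU count below is an assumption):

# Back-of-the-envelope check of the effective batch size
# (num_gpus is hypothetical; scale it to your own hardware).
max_sentences = 16      # sequences per GPU per forward pass
update_freq = 16        # gradient accumulation steps
num_gpus = 1            # assumption; adjust to your setup
tokens_per_sample = 512

effective_batch = max_sentences * update_freq * num_gpus
print(effective_batch)                      # 256 sequences per update
print(effective_batch * tokens_per_sample)  # ~131k tokens per update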

Pre-trained models

We have updated the final pre-trained MPNet model for fine-tuning.

You can load the pre-trained MPNet model like this:

import torch
from fairseq.models.masked_permutation_net import MPNet
mpnet = MPNet.from_pretrained('checkpoints', 'checkpoint_best.pt', 'path/to/data', bpe='bert')
assert isinstance(mpnet.model, torch.nn.Module)
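
Since mpnet.model is a plain torch.nn.Module, the usual PyTorch idioms apply; for example (a minimal sketch):

# Put the model in evaluation mode before inference and inspect its size.
mpnet.model.eval()
num_params = sum(p.numel() for p in mpnet.model.parameters())
print(f"MPNet parameters: {num_params / 1e6:.1f}M")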

Fine-tuning MPNet on downstream tasks

Acknowledgements

Our code is based on fairseq-0.8.0. Thanks for their contribution to the open-source community.

Reference

If you find this toolkit useful in your work, you can cite the corresponding paper listed below:

@article{song2020mpnet,
    title={MPNet: Masked and Permuted Pre-training for Language Understanding},
    author={Song, Kaitao and Tan, Xu and Qin, Tao and Lu, Jianfeng and Liu, Tie-Yan},
    journal={arXiv preprint arXiv:2004.09297},
    year={2020}
}


mpnet's People

Contributors

microsoftopensource, stillkeeptry, tan-xu


mpnet's Issues

The future is to combine MPNet with other language model innovations

For example, it could really make sense to adapt MPNet to preserve PLM but use the approach of ELECTRA for MLM.
SpanBERT has some potential too (e.g. on coreference resolution).
I believe this could really push the state of the art of accuracy on key tasks.

What do you think?
@StillKeepTry
@tan-xu

Moreover, there is important low-hanging fruit that has been consistently ignored by transformer researchers:

The activation function used should probably be https://github.com/digantamisra98/Mish,
as it is the one that gives the most accuracy gains in general. It can give 1% accuracy gains, which is huge.

Secondly, the optimizer you're using, Adam, is flawed; you should use its rectified version:
https://github.com/LiyuanLucasLiu/RAdam
Moreover it can be optionally combined with a complementary optimizer:
https://github.com/michaelrzhang/lookahead

Moreover there are newer techniques for training that yield significant accuracy gains, such as:
https://github.com/Yonghongwei/Gradient-Centralization
And gradient normalization.

There is a library that integrates all those advances and more here:
https://github.com/lessw2020/Ranger21

Accuracy gains in NLP/NLU have reached a plateau. The reason is that researchers work far too much in isolation. They bring N new innovations per year, but the number of researchers who attempt to use those innovations/optimizations together can be counted on the fingers of one hand.

XLNet has been consistently ignored by researchers; you are the ones who saw the opportunity to combine the best of both worlds of BERT and XLNet. Why stop there?
As I said, both on the transformer/language-model side and on the activation-function/optimizer side, there are a LOT of significant accuracy optimizations to integrate into the successor of MPNet.
Aggregating those optimizations could yield a revolutionary language model with 5-10% accuracy gains on average over the existing SOTA. It would mark history.
No one else will attempt to combine a wide range of those innovations; you are the only hope. If you do not do it, I'm afraid no one else will, and NLU will stagnate for the decade to come.

Inconsistencies between data collator output and masked permute in original paper

Hi all on the MPNet research team,

I am in the process of converting the fairseq training code for MPNet into a training loop that is compatible with Huggingface. Although many of the convenience classes already exist in Huggingface (like MPNetForMaskedLM), one thing that has become clear to us is that we will need to port over the collator function in MaskedDataset (under tasks/masked_permutation_lm).

In exploring how this collator works, I understand the logic as:

  1. Permute input IDs (based on whole word spans or tokens via arg) and positions
  2. Create masked/corrupted tokens based on the final n indices of the permuted sequence, where n is the prediction size (i.e. seq_len x 0.15 at default values)
  3. Concat these together using concat(seq, mask, mask) and concat(positions, predict_positions, predict_positions)

Using this logic, we might expect the collator function to perform the below operation on some dummy input IDs:

src_tokens = [10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30]

# Once the collator permutes everything and we append the mask portions, we expect something like
new_ids = [ 20,  23,  30,  14,  15,  27,  28,  11,  12,  17,  18,  26,  29,  13, 10,  19,  21,  22,  16,  24,  25, <mask>,  <corrupted>, <mask>, <mask>,  <corrupted>, <mask>]
new_positions = [10, 13, 20,  4,  5, 17, 18,  1,  2,  7,  8, 16, 19,  3,  0,  9, 11, 12, 6, 14, 15,  6, 14, 15,  6, 14, 15]
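
For concreteness, here is a rough sketch of steps 1-3 as I understand them (hypothetical and simplified; the real MaskedDataset collator also handles whole-word spans and token corruption):

import random

MASK = "<mask>"

def sketch_collate(src_tokens, pred_ratio=0.15, seed=0):
    # 1) Permute the token indices; positions travel with their tokens.
    rng = random.Random(seed)
    order = list(range(len(src_tokens)))
    rng.shuffle(order)
    perm_tokens = [src_tokens[i] for i in order]
    perm_positions = order[:]

    # 2) The last n permuted items become the prediction targets.
    n = max(1, round(len(src_tokens) * pred_ratio))
    pred_positions = perm_positions[-n:]

    # 3) concat(seq, mask, mask) and concat(positions, pred, pred);
    #    in the real collator some appended tokens are corrupted
    #    (random replacements) rather than <mask>.
    new_ids = perm_tokens + [MASK] * n + [MASK] * n
    new_positions = perm_positions + pred_positions + pred_positions
    return new_ids, new_positions

ids, positions = sketch_collate(list(range(10, 31)))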

However, after rereading the MPNet paper, especially sections 2.2 and 2.3 with attention to Figure 2, it would SEEM that the output of the collator is incongruous with what is described in these sections.

Figure 2 points out that the content and query masks are built using a permuted sequence that looks like:

src_tokens = [x_1, x_2, x_3, x_4, x_5, x_6]

# Once permuted we get:
new_ids = [x_1, x_3, x_5, <mask>, <mask>, <mask>,  x_4, x_6, x_2]
new_positions = [1, 3, 5, 4, 6, 2, 4, 6, 2]

In this example within the paper, we are masking the pred_len tokens and then appending the content to the end for the content stream. However, the collator output KEEPS the token content in the main sequence, and then adds TWO batches of mask tokens to the end, which to me seems necessarily different from what's described in the paper. Referring back to our dummy example above, I can outline the discrepancies I'm seeing:

collator_ids = [ 20,  23,  30,  14,  15,  27,  28,  11,  12,  17,  18,  26,  29,  13, 10,  19,  21,  22,  16,  24,  25, <mask>,  <corrupted>, <mask>, <mask>,  <corrupted>, <mask>]
collator_positions = [10, 13, 20,  4,  5, 17, 18,  1,  2,  7,  8, 16, 19,  3,  0,  9, 11, 12, 6, 14, 15,  6, 14, 15,  6, 14, 15]

paper_ids = [ 20,  23,  30,  14,  15,  27,  28,  11,  12,  17,  18,  26,  29,  13, 10,  19,  21,  22, <mask>,  <corrupted>, <mask>, 16, 24, 25]
paper_positions = [10, 13, 20,  4,  5, 17, 18,  1,  2,  7,  8, 16, 19,  3,  0,  9, 11, 12, 6, 14, 15,  6, 14, 15]

My question, then, is this: am I correct in understanding that the collator implementation is different than what's described in the paper? If so, why?

Setting 'max_len_single_sentence' is now deprecated. This value is automatically set up.

When running the training script for SQuAD, I was getting the error below.

Traceback (most recent call last):
  File "/media/data2/anaconda/envs/mpnet/bin/fairseq-train", line 33, in <module>
    sys.exit(load_entry_point('fairseq', 'console_scripts', 'fairseq-train')())
  File "/media/data1/bhadresh/MPNet/MPNet/pretraining/fairseq_cli/train.py", line 370, in cli_main
    main(args)
  File "/media/data1/bhadresh/MPNet/MPNet/pretraining/fairseq_cli/train.py", line 47, in main
    task = tasks.setup_task(args)
  File "/media/data1/bhadresh/MPNet/MPNet/pretraining/fairseq/tasks/__init__.py", line 17, in setup_task
    return TASK_REGISTRY[args.task].setup_task(args, **kwargs)
  File "/media/data1/bhadresh/MPNet/MPNet/pretraining/fairseq/tasks/squad2.py", line 104, in setup_task
    return cls(args, dictionary)
  File "/media/data1/bhadresh/MPNet/MPNet/pretraining/fairseq/tasks/squad2.py", line 84, in __init__
    self.tokenizer = SQuADTokenizer(args.bpe_vocab_file, dictionary)
  File "/media/data1/bhadresh/MPNet/MPNet/pretraining/fairseq/tasks/squad2.py", line 42, in __init__
    self.max_len_single_sentence = self.max_len - 2
  File "/media/data2/anaconda/envs/mpnet/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 1547, in max_len_single_sentence
    raise ValueError(
ValueError: Setting 'max_len_single_sentence' is now deprecated. This value is automatically set up.

By commenting out lines 42 and 43 in pretraining/fairseq/tasks/squad2.py:

 self.max_len_single_sentence = self.max_len - 2
 self.max_len_sentences_pair = self.max_len - 3

The error is resolved, but is it fine to do so?

When I ran the script, I got a lower F1 score and Exact Match than reported in the paper. I also created an issue for that.

input mode questions?

Is using input mode mlm like RoBERTa, and using plm like XLNet?

And by the way, would you provide a script to convert the model format to the Hugging Face Transformers one?

About the SQuAD 2.0 evaluation

May I ask whether the SQuAD 2.0 dev results in the paper use the default threshold of 0 (for answerable vs. unanswerable), or the best F1 and best EM? Thank you very much!

How to continue pretraining from the released checkpoint?

Hello,
Thank you for releasing the code for pretraining MPNet!
I am trying to continue training the language model task on a custom dataset from the released checkpoint using the --restore-file argument. However, I am not able to successfully load the checkpoint. It fails with the following error:

File "MPNet/pretraining/fairseq/checkpoint_utils.py", line 307, in _upgrade_state_dict
    registry.set_defaults(state['args'], tasks.TASK_REGISTRY[state['args'].task])
KeyError: 'mixed_position_lm'

In case it helps, here are the details of the training command:

WARMUP_UPDATES=50000    # Warmup the learning rate over this many updates
PEAK_LR=0.0005          # Peak learning rate, adjust as needed
TOKENS_PER_SAMPLE=512   # Max sequence length
MAX_POSITIONS=512       # Num. positional embeddings (usually same as above)
MAX_SENTENCES=35        # Number of sequences per batch (batch size)
UPDATE_FREQ=16          # Increase the batch size 16x

DATA_DIR=data-bin

fairseq-train --fp16 $DATA_DIR \
  --task masked_permutation_lm --criterion masked_permutation_cross_entropy \
  --arch mpnet_base --sample-break-mode none --tokens-per-sample $TOKENS_PER_SAMPLE \
  --optimizer adam --adam-betas '(0.9,0.98)' --adam-eps 1e-6 --clip-norm 0.0 \
  --lr-scheduler polynomial_decay --lr $PEAK_LR --warmup-updates $WARMUP_UPDATES \ 
  --total-num-update $TOTAL_UPDATES   --dropout 0.1 --attention-dropout 0.1 --weight-decay 0.01 \
  --max-sentences $MAX_SENTENCES --update-freq $UPDATE_FREQ --skip-invalid-size-inputs-valid-test \
  --max-update $TOTAL_UPDATES --log-format simple --log-interval 1 --input-mode 'mpnet'\
  --restore-file mpnet.base/mpnet.pt --save-interval-updates 10 --ddp-backend no_c10d

I would appreciate insights on how to resolve this error. Thank you!
