
mpnet's Introduction

MPNet

MPNet (Masked and Permuted Pre-training for Language Understanding), by Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, and Tie-Yan Liu, is a novel pre-training method for language understanding tasks. It addresses the problems of MLM (masked language modeling) in BERT and PLM (permuted language modeling) in XLNet and achieves better accuracy.

News: We have updated the pre-trained models.

Supported Features

  • A unified view and implementation of several pre-training models including BERT, XLNet, MPNet, etc.
  • Code for pre-training and fine-tuning on a variety of language understanding tasks (GLUE, SQuAD, RACE, etc.).

Installation

We implement MPNet and this pre-training toolkit on top of the fairseq codebase. The installation is as follows:

pip install --editable pretraining/
pip install pytorch_transformers==1.0.0 transformers scipy sklearn

Pre-training MPNet

Our model is pre-trained with the BERT dictionary, so you first need to pip install transformers to use the BERT tokenizer. We provide a script encode.py and a dictionary file dict.txt to tokenize your corpus. You can modify encode.py if you want to use other tokenizers (such as the RoBERTa one).
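
For reference, the encoding step is roughly equivalent to the sketch below (a simplification: the provided encode.py adds multiprocessing and command-line handling, and the exact output format, wordpiece strings vs. token IDs, should follow that script):

# Rough sketch of the per-line tokenization performed by encode.py
# (simplified; the file paths and output format here are assumptions,
# follow the provided MPNet/encode.py and dict.txt for the real setup).
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

with open("wiki.train.raw", encoding="utf-8") as fin, \
     open("wiki.train.bpe", "w", encoding="utf-8") as fout:
    for line in fin:
        # Convert each raw line into space-separated wordpiece tokens.
        tokens = tokenizer.tokenize(line.strip())
        fout.write(" ".join(tokens) + "\n")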

1) Preprocess data

We use WikiText-103 as a demo. The running script is as follows:

wget https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-103-raw-v1.zip
unzip wikitext-103-raw-v1.zip

for SPLIT in train valid test; do \
    python MPNet/encode.py \
        --inputs wikitext-103-raw/wiki.${SPLIT}.raw \
        --outputs wikitext-103-raw/wiki.${SPLIT}.bpe \
        --keep-empty \
        --workers 60; \
done

Then, we need to binarize the data. The binarization command is as follows:

fairseq-preprocess \
    --only-source \
    --srcdict MPNet/dict.txt \
    --trainpref wikitext-103-raw/wiki.train.bpe \
    --validpref wikitext-103-raw/wiki.valid.bpe \
    --testpref wikitext-103-raw/wiki.test.bpe \
    --destdir data-bin/wikitext-103 \
    --workers 60

2) Pre-train MPNet

Use the command below to train an MPNet model:

TOTAL_UPDATES=125000    # Total number of training steps
WARMUP_UPDATES=10000    # Warmup the learning rate over this many updates
PEAK_LR=0.0005          # Peak learning rate, adjust as needed
TOKENS_PER_SAMPLE=512   # Max sequence length
MAX_POSITIONS=512       # Num. positional embeddings (usually same as above)
MAX_SENTENCES=16        # Number of sequences per batch (batch size)
UPDATE_FREQ=16          # Increase the batch size 16x

DATA_DIR=data-bin/wikitext-103

fairseq-train --fp16 $DATA_DIR \
    --task masked_permutation_lm --criterion masked_permutation_cross_entropy \
    --arch mpnet_base --sample-break-mode complete --tokens-per-sample $TOKENS_PER_SAMPLE \
    --optimizer adam --adam-betas '(0.9,0.98)' --adam-eps 1e-6 --clip-norm 0.0 \
    --lr-scheduler polynomial_decay --lr $PEAK_LR --warmup-updates $WARMUP_UPDATES --total-num-update $TOTAL_UPDATES \
    --dropout 0.1 --attention-dropout 0.1 --weight-decay 0.01 \
    --max-sentences $MAX_SENTENCES --update-freq $UPDATE_FREQ \
    --max-update $TOTAL_UPDATES --log-format simple --log-interval 1 --input-mode 'mpnet'

Note: you can replace --arch mpnet_base with mpnet_rel_base and add --mask-whole-words --bpe bert to use relative position embeddings and whole-word masking.

Note: you can set --input-mode to mlm or plm to train a masked language model or a permuted language model instead.
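
For reference, the effective batch size implied by these settings is MAX_SENTENCES x UPDATE_FREQ x (number of GPUs) sequences per optimizer step; a quick sanity check (the GPU count below is an assumption):

# Back-of-the-envelope check of the effective batch size
# (num_gpus is hypothetical; scale it to your own hardware).
max_sentences = 16      # sequences per GPU per forward pass
update_freq = 16        # gradient accumulation steps
num_gpus = 1            # assumption; adjust to your setup
tokens_per_sample = 512

effective_batch = max_sentences * update_freq * num_gpus
print(effective_batch)                      # 256 sequences per update
print(effective_batch * tokens_per_sample)  # ~131k tokens per update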

Pre-trained models

We have updated the final pre-trained MPNet model for fine-tuning.

You can load the pre-trained MPNet model like this:

import torch
from fairseq.models.masked_permutation_net import MPNet
mpnet = MPNet.from_pretrained('checkpoints', 'checkpoint_best.pt', 'path/to/data', bpe='bert')
assert isinstance(mpnet.model, torch.nn.Module)
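
Since mpnet.model is a plain torch.nn.Module, the usual PyTorch idioms apply; for example (a minimal sketch):

# Put the model in evaluation mode before inference and inspect its size.
mpnet.model.eval()
num_params = sum(p.numel() for p in mpnet.model.parameters())
print(f"MPNet parameters: {num_params / 1e6:.1f}M")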

Fine-tuning MPNet on downstream tasks

Acknowledgements

Our code is based on fairseq-0.8.0. Thanks for their contribution to the open-source community.

Reference

If you find this toolkit useful in your work, you can cite the corresponding paper listed below:

@article{song2020mpnet,
    title={MPNet: Masked and Permuted Pre-training for Language Understanding},
    author={Song, Kaitao and Tan, Xu and Qin, Tao and Lu, Jianfeng and Liu, Tie-Yan},
    journal={arXiv preprint arXiv:2004.09297},
    year={2020}
}


mpnet's People

Contributors

microsoftopensource, stillkeeptry, tan-xu


mpnet's Issues

The future is to combine MPNet with other language model innovations

For example, it could really make sense to adapt MPNet to preserve PLM but use the approach of ELECTRA for MLM.
SpanBERT has some potential too (e.g. on coreference resolution).
I believe this could really push the state of the art of accuracy on key tasks.

What do you think?
@StillKeepTry
@tan-xu

Moreover, there is important low-hanging fruit that has been consistently ignored by transformer researchers:

The activation function used should probably be https://github.com/digantamisra98/Mish,
as it is the one that gives the most accuracy gains in general. It can give 1% accuracy gains, which is huge.

Secondly, the optimizer you're using, Adam, is flawed; you should use its rectified version:
https://github.com/LiyuanLucasLiu/RAdam
Moreover it can be optionally combined with a complementary optimizer:
https://github.com/michaelrzhang/lookahead

Moreover there are newer techniques for training that yield significant accuracy gains, such as:
https://github.com/Yonghongwei/Gradient-Centralization
And gradient normalization.

There is a library that integrates all those advances and more here:
https://github.com/lessw2020/Ranger21

Accuracy gains in NLP/NLU have reached a plateau. The reason is that researchers work far too much in isolation. They bring N new innovations per year, but the number of researchers who attempt to use those innovations/optimizations together can be counted on the fingers of one hand.

XLNet has been consistently ignored by researchers; you are the ones who saw the opportunity to combine the best of both worlds of BERT and XLNet. Why stop there?
As I said, both on the transformer/language-model side and on the activation-function/optimizer side, there are a LOT of significant accuracy optimizations to integrate into the successor of MPNet.
Aggregating those optimizations could yield a revolutionary language model with 5-10% accuracy gains on average over the existing SOTA. It would mark history.
No one else will attempt to combine a wide range of those innovations; you are the only hope. If you do not do it, I'm afraid no one else will, and NLU will stagnate for the decade to come.

Inconsistencies between data collator output and masked permute in original paper

Hi all on the MPNet research team,

I am in the process of converting the fairseq training code for MPNet into a training loop that is compatible with Huggingface. Although many of the convenience classes already exist in Huggingface (like MPNetForMaskedLM), one thing that has become clear to us is that we will need to port over the collator function in MaskedDataset (under tasks/masked_permutation_lm).

In exploring how this collator works, I understand the logic as:

  1. Permute input IDs (based on whole word spans or tokens via arg) and positions
  2. Create masked/corrupted tokens based on the final n indices of the permuted sequence, where n is the prediction size (i.e. seq_len x 0.15 at default values)
  3. Concat these together using concat(seq, mask, mask) and concat(positions, predict_positions, predict_positions)

Using this logic, we might expect the collator function to perform the below operation on some dummy input IDs:

src_tokens = [10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30]

# Once the collator permutes everything and we append the mask portions, we expect something like
new_ids = [ 20,  23,  30,  14,  15,  27,  28,  11,  12,  17,  18,  26,  29,  13, 10,  19,  21,  22,  16,  24,  25, <mask>,  <corrupted>, <mask>, <mask>,  <corrupted>, <mask>]
new_positions = [10, 13, 20,  4,  5, 17, 18,  1,  2,  7,  8, 16, 19,  3,  0,  9, 11, 12, 6, 14, 15,  6, 14, 15,  6, 14, 15]
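
For concreteness, here is a rough sketch of steps 1-3 as I understand them (hypothetical and simplified; the real MaskedDataset collator also handles whole-word spans and token corruption):

import random

MASK = "<mask>"

def sketch_collate(src_tokens, pred_ratio=0.15, seed=0):
    # 1) Permute the token indices; positions travel with their tokens.
    rng = random.Random(seed)
    order = list(range(len(src_tokens)))
    rng.shuffle(order)
    perm_tokens = [src_tokens[i] for i in order]
    perm_positions = order[:]

    # 2) The last n permuted items become the prediction targets.
    n = max(1, round(len(src_tokens) * pred_ratio))
    pred_positions = perm_positions[-n:]

    # 3) concat(seq, mask, mask) and concat(positions, pred, pred);
    #    in the real collator some appended tokens are corrupted
    #    (random replacements) rather than <mask>.
    new_ids = perm_tokens + [MASK] * n + [MASK] * n
    new_positions = perm_positions + pred_positions + pred_positions
    return new_ids, new_positions

ids, positions = sketch_collate(list(range(10, 31)))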

However, after rereading the MPNet paper, especially sections 2.2 and 2.3 with attention to Figure 2, it would SEEM that the output of the collator is incongruous with what is described in these sections.

Figure 2 points out that the content and query masks are built using a permuted sequence that looks like:

src_tokens = [x_1, x_2, x_3, x_4, x_5, x_6]

# Once permuted we get:
new_ids = [x_1, x_3, x_5, <mask>, <mask>, <mask>,  x_4, x_6, x_2]
new_positions = [1, 3, 5, 4, 6, 2, 4, 6, 2]

In this example within the paper, we are masking the pred_len tokens and then appending the content to the end for the content stream. However, the collator output KEEPS the token content in the main sequence, and then adds TWO batches of mask tokens to the end, which to me seems necessarily different from what's described in the paper. Referring back to our dummy example above, I can outline the discrepancies I'm seeing:

collator_ids = [ 20,  23,  30,  14,  15,  27,  28,  11,  12,  17,  18,  26,  29,  13, 10,  19,  21,  22,  16,  24,  25, <mask>,  <corrupted>, <mask>, <mask>,  <corrupted>, <mask>]
collator_positions = [10, 13, 20,  4,  5, 17, 18,  1,  2,  7,  8, 16, 19,  3,  0,  9, 11, 12, 6, 14, 15,  6, 14, 15,  6, 14, 15]

paper_ids = [ 20,  23,  30,  14,  15,  27,  28,  11,  12,  17,  18,  26,  29,  13, 10,  19,  21,  22, <mask>,  <corrupted>, <mask>, 16, 24, 25]
paper_positions = [10, 13, 20,  4,  5, 17, 18,  1,  2,  7,  8, 16, 19,  3,  0,  9, 11, 12, 6, 14, 15,  6, 14, 15]

My question, then, is this: am I correct in understanding that the collator implementation is different than what's described in the paper? If so, why?

Setting 'max_len_single_sentence' is now deprecated. This value is automatically set up.

When running the training script for SQuAD, I was getting the error below.

Traceback (most recent call last):
  File "/media/data2/anaconda/envs/mpnet/bin/fairseq-train", line 33, in <module>
    sys.exit(load_entry_point('fairseq', 'console_scripts', 'fairseq-train')())
  File "/media/data1/bhadresh/MPNet/MPNet/pretraining/fairseq_cli/train.py", line 370, in cli_main
    main(args)
  File "/media/data1/bhadresh/MPNet/MPNet/pretraining/fairseq_cli/train.py", line 47, in main
    task = tasks.setup_task(args)
  File "/media/data1/bhadresh/MPNet/MPNet/pretraining/fairseq/tasks/__init__.py", line 17, in setup_task
    return TASK_REGISTRY[args.task].setup_task(args, **kwargs)
  File "/media/data1/bhadresh/MPNet/MPNet/pretraining/fairseq/tasks/squad2.py", line 104, in setup_task
    return cls(args, dictionary)
  File "/media/data1/bhadresh/MPNet/MPNet/pretraining/fairseq/tasks/squad2.py", line 84, in __init__
    self.tokenizer = SQuADTokenizer(args.bpe_vocab_file, dictionary)
  File "/media/data1/bhadresh/MPNet/MPNet/pretraining/fairseq/tasks/squad2.py", line 42, in __init__
    self.max_len_single_sentence = self.max_len - 2
  File "/media/data2/anaconda/envs/mpnet/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 1547, in max_len_single_sentence
    raise ValueError(
ValueError: Setting 'max_len_single_sentence' is now deprecated. This value is automatically set up.

By commenting out lines 42 and 43 in pretraining/fairseq/tasks/squad2.py:

 self.max_len_single_sentence = self.max_len - 2
 self.max_len_sentences_pair = self.max_len - 3

The error is resolved, but is it fine to do so?

When I ran the script, I got a lower F1 score and Exact Match than reported in the paper. I also created an issue for that.

input mode questions?

Is using input mode mlm like RoBERTa, and using plm like XLNet?

And by the way, would you provide a script to convert the model format to the Hugging Face Transformers one?

About the SQuAD 2.0 evaluation

May I ask whether the SQuAD 2.0 dev results in the paper use the default threshold of 0 (for answerable vs. unanswerable), or the best F1 and best EM? Thank you very much!

How to continue pretraining from the released checkpoint?

Hello,
Thank you for releasing the code for pretraining MPNet!
I am trying to continue training the language model task on a custom dataset from the released checkpoint using the --restore-file argument. However, I am not able to successfully load the checkpoint. It fails with the following error:

File "MPNet/pretraining/fairseq/checkpoint_utils.py", line 307, in _upgrade_state_dict
    registry.set_defaults(state['args'], tasks.TASK_REGISTRY[state['args'].task])
KeyError: 'mixed_position_lm'

In case it helps, here are the details of the training command:

WARMUP_UPDATES=50000    # Warmup the learning rate over this many updates
PEAK_LR=0.0005          # Peak learning rate, adjust as needed
TOKENS_PER_SAMPLE=512   # Max sequence length
MAX_POSITIONS=512       # Num. positional embeddings (usually same as above)
MAX_SENTENCES=35        # Number of sequences per batch (batch size)
UPDATE_FREQ=16          # Increase the batch size 16x

DATA_DIR=data-bin

fairseq-train --fp16 $DATA_DIR \
  --task masked_permutation_lm --criterion masked_permutation_cross_entropy \
  --arch mpnet_base --sample-break-mode none --tokens-per-sample $TOKENS_PER_SAMPLE \
  --optimizer adam --adam-betas '(0.9,0.98)' --adam-eps 1e-6 --clip-norm 0.0 \
  --lr-scheduler polynomial_decay --lr $PEAK_LR --warmup-updates $WARMUP_UPDATES \ 
  --total-num-update $TOTAL_UPDATES   --dropout 0.1 --attention-dropout 0.1 --weight-decay 0.01 \
  --max-sentences $MAX_SENTENCES --update-freq $UPDATE_FREQ --skip-invalid-size-inputs-valid-test \
  --max-update $TOTAL_UPDATES --log-format simple --log-interval 1 --input-mode 'mpnet'\
  --restore-file mpnet.base/mpnet.pt --save-interval-updates 10 --ddp-backend no_c10d

I would appreciate insights on how to resolve this error. Thank you!
