mlm-scoring's Introduction

Masked Language Model Scoring

This package uses masked LMs like BERT, RoBERTa, and XLM to score sentences and rescore n-best lists via pseudo-log-likelihood (PLL) scores, computed by masking each token in turn. We also support autoregressive LMs like GPT-2. Example uses include rescoring ASR and NMT hypotheses and judging linguistic acceptability.

Paper: Julian Salazar, Davis Liang, Toan Q. Nguyen, Katrin Kirchhoff. "Masked Language Model Scoring", ACL 2020.
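
For intuition, here is a minimal sketch of how a PLL is computed, using plain 🤗 Transformers (an illustration under our own assumptions, not this package's implementation: it assumes a standalone transformers 4.x / torch install, the helper name pll is ours, and the package batches the masked copies for speed):

import torch
from transformers import BertForMaskedLM, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
model = BertForMaskedLM.from_pretrained("bert-base-cased")
model.eval()

def pll(sentence: str) -> float:
    """Sum of each token's log-probability when masked out one at a time."""
    ids = tokenizer(sentence, return_tensors="pt")["input_ids"][0]
    total = 0.0
    for i in range(1, len(ids) - 1):  # skip [CLS] and [SEP]
        masked = ids.clone()
        masked[i] = tokenizer.mask_token_id
        with torch.no_grad():
            logits = model(masked.unsqueeze(0)).logits[0, i]
        total += torch.log_softmax(logits, dim=-1)[ids[i]].item()
    return total

print(pll("Hello world!"))  # comparable to the scorer outputs shown below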

Installation

Python 3.6+ is required. Clone this repository and install:

pip install -e .
pip install torch mxnet-cu102mkl  # Replace w/ your CUDA version; mxnet-mkl if CPU only.
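
As a quick sanity check (our suggestion, not an official step), you can confirm that both backends import cleanly:

python -c "import mxnet, torch; print(mxnet.__version__, torch.__version__)"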

Some models are provided via GluonNLP and others via 🤗 Transformers, so for now we require both MXNet and PyTorch. You can then import the library directly:

from mlm.scorers import MLMScorer, MLMScorerPT, LMScorer
from mlm.models import get_pretrained
import mxnet as mx
ctxs = [mx.cpu()] # or, e.g., [mx.gpu(0), mx.gpu(1)]

# MXNet MLMs (use names from mlm.models.SUPPORTED_MLMS)
model, vocab, tokenizer = get_pretrained(ctxs, 'bert-base-en-cased')
scorer = MLMScorer(model, vocab, tokenizer, ctxs)
print(scorer.score_sentences(["Hello world!"]))
# >> [-12.410664200782776]
print(scorer.score_sentences(["Hello world!"], per_token=True))
# >> [[None, -6.126736640930176, -5.501412391662598, -0.7825151681900024, None]]

# EXPERIMENTAL: PyTorch MLMs (use names from https://huggingface.co/transformers/pretrained_models.html)
model, vocab, tokenizer = get_pretrained(ctxs, 'bert-base-cased')
scorer = MLMScorerPT(model, vocab, tokenizer, ctxs)
print(scorer.score_sentences(["Hello world!"]))
# >> [-12.411025047302246]
print(scorer.score_sentences(["Hello world!"], per_token=True))
# >> [[None, -6.126738548278809, -5.501765727996826, -0.782496988773346, None]]

# MXNet LMs (use names from mlm.models.SUPPORTED_LMS)
model, vocab, tokenizer = get_pretrained(ctxs, 'gpt2-117m-en-cased')
scorer = LMScorer(model, vocab, tokenizer, ctxs)
print(scorer.score_sentences(["Hello world!"]))
# >> [-15.995375633239746]
print(scorer.score_sentences(["Hello world!"], per_token=True))
# >> [[-8.293947219848633, -6.387561798095703, -1.3138668537139893]]

(MXNet and PyTorch interfaces will be unified soon!)

Scoring

Run mlm score --help to see supported models and options. See examples/demo/format.json for the file format. In inputs, the "score" field is optional; outputs will add "score" fields containing PLL scores.
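
Purely as a hypothetical illustration (the field names below are our guesses; treat examples/demo/format.json as authoritative), an n-best input might look like:

{
    "utt-001": {
        "hyp-1": {"score": -1.23, "text": "hello world"},
        "hyp-2": {"text": "hello word"}
    }
}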

There are three score types, depending on the model:

  • Pseudo-log-likelihood score (PLL): BERT, RoBERTa, multilingual BERT, XLM, ALBERT, DistilBERT
  • Maskless PLL score: same models as above (add --no-mask)
  • Log-probability score: GPT-2 (see the sketch after this list)
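
By contrast with PLLs, an autoregressive LM scores each token given only its predecessors, so no masked copies are needed. A minimal sketch with plain 🤗 Transformers (our illustration; the package's LMScorer handles tokenization and start-of-text details itself, so its scores will differ):

import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def log_prob(sentence: str) -> float:
    """Log-probability of tokens 2..T, each conditioned on its prefix."""
    ids = tokenizer(sentence, return_tensors="pt")["input_ids"]
    with torch.no_grad():
        logits = model(ids).logits
    # Position t predicts token t+1: shift logits left, targets right.
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    targets = ids[0, 1:]
    return log_probs.gather(1, targets.unsqueeze(1)).sum().item()

print(log_prob("Hello world!"))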

We score hypotheses for 3 utterances of LibriSpeech dev-other on GPU 0 using BERT base (uncased):

mlm score \
    --mode hyp \
    --model bert-base-en-uncased \
    --max-utts 3 \
    --gpus 0 \
    examples/asr-librispeech-espnet/data/dev-other.am.json \
    > examples/demo/dev-other-3.lm.json

Rescoring

One can rescore n-best lists via log-linear interpolation. Run mlm rescore --help to see all options. The first input is a file with original scores; the second contains scores from mlm score.
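
The combined score is a weighted sum in the log domain. A minimal sketch of the idea (our own helper, assuming the common form AM + lambda * LM, where lambda is the --weight):

def interpolate(am_score: float, lm_score: float, weight: float) -> float:
    # weight (lambda) = 0 keeps the original ranking;
    # larger values trust the language model more.
    return am_score + weight * lm_score

# Keep the hypothesis with the best combined score:
hyps = [(-10.2, -12.4, "hello world"), (-9.8, -15.9, "hello word")]
best = max(hyps, key=lambda h: interpolate(h[0], h[1], weight=0.5))
print(best[2])  # "hello world"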

We rescore acoustic scores (from dev-other.am.json) using BERT's scores (from previous section), under different LM weights:

for weight in 0 0.5 ; do
    echo "lambda=${weight}"; \
    mlm rescore \
        --model bert-base-en-uncased \
        --weight ${weight} \
        examples/asr-librispeech-espnet/data/dev-other.am.json \
        examples/demo/dev-other-3.lm.json \
        > examples/demo/dev-other-3.lambda-${weight}.json
done

The original WER is 12.2% while the rescored WER is 8.5%.

Maskless finetuning

One can finetune masked LMs to give usable PLL scores without masking. See LibriSpeech maskless finetuning.

Development

Run pip install -e .[dev] to install extra testing packages. Then:

  • To run unit tests and coverage, run pytest --cov=src/mlm in the root directory.

mlm-scoring's Issues

Using mlm-scoring with other public PyTorch RoBERTa model

Hello,
This is probably a silly question, but I'm having a hard time adapting mlm-scoring to use other public PyTorch RoBERTa models that are not on the list of supported models. Do you have any tutorials/materials on how to use self-trained or other public RoBERTa models with mlm-scoring? Any help would be much appreciated, and I apologize in advance in case this information is in the repository and I missed it.

Kind regards,
Danielly

IndexError: too many indices for tensor of dimension 1

Hi there,

I'm using the PyTorch implementation with bert-base-uncased and I get the following error when the sentence contains only one token:

Traceback (most recent call last):
  File "bert.py", line 28, in <module>
    print(scorer.score_sentences(["Hello"]))
  File ".../mlm-scoring/src/mlm/scorers.py", line 167, in score_sentences
    return self.score(corpus, **kwargs)[0]
  File ".../mlm-scoring/src/mlm/scorers.py", line 757, in score
    out = out[list(range(split_size)), token_masked_ids]
IndexError: too many indices for tensor of dimension 1

It works fine with MXNet MLMs, but I need to use a community model from HuggingFace.

Thanks!

PyTorch models

Hi,
It seems that support for PyTorch models is currently limited to bert and xlm. Would it be possible to add support for lighter models, e.g. DistilBERT or ALBERT?
Do you think that using these models would hurt the performance of the scorers significantly?

Thanks!

ValueError: Model 'BertForMaskedLMOptimized' is not supported by the scorer 'RegressionFinetuner'.

Hi there,
I'm using community model 'bert-base-chinese' from HuggingFace to finetune masked LMs and I get the following error:
ValueError:
Model 'BertForMaskedLMOptimized' is not supported by the scorer 'RegressionFinetuner'.

  • MLMScorer supports MXNet GluonNLP MLMs: ['bert-base-en-uncased', 'bert-base-en-cased', 'roberta-base-en-cased', 'bert-large-en-uncased', 'bert-large-en-cased', 'roberta-large-en-cased', 'bert-base-en-uncased-owt', 'bert-base-multi-uncased', 'bert-base-multi-cased']
  • LMScorer supports MXNet GluonNLP LMs: ['gpt2-117m-en-cased', 'gpt2-345m-en-cased']
  • MLMScorerPT supports PyTorch Transformers MLMs:
    • 'albert-*' (wrapped by AlbertForMaskedLMOptimized)
    • 'bert-*' (wrapped by BertForMaskedLMOptimized)
    • 'distilbert-*' (wrapped by DistilBertForMaskedLMOptimized)
    • 'xlm-*' (some variants require 'lang' parameter; XLM-R not supported)

What can I do to solve this issue?
Thanks!

Understanding runtimes of different models

Hi,

I need to score a rather large number of sentences for a downstream task. I'm experimenting with models supported by Hugging Face, with no fine-tuning, e.g.:

    mlms_model, vocab, tokenizer = get_pretrained(ctxs, 'albert-base-v2')
    scorer = MLMScorerPT(mlms_model, vocab, tokenizer, ctxs)
    sentences = ... # 1847 sentences
    corpus = Corpus.from_text(sentences)
    scores = scorer.score(corpus, 1.0, 500)  # batch size lowered to avoid GPU out-of-memory errors

Depending on the model and scorer I get wildly different runtimes. On my computer, encoding 1847 sentences:

  • MXNet MLMs like 'bert-base-en-cased' and 'roberta-base-en-cased' with MLMScorer take 3-4 minutes
  • MXNet LMs like 'gpt2-117m-en-cased' with LMScorer take about 8-10 secs (for some reason I need to lower the batch size to around 50)
  • 'albert-base-v1' and 'albert-base-v2' with MLMScorerPT take 4-5 minutes.
  • 'distilbert-base-cased' and 'distilbert-base-uncased' take 1-2 minutes.

I expected, perhaps naively, that ALBERT and DistilBERT would be much faster due to reduced dimensionality and number of layers.

xlm-roberta example?

Thank you for the amazing work. I am trying to use XLM models for scoring, but I get the error below when using xlm-roberta-base/large.

(base) bill@ink-molly:~/MickeyProbes$ python probe_generation/sent_scoring.py
/home/bill/anaconda3/lib/python3.7/site-packages/mxnet/optimizer/optimizer.py:167: UserWarning: WARNING: New optimizer gluonnlp.optimizer.lamb.LAMB is overriding existing optimizer mxnet.optimizer.optimizer.LAMB
  Optimizer.opt_registry[name].__name__))
WARNING:root:Model 'xlm-roberta-large' not recognized as an MXNet model; treating as PyTorch model
Downloading: 100%|████████████████████████████████| 513/513 [00:00<00:00, 257kB/s]
Can't set hidden_size with value 1024 for XLMConfig {
  "architectures": [
    "XLMRobertaForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "eos_token_id": 2,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "model_type": "xlm",
  "pad_token_id": 1
}

Traceback (most recent call last):
  File "probe_generation/sent_scoring.py", line 9, in <module>
    model, vocab, tokenizer = get_pretrained(ctxs, 'xlm-roberta-large')
  File "/home/bill/MickeyProbes/mlm-scoring/src/mlm/models/__init__.py", line 126, in get_pretrained
    model, loading_info = transformers.XLMWithLMHeadModel.from_pretrained(model_fullname, output_loading_info=True)
  File "/home/bill/anaconda3/lib/python3.7/site-packages/transformers/modeling_utils.py", line 854, in from_pretrained
    **kwargs,
  File "/home/bill/anaconda3/lib/python3.7/site-packages/transformers/configuration_utils.py", line 316, in from_pretrained
    return cls.from_dict(config_dict, **kwargs)
  File "/home/bill/anaconda3/lib/python3.7/site-packages/transformers/configuration_utils.py", line 403, in from_dict
    config = cls(**config_dict)
  File "/home/bill/anaconda3/lib/python3.7/site-packages/transformers/configuration_xlm.py", line 195, in __init__
    super().__init__(pad_token_id=pad_token_id, bos_token_id=bos_token_id, **kwargs)
  File "/home/bill/anaconda3/lib/python3.7/site-packages/transformers/configuration_utils.py", line 215, in __init__
    raise err
  File "/home/bill/anaconda3/lib/python3.7/site-packages/transformers/configuration_utils.py", line 212, in __init__
    setattr(self, key, value)
AttributeError: can't set attribute

I am not sure if it is a version issue. Would you please provide an example for running xlm in the code? Thanks!

Increasing batch size when using command-line mlm score

I am trying to use this package's command-line interface in a similar fashion to the README's example:

mlm score \
    --mode hyp \
    --model bert-base-en-uncased \
    --gpus 0 \
    examples/asr-librispeech-espnet/data/dev-other.am.json \
    > examples/demo/dev-other-3.lm.json

However, I see that it uses only around 601 MB of GPU memory, which is much less than what the GPU is able to support (12 GB). Is there any way to increase the batch size when using mlm score? It seems that the --split-size argument would do something like this; is that right?

Hardcoded GPU 0?

Hi there,

I'm facing an issue with your PyTorch implementation and some input sentences. E.g.

s = 'RT @HISPANlCPROBS : When u walk straight into the kitchen to eat & ur mom hits u with the " ya saludaste " #ThanksgivingWithHispanics https://…'
print(scorer.score_sentences([s]))

gives the following error:

RuntimeError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 11.91 GiB total capacity; 451.65 MiB already allocated; 12.12 MiB free; 40.35 MiB cached)

I'm working on a server with three GPUs and tried setting ctxs = [mx.gpu(0)], ctxs = [mx.gpu(1)], ctxs = [mx.gpu(2)] and ctxs = [mx.cpu()] but I always get the same error about GPU 0. I'm wondering if this is hardcoded somewhere in your code? Changing the ctxs variable seems to have no effect.

Thanks.

Can't load tokenizer for 'xlm-roberta-large'

Dear authors,

I have tried to change the pre-trained model to 'xlm-roberta-large', but I got this OSError message:

Can't load tokenizer for 'xlm-roberta-large'. If you were trying to load it from 'https://huggingface.co/models', make sure you don't have a local directory with the same name. Otherwise, make sure 'xlm-roberta-large' is the correct path to a directory containing all relevant files for a XLMTokenizer tokenizer.

Could you guide me on how to solve this problem?

how to integrate models not available via huggingface or gluon?

Hi there,

As I understand your library, it works with models which are available from Hugging Face or Gluon.

Question: for a model that is not available in the model zoos of those two frameworks, e.g. a model I trained myself, how can I get this to work with a config.json, a pytorch_model.bin, and a vocab.txt file?

Best,

Phillip

Help for DistilRoBERTa

Hi,
I am trying to use the DistilRoBERTa model with this code. I can see there is a class for DistilBERT which is loaded from the transformers library here. Is there a way I could also use it for DistilRoBERTa, as it is not in the transformers model source but only has a model card?

GPT Models Scoring error

I tried scoring sentences with the models mentioned here. Every model works fine except for gpt2-117m-en-cased and gpt2-345m-en-cased. The following error pops up:

Traceback (most recent call last):
  File "sample.py", line 16, in <module>
    print(scorer.score_sentences(["Hello world!"]))
  File "/home/pandramish.vinay/mlm-scoring/src/mlm/scorers.py", line 148, in score_sentences
    return self.score(corpus, **kwargs)[0]
  File "/home/pandramish.vinay/mlm-scoring/src/mlm/scorers.py", line 396, in score
    dataset = self.corpus_to_dataset(corpus)
  File "/home/pandramish.vinay/mlm-scoring/src/mlm/scorers.py", line 364, in corpus_to_dataset
    ids_masked = self._ids_to_masked(ids_original)
  File "/home/pandramish.vinay/mlm-scoring/src/mlm/scorers.py", line 329, in _ids_to_masked
    mask_token_id = self._vocab.token_to_idx[self._vocab.mask_token]
AttributeError: 'Vocab' object has no attribute 'mask_token'

Any fixes ?

How to integrate this with MuRIL?

I wanted to use this library to compute scores for MuRIL, which is based on BERT's MLM. It's not on Hugging Face yet. How can I bridge the gap?

Where is vocab file?

I don't want to download the vocab file because I need to run offline, so I would like to pass a parameter to get_pretrained.
Having read the code, I don't think this is currently possible.
Would you fix it?

Applying domain MLM finetuning for rescoring

Hi, I am a little confused about rescoring for ASR and NMT.
Is the model further pretrained on a domain corpus before rescoring (i.e., MLM applied to domain data), or do you just use an open-source pretrained model (RoBERTa or BERT trained on the Wikipedia/BookCorpus data)?

"NotImplementedError" When trying to fine-tune any bert model

When trying to follow the steps stated in the Maskless finetuning section (I even tried to use the exact model stated in the steps),
I always receive:

     60     @staticmethod
     61     def _check_support(model) -> bool:
---> 62         raise NotImplementedError

Is the regression finetuner implemented for BERT models?

ERROR: No matching distribution found for mxnet-mkl

I cloned the repo locally, then ran pip install -e . and pip install mxnet-mkl, but I get the error:

ERROR: Could not find a version that satisfies the requirement mxnet-mkl (from versions: none)
ERROR: No matching distribution found for mxnet-mkl

How can I fix it?

Update to transformers 4.x

I'd quite like to use this library to score the output from my RoBERTa model, but my model is implemented with Hugging Face transformers version 4.x while this library requires 3.3.1 (and that also ended up installing tokenizers-0.8.1rc2 for some reason).

It would be nice if it could be upgraded to the latest version.
