
robbert's Issues

Missing Token `Ċ` in vocabulary for NER Model

Hi @iPieter,

I was trying to use your robbert-v2-dutch-ner model in my code for fine-tuning. I would like to use the tokenizer as a fast tokenizer, so that I can use the word ids to know from which words the tokens originate.

Unfortunately, I'm not able to create a RobertaTokenizerFast from the pretrained model because of the following error: `Error while initializing BPE: Token Ċ out of vocabulary`

When trying to find a solution, I saw the following issue (huggingface/transformers#9290) which mentions almost the same problem (I think) for the robbert-v2-dutch-base model.

Could the same fix that was applied to the base model also be applied to the NER model?
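
For reference, this is roughly what I am trying to run (a minimal sketch; the hub id below is my assumption of the model's name on the hub):

# Sketch of the failing call and of how I want to use the word ids (hub id assumed).
from transformers import RobertaTokenizerFast

# This line currently fails with "Error while initializing BPE: Token Ċ out of vocabulary".
tokenizer = RobertaTokenizerFast.from_pretrained("pdelobelle/robbert-v2-dutch-ner")

# With a working fast tokenizer, each token can be mapped back to its originating word:
encoding = tokenizer("Jan werkt in Leuven.")
print(encoding.word_ids())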

Questions on semantic similarity

I'm working on a Dutch bible project and am therefore interested in semantic similarity.
Are there any plans to make a sentence similarity model?
The only models I found that support semantic similarity in Dutch are multilingual models:

  • sentence-transformers/xlm-r-100langs-bert-base-nli-stsb-mean-tokens
  • sentence-transformers/distilbert-multilingual-nli-stsb-quora-ranking

My plan for now is:

  • Find some model that supports Dutch
  • Train it on sentence similarity (how? And where can I get a decent dataset?)
  • There are some parallel bible translations that could be used as a start, but they have no similarity scores
  • Evaluate the results

I'm also looking for datasets to train such a model on.
The Bertje model does not come with a sentence similarity model either.
Any suggestions that could help me?
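
As a baseline, something along these lines with one of the multilingual models listed above might work (a rough sketch, not a tuned setup):

# Rough baseline sketch using one of the multilingual sentence-transformers models listed above.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/xlm-r-100langs-bert-base-nli-stsb-mean-tokens")
zinnen = ["In den beginne schiep God de hemel en de aarde.",
          "In het begin maakte God de hemel en de aarde."]
embeddings = model.encode(zinnen, convert_to_tensor=True)
print(util.pytorch_cos_sim(embeddings[0], embeddings[1]))  # cosine similarity between the two verses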

Download link Fairseq v2.0 model is not provided

Lately, I have been experimenting with the masked LM head of the RobBERT model from fairseq (since the Hugging Face transformers version wasn't available at the time), and I have been getting some unexpected results, such as:

from fairseq.models.roberta import RobertaModel

model = RobertaModel.from_pretrained('../../models/robbert/', checkpoint_file='RobBERT-base.pt')
model.eval()

text = '<mask> is de hoofdstad van België.'
print(model.fill_mask(text, 4))

Resulting in: [(0.23022648692131042, 'Canada'), (0.11474426090717316, 'France'), (0.08297888934612274, 'Paris'), (0.07531193643808365, 'Dat')]

Another strange example:

text = 'Ik heb zin in <mask> met frietjes.'
print(model.fill_mask(text, topk=4))

Resulting in [( 0.15896572172641754, ' brood'), ( 0.11806301772594452, ' chips'), (0.08460015058517456, ' pasta'), ( 0.071708545088768, ' spaghetti')]

However, when I tried out these examples using the Hugging Face transformers implementation, I got different (and better) results:

[{'sequence': '<s>Belgiëis de hoofdstad van België.</s>',
  'score': 0.2881818115711212},
 {'sequence': '<s>Brusselis de hoofdstad van België.</s>',
  'score': 0.1142464280128479},
 {'sequence': '<s>Vlaanderenis de hoofdstad van België.</s>',
  'score': 0.09562666714191437},
 {'sequence': '<s>Antwerpenis de hoofdstad van België.</s>',
  'score': 0.06401436030864716},
 {'sequence': '<s>Bis de hoofdstad van België.</s>',
  'score': 0.040388405323028564}]

and

[{'sequence': '<s>Ik heb zin in/met frietjes.</s>',
  'score': 0.26582473516464233},
 {'sequence': '<s>Ik heb zin in...met frietjes.</s>',
  'score': 0.1382495015859604},
 {'sequence': '<s>Ik heb zin in frietjesmet frietjes.</s>',
  'score': 0.1260228306055069},
 {'sequence': '<s>Ik heb zin in kipmet frietjes.</s>',
  'score': 0.043293338268995285},
 {'sequence': '<s>Ik heb zin in frietmet frietjes.</s>',
  'score': 0.03967735171318054}]

When analysing this difference in behaviour, I saw that the link to download the fairseq model still seems to refer to version 1.0 of the model: https://github.com/iPieter/BERDT/releases/download/v1.0/RobBERT-base.pt. Is that link still correct, or should there be a v2.0 release?

How to train Robbert-base?

Thanks for sharing.
I want to train a language model for a different language (Hindi).
How did you train your RobBERT-base model? Are those steps documented anywhere?

Choice of Vocabulary

Hello,

Great work! I was just wondering why you used an English vocabulary for a Dutch model. Was there a specific reason for that?
I saw, for example, that the Dutch BERT model (Bertje) uses a Dutch vocab and a Spanish model (RuBERTa) uses a Spanish vocab.

Using a Dutch vocab will probably increase the performance of RobBERT. What do you think?

Thank you

Sentiment and PoS models with version 2

Is there any plan to make the sentiment classification and part-of-speech models that were trained on version 2 of RobBERT (robbert-v2-dutch-base) available?

These would make it possible to use those tasks without having to fine-tune from scratch.

Inconsistency between config and finetune notebooks

Hi there

I wanted to reproduce the results from your paper on the sentiment analysis task. I followed all the steps you list and then ran the notebook for DBRD. I found two unclear points that I hope you can clarify:

  • the notebook uses v1, and it is not clear whether the same notebook should be used to reproduce results for v2;
  • the Config that is used in the notebook is not the same as the one in the repo: in the repo, the config has gradient_accumulation_steps = 8. However, in the notebook output cells I can see that you originally ran this with gradient_accumulation_steps = 1. I could only reproduce your results when I changed the notebook so that config.gradient_accumulation_steps = 1, while running with 4 GPUs.

Maybe these things can be clarified/made more consistent in the repo?

In the end I was able to reproduce your results. My results are a bit lower than the ones you report in this repository, but they are within the confidence interval described in the paper, so thanks for including that CI!
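
For what it's worth, the way I reason about these settings (my own assumption, not something stated in the repo) is that the effective batch size is the product of the per-GPU batch size, the gradient accumulation steps, and the number of GPUs, which is why gradient_accumulation_steps has to be adjusted per hardware setup:

# My assumption of how the settings relate (illustrative values, not taken from the repo).
per_gpu_batch_size = 4           # hypothetical per-device batch size
gradient_accumulation_steps = 1  # what the notebook output cells show
n_gpus = 4                       # my setup
effective_batch_size = per_gpu_batch_size * gradient_accumulation_steps * n_gpus
print(effective_batch_size)      # 16 in this illustration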

Can't find finetune_dbrd.ipynb

First of all, thanks for sharing your code!

In the readme you mention the following:

Sentiment analysis using the Dutch Book Review Dataset

  • Download the Dutch book review dataset from https://github.com/benjaminvdb/110kDBRD, and save it to data/raw/110kDBRD
  • Run src/preprocess_dbrd.py to prepare the dataset.
  • Follow the notebook notebooks/finetune_dbrd.ipynb to finetune the model.

However, I cannot find this file. Could you upload it? Thanks!

Tokenizer english or dutch

When downloading the pretrained Roberta Tokenizer and inspecting the merges.txt and vocab.json, they seem to be English. I was expecting Dutch.

Are the files correct, or is the download (or my expectation) wrong?

To download and inspect use:

from transformers import RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained("pdelobelle/robBERT-base")
tokenizer.save_vocabulary("/somefolder")
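
Another quick check, besides saving the vocabulary, is to tokenize a Dutch sentence and look at the subword splits (just a sketch):

# Quick check of how the v1 tokenizer splits a Dutch sentence (sketch).
from transformers import RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained("pdelobelle/robBERT-base")
print(tokenizer.tokenize("Ik ga met de fiets naar het werk."))
# With English merges, Dutch words tend to be split into many short, English-looking pieces.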

Mask predicted in English

Hello,

I've loaded RobBERT with Hugging Face transformers and wanted to use it for mask prediction, but I get surprising results.
What am I doing wrong?

from transformers import RobertaTokenizer, RobertaModel, RobertaForMaskedLM, pipeline

tokenizer = RobertaTokenizer.from_pretrained("pdelobelle/robBERT-base")
model = RobertaModel.from_pretrained("pdelobelle/robBERT-base")

maskFill = pipeline('fill-mask', model=model, tokenizer=tokenizer, topk=5)
maskFill("Ik ga met de <mask> naar het werk.")

[{'sequence': '<s>Ik ga met de saying naar het werk.</s>', 'score': 0.9544020295143127, 'token': 584, 'token_str': 'Ġsaying'}, {'sequence': '<s>Ik ga met de real naar het werk.</s>', 'score': 0.00021602092601824552, 'token': 588, 'token_str': 'Ġreal'}, {'sequence': '<s>Ik ga met de play naar het werk.</s>', 'score': 0.00019373372197151184, 'token': 310, 'token_str': 'Ġplay'}, {'sequence': '<s>Ik ga met de this naar het werk.</s>', 'score': 0.00019168092694599181, 'token': 42, 'token_str': 'Ġthis'}, {'sequence': '<s>Ik ga met de for naar het werk.</s>', 'score': 0.0001903186202980578, 'token': 13, 'token_str': 'Ġfor'}]

model = RobertaForMaskedLM.from_pretrained("pdelobelle/robBERT-base")

[{'sequence': '<s>Ik ga met de%) naar het werk.</s>', 'score': 0.01353645883500576, 'token': 8871, 'token_str': '%)'}, 
{'sequence': '<s>Ik ga met de Chile naar het werk.</s>', 'score': 0.010698799043893814, 'token': 9614, 'token_str': 'ĠChile'}, {'sequence': '<s>Ik ga met de som naar het werk.</s>', 'score': 0.008496173657476902, 'token': 16487, 'token_str': 'Ġsom'}, {'sequence': '<s>Ik ga met de cure naar het werk.</s>', 'score': 0.006187774706631899, 'token': 13306, 'token_str': 'Ġcure'}, {'sequence': '<s>Ik ga met deateg naar het werk.</s>', 'score': 0.005943992640823126, 'token': 27586, 'token_str': 'ateg'}]

I've also tried with the AutoModel class

from transformers import AutoTokenizer, AutoModel, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("pdelobelle/robBERT-base")
model = AutoModelForMaskedLM.from_pretrained("pdelobelle/robBERT-base")

[{'sequence': '<s>Ik ga met destant naar het werk.</s>', 'score': 0.017187733203172684, 'token': 20034, 'token_str': 'stant'}, {'sequence': '<s>Ik ga met devest naar het werk.</s>', 'score': 0.006343857850879431, 'token': 13493, 'token_str': 'vest'}, {'sequence': '<s>Ik ga met decies naar het werk.</s>', 'score': 0.005877971183508635, 'token': 32510, 'token_str': 'cies'}, {'sequence': '<s>Ik ga met desteam naar het werk.</s>', 'score': 0.0044727507047355175, 'token': 46614, 'token_str': 'steam'}, {'sequence': '<s>Ik ga met de Sebast naar het werk.</s>', 'score': 0.0035716358106583357, 'token': 32905, 'token_str': 'ĠSebast'}]

model = AutoModel.from_pretrained("pdelobelle/robBERT-base")

[{'sequence': '<s>Ik ga met de saying naar het werk.</s>', 'score': 0.9544020295143127, 'token': 584, 'token_str': 'Ġsaying'}, {'sequence': '<s>Ik ga met de real naar het werk.</s>', 'score': 0.00021602092601824552, 'token': 588, 'token_str': 'Ġreal'}, {'sequence': '<s>Ik ga met de play naar het werk.</s>', 'score': 0.00019373372197151184, 'token': 310, 'token_str': 'Ġplay'}, {'sequence': '<s>Ik ga met de this naar het werk.</s>', 'score': 0.00019168092694599181, 'token': 42, 'token_str': 'Ġthis'}, {'sequence': '<s>Ik ga met de for naar het werk.</s>', 'score': 0.0001903186202980578, 'token': 13, 'token_str': 'Ġfor'}]

Testing on Mac (cpu-only)

First, thanks for sharing this valuable project! We are very excited to test these models in our research at Erasmus Rotterdam.

My question is about running the models in CPU-only mode, as I have a Mac without an NVIDIA GPU (so no CUDA support). I am trying to reproduce the sentiment analysis project (DBRD) using the finetune_dbrd.ipynb notebook you have shared. However, when training the model with the provided code I get an error at the end of the training process: AssertionError: Torch not compiled with CUDA enabled.

So my question is whether there is a way to run this with CPU-only PyTorch. I see that your train.py script detects the device type (args.device), and I have verified that it prints out 'cpu'. Are there some parameters I need to change in order to run the notebook on my Mac laptop?
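
In case it helps, this is the device-handling pattern I would expect to need (my own guess, not checked against the repo's train.py):

# My guess at a CPU-safe pattern (not checked against the repo's train.py).
import torch
from transformers import RobertaForSequenceClassification, RobertaTokenizer

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

tokenizer = RobertaTokenizer.from_pretrained("pdelobelle/robbert-v2-dutch-base")
model = RobertaForSequenceClassification.from_pretrained("pdelobelle/robbert-v2-dutch-base")
model.to(device)  # instead of an unconditional model.cuda()

batch = tokenizer("Wat een prachtig boek!", return_tensors="pt")
batch = {k: v.to(device) for k, v in batch.items()}  # move inputs to the same device as the model
outputs = model(**batch)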

notebook for NER

Hello there @iPieter,
I need to evaluate RobBERT and Bertje on a named entity recognition task on 18th-19th century Dutch texts.
Is there a notebook somewhere where I can follow the flow from an IOB-tagged dataset to fine-tuning and applying the model?
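
In the meantime, this is roughly the label-alignment step I have in mind (a minimal sketch with made-up example data, assuming a fast tokenizer is available for the base model):

# Minimal sketch of aligning IOB tags with RobBERT subword tokens (example data is made up).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("pdelobelle/robbert-v2-dutch-base")

words = ["Willem", "woonde", "in", "Amsterdam", "."]
tags = ["B-PER", "O", "O", "B-LOC", "O"]
label2id = {"O": 0, "B-PER": 1, "B-LOC": 2}

encoding = tokenizer(words, is_split_into_words=True, truncation=True)
labels = []
for word_id in encoding.word_ids():
    # Special tokens get -100 so they are ignored by the loss;
    # every subword simply inherits the tag of its originating word here.
    labels.append(-100 if word_id is None else label2id[tags[word_id]])

# encoding["input_ids"] together with labels can then be used to fine-tune
# AutoModelForTokenClassification, e.g. with the Trainer API.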

ValueError loading RobBERT

Hi all,

I am using RobBERT for a Dutch sequence classification task. Everything worked well (more than 100 times), but now, all of a sudden, the model does not load anymore.

Downloading the model using:
from transformers import TFRobertaForSequenceClassification

model = TFRobertaForSequenceClassification.from_pretrained("pdelobelle/robbert-v2-dutch-base", num_labels=9)

I get the following error:

ValueError: cannot reshape array of size 30708480 into shape (40000,768)

I am using gp_minimize (a function from scikit-optimize); it has worked many times before this happened. I added print statements before and after the lines of the function that gp_minimize tries to optimize, and found out that the problem really is in downloading the model with the line above.

Please help me, I was about to do the final tests for my thesis. My deadline is next week.
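
One observation that might help with debugging: 30708480 = 39985 × 768, while a (40000, 768) embedding matrix needs 40000 × 768 = 30720000 values, so the downloaded array is 15 rows short. That usually points to a corrupted or partially cached weight file rather than to the code, so forcing a fresh download could be worth a try (just my guess):

# Guess: force a fresh download in case the cached weight file is corrupted.
from transformers import TFRobertaForSequenceClassification

model = TFRobertaForSequenceClassification.from_pretrained(
    "pdelobelle/robbert-v2-dutch-base", num_labels=9, force_download=True
)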

empty processed .txt files when running preprocess_dbrd.py

Hi there,

First of all, thanks for your research and repository. I was trying to replicate your book review LM fine-tuning, but when running preprocess_dbrd.py I end up with empty .txt files in the processed folder. I don't get any errors while running the script, so I can't show you any output or error logs.

Anyway, it would be great if you could check whether this also happens on your end. Thanks in advance,

Thomas

Dutch tokenizer behaves unexpectedly

The problem

When running the code below, the output of the tokenizer looks strange. Some weird characters seem to be introduced by the tokenization, which causes downstream tasks (e.g. MLM) to perform poorly.

Code:

from transformers import RobertaTokenizer
tokenizer = RobertaTokenizer.from_pretrained("pdelobelle/robbert-v2-dutch-base", do_lower_case=True)
sentence = "ik zie een boom in mijn tuin."
tokenized_text = tokenizer.tokenize(sentence)

Result:
['ik', 'Ġzie', 'Ġeen', 'Ġboom', 'Ġin', 'Ġmijn', 'Ġtuin', '.']
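
For what it's worth, converting these tokens back does give the original sentence again, so the Ġ characters seem to be the tokenizer's marker for a leading space rather than lost information (a quick check):

# Quick round-trip check of the tokens shown above.
from transformers import RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained("pdelobelle/robbert-v2-dutch-base")
tokens = tokenizer.tokenize("ik zie een boom in mijn tuin.")
print(tokenizer.convert_tokens_to_string(tokens))  # 'ik zie een boom in mijn tuin.'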

Similar code that uses default BERT Tokenizer

However, when using the exact same code with a default BERT tokenizer, it does seem to work fine.

Code:

from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased", do_lower_case=True)
sentence = "I work as a motorbike stunt rider - that is, I do tricks on my motorbike at shows."
tokenized_text = tokenizer.tokenize(sentence)

Result:
[ 'i', 'work', 'as', 'a', 'motor', '##bi', '##ke', 'stunt', 'rider', '-', 'that', 'is', ',', 'i', 'do', 'tricks', 'on', 'my', 'motor', '##bi', '##ke', 'at', 'shows', '.']

Question

Why is this, and how can it be solved?
Thanks in advance!
