vinairesearch / bertweet
BERTweet: A pre-trained language model for English Tweets (EMNLP-2020)
License: MIT License
Hi,
I am using Google Colab and trying to run the sample usage code you have given:
from fairseq.data.encoders.fastbpe import fastBPE
from fairseq import options

parser = options.get_preprocessing_parser()
parser.add_argument('--bpe-codes', type=str, help='path to fastBPE BPE',
                    default="BERTweet_base_fairseq/bpe.codes")
args = parser.parse_args()
I'm facing an issue while passing args.
The error is shown in the attached image. How do I fix this error?
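In a notebook, sys.argv contains Jupyter's own command-line flags, so parser.parse_args() typically fails on unrecognized arguments. A minimal sketch of a common workaround (assuming the default bpe.codes path is correct) is to pass an explicit, empty argument list:

from fairseq.data.encoders.fastbpe import fastBPE
from fairseq import options

parser = options.get_preprocessing_parser()
parser.add_argument("--bpe-codes", type=str, default="BERTweet_base_fairseq/bpe.codes",
                    help="path to fastBPE BPE")
# Parse an explicit empty list so Colab's own argv flags are ignored.
args = parser.parse_args([])
bpe = fastBPE(args)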
Does vinai/bertweet-covid19-base-uncased use the same tokenizer as bertweet-base? I've been trying to run code on my annotated data, and it keeps giving me the error "index out of range in self" during the training stage if I download the model and tokenizer from bertweet-covid19-base-uncased. The only way it worked for me was using the model from bertweet-covid19-base-uncased and the tokenizer from bertweet-base.
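A quick diagnostic (a sketch, not a confirmed explanation of the error) is to compare the tokenizer's vocabulary size against the model's embedding table: "index out of range in self" usually means some token id falls outside that table.

from transformers import AutoModel, AutoTokenizer

name = "vinai/bertweet-covid19-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(name, use_fast=False)
model = AutoModel.from_pretrained(name)

print(len(tokenizer))                                # tokenizer vocabulary size
print(model.get_input_embeddings().weight.shape[0])  # rows in the embedding table
# If len(tokenizer) exceeds the number of rows, some ids will index out of range.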
Hi, could you explain how you generate the tweet sentence embedding, please? Checking the shape of the output from the example, features = bertweet(input_ids) seems to give embeddings of each token in feature[0] (e.g., [1, 20, 768]) and a tweet sentence embedding in feature[1] (e.g., [1, 768]). If so, could you let me know how you generate feature[1]? Is it based on the [CLS] token, or simply the average of the word token embeddings? Thanks!
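For RoBERTa-style models in transformers, feature[1] is the pooler output: the final hidden state of the first token (<s>, the CLS equivalent) passed through a trained dense layer with tanh; it is not an average of the token embeddings. A minimal sketch showing both, plus mean pooling as a common alternative (the pooling choice here is illustrative, not BERTweet-specific):

import torch
from transformers import AutoModel, AutoTokenizer

bertweet = AutoModel.from_pretrained("vinai/bertweet-base")
tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-base", use_fast=False)

line = "SC has first two presumptive cases of coronavirus , DHEC confirms HTTPURL via @USER :cry:"
input_ids = torch.tensor([tokenizer.encode(line)])
with torch.no_grad():
    features = bertweet(input_ids)

token_embeddings = features[0]               # (1, seq_len, 768): one vector per token
pooler_output = features[1]                  # (1, 768): dense+tanh over the first <s> token
mean_pooled = token_embeddings.mean(dim=1)   # (1, 768): simple average, a common alternative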
Great work on a tweet-based LM. I am wondering whether the tweets downloaded from the Archive Team website were also truncated when you trained the model?
The sample script provided here gives an error. The script is given below:
import torch
from transformers import AutoModel, AutoTokenizer
bertweet = AutoModel.from_pretrained("vinai/bertweet-base")
tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-base")
# INPUT TWEET IS ALREADY NORMALIZED!
line = "SC has first two presumptive cases of coronavirus , DHEC confirms HTTPURL via @USER :cry:"
input_ids = torch.tensor([tokenizer.encode(line)])
with torch.no_grad():
features = bertweet(input_ids)
Error:
Traceback (most recent call last):
File "temp.py", line 5, in
tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-base")
File "<conda_env>//lib/python3.8/site-packages/transformers/models/auto/tokenization_auto.py", line 372, in from_pretrained
tokenizer_class = tokenizer_class_from_name(tokenizer_class_candidate)
File "<conda_env>/lib/python3.8/site-packages/transformers/models/auto/tokenization_auto.py", line 275, in tokenizer_class_from_name
if c.__name__ == class_name:
AttributeError: 'NoneType' object has no attribute '__name__'
The error is resolved by using:
tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-base", use_fast=False)
Thanks a lot for the work on BERTweet. In the paper you describe using the model for 3-class sentiment analysis. Can you please provide an example of how it is done?
The readme example:
bertweet = AutoModel.from_pretrained("vinai/bertweet-base")
tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-base", use_fast=False)
line = "SC has first two presumptive cases of coronavirus , DHEC confirms HTTPURL via @USER :crying_face:"
input_ids = torch.tensor([tokenizer.encode(line)])
with torch.no_grad():
features = bertweet(input_ids)
produces the following results:
# Inspect results
print(f'Pooler outputs shape: {features["pooler_output"].shape}')
print(f'Last hidden states shape: {features["last_hidden_state"].shape}')
Pooler outputs shape: torch.Size([1, 768])
Last hidden states shape: torch.Size([1, 20, 768])
How do you then use the model outputs to classify the sentiment of the tweet?
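The bare AutoModel has no classification head, so its outputs cannot be mapped to sentiment directly. A minimal sketch of the usual route (not necessarily the paper's exact setup): load a sequence-classification head with num_labels=3 and fine-tune it on labeled tweets before trusting the logits.

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-base", use_fast=False)
# The classification head is randomly initialized; it needs fine-tuning.
model = AutoModelForSequenceClassification.from_pretrained("vinai/bertweet-base", num_labels=3)

line = "SC has first two presumptive cases of coronavirus , DHEC confirms HTTPURL via @USER :crying_face:"
inputs = tokenizer(line, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits      # (1, 3): one score per class
pred = logits.argmax(dim=-1).item()      # class index, meaningful only after fine-tuning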
Hi all,
Thanks for the great work! I used it to make this little emoji recommender:
http://rensdimmendaal.com/emoji/
I'd love to expand the number of different emoji I can recommend. However, I cannot at the moment, because some emoji are split into multiple tokens. Would you be willing to share the preprocessed data of bertweet-base so I can add these as single tokens to the vocabulary and tune the model?
@famanson @andreydung @datquocnguyen @tienthanhdhcn @thanhluong
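For context, extending the vocabulary does not strictly require the original pre-training data: transformers lets you register new tokens and grow the embedding matrix, then fine-tune so the new rows acquire useful values. A sketch under that assumption (the emoji aliases below are illustrative placeholders):

from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-base", use_fast=False)
model = AutoModel.from_pretrained("vinai/bertweet-base")

# Hypothetical multi-token emoji aliases to register as single tokens.
new_tokens = [":face_with_tears_of_joy:", ":rolling_on_the_floor_laughing:"]
num_added = tokenizer.add_tokens(new_tokens)
# Grow the embedding table; the new rows are randomly initialized and only
# become meaningful after further fine-tuning or continued pre-training.
model.resize_token_embeddings(len(tokenizer))
print(f"added {num_added} tokens; new vocab size {len(tokenizer)}")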
Can I use this model for author prediction on local English tweets?
I tried to run your example usage:
import torch
from transformers import AutoModel, AutoTokenizer
bertweet = AutoModel.from_pretrained("vinai/bertweet-base")
tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-base")
# INPUT TWEET IS ALREADY NORMALIZED!
line = "SC has first two presumptive cases of coronavirus , DHEC confirms HTTPURL via @USER :cry:"
input_ids = torch.tensor([tokenizer.encode(line)])
with torch.no_grad():
features = bertweet(input_ids)  # Model outputs are now tuples
print(features)
I'm getting the following error:
OSError: Model name 'vinai/bertweet-base' was not found in tokenizers model name list (roberta-large-openai-detector, roberta-large-mnli, roberta-large, roberta-base-openai-detector, roberta-base, distilroberta-base). We assumed 'vinai/bertweet-base' was a path, a model identifier, or url to a directory containing vocabulary files named ['vocab.json', 'merges.txt'] but couldn't find such vocabulary files at this path or url.
My environment:
python 3.5.6
transformers 2.5.1
torch 1.4.0
When I execute the huggingface version It throws the following error:
OSError: Model name 'vinai/bertweet-base' was not found in tokenizers model name list (roberta-base, roberta-large, roberta-large-mnli, distilroberta-base, roberta-base-openai-detector, roberta-large-openai-detector). We assumed 'vinai/bertweet-base' was a path, a model identifier, or url to a directory containing vocabulary files named ['vocab.json', 'merges.txt'] but couldn't find such vocabulary files at this path or url.
https://colab.research.google.com/drive/1bwWQAX9Ql0d1fTVSQd1KpP2AQledgyKZ?usp=sharing
I am working on a remote Ubuntu server using SSH. Python 3.6. I am unable to install fastBPE and get this error:
error: command 'gcc' failed with exit status 1
As a result, I'm unable to execute my code. I cannot install it with pip or conda either. Please help.
Hello,
How do you save the tokenizer with the custom AutoTokenizer parameter normalization=True? And how do you point it to the preprocessing function?
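A sketch of one way to do it, assuming normalization=True is accepted by the slow BertweetTokenizer; save_pretrained writes the init kwargs to tokenizer_config.json, which is worth verifying after reloading:

from transformers import AutoTokenizer

# normalization=True makes the tokenizer run its built-in tweet normalization
# (user/URL replacement, emoji demojizing) on raw input.
tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-base", use_fast=False, normalization=True)
tokenizer.save_pretrained("my-bertweet-tokenizer")  # hypothetical directory

reloaded = AutoTokenizer.from_pretrained("my-bertweet-tokenizer", use_fast=False)
print(reloaded.tokenize("Check this out https://foo.bar @someone"))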
Hello everyone :)
I'm a psychologist researching user behaviour on social media.
As part of my research, I collect a huge number of tweets on a specific hashtag (~25,000,000 tweets).
I would like to run sentiment analysis on this dataset. I previously used the default Hugging Face pipeline for sentiment analysis, but the results weren't that great:
from transformers import pipeline

classifier = pipeline("sentiment-analysis")
tweets_df['sentiment'] = tweets_df['text'].apply(lambda row : (classifier(row))[0]['label'])
I ran the example code on the main GitHub page. I'm asking for some guidance: how can I apply it in a performant way and get a new column called 'sentiment' with the sentiment of each row?
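One way to make this faster (a sketch, assuming your tweets_df from above; the checkpoint path is a placeholder for any BERTweet model fine-tuned on sentiment, since the bare vinai/bertweet-base has no trained classification head):

from transformers import pipeline

# Hypothetical fine-tuned checkpoint; replace with your own.
classifier = pipeline("sentiment-analysis", model="path/to/fine-tuned-bertweet")

texts = tweets_df["text"].tolist()
labels = []
batch_size = 64
for i in range(0, len(texts), batch_size):
    # One forward pass per batch instead of one per row with .apply().
    labels.extend(r["label"] for r in classifier(texts[i:i + batch_size]))
tweets_df["sentiment"] = labels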
hi VinAI team,
Great work! I'm the founder & CEO of Jina AI; you may have used or heard of my previous work on Fashion-MNIST and bert-as-service. I'm the creator of these two OSS projects.
I'm asking if we can build a synergy between Jina <> BERTweet (& PhoBERT, post it separately). https://github.com/jina-ai/jina
Simply put, Jina is a universal neural search engine: a search infrastructure that can be used for searching text2text, image2image, audio2audio, etc. We already have examples using Jina for QA and semantic text search; full examples can be found here.
I see great potential to apply this in production. Therefore I kindly ask if you are interested in porting it into jina or jina-hub, so that people can use it as one of their search components in Jina.
If you are interested in long-term collaboration, we also have a Slack channel, where we can invite you for more discussion. We also welcome your thoughts on this.
Hi all,
Thank you for the great work! This will solve the problem of adapting BERT (trained on Wikipedia and the book corpus) to the tweet domain. Of course, the problem of adapting it to one's own domain of tweet data is still there. For this purpose, it would be useful to first re-train the BERTweet language model to teach it the language of a specific domain. I have been looking at some tutorials and the Trainer module of transformers. Do you have any guidance, script, or tutorial that you can share?
Many thanks
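A compressed sketch of domain-adaptive pretraining with the Trainer API (not an official recipe; the file name "tweets.txt" is a placeholder for a file with one normalized tweet per line, and the datasets library is assumed):

from datasets import load_dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-base", use_fast=False, normalization=True)
model = AutoModelForMaskedLM.from_pretrained("vinai/bertweet-base")

dataset = load_dataset("text", data_files={"train": "tweets.txt"})["train"]
dataset = dataset.map(lambda ex: tokenizer(ex["text"], truncation=True, max_length=128),
                      batched=True, remove_columns=["text"])

# Random 15% masking, matching the usual MLM objective.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)
args = TrainingArguments(output_dir="bertweet-domain", per_device_train_batch_size=32,
                         num_train_epochs=1)
Trainer(model=model, args=args, train_dataset=dataset, data_collator=collator).train()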
def convertTags(filein):
    """Rewrite POS tags for Twitter-specific tokens in a token<TAB>tag file."""
    writer = open(filein + ".post", "w")
    lines = open(filein, "r").readlines()
    for ind in range(len(lines)):
        line = lines[ind]
        tokTag = line.strip().split()
        # Blank line: sentence boundary, copy it through.
        if len(tokTag) == 0:
            writer.write("\n")
            continue
        # Map the special normalized tokens to dedicated tags.
        if tokTag[0] == "@USER":
            tokTag[1] = "USR"
        elif tokTag[0] == "HTTPURL":
            tokTag[1] = "URL"
        elif tokTag[0].startswith("#"):
            tokTag[1] = "HT"
        elif tokTag[0] == "RT":
            tokTag[1] = "RT"
        # Parentheses: tag as UH when adjacent to a ":"/UH line (part of an
        # emoticon such as ":)"), otherwise tag the parenthesis as itself.
        if tokTag[0] == "(" or tokTag[0] == ")":
            if ind >= 1:
                tokTag_1 = lines[ind - 1].strip().split()
                if len(tokTag_1) == 2:
                    if tokTag_1[0] == ":" and tokTag_1[1] == "UH":
                        tokTag[1] = "UH"
                    else:
                        tokTag[1] = tokTag[0]
            if ind < len(lines) - 1:
                tokTag_1 = lines[ind + 1].strip().split()
                if len(tokTag_1) == 2:
                    if tokTag_1[0] == ":" and tokTag_1[1] == "UH":
                        tokTag[1] = "UH"
                    else:
                        tokTag[1] = tokTag[0]
        writer.write("\t".join(tokTag) + "\n")
    writer.close()
I want to do sentiment analysis of tweets and I want to use this model for that purpose.
Can someone provide a high-level overview of what I should be doing to accomplish this task?
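At a high level: normalize the tweets, tokenize them, put a classification head on BERTweet, and fine-tune on labeled data. A compressed sketch with the Trainer API (the CSV file names and 3-class label count are placeholders, and a "text"/"label" column layout is assumed):

from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-base", use_fast=False, normalization=True)
model = AutoModelForSequenceClassification.from_pretrained("vinai/bertweet-base", num_labels=3)

# Hypothetical CSVs with "text" and "label" columns.
ds = load_dataset("csv", data_files={"train": "train.csv", "test": "test.csv"})
ds = ds.map(lambda ex: tokenizer(ex["text"], truncation=True, max_length=128), batched=True)

args = TrainingArguments(output_dir="bertweet-sentiment", per_device_train_batch_size=32,
                         num_train_epochs=3)
Trainer(model=model, args=args, train_dataset=ds["train"], eval_dataset=ds["test"]).train()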
Hi, I was planning on implementing the same POS tagger architecture using the bertweet-base model with Huggingface, but since it is not supported by PreTrainedTokenizerFast, you cannot access offset_mappings, and thus cannot easily access the embeddings for a given token for POS tagging (I planned on pooling the subwords per token in Tweebank). The tokenizer doesn't seem to deviate from the Huggingface RoBERTa tokenizers except for the normalization functionality, so is there any way to use this feature, or could it be added (perhaps in a setting that doesn't use normalization)? It already works for the bertweet-large model, so I assume it's not impossible.
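Until fast-tokenizer support lands, one workaround (a sketch, not an official recipe) is to tokenize word by word so the subword-to-word alignment is known without offset_mapping, then mean-pool the subword vectors per token:

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-base", use_fast=False)
model = AutoModel.from_pretrained("vinai/bertweet-base")

words = ["SC", "has", "first", "two", "presumptive", "cases"]
# Tokenize each word separately so we know which subwords belong to which word.
pieces = [tokenizer.tokenize(w) for w in words]
ids = [tokenizer.cls_token_id]
spans = []
for p in pieces:
    start = len(ids)
    ids.extend(tokenizer.convert_tokens_to_ids(p))
    spans.append((start, len(ids)))
ids.append(tokenizer.sep_token_id)

with torch.no_grad():
    hidden = model(torch.tensor([ids])).last_hidden_state[0]

# One embedding per word: mean over that word's subword vectors.
word_embs = torch.stack([hidden[s:e].mean(dim=0) for s, e in spans])
print(word_embs.shape)  # (6, 768)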
Would you release a tutorial on how to generate the bpe.codes and dict.txt files, and the preprocessing pipeline for generating the pre-training data?
I want to train a BERTweet for another language.
I want to know what preprocessing steps are applied other than the emoji, username, and URL handling.
Are tweets lowercased?
Are digits removed?
What about punctuation?
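The released tokenizer exposes its normalization step, so one way to answer these questions empirically is to run it on probe strings. A sketch assuming the slow BertweetTokenizer's normalizeTweet method (which handles user mentions, URLs, and emoji) is available:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-base", use_fast=False, normalization=True)
# Probe the normalizer directly: observe what happens to case, digits,
# and punctuation versus mentions/URLs (worth confirming on your own inputs).
print(tokenizer.normalizeTweet("Check THIS out!!! 123 @someone https://foo.bar"))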
I downloaded the model weights and config file from https://huggingface.co/vinai/bertweet-large and found that the vocab and tokenizer are different from those in bertweet-base (https://huggingface.co/vinai/bertweet-base). Moreover, I cannot find the ':crying_face:' token in the large version's vocab.json. The tokenizer seems more like a RobertaTokenizer instead of a BertweetTokenizer. Could the researchers explain the changes?
When I try to use the BERTweet model with the FARM package I get the following error. It seems to be struggling to find the model, but I do not understand why. I am using the Jupyter notebook described in this article and have included the cell code with the error below.
lang_model = "vinai/bertweet-base"
do_lower_case = False
tokenizer = Tokenizer.load(
    pretrained_model_name_or_path=lang_model,
    do_lower_case=do_lower_case)
---------------------------------------------------------------------------
OSError Traceback (most recent call last)
/var/folders/5r/p050t_sd4l130ytlj_x4wxyh0000gn/T/ipykernel_39034/78292544.py in <module>
2 do_lower_case = False
3
----> 4 tokenizer = Tokenizer.load(
5 pretrained_model_name_or_path=lang_model,
6 do_lower_case=do_lower_case)
~/Git/Trade_with_Twitter/venv/lib/python3.8/site-packages/farm/modeling/tokenization.py in load(cls, pretrained_model_name_or_path, revision, tokenizer_class, use_fast, **kwargs)
95 elif "RobertaTokenizer" in tokenizer_class:
96 if use_fast:
---> 97 ret = RobertaTokenizerFast.from_pretrained(pretrained_model_name_or_path, **kwargs)
98 else:
99 ret = RobertaTokenizer.from_pretrained(pretrained_model_name_or_path, **kwargs)
~/Git/Trade_with_Twitter/venv/lib/python3.8/site-packages/transformers/tokenization_utils_base.py in from_pretrained(cls, pretrained_model_name_or_path, *init_inputs, **kwargs)
1706 f"- or '{pretrained_model_name_or_path}' is the correct path to a directory containing relevant tokenizer files\n\n"
1707 )
-> 1708 raise EnvironmentError(msg)
1709
1710 for file_id, file_path in vocab_files.items():
OSError: Can't load tokenizer for 'vinai/bertweet-base'. Make sure that:
- 'vinai/bertweet-base' is a correct model identifier listed on 'https://huggingface.co/models'
- or 'vinai/bertweet-base' is the correct path to a directory containing relevant tokenizer files
It would be nice to have the model also hosted on huggingface (https://huggingface.co/models), so people could use it from the huggingface API without manually downloading the model dump.
Hi,
Where can I access the 80GB Twitter dataset that you used for training the model?
Hello,
I saw your preprocessing steps (where you convert mentions and links to @USER and HTTPURL), but I am wondering: when you trained BERTweet, I imagine there can be strange tokens/symbols, so when you mask 15% of the tokens in each sentence, why does the model not get confused when trying to predict, for example, ":@" or ":D" or other strange symbols?
Regards
How do I deal with a batch of inputs of different lengths? For example, with batch_size = 2, how do I pass the two sentences "I like playing basketball" and "it's not a good day" as input?
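A minimal sketch using the standard padding mechanism (nothing BERTweet-specific here): the tokenizer pads every sequence to the longest one in the batch and returns an attention_mask so the padded positions are ignored.

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-base", use_fast=False)
model = AutoModel.from_pretrained("vinai/bertweet-base")

batch = ["I like playing basketball", "it's not a good day"]
# padding=True pads to the longest sequence; attention_mask marks real tokens.
enc = tokenizer(batch, padding=True, truncation=True, max_length=128, return_tensors="pt")
with torch.no_grad():
    out = model(**enc)
print(out.last_hidden_state.shape)  # (2, seq_len, 768)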
Thanks for making the model available in huggingface hub. I tried to use it with some existing code I have. I've been running the same code with some 10+ models from huggingface hub with no issue. When I try to run with: "vinai/bertweet-base"
I get the following error (note model loads fine and it seems it starts training for several iterations) - see below.
I'm not sure what the problem could be. Could the version of transformers and/or pytorch be the problem? Do you know which versions you tried it with? I'm using transformers 3.4 and torch 1.5.1+cu101
Thanks for your help!
| 44/1923 [00:11<08:11, 3.82it/s]Traceback (most recent call last):
File "../models/jigsaw/tr-3.4//run_puppets.py", line 284, in <module>
main()
File "../models/jigsaw/tr-3.4//run_puppets.py", line 195, in main
trainer.train(
File "/u/ioana/.conda/envs/tr34/lib/python3.8/site-packages/transformers/trainer.py", line 756, in train
tr_loss += self.training_step(model, inputs)
File "/u/ioana/.conda/envs/tr34/lib/python3.8/site-packages/transformers/trainer.py", line 1070, in training_step
loss.backward()
File "/u/ioana/.conda/envs/tr34/lib/python3.8/site-packages/torch/tensor.py", line 198, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph)
File "/u/ioana/.conda/envs/tr34/lib/python3.8/site-packages/torch/autograd/__init__.py", line 98, in backward
Variable._execution_engine.run_backward(
RuntimeError: CUDA error: device-side assert triggered (launch_kernel at /pytorch/aten/src/ATen/native/cuda/CUDALoops.cuh:217)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x46 (0x2b9e5852d536 in /u/ioana/.conda/envs/tr34/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xd43696 (0x2b9e2155e696 in /u/ioana/.conda/envs/tr34/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #2: void at::native::gpu_kernel_impl<__nv_hdl_wrapper_t<false, false, __nv_dl_tag<void (*)(at::TensorIterator&, c10::Scalar), &at::native::add_kernel_cuda, 4u>, float (float, float), float> >(at::TensorIterator&, __nv_hdl_wrapper_t<false, false, __nv_dl_tag<void (*)(at::TensorIterator&, c10::Scalar), &at::native::add_kernel_cuda, 4u>, float (float, float), float> const&) + 0x19e1 (0x2b9e2251ce11 in /u/ioana/.conda/envs/tr34/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
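Not a confirmed diagnosis, but one frequent cause of device-side asserts with this checkpoint is sequence length: bertweet-base only has position embeddings for 128 tokens, so batches tokenized to a longer length index past the table (on CPU the same bug surfaces as a clearer IndexError). A sketch of the guard, using only standard tokenizer arguments:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-base", use_fast=False)
texts = ["first example tweet", "second, much longer example tweet"]  # placeholders
# Cap every sequence at BERTweet's 128-token limit before training.
enc = tokenizer(texts, padding=True, truncation=True, max_length=128, return_tensors="pt")
print(enc["input_ids"].shape)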
When trying to load the Tokenizer using the following code:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-base")
I got these error messages:
AttributeError                            Traceback (most recent call last)
<ipython-input> in <module>()
      1 from transformers import AutoTokenizer
----> 2 tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-base")

/usr/local/lib/python3.6/dist-packages/transformers/models/auto/tokenization_auto.py in from_pretrained(cls, pretrained_model_name_or_path, *inputs, **kwargs)
    368         if use_fast and not config.tokenizer_class.endswith("Fast"):
    369             tokenizer_class_candidate = f"{config.tokenizer_class}Fast"
--> 370         tokenizer_class = tokenizer_class_from_name(tokenizer_class_candidate)
    371         if tokenizer_class is None:
    372             tokenizer_class_candidate = config.tokenizer_class

/usr/local/lib/python3.6/dist-packages/transformers/models/auto/tokenization_auto.py in tokenizer_class_from_name(class_name)
    271     )
    272     for c in all_tokenizer_classes:
--> 273         if c.__name__ == class_name:
    274             return c
    275

AttributeError: 'NoneType' object has no attribute '__name__'
Python 3.6.9
Transformers 4.3.2
How do I generate entire tweet embeddings instead of word embeddings using BERTweet?
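One common recipe (a sketch, not the authors' prescribed method) is masked mean pooling over the last hidden states, so padding tokens don't dilute the average:

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-base", use_fast=False)
model = AutoModel.from_pretrained("vinai/bertweet-base")

tweets = ["SC has first two presumptive cases of coronavirus , DHEC confirms HTTPURL via @USER",
          "it 's a beautiful day"]
enc = tokenizer(tweets, padding=True, truncation=True, max_length=128, return_tensors="pt")
with torch.no_grad():
    hidden = model(**enc).last_hidden_state        # (batch, seq_len, 768)

mask = enc["attention_mask"].unsqueeze(-1)         # (batch, seq_len, 1)
# Sum only over real tokens, then divide by each tweet's true length.
embeddings = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
print(embeddings.shape)                            # (2, 768)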
Unable to use the BERTweet tokenizer with encode_plus. While the tokenizer.encode tokenizes correctly, the tokenizer.encode_plus doesn't work correctly on the raw tweets.
Hi, I encountered an error: "IndexError: index out of range in self." Below is my code. Can you help identify where the problem is? Is it related to the length of the sequence? I can provide the specific text if you need it.
from transformers import AutoModelForMaskedLM, AutoTokenizer

pretrain_model = 'vinai/bertweet-base'
tokenizer = AutoTokenizer.from_pretrained(pretrain_model)
model = AutoModelForMaskedLM.from_pretrained(pretrain_model)
inputs = tokenizer(text, return_tensors="pt", padding='max_length', truncation=True, max_length=512)
last_hidden_states = model(**inputs, output_hidden_states=True).hidden_states[-1]
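The max_length=512 is the likely culprit: bertweet-base was pre-trained with a 128-token limit, so padding to 512 produces position ids beyond the position-embedding table, which raises exactly this IndexError. A sketch of the fix:

import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

pretrain_model = "vinai/bertweet-base"
tokenizer = AutoTokenizer.from_pretrained(pretrain_model, use_fast=False)
model = AutoModelForMaskedLM.from_pretrained(pretrain_model)

text = "SC has first two presumptive cases of coronavirus , DHEC confirms HTTPURL via @USER"
# 128, not 512: bertweet-base only has position embeddings for 128 tokens.
inputs = tokenizer(text, return_tensors="pt", padding="max_length", truncation=True, max_length=128)
with torch.no_grad():
    last_hidden = model(**inputs, output_hidden_states=True).hidden_states[-1]
print(last_hidden.shape)  # (1, 128, 768)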
Hi,
Is there a way to mask multiple words in a sentence?
Thanks in advance.
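Masking multiple words needs nothing special: insert the tokenizer's mask token at each position and read off the logits at every masked index. A minimal sketch (the example sentence is illustrative):

import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-base", use_fast=False)
model = AutoModelForMaskedLM.from_pretrained("vinai/bertweet-base")

# Insert the model's mask token (<mask> for BERTweet) as many times as needed.
line = f"SC has first two {tokenizer.mask_token} cases of {tokenizer.mask_token} , DHEC confirms"
inputs = tokenizer(line, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Decode the top prediction at every masked position.
mask_positions = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
for pos in mask_positions:
    token_id = logits[0, pos].argmax().item()
    print(pos.item(), tokenizer.convert_ids_to_tokens(token_id))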
Hi, I'm interested in your great work and tried to reproduce your results of fine-tuning XLM-R (with my own code). I got 92.6 on Ritter11, 93.4 on ARK, and 95.0 on TB-v2. I find that the result on ARK is lower than the one reported in the paper. In the paper, you applied the "soft" and "hard" normalization strategies to the dataset, while I did nothing. Therefore, I think the reason is possibly the data preprocessing; am I right?
Hi all, and thanks for the cool contribution.
Now that the PR is merged in transformers, I am trying to include your model in the simpletransformers repository, in order to use it in my project. Two questions:
- Why is RobertaConfig used in src/transformers/tokenization_auto.py (in TOKENIZER_MAPPING; see the changed files in the PR)? Shouldn't we use a BertConfig instead?
- Should we use BertForSequenceClassification or RobertaForSequenceClassification?
Thanks a lot in advance.
Hi all, I have something fishy going on with the bertweet-large model. The code and the output are below. I also tested a dataset of 5000 tweets and it returns LABEL_0 for all of them. Do you have any idea what the issue might be?
Thanks
Best
from transformers import pipeline

classifier = pipeline('sentiment-analysis', model="vinai/bertweet-large")  # , return_all_scores=True
print(classifier('I hate you'))
print(classifier('I love you'))
print(classifier('I you'))
[{'label': 'LABEL_0', 'score': 0.6324337720870972}]
[{'label': 'LABEL_0', 'score': 0.6408738493919373}]
[{'label': 'LABEL_0', 'score': 0.6261811256408691}]
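Likely (an inference, not a confirmed diagnosis): vinai/bertweet-large is the bare pre-trained encoder, so the pipeline attaches a freshly initialized classification head, and every input lands on an arbitrary label until the model is fine-tuned on sentiment data. The warning printed at load time hints at this:

from transformers import AutoModelForSequenceClassification

# transformers warns that the classification head weights are newly
# initialized, i.e. the sentiment outputs are random until fine-tuning.
model = AutoModelForSequenceClassification.from_pretrained("vinai/bertweet-large", num_labels=2)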
What is the [MASK] token for BERTweet?
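BERTweet follows RoBERTa's convention, so the mask token is <mask> rather than BERT's [MASK]; the safest approach is to read it off the tokenizer rather than hard-code it. A minimal sketch:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-base", use_fast=False)
print(tokenizer.mask_token, tokenizer.mask_token_id)  # expected: <mask> and its id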
Hi there, thanks for the brilliant work. I want to ask for help on how to get dependency parsing results using BERTweet.
Hi dev team
Thank you for making this model to facilitate NLP research on tweets! I have been trying to use BERTweet for my project on Twitter; however, I think I've just found something weird in the tokenization step.
The 'Example usage' section in the README gives a sample tweet: "SC has first two presumptive cases of coronavirus , DHEC confirms HTTPURL via @user 😢"
I tried to tokenize this tweet with AutoTokenizer.from_pretrained("vinai/bertweet-base", use_fast=False), then used print(tokenizer.convert_ids_to_tokens(tokenizer.encode(line))), and I get:
['<s>', 'SC', 'has', 'first', 'two', 'presum@@', 'ptive', 'cases', 'of', 'coronavirus', ',', 'D@@', 'HE@@', 'C', 'confirms', 'HTTPURL', 'via', '@USER', '<unk>', ':', '</s>']
Or with AutoTokenizer.from_pretrained("vinai/bertweet-base", use_fast=False, normalization=True), I get:
['<s>', 'SC', 'has', 'first', 'two', 'presum@@', 'ptive', 'cases', 'of', 'coronavirus', ',', 'D@@', 'HE@@', 'C', 'confirms', 'HTTPURL', 'via', '@USER', ':', 'cry', ':', '</s>']
Either way, the tokenization is not correct for the emoji string ":cry:".
I have checked the source code in transformers, and I think what went wrong is that for emoji.demojize(), you need to set the option use_aliases=True to cover all emojis; otherwise some just don't get included.
I have also checked tokenizer.get_vocab()[':cry:'], and it returns a KeyError.
All the following calls to the tokenizer return the same ids:
tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-covid19-base-uncased")
input_ids = tokenizer.encode(line, return_tensors="pt")
input_ids = tokenizer.encode(line, padding=True, return_tensors="pt")
input_ids = tokenizer.encode(line, padding=True, max_length=128, return_tensors="pt")
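That is expected for a single sentence: padding=True pads to the longest sequence in the batch, so with one input nothing is added, and max_length only takes effect together with truncation or padding="max_length". A sketch that does produce different ids (standard tokenizer API only):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-covid19-base-uncased")
line = "SC has first two presumptive cases of coronavirus , DHEC confirms HTTPURL via @USER"

a = tokenizer.encode(line, return_tensors="pt")
# padding="max_length" actually appends pad tokens up to 128 positions.
b = tokenizer.encode(line, padding="max_length", max_length=128, return_tensors="pt")
print(a.shape, b.shape)  # e.g. torch.Size([1, 20]) vs torch.Size([1, 128])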
If you have a plan to release a larger model please consider the following options as well:
Thanks!
Thanks for developing BERTweet.
Here is a conceptual question about utilizing tweets for training a BERT model; I am curious how you have handled it.
The BERT language model has a "next sentence prediction" objective, where training the LM also optimizes predicting the next sentence.
Since tweets are short and often contain a single sentence, I am curious how you handled that. How did you bypass the NSP part?
Thank you again.
tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-base")
results in:
Traceback (most recent call last):
File "", line 1, in
File "/home/tam/anaconda3/lib/python3.8/site-packages/transformers/tokenization_auto.py", line 220, in from_pretrained
return tokenizer_class_py.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
File "/home/tam/anaconda3/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 1425, in from_pretrained
return cls._from_pretrained(*inputs, **kwargs)
File "/home/tam/anaconda3/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 1524, in _from_pretrained
raise EnvironmentError(
OSError: Model name 'vinai/bertweet-base' was not found in tokenizers model name list (roberta-base, roberta-large, roberta-large-mnli, distilroberta-base, roberta-base-openai-detector, roberta-large-openai-detector). We assumed 'vinai/bertweet-base' was a path, a model identifier, or url to a directory containing vocabulary files named ['vocab.json', 'merges.txt'] but couldn't find such vocabulary files at this path or url.
transformers version: 3.1.0
It is a bit onerous to mix APIs from transformers and fairseq. Any chance to have some demo code with transformers only?
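Since the model is on the Hugging Face Hub, a transformers-only sketch (no fairseq, no standalone fastBPE; the slow tokenizer bundles the BPE and the optional tweet normalization) might look like:

import torch
from transformers import AutoModel, AutoTokenizer

bertweet = AutoModel.from_pretrained("vinai/bertweet-base")
# use_fast=False selects the slow BertweetTokenizer; normalization=True
# lets it normalize raw tweets (user/URL replacement, emoji demojizing).
tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-base", use_fast=False, normalization=True)

line = "SC has first two presumptive cases of coronavirus , DHEC confirms HTTPURL via @USER :cry:"
inputs = tokenizer(line, return_tensors="pt")
with torch.no_grad():
    features = bertweet(**inputs)
print(features.last_hidden_state.shape)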