vinairesearch / bertweet
BERTweet: A pre-trained language model for English Tweets (EMNLP-2020)
License: MIT License
Hi,
I am using Google Colab and trying to run the sample usage code you have given:
from fairseq.data.encoders.fastbpe import fastBPE
from fairseq import options

parser = options.get_preprocessing_parser()
parser.add_argument('--bpe-codes', type=str, help='path to fastBPE BPE',
                    default="BERTweet_base_fairseq/bpe.codes")
args = parser.parse_args()
I'm facing an issue while passing args.
The error is shown in the attached image. How do I fix this error?
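In a notebook, sys.argv contains Jupyter's own command-line flags, so parser.parse_args() typically fails on unrecognized arguments. A minimal sketch of a common workaround (assuming the default bpe.codes path is correct) is to pass an explicit, empty argument list:

from fairseq.data.encoders.fastbpe import fastBPE
from fairseq import options

parser = options.get_preprocessing_parser()
parser.add_argument("--bpe-codes", type=str, default="BERTweet_base_fairseq/bpe.codes",
                    help="path to fastBPE BPE")
# Parse an explicit empty list so Colab's own argv flags are ignored.
args = parser.parse_args([])
bpe = fastBPE(args)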
Does vinai/bertweet-covid19-base-uncased use the same tokenizer as bertweet-base? I've been trying to run code on my annotated data, and it keeps giving me the error "index out of range in self" during the training stage if I download the model and tokenizer from bertweet-covid19-base-uncased. The only way it worked for me was using the model from bertweet-covid19-base-uncased and the tokenizer from bertweet-base.
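A quick diagnostic (a sketch, not a confirmed explanation of the error) is to compare the tokenizer's vocabulary size against the model's embedding table: "index out of range in self" usually means some token id falls outside that table.

from transformers import AutoModel, AutoTokenizer

name = "vinai/bertweet-covid19-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(name, use_fast=False)
model = AutoModel.from_pretrained(name)

print(len(tokenizer))                                # tokenizer vocabulary size
print(model.get_input_embeddings().weight.shape[0])  # rows in the embedding table
# If len(tokenizer) exceeds the number of rows, some ids will index out of range.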
Hi, could you explain how you generate the tweet sentence embedding, please? Checking the shape of the output from the example, features = bertweet(input_ids) seems to give embeddings of each token in feature[0] (e.g., [1, 20, 768]) and a tweet sentence embedding in feature[1] (e.g., [1, 768]). If so, could you let me know how you generate feature[1]? Is it based on the [CLS] token, or simply the average of the word token embeddings? Thanks!
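For RoBERTa-style models in transformers, feature[1] is the pooler output: the final hidden state of the first token (<s>, the CLS equivalent) passed through a trained dense layer with tanh; it is not an average of the token embeddings. A minimal sketch showing both, plus mean pooling as a common alternative (the pooling choice here is illustrative, not BERTweet-specific):

import torch
from transformers import AutoModel, AutoTokenizer

bertweet = AutoModel.from_pretrained("vinai/bertweet-base")
tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-base", use_fast=False)

line = "SC has first two presumptive cases of coronavirus , DHEC confirms HTTPURL via @USER :cry:"
input_ids = torch.tensor([tokenizer.encode(line)])
with torch.no_grad():
    features = bertweet(input_ids)

token_embeddings = features[0]               # (1, seq_len, 768): one vector per token
pooler_output = features[1]                  # (1, 768): dense+tanh over the first <s> token
mean_pooled = token_embeddings.mean(dim=1)   # (1, 768): simple average, a common alternative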
Great work on a tweet-based LM. I am wondering whether the tweets downloaded from the Archive Team website were also truncated when you trained the model?
The sample script provided here gives an error. The script is given below:
import torch
from transformers import AutoModel, AutoTokenizer
bertweet = AutoModel.from_pretrained("vinai/bertweet-base")
tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-base")
# INPUT TWEET IS ALREADY NORMALIZED!
line = "SC has first two presumptive cases of coronavirus , DHEC confirms HTTPURL via @USER :cry:"
input_ids = torch.tensor([tokenizer.encode(line)])
with torch.no_grad():
features = bertweet(input_ids)
Error:
Traceback (most recent call last):
File "temp.py", line 5, in
tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-base")
File "<conda_env>//lib/python3.8/site-packages/transformers/models/auto/tokenization_auto.py", line 372, in from_pretrained
tokenizer_class = tokenizer_class_from_name(tokenizer_class_candidate)
File "<conda_env>/lib/python3.8/site-packages/transformers/models/auto/tokenization_auto.py", line 275, in tokenizer_class_from_name
if c.__name__ == class_name:
AttributeError: 'NoneType' object has no attribute '__name__'
The error is resolved by using:
tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-base", use_fast=False)
Thanks a lot for the work on BERTweet. In the paper you describe using the model for 3-class sentiment analysis. Can you please provide an example of how it is done?
The readme example:
bertweet = AutoModel.from_pretrained("vinai/bertweet-base")
tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-base", use_fast=False)
line = "SC has first two presumptive cases of coronavirus , DHEC confirms HTTPURL via @USER :crying_face:"
input_ids = torch.tensor([tokenizer.encode(line)])
with torch.no_grad():
features = bertweet(input_ids)
produces the following results:
# Inspect results
print(f'Pooler outputs shape: {features["pooler_output"].shape}')
print(f'Last hidden states shape: {features["last_hidden_state"].shape}')
Pooler outputs shape: torch.Size([1, 768])
Last hidden states shape: torch.Size([1, 20, 768])
How do you then use the model outputs to classify the sentiment of the tweet?
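The bare AutoModel has no classification head, so its outputs cannot be mapped to sentiment directly. A minimal sketch of the usual route (not necessarily the paper's exact setup): load a sequence-classification head with num_labels=3 and fine-tune it on labeled tweets before trusting the logits.

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-base", use_fast=False)
# The classification head is randomly initialized; it needs fine-tuning.
model = AutoModelForSequenceClassification.from_pretrained("vinai/bertweet-base", num_labels=3)

line = "SC has first two presumptive cases of coronavirus , DHEC confirms HTTPURL via @USER :crying_face:"
inputs = tokenizer(line, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits      # (1, 3): one score per class
pred = logits.argmax(dim=-1).item()      # class index, meaningful only after fine-tuning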
Hi all,
Thanks for the great work! I used it to make this little emoji recommender:
http://rensdimmendaal.com/emoji/
I'd love to expand the number of different emoji I can recommend. However, I cannot at the moment, because some emoji are split into multiple tokens. Would you be willing to share the preprocessed data of bertweet-base so I can add these as single tokens to the vocabulary and tune the model?
@famanson @andreydung @datquocnguyen @tienthanhdhcn @thanhluong
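For context, extending the vocabulary does not strictly require the original pre-training data: transformers lets you register new tokens and grow the embedding matrix, then fine-tune so the new rows acquire useful values. A sketch under that assumption (the emoji aliases below are illustrative placeholders):

from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-base", use_fast=False)
model = AutoModel.from_pretrained("vinai/bertweet-base")

# Hypothetical multi-token emoji aliases to register as single tokens.
new_tokens = [":face_with_tears_of_joy:", ":rolling_on_the_floor_laughing:"]
num_added = tokenizer.add_tokens(new_tokens)
# Grow the embedding table; the new rows are randomly initialized and only
# become meaningful after further fine-tuning or continued pre-training.
model.resize_token_embeddings(len(tokenizer))
print(f"added {num_added} tokens; new vocab size {len(tokenizer)}")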
Can I use this model for author prediction on local English tweets?
I tried to run your example usage:
import torch
from transformers import AutoModel, AutoTokenizer
bertweet = AutoModel.from_pretrained("vinai/bertweet-base")
tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-base")
# INPUT TWEET IS ALREADY NORMALIZED!
line = "SC has first two presumptive cases of coronavirus , DHEC confirms HTTPURL via @USER :cry:"
input_ids = torch.tensor([tokenizer.encode(line)])
with torch.no_grad():
features = bertweet(input_ids)  # Model outputs are now tuples
print(features)
I'm getting the following error:
OSError: Model name 'vinai/bertweet-base' was not found in tokenizers model name list (roberta-large-openai-detector, roberta-large-mnli, roberta-large, roberta-base-openai-detector, roberta-base, distilroberta-base). We assumed 'vinai/bertweet-base' was a path, a model identifier, or url to a directory containing vocabulary files named ['vocab.json', 'merges.txt'] but couldn't find such vocabulary files at this path or url.
My environment:
python 3.5.6
transformers 2.5.1
torch 1.4.0
When I execute the huggingface version It throws the following error:
OSError: Model name 'vinai/bertweet-base' was not found in tokenizers model name list (roberta-base, roberta-large, roberta-large-mnli, distilroberta-base, roberta-base-openai-detector, roberta-large-openai-detector). We assumed 'vinai/bertweet-base' was a path, a model identifier, or url to a directory containing vocabulary files named ['vocab.json', 'merges.txt'] but couldn't find such vocabulary files at this path or url.
https://colab.research.google.com/drive/1bwWQAX9Ql0d1fTVSQd1KpP2AQledgyKZ?usp=sharing
I am working on a remote Ubuntu server using SSH. Python 3.6. I am unable to install fastBPE and get this error:
error: command 'gcc' failed with exit status 1
As a result, I'm unable to execute my code. I cannot install it with pip or conda either. Please help.
Hello,
How do you save the tokenizer with the custom AutoTokenizer parameter normalization=True? And how do you point it to the preprocessing function?
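A sketch of one way to do it, assuming normalization=True is accepted by the slow BertweetTokenizer; save_pretrained writes the init kwargs to tokenizer_config.json, which is worth verifying after reloading:

from transformers import AutoTokenizer

# normalization=True makes the tokenizer run its built-in tweet normalization
# (user/URL replacement, emoji demojizing) on raw input.
tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-base", use_fast=False, normalization=True)
tokenizer.save_pretrained("my-bertweet-tokenizer")  # hypothetical directory

reloaded = AutoTokenizer.from_pretrained("my-bertweet-tokenizer", use_fast=False)
print(reloaded.tokenize("Check this out https://foo.bar @someone"))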
Hello everyone :)
I'm a psychologist researching user behaviour on social media.
As part of my research, I collect a huge number of tweets on a specific hashtag (~25,000,000 tweets).
I would like to run sentiment analysis on this dataset. I previously used the default Hugging Face pipeline for sentiment analysis, but the results weren't that great:
from transformers import pipeline

classifier = pipeline("sentiment-analysis")
tweets_df['sentiment'] = tweets_df['text'].apply(lambda row : (classifier(row))[0]['label'])
I ran the example code on the main GitHub page. I'm asking for some guidance: how can I apply it in a performant way and get a new column called 'sentiment' with the sentiment of each row?
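One way to make this faster (a sketch, assuming your tweets_df from above; the checkpoint path is a placeholder for any BERTweet model fine-tuned on sentiment, since the bare vinai/bertweet-base has no trained classification head):

from transformers import pipeline

# Hypothetical fine-tuned checkpoint; replace with your own.
classifier = pipeline("sentiment-analysis", model="path/to/fine-tuned-bertweet")

texts = tweets_df["text"].tolist()
labels = []
batch_size = 64
for i in range(0, len(texts), batch_size):
    # One forward pass per batch instead of one per row with .apply().
    labels.extend(r["label"] for r in classifier(texts[i:i + batch_size]))
tweets_df["sentiment"] = labels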
hi VinAI team,
Great work! I'm the founder & CEO of Jina AI; you may have used or heard of my previous work on Fashion-MNIST and bert-as-service. I'm the creator of these two OSS projects.
I'm asking if we can build a synergy between Jina <> BERTweet (& PhoBERT, post it separately). https://github.com/jina-ai/jina
Simply put, Jina is a universal neural search engine: a search infrastructure that can be used for searching text2text, image2image, audio2audio, etc. We already have examples using Jina for QA and semantic text search; full examples can be found here.
I see great potential to apply this in production. Therefore I kindly ask if you are interested in porting it into jina or jina-hub, so that people can use it as one of their search components in Jina.
If you are interested in long-term collaboration, we also have a Slack channel, where we can invite you for more discussion. We also welcome your thoughts on this.
Hi all,
Thank you for the great work! This will solve the problem of adapting BERT (trained on Wikipedia and the book corpus) to the tweet domain. Of course, the problem of adapting it to one's own domain of tweet data is still there. For this purpose, it would be useful to first re-train the BERTweet language model to teach it the language of a specific domain. I have been looking at some tutorials and the Trainer module of transformers. Do you have any guidance, script, or tutorial that you can share?
Many thanks
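A compressed sketch of domain-adaptive pretraining with the Trainer API (not an official recipe; the file name "tweets.txt" is a placeholder for a file with one normalized tweet per line, and the datasets library is assumed):

from datasets import load_dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-base", use_fast=False, normalization=True)
model = AutoModelForMaskedLM.from_pretrained("vinai/bertweet-base")

dataset = load_dataset("text", data_files={"train": "tweets.txt"})["train"]
dataset = dataset.map(lambda ex: tokenizer(ex["text"], truncation=True, max_length=128),
                      batched=True, remove_columns=["text"])

# Random 15% masking, matching the usual MLM objective.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)
args = TrainingArguments(output_dir="bertweet-domain", per_device_train_batch_size=32,
                         num_train_epochs=1)
Trainer(model=model, args=args, train_dataset=dataset, data_collator=collator).train()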
def convertTags(filein):
    """Rewrite POS tags for Twitter-specific tokens in a token<TAB>tag file."""
    writer = open(filein + ".post", "w")
    lines = open(filein, "r").readlines()
    for ind in range(len(lines)):
        line = lines[ind]
        tokTag = line.strip().split()
        # Blank line: sentence boundary, copy it through.
        if len(tokTag) == 0:
            writer.write("\n")
            continue
        # Map the special normalized tokens to dedicated tags.
        if tokTag[0] == "@USER":
            tokTag[1] = "USR"
        elif tokTag[0] == "HTTPURL":
            tokTag[1] = "URL"
        elif tokTag[0].startswith("#"):
            tokTag[1] = "HT"
        elif tokTag[0] == "RT":
            tokTag[1] = "RT"
        # Parentheses: tag as UH when adjacent to a ":"/UH line (part of an
        # emoticon such as ":)"), otherwise tag the parenthesis as itself.
        if tokTag[0] == "(" or tokTag[0] == ")":
            if ind >= 1:
                tokTag_1 = lines[ind - 1].strip().split()
                if len(tokTag_1) == 2:
                    if tokTag_1[0] == ":" and tokTag_1[1] == "UH":
                        tokTag[1] = "UH"
                    else:
                        tokTag[1] = tokTag[0]
            if ind < len(lines) - 1:
                tokTag_1 = lines[ind + 1].strip().split()
                if len(tokTag_1) == 2:
                    if tokTag_1[0] == ":" and tokTag_1[1] == "UH":
                        tokTag[1] = "UH"
                    else:
                        tokTag[1] = tokTag[0]
        writer.write("\t".join(tokTag) + "\n")
    writer.close()
I want to do sentiment analysis of tweets and I want to use this model for that purpose.
Can someone provide a high-level overview of what I should be doing to accomplish this task?
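At a high level: normalize the tweets, tokenize them, put a classification head on BERTweet, and fine-tune on labeled data. A compressed sketch with the Trainer API (the CSV file names and 3-class label count are placeholders, and a "text"/"label" column layout is assumed):

from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-base", use_fast=False, normalization=True)
model = AutoModelForSequenceClassification.from_pretrained("vinai/bertweet-base", num_labels=3)

# Hypothetical CSVs with "text" and "label" columns.
ds = load_dataset("csv", data_files={"train": "train.csv", "test": "test.csv"})
ds = ds.map(lambda ex: tokenizer(ex["text"], truncation=True, max_length=128), batched=True)

args = TrainingArguments(output_dir="bertweet-sentiment", per_device_train_batch_size=32,
                         num_train_epochs=3)
Trainer(model=model, args=args, train_dataset=ds["train"], eval_dataset=ds["test"]).train()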
Hi, I was planning on implementing the same POS tagger architecture using the bertweet-base model with Huggingface, but since it is not supported by PreTrainedTokenizerFast, you cannot access offset_mappings, and thus cannot easily access the embeddings for a given token for POS tagging (I planned on pooling the subwords per token in Tweebank). The tokenizer doesn't seem to deviate from the Huggingface RoBERTa tokenizers except for the normalization functionality, so is there any way to use this feature, or could it be added (perhaps in a setting that doesn't use normalization)? It already works for the bertweet-large model, so I assume it's not impossible.
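Until fast-tokenizer support lands, one workaround (a sketch, not an official recipe) is to tokenize word by word so the subword-to-word alignment is known without offset_mapping, then mean-pool the subword vectors per token:

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-base", use_fast=False)
model = AutoModel.from_pretrained("vinai/bertweet-base")

words = ["SC", "has", "first", "two", "presumptive", "cases"]
# Tokenize each word separately so we know which subwords belong to which word.
pieces = [tokenizer.tokenize(w) for w in words]
ids = [tokenizer.cls_token_id]
spans = []
for p in pieces:
    start = len(ids)
    ids.extend(tokenizer.convert_tokens_to_ids(p))
    spans.append((start, len(ids)))
ids.append(tokenizer.sep_token_id)

with torch.no_grad():
    hidden = model(torch.tensor([ids])).last_hidden_state[0]

# One embedding per word: mean over that word's subword vectors.
word_embs = torch.stack([hidden[s:e].mean(dim=0) for s, e in spans])
print(word_embs.shape)  # (6, 768)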
Would you release a tutorial on how to generate the bpe.codes and dict.txt files, and the preprocessing pipeline for generating the pre-training data?
I want to train a BERTweet for another language.
I want to know what preprocessing steps are applied other than the emoji, username, and URL handling.
Are tweets lowercased?
Are digits removed?
What about punctuation?
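The released tokenizer exposes its normalization step, so one way to answer these questions empirically is to run it on probe strings. A sketch assuming the slow BertweetTokenizer's normalizeTweet method (which handles user mentions, URLs, and emoji) is available:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-base", use_fast=False, normalization=True)
# Probe the normalizer directly: observe what happens to case, digits,
# and punctuation versus mentions/URLs (worth confirming on your own inputs).
print(tokenizer.normalizeTweet("Check THIS out!!! 123 @someone https://foo.bar"))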
I downloaded the model weights and config file from https://huggingface.co/vinai/bertweet-large and found that the vocab and tokenizer are different from those in bertweet-base (https://huggingface.co/vinai/bertweet-base). Moreover, I cannot find the ':crying_face:' token in the large version's vocab.json. The tokenizer seems more like a RobertaTokenizer instead of a BertweetTokenizer. Could the researchers explain the changes?
When I try to use the BERTweet model with the FARM package I get the following error. It seems to be struggling to find the model, but I do not understand why. I am using the Jupyter notebook described in this article and have included the cell code with the error below.
lang_model = "vinai/bertweet-base"
do_lower_case = False
tokenizer = Tokenizer.load(
    pretrained_model_name_or_path=lang_model,
    do_lower_case=do_lower_case)
---------------------------------------------------------------------------
OSError Traceback (most recent call last)
/var/folders/5r/p050t_sd4l130ytlj_x4wxyh0000gn/T/ipykernel_39034/78292544.py in <module>
2 do_lower_case = False
3
----> 4 tokenizer = Tokenizer.load(
5 pretrained_model_name_or_path=lang_model,
6 do_lower_case=do_lower_case)
~/Git/Trade_with_Twitter/venv/lib/python3.8/site-packages/farm/modeling/tokenization.py in load(cls, pretrained_model_name_or_path, revision, tokenizer_class, use_fast, **kwargs)
95 elif "RobertaTokenizer" in tokenizer_class:
96 if use_fast:
---> 97 ret = RobertaTokenizerFast.from_pretrained(pretrained_model_name_or_path, **kwargs)
98 else:
99 ret = RobertaTokenizer.from_pretrained(pretrained_model_name_or_path, **kwargs)
~/Git/Trade_with_Twitter/venv/lib/python3.8/site-packages/transformers/tokenization_utils_base.py in from_pretrained(cls, pretrained_model_name_or_path, *init_inputs, **kwargs)
1706 f"- or '{pretrained_model_name_or_path}' is the correct path to a directory containing relevant tokenizer files\n\n"
1707 )
-> 1708 raise EnvironmentError(msg)
1709
1710 for file_id, file_path in vocab_files.items():
OSError: Can't load tokenizer for 'vinai/bertweet-base'. Make sure that:
- 'vinai/bertweet-base' is a correct model identifier listed on 'https://huggingface.co/models'
- or 'vinai/bertweet-base' is the correct path to a directory containing relevant tokenizer files
It would be nice to have the model also hosted on huggingface (https://huggingface.co/models), so people could use it from the huggingface API without manually downloading the model dump.
Hi,
Where can I access the 80GB Twitter dataset that you used for training the model?
Hello,
I saw your preprocessing steps (where you convert mentions and links to @USER and HTTPURL), but I am wondering: when you trained BERTweet, I imagine there can be strange tokens/symbols, so when you mask 15% of the tokens in each sentence, why does the model not get confused when trying to predict, for example, ":@" or ":D" or other strange symbols?
Regards
How do I deal with a batch of inputs of different lengths? For example, with batch_size = 2, how do I pass the two sentences "I like playing basketball" and "it's not a good day" as input?
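A minimal sketch using the standard padding mechanism (nothing BERTweet-specific here): the tokenizer pads every sequence to the longest one in the batch and returns an attention_mask so the padded positions are ignored.

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-base", use_fast=False)
model = AutoModel.from_pretrained("vinai/bertweet-base")

batch = ["I like playing basketball", "it's not a good day"]
# padding=True pads to the longest sequence; attention_mask marks real tokens.
enc = tokenizer(batch, padding=True, truncation=True, max_length=128, return_tensors="pt")
with torch.no_grad():
    out = model(**enc)
print(out.last_hidden_state.shape)  # (2, seq_len, 768)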
Thanks for making the model available in huggingface hub. I tried to use it with some existing code I have. I've been running the same code with some 10+ models from huggingface hub with no issue. When I try to run with: "vinai/bertweet-base"
I get the following error (note model loads fine and it seems it starts training for several iterations) - see below.
I'm not sure what the problem could be. Could the version of transformers and/or pytorch be the problem? Do you know which versions you tried it with? I'm using transformers 3.4 and torch 1.5.1+cu101
Thanks for your help!
| 44/1923 [00:11<08:11, 3.82it/s]Traceback (most recent call last):
File "../models/jigsaw/tr-3.4//run_puppets.py", line 284, in <module>
main()
File "../models/jigsaw/tr-3.4//run_puppets.py", line 195, in main
trainer.train(
File "/u/ioana/.conda/envs/tr34/lib/python3.8/site-packages/transformers/trainer.py", line 756, in train
tr_loss += self.training_step(model, inputs)
File "/u/ioana/.conda/envs/tr34/lib/python3.8/site-packages/transformers/trainer.py", line 1070, in training_step
loss.backward()
File "/u/ioana/.conda/envs/tr34/lib/python3.8/site-packages/torch/tensor.py", line 198, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph)
File "/u/ioana/.conda/envs/tr34/lib/python3.8/site-packages/torch/autograd/__init__.py", line 98, in backward
Variable._execution_engine.run_backward(
RuntimeError: CUDA error: device-side assert triggered (launch_kernel at /pytorch/aten/src/ATen/native/cuda/CUDALoops.cuh:217)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x46 (0x2b9e5852d536 in /u/ioana/.conda/envs/tr34/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xd43696 (0x2b9e2155e696 in /u/ioana/.conda/envs/tr34/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #2: void at::native::gpu_kernel_impl<__nv_hdl_wrapper_t<false, false, __nv_dl_tag<void (*)(at::TensorIterator&, c10::Scalar), &at::native::add_kernel_cuda, 4u>, float (float, float), float> >(at::TensorIterator&, __nv_hdl_wrapper_t<false, false, __nv_dl_tag<void (*)(at::TensorIterator&, c10::Scalar), &at::native::add_kernel_cuda, 4u>, float (float, float), float> const&) + 0x19e1 (0x2b9e2251ce11 in /u/ioana/.conda/envs/tr34/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
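Not a confirmed diagnosis, but one frequent cause of device-side asserts with this checkpoint is sequence length: bertweet-base only has position embeddings for 128 tokens, so batches tokenized to a longer length index past the table (on CPU the same bug surfaces as a clearer IndexError). A sketch of the guard, using only standard tokenizer arguments:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-base", use_fast=False)
texts = ["first example tweet", "second, much longer example tweet"]  # placeholders
# Cap every sequence at BERTweet's 128-token limit before training.
enc = tokenizer(texts, padding=True, truncation=True, max_length=128, return_tensors="pt")
print(enc["input_ids"].shape)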
When trying to load the Tokenizer using the following code:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-base")
I got these error messages:
AttributeError                            Traceback (most recent call last)
<ipython-input> in <module>()
      1 from transformers import AutoTokenizer
----> 2 tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-base")

/usr/local/lib/python3.6/dist-packages/transformers/models/auto/tokenization_auto.py in from_pretrained(cls, pretrained_model_name_or_path, *inputs, **kwargs)
    368         if use_fast and not config.tokenizer_class.endswith("Fast"):
    369             tokenizer_class_candidate = f"{config.tokenizer_class}Fast"
--> 370         tokenizer_class = tokenizer_class_from_name(tokenizer_class_candidate)
    371         if tokenizer_class is None:
    372             tokenizer_class_candidate = config.tokenizer_class

/usr/local/lib/python3.6/dist-packages/transformers/models/auto/tokenization_auto.py in tokenizer_class_from_name(class_name)
    271     )
    272     for c in all_tokenizer_classes:
--> 273         if c.__name__ == class_name:
    274             return c
    275

AttributeError: 'NoneType' object has no attribute '__name__'
Python 3.6.9
Transformers 4.3.2
How do I generate entire tweet embeddings instead of word embeddings using BERTweet?
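One common recipe (a sketch, not the authors' prescribed method) is masked mean pooling over the last hidden states, so padding tokens don't dilute the average:

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-base", use_fast=False)
model = AutoModel.from_pretrained("vinai/bertweet-base")

tweets = ["SC has first two presumptive cases of coronavirus , DHEC confirms HTTPURL via @USER",
          "it 's a beautiful day"]
enc = tokenizer(tweets, padding=True, truncation=True, max_length=128, return_tensors="pt")
with torch.no_grad():
    hidden = model(**enc).last_hidden_state        # (batch, seq_len, 768)

mask = enc["attention_mask"].unsqueeze(-1)         # (batch, seq_len, 1)
# Sum only over real tokens, then divide by each tweet's true length.
embeddings = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
print(embeddings.shape)                            # (2, 768)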
Unable to use the BERTweet tokenizer with encode_plus. While the tokenizer.encode tokenizes correctly, the tokenizer.encode_plus doesn't work correctly on the raw tweets.
Hi, I encountered an error: "IndexError: index out of range in self." Below is my code. Can you help identify where the problem is? Is it related to the length of the sequence? I can provide the specific text if you need it.
from transformers import AutoModelForMaskedLM, AutoTokenizer

pretrain_model = 'vinai/bertweet-base'
tokenizer = AutoTokenizer.from_pretrained(pretrain_model)
model = AutoModelForMaskedLM.from_pretrained(pretrain_model)
inputs = tokenizer(text, return_tensors="pt", padding='max_length', truncation=True, max_length=512)
last_hidden_states = model(**inputs, output_hidden_states=True).hidden_states[-1]
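The max_length=512 is the likely culprit: bertweet-base was pre-trained with a 128-token limit, so padding to 512 produces position ids beyond the position-embedding table, which raises exactly this IndexError. A sketch of the fix:

import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

pretrain_model = "vinai/bertweet-base"
tokenizer = AutoTokenizer.from_pretrained(pretrain_model, use_fast=False)
model = AutoModelForMaskedLM.from_pretrained(pretrain_model)

text = "SC has first two presumptive cases of coronavirus , DHEC confirms HTTPURL via @USER"
# 128, not 512: bertweet-base only has position embeddings for 128 tokens.
inputs = tokenizer(text, return_tensors="pt", padding="max_length", truncation=True, max_length=128)
with torch.no_grad():
    last_hidden = model(**inputs, output_hidden_states=True).hidden_states[-1]
print(last_hidden.shape)  # (1, 128, 768)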
Hi,
Is there a way to mask multiple words in a sentence?
Thanks in advance.
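Masking multiple words needs nothing special: insert the tokenizer's mask token at each position and read off the logits at every masked index. A minimal sketch (the example sentence is illustrative):

import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-base", use_fast=False)
model = AutoModelForMaskedLM.from_pretrained("vinai/bertweet-base")

# Insert the model's mask token (<mask> for BERTweet) as many times as needed.
line = f"SC has first two {tokenizer.mask_token} cases of {tokenizer.mask_token} , DHEC confirms"
inputs = tokenizer(line, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Decode the top prediction at every masked position.
mask_positions = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
for pos in mask_positions:
    token_id = logits[0, pos].argmax().item()
    print(pos.item(), tokenizer.convert_ids_to_tokens(token_id))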
Hi, I'm interested in your great work and tried to reproduce your results of fine-tuning XLM-R (with my own code). I got 92.6 on Ritter11, 93.4 on ARK, and 95.0 on TB-v2. I find that the result on ARK is lower than the one reported in the paper. In the paper, you applied the "soft" and "hard" normalization strategies to the dataset, while I did nothing. Therefore, I think the reason is possibly the data preprocessing; am I right?
Hi all, and thanks for the cool contribution.
Now that the PR is merged in transformers, I am trying to include your model in the simpletransformers repository, in order to use it in my project. Two questions:
- Why is RobertaConfig used in src/transformers/tokenization_auto.py (in TOKENIZER_MAPPING; see the changed files in the PR)? Shouldn't we use a BertConfig instead?
- Should we use BertForSequenceClassification or RobertaForSequenceClassification?
Thanks a lot in advance.
Hi all, I have something fishy going on with the bertweet-large model. The code and the output are below. I also tested a dataset of 5000 tweets and it returns LABEL_0 for all of them. Do you have any idea what the issue might be?
Thanks
Best
from transformers import pipeline

classifier = pipeline('sentiment-analysis', model="vinai/bertweet-large")  # , return_all_scores=True
print(classifier('I hate you'))
print(classifier('I love you'))
print(classifier('I you'))
[{'label': 'LABEL_0', 'score': 0.6324337720870972}]
[{'label': 'LABEL_0', 'score': 0.6408738493919373}]
[{'label': 'LABEL_0', 'score': 0.6261811256408691}]
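Likely (an inference, not a confirmed diagnosis): vinai/bertweet-large is the bare pre-trained encoder, so the pipeline attaches a freshly initialized classification head, and every input lands on an arbitrary label until the model is fine-tuned on sentiment data. The warning printed at load time hints at this:

from transformers import AutoModelForSequenceClassification

# transformers warns that the classification head weights are newly
# initialized, i.e. the sentiment outputs are random until fine-tuning.
model = AutoModelForSequenceClassification.from_pretrained("vinai/bertweet-large", num_labels=2)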
What is the [MASK] token for BERTweet?
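BERTweet follows RoBERTa's convention, so the mask token is <mask> rather than BERT's [MASK]; the safest approach is to read it off the tokenizer rather than hard-code it. A minimal sketch:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-base", use_fast=False)
print(tokenizer.mask_token, tokenizer.mask_token_id)  # expected: <mask> and its id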
Hi there, thanks for the brilliant work. I want to ask for help on how to get dependency parsing results using BERTweet.
Hi dev team
Thank you for making this model to facilitate NLP research on tweets! I have been trying to use BERTweet for my project on Twitter; however, I think I've just found something weird in the tokenization step.
The 'Example usage' section in the README gives a sample tweet: "SC has first two presumptive cases of coronavirus , DHEC confirms HTTPURL via @user 😢"
I tried to tokenize this tweet with AutoTokenizer.from_pretrained("vinai/bertweet-base", use_fast=False), then used print(tokenizer.convert_ids_to_tokens(tokenizer.encode(line))), and I get:
['<s>', 'SC', 'has', 'first', 'two', 'presum@@', 'ptive', 'cases', 'of', 'coronavirus', ',', 'D@@', 'HE@@', 'C', 'confirms', 'HTTPURL', 'via', '@USER', '<unk>', ':', '</s>']
Or with AutoTokenizer.from_pretrained("vinai/bertweet-base", use_fast=False, normalization=True), I get:
['<s>', 'SC', 'has', 'first', 'two', 'presum@@', 'ptive', 'cases', 'of', 'coronavirus', ',', 'D@@', 'HE@@', 'C', 'confirms', 'HTTPURL', 'via', '@USER', ':', 'cry', ':', '</s>']
Either way, the tokenization is not correct for the emoji string ":cry:".
I have checked the source code in transformers, and I think what went wrong is that for emoji.demojize(), you need to set the option use_aliases=True to cover all emojis; otherwise some just don't get included.
I have also checked tokenizer.get_vocab()[':cry:'], and it returns a KeyError.
All the following calls to the tokenizer return the same ids:
tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-covid19-base-uncased")
input_ids = tokenizer.encode(line, return_tensors="pt")
input_ids = tokenizer.encode(line, padding=True, return_tensors="pt")
input_ids = tokenizer.encode(line, padding=True, max_length=128, return_tensors="pt")
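That is expected for a single sentence: padding=True pads to the longest sequence in the batch, so with one input nothing is added, and max_length only takes effect together with truncation or padding="max_length". A sketch that does produce different ids (standard tokenizer API only):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-covid19-base-uncased")
line = "SC has first two presumptive cases of coronavirus , DHEC confirms HTTPURL via @USER"

a = tokenizer.encode(line, return_tensors="pt")
# padding="max_length" actually appends pad tokens up to 128 positions.
b = tokenizer.encode(line, padding="max_length", max_length=128, return_tensors="pt")
print(a.shape, b.shape)  # e.g. torch.Size([1, 20]) vs torch.Size([1, 128])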
If you have a plan to release a larger model please consider the following options as well:
Thanks!
Thanks for developing BERTweet.
Here is a conceptual question about utilizing tweets for training a BERT model; I am curious how you have handled it.
The BERT language model has a "next sentence prediction" objective, where training the LM also optimizes predicting the next sentence.
Since tweets are short and often contain a single sentence, I am curious how you handled that. How did you bypass the NSP part?
Thank you again.
tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-base")
results in:
Traceback (most recent call last):
File "", line 1, in
File "/home/tam/anaconda3/lib/python3.8/site-packages/transformers/tokenization_auto.py", line 220, in from_pretrained
return tokenizer_class_py.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
File "/home/tam/anaconda3/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 1425, in from_pretrained
return cls._from_pretrained(*inputs, **kwargs)
File "/home/tam/anaconda3/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 1524, in _from_pretrained
raise EnvironmentError(
OSError: Model name 'vinai/bertweet-base' was not found in tokenizers model name list (roberta-base, roberta-large, roberta-large-mnli, distilroberta-base, roberta-base-openai-detector, roberta-large-openai-detector). We assumed 'vinai/bertweet-base' was a path, a model identifier, or url to a directory containing vocabulary files named ['vocab.json', 'merges.txt'] but couldn't find such vocabulary files at this path or url.
transformers version: 3.1.0
It is a bit onerous to mix APIs from transformers and fairseq. Any chance to have some demo code with transformers only?
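Since the model is on the Hugging Face Hub, a transformers-only sketch (no fairseq, no standalone fastBPE; the slow tokenizer bundles the BPE and the optional tweet normalization) might look like:

import torch
from transformers import AutoModel, AutoTokenizer

bertweet = AutoModel.from_pretrained("vinai/bertweet-base")
# use_fast=False selects the slow BertweetTokenizer; normalization=True
# lets it normalize raw tweets (user/URL replacement, emoji demojizing).
tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-base", use_fast=False, normalization=True)

line = "SC has first two presumptive cases of coronavirus , DHEC confirms HTTPURL via @USER :cry:"
inputs = tokenizer(line, return_tensors="pt")
with torch.no_grad():
    features = bertweet(**inputs)
print(features.last_hidden_state.shape)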