digitalepidemiologylab / covid-twitter-bert
Pretrained BERT model for analysing COVID-19 Twitter data
License: MIT License
Hi, I have a question about the training procedure of COVID-Twitter-BERT v2 MNLI. When it is fine-tuned on the MNLI dataset, does it use a cross-encoder architecture with cross-entropy loss as the objective?
How about another pretrained model over CORD-19? It would be more scientifically relevant. Thanks.
Hi, I am trying to run basic masked word prediction in PyTorch transformers to compare BERT-large-uncased-WWM and COVID-Twitter-BERT for a publication.
from transformers import pipeline, AutoModel, AutoTokenizer
pipe = pipeline(task='fill-mask', framework='pt',
model="bert-large-uncased-whole-word-masking",
device=0)
pipe(f"This is the best thing I've {pipe.tokenizer.mask_token} in my life.")
returns meaningful results as:
[{'sequence': "[CLS] this is the best thing i've done in my life. [SEP]",
'score': 0.45353376865386963, 'token': 2589},
{'sequence': "[CLS] this is the best thing i've experienced in my life. [SEP]",
'score': 0.2302728146314621, 'token': 5281},
{'sequence': "[CLS] this is the best thing i've seen in my life. [SEP]",
'score': 0.0811614915728569, 'token': 2464},
{'sequence': "[CLS] this is the best thing i've felt in my life. [SEP]",
'score': 0.06349574029445648, 'token': 2371},
{'sequence': "[CLS] this is the best thing i've had in my life. [SEP]",
'score': 0.058649078011512756, 'token': 2018}]
However, with COVID-Twitter-BERT,
model = AutoModel.from_pretrained(pretrained_model_name_or_path='digitalepidemiologylab/covid-twitter-bert', from_tf=True)
tokenizer = AutoTokenizer.from_pretrained('digitalepidemiologylab/covid-twitter-bert', do_lower_case=True)
pipe = pipeline(task='fill-mask', framework='pt',
model=model,
tokenizer=tokenizer,
device=0)
pipe(f"This is the best thing I've {pipe.tokenizer.mask_token} in my life.")
returns
[{'sequence': "[CLS] this is the best thing i've [unused751] in my life. [SEP]",
'score': 0.012442857027053833, 'token': 756},
{'sequence': "[CLS] this is the best thing i've [unused803] in my life. [SEP]",
'score': 0.00925508514046669, 'token': 808},
{'sequence': "[CLS] this is the best thing i've [unused465] in my life. [SEP]",
'score': 0.009094784036278725, 'token': 470},
{'sequence': "[CLS] this is the best thing i've [unused490] in my life. [SEP]",
'score': 0.008908418007194996, 'token': 495},
{'sequence': "[CLS] this is the best thing i've [unused91] in my life. [SEP]",
'score': 0.008854263462126255, 'token': 92}]
Would really appreciate help here.
Hi all,
Thanks for sharing a great paper and accompanying code! I would love to re-use some of your text processing functions elsewhere (probably as part of a PR to an open source library); however, you haven't defined a license file for the repo. Without it:
all rights are reserved and it is not Open Source or Free. You cannot modify or redistribute this code without explicit permission from the copyright holder.
Would you consider a more open source friendly license? :)
Hi,
Thank you for the wonderful model and repository in these much-needed times.
I tried to convert the covid-twitter-bert TensorFlow model, which was downloaded from TF2 Checkpoint, to a PyTorch model using the Hugging Face TensorFlow checkpoint conversion (transformers version 3.0.0). However, the conversion failed with the following error:
INFO:transformers.modeling_bert:Skipping _CHECKPOINTABLE_OBJECT_GRAPH
Traceback (most recent call last):
File "/usr/local/bin/transformers-cli", line 32, in <module>
service.run()
File "/usr/local/lib/python3.6/dist-packages/transformers/commands/convert.py", line 78, in run
convert_tf_checkpoint_to_pytorch(self._tf_checkpoint, self._config, self._pytorch_dump_output)
File "/usr/local/lib/python3.6/dist-packages/transformers/convert_bert_original_tf_checkpoint_to_pytorch.py", line 36, in convert_tf_checkpoint_to_pytorch
load_tf_weights_in_bert(model, config, tf_checkpoint_path)
File "/usr/local/lib/python3.6/dist-packages/transformers/modeling_bert.py", line 124, in load_tf_weights_in_bert
assert pointer.shape == array.shape
File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 594, in __getattr__
type(self).__name__, name))
AttributeError: 'BertForPreTraining' object has no attribute 'shape'
The same conversion method worked with the BERT-base (cased_L-12_H-768_A-12) TensorFlow model, which converted successfully. Therefore, I assume this error is specific to the covid-twitter-bert model.
Comparing it with other BERT TensorFlow models, I found that the covid-twitter-bert TensorFlow model is missing a .meta file, and I suspect this may be the reason for the error.
If the error is caused by the missing .meta file, can you please provide this file? Otherwise, do you have a solution for this issue, or can you make a PyTorch model available?
The paper says: "Each tweet was pseudonymized by replacing all Twitter usernames with a common text token. A similar procedure was performed on all URLs to web pages. We also replaced all unicode emoticons with textual ASCII representations (e.g. [emoji] for [...]"
However, it is not clear exactly which tokens are used in place of usernames and URLs. Are they documented anywhere?
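For illustration only, the kind of replacement the paper describes for usernames and URLs could look like the sketch below. The placeholder strings `USER_TOKEN` and `URL_TOKEN`, the regular expressions, and the `pseudonymize` function are all assumptions made for this example; they are not the tokens or patterns actually used for COVID-Twitter-BERT, so the question about the real tokens still stands.

```python
import re

# Hypothetical placeholder tokens: the tokens actually used by the authors
# are not stated in this thread.
USER_TOKEN = "<user>"
URL_TOKEN = "<url>"

# "@" followed by word characters, but not when preceded by a word character
# (so e-mail addresses such as a@b.com are left alone).
USERNAME_RE = re.compile(r"(?<!\w)@\w+")
# Anything starting with http:// or https:// up to the next whitespace.
URL_RE = re.compile(r"https?://\S+")


def pseudonymize(text: str) -> str:
    """Replace URLs and @usernames with common placeholder tokens."""
    text = URL_RE.sub(URL_TOKEN, text)
    text = USERNAME_RE.sub(USER_TOKEN, text)
    return text


print(pseudonymize("@alice see https://example.com/a?b=1 cc @bob_42"))
```

Whatever the real tokens are, text fed to the model at inference time would presumably need the same replacement applied so that it matches what the model saw during pretraining.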
Hi, I've been trying to use CT-BERT on my local machine and am running into this issue when using my own dataset. I'm using a modified version of the GPU notebook provided here. The issue is:
And the code I'm using to load the dataset is as follows:
I don't understand why the issue is cropping up since everything looks alright.
My system specs are:
OS: Windows 10
GPU: RTX 3080
CUDA: 11.3
CudNN: 8.2
Python: 3.9
TF: 2.5
Any help will be greatly appreciated, since I've spent most of my Christmas break on this :D
For example, line 754 is:
1.22065E+18,category_news
Unfortunately, this breaks the download (hydration).
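A value such as `1.22065E+18` looks like a tweet ID that passed through a spreadsheet or a float conversion and lost its low-order digits, although this thread does not confirm how the file was produced. The sketch below uses made-up sample rows and a hypothetical `find_malformed_ids` helper to show one way to flag such rows before hydration, and why the original ID cannot be recovered from the rounded value.

```python
import csv
import io
from typing import List, Tuple


def find_malformed_ids(csv_text: str) -> List[Tuple[int, str]]:
    """Return (line_number, raw_id) for rows whose first column is not all digits.

    Assumes the file has no header row; a header would be flagged as well.
    """
    bad = []
    for line_no, row in enumerate(csv.reader(io.StringIO(csv_text)), start=1):
        if row and not row[0].strip().isdigit():
            bad.append((line_no, row[0]))
    return bad


# Made-up sample data: the first ID is intact, the second is in scientific notation.
sample = "1220650000000000123,category_news\n1.22065E+18,category_news\n"
print(find_malformed_ids(sample))  # [(2, '1.22065E+18')]

# Converting the rounded value back to an integer only yields zeros in the
# low-order digits, so the row cannot be repaired without the source data.
print(int(float("1.22065E+18")))  # 1220650000000000000
```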
Hi,
I want to try to reproduce your results on the SemEval 2016 Task 4 dataset (http://alt.qcri.org/semeval2016/task4/index.php?id=data-and-tools), but the download link provided gives only the IDs of the tweets. I tried to download them with the Twitter API, but many of the requests return 404 errors.
Do you have a solution?
Thanks a lot for the nice work!
What is the logic behind masking pronouns with the unknown token [UNK]? This seems to be a major deviation from standard BERT models.
For example:
tokenizer = AutoTokenizer.from_pretrained("digitalepidemiologylab/covid-twitter-bert", do_lower_case=False)
tokenizer.tokenize('She is cool!')
outputs
['[UNK]', 'is', 'cool', '!']
while
tokenizer2 = AutoTokenizer.from_pretrained("bert-large-cased", do_lower_case=False)
tokenizer2.tokenize('She is cool!')
outputs
['She', 'is', 'cool', '!']
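One plausible explanation, stated here as an assumption and not as a confirmed answer from the maintainers: if the model's vocabulary is uncased (lowercase entries only), then passing `do_lower_case=False` means a capitalised word such as "She" is looked up as-is, is not found, and falls back to `[UNK]`. The toy vocabulary and the `lookup` function below are invented for illustration; they do whole-word lookup only and are not the Hugging Face tokenizer.

```python
# Toy uncased vocabulary (an assumption for illustration only).
VOCAB = {"she", "is", "cool", "!"}


def lookup(words, do_lower_case):
    """Map each word to itself if it is in VOCAB, otherwise to [UNK]."""
    out = []
    for word in words:
        token = word.lower() if do_lower_case else word
        out.append(token if token in VOCAB else "[UNK]")
    return out


words = ["She", "is", "cool", "!"]
print(lookup(words, do_lower_case=False))  # ['[UNK]', 'is', 'cool', '!']
print(lookup(words, do_lower_case=True))   # ['she', 'is', 'cool', '!']
```

Under that assumption the pronoun itself is not special: any capitalised word would be affected in the same way, and a cased model such as bert-large-cased would not show the effect because its vocabulary contains "She".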
Hi,
It is a very good paper for exploring Twitter-related COVID information. I am wondering whether there are any strategies you could suggest for downstream NER tasks that contain more labels than those mentioned in the paper. For instance, we would like to explore NER labels such as 'treatment', 'problem', etc.
Thanks!
I have just checked the encodings that AutoTokenizer produces. It seems that for words such as "wuhan", "ncov", "coronavirus", "covid", or "sars-cov-2" it produces more than one token, while it produces a single token for 'conventional' words like "apple".
E.g.
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("digitalepidemiologylab/covid-twitter-bert-v2", do_lower_case=True)
tokenizer(['wuhan', "covid","coronavirus","sars-cov-2","apple","city"], truncation=True, padding=True, max_length=512)
Result:
{'input_ids': [[101, 8814, 4819, 102, 0, 0, 0, 0, 0], [101, 2522, 17258, 102, 0, 0, 0, 0, 0], [101, 21887, 23350, 102, 0, 0, 0, 0, 0], [101, 18906, 2015, 1011, 2522, 2615, 1011, 1016, 102], [101, 6207, 102, 0, 0, 0, 0, 0, 0], [101, 2103, 102, 0, 0, 0, 0, 0, 0]]}.
As you can see, there are two encoded values each for "wuhan", "covid" and "coronavirus" ([8814, 4819], [2522, 17258] and [21887, 23350] respectively), while there is one ID each for "apple" and "city" (as it should be: [6207] and [2103]).
I have also checked the tokenizer vocabulary (vocab.txt) from https://huggingface.co/digitalepidemiologylab/covid-twitter-bert-v2/tree/main
and it contains no such terms as "wuhan", "ncov", "coronavirus", "covid", or "sars-cov-2" (as mentioned in the readme: https://huggingface.co/digitalepidemiologylab/covid-twitter-bert-v2).
I wonder why the model does not recognize COVID-related terms, and how I can make the model 'understand' these terms. It seems that the poor performance of models in my specific case (web texts that mention COVID only once) may be related to this issue.
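The multi-token encodings are consistent with how WordPiece tokenizers generally behave: a word that is missing from the vocabulary is split greedily into the longest matching sub-word pieces instead of being mapped to a single ID. The sketch below is a simplified greedy longest-match-first splitter over a toy vocabulary invented for this example; it is not the Hugging Face implementation and not the model's real vocabulary.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of a single word into sub-word pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matches at this position
        pieces.append(piece)
        start = end
    return pieces


# Toy vocabulary (an assumption): whole words for "apple" and "city",
# but only fragments for "covid" and "wuhan".
TOY_VOCAB = {"apple", "city", "co", "##vid", "wu", "##han"}

print(wordpiece("apple", TOY_VOCAB))  # ['apple']
print(wordpiece("covid", TOY_VOCAB))  # ['co', '##vid']
print(wordpiece("wuhan", TOY_VOCAB))  # ['wu', '##han']
```

Being split into several pieces does not by itself mean the model has no representation of the term, since the pieces are still seen in context during pretraining; whether that explains the poor performance described above would need to be checked separately.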
Mention the preprint (https://arxiv.org/abs/2005.07503) in the README.
Hey there,
I am trying to download the SST-2 dataset, but it returns a 403 error.
The error message is below:
{
"error": {
"code": 403,
"message": "Permission denied. Could not perform this operation"
}
}
I did a run-through with this model, classifying comments on a petition that had a lot of traction with anti-vaxxers. The results were pretty mixed.
Some observations:
The model really needs an initial classifier, trained to ask a more basic question: "Is this comment about COVID at all?" I found it gave negative or positive classifications to comments that really didn't apply (e.g. one stating something to the effect of "I haven't seen my family in a year because of border restrictions", which neither I nor the algorithm has any way of evaluating for truthfulness).
This is to be expected, as the model is for classifying COVID statements and thus has no real frame of reference for dealing with statements that aren't actually about COVID per se.
What I DID notice is that, among those misclassified statements, it seems to be homing in on the hostility of the statements. As the petition is largely about dropping Australian border restrictions and vaccine mandates, it's to be expected that a large number of signers are anti-vax activists, who have had a tendency to be somewhat aggressive. That has me wondering whether the model is actually responding to the tone of 'voice' in the comments, producing strong "false" or "misleading" signals if the input text is aggressive in nature.
Edit: Oh, and further context: the petition claimed to be from "West Australian doctors". Quickly plugging names into the register of medical practitioners revealed that the majority of signers are not actually medical practitioners (and worse, there's some evidence that, of those who are, at least some might have had their names entered onto the petition without consent, with data coming from the registry of practitioners) or are practitioners from non-medical fields like chiropractic and other pseudoscience professions, so I'm not sure how that impacts the result. Perhaps the algorithm is picking up untruthfulness signals that I'm missing myself.
Hello,
Thank you for releasing a fine-tuned BERT model.
Could I ask what the loss is capturing, i.e., whether the model is trained on masked language modelling (MLM), next sentence prediction (NSP), or sentence order prediction (SOP)?
Thank you!
I was trying to fine-tune CT-BERT on my own data and have created a very simple classifier on top of CT-BERT. I used
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("digitalepidemiologylab/covid-twitter-bert")
model = AutoModel.from_pretrained("digitalepidemiologylab/covid-twitter-bert")
as suggested.
However, when training the model on a cluster with 10 GB of memory, I get this error:
RuntimeError: CUDA out of memory. Tried to allocate 22.00 MiB (GPU 0; 10.73 GiB total capacity; 9.54 GiB already allocated; 19.56 MiB free; 9.86 GiB reserved in total by PyTorch)
I tried the same classifier with the bert-base-uncased model and did not encounter this problem.
So I was wondering whether the model is too large for this GPU cluster, or whether I should do something extra?
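Assuming COVID-Twitter-BERT uses the BERT-large architecture (24 layers, hidden size 1024), as the comparison with BERT-large earlier in this thread suggests, it has roughly three times the parameters of bert-base-uncased (12 layers, hidden size 768), which could explain why one fits on the GPU and the other does not. The back-of-the-envelope count below assumes the standard BERT layer layout, a 30,522-token vocabulary and 512 positions; the function name is made up for this sketch and the memory figures are rough estimates, not measurements.

```python
def bert_param_count(num_layers, hidden, vocab_size=30522, max_positions=512):
    """Approximate parameter count of a BERT encoder with pooler (no task head)."""
    layer_norm = 2 * hidden  # scale and bias
    embeddings = (vocab_size + max_positions + 2) * hidden + layer_norm
    attention = 4 * (hidden * hidden + hidden) + layer_norm  # Q, K, V, output
    intermediate = 4 * hidden  # feed-forward inner size
    feed_forward = (
        hidden * intermediate + intermediate  # up projection
        + intermediate * hidden + hidden      # down projection
        + layer_norm
    )
    pooler = hidden * hidden + hidden
    return embeddings + num_layers * (attention + feed_forward) + pooler


base = bert_param_count(num_layers=12, hidden=768)
large = bert_param_count(num_layers=24, hidden=1024)
print(f"base:  {base:,} parameters")
print(f"large: {large:,} parameters ({large / base:.2f}x base)")

# Rough fp32 training footprint with Adam: weights + gradients + two moment
# buffers = 16 bytes per parameter, before counting any activations.
for name, count in (("base", base), ("large", large)):
    print(f"{name}: ~{count * 16 / 2**30:.1f} GiB for weights, gradients and optimizer state")
```

Activations come on top of this and grow with batch size and sequence length, so commonly used mitigations such as a smaller batch size, a shorter maximum sequence length, gradient accumulation, or mixed precision may be worth trying before concluding that the GPU is too small.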