digitalepidemiologylab / covid-twitter-bert

Pretrained BERT model for analysing COVID-19 Twitter data

License: MIT License

Shell 5.33% Python 47.08% TeX 8.51% Jupyter Notebook 39.07%

bert-model twitter-data pretrained-models twitter twitter-sentiment-analysis

covid-twitter-bert's Issues

the model architecture of COVID-Twitter-BERT v2 MNLI

Hi, I have a question about the training procedure of COVID-Twitter-BERT v2 MNLI. When it is fine-tuned on the MNLI dataset, does it use a cross-encoder architecture with cross-entropy loss as the objective?

Make one over CORD-19

How about another pretrained model over CORD-19? It would be more scientifically relevant. Thanks.

Masked word prediction returning "unused" tokens

Hi, I am trying to run basic masked word prediction in pytorch transformers to compare BERT-large-uncased-WWM and COVID-Twitter-BERT for a publication.

from transformers import pipeline, AutoModel, AutoTokenizer
pipe = pipeline(task='fill-mask', framework='pt',
                 model="bert-large-uncased-whole-word-masking",
                 device=0)
pipe(f"This is the best thing I've {pipe.tokenizer.mask_token} in my life.")

returns meaningful results as:

[{'sequence': "[CLS] this is the best thing i've done in my life. [SEP]",
  'score': 0.45353376865386963,  'token': 2589},
 {'sequence': "[CLS] this is the best thing i've experienced in my life. [SEP]",
  'score': 0.2302728146314621,  'token': 5281},
 {'sequence': "[CLS] this is the best thing i've seen in my life. [SEP]",
  'score': 0.0811614915728569,  'token': 2464},
 {'sequence': "[CLS] this is the best thing i've felt in my life. [SEP]",
  'score': 0.06349574029445648,  'token': 2371},
 {'sequence': "[CLS] this is the best thing i've had in my life. [SEP]",
  'score': 0.058649078011512756,  'token': 2018}]

However

model = AutoModel.from_pretrained(pretrained_model_name_or_path='digitalepidemiologylab/covid-twitter-bert', from_tf=True)
tokenizer = AutoTokenizer.from_pretrained('digitalepidemiologylab/covid-twitter-bert', do_lower_case=True)
pipe = pipeline(task='fill-mask', framework='pt',
                 model=model,
                 tokenizer=tokenizer,
                 device=0)
pipe(f"This is the best thing I've {pipe.tokenizer.mask_token} in my life.")

returns

[{'sequence': "[CLS] this is the best thing i've [unused751] in my life. [SEP]",
  'score': 0.012442857027053833,  'token': 756},
 {'sequence': "[CLS] this is the best thing i've [unused803] in my life. [SEP]",
  'score': 0.00925508514046669,  'token': 808},
 {'sequence': "[CLS] this is the best thing i've [unused465] in my life. [SEP]",
  'score': 0.009094784036278725,  'token': 470},
 {'sequence': "[CLS] this is the best thing i've [unused490] in my life. [SEP]",
  'score': 0.008908418007194996,  'token': 495},
 {'sequence': "[CLS] this is the best thing i've [unused91] in my life. [SEP]",
  'score': 0.008854263462126255,  'token': 92}]

Would really appreciate help here.
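
One likely explanation, offered as a sketch rather than a confirmed diagnosis: `AutoModel` loads the bare encoder without the masked-language-modelling head, so the pipeline ends up ranking the 1024 hidden dimensions of a BERT-large model instead of vocabulary logits. Every token id returned above is below 1024, which is the range where the uncased BERT vocabulary keeps most of its `[unused…]` entries. The helper names `ids_fit_hidden_size` and `build_fill_mask` are illustrative, and the example assumes the Hub repository provides weights that the installed `transformers` version can load.

```python
HIDDEN_SIZE = 1024  # hidden size of BERT-large, the architecture CT-BERT is based on


def ids_fit_hidden_size(token_ids, hidden_size=HIDDEN_SIZE):
    """Return True if every predicted id could be an index into a hidden-state
    vector rather than into the full vocabulary."""
    return all(0 <= t < hidden_size for t in token_ids)


def build_fill_mask(model_name="digitalepidemiologylab/covid-twitter-bert"):
    """Build a fill-mask pipeline whose model includes the MLM head."""
    # Imported lazily so the helper above works without transformers installed.
    from transformers import AutoModelForMaskedLM, AutoTokenizer, pipeline

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    # AutoModelForMaskedLM (not AutoModel) keeps the vocabulary projection layer.
    model = AutoModelForMaskedLM.from_pretrained(model_name)
    return pipeline(task="fill-mask", model=model, tokenizer=tokenizer)


# Token ids reported in the two outputs above.
ct_bert_ids = [756, 808, 470, 495, 92]
bert_wwm_ids = [2589, 5281, 2464, 2371, 2018]

print(ids_fit_hidden_size(ct_bert_ids))   # True: all ids are below 1024
print(ids_fit_hidden_size(bert_wwm_ids))  # False: ids span the real vocabulary
```

With the head in place, `pipe = build_fill_mask()` followed by the same `pipe(f"... {pipe.tokenizer.mask_token} ...")` call should rank real vocabulary entries.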

Thanks for sharing a great paper and accompanying code! I would love to re-use some of your text processing functions elsewhere (probably as part of a PR to an open source library), however you haven't defined a license file for the repo. Without it:

all rights are reserved and it is not Open Source or Free. You cannot modify or redistribute this code without explicit permission from the copyright holder.

Would you consider a more open source friendly license? :)

Failed to convert the TensorFlow model to a PyTorch model

Hi,
Thank you for the wonderful model and repository in these much-needed times.

I tried to convert the covid-twitter-bert TensorFlow model, downloaded from the TF2 checkpoint, to a PyTorch model using the Hugging Face TensorFlow checkpoint conversion (transformers version 3.0.0). The conversion failed with the following error:

INFO:transformers.modeling_bert:Skipping _CHECKPOINTABLE_OBJECT_GRAPH
Traceback (most recent call last):
  File "/usr/local/bin/transformers-cli", line 32, in <module>
    service.run()
  File "/usr/local/lib/python3.6/dist-packages/transformers/commands/convert.py", line 78, in run
    convert_tf_checkpoint_to_pytorch(self._tf_checkpoint, self._config, self._pytorch_dump_output)
  File "/usr/local/lib/python3.6/dist-packages/transformers/convert_bert_original_tf_checkpoint_to_pytorch.py", line 36, in convert_tf_checkpoint_to_pytorch
    load_tf_weights_in_bert(model, config, tf_checkpoint_path)
  File "/usr/local/lib/python3.6/dist-packages/transformers/modeling_bert.py", line 124, in load_tf_weights_in_bert
    assert pointer.shape == array.shape
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 594, in __getattr__
    type(self).__name__, name))
AttributeError: 'BertForPreTraining' object has no attribute 'shape'

The same conversion method worked with the BERT-base (cased_L-12_H-768_A-12) TensorFlow model, so I assume this error is specific to the covid-twitter-bert model.
Comparing it with other BERT TensorFlow models, I found that the covid-twitter-bert TensorFlow model is missing a .meta file, and I suspect this may be the reason for the error.

If the error is caused by the missing .meta file, can you please provide this file? Otherwise, do you have a solution for this issue, or can you make a PyTorch model available?

What are the tokens for Usernames and Hashtags in the BERT vocabulary?

The paper states: "Each tweet was pseudonymized by replacing all Twitter usernames with a common text token. A similar procedure was performed on all URLs to web pages. We also replaced all unicode emoticons with textual ASCII representations (e.g. :smile: for 😄) using the Python emoji library."

But it is not clear which tokens are used in place of usernames and URLs. Are they documented anywhere?
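
To illustrate the kind of replacement the paper describes, here is a minimal sketch. The filler strings `twitteruser` and `twitterurl` and both regular expressions are assumptions made for this example, not confirmed values from the authors' preprocessing code, so check the repository before relying on them.

```python
import re

# Hypothetical filler tokens; the actual strings used for pretraining should be
# taken from the repository's preprocessing code.
USERNAME_FILLER = "twitteruser"
URL_FILLER = "twitterurl"

# Twitter handles: "@" plus 1-15 word characters, not preceded by a word character
# (so e-mail addresses are left alone).
USERNAME_RE = re.compile(r"(?<!\w)@\w{1,15}")
# A deliberately simple URL pattern: http(s) scheme or a leading "www.".
URL_RE = re.compile(r"(?:https?://|www\.)\S+")


def pseudonymize(text):
    """Replace URLs and usernames with common filler tokens."""
    text = URL_RE.sub(URL_FILLER, text)  # URLs first, in case they contain "@"
    text = USERNAME_RE.sub(USERNAME_FILLER, text)
    return text


print(pseudonymize("@who says wash your hands https://t.co/abc123"))
# twitteruser says wash your hands twitterurl
```

Applying the same replacement to downstream data keeps its input distribution closer to what the model saw during pretraining, provided the filler strings match the ones actually used.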

Failed to find data adapter

Hi, I've been trying to use CT-bert on my local machine and am running into this issue when using my own data set. I'm using a modified version of the GPU notebook provided here. The issue is:

And the code I'm using to load the dataset is as follows:

I don't understand why the issue is cropping up since everything looks alright.
My system specs are:
OS: Windows 10
GPU: RTX 3080
CUDA: 11.3
CudNN: 8.2
Python: 3.9
TF: 2.5
Any help will be greatly appreciated since I've spent most of my Christmas break on this :D

COVID Category (CC) dataset has 4 invalid formats for Tweet ID

For example line 754 is:
1.22065E+18,category_news

This breaks the download (hydration) unfortunately.
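
A value such as `1.22065E+18` looks like what spreadsheet software produces when it converts a long integer to floating point, in which case the lost digits cannot be recovered from the file itself. The sketch below only locates such rows before hydration. It assumes the `id,category` column layout seen in the quoted line, and the function name `find_invalid_ids` is hypothetical.

```python
import csv
import io


def find_invalid_ids(lines):
    """Yield (line_number, raw_id) for rows whose first column is not a plain
    decimal integer. Line numbers are 1-based."""
    for line_number, row in enumerate(csv.reader(lines), start=1):
        if not row:
            continue  # skip blank lines
        raw_id = row[0].strip()
        # Valid tweet IDs consist only of ASCII digits.
        if not (raw_id.isascii() and raw_id.isdigit()):
            yield line_number, raw_id


# Made-up sample rows; only the second mirrors the malformed line quoted above.
sample = io.StringIO(
    "1220650000000000001,category_news\n"
    "1.22065E+18,category_news\n"
)
print(list(find_invalid_ids(sample)))  # [(2, '1.22065E+18')]
```

A header row, if the file has one, would also be reported and can simply be ignored.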

Download Twitter-Sentiment SemEval

Hi,

I want to reproduce your results on the SemEval 2016 Task 4 dataset (http://alt.qcri.org/semeval2016/task4/index.php?id=data-and-tools), but the download link provided gives only the IDs of the tweets. I tried to download the tweets with the Twitter API, but many of the requests return 404 errors.

Do you have a solution?

TypeError

I'm trying to run the Colab notebook given in the README file and am getting this:

Any idea how to resolve it?

Logic behind [UNK] token for pronouns

Thanks a lot for the nice work!

What is the logic behind replacing pronouns with the unknown token [UNK]? This seems to be a major deviation from standard BERT models.

For example:

tokenizer = AutoTokenizer.from_pretrained("digitalepidemiologylab/covid-twitter-bert", do_lower_case=False)
tokenizer.tokenize('She is cool!')

outputs
['[UNK]', 'is', 'cool', '!']

while

tokenizer2 = AutoTokenizer.from_pretrained("bert-large-cased", do_lower_case=False)
tokenizer2.tokenize('She is cool!')

outputs
['She', 'is', 'cool', '!']
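
A plausible explanation, assuming CT-BERT inherits the uncased vocabulary of the BERT-large-uncased model it is compared with in the fill-mask issue above: an uncased vocabulary contains no capitalised entries, so with `do_lower_case=False` the string `She` is looked up as-is and not found. This would affect any capitalised word, not only pronouns. The toy lookup below uses a made-up four-entry vocabulary and a hypothetical `lookup` function to show the effect; it is not the real tokenizer.

```python
# Toy uncased vocabulary; a real uncased BERT vocabulary likewise has no
# capitalised word pieces.
UNCASED_VOCAB = {"she", "is", "cool", "!"}


def lookup(tokens, vocab, do_lower_case):
    """Map each token to itself if it is in the vocabulary, else to [UNK]."""
    out = []
    for token in tokens:
        if do_lower_case:
            token = token.lower()
        out.append(token if token in vocab else "[UNK]")
    return out


words = ["She", "is", "cool", "!"]
print(lookup(words, UNCASED_VOCAB, do_lower_case=False))  # ['[UNK]', 'is', 'cool', '!']
print(lookup(words, UNCASED_VOCAB, do_lower_case=True))   # ['she', 'is', 'cool', '!']
```

If this is the cause, loading the CT-BERT tokenizer with `do_lower_case=True` should avoid the `[UNK]` tokens.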

Strategies for downstream NER tasks with more labels

Hi,

This is a very good paper for exploring COVID-related information on Twitter. Are there any strategies you could recommend for downstream NER tasks that use labels other than those mentioned in the paper? For instance, we would like to explore NER labels such as 'treatment', 'problem', etc.

Thanks!

No proper encodings for covid-related terms

I have just checked the encodings that AutoTokenizer produces. For the words "wuhan", "ncov", "coronavirus", "covid", and "sars-cov-2" it produces more than one token, while it produces a single token for 'conventional' words like apple.
E.g.

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("digitalepidemiologylab/covid-twitter-bert-v2", do_lower_case=True)
tokenizer(['wuhan', "covid", "coronavirus", "sars-cov-2", "apple", "city"], truncation=True, padding=True, max_length=512)

Result:

{'input_ids': [[101, 8814, 4819, 102, 0, 0, 0, 0, 0], [101, 2522, 17258, 102, 0, 0, 0, 0, 0], [101, 21887, 23350, 102, 0, 0, 0, 0, 0], [101, 18906, 2015, 1011, 2522, 2615, 1011, 1016, 102], [101, 6207, 102, 0, 0, 0, 0, 0, 0], [101, 2103, 102, 0, 0, 0, 0, 0, 0]]}.

As you can see, there are two encoded values each for 'wuhan', "covid", and "coronavirus" ([8814, 4819], [2522, 17258], and [21887, 23350] respectively), but a single id for apple and city (as it should be: [6207] and [2103]).

I have also checked tokenizer dictionary (vocab.txt) from https://huggingface.co/digitalepidemiologylab/covid-twitter-bert-v2/tree/main
and there are no such terms as "wuhan", "ncov", "coronavirus", "covid", or "sars-cov-2" (as mentioned in the readme - https://huggingface.co/digitalepidemiologylab/covid-twitter-bert-v2).

I wonder why the model does not recognize covid-related terms, and how I can make the model 'understand' these terms. It seems that the poor performance of models in my specific case (web texts that mention covid only once) may be related to this issue.
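
The multi-token encodings are consistent with ordinary WordPiece behaviour: a word that is absent from the vocabulary is split into known sub-word pieces rather than being rejected. The ids for apple and city match the standard uncased BERT vocabulary, which suggests the vocabulary was kept unchanged during domain pretraining. The sketch below is a toy greedy longest-match-first splitter with a made-up vocabulary; the real pieces for these words may differ.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one lowercase word into pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces are prefixed
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matches: the whole word becomes unknown
        pieces.append(piece)
        start = end
    return pieces


# Made-up vocabulary for illustration only.
TOY_VOCAB = {"apple", "city", "co", "##vid", "wu", "##han"}

for w in ["apple", "covid", "wuhan"]:
    print(w, wordpiece(w, TOY_VOCAB))
# apple ['apple']
# covid ['co', '##vid']
# wuhan ['wu', '##han']
```

A multi-piece word is still represented through the contextual embeddings of its pieces, so splitting alone does not mean the model ignores the term. If single-token entries are really needed, `tokenizer.add_tokens` followed by `model.resize_token_embeddings` exists in `transformers`, but the new embeddings start untrained and need further training.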

Add link to preprint in readme

Mention the preprint ( https://arxiv.org/abs/2005.07503 ) in readme.

Failed to download dataset

Hey there,
I am trying to download the SST-2 dataset, but it returns a 403 error.
The error message is below:
{
"error": {
"code": 403,
"message": "Permission denied. Could not perform this operation"
}
}

A few observations.

I did a run-through with this model, classifying comments on a petition that had a lot of traction with antivaxxers. The results were pretty mixed.

Some observations.
The model really needs an initial classifier, trained to ask a more basic question: "Is this comment about covid at all?" I found it gave negative or positive classifications to comments that really didn't apply (e.g. one stating something to the effect of "I haven't seen my family in a year because of border restrictions", which neither I nor the algorithm has any way of evaluating for truthfulness).

This is to be expected, as the model is for classifying covid statements and thus has no real frame of reference for dealing with statements that aren't actually about covid per se.
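
The two-stage idea can be sketched as a relevance gate in front of the classifier. Everything here is hypothetical: the keyword list, the function names, and the stand-in classifier. In practice a trained relevance model would replace the keyword check.

```python
# Hypothetical keyword stems; a trained relevance classifier would be more robust.
COVID_KEYWORDS = ("covid", "coronavirus", "vaccin", "pandemic")


def is_about_covid(text, keywords=COVID_KEYWORDS):
    """Crude relevance check: does the text mention any covid-related stem?"""
    lowered = text.lower()
    return any(k in lowered for k in keywords)


def classify_with_gate(text, classifier):
    """Run the downstream classifier only on comments that pass the gate."""
    if not is_about_covid(text):
        return "not_applicable"
    return classifier(text)


def dummy_classifier(text):
    """Stand-in for the fine-tuned model; always returns the same label."""
    return "needs_review"


print(classify_with_gate(
    "I haven't seen my family in a year because of border restrictions",
    dummy_classifier,
))  # not_applicable
print(classify_with_gate("The covid vaccine changes your DNA", dummy_classifier))
# needs_review
```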

What I DID notice is that, for those misclassified statements, the model seems to be homing in on the hostility of the statements. As the petition is largely about dropping Australian border restrictions and vaccine mandates, it's to be expected that a large number of signers are antivax activists, who have had a tendency to be somewhat aggressive. That has me wondering whether the model is actually responding to the tone of 'voice' in the comments, producing strong "false" or "misleading" signals if the input text is aggressive in nature.

Edit: Oh, and for further context, the petition claimed to be from "West Australian doctors". Quickly plugging names into the register of medical practitioners revealed that the majority of signers are not actually medical practitioners (and worse, there's some evidence that, of those that are, at least some might have had their names entered onto the petition without consent, with data coming from the register of practitioners) or are practitioners from non-medical fields like chiropractors and other pseudoscience professions, so I'm not sure how that impacts the result. Perhaps the algorithm is picking up untruthfulness signals that I'm missing myself.

Loss

Hello,

Thank you for releasing a fine-tuned BERT model.
Could I ask what the loss is capturing, i.e., whether the model is trained on masked language modelling (MLM), next sentence prediction (NSP) or sentence order prediction (SOP)?
Thank you!

CUDA out of memory Error from loading CT-Bert directly from Huggingface

I was trying to fine-tune CT-BERT on my own data and created a very simple classifier on top of it. I used

from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("digitalepidemiologylab/covid-twitter-bert")

model = AutoModel.from_pretrained("digitalepidemiologylab/covid-twitter-bert")

as suggested.
However, when training the model on a cluster GPU with 10 GB of memory, I get this error:

RuntimeError: CUDA out of memory. Tried to allocate 22.00 MiB (GPU 0; 10.73 GiB total capacity; 9.54 GiB already allocated; 19.56 MiB free; 9.86 GiB reserved in total by PyTorch)

I tried the same classifier with the bert-base-uncased model and did not encounter this problem.
So I was wondering whether the model is too large for this GPU, or whether I should do something extra.
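
A rough back-of-the-envelope estimate suggests why the two models behave differently: CT-BERT uses the BERT-large architecture, which has about three times as many parameters as BERT-base. The sketch below counts only weights, gradients, and Adam optimizer state in fp32; activations, which grow with batch size and sequence length, come on top. The parameter counts are the rounded figures from the BERT paper, and the 16-bytes-per-parameter rule assumes plain fp32 training with Adam.

```python
# weights (4 bytes) + gradients (4 bytes) + two Adam moment buffers (8 bytes)
BYTES_PER_PARAM_FP32_ADAM = 4 + 4 + 8


def training_state_gib(num_params, bytes_per_param=BYTES_PER_PARAM_FP32_ADAM):
    """Memory in GiB for weights, gradients and optimizer state (no activations)."""
    return num_params * bytes_per_param / 2**30


for name, params in [("BERT-base", 110_000_000), ("BERT-large", 340_000_000)]:
    print(f"{name}: ~{training_state_gib(params):.1f} GiB before activations")
```

Under these assumptions the large model needs roughly 5 GiB before any activations are stored, which leaves little headroom on a 10 GB card. Common mitigations include a smaller batch size, a shorter maximum sequence length, gradient accumulation, and mixed-precision training.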
