
covid-twitter-bert's People

Contributors

arnaudmiribel, francoispichard, mar-muel, marcelsalathe, peregilk


covid-twitter-bert's Issues

Masked word prediction returning "unused" tokens

Hi, I am trying to run basic masked word prediction with the PyTorch transformers pipeline to compare BERT-large-uncased-WWM and COVID-Twitter-BERT for a publication.

from transformers import pipeline, AutoModel, AutoTokenizer
pipe = pipeline(task='fill-mask', framework='pt',
                 model="bert-large-uncased-whole-word-masking",
                 device=0)
pipe(f"This is the best thing I've {pipe.tokenizer.mask_token} in my life.")

returns meaningful results:

[{'sequence': "[CLS] this is the best thing i've done in my life. [SEP]",
  'score': 0.45353376865386963,  'token': 2589},
 {'sequence': "[CLS] this is the best thing i've experienced in my life. [SEP]",
  'score': 0.2302728146314621,  'token': 5281},
 {'sequence': "[CLS] this is the best thing i've seen in my life. [SEP]",
  'score': 0.0811614915728569,  'token': 2464},
 {'sequence': "[CLS] this is the best thing i've felt in my life. [SEP]",
  'score': 0.06349574029445648,  'token': 2371},
 {'sequence': "[CLS] this is the best thing i've had in my life. [SEP]",
  'score': 0.058649078011512756,  'token': 2018}]

However

model = AutoModel.from_pretrained(pretrained_model_name_or_path='digitalepidemiologylab/covid-twitter-bert', from_tf=True)
tokenizer = AutoTokenizer.from_pretrained('digitalepidemiologylab/covid-twitter-bert', do_lower_case=True)
pipe = pipeline(task='fill-mask', framework='pt',
                 model=model,
                 tokenizer=tokenizer,
                 device=0)
pipe(f"This is the best thing I've {pipe.tokenizer.mask_token} in my life.")

returns

[{'sequence': "[CLS] this is the best thing i've [unused751] in my life. [SEP]",
  'score': 0.012442857027053833,  'token': 756},
 {'sequence': "[CLS] this is the best thing i've [unused803] in my life. [SEP]",
  'score': 0.00925508514046669,  'token': 808},
 {'sequence': "[CLS] this is the best thing i've [unused465] in my life. [SEP]",
  'score': 0.009094784036278725,  'token': 470},
 {'sequence': "[CLS] this is the best thing i've [unused490] in my life. [SEP]",
  'score': 0.008908418007194996,  'token': 495},
 {'sequence': "[CLS] this is the best thing i've [unused91] in my life. [SEP]",
  'score': 0.008854263462126255,  'token': 92}]

Would really appreciate help here.
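A hedged note, not from the original thread: AutoModel loads the bare encoder without a masked-LM head, so the fill-mask pipeline ends up ranking hidden-state dimensions rather than vocabulary logits (notice that every predicted token id above is below 1024, the hidden size of BERT-large). Loading the checkpoint with a masked-LM head is the usual fix; a minimal sketch, assuming a transformers version that provides AutoModelForMaskedLM (older releases expose AutoModelWithLMHead instead):

from transformers import pipeline, AutoModelForMaskedLM, AutoTokenizer

# Load the checkpoint together with its masked-LM head instead of the bare encoder.
model = AutoModelForMaskedLM.from_pretrained('digitalepidemiologylab/covid-twitter-bert', from_tf=True)
tokenizer = AutoTokenizer.from_pretrained('digitalepidemiologylab/covid-twitter-bert', do_lower_case=True)
pipe = pipeline(task='fill-mask', framework='pt', model=model, tokenizer=tokenizer, device=0)
pipe(f"This is the best thing I've {pipe.tokenizer.mask_token} in my life.")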

License?

Hi all,

Thanks for sharing a great paper and accompanying code! I would love to re-use some of your text processing functions elsewhere (probably as part of a PR to an open source library), however you haven't defined a license file for the repo. Without it:

all rights are reserved and it is not Open Source or Free. You cannot modify or redistribute this code without explicit permission from the copyright holder.

Would you consider a more open source friendly license? :)

Failed to convert the TensorFlow model to a PyTorch model

Hi,
Thank you for the wonderful model and repository in these much-needed times.

I tried to convert the covid-twitter-bert TensorFlow model, downloaded from the TF2 Checkpoint, to a PyTorch model using the Hugging Face TensorFlow checkpoint conversion (transformers version 3.0.0). However, it failed with the following error:

INFO:transformers.modeling_bert:Skipping _CHECKPOINTABLE_OBJECT_GRAPH
Traceback (most recent call last):
  File "/usr/local/bin/transformers-cli", line 32, in <module>
    service.run()
  File "/usr/local/lib/python3.6/dist-packages/transformers/commands/convert.py", line 78, in run
    convert_tf_checkpoint_to_pytorch(self._tf_checkpoint, self._config, self._pytorch_dump_output)
  File "/usr/local/lib/python3.6/dist-packages/transformers/convert_bert_original_tf_checkpoint_to_pytorch.py", line 36, in convert_tf_checkpoint_to_pytorch
    load_tf_weights_in_bert(model, config, tf_checkpoint_path)
  File "/usr/local/lib/python3.6/dist-packages/transformers/modeling_bert.py", line 124, in load_tf_weights_in_bert
    assert pointer.shape == array.shape
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 594, in __getattr__
    type(self).__name__, name))
AttributeError: 'BertForPreTraining' object has no attribute 'shape'

The same conversion method was used with the BERT-base (cased_L-12_H-768_A-12) TensorFlow model and it converted successfully, so I assume this error is specific to the covid-twitter-bert model. Comparing it with other BERT TensorFlow models, I found that the covid-twitter-bert TensorFlow model is missing a .meta file, and I suspect that could be the reason for this error.

If the error is caused by the missing .meta file, could you please provide that file? Otherwise, do you have any other solution for this issue, or could you make a PyTorch model available?
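A hedged workaround, assuming the goal is simply to obtain PyTorch weights rather than to run the converter itself: the checkpoint published on the Hugging Face hub as digitalepidemiologylab/covid-twitter-bert can be loaded by name and re-saved locally, which sidesteps the manual checkpoint conversion:

from transformers import AutoModel, AutoTokenizer

# Load the hub checkpoint and re-save it as a local PyTorch model directory.
model = AutoModel.from_pretrained("digitalepidemiologylab/covid-twitter-bert")
tokenizer = AutoTokenizer.from_pretrained("digitalepidemiologylab/covid-twitter-bert")
model.save_pretrained("./covid-twitter-bert-pytorch")      # writes pytorch_model.bin and config.json
tokenizer.save_pretrained("./covid-twitter-bert-pytorch")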

What are the tokens for usernames and hashtags in the BERT vocabulary?

In the paper, it is mentioned: "Each tweet was pseudonymized by replacing all Twitter usernames with a common text token. A similar procedure was performed on all URLs to web pages. We also replaced all unicode emoticons with textual ASCII representations (e.g. 😄 for ☺️) using the Python emoji library"

But it is not exactly clear which tokens are used in place of usernames and URLs. Are they documented anywhere?
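One way to probe this without documentation is to tokenize candidate filler strings and see which of them map to a single vocabulary entry. A minimal sketch; the candidates below ("twitteruser", "twitterurl", "@user", "<url>") are guesses for illustration, not confirmed values from the paper:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("digitalepidemiologylab/covid-twitter-bert")
for candidate in ["twitteruser", "twitterurl", "@user", "<url>"]:
    # A single-item result suggests the string exists as one vocabulary entry.
    print(candidate, tokenizer.tokenize(candidate))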

Failed to find data adapter

Hi, I've been trying to use CT-BERT on my local machine and am running into this issue when using my own dataset. I'm using a modified version of the GPU notebook provided here. The issue is:
(screenshot of the error)

And the code I'm using to load the dataset is as follows:
(screenshot of the data-loading code)

I don't understand why the issue is cropping up since everything looks alright.
My system specs are:
OS: Windows 10
GPU: RTX 3080
CUDA: 11.3
CudNN: 8.2
Python: 3.9
TF: 2.5
Any help will be greatly appreciated, since I've spent most of my Christmas break on this :D
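A hedged guess, since the screenshots are not reproduced here: Keras raises "Failed to find data adapter" when model.fit receives inputs it cannot interpret, typically plain Python lists or pandas objects; converting inputs and labels to numpy arrays (or a tf.data.Dataset) usually resolves it. A self-contained toy sketch of the pattern, unrelated to the actual notebook code:

import numpy as np
import tensorflow as tf

# Toy token ids and labels; the point is only that fit() receives numpy arrays.
input_ids = np.asarray([[101, 2023, 2003, 102], [101, 2009, 2001, 102]], dtype=np.int32)
labels = np.asarray([0, 1], dtype=np.int32)

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=30522, output_dim=16),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(2, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.fit(input_ids, labels, epochs=1)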

TypeError

Trying to run the Colab notebook given in the README file and getting this error:
(screenshot of the error)
Any idea how to resolve it?

Logic behind [UNK] token for pronouns

Thanks a lot for the nice work!

What is the logic behind mapping pronouns to the unknown token [UNK]? This seems to be a major deviation from standard BERT models.

For example:

tokenizer = AutoTokenizer.from_pretrained("digitalepidemiologylab/covid-twitter-bert", do_lower_case=False)
tokenizer.tokenize('She is cool!')

outputs
['[UNK]', 'is', 'cool', '!']

while

tokenizer2 = AutoTokenizer.from_pretrained("bert-large-cased", do_lower_case=False)
tokenizer2.tokenize('She is cool!')

outputs
['She', 'is', 'cool', '!']
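A hedged explanation: CT-BERT ships an uncased vocabulary, so with do_lower_case=False the capitalized form "She" is not found in the vocab and falls back to [UNK]. Enabling lowercasing (presumably the setting the model was trained with) should avoid this:

from transformers import AutoTokenizer

# With lowercasing enabled, 'She' becomes 'she', which is in the uncased vocabulary.
tokenizer = AutoTokenizer.from_pretrained("digitalepidemiologylab/covid-twitter-bert", do_lower_case=True)
print(tokenizer.tokenize('She is cool!'))  # expected: ['she', 'is', 'cool', '!']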

Strategies for downstream NER tasks with more labels

Hi,

It is a very good paper exploring Twitter-related COVID info. I am wondering whether there are any recommended strategies for downstream NER tasks with labels other than those mentioned in the paper? For instance, we would like to explore NER labels such as 'treatment', 'problem', etc.

Thanks!
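For what it is worth, a minimal sketch of one common strategy (not prescribed by the paper): put a token-classification head on top of the released checkpoint with whatever label set the task needs, then fine-tune on annotated data. The BIO label scheme below is hypothetical:

from transformers import AutoTokenizer, AutoModelForTokenClassification

labels = ["O", "B-treatment", "I-treatment", "B-problem", "I-problem"]  # hypothetical label set
tokenizer = AutoTokenizer.from_pretrained("digitalepidemiologylab/covid-twitter-bert")
model = AutoModelForTokenClassification.from_pretrained(
    "digitalepidemiologylab/covid-twitter-bert",
    num_labels=len(labels),
)
# Fine-tuning on the annotated NER data (e.g. with the Trainer API) would follow here.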

No proper encodings for covid-related terms

I have just checked the encodings that the AutoTokenizer produces. It seems that for words like "wuhan", "ncov", "coronavirus", "covid", or "sars-cov-2" it produces more than one token, while the tokenizer produces a single token for 'conventional' words like apple.
E.g.

from transformers import  AutoTokenizer
tokenizer =  AutoTokenizer.from_pretrained("digitalepidemiologylab/covid-twitter-bert-v2", do_lower_case=True)
tokenizer(['wuhan', "covid","coronavirus","sars-cov-2","apple","city"], truncation=True, padding=True, max_length=512)

Result:

{'input_ids': [[101, 8814, 4819, 102, 0, 0, 0, 0, 0], [101, 2522, 17258, 102, 0, 0, 0, 0, 0], [101, 21887, 23350, 102, 0, 0, 0, 0, 0], [101, 18906, 2015, 1011, 2522, 2615, 1011, 1016, 102], [101, 6207, 102, 0, 0, 0, 0, 0, 0], [101, 2103, 102, 0, 0, 0, 0, 0, 0]]}. 

As you can see, there are two encoded values each for 'wuhan', "covid", and "coronavirus" ([8814, 4819], [2522, 17258], and [21887, 23350] respectively), while there is a single id for apple and city (as it should be: [6207] and [2103]).

I have also checked the tokenizer dictionary (vocab.txt) from https://huggingface.co/digitalepidemiologylab/covid-twitter-bert-v2/tree/main and there are no such terms as "wuhan", "ncov", "coronavirus", "covid", or "sars-cov-2" (as mentioned in the readme - https://huggingface.co/digitalepidemiologylab/covid-twitter-bert-v2).

I wonder why the model does not recognize covid-related terms, and how I can make the model 'understand' them? It seems that the poor performance of the models in my specific case (web texts that mention covid only once) may be related to this issue.
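A hedged sketch of one common workaround (a general transformers technique, not something specific to this repository): register the missing terms as new tokens, resize the embedding matrix, and then fine-tune so the newly initialized embeddings are learned:

from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("digitalepidemiologylab/covid-twitter-bert-v2")
model = AutoModel.from_pretrained("digitalepidemiologylab/covid-twitter-bert-v2")

# Add the domain terms as whole tokens and grow the embedding table to the new vocab size.
num_added = tokenizer.add_tokens(["covid", "coronavirus", "wuhan", "ncov", "sars-cov-2"])
model.resize_token_embeddings(len(tokenizer))
# The new embeddings start untrained, so further pre-training or fine-tuning is required.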

Failed to download dataset

Hey there,
I am trying to download the SST-2 dataset, but it returns a 403 error.
The error message is below:
{
  "error": {
    "code": 403,
    "message": "Permission denied. Could not perform this operation"
  }
}

A few observations.

I had a run-through with this model, classifying comments on a petition that had a lot of traction with antivaxxers. The results were pretty mixed.

Some observations.
The model really needs an initial classifier to be trained and run first, asking the more basic question "Is this comment about covid at all?" I found it gave negative or positive classifications to comments that really didn't apply (i.e. one stating something to the effect of "I haven't seen my family in a year because of border restrictions", which I, or the algorithm, really have no way of evaluating for truthfulness).

This is to be expected, as the model is for classifying covid statements and thus has no real frame of reference for dealing with statements that aren't actually about covid per se.

What I DID notice is that, of those misclassified statements, it seems to be homing in on the hostility of the statements. As the petition is largely about dropping Australian border restrictions and vaccine mandates, it's to be expected that a large number of signers are antivax activists, who have a tendency to be somewhat aggressive. That has me wondering whether the model is actually responding to the tone of 'voice' in the comments, producing strong "false" or "misleading" signals if the input text is aggressive in nature?

edit: Oh, and for further context, the petition was one claiming to be from "West Australian doctors". A quick plugging of names into the register of medical practitioners revealed that the majority of signers are not actually medical practitioners (and worse, there's some evidence that of those who are, at least some might have had their names entered onto the petition without consent, with data coming from the registry of practitioners), or are practitioners from non-medical fields like chiropractic and other pseudoscience professions, so I'm not sure how that impacts the result. Perhaps the algorithm is picking up untruthfulness signals that I'm missing myself.

Loss

Hello,

Thank you for releasing a fine-tuned BERT model.
Could I ask what the loss is capturing, i.e. whether the model is trained on masked language modelling (MLM), next sentence prediction (NSP), or sentence order prediction (SOP)?
Thank you!

CUDA out of memory error when loading CT-BERT directly from Hugging Face

I was trying to fine-tune CT-BERT on my own data and have created a very simple classifier on top of CT-BERT. I used

from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("digitalepidemiologylab/covid-twitter-bert")

model = AutoModel.from_pretrained("digitalepidemiologylab/covid-twitter-bert")

as suggested.
However, when training the model on a cluster with 10 GB of GPU memory I am getting this error:

RuntimeError: CUDA out of memory. Tried to allocate 22.00 MiB (GPU 0; 10.73 GiB total capacity; 9.54 GiB already allocated; 19.56 MiB free; 9.86 GiB reserved in total by PyTorch)

I tried the same classifier with the bert-base-uncased model and did not encounter this problem.
So I was wondering whether the model is too large for the GPU, or should I do something extra?
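A hedged note: CT-BERT uses the BERT-large architecture (~340M parameters), so on a ~10 GB GPU it generally only fits with a small batch size and a short maximum sequence length; mixed precision and gradient accumulation also help. The numbers below are illustrative, not recommendations from the authors:

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("digitalepidemiologylab/covid-twitter-bert")
model = AutoModel.from_pretrained("digitalepidemiologylab/covid-twitter-bert").cuda()

# A small batch and short max_length keep activation memory down on a ~10 GB card.
batch = tokenizer(["example tweet"] * 8, padding=True, truncation=True,
                  max_length=96, return_tensors="pt").to("cuda")
with torch.cuda.amp.autocast():  # mixed precision roughly halves activation memory
    outputs = model(**batch)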
