
covid19-backend's People

Contributors

rloganiv, sameersingh, tamannahossainkay, yoshitomo-matsubara

Forkers

biancamusat

covid19-backend's Issues

merged.csv

Is there an instruction for how to reproduce the merged.csv file? Also, what is the purpose of the random-number columns?

Additional Baselines

  • Average GloVe vectors + logistic regression
  • Average BERT vectors + logistic regression
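
A minimal sketch of the first baseline, assuming a GloVe-style token-to-vector table (a tiny toy table stands in here for real pretrained vectors, and the labels are hypothetical):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

DIM = 4
# Hypothetical stand-in for a loaded GloVe table: token -> vector.
embeddings = {
    "masks": np.array([0.9, 0.1, 0.0, 0.2]),
    "work": np.array([0.8, 0.2, 0.1, 0.0]),
    "hoax": np.array([-0.7, 0.9, 0.3, 0.1]),
    "virus": np.array([0.1, 0.8, -0.2, 0.4]),
}

def featurize(text: str) -> np.ndarray:
    """Average the vectors of known tokens; zeros if none are known."""
    vecs = [embeddings[t] for t in text.lower().split() if t in embeddings]
    return np.mean(vecs, axis=0) if vecs else np.zeros(DIM)

texts = ["masks work", "virus hoax", "masks virus work", "hoax"]
labels = [1, 0, 1, 0]  # toy stance labels
X = np.stack([featurize(t) for t in texts])
clf = LogisticRegression().fit(X, labels)
preds = clf.predict(X)
```

The BERT variant would be the same pipeline with averaged contextual vectors in place of the GloVe averages.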

Reproducing results

Hello!
I'm trying to reproduce the results of the paper https://openreview.net/pdf?id=FCna-s-ZaIE, but I'm stuck at the section on training the models using a .jsonl file. Where do those .jsonl files come from? Do they need to be generated from the database? Is the training data the one in the covid-lies repo?
Thanks a lot!

Problems reproducing results of BiLSTM

Training under CentOS Linux on an Nvidia Tesla V100 with CUDA 10.1, in a correctly built repo under the recommended conda environment.

Steps to reproduce:

1. Run: python3 -m scripts.ml.train_bilstm --train data/multinli_1.0/multinli_1.0_train.jsonl --dev data/multinli_1.0/multinli_1.0_dev_matched.jsonl --output-dir /covid19-backend/models/ --epochs 20

Error Message:
Traceback (most recent call last):
File "/apps/developers/compilers/anaconda/2019.10/1/default/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/apps/developers/compilers/anaconda/2019.10/1/default/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/nobackup/ptdhv/covid19-backend/scripts/ml/train_bilstm.py", line 164, in <module>
main()
File "/nobackup/ptdhv/covid19-backend/scripts/ml/train_bilstm.py", line 120, in main
acc = accuracy(predictions, labels)
File "/nobackup/ptdhv/covid19-backend/scripts/ml/train_bilstm.py", line 40, in accuracy
return correct.sum() / length
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!
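
The traceback suggests `labels` stays on the CPU while `predictions` live on `cuda:0`. A minimal sketch of a likely fix, reusing the `accuracy` name from the traceback (the repo's exact code is assumed, not known), is to move one tensor onto the other's device before comparing:

```python
import torch

def accuracy(predictions: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    # Move labels onto the predictions' device (e.g., cuda:0) before comparing.
    labels = labels.to(predictions.device)
    correct = (predictions == labels).float()
    return correct.sum() / labels.numel()

preds = torch.tensor([1, 0, 2, 1])
golds = torch.tensor([1, 0, 1, 1])
acc = accuracy(preds, golds)  # CPU example; the same code works on GPU
```

Alternatively, the training loop could call `labels = labels.to(device)` alongside the inputs so every batch lands on one device.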

Create evaluation script for model predictions.

Requirements

Create an evaluation script: scripts/ml/evaluate.py

The script will need to take as inputs:

  1. Misconception data
  2. A trained model (in the long term we want to support any Detector, but in the short term it will suffice to evaluate only SentenceBertClassifier-style models)
  3. Annotations

The script should output the following evaluation metrics:

  1. Accuracy of classification models (e.g., SentenceBERT / anything trained to perform NLI). This is a straightforward computation: the misconception and tweet columns provide the model inputs and the pos/neg/na column provides the gold label.
  2. Hits@k and mean reciprocal rank for all models. Given an input (e.g., a tweet or reddit post), compute the positive-match score between that input and every misconception. The metrics are then computed by ranking the misconceptions by score and identifying the rank of the misconception actually being expressed.
  3. Precision-recall curve. Ultimately we will need a score cutoff to predict whether or not a tweet truly expresses a misconception; a precision-recall curve will be useful for identifying the best cutoff.
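
The ranking metrics in point 2 can be sketched as follows; the scoring model itself is assumed, so each example is reduced to a score vector plus the index of the true misconception:

```python
def rank_of_true(scores, true_index):
    """0-based rank of the true misconception when sorted by descending score."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order.index(true_index)

def hits_at_k(positions, k):
    """Fraction of examples whose true misconception ranks in the top k."""
    return sum(pos < k for pos in positions) / len(positions)

def mrr(positions):
    """Mean reciprocal rank (positions are 0-based)."""
    return sum(1.0 / (pos + 1) for pos in positions) / len(positions)

examples = [
    ([0.9, 0.2, 0.5], 0),  # true misconception ranked 1st -> position 0
    ([0.1, 0.8, 0.3], 2),  # true misconception ranked 2nd -> position 1
]
positions = [rank_of_true(s, t) for s, t in examples]
```

On these two toy examples, Hits@1 is 0.5 and MRR is 0.75.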

Problem training SBERT and SBERT DA to reproduce results

I'm having problems running the training scripts for SBERT and SBERT DA.

My setup is:
LSB Version: :core-4.1-amd64:core-4.1-noarch
Distributor ID: CentOS
Description: CentOS Linux release 7.9.2009 (Core)
Release: 7.9.2009

Steps to reproduce:

1. After cloning and installing successfully, run:
python3 scripts/ml/train_nli.py --model-name digitalepidemiologylab/covid-twitter-bert --batch_size=10 --epochs=10 --lr=5e-5 --accumulation_steps 32 --train data/multinli_1.0/multinli_1.0_train.jsonl --dev data/multinli_1.0/multinli_1.0_dev_matched.jsonl --ckpt my_model

The error message is:
Traceback (most recent call last):
File "scripts/ml/train_nli.py", line 17, in <module>
from backend.ml.sentence_bert import SentenceBertClassifier
File "/nobackup/ptdhv/covid19-backend/backend/ml/sentence_bert.py", line 34, in <module>
class SentenceBertBase(Detector, torch.nn.Module):
File "/nobackup/ptdhv/covid19-backend/backend/ml/sentence_bert.py", line 50, in SentenceBertBase
loss_kwargs: Dict[str, Any] = None) -> torch.FloatTensor:
File "/home/home01/ptdhv/.local/lib/python3.7/site-packages/overrides/overrides.py", line 88, in overrides
return _overrides(method, check_signature, check_at_runtime)
File "/home/home01/ptdhv/.local/lib/python3.7/site-packages/overrides/overrides.py", line 114, in _overrides
_validate_method(method, super_class, check_signature)
File "/home/home01/ptdhv/.local/lib/python3.7/site-packages/overrides/overrides.py", line 135, in _validate_method
ensure_signature_is_compatible(super_method, method, is_static)
File "/home/home01/ptdhv/.local/lib/python3.7/site-packages/overrides/signature.py", line 93, in ensure_signature_is_compatible
ensure_return_type_compatibility(super_type_hints, sub_type_hints, method_name)
File "/home/home01/ptdhv/.local/lib/python3.7/site-packages/overrides/signature.py", line 288, in ensure_return_type_compatibility
f"{method_name}: return type {sub_return} is not a {super_return}."
TypeError: SentenceBertBase.forward: return type <class 'torch.FloatTensor'> is not a <class 'NoneType'>.
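
The error comes from the `overrides` package enforcing signature compatibility: the base class's `forward` apparently carries no return annotation (treated as None), while the subclass annotates `torch.FloatTensor`. A pure-Python illustration of the mismatch and one plausible fix (aligning the two annotations; the class bodies here are simplified stand-ins, not the repo's code) is:

```python
import typing

class Detector:
    def forward(self, *args, **kwargs) -> typing.Any:  # annotate the base...
        raise NotImplementedError

class SentenceBertBase(Detector):
    def forward(self, *args, **kwargs) -> typing.Any:  # ...and match it here
        return 0.0

base_ret = typing.get_type_hints(Detector.forward)["return"]
sub_ret = typing.get_type_hints(SentenceBertBase.forward)["return"]
```

Pinning `overrides` to the version listed in the repo's requirements (if one is pinned there) may also avoid the stricter return-type check.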

Refactor to make better use of HuggingFace APIs.

  • Models should be AutoModels, so we can pass any pretrained model string/folder.
  • As a corollary, training scripts should use HuggingFace's serialization functionality instead of just saving weights (this prevents us from having to load a model just to use a separate set of weights)
  • Distinguish retrievers from stance detectors. Make both pipelines.
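
A sketch of the first two points, assuming the transformers `AutoModel` / `save_pretrained` API; a tiny randomly initialized BERT stands in for a real pretrained checkpoint so the example runs offline (in the repo the string would be a real model name or local folder):

```python
import tempfile
from transformers import AutoModel, BertConfig, BertModel

# Toy config standing in for a real checkpoint such as covid-twitter-bert.
config = BertConfig(vocab_size=100, hidden_size=32, num_hidden_layers=1,
                    num_attention_heads=2, intermediate_size=64)
model = BertModel(config)

with tempfile.TemporaryDirectory() as ckpt:
    model.save_pretrained(ckpt)                 # writes config.json + weights
    reloaded = AutoModel.from_pretrained(ckpt)  # class inferred from config
```

Because `save_pretrained` stores the config alongside the weights, any pretrained string or folder round-trips through `AutoModel.from_pretrained` without hard-coding the architecture class.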
