deepset-ai / covid-qa
API & Webapp to answer questions about COVID-19. Using NLP (Question Answering) and trusted data sources.
License: Apache License 2.0
I just posted an issue here on GitHub for the chatbot, because my colleague couldn't contact the team. A feedback button might also be helpful for this project.
We should create a simple evaluation dataset that can be used to benchmark our models for matching similar questions.
Something like this should be sufficient for a rough baseline:
Fine-tune BERT (or word embeddings?) on the CORD-19 dataset published on Kaggle:
[CORD-19 is a resource of over 29,000 scholarly articles, including over 13,000 with full text, about COVID-19, SARS-CoV-2, and related coronaviruses.]
The dataset is about 2 GB. I guess the domain is quite different from the FAQs, as the dataset is made up of scientific papers, but it could still be valuable for introducing some substantial vocabulary related to the virus.
Sending the query via Enter doesn't work anymore (tested in Chrome).
@sfakir didn't you already implement this?
It'd be better if we announced this product to the public on something like Product Hunt. Or has that been done somewhere already?
I followed the instructions here https://github.com/deepset-ai/COVID-QA/tree/master/backend. Perhaps a port is not configured correctly?
INFO: initializing identifier
WARNING: PUT http://localhost:9200/document [status:N/A request:0.004s]
Traceback (most recent call last):
File "/usr/local/lib/python3.7/site-packages/urllib3/connection.py", line 157, in _new_conn
(self._dns_host, self.port), self.timeout, **extra_kw
File "/usr/local/lib/python3.7/site-packages/urllib3/util/connection.py", line 84, in create_connection
raise err
File "/usr/local/lib/python3.7/site-packages/urllib3/util/connection.py", line 74, in create_connection
sock.connect(sa)
ConnectionRefusedError: [Errno 111] Connection refused
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.7/site-packages/elasticsearch/connection/http_urllib3.py", line 229, in perform_request
method, url, body, retries=Retry(False), headers=request_headers, **kw
File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 720, in urlopen
method, url, error=e, _pool=self, _stacktrace=sys.exc_info()[2]
File "/usr/local/lib/python3.7/site-packages/urllib3/util/retry.py", line 376, in increment
raise six.reraise(type(error), error, _stacktrace)
File "/usr/local/lib/python3.7/site-packages/urllib3/packages/six.py", line 735, in reraise
raise value
File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 672, in urlopen
chunked=chunked,
File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 387, in _make_request
conn.request(method, url, **httplib_request_kw)
File "/usr/local/lib/python3.7/http/client.py", line 1244, in request
self._send_request(method, url, body, headers, encode_chunked)
File "/usr/local/lib/python3.7/http/client.py", line 1290, in _send_request
self.endheaders(body, encode_chunked=encode_chunked)
File "/usr/local/lib/python3.7/http/client.py", line 1239, in endheaders
self._send_output(message_body, encode_chunked=encode_chunked)
File "/usr/local/lib/python3.7/http/client.py", line 1026, in _send_output
self.send(msg)
File "/usr/local/lib/python3.7/http/client.py", line 966, in send
self.connect()
File "/usr/local/lib/python3.7/site-packages/urllib3/connection.py", line 184, in connect
conn = self._new_conn()
File "/usr/local/lib/python3.7/site-packages/urllib3/connection.py", line 169, in _new_conn
self, "Failed to establish a new connection: %s" % e
urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPConnection object at 0x7f66d3bb9b10>: Failed to establish a new connection: [Errno 111] Connection refused
[the same WARNING and traceback repeat for three further retries]
Traceback (most recent call last):
File "/usr/local/lib/python3.7/site-packages/urllib3/connection.py", line 157, in _new_conn
(self._dns_host, self.port), self.timeout, **extra_kw
File "/usr/local/lib/python3.7/site-packages/urllib3/util/connection.py", line 84, in create_connection
raise err
File "/usr/local/lib/python3.7/site-packages/urllib3/util/connection.py", line 74, in create_connection
sock.connect(sa)
ConnectionRefusedError: [Errno 111] Connection refused
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.7/site-packages/elasticsearch/connection/http_urllib3.py", line 229, in perform_request
method, url, body, retries=Retry(False), headers=request_headers, **kw
File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 720, in urlopen
method, url, error=e, _pool=self, _stacktrace=sys.exc_info()[2]
File "/usr/local/lib/python3.7/site-packages/urllib3/util/retry.py", line 376, in increment
raise six.reraise(type(error), error, _stacktrace)
File "/usr/local/lib/python3.7/site-packages/urllib3/packages/six.py", line 735, in reraise
raise value
File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 672, in urlopen
chunked=chunked,
File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 387, in _make_request
conn.request(method, url, **httplib_request_kw)
File "/usr/local/lib/python3.7/http/client.py", line 1244, in request
self._send_request(method, url, body, headers, encode_chunked)
File "/usr/local/lib/python3.7/http/client.py", line 1290, in _send_request
self.endheaders(body, encode_chunked=encode_chunked)
File "/usr/local/lib/python3.7/http/client.py", line 1239, in endheaders
self._send_output(message_body, encode_chunked=encode_chunked)
File "/usr/local/lib/python3.7/http/client.py", line 1026, in _send_output
self.send(msg)
File "/usr/local/lib/python3.7/http/client.py", line 966, in send
self.connect()
File "/usr/local/lib/python3.7/site-packages/urllib3/connection.py", line 184, in connect
conn = self._new_conn()
File "/usr/local/lib/python3.7/site-packages/urllib3/connection.py", line 169, in _new_conn
self, "Failed to establish a new connection: %s" % e
urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPConnection object at 0x7f66d3bb9cd0>: Failed to establish a new connection: [Errno 111] Connection refused
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/bin/uvicorn", line 8, in <module>
sys.exit(main())
File "/usr/local/lib/python3.7/site-packages/click/core.py", line 829, in __call__
return self.main(*args, **kwargs)
File "/usr/local/lib/python3.7/site-packages/click/core.py", line 782, in main
rv = self.invoke(ctx)
File "/usr/local/lib/python3.7/site-packages/click/core.py", line 1066, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/usr/local/lib/python3.7/site-packages/click/core.py", line 610, in invoke
return callback(*args, **kwargs)
File "/usr/local/lib/python3.7/site-packages/uvicorn/main.py", line 331, in main
run(**kwargs)
File "/usr/local/lib/python3.7/site-packages/uvicorn/main.py", line 354, in run
server.run()
File "/usr/local/lib/python3.7/site-packages/uvicorn/main.py", line 382, in run
loop.run_until_complete(self.serve(sockets=sockets))
File "uvloop/loop.pyx", line 1456, in uvloop.loop.Loop.run_until_complete
File "/usr/local/lib/python3.7/site-packages/uvicorn/main.py", line 389, in serve
config.load()
File "/usr/local/lib/python3.7/site-packages/uvicorn/config.py", line 288, in load
self.loaded_app = import_from_string(self.app)
File "/usr/local/lib/python3.7/site-packages/uvicorn/importer.py", line 20, in import_from_string
module = importlib.import_module(module_str)
File "/usr/local/lib/python3.7/importlib/__init__.py", line 127, in import_module
return _bootstrap._gcd_import(name[level:], package, level)
File "<frozen importlib._bootstrap>", line 1006, in _gcd_import
File "<frozen importlib._bootstrap>", line 983, in _find_and_load
File "<frozen importlib._bootstrap>", line 967, in _find_and_load_unlocked
File "<frozen importlib._bootstrap>", line 677, in _load_unlocked
File "<frozen importlib._bootstrap_external>", line 728, in exec_module
File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
File "./backend/api.py", line 11, in <module>
from backend.controller.router import router as api_router
File "./backend/controller/router.py", line 3, in <module>
from backend.controller import autocomplete, model, feedback
File "./backend/controller/model.py", line 60, in <module>
excluded_meta_data=EXCLUDE_META_DATA_FIELDS,
File "/home/user/src/farm-haystack/haystack/database/elasticsearch.py", line 48, in __init__
self.client.indices.create(index=index, ignore=400, body=custom_mapping)
File "/usr/local/lib/python3.7/site-packages/elasticsearch/client/utils.py", line 92, in _wrapped
return func(*args, params=params, headers=headers, **kwargs)
File "/usr/local/lib/python3.7/site-packages/elasticsearch/client/indices.py", line 104, in create
"PUT", _make_path(index), params=params, headers=headers, body=body
File "/usr/local/lib/python3.7/site-packages/elasticsearch/transport.py", line 362, in perform_request
timeout=timeout,
File "/usr/local/lib/python3.7/site-packages/elasticsearch/connection/http_urllib3.py", line 241, in perform_request
raise ConnectionError("N/A", str(e), e)
elasticsearch.exceptions.ConnectionError: ConnectionError(<urllib3.connection.HTTPConnection object at 0x7f66d3bb9cd0>: Failed to establish a new connection: [Errno 111] Connection refused) caused by: NewConnectionError(<urllib3.connection.HTTPConnection object at 0x7f66d3bb9cd0>: Failed to establish a new connection: [Errno 111] Connection refused)
Find official data sources for FAQ about COVID-19 in different languages and scrape them.
For now we have individual scrapers for each site. Adding more sites is a very manual and slow process and existing scrapers fail when the site changes slightly. See individual scrapers here.
Automatic Scraper
We need a scraper that takes in a URL to an FAQ page and automatically extracts questions and answers in a structured way. The scraper might need some NLP based question detection to identify which parts need to be extracted. For some pseudo code see here.
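As a rough sketch of the idea (the question-detection heuristic and function name below are made up for illustration; they are not the pseudo code linked above):

```python
# Hypothetical generic FAQ extractor: instead of one hand-written scraper
# per site, detect question lines heuristically and treat the text until
# the next question as the answer.
import re

QUESTION_RE = re.compile(
    r"^(what|how|why|when|where|who|can|is|are|do|does|should)\b.*\?$",
    re.IGNORECASE,
)

def extract_faq_pairs(lines):
    """Group plain-text lines into (question, answer) pairs."""
    pairs, question, answer = [], None, []
    for line in (l.strip() for l in lines):
        if not line:
            continue
        if QUESTION_RE.match(line):
            if question and answer:
                pairs.append((question, " ".join(answer)))
            question, answer = line, []
        elif question:
            answer.append(line)
    if question and answer:
        pairs.append((question, " ".join(answer)))
    return pairs
```

A real version would still need the NLP-based question detection mentioned above, since many FAQ headings are not phrased as literal questions.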
Datasources
We can curate a sheet of official FAQ pages and crawl relevant information more quickly.
That way the community can check the validity of the source FAQ and if the extraction worked.
Hey,
what is the current model used for https://covid-staging.deepset.ai/answers? :-)
I find it very accurate for my questions and single words (in German). Have you fine-tuned on the German corona QA pairs? Do you have any trained deep-learning matching algorithm in use? I can't imagine that the model just uses cosine similarity with BERT, because plain BERT does not perform as well in my case as the bot does right now.
I experimented with my own questions using the pretrained deepset model (German) and cosine similarity. I wonder why queries with words like "hallo" or "die" have only a marginally lower similarity than real corona-specific questions when just using the deepset German model. Those irrelevant words reach a similarity of around 90%...
Do you know any reason why this is the case?
Since QA pairs in German are rare, do you have any idea what other methods could be tried for text matching without training, maybe something like Word Mover's Distance matching with BERT embeddings?
I am very new to using BERT.
Plain transformer models (like BERT) are known to produce bad sentence embeddings. From a first, very rough test also Sentence-BERT (https://github.com/UKPLab/sentence-transformers) didn't perform too great on a couple of test queries.
We should evaluate different models once we have the eval dataset from #4 and possibly finetune some on the quora duplicate question dataset (or even a small one created by the crowd).
Where do I get the test dataset for COVIDQA?
Hi,
I have been working on a chatbot for the Croatian language. Here is a little help for real-time scraping.
import requests
from bs4 import BeautifulSoup
import pandas as pd

url = "https://www.worldometers.info/coronavirus/"
headers = {"Accept": "text/html"}
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.content, "lxml")

# Each table row becomes a list of cell texts; keep only complete rows.
table = soup.find(id="main_table_countries_today")
rows = [[cell.text for cell in row.find_all("td")] for row in table.find_all("tr")]
rows = [row for row in rows if len(row) == 9]

worldometers = pd.DataFrame(rows, columns=[
    "Country,Other", "Total Cases", "New Cases", "Total Deaths", "New Deaths",
    "Total Recovered", "Active Cases", "Serious, Critical", "Tot Cases/1M pop",
])
worldometers
Hey,
how can I use the front-end? When your bot was online I tested it, and I want to use the Telegram front-end. But I am a bit lost, since there is no description of how to do that.
Thanks in advance!
This line in the CDC_Water_scraper will cause an error:
UnboundLocalError: local variable 'current_question' referenced before assignment
It seems to come from lines 51 to 53 of COVID-QA/datasources/scrapers/CDC_Water_scraper.py (commit 4890ad5).
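For illustration, this is the general bug pattern and a typical fix (a reconstruction of the error class, not the actual scraper code):

```python
# `current_question` is only assigned inside one branch, so the first
# element that is not a question heading triggers UnboundLocalError.
def collect_pairs_buggy(elements):
    pairs = []
    for tag, text in elements:
        if tag == "h3":
            current_question = text                 # assigned only here
        else:
            pairs.append((current_question, text))  # may be unbound
    return pairs

def collect_pairs_fixed(elements):
    pairs = []
    current_question = None                         # initialise before the loop
    for tag, text in elements:
        if tag == "h3":
            current_question = text
        elif current_question is not None:          # skip text before first question
            pairs.append((current_question, text))
    return pairs
```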
The paper mentions "We selected 147 scientific articles mostly related to COVID-19 from the CORD-19". How can I get that subset of documents to create an index?
https://github.com/deepset-ai/COVID-QA/blob/master/datasources/scrapers/CDC_Pregnancy_scraper.py
The CDC changed this page from a QA style page to a factual page on 7 April 2020.
This scraper no longer produces any data when run.
In the current implementation, the meta scraper runs all the scrapers sequentially, crawls the FAQs, and then writes to an Elasticsearch index. This is good for initializing an index from scratch.
We should implement a periodic job (cron or AWS Lambda) that runs the meta scraper and checks for updates, additions, and deletions since the last run.
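The update check such a job could run might look like this (a minimal sketch; the function and key choice are made up, and a real implementation would key on something more stable than the question text):

```python
# Diff freshly crawled FAQ pairs against what is already indexed,
# instead of recreating the whole index on every crawl.
def diff_faqs(indexed, crawled):
    """Both args map question -> answer; returns (added, updated, deleted)."""
    added   = {q: a for q, a in crawled.items() if q not in indexed}
    updated = {q: a for q, a in crawled.items()
               if q in indexed and indexed[q] != a}
    deleted = [q for q in indexed if q not in crawled]
    return added, updated, deleted
```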
A possible quick-n-dirty alternative to a periodic sync job would be to recreate the entire Elasticsearch index each time we crawl. This works, except that collecting user feedback gets tricky, as we lose the link to the document_id when the list of scrapers gets updated. The document_id field in ES is populated with incrementing numbers; it could be changed to a UUID to make things simpler to implement.
Hi there, thanks for sharing your QA resource!
https://github.com/deepset-ai/COVID-QA/tree/master/data/question-answering
I was wondering if you have a write-up of the annotation methodology? For example, how were the documents selected, how were the questions generated, guidelines for marking the extent of the spans, etc.
Thanks in advance!
I've created a telegram bot to your web interface
https://t.me/corona_scholar_bot
Can I add it to the README?
Hello!
First of all, thank you again for your incredible contribution with not only this dataset, but most importantly with the Haystack toolset!
I was able to closely approximate the results of your paper when running https://github.com/deepset-ai/FARM/blob/master/examples/question_answering_crossvalidation.py, although I had to reduce batch_size to 25 to prevent "RuntimeError: CUDA out of memory. Tried to allocate 540.00 MiB (GPU 0; 15.78 GiB total capacity; 14.29 GiB already allocated; 386.75 MiB free; 14.35 GiB reserved in total by PyTorch)". This is on an Ubuntu 18.04 VM with a Tesla V100 GPU and 128 GB of disk space. As mentioned, the results obtained were quite close:
XVAL EM: 0.26151560178306094
XVAL f1: 0.5858967501101285
I created a custom Covid-19 dataset that combines a preprocessed/cleansed subset of the dataset from the paper "Collecting Verified COVID-19 Question Answer Pairs" (Poliak et al, 2020) and a SQuADified version of your dataset, faq_covidbert.csv. For the latter I used your annotation tool to map questions to chunks in the answers, treating the full answers as contexts.
I trained a model with this dataset using the hyperparameters you specify here: https://huggingface.co/deepset/roberta-base-squad2-covid#hyperparameters . Informal tests of various questions related to Covid-19 indicate superior responses generated from my model as opposed to roberta-base-squad2-covid, which isn't surprising as inspection of both datasets reveals that mine contains far more Covid-19-specific questions and answers.
However, when running question_answering_crossvalidation.py with my dataset the metric results are not as good as what is observed with your dataset or even with the baseline referenced in the paper. Here are the EM and f1 scores I obtained with my dataset:
XVAL EM: 0.21554054054054053
XVAL f1: 0.4432141443807887
Can you provide any insight as to why this would be the case? Thank you so much!
The CDC children scraper does not get the right results.
I guess the page: https://www.cdc.gov/coronavirus/2019-ncov/prepare/children-faq.html
was updated.
@bogdankostic could you please update the scraper? I will put it into the data/scraper_outdated folder for now so the meta crawler won't use it.
Many people have specific information needs, which may not be answered in FAQs (e.g. "How many people are infected with Corona in Berlin?").
A possible first step would be to classify the intent of the question (w/ Rasa or DeepPavlov).
Afterwards, slot filling can be used to extract semantic concepts (w/ Rasa or DeepPavlov).
The information can then be converted into database queries (e.g. COVID-19, Coronavirus Tracker API, Coronazaehler).
Extra: The result may be formatted for different intents.
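As a toy illustration of the first two steps (a real system would use Rasa or DeepPavlov; the intent names, patterns, and slot scheme here are invented):

```python
# Minimal intent classification + slot filling by pattern matching.
import re

INTENT_PATTERNS = {
    "case_count": re.compile(r"how many .*(infected|cases)", re.IGNORECASE),
    "faq":        re.compile(r".*", re.DOTALL),  # fallback intent
}
LOCATION_RE = re.compile(r"\bin ([A-Z][a-zA-Z]+)")

def parse_question(question):
    # First matching pattern wins; "faq" catches everything else.
    intent = next(name for name, pat in INTENT_PATTERNS.items()
                  if pat.search(question))
    m = LOCATION_RE.search(question)
    slots = {"location": m.group(1)} if m else {}
    return intent, slots
```

The resulting (intent, slots) pair could then be mapped to a query against one of the case databases listed above.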
Similar projects:
Rasa project for answering simple COVID-19 questions.
https://github.com/LuisMossburger/CoronaBibliothekar/tree/rasa-init
Combined COVID-19 cases database (RKI, JHU, ECDC)
https://github.com/swildermann/COVID-19
Can anyone share more details about how we should contribute to this project, like a to-do list or something similar?
Thanks.
Thanks for your work to help the people in need! Your site has been added! I currently maintain the Open-Source-COVID-19 page, which collects all open source projects related to COVID-19, including maps, data, news, api, analysis, medical and supply information, etc. Please share to anyone who might need the information in the list, or will possibly contribute to some of those projects. You are also welcome to recommend more projects.
http://open-source-covid-19.weileizeng.com/
Cheers!
Hi,
I'm trying to load the dataset using this suggested code:
from datasets import load_dataset
dataset = load_dataset("covid_qa_deepset")
However, I get the following error:
FileNotFoundError: Couldn't find file locally at covid_qa_deepset\covid_qa_deepset.py, or remotely at https://raw.githubusercontent.com/huggingface/datasets/1.1.3/datasets/covid_qa_deepset/covid_qa_deepset.py or https://s3.amazonaws.com/datasets.huggingface.co/datasets/datasets/covid_qa_deepset/covid_qa_deepset.py
Would greatly appreciate some advice! Thanks!
Hello, would you please help me understand how "Exact Match" is calculated? Maybe point me to which .py file contains the code. Thank you.
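For reference, SQuAD-style Exact Match is usually computed roughly like this (a sketch of the standard metric, not the exact code in this repo):

```python
# Normalize prediction and gold answers, then compare for strict equality.
import re
import string

def normalize(text):
    text = text.lower()
    text = re.sub(r"\b(a|an|the)\b", " ", text)           # drop articles
    text = "".join(c for c in text if c not in string.punctuation)
    return " ".join(text.split())                          # squeeze whitespace

def exact_match(prediction, gold_answers):
    """1.0 if the normalized prediction equals any normalized gold answer."""
    return float(any(normalize(prediction) == normalize(g)
                     for g in gold_answers))
```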
Building multilingual models (zero-shot, transfer learning, etc.) takes time.
So, in the meantime, as stated in #2, we could machine-translate FAQs from English into other languages and add them to the search cluster, so that they can be retrieved for foreign-language input. The translations in the background don't need to be perfect, just sufficient for retrieval (adequacy before fluency/grammar).
TODOs:
data/scrapers repo

Using Sentence Transformers (https://github.com/UKPLab/sentence-transformers) I will start training a model on the Quora Question Pairs Dataset (https://www.kaggle.com/c/quora-question-pairs) that can classify duplicate question pairs.
Small issue here: the package name seems to be outdated.
Please close if "irda" is fine.
My website (http://know-covid19.info/) continues to get a few dozen hits daily, but I had to remove the FAQ section when Deepset took the hosted API offline. :-(
I would be happy to host the API myself. It looks like the database is included in the GitHub repo, but what about the trained model? Can you share with me the resources and instructions so that I can host the API?
While using model 2, the API returns an answer for almost everything, but not in English. I believe model 2 should only return English answers.
Try "Define gravity?"
Thank you so much for sharing your data and tools! I am working with the question-answering dataset for an experiment of my own.
@Timoeller mentioned in #103 that the documents used in the annotation tool to create the COVID-QA.json dataset "are a subset of CORD-19 papers that annotators deemed related to Covid." I was wondering if these are the same documents as listed in faq_covidbert.csv.
The reason I ask is that, as a workaround I've created my own retrieval txt file(s) through extracting the answers from COVID-QA.json, but the results are hit or miss. They are particularly off if I break the file up into chunks to improve performance, for instance into a separate txt file for each answer. I'm assuming this is due to lost context. I'm wondering if I should simply be using faq_covidbert as illustrated here, even though I am using extractive-QA.
The reason I took this approach is that I was trying to follow the extractive QA tutorial as closely as possible.
My ultimate objective is to compare the experience of using extractive QA vs FAQ-style QA, so I presumed that it would be apropos to have a bit of separation in the doc storage dataset.
Thank you!
Experiment with different methods for data augmentation, report results and compare to baseline.
Hi - I wanted to share a model that I've pretrained from scratch using BERT Large Cased and the BioASQ 7b - factoid dataset on TPU v2-8.
Original Implementation:
https://github.com/dmis-lab/biobert
Dataset can also be found on their repo.
Model Details:
loss = 0.41782737
step = 18000
max_seq_length = 384
learning_rate = 3e-6
doc_stride = 128
The model is TensorFlow-based; I haven't yet converted it to torch or transformers format, and I haven't evaluated it yet.
I'd like to continue training it on COVID related questions, as well as additional data from BioASQ but haven't yet found an easy way to convert the raw bioASQ data into the format for training. If someone would like to do that so I can continue training the model further, please let me know.
You should be able to download all the files easily with gsutil installed by running
gsutil -m cp -r gs://ce-covid-public/biobert-large-cased/* /path/to/folder/
If someone wants to run evaluation on the models and provide the metrics, I can update this.
Question - when running the backend on Docker with GPU enabled and BERT embeddings, it doesn't seem to be using the GPUs even with all the correct drivers. Is there some documentation around this?
Great job on the progress so far! I believe there's a lot of value in what's being done.
I started a chat room on Gitter to help us coordinate the projects. Please sign up at https://gitter.im/COVID-QA/community?utm_source=share-link&utm_medium=link&utm_campaign=share-link
So far we rely on matching questions with one of the existing FAQs.
=> We could add an extractive QA model + some trustworthy text datasources to handle the long tail of questions. If there's no good match in the FAQs, we could forward the request to this extractive QA model.
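The proposed routing could be sketched as follows (both callables and the threshold value are placeholders):

```python
# Answer from the FAQ match if its score clears a threshold,
# otherwise fall back to an extractive QA model.
def answer(question, faq_search, extractive_qa, threshold=0.75):
    hit = faq_search(question)          # -> (answer, similarity score) or None
    if hit and hit[1] >= threshold:
        return hit[0]
    return extractive_qa(question)
```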
I've been developing a Telegram bot using the API. Currently I am using https://covid-middleware.deepset.ai/api/bert/question to get the answers.
curl -X POST \
https://covid-middleware.deepset.ai/api/bert/question \
-H 'content-type: application/json' \
-d '{
"question":"community spread?"
}'
but Swagger doesn't list this API and shows a different one with different structures.
So my question is: which API should I use to get the answers? @tanaysoni
"I just wanted to 'like' the upper answer, but by clicking the button the second (incorrect) answer was also marked with a green like.
The question asked was "Wie lang ist die Inkubationszeit" ("How long is the incubation period") or something similar."
The problem seems to be that the answers have the same document_id.
We need some unique id for each answer, maybe paragraph_id?
Example:
paragraph_id:"HnLWAnEB3Qua7g62e99g"
document_id:"1"
Hi, would you please help me understand how the preprocessing is done for the CovidQA corpus? I ask because the contexts in the CovidQA dataset seem to be much larger than the maximum length set in the code (300+, while BERT's max_length is 512 tokens). How is the data processed to fit into the limit? I couldn't find the code for that in the Git repo. Please advise. Thank you.
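The standard way long contexts are made to fit the token limit is a sliding window with overlap, which is what the doc_stride parameter controls; a minimal sketch (not the actual FARM code, which works on tokenized input with special tokens):

```python
# Split a long token sequence into overlapping windows so each window
# fits the model's maximum length; doc_stride controls the step size.
def sliding_windows(tokens, max_len=384, doc_stride=128):
    windows, start = [], 0
    while True:
        windows.append(tokens[start:start + max_len])
        if start + max_len >= len(tokens):
            break
        start += doc_stride
    return windows
```

Answers are then extracted per window and the best-scoring span across windows is kept.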
I'm not seeing the same results after fine-tuning roberta-base-squad2 with COVID-QA.json as compared to running with your deepset/roberta-base-squad2-covid model. I followed the tutorial Tutorial2_Finetune_a_model_on_your_data. In your paper "COVID-QA: A Question Answering Dataset for COVID-19" I didn't see any specific hyperparameters used for training that might explain these differences. Did you train with default parameters?
Thanks!
Thanks again for creating this PR. Great work!
Two comments:
Would be great to hear your thoughts on that and maybe address them in a separate PR.
Originally posted by @tholor in #58 (comment)
There should be the possibility to display the UI in multiple languages (at least German and English).
It seems that the BMAS scraper is not working properly.
The problem seems to be with extracting the questions: the resulting question column contains three empty strings.