deepset-ai / covid-qa
API & Webapp to answer questions about COVID-19. Using NLP (Question Answering) and trusted data sources.
License: Apache License 2.0
I just posted an issue here on GitHub for the chatbot, because my colleague couldn't contact the team. A feedback button might also be helpful for this project.
We should create a simple evaluation dataset that can be used to benchmark our models for matching similar questions.
Something like this should be sufficient for a rough baseline:
Fine-tune BERT (or word embeddings?) on the CORD-19 dataset published on Kaggle:
[CORD-19 is a resource of over 29,000 scholarly articles, including over 13,000 with full text, about COVID-19, SARS-CoV-2, and related coronaviruses.]
The dataset is about 2 GB. I guess the domain is quite different from the FAQs, as the dataset is made up of scientific papers, but it could still be valuable for introducing some substantial vocabulary related to the virus.
Sending the query via Enter doesn't work anymore (tested in Chrome).
@sfakir didn't you already implement this?
It'd be better if we announced this product to the public on something like Product Hunt. Or has that been done somewhere already?
I followed the instructions here https://github.com/deepset-ai/COVID-QA/tree/master/backend. Perhaps a port is not configured correctly?
INFO: initializing identifier
WARNING: PUT http://localhost:9200/document [status:N/A request:0.004s]
Traceback (most recent call last):
File "/usr/local/lib/python3.7/site-packages/urllib3/connection.py", line 157, in _new_conn
(self._dns_host, self.port), self.timeout, **extra_kw
File "/usr/local/lib/python3.7/site-packages/urllib3/util/connection.py", line 84, in create_connection
raise err
File "/usr/local/lib/python3.7/site-packages/urllib3/util/connection.py", line 74, in create_connection
sock.connect(sa)
ConnectionRefusedError: [Errno 111] Connection refused
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.7/site-packages/elasticsearch/connection/http_urllib3.py", line 229, in perform_request
method, url, body, retries=Retry(False), headers=request_headers, **kw
File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 720, in urlopen
method, url, error=e, _pool=self, _stacktrace=sys.exc_info()[2]
File "/usr/local/lib/python3.7/site-packages/urllib3/util/retry.py", line 376, in increment
raise six.reraise(type(error), error, _stacktrace)
File "/usr/local/lib/python3.7/site-packages/urllib3/packages/six.py", line 735, in reraise
raise value
File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 672, in urlopen
chunked=chunked,
File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 387, in _make_request
conn.request(method, url, **httplib_request_kw)
File "/usr/local/lib/python3.7/http/client.py", line 1244, in request
self._send_request(method, url, body, headers, encode_chunked)
File "/usr/local/lib/python3.7/http/client.py", line 1290, in _send_request
self.endheaders(body, encode_chunked=encode_chunked)
File "/usr/local/lib/python3.7/http/client.py", line 1239, in endheaders
self._send_output(message_body, encode_chunked=encode_chunked)
File "/usr/local/lib/python3.7/http/client.py", line 1026, in _send_output
self.send(msg)
File "/usr/local/lib/python3.7/http/client.py", line 966, in send
self.connect()
File "/usr/local/lib/python3.7/site-packages/urllib3/connection.py", line 184, in connect
conn = self._new_conn()
File "/usr/local/lib/python3.7/site-packages/urllib3/connection.py", line 169, in _new_conn
self, "Failed to establish a new connection: %s" % e
urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPConnection object at 0x7f66d3bb9b10>: Failed to establish a new connection: [Errno 111] Connection refused
[the same WARNING and traceback repeat for three further retries]
Traceback (most recent call last):
File "/usr/local/lib/python3.7/site-packages/urllib3/connection.py", line 157, in _new_conn
(self._dns_host, self.port), self.timeout, **extra_kw
File "/usr/local/lib/python3.7/site-packages/urllib3/util/connection.py", line 84, in create_connection
raise err
File "/usr/local/lib/python3.7/site-packages/urllib3/util/connection.py", line 74, in create_connection
sock.connect(sa)
ConnectionRefusedError: [Errno 111] Connection refused
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.7/site-packages/elasticsearch/connection/http_urllib3.py", line 229, in perform_request
method, url, body, retries=Retry(False), headers=request_headers, **kw
File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 720, in urlopen
method, url, error=e, _pool=self, _stacktrace=sys.exc_info()[2]
File "/usr/local/lib/python3.7/site-packages/urllib3/util/retry.py", line 376, in increment
raise six.reraise(type(error), error, _stacktrace)
File "/usr/local/lib/python3.7/site-packages/urllib3/packages/six.py", line 735, in reraise
raise value
File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 672, in urlopen
chunked=chunked,
File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 387, in _make_request
conn.request(method, url, **httplib_request_kw)
File "/usr/local/lib/python3.7/http/client.py", line 1244, in request
self._send_request(method, url, body, headers, encode_chunked)
File "/usr/local/lib/python3.7/http/client.py", line 1290, in _send_request
self.endheaders(body, encode_chunked=encode_chunked)
File "/usr/local/lib/python3.7/http/client.py", line 1239, in endheaders
self._send_output(message_body, encode_chunked=encode_chunked)
File "/usr/local/lib/python3.7/http/client.py", line 1026, in _send_output
self.send(msg)
File "/usr/local/lib/python3.7/http/client.py", line 966, in send
self.connect()
File "/usr/local/lib/python3.7/site-packages/urllib3/connection.py", line 184, in connect
conn = self._new_conn()
File "/usr/local/lib/python3.7/site-packages/urllib3/connection.py", line 169, in _new_conn
self, "Failed to establish a new connection: %s" % e
urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPConnection object at 0x7f66d3bb9cd0>: Failed to establish a new connection: [Errno 111] Connection refused
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/bin/uvicorn", line 8, in <module>
sys.exit(main())
File "/usr/local/lib/python3.7/site-packages/click/core.py", line 829, in __call__
return self.main(*args, **kwargs)
File "/usr/local/lib/python3.7/site-packages/click/core.py", line 782, in main
rv = self.invoke(ctx)
File "/usr/local/lib/python3.7/site-packages/click/core.py", line 1066, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/usr/local/lib/python3.7/site-packages/click/core.py", line 610, in invoke
return callback(*args, **kwargs)
File "/usr/local/lib/python3.7/site-packages/uvicorn/main.py", line 331, in main
run(**kwargs)
File "/usr/local/lib/python3.7/site-packages/uvicorn/main.py", line 354, in run
server.run()
File "/usr/local/lib/python3.7/site-packages/uvicorn/main.py", line 382, in run
loop.run_until_complete(self.serve(sockets=sockets))
File "uvloop/loop.pyx", line 1456, in uvloop.loop.Loop.run_until_complete
File "/usr/local/lib/python3.7/site-packages/uvicorn/main.py", line 389, in serve
config.load()
File "/usr/local/lib/python3.7/site-packages/uvicorn/config.py", line 288, in load
self.loaded_app = import_from_string(self.app)
File "/usr/local/lib/python3.7/site-packages/uvicorn/importer.py", line 20, in import_from_string
module = importlib.import_module(module_str)
File "/usr/local/lib/python3.7/importlib/__init__.py", line 127, in import_module
return _bootstrap._gcd_import(name[level:], package, level)
File "<frozen importlib._bootstrap>", line 1006, in _gcd_import
File "<frozen importlib._bootstrap>", line 983, in _find_and_load
File "<frozen importlib._bootstrap>", line 967, in _find_and_load_unlocked
File "<frozen importlib._bootstrap>", line 677, in _load_unlocked
File "<frozen importlib._bootstrap_external>", line 728, in exec_module
File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
File "./backend/api.py", line 11, in <module>
from backend.controller.router import router as api_router
File "./backend/controller/router.py", line 3, in <module>
from backend.controller import autocomplete, model, feedback
File "./backend/controller/model.py", line 60, in <module>
excluded_meta_data=EXCLUDE_META_DATA_FIELDS,
File "/home/user/src/farm-haystack/haystack/database/elasticsearch.py", line 48, in __init__
self.client.indices.create(index=index, ignore=400, body=custom_mapping)
File "/usr/local/lib/python3.7/site-packages/elasticsearch/client/utils.py", line 92, in _wrapped
return func(*args, params=params, headers=headers, **kwargs)
File "/usr/local/lib/python3.7/site-packages/elasticsearch/client/indices.py", line 104, in create
"PUT", _make_path(index), params=params, headers=headers, body=body
File "/usr/local/lib/python3.7/site-packages/elasticsearch/transport.py", line 362, in perform_request
timeout=timeout,
File "/usr/local/lib/python3.7/site-packages/elasticsearch/connection/http_urllib3.py", line 241, in perform_request
raise ConnectionError("N/A", str(e), e)
elasticsearch.exceptions.ConnectionError: ConnectionError(<urllib3.connection.HTTPConnection object at 0x7f66d3bb9cd0>: Failed to establish a new connection: [Errno 111] Connection refused) caused by: NewConnectionError(<urllib3.connection.HTTPConnection object at 0x7f66d3bb9cd0>: Failed to establish a new connection: [Errno 111] Connection refused)
Find official data sources for FAQ about COVID-19 in different languages and scrape them.
For now we have individual scrapers for each site. Adding more sites is a very manual and slow process and existing scrapers fail when the site changes slightly. See individual scrapers here.
Automatic Scraper
We need a scraper that takes in a URL to an FAQ page and automatically extracts questions and answers in a structured way. The scraper might need some NLP based question detection to identify which parts need to be extracted. For some pseudo code see here.
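As a rough sketch of the idea (the question-detection heuristic and function name below are made up for illustration; they are not the pseudo code linked above):

```python
# Hypothetical generic FAQ extractor: instead of one hand-written scraper
# per site, detect question lines heuristically and treat the text until
# the next question as the answer.
import re

QUESTION_RE = re.compile(
    r"^(what|how|why|when|where|who|can|is|are|do|does|should)\b.*\?$",
    re.IGNORECASE,
)

def extract_faq_pairs(lines):
    """Group plain-text lines into (question, answer) pairs."""
    pairs, question, answer = [], None, []
    for line in (l.strip() for l in lines):
        if not line:
            continue
        if QUESTION_RE.match(line):
            if question and answer:
                pairs.append((question, " ".join(answer)))
            question, answer = line, []
        elif question:
            answer.append(line)
    if question and answer:
        pairs.append((question, " ".join(answer)))
    return pairs
```

A real version would still need the NLP-based question detection mentioned above, since many FAQ headings are not phrased as literal questions.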
Datasources
We can curate a sheet of official FAQ pages and crawl relevant information more quickly.
That way the community can check the validity of the source FAQ and if the extraction worked.
Hey,
what is the current model used for https://covid-staging.deepset.ai/answers? :-)
I find it very accurate for my questions and single words (in German). Have you fine-tuned on the German corona QA pairs? Do you have any trained deep-learning matching algorithm in use? I can't imagine that the model just uses cosine similarity with BERT, because plain BERT does not perform as well in my case as the bot does right now.
I experimented with my own questions using the pretrained deepset model (German) and cosine similarity. I wonder why queries with words like "hallo" or "die" have only a marginally lower similarity than real corona-specific questions when just using the deepset German model. Those irrelevant words reach a similarity of around 90%...
Do you know any reason why this is the case?
Since QA pairs in German are rare, do you have any idea what other methods could be tried for text matching without training, maybe something like Word Mover's Distance matching with BERT embeddings?
I am very new to using BERT.
Plain transformer models (like BERT) are known to produce bad sentence embeddings. From a first, very rough test also Sentence-BERT (https://github.com/UKPLab/sentence-transformers) didn't perform too great on a couple of test queries.
We should evaluate different models once we have the eval dataset from #4 and possibly finetune some on the quora duplicate question dataset (or even a small one created by the crowd).
Where do I get the test dataset for COVIDQA?
Hi,
I have been working on a chatbot for the Croatian language. Here is a little help for real-time scraping.
import requests
from bs4 import BeautifulSoup
import pandas as pd

url = "https://www.worldometers.info/coronavirus/"
headers = {"Accept": "text/html"}
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.content, "lxml")

# Each table row becomes a list of cell texts; keep only complete rows.
table = soup.find(id="main_table_countries_today")
rows = [[cell.text for cell in row.find_all("td")] for row in table.find_all("tr")]
rows = [row for row in rows if len(row) == 9]

worldometers = pd.DataFrame(rows, columns=[
    "Country,Other", "Total Cases", "New Cases", "Total Deaths", "New Deaths",
    "Total Recovered", "Active Cases", "Serious, Critical", "Tot Cases/1M pop",
])
worldometers
Hey,
how can I use the front-end? When your bot was online I tested it, and I want to use the Telegram front-end. But I am a bit lost, since there is no description of how to do that.
Thanks in advance!
This line in the CDC_Water_scraper will cause an error:
UnboundLocalError: local variable 'current_question' referenced before assignment
It seems to come from lines 51 to 53 of COVID-QA/datasources/scrapers/CDC_Water_scraper.py (commit 4890ad5).
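For illustration, this is the general bug pattern and a typical fix (a reconstruction of the error class, not the actual scraper code):

```python
# `current_question` is only assigned inside one branch, so the first
# element that is not a question heading triggers UnboundLocalError.
def collect_pairs_buggy(elements):
    pairs = []
    for tag, text in elements:
        if tag == "h3":
            current_question = text                 # assigned only here
        else:
            pairs.append((current_question, text))  # may be unbound
    return pairs

def collect_pairs_fixed(elements):
    pairs = []
    current_question = None                         # initialise before the loop
    for tag, text in elements:
        if tag == "h3":
            current_question = text
        elif current_question is not None:          # skip text before first question
            pairs.append((current_question, text))
    return pairs
```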
The paper mentions "We selected 147 scientific articles mostly related to COVID-19 from the CORD-19". How can I get that subset of documents to create an index?
https://github.com/deepset-ai/COVID-QA/blob/master/datasources/scrapers/CDC_Pregnancy_scraper.py
The CDC changed this page from a QA style page to a factual page on 7 April 2020.
This scraper no longer produces any data when run.
In the current implementation, the meta scraper runs all the scrapers sequentially, crawls the FAQs, and then writes to an Elasticsearch index. This is good for initializing an index from scratch.
We should implement a periodic job (cron or AWS Lambda) that runs the meta scraper and checks for updates, additions, and deletions since the last run.
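The update check such a job could run might look like this (a minimal sketch; the function and key choice are made up, and a real implementation would key on something more stable than the question text):

```python
# Diff freshly crawled FAQ pairs against what is already indexed,
# instead of recreating the whole index on every crawl.
def diff_faqs(indexed, crawled):
    """Both args map question -> answer; returns (added, updated, deleted)."""
    added   = {q: a for q, a in crawled.items() if q not in indexed}
    updated = {q: a for q, a in crawled.items()
               if q in indexed and indexed[q] != a}
    deleted = [q for q in indexed if q not in crawled]
    return added, updated, deleted
```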
A possible quick-n-dirty alternative to a periodic sync job would be to recreate the entire Elasticsearch index each time we crawl. This works, except that collecting user feedback gets tricky, as we lose the link to the document_id when the list of scrapers gets updated. The document_id field in ES is populated with incrementing numbers; it could be changed to a UUID to make things simpler to implement.
Hi there, thanks for sharing your QA resource!
https://github.com/deepset-ai/COVID-QA/tree/master/data/question-answering
I was wondering if you have a write-up of the annotation methodology? For example, how were the documents selected, how were the questions generated, guidelines for marking the extent of the spans, etc.
Thanks in advance!
I've created a telegram bot to your web interface
https://t.me/corona_scholar_bot
Can I add it to the README?
Hello!
First of all, thank you again for your incredible contribution with not only this dataset, but most importantly with the Haystack toolset!
I was able to closely approximate the results of your paper when running https://github.com/deepset-ai/FARM/blob/master/examples/question_answering_crossvalidation.py, although I had to reduce batch_size to 25 to prevent "RuntimeError: CUDA out of memory. Tried to allocate 540.00 MiB (GPU 0; 15.78 GiB total capacity; 14.29 GiB already allocated; 386.75 MiB free; 14.35 GiB reserved in total by PyTorch)". This is on an Ubuntu 18.04 VM with a Tesla V100 GPU and 128 GB of disk space. As mentioned, the results obtained were quite close:
XVAL EM: 0.26151560178306094
XVAL f1: 0.5858967501101285
I created a custom Covid-19 dataset that combines a preprocessed/cleansed subset of the dataset from the paper "Collecting Verified COVID-19 Question Answer Pairs" (Poliak et al, 2020) and a SQuADified version of your dataset, faq_covidbert.csv. For the latter I used your annotation tool to map questions to chunks in the answers, treating the full answers as contexts.
I trained a model with this dataset using the hyperparameters you specify here: https://huggingface.co/deepset/roberta-base-squad2-covid#hyperparameters . Informal tests of various questions related to Covid-19 indicate superior responses generated from my model as opposed to roberta-base-squad2-covid, which isn't surprising as inspection of both datasets reveals that mine contains far more Covid-19-specific questions and answers.
However, when running question_answering_crossvalidation.py with my dataset the metric results are not as good as what is observed with your dataset or even with the baseline referenced in the paper. Here are the EM and f1 scores I obtained with my dataset:
XVAL EM: 0.21554054054054053
XVAL f1: 0.4432141443807887
Can you provide any insight as to why this would be the case? Thank you so much!
The CDC children scraper does not get the right results.
I guess the page: https://www.cdc.gov/coronavirus/2019-ncov/prepare/children-faq.html
was updated.
@bogdankostic could you please update the scraper? I will put it into the data/scraper_outdated folder for now so the meta crawler won't use it.
Many people have specific information needs, which may not be answered in FAQs (e.g. "How many people are infected with Corona in Berlin?").
A possible first step would be to classify the intent of the question (w/ Rasa or DeepPavlov).
Afterwards, slot filling can be used to extract semantic concepts (w/ Rasa or DeepPavlov).
The information can then be converted into database queries (e.g. COVID-19, Coronavirus Tracker API, Coronazaehler).
Extra: The result may be formatted for different intents.
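As a toy illustration of the first two steps (a real system would use Rasa or DeepPavlov; the intent names, patterns, and slot scheme here are invented):

```python
# Minimal intent classification + slot filling by pattern matching.
import re

INTENT_PATTERNS = {
    "case_count": re.compile(r"how many .*(infected|cases)", re.IGNORECASE),
    "faq":        re.compile(r".*", re.DOTALL),  # fallback intent
}
LOCATION_RE = re.compile(r"\bin ([A-Z][a-zA-Z]+)")

def parse_question(question):
    # First matching pattern wins; "faq" catches everything else.
    intent = next(name for name, pat in INTENT_PATTERNS.items()
                  if pat.search(question))
    m = LOCATION_RE.search(question)
    slots = {"location": m.group(1)} if m else {}
    return intent, slots
```

The resulting (intent, slots) pair could then be mapped to a query against one of the case databases listed above.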
Similar projects:
Rasa project for answering simple COVID-19 questions.
https://github.com/LuisMossburger/CoronaBibliothekar/tree/rasa-init
Combined COVID-19 cases database (RKI, JHU, ECDC)
https://github.com/swildermann/COVID-19
Can anyone share more details about how we should contribute to this project, like a to-do list or something similar?
Thanks.
Thanks for your work to help the people in need! Your site has been added! I currently maintain the Open-Source-COVID-19 page, which collects all open source projects related to COVID-19, including maps, data, news, api, analysis, medical and supply information, etc. Please share to anyone who might need the information in the list, or will possibly contribute to some of those projects. You are also welcome to recommend more projects.
http://open-source-covid-19.weileizeng.com/
Cheers!
Hi,
I'm trying to load the dataset using this suggested code:
from datasets import load_dataset
dataset = load_dataset("covid_qa_deepset")
However, I get the following error:
FileNotFoundError: Couldn't find file locally at covid_qa_deepset\covid_qa_deepset.py, or remotely at https://raw.githubusercontent.com/huggingface/datasets/1.1.3/datasets/covid_qa_deepset/covid_qa_deepset.py or https://s3.amazonaws.com/datasets.huggingface.co/datasets/datasets/covid_qa_deepset/covid_qa_deepset.py
Would greatly appreciate some advice! Thanks!
Hello, would you please help me understand how "Exact Match" is calculated? Maybe point me to which .py file contains the code. Thank you.
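For reference, SQuAD-style Exact Match is usually computed roughly like this (a sketch of the standard metric, not the exact code in this repo):

```python
# Normalize prediction and gold answers, then compare for strict equality.
import re
import string

def normalize(text):
    text = text.lower()
    text = re.sub(r"\b(a|an|the)\b", " ", text)           # drop articles
    text = "".join(c for c in text if c not in string.punctuation)
    return " ".join(text.split())                          # squeeze whitespace

def exact_match(prediction, gold_answers):
    """1.0 if the normalized prediction equals any normalized gold answer."""
    return float(any(normalize(prediction) == normalize(g)
                     for g in gold_answers))
```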
Building multilingual models (zero-shot, transfer learning, etc.) takes time.
So, in the meantime, as stated in #2, we could machine-translate FAQs from English into other languages and add them to the search cluster, so that they can be retrieved for foreign-language input. The translations in the background don't need to be perfect, just sufficient for retrieval (adequacy before fluency/grammar).
TODOs:
data/scrapers repo

Using Sentence Transformers (https://github.com/UKPLab/sentence-transformers) I will start training a model on the Quora Question Pairs Dataset (https://www.kaggle.com/c/quora-question-pairs) that can classify duplicate question pairs.
Small issue here: the package name seems to be outdated.
Please close if "irda" is fine.
My website (http://know-covid19.info/) continues to get a few dozen hits daily, but I had to remove the FAQ section when Deepset took the hosted API offline. :-(
I would be happy to host the API myself. It looks like the database is included in the GitHub repo, but what about the trained model? Can you share with me the resources and instructions so that I can host the API?
While using model 2, the API returns an answer for almost everything, but not in English. I believe model 2 should only return English answers.
Try "Define gravity?"
Thank you so much for sharing your data and tools! I am working with the question-answering dataset for an experiment of my own.
@Timoeller mentioned in #103 that the documents used in the annotation tool to create the COVID-QA.json dataset "are a subset of CORD-19 papers that annotators deemed related to Covid." I was wondering if these are the same documents as listed in faq_covidbert.csv.
The reason I ask is that, as a workaround I've created my own retrieval txt file(s) through extracting the answers from COVID-QA.json, but the results are hit or miss. They are particularly off if I break the file up into chunks to improve performance, for instance into a separate txt file for each answer. I'm assuming this is due to lost context. I'm wondering if I should simply be using faq_covidbert as illustrated here, even though I am using extractive-QA.
The reason I took this approach is that I was trying to follow the extractive QA tutorial as closely as possible.
My ultimate objective is to compare the experience of using extractive QA vs FAQ-style QA, so I presumed that it would be apropos to have a bit of separation in the doc storage dataset.
Thank you!
Experiment with different methods for data augmentation, report results and compare to baseline.
Hi - I wanted to share a model that I've pretrained from scratch using BERT Large Cased and the BioASQ 7b - factoid dataset on TPU v2-8.
Original Implementation:
https://github.com/dmis-lab/biobert
Dataset can also be found on their repo.
Model Details:
loss = 0.41782737
step = 18000
max_seq_length = 384
learning_rate = 3e-6
doc_stride = 128
The model is TensorFlow-based; I haven't yet converted it to torch or transformers format, and I haven't evaluated it yet.
I'd like to continue training it on COVID related questions, as well as additional data from BioASQ but haven't yet found an easy way to convert the raw bioASQ data into the format for training. If someone would like to do that so I can continue training the model further, please let me know.
You should be able to download all the files easily with gsutil installed by running
gsutil -m cp -r gs://ce-covid-public/biobert-large-cased/* /path/to/folder/
If someone wants to run evaluation on the models and provide the metrics, I can update this.
Question - when running the backend on Docker with GPU enabled and BERT embeddings, it doesn't seem to be using the GPUs even with all the correct drivers. Is there some documentation around this?
Great job on the progress so far! I believe there's a lot of value in what's being done.
I started a chat room on Gitter to help us coordinate the projects. Please sign up at https://gitter.im/COVID-QA/community?utm_source=share-link&utm_medium=link&utm_campaign=share-link
So far we rely on matching questions with one of the existing FAQs.
=> We could add an extractive QA model + some trustworthy text datasources to handle the long tail of questions. If there's no good match in the FAQs, we could forward the request to this extractive QA model.
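The proposed routing could be sketched as follows (both callables and the threshold value are placeholders):

```python
# Answer from the FAQ match if its score clears a threshold,
# otherwise fall back to an extractive QA model.
def answer(question, faq_search, extractive_qa, threshold=0.75):
    hit = faq_search(question)          # -> (answer, similarity score) or None
    if hit and hit[1] >= threshold:
        return hit[0]
    return extractive_qa(question)
```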
I've been developing a Telegram bot using the API. Currently I am using https://covid-middleware.deepset.ai/api/bert/question to get the answers.
curl -X POST \
https://covid-middleware.deepset.ai/api/bert/question \
-H 'content-type: application/json' \
-d '{
"question":"community spread?"
}'
but Swagger doesn't list this API and shows a different one with different structures.
So my question is: which API should I use to get the answers? @tanaysoni
"I just wanted to 'like' the upper answer, but by clicking the button the second (incorrect) answer was also marked with a green like.
The question asked was "Wie lang ist die Inkubationszeit" ("How long is the incubation period") or something similar."
The problem seems to be that the answers have the same document_id.
We need some unique id for each answer, maybe paragraph_id?
Example:
paragraph_id:"HnLWAnEB3Qua7g62e99g"
document_id:"1"
Hi, would you please help me understand how the preprocessing is done for the CovidQA corpus? I ask because the contexts in the CovidQA dataset seem to be much larger than the maximum length set in the code (300+, while BERT's max_length is 512 tokens). How is the data processed to fit into the limit? I couldn't find the code for that in the Git repo. Please advise. Thank you.
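The standard way long contexts are made to fit the token limit is a sliding window with overlap, which is what the doc_stride parameter controls; a minimal sketch (not the actual FARM code, which works on tokenized input with special tokens):

```python
# Split a long token sequence into overlapping windows so each window
# fits the model's maximum length; doc_stride controls the step size.
def sliding_windows(tokens, max_len=384, doc_stride=128):
    windows, start = [], 0
    while True:
        windows.append(tokens[start:start + max_len])
        if start + max_len >= len(tokens):
            break
        start += doc_stride
    return windows
```

Answers are then extracted per window and the best-scoring span across windows is kept.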
I'm not seeing the same results after fine-tuning roberta-base-squad2 with COVID-QA.json as compared to running with your deepset/roberta-base-squad2-covid model. I followed the tutorial Tutorial2_Finetune_a_model_on_your_data. In your paper "COVID-QA: A Question Answering Dataset for COVID-19" I didn't see any specific hyperparameters used for training that might explain these differences. Did you train with default parameters?
Thanks!
Thanks again for creating this PR. Great work!
Two comments:
Would be great to hear your thoughts on that and maybe address them in a separate PR.
Originally posted by @tholor in #58 (comment)
There should be the possibility to display the UI in multiple languages (at least German and English).
It seems that the BMAS scraper is not working properly.
The problem seems to be with extracting the questions: the resulting question column contains three empty strings.