
covid-qa's People

Contributors

andra-pumnea, bogdankostic, borhenryk, brandenchan, dataworm, dwhitena, kshannon, mfleming99, runinho, schafeld, sfakir, stedomedo, tanaysoni, tbnv999, theapache64, tholor, timoeller, tkh42, viktoralm


covid-qa's Issues

Feedback button - contact details

I just posted an issue here on GitHub for the ChatBot because my colleague couldn't reach the team any other way. A feedback button might also be helpful for this project.

Create English evaluation dataset for question similarity

We should create a simple evaluation dataset that can be used to benchmark our models for matching similar questions.

What should be sufficient for a rough baseline:

  • 100-300 pairs of similar questions
  • extend the set with 50% false pairs
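A minimal sketch of how the false pairs could be generated once the true pairs are collected; the function name and data layout here are illustrative, not a spec:

```python
import random

def build_eval_set(similar_pairs, seed=42):
    """Extend true question pairs with an equal number of false pairs.

    similar_pairs: list of (question_a, question_b) tuples that ARE similar.
    Returns (question_a, question_b, label) triples, label 1 for true pairs
    and 0 for randomly mismatched ones (~50% of the final set).
    """
    rng = random.Random(seed)
    dataset = [(a, b, 1) for a, b in similar_pairs]
    rights = [b for _, b in similar_pairs]
    for a, b in similar_pairs:
        # Pair the question with some OTHER right-hand question -> false pair
        wrong = rng.choice([r for r in rights if r != b])
        dataset.append((a, wrong, 0))
    rng.shuffle(dataset)
    return dataset
```

A fixed seed keeps the benchmark reproducible across runs.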

Fine-tune BERT on CORD-19 dataset

Fine-tune BERT (or word embeddings?) on the CORD-19 dataset published on Kaggle:

[CORD-19 is a resource of over 29,000 scholarly articles, including over 13,000 with full text, about COVID-19, SARS-CoV-2, and related coronaviruses.]

The dataset is about 2 GB. I guess the domain is quite different from the FAQs, as the dataset is made up of scientific papers, but it could still be valuable for introducing substantial vocabulary related to the virus.

Docker build succeeds, docker run fails with elasticsearch error

I followed the instructions at https://github.com/deepset-ai/COVID-QA/tree/master/backend, but docker run fails with the log below. Perhaps a port is not configured correctly?

INFO:     initializing identifier
WARNING:  PUT http://localhost:9200/document [status:N/A request:0.004s]
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/urllib3/connection.py", line 157, in _new_conn
    (self._dns_host, self.port), self.timeout, **extra_kw
  File "/usr/local/lib/python3.7/site-packages/urllib3/util/connection.py", line 84, in create_connection
    raise err
  File "/usr/local/lib/python3.7/site-packages/urllib3/util/connection.py", line 74, in create_connection
    sock.connect(sa)
ConnectionRefusedError: [Errno 111] Connection refused

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/elasticsearch/connection/http_urllib3.py", line 229, in perform_request
    method, url, body, retries=Retry(False), headers=request_headers, **kw
  File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 720, in urlopen
    method, url, error=e, _pool=self, _stacktrace=sys.exc_info()[2]
  File "/usr/local/lib/python3.7/site-packages/urllib3/util/retry.py", line 376, in increment
    raise six.reraise(type(error), error, _stacktrace)
  File "/usr/local/lib/python3.7/site-packages/urllib3/packages/six.py", line 735, in reraise
    raise value
  File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 672, in urlopen
    chunked=chunked,
  File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 387, in _make_request
    conn.request(method, url, **httplib_request_kw)
  File "/usr/local/lib/python3.7/http/client.py", line 1244, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/usr/local/lib/python3.7/http/client.py", line 1290, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/usr/local/lib/python3.7/http/client.py", line 1239, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/usr/local/lib/python3.7/http/client.py", line 1026, in _send_output
    self.send(msg)
  File "/usr/local/lib/python3.7/http/client.py", line 966, in send
    self.connect()
  File "/usr/local/lib/python3.7/site-packages/urllib3/connection.py", line 184, in connect
    conn = self._new_conn()
  File "/usr/local/lib/python3.7/site-packages/urllib3/connection.py", line 169, in _new_conn
    self, "Failed to establish a new connection: %s" % e
urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPConnection object at 0x7f66d3bb9b10>: Failed to establish a new connection: [Errno 111] Connection refused
[... the same connection-refused warning and traceback repeat for each retry ...]
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/urllib3/connection.py", line 157, in _new_conn
    (self._dns_host, self.port), self.timeout, **extra_kw
  File "/usr/local/lib/python3.7/site-packages/urllib3/util/connection.py", line 84, in create_connection
    raise err
  File "/usr/local/lib/python3.7/site-packages/urllib3/util/connection.py", line 74, in create_connection
    sock.connect(sa)
ConnectionRefusedError: [Errno 111] Connection refused

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/elasticsearch/connection/http_urllib3.py", line 229, in perform_request
    method, url, body, retries=Retry(False), headers=request_headers, **kw
  File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 720, in urlopen
    method, url, error=e, _pool=self, _stacktrace=sys.exc_info()[2]
  File "/usr/local/lib/python3.7/site-packages/urllib3/util/retry.py", line 376, in increment
    raise six.reraise(type(error), error, _stacktrace)
  File "/usr/local/lib/python3.7/site-packages/urllib3/packages/six.py", line 735, in reraise
    raise value
  File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 672, in urlopen
    chunked=chunked,
  File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 387, in _make_request
    conn.request(method, url, **httplib_request_kw)
  File "/usr/local/lib/python3.7/http/client.py", line 1244, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/usr/local/lib/python3.7/http/client.py", line 1290, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/usr/local/lib/python3.7/http/client.py", line 1239, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/usr/local/lib/python3.7/http/client.py", line 1026, in _send_output
    self.send(msg)
  File "/usr/local/lib/python3.7/http/client.py", line 966, in send
    self.connect()
  File "/usr/local/lib/python3.7/site-packages/urllib3/connection.py", line 184, in connect
    conn = self._new_conn()
  File "/usr/local/lib/python3.7/site-packages/urllib3/connection.py", line 169, in _new_conn
    self, "Failed to establish a new connection: %s" % e
urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPConnection object at 0x7f66d3bb9cd0>: Failed to establish a new connection: [Errno 111] Connection refused

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/bin/uvicorn", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.7/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.7/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.7/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/uvicorn/main.py", line 331, in main
    run(**kwargs)
  File "/usr/local/lib/python3.7/site-packages/uvicorn/main.py", line 354, in run
    server.run()
  File "/usr/local/lib/python3.7/site-packages/uvicorn/main.py", line 382, in run
    loop.run_until_complete(self.serve(sockets=sockets))
  File "uvloop/loop.pyx", line 1456, in uvloop.loop.Loop.run_until_complete
  File "/usr/local/lib/python3.7/site-packages/uvicorn/main.py", line 389, in serve
    config.load()
  File "/usr/local/lib/python3.7/site-packages/uvicorn/config.py", line 288, in load
    self.loaded_app = import_from_string(self.app)
  File "/usr/local/lib/python3.7/site-packages/uvicorn/importer.py", line 20, in import_from_string
    module = importlib.import_module(module_str)
  File "/usr/local/lib/python3.7/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1006, in _gcd_import
  File "<frozen importlib._bootstrap>", line 983, in _find_and_load
  File "<frozen importlib._bootstrap>", line 967, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 677, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 728, in exec_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "./backend/api.py", line 11, in <module>
    from backend.controller.router import router as api_router
  File "./backend/controller/router.py", line 3, in <module>
    from backend.controller import autocomplete, model, feedback
  File "./backend/controller/model.py", line 60, in <module>
    excluded_meta_data=EXCLUDE_META_DATA_FIELDS,
  File "/home/user/src/farm-haystack/haystack/database/elasticsearch.py", line 48, in __init__
    self.client.indices.create(index=index, ignore=400, body=custom_mapping)
  File "/usr/local/lib/python3.7/site-packages/elasticsearch/client/utils.py", line 92, in _wrapped
    return func(*args, params=params, headers=headers, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/elasticsearch/client/indices.py", line 104, in create
    "PUT", _make_path(index), params=params, headers=headers, body=body
  File "/usr/local/lib/python3.7/site-packages/elasticsearch/transport.py", line 362, in perform_request
    timeout=timeout,
  File "/usr/local/lib/python3.7/site-packages/elasticsearch/connection/http_urllib3.py", line 241, in perform_request
    raise ConnectionError("N/A", str(e), e)
elasticsearch.exceptions.ConnectionError: ConnectionError(<urllib3.connection.HTTPConnection object at 0x7f66d3bb9cd0>: Failed to establish a new connection: [Errno 111] Connection refused) caused by: NewConnectionError(<urllib3.connection.HTTPConnection object at 0x7f66d3bb9cd0>: Failed to establish a new connection: [Errno 111] Connection refused)

Needed: Scraper that automatically extracts questions and answers from a URL

For now we have individual scrapers for each site. Adding more sites is a manual, slow process, and existing scrapers fail when a site changes slightly. See the individual scrapers here.

Automatic Scraper
We need a scraper that takes a URL to an FAQ page and automatically extracts questions and answers in a structured way. The scraper might need some NLP-based question detection to identify which parts should be extracted. For some pseudo code, see here.

Data sources
We can curate a sheet of official FAQ pages and crawl the relevant information more quickly.
That way the community can check the validity of the source FAQ and whether the extraction worked.
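A minimal heuristic sketch of such a generic extractor, assuming questions sit in heading tags and end with a question mark; a real version would replace this heuristic with NLP-based question detection:

```python
import re
from bs4 import BeautifulSoup

def extract_faq(html):
    """Heuristically pull (question, answer) pairs out of an FAQ page.

    A tag counts as a question if it is a heading whose text ends with a
    question mark; the answer is the text of the following siblings up to
    the next heading.
    """
    soup = BeautifulSoup(html, "html.parser")
    pairs = []
    for tag in soup.find_all(re.compile(r"^h[1-6]$")):
        text = tag.get_text(strip=True)
        if not text.endswith("?"):
            continue
        answer_parts = []
        for sib in tag.find_next_siblings():
            if re.match(r"^h[1-6]$", sib.name or ""):
                break  # next section starts
            answer_parts.append(sib.get_text(" ", strip=True))
        if answer_parts:
            pairs.append((text, " ".join(answer_parts)))
    return pairs
```

This obviously misses FAQ pages built from accordions or definition lists, which is exactly where the NLP-based detection would come in.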

What is the current model?

Hey,

what is the current model used for https://covid-staging.deepset.ai/answers? :-)
I find it very accurate for my questions and even single words (in German). Have you fine-tuned on the German corona QA pairs? Do you have any trained deep learning matching algorithm in use? I cannot imagine that the model just uses cosine similarity with BERT, because that does not perform as well in my case as the bot's model does right now.

I am experimenting with my own questions on the pretrained deepset model (German) using cosine similarity. I wonder why queries with words like "hallo" or "die" have only marginally lower similarity than real corona-specific questions when using just the German deepset model. Those irrelevant words get a similarity of around 90%...
Do you know why this is the case?

Since QA pairs in German are rare, do you have any idea what other methods one could try for text matching without training, maybe something like Word Mover's Distance with BERT embeddings?

I am very new to using BERT.

Real-Time data scraping for countries

Hi,

I have been working on a chatbot for the Croatian language. Here is a little help for real-time scraping.


import requests
from bs4 import BeautifulSoup
import numpy as np
import pandas as pd

url = "https://www.worldometers.info/coronavirus/"
headers = {'Accept': 'text/html'}
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.content, "lxml")

# Grab the text of every cell in the main per-country table
rows = soup.find(id="main_table_countries_today").find_all("tr")
elements = np.array([[cell.text for cell in row.find_all("td")] for row in rows])
# Keep only complete data rows (9 columns at the time of writing)
elements = [row for row in elements if len(row) == 9]

worldometers = pd.DataFrame(elements)
worldometers.columns = ["Country,Other", "Total Cases", "New Cases", "Total Deaths", "New Deaths",
                        "Total Recovered", "Active Cases", "Serious, Critical", "Tot Cases/1M pop"]
worldometers

How to use the front-end

Hey,

how can I use the front-end? When your bot was online I tested it, and I want to use the Telegram front-end. But I am a little bit lost, since there is no description of how to do that.

Thanks in advance!

CDC Water Scraper Error

columns["question"].append(current_question)

This line in the CDC_Water_scraper will cause an error.

UnboundLocalError: local variable 'current_question' referenced before assignment

It seems that the following lines

columns["question"].append(current_question)
columns["answer"].append(current_answer)
columns["answer_html"].append(current_answer_html)
should be deleted.
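For reference, the usual fix for this class of bug is to initialize the state before the loop and only append once a question has actually been seen. A generic sketch of the pattern (the element stream and names are illustrative, not the scraper's actual code):

```python
def collect_pairs(elements):
    """Walk a stream of (kind, text) elements and collect Q/A pairs.

    Guards against the UnboundLocalError by tracking the current question
    explicitly and never appending before one has been seen.
    """
    columns = {"question": [], "answer": []}
    current_question = None
    current_answer = []
    for kind, text in elements:
        if kind == "question":
            if current_question is not None:      # flush the previous pair
                columns["question"].append(current_question)
                columns["answer"].append(" ".join(current_answer))
            current_question, current_answer = text, []
        elif current_question is not None:        # ignore text before the 1st question
            current_answer.append(text)
    if current_question is not None:              # flush the last pair
        columns["question"].append(current_question)
        columns["answer"].append(" ".join(current_answer))
    return columns
```

With this shape, a page that yields no questions simply returns empty columns instead of crashing.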

Implement periodic sync of Elasticsearch with scrapers

Proposal

In the current implementation, the meta scraper runs all the scrapers sequentially, crawls the FAQs, and then writes to an Elasticsearch index. This is good for initializing an index from scratch.

We should implement a periodic job (cron or AWS Lambda) that runs the meta scraper and checks for updates, additions, and deletions since the last run.

A possible quick-and-dirty alternative to a periodic sync job would be to recreate the entire Elasticsearch index on each crawl. This works, except that collecting user feedback gets tricky, as we lose the document_id when the list of scrapers gets updated.

Workflow

  • execute the meta crawler
  • search ES to check whether the crawled question/answer pairs for a given scraper are already present. The ES query can be filtered by the link field.
  • existing questions in ES that are no longer present (or have changed) in the newly crawled link are marked as outdated in ES

Other details

  • Currently, the document_id field in ES is populated with incrementing numbers. It could be changed to a UUID to make things simpler to implement.
  • The API queries should be changed to exclude outdated documents.
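The workflow above can be sketched as plain diff logic; the Elasticsearch calls are deliberately left out, and a real job would use the client's search/update APIs filtered on the link field:

```python
def sync(crawled, indexed):
    """Diff freshly crawled Q/A pairs against what is already indexed.

    crawled / indexed: dicts mapping question text -> answer text for one
    scraper (i.e. one link). Returns (to_add, to_mark_outdated, unchanged).
    """
    # New questions, or questions whose answer text changed
    to_add = {q: a for q, a in crawled.items()
              if q not in indexed or indexed[q] != a}
    # Indexed questions that disappeared or changed -> mark outdated, keep feedback
    outdated = [q for q, a in indexed.items()
                if q not in crawled or crawled[q] != a]
    unchanged = [q for q in indexed
                 if q in crawled and crawled[q] == indexed[q]]
    return to_add, outdated, unchanged
```

Marking outdated documents instead of deleting them is what preserves the document_id that the feedback data points at.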

Results with custom dataset

Hello!

First of all, thank you again for your incredible contribution with not only this dataset, but most importantly with the Haystack toolset!

I was able to closely approximate the results of your paper when running https://github.com/deepset-ai/FARM/blob/master/examples/question_answering_crossvalidation.py, although I had to reduce batch_size to 25 to prevent RuntimeError: CUDA out of memory. Tried to allocate 540.00 MiB (GPU 0; 15.78 GiB total capacity; 14.29 GiB already allocated; 386.75 MiB free; 14.35 GiB reserved in total by PyTorch). This was on an Ubuntu 18.04 VM with a Tesla V100 GPU and 128 GB of disk space. As mentioned, the results obtained were quite close:
XVAL EM: 0.26151560178306094
XVAL f1: 0.5858967501101285

I created a custom COVID-19 dataset that combines a preprocessed/cleansed subset of the dataset from the paper "Collecting Verified COVID-19 Question Answer Pairs" (Poliak et al., 2020) and a SQuADified version of your dataset, faq_covidbert.csv. For the latter, I used your annotation tool to map questions to chunks in the answers, treating the full answers as contexts.

I trained a model with this dataset using the hyperparameters you specify here: https://huggingface.co/deepset/roberta-base-squad2-covid#hyperparameters . Informal tests of various questions related to Covid-19 indicate superior responses generated from my model as opposed to roberta-base-squad2-covid, which isn't surprising as inspection of both datasets reveals that mine contains far more Covid-19-specific questions and answers.

However, when running question_answering_crossvalidation.py with my dataset, the metric results are not as good as those observed with your dataset, or even the baseline referenced in the paper. Here are the EM and F1 scores I obtained with my dataset:
XVAL EM: 0.21554054054054053
XVAL f1: 0.4432141443807887

Can you provide any insight as to why this would be the case? Thank you so much!

Infrequently Asked Question (IFAQ) classification and answering

Many people have specific information needs, which may not be answered in FAQs (e.g. "How many people are infected with Corona in Berlin?").

A possible first step would be to classify the intent of the question (w/ Rasa or DeepPavlov).

Afterwards, slot filling can be used to extract semantic concepts (w/ Rasa or DeepPavlov).

The information can then be converted into database queries (e.g. COVID-19, Coronavirus Tracker API, Coronazaehler).

Extra: The result may be formatted for different intents.
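As an illustration of the intent classification + slot filling + query step, here is a toy rule-based sketch; the patterns, slot lookup, and stats source are all hypothetical stand-ins for what Rasa/DeepPavlov and a real case-tracker API would provide:

```python
import re

# Toy intent patterns; a real system would use a trained classifier (Rasa, DeepPavlov).
INTENTS = {
    "case_count": re.compile(r"how many (?:people|cases)", re.I),
}

def parse_question(question, known_places):
    """Classify the intent and fill a 'place' slot by dictionary lookup."""
    intent = next((name for name, pat in INTENTS.items() if pat.search(question)), None)
    place = next((p for p in known_places if p.lower() in question.lower()), None)
    return intent, place

def answer(question, stats, known_places):
    """Turn intent + slot into a lookup against a (stubbed) case database."""
    intent, place = parse_question(question, known_places)
    if intent == "case_count" and place in stats:
        return f"{stats[place]} confirmed cases in {place}."
    return None  # no match -> fall back to regular FAQ matching
```

So "How many people are infected with Corona in Berlin?" would resolve to a database lookup for Berlin, while anything unclassified falls through to the FAQ pipeline.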

Similar projects:
Rasa project for answering simple COVID-19 questions.
https://github.com/LuisMossburger/CoronaBibliothekar/tree/rasa-init

Combined COVID-19 cases database (RKI, JHU, ECDC)
https://github.com/swildermann/COVID-19

How to contribute

Can anyone share more details about how we should contribute to this project, like a to-be-done list or something similar?

Thanks.

Open Source Helps!

Thanks for your work to help people in need! Your site has been added! I currently maintain the Open-Source-COVID-19 page, which collects all open source projects related to COVID-19, including maps, data, news, APIs, analysis, medical and supply information, etc. Please share it with anyone who might need the information in the list or who might contribute to some of those projects. You are also welcome to recommend more projects.

http://open-source-covid-19.weileizeng.com/

Cheers!

Error when loading dataset

Hi,

I'm trying to load the dataset using this suggested code:

from datasets import load_dataset
dataset = load_dataset("covid_qa_deepset")

However, I get the following error:

FileNotFoundError: Couldn't find file locally at covid_qa_deepset\covid_qa_deepset.py, or remotely at https://raw.githubusercontent.com/huggingface/datasets/1.1.3/datasets/covid_qa_deepset/covid_qa_deepset.py or https://s3.amazonaws.com/datasets.huggingface.co/datasets/datasets/covid_qa_deepset/covid_qa_deepset.py

Would greatly appreciate some advice! Thanks!

Question : Exact Match computation

Hello, would you please help me understand how the "Exact Match" metric is calculated? Maybe you could point me to the .py file that contains the code. Thank you!

Multilingual IR with Machine-Translated FAQ

Building multilingual models (zero-shot, transfer learning, etc.) takes time.

So, in the meantime, as stated in #2, we could machine-translate FAQs from English into other languages and add them to the search cluster so that they can be retrieved for foreign-language input. The translations don't need to be perfect, just sufficient for retrieval (adequacy before fluency/grammar).

TODOs:

  • Scrape the English FAQ from data/scrapers repo
  • Build machine-translator tool (e.g. with https://pypi.org/project/googletrans/)
  • Translate some samples to check quality
  • Translate all English FAQ
  • Add the data to the ES cluster with columns: language, original_english_doc, is_machine_translated
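A sketch of the data-preparation step; translate below is a stand-in for a real MT call (e.g. googletrans' Translator.translate), and the document layout just mirrors the proposed columns:

```python
def translate(text, dest):
    # Stand-in for a real MT call, e.g. with googletrans:
    #   from googletrans import Translator
    #   return Translator().translate(text, src="en", dest=dest).text
    return f"[{dest}] {text}"

def build_translated_docs(english_faqs, target_langs):
    """Produce ES-ready documents carrying the proposed metadata columns."""
    docs = []
    for faq in english_faqs:
        for lang in target_langs:
            docs.append({
                "question": translate(faq["question"], lang),
                "answer": translate(faq["answer"], lang),
                "language": lang,
                "original_english_doc": faq["question"],
                "is_machine_translated": True,
            })
    return docs
```

Keeping original_english_doc on each document makes it easy to re-run translation later without re-scraping.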

Instructions for Hosting API

My website (http://know-covid19.info/) continues to get a few dozen hits daily, but I had to remove the FAQ section when Deepset took the hosted API offline. :-(

I would be happy to host the API myself. It looks like the database is included in the GitHub repo, but what about the trained model? Can you share with me the resources and instructions so that I can host the API?

Model 2 Issue

While using model 2, the API returns an answer for almost everything, but not in English. I believe model 2 should only return English answers.

Try "Define gravity?"

Document Retrieval for extractive QA with COVID-QA

Thank you so much for sharing your data and tools! I am working with the question-answering dataset for an experiment of my own.

@Timoeller mentioned in #103 that the documents used in the annotation tool to create the COVID-QA.json dataset "are a subset of CORD-19 papers that annotators deemed related to Covid." I was wondering if these are the same documents as listed in faq_covidbert.csv.

The reason I ask is that, as a workaround, I've created my own retrieval txt file(s) by extracting the answers from COVID-QA.json, but the results are hit or miss. They are particularly off if I break the file up into chunks to improve performance, for instance into a separate txt file for each answer. I'm assuming this is due to lost context. I'm wondering if I should simply be using faq_covidbert as illustrated here, even though I am using extractive QA.

The reason I chose my method is that I was trying to follow an approach most closely approximating the extractive QA tutorial.

My ultimate objective is to compare the experience of using extractive QA vs FAQ-style QA, so I presumed that it would be apropos to have a bit of separation in the doc storage dataset.

Thank you!

Data Augmentation

Experiment with different methods for data augmentation, report results and compare to baseline.

BioBERT Model Available - Trained on BioASQ for Question Answer

Hi - I wanted to share a model that I've pretrained from scratch using BERT Large Cased and the BioASQ 7b factoid dataset on a TPU v2-8.

Original Implementation:
https://github.com/dmis-lab/biobert

Dataset can also be found on their repo.

Model Details:
loss = 0.41782737
step = 18000
max_seq_length = 384
learning_rate = 3e-6
doc_stride = 128

The model is TensorFlow-based; I haven't yet converted it to PyTorch or transformers, and I haven't evaluated it yet.

I'd like to continue training it on COVID-related questions, as well as additional data from BioASQ, but I haven't yet found an easy way to convert the raw BioASQ data into the training format. If someone would like to do that so I can continue training the model, please let me know.

You should be able to download all the files easily with gsutil installed by running

gsutil -m cp -r gs://ce-covid-public/biobert-large-cased/* /path/to/folder/

If someone wants to run evaluation on the models and provide the metrics, I can update this.

Question - when running the backend in Docker with GPU enabled and BERT embeddings, it doesn't seem to use the GPUs, even with all the correct drivers installed. Is there some documentation around this?

Great job on the progress so far! I believe there's a lot of value in what's being done.

Add extractive QA (aka SQuAD style)

So far we rely on matching questions against one of the existing FAQs.

  • Pro: fast + reliable answers
  • Con: only works for the most common questions

=> We could add an extractive QA model + some trustworthy text data sources to handle the long tail of questions. If there's no good match in the FAQs, we could forward the request to this extractive QA model.
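The routing described above could be sketched roughly like this (a hedged illustration, not the project's actual code; the threshold, the embedding vectors, and the `extractive_qa` callable are all placeholders):

```python
import math

def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def route_question(q_vec, faq_vecs, faq_answers, extractive_qa, threshold=0.85):
    """Return the FAQ answer if the question matches one closely enough,
    otherwise fall back to the (slower) extractive QA model."""
    sims = [cosine(q_vec, v) for v in faq_vecs]
    best = max(range(len(sims)), key=lambda i: sims[i])
    if sims[best] >= threshold:
        return faq_answers[best], "faq"
    return extractive_qa(q_vec), "extractive"
```

The threshold would need tuning on the feedback data so that genuinely common questions stay on the fast FAQ path.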

Question : API

I've been developing a Telegram bot using the API. Currently I'm using https://covid-middleware.deepset.ai/api/bert/question to get the answers.

curl -X POST \
  https://covid-middleware.deepset.ai/api/bert/question \
  -H 'content-type: application/json' \
  -d '{
	"question":"community spread?"
}'

but the Swagger docs don't list this API and show a different one with different request/response structures.

So my question is: which API should I use to get the answers? @tanaysoni

Frontend: giving feedback results in tagging multiple answers

"I just wanted to 'like' the upper answer, but by clicking the button the second (incorrect) answer also was marked with a green like.
Question asked was "Wie lang ist die Inkubationszeit" or something similar."

The problem seems to be, that there answers have the same document_id.
We need some unique id for each answer - maybe paragraph_id?

Example:
paragraph_id:"HnLWAnEB3Qua7g62e99g"
document_id:"1"

Preprocessing of context to fit max_length

Hi, would you please help me understand how the preprocessing is done for the CovidQA corpus? I ask because the contexts in the CovidQA dataset are much larger than the maximum sequence length set in the code (BERT's max_length is 512 tokens). How is the data processed to fit within this limit? I couldn't find the code for that in the repo. Please advise. Thank you.
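The usual SQuAD-style answer to this is a sliding window: long contexts are split into overlapping windows that each fit the model, with the window advancing by `doc_stride` tokens (as in the original BERT `run_squad` preprocessing). A simplified sketch that ignores question and special tokens:

```python
def sliding_windows(tokens, max_len=384, doc_stride=128):
    """Split a long token sequence into overlapping windows: each window
    starts doc_stride tokens after the previous one, so an answer near a
    window boundary is still fully contained in some window.
    (Simplified: real preprocessing also reserves room for the question
    and special tokens within max_len.)"""
    windows = []
    start = 0
    while True:
        windows.append(tokens[start:start + max_len])
        if start + max_len >= len(tokens):
            break
        start += doc_stride
    return windows
```

At prediction time, the spans from all windows of a document are scored together and the best-scoring span is kept.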

Questions regarding datasets for FAQ-QA and extractive-QA

I'm not seeing the same results after fine-tuning roberta-base-squad2 on COVID-QA.json compared to running your deepset/roberta-base-squad2-covid model. I followed the tutorial Tutorial2_Finetune_a_model_on_your_data. In your paper "COVID-QA: A Question Answering Dataset for COVID-19" I didn't see the specific training hyperparameters, which might explain the differences. Did you train with default parameters?

Thanks!

Feature : Matched Question and Feedback Option

Thanks again for creating this PR. Great work!

Two comments:

  • Right now the answer displayed by the bot doesn't contain the "matched question". However, it might be helpful for users to see it in order to judge whether the answer is really relevant to their question. You find it in the response JSON in the field "question".
  • We now also have the option for user feedback (see API endpoint). People can rate whether the given answer was helpful or not, and we will use the data to improve the NLP model. This could also be a helpful addition to the Telegram bot.

Would be great to hear your thoughts on that and maybe address them in a separate PR.

Originally posted by @tholor in #58 (comment)

BMAS-Scraper not working

It seems that the BMAS scraper is not working properly.

The problem appears to be in extracting the questions: the resulting question column contains three empty strings.
