Search-Engine-for-COVID-19-literature

CORD-19 Search

A semantic search engine for research papers on COVID-19, covering categories such as symptoms, influential factors, and similar diseases and viruses.
This is a solution to the CORD-19 challenge on Kaggle. The dataset, created in response to the COVID-19 pandemic, contains over 500,000 scholarly articles, including over 200,000 with full text, about COVID-19, SARS-CoV-2, and related coronaviruses.

COVID-19 Open Research Dataset Challenge (CORD-19)

The challenge was to build a search engine or data mining tool that can accurately answer high-priority scientific questions in this domain. The dataset is currently 46.71 GB, and more literature is added periodically. The dataset can be found here and here.
The traditional approach is keyword-based search using metrics such as TF-IDF or BM25. These methods give solid results, but they ignore the order of words in the query. Moreover, the vector size grows with the vocabulary (sparse vectors can mitigate this). Their major drawback, however, is that they do not capture the semantics of the query or the data.
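To make the limitation concrete, here is a minimal pure-Python sketch of Okapi BM25 scoring. It is an illustration only, not code from this project: the function name `bm25_scores`, the toy documents, and the parameter defaults `k1=1.5` and `b=0.75` are all assumptions.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each tokenized document against a tokenized query (Okapi BM25)."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: how many documents contain each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue  # a term absent from the document contributes nothing
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores

# Hypothetical toy corpus, already tokenized.
docs = [
    "duration of viral shedding in patients".split(),
    "viral shedding observed after recovery".split(),
    "economic impact of lockdown".split(),
]

in_order = bm25_scores("viral shedding".split(), docs)
reversed_order = bm25_scores("shedding viral".split(), docs)
paraphrase = bm25_scores("contagious period".split(), docs)
```

In this sketch `in_order` and `reversed_order` are identical, because the score is a sum over query terms and ignores their order. The paraphrased query shares no tokens with any document, so every entry of `paraphrase` is zero even though the first two documents are topically relevant.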

To overcome this, word or sentence embeddings can be used, since they capture the meaning of the query much more accurately.

  • With word embeddings, the query and each document are represented by a weighted sum of the embeddings of the words they contain (BM25 can be used to weight the words).
  • Sentence embeddings need no such aggregation, because each vector already encodes the semantics of the entire text.
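The word-embedding option can be sketched as follows. This is a hedged illustration: the helper `weighted_sentence_vector`, the two-dimensional toy embeddings, and the per-word weights are hypothetical stand-ins (real weights might come from BM25 or IDF, real vectors from a trained model).

```python
def weighted_sentence_vector(tokens, embeddings, weights):
    """Combine word vectors into one text vector as a weighted sum."""
    dim = len(next(iter(embeddings.values())))
    total = [0.0] * dim
    for tok in tokens:
        if tok not in embeddings:
            continue  # skip out-of-vocabulary words
        w = weights.get(tok, 1.0)  # default weight of 1.0 is an assumption
        for i, x in enumerate(embeddings[tok]):
            total[i] += w * x
    return total

# Hypothetical 2-dimensional embeddings and term weights.
embeddings = {"the": [0.0, 1.0], "virus": [1.0, 0.0]}
weights = {"the": 0.5, "virus": 2.0}

vector = weighted_sentence_vector(["the", "virus"], embeddings, weights)
```

The informative word "virus" dominates the resulting vector because its weight is larger, which is the point of weighting by a term-importance score.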

Our solution uses sentence embeddings to encode the query and the data, then computes similarity scores between the two to rank the documents. To generate sentence embeddings, we use BioBERT, a pre-trained biomedical language representation model for biomedical text mining, which produces 768-dimensional vectors. The model can be found here and the corresponding paper here. We use the Sentence Transformers framework (Multilingual Sentence, Paragraph, and Image Embeddings using BERT & Co.) to load the BioBERT model.
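The ranking step can be sketched as below. The text does not state which similarity measure is used, so cosine similarity is an assumption here, and the short toy vectors stand in for the 768-dimensional embeddings the model would produce. The names `cosine` and `rank_documents` are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_u * norm_v)

def rank_documents(query_vec, doc_vecs):
    """Return document indices ordered from most to least similar."""
    scores = [cosine(query_vec, v) for v in doc_vecs]
    return sorted(range(len(doc_vecs)), key=lambda i: scores[i], reverse=True)

# Toy stand-ins for sentence embeddings.
query_vec = [1.0, 0.0]
doc_vecs = [[0.0, 1.0], [2.0, 0.0], [1.0, 1.0]]

ranking = rank_documents(query_vec, doc_vecs)
```

Document 1 points in the same direction as the query and ranks first regardless of its length, since cosine similarity compares direction only.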

The main bottleneck for search engines is the time taken to compute similarity scores between the query and every item in the dataset. To overcome this, we use Faiss, a library for efficient similarity search and clustering of dense vectors. It contains algorithms that search in sets of vectors of any size, including sets that may not fit in RAM. It builds an index over the entire dataset and efficiently returns the top 'n' results.
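Faiss offers several index types, and the text does not say which one is used here. The sketch below is not the Faiss API or its implementation; it is a pure-Python illustration of one general idea behind such indexes, the inverted file: vectors are grouped into cells around centroids, and a query scans only the cells nearest to it. The centroids, vectors, and function names are hypothetical.

```python
import heapq

def dot(u, v):
    """Inner product, used here as the similarity score."""
    return sum(a * b for a, b in zip(u, v))

def build_cells(vectors, centroids):
    """Assign each vector index to the cell of its most similar centroid."""
    cells = {c: [] for c in range(len(centroids))}
    for idx, vec in enumerate(vectors):
        nearest = max(range(len(centroids)), key=lambda c: dot(vec, centroids[c]))
        cells[nearest].append(idx)
    return cells

def search(query, vectors, centroids, cells, top_n, n_probe=1):
    """Return up to top_n vector indices, scanning only n_probe cells."""
    probed = heapq.nlargest(
        n_probe, range(len(centroids)), key=lambda c: dot(query, centroids[c])
    )
    candidates = [i for c in probed for i in cells[c]]
    return heapq.nlargest(top_n, candidates, key=lambda i: dot(query, vectors[i]))

# Toy data: two clusters in two dimensions, with fixed hypothetical centroids.
vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
centroids = [[1.0, 0.0], [0.0, 1.0]]
cells = build_cells(vectors, centroids)

fast = search([1.0, 0.0], vectors, centroids, cells, top_n=2)
wide = search([1.0, 0.0], vectors, centroids, cells, top_n=3, n_probe=2)
```

With `n_probe=1` only half of the toy vectors are scored. Raising `n_probe` trades speed for recall, since results that live in an unprobed cell cannot be returned.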

Question | Search Results
What do models for transmission predict? | Publications
What is the longest duration of viral shedding? | Symptoms
Effectiveness of case isolation/isolation of exposed individuals | Factors

All of the above questions are part of the CORD-19 challenge.

semantic-search-engine-for-covid-19-literature's People

Contributors

aditya9061, arnavsshah
