
retrieval-cord19

Welcome to the retrieval-cord19 repository!

Table of Contents

  1. Introduction
  2. Installation
  3. Structure of the repository
  4. Implemented models
  5. Results

Introduction

This repository implements classical information retrieval models from the literature in Java with Apache Lucene. The project extends a college assignment for a course named "Information Retrieval". The implemented models are those explained in Introduction to Information Retrieval by Manning et al., which is openly available on the Stanford University website.

These classical methods are tested on the CORD19 dataset (2020-07-16 release) following the TREC-COVID Challenge scheme: a set of topics provides the queries launched against the retrieval model, and a set of relevance judgements is used to compute evaluation metrics.
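The evaluation scheme described above can be sketched in a few lines of Python. This is a minimal illustration, not the repository's evaluation code: the function names are hypothetical, the judgements are assumed to follow the usual TREC layout (topic, iteration, document id, judgement), and any positive judgement is assumed to count as relevant. The metrics shown (precision at k and average precision) are common choices and are not necessarily the ones reported by this project.

```python
from collections import defaultdict


def parse_qrels(lines):
    """Collect relevant document ids per topic from TREC-style judgement lines."""
    relevant = defaultdict(set)
    for line in lines:
        parts = line.split()
        if len(parts) != 4:
            continue  # skip blank or malformed lines
        topic, _iteration, doc_id, judgement = parts
        if int(judgement) > 0:  # assumption: any positive grade is relevant
            relevant[topic].add(doc_id)
    return relevant


def precision_at_k(ranking, relevant_docs, k):
    """Fraction of the top-k retrieved documents that are relevant."""
    return sum(doc in relevant_docs for doc in ranking[:k]) / k


def average_precision(ranking, relevant_docs):
    """Mean of the precision values at the rank of each relevant document."""
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant_docs:
            hits += 1
            total += hits / rank
    return total / len(relevant_docs) if relevant_docs else 0.0


# Toy data: three judged documents for topic 1 and one ranked result list.
qrels = parse_qrels(["1 0.5 docA 2", "1 0.5 docB 0", "1 0.5 docC 1"])
ranking = ["docA", "docB", "docC"]
print(precision_at_k(ranking, qrels["1"], k=2))
print(average_precision(ranking, qrels["1"]))
```

Averaging `average_precision` over all topics gives the mean average precision of a retrieval model.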

In order to tackle the retrieval problem, our implementation considers:

  • The query field of each topic (see the XML file of the topics set).
<topics task="COVIDSearch 2020" batch="5">
    <topic number="1">
        <query>coronavirus origin</query>
        <question>what is the origin of COVID-19</question>
        <narrative>seeking range of information about the SARS-CoV-2 virus's origin, including its evolution, animal source, and first transmission into humans</narrative>
    </topic>
    <topic number="2">
        <query>coronavirus response to weather changes</query>
        <question>how does the coronavirus respond to changes in the weather</question>
        <narrative>seeking range of information about the SARS-CoV-2 virus viability in different weather/climate conditions as well as information related to transmission of the virus in different climate conditions</narrative>
    </topic>
...
</topics>
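As an illustration of how a topics file with this layout could be read, here is a minimal Python sketch using the standard library. The function name `parse_topics` is hypothetical, and the sample string is a trimmed stand-in for the real topics file, not a copy of it. The sketch keeps every child element of a topic, so the field used for retrieval can be looked up by its tag name.

```python
import xml.etree.ElementTree as ET


def parse_topics(xml_text):
    """Map each topic number to a dict of its child fields (tag -> text)."""
    root = ET.fromstring(xml_text)
    topics = {}
    for topic in root.iter("topic"):
        number = int(topic.get("number"))
        topics[number] = {
            child.tag: (child.text or "").strip() for child in topic
        }
    return topics


# Trimmed sample with a single field per topic, for illustration only.
SAMPLE = """<topics task="COVIDSearch 2020" batch="5">
    <topic number="1">
        <question>what is the origin of COVID-19</question>
    </topic>
    <topic number="2">
        <question>how does the coronavirus respond to changes in the weather</question>
    </topic>
</topics>"""

topics = parse_topics(SAMPLE)
for number, fields in sorted(topics.items()):
    print(number, fields["question"])
```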

Note: If you are unfamiliar with the TREC-COVID Challenge, consider reading:

Installation

To execute this code (Java and Python files), the following prerequisites are needed:

  • Install the required Python packages via:
pip install wget tarfile bs4 lxml transformers
  • To download the CORD19 dataset and the TREC-COVID Collection files, run the download-data.py script in retrieval-cord19/ via:
python download-data.py

To contribute to this repo or experiment with the modules, we provide the pom file to automatically create the Apache Maven project.

Structure of the repository

Once the data has been downloaded as described in the previous section, the Java code of this repository expects the following structure:

retrieval-cord19/
    2020-07-16/
        document_parses/
            pdf_json/
            pmc_json/
            embeddings.csv
            metadata.csv
            relevance-judgements.txt
            topics-embeddings.json
            topics-set.xml
    src/
        cords/
        formats/
        lucene/
        models/
        schemas/
        util/

The folder 2020-07-16/ contains the CORD19 dataset along with TREC-COVID auxiliary files:

The folder src/ stores Java packages that implement the reading, parsing, indexing and querying processes in order to test our classical retrieval models.

Implemented models

In this section we briefly explain the implemented retrieval models. Note that these models are not intended to achieve the best results in the TREC-COVID Challenge, but to show how classical retrieval models work for academic purposes.

Considerations about the PageRank implementation

To obtain the graph of references between documents, we implement a search process where, for each document:

  1. We obtain its bibliography entries and the number of times each entry is cited in the body text.
  2. For each bibliography entry (also called a reference in the code documentation), we create a BooleanQuery and search for the title and authors of the entry in the index. We create a match between each bibliography entry and the top m documents retrieved.
  3. Matches are saved as vectors in the index.

Thus, once the PageRank process has finished, in the index we have the following information per document $d_i$ (for $i=1,...,n$ where $n$ is the number of documents in the collection):

  • A vector $\vec{t}^{(i, c)} = (t^{(i,c)}_1,..., t^{(i,c)}_n)$ with reference information, where $t^{(i,c)}_j$ is the number of times $d_i$ references $d_j$, taking citation counts into account.
  • A vector $\vec{t}^{(i,nc)}$ that is obtained by normalizing $\vec{t}^{(i, c)}$ following the PageRank algorithm:

$$ \text{norm}(\vec{t}_c) = \begin{cases} (1/n, \dots, 1/n) & \text{if } t_j=0 \ \forall j \in [1,n] \\ \vec{t}_c\cdot\frac{1-\alpha}{\text{sum}(\vec{t}_c)} + \frac{\alpha}{n} & \text{otherwise} \end{cases} $$

  • A vector $\vec{t}^{(i,nb)} = \text{norm}(\vec{t}^{(i,b)})$ where $\vec{t}^{(i,b)} = \mathbb{I}( \vec{t}^{(i,c)} \geq 1)$ is the binarization of $\vec{t}^{(i,c)}$.
  • Inverse vectors:

$$ \begin{cases} \vec{o}^{(i,nb)} = (o^{(i,nb)}_1,...,o^{(i,nb)}_n), \quad \text{where } o_j^{(i,nb)} = t^{(j, nb)}_i \\ \vec{o}^{(i,nc)} = (o^{(i,nc)}_1,...,o^{(i,nc)}_n), \quad \text{where } o_j^{(i,nc)} = t^{(j, nc)}_i \end{cases} $$

Note: the $\vec{o}$ vectors are the columns of the matrix obtained by stacking the normalized vectors $\vec{t}_i$, for $i=1,...,n$, as rows.
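The vectors defined above can be illustrated with a short, self-contained Python sketch. It is not the repository's Java code, which stores these vectors in the Lucene index. The helper names `normalize`, `binarize` and `transpose` are hypothetical, and the value of $\alpha$ is an assumption since it is not stated here.

```python
def normalize(t, alpha):
    """Apply norm(t): uniform if t is all zeros, else scaled counts plus alpha/n."""
    n = len(t)
    total = sum(t)
    if total == 0:
        return [1.0 / n] * n
    return [x * (1.0 - alpha) / total + alpha / n for x in t]


def binarize(t):
    """Indicator vector: 1 where the document is referenced at least once."""
    return [1 if x >= 1 else 0 for x in t]


def transpose(rows):
    """Turn the row-stacked t vectors into the inverse (column) o vectors."""
    return [list(column) for column in zip(*rows)]


ALPHA = 0.15  # assumed value, for illustration only

# counts[i][j] = number of times document i cites document j (toy data)
counts = [
    [0, 3, 1],
    [0, 0, 0],  # a document without matched references
    [2, 0, 0],
]

t_nc = [normalize(row, ALPHA) for row in counts]
t_nb = [normalize(binarize(row), ALPHA) for row in counts]
o_nc = transpose(t_nc)
o_nb = transpose(t_nb)

for row in t_nc:
    print([round(x, 4) for x in row])
```

Every normalized vector sums to 1, so the stacked matrix is row-stochastic, which is what the PageRank iteration requires.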

With the normalized inverse-reference vectors, we can iterate the PageRank computation until convergence.
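A minimal Python sketch of that iteration follows, assuming the inverse vectors are available as plain lists. The function name `pagerank`, the uniform starting scores, the L1 stopping criterion and the tolerance are assumptions made for illustration and may differ from the Java implementation.

```python
def pagerank(inverse_vectors, tol=1e-10, max_iter=1000):
    """Iterate scores until the L1 change between iterations drops below tol.

    inverse_vectors[i][j] is the normalized weight of the reference from
    document j to document i, so each new score is a dot product.
    """
    n = len(inverse_vectors)
    scores = [1.0 / n] * n  # start from a uniform distribution
    for _ in range(max_iter):
        updated = [
            sum(weight * score for weight, score in zip(o_i, scores))
            for o_i in inverse_vectors
        ]
        change = sum(abs(new - old) for new, old in zip(updated, scores))
        scores = updated
        if change < tol:
            break
    return scores


# Two documents with alpha = 0.2 (assumed): document 0 cites document 1 once
# and document 1 cites nothing, so the normalized rows are [0.1, 0.9] and
# [0.5, 0.5]. The inverse vectors below are the columns of that matrix.
inverse_vectors = [
    [0.1, 0.5],
    [0.9, 0.5],
]

scores = pagerank(inverse_vectors)
print([round(s, 4) for s in scores])  # [0.3571, 0.6429]
```

The cited document ends up with the higher score, and the scores still sum to 1 because each normalized reference vector does.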

Results

Execution times

Contributors

anaezquerro, pedrosouza1
