This repo implements, in Java with Apache Lucene, classical information retrieval models from the literature. The project is an extension of a college assignment for a course named "Information Retrieval". The models implemented are those explained in Manning's popular book *Introduction to Information Retrieval*, openly available on the Stanford University website.
These classical methods are tested on the CORD19 dataset (using the 2020-07-16 release) following the TREC-COVID Challenge scheme, where a set of topics is used to launch queries against the retrieval model and a set of relevance judgements is used to compute evaluation metrics.
To tackle the retrieval problem, our implementation considers:
- The `models` field of each topic (see the XML file of the topics set):

```xml
<topics task="COVIDSearch 2020" batch="5">
  <topic number="1">
    <models>coronavirus origin</models>
    <question>what is the origin of COVID-19</question>
    <narrative>seeking range of information about the SARS-CoV-2 virus's origin, including its evolution, animal source, and first transmission into humans</narrative>
  </topic>
  <topic number="2">
    <models>coronavirus response to weather changes</models>
    <question>how does the coronavirus respond to changes in the weather</question>
    <narrative>seeking range of information about the SARS-CoV-2 virus viability in different weather/climate conditions as well as information related to transmission of the virus in different climate conditions</narrative>
  </topic>
  ...
</topics>
```
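The topics file above is plain XML, so it can be read with the JDK's DOM parser. The sketch below is only an illustration of extracting the `number` attribute and `models` field from each topic; the class and helper names are hypothetical and not part of this repository.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class TopicsParseSketch {
    // Hypothetical helper: returns the <models> text of the first <topic> element.
    static String firstQuery(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        Element topic = (Element) doc.getElementsByTagName("topic").item(0);
        return topic.getElementsByTagName("models").item(0).getTextContent();
    }

    public static void main(String[] args) throws Exception {
        // Minimal excerpt of the topics-set.xml structure shown above.
        String xml = "<topics task=\"COVIDSearch 2020\" batch=\"5\">"
                + "<topic number=\"1\"><models>coronavirus origin</models>"
                + "<question>what is the origin of COVID-19</question></topic>"
                + "</topics>";
        System.out.println(firstQuery(xml)); // prints "coronavirus origin"
    }
}
```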
- The title, abstract, authors, body and references of each article in the CORD19 dataset.
- The document embeddings provided with the collection.
Note: If you are unfamiliar with the TREC-COVID Challenge, consider reading:
- Information about the structure of the metadata CSV file, JSON article files, and more in the official GitHub repo of the CORD19 dataset.
- Information about the scheme of the relevance judgements TXT file in the TREC-COVID Challenge page.
To execute this code (Java and Python files), the following prerequisites are needed:
- Java 17.
- Python 3.10 and pip. Once both have been installed, install the wget, bs4, lxml and transformers libraries (tarfile ships with the Python standard library, so it does not need to be installed) by running in a terminal:

```shell
pip install wget bs4 lxml transformers
```
- To download the CORD19 dataset and TREC-COVID Collection files, run the `download-data.py` script in `retrieval-cord19/` via:

```shell
python download-data.py
```
To contribute to this repo or to experiment with the modules, we provide the `pom.xml` file so the Apache Maven project can be created automatically.
Once the data has been downloaded as described in the previous section, the following directory structure is needed in order to understand and execute the Java code of this repository:
```
retrieval-cord19/
    2020-07-16/
        document_parses/
            pdf_json/
            pmc_json/
        embeddings.csv
        metadata.csv
        relevance-judgements.txt
        topics-embeddings.json
        topics-set.xml
    src/
        cords/
        formats/
        lucene/
        models/
        schemas/
        util/
```
The folder `2020-07-16/` contains the CORD19 dataset along with the TREC-COVID auxiliary files:

- `document_parses/` contains the PMC and PDF articles in JSON format.
- `metadata.csv` contains, by row, the most important information about each article of the CORD19 dataset.
- `embeddings.csv` contains, by row, the article embedding computed by a pretrained SPECTER model.
- `relevance-judgements.txt` contains, by row, the relevance judgements of the TREC-COVID Collection.
- `topics-set.xml` contains the TREC-COVID topics info.
- `topics-embeddings.json` contains the embeddings of each topic's query and narrative fields, computed with SPECTER.
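The relevance judgements follow the standard TREC qrels layout, one judgement per line: topic id, iteration, document id, relevance grade. As a rough illustration of parsing one such line, here is a minimal sketch; the class, record, and field names below are hypothetical (they only mirror the idea behind `RelevanceJudgements.java`, not its actual code).

```java
public class QrelsSketch {
    // Hypothetical record; the column layout
    // ("topic-id  iteration  doc-id  judgement") is the standard TREC qrels format.
    record Judgement(int topicId, String docId, int relevance) {}

    static Judgement parseLine(String line) {
        String[] cols = line.trim().split("\\s+");
        // Column 1 (iteration) is ignored here; it is not used for evaluation.
        return new Judgement(Integer.parseInt(cols[0]), cols[2], Integer.parseInt(cols[3]));
    }

    public static void main(String[] args) {
        Judgement j = parseLine("1 0 005b2j4b 2"); // doc id is illustrative
        System.out.println(j.topicId() + " " + j.docId() + " " + j.relevance());
    }
}
```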
The folder `src/` stores the Java packages that implement the reading, parsing, indexing and querying processes used to test our classical retrieval models.
- `cords`: Implements Java classes with the following functionalities:
  - `CollectionReader.java`: reading and parsing the TREC-COVID collection files.
  - `Poolindexing.java`: indexing the collection into an Apache Lucene index.
  - `PageRank.java`: computing the references graph between articles of the collection.
  - `QueryComputation.java`: computing the queries of each topic of the TREC-COVID Challenge.
  - `QueryEvaluation.java`: evaluating our retrieval models in the TREC-COVID Challenge.
- `formats`: Defines the file structures of the collection in order to parse its content:
  - `Article.java` is used for the PMC and PDF JSON files in `document_parses/`.
  - `Metadata.java` is used for each row of the `metadata.csv` CSV file.
  - `RelevanceJudgements.java` is used for each line of the `relevance-judgements.txt` TXT file.
  - `Topics.java` is used for each item of the `topics-set.xml` XML file.
- `lucene`: An abstraction over the original Apache Lucene classes `IndexWriter`, `IndexReader` and `IndexSearcher` that handles thrown exceptions.
- `models`: Implementations of the classical retrieval models (see the next section).
- `schemas`: Our own classes to store variables and easily implement the parsing, indexing and querying processes.
- `util`: Auxiliary static functions used by all classes to avoid repeated code.
In this section we briefly explain the implemented retrieval models. Note that these models are not intended to achieve the best results in the TREC-COVID Challenge, but to show, for academic purposes, how classical retrieval models work.
- **Boolean Weighted Model**: Uses the `title`, `abstract` and `body` fields with weights $20$, $10$ and $5$ respectively. The documents' scores are computed with the Language Retrieval Model using Jelinek-Mercer smoothing.
- **Vector Model**: Uses the `embedding` field of each document and the topic embeddings (stored in `topics-embeddings.json`) to compute a `KnnVectorQuery`, and then applies the Rocchio algorithm to obtain new query embeddings. The Rocchio parameters can be manually configured in the `VectorModel` class. By default, we use $\alpha=0.5$, $\beta=0.4$ and $\gamma=0.1$, and the number of reranking iterations is $5$.
- **Probability Model**: Computes a Boolean Weighted Query using the Probabilistic Retrieval Model and reranks the initial ranking by expanding the query with new terms.
- **PageRank Model**: Uses the Boolean Weighted Model to compute initial results and then reranks them using the PageRank of each document. Note that PageRank is obtained at indexing time.
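For reference, Jelinek-Mercer smoothing scores a document by interpolating its term distribution with the collection's: $\mathrm{score}(q,d) = \sum_{t \in q} \log\big((1-\lambda)\,P(t\mid d) + \lambda\,P(t\mid C)\big)$. The sketch below is a plain-Java illustration of that formula, not the repo's Lucene-based implementation; the $\lambda=0.5$ used in the test is an arbitrary assumption, since the README does not state the value used.

```java
public class JelinekMercerSketch {
    /**
     * Jelinek-Mercer smoothed query likelihood.
     * tfDoc[t]  = frequency of query term t in the document,
     * cfColl[t] = frequency of query term t in the whole collection,
     * docLen / collLen = total token counts, lambda = smoothing weight.
     */
    static double score(int[] tfDoc, long docLen, long[] cfColl, long collLen, double lambda) {
        double s = 0.0;
        for (int t = 0; t < tfDoc.length; t++) {
            double pDoc = (double) tfDoc[t] / docLen;       // P(t|d)
            double pColl = (double) cfColl[t] / collLen;    // P(t|C)
            s += Math.log((1 - lambda) * pDoc + lambda * pColl);
        }
        return s;
    }

    public static void main(String[] args) {
        // A document containing the query term outscores one that lacks it.
        System.out.println(score(new int[]{2}, 10, new long[]{10}, 1000, 0.5));
        System.out.println(score(new int[]{0}, 10, new long[]{10}, 1000, 0.5));
    }
}
```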
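The Rocchio update used by the Vector Model moves the query embedding toward the centroid of relevant results and away from the centroid of non-relevant ones: $\vec{q}\,' = \alpha \vec{q} + \beta \,\mu_{rel} - \gamma \,\mu_{nonrel}$. Here is a minimal plain-Java sketch of one such update with the default parameters stated above; the class and method names are illustrative, not the repo's `VectorModel` API.

```java
public class RocchioSketch {
    // One Rocchio step: q' = alpha*q + beta*mean(rel) - gamma*mean(nonRel).
    static double[] update(double[] q, double[][] rel, double[][] nonRel,
                           double alpha, double beta, double gamma) {
        double[] out = new double[q.length];
        for (int i = 0; i < q.length; i++) {
            double relMean = 0, nonRelMean = 0;
            for (double[] d : rel) relMean += d[i];
            for (double[] d : nonRel) nonRelMean += d[i];
            if (rel.length > 0) relMean /= rel.length;
            if (nonRel.length > 0) nonRelMean /= nonRel.length;
            out[i] = alpha * q[i] + beta * relMean - gamma * nonRelMean;
        }
        return out;
    }

    public static void main(String[] args) {
        // Defaults from the README: alpha=0.5, beta=0.4, gamma=0.1.
        double[] q2 = update(new double[]{1, 0},
                             new double[][]{{1, 1}},   // relevant embeddings (toy data)
                             new double[][]{{0, 1}},   // non-relevant embeddings (toy data)
                             0.5, 0.4, 0.1);
        System.out.println(q2[0] + " " + q2[1]); // prints "0.9 0.3"
    }
}
```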
To obtain the graph of references between documents, we manually implement a search process where, for each document:

- We obtain information about its bibliography entries and how many times each entry is cited in the body text.
- For each bibliography entry (also called a *reference* in the code documentation) we create a `BooleanQuery` and search the index for the title and authors of the entry. We create a match between the bibliography entry and the top $m$ documents obtained.
- Matches are saved as vectors in the index.
Thus, once the PageRank process has finished, the index stores the following information per document:

- A vector $\vec{t}^{(i, c)} = (t^{(i,c)}_1, \dots, t^{(i,c)}_n)$ with reference information, where $t^{(i,c)}_j$ is the number of times $d_i$ references $d_j$, taking cite counts into account.
- A vector $\vec{t}^{(i,nc)}$ obtained by normalizing $\vec{t}^{(i, c)}$ following the PageRank algorithm.
- A vector $\vec{t}^{(i,nb)} = \text{norm}(\vec{t}^{(i,b)})$, where $\vec{t}^{(i,b)} = \mathbb{I}(\vec{t}^{(i,c)} \geq 1)$ is the binarization of $\vec{t}^{(i,c)}$.
- Inverted vectors: the inverse-references counterparts of the above, storing for each document the normalized references it receives from other documents.

Note: With the inverse-references normalized vectors we can run the PageRank algorithm until convergence.
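Running PageRank until convergence amounts to a power iteration over the normalized reference vectors. The sketch below is a minimal, self-contained illustration of that step on a tiny dense matrix; the damping factor $0.85$ and the iteration count are illustrative assumptions, not values taken from this repo, which works on the sparse vectors stored in the index.

```java
public class PageRankSketch {
    /**
     * Power iteration with damping. m[i][j] is the (normalized) probability
     * that document i references document j, so each row of m sums to 1.
     */
    static double[] pageRank(double[][] m, double damping, int iters) {
        int n = m.length;
        double[] pr = new double[n];
        java.util.Arrays.fill(pr, 1.0 / n); // uniform start vector
        for (int it = 0; it < iters; it++) {
            double[] next = new double[n];
            for (int j = 0; j < n; j++) {
                double in = 0;
                for (int i = 0; i < n; i++) in += pr[i] * m[i][j]; // mass flowing into j
                next[j] = (1 - damping) / n + damping * in;        // teleportation + links
            }
            pr = next;
        }
        return pr;
    }

    public static void main(String[] args) {
        // Two documents citing each other: the ranking is symmetric.
        double[] pr = pageRank(new double[][]{{0, 1}, {1, 0}}, 0.85, 50);
        System.out.println(pr[0] + " " + pr[1]);
    }
}
```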