This repo implements, in Java with Apache Lucene, classical information retrieval models from the literature. The project is an extension of a college assignment for a course named "Information Retrieval". The models implemented are those explained in Manning's popular book *Introduction to Information Retrieval*, openly available on the Stanford University website.
These classical methods are tested on the CORD19 dataset (using the 2020-07-16 release) following the TREC-COVID Challenge scheme, where a set of topics is used to launch queries against the retrieval model and a set of relevance judgements is used to compute evaluation metrics.
To tackle the retrieval problem, our implementation considers:
- The `models` field of each topic (see the XML file of the topics set):

```xml
<topics task="COVIDSearch 2020" batch="5">
  <topic number="1">
    <models>coronavirus origin</models>
    <question>what is the origin of COVID-19</question>
    <narrative>seeking range of information about the SARS-CoV-2 virus's origin, including its evolution, animal source, and first transmission into humans</narrative>
  </topic>
  <topic number="2">
    <models>coronavirus response to weather changes</models>
    <question>how does the coronavirus respond to changes in the weather</question>
    <narrative>seeking range of information about the SARS-CoV-2 virus viability in different weather/climate conditions as well as information related to transmission of the virus in different climate conditions</narrative>
  </topic>
  ...
</topics>
```
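The topics file above is plain XML, so it can be read with the JDK's DOM parser. The sketch below is only an illustration of extracting the `number` attribute and `models` field from each topic; the class and helper names are hypothetical and not part of this repository.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class TopicsParseSketch {
    // Hypothetical helper: returns the <models> text of the first <topic> element.
    static String firstQuery(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        Element topic = (Element) doc.getElementsByTagName("topic").item(0);
        return topic.getElementsByTagName("models").item(0).getTextContent();
    }

    public static void main(String[] args) throws Exception {
        // Minimal excerpt of the topics-set.xml structure shown above.
        String xml = "<topics task=\"COVIDSearch 2020\" batch=\"5\">"
                + "<topic number=\"1\"><models>coronavirus origin</models>"
                + "<question>what is the origin of COVID-19</question></topic>"
                + "</topics>";
        System.out.println(firstQuery(xml)); // prints "coronavirus origin"
    }
}
```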
- The title, abstract, authors, body and references of each article in the CORD19 dataset.
- The document embeddings provided with the collection.
Note: If you are unfamiliar with the TREC-COVID Challenge, consider reading:
- Information about the structure of the metadata CSV file, JSON article files, and more in the official GitHub repo of the CORD19 dataset.
- Information about the scheme of the relevance judgements TXT file in the TREC-COVID Challenge page.
To execute this code (Java and Python files), the following prerequisites are needed:
- Java 17.
- Python 3.10 and pip. Once both have been installed, install the wget, bs4, lxml and transformers libraries (tarfile ships with the Python standard library, so it does not need to be installed) by running in a terminal:

```shell
pip install wget bs4 lxml transformers
```
- To download the CORD19 dataset and TREC-COVID Collection files, run the `download-data.py` script in `retrieval-cord19/` via:

```shell
python download-data.py
```
To contribute to this repo or to experiment with the modules, we provide the `pom.xml` file so the Apache Maven project can be created automatically.
Once the data has been downloaded as described in the previous section, the following directory structure is needed in order to understand and execute the Java code of this repository:
```
retrieval-cord19/
    2020-07-16/
        document_parses/
            pdf_json/
            pmc_json/
        embeddings.csv
        metadata.csv
        relevance-judgements.txt
        topics-embeddings.json
        topics-set.xml
    src/
        cords/
        formats/
        lucene/
        models/
        schemas/
        util/
```
The folder `2020-07-16/` contains the CORD19 dataset along with the TREC-COVID auxiliary files:

- `document_parses/` contains the PMC and PDF articles in JSON format.
- `metadata.csv` contains, by row, the most important information about each article of the CORD19 dataset.
- `embeddings.csv` contains, by row, the article embedding computed by a pretrained SPECTER model.
- `relevance-judgements.txt` contains, by row, the relevance judgements of the TREC-COVID Collection.
- `topics-set.xml` contains the TREC-COVID topics info.
- `topics-embeddings.json` contains the embeddings of each topic's query and narrative fields, computed with SPECTER.
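The relevance judgements follow the standard TREC qrels layout, one judgement per line: topic id, iteration, document id, relevance grade. As a rough illustration of parsing one such line, here is a minimal sketch; the class, record, and field names below are hypothetical (they only mirror the idea behind `RelevanceJudgements.java`, not its actual code).

```java
public class QrelsSketch {
    // Hypothetical record; the column layout
    // ("topic-id  iteration  doc-id  judgement") is the standard TREC qrels format.
    record Judgement(int topicId, String docId, int relevance) {}

    static Judgement parseLine(String line) {
        String[] cols = line.trim().split("\\s+");
        // Column 1 (iteration) is ignored here; it is not used for evaluation.
        return new Judgement(Integer.parseInt(cols[0]), cols[2], Integer.parseInt(cols[3]));
    }

    public static void main(String[] args) {
        Judgement j = parseLine("1 0 005b2j4b 2"); // doc id is illustrative
        System.out.println(j.topicId() + " " + j.docId() + " " + j.relevance());
    }
}
```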
The folder `src/` stores the Java packages that implement the reading, parsing, indexing and querying processes used to test our classical retrieval models.
- `cords`: Implements Java classes with the following functionalities:
  - `CollectionReader.java`: reading and parsing the TREC-COVID collection files.
  - `Poolindexing.java`: indexing the collection into an Apache Lucene index.
  - `PageRank.java`: computing the references graph between articles of the collection.
  - `QueryComputation.java`: computing the queries of each topic of the TREC-COVID Challenge.
  - `QueryEvaluation.java`: evaluating our retrieval models in the TREC-COVID Challenge.
- `formats`: Defines the file structures of the collection in order to parse its content:
  - `Article.java` is used for the PMC and PDF JSON files in `document_parses/`.
  - `Metadata.java` is used for each row of the `metadata.csv` CSV file.
  - `RelevanceJudgements.java` is used for each line of the `relevance-judgements.txt` TXT file.
  - `Topics.java` is used for each item of the `topics-set.xml` XML file.
- `lucene`: An abstraction over the original Apache Lucene classes `IndexWriter`, `IndexReader` and `IndexSearcher` that handles thrown exceptions.
- `models`: Implementations of the classical retrieval models (see the next section).
- `schemas`: Our own classes to store variables and easily implement the parsing, indexing and querying processes.
- `util`: Auxiliary static functions used by all classes to avoid repeated code.
In this section we briefly explain the implemented retrieval models. Note that these models are not intended to achieve the best results in the TREC-COVID Challenge, but to show, for academic purposes, how classical retrieval models work.
- **Boolean Weighted Model**: Uses the `title`, `abstract` and `body` fields with weights $20$, $10$ and $5$ respectively. The documents' scores are computed with the Language Retrieval Model using Jelinek-Mercer smoothing.
- **Vector Model**: Uses the `embedding` field of each document and the topic embeddings (stored in `topics-embeddings.json`) to compute a `KnnVectorQuery`, and then applies the Rocchio algorithm to obtain new query embeddings. The Rocchio parameters can be manually configured in the `VectorModel` class. By default, we use $\alpha=0.5$, $\beta=0.4$ and $\gamma=0.1$, and the number of reranking iterations is $5$.
- **Probability Model**: Computes a Boolean Weighted Query using the Probabilistic Retrieval Model and reranks the initial ranking by expanding the query with new terms.
- **PageRank Model**: Uses the Boolean Weighted Model to compute initial results and then reranks them using the PageRank of each document. Note that PageRank is obtained at indexing time.
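For reference, Jelinek-Mercer smoothing scores a document by interpolating its term distribution with the collection's: $\mathrm{score}(q,d) = \sum_{t \in q} \log\big((1-\lambda)\,P(t\mid d) + \lambda\,P(t\mid C)\big)$. The sketch below is a plain-Java illustration of that formula, not the repo's Lucene-based implementation; the $\lambda=0.5$ used in the test is an arbitrary assumption, since the README does not state the value used.

```java
public class JelinekMercerSketch {
    /**
     * Jelinek-Mercer smoothed query likelihood.
     * tfDoc[t]  = frequency of query term t in the document,
     * cfColl[t] = frequency of query term t in the whole collection,
     * docLen / collLen = total token counts, lambda = smoothing weight.
     */
    static double score(int[] tfDoc, long docLen, long[] cfColl, long collLen, double lambda) {
        double s = 0.0;
        for (int t = 0; t < tfDoc.length; t++) {
            double pDoc = (double) tfDoc[t] / docLen;       // P(t|d)
            double pColl = (double) cfColl[t] / collLen;    // P(t|C)
            s += Math.log((1 - lambda) * pDoc + lambda * pColl);
        }
        return s;
    }

    public static void main(String[] args) {
        // A document containing the query term outscores one that lacks it.
        System.out.println(score(new int[]{2}, 10, new long[]{10}, 1000, 0.5));
        System.out.println(score(new int[]{0}, 10, new long[]{10}, 1000, 0.5));
    }
}
```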
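The Rocchio update used by the Vector Model moves the query embedding toward the centroid of relevant results and away from the centroid of non-relevant ones: $\vec{q}\,' = \alpha \vec{q} + \beta \,\mu_{rel} - \gamma \,\mu_{nonrel}$. Here is a minimal plain-Java sketch of one such update with the default parameters stated above; the class and method names are illustrative, not the repo's `VectorModel` API.

```java
public class RocchioSketch {
    // One Rocchio step: q' = alpha*q + beta*mean(rel) - gamma*mean(nonRel).
    static double[] update(double[] q, double[][] rel, double[][] nonRel,
                           double alpha, double beta, double gamma) {
        double[] out = new double[q.length];
        for (int i = 0; i < q.length; i++) {
            double relMean = 0, nonRelMean = 0;
            for (double[] d : rel) relMean += d[i];
            for (double[] d : nonRel) nonRelMean += d[i];
            if (rel.length > 0) relMean /= rel.length;
            if (nonRel.length > 0) nonRelMean /= nonRel.length;
            out[i] = alpha * q[i] + beta * relMean - gamma * nonRelMean;
        }
        return out;
    }

    public static void main(String[] args) {
        // Defaults from the README: alpha=0.5, beta=0.4, gamma=0.1.
        double[] q2 = update(new double[]{1, 0},
                             new double[][]{{1, 1}},   // relevant embeddings (toy data)
                             new double[][]{{0, 1}},   // non-relevant embeddings (toy data)
                             0.5, 0.4, 0.1);
        System.out.println(q2[0] + " " + q2[1]); // prints "0.9 0.3"
    }
}
```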
To obtain the graph of references between documents, we manually implement a search process where, for each document:

- We obtain information about its bibliography entries and how many times each entry is cited in the body text.
- For each bibliography entry (also called a *reference* in the code documentation) we create a `BooleanQuery` and search the index for the title and authors of the entry. We create a match between the bibliography entry and the top $m$ documents obtained.
- Matches are saved as vectors in the index.
Thus, once the PageRank process has finished, the index stores the following information per document:

- A vector $\vec{t}^{(i, c)} = (t^{(i,c)}_1, \dots, t^{(i,c)}_n)$ with reference information, where $t^{(i,c)}_j$ is the number of times $d_i$ references $d_j$, taking cite counts into account.
- A vector $\vec{t}^{(i,nc)}$ obtained by normalizing $\vec{t}^{(i, c)}$ following the PageRank algorithm.
- A vector $\vec{t}^{(i,nb)} = \text{norm}(\vec{t}^{(i,b)})$, where $\vec{t}^{(i,b)} = \mathbb{I}(\vec{t}^{(i,c)} \geq 1)$ is the binarization of $\vec{t}^{(i,c)}$.
- Inverted vectors: the inverse-references counterparts of the above, storing for each document the normalized references it receives from other documents.

Note: With the inverse-references normalized vectors we can run the PageRank algorithm until convergence.
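Running PageRank until convergence amounts to a power iteration over the normalized reference vectors. The sketch below is a minimal, self-contained illustration of that step on a tiny dense matrix; the damping factor $0.85$ and the iteration count are illustrative assumptions, not values taken from this repo, which works on the sparse vectors stored in the index.

```java
public class PageRankSketch {
    /**
     * Power iteration with damping. m[i][j] is the (normalized) probability
     * that document i references document j, so each row of m sums to 1.
     */
    static double[] pageRank(double[][] m, double damping, int iters) {
        int n = m.length;
        double[] pr = new double[n];
        java.util.Arrays.fill(pr, 1.0 / n); // uniform start vector
        for (int it = 0; it < iters; it++) {
            double[] next = new double[n];
            for (int j = 0; j < n; j++) {
                double in = 0;
                for (int i = 0; i < n; i++) in += pr[i] * m[i][j]; // mass flowing into j
                next[j] = (1 - damping) / n + damping * in;        // teleportation + links
            }
            pr = next;
        }
        return pr;
    }

    public static void main(String[] args) {
        // Two documents citing each other: the ranking is symmetric.
        double[] pr = pageRank(new double[][]{{0, 1}, {1, 0}}, 0.85, 50);
        System.out.println(pr[0] + " " + pr[1]);
    }
}
```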