CSV-for-LSR-ECIR24

Introduction

Welcome to the official repository for the ECIR '24 paper, Improved Learned Sparse Retrieval with Corpus-Specific Vocabularies. This repository contains code and instructions for fully reproducing our results, as well as pointers to checkpoints and datasets that you can incorporate into your own workflows!

Overview

In general, there are 6 steps to fully reproduce our results:

  1. Learning corpus-specific vocabularies (CSV).
  2. Pre-training CSV-based language models on the retrieval corpus.
  3. Training a TILDE model based on the CSV-based LM and expanding the corpus with it.
  4. Creating the expanded corpus and training data for uniCOIL.
  5. Training uniCOIL based on the CSV-based LM.
  6. Running uniCOIL inference on the expanded corpus and creating the inverted index.

Resources

In many circumstances, you don't actually need to follow the whole workflow described above. For example, you could take one of the CSV-based pre-trained checkpoints and fine-tune it with SPLADE. We therefore provide several checkpoints and datasets that you can plug into your own workflows and try out (see the loading sketch after the table)!

| Type | Link | Meaning |
| --- | --- | --- |
| Pre-trained model | pxyu/MSMARCO-V2-BERT-MLM-CSV30k | BERT (CSV, 30K MS MARCO vocabulary) pre-trained on the MS MARCO v2 corpus for 3 epochs |
| Pre-trained model | pxyu/MSMARCO-V2-BERT-MLM-CSV100k | BERT (CSV, 100K MS MARCO vocabulary) pre-trained on the MS MARCO v2 corpus for 3 epochs |
| Pre-trained model | pxyu/MSMARCO-V1-BERT-MLM-CSV300k | BERT (CSV, 300K MS MARCO vocabulary) pre-trained on the MS MARCO v1 corpus for 10 epochs |
| Pre-trained model | pxyu/MSMARCO-V2-BERT-MLM-CSV300k | BERT (CSV, 300K MS MARCO vocabulary) pre-trained on the MS MARCO v2 corpus for 3 epochs |

(more to be added...)
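For example, loading one of the checkpoints above only requires standard Hugging Face calls (a minimal sketch, assuming the checkpoints are ordinary BERT MLM checkpoints, as their names suggest):

from transformers import AutoTokenizer, AutoModelForMaskedLM

model_name = "pxyu/MSMARCO-V2-BERT-MLM-CSV30k"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# the tokenizer should expose the corpus-specific vocabulary (~30K entries here)
print(len(tokenizer))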

Detailed steps of the 6-step approach

1. Learning corpus-specific vocabularies (CSV)
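As a rough illustration of this step (a generic sketch, not necessarily our exact recipe), a corpus-specific WordPiece vocabulary can be learned directly from the collection with the Hugging Face tokenizers library. The corpus path and the 30K size below are placeholders; the paper also studies 100K and 300K vocabularies:

from tokenizers import BertWordPieceTokenizer

# illustrative recipe only: learn a WordPiece vocabulary from the corpus text
tokenizer = BertWordPieceTokenizer(lowercase=True)
tokenizer.train(
    files=["data/collection/collection.tsv"],  # hypothetical corpus location
    vocab_size=30_000,                         # 30K here; 100K/300K also possible
    min_frequency=2,
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
)
tokenizer.save_model("csv-vocab-30k")  # writes vocab.txt into this directory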

2. Pre-training CSV-based language models on the retrieval corpus.
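Again as a hedged sketch rather than our exact setup: MLM pre-training a from-scratch BERT over the new vocabulary can be done with standard transformers components. The corpus file, sequence length, and hyperparameters below are placeholders; the shared checkpoints were pre-trained for 3 epochs on MS MARCO v2 and 10 epochs on MS MARCO v1:

from datasets import load_dataset
from transformers import (BertConfig, BertForMaskedLM, BertTokenizerFast,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

# build a fresh BERT whose embedding table matches the learned CSV vocabulary
tokenizer = BertTokenizerFast(vocab_file="csv-vocab-30k/vocab.txt", do_lower_case=True)
model = BertForMaskedLM(BertConfig(vocab_size=tokenizer.vocab_size))

# one passage per line; tokenize the retrieval corpus
corpus = load_dataset("text", data_files={"train": "passages.txt"})["train"]
corpus = corpus.map(lambda b: tokenizer(b["text"], truncation=True, max_length=128),
                    batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="bert-mlm-csv30k", num_train_epochs=3),
    train_dataset=corpus,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15),
)
trainer.train()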

3. Training a TILDE model based on the CSV-based LM and expanding the corpus with it.

By now, you should have a CSV-based BERT checkpoint pre-trained on the retrieval corpus (either your own or one of the Hugging Face checkpoints we shared), located at PRETRAINED_BERT_PATH_OR_NAME. For document expansion, we mostly inherit the code from the TILDE framework, with some changes that generate augmented training data to deal with the false-negative issue in MS MARCO.

For training a TILDE model using MS MARCO, we first need to process the data:

cd tilde
mkdir -p data/hard_neg data/train data/collection

# download MS MARCO data and place them into the data folder
wget -P data/hard_neg https://huggingface.co/datasets/sentence-transformers/msmarco-hard-negatives/resolve/main/msmarco-hard-negatives.jsonl.gz
wget -P data/hard_neg https://msmarco.z22.web.core.windows.net/msmarcoranking/queries.tar.gz
wget -P data/collection https://msmarco.z22.web.core.windows.net/msmarcoranking/collection.tar.gz
tar -xzvf data/hard_neg/queries.tar.gz -C data/hard_neg/
tar -xzvf data/collection/collection.tar.gz -C data/collection/

# create augmented training data based on borda aggregation
cd scripts
python create-borda-falneg.py --top_k 10

Now, we have augmented training data for TILDE at tilde/data/train/borda_top10.train.tsv. The next step is to train a BERT-based TILDE model using this data:

# go back to the tilde folder
cd ..

python train_tilde.py \
    --model_type_or_path PRETRAINED_BERT_PATH_OR_NAME \
    --train_path data/train/borda_top10.train.tsv \
    --save_path checkpoints/YOUR_TILDE_MODEL_NAME \
    --batch_size 64 \
    --num_gpus 8 \
    --use_dl --use_ql

Finally, we can use the trained TILDE model to generate the expansion tokens that should be added to every MS MARCO passage:

python expansion.py \
    --model_type_or_path PRETRAINED_BERT_PATH_OR_NAME \
    --ckpt_path checkpoints/YOUR_TILDE_MODEL_NAME/epoch_5.ckpt \
    --corpus_path data/collection/collection.tsv \
    --output_dir data/collection/expanded/YOUR_TILDE_MODEL_NAME \
    --topk 200 \
    --batch_size 64 \
    --shard -1 \
    --num_workers 8 \
    --store_raw

Now, the expansion terms drawn from our new vocabulary are available at tilde/data/collection/expanded/YOUR_TILDE_MODEL_NAME, which is what we need for training an effective uniCOIL model next.

4. Creating the expanded corpus and training data for uniCOIL.
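Conceptually, this step appends the expansion tokens from step 3 to each original passage. The sketch below illustrates the idea only; the file names and the JSON-lines layout with pid/terms fields are assumptions, not the actual output format of expansion.py:

import json

# hypothetical layout: one JSON object per line with a passage id and a list
# of expansion terms; the real output format of expansion.py may differ
expansions = {}
with open("data/collection/expanded/YOUR_TILDE_MODEL_NAME/shard.jsonl") as f:
    for line in f:
        record = json.loads(line)
        expansions[record["pid"]] = record["terms"]

# append the expansion terms to every original passage
with open("data/collection/collection.tsv") as src, \
     open("data/collection/expanded_collection.tsv", "w") as dst:
    for line in src:
        pid, text = line.rstrip("\n").split("\t", 1)
        dst.write(f"{pid}\t{text} {' '.join(expansions.get(pid, []))}\n")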

5. Training uniCOIL based on the CSV-based LM.
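Architecturally, uniCOIL adds a per-token scoring head on top of the encoder; our change is to initialize the encoder from a CSV-based checkpoint. A minimal sketch of the standard uniCOIL head (the checkpoint name is just an example):

import torch
from torch import nn
from transformers import AutoModel

class UniCOIL(nn.Module):
    """Per-token impact scorer in the style of uniCOIL (sketch only)."""

    def __init__(self, encoder_name="pxyu/MSMARCO-V2-BERT-MLM-CSV30k"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)
        self.tok_proj = nn.Linear(self.encoder.config.hidden_size, 1)

    def forward(self, input_ids, attention_mask):
        hidden = self.encoder(input_ids=input_ids,
                              attention_mask=attention_mask).last_hidden_state
        # one non-negative weight per token; padding contributes nothing
        weights = torch.relu(self.tok_proj(hidden)).squeeze(-1)
        return weights * attention_mask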

6. Running uniCOIL inference on the expanded corpus and creating the inverted index.
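Once inference produces a term-weight vector per passage, the weights are quantized and indexed with an impact-aware engine. Assuming Anserini/Pyserini-style indexing (an assumption; this step may be implemented differently here), the JsonVectorCollection input format can be produced as in this sketch:

import json, os

def write_impact_shard(docs, path, scale=100):
    """docs: iterable of (doc_id, {term: weight}); quantize and serialize."""
    with open(path, "w") as f:
        for doc_id, weights in docs:
            vector = {t: int(round(w * scale)) for t, w in weights.items() if w > 0}
            f.write(json.dumps({"id": doc_id, "contents": "", "vector": vector}) + "\n")

os.makedirs("impact-shards", exist_ok=True)
write_impact_shard([("0", {"car": 2.31, "rental": 1.05})], "impact-shards/docs00.jsonl")

Shards in this format can then be indexed with Pyserini's Lucene indexer using its impact and pretokenized options.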
