

ConvDR

This repo contains code and data for the SIGIR 2021 paper "Few-Shot Conversational Dense Retrieval".

Prerequisites

Install dependencies:

git clone https://github.com/thunlp/ConvDR.git
cd ConvDR
pip install -r requirements.txt

We recommend setting PYTHONPATH before running the code:

export PYTHONPATH=${PYTHONPATH}:`pwd`

To train ConvDR, we need trained ad hoc dense retrievers. We use ANCE for both tasks. Please download the checkpoints here: TREC CAsT and OR-QuAC. For TREC CAsT, we directly use the official model trained on the MS MARCO Passage Retrieval task. For OR-QuAC, we initialize the retriever from the official model trained on NQ and TriviaQA, and continue training on OR-QuAC with manually reformulated questions using the ANCE codebase.

The following commands download the checkpoints and store them in ./checkpoints:

mkdir checkpoints
wget https://webdatamltrainingdiag842.blob.core.windows.net/semistructstore/OpenSource/Passage_ANCE_FirstP_Checkpoint.zip
wget https://data.thunlp.org/convdr/ad-hoc-ance-orquac.cp
unzip Passage_ANCE_FirstP_Checkpoint.zip
mv "Passage ANCE(FirstP) Checkpoint" ad-hoc-ance-msmarco

Data Preparation

By default, we expect raw data to be stored in ./datasets/raw and processed data to be stored in ./datasets:

mkdir datasets
mkdir datasets/raw

TREC CAsT

CAsT shared files download

Use the following commands to download the document collection for CAsT-19 & CAsT-20 as well as the MARCO duplicate file:

cd datasets/raw
wget https://msmarco.blob.core.windows.net/msmarcoranking/collection.tar.gz
wget http://trec-car.cs.unh.edu/datareleases/v2.0/paragraphCorpus.v2.0.tar.xz
wget http://boston.lti.cs.cmu.edu/Services/treccast19/duplicate_list_v1.0.txt
tar -zxvf collection.tar.gz
tar -xvf paragraphCorpus.v2.0.tar.xz
mv collection.tsv msmarco.tsv

CAsT-19 files download

Download necessary files for CAsT-19 and store them into ./datasets/raw/cast-19:

mkdir datasets/raw/cast-19
cd datasets/raw/cast-19
wget https://raw.githubusercontent.com/daltonj/treccastweb/master/2019/data/evaluation/evaluation_topics_v1.0.json
wget https://raw.githubusercontent.com/daltonj/treccastweb/master/2019/data/evaluation/evaluation_topics_annotated_resolved_v1.0.tsv
wget https://trec.nist.gov/data/cast/2019qrels.txt

CAsT-20 files download

Download necessary files for CAsT-20 and store them into ./datasets/raw/cast-20:

mkdir datasets/raw/cast-20
cd datasets/raw/cast-20
wget https://raw.githubusercontent.com/daltonj/treccastweb/master/2020/2020_automatic_evaluation_topics_v1.0.json
wget https://raw.githubusercontent.com/daltonj/treccastweb/master/2020/2020_manual_evaluation_topics_v1.0.json
wget https://trec.nist.gov/data/cast/2020qrels.txt

CAsT preprocessing

Use the scripts ./data/preprocess_cast19.py and ./data/preprocess_cast20.py to preprocess the raw CAsT files:

mkdir datasets/cast-19
mkdir datasets/cast-shared
python data/preprocess_cast19.py  --car_cbor=datasets/raw/paragraphCorpus/dedup.articles-paragraphs.cbor  --msmarco_collection=datasets/raw/msmarco.tsv  --duplicate_file=datasets/raw/duplicate_list_v1.0.txt  --cast_dir=datasets/raw/cast-19/  --out_data_dir=datasets/cast-19  --out_collection_dir=datasets/cast-shared
mkdir datasets/cast-20
python data/preprocess_cast20.py  --car_cbor=datasets/raw/paragraphCorpus/dedup.articles-paragraphs.cbor  --msmarco_collection=datasets/raw/msmarco.tsv  --duplicate_file=datasets/raw/duplicate_list_v1.0.txt  --cast_dir=datasets/raw/cast-20/  --out_data_dir=datasets/cast-20  --out_collection_dir=datasets/cast-shared

OR-QuAC

OR-QuAC files download

Download necessary OR-QuAC files and store them into ./datasets/raw/or-quac:

mkdir datasets/raw/or-quac
cd datasets/raw/or-quac
wget https://ciir.cs.umass.edu/downloads/ORConvQA/all_blocks.txt.gz
wget https://ciir.cs.umass.edu/downloads/ORConvQA/qrels.txt.gz
gzip -d *.txt.gz
mkdir preprocessed
cd preprocessed
wget https://ciir.cs.umass.edu/downloads/ORConvQA/preprocessed/train.txt
wget https://ciir.cs.umass.edu/downloads/ORConvQA/preprocessed/test.txt
wget https://ciir.cs.umass.edu/downloads/ORConvQA/preprocessed/dev.txt

OR-QuAC preprocessing

Use the script ./data/preprocess_orquac.py to preprocess the OR-QuAC files:

mkdir datasets/or-quac
python data/preprocess_orquac.py  --orquac_dir=datasets/raw/or-quac  --output_dir=datasets/or-quac

Generate Document Embeddings

Our code is based on ANCE and uses a similar embedding inference pipeline: documents are first tokenized and converted to token ids, and the token ids are then used for embedding inference. We create sub-directories tokenized and embeddings inside ./datasets/cast-shared and ./datasets/or-quac to store the tokenized documents and document embeddings, respectively:

mkdir datasets/cast-shared/tokenized
mkdir datasets/cast-shared/embeddings
mkdir datasets/or-quac/tokenized
mkdir datasets/or-quac/embeddings
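
For intuition, here is a minimal sketch of that tokenize-then-encode flow using the HuggingFace transformers API. The model name and [CLS] pooling are illustrative assumptions; the actual scripts load the ANCE/DPR checkpoints downloaded earlier:

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # assumed encoder
encoder = AutoModel.from_pretrained("bert-base-uncased")

# Stage 1: tokenize documents offline and store the token ids
docs = ["A passage about conversational search.", "Another passage."]
inputs = tokenizer(docs, padding=True, truncation=True, return_tensors="pt")

# Stage 2: run embedding inference over the stored token ids
with torch.no_grad():
    out = encoder(**inputs)
embeddings = out.last_hidden_state[:, 0]  # [CLS] vector as the document embedding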

Run ./data/tokenizing.py to tokenize documents in parallel:

# CAsT
python data/tokenizing.py  --collection=datasets/cast-shared/collection.tsv  --out_data_dir=datasets/cast-shared/tokenized  --model_name_or_path=checkpoints/ad-hoc-ance-msmarco --model_type=rdot_nll
# OR-QuAC
python data/tokenizing.py  --collection=datasets/or-quac/collection.tsv  --out_data_dir=datasets/or-quac/tokenized  --model_name_or_path=bert-base-uncased --model_type=dpr

After tokenization, run ./drivers/gen_passage_embeddings.py to generate document embeddings:

# CAsT ($gpu_no is the number of GPUs to use)
python -m torch.distributed.launch --nproc_per_node=$gpu_no drivers/gen_passage_embeddings.py  --data_dir=datasets/cast-shared/tokenized  --checkpoint=checkpoints/ad-hoc-ance-msmarco  --output_dir=datasets/cast-shared/embeddings  --model_type=rdot_nll
# OR-QuAC
python -m torch.distributed.launch --nproc_per_node=$gpu_no drivers/gen_passage_embeddings.py  --data_dir=datasets/or-quac/tokenized  --checkpoint=checkpoints/ad-hoc-ance-orquac.cp  --output_dir=datasets/or-quac/embeddings  --model_type=dpr

Note that we follow the ANCE implementation, and this step takes up a lot of memory. To generate all 38M CAsT document embeddings safely, the machine should have at least 200GB of memory. It's possible to save memory by generating one part at a time, and we may update the implementation in the future.
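
If you need to work around the memory limit before any such update, one option is to stream per-shard embedding files into a single on-disk array instead of holding everything in RAM. The following is a minimal sketch of that idea only; the shard file names, layout, and embedding dimension are assumptions, not the repo's actual format:

import glob
import numpy as np

dim = 768  # ANCE/DPR embedding size (assumed)
shards = sorted(glob.glob("datasets/cast-shared/embeddings/shard_*.npy"))  # hypothetical shard files
total = sum(np.load(s, mmap_mode="r").shape[0] for s in shards)

# Allocate the merged array on disk so RAM usage stays near one shard
merged = np.lib.format.open_memmap(
    "datasets/cast-shared/embeddings/merged.npy",
    mode="w+", dtype=np.float32, shape=(total, dim))

offset = 0
for s in shards:
    part = np.load(s, mmap_mode="r")  # memory-mapped, loaded lazily
    merged[offset:offset + part.shape[0]] = part
    offset += part.shape[0]
merged.flush()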

ConvDR Training

Now everything is prepared: the data is downloaded and preprocessed, and the document embeddings are generated. Simply run ./drivers/run_convdr_train.py to train ConvDR with the KD (MSE) loss:

# CAsT-19, KD loss only, five-fold cross-validation
python drivers/run_convdr_train.py  --output_dir=checkpoints/convdr-kd-cast19  --model_name_or_path=checkpoints/ad-hoc-ance-msmarco  --train_file=datasets/cast-19/eval_topics.jsonl  --query=no_res  --per_gpu_train_batch_size=4  --learning_rate=1e-5   --log_dir=logs/convdr_kd_cast19  --num_train_epochs=8  --model_type=rdot_nll  --cross_validate
# CAsT-20, KD loss only, five-fold cross-validation, use automatic canonical responses, set a longer length
python drivers/run_convdr_train.py  --output_dir=checkpoints/convdr-kd-cast20  --model_name_or_path=checkpoints/ad-hoc-ance-msmarco  --train_file=datasets/cast-20/eval_topics.jsonl  --query=auto_can  --per_gpu_train_batch_size=4  --learning_rate=1e-5   --log_dir=logs/convdr_kd_cast20  --num_train_epochs=8  --model_type=rdot_nll  --cross_validate  --max_concat_length=512
# OR-QuAC, KD loss only
python drivers/run_convdr_train.py  --output_dir=checkpoints/convdr-kd-orquac.cp  --model_name_or_path=checkpoints/ad-hoc-ance-orquac.cp  --train_file=datasets/or-quac/train.jsonl  --query=no_res  --per_gpu_train_batch_size=4  --learning_rate=1e-5  --log_dir=logs/convdr_kd_orquac  --num_train_epochs=1  --model_type=dpr  --log_steps=100
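
For reference, the KD objective itself is simple: the student encodes the raw conversational input, a frozen teacher (the ad hoc retriever) encodes the manual rewrite, and the student is trained to reproduce the teacher's query embedding. A minimal PyTorch sketch, where student and teacher stand in for the encoder forward passes and all names are illustrative:

import torch
import torch.nn.functional as F

def kd_step(student, teacher, conv_input, manual_input, optimizer):
    with torch.no_grad():
        target = teacher(manual_input)  # teacher embedding of the manual rewrite
    pred = student(conv_input)          # student embedding of the conversation
    loss = F.mse_loss(pred, target)     # KD (MSE) loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()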

Note that for CAsT-20, it's better to first pretrain the model on CANARD and then do cross-validation:

# Pretrain on CANARD (use preprocessed OR-QuAC)
python drivers/run_convdr_train.py  --output_dir=checkpoints/convdr-kd-cast20-warmup  --model_name_or_path=checkpoints/ad-hoc-ance-msmarco  --train_file=datasets/or-quac/train.jsonl  --query=man_can  --per_gpu_train_batch_size=4  --learning_rate=1e-5   --log_dir=logs/convdr_kd_cast20_warmup  --num_train_epochs=1  --model_type=rdot_nll  --log_steps=100  --max_concat_length=512
# Do cross-validation on CAsT-20; set model_name_or_path to the pretrained model and teacher_model to the ad hoc model
python drivers/run_convdr_train.py  --output_dir=checkpoints/convdr-kd-cast20  --model_name_or_path=checkpoints/convdr-kd-cast20-warmup  --teacher_model=checkpoints/ad-hoc-ance-msmarco  --train_file=datasets/cast-20/eval_topics.jsonl  --query=auto_can  --per_gpu_train_batch_size=4  --learning_rate=1e-5   --log_dir=logs/convdr_kd_cast20  --num_train_epochs=8  --model_type=rdot_nll  --cross_validate  --max_concat_length=512

To use the ranking loss, we need negative documents for each query. We use the top retrieved documents from the ranking results of the manual queries as negatives, so we first perform retrieval using the manual queries:

# CAsT-19
python drivers/run_convdr_inference.py  --model_path=checkpoints/ad-hoc-ance-msmarco  --eval_file=datasets/cast-19/eval_topics.jsonl  --query=target  --per_gpu_eval_batch_size=8  --ann_data_dir=datasets/cast-19/embeddings  --qrels=datasets/cast-19/qrels.tsv  --processed_data_dir=datasets/cast-19/tokenized  --raw_data_dir=datasets/cast-19   --output_file=results/cast-19/manual_ance.jsonl  --output_trec_file=results/cast-19/manual_ance.trec  --model_type=rdot_nll  --output_query_type=manual  --use_gpu
# OR-QuAC, inference on train, set query to "target" to use manual queries directly
python drivers/run_convdr_inference.py  --model_path=checkpoints/ad-hoc-ance-orquac.cp  --eval_file=datasets/or-quac/train.jsonl  --query=target  --per_gpu_eval_batch_size=8  --ann_data_dir=datasets/or-quac/embeddings  --qrels=datasets/or-quac/qrels.tsv  --processed_data_dir=datasets/or-quac/tokenized  --raw_data_dir=datasets/or-quac   --output_file=results/or-quac/manual_ance_train.jsonl  --output_trec_file=results/or-quac/manual_ance_train.trec  --model_type=dpr  --output_query_type=train.manual  --use_gpu

After the retrieval finishes, we can select negative documents from manual runs and supplement the original training files with them:

# CAsT-19
python data/gen_ranking_data.py  --train=datasets/cast-19/eval_topics.jsonl  --run=results/cast-19/manual_ance.trec  --output=datasets/cast-19/eval_topics.rank.jsonl  --qrels=datasets/cast-19/qrels.tsv  --collection=datasets/cast-shared/collection.tsv  --cast
# OR-QuAC
python data/gen_ranking_data.py  --train=datasets/or-quac/train.jsonl  --run=results/or-quac/manual_ance_train.trec  --output=datasets/or-quac/train.rank.jsonl  --qrels=datasets/or-quac/qrels.tsv  --collection=datasets/or-quac/collection.jsonl
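
Conceptually, gen_ranking_data.py pairs each training query with hard negatives drawn from the manual run. A minimal sketch of that selection logic, assuming standard TREC qrels/run formats (the exact fields the script reads and writes may differ):

from collections import defaultdict

def load_positives(qrels_path):
    rel = defaultdict(set)
    for line in open(qrels_path):
        qid, _, docid, label = line.split()
        if int(label) > 0:
            rel[qid].add(docid)  # judged-relevant documents
    return rel

def select_negatives(run_path, qrels_path, k=10):
    rel = load_positives(qrels_path)
    negs = defaultdict(list)
    for line in open(run_path):  # qid Q0 docid rank score tag
        qid, _, docid, _, _, _ = line.split()
        if docid not in rel[qid] and len(negs[qid]) < k:
            negs[qid].append(docid)  # top retrieved non-relevant docs
    return negs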

Now we are able to use the ranking loss, with the --ranking_task flag on:

# CAsT-19, Multi-task
python drivers/run_convdr_train.py  --output_dir=checkpoints/convdr-multi-cast19  --model_name_or_path=checkpoints/ad-hoc-ance-msmarco  --train_file=datasets/cast-19/eval_topics.rank.jsonl  --query=no_res  --per_gpu_train_batch_size=4  --learning_rate=1e-5   --log_dir=logs/convdr_multi_cast19  --num_train_epochs=8  --model_type=rdot_nll  --cross_validate  --ranking_task
# OR-QuAC, Multi-task
python drivers/run_convdr_train.py  --output_dir=checkpoints/convdr-multi-orquac.cp  --model_name_or_path=checkpoints/ad-hoc-ance-orquac.cp  --train_file=datasets/or-quac/train.rank.jsonl  --query=no_res  --per_gpu_train_batch_size=4  --learning_rate=1e-5  --log_dir=logs/convdr_multi_orquac  --num_train_epochs=1  --model_type=dpr  --log_steps=100  --ranking_task

To disable the KD loss, simply set the --no_mse flag.
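
For reference, a minimal sketch of the multi-task objective: the ranking loss is a softmax cross-entropy that scores the positive document against the sampled negatives by dot product, and the KD (MSE) term is added unless disabled. Shapes and names here are illustrative assumptions:

import torch
import torch.nn.functional as F

def multi_task_loss(q, q_teacher, pos_doc, neg_docs, use_mse=True):
    # q, q_teacher, pos_doc: (dim,); neg_docs: (num_negs, dim)
    docs = torch.cat([pos_doc.unsqueeze(0), neg_docs], dim=0)  # positive first
    scores = docs @ q                                          # dot-product scores
    target = torch.zeros(1, dtype=torch.long)                  # positive is index 0
    loss = F.cross_entropy(scores.unsqueeze(0), target)        # ranking loss
    if use_mse:                                                # dropped by --no_mse
        loss = loss + F.mse_loss(q, q_teacher)                 # KD term
    return loss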

ConvDR Inference

Run ./drivers/run_convdr_inference.py to get inference results. output_file is the OpenMatch-format file for reranking, and output_trec_file is the TREC-style run file, which can be evaluated with the trec_eval tool.

# OR-QuAC
python drivers/run_convdr_inference.py  --model_path=checkpoints/convdr-multi-orquac.cp  --eval_file=datasets/or-quac/test.jsonl  --query=no_res  --per_gpu_eval_batch_size=8  --cache_dir=../ann_cache_dir  --ann_data_dir=datasets/or-quac/embeddings  --qrels=datasets/or-quac/qrels.tsv  --processed_data_dir=datasets/or-quac/tokenized  --raw_data_dir=datasets/or-quac   --output_file=results/or-quac/multi_task.jsonl  --output_trec_file=results/or-quac/multi_task.trec  --model_type=dpr  --output_query_type=test.raw  --use_gpu
# CAsT-19
python drivers/run_convdr_inference.py  --model_path=checkpoints/convdr-kd-cast19  --eval_file=datasets/cast-19/eval_topics.jsonl  --query=no_res  --per_gpu_eval_batch_size=8  --cache_dir=../ann_cache_dir  --ann_data_dir=datasets/cast-19/embeddings  --qrels=datasets/cast-19/qrels.tsv  --processed_data_dir=datasets/cast-19/tokenized  --raw_data_dir=datasets/cast-19   --output_file=results/cast-19/kd.jsonl  --output_trec_file=results/cast-19/kd.trec  --model_type=rdot_nll  --output_query_type=raw  --use_gpu  --cross_validation

The query embedding inference always takes the first GPU. If you set the --use_gpu flag (recommended), retrieval is performed on the remaining GPUs. The retrieval process consumes a lot of GPU memory, so to reduce resource usage we split the document embeddings into several blocks, search them one by one, and finally combine the results. If you have enough GPU resources, you can modify the code to search everything at once.
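
A minimal sketch of that block-wise search, using FAISS (which the ANCE retrieval pipeline builds on): each block is searched separately, a per-query heap keeps the global top-k across blocks, and offsets map block-local indices back to global passage ids. This illustrates the strategy only, not the repo's exact implementation:

import heapq
import faiss
import numpy as np

def blockwise_search(query_emb, blocks, k=100):
    # query_emb: float32 (num_queries, dim); blocks: list of (offset, float32 array)
    merged = [[] for _ in range(query_emb.shape[0])]
    for offset, block in blocks:
        index = faiss.IndexFlatIP(block.shape[1])  # inner-product (dot) index
        index.add(block)
        scores, ids = index.search(query_emb, k)   # top-k within this block
        del index                                  # release memory before the next block
        for qi in range(query_emb.shape[0]):
            for s, i in zip(scores[qi], ids[qi]):
                heapq.heappush(merged[qi], (float(s), int(i) + offset))
                if len(merged[qi]) > k:
                    heapq.heappop(merged[qi])      # keep only the global top-k
    return [sorted(m, reverse=True) for m in merged]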

Download Trained Models

Three trained models can be downloaded via the following links: CAsT19-KD-CV-Fold1, CAsT20-KD-Warmup-CV-Fold2, and ORQUAC-Multi.

Results

Download ConvDR and baseline runs on CAsT

Contact

Please send email to [email protected] or [email protected].


Issues

Question about ConvDR re-ranking

Hi Shi Yu,

The paper said, "The same few-shot paradigms can also be used to train the BERT Reranker".

But I don't understand how ConvDR is optimized under the teacher-student framework for re-ranking. Do we need to incorporate the document into the KD loss? Could you please explain it in more detail?

Provide ConvDR runs for comparison

Dear authors,
thank you for this really interesting work and for publishing your code.

I would like to ask whether it is possible to provide the ranking files (output) for ConvDR as published in the SIGIR paper. I am working in the same domain, but I am having problems creating the ANCE index on the machine I'm using.

Specifically, it would be helpful for me to have the runs on the CAsT 19, 20 (and 21, if available) datasets, for both (a) the original ConvDR model and (b) the "zero-shot" variant introduced in the paper.

PS: Also, one clarification question: to my understanding, the zero-shot variant is simply original query + ANCE, right?

Thank you (and happy Chinese new year!)

May I ask which random_seed you used on CAsT-20?

Hi Shi Yu,

Thanks for your great work and open-source code. May I ask which random_seed you used on CAsT-20? When I use the default seed 42, the results I get are 0.314 (without warmup on CANARD) and 0.325 (with warmup) in terms of nDCG@3, which leaves a small gap from the numbers reported in the paper. Do you have any suggestions for closing this gap?

How critical is the version of transformers used to replicate this work?

Hi, I'm very interested in this work and want to reproduce it, but I noticed that in requirements.txt the transformers version is pinned to 2.3.0, which is not available in my conda setup. Is it OK to replace it with either 2.1.1 (an older version) or 3.5.1 (a newer version)?
Thanks!

What are the parameters of BM25?

Hi Yu,
I found that the performance of BM25 with manual query rewrites on CAsT-20 is 0.445 (MRR) and 0.301 (NDCG@3), which is much higher than the performance of BM25 from Pyserini with the default settings: the result I got is 0.406 (MRR) and 0.257 (NDCG@3).

I would appreciate it if you could share your parameters for BM25.

Best wishes,
Chuan Meng

Cannot wget Passage_ANCE_FirstP_Checkpoint.zip

Hi, I followed the tutorial but got an error when running this line:

wget https://webdatamltrainingdiag842.blob.core.windows.net/semistructstore/OpenSource/Passage_ANCE_FirstP_Checkpoint.zip

The error I get is:

--2021-08-09 13:11:30--  https://webdatamltrainingdiag842.blob.core.windows.net/semistructstore/OpenSource/Passage_ANCE_FirstP_Checkpoint.zip
Resolving webdatamltrainingdiag842.blob.core.windows.net (webdatamltrainingdiag842.blob.core.windows.net)... 52.239.193.68
Connecting to webdatamltrainingdiag842.blob.core.windows.net (webdatamltrainingdiag842.blob.core.windows.net)|52.239.193.68|:443... connected.
Unable to establish SSL connection.

System information: Linux 4.15.0-135-generic #139-Ubuntu SMP Mon Jan 18 17:38:24 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

I have also referred to this page, but none of the solutions there works:
https://stackoverflow.com/questions/50292852/wget-returns-unable-to-establish-ssl-connection/50293151

Thanks!

Issues with mapping passage ids in results for TREC CAsT-19

First of all, I would like to say that in my opinion this work is a great piece of research: the idea is reasonable and the results are great too. That is why I decided to use it in my project about conversational search.

I was happy to see that the project's GitHub repository had a clear readme and seemed easy to replicate. Unfortunately, after about 3 weeks I'm still struggling to run all the steps successfully, and there are some parts of the repository that I don't understand. At the same time, my project deadline is at the end of the week, so I really need it to work soon and would greatly appreciate your help.

The current issue I'm struggling with is that when I run this command:

python drivers/run_convdr_inference.py  --model_path=checkpoints/convdr-kd-cast19  --eval_file=datasets/cast-19/eval_topics.jsonl  --query=no_res  --per_gpu_eval_batch_size=8  --cache_dir=../ann_cache_dir  --ann_data_dir=datasets/cast-19/embeddings  --qrels=datasets/cast-19/qrels.tsv  --processed_data_dir=datasets/cast-19/tokenized  --raw_data_dir=datasets/cast-19   --output_file=results/cast-19/kd.jsonl  --output_trec_file=results/cast-19/kd.trec  --model_type=rdot_nll  --output_query_type=raw  --use_gpu  --cross_validation

I obtain a .trec file that, instead of having passage ids preceded by "CAR" or "MARCO" (like the .trec files you include for download in the repo, such as convdr_multi.trec), only has numerical values as passage ids. Can you please let me know how you mapped the pids to ids with "CAR" or "MARCO"? What I tried is using the id_remap.py script to map the passage ids with the car_idx_to_id.pickle file. However, in that case we seem to be missing the MARCO ids, and when running id_remap.py I get a list index error.

python data/id_remap.py --convdr results/cast-19/multi.trec --doc_idx_to_id datasets/cast-shared/car_idx_to_id.pickle --out_trec results/cast-19/multi_mapped.trec

So my current guess is that I should also add the MARCO ids to car_idx_to_id.pickle. Could you please tell me whether that's what I need to do, or did you map the ids in a different way?

Thank you in advance for your response.

about Generate Document Embeddings

Thanks for sharing your code; it was a very rewarding piece of work. I'm trying to reproduce it, but I've run into some problems.
When I run gen_passage_embeddings.py, it always fails during "merging embeddings". I traced the problem to the function barrier_array_merge in util.py: the size of the variable data_list keeps increasing until it runs out of memory, yet data_list is never output. However, data_array is written to files. As far as I can tell, data_array is the input parameter and nothing is done to it, so why write it out?
I would appreciate it if you could explain this.
PS: I'm sure I have enough memory.

Best wishes!

Issue with Downloading "Passage_ANCE_FirstP_Checkpoint.zip" File

I'm trying to download the "Passage_ANCE_FirstP_Checkpoint.zip" file from the ConvDR repository using the following command:
wget https://webdatamltrainingdiag842.blob.core.windows.net/semistructstore/OpenSource/Passage_ANCE_FirstP_Checkpoint.zip
However, I'm encountering a 404 error, indicating the file cannot be found at the specified location.
The error is:

--2024-03-05 19:57:32-- https://webdatamltrainingdiag842.blob.core.windows.net/semistructstore/OpenSource/Passage_ANCE_FirstP_Checkpoint.zip
Resolving webdatamltrainingdiag842.blob.core.windows.net (webdatamltrainingdiag842.blob.core.windows.net)... 20.60.229.129
Connecting to webdatamltrainingdiag842.blob.core.windows.net (webdatamltrainingdiag842.blob.core.windows.net)|20.60.229.129|:443... connected.
HTTP request sent, awaiting response... 404 The specified resource does not exist.
2024-03-05 19:57:32 ERROR 404: The specified resource does not exist..

I've already tried the following steps:

1. Double-checked the URL for any typos or errors.
2. Verified I'm using the correct protocol (HTTPS).
3. Consulted the ConvDR documentation and release notes, but couldn't find any information about file relocation or updates.

I understand that external file locations might change over time, and I wanted to check if there's an alternative way to download the required file, or if the file is no longer available.

Any information or guidance you can provide would be greatly appreciated.
