The code is split between two platforms: Google Colab and Compute Canada. It also draws on other repos and datasets used in the implementation of this project: the DPR ("Dense Passage Retrieval") repo, the CAsT dataset, and the MS MARCO dataset. Another repo is used for the CAsT evaluation.
Pierre McWhannel, Nicole Yan, Ahmed Salamah.
- Bash scripts for submitting jobs to different Compute Canada clusters.
- Notebooks to be used on Google Colab.
- The dense retriever model is based on a bi-encoder architecture.
- Dense Passage Retrieval is inspired by this paper.
- Related data pre- and post-processing tools.
- The inference-time logic of the dense retriever component is based on a FAISS index, as in the DPR paper.
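As a rough illustration of the bi-encoder retrieval idea listed above: questions and passages are embedded separately, and passages are ranked by their inner product with the question vector. The sketch below uses plain Python lists and toy vectors in place of real encoder outputs, and an exhaustive search in place of a FAISS index. The function names `inner_product` and `top_k` are hypothetical and do not come from this repo.

```python
import heapq
from typing import List, Sequence, Tuple


def inner_product(u: Sequence[float], v: Sequence[float]) -> float:
    """Similarity score between a question vector and a passage vector."""
    return sum(a * b for a, b in zip(u, v))


def top_k(query_vec: Sequence[float],
          passage_vecs: Sequence[Sequence[float]],
          k: int) -> List[Tuple[int, float]]:
    """Return the k (passage_id, score) pairs with the highest scores, best first."""
    scored = ((pid, inner_product(query_vec, vec))
              for pid, vec in enumerate(passage_vecs))
    return heapq.nlargest(k, scored, key=lambda item: item[1])


# Toy embeddings standing in for the outputs of the passage and question encoders.
passages = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
query = [0.9, 0.1]
results = top_k(query, passages, k=2)
```

An index such as FAISS performs the same maximum inner product search, but scales to millions of passage vectors.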
Install from source. A Python virtual environment or a Conda environment is recommended.
git clone https://github.com/AhmedHussKhalifa/Dense_Passage_Retrieval_in_Conversational_Search
cd Dense_Passage_Retrieval_in_Conversational_Search
This project is tested on Python 3.6+, PyTorch 1.2.0+ and Transformers 3.5.
sbatch ComputeCanada/pyTorch_DPR.sh
You might need the following command-line setup if the previous script did not work.
module load python/3.6.3
# Replace virtual_DPR with wherever you create your virtual env.
virtualenv --no-download virtual_DPR
source virtual_DPR/bin/activate
pip install torch --no-index
pip install --no-index torch torchvision torchtext torchaudio
pip install --no-index 'transformers==3.0.2'
pip install 'spacy[cuda]' --no-index
module load nixpkgs/16.09 gcc/7.3.0 cuda/10.1
export LD_LIBRARY_PATH=$CUDA_HOME/lib64:$LD_LIBRARY_PATH
module load python/3.6.3
module load faiss/1.6.2
deactivate
source virtual_DPR/bin/activate
python --version
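After the setup commands above, a quick sanity check can confirm that the environment is usable before submitting jobs. The following is a minimal sketch, assuming the packages are importable as `torch`, `transformers`, `faiss` and `spacy`. The helper names are hypothetical and not part of this repo.

```python
import importlib.util
import sys

# Import names assumed from the install commands above.
REQUIRED = ("torch", "transformers", "faiss", "spacy")


def missing_packages(names=REQUIRED):
    """Return the names that cannot be found on the current import path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def python_is_supported(minimum=(3, 6)):
    """Check the interpreter against the minimum version this project was tested on."""
    return sys.version_info[:2] >= minimum


if __name__ == "__main__":
    print("Python OK:", python_is_supported())
    print("Missing packages:", missing_packages() or "none")
```

Run it inside the activated virtual environment. An empty list of missing packages only shows that the modules can be located, not that the GPU builds work.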
wget https://dl.fbaipublicfiles.com/dpr/checkpoint/reader/nq-trivia-hybrid/hf_bert_base.cp
sbatch ComputeCanada/create-json-job.sh
sbatch ComputeCanada/reformat_car.sh
sbatch ComputeCanada/PreprocessDataNoNeg.sh
sbatch ComputeCanada/merge-job.sh
sbatch ComputeCanada/ctx_1.sh
6. For query rewriting with GPT2QR, you can use this notebook on Google Colab, or run it on Compute Canada using this link.
Google_Collab/GPT2QR.ipynb
sbatch ComputeCanada/cast_inference.sh
8. For the other parts of the project, our scripts are split between Compute Canada and Google Colab (the notebooks are well commented).
9. In the "CAsT_GPT_rewrite" folder you will find rewritten examples of the CAsT queries, produced with the reformulation strategies used in our experiments.
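The data preparation and encoding jobs listed earlier are submitted one at a time with `sbatch`. If each job should start only after the previous one succeeds, they could be chained with Slurm job dependencies. The sketch below is one way to do that. It assumes the scripts run in the order they appear in this README and that each depends on the one before it. Neither is confirmed here, and the helper names are hypothetical.

```python
import subprocess
from typing import List, Optional

# Script order assumed from the order of the steps in this README.
PIPELINE = [
    "ComputeCanada/create-json-job.sh",
    "ComputeCanada/reformat_car.sh",
    "ComputeCanada/PreprocessDataNoNeg.sh",
    "ComputeCanada/merge-job.sh",
    "ComputeCanada/ctx_1.sh",
]


def sbatch_command(script: str, after_job: Optional[str] = None) -> List[str]:
    """Build an sbatch call that optionally waits for another job to finish OK."""
    cmd = ["sbatch", "--parsable"]  # --parsable prints just the job id
    if after_job:
        cmd.append("--dependency=afterok:" + after_job)
    cmd.append(script)
    return cmd


def submit_chain(scripts=PIPELINE, dry_run: bool = True) -> List[List[str]]:
    """Submit the scripts in order, or only build the commands when dry_run is set."""
    previous = None
    commands = []
    for i, script in enumerate(scripts):
        cmd = sbatch_command(script, previous)
        commands.append(cmd)
        if dry_run:
            previous = "<job{}>".format(i)  # placeholder id, nothing is submitted
        else:
            out = subprocess.run(cmd, check=True, stdout=subprocess.PIPE,
                                 universal_newlines=True).stdout
            previous = out.strip().split(";")[0]  # output may be "jobid;cluster"
    return commands


for command in submit_chain(dry_run=True):
    print(" ".join(command))
```

With `dry_run=True` nothing is submitted, so the printed commands can be reviewed before running with `dry_run=False` on a login node.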
If you run into errors when downloading the pretrained model through Hugging Face, you can follow these steps:
mkdir -p data/models
cd data/models
wget https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased.tar.gz
tar -xzf bert-base-uncased.tar.gz
mv bert_config.json config.json
wget https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-vocab.txt
mv bert-base-uncased-vocab.txt vocab.txt
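Once the manual download steps above are done, it can help to confirm that the model directory has the files a local load would look for. The sketch below assumes the extracted archive also provides `pytorch_model.bin` next to the renamed `config.json` and `vocab.txt`. That file name is an assumption, and the helper name is hypothetical.

```python
import os

# config.json and vocab.txt come from the rename steps above;
# pytorch_model.bin is assumed to be the weights file in the archive.
EXPECTED_FILES = ("config.json", "vocab.txt", "pytorch_model.bin")


def missing_model_files(model_dir, expected=EXPECTED_FILES):
    """Return the expected file names that are not present in model_dir."""
    return [name for name in expected
            if not os.path.isfile(os.path.join(model_dir, name))]


if __name__ == "__main__":
    missing = missing_model_files("data/models")
    print("Missing files:", missing or "none")
```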
#### change the hf_models.py in dpr/models
def get_bert_tokenizer(pretrained_cfg_name: str, do_lower_case: bool = True):
    # Load the tokenizer from the local vocab file instead of downloading it.
    return BertTokenizer.from_pretrained('/home/YOUR_USERNAME/scratch/DPR/data/models/vocab.txt', do_lower_case=do_lower_case)

class HFBertEncoder(BertModel):
    @classmethod
    def init_encoder(cls, cfg_name: str, projection_dim: int = 0, dropout: float = 0.1, **kwargs) -> BertModel:
        #cfg = BertConfig.from_pretrained(cfg_name if cfg_name else 'bert-base-uncased')
        # Load the config from the local model directory instead.
        cfg = BertConfig.from_pretrained('/home/YOUR_USERNAME/scratch/DPR/data/models/')
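The edit above hardcodes a per-user path in two places. One possible alternative is to read the model directory from an environment variable, so the same code works for every user. This is a minimal sketch only: the variable name `DPR_BERT_DIR` and both helper functions are hypothetical and not part of DPR or this repo.

```python
import os


def local_bert_dir(default="data/models"):
    """Resolve the local BERT directory from DPR_BERT_DIR (hypothetical), else a default."""
    raw = os.environ.get("DPR_BERT_DIR", default)
    return os.path.abspath(os.path.expanduser(raw))


def local_vocab_path():
    """Path to the vocab file inside the resolved model directory."""
    return os.path.join(local_bert_dir(), "vocab.txt")
```

The two hardcoded strings in `hf_models.py` could then be replaced with `local_vocab_path()` and `local_bert_dir()`, after setting the variable in the job script, for example with `export DPR_BERT_DIR=$HOME/scratch/DPR/data/models`.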
More details can be provided by any of the authors listed above.