Dense Passage Retrieval in Conversational Search

The code is split between two platforms: Google Colab and Compute Canada. The project also builds on other resources: the DPR repo for "Dense Passage Retrieval", the CAsT dataset, and the MSMARCO dataset. A further repo is used for CAsT evaluation.

Authors:

Pierre McWhannel, Nicole Yan, Ahmed Salamah.

Features

  1. Bash files to submit jobs on different Compute Canada clusters.
  2. Notebooks to be used on Google Colab.
  3. The dense retriever model is based on a bi-encoder architecture.
  4. Dense Passage Retrieval inspired by this paper.
  5. Related data pre- and post-processing tools.
  6. The dense retriever's inference-time logic is based on a FAISS index, as in the DPR paper.
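As a rough illustration of items 3 and 6: a bi-encoder embeds queries and passages separately, and retrieval returns the passages whose vectors have the highest inner product with the query vector. The snippet below is a minimal sketch of that idea, not the code in this repo. The vectors are toy values, the helper names `dot` and `top_k` are hypothetical, and the exhaustive loop stands in for the FAISS index lookup.

```python
from typing import List, Sequence, Tuple


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Inner product, used here as the query-passage similarity score."""
    return sum(a * b for a, b in zip(u, v))


def top_k(query_vec: Sequence[float],
          passage_vecs: List[Sequence[float]],
          k: int) -> List[Tuple[int, float]]:
    """Return (passage_index, score) pairs for the k highest inner products.

    This is an exhaustive search; an index library performs the same lookup
    efficiently over large passage collections.
    """
    scored = [(i, dot(query_vec, p)) for i, p in enumerate(passage_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy 3-dimensional embeddings standing in for encoder outputs (hypothetical values).
passages = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.5, 0.5, 0.0],
]
query = [1.0, 0.5, 0.0]

results = top_k(query, passages, k=2)
print(results)  # [(0, 1.0), (2, 0.75)]
```

Because passages are encoded independently of the query, their vectors can be computed once offline and only the query needs encoding at inference time.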

Installation

Install from source. A Python virtual environment or a Conda environment is recommended.

git clone https://github.com/AhmedHussKhalifa/Dense_Passage_Retrieval_in_Conversational_Search
cd Dense_Passage_Retrieval_in_Conversational_Search

This project is tested on Python 3.6+, PyTorch 1.2.0+ and Transformers 3.5.

1. Set up Compute Canada and run a single trial:

sbatch ComputeCanada/pyTorch_DPR.sh

You might need the following command-line setup if the previous file did not work.

module load python/3.6.3

# Replace virtual_DPR with wherever you create your virtual env.
virtualenv --no-download virtual_DPR
source virtual_DPR/bin/activate

pip install torch --no-index
pip install --no-index torch torchvision torchtext torchaudio
pip install --no-index 'transformers==3.0.2'
pip install spacy[cuda] --no-index

module load nixpkgs/16.09 gcc/7.3.0 cuda/10.1
export LD_LIBRARY_PATH=$CUDA_HOME/lib64:$LD_LIBRARY_PATH

module load python/3.6.3
module load faiss/1.6.2
deactivate

source virtual_DPR/bin/activate
python --version
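After the setup above, it can help to confirm that the packages are visible from inside the activated environment. The following is a small sketch that only uses the standard library; the function name `check_modules` is hypothetical, and the module list simply mirrors the packages installed or loaded in this step.

```python
import importlib.util
import sys
from typing import Dict, Iterable


def check_modules(names: Iterable[str]) -> Dict[str, bool]:
    """Map each top-level module name to True if it can be found in this environment."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


if __name__ == "__main__":
    print("Python", sys.version.split()[0])
    # Modules expected after the pip installs and `module load faiss` above.
    for name, found in check_modules(["torch", "transformers", "faiss", "spacy"]).items():
        print(f"{name:<14}{'ok' if found else 'MISSING'}")
```

The check only looks the modules up and does not import them, so it is quick to run on a login node.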

2. Download the DPR checkpoint trained on Trivia multi hybrid retriever results, based on the HF bert-base-uncased model:

wget https://dl.fbaipublicfiles.com/dpr/checkpoint/reader/nq-trivia-hybrid/hf_bert_base.cp

3. Download the CAR dataset and split it:

sbatch ComputeCanada/create-json-job.sh
sbatch ComputeCanada/reformat_car.sh

4. Split the data into training, testing and validation (MSMARCO & CAsT):

sbatch ComputeCanada/PreprocessDataNoNeg.sh
sbatch ComputeCanada/merge-job.sh
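The job scripts above handle the actual split. For readers unfamiliar with the idea, here is a minimal sketch of a reproducible train/validation/test split. It is not taken from those scripts: the function name `split_dataset`, the 80/10/10 ratios, and the seed are all assumptions chosen for illustration.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_dataset(items: Sequence[T],
                  train_frac: float = 0.8,
                  dev_frac: float = 0.1,
                  seed: int = 0) -> Tuple[List[T], List[T], List[T]]:
    """Shuffle with a fixed seed, then cut into train / validation / test lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    n_dev = int(len(shuffled) * dev_frac)
    train = shuffled[:n_train]
    dev = shuffled[n_train:n_train + n_dev]
    test = shuffled[n_train + n_dev:]  # whatever remains goes to the test split
    return train, dev, test


# Hypothetical query ids; the real inputs are the MSMARCO and CAsT files.
train, dev, test = split_dataset([f"q{i}" for i in range(10)])
print(len(train), len(dev), len(test))  # 8 1 1
```

Fixing the seed keeps the split identical across runs, which matters when jobs are resubmitted on the cluster.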

5. Generate the dense embeddings on Compute Canada:

sbatch ComputeCanada/ctx_1.sh

6. For query rewriting with GPT2QR, you can use this notebook on Google Colab, or run it on Compute Canada using this link.

Google_Collab/GPT2QR.ipynb

7. For CAsT inference:

sbatch ComputeCanada/cast_inference.sh

8. For other parts of the project, our scripts are combined between Compute Canada and Google Colab (the notebooks are well commented).

9. In the "CAsT_GPT_rewrite" folder you will find rewritten examples of the CAsT queries, following the reformulation strategies used in our experiments.

Note:

If you encounter errors while downloading the pretrained model from Hugging Face, you can follow these steps:

mkdir -p data/models
cd data/models

wget https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased.tar.gz
tar -xzf bert-base-uncased.tar.gz
mv bert_config.json config.json

wget https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-vocab.txt
mv bert-base-uncased-vocab.txt vocab.txt

#### Change hf_models.py in dpr/models (replace YOUR_USERNAME with your own user name)

def get_bert_tokenizer(pretrained_cfg_name: str, do_lower_case: bool = True):
    # Load the vocabulary from the local copy instead of downloading it.
    return BertTokenizer.from_pretrained('/home/YOUR_USERNAME/scratch/DPR/data/models/vocab.txt', do_lower_case=do_lower_case)


class HFBertEncoder(BertModel):
    @classmethod
    def init_encoder(cls, cfg_name: str, projection_dim: int = 0, dropout: float = 0.1, **kwargs) -> BertModel:
        # cfg = BertConfig.from_pretrained(cfg_name if cfg_name else 'bert-base-uncased')
        # Read config.json from the local model directory instead.
        cfg = BertConfig.from_pretrained('/home/YOUR_USERNAME/scratch/DPR/data/models/')
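Before running with the patched paths, it may be worth checking that the local model directory actually holds the files the steps in this note produce. The sketch below uses only the standard library. The function name `missing_model_files` is hypothetical, and the presence of `pytorch_model.bin` is an assumption about what the archive unpacks.

```python
import os
from typing import List

# Files the steps in this note are expected to leave in data/models.
# pytorch_model.bin is an assumption about the contents of the archive.
EXPECTED_FILES = ("config.json", "vocab.txt", "pytorch_model.bin")


def missing_model_files(model_dir: str) -> List[str]:
    """Return the expected files that are not present in model_dir."""
    return [name for name in EXPECTED_FILES
            if not os.path.isfile(os.path.join(model_dir, name))]


if __name__ == "__main__":
    # Path is relative to the repo root; adjust it to match the paths patched above.
    missing = missing_model_files("data/models")
    print("all files present" if not missing else f"missing: {missing}")
```

A missing `config.json` usually means the `mv bert_config.json config.json` step was skipped.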

More details can be provided by any of the following authors:

  1. Pierre McWhannel.
  2. Nicole Yan.
  3. Ahmed Salamah.
