
dpr-scale's Introduction

Scalable implementation of dense retrieval.

This repo implements the following papers:

Input Data Format (JSONL)

Linewise JSON file where each row typically looks like:

{
    "question": ...,
    "positive_ctxs": [
        {
        "title": "...",
        "text": "....",
        <optional>
        "id": ...,
        "relevance": ...
        }, {...}, ...
    ],
    "hard_negative_ctxs": [{...}, {...}, ...]
}

or

{
    "question": ...,
    "id": ...,
    "ctxs": [
        {
        "has_answer": True or False,    
        "title": "...",
        "text": "....",
        <optional>
        "id": ...,
        "relevance": ...
        }, {...}, ...
    ]
}
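As a minimal sketch of how rows in the first format could be written and read back, the snippet below uses only the standard `json` module. The helper names (`make_row`, `dump_jsonl`, `load_jsonl`) and the example question and passages are hypothetical and are not part of dpr-scale.

```python
import io
import json


def make_row(question, positives, hard_negatives):
    """Build one training row in the first format shown above."""
    return {
        "question": question,
        "positive_ctxs": positives,
        "hard_negative_ctxs": hard_negatives,
    }


def dump_jsonl(rows, fh):
    # One JSON object per line, with no enclosing list.
    for row in rows:
        fh.write(json.dumps(row, ensure_ascii=False) + "\n")


def load_jsonl(fh):
    # Skip blank lines so a trailing newline does not break parsing.
    return [json.loads(line) for line in fh if line.strip()]


# Hypothetical example content, not taken from any dataset.
rows = [
    make_row(
        "who wrote hamlet",
        [{"title": "Hamlet", "text": "Hamlet is a tragedy by William Shakespeare."}],
        [{"title": "Macbeth", "text": "Macbeth is a tragedy set in Scotland."}],
    )
]

buffer = io.StringIO()  # stands in for a real file opened with open(path, "w")
dump_jsonl(rows, buffer)
buffer.seek(0)
loaded = load_jsonl(buffer)
```

Note that in a real JSONL file the `has_answer` values of the second format must be the JSON booleans `true` and `false`; `json.dumps` performs that conversion from Python's `True` and `False`.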

If your training data is large, you can use a lightweight format by specifying the line number (docidx starting from 0) of the document in the corpus without storing its title and text:

{
    "question": ...,
    "positive_ctxs": [
        {
        "docidx": ..., # denote the position of the passage in the corpus, starting from 0
        <optional>
        "id": ...,
        "relevance": ...
        }, {...}, ...
    ],
    "hard_negative_ctxs": [{...}, {...}, ...]
}

This format requires you to use DenseRetrieverMultiJsonlDataModule and to set --corpus_path while training. See the example below with the config msmarco_baseline.yaml. The corpus follows the default Wiki corpus format, with a header on the first line:

id"\t"text"\t"title
<id>"\t"<text>"\t"<title>
...
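The relationship between `docidx` and the corpus file can be sketched as follows. This is an illustration only, not the code used by DenseRetrieverMultiJsonlDataModule: the helper names `load_corpus` and `resolve` are hypothetical, and the sketch assumes that `docidx` 0 refers to the first passage row after the header line.

```python
import csv
import io


def load_corpus(fh):
    """Read a tab-separated corpus whose first line is the header id, text, title."""
    reader = csv.reader(fh, delimiter="\t", quoting=csv.QUOTE_NONE)
    header = next(reader)
    if header != ["id", "text", "title"]:
        raise ValueError(f"unexpected header: {header}")
    return [{"id": r[0], "text": r[1], "title": r[2]} for r in reader]


def resolve(ctx, corpus):
    # ASSUMPTION: docidx 0 is the first passage row after the header.
    passage = corpus[ctx["docidx"]]
    return {**ctx, "title": passage["title"], "text": passage["text"]}


# Hypothetical two-passage corpus; a real run would open the file at --corpus_path.
corpus_file = io.StringIO(
    "id\ttext\ttitle\n"
    "7\tHamlet is a tragedy.\tHamlet\n"
    "9\tMacbeth is set in Scotland.\tMacbeth\n"
)
corpus = load_corpus(corpus_file)
resolved = resolve({"docidx": 1, "relevance": 1}, corpus)
```

Here `resolved` carries the title and text of the second passage alongside the original `docidx` and `relevance` fields, which is the information the full format stores inline.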

Training on cluster

By default it trains locally:

PYTHONPATH=.:$PYTHONPATH python dpr_scale/main.py trainer.gpus=1

You can try our baseline training example locally on the MS MARCO dataset with the lightweight data format:

PYTHONPATH=.:$PYTHONPATH python dpr_scale/main.py -m --config-name msmarco_baseline.yaml 

SLURM Training

To train the model on SLURM, run:

PYTHONPATH=.:$PYTHONPATH python dpr_scale/main.py -m trainer=slurm trainer.num_nodes=2 trainer.gpus=2

Reproduce DPR on 8 GPUs

PYTHONPATH=.:$PYTHONPATH python dpr_scale/main.py -m --config-name nq.yaml  +hydra.launcher.name=dpr_stl_nq_reproduce

Generate embeddings on Wikipedia

PYTHONPATH=.:$PYTHONPATH python dpr_scale/generate_embeddings.py -m --config-name nq.yaml datamodule=generate datamodule.test_path=psgs_w100.tsv +task.ctx_embeddings_dir=<CTX_EMBEDDINGS_DIR> +task.checkpoint_path=<CHECKPOINT_PATH>

Get retrieval results

Currently this runs on 1 GPU. Use CTX_EMBEDDINGS_DIR from above.

PYTHONPATH=.:$PYTHONPATH python dpr_scale/run_retrieval.py --config-name nq.yaml trainer=gpu_1_host trainer.gpus=1 +task.output_path=<PATH_TO_OUTPUT_JSON> +task.ctx_embeddings_dir=<CTX_EMBEDDINGS_DIR> +task.checkpoint_path=<CHECKPOINT_PATH> +task.passages=psgs_w100.tsv datamodule.test_path=<PATH_TO_QUERIES_JSONL>

Generate query embeddings

Alternatively, query embedding generation and retrieval can be separated. After query embeddings are generated using the following command, the run_retrieval_fb.py or run_retrieval_multiset.py script can be used to perform retrieval.

PYTHONPATH=.:$PYTHONPATH python dpr_scale/generate_query_embeddings.py -m --config-name nq.yaml trainer.gpus=1 datamodule.test_path=<PATH_TO_QUERIES_JSONL> +task.ctx_embeddings_dir=<CTX_EMBEDDINGS_DIR> +task.checkpoint_path=<CHECKPOINT_PATH> +task.query_emb_output_path=<OUTPUT_TO_QUERY_EMB>
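Conceptually, once query and context embeddings exist as separate artifacts, retrieval reduces to a nearest-neighbour search between them. The following is a minimal brute-force sketch, assuming inner-product scoring as in the DPR paper; it is not the implementation in run_retrieval_fb.py or run_retrieval_multiset.py, which may use a dedicated index, and the function names and toy vectors are hypothetical.

```python
import heapq


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def retrieve(query_emb, ctx_embs, k):
    """Return (index, score) pairs for the k contexts with the highest inner product."""
    scored = ((dot(query_emb, c), i) for i, c in enumerate(ctx_embs))
    top = heapq.nlargest(k, scored)
    return [(i, s) for s, i in top]


# Hypothetical 2-dimensional embeddings; real ones are much larger.
ctx_embs = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
query = [1.0, 0.2]
hits = retrieve(query, ctx_embs, k=2)
```

For the toy vectors above, the first and third contexts score highest, so `hits` lists index 0 followed by index 2.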

Get evaluation metrics for a given JSON output file

python dpr_scale/eval_dpr.py --retrieval <PATH_TO_OUTPUT_JSON> --topk 1 5 10 20 50 100 
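To make the `--topk` values concrete, here is a hedged sketch of top-k retrieval accuracy. It assumes the output JSON holds a list of entries whose `ctxs` are ordered by rank and carry a boolean `has_answer`, as in the second input format above; whether eval_dpr.py computes exactly this is an assumption, and `topk_accuracy` is a hypothetical name.

```python
def topk_accuracy(results, ks):
    """Fraction of questions with at least one answer-bearing passage in the top k."""
    accuracy = {}
    for k in ks:
        hits = sum(
            any(ctx["has_answer"] for ctx in entry["ctxs"][:k]) for entry in results
        )
        accuracy[k] = hits / len(results)
    return accuracy


# Hypothetical retrieval output for two questions.
results = [
    {"question": "q1", "ctxs": [{"has_answer": False}, {"has_answer": True}]},
    {"question": "q2", "ctxs": [{"has_answer": False}, {"has_answer": False}]},
]
scores = topk_accuracy(results, ks=[1, 2])
```

In this toy case no question is answered at rank 1, and one of the two questions is answered within the top 2.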

Get evaluation metrics for MSMARCO

python dpr_scale/msmarco_eval.py ~data/msmarco/qrels.dev.small.tsv <PATH_TO_OUTPUT_JSON>
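The standard metric for MS MARCO passage ranking is MRR@10. The sketch below shows how that metric is computed from relevance judgments and ranked passage ids; it is an illustration under the assumption that msmarco_eval.py reports this metric, and the function name and the in-memory dictionaries are hypothetical stand-ins for the qrels file and the retrieval output.

```python
def mrr_at_k(qrels, ranking, k=10):
    """Mean reciprocal rank of the first relevant passage within the top k."""
    total = 0.0
    for qid, relevant in qrels.items():
        for rank, pid in enumerate(ranking.get(qid, [])[:k], start=1):
            if pid in relevant:
                total += 1.0 / rank
                break  # only the first relevant passage counts
    return total / len(qrels)


# Hypothetical judgments and rankings for two queries.
qrels = {"q1": {"p3"}, "q2": {"p9"}}
ranking = {"q1": ["p1", "p3", "p5"], "q2": ["p2", "p4"]}
score = mrr_at_k(qrels, ranking)
```

Query q1 finds its relevant passage at rank 2 and q2 finds none, so the mean reciprocal rank is 0.25.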

Domain-matched Pre-training Tasks for Dense Retrieval

Paper: https://arxiv.org/abs/2107.13602

The sections below provide links to datasets and pretrained models, as well as instructions for preparing the datasets and for pretraining and fine-tuning the models.

Q&A Datasets

PAQ

Download the dataset from here

Conversational Datasets

You can download each dataset from its table below.

Reddit

File   Download Link
train  download
dev    download

ConvAI2

File   Download Link
train  download
dev    download

DSTC7

File   Download Link
train  download
dev    download
test   download

Prepare the data by downloading the tarball linked here and running the command below.

DSTC7_DATA_ROOT=<path_of_dir_where_the_data_is_extracted>
python dpr_scale/data_prep/prep_conv_datasets.py \
    --dataset dstc7 \
    --in_file_path $DSTC7_DATA_ROOT/ubuntu_train_subtask_1_augmented.json \
    --out_file_path $DSTC7_DATA_ROOT/ubuntu_train.jsonl

Ubuntu V2

File   Download Link
train  download
dev    download
test   download

Prepare the data by downloading the tarball linked here and running the command below.

UBUNTUV2_DATA_ROOT=<path_of_dir_where_the_data_is_extracted>
python dpr_scale/data_prep/prep_conv_datasets.py \
    --dataset ubuntu2 \
    --in_file_path $UBUNTUV2_DATA_ROOT/train.csv \
    --out_file_path $UBUNTUV2_DATA_ROOT/train.jsonl

Pretraining DPR

Pretrained Checkpoints

Pretrained Model  Dataset  Download Link
BERT-base         PAQ      download
BERT-large        PAQ      download
BERT-base         Reddit   download
BERT-large        Reddit   download
RoBERTa-base      Reddit   download
RoBERTa-large     Reddit   download

Pretraining on PAQ dataset

DPR_ROOT=<path_of_your_repo's_root>
MODEL="bert-large-uncased"
NODES=8
BSZ=16
MAX_EPOCHS=20
LR=1e-5
TIMEOUT_MINS=4320
EXP_DIR=<path_of_the_experiment_dir>
TRAIN_PATH=<path_of_the_training_data_file>
mkdir -p ${EXP_DIR}/logs
PYTHONPATH=$DPR_ROOT python ${DPR_ROOT}/dpr_scale/main.py -m \
    --config-dir ${DPR_ROOT}/dpr_scale/conf \
    --config-name nq.yaml \
    hydra.launcher.timeout_min=$TIMEOUT_MINS \
    hydra.sweep.dir=${EXP_DIR} \
    trainer.num_nodes=${NODES} \
    task.optim.lr=${LR} \
    task.model.model_path=${MODEL} \
    trainer.max_epochs=${MAX_EPOCHS} \
    datamodule.train_path=$TRAIN_PATH \
    datamodule.batch_size=${BSZ} \
    datamodule.num_negative=1 \
    datamodule.num_val_negative=10 \
    datamodule.num_test_negative=50 > ${EXP_DIR}/logs/log.out 2> ${EXP_DIR}/logs/log.err &

Pretraining on Reddit dataset

# Use a batch size of 16 for BERT and RoBERTa base models.
BSZ=4
NODES=8
MAX_EPOCHS=5
WARMUP_STEPS=10000
LR=1e-5
MODEL="roberta-large"
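EXP_DIR=<path_of_the_experiment_dir>
# Create the log directory first; otherwise the output redirection below fails.
mkdir -p ${EXP_DIR}/logs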
PYTHONPATH=. python dpr_scale/main.py -m \
    --config-dir ${DPR_ROOT}/dpr_scale/conf \
    --config-name reddit.yaml \
    hydra.launcher.nodes=${NODES} \
    hydra.sweep.dir=${EXP_DIR} \
    trainer.num_nodes=${NODES} \
    task.optim.lr=${LR} \
    task.model.model_path=${MODEL} \
    trainer.max_epochs=${MAX_EPOCHS} \
    task.warmup_steps=${WARMUP_STEPS} \
    datamodule.batch_size=${BSZ} > ${EXP_DIR}/logs/log.out 2> ${EXP_DIR}/logs/log.err &

Fine-tuning DPR on downstream tasks/datasets

Fine-tune the pretrained PAQ checkpoint

# You can also try 2e-5 or 5e-5. Usually these 3 learning rates work best.
LR=1e-5
# Use a batch size of 32 for BERT and RoBERTa base models.
BSZ=12
MODEL="bert-large-uncased"
MAX_EPOCHS=40
WARMUP_STEPS=1000
NODES=1
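PRETRAINED_CKPT_PATH=<path_of_checkpoint_pretrained_on_paq>
EXP_DIR=<path_of_the_experiment_dir>
# Create the log directory first; otherwise the output redirection below fails.
mkdir -p ${EXP_DIR}/logs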
PYTHONPATH=. python dpr_scale/main.py -m \
    --config-dir ${DPR_ROOT}/dpr_scale/conf \
    --config-name nq.yaml \
    hydra.launcher.name=${NAME} \
    hydra.sweep.dir=${EXP_DIR} \
    trainer.num_nodes=${NODES} \
    trainer.max_epochs=${MAX_EPOCHS} \
    datamodule.num_negative=1 \
    datamodule.num_val_negative=25 \
    datamodule.num_test_negative=50 \
    +trainer.val_check_interval=150 \
    task.warmup_steps=${WARMUP_STEPS} \
    task.optim.lr=${LR} \
    task.pretrained_checkpoint_path=$PRETRAINED_CKPT_PATH \
    task.model.model_path=${MODEL} \
    datamodule.batch_size=${BSZ} > ${EXP_DIR}/logs/log.out 2> ${EXP_DIR}/logs/log.err &

Fine-tune the pretrained Reddit checkpoint

Batch sizes that worked on Volta 32GB GPUs for the respective models and datasets:

Model               Dataset    Batch Size
BERT/RoBERTa base   ConvAI2    64
BERT/RoBERTa large  ConvAI2    16
BERT/RoBERTa base   DSTC7      24
BERT/RoBERTa large  DSTC7      8
BERT/RoBERTa base   Ubuntu V2  64
BERT/RoBERTa large  Ubuntu V2  16

# Change the config file name to convai2.yaml or dstc7.yaml for the respective datasets.
CONFIG_FILE_NAME=ubuntuv2.yaml
# You can also try 2e-5 or 5e-5. Usually these 3 learning rates work best.
LR=1e-5
BSZ=16
NODES=1
MAX_EPOCHS=5
WARMUP_STEPS=10000
MODEL="roberta-large"
PRETRAINED_CKPT_PATH=<path_of_checkpoint_pretrained_on_reddit>
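EXP_DIR=<path_of_the_experiment_dir>
# Create the log directory first; otherwise the output redirection below fails.
mkdir -p ${EXP_DIR}/logs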
PYTHONPATH=${DPR_ROOT} python ${DPR_ROOT}/dpr_scale/main.py -m \
    --config-dir=${DPR_ROOT}/dpr_scale/conf \
    --config-name=$CONFIG_FILE_NAME \
    hydra.launcher.nodes=${NODES} \
    hydra.sweep.dir=${EXP_DIR} \
    trainer.num_nodes=${NODES} \
    trainer.max_epochs=${MAX_EPOCHS} \
    +trainer.val_check_interval=150 \
    task.pretrained_checkpoint_path=$PRETRAINED_CKPT_PATH \
    task.warmup_steps=${WARMUP_STEPS} \
    task.optim.lr=${LR} \
    task.model.model_path=$MODEL \
    datamodule.batch_size=${BSZ} > ${EXP_DIR}/logs/log.out 2> ${EXP_DIR}/logs/log.err &

License

dpr-scale is CC-BY-NC 4.0 licensed as of now.

dpr-scale's People

Contributors

borguz, ccsasuke, facebook-github-bot, jacklin64


dpr-scale's Issues

CITADEL reproduction scripts

Hello! Is it possible to release reproduction scripts for the CITADEL paper?
I am assuming that the only thing that needs to be changed is task=multivec task/model=citadel_model, but it would still be nice to have the parameters needed to achieve the reported results!

Thank you for your work :)

Question about deepspeed option

Dear colleagues,

During training on several GPUs, I get an exception like this:

  File "/root/dpr-scale/dpr_scale/task/dpr_task.py", line 272, in validation_epoch_end
    self._eval_epoch_end(valid_outputs)
  File "/root/dpr-scale/dpr_scale/task/dpr_task.py", line 266, in _eval_epoch_end
    self.log_dict(metrics, on_epoch=True, sync_dist=True)
  File "/root/.local/lib/python3.9/site-packages/pytorch_lightning/core/lightning.py", line 343, in log_dict
    self.log(
  File "/root/.local/lib/python3.9/site-packages/pytorch_lightning/core/lightning.py", line 286, in log
    self._results.log(
  File "/root/.local/lib/python3.9/site-packages/pytorch_lightning/core/step_result.py", line 149, in log
    value = sync_fn(value, group=sync_dist_group, reduce_op=sync_dist_op)
  File "/root/.local/lib/python3.9/site-packages/pytorch_lightning/plugins/training_type/ddp.py", line 290, in reduce
    output = sync_ddp_if_available(output, group, reduce_op)
  File "/root/.local/lib/python3.9/site-packages/pytorch_lightning/utilities/distributed.py", line 129, in sync_ddp_if_available
    return sync_ddp(result, group=group, reduce_op=reduce_op)
  File "/root/.local/lib/python3.9/site-packages/pytorch_lightning/utilities/distributed.py", line 162, in sync_ddp
    torch.distributed.all_reduce(result, op=op, group=group, async_op=False)
  File "/root/.local/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1287, in all_reduce
    work = group.allreduce([tensor], opts)
RuntimeError: Tensors must be CUDA and dense

How can I fix this error in the validation step?

My current hyperparameters for multi-GPU training are:

accumulate_grad_batches: 1
plugins: deepspeed
accelerator: ddp
precision: 16

Thanks in advance.

Reddit data download link broken

Hi, I'm trying to download the 200M Reddit data, but it seems that the URL is broken.

$ wget https://dl.fbaipublicfiles.com/dpr_scale/reddit/train.200M.jsonl
--2023-04-18 15:05:43--  https://dl.fbaipublicfiles.com/dpr_scale/reddit/train.200M.jsonl
Resolving dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)... 13.32.50.61, 13.32.50.10, 13.32.50.72, ...
Connecting to dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)|13.32.50.61|:443... connected.
HTTP request sent, awaiting response... 403 Forbidden
2023-04-18 15:05:44 ERROR 403: Forbidden.

What kind of bug might happen when num_workers > 0?

Hi @ccsasuke

I noticed you mentioned that num_workers > 0 bugs out right now.

num_workers: int = 0, # increasing this bugs out right now

However, when I set num_workers = 8, the code seems to work.
Could you point out what kind of bug might happen?
Or did you forget to remove the comment after fixing the bug? Some configs do set num_workers above 0, for example:
num_workers: 10

scidocs seems to be badly formatted

Hi colleagues. When trying to embed the corpus from the BeIR benchmark with dpr_scale/generate_embeddings.py, it fails because some "text" fields of scidocs are NaN. I fixed it by simply replacing the NaN values with empty strings:

import pandas as pd
import numpy as np

BEIR_FOLDER = "/home/davide/DRAGON/dpr-scale/beir/"

scidocs_path = BEIR_FOLDER + "scidocs/collection.tsv"
scidocs = pd.read_csv(scidocs_path, sep="\t")

scidocs.loc[~scidocs.text.apply(lambda x: isinstance(x, str)), "text"] = ""
scidocs.loc[~scidocs.title.apply(lambda x: isinstance(x, str)), "title"] = ""

scidocs.to_csv(scidocs_path, sep="\t", index=False)

Embeddings generation without Trainer

Dear colleagues,

Do you know how to generate embeddings for all contexts without using Trainer? At the initialization step, I only want to load a model from my checkpoint, and then compute query embeddings and retrieve documents as an inference step.

Thanks,
Daria

BEIR reproduction

Thanks for providing this!

Do you have the scripts for reproducing the results of SPAR on the BEIR benchmark?

Particularly, did you tune the concatenation weight for BEIR evaluation?

KeyError: 'positive_ctxs' when running run_retrieval.py for nq-test.jsonl

When I run run_retrieval.py with nq-test.jsonl as the test file, I get KeyError: 'positive_ctxs' because nq-test.jsonl does not have positive_ctxs. Why do we need positive_ctxs at test time?

Traceback (most recent call last):
File "/home/default/persistent_drive/dpr_scale/dpr_scale/run_retrieval.py", line 83, in main
trainer.test(task, datamodule=datamodule)
File "/usr/local/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 914, in test
results = self.__test_given_model(model, test_dataloaders)
File "/usr/local/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 972, in __test_given_model
results = self.fit(model)
File "/usr/local/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 498, in fit
self.dispatch()
File "/usr/local/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 539, in dispatch
self.accelerator.start_testing(self)
File "/usr/local/lib/python3.9/site-packages/pytorch_lightning/accelerators/accelerator.py", line 76, in start_testing
self.training_type_plugin.start_testing(trainer)
File "/usr/local/lib/python3.9/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 118, in start_testing
self._results = trainer.run_test()
File "/usr/local/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 785, in run_test
eval_loop_results, _ = self.run_evaluation()
File "/usr/local/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 711, in run_evaluation
for batch_idx, batch in enumerate(dataloader):
File "/usr/local/lib64/python3.9/site-packages/torch/utils/data/dataloader.py", line 521, in next
data = self._next_data()
File "/usr/local/lib64/python3.9/site-packages/torch/utils/data/dataloader.py", line 561, in _next_data
data = self._dataset_fetcher.fetch(index) # may raise StopIteration
File "/usr/local/lib64/python3.9/site-packages/torch/utils/data/_utils/fetch.py", line 52, in fetch
return self.collate_fn(data)
File "/home/default/persistent_drive/dpr_scale/dpr_scale/datamodule/dpr.py", line 138, in collate_test
return self.collate(batch, "test")
File "/home/default/persistent_drive/dpr_scale/dpr_scale/datamodule/dpr.py", line 203, in collate
return self.dpr_transform(batch, stage)
File "/usr/local/lib64/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/home/default/persistent_drive/dpr_scale/dpr_scale/transforms/dpr_transform.py", line 85, in forward
contexts_pos = row["positive_ctxs"]
KeyError: 'positive_ctxs'

Some questions about 'generating training queries'

To generate our own training data for Lambda, we want to create training queries. I have a question about the code in 'dpr_scale/utils/prep_wiki_exp.py'.

image
Looking at the code above, it can be seen that the query is obtained from passage_sents.

image
However, if you look at the source of passage_sents, you can see that it was taken from the document. Am I understanding this correctly?

image
In the end, it seems that positive_ctxs stores the passage from passage_sents, so I wonder whether the query and context are used correctly.

If you have time, please reply. Thank you.

Clarification about training of dragon

The DRAGON paper states that, given a source of supervision, the training samples are triplets of:
a query, one document sampled from the top 10 retrieved documents, and one document sampled from the top 41-50 documents.

In the code, specifically in dpr_scale.task.dpr_task, it seems that the CrossEntropyLoss is computed over the scores of 2*batch_size documents, so each training sample is actually far larger than a simple triplet, because the model also sees in-batch negatives.

I am confused at this point. Is the code not reproducing the paper, or is the paper not clear enough?
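To make the distinction in the question concrete, here is a small sketch of a cross-entropy loss over in-batch negatives. It is an illustration only, not the code in dpr_scale.task.dpr_task, and it does not settle whether the code matches the paper. The column layout (positives first, then the sampled negatives) and the function name are assumptions made for the example.

```python
import math


def in_batch_cross_entropy(scores):
    """Mean cross-entropy where scores[i][j] is the similarity of query i to document j.

    ASSUMED layout for a batch of B queries and 2*B documents:
    columns 0..B-1 hold each query's positive, columns B..2B-1 the sampled negatives.
    The target for query i is column i; every other column acts as a negative.
    """
    losses = []
    for i, row in enumerate(scores):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(s - m) for s in row))
        losses.append(log_z - row[i])
    return sum(losses) / len(losses)


# Hypothetical scores for B = 2 queries against 2 * B = 4 documents.
scores = [
    [5.0, 1.0, 2.0, 0.0],
    [0.5, 4.0, 0.0, 3.0],
]
loss = in_batch_cross_entropy(scores)
```

Each query contributes one triplet of its own (query, positive, sampled negative), but the softmax in its row also runs over the positives and negatives of the other queries in the batch, which is why the loss sees 2*batch_size documents per query.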

Error during embeddings generation

Dear colleagues,
when I try to generate embeddings, I get an error:

Testing: 100%|████████████████████████████████████████████████████████████████████████████| 1851/1851 [02:41<00:00, 13.31it/s]
Writing tensor of size torch.Size([29606, 768]) to /root/dpr/ctx_embeddings/reps_0000.pkl
Error executing job with overrides: ['trainer.gpus=1', 'datamodule=generate', 'datamodule.test_path=/root/dpr/python_docs_w100.tsv', 'datamodule.test_batch_size=16', '+task.ctx_embeddings_dir=/root/dpr/ctx_embeddings', '+task.checkpoint_path=/root/dpr/trained_only_by_answers.ckpt', '+task.pretrained_checkpoint_path=/root/dpr/trained_only_by_answers.ckpt']
Traceback (most recent call last):
  File "/root/dpr-scale/dpr_scale/generate_embeddings.py", line 30, in <module>
    main()
  File "/root/.local/lib/python3.9/site-packages/hydra/main.py", line 48, in decorated_main
    _run_hydra(
  File "/root/.local/lib/python3.9/site-packages/hydra/_internal/utils.py", line 385, in _run_hydra
    run_and_report(
  File "/root/.local/lib/python3.9/site-packages/hydra/_internal/utils.py", line 214, in run_and_report
    raise ex
  File "/root/.local/lib/python3.9/site-packages/hydra/_internal/utils.py", line 211, in run_and_report
    return func()
  File "/root/.local/lib/python3.9/site-packages/hydra/_internal/utils.py", line 386, in <lambda>
    lambda: hydra.multirun(
  File "/root/.local/lib/python3.9/site-packages/hydra/_internal/hydra.py", line 140, in multirun
    ret = sweeper.sweep(arguments=task_overrides)
  File "/root/.local/lib/python3.9/site-packages/hydra/_internal/core_plugins/basic_sweeper.py", line 161, in sweep
    _ = r.return_value
  File "/root/.local/lib/python3.9/site-packages/hydra/core/utils.py", line 233, in return_value
    raise self._return_value
  File "/root/.local/lib/python3.9/site-packages/hydra/core/utils.py", line 160, in run_job
    ret.return_value = task_function(task_cfg)
  File "/root/dpr-scale/dpr_scale/generate_embeddings.py", line 26, in main
    trainer.test(task, datamodule=datamodule)
  File "/root/.local/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 914, in test
    results = self.__test_given_model(model, test_dataloaders)
  File "/root/.local/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 972, in __test_given_model
    results = self.fit(model)
  File "/root/.local/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 498, in fit
    self.dispatch()
  File "/root/.local/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 539, in dispatch
    self.accelerator.start_testing(self)
  File "/root/.local/lib/python3.9/site-packages/pytorch_lightning/accelerators/accelerator.py", line 76, in start_testing
    self.training_type_plugin.start_testing(trainer)
  File "/root/.local/lib/python3.9/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 118, in start_testing
    self._results = trainer.run_test()
  File "/root/.local/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 785, in run_test
    eval_loop_results, _ = self.run_evaluation()
  File "/root/.local/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 740, in run_evaluation
    deprecated_eval_results = self.evaluation_loop.evaluation_epoch_end()
  File "/root/.local/lib/python3.9/site-packages/pytorch_lightning/trainer/evaluation_loop.py", line 187, in evaluation_epoch_end
    deprecated_results = self.__run_eval_epoch_end(self.num_dataloaders)
  File "/root/.local/lib/python3.9/site-packages/pytorch_lightning/trainer/evaluation_loop.py", line 219, in __run_eval_epoch_end
    eval_results = model.test_epoch_end(eval_results)
  File "/root/dpr-scale/dpr_scale/task/dpr_eval_task.py", line 49, in test_epoch_end
    torch.distributed.barrier()  # make sure rank 0 waits for all to complete
  File "/root/.local/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 2708, in barrier
    default_pg = _get_default_group()
  File "/root/.local/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 410, in _get_default_group
    raise RuntimeError(
RuntimeError: Default process group has not been initialized, please make sure to call init_process_group.
Testing: 100%|██████████| 1851/1851 [02:42<00:00, 11.40it/s]

Do you know how to fix it?
