microsoft / msmarco-passage-ranking

MS MARCO (Microsoft Machine Reading Comprehension) is a large-scale dataset focused on machine reading comprehension, question answering, and passage ranking. A variant of this task will be part of TREC and AFIRM 2019. For updates about TREC 2019, please follow this repository. Passage Reranking Task: given a query q and the 1000 most relevant passages P = p1, p2, p3, ..., p1000 as retrieved by BM25, a successful system is expected to rerank the most relevant passage as high as possible. For this task, not all of the 1000 retrieved passages have a human-labeled relevance judgment. Evaluation will be done using MRR.
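Since evaluation uses MRR, here is a minimal sketch of the metric (hypothetical data structures; this is not the official msmarco_eval.py implementation):

def mrr_at_k(qrels, rankings, k=10):
    # qrels: qid -> set of relevant passage ids
    # rankings: qid -> list of passage ids, best first
    total = 0.0
    for qid, ranked in rankings.items():
        for rank, pid in enumerate(ranked[:k], start=1):
            if pid in qrels.get(qid, set()):
                total += 1.0 / rank   # reciprocal rank of the first relevant hit
                break
    return total / len(rankings)      # mean over all evaluated queries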

Home Page: https://microsoft.github.io/MSMARCO-Passage-Ranking/

License: MIT License

Jupyter Notebook 56.18% Shell 2.77% Python 41.05%


msmarco-passage-ranking's People

Contributors

bmitra-msft, microsoft-github-policy-service[bot], microsoftopensource, msftgits, seanmacavaney


msmarco-passage-ranking's Issues

Top 1000 Train in TREC Format

Hello,

I wonder if the train set's top-1000 retrieval results in TREC eval format, i.e. teIn, could be released. This would allow developers to put together training sets of various sizes. Please release it in its original form, without shuffling, so that truncation can be done properly.

Extracting qidpidtriples.train.full.tar.gz

I am having trouble extracting qidpidtriples.train.full.tar.gz. The commands I ran:

  1. gunzip qidpidtriples.train.full.tar.gz yielding the file qidpidtriples.train.full.tar
  2. then running tar -xf qidpidtriples.train.full.tar I get the following error:

tar: This does not look like a tar archive
tar: Skipping to next header
tar: Archive contains '736889\t87209' where numeric off_t value expected
...

If I run tar -xzvf qidpidtriples.train.full.tar.gz directly I get the same error.
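For what it's worth, the "Archive contains ... where numeric off_t value expected" message usually means tar is reading plain data, so the file may in fact be a gzip-compressed TSV with a misleading .tar.gz name. A sketch, under that assumption:

import gzip

# treat the download as a plain gzip-compressed TSV despite the .tar.gz suffix
rows = 0
with gzip.open('qidpidtriples.train.full.tar.gz', 'rt', encoding='utf-8') as f:
    for line in f:
        fields = line.rstrip('\n').split('\t')  # expected: qid, positive pid, negative pid
        rows += 1
print('rows:', rows)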

top1000.dev contains the same queries as queries.dev.small.tsv

Hi, I am replicating the project and I found that the dataset you provide may be missing some data.

Specifically, I found that the set of query IDs in top1000.dev.tar.gz is equal to the set of query IDs in qrels.dev.small.tsv (and the sets of query IDs in qrels.dev.small.tsv and queries.dev.small.tsv are the same, which is expected).

So I am wondering: should top1000.dev.tar.gz be renamed to top1000.dev.small.tar.gz? And should there be a 'true' top1000.dev.tar.gz containing the candidates for the queries in qrels.dev.tsv?

By the way, I found that the Num Records column in the table is incorrect for some files. First, it says that top1000.dev.tar.gz has 6,669,195 records, whereas the downloaded top1000.dev.tar.gz has 6,668,967. Second, the table lists triples.train.small.tar.gz with 39,782,779 records, whereas the downloaded file has 39,780,811. I guess this may not be a problem, since the differences are small compared to the totals?

Thanks~

Encoding issues with triples.train.small

Hi

I have downloaded triples.train.small.tar.gz and extracted the TSV file. Here is what I see when I open it as a text file:

faxis a little caffeine ok during pregnancy        We don<C3><A2><C2><80><C2><99>t know a lot about the ef
fects of caffeine during pregnancy on you and your baby. So it<C3><A2><C2><80><C2><99>s best to limit t
he amount you get each day. If you<C3><A2><C2><80><C2><99>re pregnant, limit caffeine to 200 milligrams
 each day. This is about the amount in 1<C3><82><C2><BD> 8-ounce cups of coffee or one 12-ounce cup of
coffee.     It is generally safe for pregnant women to eat chocolate because studies have shown to prov
e certain benefits of eating chocolate during pregnancy. However, pregnant women should ensure their ca
ffeine intake is below 200 mg per day.

You can see that apostrophes and other Unicode symbols are encoded incorrectly, and this behaviour occurs across the dataset. The problem seems to be that some characters are double-encoded, as described for example here: https://stackoverflow.com/a/34175283/5230670
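If that diagnosis is right, a minimal repair sketch: re-encode the mojibake as Latin-1 to recover the original bytes, then decode those bytes as UTF-8 (this assumes exactly one layer of Latin-1/UTF-8 double encoding):

# what the file yields when read as UTF-8: U+00E2 U+0080 U+0099 instead of U+2019
broken = 'don\u00e2\u0080\u0099t'
# re-encoding as Latin-1 restores the original UTF-8 byte sequence E2 80 99
fixed = broken.encode('latin-1').decode('utf-8')
print(fixed)  # don’t  (with a proper right single quotation mark)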

Any plan to update the MS MARCO Passage Ranking dataset?

Hi, thanks for delivering this powerful dataset to the IR community.

This dataset was released in 2018. Three years later, new words and topics keep emerging on the internet. Do you have a plan to update this dataset?

Thanks!

relevant / non-relevant ratio

What is the ratio of relevant to non-relevant (positive/negative) passages in the large training set (Train Triples Large)? 400/1?

top-1000 train set contains ~5 retrieved passages on average

Let's count the number of retrieved passages per query in top1000.train.tar.gz:

import tarfile

# PATH_PREFIX points at the directory holding the downloaded dataset files
# collect all train query ids from queries.train.tsv
train_qids = []
with open(PATH_PREFIX + 'queries.train.tsv') as qcollection:
    for l in qcollection:
        qid, qtxt = l.strip().split('\t')
        train_qids.append(int(qid))

print('Number of train queries: %s' % format(len(train_qids), ','))

qid_serp = dict()
# top-1k train set: count retrieved passages per query id
with tarfile.open(PATH_PREFIX + 'top1000.train.tar.gz', "r:gz") as top1k:
    for member in top1k.getmembers():
        f = top1k.extractfile(member)
        if f is not None:
            try:
                for l in f:
                    qid, did = l.decode('utf-8').strip().split('\t')[:2]
                    qid, did = map(int, [qid, did])
                    qid_serp.setdefault(qid, 0)
                    qid_serp[qid] += 1
            except ValueError:
                # note: one malformed line stops processing of the rest of this member
                print('mis-format train doc: ', l)

print('Total queries presented in top-1000: %d' % len(qid_serp.keys()))
print('Min number of retrieved passages: %d' % min(qid_serp.values()))
print('Max number of retrieved passages: %d' % max(qid_serp.values()))

This is the output:

Number of train queries: 808,731

mis-format train doc:  b' Your computer may display a reverse image if certain keys are pressed at the same time. Depending on your configuration, it may be black and white or inverted color, as shown below.\n'

Total queries presented in top-1000: 481780
Min number of retrieved passages: 1
Max number of retrieved passages: 19

Why are only 481K out of 808K train queries present in top1000.train.tar.gz?
And, more importantly, why does each present query contain at most 19 retrieved passages?!
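(For the "~5 on average" figure in the title, a one-line follow-up on the same qid_serp dictionary:)

print('Avg number of retrieved passages: %.1f' % (sum(qid_serp.values()) / len(qid_serp)))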

ms_marco_eval reference_file structure

Hi,

I just checked out your IR evaluation script msmarco_eval.py and I think I found something inconsistent related to the reference_file format, but I may be wrong.

You state in one of the function definitions in the script that the reference_file format is QUERYID\tPASSAGEID:

def compute_metrics_from_files(path_to_reference, path_to_candidate, perform_checks=True):
    """Compute MRR metric
    Args:    
    p_path_to_reference_file (str): path to reference file.
        Reference file should contain lines in the following format:
            QUERYID\tPASSAGEID
            Where PASSAGEID is a relevant passage for a query. Note QUERYID can repeat on different lines with different PASSAGEIDs

However, qrels files are not formatted this way. As defined in the TREC Relevance Judgments File List, the format of a qrels file is TOPIC\tITERATION\tDOCUMENT#\tRELEVANCY, where

  • TOPIC is the topic number,
  • ITERATION is the feedback iteration (almost always zero and not used),
  • DOCUMENT# is the official document number that corresponds to the "docno" field in the documents, and
  • RELEVANCY is a binary code of 0 for not relevant and 1 for relevant.

You provided us with qrels files in your GitHub repo that are already in that latter format.
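For illustration, a tab-separated line in that layout would look like this (hypothetical IDs):

1234    0       567890  1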

If you have a look at the function load_reference_from_stream, which loads the reference file, you can see that it also assumes there are at least three tab-separated values. Here is an excerpt:

l = l.strip().split('\t')
qid = int(l[0])
qids_to_relevant_passageids[qid].append(int(l[2]))

Therefore, I assume that the function documentation for compute_metrics_from_files is just unintentionally wrong. I think it should read, for example, QUERYID\tITERATION\tPASSAGEID\tRELEVANCY.

I was a bit confused about this at first, and I hope this note clarifies the usage of the ms_marco_eval script for other newcomers like me.

Provide small queries and qrels datasets

Would it be possible to provide direct links to qrels.dev.small.tsv, queries.dev.small.tsv, and queries.eval.small.tsv? They are used in many research papers and are hard to find (apart from having to re-download the whole collection.tsv archive).

Invalid line breaks in the top1000 TSV files of the reranking datasets

Describe the bug

The top1000 TSV files of the reranking datasets contain a lot of invalid line breaks.
For example, line 234472 in top1000.dev.tsv does not start with the IDs.

To Reproduce

% sed -n 234471,234472p top1000.dev.tsv
1082445 3492590 what does unlock my device mean iOS: Understanding the SIM PIN.
You can lock your SIM card so that it can't be used without a Personal Identification Number (PIN).
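A quick way to count such broken lines, as a sketch assuming every valid row starts with two integer IDs:

# count lines whose first two tab-separated fields are not both integer ids
bad = 0
with open('top1000.dev.tsv', encoding='utf-8') as f:
    for line in f:
        fields = line.rstrip('\n').split('\t')
        if len(fields) < 2 or not (fields[0].isdigit() and fields[1].isdigit()):
            bad += 1
print('broken lines:', bad)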

relevant-passages section of the README says there are 5 qrels files, but I only see two; the others are query files

https://github.com/microsoft/MSMARCO-Passage-Ranking#relevant-passages

In the relevant-passages section, it says that the archive located at https://msmarco.blob.core.windows.net/msmarcoranking/collectionandqueries.tar.gz contains the following:

7437 qrels.dev.small.tsv

59273 qrels.dev.tsv
7304 qrels.eval.small.tsv
59187 qrels.eval.tsv
532761 qrels.train.tsv
665962 total

However, these are the files I got instead

collection.tsv
qrels.dev.small.tsv
qrels.train.tsv
queries.dev.small.tsv
queries.dev.tsv
queries.eval.small.tsv
queries.eval.tsv
queries.train.tsv

Are there missing qrels files, or does the README have some typos?

I also had a few other questions about the passage ranking/retrieval data:

Is the triples training data created purely from the passage texts, query texts, and the qrels relevance judgments? Are the negative samples selected at random (excluding the positive passage)?

The qrels data has a total of 532,761 (train) + 59,273 (dev) relevance judgments, but there are about 1 million queries and 8.8 million passages. So does the relevance data exclude the majority of the queries/passages?

MSMARCO license ambiguity

Hi, I am wondering what exactly in this repo/project is released under the MIT license and can be used for commercial research purposes. On the linked page http://www.msmarco.org/dataset.aspx it is clearly written that the datasets are not intended for this.

The problem is that the README contains links that directly start a download of the datasets.

Am I missing something here?
It would be great if you could state this clearly in the repository.

On full-document and passage alignment

Hi,
Is there an alignment (or a script to produce one) between passages and full documents? Since the passages are extracted from documents, we should be able to recover this information. Also, since the maintainer is active on many MS MARCO projects, I was wondering whether a release date for the Document Ranking data is available?

How triples.train.small is generated

Hi there,

I have a question about how triples.train.small is generated. The statement says:

"The triples.train.<size>.tsv are two files that we have created as an easy-to-consume training dataset. Each line of the TSV contains the query text, a relevant passage, and a non-relevant passage, all separated by \t. The only difference between triples.train.full.tsv and triples.train.small.tsv is that the smaller one is ~10% of the overall size, since the full-sized train file is > 270 GB."

  • Does triples.train.small contain all of the queries in triples.train.full?
  • For each query (in triples.train.small), how were the non-relevant passages selected? Randomly?
