microsoft / msmarco-passage-ranking

MS MARCO (Microsoft Machine Reading Comprehension) is a large-scale dataset focused on machine reading comprehension, question answering, and passage ranking. A variant of this task will be part of TREC and AFIRM 2019. For updates about TREC 2019, please follow this repository. Passage Reranking Task: given a query q and the 1000 most relevant passages P = p1, p2, p3, ..., p1000 as retrieved by BM25, a successful system is expected to rerank the most relevant passage as high as possible. For this task, not all of the 1000 retrieved passages have a human-labeled relevance judgment. Evaluation will be done using MRR.
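Since evaluation uses MRR, here is a minimal sketch of the metric (hypothetical data structures; this is not the official msmarco_eval.py implementation):

def mrr_at_k(qrels, rankings, k=10):
    # qrels: qid -> set of relevant passage ids
    # rankings: qid -> list of passage ids, best first
    total = 0.0
    for qid, ranked in rankings.items():
        for rank, pid in enumerate(ranked[:k], start=1):
            if pid in qrels.get(qid, set()):
                total += 1.0 / rank   # reciprocal rank of the first relevant hit
                break
    return total / len(rankings)      # mean over all evaluated queries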

Home Page: https://microsoft.github.io/MSMARCO-Passage-Ranking/

License: MIT License

Jupyter Notebook 56.18% Shell 2.77% Python 41.05%


msmarco-passage-ranking's People

Contributors

bmitra-msft, microsoft-github-policy-service[bot], microsoftopensource, msftgits, seanmacavaney


msmarco-passage-ranking's Issues

Top 1000 Train in TREC Format

Hello,

I wonder if the train set's top-1000 retrieval results in TREC eval format, i.e. teIn, could be released. This would allow developers to put together training sets of various sizes. Please release it in its original form, without shuffling, so that truncation can be done properly.

Extracting qidpidtriples.train.full.tar.gz

I am having trouble extracting qidpidtriples.train.full.tar.gz. The commands I ran:

  1. gunzip qidpidtriples.train.full.tar.gz yielding the file qidpidtriples.train.full.tar
  2. then running tar -xf qidpidtriples.train.full.tar I get the following error:

tar: This does not look like a tar archive
tar: Skipping to next header
tar: Archive contains '736889\t87209' where numeric off_t value expected
...

If I run tar -xzvf qidpidtriples.train.full.tar.gz directly I get the same error.
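For what it's worth, the "Archive contains ... where numeric off_t value expected" message usually means tar is reading plain data, so the file may in fact be a gzip-compressed TSV with a misleading .tar.gz name. A sketch, under that assumption:

import gzip

# treat the download as a plain gzip-compressed TSV despite the .tar.gz suffix
rows = 0
with gzip.open('qidpidtriples.train.full.tar.gz', 'rt', encoding='utf-8') as f:
    for line in f:
        fields = line.rstrip('\n').split('\t')  # expected: qid, positive pid, negative pid
        rows += 1
print('rows:', rows)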

top1000.dev contains the same queries as queries.dev.small.tsv

Hi, I am replicating the project and I found that the dataset you provide may be missing some data.

Specifically, I found that the set of query IDs in top1000.dev.tar.gz is equal to the set of query IDs in qrels.dev.small.tsv (and the sets of query IDs in qrels.dev.small.tsv and queries.dev.small.tsv are the same, which is expected).

So I am wondering: should top1000.dev.tar.gz be renamed to top1000.dev.small.tar.gz? And should there be a 'true' top1000.dev.tar.gz containing the candidates for the queries in qrels.dev.tsv?

By the way, I found that the Num Records column in the table is incorrect for some files. First, it says that top1000.dev.tar.gz has 6,669,195 records, whereas the downloaded top1000.dev.tar.gz has 6,668,967. Second, the table lists triples.train.small.tar.gz with 39,782,779 records, whereas the downloaded file has 39,780,811. I guess this may not be a problem, since the differences are small compared to the totals?

Thanks~

Encoding issues with triples.train.small

Hi

I have downloaded triples.train.small.tar.gz and extracted the TSV file. Here is what I see when I open it as a text file:

faxis a little caffeine ok during pregnancy        We don<C3><A2><C2><80><C2><99>t know a lot about the ef
fects of caffeine during pregnancy on you and your baby. So it<C3><A2><C2><80><C2><99>s best to limit t
he amount you get each day. If you<C3><A2><C2><80><C2><99>re pregnant, limit caffeine to 200 milligrams
 each day. This is about the amount in 1<C3><82><C2><BD> 8-ounce cups of coffee or one 12-ounce cup of
coffee.     It is generally safe for pregnant women to eat chocolate because studies have shown to prov
e certain benefits of eating chocolate during pregnancy. However, pregnant women should ensure their ca
ffeine intake is below 200 mg per day.

You can see that apostrophes and other Unicode symbols are encoded incorrectly, and this behaviour occurs across the dataset. The problem seems to be that some characters are double-encoded, as described for example here: https://stackoverflow.com/a/34175283/5230670
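If that diagnosis is right, a minimal repair sketch: re-encode the mojibake as Latin-1 to recover the original bytes, then decode those bytes as UTF-8 (this assumes exactly one layer of Latin-1/UTF-8 double encoding):

# what the file yields when read as UTF-8: U+00E2 U+0080 U+0099 instead of U+2019
broken = 'don\u00e2\u0080\u0099t'
# re-encoding as Latin-1 restores the original UTF-8 byte sequence E2 80 99
fixed = broken.encode('latin-1').decode('utf-8')
print(fixed)  # don’t  (with a proper right single quotation mark)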

Any plan to update the MS MARCO Passage Ranking dataset?

Hi, thanks for delivering this powerful dataset to the IR community.

This dataset was released in 2018. Three years later, new words and topics keep emerging on the internet. Do you have a plan to update this dataset?

Thanks!

relevant / non-relevant ratio

What is the ratio of relevant to non-relevant (positive/negative) passages in the large training set (Train Triples Large)? 400/1?

top-1000 train set contains ~5 retrieved passages on average

Let's count the number of retrieved passages per query in top1000.train.tar.gz:

import tarfile

# PATH_PREFIX points at the directory holding the downloaded dataset files
# collect all train query ids from queries.train.tsv
train_qids = []
with open(PATH_PREFIX + 'queries.train.tsv') as qcollection:
    for l in qcollection:
        qid, qtxt = l.strip().split('\t')
        train_qids.append(int(qid))

print('Number of train queries: %s' % format(len(train_qids), ','))

qid_serp = dict()
# top-1k train set: count retrieved passages per query id
with tarfile.open(PATH_PREFIX + 'top1000.train.tar.gz', "r:gz") as top1k:
    for member in top1k.getmembers():
        f = top1k.extractfile(member)
        if f is not None:
            try:
                for l in f:
                    qid, did = l.decode('utf-8').strip().split('\t')[:2]
                    qid, did = map(int, [qid, did])
                    qid_serp.setdefault(qid, 0)
                    qid_serp[qid] += 1
            except ValueError:
                # note: one malformed line stops processing of the rest of this member
                print('mis-format train doc: ', l)

print('Total queries presented in top-1000: %d' % len(qid_serp.keys()))
print('Min number of retrieved passages: %d' % min(qid_serp.values()))
print('Max number of retrieved passages: %d' % max(qid_serp.values()))

This is the output:

Number of train queries: 808,731

mis-format train doc:  b' Your computer may display a reverse image if certain keys are pressed at the same time. Depending on your configuration, it may be black and white or inverted color, as shown below.\n'

Total queries presented in top-1000: 481780
Min number of retrieved passages: 1
Max number of retrieved passages: 19

Why are only 481K out of 808K train queries present in top1000.train.tar.gz?
And, more importantly, why does each present query contain at most 19 retrieved passages?!
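(For the "~5 on average" figure in the title, a one-line follow-up on the same qid_serp dictionary:)

print('Avg number of retrieved passages: %.1f' % (sum(qid_serp.values()) / len(qid_serp)))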

ms_marco_eval reference_file structure

Hi,

I just checked out your IR evaluation script msmarco_eval.py and I think I found something inconsistent related to the reference_file format, but I may be wrong.

You state in one of the function definitions in the script that the reference_file format is QUERYID\tPASSAGEID:

def compute_metrics_from_files(path_to_reference, path_to_candidate, perform_checks=True):
    """Compute MRR metric
    Args:    
    p_path_to_reference_file (str): path to reference file.
        Reference file should contain lines in the following format:
            QUERYID\tPASSAGEID
            Where PASSAGEID is a relevant passage for a query. Note QUERYID can repeat on different lines with different PASSAGEIDs

However, qrels files are not formatted this way. As defined in the TREC Relevance Judgments File List, the format of a qrels file is TOPIC\tITERATION\tDOCUMENT#\tRELEVANCY, where

  • TOPIC is the topic number,
  • ITERATION is the feedback iteration (almost always zero and not used),
  • DOCUMENT# is the official document number that corresponds to the "docno" field in the documents, and
  • RELEVANCY is a binary code of 0 for not relevant and 1 for relevant.

You provided us with qrels files in your GitHub repo that are already in that latter format.
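For illustration, a tab-separated line in that layout would look like this (hypothetical IDs):

1234    0       567890  1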

If you have a look at the function load_reference_from_stream, which loads the reference file, you can see that it also assumes there are at least three tab-separated values. Here is an excerpt:

l = l.strip().split('\t')
qid = int(l[0])
qids_to_relevant_passageids[qid].append(int(l[2]))

Therefore, I assume that the function documentation for compute_metrics_from_files is just unintentionally wrong. I think it should read, for example, QUERYID\tITERATION\tPASSAGEID\tRELEVANCY.

I was a bit confused about this at first, and I hope this note clarifies the usage of the ms_marco_eval script for other newcomers like me.

Provide small queries and qrels datasets

Would it be possible to provide direct links to qrels.dev.small.tsv, queries.dev.small.tsv, and queries.eval.small.tsv? They are used in many research papers and are hard to find (apart from having to re-download the whole collection.tsv archive).

Invalid line breaks in the top1000 TSV files of the reranking datasets

Describe the bug

The top1000 TSV files of the reranking datasets contain a lot of invalid line breaks.
For example, line 234472 in top1000.dev.tsv does not start with the IDs.

To Reproduce

% sed -n 234471,234472p top1000.dev.tsv
1082445 3492590 what does unlock my device mean iOS: Understanding the SIM PIN.
You can lock your SIM card so that it can't be used without a Personal Identification Number (PIN).
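A quick way to count such broken lines, as a sketch assuming every valid row starts with two integer IDs:

# count lines whose first two tab-separated fields are not both integer ids
bad = 0
with open('top1000.dev.tsv', encoding='utf-8') as f:
    for line in f:
        fields = line.rstrip('\n').split('\t')
        if len(fields) < 2 or not (fields[0].isdigit() and fields[1].isdigit()):
            bad += 1
print('broken lines:', bad)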

relevant-passages section of the README says there are 5 qrels files, but I only see two; the others are query files

https://github.com/microsoft/MSMARCO-Passage-Ranking#relevant-passages

In the relevant-passages section, it says that the archive located at https://msmarco.blob.core.windows.net/msmarcoranking/collectionandqueries.tar.gz contains the following:

7437 qrels.dev.small.tsv

59273 qrels.dev.tsv
7304 qrels.eval.small.tsv
59187 qrels.eval.tsv
532761 qrels.train.tsv
665962 total

However, these are the files I got instead

collection.tsv
qrels.dev.small.tsv
qrels.train.tsv
queries.dev.small.tsv
queries.dev.tsv
queries.eval.small.tsv
queries.eval.tsv
queries.train.tsv

Are there missing qrels files, or does the README have some typos?

I also had a few other questions about the passage ranking/retrieval data:

Is the triples training data created purely from the passage texts, query texts, and the qrels relevance judgments? Are the negative samples selected at random (excluding the positive passage)?

The qrels data has a total of 532,761 (train) + 59,273 (dev) relevance judgments, but there are about 1 million queries and 8.8 million passages. So does the relevance data exclude the majority of the queries/passages?

MSMARCO license ambiguity

Hi, I am wondering what exactly in this repo/project is released under the MIT license and can be used for commercial research purposes. On the linked page http://www.msmarco.org/dataset.aspx it is clearly written that the datasets are not intended for this.

The problem is that the README contains links that directly start a download of the datasets.

Am I missing something here?
It would be great if you could state this clearly in the repository.

On full-document and passage alignment

Hi,
Is there an alignment (or a script to produce one) between passages and full documents? Since the passages are extracted from documents, we should be able to recover this information. Also, since the maintainer is active on many MS MARCO projects, I was wondering whether a release date for the Document Ranking data is available?

How triples.train.small is generated

Hi there,

I have a question about how triples.train.small is generated. The statement says:

"The triples.train.<size>.tsv are two files that we have created as an easy-to-consume training dataset. Each line of the TSV contains the query text, a relevant passage, and a non-relevant passage, all separated by \t. The only difference between triples.train.full.tsv and triples.train.small.tsv is that the smaller one is ~10% of the overall size, since the full-sized train file is > 270 GB."

  • Does triples.train.small contain all of the queries in triples.train.full?
  • For each query (in triples.train.small), how were the non-relevant passages selected? Randomly?
