
trec-car-tools's Introduction

TREC Car Tools

Development tools for participants of the TREC Complex Answer Retrieval track.

Data release support for v1.5, v2.0, and v2.6.

Note that in order to allow compiling your project against both trec-car format versions, the Maven artifact ID was changed to treccar-tools-v2 with version 2.0, and the package path changed to treccar_v2.

Current support for

  • Python 3.6
  • Java 1.8

If you are using Anaconda, install the cbor library for Python 3.6:

conda install -c laura-dietz cbor=1.0.0 

How to use the Python bindings for trec-car-tools?

  1. Get the data from http://trec-car.cs.unh.edu
  2. Clone this repository
  3. python setup.py install

See test.py for an example of how to access the data.
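For a quick start, here is a minimal sketch of iterating over a paragraph collection with the same bindings (the file name paragraphs.cbor is a placeholder for one of the downloaded collections):

    from trec_car.read_data import iter_paragraphs

    with open('paragraphs.cbor', 'rb') as f:
        for p in iter_paragraphs(f):
            print(p.para_id, p.get_text())  # paragraph id and plain text
            break  # show only the first paragraph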

How to use the Java 1.8 (or higher) bindings for trec-car-tools through Maven?

Add the following to your project's pom.xml file (or the equivalent for Gradle or sbt):

    <repositories>
        <repository>
            <id>jitpack.io</id>
            <url>https://jitpack.io</url>
        </repository>
    </repositories>

Add the trec-car-tools dependency:

        <dependency>
            <groupId>com.github.TREMA-UNH</groupId>
            <artifactId>trec-car-tools-java</artifactId>
            <version>20</version>
        </dependency>

Compile your project with mvn compile.

Tool support

This package provides support for the following activities.

  • read_data: reading the provided paragraph collection, outline collections, and training articles
  • format_runs: writing submission files

Reading Data

If you use Python or Java, please use trec-car-tools; there is no need to understand the following. We provide bindings for Haskell upon request. If you are programming in a different language, you can use any CBOR library and decode the grammar below.

CBOR is similar to JSON, but it is a binary format that compresses better and avoids text file encoding issues.

Articles, outlines, and paragraphs are all described in CBOR following this grammar. Wikipedia-internal hyperlinks are preserved through ParaLinks.

     Page         -> $pageName $pageId [PageSkeleton] PageType PageMetadata
     PageType     -> ArticlePage | CategoryPage | RedirectPage ParaLink | DisambiguationPage
     PageMetadata -> RedirectNames DisambiguationNames DisambiguationIds CategoryNames CategoryIds InlinkIds InlinkAnchors WikiDataQid SiteId PageTags
     RedirectNames       -> [$pageName] 
     DisambiguationNames -> [$pageName] 
     DisambiguationIds   -> [$pageId] 
     CategoryNames       -> [$pageName] 
     CategoryIds         -> [$pageId] 
     InlinkIds           -> [$pageId] 
     InlinkAnchors       -> [$anchorText] 
     WikiDataQid         -> [$qid] 
     SiteId              -> [$siteId] 
     PageTags            -> [$pageTags] 
     
     PageSkeleton -> Section | Para | Image | ListItem | Infobox
     Section      -> $sectionHeading [PageSkeleton]
     Para         -> Paragraph
     Paragraph    -> $paragraphId, [ParaBody]
     ListItem     -> $nestingLevel, Paragraph
     Image        -> $imageURL [PageSkeleton]
     ParaBody     -> ParaText | ParaLink
     ParaText     -> $text
     ParaLink     -> $targetPage $targetPageId $linkSection $anchorText
     Infobox      -> $infoboxName [($key, [PageSkeleton])]
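As an illustration, here is a hedged sketch of decoding a single Paragraph record by hand with the generic Python cbor library. The field layout ([tag, $paragraphId, [ParaBody]]) is inferred from the grammar above and from the stack traces in the issues below, and v2.0 files begin with a ['CAR', ...] header record, so treat the details as illustrative rather than authoritative:

    import cbor

    with open('paragraphs.cbor', 'rb') as f:
        header = cbor.load(f)       # v2.0 files start with a ['CAR', ...] header
        rec = cbor.load(f)          # first Paragraph record
        assert rec[0] == 0          # record-type tag
        paragraph_id = rec[1].decode('ascii')
        para_bodies = rec[2]        # list of ParaText / ParaLink nodes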

You can use any CBOR serialization library. Below is a convenience library for reading the data into Python (3.5):

  • ./read_data/trec_car_read_data.py Python 3.5 convenience library for reading the input data (in CBOR format). -- If you use Anaconda, please install the cbor library with conda install -c auto cbor=1.0 -- Otherwise install it with pip install cbor

Ranking Results

Given an outline, your task is to produce one ranking for each section $section (representing an information need in traditional IR evaluations).

Each ranked element is an (entity, passage) pair, meaning that the passage is relevant to the section because it features a relevant entity. "Relevant" means that the entity or passage must/should/could be listed in this section.

The section is represented by the path of headings in the outline $pageTitle/$heading1/$heading1.1/.../$section in URL encoding.
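For example, a sketch of building such a section path with Python's standard library (the heading strings are hypothetical):

    from urllib.parse import quote

    # URL-encode each path component, then join with '/'
    qid = "/".join(quote(h, safe='') for h in ["Green sea turtle", "Habitat"])
    # -> 'Green%20sea%20turtle/Habitat'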

The entity is represented by the DBpedia entity id (derived from the Wikipedia URL). Optionally, the entity can be omitted.

The passage is represented by the passage id given in the passage corpus (an MD5 hash of the content). Optionally, the passage can be omitted.

The results are provided in a format similar to the trec_results file format of trec_eval. See the trec_eval documentation and source for more information on how to use it.

Example of ranking format

     Green_sea_turtle/Habitat  Pelagic_zone  12345       0     27409  myTeam
     $qid                      $entity       $passageId  rank  sim    run_id

Integration with other tools

It is recommended to use the format_runs package to write run files. Here is an example:

    from trec_car.format_runs import configure_csv_writer, format_run, RankingEntry

    with open('runfile', mode='w', encoding='UTF-8') as f:
        writer = configure_csv_writer(f)
        for page in pages:
            for section_path in page.flat_headings_list():
                # scored_paragraphs is a list of (paragraph, score, rank) triples
                ranking = [RankingEntry(page.page_name, section_path, p.para_id, r, s, paragraph_content=p)
                           for p, s, r in scored_paragraphs]
                format_run(writer, ranking, exp_name='test')
        # no explicit f.close() needed: the with-block closes the file

This ensures that the output is correctly formatted to work with trec_eval and the provided qrels file.

Run trec_eval version 9.0.4 as usual:

  trec_eval -q release.qrel runfile > run.eval

The output is compatible with the eval plotting package minir-plots. For example, run:

  python column.py --out column-plot.pdf --metric map run.eval
  python column_difficulty.py --out column-difficulty-plot.pdf --metric map run.eval run2.eval

Moreover, you can compute success statistics such as hurts/helps or a paired t-test as follows:

  python hurtshelps.py --metric map run.eval run2.eval
  python paired-ttest.py --metric map run.eval run2.eval

Creative Commons License
TREC-CAR Dataset by Laura Dietz, Ben Gamari is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.
Based on a work at www.wikipedia.org.


trec-car-tools's Issues

ParagraphsFile expects CBOR rows to return bytes, but wikipedia CBOR has str rows

Minimum Working Example:

cbor_path = "wikipedia/car-wiki2020-01-01/enwiki2020.cbor"

from trec_car.read_data import AnnotationsFile, ParagraphsFile
cbor_toc_paragraphs = ParagraphsFile(cbor_path)

cbor_toc_paragraphs.get(b'enwiki:U.S.%20Route%20277')

The last line causes the following stack trace:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-34-4af510e90e8f> in <module>
----> 1 cbor_toc_paragraphs.get(b'enwiki:U.S.%20Route%20277')
/usr/local/lib/python3.6/dist-packages/trec_car_tools-2.5.3-py3.6.egg/trec_car/read_data.py in get(self, page)
    771             if offset is not None:
    772                 self.cbor.seek(offset)
--> 773                 return read_val(cbor.load(self.cbor))
    774             return None
    775     return AnnotationsFile

/usr/local/lib/python3.6/dist-packages/trec_car_tools-2.5.3-py3.6.egg/trec_car/read_data.py in from_cbor(cbor)
    558             raise CborElementNotDefinedException(cbor)
    559 
--> 560         paragraphId = cbor[1].decode('ascii')
    561         return Paragraph(paragraphId, map(ParaBody.from_cbor, cbor[2]))
    562 

The problem is clearly in line 560: we expect cbor[1] to be bytes, but it is a str instead.

I am going to push a small pull request quite soon to fix this.
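A minimal sketch of such a fix, accepting both bytes and str for the id field:

    # tolerate both bytes and str ids when decoding
    raw_id = cbor[1]
    paragraphId = raw_id.decode('ascii') if isinstance(raw_id, bytes) else raw_id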

cannot load test200 v2.0 in python

Hi there.
I have followed the instructions and used the following command to read data from the dataset, but it failed:
python test.py pages train.pages.cbor

I get the following errors:

Traceback (most recent call last):
  File "test.py", line 29, in <module>
    args.func(args)
  File "test.py", line 7, in dump_pages
    for p in iter_annotations(args.file):
  File "E:\PythonProject\TERC-CAR\trec_car\read_data.py", line 232, in iter_annotations
    yield Page.from_cbor(cbor.load(file))
  File "E:\PythonProject\TERC-CAR\trec_car\read_data.py", line 44, in from_cbor
    assert cbor[0] == 0 # tag
AssertionError

Then I tried to print out the binary data that I had read from the file, and I got this:

['CAR', [0], [0, [[0, 'enwiki', 'en-US', '20161220', []]], 'trec-car v2.0', [], [[0, 'trec-car-import', '0b10a0342662c908f3198d79100796472918d30a', []]]]]

Is there anything wrong with my procedure? Could anyone help me? Thank you.

Travis is Broken

The Travis tests are failing to install 'maven2' when setting up their environment.

I suspect that this can simply be replaced with the 'maven' package?

moving to python>=3.6 and dropping typing requirement?

Hi Laura, (cc: @cmacdonald)

Would you approve a PR that did the following:

  1. Removed support for python 3.5
  2. Dropped the pypi typing requirement

The reason being that from python 3.6 onward, typing is part of the standard library. The pypi package shadows the standard library version, and this leads to some annoying errors.

I don't think many people still use 3.5, so it wouldn't be a big loss. (And if they did, they could always use the previous version of trec-car-tools from pypi.)

Thanks!
sean

issue with reading TREC data by using trec-car-tools in python3

I have followed the instructions to access the data via test.py, but something went wrong that I cannot figure out.

Here is the problem I encountered:

[xuesiyuan@241server python3]$ python ./trec_car/read_data.py ./benchmarkY1/benchmarkY1-train/train.pages.cbor ./benchmarkY1/benchmarkY1-train/train.pages.cbor-outlines.cbor ./benchmarkY1/benchmarkY1-train/train.pages.cbor-paragraphs.cbor >out

After this, there is no output.

[xuesiyuan@241server python3]$ python read_data_test.py ./benchmarkY1/benchmarkY1-train/train.pages.cbor ./benchmarkY1/benchmarkY1-train/train.pages.cbor-outlines.cbor ./benchmarkY1/benchmarkY1-train/train.pages.cbor-paragraphs.cbor >out
Traceback (most recent call last):
  File "read_data_test.py", line 14, in <module>
    for p in iter_annotations(f):
  File "/home/xuesiyuan/pythonworkspace/pythonWorkspace/py3/trec-car-tools-1.5/python3/trec_car/read_data.py", line 509, in _iter_with_header
    yield parse(cbor.load(file))
  File "/home/xuesiyuan/pythonworkspace/pythonWorkspace/py3/trec-car-tools-1.5/python3/trec_car/read_data.py", line 71, in from_cbor
    return Page(pagename, pageId, map(PageSkeleton.from_cbor, cbor[3]), PageMetadata.from_cbor(cbor[4]))
  File "/home/xuesiyuan/pythonworkspace/pythonWorkspace/py3/trec-car-tools-1.5/python3/trec_car/read_data.py", line 215, in from_cbor
    pageType=PageType.from_cbor(cbor[1])
IndexError: list index out of range

Is there anything wrong?

Normalize link targets

Currently, link targets are unprocessed and therefore generally have the form pageName#anchor, where the capitalization of the first letter of pageName is lost.

Please provide access to normalized page names and store the anchor separately.
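A minimal sketch of the requested normalization (the helper name is hypothetical):

    def normalize_link_target(target):
        # split 'pageName#anchor' and restore the leading capital lost in raw targets
        page_name, _, anchor = target.partition('#')
        return page_name[:1].upper() + page_name[1:], anchor or None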

flat_headings_list is not flat

soboroff$ ipython3
Python 3.6.6 (default, Jun 28 2018, 05:43:53) 
Type 'copyright', 'credits' or 'license' for more information
IPython 6.4.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: from trec_car import read_data

In [2]: fp = open('/home/collections/news-track-2018-wikipedia/all-enwiki-201708
   ...: 20/all-enwiki-20170820.cbor', 'rb')

In [3]: a = read_data.iter_annotations(fp)

In [4]: page = a.__next__()

In [5]: page.flat_headings_list()
Out[5]: 
[[<trec_car.read_data.Section at 0x106b684e0>,
  <trec_car.read_data.Section at 0x106b6b278>],
 [<trec_car.read_data.Section at 0x106b68518>,
  <trec_car.read_data.Section at 0x106b6bf98>],
 [<trec_car.read_data.Section at 0x106b68518>,
  <trec_car.read_data.Section at 0x106b73160>],
 [<trec_car.read_data.Section at 0x106b68518>,
  <trec_car.read_data.Section at 0x106b73208>],
 [<trec_car.read_data.Section at 0x106b6bf28>],
 [<trec_car.read_data.Section at 0x106b73fd0>,
  <trec_car.read_data.Section at 0x106d780b8>],
 [<trec_car.read_data.Section at 0x106b73fd0>,
  <trec_car.read_data.Section at 0x106d7b160>],
 [<trec_car.read_data.Section at 0x106b73fd0>,
  <trec_car.read_data.Section at 0x106d80080>],
 [<trec_car.read_data.Section at 0x106d78080>],
 [<trec_car.read_data.Section at 0x106d803c8>],
 [<trec_car.read_data.Section at 0x106d80e10>],
 [<trec_car.read_data.Section at 0x106d80e48>],
 [<trec_car.read_data.Section at 0x106d80e80>],
 [<trec_car.read_data.Section at 0x106d83080>]]

In [6]: import itertools

In [7]: itertools.chain.from_iterable(page.flat_headings_list())
Out[7]: <itertools.chain at 0x106d9e0b8>

In [8]: list(itertools.chain.from_iterable(page.flat_headings_list()))
Out[8]: 
[<trec_car.read_data.Section at 0x106b684e0>,
 <trec_car.read_data.Section at 0x106b6b278>,
 <trec_car.read_data.Section at 0x106b68518>,
 <trec_car.read_data.Section at 0x106b6bf98>,
 <trec_car.read_data.Section at 0x106b68518>,
 <trec_car.read_data.Section at 0x106b73160>,
 <trec_car.read_data.Section at 0x106b68518>,
 <trec_car.read_data.Section at 0x106b73208>,
 <trec_car.read_data.Section at 0x106b6bf28>,
 <trec_car.read_data.Section at 0x106b73fd0>,
 <trec_car.read_data.Section at 0x106d780b8>,
 <trec_car.read_data.Section at 0x106b73fd0>,
 <trec_car.read_data.Section at 0x106d7b160>,
 <trec_car.read_data.Section at 0x106b73fd0>,
 <trec_car.read_data.Section at 0x106d80080>,
 <trec_car.read_data.Section at 0x106d78080>,
 <trec_car.read_data.Section at 0x106d803c8>,
 <trec_car.read_data.Section at 0x106d80e10>,
 <trec_car.read_data.Section at 0x106d80e48>,
 <trec_car.read_data.Section at 0x106d80e80>,
 <trec_car.read_data.Section at 0x106d83080>]

Incorrect anchor text serialization

There is an error between Data.java and DeserializeData.java: the order of the constructor parameters in DeserializeData.java is currently incorrect.

Current code:
public ParaLink(String pageId, String anchorText, String page) {

Correct code:
public ParaLink(String page, String pageId, String anchorText) {

The result is that the fields are jumbled.

format_runs.py throws error with v1.1 release

I'm getting this error in format_runs.py, running format_runs_test.py. I get this error for both the halfwise and spritzer releases, so I think it's the code and not the data. Python version: Python 3.5.2 :: Anaconda 4.2.0 (64-bit). The code:

pages = []
with open(spritzer_outlines, 'rb') as f:
    pages = [p for p in itertools.islice(iter_annotations(f), 0, 10)]

paragraphs = []
with open(spritzer_paragraphs, 'rb') as f:
    paragraphs = [p for p in itertools.islice(iter_paragraphs(f), 0, None, 5)]

print("pages: ", len(pages))
print("paragraphs: ", len(paragraphs))

mock_ranking = [(p, 1.0 / (r + 1), (r + 1)) for p, r in zip(paragraphs, range(0, 1000))]

with open('testfile', mode='w', encoding='UTF-8') as f:
    writer = configure_csv_writer(f)
    for page in pages:
        for section_path in page.flat_headings_list():
            query_id = "/".join([page.page_id] + [section.headingId for section in section_path])
            ranking = [RankingEntry(query_id, p.para_id, r, s, paragraph_content=p) for p, s, r in mock_ranking]
            format_run(writer, ranking, exp_name='test')

    f.close()

Output:

pages:  3
paragraphs:  26

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input> in <module>()
     19             query_id = "/".join([page.page_id]+[section.headingId for section in section_path])
     20             ranking = [RankingEntry(query_id, p.para_id, r, s, paragraph_content=p) for p, s, r in mock_ranking]
---> 21             format_run(writer, ranking, exp_name='test')
     22 
     23 f.close()

/home/matt/lib/trec-car-tools/python3/trec_car/format_runs.py in format_run(writer, ranking_of_paragraphs, exp_name)
     56     for elem in ranking_of_paragraphs:
     57         # query-number Q0 document-id rank score Exp
---> 58         writer.write(" ".join([str(x) for x in elem.to_trec_eval_row(exp_name)]))
     59         writer.write("\n")

AttributeError: '_csv.writer' object has no attribute 'write'

Paragraph-Dataset 2.0 cannot be loaded in java

I am facing an issue when trying to read the paragraph corpus dataset 2.0.
The code to read the data looks like the one given in the examples:

final FileInputStream fileInputStream = new FileInputStream(new File(path));
for (Data.Paragraph p : DeserializeData.iterableParagraphs(fileInputStream)) {
    System.out.println(p);
}

When the program tries to load the first paragraph it quits with the following exception:

java.lang.NullPointerException
at edu.unh.cs.treccar_v2.read_data.DeserializeData.paragraphFromCbor(DeserializeData.java:294)
at edu.unh.cs.treccar_v2.read_data.DeserializeData.access$100(DeserializeData.java:14)
at edu.unh.cs.treccar_v2.read_data.DeserializeData$1ParagraphIterator.parseItem(DeserializeData.java:142)
at edu.unh.cs.treccar_v2.read_data.DeserializeData$1ParagraphIterator.parseItem(DeserializeData.java:137)
at edu.unh.cs.treccar_v2.read_data.CborListWithHeaderIterator.next(CborListWithHeaderIterator.java:53)
at ...

The reader seems to expect a tag for each item in the cbor (see the assert in DeserializeData.pageFromCbor()), which is currently null for each array in the given DataItem.

I use the Maven dependencies in version 2.0 and have tried different Java versions from 1.8 to 11.
The data is readable in Python without any issue.

Administrative Headings still in Processed-All-But-Benchmark 2.0.2

I still see the following headings in release-v2.0.2/processedAllButBenchmark.cbor:

History
Biography
References
External%20links
Types
Examples
Description
Overview
Career
Uses
See%20also
Game%20log
Life%20and%20career
Structure
Notable%20discoveries
Related%20legislation
Classification
Life
Fleet
Etymology
Function
Species
Background
Abstracting%20and%20indexing
Applications
Causes
Criteria
Signs%20and%20symptoms
Insect%20Larva
Further%20reading
Production
Interactions
Life%20and%20work
Replacements%20for%20holotypes
Definition
Municipalities
Taxonomy
Categories
Functions%20and%20organization
Route%20description
Occurrence
Geography
Neurology
Variations
Development
Cause
Genera
Preparation
Definitions
Varieties

v2.0 dataset para id in manual qrels not found in paragraphCorpus

example: 147e6b06a5cbe64ad7bf73ab8c632152a400718b
This paragraph id shows up in the manual qrels but cannot be found in the paragraphCorpus.

Code to reproduce this error:

import os
import time
import pandas as pd
from trec_car.read_data import *
data_dir = '/home/rding/rank_ds/trec_car/v2.0'
# get all the qrels, query-doc pairs
qrel_file = os.path.join(data_dir, 'benchmarkY1Test-manual-qrels/manual.benchmarkY1test.cbor.hierarchical.qrels')
with open(qrel_file, 'r') as f:
    q_lines = [x.strip().split() for x in f.readlines()]
df=pd.DataFrame(q_lines, columns=['qid', 'blah', 'pid', 'rel'])
pid_used = set(df.pid.unique())

start = time.time()
paragraphs = os.path.join(data_dir, 'paragraphCorpus/dedup.articles-paragraphs.cbor')
pid_selected = dict()
with open(paragraphs, 'rb') as f:
    for p in iter_paragraphs(f):
        if p.para_id in pid_used:
            pid_selected[p.para_id] = p.get_text()
print('time elapsed: {}s'.format(time.time()-start))

for i in pid_used:
    if i not in pid_selected:
        print(i)
        break

ClassCastException in DeserializeData.java

Exception in thread "main" java.lang.ClassCastException: co.nstant.in.cbor.model.ByteString cannot be cast to co.nstant.in.cbor.model.UnicodeString
at edu.unh.cs.treccar.read_data.DeserializeData.pageFromCbor(DeserializeData.java:157)
at edu.unh.cs.treccar.read_data.DeserializeData$1.lowLevelNext(DeserializeData.java:36)
at edu.unh.cs.treccar.read_data.DeserializeData$1.<init>(DeserializeData.java:27)
at edu.unh.cs.treccar.read_data.DeserializeData.iterAnnotations(DeserializeData.java:26)
at edu.unh.cs.treccar.read_data.DeserializeData$2.iterator(DeserializeData.java:67)
at edu.unh.cs.treccar.playground.LinksWithContextKeywords.main(LinksWithContextKeywords.java:157)

As input I'm using release-v1.1.cbor.paragraphs. I guess I need the complete article cbor for LinksWithContextKeywords, which is mentioned in another issue: enwiki-20161220-pages-articles24.xml-p030503454p033952815.cbor?

Thanks for your help

Java CBOR deserialization error

edu.unh.cs.treccar.playground.ExtractPlainText ".../paragraphcorpus/paragraphcorpus.cbor"

Exception in thread "main" java.lang.ClassCastException: co.nstant.in.cbor.model.UnsignedInteger cannot be cast to co.nstant.in.cbor.model.UnicodeString
at edu.unh.cs.treccar.read_data.TrecCarHeader.<init>(TrecCarHeader.java:46)
at edu.unh.cs.treccar.read_data.CborListWithHeaderIterator.<init>(CborListWithHeaderIterator.java:23)
at edu.unh.cs.treccar.read_data.DeserializeData$1PageIterator.<init>(DeserializeData.java:21)
at edu.unh.cs.treccar.read_data.DeserializeData.iterAnnotations(DeserializeData.java:29)
at edu.unh.cs.treccar.read_data.DeserializeData$1.iterator(DeserializeData.java:36)
at edu.unh.cs.treccar.playground.ExtractPlainText.main(ExtractPlainText.java:53)

I'm using the 1.5 tools branch and the paragraph data from the website.

bug in read_data.py in the PageMetadata __str__ function (categoryNames)

The __str__ method of the class PageMetadata crashes on files without categoryNames.

There is an error in the test before getting categoryNames: it should be if self.categoryNames is None, not if self.redirectNames is None.

    def __str__(self):
        redirStr = ("" if self.redirectNames is None else (" redirected = "+", ".join([name for name in self.redirectNames])))
        disamStr = ("" if self.disambiguationNames is None else (" disambiguated = "+", ".join([name for name in self.disambiguationNames])))
        catStr = ("" if self.redirectNames is None else (" categories = "+", ".join([name for name in self.categoryNames]))) ## HERE
        inlinkStr = ("" if self.inlinkIds is None else (" inlinks = "+", ".join([name for name in self.inlinkIds])))
        # inlinkAnchorStr = str (self.inlinkAnchors)
        inlinkAnchorStr = ("" if self.inlinkAnchors is None else \
                                (" inlinkAnchors = "+", ".join( \
                                    [ ("%s: %d" % (name, freq)) for (name, freq) in self.inlinkAnchors] \
                                    # [ ("%s: " % (name)) for (name, freq) in self.inlinkAnchors] \
                                )))
        return  "%s \n%s \n%s \n%s \n%s\n" % (redirStr, disamStr, catStr, inlinkStr, inlinkAnchorStr)

I found this while trying to read the latest release of the TREC CAR dataset, so I am confused...

This bug has been in the repository for 6 months (since the last commit on read_data), so has no one hit this problem before me? Am I using the wrong dataset, and must all files have categoryNames? Or do most users not use the Python version of the tool?

about data release v1.4

The paragraph 83423c198b6099edba08f185f940042d5dba3b79 is annotated as relevant to more than one section id, although the following statement occurs on the track web page:

*.cbor.hierarchical.qrels: every paragraph is relevant only for its leaf most specific section (example: PageName/Heading1/Heading1.1 - but not PageName/Heading1!)

cat release-v1.4/fold1.train.cbor.hierarchical.qrels | grep 83423c198b6099edba08f185f940042d5dba3b79

yields the following output:

Joint%20University%20Programmes%20Admissions%20System/Difficulty 0 83423c198b6099edba08f185f940042d5dba3b79 1
Kawasaki%20Ki-100/Production 0 83423c198b6099edba08f185f940042d5dba3b79 1
Kawasaki%20Ki-61/Production 0 83423c198b6099edba08f185f940042d5dba3b79 1
Sports%20in%20San%20Antonio/NCAA%20college%20basketball 0 83423c198b6099edba08f185f940042d5dba3b79 1
Variational%20Bayesian%20methods/A%20more%20complex%20example 0 83423c198b6099edba08f185f940042d5dba3b79 1

Zipping cbor

Suggestion: distribute zipped cbor files, with tooling support to read the zipped stream directly.
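A sketch of what that tooling support could look like, assuming iter_paragraphs only needs a readable binary file object (the .gz file name is hypothetical):

    import gzip
    from trec_car.read_data import iter_paragraphs

    # read a gzip-compressed cbor stream without decompressing to disk
    with gzip.open('paragraphs.cbor.gz', 'rb') as f:
        for p in iter_paragraphs(f):
            pass  # process each paragraph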

Issue with Python library with 2.0 paragraph collection

Hi - I've updated the trec_car library to the latest version. With the following code I received the error message posted below:

from trec_car.read_data import *

ct = 0

with open("../release-v2.0/paragraphCorpus/dedup.articles-paragraphs.cbor", 'rb') as f:  # 16GB
    for p in iter_paragraphs(f):
        ct += 1

##########

---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input> in <module>()
      4 
      5 with open("../release-v2.0/paragraphCorpus/dedup.articles-paragraphs.cbor", 'rb') as f: #16GB
----> 6     for p in iter_paragraphs(f):
      7         ct += 1
      8 print (ct)

~/anaconda2/envs/py36/lib/python3.6/site-packages/trec_car/read_data.py in iter_paragraphs(file)
    246     while True:
    247         try:
--> 248             yield Paragraph.from_cbor(cbor.load(file))
    249         except EOFError:
    250             break

~/anaconda2/envs/py36/lib/python3.6/site-packages/trec_car/read_data.py in from_cbor(cbor)
    171     @staticmethod
    172     def from_cbor(cbor):
--> 173         assert cbor[0] == 0
    174         paragraphId = cbor[1].decode('ascii')
    175         return Paragraph(paragraphId, map(ParaBody.from_cbor, cbor[2]))

AssertionError:

Am I doing something wrong?
