Refactor ClueWeb09b to parallel structure of IndexGov2 about anserini HOT 17 CLOSED

castorini commented on May 25, 2024

Refactor ClueWeb09b to parallel structure of IndexGov2

from anserini.

Comments (17)

iorixxx commented on May 25, 2024

Hi,

Ok.

How about using args4j instead of commons-cli?
https://github.com/kohsuke/args4j/blob/master/args4j/examples/SampleMain.java

I think it is more user-friendly.

Thanks,
Ahmet

On Saturday, October 10, 2015 3:47 PM, Jimmy Lin wrote:

@iorixxx Please check out my branch cw09b-refactoring
I've pulled in your edits and started putting classes in the "right" package hierarchy, following the general layout of Lucene's package hierarchy. Can you please:
* Clean up various references (e.g., pom.xml) to make sure everything still works?
* Refactor out usage of Args class in IndexClueWeb09b (we should just be using commons-cli), and in general, make the logging, cmdline options, etc. consistent?
Thanks!
lintool commented on May 25, 2024

I'm happy with args4j, but would you mind retro-fitting all the classes to make them consistent? Thanks!

iorixxx commented on May 25, 2024

cw09b-refactoring lacks my last two commits (abandon System.currentTimeMillis(), which is not monotonic) from my fork. Do you mind grabbing them?
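
For context, the concern with `System.currentTimeMillis()` is that it follows the wall clock, which can jump forwards or backwards while a long indexing run is in progress. A minimal sketch of monotonic timing, assuming `System.nanoTime()` is the replacement (the thread does not say which call the commits actually use); the class name `ElapsedTimer` is hypothetical:

```java
import java.util.concurrent.TimeUnit;

// Hypothetical helper, not taken from the Anserini code base.
public class ElapsedTimer {
    // System.nanoTime() is only meaningful as a difference between two readings,
    // but it is not affected by wall-clock adjustments.
    private final long startNanos = System.nanoTime();

    /** Nanoseconds elapsed since this timer was created. */
    public long elapsedNanos() {
        return System.nanoTime() - startNanos;
    }

    /** Milliseconds elapsed since this timer was created. */
    public long elapsedMillis() {
        return TimeUnit.NANOSECONDS.toMillis(elapsedNanos());
    }

    public static void main(String[] args) throws InterruptedException {
        ElapsedTimer timer = new ElapsedTimer();
        Thread.sleep(50); // stand-in for indexing work
        System.out.println("Elapsed: " + timer.elapsedMillis() + " ms");
    }
}
```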

iorixxx commented on May 25, 2024

Hi, I reduced the pom.xml and used args4j for command-line parsing in the cw09b indexer.

I think annotations help the readability of the code: IndexArgs.java

I forgot to mention the issue number in the commit message.

Here is the commit: c57141a6b4

retro-fitting all the classes

I see that gov2 has doclimit, update, positions, optimize options. Do we really need them?

And searchers don't follow indexers' CLI behaviour.

lintool commented on May 25, 2024

cw09b-refactoring lacks my last two commits (abandon System.currentTimeMillis(), which is not monotonic) from my fork. Do you mind grabbing them?

I thought I did - https://github.com/lintool/Anserini/network shows that I picked up all your forks.

I see that gov2 has doclimit, update, positions, optimize options. Do we really need them?

Here's my rationale, one by one:

doclimit: sometimes you just want to index a small portion of the collection to test. keep?
update: probably don't need, since we're indexing standard TREC test collections. drop?
positions and count: we want these to compare against other systems. Also, at some point in time we'll want to add docvector and store the document vectors for relevance feedback. keep?
optimize: should we just always optimize by default (since these are static doc collections). keep?

And searchers don't follow indexers' CLI behaviour.

We should probably make these consistent next...

iorixxx commented on May 25, 2024

doclimit: sometimes you just want to index a small portion of the collection to test. keep?

Well, you can always supply a sub-folder (or some arbitrary folder containing a few warc files) with the -input argument. E.g., instead of /path/to/cw09b/, use /path/to/cw09b/enwp03. The doclimit option would complicate things without adding value to reproducibility.

update: probably don't need, since we're indexing standard TREC test collections. drop?

+1 to drop

positions and count:

+1 to keep.

optimize: should we just always optimize by default

Optimize can be in another executable. It takes about 30-40 minutes to optimise the catB dataset, but I haven't optimised the catA dataset yet. It also needs 2-3 times more disk space during the operation. If the user has enough disk space and wants to optimise, we could provide a separate program. Let's drop it from the indexer code?

lintool commented on May 25, 2024

doclimit: sometimes you just want to index a small portion of the collection to test. keep?

Well, you can always supply a sub-folder (or some arbitrary folder containing a few warc files) with the -input argument. E.g., instead of /path/to/cw09b/, use /path/to/cw09b/enwp03. The doclimit option would complicate things without adding value to reproducibility.

Hrm. I don't feel that strongly... although I've already found the option to be useful in my testing... don't need to muck around with different paths, don't need to worry about how big each file is, don't need to copy around files, etc. just use same exact command, add in extra option.

I would vote to keep, but fine with your decision... can always go back and add later.

optimize: should we just always optimize by default

Optimize can be in another executable. It takes about 30-40 minutes to optimise the catB dataset, but I haven't optimised the catA dataset yet. It also needs 2-3 times more disk space during the operation. If the user has enough disk space and wants to optimise, we could provide a separate program. Let's drop it from the indexer code?

I don't like proliferation of small programs...

Would we even want to index catA in one monolithic index? I'd like for catA we'd build an index per shard and manage using Solr or something like that...

iorixxx commented on May 25, 2024

I made doclimit control the number of *.warc files. Making it control the number of actual Lucene documents would require the threads to communicate with each other. In the current cw09b indexer, the threads are independent of each other.
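
A minimal sketch of this design, assuming one task per *.warc file on a fixed-size thread pool. The names `DocLimitSketch`, `applyLimit`, `indexAll`, and `indexFile` are hypothetical and not from the actual indexer. The limit is applied to the file list before any task is submitted, so the workers never need a shared document counter:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch, not the Anserini implementation.
public class DocLimitSketch {

    /** Keeps the first docLimit files, or all of them when docLimit is negative. */
    public static List<String> applyLimit(List<String> warcFiles, int docLimit) {
        int n = docLimit < 0 ? warcFiles.size() : Math.min(docLimit, warcFiles.size());
        return new ArrayList<>(warcFiles.subList(0, n));
    }

    /** Placeholder for per-file work (assumption: parse the file and add its documents). */
    static void indexFile(String warcFile) {
        // Real indexing would happen here.
    }

    /** Submits one independent task per selected file; returns how many files were submitted. */
    public static int indexAll(List<String> warcFiles, int docLimit, int numThreads) {
        List<String> selected = applyLimit(warcFiles, docLimit);
        ExecutorService pool = Executors.newFixedThreadPool(numThreads);
        for (String file : selected) {
            pool.execute(() -> indexFile(file));
        }
        pool.shutdown();
        try {
            // Wait for the workers; the one-hour timeout is an arbitrary choice for this sketch.
            pool.awaitTermination(1, TimeUnit.HOURS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return selected.size();
    }

    public static void main(String[] args) {
        List<String> files = Arrays.asList("00.warc.gz", "01.warc.gz", "02.warc.gz");
        System.out.println("Files submitted: " + indexAll(files, 2, 2));
    }
}
```

Limiting by Lucene document count instead would need something like a shared atomic counter checked by every worker, which is the coordination the comment above wants to avoid.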

Here is the usage message, 3 required options along with 3 optional ones.

 -doclimit [Number]      : Maximum number of *.warc documents to index (-1 to index
                           everything) (default: -1)
 -index [Path]           : Lucene index
 -input [Path]           : Collection Directory
 -optimize [true|false]  : Optimize index (force merge) (default: false)
 -positions [true|false] : Index positions (default: true)
 -threads [Number]       : Number of Threads
Example: IndexClueWeb09b -index [Path] -input [Path] -threads [Number]

What do you think about this in its current form? Can someone test doclimit option?

iorixxx commented on May 25, 2024

Would we even want to index catA in one monolithic index?

If the index fits in one box, why not? The current catB index is 80 GB. Unlike real-world settings, for static TREC collections and experiments, sharding might not be needed.

lintool commented on May 25, 2024

-optimize [true|false] : Optimize index (force merge) (default: false)
-positions [true|false] : Index positions (default: true)

Can we just have the arg instead of true/false? I.e., -optimize instead of -optimize true.

lintool commented on May 25, 2024

Would we even want to index catA in one monolithic index?

If the index fits in one box, why not? The current catB index is 80 GB. Unlike real-world settings, for static TREC collections and experiments, sharding might not be needed.

Challenge accepted :)

Streeling has plenty of disk and memory - at some point I'll try indexing all of ClueWeb.

Which raises the point: is the indexer for ClueWeb09 going to be any different from the one for ClueWeb09b, other than the data path? If not, shouldn't we just rename the class to ClueWeb09?

iorixxx commented on May 25, 2024

The positions and optimize options are now Boolean switches:

 -optimize          : Boolean switch to optimize index (force merge) (default: false)
 -positions         : Boolean switch to index positions (default: false)
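
To illustrate the switch semantics only, here is a minimal hand-rolled sketch in plain Java. It is not args4j and not the actual IndexArgs class; the class name `SwitchArgsSketch` is hypothetical. A flag's presence sets the field to true, its absence leaves the default of false, and no `true`/`false` value follows the flag:

```java
// Hypothetical sketch of presence-based Boolean switches; not the args4j-based code.
public class SwitchArgsSketch {
    public boolean optimize = false;   // set by -optimize
    public boolean positions = false;  // set by -positions

    /** Parses only the two switches discussed in the thread; anything else is rejected. */
    public static SwitchArgsSketch parse(String... args) {
        SwitchArgsSketch parsed = new SwitchArgsSketch();
        for (String arg : args) {
            switch (arg) {
                case "-optimize":
                    parsed.optimize = true;
                    break;
                case "-positions":
                    parsed.positions = true;
                    break;
                default:
                    throw new IllegalArgumentException("Unknown option: " + arg);
            }
        }
        return parsed;
    }

    public static void main(String[] args) {
        SwitchArgsSketch parsed = parse("-optimize");
        System.out.println("optimize=" + parsed.optimize + ", positions=" + parsed.positions);
    }
}
```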

iorixxx commented on May 25, 2024

is the indexer for ClueWeb09 going to be any different from the one for ClueWeb09b, other than the data path?

I don't have the entire collection, so I didn't try. But I believe it can handle catA :) (minor tweaks may be required). Let's wait for the first try and see how it goes. Then we can rename?

lintool commented on May 25, 2024

It's on my TODO list to merge in your branch.

iorixxx commented on May 25, 2024

Since we agreed on the indexing options, I applied them to the gov2 indexer code. I think it has better readability this way.

lintool commented on May 25, 2024

Agreed. Building cw09b index and trying to reproduce results now. Will merge back into master after everything looks okay.

lintool commented on May 25, 2024

Merged. a6ab450
