
Comments (17)

iorixxx commented on May 25, 2024

Hi,

Ok.

How about using args4j instead of commons-cli?
https://github.com/kohsuke/args4j/blob/master/args4j/examples/SampleMain.java

I think it is more user-friendly.

Thanks,
Ahmet

On Saturday, October 10, 2015 3:47 PM, Jimmy Lin wrote:

@iorixxx Please check out my branch cw09b-refactoring
I've pulled in your edits and started putting classes in the "right" package hierarchy, following the general layout of Lucene's package hierarchy. Can you please:
* Clean up various references (e.g., pom.xml) to make sure everything still works?
* Refactor out usage of Args class in IndexClueWeb09b (we should just be using commons-cli), and in general, make the logging, cmdline options, etc. consistent?
Thanks!

from anserini.

lintool commented on May 25, 2024

I'm happy with args4j, but would you mind retro-fitting all the classes to make them consistent? Thanks!

iorixxx commented on May 25, 2024

cw09b-refactoring lacks my last two commits (abandon System.currentTimeMillis(), which is not monotonic) from my fork. Do you mind grabbing them?
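
For context on the monotonicity point: `System.currentTimeMillis()` reports wall-clock time, which can jump when the system clock is adjusted, whereas `System.nanoTime()` is meant for measuring elapsed time. A minimal sketch of timing a job that way follows; the `ElapsedTimer` class and its method name are made up for illustration and are not taken from the commits mentioned above.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical helper for illustration; not part of the Anserini code base.
final class ElapsedTimer {
  // nanoTime() has an arbitrary origin, so only differences are meaningful.
  private final long startNanos = System.nanoTime();

  // Elapsed time since construction, in whole milliseconds.
  long elapsedMillis() {
    return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - startNanos);
  }

  public static void main(String[] args) throws InterruptedException {
    ElapsedTimer timer = new ElapsedTimer();
    Thread.sleep(50); // stand-in for the indexing work
    System.out.println("Indexing took " + timer.elapsedMillis() + " ms");
  }
}
```

Because the value is only used as a difference between two readings, clock adjustments during a long indexing run should not distort the reported duration.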

iorixxx commented on May 25, 2024

Hi, I reduced the pom.xml and used args4j for command-line parsing in the cw09b indexer.

I think annotations help the readability of the code: IndexArgs.java

I forgot to mention the issue number in the commit message.

Here is the commit: c57141a6b4

> retro-fitting all the classes

I see that gov2 has doclimit, update, positions, optimize options. Do we really need them?

And searchers don't follow indexers' CLI behaviour.

lintool commented on May 25, 2024

> cw09b-refactoring lacks my last two commits (abandon System.currentTimeMillis(), which is not monotonic) from my fork. Do you mind grabbing them?

I thought I did - https://github.com/lintool/Anserini/network shows that I picked up all your forks.

> I see that gov2 has doclimit, update, positions, optimize options. Do we really need them?

Here's my rationale, one by one:

  • doclimit: sometimes you just want to index a small portion of the collection to test. keep?
  • update: probably don't need, since we're indexing standard TREC test collections. drop?
  • positions and count: we want these to compare against other systems. Also, at some point in time we'll want to add docvector and store the document vectors for relevance feedback. keep?
  • optimize: should we just always optimize by default (since these are static doc collections). keep?

> And searchers don't follow indexers' CLI behaviour.

We should probably make these consistent next...

iorixxx commented on May 25, 2024

> doclimit: sometimes you just want to index a small portion of the collection to test. keep?

Well, you can always supply a sub-folder (or some arbitrary folder containing a few warc files) with the -input argument, e.g., /path/to/cw09b/enwp03 instead of /path/to/cw09b/. The option would complicate things without adding value to reproducibility.

> update: probably don't need, since we're indexing standard TREC test collections. drop?

+1 to drop

> positions and count:

+1 to keep.

> optimize: should we just always optimize by default

Optimize can be in another executable. It takes about 30-40 minutes to optimise the catB dataset, but I haven't optimised the catA dataset yet. And it needs 2-3 times more disk space during the operation. If the user has enough disk space and wants to optimise, we could provide a separate program. Let's drop it from the indexer code?

lintool commented on May 25, 2024

> doclimit: sometimes you just want to index a small portion of the collection to test. keep?
>
> Well, you can always supply a sub-folder (or some arbitrary folder containing a few warc files) with the -input argument, e.g., /path/to/cw09b/enwp03 instead of /path/to/cw09b/. The option would complicate things without adding value to reproducibility.

Hrm. I don't feel that strongly... although I've already found the option to be useful in my testing... don't need to muck around with different paths, don't need to worry about how big each file is, don't need to copy around files, etc. just use same exact command, add in extra option.

I would vote to keep, but fine with your decision... can always go back and add later.

> optimize: should we just always optimize by default
>
> Optimize can be in another executable. It takes about 30-40 minutes to optimise the catB dataset, but I haven't optimised the catA dataset yet. And it needs 2-3 times more disk space during the operation. If the user has enough disk space and wants to optimise, we could provide a separate program. Let's drop it from the indexer code?

I don't like proliferation of small programs...

Would we even want to index catA in one monolithic index? I'd think for catA we'd build an index per shard and manage them using Solr or something like that...

iorixxx commented on May 25, 2024

I made doclimit control the number of *.warc files. Making it control the number of actual Lucene documents would require threads to communicate with each other. In the current cw09b indexer, threads are independent of each other.
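
The trade-off described here can be sketched in plain Java. Capping the number of files happens once, before any work is handed to worker threads, so the workers stay independent; capping the number of documents would need shared state that every worker consults, such as an `AtomicLong`. The class and method names below are hypothetical and are not the actual indexer code.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;
import java.util.stream.Collectors;

// Hypothetical sketch; not the actual IndexClueWeb09b implementation.
final class DocLimitSketch {

  // File-level limit: trim the work list up front; a negative limit means "index everything".
  static List<String> limitFiles(List<String> warcFiles, int docLimit) {
    if (docLimit < 0) {
      return warcFiles;
    }
    return warcFiles.stream().limit(docLimit).collect(Collectors.toList());
  }

  // Document-level limit: all worker threads would have to share this counter.
  static final class SharedDocBudget {
    private final AtomicLong remaining;

    SharedDocBudget(long maxDocs) {
      this.remaining = new AtomicLong(maxDocs);
    }

    // Returns true if the caller may index one more document; never drops below zero.
    boolean tryAcquire() {
      return remaining.getAndUpdate(r -> r > 0 ? r - 1 : 0) > 0;
    }
  }
}
```

With the file-level variant, each thread simply receives its share of the trimmed list; with the document-level variant, every thread would call `tryAcquire()` before adding a document, which is the extra coordination mentioned above.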

Here is the usage message, 3 required options along with 3 optional ones.

 -doclimit [Number]      : Maximum number of *.warc documents to index (-1 to index
                           everything) (default: -1)
 -index [Path]           : Lucene index
 -input [Path]           : Collection Directory
 -optimize [true|false]  : Optimize index (force merge) (default: false)
 -positions [true|false] : Index positions (default: true)
 -threads [Number]       : Number of Threads
Example: IndexClueWeb09b -index [Path] -input [Path] -threads [Number]

What do you think about this in its current form? Can someone test the doclimit option?

iorixxx commented on May 25, 2024

> Would we even want to index catA in one monolithic index?

If the index fits in one box, why not? The current catB index is 80 GB. Unlike real-world settings, for static TREC collections and experiments, sharding might not be needed.

lintool commented on May 25, 2024

> -optimize [true|false] : Optimize index (force merge) (default: false)
> -positions [true|false] : Index positions (default: true)

Can we just have the arg instead of true/false? I.e., -optimize instead of -optimize true.

lintool commented on May 25, 2024

> Would we even want to index catA in one monolithic index?
>
> If the index fits in one box, why not? The current catB index is 80 GB. Unlike real-world settings, for static TREC collections and experiments, sharding might not be needed.

Challenge accepted :)

Streeling has plenty of disk and memory - at some point I'll try indexing all of ClueWeb.

Which raises the point: is the indexer for ClueWeb09 going to be any different from the one for ClueWeb09b, other than the data path? If not, shouldn't we just rename the class to ClueWeb09?

iorixxx commented on May 25, 2024

The positions and optimize options are now Boolean switches:

 -optimize          : Boolean switch to optimize index (force merge) (default: false)
 -positions         : Boolean switch to index positions (default: false)
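
To illustrate the difference between the two styles, here is a small hand-rolled sketch in plain Java. It deliberately does not use args4j, so that the parsing logic is visible. A switch is true when present and false when absent, which is why its default can only sensibly be false. The `SwitchSketch` class is hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical illustration; the indexer discussed here delegates parsing to args4j.
final class SwitchSketch {
  final boolean optimize;
  final boolean positions;

  SwitchSketch(String[] args) {
    Set<String> flags = new HashSet<>(Arrays.asList(args));
    // Presence of the flag turns the option on; there is no value to parse.
    this.optimize = flags.contains("-optimize");
    this.positions = flags.contains("-positions");
  }

  public static void main(String[] args) {
    SwitchSketch opts = new SwitchSketch(new String[] {"-positions"});
    // Prints: optimize=false positions=true
    System.out.println("optimize=" + opts.optimize + " positions=" + opts.positions);
  }
}
```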

iorixxx commented on May 25, 2024

> is the indexer for ClueWeb09 going to be any different from the one for ClueWeb09b, other than the data path?

I don't have the entire collection, so I didn't try. But I believe it can handle catA :) (minor tweaks may be required). Let's wait for the first try and see how it goes. Then we can rename?

lintool commented on May 25, 2024

sg

It's on my TODO list to merge in your branch.

iorixxx commented on May 25, 2024

Since we agreed on the indexing options, I applied them to the gov2 indexer code. I think it has better readability this way.

lintool commented on May 25, 2024

Agreed. Building cw09b index and trying to reproduce results now. Will merge back into master after everything looks okay.

lintool commented on May 25, 2024

Merged. a6ab450
