Comments (17)
Hi,
Ok.
How about using args4j instead of commons-cli?
https://github.com/kohsuke/args4j/blob/master/args4j/examples/SampleMain.java
I think it is more user-friendly.
Thanks,
Ahmet
On Saturday, October 10, 2015 3:47 PM, Jimmy Lin wrote:
@iorixxx Please check out my branch cw09b-refactoring
I've pulled in your edits and started putting classes in the "right" package hierarchy, following the general layout of Lucene's package hierarchy. Can you please:
* Clean up various references (e.g., pom.xml) to make sure everything still works?
* Refactor out usage of Args class in IndexClueWeb09b (we should just be using commons-cli), and in general, make the logging, cmdline options, etc. consistent?
Thanks!
from anserini.
I'm happy with args4j, but would you mind retro-fitting all the classes to make them consistent? Thanks!
from anserini.
cw09b-refactoring lacks my last two commits (abandon System.currentTimeMillis(), which is not monotonic) from my fork. Do you mind grabbing them?
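For context on the timing change, here is a minimal sketch (not the code from those commits). `System.currentTimeMillis()` reports wall-clock time, which can jump when the system clock is adjusted, whereas `System.nanoTime()` is meant for measuring elapsed time. The `ElapsedTimer` class and its method name are hypothetical.

```java
import java.util.concurrent.TimeUnit;

/** Hypothetical helper: measures elapsed time with System.nanoTime(). */
final class ElapsedTimer {
    // nanoTime() has an arbitrary origin; only differences between two readings are meaningful.
    private final long startNanos = System.nanoTime();

    /** Milliseconds elapsed since this timer was created. */
    long elapsedMillis() {
        return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - startNanos);
    }

    public static void main(String[] args) throws InterruptedException {
        ElapsedTimer timer = new ElapsedTimer();
        Thread.sleep(50); // stand-in for indexing work
        System.out.println("Elapsed: " + timer.elapsedMillis() + " ms");
    }
}
```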
from anserini.
Hi, I reduced the pom.xml and used args4j for command-line parsing in the cw09b indexer.
I think annotations help the readability of the code: IndexArgs.java
I forgot to mention the issue number in the commit message.
Here is the commit: c57141a6b4
retro-fitting all the classes
I see that gov2 has doclimit, update, positions, optimize options. Do we really need them?
And searchers don't follow indexers' CLI behaviour.
from anserini.
cw09b-refactoring lacks my last two commits (abandon System.currentTimeMillis(), which is not monotonic) from my fork. Do you mind grabbing them?
I thought I did - https://github.com/lintool/Anserini/network shows that I picked up all your forks.
I see that gov2 has doclimit, update, positions, optimize options. Do we really need them?
Here's my rationale, one by one:
doclimit: sometimes you just want to index a small portion of the collection to test. keep?
update: probably don't need, since we're indexing standard TREC test collections. drop?
positions and count: we want these to compare against other systems. Also, at some point in time we'll want to add docvector and store the document vectors for relevance feedback. keep?
optimize: should we just always optimize by default (since these are static doc collections). keep?
And searchers don't follow indexers' CLI behaviour.
We should probably make these consistent next...
from anserini.
doclimit: sometimes you just want to index a small portion of the collection to test. keep?
Well, you can always supply a sub-folder (or some arbitrary folder containing a few warc files) with the -input argument, e.g., /path/to/cw09b/enwp03 instead of /path/to/cw09b/. A doclimit option would complicate things without adding value to reproducibility.
update: probably don't need, since we're indexing standard TREC test collections. drop?
+1 to drop
positions and count:
+1 to keep.
optimize: should we just always optimize by default
Optimize can be in another executable. It takes about 30 - 40 minutes to optimise the catB dataset, but I haven't optimised the catA dataset yet. And it needs 2-3 times more disk space during the operation. If the user has enough disk space and wants to optimise, we could provide a separate program. Let's drop it from the indexer code?
from anserini.
doclimit: sometimes you just want to index a small portion of the collection to test. keep?
Well, you can always supply a sub-folder (or some arbitrary folder containing a few warc files) with the -input argument, e.g., /path/to/cw09b/enwp03 instead of /path/to/cw09b/. A doclimit option would complicate things without adding value to reproducibility.
Hrm. I don't feel that strongly... although I've already found the option to be useful in my testing... don't need to muck around with different paths, don't need to worry about how big each file is, don't need to copy around files, etc. just use same exact command, add in extra option.
I would vote to keep, but fine with your decision... can always go back and add later.
optimize: should we just always optimize by default
Optimize can be in another executable. It takes about 30 - 40 minutes to optimise the catB dataset, but I haven't optimised the catA dataset yet. And it needs 2-3 times more disk space during the operation. If the user has enough disk space and wants to optimise, we could provide a separate program. Let's drop it from the indexer code?
I don't like proliferation of small programs...
Would we even want to index catA in one monolithic index? I'd imagine that for catA we'd build an index per shard and manage them using Solr or something like that...
from anserini.
I made doclimit control the number of *.warc files. Making it control actual Lucene documents would require the threads to communicate with each other. In the current cw09b indexer, threads are independent of each other.
Here is the usage message, 3 required options along with 3 optional ones.
-doclimit [Number] : Maximum number of *.warc documents to index (-1 to index
everything) (default: -1)
-index [Path] : Lucene index
-input [Path] : Collection Directory
-optimize [true|false] : Optimize index (force merge) (default: false)
-positions [true|false] : Index positions (default: true)
-threads [Number] : Number of Threads
Example: IndexClueWeb09b -index [Path] -input [Path] -threads [Number]
What do you think about this in its current form? Can someone test doclimit option?
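To illustrate the trade-off behind doclimit, here is a rough sketch in plain Java (not the Anserini code; all class and method names are hypothetical). Limiting the number of *.warc files only requires truncating the file list before work is handed to the threads, whereas limiting the number of indexed documents would need a counter shared by all worker threads.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;
import java.util.stream.Collectors;

/** Hypothetical sketch contrasting a file-level limit with a document-level limit. */
final class DocLimitSketch {

    /** File-level limit: truncate the list up front; a negative limit means "everything". */
    static List<String> limitFiles(List<String> warcFiles, int doclimit) {
        if (doclimit < 0) {
            return warcFiles;
        }
        return warcFiles.stream().limit(doclimit).collect(Collectors.toList());
    }

    /** Document-level limit: every worker thread has to consult one shared counter. */
    static final class SharedDocBudget {
        private final AtomicLong remaining;

        SharedDocBudget(long maxDocs) {
            this.remaining = new AtomicLong(maxDocs);
        }

        /** Returns true if the caller may index one more document. */
        boolean tryAcquire() {
            // Atomically decrement, but never below zero.
            return remaining.getAndUpdate(n -> n > 0 ? n - 1 : 0) > 0;
        }
    }

    public static void main(String[] args) {
        List<String> files = Arrays.asList("00.warc.gz", "01.warc.gz", "02.warc.gz");
        System.out.println(limitFiles(files, 2));

        SharedDocBudget budget = new SharedDocBudget(1);
        System.out.println(budget.tryAcquire());
        System.out.println(budget.tryAcquire());
    }
}
```

The file-level variant keeps the threads free of shared state, which matches the design described above; the shared counter is the extra coordination a per-document limit would introduce.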
from anserini.
Would we even want to index catA in one monolithic index?
If the index fits in one box, why not? The current catB index is 80 GB. Unlike real-world settings, for static TREC collections & experiments, sharding might not be needed.
from anserini.
-optimize [true|false] : Optimize index (force merge) (default: false)
-positions [true|false] : Index positions (default: true)
Can we just have the arg instead of true/false? I.e., -optimize instead of -optimize true.
from anserini.
Would we even want to index catA in one monolithic index?
If the index fits in one box, why not? The current catB index is 80 GB. Unlike real-world settings, for static TREC collections & experiments, sharding might not be needed.
Challenge accepted :)
Streeling has plenty of disk and memory - at some point I'll try indexing all of ClueWeb.
Which raises the point: is the indexer for ClueWeb09b going to be any different from ClueWeb09, other than the data path? If not, shouldn't we just rename the class to ClueWeb09?
from anserini.
The positions and optimize options are now Boolean switches:
-optimize : Boolean switch to optimize index (force merge) (default: false)
-positions : Boolean switch to index positions (default: false)
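As a rough illustration of the switch semantics, here is a hand-rolled sketch in plain Java (not the args4j-based IndexArgs code; the class name, field names, and the default for -threads are assumptions). A Boolean switch defaults to false and becomes true merely by being present, while an option such as -threads still consumes a value.

```java
/** Hypothetical sketch of Boolean-switch parsing; not the args4j implementation. */
final class SwitchArgsSketch {
    boolean optimize = false;  // true only if -optimize is present
    boolean positions = false; // true only if -positions is present
    int threads = 1;           // assumed default, for illustration only

    static SwitchArgsSketch parse(String[] args) {
        SwitchArgsSketch parsed = new SwitchArgsSketch();
        for (int i = 0; i < args.length; i++) {
            switch (args[i]) {
                case "-optimize":
                    parsed.optimize = true; // a switch takes no value
                    break;
                case "-positions":
                    parsed.positions = true;
                    break;
                case "-threads":
                    if (i + 1 >= args.length) {
                        throw new IllegalArgumentException("-threads requires a value");
                    }
                    parsed.threads = Integer.parseInt(args[++i]); // consume the value
                    break;
                default:
                    throw new IllegalArgumentException("Unknown option: " + args[i]);
            }
        }
        return parsed;
    }

    public static void main(String[] args) {
        SwitchArgsSketch parsed = parse(new String[] {"-optimize", "-threads", "4"});
        System.out.println("optimize=" + parsed.optimize
                + " positions=" + parsed.positions
                + " threads=" + parsed.threads);
    }
}
```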
from anserini.
is the indexer for ClueWeb09b going to be any different from ClueWeb09, other than the data path?
I don't have the entire collection, so I didn't try. But I believe it can handle catA :) (minor tweaks may be required). Let's wait for the first try and see how it goes. Then we can rename?
from anserini.
sg
It's on my TODO list to merge in your branch.
from anserini.
Since we agreed on the indexing options, I applied them to the gov2 indexer code. I think it has better readability this way.
from anserini.
Agreed. Building cw09b index and trying to reproduce results now. Will merge back into master after everything looks okay.
from anserini.
Merged. a6ab450
from anserini.