Comments (7)
Building three types of indexes, using uniCOIL as an example:
# "base"
target/appassembler/bin/IndexCollection \
-collection JsonVectorCollection \
-input /mnt/collections/msmarco/msmarco-passage-unicoil \
-index indexes/lucene-index.msmarco-passage-unicoil/ \
-generator DefaultLuceneDocumentGenerator \
-threads 16 -impact -pretokenized -optimize &
# Store docvectors
target/appassembler/bin/IndexCollection \
-collection JsonVectorCollection \
-input /mnt/collections/msmarco/msmarco-passage-unicoil \
-index indexes/lucene-index.msmarco-passage-unicoil.docvectors/ \
-generator DefaultLuceneDocumentGenerator \
-threads 16 -impact -pretokenized -storeDocvectors -optimize &
# Store raw text
target/appassembler/bin/IndexCollection \
-collection JsonVectorCollection \
-input /mnt/collections/msmarco/msmarco-passage-unicoil \
-index indexes/lucene-index.msmarco-passage-unicoil.text/ \
-generator DefaultLuceneDocumentGenerator \
-threads 16 -impact -pretokenized -storeRaw -optimize &
Storing only the raw text saves a lot of space compared with storing docvectors (8.0G versus 47G):
$ du -h indexes/ | grep unicoil
1.3G indexes/lucene-index.msmarco-passage-unicoil
47G indexes/lucene-index.msmarco-passage-unicoil.docvectors
8.0G indexes/lucene-index.msmarco-passage-unicoil.text
from anserini.
@AileenLin to help you out - this is currently what's not working on the Pyserini end:
python -m pyserini.search.lucene \
--threads 16 --batch-size 128 \
--index ../anserini/indexes/lucene-index.msmarco-passage-unicoil.docvectors \
--topics dl19-passage-unicoil \
--output runs/run.dl19-rocchio.txt \
--hits 1000 --impact --rocchio
Ultimately, I want to make this work.
Do you mean this error? AttributeError: 'LuceneImpactSearcher' object has no attribute 'set_rocchio'
I have tested Anserini with the following commands, and the results matched the benchmark:
target/appassembler/bin/SearchCollection \
-index indexes/lucene-index.msmarco-passage-unicoil.docvectors/ \
-topics src/main/resources/topics-and-qrels/topics.msmarco-passage.dev-subset.unicoil.tsv.gz \
-collection JsonVectorCollection \
-topicreader TsvInt \
-output runs/run.msmarco-passage-unicoil.rocchio.topics.msmarco-passage.dev-subset.unicoil.txt \
-impact -pretokenized -rocchio
target/appassembler/bin/SearchCollection \
-index indexes/lucene-index.msmarco-passage-unicoil.text/ \
-topics src/main/resources/topics-and-qrels/topics.msmarco-passage.dev-subset.unicoil.tsv.gz \
-collection JsonVectorCollection \
-topicreader TsvInt \
-output runs/run.msmarco-passage-unicoil.rocchio.topics.msmarco-passage.dev-subset.unicoil.txt \
-impact -pretokenized -rocchio
Yup, we need to expose the feature in the Java class, and then wire the connections to Python.
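For context on what is being wired up: Rocchio feedback builds a new query vector from the original query plus the term weights of the top-ranked documents. The sketch below is only a minimal illustration over sparse term-to-weight dictionaries, not Anserini's or Pyserini's implementation; the function name rocchio_expand and the parameters alpha, beta, and top_terms are hypothetical, and the real defaults may differ. It does show why the feedback step needs per-document term weights, which is presumably why the commands in this thread run against indexes that store docvectors or raw text.

```python
from collections import defaultdict


def rocchio_expand(query, feedback_docs, alpha=1.0, beta=0.75, top_terms=10):
    """Illustrative Rocchio update (positive feedback only).

    new_query = alpha * query + beta * mean(feedback_docs)

    `query` and each entry of `feedback_docs` are dicts mapping a term
    to its weight (for example, uniCOIL impact scores). All names and
    default values here are assumptions made for this sketch.
    """
    expanded = defaultdict(float)

    # Contribution of the original query terms.
    for term, weight in query.items():
        expanded[term] += alpha * weight

    # Contribution of the mean feedback-document vector.
    if feedback_docs:
        n = len(feedback_docs)
        for doc in feedback_docs:
            for term, weight in doc.items():
                expanded[term] += beta * weight / n

    # Keep only the highest-weighted terms (ties broken alphabetically).
    ranked = sorted(expanded.items(), key=lambda kv: (-kv[1], kv[0]))
    return dict(ranked[:top_terms])


if __name__ == "__main__":
    q = {"a": 2.0}
    docs = [{"a": 4.0, "b": 2.0}, {"b": 6.0}]
    print(rocchio_expand(q, docs, alpha=1.0, beta=0.5))
```

In this toy setup the term "b" enters the query only through the feedback documents, which is the behaviour the -rocchio and --rocchio flags above are meant to provide at search time.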
got it
This has been pushed out in v0.22.0. All done!