I’ll try to find time to run this.
from anserini.
Reporting on experiments with code on #2302 at commit 0b9f79e ("Added -noMerge option").
MS MARCO v1 passage, dev queries, cosDPR-distil model. This is the model reported here:
- Anserini Gets Dense Retrieval: Integration of Lucene's HNSW Indexes
- End-to-End Retrieval with Learned Dense and Sparse Representations Using Lucene
Experimental Setup
Comparing HNSW fp32 vs. int8. In both cases, I'm forcing no merge to reduce variance in indexing times, which otherwise vary from trial to trial depending on which segments get selected for merging. In both conditions, we end up with 16 segments.
Experiments on my Mac Studio, stats:
- Mac Studio, M1 Ultra (20 cores; 16 performance and 4 efficiency), 128 GB RAM
- macOS Sonoma 14.1.2; openjdk 11.0.13 2021-10-19.
- Using 16 threads.
Four trials for each condition. Commands:
python src/main/python/run_regression.py --index --verify --search --regression msmarco-passage-cos-dpr-distil-hnsw >& logs/log.msmarco-passage-cos-dpr-distil-hnsw.lucene990.fp32.1
python src/main/python/run_regression.py --index --verify --search --regression msmarco-passage-cos-dpr-distil-hnsw-int8 >& logs/log.msmarco-passage-cos-dpr-distil-hnsw.lucene990.int8.1
python src/main/python/run_regression.py --index --verify --search --regression msmarco-passage-cos-dpr-distil-hnsw >& logs/log.msmarco-passage-cos-dpr-distil-hnsw.lucene990.fp32.2
python src/main/python/run_regression.py --index --verify --search --regression msmarco-passage-cos-dpr-distil-hnsw >& logs/log.msmarco-passage-cos-dpr-distil-hnsw.lucene990.fp32.3
python src/main/python/run_regression.py --index --verify --search --regression msmarco-passage-cos-dpr-distil-hnsw >& logs/log.msmarco-passage-cos-dpr-distil-hnsw.lucene990.fp32.4
python src/main/python/run_regression.py --index --verify --search --regression msmarco-passage-cos-dpr-distil-hnsw-int8 >& logs/log.msmarco-passage-cos-dpr-distil-hnsw.lucene990.int8.2
python src/main/python/run_regression.py --index --verify --search --regression msmarco-passage-cos-dpr-distil-hnsw-int8 >& logs/log.msmarco-passage-cos-dpr-distil-hnsw.lucene990.int8.3
python src/main/python/run_regression.py --index --verify --search --regression msmarco-passage-cos-dpr-distil-hnsw-int8 >& logs/log.msmarco-passage-cos-dpr-distil-hnsw.lucene990.int8.4
Results
Indexing:
# fp32
Total 8,841,823 documents indexed in 00:31:04
Total 8,841,823 documents indexed in 00:31:08
Total 8,841,823 documents indexed in 00:30:34
Total 8,841,823 documents indexed in 00:30:24
avg 00:30:48
# int8
Total 8,841,823 documents indexed in 00:35:55
Total 8,841,823 documents indexed in 00:35:32
Total 8,841,823 documents indexed in 00:35:42
Total 8,841,823 documents indexed in 00:36:15
avg 00:35:51
+16% slowdown
Retrieval:
# fp32
6980 queries processed in 00:02:33 = ~45.34 q/s
6980 queries processed in 00:02:33 = ~45.57 q/s
6980 queries processed in 00:02:33 = ~45.33 q/s
6980 queries processed in 00:02:33 = ~45.55 q/s
avg 45.45 q/s
# int8
6980 queries processed in 00:01:48 = ~64.54 q/s
6980 queries processed in 00:01:48 = ~64.50 q/s
6980 queries processed in 00:01:48 = ~64.56 q/s
6980 queries processed in 00:01:48 = ~64.45 q/s
avg 64.51 q/s
+42% speedup
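As a sanity check, the averages and the slowdown/speedup percentages above can be recomputed from the raw per-trial numbers:

```python
# Recompute the summary statistics reported above from the per-trial numbers.

def hms_to_sec(t: str) -> int:
    """Convert an HH:MM:SS string to seconds."""
    h, m, s = (int(x) for x in t.split(":"))
    return h * 3600 + m * 60 + s

# Indexing times per trial (HH:MM:SS).
fp32_index = [hms_to_sec(t) for t in ["00:31:04", "00:31:08", "00:30:34", "00:30:24"]]
int8_index = [hms_to_sec(t) for t in ["00:35:55", "00:35:32", "00:35:42", "00:36:15"]]

# Retrieval throughput per trial (queries/sec).
fp32_qps = [45.34, 45.57, 45.33, 45.55]
int8_qps = [64.54, 64.50, 64.56, 64.45]

def avg(xs):
    return sum(xs) / len(xs)

slowdown = avg(int8_index) / avg(fp32_index) - 1  # indexing: int8 takes longer
speedup = avg(int8_qps) / avg(fp32_qps) - 1       # retrieval: int8 is faster
print(f"indexing slowdown: {slowdown:+.0%}")  # +16%
print(f"retrieval speedup: {speedup:+.0%}")   # +42%
```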
Finally, effectiveness: remains the same, no difference.
> Finally, effectiveness: remains the same, no difference.
🎉
> +16% slowdown
On index and initial flush, this isn't surprising. We build the graph with float32 and then have the small additional overhead of calculating quantiles and storing the quantized representation of everything.
But, I have noticed that merging is faster (about 30-40%).
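To make the indexing overhead concrete, here is a rough sketch of quantile-based scalar quantization to int8. This is illustrative only, not Lucene's actual implementation (Lucene computes its quantiles per segment with its own confidence-interval logic); it just shows the extra work per vector component: find quantiles, clip outliers, map the range onto 256 buckets.

```python
import random

def quantize_int8(values, confidence=0.99):
    """Illustrative quantile-based scalar quantization to int8.

    NOT Lucene's implementation -- a sketch of the idea: pick lower/upper
    quantiles, clip outliers to that range, then map the range onto the
    256 int8 buckets [-128, 127].
    """
    s = sorted(values)
    tail = (1 - confidence) / 2
    lo = s[int(tail * (len(s) - 1))]          # lower quantile
    hi = s[int((1 - tail) * (len(s) - 1))]    # upper quantile
    scale = 255.0 / (hi - lo)
    # Clip to [lo, hi], rescale to [0, 255], shift into signed int8 range.
    return [round((min(max(v, lo), hi) - lo) * scale) - 128 for v in values]

random.seed(0)
component_values = [random.gauss(0, 1) for _ in range(10_000)]
q = quantize_int8(component_values)
```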
@benwtrent merging is difficult to benchmark because I get high variance in running times... my diagnosis is that running time is idiosyncratically dependent on which segments get selected for merging... is this a convincing explanation?
@lintool I understand :)
I am happy to see that effectiveness is still, well, effective.
Hrm... update: trying the same experiments, but with OpenAI embeddings. Getting errors:
...
2023-12-14 20:46:43,421 INFO [main] index.AbstractIndexer (AbstractIndexer.java:260) - 53.93% of files completed, 4,830,000 documents indexed
[2336.220s][warning][gc,alloc] pool-2-thread-16: Retried waiting for GCLocker too often allocating 244355 words
2023-12-14 20:47:44,104 INFO [main] index.AbstractIndexer (AbstractIndexer.java:260) - 53.93% of files completed, 4,830,000 documents indexed
[2373.702s][warning][gc,alloc] pool-2-thread-8: Retried waiting for GCLocker too often allocating 249714 words
2023-12-14 20:48:44,511 INFO [main] index.AbstractIndexer (AbstractIndexer.java:260) - 53.93% of files completed, 4,830,000 documents indexed
2023-12-14 20:49:46,485 INFO [main] index.AbstractIndexer (AbstractIndexer.java:260) - 53.93% of files completed, 4,830,000 documents indexed
2023-12-14 20:50:49,971 INFO [main] index.AbstractIndexer (AbstractIndexer.java:260) - 53.93% of files completed, 4,830,000 documents indexed
2023-12-14 20:51:52,476 INFO [main] index.AbstractIndexer (AbstractIndexer.java:260) - 53.93% of files completed, 4,830,000 documents indexed
2023-12-14 20:52:53,176 INFO [main] index.AbstractIndexer (AbstractIndexer.java:260) - 53.93% of files completed, 4,830,000 documents indexed
2023-12-14 20:53:56,204 INFO [main] index.AbstractIndexer (AbstractIndexer.java:260) - 53.93% of files completed, 4,830,000 documents indexed
2023-12-14 20:54:58,288 INFO [main] index.AbstractIndexer (AbstractIndexer.java:260) - 53.93% of files completed, 4,830,000 documents indexed
[2792.142s][warning][gc,alloc] pool-2-thread-5: Retried waiting for GCLocker too often allocating 249321 words
[2793.036s][warning][gc,alloc] pool-2-thread-8: Retried waiting for GCLocker too often allocating 249713 words
2023-12-14 20:56:13,898 INFO [main] index.AbstractIndexer (AbstractIndexer.java:260) - 53.93% of files completed, 4,830,000 documents indexed
[2869.427s][warning][gc,alloc] pool-2-thread-9: Retried waiting for GCLocker too often allocating 247479 words
[2872.043s][warning][gc,alloc] pool-2-thread-3: Retried waiting for GCLocker too often allocating 237701 words
@benwtrent any ideas? cc/ @ChrisHegarty @jpountz
How should I start debugging?
do these warnings result in indexing errors? or is it "just" polluting the output logs?
It might be the GC not managing to free up enough memory (also considering the larger embedding size of 1536 dimensions). Either we can give the JVM a larger heap or allow more retries from the GC (e.g., by passing -XX:GCLockerRetryAllocationCount=100 to the indexing java command).
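For concreteness, a sketch of such an invocation. The classpath, main class, and heap sizes here are assumptions about the local setup; the relevant addition is the GCLocker retry flag:

```shell
# Sketch only: jar path and heap sizes are assumptions; adjust for your build.
# The GCLocker flag raises the number of allocation retries before the
# "Retried waiting for GCLocker too often" failure is reported.
java -Xms64g -Xmx64g \
  -XX:GCLockerRetryAllocationCount=100 \
  -cp target/anserini-fatjar.jar \
  io.anserini.index.IndexCollection \
  ...
```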
We're still having issues with openai-ada2-int8, but #2302 (Lucene 9.9.1) has been merged into master. Will create a new issue focused on debugging this.
Ref: #2314
Ref: #2318
No further follow-up, closing.