tony-x / search-benchmark-game Goto Github PK

This project forked from quickwit-oss/search-benchmark-game

https://github.com/quickwit-oss/search-benchmark-game

Home Page: https://tantivy-search.github.io/bench/

License: MIT License

Shell 0.01% JavaScript 0.15% Python 0.19% Java 0.10% Rust 0.18% Makefile 0.04% HTML 0.04% SCSS 0.01% Jupyter Notebook 99.29%

search-benchmark-game's People

Contributors

Stargazers

Watchers

Forkers

mikemccand slow-j mdmarshmallow singhsuryansh12 macohen mooreniemi

search-benchmark-game's Issues

Confirm that TOP_N hits returned match in both engines

Let's confirm the two engines are returning the same/correct hits before concluding anything about performance?

Does Tantivy also specialize SIMD for aarch64?

Arm (aarch64) cpus are becoming more popular/powerful (e.g. AWS Graviton instances) ... does Tantivy also specialize SIMD for Arm CPUs?

Document how to setup and run the benchmark

The original Tantivy README explained how to setup and run the benchmarks for yourself ... let's do the same for our hybrid luceneutil/search-benchmark-game solution here?

Also test non-relevance sort

Our search benchmarks so far are sorting by BM25.

We should also test a non-relevance sort, e.g. add a numeric (doc values) field and sort by that. We should use the same non-relevance fields as luceneutil, which come from the enwiki corpus already.

Benchmark Lucene with a single level skip list

@Tony-X and I were chatting about the difference in skipping in Tantivy (single level, fast binary search) and Lucene (multi-level, sort of complex implementation) and wondering whether that is causing the AndHighLow performance difference.

We could test this by disabling Lucene's multi-level skip list at indexing time, or, rather, enabling it but configuring the level skip multiplier so high (Integer.MAX_VALUE?) so that it only ever uses a single skip level.

Wrong Lucene version in build.gradle

This looks wrong, the version is outrdated. If the Lucene analyzers module is not used at all, we should remove it: https://github.com/Tony-X/search-benchmark-game/blob/4402d42c906830e85d8d79a30ae776f204ade770/engines/lucene-9.5.0/build.gradle#LL17C4-L17C4

Run more iterations per task

Three iterations only leaves room for substantial noise (GC, hotspot, etc.).

Can we increase to 50 iterations, discard first N (10?) as warmup, and then take fastest time of the remaining 40 tasks.

Let's see if that makes the %tg difference b/w the two engines more consistent, especially for the high hit count cases.

Enable Panama MMAP

In PR #56, we disabled Panama MMAP, by setting enableMemorySegments=false to allow us to test with JDK19+ (we are currently using JDK20 for the search-benchmark-game results)

I had some issues with using JDK19+ with Panama MMAP, see the comments in #47 .

Let's enable this to ensure we are indeed testing the latest & greatest Lucene features!

Open issues in Tantivy for possible cross-fertilization improvements

E.g. maybe Tantivy does not rewrite a query like "A B" to "A" when B has docFreq=0?

Let's gather other possible cross fertilizations here and iteratively propose upstream to Tantivy and Lucene dev communities.

[Brainstorm] Functional Differences Between Tantivy & Lucene

As of 2023-06-20, we believe there are functional differences in these two libraries. Our goal with this issue is to brainstorm differences with the community and eventually move these findings into a wiki page. Our first few questions:

How do Tantivy and Lucene encode postings? What are the differences?
How do Tantivy and Lucene do skipping?
How do Tantivy and Lucene implement BMW (Block-Max Wand)?

What else should we discuss? We can create separate issues as needed for more discussion...

Can we get both count and top_n?

Lucene returns a count (either estimated or exact) when collecting TOP_N hits ... but from the benchmark results UI it looks like Tantivy is always showing 1 hit. Is this just a limitation of the benchmarker or does Tantivy really not report estimated hit count along with TOP_N?

Also test with deletions

Deletions/updates are common in real search applications -- can we compare the two engines with varying %tg of deletions?

Benchmark should report variance of the post-warmup latencies

Spinoff from #25.

Can we add variance of the time to run each task in the final report UI? This is helpful for tasks that do not have predictable execution times and might point to a source of interesting structural noise (GC, hotspot recompilation, or something) in the benchmark.

Measure search engine time, not client time

The Rust benchmark client now measures time on the client side, which includes communication time over the local Unix pipe. Likely the added pipe cost is minuscule, but let's be paranoid and instead measure the precise query execution time on each server?

Use min latency for both engines to eliminate noise

This is especially important for Lucene, since there could be GC

Can we disable BMW for both engines and test performance?

This would let us separate out differences in how both engines use BMW for skipping.

Can we tie-break sort the benchmark results by %tg difference?

It's hard to consume the percentage difference b/w the two engines otherwise.

Try benchmarking with SIMD disabled

One possible approach to see how much of a gain Tantivy is seeing because it can specialize SIMD for postings lists decoding would be to run under a VM that blocks SIMD instructions.

Why do sloppy phrase queries return such different hit counts?

The two engines sometimes produce wildly different results -- how come?

E.g.:

"web url"~4   428 docs (Tantivy)   131,414 docs  (Lucene)

Sort the index consistently on both side

This would enable generating controlled amount of deletions #1

Do we have a percentile latency breakdown?

If we focus too much on average then the stats can be dominated by those long tail queries?

Is the Tantivy index really single segment?

After make index, when I ls -l tantivy-0.19/idx I see this:

-rw-r--r-- 1 mike mike     406685 May 25 08:14 0271c17c6ab54da4bed5143ed9e8732f.fast
-rw-r--r-- 1 mike mike     545586 May 25 08:14 0271c17c6ab54da4bed5143ed9e8732f.fieldnorm
-rw-r--r-- 1 mike mike   84751928 May 25 08:14 0271c17c6ab54da4bed5143ed9e8732f.idx
-rw-r--r-- 1 mike mike   56702658 May 25 08:14 0271c17c6ab54da4bed5143ed9e8732f.pos
-rw-r--r-- 1 mike mike    2195069 May 25 08:14 0271c17c6ab54da4bed5143ed9e8732f.store
-rw-r--r-- 1 mike mike   90916564 May 25 08:14 0271c17c6ab54da4bed5143ed9e8732f.term
-rw-r--r-- 1 mike mike        142 May 25 08:15 0410ac18886c4f45ad35f5548e67b965.fast
-rw-r--r-- 1 mike mike    4253474 May 25 08:13 0410ac18886c4f45ad35f5548e67b965.fieldnorm
-rw-r--r-- 1 mike mike  674168376 May 25 08:15 0410ac18886c4f45ad35f5548e67b965.idx
-rw-r--r-- 1 mike mike  440099234 May 25 08:15 0410ac18886c4f45ad35f5548e67b965.pos
-rw-r--r-- 1 mike mike   17115473 May 25 08:15 0410ac18886c4f45ad35f5548e67b965.store
-rw-r--r-- 1 mike mike  499169215 May 25 08:15 0410ac18886c4f45ad35f5548e67b965.term
-rw-r--r-- 1 mike mike        142 May 25 08:12 0725c009094b43dfb561e2686c685a44.fast
-rw-r--r-- 1 mike mike    3641078 May 25 08:11 0725c009094b43dfb561e2686c685a44.fieldnorm
-rw-r--r-- 1 mike mike  657912184 May 25 08:12 0725c009094b43dfb561e2686c685a44.idx
-rw-r--r-- 1 mike mike  439505796 May 25 08:12 0725c009094b43dfb561e2686c685a44.pos
-rw-r--r-- 1 mike mike   14651197 May 25 08:12 0725c009094b43dfb561e2686c685a44.store
-rw-r--r-- 1 mike mike  406930076 May 25 08:12 0725c009094b43dfb561e2686c685a44.term
-rw-r--r-- 1 mike mike     415249 May 25 08:14 1d1fba8b728c4735886915f719e5a155.fast
-rw-r--r-- 1 mike mike     568487 May 25 08:14 1d1fba8b728c4735886915f719e5a155.fieldnorm
-rw-r--r-- 1 mike mike   85273587 May 25 08:14 1d1fba8b728c4735886915f719e5a155.idx
-rw-r--r-- 1 mike mike   56744442 May 25 08:14 1d1fba8b728c4735886915f719e5a155.pos
-rw-r--r-- 1 mike mike    2287230 May 25 08:14 1d1fba8b728c4735886915f719e5a155.store
-rw-r--r-- 1 mike mike   93781334 May 25 08:14 1d1fba8b728c4735886915f719e5a155.term
-rw-r--r-- 1 mike mike     405070 May 25 08:14 29137e052b644efcb12c359f3225b76e.fast
-rw-r--r-- 1 mike mike     549125 May 25 08:13 29137e052b644efcb12c359f3225b76e.fieldnorm
-rw-r--r-- 1 mike mike   84774858 May 25 08:14 29137e052b644efcb12c359f3225b76e.idx
-rw-r--r-- 1 mike mike   56802967 May 25 08:14 29137e052b644efcb12c359f3225b76e.pos
-rw-r--r-- 1 mike mike    2209308 May 25 08:14 29137e052b644efcb12c359f3225b76e.store
-rw-r--r-- 1 mike mike   90816610 May 25 08:14 29137e052b644efcb12c359f3225b76e.term
-rw-r--r-- 1 mike mike        139 May 25 08:10 34d7f5f7a39b4f18a3462c3d5175d84f.fast
-rw-r--r-- 1 mike mike    3218222 May 25 08:09 34d7f5f7a39b4f18a3462c3d5175d84f.fieldnorm
-rw-r--r-- 1 mike mike  647937541 May 25 08:10 34d7f5f7a39b4f18a3462c3d5175d84f.idx
-rw-r--r-- 1 mike mike  430061929 May 25 08:10 34d7f5f7a39b4f18a3462c3d5175d84f.pos
-rw-r--r-- 1 mike mike   12949633 May 25 08:10 34d7f5f7a39b4f18a3462c3d5175d84f.store
-rw-r--r-- 1 mike mike  433387379 May 25 08:10 34d7f5f7a39b4f18a3462c3d5175d84f.term
-rw-r--r-- 1 mike mike          0 May 25 08:15 499a2a9ab6614af0983119526173eb45.fast
-rw-r--r-- 1 mike mike   29967545 May 25 08:15 499a2a9ab6614af0983119526173eb45.fieldnorm
-rw-r--r-- 1 mike mike 1466223757 May 25 08:18 499a2a9ab6614af0983119526173eb45.idx
-rw-r--r-- 1 mike mike  851720696 May 25 08:18 499a2a9ab6614af0983119526173eb45.pos
-rw-r--r-- 1 mike mike          0 May 25 08:15 499a2a9ab6614af0983119526173eb45.store
-rw-r--r-- 1 mike mike  656914437 May 25 08:18 499a2a9ab6614af0983119526173eb45.term
-rw-r--r-- 1 mike mike        142 May 25 08:11 4e0775b1aee34649aa86966e0da235b8.fast
-rw-r--r-- 1 mike mike    3379702 May 25 08:09 4e0775b1aee34649aa86966e0da235b8.fieldnorm
-rw-r--r-- 1 mike mike  659921718 May 25 08:11 4e0775b1aee34649aa86966e0da235b8.idx
-rw-r--r-- 1 mike mike  433690134 May 25 08:11 4e0775b1aee34649aa86966e0da235b8.pos
-rw-r--r-- 1 mike mike   13599407 May 25 08:11 4e0775b1aee34649aa86966e0da235b8.store
-rw-r--r-- 1 mike mike  433097571 May 25 08:11 4e0775b1aee34649aa86966e0da235b8.term
-rw-r--r-- 1 mike mike        142 May 25 08:14 579dc7ed3d984a4b950f00a3082a0839.fast
-rw-r--r-- 1 mike mike    3990309 May 25 08:12 579dc7ed3d984a4b950f00a3082a0839.fieldnorm
-rw-r--r-- 1 mike mike  665233387 May 25 08:14 579dc7ed3d984a4b950f00a3082a0839.idx
-rw-r--r-- 1 mike mike  438146067 May 25 08:14 579dc7ed3d984a4b950f00a3082a0839.pos
-rw-r--r-- 1 mike mike   16056513 May 25 08:14 579dc7ed3d984a4b950f00a3082a0839.store
-rw-r--r-- 1 mike mike  452116929 May 25 08:14 579dc7ed3d984a4b950f00a3082a0839.term
-rw-r--r-- 1 mike mike     174034 May 25 08:14 5cb72df77c5347649f835f11d91821e8.fast
-rw-r--r-- 1 mike mike     230813 May 25 08:14 5cb72df77c5347649f835f11d91821e8.fieldnorm
-rw-r--r-- 1 mike mike   34797726 May 25 08:14 5cb72df77c5347649f835f11d91821e8.idx
-rw-r--r-- 1 mike mike   23066174 May 25 08:14 5cb72df77c5347649f835f11d91821e8.pos
-rw-r--r-- 1 mike mike     928460 May 25 08:14 5cb72df77c5347649f835f11d91821e8.store
-rw-r--r-- 1 mike mike   42777101 May 25 08:14 5cb72df77c5347649f835f11d91821e8.term
-rw-r--r-- 1 mike mike        141 May 25 08:14 6b3310e73ad2403586ebb81a9f429e74.fast
-rw-r--r-- 1 mike mike    4054947 May 25 08:13 6b3310e73ad2403586ebb81a9f429e74.fieldnorm
-rw-r--r-- 1 mike mike  668074349 May 25 08:14 6b3310e73ad2403586ebb81a9f429e74.idx
-rw-r--r-- 1 mike mike  438172622 May 25 08:14 6b3310e73ad2403586ebb81a9f429e74.pos
-rw-r--r-- 1 mike mike   16316605 May 25 08:14 6b3310e73ad2403586ebb81a9f429e74.store
-rw-r--r-- 1 mike mike  501457399 May 25 08:14 6b3310e73ad2403586ebb81a9f429e74.term
-rw-r--r-- 1 mike mike     139818 May 25 08:14 6ca78d9949d64a50a104e12cccdf9bbf.fast
-rw-r--r-- 1 mike mike     526679 May 25 08:13 6ca78d9949d64a50a104e12cccdf9bbf.fieldnorm
-rw-r--r-- 1 mike mike   84338041 May 25 08:14 6ca78d9949d64a50a104e12cccdf9bbf.idx
-rw-r--r-- 1 mike mike   56404829 May 25 08:14 6ca78d9949d64a50a104e12cccdf9bbf.pos
-rw-r--r-- 1 mike mike    2118998 May 25 08:14 6ca78d9949d64a50a104e12cccdf9bbf.store
-rw-r--r-- 1 mike mike   83369152 May 25 08:14 6ca78d9949d64a50a104e12cccdf9bbf.term
-rw-r--r-- 1 mike mike      86711 May 25 08:14 70990c5f50f5451195717fc9537d887f.fast
-rw-r--r-- 1 mike mike     549140 May 25 08:14 70990c5f50f5451195717fc9537d887f.fieldnorm
-rw-r--r-- 1 mike mike   83799629 May 25 08:14 70990c5f50f5451195717fc9537d887f.idx
-rw-r--r-- 1 mike mike   56017038 May 25 08:14 70990c5f50f5451195717fc9537d887f.pos
-rw-r--r-- 1 mike mike    2209369 May 25 08:14 70990c5f50f5451195717fc9537d887f.store
-rw-r--r-- 1 mike mike   82187808 May 25 08:14 70990c5f50f5451195717fc9537d887f.term
-rw-r--r-- 1 mike mike        141 May 25 08:12 a6c96e578b824fb1ad7893e1614328a4.fast
-rw-r--r-- 1 mike mike    3523909 May 25 08:10 a6c96e578b824fb1ad7893e1614328a4.fieldnorm
-rw-r--r-- 1 mike mike  661635350 May 25 08:12 a6c96e578b824fb1ad7893e1614328a4.idx
-rw-r--r-- 1 mike mike  439209853 May 25 08:12 a6c96e578b824fb1ad7893e1614328a4.pos
-rw-r--r-- 1 mike mike   14179717 May 25 08:12 a6c96e578b824fb1ad7893e1614328a4.store
-rw-r--r-- 1 mike mike  412848549 May 25 08:12 a6c96e578b824fb1ad7893e1614328a4.term
-rw-r--r-- 1 mike mike        142 May 25 08:13 b862983c25d844c89da73ee6dd482ff3.fast
-rw-r--r-- 1 mike mike    3906636 May 25 08:11 b862983c25d844c89da73ee6dd482ff3.fieldnorm
-rw-r--r-- 1 mike mike  664436405 May 25 08:13 b862983c25d844c89da73ee6dd482ff3.idx
-rw-r--r-- 1 mike mike  437492359 May 25 08:13 b862983c25d844c89da73ee6dd482ff3.pos
-rw-r--r-- 1 mike mike   15719818 May 25 08:13 b862983c25d844c89da73ee6dd482ff3.store
-rw-r--r-- 1 mike mike  439975329 May 25 08:13 b862983c25d844c89da73ee6dd482ff3.term
-rw-r--r-- 1 mike mike     137172 May 25 08:14 bd2a947e97434d5f9ca656458c04342d.fast
-rw-r--r-- 1 mike mike     182093 May 25 08:14 bd2a947e97434d5f9ca656458c04342d.fieldnorm
-rw-r--r-- 1 mike mike   27530710 May 25 08:14 bd2a947e97434d5f9ca656458c04342d.idx
-rw-r--r-- 1 mike mike   18342728 May 25 08:14 bd2a947e97434d5f9ca656458c04342d.pos
-rw-r--r-- 1 mike mike     732425 May 25 08:14 bd2a947e97434d5f9ca656458c04342d.store
-rw-r--r-- 1 mike mike   33553082 May 25 08:14 bd2a947e97434d5f9ca656458c04342d.term
-rw-r--r-- 1 mike mike     163869 May 25 08:14 de40404070b14235a3df18ea0ad0e594.fast
-rw-r--r-- 1 mike mike     214096 May 25 08:14 de40404070b14235a3df18ea0ad0e594.fieldnorm
-rw-r--r-- 1 mike mike   32026454 May 25 08:14 de40404070b14235a3df18ea0ad0e594.idx
-rw-r--r-- 1 mike mike   21247245 May 25 08:14 de40404070b14235a3df18ea0ad0e594.pos
-rw-r--r-- 1 mike mike     861202 May 25 08:14 de40404070b14235a3df18ea0ad0e594.store
-rw-r--r-- 1 mike mike   38321689 May 25 08:14 de40404070b14235a3df18ea0ad0e594.term
-rw------- 1 mike mike       2605 May 25 08:15 meta.json

Also the meta.json seems to indicate there are still multiple segments.

Whereas the corresponding Lucene index does have a single segment:

[root@beast3 engines]# ls -lc lucene-9.5.0/idx
total 10927392
-rw-r--r-- 1 mike mike    4166643 May 25 09:03 _11_1.liv
-rw-r--r-- 1 mike mike       1463 May 25 08:55 _11.fdm
-rw-r--r-- 1 mike mike     453924 May 25 08:55 _11.fdt
-rw-r--r-- 1 mike mike      44969 May 25 08:55 _11.fdx
-rw-r--r-- 1 mike mike        249 May 25 09:03 _11.fnm
-rw-r--r-- 1 mike mike   35086021 May 25 09:03 _11.kdd
-rw-r--r-- 1 mike mike     305298 May 25 09:03 _11.kdi
-rw-r--r-- 1 mike mike        153 May 25 09:03 _11.kdm
-rw-r--r-- 1 mike mike 4594864820 May 25 09:03 _11_Lucene90_0.doc
-rw-r--r-- 1 mike mike   66708056 May 25 09:03 _11_Lucene90_0.dvd
-rw-r--r-- 1 mike mike        162 May 25 09:03 _11_Lucene90_0.dvm
-rw-r--r-- 1 mike mike 3443744005 May 25 09:03 _11_Lucene90_0.pos
-rw-r--r-- 1 mike mike 2961401248 May 25 09:03 _11_Lucene90_0.tim
-rw-r--r-- 1 mike mike   49474808 May 25 09:03 _11_Lucene90_0.tip
-rw-r--r-- 1 mike mike        248 May 25 09:03 _11_Lucene90_0.tmd
-rw-r--r-- 1 mike mike   33332679 May 25 08:55 _11.nvd
-rw-r--r-- 1 mike mike        103 May 25 08:55 _11.nvm
-rw-r--r-- 1 mike mike        581 May 25 09:03 _11.si
-rw-r--r-- 1 mike mike        155 May 25 09:03 segments_2
-rw-r--r-- 1 mike mike          0 May 25 08:18 write.lock

Test different heap sizes for Lucene benchy

Spinoff from this discussion: apache/lucene#12358 (comment)

We should fix the heap size for the JVM running Lucene. It can save some cost of the JVM trying to grow/shrink/reallocate, etc.?

I'm starting with 4 GB as a random guess but we should test at what heap size is GC cost minimized.

Sometimes my new results.json is not reflected in the UI?

I'm not sure what I did wrong, but I ran make bench and then make serve yet somehow the results displayed in the UI did not reflect the results in my results.json.

I even removed all results.json from my dev area and the UI continued to serve some old copy.

I think somehow my browser was caching the old results in the local service worker? When I switched to a different browser, suddenly it worked (reflected my new results). Likewise an incognito browser window also worked.

Enable dynamic pruning immediately

This benchmark uses IndexSearcher#search(Query, int) which will only start to dynamically prune after collecting 1,000 hits. This can be fixed by creating the collector manually.

TopScoreDocsCollector collector = TopScoreDocCollector.create(10, 10); // second argument is now 10 instead of 1,000
searcher.search(query, collector);

Run benchmark on Graviton3 instance

I am curious how the performance difference between the two engines differs between a modern aarch64 (e.g. Graviton3) and a modern x86-64 box.

For starters let's just run the "Tantivy vs Lucene" benchmark on a Graviton3 instance?

Separately we can try to find "similar" aarch64 and x86-64 instances and compare performance of each engine across those.

Add highly concurrent indexing mode

We mostly are focusing on search performance so we'd ideally like to build each index at maximum throughput (saturate CPU, IO, etc.) so we can get to searching faster.

Maybe we could add a fast_index target?

Test red-line QPS

Part of Tantivy's high performance over Lucene might come from using low-level concurrency in a single query. E.g. from this blog post:

Search engines generally rely on the OS to schedule their IO and do prefetching. Quickwit schedules its own IO and concurrently downloads the data to maximize its possible throughput. In practice, we do observe a throughput of 1GB/s over the span of a query. This is the kind of bandwidth you could expect from an SSD.

The context was in searching an index directly from S3, but likely this also applies to searching local hot (cached entirely in RAM) indices?

If so, we really need to measure red-line QPS to compare the two engines. Concurrency is net/net a zero sum game (your are consuming CPU resources). Below red-line it can make each query run faster by utilizing otherwise idle CPU resources. But at red-line we would see a fairer comparison.

Benchmark should produce a simple plain text summary

It's nice that we get a results.json with all the juicy details of each benchmark, but, for sharing purposes (e.g. the Graviton results I just attached to #36) it'd be nicer/easier to copy/paste a text summary so that we could see at a glance what the results look like.

We could just make a new tool that parses the JSON and renders to flat text, in addition to the make bench that renders to an interactive Web UI.

"make bench" fails the first time

I'm not sure what happened here, but after a make clean and a make index, I then ran make bench and got this:

[mike@raptorlake search-benchmark-game]$ make bench
--- Benchmarking ---
======================
BENCHMARKING tantivy-0.19 TOP_10_COUNT
/s0/l/search-benchmark-game/engines/tantivy-0.19
--- Warming up ...
- Warm up Run #1 of 1
Traceback (most recent call last):
  File "/s0/l/search-benchmark-game/src/client.py", line 96, in <module>
    for _ in drive(queries_shuffled, search_client, command):
  File "/s0/l/search-benchmark-game/src/client.py", line 50, in drive
    count = client.query(query.query, command)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/s0/l/search-benchmark-game/src/client.py", line 39, in query
    cnt = int(recv)
          ^^^^^^^^^
ValueError: invalid literal for int() with base 10: b"--- Building tantivy's binary ---"
make: *** [Makefile:58: bench] Error 1
[mike@raptorlake search-benchmark-game]$    Compiling tantivy-bench v0.1.0 (/s0/l/search-benchmark-game/engines/tantivy-0.19)
    Finished release [optimized] target(s) in 13.16s
thread 'main' panicked at 'failed printing to stdout: Broken pipe (os error 32)', library/std/src/io/stdio.rs:1019:9
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
make[1]: *** [Makefile:17: serve] Error 101

I think we need to fix client.py to be resilient to some errant make output building things, the first time?

When I ran make bench again, it seems to be succeeding (running now).

`results.json` should include more details about the run like JVM command-line, OS env, etc.

I love that the results.json captures all the query latency details, but we should also record JVM command-line flags, OS / CPU details, etc.

`results.json` should include the category (`COUNT`, `TOP_10_COUNT`) in each query

I pretty-printed the results.json from a full run. It seems each task is executed twice (with N iterations each), yet the times vary quite a bit on each, for both Tantivy and Lucene:

beast3:search-benchmark-game[bind_all]$ grep -A 17 "\"+who +so\"" pretty1.txt
        "query": "+who +so",
        "tags": [
          "AndHighHigh"
        ],
        "count": 257789,
        "duration": [
          31627,
          31647,
          31655,
          31658,
          31661,
          31671,
          31675,
          31716,
          31720,
          31846
        ]
      },
--
        "query": "+who +so",
        "tags": [
          "AndHighHigh"
        ],
        "count": 257789,
        "duration": [
          37542,
          37566,
          37613,
          37614,
          37617,
          37634,
          37641,
          37648,
          37665,
          37708
        ]
      },
--
        "query": "+who +so",
        "tags": [
          "AndHighHigh"
        ],
        "count": 257789,
        "duration": [
          23350,
          23370,
          23399,
          23412,
          23413,
          23421,
          23438,
          23466,
          23511,
          23512
        ]
      },
--
        "query": "+who +so",
        "tags": [
          "AndHighHigh"
        ],
        "count": 257789,
        "duration": [
          25625,
          25647,
          25650,
          25665,
          25668,
          25682,
          25697,
          25709,
          25711,
          26451
        ]
      },

We should test on a multi-segment index?

Right now the benchmark force merges both indices down to a single segment, which is great for drawing comparisions, but is not realistic for apps that need continuous updates.

It's a bit tricky to get the same docs to the same segments in the same order in both engines though...

Bug: if you run a single COMMAND that is not COUNT, you cannot view the results.

Repro: set COMMANDS ?= TOP_10

Result:

If you duplicate and rename TOP_10 in the result.json, you can view a report by changing selection.

Therefore I am guessing that since state.mode of Benchmark defaults to COUNT and we only have 1 key (which is not COUNT) therefore we cannot change the mode onChange?

I can't confirm this since I am unable to rebuild the web UI.
As part of this, we should update the development guide with instructions on how to do so.

Also test with TOP_N=100

Many search applications do some sort of re-ranking of the initial search results ... let's test with a larger TOP_N?

Are the benchmark results repeatable?

We still see that within the same category of queries, e.g. OrHighHigh, there is a wide difference in how much faster Tantivy is than Lucene from a given pair of terms to another.

I'm curious whether those differences are consistent/repeatable? I.e. will the same pair of terms show XXX% gain, within tightish variance, from one run to another?

Let's also compare indexing throughput?

The benchmark today seems to focus on search performance, which is typically the more important of the two.

But for some applications the situation is reversed: maximizing indexing throughput on limited hardware is more important.

Can we remove the `results.json` from git?

I know it's nice to be able to checkout and immediately bring up the UI for browsing results...

But it's a bit misleading, since this did not run on your machine.

Also, git diff is massive if you've run benchmarks locally.

Finally, it is checked in in three different places (results.json, web/public/results.json, web/build/results.json), and it's confusing knowing which one "matters".

Will the JVM's hotspot compiler sometimes compile to branchless (`CMOVcc`) CPU instructions?

One of the cool optimizations Tantivy (and many other places, apparently) use is branchless algorithms like binary search. Another blog post about branchless binary search.

It compiles down to CPU instructions called "conditional move", which will move data from one place to another depending on a condition, versus a more general if statement that changes which instructions the CPU executes. CMOVcc does not hide memory latency since there is still a data path dependency on both the condition and the source to move, but it does avoid costly branch mis-predictions, which for algos like binary search, or data structures like priority queues, can be frequent.

This got me wondering: would such implementations in Java also compile down to CMOVcc instructions?

I tried googling and didn't find any clear answers. There are some indications, like this JDK bug, but that is moving in the opposite direction (NOT compiling down to CMOVcc).

Is there somehow a chance for branchless algorithms and data structures to work well in Java too?

Too much variance in perf difference for the same class of queries

Tantivy is substantially faster than Lucene, e.g. for OrHighHigh (disjunction of two high freq terms) queries.

But I'm baffled why the percent difference between the two engines is so wildly different even for queries matching 1M+ hits?

E.g.:

people nbsp | 4,004 μs1,167,748 docs | 8,867 μs +121.5 % 1,167,749 docs

and:

cite 2009 | 4,829 μs1,381,248 docs | 22,398 μs +363.8 % 1,381,266 docs

If Tantivy has a better postings decode (e.g. dedicated SIMD) it should be fairly uniformly faster for queries whose cost is dominated by postings decode.

Test on multi-segment indices?

Tantivy may be using concurrency for a single query even on a single segment.

Maybe, to make the query latency comparisons more fair, we should create a multi-segment index, and use Lucene's concurrent (across segment) search? This also better matches real-world usage for applications that need to continuously update their indices.

Upgrade & re-test latest Tantivy and Lucene

There have been a number of search-time optimizations inspired by Tantivy folded into Lucene lately. And I'm sure Tantivy is making improvements too! Let's re-upgrade to the latest & greatest and re-test?

Try turning off patching in Lucene's PFOR encoding

One difference between Lucene and Tantivy is Lucene uses the "patch" FOR, meaning the large values in a block are held out as exceptions so that the remaining values can use a smaller number of bits to encode, a tradeoff of CPU for lower storage space.

Let's try temporarily disabling the patching in Lucene, to match how Tantivy encodes, to see how much of a performance difference that is costing us?

Upgrade from JDK17 to JDK19+

As part of having all the the components updated to the newest versions, we should do the same for JDK.

Currently only JDK17 is supported. To upgrade to JDK 19 we need to enable Panama API and build MemorySegmentIndexInputProvider into the JAR.

This is the current error message when running make index with JDK19

--- Indexing Lucene 9.5.0 with %2 deletes ---
java -server -cp build/libs/search-index-benchmark-game-lucene-1.0-SNAPSHOT-all.jar BuildIndex idx 2 < /local/home/jslowins/unison/tantivy-bench/search-benchmark-game/corpus/enwiki-20120502-lines-1k-fixed-utf8-with-random-label.txt
Exception in thread "main" java.lang.LinkageError: MemorySegmentIndexInputProvider is missing in Lucene JAR file
        at org.apache.lucene.store.MMapDirectory.lookupProvider(MMapDirectory.java:437)
        at java.base/java.security.AccessController.doPrivileged(AccessController.java:318)
        at org.apache.lucene.store.MMapDirectory.doPrivileged(MMapDirectory.java:395)
        at org.apache.lucene.store.MMapDirectory.<clinit>(MMapDirectory.java:448)
        at org.apache.lucene.store.FSDirectory.open(FSDirectory.java:161)
        at org.apache.lucene.store.FSDirectory.open(FSDirectory.java:156)
        at BuildIndex.main(BuildIndex.java:33)
Caused by: java.lang.ClassNotFoundException: org.apache.lucene.store.MemorySegmentIndexInputProvider
        at java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:641)
        at java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:188)
        at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:521)
        at java.base/java.lang.Class.forName0(Native Method)
        at java.base/java.lang.Class.forName(Class.java:495)
        at java.base/java.lang.Class.forName(Class.java:474)
        at java.base/java.lang.invoke.MethodHandles$Lookup.findClass(MethodHandles.java:2786)
        at org.apache.lucene.store.MMapDirectory.lookupProvider(MMapDirectory.java:422)
        ... 6 more

Add indexing time to the benchmarks report

Test with the throughput GC, not G1GC (the default)

@uschindler pointed out on an upstream Lucene issue that G1GC may not be "best" for measuring Lucene's single-threaded performance, and we should use the throughput collector instead. This collector will stop-the-world (causing any running queries to take crazy long, but they would be discarded as outliers by the benchy) in exchange for minimal net CPU cost spent on GC.

It's simple to enable the throughput collector (-XX:+UseParallelGC I think). We should test if this makes any difference in the benchmark.

HOWEVER, it's hard to know if this is the "right thing to do". Yes, on the one hand, if we are trying to purely measure CPU cost of Lucene's querying, G1GC is a bit unfair since it adds some continuous cost due to memory guards on reads, though it seems likely most of those reads won't be blocked on G1GC having to suddenly move an object. But on the other hand, in a production context for searching, most applications will not use the throughput collector but rather G1GC or maybe CMS, and so G1GC would reflect the "real world" cost of doing business in javaland.

Still we should run this test to at least see the impact.

We should also test doc values

Doc values are an important part of most search applications, and we're not testing them today.