Comments (3)
OK, so after 14 days of intensive reflection (lol), I found a way to reduce the time spent in the query_all function: remove the HashMap<DocumentId, Vec<Match>> and replace it with a Vec<(DocumentId, Match)> that is sorted in parallel using rayon.
The sorted vec is finally aggregated into 7 vecs, each holding a single data type (e.g. all distances, all exact flags), following the data-oriented design already developed for the engine.
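To make the idea concrete, here is a minimal sketch of the sort-then-split step. It is not the engine's actual code: the `Match` fields, the `Matches` columns (only three are shown, not seven), `RawDocument`, and `group_matches` are hypothetical names chosen for illustration. The sketch uses the standard library's sequential sort so that it compiles without dependencies; with rayon, the same call would be `par_sort_by_key` from `rayon::slice::ParallelSliceMut`.

```rust
/// Hypothetical stand-ins for the engine's types; the real definitions may differ.
type DocumentId = u64;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Match {
    query_index: u32,
    distance: u8,
    is_exact: bool,
}

/// Column-oriented storage: one Vec per field instead of one Vec of structs.
#[derive(Debug, Default)]
struct Matches {
    query_index: Vec<u32>,
    distance: Vec<u8>,
    is_exact: Vec<bool>,
}

/// One entry per document: its id and the range of its matches in the columns.
#[derive(Debug, PartialEq, Eq)]
struct RawDocument {
    id: DocumentId,
    range: std::ops::Range<usize>,
}

fn group_matches(mut pairs: Vec<(DocumentId, Match)>) -> (Vec<RawDocument>, Matches) {
    // Sorting by id makes every document's matches contiguous.
    // Assumption: with rayon this would be `pairs.par_sort_by_key(|(id, _)| *id)`.
    pairs.sort_by_key(|(id, _)| *id);

    let mut columns = Matches::default();
    let mut documents: Vec<RawDocument> = Vec::new();

    for (index, (id, m)) in pairs.into_iter().enumerate() {
        // Scatter each field of the match into its own column.
        columns.query_index.push(m.query_index);
        columns.distance.push(m.distance);
        columns.is_exact.push(m.is_exact);

        // Extend the current document's range, or start a new document.
        match documents.last_mut() {
            Some(doc) if doc.id == id => doc.range.end = index + 1,
            _ => documents.push(RawDocument { id, range: index..index + 1 }),
        }
    }

    (documents, columns)
}
```

Each criterion can then read only the column it needs (for example `columns.distance[doc.range.clone()]`), which is the point of keeping values of the same type next to each other.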
Here are the before/after performance logs of the search engine when searching for "s" with the default-fields.toml schema. The first log is the run before the change, the second is the run after it.
Searching for: s
97360 total documents to classify
677358 total matches to classify
query_all took 88.97 ms
criterion SumOfTypos, documents group of size 97360
criterion SumOfTypos sort took 5.57 ms
criterion NumberOfWords, documents group of size 97360
criterion NumberOfWords sort took 6.42 ms
criterion WordsProximity, documents group of size 97360
criterion WordsProximity sort took 7.53 ms
criterion SumOfWordsAttribute, documents group of size 97360
criterion SumOfWordsAttribute sort took 25.56 ms
criterion SumOfWordsPosition, documents group of size 50898
criterion SumOfWordsPosition sort took 3.23 ms
criterion Exact, documents group of size 50898
criterion Exact sort took 4.70 ms
criterion DocumentId, documents group of size 50898
criterion DocumentId sort took 6.85 ms
Found 4 results in 152.88 ms
Searching for: s
97360 total documents to classify
677358 total matches to classify
query_all took 32.94 ms
criterion SumOfTypos, documents group of size 97360
criterion SumOfTypos sort took 3.56 ms
criterion NumberOfWords, documents group of size 97360
criterion NumberOfWords sort took 3.06 ms
criterion WordsProximity, documents group of size 97360
criterion WordsProximity sort took 4.35 ms
criterion SumOfWordsAttribute, documents group of size 97360
criterion SumOfWordsAttribute sort took 9.23 ms
criterion SumOfWordsPosition, documents group of size 50898
criterion SumOfWordsPosition sort took 1.75 ms
criterion Exact, documents group of size 50898
criterion Exact sort took 2.35 ms
criterion DocumentId, documents group of size 50898
criterion DocumentId sort took 3.64 ms
Found 4 results in 61.06 ms
It seems to be a success: a 2.50x improvement overall (152.88 ms down to 61.06 ms). Note that we now use multithreading. The rayon library is nicely designed and uses a pool of threads, but this could have an impact on the number of concurrent HTTP requests.
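For reference, the speedup figures can be recomputed from the two logs. This is only a small arithmetic sketch; the `speedup` and `report` helpers are hypothetical names, and the numbers are copied from the "Found 4 results" and "query_all took" lines above.

```rust
/// Ratio of an old timing to a new one; values above 1.0 mean the new run is faster.
fn speedup(before_ms: f64, after_ms: f64) -> f64 {
    before_ms / after_ms
}

/// Formats the speedups for the whole search and for `query_all` alone,
/// using the timings from the before/after logs.
fn report() -> String {
    let total = speedup(152.88, 61.06);
    let query_all = speedup(88.97, 32.94);
    format!("total: {:.2}x, query_all: {:.2}x", total, query_all)
}
```

Calling `report()` gives `total: 2.50x, query_all: 2.70x`, so query_all itself improved slightly more than the search as a whole.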
I need to port the criterion tests from the old version to the new one.
from meilisearch.
Working on a simple solution leads to good timings (branch data-oriented).
97360 total documents to classify
626460 total matches to classify
query_all took 106.18 ms
criterion SumOfTypos, documents group of size 97360
criterion SumOfTypos sort took 4.36 ms
criterion NumberOfWords, documents group of size 97360
criterion NumberOfWords sort took 3.51 ms
criterion WordsProximity, documents group of size 97360
criterion WordsProximity sort took 1.76 ms
criterion SumOfWordsAttribute, documents group of size 97360
criterion SumOfWordsAttribute sort took 10.47 ms
criterion SumOfWordsPosition, documents group of size 33657
criterion SumOfWordsPosition sort took 5.97 ms
criterion Exact, documents group of size 16708
criterion Exact sort took 882.94 μs
criterion DocumentId, documents group of size 16708
criterion DocumentId sort took 1.39 ms
After many hours of reflection, I did not find a way to fix the significant overhead of query_all.