Comments (18)
Testing it out.
from elastiknn.
A little of the output from Elasticsearch:
Caused by: org.elasticsearch.ElasticsearchException$1: Couldn't advance to binary doc values for doc with id [404521]
at org.elasticsearch.ElasticsearchException.guessRootCauses(ElasticsearchException.java:644) ~[elasticsearch-7.9.2.jar:7.9.2]
at org.elasticsearch.action.search.AbstractSearchAsyncAction.executeNextPhase(AbstractSearchAsyncAction.java:307) [elasticsearch-7.9.2.jar:7.9.2]
... 21 more
Caused by: java.lang.RuntimeException: Couldn't advance to binary doc values for doc with id [404521]
at com.klibisz.elastiknn.query.ExactQuery$StoredVecReader.apply(ExactQuery.scala:53) ~[?:?]
at com.klibisz.elastiknn.query.HashingQuery$.$anonfun$apply$2(HashingQuery.scala:23) ~[?:?]
at org.apache.lucene.search.MatchHashesAndScoreQuery$1$2.score(MatchHashesAndScoreQuery.java:168) ~[?:?]
at org.apache.lucene.search.ConjunctionScorer.score(ConjunctionScorer.java:59) ~[lucene-core-8.6.2.jar:8.6.2 016993b65e393b58246d54e8ddda9f56a453eb0e - ivera - 2020-08-26 10:53:36]
at org.apache.lucene.search.TopScoreDocCollector$SimpleTopScoreDocCollector$1.collect(TopScoreDocCollector.java:76) ~[lucene-core-8.6.2.jar:8.6.2 016993b65e393b58246d54e8ddda9f56a453eb0e - ivera - 2020-08-26 10:53:36]
at org.apache.lucene.search.Weight$DefaultBulkScorer.scoreRange(Weight.java:242) ~[lucene-core-8.6.2.jar:8.6.2 016993b65e393b58246d54e8ddda9f56a453eb0e - ivera - 2020-08-26 10:53:36]
at org.apache.lucene.search.Weight$DefaultBulkScorer.score(Weight.java:229) ~[lucene-core-8.6.2.jar:8.6.2 016993b65e393b58246d54e8ddda9f56a453eb0e - ivera - 2020-08-26 10:53:36]
at org.elasticsearch.search.internal.CancellableBulkScorer.score(CancellableBulkScorer.java:56) ~[elasticsearch-7.9.2.jar:7.9.2]
at org.apache.lucene.search.BulkScorer.score(BulkScorer.java:39) ~[lucene-core-8.6.2.jar:8.6.2 016993b65e393b58246d54e8ddda9f56a453eb0e - ivera - 2020-08-26 10:53:36]
from elastiknn.
That's interesting. I can't think of any obvious reason why that part of the code would have changed for the 7.9.2 build.
Can you tell me more about your index setup? At least: how many shards, how many segments, how many total docs?
from elastiknn.
One shard, 500,000 documents. This is just an initial test on my work laptop. I don't know what a segment is.
Maybe I can try to replicate this with a one document index.
from elastiknn.
Unfortunately, I couldn't replicate this with a simple one-field, one-document index.
25 segments in the original index:
index shard prirep ip segment generation docs.count docs.deleted size size.memory committed searchable version compound
discovery_events_v2 0 p 127.0.0.1 _1lp 2077 388877 150231 1012.3mb 20670 true true 8.6.2 false
discovery_events_v2 0 p 127.0.0.1 _2g5 3173 1359771 984921 6.2gb 42664 true true 8.6.2 false
discovery_events_v2 0 p 127.0.0.1 _2nt 3449 515047 0 693.1mb 17876 true true 8.6.2 true
discovery_events_v2 0 p 127.0.0.1 _3h1 4501 281414 0 839mb 17268 true true 8.6.2 false
discovery_events_v2 0 p 127.0.0.1 _43t 5321 199234 0 512.5mb 16132 true true 8.6.2 true
discovery_events_v2 0 p 127.0.0.1 _4i9 5841 49849 0 128.4mb 13172 true true 8.6.2 true
discovery_events_v2 0 p 127.0.0.1 _533 6591 46272 0 99.1mb 13524 true true 8.6.2 true
discovery_events_v2 0 p 127.0.0.1 _565 6701 183862 0 525.4mb 16564 true true 8.6.2 true
discovery_events_v2 0 p 127.0.0.1 _5a1 6841 25803 0 52.2mb 12852 true true 8.6.2 true
discovery_events_v2 0 p 127.0.0.1 _5dd 6961 22172 0 55.1mb 12964 true true 8.6.2 true
discovery_events_v2 0 p 127.0.0.1 _5fb 7031 60066 0 107.1mb 13220 true true 8.6.2 true
discovery_events_v2 0 p 127.0.0.1 _5gp 7081 20324 0 41mb 21228 true true 8.6.2 true
discovery_events_v2 0 p 127.0.0.1 _5ix 7161 33735 0 78.1mb 12516 true true 8.6.2 true
discovery_events_v2 0 p 127.0.0.1 _5l5 7241 25158 0 47.1mb 14556 true true 8.6.2 true
discovery_events_v2 0 p 127.0.0.1 _5mj 7291 97666 0 157.2mb 13204 true true 8.6.2 true
discovery_events_v2 0 p 127.0.0.1 _5n3 7311 17518 0 48.5mb 16164 true true 8.6.2 true
discovery_events_v2 0 p 127.0.0.1 _5pb 7391 24012 0 50mb 13268 true true 8.6.2 true
discovery_events_v2 0 p 127.0.0.1 _5w9 7641 4210 0 19.2mb 14908 true true 8.6.2 true
discovery_events_v2 0 p 127.0.0.1 _5xd 7681 59 0 324kb 15772 true true 8.6.2 true
discovery_events_v2 0 p 127.0.0.1 _5xn 7691 44 0 325.9kb 14052 true true 8.6.2 true
discovery_events_v2 0 p 127.0.0.1 _5y7 7711 46 0 381.5kb 15244 true true 8.6.2 true
discovery_events_v2 0 p 127.0.0.1 _5yh 7721 74 0 434.8kb 17004 true true 8.6.2 true
discovery_events_v2 0 p 127.0.0.1 _5ys 7732 92 0 511.9kb 17748 true true 8.6.2 true
discovery_events_v2 0 p 127.0.0.1 _5z1 7741 93 0 491.3kb 17076 true true 8.6.2 true
discovery_events_v2 0 p 127.0.0.1 _5z2 7742 2 0 48.1kb 10100 true true 8.6.2 true
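For anyone following along, a listing like the one above can be summarized with a few lines of Python. This is only a sketch: it assumes the text is the header-prefixed output of `_cat/segments?v`, the helper names are made up for illustration, and the sample contains just two rows copied from the listing.

```python
from typing import Dict, List


def parse_cat_segments(text: str) -> List[Dict[str, str]]:
    """Parse whitespace-separated `_cat/segments?v` output into one dict per segment."""
    lines = [line for line in text.splitlines() if line.strip()]
    header = lines[0].split()
    return [dict(zip(header, line.split())) for line in lines[1:]]


def summarize_segments(rows: List[Dict[str, str]]) -> Dict[str, int]:
    """Count segments and total live and deleted docs across them."""
    return {
        "segments": len(rows),
        "docs": sum(int(r["docs.count"]) for r in rows),
        "deleted": sum(int(r["docs.deleted"]) for r in rows),
    }


# Two rows copied from the listing above, for illustration only.
SAMPLE = """\
index shard prirep ip segment generation docs.count docs.deleted size size.memory committed searchable version compound
discovery_events_v2 0 p 127.0.0.1 _1lp 2077 388877 150231 1012.3mb 20670 true true 8.6.2 false
discovery_events_v2 0 p 127.0.0.1 _5z2 7742 2 0 48.1kb 10100 true true 8.6.2 true
"""

summary = summarize_segments(parse_cat_segments(SAMPLE))
print(summary)
```

Running the same functions over the full listing would give the segment count plus the live and deleted doc totals in one place.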
from elastiknn.
Thanks for the debugging info.
To clarify: segments are to a shard what shards are to an index. That is, an index has one or more shards, and each shard has one or more Lucene segments.
One thing you might try next: merge that index into one segment, then re-run the query. See https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-forcemerge.html
That's not a good long-term solution, but it would tell us whether the bug is due to having docs in multiple segments.
I appreciate your patience in helping smooth out these issues :) There's quite a long tail of use cases and configurations with Elasticsearch.
from elastiknn.
One other data point is that if I reindex from the original index (on which I had the exception) to a new index, then the exception goes away.
from elastiknn.
_forcemerge doesn't seem to do anything. Returns immediately and the number of segments appears unchanged.
from elastiknn.
> One other data point is that if I reindex from the original index (on which I had the exception) to a new index, then the exception goes away.
Hmm, ok, that makes it sound like there is some issue with how Elastiknn is using doc IDs when there is more than one segment.
> _forcemerge doesn't seem to do anything. Returns immediately and the number of segments appears unchanged.
Perhaps try with: _forcemerge?max_num_segments=1
Also, I'm not sure if you saw on the 7.9.2 PR, but I made a comment saying the performance is pretty abysmal due to the additional compression that the Lucene project introduced. You might be better off playing around with the latest 7.6.2 version. I'll probably have to add some sort of intermediate caching layer until Lucene gives a way to disable/configure that compression.
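A minimal sketch of the `_forcemerge?max_num_segments=1` call suggested above, using only the Python standard library. The base URL `http://localhost:9200` and the helper names are assumptions for illustration; the index name is taken from the segment listing earlier in the thread.

```python
import urllib.request
from urllib.parse import quote, urlencode


def build_forcemerge_request(
    base_url: str, index: str, max_num_segments: int = 1
) -> urllib.request.Request:
    """Build (but do not send) a POST to <index>/_forcemerge?max_num_segments=N."""
    query = urlencode({"max_num_segments": max_num_segments})
    url = f"{base_url.rstrip('/')}/{quote(index)}/_forcemerge?{query}"
    return urllib.request.Request(url, method="POST")


def send(request: urllib.request.Request) -> str:
    """Send the request and return the raw response body."""
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")


# Assumed local single-node cluster; adjust the base URL for your setup.
req = build_forcemerge_request("http://localhost:9200", "discovery_events_v2")
print(req.get_method(), req.full_url)
```

Calling `send(req)` against a running cluster would trigger the merge; re-listing the segments afterwards shows whether the count actually dropped to one.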
from elastiknn.
Hmm, was just able to reproduce the problem. To give more context on what I am doing, I first use "reindex" to create a copy of an existing index. The elastiknn_dense_float_vector field does not exist in the source index. Then I run a script that makes a bunch of calls to the bulk() API with "update" actions in order to store values (vectors) in the elastiknn_dense_float_vector field.
So, I don't know, maybe some problem with updates.
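To make the update step concrete, here is a rough sketch of the newline-delimited body that a bulk request with "update" actions takes. The index name `copy_index`, the field name `vec`, and the helper name are hypothetical, and the `{"values": [...]}` document shape for the vector is an assumption about how Elastiknn expects dense float vectors, so check it against the plugin version in use.

```python
import json
from typing import Dict, List


def bulk_update_body(index: str, field: str, vectors: Dict[str, List[float]]) -> str:
    """Build an NDJSON _bulk body with one partial-document update per id."""
    lines = []
    for doc_id, values in vectors.items():
        # Action line: which document to update.
        lines.append(json.dumps({"update": {"_index": index, "_id": doc_id}}))
        # Payload line: partial doc that adds the vector field (assumed shape).
        lines.append(json.dumps({"doc": {field: {"values": values}}}))
    # The bulk API requires every line, including the last, to end with a newline.
    return "".join(line + "\n" for line in lines)


body = bulk_update_body("copy_index", "vec", {"1": [0.1, 0.2], "2": [0.3, 0.4]})
print(body, end="")
```

The resulting string would be sent as the body of a POST to `_bulk` with the `Content-Type: application/x-ndjson` header.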
from elastiknn.
Gotcha. At what point do you update the mapping to include the vector field? If I know that, I should be able to try a repro.
from elastiknn.
Before the initial reindex, I create the index with the elastiknn_dense_float_vector field defined in the initial mapping.
In my most recent replication, I saw the problem after updating several thousand documents.
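For a repro, the index-creation body might look roughly like the sketch below. The field name `vec`, the dimension count, and the helper name are hypothetical, and the `"elastiknn": {"dims": ...}` layout is an assumption based on the plugin's documented mapping format, which may differ between versions. Model-specific settings such as the LSH parameters are left out.

```python
import json


def vector_index_body(field: str, dims: int) -> dict:
    """Build a create-index body whose mapping declares one dense float vector field."""
    return {
        "mappings": {
            "properties": {
                field: {
                    "type": "elastiknn_dense_float_vector",
                    # Assumed layout; model settings (e.g. LSH) would also go here.
                    "elastiknn": {"dims": dims},
                }
            }
        }
    }


# Hypothetical field name and dimensionality.
print(json.dumps(vector_index_body("vec", 128), indent=2))
```

This body would be sent with a PUT to the new index name before running the reindex and the bulk updates.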
from elastiknn.
All good info. I should have time to look into this tonight.
from elastiknn.
@ejackson-eb When you get a chance, can you try it with the plugin release here: https://github.com/alexklibisz/elastiknn/releases/tag/0.1.0-PRE43-PR184-SNAPSHOT
from elastiknn.
That plug-in is for 7.6.2, but I have only seen this problem with the plug-in you built for 7.9.2.
from elastiknn.
I see. After I merge this change into master I'll get it out to the 7.9.2 branch as well and it'll publish a snapshot.
from elastiknn.
All of the fixes I've made over the past couple of weeks, as well as the release for 7.9.2, are out now. Check out the latest version and let me know if you're still having issues with this.
from elastiknn.
This seems to be resolved based on some email discussion.
from elastiknn.