Comments (14)
@HantaoCai please send it to me by email: [email protected]; if you could also provide the Milvus logs, that would be perfect.
/assign @HantaoCai
The document has been sent, please check your inbox.
@zhuwenxing is trying to reproduce the issue with your data
Yes, it can be reproduced.
Reproduction script:
```python
from pymilvus import connections, Collection

connections.connect()
c = Collection(name="gemini_library_v5_bak")
print(c.describe())

# first query: remember the primary keys and the tags value of each entity
map = {}
res = c.query(expr="file_id == 3058", output_fields=["*"])
pk_list = [r['index_id'] for r in res]
for i in range(1):
    res = c.query(expr="file_id == 3058", output_fields=["*"])
    print(len(res))
    assert set(pk_list) == set([r['index_id'] for r in res])
    for r in res:
        if r['index_id'] not in map:
            map[r['index_id']] = r['tags']
        else:
            map[r['index_id']] += r['tags']

# repeat the query and compare each result against the first one
print("first time query, then compare following time query result with first time query result")
for i in range(10):
    diff_cnt = 0
    tmp = {}
    res = c.query(expr="file_id == 3058", output_fields=["*"])
    print(len(res))
    assert set(pk_list) == set([r['index_id'] for r in res])
    for r in res:
        tmp[r['index_id']] = r['tags']
        if map[r['index_id']] != r['tags']:
            diff_cnt += 1
            # print(f"diff found: {r['index_id']}, {map[r['index_id']]} != {r['tags']}")
    print(f"in compare time {i}, find diff count: {diff_cnt}")
```
Result: for the same index_id, the value of tags is sometimes ["1"] and sometimes [].
```text
diff found: ff386894910459ec6cc058b0e395015f, ['1'] != []
diff found: ff42158f2e5623c5fe07d9cff4cd9965, ['1'] != []
diff found: ff68acd76bfbdfc189e6d87631810a40, ['1'] != []
diff found: ff84615dbc0a42a8509dbc11b0b7fac5, ['1'] != []
diff found: ff8c617c8c4633a5598968f661f7343c, ['1'] != []
diff found: ffd0810e0b8993435e8e8792204d1a65, ['1'] != []
diff found: ffd1b8e0759fd12991a255f4d2f522fb, ['1'] != []
diff found: ffe59056e7a3235628c80d2150b13dbd, ['1'] != []
diff found: fff08735118cc834861d4b9a892cbb6b, ['1'] != []
diff found: fff89ffe055b52cd804a21fa0e11475e, ['1'] != []
diff found: fff9e71c0e45270462cee2d3fe80c6d5, ['1'] != []
```
```text
{'collection_name': 'gemini_library_v5_bak', 'auto_id': False, 'num_shards': 1, 'description': 'gemini矢量表', 'fields': [{'field_id': 100, 'name': 'index_id', 'description': '', 'type': <DataType.VARCHAR: 21>, 'params': {'max_length': 110}, 'is_primary': True}, {'field_id': 101, 'name': 'vector', 'description': '', 'type': <DataType.FLOAT_VECTOR: 101>, 'params': {'dim': 1536}}, {'field_id': 102, 'name': 'partition_id', 'description': '', 'type': <DataType.INT64: 5>, 'params': {}}, {'field_id': 103, 'name': 'file_id', 'description': '', 'type': <DataType.INT64: 5>, 'params': {}}, {'field_id': 104, 'name': 'chunk_id', 'description': '', 'type': <DataType.INT64: 5>, 'params': {}}, {'field_id': 105, 'name': 'tags', 'description': '', 'type': <DataType.ARRAY: 22>, 'params': {'max_length': 200, 'max_capacity': 1024}, 'element_type': <DataType.VARCHAR: 21>}], 'aliases': [], 'collection_id': 450111991800541421, 'consistency_level': 2, 'properties': {}, 'num_partitions': 65, 'enable_dynamic_field': True}
2414
first time query, then compare following time query result with first time query result
2414
in compare time 0, find diff count: 2414
2414
in compare time 1, find diff count: 0
2414
in compare time 2, find diff count: 0
2414
in compare time 3, find diff count: 0
2414
in compare time 4, find diff count: 0
2414
in compare time 5, find diff count: 0
2414
in compare time 6, find diff count: 0
2414
in compare time 7, find diff count: 0
2414
in compare time 8, find diff count: 0
2414
in compare time 9, find diff count: 2414
```
/assign @longjiquan
Added a step to check the count for each PK and found that each PK has two entities.
So the same PK must have been inserted twice, with different data each time.
```python
# for each primary key seen in the first query, count how many entities share it
for k, v in map.items():
    res = c.query(expr=f"index_id == '{k}'", output_fields=["count(*)"])
    print(f"{k} {res}")
```
```text
ffd1b8e0759fd12991a255f4d2f522fb data: ["{'count(*)': 2}"] ..., extra_info: {'cost': 0}
ffe59056e7a3235628c80d2150b13dbd data: ["{'count(*)': 2}"] ..., extra_info: {'cost': 0}
fff08735118cc834861d4b9a892cbb6b data: ["{'count(*)': 2}"] ..., extra_info: {'cost': 0}
fff89ffe055b52cd804a21fa0e11475e data: ["{'count(*)': 2}"] ..., extra_info: {'cost': 0}
fff9e71c0e45270462cee2d3fe80c6d5 data: ["{'count(*)': 2}"] ..., extra_info: {'cost': 0}
```
@HantaoCai this proves that the data you inserted contains duplicate primary keys, which is why the query results change. Please try to de-duplicate the data first (a rough sketch of one possible approach follows below).
/assign @HantaoCai
/unassign @zhuwenxing @longjiquan
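A minimal sketch of what such a cleanup pass could look like, assuming the duplicated primary keys have already been collected (for example from the count(*) check above) and that a non-empty tags array marks the copy worth keeping. The example PK, the selection rule, and the row-based re-insert are illustrative assumptions, not steps confirmed in this thread:

```python
from pymilvus import connections, Collection

connections.connect()
c = Collection(name="gemini_library_v5_bak")

# hypothetical list of primary keys that the count(*) check reported as duplicated
duplicated_pks = ["ffd1b8e0759fd12991a255f4d2f522fb"]

for pk in duplicated_pks:
    # read every copy of this PK; depending on the server version, the vector field
    # may need to be listed explicitly in output_fields so the row can be re-inserted whole
    rows = c.query(expr=f'index_id == "{pk}"', output_fields=["*"])
    # assumption: the copy with a non-empty tags array is the one to keep
    keep = next((r for r in rows if r.get("tags")), rows[0])

    # delete removes *all* copies sharing this PK, then re-insert the single survivor
    c.delete(expr=f'index_id in ["{pk}"]')
    c.insert([keep])

c.flush()
```

This is only a sketch of the delete-then-reinsert idea; whether it is safe to run against a live collection depends on how the application writes to it.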
I previously asked why multiple records with the same primary key could be inserted, and the answer I received was that the data retrieved would be the newest record with that primary key. The behavior I am seeing now is different from that. Should this be considered a bug?
We are long-time Milvus users, and version 2.2 did not have an upsert feature, so our historical code may still contain insert calls. In addition, we expect primary keys to be unique, as in traditional databases. From a user's perspective, plain insert is therefore not very meaningful, and we believe upsert should be favored instead.
> the response I received was that the data retrieved would be the newest record with that primary key.

Yes, that is the expected behavior.
We are investigating whether restoring or importing data that contains identical primary keys could assign them identical timestamps, which would lead to the issue you are seeing.
@HantaoCai
I don't think there is a clear way to cover every duplicate PK; under filtering and search this is not doable.
Even without upsert, you can delete the old data and then insert the new data to avoid duplication; a minimal sketch of that pattern follows below.
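A hedged sketch of that "manual upsert" write path, i.e. delete any existing copy of the primary key before inserting the fresh row. The collection name is taken from the thread; the row values and the row-based insert are placeholders/assumptions:

```python
from pymilvus import connections, Collection

connections.connect()
c = Collection(name="gemini_library_v5_bak")

# placeholder row; field names follow the schema shown earlier, values are made up
new_row = {
    "index_id": "some-primary-key",
    "vector": [0.0] * 1536,
    "partition_id": 0,
    "file_id": 3058,
    "chunk_id": 0,
    "tags": ["1"],
}

# delete any existing entity with this PK first, then insert the new row,
# so the collection never ends up holding two entities with the same primary key
pk = new_row["index_id"]
c.delete(expr=f'index_id in ["{pk}"]')
c.insert([new_row])
```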
We will clean up our data and update all of our historical code to use the upsert method (see the sketch below).
We would like queries either to return every record that shares a PK, or to guarantee that only the latest record for a given PK is returned. Either behavior would help us quickly identify or prevent these issues. However, we also recognize that returning only the latest record for a PK may not always be the best approach.
In any case, thank you for your assistance in investigating the root cause of the issue.
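For reference, a minimal sketch of the upsert call mentioned above, assuming pymilvus / Milvus 2.3 or later where Collection.upsert is available; the row values are placeholders:

```python
from pymilvus import connections, Collection

connections.connect()
c = Collection(name="gemini_library_v5_bak")

# upsert replaces the existing entity with the same primary key instead of
# adding a second copy (placeholder values; dim 1536 matches the schema above)
row = {
    "index_id": "some-primary-key",
    "vector": [0.0] * 1536,
    "partition_id": 0,
    "file_id": 3058,
    "chunk_id": 0,
    "tags": ["1"],
}
c.upsert([row])
```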
When I set out to clean up the data, I encountered a problem. It appears that this is not a simple task; I am unable to delete data using the PK as a reference.
Is there a way for me to filter out the data that has the same primary key but is older in terms of the timestamp?
Regarding this issue with Attu: my current goal is to delete the records with duplicate primary keys, keeping only the newer ones. After comparing against our scalar database, I have identified which records need to be deleted. However, during a test deletion in Attu, I found that the delete expression Attu generates is based on the primary key rather than on the filter criteria I provided, so all records with the same primary key were deleted. For the product's deletion feature, I would expect the delete expression to be generated from my filter criteria, not from the primary key, because the primary key is not unique at the moment.
@xiaofan-luan @yanliang567