Comments (14)
@HantaoCai please send it to me by email: [email protected]; if you could also provide the Milvus logs, that would be perfect.
/assign @HantaoCai
The document has been sent, please check your inbox.
@zhuwenxing is trying to reproduce the issue with your data
Yes, it can be reproduced.
Reproduction script:
```python
from pymilvus import connections, Collection

connections.connect()
c = Collection(name="gemini_library_v5_bak")
print(c.describe())

# first query: remember the primary keys and the tags value of each entity
map = {}
res = c.query(expr="file_id == 3058", output_fields=["*"])
pk_list = [r['index_id'] for r in res]
for i in range(1):
    res = c.query(expr="file_id == 3058", output_fields=["*"])
    print(len(res))
    assert set(pk_list) == set([r['index_id'] for r in res])
    for r in res:
        if r['index_id'] not in map:
            map[r['index_id']] = r['tags']
        else:
            map[r['index_id']] += r['tags']

# repeat the query and compare each result against the first one
print("first time query, then compare following time query result with first time query result")
for i in range(10):
    diff_cnt = 0
    tmp = {}
    res = c.query(expr="file_id == 3058", output_fields=["*"])
    print(len(res))
    assert set(pk_list) == set([r['index_id'] for r in res])
    for r in res:
        tmp[r['index_id']] = r['tags']
        if map[r['index_id']] != r['tags']:
            diff_cnt += 1
            # print(f"diff found: {r['index_id']}, {map[r['index_id']]} != {r['tags']}")
    print(f"in compare time {i}, find diff count: {diff_cnt}")
```
Result: for the same index_id, the value of tags is sometimes ["1"] and sometimes [].
```text
diff found: ff386894910459ec6cc058b0e395015f, ['1'] != []
diff found: ff42158f2e5623c5fe07d9cff4cd9965, ['1'] != []
diff found: ff68acd76bfbdfc189e6d87631810a40, ['1'] != []
diff found: ff84615dbc0a42a8509dbc11b0b7fac5, ['1'] != []
diff found: ff8c617c8c4633a5598968f661f7343c, ['1'] != []
diff found: ffd0810e0b8993435e8e8792204d1a65, ['1'] != []
diff found: ffd1b8e0759fd12991a255f4d2f522fb, ['1'] != []
diff found: ffe59056e7a3235628c80d2150b13dbd, ['1'] != []
diff found: fff08735118cc834861d4b9a892cbb6b, ['1'] != []
diff found: fff89ffe055b52cd804a21fa0e11475e, ['1'] != []
diff found: fff9e71c0e45270462cee2d3fe80c6d5, ['1'] != []
```
```text
{'collection_name': 'gemini_library_v5_bak', 'auto_id': False, 'num_shards': 1, 'description': 'gemini矢量表', 'fields': [{'field_id': 100, 'name': 'index_id', 'description': '', 'type': <DataType.VARCHAR: 21>, 'params': {'max_length': 110}, 'is_primary': True}, {'field_id': 101, 'name': 'vector', 'description': '', 'type': <DataType.FLOAT_VECTOR: 101>, 'params': {'dim': 1536}}, {'field_id': 102, 'name': 'partition_id', 'description': '', 'type': <DataType.INT64: 5>, 'params': {}}, {'field_id': 103, 'name': 'file_id', 'description': '', 'type': <DataType.INT64: 5>, 'params': {}}, {'field_id': 104, 'name': 'chunk_id', 'description': '', 'type': <DataType.INT64: 5>, 'params': {}}, {'field_id': 105, 'name': 'tags', 'description': '', 'type': <DataType.ARRAY: 22>, 'params': {'max_length': 200, 'max_capacity': 1024}, 'element_type': <DataType.VARCHAR: 21>}], 'aliases': [], 'collection_id': 450111991800541421, 'consistency_level': 2, 'properties': {}, 'num_partitions': 65, 'enable_dynamic_field': True}
2414
first time query, then compare following time query result with first time query result
2414
in compare time 0, find diff count: 2414
2414
in compare time 1, find diff count: 0
2414
in compare time 2, find diff count: 0
2414
in compare time 3, find diff count: 0
2414
in compare time 4, find diff count: 0
2414
in compare time 5, find diff count: 0
2414
in compare time 6, find diff count: 0
2414
in compare time 7, find diff count: 0
2414
in compare time 8, find diff count: 0
2414
in compare time 9, find diff count: 2414
```
/assign @longjiquan
Added a step to check the count for each PK and found that each PK has two entities.
So the same PK must have been inserted twice, with different data each time.
```python
# for each primary key seen in the first query, count how many entities share it
for k, v in map.items():
    res = c.query(expr=f"index_id == '{k}'", output_fields=["count(*)"])
    print(f"{k} {res}")
```
```text
ffd1b8e0759fd12991a255f4d2f522fb data: ["{'count(*)': 2}"] ..., extra_info: {'cost': 0}
ffe59056e7a3235628c80d2150b13dbd data: ["{'count(*)': 2}"] ..., extra_info: {'cost': 0}
fff08735118cc834861d4b9a892cbb6b data: ["{'count(*)': 2}"] ..., extra_info: {'cost': 0}
fff89ffe055b52cd804a21fa0e11475e data: ["{'count(*)': 2}"] ..., extra_info: {'cost': 0}
fff9e71c0e45270462cee2d3fe80c6d5 data: ["{'count(*)': 2}"] ..., extra_info: {'cost': 0}
```
@HantaoCai this proves that the data you inserted contains duplicate primary keys, which is why the query results change. Please try to de-duplicate the data first (a rough sketch of one possible approach follows below).
/assign @HantaoCai
/unassign @zhuwenxing @longjiquan
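A minimal sketch of what such a cleanup pass could look like, assuming the duplicated primary keys have already been collected (for example from the count(*) check above) and that a non-empty tags array marks the copy worth keeping. The example PK, the selection rule, and the row-based re-insert are illustrative assumptions, not steps confirmed in this thread:

```python
from pymilvus import connections, Collection

connections.connect()
c = Collection(name="gemini_library_v5_bak")

# hypothetical list of primary keys that the count(*) check reported as duplicated
duplicated_pks = ["ffd1b8e0759fd12991a255f4d2f522fb"]

for pk in duplicated_pks:
    # read every copy of this PK; depending on the server version, the vector field
    # may need to be listed explicitly in output_fields so the row can be re-inserted whole
    rows = c.query(expr=f'index_id == "{pk}"', output_fields=["*"])
    # assumption: the copy with a non-empty tags array is the one to keep
    keep = next((r for r in rows if r.get("tags")), rows[0])

    # delete removes *all* copies sharing this PK, then re-insert the single survivor
    c.delete(expr=f'index_id in ["{pk}"]')
    c.insert([keep])

c.flush()
```

This is only a sketch of the delete-then-reinsert idea; whether it is safe to run against a live collection depends on how the application writes to it.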
I previously asked why multiple records with the same primary key could be inserted, and the answer I received was that the data retrieved would be the newest record with that primary key. The behavior I am seeing now is different from that. Should this be considered a bug?
We are long-time Milvus users, and version 2.2 did not have an upsert feature, so our historical code may still contain insert calls. In addition, we expect primary keys to be unique, as in traditional databases. From a user's perspective, plain insert is therefore not very meaningful, and we believe upsert should be favored instead.
> the response I received was that the data retrieved would be the newest record with that primary key.

Yes, that is the expected behavior.
We are investigating whether restoring or importing data that contains identical primary keys could assign them identical timestamps, which would lead to the issue you are seeing.
@HantaoCai
I don't think there is a clear way to cover every duplicate PK; under filtering and search this is not doable.
Even without upsert, you can delete the old data and then insert the new data to avoid duplication; a minimal sketch of that pattern follows below.
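A hedged sketch of that "manual upsert" write path, i.e. delete any existing copy of the primary key before inserting the fresh row. The collection name is taken from the thread; the row values and the row-based insert are placeholders/assumptions:

```python
from pymilvus import connections, Collection

connections.connect()
c = Collection(name="gemini_library_v5_bak")

# placeholder row; field names follow the schema shown earlier, values are made up
new_row = {
    "index_id": "some-primary-key",
    "vector": [0.0] * 1536,
    "partition_id": 0,
    "file_id": 3058,
    "chunk_id": 0,
    "tags": ["1"],
}

# delete any existing entity with this PK first, then insert the new row,
# so the collection never ends up holding two entities with the same primary key
pk = new_row["index_id"]
c.delete(expr=f'index_id in ["{pk}"]')
c.insert([new_row])
```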
We will clean up our data and update all of our historical code to use the upsert method (see the sketch below).
We would like queries either to return every record that shares a PK, or to guarantee that only the latest record for a given PK is returned. Either behavior would help us quickly identify or prevent these issues. However, we also recognize that returning only the latest record for a PK may not always be the best approach.
In any case, thank you for your assistance in investigating the root cause of the issue.
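For reference, a minimal sketch of the upsert call mentioned above, assuming pymilvus / Milvus 2.3 or later where Collection.upsert is available; the row values are placeholders:

```python
from pymilvus import connections, Collection

connections.connect()
c = Collection(name="gemini_library_v5_bak")

# upsert replaces the existing entity with the same primary key instead of
# adding a second copy (placeholder values; dim 1536 matches the schema above)
row = {
    "index_id": "some-primary-key",
    "vector": [0.0] * 1536,
    "partition_id": 0,
    "file_id": 3058,
    "chunk_id": 0,
    "tags": ["1"],
}
c.upsert([row])
```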
When I set out to clean up the data, I encountered a problem. It appears that this is not a simple task; I am unable to delete data using the PK as a reference.
Is there a way for me to filter out the data that has the same primary key but is older in terms of the timestamp?
Regarding this issue with Attu: my current goal is to delete the records with duplicate primary keys, keeping only the newer ones. After comparing against our scalar database, I have identified which records need to be deleted. However, during a test deletion in Attu, I found that the delete expression Attu generates is based on the primary key rather than on the filter criteria I provided, so all records with the same primary key were deleted. For the product's deletion feature, I would expect the delete expression to be generated from my filter criteria, not from the primary key, because the primary key is not unique at the moment.
@xiaofan-luan @yanliang567