Comments (14)

yanliang567 avatar yanliang567 commented on June 17, 2024

@HantaoCai please send it to me by email at [email protected]. If you could also provide the Milvus logs, that would be perfect.

yanliang567 avatar yanliang567 commented on June 17, 2024

/assign @HantaoCai

HantaoCai avatar HantaoCai commented on June 17, 2024

The document has been sent; please check your inbox.

yanliang567 avatar yanliang567 commented on June 17, 2024

@zhuwenxing is trying to reproduce the issue with your data

zhuwenxing avatar zhuwenxing commented on June 17, 2024

Yes, it can be reproduced.

Reproduction script:

from pymilvus import connections, Collection

connections.connect()
c = Collection(name="gemini_library_v5_bak")
print(c.describe())

# baseline: remember the tags returned by the first query, keyed by primary key (index_id)
map = {}
res = c.query(expr="file_id == 3058", output_fields=["*"])
pk_list = [r['index_id'] for r in res]
for i in range(1):
    res = c.query(expr="file_id == 3058", output_fields=["*"])
    print(len(res))
    assert set(pk_list) == set([r['index_id'] for r in res])
    for r in res:
        if r['index_id'] not in map:
            map[r['index_id']] = r['tags']
        else:
            map[r['index_id']] += r['tags']
print("first time query, then compare following time query result with first time query result")

# re-run the same query repeatedly and count rows whose tags differ from the baseline
for i in range(10):
    diff_cnt = 0
    tmp = {}
    res = c.query(expr="file_id == 3058", output_fields=["*"])
    print(len(res))
    assert set(pk_list) == set([r['index_id'] for r in res])
    for r in res:
        tmp[r['index_id']] = r['tags']
        if map[r['index_id']] != r['tags']:
            diff_cnt += 1
            # print(f"diff found: {r['index_id']}, {map[r['index_id']]} != {r['tags']}")
    print(f"in compare time  {i}, find diff count: {diff_cnt}")

Result: for the same index_id, the value of tags is sometimes ['1'] and sometimes [].

diff found: ff386894910459ec6cc058b0e395015f, ['1'] != []
diff found: ff42158f2e5623c5fe07d9cff4cd9965, ['1'] != []
diff found: ff68acd76bfbdfc189e6d87631810a40, ['1'] != []
diff found: ff84615dbc0a42a8509dbc11b0b7fac5, ['1'] != []
diff found: ff8c617c8c4633a5598968f661f7343c, ['1'] != []
diff found: ffd0810e0b8993435e8e8792204d1a65, ['1'] != []
diff found: ffd1b8e0759fd12991a255f4d2f522fb, ['1'] != []
diff found: ffe59056e7a3235628c80d2150b13dbd, ['1'] != []
diff found: fff08735118cc834861d4b9a892cbb6b, ['1'] != []
diff found: fff89ffe055b52cd804a21fa0e11475e, ['1'] != []
diff found: fff9e71c0e45270462cee2d3fe80c6d5, ['1'] != []
{'collection_name': 'gemini_library_v5_bak', 'auto_id': False, 'num_shards': 1, 'description': 'gemini矢量表', 'fields': [{'field_id': 100, 'name': 'index_id', 'description': '', 'type': <DataType.VARCHAR: 21>, 'params': {'max_length': 110}, 'is_primary': True}, {'field_id': 101, 'name': 'vector', 'description': '', 'type': <DataType.FLOAT_VECTOR: 101>, 'params': {'dim': 1536}}, {'field_id': 102, 'name': 'partition_id', 'description': '', 'type': <DataType.INT64: 5>, 'params': {}}, {'field_id': 103, 'name': 'file_id', 'description': '', 'type': <DataType.INT64: 5>, 'params': {}}, {'field_id': 104, 'name': 'chunk_id', 'description': '', 'type': <DataType.INT64: 5>, 'params': {}}, {'field_id': 105, 'name': 'tags', 'description': '', 'type': <DataType.ARRAY: 22>, 'params': {'max_length': 200, 'max_capacity': 1024}, 'element_type': <DataType.VARCHAR: 21>}], 'aliases': [], 'collection_id': 450111991800541421, 'consistency_level': 2, 'properties': {}, 'num_partitions': 65, 'enable_dynamic_field': True}
2414
first time query, then compare following time query result with first time query result
2414
in compare time  0, find diff count: 2414
2414
in compare time  1, find diff count: 0
2414
in compare time  2, find diff count: 0
2414
in compare time  3, find diff count: 0
2414
in compare time  4, find diff count: 0
2414
in compare time  5, find diff count: 0
2414
in compare time  6, find diff count: 0
2414
in compare time  7, find diff count: 0
2414
in compare time  8, find diff count: 0
2414
in compare time  9, find diff count: 2414

xiaofan-luan avatar xiaofan-luan commented on June 17, 2024

/assign @longjiquan

zhuwenxing avatar zhuwenxing commented on June 17, 2024

Added a step to check the count of each PK and found that each PK has two entities.

So the same PK was inserted more than once, with different data each time.

# check how many entities exist for each primary key seen in the earlier query
for k, v in map.items():
    res = c.query(expr=f"index_id == '{k}'", output_fields=["count(*)"])
    print(f"{k} {res}")
ffd1b8e0759fd12991a255f4d2f522fb data: ["{'count(*)': 2}"] ..., extra_info: {'cost': 0}
ffe59056e7a3235628c80d2150b13dbd data: ["{'count(*)': 2}"] ..., extra_info: {'cost': 0}
fff08735118cc834861d4b9a892cbb6b data: ["{'count(*)': 2}"] ..., extra_info: {'cost': 0}
fff89ffe055b52cd804a21fa0e11475e data: ["{'count(*)': 2}"] ..., extra_info: {'cost': 0}
fff9e71c0e45270462cee2d3fe80c6d5 data: ["{'count(*)': 2}"] ..., extra_info: {'cost': 0}

yanliang567 avatar yanliang567 commented on June 17, 2024

@HantaoCai this proves that the data you inserted has duplicate primary keys, which causes the query results to change. Please try to de-duplicate the data first.
/assign @HantaoCai
/unassign @zhuwenxing @longjiquan

HantaoCai avatar HantaoCai commented on June 17, 2024

I previously inquired about why multiple records with the same primary key could be inserted, and the response I received was that queries would return the newest record for a given primary key. The behavior I am seeing now is different. Should this be considered a bug?

We are long-time users of Milvus, and version 2.2 does not have an upsert feature, which means our historical code might contain Insert methods. Additionally, we expect, as with traditional databases, that primary keys should be unique. Therefore, from a user's perspective, the Insert method is not very meaningful, and we believe that the use of upsert should be favored instead.

zhuwenxing avatar zhuwenxing commented on June 17, 2024

> the response I received was that the data retrieved would be the new record with the same primary key.

Yes, that is the expected behavior.

We are investigating whether restoring or importing data that contains identical primary keys results in identical timestamps, which would lead to the current issue.

xiaofan-luan avatar xiaofan-luan commented on June 17, 2024

@HantaoCai
I think there is no clear way to check uniqueness across all PKs; under filtering and search this is not doable.
Even without upsert, you could delete the old data and insert the new data to avoid duplication, as sketched below.
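
A minimal sketch of that delete-then-insert pattern with pymilvus; this assumes a recent pymilvus that accepts row-based (dict) inserts, and the row values are hypothetical placeholders rather than data from this issue:

from pymilvus import connections, Collection

connections.connect()
c = Collection(name="gemini_library_v5_bak")

# hypothetical replacement row; real values must match the collection schema
# (index_id, 1536-dim vector, partition_id, file_id, chunk_id, tags)
row = {
    "index_id": "some_pk_value",   # placeholder primary key
    "vector": [0.0] * 1536,        # placeholder embedding
    "partition_id": 0,
    "file_id": 3058,
    "chunk_id": 0,
    "tags": ["1"],
}

# delete any existing entities with this primary key, then insert the new row
c.delete(expr='index_id in ["some_pk_value"]')
c.insert([row])
c.flush()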

HantaoCai avatar HantaoCai commented on June 17, 2024

We will be cleaning up our data and updating all historical code to use the upsert method.

We would like queries either to return all records that share a PK, or to guarantee that only the latest record for a given PK is returned. Either would help us quickly identify or prevent these issues. However, we also recognize that returning only the latest record for a PK may not always be the best approach.

In any case, thank you for your assistance in investigating the root cause of the issue.
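
For reference, a minimal sketch of the insert-to-upsert migration mentioned above, assuming a Milvus/pymilvus release that exposes Collection.upsert (2.3+); the row values are hypothetical placeholders:

from pymilvus import connections, Collection

connections.connect()
c = Collection(name="gemini_library_v5_bak")

# placeholder row matching the collection schema; upsert removes any existing
# entity with the same primary key before inserting, so repeated writes of the
# same index_id should no longer accumulate duplicates
row = {
    "index_id": "some_pk_value",
    "vector": [0.0] * 1536,
    "partition_id": 0,
    "file_id": 3058,
    "chunk_id": 0,
    "tags": ["1"],
}
c.upsert([row])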

HantaoCai avatar HantaoCai commented on June 17, 2024

When I set out to clean up the data, I encountered a problem. It appears that this is not a simple task; I am unable to delete data using the PK as a reference.

Is there a way for me to filter out the data that has the same primary key but is older in terms of the timestamp?
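
As far as this thread shows, the insert timestamp is not exposed as a queryable field, so one workaround is to first locate the duplicated primary keys and then decide which copy to drop by comparing against an external source (e.g. the scalar database mentioned in the next comment). A minimal sketch that reuses the count(*) check from earlier in the thread:

from pymilvus import connections, Collection

connections.connect()
c = Collection(name="gemini_library_v5_bak")

# a plain query returns only one copy per primary key, so use count(*) per PK
# (as in the earlier check) to find the duplicated ones
res = c.query(expr="file_id == 3058", output_fields=["index_id"])
dup_pks = []
for pk in {r["index_id"] for r in res}:
    cnt = c.query(expr=f"index_id == '{pk}'", output_fields=["count(*)"])[0]["count(*)"]
    if cnt > 1:
        dup_pks.append(pk)

print(f"{len(dup_pks)} primary keys have more than one entity")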

HantaoCai avatar HantaoCai commented on June 17, 2024

Regarding this issue with Attu, my current goal is to delete records with duplicate primary keys. I intend to remove the older records. After comparing with the scalar database, I have identified which records need to be deleted. However, during my test deletion on Attu, I found that the delete expression generated by Attu is based on the primary key rather than the filter criteria I provided. This has led to all records with the same primary key being deleted. For the product's deletion feature, I would expect the delete expression to be generated using my filter criteria, not the primary key, because the primary key is not unique at the moment.
@xiaofan-luan @yanliang567
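
If Attu's generated expression is the blocker, one possible workaround is to issue the delete directly through pymilvus with your own filter expression; this assumes the Milvus version in use accepts non-primary-key expressions in delete (older releases only accept primary-key expressions), and the filter below is a hypothetical placeholder:

from pymilvus import connections, Collection

connections.connect()
c = Collection(name="gemini_library_v5_bak")

# hypothetical filter describing the stale copies, derived from comparing
# against the scalar database; replace with the real criteria
expr = "file_id == 3058 and chunk_id == 0"

c.delete(expr=expr)
c.flush()

Note that if both copies of a duplicated primary key match the filter, both will still be deleted; whether the criteria can distinguish the two copies depends on the data.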
