Comments (14)

tianshihan818 commented on July 3, 2024

According to these issues:
#32714
#32894

It seems that this situation is caused by high write throughput: some datanodes became unhealthy, so too many growing segments accumulate on some querynodes. (But why did the number of already-sealed loaded segments decrease? I am not very clear about what happened internally.)

I tried redeploying the querynodes and reloading the collection, and the memory usage went back to normal:
[image: WeCom screenshot 1718605265848]

Hope there will be some helpful info.

  • Should I scale up the datanodes?
  • How can I control the write rate or concurrency in large-scale data scenarios?
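
For the second question, one client-side option is to batch the inserts and pace them. The following is only a minimal sketch: `insert_fn` is a hypothetical callable wrapping the real insert call (for example `collection.insert`), and the batch size, rate limit, and retry settings are illustrative placeholders rather than Milvus recommendations.

```python
import time
from typing import Callable, List, Sequence


def throttled_insert(
    rows: Sequence[dict],
    insert_fn: Callable[[List[dict]], None],  # hypothetical wrapper around the real insert
    batch_size: int = 1000,                   # placeholder value
    max_rows_per_sec: float = 5000.0,         # placeholder value
    max_retries: int = 5,
    base_backoff_sec: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> int:
    """Insert rows in batches, pacing writes and backing off on errors."""
    sent = 0
    start = clock()
    for i in range(0, len(rows), batch_size):
        batch = list(rows[i:i + batch_size])
        for attempt in range(max_retries + 1):
            try:
                insert_fn(batch)
                break
            except Exception:
                # In real code, catch only the client's timeout/quota
                # exception type instead of every Exception.
                if attempt == max_retries:
                    raise
                # Exponential backoff: 0.5s, 1s, 2s, ...
                sleep(base_backoff_sec * (2 ** attempt))
        sent += len(batch)
        # Sleep until the average rate is back under the limit.
        delay = sent / max_rows_per_sec - (clock() - start)
        if delay > 0:
            sleep(delay)
    return sent
```

The `sleep` and `clock` parameters exist only so the pacing can be tested without waiting; with several concurrent writers, each writer would need its own share of the overall rate.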

from milvus.

yanliang567 commented on July 3, 2024

@tianshihan818 Thank you for the issue and the updates. It seems that something is stuck in that querynode, but we cannot tell without the Milvus logs. Could you please refer to this doc to export the full Milvus logs for investigation?
BTW, which index type are you using?

/assign @tianshihan818

tianshihan818 commented on July 3, 2024

@yanliang567 Thanks! Here are the logs:
milvus-log.tar.gz

And I use the IVF_SQ8 index type:
[image]

yanliang567 commented on July 3, 2024

/assign @congqixia
Please help take a look.

tianshihan818 commented on July 3, 2024

Additional info:
The initial logs were lost, so I reproduced this test (concurrent insert) and collected the logs above, i.e.
milvus-log.tar.gz

This time the insert API first reported an error on June 17th at 22:50: pymilvus.exceptions.MilvusException: <MilvusException: (code=65535, message=message send timeout: TimeoutError)> (There is an 8-hour offset between my local time and the clock on the machine nodes, so the corresponding log entries should be around 14:50.)
I also checked the metrics and the pod list during this test period.

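  • Pod list
[image: WeCom screenshot 17186918952022]
  • Memory
[image: WeCom screenshot 17186908729771]
  • Segment Loaded Num
[image: WeCom screenshot 17186912067480]
  • Queryable Entity Num
[image: WeCom screenshot 17186912832872]
  • Goroutines Total
[image: WeCom screenshot 17186922195118]
  • Goroutines
[image: WeCom screenshot 17186923342838]
  • CPU
[image: WeCom screenshot 17186924613424]
  • OS Threads
[image: WeCom screenshot 17186925929468]
  • GC Max duration seconds
[image: WeCom screenshot 17186926669084]
  • Other metrics
[image: WeCom screenshot 17186945946792] [image: WeCom screenshot 17186947094234]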
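
For anyone lining up the client-side error time with the log timestamps, the 8-hour offset mentioned above can be applied as follows. This is only a sketch under two assumptions that the thread does not confirm: that the client's local time is UTC+8 and that the logs are written in UTC. The year 2024 is also assumed.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the reporter's local time zone is UTC+8 and the logs use UTC.
LOCAL_TZ = timezone(timedelta(hours=8))


def local_to_log_time(local: datetime) -> datetime:
    """Convert a naive local timestamp to the (assumed) UTC log time."""
    return local.replace(tzinfo=LOCAL_TZ).astimezone(timezone.utc)


# First client-side error: June 17th, 22:50 local time (year assumed).
first_error_local = datetime(2024, 6, 17, 22, 50)
print(local_to_log_time(first_error_local).strftime("%Y-%m-%d %H:%M"))
# prints 2024-06-17 14:50
```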

tianshihan818 commented on July 3, 2024

I continued inserting, and the memory usage of querynode miltest-milvus-querynode-5559dc65d-knhn7 kept increasing. When it reached the quota water level, the insert reported the same error as before: pymilvus.exceptions.MilvusException: <MilvusException: (code=9, message=quota exceeded[reason=memory quota exceeded, please allocate more resources])>

[image: WeCom screenshot 17186966014880]

The logs for this run are here:
milvus-log-new.tar.gz

xiaofan-luan commented on July 3, 2024

You can try increasing the shard number.

In our design, the best practice is to have 100 to 200 million entities per shard.

In your case, you have roughly 1B entities, so we would recommend using 8 shards.

The reason you see very high CPU on one querynode is that it is the shard delegator. If index building cannot catch up, data may accumulate. Also, since only the delegator forwards requests, it may become the bottleneck under heavy search load.

If you can use pprof to check why the memory is high, that would also be helpful.
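
The arithmetic behind that recommendation can be sketched as below. The helper `shard_range` is hypothetical and simply applies the 100 to 200 million per-shard guideline quoted above; it is not part of Milvus or pymilvus.

```python
import math


def shard_range(
    total_entities: int,
    min_per_shard: int = 100_000_000,  # lower bound of the guideline above
    max_per_shard: int = 200_000_000,  # upper bound of the guideline above
) -> tuple:
    """Return (fewest, most) shard counts that keep each shard in range."""
    fewest = math.ceil(total_entities / max_per_shard)
    most = max(1, total_entities // min_per_shard)
    return fewest, most


print(shard_range(1_000_000_000))  # (5, 10)
```

For roughly 1B entities this gives 5 to 10 shards, and the suggested 8 falls inside that range. In pymilvus the shard count is, as far as I know, chosen when the collection is created (for example through the `shards_num` argument of `Collection`), so changing it would mean recreating the collection.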

congqixia commented on July 3, 2024

@tianshihan818
The segment is still being loaded into the delegator querynode, which is not by design.
Could you please provide a birdwatcher backup and the output of the show segment-loaded-grpc command?
https://github.com/milvus-io/birdwatcher/releases/tag/v1.0.4

tianshihan818 commented on July 3, 2024

Thanks for your reply! @xiaofan-luan @congqixia

I used birdwatcher to get the pprof heap files:
bw_pprof_heap.240619-045745.tar.gz

And the segment-loaded-grpc list results are in this file:
show_segment_loaded_grpc.log

congqixia commented on July 3, 2024

@tianshihan818 Thanks for the quick reply. Could you please provide the backup file as well?

tianshihan818 commented on July 3, 2024

@congqixia The etcd backup is here:
bw_etcd_ALL.240619-060804.bak.gz

XuanYang-cn commented on July 3, 2024

DataNode panicked on known bugs, which should be fixed by #33829
[image]

tianshihan818 commented on July 3, 2024

@xiaofan-luan Hello! Thanks for sharing the technical details. I verified in the logs that the "delegator" is indeed on the abnormal querynode. Is this "delegator" responsible for forwarding load and search requests and gathering the search results?
So in the concurrent insert case, once collection.load() has been called, the querynode automatically loads newly inserted data. Since my collection's consistency_level is set to "Strong", if index building (IVF_SQ8 in this test; ideally the index files use around 30% of the raw data size in memory) does not keep up with the insert speed, the querynode loads the raw data into memory instead of the index files, which causes the high memory usage. Is that right?
I am still not very clear on these things:

  • Why does only the "delegator" querynode bear the load pressure in this case?
  • Would it be better to load all the data only after inserting all of it?
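
To make the memory reasoning above concrete, here is a rough back-of-the-envelope sketch. It assumes float32 vectors and takes the "around 30% of the raw data size" figure from this comment at face value; the vector dimension of 128 is a hypothetical placeholder, since the real dimension is not stated in the thread, and other memory overheads are ignored.

```python
def raw_vector_bytes(num_entities: int, dim: int, bytes_per_value: int = 4) -> int:
    """Size of the raw vectors, assuming float32 (4 bytes per value)."""
    return num_entities * dim * bytes_per_value


def estimate_memory_gib(
    num_entities: int,
    dim: int,
    indexed_fraction: float,
    index_ratio: float = 0.30,  # the ~30% figure quoted in the comment above
) -> float:
    """Estimate memory when only part of the data has a built index."""
    if not 0.0 <= indexed_fraction <= 1.0:
        raise ValueError("indexed_fraction must be between 0 and 1")
    raw = raw_vector_bytes(num_entities, dim)
    indexed = raw * indexed_fraction * index_ratio  # loaded as index files
    unindexed = raw * (1.0 - indexed_fraction)      # loaded as raw data
    return (indexed + unindexed) / 2**30


# dim=128 is a hypothetical value used only for illustration.
for fraction in (1.0, 0.5, 0.0):
    gib = estimate_memory_gib(1_000_000_000, 128, fraction)
    print(f"indexed fraction {fraction:.1f}: about {gib:.0f} GiB")
```

Under these assumptions, the estimate grows quickly as the share of data without a built index increases, which matches the behaviour described in this thread.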

xiaofan-luan commented on July 3, 2024

The delegator holds all the growing segment data. Once a segment is sealed and its index is built, all other nodes can take over the work.

@XuanYang-cn Would upgrading to 2.4.5 solve the problem? Do we need a special fix for this "segment not found" error?
