Comments (13)
I guess the bottleneck here is that the more segments you have, the more upload operations there are.
With 10,000 segments and 20 ms per upload, that adds up to 200 s.
Let's use a pool of CPU cores * 8 workers for concurrent file flush.
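To make the arithmetic and the pool idea concrete, here is a minimal Python sketch (not Milvus code, which is written in Go). The names `serial_upload_seconds`, `flush_concurrently`, and the `upload` callback are hypothetical; only the 20 ms figure, the 10,000 segment count, and the "CPU cores * 8" pool size come from the comment above.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def serial_upload_seconds(num_segments: int, per_upload_ms: float) -> float:
    """Total time if uploads run one after another."""
    return num_segments * per_upload_ms / 1000.0


def flush_concurrently(segments, upload, workers=None):
    """Run `upload` on every segment using a bounded thread pool.

    `upload` is a hypothetical callback that flushes one segment's file.
    The default pool size follows the "CPU cores * 8" suggestion.
    """
    if workers is None:
        workers = (os.cpu_count() or 1) * 8
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order and re-raises the first upload error.
        return list(pool.map(upload, segments))


if __name__ == "__main__":
    print(serial_upload_seconds(10_000, 20))  # 200.0
    print(flush_concurrently(range(5), lambda seg: seg * 2))
```

With uploads that are I/O bound, a pool of N workers should bring the wall-clock time down to roughly 1/N of the serial figure, as long as the object store and network keep up.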
from milvus.
/unassign
from milvus.
It's not stuck. L0 compaction executes more and more slowly as the segment count increases, and all the L0 tasks are pending execution, which leaves MixCompaction unable to be scheduled.
We need to refine the scheduler to avoid this situation.
from milvus.
Does L0 compaction become slower because of the bloom filter (BF)?
from milvus.
Maybe we need an extra stage in L0 compaction:
- concurrently load the BF into a datanode (maybe cached)
- concurrently split all L0 segments across the existing segments
- lock L1/L2 compaction
- copy delta logs to the compacted result if necessary
- release the lock
I would assume this will relieve the pressure on compaction and reduce the lock holding time.
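The staged flow proposed above can be sketched roughly as follows, in Python rather than Milvus's Go. Everything here is an assumption for illustration: `compaction_lock`, `split_l0`, and `l0_compact` are hypothetical names, and a plain `set` of primary keys stands in for each segment's bloom filter (a real BF can return false positives, which a set cannot).

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# Hypothetical lock that L1/L2 compaction would also have to take.
compaction_lock = threading.Lock()


def split_l0(l0_deletes, segment_keys):
    """Route each deleted primary key to the segments that may contain it."""
    routed = {seg_id: [] for seg_id in segment_keys}
    for pk in l0_deletes:
        for seg_id, keys in segment_keys.items():
            if pk in keys:  # stand-in for a bloom filter lookup
                routed[seg_id].append(pk)
    return routed


def l0_compact(l0_segments, segment_keys, delta_logs):
    # Expensive stage: split every L0 segment concurrently, no lock held.
    with ThreadPoolExecutor() as pool:
        parts = list(pool.map(lambda d: split_l0(d, segment_keys), l0_segments))
    # Short stage: hold the lock only while copying the delta logs.
    with compaction_lock:
        for part in parts:
            for seg_id, pks in part.items():
                delta_logs.setdefault(seg_id, []).extend(pks)
    return delta_logs
```

The point of the split is that the lock covers only the final copy, so L1/L2 compaction is blocked for the duration of an in-memory merge rather than for the whole routing pass.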
from milvus.
We may also need a mechanism to cache the BF for fast loading in the future.
Maybe caching it on the log node is a good idea? @XuanYang-cn @congqixia @tedxu
from milvus.
L0 compaction becomes slower because the bf?
@xiaofan-luan
L0 relies on the BF cached in the datanode, because we want the datanode to work when L0 is disabled. So currently the L0 compaction process doesn't need to load the BF from S3; they're all in the datanode already.
We need tracing to know the time costs: as the segment count increased to 8000, the L0 compaction p99 cost 15 min.
Also, L0 compaction tasks bypass the scheduler's max task number, so hundreds of L0 tasks were submitted and no MixCompaction could be scheduled. MixCompaction not being scheduled is the root cause of the increasing segment count.
#31270 would prevent the scheduler from appending endless L0 tasks.
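To illustrate the starvation described above, here is a toy scheduler in Python. It is not the change made in #31270, whose details are not shown in this thread; `CappedScheduler`, `max_parallel`, and `max_l0` are hypothetical names for a sketch in which L0 tasks count against the global limit and also have their own cap.

```python
from collections import deque


class CappedScheduler:
    """Toy scheduler: a global parallelism limit plus a separate L0 cap."""

    def __init__(self, max_parallel: int, max_l0: int):
        self.max_parallel = max_parallel
        self.max_l0 = max_l0
        self.queue = deque()
        self.running = []

    def submit(self, task_type: str, task_id: int) -> None:
        self.queue.append((task_type, task_id))

    def schedule(self):
        """Move queued tasks to running while respecting both limits."""
        deferred = deque()
        while self.queue and len(self.running) < self.max_parallel:
            task = self.queue.popleft()
            l0_running = sum(1 for kind, _ in self.running if kind == "L0")
            if task[0] == "L0" and l0_running >= self.max_l0:
                deferred.append(task)  # L0 cap reached, try again later
                continue
            self.running.append(task)
        # Put deferred L0 tasks back at the front, keeping their order.
        self.queue.extendleft(reversed(deferred))
        return list(self.running)
```

With `max_parallel=4` and `max_l0=2`, submitting five L0 tasks followed by two Mix tasks leaves two slots for the Mix tasks instead of filling every slot with L0 work.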
from milvus.
@wangting0128 please help verify, also could you enable tracing during tests?
/assign @wangting0128
from milvus.
A typical batch L0 process trace
from milvus.
A very obvious and quick enhancement: change linear upload to batch upload
from milvus.
Makes sense to me
from milvus.
After verification, under the current mechanism, there are two ways to alleviate the problem of slow compaction speed under concurrent DML.
verified image:2.4-20240318-506534c2
- Add dataNode count
- Adjust config parameters to improve compaction concurrency
dataCoord:
  compaction:
    workerMaxParallelTaskNum: 5
    maxParallelTaskNum: 20
from milvus.
After verification, under the current mechanism, there are two ways to alleviate the problem of slow compaction speed under concurrent DML.
verified image:2.4-20240318-506534c2
- Add dataNode count
- Adjust config parameters to improve compaction concurrency
dataCoord:
  compaction:
    workerMaxParallelTaskNum: 5
    maxParallelTaskNum: 20
After verification, this issue can be closed
from milvus.