Comments (14)
/assign @sunby
please help to check
/unassign
from milvus.
@artin-recollective Even with mmap enabled, a segment needs to be first loaded into memory, processed, and then serialized into an mmap file. So, predicting memory usage is expected. The issue here is that you've set the maximum size of the segment larger than the available memory, which prevents the segment from being loaded into memory. This 21010MB segment may be generated by compaction. The reason why v2.3.12 works should be that no segment bigger than memory size is generated.
from milvus.
@sunby the data is the same as what is being used in 2.3.12; I am just trying to load the same collection with mmap enabled. 2.3.12 works, and 2.4.x does not. I am using the exact same limits, segment sizes and host machine for the query node in both versions. To clarify, my segment size is 512MB and the max segment size is configured at 1024.
The same validation in 2.3.12 checks whether mmap is enabled before failing the load:
if !mmapEnabled && predictMemUsage > uint64(float64(totalMem)*paramtable.Get().QueryNodeCfg.OverloadedMemoryThresholdPercentage.GetAsFloat()) {
	return 0, 0, fmt.Errorf("load segment failed, OOM if load, maxSegmentSize = %v MB, concurrency = %d, memUsage = %v MB, predictMemUsage = %v MB, totalMem = %v MB thresholdFactor = %f",
		toMB(maxSegmentSize),
		concurrency,
		toMB(memUsage),
		toMB(predictMemUsage),
		toMB(totalMem),
		paramtable.Get().QueryNodeCfg.OverloadedMemoryThresholdPercentage.GetAsFloat())
}
if mmapEnabled && memUsage > uint64(float64(totalMem)*paramtable.Get().QueryNodeCfg.OverloadedMemoryThresholdPercentage.GetAsFloat()) {
	return 0, 0, fmt.Errorf("load segment failed, OOM if load, maxSegmentSize = %v MB, concurrency = %d, memUsage = %v MB, predictMemUsage = %v MB, totalMem = %v MB thresholdFactor = %f",
		toMB(maxSegmentSize),
		concurrency,
		toMB(memUsage),
		toMB(predictMemUsage),
		toMB(totalMem),
		paramtable.Get().QueryNodeCfg.OverloadedMemoryThresholdPercentage.GetAsFloat())
}
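To make the difference concrete, here is a minimal standalone sketch of the two behaviours discussed in this thread. `checkV23` follows the shape of the snippet above; `checkStrict` is a hypothetical simplification of a check that ignores the mmap flag, not the actual 2.4.x code. The 0.9 threshold and the 1024 MB of current usage are assumed values; the 12GB node and the 21010MB segment come from this thread.

```go
package main

import "fmt"

// checkV23 mirrors the shape of the v2.3.12 snippet: with mmap enabled,
// only the current memUsage is compared against the limit.
func checkV23(mmapEnabled bool, memUsage, predictMemUsage, totalMem uint64, threshold float64) error {
	limit := uint64(float64(totalMem) * threshold)
	if !mmapEnabled && predictMemUsage > limit {
		return fmt.Errorf("OOM if load: predictMemUsage %d > limit %d", predictMemUsage, limit)
	}
	if mmapEnabled && memUsage > limit {
		return fmt.Errorf("OOM if load: memUsage %d > limit %d", memUsage, limit)
	}
	return nil
}

// checkStrict is a hypothetical variant (an assumption, not Milvus code)
// that compares predictMemUsage against the limit regardless of mmap.
func checkStrict(predictMemUsage, totalMem uint64, threshold float64) error {
	limit := uint64(float64(totalMem) * threshold)
	if predictMemUsage > limit {
		return fmt.Errorf("OOM if load: predictMemUsage %d > limit %d", predictMemUsage, limit)
	}
	return nil
}

func main() {
	const mb = 1 << 20
	totalMem := uint64(12288 * mb)   // 12GB query node
	memUsage := uint64(1024 * mb)    // assumed current usage
	predict := memUsage + 21010*mb   // current usage plus the 21010MB segment
	threshold := 0.9                 // assumed threshold factor

	fmt.Println("v2.3.12-style, mmap on:", checkV23(true, memUsage, predict, totalMem, threshold))
	fmt.Println("strict check:          ", checkStrict(predict, totalMem, threshold))
}
```

Under these assumptions the first check passes and the second one rejects the load, which matches the behaviour reported for the two versions.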
from milvus.
@artin-recollective Thanks for your reply!
There's a known bug in v2.3.x where compaction may create a very large segment (much larger than the max segment size). That may be why this segment was generated.
And IMO the memory check code in v2.3.x is wrong, because the segment has to be loaded into memory first.
As for this:
2.3.12 works, and 2.4.x does not work
I think this large segment may have been created in v2.3.x but not loaded. Milvus works fine because this segment's ancestors are already loaded. But after you upgrade to v2.4.x, loading the collection selects this segment to be loaded, so loading fails.
You can try loading in v2.3.x again; I expect it will fail.
from milvus.
@artin-recollective If you have the v2.3.x querycoord logs and the log level is info, please search for the keywords "load segments done" and "449469341520221184" (segment id) to check whether this segment was loaded successfully.
from milvus.
Thanks for the info @sunby. I have actually reverted to 2.3.12 and loaded the collection again without any problems. I have switched versions multiple times for testing, and on 2.3.12 the collection gets fully loaded.
I haven't investigated the code fully to understand whether the whole segment gets loaded into memory first and then offloaded to disk, but it seems to me that in 2.3.12 the mmap functionality works correctly and flexibly with whatever amount of memory is available. That makes sense, since the whole point of mmap is to use less memory.
I am curious whether there is a fix for the oversized segments you mentioned. Is there a way to fix segments that have grown too big?
from milvus.
I checked the logs in 2.3.12 after downgrading from 2.4.4, and I can find the "load segments done" logs in both query coord and query node for that segment. In fact, the collection has 5 segments, and I can find "load segments done" logs for all 5 segments in both query node and query coord.
querycoord logs:
[2024/06/14 15:08:46.293 +00:00] [INFO] [task/executor.go:251] ["load segments done"] [taskID=1718377018226] [collectionID=449469341520987343] [replicaID=450462225059282945] [segmentID=449469341520221184] [node=883] [source=segment_checker] [shardLeader=883] [elapsed=9m23.508329564s]
querynode logs:
[2024/06/14 15:08:46.000 +00:00] [INFO] [querynodev2/services.go:501] ["load segments done..."] [traceID=74bb001f3eaf6896c63d0b2e78ad90ca] [collectionID=449469341520987343] [partitionID=449469341520987344] [shard=by-dev-rootcoord-dml_13_449469341520987343v0] [segmentID=449469341520221184] [currentNodeID=883] [segments="[449469341520221184]"]
from milvus.
Is there a reason the segment size was changed to such a large value?
Our recommendation is to use a 4-8GB segment size, for these reasons:
- Index build on large segments is much slower (it takes 1 hour or more each time).
- Large segments cannot be balanced evenly across all querynodes.
- Large segments take memory to load: we need at least 1x the segment size in extra memory, which means you need at least 20GB of memory to load a 20GB segment even with mmap (without mmap, even more).
from milvus.
With 12GB of memory, a 2GB or 4GB segment size is what we recommend.
from milvus.
You can always tune the parameter to bypass the check, but this may cause OOM and other unexpected behaviour.
from milvus.
I am using all the default configs for segment size, so my segment size should be 512, from: https://github.com/milvus-io/milvus/blob/v2.3.12/configs/milvus.yaml
from milvus.
@sunby is there any way to find the segments that are larger than expected? For example, in S3 or with an SDK call that shows the segment size in MB?
from milvus.
You can use birdwatcher to get the size of every segment.
from milvus.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Rotten issues close after 30d of inactivity. Reopen the issue with /reopen.
from milvus.