[Bug]: MMAP regression with Milvus 2.4.x about milvus HOT 14 CLOSED

artin-recollective commented on September 7, 2024

[Bug]: MMAP regression with Milvus 2.4.x

from milvus.

Comments (14)

yanliang567 commented on September 7, 2024

/assign @sunby
please help to check
/unassign

from milvus.

sunby commented on September 7, 2024

@artin-recollective Even with mmap enabled, a segment needs to be first loaded into memory, processed, and then serialized into an mmap file. So, predicting memory usage is expected. The issue here is that you've set the maximum size of the segment larger than the available memory, which prevents the segment from being loaded into memory. This 21010MB segment may be generated by compaction. The reason why v2.3.12 works should be that no segment bigger than memory size is generated.

from milvus.

artin-recollective commented on September 7, 2024

@sunby the data is the same as what I use in 2.3.12; I am just trying to load the same collection with mmap enabled. 2.3.12 works, and 2.4.x does not. I am using exactly the same limits, segment sizes and query node host machine for both versions. To clarify, my segment size is 512MB and the max segment size is configured at 1024.
The same validation in 2.3.12 checks whether mmap is enabled before failing the load:

	if !mmapEnabled && predictMemUsage > uint64(float64(totalMem)*paramtable.Get().QueryNodeCfg.OverloadedMemoryThresholdPercentage.GetAsFloat()) {
		return 0, 0, fmt.Errorf("load segment failed, OOM if load, maxSegmentSize = %v MB, concurrency = %d, memUsage = %v MB, predictMemUsage = %v MB, totalMem = %v MB thresholdFactor = %f",
			toMB(maxSegmentSize),
			concurrency,
			toMB(memUsage),
			toMB(predictMemUsage),
			toMB(totalMem),
			paramtable.Get().QueryNodeCfg.OverloadedMemoryThresholdPercentage.GetAsFloat())
	}

	if mmapEnabled && memUsage > uint64(float64(totalMem)*paramtable.Get().QueryNodeCfg.OverloadedMemoryThresholdPercentage.GetAsFloat()) {
		return 0, 0, fmt.Errorf("load segment failed, OOM if load, maxSegmentSize = %v MB, concurrency = %d, memUsage = %v MB, predictMemUsage = %v MB, totalMem = %v MB thresholdFactor = %f",
			toMB(maxSegmentSize),
			concurrency,
			toMB(memUsage),
			toMB(predictMemUsage),
			toMB(totalMem),
			paramtable.Get().QueryNodeCfg.OverloadedMemoryThresholdPercentage.GetAsFloat())
	}

https://github.com/milvus-io/milvus/blob/835862df22c5c8f531fedf6cc024f89f0a50a3c6/internal/querynodev2/segments/segment_loader.go#L1062C2-L1080C3
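To make the difference concrete, here is a minimal standalone sketch of the two branches quoted above. It is not the Milvus implementation: `checkLoad`, the local `toMB` helper, the 0.9 threshold factor and the 2048 MB of current usage are assumptions chosen for illustration; the 12 GB node and the 21010 MB segment are the figures mentioned in this thread.

```go
package main

import "fmt"

// toMB is a local helper for this sketch; it converts bytes to whole megabytes.
func toMB(b uint64) uint64 { return b / (1 << 20) }

// checkLoad follows the shape of the v2.3.12 check quoted above:
// without mmap the predicted usage is compared against the limit,
// with mmap only the current usage is compared.
func checkLoad(mmapEnabled bool, memUsage, predictMemUsage, totalMem uint64, thresholdFactor float64) error {
	limit := uint64(float64(totalMem) * thresholdFactor)
	if !mmapEnabled && predictMemUsage > limit {
		return fmt.Errorf("load segment failed, OOM if load, predictMemUsage = %d MB, limit = %d MB",
			toMB(predictMemUsage), toMB(limit))
	}
	if mmapEnabled && memUsage > limit {
		return fmt.Errorf("load segment failed, OOM if load, memUsage = %d MB, limit = %d MB",
			toMB(memUsage), toMB(limit))
	}
	return nil
}

func main() {
	const mb = 1 << 20
	total := uint64(12288 * mb)              // 12 GB node
	used := uint64(2048 * mb)                // hypothetical current usage
	predicted := used + uint64(21010*mb)     // current usage plus the 21010 MB segment

	fmt.Println(checkLoad(false, used, predicted, total, 0.9) != nil) // true: rejected without mmap
	fmt.Println(checkLoad(true, used, predicted, total, 0.9) != nil)  // false: accepted with mmap
}
```

With these inputs the mmap branch lets the load through because only the current usage is compared, which matches the behaviour described for 2.3.12 in this comment.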

from milvus.

sunby commented on September 7, 2024

@artin-recollective Thanks for your reply!
There's a known bug in v2.3.x where compaction may create a very large segment (much larger than the max segment size). That may be why this segment was generated.
Also, IMO the memory check code in v2.3.x is wrong, because the segment has to be loaded into memory first.
As for this:

2.3.12 works, and 2.4.x does not work

I think this large segment may have been created in v2.3.x but never loaded. Milvus kept working because this segment's ancestors were already loaded. After you upgraded to v2.4.x, loading the collection picks this segment, so the load fails.

You can try loading again in v2.3.x; I expect it to fail.

from milvus.

sunby commented on September 7, 2024

@artin-recollective If you have the v2.3.x querycoord logs and the log level is info, please search them for the keywords "load segments done" and "449469341520221184" (the segment ID) to check whether this segment was loaded successfully.

from milvus.

artin-recollective commented on September 7, 2024

Thanks for the info @sunby. I have actually reverted to 2.3.12 and loaded the collection again without any problems. I have switched versions multiple times for testing, and on 2.3.12 the collection gets fully loaded every time.
I haven't investigated the code enough to know whether the whole segment gets loaded into memory first and then offloaded to disk, but it seems to me that in 2.3.12 the mmap functionality works correctly and is flexible with whatever amount of memory is available. That makes sense, since the whole point of mmap is to use less memory.
I am curious whether there is a fix for the oversized segments you mentioned. Is there a way to fix segments that have grown too big?

from milvus.

artin-recollective commented on September 7, 2024

I checked the logs in 2.3.12 after downgrading from 2.4.4, and I can find the "load segments done" logs in both querycoord and querynode for that segment. In fact, the collection has 5 segments, and I can find "load segments done" logs for all 5 of them in both querynode and querycoord.

querycoord logs:

[2024/06/14 15:08:46.293 +00:00] [INFO] [task/executor.go:251] ["load segments done"] [taskID=1718377018226] [collectionID=449469341520987343] [replicaID=450462225059282945] [segmentID=449469341520221184] [node=883] [source=segment_checker] [shardLeader=883] [elapsed=9m23.508329564s]

querynode logs:

[2024/06/14 15:08:46.000 +00:00] [INFO] [querynodev2/services.go:501] ["load segments done..."] [traceID=74bb001f3eaf6896c63d0b2e78ad90ca] [collectionID=449469341520987343] [partitionID=449469341520987344] [shard=by-dev-rootcoord-dml_13_449469341520987343v0] [segmentID=449469341520221184] [currentNodeID=883] [segments="[449469341520221184]"]

from milvus.

xiaofan-luan commented on September 7, 2024

Is there a reason we changed to such a large segment size?

Our recommendation is to use a 4-8 GB segment size, for these reasons:

Index build on a large segment is much slower (it takes 1 hour or more each time).
Large segments cannot be balanced evenly across all querynodes.
Large segments take memory to load: we need at least 1 x segment size of extra memory, which means you need at least 20 GB of memory to load a 20 GB segment even with mmap (without mmap it is even more).
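The last point can be turned into a quick sanity check. This is only a sketch of the rule of thumb stated above (free memory must be at least one segment size), not Milvus logic; `tooLargeToLoad` is a hypothetical helper, and the sizes are the ones mentioned in this thread (512 MB, 1024, a 21010 MB segment, a 12 GB node).

```go
package main

import "fmt"

// tooLargeToLoad returns the segment sizes (in MB) that exceed the memory
// still free on a node, following the "1 x segment size of extra memory" rule.
func tooLargeToLoad(segmentsMB []uint64, totalMB, usedMB uint64) []uint64 {
	var free uint64
	if totalMB > usedMB {
		free = totalMB - usedMB
	}
	var out []uint64
	for _, s := range segmentsMB {
		if s > free {
			out = append(out, s)
		}
	}
	return out
}

func main() {
	// 12 GB node with nothing loaded yet.
	fmt.Println(tooLargeToLoad([]uint64{512, 1024, 21010}, 12288, 0)) // prints [21010]
}
```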

from milvus.

xiaofan-luan commented on September 7, 2024

with 12GB memory, 2/4GB segment size is what we recommended.

from milvus.

xiaofan-luan commented on September 7, 2024

You can always tune the parameter to remove the check, but this may cause OOM and other unexpected behaviour.

from milvus.

artin-recollective commented on September 7, 2024

with 12GB memory, 2/4GB segment size is what we recommended.

I am using all the default configs for segment size, so my segment size should be 512, from: https://github.com/milvus-io/milvus/blob/v2.3.12/configs/milvus.yaml

from milvus.

artin-recollective commented on September 7, 2024

@sunby is there any way I can find the segments that are larger than expected? For example, in S3, or with an SDK call that shows the segment size in MB?

from milvus.

xiaofan-luan commented on September 7, 2024

You can use birdwatcher to get the size of every segment.

from milvus.

stale commented on September 7, 2024

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Rotten issues close after 30d of inactivity. Reopen the issue with /reopen.

from milvus.
