Comments (14)

yanliang567 commented on September 7, 2024

/assign @sunby
please help to check
/unassign

from milvus.

sunby commented on September 7, 2024

@artin-recollective Even with mmap enabled, a segment must first be loaded into memory, processed, and then serialized into an mmap file, so checking the predicted memory usage is expected. The issue here is that you've set the maximum segment size larger than the available memory, which prevents the segment from being loaded into memory. This 21010 MB segment may have been generated by compaction. v2.3.12 probably works because no segment larger than the memory size was generated there.

artin-recollective commented on September 7, 2024

@sunby The data is the same as in 2.3.12; I am just trying to load the same collection with mmap enabled. 2.3.12 works, and 2.4.x does not. I am using exactly the same limits, segment sizes, and query node host machine for both versions. To clarify, my segment size is 512MB and the max segment size is configured at 1024.
The same validation in 2.3.12 checks whether mmap is enabled before failing the load:

// Without mmap: reject the load when the predicted usage exceeds the threshold.
if !mmapEnabled && predictMemUsage > uint64(float64(totalMem)*paramtable.Get().QueryNodeCfg.OverloadedMemoryThresholdPercentage.GetAsFloat()) {
	return 0, 0, fmt.Errorf("load segment failed, OOM if load, maxSegmentSize = %v MB, concurrency = %d, memUsage = %v MB, predictMemUsage = %v MB, totalMem = %v MB thresholdFactor = %f",
		toMB(maxSegmentSize),
		concurrency,
		toMB(memUsage),
		toMB(predictMemUsage),
		toMB(totalMem),
		paramtable.Get().QueryNodeCfg.OverloadedMemoryThresholdPercentage.GetAsFloat())
}

// With mmap: only the current usage is compared against the threshold.
if mmapEnabled && memUsage > uint64(float64(totalMem)*paramtable.Get().QueryNodeCfg.OverloadedMemoryThresholdPercentage.GetAsFloat()) {
	return 0, 0, fmt.Errorf("load segment failed, OOM if load, maxSegmentSize = %v MB, concurrency = %d, memUsage = %v MB, predictMemUsage = %v MB, totalMem = %v MB thresholdFactor = %f",
		toMB(maxSegmentSize),
		concurrency,
		toMB(memUsage),
		toMB(predictMemUsage),
		toMB(totalMem),
		paramtable.Get().QueryNodeCfg.OverloadedMemoryThresholdPercentage.GetAsFloat())
}

https://github.com/milvus-io/milvus/blob/835862df22c5c8f531fedf6cc024f89f0a50a3c6/internal/querynodev2/segments/segment_loader.go#L1062C2-L1080C3
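For readers who want to experiment with the two branches quoted above, here is a minimal standalone sketch. It is not the Milvus implementation: `checkMemory`, its parameters, and the sample numbers (12288 MB total memory, a 0.9 threshold factor, 2048 MB current usage, 23058 MB predicted usage) are assumptions made up for illustration. It only shows that the same predicted usage can be rejected without mmap and accepted with mmap under this style of check.

```go
package main

import "fmt"

// checkMemory is a hypothetical stand-in for the quoted validation.
// Without mmap it compares the predicted usage against the limit;
// with mmap it compares only the current usage. All values are in MB.
func checkMemory(mmapEnabled bool, memUsage, predictMemUsage, totalMem uint64, thresholdFactor float64) error {
	limit := uint64(float64(totalMem) * thresholdFactor)
	checked := predictMemUsage
	if mmapEnabled {
		checked = memUsage
	}
	if checked > limit {
		return fmt.Errorf("load segment failed, OOM if load: checked = %d MB, limit = %d MB", checked, limit)
	}
	return nil
}

func main() {
	// Assumed sample values, not taken from the thread's logs.
	fmt.Println(checkMemory(false, 2048, 23058, 12288, 0.9)) // predicted usage is checked
	fmt.Println(checkMemory(true, 2048, 23058, 12288, 0.9))  // only current usage is checked
}
```

With these assumed numbers the first call returns an error and the second returns nil, which matches the behaviour difference described in this comment.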

sunby commented on September 7, 2024

@artin-recollective Thanks for your reply!
There's a known bug in v2.3.x where compaction may create a very large segment (much larger than the max segment size). That may be why this segment was generated.
Also, IMO the memory check code in v2.3.x is wrong, because the segment has to be loaded into memory first.
As for "2.3.12 works, and 2.4.x does not work":

I think this large segment may have been created in v2.3.x but never loaded. Milvus works fine because this segment's ancestors have already been loaded. After you upgrade to v2.4.x, loading the collection picks this segment, so loading fails.

You can try loading in v2.3.x again; I expect it will fail.

sunby commented on September 7, 2024

@artin-recollective If you have the v2.3.x querycoord logs and the log level is info, please search for the keywords "load segments done" and "449469341520221184" (the segment ID) to check whether this segment was loaded successfully.

artin-recollective commented on September 7, 2024

Thanks for the info @sunby. I have actually reverted to 2.3.12 and loaded the collection again without any problems. I have switched versions multiple times for testing, and the collection actually gets fully loaded.
I haven't investigated the code enough to know whether the whole segment is loaded into memory first and then offloaded to disk, but it seems to me that in 2.3.12 the mmap functionality works correctly and flexibly with whatever amount of memory is available. That only makes sense, since the whole point of mmap is to use less memory.
I am curious whether there is a fix for the oversized segments you mentioned. Is there a way to fix segments that have grown too big?

artin-recollective commented on September 7, 2024

I checked the logs in 2.3.12 after downgrading from 2.4.4, and I can find the "load segments done" logs in both querycoord and querynode for that segment. In fact, the collection has 5 segments, and I can find "load segments done" logs for all 5 in both querynode and querycoord.

querycoord logs:

[2024/06/14 15:08:46.293 +00:00] [INFO] [task/executor.go:251] ["load segments done"] [taskID=1718377018226] [collectionID=449469341520987343] [replicaID=450462225059282945] [segmentID=449469341520221184] [node=883] [source=segment_checker] [shardLeader=883] [elapsed=9m23.508329564s]

querynode logs:

[2024/06/14 15:08:46.000 +00:00] [INFO] [querynodev2/services.go:501] ["load segments done..."] [traceID=74bb001f3eaf6896c63d0b2e78ad90ca] [collectionID=449469341520987343] [partitionID=449469341520987344] [shard=by-dev-rootcoord-dml_13_449469341520987343v0] [segmentID=449469341520221184] [currentNodeID=883] [segments="[449469341520221184]"]
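The log check discussed here (a line that carries both the "load segments done" message and the segment ID) could be automated roughly as follows. This is a sketch under assumptions: `segmentLoaded` is a hypothetical helper, the first sample line is shortened from the querycoord line above, and the second sample line is invented to show a non-match. It relies only on the `[segmentID=...]` field format visible in the logs in this thread.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// segmentLoaded reports whether any single log line contains both the
// "load segments done" message and the bracketed segmentID field.
func segmentLoaded(logs, segmentID string) bool {
	needle := "[segmentID=" + segmentID + "]"
	sc := bufio.NewScanner(strings.NewReader(logs))
	for sc.Scan() {
		line := sc.Text()
		if strings.Contains(line, "load segments done") && strings.Contains(line, needle) {
			return true
		}
	}
	return false
}

func main() {
	// First line: shortened from the thread. Second line: made up for contrast.
	logs := `[2024/06/14 15:08:46.293 +00:00] [INFO] [task/executor.go:251] ["load segments done"] [collectionID=449469341520987343] [segmentID=449469341520221184] [node=883]
[2024/06/14 15:08:47.000 +00:00] [INFO] [example.go:1] ["some other message"] [segmentID=1]`

	fmt.Println(segmentLoaded(logs, "449469341520221184"))
	fmt.Println(segmentLoaded(logs, "1"))
}
```

Matching on the full bracketed field avoids false positives where one segment ID is a prefix of another. Note that `bufio.Scanner` has a default per-line size limit, so very long log lines would need a larger buffer.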

xiaofan-luan commented on September 7, 2024

Is there a reason we changed to such a large segment size?

Our recommendation is to use a 4-8 GB segment size, for these reasons:

  1. Index build on large segments is much slower (it takes 1 hour or more per build).
  2. Large segments cannot be balanced evenly across all querynodes.
  3. Large segments take memory to load: we need at least 1 x the segment size in extra memory, which means you need at least 20 GB of memory to load a 20 GB segment even with mmap (without mmap it will be even more).
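The arithmetic behind the third point can be sketched as below. This is an illustration only: `maxLoadableMB` and `fits` are hypothetical helpers, the 0.9 threshold factor is an assumed value, and the 12 GB and 21010 MB figures are the ones mentioned in this thread.

```go
package main

import "fmt"

// maxLoadableMB returns the largest segment size (in MB) that passes a check
// of the form "segment size <= totalMem * thresholdFactor", assuming the whole
// segment has to fit in memory once while it is being loaded.
func maxLoadableMB(totalMemMB uint64, thresholdFactor float64) uint64 {
	return uint64(float64(totalMemMB) * thresholdFactor)
}

// fits reports whether a segment of segmentMB could be loaded under that limit.
func fits(segmentMB, totalMemMB uint64, thresholdFactor float64) bool {
	return segmentMB <= maxLoadableMB(totalMemMB, thresholdFactor)
}

func main() {
	const totalMemMB = 12 * 1024 // 12 GB query node
	const factor = 0.9           // assumed threshold factor

	for _, seg := range []uint64{512, 4096, 21010} {
		fmt.Printf("segment %5d MB fits in %d MB: %v\n", seg, totalMemMB, fits(seg, totalMemMB, factor))
	}
}
```

Under these assumptions the 512 MB and 4096 MB segments fit, while the 21010 MB segment does not.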

xiaofan-luan commented on September 7, 2024

With 12 GB of memory, a 2 or 4 GB segment size is what we recommend.

xiaofan-luan commented on September 7, 2024

You can always tune the parameter to remove the check, but this may cause OOM and other unexpected behaviours.

artin-recollective commented on September 7, 2024

Regarding "with 12GB memory, 2/4GB segment size is what we recommended":

I am using all the default configs for segment size, so it should be 512, per: https://github.com/milvus-io/milvus/blob/v2.3.12/configs/milvus.yaml

artin-recollective commented on September 7, 2024

@sunby Is there any way I can find the segments that are larger than expected? For example, in S3, or with an SDK call that shows the segment size in MB?

xiaofan-luan commented on September 7, 2024

You can use birdwatcher to get all the segment sizes.

stale commented on September 7, 2024

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Rotten issues close after 30d of inactivity. Reopen the issue with /reopen.
