Comments (19)
It is int64. Probably no danger?
You are right. If we are writing at a speed of 50k IOPS, it would take 5,849,424 years to overflow. None of us will see that day :)))
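(Rough math, assuming a constant 50,000 increments per second: the int64 max is about 9.22 × 10^18, and 50,000 × 31,536,000 seconds/year ≈ 1.58 × 10^12 increments per year, which gives roughly 5.85 million years before overflow.)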
from longhorn.
Update: a tiny improvement to the revision counter logic
Current implementation
We are converting the int64 to a string and then to a slice of bytes: https://github.com/longhorn/longhorn-engine/blob/be5e072e02cd959ab371a04264957624223b01fb/pkg/replica/revision_counter.go#L45-L46
The strconv.FormatInt function seems to eat noticeable CPU time in profiling.
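For reference, the current conversion is roughly equivalent to this one-liner (illustrative only, not the exact lines from the linked file):

```go
// Illustrative only: the current write path formats the counter as a decimal
// string and copies its bytes into the buffer that is written to disk.
// The repeated string formatting is where strconv.FormatInt shows up in profiles.
copy(revisionCounterBuf, []byte(strconv.FormatInt(counter, 10)))
```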
Modification
Use binary.LittleEndian to encode and decode the int64:
```go
func (r *Replica) writeRevisionCounter(counter int64) error {
	if r.revisionFile == nil {
		return fmt.Errorf("BUG: revision file wasn't initialized")
	}

	// revisionCounterBuf is the pre-allocated buffer that gets persisted to
	// the revision counter file (defined elsewhere in the package).
	copy(revisionCounterBuf, int64ToBytes(counter))
	if _, err := r.revisionFile.WriteAt(revisionCounterBuf, 0); err != nil {
		return errors.Wrap(err, "failed to write to revision counter file")
	}
	return nil
}

// int64ToBytes encodes the counter as 8 little-endian bytes.
func int64ToBytes(n int64) []byte {
	bytes := make([]byte, 8)
	binary.LittleEndian.PutUint64(bytes, uint64(n))
	return bytes
}

// bytesToInt64 decodes 8 little-endian bytes back into an int64.
func bytesToInt64(bytes []byte) int64 {
	return int64(binary.LittleEndian.Uint64(bytes))
}
```
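For completeness, a hypothetical read-side counterpart might look like the sketch below (the method name and the assumption that revisionFile supports ReadAt are illustrative, not the actual longhorn-engine code):

```go
// Hypothetical sketch of the matching read path, assuming the same
// revisionFile field and the fixed 8-byte little-endian encoding above.
func (r *Replica) readRevisionCounterLE() (int64, error) {
	if r.revisionFile == nil {
		return 0, fmt.Errorf("BUG: revision file wasn't initialized")
	}
	buf := make([]byte, 8)
	if _, err := r.revisionFile.ReadAt(buf, 0); err != nil {
		return 0, errors.Wrap(err, "failed to read from revision counter file")
	}
	return bytesToInt64(buf), nil
}
```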
Result
A small improvement in write performance, maybe 1-2%.
However, the implementation cost is big: we would need to migrate the on-disk counter from the old string encoding to the new binary encoding.
I think this is not a good idea.
from longhorn.
This is potentially the 3rd bottleneck:
- For some reason, we always have to do an alignment for each read request, which means going through this code path https://github.com/longhorn/longhorn-engine/blob/be5e072e02cd959ab371a04264957624223b01fb/vendor/github.com/longhorn/sparse-tools/sparse/file.go#L120-L123
- This doesn't happen for write requests, i.e., write requests don't have to go through this code path https://github.com/longhorn/longhorn-engine/blob/be5e072e02cd959ab371a04264957624223b01fb/vendor/github.com/longhorn/sparse-tools/sparse/file.go#L133-L136
cc @shuo-wu @ejweber @derekbit
from longhorn.
This is potentially the 3rd bottleneck:
- For some reason, we always have to do an alignment for each read request, which means going through this code path https://github.com/longhorn/longhorn-engine/blob/be5e072e02cd959ab371a04264957624223b01fb/vendor/github.com/longhorn/sparse-tools/sparse/file.go#L120-L123
- This doesn't happen for write requests, i.e., write requests don't have to go through this code path https://github.com/longhorn/longhorn-engine/blob/be5e072e02cd959ab371a04264957624223b01fb/vendor/github.com/longhorn/sparse-tools/sparse/file.go#L133-L136
We already know the input data []byte needs to be aligned for direct IO. Can we always allocate an aligned buffer for data on the caller end to prevent the extra data copy in the two functions?
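A minimal sketch of what caller-side alignment could look like, assuming a power-of-two block size such as 4096 for direct I/O (the helper name alignedBuffer is hypothetical, not existing longhorn/sparse-tools code):

```go
import "unsafe"

// alignedBuffer returns a slice of length size whose backing array starts on
// an align-byte boundary (align must be a power of two, e.g. 4096 for direct
// I/O), so the read/write helpers would not need to copy data into their own
// aligned buffers.
func alignedBuffer(size, align int) []byte {
	buf := make([]byte, size+align)
	shift := int(uintptr(unsafe.Pointer(&buf[0])) & uintptr(align-1))
	if shift != 0 {
		shift = align - shift
	}
	return buf[shift : shift+size]
}
```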
from longhorn.
We already know the input data []byte needs to be aligned for direct IO. Can we always allocate an aligned buffer for data on the caller end to prevent the extra data copy in the two functions?
Agree with this idea. I will give it a try to see if it can improve the read speed.
@derekbit Looks like the performance gets worse. So we will not do this:
IOPS (Read/Write)
Random: 45,395 / 19,553
Sequential: 62,778 / 33,106
Bandwidth in KiB/sec (Read/Write)
Random: 714,421 / 348,190
Sequential: 976,918 / 353,571
Latency in ns (Read/Write)
Random: 821,898 / 455,709
Sequential: 814,332 / 456,455
from longhorn.
Update:
There is potentially another bottleneck in the read flow in the replica. For every read, we have to make a system call, os.(*File).Stat, to get the size of the snapshot file here. This system call is eating noticeable CPU. Do you think we can cache the snapshot size value so we don't have to issue a system call for every read? @shuo-wu @derekbit @ejweber
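A minimal sketch of the caching idea, with hypothetical names (cachedSizeFile is not the actual replica type; it just illustrates stat-once-and-cache):

```go
import "os"

// cachedSizeFile wraps a snapshot file and remembers its size so the hot read
// path does not need to call Stat (an fstat syscall) for every request.
type cachedSizeFile struct {
	f    *os.File
	size int64
}

func newCachedSizeFile(f *os.File) (*cachedSizeFile, error) {
	info, err := f.Stat() // one Stat up front instead of one per read
	if err != nil {
		return nil, err
	}
	return &cachedSizeFile{f: f, size: info.Size()}, nil
}

// Size returns the cached value; it is only valid while the snapshot file
// stays immutable (which the ctime/checksum mechanism is expected to guard).
func (c *cachedSizeFile) Size() int64 { return c.size }
```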
from longhorn.
We already know the input data []byte needs to be aligned for direct IO.
Can we always allocate an aligned buffer for data on the caller end to prevent the extra data copy in the two functions?
Agree with this idea. I will give it a try to see if it can improve the read speed
from longhorn.
A bit of a side track: do you think there is a danger of integer overflow when we keep increasing the counter over here https://github.com/longhorn/longhorn-engine/blob/be5e072e02cd959ab371a04264957624223b01fb/pkg/replica/revision_counter.go#L143 ?
If we keep writing long enough, it might become negative.
It is int64. Probably no danger?
from longhorn.
@shuo-wu This is the CPU usage of strconv.FormatInt that we were talking about in the sync-up. It is small but noticeable.
from longhorn.
For every read, we have to make a system call, os.(*File).Stat, to get the size of the snapshot file here.
Good catch! I am fine with using the cached size. The snapshot immutability should be guaranteed by the ctime/checksum mechanism
from longhorn.
Do you think we should backport this to v1.6.x and v1.5.x? @shuo-wu @ejweber @derekbit
from longhorn.
Do you think we should backport this to v1.6.x and v1.5.x? @shuo-wu @ejweber @derekbit
I don't think there is much danger here, especially around compatibility, as the changes are all self-contained within the engine. On the other hand, the connection count will increase (and possibly CPU utilization as well) as a tradeoff for the performance gain. Maybe that's surprising in a patch update?
I lean slightly towards backporting, but I'm also fine with the gains being associated with a new minor version of Longhorn. I'll defer to whatever you decide.
from longhorn.
A bit of a side track: do you think there is a danger of integer overflow when we keep increasing the counter over here https://github.com/longhorn/longhorn-engine/blob/be5e072e02cd959ab371a04264957624223b01fb/pkg/replica/revision_counter.go#L143 ?
If we keep writing long enough, it might become negative.
from longhorn.
As discussed with @shuo-wu in the sync-up, we reran some of the Broadcom tests. The setup is the same as in our initial Broadcom report in the SUSE data center.
Case 1: Under CPU/RAM constraints
- Random read IOPS test
  - IOPS: similar
  - CPU usage: similar
- Sequential read IOPS test
  - IOPS: similar
  - CPU usage: similar
- Random write IOPS test
  - IOPS: increased from 2,988 to 3,150
  - CPU usage: similar
- Sequential write IOPS test
  - IOPS: similar
  - CPU usage: similar
Case 2: No CPU/RAM constraints and workloads NOT rate-limited
With master-head engine:
IOPS (Read/Write)
Random: 26,603 / 28,501
Sequential: 49,915 / 43,092
Bandwidth in KiB/sec (Read/Write)
Random: 1,587,307 / 505,325
Sequential: 1,827,467 / 535,291
Latency in ns (Read/Write)
Random: 272,445 / 260,084
Sequential: 273,663 / 257,785
With the PR:
IOPS (Read/Write)
Random: 26,946 / 33,542
Sequential: 50,484 / 51,998
Bandwidth in KiB/sec (Read/Write)
Random: 1,595,310 / 541,211
Sequential: 1,611,390 / 528,907
Latency in ns (Read/Write)
Random: 274,136 / 262,201
Sequential: 277,235 / 252,450
Note: in this case the read performance somehow doesn't increase. I believe it is something weird with tgt on Photon RT OS. Even with tgt + a local file backend, it still gives the same read performance as tgt + Longhorn => the bottleneck is at tgt.
Conclusion
- Does the increased number of connections between the engine and replicas hurt a resource-constrained environment?
  -> Looks like no
from longhorn.
Btw, this article is very helpful in understanding a common mistake in Go concurrency: https://eli.thegreenplace.net/2019/go-internals-capturing-loop-variables-in-closures/
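As a quick illustration of the mistake the article covers (loop-variable capture in closures, which behaves this way in Go versions before 1.22):

```go
package main

import (
	"fmt"
	"sync"
)

func main() {
	var wg sync.WaitGroup

	// Buggy (pre-Go 1.22): every goroutine captures the same loop variable i,
	// so most of them print whatever value i holds when they finally run.
	for i := 0; i < 3; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			fmt.Println("buggy:", i)
		}()
	}
	wg.Wait()

	// Fix: shadow the loop variable (or pass it as an argument) so each
	// goroutine captures its own copy.
	for i := 0; i < 3; i++ {
		i := i
		wg.Add(1)
		go func() {
			defer wg.Done()
			fmt.Println("fixed:", i)
		}()
	}
	wg.Wait()
}
```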
from longhorn.
After testing, the snapshot size caching idea doesn't seem to improve performance in practice, even though it sounds good in theory. CPU usage is also about the same. Let's abandon it. Sorry for the noise.
from longhorn.
Pre Ready-For-Testing Checklist
- Where is the reproduce steps/test steps documented?
  The reproduce steps/test steps are at: #8436 (comment)
- Is there a workaround for the issue? If so, where is it documented?
  The workaround is at:
- Does the PR include the explanation for the fix or the feature?
- Does the PR include deployment change (YAML/Chart)? If so, where are the PRs for both YAML file and Chart?
  The PR for the YAML change is at:
  The PR for the chart change is at:
- Have the backend code been merged (Manager, Engine, Instance Manager, BackupStore etc) (including backport-needed/*)?
  The PR is at: longhorn/longhorn-engine#1085
- Which areas/issues this PR might have potential impacts on?
  Area: Volume Read/Write
  Issues:
- If labeled: require/LEP Has the Longhorn Enhancement Proposal PR submitted?
  The LEP PR is at:
- If labeled: area/ui Has the UI issue filed or ready to be merged (including backport-needed/*)?
  The UI issue/PR is at:
- If labeled: require/doc Has the necessary document PR submitted or merged (including backport-needed/*)?
  The documentation issue/PR is at:
- If labeled: require/automation-e2e Has the end-to-end test plan been merged? Have QAs agreed on the automation test case? If only test case skeleton w/o implementation, have you created an implementation issue (including backport-needed/*)?
  The automation skeleton PR is at:
  The automation test case PR is at:
  The issue of automation test case implementation is at (please create by the template):
- If labeled: require/automation-engine Has the engine integration test been merged (including backport-needed/*)?
  The engine automation PR is at:
- If labeled: require/manual-test-plan Has the manual test plan been documented?
  The updated manual test plan is at:
- If the fix introduces the code for backward compatibility Has a separate issue been filed with the label release/obsolete-compatibility?
  The compatibility issue is filed at:
from longhorn.
Test Plan:
- Create a cluster of 3 worker nodes with a similarly big spec, e.g. Equinix Metal m3.small.x86, Ubuntu 22.04, kernel 5.15.0-101-generic, 20 Gbps network
- Deploy Longhorn 1.6.1
- Run kbench
- Upgrade Longhorn to master-head and upgrade the engine to master-head
- Run kbench again
- Verify that you see a better IOPS result
- See the issue description for an example
from longhorn.
Verified on master-head 20240508
- longhorn master-head 8026d1a
- longhorn-engine master-head https://github.com/longhorn/longhorn-engine/commit/
The test steps: #8436 (comment)
Result: Passed
- In the AWS EC2 t2.xlarge environment, I did not observe significant differences in IOPS. Therefore, I decided to build a cluster on a local virtual machine to conduct this test.
The test result is as follows.
v1.6.1
=========================
FIO Benchmark Summary
For: test_device
CPU Idleness Profiling: disabled
Size: 30G
Quick Mode: disabled
=========================
IOPS (Read/Write)
Random: 10,329 / 4,684
Sequential: 17,677 / 10,204
Bandwidth in KiB/sec (Read/Write)
Random: 413,252 / 141,958
Sequential: 461,723 / 170,358
Latency in ns (Read/Write)
Random: 431,773 / 553,568
Sequential: 357,682 / 547,949
master-head
=========================
FIO Benchmark Summary
For: test_device
CPU Idleness Profiling: disabled
Size: 30G
Quick Mode: disabled
=========================
IOPS (Read/Write)
Random: 11,024 / 5,082
Sequential: 18,149 / 10,672
Bandwidth in KiB/sec (Read/Write)
Random: 415,402 / 137,600
Sequential: 445,611 / 167,325
Latency in ns (Read/Write)
Random: 409,800 / 525,464
Sequential: 342,326 / 525,112
Here is the summary table:

| | IOPS (Random Read/Write) | IOPS (Sequential Read/Write) |
|---|---|---|
| v1.6.1-1st | 10,329 / 4,684 | 17,677 / 10,204 |
| master-head-1st | 11,024 / 5,082 | 18,149 / 10,672 |

| | Bandwidth KiB/sec (Random Read/Write) | Bandwidth KiB/sec (Sequential Read/Write) |
|---|---|---|
| v1.6.1-1st | 413,252 / 141,958 | 461,723 / 170,358 |
| master-head-1st | 415,402 / 137,600 | 445,611 / 167,325 |

| | Latency ns (Random Read/Write) | Latency ns (Sequential Read/Write) |
|---|---|---|
| v1.6.1-1st | 431,773 / 553,568 | 357,682 / 547,949 |
| master-head-1st | 409,800 / 525,464 | 342,326 / 525,112 |
Hi @PhanLe1010
I think we can close this ticket. If there are any concerns regarding the test results, please let me know.
from longhorn.