Pathological performance of file_handle_cache read buffer when synchronising queues (rabbitmq-server, CLOSED)

rabbitmq commented on May 21, 2024

Pathological performance of file_handle_cache read buffer when synchronising queues

from rabbitmq-server.

Comments (7)

simonmacmullen commented on May 21, 2024

> Hard code a smaller buffer size

Seems rather sad.

> Dynamically shrink the buffer size if we determine it is not working

This is the approach I've gone for, primarily because it should be able to stop other pathological behaviour.

> Read the buffer backwards from our seek point if we detect we are seeking backwards

This might be a nice option to get syncing going still faster, but it's also fiddly and only solves this exact problem. I'll settle for having it no worse than it was in 3.4.4.

simonmacmullen commented on May 21, 2024

Just for clarity, this bug:

- Only affects 3.5.0
- Requires messages to be larger than the queue index embedding threshold (by default 4kB)
- Requires messages to be paged out before synchronisation starts

You can see in the I/O stats on the master that if (say) 250 messages are read from disk per second, we also read 250 MB/s, even though each message is much smaller than 1 MB.
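
The figure above is plain arithmetic: if every disk read fills the whole read buffer, the bytes read per second equal the reads per second times the buffer size, whatever the message size. A minimal Python sketch, assuming a 1 MB read buffer (the size reported later in this thread for the stable branch), one read per message, and a 10 kB message as used in the tests below. The helper names `disk_read_rate` and `useful_fraction` are hypothetical and only serve the illustration.

```python
MB = 1_000_000  # decimal megabyte, chosen here only to keep the numbers round

def disk_read_rate(reads_per_second: int, buffer_size: int) -> int:
    """Bytes read from disk per second when every read fills the buffer."""
    return reads_per_second * buffer_size

def useful_fraction(message_size: int, buffer_size: int) -> float:
    """Share of each buffer fill that is actually message payload."""
    return min(message_size, buffer_size) / buffer_size

# Assumption: 250 reads/s, one message per read, 1 MB buffer.
rate = disk_read_rate(250, 1 * MB)
print(rate / MB)  # 250.0

# Assumption: 10 kB (10240-byte) messages, as in the PerfTest runs below.
print(useful_fraction(10_240, 1 * MB))
```

Under these assumptions only about 1% of each buffer fill is payload, which is why the read rate in MB/s tracks the number of reads rather than the message size.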

dumbbell commented on May 21, 2024

Here is what I did to test the correction:

I started two nodes, A and B, with a very low vm_memory_high_watermark to make them page messages out early, clustered them, and added the following HA policy on node B:
```
rabbitmqctl -n B set_policy ha-all "." '{"ha-mode":"all"}'
```
I stopped node B using:
```
rabbitmqctl -n B stop_app
```
I used PerfTest to publish 10 kB messages with a rate-limited consumer so messages stay in RabbitMQ:
```
PerfTest -s 10240 -R 100
```
The producer could publish around 40,000 messages before being throttled.
I started node B again and forced synchronisation from the management UI.

With the stable branch, the management UI reports I/O read rates of:

- 150 messages/s
- 150 MB/s

With the rabbitmq-server-69 branch (this fix), it reports:

- 1000 messages/s
- 15 MB/s

I logged the size of the read buffer in file_handle_cache.erl at the same time. With stable, the buffer remains at an expected 1MB size. With the fix, the size continuously switches between 10468 and 20936, with an occasional jump to 4 MB.

simonmacmullen commented on May 21, 2024

Note that you don't need the `-R 100`; you can use `-y0 -u test -p` to get PerfTest to publish to a queue with no consumers, which might be easier to work with.

The 4MB sizes probably refer to other files (queue index files?)

Not sure whether the flicking between 10468 and 20936 is worth fixing, what do you think?

simonmacmullen commented on May 21, 2024

Oh, also you can set a very low `vm_memory_high_watermark_paging_ratio` rather than `vm_memory_high_watermark`; that way you can publish indefinitely but get paged out rapidly.

dumbbell commented on May 21, 2024

One correction to my previous comment:

> 150 messages/s
> 1000 messages/s

Those should read:

- 150 reads/s
- 1000 reads/s

> The 4MB sizes probably refer to other files (queue index files?)

You're right, the file handle differs for those reads.

> Not sure whether the flicking between 10468 and 20936 is worth fixing, what do you think?

After a test:

With the flickering buffer:
- 1000 reads/s
- 15 MB/s from disc
- 10-12 MB/s sent to node B
With a constant buffer:
- 750 reads/s
- 7 MB/s from disc
- 7 MB/s sent to node B

In the first case, we read 20 kB but only use 10 kB, then we read 10 kB, then we double again, and so on. We don't do this in the second case (we always read 10 kB). When comparing the number of reads to the throughput, we see a 33% decrease in throughput in the second case, corresponding to not wasting 10 kB. However, I can't explain why it is slower...
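
The alternation described above can be reproduced with a small simulation. This is only a sketch under stated assumptions, not the actual `file_handle_cache` code: the resize rule in `next_buffer_size` is hypothetical (double when the buffer was fully consumed, shrink to the used amount otherwise), and each read is assumed to serve exactly one 10468-byte message because the next message lies before the current seek position.

```python
def next_buffer_size(current: int, used: int) -> int:
    """Hypothetical resize rule, NOT the real file_handle_cache logic."""
    if used >= current:
        return current * 2  # buffer fully consumed: try a bigger one
    return used             # part of the buffer was wasted: shrink to fit

def simulate(message_size: int, start_size: int, reads: int) -> list:
    """Return the buffer size in effect for each successive read."""
    sizes = []
    size = start_size
    for _ in range(reads):
        sizes.append(size)
        used = min(message_size, size)  # assumption: one message per read
        size = next_buffer_size(size, used)
    return sizes

sizes = simulate(message_size=10_468, start_size=1_048_576, reads=7)
print(sizes)

# Mean read size once the alternation has settled (reads 2 and 3).
mean_read = sum(sizes[1:3]) / 2
print(mean_read)  # 15702.0
```

Under this rule the sizes settle into alternating 10468 and 20936 after the first read, the same two values logged earlier in the thread. The mean of 15702 bytes per read would give about 15.7 MB/s at 1000 reads/s, in the same range as the 15 MB/s reported, while a constant 10468-byte buffer reads about a third less per operation.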

dumbbell commented on May 21, 2024

Here are new, more meaningful numbers comparing stable and 8faf4ee.

The protocol is:

1. Start nodes A and B, cluster them, add an HA policy.
2. Create a queue from the management UI.
3. Stop node B.
4. Use PerfTest to queue 300,000 messages, which is enough to page them out (the filesystem is tmpfs). No clients are connected after that.
5. Start B and force synchronization. While this happens, look at the time the full sync takes, as well as I/O and network statistics.

Results with stable:

- Synchronization finished in 1'55".
- Reads: 1600/s (1.6 GiB/s)
- Network (from A to B): 18 MiB/s while messages are paged in, then 58 MiB/s

Results with 8faf4ee:

- Synchronization finished in 1'10".
- Reads: 4500/s (57 MiB/s)
- Network (from A to B): 58 MiB/s
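
To put the two result sets side by side, the sketch below derives the sync speedup and the average amount read per read operation from the figures above. It only uses the numbers quoted in this comment. The helper names `to_seconds` and `bytes_per_read` are hypothetical, and the binary units (MiB, GiB) are taken at face value from the report.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3

def to_seconds(minutes: int, seconds: int) -> int:
    """Convert a duration such as 1'55" into seconds."""
    return minutes * 60 + seconds

def bytes_per_read(bytes_per_second: float, reads_per_second: float) -> float:
    """Average number of bytes fetched by one read operation."""
    return bytes_per_second / reads_per_second

# Figures copied from the results above.
stable = {"sync_s": to_seconds(1, 55), "reads": 1600, "read_rate": 1.6 * GIB}
fixed = {"sync_s": to_seconds(1, 10), "reads": 4500, "read_rate": 57 * MIB}

speedup = stable["sync_s"] / fixed["sync_s"]
print(f"sync speedup: {speedup:.2f}x")

for name, run in (("stable", stable), ("8faf4ee", fixed)):
    size = bytes_per_read(run["read_rate"], run["reads"])
    print(f"{name}: about {size / 1024:.0f} KiB per read")
```

By these figures, stable fetches roughly 1 MiB per read while 8faf4ee fetches roughly 13 KiB, and the full sync completes about 1.6 times faster.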
