I have a cluster of ~300 nodes and mesh is consuming ~1-2 GB of RAM. I dug into it and


High memory/CPU utilization for moderately sized cluster (mesh issue, closed)

weaveworks commented on May 24, 2024


Comments (5)

bboreham commented on May 24, 2024

The idea is that gossip messages are not sent very often - Weave Mesh selects log(number of connections) peers to send to - so the scalability should be good.
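To make the log(number of connections) idea concrete, here is a minimal sketch in Python (not the actual Weave Mesh code, which is written in Go). The function name `pick_gossip_targets`, the base-2 logarithm, and the uniform random choice are all assumptions for illustration.

```python
import math
import random


def pick_gossip_targets(connections, rng=random):
    """Pick roughly log2(len(connections)) distinct peers to gossip to.

    Assumption: base-2 log and uniform sampling; the real library may differ.
    """
    n = len(connections)
    if n == 0:
        return []
    # Always gossip to at least one peer when any connection exists.
    k = max(1, math.ceil(math.log2(n)))
    return rng.sample(connections, min(k, n))


if __name__ == "__main__":
    peers = [f"peer-{i}" for i in range(299)]  # 299 connections in a 300-node mesh
    print(len(pick_gossip_targets(peers)))
```

With 299 connections this picks 9 peers per round, so the per-update fan-out grows very slowly with cluster size.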

However, various other issues in the code mean that messages are sent far more often than this ideal. #101, #106 and #107 are attempts to improve matters, though work is ongoing to understand the full set of causes.


jacksontj commented on May 24, 2024

From my use-case of protokube (kubernetes/kops#7427) I'm seeing ~2 cores of CPU usage and ~3 GB of RAM usage with a fully connected mesh of ~300 nodes. This seems to highlight some serious scale limitations of weaveworks/mesh, as that isn't even a very large cluster. More importantly, the utilization ramp-up was more-or-less exponential as more nodes were added.

Even with a custom build that includes the #107 fix, CPU usage only dropped to 1.6 cores, which is still far too much (all the CPU time was being spent marshaling/unmarshaling the peer list being gossiped around).

There seem to be quite a few issues. A couple: (1) there is no concept of a "suspect" state; (2) peer messages include the list of all peers it has connected to, which scales with cluster size. There are likely more, but TBH I have decided to instead spend my time swapping protokube to a more robust/reliable gossip library.


bboreham commented on May 24, 2024

We're seeing quite positive results from deferring gossip updates - #117 and #118.
Would you like to try those in your build?
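For readers unfamiliar with the idea, deferring gossip updates can be pictured as coalescing changes and sending them at most once per interval. This is a generic sketch, not what #117 and #118 actually implement; the class `DeferredGossip`, its parameters, and the explicit `now` timestamps are hypothetical.

```python
class DeferredGossip:
    """Coalesce updates and send them at most once per `interval` seconds."""

    def __init__(self, interval, send):
        self.interval = interval
        self.send = send          # callback that receives a sorted list of names
        self.pending = set()      # names changed since the last flush
        self.last_flush = None    # timestamp of the last flush, None before the first

    def update(self, peer_name, now):
        """Record a change; duplicates within an interval collapse into one entry."""
        self.pending.add(peer_name)
        self.maybe_flush(now)

    def maybe_flush(self, now):
        """Send pending changes if the interval has elapsed since the last flush."""
        if not self.pending:
            return
        if self.last_flush is not None and now - self.last_flush < self.interval:
            return
        self.send(sorted(self.pending))
        self.pending.clear()
        self.last_flush = now


if __name__ == "__main__":
    sent = []
    gossip = DeferredGossip(interval=1.0, send=sent.append)
    for name, t in [("a", 0.0), ("b", 0.2), ("c", 0.5), ("b", 0.9)]:
        gossip.update(name, t)
    gossip.maybe_flush(1.0)
    print(sent)
```

Four updates result in two sends here. A real implementation would also need a timer to call `maybe_flush`, otherwise the last batch waits until the next update arrives.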


bboreham commented on May 24, 2024

jacksontj wrote: "peer messages include the list of all peers it has connected"

It's worse than that: the topology message lists all the connections of all peers. In other words, in a fully connected cluster its size is O(N^2).

However, for 300 nodes that might be 8MB per message, so something else is needed to get to 1-2GB.

We found that the connections would each read a message and then block on the Peers lock to apply the update. So with 200 connections each holding an 8MB message, that's 1.6GB.
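The arithmetic above can be checked with a quick sketch. The 90 bytes per connection entry is an assumption back-derived from the 8MB estimate, not a measured value, and the function names are hypothetical.

```python
def topology_entries(n_peers):
    """Connection entries in a fully connected mesh: each peer lists every other peer."""
    return n_peers * (n_peers - 1)


def message_bytes(n_peers, bytes_per_entry):
    """Approximate size of one topology message."""
    return topology_entries(n_peers) * bytes_per_entry


def blocked_memory_bytes(n_connections, msg_bytes):
    """Memory held if every connection has read one message and is waiting on the lock."""
    return n_connections * msg_bytes


if __name__ == "__main__":
    BYTES_PER_ENTRY = 90  # assumption, chosen so 300 nodes gives roughly 8MB
    print(topology_entries(300))                  # 89700
    print(message_bytes(300, BYTES_PER_ENTRY))    # 8073000, about 8MB
    print(blocked_memory_bytes(200, 8_000_000))   # 1600000000, i.e. 1.6GB
```

Since the message size grows with N^2 and the number of blocked connections grows with N, the worst case in this model grows roughly with N^3.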

Changing the "everyone sends everything" behaviour is quite a big change, so ahead of that I felt that just slowing down the initial connections would help (#124). After the initial connection, updates only go to log N peers, so we shouldn't get the massive spikes.


bboreham commented on May 24, 2024

I'm going to close this issue now that 0.4 is released. If you want to come back to the discussion, please do.

