weaveworks / mesh
A tool for building distributed applications.
License: Apache License 2.0
Two PRs languished for ~5 months, which is not good.
Perhaps we could have an explicit list of maintainers, and an instruction to @-mention them to gain attention?
For alertmanager, we'd like to be able to update the set of peers in a gossip based on discovered Kubernetes pods. What's the right way to do this?
Currently, I'm doing something like
meshRouter.ConnectionMaker.InitiateConnections(peers, true)
where meshRouter is a mesh.Router.
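A self-contained sketch of this refresh loop, under the assumption that discovered peers arrive as plain host:port strings; initiateConnections stands in for the real mesh.ConnectionMaker call, and the self-address filter is my own addition, not something mesh requires:

```go
package main

import "fmt"

// initiateConnections is a stand-in for the call described above,
// meshRouter.ConnectionMaker.InitiateConnections(peers, true): it is
// handed the full desired peer list, with replace=true discarding
// previously supplied targets.
func initiateConnections(peers []string, replace bool) {
	fmt.Printf("InitiateConnections(%v, replace=%v)\n", peers, replace)
}

// refreshPeers filters our own listen address out of the discovered
// pod addresses before handing them to the connection maker, so the
// router is never asked to dial itself.
func refreshPeers(discovered []string, self string) []string {
	peers := make([]string, 0, len(discovered))
	for _, p := range discovered {
		if p != self {
			peers = append(peers, p)
		}
	}
	initiateConnections(peers, true)
	return peers
}

func main() {
	// Addresses as they might arrive from a Kubernetes pod watch.
	discovered := []string{"10.0.0.1:6783", "10.0.0.2:6783", "10.0.0.3:6783"}
	fmt.Println(refreshPeers(discovered, "10.0.0.2:6783"))
	// → [10.0.0.1:6783 10.0.0.3:6783]
}
```

This also bears on the sub-question below: filtering the local address up front sidesteps the "what if peers includes the current node" case entirely.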
Sub-question: if peers includes the current node, what happens?

We set a deadline on reads but not on writes. This means that if something is blocking the other end of the connection, we will wait forever.
Since we expect connected peers to be fairly live, a deadline around the length of a heartbeat should allow us to error on stalled peers while not affecting normal operation.
Worst case, a connection would be torn down then re-established.
Seen at weaveworks/weave#3762
mesh performs peers.garbageCollect on two occasions: OnGossipBroadcast for topology gossip data

Peers.garbageCollect() invokes Peer.routes(), which is an O(n^2) operation and hence causes significant CPU usage when there is significant topology change.
The following metrics (gathered as calls per second) were collected with an instrumented mesh on a 150-node Kubernetes cluster running weave-net, which uses mesh. It's not uncommon for someone to apply a DaemonSet, which results in each node connecting to the rest of the peers (so n^2 connections), causing significant topology changes and hence topology gossip.
===================================================================
2019-09-17 7:22:0 Peers.garbageCollect(): 365
2019-09-17 7:22:0 routes.calculate() -> routes.calculateBroadcast(): 59
2019-09-17 7:22:0 routes.lookupOrCalculate() -> routes.calculateBroadcast(): 335
2019-09-17 7:22:0 routes.calculateUnicast(): 119
2019-09-17 7:22:0 connectionMaker.refresh(): 63
2019-09-17 7:22:0 rx gossip unicast: 0
2019-09-17 7:22:0 rx gossip broadcast: 325
2019-09-17 7:22:0 gossip broadcast - relay broadcasts: 345
2019-09-17 7:22:0 gossip broadcast - topology updates: 1
===================================================================
2019-09-17 7:22:1 Peers.garbageCollect(): 347
2019-09-17 7:22:1 routes.calculate() -> routes.calculateBroadcast(): 68
2019-09-17 7:22:1 routes.lookupOrCalculate() -> routes.calculateBroadcast(): 328
2019-09-17 7:22:1 routes.calculateUnicast(): 135
2019-09-17 7:22:1 connectionMaker.refresh(): 70
2019-09-17 7:22:1 rx gossip unicast: 0
2019-09-17 7:22:1 rx gossip broadcast: 316
2019-09-17 7:22:1 gossip broadcast - relay broadcasts: 324
2019-09-17 7:22:1 gossip broadcast - topology updates: 0
===================================================================
2019-09-17 7:22:2 Peers.garbageCollect(): 369
2019-09-17 7:22:2 routes.calculate() -> routes.calculateBroadcast(): 61
2019-09-17 7:22:2 routes.lookupOrCalculate() -> routes.calculateBroadcast(): 313
2019-09-17 7:22:2 routes.calculateUnicast(): 124
2019-09-17 7:22:2 connectionMaker.refresh(): 64
2019-09-17 7:22:2 rx gossip unicast: 0
2019-09-17 7:22:2 rx gossip broadcast: 315
2019-09-17 7:22:2 gossip broadcast - relay broadcasts: 343
2019-09-17 7:22:2 gossip broadcast - topology updates: 0
===================================================================
2019-09-17 7:22:3 Peers.garbageCollect(): 336
2019-09-17 7:22:3 routes.calculate() -> routes.calculateBroadcast(): 75
2019-09-17 7:22:3 routes.lookupOrCalculate() -> routes.calculateBroadcast(): 327
2019-09-17 7:22:3 routes.calculateUnicast(): 148
2019-09-17 7:22:3 connectionMaker.refresh(): 75
2019-09-17 7:22:3 rx gossip unicast: 0
2019-09-17 7:22:3 rx gossip broadcast: 322
2019-09-17 7:22:3 gossip broadcast - relay broadcasts: 326
2019-09-17 7:22:3 gossip broadcast - topology updates: 1
===================================================================
2019-09-17 7:22:4 Peers.garbageCollect(): 353
2019-09-17 7:22:4 routes.calculate() -> routes.calculateBroadcast(): 69
2019-09-17 7:22:4 routes.lookupOrCalculate() -> routes.calculateBroadcast(): 344
2019-09-17 7:22:4 routes.calculateUnicast(): 138
2019-09-17 7:22:4 connectionMaker.refresh(): 71
2019-09-17 7:22:4 rx gossip unicast: 0
2019-09-17 7:22:4 rx gossip broadcast: 339
2019-09-17 7:22:4 gossip broadcast - relay broadcasts: 337
2019-09-17 7:22:4 gossip broadcast - topology updates: 1
===================================================================
2019-09-17 7:22:5 Peers.garbageCollect(): 323
2019-09-17 7:22:5 routes.calculate() -> routes.calculateBroadcast(): 68
2019-09-17 7:22:5 routes.lookupOrCalculate() -> routes.calculateBroadcast(): 330
2019-09-17 7:22:5 routes.calculateUnicast(): 136
2019-09-17 7:22:5 connectionMaker.refresh(): 70
2019-09-17 7:22:5 rx gossip unicast: 0
2019-09-17 7:22:5 rx gossip broadcast: 328
2019-09-17 7:22:5 gossip broadcast - relay broadcasts: 311
2019-09-17 7:22:5 gossip broadcast - topology updates: 3
===================================================================
2019-09-17 7:22:6 Peers.garbageCollect(): 340
2019-09-17 7:22:6 routes.calculate() -> routes.calculateBroadcast(): 78
2019-09-17 7:22:6 routes.lookupOrCalculate() -> routes.calculateBroadcast(): 320
2019-09-17 7:22:6 routes.calculateUnicast(): 156
2019-09-17 7:22:6 connectionMaker.refresh(): 82
2019-09-17 7:22:6 rx gossip unicast: 0
2019-09-17 7:22:6 rx gossip broadcast: 321
2019-09-17 7:22:6 gossip broadcast - relay broadcasts: 322
2019-09-17 7:22:6 gossip broadcast - topology updates: 0
===================================================================
2019-09-17 7:22:7 Peers.garbageCollect(): 321
2019-09-17 7:22:7 routes.calculate() -> routes.calculateBroadcast(): 85
2019-09-17 7:22:7 routes.lookupOrCalculate() -> routes.calculateBroadcast(): 300
2019-09-17 7:22:7 routes.calculateUnicast(): 172
2019-09-17 7:22:7 connectionMaker.refresh(): 90
2019-09-17 7:22:7 rx gossip unicast: 0
2019-09-17 7:22:7 rx gossip broadcast: 296
2019-09-17 7:22:7 gossip broadcast - relay broadcasts: 309
2019-09-17 7:22:7 gossip broadcast - topology updates: 0
===================================================================
2019-09-17 7:22:8 Peers.garbageCollect(): 313
2019-09-17 7:22:8 routes.calculate() -> routes.calculateBroadcast(): 81
2019-09-17 7:22:8 routes.lookupOrCalculate() -> routes.calculateBroadcast(): 308
2019-09-17 7:22:8 routes.calculateUnicast(): 161
2019-09-17 7:22:8 connectionMaker.refresh(): 85
2019-09-17 7:22:8 rx gossip unicast: 0
2019-09-17 7:22:8 rx gossip broadcast: 309
2019-09-17 7:22:8 gossip broadcast - relay broadcasts: 291
2019-09-17 7:22:8 gossip broadcast - topology updates: 1
===================================================================
2019-09-17 7:22:9 Peers.garbageCollect(): 316
2019-09-17 7:22:9 routes.calculate() -> routes.calculateBroadcast(): 84
2019-09-17 7:22:9 routes.lookupOrCalculate() -> routes.calculateBroadcast(): 307
2019-09-17 7:22:9 routes.calculateUnicast(): 167
2019-09-17 7:22:9 connectionMaker.refresh(): 88
2019-09-17 7:22:9 rx gossip unicast: 0
2019-09-17 7:22:9 rx gossip broadcast: 302
2019-09-17 7:22:9 gossip broadcast - relay broadcasts: 306
2019-09-17 7:22:9 gossip broadcast - topology updates: 0
===================================================================
2019-09-17 7:22:10 Peers.garbageCollect(): 312
2019-09-17 7:22:10 routes.calculate() -> routes.calculateBroadcast(): 83
2019-09-17 7:22:10 routes.lookupOrCalculate() -> routes.calculateBroadcast(): 278
2019-09-17 7:22:10 routes.calculateUnicast(): 166
2019-09-17 7:22:10 connectionMaker.refresh(): 85
2019-09-17 7:22:10 rx gossip unicast: 0
2019-09-17 7:22:10 rx gossip broadcast: 275
2019-09-17 7:22:10 gossip broadcast - relay broadcasts: 300
2019-09-17 7:22:10 gossip broadcast - topology updates: 2
===================================================================
This would have picked up the issue in #70
For instance:
$ go test -tags 'peer_name_hash peer_name_alternative'
# github.com/weaveworks/mesh
./peer.go:50: name.bytes undefined (type PeerName has no field or method bytes)
./peer.go:60: name.bytes undefined (type PeerName has no field or method bytes)
./peer_name_hash.go:49: undefined: checkFatal
FAIL github.com/weaveworks/mesh [build failed]
The routes() calculation is an expensive O(n^3) operation, where n is the number of peers in the mesh.
Routes are calculated every time mesh receives a topology gossip update from a peer. When all consumers of the mesh library start at once (e.g. when a k8s cluster is starting up), or when a peer joins or leaves the mesh, each peer receives a topology update. At scale, topology update gossip can get noisy, resulting in high CPU usage from route calculation.
This issue is an enhancement request to add a rate limiter that limits the number of concurrent routes() calculations that can occur at a time, and to coalesce consecutive requests, effectively reducing the number of calculations.
I don't see a way to monitor updates to the changing status of peers (new peers added, peers removed, connections lost).
Our need is to gather statistics on the status of mesh peers. Did I miss something, or how do you see it?
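In the absence of a change-notification API, one workaround is to periodically snapshot the peer list (however the application obtains it) and diff successive snapshots. diffPeers below is an illustrative helper, not part of mesh:

```go
package main

import (
	"fmt"
	"sort"
)

// diffPeers compares two snapshots of the peer set and reports which
// peers appeared and which disappeared, suitable for feeding
// added/removed counters in a metrics system.
func diffPeers(prev, curr map[string]bool) (added, removed []string) {
	for p := range curr {
		if !prev[p] {
			added = append(added, p)
		}
	}
	for p := range prev {
		if !curr[p] {
			removed = append(removed, p)
		}
	}
	sort.Strings(added) // deterministic output despite map iteration order
	sort.Strings(removed)
	return
}

func main() {
	prev := map[string]bool{"peer-a": true, "peer-b": true}
	curr := map[string]bool{"peer-b": true, "peer-c": true}
	added, removed := diffPeers(prev, curr)
	fmt.Println(added, removed) // → [peer-c] [peer-a]
}
```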
Hello, I am a newbie to weaveworks/mesh. I want to use it to build a distributed app. I wonder if there is an example that I can learn from or some commercial version. Thanks for any help!
Sometimes machines have multiple IP addresses, and the one you wish peers to connect to is not the same as the one you connect out on.
When the local peer connects to or disconnects from remote peers (handleAddConnection, handleConnectionEstablished, handleDeleteConnection), there is a broadcastTopologyUpdate.
In the case of fully connected mesh nodes, as with Kubernetes deployed on AWS with weave-net, broadcast topology updates are redundant: each node may already be connected to the added/deleted node, so it can learn the topology first-hand.
It would be desirable to minimise broadcast topology updates, at least in the case of fully connected mesh nodes.
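One way such suppression could work, sketched with illustrative names (relayTargets is not a mesh function): when relaying a topology update about some peer, skip any neighbour that is itself directly connected to that peer, since it learns the change first-hand. In a fully connected mesh this suppresses the relay entirely.

```go
package main

import "fmt"

// relayTargets returns the neighbours that still need a topology update
// about subject: a neighbour directly connected to subject can be
// skipped. conns maps each peer to the set of peers it is directly
// connected to.
func relayTargets(neighbours []string, subject string, conns map[string]map[string]bool) []string {
	var out []string
	for _, n := range neighbours {
		if !conns[n][subject] {
			out = append(out, n)
		}
	}
	return out
}

func main() {
	conns := map[string]map[string]bool{
		"b": {"x": true}, // b already talks to x directly
		"c": {},          // c does not
	}
	fmt.Println(relayTargets([]string{"b", "c"}, "x", conns)) // → [c]
}
```

A caveat: this relies on connectivity knowledge being current, so a real implementation would need care around races between connection events and relays.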
The documentation on Gossiper.On* and GossipData.Merge could be clearer about whether I can modify the argument and return it again, or can return the receiver after successfully merging.
Original message from weaveworks/weave#1763:
"SurrogateGossiper does not (and cannot) know how to merge GossipData, so instead it just stashes it in a queue. That is potentially unbounded since downstream connections in gossip propagation may not be able to accept data at the same rate it is received."
Note that #66 ameliorated this, in the case that some messages are exact duplicates.
As the title says: if I submit an issue or PR, will someone take a look?
The implementation relies on Go's map iteration order being random, but it isn't very random.
Number of times a certain element x is extracted first from a #golang map with 1000 elements. Yup, heavily skewed: if you care about extracting fairly a random element from a map do *not* use `for i := range m { return i }`.
— Carlo Alberto Ferraris (@CAFxX) June 2, 2019
Another example demonstrates it is even worse on small maps. With two connections, randomNeighbours() will pick the first-added one 90% of the time, which caused the Weave Net tests to fail in #120.
On each connection add/delete/established event from a peer, the mesh router broadcasts topology updates to its peers. In a fully connected topology, the broadcast goes to all nodes in the mesh.
A received topology gossip is further relayed to peers if it is a new update. While this should not be a concern in a stable topology, it can be problematic in some use cases.
For example, combined with #114 and #115, which result in high CPU usage, chatty topology gossip has a cascading effect.
As the number of peers in the mesh increases, this significantly impacts scalability.
The following metrics were gathered with an instrumented mesh on a 150-node Kubernetes cluster running weave-net, which uses mesh. rx gossip broadcast is the number of received topology gossips per second.
(The metrics block here is identical to the instrumented-mesh block shown above.)
I'm running a server that uses mesh and got this panic:
2020/09/09 01:35:25 [service] configured logging provider (stderr)
badger 2020/09/09 01:35:25 INFO: All 0 tables opened in 0s
badger 2020/09/09 01:35:25 INFO: Replaying file id: 0 at offset: 0
badger 2020/09/09 01:35:25 INFO: Replay took: 2.48µs
badger 2020/09/09 01:35:25 DEBUG: Value log discard stats empty
2020/09/09 01:35:25 [service] configured message storage (ssd)
2020/09/09 01:35:25 [service] configured usage metering (noop)
2020/09/09 01:35:25 [service] configured contracts provider (single)
2020/09/09 01:35:25 [service] configured monitoring sink (self)
2020/09/09 01:35:25 [service] configured node name (ae:b1:cc:f5:0a:d9)
2020/09/09 01:35:25 [service] starting the listener (0.0.0.0:8080)
2020/09/09 01:35:25 [tls] unable to configure certificates, make sure a valid cache or certificate is configured
2020/09/09 01:35:25 [service] service started
2020/09/09 01:35:31 [swarm] peer created (ea:c5:b0:af:45:c1)
2020/09/09 01:35:38 [swarm] peer created (ba:54:8f:83:e8:02)
2020/09/09 01:35:38 [closing] panic recovered: runtime error: invalid memory address or nil pointer dereference
goroutine 842 [running]:
runtime/debug.Stack(0xc006775a88, 0x119d020, 0x1b2d020)
/usr/local/go/src/runtime/debug/stack.go:24 +0x9d
github.com/emitter-io/emitter/internal/broker.(*Conn).Close(0xc007b22090, 0xbfce2a370001db67, 0x4c72e81)
/go-build/src/github.com/emitter-io/emitter/internal/broker/conn.go:353 +0x32d
panic(0x119d020, 0x1b2d020)
/usr/local/go/src/runtime/panic.go:969 +0x166
github.com/weaveworks/mesh.(*gossipSender).Broadcast(0xc007a8b450, 0xaeb1ccf50ad9, 0x1521c80, 0xc007b1e250)
/go/pkg/mod/github.com/weaveworks/[email protected]/gossip.go:202 +0xb2
github.com/weaveworks/mesh.(*gossipChannel).relayBroadcast(0xc0000a3e00, 0xaeb1ccf50ad9, 0x1521c80, 0xc007b1e250)
/go/pkg/mod/github.com/weaveworks/[email protected]/gossip_channel.go:116 +0x10b
github.com/weaveworks/mesh.(*gossipChannel).GossipBroadcast(0xc0000a3e00, 0x1521c80, 0xc007b1e250)
/go/pkg/mod/github.com/weaveworks/[email protected]/gossip_channel.go:83 +0x4f
github.com/emitter-io/emitter/internal/service/cluster.(*Swarm).Notify(0xc0001757a0, 0x152efc0, 0xc0078d8d80, 0x1)
/go-build/src/github.com/emitter-io/emitter/internal/service/cluster/swarm.go:359 +0xb5
github.com/emitter-io/emitter/internal/broker.(*Conn).onConnect(0xc007b22090, 0xc0079a8f00, 0x4)
/go-build/src/github.com/emitter-io/emitter/internal/broker/conn.go:344 +0x1a6
github.com/emitter-io/emitter/internal/broker.(*Conn).onReceive(0xc007b22090, 0x152f080, 0xc0079a8f00, 0x0, 0x0)
/go-build/src/github.com/emitter-io/emitter/internal/broker/conn.go:194 +0x7aa
github.com/emitter-io/emitter/internal/broker.(*Conn).Process(0xc007b22090, 0x0, 0x0)
/go-build/src/github.com/emitter-io/emitter/internal/broker/conn.go:180 +0x1cb
created by github.com/emitter-io/emitter/internal/broker.(*Service).onAcceptConn
/go-build/src/github.com/emitter-io/emitter/internal/broker/service.go:315 +0x6e
2020/09/09 01:35:38 [query] presence query received ([1156454352 2843744046 1815237614 2486959251 1909563767])
Routes calculation is an O(n_peers^2) operation; it was optimised in
weaveworks/weave#1773
weaveworks/weave#1761
However, for use cases where the mesh of nodes is fully connected (i.e. each node is a single hop away), this topology could be treated as a special case when calculating the routes.
This topology is very common in Kubernetes clusters, and weave-net, which consumes the mesh library, could benefit from such optimised routing.
Topology updates, whether originated locally or received from other peers, are sent via GossipChannel.relayBroadcast(). It calls routes.ensureRecalculated(), which has three possible behaviours.
This means that, when recalculations are plentiful, e.g. when a large cluster is forming, each update waits for one recalc. We saw this when trying to get #106 to work.
I believe the idea was that ensureRecalculated() would wait for any previously requested recalc to finish, but not for any future recalcs that might become necessary.
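That intended semantics, "wait for previously requested recalcs, not future ones", can be expressed with generation counters: a waiter records the generation requested before its call and waits only until that generation has completed. This is a sketch of the idea, not mesh's actual implementation:

```go
package main

import (
	"fmt"
	"sync"
)

// recalculator tracks a requested and a completed generation number.
// ensureRecalculated waits until the generation requested *before* the
// call has completed, but never for recalcs requested afterwards.
type recalculator struct {
	mu        sync.Mutex
	cond      *sync.Cond
	requested int
	completed int
}

func newRecalculator() *recalculator {
	r := &recalculator{}
	r.cond = sync.NewCond(&r.mu)
	return r
}

func (r *recalculator) request() { // a topology change needs a recalc
	r.mu.Lock()
	r.requested++
	r.mu.Unlock()
}

func (r *recalculator) complete() { // a worker finished one recalc
	r.mu.Lock()
	r.completed = r.requested
	r.cond.Broadcast()
	r.mu.Unlock()
}

func (r *recalculator) ensureRecalculated() {
	r.mu.Lock()
	want := r.requested // snapshot: ignore generations requested later
	for r.completed < want {
		r.cond.Wait()
	}
	r.mu.Unlock()
}

func main() {
	r := newRecalculator()
	r.request()
	r.complete()
	r.ensureRecalculated() // returns immediately: nothing newer is owed
	r.request()            // a later request does not affect earlier waiters
	fmt.Println("waited only for the recalc requested before the call")
}
```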
The example project in examples/increment-only-counter is over-engineered and too complicated for a usage example, and there are no examples in the docs. I would suggest writing a minimal working example in the README so people don't have to reverse-engineer the library.
A general pattern we followed to date is: create a channel, retain the send side (chan<-) in a struct member, and hand the receive side (<-chan) to a spawned goroutine. The idea is that the typing ensures that only the spawned goroutine can read from the channel.
This was changed in 0cb8f55 for the likes of connectionMaker and gossipSender, but not for others like routes.
We need to make up our mind whether to stick to the pattern or not, and apply that decision consistently.
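The pattern above can be sketched as follows (worker and newWorker are illustrative names): the directional channel types make it a compile error for anyone but the spawned goroutine to receive.

```go
package main

import "fmt"

// worker retains only the send side of the channel: callers can submit
// actions but cannot read them back.
type worker struct {
	actions chan<- func() // send-only view held in the struct
}

func newWorker() *worker {
	ch := make(chan func())
	go func(in <-chan func()) { // receive-only: only this goroutine reads
		for f := range in {
			f()
		}
	}(ch)
	return &worker{actions: ch}
}

func main() {
	w := newWorker()
	done := make(chan string)
	w.actions <- func() { done <- "ran in the owning goroutine" }
	fmt.Println(<-done)
}
```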
Is something wrong here? I ran the increment-only counter example.
If 127.0.0.1:6001 goes down, the cluster doesn't stay consistent.
Is that expected?
At https://github.com/weaveworks/mesh/blob/master/connection.go#L194, the incoming UID is parsed into remote, but registerRemote() ignores this UID and looks up the peer by name.
If the peer has restarted and reconnected, this has the effect of overwriting the new info with the old peer's info; for instance, LocalPeer.handleAddConnection() will see it as an existing peer and not send the initial gossip to it, thus leaving it ignorant of current topology, DNS and IPAM data.
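A sketch of a UID-aware registration, using a cut-down stand-in type rather than mesh's actual one: a restarted peer reconnects with the same name but a fresh UID, and replacing the stale record lets the router treat it as new and send the initial gossip.

```go
package main

import "fmt"

// peer is a cut-down stand-in for mesh's peer record.
type peer struct {
	name string
	uid  uint64
}

// registerRemote looks the peer up by name, as described above, but
// also compares the parsed UID before deciding the record is current.
func registerRemote(peers map[string]*peer, remote *peer) (replacedStale bool) {
	existing, ok := peers[remote.name]
	if ok && existing.uid == remote.uid {
		return false // same incarnation: keep the existing record
	}
	peers[remote.name] = remote // new peer, or a restart with a fresh UID
	return ok                   // true only when a stale incarnation was replaced
}

func main() {
	peers := map[string]*peer{}
	registerRemote(peers, &peer{name: "aa:bb", uid: 1}) // first sighting
	// Same name, new UID: the peer restarted and must be re-registered.
	fmt.Println(registerRemote(peers, &peer{name: "aa:bb", uid: 2})) // → true
}
```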
If we have at least 3 peers which do not implement a particular channel, and some other peer sends gossip on that channel, then the following happens:
1. Router.handleGossip() decodes the message and calls gossipChannel.deliver()
2. gossipChannel.deliver() calls surrogateGossiper.OnGossip() with the incoming message
3. surrogateGossiper.OnGossip returns a surrogateGossipData with the same payload
4. gossipChannel.deliver() then relays this payload to other listeners
This surfaced as an error message from Weave Net, "connection shutting down due to error: host clock skew of -4043s exceeds 900s limit", reported at https://groups.google.com/a/weave.works/forum/#!topic/weave-users/zcrATGRTY6s
It is quite bad because the peers end up sending gossip data in a tight loop.
This is a transplant of weaveworks/weave#1867. Quote,
When a peer restarts quickly, not all other peers necessarily see the peer go away, since topology gossip data merging may combine the removal and addition, or the events get re-ordered in transit, with the removal being skipped (since it will have an earlier version) - both of these effectively just end up updating the peer UID and version. If that happens on just a single peer then the DNS entries of the restarted peer are all retained. This is problematic in two ways:
- if any containers died on the peer while the weave router was down, their entries are leaked.
- for surviving containers, the version of the re-created entry may well be lower than the existing one, e.g. if previously the entry had been tombstoned and resurrected a few times. If the last version of the entry known by surviving peers was a tombstone, then this will effectively wipe out the re-created entry.
Possible fix,
Peers could invoke the OnGC callbacks when the UID of a peer changes.
From #33,
It looks like LocalConnection would not need to exported either if it implemented an interface with an OverlayConn method, so the likes of https://github.com/weaveworks/weave/blob/master/router/network_router.go#L189 can use that instead of a cast to LocalConnection.
When the mesh router receives ProtocolGossipBroadcast traffic, it attempts to relay the broadcast to peers. In order to broadcast gossip received from a source, routes.BroadcastAll() is performed on each received gossip. routes.BroadcastAll() calls peer.routes, which is known to be an O(n^2) operation.
As an optimisation to avoid peer.routes calls, routes.lookupOrCalculate caches the calculated broadcast routes. However, on topology changes the routes are flushed. While this should be fine on a stable cluster, when there are constant topology changes the cache misses can be expensive.
The following metrics (gathered as calls per second) were collected with an instrumented mesh on a 150-node Kubernetes cluster running weave-net, which uses mesh. It's not uncommon for someone to apply a DaemonSet, which results in each node connecting to the rest of the peers (so n^2 connections), causing significant topology changes, hence misses in routes.lookupOrCalculate() and a peer.routes calculation on every call.
It would be desirable to prevent the excessive peer.routes() calls resulting from gossip broadcast.
===================================================================
2019-09-17 7:23:0 Peers.garbageCollect(): 198
2019-09-17 7:23:0 routes.calculate() -> routes.calculateBroadcast(): 59
2019-09-17 7:23:0 routes.lookupOrCalculate() -> routes.calculateBroadcast(): 201
2019-09-17 7:23:0 routes.calculateUnicast(): 118
2019-09-17 7:23:0 connectionMaker.refresh(): 64
2019-09-17 7:23:0 rx gossip unicast: 0
2019-09-17 7:23:0 rx gossip broadcast: 183
2019-09-17 7:23:0 gossip broadcast - relay broadcasts: 192
2019-09-17 7:23:0 gossip broadcast - topology updates: 12
===================================================================
2019-09-17 7:23:1 Peers.garbageCollect(): 247
2019-09-17 7:23:1 routes.calculate() -> routes.calculateBroadcast(): 77
2019-09-17 7:23:1 routes.lookupOrCalculate() -> routes.calculateBroadcast(): 246
2019-09-17 7:23:1 routes.calculateUnicast(): 155
2019-09-17 7:23:1 connectionMaker.refresh(): 88
2019-09-17 7:23:1 rx gossip unicast: 0
2019-09-17 7:23:1 rx gossip broadcast: 226
2019-09-17 7:23:1 gossip broadcast - relay broadcasts: 234
2019-09-17 7:23:1 gossip broadcast - topology updates: 4
===================================================================
2019-09-17 7:23:2 Peers.garbageCollect(): 216
2019-09-17 7:23:2 routes.calculate() -> routes.calculateBroadcast(): 80
2019-09-17 7:23:2 routes.lookupOrCalculate() -> routes.calculateBroadcast(): 220
2019-09-17 7:23:2 routes.calculateUnicast(): 158
2019-09-17 7:23:2 connectionMaker.refresh(): 75
2019-09-17 7:23:2 rx gossip unicast: 0
2019-09-17 7:23:2 rx gossip broadcast: 209
2019-09-17 7:23:2 gossip broadcast - relay broadcasts: 206
2019-09-17 7:23:2 gossip broadcast - topology updates: 11
===================================================================
2019-09-17 7:23:3 Peers.garbageCollect(): 233
2019-09-17 7:23:3 routes.calculate() -> routes.calculateBroadcast(): 87
2019-09-17 7:23:3 routes.lookupOrCalculate() -> routes.calculateBroadcast(): 256
2019-09-17 7:23:3 routes.calculateUnicast(): 174
2019-09-17 7:23:3 connectionMaker.refresh(): 88
2019-09-17 7:23:3 rx gossip unicast: 0
2019-09-17 7:23:3 rx gossip broadcast: 240
2019-09-17 7:23:3 gossip broadcast - relay broadcasts: 226
2019-09-17 7:23:3 gossip broadcast - topology updates: 6
===================================================================
2019-09-17 7:23:4 Peers.garbageCollect(): 289
2019-09-17 7:23:4 routes.calculate() -> routes.calculateBroadcast(): 88
2019-09-17 7:23:4 routes.lookupOrCalculate() -> routes.calculateBroadcast(): 217
2019-09-17 7:23:4 routes.calculateUnicast(): 176
2019-09-17 7:23:4 connectionMaker.refresh(): 109
2019-09-17 7:23:4 rx gossip unicast: 0
2019-09-17 7:23:4 rx gossip broadcast: 204
2019-09-17 7:23:4 gossip broadcast - relay broadcasts: 278
2019-09-17 7:23:4 gossip broadcast - topology updates: 1
===================================================================
2019-09-17 7:23:5 Peers.garbageCollect(): 253
2019-09-17 7:23:5 routes.calculate() -> routes.calculateBroadcast(): 82
2019-09-17 7:23:5 routes.lookupOrCalculate() -> routes.calculateBroadcast(): 246
2019-09-17 7:23:5 routes.calculateUnicast(): 164
2019-09-17 7:23:5 connectionMaker.refresh(): 96
2019-09-17 7:23:5 rx gossip unicast: 0
2019-09-17 7:23:5 rx gossip broadcast: 244
2019-09-17 7:23:5 gossip broadcast - relay broadcasts: 244
2019-09-17 7:23:5 gossip broadcast - topology updates: 4
===================================================================
2019-09-17 7:23:6 Peers.garbageCollect(): 234
2019-09-17 7:23:6 routes.calculate() -> routes.calculateBroadcast(): 71
2019-09-17 7:23:6 routes.lookupOrCalculate() -> routes.calculateBroadcast(): 225
2019-09-17 7:23:6 routes.calculateUnicast(): 144
2019-09-17 7:23:6 connectionMaker.refresh(): 93
2019-09-17 7:23:6 rx gossip unicast: 0
2019-09-17 7:23:6 rx gossip broadcast: 228
2019-09-17 7:23:6 gossip broadcast - relay broadcasts: 226
2019-09-17 7:23:6 gossip broadcast - topology updates: 2
===================================================================
2019-09-17 7:23:7 Peers.garbageCollect(): 262
2019-09-17 7:23:7 routes.calculate() -> routes.calculateBroadcast(): 80
2019-09-17 7:23:7 routes.lookupOrCalculate() -> routes.calculateBroadcast(): 250
2019-09-17 7:23:7 routes.calculateUnicast(): 158
2019-09-17 7:23:7 connectionMaker.refresh(): 96
2019-09-17 7:23:7 rx gossip unicast: 0
2019-09-17 7:23:7 rx gossip broadcast: 254
2019-09-17 7:23:7 gossip broadcast - relay broadcasts: 253
2019-09-17 7:23:7 gossip broadcast - topology updates: 1
===================================================================
2019-09-17 7:23:8 Peers.garbageCollect(): 264
2019-09-17 7:23:8 routes.calculate() -> routes.calculateBroadcast(): 55
2019-09-17 7:23:8 routes.lookupOrCalculate() -> routes.calculateBroadcast(): 259
2019-09-17 7:23:8 routes.calculateUnicast(): 109
2019-09-17 7:23:8 connectionMaker.refresh(): 60
2019-09-17 7:23:8 rx gossip unicast: 0
2019-09-17 7:23:8 rx gossip broadcast: 259
2019-09-17 7:23:8 gossip broadcast - relay broadcasts: 246
2019-09-17 7:23:8 gossip broadcast - topology updates: 3
===================================================================
2019-09-17 7:23:9 Peers.garbageCollect(): 266
2019-09-17 7:23:9 routes.calculate() -> routes.calculateBroadcast(): 60
2019-09-17 7:23:9 routes.lookupOrCalculate() -> routes.calculateBroadcast(): 318
2019-09-17 7:23:9 routes.calculateUnicast(): 120
2019-09-17 7:23:9 connectionMaker.refresh(): 71
2019-09-17 7:23:9 rx gossip unicast: 0
2019-09-17 7:23:9 rx gossip broadcast: 319
2019-09-17 7:23:9 gossip broadcast - relay broadcasts: 255
2019-09-17 7:23:9 gossip broadcast - topology updates: 0
===================================================================
2019-09-17 7:23:10 Peers.garbageCollect(): 308
2019-09-17 7:23:10 routes.calculate() -> routes.calculateBroadcast(): 58
2019-09-17 7:23:10 routes.lookupOrCalculate() -> routes.calculateBroadcast(): 255
2019-09-17 7:23:10 routes.calculateUnicast(): 116
2019-09-17 7:23:10 connectionMaker.refresh(): 73
2019-09-17 7:23:10 rx gossip unicast: 0
2019-09-17 7:23:10 rx gossip broadcast: 258
2019-09-17 7:23:10 gossip broadcast - relay broadcasts: 302
2019-09-17 7:23:10 gossip broadcast - topology updates: 2
===================================================================
It seems that mesh doesn't support IPv6 at the moment.
Line 102 in f76d3ef
Or should I be barking up another tree?
Per a discussion (https://groups.google.com/d/msgid/prometheus-developers/CAL%2BpMaAC4mo%3DaD1D7PbUcK1S0UAQ2YqyzhE%2BjND7tt_8ys0MtQ%40mail.gmail.com?utm_medium=email&utm_source=footer) on the Prometheus mailing list about the component that uses this mesh library, it appears that if a peer's port differs from the ports in the statically configured initial peers list, the other cluster members will fail to connect to it directly, presumably because the gossip messages about peer membership don't carry port information.
I have a use case where I run software on top of a cluster scheduler (like Apache Aurora) and get randomly assigned ports for each instance in the cluster, which means I can't effectively use anything built on this library there. Please fix the library so that there are no hidden assumptions about how to connect to each peer.
Discovered in weaveworks/weave#2527...
weave launch 0.0.0.0
produces
github.com/weaveworks/weave/vendor/github.com/weaveworks/mesh.(*connectionMaker).connectionTerminated.func1(0xc8223f1f08)
/go/src/github.com/weaveworks/weave/vendor/github.com/weaveworks/mesh/connection_maker.go:195 +0x103
github.com/weaveworks/weave/vendor/github.com/weaveworks/mesh.(*connectionMaker).queryLoop(0xc820014c60, 0xc820014c00)
/go/src/github.com/weaveworks/weave/vendor/github.com/weaveworks/mesh/connection_maker.go:224 +0xf4
created by github.com/weaveworks/weave/vendor/github.com/weaveworks/mesh.newConnectionMaker
/go/src/github.com/weaveworks/weave/vendor/github.com/weaveworks/mesh/connection_maker.go:74 +0x24a
The problem is in mesh.localPeer.createConnection. The connectionMaker has the target address as "0.0.0.0:6783", which is what it passes to that function as the peerAddr. We re-parse that into remoteTCPAddr with net.ResolveTCPAddr, which has the same string representation. So far so good. But then we initialise the remoteConnection data structure in newRemoteConnection() with a remoteTCPAddr of tcpConn.RemoteAddr().String(), which actually returns "127.0.0.1:6783". I've traced the origin of that to a syscall.Getpeername() call in the Go socket code.
AFAICT the code has always been like that. The obvious fix is to invoke newRemoteConnection() with peerAddr instead.
https://github.com/weaveworks/mesh/blob/master/connection_maker.go#L96
If the host part of a peer is a DNS name, it'd be nice if that name were resolved and all resulting addresses were added to the list, instead of just one. This is a very common pattern in service discovery (e.g. Consul).
(I'm not sure how this code would respond with a request to peer with oneself though?)
Looking at the PRs and issues, I wonder whether this repo is still being maintained.
I have a cluster of ~300 nodes and mesh is consuming ~1-2 GB of RAM. I dug into it and found the memory is all being consumed by topology gossip messages. On further inspection I found that the gossip messages include all peers, which means the message sizes (and therefore the memory and CPU to generate them) scale with the cluster size.
Are there any plans to implement a more scalable topology gossip?
Why does NewGossip panic if a Gossip with that id already exists? Why not either return an error or return the existing Gossip?
See
Line 136 in 2534f73
The existing interface makes it difficult to build programs that create Gossips in response to runtime events.
peers.go:270:23:warning: redundant type conversion (unconvert)
Right now gossipInterval
is hard-coded to 30 seconds in the mesh library, so every 30 seconds a topology update is gossiped. Processing topology updates received through gossip gets expensive as the number of nodes participating in the gossip grows.
Is there a reason to gossip periodically? Would sending a topology update only when a topology change is detected be sufficient? If periodic gossip is required, would it be desirable to raise the gossip interval to a higher value?
Proposed interface
type Logger interface {
Logf(format string, args ...interface{})
}
Pretty much the title. I'm curious to know if the design goals are similar and what the state of Mesh is.
talk to @miolini -- I think that he and the weave team in general would have a fascinating conversation.
So right now mesh requires knowledge of at least one existing node; why not bootstrap this from the BitTorrent DHT, using a PSK or certificate to ensure that no ne'er-do-wells show up and crash your party? That would require knowledge of zero nodes, and enable glorious networked chaos.
What happens @ >100 nodes?
:).
BTW, here's an issue list mostly made by @miolini:
https://github.com/meshbird/meshbird/issues/created_by/miolini
Yeah, he's that awesome.
The lock is not released until Send() returns. Taken together with #125 this means it can hold the lock forever.
Sender and receiver each maintain state including a sequence number, so if we unlock before calling Send(), it's possible that messages get out of order.
Seen at weaveworks/weave#3762
While performing mesh scaling tests on cluster sizes > 175 nodes, even with the combined patch #110 (whose intent is to rate-limit the number of peer.routes() calls performed), a high number of calculateBroadcast() calls (on the order of 200-300 per second) is still observed. calculateBroadcast() calls peer.routes(), which is of O(n^2) complexity.
routes.lookupOrCalculate optimizes the number of routes() calls by caching the results in routes.broadcast and routes.broadcastAll. However, routes.calculate() resets the cached data every time it is called, causing lookupOrCalculate to miss the cache and recompute the routes.
https://github.com/weaveworks/mesh/blob/v0.2/routes.go#L210-L211
I want to try writing a DNS server in Go, but instead of the standard AXFR mechanism to transfer data, I want to use mesh.
Where do I need to start? Thanks!
If I configure the local address in the peers list, does it work?