
mesh's Introduction


Mesh is a tool for building distributed applications.

Mesh implements a gossip protocol that provides membership, unicast, and broadcast functionality with eventually-consistent semantics. In CAP terms, it is AP: highly available and partition tolerant.

Mesh works in a wide variety of network setups, including through NAT and firewalls, and across clouds and datacenters. It works in situations where there is only partial connectivity: data is transparently routed across multiple hops when there is no direct connection between peers. It copes with partitions and partial network failure. It can be easily bootstrapped, typically requiring knowledge of only a single existing peer in the mesh to join. It has built-in shared-secret authentication and encryption. It scales to on the order of 100 peers, and has no dependencies.

Using

Mesh is currently distributed as a Go package. See the API documentation.

We plan to offer Mesh as a standalone service + an easy-to-use API. We will support multiple deployment scenarios, including as a standalone binary, as a container, as an ambassador or sidecar component to an existing container, and as an infrastructure service in popular platforms.

Developing

Mesh builds with the standard Go tooling. You will need to put the repository in Go's expected directory structure; i.e., $GOPATH/src/github.com/weaveworks/mesh.

Building

If necessary, you may fetch the latest version of all of the dependencies into your GOPATH via

go get -d -u -t ./...

Build the code with the usual

go install ./...

Testing

Assuming you've fetched dependencies as above,

go test ./...

Dependencies

Mesh is a library, designed to be imported into a binary package. Vendoring is currently the best way for binary package authors to ensure reliable, reproducible builds. Therefore, we strongly recommend our users use vendoring for all of their dependencies, including Mesh. To avoid compatibility and availability issues, Mesh doesn't vendor its own dependencies, and doesn't recommend use of third-party import proxies.

There are several tools to make vendoring easier, including gb, gvt, glide, and govendor.

Workflow

Mesh follows a typical PR workflow. All contributions should be made as pull requests that satisfy the guidelines below.

Guidelines

  • All code must abide by the Go Code Review Comments
  • Names should abide by What's in a name
  • Code must build on both Linux and Darwin, via plain go build
  • Code should have appropriate test coverage, invoked via plain go test

In addition, several mechanical checks are enforced. See the lint script for details.

Getting Help

If you have any questions about, feedback for, or problems with mesh:

Your feedback is always welcome!

mesh's Issues

Example for mesh to build a distributed app

Hello, I am a newbie to weaveworks/mesh. I want to use it to build a distributed app. I wonder if there is an example that I can learn from or some commercial version. Thanks for any help!

New linter error

peers.go:270:23:warning: redundant type conversion (unconvert)

Add IPv6 support

It seems that mesh doesn't support IPv6 at the moment.

mesh/router.go

Line 102 in f76d3ef

localAddr, err := net.ResolveTCPAddr("tcp4", net.JoinHostPort(router.Host, fmt.Sprint(router.Port)))

We use IPv6 almost everywhere, and in most cases we don't have any IPv4 addresses except 127.0.0.1.
I guess we are not alone in running such a setup.
Is it possible to add IPv6 support?

cached broadcast routes are reset when performing routes.calculate()

While performing mesh scaling tests on cluster sizes > 175 nodes, even with the combined patch #110, #110 (whose intent is to rate-limit the number of peer.routes() calls performed), a high number of calculateBroadcast() calls (on the order of 200-300 per second) is still observed, each of which calls peer.routes(). peer.routes() is of O(n^2) complexity.

routes.lookupOrCalculate reduces the number of routes() calls by caching the results in routes.broadcast and routes.broadcastAll. However, routes.calculate() resets the cached data every time it is called, causing lookupOrCalculate to miss the cache and perform the routes calculation again.

https://github.com/weaveworks/mesh/blob/v0.2/routes.go#L210-L211

Topology update broadcasts are sent once per route calculation

Topology updates, either sent directly or received from other peers, are sent via
GossipChannel.relayBroadcast(). It calls routes.ensureRecalculated(), which has three possible behaviours:

  1. if no recalculation is pending, it returns straight away
  2. if a recalculation is pending and nobody else is waiting, it will wait for that recalculation to start then finish
  3. otherwise it will queue for an opportunity to do 1 or 2

This means that, when recalculations are plentiful, e.g. when a large cluster is forming, each update waits for one recalc. We saw this when trying to get #106 to work.

I believe the idea was that ensureRecalculated() would wait for any previously-requested recalc to finish, but not wait for any future recalcs that might become necessary.

consistent application of channel input/output capability restrictions

A general pattern we have followed to date is:

  1. create a channel in a constructor
  2. store the output capability (chan<-) of that in a struct member
  3. pass the input capability (<-chan) to a spawned goroutine

The idea here is that the typing ensures that only the spawned goroutine can read from the channel.

This was changed in 0cb8f55 for the likes of connectionMaker and gossipSender, but not others like routes.

We need to make up our mind whether to stick to the pattern or not, and apply that decision consistently.
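
For illustration, a minimal sketch of the pattern as described above, using hypothetical names rather than mesh's actual types:

package pattern

// worker illustrates the pattern: the constructor creates the channel,
// the struct keeps only the send-only capability, and the receive-only
// capability is handed to the spawned goroutine, so the type system
// guarantees that only that goroutine reads from the channel.
type worker struct {
	actions chan<- func() // send-only capability kept by the struct
}

func newWorker() *worker {
	ch := make(chan func())
	w := &worker{actions: ch}
	go w.loop(ch) // the goroutine gets the receive-only capability
	return w
}

func (w *worker) loop(actions <-chan func()) {
	for action := range actions {
		action() // sole reader of the channel
	}
}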

Make the maintenance process work

Two PRs languished for ~5 months, which is not good.

Perhaps we could have an explicit list of maintainers, and an instruction to @-mention them to gain attention?

panic due to remote address discrepancy

Discovered in weaveworks/weave#2527...

weave launch 0.0.0.0 produces

github.com/weaveworks/weave/vendor/github.com/weaveworks/mesh.(*connectionMaker).connectionTerminated.func1(0xc8223f1f08)
    /go/src/github.com/weaveworks/weave/vendor/github.com/weaveworks/mesh/connection_maker.go:195 +0x103
github.com/weaveworks/weave/vendor/github.com/weaveworks/mesh.(*connectionMaker).queryLoop(0xc820014c60, 0xc820014c00)
    /go/src/github.com/weaveworks/weave/vendor/github.com/weaveworks/mesh/connection_maker.go:224 +0xf4
created by github.com/weaveworks/weave/vendor/github.com/weaveworks/mesh.newConnectionMaker
    /go/src/github.com/weaveworks/weave/vendor/github.com/weaveworks/mesh/connection_maker.go:74 +0x24a

The problem is in mesh.localPeer.createConnection. The connectionMaker has the target address as "0.0.0.0:6783", which is what it passes to that function as the peerAddr. We re-parse that into remoteTCPAddr with net.ResolveTCPAddr, which has the same string representation. So far so good. But then we initialise the remoteConnection data structure in newRemoteConnection() with a remoteTCPAddr of tcpConn.RemoteAddr().String(). That actually returns "127.0.0.1:6783". I've traced the origin of that to a syscall.Getpeername() in the go socket code.

AFAICT the code has always been like that. The obvious fix is to invoke newRemoteConnection() with peerAddr instead.
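
The underlying behaviour is reproducible with the standard library alone; a minimal sketch, assuming Linux semantics where connecting to 0.0.0.0 is treated as connecting to the local host:

package main

import (
	"fmt"
	"log"
	"net"
)

func main() {
	// Listen on all interfaces on an ephemeral port.
	ln, err := net.Listen("tcp", "0.0.0.0:0")
	if err != nil {
		log.Fatal(err)
	}
	defer ln.Close()
	port := ln.Addr().(*net.TCPAddr).Port

	// Dial the same kind of "0.0.0.0" target a user might configure.
	target := net.JoinHostPort("0.0.0.0", fmt.Sprint(port))
	conn, err := net.Dial("tcp", target)
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	fmt.Println("dialled:   ", target)            // 0.0.0.0:<port>
	fmt.Println("RemoteAddr:", conn.RemoteAddr()) // 127.0.0.1:<port>, via getpeername(2)
}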

in a fully connected mesh topology, topology update gossips can get chatty

On each connection add/delete/established event from a peer, the mesh router broadcasts topology updates to its peers. In a fully connected topology, the broadcast goes to all nodes in the mesh.

A received topology gossip is further relayed to the peers if it is a new update. While this should not be a concern in a stable topology, it can be problematic in some use cases.

For example:

  • when someone deploys a weave-net daemonset in an N-node cluster, each node ends up connecting to the other nodes, so concurrent topology updates can be on the order of n^2 in the cluster
  • in auto-scaling groups, nodes can get added/deleted, which can result in a high rate of topology updates

Considering #114 and #115, which result in high CPU usage, the combination with chatty topology gossip produces a cascading effect.

As the number of peers in the mesh increases, this significantly impacts scalability.

The following metrics were gathered with an instrumented mesh on a 150-node Kubernetes cluster running weave-net, which uses mesh. "rx gossip broadcast" is the number of received topology gossips per second.

===================================================================
2019-09-17 7:22:0 Peers.garbageCollect(): 365
2019-09-17 7:22:0 routes.calculate()         -> routes.calculateBroadcast(): 59
2019-09-17 7:22:0 routes.lookupOrCalculate() -> routes.calculateBroadcast(): 335
2019-09-17 7:22:0 routes.calculateUnicast(): 119
2019-09-17 7:22:0 connectionMaker.refresh(): 63
2019-09-17 7:22:0 rx gossip unicast: 0
2019-09-17 7:22:0 rx gossip broadcast: 325
2019-09-17 7:22:0 gossip broadcast - relay broadcasts: 345
2019-09-17 7:22:0 gossip broadcast - topology updates: 1
===================================================================
2019-09-17 7:22:1 Peers.garbageCollect(): 347
2019-09-17 7:22:1 routes.calculate()         -> routes.calculateBroadcast(): 68
2019-09-17 7:22:1 routes.lookupOrCalculate() -> routes.calculateBroadcast(): 328
2019-09-17 7:22:1 routes.calculateUnicast(): 135
2019-09-17 7:22:1 connectionMaker.refresh(): 70
2019-09-17 7:22:1 rx gossip unicast: 0
2019-09-17 7:22:1 rx gossip broadcast: 316
2019-09-17 7:22:1 gossip broadcast - relay broadcasts: 324
2019-09-17 7:22:1 gossip broadcast - topology updates: 0
===================================================================
2019-09-17 7:22:2 Peers.garbageCollect(): 369
2019-09-17 7:22:2 routes.calculate()         -> routes.calculateBroadcast(): 61
2019-09-17 7:22:2 routes.lookupOrCalculate() -> routes.calculateBroadcast(): 313
2019-09-17 7:22:2 routes.calculateUnicast(): 124
2019-09-17 7:22:2 connectionMaker.refresh(): 64
2019-09-17 7:22:2 rx gossip unicast: 0
2019-09-17 7:22:2 rx gossip broadcast: 315
2019-09-17 7:22:2 gossip broadcast - relay broadcasts: 343
2019-09-17 7:22:2 gossip broadcast - topology updates: 0
===================================================================
2019-09-17 7:22:3 Peers.garbageCollect(): 336
2019-09-17 7:22:3 routes.calculate()         -> routes.calculateBroadcast(): 75
2019-09-17 7:22:3 routes.lookupOrCalculate() -> routes.calculateBroadcast(): 327
2019-09-17 7:22:3 routes.calculateUnicast(): 148
2019-09-17 7:22:3 connectionMaker.refresh(): 75
2019-09-17 7:22:3 rx gossip unicast: 0
2019-09-17 7:22:3 rx gossip broadcast: 322
2019-09-17 7:22:3 gossip broadcast - relay broadcasts: 326
2019-09-17 7:22:3 gossip broadcast - topology updates: 1
===================================================================
2019-09-17 7:22:4 Peers.garbageCollect(): 353
2019-09-17 7:22:4 routes.calculate()         -> routes.calculateBroadcast(): 69
2019-09-17 7:22:4 routes.lookupOrCalculate() -> routes.calculateBroadcast(): 344
2019-09-17 7:22:4 routes.calculateUnicast(): 138
2019-09-17 7:22:4 connectionMaker.refresh(): 71
2019-09-17 7:22:4 rx gossip unicast: 0
2019-09-17 7:22:4 rx gossip broadcast: 339
2019-09-17 7:22:4 gossip broadcast - relay broadcasts: 337
2019-09-17 7:22:4 gossip broadcast - topology updates: 1
===================================================================
2019-09-17 7:22:5 Peers.garbageCollect(): 323
2019-09-17 7:22:5 routes.calculate()         -> routes.calculateBroadcast(): 68
2019-09-17 7:22:5 routes.lookupOrCalculate() -> routes.calculateBroadcast(): 330
2019-09-17 7:22:5 routes.calculateUnicast(): 136
2019-09-17 7:22:5 connectionMaker.refresh(): 70
2019-09-17 7:22:5 rx gossip unicast: 0
2019-09-17 7:22:5 rx gossip broadcast: 328
2019-09-17 7:22:5 gossip broadcast - relay broadcasts: 311
2019-09-17 7:22:5 gossip broadcast - topology updates: 3
===================================================================
2019-09-17 7:22:6 Peers.garbageCollect(): 340
2019-09-17 7:22:6 routes.calculate()         -> routes.calculateBroadcast(): 78
2019-09-17 7:22:6 routes.lookupOrCalculate() -> routes.calculateBroadcast(): 320
2019-09-17 7:22:6 routes.calculateUnicast(): 156
2019-09-17 7:22:6 connectionMaker.refresh(): 82
2019-09-17 7:22:6 rx gossip unicast: 0
2019-09-17 7:22:6 rx gossip broadcast: 321
2019-09-17 7:22:6 gossip broadcast - relay broadcasts: 322
2019-09-17 7:22:6 gossip broadcast - topology updates: 0
===================================================================
2019-09-17 7:22:7 Peers.garbageCollect(): 321
2019-09-17 7:22:7 routes.calculate()         -> routes.calculateBroadcast(): 85
2019-09-17 7:22:7 routes.lookupOrCalculate() -> routes.calculateBroadcast(): 300
2019-09-17 7:22:7 routes.calculateUnicast(): 172
2019-09-17 7:22:7 connectionMaker.refresh(): 90
2019-09-17 7:22:7 rx gossip unicast: 0
2019-09-17 7:22:7 rx gossip broadcast: 296
2019-09-17 7:22:7 gossip broadcast - relay broadcasts: 309
2019-09-17 7:22:7 gossip broadcast - topology updates: 0
===================================================================
2019-09-17 7:22:8 Peers.garbageCollect(): 313
2019-09-17 7:22:8 routes.calculate()         -> routes.calculateBroadcast(): 81
2019-09-17 7:22:8 routes.lookupOrCalculate() -> routes.calculateBroadcast(): 308
2019-09-17 7:22:8 routes.calculateUnicast(): 161
2019-09-17 7:22:8 connectionMaker.refresh(): 85
2019-09-17 7:22:8 rx gossip unicast: 0
2019-09-17 7:22:8 rx gossip broadcast: 309
2019-09-17 7:22:8 gossip broadcast - relay broadcasts: 291
2019-09-17 7:22:8 gossip broadcast - topology updates: 1
===================================================================
2019-09-17 7:22:9 Peers.garbageCollect(): 316
2019-09-17 7:22:9 routes.calculate()         -> routes.calculateBroadcast(): 84
2019-09-17 7:22:9 routes.lookupOrCalculate() -> routes.calculateBroadcast(): 307
2019-09-17 7:22:9 routes.calculateUnicast(): 167
2019-09-17 7:22:9 connectionMaker.refresh(): 88
2019-09-17 7:22:9 rx gossip unicast: 0
2019-09-17 7:22:9 rx gossip broadcast: 302
2019-09-17 7:22:9 gossip broadcast - relay broadcasts: 306
2019-09-17 7:22:9 gossip broadcast - topology updates: 0
===================================================================
2019-09-17 7:22:10 Peers.garbageCollect(): 312
2019-09-17 7:22:10 routes.calculate()         -> routes.calculateBroadcast(): 83
2019-09-17 7:22:10 routes.lookupOrCalculate() -> routes.calculateBroadcast(): 278
2019-09-17 7:22:10 routes.calculateUnicast(): 166
2019-09-17 7:22:10 connectionMaker.refresh(): 85
2019-09-17 7:22:10 rx gossip unicast: 0
2019-09-17 7:22:10 rx gossip broadcast: 275
2019-09-17 7:22:10 gossip broadcast - relay broadcasts: 300
2019-09-17 7:22:10 gossip broadcast - topology updates: 2
===================================================================

Panic on creating duplicate Gossip?

Why does NewGossip panic if a Gossip with that id already exists? Why not either return an error or return the existing Gossip?

See

mesh/router.go

Line 136 in 2534f73

panic(fmt.Sprintf("[gossip] duplicate channel %s", channelName))

The existing interface makes it difficult to build programs that build Gossips in response to runtime events.
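
Until the interface changes, an application can guard against the panic itself by remembering which channels it has already created. A rough sketch (the registry type is hypothetical; only the documented Router.NewGossip(name, gossiper) call is assumed):

package gossipreg

import (
	"sync"

	"github.com/weaveworks/mesh"
)

// registry is an application-side guard, not part of mesh: it calls
// Router.NewGossip at most once per channel name and hands back the
// existing Gossip on subsequent requests instead of triggering the panic.
type registry struct {
	mtx      sync.Mutex
	router   *mesh.Router
	channels map[string]mesh.Gossip
}

func newRegistry(router *mesh.Router) *registry {
	return &registry{router: router, channels: map[string]mesh.Gossip{}}
}

func (r *registry) get(name string, g mesh.Gossiper) (mesh.Gossip, error) {
	r.mtx.Lock()
	defer r.mtx.Unlock()
	if gossip, ok := r.channels[name]; ok {
		return gossip, nil // channel already exists; reuse it
	}
	gossip, err := r.router.NewGossip(name, g)
	if err != nil {
		return nil, err
	}
	r.channels[name] = gossip
	return gossip, nil
}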

Run tests on alternative tags

This would have picked up the issue in #70

For instance:

$ go test -tags 'peer_name_hash peer_name_alternative'
# github.com/weaveworks/mesh
./peer.go:50: name.bytes undefined (type PeerName has no field or method bytes)
./peer.go:60: name.bytes undefined (type PeerName has no field or method bytes)
./peer_name_hash.go:49: undefined: checkFatal
FAIL	github.com/weaveworks/mesh [build failed]

Remove assumptions about peer ports

Per a discussion (https://groups.google.com/d/msgid/prometheus-developers/CAL%2BpMaAC4mo%3DaD1D7PbUcK1S0UAQ2YqyzhE%2BjND7tt_8ys0MtQ%40mail.gmail.com?utm_medium=email&utm_source=footer) on the Prometheus mailing list about the component that uses this mesh library, it appears that if a peer's port is not the same as one of the ports in the statically configured initial peers list, the other cluster members will fail to connect to it directly, presumably because the gossip messages about peer membership don't contain port information.

I have a use case where I run software on top of a cluster scheduler (like Apache Aurora) and get randomly assigned ports for each instance in the cluster, which means I can't effectively use anything that depends on this library in that cluster. Please fix the library so that there are no hidden assumptions about how to connect to each peer.

Peer restarts can be squelched

This is a transplant of weaveworks/weave#1867. Quote,

When a peer restarts quickly, not all other peers necessarily see the peer go away, since topology gossip data merging may combine the removal and addition, or the events get re-ordered in transit, with the removal being skipped (since it will have an earlier version) - both of these effectively just end up updating the peer UID and version. If that happens on just a single peer then the DNS entries of the restarted peer are all retained. This is problematic in two ways:

  • if any containers died on the peer while the weave router was down, their entries are leaked.
  • for surviving containers, the version of the re-created entry may well be lower than the existing one, e.g. if previously the entry had been tombstoned and resurrected a few times. If the last version of the entry known by surviving peers was a tombstone, then this will effectively wipe out the re-created entry.

Possible fix,

Peers could invoke the OnGC callbacks when the UID of a peer changes.

SurrogateGossiper can accumulate infinite amounts of data

Original message from weaveworks/weave#1763:

"SurrogateGossiper does not (and cannot) know how to merge GossipData, so instead it just stashes it in a queue. That is potentially unbounded since downstream connections in gossip propagation may not be able to accept data at the same rate it is received."

Note that #66 ameliorated this, in the case that some messages are exact duplicates.
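
For illustration only, one generic way to bound such a stash is a fixed-capacity FIFO that drops the oldest payloads when full; this sketches the idea and is not what surrogateGossiper currently does:

package surrogate

// boundedQueue is an illustrative fixed-capacity FIFO of opaque gossip
// payloads: when full, the oldest payload is dropped so memory stays
// bounded even if downstream connections cannot keep up.
type boundedQueue struct {
	max  int
	bufs [][]byte
}

func newBoundedQueue(max int) *boundedQueue { return &boundedQueue{max: max} }

func (q *boundedQueue) push(buf []byte) {
	if len(q.bufs) == q.max {
		q.bufs = q.bufs[1:] // drop the oldest payload to make room
	}
	q.bufs = append(q.bufs, buf)
}

func (q *boundedQueue) pop() ([]byte, bool) {
	if len(q.bufs) == 0 {
		return nil, false
	}
	buf := q.bufs[0]
	q.bufs = q.bufs[1:]
	return buf, true
}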

High memory/CPU utilization for moderately sized cluster

I have a cluster of ~300 nodes and mesh is consuming ~1-2 GB of RAM. I dug into it and the memory is all being consumed by the topology gossip messages. Upon further inspection I found that the gossip messages are including all peers in the message -- which means the message sizes (and therefore the memory and CPU to generate them) scale with the cluster size.

Are there any plans to implement a more scalable topology gossip?

surrogateGossiper can form a feedback loop

If we have at least 3 peers which do not implement a particular channel, and some other peer sends gossip on that channel, then the following happens:

  1. Router.handleGossip() decodes the message and calls gossipChannel.deliver()
  2. gossipChannel.deliver() calls surrogateGossiper.OnGossip() with the incoming message
  3. surrogateGossiper.OnGossip returns a surrogateGossipData with the same payload
  4. gossipChannel.deliver() then relays this payload to other listeners
  5. repeat

This surfaced as an error message from Weave Net "connection shutting down due to error: host clock skew of -4043s exceeds 900s limit", reported at https://groups.google.com/a/weave.works/forum/#!topic/weave-users/zcrATGRTY6s

It is quite bad because the peers are sending gossip data in a tight loop.

redundant broadcast topology updates in the case of a fully connected mesh of nodes

When the local peer connects to or disconnects from remote peers (handleAddConnection, handleConnectionEstablished, handleDeleteConnection), a broadcastTopologyUpdate is performed.

In the case of a fully connected mesh of nodes, as with Kubernetes deployed on AWS with Weave-net, broadcast topology updates are redundant, since each node is already connected to the added/deleted node and can learn the topology first-hand.

It would be desirable to minimize the broadcast topology updates, at least in the case of a fully connected mesh of nodes.

TCP sends block forever

We set a deadline on reads but not on writes. This means if something is blocking the other end of the connection we will wait forever.

Since we expect connected peers to be fairly live, a deadline around the length of a heartbeat should allow us to error on stalled peers while not affecting normal operation.
Worst case, a connection would be torn down then re-established.

Seen at weaveworks/weave#3762
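
A minimal sketch of the proposed mitigation, using only the standard library; the timeout value is an assumption, roughly one heartbeat period:

package conn

import (
	"net"
	"time"
)

// writeWithDeadline bounds a single write so that a stalled remote peer
// produces an error (and, in mesh's case, a torn-down connection) instead
// of blocking the sender forever.
func writeWithDeadline(c net.Conn, buf []byte, timeout time.Duration) error {
	if err := c.SetWriteDeadline(time.Now().Add(timeout)); err != nil {
		return err
	}
	_, err := c.Write(buf)
	return err
}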

Restarted peers do not get initial gossip

At https://github.com/weaveworks/mesh/blob/master/connection.go#L194, the incoming UID is parsed into remote, but registerRemote() ignores this UID and looks up the peer by name.

If the peer has restarted and reconnected, this has the effect of overwriting the new info with the old peer's info, so for instance LocalPeer.handleAddConnection() will see it as an existing peer and not send the initial gossip to it, thus leaving it ignorant of current topology, DNS and IPAM data.

high rate of peer.routes() calculation resulting from gossip broadcasts

When the mesh router receives ProtocolGossipBroadcast traffic, it attempts to relay the broadcast to its peers. In order to broadcast gossip received from a source, routes.BroadcastAll() is performed for each received gossip. routes.BroadcastAll() calls peer.routes, which is known to be an O(n^2) operation.

As an optimization to avoid peer.routes calls, routes.lookupOrCalculate caches the calculated broadcast routes. However, on topology changes the routes are flushed. While this should be fine on a stable cluster, the cache misses can be expensive when there are constant topology changes.

The following metrics (calls per second) were gathered with an instrumented mesh on a 150-node Kubernetes cluster running weave-net, which uses mesh. It is not uncommon for someone to apply a daemonset that results in each node connecting to the rest of the peers (so n^2 connections), causing significant topology changes. Hence routes.lookupOrCalculate() misses its cache and calculates peer.routes on every call.

It would be desirable to prevent excessive peer.routes() calls resulting from gossip broadcasts.

===================================================================
2019-09-17 7:23:0 Peers.garbageCollect(): 198
2019-09-17 7:23:0 routes.calculate()         -> routes.calculateBroadcast(): 59
2019-09-17 7:23:0 routes.lookupOrCalculate() -> routes.calculateBroadcast(): 201
2019-09-17 7:23:0 routes.calculateUnicast(): 118
2019-09-17 7:23:0 connectionMaker.refresh(): 64
2019-09-17 7:23:0 rx gossip unicast: 0
2019-09-17 7:23:0 rx gossip broadcast: 183
2019-09-17 7:23:0 gossip broadcast - relay broadcasts: 192
2019-09-17 7:23:0 gossip broadcast - topology updates: 12
===================================================================
2019-09-17 7:23:1 Peers.garbageCollect(): 247
2019-09-17 7:23:1 routes.calculate()         -> routes.calculateBroadcast(): 77
2019-09-17 7:23:1 routes.lookupOrCalculate() -> routes.calculateBroadcast(): 246
2019-09-17 7:23:1 routes.calculateUnicast(): 155
2019-09-17 7:23:1 connectionMaker.refresh(): 88
2019-09-17 7:23:1 rx gossip unicast: 0
2019-09-17 7:23:1 rx gossip broadcast: 226
2019-09-17 7:23:1 gossip broadcast - relay broadcasts: 234
2019-09-17 7:23:1 gossip broadcast - topology updates: 4
===================================================================
2019-09-17 7:23:2 Peers.garbageCollect(): 216
2019-09-17 7:23:2 routes.calculate()         -> routes.calculateBroadcast(): 80
2019-09-17 7:23:2 routes.lookupOrCalculate() -> routes.calculateBroadcast(): 220
2019-09-17 7:23:2 routes.calculateUnicast(): 158
2019-09-17 7:23:2 connectionMaker.refresh(): 75
2019-09-17 7:23:2 rx gossip unicast: 0
2019-09-17 7:23:2 rx gossip broadcast: 209
2019-09-17 7:23:2 gossip broadcast - relay broadcasts: 206
2019-09-17 7:23:2 gossip broadcast - topology updates: 11
===================================================================
2019-09-17 7:23:3 Peers.garbageCollect(): 233
2019-09-17 7:23:3 routes.calculate()         -> routes.calculateBroadcast(): 87
2019-09-17 7:23:3 routes.lookupOrCalculate() -> routes.calculateBroadcast(): 256
2019-09-17 7:23:3 routes.calculateUnicast(): 174
2019-09-17 7:23:3 connectionMaker.refresh(): 88
2019-09-17 7:23:3 rx gossip unicast: 0
2019-09-17 7:23:3 rx gossip broadcast: 240
2019-09-17 7:23:3 gossip broadcast - relay broadcasts: 226
2019-09-17 7:23:3 gossip broadcast - topology updates: 6
===================================================================
2019-09-17 7:23:4 Peers.garbageCollect(): 289
2019-09-17 7:23:4 routes.calculate()         -> routes.calculateBroadcast(): 88
2019-09-17 7:23:4 routes.lookupOrCalculate() -> routes.calculateBroadcast(): 217
2019-09-17 7:23:4 routes.calculateUnicast(): 176
2019-09-17 7:23:4 connectionMaker.refresh(): 109
2019-09-17 7:23:4 rx gossip unicast: 0
2019-09-17 7:23:4 rx gossip broadcast: 204
2019-09-17 7:23:4 gossip broadcast - relay broadcasts: 278
2019-09-17 7:23:4 gossip broadcast - topology updates: 1
===================================================================
2019-09-17 7:23:5 Peers.garbageCollect(): 253
2019-09-17 7:23:5 routes.calculate()         -> routes.calculateBroadcast(): 82
2019-09-17 7:23:5 routes.lookupOrCalculate() -> routes.calculateBroadcast(): 246
2019-09-17 7:23:5 routes.calculateUnicast(): 164
2019-09-17 7:23:5 connectionMaker.refresh(): 96
2019-09-17 7:23:5 rx gossip unicast: 0
2019-09-17 7:23:5 rx gossip broadcast: 244
2019-09-17 7:23:5 gossip broadcast - relay broadcasts: 244
2019-09-17 7:23:5 gossip broadcast - topology updates: 4
===================================================================
2019-09-17 7:23:6 Peers.garbageCollect(): 234
2019-09-17 7:23:6 routes.calculate()         -> routes.calculateBroadcast(): 71
2019-09-17 7:23:6 routes.lookupOrCalculate() -> routes.calculateBroadcast(): 225
2019-09-17 7:23:6 routes.calculateUnicast(): 144
2019-09-17 7:23:6 connectionMaker.refresh(): 93
2019-09-17 7:23:6 rx gossip unicast: 0
2019-09-17 7:23:6 rx gossip broadcast: 228
2019-09-17 7:23:6 gossip broadcast - relay broadcasts: 226
2019-09-17 7:23:6 gossip broadcast - topology updates: 2
===================================================================
2019-09-17 7:23:7 Peers.garbageCollect(): 262
2019-09-17 7:23:7 routes.calculate()         -> routes.calculateBroadcast(): 80
2019-09-17 7:23:7 routes.lookupOrCalculate() -> routes.calculateBroadcast(): 250
2019-09-17 7:23:7 routes.calculateUnicast(): 158
2019-09-17 7:23:7 connectionMaker.refresh(): 96
2019-09-17 7:23:7 rx gossip unicast: 0
2019-09-17 7:23:7 rx gossip broadcast: 254
2019-09-17 7:23:7 gossip broadcast - relay broadcasts: 253
2019-09-17 7:23:7 gossip broadcast - topology updates: 1
===================================================================
2019-09-17 7:23:8 Peers.garbageCollect(): 264
2019-09-17 7:23:8 routes.calculate()         -> routes.calculateBroadcast(): 55
2019-09-17 7:23:8 routes.lookupOrCalculate() -> routes.calculateBroadcast(): 259
2019-09-17 7:23:8 routes.calculateUnicast(): 109
2019-09-17 7:23:8 connectionMaker.refresh(): 60
2019-09-17 7:23:8 rx gossip unicast: 0
2019-09-17 7:23:8 rx gossip broadcast: 259
2019-09-17 7:23:8 gossip broadcast - relay broadcasts: 246
2019-09-17 7:23:8 gossip broadcast - topology updates: 3
===================================================================
2019-09-17 7:23:9 Peers.garbageCollect(): 266
2019-09-17 7:23:9 routes.calculate()         -> routes.calculateBroadcast(): 60
2019-09-17 7:23:9 routes.lookupOrCalculate() -> routes.calculateBroadcast(): 318
2019-09-17 7:23:9 routes.calculateUnicast(): 120
2019-09-17 7:23:9 connectionMaker.refresh(): 71
2019-09-17 7:23:9 rx gossip unicast: 0
2019-09-17 7:23:9 rx gossip broadcast: 319
2019-09-17 7:23:9 gossip broadcast - relay broadcasts: 255
2019-09-17 7:23:9 gossip broadcast - topology updates: 0
===================================================================
2019-09-17 7:23:10 Peers.garbageCollect(): 308
2019-09-17 7:23:10 routes.calculate()         -> routes.calculateBroadcast(): 58
2019-09-17 7:23:10 routes.lookupOrCalculate() -> routes.calculateBroadcast(): 255
2019-09-17 7:23:10 routes.calculateUnicast(): 116
2019-09-17 7:23:10 connectionMaker.refresh(): 73
2019-09-17 7:23:10 rx gossip unicast: 0
2019-09-17 7:23:10 rx gossip broadcast: 258
2019-09-17 7:23:10 gossip broadcast - relay broadcasts: 302
2019-09-17 7:23:10 gossip broadcast - topology updates: 2
===================================================================

make gossipInterval configurable

Right now gossipInterval is hard-coded to 30 seconds in the mesh library, so every 30 seconds a topology update is gossiped. Processing a topology update received through gossip can get expensive as the number of nodes participating in the gossip increases.

Is there a reason to gossip periodically? Shouldn't sending topology updates only when a topology change is detected be sufficient? If periodic gossip is required, then is it desirable to increase the gossip interval to a higher value?

Provide a simple usage example

The example project in examples/increment-only-counter is over-engineered and too complicated for a usage example, and there are no examples in the docs. I would suggest writing a minimal working example in the README so people don't have to reverse-engineer the library.
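
Pending an official example, here is a rough sketch of the two interfaces an application implements; the method signatures follow the mesh API docs, and wiring up the Router itself (NewRouter, NewGossip, Start, InitiateConnections) follows examples/increment-only-counter and is omitted:

package counter

import (
	"bytes"
	"encoding/gob"
	"sync"

	"github.com/weaveworks/mesh"
)

// state is a trivially mergeable map of per-peer counters; it implements
// mesh.GossipData.
type state struct {
	mtx sync.RWMutex
	set map[mesh.PeerName]int
}

func newState() *state { return &state{set: map[mesh.PeerName]int{}} }

// Encode serializes the state for transmission on the gossip channel.
func (s *state) Encode() [][]byte {
	s.mtx.RLock()
	defer s.mtx.RUnlock()
	var buf bytes.Buffer
	if err := gob.NewEncoder(&buf).Encode(s.set); err != nil {
		panic(err)
	}
	return [][]byte{buf.Bytes()}
}

// Merge folds other (a freshly decoded, unshared copy) into s, keeping the
// larger counter per peer, and returns the merged result.
func (s *state) Merge(other mesh.GossipData) mesh.GossipData {
	s.mtx.Lock()
	defer s.mtx.Unlock()
	for name, v := range other.(*state).set {
		if v > s.set[name] {
			s.set[name] = v
		}
	}
	return s
}

// peer implements mesh.Gossiper around the state above.
type peer struct{ st *state }

// Gossip returns the complete state for periodic gossip.
func (p *peer) Gossip() mesh.GossipData { return p.st }

// OnGossip merges received gossip; a real implementation should return only
// the delta that was new, but returning the merged state keeps this sketch short.
func (p *peer) OnGossip(buf []byte) (mesh.GossipData, error) {
	other, err := decode(buf)
	if err != nil {
		return nil, err
	}
	return p.st.Merge(other), nil
}

func (p *peer) OnGossipBroadcast(src mesh.PeerName, buf []byte) (mesh.GossipData, error) {
	other, err := decode(buf)
	if err != nil {
		return nil, err
	}
	return p.st.Merge(other), nil
}

func (p *peer) OnGossipUnicast(src mesh.PeerName, buf []byte) error {
	other, err := decode(buf)
	if err != nil {
		return err
	}
	p.st.Merge(other)
	return nil
}

func decode(buf []byte) (*state, error) {
	s := newState()
	err := gob.NewDecoder(bytes.NewReader(buf)).Decode(&s.set)
	return s, err
}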

Clarify documentation on copying

Documentation on Gossiper.On* and GossipData.Merge could be clearer about whether I can modify the argument and return it again, or return the receiver, after successfully merging.

How to change peers?

For alertmanager, we'd like to be able to update the set of peers in a gossip, based on discovered kubernetes pods. What's the right way to do this?

Currently, I'm doing something like

meshRouter.ConnectionMaker.InitiateConnections(peers, true)

where meshRouter is a mesh.Router

Sub-questions:

  • does this remove / disconnect old peers?
  • does it add new ones?
  • what happens to peers that remain present across updates?
  • if peers includes the current node, what happens?

Observer for Change of peers

I don't see a way to monitor changes in the status of peers (new peer added, peer removed, connection lost).
We need to gather statistics on the status of mesh peers. Did I miss something, or how do you see it?

rate-limit routes calculation done when a gossip topology update is received

The routes() calculation is an expensive operation, O(n^3), where n is the number of peers in the mesh.

Routes are calculated every time mesh receives a gossip topology update from a peer. When all consumers of the mesh library start at once (as when a k8s cluster is starting up), or when a peer joins or leaves the mesh, each peer receives topology updates. At scale, topology update gossip can get noisy, resulting in high CPU usage from the routes calculations.

This issue is an enhancement request to add a rate limiter that serialises and limits the number of concurrent routes() calculations that can occur at a time, and also coalesces consecutive requests, effectively reducing the number of calculations; see the sketch below.
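
For illustration, a generic sketch of coalescing bursts of recalculation requests so that at most one calculation runs at a time and at most one more is queued (this is not mesh's current behaviour):

package recalc

// limiter coalesces bursts of "please recalculate" requests: while one
// calculation is running, any number of further requests collapse into at
// most one queued calculation.
type limiter struct {
	requests chan struct{}
}

func newLimiter(calculate func()) *limiter {
	l := &limiter{requests: make(chan struct{}, 1)}
	go func() {
		for range l.requests {
			calculate() // runs serially; at most one in flight
		}
	}()
	return l
}

// request asks for a recalculation; it never blocks, and requests arriving
// while one is already queued are coalesced into it.
func (l *limiter) request() {
	select {
	case l.requests <- struct{}{}:
	default: // a recalculation is already queued; coalesce
	}
}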

Weave mesh comments

  1. talk to @miolini -- I think that he and the weave team in general would have a fascinating conversation.

  2. So right now mesh requires knowledge of a single node -- why not bootstrap this from the BitTorrent DHT, while using a PSK or certificate to ensure that no ne'er-do-wells show up and crash your party? It would require knowledge of zero nodes, and enable glorious networked chaos.

  3. What happens @ >100 nodes?

:).

BTW, here's an issue list mostly made by @miolini:

https://github.com/meshbird/meshbird/issues/created_by/miolini

Yeah, he's that awesome.

high rate of Peers.garbageCollect() calls resulting from topology gossip

mesh performs peers.garbageCollect on two occasions:

  • when a direct topology change is seen by the peer as connections get deleted
  • when performing OnGossipBroadcast for topology gossip data

Peers.garbageCollect() invokes Peer.routes(), which is an O(n^2) operation, and hence results in significant CPU usage when there is significant topology change.

The following metrics (calls per second) were gathered with an instrumented mesh on a 150-node Kubernetes cluster running weave-net, which uses mesh. It is not uncommon for someone to apply a daemonset that results in each node connecting to the rest of the peers (so n^2 connections), causing significant topology changes and hence topology gossip.

===================================================================
2019-09-17 7:22:0 Peers.garbageCollect(): 365
2019-09-17 7:22:0 routes.calculate()         -> routes.calculateBroadcast(): 59
2019-09-17 7:22:0 routes.lookupOrCalculate() -> routes.calculateBroadcast(): 335
2019-09-17 7:22:0 routes.calculateUnicast(): 119
2019-09-17 7:22:0 connectionMaker.refresh(): 63
2019-09-17 7:22:0 rx gossip unicast: 0
2019-09-17 7:22:0 rx gossip broadcast: 325
2019-09-17 7:22:0 gossip broadcast - relay broadcasts: 345
2019-09-17 7:22:0 gossip broadcast - topology updates: 1
===================================================================
2019-09-17 7:22:1 Peers.garbageCollect(): 347
2019-09-17 7:22:1 routes.calculate()         -> routes.calculateBroadcast(): 68
2019-09-17 7:22:1 routes.lookupOrCalculate() -> routes.calculateBroadcast(): 328
2019-09-17 7:22:1 routes.calculateUnicast(): 135
2019-09-17 7:22:1 connectionMaker.refresh(): 70
2019-09-17 7:22:1 rx gossip unicast: 0
2019-09-17 7:22:1 rx gossip broadcast: 316
2019-09-17 7:22:1 gossip broadcast - relay broadcasts: 324
2019-09-17 7:22:1 gossip broadcast - topology updates: 0
===================================================================
2019-09-17 7:22:2 Peers.garbageCollect(): 369
2019-09-17 7:22:2 routes.calculate()         -> routes.calculateBroadcast(): 61
2019-09-17 7:22:2 routes.lookupOrCalculate() -> routes.calculateBroadcast(): 313
2019-09-17 7:22:2 routes.calculateUnicast(): 124
2019-09-17 7:22:2 connectionMaker.refresh(): 64
2019-09-17 7:22:2 rx gossip unicast: 0
2019-09-17 7:22:2 rx gossip broadcast: 315
2019-09-17 7:22:2 gossip broadcast - relay broadcasts: 343
2019-09-17 7:22:2 gossip broadcast - topology updates: 0
===================================================================
2019-09-17 7:22:3 Peers.garbageCollect(): 336
2019-09-17 7:22:3 routes.calculate()         -> routes.calculateBroadcast(): 75
2019-09-17 7:22:3 routes.lookupOrCalculate() -> routes.calculateBroadcast(): 327
2019-09-17 7:22:3 routes.calculateUnicast(): 148
2019-09-17 7:22:3 connectionMaker.refresh(): 75
2019-09-17 7:22:3 rx gossip unicast: 0
2019-09-17 7:22:3 rx gossip broadcast: 322
2019-09-17 7:22:3 gossip broadcast - relay broadcasts: 326
2019-09-17 7:22:3 gossip broadcast - topology updates: 1
===================================================================
2019-09-17 7:22:4 Peers.garbageCollect(): 353
2019-09-17 7:22:4 routes.calculate()         -> routes.calculateBroadcast(): 69
2019-09-17 7:22:4 routes.lookupOrCalculate() -> routes.calculateBroadcast(): 344
2019-09-17 7:22:4 routes.calculateUnicast(): 138
2019-09-17 7:22:4 connectionMaker.refresh(): 71
2019-09-17 7:22:4 rx gossip unicast: 0
2019-09-17 7:22:4 rx gossip broadcast: 339
2019-09-17 7:22:4 gossip broadcast - relay broadcasts: 337
2019-09-17 7:22:4 gossip broadcast - topology updates: 1
===================================================================
2019-09-17 7:22:5 Peers.garbageCollect(): 323
2019-09-17 7:22:5 routes.calculate()         -> routes.calculateBroadcast(): 68
2019-09-17 7:22:5 routes.lookupOrCalculate() -> routes.calculateBroadcast(): 330
2019-09-17 7:22:5 routes.calculateUnicast(): 136
2019-09-17 7:22:5 connectionMaker.refresh(): 70
2019-09-17 7:22:5 rx gossip unicast: 0
2019-09-17 7:22:5 rx gossip broadcast: 328
2019-09-17 7:22:5 gossip broadcast - relay broadcasts: 311
2019-09-17 7:22:5 gossip broadcast - topology updates: 3
===================================================================
2019-09-17 7:22:6 Peers.garbageCollect(): 340
2019-09-17 7:22:6 routes.calculate()         -> routes.calculateBroadcast(): 78
2019-09-17 7:22:6 routes.lookupOrCalculate() -> routes.calculateBroadcast(): 320
2019-09-17 7:22:6 routes.calculateUnicast(): 156
2019-09-17 7:22:6 connectionMaker.refresh(): 82
2019-09-17 7:22:6 rx gossip unicast: 0
2019-09-17 7:22:6 rx gossip broadcast: 321
2019-09-17 7:22:6 gossip broadcast - relay broadcasts: 322
2019-09-17 7:22:6 gossip broadcast - topology updates: 0
===================================================================
2019-09-17 7:22:7 Peers.garbageCollect(): 321
2019-09-17 7:22:7 routes.calculate()         -> routes.calculateBroadcast(): 85
2019-09-17 7:22:7 routes.lookupOrCalculate() -> routes.calculateBroadcast(): 300
2019-09-17 7:22:7 routes.calculateUnicast(): 172
2019-09-17 7:22:7 connectionMaker.refresh(): 90
2019-09-17 7:22:7 rx gossip unicast: 0
2019-09-17 7:22:7 rx gossip broadcast: 296
2019-09-17 7:22:7 gossip broadcast - relay broadcasts: 309
2019-09-17 7:22:7 gossip broadcast - topology updates: 0
===================================================================
2019-09-17 7:22:8 Peers.garbageCollect(): 313
2019-09-17 7:22:8 routes.calculate()         -> routes.calculateBroadcast(): 81
2019-09-17 7:22:8 routes.lookupOrCalculate() -> routes.calculateBroadcast(): 308
2019-09-17 7:22:8 routes.calculateUnicast(): 161
2019-09-17 7:22:8 connectionMaker.refresh(): 85
2019-09-17 7:22:8 rx gossip unicast: 0
2019-09-17 7:22:8 rx gossip broadcast: 309
2019-09-17 7:22:8 gossip broadcast - relay broadcasts: 291
2019-09-17 7:22:8 gossip broadcast - topology updates: 1
===================================================================
2019-09-17 7:22:9 Peers.garbageCollect(): 316
2019-09-17 7:22:9 routes.calculate()         -> routes.calculateBroadcast(): 84
2019-09-17 7:22:9 routes.lookupOrCalculate() -> routes.calculateBroadcast(): 307
2019-09-17 7:22:9 routes.calculateUnicast(): 167
2019-09-17 7:22:9 connectionMaker.refresh(): 88
2019-09-17 7:22:9 rx gossip unicast: 0
2019-09-17 7:22:9 rx gossip broadcast: 302
2019-09-17 7:22:9 gossip broadcast - relay broadcasts: 306
2019-09-17 7:22:9 gossip broadcast - topology updates: 0
===================================================================
2019-09-17 7:22:10 Peers.garbageCollect(): 312
2019-09-17 7:22:10 routes.calculate()         -> routes.calculateBroadcast(): 83
2019-09-17 7:22:10 routes.lookupOrCalculate() -> routes.calculateBroadcast(): 278
2019-09-17 7:22:10 routes.calculateUnicast(): 166
2019-09-17 7:22:10 connectionMaker.refresh(): 85
2019-09-17 7:22:10 rx gossip unicast: 0
2019-09-17 7:22:10 rx gossip broadcast: 275
2019-09-17 7:22:10 gossip broadcast - relay broadcasts: 300
2019-09-17 7:22:10 gossip broadcast - topology updates: 2
===================================================================

randomNeighbours selection is biased

The implementation relies on Go's map iteration order being random, but it isn't very random.

Number of times a certain element x is extracted first from a #golang map with 1000 elements. Yup, heavily skewed: if you care about extracting fairly a random element from a map do *not* use `for i := range m { return i }`.

— Carlo Alberto Ferraris (@CAFxX) June 2, 2019

[chart: how often each element is extracted first when ranging over a 1000-element map; heavily skewed]

Another example demonstrates it is even worse on small maps. With two connections, randomNeighbours() will pick the first-added one 90% of the time, which caused the Weave Net tests to fail on #120.
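
A minimal sketch of an unbiased alternative: materialise the keys into a slice and index it with math/rand, instead of relying on map iteration order (illustrative only, not mesh's implementation; the map type here is an assumption):

package neighbours

import (
	"math/rand"

	"github.com/weaveworks/mesh"
)

// randomNeighbour picks a key uniformly at random by collecting the keys
// into a slice first, avoiding the skew of `for k := range m { return k }`.
func randomNeighbour(conns map[mesh.PeerName]struct{}) (mesh.PeerName, bool) {
	if len(conns) == 0 {
		var zero mesh.PeerName
		return zero, false
	}
	names := make([]mesh.PeerName, 0, len(conns))
	for name := range conns {
		names = append(names, name)
	}
	return names[rand.Intn(len(names))], true
}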

There is something wrong

Is there something wrong?
I ran the increment-only-counter example.
If 127.0.0.1:6001 goes down, the cluster does not become consistent.
Is that the expected behaviour?

runtime error: invalid memory address or nil pointer dereference

I'm running a server that uses mesh and got this panic:

2020/09/09 01:35:25 [service] configured logging provider (stderr)
badger 2020/09/09 01:35:25 INFO: All 0 tables opened in 0s
badger 2020/09/09 01:35:25 INFO: Replaying file id: 0 at offset: 0
badger 2020/09/09 01:35:25 INFO: Replay took: 2.48µs
badger 2020/09/09 01:35:25 DEBUG: Value log discard stats empty
2020/09/09 01:35:25 [service] configured message storage (ssd)
2020/09/09 01:35:25 [service] configured usage metering (noop)
2020/09/09 01:35:25 [service] configured contracts provider (single)
2020/09/09 01:35:25 [service] configured monitoring sink (self)
2020/09/09 01:35:25 [service] configured node name (ae:b1:cc:f5:0a:d9)
2020/09/09 01:35:25 [service] starting the listener (0.0.0.0:8080)
2020/09/09 01:35:25 [tls] unable to configure certificates, make sure a valid cache or certificate is configured
2020/09/09 01:35:25 [service] service started
2020/09/09 01:35:31 [swarm] peer created (ea:c5:b0:af:45:c1)
2020/09/09 01:35:38 [swarm] peer created (ba:54:8f:83:e8:02)
2020/09/09 01:35:38 [closing] panic recovered: runtime error: invalid memory address or nil pointer dereference
 goroutine 842 [running]:
runtime/debug.Stack(0xc006775a88, 0x119d020, 0x1b2d020)
	/usr/local/go/src/runtime/debug/stack.go:24 +0x9d
github.com/emitter-io/emitter/internal/broker.(*Conn).Close(0xc007b22090, 0xbfce2a370001db67, 0x4c72e81)
	/go-build/src/github.com/emitter-io/emitter/internal/broker/conn.go:353 +0x32d
panic(0x119d020, 0x1b2d020)
	/usr/local/go/src/runtime/panic.go:969 +0x166
github.com/weaveworks/mesh.(*gossipSender).Broadcast(0xc007a8b450, 0xaeb1ccf50ad9, 0x1521c80, 0xc007b1e250)
	/go/pkg/mod/github.com/weaveworks/[email protected]/gossip.go:202 +0xb2
github.com/weaveworks/mesh.(*gossipChannel).relayBroadcast(0xc0000a3e00, 0xaeb1ccf50ad9, 0x1521c80, 0xc007b1e250)
	/go/pkg/mod/github.com/weaveworks/[email protected]/gossip_channel.go:116 +0x10b
github.com/weaveworks/mesh.(*gossipChannel).GossipBroadcast(0xc0000a3e00, 0x1521c80, 0xc007b1e250)
	/go/pkg/mod/github.com/weaveworks/[email protected]/gossip_channel.go:83 +0x4f
github.com/emitter-io/emitter/internal/service/cluster.(*Swarm).Notify(0xc0001757a0, 0x152efc0, 0xc0078d8d80, 0x1)
	/go-build/src/github.com/emitter-io/emitter/internal/service/cluster/swarm.go:359 +0xb5
github.com/emitter-io/emitter/internal/broker.(*Conn).onConnect(0xc007b22090, 0xc0079a8f00, 0x4)
	/go-build/src/github.com/emitter-io/emitter/internal/broker/conn.go:344 +0x1a6
github.com/emitter-io/emitter/internal/broker.(*Conn).onReceive(0xc007b22090, 0x152f080, 0xc0079a8f00, 0x0, 0x0)
	/go-build/src/github.com/emitter-io/emitter/internal/broker/conn.go:194 +0x7aa
github.com/emitter-io/emitter/internal/broker.(*Conn).Process(0xc007b22090, 0x0, 0x0)
	/go-build/src/github.com/emitter-io/emitter/internal/broker/conn.go:180 +0x1cb
created by github.com/emitter-io/emitter/internal/broker.(*Service).onAcceptConn
	/go-build/src/github.com/emitter-io/emitter/internal/broker/service.go:315 +0x6e

2020/09/09 01:35:38 [query] presence query received ([1156454352 2843744046 1815237614 2486959251 1909563767])
