openziti / fabric
Geo-scale overlay network and core network programming model
License: Apache License 2.0
I agree. It's the way it is b/c if ZITI_SOURCE isn't specified, it evaluates to blank and we end up with a relative path ziti-fabric/etc if there's no slash, or an absolute path /ziti-fabric/etc if there is.
If having the absolute path as a fallback is OK, I can switch it.
I considered adding logic to check for empty ZITI_SOURCE but there's nothing special about it. Other people might use different values in their config files, so it seemed a little strange to add something to the code for it.
Thoughts?
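To make the two fallbacks concrete, here is a minimal Go sketch. It assumes the config value uses `${ZITI_SOURCE}`-style expansion, which the comment above does not confirm; `expandPath`, the lookup map, and the example paths are hypothetical and only illustrate how an unset variable produces either a relative or an absolute path.

```go
package main

import (
	"fmt"
	"os"
)

// expandPath substitutes ${NAME} references using the supplied map.
// Names missing from the map expand to the empty string, which mirrors
// how an unset environment variable behaves with os.ExpandEnv.
func expandPath(template string, env map[string]string) string {
	return os.Expand(template, func(name string) string {
		return env[name]
	})
}

func main() {
	unset := map[string]string{}
	set := map[string]string{"ZITI_SOURCE": "/home/user/src"}

	// With a slash after the variable, the unset case falls back to an absolute path.
	fmt.Println(expandPath("${ZITI_SOURCE}/ziti-fabric/etc", unset))
	// Without the slash, the unset case falls back to a relative path.
	fmt.Println(expandPath("${ZITI_SOURCE}ziti-fabric/etc", unset))
	// When the variable is set, the slash is needed to separate the segments.
	fmt.Println(expandPath("${ZITI_SOURCE}/ziti-fabric/etc", set))
}
```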
We're going to need either xgress_proxy to understand UDP, or we're going to need an xgress_proxy_udp implementation that can speak plain UDP to a service.
Required for the fablab/zitilab/characterization work.
For consistency with edge, add names to service and router. Depends on short ids being added first. Default name to id if no name is provided.
Allow ziti-router instances to be used by multiple controllers.
Develop a workflow for a Fabric REST API and REST API platform that the Edge can utilize.
The fabric controller and higher-level layers (i.e. edge) need to stream events for external systems integrations that report on the functionality of the controller.
Main use case: edge session/identity events need to be exposed such that tying metrics to edge sessions/identities does not rely on scraping the REST API.
When egress connect fails, the router does not return a failure to the controller. This means that failing session dials can take much longer to return to the client than they should.
If we don't, we lose important context, especially for router terminated connections.
$ ziti-fabric list terminators
Terminators: (1)
Id | Service | Binding | Destination
ddf92520-8699-4ec4-b2ea-1a16a5270513 | ssh | transport | 003 -> tcp:192.168.9.98:22
Terminators should use shortid.
The Transwarp Initiator, along with other initiatives will probably benefit from the ability to configure and tune the maximum payload size.
If we're going to leverage data plane retransmission to cover the TI UDP implementation retransmission, we're probably going to want to tune the Payload size to be equivalent to the UDP MTU. Otherwise, the loss of a single Payload will result in multiple data plane packets being retransmitted.
There is also much fruit to be found in tuning the data plane transmission unit size.
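The relationship between Payload size and retransmission cost can be sketched with a little arithmetic. The Go snippet below assumes a Payload is retransmitted as a unit and that each UDP datagram carries at most 1472 bytes (a 1500-byte Ethernet MTU minus 20 bytes of IPv4 header and 8 bytes of UDP header). Both the function name and the sizes are illustrative, not taken from the fabric code.

```go
package main

import "fmt"

// datagramsPerPayload returns how many UDP datagrams are needed to carry one
// Payload of payloadSize bytes when each datagram holds at most maxDatagram bytes.
func datagramsPerPayload(payloadSize, maxDatagram int) int {
	if payloadSize <= 0 || maxDatagram <= 0 {
		return 0
	}
	return (payloadSize + maxDatagram - 1) / maxDatagram // ceiling division
}

func main() {
	// Assumed per-datagram capacity: 1500 (MTU) - 20 (IPv4) - 8 (UDP).
	const maxDatagram = 1472

	for _, size := range []int{1472, 16 * 1024, 64 * 1024} {
		fmt.Printf("payload=%d bytes -> %d datagrams resent if the Payload is lost\n",
			size, datagramsPerPayload(size, maxDatagram))
	}
}
```

Under these assumptions, only a Payload that fits in one datagram keeps a single loss from costing more than a single retransmitted packet.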
[ 171.041]    INFO fabric/controller/handler_ctrl.(*faultHandler).HandleReceive [ch{FsfkfA4GR}->u{classic}->i{80jk}]: link fault [l/ynWJ]
[ 171.041]    INFO fabric/controller/network.(*Network).Run: changed link [l/ynWJ]
[ 171.041]    INFO fabric/controller/network.(*Network).rerouteLink: link [l/ynWJ] changed
[ 171.042]    INFO fabric/controller/network.(*Network).LinkConnected: link [l/3Nr3] failed
[ 171.042]    INFO fabric/controller/handler_ctrl.(*faultHandler).HandleReceive [ch{6dBmMM4MR}->u{classic}->i{8ZAr}]: link fault [l/3Nr3]
[ 171.042]    INFO fabric/controller/network.(*Network).Run: changed link [l/3Nr3]
[ 171.042]    INFO fabric/controller/network.(*Network).rerouteLink: link [l/3Nr3] changed
[ 171.042]    INFO fabric/controller/network.(*Network).rerouteLink: session [s/KAz7] uses link [l/3Nr3]
[ 171.042] WARNING fabric/controller/network.(*Network).rerouteSession: rerouting [s/KAz7]
[ 171.042]   ERROR foundation/channel2.(*channelImpl).rxer [ch{NJFxCLVMR}->u{classic}->i{QJjQ}]: rx error (short read)
[ 171.043]    INFO fabric/controller/handler_ctrl.(*xctrlCloseHandler).HandleClose [ch{NJFxCLVMR}->u{classic}->i{QJjQ}]: closing Xctrl instances
[ 171.043] WARNING fabric/controller/handler_ctrl.(*closeHandler).HandleClose [r/NJFxCLVMR]: disconnected
[ 171.045]   ERROR fabric/controller/network.(*Network).rerouteSession: error sending route to [r/NJFxCLVMR] (channel closed)
[ 171.045]   ERROR runtime.gopanic: exited
panic: runtime error: index out of range [1] with length 1
goroutine 1 [running]:
github.com/openziti/fabric/controller/network.(*Network).rerouteSession(0xc0000ce200, 0xc000949200, 0x126b18d, 0x1f)
/home/andrew/remote-repos/openziti/fabric/controller/network/network.go:637 +0x5aa
github.com/openziti/fabric/controller/network.(*Network).rerouteLink(0xc0000ce200, 0xc0017be960, 0x125cb99, 0x13)
/home/andrew/remote-repos/openziti/fabric/controller/network/network.go:611 +0x29f
github.com/openziti/fabric/controller/network.(*Network).Run(0xc0000ce200)
/home/andrew/remote-repos/openziti/fabric/controller/network/network.go:568 +0x3ff
github.com/openziti/fabric/controller.(*Controller).Run(0xc0004ec680, 0x134f370, 0xc000100460)
/home/andrew/remote-repos/openziti/fabric/controller/controller.go:106 +0x601
github.com/openziti/ziti/ziti-controller/subcmd.run(0x2063fe0, 0xc0003bc5f0, 0x1, 0x1)
/home/andrew/remote-repos/openziti/ziti/ziti-controller/subcmd/run.go:62 +0x1e4
github.com/spf13/cobra.(*Command).execute(0x2063fe0, 0xc0003bc5c0, 0x1, 0x1, 0x2063fe0, 0xc0003bc5c0)
/home/andrew/go/pkg/mod/github.com/spf13/[email protected]/command.go:846 +0x29d
github.com/spf13/cobra.(*Command).ExecuteC(0x2063d40, 0x4, 0xc0000fbf48, 0x20c7d20)
/home/andrew/go/pkg/mod/github.com/spf13/[email protected]/command.go:950 +0x349
github.com/spf13/cobra.(*Command).Execute(...)
/home/andrew/go/pkg/mod/github.com/spf13/[email protected]/command.go:887
github.com/openziti/ziti/ziti-controller/subcmd.Execute()
/home/andrew/remote-repos/openziti/ziti/ziti-controller/subcmd/root.go:61 +0x31
main.main()
/home/andrew/remote-repos/openziti/ziti/ziti-controller/main.go:44 +0x20
kroot@ip-10-19-14-77:/opt/netfoundry/ziti/ziti-controller\[root@ip-10-19-14-77 ziti-controller]#
kroot@ip-10-19-14-77:/opt/netfoundry/ziti/ziti-controller\[root@ip-10-19-14-77 ziti-controller]# exit
Script done on Tue 25 Aug 2020 09:14:20 PM UTC```
Currently we can stream metrics and other events over the channel2 mgmt API. To be able to fully replace the channel2 mgmt API with REST APIs, we need to support streaming. We can do this by supporting WebSocket-based streaming endpoints.
Allow a fixed set of “static” tags to be set in the configuration of each router.
Return these tags with the metrics for that router. The tags will need to be transmitted from the router to the controller.
Darius reported an issue where he has two links with two terminators:
Id | Src -> Dst | State | Cost | Latency
ARBA | 09TjLwHMR -> OoJQ-wHGR | Connected | 100 | 0.0433 0.0433 ->down<-
AOq7 | CZejYQNMg -> 09TjLwHMR | Connected | 1 | 0.0632 0.0630 ->down<-
7Ww7 | CZejYQNMg -> M7ZjLwNGR | Connected | 1 | 0.0740 0.0740 ->down<-
7DzA | CZejYQNMg -> OoJQ-wHGR | Connected | 1 | 0.0642 0.0642 ->down<-
nPQ7 | M7ZjLwNGR -> 09TjLwHMR | Connected | 1000000 | 0.0430 0.0430
A0J7 | M7ZjLwNGR -> CZejYQNMg | Connected | 1 | 0.0742 0.0742
nLxn | M7ZjLwNGR -> OoJQ-wHGR | Connected | 1 | 0.0230 0.0229 ->down<-
nQmn | OoJQ-wHGR -> 09TjLwHMR | Connected | 1 | 0.0433 0.0433 ->down<-
ldQA | OoJQ-wHGR -> CZejYQNMg | Connected | 1 | 0.0642 0.0641 ->down<-
nMQn | OoJQ-wHGR -> M7ZjLwNGR | Connected | 1 | 0.0239 0.0238 ->down<-
Terminators: (2)
Id | Service | Binding | Destination
KkV9 | S9XlUwNMg | edge_transport | 09TjLwHMR -> tcp:10.10.255.4:22
pw7p | S9XlUwNMg | edge_transport | CZejYQNMg -> tcp:10.10.255.3:22
Services: (1)
Id | Name | Terminator Strategy | Destination(s)
S9XlUwNMg | MicroEdgeSSH | smartrouting | 09TjLwHMR -> tcp:10.10.255.4:22
| | CZejYQNMg -> tcp:10.10.255.3:22
Despite setting the cost to 1000000 on the link nPQ7 and having the smartrouting terminator selection strategy, he is still being directed to the higher cost terminator:
Sessions: (1)
Id | Client | Service | Path
XDGA | 301a5a9b-71d3-4a8b-9ac9-68d30e4b6350 | S9XlUwNMg | [r/M7ZjLwNGR]->{l/nPQ7}->[r/09TjLwHMR]
Allow the router fingerprint to be unset. This allows partially enrolled routers, which in turn will allow edge routers to extend fabric routers.
It's looking like the underlying connectivity semantics between transwarp and the traditional transport-based protocols are different enough that we're not likely to contain all of the differences within the transport framework. In other words, the transwarp abstraction is likely to leak out of the transport implementation.
So... we're probably going to need some kind of Xlink framework, encapsulating the behavior of link management for the overlay. The current implementation will likely become xlink_transport, and we'll end up developing an xlink_transwarp.
Include a facade to push core metrics handling into the router core, and contain metric surfacing within the Xlink implementation.
xlink_transport currently launches the existing channel metrics infrastructure, which directly inserts the metrics into the metrics registry. This change would have xlink_transport capturing metrics, and then calling Xlink API methods to push the data into the core router.
This would clarify and encapsulate what's required metrics-wise from an Xlink implementation to have a link properly participate in smart routing.
With pluggable Xlink in play (#54), support multiple advertisements per-router (for each listener). This probably won't consider differentiating multiple dialers with the same binding on a router, but will consider Xlink listeners in the same manner that single-listener routers were supported in previous versions.
This should also result in a bare-bones implementation of (#45).
Using nanoseconds causes latency to have an outsize effect on path selection.
Mike Guthrie was asking how to check the health of a Ziti Controller. The current strategy is to watch for open ports on the controller (REST API, Management, Control), but that doesn't mean things are working properly. He asked for a holistic endpoint to check that can report on the health of the controller.
We need to discuss what this means and how to allow the Edge to report its status as well.
Update the README for the initial open core release.
When edge terminators are created dynamically, we don't remove them if a router goes away. We probably don't want to remove them when the router goes away, but have a way to query the router when it reconnects. This will help us avoid removing terminators that are still valid in cases where the router didn't go down, just lost connection to the controller (or controller restarted).
Assume a network configured like this:
001 -> 002 <- 003
…where router 002 advertises a listener, but routers 001 and 003 do not.
If both router 001 and 003 connect to the controller at the same time, before either router has an established link with 002, the controller will end up creating redundant links. This scenario should end up creating a mesh with 2 links, but this race condition can result in an extra link for both hops.
This is ultimately because the connection handling code, which reevaluates the mesh and sends the appropriate Link commands across the control plane, is not aware of the pending links of the other connection.
In practice, this is generally harmless as the redundant links don’t actually impact the performance of anything. This is generally an administrative hygiene issue.
This could be fixed by adding pending link tracking support to the controller. When the controller issues a Link command to a router, the pending link is tracked. The list of pending links is considered along with the established links when evaluating the mesh.
Alternately, the controller could reap redundant links in a similar manner to how Failed links are reaped.
Make the dialers: configuration stanza work consistently with how listeners: works in the router config.
This will mean that unless a dialer is specified in the config, it will be unavailable in the router.
Doing this for both “security” and “consistency” reasons.
Implement basic sliding-window flow control for xlink_transwarp.
A subtask of (#44).
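For readers unfamiliar with the technique, a basic sliding window can be sketched in a few lines of Go. This is a generic illustration with cumulative acknowledgements, not the xlink_transwarp design; the `window` type and its methods are hypothetical, and sequence number wraparound is ignored for brevity.

```go
package main

import "fmt"

// window limits the number of unacknowledged messages in flight.
type window struct {
	size    int    // maximum unacknowledged messages allowed
	base    uint32 // oldest unacknowledged sequence number
	nextSeq uint32 // sequence number the next send will use
}

// trySend reserves the next sequence number if the window has room.
// It returns false when the sender must wait for acknowledgements.
func (w *window) trySend() (uint32, bool) {
	if int(w.nextSeq-w.base) >= w.size {
		return 0, false
	}
	seq := w.nextSeq
	w.nextSeq++
	return seq, true
}

// ack is cumulative: it acknowledges every sequence number up to and
// including seq, sliding the window forward. Stale or unknown acks are ignored.
func (w *window) ack(seq uint32) {
	if seq >= w.base && seq < w.nextSeq {
		w.base = seq + 1
	}
}

func main() {
	w := &window{size: 2}
	for i := 0; i < 3; i++ {
		seq, ok := w.trySend()
		fmt.Printf("send attempt %d: seq=%d ok=%v\n", i, seq, ok)
	}
	w.ack(0) // the peer acknowledges the first message
	seq, ok := w.trySend()
	fmt.Printf("after ack: seq=%d ok=%v\n", seq, ok)
}
```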
During Data Net's scaling of routers, two symptoms became apparent:
To address this a configurable pool of workers for link establishment and xgress dials will be added. This will make xgress dial and link establishment non-blocking for the control channel processing and allow multiple xgress dials/link dials to occur in the same router.
Allow tags to be stored on the router in the persistent store. Combine these tags with the static tags implemented in phase 1, and return them with each metrics message.
Currently we try to derive binding based on address when creating services. Should we do that dynamically when being dialed?
If not, we need to enforce service binding so that edge cannot bind a non-edge service.
Currently services have the following attributes
To be able to support failover and HA capable services, we need to be able to support services with multiple endpoints. To support all of the use cases, we could split out the endpoint information into a separate structure.
Service would just retain
There would be a new structure, name TBD (endpoint, terminator, egressPoint), which would have the following attributes
When a new session is established the controller would pick an endpoint from all endpoints which had the highest priority.
For SDK applications which are binding, these entries would get created/removed dynamically. The binding would allow setting the priority. The priority could also be specified as -1, which would set the priority to one lower than the currently lowest priority.
So to model different scenarios -
HA - many endpoints with the same priority
Main/backups - N endpoints with priority 0, 1, etc...
Rolling failover - N endpoints, each one grabs the next lowest priority as they bind, so that we don't fail back when servers come back up
We could also store these entries associated to router and then dynamically add/remove them to/from the services as the router goes online/offline
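The priority scheme above can be sketched in Go as follows. The sketch assumes that a smaller number means a higher priority (so priority 0 is the main endpoint), which is how the Main/backups example reads but is not stated outright. The `endpoint` type and both functions are hypothetical.

```go
package main

import "fmt"

// endpoint is a hypothetical stand-in for the new structure (name TBD).
type endpoint struct {
	address  string
	priority int // assumed: smaller number means higher priority
}

// assignPriority resolves a requested priority. A request of -1 is assigned a
// priority one lower than the lowest priority currently in use.
func assignPriority(existing []endpoint, requested int) int {
	if requested != -1 {
		return requested
	}
	lowest := -1
	for _, e := range existing {
		if e.priority > lowest {
			lowest = e.priority
		}
	}
	return lowest + 1
}

// candidates returns every endpoint that shares the highest priority.
func candidates(endpoints []endpoint) []endpoint {
	if len(endpoints) == 0 {
		return nil
	}
	best := endpoints[0].priority
	for _, e := range endpoints[1:] {
		if e.priority < best {
			best = e.priority
		}
	}
	var out []endpoint
	for _, e := range endpoints {
		if e.priority == best {
			out = append(out, e)
		}
	}
	return out
}

func main() {
	// Rolling failover: each server binds with -1 and takes the next lowest priority.
	var eps []endpoint
	for _, addr := range []string{"tcp:10.0.0.1:22", "tcp:10.0.0.2:22", "tcp:10.0.0.3:22"} {
		eps = append(eps, endpoint{addr, assignPriority(eps, -1)})
	}
	fmt.Println("all endpoints:", eps)
	fmt.Println("dial candidates:", candidates(eps))

	// The first server goes away; the next one takes over.
	eps = eps[1:]
	fmt.Println("after failover:", candidates(eps))
}
```

When the first server later rebinds with -1, it lands behind the others, which gives the "don't fail back" behavior described for rolling failover. The HA case falls out of `candidates` returning several endpoints with equal priority, from which the controller would pick one.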
Research and prototype an implementation of a custom, UDP-based data plane protocol.
The goal is for this to become our highest-performing wide-area data plane implementation.
Xgress setup has a race condition where the initiator may terminate the session before the terminator receives the Xgress Start.
If the client starts a connection, writes some short data then immediately closes the connection, the session may be closed before the start xgress can make it from the initiator to the controller to the terminator.
Instead, the xgress start should be sent in-band as the first message from the initiator (same as session end).
Improve log messages emitting from smart routing. Include as much explainability information as possible.
Fully document the configuration syntax and structure used by the controller, router, and dotziti (identities) framework.
Session failures often result in "second order" error messages, where a subsystem (xgress, for example) produces an error message, which is bubbled up to the caller. In some cases, these messages are swallowed, resulting in something generic, like EOF.
Revisit session setup error paths and endeavor to produce generally better, clearer error messages about why session setup may have failed.
Incorporate additional metrics into the core smart routing implementation. Support balancing using bandwidth metrics and targets. Balancing using other criteria (diversity).
In a cold controller restart, the controller doesn’t know what links may have already been extant on the mesh. It will recreate a new mesh, and the existing links will eventually go stale. This doesn’t technically hurt anything, but it is wasteful.
A second iteration of reconnect behavior would either persistently store the mesh state in the database, or query connecting routers for their link tables (and possibly forwarding tables, eventually) to rebuild the controller's internal representation.
Using a "connected" UDP socket reduces the amount of kernel overhead involved with sending datagrams. An "unconnected" UDP socket needs to do a route lookup for every datagram sent. At least on Linux, using a connected socket can provide significant performance benefits.
The current xlink_transwarp implementation has a "listener" side and a "dialer" side. The listener side uses a net.UDPConn to accept incoming connections, and maintains a "reader", which then uses the peer address to dereference an xlink_transwarp.impl associated with the address; there is a single reader for all of the links connected to the listener. The dialer side runs an independent reader for each xlink_transwarp.impl. Both sides use unconnected sockets.
A revision to allow connected sockets might be to use the listener purely to "rendezvous" a pair of xlink_transwarp.impl sockets on both sides, which would be connected to one another. This would function similarly to a very lightweight, purpose-built STUN implementation.
Modify the network implementation to support multiple link mesh building strategies, such that multiple data plane protocols can be run in parallel.
With the implementation of Smart Routing 2 (#3), we'll be able to allow the controller to solve for the highest performing paths between routers.
Add API to kick off a database backup
If controller.Shutdown() is called, the bbolt db file is not closed. This is fine during normal ziti-controller use, but when running inside of a testing context this can cause file lock blocking.
Take a pass through the Xgress framework to remove any stuttering in names like xgress.XgressFactory (becomes xgress.Factory), etc.
Modularize the ziti CLI, such that the ziti-fabric components, the ziti-edge components, and any other interested parties can contribute CLI commands. Use careful namespacing: ziti fabric, ziti edge, etc. to ensure that there aren't overlaps.
Migrate ziti-fabric to use this new, modular ziti infrastructure.
A fresh pre-1.0 survey and cleanup of the tracing and overlay message decoding.
Create a version of the management API accessible via REST. Framework should be extensible so the edge can use/extend the same framework to expose its management API.
Will need to be pluggable in some way so that different authentication mechanisms can be used. For example, when running in pure fabric mode, the controller may use certificate based authentication. When running with the edge components, it should allow using the edge authentication mechanisms.
We do not need to support both at the same time. If running with the edge, the pure fabric certificate mode does not need to be supported.
Enable building and running go test in Travis. Additional work such as auto-versioning and git tagging will come later.