
fabric's People

Contributors

andrewpmartinez, camotts, dependabot[bot], dovholuknf, ekoby, gberl002, mguthrie88, michaelquigley, plorenz, rentallect, sabedevops, tburtchell, ziti-ci


fabric's Issues

${ZITI_SOURCE} 2

I agree. It's the way it is b/c if ZITI_SOURCE isn't specified, it evaluates to blank and we end up with a relative path ziti-fabric/etc if there's no slash, or an absolute path /ziti-fabric/etc if there is.

If having the absolute path as a fallback is OK, I can switch it.
I considered adding logic to check for empty ZITI_SOURCE but there's nothing special about it. Other people might use different values in their config files, so it seemed a little strange to add something to the code for it.

Thoughts?
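
To make the fallback behaviour concrete, here is a minimal Go sketch. It uses os.Expand with a hypothetical lookup map rather than the real config loader (which isn't shown in this thread), so expandPath and the sample values are illustrative assumptions only.

```go
package main

import (
	"fmt"
	"os"
)

// expandPath substitutes ${VAR} references using the supplied values.
// Variables missing from the map expand to the empty string, which is the
// behaviour described above. Hypothetical helper, not the fabric's loader.
func expandPath(template string, vars map[string]string) string {
	return os.Expand(template, func(name string) string {
		return vars[name]
	})
}

func main() {
	unset := map[string]string{}
	set := map[string]string{"ZITI_SOURCE": "/home/user/src"} // sample value

	// No slash after the variable: falls back to a relative path.
	fmt.Println(expandPath("${ZITI_SOURCE}ziti-fabric/etc", unset))
	// Slash after the variable: falls back to an absolute path.
	fmt.Println(expandPath("${ZITI_SOURCE}/ziti-fabric/etc", unset))
	// Variable set: both forms point under the source tree.
	fmt.Println(expandPath("${ZITI_SOURCE}/ziti-fabric/etc", set))
}
```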

Originally posted by @plorenz in #15

xgress_proxy UDP

We're going to need either xgress_proxy to understand UDP, or we're going to need an xgress_proxy_udp implementation that can speak plain UDP to a service.

Required for the fablab/zitilab/characterization work.

Add name to service and router

For consistency with edge, add names to service and router. Depends on short ids being added first. Default name to id if no name is provided.

Fabric REST API

Develop a workflow for a Fabric REST API and REST API platform that the Edge can utilize.

  • Open API documentation (aka swagger)
  • Generated server flow
  • Initial support for the Fabric JSON Gateway API endpoints

Event Streaming

The fabric controller and higher-level layers (i.e. edge) need to stream events for external systems integrations that report on the functionality of the controller.

Main use case: edge session/identity events need to be exposed such that tying metrics to edge sessions/identities does not rely on scraping the REST API.

  • Expose programmatic interfaces/APIs:
    -- to allow events to be streamed from higher-level layers (i.e. edge)
    -- to allow events to be received and handled through a generic interface
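
As a starting point for discussion, the two interfaces above could be sketched roughly like this. Every name here (Event, Handler, Dispatcher, the "edge.session" namespace) is hypothetical and only meant to show the shape: one side emits, the other subscribes through a single generic interface.

```go
package main

import (
	"fmt"
	"sync"
)

// Event is a hypothetical generic envelope for anything a layer wants to publish.
type Event struct {
	Namespace string // e.g. "edge.session" (illustrative)
	Type      string // e.g. "created"
	Data      map[string]string
}

// Handler is the generic receiving interface.
type Handler interface {
	HandleEvent(e Event)
}

// HandlerFunc adapts a plain function to the Handler interface.
type HandlerFunc func(e Event)

func (f HandlerFunc) HandleEvent(e Event) { f(e) }

// Dispatcher lets higher-level layers emit events and integrations subscribe.
type Dispatcher struct {
	mu       sync.RWMutex
	handlers []Handler
}

// Subscribe registers a handler for all subsequent events.
func (d *Dispatcher) Subscribe(h Handler) {
	d.mu.Lock()
	defer d.mu.Unlock()
	d.handlers = append(d.handlers, h)
}

// Emit delivers the event synchronously to every registered handler.
func (d *Dispatcher) Emit(e Event) {
	d.mu.RLock()
	defer d.mu.RUnlock()
	for _, h := range d.handlers {
		h.HandleEvent(e)
	}
}

func main() {
	d := &Dispatcher{}
	d.Subscribe(HandlerFunc(func(e Event) {
		fmt.Printf("%s.%s %v\n", e.Namespace, e.Type, e.Data)
	}))
	d.Emit(Event{Namespace: "edge.session", Type: "created", Data: map[string]string{"id": "s1"}})
}
```

Delivery is synchronous here to keep the sketch short; a real implementation would probably need to decide whether slow handlers may block the emitter.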

Shortids for Terminators

$ ziti-fabric list terminators

Terminators: (1)

Id           | Service      | Binding      | Destination
ddf92520-8699-4ec4-b2ea-1a16a5270513 | ssh          | transport    | 003          -> tcp:192.168.9.98:22

Terminators should use shortid.

Data Plane Payload Size Configuration

The Transwarp Initiator, along with other initiatives, will probably benefit from the ability to configure and tune the maximum payload size.

If we're going to leverage data plane retransmission to cover the TI UDP implementation retransmission, we're probably going to want to tune the Payload size to be equivalent to the UDP MTU. Otherwise, a single Payload loss will result in multiple data plane packets being retransmitted.

There is also much fruit to be found in tuning the data plane transmission unit size.
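
One reading of the retransmission concern, as a small arithmetic sketch: if a Payload is larger than what fits in one datagram, it is carried as several datagrams, and losing it means resending all of them. The 1400-byte figure and the function name are assumptions for illustration, not values from the fabric.

```go
package main

import "fmt"

// datagramsPerPayload returns how many datagrams are needed to carry one
// payload when each datagram holds at most maxDatagram bytes.
func datagramsPerPayload(payloadSize, maxDatagram int) int {
	if payloadSize <= 0 || maxDatagram <= 0 {
		return 0
	}
	return (payloadSize + maxDatagram - 1) / maxDatagram // ceiling division
}

func main() {
	const maxDatagram = 1400 // illustrative usable datagram size
	for _, size := range []int{1400, 16384, 65536} {
		fmt.Printf("payload=%d bytes -> %d datagram(s) resent per lost payload\n",
			size, datagramsPerPayload(size, maxDatagram))
	}
}
```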

If link vanishes during reroute, controller can panic

[ 171.041]    INFO fabric/controller/handler_ctrl.(*faultHandler).HandleReceive [ch{FsfkfA4GR}->u{classic}->i{80jk}]: link fault [l/ynWJ]
[ 171.041]    INFO fabric/controller/network.(*Network).Run: changed link [l/ynWJ]
[ 171.041]    INFO fabric/controller/network.(*Network).rerouteLink: link [l/ynWJ] changed
[ 171.042]    INFO fabric/controller/network.(*Network).LinkConnected: link [l/3Nr3] failed
[ 171.042]    INFO fabric/controller/handler_ctrl.(*faultHandler).HandleReceive [ch{6dBmMM4MR}->u{classic}->i{8ZAr}]: link fault [l/3Nr3]
[ 171.042]    INFO fabric/controller/network.(*Network).Run: changed link [l/3Nr3]
[ 171.042]    INFO fabric/controller/network.(*Network).rerouteLink: link [l/3Nr3] changed
[ 171.042]    INFO fabric/controller/network.(*Network).rerouteLink: session [s/KAz7] uses link [l/3Nr3]
[ 171.042] WARNING fabric/controller/network.(*Network).rerouteSession: rerouting [s/KAz7]
[ 171.042]   ERROR foundation/channel2.(*channelImpl).rxer [ch{NJFxCLVMR}->u{classic}->i{QJjQ}]: rx error (short read)
[ 171.043]    INFO fabric/controller/handler_ctrl.(*xctrlCloseHandler).HandleClose [ch{NJFxCLVMR}->u{classic}->i{QJjQ}]: closing Xctrl instances
[ 171.043] WARNING fabric/controller/handler_ctrl.(*closeHandler).HandleClose [r/NJFxCLVMR]: disconnected
[ 171.045]   ERROR fabric/controller/network.(*Network).rerouteSession: error sending route to [r/NJFxCLVMR] (channel closed)
[ 171.045]   ERROR runtime.gopanic: exited
panic: runtime error: index out of range [1] with length 1
goroutine 1 [running]:
github.com/openziti/fabric/controller/network.(*Network).rerouteSession(0xc0000ce200, 0xc000949200, 0x126b18d, 0x1f)
	/home/andrew/remote-repos/openziti/fabric/controller/network/network.go:637 +0x5aa
github.com/openziti/fabric/controller/network.(*Network).rerouteLink(0xc0000ce200, 0xc0017be960, 0x125cb99, 0x13)
	/home/andrew/remote-repos/openziti/fabric/controller/network/network.go:611 +0x29f
github.com/openziti/fabric/controller/network.(*Network).Run(0xc0000ce200)
	/home/andrew/remote-repos/openziti/fabric/controller/network/network.go:568 +0x3ff
github.com/openziti/fabric/controller.(*Controller).Run(0xc0004ec680, 0x134f370, 0xc000100460)
	/home/andrew/remote-repos/openziti/fabric/controller/controller.go:106 +0x601
github.com/openziti/ziti/ziti-controller/subcmd.run(0x2063fe0, 0xc0003bc5f0, 0x1, 0x1)
	/home/andrew/remote-repos/openziti/ziti/ziti-controller/subcmd/run.go:62 +0x1e4
github.com/spf13/cobra.(*Command).execute(0x2063fe0, 0xc0003bc5c0, 0x1, 0x1, 0x2063fe0, 0xc0003bc5c0)
	/home/andrew/go/pkg/mod/github.com/spf13/[email protected]/command.go:846 +0x29d
github.com/spf13/cobra.(*Command).ExecuteC(0x2063d40, 0x4, 0xc0000fbf48, 0x20c7d20)
	/home/andrew/go/pkg/mod/github.com/spf13/[email protected]/command.go:950 +0x349
github.com/spf13/cobra.(*Command).Execute(...)
	/home/andrew/go/pkg/mod/github.com/spf13/[email protected]/command.go:887
github.com/openziti/ziti/ziti-controller/subcmd.Execute()
	/home/andrew/remote-repos/openziti/ziti/ziti-controller/subcmd/root.go:61 +0x31
main.main()
	/home/andrew/remote-repos/openziti/ziti/ziti-controller/main.go:44 +0x20
Script done on Tue 25 Aug 2020 09:14:20 PM UTC

Tags in Streaming Metrics (Phase 1)

Allow a fixed set of “static” tags to be set in the configuration of each router.

Return these tags with the metrics for that router. The tags will need to be transmitted from the router to the controller.

Terminators and Fixed Link Cost, Incorrect Path Selection

Darius reported an issue where he has two links with two terminators:

Id     | Src                      -> Dst                      | State        | Cost | Latency
ARBA   | 09TjLwHMR                -> OoJQ-wHGR                | Connected    | 100  | 0.0433 0.0433 ->down<-
AOq7   | CZejYQNMg                -> 09TjLwHMR                | Connected    | 1    | 0.0632 0.0630 ->down<-
7Ww7   | CZejYQNMg                -> M7ZjLwNGR                | Connected    | 1    | 0.0740 0.0740 ->down<-
7DzA   | CZejYQNMg                -> OoJQ-wHGR                | Connected    | 1    | 0.0642 0.0642 ->down<-
nPQ7   | M7ZjLwNGR                -> 09TjLwHMR                | Connected    | 1000000 | 0.0430 0.0430
A0J7   | M7ZjLwNGR                -> CZejYQNMg                | Connected    | 1    | 0.0742 0.0742
nLxn   | M7ZjLwNGR                -> OoJQ-wHGR                | Connected    | 1    | 0.0230 0.0229 ->down<-
nQmn   | OoJQ-wHGR                -> 09TjLwHMR                | Connected    | 1    | 0.0433 0.0433 ->down<-
ldQA   | OoJQ-wHGR                -> CZejYQNMg                | Connected    | 1    | 0.0642 0.0641 ->down<-
nMQn   | OoJQ-wHGR                -> M7ZjLwNGR                | Connected    | 1    | 0.0239 0.0238 ->down<-

Terminators: (2)
Id           | Service      | Binding      | Destination
KkV9         | S9XlUwNMg    | edge_transport | 09TjLwHMR    -> tcp:10.10.255.4:22
pw7p         | S9XlUwNMg    | edge_transport | CZejYQNMg    -> tcp:10.10.255.3:22

Services: (1)
Id           | Name         | Terminator Strategy | Destination(s)
S9XlUwNMg    | MicroEdgeSSH | smartrouting | 09TjLwHMR    -> tcp:10.10.255.4:22
             |              | CZejYQNMg    -> tcp:10.10.255.3:22

Despite setting the cost to 1000000 on the link nPQ7 and having the smartrouting terminator selection strategy, he is still being directed to the higher cost terminator:

Sessions: (1)
Id           | Client       | Service      | Path
XDGA         | 301a5a9b-71d3-4a8b-9ac9-68d30e4b6350 | S9XlUwNMg    | [r/M7ZjLwNGR]->{l/nPQ7}->[r/09TjLwHMR]

Xlink

It's looking like the underlying connectivity semantics between transwarp and the traditional transport-based protocols are different enough that we're not likely to contain all of the differences within the transport framework. In other words, the transwarp abstraction is likely to leak out of the transport implementation.

So... we're probably going to need some kind of Xlink framework, encapsulating the behavior of link management for the overlay. The current implementation will likely become xlink_transport, and we'll end up developing an xlink_transwarp.

Xlink Metrics Facade

Include a facade to push core metrics handling into the router core, and contain metric surfacing within the Xlink implementation.

xlink_transport currently launches the existing channel metrics infrastructure, which directly inserts the metrics into the metrics registry. This change would have xlink_transport capturing metrics, and then calling Xlink API methods to push the data into the core router.

This would clarify and encapsulate what's required metrics-wise from an Xlink implementation to have a link properly participate in smart routing.

Xlink Multiple Advertisements

With pluggable Xlink in play (#54), support multiple advertisements per-router (for each listener). This probably won't consider differentiating multiple dialers with the same binding on a router, but will consider Xlink listeners in the same manner that single-listener routers were supported in previous versions.

This should also result in a bare-bones implementation of (#45).

Xlink Interface Binding

With Xlink (#54) and multi-link overlays (#45), it would be useful to support binding Xlink dialers and listeners to specific NIC addresses.

Health Check API

Mike Guthrie was asking how to check the health of a Ziti Controller. The current strategy is to watch for open ports on the controller (REST API, Management, Control), but that doesn't mean things are working properly. He asked for a holistic endpoint to check that can report on the health of the controller.

We need to discuss what this means and how to allow the Edge to report its status as well.

Update README

Update the README for the initial open core release.

Stopping a router does not remove dynamically created terminators

When edge terminators are created dynamically, we don't remove them if a router goes away. We probably don't want to remove them when the router goes away, but have a way to query the router when it reconnects. This will help us avoid removing terminators that are still valid in cases where the router didn't go down, just lost connection to the controller (or controller restarted).

Mesh Building Race

Assume a network configured like this:

001 -> 002 <- 003

…where router 002 advertises a listener, but routers 001 and 003 do not.

If both router 001 and 003 connect to the controller at the same time, before either router has an established link with 002, the controller will end up creating redundant links. This scenario should end up creating a mesh with 2 links, but this race condition can result in an extra link for both hops.

This is ultimately because the connection handling code, which reevaluates the mesh and sends the appropriate Link commands across the control plane, is not aware of the pending links of the other connection.

In practice, this is generally harmless as the redundant links don’t actually impact the performance of anything. This is generally an administrative hygiene issue.

Potential Fixes

This could be fixed by adding pending link tracking support to the controller. When the controller issues a Link command to a router, the pending link is tracked. The list of pending links is considered along with the established links when evaluating the mesh.

Alternately, the controller could reap redundant links in a similar manner to how Failed links are reaped.
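
The first option, pending link tracking, might look roughly like the sketch below. The linkTable type and its methods are hypothetical stand-ins for the controller's link state; the point is only that mesh evaluation consults pending links as well as established ones before issuing a Link command.

```go
package main

import (
	"fmt"
	"sync"
)

// linkKey identifies a link by dialing router and listening router.
type linkKey struct{ src, dst string }

// linkTable is a hypothetical stand-in for the controller's link state.
type linkTable struct {
	mu          sync.Mutex
	established map[linkKey]bool
	pending     map[linkKey]bool
}

func newLinkTable() *linkTable {
	return &linkTable{established: map[linkKey]bool{}, pending: map[linkKey]bool{}}
}

// RequestLink reports whether a Link command should be sent for src->dst.
// It returns false when that link is already established or pending.
func (t *linkTable) RequestLink(src, dst string) bool {
	t.mu.Lock()
	defer t.mu.Unlock()
	k := linkKey{src, dst}
	if t.established[k] || t.pending[k] {
		return false
	}
	t.pending[k] = true
	return true
}

// LinkConnected promotes a pending link to established.
func (t *linkTable) LinkConnected(src, dst string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	k := linkKey{src, dst}
	delete(t.pending, k)
	t.established[k] = true
}

// LinkFailed clears the pending entry so a later evaluation can retry.
func (t *linkTable) LinkFailed(src, dst string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	delete(t.pending, linkKey{src, dst})
}

func main() {
	links := newLinkTable()
	sent := 0
	// Routers 001 and 003 connect at the same time; each connection
	// handler reevaluates the whole mesh against listener 002.
	for handler := 0; handler < 2; handler++ {
		for _, src := range []string{"001", "003"} {
			if links.RequestLink(src, "002") {
				sent++
			}
		}
	}
	fmt.Println("Link commands sent:", sent)
}
```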

"dialers" Congruent with "listeners" in Router Config

Make the dialers: configuration stanza work consistently with how listeners: works in the router config.

This will mean that unless a dialer is specified in the config, it will be unavailable in the router.

Doing this for both “security” and “consistency” reasons.

Add worker pools for link and xgress dials

During Data Net's scaling of routers two symptoms became apparent:

  • it took longer and longer for each router to establish links, since there were more routers to dial
  • during xgress dials in route establishment, the control channel processing would be held up by the dial

To address this a configurable pool of workers for link establishment and xgress dials will be added. This will make xgress dial and link establishment non-blocking for the control channel processing and allow multiple xgress dials/link dials to occur in the same router.

Tags in Streaming Metrics (Phase 2)

Allow tags to be stored on the router in the persistent store. Combine these tags with the static tags implemented in phase 1, and return them with each metrics message.

Service Binding 2

Currently we try to derive binding based on address when creating services. Should we do that dynamically when being dialed?
If not, we need to enforce service binding so that edge cannot bind a non-edge service.

Pluggable Service Terminator Strategies (with High Availability Implementations)

Currently services have the following attributes

  • name
  • binding
  • egressRouter
  • endpointAddress

To be able to support failover and HA capable services, we need to be able to support services with multiple endpoints. To support all of the use cases, we could split out the endpoint information into a separate structure.

Service would just retain

  • name

There would be a new structure, name TBD (endpoint, terminator, egressPoint), which would have the following attributes

  • id and/or name?
  • priority
  • binding
  • egressRouter
  • endpointAddress

When a new session is established the controller would pick an endpoint from all endpoints which had the highest priority.

For SDK applications which are binding, these entries would get created/removed dynamically. The binding would allow setting the priority. The priority could also be specified as -1, which would set the priority to one lower than the currently lowest priority.

So to model different scenarios -

HA - many endpoints with the same priority
Main/backups - N endpoints with priority 0, 1, etc...
Rolling failover - N endpoints, each one grabs the next lowest priority as they bind, so that we don't fail back when servers come back up
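
To check that the three scenarios fall out of one rule, here is a rough sketch of the selection and of the -1 convention. It assumes that a numerically smaller value means higher priority (0 is the "main"), which the proposal implies but doesn't state outright; the Terminator type and function names are placeholders.

```go
package main

import "fmt"

// Terminator is a hypothetical shape for the proposed structure.
type Terminator struct {
	ID              string
	Priority        int // assumption: smaller number = higher priority
	Binding         string
	EgressRouter    string
	EndpointAddress string
}

// candidates returns every terminator that shares the best priority.
// With equal priorities (HA) all are returned; with distinct priorities
// (main/backups) only the current main is.
func candidates(ts []Terminator) []Terminator {
	if len(ts) == 0 {
		return nil
	}
	best := ts[0].Priority
	for _, t := range ts[1:] {
		if t.Priority < best {
			best = t.Priority
		}
	}
	var out []Terminator
	for _, t := range ts {
		if t.Priority == best {
			out = append(out, t)
		}
	}
	return out
}

// resolvePriority applies the -1 convention: one step below the
// currently lowest priority among the existing terminators.
func resolvePriority(requested int, existing []Terminator) int {
	if requested != -1 {
		return requested
	}
	lowest := -1 // so the first terminator to bind gets priority 0
	for _, t := range existing {
		if t.Priority > lowest {
			lowest = t.Priority
		}
	}
	return lowest + 1
}

func main() {
	// Rolling failover: each server binds with -1 as it comes up.
	var ts []Terminator
	for _, id := range []string{"a", "b", "c"} {
		ts = append(ts, Terminator{ID: id, Priority: resolvePriority(-1, ts)})
	}
	fmt.Println("active:", candidates(ts)[0].ID)

	// "a" goes away, "b" takes over; when "a" rebinds it joins at the back.
	ts = ts[1:]
	ts = append(ts, Terminator{ID: "a", Priority: resolvePriority(-1, ts)})
	fmt.Println("active after failover:", candidates(ts)[0].ID)
}
```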

We could also store these entries associated to router and then dynamically add/remove them to/from the services as the router goes online/offline

xlink_transwarp

Research and prototype an implementation of a custom, UDP-based data plane protocol.

The goal is for this to become our highest-performing wide-area data plane implementation.

Outstanding Issues

  • Basic Payload/Acknowledgement
  • Throttle Control 1
  • Fragmentation (>MSS)
  • Timeout (no ping response >)
  • Shutdown
  • Public Key Swap
  • Message Segment Encryption (not Data Segment)
  • Metrics

Xgress setup has a race condition

Xgress setup has a race condition where the initiator may terminate the session before the terminator receives the Xgress Start.

If the client starts a connection, writes some short data then immediately closes the connection, the session may be closed before the start xgress can make it from the initiator to the controller to the terminator.

Instead, the xgress start should be sent in-band as the first message from the initiator (same as session end).

Fully Documented Configuration

Fully document the configuration syntax and structure used by the controller, router, and dotziti (identities) framework.

Fabric Session Error Reporting Improvements

Session failures often result in "second order" error messages, where a subsystem (xgress, for example) produces an error message, which is bubbled up to the caller. In some cases, these messages are swallowed, resulting in something generic, like EOF.

Revisit session setup error paths and endeavor to produce generally better, clearer error messages about why session setup may have failed.

Additional Smart Routing Metrics

Incorporate additional metrics into the core smart routing implementation. Support balancing using bandwidth metrics and targets. Balancing using other criteria (diversity).

Resynchronize Mesh on Reconnect

In a cold controller restart, the controller doesn’t know what links may have already been extant on the mesh. It will recreate a new mesh, and the existing links will eventually go stale. This doesn’t technically hurt anything, but it is wasteful.

A second iteration of reconnect behavior would persistently store the mesh state in the database, and then query connecting routers for their link tables (and possibly forwarding tables, eventually). It may also be useful to query the routers' link tables to rebuild the controller's internal representation.

Transwarp 2-Phase Connect (connected UDP sockets)

Using a "connected" UDP socket reduces the amount of kernel overhead involved with sending datagrams. An "unconnected" UDP socket needs to do a route lookup for every datagram sent. At least on Linux, using a connected socket can provide significant performance benefits.
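
For reference, this is what the two socket styles look like with Go's standard net package. It is a loopback-only sketch with a made-up helper name (sendBoth), and it shows the API difference only; it makes no attempt to measure the performance gap.

```go
package main

import (
	"fmt"
	"net"
	"time"
)

// sendBoth sends msg to a local listener once from an unconnected socket
// and once from a connected socket, and returns what the listener read.
func sendBoth(msg string) ([]string, error) {
	loopback := &net.UDPAddr{IP: net.IPv4(127, 0, 0, 1), Port: 0}

	listener, err := net.ListenUDP("udp", loopback)
	if err != nil {
		return nil, err
	}
	defer listener.Close()
	peer := listener.LocalAddr().(*net.UDPAddr)

	// Unconnected: the destination is supplied on every write.
	unconnected, err := net.ListenUDP("udp", loopback)
	if err != nil {
		return nil, err
	}
	defer unconnected.Close()
	if _, err := unconnected.WriteToUDP([]byte("unconnected:"+msg), peer); err != nil {
		return nil, err
	}

	// Connected: the peer is fixed at dial time, so plain Write is used.
	connected, err := net.DialUDP("udp", nil, peer)
	if err != nil {
		return nil, err
	}
	defer connected.Close()
	if _, err := connected.Write([]byte("connected:" + msg)); err != nil {
		return nil, err
	}

	// Read both datagrams back, giving up after a short deadline.
	if err := listener.SetReadDeadline(time.Now().Add(2 * time.Second)); err != nil {
		return nil, err
	}
	buf := make([]byte, 1500)
	var got []string
	for i := 0; i < 2; i++ {
		n, _, err := listener.ReadFromUDP(buf)
		if err != nil {
			return got, err
		}
		got = append(got, string(buf[:n]))
	}
	return got, nil
}

func main() {
	got, err := sendBoth("hello")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, g := range got {
		fmt.Println(g)
	}
}
```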

The current xlink_transwarp implementation has a "listener" side and a "dialer" side. The listener side uses a net.UDPConn to accept incoming connections, and maintains a "reader", which then uses the peer address to dereference an xlink_transwarp.impl associated with the address; there is a single reader for all of the links connected to the listener. The dialer side runs an independent reader for each xlink_transwarp.impl. Both sides use unconnected sockets.

A revision to allow connected sockets might be to use the listener purely to "rendezvous" a pair of xlink_transwarp.impl sockets on both sides, which would be connected to one another. This would function similarly to a very lightweight, purpose-built STUN implementation.

Multi-Link Network 1

Modify the network implementation to support multiple link mesh building strategies, such that multiple data plane protocols can be run in parallel.

With the implementation of Smart Routing 2 (#3), we'll be able to allow the controller to solve for the highest performing paths between routers.

shutdown does not close db

If controller.Shutdown() is called, the bbolt db file is not closed. This is fine during normal ziti-controller use, but when running inside of a testing context this can cause file lock blocking.
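
A sketch of the shape a fix could take, assuming the controller holds its database handle behind something with a Close() error method (as the bbolt DB type has). The Controller and fakeDB types below are stripped-down placeholders, not the real ones.

```go
package main

import (
	"fmt"
	"io"
)

// Controller is a hypothetical, stripped-down stand-in for the real type.
type Controller struct {
	db io.Closer // in the real controller this would be the bbolt handle
}

// Shutdown releases the database so its file lock is dropped.
func (c *Controller) Shutdown() error {
	if c.db == nil {
		return nil
	}
	err := c.db.Close()
	c.db = nil // makes repeated Shutdown calls harmless
	return err
}

// fakeDB counts Close calls so a test can assert on them.
type fakeDB struct{ closes int }

func (f *fakeDB) Close() error {
	f.closes++
	return nil
}

func main() {
	db := &fakeDB{}
	c := &Controller{db: db}
	if err := c.Shutdown(); err != nil {
		fmt.Println("shutdown error:", err)
		return
	}
	fmt.Println("db close calls:", db.closes)
}
```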

De-stutter Xgress Framework

Take a pass through the Xgress framework to remove any stuttering in names like xgress.XgressFactory (becomes xgress.Factory), etc.

Migrate 'ziti-fabric' into Modularized 'ziti'

Modularize the ziti CLI, such that the ziti-fabric components, the ziti-edge components, and any other interested parties can contribute CLI commands. Use careful namespacing: ziti fabric, ziti edge, etc. to ensure that there aren't overlaps.

Migrate ziti-fabric to use this new, modular ziti infrastructure.

Tracing(/Decoder) 2

A fresh pre-1.0 survey and cleanup of the tracing and overlay message decoding.

Extensible REST Management API

Create a version of the management API accessible via REST. Framework should be extensible so the edge can use/extend the same framework to expose its management API.

Will need to be pluggable in some way so that different authentication mechanisms can be used. For example, when running in pure fabric mode, the controller may use certificate based authentication. When running with the edge components, it should allow using the edge authentication mechanisms.

We do not need to support both at the same time. If running with the edge, the pure fabric certificate mode does not need to be supported.

Initial Travis CI integration

Enable building and running go test in Travis. Additional work such as auto-versioning and git tagging will come later.
