
Comments (13)

jamilbk commented on September 26, 2024

The problem is that even if only 5% of connections fail, that can still impact entire organizations (e.g. when an organization's gateway is behind NAT).

I think the added operational/product complexity here (on the portal too) might nullify the benefits, considering that (1) the relay may not see many updates once it's "stable" and (2) we will have more operational maturity once we are soak testing staging.


jamilbk commented on September 26, 2024

Let's chat more after standup tomorrow; it would be good if we all rowed in the same direction on this topic.


jamilbk commented on September 26, 2024

Maybe this approach would be a better ROI long-term:

#4506


thomaseizinger commented on September 26, 2024

The problem is that even if only 5% of connections fail, that can still impact entire organizations (e.g. when an organization's gateway is behind NAT).

Surely it is better to impact only one organisation instead of all of them?

the added operational/product complexity here (on the portal too) might nullify the benefits, considering that (1) the relay may not see many updates once it's "stable" and (2) we will have more operational maturity once we are soak testing staging.

How is the portal's operation affected? I don't think it needs to monitor the uptime of those servers via a websocket or anything. IMO it would be good enough to have monitoring on the machines: they are so simple that they will virtually never stop working unless there is a GCP outage or something.

It seems like a quick win to me because we already have all the code to do this, just need to use it.


jamilbk commented on September 26, 2024

It seems like a quick win to me because we already have all the code to do this, just need to use it.

The portal needs to know what STUN/TURN servers to send the clients upon each connection attempt, right?

We could use public ones, but would we have fallbacks if those go offline or are inaccessible in some jurisdictions? And would we be adhering to their Terms of Use?
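To make concrete what the portal would have to hand out, it's roughly something like this per connection attempt (a hypothetical shape for illustration, not the actual portal message format):

```rust
// Hypothetical shape of the ICE server list the portal could send a client
// or gateway on each connection attempt; not the actual wire format.
use std::net::SocketAddr;

enum IceServer {
    // STUN needs no credentials: the client just sends a Binding request.
    Stun { addr: SocketAddr },
    // TURN allocations require (time-limited) credentials.
    Turn {
        addr: SocketAddr,
        username: String,
        password: String,
    },
}

fn main() {
    // Whether these are our own servers or public ones, the portal has to
    // hand out some such list (and keep it accurate).
    let servers = vec![
        IceServer::Stun { addr: "203.0.113.1:3478".parse().unwrap() },
        IceServer::Turn {
            addr: "203.0.113.2:3478".parse().unwrap(),
            username: "demo".into(),
            password: "demo".into(),
        },
    ];
    println!("would offer {} ICE servers", servers.len());
}
```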


jamilbk commented on September 26, 2024

I'm not necessarily disagreeing that it might be (or might have been) a good idea to split the STUN and TURN servers, but now is probably not the best time to do so, and by the time it is, maybe this will be a solved problem?


thomaseizinger commented on September 26, 2024

It seems like a quick win to me because we already have all the code to do this, just need to use it.

The portal needs to know what STUN/TURN servers to send the clients upon each connection attempt, right?

For STUN, we don't need regional IPs because it is a single request-response protocol. You can probably serve tens of thousands of clients with a single server, if not hundreds of thousands.

We could deploy 4, always statically return those, and call it a day. They aren't critical for operation because the TURN servers can do the same job, but they would keep most of firezone operational if the relays go down for some reason.
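To give a sense of how small such a server is, here is a minimal sketch of a STUN Binding responder using only the Rust standard library (IPv4 only, barely any error handling; an illustration, not our actual relay code):

```rust
// Minimal STUN Binding responder (RFC 5389 layout), std-only, IPv4 only.
// Hypothetical sketch for illustration, not Firezone's relay code.
use std::net::{SocketAddr, UdpSocket};

const MAGIC_COOKIE: u32 = 0x2112_A442;
const BINDING_REQUEST: u16 = 0x0001;
const BINDING_SUCCESS: u16 = 0x0101;
const XOR_MAPPED_ADDRESS: u16 = 0x0020;

fn main() -> std::io::Result<()> {
    let socket = UdpSocket::bind("0.0.0.0:3478")?;
    let mut buf = [0u8; 1500];

    loop {
        let (len, from) = socket.recv_from(&mut buf)?;
        // A STUN header is 20 bytes; ignore anything shorter or non-Binding.
        if len < 20 || u16::from_be_bytes([buf[0], buf[1]]) != BINDING_REQUEST {
            continue;
        }
        let transaction_id = &buf[8..20];

        // Tell the sender its reflexive (public) address, XORed per the RFC.
        let SocketAddr::V4(v4) = from else { continue };
        let x_port = v4.port() ^ (MAGIC_COOKIE >> 16) as u16;
        let x_addr = u32::from(*v4.ip()) ^ MAGIC_COOKIE;

        let mut resp = Vec::with_capacity(32);
        resp.extend_from_slice(&BINDING_SUCCESS.to_be_bytes());
        resp.extend_from_slice(&12u16.to_be_bytes()); // length of attributes
        resp.extend_from_slice(&MAGIC_COOKIE.to_be_bytes());
        resp.extend_from_slice(transaction_id);
        resp.extend_from_slice(&XOR_MAPPED_ADDRESS.to_be_bytes());
        resp.extend_from_slice(&8u16.to_be_bytes()); // attribute value length
        resp.extend_from_slice(&[0x00, 0x01]); // reserved + family (IPv4)
        resp.extend_from_slice(&x_port.to_be_bytes());
        resp.extend_from_slice(&x_addr.to_be_bytes());

        let _ = socket.send_to(&resp, from);
    }
}
```

IPv6, rate limiting and so on are elided; the point is just that the hot path is a single recv/send pair.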


jamilbk commented on September 26, 2024

Hm I see. I'm still not sold on this idea. Would we really be more cavalier in our updates to the Relay if we know they might only affect 10% of our customers? That doesn't feel right.

That would still be a critical component outage. We already have some operational experience here with support requests from customers who chose "STUN-only" when creating a site only to hit unexpected connectivity issues when the Relays weren't accessible.

So we need to make it reliable, and if we're already doing that, why not rely on it?

Before @AndrewDryga or @bmanifold start work on this, I would consider these prerequisites:

  • @AndrewDryga any thoughts?
  • What STUN server would we use and how would it be deployed / operated?
  • Why is #4290 happening?
  • How many connections are getting Relayed today?


thomaseizinger commented on September 26, 2024

Hm I see. I'm still not sold on this idea. Would we really be more cavalier in our updates to the Relay if we know they might only affect 10% of our customers? That doesn't feel right.

That would still be a critical component outage. We already have some operational experience here with support requests from customers who chose "STUN-only" when creating a site only to hit unexpected connectivity issues when the Relays weren't accessible.

So we need to make it reliable, and if we're already doing that, why not rely on it?

I would see it as a simple risk-reducing action. Currently, the relays are a single point of failure for every firezone customer. If they don't work, nothing works.

So yeah, we should make them work really well. At the same time, having STUN-only servers reduces that risk significantly.

STUN is also unauthenticated, meaning those servers would continue working even if we have some auth bugs etc.

  • Why is #4290 happening?

Very valid question. I've been working on this in the form of #4268. Just today, str0m merged my PR that fixes a panic that was blocking my progress on that.

  • What STUN server would we use and how would it be deployed / operated?

TBD. Writing one would take less than 100 LoC, I think. I would shove it into another repo, make a container, and never touch it again :)

Re: Operation. Naive-me would think that it doesn't really need any operation?

  • How many connections are getting Relayed today?

That is a very interesting question and it would be great to gather some data on this.


jamilbk commented on September 26, 2024

Ok, we'd split the relay binary into two.

Re: Operation. Naive-me would think that it doesn't really need any operation?

This is the problem. Will it just use std-lib and have no dependencies? If so, will we upgrade it when rust-stable is upgraded?

As a rule we still need full operational support for anything we host. Not saying it's not feasible, just the ROI doesn't seem particularly great.

Could you de-risk Relay changes within the same binary? Why not split off a thread in main that just performs STUN? Wouldn't that serve the same purpose?
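Something along these lines is what I have in mind (a hypothetical sketch, not the actual relay code; `serve_stun_only` stands in for the small Binding responder sketched above, and how it shares the relay's socket is glossed over):

```rust
// Hypothetical sketch of running STUN on its own thread inside the existing
// relay binary; not the actual relay code.
use std::net::UdpSocket;
use std::thread;

// Stand-in for the ~100-LoC Binding responder sketched earlier in the thread.
fn serve_stun_only(addr: &str) -> std::io::Result<()> {
    let socket = UdpSocket::bind(addr)?;
    let mut buf = [0u8; 1500];
    loop {
        let (_len, _from) = socket.recv_from(&mut buf)?;
        // ... parse the Binding request, reply with XOR-MAPPED-ADDRESS ...
    }
}

fn main() {
    // Answer Binding requests on a dedicated thread so STUN keeps working
    // even if the TURN/portal logic in the main loop misbehaves.
    // (Whether this shares the relay's port 3478 socket is elided here.)
    let stun = thread::spawn(|| serve_stun_only("0.0.0.0:3478"));

    // ... existing relay main loop: TURN allocations, portal websocket, etc.

    let _ = stun.join();
}
```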


AndrewDryga commented on September 26, 2024

This proposal feels like a hack, while a proper solution would be to improve our testing, monitoring, and health checks. We already do surge updates: a new instance is created before the old one is taken down. So a situation like that is preventable without hosting separate infrastructure that would hide the issue for some % of clients while we still have a general outage.

Instead of hosting a separate set of servers, we could also return that information from the portal when the websocket is connected, but I would not even do that before our data plane becomes really stable without such optimizations.

For example,

  • if, during the last outage, the relays had not responded OK to the health check, the instance group manager would have rebooted them, or would not even have let the deploy happen if the initial probes failed (a minimal sketch of such a health endpoint follows this list);
  • if the relay logged an error, we would know about it too, and that would trigger an alert;
  • we also need an alert on the number of relays connected to the portal so we can roll back quickly if something not caught by any previous stage fails.
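For illustration, the health endpoint itself can be trivial; something like this on each relay VM would be enough for the instance group manager's HTTP probe (a hypothetical sketch, not our actual setup):

```rust
// Hypothetical health-check endpoint on a relay VM, answering the GCP
// instance group's HTTP health check. Illustration only.
use std::io::{Read, Write};
use std::net::TcpListener;

fn main() -> std::io::Result<()> {
    let listener = TcpListener::bind("0.0.0.0:8080")?;
    for stream in listener.incoming() {
        let mut stream = stream?;
        // Read (and ignore) the probe's request.
        let mut buf = [0u8; 1024];
        let _ = stream.read(&mut buf);
        // A real check would answer 200 only if the relay's event loop is
        // still making progress (e.g. a recent heartbeat), not merely alive.
        let _ = stream.write_all(b"HTTP/1.1 200 OK\r\ncontent-length: 2\r\n\r\nok");
    }
    Ok(())
}
```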

Could you de-risk Relay changes within the same binary? Why not split off a thread in main that just performs STUN? Wouldn't that serve the same purpose?

I think prioritizing STUN is a good idea too.


thomaseizinger commented on September 26, 2024

I am surprised by the amount of push-back to this idea. Three months ago, this was an advertised product feature, and now it is suddenly considered a hack?

Even the best software, with the best review process and the best alerting, will have bugs. Why not further de-risk this by introducing redundancy at the service level? We already deal with a lot of complexity in the clients for redundancy reasons by operating on a list of relays instead of just one. How is this idea any different?

Not saying it's not feasible, just the ROI doesn't seem particularly great.

Maybe I am failing to express myself but are we seeing the same things here in terms of ROI?

Investment

  • Write a 100 LoC STUN server and stick it into a separate repository
  • Add some terraform code to deploy these to GCP with static IPs
  • Return this list of IPs for every connection attempt to the clients and gateways (code for this already exists on portal, clients and gateways)

Return

  • De-risk the hosted relays from being a single point of failure for all firezone users to being one only for the users that actually need a relay for their connections. Even users who don't need a relay currently have a hard dependency on them just to discover their public IP.

The clients are already designed such that the data plane doesn't have a hard dependency on the portal (at least for some amount of time). So in general, we seem to be in agreement that we should limit the number of hard dependencies where possible. Why should direct connections have a hard dependency on the relay when they only need STUN?

I want to stress that this isn't just about bugs in the relay. Of course we should fix those and improve processes and testing where possible to mitigate them in the future. But as it stands today, any outage of the VMs that the relays run on will affect all firezone connections:

  • What if the portal and the relays get partitioned because of some network failure in GCP and thus the websocket connection dies?
  • What if the load balancing done by the portal fails and we overload a relay?
  • What if somebody runs a DoS attack against our relays?
  • What if some middle-box filters TURN traffic?

I am happy to stop the discussion given the amount of push-back but I'd really be curious to understand why the above ROI is perceived so differently by others.


thomaseizinger commented on September 26, 2024

Closing in favor of a soak testbed and overall improved monitoring of the relay. Long term, we want to remove the STUN-only functionality to simplify the codebase.

