Even if a DNS mapping and direct connection to a Gateway exists, we seem to have a har

DNS resource mappings break after redeploy about firezone HOT 13 CLOSED

jamilbk commented on June 17, 2024

DNS resource mappings break after redeploy

from firezone.

Comments (13)

AndrewDryga commented on June 17, 2024

The portal can't send stale credentials because it stores them attached to the portal process; when a relay disconnects, they are gone almost immediately.

jamilbk commented on June 17, 2024

False positive. #4517 should test this isn't the case.

jamilbk commented on June 17, 2024

Actually, re-opening this. I wasn't able to produce a failing test for this issue in #4517, but I think this might be related to how we deploy Relays and refresh credentials.

What I just experienced was that GitHub worked intermittently, on and off, for a few minutes while the Relays were being rolled over. My client went through different Relays around the world as this happened.

Then, they all stopped working, and my logs started filling with these messages:

connlib	23:24:37.841167-0700	Invalid credentials, refusing to re-authenticate refresh
connlib	23:24:37.853528-0700	Invalid credentials, refusing to re-authenticate refresh
connlib	23:24:37.854295-0700	Invalid credentials, refusing to re-authenticate refresh

After the TURN credentials refreshed again, it started working (after 5 minutes):

connlib	23:24:37.903964-0700	Invalid credentials, refusing to re-authenticate refresh
connlib	23:24:37.906322-0700	Establishing new connection  duration_since_intent=86.652958ms
connlib	23:24:37.907247-0700	Invalid credentials, refusing to re-authenticate refresh
connlib	23:24:37.916378-0700	Invalid credentials, refusing to re-authenticate refresh
connlib	23:24:37.919698-0700	error=Allocation Mismatch
connlib	23:24:37.920663-0700	error=Allocation Mismatch
connlib	23:24:37.932501-0700	Updated candidates of allocation  lifetime=600s relay_ip4=Some(Candidate(relay=34.102.94.102:61896/udp prio=50331391)) relay_ip6=Some(Candidate(relay=[2600:1900:4120:f732:0:17::]:61896/udp prio=50331647)) srflx=Some(Candidate(srflx=[2600:1700:3ecb:2410:8cdc:e84:d6e4:7f18]:59992/udp prio=1694498815))
connlib	23:24:37.934226-0700	Updated candidates of allocation  lifetime=600s relay_ip4=Some(Candidate(relay=34.102.94.102:58675/udp prio=50331391)) relay_ip6=Some(Candidate(relay=[2600:1900:4120:f732:0:17::]:58675/udp prio=50331647)) srflx=Some(Candidate(srflx=107.197.104.68:56889/udp base=192.168.1.65:56889 prio=1694498559))
connlib	23:24:38.017715-0700	Signalling protocol completed  duration_since_intent=197.911167ms remote=c02684f599f9e7e1c1d37f0ed801aba4896aa4541283b62a17581f14b23f1370
connlib	23:24:38.044881-0700	Updated lifetime of allocation  lifetime=Lifetime(600s) relay_ip4=Some(Candidate(relay=35.230.156.15:50345/udp prio=50331391)) relay_ip6=Some(Candidate(relay=[2600:1900:40c0:3149:0:2::]:50345/udp prio=50331647)) srflx=Some(Candidate(srflx=[2600:1700:3ecb:2410:8cdc:e84:d6e4:7f18]:59992/udp prio=1694498815))
connlib	23:24:38.045145-0700	Updated lifetime of allocation  lifetime=Lifetime(600s) relay_ip4=Some(Candidate(relay=35.230.156.15:55921/udp prio=50331391)) relay_ip6=Some(Candidate(relay=[2600:1900:40c0:3149:0:2::]:55921/udp prio=50331647)) srflx=Some(Candidate(srflx=107.197.104.68:56889/udp base=192.168.1.65:56889 prio=1694498559))
connlib	23:24:38.916115-0700	Updating remote socket  duration_since_intent=1.095011875s new=Direct { source: [2600:1700:3ecb:2410:8cdc:e84:d6e4:7f18]:59992, dest: [2600:1900:40b0:1504:0:14::]:62704 } old=None
connlib	23:24:39.254992-0700	Completed wireguard handshake  duration_since_intent=1.435202708s
connlib	23:26:29.059777-0700	Updated lifetime of allocation  lifetime=Lifetime(600s) relay_ip4=Some(Candidate(relay=35.196.218.219:51147/udp prio=50331391)) relay_ip6=Some(Candidate(relay=[2600:1900:4020:eb:0:17::]:51147/udp prio=50331647)) srflx=Some(Candidate(srflx=107.197.104.68:56889/udp base=192.168.1.65:56889 prio=1694498559))
connlib	23:26:29.071330-0700	Updated lifetime of allocation  lifetime=Lifetime(600s) relay_ip4=Some(Candidate(relay=35.196.218.219:62306/udp prio=50331391)) relay_ip6=Some(Candidate(relay=[2600:1900:4020:eb:0:17::]:62306/udp prio=50331647)) srflx=Some(Candidate(srflx=[2600:1700:3ecb:2410:8cdc:e84:d6e4:7f18]:59992/udp prio=1694498815))

~~This could probably be fixed with a simple rolling deployment model for our Relays... cc @AndrewDryga~~ Edit: Not so sure about that. Probably a state machine bug we can fix by re-requesting new credentials or Relays for Relayed connections.

refs #2143

jamilbk commented on June 17, 2024

Update a few minutes later: my client seems to have settled on a Relay halfway across the world after the 5-minute mark, and is "stuck" on it, so GitHub is really slow.

Signing out and back in is the only thing that fixes the state completely.

jamilbk commented on June 17, 2024

Is it possible that CI doesn't catch this because we use static credentials there? That might mean we are falsely passing the tests that should be catching this exact scenario...

AndrewDryga commented on June 17, 2024

We already roll them over: we first create a new one, wait for its health check, and then delete the previous one.

thomaseizinger commented on June 17, 2024

> Is it possible that CI doesn't catch this because we use static credentials there? That might mean we are falsely passing the tests that should be catching this exact scenario...

We don't use static credentials, at least the relay's seed isn't fixed AFAIK.

Could it be that we are getting stale creds from the portal?

thomaseizinger commented on June 17, 2024

> Then, they all stopped working, and my logs started filling with these messages:

Are they literally filling up or do you just get one per relay? The former is not expected but as they reboot, the credentials will be invalid.

I think we need #3938 for this. We need to stop using a relay as soon as we detect that the credentials are invalid. That will cause the connection to fail which makes us request new credentials from the portal.
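A minimal Rust sketch of the behaviour proposed here, not connlib's actual implementation. `RelayPool`, `RelayState`, `Action` and `on_invalid_credentials` are all hypothetical names chosen for illustration. The sketch assumes that a relay is dropped on the first "invalid credentials" response, that repeated errors for an already-dropped relay are ignored (so they would not trigger further refresh attempts), and that losing the last usable relay results in a request to the portal for new ones.

```rust
use std::collections::HashMap;

/// Hypothetical relay identifier; connlib's real type may differ.
type RelayId = u32;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum RelayState {
    Active,
    /// Credentials were rejected; we no longer send refreshes to this relay.
    Invalidated,
}

#[derive(Debug, PartialEq, Eq)]
enum Action {
    /// Stop using this relay and fail connections that depend on it.
    DropRelay(RelayId),
    /// No usable relay is left: ask the portal for new relays and credentials.
    RequestNewRelays,
}

#[derive(Default)]
struct RelayPool {
    relays: HashMap<RelayId, RelayState>,
}

impl RelayPool {
    fn add(&mut self, id: RelayId) {
        self.relays.insert(id, RelayState::Active);
    }

    /// Handles an "invalid credentials" response from relay `id`.
    /// Returns no actions if the relay is unknown or was already invalidated.
    fn on_invalid_credentials(&mut self, id: RelayId) -> Vec<Action> {
        match self.relays.get_mut(&id) {
            Some(state) if *state == RelayState::Active => {
                *state = RelayState::Invalidated;
            }
            _ => return Vec::new(),
        }

        let mut actions = vec![Action::DropRelay(id)];
        if !self.relays.values().any(|s| *s == RelayState::Active) {
            actions.push(Action::RequestNewRelays);
        }
        actions
    }
}
```

Under these assumptions, a second error from the same relay yields no action at all, which is the property that would prevent the repeated re-authentication attempts seen in the logs.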

jamilbk commented on June 17, 2024

> Are they literally filling up or do you just get one per relay?

Yeah, my logs get spammed with about 100 of these a second. See attached logs.

Note I wasn't ping flooding or anything, just trying to use GitHub.com.
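To put a number on the log spam, something like the following Rust sketch could tally matching lines per second. It assumes the line layout shown in the excerpts above (`connlib`, then a timestamp such as `23:24:37.841167-0700`, then the message, separated by whitespace); `count_per_second` is a hypothetical helper, not part of connlib.

```rust
use std::collections::BTreeMap;

/// Counts log lines containing `needle`, grouped by the `HH:MM:SS` prefix
/// of the timestamp. Lines without a second whitespace-separated field,
/// or with a timestamp shorter than 8 bytes, are skipped.
fn count_per_second(log: &str, needle: &str) -> BTreeMap<String, usize> {
    let mut counts = BTreeMap::new();
    for line in log.lines().filter(|l| l.contains(needle)) {
        // The second field is assumed to be the timestamp.
        if let Some(ts) = line.split_whitespace().nth(1) {
            if let Some(second) = ts.get(..8) {
                *counts.entry(second.to_string()).or_insert(0) += 1;
            }
        }
    }
    counts
}
```

Calling `count_per_second(&log_text, "Invalid credentials")` on the contents of a log file returns one entry per second, which distinguishes "one message per relay" from a sustained flood.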

connlib.zip

jamilbk commented on June 17, 2024

Have a theory as to what's going on (or some variation): #4517 (comment)

thomaseizinger commented on June 17, 2024

> > Are they literally filling up or do you just get one per relay?
>
> Yeah, my logs get spammed with about 100 of these a second.

That is def a state machine bug, damn!

jamilbk commented on June 17, 2024

Yeah, seems like a state machine bug. Redeploys break DNS resources (and maybe others?) and they don't self-heal -- only signing out and signing back in fixes the state machine. Unless I do that, the following occurs:

- I get log-spammed with "Invalid credentials" until they're refreshed (5 minutes? 10 minutes?)
- After that, requests are still intermittent and fail with "Bad request", "Unallowed packet in channel", etc.

Note that I am forcing use of a Relay by blocking direct IP access to the Staging Gateway(s). Perhaps this issue isn't present with Direct connections only.

Here are more logs from when I started having issues after today's deploy. Hopefully there are some helpful nuggets in there.

connlib.2024-04-05-06-36-15.log.zip

jamilbk commented on June 17, 2024

Closing in favor of #4548 which should alleviate the issue when implemented.
