Comments (13)
The portal can't send stale credentials: it stores them attached to the portal process, so when a relay disconnects they are gone pretty much right away.
from firezone.
False positive. #4517 should test this isn't the case.
from firezone.
Actually, re-opening this. I wasn't able to produce a failing test for this issue in #4517, but I think this might be related to how we deploy Relays and refresh credentials.
What I just experienced was that GitHub worked only intermittently for a few minutes while the Relays were being rolled over. My client would go through different Relays around the world as this happened.
Then, they all stopped working, and my logs started filling with these messages:
connlib 23:24:37.841167-0700 Invalid credentials, refusing to re-authenticate refresh
connlib 23:24:37.853528-0700 Invalid credentials, refusing to re-authenticate refresh
connlib 23:24:37.854295-0700 Invalid credentials, refusing to re-authenticate refresh
After the TURN credentials refreshed again, it started working (after 5 minutes):
connlib 23:24:37.903964-0700 Invalid credentials, refusing to re-authenticate refresh
connlib 23:24:37.906322-0700 Establishing new connection duration_since_intent=86.652958ms
connlib 23:24:37.907247-0700 Invalid credentials, refusing to re-authenticate refresh
connlib 23:24:37.916378-0700 Invalid credentials, refusing to re-authenticate refresh
connlib 23:24:37.919698-0700 error=Allocation Mismatch
connlib 23:24:37.920663-0700 error=Allocation Mismatch
connlib 23:24:37.932501-0700 Updated candidates of allocation lifetime=600s relay_ip4=Some(Candidate(relay=34.102.94.102:61896/udp prio=50331391)) relay_ip6=Some(Candidate(relay=[2600:1900:4120:f732:0:17::]:61896/udp prio=50331647)) srflx=Some(Candidate(srflx=[2600:1700:3ecb:2410:8cdc:e84:d6e4:7f18]:59992/udp prio=1694498815))
connlib 23:24:37.934226-0700 Updated candidates of allocation lifetime=600s relay_ip4=Some(Candidate(relay=34.102.94.102:58675/udp prio=50331391)) relay_ip6=Some(Candidate(relay=[2600:1900:4120:f732:0:17::]:58675/udp prio=50331647)) srflx=Some(Candidate(srflx=107.197.104.68:56889/udp base=192.168.1.65:56889 prio=1694498559))
connlib 23:24:38.017715-0700 Signalling protocol completed duration_since_intent=197.911167ms remote=c02684f599f9e7e1c1d37f0ed801aba4896aa4541283b62a17581f14b23f1370
connlib 23:24:38.044881-0700 Updated lifetime of allocation lifetime=Lifetime(600s) relay_ip4=Some(Candidate(relay=35.230.156.15:50345/udp prio=50331391)) relay_ip6=Some(Candidate(relay=[2600:1900:40c0:3149:0:2::]:50345/udp prio=50331647)) srflx=Some(Candidate(srflx=[2600:1700:3ecb:2410:8cdc:e84:d6e4:7f18]:59992/udp prio=1694498815))
connlib 23:24:38.045145-0700 Updated lifetime of allocation lifetime=Lifetime(600s) relay_ip4=Some(Candidate(relay=35.230.156.15:55921/udp prio=50331391)) relay_ip6=Some(Candidate(relay=[2600:1900:40c0:3149:0:2::]:55921/udp prio=50331647)) srflx=Some(Candidate(srflx=107.197.104.68:56889/udp base=192.168.1.65:56889 prio=1694498559))
connlib 23:24:38.916115-0700 Updating remote socket duration_since_intent=1.095011875s new=Direct { source: [2600:1700:3ecb:2410:8cdc:e84:d6e4:7f18]:59992, dest: [2600:1900:40b0:1504:0:14::]:62704 } old=None
connlib 23:24:39.254992-0700 Completed wireguard handshake duration_since_intent=1.435202708s
connlib 23:26:29.059777-0700 Updated lifetime of allocation lifetime=Lifetime(600s) relay_ip4=Some(Candidate(relay=35.196.218.219:51147/udp prio=50331391)) relay_ip6=Some(Candidate(relay=[2600:1900:4020:eb:0:17::]:51147/udp prio=50331647)) srflx=Some(Candidate(srflx=107.197.104.68:56889/udp base=192.168.1.65:56889 prio=1694498559))
connlib 23:26:29.071330-0700 Updated lifetime of allocation lifetime=Lifetime(600s) relay_ip4=Some(Candidate(relay=35.196.218.219:62306/udp prio=50331391)) relay_ip6=Some(Candidate(relay=[2600:1900:4020:eb:0:17::]:62306/udp prio=50331647)) srflx=Some(Candidate(srflx=[2600:1700:3ecb:2410:8cdc:e84:d6e4:7f18]:59992/udp prio=1694498815))
This could probably be fixed with a simple rolling deployment model for our Relays... cc @AndrewDryga
Edit: Not so sure about that. Probably a state machine bug we can fix by re-requesting new credentials or Relays for Relayed connections.
refs #2143
from firezone.
Update a few minutes later: my client seems to have settled on a Relay halfway across the world after the 5-minute mark, and is "stuck" on it, so GitHub is really slow.
Signing out and back in is the only thing that fixes the state completely.
from firezone.
Is it possible that CI doesn't catch this because we use static credentials there? That might mean we are falsely passing the tests that should be catching this exact scenario...
from firezone.
We already roll them over: first we create a new one, wait for its health check to pass, and then delete the previous one.
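A minimal sketch of that ordering, in case it helps the discussion. The `Relay` struct and the `deletable` function are hypothetical names used only for illustration, not the actual deployment tooling: old relays become deletable only once at least one replacement exists and every replacement reports healthy.

```rust
/// Hypothetical view of a relay instance during a rollover (not a real Firezone type).
#[derive(Debug, Clone, PartialEq, Eq)]
struct Relay {
    name: String,
    healthy: bool,
}

/// Returns the names of old relays that may be deleted.
/// Nothing is deletable until there is at least one replacement
/// and every replacement has passed its health check.
fn deletable<'a>(old: &'a [Relay], new: &[Relay]) -> Vec<&'a str> {
    let replacements_ready = !new.is_empty() && new.iter().all(|r| r.healthy);
    if replacements_ready {
        old.iter().map(|r| r.name.as_str()).collect()
    } else {
        Vec::new()
    }
}
```

Note that this ordering only keeps some relay reachable at all times. Clients holding an allocation on the deleted relay still lose it, which is consistent with the suspicion above that the remaining problem is on the client side.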
from firezone.
> Is it possible that CI doesn't catch this because we use static credentials there? That might mean we are falsely passing the tests that should be catching this exact scenario...
We don't use static credentials, at least the relay's seed isn't fixed AFAIK.
Could it be that we are getting stale creds from the portal?
from firezone.
> Then, they all stopped working, and my logs started filling with these messages:
Are they literally filling up, or do you just get one per relay? The former is not expected, but one message per relay is: as the relays reboot, their credentials become invalid.
I think we need #3938 for this. We need to stop using a relay as soon as we detect that the credentials are invalid. That will cause the connection to fail which makes us request new credentials from the portal.
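A rough sketch of the behaviour proposed here, under stated assumptions: `Allocations`, `AllocationState`, `Event` and `RelayId` are hypothetical names invented for this example and are not connlib's actual types. The first "invalid credentials" response marks the relay as unusable and queues a single request for fresh credentials. Repeated responses for the same relay are ignored, so they cannot trigger further refresh attempts or log lines.

```rust
use std::collections::BTreeMap;

/// Hypothetical relay identifier (not a connlib type).
type RelayId = u32;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum AllocationState {
    Active,
    /// The relay rejected our credentials; never use or refresh this allocation again.
    Invalidated,
}

#[derive(Debug, PartialEq, Eq)]
enum Event {
    /// Assumed signal that the client should ask the portal for new relays/credentials.
    RequestNewCredentials,
}

#[derive(Debug, Default)]
struct Allocations {
    by_relay: BTreeMap<RelayId, AllocationState>,
    pending_events: Vec<Event>,
}

impl Allocations {
    fn add(&mut self, relay: RelayId) {
        self.by_relay.insert(relay, AllocationState::Active);
    }

    /// Handles an "invalid credentials" response from `relay`.
    /// Returns `true` only on the first such response for an active allocation.
    fn on_invalid_credentials(&mut self, relay: RelayId) -> bool {
        let Some(state) = self.by_relay.get_mut(&relay) else {
            return false; // Unknown relay: nothing to do.
        };
        if *state == AllocationState::Invalidated {
            return false; // Already handled: do not retry or log again.
        }
        *state = AllocationState::Invalidated;
        self.pending_events.push(Event::RequestNewCredentials);
        true
    }

    /// Relays that may still be used for relayed connections, in ascending id order.
    fn usable_relays(&self) -> Vec<RelayId> {
        self.by_relay
            .iter()
            .filter(|(_, state)| **state == AllocationState::Active)
            .map(|(id, _)| *id)
            .collect()
    }
}
```

With two relays added, invalidating relay 1 twice leaves only relay 2 usable and queues exactly one `RequestNewCredentials` event. How that event reaches the portal is left out of the sketch.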
from firezone.
> Are they literally filling up or do you just get one per relay?
Yeah, my logs get spammed with about ~100 of these a second. See attached logs.
Note I wasn't ping flooding or anything, just trying to use GitHub.com.
from firezone.
Have a theory as to what's going on (or some variation): #4517 (comment)
from firezone.
> Are they literally filling up or do you just get one per relay?
> Yeah, my logs get spammed with about ~100 of these a second.
That is def a state machine bug, damn!
from firezone.
Yeah, seems like a state machine bug. Redeploys break DNS resources (and maybe others?) and they don't self-heal -- only signing out and signing back in fixes the state machine. Unless I do that, the following occurs:
- I get logspammed "Invalid credentials" until they're refreshed (5 minutes? 10 minutes?)
- After that, requests are still intermittent and fail with "Bad request", "Unallowed packet in channel", etc.
Note that I am forcing use of a Relay by blocking direct IP access to the Staging Gateway(s). Perhaps this issue isn't present with Direct connections only.
Here are more logs from when I started having issues after today's deploy. Hopefully there are some helpful nuggets in there.
connlib.2024-04-05-06-36-15.log.zip
from firezone.
Closing in favor of #4548 which should alleviate the issue when implemented.
from firezone.