Comments (13)

AndrewDryga avatar AndrewDryga commented on June 17, 2024 1

The portal can't send stale credentials: it stores them attached to the portal process, so when a relay disconnects they are gone pretty much right away.

from firezone.

jamilbk avatar jamilbk commented on June 17, 2024

False positive. #4517 should test this isn't the case.

jamilbk avatar jamilbk commented on June 17, 2024

Actually, re-opening this. I wasn't able to produce a failing test for this issue in #4517, but I think this might be related to how we deploy Relays and refresh credentials.

What I just experienced was that GitHub was working intermittently, on and off for a few minutes, while the Relays were being rolled over. My client would go through different Relays around the world as this happened.

Then, they all stopped working, and my logs started filling with these messages:

connlib	23:24:37.841167-0700	Invalid credentials, refusing to re-authenticate refresh
connlib	23:24:37.853528-0700	Invalid credentials, refusing to re-authenticate refresh
connlib	23:24:37.854295-0700	Invalid credentials, refusing to re-authenticate refresh

After the TURN credentials refreshed again, it started working (after 5 minutes):

connlib	23:24:37.903964-0700	Invalid credentials, refusing to re-authenticate refresh
connlib	23:24:37.906322-0700	Establishing new connection  duration_since_intent=86.652958ms
connlib	23:24:37.907247-0700	Invalid credentials, refusing to re-authenticate refresh
connlib	23:24:37.916378-0700	Invalid credentials, refusing to re-authenticate refresh
connlib	23:24:37.919698-0700	error=Allocation Mismatch
connlib	23:24:37.920663-0700	error=Allocation Mismatch
connlib	23:24:37.932501-0700	Updated candidates of allocation  lifetime=600s relay_ip4=Some(Candidate(relay=34.102.94.102:61896/udp prio=50331391)) relay_ip6=Some(Candidate(relay=[2600:1900:4120:f732:0:17::]:61896/udp prio=50331647)) srflx=Some(Candidate(srflx=[2600:1700:3ecb:2410:8cdc:e84:d6e4:7f18]:59992/udp prio=1694498815))
connlib	23:24:37.934226-0700	Updated candidates of allocation  lifetime=600s relay_ip4=Some(Candidate(relay=34.102.94.102:58675/udp prio=50331391)) relay_ip6=Some(Candidate(relay=[2600:1900:4120:f732:0:17::]:58675/udp prio=50331647)) srflx=Some(Candidate(srflx=107.197.104.68:56889/udp base=192.168.1.65:56889 prio=1694498559))
connlib	23:24:38.017715-0700	Signalling protocol completed  duration_since_intent=197.911167ms remote=c02684f599f9e7e1c1d37f0ed801aba4896aa4541283b62a17581f14b23f1370
connlib	23:24:38.044881-0700	Updated lifetime of allocation  lifetime=Lifetime(600s) relay_ip4=Some(Candidate(relay=35.230.156.15:50345/udp prio=50331391)) relay_ip6=Some(Candidate(relay=[2600:1900:40c0:3149:0:2::]:50345/udp prio=50331647)) srflx=Some(Candidate(srflx=[2600:1700:3ecb:2410:8cdc:e84:d6e4:7f18]:59992/udp prio=1694498815))
connlib	23:24:38.045145-0700	Updated lifetime of allocation  lifetime=Lifetime(600s) relay_ip4=Some(Candidate(relay=35.230.156.15:55921/udp prio=50331391)) relay_ip6=Some(Candidate(relay=[2600:1900:40c0:3149:0:2::]:55921/udp prio=50331647)) srflx=Some(Candidate(srflx=107.197.104.68:56889/udp base=192.168.1.65:56889 prio=1694498559))
connlib	23:24:38.916115-0700	Updating remote socket  duration_since_intent=1.095011875s new=Direct { source: [2600:1700:3ecb:2410:8cdc:e84:d6e4:7f18]:59992, dest: [2600:1900:40b0:1504:0:14::]:62704 } old=None
connlib	23:24:39.254992-0700	Completed wireguard handshake  duration_since_intent=1.435202708s
connlib	23:26:29.059777-0700	Updated lifetime of allocation  lifetime=Lifetime(600s) relay_ip4=Some(Candidate(relay=35.196.218.219:51147/udp prio=50331391)) relay_ip6=Some(Candidate(relay=[2600:1900:4020:eb:0:17::]:51147/udp prio=50331647)) srflx=Some(Candidate(srflx=107.197.104.68:56889/udp base=192.168.1.65:56889 prio=1694498559))
connlib	23:26:29.071330-0700	Updated lifetime of allocation  lifetime=Lifetime(600s) relay_ip4=Some(Candidate(relay=35.196.218.219:62306/udp prio=50331391)) relay_ip6=Some(Candidate(relay=[2600:1900:4020:eb:0:17::]:62306/udp prio=50331647)) srflx=Some(Candidate(srflx=[2600:1700:3ecb:2410:8cdc:e84:d6e4:7f18]:59992/udp prio=1694498815))

This could probably be fixed with a simple rolling deployment model for our Relays... cc @AndrewDryga

Edit: Not so sure about that. Probably a state machine bug we can fix by re-requesting new credentials or Relays for Relayed connections.

refs #2143

jamilbk avatar jamilbk commented on June 17, 2024

Update a few minutes later: my client seems to have settled on a Relay halfway across the world after the 5-minute mark, and is "stuck" on it, so GitHub is really slow.

Signing out and back in is the only thing that fixes the state completely.

jamilbk avatar jamilbk commented on June 17, 2024

Is it possible that CI doesn't catch this because we use static credentials there? That might mean we are falsely passing the tests that should be catching this exact scenario...

AndrewDryga avatar AndrewDryga commented on June 17, 2024

We already roll them over: first create a new one, wait for its health check, and then delete the previous one.
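
The ordering described here (create, wait for health, then delete) can be sketched roughly as follows. This is only an illustration, not the actual deployment tooling: `create`, `is_healthy`, and `delete` are hypothetical callbacks standing in for whatever the infrastructure provides, and the timeout values are made up.

```python
import time
from typing import Callable


def roll_over(
    old_relay: str,
    create: Callable[[], str],
    is_healthy: Callable[[str], bool],
    delete: Callable[[str], None],
    timeout_s: float = 60.0,  # assumed value, not from the thread
    poll_s: float = 1.0,      # assumed value, not from the thread
) -> str:
    """Replace one relay: create the new one, wait until healthy, delete the old one."""
    new_relay = create()
    deadline = time.monotonic() + timeout_s
    while not is_healthy(new_relay):
        if time.monotonic() >= deadline:
            # Keep the old relay running; remove the replacement that never became healthy.
            delete(new_relay)
            raise TimeoutError(f"{new_relay} did not pass its health check")
        time.sleep(poll_s)
    # Only now is the previous relay removed.
    delete(old_relay)
    return new_relay
```

Note that even with this ordering, any allocation a client holds on the old relay disappears when that relay is deleted, so the client still has to recover on its own.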

thomaseizinger avatar thomaseizinger commented on June 17, 2024

> Is it possible that CI doesn't catch this because we use static credentials there? That might mean we are falsely passing the tests that should be catching this exact scenario...

We don't use static credentials, at least the relay's seed isn't fixed AFAIK.

Could it be that we are getting stale creds from the portal?
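
To illustrate why a non-fixed seed matters: if credentials are derived from a per-relay secret, anything issued before a restart stops validating once the secret changes. The sketch below uses the scheme from the TURN REST API draft (password = base64 of HMAC-SHA1 over the username, username prefixed with an expiry timestamp). That is an assumption for illustration only; nothing in this thread says how our Relays actually derive or check credentials, and the function names are made up.

```python
import base64
import hashlib
import hmac


def turn_password(secret: bytes, username: str) -> str:
    """Derive a time-limited TURN password from a shared secret (TURN REST API draft style)."""
    digest = hmac.new(secret, username.encode(), hashlib.sha1).digest()
    return base64.b64encode(digest).decode()


def relay_accepts(secret: bytes, username: str, password: str, now: int) -> bool:
    """Check a credential against the relay's current secret; username is '<expiry>:<id>'."""
    expiry = int(username.split(":", 1)[0])
    expected = turn_password(secret, username)
    return now < expiry and hmac.compare_digest(password.encode(), expected.encode())
```

With a static secret, as a CI setup might use, a credential survives a relay restart. With a secret regenerated on boot, the same credential is rejected afterwards even though it has not expired.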

thomaseizinger avatar thomaseizinger commented on June 17, 2024

> Then, they all stopped working, and my logs started filling with these messages:

Are they literally filling up or do you just get one per relay? The former is not expected, but one per relay is, since the credentials become invalid when the relays reboot.

I think we need #3938 for this. We need to stop using a relay as soon as we detect that the credentials are invalid. That will cause the connection to fail which makes us request new credentials from the portal.
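
A rough sketch of that behaviour, assuming a client-side set of usable relays. `RelayedClient` and its methods are hypothetical names for illustration, not connlib's actual types: the first invalid-credentials response discards the relay and flags that fresh credentials are needed, and any later response for the same relay is ignored instead of being retried or logged again.

```python
from dataclasses import dataclass, field


@dataclass
class RelayedClient:
    usable: set = field(default_factory=set)
    discarded: set = field(default_factory=set)
    needs_portal_refresh: bool = False

    def add_relay(self, relay: str) -> None:
        """Register a relay for which the portal just handed out credentials."""
        self.discarded.discard(relay)
        self.usable.add(relay)

    def on_invalid_credentials(self, relay: str) -> bool:
        """Handle an invalid-credentials response. Returns True only the first time."""
        if relay not in self.usable:
            # Already discarded (or unknown): do nothing, so there is no retry loop.
            return False
        self.usable.remove(relay)
        self.discarded.add(relay)
        # Connections through this relay will now fail, so ask the portal for new credentials.
        self.needs_portal_refresh = True
        return True

    def usable_relays(self) -> list:
        return sorted(self.usable)
```

The point of returning `False` on repeats is that one stale relay produces one state change rather than a stream of identical failures.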

jamilbk avatar jamilbk commented on June 17, 2024

> Are they literally filling up or do you just get one per relay?

Yeah, my logs get spammed with about ~100 of these a second. See attached logs.

Note I wasn't ping flooding or anything, just trying to use GitHub.com.

connlib.zip

jamilbk avatar jamilbk commented on June 17, 2024

Have a theory as to what's going on (or some variation): #4517 (comment)

thomaseizinger avatar thomaseizinger commented on June 17, 2024

> Are they literally filling up or do you just get one per relay?

> Yeah, my logs get spammed with about ~100 of these a second.

That is def a state machine bug, damn!

jamilbk avatar jamilbk commented on June 17, 2024

Yeah, seems like a state machine bug. Redeploys break DNS resources (and maybe others?) and they don't self-heal -- only signing out and signing back in fixes the state machine. Unless I do that, the following occurs:

  1. I get logspammed with "Invalid credentials" until they're refreshed (5 minutes? 10 minutes?)
  2. After that, requests are still intermittent and fail with "Bad request", "Unallowed packet in channel", etc.

Note that I am forcing use of a Relay by blocking direct IP access to the Staging Gateway(s). Perhaps this issue isn't present with Direct connections only.

Here are more logs from when I started having issues after today's deploy. Hopefully there are some helpful nuggets in there.

connlib.2024-04-05-06-36-15.log.zip

jamilbk avatar jamilbk commented on June 17, 2024

Closing in favor of #4548 which should alleviate the issue when implemented.
