Code Monkey home page Code Monkey logo

Comments (12)

jamilbk avatar jamilbk commented on June 20, 2024

My perf gateway however which tracks the latest release (1.0.1) is working fine, so it's most likely not related to any code changes we've made in the last few days (haven't really been any, anyway)

from firezone.

jamilbk avatar jamilbk commented on June 20, 2024

Happening for IPv4 too:

2024-04-30T02:32:17.839732Z DEBUG handle_timeout{relay=Some(35.235.125.17:3478)}: snownet::allocation: Request timed out after 5.0625s, re-sending id=TransactionId(0x1351B35654352558F1D7072E) method=binding dst=[2600:1900:4120:691:0:26::]:3478
2024-04-30T02:32:17.839776Z DEBUG handle_timeout{relay=Some(35.239.126.166:3478)}: snownet::allocation: Request timed out after 5.0625s, re-sending id=TransactionId(0xF787E01C4DDF12AE7EDEC6B2) method=binding dst=[2600:1900:4000:a47e:0:d::]:3478
2024-04-30T02:32:25.433971Z DEBUG handle_timeout{relay=Some(35.235.125.17:3478)}: snownet::allocation: Request timed out after 7.59375s, re-sending id=TransactionId(0x1351B35654352558F1D7072E) method=binding dst=[2600:1900:4120:691:0:26::]:3478
2024-04-30T02:32:25.434013Z DEBUG handle_timeout{relay=Some(35.239.126.166:3478)}: snownet::allocation: Request timed out after 7.59375s, re-sending id=TransactionId(0xF787E01C4DDF12AE7EDEC6B2) method=binding dst=[2600:1900:4000:a47e:0:d::]:3478
2024-04-30T02:32:36.824812Z DEBUG handle_timeout{relay=Some(35.235.125.17:3478)}: snownet::allocation: Request timed out after 11.390625s, re-sending id=TransactionId(0x1351B35654352558F1D7072E) method=binding dst=[2600:1900:4120:691:0:26::]:3478
2024-04-30T02:32:36.824855Z DEBUG handle_timeout{relay=Some(35.239.126.166:3478)}: snownet::allocation: Request timed out after 11.390625s, re-sending id=TransactionId(0xF787E01C4DDF12AE7EDEC6B2) method=binding dst=[2600:1900:4000:a47e:0:d::]:3478
2024-04-30T02:32:53.909311Z DEBUG handle_timeout{relay=Some(35.235.125.17:3478)}: snownet::allocation: Request timed out after 17.0859375s, re-sending id=TransactionId(0x1351B35654352558F1D7072E) method=binding dst=[2600:1900:4120:691:0:26::]:3478
2024-04-30T02:32:53.909350Z DEBUG handle_timeout{relay=Some(35.235.125.17:3478)}: snownet::allocation: Unable to queue binding because we've exceeded its backoffs
2024-04-30T02:32:53.909361Z DEBUG handle_timeout{relay=Some(35.239.126.166:3478)}: snownet::allocation: Request timed out after 17.0859375s, re-sending id=TransactionId(0xF787E01C4DDF12AE7EDEC6B2) method=binding dst=[2600:1900:4000:a47e:0:d::]:3478
2024-04-30T02:32:53.909370Z DEBUG handle_timeout{relay=Some(35.239.126.166:3478)}: snownet::allocation: Unable to queue binding because we've exceeded its backoffs

from firezone.

jamilbk avatar jamilbk commented on June 20, 2024
{
  "insertId": "1d44q2mg2rak5gt",
  "jsonPayload": {
    "target": "relay",
    "time": "2024-04-30T14:40:36.423077098Z",
    "cos.googleapis.com/container_id": "62b31dfee0a9b08cc10e9688b294181e49f24cc64ba99cd621b3fd3e9ead6c3a",
    "cos.googleapis.com/container_name": "klt-relay-jaoh",
    "cos.googleapis.com/stream": "stdout",
    "message": "channel bind failed: Unauthorized"
  },
  "resource": {
    "type": "gce_instance",
    "labels": {
      "project_id": "firezone-staging",
      "zone": "us-east1-d",
      "instance_id": "3277658703961888395"
    }
  },
  "timestamp": "2024-04-30T14:40:36.423314863Z",
  "severity": "WARNING",
  "logName": "projects/firezone-staging/logs/cos_containers",
  "sourceLocation": {
    "file": "relay/src/server.rs",
    "line": "318"
  },
  "receiveTimestamp": "2024-04-30T14:40:37.405382298Z"
}

@thomaseizinger We're seeing over 500,000 of these in the last day since the str0m version was bumped. Could this be related?

Causing a partial outage affecting customers:

https://firezone.statuspage.io

from firezone.

jamilbk avatar jamilbk commented on June 20, 2024

Here's the diff between working str0m and the one causing the channel bind failure:

https://github.com/firezone/str0m/compare/44b616945145033ade739f30cdce004728011619..49b9ac85679fdcdd050dad856bc5df63616fa065

from firezone.

jamilbk avatar jamilbk commented on June 20, 2024

Could it be possible that a lack of IPv6 connectivity is causing issues connecting to IPv4 Relays? This log seems suspicious:

2024-04-30T16:25:50.893420Z DEBUG handle_timeout{relay=Some(35.236.65.205:3478)}: snownet::allocation: Unable to queue binding because we've exceeded its backoffs
2024-04-30T16:25:50.893433Z DEBUG handle_timeout{relay=Some(34.72.251.203:3478)}: snownet::allocation: Request timed out after 17.0859375s, re-sending id=TransactionId(0x061BAC5117458006106CB1A0) method=binding dst=[2600:1900:4000:a47e:0:d::]:3478

from firezone.

jamilbk avatar jamilbk commented on June 20, 2024

Hmm more debugging -- should this STUN binding request be sent to my laptop (192.168.1.65)?

2024-04-30T16:53:39.719409Z DEBUG handle_timeout{id=c2806b19-678f-4e63-9a36-4d6d041290d5}: str0m::ice_::agent: Send STUN request: 172.17.0.2:39511 -> 192.168.1.65:57359 StunMessage { method: Binding, class: Request, attrs: Attributes { username: "4jXl:hxaY", priority: 1862270719, ice_controlled: 3305003312144811039 }, integrity_len: 0 }

This is on the problematic prod Gateway.

from firezone.

jamilbk avatar jamilbk commented on June 20, 2024

Here's a full log capture on the problematic Gateway with wire=trace enabled:

gateway.jsonl.zip

from firezone.

jamilbk avatar jamilbk commented on June 20, 2024

Seeing "Bad Request" which probably is the source of the issue:

2024-04-30T17:06:17.219703Z  WARN decapsulate{from=34.72.251.203:3478 num_bytes=40}:handle_input{id=TransactionId(0xDD4F41FE6C40163DF82DA56B) method=channel bind class=error response rtt=33.078034ms}: snownet::allocation: error=Bad Request channel=16396 peer=34.72.251.203:52214
2024-04-30T17:06:17.221707Z TRACE wire: from="network" src=34.72.251.203:3478 dst=172.17.0.2:39511 num_bytes=40
2024-04-30T17:06:17.221763Z  WARN decapsulate{from=34.72.251.203:3478 num_bytes=40}:handle_input{id=TransactionId(0xFF238A6E84B1AA98C815B34A) method=channel bind class=error response rtt=33.379065ms}: snownet::allocation: error=Bad Request channel=16397 peer=[2600:1900:4000:a47e:0:d::]:52214
2024-04-30T17:06:17.237043Z DEBUG handle_timeout{id=c2806b19-678f-4e63-9a36-4d6d041290d5}: str0m::ice_::agent: Send STUN request: 104.196.159.78:59948 -> 192.168.1.65:57359 StunMessage { method: Binding, class: Request, attrs: Attributes { username: "lejO:BcYa", priority: 1849687295, ice_controlled: 2838992140673646111 }, integrity_len: 0 }
2024-04-30T17:06:17.239975Z TRACE wire: from="network" src=35.235.125.17:3478 dst=172.17.0.2:39511 num_bytes=40
2024-04-30T17:06:17.240037Z  WARN decapsulate{from=35.235.125.17:3478 num_bytes=40}:handle_input{id=TransactionId(0x93506EB5AB2AAD1BC65726D2) method=channel bind class=error response rtt=60.845142ms}: snownet::allocation: error=Bad Request channel=16395 peer=[2600:1900:4120:691:0:26::]:60460
2024-04-30T17:06:17.240058Z TRACE wire: from="network" src=35.235.125.17:3478 dst=172.17.0.2:39511 num_bytes=40
2024-04-30T17:06:17.240077Z  WARN decapsulate{from=35.235.125.17:3478 num_bytes=40}:handle_input{id=TransactionId(0x60B54198D4736A0281B50458) method=channel bind class=error response rtt=61.553165ms}: snownet::allocation: error=Bad Request channel=16394 peer=35.235.125.17:60460

I thought the maximum number of channels we could bind was 16,384 -- is channel=16394 an issue?

from firezone.

jamilbk avatar jamilbk commented on June 20, 2024

This turned out to be a bug with str0m, see algesten/str0m#508 (comment)

Reverted for now in #4838

from firezone.

thomaseizinger avatar thomaseizinger commented on June 20, 2024

FWIW:

2024-04-30T02:23:30.269069Z DEBUG handle_timeout{relay=Some(34.23.241.97:3478)}: snownet::allocation: Request timed out after 1.5s, re-sending id=TransactionId(0x01255695304100B3C281E31E) method=binding dst=[2600:1900:4020:8f68:0:27::]:3478

This is completely normal if we have no IPv6 connectivity. The IP we are seeing in the span is the one we have selected for this Allocation, i.e. we chose to talk to it on IPv4 likely because IPv6 never got back to us and hence it is timing out eventually.

from firezone.

thomaseizinger avatar thomaseizinger commented on June 20, 2024

Here is my theory:

  • Additional relay load is an artifact of connections that fail over and over again.
  • IPv6 packets timing out (see the dst variable) happens if you don't have IPv6 connectivity, unrelated to this issue.
  • "Bad request" on the relay could indicate a different issue. Would be interesting to see the logs of the corresponding transaction ID on the relay.

algesten/str0m#508 by itself does not invalidate candidates, it just handles less packets than before (see the log about ignoring some). That makes me think that the actual source of the bug (invalidating a candidate that we shouldn't) is still lurking somewhere in our codebase. Reverting the above PR doesn't fix that, it just masks it.

I wonder if this can be reproduced with a unit-test where we check, which candidates are invalidated?

from firezone.

thomaseizinger avatar thomaseizinger commented on June 20, 2024

I thought the maximum number of channels we could bind was 16,384 -- is channel=16394 an issue?

16384 is the first channel, 20479 is the last one. The max count is 4095.

from firezone.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.