Comments (12)
My perf
gateway however which tracks the latest release (1.0.1
) is working fine, so it's most likely not related to any code changes we've made in the last few days (haven't really been any, anyway)
from firezone.
Happening for IPv4 too:
2024-04-30T02:32:17.839732Z DEBUG handle_timeout{relay=Some(35.235.125.17:3478)}: snownet::allocation: Request timed out after 5.0625s, re-sending id=TransactionId(0x1351B35654352558F1D7072E) method=binding dst=[2600:1900:4120:691:0:26::]:3478
2024-04-30T02:32:17.839776Z DEBUG handle_timeout{relay=Some(35.239.126.166:3478)}: snownet::allocation: Request timed out after 5.0625s, re-sending id=TransactionId(0xF787E01C4DDF12AE7EDEC6B2) method=binding dst=[2600:1900:4000:a47e:0:d::]:3478
2024-04-30T02:32:25.433971Z DEBUG handle_timeout{relay=Some(35.235.125.17:3478)}: snownet::allocation: Request timed out after 7.59375s, re-sending id=TransactionId(0x1351B35654352558F1D7072E) method=binding dst=[2600:1900:4120:691:0:26::]:3478
2024-04-30T02:32:25.434013Z DEBUG handle_timeout{relay=Some(35.239.126.166:3478)}: snownet::allocation: Request timed out after 7.59375s, re-sending id=TransactionId(0xF787E01C4DDF12AE7EDEC6B2) method=binding dst=[2600:1900:4000:a47e:0:d::]:3478
2024-04-30T02:32:36.824812Z DEBUG handle_timeout{relay=Some(35.235.125.17:3478)}: snownet::allocation: Request timed out after 11.390625s, re-sending id=TransactionId(0x1351B35654352558F1D7072E) method=binding dst=[2600:1900:4120:691:0:26::]:3478
2024-04-30T02:32:36.824855Z DEBUG handle_timeout{relay=Some(35.239.126.166:3478)}: snownet::allocation: Request timed out after 11.390625s, re-sending id=TransactionId(0xF787E01C4DDF12AE7EDEC6B2) method=binding dst=[2600:1900:4000:a47e:0:d::]:3478
2024-04-30T02:32:53.909311Z DEBUG handle_timeout{relay=Some(35.235.125.17:3478)}: snownet::allocation: Request timed out after 17.0859375s, re-sending id=TransactionId(0x1351B35654352558F1D7072E) method=binding dst=[2600:1900:4120:691:0:26::]:3478
2024-04-30T02:32:53.909350Z DEBUG handle_timeout{relay=Some(35.235.125.17:3478)}: snownet::allocation: Unable to queue binding because we've exceeded its backoffs
2024-04-30T02:32:53.909361Z DEBUG handle_timeout{relay=Some(35.239.126.166:3478)}: snownet::allocation: Request timed out after 17.0859375s, re-sending id=TransactionId(0xF787E01C4DDF12AE7EDEC6B2) method=binding dst=[2600:1900:4000:a47e:0:d::]:3478
2024-04-30T02:32:53.909370Z DEBUG handle_timeout{relay=Some(35.239.126.166:3478)}: snownet::allocation: Unable to queue binding because we've exceeded its backoffs
from firezone.
{
"insertId": "1d44q2mg2rak5gt",
"jsonPayload": {
"target": "relay",
"time": "2024-04-30T14:40:36.423077098Z",
"cos.googleapis.com/container_id": "62b31dfee0a9b08cc10e9688b294181e49f24cc64ba99cd621b3fd3e9ead6c3a",
"cos.googleapis.com/container_name": "klt-relay-jaoh",
"cos.googleapis.com/stream": "stdout",
"message": "channel bind failed: Unauthorized"
},
"resource": {
"type": "gce_instance",
"labels": {
"project_id": "firezone-staging",
"zone": "us-east1-d",
"instance_id": "3277658703961888395"
}
},
"timestamp": "2024-04-30T14:40:36.423314863Z",
"severity": "WARNING",
"logName": "projects/firezone-staging/logs/cos_containers",
"sourceLocation": {
"file": "relay/src/server.rs",
"line": "318"
},
"receiveTimestamp": "2024-04-30T14:40:37.405382298Z"
}
@thomaseizinger We're seeing over 500,000 of these in the last day since the str0m version was bumped. Could this be related?
Causing a partial outage affecting customers:
https://firezone.statuspage.io
from firezone.
Here's the diff between working str0m and the one causing the channel bind failure:
from firezone.
Could it be possible that a lack of IPv6 connectivity is causing issues connecting to IPv4 Relays? This log seems suspicious:
2024-04-30T16:25:50.893420Z DEBUG handle_timeout{relay=Some(35.236.65.205:3478)}: snownet::allocation: Unable to queue binding because we've exceeded its backoffs
2024-04-30T16:25:50.893433Z DEBUG handle_timeout{relay=Some(34.72.251.203:3478)}: snownet::allocation: Request timed out after 17.0859375s, re-sending id=TransactionId(0x061BAC5117458006106CB1A0) method=binding dst=[2600:1900:4000:a47e:0:d::]:3478
from firezone.
Hmm more debugging -- should this STUN binding request be sent to my laptop (192.168.1.65)
?
2024-04-30T16:53:39.719409Z DEBUG handle_timeout{id=c2806b19-678f-4e63-9a36-4d6d041290d5}: str0m::ice_::agent: Send STUN request: 172.17.0.2:39511 -> 192.168.1.65:57359 StunMessage { method: Binding, class: Request, attrs: Attributes { username: "4jXl:hxaY", priority: 1862270719, ice_controlled: 3305003312144811039 }, integrity_len: 0 }
This is on the problematic prod Gateway.
from firezone.
Here's a full log capture on the problematic Gateway with wire=trace
enabled:
from firezone.
Seeing "Bad Request" which probably is the source of the issue:
2024-04-30T17:06:17.219703Z WARN decapsulate{from=34.72.251.203:3478 num_bytes=40}:handle_input{id=TransactionId(0xDD4F41FE6C40163DF82DA56B) method=channel bind class=error response rtt=33.078034ms}: snownet::allocation: error=Bad Request channel=16396 peer=34.72.251.203:52214
2024-04-30T17:06:17.221707Z TRACE wire: from="network" src=34.72.251.203:3478 dst=172.17.0.2:39511 num_bytes=40
2024-04-30T17:06:17.221763Z WARN decapsulate{from=34.72.251.203:3478 num_bytes=40}:handle_input{id=TransactionId(0xFF238A6E84B1AA98C815B34A) method=channel bind class=error response rtt=33.379065ms}: snownet::allocation: error=Bad Request channel=16397 peer=[2600:1900:4000:a47e:0:d::]:52214
2024-04-30T17:06:17.237043Z DEBUG handle_timeout{id=c2806b19-678f-4e63-9a36-4d6d041290d5}: str0m::ice_::agent: Send STUN request: 104.196.159.78:59948 -> 192.168.1.65:57359 StunMessage { method: Binding, class: Request, attrs: Attributes { username: "lejO:BcYa", priority: 1849687295, ice_controlled: 2838992140673646111 }, integrity_len: 0 }
2024-04-30T17:06:17.239975Z TRACE wire: from="network" src=35.235.125.17:3478 dst=172.17.0.2:39511 num_bytes=40
2024-04-30T17:06:17.240037Z WARN decapsulate{from=35.235.125.17:3478 num_bytes=40}:handle_input{id=TransactionId(0x93506EB5AB2AAD1BC65726D2) method=channel bind class=error response rtt=60.845142ms}: snownet::allocation: error=Bad Request channel=16395 peer=[2600:1900:4120:691:0:26::]:60460
2024-04-30T17:06:17.240058Z TRACE wire: from="network" src=35.235.125.17:3478 dst=172.17.0.2:39511 num_bytes=40
2024-04-30T17:06:17.240077Z WARN decapsulate{from=35.235.125.17:3478 num_bytes=40}:handle_input{id=TransactionId(0x60B54198D4736A0281B50458) method=channel bind class=error response rtt=61.553165ms}: snownet::allocation: error=Bad Request channel=16394 peer=35.235.125.17:60460
I thought the maximum number of channels we could bind was 16,384 -- is channel=16394
an issue?
from firezone.
This turned out to be a bug with str0m, see algesten/str0m#508 (comment)
Reverted for now in #4838
from firezone.
FWIW:
2024-04-30T02:23:30.269069Z DEBUG handle_timeout{relay=Some(34.23.241.97:3478)}: snownet::allocation: Request timed out after 1.5s, re-sending id=TransactionId(0x01255695304100B3C281E31E) method=binding dst=[2600:1900:4020:8f68:0:27::]:3478
This is completely normal if we have no IPv6 connectivity. The IP we are seeing in the span is the one we have selected for this Allocation
, i.e. we chose to talk to it on IPv4 likely because IPv6 never got back to us and hence it is timing out eventually.
from firezone.
Here is my theory:
- Additional relay load is an artifact of connections that fail over and over again.
- IPv6 packets timing out (see the
dst
variable) happens if you don't have IPv6 connectivity, unrelated to this issue. - "Bad request" on the relay could indicate a different issue. Would be interesting to see the logs of the corresponding transaction ID on the relay.
algesten/str0m#508 by itself does not invalidate candidates, it just handles less packets than before (see the log about ignoring some). That makes me think that the actual source of the bug (invalidating a candidate that we shouldn't) is still lurking somewhere in our codebase. Reverting the above PR doesn't fix that, it just masks it.
I wonder if this can be reproduced with a unit-test where we check, which candidates are invalidated?
from firezone.
I thought the maximum number of channels we could bind was 16,384 -- is
channel=16394
an issue?
16384 is the first channel, 20479 is the last one. The max count is 4095.
from firezone.
Related Issues (20)
- comment typo relating to DNS control
- bug(headless-client): adding service actor to group only takes effect after restarting the Client HOT 10
- connlib: packets are routed based on connected gateways and not resources HOT 3
- Support multimatch wildcards in DNS
- Fix code highlighting for React MDX
- landing page iteration 2
- transparent enterprise pricing
- ux(headless-client/linux): don't default to null DNS control
- Do service accounts belong to the `everyone` group or not?
- Ensure `reconnect` clears all previous backoff timers HOT 1
- One-click installer for DO
- k8s instructions
- Pulumi instructions
- Show instructions in docs for deploying Gateways for different infra
- UX audit tracking issue
- connlib: perform mangling of DNS requests to resolvers that are CIDR resources before we look up the peer HOT 1
- connlib: implement reconnect as "drop all connections and wait for new packets to trigger new ones"
- Allow FIREZONE_TOKEN to point to file HOT 1
- chore(connlib/android): revert possible Android regression from #4788
- Tracking issue for extensions to property-based state machine tests
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from firezone.