Comments (5)
The issue lies with link-up events being triggered rather late in the boot sequence (which might be caused by anything; in my case, slow-to-come-up SFP modules). If we consider a DHCP interface such as the factory default WAN, the following happens before a link is established:
/usr/local/etc/rc.bootup: The command '/sbin/dhclient -c '/var/etc/dhclient_wan.conf' -p '/var/run/dhclient.ax0.pid' 'ax0'' returned exit code '1', the output was 'ax0: no link .............. giving up'
The devd event is ignored during boot (by exit_on_bootup()), as it likely should be to avoid other problems. The result is that the WAN interface never obtains an IP address. Ideally, dhclient would simply wait for a link to come up (which is the case on Linux, AFAIK) and not bail at all; I believe dhclient on FreeBSD does not support this.
This may just as well be a driver issue, as it's quite uncommon for a link to come up this late, though in my opinion it is a bit risky to assume outright that there is a link.
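To illustrate the "wait for a link instead of bailing" idea, here is a minimal sketch that polls the interface status before a DHCP client would be started. It assumes the FreeBSD ifconfig output contains a "status: active" line once the link is up (some interfaces, such as loopback, print no status line at all). The function names, the timeout and the polling interval are hypothetical and not taken from OPNsense.

```python
import re
import subprocess
import time


def has_link(ifconfig_output: str) -> bool:
    """Return True if the ifconfig output contains a 'status: active' line."""
    match = re.search(r"^\s*status:\s*(.+)$", ifconfig_output, re.MULTILINE)
    return match is not None and match.group(1).strip() == "active"


def wait_for_link(ifname: str, timeout: float = 120.0, interval: float = 1.0) -> bool:
    """Poll ifconfig until the interface reports a link or the timeout expires.

    The timeout and interval values are illustrative assumptions.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        result = subprocess.run(
            ["/sbin/ifconfig", ifname], capture_output=True, text=True
        )
        if result.returncode == 0 and has_link(result.stdout):
            return True
        time.sleep(interval)
    return False
```

A caller would only launch dhclient for the interface once wait_for_link() returns True, rather than letting dhclient give up on "no link".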
from core.
I just battled what I think is a relevant problem for this issue. It started late last year after an upgrade to one of the later 23.x versions (still present in 24.1). I spent a lot of time learning how OPNsense boots and debugging the boot process on our slow Atom D510 4MB RAM hardware.
It's long and drawn out, but I thought I would write it up in case anyone finds it helpful or I need to refer back to it myself :)
What @fichtner is trying to fix and @swhite2 is saying about late link-up events is exactly what I found the problem to be.
Problem Description
For many years we have successfully run a redundant pair of OPNsense firewalls with 4 NICs and ~8 VLANs, some with CARP, some with FRR/ospfd. After an upgrade to one of the later 23.x versions, bootup would "complete" and FRR/ospfd would converge with all OSPF routes added to the kernel and working as expected. Within 2-3 minutes, some or all OSPF-learned routes for an OSPF interface (sometimes more than one interface) would be removed from the kernel and never re-added. I upgraded to the latest 24.x version and the issue remained.
Problem Details
Turns out that interface_configure() ends up doing an address flush which executes ifconfig <intf> <ip> -alias. The IP gets re-added at some point, but the kernel removes the OSPF route when the IP is removed. FRR/ospfd never drops the adjacency and re-neighbors, so the kernel is never told to re-add the route.
I use rc.syshook.d/early/ to enable zebra debugging during boot and I can see FRR/zebra does receive notifications from the kernel about the address/route removal, but FRR does nothing about it. As far as FRR is concerned, the kernel has the route zebra gave it and ospfd continues to maintain adjacency. Toggling the interface at either end of the link will cause OSPF to re-neighbor and the route(s) get re-added to the kernel.
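The mismatch described here (routes present in zebra but gone from the kernel) can be checked with a simple set difference. The sketch below assumes the two prefix lists have already been collected by some other means, for example from zebra's and the kernel's routing table output; that collection step is left out, and the function name is hypothetical.

```python
from ipaddress import ip_network


def routes_missing_from_kernel(zebra_prefixes, kernel_prefixes):
    """Return prefixes that zebra believes are installed but the kernel lacks.

    Both arguments are iterables of CIDR strings such as '192.0.2.0/24'.
    """
    zebra = {ip_network(prefix) for prefix in zebra_prefixes}
    kernel = {ip_network(prefix) for prefix in kernel_prefixes}
    missing = zebra - kernel
    # Sort by IP version, then numeric network address, then prefix length.
    ordered = sorted(
        missing,
        key=lambda net: (net.version, int(net.network_address), net.prefixlen),
    )
    return [str(net) for net in ordered]
```

An empty result means the kernel still holds every route zebra installed; a non-empty result matches the broken state described above.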
Ultimately I found (at least on our ancient hardware/config) that rc.bootup finishes well before all the devd LINKUP/DOWN events have fired, so the flock on /var/run/booting is released while link events are still pending. During "booting", the call to interface_configure() is skipped in rc.linkup by exit_on_bootup(). After booting is considered complete, this interface_configure() call is not skipped, so the IP address flush in interface_configure() is executed, which kills the kernel route(s) zebra added for that interface.
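The gating behaviour can be sketched as follows. This is not the actual exit_on_bootup() implementation, which is not shown in this thread; it only assumes that "booting" is signalled by an exclusive flock held on a lock file, and the function names are hypothetical.

```python
import fcntl
import os


def is_booting(lock_path: str) -> bool:
    """Return True if some other holder has an exclusive flock on lock_path."""
    if not os.path.exists(lock_path):
        return False
    with open(lock_path, "r") as handle:
        try:
            # A non-blocking shared lock fails while an exclusive lock is held.
            fcntl.flock(handle, fcntl.LOCK_SH | fcntl.LOCK_NB)
        except BlockingIOError:
            return True
        fcntl.flock(handle, fcntl.LOCK_UN)
        return False


def handle_link_event(lock_path: str, interface: str) -> str:
    """Decide what a link event handler would do for the given interface."""
    if is_booting(lock_path):
        # Still booting: the event is dropped and no address flush happens.
        return f"skip {interface}"
    # Boot finished: the interface is reconfigured, including the address flush.
    return f"configure {interface}"
```

A late link event arriving after the lock is released takes the "configure" branch, which is the path that ends up removing the address and with it the kernel route.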
"My Workaround/Fix"
I added a long delay to the end of rc.bootup, before the final exit(0). This kept the "booting" state (flock on /var/run/booting) active for the duration of the artificial delay. With this delay in place I can see in the system log file that devd events continue to be processed after bootup would normally have been "done". The interface_configure() calls get skipped, so there are no ifconfig <intf> <ip> -alias executions. The calls to plugin hooks for openvpn, ipsec, dhcp, dns, crl etc. run very slowly. It takes 2-3 or more minutes to complete... like @fichtner describes.
After my artificial delay is complete, rc.bootup exits, the flock on /var/run/booting is released, normal boot continues (starting FRR etc.) and everything works fine... because there are no more linkup/down devd events.
For now I've left my rc.bootup extended delay in. It's a redundant firewall setup, so we can live with a boot that takes 3 minutes longer.
Questions/Thoughts
So... I'm not an expert on the OPNsense/FreeBSD boot process or future goals for it but:
- If FRR/zebra acted on the kernel messages about address/route removal, the OSPF adjacency would/could drop and be re-acquired, sending the route(s) to the kernel again.
- If I upgraded to faster hardware, it might chew through the devd linkup/downs before rc.bootup completed.
- @fichtner's coalesce concept for this issue may speed things up so devd linkup/downs are processed before rc.bootup is completed.
- From some code and comments I ran into about older versions: there "may" have been special processing for interfaces with a static IP (or just static ARP?) that did not remove the IP... which would leave the kernel route intact. That might explain why we did not see this issue until recently (later 23.x versions).
- Maybe adding yet another interface plugin for FRR that restarts the OSPF process (via vtysh "clear ip ospf process" or full FRR/ospfd restart) could work.
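As a rough sketch of the last idea, a hook could clear the OSPF process whenever kernel routes are known to be missing. The vtysh path and the use of the -c option to pass the command are assumptions about the local FRR install, the function names are hypothetical, and how the missing routes are detected is left to the caller.

```python
import subprocess


def ospf_restart_command(vtysh_path: str = "/usr/local/bin/vtysh"):
    """Build the command line that asks FRR to clear the OSPF process.

    The default vtysh path is an assumption and may differ per system.
    """
    return [vtysh_path, "-c", "clear ip ospf process"]


def restart_ospf_if_needed(missing_routes, run=subprocess.run) -> bool:
    """Clear the OSPF process only when at least one route is missing.

    'run' is injectable so the decision logic can be exercised without FRR.
    """
    if not missing_routes:
        return False
    run(ospf_restart_command(), check=True)
    return True
```

Clearing the process forces the adjacency to re-form, which is the same effect the manual interface toggle has, at the cost of a brief routing interruption.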
Working and Not Working Log
I attached some logs with my added DEBUG logging that have been sanitized with IPs/names/etc replaced via sed.
Note that the "ospfd already running?" error in the logs is OK and is not involved in this issue. I noticed this at some point and found a @fichtner comment saying that it is by design, so that a 2nd start attempt is made later in the startup process.
boot_with_rc.bootup_delay_added.txt:
- rc.bootup has added delay, allowing a working bootup.
- There is only 1 call to ifconfig <intf> <ip> -alias ... for loopback/127.0.0.1
boot_with_rc.bootup_delay_added.txt
boot_without_rc.bootup_delay_added.txt:
- Normal bootup, no artificial delay added to rc.bootup
- devd linkup/downs occur after bootup is "complete"
- ifconfig <intf> <ip> -alias commands are invoked (which is what kills the kernel routes added by zebra)
boot_without_rc.bootup_delay_added.txt
from core.
@framer99 thanks for the detailed report. Just to be sure: are you already on 24.x or still on 23.x?
from core.
Yes that's correct, 24.1.1.
from core.
I've discovered that during normal operation we can still lose kernel routes that zebra installed, for example when a link toggled because a neighbor switch rebooted.
Clean, 5-second-long cable unplugs/re-plugs generally work. However, during card resets/switch reboots for equipment plugged into the OPNsense machine, routes can still end up in zebra but not in the kernel.
I turned bfd back on for one OSPF link and bfd itself seems to bounce a lot (3, 4, maybe more times) when recovering, instead of just going bfd down then bfd up.
I commented out the actual ifconfig <intf> <ip> -alias command in legacy_interface_deladdress() and things seem better.
I will need to create a simple single-link test setup to be able to get to the bottom of it all. Maybe it's all our ancient hardware or some other part of the config.
from core.