
Comments (5)

swhite2 commented on June 2, 2024

The issue lies with link-up events being triggered rather late in the boot sequence (which might be caused by anything; in my case, slow-to-come-up SFP modules). If we consider a DHCP interface such as the factory default WAN, the following happens before the link comes up:

/usr/local/etc/rc.bootup: The command '/sbin/dhclient -c '/var/etc/dhclient_wan.conf' -p '/var/run/dhclient.ax0.pid' 'ax0'' 
returned exit code '1', the output was 'ax0: no link .............. giving up'

The devd event is ignored during boot (by exit_on_bootup()), as it likely should be to avoid other problems. The result is that the WAN interface never obtains an IP. Ideally, dhclient would simply wait for the link to come up instead of bailing out (which is the case on Linux, AFAIK); I believe dhclient on FreeBSD does not support this.
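
The "wait for the link instead of giving up" behaviour described above could be sketched roughly as follows. This is a minimal Python illustration, not OPNsense or dhclient code: `wait_for_link`, `start_dhcp_when_linked`, `link_is_up` and `start_client` are hypothetical names, and how the link state is actually read is left to the caller.

```python
import time
from typing import Callable


def wait_for_link(
    link_is_up: Callable[[], bool],
    timeout: float = 120.0,
    interval: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll link_is_up() until it returns True or the timeout expires."""
    deadline = clock() + timeout
    while True:
        if link_is_up():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)


def start_dhcp_when_linked(link_is_up, start_client, **kwargs) -> bool:
    """Start the DHCP client only once the link is up (hypothetical helper)."""
    if not wait_for_link(link_is_up, **kwargs):
        # Still no link: the caller decides whether to retry later.
        return False
    start_client()
    return True
```

The point of the sketch is only that a missing link becomes a "not yet" rather than a hard failure, so a late link-up can still lead to a lease.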

This may just as well be a driver issue, as it's quite uncommon for a link to come up this late, though assuming outright that a link is present is, in my opinion, a bit risky.

from core.

framer99 commented on June 2, 2024

I just battled what I think is a problem relevant to this issue. It started late last year after an upgrade to one of the later 23.x versions (still present in 24.1). I spent a lot of time learning how OPNsense boots and debugging the boot process on our slow Atom D510 4MB RAM hardware.

It's long and drawn out but I thought I would write it up in case anyone finds it helpful or I need to refer back to it myself:)

What @fichtner is trying to fix and @swhite2 is saying about late link-up events is exactly what I found the problem to be.

Problem Description

For many years we have successfully run a redundant pair of OPNsense firewalls with 4 NICs, ~8 VLANs, some with CARP, some with FRR/ospfd. After an upgrade to one of the later 23.x versions, bootup would "complete" and FRR/ospfd would converge, with all OSPF routes added to the kernel and working as expected. Within 2-3 minutes, some or all OSPF-learned routes for an OSPF interface (or sometimes more than one interface) would be removed from the kernel and never re-added. I upgraded to the latest 24.x version and the issue remained.

Problem Details

It turns out that interface_configure() ends up doing an address flush, which executes ifconfig <intf> <ip> -alias. The IP gets re-added at some point, but the kernel removes the OSPF route when the IP is removed. FRR/ospfd never drops the adjacency and re-neighbors, so the kernel is never re-informed to add the route.

I use rc.syshook.d/early/ to enable zebra debugging during boot and I can see FRR/zebra does receive notifications from the kernel about the address/route removal, but FRR does nothing about it. As far as FRR is concerned the kernel has the route zebra gave it and ospfd continues to maintain adjacency. Toggling the interface at either end of the link will cause OSPF to re-neighbor and the route(s) get re-added to the kernel.

Ultimately I found (at least on our ancient hardware/config) that rc.bootup finishes way before all the devd LINKUP/DOWN events are complete, so the flock on /var/run/booting is released well before all the devd linkup/down events fire off. During "booting", the call to interface_configure() is skipped in rc.linkup by exit_on_bootup(). After booting is considered complete, this interface_configure() call is not skipped, so the IP address flush in interface_configure() is executed which kills the kernel route(s) zebra added for that interface.
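
To make the "booting" state concrete, here is a minimal sketch of a lock-file check of this kind. It is an assumption about the general technique (an exclusive flock held for the duration of boot, probed with a non-blocking lock attempt), not the actual OPNsense implementation; `acquire_boot_lock`, `release_boot_lock` and `is_booting` are hypothetical names.

```python
import fcntl
import os


def acquire_boot_lock(path: str) -> int:
    """Open path and take an exclusive flock on it; return the descriptor."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    fcntl.flock(fd, fcntl.LOCK_EX)
    return fd


def release_boot_lock(fd: int) -> None:
    """Drop the lock and close the descriptor."""
    fcntl.flock(fd, fcntl.LOCK_UN)
    os.close(fd)


def is_booting(path: str) -> bool:
    """Return True while another open descriptor holds the lock on path."""
    try:
        fd = os.open(path, os.O_RDWR)
    except FileNotFoundError:
        return False
    try:
        # Non-blocking attempt: fails immediately if the lock is held.
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        return True
    else:
        fcntl.flock(fd, fcntl.LOCK_UN)
        return False
    finally:
        os.close(fd)
```

With a check like this, an event handler that returns early while `is_booting()` is true behaves differently the moment the boot script releases its lock, which matches the timing problem described above: any link event that arrives after the release takes the full reconfiguration path.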

"My Workaround/Fix"

I added a long delay to the end of rc.bootup, before the final exit(0). This kept the "booting" state (flock on /var/run/booting) active for the duration of the artificial delay. With this delay in place I can see in the system log file that devd events continue to be processed after bootup would normally have been "done". The interface_configure() calls get skipped, so there are no ifconfig <intf> <ip> -alias executions. The calls to plugin hooks for openvpn, ipsec, dhcp, dns, crl etc. run very slowly. It takes 2-3 or more minutes to complete, like @fichtner describes.

After my artificial delay is complete, rc.bootup exits, the flock on /var/run/booting is released, normal boot continues (starting FRR etc.) and everything works fine, because there are no more linkup/down devd events.

For now I've left my rc.bootup extended delay in. It's a redundant firewall setup so we can live with a 3 minute longer boot.

Questions/Thoughts

So... I'm not an expert on the OPNsense/FreeBSD boot process or future goals for it but:

  1. If FRR/zebra acted on the kernel messages about address/route removal, OSPF adjacency would/could drop and be re-acquired, sending the route(s) to the kernel again.
  2. If I upgraded to faster hardware, it may chew through the devd linkup/downs before rc.bootup completed.
  3. @fichtner's coalesce concept for this issue may speed things up so devd linkup/downs are processed before rc.bootup is completed.
  4. From some code and comments I ran into about older versions: there "may" have been special processing for interfaces with a static IP (or just static ARP?) that did not remove the IP, which would leave the kernel route intact. That might explain why we did not see this issue until recently (later 23.x versions).
  5. Maybe adding yet another interface plugin for FRR that restarts the OSPF process (via vtysh "clear ip ospf process" or full FRR/ospfd restart) could work.
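
One possible reading of "coalescing" in point 3, sketched in Python purely as an illustration: collapse a burst of link events so that each interface is handled once, using its most recent state. This is an assumption about the general idea, not necessarily what @fichtner has in mind; `coalesce_link_events` and the event tuple format are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple

LinkEvent = Tuple[str, str]  # (interface, "up" or "down")


def coalesce_link_events(events: Iterable[LinkEvent]) -> List[LinkEvent]:
    """Keep only the most recent event per interface.

    The result is ordered by when each interface's last event arrived.
    """
    latest: Dict[str, str] = {}
    for interface, state in events:
        # Remove and re-insert so dict order tracks the latest arrival.
        latest.pop(interface, None)
        latest[interface] = state
    return list(latest.items())
```

A link that flaps several times during boot would then trigger a single round of plugin hooks instead of one round per flap.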

Working and Not Working Log

I attached some logs with my added DEBUG logging that have been sanitized with IPs/names/etc replaced via sed.
Note that the "ospfd already running?" error in the logs is OK and is not involved in this issue. I noticed this at some point and found a @fichtner comment saying that it's by design, so that a second start attempt is made later in the startup process.

boot_with_rc.bootup_delay_added.txt:

boot_without_rc.bootup_delay_added.txt:

  • Normal bootup, no artificial delay added to rc.bootup
  • devd linkup/downs occur after bootup is "complete"
  • ifconfig <intf> <ip> -alias commands are invoked (which is what kills the kernel routes added by zebra)
    boot_without_rc.bootup_delay_added.txt


fichtner commented on June 2, 2024

@framer99 thanks for the detailed report. Just to be sure: are you already on 24.x or still on 23.x?


framer99 commented on June 2, 2024

Yes that's correct, 24.1.1.


framer99 commented on June 2, 2024

I've discovered that during normal operation we can still lose kernel routes that zebra installed when a link toggles because a neighboring switch reboots.

Clean, 5-second cable unplug/re-plug cycles generally work. However, during card resets or switch reboots for equipment plugged into the OPNsense machine, routes can still end up in zebra but not in the kernel.
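
A small Python sketch of how the "in zebra but not in the kernel" condition could be checked once both route lists have been collected. How the prefixes are gathered is left open; `routes_missing_from_kernel` is a hypothetical helper and the inputs are assumed to be plain CIDR strings.

```python
import ipaddress
from typing import Iterable, List


def routes_missing_from_kernel(
    zebra_routes: Iterable[str], kernel_routes: Iterable[str]
) -> List[str]:
    """Return prefixes that zebra believes are installed but the kernel lacks."""

    def normalize(prefixes: Iterable[str]) -> set:
        # strict=False accepts host bits, e.g. 10.0.1.5/24 -> 10.0.1.0/24
        return {ipaddress.ip_network(p, strict=False) for p in prefixes}

    missing = normalize(zebra_routes) - normalize(kernel_routes)
    ordered = sorted(
        missing, key=lambda n: (n.version, int(n.network_address), n.prefixlen)
    )
    return [str(n) for n in ordered]
```

Run periodically, a non-empty result would flag exactly the state described here, where ospfd keeps its adjacency while the kernel has silently lost the route.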

I turned bfd back on for one OSPF link, and bfd itself seems to bounce a lot (3, 4, maybe more times) when recovering instead of just going bfd down then bfd up.

I commented out the actual ifconfig <intf> <ip> -alias command in legacy_interface_deladdress() and things seem better.

I will need to create a simple single-link test setup to be able to get to the bottom of it all. Maybe it's all our ancient hardware or some other part of the config.
