fedora-iot / greenboot
Generic Health Checking Framework for systemd
License: GNU Lesser General Public License v2.1
The semantics of a name can help in remembering its capabilities; if there's a story behind the name, it would be great to see it documented.
Hi, from my understanding, since F39 users will be able to install systemd-boot cleanly even on Fedora IoT, which raises the question: is greenboot going to work with it, or will it have to be disabled on such installs?
Greenboot should notify the user (via MOTD) on the next reboot that an abrupt boot cycle was detected, in case of power loss on the device or force-killing of a VM.
Things/services that depend on shutdown targets may not have executed correctly, which can cause issues on the next boot.
Example: rpm-ostree update - the staged update gets lost after a sudden power failure.
Since we don't use headers in source files, it's not clear whether the project is licensed under the:
The license itself doesn't change, just the headers that we don't use; but since we need to fix the license text in Fedora, we need to know exactly what it is.
Let's analyze commonalities and differences with SUSE's https://github.com/kubic-project/health-checker
Maybe there are opportunities to collaborate and align (or even merge projects?) in the long run.
Issue:
When an rpm-ostree command removes a package and an rpm-ostree reset (remove all mutations) is then performed, it affects the behavior of Greenboot. After a reboot is initiated, Greenboot does not perform the check, and the Greenboot boot_counter stays at the set default value.
Steps to reproduce:
Expected Result:
Greenboot should notice that a package is missing and attempt to restore the last known good state.
Actual Result:
Greenboot does not attempt to fix itself; the Greenboot boot_counter remains at the default value.
Additional notes (if any):
The ostree status:
GREENBOOT_WATCHDOG_CHECK_ENABLED=true
Greenboot variables:
boot_counter=2
A system reboot will clear the boot flag and restore the system to a known good state.
Issue description:
Provision a VM and boot it, then upgrade to a new commit (a fake commit just for testing, whose health check never passes). After the reboot, the greenboot health check fails and the system reboots several times, then finally falls back to the previous commit.
The problem is that when running rpm-ostree status, the new failed commit is still on top of the old commit, like this:
[admin@vm-1 ~]$ rpm-ostree status
State: idle
Deployments:
rhel-edge:centos/9/x86_64/edge
Version: 9-stream (2023-06-12T11:25:33Z)
BaseCommit: de8c63c4444fdc53e8b927357f458250bdaaec938e24f2923fbc071ce7aa24c2
Diff: 1 added
LocalPackages: greenboot-failing-unit-1.0-1.el8.noarch
● rhel-edge:centos/9/x86_64/edge
Version: 9-stream (2023-06-12T11:25:33Z)
Commit: de8c63c4444fdc53e8b927357f458250bdaaec938e24f2923fbc071ce7aa24c2
Actual result:
The rpm-ostree status order is not reversed; each reboot will try to boot the top entry.
Expected result:
The rpm-ostree status order should be reversed
greenboot.target: Requested dependency OnFailure=redboot.target ignored (target units cannot fail).
This is due to a change in systemd, included in v245.
I am not sure if that actually breaks the functionality but it would appear so, since it should trigger the red path for boot failures.
Following up on https://bugzilla.redhat.com/show_bug.cgi?id=1920063,
Although it can be set via https://github.com/fedora-iot/greenboot/blob/main/usr/libexec/greenboot/greenboot-grub2-set-counter, it is hardcoded in https://github.com/fedora-iot/greenboot/blob/main/usr/lib/systemd/system/greenboot-grub2-set-counter.service#L19. Extending the former so it can also be configured via an environment variable could be interesting.
greenboot has been, since the first time I used it, the main reason for bootloops on my system. And while I like how it helps keep system updates robust to failures, causing bootloops is sometimes worse than the original issue.
Example: on update we run an application that stores its configs in /var on an ostree-based system. When someone (namely me) now somehow manages to get a typo into the configs, of course this component breaks, meaning it can't be started no matter which version is used.
What greenboot then does is an indefinite number of rollbacks. On every boot it comes up, figures out it's not a successful boot, and triggers a rollback.
What I wish greenboot would do: try to boot, figure out the boot was not successful, and try a rollback. But if there is no way to roll back any further, maybe just keep the broken system booted. The chance that it can be fixed is a lot higher when the system is booting than when it's stuck in a bootloop, I guess.
I would love your input on this, and while config validation is definitely the right answer for my specific case, I don't think it hits the core of the problem here.
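The "stop rolling back when there is nothing left to roll back to" behaviour described above can be sketched in isolation. This is a hedged sketch, not greenboot's actual logic: the helper name handle_red_boot is hypothetical, and in practice the deployment count would come from something like rpm-ostree status.

```shell
#!/bin/bash
# Hypothetical sketch: on a red boot, only trigger a rollback when a
# previous deployment actually exists; otherwise keep the broken system
# booted so an operator can intervene. The deployment count is assumed
# to be obtained elsewhere (e.g. from `rpm-ostree status --json`).
handle_red_boot() {  # handle_red_boot <deployment_count>
  if [ "$1" -gt 1 ]; then
    echo "rollback"       # a previous deployment exists to fall back to
  else
    echo "stay-booted"    # nothing to roll back to: avoid a bootloop
  fi
}

handle_red_boot 2
handle_red_boot 1
```

The key design point is the floor check: a rollback is only ever attempted while more than one deployment exists.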
Packit failed on creating pull-requests in dist-git:
dist-git branch | error
---|---
main | Bad source: /tmp/packit-dist-git6yk70umm/greenboot-0.15.0.tar.gz: No such file or directory
You can retrigger the update by adding a comment (/packit propose-downstream) into this issue.
An ostree remote config may have a url= parameter and a contenturl= parameter included in the config. When the contenturl= parameter is present, the ostree client will fetch content from that resource, but will fetch metadata from the resource specified by the url= parameter.
Currently, the Fedora IoT ostree infrastructure is configured in a way that doing curl -L https://ostree.fedoraproject.org/iot (as specified in the url= parameter) returns an HTTP 403. But curl -L https://ostree.fedoraproject.org/iot/config returns HTTP 200. Along similar lines, curl -L https://ostree.fedoraproject.org/iot/mirrorlist returns HTTP 200. And substituting the CloudFront hostname from the mirror list, curl -L https://d2ju0wfl996cmc.cloudfront.net/config also returns HTTP 200.
The intent is to make the script more intelligent and test for actual content availability depending on how the ostree remote config is populated. If there is a contenturl= parameter, the script should check fetching the config asset from both the url= and contenturl= parameters to validate both more completely. In the absence of the contenturl= parameter, the script should only check the url= parameter.
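The proposed logic could be sketched as below. This is a hedged sketch, not the project's implementation: the read_remote_key helper is hypothetical, and a sample remote config is inlined for illustration (real files live under /etc/ostree/remotes.d/).

```shell
#!/bin/bash
# Sketch: read url= and contenturl= from an ostree remote config, then
# probe the "config" asset on each endpoint that is present.
read_remote_key() {  # read_remote_key <key> <file>
  awk -F'=' -v k="$1" '$1 == k {print $2}' "$2"
}

# Inlined sample config for illustration; the real check would read the
# files in /etc/ostree/remotes.d/ instead.
conf=$(mktemp)
cat > "$conf" <<'EOF'
[remote "fedora-iot"]
url=https://ostree.fedoraproject.org/iot
contenturl=https://d2ju0wfl996cmc.cloudfront.net
EOF

url=$(read_remote_key url "$conf")
contenturl=$(read_remote_key contenturl "$conf")
rm -f "$conf"

# The actual probe would be something like:
#   curl -o /dev/null -Isw '%{http_code}' -L "$url/config"
echo "metadata endpoint to probe: $url/config"
if [ -n "$contenturl" ]; then
  echo "content endpoint to probe: $contenturl/config"
fi
```

When contenturl= is absent, only the url= endpoint is probed, matching the behaviour described above.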
Goal:
We need a default health check to ensure that the update services are still contactable post-upgrade.
Acceptance Criteria
When we update and reboot, we need to ensure the update platform is still available post-upgrade:
Currently we can see a greenboot log like this:
[admin@vm-1 ~]$ journalctl -b -4 -u greenboot.service
Jun 12 08:02:47 localhost systemd[1]: Starting greenboot Health Checks Runner...
Jun 12 08:02:47 localhost greenboot[909]: INFO greenboot > GreenbootConfig { max_reboot: 3 }
Jun 12 08:02:47 localhost greenboot[909]: INFO greenboot > running required check /usr/lib/greenboot/check/required.d/01_repository_d>
Jun 12 08:02:47 localhost greenboot[909]: INFO greenboot > running required check /usr/lib/greenboot/check/required.d/02_watchdog.sh
Jun 12 08:02:47 localhost greenboot[909]: INFO greenboot > running required check /etc/greenboot/check/required.d/10_failing_check.sh
Jun 12 08:02:47 localhost greenboot[909]: ERROR greenboot > required script /etc/greenboot/check/required.d/10_failing_check.sh failed!
Jun 12 08:02:47 localhost greenboot[909]: ERROR greenboot > reason:
Jun 12 08:02:47 localhost greenboot[909]: INFO greenboot > running wanted check /usr/lib/greenboot/check/wanted.d/01_update_platforms>
Jun 12 08:02:47 localhost greenboot[909]: WARN greenboot > wanted script /usr/lib/greenboot/check/wanted.d/01_update_platforms_check.>
Jun 12 08:02:47 localhost greenboot[909]: WARN greenboot > reason: grep: /etc/ostree/remotes.d/*: No such file or directory
Jun 12 08:02:47 localhost greenboot[909]: ERROR greenboot > Greenboot health-check failed!
Jun 12 08:02:47 localhost greenboot[909]: INFO greenboot::handler > boot_counter initialized
Jun 12 08:02:47 localhost greenboot[909]: INFO greenboot::handler > restarting system
Jun 12 08:02:47 localhost greenboot[909]: Error: health-check failed!
Jun 12 08:02:47 localhost systemd[1]: greenboot.service: Main process exited, code=exited, status=1/FAILURE
Jun 12 08:02:47 localhost systemd[1]: greenboot.service: Failed with result 'exit-code'.
Jun 12 08:02:47 localhost systemd[1]: Stopped greenboot Health Checks Runner.
Suggestion: add more messages to help the customer understand the boot status:
Currently greenboot rollback depends on ostree-finalize-staged.service, which is triggered only on the first reboot after an update is deployed in ostree. So a time-delayed failure cannot trigger any rollback, which may hamper certain use cases. Decoupling this would also help greenboot integrate more closely with the ostree architecture.
It would also reduce the dependency on systemd service orchestration.
Example: a /usr/lib/greenboot/check.d/required.d/02_watchdog.sh
failure will not trigger any rollback after the first reboot, which can happen in an edge scenario.
Two issues for greenboot-rollback service:
1. When greenboot-rollback.service is enabled, it fails and reports an error.
2. Checking the log of greenboot-rollback.service for the first boot, there is an error message: "Error: Rollback not initiated as boot_counter is either unset or not equal to 0"
[admin@vm-1 ~]$ systemctl enable --now greenboot-rollback.service
Job for greenboot-rollback.service failed because the control process exited with error code.
See "systemctl status greenboot-rollback.service" and "journalctl -xeu greenboot-rollback.service" for details.
[admin@vm-1 ~]$ journalctl -b -5 -u greenboot-rollback.service
Jul 19 09:49:48 vm-1 systemd[1]: Starting Greenboot rollback...
Jul 19 09:49:48 vm-1 greenboot[1747]: Error: Rollback not initiated as boot_counter is either unset or not equal to 0
Jul 19 09:49:48 vm-1 systemd[1]: greenboot-rollback.service: Main process exited, code=exited, status=1/FAILURE
Jul 19 09:49:48 vm-1 systemd[1]: greenboot-rollback.service: Failed with result 'exit-code'.
Jul 19 09:49:48 vm-1 systemd[1]: Failed to start Greenboot rollback.
stdout: |-
Aug 27 04:39:54 localhost systemd[1]: multi-user.target: Found ordering cycle on greenboot-grub2-set-success.service/start
Aug 27 04:39:54 localhost systemd[1]: multi-user.target: Found dependency on boot-complete.target/start
Aug 27 04:39:54 localhost systemd[1]: multi-user.target: Found dependency on greenboot-service-monitor.service/start
Aug 27 04:39:54 localhost systemd[1]: multi-user.target: Found dependency on multi-user.target/start
Aug 27 04:39:54 localhost systemd[1]: multi-user.target: Job greenboot-grub2-set-success.service/start deleted to break ordering cycle starting with multi-user.target/start
Aug 27 04:39:54 localhost systemd[1]: multi-user.target: Found ordering cycle on greenboot-status.service/start
Aug 27 04:39:54 localhost systemd[1]: multi-user.target: Found dependency on greenboot-task-runner.service/start
Aug 27 04:39:54 localhost systemd[1]: multi-user.target: Found dependency on boot-complete.target/start
Aug 27 04:39:54 localhost systemd[1]: multi-user.target: Found dependency on greenboot-service-monitor.service/start
Aug 27 04:39:54 localhost systemd[1]: multi-user.target: Found dependency on multi-user.target/start
Aug 27 04:39:54 localhost systemd[1]: multi-user.target: Job greenboot-status.service/start deleted to break ordering cycle starting with multi-user.target/start
Aug 27 04:39:54 localhost systemd[1]: multi-user.target: Found ordering cycle on greenboot-task-runner.service/start
Aug 27 04:39:54 localhost systemd[1]: multi-user.target: Found dependency on boot-complete.target/start
Aug 27 04:39:54 localhost systemd[1]: multi-user.target: Found dependency on greenboot-service-monitor.service/start
Aug 27 04:39:54 localhost systemd[1]: multi-user.target: Found dependency on multi-user.target/start
Aug 27 04:39:54 localhost systemd[1]: multi-user.target: Job greenboot-task-runner.service/start deleted to break ordering cycle starting with multi-user.target/start
full logs at https://github.com/virt-s1/rhel-edge/runs/8047704187?check_suite_focus=true#step:4:5342 - bz https://bugzilla.redhat.com/show_bug.cgi?id=2120593#c1
There will be cases where we may wish to prevent rolling back an update, in particular to avoid the exploitation of old vulnerabilities that have been fixed in newer versions.
This will require integration with the OS update mechanism to see whether a version has been tagged as "don't roll back", in case of something like a remotely exploitable CVE.
Use cargo-vendor-filterer to remove the windows crate.
Show which checks failed,
and the current/remaining number of boot attempts (boot iterator).
There has been a network configuration change and my Fedora IoT server needs some changes. The problem is that it keeps rebooting while I'm trying to solve the issue, so I'd like to suspend it until I fix the network.
Do I really need to uninstall greenboot for this?
If yes, that's terrible UX.
If no, that's terrible documentation. Anyone who cares about the docs will probably be here for the same reason.
Packit failed on creating pull-requests in dist-git:
dist-git branch | error
---|---
main | Bad source: /tmp/packit-dist-gitgjeb2ttg/greenboot-0.15.1.tar.gz: No such file or directory
You can retrigger the update by adding a comment (/packit propose-downstream) into this issue.
An ssh session seems to trigger the grub-boot-success.timer from grub2-tools, which sets the boot success flag; this means that grub does not decrement the boot counter if there is a reboot.
The readme lists two examples: the wanted check for systemd and the required check for systemd. From reading them, they look exactly the same to me when it comes to the fields that are actually relevant for systemd. I guess this is due to a copy-and-paste mistake?
The last example should be WantedBy=boot-complete.target instead of also being RequiredBy, right? Or is there something I'm missing?
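If that reading is right, the [Install] section of the wanted-check example would presumably look like this. A sketch based on the reporter's interpretation, not a confirmed fix from the maintainers:

```ini
# Hypothetical corrected [Install] section for the *wanted* check example:
[Install]
WantedBy=boot-complete.target
```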
After having quite a lot of trouble with a new clean install of Fedora 37 IOT on a raspberry pi 4, I discovered that the repository DNS check was failing due to the system time not having synced by the time greenboot-healthcheck.service was running.
I've not entirely got my head around why this might be the case. I've had problems with DNSSEC on a raspberry pi due to the time being wrong before, but I've not yet worked out if this is the problem now!
I fixed this by making greenboot wait for the correct time (see below). I'm not sure if this is the most appropriate fix, but it seems to work to allow me to boot the pi more than once, at least!
After flashing the initial image to a raspberry pi, the first boot worked OK, but reported
Script '01_repository_dns_check.sh' FAILURE (exit code '1'). Continuing...
Boot Status is RED - Health Check FAILURE!
SYSTEM is UNHEALTHY, but bootlader entry count is 1. Manual intervention necessary.
However, attempting to update the system using rpm-ostree (and rebooting to finalise this) resulted in a continuous bootloop. The system appeared to reach the operating system OK, and it was even possible to log in for a few seconds before greenboot rebooted the system. This appeared to happen indefinitely, or at least for the 10 minutes or so before I got bored!
To fix this, I enabled chrony-wait and made greenboot-healthcheck.service wait for time-sync.target:
systemctl enable chrony-wait.service
systemctl edit greenboot-healthcheck.service
and add the following:
[Unit]
After=time-sync.target
Requires=time-sync.target
The current greenboot-service-monitor implementation validates the active state of the monitored services regardless of whether they are enabled,
which causes an unforced issue when a user disables a service but leaves it in the greenboot config to monitor.
Can be resolved by the following process:
service enabled? Yes -> service active? Yes -> boot success; No -> trigger redboot.target
service enabled? No -> boot success
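The decision flow above can be sketched as a small function. A hedged sketch only: monitor_verdict is a hypothetical helper whose two arguments stand in for the exit codes of `systemctl is-enabled` and `systemctl is-active` (0 = yes), so the logic can be shown without touching a live system.

```shell
#!/bin/bash
# Hypothetical sketch of the proposed service-monitor flow. Only an
# enabled-but-inactive service should trigger the red path; disabled
# services are skipped entirely.
monitor_verdict() {  # monitor_verdict <is-enabled rc> <is-active rc>
  local enabled=$1 active=$2
  if [ "$enabled" -ne 0 ]; then
    echo "boot success (service disabled, skipped)"
  elif [ "$active" -eq 0 ]; then
    echo "boot success"
  else
    echo "trigger redboot.target"
  fi
}

monitor_verdict 0 0   # enabled + active
monitor_verdict 1 1   # disabled, regardless of active state
monitor_verdict 0 3   # enabled + inactive
```

In a real implementation the two arguments would come from `systemctl is-enabled --quiet "$svc"` and `systemctl is-active --quiet "$svc"`.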
Issue description:
The greenboot and greenboot-rollback services are disabled and not started:
[admin@vm-1 ~]$ systemctl is-enabled greenboot.service greenboot-rollback.service
disabled
disabled
[admin@vm-1 ~]$ journalctl --list-boots
IDX BOOT ID FIRST ENTRY LAST ENTRY
0 1fd7bf376e634d0298540a6715cfa976 Wed 2023-07-19 04:58:22 EDT Wed 2023-07-19 07:15:38 EDT
[admin@vm-1 ~]$ journalctl -b -0 -u greenboot.service
– No entries –
Also remove code duplication by using one function for all script runners
The currently available default health check scripts (the required ones) do not have the level of reliability we desire for edge device scenarios.
See #68, #71, #90, #93, #98 for examples of difficulties, failures, etc. with the scripts.
Until we are able to provide more resilient health checks by default, we should package the existing scripts in a separate RPM and have it be an optional install.
As $subject states, this behaviour makes sense as Greenboot would mark the boot as "red/faulty" anyway, but it would bring more value to the user to know all the required health checks that failed, not just the first one.
Our edge devices are getting this issue below when logging into the device using SSH.
Script '01_update_platforms_check.sh' FAILURE (exit code '1'). Continuing...
Boot Status is GREEN - Health Check SUCCESS
It sounds like the same issue occurred in issue #68.
Since ostree repos are common to every system that uses them, our idea was to change the script to check for the config file instead of the root directory.
Is there another way to fix it for the different repo flavors we have?
In order to avoid accidental modifications of default health checks, we should move those to a read-only directory.
This implies that Greenboot should be able to check two directories now: the read-only one for the default scripts and the current one (/etc/greenboot/check/{required,wanted}.d).
Basically: grub2, status, etc.
As part of #2, there was a commit that @jwrdegoede sent to util-linux: util-linux/util-linux@b1b0259
This would add more reliability to this watchdog check. The commit was added to the main branch on Aug 18th, 2021, and sadly the last release is from Aug 17th, so we will have to wait until a new release is, er, released, in order to get this functionality into util-linux.
Once it's released, I recommend enforcing this version dependency in the spec file, basically making that line Requires: util-linux > 2.37.2.
This is the tags page of util-linux.
Packit failed on creating pull-requests in dist-git:
dist-git branch | error
---|---
f33 | Failed to download file from URL https://github.com/fedora-iot/greenboot/archive/vv0.12.0.tar.gz. Reason: 'Not Found'.
f34 | Failed to download file from URL https://github.com/fedora-iot/greenboot/archive/vv0.12.0.tar.gz. Reason: 'Not Found'.
main | Failed to download file from URL https://github.com/fedora-iot/greenboot/archive/vv0.12.0.tar.gz. Reason: 'Not Found'.
You can retrigger the update by adding a comment (/packit propose-downstream) into this issue.
The requirement for systemd is currently too vague, since greenboot requires boot-complete.target, which was added in systemd 240.
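In the spec file, this could presumably be encoded along the following lines. A hypothetical change, not the project's confirmed packaging:

```
# Hypothetical spec-file tightening (boot-complete.target landed in systemd 240):
Requires: systemd >= 240
```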
Hello,
In MicroShift we're looking for a sure way to know that the system rolled back, so we can perform certain actions.
We have some ideas so far:
1. When boot_counter == 0, create some kind of file on disk to persist that information and read it on the next boot. This assumes that: a set boot_counter means we're in the middle of "new deployment hasn't been determined yet to be okay, so greenboot might reboot the system" - especially when the red script runs (as it will be followed by a reboot); boot_counter == 0 means it's the last attempt - when the system reboots, grub will see that value and select the second boot entry (rollback).
2. Check journalctl --boot 0 -u greenboot-rpm-ostree-grub2-check-fallback for the existence of the "FALLBACK BOOT DETECTED! Default rpm-ostree deployment has been rolled back" message.
3. Use a unit ordered After=greenboot-rpm-ostree-grub2-check-fallback, but we'd have to check whether non-ostree systems are impacted, or whether RemainAfterExit=yes would affect that as well.
4. Patch greenboot-rpm-ostree-grub2-check-fallback to create a file like /run/rolled-back; this file would be removed by greenboot-grub2-set-counter or be cleaned automatically by reboot (if a new deployment wasn't staged, but the machine was simply rebooted).
Do you have any other ideas how we could make this a robust mechanism?
Would you happen to know about any other source of this information, like grub or (rpm-)ostree?
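The journal-check and marker-file approaches above could be combined in a small one-shot unit. This is a hedged sketch only: the unit and message names are taken from the discussion, while the ExecStart command, the /run/rolled-back path, and the WantedBy target are assumptions.

```ini
# Hypothetical unit: persist a marker when a fallback boot is detected.
[Unit]
Description=Record that the system rolled back
After=greenboot-rpm-ostree-grub2-check-fallback.service

[Service]
Type=oneshot
ExecStart=/bin/sh -c 'journalctl -b 0 -u greenboot-rpm-ostree-grub2-check-fallback | grep -q "FALLBACK BOOT DETECTED" && touch /run/rolled-back || true'

[Install]
WantedBy=multi-user.target
```

Since /run is a tmpfs, the marker disappears on the next reboot by itself, which matches the cleanup behaviour wished for above.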
I don't know if it is a duplicate of #90 and #71, so excuse me in advance:
On a fresh Fedora IoT 37 I always get
There are problems connecting with the following URLs:
https://ostree.fedoraproject.org/iot
https://ostree.fedoraproject.org/iot/mirrorlist
Also, running /usr/lib/greenboot/check/wanted.d/01_update_platforms_check.sh by hand after the boot returns the same result.
I'm not a great bash expert, but I suspect that the double quotes around ${UPDATE_PLATFORM_URLS[@]} result in a single word containing all the UPDATE_PLATFORM_URLS, making the for loop ineffective.
At the end of the day, the curl command is:
curl -o /dev/null -Isw '%{http_code}\n' "https://ostree.fedoraproject.org/iot https://ostree.fedoraproject.org/iot/mirrorlist"
leading to a 000 HTTP_STATUS variable (that is, curl: (3) URL using bad/illegal format or missing URL).
In short: if in that line I remove the double quotes:
for UPDATE_PLATFORM_URL in ${UPDATE_PLATFORM_URLS[@]}; do
the script works as expected.
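The quoting behaviour can be reproduced in isolation (the URLs below are placeholders, not the script's real configuration). Note that for a proper bash array, the quoted "${arr[@]}" expansion is actually correct; the symptom described suggests UPDATE_PLATFORM_URLS was a plain space-separated string, for which quoting collapses everything into one word:

```shell
#!/bin/bash
# Demonstration of bash array vs. string expansion in a for loop
# (hypothetical URLs for illustration).
URLS_ARRAY=("https://example.com/a" "https://example.com/b")
URLS_STRING="https://example.com/a https://example.com/b"

count_array=0
for u in "${URLS_ARRAY[@]}"; do   # quoted array: one word per element
  count_array=$((count_array + 1))
done

count_quoted_string=0
for u in "$URLS_STRING"; do       # quoted string: one single word
  count_quoted_string=$((count_quoted_string + 1))
done

count_unquoted_string=0
for u in $URLS_STRING; do         # unquoted string: word-splitting applies
  count_unquoted_string=$((count_unquoted_string + 1))
done

echo "array: $count_array, quoted string: $count_quoted_string, unquoted string: $count_unquoted_string"
```

So the cleanest fix would be to declare UPDATE_PLATFORM_URLS as an array and keep the quotes; removing the quotes works too, but leaves the loop exposed to word-splitting surprises.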
It would be good to have this info for both the from-GitHub and the from-RPM installation paths.
The packit service will do auto package updates: https://packit.dev/
Line 36 in 650ce77
Should greenboot-rpm-ostree-grub2-check-fallback, greenboot-grub2-set-counter, redboot-task-runner and redboot-auto-reboot be enabled as well?
I'm not sure whether the issue should be created here or it's a reverse-proxy configuration issue. The way it checks that a url is valid is:
HTTP_STATUS=$(curl -o /dev/null -Isw '%{http_code}\n' "$UPDATE_PLATFORM_URL" || echo "Unreachable")
if ! [[ $HTTP_STATUS == 2* ]] && ! [[ $HTTP_STATUS == 3* ]]; then
URLS_WITH_PROBLEMS+=( "$UPDATE_PLATFORM_URL" )
fi
And accessing https://ostree.fedoraproject.org/iot/ returns 403, although ostree updating still works correctly.