Comments (9)
A loss of power is very likely what triggered the error in my case.
from raft.
@adamryczkowski so you positively know that there was a power loss that caused the system to shutdown abruptly and the LXD issue happened right after the reboot?
Since the same issue was reported by @julian-klode and I don't think there was any power loss in his case, I'd tend to think that the fact you experienced a power loss was just a coincidence. Happy to hear more info about this though.
I've got this too. As far as I can see there was no upgrade in between, and no power outage or anything either. I rebooted the system (cleanly), and when it came back up LXD wasn't starting any more. So seems like something that can happen at any restart (of LXD), to me.
Aug 10 18:36:50 raleigh.orangesquash.org.uk lxd.daemon[3671]: => LXD is ready
Aug 11 11:53:06 raleigh.orangesquash.org.uk systemd[1]: Stopping Service for snap application lxd.daemon...
Aug 11 11:53:07 raleigh.orangesquash.org.uk lxd.daemon[242790]: => Stop reason is: host shutdown
Aug 11 11:53:07 raleigh.orangesquash.org.uk lxd.daemon[242790]: => Stopping LXD (with container shutdown)
Aug 11 11:53:15 raleigh.orangesquash.org.uk lxd.daemon[242790]: => Stopping LXCFS
Aug 11 11:53:16 raleigh.orangesquash.org.uk systemd[1]: snap.lxd.daemon.service: Succeeded.
Aug 11 11:53:16 raleigh.orangesquash.org.uk systemd[1]: Stopped Service for snap application lxd.daemon.
-- Reboot --
Aug 11 11:53:18 raleigh.orangesquash.org.uk systemd[1]: Started Service for snap application lxd.daemon.
Aug 11 11:53:18 raleigh.orangesquash.org.uk lxd.daemon[3806]: => Preparing the system
Aug 11 11:53:18 raleigh.orangesquash.org.uk lxd.daemon[3806]: ==> Loading snap configuration
Aug 11 11:53:18 raleigh.orangesquash.org.uk lxd.daemon[3806]: ==> Setting up mntns symlink (mnt:[4026532469])
Aug 11 11:53:18 raleigh.orangesquash.org.uk lxd.daemon[3806]: ==> Setting up persistent shmounts path
Aug 11 11:53:18 raleigh.orangesquash.org.uk lxd.daemon[3806]: ====> Making LXD shmounts use the persistent path
Aug 11 11:53:18 raleigh.orangesquash.org.uk lxd.daemon[3806]: ====> Making LXCFS use the persistent path
Aug 11 11:53:18 raleigh.orangesquash.org.uk lxd.daemon[3806]: ==> Setting up kmod wrapper
Aug 11 11:53:18 raleigh.orangesquash.org.uk lxd.daemon[3806]: ==> Preparing /boot
Aug 11 11:53:18 raleigh.orangesquash.org.uk lxd.daemon[3806]: ==> Preparing a clean copy of /run
Aug 11 11:53:18 raleigh.orangesquash.org.uk lxd.daemon[3806]: ==> Preparing a clean copy of /etc
Aug 11 11:53:18 raleigh.orangesquash.org.uk lxd.daemon[3806]: ==> Setting up ceph configuration
Aug 11 11:53:18 raleigh.orangesquash.org.uk lxd.daemon[3806]: ==> Setting up LVM configuration
Aug 11 11:53:18 raleigh.orangesquash.org.uk lxd.daemon[3806]: ==> Rotating logs
Aug 11 11:53:18 raleigh.orangesquash.org.uk lxd.daemon[3806]: ==> Setting up ZFS (0.8)
Aug 11 11:53:18 raleigh.orangesquash.org.uk lxd.daemon[3806]: ==> Escaping the systemd cgroups
Aug 11 11:53:18 raleigh.orangesquash.org.uk lxd.daemon[3806]: ====> Detected cgroup V1
Aug 11 11:53:18 raleigh.orangesquash.org.uk lxd.daemon[3806]: ==> Escaping the systemd process resource limits
Aug 11 11:53:18 raleigh.orangesquash.org.uk lxd.daemon[3806]: ==> Increasing the number of inotify user instances
Aug 11 11:53:18 raleigh.orangesquash.org.uk lxd.daemon[3806]: ==> Disabling shiftfs on this kernel (auto)
Aug 11 11:53:19 raleigh.orangesquash.org.uk lxd.daemon[3806]: => Starting LXCFS
Aug 11 11:53:19 raleigh.orangesquash.org.uk lxd.daemon[3806]: mount namespace: 5
Aug 11 11:53:19 raleigh.orangesquash.org.uk lxd.daemon[3806]: hierarchies:
Aug 11 11:53:19 raleigh.orangesquash.org.uk lxd.daemon[3806]: 0: fd: 6: perf_event
Aug 11 11:53:19 raleigh.orangesquash.org.uk lxd.daemon[3806]: 1: fd: 7: freezer
Aug 11 11:53:19 raleigh.orangesquash.org.uk lxd.daemon[3806]: 2: fd: 8: hugetlb
Aug 11 11:53:19 raleigh.orangesquash.org.uk lxd.daemon[3806]: 3: fd: 9: blkio
Aug 11 11:53:19 raleigh.orangesquash.org.uk lxd.daemon[3806]: 4: fd: 10: devices
Aug 11 11:53:19 raleigh.orangesquash.org.uk lxd.daemon[3806]: 5: fd: 11: pids
Aug 11 11:53:19 raleigh.orangesquash.org.uk lxd.daemon[3806]: 6: fd: 12: net_cls,net_prio
Aug 11 11:53:19 raleigh.orangesquash.org.uk lxd.daemon[3806]: 7: fd: 13: rdma
Aug 11 11:53:19 raleigh.orangesquash.org.uk lxd.daemon[3806]: 8: fd: 14: cpu,cpuacct
Aug 11 11:53:19 raleigh.orangesquash.org.uk lxd.daemon[3806]: 9: fd: 15: memory
Aug 11 11:53:19 raleigh.orangesquash.org.uk lxd.daemon[3806]: 10: fd: 16: cpuset
Aug 11 11:53:19 raleigh.orangesquash.org.uk lxd.daemon[3806]: 11: fd: 17: name=systemd
Aug 11 11:53:19 raleigh.orangesquash.org.uk lxd.daemon[3806]: 12: fd: 18: unified
Aug 11 11:53:19 raleigh.orangesquash.org.uk lxd.daemon[3806]: => Starting LXD
Aug 11 11:53:19 raleigh.orangesquash.org.uk lxd.daemon[3806]: t=2019-08-11T11:53:19+0100 lvl=warn msg="CGroup memory swap accounting is disabled, swap limits will be ignored."
Aug 11 11:53:19 raleigh.orangesquash.org.uk lxd.daemon[3806]: t=2019-08-11T11:53:19+0100 lvl=eror msg="Failed to start the daemon: Failed to start dqlite server: run failed with 13"
Aug 11 11:53:19 raleigh.orangesquash.org.uk lxd.daemon[3806]: Error: Failed to start dqlite server: run failed with 13
Aug 11 11:53:20 raleigh.orangesquash.org.uk lxd.daemon[3806]: => LXD failed to start
Aug 11 11:53:20 raleigh.orangesquash.org.uk systemd[1]: snap.lxd.daemon.service: Main process exited, code=exited, status=1/FAILURE
From snap info lxd:
tracking: stable/ubuntu-18.10
refresh-date: 10 days ago, at 20:06 BST
…
installed: 3.15 (11437) 57MB -
@iainlane what filesystem do you have behind /var/snap/lxd?
Also, am I reading this right that your system took less than 2s to reboot (from lxd snap stopped to lxd snap starting)? That seems unusually fast, do you have some magic going on like kexec to make it so fast or is it just clock skew that's making it look that quick?
I have come across this issue too. In my case there was no power loss.
Relevant sequence of events:
2019-08-08 lxd snap auto-refresh
2019-08-09 09:17:26 lxd shutdown triggered by system shutdown
(The system was powered off for several hours)
2019-08-09 15:38:52 lxd failed to start when the system came back up after power-on
My /var/snap/lxd is backed by ZFS
$ snap info lxd
...
tracking: stable
refresh-date: 4 days ago, at 07:22 CEST
...
installed: 3.15 (11437) 57MB -
$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 19.04
Release: 19.04
Codename: disco
I have captured an archive of the database itself, all available copies of /var/log/syslog, and /var/snap/lxd/common/lxd/logs/lxd.log. Let me know if you need any more information.
laney@raleigh> mount -l | grep lxd
/var/lib/snapd/snaps/lxd_11405.snap on /snap/lxd/11405 type squashfs (ro,nodev,relatime,x-gdu.hide)
/var/lib/snapd/snaps/lxd_11437.snap on /snap/lxd/11437 type squashfs (ro,nodev,relatime,x-gdu.hide)
nsfs on /run/snapd/ns/lxd.mnt type nsfs (rw)
tmpfs on /var/snap/lxd/common/ns type tmpfs (rw,relatime,size=1024k,mode=700)
nsfs on /var/snap/lxd/common/ns/shmounts type nsfs (rw)
nsfs on /var/snap/lxd/common/ns/mntns type nsfs (rw)
but /var/snap/lxd itself is just on /, which is /dev/mapper/ubuntu--vg-root on / type ext4 (rw,relatime,errors=remount-ro) (LVM on LUKS).
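For anyone else checking the same thing: the backing filesystem of a path can be confirmed directly with standard util-linux/coreutils commands (the path below is the one from this thread):

```shell
# Resolve the mount that actually contains LXD's state directory.
# -T follows the path down to its containing mount; -n drops the header.
findmnt -n -o SOURCE,FSTYPE,TARGET -T /var/snap/lxd

# df -T reports the same backing device and filesystem type.
df -T /var/snap/lxd
```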
And, as regards the weird timestamps: it looks like my clock drifted a bit, probably while my PC was off during my holiday, and the next reboot got an NTP sync which fixed it up. The actual shutdown after the "last good" stop of LXD was
Aug 11 11:53:28 raleigh.orangesquash.org.uk systemd[1]: Shutting down.
so the clock certainly went backwards :).
For all folks that got affected by this:
The bad news is that I could not figure out what exactly happened here, so in principle something similar might happen again.
The good news is that I completely revisited the way we deal with deletion of snapshotted raft log entries that should no longer be needed:
- First of all, there is now a retention window: instead of deleting snapshotted entries right away, they are kept until the trail starts using too much space (in the case of dqlite/LXD this will be 30-50 megabytes, depending on the size of the entries).
- In addition to that, we are now more conservative at startup time and don't delete anything during that phase, reducing the surface area for possible bugs due to deleting files that shouldn't be deleted.
- Also, there is now much more sophisticated logic in place for dealing with unexpected situations at startup: previously we would bail out with a failure on any little inconsistency; now the raft library tries its best to still start up successfully when it's safely possible to do so, despite a number of inconsistent scenarios that might be detected. We'll bail out only when there's truly no decision that we can safely take ("safely" as in "no data loss will occur").
- Finally, in the worst-case scenario that this happens again, we should have better logging at hand to debug the situation. This is an area where we'll want further improvements even after this commit.
Note that since the logic that possibly triggered the bug was changed considerably, I have reasonable hopes that this won't happen again. We'll see.
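The retention-window idea from the first point above can be sketched roughly as follows. This is an illustrative shell model only, not raft's actual C implementation; the trim_trail name, the directory layout, and the byte budget are assumptions made for the sketch:

```shell
# Illustrative model of a snapshot-trail retention window: closed
# segment files are deleted oldest-first, but only while the trail
# exceeds a size budget -- instead of being deleted immediately after
# a snapshot, as before.
trim_trail() {
    dir="$1"      # directory holding closed segment files (assumed layout)
    budget="$2"   # retention window in bytes (dqlite/LXD: ~30-50 MB)

    # Total size of the retained trail.
    total=0
    for f in "$dir"/*; do
        total=$((total + $(stat -c%s "$f")))
    done

    # File names are assumed to sort by log index, so the glob walks
    # oldest-first; stop as soon as we are back under budget.
    for f in "$dir"/*; do
        [ "$total" -le "$budget" ] && return
        size=$(stat -c%s "$f")
        rm "$f"
        total=$((total - size))
    done
}
```

For example, with a 2 KiB budget and four 1 KiB segments, the two oldest segments are removed and the two newest survive.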