Comments (9)
A loss of power is very likely what triggered the error in my case.
from raft.
@adamryczkowski so you positively know that there was a power loss that caused the system to shutdown abruptly and the LXD issue happened right after the reboot?
Since the same issue was reported by @julian-klode and I don't think there was any power loss in his case, I'd tend to think that the fact you experienced a power loss was just a coincidence. Happy to hear more info about this though.
I've got this too. As far as I can see there was no upgrade in between, and no power outage or anything either. I rebooted the system (cleanly), and when it came back up LXD wasn't starting any more. So seems like something that can happen at any restart (of LXD), to me.
Aug 10 18:36:50 raleigh.orangesquash.org.uk lxd.daemon[3671]: => LXD is ready
Aug 11 11:53:06 raleigh.orangesquash.org.uk systemd[1]: Stopping Service for snap application lxd.daemon...
Aug 11 11:53:07 raleigh.orangesquash.org.uk lxd.daemon[242790]: => Stop reason is: host shutdown
Aug 11 11:53:07 raleigh.orangesquash.org.uk lxd.daemon[242790]: => Stopping LXD (with container shutdown)
Aug 11 11:53:15 raleigh.orangesquash.org.uk lxd.daemon[242790]: => Stopping LXCFS
Aug 11 11:53:16 raleigh.orangesquash.org.uk systemd[1]: snap.lxd.daemon.service: Succeeded.
Aug 11 11:53:16 raleigh.orangesquash.org.uk systemd[1]: Stopped Service for snap application lxd.daemon.
-- Reboot --
Aug 11 11:53:18 raleigh.orangesquash.org.uk systemd[1]: Started Service for snap application lxd.daemon.
Aug 11 11:53:18 raleigh.orangesquash.org.uk lxd.daemon[3806]: => Preparing the system
Aug 11 11:53:18 raleigh.orangesquash.org.uk lxd.daemon[3806]: ==> Loading snap configuration
Aug 11 11:53:18 raleigh.orangesquash.org.uk lxd.daemon[3806]: ==> Setting up mntns symlink (mnt:[4026532469])
Aug 11 11:53:18 raleigh.orangesquash.org.uk lxd.daemon[3806]: ==> Setting up persistent shmounts path
Aug 11 11:53:18 raleigh.orangesquash.org.uk lxd.daemon[3806]: ====> Making LXD shmounts use the persistent path
Aug 11 11:53:18 raleigh.orangesquash.org.uk lxd.daemon[3806]: ====> Making LXCFS use the persistent path
Aug 11 11:53:18 raleigh.orangesquash.org.uk lxd.daemon[3806]: ==> Setting up kmod wrapper
Aug 11 11:53:18 raleigh.orangesquash.org.uk lxd.daemon[3806]: ==> Preparing /boot
Aug 11 11:53:18 raleigh.orangesquash.org.uk lxd.daemon[3806]: ==> Preparing a clean copy of /run
Aug 11 11:53:18 raleigh.orangesquash.org.uk lxd.daemon[3806]: ==> Preparing a clean copy of /etc
Aug 11 11:53:18 raleigh.orangesquash.org.uk lxd.daemon[3806]: ==> Setting up ceph configuration
Aug 11 11:53:18 raleigh.orangesquash.org.uk lxd.daemon[3806]: ==> Setting up LVM configuration
Aug 11 11:53:18 raleigh.orangesquash.org.uk lxd.daemon[3806]: ==> Rotating logs
Aug 11 11:53:18 raleigh.orangesquash.org.uk lxd.daemon[3806]: ==> Setting up ZFS (0.8)
Aug 11 11:53:18 raleigh.orangesquash.org.uk lxd.daemon[3806]: ==> Escaping the systemd cgroups
Aug 11 11:53:18 raleigh.orangesquash.org.uk lxd.daemon[3806]: ====> Detected cgroup V1
Aug 11 11:53:18 raleigh.orangesquash.org.uk lxd.daemon[3806]: ==> Escaping the systemd process resource limits
Aug 11 11:53:18 raleigh.orangesquash.org.uk lxd.daemon[3806]: ==> Increasing the number of inotify user instances
Aug 11 11:53:18 raleigh.orangesquash.org.uk lxd.daemon[3806]: ==> Disabling shiftfs on this kernel (auto)
Aug 11 11:53:19 raleigh.orangesquash.org.uk lxd.daemon[3806]: => Starting LXCFS
Aug 11 11:53:19 raleigh.orangesquash.org.uk lxd.daemon[3806]: mount namespace: 5
Aug 11 11:53:19 raleigh.orangesquash.org.uk lxd.daemon[3806]: hierarchies:
Aug 11 11:53:19 raleigh.orangesquash.org.uk lxd.daemon[3806]: 0: fd: 6: perf_event
Aug 11 11:53:19 raleigh.orangesquash.org.uk lxd.daemon[3806]: 1: fd: 7: freezer
Aug 11 11:53:19 raleigh.orangesquash.org.uk lxd.daemon[3806]: 2: fd: 8: hugetlb
Aug 11 11:53:19 raleigh.orangesquash.org.uk lxd.daemon[3806]: 3: fd: 9: blkio
Aug 11 11:53:19 raleigh.orangesquash.org.uk lxd.daemon[3806]: 4: fd: 10: devices
Aug 11 11:53:19 raleigh.orangesquash.org.uk lxd.daemon[3806]: 5: fd: 11: pids
Aug 11 11:53:19 raleigh.orangesquash.org.uk lxd.daemon[3806]: 6: fd: 12: net_cls,net_prio
Aug 11 11:53:19 raleigh.orangesquash.org.uk lxd.daemon[3806]: 7: fd: 13: rdma
Aug 11 11:53:19 raleigh.orangesquash.org.uk lxd.daemon[3806]: 8: fd: 14: cpu,cpuacct
Aug 11 11:53:19 raleigh.orangesquash.org.uk lxd.daemon[3806]: 9: fd: 15: memory
Aug 11 11:53:19 raleigh.orangesquash.org.uk lxd.daemon[3806]: 10: fd: 16: cpuset
Aug 11 11:53:19 raleigh.orangesquash.org.uk lxd.daemon[3806]: 11: fd: 17: name=systemd
Aug 11 11:53:19 raleigh.orangesquash.org.uk lxd.daemon[3806]: 12: fd: 18: unified
Aug 11 11:53:19 raleigh.orangesquash.org.uk lxd.daemon[3806]: => Starting LXD
Aug 11 11:53:19 raleigh.orangesquash.org.uk lxd.daemon[3806]: t=2019-08-11T11:53:19+0100 lvl=warn msg="CGroup memory swap accounting is disabled, swap limits will be ignored."
Aug 11 11:53:19 raleigh.orangesquash.org.uk lxd.daemon[3806]: t=2019-08-11T11:53:19+0100 lvl=eror msg="Failed to start the daemon: Failed to start dqlite server: run failed with 13"
Aug 11 11:53:19 raleigh.orangesquash.org.uk lxd.daemon[3806]: Error: Failed to start dqlite server: run failed with 13
Aug 11 11:53:20 raleigh.orangesquash.org.uk lxd.daemon[3806]: => LXD failed to start
Aug 11 11:53:20 raleigh.orangesquash.org.uk systemd[1]: snap.lxd.daemon.service: Main process exited, code=exited, status=1/FAILURE
From snap info lxd:
tracking: stable/ubuntu-18.10
refresh-date: 10 days ago, at 20:06 BST
…
installed: 3.15 (11437) 57MB -
@iainlane what filesystem do you have behind /var/snap/lxd?
Also, am I reading this right that your system took less than 2s to reboot (from lxd snap stopped to lxd snap starting)? That seems unusually fast, do you have some magic going on like kexec to make it so fast or is it just clock skew that's making it look that quick?
I have come across this issue too. In my case there was no power loss.
Relevant sequence of events:
2019-08-08 lxd snap auto-refresh
2019-08-09 09:17:26 lxd shutdown triggered by system shutdown
(The system was powered off for several hours)
2019-08-09 15:38:52 lxd failed to start when the system came back up after power-on
My /var/snap/lxd is backed by ZFS
$ snap info lxd
...
tracking: stable
refresh-date: 4 days ago, at 07:22 CEST
...
installed: 3.15 (11437) 57MB -
$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 19.04
Release: 19.04
Codename: disco
I have captured an archive of the database itself, all available copies of /var/log/syslog, and /var/snap/lxd/common/lxd/logs/lxd.log. Let me know if you need any more information.
laney@raleigh> mount -l | grep lxd
/var/lib/snapd/snaps/lxd_11405.snap on /snap/lxd/11405 type squashfs (ro,nodev,relatime,x-gdu.hide)
/var/lib/snapd/snaps/lxd_11437.snap on /snap/lxd/11437 type squashfs (ro,nodev,relatime,x-gdu.hide)
nsfs on /run/snapd/ns/lxd.mnt type nsfs (rw)
tmpfs on /var/snap/lxd/common/ns type tmpfs (rw,relatime,size=1024k,mode=700)
nsfs on /var/snap/lxd/common/ns/shmounts type nsfs (rw)
nsfs on /var/snap/lxd/common/ns/mntns type nsfs (rw)
but /var/snap/lxd itself is just on /, which is /dev/mapper/ubuntu--vg-root on / type ext4 (rw,relatime,errors=remount-ro) (LVM on LUKS).
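For anyone else checking the same thing: the backing filesystem of a path can be confirmed directly with standard util-linux/coreutils commands (the path below is the one from this thread):

```shell
# Resolve the mount that actually contains LXD's state directory.
# -T follows the path down to its containing mount; -n drops the header.
findmnt -n -o SOURCE,FSTYPE,TARGET -T /var/snap/lxd

# df -T reports the same backing device and filesystem type.
df -T /var/snap/lxd
```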
And, as regards the weird timestamps: it looks like my clock drifted a bit, probably while my PC was off during my holiday, and the next reboot got an NTP sync which fixed it up. The actual shutdown after the "last good" stop of LXD was
Aug 11 11:53:28 raleigh.orangesquash.org.uk systemd[1]: Shutting down.
so the clock certainly went backwards :).
For all folks that got affected by this:
The bad news is that I could not figure out what exactly happened here, so in principle something similar might happen again.
The good news is that I completely revisited the way we deal with deletion of snapshotted raft log entries that should no longer be needed:
- First of all, there is now a retention window: instead of deleting snapshotted entries right away, they are kept until the trail starts using too much space (in the case of dqlite/LXD this will be 30-50 megabytes, depending on the size of the entries).
- In addition to that, we are now more conservative at startup time and don't delete anything during that phase, reducing the surface area for possible bugs due to deleting files that shouldn't be deleted.
- Also, there is now much more sophisticated logic in place for dealing with unexpected situations at startup: previously we would bail out with a failure on any little inconsistency; now the raft library tries its best to still start up successfully when it's safely possible to do so, despite a number of inconsistent scenarios that might be detected. We'll bail out only when there's truly no decision that we can safely take ("safely" as in "no data loss will occur").
- Finally, in the worst-case scenario that this happens again, we should have better logging at hand to debug the situation. This is an area where we'll want further improvements even after this commit.
Note that since the logic that possibly triggered the bug was changed considerably, I have reasonable hopes that this won't happen again. We'll see.
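The retention-window idea from the first point above can be sketched roughly as follows. This is an illustrative shell model only, not raft's actual C implementation; the trim_trail name, the directory layout, and the byte budget are assumptions made for the sketch:

```shell
# Illustrative model of a snapshot-trail retention window: closed
# segment files are deleted oldest-first, but only while the trail
# exceeds a size budget -- instead of being deleted immediately after
# a snapshot, as before.
trim_trail() {
    dir="$1"      # directory holding closed segment files (assumed layout)
    budget="$2"   # retention window in bytes (dqlite/LXD: ~30-50 MB)

    # Total size of the retained trail.
    total=0
    for f in "$dir"/*; do
        total=$((total + $(stat -c%s "$f")))
    done

    # File names are assumed to sort by log index, so the glob walks
    # oldest-first; stop as soon as we are back under budget.
    for f in "$dir"/*; do
        [ "$total" -le "$budget" ] && return
        size=$(stat -c%s "$f")
        rm "$f"
        total=$((total - size))
    done
}
```

For example, with a 2 KiB budget and four 1 KiB segments, the two oldest segments are removed and the two newest survive.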