Comments (20)
Well, at least here's a code snippet to recombine a backup without Longhorn:
Run it in the backup NFS mount:
e.g. /mnt/nfs-backup/backupstore/volumes/ed/f4/pvc-6728f786-51d9-11e8-beb0-0e701ed55039
truncate -s $(jq -r '.Size' volume.cfg) restore.img
jq -r '.Blocks|.[]|[.Offset, .BlockChecksum]|@tsv' backups/backup_backup-PICK_ONE.cfg | perl -anle 'if ($F[1]=~/^(..)(..)/) {print "dd conv=notrunc if=<(zcat blocks/$1/$2/".$F[1].".blk) of=restore.img seek=$F[0] oflag=seek_bytes"}' > /tmp/restore-image.sh
bash /tmp/restore-image.sh
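To sanity-check the idea behind the snippet, the restore-by-offset step can be exercised on dummy data without any backupstore at hand. This is only a sketch (the sizes, offset, and filenames are made up for illustration); it fabricates one gzipped "block" and places it at a byte offset with the same dd flags the generated script uses:

```shell
set -e
cd "$(mktemp -d)"

# Fabricate a 2MiB "block" and gzip it, mimicking a blocks/xx/yy/<checksum>.blk entry
head -c $((2*1024*1024)) /dev/urandom > block.raw
gzip -c block.raw > block.blk

# Pretend volume.cfg said Size = 8MiB, and the backup cfg said Offset = 4MiB
truncate -s $((8*1024*1024)) restore.img
dd conv=notrunc if=<(zcat block.blk) of=restore.img seek=$((4*1024*1024)) oflag=seek_bytes status=none

# The block's data should now sit at exactly byte offset 4MiB in the image
cmp <(dd if=restore.img bs=1M skip=4 count=2 status=none) block.raw && echo "block restored OK"
```

Note that the process substitution (`<(...)`) requires bash, which is why the generated script is run with `bash /tmp/restore-image.sh` rather than `sh`.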
from longhorn.
Created enhancement issues #1523, #1522, #1521
@jhughes2112 you can find how to restore from backup using https://longhorn.io/docs/1.0.0/advanced-resources/data-recovery/recover-without-system/ . BTW, I think you've misunderstood my comment. We don't expect users to write scripts themselves to recover; that's what I meant by manually. Longhorn already provides the mechanism to recover without Kubernetes. The effort is in fact tracked by this PR.
@jhughes2112 Thanks for your feedback. In particular, I think it makes sense to change the script into a docker run command instead of a YAML file. As you said, you might lose the whole Kubernetes cluster. We did that for the single-replica recovery but not for the restore from the backups.
To detect the volumes more intelligently, it would require a bit more work though.
@khushboo-rancher can you help to check https://longhorn.io/docs/1.0.0/advanced-resources/data-recovery/recover-without-system/ and the above comment, and file issues for enhancement accordingly? Thanks.
Hmm... I don't know why it didn't show the error, so I edited the comment above.
I did not see that salvage option. The volumes said they were in the Attaching state (and stayed there), and opening host->replica showed "ERROR" on all replicas.
Restoring from backup as a new volume works; at least, on the host where it is attached (/dev/longhorn/... is there) I can mount the volume with mount (on the host).
But what I'm trying to do is to keep existing PVC and restore PV from backup.
- Delete non functioning volume - pvc-123
- Restore from backup, same name: pvc-123
- When the pod is trying to mount it, I can see that JSON missing-size message in Describe
- The restored volume can be mounted on the host with mount command.
What I really would like to have is (in this order):
- document a way (with minimal example) to restore a volume from backup without recreating deployment and pvc
- document (if there is) a way to monitor and do the ui stuff from either inside a "manager" pod or with API
- can replica count be changed? Moved from host/node to another (removing, adding nodes)
- some way to recover faulty volumes (you mentioned the salvage options)
- documentation how node replica volumes are constructed. Are they just sparse image files that could be mounted? QCow format?
- offline scripts or enough design documentation to do those myself (like my backup restore snippet; enough was said on the Rancher blog to figure out that the 2Mi chunks/blocks just need to be restored to the correct offset)
If the volume stays in the Attaching state with all the replicas in ERROR, it's a bug. The volume shouldn't be in the Attaching state at that time. Let me check if I can reproduce it.
The Salvage option will be shown if the volume is in the Faulted state.
The PVC is automatically created by k8s. I think k8s somehow wasn't passing the size option to the flexvolume driver, resulting in an error. I will check this as well.
Some quick answers below:
- I need to check why your steps don't work.
- For now, check the API endpoint of the longhorn-backend. There is a minimal UI for API, though it's more prone to break if options are wrong. Also https://github.com/rancher/longhorn-tests/tree/master/manager/integration has examples on how to interact with manager API.
- Replica count cannot be changed currently. We can work on that later.
- When the volume is in the Faulted state (all the replicas are in ERROR and the volume is detached), the rightmost volume options for that volume in the volume list view will show the Salvage option.
- They're layered sparse files. Almost all of them cannot be mounted, since they normally don't contain the full data for the disk. Check the Replica Operations section of https://rancher.com/microservices-block-storage/ for details.
- For the restoration, I'm not sure why you chose to use your own script instead of using Longhorn. Or did you just want a way to export the image? We can add a feature to support that.
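For intuition on the "sparse file" part of that answer: a sparse file reports its full apparent size while consuming almost no disk until blocks are actually written, which is why a replica file can look volume-sized without holding the whole disk's data. A quick, Longhorn-independent illustration:

```shell
cd "$(mktemp -d)"

# A 1GiB sparse file: apparent size is 1GiB, actual disk usage is near zero
truncate -s 1G sparse.img
du -h --apparent-size sparse.img   # reports the 1G apparent size
du -h sparse.img                   # reports ~0, since no data blocks are allocated yet
```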
Need to cover how to recover from a single replica.
For documentation purposes, I would mention that getting a valid restore out of the script is a little time-consuming and hit-or-miss, guessing what to pass in. Seeing as how I no longer have the cluster the data came from, I would suggest, for your future users' benefit, several improvements.
- Change the yaml from a pod to a simple docker execution line. I had to create a cluster just so I could jam the yaml into it, but it's really not necessary, is it?
- If you do keep it in yaml form, change the unnecessary env pulling secrets to a more straightforward name/value pair. There's a really good chance creating a secret properly for some folks will simply derail them unnecessarily, trying to debug that part when there's plenty of other things to debug.
- Make the script capable of listing the valid volumes it would recognize, and list the backups that are legal to insert into the config. You already know the structure of the backup S3 bucket and can easily pull out the volume.cfg files. I lost everything, so I had to spend time looking through S3 and guessing at what your script wanted as inputs.
- Maybe mention, at the URL you gave me above, how the directory structure of the bucket storage maps to the inputs. There's a file called backup_backup-ab4a140a6d264cb4.cfg, and the input that you're looking for is actually a piece of that filename, but it is also stored in volume.cfg; you could have pulled that out and told me what would be accepted as valid backups in the case of an error.
For future reference and documentation's sake, the path in S3 that I was restoring from was this:
Amazon S3/MYBUCKET/backupstore/volumes/34/4e/cdn-data
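To avoid guessing, valid backup names can be enumerated straight from the backupstore layout: the name is the part after `backup_` in each cfg filename under `backups/`. A sketch (the layout and the example name are taken from this thread; in real use you would run the loop inside the volume's directory instead of fabricating one):

```shell
set -e
cd "$(mktemp -d)"

# Fake a volume directory containing one backup cfg, named as seen in the thread
mkdir backups
touch backups/backup_backup-ab4a140a6d264cb4.cfg

# List valid backup names by stripping the backup_ prefix and .cfg suffix
for f in backups/backup_*.cfg; do
  b=$(basename "$f" .cfg)
  echo "${b#backup_}"   # prints: backup-ab4a140a6d264cb4
done
```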
I didn't find a good nearly-working example for pulling a restore down from AWS, so here's one, with a couple of redactions:
apiVersion: v1
kind: Pod
metadata:
  name: restore-to-file
  namespace: default
spec:
  nodeName: NODENAME-TO-SAVE-TO
  containers:
  - name: restore-to-file
    command:
    # set restore-to-file arguments here
    - /bin/sh
    - -c
    - longhorn backup restore-to-file
      's3://MYBUCKET@us-east-1/?backup=backup-ab4a140a6d264cb4&volume=cdn-data'
      --output-file '/tmp/restore/cdn-data.qcow2'
      --output-format qcow2
    # the version of longhorn engine should be v0.4.1 or higher
    image: rancher/longhorn-engine:v0.4.1
    imagePullPolicy: IfNotPresent
    securityContext:
      privileged: true
    volumeMounts:
    - name: disk-directory
      mountPath: /tmp/restore # the argument <output-file> should be in this directory
    env:
    # set Backup Target Credential Secret here.
    - name: AWS_ACCESS_KEY_ID
      value: myawsaccesskeyid19long
    - name: AWS_SECRET_ACCESS_KEY
      value: myawssecretkey41characterslong
    - name: AWS_ENDPOINTS
      value: http://s3.us-east-1.amazonaws.com # fix the region code if you need to
  volumes:
  # the output file can be found on this host path
  - name: disk-directory
    hostPath:
      path: /tmp/restore
  restartPolicy: Never
There's also no mention of how to actually get files out of a qcow2, or any links on what that is. A few quick links on converting that to some useful format would save a bunch of downstream grumbling. It's feedback... take it or leave it.
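On the qcow2 question: qcow2 is QEMU's copy-on-write disk image format, and qemu-img (from the qemu-utils package) can convert it to a raw image that can be loop-mounted. A hedged sketch using the filenames from the pod spec above, guarded so it is a no-op where qemu-img or the restored image is absent:

```shell
# Convert the restored qcow2 to a raw disk image, then loop-mount it to pull files out.
# Guarded: only runs if qemu-img and the restored image are actually present.
if command -v qemu-img >/dev/null 2>&1 && [ -f /tmp/restore/cdn-data.qcow2 ]; then
  qemu-img convert -f qcow2 -O raw /tmp/restore/cdn-data.qcow2 /tmp/restore/cdn-data.img
  # If the image holds a bare filesystem:
  #   mount -o loop /tmp/restore/cdn-data.img /mnt/recovered
  # If it is partitioned, attach it with partition scanning first:
  #   losetup -fP /tmp/restore/cdn-data.img
  echo "converted"
else
  echo "skipped: qemu-img or restored image not available"
fi
```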
Best regards.
Simulating a hard crash and trying to restore the volume back from backups (with the same name):
I can mount the restored volume on the host, but pod creation fails:
MountVolume.SetUp failed for volume "pvc-81cb539a-51d9-11e8-beb0-0e701ed55039" : mount command failed, status: Failure, reason: create volume fail: fail to parse size error parsing size 'null': quantities must match the regular expression '^([+-]?[0-9.]+)([eEinumkKMGTP]*[-+]?[0-9]*)$'
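For reference, the regular expression in that error message is how Kubernetes validates resource quantities (the pattern in the paste looks slightly truncated; the one below is my reading of the resource.Quantity format, so treat it as an assumption). The failure is simply that the string 'null' has no leading number:

```shell
# Assumed k8s quantity pattern: a number, then an optional suffix like Gi, Mi, m, k...
re='^([+-]?[0-9.]+)([eEinumkKMGTP]*[-+]?[0-9]*)$'
for s in 10Gi 500m null; do
  if [[ $s =~ $re ]]; then
    echo "$s: valid quantity"
  else
    echo "$s: invalid"      # 'null' fails: it has no leading number
  fi
done
```

So the real bug is upstream: the size option reached the flexvolume driver as the literal string 'null' instead of a quantity like 10Gi.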
@manwegit wow, we don't really expect the user to recover the backup manually without Longhorn :)
With a backup, you can create a new Longhorn volume out of that backup: add the backupstore, then select restore.
We also have a built-in way to recover the volume, but we haven't documented it. When a volume is in the FAULT state, one additional action item called Salvage will appear in the UI on the volume list page. It will ask the user to choose which error replicas to use as the potentially good candidates. Longhorn will select one of the replicas and reset its state; then the user can at least recover the data to the point that replica recorded.
The script looks like it should work; what's the error message (just curious)?
My steps for cluster crash testing:
- Ubuntu 16.04 hosts, with docker and required packages installed. Dual network, with k8s communication over the private net
- Deploy k8s with rke tool, three nodes (and with five)
- Install Rancher 2.0 and use the catalog Longhorn (changing the volume plugin path)
- I added two binds to kubelet (I think) /dev and /var/lib/ something
- Use catalog redis or something that requires pvc (use longhorn class)
- take some snapshots, backups (external nfs)
- crash everything by rebooting all k8s nodes simultaneously
- Volumes stayed in Attaching state and replicas in ERROR.
As to point 6 from the previous comments: I'm thinking of offline recovery of data for many different purposes. Let's say hacking, DB corruption, etc., where you create an image from backed-up data without k8s cluster or network access. Those tools and ways to access data are hopefully never needed, but I've been in this business too long not to have options for data recovery/forensics.
It would be really nice if, when one only has snapshots, one could copy/transfer the replica folder to another Longhorn cluster, place it among e.g. the backups, and access/use it like a backup.
An additional reason beyond backups would be that after a cluster crash, while a new cluster is being built, the backup may be too old (the replica dir with snapshots should be a more recent point of recovery).
Hello. I spent the last 2 days recovering my Longhorn volumes, so I think I can provide some info. Maybe someone else will find it of help.
What happened:
We needed to turn off one machine to replace a hard drive which reported signs of failure. I added a temporary machine to share the load and turned off the machine in question. For some reason I couldn't evacuate the machine; it didn't stop any service. So I removed it from the cluster completely. After the disk replacement was done, I added the machine back and, after checking that everything was OK, I evacuated the temporary machine (without problems), deleted it from the UI, and then deleted that machine from the hosting provider. The machine got stuck in the Removing state in Rancher and I couldn't do anything. I found similar problems in the Rancher repository issues and restored the cluster from a snapshot. That's when the problems with Longhorn began. For some reason half (4) of the volumes became faulty.
Recovery:
A few had different paths on disk and in the UI. One was salvageable and recovery was easy. The last three took some time to figure out how to recover. They were present on disk, but Longhorn marked them as failed and I couldn't find a way to fix it. So the solution was to create a new volume of similar size, stop it, replace its content with what was in the failed volume, and start it.
It would be very helpful if you could add a way to manually add a replica from disk. That would've reduced recovery time to a few hours.
Also, if backups had been enabled from the start, it wouldn't have been that big of a problem either, but you can't back up everything every few minutes, so lots of data could be lost. It was a debug cluster for me and data loss was acceptable, so I didn't have backups.
@TrueCarry The last three volumes should be salvageable as well. Probably caused by an issue we found recently #576 .
But I agree that we should be able to provide an easier way to recover a volume using a replica from the disk. Maybe we can follow up in #469
Related to #1145
Need to address how to tell a bad replica from the good ones when an incident happens on the disk.
Need to cover how to recover from a disk-full condition here.
Content is now ready at https://github.com/yasker/longhorn/wiki/Data-recovery-%5BDRAFT%5D .
@catherineluse Can you help to move them to the website?
I think this issue can be closed because longhorn/website#128 was merged
wow, we don't really expect the user to recover the backup manually without Longhorn :)
Please consider this a high priority. The backups should be held in a folder/file structure if at all possible, or a script should be available that can parse out the structure, requiring NOTHING other than the script and an address to pull the structure from. Requiring Longhorn also requires rebuilding a cluster and standing up some kind of basic pod to hold the PVC. There are a lot of steps to get to that point. Having no access to the actual files in the backups is part of the reason I gave up on Longhorn today. After a cluster crash and loss of all nodes for the Nth time, I'm also dropping Rancher today. So getting my data back is very difficult without Longhorn. Assuming how people need to access their own data is very dangerous. Good luck.
I somehow missed the link in the docs. Thank you so much for responding and also for having done this. I was just on the path to upgrade the storage handler from 0.7.1 -> 1.0.0 when I started having problems with my cluster. How far back is the recovery script compatible with the backup format?