
Comments (20)

manwegit commented on May 13, 2024

Well, here at least is a code snippet to recombine a backup without Longhorn:

Run this inside the backup NFS mount, in the volume's directory:
e.g. /mnt/nfs-backup/backupstore/volumes/ed/f4/pvc-6728f786-51d9-11e8-beb0-0e701ed55039

# Pre-allocate a sparse image of the full volume size
truncate -s "$(jq -r '.Size' volume.cfg)" restore.img

# For each block in the backup config, emit a dd command that gunzips
# the block and writes it at its byte offset in the image
cat backups/backup_backup-PICK_ONE.cfg |jq -r '.Blocks| .[]| [.Offset, .BlockChecksum]|@tsv' |perl -anle 'if ($F[1]=~/^(..)(..)/) {print "dd conv=notrunc if=<(zcat blocks/$1/$2/".$F[1].".blk ) of=restore.img seek=$F[0] oflag=seek_bytes" }' > /tmp/restore-image.sh

# Run with bash: the generated dd commands use process substitution
bash /tmp/restore-image.sh

from longhorn.

khushboo-rancher commented on May 13, 2024

Created enhancement issues #1523, #1522, #1521


yasker commented on May 13, 2024

@jhughes2112 You can find out how to restore from backup using https://longhorn.io/docs/1.0.0/advanced-resources/data-recovery/recover-without-system/ . BTW, I think you've misunderstood my comment. We don't expect users to write scripts themselves to recover; that's what I meant by manually. Longhorn already provides a mechanism to recover without Kubernetes. The effort is in fact tracked by this PR.


yasker commented on May 13, 2024

@jhughes2112 Thanks for your feedback. In particular, I think it makes sense to change the script into a docker run instead of a YAML file. As you said, you might lose the whole Kubernetes cluster. We did that for single-replica recovery but not for restoring from backups.

To detect the volumes more intelligently, it would require a bit more work though.

@khushboo-rancher can you help to check https://longhorn.io/docs/1.0.0/advanced-resources/data-recovery/recover-without-system/ and the above comment, and file issues for enhancement accordingly? Thanks.


manwegit commented on May 13, 2024

Hmm, I don't know why it didn't show the error, so I edited the comment above.

I did not see that salvage option. The volumes say they are in the Attaching state (and stay there), and expanding host -> replica shows "ERROR" on all replicas.

Restoring from backup as a new volume works, at least in the sense that on the host where it is attached (/dev/longhorn/... is there) I can mount the volume with mount (on the host).

But what I'm trying to do is keep the existing PVC and restore the PV from backup.

  1. Delete non functioning volume - pvc-123
  2. Restore from backup, same name: pvc-123
  3. When the pod tries to mount it, I can see the JSON missing-size message in kubectl describe
  4. The restored volume can be mounted on the host with mount command.

What I really would like to have is (in this order):

  1. document a way (with minimal example) to restore a volume from backup without recreating deployment and pvc
  2. document (if there is) a way to monitor and do the ui stuff from either inside a "manager" pod or with API
  3. can the replica count be changed? Can replicas be moved from one host/node to another (removing, adding nodes)?
  4. some way to recover faulty volumes (you mentioned the salvage option)
  5. documentation on how on-node replica volumes are constructed. Are they just sparse image files that could be mounted? QCOW format?
  6. offline scripts, or enough design documentation to do those myself (like my backup restore snippet above. Enough was said on the Rancher blog to figure out that the 2Mi chunks/blocks just need to be restored to the correct offsets.)


yasker commented on May 13, 2024

If the volume stays in the Attaching state with all the replicas in ERROR, it's a bug. The volume shouldn't be in the Attaching state in that case. Let me check if I can reproduce it.

The salvage option will be shown if the volume is in Faulted state.

The PVC is automatically created by k8s. I think k8s somehow wasn't passing the size option to the flexvolume driver, resulting in the error. I will check this as well.

Some quick answers below:

  1. I need to check why your steps don't work.
  2. For now, check the API endpoint of the longhorn-backend. There is a minimal UI for API, though it's more prone to break if options are wrong. Also https://github.com/rancher/longhorn-tests/tree/master/manager/integration has examples on how to interact with manager API.
  3. Replica count cannot be changed currently. We can work on that later.
  4. When the volume is in the Faulted state (all the replicas are in ERROR and the volume is detached), the rightmost volume options dropdown for that volume in the volume list view will show the Salvage option.
  5. They're layered sparse files. Almost none of them can be mounted directly, since they normally don't contain the full data for the disk. Check the Replica Operations section of https://rancher.com/microservices-block-storage/ for details.
  6. For the restoration, I'm not sure why you chose to use your own script instead of using Longhorn? Or did you just want a way to export the image? We can add a feature to support that.
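Regarding point 2, a minimal sketch of reaching the manager API from a workstation. This is an assumption-heavy example: the service name, namespace, port 9500, and the /v1/volumes endpoint path follow what a default longhorn-system install exposed at the time, but verify them against your deployment.

```shell
# Forward the longhorn-backend service to localhost (namespace, service
# name, and port are assumptions based on a default install).
kubectl -n longhorn-system port-forward svc/longhorn-backend 9500:9500 &
sleep 2

# List volume names via the REST API (endpoint path is an assumption;
# the API follows the Rancher API style, so .data holds the collection).
curl -s http://localhost:9500/v1/volumes | jq -r '.data[].name'
```

The same endpoints are what the minimal API UI mentioned above is built on, so anything visible there should be reachable this way.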


yasker commented on May 13, 2024

Need to cover how to recover from a single replica.


jhughes2112 commented on May 13, 2024

For documentation purposes, I would mention that getting a valid restore out of the script is a little time-consuming and hit-or-miss, guessing what to pass in. Seeing as I no longer have the cluster the data came from, I would suggest, for your future users' benefit, several improvements.

  1. Change the yaml from a pod to a simple docker execution line. I had to create a cluster just so I could jam the yaml into it, but it's really not necessary, is it?
  2. If you do keep it in yaml form, change the unnecessary env pulling secrets to a more straightforward name/value pair. There's a really good chance creating a secret properly for some folks will simply derail them unnecessarily, trying to debug that part when there's plenty of other things to debug.
  3. Make the script capable of listing the valid volumes it would recognize, and the backups that are legal to insert into the config. You already know the structure of the backup S3 bucket and can easily pull out the volume.cfg files. I lost everything, so I had to spend time looking through S3 and guessing at what your script wanted as inputs.
  4. Maybe mention, on the page you linked above, how the directory structure of the bucket storage maps to the inputs. There's a file called backup_backup-ab4a140a6d264cb4.cfg, and the input you're looking for is actually a piece of that filename; it's also stored in volume.cfg, so you could have pulled that out and told me what would be accepted as valid backups in the case of an error.

For future reference and documentation's sake, the path in S3 that I was restoring from was this:

Amazon S3/MYBUCKET/backupstore/volumes/34/4e/cdn-data

I didn't find a good nearly-working example for pulling a restore down from AWS, so here's one, with a couple of redactions:

apiVersion: v1
kind: Pod
metadata:
  name: restore-to-file
  namespace: default
spec:
  nodeName: NODENAME-TO-SAVE-TO
  containers:
  - name: restore-to-file
    command:
    # set restore-to-file arguments here
    - /bin/sh
    - -c
    - longhorn backup restore-to-file
      's3://MYBUCKET@us-east-1/?backup=backup-ab4a140a6d264cb4&volume=cdn-data'
      --output-file '/tmp/restore/cdn-data.qcow2'
      --output-format qcow2
    # the version of longhorn engine should be v0.4.1 or higher
    image: rancher/longhorn-engine:v0.4.1
    imagePullPolicy: IfNotPresent
    securityContext:
      privileged: true
    volumeMounts:
    - name: disk-directory
      mountPath: /tmp/restore  # the argument <output-file> should be in this directory
    env:
    # set Backup Target Credential Secret here.
    - name: AWS_ACCESS_KEY_ID
      value: myawsaccesskeyid19long
    - name: AWS_SECRET_ACCESS_KEY
      value: myawssecretkey41characterslong
    - name: AWS_ENDPOINTS
      value: http://s3.us-east-1.amazonaws.com  # fix the region code if you need to
  volumes:
    # the output file can be found on this host path
    - name: disk-directory
      hostPath:
        path: /tmp/restore
  restartPolicy: Never

There's also no mention of how to actually get files out of a qcow2, or any links on what that is. A few quick links on converting it to some useful format would save a bunch of downstream grumbling. It's feedback... take it or leave it.
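For what it's worth, here is one way to get files out of the qcow2 using the qemu-utils tools. The filenames follow the pod spec above; mounting needs root, and whether the image contains a bare filesystem or a partition table depends on how the volume was originally formatted.

```shell
# Convert the qcow2 produced by restore-to-file into a raw image
# (requires the qemu-utils package).
qemu-img convert -f qcow2 -O raw /tmp/restore/cdn-data.qcow2 /tmp/restore/cdn-data.img

# Mount it read-only to copy files out. If the image contains a
# partition table rather than a bare filesystem, attach it with
# `losetup -fP` and mount the partition device instead.
mkdir -p /mnt/restore
mount -o loop,ro /tmp/restore/cdn-data.img /mnt/restore
```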

Best regards.


manwegit commented on May 13, 2024

Simulating a hard crash and trying to restore the volume back from backups (with the same name):

I can mount the restored volume on the host, but pod creation fails:

MountVolume.SetUp failed for volume "pvc-81cb539a-51d9-11e8-beb0-0e701ed55039" : mount command failed, status: Failure, reason: create volume fail: fail to parse size error parsing size 'null': quantities must match the regular expression '^([+-]?[0-9.]+)([eEinumkKMGTP][-+]?[0-9])$'


yasker commented on May 13, 2024

@manwegit wow, we don't really expect the user to recover the backup manually without Longhorn :)

With a backup, you can create a new Longhorn volume out of that backup. Add the backupstore, then select restore.

We also have a built-in way to recover the volume, but we haven't documented it. When a volume is in FAULT state, one additional action item called Salvage will appear in the UI in the volume list page. It will ask the user to choose which error replicas to use as the potentially good candidates. Longhorn will select one of the replicas and reset its state, then the user can at least recover the data to the point that replica recorded.

The script looks like it should work. What's the error message? (Just curious.)


manwegit commented on May 13, 2024

My steps for cluster crash testing:

  1. Ubuntu 16.04 hosts, with docker and required packages installed. Dual network, with k8s communication on the private net
  2. Deploy k8s with the rke tool, three nodes (and with five)
  3. Install Rancher 2.0 and use the catalog Longhorn (changing the volume plugin path)
  4. I added two binds to kubelet (I think): /dev and /var/lib/ something
  5. Use catalog redis or something that requires a PVC (use the longhorn class)
  6. Take some snapshots and backups (external NFS)
  7. Crash everything by running reboot simultaneously on all k8s nodes
  8. Volumes stayed in the Attaching state and replicas in ERROR.

As to point 6 from previous comments: I'm thinking of offline data recovery for many different purposes. Let's say hacking, DB corruption, etc., where you create an image from backed-up data without a k8s cluster or network access. Those tools and ways to access data are hopefully never needed, but I've been in this business too long not to have options for data recovery/forensics.

It would be really nice, if one only has snapshots, to be able to copy/transfer the replica folder to another Longhorn cluster, place it among e.g. the backups, and access/use it like a backup.

An additional reason, beyond backups, would be a cluster crash where a new cluster is being built but the backup is too old (the replica dir with snapshots should be a more recent recovery point).


TrueCarry commented on May 13, 2024

Hello. I spent the last 2 days recovering my Longhorn volumes, so I think I can provide some info. Maybe someone else will find it helpful.

What happened:
We needed to turn off one machine to replace a hard drive which was reporting signs of failure. I added a temporary machine to share the load and turned off the machine in question. For some reason I couldn't evacuate the machine; it didn't stop any service. So I removed it from the cluster completely. After the disk replacement was done I added the machine back, and after checking that everything was OK I evacuated the temporary machine (without problems), deleted it from the UI, and then deleted that machine from the hosting provider. The machine got stuck in the Removing state in Rancher and I couldn't do anything. I found similar problems in the Rancher repository issues and restored the cluster from a snapshot. That's when the problems with Longhorn began. For some reason half (4) of the volumes became faulty.

Recovery:
A few had different paths on disk and in the UI. One was salvageable and recovery was easy. The last three took some time to figure out how to recover. They were present on disk, but Longhorn marked them as failed and I couldn't find a way to fix it. So the solution was to create a new volume of similar size, stop it, replace its content with what was in the failed volume, and start it.

It would be very helpful if you could add a way to manually add a replica from disk. That would've reduced the recovery time to a few hours.

Also, if backups had been enabled from the start it wouldn't have been that big of a problem, but you can't back up everything every few minutes, so lots of data could be lost. It was a debug cluster for me and data loss was acceptable, so I didn't have backups.
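The workaround described above (new volume, stop it, swap in the old data) can be sketched roughly like this. Purely illustrative: the data path and replica directory names are assumptions, the on-disk file layout may differ between Longhorn versions, and both volumes must be detached while doing this.

```shell
# Assumed default data path; replica directory names are hypothetical.
OLD_REPLICA=/var/lib/longhorn/replicas/pvc-failed-xxxx
NEW_REPLICA=/var/lib/longhorn/replicas/pvc-fresh-yyyy

# With the new volume stopped/detached, drop its freshly created data
# files and copy the data files from the failed replica in their place.
rm -f "$NEW_REPLICA"/volume-head-*.img*
cp -a "$OLD_REPLICA"/. "$NEW_REPLICA"/
```

After starting the new volume, Longhorn should serve the copied data; the sizes of the two volumes need to match for this to have a chance of working.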


yasker commented on May 13, 2024

@TrueCarry The last three volumes should be salvageable as well. Probably caused by an issue we found recently #576 .

But I agree that we should be able to provide an easier way to recover a volume using a replica from the disk. Maybe we can follow up in #469


yasker commented on May 13, 2024

Related to #1145


yasker commented on May 13, 2024

Need to address how to tell a bad replica from the good ones when an incident happens on the disk.


yasker commented on May 13, 2024

Need to cover how to recover from disk full here.


yasker commented on May 13, 2024

Content is now ready at https://github.com/yasker/longhorn/wiki/Data-recovery-%5BDRAFT%5D .

@catherineluse Can you help to move them to the website?


catherineluse commented on May 13, 2024

I think this issue can be closed because longhorn/website#128 was merged


jhughes2112 commented on May 13, 2024

wow, we don't really expect the user to recover the backup manually without Longhorn :)

Please consider this a high priority. The backups should be held in a folder/file structure if at all possible, or a script should be available that can parse out the structure, requiring NOTHING other than the script and an address to pull the structure from. Requiring Longhorn also requires rebuilding a cluster and standing up some kind of basic pod to hold the PVC. There are a lot of steps to get to that point. Having no access to the actual files in the backups is part of the reason I gave up on Longhorn today. After a cluster crash and loss of all nodes for the Nth time, I'm also dropping Rancher today. So getting my data back is very difficult without Longhorn. Assuming how people need to access their own data is very dangerous. Good luck.


jhughes2112 commented on May 13, 2024

I somehow missed the link in the docs. Thank you so much for responding and also for having done this. I was just on the path to upgrade the storage handler from 0.7.1 -> 1.0.0 when I started having problems with my cluster. How far back is the recovery script compatible with the backup format?

