Comments (8)

james-munson commented on June 25, 2024

logs/longhorn-system/csi-attacher-57c5fd5bdf-x8wmf/csi-attacher.log

2024-04-23T19:04:17.278448755+10:00 I0423 09:04:17.278284       1 csi_handler.go:234] Error processing "csi-8ce5e5e7551c2951b0826ae7383457b98a30e0a8251083368f71332205f7d9d6": failed to attach: rpc error: code = Internal desc = volume bazarr failed to attach to node k3s-worker-1 with attachmentID csi-8ce5e5e7551c2951b0826ae7383457b98a30e0a8251083368f71332205f7d9d6: Waiting for volume share to be available

And also in yamls/namespaced/longhorn-system/longhorn.io/v1beta2/volumeattachments.yaml,

    spec:
      attachmentTickets:
        csi-8ce5e5e7551c2951b0826ae7383457b98a30e0a8251083368f71332205f7d9d6:
          generation: 0
          id: csi-8ce5e5e7551c2951b0826ae7383457b98a30e0a8251083368f71332205f7d9d6
          nodeID: k3s-worker-1
          parameters:
            disableFrontend: "false"
            lastAttachedBy: "null"
          type: csi-attacher
      volume: bazarr
    status:
      attachmentTicketStatuses:
        csi-8ce5e5e7551c2951b0826ae7383457b98a30e0a8251083368f71332205f7d9d6:
          conditions:
            - lastProbeTime: "null"
              lastTransitionTime: "2024-04-22T22:09:38Z"
              message: Waiting for volume share to be available
              reason: "null"
              status: "False"
              type: Satisfied
          generation: 0
          id: csi-8ce5e5e7551c2951b0826ae7383457b98a30e0a8251083368f71332205f7d9d6
          satisfied: false

That comes from
https://github.com/longhorn/longhorn-manager/blob/2c27d58245028e2475b301ab14021d51a9ef73e1/controller/volume_attachment_controller.go#L741

and therefore from
https://github.com/longhorn/longhorn-manager/blob/2c27d58245028e2475b301ab14021d51a9ef73e1/controller/volume_attachment_controller.go#L914-L918
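
In other words, the attachment ticket stays unsatisfied while the volume's ShareManager is in any state other than running. On a live cluster, the share manager states can be listed straight from the CRs (a sketch, using the same fields that appear in sharemanagers.yaml):

    kubectl -n longhorn-system get sharemanagers.longhorn.io \
      -o custom-columns=NAME:.metadata.name,STATE:.status.state,OWNER:.status.ownerID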

And indeed its share-manager is not running; it is stopping, as are all but two of them.

yamls/namespaced/longhorn-system/longhorn.io/v1beta2/sharemanagers.yaml

      name: bazarr
      namespace: longhorn-system
      ownerReferences:
        - apiVersion: longhorn.io/v1beta2
          kind: Volume
          name: bazarr
          uid: e214dc42-a6e8-41e8-8d27-1971c32fddaf
      resourceVersion: "247397410"
      uid: 13836141-afbe-4edd-9881-6665dfb2fb9e
    spec:
      image: longhornio/longhorn-share-manager:v1.6.1
    status:
      endpoint: "null"
      ownerID: k3s-worker-1
      state: stopping

Looking into why that would be.
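
A couple of things to check next (a sketch; share-manager pods are normally named share-manager-<volume>, so adjust if yours differ):

    # Does the share-manager pod for the volume still exist, and on which node?
    kubectl -n longhorn-system get pod share-manager-bazarr -o wide

    # Its logs, and the longhorn-manager logs, may say why it is stopping
    kubectl -n longhorn-system logs share-manager-bazarr
    kubectl -n longhorn-system logs -l app=longhorn-manager --tail=-1 | grep -i "share manager"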


james-munson commented on June 25, 2024

For the replica placement question, try setting replica-disk-soft-anti-affinity to false. That would mean "hard" anti-affinity, so that a replica would simply remain unscheduled rather than reluctantly sharing a disk when that is the only way to schedule it.

It's interesting, because in the support bundle, replica-soft-anti-affinity (which is really "replica node soft anti-affinity") is already false, so they should not have shared a node, much less a disk. But I'm not entirely sure of the order of priority when calculating the scheduling.
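
Both of those settings live as Setting CRs in the longhorn-system namespace, so they can be inspected and flipped with kubectl as well as in the UI (a sketch; the value field is a string):

    kubectl -n longhorn-system get settings.longhorn.io \
      replica-soft-anti-affinity replica-disk-soft-anti-affinity

    # "false" means hard anti-affinity: leave the replica unscheduled rather than share a disk
    kubectl -n longhorn-system patch settings.longhorn.io replica-disk-soft-anti-affinity \
      --type=merge -p '{"value":"false"}'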

For VM partition, I don't know. That's probably worth a separate ticket (or question, actually).


james-munson commented on June 25, 2024

The support bundle is empty (0 bytes). Perhaps mail it to [email protected].


ZanderPittari commented on June 25, 2024

I think the file is too large; I'll send a link to that email now.


james-munson commented on June 25, 2024

Some other things. There is a lot of logging about things like

2024-04-23T08:08:36.437737012+10:00 time="2024-04-22T22:08:36Z" level=error msg="Failed to sync Longhorn replica" func=controller.handleReconcileErrorLogging file="utils.go:67" Replica=longhorn-system/prowlarr-r-e2cd97b3 controller=longhorn-replica error="failed to sync replica for longhorn-system/prowlarr-r-e2cd97b3: instance prowlarr-r-e2cd97b3 NodeID k3s-worker-2 is not the same as the instance manager instance-manager-5b18aaee3512cb8713b524f2f78399b4 NodeID k3s-worker-3"

And when I look at replicas.yaml, there are only a handful that even have a storageIP currently:

grep storageIP replicas.yaml | sort | uniq -c
      4       storageIP: 10.42.3.71
      2       storageIP: 10.42.5.177
     26       storageIP: "null"
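
The same tally can be pulled from a live cluster rather than from the bundle (a sketch, assuming the default longhorn-system namespace):

    kubectl -n longhorn-system get replicas.longhorn.io \
      -o custom-columns=NAME:.metadata.name,NODE:.spec.nodeID,STORAGEIP:.status.storageIP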

The engines.yaml, though, shows other addresses for replicas in its replicaAddressMap entries, for instance:

      replicaAddressMap:
        bazarr-r-3aa7d762: 10.42.4.16:10030
        bazarr-r-501487c6: 10.42.5.167:10002
        bazarr-r-cf42f74a: 10.42.4.16:10000

This is a good time to mention that Longhorn identifies replicas by the IP:port they are assigned to listen on for I/O commands. (This one is also interesting because two of the three replicas are on the same node, which is not really optimal.)
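
To line the two views up on a live cluster, the address maps can be dumped straight from the Engine CRs (a sketch):

    kubectl -n longhorn-system get engines.longhorn.io \
      -o jsonpath='{range .items[*]}{.metadata.name}{": "}{.spec.replicaAddressMap}{"\n"}{end}'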

So, a couple of questions:

  1. Is it possible that, in the reboot and DNS shift, some nodes were assigned new IPs?

  2. Can you give more detail about how the replicas were moved from one node to another?


ZanderPittari commented on June 25, 2024

I'm not entirely certain whether they were given new IPs, but I doubt it, since I didn't change my DHCP server, only the DNS.

However, the strangest thing happened just before: I evicted all pods from the k3s-worker-1 VM to increase the root partition to a larger size, and for some reason all the deployments besides radarr and vaultwarden are working now. Not really sure why.

Also, how do I force Longhorn to have only one replica per node? It looks like the replicas jump around and sometimes end up on the same node.


ZanderPittari commented on June 25, 2024

Also, another quick question (not sure if I should make another issue for this or not): would you know how to increase the partition of a VM that has Longhorn running on it? I'm getting this error, and every time I try to expand the main partition the VM doesn't boot up again:

GPT PMBR size mismatch (209715199 != 419430399) will be corrected by write.
The backup GPT table is not on the end of the device. This problem will be corrected by write.
This disk is currently in use - repartitioning is probably a bad idea.
It's recommended to umount all file systems, and swapoff all swap
partitions on this disk.
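
For what it's worth, that mismatch just means the virtual disk was grown while the backup GPT header stayed at the old end of the device. The usual sequence, as a general sketch only (assuming the root disk is /dev/sda, the root partition is /dev/sda1, and an ext4 filesystem; snapshot the VM first and adjust device names for your layout):

    # Relocate the backup GPT structures to the new end of the disk
    sgdisk -e /dev/sda

    # Grow partition 1 into the freed space, then grow the filesystem
    growpart /dev/sda 1
    resize2fs /dev/sda1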


ZanderPittari commented on June 25, 2024

Yeah, I also have Default Data Locality set to best-effort, which I assumed would help, but nothing changed. I can see the issue with radarr now: the replicas are on nodes 1 and 3, but it's trying to create the pod on node 2. Got to figure it out somehow. Thanks for your help.

I'll make another post for the VM partition, cheers.

