Comments (8)

james-munson commented on June 25, 2024

logs/longhorn-system/csi-attacher-57c5fd5bdf-x8wmf/csi-attacher.log

2024-04-23T19:04:17.278448755+10:00 I0423 09:04:17.278284       1 csi_handler.go:234] Error processing "csi-8ce5e5e7551c2951b0826ae7383457b98a30e0a8251083368f71332205f7d9d6": failed to attach: rpc error: code = Internal desc = volume bazarr failed to attach to node k3s-worker-1 with attachmentID csi-8ce5e5e7551c2951b0826ae7383457b98a30e0a8251083368f71332205f7d9d6: Waiting for volume share to be available

And also in yamls/namespaced/longhorn-system/longhorn.io/v1beta2/volumeattachments.yaml,

    spec:
      attachmentTickets:
        csi-8ce5e5e7551c2951b0826ae7383457b98a30e0a8251083368f71332205f7d9d6:
          generation: 0
          id: csi-8ce5e5e7551c2951b0826ae7383457b98a30e0a8251083368f71332205f7d9d6
          nodeID: k3s-worker-1
          parameters:
            disableFrontend: "false"
            lastAttachedBy: "null"
          type: csi-attacher
      volume: bazarr
    status:
      attachmentTicketStatuses:
        csi-8ce5e5e7551c2951b0826ae7383457b98a30e0a8251083368f71332205f7d9d6:
          conditions:
            - lastProbeTime: "null"
              lastTransitionTime: "2024-04-22T22:09:38Z"
              message: Waiting for volume share to be available
              reason: "null"
              status: "False"
              type: Satisfied
          generation: 0
          id: csi-8ce5e5e7551c2951b0826ae7383457b98a30e0a8251083368f71332205f7d9d6
          satisfied: false

That comes from
https://github.com/longhorn/longhorn-manager/blob/2c27d58245028e2475b301ab14021d51a9ef73e1/controller/volume_attachment_controller.go#L741

and therefore from
https://github.com/longhorn/longhorn-manager/blob/2c27d58245028e2475b301ab14021d51a9ef73e1/controller/volume_attachment_controller.go#L914-L918
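
In other words, the attachment ticket stays unsatisfied while the volume's ShareManager is in any state other than running. On a live cluster, the share manager states can be listed straight from the CRs (a sketch, using the same fields that appear in sharemanagers.yaml):

    kubectl -n longhorn-system get sharemanagers.longhorn.io \
      -o custom-columns=NAME:.metadata.name,STATE:.status.state,OWNER:.status.ownerID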

And indeed its share-manager is not running; it is stopping, as are all but two of them.

yamls/namespaced/longhorn-system/longhorn.io/v1beta2/sharemanagers.yaml

      name: bazarr
      namespace: longhorn-system
      ownerReferences:
        - apiVersion: longhorn.io/v1beta2
          kind: Volume
          name: bazarr
          uid: e214dc42-a6e8-41e8-8d27-1971c32fddaf
      resourceVersion: "247397410"
      uid: 13836141-afbe-4edd-9881-6665dfb2fb9e
    spec:
      image: longhornio/longhorn-share-manager:v1.6.1
    status:
      endpoint: "null"
      ownerID: k3s-worker-1
      state: stopping

Looking into why that would be.
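
A couple of things to check next (a sketch; share-manager pods are normally named share-manager-<volume>, so adjust if yours differ):

    # Does the share-manager pod for the volume still exist, and on which node?
    kubectl -n longhorn-system get pod share-manager-bazarr -o wide

    # Its logs, and the longhorn-manager logs, may say why it is stopping
    kubectl -n longhorn-system logs share-manager-bazarr
    kubectl -n longhorn-system logs -l app=longhorn-manager --tail=-1 | grep -i "share manager"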


james-munson commented on June 25, 2024

For the replica placement question, try setting replica-disk-soft-anti-affinity to false. That would mean "hard" anti-affinity, so that a replica would simply remain unscheduled rather than reluctantly sharing a disk when that is the only way to schedule it.

It's interesting, because in the support bundle, replica-soft-anti-affinity (which is really "replica node soft anti-affinity") is already false, so they should not have shared a node, much less a disk. But I'm not entirely sure of the order of priority when calculating the scheduling.
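
Both of those settings live as Setting CRs in the longhorn-system namespace, so they can be inspected and flipped with kubectl as well as in the UI (a sketch; the value field is a string):

    kubectl -n longhorn-system get settings.longhorn.io \
      replica-soft-anti-affinity replica-disk-soft-anti-affinity

    # "false" means hard anti-affinity: leave the replica unscheduled rather than share a disk
    kubectl -n longhorn-system patch settings.longhorn.io replica-disk-soft-anti-affinity \
      --type=merge -p '{"value":"false"}'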

For VM partition, I don't know. That's probably worth a separate ticket (or question, actually).


james-munson commented on June 25, 2024

The support bundle is empty (0 bytes). Perhaps mail it to [email protected].


ZanderPittari commented on June 25, 2024

I think the file is too large; I'll send a link to that email now.


james-munson commented on June 25, 2024

Some other things. There is a lot of logging about things like

2024-04-23T08:08:36.437737012+10:00 time="2024-04-22T22:08:36Z" level=error msg="Failed to sync Longhorn replica" func=controller.handleReconcileErrorLogging file="utils.go:67" Replica=longhorn-system/prowlarr-r-e2cd97b3 controller=longhorn-replica error="failed to sync replica for longhorn-system/prowlarr-r-e2cd97b3: instance prowlarr-r-e2cd97b3 NodeID k3s-worker-2 is not the same as the instance manager instance-manager-5b18aaee3512cb8713b524f2f78399b4 NodeID k3s-worker-3"

And when I look at replicas.yaml, there are only a handful that even have a storageIP currently:

grep storageIP replicas.yaml | sort | uniq -c
      4       storageIP: 10.42.3.71
      2       storageIP: 10.42.5.177
     26       storageIP: "null"
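
The same tally can be pulled from a live cluster rather than from the bundle (a sketch, assuming the default longhorn-system namespace):

    kubectl -n longhorn-system get replicas.longhorn.io \
      -o custom-columns=NAME:.metadata.name,NODE:.spec.nodeID,STORAGEIP:.status.storageIP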

The engines.yaml, though, shows other addresses for replicas in its replicaAddressMap entries, for instance:

      replicaAddressMap:
        bazarr-r-3aa7d762: 10.42.4.16:10030
        bazarr-r-501487c6: 10.42.5.167:10002
        bazarr-r-cf42f74a: 10.42.4.16:10000

This is a good time to mention that Longhorn identifies replicas by the IP:port they are assigned to listen on for I/O commands. (This one is also interesting because two of the three replicas are on the same node, which is not really optimal.)
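
To line the two views up on a live cluster, the address maps can be dumped straight from the Engine CRs (a sketch):

    kubectl -n longhorn-system get engines.longhorn.io \
      -o jsonpath='{range .items[*]}{.metadata.name}{": "}{.spec.replicaAddressMap}{"\n"}{end}'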

So, a couple of questions:

  1. Is it possible that, in the reboot and DNS shift, some nodes were assigned new IPs?

  2. Can you give more detail about how the replicas were moved from one node to another?


ZanderPittari commented on June 25, 2024

I'm not entirely certain whether they were given new IPs, but I doubt it, since I didn't change my DHCP server, only the DNS.

However, the strangest thing happened just before: I evicted all pods from the k3s-worker-1 VM to increase the root partition to a larger size, and for some reason all the deployments besides radarr and vaultwarden are working now. Not really sure why.

Also, how do I force Longhorn to have only one replica per node? It looks like the replicas jump around and sometimes end up on the same node.


ZanderPittari commented on June 25, 2024

Also, another quick question (not sure if I should make another issue for this or not): would you know how to increase the partition of a VM that has Longhorn running on it? I'm getting this error, and every time I try to expand the main partition the VM doesn't boot up again:

GPT PMBR size mismatch (209715199 != 419430399) will be corrected by write.
The backup GPT table is not on the end of the device. This problem will be corrected by write.
This disk is currently in use - repartitioning is probably a bad idea.
It's recommended to umount all file systems, and swapoff all swap
partitions on this disk.
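
For what it's worth, that mismatch just means the virtual disk was grown while the backup GPT header stayed at the old end of the device. The usual sequence, as a general sketch only (assuming the root disk is /dev/sda, the root partition is /dev/sda1, and an ext4 filesystem; snapshot the VM first and adjust device names for your layout):

    # Relocate the backup GPT structures to the new end of the disk
    sgdisk -e /dev/sda

    # Grow partition 1 into the freed space, then grow the filesystem
    growpart /dev/sda 1
    resize2fs /dev/sda1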


ZanderPittari commented on June 25, 2024

Yeah, I also have Default Data Locality set to best-effort, which I assumed would help, but nothing changed. I can see the issue with radarr now: the replicas are on nodes 1 and 3, but it's trying to create the pod on node 2. Got to figure it out somehow. Thanks for your help.

I'll make another post for the VM partition, cheers.

