Comments (29)

derekbit commented on May 28, 2024

Do you have any taints or tolerations on these nodes?
A support bundle is appreciated.

from longhorn.

cometta commented on May 28, 2024

Yes, this new nodepool has a taint set: nvidia.com/gpu=present with effect NoSchedule.

It's running on GCP Kubernetes (GKE).

  • Longhorn version: CSI Driver driver.longhorn.io CSI Git commit edf23eddc3b6c307031eaa770c3d312c963f25a5 version v1.6.0
  • Kubernetes version: 1.27.8-gke.1067004
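If it helps to double-check, the taints on the new nodepool's nodes can be listed with kubectl (the node name below is a placeholder):

```shell
# List every node with its taint keys; the GPU nodepool nodes should show
# nvidia.com/gpu among them.
kubectl get nodes -o custom-columns='NAME:.metadata.name,TAINTS:.spec.taints[*].key'

# Inspect one node's taints in full (key, value, and effect):
kubectl describe node <gpu-node-name> | grep -A 3 'Taints:'
```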

cometta commented on May 28, 2024

For the old nodepool, which works with Longhorn, I am able to see

csi.volume.kubernetes.io/nodeid: {"driver.longhorn.io" ...}

but for the new nodepool, I can't see this annotation.
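For anyone else debugging this, the annotation and the registered CSI drivers can be read directly (node name is a placeholder; kubelet adds the annotation only after the Longhorn CSI node plugin registers on that node):

```shell
# Show the CSI node-id annotation; it is absent if no CSI plugin has
# registered on the node yet.
kubectl get node <node-name> -o jsonpath='{.metadata.annotations.csi\.volume\.kubernetes\.io/nodeid}'

# Equivalent view from the CSINode object: the list of registered drivers.
kubectl get csinode <node-name> -o jsonpath='{.spec.drivers[*].name}'
```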

cometta commented on May 28, 2024

I tried uninstalling and reinstalling Longhorn; the issue still persists.

derekbit commented on May 28, 2024

Can you send us a support bundle?

cometta commented on May 28, 2024

@derekbit, done. I generated the support bundle zip and uploaded it to https://drive.google.com/file/d/1dLexKrdwuQvwKvCTu7flzeH1Ndt2mjL7/view?usp=drive_link. I will delete it after you download it.

derekbit commented on May 28, 2024

OK. The node gke-dataops-gpu-cluster-gpu-np0-bf7ed2c8-9pxd is tainted with nvidia.com/gpu=present:NoSchedule, but the Longhorn components cannot tolerate it.
Please check the document https://longhorn.io/docs/1.6.0/advanced-resources/deploy/taint-toleration/ and set appropriate tolerations for the components.

cometta commented on May 28, 2024

@derekbit, I followed the guide and set the tolerations in the Helm values. I can see the Longhorn UI, manager, and driver pods restarted. When I open the UI in a browser, I see the message below:

[screenshot]

[screenshot]

I can't attach the disk.

derekbit commented on May 28, 2024

disks are unavailable -> Can you check the status.diskStatus of the newly added node with kubectl -n longhorn-system get nodes.longhorn.io -o yaml?

cometta commented on May 28, 2024

kubectl -n longhorn-system get nodes.longhorn.io -o yaml
apiVersion: v1
items:
- apiVersion: longhorn.io/v1beta2
  kind: Node
  metadata:
    creationTimestamp: "2024-02-29T00:35:53Z"
    finalizers:
    - longhorn.io
    generation: 1
    name: gke-test-gpu-cluster-cpu-np0-3228fc14-9h19
    namespace: longhorn-system
    resourceVersion: "2223795"
    uid: d774cfdd-1740-4ece-9376-ceec8b39eda3
  spec:
    allowScheduling: true
    disks:
      default-disk-d1568ee2a91a8a33:
        allowScheduling: true
        diskType: filesystem
        evictionRequested: false
        path: /var/lib/longhorn/
        storageReserved: 93598043750
        tags: []
    evictionRequested: false
    instanceManagerCPURequest: 0
    name: gke-test-gpu-cluster-cpu-np0-3228fc14-9h19
    tags: []
  status:
    autoEvicting: false
    conditions:
    - lastProbeTime: ""
      lastTransitionTime: "2024-02-29T00:35:53Z"
      message: Node gke-test-gpu-cluster-cpu-np0-3228fc14-9h19 is ready
      reason: ""
      status: "True"
      type: Ready
    - lastProbeTime: ""
      lastTransitionTime: "2024-02-29T00:35:53Z"
      message: ""
      reason: ""
      status: "True"
      type: Schedulable
    - lastProbeTime: ""
      lastTransitionTime: "2024-02-29T00:35:53Z"
      message: ""
      reason: ""
      status: "True"
      type: MountPropagation
    diskStatus:
      default-disk-d1568ee2a91a8a33:
        conditions:
        - lastProbeTime: ""
          lastTransitionTime: "2024-02-29T02:00:12Z"
          message: 'Disk default-disk-d1568ee2a91a8a33(/var/lib/longhorn/) on node
            gke-test-gpu-cluster-cpu-np0-3228fc14-9h19 is not ready: record diskUUID
            doesn''t match the one on the disk '
          reason: DiskFilesystemChanged
          status: "False"
          type: Ready
        - lastProbeTime: ""
          lastTransitionTime: "2024-02-29T02:00:12Z"
          message: Disk default-disk-d1568ee2a91a8a33 (/var/lib/longhorn/) on the
            node gke-test-gpu-cluster-cpu-np0-3228fc14-9h19 is not ready
          reason: DiskNotReady
          status: "False"
          type: Schedulable
        diskType: filesystem
        diskUUID: 9fbfa3e2-4b8e-41bc-9bff-7df0caebba01
        filesystemType: ext2/ext3
        scheduledReplica: {}
        storageAvailable: 0
        storageMaximum: 0
        storageScheduled: 0
    region: <masked>
    snapshotCheckStatus: {}
    zone: <masked>
kind: List
metadata:
  resourceVersion: ""

derekbit commented on May 28, 2024

Can you try the following?

  • Delete the disk from the node gke-test-gpu-cluster-cpu-np0-3228fc14-9h19
  • Delete /var/lib/longhorn/longhorn-disk.cfg on the node
  • Add the disk /var/lib/longhorn back
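A sketch of the on-node part of these steps, assuming SSH access (the disk delete/re-add itself is done in the Longhorn UI or by editing the nodes.longhorn.io resource):

```shell
# On the affected node: remove the stale disk config so Longhorn generates a
# fresh diskUUID when the disk is added back. Do this only after the disk has
# been removed from the Longhorn node, and only if no replica data on it is
# still needed.
sudo rm /var/lib/longhorn/longhorn-disk.cfg
```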

cometta commented on May 28, 2024

@derekbit, can you provide the commands, or can I do this in the Longhorn UI?

derekbit commented on May 28, 2024

Please check https://longhorn.io/docs/1.6.0/nodes-and-volumes/nodes/multidisk/.
Can you provide a new support bundle? I'd like to check how the error is triggered.

          message: 'Disk default-disk-d1568ee2a91a8a33(/var/lib/longhorn/) on node
            gke-test-gpu-cluster-cpu-np0-3228fc14-9h19 is not ready: record diskUUID
            doesn''t match the one on the disk '

cometta commented on May 28, 2024

@derekbit, I regenerated the support bundle: https://drive.google.com/file/d/16rVGu-7dJRilfgmcAYtRmQNpkvgQkMNP/view?usp=sharing

cometta commented on May 28, 2024

I followed the steps: deleted the disk, deleted the *cfg file, and added a new disk. The node page in the Longhorn UI looks OK, but I am still getting an error on the new node pool; the existing node pool has no issue.

AttachVolume.Attach failed for volume "pvc-5d4bc87d-b1bd-404f-a9fe-cc66fb811300" : CSINode gke-something 0-bf7ed2c8-m2wc does not contain driver driver.longhorn.io

I SSH'd into the node that has this error; I don't see the folder /var/lib/longhorn.
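For reference, this CSINode error can be confirmed from the cluster side (node name is a placeholder):

```shell
# If driver.longhorn.io is missing from this list, the Longhorn CSI plugin
# never registered on the node - typically because its DaemonSet pod could
# not tolerate the node's taint and was never scheduled there.
kubectl get csinode <node-name> -o jsonpath='{.spec.drivers[*].name}'

# Check which Longhorn pods (if any) actually landed on that node:
kubectl -n longhorn-system get pods -o wide --field-selector spec.nodeName=<node-name>
```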

cometta commented on May 28, 2024

Sharing a screenshot of the volume page:

[screenshot]

derekbit commented on May 28, 2024

AttachVolume.Attach failed for volume "pvc-5d4bc87d-b1bd-404f-a9fe-cc66fb811300" : CSINode gke-something 0-bf7ed2c8-m2wc does not contain driver driver.longhorn.io

You haven't set the correct key. Isn't the key nvidia.com/gpu?

cometta commented on May 28, 2024

On the new node that does not work, I can see this taint, set by GCP, not by me:

[screenshot]

On the existing node pool that works with Longhorn, I don't see this key; there is also no GPU.

derekbit commented on May 28, 2024

To avoid confusion, let's focus on the CSINode error on the new node first.
Can you fix the key of the tolerations and follow https://longhorn.io/docs/1.6.0/advanced-resources/deploy/taint-toleration/?

cometta commented on May 28, 2024

I already ran this once, using the command `helm upgrade longhorn longhorn/longhorn --namespace longhorn-system --values ./longhorn_values.yaml`. The first time I ran it, all 3 pods restarted. If I re-run it, the 3 pods are not restarted again.

In the UI pod's YAML, I can see

tolerations:
  - effect: NoSchedule
    key: key
    operator: Equal
    value: value
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300

driver pod

tolerations:
  - effect: NoSchedule
    key: key
    operator: Equal
    value: value
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300

manager pod

tolerations:
  - effect: NoSchedule
    key: key
    operator: Equal
    value: value
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/disk-pressure
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/memory-pressure
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/pid-pressure
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/unschedulable
    operator: Exists

All these pods are running on the existing node, which works with Longhorn; I don't see any Longhorn pods running on the new nodepool.

derekbit commented on May 28, 2024

I mean: why did you set key: key rather than key: nvidia.com/gpu?

cometta commented on May 28, 2024

Are you referring to this? It is auto-set by GCP. I tried deleting it, but GCP puts it back:

[screenshot]

derekbit commented on May 28, 2024

You can follow the steps in https://longhorn.io/docs/1.6.0/advanced-resources/deploy/taint-toleration/.
You need to update the tolerations in longhorn_values.yaml, and the key should be key: nvidia.com/gpu rather than key: key.
Then, run helm upgrade longhorn longhorn/longhorn --namespace longhorn-system --values ./longhorn_values.yaml.
If you still run into the issue, please provide the content of longhorn_values.yaml.

cometta commented on May 28, 2024

@derekbit, I tweaked the longhorn_values.yaml; can you help me check it before I run it?

longhornManager:
  tolerations:
  - key: "nvidia.com/gpu=present"
    operator: "Equal"
    value: "value"
    effect: "NoSchedule"
longhornDriver:
  tolerations:
  - key: "nvidia.com/gpu=present"
    operator: "Equal"
    value: "value"
    effect: "NoSchedule"
longhornUI:
  tolerations:
  - key: "nvidia.com/gpu=present"
    operator: "Equal"
    value: "value"
    effect: "NoSchedule"

derekbit commented on May 28, 2024

Sorry, it should be key "nvidia.com/gpu" with value "present".
Let's try it:

longhornManager:
  tolerations:
  - key: "nvidia.com/gpu"
    operator: "Equal"
    value: "present"
    effect: "NoSchedule"
longhornDriver:
  tolerations:
  - key: "nvidia.com/gpu"
    operator: "Equal"
    value: "present"
    effect: "NoSchedule"
longhornUI:
  tolerations:
  - key: "nvidia.com/gpu"
    operator: "Equal"
    value: "present"
    effect: "NoSchedule"

cometta commented on May 28, 2024

I re-ran the helm command that you shared, and I can see the Longhorn pods restarted. Then I test-ran a pod on the new node pool:

 Warning  FailedMount         9s                    kubelet                  Unable to attach or mount volumes: unmounted volumes=[first-pvc secondpvc], unattached volumes=[first-pvc secondpvc], failed to process volumes=[]: timed out waiting for the condition
 Warning  FailedAttachVolume  0s (x9 over 2m12s)    attachdetach-controller  AttachVolume.Attach failed for volume "pvc-something" : CSINode gke-something-np0 -bf7ed2c8-554m does not contain driver driver.longhorn.io

 Warning  FailedAttachVolume  0s (x9 over 2m12s)    attachdetach-controller  AttachVolume.Attach failed for volume "pvc-87302005-f605-4ae9-81d0-5d6ad5da7630" : CSINode gke-something-bf7ed2c8-554m does not contain driver driver.longhorn.io

derekbit commented on May 28, 2024

A support bundle is appreciated. BTW, you can ping me on the Rancher or CNCF Slack channel to speed up the process.

derekbit commented on May 28, 2024

@cometta
BTW, have you executed step 2 in https://longhorn.io/docs/1.6.0/advanced-resources/deploy/taint-toleration/?

defaultSettings:
  taintToleration: "nvidia.com/gpu=present:NoSchedule"
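Putting the pieces from this thread together, a longhorn_values.yaml covering both the system-managed components (via defaultSettings.taintToleration) and the user-deployed components might look like this (a sketch; verify the value keys against the Longhorn chart version in use):

```yaml
defaultSettings:
  # Tolerations for system-managed components (instance managers,
  # CSI deployments, etc.), as in step 2 of the taint-toleration doc.
  taintToleration: "nvidia.com/gpu=present:NoSchedule"
longhornManager:
  tolerations:
  - key: "nvidia.com/gpu"
    operator: "Equal"
    value: "present"
    effect: "NoSchedule"
longhornDriver:
  tolerations:
  - key: "nvidia.com/gpu"
    operator: "Equal"
    value: "present"
    effect: "NoSchedule"
longhornUI:
  tolerations:
  - key: "nvidia.com/gpu"
    operator: "Equal"
    value: "present"
    effect: "NoSchedule"
```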

cometta commented on May 28, 2024

Issue resolved. Thank you, @derekbit.

In summary:

  1. Keep all existing PVCs and PVs; there is no need to delete them. Detach all apps that use the PVCs/PVs by uninstalling them.
  2. Add the value in the Longhorn UI -> Setting -> Kubernetes Taint Toleration -> nvidia.com/gpu=present:NoSchedule, then save; force the Longhorn pods to restart.
  3. Run the pod on the new node pool again.
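To verify a fix like this, one can check that Longhorn pods now schedule onto the tainted nodepool and that the CSI driver registered there (node name is a placeholder):

```shell
# Longhorn pods should now appear on the GPU node...
kubectl -n longhorn-system get pods -o wide --field-selector spec.nodeName=<gpu-node-name>

# ...and driver.longhorn.io should show up in its CSINode entry.
kubectl get csinode <gpu-node-name> -o jsonpath='{.spec.drivers[*].name}'
```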
