Comments (29)

derekbit commented on May 28, 2024

Do you have any taints or tolerations on these nodes?
A support bundle is appreciated.

from longhorn.

cometta commented on May 28, 2024

Yes, this new nodepool has a taint set: nvidia.com/gpu=present with effect NoSchedule.

It's running on GCP Kubernetes (GKE).

  • Longhorn version: CSI Driver driver.longhorn.io CSI Git commit edf23eddc3b6c307031eaa770c3d312c963f25a5 version v1.6.0
  • Kubernetes version: 1.27.8-gke.1067004
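If it helps to double-check, the taints on the new nodepool's nodes can be listed with kubectl (the node name below is a placeholder):

```shell
# List every node with its taint keys; the GPU nodepool nodes should show
# nvidia.com/gpu among them.
kubectl get nodes -o custom-columns='NAME:.metadata.name,TAINTS:.spec.taints[*].key'

# Inspect one node's taints in full (key, value, and effect):
kubectl describe node <gpu-node-name> | grep -A 3 'Taints:'
```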

cometta commented on May 28, 2024

For the old nodepool, which works with Longhorn, I am able to see

csi.volume.kubernetes.io/nodeid: {"driver.longhorn.io" ...}

but for the new nodepool, I can't see this annotation.
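For anyone else debugging this, the annotation and the registered CSI drivers can be read directly (node name is a placeholder; kubelet adds the annotation only after the Longhorn CSI node plugin registers on that node):

```shell
# Show the CSI node-id annotation; it is absent if no CSI plugin has
# registered on the node yet.
kubectl get node <node-name> -o jsonpath='{.metadata.annotations.csi\.volume\.kubernetes\.io/nodeid}'

# Equivalent view from the CSINode object: the list of registered drivers.
kubectl get csinode <node-name> -o jsonpath='{.spec.drivers[*].name}'
```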

cometta commented on May 28, 2024

I tried uninstalling and reinstalling Longhorn; the issue still persists.

derekbit commented on May 28, 2024

Can you send us a support bundle?

cometta commented on May 28, 2024

@derekbit, done. I generated the support bundle zip and uploaded it to https://drive.google.com/file/d/1dLexKrdwuQvwKvCTu7flzeH1Ndt2mjL7/view?usp=drive_link. I will delete it after you download it.

derekbit commented on May 28, 2024

OK. The node gke-dataops-gpu-cluster-gpu-np0-bf7ed2c8-9pxd is tainted with nvidia.com/gpu=present:NoSchedule, but the Longhorn components cannot tolerate it.
Please check the document https://longhorn.io/docs/1.6.0/advanced-resources/deploy/taint-toleration/ and set appropriate tolerations for the components.

cometta commented on May 28, 2024

@derekbit, I followed the guide and set the tolerations in the Helm values. I can see the Longhorn UI, manager, and driver pods restarted. When I open the UI in a browser, I see the message below:

[screenshot]

[screenshot]

I can't attach the disk.

derekbit commented on May 28, 2024

disks are unavailable -> Can you check the status.diskStatus of the newly added node with kubectl -n longhorn-system get nodes.longhorn.io -o yaml?

cometta commented on May 28, 2024

kubectl -n longhorn-system get nodes.longhorn.io -o yaml
apiVersion: v1
items:
- apiVersion: longhorn.io/v1beta2
  kind: Node
  metadata:
    creationTimestamp: "2024-02-29T00:35:53Z"
    finalizers:
    - longhorn.io
    generation: 1
    name: gke-test-gpu-cluster-cpu-np0-3228fc14-9h19
    namespace: longhorn-system
    resourceVersion: "2223795"
    uid: d774cfdd-1740-4ece-9376-ceec8b39eda3
  spec:
    allowScheduling: true
    disks:
      default-disk-d1568ee2a91a8a33:
        allowScheduling: true
        diskType: filesystem
        evictionRequested: false
        path: /var/lib/longhorn/
        storageReserved: 93598043750
        tags: []
    evictionRequested: false
    instanceManagerCPURequest: 0
    name: gke-test-gpu-cluster-cpu-np0-3228fc14-9h19
    tags: []
  status:
    autoEvicting: false
    conditions:
    - lastProbeTime: ""
      lastTransitionTime: "2024-02-29T00:35:53Z"
      message: Node gke-test-gpu-cluster-cpu-np0-3228fc14-9h19 is ready
      reason: ""
      status: "True"
      type: Ready
    - lastProbeTime: ""
      lastTransitionTime: "2024-02-29T00:35:53Z"
      message: ""
      reason: ""
      status: "True"
      type: Schedulable
    - lastProbeTime: ""
      lastTransitionTime: "2024-02-29T00:35:53Z"
      message: ""
      reason: ""
      status: "True"
      type: MountPropagation
    diskStatus:
      default-disk-d1568ee2a91a8a33:
        conditions:
        - lastProbeTime: ""
          lastTransitionTime: "2024-02-29T02:00:12Z"
          message: 'Disk default-disk-d1568ee2a91a8a33(/var/lib/longhorn/) on node
            gke-test-gpu-cluster-cpu-np0-3228fc14-9h19 is not ready: record diskUUID
            doesn''t match the one on the disk '
          reason: DiskFilesystemChanged
          status: "False"
          type: Ready
        - lastProbeTime: ""
          lastTransitionTime: "2024-02-29T02:00:12Z"
          message: Disk default-disk-d1568ee2a91a8a33 (/var/lib/longhorn/) on the
            node gke-test-gpu-cluster-cpu-np0-3228fc14-9h19 is not ready
          reason: DiskNotReady
          status: "False"
          type: Schedulable
        diskType: filesystem
        diskUUID: 9fbfa3e2-4b8e-41bc-9bff-7df0caebba01
        filesystemType: ext2/ext3
        scheduledReplica: {}
        storageAvailable: 0
        storageMaximum: 0
        storageScheduled: 0
    region: <masked>
    snapshotCheckStatus: {}
    zone: <masked>
kind: List
metadata:
  resourceVersion: ""

derekbit commented on May 28, 2024

Can you try the following?

  • Delete the disk from the node gke-test-gpu-cluster-cpu-np0-3228fc14-9h19
  • Delete /var/lib/longhorn/longhorn-disk.cfg on the node
  • Add the disk /var/lib/longhorn back
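A sketch of the on-node part of these steps, assuming SSH access (the disk delete/re-add itself is done in the Longhorn UI or by editing the nodes.longhorn.io resource):

```shell
# On the affected node: remove the stale disk config so Longhorn generates a
# fresh diskUUID when the disk is added back. Do this only after the disk has
# been removed from the Longhorn node, and only if no replica data on it is
# still needed.
sudo rm /var/lib/longhorn/longhorn-disk.cfg
```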

cometta commented on May 28, 2024

@derekbit, can you provide the commands, or can I do this in the Longhorn UI?

derekbit commented on May 28, 2024

Please check https://longhorn.io/docs/1.6.0/nodes-and-volumes/nodes/multidisk/.
Can you provide a new support bundle? I'd like to check how the error is triggered.

          message: 'Disk default-disk-d1568ee2a91a8a33(/var/lib/longhorn/) on node
            gke-test-gpu-cluster-cpu-np0-3228fc14-9h19 is not ready: record diskUUID
            doesn''t match the one on the disk '

cometta commented on May 28, 2024

@derekbit, I regenerated the support bundle: https://drive.google.com/file/d/16rVGu-7dJRilfgmcAYtRmQNpkvgQkMNP/view?usp=sharing

cometta commented on May 28, 2024

I followed the steps: deleted the disk, deleted the *cfg file, and added a new disk. The node page in the Longhorn UI looks OK, but I am still getting an error on the new node pool; the existing node pool has no issue.

AttachVolume.Attach failed for volume "pvc-5d4bc87d-b1bd-404f-a9fe-cc66fb811300" : CSINode gke-something 0-bf7ed2c8-m2wc does not contain driver driver.longhorn.io

I SSH'd into the node that has this error; I don't see the folder /var/lib/longhorn.
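For reference, this CSINode error can be confirmed from the cluster side (node name is a placeholder):

```shell
# If driver.longhorn.io is missing from this list, the Longhorn CSI plugin
# never registered on the node - typically because its DaemonSet pod could
# not tolerate the node's taint and was never scheduled there.
kubectl get csinode <node-name> -o jsonpath='{.spec.drivers[*].name}'

# Check which Longhorn pods (if any) actually landed on that node:
kubectl -n longhorn-system get pods -o wide --field-selector spec.nodeName=<node-name>
```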

cometta commented on May 28, 2024

Sharing a screenshot of the volume page:

[screenshot]

derekbit commented on May 28, 2024

AttachVolume.Attach failed for volume "pvc-5d4bc87d-b1bd-404f-a9fe-cc66fb811300" : CSINode gke-something 0-bf7ed2c8-m2wc does not contain driver driver.longhorn.io

You haven't set the correct key. Isn't the key nvidia.com/gpu?

cometta commented on May 28, 2024

On the new node that does not work, I can see this taint, set by GCP, not by me:

[screenshot]

On the existing node pool that works with Longhorn, I don't see this key; there is also no GPU.

derekbit commented on May 28, 2024

To avoid confusion, let's focus on the CSINode error on the new node first.
Can you fix the key of the tolerations and follow https://longhorn.io/docs/1.6.0/advanced-resources/deploy/taint-toleration/?

cometta commented on May 28, 2024

I already ran this once, using the command `helm upgrade longhorn longhorn/longhorn --namespace longhorn-system --values ./longhorn_values.yaml`. The first time I ran it, all 3 pods restarted. If I re-run it, the 3 pods are not restarted again.

In the UI pod's YAML, I can see

tolerations:
  - effect: NoSchedule
    key: key
    operator: Equal
    value: value
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300

driver pod

tolerations:
  - effect: NoSchedule
    key: key
    operator: Equal
    value: value
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300

manager pod

tolerations:
  - effect: NoSchedule
    key: key
    operator: Equal
    value: value
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/disk-pressure
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/memory-pressure
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/pid-pressure
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/unschedulable
    operator: Exists

All these pods are running on the existing node, which works with Longhorn; I don't see any Longhorn pods running on the new nodepool.

derekbit commented on May 28, 2024

I mean: why did you set key: key rather than key: nvidia.com/gpu?

cometta commented on May 28, 2024

Are you referring to this? It is auto-set by GCP. I tried deleting it, but GCP puts it back:

[screenshot]

derekbit commented on May 28, 2024

You can follow the steps in https://longhorn.io/docs/1.6.0/advanced-resources/deploy/taint-toleration/.
You need to update the tolerations in longhorn_values.yaml, and the key should be key: nvidia.com/gpu rather than key: key.
Then, run helm upgrade longhorn longhorn/longhorn --namespace longhorn-system --values ./longhorn_values.yaml.
If you still run into the issue, please provide the content of longhorn_values.yaml.

cometta commented on May 28, 2024

@derekbit, I tweaked the longhorn_values.yaml; can you help me check it before I run it?

longhornManager:
  tolerations:
  - key: "nvidia.com/gpu=present"
    operator: "Equal"
    value: "value"
    effect: "NoSchedule"
longhornDriver:
  tolerations:
  - key: "nvidia.com/gpu=present"
    operator: "Equal"
    value: "value"
    effect: "NoSchedule"
longhornUI:
  tolerations:
  - key: "nvidia.com/gpu=present"
    operator: "Equal"
    value: "value"
    effect: "NoSchedule"

derekbit commented on May 28, 2024

Sorry, it should be key "nvidia.com/gpu" with value "present".
Let's try it:

longhornManager:
  tolerations:
  - key: "nvidia.com/gpu"
    operator: "Equal"
    value: "present"
    effect: "NoSchedule"
longhornDriver:
  tolerations:
  - key: "nvidia.com/gpu"
    operator: "Equal"
    value: "present"
    effect: "NoSchedule"
longhornUI:
  tolerations:
  - key: "nvidia.com/gpu"
    operator: "Equal"
    value: "present"
    effect: "NoSchedule"

cometta commented on May 28, 2024

I re-ran the helm command that you shared, and I can see the Longhorn pods restarted. Then I test-ran a pod on the new node pool:

 Warning  FailedMount         9s                    kubelet                  Unable to attach or mount volumes: unmounted volumes=[first-pvc secondpvc], unattached volumes=[first-pvc secondpvc], failed to process volumes=[]: timed out waiting for the condition
 Warning  FailedAttachVolume  0s (x9 over 2m12s)    attachdetach-controller  AttachVolume.Attach failed for volume "pvc-something" : CSINode gke-something-np0 -bf7ed2c8-554m does not contain driver driver.longhorn.io

 Warning  FailedAttachVolume  0s (x9 over 2m12s)    attachdetach-controller  AttachVolume.Attach failed for volume "pvc-87302005-f605-4ae9-81d0-5d6ad5da7630" : CSINode gke-something-bf7ed2c8-554m does not contain driver driver.longhorn.io

derekbit commented on May 28, 2024

A support bundle is appreciated. BTW, you can ping me on the Rancher or CNCF Slack channel to speed up the process.

derekbit commented on May 28, 2024

@cometta
BTW, have you executed step 2 in https://longhorn.io/docs/1.6.0/advanced-resources/deploy/taint-toleration/?

defaultSettings:
  taintToleration: "nvidia.com/gpu=present:NoSchedule"
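Putting the pieces from this thread together, a longhorn_values.yaml covering both the system-managed components (via defaultSettings.taintToleration) and the user-deployed components might look like this (a sketch; verify the value keys against the Longhorn chart version in use):

```yaml
defaultSettings:
  # Tolerations for system-managed components (instance managers,
  # CSI deployments, etc.), as in step 2 of the taint-toleration doc.
  taintToleration: "nvidia.com/gpu=present:NoSchedule"
longhornManager:
  tolerations:
  - key: "nvidia.com/gpu"
    operator: "Equal"
    value: "present"
    effect: "NoSchedule"
longhornDriver:
  tolerations:
  - key: "nvidia.com/gpu"
    operator: "Equal"
    value: "present"
    effect: "NoSchedule"
longhornUI:
  tolerations:
  - key: "nvidia.com/gpu"
    operator: "Equal"
    value: "present"
    effect: "NoSchedule"
```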

cometta commented on May 28, 2024

Issue resolved. Thank you, @derekbit.

In summary:

  1. Keep all existing PVCs and PVs; there is no need to delete them. Detach all apps that use the PVCs/PVs by uninstalling them.
  2. Add the value in the Longhorn UI -> Setting -> Kubernetes Taint Toleration -> nvidia.com/gpu=present:NoSchedule, then save; force the Longhorn pods to restart.
  3. Run the pod on the new node pool again.
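To verify a fix like this, one can check that Longhorn pods now schedule onto the tainted nodepool and that the CSI driver registered there (node name is a placeholder):

```shell
# Longhorn pods should now appear on the GPU node...
kubectl -n longhorn-system get pods -o wide --field-selector spec.nodeName=<gpu-node-name>

# ...and driver.longhorn.io should show up in its CSINode entry.
kubectl get csinode <gpu-node-name> -o jsonpath='{.spec.drivers[*].name}'
```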
