Comments (29)
Do you have any taints or tolerations on these nodes?
A support bundle is appreciated.
from longhorn.
Yes, the new node pool has a taint: nvidia.com/gpu=present with effect NoSchedule.
It's running on GCP Kubernetes (GKE).
- Longhorn version: CSI Driver driver.longhorn.io CSI Git commit edf23eddc3b6c307031eaa770c3d312c963f25a5 version v1.6.0
- Kubernetes version: 1.27.8-gke.1067004
from longhorn.
For the old node pool, which works with Longhorn, I can see the annotation
csi.volume.kubernetes.io/nodeid: {"driver.longhorn.io" ...}
but on the new node pool I can't see it.
from longhorn.
I tested uninstalling and reinstalling Longhorn; the issue still persists.
from longhorn.
Can you send us a support bundle?
from longhorn.
@derekbit, done. I generated the support bundle zip and uploaded it to https://drive.google.com/file/d/1dLexKrdwuQvwKvCTu7flzeH1Ndt2mjL7/view?usp=drive_link. I will delete it after you download it.
from longhorn.
OK. The node gke-dataops-gpu-cluster-gpu-np0-bf7ed2c8-9pxd is tainted with nvidia.com/gpu=present:NoSchedule, but the Longhorn components cannot tolerate it.
Please check the document https://longhorn.io/docs/1.6.0/advanced-resources/deploy/taint-toleration/ and set appropriate tolerations for the components.
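To make the taint/toleration relationship above concrete, here is a minimal Python sketch of the matching rule Kubernetes applies when deciding whether a pod may be scheduled onto a tainted node. The function names `tolerates` and `schedulable` are made up for illustration; this is a simplified model of the scheduler's behaviour, not Longhorn or Kubernetes code.

```python
def tolerates(toleration: dict, taint: dict) -> bool:
    """Return True if one toleration matches one taint (simplified model)."""
    effect = toleration.get("effect", "")
    # An empty effect on the toleration matches any taint effect.
    if effect and effect != taint.get("effect", ""):
        return False
    key = toleration.get("key", "")
    # An empty key matches any taint key (normally used with Exists).
    if key and key != taint.get("key", ""):
        return False
    # Exists ignores the value; Equal (the default) compares values.
    if toleration.get("operator", "Equal") == "Exists":
        return True
    return toleration.get("value", "") == taint.get("value", "")


def schedulable(tolerations: list, taints: list) -> bool:
    """A pod fits only if every NoSchedule taint has a matching toleration."""
    return all(
        any(tolerates(tol, taint) for tol in tolerations)
        for taint in taints
        if taint.get("effect") == "NoSchedule"
    )


# The taint reported on the GPU node pool in this thread.
gpu_taint = {"key": "nvidia.com/gpu", "value": "present", "effect": "NoSchedule"}
matching = {"key": "nvidia.com/gpu", "operator": "Equal",
            "value": "present", "effect": "NoSchedule"}

print(schedulable([], [gpu_taint]))          # pod without tolerations
print(schedulable([matching], [gpu_taint]))  # pod with a matching toleration
```

A pod with no toleration for the taint is rejected by the scheduler, which is consistent with no Longhorn pods appearing on the GPU node pool.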
from longhorn.
@derekbit, I followed the guide and set the tolerations in the Helm values. I can see the Longhorn UI, manager, and driver pods restarted. When I open the UI in a browser, I see the message below.
I can't attach the disk.
from longhorn.
disks are unavailable
-> Can you check the status.diskStatus of the newly added node with kubectl -n longhorn-system get nodes.longhorn.io -o yaml?
from longhorn.
kubectl -n longhorn-system get nodes.longhorn.io -o yaml
apiVersion: v1
items:
- apiVersion: longhorn.io/v1beta2
  kind: Node
  metadata:
    creationTimestamp: "2024-02-29T00:35:53Z"
    finalizers:
    - longhorn.io
    generation: 1
    name: gke-test-gpu-cluster-cpu-np0-3228fc14-9h19
    namespace: longhorn-system
    resourceVersion: "2223795"
    uid: d774cfdd-1740-4ece-9376-ceec8b39eda3
  spec:
    allowScheduling: true
    disks:
      default-disk-d1568ee2a91a8a33:
        allowScheduling: true
        diskType: filesystem
        evictionRequested: false
        path: /var/lib/longhorn/
        storageReserved: 93598043750
        tags: []
    evictionRequested: false
    instanceManagerCPURequest: 0
    name: gke-test-gpu-cluster-cpu-np0-3228fc14-9h19
    tags: []
  status:
    autoEvicting: false
    conditions:
    - lastProbeTime: ""
      lastTransitionTime: "2024-02-29T00:35:53Z"
      message: Node gke-test-gpu-cluster-cpu-np0-3228fc14-9h19 is ready
      reason: ""
      status: "True"
      type: Ready
    - lastProbeTime: ""
      lastTransitionTime: "2024-02-29T00:35:53Z"
      message: ""
      reason: ""
      status: "True"
      type: Schedulable
    - lastProbeTime: ""
      lastTransitionTime: "2024-02-29T00:35:53Z"
      message: ""
      reason: ""
      status: "True"
      type: MountPropagation
    diskStatus:
      default-disk-d1568ee2a91a8a33:
        conditions:
        - lastProbeTime: ""
          lastTransitionTime: "2024-02-29T02:00:12Z"
          message: 'Disk default-disk-d1568ee2a91a8a33(/var/lib/longhorn/) on node
            gke-test-gpu-cluster-cpu-np0-3228fc14-9h19 is not ready: record diskUUID
            doesn''t match the one on the disk '
          reason: DiskFilesystemChanged
          status: "False"
          type: Ready
        - lastProbeTime: ""
          lastTransitionTime: "2024-02-29T02:00:12Z"
          message: Disk default-disk-d1568ee2a91a8a33 (/var/lib/longhorn/) on the
            node gke-test-gpu-cluster-cpu-np0-3228fc14-9h19 is not ready
          reason: DiskNotReady
          status: "False"
          type: Schedulable
        diskType: filesystem
        diskUUID: 9fbfa3e2-4b8e-41bc-9bff-7df0caebba01
        filesystemType: ext2/ext3
        scheduledReplica: {}
        storageAvailable: 0
        storageMaximum: 0
        storageScheduled: 0
    region: <masked>
    snapshotCheckStatus: {}
    zone: <masked>
kind: List
metadata:
  resourceVersion: ""
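With many nodes, the failing disk can be hard to spot in this output. The following Python sketch lists every disk whose Ready condition is not "True". It assumes the same object shape as the output above, loaded as a dict (for example from the `-o json` form of the same kubectl command). The helper name `not_ready_disks` is hypothetical, and the sample is trimmed from the output above.

```python
def not_ready_disks(node_list: dict) -> list:
    """Return (node, disk, reason) for each disk whose Ready condition is not "True"."""
    problems = []
    for node in node_list.get("items", []):
        node_name = node["metadata"]["name"]
        disk_status = node.get("status", {}).get("diskStatus") or {}
        for disk_name, status in disk_status.items():
            for cond in status.get("conditions") or []:
                if cond.get("type") == "Ready" and cond.get("status") != "True":
                    problems.append((node_name, disk_name, cond.get("reason", "")))
    return problems


# Trimmed sample shaped like the nodes.longhorn.io output in this thread.
sample = {
    "items": [
        {
            "metadata": {"name": "gke-test-gpu-cluster-cpu-np0-3228fc14-9h19"},
            "status": {
                "diskStatus": {
                    "default-disk-d1568ee2a91a8a33": {
                        "conditions": [
                            {"type": "Ready", "status": "False",
                             "reason": "DiskFilesystemChanged"},
                            {"type": "Schedulable", "status": "False",
                             "reason": "DiskNotReady"},
                        ]
                    }
                }
            },
        }
    ]
}

for node_name, disk_name, reason in not_ready_disks(sample):
    print(f"{node_name}/{disk_name}: {reason}")
```

Only the Ready condition is inspected here; the Schedulable condition in the output above appears to follow from it.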
from longhorn.
Can you try the following?
- Delete the disk from the node gke-test-gpu-cluster-cpu-np0-3228fc14-9h19
- Delete /var/lib/longhorn/longhorn-disk.cfg on the node
- Add the disk /var/lib/longhorn back
from longhorn.
@derekbit, can you provide the commands, or can I do this in the Longhorn UI?
from longhorn.
Please check https://longhorn.io/docs/1.6.0/nodes-and-volumes/nodes/multidisk/.
Can you provide a new support bundle? I'd like to check how the error is triggered.
message: 'Disk default-disk-d1568ee2a91a8a33(/var/lib/longhorn/) on node
gke-test-gpu-cluster-cpu-np0-3228fc14-9h19 is not ready: record diskUUID
doesn''t match the one on the disk '
from longhorn.
@derekbit, I regenerated the support bundle: https://drive.google.com/file/d/16rVGu-7dJRilfgmcAYtRmQNpkvgQkMNP/view?usp=sharing
from longhorn.
I followed the steps: deleted the disk, deleted the *cfg file, and added a new disk. The node page in the Longhorn UI looks OK, but I still get an error on the new node pool; the existing node pool has no issue.
AttachVolume.Attach failed for volume "pvc-5d4bc87d-b1bd-404f-a9fe-cc66fb811300" : CSINode gke-something 0-bf7ed2c8-m2wc does not contain driver driver.longhorn.io
I SSHed into the node that has this error, and I don't see the folder /var/lib/longhorn.
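The error above says that the CSINode object for the new node has no entry for driver.longhorn.io. As a rough illustration, this Python sketch checks a CSINode object (loaded as a dict, for example from `kubectl get csinode <node-name> -o json`) for a given driver. The helper name `has_driver`, the node names, and the node IDs in the samples are hypothetical; only the `spec.drivers[].name` layout comes from the Kubernetes CSINode API.

```python
def has_driver(csinode: dict, driver: str = "driver.longhorn.io") -> bool:
    """Return True if the CSINode lists the given CSI driver under spec.drivers."""
    drivers = csinode.get("spec", {}).get("drivers") or []
    return any(d.get("name") == driver for d in drivers)


# Hypothetical CSINode objects: one where the Longhorn CSI plugin registered,
# and one (like the GPU node) where it did not.
working_node = {
    "metadata": {"name": "example-cpu-node"},
    "spec": {"drivers": [
        {"name": "pd.csi.storage.gke.io", "nodeID": "example-id-1"},
        {"name": "driver.longhorn.io", "nodeID": "example-cpu-node"},
    ]},
}
gpu_node = {
    "metadata": {"name": "example-gpu-node"},
    "spec": {"drivers": [
        {"name": "pd.csi.storage.gke.io", "nodeID": "example-id-2"},
    ]},
}

for node in (working_node, gpu_node):
    print(node["metadata"]["name"], has_driver(node))
```

A driver normally shows up in this list only after its node plugin pod has registered with the kubelet on that node, so a missing entry usually suggests the Longhorn CSI plugin pod is not running there.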
from longhorn.
share screenshot of volume page
from longhorn.
AttachVolume.Attach failed for volume "pvc-5d4bc87d-b1bd-404f-a9fe-cc66fb811300" : CSINode gke-something 0-bf7ed2c8-m2wc does not contain driver driver.longhorn.io
You haven't set the correct key. Shouldn't the key be nvidia.com/gpu?
from longhorn.
On the new node that does not work, I can see this key; it was set by GCP, not by me.
On the existing node pool that works with Longhorn, I don't see this key, and there is no GPU either.
from longhorn.
To avoid confusion, let's focus on the CSINode error on the new node first.
Can you fix the key of the tolerations and try longhorn.io/docs/1.6.0/advanced-resources/deploy/taint-toleration/ again?
from longhorn.
I already ran this once, using the command `helm upgrade longhorn longhorn/longhorn --namespace longhorn-system --values ./longhorn_values.yaml`. The first time I ran it, all 3 pods restarted. If I run it again, the 3 pods are not restarted again.
In the YAML of the UI pod, I can see:
tolerations:
- effect: NoSchedule
  key: key
  operator: Equal
  value: value
- effect: NoExecute
  key: node.kubernetes.io/not-ready
  operator: Exists
  tolerationSeconds: 300
- effect: NoExecute
  key: node.kubernetes.io/unreachable
  operator: Exists
  tolerationSeconds: 300
Driver pod:
tolerations:
- effect: NoSchedule
  key: key
  operator: Equal
  value: value
- effect: NoExecute
  key: node.kubernetes.io/not-ready
  operator: Exists
  tolerationSeconds: 300
- effect: NoExecute
  key: node.kubernetes.io/unreachable
  operator: Exists
  tolerationSeconds: 300
Manager pod:
tolerations:
- effect: NoSchedule
  key: key
  operator: Equal
  value: value
- effect: NoExecute
  key: node.kubernetes.io/not-ready
  operator: Exists
- effect: NoExecute
  key: node.kubernetes.io/unreachable
  operator: Exists
- effect: NoSchedule
  key: node.kubernetes.io/disk-pressure
  operator: Exists
- effect: NoSchedule
  key: node.kubernetes.io/memory-pressure
  operator: Exists
- effect: NoSchedule
  key: node.kubernetes.io/pid-pressure
  operator: Exists
- effect: NoSchedule
  key: node.kubernetes.io/unschedulable
  operator: Exists
All these pods are running on the existing nodes, which work with Longhorn. I don't see any Longhorn pods running on the new node pool.
from longhorn.
I mean, why did you set key: key rather than key: nvidia.com/gpu?
from longhorn.
Are you referring to this? It is set automatically by GCP. I tested deleting it, and GCP put it back.
from longhorn.
You can follow the steps in longhorn.io/docs/1.6.0/advanced-resources/deploy/taint-toleration.
You need to update the tolerations in longhorn_values.yaml, and the key should be key: nvidia.com/gpu rather than key.
Then, run helm upgrade longhorn longhorn/longhorn --namespace longhorn-system --values ./longhorn_values.yaml.
If you still run into the issue, please provide the content of longhorn_values.yaml.
from longhorn.
@derekbit, I tweaked longhorn_values.yaml. Can you help me check it before I run it?
longhornManager:
  tolerations:
  - key: "nvidia.com/gpu=present"
    operator: "Equal"
    value: "value"
    effect: "NoSchedule"
longhornDriver:
  tolerations:
  - key: "nvidia.com/gpu=present"
    operator: "Equal"
    value: "value"
    effect: "NoSchedule"
longhornUI:
  tolerations:
  - key: "nvidia.com/gpu=present"
    operator: "Equal"
    value: "value"
    effect: "NoSchedule"
from longhorn.
Sorry, it should be key "nvidia.com/gpu" and value "present".
Let's try this:
longhornManager:
  tolerations:
  - key: "nvidia.com/gpu"
    operator: "Equal"
    value: "present"
    effect: "NoSchedule"
longhornDriver:
  tolerations:
  - key: "nvidia.com/gpu"
    operator: "Equal"
    value: "present"
    effect: "NoSchedule"
longhornUI:
  tolerations:
  - key: "nvidia.com/gpu"
    operator: "Equal"
    value: "present"
    effect: "NoSchedule"
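The confusion in this exchange comes from the taint being written as one string (nvidia.com/gpu=present:NoSchedule) while a toleration splits it into separate key, value, and effect fields. The sketch below shows that split in Python. The function name `taint_to_toleration` is hypothetical, and the parsing assumes the usual key=value:Effect or key:Effect taint notation.

```python
def taint_to_toleration(taint: str) -> dict:
    """Split a taint written as "key=value:Effect" or "key:Effect" into toleration fields."""
    # The effect is whatever follows the last colon, if there is one.
    spec, sep, effect = taint.rpartition(":")
    if not sep:
        spec, effect = taint, ""
    # The key and value are separated by the first "=".
    key, eq, value = spec.partition("=")
    toleration = {"key": key}
    if eq:
        toleration["operator"] = "Equal"
        toleration["value"] = value
    else:
        # No value given: tolerate any value for this key.
        toleration["operator"] = "Exists"
    if effect:
        toleration["effect"] = effect
    return toleration


print(taint_to_toleration("nvidia.com/gpu=present:NoSchedule"))
```

The part before the "=" is the key and the part after it is the value, which is why a key of "nvidia.com/gpu=present" never matches the taint on the node.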
from longhorn.
I reran the helm command with the values you shared, and I can see the Longhorn pods restarted. Then I test-ran a pod on the new node pool:
Warning FailedMount 9s kubelet Unable to attach or mount volumes: unmounted volumes=[first-pvc secondpvc], unattached volumes=[first-pvc secondpvc], failed to process volumes=[]: timed out waiting for the condition
Warning FailedAttachVolume 0s (x9 over 2m12s) attachdetach-controller AttachVolume.Attach failed for volume "pvc-something" : CSINode gke-something-np0 -bf7ed2c8-554m does not contain driver driver.longhorn.io
Warning FailedAttachVolume 0s (x9 over 2m12s) attachdetach-controller AttachVolume.Attach failed for volume "pvc-87302005-f605-4ae9-81d0-5d6ad5da7630" : CSINode gke-something-bf7ed2c8-554m does not contain driver driver.longhorn.io
from longhorn.
A support bundle is appreciated. BTW, you can ping me on the Rancher or CNCF Slack channel to speed up the process.
from longhorn.
@cometta
BTW, have you executed step 2 in https://longhorn.io/docs/1.6.0/advanced-resources/deploy/taint-toleration/?
defaultSettings:
  taintToleration: "nvidia.com/gpu=present:NoSchedule"
from longhorn.
Issue resolved. Thank you, @derekbit.
In summary:
- Keep all existing PVCs and PVs, no need to delete them; detach all apps that use them by uninstalling the apps.
- In the Longhorn UI, go to Setting -> Kubernetes Taint Toleration, enter nvidia.com/gpu=present:NoSchedule, and save; force the Longhorn pods to restart.
- Run the pod on the new node pool again.
from longhorn.