Comments (9)
Sure, it's a problem. But looks like you also installed nvidia-container-runtime
on this node?
from k8s-device-plugin.
@RenaudWasTaken when deploying to a node with no GPU, we should wait indefinitely, right? We already have a case for this, but it assumes NVML is present and working:
https://github.com/NVIDIA/k8s-device-plugin/blob/master/main.go#L46-L49
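To make the "wait indefinitely" option concrete, here is a minimal Go sketch of the decision. It assumes the plugin can get a GPU count and an error from some probe; `decide`, `Action`, and `errNoDriver` are hypothetical names for illustration and are not taken from the plugin's main.go.

```go
package main

import (
	"errors"
	"fmt"
)

// Action is what the plugin does after probing the node.
type Action int

const (
	Serve Action = iota // GPUs found: register with the kubelet and serve
	Block               // nothing to advertise: stay alive and idle
)

// errNoDriver is a hypothetical error standing in for a failed NVML load.
var errNoDriver = errors.New("driver library not available")

// decide treats "NVML missing" and "zero GPUs" the same way, so a node
// without a GPU idles instead of exiting and being restarted.
func decide(gpuCount int, probeErr error) Action {
	if probeErr != nil || gpuCount == 0 {
		return Block
	}
	return Serve
}

func main() {
	if decide(0, errNoDriver) == Block {
		fmt.Println("no usable GPUs on this node, idling")
		// A long-running plugin would block here, for example by waiting
		// on a signal channel. It is left out so the sketch terminates.
	}
}
```

The point of the sketch is only that the NVML-failure path and the zero-device path lead to the same outcome; whether idling or exiting is preferable is exactly what the rest of this thread discusses.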
@RenaudWasTaken when deploying to a node with no GPU, we should wait indefinitely, right?
The way I expected device plugins to work is to stop when they detect that no devices are available on the node.
I also expect the restart policy to be OnFailure.
Fixed by 9b54e91
@RenaudWasTaken “A Pod Template in a DaemonSet must have a RestartPolicy equal to Always, or be unspecified, which defaults to Always.” https://kubernetes.io/docs/concepts/workloads/controllers/daemonset/#pod-template
As of now, a node without a GPU will be in a crash loop.
@RenaudWasTaken “A Pod Template in a DaemonSet must have a RestartPolicy equal to Always, or be unspecified, which defaults to Always.” https://kubernetes.io/docs/concepts/workloads/controllers/daemonset/#pod-template
Looks like we'll have to update the k8s docs.
As of now, a node without a GPU will be in a crash loop.
I just pushed a fix for that :)
@RenaudWasTaken I'm getting: The DaemonSet "nvidia-device-plugin-daemonset" is invalid: spec.template.spec.restartPolicy: Unsupported value: "OnFailure": supported values: Always
when applying the manifest on k8s 1.8.1, which is correct according to the docs. Also, I could not find any indication that this has changed in newer versions.
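The rule behind that error can be sketched in a few lines of Go. This is an illustration of the documented DaemonSet constraint only; `validateRestartPolicy` is a hypothetical helper and not the API server's actual validation code.

```go
package main

import "fmt"

// validateRestartPolicy mirrors the documented rule for DaemonSet pod
// templates: the policy must be "Always", or unset (which defaults to
// "Always"). Hypothetical helper for illustration.
func validateRestartPolicy(policy string) error {
	if policy == "" || policy == "Always" {
		return nil
	}
	return fmt.Errorf(
		"spec.template.spec.restartPolicy: Unsupported value: %q: supported values: Always",
		policy,
	)
}

func main() {
	for _, p := range []string{"", "Always", "OnFailure"} {
		fmt.Printf("%q -> %v\n", p, validateRestartPolicy(p))
	}
}
```

Under this rule, a plugin that exits on a node without a GPU is always restarted, which is where the crash loop comes from.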
@idealhack thanks for noticing this mistake.
Looks like you have to manually label your nodes if you don't want the plugin to run on every node.
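For readers unfamiliar with how labels restrict a DaemonSet, the matching rule of a `nodeSelector` can be sketched as below: a pod is eligible for a node only if every key/value pair in the selector is present in the node's labels. The label key `example.com/has-gpu` is a made-up placeholder, not a label the plugin defines, and `matchesSelector` is a hypothetical helper.

```go
package main

import "fmt"

// matchesSelector reports whether a node's labels satisfy a nodeSelector:
// every selector key must exist on the node with the same value.
// An empty selector matches every node.
func matchesSelector(nodeLabels, selector map[string]string) bool {
	for key, want := range selector {
		if got, ok := nodeLabels[key]; !ok || got != want {
			return false
		}
	}
	return true
}

func main() {
	// Placeholder label; choose whatever key your cluster uses.
	selector := map[string]string{"example.com/has-gpu": "true"}

	gpuNode := map[string]string{"example.com/has-gpu": "true"}
	cpuNode := map[string]string{}

	fmt.Println("gpu node eligible:", matchesSelector(gpuNode, selector))
	fmt.Println("cpu node eligible:", matchesSelector(cpuNode, selector))
}
```

With such a selector in the DaemonSet's pod template, the plugin pod is never created on unlabeled nodes, so the crash loop on non-GPU nodes does not occur.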
So crashing is actually the right behavior for the GPU plugin on a non-GPU node, right?
Speaking of docs, how about we add a note to the README about using taints to handle this kind of cluster, like https://kubernetes.io/docs/concepts/configuration/taint-and-toleration/#example-use-cases