Comments (4)
What do the "GPU allocated memory" and "zombie processes" checks do?
Those are related to each other, and to GPU reset as well. If it's not possible to reset your card, you at least need to detect when things are broken.
When things go awfully wrong, you can see the following:
- There is no process using the GPU, but `nvidia-smi` shows a non-trivial amount of memory being used, e.g. something like `352MiB / 12181MiB`.
- There is a process already using the GPU you are supposed to give to a new container (excluding voluntary sharing). This can happen when there is a GPU fault and the process that had an open CUDA context can't tear down properly.
These kinds of checks are useful safety checks in addition to event-based health checks like XIDs and ECC errors. Some of these errors could go unnoticed otherwise.
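A minimal sketch of such a check, assuming a Python helper around `nvidia-smi` (the threshold and function names here are illustrative assumptions, not part of the plugin), could query the used memory and the list of compute processes, and flag the "memory allocated but no process" state described above:

```python
import subprocess

# Hypothetical threshold: the driver itself can hold a few MiB, so only
# flag clearly non-trivial leftovers (e.g. the 352MiB case above).
ZOMBIE_MEM_THRESHOLD_MIB = 64


def is_zombie_state(memory_used_mib: int, compute_pids: list) -> bool:
    """True when GPU memory is still allocated but no compute process exists."""
    return not compute_pids and memory_used_mib > ZOMBIE_MEM_THRESHOLD_MIB


def query_gpu(index: int = 0):
    """Read used memory (MiB) and compute-process PIDs for one GPU
    via nvidia-smi's machine-readable query flags."""
    mem = int(subprocess.check_output(
        ["nvidia-smi", "-i", str(index),
         "--query-gpu=memory.used", "--format=csv,noheader,nounits"],
        text=True).strip())
    pids_out = subprocess.check_output(
        ["nvidia-smi", "-i", str(index),
         "--query-compute-apps=pid", "--format=csv,noheader"],
        text=True).strip()
    pids = [int(p) for p in pids_out.splitlines() if p]
    return mem, pids
```

A health-check loop would call `query_gpu()` per device and mark the device unhealthy when `is_zombie_state()` returns true; checking for an unexpected pre-existing PID before handing the GPU to a new container covers the second failure mode.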
from k8s-device-plugin.
@flx42 @jiayingz any plan to enhance the current device health check?
cc @jiayingz @vishh @mindprince
@flx42 Thanks for your detailed explanation, it's very useful for us!
Related Issues (20)
- Is nvidia container runtime necessary for app pod when using cdi-annotations strategy? HOT 5
- More flexible time-slicing strategy configuration
- an amazon machine image (AMI) that meets the prerequisites of k8s-device-plugin HOT 2
- How to trigger gpu failure, the gpu count of node's allocatable field will be dynamically decrease HOT 4
- Unable to install in Ubuntu 20.04 a nvidia container toolkit with version < 1.14.4 HOT 15
- GPU health status exposure and remediation methods HOT 1
- GPU distribution wrong after reboot node HOT 2
- Addressing several security vulnerabilities in the version v0.14.4 and v0.14.5 HOT 1
- GPU allocation does not respect NVLink HOT 5
- A pod can access all gpu resources even if no nvidia.com/gpu is configed. HOT 1
- Using CUDA MPS to enable GPU sharing, the pod occupies all GPU memory. HOT 11
- 0/1 nodes are available: 1 Insufficient nvidia.com/gpu HOT 2
- Limiting GPU Resource Usage per Docker Container with MPS Daemon
- K8s 1.24 failed to schedule using GPU-(error code CUDA driver HOT 6
- Access NVIDIA GPUs in K8s in a non-privileged container
- can't install 0.15.0-rc.2 HOT 3
- Device plugin does not start on MIG-enabled host due to insufficient permissions HOT 6
- Daemonset yaml file is not picking up Timeslicing configMap
- Create CDI spec error "libcuda.so.535.129.03 not found" in version "v0.15.0-rc.2" HOT 2