Comments (6)
@ercanserteli I am not able to reproduce the error you're seeing.
Using this zarf.yaml:
kind: ZarfPackageConfig
metadata:
  name: test-package
  version: 1.0.0
components:
  - name: gpu-operator
    required: true
    charts:
      - name: gpu-operator
        namespace: gpu-operator
        url: https://helm.ngc.nvidia.com/nvidia
        version: v23.9.2
        valuesFiles:
          - ./values.yaml
    images:
      - registry.k8s.io/nfd/node-feature-discovery:v0.14.2
      - nvcr.io/nvidia/k8s/dcgm-exporter:3.3.0-3.2.0-ubuntu22.04
      - nvcr.io/nvidia/gpu-feature-discovery:v0.8.2-ubi8
      - nvcr.io/nvidia/k8s-device-plugin:v0.14.5-ubi8
      - nvcr.io/nvidia/k8s/container-toolkit:v1.14.6-ubuntu20.04
      - nvcr.io/nvidia/gpu-operator:v23.9.2
      - nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.9.2
      - nvcr.io/nvidia/cloud-native/k8s-driver-manager:v0.6.5
Note that I had to add k8s to this image reference: nvcr.io/nvidia/k8s/dcgm-exporter:3.3.0-3.2.0-ubuntu22.04
https://catalog.ngc.nvidia.com/orgs/nvidia/teams/k8s/containers/dcgm-exporter/tags
In the output you provided, it says "Loading metadata for 59 images", but the example zarf.yaml you provided only has 8 images. Your package seems to define many more images that Zarf is trying to pull. Could you provide the zarf.yaml with the 59 images that is actually being used?
from zarf.
After waiting a few hours and trying again, I am seeing the error now. I suspect this is an issue related to NVIDIA's registry somehow as we have not seen this problem occur with other registries that I'm aware of.
You are right, the sample outputs I added were from running with the whole production zarf.yaml, which has more images overall, but as you confirmed, the problem occurs even with only this component. I also do not see this problem with any registry other than NVIDIA's. It may be that the failure rate increases when there are more images to pull overall: I have a 100% failure rate with the full zarf.yaml across ~50 tries, although I can't share it here because it includes private components. (Of course, there is no error when I exclude the gpu-operator component, so the other images are not to blame for the failure.)
In any case, I believe that Zarf should handle failed image layer downloads more gracefully, so that they don't get cached in a corrupted state. If that were fixed, Zarf's retry mechanism could succeed, and the sporadic INTERNAL_ERROR from the registry side would not ruin the pulling process.
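To illustrate the idea of not caching corrupted layers, here is a minimal shell sketch, not Zarf's actual code. It assumes a hypothetical helper named commit_layer, a cache directory keyed by SHA-256 digest, and the sha256sum tool being available. The layer is moved into the cache only after its digest matches the expected one, and a mismatching download is deleted so that a retry starts clean.

```shell
#!/bin/sh
# Sketch only: commit_layer and the cache layout are hypothetical.
set -eu

# commit_layer <downloaded-file> <expected-sha256-hex> <cache-dir>
# Moves the file into the cache only if its digest matches.
commit_layer() {
  file="$1"
  expected="$2"
  cache_dir="$3"

  actual="$(sha256sum "$file" | cut -d' ' -f1)"
  if [ "$actual" != "$expected" ]; then
    # Discard the partial or corrupted download instead of caching it.
    rm -f "$file"
    echo "digest mismatch for $file; not cached" >&2
    return 1
  fi

  mkdir -p "$cache_dir"
  # mv is atomic when source and destination are on the same filesystem.
  mv "$file" "$cache_dir/$expected"
}
```

With a check like this in place, a retry after a failed download would fetch the layer again rather than reuse a bad cache entry.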
Is there any possible workaround for this problem? For example, running docker pull on the images manually works, but I do not know if there is a way to make Zarf use the local docker cache.
I also tried setting up a pull-through cache on AWS ECR, but it seems they don't support NVIDIA's registry.
Any ideas on a workaround would be great, so that we can create packages in the meantime.
Yes, if Zarf does not find an image, it will pull from the local docker image store. I'm not sure whether Zarf will still fall back to the local docker store if it sees an image in a remote registry and then fails to pull it. You may have to rename / retag the images.
Thank you, this worked as a workaround! For anyone with the same problem: I first modified the hosts file to make nvcr.io unreachable, and Zarf then used the local docker images, but it was extremely slow. Instead, setting up a local registry, pushing all the images to it, and using --registry-override during package create worked like a charm.
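For anyone who wants to try the same workaround, the steps could look roughly like the sketch below. The registry address localhost:5000, the container name zarf-mirror, and the helper functions run and mirror_ref are assumptions for illustration, and the image list is the nvcr.io subset of the example zarf.yaml above. The script only prints the commands by default; set DRY_RUN=0 to execute them.

```shell
#!/bin/sh
# Sketch: mirror nvcr.io images into a local registry, then point Zarf at it.
# LOCAL_REGISTRY and the container name zarf-mirror are assumptions.
set -eu

LOCAL_REGISTRY="${LOCAL_REGISTRY:-localhost:5000}"

# Print commands by default; set DRY_RUN=0 to actually run them.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Rewrite nvcr.io/<path> to <local registry>/<path>.
mirror_ref() {
  echo "${LOCAL_REGISTRY}/${1#nvcr.io/}"
}

# Start a throwaway local registry.
run docker run -d -p 5000:5000 --name zarf-mirror registry:2

for image in \
  nvcr.io/nvidia/k8s/dcgm-exporter:3.3.0-3.2.0-ubuntu22.04 \
  nvcr.io/nvidia/gpu-feature-discovery:v0.8.2-ubi8 \
  nvcr.io/nvidia/k8s-device-plugin:v0.14.5-ubi8 \
  nvcr.io/nvidia/k8s/container-toolkit:v1.14.6-ubuntu20.04 \
  nvcr.io/nvidia/gpu-operator:v23.9.2 \
  nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.9.2 \
  nvcr.io/nvidia/cloud-native/k8s-driver-manager:v0.6.5
do
  target="$(mirror_ref "$image")"
  run docker pull "$image"
  run docker tag "$image" "$target"
  run docker push "$target"
done

# Have Zarf resolve nvcr.io references against the local mirror.
run zarf package create . --registry-override "nvcr.io=${LOCAL_REGISTRY}"
```

Depending on the Zarf version, pulling from a plain-HTTP local registry may need additional settings. The image references in zarf.yaml stay unchanged, since the override only affects where they are pulled from during package create.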