Comments (6)

lucasrod16 commented on June 16, 2024

@ercanserteli I am not able to reproduce the error you're seeing.

Using this zarf.yaml:

kind: ZarfPackageConfig
metadata:
  name: test-package
  version: 1.0.0
components:
  - name: gpu-operator
    required: true
    charts:
      - name: gpu-operator
        namespace: gpu-operator
        url: https://helm.ngc.nvidia.com/nvidia
        version: v23.9.2
        valuesFiles:
          - ./values.yaml
    images:
      - registry.k8s.io/nfd/node-feature-discovery:v0.14.2
      - nvcr.io/nvidia/k8s/dcgm-exporter:3.3.0-3.2.0-ubuntu22.04
      - nvcr.io/nvidia/gpu-feature-discovery:v0.8.2-ubi8
      - nvcr.io/nvidia/k8s-device-plugin:v0.14.5-ubi8
      - nvcr.io/nvidia/k8s/container-toolkit:v1.14.6-ubuntu20.04
      - nvcr.io/nvidia/gpu-operator:v23.9.2
      - nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.9.2
      - nvcr.io/nvidia/cloud-native/k8s-driver-manager:v0.6.5
      

Note that I had to add k8s to this image reference: nvcr.io/nvidia/k8s/dcgm-exporter:3.3.0-3.2.0-ubuntu22.04

https://catalog.ngc.nvidia.com/orgs/nvidia/teams/k8s/containers/dcgm-exporter/tags

The output you provided says "Loading metadata for 59 images", but the example zarf.yaml you provided only has 8 images. Your package seems to define many more images that Zarf is trying to pull. Could you provide the zarf.yaml that has the 59 images?

from zarf.

lucasrod16 commented on June 16, 2024

After waiting a few hours and trying again, I am now seeing the error. I suspect this is somehow related to NVIDIA's registry, since as far as I'm aware we have not seen this problem with other registries.

ercanserteli commented on June 16, 2024

You are right, the sample outputs I added were from running with the whole production zarf.yaml, which has more images overall, but as you confirmed, the problem occurs even with only this component. I also do not see this problem with any registry other than Nvidia's. It may be that the problem occurs more often when there are more images to pull overall. I have a 100% failure rate with the full zarf.yaml over ~50 tries, although I can't share it here because it includes private components. (Of course, there is no error when I exclude the gpu-operator component, so the other images are not to blame for the failure.)

In any case, I believe that Zarf should handle failed image layer downloads more gracefully such that they don't get cached in a corrupted state. If that were fixed, Zarf's retry mechanism could work successfully, and the sporadic INTERNAL_ERROR from the registry side would not ruin the pulling process.
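
To illustrate what "more gracefully" could mean here, below is a minimal shell sketch of the general pattern: write the download to a temporary file, verify its SHA-256 digest, and only then move it into the cache. This is not Zarf's actual code; the cache_blob function, its arguments, and the cache layout are hypothetical, and a local file stands in for the registry download.

```shell
#!/usr/bin/env bash
# Hypothetical sketch, not Zarf's implementation: a blob only enters the
# cache after its digest has been verified.
set -u

# cache_blob SRC EXPECTED_SHA256 CACHE_DIR
# SRC stands in for the downloaded layer; a real client would stream it
# from the registry instead of copying a local file.
cache_blob() {
  local src="$1" expected="$2" cache_dir="$3"
  local tmp actual

  mkdir -p "$cache_dir" || return 1

  # Temporary file in the same directory, so the final rename stays on
  # one filesystem.
  tmp="$(mktemp "$cache_dir/.partial.XXXXXX")" || return 1

  if ! cp "$src" "$tmp"; then
    rm -f "$tmp"
    return 1
  fi

  actual="$(sha256sum "$tmp" | cut -d' ' -f1)"
  if [ "$actual" != "$expected" ]; then
    # Truncated or corrupted download: discard it so a retry starts clean.
    rm -f "$tmp"
    return 1
  fi

  # The blob becomes visible under its digest only once it is complete.
  mv "$tmp" "$cache_dir/$expected"
}
```

With this shape, a failed download leaves nothing behind in the cache, so a retry downloads the layer again instead of reusing a partial file.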

ercanserteli commented on June 16, 2024

Is there any possible workaround for this problem? For example, running docker pull on the images manually works, but I do not know if there is a way to make Zarf use the local Docker cache.

I also tried setting up a pull-through cache on AWS ECR but it seems they don't support Nvidia's registry.

Any ideas on a workaround would be great, so that we can create packages in the meantime.

AustinAbro321 commented on June 16, 2024

Yes, if Zarf does not find an image, it will pull from the local Docker image store. I'm not sure whether Zarf will still fall back to the local Docker store if it sees an image in a remote registry and then fails to pull it. You may have to rename or retag the images.

ercanserteli commented on June 16, 2024

Thank you, this worked as a workaround! For anyone with the same problem, I first modified the hosts file to make nvcr.io unreachable and it used the local docker images, but it was extremely slow. Instead, setting up a local registry, pushing all the images and using --registry-override during package create worked like a charm.
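
For reference, the local-registry approach can be sketched roughly as follows. The registry address localhost:5000, the container name zarf-mirror, and the helper functions are assumptions for illustration; the image list is taken from the zarf.yaml earlier in this thread. The script only prints the commands by default; set DRY_RUN=0 to actually run them.

```shell
#!/usr/bin/env bash
# Sketch of the local-registry workaround. Names and addresses below are
# assumptions; adjust them to your environment.
set -u

SOURCE_REGISTRY="nvcr.io"
LOCAL_REGISTRY="${LOCAL_REGISTRY:-localhost:5000}"  # assumed local registry address
DRY_RUN="${DRY_RUN:-1}"                             # 1 = print commands only

# Only the nvcr.io images need mirroring, since only that host is overridden.
IMAGES=(
  "nvcr.io/nvidia/k8s/dcgm-exporter:3.3.0-3.2.0-ubuntu22.04"
  "nvcr.io/nvidia/gpu-feature-discovery:v0.8.2-ubi8"
  "nvcr.io/nvidia/k8s-device-plugin:v0.14.5-ubi8"
  "nvcr.io/nvidia/k8s/container-toolkit:v1.14.6-ubuntu20.04"
  "nvcr.io/nvidia/gpu-operator:v23.9.2"
  "nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.9.2"
  "nvcr.io/nvidia/cloud-native/k8s-driver-manager:v0.6.5"
)

# Swap the source registry host for the local one, keeping the repository
# path and tag unchanged.
rewrite_ref() {
  printf '%s\n' "${LOCAL_REGISTRY}/${1#"${SOURCE_REGISTRY}"/}"
}

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Pull each image with Docker, retag it for the local registry, and push it.
mirror_images() {
  local image target
  for image in "${IMAGES[@]}"; do
    target="$(rewrite_ref "$image")"
    run docker pull "$image"
    run docker tag "$image" "$target"
    run docker push "$target"
  done
}

# Start a throwaway local registry, mirror the images, then create the
# package with nvcr.io redirected to the local registry.
run docker run -d -p 5000:5000 --name zarf-mirror registry:2
mirror_images
run zarf package create . --registry-override "${SOURCE_REGISTRY}=${LOCAL_REGISTRY}"
```

Depending on the Zarf version and how the local registry is served, an additional option may be needed to allow a plain-HTTP registry; check zarf package create --help for what your version supports.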
