Comments (15)

architkulkarni avatar architkulkarni commented on June 16, 2024

Here's the relevant code:

        def download_images():
            """Download Docker images from DockerHub"""
            logger.info("Download Docker images: %s", self.docker_image_dict)
            for key in self.docker_image_dict:
                # Only pull the image from DockerHub when the image does not
                # exist in the local docker registry.
                image = self.docker_image_dict[key]
                if (
                    shell_subprocess_run(
                        f"docker image inspect {image} > /dev/null", check=False
                    )
                    != 0
                ):
                    shell_subprocess_run(f"docker pull {image}")
                else:
                    logger.info("Image %s exists", image)

In our output we're seeing "Image exists", which is weird: how could the Buildkite machine already have the Ray 2.9.1 image pulled?

In any case, it looks like one fix would be to always pull the image fresh, or somehow check the hashes and only pull it if the hashes don't match.

architkulkarni avatar architkulkarni commented on June 16, 2024

@kevin85421 actually, can we just eliminate this custom check and always run docker pull? It seems that docker pull will skip downloading things which already exist locally, so it will still be efficient. And maybe docker pull has some built-in way of dealing with this hash issue.
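
Dropping the check would reduce the helper to roughly the following standalone sketch (using subprocess directly rather than KubeRay's shell_subprocess_run wrapper; this is an illustration, not the actual patch). docker pull compares layer digests itself and skips anything already cached, so no manual existence check is needed:

    import logging
    import subprocess

    logging.basicConfig(level=logging.INFO)
    logger = logging.getLogger(__name__)

    def download_images(docker_image_dict: dict) -> None:
        """Always run `docker pull`; the daemon skips layers whose digests
        already match the local cache, so a manual existence check adds nothing."""
        logger.info("Download Docker images: %s", docker_image_dict)
        for image in docker_image_dict.values():
            subprocess.run(["docker", "pull", image], check=True)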

Testing this approach here: https://buildkite.com/ray-project/ray-ecosystem-ci-kuberay-ci/builds/3017#018d3279-9b0a-4ee5-82d5-4d758060095b

kevin85421 avatar kevin85421 commented on June 16, 2024

It makes sense to me. We can remove the check.

architkulkarni avatar architkulkarni commented on June 16, 2024

Unfortunately we're still getting the same error, even though it's pulling fresh:

    2024-01-22:18:49:46,476 INFO     [utils.py:288] Download Docker images: {'ray-image': 'rayproject/ray:2.9.1', 'kuberay-operator-image': 'kuberay/operator:nightly'}
    2024-01-22:18:49:46,476 INFO     [utils.py:349] Execute command: docker pull rayproject/ray:2.9.1
    2.9.1: Pulling from rayproject/ray
    521f275cc58b: Pull complete
    bf72fdb087e6: Pull complete
    bf1ecb086b72: Pull complete
    4f4fb700ef54: Pull complete
    f38b8e60ced2: Pull complete
    86eb5ea39abc: Pull complete
    cf9fa6809682: Pull complete
    0d7128e71fad: Pull complete
    a849c7c52514: Pull complete
    Digest: sha256:08f711dffe947bf7aea219066ca8871bc24b78c96ff0450c4f5807d71ba07a23
    Status: Downloaded newer image for rayproject/ray:2.9.1
    docker.io/rayproject/ray:2.9.1
    2024-01-22:18:50:12,550 INFO     [utils.py:349] Execute command: docker pull kuberay/operator:nightly
    nightly: Pulling from kuberay/operator
    e6e98c874e21: Already exists
    6350fae67ca1: Pull complete
    Digest: sha256:7aa36a5d3cda5dc424a7898c075a1883c1771e8ced1e79c2e5fddac08c71c751
    Status: Downloaded newer image for kuberay/operator:nightly
    docker.io/kuberay/operator:nightly
    2024-01-22:18:50:16,97 INFO     [utils.py:295] Load images into KinD cluster
    2024-01-22:18:50:16,97 INFO     [utils.py:349] Execute command: kind load docker-image rayproject/ray:2.9.1
    Image: "rayproject/ray:2.9.1" with ID "sha256:1b80f2dd7cb3daaafc5fc33a17ad46fe68c02e2fffd6601c7b3b68c86101d394" not yet present on node "kind-control-plane", loading...
    ERROR: failed to load image: command "docker exec --privileged -i kind-control-plane ctr --namespace=k8s.io images import -" failed with error: exit status 1
    Command Output: unpacking docker.io/rayproject/ray:2.9.1 (sha256:553e7af4a541619772d531c3b7e1722fdbf9d9a4d43d4255e2a8debc2bbac518)...time="2024-01-22T18:51:33Z" level=info msg="apply failure, attempting cleanup" error="wrong diff id calculated on extraction \"sha256:64dba0d4b6296b5d85b7b8a27a4027bcc37405b7293e868adb5e12f6471c8359\"" key="extract-207399946-OyBF sha256:61b67ada3a810ce6d65ce004a85a36c8d2f7485c89b8330f609e052292b8314e"
    ctr: wrong diff id calculated on extraction "sha256:64dba0d4b6296b5d85b7b8a27a4027bcc37405b7293e868adb5e12f6471c8359"

Very weird

architkulkarni avatar architkulkarni commented on June 16, 2024

Docker was updated from 24.0 to 25.0 on Jan 19, which lines up with when this started failing, and the KubeRay CI currently installs whatever the latest Docker version is.

I'll pin the Docker version to 24.0 and see if that fixes it.

architkulkarni avatar architkulkarni commented on June 16, 2024

Pinning to 24.0 doesn't work; it results in the same error.

kevin85421 avatar kevin85421 commented on June 16, 2024

Some experiments:

  • I still get the same error when I use the Ray 2.8.0 image.
  • I ran docker image ls to check whether any images already exist locally; there are none. You can search for the header REPOSITORY TAG IMAGE ID CREATED SIZE in this build to see that no images are listed.

kevin85421 avatar kevin85421 commented on June 16, 2024

A workaround is to avoid using kind load, as shown in 4a6f13d. See this build for more details. Until we find the root cause, we can avoid using kind load and increase the timeout as a workaround.

architkulkarni avatar architkulkarni commented on June 16, 2024

A workaround is to avoid using kind load, as shown in 4a6f13d. See this build for more details. Until we find the root cause, we can avoid using kind load and increase the timeout as a workaround.

@kevin85421 Nice find, I was pretty stuck on this. But how does the test work if kind load isn't there? Is it just that kind load is unnecessary because the Ray containers will pull the Ray image on their own if needed, so there's no need to preload it in the kind cluster?

I think the priority should be (1) merge the kind load workaround to unblock CI, then (2) figure out the root cause and add back kind load if necessary.

kevin85421 avatar kevin85421 commented on June 16, 2024

But how does the test work if kind load isn't there? Is it just that kind load is unnecessary because the Ray containers will pull the Ray image on their own if needed, so there's no need to preload it in the kind cluster?

If a Pod can't find its image on the Kubernetes node (i.e. the kind node), it will pull the image from a remote registry such as DockerHub or Quay. We should determine whether the issue occurs when we use kind load $KUBERAY_IMAGE, or if it only happens with kind load $RAY_IMAGE. We should still load the KubeRay image into the Kind cluster.
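
A minimal sketch of that split, reusing the image names and commands from the log above (the helper itself is hypothetical, not the actual test code): both images are still pulled on the host, but only the KubeRay operator image is loaded into kind, while the Ray image is left for the kind node to pull from DockerHub when a Ray pod is scheduled (which is also why the test timeout may need to be increased).

    import subprocess

    RAY_IMAGE = "rayproject/ray:2.9.1"          # pulled by the kind node at pod start
    KUBERAY_IMAGE = "kuberay/operator:nightly"  # preloaded into the kind cluster

    def prepare_images() -> None:
        """Pull both images on the host, but only `kind load` the operator image;
        the Ray image is fetched from the registry by the node when needed."""
        for image in (RAY_IMAGE, KUBERAY_IMAGE):
            subprocess.run(["docker", "pull", image], check=True)
        subprocess.run(["kind", "load", "docker-image", KUBERAY_IMAGE], check=True)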

architkulkarni avatar architkulkarni commented on June 16, 2024

It only happens when loading Ray. Loading the KubeRay operator is fine.

Passing test when only kind load $RAY_IMAGE is skipped: https://buildkite.com/ray-project/ray-ecosystem-ci-kuberay-ci/builds/3027#018d3381-ae0e-4025-b56b-6cc2d2d4b434

architkulkarni avatar architkulkarni commented on June 16, 2024

It only happens when loading Ray. Loading the KubeRay operator is fine.

Passing test when only kind load $RAY_IMAGE is skipped: https://buildkite.com/ray-project/ray-ecosystem-ci-kuberay-ci/builds/3027#018d3381-ae0e-4025-b56b-6cc2d2d4b434

Actually, it turns out the KubeRay operator failed to load in one case: https://buildkite.com/ray-project/ray-ecosystem-ci-kuberay-ci/builds/3028#018d3397-cd40-4ecc-ac10-e1f3e88f1595/251-522

Since CI is now just flaky instead of consistently failing, I'll downgrade the priority to P1. But we can change it back to P0 if we think it's still too flaky.

kevin85421 avatar kevin85421 commented on June 16, 2024

I believe it should remain a P0 priority, given that failures are still common.

architkulkarni avatar architkulkarni commented on June 16, 2024

Sure. One possible clue is that kind load isn't failing on Ray CI, although the setup seems to be a bit different (ray-project/ray#41836).

architkulkarni avatar architkulkarni commented on June 16, 2024

Seems related to kubernetes-sigs/kind#3488; it's the same error message. Maybe I somehow messed up when I tried reverting to Docker 24.0. [For posterity: running install-docker.sh --version=24.0 didn't restart the Docker daemon, so the server was still running 25.0. I missed this because I ran docker --version, which only shows the client version; docker info shows both the client and server versions.]
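
A small sketch of the check that would have caught the stale daemon (our own snippet, assuming a standard Docker install): query both versions with docker version --format instead of relying on docker --version, which only reports the client.

    import subprocess

    def docker_versions() -> tuple[str, str]:
        """Return (client_version, server_version) as reported by the Docker CLI."""
        fmt = "{{.Client.Version}} {{.Server.Version}}"
        out = subprocess.run(
            ["docker", "version", "--format", fmt],
            capture_output=True, text=True, check=True,
        ).stdout.split()
        return out[0], out[1]

    client, server = docker_versions()
    # After pinning to 24.0, the daemon must also be restarted for this to hold.
    assert server.startswith("24."), f"Docker daemon is still running {server}"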
