Comments (15)
Here's the relevant code:
def download_images(self):
    """Download Docker images from DockerHub"""
    logger.info("Download Docker images: %s", self.docker_image_dict)
    for key in self.docker_image_dict:
        # Only pull the image from DockerHub when the image does not
        # exist in the local Docker registry.
        image = self.docker_image_dict[key]
        if (
            shell_subprocess_run(
                f"docker image inspect {image} > /dev/null", check=False
            )
            != 0
        ):
            shell_subprocess_run(f"docker pull {image}")
        else:
            logger.info("Image %s exists", image)
In our output we're seeing "Image exists", which is weird: how could the Buildkite machine already have the Ray 2.9.1 image pulled?
In any case, it looks like one fix would be to always pull the image fresh, or somehow check the hashes and only pull it if the hashes don't match.
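For the hash idea, a minimal sketch of reading the local digest with plain subprocess (the helper name local_repo_digest is hypothetical and this is not what the CI code does; comparing against the registry's digest would still need an extra lookup such as docker manifest inspect, which is part of why always pulling is simpler):

import json
import subprocess

def local_repo_digest(image: str) -> str | None:
    """Return the first RepoDigest of a locally cached image, or None if absent."""
    result = subprocess.run(
        ["docker", "image", "inspect", "--format", "{{json .RepoDigests}}", image],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        return None  # image is not present locally
    digests = json.loads(result.stdout)
    # Entries look like "rayproject/ray@sha256:..."; keep only the digest part.
    return digests[0].split("@", 1)[1] if digests else None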
from kuberay.
@kevin85421 actually, can we just eliminate this custom check and always run `docker pull`? It seems that `docker pull` will skip downloading layers that already exist locally, so it will still be efficient. And maybe `docker pull` has some built-in way of dealing with this hash issue.
Testing this approach here: https://buildkite.com/ray-project/ray-ecosystem-ci-kuberay-ci/builds/3017#018d3279-9b0a-4ee5-82d5-4d758060095b
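For reference, dropping the check would collapse the method to something like this (a sketch based on the snippet above, reusing its shell_subprocess_run helper):

def download_images(self):
    """Download Docker images from DockerHub"""
    logger.info("Download Docker images: %s", self.docker_image_dict)
    for image in self.docker_image_dict.values():
        # `docker pull` skips layers that already exist locally, so an
        # unconditional pull is still reasonably efficient.
        shell_subprocess_run(f"docker pull {image}")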
from kuberay.
It makes sense to me. We can remove the check.
from kuberay.
Unfortunately we're still getting the same error, even though it's pulling fresh:
2024-01-22:18:49:46,476 INFO [utils.py:288] Download Docker images: {'ray-image': 'rayproject/ray:2.9.1', 'kuberay-operator-image': 'kuberay/operator:nightly'}
2024-01-22:18:49:46,476 INFO [utils.py:349] Execute command: docker pull rayproject/ray:2.9.1
2.9.1: Pulling from rayproject/ray
521f275cc58b: Pull complete
bf72fdb087e6: Pull complete
bf1ecb086b72: Pull complete
4f4fb700ef54: Pull complete
f38b8e60ced2: Pull complete
86eb5ea39abc: Pull complete
cf9fa6809682: Pull complete
0d7128e71fad: Pull complete
a849c7c52514: Pull complete
Digest: sha256:08f711dffe947bf7aea219066ca8871bc24b78c96ff0450c4f5807d71ba07a23
Status: Downloaded newer image for rayproject/ray:2.9.1
docker.io/rayproject/ray:2.9.1
2024-01-22:18:50:12,550 INFO [utils.py:349] Execute command: docker pull kuberay/operator:nightly
nightly: Pulling from kuberay/operator
e6e98c874e21: Already exists
6350fae67ca1: Pull complete
Digest: sha256:7aa36a5d3cda5dc424a7898c075a1883c1771e8ced1e79c2e5fddac08c71c751
Status: Downloaded newer image for kuberay/operator:nightly
docker.io/kuberay/operator:nightly
2024-01-22:18:50:16,97 INFO [utils.py:295] Load images into KinD cluster
2024-01-22:18:50:16,97 INFO [utils.py:349] Execute command: kind load docker-image rayproject/ray:2.9.1
Image: "rayproject/ray:2.9.1" with ID "sha256:1b80f2dd7cb3daaafc5fc33a17ad46fe68c02e2fffd6601c7b3b68c86101d394" not yet present on node "kind-control-plane", loading...
ERROR: failed to load image: command "docker exec --privileged -i kind-control-plane ctr --namespace=k8s.io images import -" failed with error: exit status 1
Command Output: unpacking docker.io/rayproject/ray:2.9.1 (sha256:553e7af4a541619772d531c3b7e1722fdbf9d9a4d43d4255e2a8debc2bbac518)...time="2024-01-22T18:51:33Z" level=info msg="apply failure, attempting cleanup" error="wrong diff id calculated on extraction \"sha256:64dba0d4b6296b5d85b7b8a27a4027bcc37405b7293e868adb5e12f6471c8359\"" key="extract-207399946-OyBF sha256:61b67ada3a810ce6d65ce004a85a36c8d2f7485c89b8330f609e052292b8314e"
ctr: wrong diff id calculated on extraction "sha256:64dba0d4b6296b5d85b7b8a27a4027bcc37405b7293e868adb5e12f6471c8359"
Very weird
from kuberay.
Docker was updated from 24.0 to 25.0 on Jan 19, which aligns with when this started failing. And currently KubeRay just pulls the latest Docker version.
I'll pin the Docker version to 24.0 and see if that fixes it.
from kuberay.
Pinning to 24.0 doesn't work; it results in the same error.
from kuberay.
Some experiments:
- I still get the same error when I use the Ray 2.8.0 image.
- I ran `docker image ls` to check whether there are any existing images. There are no existing images; you can search for `REPOSITORY TAG IMAGE ID CREATED SIZE` in this build.
from kuberay.
A workaround is to avoid using `kind load`, as shown in 4a6f13d. See this build for more details. If we can find the root cause, we can increase the timeout and avoid using `kind load` as a workaround.
from kuberay.
> A workaround is to avoid using `kind load`, as shown in 4a6f13d. See this build for more details. If we can find the root cause, we can increase the timeout and avoid using `kind load` as a workaround.
@kevin85421 Nice find, I was pretty stuck on this. But how does the test work if `kind load` isn't there? Is it just that `kind load` is unnecessary because the Ray containers will pull the Ray image on their own if necessary, so there's no need to preload it in the `kind` cluster?
I think the priority should be (1) merge the `kind load` workaround to unblock CI, then (2) figure out the root cause and add back `kind load` if necessary.
from kuberay.
> But how does the test work if `kind load` isn't there? Is it just that `kind load` is unnecessary because the Ray containers will pull the Ray image on their own if necessary, so there's no need to preload it in the `kind` cluster?
If a Pod can't find the image on the Kubernetes node (i.e. the `kind` node), it will pull the image from a remote registry such as DockerHub or Quay. We should determine whether the issue occurs when we use `kind load $KUBERAY_IMAGE`, or if it only happens with `kind load $RAY_IMAGE`. We should still load the KubeRay image into the Kind cluster.
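One way to bisect this, as a sketch reusing the helpers from the snippet above (the method name load_images_into_kind is hypothetical):

def load_images_into_kind(self):
    """Load each image into the KinD cluster separately to see which load fails."""
    for image in self.docker_image_dict.values():
        returncode = shell_subprocess_run(
            f"kind load docker-image {image}", check=False
        )
        if returncode != 0:
            logger.warning("kind load failed for image %s", image)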
from kuberay.
It only happens when loading Ray. Loading the KubeRay operator is fine.
Passing test when only `kind load $RAY_IMAGE` is skipped: https://buildkite.com/ray-project/ray-ecosystem-ci-kuberay-ci/builds/3027#018d3381-ae0e-4025-b56b-6cc2d2d4b434
from kuberay.
> It only happens when loading Ray. Loading the KubeRay operator is fine. Passing test when only `kind load $RAY_IMAGE` is skipped: https://buildkite.com/ray-project/ray-ecosystem-ci-kuberay-ci/builds/3027#018d3381-ae0e-4025-b56b-6cc2d2d4b434
Actually, it turns out the KubeRay operator failed to load in one case: https://buildkite.com/ray-project/ray-ecosystem-ci-kuberay-ci/builds/3028#018d3397-cd40-4ecc-ac10-e1f3e88f1595/251-522
Since CI is now just flaky instead of consistently failing, I'll downgrade the priority to P1. But we can change it back to P0 if we think it's still too flaky.
from kuberay.
I believe it should remain a P0 priority, given that failures are still common.
from kuberay.
Sure. One possible clue is that `kind load` isn't failing on Ray CI, although the setup seems to be a bit different (ray-project/ray#41836).
from kuberay.
Seems related to kubernetes-sigs/kind#3488; it's the same error message. Maybe I somehow messed up when I tried reverting to Docker 24.0. [For posterity: running `install-docker.sh --version=24.0` didn't restart the Docker server, so the server was still running 25.0. I missed this because I ran `docker --version`, which only shows the client version; `docker info` shows both the client and server versions.]
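For the record, a check along these lines would have caught the mismatch (a sketch; `docker info` accepts a Go template via --format):

import subprocess

def docker_server_version() -> str:
    """Return the Docker daemon (server) version.

    `docker --version` only reports the client version, so query the
    daemon via `docker info` instead.
    """
    return subprocess.run(
        ["docker", "info", "--format", "{{.ServerVersion}}"],
        capture_output=True,
        text=True,
        check=True,
    ).stdout.strip()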
from kuberay.