
Comments (7)

kevin85421 commented on June 22, 2024

Which Ray images are you using? You should use images that include aarch64 in the image tag.

from kuberay.

anovv commented on June 22, 2024

@kevin85421 Yes, I'm using aarch64 images: 2.22.0-py310-aarch64 for Ray, to be exact.

anovv commented on June 22, 2024

@kevin85421 do you have any idea what may be happening? This blocks me.

kevin85421 commented on June 22, 2024

I tried the following on my Mac M1, and my RayCluster is healthy; no pods have been killed.

kind create cluster
helm install kuberay-operator kuberay/kuberay-operator --version 1.1.1
helm install raycluster kuberay/ray-cluster --version 1.1.1 --set image.tag=2.22.0-py310-aarch64
  • We may have some differences: (1) kind vs. minikube, (2) M1 vs. M2, (3) different instructions.
    • You can try kind to determine whether the issue is minikube-only.
    • Use exactly the same instructions as above in your environment.

Btw, are you in the Ray Slack workspace? Joining it would be helpful, since other KubeRay users can also share their experiences there. You can join the #kuberay-questions channel.

anovv commented on June 22, 2024

@kevin85421 what container runtime do you use? Colima or Docker Desktop?

kevin85421 commented on June 22, 2024

I use Docker.

anovv commented on June 22, 2024

Ok @kevin85421, I think I found the culprit: some odd behaviour of the worker.minReplicas parameter when autoscaling is enabled (head.enableInTreeAutoscaling: true).

Example cases:

  • worker:
      replicas: 4
      minReplicas: 0
      maxReplicas: 1000 
    

    I get 4 pods launched; then, after about 60s, all 4 fail the readiness probe and get killed.

  • worker:
      replicas: 4
      minReplicas: 2
      maxReplicas: 1000 
    

    I get 4 pods launched; then, after about 60s, 2 fail the readiness probe and die, while 2 stay healthy and keep working.

  • If I do not set minReplicas:

    worker:
      replicas: 4
      maxReplicas: 1000 
    

    I get 4 pods launched; then, after about 60s, 3 fail the readiness probe and get killed, while 1 stays healthy and keeps working.

  • If I set worker.replicas = worker.minReplicas = 4, all 4 pods work properly.
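
For reference, the last case can be sketched as a Helm values fragment. This is only a sketch, assuming the keys are nested exactly as they are written in this thread (head.enableInTreeAutoscaling, worker.replicas, worker.minReplicas, worker.maxReplicas) and that the values are passed to the ray-cluster chart; nothing else about the chart is implied.

```yaml
# Values taken from the cases described in this thread.
# Assumption: key nesting follows the dotted names used above.
head:
  enableInTreeAutoscaling: true   # autoscaling stays enabled
worker:
  replicas: 4                     # requested number of worker pods
  minReplicas: 4                  # equal to replicas: all 4 pods stayed healthy
  maxReplicas: 1000
```

In the other cases above, only minReplicas differs (0, 2, or unset), and the number of pods that stayed healthy was 0, 2, and 1 respectively.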

I also noticed that not setting worker.maxReplicas leads to odd behaviour as well (the number of pods does not match the request), and the head node throws an error, with the autoscaler not working properly.

So I see two possible things here (which may be interconnected):

  • KubeRay uses worker.minReplicas as the default when the autoscaler is on, after recovering from a readiness probe failure (which is unexpected, as it should use the worker.replicas value)?
  • Readiness probes fail only on pods not tracked by the autoscaler (not sure why)?

Disabling enableInTreeAutoscaling makes everything work as expected.

What do you think?
