
Comments (7)

rynewang commented on July 20, 2024

@jjyao can you help triage this? Thanks

from ray.

jjyao commented on July 20, 2024

cc @WeichenXu123 can you take a look at this one?


jjyao commented on July 20, 2024

@WeichenXu123 gentle ping here.


WeichenXu123 commented on July 20, 2024

checking


WeichenXu123 commented on July 20, 2024

Ah, you are creating 70 Ray worker nodes, but how many Spark worker nodes are there?

We recommend that one Ray worker node occupy all the CPUs/GPUs of a Spark worker node, i.e. each Spark worker node launches at most one Ray worker node. This reduces the risk of port conflicts.

We do have a mechanism to prevent port conflicts:

def _preallocate_ray_worker_port_range():

According to your error message '1000 ports from 35000 to 35999', you must have started at least 16 Ray worker nodes on the same machine.
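The arithmetic behind that estimate can be sketched as follows. This is only an illustration: the base port of 20000, the 1000-port slot size, and the in-order slot assignment are assumptions chosen to be consistent with the "at least 16" figure above, not values confirmed in this thread, and the function names are hypothetical.

```python
# Assumed layout (hypothetical, not confirmed in this thread): each Ray worker
# node on a machine gets its own slot of 1000 ports, starting at port 20000,
# and slots are handed out in increasing order.
ASSUMED_BASE_PORT = 20000
ASSUMED_PORTS_PER_SLOT = 1000


def slot_index(range_begin: int) -> int:
    """Return the 0-based slot occupied by a port range starting at range_begin."""
    offset = range_begin - ASSUMED_BASE_PORT
    if offset < 0 or offset % ASSUMED_PORTS_PER_SLOT != 0:
        raise ValueError(f"{range_begin} is not the start of a slot")
    return offset // ASSUMED_PORTS_PER_SLOT


def min_workers_on_machine(range_begin: int) -> int:
    """If slots are assigned in order, slot N implies at least N + 1 workers."""
    return slot_index(range_begin) + 1


if __name__ == "__main__":
    # The error message reported the range 35000 to 35999.
    print(min_workers_on_machine(35000))
```

Under these assumptions, a range starting at 35000 is the 16th slot, which is where the lower bound on colocated workers comes from.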


WeichenXu123 commented on July 20, 2024

Ports above 30000 might be used by other Ray components. Reducing the number of Ray worker nodes per Spark worker node should address the issue. @jjyao Do Ray system services use the port range 10000 to 20000?


fersarr commented on July 20, 2024

Thanks for having a look, @WeichenXu123. I will answer your questions below and add some thoughts:

> ah, you are creating 70 Ray worker nodes, but how many spark worker nodes there?

We have 100+ nodes, so it should be fine to have 70 Ray workers. But yes, I do often see Ray putting more than one worker on the same node. How can I tell Ray not to do that without telling it to use all the resources of the machine?

> we recommend to make one Ray worker node occupies all CPU/GPUs in a spark worker node

I think there might be a few issues with this. The nodes in the cluster might differ from one another, so it will be hard to configure. Also, it might take longer for my Ray task to launch if it has to wait for a full machine to become available instead of just using whatever is available.
Also, if I somehow configure Ray to use all the CPUs/GPUs in a Spark worker node, that will potentially lead to underutilisation of the cluster, right? We have really big nodes, sometimes with more than 150 cores.
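To put rough numbers on this trade-off, here is a minimal sketch that estimates how many Ray worker nodes could end up on one Spark node if CPU count were the only limit. The 150-core figure comes from the comment above; the 8 CPUs per Ray worker is a made-up value for illustration, the function name is hypothetical, and real placement also depends on memory, GPUs, and what else is running.

```python
def max_colocated_ray_workers(spark_node_cores: int, cpus_per_ray_worker: int) -> int:
    """Upper bound on Ray worker nodes that fit on one Spark node by CPU count alone."""
    if spark_node_cores < 0:
        raise ValueError("spark_node_cores must be non-negative")
    if cpus_per_ray_worker <= 0:
        raise ValueError("cpus_per_ray_worker must be positive")
    return spark_node_cores // cpus_per_ray_worker


if __name__ == "__main__":
    # Hypothetical: small Ray workers on a 150-core Spark node.
    print(max_colocated_ray_workers(150, 8))
    # One Ray worker sized to the whole node, as recommended earlier in the thread.
    print(max_colocated_ray_workers(150, 150))
```

With small workers on a big node, the CPU-only bound is 18 colocated workers, which is above the 16 inferred from the error message. Sizing the worker to the whole node brings it down to one, at the cost of the scheduling delay and underutilisation concerns raised above.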

