Comments (7)
@jjyao can you help triage this? Thanks
from ray.
cc @WeichenXu123 can you take a look at this one?
from ray.
@WeichenXu123 gentle ping here.
from ray.
checking
from ray.
Ah, you are creating 70 Ray worker nodes, but how many Spark worker nodes are there?
We recommend having each Ray worker node occupy all the CPUs/GPUs of a Spark worker node, i.e. one Spark worker node launches at most one Ray worker node. This reduces the risk of port conflicts.
We do have a mechanism to prevent port conflicts:
ray/python/ray/util/spark/cluster_init.py
Line 287 in 0be0639
According to your error message ('1000 ports from 35000 to 35999'), you must have started at least 16 Ray worker nodes on the same machine.
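The arithmetic behind "at least 16" can be sketched as follows. This is only an illustration, not the code in cluster_init.py: it assumes that port slots start at 20000 (a value inferred from the numbers, not stated in this thread), that each slot holds 1000 ports (from the error message), and that each new Ray worker node on a machine takes the lowest free slot. The names `BASE_PORT`, `SLOT_SIZE`, `slot_index` and `min_workers_on_machine` are hypothetical.

```python
# Assumption (not stated in the thread): the first slot starts at port 20000.
BASE_PORT = 20000
# From the error message: each Ray worker node gets a range of 1000 ports.
SLOT_SIZE = 1000


def slot_index(range_start: int, base: int = BASE_PORT, size: int = SLOT_SIZE) -> int:
    """Return the zero-based slot whose port range begins at range_start."""
    offset = range_start - base
    if offset < 0 or offset % size != 0:
        raise ValueError(f"{range_start} is not the start of a slot")
    return offset // size


def min_workers_on_machine(range_start: int) -> int:
    """If slots are handed out lowest-first, reaching slot k means
    slots 0..k-1 were already taken, i.e. at least k + 1 workers."""
    return slot_index(range_start) + 1


# The range from the error message starts at 35000.
print(min_workers_on_machine(35000))
```

Under those assumptions, the range 35000 to 35999 is the sixteenth slot, which matches the estimate above.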
from ray.
Ports above 30000 might be used by other Ray components. Reducing the number of Ray worker nodes per Spark worker node should address the issue. @jjyao, do Ray system services use the port range 10000 to 20000?
from ray.
Thanks for having a look, @WeichenXu123. I will answer your questions below and add some thoughts:
> Ah, you are creating 70 Ray worker nodes, but how many Spark worker nodes are there?
We have 100+ nodes, so it should be fine to have 70 Ray workers. But yes, I do often see Ray putting more than one worker on the same node. How can I tell Ray not to do that without telling it to use all the resources of the machine?
> We recommend having each Ray worker node occupy all the CPUs/GPUs of a Spark worker node
I think there might be a few issues with this. The nodes in the cluster might be heterogeneous, so this would be hard to configure. Also, it might take longer for my Ray task to launch if it has to wait for a full machine to become available instead of just using whatever is available.
Also, if I somehow configure Ray to use all the CPUs/GPUs in the Spark worker node, that will potentially lead to underutilisation of the cluster, right? We have really big nodes, sometimes with more than 150 cores.
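To make the trade-off concrete, here is a rough sketch of how the per-worker CPU request bounds co-location. It assumes that each Ray worker node is scheduled as a single Spark task requesting a fixed number of CPUs, and that only CPU count limits how many fit on one Spark worker (GPUs and memory are ignored). The function names are hypothetical, the 150-core figure comes from the comment above, and the 8-CPU request is an arbitrary example value.

```python
def max_colocated_ray_workers(spark_worker_cores: int, cpus_per_ray_worker: int) -> int:
    """Upper bound on Ray worker nodes that fit on one Spark worker, by CPU count alone."""
    if spark_worker_cores <= 0 or cpus_per_ray_worker <= 0:
        raise ValueError("core counts must be positive")
    return spark_worker_cores // cpus_per_ray_worker


def min_cpus_to_cap(spark_worker_cores: int, max_per_machine: int) -> int:
    """Smallest per-Ray-worker CPU request that keeps co-location <= max_per_machine."""
    if spark_worker_cores <= 0 or max_per_machine <= 0:
        raise ValueError("arguments must be positive")
    # floor(cores / n) <= m  holds exactly when  n > cores / (m + 1).
    return spark_worker_cores // (max_per_machine + 1) + 1


cores = 150  # a "really big" node, as mentioned above
print(max_colocated_ray_workers(cores, 8))  # hypothetical 8-CPU Ray workers
print(min_cpus_to_cap(cores, 1))            # request needed for one Ray worker per node
```

On a 150-core node, small requests allow many Ray worker nodes to share a machine, while capping it at one per machine needs a request of more than half the cores. That is the underutilisation concern in numbers.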
from ray.