We use this code: ray/python/ray/_private/utils.py, line 544 in commit e75689e.
Here is our raylet.out log: raylet_out_2024-05-20-16-54-06.txt
Update: it runs when we specify both num_cpus and num_gpus. But why does it fail when we set only num_gpus? How does Ray detect num_cpus by default, and how should we set num_cpus properly?
We don't have an AMD GPU environment. If you can provide one for us to reproduce the issue, please ping us on Slack: https://ray-distributed.slack.com/team/U055TQCDAAY
If you don't set num_cpus or num_gpus, Ray will auto-detect them. Have you tried not setting num_gpus, to see whether Ray detects the CPU and GPU counts on its own?
Yes, we tried that. Even with no arguments at all, just ray.init(), it still crashes. Sorry that we cannot provide an AMD environment right now. But I think the problem is on the CPU side: Ray cannot detect the CPUs automatically, so we have to set num_cpus manually. Can you tell me how Ray detects CPUs? And is there any way to find out more about the problem, for example, to check whether the CPU threads are working properly?
Update: multiprocessing.cpu_count() returns the CPU count correctly. But we cannot set num_cpus too high. The machine has 192 CPUs, yet num_cpus only works up to 10; with num_cpus=20, Ray fails. Here is the log for num_cpus=20: ray-num-cpu-20-log.txt
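Since num_cpus=10 works and num_cpus=20 fails, bisecting between the two can pin down the exact threshold, which may hint at what limit is being hit. Below is a minimal sketch under two stated assumptions: failures are monotonic (if some value fails, every larger value fails too), and starting Ray in a fresh interpreter is a fair probe. ray_starts_with and largest_working are hypothetical helpers; only ray.init(num_cpus=..., num_gpus=...), ray.cluster_resources() and ray.shutdown() are real Ray calls. The default num_gpus=1 is a placeholder for the machine's actual GPU count.

```python
import subprocess
import sys
from typing import Callable


def ray_starts_with(num_cpus: int, num_gpus: int = 1, timeout: float = 120.0) -> bool:
    """Hypothetical probe: start Ray in a fresh interpreter, report success."""
    # num_gpus=1 is a placeholder; set it to the GPU count you normally pass.
    script = (
        "import ray\n"
        f"ray.init(num_cpus={num_cpus}, num_gpus={num_gpus})\n"
        "print(ray.cluster_resources())\n"
        "ray.shutdown()\n"
    )
    try:
        proc = subprocess.run([sys.executable, "-c", script], timeout=timeout)
    except subprocess.TimeoutExpired:
        return False  # treat a hang as a failure
    return proc.returncode == 0


def largest_working(probe: Callable[[int], bool], lo: int, hi: int) -> int:
    """Binary search for the largest n in [lo, hi] with probe(n) True.

    Assumes probe(lo) is True and that failures are monotonic.
    """
    while lo < hi:
        mid = (lo + hi + 1) // 2  # round up so the loop always makes progress
        if probe(mid):
            lo = mid
        else:
            hi = mid - 1
    return lo
```

With the values from this thread, largest_working(ray_starts_with, 10, 20) needs about four Ray starts. If a failed start leaves Ray processes behind, running `ray stop` between attempts keeps the probes independent.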
Related Issues (20)
- Release test chaos_torch_batch_inference_16_gpu_300gb_raw.aws failed
- UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf8 in position 0: invalid start byte [<Ray component: tune }
- [Data] Dataset.write_parquet(**arrow_parquet_args) not work
- [Tune] Quantized with q=1 bug
- CI test linux://python/ray/tests:test_draining is flaky
- [Core] error happened in _raylet.so, thread_proxy when using ray.init
- [RayTrain] Checkpoint API to recover from checkpoint from previous runs
- [Ray Tune] [ JobSubmissionClient] Error fetching job logs using client.get_job_logs(job_id) with JobSubmissionClient
- [Ray Core] Ray agent crashes: grpcio version mismatch and unexpected errors (dashboard_agent, runtime_env_agent)
- [Core] `RayTaskError.as_instanceof_cause` raises a `TypeError` if the cause is a `BaseExceptionGroup`
- [RLlib] ONNX Export support when using RLModule API
- [Data] `num_rows_per_file` parameter description is misleading
- How to speed up ray.get() to get a large object from another node?
- CI test darwin://python/ray/tests:test_job is consistently_failing
- [Data] Add WarcDatasource for reading WARC/ARC files
- CI test darwin://python/ray/tests:test_job is consistently_failing
- [data] ray_tqdm does not work with numba
- Release test single_node_oom.aws failed
- [Ray debugger] Unable to use debugger on Ray Cluster on k8s
- [RLlib] Incorrect Callback Order