Comments (4)
Adding one more observation: we have seen this issue only when jobs are submitted using the RayJob CRD. With a static Ray cluster and the ray job CLI for job submission, we do not see it.
from ray.
![Screenshot 2024-05-21 at 10 20 40 PM](https://private-user-images.githubusercontent.com/13175315/338293128-717ac211-eb15-4e20-9f11-d3ae9e923f3c.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MjEzOTg5MzYsIm5iZiI6MTcyMTM5ODYzNiwicGF0aCI6Ii8xMzE3NTMxNS8zMzgyOTMxMjgtNzE3YWMyMTEtZWIxNS00ZTIwLTlmMTEtZDNhZTllOTIzZjNjLnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNDA3MTklMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjQwNzE5VDE0MTcxNlomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPTI3MzRmNGJmMzVlZTE0YmIyZTJiY2M5MjNmMTFjODA1MjNlMmUxYjNlYmU4YWE5Nzc3OTk4Mzc2NjU0YWI5YTEmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0JmFjdG9yX2lkPTAma2V5X2lkPTAmcmVwb19pZD0wIn0.LpXzd2sm3Xe26JMsgBV5k6Y4Ker3omRdNnD6VTrWAz0)
![Screenshot 2024-06-10 at 11 30 02 PM](https://private-user-images.githubusercontent.com/13175315/338293770-af2527bd-eed9-492d-817a-a385d4bbeb4a.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MjEzOTg5MzYsIm5iZiI6MTcyMTM5ODYzNiwicGF0aCI6Ii8xMzE3NTMxNS8zMzgyOTM3NzAtYWYyNTI3YmQtZWVkOS00OTJkLTgxN2EtYTM4NWQ0YmJlYjRhLnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNDA3MTklMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjQwNzE5VDE0MTcxNlomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPWY4MWI1NjY2ZWI4OGY3ZjJjMjIwYmQ3OTEzNTQ1MTM4YmFhZjczMzU1YjA4MjhlODRhYmI4OGVmOWI4ZTcxNTgmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0JmFjdG9yX2lkPTAma2V5X2lkPTAmcmVwb19pZD0wIn0.p9VUvzqsShYplRKwg3plxdopXnnNlDhqVUlZ8vXlbvM)
Attaching screenshots of the Ray Dashboard taken while the job was stuck.
Team, this is a critical issue that is blocking us from using Ray for batch inference in a predictable way. When running batch inference over millions of records, the job gets stuck almost at the end, and the only way to recover is to kill the job. All the time spent on inference is then wasted, because there is no way to know which part of the batch remains. This also happens too often: even if we could work out the leftover data for inference, the solution is practically unusable. We would appreciate any help here. Please let us know if there are any further inputs we can provide to help debug this.
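For anyone hitting the same problem of not knowing which records remain after killing a stuck job, here is a minimal sketch of one possible mitigation. It does not fix the hang and does not use any Ray API. It assumes every record has a unique ID and that results can be appended to a ledger file after each chunk. The names `run_with_ledger`, `load_done_ids`, `chunked`, and the JSON Lines ledger layout are all hypothetical, chosen only for illustration.

```python
import json
from pathlib import Path
from typing import Callable, Iterable, List, Set


def chunked(items: List[str], size: int) -> Iterable[List[str]]:
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def load_done_ids(ledger: Path) -> Set[str]:
    """Return the IDs already recorded in the ledger (empty if no ledger yet)."""
    if not ledger.exists():
        return set()
    with ledger.open() as f:
        return {json.loads(line)["id"] for line in f if line.strip()}


def run_with_ledger(
    record_ids: List[str],
    infer: Callable[[str], str],
    ledger: Path,
    chunk_size: int = 2,
) -> List[str]:
    """Process only the records missing from the ledger, chunk by chunk.

    A chunk is appended to the ledger only after every record in it has
    been processed, so a killed run loses at most one chunk of work.
    Returns the list of IDs that were pending when this run started.
    """
    done = load_done_ids(ledger)
    pending = [rid for rid in record_ids if rid not in done]
    for chunk in chunked(pending, chunk_size):
        results = [(rid, infer(rid)) for rid in chunk]
        with ledger.open("a") as f:
            for rid, result in results:
                f.write(json.dumps({"id": rid, "result": result}) + "\n")
    return pending
```

After a stuck run is killed, calling `run_with_ledger` again with the same IDs and ledger path skips everything already recorded and retries only the leftover records. The chunk size trades ledger write frequency against the amount of work lost when a run is killed.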
Stack trace from one of the idle actors, in case it helps:
Process 207: ray::_MapWorker
Python v3.10.14 (/home/ray/anaconda3/bin/python3.10)

Thread 207 (idle): "MainThread"
    epoll_wait (libc-2.31.so)
    boost::asio::detail::epoll_reactor::run (ray/_raylet.so)
    boost::asio::detail::scheduler::do_run_one (ray/_raylet.so)
    boost::asio::detail::scheduler::run (ray/_raylet.so)
    boost::asio::io_context::run (ray/_raylet.so)
    ray::core::CoreWorker::RunTaskExecutionLoop (ray/_raylet.so)
    ray::core::CoreWorkerProcessImpl::RunWorkerTaskExecutionLoop (ray/_raylet.so)
    ray::core::CoreWorkerProcess::RunTaskExecutionLoop (ray/_raylet.so)
    run_task_loop (ray/_raylet.so)
    main_loop (ray/_private/worker.py:876)
    <module> (ray/_private/workers/default_worker.py:289)

Thread 3778 (idle): "ThreadPoolExecutor-0_0"
    do_futex_wait.constprop.0 (libpthread-2.31.so)
    __new_sem_wait_slow.constprop.0 (libpthread-2.31.so)
    PyThread_acquire_lock_timed.localalias (python3.10)
    _queue_SimpleQueue_get_impl (_queuemodule.c:248)
    _queue_SimpleQueue_get (_queuemodule.c.h:175)
    _worker (concurrent/futures/thread.py:81)
    run (threading.py:953)
    _bootstrap_inner (threading.py:1016)
    _bootstrap (threading.py:973)
    thread_run (python3.10)
    clone (libc-2.31.so)

Thread 14118 (idle): "Thread-1"
    do_futex_wait.constprop.0 (libpthread-2.31.so)
    __new_sem_wait_slow.constprop.0 (libpthread-2.31.so)
    PyThread_acquire_lock_timed.localalias (python3.10)
    lock_PyThread_acquire_lock (python3.10)
    wait (threading.py:324)
    wait (threading.py:607)
    run (tqdm/_monitor.py:60)
    _bootstrap_inner (threading.py:1016)
    _bootstrap (threading.py:973)
    thread_run (python3.10)
    clone (libc-2.31.so)