Comments (3)
In our Ray deployment, some unknown condition causes tasks on certain preempted nodes to keep registering as "running" and the node to appear alive in the Ray Dashboard, even though the node is long gone. The node page on the Ray Dashboard renders as an empty screen, and the task continues "running" forever.
Hi @terraflops1048576, this seems like a Ray bug that we should fix. Could you elaborate more?
from ray.
I don't really have the ability to diagnose what's going on here. Opening the Chrome DevTools on the node page (http://<cluster ip>/#/cluster/nodes/<node id>) shows:
TypeError: Cannot read properties of undefined (reading '0')
at hc (NodeDetail.tsx:115:30)
at oo (react-dom.production.min.js:157:137)
...
which suggests to me that the cluster can't fetch the information for the node because it's gone. The node IP is unreachable over SSH, which suggests that the node has been preempted.
However, the task continues to show as "running" (blue) in the Ray Core Dashboard and never terminates. We ran into this because tasks appeared to run forever, and clicking on such a task to get the node information yields a blank screen. I have screenshots of the problem, but I'm not sure that they're helpful.
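The manual SSH check described above could be automated roughly as follows. This is only a sketch under assumptions: `tcp_reachable`, `suspect_nodes`, and `nodes_from_ray` are hypothetical helper names, port 22 is assumed to be open on healthy nodes, and `nodes_from_ray` assumes a Ray version that ships `ray.util.state.list_nodes` with `node_id`, `node_ip`, and `state` fields.

```python
import socket
from typing import Callable, Iterable, List, Mapping


def tcp_reachable(host: str, port: int = 22, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unroutable hosts.
        return False


def suspect_nodes(
    nodes: Iterable[Mapping[str, str]],
    probe: Callable[[str], bool] = tcp_reachable,
) -> List[str]:
    """IDs of nodes that are reported ALIVE but do not answer the probe."""
    return [
        n["node_id"]
        for n in nodes
        if n.get("state") == "ALIVE" and not probe(n["node_ip"])
    ]


def nodes_from_ray() -> List[Mapping[str, str]]:
    """Fetch node records from a running cluster (assumes ray.util.state exists)."""
    from ray.util.state import list_nodes  # imported lazily; needs Ray installed

    return [
        {"node_id": n.node_id, "node_ip": n.node_ip, "state": n.state}
        for n in list_nodes()
    ]
```

Calling `suspect_nodes(nodes_from_ray())` from a machine that can normally reach the workers would list the nodes Ray still believes are alive but that no longer respond. A failed probe is only a hint, since a firewall or a closed SSH port produces the same result.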
I should add that I understand that this information is certainly not sufficient to reproduce the bug, and I would love to collect information to track this down -- if I could be told what exactly to gather, because this seems to happen often enough.
I think that, at the very least, a CLI/API option to unstick tasks that get stuck in this state would be a useful workaround.
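As a starting point for gathering that information, the sketch below groups tasks reported as RUNNING by the node they are attributed to, restricted to a set of node IDs already suspected to be gone. It does not unstick anything; it only produces a report that could be attached to the issue. `running_tasks_on` and `tasks_from_ray` are hypothetical helper names, and `tasks_from_ray` assumes a Ray version that provides `ray.util.state.list_tasks` with `task_id`, `name`, `node_id`, and `state` fields.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Mapping, Set


def running_tasks_on(
    tasks: Iterable[Mapping[str, str]],
    node_ids: Set[str],
) -> Dict[str, List[str]]:
    """Map each given node ID to the IDs of tasks still reported RUNNING on it."""
    by_node: Dict[str, List[str]] = defaultdict(list)
    for t in tasks:
        if t.get("state") == "RUNNING" and t.get("node_id") in node_ids:
            by_node[t["node_id"]].append(t["task_id"])
    return dict(by_node)


def tasks_from_ray(limit: int = 10000) -> List[Mapping[str, str]]:
    """Fetch RUNNING task records from a cluster (assumes ray.util.state exists)."""
    from ray.util.state import list_tasks  # imported lazily; needs Ray installed

    return [
        {
            "task_id": t.task_id,
            "name": t.name,
            "node_id": t.node_id,
            "state": t.state,
        }
        for t in list_tasks(filters=[("state", "=", "RUNNING")], limit=limit)
    ]
```

For example, `running_tasks_on(tasks_from_ray(), {"<node id>"})` with the ID of a preempted node would show which task IDs are still attributed to it.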