
Comments (4)

pravingadakh avatar pravingadakh commented on July 22, 2024

Adding one more observation: we have seen this issue only when jobs are submitted using the RayJob CRD. With a static Ray cluster and the ray job CLI for job submission, we do not see it.

from ray.

pravingadakh avatar pravingadakh commented on July 22, 2024
[Image: Screenshot 2024-05-21 at 10 20 40 PM] [Image: Screenshot 2024-06-10 at 11 30 02 PM]

Attaching screenshots of the Ray Dashboard taken while the job was stuck.


shallys avatar shallys commented on July 22, 2024

Team, this is a critical issue and a blocker for us in using Ray for batch inference in a predictable way. While running batch inference over millions of records, the job gets stuck almost at the end, and the only way to recover is to kill the job. The entire time spent on inference is then wasted, because there is no way to know which batches remain. This also happens so often that, even if we could work out the leftover data, the solution would be practically unusable. We would appreciate any help here. Please let us know if we can provide any further inputs to help debug this.


pravingadakh avatar pravingadakh commented on July 22, 2024

Stack trace from one of the idle actors, in case it helps:

Process 207: ray::_MapWorker
Python v3.10.14 (/home/ray/anaconda3/bin/python3.10)

Thread 207 (idle): "MainThread"
    epoll_wait (libc-2.31.so)
    boost::asio::detail::epoll_reactor::run (ray/_raylet.so)
    boost::asio::detail::scheduler::do_run_one (ray/_raylet.so)
    boost::asio::detail::scheduler::run (ray/_raylet.so)
    boost::asio::io_context::run (ray/_raylet.so)
    ray::core::CoreWorker::RunTaskExecutionLoop (ray/_raylet.so)
    ray::core::CoreWorkerProcessImpl::RunWorkerTaskExecutionLoop (ray/_raylet.so)
    ray::core::CoreWorkerProcess::RunTaskExecutionLoop (ray/_raylet.so)
    run_task_loop (ray/_raylet.so)
    main_loop (ray/_private/worker.py:876)
    <module> (ray/_private/workers/default_worker.py:289)
Thread 3778 (idle): "ThreadPoolExecutor-0_0"
    do_futex_wait.constprop.0 (libpthread-2.31.so)
    __new_sem_wait_slow.constprop.0 (libpthread-2.31.so)
    PyThread_acquire_lock_timed.localalias (python3.10)
    _queue_SimpleQueue_get_impl (_queuemodule.c:248)
    _queue_SimpleQueue_get (_queuemodule.c.h:175)
    _worker (concurrent/futures/thread.py:81)
    run (threading.py:953)
    _bootstrap_inner (threading.py:1016)
    _bootstrap (threading.py:973)
    thread_run (python3.10)
    clone (libc-2.31.so)
Thread 14118 (idle): "Thread-1"
    do_futex_wait.constprop.0 (libpthread-2.31.so)
    __new_sem_wait_slow.constprop.0 (libpthread-2.31.so)
    PyThread_acquire_lock_timed.localalias (python3.10)
    lock_PyThread_acquire_lock (python3.10)
    wait (threading.py:324)
    wait (threading.py:607)
    run (tqdm/_monitor.py:60)
    _bootstrap_inner (threading.py:1016)
    _bootstrap (threading.py:973)
    thread_run (python3.10)
    clone (libc-2.31.so)
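
For anyone comparing several such dumps, here is a minimal sketch that condenses a dump in the format above into one line per thread (ID, state, name, innermost frame). The function name summarize_dump and the shortened SAMPLE text are illustrative assumptions, not part of Ray or of any profiler's API. Applied to the trace above, it would show all three threads as idle: the main thread in epoll_wait under the core worker's task execution loop, a thread-pool worker waiting on its queue, and the tqdm monitor thread. That pattern suggests the actor is waiting for work rather than running user code.

```python
import re

# Matches thread headers such as: Thread 207 (idle): "MainThread"
THREAD_RE = re.compile(r'^Thread (\d+) \(([^)]*)\): "([^"]*)"')

# Shortened excerpt of the dump above, used only for illustration.
SAMPLE = '''\
Thread 207 (idle): "MainThread"
    epoll_wait (libc-2.31.so)
    ray::core::CoreWorker::RunTaskExecutionLoop (ray/_raylet.so)
Thread 3778 (idle): "ThreadPoolExecutor-0_0"
    do_futex_wait.constprop.0 (libpthread-2.31.so)
    _worker (concurrent/futures/thread.py:81)
Thread 14118 (idle): "Thread-1"
    do_futex_wait.constprop.0 (libpthread-2.31.so)
    run (tqdm/_monitor.py:60)
'''


def summarize_dump(text):
    """Return one (thread_id, state, name, innermost_frame) tuple per thread."""
    summary = []
    lines = text.splitlines()
    for i, line in enumerate(lines):
        match = THREAD_RE.match(line.strip())
        if not match:
            continue
        # The first line after a header is the innermost frame, unless the
        # thread has no frames and the next line is already another header.
        top = ""
        if i + 1 < len(lines) and not THREAD_RE.match(lines[i + 1].strip()):
            top = lines[i + 1].strip()
        summary.append((int(match.group(1)), match.group(2), match.group(3), top))
    return summary


if __name__ == "__main__":
    for thread_id, state, name, top in summarize_dump(SAMPLE):
        print(f"{thread_id:>6}  {state:<6}  {name:<24}  {top}")
```

Passing the full dump text instead of SAMPLE works the same way, since only the header lines and the line that follows each one are inspected.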

