
basnijholt / adaptive-scheduler


Run many functions (adaptively) on many cores (>10k-100k) using mpi4py.futures, ipyparallel, loky, or dask-mpi. 🎉

Home Page: http://adaptive-scheduler.readthedocs.io

License: BSD 3-Clause "New" or "Revised" License

Python 96.22% Jupyter Notebook 3.78%
parallel-computing distributed-computing adaptive-learning active-learning python ipyparallel mpi4py dask slurm pbs

adaptive-scheduler's Issues

Ideas 💡

  • Show serialization time in live info
  • Show serialized function size in info
  • Show total size of files (data/df/learners)

Todos

  • Do not log everything in the notebook
  • Don't start new learners when some are marked is_done

Early requeue of jobs

(copied from a chat)
Consider a case of 100 jobs, each taking about 30 minutes.
When adaptive-scheduler submits these jobs it creates one SLURM job for each, which requires allocating a node per job.
But when the jobs are not very long, by the time the 50th node becomes available the earlier nodes may already be idle.
So you end up with 98 allocated nodes and 98% of the calculation finished, waiting only for that last node to boot up.

I was wondering whether more frequent checking and requeuing of jobs would be useful for adaptive-scheduler, or is it too much of a hassle?
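A toy model of the scenario above (the one-node-every-2-minutes grant rate is an assumption for illustration, not adaptive-scheduler behavior): it shows how many of the 100 jobs have already finished by the time the last node is finally granted.

```python
# Toy model: 100 jobs of 30 min each; assume one node is granted every 2 min.
job_minutes = 30
grant_interval = 2
n_jobs = 100

last_grant = (n_jobs - 1) * grant_interval  # minute at which the 100th node arrives
# a job started at its grant time has finished once grant + job_minutes has passed
finished_by_then = sum(
    1 for i in range(n_jobs) if i * grant_interval + job_minutes <= last_grant
)
print(finished_by_then)  # jobs already done while the allocation is still growing
```

Under these assumed numbers, 85 of the 100 jobs are already finished before the final node even starts, which is the waste that early requeuing would avoid.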

extra_script is missing in SLURM(BaseScheduler)

scheduler_kwargs = dict(
    num_threads=4,
    extra_scheduler=[
        "--exclusive",
        "--partition=partition",
    ],
    extra_script="umask 0000",  # or ["umask 0000"]
    executor_type="process-pool",
)

ends up producing the following job script:

#!/bin/bash
#SBATCH --ntasks 4
#SBATCH --no-requeue
#SBATCH --exclusive
#SBATCH --partition=partition

export MKL_NUM_THREADS=4
export OPENBLAS_NUM_THREADS=4
export OMP_NUM_THREADS=4
export NUMEXPR_NUM_THREADS=4

python ... 

Note that the extra_script lines are missing from the output.
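For reference, a hypothetical sketch (not adaptive-scheduler's actual template code) of how extra_script lines could be appended when building the job script, covering both the string and the list form:

```python
def build_job_script(extra_scheduler, extra_script):
    # Hypothetical sketch of SLURM job-script assembly with extra_script support.
    if isinstance(extra_script, str):
        extra_script = [extra_script]  # accept a single string or a list of lines
    lines = ["#!/bin/bash"]
    lines += [f"#SBATCH {opt}" for opt in extra_scheduler]
    lines.append("")
    lines += extra_script  # these are the lines missing from the reported output
    return "\n".join(lines)

script = build_job_script(["--exclusive", "--partition=partition"], "umask 0000")
print(script)
```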

mpi4py not working on SLURM

The following never starts:

test.sbatch

#!/bin/bash
#SBATCH --ntasks 10
#SBATCH --no-requeue
#SBATCH --job-name test

export MKL_NUM_THREADS=1
export OPENBLAS_NUM_THREADS=1
export OMP_NUM_THREADS=1

srun --mpi=pmi2 -n 10 /gscratch/home/a-banijh/miniconda3/envs/py37_new/bin/python -m mpi4py.futures run_learner.py

run_learner.py

#!/usr/bin/env python3

import cloudpickle
from mpi4py import MPI
from mpi4py.futures import MPIPoolExecutor
MPI.pickle.__init__(cloudpickle.dumps, cloudpickle.loads)

if __name__ == "__main__":  # ← use this, see warning @ https://bit.ly/2HAk0GG
    executor = MPIPoolExecutor()
    executor.bootup()  # wait until all workers are up and running
    s = executor._pool.size
    print("yolo", s)

Running mpiexec -n 10 python -m mpi4py.futures run_learner.py does work!

add NUMEXPR_NUM_THREADS

Set all of:

  • OPENBLAS_NUM_THREADS=1
  • MKL_NUM_THREADS=1
  • OMP_NUM_THREADS=1
  • NUMEXPR_NUM_THREADS=1
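Setting all four variables from Python could look like this minimal sketch (the value of 1 assumes one worker per core, so threaded math libraries must not oversubscribe):

```python
import os

# Pin every common threaded-math library to one thread per worker so that
# parallel workers do not oversubscribe the cores they were allocated.
for var in ("OPENBLAS_NUM_THREADS", "MKL_NUM_THREADS",
            "OMP_NUM_THREADS", "NUMEXPR_NUM_THREADS"):
    os.environ[var] = "1"
```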

Calls to scheduler are not asynchronous

The Scheduler methods 'queue' and 'start_job' are blocking functions.

'start_new_jobs' is run in a ThreadPoolExecutor inside JobManager, but 'queue' is never run in a separate thread, and will block the event loop.

One fix would be to wrap these methods with asyncio.to_thread (or loop.run_in_executor) everywhere they are called. An alternative is to make the methods themselves async.

The first fix is easier to implement, but also easier to footgun (forgetting a single to_thread wrapper silently blocks the event loop and hurts performance). The second fix would require more changes.
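A minimal sketch of the first fix, with a hypothetical stand-in for the blocking Scheduler.queue call (asyncio.to_thread requires Python 3.9+):

```python
import asyncio
import time

def queue_status():
    # Hypothetical stand-in for the blocking Scheduler.queue call.
    time.sleep(0.1)  # simulates a slow `squeue` subprocess
    return {"12345": {"state": "RUNNING"}}

async def manage():
    # Run the blocking call in a worker thread so the event loop
    # stays responsive while the scheduler is being polled.
    return await asyncio.to_thread(queue_status)

status = asyncio.run(manage())
```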

Periodically call arbitrary callback

Currently adaptive-scheduler is capable of periodically saving the learner data, and calling 'learner.to_dataframe' and saving the resultant dataframe.

I would like to request the option to additionally call an arbitrary callback.
My use case is that I would like to save the learner data in a more efficient format.

Failing this I could potentially override the 'save' method for my learners; does adaptive-scheduler assume anything about the filenames/data format produced by 'save'?
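The requested feature could be sketched as a small asyncio task (this is an illustration of the pattern, not adaptive-scheduler's API; the names are hypothetical):

```python
import asyncio

async def call_periodically(interval, callback):
    # Invoke `callback` every `interval` seconds until the task is cancelled,
    # mirroring how a run manager could drive a user-supplied save hook.
    while True:
        callback()
        await asyncio.sleep(interval)

calls = []

async def main():
    task = asyncio.create_task(call_periodically(0.02, lambda: calls.append(1)))
    await asyncio.sleep(0.07)  # let the callback fire a few times
    task.cancel()

asyncio.run(main())
```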

Flood of RuntimeErrors with latest adaptive-scheduler

When I launch an adaptive scheduler run with ~100 learners or more, I get a RuntimeError after some time, which is triggered by this line:

This happens with the latest adaptive-scheduler release, and previous releases back to 0.2.3 (when this code block was introduced).

The run manager continues to run (the exception is logged and swallowed before it bubbles to the top), but the notebook output gets filled to the point that the notebook crashes.

This condition is triggered when adaptive scheduler starts new jobs (because it sees that there are still learners to do), but when it actually tries to get an fname out of the database that meets the necessary conditions:

lambda e: e.job_id is None and not e.is_done and not e.is_pending,

it cannot find one.

Even more vexing: if I kill the run manager as soon as this condition is detected and inspect the state of the database I can run the 'get' and I do get an fname back. To me this indicates that there is some kind of race condition happening.
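To make the query concrete, here is a self-contained sketch of what the predicate selects, using a hypothetical Entry stand-in whose field names match the lambda (not the actual database class):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Entry:
    # Hypothetical stand-in for a database entry; fields match the predicate.
    fname: str
    job_id: Optional[str] = None
    is_done: bool = False
    is_pending: bool = False

entries = [
    Entry("a.pickle", job_id="123"),     # already claimed by a running job
    Entry("b.pickle", is_done=True),     # finished
    Entry("c.pickle", is_pending=True),  # queued but not yet claimed
    Entry("d.pickle"),                   # free: the only entry the query matches
]

free = [e for e in entries if e.job_id is None and not e.is_done and not e.is_pending]
```

The RuntimeError corresponds to `free` coming back empty even though a matching entry exists moments later, which points at a race between marking entries pending and starting jobs.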

forced --no-requeue

According to https://github.com/basnijholt/adaptive-scheduler/blob/master/adaptive_scheduler/scheduler.py#L594
the automatic requeuing by SLURM is disabled for adaptive-scheduler jobs. I ran into an issue where the node hosting the job faltered and the job hung in preparation state for a while (50 min). I was able to fix it by requeuing the job (one can override --no-requeue with scontrol later), and adaptive-scheduler happily picked up the job and showed it as running.

So, I was wondering what's the reason behind forced --no-requeue?

Unicode error on template writing

I've been seeing some sporadic r/w issues with unicode. Difficult to reproduce, but it looks like modifying line 608 of server_support.py to:

with open(run_script_fname, "w", encoding="utf-8") as f:
    f.write(template)

fixes some things, since template contains a unicode character (a left arrow). @basnijholt, any objection to specifying the encoding?
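A quick self-contained demonstration of the fix: with an explicit encoding the arrow round-trips regardless of the platform's default locale (the file name here is just for illustration):

```python
import os
import tempfile

template = "#!/bin/bash\n# ← the arrow that trips up non-UTF-8 default encodings\n"

# Writing with an explicit encoding makes the result independent of the locale;
# without it, open() falls back to the platform's preferred encoding.
fname = os.path.join(tempfile.mkdtemp(), "run_learner.sh")
with open(fname, "w", encoding="utf-8") as f:
    f.write(template)

with open(fname, encoding="utf-8") as f:
    roundtrip = f.read()
```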

Make it harder to accidentally "cancel jobs"

The "update info" and "cancel jobs" buttons are very close together in the interactive widget.

We could either increase the distance between the buttons, or require a confirmation before canceling jobs.
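The confirmation option could work as a two-click arm-then-fire button; here is a plain-Python sketch of the logic (a real fix would wire this into the ipywidgets Button callback; the class name is hypothetical):

```python
class ConfirmButton:
    # Sketch of two-click confirmation: the first click only arms the button,
    # the second click actually performs the destructive action.
    def __init__(self, action):
        self.action = action
        self.armed = False

    def click(self):
        if not self.armed:
            self.armed = True
            return "click again to confirm"
        self.armed = False
        self.action()
        return "cancelled"

events = []
button = ConfirmButton(lambda: events.append("cancel jobs"))
first = button.click()
second = button.click()
```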

Yellow text is difficult to read in jupyter lab

This is a quality of life issue, but it would be fantastic if the warning color for run metrics could be made a deeper shade of yellow. For example, the mean cpu usage is unreadable for me without using viewing angle tricks on my monitor:
(screenshot omitted)

Adaptive goal is reached, but job does not shut down

When running adaptive scheduler, sometimes a learner completes, but the job does not shut down:
{"job_id": "289289", "log_fname": "adaptive-scheduler-1-289289.log", "job_name": "adaptive-scheduler-1", "event": "trying to get learner", "timestamp": "2020-01-14 18:10.46"}
{"event": "sent start signal, timeout after 10s.", "timestamp": "2020-01-14 18:10.46"}
{"reply": "[I DELETED MY PATH]", "event": "got reply", "timestamp": "2020-01-14 18:10.46"}
{"event": "got fname", "timestamp": "2020-01-14 18:10.46"}
{"event": "picked a learner", "timestamp": "2020-01-14 18:10.46"}
{"event": "started logger on hostname [DELETED HOSTNAME]", "timestamp": "2020-01-14 18:10.46"}
{"npoints": 100, "event": "npoints at start", "timestamp": "2020-01-14 18:10.46"}
{"status": "finished", "event": "runner status changed", "timestamp": "2020-01-14 18:10.46"}
{"elapsed_time": "0:00:00.000688", "overhead": 0, "npoints": 100, "cpu_usage": 1.6, "mem_usage": 2.2, "event": "current status", "timestamp": "2020-01-14 18:10.46"}
{"event": "goal reached! 🎉🎊🥳", "timestamp": "2020-01-14 18:10.46"}
{"fname": "[I DELETED MY PATH]", "event": "sent stop signal, timeout after 10s", "timestamp": "2020-01-14 18:10.46"}
(screenshot omitted)

Also, when running parse_log_files(), the log files of the currently running jobs don't show up.

add pathos backend

This works:

import adaptive_scheduler
from pathos.multiprocessing import ProcessPool

learner = adaptive_scheduler.utils.fname_to_learner("data/offset_0.0__width_0.01.pickle")

ex = ProcessPool()
results = ex.map(learner.function, [0])  # ProcessPool.map blocks and returns a list
results

Can't show logs on error

Ran into this issue about 5 times, not sure how to consistently reproduce it.

-> 1078             box.children = (*new_children, log_explorer(self))
   1079 
   1080         buttons["cancel jobs"].on_click(cancel)

~/conda/envs/qms/lib/python3.7/site-packages/adaptive_scheduler/widgets.py in log_explorer(run_manager)
     97 
     98     fnames = _get_fnames(run_manager, only_running=False)
---> 99     text = _read_file(fnames[0]) if fnames else ""
    100     textarea = Textarea(text, layout=dict(width="auto"), rows=20)
    101     dropdown = Dropdown(options=fnames)

~/conda/envs/qms/lib/python3.7/site-packages/adaptive_scheduler/widgets.py in _read_file(fname)
     19 def _read_file(fname: Path) -> str:
     20     with fname.open() as f:
---> 21         return "".join(f.readlines())
     22 
     23 

~/conda/envs/qms/lib/python3.7/codecs.py in decode(self, input, final)
    320         # decode input (taking the buffer into account)
    321         data = self.buffer + input
--> 322         (result, consumed) = self._buffer_decode(data, self.errors, final)
    323         # keep undecoded input until the next call
    324         self.buffer = data[consumed:]

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte

Here's the adaptive-scheduler runner output:
(screenshot omitted)
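The 0x8b byte suggests the widget is trying to read a gzip-compressed (or otherwise binary) log. One possible hardening, sketched here under the assumption that a lossy read is acceptable for display purposes, is to replace undecodable bytes instead of raising:

```python
import tempfile
from pathlib import Path

def read_file_tolerant(fname: Path) -> str:
    # Sketch of a more tolerant _read_file: substitute undecodable bytes
    # (e.g. the 0x8b from a gzipped log) instead of raising UnicodeDecodeError.
    return fname.read_text(encoding="utf-8", errors="replace")

p = Path(tempfile.mkdtemp()) / "bad.log"
p.write_bytes(b"\x1f\x8b ok")  # gzip magic bytes followed by plain text
text = read_file_tolerant(p)
```

Detecting the gzip magic bytes and decompressing before decoding would be a more complete fix, at the cost of extra logic.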
