
basnijholt / adaptive-scheduler


Run many functions (adaptively) on many cores (>10k-100k) using mpi4py.futures, ipyparallel, loky, or dask-mpi. 🎉

Home Page: http://adaptive-scheduler.readthedocs.io

License: BSD 3-Clause "New" or "Revised" License

Python 96.22% Jupyter Notebook 3.78%
parallel-computing distributed-computing adaptive-learning active-learning python ipyparallel mpi4py dask slurm pbs

adaptive-scheduler's Issues

Ideas 💡

  • Show serialization time in live info
  • Show serialized function size in info
  • Show total size of files (data/df/learners)

Todos

  • Do not log everything in the notebook
  • Don't start new learners when some are marked is_done

Early requeue of jobs

(copied from a chat)
Consider a case of 100 jobs, each taking about 30 minutes.
When adaptive-scheduler submits these jobs it creates one SLURM job for each, which requires allocating a node per job.
But when the jobs are not very long, by the time the 50th node becomes available the earlier nodes may already be idle.
So you end up with 98 allocated nodes and 98% of the calculation finished, waiting only for that last node to boot up.

I was wondering whether more frequent checking and requeuing of jobs would be useful for adaptive-scheduler, or is it too much of a hassle?
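A toy model of the scenario above (the one-node-every-2-minutes grant rate is an assumption for illustration, not adaptive-scheduler behavior): it shows how many of the 100 jobs have already finished by the time the last node is finally granted.

```python
# Toy model: 100 jobs of 30 min each; assume one node is granted every 2 min.
job_minutes = 30
grant_interval = 2
n_jobs = 100

last_grant = (n_jobs - 1) * grant_interval  # minute at which the 100th node arrives
# a job started at its grant time has finished once grant + job_minutes has passed
finished_by_then = sum(
    1 for i in range(n_jobs) if i * grant_interval + job_minutes <= last_grant
)
print(finished_by_then)  # jobs already done while the allocation is still growing
```

Under these assumed numbers, 85 of the 100 jobs are already finished before the final node even starts, which is the waste that early requeuing would avoid.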

extra_script is missing in SLURM(BaseScheduler)

scheduler_kwargs = dict(
    num_threads=4,
    extra_scheduler=[
        "--exclusive",
        "--partition=partition",
    ],
    extra_script="umask 0000",  # or ["umask 0000"]
    executor_type="process-pool",
)

ends up producing the following job script:

#!/bin/bash
#SBATCH --ntasks 4
#SBATCH --no-requeue
#SBATCH --exclusive
#SBATCH --partition=partition

export MKL_NUM_THREADS=4
export OPENBLAS_NUM_THREADS=4
export OMP_NUM_THREADS=4
export NUMEXPR_NUM_THREADS=4

python ... 

Note that the extra_script lines are missing from the output.
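For reference, a hypothetical sketch (not adaptive-scheduler's actual template code) of how extra_script lines could be appended when building the job script, covering both the string and the list form:

```python
def build_job_script(extra_scheduler, extra_script):
    # Hypothetical sketch of SLURM job-script assembly with extra_script support.
    if isinstance(extra_script, str):
        extra_script = [extra_script]  # accept a single string or a list of lines
    lines = ["#!/bin/bash"]
    lines += [f"#SBATCH {opt}" for opt in extra_scheduler]
    lines.append("")
    lines += extra_script  # these are the lines missing from the reported output
    return "\n".join(lines)

script = build_job_script(["--exclusive", "--partition=partition"], "umask 0000")
print(script)
```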

mpi4py not working on SLURM

The following never starts:

test.sbatch

#!/bin/bash
#SBATCH --ntasks 10
#SBATCH --no-requeue
#SBATCH --job-name test

export MKL_NUM_THREADS=1
export OPENBLAS_NUM_THREADS=1
export OMP_NUM_THREADS=1

srun --mpi=pmi2 -n 10 /gscratch/home/a-banijh/miniconda3/envs/py37_new/bin/python -m mpi4py.futures run_learner.py

run_learner.py

#!/usr/bin/env python3

import cloudpickle
from mpi4py import MPI
from mpi4py.futures import MPIPoolExecutor
MPI.pickle.__init__(cloudpickle.dumps, cloudpickle.loads)

if __name__ == "__main__":  # ← use this, see warning @ https://bit.ly/2HAk0GG
    executor = MPIPoolExecutor()
    executor.bootup()  # wait until all workers are up and running
    s = executor._pool.size
    print("yolo", s)

Running mpiexec -n 10 python -m mpi4py.futures run_learner.py does work!

add NUMEXPR_NUM_THREADS

Set all of:

  • OPENBLAS_NUM_THREADS=1
  • MKL_NUM_THREADS=1
  • OMP_NUM_THREADS=1
  • NUMEXPR_NUM_THREADS=1
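Setting all four variables from Python could look like this minimal sketch (the value of 1 assumes one worker per core, so threaded math libraries must not oversubscribe):

```python
import os

# Pin every common threaded-math library to one thread per worker so that
# parallel workers do not oversubscribe the cores they were allocated.
for var in ("OPENBLAS_NUM_THREADS", "MKL_NUM_THREADS",
            "OMP_NUM_THREADS", "NUMEXPR_NUM_THREADS"):
    os.environ[var] = "1"
```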

Calls to scheduler are not asynchronous

The Scheduler methods 'queue' and 'start_job' are blocking functions.

'start_new_jobs' is run in a ThreadPoolExecutor inside JobManager, but 'queue' is never run in a separate thread, and will block the event loop.

One fix would be to wrap these methods with asyncio.to_thread (or loop.run_in_executor) everywhere they are called. An alternative is to make the methods themselves async.

The first fix is easier to implement, but also easier to footgun (forgetting a single to_thread wrapper silently blocks the event loop and hurts performance). The second fix would require more changes.
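A minimal sketch of the first fix, with a hypothetical stand-in for the blocking Scheduler.queue call (asyncio.to_thread requires Python 3.9+):

```python
import asyncio
import time

def queue_status():
    # Hypothetical stand-in for the blocking Scheduler.queue call.
    time.sleep(0.1)  # simulates a slow `squeue` subprocess
    return {"12345": {"state": "RUNNING"}}

async def manage():
    # Run the blocking call in a worker thread so the event loop
    # stays responsive while the scheduler is being polled.
    return await asyncio.to_thread(queue_status)

status = asyncio.run(manage())
```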

Periodically call arbitrary callback

Currently adaptive-scheduler is capable of periodically saving the learner data, and calling 'learner.to_dataframe' and saving the resultant dataframe.

I would like to request the option to additionally call an arbitrary callback.
My use case is that I would like to save the learner data in a more efficient format.

Failing this I could potentially override the 'save' method for my learners; does adaptive-scheduler assume anything about the filenames/data format produced by 'save'?
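The requested feature could be sketched as a small asyncio task (this is an illustration of the pattern, not adaptive-scheduler's API; the names are hypothetical):

```python
import asyncio

async def call_periodically(interval, callback):
    # Invoke `callback` every `interval` seconds until the task is cancelled,
    # mirroring how a run manager could drive a user-supplied save hook.
    while True:
        callback()
        await asyncio.sleep(interval)

calls = []

async def main():
    task = asyncio.create_task(call_periodically(0.02, lambda: calls.append(1)))
    await asyncio.sleep(0.07)  # let the callback fire a few times
    task.cancel()

asyncio.run(main())
```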

Flood of RuntimeErrors with latest adaptive-scheduler

When I launch an adaptive scheduler run with ~100 learners or more, I get a RuntimeError after some time, which is triggered by this line:

This happens with the latest adaptive-scheduler release, and previous releases back to 0.2.3 (when this code block was introduced).

The run manager continues to run (the exception is logged and swallowed before it bubbles to the top), but the notebook output gets filled to the point that the notebook crashes.

This condition is triggered when adaptive scheduler starts new jobs (because it sees that there are still learners to do), but when it actually tries to get an fname out of the database that meets the necessary conditions:

lambda e: e.job_id is None and not e.is_done and not e.is_pending,

it cannot find one.

Even more vexing: if I kill the run manager as soon as this condition is detected and inspect the state of the database I can run the 'get' and I do get an fname back. To me this indicates that there is some kind of race condition happening.
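To make the query concrete, here is a self-contained sketch of what the predicate selects, using a hypothetical Entry stand-in whose field names match the lambda (not the actual database class):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Entry:
    # Hypothetical stand-in for a database entry; fields match the predicate.
    fname: str
    job_id: Optional[str] = None
    is_done: bool = False
    is_pending: bool = False

entries = [
    Entry("a.pickle", job_id="123"),     # already claimed by a running job
    Entry("b.pickle", is_done=True),     # finished
    Entry("c.pickle", is_pending=True),  # queued but not yet claimed
    Entry("d.pickle"),                   # free: the only entry the query matches
]

free = [e for e in entries if e.job_id is None and not e.is_done and not e.is_pending]
```

The RuntimeError corresponds to `free` coming back empty even though a matching entry exists moments later, which points at a race between marking entries pending and starting jobs.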

forced --no-requeue

According to https://github.com/basnijholt/adaptive-scheduler/blob/master/adaptive_scheduler/scheduler.py#L594
the automatic requeuing by SLURM is disabled for adaptive-scheduler jobs. I ran into an issue where the node hosting the job faltered and the job hung in preparation state for a while (50 min). I was able to fix it by requeuing the job (one can override --no-requeue with scontrol later), and adaptive-scheduler happily picked up the job and showed it as running.

So, I was wondering what's the reason behind forced --no-requeue?

Unicode error on template writing

I've been seeing some sporadic r/w issues with unicode. Difficult to reproduce, but it looks like modifying line 608 of server_support.py to:

with open(run_script_fname, "w", encoding="utf-8") as f:
    f.write(template)

fixes some things, since template contains a unicode character (a left arrow). @basnijholt, any objection to specifying the encoding?
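A quick self-contained demonstration of the fix: with an explicit encoding the arrow round-trips regardless of the platform's default locale (the file name here is just for illustration):

```python
import os
import tempfile

template = "#!/bin/bash\n# ← the arrow that trips up non-UTF-8 default encodings\n"

# Writing with an explicit encoding makes the result independent of the locale;
# without it, open() falls back to the platform's preferred encoding.
fname = os.path.join(tempfile.mkdtemp(), "run_learner.sh")
with open(fname, "w", encoding="utf-8") as f:
    f.write(template)

with open(fname, encoding="utf-8") as f:
    roundtrip = f.read()
```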

Make it harder to accidentally "cancel jobs"

The "update info" and "cancel jobs" buttons are very close together in the interactive widget.

We could either increase the distance between the buttons, or require a confirmation before canceling jobs.
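The confirmation option could work as a two-click arm-then-fire button; here is a plain-Python sketch of the logic (a real fix would wire this into the ipywidgets Button callback; the class name is hypothetical):

```python
class ConfirmButton:
    # Sketch of two-click confirmation: the first click only arms the button,
    # the second click actually performs the destructive action.
    def __init__(self, action):
        self.action = action
        self.armed = False

    def click(self):
        if not self.armed:
            self.armed = True
            return "click again to confirm"
        self.armed = False
        self.action()
        return "cancelled"

events = []
button = ConfirmButton(lambda: events.append("cancel jobs"))
first = button.click()
second = button.click()
```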

Yellow text is difficult to read in jupyter lab

This is a quality of life issue, but it would be fantastic if the warning color for run metrics could be made a deeper shade of yellow. For example, the mean cpu usage is unreadable for me without using viewing angle tricks on my monitor:
(screenshot omitted)

Adaptive goal is reached, but job does not shut down

When running adaptive scheduler, sometimes a learner completes, but the job does not shut down:
{"job_id": "289289", "log_fname": "adaptive-scheduler-1-289289.log", "job_name": "adaptive-scheduler-1", "event": "trying to get learner", "timestamp": "2020-01-14 18:10.46"}
{"event": "sent start signal, timeout after 10s.", "timestamp": "2020-01-14 18:10.46"}
{"reply": "[I DELETED MY PATH]", "event": "got reply", "timestamp": "2020-01-14 18:10.46"}
{"event": "got fname", "timestamp": "2020-01-14 18:10.46"}
{"event": "picked a learner", "timestamp": "2020-01-14 18:10.46"}
{"event": "started logger on hostname [DELETED HOSTNAME]", "timestamp": "2020-01-14 18:10.46"}
{"npoints": 100, "event": "npoints at start", "timestamp": "2020-01-14 18:10.46"}
{"status": "finished", "event": "runner status changed", "timestamp": "2020-01-14 18:10.46"}
{"elapsed_time": "0:00:00.000688", "overhead": 0, "npoints": 100, "cpu_usage": 1.6, "mem_usage": 2.2, "event": "current status", "timestamp": "2020-01-14 18:10.46"}
{"event": "goal reached! 🎉🎊🥳", "timestamp": "2020-01-14 18:10.46"}
{"fname": "[I DELETED MY PATH]", "event": "sent stop signal, timeout after 10s", "timestamp": "2020-01-14 18:10.46"}
(screenshot omitted)

Also, when running parse_log_files(), the log files of the currently running jobs don't show up.

add pathos backend

This works:

import adaptive_scheduler
from pathos.multiprocessing import ProcessPool

learner = adaptive_scheduler.utils.fname_to_learner("data/offset_0.0__width_0.01.pickle")

ex = ProcessPool()
results = ex.map(learner.function, [0])  # ProcessPool.map blocks and returns a list
results

Can't show logs on error

Ran into this issue about 5 times, not sure how to consistently reproduce it.

-> 1078             box.children = (*new_children, log_explorer(self))
   1079 
   1080         buttons["cancel jobs"].on_click(cancel)

~/conda/envs/qms/lib/python3.7/site-packages/adaptive_scheduler/widgets.py in log_explorer(run_manager)
     97 
     98     fnames = _get_fnames(run_manager, only_running=False)
---> 99     text = _read_file(fnames[0]) if fnames else ""
    100     textarea = Textarea(text, layout=dict(width="auto"), rows=20)
    101     dropdown = Dropdown(options=fnames)

~/conda/envs/qms/lib/python3.7/site-packages/adaptive_scheduler/widgets.py in _read_file(fname)
     19 def _read_file(fname: Path) -> str:
     20     with fname.open() as f:
---> 21         return "".join(f.readlines())
     22 
     23 

~/conda/envs/qms/lib/python3.7/codecs.py in decode(self, input, final)
    320         # decode input (taking the buffer into account)
    321         data = self.buffer + input
--> 322         (result, consumed) = self._buffer_decode(data, self.errors, final)
    323         # keep undecoded input until the next call
    324         self.buffer = data[consumed:]

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte

Here's the adaptive-scheduler runner output:
(screenshot omitted)
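The 0x8b byte suggests the widget is trying to read a gzip-compressed (or otherwise binary) log. One possible hardening, sketched here under the assumption that a lossy read is acceptable for display purposes, is to replace undecodable bytes instead of raising:

```python
import tempfile
from pathlib import Path

def read_file_tolerant(fname: Path) -> str:
    # Sketch of a more tolerant _read_file: substitute undecodable bytes
    # (e.g. the 0x8b from a gzipped log) instead of raising UnicodeDecodeError.
    return fname.read_text(encoding="utf-8", errors="replace")

p = Path(tempfile.mkdtemp()) / "bad.log"
p.write_bytes(b"\x1f\x8b ok")  # gzip magic bytes followed by plain text
text = read_file_tolerant(p)
```

Detecting the gzip magic bytes and decompressing before decoding would be a more complete fix, at the cost of extra logic.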
