
basnijholt / adaptive-scheduler


Run many functions (adaptively) on many cores (>10k-100k) using mpi4py.futures, ipyparallel, loky, or dask-mpi. :tada:

Home Page: http://adaptive-scheduler.readthedocs.io

License: BSD 3-Clause "New" or "Revised" License

Python 96.22% Jupyter Notebook 3.78%
active-learning adaptive adaptive-learning dask distributed-computing interactive ipyparallel loky mpi4py parallel-computing pbs python slurm

adaptive-scheduler's Introduction

Bas Nijholt 👋

  • 👷🏻‍♂️ Currently at IonQ, doing my bit in building a quantum computer; before that, I was at Microsoft Quantum.
  • 🌟 A deep dive into computational topological quantum mechanics earned me my PhD.
  • 🎨 I've crafted a few libraries for Home Assistant, making home automation a bit more fun.
  • ⚒️ Made other tools speed up and massively parallelize numerical simulations.
  • 🏅 Very passionate about open-source, software quality, user experience, and smooth performance.
  • 🐍 Python is my go-to language in most of my projects.
  • Some of my favorite creations:
    • 📈 python-adaptive/adaptive: Parallel active learning of mathematical functions? Check!
    • 🧬 unidep: Unifying pip and conda requirements, single command to set up a full dev environment.
    • 💡 adaptive-lighting: A custom component for Home Assistant to keep your lighting in sync with the sun.
    • 📝 markdown-code-runner: Run (hidden) code blocks right within your Markdown files - keep simple README.mds in sync!
    • 🕒 rsync-time-machine.py: Time Machine-style backups with rsync for the minimalists.
    • 🏠 home-assistant-config: Over 100 documented automations in my Home Assistant config

Below are some (automatically generated) statistics about my activity on GitHub. For more info check out my website www.nijho.lt or talk to me on Mastodon.


Last updated at 2024-05-12 12:08:51.127257.

GitHub statistics — my top 20

number of GitHub stars ⭐️

  1. basnijholt/adaptive-lighting, 1655 ⭐️s
  2. basnijholt/home-assistant-config, 1648 ⭐️s
  3. python-adaptive/adaptive, 1114 ⭐️s
  4. python-kasa/python-kasa, 1103 ⭐️s
  5. basnijholt/lovelace-ios-themes, 570 ⭐️s
  6. basnijholt/lovelace-ios-dark-mode-theme, 442 ⭐️s
  7. basnijholt/rsync-time-machine.py, 367 ⭐️s
  8. basnijholt/miflora, 361 ⭐️s
  9. topocm/topocm_content, 267 ⭐️s
  10. basnijholt/unidep, 209 ⭐️s
  11. basnijholt/home-assistant-streamdeck-yaml, 207 ⭐️s
  12. basnijholt/home-assistant-macbook-touch-bar, 94 ⭐️s
  13. kwant-project/kwant, 84 ⭐️s
  14. basnijholt/markdown-code-runner, 82 ⭐️s
  15. basnijholt/home-assistant-streamdeck-yaml-addon, 62 ⭐️s
  16. basnijholt/aiokef, 37 ⭐️s
  17. basnijholt/thesis-cover, 34 ⭐️s
  18. basnijholt/adaptive-scheduler, 26 ⭐️s
  19. basnijholt/instacron, 20 ⭐️s
  20. kwant-project/kwant-tutorial-2016, 19 ⭐️s

number of commits :octocat:

  1. basnijholt/home-assistant-config, 1769 commits :octocat:
  2. python-adaptive/adaptive, 1427 commits :octocat:
  3. basnijholt/adaptive-scheduler, 755 commits :octocat:
  4. basnijholt/adaptive-lighting, 554 commits :octocat:
  5. basnijholt/home-assistant-streamdeck-yaml, 313 commits :octocat:
  6. basnijholt/aiokef, 288 commits :octocat:
  7. ohld/igbot, 191 commits :octocat:
  8. basnijholt/lovelace-ios-themes, 161 commits :octocat:
  9. basnijholt/media_player.kef, 157 commits :octocat:
  10. basnijholt/hpc05, 152 commits :octocat:
  11. basnijholt/instacron, 115 commits :octocat:
  12. basnijholt/markdown-code-runner, 97 commits :octocat:
  13. basnijholt/basnijholt, 86 commits :octocat:
  14. basnijholt/home-assistant-streamdeck-yaml-addon, 80 commits :octocat:
  15. basnijholt/lovelace-ios-dark-mode-theme, 80 commits :octocat:
  16. basnijholt/home-assistant-macbook-touch-bar, 69 commits :octocat:
  17. conda-forge/kwant-feedstock, 65 commits :octocat:
  18. basnijholt/lovelace-ios-light-mode-theme, 65 commits :octocat:
  19. basnijholt/addon-otmonitor, 59 commits :octocat:
  20. basnijholt/codestructure, 52 commits :octocat:

These plots and stats are generated by this Jupyter notebook using this GitHub Action.

adaptive-scheduler's People

Contributors

basnijholt, github-actions[bot], jbweston, jorgectf, pre-commit-ci[bot], sbalk


adaptive-scheduler's Issues

add pathos backend

This works:

import adaptive_scheduler
from pathos.multiprocessing import ProcessPool

# Load a previously pickled learner.
learner = adaptive_scheduler.utils.fname_to_learner("data/offset_0.0__width_0.01.pickle")

# Evaluate the learner's function in a pathos process pool;
# ProcessPool.map blocks and returns the results directly.
ex = ProcessPool()
fut = ex.map(learner.function, [0])
fut
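
For an actual backend the submission would presumably need to be non-blocking. A minimal sketch of what that could look like with pathos, assuming the same pickled learner as above (amap is pathos' asynchronous map and returns a result object that can be polled):

import adaptive_scheduler
from pathos.multiprocessing import ProcessPool

learner = adaptive_scheduler.utils.fname_to_learner("data/offset_0.0__width_0.01.pickle")

pool = ProcessPool()
# amap submits the work without blocking; .get() collects the results later.
result = pool.amap(learner.function, [0, 1, 2])
print(result.get())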

Yellow text is difficult to read in jupyter lab

This is a quality-of-life issue, but it would be fantastic if the warning color for run metrics could be made a deeper shade of yellow. For example, the mean CPU usage is unreadable for me without using viewing-angle tricks on my monitor.

add NUMEXPR_NUM_THREADS

Set all of:

OPENBLAS_NUM_THREADS=1
MKL_NUM_THREADS=1
OMP_NUM_THREADS=1
NUMEXPR_NUM_THREADS=1

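A minimal sketch of the intent, assuming the variables are set before any worker imports numpy/numexpr (this only illustrates the requested set of variables, not where adaptive-scheduler actually exports them):

import os

# Pin every common threading backend to one thread per worker;
# NUMEXPR_NUM_THREADS is the variable this issue asks to add.
for var in (
    "OPENBLAS_NUM_THREADS",
    "MKL_NUM_THREADS",
    "OMP_NUM_THREADS",
    "NUMEXPR_NUM_THREADS",
):
    os.environ[var] = "1"
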
extra_script is missing in SLURM(BaseScheduler)

scheduler_kwargs = dict(
    num_threads=4,
    extra_scheduler=[
        "--exclusive",
        "--partition=partition",
    ],
    extra_script="umask 0000",  # or ["umask 0000"]
    executor_type="process-pool",
)

ends up having

#!/bin/bash
#SBATCH --ntasks 4
#SBATCH --no-requeue
#SBATCH --exclusive
#SBATCH --partition=partition

export MKL_NUM_THREADS=4
export OPENBLAS_NUM_THREADS=4
export OMP_NUM_THREADS=4
export NUMEXPR_NUM_THREADS=4

python ... 

Note that the lines from extra_script are missing.

Early requeue of jobs

(copy-paste from a chat)
Consider a case of 100 jobs, each taking about 30 minutes. When adaptive-scheduler submits these jobs it creates a SLURM job for each one, which requires allocating a node per job. But when the jobs are not very long, by the time the 50th node becomes available the earlier nodes may already be idle. So you can end up with 98 allocated nodes and 98% of the calculation finished, only waiting for the last nodes to come up.

I was wondering if more frequent checking and requeuing of jobs would be useful for adaptive-scheduler, or is it too much of a hassle?

Make it harder to accidentally "cancel jobs"

The "update info" and "cancel jobs" buttons are very close together in the interactive widget.

We could either increase the distance between the buttons, or require a confirmation before canceling jobs.
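
A hedged sketch of the confirmation idea using ipywidgets (the button names are hypothetical and this is not the actual code in adaptive_scheduler.widgets):

from ipywidgets import Button, HBox

cancel_button = Button(description="cancel jobs", button_style="danger")
confirm_button = Button(description="confirm cancel", button_style="warning")
confirm_button.layout.display = "none"  # hidden until the first click

def _arm(_):
    # The first click only reveals the confirmation button.
    confirm_button.layout.display = ""

def _really_cancel(_):
    confirm_button.layout.display = "none"
    # run_manager.cancel()  # the actual cancel call would go here

cancel_button.on_click(_arm)
confirm_button.on_click(_really_cancel)
HBox([cancel_button, confirm_button])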

Ideas 💡

  • Show serialization time in live info
  • Show serialized function size in info
  • Show total size of files (data/df/learners)

mpi4py not working on SLURM

The following never starts:

test.sbatch

#!/bin/bash
#SBATCH --ntasks 10
#SBATCH --no-requeue
#SBATCH --job-name test

export MKL_NUM_THREADS=1
export OPENBLAS_NUM_THREADS=1
export OMP_NUM_THREADS=1

srun --mpi=pmi2 -n 10 /gscratch/home/a-banijh/miniconda3/envs/py37_new/bin/python -m mpi4py.futures run_learner.py

run_learner.py

#!/usr/bin/env python3

import cloudpickle
from mpi4py import MPI
from mpi4py.futures import MPIPoolExecutor
MPI.pickle.__init__(cloudpickle.dumps, cloudpickle.loads)

if __name__ == "__main__":  # ← use this, see warning @ https://bit.ly/2HAk0GG
    executor = MPIPoolExecutor()
    executor.bootup()  # wait until all workers are up and running
    s = executor._pool.size
    print("yolo", s)

Running mpiexec -n 10 python -m mpi4py.futures run_learner.py does work!

forced --no-requeue

According to https://github.com/basnijholt/adaptive-scheduler/blob/master/adaptive_scheduler/scheduler.py#L594
the automatic requeueing by SLURM is disabled for jobs launched by adaptive-scheduler. I ran into an issue where the node hosting the job faltered and the job hung in a preparation state for a while (50 min). I was able to fix it by requeueing the job (one can override --no-requeue with scontrol later), and adaptive-scheduler happily picked up the job and showed it as running.

So, I was wondering: what is the reason behind the forced --no-requeue?

Calls to scheduler are not asynchronous

The Scheduler methods 'queue' and 'start_job' are blocking functions.

'start_new_jobs' is run in a ThreadPoolExecutor inside JobManager, but 'queue' is never run in a separate thread, and will block the event loop.

One fix would be to wrap these calls with 'loop.run_in_executor' (or 'asyncio.to_thread') everywhere these methods are called. An alternative is to make these methods async.

The first fix is easier to do, but also easier to footgun (forgetting to wrap a single call blocks the event loop again and hurts performance). The second fix would require more changes.
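
A minimal sketch of the first option, assuming a scheduler object with the blocking 'queue' method described above:

import asyncio

async def queue_async(scheduler):
    # Run the blocking Scheduler.queue() call in a worker thread so the
    # asyncio event loop keeps running while waiting for the result.
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(None, scheduler.queue)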

Flood of RuntimeErrors with latest adaptive-scheduler

When I launch an adaptive scheduler run with ~100 learners or more, I get a RuntimeError after some time, which is triggered by this line:

This happens with the latest adaptive-scheduler release, and previous releases back to 0.2.3 (when this code block was introduced).

The run manager continues to run (the exception is logged and swallowed before it bubbles to the top), but the notebook output gets filled to the point that the notebook crashes.

This condition is triggered when adaptive scheduler starts new jobs (because it sees that there are still learners to do), but when it actually tries to get an fname out of the database that meets the necessary conditions:

lambda e: e.job_id is None and not e.is_done and not e.is_pending,

it cannot find one.

Even more vexing: if I kill the run manager as soon as this condition is detected and inspect the state of the database I can run the 'get' and I do get an fname back. To me this indicates that there is some kind of race condition happening.

Adaptive goal is reached, but job does not shut down

When running adaptive scheduler, sometimes a learner completes, but the job does not shut down:
{"job_id": "289289", "log_fname": "adaptive-scheduler-1-289289.log", "job_name": "adaptive-scheduler-1", "event": "trying to get learner", "timestamp": "2020-01-14 18:10.46"}
{"event": "sent start signal, timeout after 10s.", "timestamp": "2020-01-14 18:10.46"}
{"reply": "[I DELETED MY PATH]", "event": "got reply", "timestamp": "2020-01-14 18:10.46"}
{"event": "got fname", "timestamp": "2020-01-14 18:10.46"}
{"event": "picked a learner", "timestamp": "2020-01-14 18:10.46"}
{"event": "started logger on hostname [DELETED HOSTNAME]", "timestamp": "2020-01-14 18:10.46"}
{"npoints": 100, "event": "npoints at start", "timestamp": "2020-01-14 18:10.46"}
{"status": "finished", "event": "runner status changed", "timestamp": "2020-01-14 18:10.46"}
{"elapsed_time": "0:00:00.000688", "overhead": 0, "npoints": 100, "cpu_usage": 1.6, "mem_usage": 2.2, "event": "current status", "timestamp": "2020-01-14 18:10.46"}
{"event": "goal reached! \ud83c\udf89\ud83c\udf8a\ud83e\udd73", "timestamp": "2020-01-14 18:10.46"}
{"fname": "[I DELETED MY PATH]", "event": "sent stop signal, timeout after 10s", "timestamp": "2020-01-14 18:10.46"}

Also, when running parse_log_files(), the log files of the running jobs don't show up.

todos

  • do not log everything in the notebook
  • don't start new learners when some are marked is_done

Can't show logs on error

Ran into this issue about 5 times, not sure how to consistently reproduce it.

-> 1078             box.children = (*new_children, log_explorer(self))
   1079 
   1080         buttons["cancel jobs"].on_click(cancel)

~/conda/envs/qms/lib/python3.7/site-packages/adaptive_scheduler/widgets.py in log_explorer(run_manager)
     97 
     98     fnames = _get_fnames(run_manager, only_running=False)
---> 99     text = _read_file(fnames[0]) if fnames else ""
    100     textarea = Textarea(text, layout=dict(width="auto"), rows=20)
    101     dropdown = Dropdown(options=fnames)

~/conda/envs/qms/lib/python3.7/site-packages/adaptive_scheduler/widgets.py in _read_file(fname)
     19 def _read_file(fname: Path) -> str:
     20     with fname.open() as f:
---> 21         return "".join(f.readlines())
     22 
     23 

~/conda/envs/qms/lib/python3.7/codecs.py in decode(self, input, final)
    320         # decode input (taking the buffer into account)
    321         data = self.buffer + input
--> 322         (result, consumed) = self._buffer_decode(data, self.errors, final)
    323         # keep undecoded input until the next call
    324         self.buffer = data[consumed:]

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte

[screenshot: adaptive-scheduler runner output]
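
Byte 0x8b at position 1 matches the gzip magic number (\x1f\x8b), so the widget was probably handed a compressed or otherwise non-UTF-8 file. A minimal, hedged sketch of a more forgiving reader (not the actual fix in adaptive_scheduler/widgets.py):

from pathlib import Path

def _read_file(fname: Path) -> str:
    # Replace undecodable bytes instead of raising, so the log explorer
    # can still open even if one file is binary or compressed.
    with fname.open(encoding="utf-8", errors="replace") as f:
        return f.read()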

Periodically call arbitrary callback

Currently adaptive-scheduler can periodically save the learner data, and it can call 'learner.to_dataframe' and save the resulting dataframe.

I would like to request the option to be able to additionally call an arbitrary callback.
My use-case is that I would like to save the learner data in a more efficient format.

Failing this I could potentially override the 'save' method for my learners; does adaptive-scheduler assume anything about the filenames/data format produced by 'save'?
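
A hypothetical sketch of the kind of callback meant here; the hook itself does not exist in adaptive-scheduler, and the function below only illustrates "save the learner data in a more efficient format":

import gzip
import pickle

def save_compact(learner, fname):
    # Hypothetical periodic callback: write the learner's data to a
    # compressed pickle instead of the default save format.
    # Assumes an adaptive learner that exposes its points via `.data`.
    with gzip.open(f"{fname}.pickle.gz", "wb") as f:
        pickle.dump(learner.data, f)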

Unicode error on template writing

I've been seeing some sporadic r/w issues with unicode. Difficult to reproduce, but it looks like modifying line 608 of server_support.py to:

with open(run_script_fname, "w", encoding="utf-8") as f:
    f.write(template)

fixes some things, since the template contains a Unicode character (a left arrow). @basnijholt, any objection to specifying the encoding?
