
slurminade's Introduction

slurminade - A decorator-based slurm runner for Python-code.


slurminade makes using the workload manager slurm with Python beautiful. It is based on simple_slurm, but instead of just letting you comfortably execute shell commands in slurm, it allows you to directly distribute Python functions. A function decorated with @slurminade.slurmify(partition="alg") will automatically be executed by a node of the partition alg by just calling .distribute(yes_also_args_are_allowed). The general idea is that the corresponding Python code exists on both machines; thus, the slurm node can also call the functions of the original code if you tell it which one to call and with which arguments. This is similar to celery, but you do not need to install anything; just make sure the same Python environment is available on the nodes (usually the case in a proper slurm setup).

Please check the documentation of simple_slurm to learn more about the possible parameters. You can also use srun and sbatch directly through slurminade (automatically applying the configuration specified with slurminade).

slurminade has two design goals:

  1. Pythonic slurm: Use slurm in a Pythonic way, without any shell commands etc.
  2. Compatibility: Scripts can also run without slurm. You can share a script, and people without slurm can execute it without any changes.

We use it to empirically evaluate optimization algorithms for research papers on hundreds of instances, each of which can require 15 min to solve. With slurminade, we can distribute the workload by changing just a few lines of code in our local Python scripts (the ones used for probing and development before running big experiments). An example of such a usage can be found here: Example of an empirical algorithm performance study for graph coloring heuristics using slurminade and AlgBench. You will find the original runner and the slurmified runner, showing how simple it is to distribute your experiments with slurminade.

A simple script could look like this:

import slurminade

slurminade.update_default_configuration(
    partition="alg", exclusive=True
)  # global options for slurm

# If no slurm environment is found, the functions are called directly to make scripts
# compatible with any environment.
# You can enforce slurm with `slurminade.set_dispatcher(slurminade.SlurmDispatcher())`


@slurminade.node_setup
def setup():
    print("I will run automatically on every slurm node at the beginning!")


# use this decorator to make a function distributable with slurm
@slurminade.slurmify(
    constraint="alggen02"
)  # function specific options can be specified
def prepare():
    print("Prepare")


@slurminade.slurmify()
def f(foobar):
    print(f"f({foobar})")


@slurminade.slurmify()
def clean_up():
    print("Clean up")


if __name__ == "__main__":
    prepare.distribute()
    slurminade.join()  # make sure that no job runs before prepare has finished
    with slurminade.JobBundling(max_size=20):  # automatically bundles up to 20 tasks
        # run 100x f after `prepare` has finished
        for i in range(100):
            f.distribute(i)

    slurminade.join()  # make sure that the clean-up job runs after all f-jobs have finished
    clean_up.distribute()

If slurm is not available, distribute results in a local function call. The same holds for srun and sbatch (which adds some extra value on top of just forwarding to simple_slurm).
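
For example, shell commands could be dispatched like this (a minimal sketch; the echoed commands are just placeholders):

import slurminade

slurminade.update_default_configuration(partition="alg")

if __name__ == "__main__":
    # Both calls use the slurminade configuration; without a slurm
    # environment they fall back to local execution.
    slurminade.srun("echo 'hello from a node'")
    slurminade.sbatch("echo 'queued without blocking'")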

Warning

Always use JobBundling when distributing many small tasks to few nodes. Slurm jobs have a certain overhead, and you do not want to spam your infrastructure with too many of them. However, function calls joined by JobBundling are considered a single job by slurm and are thus not shared across nodes.

What are the limitations of slurminade? Slurminade reconstructs the environment by loading the code on the slurm node (without the __main__-part) and then calling the slurmified function with its arguments serialized as JSON. This means that the code must live in a common .py-file and all (distributed) function arguments must be JSON-serializable. Also, the function must not use any global state (e.g., global variables, file or database connections) initialized in the __main__-part. Additionally, the Python environment must be available under the same path on the slurm node, as slurminade uses the same paths there to reconstruct the environment (which allows the use of virtual environments).

Does slurminade work with Python 2? No, it is a Python 3 project. We tested it with Python 3.7 and higher.

Does slurminade work with Windows? Probably not, but I have never seen a slurm cluster running on Windows. The (automatic) slurm-less mode should work on Windows, so your code will run, but all function calls will be local.

Are multi-file projects supported? Yes, as long as the files are available on the slurm node.

Does slurminade work with virtual environments? Yes. We recommend using slurminade with conda. We have not tested it with other virtual environments.

Can I run my slurmified code outside a slurm environment? Yes, if you do not have slurm, the distributed functions are run as normal Python function calls. This means that you can share the same code with people that do not have slurm. It was important to us that the experimental evaluations we run on our slurm cluster can also be run in a common Python environment by reviewers without any changes.

Can I receive the return value of a slurmified function? No, the return value is not transmitted back to the caller. Note that the distribute-calls are non-blocking, i.e., they return immediately. Return values could be implemented via a Promise-like object, as in other distributed computing frameworks, but we have not seen the need for it yet. We usually save the results in a database or in files, e.g., using AlgBench.
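
A common pattern is to persist results from within the slurmified function; here is a minimal sketch (the file layout and the placeholder computation are purely illustrative, not part of the slurminade API):

import json
from pathlib import Path

import slurminade


@slurminade.slurmify()
def solve(instance_name, result_file):
    result = {"instance": instance_name, "value": 42}  # placeholder computation
    # Persist the result instead of returning it; the path must be on
    # storage shared between the submitting machine and the nodes.
    Path(result_file).parent.mkdir(parents=True, exist_ok=True)
    Path(result_file).write_text(json.dumps(result))


if __name__ == "__main__":
    solve.distribute("instance_1", "results/instance_1.json")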

Can I use command line arguments ``sys.argv`` in my scripts? Yes, but only in your __main__-part. The arguments are not transmitted to the slurm nodes, as they are not part of the function call. You can pass them as normal function arguments to your slurmified functions if needed. It is important that your global objects do not rely on these arguments for initialization, as the __main__-part is not executed on the slurm node. It would be theoretically possible to transmit the arguments to the slurm node, but we have not seen the need for it. Let us know if you need it and we may implement it.
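
A sketch of that pattern: parse sys.argv in the __main__-part and forward the values as plain, JSON-serializable arguments (the parameters n_runs and seed are just examples):

import sys

import slurminade


@slurminade.slurmify()
def experiment(n_runs, seed):
    # Only these explicit arguments arrive on the node; sys.argv does not.
    print(f"running {n_runs} repetitions with seed {seed}")


if __name__ == "__main__":
    # sys.argv is only available here, on the submitting machine.
    n_runs = int(sys.argv[1]) if len(sys.argv) > 1 else 10
    seed = int(sys.argv[2]) if len(sys.argv) > 2 else 0
    experiment.distribute(n_runs, seed)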

The code is super simple and open source, don’t be afraid to create a fork that fits your own needs.

Note

Talk with your system administrator or supervisor to get the proper slurm configuration.

Installation

You can install slurminade with pip install slurminade.

Usage

You can set task-specific slurm arguments within the decorator, e.g., @slurminade.slurmify(constraint="alggen03"). These arguments are passed directly to simple_slurm, so all of its arguments are supported.

In order for slurminade to work, the code needs to be in a Python file/project shared by all slurm nodes; otherwise, slurminade will not find the corresponding function. The slurmified functions must also be importable, i.e., defined at the top level. Currently, all function names must be unique, as slurminade only transmits the function’s name.

Don’t do:

Bad: Non-blocking system calls

import slurminade
import subprocess


@slurminade.slurmify()
def run_shell_command():
    # non-blocking system call
    subprocess.Popen("complex call")
    # BAD! The system call runs outside of slurm! The slurm task terminates immediately.

instead use

import slurminade

if __name__ == "__main__":
    slurminade.sbatch(
        "complex call"
)  # forwards your call to simple_slurm, which is better suited for such things.

Bad: Global variables in the __main__ part

import slurminade

FLAG = True


@slurminade.slurmify()
def bad_global(args):
    if FLAG:  # BAD! Will be True because the __main__ Part is not executed on the node.
        pass
    else:
        pass


if __name__ == "__main__":
    FLAG = False
    bad_global.distribute("args")

instead do

import slurminade


@slurminade.slurmify()
def bad_global(
    args, FLAG
):  # Now the flag is passed correctly as an argument. Note that only json-compatible arguments are possible.
    if FLAG:
        pass
    else:
        pass


# Without the `if`, the node would also execute this part (*slurminade* will abort automatically)
if __name__ == "__main__":
    FLAG = False
    bad_global.distribute("args", FLAG)

Warning

The same is true for any global state such as file or database connections. You can use global variables, but be wary of side effects.

Error: Complex objects as arguments

import slurminade


@slurminade.slurmify()
def sec_order_func(func):
    func()


def f():
    print("hello")


def g():
    print("world!")


if __name__ == "__main__":
    sec_order_func.distribute(f)  # will throw an exception
    sec_order_func.distribute(g)

Instead, create individual slurmified functions for each call, or pass a simple identifier that lets the function deduce what to do, e.g., via a switch-case. If you really need to pass complex objects, you can also pickle the object and pass only the file name.
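
A sketch of both workarounds (the FUNCTIONS table and do_something are illustrative assumptions):

import pickle

import slurminade


def f():
    print("hello")


def g():
    print("world!")


# A module-level switch-case: the identifier is JSON-serializable, and the
# table is rebuilt on the node when the code is loaded there.
FUNCTIONS = {"f": f, "g": g}


@slurminade.slurmify()
def sec_order_func(func_name):
    FUNCTIONS[func_name]()


@slurminade.slurmify()
def process_pickled(path):
    # For genuinely complex objects: pass the file name of a pickled
    # object; the file must be readable from the slurm node.
    with open(path, "rb") as file:
        obj = pickle.load(file)
    obj.do_something()  # hypothetical method of the unpickled object


if __name__ == "__main__":
    sec_order_func.distribute("f")
    sec_order_func.distribute("g")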

Default configuration

You can set up a default configuration in ~/.slurminade_default.json. It should simply be a dictionary of arguments for simple_slurm. For example:

{
  "partition": "alg"
}

The current version checks the following files in order, with later files overwriting values from earlier ones:

  1. ~/.slurminade_default.json
  2. ~/$XDG_CONFIG_HOME/slurminade/.slurminade_default.json
  3. ./.slurminade_default.json

Debugging

You can use

import slurminade

slurminade.set_dispatcher(slurminade.TestDispatcher())

to see the serialization or

import slurminade

slurminade.set_dispatcher(slurminade.SubprocessDispatcher())

to distribute the tasks without slurm using subprocesses.

If there is a bug, you will see it directly in the output (at least for most bugs).

Project structure

The project structure is reasonably simple:

  • bundling.py: Contains code for bundling tasks, so we don’t spam slurm with too many.
  • conf.py: Contains code for managing the configuration of slurm.
  • dispatcher.py: Contains code for actually dispatching tasks to slurm.
  • execute.py: Contains code to execute the task on the slurm node.
  • function.py: Contains the code for making a function slurm-compatible.
  • function_map.py: Saves all the slurmified functions.
  • guard.py: Contains code to prevent you from accidentally DDoSing your infrastructure.
  • options.py: Contains a simple data structure to save slurm options.

Changes

  • 1.1.1: Fixed a bug occurring when there is output to stdout while loading the code, such as deprecation warnings.
  • 1.1.0: Slurminade can now be called from iPython, too! exec has been renamed to shell to prevent confusion with the Python built-in exec, which evaluates a string as Python code.
  • 1.0.1: Dispatchers now return job references instead of job ids. This allows some fancier stuff in the future, when job infos are only available for a short time after the job has been submitted.
  • 0.10.1: FIX: Listing functions will no longer execute setup functions.
  • 0.10.0: Batch is now named JobBundling. There is a method join for easier synchronization. exec allows executing commands just like srun and sbatch, but with a syntax uniform with other slurmified functions. Functions can now also be called with distribute_and_wait. If you call python3 -m slurminade.check --partition YOUR_PARTITION --constraint YOUR_CONSTRAINT, you can check whether your slurm configuration is working correctly.
  • 0.9.0: Lots of improvements.
  • 0.8.1: Bugfix and automatic detection of wrong usage when using Batch with wait_for.
  • 0.8.0: Added extensive logging and improved typing.
  • 0.7.0: Warning if a Batch is flushed multiple times, as we noticed this to be a common indentation error.
  • 0.6.2: Fixes recursive distribution guard, which seemed to be broken.
  • 0.6.1: Bugfixes in naming
  • 0.6.0: Automatic naming of tasks.
  • 0.5.5: Fixing a guard bug in the subprocess dispatcher.
  • 0.5.4: Dispatched function calls that are too long for the command line now use a temporary file instead.
  • 0.5.3: Fixed a bug that caused the dispatch limit to have no effect.
  • 0.5.2: Added pyproject.toml for PEP compliance
  • 0.5.1: Batch will now flush on delete, in case you forgot.
  • 0.5.0:
    • Functions now have a wait_for-option and return job ids.
    • Breaking changes: Batches have a new API.
      • add is no longer needed.
      • AutoBatch is now called Batch.
    • Fundamental code changes under the hood.
  • <0.5.0:
    • Lots of experiments on finding the right interface.

Contributors

This project is developed at the Algorithms Group at TU Braunschweig, Germany. The lead developer is Dominik Krupke. Further contributors are Matthias Konitzny and Patrick Blumenberg.

Similar Projects

  • This project is greatly inspired by Celery, but does not require any additional infrastructure except for slurm.
  • If you want a more powerful library that can, e.g., also distribute lambdas or functions with complex arguments, check out submitit. It is a great project, and we may use it as a backend in the future. However, it does not support the slurm-less mode and can easily hide non-deterministic errors. Slurminade, on the other hand, is deliberately restricted to support reproducible scripts that can also be run without slurm.


slurminade's Issues

Allow slurm configuration update outside of module level

Currently, the slurm configuration must be set at module level, as the decorator directly creates a SimpleSlurm object with the current config.
This behaviour can be inconvenient in cases where the slurm configuration is computed dynamically or loaded from a separate configuration file. The SimpleSlurm object should therefore be created only after calling .distribute, so that prior slurminade.update_default_configuration() calls can take effect.
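
A minimal sketch of the inconvenient case described here (the configuration source is illustrative):

import json

import slurminade


@slurminade.slurmify()  # the decorator already captures the config here
def task():
    pass


if __name__ == "__main__":
    # E.g., loaded from a separate configuration file at runtime; according
    # to this issue, it comes too late to affect `task`.
    config = json.loads('{"partition": "alg"}')
    slurminade.update_default_configuration(**config)
    task.distribute()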

Provide some options for machine dependent "preparations".

Some machines need specific preparations before a job can run. Provide some simple functionality to run local preparation code if it is available.
This could look like this:

  • Add a local_setup() function at the beginning of your script.
  • This function checks if a ~/.pythonrc.py exists and executes it if so.

The important use case for us is setting up the Gurobi license, which is individual for every workstation. We usually do this via .bashrc resp. .zshrc, but those are not executed when using slurm.
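
A rough sketch of the proposed helper (local_setup is not part of slurminade; this is only the behavior suggested above):

from pathlib import Path


def local_setup():
    # Run machine-specific preparation code, e.g., setting up the Gurobi
    # license, if the user provides it. Executed on the local machine only.
    rc = Path.home() / ".pythonrc.py"
    if rc.is_file():
        exec(rc.read_text())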

Dispatch limit does not seem to trigger when using Batch

I have just been able to spawn 500 jobs despite a dispatch limit of 100.

The code

from aemeasure import MeasurementSeries, read_as_pandas_table, Database
from samplns.lns.neighborhood import RandomNeighborhood, Neighborhood
import slurminade
import os

from samplns.simple import ConvertingLns
from samplns.utils import Timer

from _utils import get_instance, parse_solution_overview, parse_sample
from samplns.lns.lns import LnsLogger

slurminade.update_default_configuration(
    partition="alg",
    constraint="alggen03",
    mail_user="**********",
    mail_type="FAIL",
)

slurminade.set_dispatch_limit(100)

ITERATIONS = 10000
ITERATION_TIME_LIMIT = 60.0
TIME_LIMIT = 900

BASE = "900_seconds_1_it"
INPUT_SAMPLE_ARCHIVE = f"./baseline/{BASE}.zip"
INSTANCE_ARCHIVE = "./00_benchmark_instances.zip"
RESULT_FOLDER = f"01_results/{BASE}_{TIME_LIMIT}"


class MyLnsLogger(LnsLogger):
    def __init__(self):
        self.timer = Timer(0)
        self.iterations = []

    def report_neighborhood_optimization(self, neighborhood: Neighborhood):
        self.iterations[-1]["nbrhd_tuples"] = len(neighborhood.missing_tuples)
        self.iterations[-1]["nbrhd_confs"] = len(neighborhood.initial_solution)

    def report_iteration_begin(self, iteration: int):
        self.iterations.append({})

    def report_iteration_end(
        self, iteration: int, runtime: float, lb: int, solution, events
    ):
        self.iterations[-1].update(
            {
                "iteration": iteration,
                "lb": lb,
                "ub": len(solution),
                "time": self.timer.time(),
                "iteration_time": runtime,
                "events": events,
            }
        )

    def __call__(self, *args, **kwargs):
        print(f"LOG[{self.timer.time()}]", *args)


@slurminade.slurmify
def optimize(instance_name, solution_path):
    configure_grb_license_path()
    try:
        instance = get_instance(instance_name, INSTANCE_ARCHIVE)
    except Exception as e:
        print("Skipping due to parser error:", instance_name, str(e))
        return
    sample = parse_sample(sample_path=solution_path, archive_path=INPUT_SAMPLE_ARCHIVE)

    with MeasurementSeries(RESULT_FOLDER) as ms:
        with ms.measurement() as m:
            m["instance"] = instance_name
            m["initial_sample_path"] = solution_path
            m["iteration_time_limit"] = ITERATION_TIME_LIMIT
            m["iterations"] = ITERATIONS
            m["time_limit"] = TIME_LIMIT

            # setup (needs time measurement as already involves calculations)
            logger = MyLnsLogger()
            solver = ConvertingLns(
                instance=instance,
                initial_solution=sample,
                neighborhood_selector=RandomNeighborhood(),
                logger=logger,
            )

            solver.optimize(
                iterations=ITERATIONS,
                iteration_timelimit=ITERATION_TIME_LIMIT,
                timelimit=TIME_LIMIT,
            )

            m["solution"] = solver.get_best_solution(verify=True)
            m["lower_bound"] = solver.get_lower_bound()
            m["upper_bound"] = len(solver.get_best_solution())
            m["optimal"] = solver.get_lower_bound() == len(solver.get_best_solution())
            m["runtime"] = m.time().total_seconds()
            m["iteration_info"] = logger.iterations


@slurminade.slurmify(mail_type="ALL")
def pack_after_finish():
    Database(RESULT_FOLDER).compress()


def configure_grb_license_path():
    # hack for gurobi license on alg workstations. TODO: Find a nicer way
    import socket
    from pathlib import Path

    os.environ["GRB_LICENSE_FILE"] = os.path.join(
        Path.home(), ".gurobi", socket.gethostname(), "gurobi.lic"
    )
    if not os.path.exists(os.environ["GRB_LICENSE_FILE"]):
        raise RuntimeError("Gurobi License File does not exist.")


if __name__ == "__main__":
    samples = parse_solution_overview(INPUT_SAMPLE_ARCHIVE)
    data = read_as_pandas_table(RESULT_FOLDER)
    already_done = data["initial_sample_path"].unique() if len(data) else []
    with slurminade.Batch(max_size=1) as batch:
        for idx in samples.index:
            if not samples["Path"][idx]:
                print("Skipping unsuccessful row", samples.loc[idx])
                continue
            if samples["#Variables"][idx] > 1500:
                print("Skipping",samples["Instance"][idx],"due to its size.")
                continue
            path = samples["Path"][idx]
            if path in already_done:
                print("Skipping", path)
                continue
            instance = samples["Instance"][idx]
            if "uclibc" in instance:
                print("Skipping uclibc instance! They seem to be inconsistent.")
                continue
            optimize.distribute(instance, path)
        pack_after_finish.wait_for(batch.flush()).distribute()

Write an introduction into the most important slurm options

It is difficult for beginners, i.e., our students, to grasp all the possibilities and options of slurm. Write a short introduction to the various options.

Important scenarios are:

  1. The jobs have to run on specific hosts.
  2. Multiple jobs can run in parallel on the same machine, but they have requirements regarding CPU cores and memory (see the sketch after this list).
  3. Using slurminade as a replacement for slurm bash scripts.
  4. Using default configurations.
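
A sketch of how the first two scenarios could look, using standard sbatch options (nodelist, cpus_per_task, mem, time) that are passed through to simple_slurm; the host name is hypothetical:

import slurminade


# Scenario 1: the job has to run on specific hosts.
@slurminade.slurmify(nodelist="algws01")  # hypothetical host name
def on_specific_host():
    pass


# Scenario 2: several jobs share a machine, each with bounded resources.
@slurminade.slurmify(cpus_per_task=4, mem="8G", time="01:00:00")
def resource_bounded_job():
    pass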

It is easy to use `wait_for` wrong with `Batch`.

The batch automatically flushes on exit, which makes it important to use .wait_for(batch.flush()) within the context. Otherwise, all information about the distributed job ids is lost. This can be a difficult-to-spot bug. It should be reasonably easy to detect whether flush was called without any jobs and to give a warning.
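
A sketch of the correct pattern, using the Batch API from the example above:

import slurminade


@slurminade.slurmify()
def work(i):
    print(i)


@slurminade.slurmify()
def pack_results():
    print("packing")


if __name__ == "__main__":
    with slurminade.Batch(max_size=20) as batch:
        for i in range(100):
            work.distribute(i)
        # Correct: flush inside the context, so the returned job ids can
        # feed wait_for. Relying on the implicit flush on exit would leave
        # wait_for without any job ids.
        pack_results.wait_for(batch.flush()).distribute()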

Think about a cleaner approach to offer "setup" and "clean" jobs.

When doing experiments, there are usually some "setup" or "clean up" jobs.
A "setup"-job may ensures that all the binaries have been built, the instances are available etc. All other jobs have a dependency to this job and should only be executed if the setup succeeded.
A "clean up" job may compress data or remove artifacts.

You can implement them with the current functionality, but maybe it is better to make it more explicit.
One could take inspiration from testing frameworks that have similar properties.

A further thought is to do something similar for local setup and clean-up when running a batch of jobs.
For example, one could load a database once for the whole batch instead of reloading it for every single job.
This can currently be done by simply using global variables, but maybe it can be done more cleanly.

Fix the warnings of ruff

This is a pretty simple task. Ruff finds some anti-patterns in the code that should be fixed. Most of the time, it directly gives hints. You can run Ruff yourself via pre-commit with pre-commit run --all-files.

Here is a log from my computer:

ruff.....................................................................Failed
- hook id: ruff
- exit code: 1

docs/conf.py:9:20: PTH100 `os.path.abspath()` should be replaced by `Path.resolve()`
examples/example_1.py:11:10: PTH123 `open()` should be replaced by `Path.open()`
examples/example_1.py:12:9: T201 `print` found
examples/example_2.py:14:5: T201 `print` found
examples/example_2.py:15:10: PTH123 `open()` should be replaced by `Path.open()`
examples/example_2b.py:6:5: T201 `print` found
examples/example_2b.py:7:5: T201 `print` found
examples/example_3.py:10:5: T201 `print` found
examples/example_3.py:15:5: T201 `print` found
examples/example_3.py:20:5: T201 `print` found
src/slurminade/batch.py:83:21: PLW2901 `for` loop variable `tasks` overwritten by assignment target
src/slurminade/batch.py:137:13: T201 `print` found
src/slurminade/conf.py:17:12: PTH113 `os.path.isfile()` should be replaced by `Path.is_file()`
src/slurminade/conf.py:18:18: PTH123 `open()` should be replaced by `Path.open()`
src/slurminade/conf.py:23:9: T201 `print` found
src/slurminade/conf.py:35:12: PTH118 `os.path.join()` should be replaced by `Path` with `/` operator
src/slurminade/conf.py:38:16: PTH118 `os.path.join()` should be replaced by `Path` with `/` operator
src/slurminade/dispatcher.py:168:53: ARG002 Unused method argument: `options`
src/slurminade/dispatcher.py:183:12: PTH110 `os.path.exists()` should be replaced by `Path.exists()`
src/slurminade/dispatcher.py:184:13: PTH107 `os.remove()` should be replaced by `Path.unlink()`
src/slurminade/dispatcher.py:189:9: ARG002 Unused method argument: `conf`
src/slurminade/dispatcher.py:190:9: ARG002 Unused method argument: `simple_slurm_kwargs`
src/slurminade/dispatcher.py:194:9: T201 `print` found
src/slurminade/dispatcher.py:199:9: ARG002 Unused method argument: `conf`
src/slurminade/dispatcher.py:200:9: ARG002 Unused method argument: `simple_slurm_kwargs`
src/slurminade/dispatcher.py:204:9: T201 `print` found
src/slurminade/dispatcher.py:225:16: RET504 Unnecessary variable assignment before `return` statement
src/slurminade/dispatcher.py:231:9: RET505 Unnecessary `else` after `return` statement
src/slurminade/dispatcher.py:260:9: RET505 Unnecessary `else` after `return` statement
src/slurminade/dispatcher.py:275:9: RET505 Unnecessary `else` after `return` statement
src/slurminade/dispatcher.py:293:53: ARG002 Unused method argument: `options`
src/slurminade/dispatcher.py:303:9: ARG002 Unused method argument: `conf`
src/slurminade/dispatcher.py:304:9: ARG002 Unused method argument: `simple_slurm_kwargs`
src/slurminade/dispatcher.py:313:9: ARG002 Unused method argument: `conf`
src/slurminade/dispatcher.py:314:9: ARG002 Unused method argument: `simple_slurm_kwargs`
src/slurminade/dispatcher.py:330:53: ARG002 Unused method argument: `options`
src/slurminade/dispatcher.py:340:9: ARG002 Unused method argument: `conf`
src/slurminade/dispatcher.py:341:9: ARG002 Unused method argument: `simple_slurm_kwargs`
src/slurminade/dispatcher.py:349:9: ARG002 Unused method argument: `conf`
src/slurminade/dispatcher.py:350:9: ARG002 Unused method argument: `simple_slurm_kwargs`
src/slurminade/dispatcher.py:398:5: PLW0603 Using the global statement to update `__dispatcher` is discouraged
src/slurminade/dispatcher.py:401:13: PLW0603 Using the global statement to update `__dispatcher` is discouraged
src/slurminade/dispatcher.py:415:5: PLW0603 Using the global statement to update `__dispatcher` is discouraged
src/slurminade/execute.py:34:14: PTH123 `open()` should be replaced by `Path.open()`
src/slurminade/execute.py:51:10: PTH123 `open()` should be replaced by `Path.open()`
src/slurminade/execute.py:55:5: PLW0603 Using the global statement to update `__name__` is discouraged
src/slurminade/function.py:75:9: SIM108 Use ternary operator `job_ids = [job_ids] if isinstance(job_ids, int) else list(job_ids)` instead of `if`-`else`-block
src/slurminade/function.py:146:5: RET505 Unnecessary `else` after `return` statement
src/slurminade/function_map.py:42:33: PTH100 `os.path.abspath()` should be replaced by `Path.resolve()`
src/slurminade/function_map.py:104:12: PTH113 `os.path.isfile()` should be replaced by `Path.is_file()`
src/slurminade/function_map.py:107:19: PTH100 `os.path.abspath()` should be replaced by `Path.resolve()`
src/slurminade/guard.py:20:12: PLW0602 Using global for `_exec_flag` but no assignment is done
src/slurminade/guard.py:35:5: PLW0603 Using the global statement to update `_exec_flag` is discouraged
src/slurminade/guard.py:44:5: PLW0603 Using the global statement to update `_exec_flag` is discouraged
src/slurminade/options.py:14:17: PLW2901 `for` loop variable `v` overwritten by assignment target
tests/test_create_command.py:13:10: PTH123 `open()` should be replaced by `Path.open()`
tests/test_create_command.py:47:14: PTH123 `open()` should be replaced by `Path.open()`
tests/test_local.py:13:10: PTH123 `open()` should be replaced by `Path.open()`
tests/test_local.py:18:8: PTH110 `os.path.exists()` should be replaced by `Path.exists()`
tests/test_local.py:19:9: PTH107 `os.remove()` should be replaced by `Path.unlink()`
tests/test_local.py:24:10: PTH123 `open()` should be replaced by `Path.open()`
tests/test_local.py:29:8: PTH110 `os.path.exists()` should be replaced by `Path.exists()`
tests/test_local.py:30:9: PTH107 `os.remove()` should be replaced by `Path.unlink()`
tests/test_local.py:37:16: PTH110 `os.path.exists()` should be replaced by `Path.exists()`
tests/test_local.py:43:16: PTH110 `os.path.exists()` should be replaced by `Path.exists()`
tests/test_local.py:44:14: PTH123 `open()` should be replaced by `Path.open()`
tests/test_subprocess.py:12:10: PTH123 `open()` should be replaced by `Path.open()`
tests/test_subprocess.py:17:8: PTH110 `os.path.exists()` should be replaced by `Path.exists()`
tests/test_subprocess.py:18:9: PTH107 `os.remove()` should be replaced by `Path.unlink()`
tests/test_subprocess.py:23:10: PTH123 `open()` should be replaced by `Path.open()`
tests/test_subprocess.py:28:8: PTH110 `os.path.exists()` should be replaced by `Path.exists()`
tests/test_subprocess.py:29:9: PTH107 `os.remove()` should be replaced by `Path.unlink()`
tests/test_subprocess.py:35:12: PTH110 `os.path.exists()` should be replaced by `Path.exists()`
tests/test_subprocess.py:37:12: PTH100 `os.path.abspath()` should be replaced by `Path.resolve()`
tests/test_subprocess.py:48:16: PTH110 `os.path.exists()` should be replaced by `Path.exists()`
tests/test_subprocess.py:58:16: PTH110 `os.path.exists()` should be replaced by `Path.exists()`
tests/test_subprocess.py:59:14: PTH123 `open()` should be replaced by `Path.open()`

Throw an error if a non-global function gets decorated

Slurminade runs the exact same code, but without the if __name__=="__main__" part, to find the function it is supposed to call. If you do something nasty like

if __name__ == "__main__":
    @slurminade.slurmify()
    def f():
        pass

    f.distribute()

slurminade has no chance to find f, and there will be an error on the worker but not directly on the host. To get an error directly, we could perform the function collection before the first distribute and check whether f will be found. I suggest implementing an additional command line argument that prints a list of all available function ids. The host can call this and simply parse the last line, or throw an error if Python failed.

When doing this, we could also automatically check whether there are distribute calls on the global level, as this is another very common error with students who do not understand why the if __name__=="__main__" is needed, since many scripts work fine without it.

Should be a relatively simple thing to implement.

Automatically name tasks based on function names

Currently, the default names of slurminade are quite useless. If no name is set, we could automatically name a task "slurminade[FUNC_NAME]". This would probably help with parsing the output of squeue. Should be easy to implement.

Enable dependencies also in batches

Batches do not immediately return a job id, so no dependency can be set. However, one could create virtual job ids and ensure the dependencies during the batched dispatch. This would involve quite a bit of logic, so I am not sure whether it is worth sacrificing simplicity for this feature.

Too long batches will fail due to limits in the commands

We attach the metadata for every task to the shell command. Unfortunately, the command gets too long with many tasks, leading to a failure. Add a limit and use a temporary file for passing the metadata if it is too much for the command line.
