um-bridge / umbridge

UM-Bridge (the UQ and Model Bridge) provides a unified interface for numerical models that is accessible from virtually any programming language or framework.

Home Page: https://um-bridge-benchmarks.readthedocs.io/en/docs/

License: MIT License

Dockerfile 0.07% Python 3.20% C++ 49.91% Shell 0.09% CMake 0.02% TeX 0.22% R 0.59% Jupyter Notebook 45.14% PowerShell 0.01% MATLAB 0.64% M 0.02% Objective-C 0.01% Julia 0.05% Makefile 0.05%

umbridge's Introduction

UM-Bridge

UM-Bridge (the UQ and Model Bridge) is a unified interface for numerical models that is accessible from virtually any programming language or framework. It is primarily intended for coupling advanced models (e.g. simulations of complex physical processes) to advanced statistical or optimization methods.

Documentation, including tutorials and the UQ benchmark library, can be found here: Documentation.

Instructions for contacting the team, reporting bugs, and contributing can be found in CONTRIBUTING.md.

umbridge's People

Contributors

andyddavis, annereinarz, chun9l, crambor, danielskatz, dolgov, imacklui, krosenfeld, lennoxliu, linusseelinger, marlenaweidenauer, purusharths, schlevidon, sebwolf-de, vivilearns2code


umbridge's Issues

Remove umbridge.h copy in hpc setup

Remove hpc/lib/* and change the load balancer to use lib/umbridge.h instead. The latter should now have all the necessary features (disabling error checks, optional parallel model evaluations).

Problem running Tsunami model benchmark

I am having trouble running the benchmark examples. Below I show the result for the Tsunami model, following both the UM-Bridge benchmark talk and the online quickstart guide.

Tsunami model

I start the Docker container in a PowerShell:
docker run -p 4242:4242 linusseelinger/model-exahype-tsunami
which outputs:

Running on number of ranks: 1
Listening on port 4242...

And then evaluating in python:

>>> import umbridge
>>> model = umbridge.HTTPModel('localhost:4242')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: __init__() missing 1 required positional argument: 'name'

If I instead use the example script provided in the quickstart:

(env) C:\Users\katherinero\Documents\projects\playpen\umbridge_example>python umbridge-client.py http://localhost:4242
Connecting to host URL http://localhost:4242
Traceback (most recent call last):
  File "umbridge-client.py", line 13, in <module>
    model = umbridge.HTTPModel(args.url, "posterior")
  File "C:\Users\katherinero\Documents\projects\playpen\umbridge_example\env\lib\site-packages\umbridge\um.py", line 45, in __init__
    raise Exception(f'Model {name} not supported by server! Supported models are: {supported_models(url)}')
Exception: Model posterior not supported by server! Supported models are: ['forward']

Thank you!

System info:

Conda environment
Windows 10
python 3.8
Installing directly from source or from PyPI results in the same behavior.

(env) C:\Users\katherinero\Documents\projects\playpen\umbridge_example>conda list
# packages in environment at C:\Users\katherinero\Documents\projects\playpen\umbridge_example\env:
#
# Name                    Version                   Build  Channel
aiohttp                   3.8.3                    pypi_0    pypi
aiosignal                 1.2.0                    pypi_0    pypi
async-timeout             4.0.2                    pypi_0    pypi
asyncio                   3.4.3                    pypi_0    pypi
attrs                     22.1.0                   pypi_0    pypi
bzip2                     1.0.8                h8ffe710_4    conda-forge
ca-certificates           2022.9.24            h5b45459_0    conda-forge
certifi                   2022.9.24                pypi_0    pypi
charset-normalizer        2.1.1                    pypi_0    pypi
frozenlist                1.3.1                    pypi_0    pypi
idna                      3.4                      pypi_0    pypi
libffi                    3.4.2                h8ffe710_5    conda-forge
libsqlite                 3.39.4               hcfcfb64_0    conda-forge
libzlib                   1.2.13               hcfcfb64_4    conda-forge
multidict                 6.0.2                    pypi_0    pypi
numpy                     1.23.4                   pypi_0    pypi
openssl                   3.0.5                hcfcfb64_2    conda-forge
pip                       22.3               pyhd8ed1ab_0    conda-forge
python                    3.8.13          hcf16a7b_0_cpython    conda-forge
requests                  2.28.1                   pypi_0    pypi
scipy                     1.9.3                    pypi_0    pypi
setuptools                65.5.0             pyhd8ed1ab_0    conda-forge
sqlite                    3.39.4               hcfcfb64_0    conda-forge
tk                        8.6.12               h8ffe710_0    conda-forge
ucrt                      10.0.22621.0         h57928b3_0    conda-forge
umbridge                  1.1.3                    pypi_0    pypi
urllib3                   1.26.12                  pypi_0    pypi
vc                        14.3                 h3d8a991_9    conda-forge
vs2015_runtime            14.32.31332          h1d6e394_9    conda-forge
wheel                     0.37.1             pyhd8ed1ab_0    conda-forge
xz                        5.2.6                h8d14728_0    conda-forge
yarl                      1.8.1                    pypi_0    pypi

pymc version

Hello,

I was attempting to run the example pymc client and am getting an error related to the TensorType in UmbridgeOp. First I run the benchmark:

docker run -it -p 4243:4243 linusseelinger/benchmark-analytic-gaussian-mixture

and then the client:

(/home/krosenfeld/projects/umbridge_test/env) [krosenfeld@ipapvwks25 umbridge_test]$ python pymc-client.py http://localhost:4243
Connecting to host URL http://localhost:4243
Model output: [-48.17723351]
Check model's gradient against numerical gradient. This requires an UM-Bridge model with gradient support.
Traceback (most recent call last):
  File "pymc-client.py", line 39, in <module>
    map_estimate = pm.find_MAP()
  File "/home/krosenfeld/projects/umbridge_test/env/lib/python3.8/site-packages/pymc/tuning/starting.py", line 125, in find_MAP
    model.check_start_vals(start)
  File "/home/krosenfeld/projects/umbridge_test/env/lib/python3.8/site-packages/pymc/model.py", line 1776, in check_start_vals
    initial_eval = self.point_logps(point=elem)
  File "/home/krosenfeld/projects/umbridge_test/env/lib/python3.8/site-packages/pymc/model.py", line 1805, in point_logps
    factor_logps_fn = [at.sum(factor) for factor in self.logp(factors, sum=False)]
  File "/home/krosenfeld/projects/umbridge_test/env/lib/python3.8/site-packages/pymc/model.py", line 759, in logp
    rv_logps = joint_logp(
  File "/home/krosenfeld/projects/umbridge_test/env/lib/python3.8/site-packages/pymc/logprob/joint_logprob.py", line 293, in joint_logp
    temp_logp_terms = factorized_joint_logprob(
  File "/home/krosenfeld/projects/umbridge_test/env/lib/python3.8/site-packages/pymc/logprob/joint_logprob.py", line 211, in factorized_joint_logprob
    q_logprob_vars = _logprob(
  File "/home/krosenfeld/projects/umbridge_test/env/lib/python3.8/functools.py", line 875, in wrapper
    return dispatch(args[0].__class__)(*args, **kw)
  File "/home/krosenfeld/projects/umbridge_test/env/lib/python3.8/site-packages/pymc/distributions/distribution.py", line 568, in custom_dist_logp
    return logp(values[0], *dist_params)
  File "/home/krosenfeld/projects/umbridge_test/env/lib/python3.8/site-packages/aesara/graph/op.py", line 297, in __call__
    node = self.make_node(*inputs, **kwargs)
  File "/home/krosenfeld/projects/umbridge_test/env/lib/python3.8/site-packages/aesara/graph/op.py", line 241, in make_node
    raise TypeError(
TypeError: Invalid input types for Op UmbridgeOp:
Input 1/1: Expected TensorType(float64, (?,)), got TensorType(float64, (2,))

I believe this is related to the version of pymc I have. I had installed umbridge following the directions in the docs:

pip install umbridge[pymc]

This did not install pymc itself so I then installed pymc:

mamba install pymc

which results in pymc 5.2.0.

If I revert to an old version of pymc:

mamba install pymc=4.0.0

The example runs:

(/home/krosenfeld/projects/umbridge_test/env) [krosenfeld@ipapvwks25 umbridge_test]$ python pymc-client.py http://localhost:4243
Connecting to host URL http://localhost:4243
Model output: [-48.17723351]
Check model's gradient against numerical gradient. This requires an UM-Bridge model with gradient support.
 |█████████████████████████████████████████████████████████████████████████████████████████| 100.00% [8/8 00:00<00:00 logp = -3.7296, ||grad|| = 0.025167]
MAP estimate of posterior is [-1.99936594  1.99936594]
Only 400 samples in chain.
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Sequential sampling (2 chains in 1 job)
NUTS: [posterior]
 100.00% [500/500 00:24<00:00 Sampling chain 1, 0 divergences]
Sampling 2 chains for 100 tune and 400 draw iterations (200 + 800 draws total) took 49 seconds.
The acceptance probability does not match the target. It is 0.8838, but should be close to 0.8. Try to increase the number of tuning steps.

I am using Python 3.8.16 on RHEL 8.6.

Model readiness check in hpc load balancer

Ensure a model instance is only called after it has completed its setup phase (i.e. left the model instance constructor). Some models may have considerable setup cost, and calls to them during setup may cause errors or (at the very least) avoidable delays. In Kubernetes we can achieve a readiness check by checking whether the server's /Info endpoint is reachable.
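
A minimal sketch of such a readiness check, assuming the load balancer uses cpp-httplib (as umbridge.h does); the function name and polling interval are placeholders:

#include <chrono>
#include <string>
#include <thread>
#include "httplib.h"

// Poll the model server's /Info endpoint until it responds, i.e. until the
// instance has left its constructor and is actually serving requests.
bool waitUntilReady(const std::string& host, int port, int max_attempts = 60) {
    httplib::Client client(host, port);
    for (int attempt = 0; attempt < max_attempts; ++attempt) {
        if (auto res = client.Get("/Info"); res && res->status == 200)
            return true;  // setup phase complete
        std::this_thread::sleep_for(std::chrono::seconds(1));
    }
    return false;  // never became ready within the timeout
}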

Merge hpc load balancer's umbridge.h with main umbridge.h

lib/umbridge.h always uses a mutex in serveModels to ensure the user-supplied model is only ever called sequentially, which makes sense for costly models. For the hpc load balancer, however, we need parallel evaluations in order to support distributing requests to parallel model instances. hpc/lib/umbridge.h allows disabling the mutex (though by accident it unconditionally disables it!)

Task: Add support for switching the mutex on/off in lib/umbridge.h. Default should be on as before. Ensure it works with existing models and the hpc load balancer.
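
A rough sketch of what the switchable mutex could look like; the flag name and class layout are hypothetical, not the actual umbridge.h code:

#include <mutex>
#include <optional>

class ModelServer {
public:
    // Default is on, preserving the current sequential behavior.
    explicit ModelServer(bool enforce_sequential = true)
        : enforce_sequential_(enforce_sequential) {}

    void handleEvaluate(/* request */) {
        std::optional<std::lock_guard<std::mutex>> guard;
        if (enforce_sequential_)
            guard.emplace(evaluation_mutex_);  // serialize calls into the user model
        // ... call the user-supplied model here ...
    }

private:
    bool enforce_sequential_;
    std::mutex evaluation_mutex_;
};

The hpc load balancer would then construct the server with enforce_sequential = false to allow parallel evaluations.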

Pass model errors through hpc load balancer to client

Task: Catch errors thrown by user model instance and pass them through the load balancer all the way to the client.

Could be done by using the error handling capability from #28: catch the model instance error in the load balancer, then throw it again so the client receives it.
If this and other issues turn out hard to solve cleanly, we might consider building a hyperqueue-backed reverse proxy at the HTTP level instead of building on top of umbridge. That needs careful consideration though.
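
A sketch of the rethrow idea, assuming the protocol from #28 reports failures in an error field of the reply (the field name is an assumption):

#include <stdexcept>
#include <string>
#include "json.hpp"

using json = nlohmann::json;

// Inspect a model instance's reply inside the load balancer; rethrow its
// error so the serving layer forwards it to the client instead of
// swallowing it here.
void throwIfModelError(const std::string& instance_reply_body) {
    json response = json::parse(instance_reply_body);
    if (response.contains("error"))
        throw std::runtime_error(response["error"].dump());
}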

Remove direct SLURM call in HPC load balancer

The load balancer currently does a direct call to SLURM to find out what models the server offers. This should go through hyperqueue as well, so we have hq as an abstraction for other job schedulers (hq supports at least one other scheduler).

Improve servers' error handling

We already have a protocol for sending error messages to the clients and the clients react appropriately, but currently it's only used for errors the server itself detects.

Task: Catch user model errors when the server calls the user model, and pass a sensible error message to the client using the existing error message format.
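
A sketch of what this could look like in the request handler, assuming cpp-httplib and nlohmann::json as in umbridge.h; the helper and the JSON error shape below are illustrative, not the actual protocol:

#include <exception>
#include <string>
#include "httplib.h"
#include "json.hpp"

using json = nlohmann::json;

std::string evaluateUserModel(const std::string& request_body);  // hypothetical helper

void registerEvaluate(httplib::Server& svr) {
    svr.Post("/Evaluate", [](const httplib::Request& req, httplib::Response& res) {
        try {
            res.set_content(evaluateUserModel(req.body), "application/json");
        } catch (const std::exception& e) {
            // Pass the user model's error on to the client instead of crashing;
            // the field names here are placeholders for the existing format.
            json error = {{"error", {{"type", "ModelError"}, {"message", e.what()}}}};
            res.set_content(error.dump(), "application/json");
        }
    });
}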

Remove URL dependency on filesystem for HPC loadbalancer implementation

An alternative approach to temporarily storing all model URLs on the shared filesystem would be to have an endpoint on the load balancer that can accept a model URL (e.g. via HTTP POST) from inside each hyperqueue job. This would allow storing URLs in memory rather than requiring expensive file I/O operations.
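
A sketch of such an endpoint with cpp-httplib; the path and the job-id header are made up for illustration:

#include <map>
#include <mutex>
#include <string>
#include "httplib.h"

std::map<std::string, std::string> model_urls;  // job id -> model URL, kept in memory
std::mutex registry_mutex;

void addRegistrationEndpoint(httplib::Server& svr) {
    svr.Post("/RegisterModel", [](const httplib::Request& req, httplib::Response& res) {
        std::lock_guard<std::mutex> lock(registry_mutex);
        // Each hyperqueue job POSTs its model URL in the request body.
        model_urls[req.get_header_value("Job-Id")] = req.body;
        res.set_content("ok", "text/plain");
    });
}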

HPC: Sporadic model crashes on Helix

The MultiplyBy2 test jobs (modified to take 10 seconds per evaluation) occasionally log Quit after Listening on port x.... I'm running 100 instances via HQ, queried from the test script below. The issue happens around once every 300 runs. As a result, the test script waits indefinitely for the failed job to return something.

#!/bin/bash 
 
echo "Sending requests..." 
 
for i in {1..100} 
do 
   # Expected output: {"output":[[200.0]]} 
   # Check if curl output equals expected output 
   # If not, print error message 
 
   if [ "$(curl -s http://localhost:4242/Evaluate -X POST -d '{"name": "forward", "input": [[100.0]]}')" == '{"output":[[200.0]]}' ]; then 
       echo -n "y" 
   else 
       echo -n "n" 
       #echo "Error: curl output does not equal expected output" 
   fi & 
 
done 
 
echo "Requests sent. Waiting for responses..." 
 
wait

A possible workaround is to set a minimum port of 60000. I therefore suspect the issue is just an occupied port.

Either the port finder does not work as intended on Helix, or maybe there is a race condition in the short window between finding a free port and the model actually reserving it on launch (seems unlikely).

HPC: Enforce FIFO order in jobs

hq seems to work on the most recently submitted jobs first, possibly leading to some jobs never finishing if too many new ones come in!

Fix via priorities when submitting jobs:

56c7038

Double check and merge this into main, possibly using an atomic int to ensure a unique priority index.
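
A sketch of the atomic counter; this assumes HQ schedules higher-priority jobs first:

#include <atomic>

std::atomic<int> next_priority{0};

// fetch_sub returns the previous value, so concurrent submitters still get
// unique, strictly decreasing priorities -> earlier submissions run first.
int takePriority() {
    return next_priority.fetch_sub(1);
}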

Investigate launching of model instances

Right now we are launching the model server on worker start via --worker-start-cmd.

According to the docs (https://it4innovations.github.io/hyperqueue/stable/deployment/allocation/):

"Specifies a shell command that will be executed on each allocated node just before a worker is started on that node. You can use it e.g. to initialize some shared environment for the node, or to load software modules."

So if multiple tasks end up on the same node, we probably route all requests to a single model instance. Verify that that's the case and come up with a solution. Fallback solution: launch a model upon every task submission, then tear it down once the task finishes. Not ideal, since the model initialization cost is incurred each time.

pymc-client.py fails to sample on Windows

I am running the pymc-client.py script using the analytic gaussian mixture benchmark:

docker run -it -p 4243:4243 linusseelinger/benchmark-analytic-gaussian-mixture
python3 umbridge/clients/python/pymc-client.py http://localhost:4243

and seeing an error during pm.sample:

http://localhost:4243
Connecting to host URL http://localhost:4243
Model output: [-48.17723351]
Check model's gradient against numerical gradient. This requires an UM-Bridge model with gradient support.
 |███████████████████████████████████████████████████████| 100.00% [8/8 00:00<00:00 logp = -3.7296, ||grad|| = 0.025167]
MAP estimate of posterior is [-1.99936594  1.99936594]
Only 50 samples in chain.
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [posterior]
Connecting to host URL http://localhost:4243
Model output: [-48.17723351]
Check model's gradient against numerical gradient. This requires an UM-Bridge model with gradient support.
 |███████████████████████████████████████████████████████| 100.00% [8/8 00:00<00:00 logp = -3.7296, ||grad|| = 0.025167]
MAP estimate of posterior is [-1.99936594  1.99936594]
Only 50 samples in chain.
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [posterior]
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "C:\Users\katherinero\Documents\code\umbridge\env\lib\multiprocessing\spawn.py", line 116, in spawn_main
    exitcode = _main(fd, parent_sentinel)
  File "C:\Users\katherinero\Documents\code\umbridge\env\lib\multiprocessing\spawn.py", line 125, in _main
    prepare(preparation_data)
  File "C:\Users\katherinero\Documents\code\umbridge\env\lib\multiprocessing\spawn.py", line 236, in prepare
    _fixup_main_from_path(data['init_main_from_path'])
  File "C:\Users\katherinero\Documents\code\umbridge\env\lib\multiprocessing\spawn.py", line 287, in _fixup_main_from_path
    main_content = runpy.run_path(main_path,
  File "C:\Users\katherinero\Documents\code\umbridge\env\lib\runpy.py", line 265, in run_path
    return _run_module_code(code, init_globals, run_name,
  File "C:\Users\katherinero\Documents\code\umbridge\env\lib\runpy.py", line 97, in _run_module_code
    _run_code(code, mod_globals, init_globals,
  File "C:\Users\katherinero\Documents\code\umbridge\env\lib\runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "C:\Users\katherinero\Documents\code\umbridge\clients\python\pymc-client.py", line 40, in <module>
    inferencedata = pm.sample(draws=50)
  File "C:\Users\katherinero\Documents\code\umbridge\env\lib\site-packages\pymc\sampling.py", line 617, in sample
    mtrace = _mp_sample(**sample_args, **parallel_args)
  File "C:\Users\katherinero\Documents\code\umbridge\env\lib\site-packages\pymc\sampling.py", line 1508, in _mp_sample
    sampler = ps.ParallelSampler(
  File "C:\Users\katherinero\Documents\code\umbridge\env\lib\site-packages\pymc\parallel_sampling.py", line 412, in __init__
    self._samplers = [
  File "C:\Users\katherinero\Documents\code\umbridge\env\lib\site-packages\pymc\parallel_sampling.py", line 413, in <listcomp>
    ProcessAdapter(
  File "C:\Users\katherinero\Documents\code\umbridge\env\lib\site-packages\pymc\parallel_sampling.py", line 272, in __init__
    self._process.start()
  File "C:\Users\katherinero\Documents\code\umbridge\env\lib\multiprocessing\process.py", line 121, in start
    self._popen = self._Popen(self)
  File "C:\Users\katherinero\Documents\code\umbridge\env\lib\multiprocessing\context.py", line 327, in _Popen
    return Popen(process_obj)
  File "C:\Users\katherinero\Documents\code\umbridge\env\lib\multiprocessing\popen_spawn_win32.py", line 45, in __init__
    prep_data = spawn.get_preparation_data(process_obj._name)
  File "C:\Users\katherinero\Documents\code\umbridge\env\lib\multiprocessing\spawn.py", line 154, in get_preparation_data
    _check_not_importing_main()
  File "C:\Users\katherinero\Documents\code\umbridge\env\lib\multiprocessing\spawn.py", line 134, in _check_not_importing_main
    raise RuntimeError('''
RuntimeError:
        An attempt has been made to start a new process before the
        current process has finished its bootstrapping phase.

        This probably means that you are not using fork to start your
        child processes and you have forgotten to use the proper idiom
        in the main module:

            if __name__ == '__main__':
                freeze_support()
                ...

        The "freeze_support()" line can be omitted if the program
        is not going to be frozen to produce an executable.

Skip error checks in hpc load balancer

The umbridge C++ server currently (unconditionally) does some error checking of user/model vector sizes etc. This happens both in the load balancer AND in the model instances themselves, leading to unnecessary hyperqueue tasks being spawned by the load balancer.

Task: Avoid unnecessary error checking in the load balancer by making error checks optional (on by default) and disabling them in the hpc load balancer. Verify that no superfluous hyperqueue tasks are spawned anymore.
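
A sketch of an optional size check; the flag and function are illustrative, not the actual umbridge.h code:

#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// On by default, matching current behavior; the hpc load balancer would
// disable this so invalid requests don't spawn hyperqueue tasks just to fail.
bool check_input_sizes = true;

void validateInput(const std::vector<std::vector<double>>& input,
                   const std::vector<std::size_t>& expected_sizes) {
    if (!check_input_sizes)
        return;
    if (input.size() != expected_sizes.size())
        throw std::invalid_argument("Wrong number of input vectors");
    for (std::size_t i = 0; i < input.size(); ++i)
        if (input[i].size() != expected_sizes[i])
            throw std::invalid_argument("Wrong size of input vector " + std::to_string(i));
}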

Shut down Hyperqueue server automatically after program stops

A Hyperqueue server starts whenever the load balancer executable is called, but it continues to run after the program is killed. Subsequent calls then yield a warning about the already-running Hyperqueue server. This does not break anything though.
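
One possible fix, sketched under the assumption that hq server stop is available on the PATH; this covers normal exits, while a killed process would additionally need a signal handler:

#include <cstdlib>

int main() {
    // ... start the HQ server and run the load balancer as before ...
    std::atexit([] { std::system("hq server stop"); });
    // ...
    return 0;  // the atexit handler shuts the HQ server down on normal exit
}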

Multiple allocation queues create too many SLURM jobs

In the current setup multiple allocation queues are used (at least two - one to determine the model names and one for each model). It seems like whenever HQ decides to add a new allocation to meet the current demand, all allocation queues create a new SLURM job. The result is that more SLURM jobs/HQ workers are created than necessary.

It might be best to have one shared allocation queue for all models to avoid wasting resources.

Removing URL files when HQ jobs are cancelled

After #45, all old URL files are removed when starting the load balancer. Should we still remove the URL file whenever an HQ job is cancelled?

  • Not removing the files might be helpful for debugging, and keeping them wouldn't clutter the system too much, since they get deleted before each run anyway.
  • If we decide to keep the current behavior, the code should be refactored to use the std::filesystem library instead of plain bash.

std::system(("rm ./urls/url-" + job_id + ".txt").c_str());
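
If we do keep the current behavior, the std::filesystem equivalent of the bash call above would be:

#include <filesystem>
#include <string>

void removeUrlFile(const std::string& job_id) {
    // Unlike the shell call, remove() simply returns false if the file
    // is already gone rather than printing an error.
    std::filesystem::remove("./urls/url-" + job_id + ".txt");
}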

HPC: Add multiple backend options (e.g. HyperQueue and pure SLURM)

Motivation

The current HPC setup uses HyperQueue to manage SLURM allocations. This works very well for some clusters, but unfortunately, not for all. Especially larger clusters tend to run customized SLURM implementations which violate certain assumptions that HyperQueue makes to function correctly (e.g. non-standard formatting of the job submission message which breaks parsing of the SLURM job ID).
To solve this issue we would like to implement multiple backends (i.e. software that handles SLURM allocations) which users can freely choose from; a rough interface sketch follows the list of requirements below.
For example, one implementation that is already in use (see: hpc-slurm) is a pure SLURM backend which simply creates a new SLURM allocation for each request received by the load balancer. This works well for applications that require only a few long-running jobs, since in this case, overhead from resubmitting SLURM allocations is comparatively small. But most importantly, it does not make any assumptions, so it should run on pretty much any cluster that uses SLURM for job management.

Key requirements

  • Extensibility: It should be easy to add new backend implementations.
  • Flexibility: The interface should not make too many assumptions, since we want the load balancer to be able to run on as many clusters as possible.
  • Usability: The backend should be determined at run time, not compile time, so users don't need to deal with multiple builds of the load balancer.
  • Add the current HyperQueue setup and the pure SLURM setup in hpc-slurm as available backend options.
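
A rough sketch of a run-time selectable backend interface (all names are hypothetical):

#include <memory>
#include <stdexcept>
#include <string>

class SchedulerBackend {
public:
    virtual ~SchedulerBackend() = default;
    // Submit a model job and return the URL of the resulting model instance.
    virtual std::string submitJob(const std::string& model_name) = 0;
};

class HyperQueueBackend : public SchedulerBackend {  // current HQ setup
    std::string submitJob(const std::string& /*model_name*/) override { return {}; }
};

class SlurmBackend : public SchedulerBackend {  // hpc-slurm style, one allocation per request
    std::string submitJob(const std::string& /*model_name*/) override { return {}; }
};

// Chosen at run time (e.g. from a CLI flag), so one build serves all clusters.
std::unique_ptr<SchedulerBackend> makeBackend(const std::string& name) {
    if (name == "hyperqueue") return std::make_unique<HyperQueueBackend>();
    if (name == "slurm")      return std::make_unique<SlurmBackend>();
    throw std::invalid_argument("Unknown backend: " + name);
}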

HPC: Broken/Insufficient Testing of Load Balancer Functionality

Problems with the examples in hpc/test:

  • Helix/Makefile is outdated and unusable in its current state.
  • MultiplyBy2/client.py sends a request with too many input dimensions and crashes.
  • minimal and MultiplyBy2 do almost the same thing on the server side and should be merged into a single example or renamed appropriately.
  • Currently only Evaluate requests are tested.

Tasks:

  • Clean up and fix the current examples.
  • Add tests to ensure coverage of the entire UM-Bridge model interface, i.e.:
    • GetInputSizes and GetOutputSizes
    • Evaluate, Gradient, ApplyJacobian and ApplyHessian
    • SupportsEvaluate, SupportsGradient, SupportsApplyJacobian and SupportsApplyHessian
  • If possible: Add CI to automatically run tests.
    • It should be possible to use HQ without SLURM by using hq worker start instead of automatic allocation.
    • Alternatively, it might be possible to run a virtual SLURM cluster?

Print what job files are found/used

Give clear indication to the user what job scripts are found and used for each model.

Could look like: print a list of available models and the script each of them will use.

umbridge.h: Pass json config by const reference

Context

Methods of UM-Bridge models take an optional json parameter which may be used to change the behavior of a model (e.g. setting different fidelity levels). Some methods defined in the current C++ implementation (see umbridge.h) use pass-by-reference (e.g. GetInputSizes) while others use pass-by-value (e.g. Evaluate) for the json config argument.

Proposed Solution

This behavior is likely unintentional. Instead, all model methods should accept a const reference for the json config argument, since there is no reason for a method to modify the config, and we want to avoid copying (potentially large) json objects.
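
Illustrative signatures for the proposed change (simplified from umbridge.h):

#include <vector>
#include "json.hpp"

using json = nlohmann::json;

// Before (pass-by-value): the config is copied on every call.
// std::vector<std::vector<double>> Evaluate(
//     const std::vector<std::vector<double>>& inputs, json config);

// After (pass-by-const-reference), consistent with GetInputSizes:
std::vector<std::vector<double>> Evaluate(
    const std::vector<std::vector<double>>& inputs, const json& config);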

Remove HTTP timeout for slow models entirely

Currently umbridge.h sets a fixed timeout for the http library. Let's get rid of the timeout entirely, since we can't know how long a model might take. Hopefully this is possible without digging deep into the http library, since we don't want to maintain a fork of it.
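
If the library does not allow disabling the timeout outright, one workaround sketch (assuming cpp-httplib's set_read_timeout; the constant is arbitrary):

#include "httplib.h"

void configureClient(httplib::Client& client) {
    // Effectively no read timeout: a year in seconds, far beyond any model run.
    client.set_read_timeout(60 * 60 * 24 * 365, 0);
    // Connecting itself should still fail fast if the server is unreachable.
    client.set_connection_timeout(10, 0);
}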

Avoid file transfers in hpc load balancer

The hpc load balancer currently transfers files to find out the address of a specific task's model instance.

Task: Avoid file transfers. We may need to chat with the hyperqueue devs about how to make this possible.

Change timeout for slow job allocations

In the LoadBalancer implementation, a URL to the compute node is returned after submitting the model SLURM script, and there is a timeout while waiting for it. Currently, the countdown for the timeout begins the moment the job is submitted, which means slow job allocation will be an issue.

Investigate multi-node support in hpc platform

According to its docs, hyperqueue seems to have limitations on multi-node tasks (and setting up model instances through a worker start command probably sets them up once per physical node, while we need only a single instance).

Investigate how hyperqueue currently behaves in our setting, in particular see if auto allocation works.

Then check with the hyperqueue devs, and (as a fallback) possibly keep direct SLURM calls as an optional backend for very large jobs.

Refactor waiting for url files to use C++ instead of bash

Now that we use the C++17 standard, go with std::filesystem::exists("helloworld.txt"); in waitForFile instead of a bash loop (which may be more system dependent than the C++ standard library).

Make sure the timeout actually works then; currently the loop is effectively infinite because the bash call never completes if the file doesn't show up.

Fix how the amount of time passed is computed: steady_clock::now does not necessarily return seconds. See for example the second answer in https://stackoverflow.com/questions/728068/how-to-calculate-a-time-difference-in-c for computing a duration in modern C++.
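
Putting both points together, a sketch of what waitForFile could look like:

#include <chrono>
#include <filesystem>
#include <string>
#include <thread>

bool waitForFile(const std::string& path, std::chrono::seconds timeout) {
    const auto start = std::chrono::steady_clock::now();
    while (!std::filesystem::exists(path)) {
        // Compare durations directly instead of assuming seconds.
        if (std::chrono::steady_clock::now() - start > timeout)
            return false;  // timed out instead of looping forever
        std::this_thread::sleep_for(std::chrono::milliseconds(100));
    }
    return true;
}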
