
Supervisor's Issues

CANDLE Python Library

Prepare a CANDLE repo that contains:

  • core library for CANDLE-compliant codes
  • examples that use the CANDLE Library
  • Multi-layer network for MNIST dataset
  • CNN for MNIST dataset
  • Unet

-- Benchmark repo

  • make new release branch
  • merge or clean up stale branches
  • update Pilot1/2/3 code with CANDLE Library (?)

Auto-configuration in CP1

Auto-configure:

Set TURBINE_RESIDENT_WORK_WORKERS based on studies[12].txt
Auto-create DB based on HPO search space.

Initial DB scripts

  1. Set up the database
  2. Insert a record f(N1,NE)->val_loss
  3. List records
  4. Query for (N1,NE)
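
The four DB scripts above could be sketched with the standard-library sqlite3 module; the table name, column names, and SQLite itself are assumptions here, not a committed design:

```python
import sqlite3

def setup_db(path="hpo.db"):
    # 1. Set up the database: one table mapping (N1, NE) -> val_loss
    conn = sqlite3.connect(path)
    conn.execute("""CREATE TABLE IF NOT EXISTS results
                    (N1 INTEGER, NE INTEGER, val_loss REAL,
                     PRIMARY KEY (N1, NE))""")
    return conn

def insert_record(conn, n1, ne, val_loss):
    # 2. Insert (or overwrite) a record f(N1, NE) -> val_loss
    conn.execute("INSERT OR REPLACE INTO results VALUES (?, ?, ?)",
                 (n1, ne, val_loss))
    conn.commit()

def list_records(conn):
    # 3. List all records
    return conn.execute("SELECT N1, NE, val_loss FROM results").fetchall()

def query(conn, n1, ne):
    # 4. Query for a specific (N1, NE) point; None if absent
    row = conn.execute("SELECT val_loss FROM results WHERE N1=? AND NE=?",
                       (n1, ne)).fetchone()
    return row[0] if row else None
```

The `(N1, NE)` primary key makes re-evaluations of the same point an overwrite rather than a duplicate row, which is usually what an HPO cache wants.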

CPU / GPU configuration to avoid resource contention

Add functionality for setting the number of threads per run and which GPU each run uses, in order to avoid resource contention/starvation when multiple runs share a node.

Thread config for Keras + TensorFlow:

keras-team/keras#4740
http://stackoverflow.com/questions/34389945/changing-the-number-of-threads-in-tensorflow-on-cifar10

With respect to GPUs, the environment variable CUDA_VISIBLE_DEVICES can be used, but we'd need a way to figure out which device is free.

http://stackoverflow.com/questions/37893755/tensorflow-set-cuda-visible-devices-within-jupyter
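
A minimal sketch of per-run pinning via environment variables, assuming runs carry an integer rank on the node; the round-robin GPU assignment is a naive placeholder, not the free-device probe the issue asks for:

```python
import os

def configure_run(rank, threads_per_run=4, gpus_per_node=0):
    """Pin one run to a thread budget and (optionally) a single GPU.

    Must be called before TensorFlow initializes, since both OMP_NUM_THREADS
    and CUDA_VISIBLE_DEVICES are read at startup.
    """
    # Thread cap honored by OpenMP-backed math libraries
    os.environ["OMP_NUM_THREADS"] = str(threads_per_run)
    if gpus_per_node > 0:
        # Restrict this process to one device (naive round-robin choice)
        gpu = rank % gpus_per_node
        os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu)
        return gpu
    os.environ["CUDA_VISIBLE_DEVICES"] = "-1"  # CPU-only run
    return None
```

Within TensorFlow itself, the intra/inter-op thread pools can additionally be capped through the session configuration discussed in the Stack Overflow threads above.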

Add restart capability

If the allotted run time proves insufficient, a rerun should pick up from the completed iterations and restart with an updated, better estimate of the time required.
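
The walltime estimate for the rerun could come from the partial run itself; this hypothetical helper just extrapolates the observed per-iteration cost with a safety margin:

```python
def restart_walltime(total_iters, done_iters, elapsed_s, margin=1.25):
    """Estimate the walltime (seconds) needed to finish a truncated run.

    Uses the observed per-iteration cost from the partial run, padded
    by a safety margin; assumes iteration cost is roughly constant.
    """
    if done_iters == 0:
        raise ValueError("no completed iterations to estimate from")
    per_iter = elapsed_s / done_iters
    remaining = total_iters - done_iters
    return per_iter * remaining * margin
```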

CANDLE Usability

  • "How to run in CANDLE" on GitHub

  • "How to write CANDLE-compliant code" on GitHub

Try Horovod in Supervisor

Horovod is currently hard-coded to use MPI_COMM_WORLD. However, the underlying MPI code is only ~2000 lines of C++ and I think it may be possible to make this work.

mechanism for checking python dependencies?

There are several Python libraries we're using, and we only discover missing libraries when jobs fail. It would be ideal to have a routine that checks all the dependencies.

However, given how differently these machines are configured, checking Python libraries on the login node does not guarantee they are available on the compute nodes. Any ideas?
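
One option is a tiny pre-flight check submitted as a compute-node job before the real workflow; the module list below is illustrative, not the actual dependency set:

```python
import importlib

REQUIRED = ["numpy", "keras", "hyperopt"]  # illustrative list

def check_dependencies(modules=REQUIRED):
    """Return the subset of modules that cannot be imported.

    Run this on a compute node, since the login-node environment
    may differ from the one batch jobs actually see.
    """
    missing = []
    for name in modules:
        try:
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    return missing
```

Running it as the first task of the job (rather than on the login node) is what makes the check meaningful on machines with per-partition environments.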

Hyperopt EMEWS Python API change/update

While trying to compile and run the examples in a vanilla Ubuntu 16.04 VirtualBox, I see the error below. A change to the Hyperopt EMEWS module may be required.

jain@jain-VirtualBox:~/Supervisor/workflows/p1b1_hyperopt/swift$ ./workflow.sh ex11
Experiment directory exists. Continue? (Y/n) Y
/home/jain/Supervisor/workflows/p1b1_hyperopt/python:/home/jain/Supervisor/workflows/p1b1_hyperopt/ext/EQ-Py:/home/jain/Supervisor/workflows/p1b1_hyperopt/../../python/hyperopt:/home/jain/Supervisor/workflows/p1b1_hyperopt/../../../Benchmarks/Pilot1/P1B1
+ SWIFT_FILE=workflow.swift
+ swift-t -n 4 -p -I /home/jain/Supervisor/workflows/p1b1_hyperopt/ext/EQ-Py -r /home/jain/Supervisor/workflows/p1b1_hyperopt/ext/EQ-Py /home/jain/Supervisor/workflows/p1b1_hyperopt/swift/workflow.swift ex11 -seed=1234 -max_evals=4 -param_batch_size=1 -space_description_file=/home/jain/Supervisor/workflows/p1b1_hyperopt/data/space_description.txt -data_directory=/home/jain/Supervisor/workflows/p1b1_hyperopt/data
WARN  /home/jain/Supervisor/workflows/p1b1_hyperopt/swift/workflow.swift:74:29: Variable usage warning. Variable trials is not used
WARN  /home/jain/Supervisor/workflows/p1b1_hyperopt/swift/workflow.swift:74:29: Variable usage warning. Variable params is not used
WARN  /home/jain/Supervisor/workflows/p1b1_hyperopt/swift/workflow.swift:97:5: Variable usage warning. Variable trials is not used
WARN  /home/jain/Supervisor/workflows/p1b1_hyperopt/swift/workflow.swift:97:5: Variable usage warning. Variable id_suffix is not used
WARN  /home/jain/Supervisor/workflows/p1b1_hyperopt/swift/workflow.swift:103:10: Variable usage warning. Variable v might be read and not written, possibly leading to deadlock
WARN  /home/jain/Supervisor/workflows/p1b1_hyperopt/ext/EQ-Py/EQPy.swift:24: Variable usage warning. Variable loc is not used
CAUGHT ERROR:
wrong # args: should be "turbine::python persist exceptions_are_errors code expression"
    while executing
"turbine::python 1 ${v:code:1:1} "\"\"""
    (procedure "_void_py-argwait" line 3)
    invoked from within
"_void_py-argwait 5 7 2 {import eqpy
import eqpy_hyperopt.hyperopt_runner
import threading
p = threading.Thread(target=eqpy_hyperopt.hyperopt_runner.ru..."
Turbine worker task error in: _void_py-argwait 5 7 2 {import eqpy
import eqpy_hyperopt.hyperopt_runner
import threading
p = threading.Thread(target=eqpy_hyperopt.hyperopt_runner.run)
p.start()} /home/jain/Supervisor/workflows/p1b1_hyperopt/data /home/jain/Supervisor/workflows/p1b1_hyperopt/experiments/ex1 6
    invoked from within
"c::worker_loop $WORK_TYPE($mode) $keyword_args"
    (procedure "standard_worker" line 27)
    invoked from within
"standard_worker $rules $startup_cmd $mode"
    (procedure "custom_worker" line 5)
    invoked from within
"custom_worker $rules $startup_cmd $mode"
    (procedure "enter_mode_unchecked" line 7)
    invoked from within
"enter_mode_unchecked $rules $startup_cmd"
    (procedure "enter_mode" line 5)
    invoked from within
"enter_mode $rules $startup_cmd "
ADLB: ADLB_Abort(1) calling MPI_Abort(1)
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 2

Require "id" to be unique in parameters file (upf-1.txt)

I mistakenly set the id to test0 for all the entries in the upf-1.txt file. The workflow ran, and it looks like multiple workers are writing to the same files. I will have to resubmit the run. Perhaps the workflow could verify that the ids are unique before starting.

{"id": "test0",
{"id": "test0",
{"id": "test0",
{"id": "test0",
...
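
A pre-flight uniqueness check could be as simple as the sketch below; it assumes the UPF is one JSON object per line with an "id" field, as in the excerpt above:

```python
import json
from collections import Counter

def duplicate_ids(upf_path):
    """Return the ids that appear more than once in an unrolled
    parameter file (one JSON object per line, each with an "id")."""
    with open(upf_path) as f:
        ids = [json.loads(line)["id"] for line in f if line.strip()]
    return [i for i, n in Counter(ids).items() if n > 1]
```

The workflow could refuse to start when this returns a non-empty list, instead of letting runs silently clobber each other's output directories.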

writing parameter.txt for runs

Divide the output/run-dir/parameters.txt into two:
[Model Params]
and
[Monitor Params]

It currently dumps everything under one subheading, so the file cannot be used as-is with Benchmarks.
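
A sketch of the proposed writer, assuming an INI-style layout (the section names come from the issue, but the exact file format is an assumption):

```python
def write_parameters(path, model_params, monitor_params):
    """Write run parameters as two INI-style sections, so the
    [Model Params] block can be consumed by Benchmarks as-is."""
    with open(path, "w") as f:
        f.write("[Model Params]\n")
        for k, v in model_params.items():
            f.write("{} = {}\n".format(k, v))
        f.write("\n[Monitor Params]\n")
        for k, v in monitor_params.items():
            f.write("{} = {}\n".format(k, v))
```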

External DNNs

Clean up and document results from tutorial work.

make install

We need to be able to install the Benchmarks and Supervisor from the Git working copy into another FS that is used at runtime. At OLCF and on Beagle, we need to be able to install from HOME to Lustre. At NERSC, we should use the software-optimized FS (#29).

model_runner.py model naming convention

From Tom B.

when you get a chance, can you look at making model_runner.py not dependent on the naming convention, particularly:
module_name = "{}_baseline_keras2".format(model_name)
Not sure how best to do it.
but it really locks us into a directory and naming convention
say I have ...Combo/infer.py that I want to run using the UPF workflow
I can name it Combo/infer_baseline_keras.py but the current use of MODEL_NAME assumes the directory and the first part of the model name are the same.
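
One way to decouple the two would be an explicit module-name override that falls back to the current convention; the parameter name here is hypothetical, not existing model_runner.py API:

```python
import importlib

def resolve_model_module(model_name, module_name=None):
    """Import the benchmark module for a model.

    By default this reproduces the current convention
    ("<model_name>_baseline_keras2"); passing module_name explicitly
    (a hypothetical override) would let e.g. Combo/infer.py run under
    the UPF workflow without renaming it.
    """
    if module_name is None:
        module_name = "{}_baseline_keras2".format(model_name)
    return importlib.import_module(module_name)
```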

Bash invocation intermittently fails with no error output

Invoking Python code via a bash script intermittently fails. For example, sometimes the experiment_start.json file is not created because bash seems to return immediately without running the script. Similarly, an invocation of a benchmark run will sometimes return immediately without running the benchmark. In the latter case we return NaN back to mlrMBO.
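
Until the root cause is found, a retrying wrapper that also verifies the expected output file exists would at least surface the symptom; this is a diagnostic sketch, not a fix for the underlying bash behavior:

```python
import os
import subprocess
import time

def run_with_retry(cmd, expected_file=None, attempts=3, delay=1.0):
    """Run a shell command, retrying when it exits nonzero or the
    expected output file is missing (the intermittent-failure symptom).
    Returns True on success, False after exhausting all attempts."""
    for _ in range(attempts):
        proc = subprocess.run(cmd, shell=True,
                              capture_output=True, text=True)
        ok = proc.returncode == 0 and (
            expected_file is None or os.path.exists(expected_file))
        if ok:
            return True
        time.sleep(delay)
    return False
```

Logging proc.returncode, proc.stdout, and proc.stderr on each failed attempt would also give us something to work with, since today the failure produces no error output at all.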
