Please refer to CANDLE Documentation Home for tutorials and CANDLE Library API.
ecp-candle / supervisor (License: MIT)
@jmjwozniak working on system scale run
@pbalapra system-scale runs for the three P1 benchmarks on Titan, with a new version of the code.
Prepare a CANDLE repo that contains:
-- the Benchmarks repo
For mlrMBO scaling tests.
Auto-configure:
Set TURBINE_RESIDENT_WORK_WORKERS based on studies[12].txt
Auto-create DB based on HPO search space.
Add functionality for setting the number of threads per run, and which GPU to use per run, in order to avoid resource contention/starvation when running multiple runs on the same node.
Thread configuration for Keras + TensorFlow:
keras-team/keras#4740
http://stackoverflow.com/questions/34389945/changing-the-number-of-threads-in-tensorflow-on-cifar10
With respect to GPUs, the env var CUDA_VISIBLE_DEVICES can be used, but we'd need a way to figure out which device is free.
http://stackoverflow.com/questions/37893755/tensorflow-set-cuda-visible-devices-within-jupyter
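The per-run GPU assignment described above could be sketched as follows. This is a minimal, hypothetical helper (the function name and the rank-based assignment policy are assumptions, not Supervisor code); it pins a process to one device by local rank and notes where the TF1-era thread limits would go.

```python
import os

def pin_gpu(local_rank, gpus_per_node):
    """Hypothetical helper: pin this process to one GPU by local rank.

    Must run before TensorFlow initializes CUDA, or the setting is ignored.
    """
    device = local_rank % gpus_per_node
    os.environ["CUDA_VISIBLE_DEVICES"] = str(device)
    return device

# Thread limits would then go into the TF1-era session config, e.g.:
#   tf.ConfigProto(intra_op_parallelism_threads=k,
#                  inter_op_parallelism_threads=k)
```

Round-robin by local rank avoids contention only if the scheduler places at most `gpus_per_node` ranks per node; detecting a genuinely free device would need something like querying nvidia-smi.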
Do a quick review of the open-source license, as requested in the meeting.
We currently have trouble running Python/TF on Summit.
Also note the difference between Python nan and R NaN.
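A minimal illustration of that mismatch: Python stringifies its NaN in lowercase, while R's parser expects "NaN", so naive string passing between the two breaks. The normalization below is a sketch of one possible fix.

```python
import math

x = float("nan")
print(x)                  # Python prints: nan
# R expects "NaN", so normalize before handing the value to R:
text = "NaN" if math.isnan(x) else repr(x)
print(text)               # NaN
```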
If the allotted run time is insufficient, a rerun should reuse the completed iterations and restart with an updated, better estimate of the time required.
"How to run in CANDLE" in github
"How to write CANDLE compliant code" in github
If the evaluation for a set of hyperparameters has already been performed, don't rerun it; supply the results from the previous run.
Doing this now...
Use #33 to call infer without the need for a soft link at infer_baseline_keras2.py .
It is in /global/common/software
Horovod is currently hard-coded to use MPI_COMM_WORLD. However, the underlying MPI code is only ~2000 lines of C++ and I think it may be possible to make this work.
There are several Python libraries we're using, and we only discover missing libraries when jobs fail. It would be ideal to have a routine that checks all the dependencies.
However, due to the complexity of these machines, checking Python libraries on the login node is no guarantee, since the compute-node environment may differ. Any ideas for this?
Just a prototype for discussion.
Get these two running together.
Example will build on simple synthetic Swift workflow.
or a fork of these scripts.
The workflow scripts contain a lot of duplicated code that can be removed and put in a single place such as Supervisor/common/sh.
Should do immediately; need to get a hyperparameter test case.
Reported by @hyoo
Need to determine convention for Swift scripts, such as indentation level and opening brace position. @ncollier , any preferences?
with Lorenzo's Python. Already half done...
for mlrMBO tests.
Add a test module for each workflow. This feeds into #20 .
Draft workflow that uses S[1-5].py to inject user code.
Start with existing workflows.
from the winter.
via Python subprocess
mlrMBO1 and mlrMBO3
ai_workflow
workflowx.sh, etc.
for FS issues on Theta, Summit.
I'm trying to compile and run the examples on a vanilla Ubuntu 16.04 VirtualBox, and I see the error below:
Possible Hyperopt EMEWS module change required.
jain@jain-VirtualBox:~/Supervisor/workflows/p1b1_hyperopt/swift$ ./workflow.sh ex11
Experiment directory exists. Continue? (Y/n) Y
/home/jain/Supervisor/workflows/p1b1_hyperopt/python:/home/jain/Supervisor/workflows/p1b1_hyperopt/ext/EQ-Py:/home/jain/Supervisor/workflows/p1b1_hyperopt/../../python/hyperopt:/home/jain/Supervisor/workflows/p1b1_hyperopt/../../../Benchmarks/Pilot1/P1B1
+ SWIFT_FILE=workflow.swift
+ swift-t -n 4 -p -I /home/jain/Supervisor/workflows/p1b1_hyperopt/ext/EQ-Py -r /home/jain/Supervisor/workflows/p1b1_hyperopt/ext/EQ-Py /home/jain/Supervisor/workflows/p1b1_hyperopt/swift/workflow.swift ex11 -seed=1234 -max_evals=4 -param_batch_size=1 -space_description_file=/home/jain/Supervisor/workflows/p1b1_hyperopt/data/space_description.txt -data_directory=/home/jain/Supervisor/workflows/p1b1_hyperopt/data
WARN /home/jain/Supervisor/workflows/p1b1_hyperopt/swift/workflow.swift:74:29: Variable usage warning. Variable trials is not used
WARN /home/jain/Supervisor/workflows/p1b1_hyperopt/swift/workflow.swift:74:29: Variable usage warning. Variable params is not used
WARN /home/jain/Supervisor/workflows/p1b1_hyperopt/swift/workflow.swift:97:5: Variable usage warning. Variable trials is not used
WARN /home/jain/Supervisor/workflows/p1b1_hyperopt/swift/workflow.swift:97:5: Variable usage warning. Variable id_suffix is not used
WARN /home/jain/Supervisor/workflows/p1b1_hyperopt/swift/workflow.swift:103:10: Variable usage warning. Variable v might be read and not written, possibly leading to deadlock
WARN /home/jain/Supervisor/workflows/p1b1_hyperopt/ext/EQ-Py/EQPy.swift:24: Variable usage warning. Variable loc is not used
CAUGHT ERROR:
wrong # args: should be "turbine::python persist exceptions_are_errors code expression"
while executing
"turbine::python 1 ${v:code:1:1} "\"\"""
(procedure "_void_py-argwait" line 3)
invoked from within
"_void_py-argwait 5 7 2 {import eqpy
import eqpy_hyperopt.hyperopt_runner
import threading
p = threading.Thread(target=eqpy_hyperopt.hyperopt_runner.ru..."
Turbine worker task error in: _void_py-argwait 5 7 2 {import eqpy
import eqpy_hyperopt.hyperopt_runner
import threading
p = threading.Thread(target=eqpy_hyperopt.hyperopt_runner.run)
p.start()} /home/jain/Supervisor/workflows/p1b1_hyperopt/data /home/jain/Supervisor/workflows/p1b1_hyperopt/experiments/ex1 6
invoked from within
"c::worker_loop $WORK_TYPE($mode) $keyword_args"
(procedure "standard_worker" line 27)
invoked from within
"standard_worker $rules $startup_cmd $mode"
(procedure "custom_worker" line 5)
invoked from within
"custom_worker $rules $startup_cmd $mode"
(procedure "enter_mode_unchecked" line 7)
invoked from within
"enter_mode_unchecked $rules $startup_cmd"
(procedure "enter_mode" line 5)
invoked from within
"enter_mode $rules $startup_cmd "
ADLB: ADLB_Abort(1) calling MPI_Abort(1)
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 2
I mistakenly set the id to test0 for all the entries in the upf-1.txt file. The workflow ran, and it looks like multiple workers are writing to the same files. I will have to resubmit the run. Perhaps the workflow could verify that the ids are unique before starting.
{"id": "test0",
{"id": "test0",
{"id": "test0",
{"id": "test0",
...
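The suggested pre-flight verification could be sketched like this, assuming one JSON object per UPF line as in the entries above (the function name is hypothetical):

```python
import json

def duplicate_ids(upf_lines):
    """Hypothetical check: return ids that appear more than once in a UPF."""
    seen, dupes = set(), []
    for line in upf_lines:
        _id = json.loads(line)["id"]
        if _id in seen:
            dupes.append(_id)
        seen.add(_id)
    return dupes
```

The workflow could abort before submitting any tasks if this returns a non-empty list, avoiding the clobbered output directories described above.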
The latest challenge problem doc adds a requirement to train Uno on each study with no feature file. We need to add this to the workflow, including the DB.
Divide the output/run-dir/parameters.txt into two:
[Model Params]
and
[Monitor Params]
It currently dumps everything under one subheading, so the file cannot be used "as is" with the Benchmarks.
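The two-section split could be produced with the standard-library configparser; the parameter names below are placeholders, not the actual contents of parameters.txt.

```python
import configparser

# Hypothetical split: write model vs. monitor settings to separate
# sections so the Benchmarks can consume [Model Params] directly.
config = configparser.ConfigParser()
config["Model Params"] = {"epochs": "10", "batch_size": "32"}
config["Monitor Params"] = {"timeout": "3600"}
with open("parameters.txt", "w") as f:
    config.write(f)
```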
with Python.
Clean up and document results from tutorial work.
We need to be able to install the Benchmarks and Supervisor from the Git working copy into another FS that is used at runtime. At OLCF and on Beagle, we need to be able to install from HOME to Lustre. At NERSC, we should use the software-optimized FS (#29).
https://github.com/ECP-CANDLE/Supervisor/blob/develop/workflows/common/sh/env-summit.sh#L37
For Summit, it is hardcoded and ignores the user's setting.
Can we preserve what the user sets, and supply a default value only when it is empty?
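The standard shell idiom for that is the `${VAR:-default}` parameter expansion, which keeps a user-supplied value and falls back only when the variable is unset or empty. The variable name and default path below are placeholders, not the actual contents of env-summit.sh.

```shell
# Keep the user's setting when present; otherwise use the site default.
SITE_DEFAULT="/sw/summit/python"               # hypothetical default
MY_SETTING="${MY_SETTING:-$SITE_DEFAULT}"
export MY_SETTING
echo "$MY_SETTING"
```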
From Tom B.
when you get a chance, can you look at making model_runner.py not dependent on the naming convention, particularly:
module_name = "{}_baseline_keras2".format(model_name)
Not sure how best to do it.
but it really locks us into a directory and naming convention
say I have ...Combo/infer.py that I want to run using the UPF workflow
I can name it Combo/infer_baseline_keras.py but the current use of MODEL_NAME assumes the directory and the first part of the model name are the same.
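One way to relax the convention would be an optional explicit module name that overrides the derived one; the helper below is a sketch of that idea (the function and parameter names are assumptions, not the current model_runner.py API).

```python
import importlib

def load_model_module(model_name, module_name=None):
    # Hypothetical: an explicit module_name (e.g. "infer") overrides the
    # "<model>_baseline_keras2" convention; otherwise keep the old behavior.
    if module_name is None:
        module_name = "{}_baseline_keras2".format(model_name)
    return importlib.import_module(module_name)
```

With this, a UPF entry could carry its own module name, so Combo/infer.py would no longer need to be renamed or soft-linked to match the directory.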
Invoking Python code via a bash script intermittently fails. For example, sometimes the experiment_start.json file is not created because bash seems to return immediately without running the script. Similarly, an invocation of a benchmark run will sometimes return immediately without running the benchmark. In the latter case we return NaN back to mlrMBO.
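Until the root cause is found, a retry wrapper around the invocation could paper over the intermittent failures; this is a hypothetical sketch (names and retry policy are assumptions), detecting the silent-return case by an empty stdout.

```python
import subprocess
import sys
import time

def run_with_retry(cmd, attempts=3, delay=1.0):
    """Hypothetical guard: retry a command that exits nonzero or produces
    no output, instead of silently handing NaN back to mlrMBO."""
    for _ in range(attempts):
        proc = subprocess.run(cmd, capture_output=True, text=True)
        if proc.returncode == 0 and proc.stdout.strip():
            return proc.stdout
        time.sleep(delay)
    return float("nan")

# e.g.: out = run_with_retry([sys.executable, "-c", "print('ok')"])
```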