
Supervisor's Issues

CANDLE Python Library

Prepare a CANDLE repo that contains:

  • core library for CANDLE-compliant codes
  • examples that use the CANDLE Library
  • Multi-layer network for MNIST dataset
  • CNN for MNIST dataset
  • Unet

-- Benchmark repo

  • make new release branch
  • merge or clean up stale branches
  • update Pilot1/2/3 code with CANDLE Library (?)

Auto-configuration in CP1

Auto-configure:

Set TURBINE_RESIDENT_WORK_WORKERS based on studies[12].txt
Auto-create DB based on HPO search space.

Initial DB scripts

  1. Set up the database
  2. Insert a record f(N1,NE)->val_loss
  3. List records
  4. Query for (N1,NE)
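
The four DB scripts above could be sketched with the standard-library sqlite3 module; the table name, column names, and SQLite itself are assumptions here, not a committed design:

```python
import sqlite3

def setup_db(path="hpo.db"):
    # 1. Set up the database: one table mapping (N1, NE) -> val_loss
    conn = sqlite3.connect(path)
    conn.execute("""CREATE TABLE IF NOT EXISTS results
                    (N1 INTEGER, NE INTEGER, val_loss REAL,
                     PRIMARY KEY (N1, NE))""")
    return conn

def insert_record(conn, n1, ne, val_loss):
    # 2. Insert (or overwrite) a record f(N1, NE) -> val_loss
    conn.execute("INSERT OR REPLACE INTO results VALUES (?, ?, ?)",
                 (n1, ne, val_loss))
    conn.commit()

def list_records(conn):
    # 3. List all records
    return conn.execute("SELECT N1, NE, val_loss FROM results").fetchall()

def query(conn, n1, ne):
    # 4. Query for a specific (N1, NE) point; None if absent
    row = conn.execute("SELECT val_loss FROM results WHERE N1=? AND NE=?",
                       (n1, ne)).fetchone()
    return row[0] if row else None
```

The `(N1, NE)` primary key makes re-evaluations of the same point an overwrite rather than a duplicate row, which is usually what an HPO cache wants.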

CPU / GPU configuration to avoid resource contention

Add functionality for setting the number of threads per run and which GPU each run uses, in order to avoid resource contention/starvation when multiple runs share a node.

Thread config for Keras + TensorFlow:

keras-team/keras#4740
http://stackoverflow.com/questions/34389945/changing-the-number-of-threads-in-tensorflow-on-cifar10

With respect to GPUs, the environment variable CUDA_VISIBLE_DEVICES can be used, but we'd need a way to figure out which device is free.

http://stackoverflow.com/questions/37893755/tensorflow-set-cuda-visible-devices-within-jupyter
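
A minimal sketch of per-run pinning via environment variables, assuming runs carry an integer rank on the node; the round-robin GPU assignment is a naive placeholder, not the free-device probe the issue asks for:

```python
import os

def configure_run(rank, threads_per_run=4, gpus_per_node=0):
    """Pin one run to a thread budget and (optionally) a single GPU.

    Must be called before TensorFlow initializes, since both OMP_NUM_THREADS
    and CUDA_VISIBLE_DEVICES are read at startup.
    """
    # Thread cap honored by OpenMP-backed math libraries
    os.environ["OMP_NUM_THREADS"] = str(threads_per_run)
    if gpus_per_node > 0:
        # Restrict this process to one device (naive round-robin choice)
        gpu = rank % gpus_per_node
        os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu)
        return gpu
    os.environ["CUDA_VISIBLE_DEVICES"] = "-1"  # CPU-only run
    return None
```

Within TensorFlow itself, the intra/inter-op thread pools can additionally be capped through the session configuration discussed in the Stack Overflow threads above.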

Add restart capability

If the allotted run time proves insufficient, a rerun should pick up from the completed iterations and restart with an updated, better estimate of the time required.
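
The walltime estimate for the rerun could come from the partial run itself; this hypothetical helper just extrapolates the observed per-iteration cost with a safety margin:

```python
def restart_walltime(total_iters, done_iters, elapsed_s, margin=1.25):
    """Estimate the walltime (seconds) needed to finish a truncated run.

    Uses the observed per-iteration cost from the partial run, padded
    by a safety margin; assumes iteration cost is roughly constant.
    """
    if done_iters == 0:
        raise ValueError("no completed iterations to estimate from")
    per_iter = elapsed_s / done_iters
    remaining = total_iters - done_iters
    return per_iter * remaining * margin
```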

CANDLE Usability

  • "How to run in CANDLE" on GitHub

  • "How to write CANDLE-compliant code" on GitHub

Try Horovod in Supervisor

Horovod is currently hard-coded to use MPI_COMM_WORLD. However, the underlying MPI code is only ~2000 lines of C++ and I think it may be possible to make this work.

mechanism for checking python dependencies?

There are several Python libraries we're using, and we only discover missing libraries when jobs fail. It would be ideal to have a routine that checks all the dependencies.

However, given how differently these machines are configured, checking Python libraries on the login node does not guarantee they are available on the compute nodes. Any ideas?
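
One option is a tiny pre-flight check submitted as a compute-node job before the real workflow; the module list below is illustrative, not the actual dependency set:

```python
import importlib

REQUIRED = ["numpy", "keras", "hyperopt"]  # illustrative list

def check_dependencies(modules=REQUIRED):
    """Return the subset of modules that cannot be imported.

    Run this on a compute node, since the login-node environment
    may differ from the one batch jobs actually see.
    """
    missing = []
    for name in modules:
        try:
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    return missing
```

Running it as the first task of the job (rather than on the login node) is what makes the check meaningful on machines with per-partition environments.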

Hyperopt EMEWS Python API change/update

While trying to compile and run the examples in a vanilla Ubuntu 16.04 VirtualBox, I see the error below. A change to the Hyperopt EMEWS module may be required.

jain@jain-VirtualBox:~/Supervisor/workflows/p1b1_hyperopt/swift$ ./workflow.sh ex11
Experiment directory exists. Continue? (Y/n) Y
/home/jain/Supervisor/workflows/p1b1_hyperopt/python:/home/jain/Supervisor/workflows/p1b1_hyperopt/ext/EQ-Py:/home/jain/Supervisor/workflows/p1b1_hyperopt/../../python/hyperopt:/home/jain/Supervisor/workflows/p1b1_hyperopt/../../../Benchmarks/Pilot1/P1B1
+ SWIFT_FILE=workflow.swift
+ swift-t -n 4 -p -I /home/jain/Supervisor/workflows/p1b1_hyperopt/ext/EQ-Py -r /home/jain/Supervisor/workflows/p1b1_hyperopt/ext/EQ-Py /home/jain/Supervisor/workflows/p1b1_hyperopt/swift/workflow.swift ex11 -seed=1234 -max_evals=4 -param_batch_size=1 -space_description_file=/home/jain/Supervisor/workflows/p1b1_hyperopt/data/space_description.txt -data_directory=/home/jain/Supervisor/workflows/p1b1_hyperopt/data
WARN  /home/jain/Supervisor/workflows/p1b1_hyperopt/swift/workflow.swift:74:29: Variable usage warning. Variable trials is not used
WARN  /home/jain/Supervisor/workflows/p1b1_hyperopt/swift/workflow.swift:74:29: Variable usage warning. Variable params is not used
WARN  /home/jain/Supervisor/workflows/p1b1_hyperopt/swift/workflow.swift:97:5: Variable usage warning. Variable trials is not used
WARN  /home/jain/Supervisor/workflows/p1b1_hyperopt/swift/workflow.swift:97:5: Variable usage warning. Variable id_suffix is not used
WARN  /home/jain/Supervisor/workflows/p1b1_hyperopt/swift/workflow.swift:103:10: Variable usage warning. Variable v might be read and not written, possibly leading to deadlock
WARN  /home/jain/Supervisor/workflows/p1b1_hyperopt/ext/EQ-Py/EQPy.swift:24: Variable usage warning. Variable loc is not used
CAUGHT ERROR:
wrong # args: should be "turbine::python persist exceptions_are_errors code expression"
    while executing
"turbine::python 1 ${v:code:1:1} "\"\"""
    (procedure "_void_py-argwait" line 3)
    invoked from within
"_void_py-argwait 5 7 2 {import eqpy
import eqpy_hyperopt.hyperopt_runner
import threading
p = threading.Thread(target=eqpy_hyperopt.hyperopt_runner.ru..."
Turbine worker task error in: _void_py-argwait 5 7 2 {import eqpy
import eqpy_hyperopt.hyperopt_runner
import threading
p = threading.Thread(target=eqpy_hyperopt.hyperopt_runner.run)
p.start()} /home/jain/Supervisor/workflows/p1b1_hyperopt/data /home/jain/Supervisor/workflows/p1b1_hyperopt/experiments/ex1 6
    invoked from within
"c::worker_loop $WORK_TYPE($mode) $keyword_args"
    (procedure "standard_worker" line 27)
    invoked from within
"standard_worker $rules $startup_cmd $mode"
    (procedure "custom_worker" line 5)
    invoked from within
"custom_worker $rules $startup_cmd $mode"
    (procedure "enter_mode_unchecked" line 7)
    invoked from within
"enter_mode_unchecked $rules $startup_cmd"
    (procedure "enter_mode" line 5)
    invoked from within
"enter_mode $rules $startup_cmd "
ADLB: ADLB_Abort(1) calling MPI_Abort(1)
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 2

Require "id" to be unique in parameters file (upf-1.txt)

I mistakenly set the id to test0 for all the entries in the upf-1.txt file. The workflow ran, and it looks like multiple workers are writing to the same files. I will have to resubmit the run. Perhaps the workflow could verify that the ids are unique before starting.

{"id": "test0",
{"id": "test0",
{"id": "test0",
{"id": "test0",
...
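
A pre-flight uniqueness check could be as simple as the sketch below; it assumes the UPF is one JSON object per line with an "id" field, as in the excerpt above:

```python
import json
from collections import Counter

def duplicate_ids(upf_path):
    """Return the ids that appear more than once in an unrolled
    parameter file (one JSON object per line, each with an "id")."""
    with open(upf_path) as f:
        ids = [json.loads(line)["id"] for line in f if line.strip()]
    return [i for i, n in Counter(ids).items() if n > 1]
```

The workflow could refuse to start when this returns a non-empty list, instead of letting runs silently clobber each other's output directories.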

writing parameter.txt for runs

Divide the output/run-dir/parameters.txt into two:
[Model Params]
and
[Monitor Params]

It currently dumps everything under one subheading, so the file cannot be used as-is with Benchmarks.
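
A sketch of the proposed writer, assuming an INI-style layout (the section names come from the issue, but the exact file format is an assumption):

```python
def write_parameters(path, model_params, monitor_params):
    """Write run parameters as two INI-style sections, so the
    [Model Params] block can be consumed by Benchmarks as-is."""
    with open(path, "w") as f:
        f.write("[Model Params]\n")
        for k, v in model_params.items():
            f.write("{} = {}\n".format(k, v))
        f.write("\n[Monitor Params]\n")
        for k, v in monitor_params.items():
            f.write("{} = {}\n".format(k, v))
```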

External DNNs

Clean up and document results from tutorial work.

make install

We need to be able to install the Benchmarks and Supervisor from the Git working copy into another FS that is used at runtime. At OLCF and on Beagle, we need to be able to install from HOME to Lustre. At NERSC, we should use the software-optimized FS (#29).

model_runner.py model naming convention

From Tom B.

when you get a chance, can you look at making model_runner.py not dependent on the naming convention, particularly:
module_name = "{}_baseline_keras2".format(model_name)
Not sure how best to do it.
but it really locks us into a directory and naming convention
say I have ...Combo/infer.py that I want to run using the UPF workflow
I can name it Combo/infer_baseline_keras.py but the current use of MODEL_NAME assumes the directory and the first part of the model name are the same.
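
One way to decouple the two would be an explicit module-name override that falls back to the current convention; the parameter name here is hypothetical, not existing model_runner.py API:

```python
import importlib

def resolve_model_module(model_name, module_name=None):
    """Import the benchmark module for a model.

    By default this reproduces the current convention
    ("<model_name>_baseline_keras2"); passing module_name explicitly
    (a hypothetical override) would let e.g. Combo/infer.py run under
    the UPF workflow without renaming it.
    """
    if module_name is None:
        module_name = "{}_baseline_keras2".format(model_name)
    return importlib.import_module(module_name)
```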

Bash invocation intermittently fails with no error output

Invoking Python code via a bash script intermittently fails. For example, sometimes the experiment_start.json file is not created because bash seems to return immediately without running the script. Similarly, an invocation of a benchmark run will sometimes return immediately without running the benchmark. In the latter case we return NaN back to mlrMBO.
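
Until the root cause is found, a retrying wrapper that also verifies the expected output file exists would at least surface the symptom; this is a diagnostic sketch, not a fix for the underlying bash behavior:

```python
import os
import subprocess
import time

def run_with_retry(cmd, expected_file=None, attempts=3, delay=1.0):
    """Run a shell command, retrying when it exits nonzero or the
    expected output file is missing (the intermittent-failure symptom).
    Returns True on success, False after exhausting all attempts."""
    for _ in range(attempts):
        proc = subprocess.run(cmd, shell=True,
                              capture_output=True, text=True)
        ok = proc.returncode == 0 and (
            expected_file is None or os.path.exists(expected_file))
        if ok:
            return True
        time.sleep(delay)
    return False
```

Logging proc.returncode, proc.stdout, and proc.stderr on each failed attempt would also give us something to work with, since today the failure produces no error output at all.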
