
didehpc

Project Status: WIP – Initial development is in progress, but there has not yet been a stable, usable release suitable for the public.

DIDE Cluster Support

NOTICE: This will only be of use to people at DIDE, as it uses our cluster web portal, local cluster, and local network file systems.

What is this?

This is a package for interfacing with the DIDE cluster directly from R. It is meant to make jobs running on the cluster appear as if they are running locally, but asynchronously. The idea is to let the cluster appear as an extension of your own computer, so you can start using it within an R project easily.

How does it work?

The steps below are described in more detail in the vignettes:

  1. Ensure that your project is in a directory that the cluster can see (i.e. on one of the network drives). See notes for instructions
  2. Set your DIDE credentials up so that you can log in and tell didehpc about them.
  3. Create a "context" in which future expressions will be evaluated (which will be recreated on the cluster)
  4. Create a "queue" that uses that context
  5. Queue expressions which will be run at some future time on the cluster
  6. Monitor progress, retrieve results, etc.
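
Put together, a minimal session follows the steps above. This is a sketch only: the credentials, context directory, and queued expression are illustrative assumptions, not prescriptions.

```r
# Sketch of the workflow above; names and arguments are illustrative.
# Steps 1-2: project sits on a network drive; tell didehpc who you are.
didehpc::didehpc_config_global(credentials = "yourusername")

# Step 3: create a context (it will be recreated on the cluster).
ctx <- context::context_save("contexts")

# Step 4: create a queue that uses that context.
obj <- didehpc::queue_didehpc(ctx)

# Step 5: queue an expression for future evaluation on the cluster.
t <- obj$enqueue(sessionInfo())

# Step 6: monitor progress and retrieve results.
t$status()
t$result()
```

The main vignette covers each step in full; the calls here mirror examples appearing elsewhere on this page.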

Documentation

  • New to this? The main vignette contains full instructions and explanations about why some bits are needed.
  • Need a reminder? There is a quickstart guide which is much shorter and will be quicker to glance through.
  • Trying to install packages on the cluster? Check the packages vignette for ways of controlling this.
  • Having problems? Check the troubleshooting guide.
  • Lots of small jobs to run? Consider using workers for a fast queue over several cluster nodes.

Issues

  • Check the issue tracker for known problems, or to create a new one
  • Use the "Cluster" channel on Teams, which Rich and Wes keep an eye on

Installation

The simplest approach is to run:

# install.packages("drat") # if needed
drat:::add("mrc-ide")
install.packages("didehpc")

License

MIT © Imperial College of Science, Technology and Medicine

didehpc's Issues

Loss of job tracking

After some time, the connection to the cluster drops and I cannot ping the status of jobs, e.g.

> real_inc_1$status(); Error in context_root(x$root) : context database not set up at contexts

set.seed()

Ability to set a custom seed for jobs would be great.

I also have questions about whether the task list can be deleted or edited, as these folders are surely going to become fairly cumbersome.
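
Until a built-in option exists, one workaround is to make the seed an explicit argument of the queued function, so each task is reproducible. The `run_sim` function below is hypothetical:

```r
# Hypothetical workaround: pass the seed as an argument rather than
# relying on whatever random state the cluster node starts with.
run_sim <- function(seed, n) {
  set.seed(seed)
  rnorm(n)
}

# 'obj' is an existing didehpc queue object
t <- obj$enqueue(run_sim(seed = 42, n = 100))
```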

Detect and gracefully error failed path mappings

share <- didehpc::path_mapping("malaria", tempdir(),
                               "//fi--wronghost/path", "M:")
config <- didehpc::didehpc_config(shares = share)
ctx <- context::context_save("contexts")
obj <- didehpc::queue_didehpc(ctx, config)
t <- obj$enqueue(sessionInfo())
t$wait(10)
obj$dide_log(t)

gives

generated on host: wpia-dide136.dide.ic.ac.uk
generated on date: 2017-05-05
didehpc version: 0.1.1
context version: 0.1.1
running on: FI--DIDEMRC18
mapping Q: -> \\fi--san03\homes\rfitzjoh
The command completed successfully.
mapping T: -> \\fi--didef2\tmp
The command completed successfully.
mapping M: -> \\fi--wronghost\path
System error 53 has occurred.
The network path was not found.
working directory: Q:\cluster_testing\20170505\basic_1322433a3c2e
this is a single task
logfile: Q:\cluster_testing\20170505\basic_1322433a3c2e\contexts\logs\4b9d447dd87220803cab6e92a6ec54cf
Q:\cluster_testing\20170505\basic_1322433a3c2e>Rscript "Q:\cluster_testing\20170505\basic_1322433a3c2e\contexts\bin\task_run" "Q:\cluster_testing\20170505\basic_1322433a3c2e\contexts" 4b9d447dd87220803cab6e92a6ec54cf  1>"Q:\cluster_testing\20170505\basic_1322433a3c2e\contexts\logs\4b9d447dd87220803cab6e92a6ec54cf" 2>&1
Quitting

Perhaps detect the failed mapping commands and, if they fail, pass something through to automatically error the job?

Avoid stan path warning

rstan's rtools detection does not seem to pick up Rtools in the location that we put it.

It finds gcc in T:\Rtools\Rtools33\gcc-4.6.3\bin but finds ls in C:\windows, so rstan gives a warning that it can't be found. However, PATH seems correct as

"T:\\Rtools\\Rtools33\\bin;T:\\Rtools\\Rtools33\\gcc-4.6.3\\bin;C:\\Program Files (x86)\\Common Files\\libhmsbeagle\\;C:\\Program Files\\Microsoft MPI\\Bin\\;C:\\Program Files\\Microsoft HPC Pack 2012\\Bin\\;C:\\Windows\\system32;C:\\Windows;C:\\Windows\\System32\\Wbem;C:\\Windows\\System32\\WindowsPowerShell\\v1.0\\;C:\\Program Files\\Dell\\SysMgt\\oma\\bin;C:\\Program Files\\Dell\\SysMgt\\shared\\bin;C:\\Program Files\\Dell\\SysMgt\\idrac;C:\\Program Files\\Mellanox\\\\WinMFT\\;;C:\\Program Files\\R\\R-3.2.4revised\\bin\\x64"

Potentially corrupt task ID database

Running cluster jobs that write output to the network drive. The disk runs out of space, which results in corrupted processes. These processes seem to become lost on the web portal and through the didewin tools, but retain a lock on the drive.

Error in:aa7caae7e6dd49e5ba5bcd047e019faf. Error is: Error in exists(name, envir = envir, inherits = FALSE): invalid first argument

aborting jobs

Is it possible to cancel jobs that I've submitted? Is it possible to do this in batch?
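
One possible shape for this, assuming the queue object grows an `unsubmit` method (an assumption here, not a documented API at the time of this issue):

```r
# Assumed interface: an 'unsubmit' method taking one or more task ids.
obj$unsubmit(t$id)    # cancel a single submitted task
obj$unsubmit(ids)     # cancel a vector of task ids in batch
```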

wmic command can still fail

This is not always the case:

format_csv <- sprintf('/format:"%s\\System32\\wbem\\en-US\\csv"', windir)

sometimes it is en-GB - I'm sure that there can be others too!
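
A possible fix is to stop hard-coding the locale and instead look up whichever localised wbem directory actually exists on the node. This is a sketch, not the package's current code:

```r
# Sketch: find the localised wbem directory instead of assuming en-US.
windir <- Sys.getenv("WINDIR", "C:\\Windows")
wbem <- file.path(windir, "System32", "wbem")
locale <- dir(wbem, pattern = "^[a-z]{2}-[A-Z]{2}$")[[1]]
format_csv <- sprintf('/format:"%s\\%s\\csv"', wbem, locale)
```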

Job fail on package load should log as error

If a package is not found (or is an incorrect version), or if the source files are invalid, then the task will fail but the job is not recorded as failed. This might be solved in queuer, but we need to change the task status from PENDING to ERROR.

Work on paths with spaces

setwd("Q:/cluster_test/path with spaces")
didewin::didewin_config_global(credentials="rfitzjoh",
                               cluster="fi--didemrchnb")
ctx <- context::context_save("contexts", packages=NULL, sources=NULL)
context::context_log_start()
obj <- didewin::queue_didewin(ctx)

gives

Loading context f413d5b3eb34996d77ffb38f680ad15d
[ context   ]  f413d5b3eb34996d77ffb38f680ad15d
[ lib       ]  contexts/R/x86_64-w64-mingw32/3.3.0
[ library   ]
[ namespace ]
[ source    ]
[ cross     ]  checking available packages
[ cross     ]  Downloading binary packages for: crayon, curl, digest, drat, memoise, R6, storr, uuid
trying URL 'https://cran.rstudio.com/bin/windows/contrib/3.2/crayon_1.3.1.zip'
Content type 'application/zip' length 606260 bytes (592 KB)
downloaded 592 KB
...
[ cross     ]  Cross installing source packages for: context
trying URL 'https://richfitz.github.io/drat/src/contrib/context_0.0.7.tar.gz'
Content type 'application/octet-stream' length 41779 bytes (40 KB)
downloaded 40 KB

Warning: invalid package 'with'
Warning: invalid package 'spaces/contexts/R/x86_64-w64-mingw32/3.2.4'
Error: ERROR: cannot cd to directory 'Q:/cluster_test/path'
Error in FUN(X[[i]], ...) : Command failed (code: 1)
In addition: Warning message:
running command '"C:/PROGRA~1/R/R-33~1.0/bin/R" R_LIBS_USER=Q:/cluster_test/path with spaces/contexts/R/x86_64-w64-mingw32/3.2.4;Q:/cluster_test/path with spaces/contexts/R/x86_64-w64-mingw32/3.3.0;C:/Users/rfitzjoh/Documents/R/win-library/3.3;C:/Program Files/R/R-3.3.0/library CYGWIN=nodosfilewarning CMD INSTALL --no-test-load --library=Q:/cluster_test/path with spaces/contexts/R/x86_64-w64-mingw32/3.2.4 C:\Users\rfitzjoh\AppData\Local\Temp\Rtmpiq85SP\file8e030676c43b6\context' had status 1 

info on nodes

Is the information on which node a job is run on stored in the task bundle object? Would that be possible?

Force upgrade of a package

This is particularly complicated for packages that are built by buildr because they need purging from the drat binary repo too...

errors don't show in the HPC software

When I'm running jobs on nodes with not enough memory and they fail, I get the normal R error message in the log script, but in the HPC software the job is logged as finished, which is a little inconvenient. It does come up as ERROR in the task bundle object. However, if I run really long jobs I don't really want to keep the R session running until the jobs are finished (I might even switch off my computer over night). Can I even get the ids back if I've exited the R session? Or would it be worthwhile to output the task bundle object as an .RData file so I can get back to it?
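
On the last question: a task bundle is an ordinary R object, so one option is to serialise it to disk and reload it in a later session. This is a sketch; whether the reloaded object reconnects cleanly to the context database has not been verified here:

```r
# 'grp' is an existing task bundle; save it before closing the session.
saveRDS(grp, "my_jobs.rds")

# ...later, in a fresh R session, after recreating the queue object:
grp <- readRDS("my_jobs.rds")
grp$status()
```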

fi--dideclusthn core limit 8 => 12

Reported error:
"Error in check_resources(cluster, dat$template, dat$cores, dat$wholenode, :
Maximum number of cores for fi--dideclusthn is 8"

Which used to be true, but we now have four 12-core nodes.

bulk submit jobs

Submitting very large numbers of jobs (really anything over ~50) is tedious and takes ages; it would be nice to have something that can submit a bunch of jobs at once.

  • Submit multiple jobs at once via the web interface (loses names, but could be fixed)
  • Submit jobs on the cluster itself (not actually any faster but at least is async)
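
One possible interface, where the queue accepts a data.frame of parameters and submits one task per row (the `enqueue_bulk` name and signature are assumptions here):

```r
# Assumed bulk interface: one task per row of 'pars'.
pars <- data.frame(a = 1:100, b = runif(100))
grp <- obj$enqueue_bulk(pars, function(a, b) a * b)
grp$status()
```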

Better support for picking up existing drive mappings

It would be nice to be able to write (on windows)

didehpc::path_mapping("shared", "P:")

and have it detect where the path is - we can already work that out. Similarly on linux/macos it could be

didehpc::path_mapping("shared", "~/net/malaria", NULL, "P:")

perhaps

Depending on context fails

root <- "context"
## ctx <- context::context_save(root=root)
src <- context::package_sources(github="dide-tools/context")
ctx <- context::context_save(root=root,
                             package_sources=src,
                             packages=list(attached="context"))
obj <- didewin::queue_didewin(ctx)

gives

Warning: unable to access index for repository context/drat/bin/windows/contrib/3.2:
  unsupported URL scheme
Warning: unable to access index for repository context/drat/src/contrib:
  unsupported URL scheme
Error in cross_install_packages(lib, platform, r_version, repos, "context") : 
  Can't find installation candidate for: context

traceback:

8: stop(sprintf("Can't find installation candidate for: %s", paste(msg, 
       collapse = ", ")))
7: cross_install_packages(lib, platform, r_version, repos, "context")
6: cross_install_bootstrap(lib, platform, r_version, root)
5: context::cross_install_context(path_lib, "windows", r_version_2, 
       context)
4: initialise_windows_packages(self)
3: public_bind_env$initialize(...)
2: .R6_queue_didewin$new(context, config, initialise, sync)
1: didewin::queue_didewin(ctx)

Also seen in Windows

Support for old versions of R

Anything older than oldrel (currently 3.3.x) is going to require significant work in at least didehpc and provisionr, probably also buildr

Update to use new xml2

1: 'xml2::xml_find_one' is deprecated.
Use 'xml_find_first' instead.

🔥🔥🔥🔥🔥

Tracking failed jobs

When submitting jobs through the didewin package, jobs that return status ERROR do not enter the failed state on the HPC web portal.

`install#extras` not installing required package `RPostgreSQL`

Working through workers.R vignette, throws the following error:

+                              packages="ape", package_sources=pkgsrc,
+                              storage_type = didehpc:::storage_driver_psql())
[ init:id   ]  0bf248af0181672b0842b20d57301055
[ init:db   ]  postgres
Error in loadNamespace("RPostgreSQL") : 
  there is no package called 'RPostgreSQL'

Name change issue

In master, this exists: didehpc::didewin_config_global

But just noticed that in the linux dev branch it's all been corrected – so this issue may be moot.

hook for "all jobs complete"

Either within a bundle, or for all tasks, test to see if everything is done and run a hook on the cluster (e.g., send an email).

Issues to consider:

  • credentials for sending email
  • tasks don't know that they're part of a group
  • querying all task status is fairly costly with the disk system, especially when people submit >1K jobs

cc: @OJWatson

jobtemplate

My jobs are failing at the moment because they're so memory-hungry, and the default seems to be to request just 1 core, so they have to share the node's memory. That doesn't work for these.
Is there a way to request the whole node? And the 24Core template (that didn't work earlier, but I think you'd made a note of that already).
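
For reference, the check_resources() error quoted in the core-limit issue above mentions `template`, `cores`, and `wholenode` arguments, which suggests a config along these lines (the template name is an assumption):

```r
# Sketch based on the check_resources() arguments seen elsewhere on this
# page: request a whole node, or a larger named template.
config <- didehpc::didehpc_config(wholenode = TRUE)
# or: config <- didehpc::didehpc_config(template = "24Core")
obj <- didehpc::queue_didehpc(ctx, config)
```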
