
didehpc

Project Status: WIP – Initial development is in progress, but there has not yet been a stable, usable release suitable for the public.

DIDE Cluster Support

NOTICE: This will only be of use to people at DIDE, as it uses our cluster web portal, local cluster, and local network file systems.

What is this?

This is a package for interfacing with the DIDE cluster directly from R. It is meant to make jobs running on the cluster appear as if they are running locally, but asynchronously. The idea is to let the cluster appear as an extension of your own computer, so you can start using it within an R project easily.

How does it work?

The steps below are described in more detail in the vignettes:

  1. Ensure that your project is in a directory that the cluster can see (i.e. on one of the network drives). See notes for instructions
  2. Set your DIDE credentials up so that you can log in and tell didehpc about them.
  3. Create a "context" in which future expressions will be evaluated (which will be recreated on the cluster)
  4. Create a "queue" that uses that context
  5. Queue expressions which will be run at some future time on the cluster
  6. Monitor progress, retrieve results, etc.
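
Put together, a minimal session follows the steps above. This is a sketch only: the credentials, context directory, and queued expression are illustrative assumptions, not prescriptions.

```r
# Sketch of the workflow above; names and arguments are illustrative.
# Steps 1-2: project sits on a network drive; tell didehpc who you are.
didehpc::didehpc_config_global(credentials = "yourusername")

# Step 3: create a context (it will be recreated on the cluster).
ctx <- context::context_save("contexts")

# Step 4: create a queue that uses that context.
obj <- didehpc::queue_didehpc(ctx)

# Step 5: queue an expression for future evaluation on the cluster.
t <- obj$enqueue(sessionInfo())

# Step 6: monitor progress and retrieve results.
t$status()
t$result()
```

The main vignette covers each step in full; the calls here mirror examples appearing elsewhere on this page.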

Documentation

  • New to this? The main vignette contains full instructions and explanations about why some bits are needed.
  • Need a reminder? There is a quickstart guide which is much shorter and will be quicker to glance through.
  • Trying to install packages on the cluster? Check the packages vignette for ways of controlling this.
  • Having problems? Check the troubleshooting guide.
  • Lots of small jobs to run? Consider using workers for a fast queue over several cluster nodes.

Issues

  • Check the issue tracker for known problems, or to create a new one
  • Use the "Cluster" channel on Teams, which Rich and Wes keep an eye on

Installation

The simplest approach is to run:

# install.packages("drat") # if needed
drat:::add("mrc-ide")
install.packages("didehpc")

License

MIT © Imperial College of Science, Technology and Medicine

didehpc's Issues

Loss of job tracking

After some time, the connection to the cluster drops and I cannot ping the status of jobs, e.g.

> real_inc_1$status(); Error in context_root(x$root) : context database not set up at contexts

set.seed()

Ability to set a custom seed for jobs would be great.

I also have questions about whether the task list can be deleted or edited, as these folders are surely going to become fairly cumbersome.
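
Until a built-in option exists, one workaround is to make the seed an explicit argument of the queued function, so each task is reproducible. The `run_sim` function below is hypothetical:

```r
# Hypothetical workaround: pass the seed as an argument rather than
# relying on whatever random state the cluster node starts with.
run_sim <- function(seed, n) {
  set.seed(seed)
  rnorm(n)
}

# 'obj' is an existing didehpc queue object
t <- obj$enqueue(run_sim(seed = 42, n = 100))
```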

Detect and gracefully error failed path mappings

share <- didehpc::path_mapping("malaria", tempdir(),
                               "//fi--wronghost/path", "M:")
config <- didehpc::didehpc_config(shares = share)
ctx <- context::context_save("contexts")
obj <- didehpc::queue_didehpc(ctx, config)
t <- obj$enqueue(sessionInfo())
t$wait(10)
obj$dide_log(t)

gives

generated on host: wpia-dide136.dide.ic.ac.uk
generated on date: 2017-05-05
didehpc version: 0.1.1
context version: 0.1.1
running on: FI--DIDEMRC18
mapping Q: -> \\fi--san03\homes\rfitzjoh
The command completed successfully.
mapping T: -> \\fi--didef2\tmp
The command completed successfully.
mapping M: -> \\fi--wronghost\path
System error 53 has occurred.
The network path was not found.
working directory: Q:\cluster_testing\20170505\basic_1322433a3c2e
this is a single task
logfile: Q:\cluster_testing\20170505\basic_1322433a3c2e\contexts\logs\4b9d447dd87220803cab6e92a6ec54cf
Q:\cluster_testing\20170505\basic_1322433a3c2e>Rscript "Q:\cluster_testing\20170505\basic_1322433a3c2e\contexts\bin\task_run" "Q:\cluster_testing\20170505\basic_1322433a3c2e\contexts" 4b9d447dd87220803cab6e92a6ec54cf  1>"Q:\cluster_testing\20170505\basic_1322433a3c2e\contexts\logs\4b9d447dd87220803cab6e92a6ec54cf" 2>&1
Quitting

Perhaps detect the failed mapping commands and, if they fail, pass something through to automatically error the job?

Avoid stan path warning

rstan's rtools detection does not seem to pick up Rtools in the location that we put it.

It finds gcc in T:\Rtools\Rtools33\gcc-4.6.3\bin but finds ls in C:\windows, so rstan gives a warning that it can't be found. However, PATH seems correct as

"T:\\Rtools\\Rtools33\\bin;T:\\Rtools\\Rtools33\\gcc-4.6.3\\bin;C:\\Program Files (x86)\\Common Files\\libhmsbeagle\\;C:\\Program Files\\Microsoft MPI\\Bin\\;C:\\Program Files\\Microsoft HPC Pack 2012\\Bin\\;C:\\Windows\\system32;C:\\Windows;C:\\Windows\\System32\\Wbem;C:\\Windows\\System32\\WindowsPowerShell\\v1.0\\;C:\\Program Files\\Dell\\SysMgt\\oma\\bin;C:\\Program Files\\Dell\\SysMgt\\shared\\bin;C:\\Program Files\\Dell\\SysMgt\\idrac;C:\\Program Files\\Mellanox\\\\WinMFT\\;;C:\\Program Files\\R\\R-3.2.4revised\\bin\\x64"

Potentially corrupt task ID database

Running cluster jobs that write output to the network drive. The disk runs out of space, which results in corrupted processes. These processes seem to become lost on the web portal and through the didewin tools, but retain a lock on the drive.

Error in:aa7caae7e6dd49e5ba5bcd047e019faf. Error is: Error in exists(name, envir = envir, inherits = FALSE): invalid first argument

aborting jobs

Is it possible to cancel jobs that I've submitted? Is it possible to do this in batch?
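
One possible shape for this, assuming the queue object grows an `unsubmit` method (an assumption here, not a documented API at the time of this issue):

```r
# Assumed interface: an 'unsubmit' method taking one or more task ids.
obj$unsubmit(t$id)    # cancel a single submitted task
obj$unsubmit(ids)     # cancel a vector of task ids in batch
```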

wmic command can still fail

This is not always the case:

format_csv <- sprintf('/format:"%s\\System32\\wbem\\en-US\\csv"', windir)

sometimes it is en-GB - I'm sure that there can be others too!
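
A possible fix is to stop hard-coding the locale and instead look up whichever localised wbem directory actually exists on the node. This is a sketch, not the package's current code:

```r
# Sketch: find the localised wbem directory instead of assuming en-US.
windir <- Sys.getenv("WINDIR", "C:\\Windows")
wbem <- file.path(windir, "System32", "wbem")
locale <- dir(wbem, pattern = "^[a-z]{2}-[A-Z]{2}$")[[1]]
format_csv <- sprintf('/format:"%s\\%s\\csv"', wbem, locale)
```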

Job fail on package load should log as error

If a package is not found (or is an incorrect version), or if the source files are invalid, then the task will fail but the job is not recorded as failed. This might be solved in queuer, but we need to change the task status from PENDING to ERROR.

Work on paths with spaces

setwd("Q:/cluster_test/path with spaces")
didewin::didewin_config_global(credentials="rfitzjoh",
                               cluster="fi--didemrchnb")
ctx <- context::context_save("contexts", packages=NULL, sources=NULL)
context::context_log_start()
obj <- didewin::queue_didewin(ctx)

gives

Loading context f413d5b3eb34996d77ffb38f680ad15d
[ context   ]  f413d5b3eb34996d77ffb38f680ad15d
[ lib       ]  contexts/R/x86_64-w64-mingw32/3.3.0
[ library   ]
[ namespace ]
[ source    ]
[ cross     ]  checking available packages
[ cross     ]  Downloading binary packages for: crayon, curl, digest, drat, memoise, R6, storr, uuid
trying URL 'https://cran.rstudio.com/bin/windows/contrib/3.2/crayon_1.3.1.zip'
Content type 'application/zip' length 606260 bytes (592 KB)
downloaded 592 KB
...
[ cross     ]  Cross installing source packages for: context
trying URL 'https://richfitz.github.io/drat/src/contrib/context_0.0.7.tar.gz'
Content type 'application/octet-stream' length 41779 bytes (40 KB)
downloaded 40 KB

Warning: invalid package 'with'
Warning: invalid package 'spaces/contexts/R/x86_64-w64-mingw32/3.2.4'
Error: ERROR: cannot cd to directory 'Q:/cluster_test/path'
Error in FUN(X[[i]], ...) : Command failed (code: 1)
In addition: Warning message:
running command '"C:/PROGRA~1/R/R-33~1.0/bin/R" R_LIBS_USER=Q:/cluster_test/path with spaces/contexts/R/x86_64-w64-mingw32/3.2.4;Q:/cluster_test/path with spaces/contexts/R/x86_64-w64-mingw32/3.3.0;C:/Users/rfitzjoh/Documents/R/win-library/3.3;C:/Program Files/R/R-3.3.0/library CYGWIN=nodosfilewarning CMD INSTALL --no-test-load --library=Q:/cluster_test/path with spaces/contexts/R/x86_64-w64-mingw32/3.2.4 C:\Users\rfitzjoh\AppData\Local\Temp\Rtmpiq85SP\file8e030676c43b6\context' had status 1 

info on nodes

Is the information on which node a job is run on stored in the task bundle object? Would that be possible?

Force upgrade of a package

This is particularly complicated for packages that are built by buildr because they need purging from the drat binary repo too...

errors don't show in the HPC software

When I'm running jobs on nodes with not enough memory and they fail, I get the normal R error message in the log script, but in the HPC software the job is logged as finished, which is a little inconvenient. It does come up as ERROR in the task bundle object. However, if I run really long jobs I don't really want to keep the R session running until the jobs are finished (I might even switch off my computer over night). Can I even get the ids back if I've exited the R session? Or would it be worthwhile to output the task bundle object as an .RData file so I can get back to it?
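
On the last question: a task bundle is an ordinary R object, so one option is to serialise it to disk and reload it in a later session. This is a sketch; whether the reloaded object reconnects cleanly to the context database has not been verified here:

```r
# 'grp' is an existing task bundle; save it before closing the session.
saveRDS(grp, "my_jobs.rds")

# ...later, in a fresh R session, after recreating the queue object:
grp <- readRDS("my_jobs.rds")
grp$status()
```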

fi--dideclusthn core limit 8 => 12

Reported error:
"Error in check_resources(cluster, dat$template, dat$cores, dat$wholenode, :
Maximum number of cores for fi--dideclusthn is 8"

Which used to be true, but we now have four 12-core nodes.

bulk submit jobs

Submitting very large numbers of jobs (really anything over ~50) is tedious and takes ages; it would be nice to have something that can submit a bunch of jobs at once.

  • Submit multiple jobs at once via the web interface (loses names, but could be fixed)
  • Submit jobs on the cluster itself (not actually any faster but at least is async)
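
One possible interface, where the queue accepts a data.frame of parameters and submits one task per row (the `enqueue_bulk` name and signature are assumptions here):

```r
# Assumed bulk interface: one task per row of 'pars'.
pars <- data.frame(a = 1:100, b = runif(100))
grp <- obj$enqueue_bulk(pars, function(a, b) a * b)
grp$status()
```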

Better support for picking up existing drive mappings

It would be nice to be able to write (on windows)

didehpc::path_mapping("shared", "P:")

and have it detect where the path is - we can already work that out. Similarly on linux/macos it could be

didehpc::path_mapping("shared", "~/net/malaria", NULL, "P:")

perhaps

Depending on context fails

root <- "context"
## ctx <- context::context_save(root=root)
src <- context::package_sources(github="dide-tools/context")
ctx <- context::context_save(root=root,
                             package_sources=src,
                             packages=list(attached="context"))
obj <- didewin::queue_didewin(ctx)

gives

Warning: unable to access index for repository context/drat/bin/windows/contrib/3.2:
  unsupported URL scheme
Warning: unable to access index for repository context/drat/src/contrib:
  unsupported URL scheme
Error in cross_install_packages(lib, platform, r_version, repos, "context") : 
  Can't find installation candidate for: context

traceback:

8: stop(sprintf("Can't find installation candidate for: %s", paste(msg, 
       collapse = ", ")))
7: cross_install_packages(lib, platform, r_version, repos, "context")
6: cross_install_bootstrap(lib, platform, r_version, root)
5: context::cross_install_context(path_lib, "windows", r_version_2, 
       context)
4: initialise_windows_packages(self)
3: public_bind_env$initialize(...)
2: .R6_queue_didewin$new(context, config, initialise, sync)
1: didewin::queue_didewin(ctx)

Also seen in Windows

Support for old versions of R

Anything older than oldrel (currently 3.3.x) is going to require significant work in at least didehpc and provisionr, probably also buildr

Update to use new xml2

1: 'xml2::xml_find_one' is deprecated.
Use 'xml_find_first' instead.

🔥🔥🔥🔥🔥

Tracking failed jobs

When submitting jobs through the didewin package, jobs that return status ERROR do not enter the failed state on the HPC web portal.

`install#extras` not installing required package `RPostgreSQL`

Working through workers.R vignette, throws the following error:

+                              packages="ape", package_sources=pkgsrc,
+                              storage_type = didehpc:::storage_driver_psql())
[ init:id   ]  0bf248af0181672b0842b20d57301055
[ init:db   ]  postgres
Error in loadNamespace("RPostgreSQL") : 
  there is no package called 'RPostgreSQL'

Name change issue

In master, this exists: didehpc::didewin_config_global

But just noticed that in the linux dev branch it's all been corrected – so this issue may be moot.

hook for "all jobs complete"

Either within a bundle, or for all tasks, test to see if everything is done and run a hook on the cluster (e.g., send an email).

Issues to consider:

  • credentials for sending email
  • tasks don't know that they're part of a group
  • querying all task status is fairly costly with the disk system, especially when people submit >1K jobs

cc: @OJWatson

jobtemplate

My jobs are failing at the moment because they're so memory-hungry, and the default seems to be to request just 1 core, so they have to share the node's memory. That doesn't work for these.
Is there a way to request the whole node? And the 24Core template (that didn't work earlier, but I think you'd made a note of that already).
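
For reference, the check_resources() error quoted in the core-limit issue above mentions `template`, `cores`, and `wholenode` arguments, which suggests a config along these lines (the template name is an assumption):

```r
# Sketch based on the check_resources() arguments seen elsewhere on this
# page: request a whole node, or a larger named template.
config <- didehpc::didehpc_config(wholenode = TRUE)
# or: config <- didehpc::didehpc_config(template = "24Core")
obj <- didehpc::queue_didehpc(ctx, config)
```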
