mllg / batchtools Goto Github PK

Tools for computation on batch systems

Home Page: https://mllg.github.io/batchtools/

License: GNU Lesser General Public License v3.0

R 93.91% Shell 3.20% C 1.42% TeX 1.46%

cran batchjobs batchexperiments slurm lsf sge docker-swarm torque openlava parallel-computing

batchtools's People

imbs-hl alenzhao bobp4ski jakob-r gabora bunhoel biodev nsheff deann88 mnwright sampoll cfhammill brnleehng kuhnrl30 smilesun ja-thomas fouodo bomeara mtmorgan tdhock gluque jasonserviss mb706 eliford kchester12 payamemami scottporter damirpolat michaelchirico dankessler syma-research edgbr rhjp izahn mkrzak jimsforks aaronpeikert define-ag plusge reikoch michaelmayer2 tcikezu jzl singhravipratap crerecombinase living1069 franzbischoff takewiki rockefelleruniversity jemus42 olivroy

batchtools's Issues

Fix static docs

Maybe completely rely on rdocumentation.com and CRAN for hosting manuals and vignettes. Currently the man stuff is broken.

Implement cfProcessx

... as soon as https://github.com/MangoTheCat/processx/ is released. Benefit over cfSSH w/ only local workers: works on windows.

doc issue: hyperparameters

hyperparameters are very often mentioned. this does not make sense, batchtools does not only concern ML.

it should be "parameters".

submitjobs / max.concurrent jobs: numbers for sleeping should be options

currently we have these magic numbers in the code

default.wait = 5
....
wait = wait * 1.025

while i will heavily argue that we should have very reasonable defaults here, so that most users never have to touch them, these 2 should be configurable via options.
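
A minimal sketch of how the two constants could be exposed through options. The option names `batchtools.default.wait` and `batchtools.wait.factor` are hypothetical and chosen only for this illustration; the fallbacks are the hard-coded values quoted above.

```r
# Hypothetical option names; the defaults mirror the hard-coded values above.
getWaitSettings = function() {
  list(
    default.wait = getOption("batchtools.default.wait", 5),
    wait.factor = getOption("batchtools.wait.factor", 1.025)
  )
}

# Sleep times for the first n polling rounds under the current settings.
waitSchedule = function(n, settings = getWaitSettings()) {
  settings$default.wait * settings$wait.factor^(seq_len(n) - 1L)
}

waitSchedule(3)
```

Users who never set the options get the current behaviour; others could tune it with something like `options(batchtools.default.wait = 10)`.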

No Error message, when src file does not work.

See #15
The source file should contain something like
setwd(dir)

Your main file

library(batchtools)
library(plyr)

dir = "/home/probst/Random_Forest/RFParset"
setwd(paste0(dir,"/results"))

unlink("probs-test", recursive = TRUE)
regis = makeExperimentRegistry("probs-test", 
                               source = "/home/probst/Random_Forest/RFParset/code/probst_defs.R"
)

regis$cluster.functions = makeClusterFunctionsMulticore(debug = TRUE)

addProblem(name = as.character(1), data = 1)

addAlgorithm("eval", fun = function(job, data, instance, ...) {
  x = list(...)
  data + x$x + 1
})

set.seed(124)
ades = data.frame(c(sample(1:100)))
names(ades) = "x"

addExperiments(algo.designs = list(eval = ades))

ids = getJobTable()$job.id
ids = chunkIds(ids, chunk.size = 10)
submitJobs(ids)
getStatus()

Result:

OS cmd: /nfsmb/koll/probst/R/x86_64-pc-linux-gnu-library/3.2/batchtools/bin/linux-helper list-jobs probs-test
OS result (exit code 0):
character(0)
Status for 100 jobs:
  Submitted: 100 (100.0%)
  Queued   :   0 (  0.0%)
  Started  :   0 (  0.0%)
  Running  :   0 (  0.0%)
  Done     :   0 (  0.0%)
  Error    :   0 (  0.0%)

doJobCollection: assertPathForOutput leads to unintuitive errors

we discussed this on hangout

happens if:

eg SLURM already creates empty log files on SBATCH
you still pass the logfile path to doJobCollection in the template

apparently the latter is a user error on our part, but this was still hard to figure out and completely broke our workflow after a package update

resolution:

check that this is really well documented
maybe the error reporting can be made clearer
then close quickly

Bug: convertIds can create columns entirely with NAs

I can not say exactly how this happens but starting from a bit messed up state:

> getStatus()
Status for 660 jobs:
  Submitted : 540 ( 81.8%)
  Queued    :   0 (  0.0%)
  Started   :  55 (  8.3%)
  Running   :   0 (  0.0%)
  Done      :  23 (  3.5%)
  Error     :  32 (  4.8%)
  Expired   : 485 ( 73.5%)

Now I run

> submit.ids = chunkIds(ids = findNotDone(), chunk.size=5)  
> submitJobs(ids = submit.ids, resources=list(walltime = 60^2, memory = 4000))                                                                                          
Error in submitJobs(ids = submit.ids, resources = list(walltime = 60^2,  : 
  Assertion on 'ids$chunk' failed: Contains missing values.

Looking into it, it seems that the cause is convertIds() which generates something like

> submit.ids2 = batchtools:::convertIds(reg = reg, ids = submit.ids, default = batchtools:::.findNotSubmitted(reg = reg), keep.extra = c("job.id", "chunk"))            
> submit.ids2
     job.id chunk
  1:      1    42
  2:      2    79
  3:      3    35
  4:      4    21
  5:      5    43
 ---             
633:     NA    NA
634:     NA    NA
635:     NA    NA
636:     NA    NA
637:     NA    NA

listJobs should differentiate between running and queued jobs

Messaging back to the registry as soon as the job starts is currently not implemented and should not be necessary if listJobs differentiated between running and queued jobs.
Should be pretty simple to implement.

capital letter in conf file required

We need a capital R in the conf file on LiDO. ~/.batchtools.conf.r is not used, only ~/.batchtools.conf.R switches to Torque. Do we intend this?
I needed an hour to find this issue...

Additional info:
ExperimentRegistry.R uses .batchtools.conf.r, Registry.R uses .batchtools.conf.R. I switched the first one to R.
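
One way to make the lookup tolerant of both spellings is sketched below. `findConfFile` is a hypothetical helper written for this illustration, not a batchtools function.

```r
# Hypothetical helper, not part of batchtools: return the first existing
# configuration file, accepting both spellings of the extension.
findConfFile = function(dir = "~") {
  candidates = file.path(dir, c(".batchtools.conf.R", ".batchtools.conf.r"))
  found = candidates[file.exists(candidates)]
  if (length(found) == 0L) NA_character_ else found[1L]
}

findConfFile()
```

The capital-R spelling is tried first, so an existing ~/.batchtools.conf.R keeps priority on case-sensitive file systems.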

multi row results

I added a test for multi row results in 0b83fe8. Do we expect the results from tab.expect or is tab correct?

does clusterFunctionsTorque use input template for anything?

I inspected makeClusterFunctionsTorque.R and I cannot understand how you handle the template. It seems you read the template file once, template = cfReadBrewTemplate(template, "##"), and then drop it, as it is not returned by the function. How is the template then added to the registry via a conf.file? How is the template used when submitting jobs? Does the current version of batchtools instead rely on batchtools:::findTemplateFile always recovering some template file later?

all the best

makeClusterFunctionsMulticore seems to waste a lot of memory

make_data <- function(data, scale, job=NULL) {
  gamSim(eg = 1, n = 4000, dist = "normal", scale = scale, 
    verbose = FALSE)
}

fit_model <- function(data, job=NULL, instance) {
  m <- gam(y ~ s(x0) + s(x1) + s(x2) + s(x3), data = instance)
  m$coefficients
}

library(batchtools)
file.dir <- paste0("testtest_", Sys.Date())
reg <- makeExperimentRegistry(file.dir = file.dir, packages = "mgcv", 
  seed = 1)
reg$cluster.functions <- makeClusterFunctionsMulticore(25)
saveRegistry()


# Add problem and algorithms to the registry
addProblem(name = "make_data", data = NULL, fun = make_data, seed = 1)
addAlgorithm(name = "fit_model", fun = fit_model)

# Add experiments
problems <- list(make_data = data.frame(scale = 2 ^ (-4 : 4)))
addExperiments(problems, repls = 50)

#
options(error = function( ) dump.frames("batchtools.dump", to.file = TRUE))
# testJob(1)
submitJobs()

This results in

# Submitting 450 jobs in 450 chunks using cluster functions 'Parallel' ...
# Submitting [===================================----------------]  70% eta: 14sError in mcfork(detached) : 
#   unable to fork, possible reason: Cannot allocate memory

Closing the session, reopening a fresh one, loading the registry and doing submitJobs() again immediately triggers the same error after "x files synced" is written to the console.

Exactly the same experiment setup works fine with reg$cluster.functions <- makeClusterFunctionsSocket(25) instead of reg$cluster.functions <- makeClusterFunctionsMulticore(25).

Until the R session that throws the "unable to fork"- error is closed nothing else works properly (some browser tabs crash (?), other open R sessions all fail with "Cannot allocate memory")

> sessionInfo()
R version 3.3.1 (2016-06-21)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.5 LTS

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=de_DE.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=de_DE.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=de_DE.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] mgcv_1.8-13      nlme_3.1-128     batchtools_0.1   data.table_1.9.6

loaded via a namespace (and not attached):
 [1] lattice_0.20-33   snow_0.4-1        prettyunits_1.0.2 digest_0.6.10    
 [5] assertthat_0.1    chron_2.3-47      grid_3.3.1        R6_2.1.3         
 [9] backports_1.0.3   magrittr_1.5      progress_1.0.2    stringi_1.1.1    
[13] Matrix_1.2-6      checkmate_1.8.1   tools_3.3.1       parallel_3.3.1   

> packageDescription("batchtools")
Package: batchtools
Title: Tools for Computation on Batch Systems
Version: 0.1
[...]
Built: R 3.3.0; x86_64-pc-linux-gnu; 2016-08-22 15:16:02 UTC; unix
[...]
RemoteSha: e34e069ce00e2d9e727cfedaf7e2278751f0cfad
[...]

JobTable: wrong submitted date on windows interactive

MWE:

library(batchtools)
reg = makeRegistry(file.dir = NA)
ids = batchMap(function(x) {Sys.sleep(x)}
               , x = c(2, 6, 6, 6, 4)
               )
ids = chunkIds(ids)
submitJobs(ids, resources = list(chunk.ncpus = 4))
getJobStatus()[, list(submitted, started, time.queued)]

This gives on my system:

             submitted             started time.queued
1: 2016-04-14 11:24:20 2016-04-14 11:24:14     -6 secs
2: 2016-04-14 11:24:20 2016-04-14 11:24:14     -6 secs
3: 2016-04-14 11:24:20 2016-04-14 11:24:14     -6 secs
4: 2016-04-14 11:24:20 2016-04-14 11:24:14     -6 secs
5: 2016-04-14 11:24:20 2016-04-14 11:24:16     -4 secs

At first I thought the calculation of time.queued was wrong, but it is the submitted date that is later than started. A Linux system gives the correct numbers.

job.name and arrayjobs equivalents in batchtools SGE template

I've begun experimenting with batchtools by migrating some previous projects that used BatchJobs and BatchExperiments. We use SGE on our cluster. Here are the contents of my working template file (much of which was copied from my existing template with BatchJobs):

#!/bin/bash

# The name of the job, can be anything, simply used when displaying the list of running jobs
#$ -N "my_job"
# Combining output/error messages into one file
#$ -j y
# Giving the name of the output log file
#$ -o <%= log.file %>
# One needs to tell the queue system to use the current directory as the working directory
# Or else the script may fail as it will execute in your top level home directory /home/username
#$ -cwd
# use environment variables
#$ -V
# define multiple cores per node
#$ -pe smp <%= resources$n.cores %>
# use resource reservation
#$ -R y

# we merge R output with stdout from SGE, which gets then logged via -o option
module load cluster-setup
module unload R
module load R/3.2.0
Rscript -e 'batchtools::doJobCollection("<%= uri %>")' /dev/stdout
exit 0

Everything works so that's great! In my previous template I had the following that utilized job.name and arrayjobs:

# The name of the job, can be anything, simply used when displaying the list of running jobs
#$ -N <%= job.name %>

# use job arrays
#$ -t 1-<%= arrayjobs %>

You can see that I simply hard-coded a job name in my new template, but is there a more elegant way to define the job name dynamically? Also how would I go about specifying arrayjobs? Thanks for your great work on this package!

Relax checks on arguments of problems and algorithms

... should also be fine.

Multicore with batchtools

Hi,

I tried to use makeClusterFunctionsMulticore.

My registry looks like this:

regis = makeExperimentRegistry("probs-muell", 
                             packages = c("mlr", "OpenML"),
                             source = "/nfsmb/koll/probst/Random_Forest/RFParset/code/probst_defs.R",
                             work.dir = paste0(dir,"/results")
)
regis$cluster.functions = makeClusterFunctionsMulticore()

Then I try to submit it (after algorithm setting, etc)

ids = getJobTable()$job.id
ids = chunkIds(ids, chunk.size = 30)
submitJobs(ids)

getStatus() gives me the following (also after waiting more time):

getStatus()
Status for 240 jobs:
  Submitted: 240 (100.0%)
  Queued   :   0 (  0.0%)
  Started  :   0 (  0.0%)
  Running  :   0 (  0.0%)
  Done     :   0 (  0.0%)
  Error    :   0 (  0.0%)

I do not find anything done in my results folder.
What is wrong here?

param designs: either convert factors or at least warn about them

unlink("registry", recursive = T)
reg = makeExperimentRegistry()
addProblem("p1")
addAlgorithm("a1", fun = function(instance, method, ...) {
  print(str(method))
  return(method)
})

ades = data.frame(method = c("a", "b"))
addExperiments(algo.designs = list(a1 = ades))
testJob(1)

this shows that "method" is a factor with one element, and 2 levels.

a) this is really nearly never what you want. you want "method" to be a string.
for potential problems see this:
mllg/checkmate#75
(see the switch problem there)
NB: these are NOT the same issues. the one in cm is about guarding against this. this here is about not creating the problem in the first place, also wrt code you later dont control.

b) of course one can say that this is a user error, as data.frame did not set stringsAsFactors = FALSE.
but this will happen one million times, even for experienced R coders.
although i dislike warning usually, this seems to be a clear case where a warning would be beneficial.
the other option would be to auto-convert the factor columns of the design to chars.
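
Both options can be sketched in base R. `convertDesignFactors` is a hypothetical helper for illustration only; it converts factor columns of a design to character and optionally warns about them.

```r
# Hypothetical helper: convert factor columns of a design to character,
# optionally warning about the columns that were changed.
convertDesignFactors = function(design, warn = TRUE) {
  is.fac = vapply(design, is.factor, logical(1L))
  if (any(is.fac)) {
    if (warn)
      warning("Converting factor columns to character: ",
        paste(names(design)[is.fac], collapse = ", "))
    design[is.fac] = lapply(design[is.fac], as.character)
  }
  design
}

# stringsAsFactors = TRUE reproduces the situation described above.
ades = data.frame(method = c("a", "b"), stringsAsFactors = TRUE)
ades = convertDesignFactors(ades, warn = FALSE)
str(ades)
```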

Support caching of problem instances

If the problem has a problem seed, there should be a possibility to cache the results. Use a hash of problem id and problem seed as unique identifier?

QUESTION: Is `resources` the only way to pass data to the templates?

In BatchJobs, I think argument resources (named list) of submitJobs() was the only way to pass a variable to the template, which then is also named resources. Is this the case for batchtools as well? It looks so from inspecting the code.

There are two reasons why I ask:

There could be other info that one wishes to send to the template that is not really "resource" related:

email address (I know you can set this up with the registry too)
other hard-to-predict properties of future or rare job schedulers
There could be other things the user wishes to do at the same time as the template is compiled (though it is too early to come up with a reasonable example now)

If resources is the only one, then couldn't one just attach its fields so that the template doesn't have to do resources$foo etc. each time?

Maybe what I'm fishing for is a generic args argument that is a named list (or environment) whose elements are attached to the template evaluation environment. Then it could look like this:

resources <- list(ncpus = 4, walltime = 3600, memory = 2.0)
submitJobs(reg, args = resources)

and the template could immediately access ncpus, walltime, and memory without having to use resources$ncpus etc. The downside might be that it's less clear what's an argument coming from the submitJobs() call and what comes from the internals of batchtools.

Just a thought and wondered if you already thought about this in the past.
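
The idea of attaching the elements of a named list to an evaluation environment can be illustrated in base R. This is only a sketch of the mechanism; it does not use the template engine, and nothing here is batchtools API.

```r
# Base R illustration only: a named list becomes an environment, so
# expressions can refer to its elements directly.
resources = list(ncpus = 4, walltime = 3600, memory = 2.0)
env = list2env(resources, parent = baseenv())

# Lookup via the list, and direct lookups in the attached environment.
resources$ncpus
eval(quote(ncpus), env)
eval(quote(walltime / 60), env)
```

A name clash between a user-supplied element and an internal variable would silently shadow one of them, which is the downside mentioned above.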

NAMING CONVENTION: Slurm not SLURM

I think it should be 'Slurm' not 'SLURM', cf. http://slurm.schedmd.com/ and https://en.wikipedia.org/wiki/Slurm_Workload_Manager (Wikipedia implies it was SLURM in the past).

This affects some of your function names.

makeRegistry and makeExperimentRegistry use different defaults for conf.file

Hi,

makeRegistry and makeExperimentRegistry have two different defaults for conf.file

makeRegistry = function(file.dir = "registry", work.dir = getwd(), conf.file = "~/.batchtools.conf.R", ...)

makeExperimentRegistry = function(file.dir = "registry", work.dir = getwd(), conf.file ="~/.batchtools.conf.r",...)

(Just the difference between .R and .r)

I don't think this is intended?

Change logging mechanism

Because there is no way (yet) to create a connection which combines its output with a prefix, output is delayed until the job is terminated. Find a better workaround.

Summarize Registry

We should implement a function to summarize the registry. For instance, it is hard to get the added problems from an experimental registry, hence an object containing this and other interesting information of the registry is necessary.

autoload package "methods" on slaves

as you chose to use Rscript. it is really confusing to users to not do that.

see #15

conceptual problems: designs where algo configs change over problems

hi,

this is a problem that came up recently in a project and i couldnt find a way to write this down properly with bt.

i have a couple of instances, and some algos. to simplify this, imagine that the problems dont have any params. so i just have p_1, ... p_k. for the algos i would precreate, as a data.frame, the different config settings i want to compute and study. but: the algo configs should not be the same for every p_i.

reason: instead of "variance reduction" (= try out the same setting for each p_i), i want more "exploration" to learn possibly better how the params affect the algo performance.

problem: batchtools does not allow this. as i have to specify "algo.design" which is then used for every p_i.
this is conceptually problematic as what i just outlined is something which is extremely common as an at least potential approach in experimental designs.

solution (?):
instead of always forcing the user to enter prob.design and algo design, then internally compute the crossproduct, let the user already pass the combined design as a single df / dt. then he has complete control.

so in my case i would pass something like

prob.id, algo.id, algo.par.1, algo.par.2, ..., algo.par.m
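
Such a combined design could be built as a plain data.frame, as in the toy sketch below. The problem names, the algorithm name and the parameter ranges are made up for illustration; passing such a table to addExperiments is the proposal, not existing behaviour.

```r
# Toy illustration with made-up problem names and parameter ranges.
set.seed(1)
prob.ids = c("p_1", "p_2", "p_3")

# Draw a fresh set of algorithm configurations for every problem instead of
# crossing one shared design with all problems.
combined = do.call(rbind, lapply(prob.ids, function(p) {
  data.frame(
    prob.id = p,
    algo.id = "a",
    algo.par.1 = runif(2),
    algo.par.2 = sample(1:10, 2),
    stringsAsFactors = FALSE
  )
}))
combined
```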

removeExperiments(ids) deletes all result files

If you want to delete some jobs of the registry, removeExperiments(ids) will delete all result files and not only the results with the specified ids!

Here is an example:

library(batchtools)

reg <- makeExperimentRegistry(file.dir = "test_registry")

subsample <- function(data, job) {
  n <- nrow(data)
  train <- sample(n, floor(n * 0.5))
  test <- setdiff(seq(n), train)
  list(test = test, train = train)
}
data("iris", package = "datasets")
addProblem(reg, name = "iris", data = iris, fun = subsample, seed = 123)


forest.wrapper <- function(job, data, instance, ...) {
  library("randomForest")
  mod <- randomForest(Species ~ ., data = data,
                      subset = instance$train, ...)
  pred <- predict(mod, newdata = data[instance$test, ])
  table(data$Species[instance$test], pred)
  }
addAlgorithm(reg = reg, name = "forest", fun = forest.wrapper)

minsplit <- c(5, 10, 20, 6)
cp <- c(0.01, 0.1, 0.01, 0.1)
ntree <- c(100, 500, 1000, 200)

design <- data.frame(minsplit = minsplit, cp = cp, ntree = ntree)

algodes <- list( forest = design)
addExperiments(reg = reg, algo.design = algodes, repls = 2)
summarizeExperiments(reg = reg)

submitJobs()
getStatus()

all_jobs <- getJobPars(reg = reg)
jobs_500 <- subset(all_jobs, all_jobs$ntree == 500)
res_1 <- reduceResults(reg = reg, ids = jobs_500$job.id, fun = function(x, y) c(x, y))
res_1


> removeExperiments(ids = jobs_500$job.id)

Removing 2 Experiments
Cleaning up 1 job definitions
Removing 2 obsolete result files
Removing 2 obsolete log files
   job.id
1:      3
2:      4

> res_2 <- reduceResults(reg = reg, fun = function(x, y) c(x, y))

Error in gzfile(file, "rb") : cannot open the connection
In addition: Warning message:
In gzfile(file, "rb") :
  cannot open compressed file '/test_registry/results/1.rds', probable reason 'No such file or directory'

Make batchtools easier for simple problems

It's kind of a hassle to conduct easy experiments with batchtools that are more complicated than batchMap but kind of too simple for the algo / problem differentiation.
For example if I have a static problem without parameters I still have to generate the problem.design list.
At the moment it kind of looks like this:

for (obj.fun in obj.funs){
  addProblem(name = getID(obj.fun), data = list(obj.fun = obj.fun))
}
pdes = lapply(obj.funs, function(x) data.frame())
names(pdes) = sapply(obj.funs, getID)

This is kind of suboptimal to read and comprehend.

Link to other vignette

This issue is part of this JOSS review

In the Pi example vignette, there is "see one of the other vignettes for an example", maybe give the title of the mentioned vignette?

Appveyor badge

This issue is part of this JOSS review

The Appveyor badge in the README is red and refers to the estimateRuntimes branch, while the build for master is not failing (but it is the one one gets when clicking on the badge). I'm not sure it's a real problem, I wonder if it would make more sense to have two badges or to show the Appveyor badge for the master branch only.

Son-of Grid Engine needs job titles to start with alphabetical

If I try batchtools with the SGE template it fails on our cluster with:

Fatal error occurred: 101. Command 'qsub' produced exit code 1. Output: 'Unable to run job: denied: "528799ad34365a4c3e32ebb3f08bfd98" is not a valid object name (cannot start with a digit)

Which comes from the SGE template file:

#$ -N <%= job.hash %>

Fix can be as simple as prefixing something to the hash:

#$ -N job<%= job.hash %>

I'm not sure if this is a difference between the original Sun Grid Engine and the Son-of Grid Engine that we use.
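
As an alternative to always prefixing in the template, the name could be sanitized on the R side. `sanitizeJobName` is a hypothetical helper for illustration; it adds the prefix only when the name starts with a digit.

```r
# Hypothetical helper: make sure a job name does not start with a digit.
sanitizeJobName = function(job.hash, prefix = "job") {
  ifelse(grepl("^[0-9]", job.hash), paste0(prefix, job.hash), job.hash)
}

sanitizeJobName("528799ad34365a4c3e32ebb3f08bfd98")
```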

addExperiments: referring to problems or algos without parameters

currently one has to write this

reg = makeExperimentRegistry(file.dir = NA, make.default = FALSE)
prob = addProblem(reg = reg, "p1", data = 1)
prob = addProblem(reg = reg, "p2", data = 2)
algo = addAlgorithm(reg = reg, "a", fun = function(...) list(...))

algo.designs = list(a = data.table(par1 = 1, par2 = 2))
addExperiments(reg = reg, prob.designs = list(p1 = data.table()), algo.designs = algo.designs)

BE offered the possibility to shorten the last line to
addExperiments(reg = reg, prob.designs = "p1", algo.designs = algo.designs)

and i would suggest that we should allow this again

this is simple and useful convenience for the user. if i pass a charvec (x, y, z) this is internally transformed to list(x = data.table(), y = data.table(), z = data.table())
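
The transformation could look roughly like this. `expandDesigns` is a hypothetical helper; it uses data.frame() instead of data.table() only to keep the sketch free of dependencies.

```r
# Hypothetical helper: expand a character vector of names into a named list
# of empty designs; anything that is not a character vector passes through.
expandDesigns = function(x) {
  if (is.character(x))
    x = setNames(replicate(length(x), data.frame(), simplify = FALSE), x)
  x
}

str(expandDesigns(c("p1", "p2")))
```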

Workaround for AppVeyor

Currently the building test on AppVeyor fails. The reason is an issue with r-appveyor which doesn't load dependencies of dependencies as mentioned here: krlmlr/r-appveyor#69

As a workaround I added the missing packages to the appveyor.yml in commit 40ab0d3 to load these packages manually.

reduceResultsDataTable

I am getting my results with
reduceResultsDataTable(ids = ids_classif, fun = function(r) as.data.frame(as.list(r)), reg = regis, fill = TRUE)

and it takes about 5 hours (50000 jobs). Is there a way to make it faster, maybe parallelizable?

Link to the package website in the repository description?

This issue is part of this JOSS review

The link to https://mllg.github.io/batchtools/ (really nice that the package has the website by the way) could be added to the description of the repository, so that it might be easier to see?

Either `ncpus` or `max.jobs` is redundant

Does Batchtools work on windows ?

Hi,

I tried to use batchtools on windows.
I get the following error launching getStatus after creating my experiment.

> getStatus()
Error: 'mccollect' is not an exported object from 'namespace:parallel'

Looking at the function mccollect, it seems not available in windows:
https://stat.ethz.ch/R-manual/R-devel/library/parallel/html/mcparallel.html

Should I use BatchExperiments instead?

Raphael

removeExperiments will delete all results no matter what ids are passed

This was found by one of our master students.

Here is a smallish example:

library(batchtools)


reg <- makeExperimentRegistry(file.dir = "test_registry")

subsample <- function(data, job) {
  n <- nrow(data)
  train <- sample(n, floor(n * 0.5))
  test <- setdiff(seq(n), train)
  list(test = test, train = train)
}
data("iris", package = "datasets")
addProblem(reg, name = "iris", data = iris, fun = subsample, seed = 123)


forest.wrapper <- function(job, data, instance, ...) {
  library("randomForest")
  mod <- randomForest(Species ~ ., data = data,
                      subset = instance$train, ...)
  pred <- predict(mod, newdata = data[instance$test, ])
  table(data$Species[instance$test], pred)
  }
addAlgorithm(reg = reg, name = "forest", fun = forest.wrapper)


minsplit <- c(5, 10, 20, 6)
cp <- c(0.01, 0.1, 0.01, 0.1)
ntree <- c(100, 500, 1000, 200)

design <- data.frame(minsplit = minsplit, cp = cp, ntree = ntree)

algodes <- list( forest = design)
addExperiments(reg = reg, algo.design = algodes, repls = 2)
summarizeExperiments(reg = reg)

submitJobs()
getStatus()

all_jobs <- getJobPars(reg = reg)
jobs_500 <- subset(all_jobs, all_jobs$ntree == 500)
res_1 <- reduceResults(reg = reg, ids = jobs_500$job.id, fun = function(x, y) c(x, y))
res_1

removeExperiments(ids = jobs_500$job.id)

res_2 <- reduceResults(reg = reg, fun = function(x, y) c(x, y))

[Same results if we use the data.table]

Also, if we specify no ids in removeExperiments the documentation says that no jobs are deleted. But it seems that all results are deleted as well.

problem and algorithm in getJobPars as factors?

Currently problem and algorithm are returned as characters, whereas nominal parameters are stored as factors in the result of getJobPars. This results in problems for statistical methods like ranger. Here data.matrix(data) is used, which returns NA on character columns.
Does it make sense to store everything by default as factors?

resetJobs

There is no function resetJobs like in BatchJobs. Maybe this could be helpful, although it is also possible to resubmit the Job just by writing submitJobs(ids) again.

JSS paper vs. JOSS paper

This issue is part of this JOSS review

The JSS paper is mentioned as a resource, which is understandable. I have a few comments/questions on this:

Is there any way to update the JSS paper to mention the new package? Or at least to add a warning in https://www.jstatsoft.org/article/view/v064i11 ?
Could you be more precise in the README about what is still usable from the JSS paper (e.g. the section about "We use an ExperimentRegistry where the job definition is split into creating problems and algorithms." according to one batchtools vignette) and which parts of the interfaces changed (besides there now being a single package)? I guess one can get this information by reading NEWS.md + the JSS paper, but NEWS.md contains other information too. This would also make the novelty of the software described in the JOSS paper clearer.
The second paragraph of the abstract of the JSS paper is really nice (the list with letters), I wonder if it is allowed to have it in your README / in an intro vignette too.

batch command line tools

the clusterfunctions rely on issuing a couple of specific system commands. We do this by having (user configurable) R code where these cmds are generated and then executed.

Better approach might be this.
We should add another layer of abstraction. Which are the batch commands.
This is a small set of aliases / scripts.

Like this:
btsubmit
btkill
btlist

These we ship out, like we do with the cfs before, but the user could also adapt them if they have to.

This has a couple of advantages:

it is easier for the user to locally adapt these simple bash scripts than our current R code
they are nicely separately testable
we kinda need a defined API here anyway
if you put them in your path, you dont even need to remember whether you are on SLURM or TORQUE or whatever

I would actually ship out at some point some extra cmds which kinda only make sense on the console:
btkill-all: kill everything from the CLI the hard way
btlist: with a better overview option, so you can see what currently runs, with the state.

One could even later add stuff like show-active-user or show-queues, which we now have as "bad throw-away" versions for lido and which do not work at other places.

waitForJobs

I fixed some issues of waitForJobs in 7925170. However, the progressbar hides before all jobs are done. A MWE will follow.

Organization of the reference section of the website

This issue is part of this JOSS review

In the reference section of the package website, each .Rd is an entry, and there are many of them. I wonder if it would make the documentation easier to read in that part if the author used a _pkgdown.yml for creating groups, see e.g. this one with the resulting reference section

there should really be a reduceExperiments

one really always wants to have the job and algo ids in the results.

yes, one can getJobTable, reduceResultsDataTable and then join / merge the 2.
but this is somewhat unintuitive and cumbersome for the absolute default use case
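
The join itself is a one-liner once both tables exist. The sketch below uses toy stand-ins; the column names and values are assumptions made for this illustration, not the exact output of the batchtools functions.

```r
# Toy stand-ins for the job table and the reduced results.
jobs = data.frame(job.id = 1:4,
  problem = c("p1", "p1", "p2", "p2"),
  algorithm = c("a1", "a2", "a1", "a2"),
  stringsAsFactors = FALSE)
results = data.frame(job.id = c(1L, 2L, 4L), score = c(0.9, 0.7, 0.8))

# Join on job.id so every result row carries its problem and algorithm.
merged = merge(jobs, results, by = "job.id")
merged
```

A reduceExperiments function would essentially wrap these steps so that the ids come along by default.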

Template filename for automatic lookup

Hey, just following up on our discussion at useR about a standardized filename format for automatic lookup of template files. My proposal is to use filenames of format:

.batchtools.<scheduler>.tmpl

Examples:

.batchtools.slurm.tmpl
.batchtools.torque.tmpl
.batchtools.sge.tmpl
.batchtools.openlava.tmpl
.batchtools.lsf.tmpl

One reason for the <scheduler> part is that I can imagine projects with collaborators that work on different systems, and this would provide at least some flexibility to keep multiple template files in the same (Git) repository.

Where should these template files be located? The default could be to search for the first available file in:

the current directory (.)
the user's home directory (~)

I'm hesitating whether it is a good thing to also fall back to a generic template file provided by the package or not. The reason why I'm not sure is that it might be confusing and it's not clear if it will work for all schedulers and for all setups. On the other hand, if it is possible to find a good enough template, then it's pretty neat.
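
The proposed lookup could be sketched as follows. `findTemplate` is a hypothetical helper written for this illustration; it searches the given directories in order and deliberately has no fallback to a packaged template.

```r
# Hypothetical helper following the proposed naming scheme: look for
# .batchtools.<scheduler>.tmpl in the given directories, in order.
findTemplate = function(scheduler, dirs = c(".", "~")) {
  candidates = file.path(dirs, sprintf(".batchtools.%s.tmpl", scheduler))
  found = candidates[file.exists(candidates)]
  if (length(found) == 0L) NA_character_ else found[1L]
}

findTemplate("slurm")
```

Returning NA when nothing is found leaves the decision about a generic fallback to the caller.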

LSF: Error: $ operator is invalid for atomic vectors

I'm having a problem that occurs only with LSF.

The simplest reproducible code is this:

> btlapply(1:3, function(x) x^2)
Sourcing configuration file '~/.batchtools.conf.R' ...
Adding 3 jobs ...
Error: $ operator is invalid for atomic vectors

My config file and template follow.

I have some experience with BatchJobs so I have a good working LSF system. Perhaps I do not understand the setup or there is a problem?

Your help is greatly appreciated.

I will be running tens of millions of jobs with batchtools and will be happy to report results:

###cluster.functions = makeClusterFunctionsLSF()
cluster.functions = makeClusterFunctionsLSF(template="/xxxx/users/yyyy/lsf_bob0.tmpl")
###cluster.functions = makeClusterFunctionsInteractive()
mail.start = "none"
mail.done = "none"
mail.error = "none"

#BSUB-J <%= job.hash %>      
Rscript -e 'batchtools::doJobCollection("<%= uri %>")'

Flowcharts

This issue is part of this JOSS review

This is again only a suggestion: the JSS paper has a flowchart for BatchExperiments, maybe a flowchart would make sense to explain the relationships between batchtools functions?

feature: resources can be defined by algorithm or problem

It would be nice if there was a handy option to define the resources needed by a specific algorithm or problem within the definition of them.
Now you just generate everything and then have to select afterwards what to start with which resources.

Further remark: Sending jobs with different resource demands in different batches might lead to underutilized queues so it would be beneficial to be able to start jobs kind of stratified.

reduction of results in parallel

in BE for good reason we had a parallel reduction method.
as for regs with a larger number of results / a non trivial reduction which at least does a little bit of "computation" or transformation you dont want to wait for hours (if you are on a parallel system)

is this supported now? because i dont think so

Links to other packages

This issue is part of this JOSS review

The README states "As a successor of the packages BatchJobs and BatchExperiments, batchtools".

Maybe you could explain why it is a successor of those packages, e.g. what it does better? Is it meant to replace the other two packages?
In the README of those two packages I could not find a reference to batchtools. It might make sense to add a link to batchtools in the README (and even in the documentation, for non-GitHub users) of BatchJobs and BatchExperiments, if batchtools does some tasks better?
