Comments (22)

mllg commented on August 25, 2024

Can you call traceback() after the error is thrown and post the output so that I can narrow this down?

from batchtools.

phaverty commented on August 25, 2024

mllg commented on August 25, 2024

I guess @phaverty is right, this is very likely an already fixed bug.

@BobP4Ski Can you please install the devel version (devtools::install_github("mllg/batchtools")) and check if the error disappears?

BobP4Ski commented on August 25, 2024

Yes, thank you. I will try the newer version and let you know.

The traceback follows and is the same as I get with submitJobs:

traceback()
7: listJobs(reg, c("bjobs", "-u $USER", "-w", "-r"))
6: cf$listJobsRunning(reg)
5: unique(cf$listJobsRunning(reg))
4: getBatchIds(reg, status = status)
3: .findOnSystem(reg = reg, cols = c("job.id", "batch.id"))
2: submitJobs(ids = ids, resources = resources, reg = reg)
1: btlapply(1:3, function(x) x^2)

BobP4Ski commented on August 25, 2024

Looks like the newest version worked, but I ended up with another problem, which may be a config problem:

Error in listJobs(reg, c("bjobs", "-u $USER", "-w", "-r")) :
Command 'bjobs -u $USER -w -r' produced exit code: 127; output: command not found

I can execute this command manually without problems

Full template follows:

> btlapply(1:3, function(x) x^2)
Sourcing configuration file '~/.batchtools.conf.R' ...
Adding 3 jobs ...
Error in listJobs(reg, c("bjobs", "-u $USER", "-w", "-r")) : 
  Command 'bjobs -u $USER -w -r' produced exit code: 127; output: command not found
> traceback()
9: stop(simpleError(sprintf(...), call = sys.call(sys.parent())))
8: stopf("Command '%s' produced exit code: %i; output: %s", stri_flatten(cmd, 
       " "), res$exit.code, res$output)
7: listJobs(reg, c("bjobs", "-u $USER", "-w", "-r"))
6: cf$listJobsRunning(reg)
5: unique(cf$listJobsRunning(reg))
4: getBatchIds(reg, status = status)
3: .findOnSystem(reg = reg, cols = c("job.id", "batch.id"))
2: submitJobs(ids = ids, resources = resources, reg = reg)
1: btlapply(1:3, function(x) x^2)
> 
#BSUB -J "<%= job.hash %>[1-<%= n.array.jobs %>]" # name of the job / number of jobs in chunk
#BSUB -o <%= log.file %>              # output is sent to logfile, stdout + stderr by default
#BSUB -q <%= resources$queue %> # Job queue
#BSUB -W <%= resources$walltime %> # Walltime in minutes
#BSUB -R  "select[rhel6], rusage[free_slot=0] "
#BSUB -n <%= resources$n.cores %> # Number of cores
#BSUB -M <%= resources$memoryGb %> # Gb of memory

Rscript -e 'batchtools::doJobCollection("<%= uri %>")'


mllg commented on August 25, 2024

I've pushed a potential fix in 4a43e55. Please re-install the devel version and try again.

BobP4Ski commented on August 25, 2024

It was a dumb mistake on my part. I did not have bsub in my path within RStudio.
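
Exit code 127 is the shell's "command not found" status, and an R session started from RStudio does not necessarily inherit the PATH of a login shell. A quick way to check is to ask R itself which scheduler commands it can see. This is a minimal sketch in base R; `find_commands` is a hypothetical helper, not part of batchtools:

```r
# Hypothetical helper (not part of batchtools): report which commands
# the current R session can find on its PATH.
find_commands <- function(cmds = c("bsub", "bjobs", "bkill")) {
  paths <- Sys.which(cmds)  # "" for every command that is not found
  data.frame(
    command = cmds,
    path = unname(paths),
    found = unname(nzchar(paths)),
    stringsAsFactors = FALSE
  )
}

print(find_commands())

# The search path as seen by R, one directory per element
print(strsplit(Sys.getenv("PATH"), .Platform$path.sep)[[1]])
```

If a command shows up as not found here but works in a terminal, comparing the PATH printed above with `echo $PATH` in that terminal usually shows which directory is missing.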

Is there a way to get a log file of the bsub (and other Linux) commands that batchtools is invoking?

The reason I ask is that I'm having what looks like a bsub command string error. Although I've been careful with my template, perhaps the command string being sent out is different than I assume. You'll see the pi example below.

Many many thanks for your help!
Bob


batchtools::batchMap(fun=piApprox,n=rep(1e5,1000),reg=reg)
Adding 1000 jobs ...

batchtools::submitJobs(reg=reg)
Submitting 1000 jobs in 1000 chunks using cluster functions 'LSF' ...
Submitting [------------------------------------------------------------------------------------------------------------------------] 0% eta: ?s
Error in batchtools::submitJobs(reg = reg) :
Fatal error occurred: 101. Command 'bsub' produced exit code 255. Output: 'Sorry, you must tell cluster how many free_slot (gb cpu) you want to use...
Request aborted by esub. Job not submitted.'

BobP4Ski commented on August 25, 2024

Found the problem...

On the Linux system I am using (Redhat) when an LSF job file is used the proper invocation is:

bsub < foo.job # This works

In your code you have

bsub foo.job # This fails

I believe you want to add a < to the command with a simple paste in runOSCommand.

res = suppressWarnings(system2(command = sys.cmd, args = sys.args, stdout = TRUE, stderr = TRUE, wait = TRUE))

Change to paste(sys.cmd," < ") ?

res = suppressWarnings(system2(command = paste(sys.cmd," < "), args = sys.args, stdout = TRUE, stderr = TRUE, wait = TRUE))


You'll see that I get a nonsense error message because nothing got piped into bsub.

beecs7711:bpack> bsub < foo.job
Job <176724> is submitted to queue .
beecs7711:bpack> bsub foo.job
Sorry, you must tell cluster how many free_slot (gb cpu) you want to use...
Request aborted by esub. Job not submitted.
beecs7711:bpack> cat foo.job
#BSUB -R "rusage[free_slot=1]"

Rscript -e 'batchtools::doJobCollection("~/tmp/jobs/job5dc4b3f11d3d2a172765f2c24ae632f8.rds")'
beecs7711:bpack>


runOSCommand = function(sys.cmd, sys.args = character(0L), nodename = "localhost") {
  assertCharacter(sys.cmd, any.missing = FALSE, len = 1L)
  assertCharacter(sys.args, any.missing = FALSE)
  assertString(nodename, min.chars = 1L)

  # run remotely via ssh unless the target is the local machine
  if (nodename != "localhost") {
    sys.args = c(nodename, shQuote(stri_flatten(c(sys.cmd, sys.args), " ")))
    sys.cmd = "ssh"
  } else if (length(sys.args) == 0L) {
    sys.args = ""
  }

  "!DEBUG OS cmd: sys.cmd stri_flatten(sys.args, ' ')"
  # temporary debugging output added locally
  print(paste("!!! system command:", sys.cmd, sys.args))
  res = suppressWarnings(system2(command = sys.cmd, args = sys.args, stdout = TRUE, stderr = TRUE, wait = TRUE))
  output = as.character(res)
  exit.code = attr(res, "status") %??% 0L

  "!DEBUG OS result (exit code exit.code):"
  "!DEBUG cat(output, sep = \"\n\")"

  return(list(exit.code = exit.code, output = output))
}

phaverty commented on August 25, 2024

mllg commented on August 25, 2024

bsub has the flag -i <input-file>. I think this might be the safest way. -> 48de42c

mllg commented on August 25, 2024

If you want more output, install the debugme package and call Sys.setenv(DEBUGME="batchtools") before loading batchtools.

BobP4Ski commented on August 25, 2024

Wasn't able to get system2 to work. I think it must strip out the "<", but I have not drilled too deeply into this. I must have a "<" on my Linux distro. I'm using the system command and it works.

If I do the following kludge it works great. This is not production-worthy though...

  • res = runOSCommand("bsub <", outfile) in submitJob in makeClusterFunctionsLSF
  • res = (system(paste(sys.cmd,sys.args))) in runOSCommand
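
A likely explanation, going by the documented behaviour of `system2`: its `command` argument is always passed through `shQuote`, so a string such as `"bsub < "` reaches the shell as a single program name, whereas `args` is appended unquoted and the `stdin` argument adds the redirect itself. The sketch below uses `cat` as a stand-in for `bsub` (an assumption, so that it runs on a Unix-like system without LSF):

```r
# A throwaway "job file"; `cat` stands in for `bsub` in this sketch.
job.file <- tempfile(fileext = ".job")
writeLines(c("#BSUB -q long", "Rscript -e '1 + 1'"), job.file)
expected <- readLines(job.file)

# (a) Redirect pasted onto the command: the whole string "cat  < " is quoted
#     and treated as one program name, so the call fails. Depending on the
#     R version this may surface as an R error or as a non-zero exit status.
ok.a <- tryCatch({
  out <- suppressWarnings(
    system2(paste("cat", " < "), args = job.file, stdout = TRUE, stderr = TRUE)
  )
  is.null(attr(out, "status"))
}, error = function(e) FALSE)

# (b) Redirect placed in args: args are not quoted, so the shell sees "<".
res.b <- system2("cat", args = c("<", shQuote(job.file)),
                 stdout = TRUE, stderr = TRUE)

# (c) Redirect requested through the stdin argument.
res.c <- system2("cat", stdin = job.file, stdout = TRUE, stderr = TRUE)

print(ok.a)
print(identical(res.b, expected))
print(identical(res.c, expected))
```

Variant (c) keeps the redirect out of the argument vector entirely, which avoids special-casing individual commands.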

mllg commented on August 25, 2024

Sorry, now I made a dumb mistake, but I hopefully fixed it with my last commit. Can you please try again? If

bsub -i <job.file>

works on your system, it should also work now with batchtools.

BobP4Ski commented on August 25, 2024

From the bsub documentation it should work, but it appears to go into an interactive mode...

I did the command manually...

Here's the job file...

beecs7711:bpack> cat jobffb4e41423b4b4807b5dae05cdbec301.job
#BSUB-J "jobffb4e41423b4b4807b5dae05cdbec301[1-1]"
#BSUB-o ~/tmp/logs/jobffb4e41423b4b4807b5dae05cdbec301.log
#BSUB-q long
#BSUB-R  "select[rhel6,lm], rusage[free_slot=1,mem=1]"
export DEBUGME=batchtools
Rscript -e 'batchtools::doJobCollection("~/tmp/jobs/jobffb4e41423b4b4807b5dae05cdbec301.rds")'

Here's the bsub command... You'll see that it's looking for more input...

beecs7711:bpack> bsub -i jobffb4e41423b4b4807b5dae05cdbec301.job
bsub> 

This works though...

beecs7711:bpack> bsub < jobffb4e41423b4b4807b5dae05cdbec301.job
Job <302620> is submitted to queue <development>.
beecs7711:bpack> 

BobP4Ski commented on August 25, 2024

This is the only clean way I could get system2 to work (in clusterFunctions.R): the < needs to be added conditionally.

res = (system2(command = sys.cmd,
                 args = ifelse(sys.cmd=="bsub",paste("<",paste(sys.args,collapse=' ')),sys.args),
                 stdout = TRUE, stderr = TRUE, wait = TRUE))
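
One caveat with the `ifelse` form: `ifelse` returns a result of the same length as its test, which here is a single value, so for any command other than `bsub` only the first element of `sys.args` would survive. A sketch of the same conditional with a plain `if`/`else`; `build_args` is a hypothetical helper name used only for illustration:

```r
# Hypothetical helper: prepend the "<" redirect for bsub only,
# and return all other argument vectors unchanged.
build_args <- function(sys.cmd, sys.args) {
  if (identical(sys.cmd, "bsub")) {
    paste("<", paste(sys.args, collapse = " "))
  } else {
    sys.args
  }
}

bjobs.args <- c("-u", "$USER", "-w", "-r")

# ifelse() with a length-one test keeps only the first element
truncated <- ifelse("bjobs" == "bsub", "unused", bjobs.args)
print(truncated)

# if/else keeps the full argument vector
print(build_args("bjobs", bjobs.args))
print(build_args("bsub", "foo.job"))
```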

mllg commented on August 25, 2024

I'll port back the stdin argument for runOSCommand on Monday. It worked in BatchJobs, so it should work here too.

BobP4Ski commented on August 25, 2024

Michel,

Yes, BatchJobs worked in this regard so perhaps switching to this approach may fix this and the following issue too.

New issue... Using the approach I showed, all jobs complete; however, the post-submitJobs functions such as getStatus and reduceResultsDataTable give improper results. So I may have fouled something up, or there is another problem with the system2 approach, such as in the stderr or stdout messages.

For example (1) below fails but (2) gives proper results.

# (1) FAIL ... only 110 rows of 10000 are returned with this
df <- reduceResultsDataTable(reg=reg)

# (2) GOOD... all 10000 rows are correctly returned
resDir <- paste(reg$file.dir,"/results",sep="")
res <- system(paste("ls",resDir),intern=TRUE)
df2 <- data.frame(rdsFile=res)
df2$result <- sapply(res, function(f) readRDS(paste(resDir,f,sep="/")))
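
The manual check in (2) can also be written without shelling out to `ls`. A minimal sketch using only base R; the temporary directory and the three small result files are stand-ins created here so the example runs on its own, whereas a real registry would already contain the files under its results directory:

```r
# Stand-in for the registry's results directory (assumption for this sketch)
res.dir <- tempfile("results")
dir.create(res.dir)
for (i in 1:3) {
  saveRDS(i^2, file.path(res.dir, sprintf("%i.rds", i)))
}

# List the result files directly from R and read each one back
files <- list.files(res.dir, pattern = "\\.rds$", full.names = TRUE)
df2 <- data.frame(rdsFile = basename(files), stringsAsFactors = FALSE)
df2$result <- vapply(files, readRDS, numeric(1), USE.NAMES = FALSE)

print(nrow(df2))
print(df2)
```

Comparing `nrow(df2)` with the number of rows returned by `reduceResultsDataTable` gives the same cross-check as the `ls` version.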

mllg commented on August 25, 2024

batchtools now uses stdin to pass the contents of the job file to bsub.

I cannot reproduce your second problem. What does findDone() return?

BobP4Ski commented on August 25, 2024

Sorry for the delay, Michel. I took some time off for the holidays. Hope you have a happy New Year.

I've tried your new version and it works great. I still have the second problem though. I get a lot of 'expired' jobs. There's always the good possibility that I'm making a mistake.

If I use findDone() or findExpired() they are consistent with the getStatus(). You'll see the output of this below.

My short example below has 10000 trivial simulations. I specify n.chunk=500 so I can limit this to 500 cores at a time.

Any help/direction you can provide is much appreciated.

When I look at the results directory, it appears that there are 10000 rds files, and the data in each of the files appears to be correct. So it appears that something is happening with getStatus, because the data seems to have been generated OK.

Bob


library(debugme)
Sys.setenv(DEBUGME='batchtools')

library(devtools)
load_all("~/R_Packages/Source/batchtools-master")

if(file.exists("~/tmp"))unlink("~/tmp",recursive=T)

reg <- makeRegistry(file.dir="~/tmp", seed=1, conf.file="~/.batchtools.conf.R")

dummyFunc <- function(n) {
  #Sys.sleep(0.04)
  n
}

n.sim <- 10000

ids <- batchtools::batchMap(fun=dummyFunc,n=1:n.sim,reg=reg)

system.time(batchtools::submitJobs(
  chunkIds(n.chunk=500,reg=reg),
  reg=reg,
  resources=list(walltime=10000, n.cores=1,
                 memoryGb=1,
                 chunks.as.arrayjobs=TRUE))
)

print("waitForJobs")
(batchtools::waitForJobs(reg = reg))

print("getStatus")
(stat <- batchtools::getStatus(reg=reg))

Here is the output of getStatus:

Status for 10000 jobs:
  Submitted : 10000 (100.0%)
  Queued    :     0 (  0.0%)
  Started   :   500 (  5.0%)
  Running   :     0 (  0.0%)
  Done      :   500 (  5.0%)
  Error     :     0 (  0.0%)
  Expired   :  9500 ( 95.0%)

My template is:

#BSUB-J "<%= job.hash %>[1-<%= n.array.jobs %>]" 
#BSUB-o <%= log.file %>              
#BSUB-q <%= resources$queue %> 
#BSUB-R  "select[rhel6,lm], rusage[free_slot=0,mem=<%= resources$memoryGb %>]"
#BSUB-W <%= resources$walltime %> 
#BSUB-n <%= resources$n.cores %> 
#BSUB-M <%= resources$memoryGb %> 

setenv DEBUGME <%= Sys.getenv("DEBUGME") %>
setenv LSB_JOB_REPORT_MAIL N

Rscript -e 'batchtools::doJobCollection("<%= uri %>")'

The following is one of the *.job files that is created by batchtools.

job0011b966fc6571f5e1a4c8060ba219e4.job
::::::::::::::
#BSUB-J "job0011b966fc6571f5e1a4c8060ba219e4[1-20]"
#BSUB-o ~/tmp/logs/job0011b966fc6571f5e1a4c8060ba219e4.log
#BSUB-q opcapd
#BSUB-R  "select[rhel6,lm], rusage[free_slot=0,mem=1]"
#BSUB-W 10000
#BSUB-n 1
#BSUB-M 1
setenv DEBUGME batchtools
setenv LSB_JOB_REPORT_MAIL N
Rscript -e 'batchtools::doJobCollection("~/tmp/jobs/job0011b966fc6571f5e1a4c8060ba219e4.rds")'
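
The numbers in this report can be cross-checked with a little arithmetic: 10000 jobs in 500 chunks gives 20 jobs per chunk, which matches the `[1-20]` array range in the job file, and 500 is also exactly the number of jobs reported as done. The sketch below assumes jobs are assigned to chunks in contiguous, equally sized blocks, which may not be how `chunkIds` actually distributes them:

```r
n.sim <- 10000L
n.chunk <- 500L

# Assumption: contiguous, equally sized chunks (chunkIds may assign differently)
chunk <- rep(seq_len(n.chunk), each = n.sim / n.chunk)
jobs.per.chunk <- table(chunk)

print(unique(as.integer(jobs.per.chunk)))  # jobs per array job
print(n.chunk / n.sim)                     # share of jobs if one per chunk is matched
```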


mllg commented on August 25, 2024

Sadly I cannot debug this myself as I'm still lacking access to a system with array jobs.

It looks like the batch ids in the data base do not match the batch ids reported by listJobs. Can you give me the output of the following lines?

batchtools:::getBatchIds(reg = reg)
head(reg$status$batch.id)

mllg commented on August 25, 2024

I managed to get my hands on some schedulers with array job support (Slurm and Torque/PBS) and realized that I have to rework the handling of array jobs:

  1. It is necessary to store the array id in the data base. If array jobs are submitted, I just assume the array ids 1:nrow(jobs).
  2. It is necessary to store a unique job identifier for each array job (or I cannot kill). For Slurm I have to construct it as [job.id]_[array.id], for Torque something like [job.id][task.id][host].
  3. The log files given in the template must use [log.file]-[array.id] for array jobs. There is an environment variable with the array id for Slurm (%a), Gridengine ($TASK_ID, but escaping not working, so this is broken), LSF (%I?). No option for Torque/PBS, so this defined the default name.
  4. There is usually an option to display all array jobs in the list functions. This has to be enabled.
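
Points 1 to 3 can be sketched as a small lookup table with one row per array task. This is only an illustration of the naming scheme described above, using the Slurm-style `[job.id]_[array.id]` identifier; `array_job_table` and its arguments are hypothetical and not the batchtools implementation:

```r
# Hypothetical helper: derive per-task identifiers and log file names
# for an array job with n.jobs tasks.
array_job_table <- function(job.id, n.jobs, log.file) {
  array.id <- seq_len(n.jobs)  # point 1: assume array ids 1:n
  data.frame(
    array.id = array.id,
    batch.id = sprintf("%s_%i", job.id, array.id),    # point 2: [job.id]_[array.id]
    log.file = sprintf("%s-%i", log.file, array.id),  # point 3: [log.file]-[array.id]
    stringsAsFactors = FALSE
  )
}

tab <- array_job_table("302620", 3L, "job.log")
print(tab)
```

With a table like this, each task has its own identifier to match against the scheduler's job list, instead of all tasks sharing the id of the array job.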

I disabled array jobs for SGE for the moment because I encountered too many nasty incompatibilities there. I would like to support array jobs for LSF, but would need the output of the commands bsub and bjobs for array jobs.

mllg commented on August 25, 2024

Array jobs are supported for selected schedulers now, see NEWS.
