kvnkuang / pbmcapply

Tracking the progress of mc*apply with progress bar.

License: Other

R 95.96% C 4.04%

r parallelization progress-bar

pbmcapply's People

daroczig thierrygosselin hoardboard qykong zhongwei-yao

pbmcapply's Issues

Non-interactive mode should not print progress bar

@kvnkuang : I was using the package in a knitr document and the output had the progress bar:

## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |=                                                                |   1%
  |                                                                       
  |==                                                               |   2%

Not copying here the whole output, but you can guess the rest. I would suggest printing the progress bar only when interactive(). At least this is what I did in pbapply.
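A minimal sketch of the suggestion above, written against base R only rather than pbmcapply's internals. The helper name `lapply_with_optional_bar` and its `show` argument are hypothetical; the only point is that the bar is skipped unless `interactive()` is TRUE.

```r
# Hypothetical helper: draw a text progress bar only in interactive sessions.
lapply_with_optional_bar <- function(X, FUN, ..., show = interactive()) {
  # knitr, Rscript and other batch runs are non-interactive: plain lapply there.
  if (!show || length(X) == 0) {
    return(lapply(X, FUN, ...))
  }
  pb <- utils::txtProgressBar(min = 0, max = length(X), style = 3)
  on.exit(close(pb))
  out <- lapply(seq_along(X), function(i) {
    res <- FUN(X[[i]], ...)
    utils::setTxtProgressBar(pb, i)  # advance the bar after each element
    res
  })
  names(out) <- names(X)  # keep names, as lapply does
  out
}

res <- lapply_with_optional_bar(1:3, sqrt)
```

Passing `show = TRUE` explicitly would force the bar on, which is roughly the role an opt-in argument could play for users who do want it in batch mode.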

Width calculation problems for short and long durations

Related to #25. If I use the same use case there with the latest (Github) commit of pbmcapply (packageVersion() reports 1.3.0), I get the following output:

If I change 10^7 to 10^2 in the use case, there are still linebreak problems.

Can you reproduce?

Issue with differing behaviour from mcapply

I don't believe it is documented in the code, but seeded pbmclapply was written to use the mc.set.stream() function where mclapply uses mc.reset.stream(). In pbmclapply this is called in the utils.R file through the .customized_mcparallel() function. This has been raised on stackoverflow as a query here:
https://stackoverflow.com/questions/67655726/parallel-processing-in-r-setting-seed-with-mclapply-vs-pbmclapply/69595595#69595595
The decision on how to proceed may additionally be linked to another stackoverflow query as expected behaviour may be different for different users:
https://stackoverflow.com/questions/15070377/r-doesnt-reset-the-seed-when-lecuyer-cmrg-rng-is-used

Some values in result list are just NULL

I have a strange issue since I have upgraded to R 3.5.0 and installed the latest CRAN version of this package. The resulting list from a pbmclapply call sometimes contains some NULL elements. Replacing the same with lapply gives the desired result of no NULL elements. Also with mclapply it seems to work, though I am not sure whether this was just luck or this is some sort of a race condition.

Could pbmclapply behave differently in the result compared to mclapply and apply?

Width calculation problem if ETA has hours (or longer)

Consider the following example:

library(pbmcapply)

lazySqrt <- function(num) {
  # Sleep randomly between 0 to 0.5 second
  Sys.sleep(runif(1, 0, 0.5))
  return(sqrt(num))
}


# Get the sqrt of 1 to 10^7 in parallel
result <- pbmclapply(seq.int(10^7), lazySqrt, mc.cores = 2)

When I run the above code, the ETA runs to hours, and I think pbmclapply then prints a line that is longer than the detected width of the terminal.

I've tested on gnome-terminal, with and without byobu, and with version 1.2.5 and the latest Github version.

Can you reproduce?
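To illustrate the kind of width calculation at stake, here is a sketch that sizes the bar from whatever room is left after the suffix has been formatted, so a longer ETA shrinks the bar instead of overflowing the line. This is not pbmcapply's rendering code; `format_eta` and `render_bar` are hypothetical names.

```r
# Hypothetical: format seconds as MM:SS, or HH:MM:SS once the ETA reaches an hour.
format_eta <- function(seconds) {
  seconds <- round(seconds)
  h <- as.integer(seconds %/% 3600)
  m <- as.integer((seconds %% 3600) %/% 60)
  s <- as.integer(seconds %% 60)
  if (h > 0) {
    sprintf("%02d:%02d:%02d", h, m, s)
  } else {
    sprintf("%02d:%02d", m, s)
  }
}

# Hypothetical: build one line of output that is exactly `width` characters wide.
render_bar <- function(fraction, eta_seconds, width = getOption("width")) {
  prefix <- "  |"
  suffix <- sprintf("| %3d%%, ETA %s",
                    as.integer(round(fraction * 100)), format_eta(eta_seconds))
  # The bar gets whatever is left once prefix and suffix are accounted for.
  bar_width <- max(width - nchar(prefix) - nchar(suffix), 0)
  filled <- floor(bar_width * fraction)
  paste0(prefix, strrep("=", filled), strrep(" ", bar_width - filled), suffix)
}

cat(render_bar(0.27, 3, width = 80), "\n")
cat(render_bar(0.01, 500000, width = 80), "\n")
```

If the terminal is narrower than the prefix and suffix combined, the line still overflows; a real implementation would need a fallback for that case.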

The link to the blog post with comparison does not work

@kvnkuang, thank you for this package it seems to be a nice extension to parallel.

Would it be possible to recover the post with comparison and/or upload it to GitHub so it remains available regardless of the blog status?

That's how "https://kevinkuang.net/tracking-progress-in-r-ad97998c359f" looks in my browser:

(It also shows Error 502: bad gateway)

Address "cannot wait for child %d as it does not exist" warning

This is a spinoff from #37 (comment). I just wanted to add some extra info:

It seems that this warning might be converted to an error in the future (?):
https://github.com/wch/r-source/blob/trunk/src/library/parallel/src/fork.c#L815-L816

My own selfish motivation is that I like to run my code with options(warn = 2). This helps me find issues at the source more quickly. But if indeed there is no way around this particular warning, I can temporarily switch "warn" to 1 whenever I run pbmcapply.

As always, thanks for your great work on this package Kevin!

Execution cannot be interrupted with Ctrl-C

When I do some expensive calculation and realize that I want to abort it, I can usually do that with Ctrl-C. With mclapply this also works:

> parallel::mclapply(1:15, function (x) Sys.sleep(1))
^C
>

Doing the same with pbmclapply nothing happens and the process gets stuck:

> pbmcapply::pbmclapply(1:15, function (x) Sys.sleep(1))
  |================                                            |  27%, ETA 00:03^C

Note the ^C at the end of the line.

I usually help myself by pressing Ctrl-Z to send the process to the background. This is what the frozen process looks like:

Trying to kill it with kill 13668 with an implicit SIGTERM does not work. I have to send it a SIGKILL in order to get rid of it.

This is annoying because I always lose my R session in case I realize that I do not want to let that finish. In RStudio I have to restart R, but also my environment is lost.

Is there something one could do to improve SIGINT handling?

Infinite call to pbmclapply()

Hi @kvnkuang,

in the following example I get an infinite execution of pbmclapply(). Using mclapply() I get "at least" NULL returned.

This problem only happens on macOS with a normal R startup. Using R in Vanilla mode solves it.
Up to now it's unclear what exactly causes this behaviour (http://stackoverflow.com/questions/44058387/r-mclapply-pblapply-vs-lapply-use-case).
It also works fine on Linux. Anyway, to be able to deal with the NULL output and return an informative error message, it would be important that pbmclapply() does not run indefinitely.

I would like to use your package in parsperrorest() as one parallel mode option.

remotes::install_github("pat-s/sperrorest@performance")

library(MASS)
library(sperrorest)
library(pbmcapply)

currentSample <- partition.cv(maipo, nfold = 4)
currentSample[[2]] <- partition.cv(maipo, nfold = 4)[[1]]
currentRes <- currentSample

lda.predfun <- function(object, newdata, fac = NULL) {
  library(nnet)
  majority <- function(x) {
    levels(x)[which.is.max(table(x))]
  }
  
  majority.filter <- function(x, fac) {
    for (lev in levels(fac)) {
      x[ fac == lev ] <- majority(x[ fac == lev ])
    }
    x
  }
  
  pred <- predict(object, newdata = newdata)$class
  if (!is.null(fac)) pred <- majority.filter(pred, newdata[,fac])
  return(pred)
}

data("maipo", package = "sperrorest")
predictors <- colnames(maipo)[5:ncol(maipo)]
fo <- as.formula(paste("croptype ~", paste(predictors, collapse = "+")))


# pbmclapply
runreps_res <- pbmclapply(cl = 2, currentSample, function(X) 
  runreps(currentSample = X, data = maipo,
          formula = fo, par.mode = 1, pred.fun = lda.predfun,
          do.try = FALSE, model.fun = lda,
          error.fold = TRUE, error.rep = TRUE, do.gc = 1,
          err.train = TRUE, importance = FALSE, currentRes = currentRes, 
          pred.args = list(fac = "field"), response = "croptype", par.cl = 2, 
          coords = c("x", "y"), progress = 1, pooled.obs.train = c(), 
          pooled.obs.test = c(), err.fun = err.default))

# mclapply
runreps_res <- mclapply(cl = 2, currentSample, function(X) 
  runreps(currentSample = X, data = maipo,
          formula = fo, par.mode = 1, pred.fun = lda.predfun,
          do.try = FALSE, model.fun = lda,
          error.fold = TRUE, error.rep = TRUE, do.gc = 1,
          err.train = TRUE, importance = FALSE, currentRes = currentRes, 
          pred.args = list(fac = "field"), response = "croptype", par.cl = 2, 
          coords = c("x", "y"), progress = 1, pooled.obs.train = c(), 
          pooled.obs.test = c(), err.fun = err.default))

WISH: Option for outputting progress bar to standard error (stderr)

Background

The progress bar generated by:

> y <- pbmcapply::pbmclapply(1:3, sqrt)
  |==================================================| 100%, Elapsed 00:00

is sent to the standard output (stdout). Proof:

> out <- capture.output(y <- pbmcapply::pbmclapply(1:3, sqrt))
> str(out)
 chr "\r  |                                                         |   0%, ETA NA\r  |==================            "| __truncated__

Issue

This means that it is captured by report generators (e.g. Sweave, knitr, and rmarkdown) and becomes part of the echoed output in the report/vignette. This is not always wanted - personally, I'd say it's rarely wanted. The reason for this is that utils::txtProgressBar() unfortunately defaults to file = "", which means "output to stdout" [https://github.com/HenrikBengtsson/Wishlist-for-R/issues/75]. Looking at other progress bar solutions in R, but also in other languages and software tools, outputting progress bars to stderr is the de facto standard.

Wish

Add an argument, and/or an option, to control where progress bar output is sent ("stderr" or "stdout").
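A small sketch of what the wish amounts to, using the `file` argument that utils::txtProgressBar() already has. The wrapper `run_with_bar` and its `file` parameter are hypothetical and do not reflect pbmcapply's API.

```r
# Hypothetical wrapper: the caller chooses where the bar is written.
run_with_bar <- function(n, file = stderr()) {
  pb <- utils::txtProgressBar(min = 0, max = n, style = 3, file = file)
  on.exit(close(pb))
  for (i in seq_len(n)) {
    utils::setTxtProgressBar(pb, i)
  }
  invisible(sqrt(seq_len(n)))
}

# capture.output() only collects stdout, so a bar sent to stderr is not captured.
captured <- capture.output(y <- run_with_bar(3))
length(captured)
```

With `file = stdout()` the same call would put the bar back into `captured`, which is the behaviour the report describes.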

progressBar() error after the 5/11/2018 update

Hi,

Firstly, thanks for this great package! I have been using it a lot.

The recent update has caused the following error in my code (used to work perfectly)

Error in progressBar(0, length, style = mc.style, substyle = mc.substyle) :
must have max bigger than min.

I will try to come up with a reproducible example. Meanwhile, let me know if you can think of anything that might contribute to this error.

Thanks,
Tony

Use getOption for ignore.interactive parameter.

Use an option to store the status of ignore.interactive parameter.
It allows a global setting of this flag.

future.globals.maxSize has no effect in pbmclapply

I really like the function pbmclapply(). Thanks a lot for providing.

Today I ran into the following error:

Error in getGlobalsAndPackages(expr, envir = envir, tweak = tweakExpression,  : 
  The total size of the 3 globals that need to be exported for the future expression (‘do.call(what = FUN, args = args)’) is 6.24 GiB. This exceeds the maximum allowed size of 1.00 GiB (option 'future.globals.maxSize'). There are three globals: ‘args’ (6.24 GiB of class ‘list’), ‘FUN’ (5.59 KiB of class ‘function’) and ‘progressFifo’ (584 bytes of class ‘numeric’).

And I found a solution here: https://stackoverflow.com/questions/40536067/how-to-adjust-future-global-maxsize-in-r

However using options(future.globals.maxSize = 22020096000) and checking with

> options("future.globals.maxSize")
$future.globals.maxSize
[1] 22020096000

did not solve the problem. However, using mclapply() instead of pbmclapply() with the changed options did not run into an error.

Dynamically select port.

Dynamically select port available for socket connection.
Currently, if two instances of pbmcapply are running on the same machine, one will fail due to the port conflict.

mc.preschedule

Hi Kevin,

Thanks for the package!

Does the design of the package allow exposing mc.foo arguments of parallel::mclapply like mc.preschedule?

Currently, I am forced to use parallel::mclapply without a progress bar when mc.preschedule is FALSE. Would this be a meaningful feature addition?

Regards,
Srikanth

learning resources for possible contributor

Hi!

great package. it's used in a codebase i've just started working with.

Do you have any resources that you'd recommend for someone looking to understand more about this parallelization stuff and maybe contribute to this project?
e.g.

setting up a dev environment that interfaces between this project, R , C source code
any good articles/blogs/.. on do_lapply and other important internal C functions.
general starter guides for exploring the C source code for R

Thanks!

windows installation

Thanks for the cool package. I have been waiting for this functionality for quite a while.

I included this package as a dependency to my package ("warbleR"). However, I just realized that pbmcapply can't be installed on Windows. This would largely limit the range of users. It might be better if it could be installed on Windows but return an error if more than 1 core is requested, as in the parallel package ("'mc.cores' > 1 is not supported on Windows").

thanks

pbmcapply not showing in Rscript mode

Awesome package, really love it!

I noticed that when I run a script via Rscript myscript.r the progress bar does not appear in the terminal.

Is there an easy way to set that up? Couldn't see anything in the documentation.

Thanks

progressBar(max=0): throws error on 'missing value where TRUE/FALSE needed'

Hi/FYI,

> pb <- pbmcapply::progressBar(max=0)
Error in if (nb == .nb && pc == .pc && timenow - .timenow < 1) { : 
  missing value where TRUE/FALSE needed

> sessionInfo()
R version 4.0.4 (2021-02-15)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.5 LTS

Matrix products: default
BLAS:   /home/hb/shared/software/CBI/R-4.0.4/lib/R/lib/libRblas.so
LAPACK: /home/hb/shared/software/CBI/R-4.0.4/lib/R/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
[1] compiler_4.0.4  parallel_4.0.4  pbmcapply_1.5.0

pbmclapply fails to return list where lapply, pbapply, and mclapply have no issue

Hi,

I'm trying to write a custom function and am having an issue where pbmclapply fails to return a list, while identical functions using lapply, pbapply, and mclapply work with no issues.

In this example I'm using a function from the Seurat package to read a matrix in HDF5 file format, but the same errors persist when it is replaced with fread, read.csv, etc.

pblapply and mclapply versions that are inside of larger function but relevant portion is here given a defined list of files and sample names:

  pboptions(char = "=")
  if (parallel) {
    raw_data_list <- mclapply(mc.cores = num_cores, 1:length(sample.names), function(i) {
      h5_loc <- file.path(data.dir, file.list[1])
      data <- Read10X_h5(filename = h5_loc)
    })
  } else {
    raw_data_list <- pblapply(1:length(x = sample.names), function(i) {
      h5_loc <- file.path(data.dir, file.list[1])
      data <- Read10X_h5(filename = h5_loc)
    })
  }
  names(raw_data_list) <- sample.names
  return(raw_data_list)
}

If I change the mclapply section to use pbmclapply:

  if (parallel) {
    raw_data_list <- pbmclapply(mc.cores = num_cores, X = 1:length(sample.names), FUN = function(i) {
      h5_loc <- file.path(data.dir, file.list[1])
      data <- Read10X_h5(filename = h5_loc)
    })
  } else {
    raw_data_list <- pblapply(1:length(x = sample.names), function(i) {
      h5_loc <- file.path(data.dir, file.list[1])
      data <- Read10X_h5(filename = h5_loc)
    })
  }
  names(raw_data_list) <- sample.names
  return(raw_data_list)
}

It appears to be working and then I get error:

Reading 10X H5 files from directory
  |====================================================================| 100%, Elapsed 00:15
Error in names(raw_data_list) <- sample.names : 
  attempt to set an attribute on NULL

Basically it is not returning the list during the function.

However, it also gets slightly weirder. When I got this error I was trying to read in 12 files with 4 cores. If I remove files from the target directory so that it is only trying to read 5 files with 4 cores it succeeds with no issues. The files are all identical except for the file names which are sequentially ordered so it is not issue of a corrupt file or anything.

Any insights would be great because I'd really love to have progress bars for parallel versions of these functions.

Thanks!
Sam

Check the length of vector before drawing the progressBar.

Return an empty list with warning message if length == 0.
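A sketch of such a guard in base R, since utils::txtProgressBar() stops with "must have 'max' > 'min'" when asked to draw a bar for zero elements. The function name `checked_lapply` is hypothetical and the body is not the package's implementation.

```r
# Hypothetical: check the input length before any progress bar is created.
checked_lapply <- function(X, FUN, ...) {
  n <- length(X)
  if (n == 0) {
    warning("X has length 0; returning an empty list without a progress bar")
    return(list())
  }
  pb <- utils::txtProgressBar(min = 0, max = n, style = 3, file = stderr())
  on.exit(close(pb))
  lapply(seq_len(n), function(i) {
    res <- FUN(X[[i]], ...)
    utils::setTxtProgressBar(pb, i)
    res
  })
}

res <- checked_lapply(1:3, sqrt)
```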

must have 'max' > 'min'

I had 3 operations. They were sequential. First two operations finished successfully, but second operation(probably) had an error in the end, please see the error below.
Third operation as I see even wasn't started.

[1] "Started processing"
|======================================================================================================| 100%
|======================================================================================================| 100%
Error in txtProgressBar(0, length, style = mc.style) :
must have 'max' > 'min'

size of globals too big for future

Hello Kevin,

I want to use your pbmclapply() function to run linear models for an analysis I am working on. I have a dataset (cpg.p) that has 850K columns. Each one of these is a probe set that I am using as a predictor in the linear models. This means I will be running 850K linear models. I run this code where cpg.p is the dataset with 850K columns, and f is a simple function I wrote that fits the models and extracts the coefficients and p-values:

rslts.p <- do.call("rbind", pbmclapply(cpg.p, FUN = f, mc.cores = getOption("mc.cores", 20L)))

I am getting this error:

Error in getGlobalsAndPackages(expr, envir = envir, tweak = tweakExpression, : The total size of the 3 globals that need to be exported for the future expression (‘do.call(what = FUN, args = args)’) is 696.56 MiB. This exceeds the maximum allowed size of 500.00 MiB (option 'future.globals.maxSize'). There are three globals: ‘args’ (696.55 MiB of class ‘list’), ‘FUN’ (5.55 KiB of class ‘function’) and ‘progressFifo’ (552 bytes of class ‘numeric’).

I am thinking this has something to do with the fact the cpg.p is too big for the future? If I run the same code with a smaller dataset (e.g. 250K) it runs beautifully. I can easily split up the dataset, run the code, and recombine, but I was wondering if there is a workaround when working with large datasets like mine?

Thank you
Harry
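The split, run, and recombine workaround mentioned above could look roughly like this. `chunked_lapply` is a hypothetical helper; `apply_fun` defaults to lapply so the sketch runs without pbmcapply, and whether smaller chunks stay under the future.globals.maxSize limit in a given setup is an assumption.

```r
# Hypothetical: apply FUN over X in chunks of at most chunk_size elements.
chunked_lapply <- function(X, FUN, chunk_size, apply_fun = lapply) {
  # Group the element indices into consecutive chunks.
  idx <- split(seq_along(X), ceiling(seq_along(X) / chunk_size))
  pieces <- lapply(idx, function(i) apply_fun(X[i], FUN))
  # Drop the chunk labels so the element names of X are preserved as-is.
  do.call(c, unname(pieces))
}

cols <- data.frame(a = 1:2, b = 3:4, c = 5:6)
res <- chunked_lapply(cols, sum, chunk_size = 2)
```

Passing `apply_fun = pbmcapply::pbmclapply` would give one progress bar per chunk rather than a single bar for the whole run.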

Execution cannot be interrupted with Ctrl-C since ignore.interactive was added

Related to #31

I just encountered this same error (Ctrl-C not stopping pbmclapply) under Linux using Rscript.

Perhaps Ctrl-C does not work anymore since the addition of ignore.interactive? Since I am using it from the terminal I need to switch ignore.interactive to TRUE in order to see the progress bar, but then Ctrl-C does not work (at least in version 1.5.0).

Thank you!

pbmclapply slow down

pbmclapply is much slower than mclapply, and the slowdown grows in proportion to the number of elements processed. Is there a fix for this?

After completion, show time taken instead of ETA

After pbmclapply completes, it shows that "ETA" is "00:00" but this does not contain much information. Instead, I would be interested in seeing how much total time was taken by the pbmclapply call. e.g., if it took 20 minutes, I would want to see "20:00". I'm not sure what would be a good abbreviation for this in place of "ETA".

I'm fine with a "won't fix" on this feature request. I think it would be possible to write a wrapper around pbmclapply using system.time that would output the information I'm interested in, but since this feature might be of interest to others, I thought I would first check in to see what your thoughts are.

Thanks for this great package!
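The wrapper idea mentioned above might be sketched like this with system.time(). The name `with_elapsed` and the "Elapsed MM:SS" label are illustrative choices, and lapply stands in for pbmclapply so the sketch runs on its own.

```r
# Hypothetical wrapper: evaluate an expression and report the wall-clock time taken.
with_elapsed <- function(expr) {
  # expr is evaluated lazily, so the work happens inside system.time().
  timing <- system.time(result <- expr)
  elapsed <- round(timing[["elapsed"]])
  message(sprintf("Elapsed %02d:%02d",
                  as.integer(elapsed %/% 60), as.integer(elapsed %% 60)))
  result
}

res <- with_elapsed(lapply(1:3, sqrt))
```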

pbmclapply hangs if cores is undefined

First of all, thanks! This package is very useful. I noticed something when trying to debug a hanging problem.

Running the example from the manual:

> library(pbmcapply)
Loading required package: parallel
> lazySqrt <- function(num) {
+     # Sleep randomly between 0 to 0.5 second
+     Sys.sleep(runif(1, 0, 0.5))
+     return(sqrt(num))
+ }
> cores <- detectCores()
> result <- pbmclapply(1:3, lazySqrt, mc.cores = cores)
  |=======================================================| 100%, Elapsed 00:00
>

If I have not previously defined the variable cores, the command hangs forever without starting and I have to terminate R. It would be good if it could throw an error instead.

> rm(cores)
> result <- pbmclapply(1:3, lazySqrt, mc.cores = cores)
  |                                                              |   0%, ETA NA

The above never progresses.
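The report does not establish why the call hangs. Assuming it is related to `mc.cores` only being evaluated after the work has started, one possible guard is to force and validate the argument up front so that an undefined variable fails immediately. `resolve_cores` is a hypothetical helper.

```r
# Hypothetical: evaluate and validate mc.cores before any child process is forked.
resolve_cores <- function(mc.cores) {
  # force() evaluates the argument here, so an undefined variable errors right away.
  force(mc.cores)
  if (!is.numeric(mc.cores) || length(mc.cores) != 1 ||
      is.na(mc.cores) || mc.cores < 1) {
    stop("'mc.cores' must be a single number >= 1")
  }
  as.integer(mc.cores)
}

cores <- resolve_cores(2)
```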

reproducibility

set.seed(1, "L'Ecuyer-CMRG")
pbmcapply::pbmcmapply(function(n, mean, sd, ...) rnorm(n, mean, sd), 1:4,
                      mc.set.seed=T, mc.cores=2, 
                      MoreArgs=list(n=1, mu=0, sd=1)
)

can't return the same result after running twice.

But

set.seed(1, "L'Ecuyer-CMRG")
parallel::mcmapply(function(n, mean, sd, ...) rnorm(n, mean, sd), 1:4,
                      mc.set.seed=T, mc.cores=2, 
                      MoreArgs=list(n=1, mu=0, sd=1)
)

can give me the same result
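For comparison, this is the usual way to check that behaviour with the parallel package alone on a Unix-alike system: select the L'Ecuyer-CMRG generator, set the seed, and run the same forked computation twice. The helper `run_once` is a hypothetical name; the sketch says nothing about how pbmcapply seeds its children.

```r
library(parallel)

# Hypothetical helper: seed, then draw one normal variate in each of four tasks.
run_once <- function(seed) {
  set.seed(seed, kind = "L'Ecuyer-CMRG")
  unlist(mclapply(1:4, function(i) rnorm(1), mc.set.seed = TRUE, mc.cores = 2))
}

first  <- run_once(1)
second <- run_once(1)
identical(first, second)
```

Running the same comparison with pbmclapply in place of mclapply is a direct way to reproduce the difference described in this report.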

If warning, return of pbmclapply has different return format if length of X is 1

Consider the following code:

library("pbmcapply")

nsims <- 1

example_fn <- function(x) {
  y <- x^2
  warning("this is a warning.")
  return(y)
}

out <- pbmclapply(X = seq_len(nsims), FUN = example_fn)

I get the following when printing out:

> out
$value
$value[[1]]
[1] 1


$warning
<simpleWarning in FUN(...): this is a warning.>

I think this is potentially an interesting way to handle warnings, but the return format is different when nsims is greater than 1 in the example above.

Thanks for your work on pbmcapply!
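One way to get the same shape for every length of X is to collect warnings per element inside the worker function. This is a sketch with base R condition handling; `capture_warnings` is a hypothetical wrapper and not how pbmcapply handles warnings.

```r
# Hypothetical: wrap FUN so each call returns its value together with its warnings.
capture_warnings <- function(FUN) {
  function(...) {
    warnings <- character(0)
    value <- withCallingHandlers(
      FUN(...),
      warning = function(w) {
        warnings <<- c(warnings, conditionMessage(w))
        invokeRestart("muffleWarning")  # continue FUN after recording the warning
      }
    )
    list(value = value, warnings = warnings)
  }
}

example_fn <- function(x) {
  y <- x^2
  warning("this is a warning.")
  return(y)
}

out <- lapply(seq_len(1), capture_warnings(example_fn))
```

Each element of `out` is then a list with `value` and `warnings`, whether X has one element or many.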

Custom progress bar?

Hi,

This is a great package! The only thing I miss is the ability to have a custom progress bar.

For me, I need a progress bar that reprints with a "\n" newline after every update, so that it can get caught by Airflow's logging.

Do you think that something like this would be possible?

Thanks in advance,
Richard
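A sketch of the kind of reporter described here: one complete line per update instead of a bar redrawn with "\r", so a line-based log collector can pick it up. `newline_progress` is a hypothetical name and does not correspond to an existing pbmcapply style.

```r
# Hypothetical: returns a function that prints one full line per progress update.
newline_progress <- function(max, file = stderr()) {
  force(max)
  force(file)
  function(i) {
    cat(sprintf("progress %d/%d (%3.0f%%)\n",
                as.integer(i), as.integer(max), 100 * i / max),
        file = file)
    invisible(i)
  }
}

tick <- newline_progress(3)
for (i in 1:3) tick(i)
```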

Object 'mcinteractive' not found

Hi!

Since lately using pbmclapply() I get the error

Error in get("mcinteractive", pkg) : 
  Object 'mcinteractive' not found

Running the lines

pkg <- asNamespace('parallel')
mcfork <- get('mcfork', pkg)
mc.advance.stream <- get('mc.advance.stream', pkg)
mcexit <- get('mcexit', pkg)
mcinteractive <- get('mcinteractive', pkg)

reproduces the error, meaning that mcinteractive is no (longer a) name/function of parallel. Thus, I'd assume this is an error due to an update of the parallel-package.
I tried to confirm this by installing an older version of the parallel-package but failed to do so ...🤷‍♂️

I do not seem to be the only one running into this issue see here.

Would you have any idea on how to solve this issue?

Thanks in advance!
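A defensive lookup along these lines would at least avoid the hard failure when an internal function is missing from the installed version of parallel. `get_internal` and the no-op fallback are hypothetical; whether a no-op is an acceptable substitute for mcinteractive is an assumption.

```r
# Hypothetical: look up an internal function, returning a default if it is absent.
get_internal <- function(name, pkg = "parallel", default = NULL) {
  get0(name, envir = asNamespace(pkg), inherits = FALSE, ifnotfound = default)
}

# Falls back to a function that does nothing if parallel does not provide it.
mcinteractive <- get_internal("mcinteractive",
                              default = function(...) invisible(NULL))
is.function(mcinteractive)
```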

Reset plan() on exit

I just like to suggest that you undo your changes to plan() when you exit your functions, e.g.

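pbmclapply <- function(X, FUN, ..., mc.style = 3, mc.cores = getOption("mc.cores", 2L)) {
  # Remember the caller's plan and restore it when the function exits
  oplan <- plan("list")
  on.exit(plan(oplan))

  plan(multiprocess)
  # ... rest of the function body ...
}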

Otherwise there's a risk that calling this function will break other functions / scripts relying on futures called afterwards.

not supported by MRO?

Dear sir,
Thanks for the excellent tool. I was using it on Windows 10, and it works perfectly. However, when I try to install it on Ubuntu 18.04 (server version) + RStudio Server + Microsoft R Open, it reports the following:

> install.packages("pbmcapply")
Installing package into ‘/home/username/R/x86_64-pc-linux-gnu-library/3.5’
(as ‘lib’ is unspecified)
Error in install.packages : This version of R is not set up to install source packages
If it was installed from an RPM, you may need the R-devel RPM
Warning messages:
1: In .rs.normalizePath(libPaths) :
  path[2]="/opt/microsoft/ropen/3.5.3/lib64/R/library": No such file or directory
2: In .rs.normalizePath(libPaths) :
  path[2]="/opt/microsoft/ropen/3.5.3/lib64/R/library": No such file or directory

I have tried many ways, including downloading the tar file, but it is still not working.
Please advise, I would really like to see it work on my system.
Many thanks.

`pbmclapply` not capturing dot `(...)` correctly

Hi !

I don't think pbmclapply is working properly when using a function with more than 2 arguments. Usually extra arguments are captured in parallel::mclapply ... argument.

might want to use dots <- pryr::named_dots(...)

Cheers
Thierry

provide a pbmcapply version of apply (not only mapply and lapply)

I was surprised to find that pbmcapply only provides pbmc{l,m}apply, and not pbmcapply. Is there a specific technical reason for this? Thanks.
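This does not answer the question of why there is no such function, but a row-wise apply can be emulated on top of an lapply-style function. The sketch below uses parallel::mclapply so it runs without pbmcapply; `mc_apply_rows` is a hypothetical name and it assumes FUN returns a single value per row.

```r
# Hypothetical: emulate apply(X, 1, FUN) by iterating over row indices.
mc_apply_rows <- function(X, FUN, ..., mc.cores = 1L) {
  res <- parallel::mclapply(seq_len(nrow(X)),
                            function(i) FUN(X[i, ], ...),
                            mc.cores = mc.cores)
  unlist(res)  # assumes one scalar result per row
}

m <- matrix(1:6, nrow = 2)
row_sums <- mc_apply_rows(m, sum)
```

Swapping parallel::mclapply for pbmcapply::pbmclapply inside the helper would add the progress bar.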