kvnkuang / pbmcapply
Tracking the progress of mc*apply with progress bar.
License: Other
@kvnkuang : I was using the package in a knitr document and the output had the progress bar:
##
|
| | 0%
|
|= | 1%
|
|== | 2%
Not copying the whole output here, but you can guess the rest. I would suggest printing the progress bar only when interactive() is TRUE. At least, this is what I did in pbapply.
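The suggestion above can be sketched in base R. This is only an illustration of gating a text progress bar on interactive(), not the code used by pbmcapply or pbapply; the helper name maybe_progress_bar and its show argument are hypothetical.

```r
# Hypothetical helper: create a progress bar only in interactive sessions.
# Returns NULL when no bar should be shown (e.g. inside knitr or Rscript).
maybe_progress_bar <- function(n, show = interactive()) {
  if (!show || n < 1L) {
    return(NULL)
  }
  utils::txtProgressBar(min = 0, max = n, style = 3)
}

# Usage: the loop works the same way with or without a bar.
pb <- maybe_progress_bar(3, show = FALSE)
for (i in 1:3) {
  Sys.sleep(0.01)
  if (!is.null(pb)) utils::setTxtProgressBar(pb, i)
}
if (!is.null(pb)) close(pb)
```

With show left at its default, a knitted document would get no bar because interactive() is FALSE there.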
Related to #25. If I use the same use case there with the latest (GitHub) commit of pbmcapply (packageVersion() reports 1.3.0), I get the following output:
If I change 10^7 to 10^2 in the use case, there are still linebreak problems.
Can you reproduce?
I don't believe it is documented in the code, but a choice was made, when using a seeded pbmclapply, to use the mc.set.stream() function where mclapply uses mc.reset.stream(). In pbmclapply this is called in the utils.R file through the .customized_mcparallel() function. This has been raised on Stack Overflow as a query here:
https://stackoverflow.com/questions/67655726/parallel-processing-in-r-setting-seed-with-mclapply-vs-pbmclapply/69595595#69595595
The decision on how to proceed may additionally be linked to another stackoverflow query as expected behaviour may be different for different users:
https://stackoverflow.com/questions/15070377/r-doesnt-reset-the-seed-when-lecuyer-cmrg-rng-is-used
I have had a strange issue since I upgraded to R 3.5.0 and installed the latest CRAN version of this package. The resulting list from a pbmclapply call sometimes contains some NULL elements. Replacing the call with lapply gives the desired result with no NULL elements. It also seems to work with mclapply, though I am not sure whether this was just luck or whether there is some sort of race condition.
Could pbmclapply behave differently in its result compared to mclapply and apply?
Consider the following example:
library(pbmcapply)
lazySqrt <- function(num) {
  # Sleep randomly between 0 to 0.5 second
  Sys.sleep(runif(1, 0, 0.5))
  return(sqrt(num))
}
# Get the sqrt of 1 to 10^7 in parallel
result <- pbmclapply(seq.int(10^7), lazySqrt, mc.cores = 2)
When I run the above code, the ETA is several hours, so I think that pbmclapply outputs a line longer than the detected width of the terminal.
I've tested on gnome-terminal, with and without byobu, and with version 1.2.5 and the latest Github version.
Can you reproduce?
@kvnkuang, thank you for this package; it seems to be a nice extension to parallel.
Would it be possible to recover the post with comparison and/or upload it to GitHub so it remains available regardless of the blog status?
That's how "https://kevinkuang.net/tracking-progress-in-r-ad97998c359f" looks in my browser:
(It also shows Error 502: bad gateway)
This is a spinoff from #37 (comment). I just wanted to add some extra info:
It seems that this warning might be converted to an error in the future (?):
https://github.com/wch/r-source/blob/trunk/src/library/parallel/src/fork.c#L815-L816
My own selfish motivation is that I like to run my code with options(warn = 2). This helps me find issues at the source more quickly. But if indeed there is no way around this particular warning, I can temporarily switch "warn" to 1 whenever I run pbmcapply.
As always, thanks for your great work on this package Kevin!
When I do some expensive calculation and realize that I want to abort it, I can usually do that with Ctrl+C. With mclapply this also works:
> parallel::mclapply(1:15, function (x) Sys.sleep(1))
^C
>
Doing the same with pbmclapply, nothing happens and the process gets stuck:
> pbmcapply::pbmclapply(1:15, function (x) Sys.sleep(1))
|================ | 27%, ETA 00:03^C
Note the ^C at the end of the line.
I usually help myself by pressing Ctrl+Z to send the process to the background. This is what the frozen process looks like:
Trying to kill it with kill 13668, which sends an implicit SIGTERM, does not work. I have to send it a SIGKILL in order to get rid of it.
This is annoying because I always lose my R session when I realize that I do not want to let the call finish. In RStudio I have to restart R, and my environment is lost as well.
Is there something one could do to improve SIGINT handling?
Hi @kvnkuang,
in the following example pbmclapply() runs forever. Using mclapply() I get "at least" NULL returned.
This problem only happens on macOS with a normal R startup. Using R in vanilla mode solves it.
Up to now it's unclear what exactly causes this behaviour (http://stackoverflow.com/questions/44058387/r-mclapply-pblapply-vs-lapply-use-case).
It also works fine on Linux. Anyway, to be able to deal with the NULL output and return an informative error message, it would be important that pbmclapply() does not run indefinitely.
I would like to use your package in parsperrorest() as one parallel mode option.
remotes::install_github("pat-s/sperrorest@performance")
library(MASS)
library(sperrorest)
library(pbmcapply)
currentSample <- partition.cv(maipo, nfold = 4)
currentSample[[2]] <- partition.cv(maipo, nfold = 4)[[1]]
currentRes <- currentSample
lda.predfun <- function(object, newdata, fac = NULL) {
  library(nnet)
  majority <- function(x) {
    levels(x)[which.is.max(table(x))]
  }
  majority.filter <- function(x, fac) {
    for (lev in levels(fac)) {
      x[ fac == lev ] <- majority(x[ fac == lev ])
    }
    x
  }
  pred <- predict(object, newdata = newdata)$class
  if (!is.null(fac)) pred <- majority.filter(pred, newdata[,fac])
  return(pred)
}
data("maipo", package = "sperrorest")
predictors <- colnames(maipo)[5:ncol(maipo)]
fo <- as.formula(paste("croptype ~", paste(predictors, collapse = "+")))
# pbmclapply
runreps_res <- pbmclapply(cl = 2, currentSample, function(X)
  runreps(currentSample = X, data = maipo,
          formula = fo, par.mode = 1, pred.fun = lda.predfun,
          do.try = FALSE, model.fun = lda,
          error.fold = TRUE, error.rep = TRUE, do.gc = 1,
          err.train = TRUE, importance = FALSE, currentRes = currentRes,
          pred.args = list(fac = "field"), response = "croptype", par.cl = 2,
          coords = c("x", "y"), progress = 1, pooled.obs.train = c(),
          pooled.obs.test = c(), err.fun = err.default))
# mclapply
runreps_res <- mclapply(cl = 2, currentSample, function(X)
  runreps(currentSample = X, data = maipo,
          formula = fo, par.mode = 1, pred.fun = lda.predfun,
          do.try = FALSE, model.fun = lda,
          error.fold = TRUE, error.rep = TRUE, do.gc = 1,
          err.train = TRUE, importance = FALSE, currentRes = currentRes,
          pred.args = list(fac = "field"), response = "croptype", par.cl = 2,
          coords = c("x", "y"), progress = 1, pooled.obs.train = c(),
          pooled.obs.test = c(), err.fun = err.default))
The progress bar generated by:
> y <- pbmcapply::pbmclapply(1:3, sqrt)
|==================================================| 100%, Elapsed 00:00
is sent to the standard output (stdout). Proof:
> out <- capture.output(y <- pbmcapply::pbmclapply(1:3, sqrt))
> str(out)
chr "\r | | 0%, ETA NA\r |================== "| __truncated__
This means that it is captured by report generators (e.g. Sweave, knitr, and rmarkdown) and becomes part of the echoed output in the report/vignette. This is not always wanted; personally, I'd say it's rarely wanted. The reason for this is that utils::txtProgressBar() unfortunately defaults to file = "", which means "output to stdout" [https://github.com/HenrikBengtsson/Wishlist-for-R/issues/75]. Looking at other progress bar solutions in R, and also in other languages and software tools, outputting progress bars to stderr is the de facto standard.
Add an argument and/or option to control where progress bar output is sent ("stderr" or "stdout").
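To illustrate the stdout versus stderr point, here is a minimal base R sketch that sends a utils::txtProgressBar() to stderr through its file argument. The wrapper lapply_with_stderr_bar is a hypothetical, sequential stand-in and not part of pbmcapply; it only shows that capture.output() no longer picks up the bar.

```r
# Hypothetical sequential wrapper: same result as lapply(), but the
# progress bar is written to stderr instead of the default stdout.
lapply_with_stderr_bar <- function(X, FUN, ...) {
  n <- length(X)
  out <- vector("list", n)
  if (n == 0L) {
    return(out)  # txtProgressBar() requires max > min, so skip the bar
  }
  pb <- utils::txtProgressBar(min = 0, max = n, style = 3, file = stderr())
  on.exit(close(pb))
  for (i in seq_len(n)) {
    out[i] <- list(FUN(X[[i]], ...))
    utils::setTxtProgressBar(pb, i)
  }
  out
}

# capture.output() only captures stdout, so nothing is captured here.
captured <- capture.output(y <- lapply_with_stderr_bar(1:3, sqrt))
```

An option-based switch (for example a hypothetical getOption() lookup choosing between stdout() and stderr()) could be layered on top of the same file argument.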
Hi,
Firstly, thanks for this great package! I have been using it a lot.
The recent update has caused the following error in my code (which used to work perfectly):
Error in progressBar(0, length, style = mc.style, substyle = mc.substyle) :
must have max bigger than min.
I will try to come up with a reproducible example. Meanwhile, let me know if you can think of anything that might contribute to this error.
Thanks,
Tony
Use an option to store the status of the ignore.interactive parameter.
This allows the flag to be set globally.
I really like the function pbmclapply(). Thanks a lot for providing it.
Today I ran into the following error:
Error in getGlobalsAndPackages(expr, envir = envir, tweak = tweakExpression, :
The total size of the 3 globals that need to be exported for the future expression (‘do.call(what = FUN, args = args)’) is 6.24 GiB. This exceeds the maximum allowed size of 1.00 GiB (option 'future.globals.maxSize'). There are three globals: ‘args’ (6.24 GiB of class ‘list’), ‘FUN’ (5.59 KiB of class ‘function’) and ‘progressFifo’ (584 bytes of class ‘numeric’).
And I found a solution here: https://stackoverflow.com/questions/40536067/how-to-adjust-future-global-maxsize-in-r
However, using options(future.globals.maxSize = 22020096000) and checking with
> options("future.globals.maxSize")
$future.globals.maxSize
[1] 22020096000
did not solve the problem. However, using mclapply() instead of pbmclapply() with the changed options did not run into an error.
Dynamically select an available port for the socket connection.
Currently, if two instances of pbmcapply are running on the same machine, one will fail due to the port conflict.
Hi Kevin,
Thanks for the package!
Does the design of the package allow exposing mc.foo arguments of parallel::mclapply, like mc.preschedule?
Currently, I am forced to use parallel::mclapply without a progress bar when mc.preschedule is FALSE. Would this be a meaningful feature addition?
Regards,
Srikanth
Hi!
Great package. It's used in a codebase I've just started working with.
Do you have any resources that you'd recommend for someone looking to understand more about this parallelization stuff and maybe contribute to this project?
e.g.
Thanks!
Thanks for the cool package. I have been waiting for this functionality for quite a while.
I included this package as a dependency of my package ("warbleR"). However, I just realized that pbmcapply can't be installed on Windows. This would largely limit the range of users. It might be better if it could be installed on Windows but return an error if more than 1 core is requested, as in the parallel package ("'mc.cores' > 1 is not supported on Windows").
thanks
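The behaviour requested above could look roughly like the following check. This is a sketch under the assumption that the package would validate mc.cores up front; resolve_cores and its os_type argument are hypothetical names, and the error text is the one quoted from the parallel package.

```r
# Hypothetical validation step: allow installation and use on Windows,
# but refuse more than one core there, mirroring parallel's own message.
resolve_cores <- function(mc.cores, os_type = .Platform$OS.type) {
  if (os_type == "windows" && mc.cores > 1L) {
    stop("'mc.cores' > 1 is not supported on Windows")
  }
  as.integer(mc.cores)
}

# os_type is passed explicitly here only to show both branches.
cores_unix <- resolve_cores(4, os_type = "unix")
cores_windows <- resolve_cores(1, os_type = "windows")
```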
Awesome package, really love it!
I noticed that when I run a script via Rscript myscript.r the progress bar does not appear in the terminal.
Is there an easy way to set that up? Couldn't see anything in the documentation.
Thanks
Hi/FYI,
> pb <- pbmcapply::progressBar(max=0)
Error in if (nb == .nb && pc == .pc && timenow - .timenow < 1) { :
missing value where TRUE/FALSE needed
> sessionInfo()
R version 4.0.4 (2021-02-15)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.5 LTS
Matrix products: default
BLAS: /home/hb/shared/software/CBI/R-4.0.4/lib/R/lib/libRblas.so
LAPACK: /home/hb/shared/software/CBI/R-4.0.4/lib/R/lib/libRlapack.so
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] compiler_4.0.4 parallel_4.0.4 pbmcapply_1.5.0
Hi,
I'm trying to write a custom function and am having an issue where pbmclapply fails to return a list, while identical functions using lapply, pbapply, and mclapply work with no issues.
In this example I'm using a function from the Seurat package to read a matrix in HDF5 file format (but the same errors persist when it is replaced with fread, read.csv, etc.).
The pblapply and mclapply versions sit inside a larger function, but the relevant portion is here, given a defined list of files and sample names:
pboptions(char = "=")
if (parallel) {
  raw_data_list <- mclapply(mc.cores = num_cores, 1:length(sample.names), function(i) {
    h5_loc <- file.path(data.dir, file.list[1])
    data <- Read10X_h5(filename = h5_loc)
  })
} else {
  raw_data_list <- pblapply(1:length(x = sample.names), function(i) {
    h5_loc <- file.path(data.dir, file.list[1])
    data <- Read10X_h5(filename = h5_loc)
  })
}
names(raw_data_list) <- sample.names
return(raw_data_list)
}
If I change the mclapply section to use pbmclapply:
if (parallel) {
  raw_data_list <- pbmclapply(mc.cores = num_cores, X = 1:length(sample.names), FUN = function(i) {
    h5_loc <- file.path(data.dir, file.list[1])
    data <- Read10X_h5(filename = h5_loc)
  })
} else {
  raw_data_list <- pblapply(1:length(x = sample.names), function(i) {
    h5_loc <- file.path(data.dir, file.list[1])
    data <- Read10X_h5(filename = h5_loc)
  })
}
names(raw_data_list) <- sample.names
return(raw_data_list)
}
It appears to be working, and then I get this error:
Reading 10X H5 files from directory
|====================================================================| 100%, Elapsed 00:15
Error in names(raw_data_list) <- sample.names :
attempt to set an attribute on NULL
Basically, it is not returning the list inside the function.
However, it also gets slightly weirder. When I got this error I was trying to read in 12 files with 4 cores. If I remove files from the target directory so that it is only trying to read 5 files with 4 cores, it succeeds with no issues. The files are all identical except for the file names, which are sequentially ordered, so it is not an issue of a corrupt file or anything.
Any insights would be great because I'd really love to have progress bars for parallel versions of these functions.
Thanks!
Sam
Return an empty list with warning message if length == 0.
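The change described above can be sketched as a guard in front of any apply-style function. This is an illustration only, not the package's actual patch; lapply_or_empty and its apply_fun argument are hypothetical, and base lapply stands in for pbmclapply so the example runs without the package.

```r
# Hypothetical guard: return an empty list with a warning for empty input
# instead of letting the progress bar fail with "must have 'max' > 'min'".
lapply_or_empty <- function(X, FUN, ..., apply_fun = lapply) {
  if (length(X) == 0L) {
    warning("X has length 0; returning an empty list")
    return(list())
  }
  apply_fun(X, FUN, ...)
}

empty_result <- suppressWarnings(lapply_or_empty(integer(0), sqrt))
normal_result <- lapply_or_empty(1:2, sqrt)
```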
I had 3 sequential operations. The first two finished successfully, but the second one (probably) had an error at the end; please see the error below.
The third operation, as far as I can see, wasn't even started.
[1] "Started processing"
|======================================================================================================| 100%
|======================================================================================================| 100%
Error in txtProgressBar(0, length, style = mc.style) :
must have 'max' > 'min'
Hello Kevin,
I want to use your pbmclapply() function to run linear models for an analysis I am working on. I have a dataset (cpg.p) that has 850K columns. Each one of these is a probe set that I am using as a predictor in the linear models. This means I will be running 850K linear models. I run this code where cpg.p is the dataset with 850K columns, and f is a simple function I wrote that fits the models and extracts the coefficients and p-values:
rslts.p <- do.call("rbind", pbmclapply(cpg.p, FUN = f, mc.cores = getOption("mc.cores", 20L)))
I am getting this error:
Error in getGlobalsAndPackages(expr, envir = envir, tweak = tweakExpression, : The total size of the 3 globals that need to be exported for the future expression (‘do.call(what = FUN, args = args)’) is 696.56 MiB. This exceeds the maximum allowed size of 500.00 MiB (option 'future.globals.maxSize'). There are three globals: ‘args’ (696.55 MiB of class ‘list’), ‘FUN’ (5.55 KiB of class ‘function’) and ‘progressFifo’ (552 bytes of class ‘numeric’).
I am thinking this has something to do with the fact that cpg.p is too big for the future? If I run the same code with a smaller dataset (e.g. 250K columns) it runs beautifully. I can easily split up the dataset, run the code, and recombine, but I was wondering if there is a workaround when working with large datasets like mine?
Thank you
Harry
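The split, run, and recombine approach mentioned in the question can be sketched as follows. This is a generic base R illustration on a tiny stand-in data frame; process_in_chunks, chunk_size, and apply_fun are hypothetical names, and whether chunking stays under the future.globals.maxSize limit depends on the chunk size chosen.

```r
# Hypothetical helper: apply f to the columns of `data` in chunks of
# `chunk_size` columns, then stack the per-chunk results with rbind.
# apply_fun could be swapped for a parallel apply function.
process_in_chunks <- function(data, f, chunk_size, apply_fun = lapply) {
  idx <- seq_along(data)
  groups <- split(idx, ceiling(idx / chunk_size))
  pieces <- lapply(groups, function(cols) {
    do.call("rbind", apply_fun(data[cols], f))
  })
  do.call("rbind", unname(pieces))
}

# Small stand-in for a wide dataset; f returns one row of summaries per column.
toy <- data.frame(a = 1:5, b = 6:10, c = 11:15)
f <- function(x) c(mean = mean(x), sd = sd(x))
chunked <- process_in_chunks(toy, f, chunk_size = 2)
```

Each chunk is a smaller list of columns, so only that chunk would need to be exported per call.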
Related to #31
I just encountered this same error (Ctrl-C not stopping pbmclapply) under Linux using Rscript.
Perhaps Ctrl-C does not work anymore since the addition of ignore.interactive? Since I am using it from the terminal, I need to switch ignore.interactive to TRUE in order to see the progress bar, but then Ctrl-C does not work (at least in version 1.5.0).
Thank you!
After pbmclapply completes, it shows that "ETA" is "00:00" but this does not contain much information. Instead, I would be interested in seeing how much total time was taken by the pbmclapply call. E.g., if it took 20 minutes, I would want to see "20:00". I'm not sure what would be a good abbreviation for this in place of "ETA".
I'm fine with a "won't fix" on this feature request. I think it would be possible to write a wrapper around pbmclapply using system.time that would output the information I'm interested in, but since this feature might be of interest to others, I thought I would first check in to see what your thoughts are.
Thanks for this great package!
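The system.time wrapper mentioned above might look like this. It is a sketch only: timed_call and format_elapsed are hypothetical helpers, and the wrapped expression here is a plain lapply call so the example does not depend on pbmcapply.

```r
# Format a number of seconds as MM:SS (minutes are not capped at 59).
format_elapsed <- function(seconds) {
  seconds <- as.integer(round(seconds))
  sprintf("%02d:%02d", seconds %/% 60L, seconds %% 60L)
}

# Evaluate an expression, report the wall-clock time on stderr,
# and return the value of the expression.
timed_call <- function(expr) {
  timing <- system.time(result <- expr)
  message("Elapsed ", format_elapsed(timing[["elapsed"]]))
  result
}

squares <- timed_call(lapply(1:3, function(x) x^2))
```

Because R evaluates arguments lazily, the expression runs inside system.time(), so the reported time covers the whole call.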
First of all, thanks! This package is very useful. I noticed something when trying to debug a hanging problem.
Running the example from the manual:
> library(pbmcapply)
Loading required package: parallel
> lazySqrt <- function(num) {
+ # Sleep randomly between 0 to 0.5 second
+ Sys.sleep(runif(1, 0, 0.5))
+ return(sqrt(num))
+ }
> cores <- detectCores()
> result <- pbmclapply(1:3, lazySqrt, mc.cores = cores)
|=======================================================| 100%, Elapsed 00:00
>
If I have not previously defined the variable cores, the command hangs forever without starting and I have to terminate R. It would be good if it could throw an error instead.
> rm(cores)
> result <- pbmclapply(1:3, lazySqrt, mc.cores = cores)
| | 0%, ETA NA
The above never progresses.
set.seed(1, "L'Ecuyer-CMRG")
pbmcapply::pbmcmapply(function(n, mean, sd, ...) rnorm(n, mean, sd), 1:4,
mc.set.seed=T, mc.cores=2,
MoreArgs=list(n=1, mu=0, sd=1)
)
does not return the same result when run twice.
But
set.seed(1, "L'Ecuyer-CMRG")
parallel::mcmapply(function(n, mean, sd, ...) rnorm(n, mean, sd), 1:4,
mc.set.seed=T, mc.cores=2,
MoreArgs=list(n=1, mu=0, sd=1)
)
does give me the same result.
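For comparison, the reproducibility that the parallel package documents for the "L'Ecuyer-CMRG" generator can be checked with a small script. This is a sketch using parallel::mclapply only (not pbmcapply); draw_once is a hypothetical helper, and it assumes default prescheduling and the same number of cores in both runs. On Windows it falls back to one core, where mclapply runs sequentially.

```r
library(parallel)

# Hypothetical helper: seed the L'Ecuyer-CMRG generator, then draw one
# normal variate per task in forked workers.
draw_once <- function(seed, cores) {
  RNGkind("L'Ecuyer-CMRG")
  set.seed(seed)
  mclapply(1:4, function(i) rnorm(1), mc.set.seed = TRUE, mc.cores = cores)
}

cores <- if (.Platform$OS.type == "windows") 1L else 2L
first <- draw_once(1, cores)
second <- draw_once(1, cores)
same_draws <- identical(first, second)
```

Under those assumptions same_draws should come out TRUE, which is the behaviour the report expects from pbmcmapply as well.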
Consider the following code:
library("pbmcapply")
nsims <- 1
example_fn <- function(x) {
y <- x^2
warning("this is a warning.")
return(y)
}
out <- pbmclapply(X = seq_len(nsims), FUN = example_fn)
I get the following when printing out:
> out
$value
$value[[1]]
[1] 1
$warning
<simpleWarning in FUN(...): this is a warning.>
I think this is potentially an interesting way to handle warnings, but the return format is different when nsims is greater than 1 in the example above.
Thanks for your work on pbmcapply!
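Until the return format is consistent, one user-side option is to collect warnings inside each task so that every element has the same shape regardless of the input length. The sketch below uses base R condition handling with lapply as a stand-in for pbmclapply; with_warnings is a hypothetical helper and does not describe how pbmcapply handles warnings internally.

```r
# Hypothetical wrapper: returns a function that runs FUN, records any
# warning messages, and returns list(value = ..., warnings = ...).
with_warnings <- function(FUN) {
  function(x, ...) {
    warnings <- character(0)
    value <- withCallingHandlers(
      FUN(x, ...),
      warning = function(w) {
        warnings <<- c(warnings, conditionMessage(w))
        invokeRestart("muffleWarning")  # keep running after recording
      }
    )
    list(value = value, warnings = warnings)
  }
}

example_fn <- function(x) {
  y <- x^2
  warning("this is a warning.")
  return(y)
}

out <- lapply(seq_len(2), with_warnings(example_fn))
```

Each element of out is then a list with value and warnings, whether there is one task or many.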
Hi,
This is a great package! The only thing I miss is the ability to have a custom progress bar.
For me, I need a progress bar that reprints with a "\n" newline after every update, so that it can get caught by Airflow's logging.
Do you think that something like this would be possible?
Thanks in advance,
Richard
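As a rough picture of what such a line-based progress report could look like, here is a sequential base R sketch that writes one newline-terminated line per completed task. It is not a pbmcapply feature; progress_line and lapply_logged are hypothetical names, and a parallel version would need its own way of reporting from the workers.

```r
# Build one log line such as "progress: 1/4 (25%)".
progress_line <- function(done, total) {
  sprintf("progress: %d/%d (%.0f%%)", done, total, 100 * done / total)
}

# Sequential stand-in: apply FUN and emit one full line after each task,
# so that line-oriented log collectors can pick up every update.
lapply_logged <- function(X, FUN, ..., con = stderr()) {
  total <- length(X)
  out <- vector("list", total)
  for (i in seq_len(total)) {
    out[i] <- list(FUN(X[[i]], ...))
    writeLines(progress_line(i, total), con = con)
  }
  out
}

roots <- lapply_logged(1:4, sqrt)
```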
Hi!
Lately, when using pbmclapply(), I get the error
Error in get("mcinteractive", pkg) :
Object 'mcinteractive' not found
Running the lines
pkg <- asNamespace('parallel')
mcfork <- get('mcfork', pkg)
mc.advance.stream <- get('mc.advance.stream', pkg)
mcexit <- get('mcexit', pkg)
mcinteractive <- get('mcinteractive', pkg)
reproduces the error, meaning that mcinteractive is not (or no longer) a name/function of parallel. Thus, I'd assume this is an error due to an update of the parallel package.
I tried to confirm this by installing an older version of the parallel package but failed to do so ...🤷♂️
I do not seem to be the only one running into this issue; see here.
Would you have any idea on how to solve this issue?
Thanks in advance!
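One way to confirm the diagnosis above is to look the internal functions up without raising an error. This sketch uses base R's get0() on the parallel namespace; find_parallel_internal is a hypothetical helper, and the result depends on the installed R version.

```r
# Hypothetical helper: return the function if the parallel namespace
# defines it, or NULL instead of an error if it does not.
find_parallel_internal <- function(name) {
  get0(name, envir = asNamespace("parallel"), mode = "function",
       inherits = FALSE, ifnotfound = NULL)
}

mcinteractive <- find_parallel_internal("mcinteractive")
if (is.null(mcinteractive)) {
  message("parallel does not define 'mcinteractive' in this R version")
}
```

Since parallel ships with R itself, its contents change with the R version rather than through a separate package update.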
I would just like to suggest that you undo your changes to plan() when you exit your functions, e.g.
pbmclapply <- function(X, FUN, ..., mc.style = 3, mc.cores = getOption("mc.cores", 2L)) {
  oplan <- plan("list")
  on.exit(plan(oplan))
  plan(multiprocess)
Otherwise there's a risk that calling this function will break other functions / scripts relying on futures called afterwards.
Dear sir,
Thanks for the excellent tool. I was using it on Windows 10, and it works perfectly. However, when I try to install it on Ubuntu 18.04 (server version) + RStudio Server + Microsoft R Open, it reports the following:
> install.packages("pbmcapply")
Installing package into ‘/home/username/R/x86_64-pc-linux-gnu-library/3.5’
(as ‘lib’ is unspecified)
Error in install.packages : This version of R is not set up to install source packages
If it was installed from an RPM, you may need the R-devel RPM
Warning messages:
1: In .rs.normalizePath(libPaths) :
path[2]="/opt/microsoft/ropen/3.5.3/lib64/R/library": No such file or directory
2: In .rs.normalizePath(libPaths) :
path[2]="/opt/microsoft/ropen/3.5.3/lib64/R/library": No such file or directory
I have tried many ways, including downloading the tar file, but it is still not working.
Please advise; I would really like to see it work on my system.
Many thanks.
Hi !
I don't think pbmclapply is working properly when using a function with more than 2 arguments. Usually extra arguments are captured in the ... argument of parallel::mclapply.
You might want to use dots <- pryr::named_dots(...)
Cheers
Thierry
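For reference, forwarding extra arguments through ... in a wrapper needs no extra package. The sketch below shows the pattern in base R with lapply as a stand-in; lapply_forwarding and scale_shift are hypothetical names, and this says nothing about how pbmclapply is implemented.

```r
# Hypothetical wrapper: every argument after FUN is passed on to FUN
# for each element of X, as parallel::mclapply does with its dots.
lapply_forwarding <- function(X, FUN, ...) {
  lapply(X, function(x) FUN(x, ...))
}

# A function with more than two arguments.
scale_shift <- function(x, a, b) x * a + b
res <- lapply_forwarding(1:3, scale_shift, a = 2, b = 1)
```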
I was surprised to find that pbmcapply only provides pbmc{l,m}apply, and not pbmcapply. Is there a specific technical reason for this? Thanks.