ruochenj / mbImpute
mbImpute: an accurate and robust imputation method for microbiome data
License: MIT License
It appears that mbImpute() doesn't use row/column names for the D matrix. So the order of D must correspond to the otu_tab table, but this isn't stated in the function docs. Even if it were documented, relying on the user to make sure the order is correct is dangerous.
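Until the function checks names itself, a defensive workaround is to reorder D to match the taxon order of otu_tab before calling mbImpute(). A minimal sketch in base R, assuming D carries taxon names on both dimensions (toy matrices here, not real data):

```r
# Toy example: 2 samples x 3 taxa, and a 3 x 3 taxon distance matrix
otu_tab <- matrix(1:6, nrow = 2,
                  dimnames = list(c("s1", "s2"), c("tA", "tB", "tC")))
D <- matrix(c(0, 2, 3,
              2, 0, 1,
              3, 1, 0), nrow = 3,
            dimnames = list(c("tB", "tA", "tC"), c("tB", "tA", "tC")))

taxa <- colnames(otu_tab)
stopifnot(all(taxa %in% rownames(D)))  # fail early if a taxon is missing from D
D_aligned <- D[taxa, taxa]             # reorder rows and columns to otu_tab's order
```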
Greetings mbImpute Team!
I have been trying to utilise the mbImpute method for a microbiome dataset with OTU counts provided for various samples. Pardon my ignorance, since this is a new domain I am working in, but I was wondering whether I need to apply normalisation procedures before using mbImpute(), or whether setting unnormalized to TRUE would be sufficient (with the default normalisation procedures outlined in your paper)?
The reason I am asking is that the dataset contains over 2,000 samples, over 100 OTU variables, and 2 metadata variables (excluding the study-condition variable); the observations are the raw OTU counts. It might be due to the size of the dataset, but the function has still not finished running after 24 hours, so I wanted to check whether I am taking the correct path.
Additionally, do I need to load extra packages to leverage parallel processing in the mbImpute() function?
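For reference, a hedged sketch of what such a call might look like, assuming the argument names unnormalized, parallel, and ncores match the package README (I have not verified defaults beyond that):

```r
library(mbImpute)  # its dependencies glmnet and Matrix are loaded by the package

# Raw counts in, so leave normalization to mbImpute; per the README,
# unnormalized = TRUE tells it the input is raw counts.
imputed <- mbImpute(condition    = condition_vec,
                    otu_tab      = otu_tab,
                    unnormalized = TRUE,
                    parallel     = TRUE,
                    ncores       = 4)
```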
Hello, mbImpute developers,
I read your mbImpute manuscript on bioRxiv recently. I really liked the idea and would like to give it a try on my datasets. However, I kept getting the following error when running the imputation:
This is the code I'm using:
a_imp <- mbImpute(condition = a_meta_num, otu_tab = otu_tab)
The otu_tab and a_meta_num I used are attached below in RDS format ('data.zip'). Each entry of otu_tab is defined as raw_counts divided by total_counts_in_the_sample, which is similar to the one presented in your manuscript except that it doesn't multiply by 10^6.
data.zip
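For concreteness, the relative-abundance transform described above can be written in base R (toy counts here; the real data are in data.zip):

```r
# Toy raw-count matrix: rows = samples, columns = OTUs
raw <- matrix(c(10, 30, 60,
                 5,  5, 90), nrow = 2, byrow = TRUE,
              dimnames = list(c("s1", "s2"), c("otu1", "otu2", "otu3")))

# Divide each row by that sample's total count (no 10^6 scaling,
# unlike the normalization in the manuscript)
otu_tab <- sweep(raw, 1, rowSums(raw), "/")
rowSums(otu_tab)  # each sample now sums to 1
```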
Could you please help me look into this? Thanks in advance!
Also, there's a small typo in README. It should be
install.packages("glmnet")
instead of
install.pacakges("glmnet").
Best,
Zifan
library(mbImpute)
library(glmnet)
library(Matrix)
mbImpute(otu_tab = otu_tab)
[1] "Meta data information unavailable"
[1] "Phylogenentic information unavailable"
Error in h(simpleError(msg, call)) :
error in evaluating the argument 'a' in selecting a method for function 'solve': requires numeric/complex matrix/vector arguments
Also, Demo 3 in your README example does not work:
otu_tab_T2D <- otu_tab[study_condition == "T2D",]
Error: object 'study_condition' not found
The otu_tab matrix doesn't include a study_condition column.
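Presumably study_condition is meant to come from a separate metadata table rather than from otu_tab itself; a hedged sketch of what Demo 3 likely intends (the meta_data table and its study_condition column are assumptions on my part):

```r
# Toy metadata aligned row-for-row with otu_tab's samples
otu_tab <- matrix(1:6, nrow = 3,
                  dimnames = list(c("s1", "s2", "s3"), c("otu1", "otu2")))
meta_data <- data.frame(sample = c("s1", "s2", "s3"),
                        study_condition = c("T2D", "control", "T2D"))

# Subset samples by condition using the metadata, not otu_tab
study_condition <- meta_data$study_condition
otu_tab_T2D <- otu_tab[study_condition == "T2D", ]
```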
sessionInfo:
R version 4.0.3 (2020-10-10)
Platform: x86_64-conda-linux-gnu (64-bit)
Running under: Ubuntu 18.04.5 LTS
Matrix products: default
BLAS/LAPACK: /ebio/abt3_projects/software/dev/miniconda3_dev/envs/mbimpute/lib/libopenblasp-r0.3.12.so
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] glmnet_4.1-1 Matrix_1.3-2 mbImpute_0.1.0
loaded via a namespace (and not attached):
[1] compiler_4.0.3 tools_4.0.3 parallel_4.0.3 survival_3.2-10
[5] splines_4.0.3 codetools_0.2-18 doParallel_1.0.16 grid_4.0.3
[9] iterators_1.0.13 foreach_1.5.1 shape_1.4.5 lattice_0.20-41
Dear @ruochenj and mbImpute developers,
I would like to use your tool for zero imputation because zCompositions cmultRepl() does not work due to the huge number of zeroes in my abundance table.
I followed your README and the package instructions (version 0.1.0), but I do not know how to calculate D, or whether I first need to calculate the edges value. In the mbImpute package help, some functions appear incomplete, such as edges or D...
I tried the following:
data(GlobalPatterns)
gpotu<-as.data.frame(otu_table(GlobalPatterns))
gpotu2<-t(gpotu)
gpotu2[1:6,1:6] # This works
but when I try D[1:6,1:6], it automatically pulls up the otu_tab matrix (Karlsson data); I tried some code, but it gives an error and I am not able to calculate the D value...
Could you give me some hints on how to proceed from the GlobalPatterns toy data to using your mbImpute?
Thanks in advance,
Magi
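Not an official answer, but since D in the manuscript is a taxon-by-taxon phylogenetic distance matrix, one plausible way to build it from the GlobalPatterns tree is ape's cophenetic distances (a sketch requiring phyloseq and ape; whether mbImpute expects exactly this distance is an assumption):

```r
library(phyloseq)  # for GlobalPatterns and phy_tree()
library(ape)       # for cophenetic.phylo()

data(GlobalPatterns)
otu_tab <- t(as.data.frame(otu_table(GlobalPatterns)))  # samples x taxa

# Pairwise tip-to-tip distances on the phylogenetic tree
D <- cophenetic.phylo(phy_tree(GlobalPatterns))

# Align D's taxon order with otu_tab's columns before calling mbImpute()
D <- D[colnames(otu_tab), colnames(otu_tab)]
```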
imputed_count_mat_list <- mbImpute(condition = study_condition, otu_tab = otu_table, D = D)
Produces:
Error in if (max(wt) > 1e-05) {: missing value where TRUE/FALSE needed
Traceback:
1. mbImpute(condition = study_condition, otu_tab = otu_table, D = D)
2. data_fit2(otu_tab, meta_data, D, k = k)
3. lapply(1:ncol(y_sim), FUN = function(col_i) {
. y = y_sim[, col_i]
. return(gamma_norm_mix(y, X)$d)
. })
4. FUN(X[[i]], ...)
5. gamma_norm_mix(y, X)
6. update_gmm_pars(x = y, wt = a_hat_t)
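The failing check max(wt) suggests the EM weights became NA, which typically traces back to NA or negative entries in the input. A quick pre-flight check on the inputs (my guess at the likely cause, not a confirmed fix):

```r
# Toy matrix with a deliberate NA to show the checks firing
otu_table <- matrix(c(1, 0, NA, 4), nrow = 2)

any(is.na(otu_table))            # TRUE here; NA input can make the mixture fit produce NA weights
any(otu_table < 0, na.rm = TRUE) # negative entries would also be invalid counts
```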
R version 4.0.3 (2020-10-10)
Platform: x86_64-conda-linux-gnu (64-bit)
Running under: Ubuntu 18.04.5 LTS
Matrix products: default
BLAS/LAPACK: /ebio/abt3_projects/Anxiety_Twins_Metagenomes/envs/tidyverse-clr/lib/libopenblasp-r0.3.12.so
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] ape_5.4-1 glmnet_4.1-1 Matrix_1.3-2 mbImpute_0.1.0
[5] LeyLabRMisc_0.1.9 tidytable_0.5.9 data.table_1.14.0 ggplot2_3.3.3
[9] tidyr_1.1.3 dplyr_1.0.5
loaded via a namespace (and not attached):
[1] Rcpp_1.0.6 pillar_1.5.1 compiler_4.0.3 iterators_1.0.13
[5] base64enc_0.1-3 tools_4.0.3 digest_0.6.27 uuid_0.1-4
[9] nlme_3.1-152 lattice_0.20-41 jsonlite_1.7.2 evaluate_0.14
[13] lifecycle_1.0.0 tibble_3.1.0 gtable_0.3.0 pkgconfig_2.0.3
[17] rlang_0.4.10 foreach_1.5.1 cli_2.3.1 IRdisplay_1.0
[21] parallel_4.0.3 IRkernel_1.1.1 repr_1.1.3 withr_2.4.1
[25] generics_0.1.0 vctrs_0.3.6 grid_4.0.3 tidyselect_1.1.0
[29] glue_1.4.2 R6_2.5.0 fansi_0.4.2 survival_3.2-10
[33] pbdZMQ_0.3-5 purrr_0.3.4 magrittr_2.0.1 splines_4.0.3
[37] codetools_0.2-18 scales_1.1.1 ellipsis_0.3.1 htmltools_0.5.1.1
[41] assertthat_0.2.1 shape_1.4.5 colorspace_2.0-0 utf8_1.2.1
[45] doParallel_1.0.16 munsell_0.5.0 crayon_1.4.1
Hi,
When I tried to install the package, the following error arose:
Error: .onLoad failed in loadNamespace() for 'pkgload', details:
call: loadNamespace(i, c(lib.loc, .libPaths()), versionCheck = vI[[i]])
error: there is no package called ‘backports’
The command I typed was the below:
devtools::install_github("ruochenj/mbImpute/mbImpute R package")
What should I do to solve it?
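The error concerns a missing dependency of devtools' own tooling (pkgload needs backports), not mbImpute itself. A likely fix is a standard CRAN install of the missing piece first (I can't verify it resolves this exact environment):

```r
install.packages("backports")  # the package loadNamespace() reports as missing
install.packages("devtools")   # reinstall in case pkgload is also stale
devtools::install_github("ruochenj/mbImpute/mbImpute R package")
```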
data_fit2() writes 2 RDS files to the current working directory:
saveRDS(c1, file = "dat_sim_add_filter_coef.rds")
saveRDS(y_imp, file = "imputed_mat_condition_as_covariate.rds")
The user has no control over where these files are written. This can cause problems when multiple jobs are run at the same time.
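One way to fix this would be an output-directory argument (a hypothetical out_dir parameter, sketched below; the current function has no such argument):

```r
# Hypothetical: let the caller choose where the intermediates are saved
save_intermediates <- function(c1, y_imp, out_dir = tempdir()) {
  saveRDS(c1,    file = file.path(out_dir, "dat_sim_add_filter_coef.rds"))
  saveRDS(y_imp, file = file.path(out_dir, "imputed_mat_condition_as_covariate.rds"))
}

# Each concurrent job can then write to its own directory:
job_dir <- file.path(tempdir(), "job1")
dir.create(job_dir, showWarnings = FALSE)
save_intermediates(c1 = 1:3, y_imp = matrix(0, 2, 2), out_dir = job_dir)
```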
It would be helpful to have more information than just:
[1] 1
[1] 0
during an mbImpute() run (single core). It appears that this output is simply print(mat_num - 1 - mat_new) in the data_fit2() function. Using message() with a bit of text along with the value would allow for more informative output.
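Concretely, the print call could become something like this (a sketch, with mat_num and mat_new as in data_fit2()):

```r
mat_num <- 5; mat_new <- 1   # stand-in values for illustration
# Instead of: print(mat_num - 1 - mat_new)
message("mbImpute: ", mat_num - 1 - mat_new, " design matrices remaining")
```

message() also goes to stderr rather than stdout, so the progress text stays out of captured results.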
While I was looking at data_fit2(), I noticed that you have different code for parallel == TRUE versus parallel == FALSE:
if(!parallel){
for(mat_new in 1:(mat_num-1)){
print(mat_num-1-mat_new)
design_mat_fit = sparseMatrix(i = 1, j =1, x = 0, dims = c(size, row_length))
track = ((mat_new-1)*size+1):(mat_new*size)
for(i in 1:size){
if(is.vector(X)){
result <- design_mat_row_gen2(y_sim, X[1:n], confidence_set[track[i]+1,1], confidence_set[track[i]+1,2], close_taxa)
design_mat_fit[i,result$nz_idx] <- result$nz_val
}
else{
result <- design_mat_row_gen2(y_sim, X[1:n,], confidence_set[track[i]+1,1], confidence_set[track[i]+1,2], close_taxa)
design_mat_fit[i,result$nz_idx] <- result$nz_val
}
}
mat_list[[mat_new]] = design_mat_fit
}
}else{
no_cores <- max(ncores, detectCores() - 1)
registerDoParallel(cores=no_cores)
cl <- makeCluster(no_cores, "FORK")
f <- function(mat_new){
design_mat_fit = sparseMatrix(i = 1, j =1, x = 0, dims = c(size, row_length))
track = ((mat_new-1)*size+1):(mat_new*size)
for(i in 1:size){
if(is.vector(X)){
result <- design_mat_row_gen2(y_sim, X[1:n], confidence_set[track[i]+1,1], confidence_set[track[i]+1,2], close_taxa)
design_mat_fit[i,result$nz_idx] <- result$nz_val
}
else{
result <- design_mat_row_gen2(y_sim, X[1:n,], confidence_set[track[i]+1,1], confidence_set[track[i]+1,2], close_taxa)
design_mat_fit[i,result$nz_idx] <- result$nz_val
}
}
return(design_mat_fit)
}
mat_list <- parLapply(cl, 1:(mat_num-1), f)
Why is this separate code instead of using the same f() function for cores = 1 versus cores > 1? Do these different implementations generate different results?
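A common pattern that would avoid the duplication is to define f() once and only switch the apply backend (sketched with a toy f; the real f builds the sparse design matrix):

```r
f <- function(mat_new) mat_new^2   # toy stand-in for the design-matrix builder
mat_num <- 5
parallel <- FALSE

if (parallel) {
  library(parallel)
  cl <- makeCluster(2, type = "FORK")     # FORK as in the original code (Unix only)
  mat_list <- parLapply(cl, 1:(mat_num - 1), f)
  stopCluster(cl)                          # the original never stops its cluster
} else {
  mat_list <- lapply(1:(mat_num - 1), f)   # same f, serial backend
}
```

Two side notes on the parallel branch: the cluster from makeCluster() is never stopped, and `max(ncores, detectCores() - 1)` looks like it should be `min()`, since as written the user's ncores is ignored whenever the machine has more cores.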
Given the challenge of formatting all of the data exactly as required for mbImpute(), it would be very helpful to provide support for phyloseq objects, so the user doesn't have to worry about all of this formatting (and there's less chance of getting the order wrong).
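A sketch of what such a wrapper might look like (a hypothetical mbImpute_physeq() function; the accessor names are phyloseq's, everything else is an assumption):

```r
# Hypothetical convenience wrapper: pull everything mbImpute() needs
# out of a phyloseq object in one consistent sample/taxon order.
mbImpute_physeq <- function(ps, condition_var, ...) {
  otu <- as(phyloseq::otu_table(ps), "matrix")
  if (phyloseq::taxa_are_rows(ps)) otu <- t(otu)   # mbImpute wants samples x taxa
  meta <- data.frame(phyloseq::sample_data(ps))
  D <- ape::cophenetic.phylo(phyloseq::phy_tree(ps))
  D <- D[colnames(otu), colnames(otu)]             # enforce matching taxon order
  mbImpute(condition = meta[[condition_var]], otu_tab = otu, D = D, ...)
}
```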
This occurs often for me when running mbImpute() (current version from GitHub):
[1] "Meta data information unavailable"
[1] "Phylogenentic information unavailable"
[1] 441 567
[1] 0.1891608 5.0000000 5.0000000 5.0000000
[1] 441 567
[1] 0.1891608 0.1791422 5.0000000 5.0000000
[1] 441 567
[1] 0.1891608 0.1791422 0.1714688 5.0000000
[1] 441 567
[1] 0.1891608 0.1791422 0.1714688 0.1614214
Error in impute_set[i, 1]: subscript out of bounds
Traceback:
1. mbImpute(otu_tab = cnt)
2. data_fit2(otu_tab, metadata, D, k = k)
3. design_mat_row_gen2_imp(y_sim, X[1:n, ], impute_set[i, 1], impute_set[i,
. 2], close_taxa)