
mbimpute's Issues

D matrix: row/column names not used

It appears that mbImpute() doesn't use the row/column names of the D matrix. So the row/column order of D must correspond to the otu_tab table, but this isn't stated in the function docs. Even if it were documented, relying on the user to ensure that the order is correct is dangerous.
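Until the function matches names itself, reordering D defensively before the call would guard against this. A minimal sketch in base R, assuming D carries row/column names that match colnames(otu_tab) (toy data here just for illustration):

```r
# Toy example: reorder a distance matrix D so its row/column order
# matches the OTU (column) order of otu_tab. Assumes D has dimnames.
otu_tab <- matrix(0, nrow = 2, ncol = 3,
                  dimnames = list(c("s1", "s2"), c("otuB", "otuA", "otuC")))
D <- matrix(c(0, 1, 2,
              1, 0, 3,
              2, 3, 0), nrow = 3,
            dimnames = list(c("otuA", "otuB", "otuC"),
                            c("otuA", "otuB", "otuC")))

taxa <- colnames(otu_tab)
D_ordered <- D[taxa, taxa]   # subset by name, not by position
stopifnot(identical(rownames(D_ordered), taxa))
```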

Normalisation & Parallel Computing for mbImpute

Greetings mbImpute Team!

I have been trying to use the mbImpute method on a microbiome dataset with OTU counts for various samples. Pardon my ignorance, since this is a new domain for me: do I need to apply normalisation procedures before calling mbImpute(), or would setting 'unnormalised' to TRUE be sufficient (i.e., the default normalisation procedures outlined in your paper)?

The reason I ask is that the dataset contains over 2,000 samples, over 100 OTU variables, and 2 metadata variables (excluding the study-condition variable); the observations are the raw OTU counts. It might be due to the size of the dataset, but the function has still not finished running after 24 hours, so I wanted to check that I am on the right path.

Additionally, do I need to load extra packages to leverage parallel processing in the mbImpute() function?
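For what it's worth, the quoted data_fit2() source later in this thread shows parallel and ncores arguments backed by doParallel. Assuming mbImpute() forwards those arguments (please verify against ?mbImpute in your installed version), a parallel run might look like:

```r
library(mbImpute)
library(doParallel)  # backs the parallel code path in data_fit2()

# Hedged sketch: the argument names 'parallel' and 'ncores' are read
# off the package source, not the docs; confirm before relying on this.
imputed <- mbImpute(condition = condition, otu_tab = otu_tab,
                    parallel = TRUE, ncores = 4)
```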

Error when running mbImpute

Hello, mbImpute developers,

I recently read your mbImpute manuscript on bioRxiv. I really like the idea and would like to try it on my datasets. However, I kept getting the following error when running the imputation:

(screenshot of the error message attached)

This is the code I'm using:
a_imp <- mbImpute(condition = a_meta_num, otu_tab = otu_tab)

The otu_tab and a_meta_num I used are attached below in RDS format ('data.zip'). Each entry of otu_tab is defined as raw_counts divided by total_counts_in_the_sample, which is similar to the definition in your manuscript except that it isn't multiplied by 10^6.
data.zip

Could you please help me look into this? Thanks in advance!
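For context, the rescaling described above (counts divided by sample totals, optionally times 10^6 as in the manuscript) is just:

```r
# Toy count matrix: rows = samples, columns = OTUs.
counts <- matrix(c(10, 90,
                   25, 75), nrow = 2, byrow = TRUE,
                 dimnames = list(c("s1", "s2"), c("otuA", "otuB")))

rel_abund <- counts / rowSums(counts)  # what this issue's data uses
scaled    <- rel_abund * 1e6           # manuscript-style scaling

# Each sample's relative abundances sum to 1 before scaling.
stopifnot(all.equal(unname(rowSums(rel_abund)), c(1, 1)))
```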

Also, there's a small typo in the README. It should be
install.packages("glmnet")
instead of
install.pacakges("glmnet").

Best,
Zifan

function 'solve': requires numeric/complex matrix/vector arguments

library(mbImpute)
library(glmnet)
library(Matrix)

mbImpute(otu_tab = otu_tab)
[1] "Meta data information unavailable"
[1] "Phylogenentic information unavailable"
Error in h(simpleError(msg, call)) :
  error in evaluating the argument 'a' in selecting a method for function 'solve': requires numeric/complex matrix/vector arguments
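One common trigger for this solve() message is a non-numeric otu_tab, e.g. a data.frame with character or factor columns that coerces to a character matrix. A quick self-contained check (the bad input here is deliberate, for illustration):

```r
# A data.frame with even one character column coerces entirely to a
# character matrix, which later trips solve()'s "requires
# numeric/complex matrix/vector arguments" error.
otu_tab <- data.frame(otu1 = c("1", "2"), otu2 = c(3, 4))  # deliberately bad

m <- as.matrix(otu_tab)
stopifnot(is.character(m))               # coercion silently produced characters

m_num <- apply(otu_tab, 2, as.numeric)   # explicit numeric conversion
stopifnot(is.numeric(m_num))             # safe to pass to mbImpute()
```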

Also, Demo 3 in your README does not work:

otu_tab_T2D <- otu_tab[study_condition == "T2D",]
Error: object 'study_condition' not found

The otu_tab matrix doesn't include a study_condition column.
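A hedged guess at what Demo 3 intends (the object name meta_data is an assumption on my part; check data(package = "mbImpute") for the actual bundled objects):

```r
library(mbImpute)

# Assumption: study_condition lives in the package's bundled metadata,
# not in otu_tab itself; 'meta_data' is a guessed object name.
study_condition <- meta_data$study_condition
otu_tab_T2D <- otu_tab[study_condition == "T2D", ]
```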

sessionInfo:

R version 4.0.3 (2020-10-10)
Platform: x86_64-conda-linux-gnu (64-bit)
Running under: Ubuntu 18.04.5 LTS

Matrix products: default
BLAS/LAPACK: /ebio/abt3_projects/software/dev/miniconda3_dev/envs/mbimpute/lib/libopenblasp-r0.3.12.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] glmnet_4.1-1   Matrix_1.3-2   mbImpute_0.1.0

loaded via a namespace (and not attached):
 [1] compiler_4.0.3    tools_4.0.3       parallel_4.0.3    survival_3.2-10
 [5] splines_4.0.3     codetools_0.2-18  doParallel_1.0.16 grid_4.0.3
 [9] iterators_1.0.13  foreach_1.5.1     shape_1.4.5       lattice_0.20-41

How does your package work on a dataset other than the Karlsson et al. data?

Dear @ruochenj and mbImpute developers.

I would like to use your tool for zero imputation because zCompositions' cmultRepl() does not work due to the huge number of zeroes in my abundance table.

I followed your README and the package instructions (version 0.1.0), and I do not know how to calculate D, or whether I first need to calculate the edges value. In the mbImpute package help, some functions, such as edges or D, appear incompletely documented.

I tried the following:

data(GlobalPatterns)

gpotu<-as.data.frame(otu_table(GlobalPatterns))
gpotu2<-t(gpotu)

gpotu2[1:6,1:6] # This works

but when I try D[1:6,1:6], it automatically redirects to the bundled otu_tab matrix (Karlsson data). I tried some code, but errors appear, and I am not able to calculate the D value.

Could you give me some hints on how to proceed from the GlobalPatterns toy data to using your mbImpute?
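One plausible way to build D from GlobalPatterns is the cophenetic (patristic) distance matrix of its phylogenetic tree. Note this is an assumption about what mbImpute expects for D, not something confirmed by the docs:

```r
library(phyloseq)  # GlobalPatterns, otu_table(), phy_tree(), filter_taxa()
library(ape)       # cophenetic.phylo() for patristic distances

data(GlobalPatterns)

# Filter first: the full tree has ~19k tips, so the full taxon-by-taxon
# distance matrix would be enormous. Keep reasonably prevalent taxa.
gp <- filter_taxa(GlobalPatterns, function(x) sum(x > 0) > 20, prune = TRUE)

gpotu2 <- t(as(otu_table(gp), "matrix"))   # samples x taxa

# Assumption: D is the taxon-by-taxon phylogenetic distance matrix,
# reordered to match the OTU column order of the OTU table.
D <- cophenetic.phylo(phy_tree(gp))
D <- D[colnames(gpotu2), colnames(gpotu2)]
```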

Thanks in advance,

Magi

Error in if (max(wt) > 1e-05)

imputed_count_mat_list <- mbImpute(condition = study_condition,  otu_tab = otu_table, D = D)

Produces:

Error in if (max(wt) > 1e-05) {: missing value where TRUE/FALSE needed
Traceback:

1. mbImpute(condition = study_condition, otu_tab = otu_table, D = D)
2. data_fit2(otu_tab, meta_data, D, k = k)
3. lapply(1:ncol(y_sim), FUN = function(col_i) {
 .     y = y_sim[, col_i]
 .     return(gamma_norm_mix(y, X)$d)
 . })
4. FUN(X[[i]], ...)
5. gamma_norm_mix(y, X)
6. update_gmm_pars(x = y, wt = a_hat_t)
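A hedged guess at the cause: max(wt) evaluates to NA when the mixture-model weights become NaN, which can happen for taxa whose column is constant or all zero. Screening such columns out before the call may avoid the error (the filter below is my suggestion, not package code):

```r
# All-zero or constant taxa can give a degenerate mixture fit with NaN
# weights, after which "if (max(wt) > 1e-05)" fails with
# "missing value where TRUE/FALSE needed".
otu_mat <- matrix(c(0, 0, 0,    # t1: all-zero taxon
                    1, 5, 9,    # t2: fine
                    2, 2, 2),   # t3: constant taxon
                  nrow = 3,
                  dimnames = list(NULL, c("t1", "t2", "t3")))

keep <- apply(otu_mat, 2, function(x) length(unique(x)) > 1)
otu_mat_ok <- otu_mat[, keep, drop = FALSE]
stopifnot(identical(colnames(otu_mat_ok), "t2"))
```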

sessionInfo

R version 4.0.3 (2020-10-10)
Platform: x86_64-conda-linux-gnu (64-bit)
Running under: Ubuntu 18.04.5 LTS

Matrix products: default
BLAS/LAPACK: /ebio/abt3_projects/Anxiety_Twins_Metagenomes/envs/tidyverse-clr/lib/libopenblasp-r0.3.12.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] ape_5.4-1         glmnet_4.1-1      Matrix_1.3-2      mbImpute_0.1.0   
 [5] LeyLabRMisc_0.1.9 tidytable_0.5.9   data.table_1.14.0 ggplot2_3.3.3    
 [9] tidyr_1.1.3       dplyr_1.0.5      

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.6        pillar_1.5.1      compiler_4.0.3    iterators_1.0.13 
 [5] base64enc_0.1-3   tools_4.0.3       digest_0.6.27     uuid_0.1-4       
 [9] nlme_3.1-152      lattice_0.20-41   jsonlite_1.7.2    evaluate_0.14    
[13] lifecycle_1.0.0   tibble_3.1.0      gtable_0.3.0      pkgconfig_2.0.3  
[17] rlang_0.4.10      foreach_1.5.1     cli_2.3.1         IRdisplay_1.0    
[21] parallel_4.0.3    IRkernel_1.1.1    repr_1.1.3        withr_2.4.1      
[25] generics_0.1.0    vctrs_0.3.6       grid_4.0.3        tidyselect_1.1.0 
[29] glue_1.4.2        R6_2.5.0          fansi_0.4.2       survival_3.2-10  
[33] pbdZMQ_0.3-5      purrr_0.3.4       magrittr_2.0.1    splines_4.0.3    
[37] codetools_0.2-18  scales_1.1.1      ellipsis_0.3.1    htmltools_0.5.1.1
[41] assertthat_0.2.1  shape_1.4.5       colorspace_2.0-0  utf8_1.2.1       
[45] doParallel_1.0.16 munsell_0.5.0     crayon_1.4.1    

can not install mbImpute

Hi,

When I tried to install the package, the error below arose:

Error: .onLoad failed in loadNamespace() for 'pkgload', details:
call: loadNamespace(i, c(lib.loc, .libPaths()), versionCheck = vI[[i]])
error: there is no package called ‘backports’

The command I typed was:

devtools::install_github("ruochenj/mbImpute/mbImpute R package")

What should I do to solve it?
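The error message itself says the 'backports' package (a dependency of pkgload, which devtools loads) is missing, so installing it first should resolve this:

```r
# loadNamespace() reports "there is no package called 'backports'",
# so install that missing dependency, then retry the GitHub install.
install.packages("backports")
devtools::install_github("ruochenj/mbImpute/mbImpute R package")
```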

data_fit2 writes RDS files to current working directory

data_fit2() writes 2 RDS files to the current working directory:

  saveRDS(c1, file = "dat_sim_add_filter_coef.rds")
  saveRDS(y_imp, file = "imputed_mat_condition_as_covariate.rds")

The user has no control over where these files are written. This can cause problems when multiple jobs are run in the same directory at the same time.
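Until the function gains an output-path argument, giving each job its own working directory avoids the collisions. A sketch:

```r
# Workaround sketch: run each job in its own directory so the
# hard-coded RDS filenames from data_fit2() cannot collide.
job_dir <- file.path(tempdir(), paste0("mbimpute_job_", Sys.getpid()))
dir.create(job_dir, recursive = TRUE, showWarnings = FALSE)

old_wd <- setwd(job_dir)   # setwd() returns the previous directory
# ... run mbImpute() here; its RDS files land in job_dir ...
setwd(old_wd)              # restore the original working directory
```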

status updates and parallel vs single-core implementation

It would be helpful to have more information than just:

[1] 1
[1] 0

during an mbImpute() run (single core). It appears that this output is simply print(mat_num-1-mat_new) in the data_fit2() function. Using message() with a bit of text along with the value would allow for more informative output.
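For example, a sketch of the suggested change (not the package's actual code):

```r
# Replace the bare  print(mat_num - 1 - mat_new)  inside data_fit2()'s
# loop with a labeled message() so the progress output is self-explanatory.
mat_num <- 5
for (mat_new in 1:(mat_num - 1)) {
  message("mbImpute: ", mat_num - 1 - mat_new,
          " design-matrix block(s) remaining")
}
```

message() also writes to stderr rather than stdout, so progress text stays separate from printed results.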


While I was looking at data_fit2(), I noticed that you have different code for parallel == TRUE versus parallel == FALSE:

    if(!parallel){
      for(mat_new in 1:(mat_num-1)){
        print(mat_num-1-mat_new)
        design_mat_fit = sparseMatrix(i = 1, j =1, x = 0, dims = c(size, row_length))
        track = ((mat_new-1)*size+1):(mat_new*size)
        for(i in 1:size){
          if(is.vector(X)){
            result <- design_mat_row_gen2(y_sim, X[1:n], confidence_set[track[i]+1,1], confidence_set[track[i]+1,2], close_taxa)
            design_mat_fit[i,result$nz_idx] <- result$nz_val
          }
          else{
            result <- design_mat_row_gen2(y_sim, X[1:n,], confidence_set[track[i]+1,1], confidence_set[track[i]+1,2], close_taxa)
            design_mat_fit[i,result$nz_idx] <- result$nz_val
          }
        }
        mat_list[[mat_new]] = design_mat_fit
      }
    }else{
      no_cores <- max(ncores, detectCores() - 1)
      registerDoParallel(cores=no_cores)
      cl <- makeCluster(no_cores, "FORK")
      f <- function(mat_new){
        design_mat_fit = sparseMatrix(i = 1, j =1, x = 0, dims = c(size, row_length))
        track = ((mat_new-1)*size+1):(mat_new*size)
        for(i in 1:size){
          if(is.vector(X)){
            result <- design_mat_row_gen2(y_sim, X[1:n], confidence_set[track[i]+1,1], confidence_set[track[i]+1,2], close_taxa)
            design_mat_fit[i,result$nz_idx] <- result$nz_val
          }
          else{
            result <- design_mat_row_gen2(y_sim, X[1:n,], confidence_set[track[i]+1,1], confidence_set[track[i]+1,2], close_taxa)
            design_mat_fit[i,result$nz_idx] <- result$nz_val
          }
        }
        return(design_mat_fit)
      }
      mat_list <- parLapply(cl, 1:(mat_num-1), f)

Why is this separate code instead of using the same f() function for cores = 1 versus cores > 1? Do these different implementations generate different results?

improve scaling?

Is there any way to improve the scaling of mbImpute()? Even with 4 cores, the function is not scaling well on real metagenome data, at least with default parameters (see the attached scaling plot, mbImpute_scaling).

phyloseq integration

Given the challenge of formatting all of the data exactly as required for mbImpute():

  • condition
    • vector in the same order as the sample order used for the OTU table
  • OTU
    • "wide" matrix format, with column names as taxon IDs and row names as samples
  • metadata
    • data.table with rownames that match the sample IDs in the OTU table
  • D
    phylogeny distance matrix in which the row and column orders must match the OTU ID order in the OTU table

...it would be VERY helpful to provide support for phyloseq objects so that the user doesn't have to worry about all of this formatting (and there is less chance of getting the order wrong).
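A hedged sketch of such a wrapper, pulling the four inputs out of a phyloseq object. The phyloseq accessors are real; the mapping onto mbImpute's arguments is my assumption:

```r
library(phyloseq)
library(ape)  # cophenetic.phylo()

# Extract mbImpute's four inputs from a phyloseq object 'ps', with the
# condition taken from sample_data column 'condition_var'. Hypothetical
# helper, not part of either package.
mbimpute_inputs <- function(ps, condition_var) {
  otu <- as(otu_table(ps), "matrix")
  if (taxa_are_rows(ps)) otu <- t(otu)        # ensure samples x taxa
  meta <- data.frame(sample_data(ps))
  cond <- meta[[condition_var]]
  D <- cophenetic.phylo(phy_tree(ps))         # taxon distance matrix
  D <- D[colnames(otu), colnames(otu)]        # match OTU column order
  list(condition = cond,
       otu_tab   = otu,
       metadata  = meta[rownames(otu), , drop = FALSE],
       D         = D)
}
```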

subscript out of bounds

This occurs often for me when running mbImpute() (current version from github):

[1] "Meta data information unavailable"
[1] "Phylogenentic information unavailable"
[1] 441 567
[1] 0.1891608 5.0000000 5.0000000 5.0000000
[1] 441 567
[1] 0.1891608 0.1791422 5.0000000 5.0000000
[1] 441 567
[1] 0.1891608 0.1791422 0.1714688 5.0000000
[1] 441 567
[1] 0.1891608 0.1791422 0.1714688 0.1614214
Error in impute_set[i, 1]: subscript out of bounds
Traceback:

1. mbImpute(otu_tab = cnt)
2. data_fit2(otu_tab, metadata, D, k = k)
3. design_mat_row_gen2_imp(y_sim, X[1:n, ], impute_set[i, 1], impute_set[i, 
 .     2], close_taxa)
