
mbimpute's Issues

D matrix: row/column names not used

It appears that mbImpute() doesn't use the row/column names of the D matrix. So the row/column order of D must correspond to the otu_tab table, but this isn't stated in the function docs. Even if it were documented, relying on the user to ensure that the order is correct is dangerous.
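Until the function matches names itself, reordering D defensively before the call would guard against this. A minimal sketch in base R, assuming D carries row/column names that match colnames(otu_tab) (toy data here just for illustration):

```r
# Toy example: reorder a distance matrix D so its row/column order
# matches the OTU (column) order of otu_tab. Assumes D has dimnames.
otu_tab <- matrix(0, nrow = 2, ncol = 3,
                  dimnames = list(c("s1", "s2"), c("otuB", "otuA", "otuC")))
D <- matrix(c(0, 1, 2,
              1, 0, 3,
              2, 3, 0), nrow = 3,
            dimnames = list(c("otuA", "otuB", "otuC"),
                            c("otuA", "otuB", "otuC")))

taxa <- colnames(otu_tab)
D_ordered <- D[taxa, taxa]   # subset by name, not by position
stopifnot(identical(rownames(D_ordered), taxa))
```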

Normalisation & Parallel Computing for mbImpute

Greetings mbImpute Team!

I have been trying to use the mbImpute method on a microbiome dataset with OTU counts for various samples. Pardon my ignorance, since this is a new domain for me: do I need to apply normalisation procedures before calling mbImpute(), or would setting 'unnormalised' to TRUE be sufficient (i.e., the default normalisation procedures outlined in your paper)?

The reason I ask is that the dataset contains over 2,000 samples, over 100 OTU variables, and 2 metadata variables (excluding the study-condition variable); the observations are the raw OTU counts. It might be due to the size of the dataset, but the function has still not finished running after 24 hours, so I wanted to check that I am on the right path.

Additionally, do I need to load extra packages to leverage parallel processing in the mbImpute() function?
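For what it's worth, the quoted data_fit2() source later in this thread shows parallel and ncores arguments backed by doParallel. Assuming mbImpute() forwards those arguments (please verify against ?mbImpute in your installed version), a parallel run might look like:

```r
library(mbImpute)
library(doParallel)  # backs the parallel code path in data_fit2()

# Hedged sketch: the argument names 'parallel' and 'ncores' are read
# off the package source, not the docs; confirm before relying on this.
imputed <- mbImpute(condition = condition, otu_tab = otu_tab,
                    parallel = TRUE, ncores = 4)
```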

Error when running mbImpute

Hello, mbImpute developers,

I recently read your mbImpute manuscript on bioRxiv. I really like the idea and would like to try it on my datasets. However, I kept getting the following error when running the imputation:

(screenshot of the error message attached)

This is the code I'm using:
a_imp <- mbImpute(condition = a_meta_num, otu_tab = otu_tab)

The otu_tab and a_meta_num I used are attached below in RDS format ('data.zip'). Each entry of otu_tab is defined as raw_counts divided by total_counts_in_the_sample, which is similar to the definition in your manuscript except that it isn't multiplied by 10^6.
data.zip

Could you please help me look into this? Thanks in advance!
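For context, the rescaling described above (counts divided by sample totals, optionally times 10^6 as in the manuscript) is just:

```r
# Toy count matrix: rows = samples, columns = OTUs.
counts <- matrix(c(10, 90,
                   25, 75), nrow = 2, byrow = TRUE,
                 dimnames = list(c("s1", "s2"), c("otuA", "otuB")))

rel_abund <- counts / rowSums(counts)  # what this issue's data uses
scaled    <- rel_abund * 1e6           # manuscript-style scaling

# Each sample's relative abundances sum to 1 before scaling.
stopifnot(all.equal(unname(rowSums(rel_abund)), c(1, 1)))
```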

Also, there's a small typo in the README. It should be
install.packages("glmnet")
instead of
install.pacakges("glmnet").

Best,
Zifan

function 'solve': requires numeric/complex matrix/vector arguments

library(mbImpute)
library(glmnet)
library(Matrix)

mbImpute(otu_tab = otu_tab)
[1] "Meta data information unavailable"
[1] "Phylogenentic information unavailable"
Error in h(simpleError(msg, call)) :
  error in evaluating the argument 'a' in selecting a method for function 'solve': requires numeric/complex matrix/vector arguments
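One common trigger for this solve() message is a non-numeric otu_tab, e.g. a data.frame with character or factor columns that coerces to a character matrix. A quick self-contained check (the bad input here is deliberate, for illustration):

```r
# A data.frame with even one character column coerces entirely to a
# character matrix, which later trips solve()'s "requires
# numeric/complex matrix/vector arguments" error.
otu_tab <- data.frame(otu1 = c("1", "2"), otu2 = c(3, 4))  # deliberately bad

m <- as.matrix(otu_tab)
stopifnot(is.character(m))               # coercion silently produced characters

m_num <- apply(otu_tab, 2, as.numeric)   # explicit numeric conversion
stopifnot(is.numeric(m_num))             # safe to pass to mbImpute()
```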

Also, Demo 3 in your README does not work:

otu_tab_T2D <- otu_tab[study_condition == "T2D",]
Error: object 'study_condition' not found

The otu_tab matrix doesn't include a study_condition column.
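A hedged guess at what Demo 3 intends (the object name meta_data is an assumption on my part; check data(package = "mbImpute") for the actual bundled objects):

```r
library(mbImpute)

# Assumption: study_condition lives in the package's bundled metadata,
# not in otu_tab itself; 'meta_data' is a guessed object name.
study_condition <- meta_data$study_condition
otu_tab_T2D <- otu_tab[study_condition == "T2D", ]
```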

sessionInfo:

R version 4.0.3 (2020-10-10)
Platform: x86_64-conda-linux-gnu (64-bit)
Running under: Ubuntu 18.04.5 LTS

Matrix products: default
BLAS/LAPACK: /ebio/abt3_projects/software/dev/miniconda3_dev/envs/mbimpute/lib/libopenblasp-r0.3.12.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] glmnet_4.1-1   Matrix_1.3-2   mbImpute_0.1.0

loaded via a namespace (and not attached):
 [1] compiler_4.0.3    tools_4.0.3       parallel_4.0.3    survival_3.2-10
 [5] splines_4.0.3     codetools_0.2-18  doParallel_1.0.16 grid_4.0.3
 [9] iterators_1.0.13  foreach_1.5.1     shape_1.4.5       lattice_0.20-41

How does your package work on a dataset other than the Karlsson et al. data?

Dear @ruochenj and mbImpute developers.

I would like to use your tool for zero imputation because zCompositions' cmultRepl() does not work due to the huge number of zeroes in my abundance table.

I followed your README and the package instructions (version 0.1.0), and I do not know how to calculate D, or whether I first need to calculate the edges value. In the mbImpute package help, some functions, such as edges or D, appear incompletely documented.

I tried the following:

data(GlobalPatterns)

gpotu<-as.data.frame(otu_table(GlobalPatterns))
gpotu2<-t(gpotu)

gpotu2[1:6,1:6] # This works

but when I try D[1:6,1:6], it automatically redirects to the bundled otu_tab matrix (Karlsson data). I tried some code, but errors appear, and I am not able to calculate the D value.

Could you give me some hints on how to proceed from the GlobalPatterns toy data to using your mbImpute?
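One plausible way to build D from GlobalPatterns is the cophenetic (patristic) distance matrix of its phylogenetic tree. Note this is an assumption about what mbImpute expects for D, not something confirmed by the docs:

```r
library(phyloseq)  # GlobalPatterns, otu_table(), phy_tree(), filter_taxa()
library(ape)       # cophenetic.phylo() for patristic distances

data(GlobalPatterns)

# Filter first: the full tree has ~19k tips, so the full taxon-by-taxon
# distance matrix would be enormous. Keep reasonably prevalent taxa.
gp <- filter_taxa(GlobalPatterns, function(x) sum(x > 0) > 20, prune = TRUE)

gpotu2 <- t(as(otu_table(gp), "matrix"))   # samples x taxa

# Assumption: D is the taxon-by-taxon phylogenetic distance matrix,
# reordered to match the OTU column order of the OTU table.
D <- cophenetic.phylo(phy_tree(gp))
D <- D[colnames(gpotu2), colnames(gpotu2)]
```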

Thanks in advance,

Magi

Error in if (max(wt) > 1e-05)

imputed_count_mat_list <- mbImpute(condition = study_condition,  otu_tab = otu_table, D = D)

Produces:

Error in if (max(wt) > 1e-05) {: missing value where TRUE/FALSE needed
Traceback:

1. mbImpute(condition = study_condition, otu_tab = otu_table, D = D)
2. data_fit2(otu_tab, meta_data, D, k = k)
3. lapply(1:ncol(y_sim), FUN = function(col_i) {
 .     y = y_sim[, col_i]
 .     return(gamma_norm_mix(y, X)$d)
 . })
4. FUN(X[[i]], ...)
5. gamma_norm_mix(y, X)
6. update_gmm_pars(x = y, wt = a_hat_t)
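A hedged guess at the cause: max(wt) evaluates to NA when the mixture-model weights become NaN, which can happen for taxa whose column is constant or all zero. Screening such columns out before the call may avoid the error (the filter below is my suggestion, not package code):

```r
# All-zero or constant taxa can give a degenerate mixture fit with NaN
# weights, after which "if (max(wt) > 1e-05)" fails with
# "missing value where TRUE/FALSE needed".
otu_mat <- matrix(c(0, 0, 0,    # t1: all-zero taxon
                    1, 5, 9,    # t2: fine
                    2, 2, 2),   # t3: constant taxon
                  nrow = 3,
                  dimnames = list(NULL, c("t1", "t2", "t3")))

keep <- apply(otu_mat, 2, function(x) length(unique(x)) > 1)
otu_mat_ok <- otu_mat[, keep, drop = FALSE]
stopifnot(identical(colnames(otu_mat_ok), "t2"))
```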

sessionInfo

R version 4.0.3 (2020-10-10)
Platform: x86_64-conda-linux-gnu (64-bit)
Running under: Ubuntu 18.04.5 LTS

Matrix products: default
BLAS/LAPACK: /ebio/abt3_projects/Anxiety_Twins_Metagenomes/envs/tidyverse-clr/lib/libopenblasp-r0.3.12.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] ape_5.4-1         glmnet_4.1-1      Matrix_1.3-2      mbImpute_0.1.0   
 [5] LeyLabRMisc_0.1.9 tidytable_0.5.9   data.table_1.14.0 ggplot2_3.3.3    
 [9] tidyr_1.1.3       dplyr_1.0.5      

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.6        pillar_1.5.1      compiler_4.0.3    iterators_1.0.13 
 [5] base64enc_0.1-3   tools_4.0.3       digest_0.6.27     uuid_0.1-4       
 [9] nlme_3.1-152      lattice_0.20-41   jsonlite_1.7.2    evaluate_0.14    
[13] lifecycle_1.0.0   tibble_3.1.0      gtable_0.3.0      pkgconfig_2.0.3  
[17] rlang_0.4.10      foreach_1.5.1     cli_2.3.1         IRdisplay_1.0    
[21] parallel_4.0.3    IRkernel_1.1.1    repr_1.1.3        withr_2.4.1      
[25] generics_0.1.0    vctrs_0.3.6       grid_4.0.3        tidyselect_1.1.0 
[29] glue_1.4.2        R6_2.5.0          fansi_0.4.2       survival_3.2-10  
[33] pbdZMQ_0.3-5      purrr_0.3.4       magrittr_2.0.1    splines_4.0.3    
[37] codetools_0.2-18  scales_1.1.1      ellipsis_0.3.1    htmltools_0.5.1.1
[41] assertthat_0.2.1  shape_1.4.5       colorspace_2.0-0  utf8_1.2.1       
[45] doParallel_1.0.16 munsell_0.5.0     crayon_1.4.1    

can not install mbImpute

Hi,

When I tried to install the package, the error below arose:

Error: .onLoad failed in loadNamespace() for 'pkgload', details:
call: loadNamespace(i, c(lib.loc, .libPaths()), versionCheck = vI[[i]])
error: there is no package called ‘backports’

The command I typed was:

devtools::install_github("ruochenj/mbImpute/mbImpute R package")

What should I do to solve it?
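The error message itself says the 'backports' package (a dependency of pkgload, which devtools loads) is missing, so installing it first should resolve this:

```r
# loadNamespace() reports "there is no package called 'backports'",
# so install that missing dependency, then retry the GitHub install.
install.packages("backports")
devtools::install_github("ruochenj/mbImpute/mbImpute R package")
```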

data_fit2 writes RDS files to current working directory

data_fit2() writes 2 RDS files to the current working directory:

  saveRDS(c1, file = "dat_sim_add_filter_coef.rds")
  saveRDS(y_imp, file = "imputed_mat_condition_as_covariate.rds")

The user has no control over where these files are written. This can cause problems when multiple jobs are run in the same directory at the same time.
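Until the function gains an output-path argument, giving each job its own working directory avoids the collisions. A sketch:

```r
# Workaround sketch: run each job in its own directory so the
# hard-coded RDS filenames from data_fit2() cannot collide.
job_dir <- file.path(tempdir(), paste0("mbimpute_job_", Sys.getpid()))
dir.create(job_dir, recursive = TRUE, showWarnings = FALSE)

old_wd <- setwd(job_dir)   # setwd() returns the previous directory
# ... run mbImpute() here; its RDS files land in job_dir ...
setwd(old_wd)              # restore the original working directory
```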

status updates and parallel vs single-core implementation

It would be helpful to have more information than just:

[1] 1
[1] 0

during an mbImpute() run (single core). It appears that this output is simply print(mat_num-1-mat_new) in the data_fit2() function. Using message() with a bit of text along with the value would allow for more informative output.
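For example, a sketch of the suggested change (not the package's actual code):

```r
# Replace the bare  print(mat_num - 1 - mat_new)  inside data_fit2()'s
# loop with a labeled message() so the progress output is self-explanatory.
mat_num <- 5
for (mat_new in 1:(mat_num - 1)) {
  message("mbImpute: ", mat_num - 1 - mat_new,
          " design-matrix block(s) remaining")
}
```

message() also writes to stderr rather than stdout, so progress text stays separate from printed results.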


While I was looking at data_fit2(), I noticed that you have different code for parallel == TRUE versus parallel == FALSE:

    if(!parallel){
      for(mat_new in 1:(mat_num-1)){
        print(mat_num-1-mat_new)
        design_mat_fit = sparseMatrix(i = 1, j =1, x = 0, dims = c(size, row_length))
        track = ((mat_new-1)*size+1):(mat_new*size)
        for(i in 1:size){
          if(is.vector(X)){
            result <- design_mat_row_gen2(y_sim, X[1:n], confidence_set[track[i]+1,1], confidence_set[track[i]+1,2], close_taxa)
            design_mat_fit[i,result$nz_idx] <- result$nz_val
          }
          else{
            result <- design_mat_row_gen2(y_sim, X[1:n,], confidence_set[track[i]+1,1], confidence_set[track[i]+1,2], close_taxa)
            design_mat_fit[i,result$nz_idx] <- result$nz_val
          }
        }
        mat_list[[mat_new]] = design_mat_fit
      }
    }else{
      no_cores <- max(ncores, detectCores() - 1)
      registerDoParallel(cores=no_cores)
      cl <- makeCluster(no_cores, "FORK")
      f <- function(mat_new){
        design_mat_fit = sparseMatrix(i = 1, j =1, x = 0, dims = c(size, row_length))
        track = ((mat_new-1)*size+1):(mat_new*size)
        for(i in 1:size){
          if(is.vector(X)){
            result <- design_mat_row_gen2(y_sim, X[1:n], confidence_set[track[i]+1,1], confidence_set[track[i]+1,2], close_taxa)
            design_mat_fit[i,result$nz_idx] <- result$nz_val
          }
          else{
            result <- design_mat_row_gen2(y_sim, X[1:n,], confidence_set[track[i]+1,1], confidence_set[track[i]+1,2], close_taxa)
            design_mat_fit[i,result$nz_idx] <- result$nz_val
          }
        }
        return(design_mat_fit)
      }
      mat_list <- parLapply(cl, 1:(mat_num-1), f)

Why is this separate code instead of using the same f() function for cores = 1 versus cores > 1? Do these different implementations generate different results?

improve scaling?

Is there any way to improve the scaling of mbImpute()? Even with 4 cores, the function is not scaling well on real metagenome data, at least with default parameters (see the attached scaling plot, mbImpute_scaling).

phyloseq integration

Given the challenge of formatting all of the data exactly as required for mbImpute():

  • condition
    • vector in the same order as the sample order used for the OTU table
  • OTU
    • "wide" matrix format, with column names as taxon IDs and row names as samples
  • metadata
    • data.table with rownames that match the sample IDs in the OTU table
  • D
    phylogeny distance matrix in which the row and column orders must match the OTU ID order in the OTU table

...it would be VERY helpful to provide support for phyloseq objects so that the user doesn't have to worry about all of this formatting (and there is less chance of getting the order wrong).
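A hedged sketch of such a wrapper, pulling the four inputs out of a phyloseq object. The phyloseq accessors are real; the mapping onto mbImpute's arguments is my assumption:

```r
library(phyloseq)
library(ape)  # cophenetic.phylo()

# Extract mbImpute's four inputs from a phyloseq object 'ps', with the
# condition taken from sample_data column 'condition_var'. Hypothetical
# helper, not part of either package.
mbimpute_inputs <- function(ps, condition_var) {
  otu <- as(otu_table(ps), "matrix")
  if (taxa_are_rows(ps)) otu <- t(otu)        # ensure samples x taxa
  meta <- data.frame(sample_data(ps))
  cond <- meta[[condition_var]]
  D <- cophenetic.phylo(phy_tree(ps))         # taxon distance matrix
  D <- D[colnames(otu), colnames(otu)]        # match OTU column order
  list(condition = cond,
       otu_tab   = otu,
       metadata  = meta[rownames(otu), , drop = FALSE],
       D         = D)
}
```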

subscript out of bounds

This occurs often for me when running mbImpute() (current version from github):

[1] "Meta data information unavailable"
[1] "Phylogenentic information unavailable"
[1] 441 567
[1] 0.1891608 5.0000000 5.0000000 5.0000000
[1] 441 567
[1] 0.1891608 0.1791422 5.0000000 5.0000000
[1] 441 567
[1] 0.1891608 0.1791422 0.1714688 5.0000000
[1] 441 567
[1] 0.1891608 0.1791422 0.1714688 0.1614214
Error in impute_set[i, 1]: subscript out of bounds
Traceback:

1. mbImpute(otu_tab = cnt)
2. data_fit2(otu_tab, metadata, D, k = k)
3. design_mat_row_gen2_imp(y_sim, X[1:n, ], impute_set[i, 1], impute_set[i, 
 .     2], close_taxa)
