Code Monkey home page Code Monkey logo

mbimpute's Introduction

mbImpute: an accurate and robust imputation method for microbiome data

Ruochen Jiang, Wei Vivian Li, and Jingyi Jessica Li 2021-08-18

mbImpute

The goal of mbImpute is to impute false zero counts in microbiome sequencing data, i.e., a sample-by-taxon count matrix, by jointly borrowing information from similar samples, similar taxa and optional metadata including sample covariates and taxon phylogeny.

Installation

Please install the following R packages first.

install.packages("glmnet")
install.packages("devtools")
install.packages("Matrix")

Then you can use the following R command to directly install the mbImpute package from GitHub:

library(devtools)
install_github("ruochenj/mbImpute/mbImpute R package")

Example

We use the microbiome dataset from Karlsson et al (2013) as an example to demonstrate the use of mbImpute:

# Load the R packages
library(mbImpute)
library(glmnet)
#> Loading required package: Matrix
#> Loaded glmnet 4.1-2
library(Matrix)

# Display part of the OTU table
otu_tab[1:6, 1:6]
#>      s__Clostridium_sp_L2_50 s__Faecalibacterium_prausnitzii
#> S112                 3954419                         2602398
#> S118                       0                         1169731
#> S121                  550000                         3162050
#> S126                       0                          986563
#> S127                       0                         3940520
#> S131                       0                          502030
#>      s__Dialister_invisus s__Dorea_longicatena s__Ruminococcus_obeum
#> S112              1671620              1440718               1112728
#> S118              1412991               623343                190968
#> S121               827400               855614                274969
#> S126              2411487               163956                112493
#> S127                    0               554154                286616
#> S131                    0               159910                802850
#>      s__Coprococcus_comes
#> S112               947723
#> S118                58660
#> S121               281024
#> S126               148430
#> S127               210698
#> S131               450721

# Display part of the taxon phylogenetic distance matrix, whose rows and columns correspond to the columns in otu_tab 
D[1:6, 1:6]
#>      [,1] [,2] [,3] [,4] [,5] [,6]
#> [1,]    0    2    9   10   10    8
#> [2,]    2    0    9   10   10    8
#> [3,]    9    9    0    3    3    3
#> [4,]   10   10    3    0    2    4
#> [5,]   10   10    3    2    0    4
#> [6,]    8    8    3    4    4    0

# Display part of the (optional) meta data, i.e., the sample covariate matrix with rows representing samples and corresponding to the rows in otu_tab
meta_data[1:6, 1:6]
#>      study_condition      age number_reads triglycerides     hba1c       ldl
#> S112             IGT 1.293993    0.6475183     0.9926486 1.2575721 1.0022890
#> S118         control 2.587987    1.4075527     0.2357540 0.8803004 2.8883168
#> S121         control 1.293993    0.9558486     0.1613054 1.2575721 0.9915117
#> S126         control 1.293993    1.2244933     0.7072621 1.3833293 2.4895566
#> S127         control 2.587987    0.9428587     0.9057918 1.2575721 3.2116358
#> S131             IGT 1.293993    1.0145444     1.8239918 1.5090865 1.7351455
study_condition = meta_data[,1]
meta_data <- as.data.frame(unclass(meta_data))
meta_data <- meta_data[,-1]

# Demo 1: run mbImpute on a single core
imputed_count_mat_list <- mbImpute(condition = study_condition, otu_tab = otu_tab, metadata = meta_data, D = D)
#> [1] 1
#> [1] 0
#> [1] 11230 23280
#> [1] 0.7156728 5.0000000 5.0000000 5.0000000
#> [1] 11230 23280
#> [1] 0.7156728 0.5568270 5.0000000 5.0000000
#> [1] 11230 23280
#> [1] 0.7156728 0.5568270 0.5841656 5.0000000
#> [1] 11230 23280
#> [1] 0.7156728 0.5568270 0.5841656 0.6362045
#> [1] 10000
#> [1] "Finished."

# A glance at the imputed result, which includes three matrices
## The first is an imputed matrix on the log10 scale; we recommend users to perform downstream analysis based on normal distributions on this data, whose values in each taxon (column) follows an approximate normal distribution (see our paper for detail)
imputed_count_mat_list$imp_count_mat_lognorm[1:3, 1:2]
#>      s__Clostridium_sp_L2_50 s__Faecalibacterium_prausnitzii
#> S112                5.335660                        5.153952
#> S118                4.000822                        4.721291
#> S121                4.409327                        5.168918
## The second is an imputed normalized count matrix, where each sample (row) is set to have the same total of a million reads
imputed_count_mat_list$imp_count_mat_norm[1:3, 1:2]
#>      s__Clostridium_sp_L2_50 s__Faecalibacterium_prausnitzii
#> S112                  216599                          142544
#> S118                   10017                           52635
#> S121                   25663                          147541
## The third is an imputed count matrix on the original scale, with each sample (row) having the read count same as that in the original otu_tab
imputed_count_mat_list$imp_count_mat_origlibsize[1:3, 1:2]
#>      s__Clostridium_sp_L2_50 s__Faecalibacterium_prausnitzii
#> S112                 3954404                         2602397
#> S118                  222608                         1169709
#> S121                  549997                         3162029

# Demo 2: if you have multiple (e.g., 4) cores and would like to do parallel computing
imputed_count_mat_list <- mbImpute(condition = meta_data$study_condition, otu_tab = otu_tab, metadata = meta_data, D = D, parallel = TRUE, ncores = 4)
#> [1] 1
#> [1] 0
#> [1] 11230 23280
#> [1] 0.7156728 5.0000000 5.0000000 5.0000000
#> [1] 11230 23280
#> [1] 0.7156728 0.5568270 5.0000000 5.0000000
#> [1] 11230 23280
#> [1] 0.7156728 0.5568270 0.5841656 5.0000000
#> [1] 11230 23280
#> [1] 0.7156728 0.5568270 0.5841656 0.6362045
#> [1] 10000
#> [1] "Finished."

# A glance at the imputed result, which includes three matrices
## The first is an imputed matrix on the log10 scale; we recommend users to perform downstream analysis based on normal distributions on this data, whose values in each taxon (column) follows an approximate normal distribution (see our paper for detail)
imputed_count_mat_list$imp_count_mat_lognorm[1:3, 1:2]
#>      s__Clostridium_sp_L2_50 s__Faecalibacterium_prausnitzii
#> S112                5.335660                        5.153952
#> S118                4.000822                        4.721291
#> S121                4.409327                        5.168918
## The second is an imputed normalized count matrix, where each sample (row) is set to have the same total of a million reads
imputed_count_mat_list$imp_count_mat_norm[1:3, 1:2]
#>      s__Clostridium_sp_L2_50 s__Faecalibacterium_prausnitzii
#> S112                  216599                          142544
#> S118                   10017                           52635
#> S121                   25663                          147541
## The third is an imputed count matrix on the original scale, with each sample (row) having the read count same as that in the original otu_tab
imputed_count_mat_list$imp_count_mat_origlibsize[1:3, 1:2]
#>      s__Clostridium_sp_L2_50 s__Faecalibacterium_prausnitzii
#> S112                 3954404                         2602397
#> S118                  222608                         1169709
#> S121                  549997                         3162029

# Demo 3: if you do not have meta data or phylogenetic information, and the samples belong to one condition
otu_tab_T2D <- otu_tab[study_condition == "T2D",]
imputed_count_matrix_list <- mbImpute(otu_tab = otu_tab_T2D)
#> [1] "Meta data information unavailable"
#> [1] "Phylogenentic information unavailable"
#> [1] 4204 3707
#> [1] 1.055828 5.000000 5.000000 5.000000
#> [1] 4204 3707
#> [1] 1.055828 1.053811 5.000000 5.000000
#> [1] 4204 3707
#> [1] 1.055828 1.053811 0.962270 5.000000
#> [1] 4204 3707
#> [1] 1.0558281 1.0538114 0.9622700 0.7518937
#> [1] "Finished."

# A glance at the imputed result, which includes three matrices
## The first is an imputed matrix on the log10 scale; we recommend users to perform downstream analysis based on normal distributions on this data, whose values in each taxon (column) follows an approximate normal distribution (see our paper for detail)
imputed_count_mat_list$imp_count_mat_lognorm[1:3, 1:2]
#>      s__Clostridium_sp_L2_50 s__Faecalibacterium_prausnitzii
#> S112                5.335660                        5.153952
#> S118                4.000822                        4.721291
#> S121                4.409327                        5.168918
## The second is an imputed normalized count matrix, where each sample (row) is set to have the same total of a million reads
imputed_count_mat_list$imp_count_mat_norm[1:3, 1:2]
#>      s__Clostridium_sp_L2_50 s__Faecalibacterium_prausnitzii
#> S112                  216599                          142544
#> S118                   10017                           52635
#> S121                   25663                          147541
## The third is an imputed count matrix on the original scale, with each sample (row) having the read count same as that in the original otu_tab
imputed_count_mat_list$imp_count_mat_origlibsize[1:3, 1:2]
#>      s__Clostridium_sp_L2_50 s__Faecalibacterium_prausnitzii
#> S112                 3954404                         2602397
#> S118                  222608                         1169709
#> S121                  549997                         3162029

Reference:

Karlsson, F. H., Tremaroli, V., Nookaew, I., Bergström, G., Behre, C. J., Fagerberg, B., … & Bäckhed, F. (2013). Gut metagenome in European women with normal, impaired and diabetic glucose control. Nature, 498(7452), 99-103.

mbimpute's People

Contributors

ruochenj avatar

Stargazers

Shuchen avatar  avatar Pierre-Marie Allard avatar Kris Sankaran avatar Ashish Damania avatar Cassandra Wattenburger avatar  avatar Silvio Waschina avatar Zifan Zhu avatar

mbimpute's Issues

improve scaling?

Is there any way to improve the scaling of mbImpute()? Even with 4 cores, the function is not scaling well on real metagenome data -- at least with default parameters: mbImpute_scaling

Error in if (max(wt) > 1e-05)

imputed_count_mat_list <- mbImpute(condition = study_condition,  otu_tab = otu_table, D = D)

Produces:

Error in if (max(wt) > 1e-05) {: missing value where TRUE/FALSE needed
Traceback:

1. mbImpute(condition = study_condition, otu_tab = otu_table, D = D)
2. data_fit2(otu_tab, meta_data, D, k = k)
3. lapply(1:ncol(y_sim), FUN = function(col_i) {
 .     y = y_sim[, col_i]
 .     return(gamma_norm_mix(y, X)$d)
 . })
4. FUN(X[[i]], ...)
5. gamma_norm_mix(y, X)
6. update_gmm_pars(x = y, wt = a_hat_t)

sessionInfo

R version 4.0.3 (2020-10-10)
Platform: x86_64-conda-linux-gnu (64-bit)
Running under: Ubuntu 18.04.5 LTS

Matrix products: default
BLAS/LAPACK: /ebio/abt3_projects/Anxiety_Twins_Metagenomes/envs/tidyverse-clr/lib/libopenblasp-r0.3.12.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] ape_5.4-1         glmnet_4.1-1      Matrix_1.3-2      mbImpute_0.1.0   
 [5] LeyLabRMisc_0.1.9 tidytable_0.5.9   data.table_1.14.0 ggplot2_3.3.3    
 [9] tidyr_1.1.3       dplyr_1.0.5      

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.6        pillar_1.5.1      compiler_4.0.3    iterators_1.0.13 
 [5] base64enc_0.1-3   tools_4.0.3       digest_0.6.27     uuid_0.1-4       
 [9] nlme_3.1-152      lattice_0.20-41   jsonlite_1.7.2    evaluate_0.14    
[13] lifecycle_1.0.0   tibble_3.1.0      gtable_0.3.0      pkgconfig_2.0.3  
[17] rlang_0.4.10      foreach_1.5.1     cli_2.3.1         IRdisplay_1.0    
[21] parallel_4.0.3    IRkernel_1.1.1    repr_1.1.3        withr_2.4.1      
[25] generics_0.1.0    vctrs_0.3.6       grid_4.0.3        tidyselect_1.1.0 
[29] glue_1.4.2        R6_2.5.0          fansi_0.4.2       survival_3.2-10  
[33] pbdZMQ_0.3-5      purrr_0.3.4       magrittr_2.0.1    splines_4.0.3    
[37] codetools_0.2-18  scales_1.1.1      ellipsis_0.3.1    htmltools_0.5.1.1
[41] assertthat_0.2.1  shape_1.4.5       colorspace_2.0-0  utf8_1.2.1       
[45] doParallel_1.0.16 munsell_0.5.0     crayon_1.4.1    

D matrix: row/column names not used

It appears that mbImpute() doesn't use row/column names for the D matrix. So, the order of the matrix must correspond to the otu_tab table, but this isn't in the function docs. Even if it was in the docs, relying on the user to make sure that the order is correct is dangerous.

data_fit2 writes RDS files to current working directory

data_fit2() writes 2 RDS files to the current working directory:

  saveRDS(c1, file = "dat_sim_add_filter_coef.rds")
  saveRDS(y_imp, file = "imputed_mat_condition_as_covariate.rds")

The use has now control on where these files are written. This can cause problems in cases where multiple jobs are run at the same time.

phyloseq integration

Given the challenge of formatting all of the data exactly as required for mbImpute():

  • condition
    • vector in the same order as the sample order used for the OTU table
  • OTU
    • "wide" matrix format, with column names as taxon IDs and row names as samples
  • metadata
    • data.table with rownames that match the sample IDs in the OTU table
  • D
    • phylogeny distance matrix in which the row and column orders much match the OTU ID order in the OTU table

...it would be VERY helpful to provide support for phyloseq objects so that the user doesn't have to worry about all of this formatting (and less chance of getting the order wrong)

Error when running mbImpute

Hello, mbImpute developers,

I read your mbImpute manuscript on bioRxiv recently. I really liked the idea and would like to give it a try on my datasets. However I kept getting the following error when running the imputation:

image

This is the code I'm using:
a_imp <- mbImpute(condition = a_meta_num, otu_tab = otu_tab).

The otu_tab and a_meta_num I used is attached below in RDS format ('data.zip'). The entry of the otu_tab is defined as raw_counts divided by total_counts_in_the_sample, which is similar to the one presented in your manuscript except that it doesn't multiply by 10^6.
data.zip

Could you please help me look into this? Thanks in advance!

Also, there's a small typo in README. It should be
install.packages("glmnet")
instead of
install.pacakges("glmnet").

Best,
Zifan

Normalisation & Parallel Computing for mbImpute

Greetings mbImpute Team!

I have been trying to utilise the mbImpute method for a microbiome dataset with OTU counts provided for various samples. Pardon my ignorance, but since this is a new domain I am working in, I was wondering if I need to apply normalisation procedures before utilizing mbImpute( ) or if setting the 'unnormalised' to TRUE would be sufficient (default normalisation procedures outlined in your paper)?

The reason I am asking you this is that the dataset contains over 2,000 samples with over 100 OTU variables, and 2 metadata variables (excluding the study-condition variable). The observations are the raw OTU counts. It might be due to the size of the dataset, however, the function has still not finished running after 24 hours - so I wanted to check if I am taking the correct path.

Additionally, do I need to initiate extra packages to leverage parallel processing in the mbImpute( ) function?

can not install mbImpute

Hi,

When I tried to install the package, but the error arose the below:

Error: .onLoad failed in loadNamespace() for 'pkgload', details:
call: loadNamespace(i, c(lib.loc, .libPaths()), versionCheck = vI[[i]])
error: there is no package called ‘backports’

The command I typed was the below:

devtools::install_github("ruochenj/mbImpute/mbImpute R package")

How shoud I do to solve it?

function 'solve': requires numeric/complex matrix/vector arguments

library(mbImpute)
library(glmnet)
library(Matrix)

mbImpute(otu_tab = otu_tab)
[1] "Meta data information unavailable"
[1] "Phylogenentic information unavailable"
Error in h(simpleError(msg, call)) :
  error in evaluating the argument 'a' in selecting a method for function 'solve': requires numeric/complex matrix/vector arguments

Also, Demo 3 in your README example does not work:

otu_tab_T2D <- otu_tab[study_condition == "T2D",]
Error: object 'study_condition' not found

The otu_tab matrix doesn't include a study_condition column.

sessionInfo:

R version 4.0.3 (2020-10-10)
Platform: x86_64-conda-linux-gnu (64-bit)
Running under: Ubuntu 18.04.5 LTS

Matrix products: default
BLAS/LAPACK: /ebio/abt3_projects/software/dev/miniconda3_dev/envs/mbimpute/lib/libopenblasp-r0.3.12.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] glmnet_4.1-1   Matrix_1.3-2   mbImpute_0.1.0

loaded via a namespace (and not attached):
 [1] compiler_4.0.3    tools_4.0.3       parallel_4.0.3    survival_3.2-10
 [5] splines_4.0.3     codetools_0.2-18  doParallel_1.0.16 grid_4.0.3
 [9] iterators_1.0.13  foreach_1.5.1     shape_1.4.5       lattice_0.20-41

How it works your package in a dataset different from Karlsson et al. data?

Dear @ruochenj and mbImpute developers.

I would like to use your tool for zero imputation because zComposition cmultrepl() does not work due to the huge of zeroes in my abundance table.

I followed your README and the package instructions version 0.1.0 and I do not know how to calculate D or previously I need to calculate the edges value. In th mbImpute package help appears some functions incomplete...such as function edges or D...

I tried the following:

data(GlobalPatterns)

gpotu<-as.data.frame(otu_table(GlobalPatterns))
gpotu2<-t(gpotu)

gpotu2[1:6,1:6] # This works

but when I try D[1:6,1:6] automatically redirects to otu_tab matrix (Karlsson data), and I tried some code but appears error and I am not able to calculate the D value...

Could you facilitate me some hints or how can I proceed from GlobalPatterns toy data to use your mbImpute.

Thanks on advance,

Magi

status updates and parallel vs single-core implementation

It would be helpful to have more information than just:

[1] 1
[1] 0

during an mbImpute() run (single core). It appears that this output is simply print(mat_num-1-mat_new) in the data_fit2() function. Using message() with a bit of text along with the value allow for more informative output.


While I was looking at data_fit2(), I noticed that you have different code for parallel == TRUE versus parallel == FALSE:

    if(!parallel){
      for(mat_new in 1:(mat_num-1)){
        print(mat_num-1-mat_new)
        design_mat_fit = sparseMatrix(i = 1, j =1, x = 0, dims = c(size, row_length))
        track = ((mat_new-1)*size+1):(mat_new*size)
        for(i in 1:size){
          if(is.vector(X)){
            result <- design_mat_row_gen2(y_sim, X[1:n], confidence_set[track[i]+1,1], confidence_set[track[i]+1,2], close_taxa)
            design_mat_fit[i,result$nz_idx] <- result$nz_val
          }
          else{
            result <- design_mat_row_gen2(y_sim, X[1:n,], confidence_set[track[i]+1,1], confidence_set[track[i]+1,2], close_taxa)
            design_mat_fit[i,result$nz_idx] <- result$nz_val
          }
        }
        mat_list[[mat_new]] = design_mat_fit
      }
    }else{
      no_cores <- max(ncores, detectCores() - 1)
      registerDoParallel(cores=no_cores)
      cl <- makeCluster(no_cores, "FORK")
      f <- function(mat_new){
        design_mat_fit = sparseMatrix(i = 1, j =1, x = 0, dims = c(size, row_length))
        track = ((mat_new-1)*size+1):(mat_new*size)
        for(i in 1:size){
          if(is.vector(X)){
            result <- design_mat_row_gen2(y_sim, X[1:n], confidence_set[track[i]+1,1], confidence_set[track[i]+1,2], close_taxa)
            design_mat_fit[i,result$nz_idx] <- result$nz_val
          }
          else{
            result <- design_mat_row_gen2(y_sim, X[1:n,], confidence_set[track[i]+1,1], confidence_set[track[i]+1,2], close_taxa)
            design_mat_fit[i,result$nz_idx] <- result$nz_val
          }
        }
        return(design_mat_fit)
      }
      mat_list <- parLapply(cl, 1:(mat_num-1), f)

Why is this separate code instead of using the same f() function for cores=1 versus cores=>1? Do these different implementations generate different results?

subscript out of bounds

This occurs often for me when running mbImpute() (current version from github):

[1] "Meta data information unavailable"
[1] "Phylogenentic information unavailable"
[1] 441 567
[1] 0.1891608 5.0000000 5.0000000 5.0000000
[1] 441 567
[1] 0.1891608 0.1791422 5.0000000 5.0000000
[1] 441 567
[1] 0.1891608 0.1791422 0.1714688 5.0000000
[1] 441 567
[1] 0.1891608 0.1791422 0.1714688 0.1614214
Error in impute_set[i, 1]: subscript out of bounds
Traceback:

1. mbImpute(otu_tab = cnt)
2. data_fit2(otu_tab, metadata, D, k = k)
3. design_mat_row_gen2_imp(y_sim, X[1:n, ], impute_set[i, 1], impute_set[i, 
 .     2], close_taxa)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.