kstreet13 / scry Goto Github PK

View Code? Open in Web Editor NEW

19.0 19.0 2.0 10.31 MB

Collection of analysis methods for small count data generated by rafalab members (the methods, not the data).

R 100.00%

scry's People

Contributors

Stargazers

Watchers

Forkers

fanyue322 eweine

scry's Issues

more than one "batch"

hi, I was wondering if there's any example to deal with "more than one batch" situation using the devianceFeatureSelection() function? for example I have "batch", "sex", "age" ... to be corrected.

Benchmark scry using in-memory vs `DelayedArray` objects

Tasks for @stephaniehicks:

Compare (1) in-memory and (2) DelayedArray objects for increasing sizes of observations and fixed number of features:

Immediate work

Using Poisson deviance residuals, plot time (y-axis) for increasing observations (x-axis)
- Track time for both scry::nullResiduals() and BiocSingular::runPCA() separately
- Try BiocSingular::runPCA() with ExactParam() and IrlbaParam()
Using Poisson deviance residuals, plot memory-usage (y-axis) for increasing observations (x-axis)
- Track memory-usage for both scry::nullResiduals() and BiocSingular::runPCA() separately
- Try BiocSingular::runPCA() with ExactParam() and IrlbaParam()

Near in the future work

@kstreet13 is currently working on implementing the code for the Poisson Pearson and Binomial Pearson cases. Once complete, @stephaniehicks will do the following:

Using Poisson Pearson residuals, plot time (y-axis) for increasing observations (x-axis)
Using Poisson Pearson residuals, plot memory-usage (y-axis) for increasing observations (x-axis)
Using Binomial Pearson residuals, plot time (y-axis) for increasing observations (x-axis)
Using Binomial Pearson residuals, plot memory-usage (y-axis) for increasing observations (x-axis)

Really in the future work

These last two cases, we will set aside for now.

Using Binomial deviance residuals, plot time (y-axis) for increasing observations (x-axis)
Using Binomial deviance residuals, plot memory-usage (y-axis) for increasing observations (x-axis)

Support for large matrices

Hello,

I wanted to use the devianceFeatureSelection on a fairly large dataset I am working on.
I'm giving the function a 584802 x 31394 sparse Matrix of class "dgRMatrix" named mat :

sce <- devianceFeatureSelection(mat)

And end up with the following error :

Error in h(simpleError(msg, call)): 
     error in evaluating the argument 'x' in selecting a method for function 'colSums' : unable to aggregate TsparseMatrix with 'i' and 'j' slots of length exceeding 2^31-1

From what I've seen online, this is a limitation of the sparse matrix class in R, as it cannot store more that 2^32-1 non-zero elements. Would you have any idea on how to handle this error in this case ?

Thank you,

`NaN` gene deviance scores

An issue I've been running into lately seems to stem from the fact that when running devianceFeatureSelection, genes with zero counts across all cells end up with a deviance of NaN. In practice, this isn't always a problem, since you probably wouldn't want to include those genes, anyway. However, the deviance scores are calculated separately for each batch and NaN + <a number> = NaN, so there is a cascading effect. One or two small-ish batches can end up removing a large percentage of the genes from downstream analysis (anywhere from 5-60% in the datasets I'm currently working with). Counterintuitively, this means I have to remove data in order to get more actual results.

@willtownes, would it make sense to replace these NaN values with zeros? It seems like there would be nothing mathematically incorrect about a degenerate poisson/NB model with mean zero (and zero variance). And that way, devianceFeatureSelection could return a value for every gene.

Experiences with the book, round 2

Similar to #7, but on a different dataset:

library(BiocFileCache)
bfc <- BiocFileCache(ask=FALSE)
qcdata <- bfcrpath(bfc, "https://github.com/LuyiTian/CellBench_data/blob/master/data/mRNAmix_qc.RData?raw=true")

env <- new.env()
load(qcdata, envir=env)
sce.8qc <- env$sce8_qc
sce.8qc$mix <- factor(sce.8qc$mix)

# Library size normalization and log-transformation.
library(scuttle)
library(scran)
sce.8qc <- logNormCounts(sce.8qc)
dec.8qc <- modelGeneVar(sce.8qc)
hvgs.8qc <- getTopHVGs(dec.8qc, n=1000)

library(scry)
sce.8qc2 <- GLMPCA(sce.8qc[hvgs.8qc,], fam="nb",
    L=20, sz=sizeFactors(sce.8qc))
## Error: Poor model fit (final deviance higher than initial deviance). Try modifying control parameters to improve optimization.

I guess I could fiddle with the control parameters to sneak it through, but that is going to be beyond the ability of most users, and this dataset is pretty tame. Seems like some more care could be taken with the step sizes to guarantee convergence, or at least reduction of the deviance.

Session information

R version 4.0.0 Patched (2020-05-01 r78341)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.5 LTS

Matrix products: default
BLAS:   /home/luna/Software/R/R-4-0-branch-dev/lib/libRblas.so
LAPACK: /home/luna/Software/R/R-4-0-branch-dev/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] scry_1.1.0                  scran_1.17.15              
 [3] scuttle_0.99.12             SingleCellExperiment_1.11.6
 [5] SummarizedExperiment_1.19.6 DelayedArray_0.15.7        
 [7] matrixStats_0.56.0          Matrix_1.2-18              
 [9] Biobase_2.49.0              GenomicRanges_1.41.6       
[11] GenomeInfoDb_1.25.10        IRanges_2.23.10            
[13] S4Vectors_0.27.12           BiocGenerics_0.35.4        
[15] BiocFileCache_1.13.1        dbplyr_1.4.4               

loaded via a namespace (and not attached):
 [1] statmod_1.4.34            tidyselect_1.1.0         
 [3] locfit_1.5-9.4            purrr_0.3.4              
 [5] BiocSingular_1.5.0        lattice_0.20-41          
 [7] vctrs_0.3.2               generics_0.0.2           
 [9] blob_1.2.1                rlang_0.4.7              
[11] pillar_1.4.6              glue_1.4.1               
[13] DBI_1.1.0                 BiocParallel_1.23.2      
[15] rappdirs_0.3.1            bit64_4.0.2              
[17] dqrng_0.2.1               GenomeInfoDbData_1.2.3   
[19] lifecycle_0.2.0           zlibbioc_1.35.0          
[21] rsvd_1.0.3                memoise_1.1.0            
[23] irlba_2.3.3               curl_4.3                 
[25] BiocNeighbors_1.7.0       Rcpp_1.0.5               
[27] edgeR_3.31.4              limma_3.45.10            
[29] XVector_0.29.3            bit_4.0.4                
[31] digest_0.6.25             bluster_0.99.1           
[33] dplyr_1.0.1               grid_4.0.0               
[35] tools_4.0.0               bitops_1.0-6             
[37] magrittr_1.5              RCurl_1.98-1.2           
[39] tibble_3.0.3              RSQLite_2.2.0            
[41] crayon_1.3.4              pkgconfig_2.0.3          
[43] MASS_7.3-51.6             ellipsis_0.3.1           
[45] DelayedMatrixStats_1.11.1 glmpca_0.2.0             
[47] assertthat_0.2.1          httr_1.4.2               
[49] R6_2.4.1                  igraph_1.2.5             
[51] compiler_4.0.0

Null residuals with HDF5/DelayedArray/Matrix matrices

To achieve this we need to:

create "ANY" method (thanks @stephaniehicks!)
generalize the code so that it works on all these matrix types (using blockApply()?)
write on disk the residuals as presumably they are the same size as the original data, so we cannot store them in memory (unless we perform feature selection first and the selected genes are much less than the total genes)
Related, should we make HDF5-ready also the devianceFeatureSelection function?
Unit tests to ensure we get the same results with all matrix types
Illustrate usage in the vignette with BiocSingular for PCA

Including covariates with nullResiduals()

Is it possible to include gene or cell covariates when using the nullResiduals approximation of glc-pca (i.e. X and Z)? I am working with a large dataset and am trying to speed up glm-pca by using nullResiduals instead. By cell covariate I mean a computed value, not a batch identity.

Thanks,
John

cache sizeFactors in SingleCellExperiment slot

decide the naming convention for sizeFactors (eg for binomial "colSums" and for Poisson "colMeans")
for method SingleCellExperiment: check sizeFactors slot for cached sz vector. If family is Poisson use the sizeFactors (only check they are all positive). If family is Binomial, check first 100 cells colSums to make sure they match the sizeFactors. If sizeFactors not present, compute them and store in the slot of returned sce object
for method SummarizedExperiment: compute the sizeFactors and return a SingleCellExperiment object
write automated tests to make sure this works as expected.

Build error from Bioconductor

@kstreet13 was notified of a build error by the Bioconductor folks, quoting from his email:

In the main vignette, there seems to be some problem loading the Zhengmix4eq dataset (though there are no issues with the DuoClustering2018 package, itself).

I'm unable to reproduce the error locally, but here's the build report: https://bioconductor.org/checkResults/devel/bioc-LATEST/scry/

It seems like the error only occurs with windows machines. I tried to reproduce the error (on a mac) but got a different one:

Error(s) in re-building vignettes:
    ...
  --- re-building ‘bigdata.Rmd’ using knitr
  Error: processing vignette 'bigdata.Rmd' failed with diagnostics:
  there is no package called ‘markdown’
  --- failed re-building ‘bigdata.Rmd’
  
  --- re-building ‘scry.Rmd’ using knitr
  Error: processing vignette 'scry.Rmd' failed with diagnostics:
  there is no package called ‘markdown’
  --- failed re-building ‘scry.Rmd’

The weird thing for my error is, the markdown package is installed, and I can build the vignettes with no issue if I build them directly, they failure only occurs when I try to do the build/check process.

Assay should default to `counts` in `SummarizedExperiment` objects

It seems to be the standard to store raw counts in the counts assay, so it would be a better default than the current (assay = 1).

Feature Selection
GLMPCA
Null residuals

Merge celltype branch to master?

@igrabski @kstreet13 should we merge your celltype branch into the main branch? Does it need any additional documentation or tests?

Cleaning up old branches

@kstreet13 are you OK with me deleting the HDF5 branch? I think it is fully integrated into both develop and master. Also, I am going to bring the "develop" and "r3" branches up to date with "master". I will merge the pull request from eweine separately.

No genes names when computing devianceFeatureSelection() with batch

Hello,
I am encountering a problem when setting the batch parameter in devianceFeatureSelection(), I do not have the gene names associated with the deviance.

When doing

sce = devianceFeatureSelection(sds, nkeep = 5000)
deviance.genes = names(rowData(sce)$binomial_deviance)

I do obtain the genes names ordered by deviance, but when I would like to correct batch effect by adding factor of orig.ident :

sce = devianceFeatureSelection(sds, nkeep = 5000, batch = factor(colData(sds)$orig.ident))
deviance.genes = names(rowData(sce)$binomial_deviance)

The vector is empty, because rowData(sce)$binomial_deviance isn't a named numeric vector, just a numeric vector. I guess it comes from the fact that in your code, after computing deviance on each factor level, you rowSums() the results, but without renaming afterwards.

edit :
I fixed it for me using the following modification of .compute_deviance_batch() :

    stopifnot(length(batch)==ncol(m))
    #each row is a gene, each column is deviance within a batch.
    #res<-matrix(0.0,nrow=nrow(m),ncol=nlevels(batch))
    tmp = list()
    for(i in seq_along(levels(batch))){
      b<-levels(batch)[i]
      #res[,i]<-.compute_deviance(m[,batch==b],fam=fam)
      tmp[[i]]<-.compute_deviance(m[,batch==b],fam=fam)
    }
    tmp = lapply(tmp, function(x) x[order(match(names(x), names(tmp[[1]])))])
    res = do.call(cbind,tmp)
    #deviance is additive across subsets of observations
    return(rowSums(res))

Stability issues and warnings in 10x Genomics Visium data

Hi, I have been using scry::devianceResiduals() for preprocessing and transformation in 10x Genomics Visium data, and have been running into some possible stability issues and warnings.

I am running the following code:

spe <- nullResiduals(spe, assay = "counts", fam = "binomial", type = "deviance")

and getting the following warning:

Warning messages:
1: In .local(x, ...) : NaNs produced
2: In sqrt(x@x) : NaNs produced

If this warning occurs, then I see that scry::nullResiduals() has returned zeros for some genes, e.g. see this screenshot showing values for the first 20 genes in one dataset:

Here is some complete reproducible code, using a SpatialExperiment object that I have saved on Dropbox here (30 MB download). This requires SpatialExperiment version >1.5.3, which can be installed from GitHub:

# install version >1.5.3 of SpatialExperiment
remotes::install_github("drighelli/SpatialExperiment")

library(SpatialExperiment)
library(scry)

# load data (.rds file from Dropbox link)
spe <- readRDS("scry_example.rds")

# calculate deviance residuals
spe <- nullResiduals(spe, assay = "counts", fam = "binomial", type = "deviance")

Have you seen this error before, and do you have any ideas what might be causing it? Thank you!

bigdata vignette not compiling

Documentation: SingleCellExperiment functionality

As a user with data in a SingleCellExperiment I can learn how to run deviance feature selection and null residuals via examples in the documentation as well as in the vignette.

deviance feature selection on sce in roxygen docs
null residuals on sce in roxygen docs
deviance feature selection on sce in vignette
null residuals on sce in vignette

Update the URL field on the bioc package page

The URL field for the bioconductor package does not point back to this repository: https://bioconductor.org/packages/release/bioc/html/scry.html

Also, we should maybe link to the bioconductor repository from the readme

Experiences from trying to get GLMPCA to run with the book

Consider the following stretch of code:

library(scRNAseq)
sce <- ZeiselBrainData()

library(scuttle)
sce <- logNormCounts(sce)

library(scran)
dec <- modelGeneVar(sce)
hvgs <- getTopHVGs(dec, n=1000)

library(scry)
out <- GLMPCA(sce[hvgs,], L=50, sz=sizeFactors(sce))

The most obvious issue is that I can't get it to work, GLMPCA conks out with:

Error in glmpca(Y = assay(object, assay), L = L, fam = fam, ctl = ctl,  : 
  Numerical divergence (deviance no longer finite), try increasing the penalty to improve stability of optimization.

Increasing the penalty to 100 causes it to run for a while. It's been 10 minutes at least; by comparison, running the usual approximate PCA on the equivalent log-expression matrix would take 1 second, maybe 2. I know it's doing more computational work than a vanilla PCA but this seems even slower than zinbwave. It's only a 3k x 1k matrix here, anything above 30 seconds would be too much.

There are also some other quality-of-life issues that you might consider addressing:

Add a subset= argument to GLMPCA() so that we don't have to row-subset the entire SCE in the input, especially when only one of the matrices is of interest.
Add a SingleCellExperiment method that defaults sz=sizeFactors(x).
What he said in #5.

Session information

R version 4.0.0 Patched (2020-05-01 r78341)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.4 LTS

Matrix products: default
BLAS:   /home/luna/Software/R/R-4-0-branch-dev/lib/libRblas.so
LAPACK: /home/luna/Software/R/R-4-0-branch-dev/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] scRNAseq_2.3.2              scuttle_0.99.8             
 [3] scran_1.17.0                SingleCellExperiment_1.11.2
 [5] SummarizedExperiment_1.19.4 DelayedArray_0.15.1        
 [7] matrixStats_0.56.0          Biobase_2.49.0             
 [9] GenomicRanges_1.41.1        GenomeInfoDb_1.25.0        
[11] IRanges_2.23.5              S4Vectors_0.27.7           
[13] BiocGenerics_0.35.2         scry_1.1.0                 

loaded via a namespace (and not attached):
 [1] httr_1.4.1                    edgeR_3.31.0                 
 [3] BiocSingular_1.5.0            bit64_0.9-7                  
 [5] AnnotationHub_2.21.0          DelayedMatrixStats_1.11.0    
 [7] shiny_1.4.0.2                 assertthat_0.2.1             
 [9] interactiveDisplayBase_1.27.2 statmod_1.4.34               
[11] BiocManager_1.30.10           BiocFileCache_1.13.0         
[13] dqrng_0.2.1                   blob_1.2.1                   
[15] GenomeInfoDbData_1.2.3        yaml_2.2.1                   
[17] BiocVersion_3.12.0            pillar_1.4.4                 
[19] RSQLite_2.2.0                 lattice_0.20-41              
[21] glue_1.4.1                    limma_3.45.0                 
[23] digest_0.6.25                 promises_1.1.0               
[25] XVector_0.29.1                htmltools_0.4.0              
[27] httpuv_1.5.2                  Matrix_1.2-18                
[29] pkgconfig_2.0.3               zlibbioc_1.35.0              
[31] purrr_0.3.4                   xtable_1.8-4                 
[33] glmpca_0.1.0                  later_1.0.0                  
[35] BiocParallel_1.23.0           tibble_3.0.1                 
[37] ellipsis_0.3.1                magrittr_1.5                 
[39] crayon_1.3.4                  mime_0.9                     
[41] memoise_1.1.0                 tools_4.0.0                  
[43] lifecycle_0.2.0               locfit_1.5-9.4               
[45] irlba_2.3.3                   AnnotationDbi_1.51.0         
[47] compiler_4.0.0                rsvd_1.0.3                   
[49] rlang_0.4.6                   grid_4.0.0                   
[51] RCurl_1.98-1.2                BiocNeighbors_1.7.0          
[53] rappdirs_0.3.1                igraph_1.2.5                 
[55] bitops_1.0-6                  ExperimentHub_1.15.0         
[57] DBI_1.1.0                     curl_4.3                     
[59] R6_2.4.1                      dplyr_0.8.5                  
[61] fastmap_1.0.1                 bit_1.1-15.2                 
[63] Rcpp_1.0.4.6                  vctrs_0.3.0                  
[65] dbplyr_1.4.3                  tidyselect_1.1.0

Deviance feature selection- increase speed

The current deviance feature selection implementation is slow. Improve the speed by vectorizing operations. The functions should support dense matrices, sparse Matrix objects, and dense DelayedArray/ HDF5Array objects.

Automated tests for batch effect adjustment

As a developer, I can verify the batch effect adjustment works as intended for both deviance feature selection and null residuals by running automated tests.

automated tests for batch effect adjustment in deviance feature selection
automated tests for batch effect adjustment in null residuals

Store GLM-PCA factors as LinearEmbeddingMatrix in SingleCellExperiment

As a user I can apply dimension reduction to a SingleCellExperiment object and access both the factors and the loadings from a LinearEmbeddingMatrix in the reducedDims slot.

Verify the factors matrix U is stored properly
Verify the loadings matrix V is stored properly
Verify there is not an implication in the SingleCellExperiment documentation that UV is supposed to be the mean of the data.
Make clear in the documentation that UV is not a predicted mean of the data.

Negative binomial null residuals

As a user with data in a SingleCellExperiment, I can compute either Pearson or deviance negative binomial null residuals so that I can use them in downstream analysis.

data in dense matrix in memory
data in DelayedArray on disk
(maybe) sparse Matrix data in memory

Possible implementation paths:

Wrap the sctransform CRAN package. This provides regularized negative binomial Pearson residuals. It doesn't appear to support disk-based data though.
Do our own implementation with glmGamPoi package as backend for fast nb regression.