kstreet13 / scry Goto Github PK
View Code? Open in Web Editor NEWCollection of analysis methods for small count data generated by rafalab members (the methods, not the data).
Collection of analysis methods for small count data generated by rafalab members (the methods, not the data).
hi, I was wondering if there's any example to deal with "more than one batch" situation using the devianceFeatureSelection()
function? for example I have "batch", "sex", "age" ... to be corrected.
Tasks for @stephaniehicks:
Compare (1) in-memory and (2) DelayedArray
objects for increasing sizes of observations and fixed number of features:
scry::nullResiduals()
and BiocSingular::runPCA()
separatelyBiocSingular::runPCA()
with ExactParam()
and IrlbaParam()
scry::nullResiduals()
and BiocSingular::runPCA()
separatelyBiocSingular::runPCA()
with ExactParam()
and IrlbaParam()
@kstreet13 is currently working on implementing the code for the Poisson Pearson and Binomial Pearson cases. Once complete, @stephaniehicks will do the following:
These last two cases, we will set aside for now.
Hello,
I wanted to use the devianceFeatureSelection on a fairly large dataset I am working on.
I'm giving the function a 584802 x 31394 sparse Matrix of class "dgRMatrix"
named mat
:
sce <- devianceFeatureSelection(mat)
And end up with the following error :
Error in h(simpleError(msg, call)):
error in evaluating the argument 'x' in selecting a method for function 'colSums' : unable to aggregate TsparseMatrix with 'i' and 'j' slots of length exceeding 2^31-1
From what I've seen online, this is a limitation of the sparse matrix class in R, as it cannot store more that 2^32-1 non-zero elements. Would you have any idea on how to handle this error in this case ?
Thank you,
An issue I've been running into lately seems to stem from the fact that when running devianceFeatureSelection
, genes with zero counts across all cells end up with a deviance of NaN
. In practice, this isn't always a problem, since you probably wouldn't want to include those genes, anyway. However, the deviance scores are calculated separately for each batch and NaN + <a number> = NaN
, so there is a cascading effect. One or two small-ish batches can end up removing a large percentage of the genes from downstream analysis (anywhere from 5-60% in the datasets I'm currently working with). Counterintuitively, this means I have to remove data in order to get more actual results.
@willtownes, would it make sense to replace these NaN
values with zeros? It seems like there would be nothing mathematically incorrect about a degenerate poisson/NB model with mean zero (and zero variance). And that way, devianceFeatureSelection
could return a value for every gene.
Similar to #7, but on a different dataset:
library(BiocFileCache)
bfc <- BiocFileCache(ask=FALSE)
qcdata <- bfcrpath(bfc, "https://github.com/LuyiTian/CellBench_data/blob/master/data/mRNAmix_qc.RData?raw=true")
env <- new.env()
load(qcdata, envir=env)
sce.8qc <- env$sce8_qc
sce.8qc$mix <- factor(sce.8qc$mix)
# Library size normalization and log-transformation.
library(scuttle)
library(scran)
sce.8qc <- logNormCounts(sce.8qc)
dec.8qc <- modelGeneVar(sce.8qc)
hvgs.8qc <- getTopHVGs(dec.8qc, n=1000)
library(scry)
sce.8qc2 <- GLMPCA(sce.8qc[hvgs.8qc,], fam="nb",
L=20, sz=sizeFactors(sce.8qc))
## Error: Poor model fit (final deviance higher than initial deviance). Try modifying control parameters to improve optimization.
I guess I could fiddle with the control parameters to sneak it through, but that is going to be beyond the ability of most users, and this dataset is pretty tame. Seems like some more care could be taken with the step sizes to guarantee convergence, or at least reduction of the deviance.
R version 4.0.0 Patched (2020-05-01 r78341)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.5 LTS
Matrix products: default
BLAS: /home/luna/Software/R/R-4-0-branch-dev/lib/libRblas.so
LAPACK: /home/luna/Software/R/R-4-0-branch-dev/lib/libRlapack.so
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] parallel stats4 stats graphics grDevices utils datasets
[8] methods base
other attached packages:
[1] scry_1.1.0 scran_1.17.15
[3] scuttle_0.99.12 SingleCellExperiment_1.11.6
[5] SummarizedExperiment_1.19.6 DelayedArray_0.15.7
[7] matrixStats_0.56.0 Matrix_1.2-18
[9] Biobase_2.49.0 GenomicRanges_1.41.6
[11] GenomeInfoDb_1.25.10 IRanges_2.23.10
[13] S4Vectors_0.27.12 BiocGenerics_0.35.4
[15] BiocFileCache_1.13.1 dbplyr_1.4.4
loaded via a namespace (and not attached):
[1] statmod_1.4.34 tidyselect_1.1.0
[3] locfit_1.5-9.4 purrr_0.3.4
[5] BiocSingular_1.5.0 lattice_0.20-41
[7] vctrs_0.3.2 generics_0.0.2
[9] blob_1.2.1 rlang_0.4.7
[11] pillar_1.4.6 glue_1.4.1
[13] DBI_1.1.0 BiocParallel_1.23.2
[15] rappdirs_0.3.1 bit64_4.0.2
[17] dqrng_0.2.1 GenomeInfoDbData_1.2.3
[19] lifecycle_0.2.0 zlibbioc_1.35.0
[21] rsvd_1.0.3 memoise_1.1.0
[23] irlba_2.3.3 curl_4.3
[25] BiocNeighbors_1.7.0 Rcpp_1.0.5
[27] edgeR_3.31.4 limma_3.45.10
[29] XVector_0.29.3 bit_4.0.4
[31] digest_0.6.25 bluster_0.99.1
[33] dplyr_1.0.1 grid_4.0.0
[35] tools_4.0.0 bitops_1.0-6
[37] magrittr_1.5 RCurl_1.98-1.2
[39] tibble_3.0.3 RSQLite_2.2.0
[41] crayon_1.3.4 pkgconfig_2.0.3
[43] MASS_7.3-51.6 ellipsis_0.3.1
[45] DelayedMatrixStats_1.11.1 glmpca_0.2.0
[47] assertthat_0.2.1 httr_1.4.2
[49] R6_2.4.1 igraph_1.2.5
[51] compiler_4.0.0
To achieve this we need to:
blockApply()
?)devianceFeatureSelection
function?BiocSingular
for PCAIs it possible to include gene or cell covariates when using the nullResiduals approximation of glc-pca (i.e. X and Z)? I am working with a large dataset and am trying to speed up glm-pca by using nullResiduals instead. By cell covariate I mean a computed value, not a batch identity.
Thanks,
John
@kstreet13 was notified of a build error by the Bioconductor folks, quoting from his email:
In the main vignette, there seems to be some problem loading the Zhengmix4eq dataset (though there are no issues with the DuoClustering2018 package, itself).
I'm unable to reproduce the error locally, but here's the build report: https://bioconductor.org/checkResults/devel/bioc-LATEST/scry/
It seems like the error only occurs with windows machines. I tried to reproduce the error (on a mac) but got a different one:
Error(s) in re-building vignettes:
...
--- re-building ‘bigdata.Rmd’ using knitr
Error: processing vignette 'bigdata.Rmd' failed with diagnostics:
there is no package called ‘markdown’
--- failed re-building ‘bigdata.Rmd’
--- re-building ‘scry.Rmd’ using knitr
Error: processing vignette 'scry.Rmd' failed with diagnostics:
there is no package called ‘markdown’
--- failed re-building ‘scry.Rmd’
The weird thing for my error is, the markdown package is installed, and I can build the vignettes with no issue if I build them directly, they failure only occurs when I try to do the build/check process.
It seems to be the standard to store raw counts in the counts
assay, so it would be a better default than the current (assay = 1
).
@igrabski @kstreet13 should we merge your celltype branch into the main branch? Does it need any additional documentation or tests?
@kstreet13 are you OK with me deleting the HDF5 branch? I think it is fully integrated into both develop and master. Also, I am going to bring the "develop" and "r3" branches up to date with "master". I will merge the pull request from eweine separately.
Hello,
I am encountering a problem when setting the batch parameter in devianceFeatureSelection()
, I do not have the gene names associated with the deviance.
When doing
sce = devianceFeatureSelection(sds, nkeep = 5000)
deviance.genes = names(rowData(sce)$binomial_deviance)
I do obtain the genes names ordered by deviance, but when I would like to correct batch effect by adding factor of orig.ident :
sce = devianceFeatureSelection(sds, nkeep = 5000, batch = factor(colData(sds)$orig.ident))
deviance.genes = names(rowData(sce)$binomial_deviance)
The vector is empty, because rowData(sce)$binomial_deviance
isn't a named numeric vector, just a numeric vector. I guess it comes from the fact that in your code, after computing deviance on each factor level, you rowSums()
the results, but without renaming afterwards.
edit :
I fixed it for me using the following modification of .compute_deviance_batch()
:
stopifnot(length(batch)==ncol(m))
#each row is a gene, each column is deviance within a batch.
#res<-matrix(0.0,nrow=nrow(m),ncol=nlevels(batch))
tmp = list()
for(i in seq_along(levels(batch))){
b<-levels(batch)[i]
#res[,i]<-.compute_deviance(m[,batch==b],fam=fam)
tmp[[i]]<-.compute_deviance(m[,batch==b],fam=fam)
}
tmp = lapply(tmp, function(x) x[order(match(names(x), names(tmp[[1]])))])
res = do.call(cbind,tmp)
#deviance is additive across subsets of observations
return(rowSums(res))
Hi, I have been using scry::devianceResiduals()
for preprocessing and transformation in 10x Genomics Visium data, and have been running into some possible stability issues and warnings.
I am running the following code:
spe <- nullResiduals(spe, assay = "counts", fam = "binomial", type = "deviance")
and getting the following warning:
Warning messages:
1: In .local(x, ...) : NaNs produced
2: In sqrt(x@x) : NaNs produced
If this warning occurs, then I see that scry::nullResiduals()
has returned zeros for some genes, e.g. see this screenshot showing values for the first 20 genes in one dataset:
Here is some complete reproducible code, using a SpatialExperiment
object that I have saved on Dropbox here (30 MB download). This requires SpatialExperiment
version >1.5.3, which can be installed from GitHub:
# install version >1.5.3 of SpatialExperiment
remotes::install_github("drighelli/SpatialExperiment")
library(SpatialExperiment)
library(scry)
# load data (.rds file from Dropbox link)
spe <- readRDS("scry_example.rds")
# calculate deviance residuals
spe <- nullResiduals(spe, assay = "counts", fam = "binomial", type = "deviance")
Have you seen this error before, and do you have any ideas what might be causing it? Thank you!
As a user with data in a SingleCellExperiment I can learn how to run deviance feature selection and null residuals via examples in the documentation as well as in the vignette.
The URL field for the bioconductor package does not point back to this repository: https://bioconductor.org/packages/release/bioc/html/scry.html
Also, we should maybe link to the bioconductor repository from the readme
Consider the following stretch of code:
library(scRNAseq)
sce <- ZeiselBrainData()
library(scuttle)
sce <- logNormCounts(sce)
library(scran)
dec <- modelGeneVar(sce)
hvgs <- getTopHVGs(dec, n=1000)
library(scry)
out <- GLMPCA(sce[hvgs,], L=50, sz=sizeFactors(sce))
The most obvious issue is that I can't get it to work, GLMPCA
conks out with:
Error in glmpca(Y = assay(object, assay), L = L, fam = fam, ctl = ctl, :
Numerical divergence (deviance no longer finite), try increasing the penalty to improve stability of optimization.
Increasing the penalty
to 100 causes it to run for a while. It's been 10 minutes at least; by comparison, running the usual approximate PCA on the equivalent log-expression matrix would take 1 second, maybe 2. I know it's doing more computational work than a vanilla PCA but this seems even slower than zinbwave. It's only a 3k x 1k matrix here, anything above 30 seconds would be too much.
There are also some other quality-of-life issues that you might consider addressing:
subset=
argument to GLMPCA()
so that we don't have to row-subset the entire SCE in the input, especially when only one of the matrices is of interest.SingleCellExperiment
method that defaults sz=sizeFactors(x)
.R version 4.0.0 Patched (2020-05-01 r78341)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.4 LTS
Matrix products: default
BLAS: /home/luna/Software/R/R-4-0-branch-dev/lib/libRblas.so
LAPACK: /home/luna/Software/R/R-4-0-branch-dev/lib/libRlapack.so
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] parallel stats4 stats graphics grDevices utils datasets
[8] methods base
other attached packages:
[1] scRNAseq_2.3.2 scuttle_0.99.8
[3] scran_1.17.0 SingleCellExperiment_1.11.2
[5] SummarizedExperiment_1.19.4 DelayedArray_0.15.1
[7] matrixStats_0.56.0 Biobase_2.49.0
[9] GenomicRanges_1.41.1 GenomeInfoDb_1.25.0
[11] IRanges_2.23.5 S4Vectors_0.27.7
[13] BiocGenerics_0.35.2 scry_1.1.0
loaded via a namespace (and not attached):
[1] httr_1.4.1 edgeR_3.31.0
[3] BiocSingular_1.5.0 bit64_0.9-7
[5] AnnotationHub_2.21.0 DelayedMatrixStats_1.11.0
[7] shiny_1.4.0.2 assertthat_0.2.1
[9] interactiveDisplayBase_1.27.2 statmod_1.4.34
[11] BiocManager_1.30.10 BiocFileCache_1.13.0
[13] dqrng_0.2.1 blob_1.2.1
[15] GenomeInfoDbData_1.2.3 yaml_2.2.1
[17] BiocVersion_3.12.0 pillar_1.4.4
[19] RSQLite_2.2.0 lattice_0.20-41
[21] glue_1.4.1 limma_3.45.0
[23] digest_0.6.25 promises_1.1.0
[25] XVector_0.29.1 htmltools_0.4.0
[27] httpuv_1.5.2 Matrix_1.2-18
[29] pkgconfig_2.0.3 zlibbioc_1.35.0
[31] purrr_0.3.4 xtable_1.8-4
[33] glmpca_0.1.0 later_1.0.0
[35] BiocParallel_1.23.0 tibble_3.0.1
[37] ellipsis_0.3.1 magrittr_1.5
[39] crayon_1.3.4 mime_0.9
[41] memoise_1.1.0 tools_4.0.0
[43] lifecycle_0.2.0 locfit_1.5-9.4
[45] irlba_2.3.3 AnnotationDbi_1.51.0
[47] compiler_4.0.0 rsvd_1.0.3
[49] rlang_0.4.6 grid_4.0.0
[51] RCurl_1.98-1.2 BiocNeighbors_1.7.0
[53] rappdirs_0.3.1 igraph_1.2.5
[55] bitops_1.0-6 ExperimentHub_1.15.0
[57] DBI_1.1.0 curl_4.3
[59] R6_2.4.1 dplyr_0.8.5
[61] fastmap_1.0.1 bit_1.1-15.2
[63] Rcpp_1.0.4.6 vctrs_0.3.0
[65] dbplyr_1.4.3 tidyselect_1.1.0
The current deviance feature selection implementation is slow. Improve the speed by vectorizing operations. The functions should support dense matrices, sparse Matrix objects, and dense DelayedArray/ HDF5Array objects.
As a developer, I can verify the batch effect adjustment works as intended for both deviance feature selection and null residuals by running automated tests.
As a user I can apply dimension reduction to a SingleCellExperiment object and access both the factors and the loadings from a LinearEmbeddingMatrix in the reducedDims slot.
As a user with data in a SingleCellExperiment, I can compute either Pearson or deviance negative binomial null residuals so that I can use them in downstream analysis.
Possible implementation paths:
A declarative, efficient, and flexible JavaScript library for building user interfaces.
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google ❤️ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.