I've been trying scAlign on large datasets and found that the current implementation can't handle them.
Here is the output from a run on the HumanPancreas dataset (14890 cells x 34363 genes):
(everything before this point runs fine)
[1] "Done random initialization"
2020-09-01 22:48:53.644928: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cublas64_100.dll
[1] "Step: 1 Loss: 119.6789"
[1] "Error during alignment, returning scAlign class."
<Rcpp::exception in py_call_impl(callable, dots$args, dots$keywords): ValueError: Cannot create a tensor proto whose content is larger than 2GB.
Detailed traceback:
File "d:\programdata\anaconda3\lib\site-packages\tensorflow_core\python\util\deprecation.py", line 507, in new_func
return func(*args, **kwargs)
File "d:\programdata\anaconda3\lib\site-packages\tensorflow_core\python\ops\nn_impl.py", line 616, in l2_normalize
return l2_normalize_v2(x, axis, epsilon, name)
File "d:\programdata\anaconda3\lib\site-packages\tensorflow_core\python\ops\nn_impl.py", line 642, in l2_normalize_v2
x = ops.convert_to_tensor(x, name="x")
File "d:\programdata\anaconda3\lib\site-packages\tensorflow_core\python\framework\ops.py", line 1184, in convert_to_tensor
return convert_to_tensor_v2(value, dtype, preferred_dtype, name)
File "d:\programdata\anaconda3\lib\site-packages\tensorflow_core\python\framework\ops.py", line 1242, in convert_to_tensor_v2
as_ref=False)
File "d:\programdata\anaconda3\lib\site-packages\tensorflow_core\python\framework\ops.py", line 1297, in internal_convert_to_tensor
ret = conversion_func(value, dtype=dtype, name=name, as_ref=as_ref)
File "d:\programdata\anaconda3\lib\site-packages\tensorflow_core\python\framework\tensor_conversion_registry.py", line 52, in _default_conversion_function
return constant_op.constant(value, dtype, name=name)
File "d:\programdata\anaconda3\lib\site-packages\tensorflow_core\python\framework\constant_op.py", line 227, in constant
allow_broadcast=True)
File "d:\programdata\anaconda3\lib\site-packages\tensorflow_core\python\framework\constant_op.py", line 265, in _constant_impl
allow_broadcast=allow_broadcast))
File "d:\programdata\anaconda3\lib\site-packages\tensorflow_core\python\framework\tensor_util.py", line 520, in make_tensor_proto
"Cannot create a tensor proto whose content is larger than 2GB.")
>
I tracked the problem down to the embedding-saving step, where the whole dataset is l2-normalized in a single call (in train_encoder.R and train_encoder_multi.R), and that call is what triggers the error above.
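If I'm counting correctly (and assuming the matrix is passed as float64, R's default numeric type), the full expression matrix is roughly 14890 x 34363 x 8 bytes ≈ 4.1 GB, so converting it to a single constant tensor inside l2_normalize blows past the 2 GB protobuf limit; even as float32 it would sit right around the 2 GB boundary.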
Could you please add per-batch l2-normalization when the embeddings are generated? Specifically, when the dataset is too large, could you split it into batches and l2-normalize each batch before feeding it through the encoder network? That would be very useful for my project, thank you!
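For illustration, here is a minimal sketch (in plain R) of the kind of batching I have in mind; the batch size, the encode_batch() helper, and all variable names are hypothetical placeholders, not scAlign's actual API:

```r
# Sketch: l2-normalize and encode the data in chunks so that only a small
# matrix is ever handed to TensorFlow at once.
# 'data' is a cells x genes matrix; 'encode_batch' stands in for a forward
# pass through the already-trained encoder (hypothetical helper).
l2_normalize_rows <- function(x, eps = 1e-12) {
  norms <- sqrt(rowSums(x^2))
  x / pmax(norms, eps)  # guard against all-zero cells
}

embed_in_batches <- function(data, encode_batch, batch_size = 1000) {
  n <- nrow(data)
  starts <- seq(1, n, by = batch_size)
  out <- lapply(starts, function(s) {
    idx   <- s:min(s + batch_size - 1, n)
    batch <- l2_normalize_rows(data[idx, , drop = FALSE])
    encode_batch(batch)  # only 'batch_size' cells go through the graph here
  })
  do.call(rbind, out)
}
```

The same chunking could presumably be applied inside train_encoder.R / train_encoder_multi.R by feeding each normalized batch through the existing encoder graph instead of converting the whole matrix to one tensor.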