caravagnalab / rcongas Goto Github PK

View Code? Open in Web Editor NEW

6.0 6.0 1.0 160.97 MB

rcongas

License: GNU General Public License v3.0

R 91.47% CSS 7.69% JavaScript 0.83%

rcongas's People

Contributors

Stargazers

Watchers

Forkers

mmfalco

rcongas's Issues

Filtering genes base on vst and absolute mean expression

The actual filtering function allows just a basic filter on the quantile of the mean maximum expression, it would definitely be better to filter based also on the variance

Models vignettes

See the new vignettes [9f169d5] 1_Models

We should rename this repo to Rcongas to make it consistent with the package, update link everywhere (yaml files, Rmd, etc), as well as in the forks at the Lab Git - and eventually change ownership and let it be in just one place.

DE_table inconsistent

Problem:

  # Load DE results - forward params
  DE_table = get_DE_table(x, chromosomes = chromosomes, ...)

  # Get gene locations
  x$data$gene_locations %>%
    filter(gene %in% DE_table$gene)

DE_table contains many genes, when I subset given gene locations I loose multiple genes. There is some problem with the way genes are stored across the various data structions

Missing unique

Missing unique after computation of get_segment_ids.

Error running fit_congas on multiome data

I was able to run segments_selector_congas() but fit_congas() failed with the following errors:
[easypar] 4/4 computations returned errors and will be removed.

Error in `pull()`:
! Can't extract columns that don't exist.
✖ Column `BIC` doesn't exist.

I get a similar error if I try to rerun with a different model selection criterion.

I was able to run successfully on the demo data using the same code and environment. I am using the most recent GitHub version of RCONGAS.

here's the full trace:

<error/vctrs_error_subscript_oob>
Error in `pull()`:
! Can't extract columns that don't exist.
✖ Column `BIC` doesn't exist.
---
Backtrace:
     ▆
  1. ├─Rcongas:::fit_congas(...)
  2. │ ├─base::order(model_selection_df %>% pull(!!model_selection))
  3. │ └─model_selection_df %>% pull(!!model_selection)
  4. ├─dplyr::pull(., !!model_selection)
  5. ├─dplyr:::pull.data.frame(., !!model_selection)
  6. │ └─tidyselect::vars_pull(names(.data), !!enquo(var))
  7. │   └─tidyselect:::pull_as_location2(...)
  8. │     ├─tidyselect:::with_subscript_errors(...)
  9. │     │ └─rlang::try_fetch(...)
 10. │     │   └─base::withCallingHandlers(...)
 11. │     └─vctrs::vec_as_location2(...)
 12. │       ├─vctrs:::result_get(...)
 13. │       └─vctrs:::vec_as_location2_result(...)
 14. │         ├─base::tryCatch(...)
 15. │         │ └─base (local) tryCatchList(expr, classes, parentenv, handlers)
 16. │         │   └─base (local) tryCatchOne(expr, names, parentenv, handlers[[1L]])
 17. │         │     └─base (local) doTryCatch(return(expr), name, parentenv, handler)
 18. │         └─vctrs::vec_as_location(i, n, names = names, arg = arg, call = call)
 19. └─vctrs (local) `<fn>`()
 20.   └─vctrs:::stop_subscript_oob(...)
 21.     └─vctrs:::stop_subscript(...)
 22.       └─rlang::abort(...)

plot_segment_density doe not work with mixturegaussiannorm

Error in function fit_congas

Really appreciate your contributions. I met the error when executing the function Rcongas:::fit_congas as follows:

── Fit with k = 4 and lambda = 0.5. 
── Fit with k = 2 and lambda = 0.5. 
── Fit with k = 1 and lambda = 0.5. 
── Fit with k = 3 and lambda = 0.5. 
No protocol specified
No protocol specified
No protocol specified
No protocol specified
sys:1: UserWarning: The given NumPy array is not writable, and PyTorch does not support non-writable tensors. This means writing to this tensor will result in undefined behavior. You may want to copy the array to protect its data or make it writable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at  ../torch/csrc/utils/tensor_numpy.cpp:178.)
sys:1: UserWarning: The given NumPy array is not writable, and PyTorch does not support non-writable tensors. This means writing to this tensor will result in undefined behavior. You may want to copy the array to protect its data or make it writable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at  ../torch/csrc/utils/tensor_numpy.cpp:178.)
sys:1: UserWarning: The given NumPy array is not writable, and PyTorch does not support non-writable tensors. This means writing to this tensor will result in undefined behavior. You may want to copy the array to protect its data or make it writable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at  ../torch/csrc/utils/tensor_numpy.cpp:178.)
sys:1: UserWarning: The given NumPy array is not writable, and PyTorch does not support non-writable tensors. This means writing to this tensor will result in undefined behavior. You may want to copy the array to protect its data or make it writable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at  ../torch/csrc/utils/tensor_numpy.cpp:178.)
Warning messages:
1: replacing previous import ‘Matrix::head’ by ‘utils::head’ when loading ‘Rcongas’ 
2: replacing previous import ‘cli::num_ansi_colors’ by ‘crayon::num_ansi_colors’ when loading ‘easypar’ 
Warning messages:
1: replacing previous import ‘Matrix::head’ by ‘utils::head’ when loading ‘Rcongas’ 
2: replacing previous import ‘cli::num_ansi_colors’ by ‘crayon::num_ansi_colors’ when loading ‘easypar’ 
Warning messages:
1: replacing previous import ‘Matrix::head’ by ‘utils::head’ when loading ‘Rcongas’ 
2: replacing previous import ‘cli::num_ansi_colors’ by ‘crayon::num_ansi_colors’ when loading ‘easypar’ 
Warning messages:
1: replacing previous import ‘Matrix::head’ by ‘utils::head’ when loading ‘Rcongas’ 
2: replacing previous import ‘cli::num_ansi_colors’ by ‘crayon::num_ansi_colors’ when loading ‘easypar’ 
[easypar] 4/4 computations returned errors and will be removed.



── (R)CONGAS+ fits completed in 24.4s. ──



Error in `pull()`:
! Can't extract columns that don't exist.
✖ Column `BIC` doesn't exist.
Run `rlang::last_trace()` to see where the error occurred.
> rlang::last_trace()
<error/vctrs_error_subscript_oob>
Error in `pull()`:
! Can't extract columns that don't exist.
✖ Column `BIC` doesn't exist.
---
Backtrace:
    ▆
 1. ├─Rcongas:::fit_congas(...)
 2. │ ├─base::order(model_selection_df %>% pull(!!model_selection))
 3. │ └─model_selection_df %>% pull(!!model_selection)
 4. ├─dplyr::pull(., !!model_selection)
 5. └─dplyr:::pull.data.frame(., !!model_selection)
Run rlang::last_trace(drop = FALSE) to see 17 hidden frames.

I don't know how to deal with this. Look forward to your reply.

problem of Rcongas installing

I encountered an issue when installing the package and received the following message:

E  checking DESCRIPTION meta-information ... 
   Authors@R field gives more than one person with maintainer role:
     Salvatore Milite <[email protected]> [aut, cre]
     Lucrezia Patruno <[email protected]> [aut, cre]
   
   See section 'The DESCRIPTION file' in the 'Writing R Extensions'
   manual.
Error: Failed to install 'Rcongas' from GitHub:
  ! System command 'Rcmd.exe' failed

Is there any method to handle this problem? ATM I could only install the package by modifying DESCRIPTION file to be only 1 maintainer.

Filter output clusters

We should ensure that when we filter clusters we do not just remove labels, but we actually re-compute the model score, latent variables and clustering assignments

Default fit object

It contains old cluster labels 1, 2, etc instead of c1, c2 etc.

`get_clones_ploidy`

Should be get_clusters_ploidy for consistency with all the other get_clusters* functions.

segments_selector_congas ERROR

Hi，
I meet the error as below when I run the following code:

code:

filt = segments_selector_congas(x,score="BIC")

Warning message:
replacing previous import ‘cli::num_ansi_colors’ by ‘crayon::num_ansi_colors’ when loading ‘easypar’
replacing previous import ‘cli::num_ansi_colors’ by ‘crayon::num_ansi_colors’ when loading ‘easypar’
Error in f():
! Argument 1 must be a data frame or a named atomic vector.
Run rlang::last_trace() to see where the error occurred.

Looks like this is the underlying error location:
runs <- easypar::run(FUN = one k, PARAMS = apply(k lambda pairs,1,list)，parallel = parallel, progress bar = FALSE)
Fit with k = 2 and lambda = 0.5.
[easypar] 1/1 computations returned errors and will be removed.
runs
NULL

This is the data I used:

x
── [ (R)CONGAS+ ] Bimodal P4 ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
── CNA segments (reference: GRCh38)
→ Input 22 CNA segments, mean ploidy 2.

      | | | |  |  |  | |  |  |  |  |  

     Ploidy:    0     1     2     3     4     5     *

── Modalities
→ RNA: 7727 cells with 11101 mapped genes, 8889659 non-zero values. Likelihood: Negative Binomial.
→ ATAC: not available
! Clusters: not available.

── LOG ──

2023-12-21 09:40:26 Created input object.
2023-12-21 09:47:01 Filtered s123egments: [0|50|50]
2023-12-21 09:54:13 Filtered s123egments: [0|50|50]

Thank you very much!

Refitting

If we refit an object we should cancel all analyses based on clusters (DE)

best_cluster: "Model not present, please run list_models() to see what is available"

Hi,
Thanks for the awesome package! I run the exmple_data, but meet the error as below:
inference <- best_cluster(input_rcongas, clusters = 1:5, model = "MixtureGaussian", param_list = list(theta_shape = theta_vals[1], theta_rate = 1),
steps = 500, lr = 0.01, MAP = TRUE, seed = 3, normalize_by_segs = FALSE, method = "ICL")
[1] "Model not present, please run list_models() to see what is available"
Error in eval(expr, envir, enclos): TypeError: 'str' object is not callable

Traceback:

best_cluster(input_rcongas2, clusters = 1:5, model = "mixturegaussian",
. param_list = list(theta_shape = theta_vals[1], theta_rate = 1),
. steps = 500, lr = 0.01, MAP = TRUE, seed = 3, normalize_by_segs = FALSE,
. method = "ICL")
lapply(clusters, function(x) run_inference(X = X, model = model,
. optim = optim, elbo = elbo, inf_type = inf_type, steps = steps,
. lr = lr, param_list = c(param_list, list(K = x, mixture = mixture[[x]])),
. MAP = MAP, seed = seed, ))
FUN(X[[i]], ...)
run_inference(X = X, model = model, optim = optim, elbo = elbo,
. inf_type = inf_type, steps = steps, lr = lr, param_list = c(param_list,
. list(K = x, mixture = mixture[[x]])), MAP = MAP, seed = seed,
. )
int$initialize_model(data_list)
py_call_impl(callable, dots$args, dots$keywords)

Interface to "multomics" model described in paper

I have multimodal 10x data (both RNA-seq and ATAC-seq measured from the same cells) and would like to use the "multomics" model (Fig. 2C in the PLoS Comp Bio paper) in which each cell is assigned to the same cluster in both modalities. However, the vignette for fitting a CONGAS+ model returns different cluster sizes between the two modalities, and I can't find a parameter to e.g. init() or fit_congas() to specify the multiomics model or that clusters should be consistent between modalities. Is there an interface available to the multiomics model?

Could not find function "fit_congas_single_run"

Thanks for the tool. I am stopped by the error Could not find function "fit_congas_single_run" when excuting one_k function, which finally leads to the error x/x computations returned errors and will be removed when running the function Rcongas:::fit_congas.

Each row in `x` should match at most 1 row in `y`.

After running CONGAS on 1 of my samples I was not able to obtain the clusters I think due to the following warning:

> input_rcongas <- init(data = count_matrix , cnv_data = cnv_table, description = s,
+                       reference_genome = "hg19", online = FALSE, correct_bins = T)
ℹ Extracting gene data from reference.
✔ Validated input(s): 904 cells, 12449 genes and 126 CNA segments.
ℹ Processing input counts.
ℹ Assembly Rcongas object.
! Mapping inconsistent for 687 genes out 10610, removing those from the raw data table.
ℹ Retaining 98.6 Mb long-format tibble data with 4274786 points, matrix was 86.7 Mb.
ℹ Object size in memory: 99.9 Mb.

── [ Rcongas ] X ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────

→ Data: 904 cells with 9878 genes, aggregated in 31 segments.

! Clusters: not available.
Warning message:
Each row in `x` should match at most 1 row in `y`.
ℹ Row 109 of `x` matches multiple rows.
ℹ If multiple matches are expected, specify `multiple = "all"` in the join call to silence this warning.

After digging in the code apparently some genes appear in multiple chromosomes which makes this step crash:

  gene_locations = gene_locations %>% left_join(locations, 
                                                by = "gene") %>% dplyr::select(gene, chr, from, to, segment_id)

For example in my case gene "U1" according to gene_locations object appears in :

 1 U1    chr1   16860381  16862144 chr1:14743501:22118500  
 2 U1    chr1   17197440  17200587 chr1:14743501:22118500  
 3 U1    chr3  125879126 125879288 chr1:14743501:22118500  
 4 U1    chr7  119646030 119646155 chr1:14743501:22118500

So maybe it is necessary applying an extra filtering step.

Documentation and tutorial

We need to systematically produce the Rmd files with roxygen to document the basic functions for the analysis. We need also to expose some tutorial on scRNAseq analysis with CONGAS.

Using Rcongas+ only in the scATAC modality

Good day!

I have read the documentation in https://caravagnalab.github.io/rcongas/index.html, and it provides valuable information on how to run Rcongas on multimodal data, but it is not clear to me whether it can work only using the scATAC modality. If so, what adjustments would I need to make to run it correctly?

Also, after running the pipeline, do the results also include the classification of a cell to be either diploid or aneuploid?

Thanks in advance!

CNA power law

Can we add a function to compute the CCF of a CNA segment of the raw data? Can we plot that as a histogram, is that a power law?

Error in `mutate()` when plotting example data

I liked the idea behind CONGAS, and I was about to try it on my dataset. But after installing I could not run the example datasets. The following error popped out:

> data(congas_example)

> print(congas_example)
── [ Rcongas ]  ──────────────────────────────────────────────────────────────────────────────────

→ Data: 503 cells with 8564 genes, aggregated in 70 segments.

ℹ Clusters: k = 2, model with AIC = 577427.35.

● Cluster c1, n = 379 [75.35% of total cells].
● Cluster c2, n = 124 [24.65% of total cells].

── CNA highlights (alpha = 0.05) 

Error in `mutate()`:
! Problem while computing `cluster = ifelse(...)`.
Caused by error in `str_starts()`:
! could not find function "str_starts"
Run `rlang::last_error()` to see where the error occurred.

> plot(congas_example)
Error in `mutate()`:
! Problem while computing `cluster = ifelse(...)`.
Caused by error in `str_starts()`:
! could not find function "str_starts"
Run `rlang::last_error()` to see where the error occurred.

> rlang::last_error()
<error/dplyr:::mutate_error>
Error in `mutate()`:
! Problem while computing `cluster = ifelse(...)`.
Caused by error in `str_starts()`:
! could not find function "str_starts"
---
Backtrace:
  1. base::plot(congas_example)
 22. base::ifelse(...)
Run `rlang::last_trace()` to see the full context.

It seems like a package version issue but I updated all the dependencies, so I am not sure if I should try some old versions for dplyr or stringr packages. Which versions did you use?

> sessionInfo()
R version 4.2.1 (2022-06-23)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.5 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/liblapack.so.3

locale:
 [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C               LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8     LC_MONETARY=en_GB.UTF-8   
 [6] LC_MESSAGES=en_GB.UTF-8    LC_PAPER=en_GB.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] Rcongas_0.1.0

loaded via a namespace (and not attached):
 [1] gtools_3.9.3          clisymbols_1.2.0      tidyselect_1.1.2.9000 akima_0.6-3.4         remotes_2.4.2         purrr_0.3.4           reshape2_1.4.4       
 [8] lattice_0.20-45       carData_3.0-5         testthat_3.1.4        colorspace_2.0-3      vctrs_0.4.1.9000      generics_0.1.3        usethis_2.1.5        
[15] utf8_1.2.2            rlang_1.0.5           pkgbuild_1.3.1        ggpubr_0.4.0          pillar_1.8.1          withr_2.5.0           glue_1.6.2           
[22] sp_1.5-0              sessioninfo_1.2.2     lifecycle_1.0.2.9000  plyr_1.8.7            stringr_1.4.1         munsell_0.5.0         ggsignif_0.6.3       
[29] gtable_0.3.1          devtools_2.4.3        memoise_2.0.1         callr_3.7.2           fastmap_1.1.0         ps_1.7.1              curl_4.3.2           
[36] fansi_1.0.3           broom_1.0.1           Rcpp_1.0.9            backports_1.4.1       scales_1.2.1          cachem_1.0.6          desc_1.4.2           
[43] pkgload_1.3.0         abind_1.4-5           fs_1.5.2              brio_1.1.3            ggplot2_3.3.6         stringi_1.7.8         rstatix_0.7.0        
[50] processx_3.7.0        dplyr_1.0.99.9000     rprojroot_2.0.3       grid_4.2.1            cli_3.4.0             tools_4.2.1           magrittr_2.0.3       
[57] patchwork_1.1.2       tibble_3.1.8          crayon_1.5.1          car_3.1-0             tidyr_1.2.1           pkgconfig_2.0.3       ellipsis_0.3.2       
[64] prettyunits_1.1.1     extraDistr_1.9.1      rstudioapi_0.14       R6_2.5.1              CNAqc_1.0.0           compiler_4.2.1

question about this

hi, i have problem with this, is there any files not included or it's just private. Ans if possible, may i have this to be able to run your R file to check the result?

rcongas/nobuild/OV_analysis.Rmd

Line 29 in 9aae602

devtools::load_all("../annealToolbox/")

Raw data visualisation

Can we plot segments and then find some way to overlay the input data? Both the aggregated ccounts, and the raw per-gene,

Add a function to suggest good value for the theta parameter

As the choice of the library size distribution prior is somewhat hard if you are not familiar with Gamma distributions. A function that adapts the parameter to the data in an approximate way can be useful to standardize the inference

Doublet detection in scATAC

Good day!

I want to apply your tool to a set of scATAC data and read in your latest pre-print that CONGAS+ can be used on this moldaity of data. However, I have not been able to find a vignette/tutorial/documentation on how to.

Could you please address me where I can find that information?

Also, does CONGAS+ give a final classification for each cell to be aneuploid or diploid?

I really appreciate any help you can provide.

plot_gene_stdev(congas_example, transpose = TRUE)

Add the stat as a parameter, and remake the plot

Bug remove_small_clusters

In https://github.com/Militeee/rcongas/blob/master/R/filter_clusters.R (function remove_small_clusters) get_best_model cannot be called as

bm <- Rcongas::get_best_model(x)

since it is not exported in https://github.com/Militeee/rcongas/blob/master/NAMESPACE.

Maybe this should be exported no?

clusters reproducibility and size

I have run CONGAS on several cancer samples and overall I am quite happy the sensitivity of the regions identified as subclonal and the subclones it finds. However, there are some samples where CONGAS finds clusters composed by only 1 cell (sorry but I can't share the data). Moreover, the identity of the cell clusters sometimes change (I thought that the seed parameter would solve this), and of course it also does when you change the number of clusters you test. But this seems to be influenced by the small subclusters found, hence the important of filtering small clusters.
For example this is the clustering result for 2 consecutive runs of the same dataset with the same parameters (inference <- best_cluster(input_rcongas, clusters = 1:5, model = "MixtureGaussian", param_list = list(theta_shape = theta_vals[1], theta_rate = 1),steps = 500, lr = 0.01, MAP = TRUE, seed = 3, normalize_by_segs = F, method = "ICL"))

    run2
run1  c1  c2  c3  c4
  c1 314  13   0   2
  c2 294   7   0   0
  c3   1 221   0   0
  c4   0   1  48   3

this is the comparison after running again changing clusters = 1:10:

    run3
run1  c1  c2  c3  c4  c5  c6  c7
  c1 326   3   0   0   0   0   0
  c2   0 299   1   0   0   1   0
  c3   1   1 197   0  23   0   0
  c4   0   0   0  51   0   0   1

    run3
run2  c1  c2  c3  c4  c5  c6  c7
  c1 311 296   1   0   0   1   0
  c2  14   7 197   1  23   0   0
  c3   0   0   0  47   0   0   1
  c4   2   0   0   3   0   0   0

As you can see the emergence of a small subcluster on the second run has made the clones c1 and c2 from run1 to collapse into 1 clone in run2. The validity of clones detected in run1 seems to be verified by the third run (finds sames clones) now that it has room to find this smalls clusters. As you can see in run3 we have c6 and 7 composed by 1 cell.
Is there anything you would suggest to increase reproducibility?
Would it make sense to add a parameter to limit the minimum size of the clusters?
Or maybe it can also be interesting to be able to filter "undesired" clusters a posteriori from the results and being able to use plot_gw_cna_profiles without having to run again the analysis.

Diagnostic plot with mean gene distributions by segments

As the number of false positive can be quite high in settings where we have different cell types (or developmental states) it is good practice to compare the gene expression distributions across different clusters for a given segment, to spot possible outliers

calculate_DE

I think calculate_DE does not work anymore.. is there a problem with the transpose input etc?

get_clusters(congas_example)

With the default object the above command files.

> get_clusters(congas_example)
 Error in `$<-.data.frame`(`*tmp*`, cell, value = c("AAACCTGTCGACAGCC-1",  : 
  replacement has 503 rows, data has 504

The error comes from

  z_nk$cell = names(best_model$parameters$assignement)

because there is 1 extra row in z_nk compared to assignments.

binsize missing

binsize line 48 of plot_single_segment is not defined.

Type-checking (sort of)

All functions that require to receive in input an object with clusters inside, should check that the input is correct.

For instance, all those in getters_clustering_results.R should do that. Same for the others.

Intelligible errors should be raised via stop.

get_input_raw_data(congas_example)

Returns a matrix with 0 rows and 32737 columns.