cole-trapnell-lab / cicero-release Goto Github PK

View Code? Open in Web Editor NEW

54.0 10.0 14.0 21.93 MB

Home Page: https://cole-trapnell-lab.github.io/cicero-release/

License: MIT License

R 100.00%

cicero-release's Introduction

Cicero

Predicting the cis-regulatory landscape

Please see our website for information on installing and using Cicero

cicero-release's People

Contributors

Stargazers

Watchers

Forkers

hypercompetent shamoni rmtsoa tohein mengchengyao jindalk vivekbhr weiiioyo janzulene wook2014 helianfeixing leahykim puriney sukses24

cicero-release's Issues

How to prepare input data for atac-seq trajectory as "cicero_data"?

input_cds <- make_atac_cds(cicero_data, binarize = TRUE)

I noticed that the example data "cicero_data", has three columns.
Peak Cell Count
140 chr18_30209631_30210783 AGCGATAGGCGCTATGGTGGAATTCAGTCAGGACGT 4
150 chr18_45820294_45821666 AGCGATAGGTAGCAGCTATGGTAATCCTAGGCGAAG 2
185 chr18_32820116_32820994 TAATGCGCCGCTTATCGTTGGCAGCTCGGTACTGAC 2
266 chr18_41888433_41890138 AGCGATAGGCGCTATGGTGGAATTCAGTCAGGACGT 2

Just wanna know how do you prepare this input data. I know that I could obtain Peak from peaks.bed and cell from barcodes.tsv and count from count matrix. However, they have different dimensions. How do you combine them to generate this input data? I am getting confused.

Thanks!

Question for aggregated cells

Is there any way of keeping the original cell metadata and still link back the agg cells to the original single cells?

Add FAQ

where to find other genome coordinates files

Make promoter annotation clearer in gene activity section

Questions regarding to "plot_connections", "plot_cells" and "new_cell_data_set", thank you for your help.

Dear Hannah,

Thank you for your quick response, you are very helpful.
I still have 3 questions below, if possible, could you give me some tips? Thank you so much!

What is the reason why "yscale" is invalid for drawing the connection plot as below in cicero for monocle3? What should I do to figure it out? Thank you so much.
#===#
plot_connections(conns, "chr2", 13451, 9848598,gene_model = gene_anno, coaccess_cutoff = .005, connection_width = .5,collapseTranscripts = "longest" )
Error in valid.viewport(x, y, width, height, just, gp, clip, xscale, yscale, :
invalid 'yscale' in viewport
#===#
How can I plot cell figures based on genes assigned after processing "Cicero gene activity scores" step in cicero for monocle3? Is the way below the correct way to do it? Or any other ways? Thank you.
#===#

plot_cells(input_cds, genes=c("GR"),show_trajectory_graph=TRUE,label_cell_groups=FALSE,label_leaves=FALSE)
Error in plot_cells(input_cds, genes = c("GR"), show_trajectory_graph = TRUE, :
None of the provided genes were found in the cds
#===#

How can I merge 10x genomics samples as one single cell_data_set (CDS) objective with sample names as a feature stored at a column of colData(cds), through multiple loading samples by "new_cell_data_set" function?

Thank you so much!

Best,
Qi

estimateDispersion Issue

Dear Cicero Team,

when running aggregate_by_cell_bin on an scATAC dataset, I am receiving the following error:
binned_input_lin <-aggregate_by_cell_bin(input_cds, "combinedId")
|======================================================|100% ~0 s remaining Error in parametricDispersionFit(disp_table, verbose) :
Parametric dispersion fit failed. Try a local fit and/or a pooled estimation. (See '?estimateDispersions'

and these warnings:
Warning messages:
1: In log(ifelse(y == 0, 1, y/mu)) : NaNs produced
2: step size truncated due to divergence

I ran cicero on this dataset before but with a different cell clustering and filtering and it worked fine.

It looks like the fit is consistently failing but I don't understand why. There are no empty samples/cells or peaks that are not present in at least of the clusters.
I would appreciate any suggestions.

Thanks a lot,
Michael

Is there "data(gene_annotation_sample)" file for hg38 in cicero in step of Constructing cis-regulatory networks? Thank you so much.

Dear Drs. Hannah Pliner and Cole Trapnell,

Is there "data(gene_annotation_sample)" file for hg38 in cicero in step of Constructing cis-regulatory networks? Thank you so much.

Best,
Qi

It looks like a conflint between monocle::detectGenes and new_cell_data_set function in Cicero for monocle3? Could you help to give me some tips to figure it out? Thank you so much.

Dear Hannah,

When I using Cicero for monocle3 to process my scATACseq data, I encountered one question below: function of "monocle::detect_genes" requires "lowerDetectionLimit" setting at "new_cell_data_set", but there is no such parameter setting there at "new_cell_data_set" showed as below: Could you help to give me some tips to figure it out? Thank you so much.
#====#

input_cds <- suppressWarnings(new_cell_data_set(indata,

cell_metadata = cellinfo,
gene_metadata = peakinfo))

input_cds <- monocle::detect_genes(input_cds)
Error: 'detect_genes' is not an exported object from 'namespace:monocle'
input_cds <- monocle::detectGenes(input_cds)
Error in monocle::detectGenes(input_cds) :
no slot of name "lowerDetectionLimit" for this object of class "cell_data_set"
#====#

how to compare two sample using plot_connections

Hi @ctrapnell and all
I have two sample, treatment and control, and I want to compare specific position between treatment and control sample using plot_connections. Base above request, what should i do on
this cases.
I wonder what the line that plot_connections showed mean ? whether or not it showed the
connective weight on curl line?

Deal with no 'chr' in plot_connections

Transfer to travis.com

using non-UCSC chromosome name formats in plot_connections()

The plot_connections() function assumes that all chromosome names start with "chr", and errors out if a user tries to pass a non-"chr" chromosome name as the window coordinate.

Fix overlapping menu on mobile

annotation of genes needs to take strandness into account?

Hi,

I am following https://cole-trapnell-lab.github.io/cicero-release/docs/#cicero-gene-activity-scores
and

# Make a subset of the gene annotation column containing just the coordinates and the 
# gene name
gene_annotation_sub <- gene_annotation_sample[,c(1:3, 8)]

# Rename the gene symbol column to "gene"
names(gene_annotation_sub)[4] <- "gene"

only the chr, start, end and gene information are used. should add the strandness information as well? TSS (transcription start site) will be at the start if strand is + and at the end if strand is -.

thanks!

Could not find function "newCellDataSet"

I have installed cicero as described here. But when I run it, I get the following error:

Error in newCellDataSet(indata, phenoData = pd, featureData = fd, expressionFamily = VGAM::binomialff(), : could not find function "newCellDataSet"

When I use another R environment and I install the old monocle it works until this step:

input_cds <- preprocessCDS(input_cds, norm_method = "none")

where it fails because the preprocessCDS doesn't exists, which is what happens when you have the old cicero instalation.

Do you know how to solve this with the new monocle3-compatible cicero?

Question for trajector analysis

Hi,
Thank you for your wonderful software. Currently I am working on scATAC-seq data trajectory analysis. After I prepared input_cds and run estimate_size_factors() function, I got this error:
Error in (function (classes, fdef, mtable) :
unable to find an inherited method for function 'counts' for signature '"CellDataSet"'
How shall I modify the input cds to avoid this error?

Thank you,
Best,
Yan

Error with creating cicero CDS

Hi! I'm getting this error with make_cicero_CDS function.

cicero_cds <- make_cicero_cds(input_cds, reduced_coordinates = umap.emb, k = 50)
Error in parametricDispersionFit(disp_table, verbose) :
Parametric dispersion fit failed. Try a local fit and/or a pooled estimation. (See ‘?estimateDispersions’)

My input_cds has about 8k cells and 100k peaks. These are the overlap QC metrics.
Maximum shared cells bin-bin: 44
Mean shared cells bin-bin: 0.322901408806195
Median shared cells bin-bin: 0

Thanks!

ERROR:duplicate 'row.names' are not allowed

Dear Sir:
I was trying to use Cicero to get gene activity score, however, when I follow the tutorial in the website, I encounter an error, when I began to process make_cicero_cds module, after input all the variable like tutorial and begin the chunk ,
the console report like this:
Overlap QC metrics:
Cells per bin: 50
Maximum shared cells bin-bin: 44
Mean shared cells bin-bin: 3.82782937736498
Median shared cells bin-bin: 0
NAs introduced by coercionnon-unique value when setting 'row.names': 'Peak'Error in .rowNamesDF<-(x, value = value) :
duplicate 'row.names' are not allowed

I was wondering which step I was wrong, as I checked all the annotate files and peakinfo, there is no duplicate index
Could you please help me to find how can I check the duplicate row or how to fix this issue?
eager for your help

                                                                                                                                 thanks

Add check for zero rows in normalize gene activities

Plot_connections: Error in `$<-.data.frame`(`tmp`, "feature", value = character(0)) : replacement has 0 rows, data has 101

Hi, I was running the vignette provided on the Cicero website and encountered the following error when trying to plot the connections. Could you please help me figure out what the issue is? Thanks a lot!!

plot_connections(conns, "chr2", 9773451, 9848598,
                 gene_model = gene_anno, 
                 coaccess_cutoff = .25, 
                 connection_width = .5, 
                 collapseTranscripts = "longest" )

Error in $<-.data.frame(*tmp*, "feature", value = character(0)) : replacement has 0 rows, data has 101

Traceback:

5. stop(sprintf(ngettext(N, "replacement has %d row, data has %d", "replacement has %d rows, data has %d"), N, nrows), domain = NA)
4. `$<-.data.frame`(`*tmp*`, "feature", value = character(0))
3. `$<-`(`*tmp*`, "feature", value = character(0))
2. make_gene_model_track(gene_model, chr, collapseTranscripts, gene_model_color, gene_model_shape)
1. plot_connections(conns, "chr2", 9773451, 9848598, gene_model = gene_anno, coaccess_cutoff = 0.25, connection_width = 0.5, collapseTranscripts = "longest")

And here is my session info:

Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS/LAPACK: /usr/lib64/libopenblas-r0.3.3.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8       
 [4] LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C              
[10] LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
 [1] grid      stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] cicero_1.3.3                Gviz_1.28.3                 monocle3_0.2.1.1           
 [4] SingleCellExperiment_1.6.0  SummarizedExperiment_1.14.1 DelayedArray_0.10.0        
 [7] BiocParallel_1.18.1         matrixStats_0.55.0          GenomicRanges_1.36.1       
[10] GenomeInfoDb_1.20.0         IRanges_2.18.3              S4Vectors_0.22.1           
[13] Biobase_2.44.0              BiocGenerics_0.30.0        

loaded via a namespace (and not attached):
 [1] ProtGenerics_1.16.0      bitops_1.0-6             bit64_0.9-7              RcppAnnoy_0.0.13        
 [5] RColorBrewer_1.1-2       progress_1.2.2           httr_1.4.1               tools_3.6.0             
 [9] backports_1.1.5          irlba_2.3.3              R6_2.4.0                 rpart_4.1-15            
[13] uwot_0.1.4               Hmisc_4.2-0              DBI_1.0.0                lazyeval_0.2.2          
[17] colorspace_1.4-1         nnet_7.3-12              tidyselect_0.2.5         gridExtra_2.3           
[21] prettyunits_1.0.2        curl_4.2                 bit_1.1-14               compiler_3.6.0          
[25] htmlTable_1.13.2         rtracklayer_1.44.4       labeling_0.3             scales_1.0.0            
[29] checkmate_1.9.4          stringr_1.4.0            digest_0.6.22            Rsamtools_2.0.3         
[33] foreign_0.8-72           XVector_0.24.0           dichromat_2.0-0          base64enc_0.1-3         
[37] pkgconfig_2.0.3          htmltools_0.4.0          ensembldb_2.8.1          BSgenome_1.52.0         
[41] htmlwidgets_1.5.1        rlang_0.4.0              VGAM_1.1-1               rstudioapi_0.10         
[45] RSQLite_2.1.2            FNN_1.1.3                acepack_1.4.1            dplyr_0.8.3             
[49] VariantAnnotation_1.30.1 RCurl_1.95-4.12          magrittr_1.5             GenomeInfoDbData_1.2.1  
[53] Formula_1.2-3            Matrix_1.2-17            Rcpp_1.0.2               munsell_0.5.0           
[57] viridis_0.5.1            yaml_2.2.0               stringi_1.4.3            zlibbioc_1.30.0         
[61] plyr_1.8.4               blob_1.2.0               crayon_1.3.4             lattice_0.20-38         
[65] Biostrings_2.52.0        splines_3.6.0            GenomicFeatures_1.36.4   hms_0.5.1               
[69] zeallot_0.1.0            knitr_1.25               pillar_1.4.2             codetools_0.2-16        
[73] reshape2_1.4.3           biomaRt_2.40.5           XML_3.98-1.20            glue_1.3.1              
[77] biovizBase_1.32.0        latticeExtra_0.6-28      RcppParallel_4.4.4       data.table_1.12.6       
[81] vctrs_0.2.0              gtable_0.3.0             purrr_0.3.3              assertthat_0.2.1        
[85] ggplot2_3.2.1            xfun_0.10                AnnotationFilter_1.8.0   RSpectra_0.15-0         
[89] glasso_1.11              survival_2.44-1.1        viridisLite_0.3.0        tibble_2.1.3            
[93] GenomicAlignments_1.20.1 AnnotationDbi_1.46.1     memoise_1.1.0            cluster_2.1.0      ```

Adding fData

About CellDataSet and cell_data_set

I use cicero to analyse my scATAC data from 10X, when I create the cds object use the code "input_cds <- suppressWarnings(new_cell_data_set(indata,
cell_metadata = cellinfo,
gene_metadata = peakinfo))", the output is "Large cell_data_set", not "Large CellDataSet".But the input of make_cicero_cds should be "Large CellDataSet".So how can I solve this problem?The error shows :"Error: is(object = cds, class2 = "CellDataSet") is not TRUE"

Add documentation support for 10x scATAC output

Question about differentialGeneTest

I noticed that some of the examples in Cicero tutorial use num_genes_expressed in full model while others (also in Monocle tutorial) don't. What's the reason for that and when should we use it? Thanks!

cannot go through reduceDimension()

I used cicero version 1.4.4 wrapped with monocle.
The analysis get stuck when run reduceDimension(),
Error: vector memory exhausted (limit reached?).
This is on my laptop, I also tried on server, while get killed.

Do you have any solutions for that?

I tried to switch to the new version wrapped with monocle3 as well, but cannot get it installed.

Enhancement - Matrix Multiplication in make_cicero_cds

Hi! I have a clunky enhancement that I made to make_cicero_cds that requires the use of Rcpp. I'm not the best at pull requests yet, so I'll report it here and if you choose to implement it, it's up to you :)

Basically, I am working with a very large set of imputed data (14,000 cells x 256,000 peaks). Normally this data is very sparse (~95%) but after imputation, less than half a percent is a 0. Previously make_cicero_cds has been able to (although not super efficiently) work with this by converting sparse matrices to dense ones within R. The specific step that was overloading my system (64 core 256GB RAM) was the Matrix multiplication on line 73.
new_exprs <- exprs_old %*% mask

So, I decided to spend some time trying to find alternatives.

Rcpp Implementation

I figured that a more efficient way of accomplishing this would be to use C++, implemented through Rcpp. I searched the internet and found this script.

// [[Rcpp::depends(RcppArmadillo, RcppEigen)]]

#include <RcppArmadillo.h>
#include <RcppEigen.h>

// [[Rcpp::export]]
SEXP armaMatMult(arma::mat A, arma::mat B){
    arma::mat C = A * B;
    
    return Rcpp::wrap(C);
}

// [[Rcpp::export]]
SEXP eigenMatMult(Eigen::MatrixXd A, Eigen::MatrixXd B){
    Eigen::MatrixXd C = A * B;
    
    return Rcpp::wrap(C);
}

// [[Rcpp::export]]
SEXP eigenMapMatMult(const Eigen::Map<Eigen::MatrixXd> A, Eigen::Map<Eigen::MatrixXd> B){
    Eigen::MatrixXd C = A * B;
    
    return Rcpp::wrap(C);
}

Following the article, I saved this as MatrixMtp.cpp and loaded it into my R session.
Next, I edited the make_cicero_cds function:

    #mask <- Matrix::Matrix(mask)
    #new_exprs <- exprs_old %*% mask
    mask <- as(mask, "matrix")
    new_exprs <- eigenMatMult(as(exprs_old,"matrix"), mask)

Now, this function eigenMatMult works, but the faster version eigenMapMatMult didn't, due to the error

Error in eigenMapMatMult(as(exprs_old, "matrix"), mask) : 
  Wrong R type for mapped matrix

I'm not a C++ expert so I just used the other version.

It worked! Although it took between 50-125GB of RAM to complete the function, it was able to actually make the dataset. I think there may be other people struggling with this problem, so maybe this will help.

argument "maxbp" is missing, with no default

Error in reduce dimension part of trajectory analysis

Hi Hannah,
Thank you for creating this useful package for scATACseq data.

I have created a celldataset going through the Cicero method and used the same object for running trajectory analysis and got the following error

Error in as.igraph.vs(graph, nodes) : Invalid vertex names Calls: reduceDimension ... learnGraph -> project2MST -> neighborhood -> as.igraph.vs

Can you help figuring out what might be causing this issue?

Thank you
Sasi

How to use gene activity matrix in plot_cells & mm10 genome to view connections?

Hello,

I have two questions:

1). How to view gene accessibility using plot_cells() with gene activity matrix?
Following the instructions on Cicero website normally, I was successful in clustering my single-cell data. However, when I try to use the genes argument in plot_cells(), I get an error that "None of the provided genes were found in the cds".

Thus, I am trying to create a new CDS object with cicero_gene_activities as the expression_matrix, per your response to MQMQ2018's second question in resolved issue #35. You responded to MQMQ2018 to use the gene activities matrix in place of the expression matrix and "gene information" in place of gene metadata.

I am not quite sure what you meant by "gene information". I have tried using gene_metadata = gene_anno as well as gene_annotation_sub (processed from ensemble's GTF files), and I keep getting this error:

input_cds2 <- suppressWarnings(new_cell_data_set(cicero_gene_activities,
cell_metadata = cellinfo,
gene_metadata = gene_anno))
Error: gene_metadata must be NULL or have the same number of rows as rows in expression_data

If the annotation data was what you meant by "gene information", can you advise me on how to reduce the annotation object to have the proper number of rows to successfully create the new CDS object so I can view gene accessibility through plot_cells?

2). mm10 genome with run_cicero?
We did peak alignments with mm10 genome using Cell Ranger. I am interested in viewing coacessibility and other Cicero connections with my scATAC-seq data. Since Cicero has the mm9 genome preloaded, I was wondering how to load in the mm10 genome to Cicero so that I can call the mm10 genome with data("mouse.mm10.genome"), then whole_genome <- mouse.mm10.genome for use in run_cicero.

I have already done run_cicero using the default mm9 genome out of curiosity and would like to investigate connections and coaccessibility in Cicero further. However, I would prefer to be consistent and use the mm10 genome with run_cicero since we used mm10 for the peak alignments. Do you have any advice for how I can address this?

In case it is helpful, here is my session info:

sessionInfo()
R version 3.6.2 (2019-12-12)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.3 LTS

attached base packages:
[1] grid stats4 parallel stats graphics grDevices utils datasets methods base

other attached packages:
[1] cicero_1.3.4.5 Gviz_1.30.1 monocle3_0.2.0 SingleCellExperiment_1.8.0
[5] SummarizedExperiment_1.16.1 DelayedArray_0.12.2 BiocParallel_1.20.1 matrixStats_0.55.0
[9] GenomicRanges_1.38.0 GenomeInfoDb_1.22.0 IRanges_2.20.2 S4Vectors_0.24.3
[13] Biobase_2.46.0 BiocGenerics_0.32.0

loaded via [1] ProtGenerics_1.18.0 [5] progress_1.2.2 [9] R6_2.4.1 [13] lazyeval_0.2.2 [17] gridExtra_2.3 [21] compiler_3.6.2 [25] checkmate_2.0.0 [29] digest_0.6.23 [33] dichromat_2.0-0 [37] pkgconfig_2.0.3 [41] htmlwidgets_1.5.1 [45] RSQLite_2.2.0 [49] RCurl_1.98-1.1 [53] Matrix_1.2-18 [57] lifecycle_0.1.0 [61] BiocFileCache_1.10.2 [65] Biostrings_2.54.0 [69] knitr_1.28 [73] XML_3.99-0.3 [77] data.table_1.12.8 [81] openssl_1.4.1 [85] xfun_0.12 [89] tibble_2.1.3 [93] cluster_2.1.0 a namespace (and not attached):
bitops_1.0-6 bit64_0.9-7 RColorBrewer_1.1-2
httr_1.4.1 tools_3.6.2 backports_1.1.5
rpart_4.1-15 Hmisc_4.3-1 DBI_1.1.0
colorspace_1.4-1 nnet_7.3-12 tidyselect_1.0.0
prettyunits_1.1.1 bit_1.1-15.2 curl_4.3
htmlTable_1.13.3 rtracklayer_1.46.0 scales_1.1.0
askpass_1.1 rappdirs_0.3.1 stringr_1.4.0
Rsamtools_2.2.1 foreign_0.8-74 XVector_0.26.0
htmltools_0.4.0 base64enc_0.1-3 jpeg_0.1-8.1
ensembldb_2.10.2 BSgenome_1.54.0 dbplyr_1.4.2
rlang_0.4.4 VGAM_1.1-2 rstudioapi_0.11
acepack_1.4.1 dplyr_0.8.4 VariantAnnotation_1.32.0
magrittr_1.5 GenomeInfoDbData_1.2.2 Formula_1.2-3
Rcpp_1.0.3 munsell_0.5.0 viridis_0.5.1
stringi_1.4.5 zlibbioc_1.32.0 plyr_1.8.5
blob_1.2.1 crayon_1.3.4 lattice_0.20-38
splines_3.6.2 GenomicFeatures_1.38.1 hms_0.5.3
pillar_1.4.3 reshape2_1.4.3 biomaRt_2.42.0
glue_1.3.1 biovizBase_1.34.1 latticeExtra_0.6-29
png_0.1-7 vctrs_0.2.2 gtable_0.3.0
purrr_0.3.3 assertthat_0.2.1 ggplot2_3.2.1
AnnotationFilter_1.10.0 survival_3.1-8 viridisLite_0.3.0
GenomicAlignments_1.22.1 AnnotationDbi_1.48.0 memoise_1.1.0

Thank you

plot_connections issued error for specific range position

Hi all,
The error occured when runging function below, ranges at chr17:50236092-50244917
plot_connections(conns, "chr17", 50236092, 50244917,
gene_model = gene_anno,
coaccess_cutoff = .25,
connection_width = .5,
collapseTranscripts = "longest")
The error showing:
Error in plot_connections(conns, "chr17", 50236092, 50244917, gene_model = gene_anno, :
No peaks in the specified range, nothing to plot!
But I can sure that chromosome length own 94987271 bases in chr17, and i make conns object was like below:
sample_genome <- subset(mouse.mm9.genome, V1 == "chr17")

Usually use the whole mouse.mm9.genome

conns <- run_cicero(cicero_cds, sample_genome, sample_num = 100)
Please give me a tips for this solution.
Thanks!

last step for normalizing the gene activity scores failed

Hi,
I was following https://cole-trapnell-lab.github.io/cicero-release/docs/#cicero-gene-activity-scores using 10x data from https://support.10xgenomics.com/single-cell-atac/datasets/1.0.1/atac_v1_pbmc_10k

everything worked fine but the last command

> cicero_gene_activities <- normalize_gene_activities(unnorm_ga, num_genes)
Error in lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) :
  NA/NaN/Inf in 'y'

full command (copied from the tutorial)

library(cicero)
#library(here)

# read in matrix data using the Matrix package
indata <- Matrix::readMM("filtered_peak_bc_matrix/matrix.mtx") 
# binarize the matrix
indata@x[indata@x > 0] <- 1

# format cell info
cellinfo <- read.table("filtered_peak_bc_matrix/barcodes.tsv")
row.names(cellinfo) <- cellinfo$V1
names(cellinfo) <- "cells"

# format peak info
peakinfo <- read.table("filtered_peak_bc_matrix/peaks.bed")
names(peakinfo) <- c("chr", "bp1", "bp2")
peakinfo$site_name <- paste(peakinfo$chr, peakinfo$bp1, peakinfo$bp2, sep="_")
row.names(peakinfo) <- peakinfo$site_name

row.names(indata) <- row.names(peakinfo)
colnames(indata) <- row.names(cellinfo)

# make CDS
fd <- methods::new("AnnotatedDataFrame", data = peakinfo)
pd <- methods::new("AnnotatedDataFrame", data = cellinfo)
input_cds <-  suppressWarnings(newCellDataSet(indata,
                                              phenoData = pd,
                                              featureData = fd,
                                              expressionFamily=VGAM::binomialff(),
                                              lowerDetectionLimit=0))
input_cds@expressionFamily@vfamily <- "binomialff"
input_cds <- monocle::detectGenes(input_cds)

#Ensure there are no peaks included with zero reads
input_cds <- input_cds[Matrix::rowSums(exprs(input_cds)) != 0,] 

set.seed(2017)
input_cds <- detectGenes(input_cds)
input_cds <- estimateSizeFactors(input_cds)

# *** if you are using Monocle 3, you need to run the following line as well!
#input_cds <- preprocessCDS(input_cds, norm_method = "none"), take ~10mins
input_cds <- reduceDimension(input_cds, max_components = 2, num_dim=6,
                             reduction_method = 'tSNE', norm_method = "none")


tsne_coords <- t(reducedDimA(input_cds))
row.names(tsne_coords) <- row.names(pData(input_cds))
cicero_cds <- make_cicero_cds(input_cds, reduced_coordinates = tsne_coords)

### run cicero
data("human.hg19.genome")

sample_genome <- subset(human.hg19.genome, V1 == "chr18")
conns <- run_cicero(cicero_cds, sample_genome)


# Make a subset of the gene annotation column containing just the coordinates and the 
# gene name
data(gene_annotation_sample)
gene_annotation_sub <- gene_annotation_sample[,c(1:3, 8)]

# Rename the gene symbol column to "gene"
names(gene_annotation_sub)[4] <- "gene"

input_cds <- annotate_cds_by_site(input_cds, gene_annotation_sub)

# head(fData(input_cds))

# generate unnormalized gene activity matrix
unnorm_ga <- build_gene_activity_matrix(input_cds, conns)

# make a list of num_genes_expressed
num_genes <- pData(input_cds)$num_genes_expressed
names(num_genes) <- row.names(pData(input_cds))

# normalize
cicero_gene_activities <- normalize_gene_activities(unnorm_ga, num_genes)

Thanks for looking into it.

Error in `[.data.table`(bedpedata, , i)

Hello. This error is driving me nuts and I don't know how to fix this. I tested the code with the chia_conns from the tutorial and that generated a plot, but my own data does not.

Error in [.data.table(bedpedata, , i) : j (the 2nd argument inside [...]) is a single symbol but column name 'i' is not found. Perhaps you intended DT[, ..i]. This difference to data.frame is deliberate and explained in FAQ 1.1.
10.
stop("j (the 2nd argument inside [...]) is a single symbol but column name '", jsubChar, "' is not found. Perhaps you intended DT[, ..", jsubChar, "]. This difference to data.frame is deliberate and explained in FAQ 1.1.")
9.
[.data.table(bedpedata, , i)
8.
bedpedata[, i]
7.
plotBedpe(sub, chrom = chr, chromstart = minbp, chromend = maxbp, connection_ymax, coaccess_cutoff, connection_width, alpha_by_coaccess, color_names)
6.
GdObject@plottingFunction(GdObject, prepare = prepare)
5.
.local(GdObject, ...)
4.
drawGD(expandedTrackList[[i]], minBase = ranges["from"], maxBase = ranges["to"], subset = FALSE)
3.
drawGD(expandedTrackList[[i]], minBase = ranges["from"], maxBase = ranges["to"], subset = FALSE)
2.
plotTracks(component_list, title.width = 0.5, showTitle = TRUE, from = minbp, to = maxbp, chromosome = chr, sizes = size_track, transcriptAnnotation = "symbol", background.title = "transparent", col.border.title = "transparent", lwd.border.title = "transparent", ...
1.
plot_connections(conns, "chr10", 10000500, 100335118, coaccess_cutoff = 0, connection_width = 0.5, return_as_list = FALSE)

> head(conns)
                       Peak1                     Peak2  coaccess
1: chr10_100005717_100007112 chr10_100333926_100335118 0.3932521
2: chr10_100146917_100147763 chr10_100184684_100187167 0.2977492
3: chr10_100146917_100147763 chr10_100209218_100209752 0.3271076
4: chr10_100146917_100147763 chr10_100285473_100287454 0.3059880
5: chr10_100146917_100147763 chr10_100338389_100338891 0.3741931
6: chr10_100146917_100147763 chr10_100341043_100341567 0.3778341

plot_connections(conns, "chr10", 100005000, 100335118,
                # gene_model = gene_anno,
                coaccess_cutoff = 0,
                connection_width = .5, return_as_list = FALSE)`

Running cicero on bulk data

Hi,

I had a naive question--if I'm trying to run cicero on bulk atac-seq data, should I assign each peak randomly to 100 cells? What is the best way to convert cicero for bulk use?

Best,
Anu

Link to a page is broken.

https://cole-trapnell-lab.github.io/monocle3/docs/#analyzing-branches-in-single-cell-trajectories link is broken.

Aggregating Co-Accessible Peaks

Hey guys! I just had a quick question about aggregating peaks. Is there a reason the default peak aggregation is using nearby peaks instead of aggregating co-accessible peaks?

Deal with plot_connections NAs

" does your gene_anno data frame contain NA transcripts?
I had to remove them to get rid from the invalid 'yscale' in viewport error:

gene_anno <- subset(gene_anno, !(is.na(transcript))) . "

Benchmarks

Hi thanks for putting out interesting software. Im curious how long the software takes.
Maybe something along the lines of the 10, 25, 50, 100k (mouse-atlas) single cells with 250k peaks. I don't have that big of a dataset, but its been running for quite a while.

Thanks!

make_atac_cds: fData(atac_cds)$chr <- as.numeric(as.character(fData(atac_cds)$chr))

Hi,

If having sex chromosomes ('X', 'Y') this line in make_atac_cds results in the chromosome name stored as NA, which results in further errors later on. Changing it to fData(atac_cds)$chr <- as.character(fData(atac_cds)$chr) seems to work.

Cheers!

Monocle 3 alpha compatibility issue

Errors in plot_connections

Hi! I was wondering if you could help me understand why I am getting errors while using plot_connections. I have the properly formatted cicero conns file...

> head(conns)
             Peak1            Peak2    coaccess
2 chr1_21162_21368 chr1_21798_22017 -0.01375438
3 chr1_21162_21368 chr1_27072_27330 -0.12267344
4 chr1_21162_21368 chr1_31671_32072 -0.23830674
5 chr1_21162_21368 chr1_40908_41039 -0.15464157
6 chr1_21162_21368 chr1_41164_41472 -0.10644979
7 chr1_21162_21368 chr1_41742_42056 -0.15240281

And also the gene_model...

> head(galGal6.genes)
             ENSEMBL     start       end chromosome transcript strand
1 ENSGALG00000051400 195689244 195693062       chr1                 *
2 ENSGALG00000049220 195648859 195651062       chr1                 *
3 ENSGALG00000009966  35566490  35612114       chr1       FRS2      *
4 ENSGALG00000050778 181186765 181189206       chr1                 *
5 ENSGALG00000036738  93596025  93648691       chr1      IGSF3      *
6 ENSGALG00000052160 181146434 181217925       chr1                 *

But I get the following error when I run this code.

> plot_connections(conns, "chr3", 107073545, 109341053, 
+                  gene_model = galGal6.genes, 
+                  coaccess_cutoff = .25, 
+                  connection_width = .5, 
+                  collapseTranscripts = "longest",
+                  return_as_list = FALSE)
Error in .normargSeqlevels(seqnames) : 
  supplied 'seqlevels' cannot contain NAs or empty strings ("")

Could you help me with this error? I've tried exporting this as a list but I can't find anything called seqlevels and I'm not sure what the issue is.

Thank you!

Compatibility with Monocle3

Hello,

I went through the trouble of getting Monocle3 installed in hopes of using some of its features, including easily generating TSNE and tree plots that encode locus accessibility via color. The Cicero website suggests that Monocle3 is compatible and you simple need to add in an extra command:

*** if you are using Monocle 3, you need to run the following line as well!
#input_cds <- preprocessCDS(input_cds, norm_method = "none")

That command doesn't exist (closest thing I can find is preprocess_cds(), but "none" is not a valid norm_method for that function). Other commands like detectGenes and estimateSizeFactors also generate errors. I also couldn't figure out how to set a comparable expressionFamily.

Is Cicero simply not compatible with Monocle3?

Thanks!

plot_accessibility_in_pseudotime update to work with Monocle 3

Would it be possible to update the plot_accessibility_in_pseudotime function in Cicero to work with Monocle3 objects?

For a cds of class 'cell_data_set', this is the error:
Error: is(object = cds_subset, class2 = "CellDataSet") is not TRUE

Tried debugging and could get it to make the cds_exprs but failed at the generation of merged_df_with_vgam.

This would be really helpful, thanks!!

multi-sample

Hi!
Good tool!
How does cicero handle multi-sample data?
How to eliminate batch effect?

Filtering genomic_coords in estimate_distance_parameters() ?

It looks like estimate_distance_parameters() will run generate_windows() on all of the chromosomes provided in genomic_coords, even if there aren't any reads in the cds for those chromosomes.

In some cases, I think this can cause a lot of iterations to find no values in the win_range, resulting in no distance parameter estimate because length(distance_parameters) < sample_num at the end of the iterations.

Would it make sense to add a filtering step before generate_windows() to keep only the chromosomes with any reads in the cds, or should users filter genomic_coords before run_cicero()?

I ran into this by trying to run cicero on separate chromosomes in parallel, and got a lot of failed distance parameter estimates because most of the windows were on different chromosomes. Maybe this is an edge case, or doesn't match how you expect the functions to be used, in which case you could skip this suggestion :)

Thanks!
-Lucas

Error in preprocessing

Hi Hannah,
I have succesfully the cicero pipeline on 10x generated data but for only one of the samples i have the following error:

Error in if (any(i < 0L)) { : missing value where TRUE/FALSE needed Calls: [<- -> [<- -> [<- -> [<- -> int2i In addition: Warning message: In int2i(as.integer(i), n) : NAs introduced by coercion to integer range

Does this error point to anything with how the data is for this sample?I tried to check from where this error is raised and i could see that this part of the preprocess/reduceDimension functions,but i cannot figure out why

Can you help in troubleshooting this?

Thank you
Sasi

Replace website color plots to include legend

scale of the normalized the gene activity matrix

Hi,

In general, what's the scale of the matrix?

> range(atac_mat)
[1] 0.0000000 0.9999923

I asked the Seurat V3 author because I am using it for label transferring from scRNAseq data.

That is a weird distribution for the normalized values, the cortex ATAC data we looked at (generated by the Shendure/Trapnell labs) has a distribution more similar to scRNA (see attached, title of the plot says RNA but it’s ATAC data)

Does cicero normalize the value somehow? or should I use the un-normalized values?

Thanks,
Tommy

Add a check for chr_UN/chr_random in make_input_cds

Handling large dataset

Hello,
I am wondering, what would you recommend to create a cicero object for a dataset that is ~150k regions by 40k cells. The data is currently stored as a csv files in the proper cicero data format (location, barcode, number of reads).

In different computer clusters, when I tried to load the data, I encountered a segmentation error (memory error).

Thanks,

Issues with Cicero gene activity score

I tried to run the gene activity score pipeline using the exact code:
https://cole-trapnell-lab.github.io/cicero-release/docs_m3/#cicero-gene-activity-scores

But I can't seem to get it working on 10X's 10K PBMC data set or my own B cell data set. In both cases, the activity score looks very low for some key marker genes that I know are expressed. Do you guys have some result plots of gene activity score that you can share based on some example data set?

FYI, these are the code I used to generate and plot the scores:
`
hg19 <- read.table("../hg19.chrom.sizes.txt")
conns <- run_cicero(cicero_cds, hg19)

gene_anno <- rtracklayer::readGFF("/somewhere/gencode.v19.annotation.gtf")
gene_anno$chromosome <- gene_anno$seqid
gene_anno$gene <- gene_anno$gene_id
gene_anno$transcript <- gene_anno$transcript_id
gene_anno$symbol <- gene_anno$gene_name
pos <- subset(gene_anno, strand == "+")
pos <- pos[order(pos$start),]
pos <- pos[!duplicated(pos$transcript),]
pos$end <- pos$start + 1
neg <- subset(gene_anno, strand == "-")
neg <- neg[order(neg$start, decreasing = TRUE),]
neg <- neg[!duplicated(neg$transcript),]
neg$start <- neg$end - 1
gene_annotation_sub <- rbind(pos, neg)

gene_annotation_sub <- gene_annotation_sub[,c("chromosome", "start", "end", "symbol")]
names(gene_annotation_sub)[4] <- "gene"
input_cds <- annotate_cds_by_site(input_cds, gene_annotation_sub)
unnorm_ga <- build_gene_activity_matrix(input_cds, conns)
unnorm_ga <- unnorm_ga[!Matrix::rowSums(unnorm_ga) == 0,
!Matrix::colSums(unnorm_ga) == 0]
num_genes <- pData(input_cds)$num_genes_expressed
names(num_genes) <- row.names(pData(input_cds))
cicero_gene_activities <- normalize_gene_activities(unnorm_ga, num_genes)

genes = c("CD8A", "CD4", "NKG7", "MS4A1", "PRDM1", "CD19", "IRF8")

for(gene in genes){
x1 <- cicero_gene_activities[gene,]
pData(input_cds)$marker_x <- x1
cairo_pdf( paste0("gene_activity_score_", gene, ".pdf"), width=4, height=4)
p <- plot_cells(input_cds, show_trajectory_graph=F, label_branch_points=F, color_cells_by="marker_x")
print(p)
dev.off()
}
`

cole-trapnell-lab / cicero-release Goto Github PK

cicero-release's Introduction

Cicero

Predicting the cis-regulatory landscape

cicero-release's People

Contributors

Stargazers

Watchers

Forkers

cicero-release's Issues

Rcpp Implementation

Usually use the whole mouse.mm9.genome

Recommend Projects

Recommend Topics

Recommend Org