Clustering through Imputation and Dimensionality Reduction

License: GNU General Public License v2.0

R 75.90% C++ 24.10%

cidr's Introduction

Ultrafast and accurate clustering through imputation and dimensionality reduction for single-cell RNA-seq data.

Most existing dimensionality reduction and clustering packages for single-cell RNA-Seq (scRNA-Seq) data deal with dropouts by heavy modelling and computational machinery. Here we introduce CIDR (Clustering through Imputation and Dimensionality Reduction), an ultrafast algorithm which uses a novel yet very simple ‘implicit imputation’ approach to alleviate the impact of dropouts in scRNA-Seq data in a principled manner.

For more details about CIDR, refer to the paper: Peijie Lin, Michael Troup, Joshua W.K. Ho, CIDR: Ultrafast and accurate clustering through imputation for single-cell RNA-seq data. Genome Biology 2017 Mar 28;18(1):59.

CIDR is maintained by Dr Joshua Ho.

Getting Started

Make sure your version of R is at least 3.1.0.
CIDR has been tested primarily on the Linux and Mac platforms. It has also been tested on Windows, where it requires the external software package Rtools.
If you are on the Windows platform, ensure that Rtools is installed. Rtools is software (installed separately from R) that assists in building R packages and R itself. Note that the download for Rtools is on the order of 100 MB.
Install the CRAN package devtools, which will be used to install CIDR and its dependencies:

## this is an R command
install.packages("devtools")

Install the CIDR package directly from the GitHub repository (including any dependencies):

## this is an R command
devtools::install_github("VCCRI/CIDR")
## Note that on some Windows platforms you may be asked to re-install Rtools,
## even though it may already be installed. Say yes if prompted.
## Your Windows platform may require the specific version of Rtools being suggested.
##
## For Mac platforms, ensure that the software "Xcode" and "Command Line Tools" are
## installed, by issuing the following command from a terminal prompt:
##  /usr/bin/clang --version
##
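
Before installing, the version requirement stated above can be checked from within R. The following is a minimal sketch using only base R; the variable names are illustrative and not part of CIDR.

```r
## Minimum R version stated in this README.
required_r <- "3.1.0"

## getRversion() returns the running R version as a comparable object.
r_ok <- getRversion() >= required_r
message("R ", as.character(getRversion()),
        if (r_ok) " meets" else " does not meet",
        " the minimum of ", required_r)

## devtools is only needed for installation; check whether it is already present.
has_devtools <- requireNamespace("devtools", quietly = TRUE)
message("devtools installed: ", has_devtools)
```

If r_ok is FALSE, upgrading R first is probably simpler than debugging installation errors afterwards.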

Examples

Simulated Data

Test the newly installed CIDR package:

library(cidr)
example("cidr")
#> 
#> cidr> par(ask=FALSE)
#> 
#> cidr> ## Generate simulated single-cell RNA-Seq tags.
#> cidr> N=3 ## 3 cell types
#> 
#> cidr> k=50 ## 50 cells per cell type
#> 
#> cidr> sData <- scSimulator(N=N, k=k)
#> 
#> cidr> ## tags - the tag matrix
#> cidr> tags <- as.matrix(sData$tags)
#> 
#> cidr> cols <- c(rep("RED",k), rep("BLUE",k), rep("GREEN",k))
#> 
#> cidr> ## Standard principal component analysis.
#> cidr> ltpm <- log2(t(t(tags)/colSums(tags))*1000000+1)
#> 
#> cidr> pca <- prcomp(t(ltpm))
#> 
#> cidr> plot(pca$x[,c(1,2)],col=cols,pch=1,xlab="PC1",ylab="PC2",main="prcomp")

#> 
#> cidr> ## Use cidr to analyse the simulated dataset.
#> cidr> ## The input for cidr should be a tag matrix.
#> cidr> sData <- scDataConstructor(tags)
#> 
#> cidr> sData <- determineDropoutCandidates(sData)
#> 
#> cidr> sData <- wThreshold(sData)
#> 
#> cidr> sData <- scDissim(sData)
#> 
#> cidr> sData <- scPCA(sData)

#> 
#> cidr> sData <- nPC(sData)
#> 
#> cidr> nCluster(sData)

#> 
#> cidr> sData <- scCluster(sData)
#> 
#> cidr> ## Two dimensional visualization: different colors denote different cell types,
#> cidr> ## while different plotting symbols denote the clusters output by cidr.
#> cidr> plot(sData@PC[,c(1,2)], col=cols,
#> cidr+      pch=sData@clusters, main="CIDR", xlab="PC1", ylab="PC2")

#> 
#> cidr> ## Use Adjusted Rand Index to measure the accuracy of the clustering output by cidr.
#> cidr> adjustedRandIndex(sData@clusters,cols)
#> [1] 0.9203693
#> 
#> cidr> ## 0.92
#> cidr> 
#> cidr> 
#> cidr>
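
The "standard principal component analysis" step in the output above first converts the tag matrix to log2 tags per million. As a small illustration of what that one-liner does, here is the same expression applied to a made-up 3 x 2 matrix (the values and names are assumptions chosen for easy arithmetic, not CIDR data).

```r
## Toy tag matrix: 3 genes (rows) x 2 cells (columns). Values are made up.
tags <- matrix(c(10, 0, 90,
                 5, 5,  0),
               nrow = 3,
               dimnames = list(c("g1", "g2", "g3"), c("cell1", "cell2")))

## Scale each column to tags per million.
## t(t(tags) / colSums(tags)) divides every column by its own total.
tpm <- t(t(tags) / colSums(tags)) * 1e6

## Add 1 before taking log2 so that zero counts map to 0.
ltpm <- log2(tpm + 1)

print(colSums(tpm))   # each cell sums to one million
print(ltpm)
```

The double transpose is needed because R recycles a vector down the rows of a matrix; transposing puts cells on the rows so that each cell is divided by its own library size.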

Biological Datasets

Examples of applying CIDR to real biological datasets can be found at this Github repository. The name of the repository is CIDR-examples.

Clicking the Clone or Download button in the CIDR-examples GitHub repository downloads a zip file containing the raw biological data and the R files for the examples. Extract the files, then run the provided R examples.

Human Brain scRNA-Seq Dataset

CIDR-examples contains a human brain single-cell RNA-Seq dataset, located in the Brain folder. In this dataset there are 420 cells in 8 cell types after we exclude hybrid cells.

Reference for the human brain dataset:

Darmanis, S. et al. A survey of human brain transcriptome diversity at the single cell level. Proceedings of the National Academy of Sciences 112, 7285–7290 (2015).

Human Pancreatic Islet scRNA-Seq Dataset

CIDR-examples contains a human pancreatic islet single-cell RNA-Seq dataset, located in the PancreaticIslet folder. In this dataset there are 60 cells in 6 cell types after we exclude undefined cells and bulk RNA-Seq samples.

Reference for the human pancreatic islet dataset:

Li, J. et al. Single-cell transcriptomes reveal characteristic features of human pancreatic islet cell types. EMBO Reports 17, 178–187 (2016).

Troubleshooting

Masking of hclust

CIDR utilises the hclust function from the base stats package. Loading CIDR masks hclust in other packages automatically. However, if any package with its own hclust function (e.g., flashClust) is loaded after CIDR, the name clash can cause problems. In this case, unloading that package should resolve the issue.
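
To see which package currently supplies hclust, and to call the base version unambiguously in your own code, something like the following may help. This is a minimal base R sketch with made-up data; it does not depend on CIDR or flashClust, and it only affects calls you write yourself, not calls made inside other packages.

```r
## Show which attached packages or environments provide an object named hclust.
## The first entry is the one a bare call to hclust() would use.
print(find("hclust"))

## Toy data: 6 points in 2 dimensions (values are arbitrary).
set.seed(1)
x <- matrix(rnorm(12), ncol = 2)

## Calling stats::hclust explicitly is unaffected by masking.
hc <- stats::hclust(dist(x), method = "ward.D2")
clusters <- cutree(hc, k = 2)
print(clusters)
```

If find("hclust") lists another package ahead of package:stats, that package is masking the base function in the current session.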

Reinstallation of CIDR - cidr.rdb corruption

In some cases, installing a new version of CIDR on top of an existing version may result in the following error message:

Error in fetch(key) : lazy-load database '/Library/Frameworks/R.framework/Versions/3.3/Resources/library/cidr/help/cidr.rdb' is corrupt

In this case, one way to resolve the issue is to reinstall the devtools package and then force a reinstall of CIDR:

install.packages("devtools")
## Click “Yes” in “Updating Loaded Packages”
devtools::install_github("VCCRI/CIDR",force=TRUE)

Some users might have installed an older version of RcppEigen. CIDR requires RcppEigen version >=0.3.2.9.0. Please re-install the latest version of this package if necessary.
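
The installed RcppEigen version can be compared against this requirement from within R. A minimal sketch using base R utilities; the variable names are illustrative.

```r
## Minimum RcppEigen version stated above.
required <- "0.3.2.9.0"

if (requireNamespace("RcppEigen", quietly = TRUE)) {
  installed <- as.character(utils::packageVersion("RcppEigen"))
  ## compareVersion returns -1 if the installed version is older than required.
  too_old <- utils::compareVersion(installed, required) < 0
  message("RcppEigen ", installed,
          if (too_old) " is older than " else " satisfies >= ", required)
} else {
  too_old <- NA
  message("RcppEigen is not installed")
}
```

If too_old is TRUE, running install.packages("RcppEigen") should fetch the current CRAN release.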

cidr's Issues

Input as batch effect corrected expression matrix

I am wondering if there is a way to use a batch-effect-corrected expression matrix as input rather than raw counts?

Error in nls.lm(par = start, fn = FCT, jac = jac, control = control, lower = lower, : evaluation of fn function returns non-sensible value!

Hello, I recently used the CIDR package to cluster my data, which has 32061 samples. I ran into this error and do not know how to deal with it. Can you give me some advice?

Why do the cpp_dist and cpp_dist_weighted functions work?

When using the CIDR package, I don't understand why the functions cpp_dist and cpp_dist_weighted in RcppExports.cpp work. That file contains only a declaration of cpp_dist and cpp_dist_weighted, so how are their parameters, such as dist, truth, counts and ncol, calculated?

The declarations of the cpp_dist and cpp_dist_weighted functions are as follows:

// cpp_dist
NumericMatrix cpp_dist(NumericMatrix dist, IntegerMatrix truth, NumericMatrix counts, int ncol, double threshold);
RcppExport SEXP _cidr_cpp_dist(SEXP distSEXP, SEXP truthSEXP, SEXP countsSEXP, SEXP ncolSEXP, SEXP thresholdSEXP) {
BEGIN_RCPP
    Rcpp::RObject rcpp_result_gen;
    Rcpp::RNGScope rcpp_rngScope_gen;
    Rcpp::traits::input_parameter< NumericMatrix >::type dist(distSEXP);
    Rcpp::traits::input_parameter< IntegerMatrix >::type truth(truthSEXP);
    Rcpp::traits::input_parameter< NumericMatrix >::type counts(countsSEXP);
    Rcpp::traits::input_parameter< int >::type ncol(ncolSEXP);
    Rcpp::traits::input_parameter< double >::type threshold(thresholdSEXP);
    rcpp_result_gen = Rcpp::wrap(cpp_dist(dist, truth, counts, ncol, threshold));
    return rcpp_result_gen;
END_RCPP
}
// cpp_dist_weighted
NumericMatrix cpp_dist_weighted(NumericMatrix dist, IntegerMatrix truth, NumericMatrix counts, int ncol, double a, double b);
RcppExport SEXP _cidr_cpp_dist_weighted(SEXP distSEXP, SEXP truthSEXP, SEXP countsSEXP, SEXP ncolSEXP, SEXP aSEXP, SEXP bSEXP) {
BEGIN_RCPP
    Rcpp::RObject rcpp_result_gen;
    Rcpp::RNGScope rcpp_rngScope_gen;
    Rcpp::traits::input_parameter< NumericMatrix >::type dist(distSEXP);
    Rcpp::traits::input_parameter< IntegerMatrix >::type truth(truthSEXP);
    Rcpp::traits::input_parameter< NumericMatrix >::type counts(countsSEXP);
    Rcpp::traits::input_parameter< int >::type ncol(ncolSEXP);
    Rcpp::traits::input_parameter< double >::type a(aSEXP);
    Rcpp::traits::input_parameter< double >::type b(bSEXP);
    rcpp_result_gen = Rcpp::wrap(cpp_dist_weighted(dist, truth, counts, ncol, a, b));
    return rcpp_result_gen;
END_RCPP
}

calc_npc() returns 0

Hi, this package is really helpful! When I ran it on my data, I found that data@nPC is zero. Is it possible for calc_npc() to return 0?

missing value?

Hi,

Is there an option in determineDropoutCandidates to handle missing values?

I made a reduced UMI count matrix for my single-cell RNA-Seq data. Then I had the "missing value" issue:

x=read.csv("filtered_reduced_matrix.csv", header=T, row.names=1, skipNul=T)
scPan <- scDataConstructor(as.matrix(x))
scPan <- determineDropoutCandidates(scPan)
Error in density.default(object@nData[, topLibraries[i]], kernel = "epanechnikov", :
'x' contains missing values

Best,
Ying
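
The error message says that the matrix passed to density() contains NA entries. As a first diagnostic, it may be worth checking the input matrix for missing values before calling scDataConstructor. The sketch below uses a made-up matrix in place of the CSV file above. How the missing entries should be handled (dropping rows, replacing with zero, or fixing the export) depends on why they arose, which this thread does not say.

```r
## Toy count matrix with one missing entry (values are made up).
x <- matrix(c(3, 0, NA, 7, 1, 0), nrow = 3,
            dimnames = list(c("g1", "g2", "g3"), c("cellA", "cellB")))

## TRUE if any entry is NA.
print(anyNA(x))

## Row and column positions of the missing entries.
na_pos <- which(is.na(x), arr.ind = TRUE)
print(na_pos)

## Genes (rows) that are complete in every cell.
complete_genes <- rownames(x)[rowSums(is.na(x)) == 0]
print(complete_genes)
```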

How to set.seed?

Dear CIDR team,

I would like to run CIDR 10 times on a dataset to get clustering results. However, when I call set.seed at the beginning of the CIDR run, I get the same result all 10 times. How can I get different results without changing the basic parameters?

Best
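
For reference, this is how set.seed behaves in R generally: resetting the same seed before each run reproduces the same random draws, while a different seed per run gives different draws. The sketch below uses runif as a stand-in for any stochastic step. Whether CIDR's clustering uses random numbers at all is not stated in this thread; if a pipeline is deterministic, changing the seed will not change its output.

```r
## Same seed: the random draws are identical on every repeat.
same <- sapply(1:3, function(i) { set.seed(42); runif(1) })
print(same)

## Different seed per repeat: the draws differ.
varied <- sapply(1:3, function(i) { set.seed(i); runif(1) })
print(varied)
```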

Error in hclust(md, method = "ward.D2") : Invalid clustering method "

trying CIDR on ChIP-seq style vectors

Hi,

Has anyone tried CIDR to cluster ChIP-seq data?
Any recommendations on where to start with?

Error running CIDR on Windows 10

Hi,

I tried to install CIDR on a Windows 10 machine, but I ran into an error. I installed Rtools as suggested and allowed Rtools to modify my system path. The following error occurred after running devtools::install_github("VCCRI/CIDR"):

'"C:/PROGRA~1/R/R-34~1.0/bin/x64/R" --no-site-file --no-environ --no-save --no-restore --quiet CMD config CC' had status 127

I followed some suggestions from http://socserv.mcmaster.ca/jfox/Courses/R/ICPSR/R-install-instructions.html, paragraph "Building Packages Under Windows, etc. (Optional)", second bullet point. But adding extra paths did not solve the problem.

How can I solve this error?

Error in if ((3 * a[b + c]) < a[b]) { : argument is of length zero

Hi,

I ran CIDR on a 32738 gene x 6143 cell dataset.
The main code is below. When running the last step (scCluster(sdata)), the function returns the error message "argument is of length zero".

Could you help me figure out the problem and fix it?

>dat <- counts(sce)
>sdata <- scDataConstructor(dat)
>sdata <- determineDropoutCandidates(sdata)
>sdata <- wThreshold(sdata)
>sdata <- scDissim(sdata, threads = params$nCore)
>sdata <- scPCA(sdata, plotPC =F)
>sdata <- nPC(sdata)
>sdata <- scCluster(sdata)
Error in if ((3 * a[b + c]) < a[b]) { : argument is of length zero

Here is the data structure of sdata.

> str(sdata)
Formal class 'scData' [package "cidr"] with 20 slots
  ..@ tags             : num [1:20751, 1:6143] 0 0 0 0 0 0 0 0 0 0 ...
  .. ..- attr(*, "dimnames")=List of 2
  .. .. ..$ : chr [1:20751] "ENSG00000228463" "ENSG00000228327" "ENSG00000237491" "ENSG00000225880" ...
  .. .. ..$ : chr [1:6143] "AAACATACACTGGT-1" "AAACATACAGACTC-1" "AAACATTGACCAAC-1" "AAACATTGAGGCGA-1" ...
  ..@ tagType          : chr "raw"
  ..@ sampleSize       : int 6143
  ..@ librarySizes     : Named num [1:6143] 19337 8786 24636 16030 27341 ...
  .. ..- attr(*, "names")= chr [1:6143] "AAACATACACTGGT-1" "AAACATACAGACTC-1" "AAACATTGACCAAC-1" "AAACATTGAGGCGA-1" ...
  ..@ nData            : num [1:20751, 1:6143] 0 0 0 0 0 0 0 0 0 0 ...
  .. ..- attr(*, "dimnames")=List of 2
  .. .. ..$ : chr [1:20751] "ENSG00000228463" "ENSG00000228327" "ENSG00000237491" "ENSG00000225880" ...
  .. .. ..$ : chr [1:6143] "AAACATACACTGGT-1" "AAACATACAGACTC-1" "AAACATTGACCAAC-1" "AAACATTGAGGCGA-1" ...
  ..@ priorTPM         : num 1
  ..@ dThreshold       : num [1:6143] 8.62 8.62 8.62 8.62 8.62 ...
  ..@ wThreshold       : Named num 9.6
  .. ..- attr(*, "names")= chr "a"
  ..@ pDropoutCoefA    : Named num 4.31
  .. ..- attr(*, "names")= chr "a"
  ..@ pDropoutCoefB    : Named num 9.6
  .. ..- attr(*, "names")= chr "b"
  ..@ dropoutCandidates: logi [1:20751, 1:6143] TRUE TRUE TRUE TRUE TRUE TRUE ...
  .. ..- attr(*, "dimnames")=List of 2
  .. .. ..$ : chr [1:20751] "ENSG00000228463" "ENSG00000228327" "ENSG00000237491" "ENSG00000225880" ...
  .. .. ..$ : chr [1:6143] "AAACATACACTGGT-1" "AAACATACAGACTC-1" "AAACATTGACCAAC-1" "AAACATTGAGGCGA-1" ...
  ..@ PC               : num [1:6143, 1:3353] 16.3 20.7 18.3 17.8 14.2 ...
  .. ..- attr(*, "dimnames")=List of 2
  .. .. ..$ : NULL
  .. .. ..$ : chr [1:3353] "Axis.1" "Axis.2" "Axis.3" "Axis.4" ...
  ..@ variation        : num [1:3353] 0.20404 0.01935 0.01156 0.01042 0.00794 ...
  ..@ eigenvalues      : num [1:6143] 1849685 175401 104792 94445 71980 ...
  ..@ dissim           : num [1:6143, 1:6143] 0 19.1 18 13.2 17.8 ...
  ..@ nCluster         : num 0
  ..@ clusters         : logi(0) 
  ..@ nPC              : num 4
  ..@ cMethod          : chr(0) 
  ..@ correction       : chr "none"

CIDR on Bioconductor?

I was just wondering if you have considered submitting CIDR to Bioconductor?

I have been working on a package which I have submitted, and I would like to use a couple of the functions in CIDR in it, but currently Bioconductor packages aren't allowed to depend on packages that are only on GitHub.

Details of the submission process are here and the Bioconductor maintainers are good at walking people through it. As well as helping me out I think it would have some benefits for you in terms of advertising CIDR and making it more accessible.

Cheers

CIDR imputed data

Dear CIDR developers,

Thanks for making CIDR freely available; it works great for me. I just have one small question:
Is it possible to extract the imputed count matrix / imputed gene expression? As far as I can see, it is calculated internally and used directly for the downstream clustering.

Kind regards
Beate

Retrieving contribution of genes to axes

With PCA, it is possible to get the contribution of the genes to the different axes. Is there a way to get this information with the PCoA of CIDR?
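
For context, this is what the question refers to in standard PCA: prcomp returns a rotation matrix of loadings, one row per variable (gene). The sketch below uses a made-up matrix and base R only. It illustrates ordinary PCA and is not a statement about what CIDR's PCoA output provides.

```r
## Toy expression matrix: 5 cells (rows) x 3 genes (columns); values are made up.
set.seed(1)
expr <- matrix(rnorm(15), nrow = 5,
               dimnames = list(paste0("cell", 1:5), c("geneA", "geneB", "geneC")))

pca <- prcomp(expr)

## rotation holds the loadings: one row per gene, one column per principal component.
print(pca$rotation)

## Squared loadings of a component sum to 1, so they can be read as
## each gene's share of that axis.
contrib_pc1 <- pca$rotation[, "PC1"]^2
print(round(contrib_pc1, 3))
```

PCoA works from a dissimilarity matrix between cells rather than from the gene-level matrix, which is why no such loadings fall out of it directly.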
