Code Monkey home page Code Monkey logo

clustifyr's Introduction

clustifyr

R-CMD-check-bioc Codecov test coverage platforms bioc #downloads

clustifyr classifies cells and clusters in single-cell RNA sequencing experiments using reference bulk RNA-seq data sets, sorted microarray expression data, single-cell gene signatures, or lists of marker genes.

Installation

Install the Bioconductor version with:

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("clustifyr")

Install the development version with:

BiocManager::install("rnabioco/clustifyr")

Example usage

In this example we use the following built-in input data:

  • an expression matrix of single cell RNA-seq data (pbmc_matrix_small)
  • a metadata data.frame (pbmc_meta), with cluster information stored ("classified")
  • a vector of variable genes (pbmc_vargenes)
  • a matrix of mean normalized scRNA-seq UMI counts by cell type (cbmc_ref)

We then calculate correlation coefficients and plot them on a pre-calculated projection (stored in pbmc_meta).

library(clustifyr)

# calculate correlation
res <- clustify(
  input = pbmc_matrix_small,
  metadata = pbmc_meta$classified,
  ref_mat = cbmc_ref,
  query_genes = pbmc_vargenes
)

# print assignments
cor_to_call(res)
#> # A tibble: 9 ร— 3
#> # Groups:   cluster [9]
#>   cluster      type           r
#>   <chr>        <chr>      <dbl>
#> 1 B            B          0.909
#> 2 CD14+ Mono   CD14+ Mono 0.915
#> 3 FCGR3A+ Mono CD16+ Mono 0.929
#> 4 Memory CD4 T CD4 T      0.861
#> 5 Naive CD4 T  CD4 T      0.889
#> 6 DC           DC         0.849
#> 7 Platelet     Mk         0.732
#> 8 CD8 T        NK         0.826
#> 9 NK           NK         0.894

# plot assignments on a projection
plot_best_call(
  cor_mat = res,
  metadata = pbmc_meta,
  cluster_col = "classified"
)

clustify() can take a clustered SingleCellExperiment or seurat object (both v2 and v3) and assign identities.

# for SingleCellExperiment
clustify(
  input = sce_small,          # an SCE object
  ref_mat = cbmc_ref,         # matrix of RNA-seq expression data for each cell type
  cluster_col = "cell_type1", # name of column in meta.data containing cell clusters
  obj_out = TRUE              # output SCE object with cell type inserted as "type" column
) 
#> class: SingleCellExperiment 
#> dim: 200 200 
#> metadata(0):
#> assays(2): counts logcounts
#> rownames(200): SGIP1 AZIN2 ... TAF12 SNHG3
#> rowData names(10): feature_symbol is_feature_control ... total_counts
#>   log10_total_counts
#> colnames(200): AZ_A1 AZ_A10 ... HP1502401_E18 HP1502401_E19
#> colData names(35): cell_quality cell_type1 ... type r
#> reducedDimNames(0):
#> mainExpName: NULL
#> altExpNames(0):

library(Seurat)
# for Seurat3/4
clustify(
  input = s_small3,
  cluster_col = "RNA_snn_res.1",
  ref_mat = cbmc_ref,
  seurat_out = TRUE
)
#> An object of class Seurat 
#> 230 features across 80 samples within 1 assay 
#> Active assay: RNA (230 features, 20 variable features)
#>  2 dimensional reductions calculated: pca, tsne

# New output option, directly as a vector (in the order of the metadata), which can then be inserted into metadata dataframes and other workflows
clustify(
  input = s_small3,
  cluster_col = "RNA_snn_res.1",
  ref_mat = cbmc_ref,
  vec_out = TRUE
)
#>  [1] "Mk"         "Mk"         "Mk"         "Mk"         "Mk"        
#>  [6] "Mk"         "Mk"         "Mk"         "Mk"         "Mk"        
#> [11] "B"          "B"          "B"          "B"          "B"         
#> [16] "B"          "B"          "B"          "B"          "B"         
#> [21] "CD16+ Mono" "CD16+ Mono" "CD16+ Mono" "CD16+ Mono" "CD16+ Mono"
#> [26] "CD16+ Mono" "CD16+ Mono" "CD16+ Mono" "CD16+ Mono" "CD16+ Mono"
#> [31] "Mk"         "B"          "Mk"         "Mk"         "Mk"        
#> [36] "Mk"         "Mk"         "Mk"         "Mk"         "Mk"        
#> [41] "Mk"         "B"          "Mk"         "Mk"         "B"         
#> [46] "B"          "Mk"         "Mk"         "Mk"         "Mk"        
#> [51] "CD16+ Mono" "CD16+ Mono" "B"          "CD16+ Mono" "CD16+ Mono"
#> [56] "CD16+ Mono" "CD16+ Mono" "CD16+ Mono" "CD16+ Mono" "Mk"        
#> [61] "B"          "CD16+ Mono" "B"          "CD16+ Mono" "B"         
#> [66] "CD16+ Mono" "CD16+ Mono" "CD16+ Mono" "CD16+ Mono" "B"         
#> [71] "Mk"         "Mk"         "Mk"         "Mk"         "Mk"        
#> [76] "Mk"         "Mk"         "Mk"         "Mk"         "CD16+ Mono"

New reference matrix can be made directly from SingleCellExperiment and Seurat objects as well. Other scRNAseq experiment object types are supported as well.

# make reference from SingleCellExperiment objects
sce_ref <- object_ref(
  input = sce_small,               # SCE object
  cluster_col = "cell_type1"       # name of column in colData containing cell identities
)
#> The following clusters have less than 10 cells for this analysis: co-expression, ductal, endothelial, epsilon, MHC class II, PSC. Classification is likely inaccurate.

# make reference from seurat objects
s_ref <- seurat_ref(
  seurat_object = s_small3,
  cluster_col = "RNA_snn_res.1"
)

head(s_ref)
#>                 0        1        2
#> MS4A1    0.000000 1.126047 5.151065
#> CD79B    2.469341 2.920407 5.031316
#> CD79A    0.000000 2.535151 5.375681
#> HLA-DRA  3.640368 6.008446 7.055386
#> TCL1A    0.000000 1.495867 4.963367
#> HLA-DQB1 1.603068 3.836290 5.137422

clustify_lists() handles identity assignment of matrix or SingleCellExperiment and seurat objects based on marker gene lists.

clustify_lists(
  input = pbmc_matrix_small,
  metadata = pbmc_meta,
  cluster_col = "classified",
  marker = pbmc_markers,
  marker_inmatrix = FALSE
)
#>                      0        1        2         3         4        5        6
#> Naive CD4 T  1.5639055 20.19469 31.77095  8.664074 23.844992 19.06931 19.06931
#> Memory CD4 T 1.5639055 20.19469 31.77095 10.568007 23.844992 17.97875 19.06931
#> CD14+ Mono   0.9575077 14.70716 76.21353 17.899569 11.687739 49.86699 16.83210
#> B            0.6564777 12.70976 31.77095 26.422929 13.536295 20.19469 13.53630
#> CD8 T        1.0785353 17.97875 31.82210 12.584823 31.822099 22.71234 40.45383
#> FCGR3A+ Mono 0.6564777 13.63321 72.43684 17.899569  9.726346 56.48245 14.61025
#> NK           0.6564777 14.61025 31.82210  7.757206 31.822099 22.71234 45.05072
#> DC           0.6564777 15.80598 63.34978 19.069308 13.758144 40.56298 17.97875
#> Platelet     0.5428889 13.34769 59.94938 14.215244 15.158755 46.92861 19.49246
#>                      7          8
#> Naive CD4 T   6.165348  0.6055118
#> Memory CD4 T  6.165348  0.9575077
#> CD14+ Mono   25.181595  1.0785353
#> B            17.899569  0.1401901
#> CD8 T         7.882145  0.3309153
#> FCGR3A+ Mono 21.409177  0.3309153
#> NK            5.358651  0.3309153
#> DC           45.101877  0.1401901
#> Platelet     19.492465 59.9493793

clustify_lists(
  input = s_small3,
  marker = pbmc_markers,
  marker_inmatrix = FALSE,
  cluster_col = "RNA_snn_res.1",
  seurat_out = TRUE
)
#> An object of class Seurat 
#> 230 features across 80 samples within 1 assay 
#> Active assay: RNA (230 features, 20 variable features)
#>  2 dimensional reductions calculated: pca, tsne

Additional resources

  • Script for benchmarking, compatible with scRNAseq_Benchmark

  • Additional reference data (including tabula muris, immgen, etc) are available in a supplemental package clustifyrdatahub. Also see list for individual downloads.

  • See the FAQ for more details.

clustifyr's People

Contributors

agillen avatar dcgenomics avatar hpages avatar jayhesselberth avatar jwokaty avatar kriemo avatar nturaga avatar raysinensis avatar sheridar avatar sidhantpuntambekar avatar yueyvettehao avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

clustifyr's Issues

additional features, to do

[immediate]
other scRNA-seq methods (sci? homemade?)
feature selection on ref side (ie most important features to preserve correct calls)
pruning? https://jef.works/blog/2018/02/28/stability-testing/
UMI distribution?
look up tissue specific expression feature selection

[soonish]
per cell cor in C++

[way down the line ideas]
some form of correction for internal priming from 10x data?

[done]
plot umap, find seurat umap slot
put genelist options into main wrapper
handle seurat 3
manual assignment wrapper function
output top 3 calls, allowing visualization of how strong the call is VS next best
incorporation of list (a couple genes) and matrix
absolute negatives to marker gene list (sort of)
umap reclustering (intentionally over cluster then merge?)
built-in general reference from tabula muris -> move to separate data package
built-in general reference from recount2?
ALRA incorporation (can't use m3drop anymore, maybe just vargenes)

Simple example analysis with plotting code

library(RClusterCT)
library(ggplot2)
data("pbmc4k_avg");
data("pbmc_bulk_matrix");
gene_constraints <- list(pbmc4k_vargenes, rownames(pbmc_bulk_matrix));
pbmc4k_avg <- select_gene_subset(pbmc4k_avg, gene_constraints);
pbmc_bulk_matrix <- select_gene_subset(pbmc_bulk_matrix, gene_constraints);

out <- lapply(colnames(pbmc4k_avg),
              function(x){
                per_col <- lapply(colnames(pbmc_bulk_matrix),
                                  function(y){
                                    compute_similarity(pbmc4k_avg[,x],
                                                       pbmc_bulk_matrix[,y], corr_coef,
                                                       method = "spearman")})
                do.call(cbind, per_col)
              })

res <- do.call(rbind, out)
rownames(res) <- colnames(pbmc4k_avg)
colnames(res) <- colnames(pbmc_bulk_matrix)
res[1:10, 1:10]
#>   primary human B cells rep 1 primary human myeloid DC rep 1
#> 0                  0.50820425                      0.4579628
#> 1                  0.32281079                      0.7202308
#> 2                  0.70432854                      0.4584629
#> 3                  0.49785471                      0.4361370
#> 4                  0.44130928                      0.4059912
#> 5                  0.40268039                      0.4198868
#> 6                  0.39736997                      0.6842751
#> 7                  0.36540944                      0.7451606
#> 8                  0.44289258                      0.4494797
#> 9                  0.06624097                      0.3350129
#>   primary human monocytes rep 1 primary human neutrophils rep 1
#> 0                     0.4022359                       0.3689856
#> 1                     0.7752308                       0.6415800
#> 2                     0.3883629                       0.3570139
#> 3                     0.3782507                       0.3569020
#> 4                     0.3422209                       0.3380242
#> 5                     0.3568106                       0.3518037
#> 6                     0.6643071                       0.5200495
#> 7                     0.6849474                       0.5670303
#> 8                     0.4079214                       0.3208810
#> 9                     0.3462206                       0.2792161
#>   primary human NK cells rep 1 primary human PBMC rep 1
#> 0                    0.5551566                0.5746751
#> 1                    0.3957439                0.6468407
#> 2                    0.4228903                0.4963271
#> 3                    0.5955744                0.5769416
#> 4                    0.6570067                0.5806158
#> 5                    0.7198484                0.5841992
#> 6                    0.4178842                0.6037755
#> 7                    0.4332017                0.6213894
#> 8                    0.3562691                0.4491803
#> 9                    0.1525381                0.3215894
#>   primary human T cells rep 1 primary human B cells rep 2
#> 0                  0.67825132                  0.50702268
#> 1                  0.29332822                  0.34436167
#> 2                  0.44460967                  0.71343253
#> 3                  0.70072958                  0.49931182
#> 4                  0.63764459                  0.44837165
#> 5                  0.57486120                  0.41192369
#> 6                  0.32515553                  0.41995243
#> 7                  0.34017640                  0.39020492
#> 8                  0.33956974                  0.45445820
#> 9                  0.05847776                  0.08712384
#>   primary human myeloid DC rep 2 primary human monocytes rep 2
#> 0                      0.4560228                     0.3908564
#> 1                      0.7079754                     0.7876344
#> 2                      0.4580736                     0.3794703
#> 3                      0.4332183                     0.3689152
#> 4                      0.4044759                     0.3372726
#> 5                      0.4171209                     0.3511799
#> 6                      0.6806905                     0.6662284
#> 7                      0.7437065                     0.6927999
#> 8                      0.4451127                     0.3973351
#> 9                      0.3345931                     0.3615359


plot_tsne(pbmc4k_meta, feature = "classified")

plot_cor(res,
         pbmc4k_meta,
         colnames(res)[1:7])
#> Warning: package 'bindrcpp' was built under R version 3.4.4
#> [[1]]

#> 
#> [[2]]

#> 
#> [[3]]

#> 
#> [[4]]

#> 
#> [[5]]

#> 
#> [[6]]

#> 
#> [[7]]

Created on 2018-06-22 by the reprex package (v0.2.0).

sample plot

library(tidyverse)
library(RClusterCT)

ggplot(
  pbmc4k_meta,
  aes(x = tSNE_1, y = tSNE_2, color = cluster)
) + geom_point()

Created on 2018-06-21 by the reprex package (v0.2.0).

use of Seurat

It would be great if we could avoid using Seurat. Too many dependencies.

Cleanup inst directory

What do we need to keep, if anything?

> fs::dir_ls("inst/")
inst/RForest
inst/dl_recount.R
inst/evaluate_refine_clusters.R
inst/flowchart.png
inst/images
inst/logo
inst/negative_controls.Rmd
inst/note_on_refine_clusters.txt
inst/presentation.pptx
inst/var_genes_M3Drop.Rmd

plotting code

library(tidyverse)
library(RClusterCT)

cont <- pbmc4k_meta %>%
  mutate(colx = sample(10, size = n(), replace = TRUE))
         
plot_tsne(cont, feature = colx)

cats <- cont %>% mutate(colx = as.character(colx))

plot_tsne(cats, feature = colx)

Created on 2018-06-21 by the reprex package (v0.2.0).

refine_clusters

Do we use this? Should it be re-written in Rcpp? Lots of loops in there.

Convert use_seurat_comp/meta to S3

What is a comp? component? expression? I don't think this is a good suffix.

And why is it use_)? Couldn't it just be seurat_expr() and seurat_meta()?

Need to dispatch on v2 and v3.

Add master function to predict cell/cluster ID

  • Call it classifyR()?
  • Take scRNA-seq matrix, metadata, and classifier info (bulk RNA-seq, gene list, etc.) and return a list of predicted IDs per-cell/cluster.
  • Arguments for correlation method, score/correlation coefficient cutoff, p-value cutoff, etc.

how to run per cell

# run correlation per cell
res <- run_cor(pbmc4k_matrix,
        pbmc4k_meta,
        pbmc_bulk_matrix,
        pbmc4k_vargenes,
        if_permute = F,
        compute_method = corr_coef, 
        per_cell = T,
        metadata = "rn")

To do list for hackathon

package building:

  • fix the logo
  • rename to clustifyr (so it's clustifyr::clustifyr())
  • remove s_small and s_small3, so no package errors are thrown in building, but that means removing tests as well
  • evaluate vignette chunks (to identify errors) <- mostly unreferenced packages and data files, currently set as eval = F
  • currently all RMD examples need clustifyrdata, is that ok or do we need to subset to 2 clusters?
  • dplyr variables throwing warnings in building
  • does this install and run on windows?
  • put on bioconductor

good code practice:

  • code tidying by packages
  • no "print()"
  • all T/F update to TRUE/FALSE
  • no if (abc == TRUE)
  • add usage examples to all functions?
  • clusters in seurat 3 meta now come as factors, which throws a lot of warnings
  • unify var/option order and names
  • opinions on the state of the tutorial website
  • tutorial example of going from ucsc data (like https://cells.ucsc.edu/?ds=cortex-dev#, there's also annotated markers at cole-trapnell-lab/garnett#4), and Ebi SCE atlas (https://www.ebi.ac.uk/gxa/sc/experiments/E-EHCA-2/downloads, also need metadata)
  • clustifyrdata cleaning and documenting, also needs to be renamed now, also too large now?

classification theory:

  • what is the best dataset example? pbmcs aren't actually that good (CD8) without correction
  • way to assess whether markers are actually good
  • negative selection on markers
  • lower limit on fgsea n_perm?

clustify S3 function

We should provide clustify as an S3 generic that dispatches on a matrix (currently run_cor()) or on a Seurat object (currently clustify_seurat()). This would also enable other inputs in the future.

PBMC cell types

B-Cells (CD19 positive)
T-Cells (CD3 positive and CD8 positive)
Megakaryocytes
Natural Killer Cells
Dendritic Cells
Monocytes

example of how to download data from recount

install recount

source("https://bioconductor.org/biocLite.R")
biocLite("recount")

Next we load the required packages. Loading recount will load the required dependencies.

Load recount R package

library('recount')
download_study("SRP051688")
load(file.path("SRP051688", "rse_gene.Rdata"))
unlink("SRP051688", recursive = TRUE)
rse_gene
## covert to read counts
rse <- scale_counts(rse_gene)
head(assay(rse, "counts"))

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.