felixthestudent / cellpypes Goto Github PK

View Code? Open in Web Editor NEW

51.0 1.0 3.0 3.1 MB

Cell type pipes for R

License: GNU General Public License v3.0

R 100.00%

celltype-annotation classification-algorithm scrnaseq single-cell-rna-seq

cellpypes's Introduction

cellpypes – Cell type pipes for R

Pipe your types!
Installation
Citation
cellpypes input
List of functions
Annotating PBMCs
Understanding how cellpypes works
Function demos
FAQ

Pipe your types!

Cellpypes uses the popular piping operator %>% to manually annotate cell types in single-cell RNA sequencing data. It can be applied to UMI data (e.g. 10x Genomics).

Define gene rules interactively:

Adjust the threshold until the selected cells agree well with marker gene expression.
Use positive (CD3E+) and negative (MS4A1-) markers to annotate any subpopulation of interest.
Explore with feat, select with rule, visualize with plot_last or plot_classes and get cell type labels with classify.

Resolve detailed cell type subsets. Switch between cell type hierarchy levels in your analysis:

Installation

Install cellpypes with the following commands:

# install.packages("devtools")
devtools::install_github("FelixTheStudent/cellpypes")

Citation

To cite cellpypes, download your favorite citation style from zenodo, type citation("cellpypes") in R or simply use:

Frauhammer, Felix, & Anders, Simon. (2022). cellpypes: Cell Type Pipes for R (0.1.1). Zenodo. https://doi.org/10.5281/zenodo.6555728

cellpypes input

cellpypes input has four slots:

raw: (sparse) matrix with genes in rows, cells in columns
totalUMI: the colSums of obj$raw
embed: two-dimensional embedding of the cells, provided as data.frame or tibble with two columns and one row per cell.
neighbors: index matrix with one row per cell and k-nearest neighbor indices in columns. We recommend k=50, but generally 15<k<100 works well. Here are two ways to get the neighbors index matrix:
- Use find_knn(featureMatrix)$idx, where featureMatrix could be principal components, latent variables or normalized genes (features in rows, cells in columns).
- use as(seurat@graphs[["RNA_nn"]], "dgCMatrix")>.1 to extract the kNN graph computed on RNA. This also works with RNA_snn, wknn/wsnn or any other available graph – check with names(seurat@graphs).

Examples for cellpypes input:

# Object from scratch:
obj <- list(
  raw = counts,              # raw UMI, cells in columns
  neighbors = knn_ids,       # neighbor indices, cells in rows, k columns
  embed = umap,              # 2D embedding, cells in rows
  totalUMI = library_sizes   # colSums of raw, one value per cell
)

# Object from Seurat:
obj <- list(
    raw      =SeuratObject::GetAssayData(seurat, "counts"),
    neighbors=as(seurat@graphs[["RNA_nn"]], "dgCMatrix")>.1, # sometims "wknn"
    embed    =FetchData(seurat, dimension_names),
    totalUMI = seurat$nCount_RNA
)

# Object from Seurat (experimental short-cut):
obj <- pype_from_seurat(seurat_object)

List of functions

Functions for manual classification:

feat: feature plot (UMAP colored by gene expression)
rule: add a cell type rule
plot_last: plot the most recent rule or class
classify: classify cells by evaluating cell type rules
plot_classes: call and visualize classify

Functions for pseudobulking and differential gene expression (DE) analysis:

class_to_deseq2: Create DESeq2 object for a given cell type
pseudobulk: Form pseudobulks from single-cells
pseudobulk_id: Generate unique IDs to identify pseudobulks

Annotating PBMCs

Here, we annotate the same PBMC data set as in the popular Seurat tutorial, using the Seurat object seurat_object that comes out of it.

library(cellpypes)
library(tidyverse) # provides piping operator %>%


pype <- seurat_object %>%
  pype_from_seurat %>%
  rule("B",           "MS4A1",   ">", 1)                    %>%
  rule("CD14+ Mono",  "CD14",    ">", 1)                    %>%
  rule("CD14+ Mono",  "LYZ",     ">", 20)                   %>%
  rule("FCGR3A+ Mono","MS4A7",   ">", 2)                    %>%
  rule("NK",          "GNLY",    ">", 75)                   %>%
  rule("DC",          "FCER1A",  ">", 1)                    %>%
  rule("Platelet",    "PPBP",    ">", 30)                   %>%
  rule("T",           "CD3E",    ">", 3.5)                  %>% 
  rule("CD8+ T",      "CD8A",    ">", .8,  parent="T")      %>%
  rule("CD4+ T",      "CD4",     ">", .05, parent="T")      %>%
  rule("Naive CD4+",  "CCR7",    ">", 1.5, parent="CD4+ T") %>%
  rule("Memory CD4+",  "S100A4", ">", 13,  parent="CD4+ T")

plot_classes(pype)+ggtitle("PBMCs annotated with cellpypes")
#> as(<lgCMatrix>, "dgCMatrix") is deprecated since Matrix 1.5-0; do as(., "dMatrix") instead

All major cell types are annotated with cell pypes. Note how there are naive CD8+ T cells among the naive CD4 cells. While overlooked in the original tutorial, the marker-based nature of cellpypes revealed this. This is a good example for cellpype’s resolution limit: If UMAP cannot separate cells properly, cellpypes will also struggle – but at least it will be obvious. In practice, one would re-cluster the T cells and try to separate naive CD8+ from naive CD4+, or train a specialized machine learning algorithm to discriminate these two cell types in particular.

Understanding how cellpypes works

cellpypes works with classes defined by gene-based rules.

Whether your classes correspond to biologically relevant cell types is best answered in a passionate discussion on their marker genes you should have with your peers.

Until you are sure, “MS4A1+” is a better class name than “B cell”.

CP10K measure UMI fractions

cellpypes uses CP10K (counts per 10 thousand) in functions rule and feat. This is why and what that means:

For marker genes, CP10K typically lie in the human-friendly range between 0.1 and 10 CP10K.
A typical mammalian cell can be expected to have around 10K UMIs in total (100K mRNAs captured with 10 % conversion rate), so 1 CP10K means roughly 1 UMI in a typical cell.
If one out of 10 thousand mRNA molecules in cells originated from CD3E, then we’d expect to observe 1 CP10K in our UMI count matrix. In reality, there is technical noise, so we might see 1 CP10K in some cells and other values in others – but by design, UMI fractions and mRNA fractions are highly correlated!
CP10K are a noisy estimate of mRNA fractions. Cellpypes models this uncertainty during manual thresholding by considering the UMI counts of nearest neighbor cells as well. When deciding if cells are positive (or negative) for a marker gene, the functions classify, plot_last and plot_classes leave cells unassigned if there was not enough evidence.

Intuition behind cell pypes

cellpypes compares the expression of a cell and its nearest neighbors to a user-provided threshold, taking the uncertainty due to technical noise into account.

cellpypes assumes a cell’s nearest neighbors are transcriptionally highly similar, so that the technical noise dominates the biological variation. This means that the UMI counts for a given marker gene, say CD3E, came from cells that roughly had the same fraction of CD3E mRNA. We can not use this reasoning to infer the individual cell’s mRNA fraction reliably (which is why imputation introduces artifacts), but we can decide with reasonable confidence whether this cell was at least part of a subpopulation in which many cells expressed this gene highly. In other words:

In cellpypes logic, CD3E+ cells are virtually indistinguishable from cells with high CD3E expression. We just can’t prove they all had CD3E mRNA due to data sparsity.

Math/statistics behind cellpypes

cellpypes models UMI counts as negative binomial (NB) random variable with a fixed overdisperions of 0.01 (size parameter of 100 in R’s pnbinom), as recommended by Lause, Berens and Kobak (Genome Biology 2021).
The marker gene UMI counts are summed up across a cell and its neighbors, forming the pooled counts K.
Summation allows cellpypes to use the NB distribution when comparing expression K with the user-provided threshold t, because the sum across NB random variables is again an NB random variable.
cellpypes checks if the summed counts K are likely to have come from an NB distribution with rate parameter t*S, where t is the CP10K threshold supplied through the rule function, and S is the sum of totalUMI counts from the cell and its neighbors. If it is very likely, the counts are too close to the threshold to decide and the cell is left unassigned. If K lies above the expectancy t*S, the cell is marked as positive, if below, it is marked as negative for the given marker gene.
The threshold t is chosen such that it separates positive from negative cells, so will typically lie directly between these two populations. This means that the above procedure should not be considered a hypothesis test, because t is picked deliberately to make the null hypothesis (H0: K came from Pois(t*S)) unlikely.
Instead, cellpypes is a tool to quantify uncertainty around the threshold t. If cells were sequenced deeply, S becomes larger, which means we have more information to decide.

Function demos

The following has a short demo of every function. Let’s say you have completed the Seurat pbmc2700 tutorial (or it’s shortcut), then you can start pyping from this pbmc object:

pbmc <- pype_from_seurat(seurat_object)

feat

Visualize marker gene expression with feat (short for feature plot):

pbmc %>% 
  feat(c("NKG7", "MS4A1"))

The viridis color scale is used because it encodes higher expression with higher color intensity, and is robust to colorblindness.
Default UMAP setting produce crowded embeddings. To avoid overplotting, we recommend playing with UMAP’s min_dist and spread parameters. Compute UMAP with spread=5 and you’ll be able to see much more in your embeddings!
Manual thresholding is easier if you know whether your gene is expressed highly or lowly. In above example, I’d start with a large threshold for NKG7 (e.g. 10 CP10K) and a moderate one for MS4A1 (e.g. 1 CP10K), simply because NKG7 goes up to 381 CP10K in some cells.

rule and plot_last

Create a few cell type rules and plot the most recent one with plot_last:

pbmc %>%
  rule("CD14+ Mono",  "CD14",    ">", 1)                    %>%
  rule("CD14+ Mono",  "LYZ",     ">", 20)                   %>%
  # uncomment this line to have a look at the LYZ+ rule:
  # plot_last()   
  rule("Tcell",       "CD3E",    ">", 3.5)                  %>% 
  rule("CD8+ T",      "CD8A",    ">", 1,  parent="Tcell")   %>%
  plot_last()

The plot_last function plots the last rule you have added to the object. You can move it between any of the above lines to look at the preceding rule, as indicated by the commented plot_last call above.
Try lower and higher thresholds, until there is good agreement between positive cells (left plot) and high marker gene expression (right plot).
cellpypes classes (aka cell types) can have as many rules as you want. CD14+ monocytes have two in this example.
You can build hierarchy with the parent argument, to arbitrary depths. In this example, CD8+ T cells are CD3E+CD8A+, not just CD8A+, because their ancestor Tcell had a rule for CD3E.
Above code chunk is a neat way to document your cell type assignment. You can generate a template with neat text alignment with pype_code_template().

classify and plot_classes

Get cell type labels with classify or plot them directly with plot_classes (which wraps ggplot2 code around classify):

# rules for several cell types:
pbmc2 <- pbmc %>%
  rule("Bcell",       "MS4A1",   ">", 1)                    %>%
  rule("CD14+ Mono",  "CD14",    ">", 1)                    %>%
  rule("CD14+ Mono",  "LYZ",     ">", 20)                   %>%
  rule("Tcell",       "CD3E",    ">", 3.5)                  %>% 
  rule("CD8+ T",      "CD8A",    ">", 1,  parent="Tcell")   

pbmc2 %>% plot_classes()

pbmc2 %>% plot_classes(c("Tcell", "CD8+ T")) + ggtitle("T cells")

head(classify(pbmc2))
#> AAACATACAACCAC-1 AAACATTGAGCTAC-1 AAACATTGATCAGC-1 AAACCGTGCTTCCG-1 
#>       Unassigned            Bcell       Unassigned       Unassigned 
#> AAACCGTGTATGCG-1 AAACGCACTGGTAC-1 
#>       Unassigned       Unassigned 
#> Levels: Bcell CD14+ Mono CD8+ T Unassigned

By default, classify/plot_classes will use all classes at the end of the hierarchy. T cells are not plotted in the first example because they are not the end, they have a ‘child’ called CD8+ T. You can still have them in the same plot, by specifically asking for plot_classes(c("Tcell", "CD8+ T")) (second example).
If cell types overlap, classify returns Unassigned for cells with mutually exclusive labels (such as Tcell and Bcell).
For this reason it matters whether you call classify("Tcell") or classify(c("Tcell","Bcell") – the overlap is masked with Unassigned in this second call, but not in the first.
If a cell gets multiple labels but from the same lineage (e.g. Tcell and CD8+ T), the more detailed class is returned (CD8+ T).

class_to_deseq2

Let’s imagine the PBMC data had multiple patients and treatment conditions (we made them up here for illustraion):

head(pbmc_meta)
#>    patient treatment
#> 1 patient1   treated
#> 2 patient5   treated
#> 3 patient1   control
#> 4 patient1   treated
#> 5 patient2   treated
#> 6 patient4   control

Every row in pbmc_meta corresponds to one cell in the pbmc object.

With cellpypes, you can directly pipe a given cell type into DESeq2 to create a DESeq2 DataSet (dds) and test it:

library(DESeq2)
dds <- pbmc %>% 
  rule("Bcell",   "MS4A1",   ">", 1)     %>%
  rule("Tcell",   "CD3E",    ">", 3.5)   %>% 
  class_to_deseq2(pbmc_meta, "Tcell", ~ treatment)
#> converting counts to integer mode
# test for differential expression and get DE genes:
dds <- DESeq(dds)
#> estimating size factors
#> estimating dispersions
#> gene-wise dispersion estimates
#> mean-dispersion relationship
#> final dispersion estimates
#> fitting model and testing
data.frame(results(dds)) %>% arrange(padj) %>% head
#>                  baseMean log2FoldChange     lfcSE         stat    pvalue
#> AL627309.1     0.25114064    -0.46193495 2.2886865 -0.201834087 0.8400464
#> AP006222.2     0.15001429     0.01895508 3.1165397  0.006082093 0.9951472
#> RP11-206L10.2  0.08742633    -0.46193859 3.1165397 -0.148221631 0.8821679
#> LINC00115      0.58340105     0.43747304 1.5879796  0.275490341 0.7829395
#> SAMD11         0.08742633    -0.46193859 3.1165397 -0.148221631 0.8821679
#> NOC2L         12.53882917    -0.57349677 0.3465506 -1.654871847 0.0979505
#>                    padj
#> AL627309.1    0.9998786
#> AP006222.2    0.9998786
#> RP11-206L10.2 0.9998786
#> LINC00115     0.9998786
#> SAMD11        0.9998786
#> NOC2L         0.9998786

In this dummy example, there is no real DE to find because we assigned cells randomly to treated/control.

pseudobulk and pseudobulk_id

Instead of piping into DESeq2 directly, you can also form pseudobulks with pseudobulk and helper function pseudobulk_id. This can be applied to any single-cell count data, independent from cellpypes. For example, counts could be seurat@assays$RNA@counts and meta_df could be selected columns from [email protected].

pbmc3 <- pbmc %>% rule("Tcell",   "CD3E",    ">", 3.5)
is_class <- classify(pbmc3) == "Tcell" 
counts <- pseudobulk(pbmc$raw[,is_class], 
                     pseudobulk_id(pbmc_meta[is_class,]))
counts[35:37, 1:3]
#> 3 x 3 Matrix of class "dgeMatrix"
#>            patient1.control patient1.treated patient2.control
#> AL645608.1                0                0                0
#> NOC2L                    13                6               16
#> KLHL17                    0                0                0

FAQ

Should I report bugs?

Yes. Do it. You can search similar problems or create new issues on gitHub. If you want to be nice and like fast help for free, try to provide a minimal example of your problem if possible. Make sure to include the version of your cellpypes installation, obtained e.g. with packageVersion("cellpypes").

Why are some cells unassigned?

Unassigned cells (grey) are not necessarily bad but a way to respect the signal-to-noise limit. Unassigned cells arise for two reasons:

Not enough signal for any rule to apply. For example, outlier cells typically get few neighbors in Seurat’s SNN graph, making them negative for most rules.
Not enough separation. If two classes are highly similar, such as CD4+ and CD8+ T cell subsets, cells in the noisy class border may be positive for rules from both classes. By default, cellpypes sets them to Unassigned, but this behavior can be controlled with the replace_overlap_with argument in classify and plot_classes.

How is DE different from cluster markers?

In cellpypes logic, Differential Expression (DE) analysis refers to comparing multiple samples (patients/mice/…) from at least two different groups (control/treated/…). These so called multi-condition multi-sample comparisons have individuals, not cells, as unit of replication, and give reliable p-values.

Finding cluster markers, in contrast, is circular and results in invalid p-values (which are useful for ranking marker genes, not for determining significance). The circularity comes from first using gene expression differences to find clusters, and then testing the null hypothesis that the same genes have no differences between clusters.

Why pseudobulks?

Pseudobulk approaches have been shown to perform as advertised, while many single-cell methods do not adjust p-values correctly and fail to control the false-discovery rate. Note that DESeq2, however, requires you to filter out lowly expressed genes.

cellpypes's People

Contributors

Stargazers

Watchers

Forkers

eegk crsky1023 animesh

cellpypes's Issues

plot_classes/classify: Handle overlap between parent and child

Here is a minimal example with the problem:

simulated_umis %>% rule("B", "MS4A1", ">", 1e-4) %>% rule("T", "CD3E", ">", .1e-4) %>% rule("Treg", "FOXP3", ">", .05e-4, parent="T") %>% plot_classes(c("B","T","Treg"))

As you can see, we can't plot Tregs and all other Tcells in the same plot, because all Tregs are removed as overlap this way.

Possible solutions:

don't replace overlap if it is between a (leaf) class and one of its ancestors. Instead, use leaf class, not ancestor class.
warn the user that more than 15 % of a class is overlapping with another class... ties in with plot_overlap function I am planning I guess?

Bonus feature:

perhaps introduce shortcuts that can be passed to classify's classes argument. For example ..leafs.. or leafs(). Then I could do something like this: classify(classes=c(..leafs.., "T")).

Function to combine objects

combine() or similar in name

combines cellpypes objects, which is useful for pseudobulking and setting same thresholds for all samples
makes it easier to compute Differential expression
perhaps I could write a deseq <- function(obj, class) function, that returns obj again with DESeq2 results. Then one could start playing with plotting log-foldchanges against each other, e.g. for different marker gene sets or hierarchy levels.
I propose that DE genes in scRNAseq are also something to explore, not a static inference result. While problematic for multiple testing correction (how to correct for interactively playing around until results are judged as good?), I like the idea that DE is also amenable to exploration.

dot_plot function

Here's code to generate a dot plot of marker genes, given an identity_dataframe. Perhaps add this function to cellpypes package?

Usage example:

Tsubsets_seurat = data.frame(
  Th = malt$seurat_clusters %in% c(2, 5, 7),
  Ttox=malt$seurat_clusters == 3
)


# The difference in purity is invisible in dot plots:
genes = c("CD4", "CD8B", "CD8A")
dot_plot(FetchData(malt, genes, slot="counts"),
         malt$nCount_RNA,
         Tsubsets_seurat)

Function code:

# stuff to develop the function:
# obj = malt %>% pype_from_seurat() 
# identity_dataframe = data.frame(Bcell=malt$seurat_clusters==0,
#                                 Tcell=malt$seurat_clusters==7)
# counts   = FetchData(malt, c("CD3E", "CD79B"), slot = "counts")
# totalUMI = malt$nCount_RNA


dot_plot <- function(counts, totalUMI, identity_dataframe, return_data=FALSE){
  
  res =data.frame(
    celltype = NULL,
    gene     = NULL,
    percent_expressing = NULL,
    average_expression = NULL
  )
  
  for(celltype in colnames(identity_dataframe)) {
    for(i in 1:ncol(counts)) {
      umi = counts[, i]
      relevant <-identity_dataframe[,celltype] 
      res = rbind(res, 
                  data.frame(
                    celltype = celltype,
                    gene     = colnames(counts)[i],
                    percent_expressing = 100 * mean(umi[relevant] > 0),
                    average_expression = 1e4*mean(umi[relevant]/totalUMI[relevant]) 
                  ))
    }
  }
  
  
  plot <- res %>%
    ggplot(aes(celltype, gene, size=percent_expressing, col=average_expression)) +
    geom_point() +
    scale_size_continuous(name = "Percent Expressed") +
    viridis::scale_color_viridis(name="Average Expression") +
    cowplot::theme_cowplot()
  
  if(return_data) return(res)
  return(plot)
  
}`

Existing labels

Create an object slot where the user can put existing labels (e.g. output from Louvain, reference label transfer, etc.).

Write functions to:

manually edit selected clusters. Specifically, a vector of existing classes could be supplied as parent to the rule function and then we simply define subsets of these with cellpypes.
rename classes
group classes together (see issue #8)

class_factor function and class_matrix function

Currently I have the "class_boolean" function, that returns a boolean vector or matrix.
I want to rename this to 'class_matrix' function to make it clear that a matrix is returned.

Simon also suggested a "classify" function, and I could name it "celltype_factor" and "celltype_matrix", not sure yet.

class_factor requires a few decisions:

How to deal with overlap? Several options. class_factor could label cells with strings such as "Unassigned" or "multiple_labels". Alternatively it could create cell type labels by concatenating the two (or more) classes for which a cell fulfills the rules (such as T_B or T_B_monocyte).
Hierarchies: by default, should only "leaves" of the hierarchy tree, i.e. classes that do not have children.

pype_from_seurat function

Converting Seurat objects to cellpypes object is best done with a wrapper that guides the user towards reasonable choices, e.g. pype_from_seurat function. Parameters that might make sense:

neighborGraph="kNN/SNN/WNN/..."
assay should be RNA

Useful accessors and code:

s <- CreateSeuratObject(t(malt_raw))
wnn <- s
wnn[["ADT"]] <- CreateAssayObject(counts = t(malt_prot))
Key(wnn[["RNA"]]); Key(wnn[["ADT"]]) 

DefaultAssay(wnn) <- 'RNA'
wnn <- NormalizeData(wnn) %>% FindVariableFeatures() %>% ScaleData() %>% RunPCA()
DefaultAssay(wnn) <- 'ADT'
VariableFeatures(wnn) <- rownames(wnn[["ADT"]])
wnn <- NormalizeData(wnn, normalization.method = 'CLR', margin = 2) %>% 
  ScaleData()  %>% RunPCA(npcs=10, reduction.name="apca", approx=F)
wnn <- FindMultiModalNeighbors(
  wnn, reduction.list = list("pca", "apca"), 
  dims.list = list(1:30, 1:10)
)

pype_from_seurat only works on simple cases

pype_from_seurat gave the following error:

Error in methods::as(seurat@graphs[[graph_choice]], "dgCMatrix") : no method or default for coercing “NULL” to “dgCMatrix”

The solution was to use the explicit object creation strategy outlined in the tutorial, like this:

obj <- list(
    raw      = SeuratObject::GetAssayData(midgut.combined, "counts"),
    neighbors= as(midgut.combined@graphs[["integrated_nn"]], "dgCMatrix")>.1,
    embed    = FetchData(midgut.combined, c("UMAP_1", "UMAP_2")),
    totalUMI = midgut.combined$nCount_RNA
)

Genes in rows (not columns)

In bioinformatics, the features (genes) go in rows. For the rest of statistics and data science it's the other way around :D

plot_last: make plots the same size

Plot_last creates two plots:

rule plot, colours indicate TRUE/FALSE
feature plot, colour indicates gene expression levels

It's notoriously difficult to make the two plots align vertically, because they are different in width. The rel_widths ratio varies depending on how many digits the maximum expression has (as this is displayed in feat's legend).

Solution I'd like to implement:

make both plots a little wider, by extending the x-axis (increasing the maximal UMAP1 coordinate value)
for feat, put the legend into this area as inset graph. Does that work?!
for rule, put TRUE / FALSE together with the corresponding number of cells there, colored same as the points.

plot_last function

Here's code to get you started, taken from the script playground/ms.Rmd. It has these innovations over other plot_last functions I made in different scripts:

uses check_obj to verify object has the expected slots
works if embedding is a tbl, which requires using drop=T like this: obj$embed[,1, drop=T]
that's it :)

plot_last <- function(obj, what="rule") {
  check_obj(obj)
  if(what=="rule") {
    last_rule <- obj$rules[nrow(obj$rules),]
    boolean=evaluate_rule(obj,
                          last_rule$class, last_rule$feature,
                          last_rule$operator,   last_rule$threshold)
    plot_title <- paste0("Rule: ", last_rule$feature, last_rule$operator, 
                         last_rule$threshold)
  } else if (what=="class") {
    last_class <- obj$rules[nrow(obj$rules), "class"]
    boolean=drop(classify(obj, classes=last_class, return_logical_matrix = T))
    plot_title <- paste0("Class: ", last_class)
  } else { stop("plot_last argument 'what' should either be rule or class.") }

  ggplot(data=data.frame(V1=obj$embed[,1, drop=T], # tbl makes drop necessary
                         V2=obj$embed[,2, drop=T],
                         last=boolean),
         aes(V1, V2, col=last))+coord_fixed()+
    geom_point()  +
    xlab( colnames(obj$embed)[1] ) + 
    ylab( colnames(obj$embed)[2] ) +
    ggtitle(plot_title)
    
}

Backtrace which cells that got classified?

Hi Felix,
Thank you so much for this very useful tool!

I was wondering if you were considering adding a function to backtrace the classification to individual cells of the Single-cell RNA seq data. I would love to be able to import the results classifier back to my Seurat object with the classification.

Al the best,
Peter

embedding instead of embed

Currently, my scripts have an obj$embed slot. I would like to change this to obj$embedding (as discussed with Simon on 9th June).

format_rules function

I think readibility is improved if all calls to rule function share the same format. Here is a skeleton:

# skeleton:
obj %>%
  rule(  "A",       "gene1",          ">",   1e-3              )             %>%
  rule(  "A",       "gene1",          ">",   1e-3, parent = "A")             %>%
  rule(  "A",       "gene1",          ">",   1e-3, parent = "A")

Two things might be useful, please experiment with them:

a function that generates above skeleton, so that I can copy-paste it into my R script and start modifying it.
Perhaps it is too annoying though to adjust white spaces every time, let's see.
a function that takes an object AFTER the rules were added, and outputs the rules formatted like this.
The user could then copy-paste the nicely formatted rules into his Rscript, so that at least after developing the
code the script looks nice.

Perhaps think about this more!

cellpypes does not handle objects processed with SoupX

Starting with Seurat objects where the raw matrices are processed with SoupX, giving counts with decimals.

Here's my script
```obj <- list(
    raw      = SeuratObject::GetAssayData(lympho_NKT, "counts"),
    neighbors=as(lympho_NKT@graphs[["CSS_PCA"]], "dgCMatrix")>.1, 
    embed    =FetchData(lympho_NKT, vars=c("umapCSS_1","umapCSS_1")),
    totalUMI = lympho_NKT$nCount_RNA
)
pype <- obj %>%
  rule("T",           "CD3E",    ">", 2)                  %>% 
  rule("CD8+ T",      "CD8A",    ">", 1,  parent="T")

I get this error:

Error in check_obj(obj) : s and as.integer(s) are not equal
Mean relative difference: 0.0001429285

parent invalid gives unclear error message

Using a non-existent parent gives a cryptic error message. Minimal example:
simulated_umis %>% rule("T", "CD3E", ">", 1e-4, parent="oops") %>% plot_classes()

Note that the error only occurs through plot_classes. I think that's good behavior, we would not want to create obstacles when interactively developing rules.

The error is cryptic, please fix:

Error in if (current_parent == "..root..") { : argument is of length zero

precompute function

Problem:
For large data set (MS data from Schirmer group, for example), the interactive commands (rule + plot_last) are permissively slow.

Solution:
Precomputing the totalUMI S would speed things up for the MS data set, since raw is in gene-wise HDF5Array format (so cell-wise computations take long).

I found it unnecessary to precompute K when using an NxN neighbor graph (SNN > .1), since that uses matrix multiplication which is incredibly fast. I still have to test an Nxk neighbor graph (i.e. matrix with indices of kNN for each cell), that might be slower.

Both cases could be solved with something like this:

obj <- list(raw, neighbors, embedding)
obj <- precompute(obj)  # computes S
obj <- precompute(obj, genes=c("PLP1", "AQP1") # precomputes genes

options function for cellpypes

Perhaps it makes sense to set options at beginning of pypes like this:

obj %>%
  options(...) %>%
  rule(...) %>% rule(...)

Parameters I can imagine:

threshold_units: "fractions", "permille", "per10k", "ppm". This way, I could write .1 instead of .1e-3 in all rules, increasing
readibility. Also, the feature plot function could display permille/per10k/... in its color legend.
perhaps pull in meta.data here, so that you can write rules for totalUMI, percent.mito etc.?
method: "count_pooling", "saver", or others?

Prepare for upcoming Seurat v5 release

I am opening this issue as a notification because cellpypes is listed here as a package that relies (depends/imports/suggests) on Seurat. As you may know, we recently released Seurat v5 as a beta in March of this year, with new updates for spatial, multimodal, and massively scalable analysis. For more information on updates and improvements, check out our website https://satijalab.org/seurat/.
We are now preparing to release Seurat v5 to CRAN, and plan to submit it on October 23rd. While we have tried our best to keep things backward-compatible, it is possible that updates to Seurat and SeuratObject might break your existing functionality or tests. We wanted to reach out before the new version is on CRAN, so that there's time to report issues/incompatibilities and prepare you for any changes in your code base that might be necessary.

We apologize for any disruption or inconvenience, but hope that the improvements to Seurat v5 will benefit your users going forward.
To test the upcoming release, you can install Seurat from the seurat5 branch using the instructions available on this page: https://satijalab.org/seurat/articles/install.

Thank you!
Seurat v5 team

threshold in rule function

Hi,
First of all, thank you for this wonderful tool.
I have a question about how to convert from count matrix to cp10k units.

It is said that the unit of argument threshold of the 'rule' function is cp10k. Is the way to get the unit cp10k from count matrix dividing the gene-specific UMI count by the total number of UMI in the cell and multiplying by 10,000?
Can you please tell me if this is correct? If not, what is the right way to get cp10k from count matrix?

Thank you very much in advance!

Function: group_classes (OR logic for rules)

Currently all rules are combined with AND, but there might be cases where OR is useful.

Motivation
Use-case 1: For example, markers a1 and a2 might have nice but imperfect overlap, and both mark cell type A. If A has subtypes it may well be we'd want to set the subtype's parent to either a1 OR a2. For this, introducing a group called "A" as a kind of virtual class (no own rules) would help.

Use-case 2: for plotting, we might want to have all myeloid cells in the same colors. Introducing the "myeloid" class could achieve this:

obj %>%
rule("mono", "CD14", .1) %>%
rule("DC",   "CD1C", .1) %>%
group_classes("myeloid"=c("mono","DC") %>%
rule("T",    "CD3E", .1) %>%
plot_classes( c("myeloid", "T"))

Since mono and DC have no common parent, this can not be achieved without the group_classes function.

How to implement
obj$classes currently has "class" and "parent" columns. parent is either a class name or the special key word "..root..". I propose to create the second key word "..group.." and the slot obj$class_groups; this could be a named list (name is group name, elements are classes belonging to the group.

plot_overlap function

Some classes may have many overlapping cells, it would be nice to visualize this somehow.

How to handle hierarchies? Overlap between parent and child class is obviously a superset-subset relationship, so that'd be pointless. See also the issue on class_factor function.

Recommend Projects

React

A declarative, efficient, and flexible JavaScript library for building user interfaces.
Vue.js

🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
Typescript

TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
TensorFlow

An Open Source Machine Learning Framework for Everyone
Django

The Web framework for perfectionists with deadlines.
Laravel

A PHP framework for web artisans
D3

Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

javascript

JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
web

Some thing interesting about web. New door for the world.
server

A server is a program made to process requests and deliver data to clients.
Machine learning

Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Visualization

Some thing interesting about visualization, use data art
Game

Some thing interesting about game, make everyone happy.

Recommend Org

Facebook

We are working to build community through open source technology. NB: members must have two-factor auth.
Microsoft

Open source projects and samples from Microsoft.
Google

Google ❤️ Open Source for everyone.
Alibaba

Alibaba Open Source for everyone
D3

Data-Driven Documents codes.
Tencent

China tencent open source team.