Code Monkey home page Code Monkey logo

hitme's Introduction

HiTME 🎯 👊

High-resolution Tumor Micro-Environment cell type classification and compositional analysis for single-cell RNA-seq

HiTME is designed for precise cell type classification within the complex tumor microenvironment (TME), providing high accuracy and interpretability in cell type identification.

Find a vignette describing its main functions in html and its code (repository).

Installation

remotes::install_github("carmonalab/HiTME")

How to cite HiTME

Please note that the publication describing HiTME is currently in preparation. In the meantime, we kindly ask that you cite the two primary components of HiTME in your work:

  • scGate: Andreatta, Massimo, Ariel J. Berenstein, and Santiago J. Carmona. 2022. “scGate: Marker-Based Purification of Cell Types from Heterogeneous Single-Cell RNA-Seq Datasets.” Bioinformatics 38 (April): 2642–44. https://doi.org/10.1093/BIOINFORMATICS/BTAC141.
  • ProjecTILs: Andreatta, Massimo, Jesus Corria-Osorio, Sören Müller, Rafael Cubas, George Coukos, and Santiago J. Carmona. 2021. “Interpretation of t Cell States from Single-Cell Transcriptomics Data Using Reference Atlases.” Nature Communications 2021 12:1 12 (May): 1–19. https://doi.org/10.1038/s41467-021-23324-4.

Cell type annotation

HiTME is an R package that combines scGate and ProjecTILs to classify cell types in single-cell RNA-seq data at high resolution and with large flexibility (e.g. easy to include new cell types).

The function takes as input Seurat objects (or list of them). These should be split by sample to avoid batch effects, or split internally in Run.HitME by indicating the parameter split.by.

This wrapper firstly runs scGate (easily customizable) marker-based classification, resulting in a coarse-grained cell type classification (CD4T, B cell, Dendritic cell...). Next, it runs for each broad cell type ProjecTILs for a finer cell type classification (CD4+ TFH, Tex CD8+, cDC1...) based on cell mapping onto expert-curated single-cell reference maps.

library(scGate)
library(ProjecTILs)
library(HiTME)

# If multiple samples are within the same Seurat object, split by sample.
# obj.list <- SplitObject(obj, split.by = "Sample")

# Define scGate model if other than default is wanted
scGate_models_DB <- get_scGateDB(branch = "master")
models.TME <- scGate_models_DB$human$HiTME

# Load ProjecTILs reference maps
path_ref <- "~/reference_atlases"
ref.maps <- list(CD8 = load.reference.map(file.path(path_ref, "CD8T_human_ref_v1.rds")),
                 CD4 = load.reference.map(file.path(path_ref, "CD4T_human_ref_v2.rds")),
                 DC = load.reference.map(file.path(path_ref, "DC_human_ref_v1.rds")),
                 MoMac = load.reference.map(file.path(path_ref, "MoMac_human_v1.rds"))
                 )

By default scGate (layer 1) will return the cell ontology ID for each predicted cell type. This ID will be then used to link each coarse cell type with its respective reference map for finer cell type classification using ProjecTILs. Hence, we need to indicate each respective cell ontology ID(s) for each reference map.

If alternative cell type link are used between the coarse and finer cell type classification, this must be specified in Run.HiTME using layer1_link parameter.

# add scGate_link to ref.maps
# Include a slot in @misc with the cell name output by scGate
# By default scGate returns cell ontology ID

layer1.links <- list("CD8" = "CL:0000625",
                  "CD4" = "CL:0000624",
                  "DC" = "CL:0000451",
                  "MoMac" = "CL:0000576_CL:0000235"
                  )
                  
for(a in names(ref.maps)){
  ref.maps[[a]]@misc$layer1_link <- layer1.links[[a]]
}
# Run HiTME
annotated.obj <- Run.HiTME(object = obj,
                scGate.model = models.TME,
                ref.maps = ref.maps)

annotated.obj <- Run.HiTME(obj,
                            scGate.model = models.TME,
                            ref.maps = ref.maps,
                            # already split object
                            split.by = NULL,
                            # if splitting or providing list, whether to return a single merged object
                            remerge = FALSE,
                            # link between scGate and ProjecTILs
                            layer1_link = "CellOntology_ID",
                            # extra signatures to be computed per celltype
                            additional.signatures = additional.signatures, 
                            # paralelization parameters
                            ncores = 4,
                            progressbar = TRUE
                            )

Summarized cell annotation

Run.HiTME will return the Seurat object or list of them with new metadata indicating cell type annotation.

Annotated Seurat objects can be summarized into HiT objects using get.HiTObject function. For this function the grouping variable group.by resulting from Run.HiTME annotation or additional annotations need to be indicated. Compositional cell type distribution and aggregated transcriptomic profile (pseudobulk) are returned for each sample.

HiT_summary <- get.HiTObject(annotated.obj ,
                            group.by = list("layer1" = "scGate_multi",
                                            "layer2" = "functional.cluster"))

Alternatively, HiT summarizing object can be obtained directly using Run.HiTME with parameters return.Seurat = FALSE.

HiT_summary <- Run.HiTME(object = obj,
                        scGate.model = models.TME,
                        ref.maps = ref.maps,
                        return.Seurat = FALSE)

Hit Object content

The Hit object summarize the cell type annotation and contain the following slots:

  1. Seurat object metadata (dataframe): metadata

  2. Cell type predictions for each cell in the data set (list): predictions

  3. Cell type composition for each layer of cell type prediction: composition. Including:

    3.1. cell counts

    3.2. frequency

    3.3. CLR (Centred log ratio)-transformed counts (useful for downstream analyses such as PCA/Logratio analysis )

  4. Aggregated profile of predicted cell types: aggregated_profile. Including:

    4.1. Average and aggregated expression per cell type of all genes in the dataset and a subset of them.

    4.2. Mean of UCell scores per cell type, if additional signatures are provided, for example from SignatuR.

hitme's People

Contributors

jgarnica22 avatar halterc avatar sjcarmona avatar

Stargazers

 avatar Ron Finn avatar Xue Huiwen avatar

Watchers

 avatar

hitme's Issues

Typo in get.aggregated.profile?

In get.aggregated.profile there is a part where names are assigned to columns (for cell types). (see reference to code section below)
If I understand correctly:
The first part handles the case if there are multiple cell types. (?)
The second part if there is only one cell type.

Can't this be handled by a general loop, valid for ncol >=1 ?
I.e. like this

for(av in names(avg.exp)){ colnames(avg.exp[[i]]) <- unique([email protected][!is.na([email protected][[group.by.aggregated[[i]]]]), group.by.aggregated[[i]]]) }

Alternatively, shouldn't the first line read:

if(ncol(avg.exp[[i]]) > 1){
?

HiTME/R/main.R

Lines 717 to 731 in d1514ac

if(ncol(avg.exp[[i]]) == 1){
for(av in names(avg.exp)){
colnames(avg.exp[[i]]) <-
unique(object@meta.data[!is.na(object@meta.data[[group.by.aggregated[[i]]]]),
group.by.aggregated[[i]]])
}
}
# add colnames if only one cell type is found
if(ncol(avg.exp[[i]] ) == 1){
colnames(avg.exp[[i]] ) <-
unique(object@meta.data[!is.na(object@meta.data[[group.by.aggregated[[i]]]]),
group.by.aggregated[[i]]])
}
}

compatibility with lower version of R

Hello,

I am very interested in using your tools which seems great for cell classification. However I am currently using R version 4.1.3 and HiTME requires R >= 4.3.1. Would it be possible to make it compatible with a lower version of R?

Thanks

Noemie

get.cluster.score: alternative method to Fisher's method to combine p-values

When using get.cluster.score with batching, batch size should be considered. For example, if one batch has 30 observations and the other one only 3-5, then the latter will never produce a meaningful p-value, whereas the first prior one should be very meaningful. Fisher's method does not account for that and weights them equally. A better option would be the Weighted Z-Score Method.

Fisher's Method
Use Cases: Fisher's method is suitable when the p-values being combined are independent and come from studies testing the same null hypothesis.
Advantages: It is straightforward to implement and does not require assumptions about the distribution of the underlying test statistics.
Disadvantages: Fisher's method tends to lose power when the p-values being combined are not uniformly distributed under the null hypothesis.

Weighted Z-Score Method
Use Cases: Weighted Z-score method is useful when some studies are deemed more reliable or informative than others.
Advantages: Allows researchers to assign weights to individual studies based on their sample sizes, effect sizes, or other relevant factors, which can improve power and accuracy.
Disadvantages: Requires prior knowledge or assumptions about the relative reliability of individual studies, and incorrect weighting can bias results.

scGate multi cells parameter in Run.HiTME

Hello,

Thanks again for creating this package. When using scGate alone, we have the option to choose whether the cell called as Multi should be considered as NA or not. This option is not included in Run.HiTME. Could you please include this parameter in the Run.HiTME function to be able to get the scGate multi cell.

Thanks

Run.HiTME with ref.maps messes up seurat object's data structure

Hey Josep. I just noticed that when I Run.HiTME on a Seurat object, it messes up the data structure but only if I run ref.maps. Here is a minimal working example:

scGate_models_DB <- get_scGateDB(branch = "master", verbose = T, force_update = TRUE)
scGate_models <- scGate_models_DB$human$HiTME

layer1.links <- list("CD8" = "CL:0000625",
                     "CD4" = "CL:0000624",
                     "DC" = "CL:0000451",
                     "MoMac" = "CL:0000576_CL:0000235")

for(a in names(ref.maps)){
  ref.maps[[a]]@misc$layer1_link <- layer1.links[[a]]
}

# Take as input any small single sample that you have for testing
a <- subset(obj, subset = sample == obj$sample[1])

# This gives an error on running Run.HiTME a second time
b <- Run.HiTME(a, scGate.model = scGate_models, ref.maps = ref.maps)
c <- Run.HiTME(b, scGate.model = scGate_models, ref.maps = ref.maps)

# This gives no error on running Run.HiTME a second time
a <- subset(obj, subset = sample == obj$sample[1])
b <- Run.HiTME(a, scGate.model = scGate_models, ref.maps = NULL)
c <- Run.HiTME(b, scGate.model = scGate_models, ref.maps = NULL)

The error I get is:

Error in validObject(object = x) : 
  invalid class "Seurat" object: 1: all cells in assays must be in the same order as the Seurat object
invalid class "Seurat" object: 2: 'active.idents' must be named with cell names

I assume it might be due to subsetting the cells.

misc Projectils_param -> only ref maps that were actually run or all tried?

Hi Josep,

at one point during "Run.HiTME", the ref maps used get saved to obj@misc (see reference to specific line of code below).

HiTME/R/utils.R

Line 100 in 57fb295

object@misc[["Projectils_param"]] <- names(ref.maps)

However, the current behaviour only saves ref maps that were actually run, not all that were tried. For example:
I Run.HiTME on a list of seurat objects ("obj.list") with the CD4 and CD8 ref maps.
obj.list[[1]] has CD4 and CD8 cells, so obj.list[[1]]@misc$Projectils_param = c("CD4", "CD8")
obj.list[[2]] has CD4 but no CD8 cells, so obj.list[[2]]@misc$Projectils_param = c("CD4")

Wouldn't it be nicer to store all ref maps run on the object for traceability?
Alternatively, we could save both, e.g:

obj.list[[2]]@misc$Projectils_param$all <- c("CD4", "CD8")
obj.list[[2]]@misc$Projectils_param$executed <- c("CD4")

I point this out because it might be important to know if a cell type was not present, so the ref map was not executed.
PS: It would also be nice to store all cell types within the ref maps, either in @misc or in $functional.cluster as factor. This is in order to identify, which cell subtypes are NA (because layer 1 cell type did not map to layer 2) and which are 0 (layer 1 cell type was mapped to layer 2 but not detected in this sample) for the calculation of the cell type composition if some or all subtypes are equal to zero.

What do you think?

plot.celltype.freq: major re-work: simplify, add parameter barplot.group.by, add parameter freq.type, exclude na and show in separate plot

plot.celltype.freq currently operates on the HiTObject_list. I suggest that we do this from the HiT_summary resulting from merge.HiTObjects as this object uses less memory (in my case HiTObject_list (56 elements, 1GB) vs HiT_summary (391 MB)) and because downstream analyses are carried out with the HiT_summary.

It would also make it easier (simplifies the code) to create grouped boxplots, e.g. comparing responders vs non-responder celltype composition.

Would be nice to have a paramter freq.type to switch between absolute relative abundance and clr (to better visualize low abundant cell types).

NA values can be interesting. However, I would exclude NAs from the relative abundances as they add noise and make it difficult to interpret the ratios in the plots visually (i.e. keep according to what user specified in get.HiTObject useNA parameter, so by default exclude). I suggest we show number of NAs per sample in a separate plot, e.g. like this, right in the beginning (after Run.HiTME on the obj.list) as a quality control:

plot.nas <- function (obj.list = NULL,
                      annot.col = c("scGate_multi", "functional.cluster"),
                      bottom.mar = 10.2) {
  if (is.null(obj.list) &
      !is.list(obj.list) &
      !all(lapply(obj.list, inherits, "Seurat"))) {
    stop("Please provide a list of seurat objects")
  }
  
  for (col in annot.col) {
    na_perc_per_sample <- c()
    for (i in names(obj.list)) {
      if (col %in% names(obj.list[[i]]@meta.data)) {
        percs <- prop.table(table(obj.list[[i]]@meta.data[[col]], useNA = "ifany"))*100
        nas_perc <- unname(percs[is.na(names(percs))])
        na_perc_per_sample <- c(na_perc_per_sample, setNames(nas_perc, i))
      } else {
        stop(paste(col, " not found in obj.list item ", i))
      }
    }
    par(mar = c(bottom.mar, 4.1, 4.1, 2.1))
    barplot(na_perc_per_sample,
            ylab = "Percentage of NA values",
            las=2)
  }
}

plot.nas(obj.list)

image
image

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.