ktrns / scrnaseq


Workflow for single-cell RNA-seq analysis using Seurat

License: MIT License

HTML 99.39% R 0.60% CSS 0.01%

scrnaseq's Issues

Small issues to change in the main script

While working on my client's project, I noticed a few things:

The color scale for DotPlots is off -> I will change it to blue-grey-yellow
I will add DotPlots per cluster to show DEG expression per sample per cluster
I will add another set of DotPlots with non-scaled DEG expression
Let's try the feature plots on log scale
We could already add slingshot as an option for pseudotime analysis

HTML tables appear as full width even though full_width=FALSE

In some cases, HTML tables appear as full width, e.g.:

knitr::kable(summary, align="l", caption="Dataset summary") %>% 
  kableExtra::kable_styling(bootstrap_options=c("striped", "hover"), full_width=FALSE, position="left")

In other cases, the exact same code works:

knitr::kable(sc_markers_top2, align="l", caption="Top 2 DEGs per cell cluster") %>% 
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover"), full_width=FALSE, position="left")

Single sample input

How can I generate a report of a single sample?
When I'm trying to do so I get an error in the 'cells_per_cluster' chunk where 'tbl' is expected to be a matrix or data.frame with more than just one dimension.
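
One possible cause, sketched under assumptions: if the chunk subsets a cluster-by-sample matrix, R drops a single remaining column to a plain vector unless drop=FALSE is given. The object names below are illustrative and are not taken from the 'cells_per_cluster' chunk.

```r
# Hypothetical cells-per-cluster matrix for a single-sample run
counts = matrix(c(2, 1, 3), ncol=1,
                dimnames=list(cluster=c("0", "1", "2"), sample="sampleA"))

# Default subsetting drops the single column and returns a plain vector
dropped = counts[, "sampleA"]

# drop=FALSE keeps the two-dimensional structure
kept = counts[, "sampleA", drop=FALSE]

print(is.null(dim(dropped)))
print(dim(kept))
```

If this is indeed the cause, adding drop=FALSE wherever 'tbl' is subset would keep it a matrix for one sample as well.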

Allow more possibilities for gene name conversion

At the moment, we (more or less) expect Ensembl IDs and convert them into gene symbols for Seurat.

For more flexibility, we could add:

user-provided gene symbols -> mapped to Ensembl IDs -> mapped to Seurat gene symbols
user-provided gene symbols/gene ids of another species -> mapped to Ensembl IDs of one reference species -> mapped to Seurat gene symbols

The latter would allow for cross-species analysis.
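
A minimal sketch of the two-step mapping, using only base R lookups. The gene names, IDs and both lookup tables are placeholders; in practice they would come from an annotation resource, and for the cross-species case the first table would hold an ortholog mapping.

```r
# Step 1 (hypothetical table): user-provided symbol or alias -> Ensembl ID
symbol_to_id = c(GeneA="ENSG_A", GeneA_alias="ENSG_A", GeneB="ENSG_B")

# Step 2 (hypothetical table): Ensembl ID -> gene symbol used in Seurat
id_to_symbol = c(ENSG_A="GeneA", ENSG_B="GeneB")

map_to_seurat_symbol = function(user_genes) {
  # Unknown names yield NA in both lookups
  ids = unname(symbol_to_id[user_genes])
  symbols = unname(id_to_symbol[ids])
  setNames(symbols, user_genes)
}

mapped = map_to_seurat_symbol(c("GeneA_alias", "GeneB", "unknown"))
print(mapped)
```

Genes that cannot be mapped come back as NA, so they can be reported instead of being silently skipped.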

Singularity container trial

Hi @mariusrueve (I moved the issue from this fork to here)

I will try to create a prototype Singularity container recipe for this.

This is only one file which then needs to be built:

apt packages need to be downloaded and installed
R packages need to be built (takes 30-60 minutes, which is really a pain if you need to make adjustments)

I'll need to know the exact packages that are used:

a) Ubuntu packages (apt-get). Ubuntu version (20.04)

b) R packages

Marius's answer - key packages

library(shiny)
library(Seurat)
library(ggplot2)
library(readxl)

Some (sub-)panels (e.g. of ridge plot, violin plot) get strongly compressed the more clusters were identified

Depending on the number of identified clusters, some (sub-)panels (e.g. of the ridge plot and the violin plot) get strongly compressed: the more clusters are identified, the less space each panel receives (see example below). The HTML report therefore needs a different way to define panel space and dimensions.

in case of few cells error in Seurat::RunPCA

Dear all,

in the case of a project with few cells in a sc-RNA-seq dataset, I am getting the same error as in:

satijalab/seurat#1914

I'm correcting it via editing the npcs argument (default npcs=50), such as follows:

sc = Seurat::RunPCA(sc, features=Seurat::VariableFeatures(object=sc), verbose=FALSE, npcs=14)

Best,

Dimitra
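
Instead of hard-coding npcs, it could be derived from the data. This is only a sketch: choose_npcs is a hypothetical helper, and the rule that npcs must stay strictly below the smaller matrix dimension is an assumption about the partial SVD behind RunPCA.

```r
# Hypothetical helper: cap the number of PCs for small datasets
choose_npcs = function(n_cells, n_features, max_npcs=50) {
  # Assumption: npcs must be strictly smaller than both dimensions
  limit = min(n_cells, n_features) - 1
  max(1, min(max_npcs, limit))
}

print(choose_npcs(15, 2000))    # 14
print(choose_npcs(5000, 2000))  # 50

# Possible use in the workflow (not run here):
# n = choose_npcs(ncol(sc), length(Seurat::VariableFeatures(object=sc)))
# sc = Seurat::RunPCA(sc, features=Seurat::VariableFeatures(object=sc), verbose=FALSE, npcs=n)
```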

Improve feature plots for known marker genes

Marker genes are provided as lists of genes, e.g.:

param$known_markers = list()
param$known_markers[["bcell"]] = c("CD79A", "MS4A1")
param$known_markers[["tcell"]] = "CD3D"
param$known_markers[["tcell.cd8+"]] = c("CD8A", "CD8B")
param$known_markers[["nk"]] = c("GNLY", "NKG7")
param$known_markers[["myeloid"]] = c("CST3", "LYZ")
param$known_markers[["monocytes"]] = "FCGR3A"
param$known_markers[["dendritic"]] = "FCER1A"

These lists might be much longer. It would be nice to

allow reading of marker genes from file
plot genes per list separately as feature plots
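
Reading the lists from a file could look like the sketch below. The two-column layout (list, gene) and the function name read_known_markers are assumptions, not an agreed format.

```r
# Write a small example file; the column names "list" and "gene" are an assumed layout
marker_file = tempfile(fileext=".tsv")
writeLines(c("list\tgene",
             "bcell\tCD79A",
             "bcell\tMS4A1",
             "tcell\tCD3D",
             "nk\tGNLY",
             "nk\tNKG7"), marker_file)

read_known_markers = function(path) {
  tbl = read.delim(path, stringsAsFactors=FALSE)
  # split() returns one character vector of genes per list name
  split(tbl$gene, tbl$list)
}

known_markers = read_known_markers(marker_file)
print(known_markers)

# Plotting each list separately could then loop over the names (not run here):
# for (n in names(known_markers)) print(Seurat::FeaturePlot(sc, features=known_markers[[n]]))
```

The result has the same shape as param$known_markers above, so the rest of the script would not need to change.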

We need to rename cells if we integrate several datasets

... because now cell names are adapted during integration and don't match the original cell names.
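
A sketch of one renaming scheme, assuming that a sample-name prefix is acceptable: every barcode is prefixed before merging or integration so that names stay unique and traceable. The barcodes are made up.

```r
# Hypothetical barcodes from two samples; the same barcode can occur in both
cells = list(
  sampleA = c("AAAC-1", "AAAG-1"),
  sampleB = c("AAAC-1", "TTTG-1")
)

# Prefix every barcode with its sample name
renamed = lapply(names(cells), function(s) paste(s, cells[[s]], sep="_"))
names(renamed) = names(cells)

all_cells = unlist(renamed, use.names=FALSE)
print(all_cells)
```

On Seurat objects, Seurat::RenameCells with its add.cell.id argument should achieve the same prefixing.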

SmartSeq-2 input data

We are re-writing bits in the beginning of the main script to be able to read SmartSeq-2 sequencing data as well.

Identified celltype-specific markers are sometimes misleading due to current selection criteria

The central heatmap of the top 10 (if available) differentially expressed marker genes per cluster, along with the underlying filtering criteria and visualization settings, is sometimes misleading for certain projects and data structures. There are cases where the identified 'celltype-specific markers' are questionable, in the sense of whether they undoubtedly fulfil this very property in several respects. This still needs improvement, mainly at the level of the underlying selection criteria: what are the best criteria to call a gene a cluster-specific marker?
This general issue is not trivial at all and needs extensive deliberation.

qc_plot_cells does not work for samples with less than three cells

In the chunk qc_plot_cells, the violin plots do not work for samples with fewer than three cells. In this case I suggest that we just plot points. Here is code that would do this:

  p_list[[i]]= ggplot(sc_cell_metadata, aes_string(x="orig.ident", y=i, fill="orig.ident")) +
    geom_violin(scale="width")

  # Adds points for samples with less than three cells since geom_violin does not work here
  p_list[[i]] = p_list[[i]] + 
    geom_point(data=sc_cell_metadata %>% dplyr::filter(orig.ident %in% names(which(table(sc_cell_metadata$orig.ident)<3))), aes_string(x="orig.ident", y=i, fill="orig.ident"), shape=21, size=2)
  
  # Now add styles
  p_list[[i]] = p_list[[i]] + 
    AddStyle(title=i, legend_position="none", fill=param$col_samples, xlab="") + 
    theme(axis.text.x=element_text(angle=45, hjust=1))
  
  # Creates a table with min/max values for filter i for each dataset

DEGs between multiple samples

If we use the workflow to analyse multiple samples, we would also like to find DEGs that are specific to one sample versus the other.

Pseudotime analysis

It would be great to extend the workflow with pseudotime analysis. Several customers would like to see this kind of analysis, e.g. RNA velocity or trajectory inference.

Split HTO demultiplexing and actual scrnaseq analysis

We will split the code again into two parts such that HTO demultiplexing generates a small standalone report as well as demultiplexed data that can be processed with the main scrnaseq analysis script.

SCTransform and JackStraw

You added the new normalization method SCTransform to the script. But it seems that JackStraw does not accept SCT transformed data and stops with an error.

Double-check translation of gene names in reading function

... to make sure that genes are properly translated and we don't skip half of the genes.

Tabs in HTML

We should carefully go through the main HTML report and think which Plots can go into Tabs, as it is done for the HTO HTML report.

Even more explanatory / introductory texts for inexperienced users

I would recommend integrating even more explanatory text sections for inexperienced end-users into the report, in addition to the explanations already included (e.g. taken from the original vignette "Seurat - Guided Clustering Tutorial", or written by ourselves). Since such text blocks will take up more space in the report and reduce visual clarity, one could consider making them collapsible, so that they can be shown or hidden (similar to the source code buttons already implemented in the HTML files). It may be most intuitive for inexperienced end-users to have such an introductory text block (or the button for it) before each visualization panel or analysis step (in part already realized, I know).

Clarify the difference between DEGs and markers

Once we implement another step to identify DEGs between 2 defined conditions, we can rename current DEGs to markers, and change the documentation accordingly.

Read functions and how they convert names

This doesn't seem quite right. Needs a double-check. Also, in the HTO script, we are not yet converting any names, not sure if it is even required.

Remove RNA signal associated to cell cycle or other specified biological processes

We plan to add the possibility to regress out cell cycle effects, or effects due to other biological processes defined by user-specified gene lists.

DegsAvgData mean calculation

In DegsAvgData, mean values are calculated for the assays and the three slots counts, data and scale.data. For data, the expression values are log-transformed counts, and I would suggest first converting them back to normal counts. At the moment we calculate the mean of the log values, which is different from the log of the mean values.

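          if (sl=="data") {
            # Assuming LogNormalize: values are log1p-transformed, so invert with expm1
            id_avg = Matrix::rowMeans(expm1(GetAssayData(sc, assay=as, slot=sl)[genes, id_cells]))
            # Back to the log scale after averaging
            id_avg = log1p(id_avg)
          } else if (sl=="counts") {
            id_avg = Matrix::rowMeans(GetAssayData(sc, assay=as, slot=sl)[genes, id_cells])
          }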

One might also think about using log2 for the means.

For scale.data I would not calculate a mean at all since it is already centered and scaled. I do not know how the mean can then be interpreted.

Andreas
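
A small numeric illustration of the point about means, assuming log1p-transformed values as produced by Seurat's LogNormalize. The numbers are arbitrary.

```r
# Arbitrary normalised expression values of one gene in three cells
x = c(1, 10, 100)

# Mean of the log values (what is calculated at the moment)
mean_of_logs = mean(log1p(x))

# Log of the mean values (what is suggested above)
log_of_mean = log1p(mean(x))

print(c(mean_of_logs=mean_of_logs, log_of_mean=log_of_mean))
```

Because the logarithm is concave, the mean of the logs can never exceed the log of the mean, and the gap grows with the spread of the values.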

Seurat FindAllMarkers uses natural log to calculate FC, we'd like to have log2

satijalab/seurat#3346 (comment)

Posted a question to the Seurat GitHub. If Seurat indeed uses the natural log, we should re-calculate it to log2. It's more intuitive to understand.

At the moment, we use a log2-based threshold for FC, which would then be a bug.
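
If the values turn out to be natural-log fold changes, the conversion is a change of base. A minimal sketch; ln_to_log2 is a hypothetical helper name.

```r
# Change of base: log2(x) = ln(x) / ln(2)
ln_to_log2 = function(avg_logFC) avg_logFC / log(2)

# A four-fold change expressed as natural log ...
ln_fc = log(4)

# ... corresponds to a log2 fold change of 2 (up to floating-point error)
print(ln_to_log2(ln_fc))

# A log2 threshold of 1 (two-fold) corresponds to this natural-log threshold
ln_threshold = 1 * log(2)
print(ln_threshold)
```

So a threshold of 1 applied to natural-log values would silently require an e-fold (about 2.7-fold) change rather than a two-fold change.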

Authors of the scrnaseq workflow

The current way we name DcGC and RCUG as authors at the top of the report is slightly confusing. We discussed this issue and decided it would be better to move the author section to the bottom of the report, and to put the name of the bioinformatician working on the project in question at the top (as a parameter in the YAML header).

Export to Cerebro

We would like to add a chunk to export the Seurat object to be able to visualise the data in Cerebro.

https://github.com/romanhaa/Cerebro

Consistency in filter criteria and order of marker genes needs improvement

It is confusing for end-users that the filter criteria that yield the candidates in the Excel output file "Markers.xlsx" are quite different from those currently applied for the bar graph panel "Number of DEGs per cell cluster". This should be harmonized.
At present, there are some obvious inconsistencies in marker gene order between three panels, namely 1) the global heatmap, 2) the Excel output file "Markers.xlsx" and 3) the table "Top 2 DEGs per cell cluster". Markers in 1) are sorted according to decreasing "avg_logFC", markers in 2) according to increasing "p_val", and markers in 3) were selected (as the top two) based on "avg_logFC" but are ordered according to increasing "p_val" (as in "Markers.xlsx"). The top gene selection for showing individual marker genes (e.g. in the ridge plot) is again based on decreasing "avg_logFC", as for the global heatmap. To make orientation within the report more intuitive, this should be harmonized.

future framework warnings about not statistically sound random seeds

The following code throws several warnings about not statistically sound random seeds:
# Normalise data the original way
sc = purrr::map(sc, Seurat::NormalizeData, normalization.method = "LogNormalize", scale.factor=10000, verbose=FALSE)

"UNRELIABLE VALUE: Future (‘future_lapply-1’) unexpectedly generated random numbers without specifying argument '[future.]seed'. There is a risk that those random numbers are not statistically sound and the overall results might be invalid. To fix this, specify argument '[future.]seed', e.g. 'seed=TRUE'. This ensures that proper, parallel-safe random numbers are produced via the L'Ecuyer-CMRG method. To disable this check, use [future].seed=NULL, or set option 'future.rng.onMisuse' to "ignore"."

These warnings always appear when knitting the project to HTML within RStudio.
They appear only occasionally when running the separate chunks within RStudio.

Can we add expression values to the output in markers.xlsx?

It would be helpful to add some kind of expression data to the output in markers.xlsx.

At the moment, the table contains the average log fold change and p-value. We could add the average raw or normalised expression per cluster of cells. This way clients can discard DEGs with low expression.
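
A sketch of how an average expression per cluster could be computed with base R before joining it to the marker table. The matrix and the cluster assignment are toy data.

```r
# Toy normalised expression matrix: genes in rows, cells in columns
expr = matrix(c(0, 2, 4, 6,
                1, 1, 3, 3), nrow=2, byrow=TRUE,
              dimnames=list(c("GeneA", "GeneB"), paste0("cell", 1:4)))

# Toy cluster assignment per cell
cluster = c(cell1="0", cell2="0", cell3="1", cell4="1")

# Group the cell names by cluster, then average each gene over the cells of a cluster
cells_by_cluster = split(colnames(expr), cluster[colnames(expr)])
avg_per_cluster = sapply(cells_by_cluster, function(cells) {
  rowMeans(expr[, cells, drop=FALSE])
})

print(avg_per_cluster)
```

The result is a genes-by-clusters matrix whose columns could be added to markers.xlsx. Seurat::AverageExpression offers comparable functionality on Seurat objects.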

(Re)Integration of the violin plots

The “violin plots” as examples for marker genes of individual clusters are extremely valuable and do show aspects of the data that no other panel is able to. Thus, they should still be (re-)integrated into the standard report (as we have already realized in a recently ‘launched’ RCUG-specific version of the script/report).

Export figures to PDF

It would be nice to introduce the option to save figures in high-res PDF, so they can be used downstream for publications.
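
A minimal sketch of such an option using the base pdf device. save_pdf is a hypothetical helper and the default sizes are arbitrary.

```r
# Hypothetical helper: run a plotting function inside a PDF device
save_pdf = function(plot_fun, file, width=7, height=5) {
  grDevices::pdf(file, width=width, height=height)
  # Close the device even if plotting fails
  on.exit(grDevices::dev.off())
  plot_fun()
  invisible(file)
}

out = save_pdf(function() plot(1:10, main="Example"), tempfile(fileext=".pdf"))
print(file.exists(out))
```

For ggplot objects, plot_fun would call print(p). Within R Markdown, the knitr chunk option dev also accepts several devices, e.g. dev=c("png", "pdf"), which might be the simpler route.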

scrnaseq shiny app

https://github.com/MHH-RCUG/scrnaseq_app/tree/dev

This is the link to my dev branch of the shiny app.

Problem with the dot plot panel if the same gene serves as TOP candidate for two clusters

We observed a substantial problem (program abort?) with the dot plot panel in an example, where one and the same gene was identified as top ranking marker in two different clusters.
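
The cause is not confirmed here. Assuming the abort comes from duplicated gene names being passed to the plotting function, a possible workaround is to de-duplicate the gene list first. The table below is toy data.

```r
# Toy table of top markers; LYZ is the top gene of two clusters
top_markers = data.frame(cluster=c("0", "1", "2"),
                         gene=c("LYZ", "LYZ", "CD3D"),
                         stringsAsFactors=FALSE)

# Keep each gene once, preserving the order of first appearance
genes_to_plot = unique(top_markers$gene)
print(genes_to_plot)
```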

DEG output table markers.xlsx lacks annotation

It looks as if markers.xlsx lacks the annotation. Headers are there, but cells are not filled.

Multiple input datasets

We plan to add the possibility to read multiple datasets and integrate them into one Seurat object.

Output of functional enrichment results is not informative yet

The currently contained table as output of the Functional enrichment analysis is not really informative (regarding actual results) and should be removed or modified to better reflect an overview of most strongly enriched gene sets or categories (e.g. GO terms) per cluster (to directly reflect the outcome of this specific approach).

Convert PlotMyStyle into theme

The PlotMyStyle function should be converted into a full ggplot theme function so that we can apply it in ggplot fashion: ggplot() + theme_plotmystyle()

See for example here: https://www.statworx.com/de/blog/custom-themes-in-ggplot2/

It makes the code more readable. Also we should think about another name and title/legend_title should be set outside of the function (I was thinking about the patchwork system).

Issues

Hi Oliver/Marius.
@mariusrueve

you can watch this repo (button top right) to be notified of all conversations, issues, pull requests (PRs) etc.

You can address people with @colindaven like in gitlab.

I'm not sure who can assign issues to users, perhaps only owners/maintainers like @ktrns ?

Best wishes,
Colin

Check online resources at the start

Check online resources at the beginning of the script and fail if a resource is not available
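
A sketch of such a check with base R connections. Both function names are hypothetical, and the reachability test (trying to open the URL) is only one possible definition of "available".

```r
# Hypothetical check: a URL counts as available if a connection can be opened
url_is_reachable = function(u, timeout=10) {
  old = options(timeout=timeout)
  on.exit(options(old))
  tryCatch({
    con = url(u, open="rb")
    close(con)
    TRUE
  }, error=function(e) FALSE, warning=function(w) FALSE)
}

# Stop with one message that lists all unavailable resources
check_resources = function(urls, is_available=url_is_reachable) {
  ok = vapply(urls, is_available, logical(1))
  if (!all(ok)) {
    stop("Resource(s) not available: ", paste(urls[!ok], collapse=", "), call.=FALSE)
  }
  invisible(TRUE)
}

# Offline demonstration with a stub checker instead of real network access
msg = tryCatch(check_resources(c("res_a", "res_b"), function(u) u == "res_a"),
               error=function(e) conditionMessage(e))
print(msg)
```

Passing the checker as an argument keeps the function testable without network access.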

HTO normalisation advice

Here is some advice about choosing normalisation for HTOs. Can we include this in the script?

From: satijalab/seurat#2954

Thanks, these are good questions

We advise normalizing across tags particularly when there is substantial variation in how well each hash performs. We saw this a lot when we were doing our own conjugations (like the original hashing paper)

We advise normalizing across cells in most other cases

Print versions at the end of report

How about printing the repository version and R packages versions at the end of the HTML report? This is done for nf-core pipelines for example, and is useful for the user when writing the paper.
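
For the R package part, a final chunk could look like the sketch below. report_versions is a hypothetical helper, and base packages stand in for the real dependencies. Reporting the repository version would need an additional step that is not shown.

```r
# Hypothetical helper: table of package versions for the report
report_versions = function(pkgs) {
  data.frame(
    package=pkgs,
    version=vapply(pkgs, function(p) as.character(utils::packageVersion(p)), character(1)),
    row.names=NULL,
    stringsAsFactors=FALSE
  )
}

# Base packages are used here so that the example runs anywhere;
# the report would list e.g. Seurat and ggplot2 instead
versions = report_versions(c("stats", "utils"))
print(versions)

# Full session information, including R version and platform
print(utils::sessionInfo())
```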

Small fixes

Feedback from Andreas:

Introduce HTO name tags
Introduce the term meta data
Replace "Identity" y-axis label with "Cluster"
Introduce what different plots are showing, particularly a dot plot
Look up adjusted p-value for DEGs: rounding issue? why is it all 0?

Logging

We need to decide whether we will use futile.logger, or simply message, warning, stop, etc.

Documentation of code

We need to go through our R/ scripts and update documentation of the functions.

Update code after major changes in cerebroApp

The cerebroApp has been rewritten substantially and most of the code for the export does not work anymore. I commented these parts out for now (commit 7b60d91) but we need to rewrite the function in order to export as much as possible to the cerebroApp.

Switch from CombinePlots() to the 'patchwork' package

CombinePlots() will soon be deprecated, as this functionality is now supported by the 'patchwork' package.

Check parameters at the start

Similar to the online resources, it would be good to verify the parameters at the start of the script and stop when something is not right. The two checks could be combined.
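
A sketch of what such a check could look like. The parameter names path_data and npcs are hypothetical; known_markers is the list shown in an earlier issue. All problems are collected first so that the user sees them in a single error.

```r
# Hypothetical parameter check; collects all problems before stopping
check_params = function(param) {
  problems = character(0)

  if (!is.character(param$path_data) || length(param$path_data) != 1 || !nzchar(param$path_data)) {
    problems = c(problems, "path_data must be a non-empty string")
  }
  if (!is.numeric(param$npcs) || length(param$npcs) != 1 || param$npcs < 1) {
    problems = c(problems, "npcs must be a single positive number")
  }
  if (!is.null(param$known_markers) && !is.list(param$known_markers)) {
    problems = c(problems, "known_markers must be a list of gene vectors")
  }

  if (length(problems) > 0) {
    stop("Invalid parameters:\n- ", paste(problems, collapse="\n- "), call.=FALSE)
  }
  invisible(TRUE)
}

# Valid parameters pass silently
check_params(list(path_data="data", npcs=10))

# Invalid parameters produce one combined error message
err = tryCatch(check_params(list(npcs=0)), error=function(e) conditionMessage(e))
cat(err, "\n")
```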

CiteFuse

Hi @ktrns , Marius and all.

Alternatives just released for Cite-seq analysis (might be helpful ?)

CiteFuse is freely available at http://shiny.maths.usyd.edu.au/CiteFuse/ as an online web service and at https://github.com/SydneyBioX/CiteFuse/ as an R package.

TotalVI - different software I believe and more about SC data integration:
https://www.biorxiv.org/content/10.1101/2020.05.08.083337v1

Best wishes,
Colin

Integration versus merging, CellCycleScoring on individual samples or merged dataset?

We need to be wise with integration. To integrate data from several samples, there should be some cells that overlap the samples, e.g. skin tissue from mouse sequenced in two different labs. The respective paper to read is: Comprehensive Integration of Single-Cell Data (Cell).
If we only "merge" several samples, we are currently re-running SCTransform. In addition, we might want to re-run CellCycleScoring, because it makes a difference whether this is run per sample, or run for the whole dataset. It should make more sense to move the complete merge chunk up to after filtering, as we anyway have to re-run normalisation. It doesn't make much sense to normalise individual samples if we don't use this at all.
What makes more sense? CellCycleScoring per sample, or for the whole merged dataset? The results differ quite a bit, not in terms of scores, but in terms of the assigned phase. If the scores wobble around 0, phase can easily change. This is discussed in: satijalab/seurat#2277
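
To illustrate why the phase flips easily, here is a simplified phase rule. It is written under the assumption that a cell is called G1 when both scores are negative and otherwise gets the phase with the higher score; ties are not handled. It is a sketch, not Seurat's code.

```r
# Simplified phase assignment (assumed rule, see lead-in)
assign_phase = function(s_score, g2m_score) {
  if (s_score < 0 && g2m_score < 0) return("G1")
  if (s_score > g2m_score) "S" else "G2M"
}

# Scores close to 0: a tiny shift in the S score changes the phase
print(assign_phase(-0.01, -0.02))  # "G1"
print(assign_phase(0.01, -0.02))   # "S"
```

Since the scores are computed relative to the expression of the other cells, running the scoring per sample or on the merged dataset can shift them by such small amounts.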

Read simple expression matrix as input

@andpet0101
Can we make it possible to read a simple expression matrix as input to our script? This would be useful for test datasets, which usually are saved as matrix.txt tab separated file, no further information.

Or is this already possible with the SmartSeq-2 reading function?
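
In case it is not, a reader for such a file could be as small as the sketch below. The layout is an assumption: tab-separated, genes in rows, gene names in the first column and cell names in the header. read_expression_matrix is a hypothetical name.

```r
# Write a tiny example matrix.txt with the assumed layout
matrix_file = tempfile(fileext=".txt")
writeLines(c("gene\tcell1\tcell2",
             "GeneA\t0\t3",
             "GeneB\t5\t1"), matrix_file)

read_expression_matrix = function(path) {
  # First column becomes the row names; keep cell names unchanged
  tbl = read.delim(path, row.names=1, check.names=FALSE)
  as.matrix(tbl)
}

counts = read_expression_matrix(matrix_file)
print(counts)

# The matrix could then be passed on to Seurat (not run here):
# sc = Seurat::CreateSeuratObject(counts=counts)
```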