scrnaseq's People

Contributors

andpet0101, colindaven, colorstorm, kosankem, ktrns, tglomb

scrnaseq's Issues

Small issues to change in the main script

While working on my client's project, I noticed a few things:

  • The color scale for DotPlots is off -> I will change it to blue-grey-yellow
  • I will add DotPlots per cluster to show DEG expression per sample per cluster
  • I will add another set of DotPlots with non-scaled DEG expression
  • Let's try the feature plots on log scale
  • We could already add slingshot as an option for pseudotime analysis

HTML tables appear as full width even though full_width=FALSE

In some cases, HTML tables appear as full width, e.g.:

knitr::kable(summary, align="l", caption="Dataset summary") %>% 
  kableExtra::kable_styling(bootstrap_options=c("striped", "hover"), full_width=FALSE, position="left")

In other cases, the exact same code works:

knitr::kable(sc_markers_top2, align="l", caption="Top 2 DEGs per cell cluster") %>% 
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover"), full_width=FALSE, position="left")

Single sample input

How can I generate a report of a single sample?
When I try, I get an error in the 'cells_per_cluster' chunk, where 'tbl' is expected to be a matrix or data.frame with more than one dimension.
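
One possible cause, offered only as an assumption because the chunk itself is not shown here: with a single sample, subsetting a two-dimensional table drops it to a plain vector. A minimal sketch with made-up cluster and sample labels, using drop=FALSE to keep both dimensions:

```r
# Hypothetical metadata: four cells from one sample, two clusters
clusters = factor(c("0", "0", "1", "1"))
samples = factor(c("sampleA", "sampleA", "sampleA", "sampleA"))

# The cross-tabulation is two-dimensional even with a single sample
tbl = table(clusters, samples)

# Default subsetting drops the sample dimension when only one column is left
tbl_dropped = tbl[, "sampleA"]

# drop=FALSE keeps a clusters x samples layout, so code that expects
# rows and columns still works
tbl_kept = tbl[, "sampleA", drop=FALSE]
```

If the error comes from a different place in the chunk, the same idea (guarding against dropped dimensions) may still apply.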

Allow more possibilities for gene name conversion

At the moment, we (more or less) expect Ensembl IDs and convert them into gene symbols for Seurat.

For more flexibility, we could add:

  • user-provided gene symbols -> mapped to Ensembl IDs -> mapped to Seurat gene symbols
  • user-provided gene symbols/gene ids of another species -> mapped to Ensembl IDs of one reference species -> mapped to Seurat gene symbols

The latter would allow for cross-species analysis.
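
The second mapping chain could look roughly like the sketch below. The lookup tables, the identifiers in them and the helper map_genes are placeholders for illustration only; in practice the tables would come from an annotation resource.

```r
# Placeholder identifiers for illustration only, not real Ensembl IDs
other_species_to_ref_ensembl = c(geneA="REF0001", geneB="REF0002")
ref_ensembl_to_symbol = c(REF0001="CD3D", REF0002="MS4A1", REF0003="CD79A")

# Chain two lookups; genes without a mapping become NA
map_genes = function(ids, first_map, second_map) {
  ensembl = unname(first_map[ids])
  symbols = unname(second_map[ensembl])
  data.frame(input=ids, ensembl=ensembl, symbol=symbols, stringsAsFactors=FALSE)
}

mapped = map_genes(c("geneA", "geneB", "geneC"),
                   other_species_to_ref_ensembl, ref_ensembl_to_symbol)
```

Keeping unmapped genes as NA (instead of dropping them silently) makes it easy to report how many genes were lost in the conversion.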

Singularity container trial

Hi @mariusrueve (I moved the issue from this fork to here)

I will try to create a prototype Singularity container recipe for this.

This is only one file which then needs to be built:

  • apt packages need to be downloaded and installed
  • R packages need to be built (takes 30-60 minutes, which is really a pain if you need to make adjustments)

I'll need to know the exact packages that are used:

a) Ubuntu packages (apt-get). Ubuntu version (20.04)

b) R packages

Marius's answer - key packages

  • library(shiny)
  • library(Seurat)
  • library(ggplot2)
  • library(readxl)

Error in Seurat::RunPCA in case of few cells

Dear all,

in the case of a project with few cells in a sc-RNA-seq dataset, I am getting the same error as in:

satijalab/seurat#1914

I work around it by lowering the npcs argument (default npcs=50), as follows:

sc = Seurat::RunPCA(sc, features=Seurat::VariableFeatures(object=sc), verbose=FALSE, npcs=14)

Best,

Dimitra
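
Instead of hard-coding the value, npcs could be derived from the size of the dataset. The helper choose_npcs below is hypothetical, and capping at one less than the smaller dimension is a conservative assumption, not a rule taken from the Seurat documentation.

```r
# Pick a number of PCs that the data can support.
# Assumption: one less than the smaller matrix dimension is a safe upper bound.
choose_npcs = function(n_cells, n_features, max_npcs=50) {
  max(1, min(max_npcs, n_cells - 1, n_features - 1))
}

npcs = choose_npcs(n_cells=15, n_features=2000)

# Possible use (requires Seurat and an object `sc`):
# sc = Seurat::RunPCA(sc, features=Seurat::VariableFeatures(object=sc), verbose=FALSE, npcs=npcs)
```

Downstream steps that use a fixed number of dimensions would need the same cap.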

Improve feature plots for known marker genes

Marker genes are provided as lists of genes, e.g.:

param$known_markers = list()
param$known_markers[["bcell"]] = c("CD79A", "MS4A1")
param$known_markers[["tcell"]] = "CD3D"
param$known_markers[["tcell.cd8+"]] = c("CD8A", "CD8B")
param$known_markers[["nk"]] = c("GNLY", "NKG7")
param$known_markers[["myeloid"]] = c("CST3", "LYZ")
param$known_markers[["monocytes"]] = "FCGR3A"
param$known_markers[["dendritic"]] = "FCER1A"

These lists might be much longer. It would be nice to

  • allow reading of marker genes from file
  • plot genes per list separately as feature plots
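
Reading the marker lists from a file could be sketched as follows. The file layout (two tab-separated columns named "list" and "gene") and the function name read_known_markers are assumptions made for this example.

```r
# Hypothetical file layout: two tab-separated columns, "list" and "gene"
marker_file = tempfile(fileext=".tsv")
writeLines(c("list\tgene", "bcell\tCD79A", "bcell\tMS4A1", "tcell\tCD3D"), marker_file)

read_known_markers = function(path) {
  tbl = read.delim(path, header=TRUE, stringsAsFactors=FALSE)
  # One character vector of genes per marker list, in order of first appearance
  split(tbl$gene, factor(tbl$list, levels=unique(tbl$list)))
}

known_markers = read_known_markers(marker_file)
```

The result has the same shape as param$known_markers above, so the feature plots could then be drawn per list by looping over names(known_markers).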

SmartSeq-2 input data

We are re-writing bits in the beginning of the main script to be able to read SmartSeq-2 sequencing data as well.

Identified celltype-specific markers are sometimes misleading due to current selection criteria

  • The central heatmap of the top 10 (if available) differentially expressed marker genes per cluster, along with the underlying filtering criteria and visualization settings, is sometimes misleading for certain projects and data structures. There are cases where the identified 'celltype-specific markers' are questionable, in the sense that they do not undoubtedly fulfill this property in several respects. The underlying selection criteria still need improvement; the main question is: "What are the best criteria to call a gene a cluster-specific marker?"
  • This general issue is not trivial at all and needs extensive deliberation.

qc_plot_cells does not work for samples with fewer than three cells

In the chunk qc_plot_cells, the violin plots do not work for samples with fewer than three cells. In this case I suggest that we just plot points. Here is code that would do this.

  p_list[[i]]= ggplot(sc_cell_metadata, aes_string(x="orig.ident", y=i, fill="orig.ident")) +
    geom_violin(scale="width")

  # Adds points for samples with less than three cells since geom_violin does not work here
  p_list[[i]] = p_list[[i]] + 
    geom_point(data=sc_cell_metadata %>% dplyr::filter(orig.ident %in% names(which(table(sc_cell_metadata$orig.ident)<3))), aes_string(x="orig.ident", y=i, fill="orig.ident"), shape=21, size=2)
  
  # Now add styles
  p_list[[i]] = p_list[[i]] + 
    AddStyle(title=i, legend_position="none", fill=param$col_samples, xlab="") + 
    theme(axis.text.x=element_text(angle=45, hjust=1))
  
  # Creates a table with min/max values for filter i for each dataset

DEGs between multiple samples

If we use the workflow to analyse multiple samples, we would also like to find DEGs that are specific to one sample versus the other.

Pseudotime analysis

It would be great to extend the workflow with pseudotime analysis. Several customers would like to see this kind of analysis, see velocity or trajectory inference.

SCTransform and JackStraw

You added the new normalization method SCTransform to the script. But it seems that JackStraw does not accept SCT transformed data and stops with an error.

Tabs in HTML

We should carefully go through the main HTML report and think which Plots can go into Tabs, as it is done for the HTO HTML report.

Even more explanatory / introductory texts for inexperienced users

  • I would recommend integrating more explanatory text sections for inexperienced end-users into the report, in addition to the explanations already included (e.g. from the current original vignette "Seurat - Guided Clustering Tutorial", or our own). Since such text blocks take up space and reduce visual clarity, one could make them collapsible, similar to the source code buttons already implemented in the HTML files. For inexperienced end-users it may be most intuitive to have such an introductory text block (or the button for it) before each visualization panel or analysis step; this is already partly realized.

DegsAvgData mean calculation

In DegsAvgData, mean values are calculated for the assays and the three slots counts, data and scale.data. For data, the expression values are log-transformed counts, and I would suggest first converting them back to normal counts. At the moment we calculate the mean of the log values, which differs from the log of the mean of the values.

          if (sl=="data") {
            id_avg = Matrix::rowMeans(exp(GetAssayData(sc, assay=as, slot=sl)[genes, id_cells]))
            id_avg = log(id_avg)
          } else if (sl=="counts") {
            id_avg = Matrix::rowMeans(GetAssayData(sc, assay=as, slot=sl)[genes, id_cells])
          }

One might also think about using log2 for the means.

For scale.data I would not calculate a mean at all since it is already centered and scaled. I do not know how the mean can then be interpreted.

Andreas
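
The difference between the two averages can be seen on a toy example with made-up counts. One detail that goes beyond the snippet above: Seurat's LogNormalize stores log1p values, so the sketch uses expm1 and log1p as the matching back-transform pair rather than exp and log.

```r
# Toy counts for one gene in four cells (made-up numbers)
counts = c(0, 1, 3, 100)
log_values = log1p(counts)

# Mean of the log values (what is calculated at the moment)
mean_of_logs = mean(log_values)

# Back-transform, average, then log again
log_of_mean = log1p(mean(expm1(log_values)))
```

The mean of the logs is always less than or equal to the log of the mean, and the gap grows with the spread of the values, so a few highly expressing cells are down-weighted by the current calculation.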

Authors of the scrnaseq workflow

The current way we name DcGC and RCUG as authors at the top of the report is slightly confusing. We discussed this issue and decided it would be better to move the author section to the bottom of the report, and to put the name of the bioinformatician working on the specific project at the top (as a parameter in the YAML header).

Consistency in filter criteria and order of marker genes needs improvement

  • It is confusing for end-users that the filter criteria that yield the candidates in the Excel output file "Markers.xlsx" are quite different from those currently applied for the bar graph panel "Number of DEGs per cell cluster". This should be harmonized.

  • At present, there are some obvious inconsistencies in marker gene order between three panels: 1) the global heatmap, 2) the Excel output file "Markers.xlsx", and 3) the table "Top 2 DEGs per cell cluster". Markers in 1) are sorted by decreasing "avg_logFC", markers in 2) by increasing "p_val", and markers in 3) are selected (as the top two) based on "avg_logFC" but ordered by increasing "p_val" (as in "Markers.xlsx"). The top gene selection for showing individual marker genes (e.g. in the ridge plot) is again based on decreasing "avg_logFC", as for the global heatmap. To make the report easier to navigate, this should be harmonized.

future framework warnings about not statistically sound random seeds

The following code throws several warnings about not statistically sound random seeds:
(Normalise data the original way)
sc = purrr::map(sc, Seurat::NormalizeData, normalization.method = "LogNormalize", scale.factor=10000, verbose=FALSE)

"UNRELIABLE VALUE: Future (‘future_lapply-1’) unexpectedly generated random numbers without specifying argument '[future.]seed'. There is a risk that those random numbers are not statistically sound and the overall results might be invalid. To fix this, specify argument '[future.]seed', e.g. 'seed=TRUE'. This ensures that proper, parallel-safe random numbers are produced via the L'Ecuyer-CMRG method. To disable this check, use [future].seed=NULL, or set option 'future.rng.onMisuse' to "ignore"."

These warnings always appear when knitting the project to HTML within RStudio.
The warnings appear occasionally when running the separate chunks within Rstudio (but not always).
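
The warning text itself names an option for disabling the check. A minimal sketch of setting it is shown below; note that this only silences the warning and does not make the random numbers parallel-safe, so it is a stopgap rather than a fix.

```r
# Option named in the warning text; this only silences the check
old_opts = options(future.rng.onMisuse="ignore")

# ... run the chunk that triggers the warning here ...

# Restore the previous setting afterwards:
# options(old_opts)
```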

Can we add expression values to the output in markers.xlsx?

It would be helpful to add some kind of expression data to the output in markers.xlsx.

At the moment, the table contains the average log fold change and p-value. We could add the average raw or normalised expression per cluster of cells. This way clients can discard DEGs with low expression.
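
The per-cluster averages could be computed along these lines. The matrix and the cluster assignment are made-up toy data; in the workflow they would come from the Seurat object.

```r
# Toy expression matrix: genes in rows, cells in columns (made-up values)
expr = matrix(c(1, 3, 10, 20,
                0, 0,  4,  6),
              nrow=2, byrow=TRUE,
              dimnames=list(c("geneA", "geneB"), paste0("cell", 1:4)))
clusters = c(cell1="0", cell2="0", cell3="1", cell4="1")

# Mean expression of each gene within each cluster (genes x clusters)
cells_by_cluster = split(colnames(expr), clusters[colnames(expr)])
avg_per_cluster = sapply(cells_by_cluster, function(cells) {
  rowMeans(expr[, cells, drop=FALSE])
})
```

The resulting matrix can be joined to the marker table by gene name before writing the Excel file.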

(Re)Integration of the violin plots

  • The “violin plots” as examples for marker genes of individual clusters are extremely valuable and do show aspects of the data that no other panel is able to. Thus, they should still be (re-)integrated into the standard report (as we have already realized in a recently ‘launched’ RCUG-specific version of the script/report).

Export figures to PDF

It would be nice to introduce the option to save figures in high-res PDF, so they can be used downstream for publications.
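
A small sketch of such an option, using only the base graphics device; the helper name save_pdf and its defaults are hypothetical. PDF is a vector format, so the output is independent of screen resolution.

```r
# Write a plot to a PDF file; width and height are in inches
save_pdf = function(plot_fun, path, width=7, height=5) {
  grDevices::pdf(path, width=width, height=height)
  # Close the device even if plotting fails
  on.exit(grDevices::dev.off())
  plot_fun()
  invisible(path)
}

pdf_path = save_pdf(function() plot(1:10, main="Example"), tempfile(fileext=".pdf"))
```

For a ggplot object p, the plotting function would be function() print(p); alternatively ggplot2::ggsave can write a plot directly to a .pdf file.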

Multiple input datasets

We plan to add the possibility to read multiple datasets and integrate them into one Seurat object.

Output of functional enrichment results is not informative yet

  • The currently contained table as output of the Functional enrichment analysis is not really informative (regarding actual results) and should be removed or modified to better reflect an overview of most strongly enriched gene sets or categories (e.g. GO terms) per cluster (to directly reflect the outcome of this specific approach).

Convert PlotMyStyle into theme

The PlotMyStyle function should be converted into a full ggplot theme function so that we can apply it in ggplot fashion: ggplot() + theme_plotmystyle()

See for example here: https://www.statworx.com/de/blog/custom-themes-in-ggplot2/

It makes the code more readable. Also we should think about another name and title/legend_title should be set outside of the function (I was thinking about the patchwork system).
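
A rough sketch of what such a theme function could look like. The name theme_plotmystyle is taken from the suggestion above; the base theme and the individual settings are placeholders, not the current PlotMyStyle settings. Titles are set outside the theme with labs().

```r
library(ggplot2)

# Hypothetical theme; the defaults are placeholders
theme_plotmystyle = function(base_size=11, legend_position="right") {
  theme_light(base_size=base_size) +
    theme(legend.position=legend_position,
          plot.title=element_text(face="bold"))
}

# Toy data to show the intended usage
df = data.frame(x=1:3, y=c(2, 4, 3))
p = ggplot(df, aes(x=x, y=y)) +
  geom_point() +
  labs(title="Example") +
  theme_plotmystyle(legend_position="none")
```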

Issues

Hi Oliver/Marius.
@mariusrueve

you can watch this repo (button top right) to be notified of all conversations, issues, pull requests (PRs) etc.

You can address people with @colindaven like in gitlab.

I'm not sure who can assign issues to users, perhaps only owners/maintainers like @ktrns ?

Best wishes,
Colin

HTO normalisation advice

Here is some advice about choosing normalisation for HTOs. Can we include this in the script?

From: satijalab/seurat#2954

Thanks, these are good questions

We advise normalizing across tags particularly when there is substantial variation in how well each hash performs. We saw this a lot when we were doing our own conjugations (like the original hashing paper)

We advise normalizing across cells in most other cases

Print versions at the end of report

How about printing the repository version and R packages versions at the end of the HTML report? This is done for nf-core pipelines for example, and is useful for the user when writing the paper.
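
A minimal sketch of collecting the versions with base R only; the function name package_versions is made up for this example. How the repository version is obtained is left open; the commented line shows one possibility that assumes the report is knitted inside a git checkout.

```r
# Collect name and version of every attached or loaded package
package_versions = function() {
  info = sessionInfo()
  pkgs = c(names(info$otherPkgs), names(info$loadedOnly), info$basePkgs)
  data.frame(package=pkgs,
             version=vapply(pkgs, function(p) as.character(packageVersion(p)), character(1)),
             row.names=NULL, stringsAsFactors=FALSE)
}

versions = package_versions()
r_version = R.version.string

# Possible way to get the repository version (assumes git is available):
# repo_version = system2("git", c("describe", "--always"), stdout=TRUE)
```

The resulting data frame could be printed with knitr::kable in the last chunk of the report.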

Small fixes

Feedback from Andreas:

  • Introduce HTO name tags
  • Introduce the term meta data
  • Replace "Identity" y-axis label with "Cluster"
  • Introduce what different plots are showing, particularly a dot plot
  • Look up adjusted p-value for DEGs: rounding issue? why is it all 0?

Logging

We need to decide whether we will use futile.logger, or simply message, warning, stop, etc.

Update code after major changes in cerebroApp

The cerebroApp has been rewritten substantially and most of the code for the export does not work anymore. I commented these parts out for now (commit 7b60d91) but we need to rewrite the function in order to export as much as possible to the cerebroApp.

Check parameters at the start

Similar to the online resources, it would be good to verify the parameters at the start of the script and stop when something is not right. This could be combined.
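
Such a check could collect all problems first and stop once, so that the user sees every issue in a single run. The parameter names path_data and npcs below are hypothetical; the real list is defined in the main script.

```r
# Hypothetical parameter names, for illustration only
check_params = function(param) {
  problems = character(0)
  if (is.null(param$path_data) || !is.character(param$path_data)) {
    problems = c(problems, "path_data must be a character string")
  }
  if (!is.numeric(param$npcs) || param$npcs < 1) {
    problems = c(problems, "npcs must be a positive number")
  }
  problems
}

param = list(path_data="counts/", npcs=0)
problems = check_params(param)

# In the script one would then stop early:
# if (length(problems) > 0) stop(paste(problems, collapse="\n"))
```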

Integration versus merging, CellCycleScoring on individual samples or merged dataset?

  • We need to be wise with integration. To integrate data from several samples, there should be some cells that overlap the samples, e.g. skin tissue from mouse sequenced in two different labs. The respective paper to read is: Comprehensive Integration of Single-Cell Data (Cell).
  • If we only "merge" several samples, we are currently re-running SCTransform. In addition, we might want to re-run CellCycleScoring, because it makes a difference whether this is run per sample, or run for the whole dataset. It should make more sense to move the complete merge chunk up to after filtering, as we anyway have to re-run normalisation. It doesn't make much sense to normalise individual samples if we don't use this at all.
  • What makes more sense? CellCycleScoring per sample, or for the whole merged dataset? The results differ quite a bit, not in terms of scores, but in terms of the assigned phase. If the scores wobble around 0, phase can easily change. This is discussed in: satijalab/seurat#2277
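
Why the phase flips so easily when scores are close to 0 can be illustrated with a simplified phase rule. The rule below (G1 if both scores are negative, otherwise the phase with the higher score, ties ignored) is an assumption about how the assignment works, and the scores are made up.

```r
# Simplified phase call from S and G2M scores (assumed rule, ties ignored)
assign_phase = function(s_score, g2m_score) {
  ifelse(s_score < 0 & g2m_score < 0, "G1",
         ifelse(s_score > g2m_score, "S", "G2M"))
}

# Made-up scores for one cell: scored per sample versus on the merged dataset
phase_per_sample = assign_phase(s_score=-0.01, g2m_score=-0.02)
phase_merged = assign_phase(s_score=0.01, g2m_score=-0.02)
```

A shift of 0.02 in the S score is enough to move this cell from G1 to S, although the scores themselves are practically unchanged.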

Read simple expression matrix as input

@andpet0101
Can we make it possible to read a simple expression matrix as input to our script? This would be useful for test datasets, which are usually saved as a tab-separated matrix.txt file with no further information.

Or is this already possible with the SmartSeq-2 reading function?
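
Reading such a file could be sketched as below. The layout (genes in rows, cells in columns, gene names in the first column) and the function name read_expression_matrix are assumptions; test datasets may be organised differently.

```r
# Toy file: genes in rows, cells in columns, first column holds gene names
matrix_file = tempfile(fileext=".txt")
writeLines(c("gene\tcell1\tcell2", "geneA\t1\t0", "geneB\t5\t2"), matrix_file)

read_expression_matrix = function(path) {
  # check.names=FALSE keeps cell barcodes exactly as written in the header
  tbl = read.delim(path, header=TRUE, row.names=1, check.names=FALSE)
  as.matrix(tbl)
}

counts = read_expression_matrix(matrix_file)

# Possible next step (requires Seurat):
# sc = Seurat::CreateSeuratObject(counts=counts)
```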
