ktrns / scrnaseq Goto Github PK
View Code? Open in Web Editor NEWWorkflow for single-cell RNA-seq analysis using Seurat
License: MIT License
Workflow for single-cell RNA-seq analysis using Seurat
License: MIT License
Working on my clients project, I noticed a few things
In some cases, HTML tables appear as full width, e.g.:
knitr::kable(summary, align="l", caption="Dataset summary") %>%
kableExtra::kable_styling(bootstrap_options=c("striped", "hover"), full_width=FALSE, position="left")
In other cases, the exact same code works:
knitr::kable(sc_markers_top2, align="l", caption="Top 2 DEGs per cell cluster") %>%
kableExtra::kable_styling(bootstrap_options = c("striped", "hover"), full_width=FALSE, position="left")
How can I generate a report of a single sample?
When I'm trying to do so I get an error in the 'cells_per_cluster' chunk where 'tbl' is expected to be a matrix or data.frame with more than just one dimension.
At the moment, we (more or less) expect Ensembl IDs and convert them into gene symbols for Seurat.
We could add for more flexibility:
The latter would allow for cross-species analysis.
Hi @mariusrueve (I moved the issue from this fork to here)
I will try to create a prototype Singularity container recipe for this.
This is only one file which then needs to be built:
I'll need to know the exact packages that are used:
a) Ubuntu packages (apt-get). Ubuntu version (20.04)
b) R packages
Marius answer - key packages
Depending on the identified number of clusters, some (sub-)panels (e.g. of the ridge plot and the violin plot) get strongly compressed the more clusters have been identified in a current situation (see example below). Therefore, some different way to define the space and dimensions in the html report is required.
Dear all,
in the case of a project with few cells in a sc-RNA-seq dataset, I am getting the same error as in:
I'm correcting it via editing the npcs argument (default npcs=50), such as follows:
sc = Seurat::RunPCA(sc, features=Seurat::VariableFeatures(object=sc), verbose=FALSE, npcs=14)
Best,
Dimitra
Marker genes are provided as lists of genes, e.g.:
param$known_markers = list()
param$known_markers[["bcell"]] = c("CD79A", "MS4A1")
param$known_markers[["tcell"]] = "CD3D"
param$known_markers[["tcell.cd8+"]] = c("CD8A", "CD8B")
param$known_markers[["nk"]] = c("GNLY", "NKG7")
param$known_markers[["myeloid"]] = c("CST3", "LYZ")
param$known_markers[["monocytes"]] = "FCGR3A"
param$known_markers[["dendritic"]] = "FCER1A"
These lists might be much longer. It would be nice to
... because now cell names are adapted during integration and don't match the original cell names.
We are re-writing bits in the beginning of the main script to be able to read SmartSeq-2 sequencing data as well.
In the chunk qc_plot_cells, the violin plots do not work for samples with less than three cells. In this case I suggest that we just plot points. Here is code that would do this.
p_list[[i]]= ggplot(sc_cell_metadata, aes_string(x="orig.ident", y=i, fill="orig.ident")) +
geom_violin(scale="width")
# Adds points for samples with less than three cells since geom_violin does not work here
p_list[[i]] = p_list[[i]] +
geom_point(data=sc_cell_metadata %>% dplyr::filter(orig.ident %in% names(which(table(sc_cell_metadata$orig.ident)<3))), aes_string(x="orig.ident", y=i, fill="orig.ident"), shape=21, size=2)
# Now add styles
p_list[[i]] = p_list[[i]] +
AddStyle(title=i, legend_position="none", fill=param$col_samples, xlab="") +
theme(axis.text.x=element_text(angle=45, hjust=1))
# Creates a table with min/max values for filter i for each dataset
If we use the workflow to analyse multiple samples, we would also like to find DEGs that are specific to one sample versus the other.
It would be great to extend the workflow with pseudotime analysis. Several customers would like to see this kind of analysis, see velocity or trajectory inference.
We will split the code again into two parts such that HTO demultiplexing will generate a standalone small report, and generates demultiplexed data that can be processed with the main scrnaseq analysis script.
You added the new normalization method SCTransform to the script. But it seems that JackStraw does not accept SCT transformed data and stops with an error.
... to make sure that genes are properly translated and we don't skip half of the genes.
We should carefully go through the main HTML report and think which Plots can go into Tabs, as it is done for the HTO HTML report.
Once we implement another step to identify DEGs between 2 defined conditions, we can rename current DEGs to markers, and change the documentation accordingly.
This doesn't seem quite right. Needs a double-check. Also, in the HTO script, we are not yet converting any names, not sure if it is even required.
We plan to add the possibility to regress out cell cycle effects, or effects due to other biological processes defined by user-specified gene lists.
In DegsAvgData, mean values are calculated for the assays and the three slots count, data, scale.data. For data, the expression values are logged counts and I would suggest to first convert them back to normal counts. At the moment we calculate the mean of log values which different from the log mean of values.
if (sl=="data") {
id_avg = Matrix::rowMeans(exp(GetAssayData(sc, assay=as, slot=sl)[genes, id_cells]))
id_avg = log(id_avg)
} else if (sl=="counts") {
id_avg = Matrix::rowMeans(GetAssayData(sc, assay=as, slot=sl)[genes, id_cells])
}
One might also think about using log2 for the means.
For scale.data I would not calculate a mean at all since it is already centered and scaled. I do not know how the mean can then be interpreted.
Andreas
satijalab/seurat#3346 (comment)
Posted a question to the Seurat GitHub. If Seurat indeed uses the natural log, we should re-calculate it to log2. Its more intuitive to understand.
At the moment, we use a log2-based threshold for FC, which would be a bug.
The current way we name DcGC and RCUG as authors at the top of the report is slightly confusing. We discussed this issue and decided it would be better to move an author section to the bottom of the report, and add the name of the bioinformatician on top (as parameter in the YAML header) who is working on the very project.
We would like to add a chunk to export the Seurat object to be able to visualise the data in Cerebro.
It is confusing for end-users that the actual filter criteria that yield in candidates of the Excel output file „Markers.xlsx“ are quite different from what is currently applied for the bar graph panel “Number of DEGs per cell cluster”. This should be harmonized.
At present, there are some obvious inconsistencies in marker gene order comparing the three different panels, namely 1) global heatmap 2) Excel output file „Markers.xlsx“ 3) table “Top 2 DEGs per cell cluster”. Markers in 1) are sorted acc. to decreasing “avg_logFC”, markers in 2) acc. to increasing “p_val”, markers in 3) were selected (as the top two) based on “avg_logFC” but were intrinsically ordered according to increasing “p_val”(as in “Markers.xlsx). The top gene selection for showing individual marker genes (e.g. in the ridge plot) is again based on decreasing “avg_logFC” similar as for the global heatmap. To improve general and intuitive orientation within the report this should be harmonized.
The following code throws several warnings about not statistically sound random seeds:
(Normalise data the original way)
sc = purrr::map(sc, Seurat::NormalizeData, normalization.method = "LogNormalize", scale.factor=10000, verbose=FALSE)
"UNRELIABLE VALUE: Future (‘future_lapply-1’) unexpectedly generated random numbers without specifying argument '[future.]seed'. There is a risk that those random numbers are not statistically sound and the overall results might be invalid. To fix this, specify argument '[future.]seed', e.g. 'seed=TRUE'. This ensures that proper, parallel-safe random numbers are produced via the L'Ecuyer-CMRG method. To disable this check, use [future].seed=NULL, or set option 'future.rng.onMisuse' to "ignore"."
These warnings appear always when knitting the project to html within Rstudio.
The warnings appear occasionally when running the separate chunks within Rstudio (but not always).
It would be helpful to add some kind of expression data to the output in markers.xlsx
.
At the moment, the table contains the average log fold change and p-value. We could add the average raw or normalised expression per cluster of cells. This way clients can discard DEGs with low expression.
It would be nice to introduce the option to save figures in high-res PDF, so they can be used downstream for publications.
https://github.com/MHH-RCUG/scrnaseq_app/tree/dev
This is the link to my dev branch of the shiny app.
It looks as if markers.xlsx
lacks the annotation. Headers are there, but cells are not filled.
We plan to add the possibility to read multiple datasets and integrate them into one Seurat object.
The PlotMyStyle function should be converted into a full ggplot theme function so that we can apply it in ggplot fashion: ggplot() + theme_plotmystyle()
See for example here: https://www.statworx.com/de/blog/custom-themes-in-ggplot2/
It makes the code more readable. Also we should think about another name and title/legend_title should be set outside of the function (I was thinking about the patchwork system).
Hi Oliver/Marius.
@mariusrueve
you can watch this repo (button top right) to be notified of all conversations, issues, pull requests (PRs) etc.
You can address people with @colindaven like in gitlab.
I'm not sure who can assign issues to users, perhaps only owners/maintainers like @ktrns ?
Best wishes,
Colin
Check online ressources at the beginning of the script and fail if a ressource is not available
Here is some advice about choosing normalisation for HTOs. Can we include this in the script?
From: satijalab/seurat#2954
Thanks, these are good questions
We advise normalizing across tags particualrly when there is substantial variation in how well each hash performs. We saw this a lot when we were doing our own conjugations (like the original hashing paper)
We advise normalizing across cells in most other cases
How about printing the repository version and R packages versions at the end of the HTML report? This is done for nf-core pipelines for example, and is useful for the user when writing the paper.
Feedback from Andreas:
We need to decide whether we will use futile.logger
, or simply message
, warning
, stop
, etc.
We need to go through our R/ scripts and update documentation of the functions.
The cerebroApp has been rewritten substantially and most of the code for the export does not work anymore. I commented these parts out for now (commit 7b60d91) but we need to rewrite the function in order to export as much as possible to the cerebroApp.
CombinePlots() is soon deprecated as this functionality is now supported by the 'patchwork' package.
Similar to the online ressources, it would be good to verify the parameters at the start of the script and stop when something is not right. This could be combined.
Hi @ktrns , Marius and all.
Alternatives just released for Cite-seq analysis (might be helpful ?)
CiteFuse is freely available at http://shiny.maths.usyd.edu.au/CiteFuse/ as an online web service and at https://github.com/SydneyBioX/CiteFuse/ as an R package.
TotalVI - different software I believe and more about SC data integration:
https://www.biorxiv.org/content/10.1101/2020.05.08.083337v1
Best wishes,
Colin
@andpet0101
Can we make it possible to read a simple expression matrix as input to our script? This would be useful for test datasets, which usually are saved as matrix.txt tab separated file, no further information.
Or is this already possible with the SmartSeq-2 reading function?
A declarative, efficient, and flexible JavaScript library for building user interfaces.
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google ❤️ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.