The scrnaseq_processing_seurat from epigen

update documentation

update current docs (work through current README)
incorporate changes to latest features
- restructuring of results and plots
add latest features
- heatmaps: rows are clustered and cells downsampled
- pseudobulking
- extended gRNA & KO call assignment
- sctransform flavor v2
- tested on 3 datasets of different sizes (10k-350k cells) and all modalities (RNA, Antibody, CRISPR, Custom)
generalize config.yaml

add example data

Lee 2021 Nat Genet CRC scRNA-seq data set from Weizmann 3CA

use marker genes for gene lists of all expected cell types -> get from GPT4

restructure results and speed up plotting

ideas

consider parallelization: https://stackoverflow.com/questions/8364288/what-hardware-limits-plotting-speed-in-r
consider split into jobs i.e., via snakemake -> no loops through feature lists
consider change of output device e.g., from PNG to PDF? -> PDF vectors... faster or slower? often far bigger...
problem: too many categories (e.g., 300 KOs) would generate plots that are too large
- Split up the plot panels into multiple panels with suffix _1, _2…
- maybe splitting internally or using Snakemake (how)? will it increase the speed?
- generate sub folder and split into individual plots -> gigantic parallelization possible but also a lot of file outputs (many many plots) and always lots of time lost to load the object
for now: increased size to max 100in in ggsave_new() in utils.R - works for everything BUT heatmaps -> see below
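
A minimal sketch of the panel-splitting idea above, assuming the categories are simply chunked in alphabetical order and each chunk gets its own file with a _1, _2, ... suffix. It is written in Python to illustrate the logic only; the helper names split_into_panels and panel_paths are hypothetical and would have to be ported to ggsave_new() in utils.R.

```python
from pathlib import PurePosixPath
from typing import List, Sequence


def split_into_panels(categories: Sequence[str], max_per_panel: int) -> List[List[str]]:
    """Sort categories alphabetically and cut them into chunks of at most max_per_panel."""
    if max_per_panel < 1:
        raise ValueError("max_per_panel must be at least 1")
    ordered = sorted(categories)
    return [ordered[i:i + max_per_panel] for i in range(0, len(ordered), max_per_panel)]


def panel_paths(plot_path: str, n_panels: int) -> List[str]:
    """Derive one output path per panel by appending _1, _2, ... before the extension."""
    path = PurePosixPath(plot_path)
    if n_panels <= 1:
        # A single panel keeps the original file name.
        return [str(path)]
    return [
        str(path.with_name(f"{path.stem}_{i}{path.suffix}"))
        for i in range(1, n_panels + 1)
    ]


# Example: 300 hypothetical KO categories, at most 100 per plot.
kos = [f"KO{i:03d}" for i in range(300)]
panels = split_into_panels(kos, max_per_panel=100)
paths = panel_paths("plots/violin.png", len(panels))
```

Each chunk could then be plotted as its own job, which would also fit the "split into jobs via snakemake" idea, at the cost of reloading the object once per job.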

plots -> what is the error/problem? -> slurmstepd: error: Detected 1 oom_kill event in StepId=4572064.batch. Some of the step tasks have been OOM Killed.
heatmaps large but only white
everything else squished or too large to look at
-> check if there is a useful way to split a ggplot object into multiple plots -> could go into ggsave_new
-> if too large then put text that says so (previously done in dea_limma?)

order all plot panels alphabetical

DotPlots: rows and columns in
VlnPlot: columns
RidgePlot: rows
Heatmap rows

extend KOcall strategy beyond singlets

support multiplets in a meaningful way
alphabetically ordered gene names as KO type in snake_case e.g., KOA_KOB_KOC

add column nKOcall (similar to Seurat nomenclature -> check again) describing number of KO genes assigned:
- Negative (no KO): 0
- Singlet: 1
- Multiplet: X
KOcall snakecase of all gene names:
- Negative: NA
- Singlet: KOA
- Multiplet (alphabetically): KOA_KOB_KOC
add column ngRNA describing number of guides assigned:
- Negative: 0
- Singlet: 1
- Multiplet: X
gRNAcall snakecase of all guide calls:
- Negative: NA
- Singlet: guide-1
- Multiplet (alphabetically): guide-1_guide-300

note: gRNA multiplet can be a KO singlet!
e.g., gRNAcall: geneX-1_geneX-2 -> KOcall: geneX
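
The assignment rules above can be sketched as follows. This is an illustration in Python, not the pipeline's R implementation. It assumes that every guide is named <gene>-<index> (as in geneX-1), so the gene is everything before the last hyphen; the names PerturbationCall, guide_to_gene and call_perturbation are hypothetical. NA is represented as None.

```python
from typing import NamedTuple, Optional, Sequence


class PerturbationCall(NamedTuple):
    ngRNA: int               # number of distinct guides assigned to the cell
    gRNAcall: Optional[str]  # guides joined by "_", None for negatives
    nKOcall: int             # number of distinct KO genes assigned to the cell
    KOcall: Optional[str]    # genes joined by "_", None for negatives


def guide_to_gene(guide: str) -> str:
    # Assumption: guide names look like "<gene>-<index>", e.g. "geneX-2".
    return guide.rsplit("-", 1)[0]


def call_perturbation(guides: Sequence[str]) -> PerturbationCall:
    """Derive guide-level and gene-level calls from the guides assigned to one cell."""
    unique_guides = sorted(set(guides))
    genes = sorted({guide_to_gene(g) for g in unique_guides})
    return PerturbationCall(
        ngRNA=len(unique_guides),
        gRNAcall="_".join(unique_guides) or None,  # empty string becomes None
        nKOcall=len(genes),
        KOcall="_".join(genes) or None,
    )


negative = call_perturbation([])
same_gene = call_perturbation(["geneX-2", "geneX-1"])  # gRNA multiplet, KO singlet
multiplet = call_perturbation(["KOB-1", "KOA-1", "KOC-3"])
```

Note that the ordering is plain string sorting, so guide-10 would sort before guide-2; a natural sort would need extra handling.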

generalize to other modalities e.g., sc/snATAC-seq

generalize to other/all modalities available
rename to sc_processing_seurat or similar

save result object in h5ad format for better interoperability with other packages

provide info in the docs e.g., resources but do not implement as the use cases are highly custom (which assays to put where etc.)

example code:

## converting to h5ad format (requires the SeuratDisk package)
library(Seurat)
library(SeuratDisk)

DefaultAssay(obj_m) <- "RNA"

h5seurat_path <- file.path(analysis_dir, "merged", "rna_merged.h5Seurat")
h5ad_path <- file.path(analysis_dir, "merged", "rna_merged.h5ad")

# remove stale outputs first so that saving and converting do not fail on existing files
if (file.exists(h5seurat_path)) file.remove(h5seurat_path)
SaveH5Seurat(obj_m, filename = h5seurat_path)

if (file.exists(h5ad_path)) file.remove(h5ad_path)
Convert(h5seurat_path, dest = "h5ad")

speed up rule save_counts (bottleneck)

fwrite

test
- input has to be data.frame
- changed for now in save_counts.R
change in utils.R
change in metadata_plot.R
sctransform_cellScore.R

fread

change in metadata_plot.R
merge.R
prepare.R

general

make mr.pareto issue: look for write.csv across all MR.PARETO modules

https://rdrr.io/cran/data.table/man/fwrite.html

library(data.table)
fwrite(as.data.frame(GetAssayData(object = seurat_object, slot = "scale.data", assay = "SCT")), file = file.path(result_dir, paste0(step, 'scaled_', 'RNA', '.csv')), row.names=TRUE)

# more general
#fast writing
fwrite(as.data.frame(df), file=file.path("path/to/file.csv"), row.names=TRUE)

#fast reading
df <- data.frame(fread(file.path("path/to/file.csv"), header=TRUE), row.names=1)

add support for generic gene by barcode/cell count matrices

step prepare:
check if folder or (sparse) matrix and import appropriately to create Seurat object.
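
A rough sketch of the check described above, in Python for illustration only. The returned labels and the file extensions are assumptions, not the pipeline's actual conventions, and detect_input_kind is a hypothetical helper.

```python
import os


def detect_input_kind(path: str) -> str:
    """Classify an input path so the prepare step can pick a matching import routine."""
    if os.path.isdir(path):
        # Assumption: a folder holds a 10x-style matrix/barcodes/features triplet.
        return "directory"
    lower = path.lower()
    if lower.endswith((".mtx", ".mtx.gz")):
        return "sparse_matrix"
    if lower.endswith((".csv", ".csv.gz", ".tsv", ".tsv.gz")):
        return "dense_matrix"
    raise ValueError(f"unsupported input: {path}")


kind_of_cwd = detect_input_kind(os.getcwd())
kind_of_mtx = detect_input_kind("sample1/counts.mtx.gz")
```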

consider how to deal with multimodal data?

make dot plot color scale centered at 0
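
One way to centre a diverging colour scale at 0 is to make its limits symmetric around zero. The sketch below shows only that arithmetic in Python; centered_limits is a hypothetical helper. In ggplot2 the resulting limits could, for example, be passed to scale_colour_gradient2() together with midpoint = 0, which is an assumption about how the plots are built.

```python
from typing import Iterable, Tuple


def centered_limits(values: Iterable[float]) -> Tuple[float, float]:
    """Return (-m, m), where m is the largest absolute value, so that 0 sits mid-scale."""
    m = max((abs(v) for v in values), default=0.0)
    return (-m, m)


limits = centered_limits([-0.5, 2.0, 1.0])
```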

add support for additional metadata file emerged from downstream analyses

add additional metadata file support for merged data, before the split to enable analyses of subsets that emerged from downstream analyses (eg clustering, cell-type annotation, perturbation classification)

config: MERGE section (would also fit with the prefixed barcodes)
rule merge (would trigger reprocessing of all other subsets as well)

OR

config: SPLIT section (would also fit with the prefixed barcodes)
rule split (would not trigger reprocessing of all other subsets as well, but the merged object does not have the metadata)

pseudobulking of counts by metadata feature

make normalization configurable

provide options from Seurat e.g., log2transform, SCTransform,...
think carefully. Is SCTransform enough/best? Are there any downsides (work)/upsides (???) in adding this functionality? Do I/we need it?

in that case, more functionality is required

Normalize: https://satijalab.org/seurat/reference/normalizedata
Scale
FindVariableFeatures
check wherever assay "SCT" is hard-coded and change

update sctransform to use vst.flavor="v2"

how to install
https://github.com/satijalab/sctransform
https://anaconda.org/conda-forge/r-sctransform

how to use
https://satijalab.org/seurat/archive/v4.3/sctransform_v2_vignette

full tests on new functions

AKsmall_test
Lee2021NatGenet -> #17
EMICROP (when does it break? -> e.g., Heatmaps) -> RUNNING

dynamic memory for data intense rules depending on input

https://stackoverflow.com/questions/50891407/snakemake-how-to-dynamically-set-memory-resource-based-on-input-file-size
added dynamic mem_mb based on attempt variable to the following rules

merge
normalize
save counts

mem_mb = lambda wc, attempt: attempt*int(config.get("mem", "16000")),
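
The linked question also suggests scaling memory with input size. A sketch combining both ideas, assuming a hypothetical helper scaled_mem_mb and an arbitrary factor of 4 MB of memory per MB of input, which would need tuning per rule:

```python
def scaled_mem_mb(input_size_mb: float, attempt: int,
                  base_mb: int = 16000, per_input_mb: float = 4.0) -> int:
    """Memory request in MB: at least base_mb, scaled by input size and retry attempt."""
    needed = max(base_mb, per_input_mb * input_size_mb)
    return int(attempt * needed)


# Possible use inside a rule (not executed here), relying on Snakemake
# passing `input` and `attempt` to resource callables:
#   resources:
#       mem_mb=lambda wildcards, input, attempt: scaled_mem_mb(
#           input.size_mb, attempt, int(config.get("mem", "16000")))

small_first_try = scaled_mem_mb(100, attempt=1)
large_second_try = scaled_mem_mb(10000, attempt=2)
```

As with the attempt-only version, this only helps when the workflow is run with retries enabled so that a failed job is resubmitted with a higher attempt number.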

background noise removal using CellBender

review why important: https://genomebiology.biomedcentral.com/articles/10.1186/s13059-023-02978-x
published in Nature Methods: https://www.nature.com/articles/s41592-023-01943-7
e.g., CellBender https://cellbender.readthedocs.io/en/latest/

Cyclic graph dependency when metadata not provided

Try to return only sample path and put metadata as parameter

extend pseudobulking functionality

check Teams discussion on pseudobulking for additional features
configurable filtering for cell_count e.g., if <20 cells do not include in output
metadata aggregation: keep columns which have the same values within each group -> if easy
visualize pseudobulked cell counts? Histograms? Metadata plots?
make report.rst for plot
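
The cell_count filter and the metadata aggregation listed above could look roughly like this. It is a plain-Python sketch on dictionaries, not the pipeline's R code; the function name pseudobulk, the min_cells parameter and the data layout are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, Tuple


def pseudobulk(counts: Dict[str, Dict[str, int]],
               metadata: Dict[str, Dict[str, object]],
               group_by: str,
               min_cells: int = 20) -> Tuple[dict, dict]:
    """Sum counts per group, drop small groups, keep metadata constant within a group.

    counts:   barcode -> {gene: count}
    metadata: barcode -> {column: value}
    """
    groups = defaultdict(list)
    for barcode, meta in metadata.items():
        groups[meta[group_by]].append(barcode)

    bulk_counts, bulk_meta = {}, {}
    for group, barcodes in groups.items():
        if len(barcodes) < min_cells:
            continue  # configurable filter: too few cells for a reliable pseudobulk
        summed = defaultdict(int)
        for bc in barcodes:
            for gene, n in counts[bc].items():
                summed[gene] += n
        bulk_counts[group] = dict(summed)

        # Keep only columns whose value is identical for every cell in the group.
        first = metadata[barcodes[0]]
        kept = {
            col: val for col, val in first.items()
            if all(metadata[bc].get(col) == val for bc in barcodes)
        }
        kept["cell_count"] = len(barcodes)
        bulk_meta[group] = kept
    return bulk_counts, bulk_meta


counts = {"c1": {"A": 1, "B": 2}, "c2": {"A": 3}, "c3": {"B": 5}}
metadata = {
    "c1": {"sample": "s1", "batch": "x", "cluster": 0},
    "c2": {"sample": "s1", "batch": "x", "cluster": 1},
    "c3": {"sample": "s2", "batch": "y", "cluster": 0},
}
bulk_counts, bulk_meta = pseudobulk(counts, metadata, "sample", min_cells=2)
```

Here the cluster column is dropped for s1 because its values differ within the group, and s2 is filtered out because it has fewer than min_cells cells. The cell_count values kept in the metadata could feed the histogram idea above.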
