scrnaseq_processing_seurat's People

Contributors

sreichl

scrnaseq_processing_seurat's Issues

update documentation

  • update current docs (work through current README)
  • incorporate changes to latest features
    • restructuring of results and plots
  • add latest features
    • heatmaps: rows are clustered and cells downsampled
    • pseudobulking
    • extended gRNA & KO call assignment
    • sctransform flavor v2
    • tested on 3 datasets of different sizes (10k-350k cells) and all modalities (RNA, Antibody, CRISPR, Custom)
  • generalize config.yaml

add example data

Lee 2021 Nat Genet CRC scRNA-seq data set from Weizmann 3CA

  • use marker genes for gene lists of all expected cell types -> get from GPT4

restructure results and speed up plotting

  • Check how the report behaves if output is a directory. How to make it still readable and nice?
  • make list of scripts (and rules) to adapt
  • restructure directory architecture in every script to {split}/{step}/plots/{type}/...
  • split panels into individual plots within their own subfolders {split}/{step}/plots/{type}/{vis_cat}/{feature_list}/{gene}.png
    • e.g., merged/NORMALIZED/plots/VlnPlot/condition/MHCgenes/gene1.png
      • Use output directory as result of the rule and use nested directories to encode metadata in addition to the file name? -> Actually possible as Snakemake is aware of metadata through the config file
    • ridge_plot.R
    • metadata_plot.R
    • violin_plot.R (STOPPED here with CORRECTED violins, the error pertains to the falsely inferred wildcard of gene_list -> enforce camelCase)
    • heatmap (collapse/simplify into one rule and adapt directory structure)
    • dotplot (collapse/simplify into one rule and adapt directory structure)
  • improve report by using labels
  • check "rule correct": Why is input NORMALIZED and not FILTERED? (RUNNING -> check if results differ)
    • read up on the correction; maybe it is only applied to scale.data? maybe module scores and HVGs are based only on data...
  • highly variable genes pre- and post-CORRECTION are the same, i.e., redundant output. compute them only for NORMALIZED
  • perform test run on AKsmall subset
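
The nested layout {split}/{step}/plots/{type}/{vis_cat}/{feature_list}/{gene}.png from the list above could be assembled with a small helper. This is only a sketch in base R; the function name plot_path and its argument names are hypothetical and not taken from the workflow's scripts.

```r
# Hypothetical helper (not part of the workflow): build the nested output
# path {split}/{step}/plots/{type}/{vis_cat}/{feature_list}/{gene}.png
plot_path <- function(split, step, type, vis_cat, feature_list, gene, ext = "png") {
  file.path(split, step, "plots", type, vis_cat, feature_list,
            paste0(gene, ".", ext))
}

p <- plot_path("merged", "NORMALIZED", "VlnPlot", "condition", "MHCgenes", "gene1")
print(p)

# The parent directory has to exist before a plot is saved, e.g.:
# dir.create(dirname(p), recursive = TRUE, showWarnings = FALSE)
```

Encoding the metadata in directory names keeps the file name itself short (just the gene), which matches the example merged/NORMALIZED/plots/VlnPlot/condition/MHCgenes/gene1.png.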

ideas

  • consider parallelization: https://stackoverflow.com/questions/8364288/what-hardware-limits-plotting-speed-in-r

  • consider split into jobs i.e., via snakemake -> no loops through feature lists

  • consider changing the output device, e.g., from PNG to PDF? -> PDF is vector-based... faster or slower? files are often far bigger...

  • problem: too many categories (e.g., 300 KOs) would generate plots that are too large

    • Split up the plot panels into multiple panels with suffix _1, _2, …
    • maybe split internally or using Snakemake (how)? would that increase the speed?
    • generate sub folder and split into individual plots -> gigantic parallelization possible but also a lot of file outputs (many many plots) and always lots of time lost to load the object
  • for now: increased size to max 100in in ggsave_new() in utils.R - works for everything BUT heatmaps -> see below

plots -> what is the error/problem? -> slurmstepd: error: Detected 1 oom_kill event in StepId=4572064.batch. Some of the step tasks have been OOM Killed.
heatmaps are large but only white
everything else is squished or too large to look at
-> check if there is a way to split a ggplot object into multiple plots in a useful manner -> could go into ggsave_new
-> if too large, then put text that says so (previously done in dea_limma?)
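
One way to approach the "too many categories" problem is to split the categories (e.g., 300 KOs) into fixed-size chunks before plotting and to save one plot per chunk with the suffix _1, _2, and so on. The sketch below uses base R only; chunk_categories and max_per_plot are hypothetical names, and the plotting call itself is left out.

```r
# Hypothetical helper: split a vector of categories into chunks of at most
# `max_per_plot` entries, named "_1", "_2", ... for use as file suffixes.
chunk_categories <- function(categories, max_per_plot = 50) {
  idx <- ceiling(seq_along(categories) / max_per_plot)
  chunks <- split(categories, idx)
  names(chunks) <- sprintf("_%d", seq_along(chunks))
  chunks
}

kos <- paste0("KO", 1:300)
chunks <- chunk_categories(kos, max_per_plot = 50)

# Each chunk would then be plotted and saved separately, e.g. to
# paste0("VlnPlot", names(chunks)[i], ".png") for chunk i.
print(names(chunks))
print(lengths(chunks))
```

This keeps every panel at a readable size, at the cost of more output files per feature list.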

extend KOcall strategy beyond singlets

support multiplets in a meaningful way
alphabetically ordered gene names as KO type in snake_case e.g., KOA_KOB_KOC

  • add column nKOcall (similar to Seurat nomenclature -> check again) describing number of KO genes assigned:
    • Negative (no KO): 0
    • Singlet: 1
    • Multiplet: X
  • KOcall: snake_case of all gene names:
    • Negative: NA
    • Singlet: KOA
    • Multiplet (alphabetically): KOA_KOB_KOC
  • add column ngRNA describing number of guides assigned:
    • Negative: 0
    • Singlet: 1
    • Multiplet: X
  • gRNAcall: snake_case of all guide calls:
    • Negative: NA
    • Singlet: guide-1
    • Multiplet (alphabetically): guide-1_guide-300

note: gRNA multiplet can be a KO singlet!
e.g., gRNAcall: geneX-1_geneX-2 -> KOcall: geneX
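
The four proposed columns could be derived per cell roughly as follows. This is a minimal base R sketch, assuming guide names follow the pattern <gene>-<index> as in the examples above; the function name summarize_calls is hypothetical.

```r
# Hypothetical helper: derive ngRNA, gRNAcall, nKOcall and KOcall for one cell
# from the character vector of guides assigned to it.
summarize_calls <- function(guides) {
  # alphabetical, locale-independent ordering of the unique guides
  guides <- sort(unique(guides), method = "radix")
  # Assumption: guide names look like "<gene>-<index>"; strip the trailing index
  genes <- sort(unique(sub("-[0-9]+$", "", guides)), method = "radix")
  list(
    ngRNA    = length(guides),
    gRNAcall = if (length(guides) > 0) paste(guides, collapse = "_") else NA_character_,
    nKOcall  = length(genes),
    KOcall   = if (length(genes) > 0) paste(genes, collapse = "_") else NA_character_
  )
}

# gRNA multiplet that is a KO singlet
str(summarize_calls(c("geneX-2", "geneX-1")))
# KO multiplet
str(summarize_calls(c("KOB-1", "KOA-2", "KOC-1")))
# negative cell (no guide assigned)
str(summarize_calls(character(0)))
```

Deriving KOcall from the guide names, rather than storing it independently, keeps the two columns consistent by construction.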

save result object in h5ad format for better interoperability with other packages

provide info in the docs e.g., resources but do not implement as the use cases are highly custom (which assays to put where etc.)

example code:

## converting to h5ad format (requires the Seurat and SeuratDisk packages)
library(Seurat)
library(SeuratDisk)

DefaultAssay(obj_m) <- "RNA"

h5seurat_path <- file.path(analysis_dir, "merged", "rna_merged.h5Seurat")
h5ad_path <- file.path(analysis_dir, "merged", "rna_merged.h5ad")

# remove stale output first, then save as h5Seurat
if (file.exists(h5seurat_path)) file.remove(h5seurat_path)
SaveH5Seurat(obj_m, filename = h5seurat_path)

# remove stale output first, then convert h5Seurat -> h5ad
if (file.exists(h5ad_path)) file.remove(h5ad_path)
Convert(h5seurat_path, dest = "h5ad")

speed up rule save_counts (bottleneck)

fwrite

  • test
    • input has to be data.frame
    • changed for now in save_counts.R
  • change in utils.R
  • change in metadata_plot.R
  • sctransform_cellScore.R

fread

  • change in metadata_plot.R
  • merge.R
  • prepare.R

general

  • make mr.pareto issue: look for write.csv across all MR.PARETO modules

https://rdrr.io/cran/data.table/man/fwrite.html

library(data.table)
library(Seurat)

# write the scaled SCT data; as.data.frame() keeps the gene names as row names
scaled <- as.data.frame(GetAssayData(object = seurat_object, slot = "scale.data", assay = "SCT"))
fwrite(scaled, file = file.path(result_dir, paste0(step, 'scaled_', 'RNA', '.csv')), row.names = TRUE)

# more general
# fast writing
fwrite(as.data.frame(df), file = "path/to/file.csv", row.names = TRUE)

# fast reading (check.names = FALSE keeps column names such as cell barcodes unchanged)
df <- data.frame(fread("path/to/file.csv", header = TRUE), row.names = 1, check.names = FALSE)

add support for an additional metadata file that emerged from downstream analyses

add support for an additional metadata file for the merged data, before the split, to enable analyses of subsets that emerged from downstream analyses (e.g., clustering, cell-type annotation, perturbation classification)

config: MERGE section (would also fit with the prefixed barcodes)
rule merge (would trigger reprocessing of all other subsets as well)

OR

config: SPLIT section (would also fit with the prefixed barcodes)
rule split (would not trigger reprocessing of all other subsets as well, but the merged object does not have the metadata)
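
Whichever rule ends up reading the file, the core step is aligning the additional metadata to the existing cells by (prefixed) barcode. A base R sketch with toy data follows; the barcodes and column names are made up for illustration. With a Seurat object, AddMetaData() could presumably take the place of the manual cbind().

```r
# Toy cell metadata, row names are prefixed barcodes (made-up values)
meta <- data.frame(condition = c("ctrl", "ctrl", "treated"),
                   row.names = c("s1_AAAC", "s1_AAAG", "s2_AAAT"),
                   stringsAsFactors = FALSE)

# Additional metadata from a downstream analysis; it may cover only some
# cells and may come in a different order
extra <- data.frame(cluster = c("c2", "c1"),
                    row.names = c("s2_AAAT", "s1_AAAC"),
                    stringsAsFactors = FALSE)

# Align by barcode; cells missing from `extra` get NA
meta_ext <- cbind(meta, extra[rownames(meta), , drop = FALSE])
rownames(meta_ext) <- rownames(meta)
print(meta_ext)
```

Cells without an entry keep NA, so a later split on the new column can decide whether to drop or retain them.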

pseudobulking of counts by metadata feature

  • new config field(s) to generate a simple pseudo count matrix for downstream analysis using bulk methods (breaking change)
    • probably a list of categorical metadata
    • e.g., pseudobulk: {by: ['patient', 'cellType', 'treatment'], method: "sum"}
  • look how others do it
  • support multiple methods
    • sum
    • mean
    • median
  • support modalities
    • RNA
    • AB
    • grna
    • custom
  • generate metadata sheet
    • include statistics of pseudobulked cell numbers per sample
  • document r-dplyr=1.1.2
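
To make the idea concrete, here is a minimal base R sketch of pseudobulking a genes x cells matrix by one or more categorical metadata columns, supporting sum, mean and median. It uses toy data, and the function name pseudobulk and its arguments are hypothetical; it is not the workflow's implementation (which the note above suggests relies on dplyr).

```r
# Toy counts (genes x cells) and matching cell metadata
counts <- matrix(c(1, 0, 3, 2,
                   0, 5, 1, 1),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("geneA", "geneB"), paste0("cell", 1:4)))
meta <- data.frame(patient  = c("P1", "P1", "P2", "P2"),
                   cellType = c("T", "T", "T", "B"),
                   row.names = colnames(counts),
                   stringsAsFactors = FALSE)

# Hypothetical helper: aggregate cells per group defined by the `by` columns
pseudobulk <- function(counts, meta, by, method = c("sum", "mean", "median")) {
  method <- match.arg(method)
  agg_fun <- switch(method, sum = sum, mean = mean, median = median)
  # one group label per cell, e.g. "P1_T"; unused combinations are dropped
  groups <- interaction(meta[colnames(counts), by, drop = FALSE],
                        sep = "_", drop = TRUE)
  cells <- split(colnames(counts), groups)
  # aggregate every gene (row) over the cells of each group
  pb <- sapply(cells, function(idx) apply(counts[, idx, drop = FALSE], 1, agg_fun))
  list(counts = pb, cell_count = lengths(cells))
}

res <- pseudobulk(counts, meta, by = c("patient", "cellType"), method = "sum")
print(res$counts)
print(res$cell_count)
```

The cell_count vector is the kind of statistic that could go into the generated metadata sheet.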

make normalization configurable

provide options from Seurat e.g., log2transform, SCTransform,...
think carefully. Is SCTransform enough/best? Are there any downsides (work)/upsides (???) in adding this functionality? Do I/we need it?

in that case, more functionality is required

extend pseudobulking functionality

  • check Teams discussion on pseudobulking for additional features
  • configurable filtering for cell_count e.g., if <20 cells do not include in output
  • metadata aggregation: keep columns which have the same values within each group -> if easy
  • visualize pseudobulked cell counts? Histograms? Metadata plots?
  • make report.rst for plot
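
The cell-count filter and the metadata aggregation from the list above could look roughly like this in base R. The data are toy values, min_cells stands in for a hypothetical config field (the issue suggests something like 20), and the column names are made up.

```r
# Toy per-cell metadata with a precomputed pseudobulk group label
meta <- data.frame(
  group   = c("P1_T", "P1_T", "P2_T", "P2_B", "P2_B", "P2_B"),
  patient = c("P1", "P1", "P2", "P2", "P2", "P2"),
  batch   = c("b1", "b2", "b1", "b1", "b1", "b1"),
  stringsAsFactors = FALSE
)

min_cells <- 2  # hypothetical config value, kept small for the toy data

# 1) keep only groups with at least `min_cells` cells
cell_count <- table(meta$group)
keep <- names(cell_count)[cell_count >= min_cells]

# 2) keep only metadata columns that are constant within every group
is_constant <- vapply(setdiff(names(meta), "group"), function(col) {
  all(tapply(meta[[col]], meta$group, function(v) length(unique(v)) == 1))
}, logical(1))
const_cols <- names(is_constant)[is_constant]

# 3) one row per retained group, plus the number of aggregated cells
pb_meta <- unique(meta[meta$group %in% keep, c("group", const_cols), drop = FALSE])
pb_meta$cell_count <- as.integer(cell_count[pb_meta$group])
rownames(pb_meta) <- NULL
print(pb_meta)
```

Here batch is dropped because it varies within P1_T, and P2_T is dropped because it has a single cell.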
