pseudobulk's People

Contributors

vertesy

pseudobulk's Issues

Better PCA analysis

Motivation

To understand how similar the samples are, PCA is a good start.
However, the current analysis is insufficient:

  • The first 3 PCs often do not separate all identified clusters.
  • Each axis needs to be scaled in proportion to the variance explained by its PC.

Output needed

  • Get the variance explained by each PC
  • Plot more PCs
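
A minimal sketch of the two outputs listed above, assuming the PCA has already been run and that the per-PC standard deviations are available (for example the `sdev` vector that R's `prcomp()` returns). The function names `variance_explained`, `axis_aspect` and `pc_pairs` are hypothetical helpers, not part of the pipeline.

```python
from itertools import combinations


def variance_explained(sdev):
    """Fraction of variance explained by each PC.

    sdev: per-PC standard deviations. The fractions are relative to the PCs
    supplied, so pass all PCs if you want fractions of the total variance.
    """
    variances = [s * s for s in sdev]
    total = sum(variances)
    return [v / total for v in variances]


def axis_aspect(fractions, x_pc, y_pc):
    """Height/width ratio for a plot of PC x_pc (x axis) vs. PC y_pc (y axis),
    so that each axis length is proportional to the variance explained.
    PCs are numbered from 1."""
    return fractions[y_pc - 1] / fractions[x_pc - 1]


def pc_pairs(n_pcs):
    """All pairs among the first n_pcs PCs, for plotting more than PC1-PC3."""
    return list(combinations(range(1, n_pcs + 1), 2))


fractions = variance_explained([3.0, 2.0, 1.0])
for x_pc, y_pc in pc_pairs(3):
    print(f"PC{x_pc} vs PC{y_pc}: aspect {axis_aspect(fractions, x_pc, y_pc):.3f}")
```

The aspect here follows the wording above literally (proportional to variance). Scaling by the standard deviation instead, i.e. the square root of this ratio, is another common convention.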

Strand specific mapping location of reads via stranded bigwig files

Motivation

Sense and antisense transcripts overlap for some interesting genes, such as TSC2 or MALAT1.
You need stranded bigwig files to see whether reads map to the sense or the antisense strand.

How to

You can get both stranded and unstranded bigwig files with the standard pipeline.

Results

Load the stranded .bw files into IGV:
/Volumes/abel-1/Data/pseudobulk/iiRNAseq_ii.GRCh38_20191120162110/bigwig/
or:
/groups/knoblich/users/abel/Data/pseudobulk/iiRNAseq_ii.GRCh38_20191120162110/bigwig/

Interpretation

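deepTools names the strands the opposite way from what you might expect, so the file labelled with one strand can contain the reads of the other. Check the strand on a known gene before interpreting.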

Turn pseudobulk from scRNAseq with logCPM values back into a Seurat object with count values?

May I ask one question:

If the data is already a pseudobulk object from scRNAseq data with logCPM values, how can I change it back to a Seurat object with count values? Can I still use the above method to turn the data back into a Seurat object?

My data was normalized into pseudobulk data as follows:
"Normalizing count data
After excluding poor quality cells, we normalized the sequencing depth of each cell by dividing
each cell’s counts by the total counts in that cell, resulting in a matrix where the entries represent
the proportion of a cell’s reads allocated to each gene (i.e. values in the range [0,1]). To estimate
a library size for each dataset, we summed the total counts in each cell, and then we took the
median as the library size for the dataset. Next, we multiplied the proportions by the library size
to get a count matrix that was normalized for sequencing depth. Finally, we transformed the
normalized count matrix with log2(1 + count). We referred to this log-transformed quantity in the
figures as log2CPM.

We created pseudobulk expression (L. Lun, Bach, and Marioni 2016) for the cells in a
cluster for each donor such that the pseudobulk matrix had one row for each gene and one column
for each cluster from each patient.
We normalized the pseudobulk counts to log2CPM as
described in the previous section. Then we use limma::lmFit() to test for differential gene
expression with the log2CPM pseudobulk matrix (Ritchie et al. 2015). We also use
presto::wilcoxauc() to compute the area under receiver operator curve (AUROC or AUC) for the
log2CPM value of each gene as a predictor of the cluster membership for each cluster
(Korsunsky, Nathan, et al. 2019).
"
Thanks, best wishes
J.
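
A minimal sketch of the normalization quoted above, to show which steps can be undone. It assumes raw counts are given as one list per cell (or per pseudobulk column), and that cells with zero total counts were already removed. The function names `log2cpm_like` and `invert_log2` are hypothetical. The log2(1 + x) step can be inverted exactly, but that only returns depth-normalized values. The original integer counts would additionally need the total count of each column, which is not part of the log2CPM matrix.

```python
import math
from statistics import median


def log2cpm_like(counts):
    """counts: list of columns (cells), each a list of raw per-gene counts.

    Follows the quoted description: proportions per cell, multiplied by the
    median total count (the "library size"), then log2(1 + value).
    Assumes every cell has a non-zero total.
    """
    totals = [sum(cell) for cell in counts]
    library_size = median(totals)
    normalized = [
        [math.log2(1 + (c / total) * library_size) for c in cell]
        for cell, total in zip(counts, totals)
    ]
    return normalized, library_size


def invert_log2(values):
    """Undo only the log2(1 + x) step; the result is still depth-normalized."""
    return [[2 ** v - 1 for v in cell] for cell in values]


raw = [[1, 3], [10, 10], [0, 8]]          # three cells, two genes
logged, library_size = log2cpm_like(raw)
recovered = invert_log2(logged)
print(library_size)   # median of the totals 4, 20, 8
print(recovered)      # depth-normalized values, not the raw counts
```

In this toy example the first cell comes back as roughly [2, 6] instead of the raw [1, 3], because its counts were rescaled to the median library size.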

Biological findings from pseudoBulk: intronic content and MALAT1

General

  • Check if you see the same in our other sc datasets
    • Rerun all 6 with Oli
  • Check if you see the same in bulk

MALAT1

  • Is the MALAT1 correlation to RC true in general?
  • Is the MALAT1 fraction correlation to TSS-RC true in general?

Intronic content

  • Where do we find high intronic content?

To Do

Milestones to finalize the pipeline

  • Barcode export from Seurat
  • Barcode de-duplications
  • Find the cause of repeated qNames in the 10X aligned bam file

Barcode de-duplications

  • Explore: is it already done by 10X?
  • Find solution via 10X or alternatively via a python script
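
If the Python route is taken, a first step could look like the sketch below. It assumes the read names (qNames) and the barcodes have already been extracted into plain lists of strings. How they are read from the BAM file or exported from Seurat is left open, and both function names are hypothetical. It only reports and removes repeats; it does not explain why they occur.

```python
from collections import Counter


def repeated_qnames(qnames):
    """Return {qname: number of records} for every qname seen more than once."""
    counts = Counter(qnames)
    return {name: n for name, n in counts.items() if n > 1}


def deduplicate_barcodes(barcodes):
    """Drop repeated barcodes, keeping the first occurrence and the order."""
    return list(dict.fromkeys(barcodes))


# Toy input; real values would come from the BAM file and the Seurat export.
qnames = ["read1", "read2", "read1", "read3", "read1"]
barcodes = ["AAACCTGAGAAACCAT-1", "AAACCTGAGAAACCGC-1", "AAACCTGAGAAACCAT-1"]

print(repeated_qnames(qnames))
print(deduplicate_barcodes(barcodes))
```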

Possible future improvements

  • Use the exact same index & reference to what 10X cell ranger uses.
  • Make it executable directly on a 10X output folder (using the clustering that 10X does).

Using the exact same index & reference

→ This is actually needed if you want to analyze 10X runs that have been mapped to the pre-mRNA reference

When you run the VBC RNAseq pipeline, provide the gene model (GTF) and the exact same human genome they use (you have to link both directly).

  • --gtf to use exactly the same names
    • if you provide both --gtf and --fasta it recreates the reference, see:
    • provide both --gtf and --fasta and --saveReference, but not --genome, to save the custom reference
    • Pass on to Maria → add it to the Nextflow config
