vertesy / pseudobulk

Cluster-specific pseudo-bulk analysis of 10X single-cell RNA-seq data by connecting Seurat to the VBC RNA-seq pipeline.

Home Page: https://vertesy.github.io/pseudoBulk

License: MIT License

R 41.65% Shell 48.40% Perl 9.96%

bulk pseudo-bulk rna-seq seurat single-cell split-bam

pseudobulk's Issues

Better PCA analysis

Motivation

To understand how similar the samples are, PCA is a good start.
However, the current analysis is insufficient:

The first 3 PCs often do not separate all identified clusters.
Each axis needs to be scaled in proportion to the variance explained by its PC.

Output needed

Get the variance explained by each PC
Plot more PCs
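
A minimal sketch of the two outputs requested above, assuming the PCA step can export the per-PC standard deviations (as R's prcomp does in its sdev field). The function names and the numbers are hypothetical and only illustrate the arithmetic. Note that the fractions are relative to the PCs passed in, so a truncated list overstates each share.

```python
def variance_explained(sdev):
    """Return the fraction of variance carried by each PC.

    `sdev` holds per-PC standard deviations (assumed input);
    the variance of a PC is the square of its standard deviation.
    """
    variances = [s * s for s in sdev]
    total = sum(variances)
    if total == 0:
        raise ValueError("all standard deviations are zero")
    return [v / total for v in variances]


def axis_lengths(fractions, base=10.0):
    """Axis lengths relative to PC1, proportional to variance explained."""
    return [base * f / fractions[0] for f in fractions]


sdev = [4.0, 3.0, 2.0, 1.0]  # made-up values for illustration
fractions = variance_explained(sdev)
for i, f in enumerate(fractions, start=1):
    print(f"PC{i}: {100 * f:.1f}% of variance")
print(axis_lengths(fractions))
```

Plotting PC pairs with these axis lengths keeps distances along a weak PC from looking as large as distances along PC1.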

New Cell Ranger has a different input format for clusters – I need to change that

Now a table with cell and cluster columns is needed.
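
A minimal sketch of reading such a table and grouping barcodes by cluster. The column names Barcode and Cluster are an assumption about the exported file, and the barcodes below are made up; adjust both to the actual table.

```python
import csv
import io
from collections import defaultdict


def barcodes_by_cluster(handle, cell_col="Barcode", cluster_col="Cluster"):
    """Group cell barcodes by cluster from a two-column CSV table."""
    groups = defaultdict(list)
    for row in csv.DictReader(handle):
        groups[row[cluster_col]].append(row[cell_col])
    return dict(groups)


# Made-up example table; a real file would be opened with open(path, newline="").
example = io.StringIO(
    "Barcode,Cluster\n"
    "AAACCTGAGAAACCAT-1,1\n"
    "AAACCTGAGAAACCGC-1,2\n"
    "AAACCTGAGAAACCTA-1,1\n"
)
clusters = barcodes_by_cluster(example)
for cluster, barcodes in sorted(clusters.items()):
    print(cluster, len(barcodes))
```

Each per-cluster barcode list can then be written to its own file for the BAM-splitting step.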

Strand specific mapping location of reads via stranded bigwig files

Motivation

Sense and antisense transcripts have overlapping regions for some interesting genes, such as TSC2 or MALAT1.
You need stranded bigwig files to see whether reads map to the sense or the antisense strand.

How to

You can get both stranded and unstranded bigwig files with the standard pipeline.

Results

Load the stranded .bw files into IGV:
/Volumes/abel-1/Data/pseudobulk/iiRNAseq_ii.GRCh38_20191120162110/bigwig/
or:
/groups/knoblich/users/abel/Data/pseudobulk/iiRNAseq_ii.GRCh38_20191120162110/bigwig/

Interpretation

deepTools appears to report the opposite strand because of its naming convention, so check which strand each file actually shows.

Turn pseudobulk from scRNA-seq with logCPM values back into a Seurat object with count values?

May I ask one question:

If the data is already a pseudobulk object from scRNA-seq data with logCPM values, how can I change it back to a Seurat object with count values? Can I still use the above method to turn the data back into a Seurat object?

My data was normalized into pseudobulk data as follows:
"Normalizing count data
After excluding poor quality cells, we normalized the sequencing depth of each cell by dividing
each cell’s counts by the total counts in that cell, resulting in a matrix where the entries represent
the proportion of a cell’s reads allocated to each gene (i.e. values in the range [0,1]). To estimate
a library size for each dataset, we summed the total counts in each cell, and then we took the
median as the library size for the dataset. Next, we multiplied the proportions by the library size
to get a count matrix that was normalized for sequencing depth. Finally, we transformed the
normalized count matrix with log2(1 + count). We referred to this log-transformed quantity in the
figures as log2CPM.

We created pseudobulk expression (L. Lun, Bach, and Marioni 2016) for the cells in a
cluster for each donor such that the pseudobulk matrix had one row for each gene and one column
for each cluster from each patient.
We normalized the pseudobulk counts to log2CPM as
described in the previous section. Then we use limma::lmFit() to test for differential gene
expression with the log2CPM pseudobulk matrix (Ritchie et al. 2015). We also use
presto::wilcoxauc() to compute the area under receiver operator curve (AUROC or AUC) for the
log2CPM value of each gene as a predictor of the cluster membership for each cluster
(Korsunsky, Nathan, et al. 2019).
"
thanks best wishes
J.
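
A minimal sketch of the arithmetic behind this question, assuming the transform is exactly the one quoted above: log2(1 + proportion × library size). Inverting it gives depth-normalized values only; recovering raw counts also needs the original total per column, which the log2CPM matrix does not store. The function names and numbers are hypothetical.

```python
import math


def to_log2cpm(counts, library_size):
    """Forward transform from the quoted methods: log2(1 + proportion * library size)."""
    total = sum(counts)
    return [math.log2(1 + c / total * library_size) for c in counts]


def from_log2cpm(values, library_size, total_counts=None):
    """Invert the transform.

    Without `total_counts` this returns depth-normalized values;
    with it, the values are rescaled back to the original count scale.
    """
    normalized = [2 ** v - 1 for v in values]
    if total_counts is None:
        return normalized
    return [n / library_size * total_counts for n in normalized]


raw = [10, 30, 60]  # made-up counts for one pseudobulk column
lib = 1000          # hypothetical library size
logged = to_log2cpm(raw, lib)
recovered = from_log2cpm(logged, lib, total_counts=sum(raw))
print([round(x, 6) for x in recovered])
```

If the per-column totals are not available, only the normalized values can be recovered, and these are generally not integers.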

Biological findings from pseudoBulk: intronic content and MALAT1

General

Check if you see the same in our other sc datasets
- rerun all 6 with Oli
Check if you see the same in bulk

MALAT1

Is the MALAT1 correlation to RC true in general?
Is the MALAT1 fraction correlation to TSS-RC true in general?

intronic content and

Where do we find high intronic content?

To Do

Milestones to finalize the pipeline

Barcode export from Seurat
Barcode de-duplications
Find the cause of repeated qNames in the 10X aligned bam file
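
A minimal sketch for the last milestone, assuming the BAM file has been converted to SAM text first (for example with samtools view -h). In the SAM format, secondary (flag bit 0x100) and supplementary (0x800) alignments share the QNAME of their primary record, so counting them separately shows whether multi-mapping explains the repeats. For paired-end data, mates also share a QNAME. The function name and the records below are made up.

```python
from collections import Counter

SECONDARY = 0x100      # SAM flag bit: secondary alignment
SUPPLEMENTARY = 0x800  # SAM flag bit: supplementary alignment


def repeated_qnames(sam_lines):
    """Return {qname: (primary_count, non_primary_count)} for names seen more than once."""
    primary, other = Counter(), Counter()
    for line in sam_lines:
        if line.startswith("@"):  # header line
            continue
        fields = line.rstrip("\n").split("\t")
        qname, flag = fields[0], int(fields[1])
        if flag & (SECONDARY | SUPPLEMENTARY):
            other[qname] += 1
        else:
            primary[qname] += 1
    names = set(primary) | set(other)
    return {q: (primary[q], other[q]) for q in names if primary[q] + other[q] > 1}


# Made-up records; only the first two columns are parsed here.
lines = [
    "@HD\tVN:1.6\n",
    "read1\t0\tchr1\t100\t255\t10M\t*\t0\t0\tACGTACGTAC\tFFFFFFFFFF\n",
    "read1\t256\tchr2\t500\t0\t10M\t*\t0\t0\tACGTACGTAC\tFFFFFFFFFF\n",
    "read2\t16\tchr1\t300\t255\t10M\t*\t0\t0\tACGTACGTAC\tFFFFFFFFFF\n",
]
repeats = repeated_qnames(lines)
print(repeats)
```

A name with more than one primary record would point to a cause other than multi-mapping.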

Barcode de-duplications

Explore: is it already done by 10X?
Find solution via 10X or alternatively via a python script

Possible future improvements

Use the exact same index & reference to what 10X cell ranger uses.
Make it directly executable on a 10X output folder (using the clustering that 10X does).

Using the exact same index & reference

→ This is actually needed if you want to analyze 10X runs that have been mapped to the pre-mRNA reference

When you run the VBC RNAseq pipeline, provide the gene model (GTF) and the exact same human genome they use (you have to link both directly).

--gtf to use exactly the same names
- if you provide both --gtf and --fasta it recreates the reference, see:
- both --gtf and --fasta and --saveReference but not --genome to save the custom reference
- Pass on to Maria → add it to the nextflow config
