README for pRSEM demo

Introduction
Download
System Requirements
Submodules
Usage
Input
Output
Contact
License

Introduction

Prior-enhanced RSEM (pRSEM) is an RNA-seq quantification method that utilizes external data for the task of transcript abundance estimation. The workflow of pRSEM is illustrated in the following figure.

This repository is a mini-example for running pRSEM. It contains all the required software packages, input files, and scripts. More details is described below. This demo runs in 4 threads. The installation and running will take about 20 to 30 minutes on a 4 x 2.4GHz core machine depending on which R/Bioconductor libraries users have already installed.

Download

There are two ways to download this demo and all three required submodules

Download the tar ball or zip file via https. They are also listed on the release page with name pRSEM_demo_and_AllSubmodules
Use a git command: git clone --recursive [email protected]:pliu55/pRSEM_demo

System Requirements

Linux
Hard drive space > 2.5G
Perl version >= 5.8.8
Python version >= 2.7.3
R version >= 3.3.1
POSIX Threads
Boost (C++ libraries)

Submodules

This demo requires three submodules:

pRSEM: prior-enhanced RSEM
STAR: aligner for RNA-seq reads, version 2.5.2b
Bowtie: aligned for ChIP-seq reads, version 1.0.1

Usage

Go to this demo's folder and type

./run_pRSEM_demo.sh

This script will carry out the following tasks:

Install Bowtie, STAR, RSEM, and all required libraries not yet installed by users
Prepare genome references for Bowtie, STAR, and RSEM
Derive prior parameters from RNA Polymerase II ChIP-seq data and use them to quantify RNA-seq data
Derive prior parameters from a combination of four histone modfication ChIP-seq data sets and use them to quantify RNA-seq data
Perform a testing procedure using a combination of four histone modification ChIP-seq data sets as the external data
Perform a testing procedure using RNA Polymerase II ChIP-seq peaks as the external data

Input

All of the following data sets are under the folder input/. The RNA-seq and PolII ChIP-seq data were derived from ENCODE2 mouse Mel cell line. Although they are derived from a cell line rather than from tissue, we named them with keyword mmliver just to be consistent with the examples given in RSEM's documentation. The four histone modification ChIP-seq data sets were derived from Lara-Astiaso and Weiner et al. Science 2014 345:943.

RNA-seq, paired-end reads in FASTQ format, a small subset of the RNA-seq data on Mel's biological replicate 1
- mmliver_1.fa.gz: first mate
- mmliver_2.fa.gz: second mate
Pol II ChIP-seq reads in FASTQ format
- mmliver_PolIIRep1.fq.gz: replicate 1 for RNA polymerase II, a small subset of the Pol II ChIP-seq data on Mel's biological replicate 1
- mmliver_PolIIRep2.fq.gz: replicate 2 for RNA polymerase II, a small subset of the Pol II ChIP-seq data on Mel's biological replicate 2
- mmliver_ChIPseqCtrl.fa.gz: replicate 1 for control, a small subset of the the control ChIP-seq data on Mel's biological replicate 1
Pol II ChIP-seq peaks in BED format
- PolII.bed.gz: ChIP-seq peaks for RNA polymerase II. It was derived from the above FASTQ files using ENCODE2's pipeline with an IDR threshold at 0.05.
Histone modification ChIP-seq in FASTQ format
- H3K27Ac.fastq.gz: a small subset of H3K27Ac modification ChIP-seq data from LT_HSC cells
- H3K4me1.fastq.gz: a small subset of H3K4me1 modification ChIP-seq data from LT_HSC cells
- H3K4me2.fastq.gz: a small subset of H3K4me2 modification ChIP-seq data from LT_HSC cells
- H3K4me3.fastq.gz: a small subset of H3K4me3 modification ChIP-seq data from LT_HSC cells
Genome sequence
- chr19.fa.gz: gzipped Chromosome 19 sequence from mouse's mm10 assembly.
Transcript annotation
- chr19.gtf.gz: gzipped Protein-coding genes on chromsome 19 from GENCODE mouse release M4
Mappability
- mm10.36mer.chr19.fake.bigWig: a mocked alignability files for mm10's chromosome 19

Output

All output files will be stored in the following four folders under output/:

genome_references/: genomic index files for Bowtie, STAR, and RSEM

quantification_results_PolII/: output files from pRSEM's expression-calculation step by using prior from Pol II ChIP-seq data. They are organized in the same way as RSEM's output. The only difference is that there are five files specifically generated by pRSEM under folder output/quantification_results_PolII/demo.stat/:

demo_prsem.all_tr_features: isofrom features to derive and assign pRSEM prior. The first line is a header and the rest is one isoform per line. The description for each column is:

column name	descrption
trid	transcript ID from input annotation
geneid	gene ID from input anntation
chrom	isoform's chromosome name
strand	isoform's strand name
start	TSS if isoform is on '+' strand; TES if isoform is on '-' strand; where TSS stands for isoform's transcription start site, i.e. 5'-end, and TES stands for isoform's transcript end site, i.e. 3'-end
end	TES if isoform is on '+' strand; TSS if isoform is on '-' strand
tss_mpp	average mappability of [TSS-500bp, TSS+500bp]
body_mpp	average mappability of (TSS+500bp, TES-500bp)
tes_mpp	average mappability of [TES-500bp, TES+500bp]
pme_count	isoform's fragment or read count from RSEM's posterior mean estimates
tss	isoform's TSS
tss_pk	equal to 1 if isoform's [TSS-500bp, TSS+500bp] region overlaps with a RNA Pol II peak; 0 otherwise
is_training	equal to 1 if isoform is in the training set where Pol II prior is learned; 0 otherwise
partition	group number that isoform was partitioned to

demo_prsem.all_tr_prior: prior parameters for every isoform. This file does not have a header. Each line contains an isoform's prior parameter and its transcript ID delimited by #
demo_prsem.pval_LL: contains a p-value indicating if Pol II ChIP-seq data is informative for RNA-seq quantification and a log-likelihood denoting the fitness of pRSEM's Dirichlet-multinomial model to Pol II-partitioned training set isoforms
demo_uniform_prior_1.gene.results: RSEM's posterior mean estimates on the gene level with an initial pseudo-count of one for every isoform
demo_uniform_prior_1.isoform.results: RSEM's posterior mean estimates on the isoform level with an initial pseudo-count of one for every isoform

quantification_results_histone/: output files from pRSEM's expression-calculation step by using prior from four histone modification ChIP-seq data. They are very similar to those in quantification_results_PolII/. The only difference are in two files:

demo_prsem.all_tr_features: similar to quantification_results_PolII/demo.stat/demo_prsem.all_tr_features. It does not have the column tss_pk, but a few more new columns as shown below:

column name	description
pme_TPM	isoform abundance from RSEM's posterior mean estimates
H3K27Ac	normalized H3K27Ac signals (in RPKM) of isoform's [TSS-500, TSS+500]
H3K4me1	normalized H3K4me1 signals (in RPKM) of isoform's [TSS-500, TSS+500]
H3K4me2	normalized H3K4me2 signals (in RPKM) of isoform's [TSS-500, TSS+500]
H3K4me3	normalized H3K4me3 signals (in RPKM) of isoform's [TSS-500, TSS+500]
prd_expr_prob	probability of being expressed from a logistic regression model
prior	isoform's prior parameter

demo_prsem.lgt_mdl.RData: It stores an R logistic regression object, which was obtained based on four histone modification signals and fragment counts of isoforms in the training set.

testing_procedure_histone_PolII/: output files from pRSEM's testing procedure. Test results are saved in demo.all.pval_LL. All the other files are similar to those in quantification_results_PolII/ and quantification_results_histone/.
- demo.all.pval__LL: contains three lines
  - line 1: Header
  - line 2: Testing procedure result when using four histone modification ChIP-seq data as the external data for RNA-seq quantification
  - line 3: Testing procedure result when using Pol II peaks as the external data for RNA-seq quantification

Please note that, in order to shorten the running time as much as possible, the input ChIP-seq and RNA-seq files were prepared in extremely small (and unrealistic) sizes, and Gibbs sampling were set to run in just 100 instead of the default 1000 steps. As a result, the final quantification results may vary from time to time.

Contact

Got a question? Please post it at RSEM's GitHub Issues page with @pliu55 mentioned.

License

This demo is licensed under the GNU General Public License v3.

pliu55 / prsem_demo Goto Github PK

prsem_demo's Introduction

README for pRSEM demo

Table of Contents

Introduction

Download

System Requirements

Submodules

Usage

Input

Output

Contact

License

prsem_demo's People

Contributors

Watchers

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent