RNA-Seq User Guide

RNA-Seq Piplines using Tophat or STAR

RNA-Seq User Guide
- Table of Contents
- Report issues or feature requests

Overview

Prepare index and data:
- Build STAR index
- Download data
STAR Pipeline:
- Run STAR_pipeline.sh on HPC environment
- Run deseq2STAR.R
Data Visualization:
- Run BubblePlotGoTerm.R
- Run HeatMap.R
- Run VennDiagram.R
- Run BarPlot.R

Build STAR index locally

Download genome

wget http://ftp.ensembl.org/pub/release-103/fasta/mus_musculus/dna/Mus_musculus.GRCm39.dna.primary_assembly.fa.gz

Download annotation file

wget http://ftp.ensembl.org/pub/release-103/gtf/mus_musculus/Mus_musculus.GRCm39.103.gtf.gz

Preprocess genome files

gunzip *.gz
mv Mus_musculus.GRCm39.dna.primary_assembly.fa mm10.fa
mv Mus_musculus.GRCm39.103.gtf mm10.gtf
grep -v -i 'gene_biotype "rRNA";\|gene_biotype "Mt_rRNA";' mm10.gtf > mm10_no_rRNA.gtf

Index genome

samtools faidx mm10.fa
java -jar -Xmx16g /public/apps/picard/2.17.1/picard.jar CreateSequenceDictionary R=mm10.fa O=mm10.dict

Create STAR Index

# If read length is 150, use 149 (150-1) for parameter --sjdbOverhang
STAR --runThreadN 8 --runMode genomeGenerate --genomeDir . --genomeFastaFiles mm10.ERCC92.fa --sjdbGTFfile mm10.ERCC92.gtf --sjdbOverhang 149  --genomeChrBinNbits 18 --limitGenomeGenerateRAM 48524399488

Build STAR index on HPC

Submit STAR_index.sh

# Config the genomedir
# Update the genome and gtf file path for download
qsub STAR_index.sh

# Check the download logs to ensure no transfer errors.

Download data

Download published data with HPC environment. Download scripts and example are available at SRA Download

# Config HPC settings in sra-download.sh. Make sure the task number equals to the sample number. (eg. -t 1~9 means 9 sampels for download)
# Update the rename step accordingly for single ended data. Then submit the array jobs to HPC.

qsub sra-download.sh

# Check the download logs to ensure no transfer errors. Move downloaded data to indir in STAR_pipeline.sh after download jobs are completed.

Run STAR pipeline

Run STAR pipleine with HPC environment

# Config HPC settings in sra-download.sh. Make sure the task number equals to the sample number. eg. '-t 1~12' means 12 sampels for download; '-pe smp 8' means using 8 cores of parallel environment(pe) "smp". You need to change it to your pe.

# Configure the directories in the job script. Especially for indir, genomedir, gtf, and RefSeqbed (RefSeqbed is used by infer_experiment.py(RSeQC) for checking library strand info)
# sourcedir=$(pwd)
# indir=$sourcedir/00_fastq
# genomedir=<STAR4 directory>
# gtf=$genomedir/mm10.gtf
# RefSeqbed=$genomedir/mm10_RefSeq_Ensembl.bed
# Add sample_list.txt to script directory and update it with sample names

qsub STAR_pipeline.sh

# Check whethr the library is stranded and set the '-s 1'. Default is '-s 0' for non-stranded library.
# infer_experiment.py -i ${sample}.bam -r mm10_RefSeq.bed
# None Stranded data look like:
# Reading reference gene model mm10_RefSeq.bed ... Done
# Loading SAM/BAM file ...  Total 200000 usable reads were sampled
# This is PairEnd Data
# Fraction of reads failed to determine: 0.0212
# Fraction of reads explained by "1++,1--,2+-,2-+": 0.4974
# Fraction of reads explained by "1+-,1-+,2++,2--": 0.4814

Run fastqc and RSeQC

Run fastqc and RSeQC pipleine with HPC environment

# Config HPC settings in sra-download.sh. Make sure the task number equals to the sample number. eg. '-t 1~12' means 12 sampels for download; '-pe smp 8' means using 8 cores of parallel environment(pe) "smp". You need to change it to your pe.

# Configure the directories in the job script. Especially for trimmdir, bamdir (generated or copied from STAR_pipeline.sh), genomedir, and RefSeqbed
# sourcedir=$(pwd)
# trimmdir=$sourcedir/00_fastq/trimmed
# bamdir=$sourcedir/02_bam
# genomedir=<STAR4 directory>
# gtf=$genomedir/mm10.gtf
# RefSeqbed=$genomedir/mm10_RefSeq_Ensembl.bed

qsub RseQC.sh

# Check fastqc and RseQC reports and figures in according folders.

Run DEG analysis

Run deseq2STAR.R

# Config comparison and projectdir before running the script.
# Create a folder for each comparion, add the counts files (output of STAR_pipeline.sh) for samples from both groups to this folder
# Add Conditions.txt to each comparion folder, and update sample, group, and batch info for according comparions. Leave batch info blank if no batch effect expected.
# Add SelectedLabledGenes.csv to Selected_genes folder, and update it with the genes you want to lable in the MA plot and Volcano plot
# Install below packages if not installed
# Recommend to run it in Rstudio
# Run deseq2STAR.R for each comparison first, then run the R scripts in plots folder for BarPlot, BubblePlot, Heatmap, and VennDiagrams.
# To included all samples in the PCA plot, run a pseduo comparison that includes all of the samples.

library(DESeq2)
library(gtools)
library(dplyr)
library(RColorBrewer)
library(pheatmap)
library(ggplot2)

# Run below scripts in Rstudio
# deseq2STAR.R
# VennDiagram.R
# HeatMap.R
# BarPlot.R
# BubblePlotGOTerm.R

Report issues or feature requests

Open git repository link: Issues and Feature
Click "New Issue"
Enter the details and submit.

mikefeixu / rna-seq Goto Github PK

rna-seq's Introduction

RNA-Seq User Guide

Table of Contents

Overview

Build STAR index locally

Build STAR index on HPC

Download data

Run STAR pipeline

Run fastqc and RSeQC

Run DEG analysis

Report issues or feature requests

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent