Code Monkey home page Code Monkey logo

rna-seq's Introduction

RNA-Seq User Guide

RNA-Seq Piplines using Tophat or STAR

Table of Contents

Overview

  • Prepare index and data:
    • Build STAR index
    • Download data
  • STAR Pipeline:
    • Run STAR_pipeline.sh on HPC environment
    • Run deseq2STAR.R
  • Data Visualization:
    • Run BubblePlotGoTerm.R
    • Run HeatMap.R
    • Run VennDiagram.R
    • Run BarPlot.R

Build STAR index locally

  • Download genome
wget http://ftp.ensembl.org/pub/release-103/fasta/mus_musculus/dna/Mus_musculus.GRCm39.dna.primary_assembly.fa.gz
  • Download annotation file
wget http://ftp.ensembl.org/pub/release-103/gtf/mus_musculus/Mus_musculus.GRCm39.103.gtf.gz
  • Preprocess genome files
gunzip *.gz
mv Mus_musculus.GRCm39.dna.primary_assembly.fa mm10.fa
mv Mus_musculus.GRCm39.103.gtf mm10.gtf
grep -v -i 'gene_biotype "rRNA";\|gene_biotype "Mt_rRNA";' mm10.gtf > mm10_no_rRNA.gtf
  • Index genome
samtools faidx mm10.fa
java -jar -Xmx16g /public/apps/picard/2.17.1/picard.jar CreateSequenceDictionary R=mm10.fa O=mm10.dict
  • Create STAR Index
# If read length is 150, use 149 (150-1) for parameter --sjdbOverhang
STAR --runThreadN 8 --runMode genomeGenerate --genomeDir . --genomeFastaFiles mm10.ERCC92.fa --sjdbGTFfile mm10.ERCC92.gtf --sjdbOverhang 149  --genomeChrBinNbits 18 --limitGenomeGenerateRAM 48524399488

Build STAR index on HPC

  • Submit STAR_index.sh
# Config the genomedir
# Update the genome and gtf file path for download
qsub STAR_index.sh

# Check the download logs to ensure no transfer errors. 

Download data

  • Download published data with HPC environment. Download scripts and example are available at SRA Download
# Config HPC settings in sra-download.sh. Make sure the task number equals to the sample number. (eg. -t 1~9 means 9 sampels for download)
# Update the rename step accordingly for single ended data. Then submit the array jobs to HPC.

qsub sra-download.sh

# Check the download logs to ensure no transfer errors. Move downloaded data to indir in STAR_pipeline.sh after download jobs are completed.

Run STAR pipeline

  • Run STAR pipleine with HPC environment
# Config HPC settings in sra-download.sh. Make sure the task number equals to the sample number. eg. '-t 1~12' means 12 sampels for download; '-pe smp 8' means using 8 cores of parallel environment(pe) "smp". You need to change it to your pe.

# Configure the directories in the job script. Especially for indir, genomedir, gtf, and RefSeqbed (RefSeqbed is used by infer_experiment.py(RSeQC) for checking library strand info)
# sourcedir=$(pwd)
# indir=$sourcedir/00_fastq
# genomedir=<STAR4 directory>
# gtf=$genomedir/mm10.gtf
# RefSeqbed=$genomedir/mm10_RefSeq_Ensembl.bed
# Add sample_list.txt to script directory and update it with sample names

qsub STAR_pipeline.sh

# Check whethr the library is stranded and set the '-s 1'. Default is '-s 0' for non-stranded library.
# infer_experiment.py -i ${sample}.bam -r mm10_RefSeq.bed
# None Stranded data look like:
# Reading reference gene model mm10_RefSeq.bed ... Done
# Loading SAM/BAM file ...  Total 200000 usable reads were sampled
# This is PairEnd Data
# Fraction of reads failed to determine: 0.0212
# Fraction of reads explained by "1++,1--,2+-,2-+": 0.4974
# Fraction of reads explained by "1+-,1-+,2++,2--": 0.4814

Run fastqc and RSeQC

  • Run fastqc and RSeQC pipleine with HPC environment
# Config HPC settings in sra-download.sh. Make sure the task number equals to the sample number. eg. '-t 1~12' means 12 sampels for download; '-pe smp 8' means using 8 cores of parallel environment(pe) "smp". You need to change it to your pe.

# Configure the directories in the job script. Especially for trimmdir, bamdir (generated or copied from STAR_pipeline.sh), genomedir, and RefSeqbed
# sourcedir=$(pwd)
# trimmdir=$sourcedir/00_fastq/trimmed
# bamdir=$sourcedir/02_bam
# genomedir=<STAR4 directory>
# gtf=$genomedir/mm10.gtf
# RefSeqbed=$genomedir/mm10_RefSeq_Ensembl.bed

qsub RseQC.sh

# Check fastqc and RseQC reports and figures in according folders.

Run DEG analysis

  • Run deseq2STAR.R
# Config comparison and projectdir before running the script.
# Create a folder for each comparion, add the counts files (output of STAR_pipeline.sh) for samples from both groups to this folder
# Add Conditions.txt to each comparion folder, and update sample, group, and batch info for according comparions. Leave batch info blank if no batch effect expected.
# Add SelectedLabledGenes.csv to Selected_genes folder, and update it with the genes you want to lable in the MA plot and Volcano plot
# Install below packages if not installed
# Recommend to run it in Rstudio
# Run deseq2STAR.R for each comparison first, then run the R scripts in plots folder for BarPlot, BubblePlot, Heatmap, and VennDiagrams.
# To included all samples in the PCA plot, run a pseduo comparison that includes all of the samples.

library(DESeq2)
library(gtools)
library(dplyr)
library(RColorBrewer)
library(pheatmap)
library(ggplot2)

# Run below scripts in Rstudio
# deseq2STAR.R
# VennDiagram.R
# HeatMap.R
# BarPlot.R
# BubblePlotGOTerm.R

Report issues or feature requests

  • Open git repository link: Issues and Feature
  • Click "New Issue"
  • Enter the details and submit.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.