Code Monkey home page Code Monkey logo

rmats's Introduction

Input

  • Input: fastq or bam

fastq

Building the STAR index

To index the genome with STAR for RNA-seq analysis, the sjdbOverhang option needs to be specified for detecting possible splicing sites. It usually equals to the minimum read size minus 1; it tells STAR what is the maximum possible stretch of sequence that can be found on one side of a spicing site

Path to hg38 gtf file: /data/Data/Database/16.ENCODE_shRNA_rMATs/01.rMATs_result/file/GENCODE_V29_anno.gtf

Path to hg38 fasta file: /data/Data/Database/00.human_reference_genome/hg38/hg38.fa

STAR \
--runMode genomeGenerate \
--genomeDir /path/to/genomeDir \
--genomeFastaFiles /path/to/genome/fasta \
--sjdbGTFfile /path/to/annotations.gtf \
--sjdbOverhang ReadLength-1

Use fastp to do the quality control for fasta file

/data/Data/ang_chu_data/raw_RNA_seq/HepG2_non_specific_target/ctrl_bam_file/QC_bam_file$ fastp --in1 /data/Data/ang_chu_data/raw_RNA_seq/HepG2_non_specific_target/ENCFF802OSJ_1_R1.fastq --in2 /data/Data/ang_chu_data/raw_RNA_seq/HepG2_non_specific_target/ENCFF566QBO_1_R2.fastq --out1 ENCFF802OSJ_1_R1_trimmed.fastq --out2 ENCFF566QBO_1_R2_trimmed.fastq

Test data

/data/Data/ang_chu_data/raw_RNA_seq/HepG2_SAP30BP_KD/test_bam_file/QC_test_bam_files$ fastp --in1 /data/Data/ang_chu_data/raw_RNA_seq/HepG2_SAP30BP_KD/ENCFF735NFI_1_R1.fastq --in2 /data/Data/ang_chu_data/raw_RNA_seq/HepG2_SAP30BP_KD/ENCFF578JJD_1_R2.fastq --out1 ENCFF735NFI_1_R1_trimmed.fastq --out2 ENCFF578JJD_1_R2_trimmed.fastq

Create BAM files using STAR alignment

  • Ctrl

STAR --runThreadN 6 --genomeDir /home/Giang/raw_RNA_seq_practice/RMATS2/hg38_index1/ --readFilesIn /data/Data/ang_chu_data/raw_RNA_seq/HepG2_non_specific_target/ctrl_bam_file/QC_bam_file/trimmed_fastp_files/ENCFF802OSJ_1_R1_trimmed.fastq /data/Data/ang_chu_data/raw_RNA_seq/HepG2_non_specific_target/ENCFF566QBO_1_R2.fastq --outSAMtype BAM SortedByCoordinate --outFileNamePrefix ctrl1 STAR --runThreadN 6 --genomeDir /home/Giang/raw_RNA_seq_practice/RMATS2/hg38_index1/ --readFilesIn /data/Data/ang_chu_data/raw_RNA_seq/HepG2_non_specific_target/ctrl_bam_file/QC_bam_file/trimmed_fastp_files/ENCFF330LGV_2_R1_trimmed.fastq /data/Data/ang_chu_data/raw_RNA_seq/HepG2_non_specific_target/ctrl_bam_file/QC_bam_file/trimmed_fastp_files/ENCFF232USH_2_R2_trimmed.fastq --outSAMtype BAM SortedByCoordinate --outFileNamePrefix ctrl2

  • Test

STAR --runThreadN 6 --genomeDir /home/Giang/raw_RNA_seq_practice/RMATS2/hg38_index1/ --readFilesIn /data/Data/ang_chu_data/raw_RNA_seq/HepG2_SAP30BP_KD/test_bam_file/QC_test_bam_files/trimmed_fastq_files/ENCFF735NFI_1_R1_trimmed.fastq /data/Data/ang_chu_data/raw_RNA_seq/HepG2_SAP30BP_KD/test_bam_file/QC_test_bam_files/trimmed_fastq_files/ENCFF578JJD_1_R2_trimmed.fastq --outSAMtype BAM SortedByCoordinate --outFileNamePrefix test1 STAR --runThreadN 6 --genomeDir /home/Giang/raw_RNA_seq_practice/RMATS2/hg38_index1/ --readFilesIn /data/Data/ang_chu_data/raw_RNA_seq/HepG2_SAP30BP_KD/test_bam_file/QC_test_bam_files/trimmed_fastq_files/ENCFF747JAG_2_R1_trimmed.fastq /data/Data/ang_chu_data/raw_RNA_seq/HepG2_SAP30BP_KD/test_bam_file/QC_test_bam_files/trimmed_fastq_files/ENCFF058PDK_2_R2_trimmed.fastq --outSAMtype BAM SortedByCoordinate --outFileNamePrefix test2

bam

  • The aligned bam file do not need to be aligned again --> can be analyzed directly.
  • The alignment must be genome-base = align directly to chromosome coordinates and non specific transcripts --> splicing information
  • Need to prepare: –b1 [Group 1 bam], –b2 [Group 2 bam] –s1, –s2, –b1, –b2 are .txt files, and the content is a combination of paths to the same set of files, separated by ",". pair-ended fastq separated by ":" For example:

ctrl_bam.txt

cat ctrl_bam /path/to/ctrl_1.bam,/path/to/ctrl_2.bam

ctrl_fastq.txt

cat ctrl_fastq /path/to/ctrl_1_R1.fastq:/path/to/ctrl_1_R2.fastq,/path/to/ctrl_2_R1.fastq:/path/to/ctrl_2_R2.fastq

** Note: please note that the file content's path should only occupy a single row. If it spans multiple rows, an error may occur, preventing the detection of the content in the second row

Right

cat ctrl_bam /path/to/ctrl_1.bam,/path/to/ctrl_2.bam

Wrong

cat ctrl_bam /path/to/ctrl_1.bam, /path/to/ctrl_2.bam

Other important parameters

  • -gtf (required): The source file for splicing events and data annotation.
  • -od (required): output data folder (dir)
  • -tmp (required): temporary archive folder
  • -t (required, default: paired): input: paired-end or single-end
  • -readLength (required): single-nthended read length
  • Can use fastqc statistics for fastq file
  • Can use stamtools stat for bam file
  • Really important: sequence length that rmats should use to determine events. If it is too short, the type and quantity of events will be greatly underestimated. If it is too long, it will lead to insufficient information and the output will directly return to zero.
  • -variable-read-length:
  • The read length can vary due to the quality control process (trimming, filter), and this parameter allows rmats to use the longest read length as much as possible, helping to address this issue.
  • -nthread: Multi-threaded computing settings, rmats itself has good performance and does not need to be set to many cores

Unzip the fastq.gz

for file in *.fastq.gz; do echo "Processing $file..." seqtk seq -a "$file" > "${file%.fastq.gz}.fa" done

Example

1.1 conda create environment + download rMATs (envName: virtual env name, freely change it)

conda create -n envName rmats

1.2 Install rMATs on existing conda env

conda install rmats

2. Activate env

conda activate envName

3.1 BAM file rMATs analysis (paired-end reads, readLength=100, gtf annotation from Gencode)

Create group file

echo /path/to/group1_1.bam,/path/to/group1_2.bam > group1 echo /path/to/group2_1.bam,/path/to/group2_2.bam > group2

Create output/temporary folder

mkdir output output/tmp

rMATs analysis

rmats.py --b1 group1 --b2 group2 --gtf gencode.gtf --od output --tmp output/tmp -t paired --readLength 100 --nthread 4

rMATs analysis with variable readLength

rmats.py --b1 group1 --b2 group2 --gtf gencode.gtf --od output --tmp output/tmp -t paired --readLength 100 --variable-read-length --nthread 4

3.2 fastq file rMATs analysis (paired-end reads, readLength=100, Gencode gtf annotation)

Create group file

echo /path/to/group1_1_R1.fastq:/path/to/group1_1_R2.fastq, /path/to/group1_2_R1.fastq:/path/to/group1_2_R2.fastq > fastqGroup1 echo /path/to/group2_1_R1.fastq:/path/to/group2_1_R2.fastq, /path/to/group2_2_R1.fastq:/path/to/group2_2_R2.fastq > fastqGroup2

Create fqOutput and fqOutput/tmp

mkdir fqOutput fqOutput/tmp

Create a STAR genome index. If there is no STAR in the virtual environment, you can use "conda install star" to install it.

STAR --runMode genomeGenerate --genomeDir path/to/indexDir --genomeFastaFiles hg 38.fasta --sjdbGTFfile gencode.gtf --sjdb Overhang 99

  • For example

STAR --runMode genomeGenerate --genomeDir /home/Giang/raw_RNA_seq_practice/hg38_index1/ --genomeFastaFiles /data/Data/Database/00.human_reference_genome/hg38/hg38.fa --sjdbGTFfile /data/Data/Database/16.ENCODE_shRNA_rMATs/01.rMATs_result/file/GENCODE_V29_anno.gtf --sjdbOverhang 99

rMATs analysis

rmats.py --s1 fastqGroup1 --s2 fastqGroup2 --bi path/to/indexDir --gtf gencode.gtf -od fqOutput --tmp fqOutput/tmp -t paired --readLength 100 --nthread 4

  • For example

rmats.py --s1 /home/Giang/raw_RNA_seq_practice/ctrl1.txt --s2 /home/Giang/raw_RNA_seq_practice/test1.txt --bi /home/Giang/raw_RNA_seq_practice/hg38_index/ --gtf /data/Data/Database/16.ENCODE_shRNA_rMATs/01.rMATs_result/file/GENCODE_V29_anno.gtf --od /home/Giang/raw_RNA_seq_practice/output/ --tmp /home/Giang/raw_RNA_seq_practice/output/tmp -t paired --readLength 100 --nthread 4

End analysis and close virtual environment

conda deactivate

sashimii plot

rMATs have designed tools can draw the rMATs result and the coverage data of BAM file into a sashimii plot.

Create env in conda

conda create n envName python=2 bedtools samtools pysam

Install matplotlib

pip install matplotlib==2.0.2

Install scipy

pip install scipy

Install rmats2sashimiplot

pip install rmats2sashimiplot

input: two modes: even mode and custom mode (required gff3 annotation)

  • rMATs event input is [AS_Event].MATS.JC.txt
  • It is recommended to use R, python, excel to filter those events first (FDR<0.05, specific target), save as a new file but maintain the tsv format
  • The alignment result part accepts SAM [–s1/–s2] or BAM [–b1/–b2] You can directly input the path. It does not have to generate a path text file like rMATs. The group and order of the input path must be consistent with running rMATs, otherwise the picture will be reversed.

Example

rmats2sashimiplot activate

conda activate envName

Making the output folder

mkdir outFolder

rmats2sashimiplot
--l1 shSF3B4 \ # label the group1 name (freely) --l2 ctrl \ # label the group2 name --o outFolder/
--t SE \ #Event type -e filtered.MATS.JC.txt \ # Filtered results file from [AS_Event].MATS.JC.txt --b1 /path/to/group1_1.bam,/path/to/group1_2.bam
--b2 /path/to/group2_1.bam,/path/to/group2_2.bam \

Output

image 5 Alternative splicing events:

  • SE: skipping exon
  • A5SS: Alternative 5'SS
  • A3SS: alternative 3'SS
  • MXE: mutually exclusive exons
  • RI: retained intron
  • The number of supporting reads can be counted by junction reads only (JC) or by both junction and exon reads (JCEC)
  • In --od direction:
  • [AS_Event].MATS.JC.txt: list of events and read counts, only count splice junction read
  • [AS_Event].MATS.JCEC.txt: list of events and read counts, count both splice junction and exon body reads
  • fromGTF.[AS_Event].txt: All identified AS events from GTF and RNA
  • fromGTF.novelJunction.[AS_Event].txt: AS events were identified only after considering the RNA --> not include events with unannotated splice site
  • fromGTF.novelSpliceSite.[AS_Event].txt: only events that include an unannotated SS
  • JC.raw.input.[AS_Event].txt: event counts only reads that span junctions
  • JCEC.raw.input.[AS_Event].txt: event counts including reads span junction and reads not cross an exon boundary
  • Column
  • ID: rMATS event id
  • GeneID
  • geneSymbol: gene name
  • chr: chromosome
  • strand: strand of the gene
  • IJC_SAMPLE_1: inclusion counts for sample 1. Replicates are comma separated
  • SJC_SAMPLE_1: skipping counts for sample 1. Replicates are comma separated
  • IncFormlen: Length of inclusion form, used for normalization
  • SkipFormLen: length of skipping form, used for normalization
  • P value: significance of splicing difference between 2 sample groups (Only if statistical model is on)
  • FDR: False Discovery Rate calculated from p-value
  • IncLevel1:inclusion level for sample 1. Replicates are comma separated (from normalized counts)
  • IncLevel2: inclusion level for sample 2. Replicates are comma separated (from normalized counts)
  • IncLevel2: average(IncLevel1) - average(IncLevel2)

Change RMATS command line

/home/Giang/anaconda3/envs/rMATs/bin/STAR-avx2 --runThreadN 6 --genomeDir /home/Giang/raw_RNA_seq_practice/RMATS3_QC_bam/hg38_index1/ --readFilesIn /data/Data/Giang/trimmed_fastq/trimmed_test_fastq/ENCFF735NFI_1_R1_trimmed.fastq /data/Data/Giang/trimmed_fastq/trimmed_test_fastq/ENCFF578JJD_1_R2_trimmed.fastq --outFileNamePrefix test1 --outSAMtype BAM SortedByCoordinate --outSJfilterOverhangMin 20 12 12 12 --alignIntronMin 10 --alignIntronMax 500000 --alignEndsType EndToEnd

rmats's People

Contributors

jagbaskerville avatar

Watchers

 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.