Input

Input: fastq or bam

fastq

Send fastq to rMATs, rMATs will use STAR to align fastq, and then send it to RMATs for analysis
Need to prepare: -s1 (group 1 fastq), -s2 (group 2 fastq) and -bi (STAR index)
-bi[STAR index] is the genome index generated by STAR aligner, for details: https://github.com/alexdobin/STAR/blob/master/doc/STARmanual.pdf

Building the STAR index

To index the genome with STAR for RNA-seq analysis, the sjdbOverhang option needs to be specified for detecting possible splicing sites. It usually equals to the minimum read size minus 1; it tells STAR what is the maximum possible stretch of sequence that can be found on one side of a spicing site

Path to hg38 gtf file: /data/Data/Database/16.ENCODE_shRNA_rMATs/01.rMATs_result/file/GENCODE_V29_anno.gtf

Path to hg38 fasta file: /data/Data/Database/00.human_reference_genome/hg38/hg38.fa

STAR \
--runMode genomeGenerate \
--genomeDir /path/to/genomeDir \
--genomeFastaFiles /path/to/genome/fasta \
--sjdbGTFfile /path/to/annotations.gtf \
--sjdbOverhang ReadLength-1

Use fastp to do the quality control for fasta file

/data/Data/ang_chu_data/raw_RNA_seq/HepG2_non_specific_target/ctrl_bam_file/QC_bam_file$ fastp --in1 /data/Data/ang_chu_data/raw_RNA_seq/HepG2_non_specific_target/ENCFF802OSJ_1_R1.fastq --in2 /data/Data/ang_chu_data/raw_RNA_seq/HepG2_non_specific_target/ENCFF566QBO_1_R2.fastq --out1 ENCFF802OSJ_1_R1_trimmed.fastq --out2 ENCFF566QBO_1_R2_trimmed.fastq

Test data

/data/Data/ang_chu_data/raw_RNA_seq/HepG2_SAP30BP_KD/test_bam_file/QC_test_bam_files$ fastp --in1 /data/Data/ang_chu_data/raw_RNA_seq/HepG2_SAP30BP_KD/ENCFF735NFI_1_R1.fastq --in2 /data/Data/ang_chu_data/raw_RNA_seq/HepG2_SAP30BP_KD/ENCFF578JJD_1_R2.fastq --out1 ENCFF735NFI_1_R1_trimmed.fastq --out2 ENCFF578JJD_1_R2_trimmed.fastq

Create BAM files using STAR alignment

Ctrl

STAR --runThreadN 6 --genomeDir /home/Giang/raw_RNA_seq_practice/RMATS2/hg38_index1/ --readFilesIn /data/Data/ang_chu_data/raw_RNA_seq/HepG2_non_specific_target/ctrl_bam_file/QC_bam_file/trimmed_fastp_files/ENCFF802OSJ_1_R1_trimmed.fastq /data/Data/ang_chu_data/raw_RNA_seq/HepG2_non_specific_target/ENCFF566QBO_1_R2.fastq --outSAMtype BAM SortedByCoordinate --outFileNamePrefix ctrl1 STAR --runThreadN 6 --genomeDir /home/Giang/raw_RNA_seq_practice/RMATS2/hg38_index1/ --readFilesIn /data/Data/ang_chu_data/raw_RNA_seq/HepG2_non_specific_target/ctrl_bam_file/QC_bam_file/trimmed_fastp_files/ENCFF330LGV_2_R1_trimmed.fastq /data/Data/ang_chu_data/raw_RNA_seq/HepG2_non_specific_target/ctrl_bam_file/QC_bam_file/trimmed_fastp_files/ENCFF232USH_2_R2_trimmed.fastq --outSAMtype BAM SortedByCoordinate --outFileNamePrefix ctrl2

Test

STAR --runThreadN 6 --genomeDir /home/Giang/raw_RNA_seq_practice/RMATS2/hg38_index1/ --readFilesIn /data/Data/ang_chu_data/raw_RNA_seq/HepG2_SAP30BP_KD/test_bam_file/QC_test_bam_files/trimmed_fastq_files/ENCFF735NFI_1_R1_trimmed.fastq /data/Data/ang_chu_data/raw_RNA_seq/HepG2_SAP30BP_KD/test_bam_file/QC_test_bam_files/trimmed_fastq_files/ENCFF578JJD_1_R2_trimmed.fastq --outSAMtype BAM SortedByCoordinate --outFileNamePrefix test1 STAR --runThreadN 6 --genomeDir /home/Giang/raw_RNA_seq_practice/RMATS2/hg38_index1/ --readFilesIn /data/Data/ang_chu_data/raw_RNA_seq/HepG2_SAP30BP_KD/test_bam_file/QC_test_bam_files/trimmed_fastq_files/ENCFF747JAG_2_R1_trimmed.fastq /data/Data/ang_chu_data/raw_RNA_seq/HepG2_SAP30BP_KD/test_bam_file/QC_test_bam_files/trimmed_fastq_files/ENCFF058PDK_2_R2_trimmed.fastq --outSAMtype BAM SortedByCoordinate --outFileNamePrefix test2

bam

The aligned bam file do not need to be aligned again --> can be analyzed directly.
The alignment must be genome-base = align directly to chromosome coordinates and non specific transcripts --> splicing information
Need to prepare: –b1 [Group 1 bam], –b2 [Group 2 bam] –s1, –s2, –b1, –b2 are .txt files, and the content is a combination of paths to the same set of files, separated by ",". pair-ended fastq separated by ":" For example:

ctrl_bam.txt

cat ctrl_bam /path/to/ctrl_1.bam,/path/to/ctrl_2.bam

ctrl_fastq.txt

cat ctrl_fastq /path/to/ctrl_1_R1.fastq:/path/to/ctrl_1_R2.fastq,/path/to/ctrl_2_R1.fastq:/path/to/ctrl_2_R2.fastq

** Note: please note that the file content's path should only occupy a single row. If it spans multiple rows, an error may occur, preventing the detection of the content in the second row

Right

cat ctrl_bam /path/to/ctrl_1.bam,/path/to/ctrl_2.bam

Wrong

cat ctrl_bam /path/to/ctrl_1.bam, /path/to/ctrl_2.bam

Other important parameters

-gtf (required): The source file for splicing events and data annotation.
-od (required): output data folder (dir)
-tmp (required): temporary archive folder
-t (required, default: paired): input: paired-end or single-end
-readLength (required): single-nthended read length

Can use fastqc statistics for fastq file
Can use stamtools stat for bam file
Really important: sequence length that rmats should use to determine events. If it is too short, the type and quantity of events will be greatly underestimated. If it is too long, it will lead to insufficient information and the output will directly return to zero.

-variable-read-length:

The read length can vary due to the quality control process (trimming, filter), and this parameter allows rmats to use the longest read length as much as possible, helping to address this issue.

-nthread: Multi-threaded computing settings, rmats itself has good performance and does not need to be set to many cores

Unzip the fastq.gz

for file in *.fastq.gz; do echo "Processing $file..." seqtk seq -a "$file" > "${file%.fastq.gz}.fa" done

Example

1.1 conda create environment + download rMATs (envName: virtual env name, freely change it)

conda create -n envName rmats

1.2 Install rMATs on existing conda env

conda install rmats

2. Activate env

conda activate envName

3.1 BAM file rMATs analysis (paired-end reads, readLength=100, gtf annotation from Gencode)

Create group file

echo /path/to/group1_1.bam,/path/to/group1_2.bam > group1 echo /path/to/group2_1.bam,/path/to/group2_2.bam > group2

Create output/temporary folder

mkdir output output/tmp

rMATs analysis

rmats.py --b1 group1 --b2 group2 --gtf gencode.gtf --od output --tmp output/tmp -t paired --readLength 100 --nthread 4

rMATs analysis with variable readLength

rmats.py --b1 group1 --b2 group2 --gtf gencode.gtf --od output --tmp output/tmp -t paired --readLength 100 --variable-read-length --nthread 4

3.2 fastq file rMATs analysis (paired-end reads, readLength=100, Gencode gtf annotation)

Create group file

echo /path/to/group1_1_R1.fastq:/path/to/group1_1_R2.fastq, /path/to/group1_2_R1.fastq:/path/to/group1_2_R2.fastq > fastqGroup1 echo /path/to/group2_1_R1.fastq:/path/to/group2_1_R2.fastq, /path/to/group2_2_R1.fastq:/path/to/group2_2_R2.fastq > fastqGroup2

Create fqOutput and fqOutput/tmp

mkdir fqOutput fqOutput/tmp

Create a STAR genome index. If there is no STAR in the virtual environment, you can use "conda install star" to install it.

STAR --runMode genomeGenerate --genomeDir path/to/indexDir --genomeFastaFiles hg 38.fasta --sjdbGTFfile gencode.gtf --sjdb Overhang 99

For example

STAR --runMode genomeGenerate --genomeDir /home/Giang/raw_RNA_seq_practice/hg38_index1/ --genomeFastaFiles /data/Data/Database/00.human_reference_genome/hg38/hg38.fa --sjdbGTFfile /data/Data/Database/16.ENCODE_shRNA_rMATs/01.rMATs_result/file/GENCODE_V29_anno.gtf --sjdbOverhang 99

rMATs analysis

rmats.py --s1 fastqGroup1 --s2 fastqGroup2 --bi path/to/indexDir --gtf gencode.gtf -od fqOutput --tmp fqOutput/tmp -t paired --readLength 100 --nthread 4

For example

rmats.py --s1 /home/Giang/raw_RNA_seq_practice/ctrl1.txt --s2 /home/Giang/raw_RNA_seq_practice/test1.txt --bi /home/Giang/raw_RNA_seq_practice/hg38_index/ --gtf /data/Data/Database/16.ENCODE_shRNA_rMATs/01.rMATs_result/file/GENCODE_V29_anno.gtf --od /home/Giang/raw_RNA_seq_practice/output/ --tmp /home/Giang/raw_RNA_seq_practice/output/tmp -t paired --readLength 100 --nthread 4

End analysis and close virtual environment

conda deactivate

sashimii plot

rMATs have designed tools can draw the rMATs result and the coverage data of BAM file into a sashimii plot.

Create env in conda

conda create n envName python=2 bedtools samtools pysam

Install matplotlib

pip install matplotlib==2.0.2

Install scipy

pip install scipy

Install rmats2sashimiplot

pip install rmats2sashimiplot

input: two modes: even mode and custom mode (required gff3 annotation)

rMATs event input is [AS_Event].MATS.JC.txt
It is recommended to use R, python, excel to filter those events first (FDR<0.05, specific target), save as a new file but maintain the tsv format
The alignment result part accepts SAM [–s1/–s2] or BAM [–b1/–b2] You can directly input the path. It does not have to generate a path text file like rMATs. The group and order of the input path must be consistent with running rMATs, otherwise the picture will be reversed.

Example

rmats2sashimiplot activate

conda activate envName

Making the output folder

mkdir outFolder

rmats2sashimiplot
--l1 shSF3B4 \ # label the group1 name (freely) --l2 ctrl \ # label the group2 name --o outFolder/
--t SE \ #Event type -e filtered.MATS.JC.txt \ # Filtered results file from [AS_Event].MATS.JC.txt --b1 /path/to/group1_1.bam,/path/to/group1_2.bam
--b2 /path/to/group2_1.bam,/path/to/group2_2.bam \

Output

5 Alternative splicing events:

SE: skipping exon
A5SS: Alternative 5'SS
A3SS: alternative 3'SS
MXE: mutually exclusive exons
RI: retained intron

The number of supporting reads can be counted by junction reads only (JC) or by both junction and exon reads (JCEC)
In --od direction:

[AS_Event].MATS.JC.txt: list of events and read counts, only count splice junction read
[AS_Event].MATS.JCEC.txt: list of events and read counts, count both splice junction and exon body reads
fromGTF.[AS_Event].txt: All identified AS events from GTF and RNA
fromGTF.novelJunction.[AS_Event].txt: AS events were identified only after considering the RNA --> not include events with unannotated splice site
fromGTF.novelSpliceSite.[AS_Event].txt: only events that include an unannotated SS
JC.raw.input.[AS_Event].txt: event counts only reads that span junctions
JCEC.raw.input.[AS_Event].txt: event counts including reads span junction and reads not cross an exon boundary
Column

ID: rMATS event id
GeneID
geneSymbol: gene name
chr: chromosome
strand: strand of the gene
IJC_SAMPLE_1: inclusion counts for sample 1. Replicates are comma separated
SJC_SAMPLE_1: skipping counts for sample 1. Replicates are comma separated
IncFormlen: Length of inclusion form, used for normalization
SkipFormLen: length of skipping form, used for normalization
P value: significance of splicing difference between 2 sample groups (Only if statistical model is on)
FDR: False Discovery Rate calculated from p-value
IncLevel1:inclusion level for sample 1. Replicates are comma separated (from normalized counts)
IncLevel2: inclusion level for sample 2. Replicates are comma separated (from normalized counts)
IncLevel2: average(IncLevel1) - average(IncLevel2)

Change RMATS command line

/home/Giang/anaconda3/envs/rMATs/bin/STAR-avx2 --runThreadN 6 --genomeDir /home/Giang/raw_RNA_seq_practice/RMATS3_QC_bam/hg38_index1/ --readFilesIn /data/Data/Giang/trimmed_fastq/trimmed_test_fastq/ENCFF735NFI_1_R1_trimmed.fastq /data/Data/Giang/trimmed_fastq/trimmed_test_fastq/ENCFF578JJD_1_R2_trimmed.fastq --outFileNamePrefix test1 --outSAMtype BAM SortedByCoordinate --outSJfilterOverhangMin 20 12 12 12 --alignIntronMin 10 --alignIntronMax 500000 --alignEndsType EndToEnd

jagbaskerville / rmats Goto Github PK

rmats's Introduction

Input

fastq

Building the STAR index

Path to hg38 gtf file: /data/Data/Database/16.ENCODE_shRNA_rMATs/01.rMATs_result/file/GENCODE_V29_anno.gtf

Path to hg38 fasta file: /data/Data/Database/00.human_reference_genome/hg38/hg38.fa

Use fastp to do the quality control for fasta file

Test data

Create BAM files using STAR alignment

bam

ctrl_bam.txt

ctrl_fastq.txt

Right

Wrong

Other important parameters

Unzip the fastq.gz

Example

1.1 conda create environment + download rMATs (envName: virtual env name, freely change it)

1.2 Install rMATs on existing conda env

2. Activate env

3.1 BAM file rMATs analysis (paired-end reads, readLength=100, gtf annotation from Gencode)

Create group file

Create output/temporary folder

rMATs analysis

rMATs analysis with variable readLength

3.2 fastq file rMATs analysis (paired-end reads, readLength=100, Gencode gtf annotation)

Create group file

Create fqOutput and fqOutput/tmp

Create a STAR genome index. If there is no STAR in the virtual environment, you can use "conda install star" to install it.

rMATs analysis

End analysis and close virtual environment

sashimii plot

Create env in conda

Install matplotlib

Install scipy

Install rmats2sashimiplot

input: two modes: even mode and custom mode (required gff3 annotation)

Example

rmats2sashimiplot activate

Making the output folder

Output

Change RMATS command line

rmats's People

Contributors

Watchers

Recommend Projects

Recommend Topics

Recommend Org