- Input: fastq or bam
- Send fastq to rMATs, rMATs will use STAR to align fastq, and then send it to RMATs for analysis
- Need to prepare: -s1 (group 1 fastq), -s2 (group 2 fastq) and -bi (STAR index)
- -bi[STAR index] is the genome index generated by STAR aligner, for details: https://github.com/alexdobin/STAR/blob/master/doc/STARmanual.pdf
To index the genome with STAR for RNA-seq analysis, the sjdbOverhang option needs to be specified for detecting possible splicing sites. It usually equals to the minimum read size minus 1; it tells STAR what is the maximum possible stretch of sequence that can be found on one side of a spicing site
Path to hg38 gtf file: /data/Data/Database/16.ENCODE_shRNA_rMATs/01.rMATs_result/file/GENCODE_V29_anno.gtf
STAR \
--runMode genomeGenerate \
--genomeDir /path/to/genomeDir \
--genomeFastaFiles /path/to/genome/fasta \
--sjdbGTFfile /path/to/annotations.gtf \
--sjdbOverhang ReadLength-1
/data/Data/ang_chu_data/raw_RNA_seq/HepG2_non_specific_target/ctrl_bam_file/QC_bam_file$ fastp --in1 /data/Data/ang_chu_data/raw_RNA_seq/HepG2_non_specific_target/ENCFF802OSJ_1_R1.fastq --in2 /data/Data/ang_chu_data/raw_RNA_seq/HepG2_non_specific_target/ENCFF566QBO_1_R2.fastq --out1 ENCFF802OSJ_1_R1_trimmed.fastq --out2 ENCFF566QBO_1_R2_trimmed.fastq
/data/Data/ang_chu_data/raw_RNA_seq/HepG2_SAP30BP_KD/test_bam_file/QC_test_bam_files$ fastp --in1 /data/Data/ang_chu_data/raw_RNA_seq/HepG2_SAP30BP_KD/ENCFF735NFI_1_R1.fastq --in2 /data/Data/ang_chu_data/raw_RNA_seq/HepG2_SAP30BP_KD/ENCFF578JJD_1_R2.fastq --out1 ENCFF735NFI_1_R1_trimmed.fastq --out2 ENCFF578JJD_1_R2_trimmed.fastq
- Ctrl
STAR --runThreadN 6 --genomeDir /home/Giang/raw_RNA_seq_practice/RMATS2/hg38_index1/ --readFilesIn /data/Data/ang_chu_data/raw_RNA_seq/HepG2_non_specific_target/ctrl_bam_file/QC_bam_file/trimmed_fastp_files/ENCFF802OSJ_1_R1_trimmed.fastq /data/Data/ang_chu_data/raw_RNA_seq/HepG2_non_specific_target/ENCFF566QBO_1_R2.fastq --outSAMtype BAM SortedByCoordinate --outFileNamePrefix ctrl1 STAR --runThreadN 6 --genomeDir /home/Giang/raw_RNA_seq_practice/RMATS2/hg38_index1/ --readFilesIn /data/Data/ang_chu_data/raw_RNA_seq/HepG2_non_specific_target/ctrl_bam_file/QC_bam_file/trimmed_fastp_files/ENCFF330LGV_2_R1_trimmed.fastq /data/Data/ang_chu_data/raw_RNA_seq/HepG2_non_specific_target/ctrl_bam_file/QC_bam_file/trimmed_fastp_files/ENCFF232USH_2_R2_trimmed.fastq --outSAMtype BAM SortedByCoordinate --outFileNamePrefix ctrl2
- Test
STAR --runThreadN 6 --genomeDir /home/Giang/raw_RNA_seq_practice/RMATS2/hg38_index1/ --readFilesIn /data/Data/ang_chu_data/raw_RNA_seq/HepG2_SAP30BP_KD/test_bam_file/QC_test_bam_files/trimmed_fastq_files/ENCFF735NFI_1_R1_trimmed.fastq /data/Data/ang_chu_data/raw_RNA_seq/HepG2_SAP30BP_KD/test_bam_file/QC_test_bam_files/trimmed_fastq_files/ENCFF578JJD_1_R2_trimmed.fastq --outSAMtype BAM SortedByCoordinate --outFileNamePrefix test1 STAR --runThreadN 6 --genomeDir /home/Giang/raw_RNA_seq_practice/RMATS2/hg38_index1/ --readFilesIn /data/Data/ang_chu_data/raw_RNA_seq/HepG2_SAP30BP_KD/test_bam_file/QC_test_bam_files/trimmed_fastq_files/ENCFF747JAG_2_R1_trimmed.fastq /data/Data/ang_chu_data/raw_RNA_seq/HepG2_SAP30BP_KD/test_bam_file/QC_test_bam_files/trimmed_fastq_files/ENCFF058PDK_2_R2_trimmed.fastq --outSAMtype BAM SortedByCoordinate --outFileNamePrefix test2
- The aligned bam file do not need to be aligned again --> can be analyzed directly.
- The alignment must be genome-base = align directly to chromosome coordinates and non specific transcripts --> splicing information
- Need to prepare: –b1 [Group 1 bam], –b2 [Group 2 bam] –s1, –s2, –b1, –b2 are .txt files, and the content is a combination of paths to the same set of files, separated by ",". pair-ended fastq separated by ":" For example:
cat ctrl_bam /path/to/ctrl_1.bam,/path/to/ctrl_2.bam
cat ctrl_fastq /path/to/ctrl_1_R1.fastq:/path/to/ctrl_1_R2.fastq,/path/to/ctrl_2_R1.fastq:/path/to/ctrl_2_R2.fastq
** Note: please note that the file content's path should only occupy a single row. If it spans multiple rows, an error may occur, preventing the detection of the content in the second row
cat ctrl_bam /path/to/ctrl_1.bam,/path/to/ctrl_2.bam
cat ctrl_bam /path/to/ctrl_1.bam, /path/to/ctrl_2.bam
- -gtf (required): The source file for splicing events and data annotation.
- -od (required): output data folder (dir)
- -tmp (required): temporary archive folder
- -t (required, default: paired): input: paired-end or single-end
- -readLength (required): single-nthended read length
- Can use fastqc statistics for fastq file
- Can use stamtools stat for bam file
- Really important: sequence length that rmats should use to determine events. If it is too short, the type and quantity of events will be greatly underestimated. If it is too long, it will lead to insufficient information and the output will directly return to zero.
- -variable-read-length:
- The read length can vary due to the quality control process (trimming, filter), and this parameter allows rmats to use the longest read length as much as possible, helping to address this issue.
- -nthread: Multi-threaded computing settings, rmats itself has good performance and does not need to be set to many cores
for file in *.fastq.gz; do echo "Processing $file..." seqtk seq -a "$file" > "${file%.fastq.gz}.fa" done
conda create -n envName rmats
conda install rmats
conda activate envName
echo /path/to/group1_1.bam,/path/to/group1_2.bam > group1 echo /path/to/group2_1.bam,/path/to/group2_2.bam > group2
mkdir output output/tmp
rmats.py --b1 group1 --b2 group2 --gtf gencode.gtf --od output --tmp output/tmp -t paired --readLength 100 --nthread 4
rmats.py --b1 group1 --b2 group2 --gtf gencode.gtf --od output --tmp output/tmp -t paired --readLength 100 --variable-read-length --nthread 4
echo /path/to/group1_1_R1.fastq:/path/to/group1_1_R2.fastq, /path/to/group1_2_R1.fastq:/path/to/group1_2_R2.fastq > fastqGroup1 echo /path/to/group2_1_R1.fastq:/path/to/group2_1_R2.fastq, /path/to/group2_2_R1.fastq:/path/to/group2_2_R2.fastq > fastqGroup2
mkdir fqOutput fqOutput/tmp
Create a STAR genome index. If there is no STAR in the virtual environment, you can use "conda install star" to install it.
STAR --runMode genomeGenerate --genomeDir path/to/indexDir --genomeFastaFiles hg 38.fasta --sjdbGTFfile gencode.gtf --sjdb Overhang 99
- For example
STAR --runMode genomeGenerate --genomeDir /home/Giang/raw_RNA_seq_practice/hg38_index1/ --genomeFastaFiles /data/Data/Database/00.human_reference_genome/hg38/hg38.fa --sjdbGTFfile /data/Data/Database/16.ENCODE_shRNA_rMATs/01.rMATs_result/file/GENCODE_V29_anno.gtf --sjdbOverhang 99
rmats.py --s1 fastqGroup1 --s2 fastqGroup2 --bi path/to/indexDir --gtf gencode.gtf -od fqOutput --tmp fqOutput/tmp -t paired --readLength 100 --nthread 4
- For example
rmats.py --s1 /home/Giang/raw_RNA_seq_practice/ctrl1.txt --s2 /home/Giang/raw_RNA_seq_practice/test1.txt --bi /home/Giang/raw_RNA_seq_practice/hg38_index/ --gtf /data/Data/Database/16.ENCODE_shRNA_rMATs/01.rMATs_result/file/GENCODE_V29_anno.gtf --od /home/Giang/raw_RNA_seq_practice/output/ --tmp /home/Giang/raw_RNA_seq_practice/output/tmp -t paired --readLength 100 --nthread 4
conda deactivate
rMATs have designed tools can draw the rMATs result and the coverage data of BAM file into a sashimii plot.
conda create n envName python=2 bedtools samtools pysam
pip install matplotlib==2.0.2
pip install scipy
pip install rmats2sashimiplot
- rMATs event input is [AS_Event].MATS.JC.txt
- It is recommended to use R, python, excel to filter those events first (FDR<0.05, specific target), save as a new file but maintain the tsv format
- The alignment result part accepts SAM [–s1/–s2] or BAM [–b1/–b2] You can directly input the path. It does not have to generate a path text file like rMATs. The group and order of the input path must be consistent with running rMATs, otherwise the picture will be reversed.
conda activate envName
mkdir outFolder
rmats2sashimiplot
--l1 shSF3B4 \ # label the group1 name (freely) --l2 ctrl \ # label the group2 name --o outFolder/
--t SE \ #Event type -e filtered.MATS.JC.txt \ # Filtered results file from [AS_Event].MATS.JC.txt --b1 /path/to/group1_1.bam,/path/to/group1_2.bam
--b2 /path/to/group2_1.bam,/path/to/group2_2.bam \
5 Alternative splicing events:
- SE: skipping exon
- A5SS: Alternative 5'SS
- A3SS: alternative 3'SS
- MXE: mutually exclusive exons
- RI: retained intron
- The number of supporting reads can be counted by junction reads only (JC) or by both junction and exon reads (JCEC)
- In --od direction:
- [AS_Event].MATS.JC.txt: list of events and read counts, only count splice junction read
- [AS_Event].MATS.JCEC.txt: list of events and read counts, count both splice junction and exon body reads
- fromGTF.[AS_Event].txt: All identified AS events from GTF and RNA
- fromGTF.novelJunction.[AS_Event].txt: AS events were identified only after considering the RNA --> not include events with unannotated splice site
- fromGTF.novelSpliceSite.[AS_Event].txt: only events that include an unannotated SS
- JC.raw.input.[AS_Event].txt: event counts only reads that span junctions
- JCEC.raw.input.[AS_Event].txt: event counts including reads span junction and reads not cross an exon boundary
- Column
- ID: rMATS event id
- GeneID
- geneSymbol: gene name
- chr: chromosome
- strand: strand of the gene
- IJC_SAMPLE_1: inclusion counts for sample 1. Replicates are comma separated
- SJC_SAMPLE_1: skipping counts for sample 1. Replicates are comma separated
- IncFormlen: Length of inclusion form, used for normalization
- SkipFormLen: length of skipping form, used for normalization
- P value: significance of splicing difference between 2 sample groups (Only if statistical model is on)
- FDR: False Discovery Rate calculated from p-value
- IncLevel1:inclusion level for sample 1. Replicates are comma separated (from normalized counts)
- IncLevel2: inclusion level for sample 2. Replicates are comma separated (from normalized counts)
- IncLevel2: average(IncLevel1) - average(IncLevel2)
/home/Giang/anaconda3/envs/rMATs/bin/STAR-avx2 --runThreadN 6 --genomeDir /home/Giang/raw_RNA_seq_practice/RMATS3_QC_bam/hg38_index1/ --readFilesIn /data/Data/Giang/trimmed_fastq/trimmed_test_fastq/ENCFF735NFI_1_R1_trimmed.fastq /data/Data/Giang/trimmed_fastq/trimmed_test_fastq/ENCFF578JJD_1_R2_trimmed.fastq --outFileNamePrefix test1 --outSAMtype BAM SortedByCoordinate --outSJfilterOverhangMin 20 12 12 12 --alignIntronMin 10 --alignIntronMax 500000 --alignEndsType EndToEnd