miRNA Discovery Project : simple, easy-to-use, rule-based miRNA candidate discovery with multithread acceleration
Version : 1.0 Final
Author : Lee Sang-gil, dept. of Applied Biology & Chemistry, Seoul National University, South Korea
E-mail : [email protected]
Quick Installation & Usage Guide
Need Python 2.7.6 or newer, bowtie, RNAfold
Simply unzip and run main.py like the following:
python main.py -r "reference fasta format genome path" -i "input fasta format RNA-seq path" -o "output path"
For help screen:
python main.py -h
Introduction
This program discovers miRNA candidates, showing mature miRNA-miRNA* duplex and pre-miRNA sequence
It provides alignment information, simplified tabular information, and length distribution of genome-mapped RNA-seq reads
Some filtering procedure, I/O formats, and parameter schemes are inspired by miREAP (by Li Qibin [email protected] Bioinformatics department, Beijing Genomics Institute)
It aims to minimize input file lists and software dependencies, it is written in python script with no installation hassle, and it doesn't require sudo (administrator) privilege to run, which is sometimes a problem when you deploy it at a server
It provides multithread implementation to maximize performance, which is scaled to the cpu threads you use
Albeit there are a number of similar softwares with machine learning techniques achieving high accuracy, this program can be used with in-house data with no extra data, and with simple and fast use
Scoring scheme is defined as simple own criteria, but users can modify the scheme by editing score_seq function in the source code
Software Dependencies
The program needs following software to be installed (in any path):
bowtie (http://bowtie-bio.sourceforge.net/index.shtml)
RNAfold (in ViennaRNA package) (https://www.tbi.univie.ac.at/RNA/)
If both software can be directly executed via terminal (bowtie-build, bowtie, RNAfold), you don't need to specify any parameters in the script
If you have installed these software but cannot directly be executed via terminal, you need to specify the path containing the executables in the script
Parameters
-h, --help
show the help screen and exit
Core
-r --reference
reference genome file location(.fa, .fasta) (default : 'ref.fa')
ex) /home/myname/(...)/my_ref.fa
if no path is specified, it tries to load "ref.fa" in the current folder
-i --input
input small RNA file location(.fa, .fasta) (default : 'smrna.fa')
ex) /home/myname/(...)/my_smrna.fa
if no path is specified, it tries to load "smrna.fa" in the current folderthe file format should look like the following:
>t0000035 3234
GAATGGATAAGGATTAGCGATGATACA
>t0000072 1909
TTGCAGTATGTAGGAAATCAAAACGTTC
(i.e. >name (tab) reads (newline) seq)
most raw RNA-seq data formats are similar to this, but you may need to pre-process the file to match the format
-o --output (default : 'result' subfolder)
output file location
ex) /home/myname/(...)/myoutput
the program generates output files in the specified output location
if the path do not exist, the program tries to make the directory
if no path is specified, results are generated in "result" subfolder in the current folder
-w --bowtiepath
bowtie location (default: command "bowtie" and "bowtie-build")
ex) /home/myname/(...)/bowtie-1.1.2
you need to specify the path CONTAINING the executables "bowtie" and "bowtie-build"
if no path is specified, the program tries to execute with terminal command "bowtie" and "bowtie-build"
-p --RNAfoldpath
RNAfold location (default : command "RNAfold")
ex) /home/myname/(...)/ViennaRNA/bin
you need to specify the path CONTAINING the executable "RNAfold"
if no path is specified, the program tries to execute with terminal command "RNAfold"
pre-miRNA Discovery Criteria
-l --minlength (default : 18)
min. length of mature miRNA
-L --maxlength (default : 26)
max. length of mature miRNA
-u --multloci (default : 20)
max. multiple loci of miRNA matches to reference genome
if the raw RNA-seq read matches to genome with too many locations, it should be considered as repeat, not miRNA candidate
this value is passed to bowtie, and only reads corresponding to this criterion will be mapped
-d --distance (default : 35)
max. distance between miRNA-miRNA* of precursor hairpin loop
if you want to discover "bigger" pre-miRNAs (like plant miRNA or non-canonical miRNA), you can increase this value, but the program will tend to prefer "bigger" pre-miRNAs and "smaller" pre-miRNAs might be discarded
-a --arm (default : 10)
length of arms(both ends) of precursor from mature miRNA
also called as "flank", this value specifies the length of potential drosha-cut sites
the program assumes this value to calculate MFE of putative pre-miRNAs
-f --mfe (default : 18)
min. abs. MFE for valid miRNA precursor
pre-miRNAs whose absolute MFE value are larger than this value (-18 kcal/mol) are considered as novel pre-miRNAs
the program "picks" pre-miRNA with the best normalized MFE near the searching location
pre-miRNAs not passing this criterion are considered "unstable" and discarded
Mature miRNA Scoring Criteria
the site of pre-miRNA passing all the criteria with the best score is selected as putative mature miRNA-miRNA* duplex
if you want to allow more non-canonical miRNA-miRNA* duplex, you may want to increase these values
-m --serialmismatch (default : 2)
max. serial mismatch of miRNA-miRNA* duplex
if any single mismatch site is "too large", it is discarded
-M --multmismatch (default : 2)
max. multiple mismatches of miRNA-miRNA* duplex
if the number of mismatch sites is "too much", it is discarded
-b --serialbulge (default : 2)
max. serial bulge of miRNA-miRNA* duplex
if any single bulge is "too large", it is discarded
-B --multbulge (default : 2)
max. multiple bulges of miRNA-miRNA* duplex
if the number of bulges is "too much", it is discarded
Performance & Memory Efficiency
-t --thread (default : automatic, all threads of the system)
number of CPU threads
if you want to run the program with fewer cpu threads (and less memory), specify the value
-s --step (default : 2)
step size of RNAfold precursor extend loop
the program progressively calculates MFE of putative pre-miRNAs extending the length of the sequence with this specified value
for example, with default settings, the program starts with [ (minlength + arm) * 2 + distance ] and extends the length "left" or "right" automatically
if you increase this value, the program calculates with RNAfold more sparsely and can achieve faster performance, but it could miss the "optimal" choice of pre-miRNA
if you have set larger "distance" value, you may want to increase this value too
-c --mincount (default : automatic, number of unique RNA-seq reads / 500000)
min. read count of smRNA seq displayed at result_mature.txt
the program allocates copy of loaded map data to all threads to maximize performance, so if map data is too large, memory capacity could be an issue, especially if the map file exceeds several hundred MBs
also, if the RNA-seq file has many reads with small count (1~2), result_mature.txt might be hard to read, and result would show less probable miRNA candidates
by default, this value is defined in a heuristic way to achieve stable memory usage and reliable result
if you specify this value, genome-mapped reads with count lower than this value will be considered "insignificant", and will not be used for mature miRNA finding
Miscellaneous & Additional Features : recommended to keep default values
--batch_size (default : number of CPU threads * 3)
size of input data "chunk" for internal processing,
it can be arbitrary value, but (number of cpu threads) * n is recommended
if you increase the size, you COULD earn faster performance (but not always), but the memory usage will be increased
optimal value is depending on the system you are running, so you may want to experiment
--plot (default:false)
draw simple bar plot of length distribution of genome-aligned RNA sequence
if you set this value to true, the program generates "length-distribution.png" file to output path
need matplotlib python package to be installed