schulzlab / orna

Fast in-silico normalization algorithm for NGS data

License: MIT License

ngstools normalization rna-seq metagenomics metagenomic-analysis rna-seq-analysis

orna's Introduction

About

The de Bruijn graph (DBG) is one of the most commonly used data structures for assembling sequencing data. Reads from the sequencer are chopped into short words of length k (k-mers), which form the nodes of the DBG. Two nodes are connected by an edge if their k-mers overlap by k-1 characters. Each edge can be labelled with the (k+1)-mer formed by merging the k-mers of the two nodes. For instance, an edge connecting the nodes ATCG and TCGT can be labelled ATCGT. An assembly is generated by traversing paths in this graph.

With the advances in deep sequencing technologies, assembling high-coverage datasets has become a challenge in terms of memory and runtime. Hence read normalization, a lossy read filtering approach, is gaining a lot of attention. Although current normalization algorithms are efficient, they provide no guarantee of preserving important k-mers that form connections between different genomic regions in the graph, so the resulting assembly may be fragmented. In this work, normalization is phrased as a set multicover problem on reads, and a linear-time heuristic algorithm named ORNA (Optimized Read Normalization Algorithm) is proposed. ORNA normalizes to the minimum number of reads required to retain all labels ((k+1)-mers), and in turn all k-mers and relative label abundances, from the original dataset. Hence no connections from the original graph are lost and coverage information is preserved.
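The node and edge-label definitions can be illustrated with a short Python sketch. It is only an illustration and not part of ORNA: the function names `kmers` and `edge_labels` are made up for this example, and reverse complements are ignored for simplicity.

```python
def kmers(seq, k):
    """Return all k-mers of seq in order of appearance."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def edge_labels(seq, k):
    """Return the (k+1)-mer labels of the DBG edges induced by seq.

    Consecutive k-mers of a read overlap by k-1 characters, so every
    edge label is simply a (k+1)-mer of the read.
    """
    return kmers(seq, k + 1)


read = "ATCGT"
print(kmers(read, 4))        # ['ATCG', 'TCGT']  (the two nodes)
print(edge_labels(read, 4))  # ['ATCGT']         (the edge between them)
```

Retaining every (k+1)-mer label therefore also retains both k-mers at the ends of the corresponding edge.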

When to use ORNA

ORNA is read normalization software developed in the spirit of Diginorm. It is computationally inexpensive and guarantees the preservation of all kmers from the original dataset. It is useful when you have a high-coverage dataset but not enough computational resources (memory in particular, but also time) to run a de novo assembly, because it removes redundancy from the data. It can also be used to merge many sequencing datasets. Be aware that ORNA (or, for that matter, any normalization software) might have a significant impact on the resulting assemblies, and that the effect depends heavily on the dataset.

Enhancements to ORNA

We have implemented two additional options in ORNA to improve the reduction performance using either abundance values of kmers in reads or base quality scores.

ORNA-Q (parameter: -sorting 1):

In this mode, ORNA, apart from preserving all the labels from the original dataset, also maximizes the total read quality score of the normalized dataset. The read quality score of a read is defined as the sum of the Phred qualities of its bases. Before reduction, ORNA-Q sorts the input dataset by read quality score using a counting sort.
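A minimal Python sketch of this pre-sorting step follows. It is not ORNA's implementation. It assumes Phred+33 encoded quality strings and assumes that reads with the highest score should come first; the function names are hypothetical.

```python
def read_quality_score(quality_string, offset=33):
    """Sum of the Phred qualities of a read (Phred+33 encoding assumed)."""
    return sum(ord(c) - offset for c in quality_string)


def counting_sort_by_quality(reads):
    """Sort (sequence, quality_string) pairs by descending quality score.

    Counting sort: one bucket per possible integer score, so the cost is
    linear in the number of reads plus the maximum score.
    """
    if not reads:
        return []
    scores = [read_quality_score(qual) for _, qual in reads]
    buckets = [[] for _ in range(max(scores) + 1)]
    for read, score in zip(reads, scores):
        buckets[score].append(read)
    ordered = []
    for bucket in reversed(buckets):  # highest score first (assumption)
        ordered.extend(bucket)
    return ordered


reads = [("A", "!!!!"), ("C", "IIII"), ("G", "5555")]
print([seq for seq, _ in counting_sort_by_quality(reads)])  # ['C', 'G', 'A']
```

A counting sort fits here because the scores are bounded non-negative integers, which avoids a comparison sort over the whole dataset.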

ORNA-K (parameter: -ksorting 1):

In this mode, the normalization algorithm maximizes the total read abundance score of the normalized dataset (apart from preserving all labels from the original dataset). The read abundance score of a read is defined as the median of the abundances of the kmers present in the read. ORNA-K sorts the input dataset by the median kmer abundance of each read and then uses ORNA for reduction.
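The read abundance score can be sketched in Python as follows. This is an illustration only: ORNA obtains kmer abundances through GATB, whereas this sketch counts them with a plain `Counter`, and the descending sort order is an assumption. All function names are hypothetical.

```python
from collections import Counter
from statistics import median


def kmer_abundances(reads, k):
    """Count how often each k-mer occurs across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def read_abundance_score(read, counts, k):
    """Median abundance of the k-mers present in the read."""
    values = [counts[read[i:i + k]] for i in range(len(read) - k + 1)]
    return median(values) if values else 0


def sort_by_median_abundance(reads, k):
    """Order reads by descending median k-mer abundance (order assumed)."""
    counts = kmer_abundances(reads, k)
    return sorted(reads,
                  key=lambda r: read_abundance_score(r, counts, k),
                  reverse=True)


print(sort_by_median_abundance(["ACGT", "AAAA", "AAAA"], 3))
# ['AAAA', 'AAAA', 'ACGT']
```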

ORNA Algorithm

Input : Read set R, LogBase b, kmer size k
Initialization: k' = k+1
                n = NumberOfDistinctK'mers(R)
                counter(0,...,n) = 0
                Rout = null
Steps:
    for r in R:
        flag = 0
        V' = ObtainK'mers(r)
        for v in V':
            if (counter(v) < min(abundance(v), log_b(abundance(v)))) then:
                counter(v)++
                flag = 1
            end if
        end for
        if flag != 0 then:
            Rout = Rout U r
        end if
    end for
Output: Rout
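
The pseudocode can be turned into a small runnable Python sketch. This is not the ORNA source code: it keeps everything in memory instead of using GATB, and it makes one assumption the pseudocode leaves implicit, namely that the logarithm is rounded up and that the threshold is at least 1, so that labels with abundance 1 are still retained. The names `labels_of` and `normalize` are hypothetical.

```python
import math
from collections import Counter


def labels_of(read, k):
    """Return the (k+1)-mers (edge labels) of a read."""
    kp = k + 1
    return [read[i:i + kp] for i in range(len(read) - kp + 1)]


def normalize(reads, k, base=1.7):
    """Keep a read if it still contributes to an under-covered label."""
    abundance = Counter(v for r in reads for v in labels_of(r, k))
    counter = Counter()
    kept = []
    for r in reads:
        keep = False
        for v in labels_of(r, k):
            # Assumption: ceil of the log, never below 1, never above
            # the original abundance of the label.
            target = math.ceil(math.log(abundance[v], base))
            threshold = max(1, min(abundance[v], target))
            if counter[v] < threshold:
                counter[v] += 1
                keep = True
        if keep:
            kept.append(r)
    return kept


reads = ["ACGTA"] * 10 + ["TTTTT"]
kept = normalize(reads, k=3, base=1.7)
print(len(reads), "->", len(kept))  # 11 -> 6
```

In this toy input, the ten identical reads are reduced to five, while the single read carrying the rare label TTTT is kept, so every label of the input still occurs in the output.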

ORNA uses GATB version 1.2.2 to store the kmer information.
It reduces the abundance of a kmer to a value which is equal to the logarithmic transformation of the abundance. The base b of the logarithm is provided by the user.
ORNA was tested with two de Bruijn graph based assemblers, namely Oases and TransABySS, and also worked for the assembly of metagenomics data.

Points to be noted

Currently, as ORNA retains all the kmers from the original dataset, it also retains erroneous kmers. Thus ORNA, like any other read reduction tool, removes more reads when the data is error corrected. For RNA-seq or other non-uniform data we suggest using the SEECER algorithm, which proved to work well with ORNA.
ORNA-Q, ORNA-K and ORNA's paired-end mode currently do not support multithreading. Work on this is in progress and will be included in future versions of ORNA.

Version

Version 0.4

Contact

For questions or suggestions regarding ORNA contact

Dilip A Durai (ddurai_at_mmci.uni-saarland.de)
Marcel H Schulz (marcel.schulz_at_em.uni-frankfurt.de)

Download

There are two ways to obtain ORNA: download it from GitHub or install it through bioconda.

If you use bioconda then installation is as easy as:

  conda install ORNA

Alternatively, the software can be downloaded by using the following command

	git clone https://github.com/SchulzLab/ORNA

The downloaded folder should contain the following files and folders:

install.sh
gatb-core (initially empty; files are copied in once the install script is run)
src(folder) (contains the source code for ORNA)

Pre-requisite

Linux operating system with gcc version >=4.7
All the analyses for the manuscript were performed on the Debian 8 operating system

Installation

Run the following command for installation if you downloaded it from github.

  bash install.sh

The above command should create a build folder. The executable of ORNA will be in build/bin

ORNA parameters

./bin/ORNA -help

short	explanation	note
-help	shows the help message
-sorting	(0 or 1) quality based sorting of input data	Default 0
-ksorting	(0 or 1) kmer abundance based sorting of input data	Default 0
-base	Base value for the logarithmic function	Default 1.7
-kmer	the value of k for kmer size	Default 21
-input	Input fasta file (for single end mode)
-pair1	First mate of the pair (for paired-end mode)
-pair2	Second mate of the pair (for paired-end mode)
-output	Prefix of the output file	Default "Normalized"
-nb-cores	number of cores (does not work for paired end mode)	Default 1
-type	type of the output file (fasta/fastq)	Default fasta

kmer value:
This parameter represents the kmer size to be used for reduction. Because we aim to preserve all the edge labels ((k+1)-mers) from the original dataset, the kmer size given by the user is incremented by 1 internally. For instance, if the user provides a kmer size of 21, ORNA uses a kmer size of 22 for all its calculations. All the analyses in the paper were done using a kmer size of 21 for reads of length 50 bp and 76 bp. If you are running a DBG assembly afterwards, we recommend using the smallest k-mer size used by the assembler. Memory and runtime requirements depend on both the dataset and k.

base:
This parameter represents the base of the logarithm function used to decide the new abundance of a kmer. For instance, if the original abundance of a kmer is 1000 and a base of 10 is selected, the new abundance is set to log₁₀(1000) = 3. The higher the base, the more the reads are reduced. According to the analysis in the ORNA paper, a base of 1.7 seems to be a good compromise between data reduction and a small loss in assembly quality. More examples can be found in this answer.
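
To get a feeling for the effect of the base, the logarithm can be evaluated for a few values. The Python sketch below is only a worked calculation, not ORNA code; the function name `target_abundance` is hypothetical, and any rounding ORNA applies to the result is not shown.

```python
import math


def target_abundance(abundance, base):
    """log_base(abundance), the value a kmer abundance is reduced to."""
    return math.log(abundance) / math.log(base)


# Smaller bases keep more copies of a kmer, larger bases keep fewer.
for base in (1.3, 1.7, 10):
    value = target_abundance(1000, base)
    print(f"base {base}: 1000 -> about {value:.1f}")
# base 1.3 gives about 26.3, base 1.7 about 13.0, base 10 about 3.0
```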

Running ORNA

To run ORNA, execute the following command from the installation directory

  ./build/bin/ORNA -input Dataset_name -output Output -base LogBase -kmer kmerSize -nb-cores NumberOfThreads -type fasta

Run ORNA in paired-end mode from the installation directory

  ./build/bin/ORNA -pair1 first_pair -pair2 second_pair -output Output -base LogBase -kmer kmerSize -type fasta

For instance, if the dataset to be normalized is named as input.fa, the following command would normalize the dataset using a log base of 1.7 and a kmer size of 21

  ./build/bin/ORNA -input input.fa -output output.fa -base 1.7 -kmer 21 -nb-cores 1

Citation

If you use ORNA in the normal mode (without quality or kmer abundance based sorting) in your work, please cite:

Durai DA, Schulz MH. In-silico read normalization with set multicover optimization. Bioinformatics 2018 full text

If you use ORNA-Q/K (with quality or kmer abundance based sorting), please cite:

Durai DA, Schulz MH. Improving in-silico normalization using read weights. Scientific Reports 2019 full text

Acknowledgement

ORNA uses the GATB library for graph building and k-mer counting. We are thankful for their support.

orna's Issues

Silently fail to create output files when the specified output dir does not exist

Hello devs,

If I run "$ORNA -pair1 1.fa.gz -pair2 2.fa.gz -output test/test" in a folder where test/ does not exist, the program would silently fail, with no output files in test/. (And if run with added argument "-sorting 1", the program would complain about missing file in test/tmp/s0.fa.) This problem can be addressed by replacing mkdir() with a recursive one.

As a side note, I noticed that ORNA didn't try to remove *.h5 file when running in paired end mode, while it did do so in single end mode. I doubt if this is intentional since *.h5 is essentially a binary file.

Cheers

Enable continuing a run

Orna is great. It does take a while to run though on bigger projects (no criticism here). When you run on a server, the allotted time does not always allow the entire run to finish. My issue is that when you rerun it, it starts all over from the beginning, which can be frustrating when you were close to the end (we're talking multiple days of run time). I tried to hack the program, but I couldn't find an easy way to do this. Would it take forever to add a --continue flag to handle this automatically?
I really like your work though, I think it's great. Cheers!

Incorrect error message when a parameter value is missing

Hello!

Thank you for a great tool!
I noticed that one of the error messages is misleading.
When I forgot to give a value for the -cs option, the error message told me: Unknown parameter 'fastq'.
After I set the parameter to -cs 1, the script worked well.

Thank you for a great job!

Anaconda installation

Would it be possible to make available an anaconda installation of ORNA? It would be easier to incorporate ORNA into other pipelines which are composed of programs on anaconda.

Thanks!

Carl

Best parameters to use in multi-assembly strategy.

Hey Durai and Schulz,

First of all, I would like to thank you for this amazing program.
It seems to be a great tool for researchers with less computational power, like me.
I would like your advice on which parameters to use with this program.
I have more than 200 GB of raw RNA-seq data from a crustacean.
Additionally, I performed read correction with two tools, SEECER as you recommend, and Rcorrector.
Now I would like to use ORNA to normalize the datasets before the assemblies.
For the assemblies I will use SOAPdenovo and TransABySS (with multiple kmers) and Trinity.

Regarding the -kmer and -base parameter, what would be your recommendation to obtain a single normalized dataset, that could be later used in the different assemblers?

Best Regards ,

A. Machado

Compiling fails with error

Hi, I'm just trying to compile, but I get error related to gatb. Any way around this?

 make
[  0%] Built target H5detect
[  0%] Built target H5make_libsettings
[ 68%] Built target hdf5
[ 68%] Built target hdf5_postbuild
[ 68%] Building CXX object ext/gatb-core/src/CMakeFiles/gatbcore-static.dir/gatb/bank/impl/Bank.cpp.o
In file included from /home/smomw240/temp/ORNA/thirdparty/gatb-core/gatb-core/src/gatb/bank/api/IBank.hpp:31,
                 from /home/smomw240/temp/ORNA/thirdparty/gatb-core/gatb-core/src/gatb/bank/impl/Bank.hpp:31,
                 from /home/smomw240/temp/ORNA/thirdparty/gatb-core/gatb-core/src/gatb/bank/impl/Bank.cpp:20:
/home/smomw240/temp/ORNA/thirdparty/gatb-core/gatb-core/src/gatb/tools/collections/api/Iterable.hpp: In member function ‘virtual Item* gatb::core::tools::collections::Iterable<Item>::getItems(Item*&)’:
/home/smomw240/temp/ORNA/thirdparty/gatb-core/gatb-core/src/gatb/tools/collections/api/Iterable.hpp:85: error: there are no arguments to ‘exit’ that depend on a template parameter, so a declaration of ‘exit’ must be available
/home/smomw240/temp/ORNA/thirdparty/gatb-core/gatb-core/src/gatb/tools/collections/api/Iterable.hpp:85: note: (if you use ‘-fpermissive’, G++ will accept your code, but allowing the use of an undeclared name is deprecated)
In file included from /home/smomw240/temp/ORNA/thirdparty/gatb-core/gatb-core/src/gatb/bank/api/IBank.hpp:31,
                 from /home/smomw240/temp/ORNA/thirdparty/gatb-core/gatb-core/src/gatb/bank/impl/Bank.hpp:31,
                 from /home/smomw240/temp/ORNA/thirdparty/gatb-core/gatb-core/src/gatb/bank/impl/Bank.cpp:20:
/home/smomw240/temp/ORNA/thirdparty/gatb-core/gatb-core/src/gatb/tools/collections/api/Iterable.hpp: In member function ‘virtual size_t gatb::core::tools::collections::Iterable<Item>::getItems(Item*&, size_t, size_t)’:
/home/smomw240/temp/ORNA/thirdparty/gatb-core/gatb-core/src/gatb/tools/collections/api/Iterable.hpp:97: error: there are no arguments to ‘exit’ that depend on a template parameter, so a declaration of ‘exit’ must be available
make[2]: *** [ext/gatb-core/src/CMakeFiles/gatbcore-static.dir/gatb/bank/impl/Bank.cpp.o] Error 1
make[1]: *** [ext/gatb-core/src/CMakeFiles/gatbcore-static.dir/all] Error 2
make: *** [all] Error 2

Question: Normalize fastq files separately or merge?

Hi! I want to assemble a transcriptome and have RNASeq data from multiple individuals and tissues. Would you recommend to run ORNA on a single fastq file pair, where I merge all fastq files into one, or on each individual/tissue separately? I would think the former to reduce the dataset the most?

Information on logarithm base value

Could I please ask for a bit more information on the function of the base value - what it does and what impact higher or lower values will have, and correspondingly how I should decide a setting based on the structure of my dataset?

For extra information, my RNAseq dataset is comprised of multiple tissues from various heterozygous individuals. I intend to conduct a multiple-k approach - should I run ORNA separate times with different values of k on the dataset to generate a set of normalised reads for respective assembly run, or should a single normalisation using the smallest k suffice?

Best wishes,
Reza

EXCEPTION: This dataset has no solid kmers

Hello SchulzLab,

I tried to run ORNA, but came out with the following errors:

(tsm-py3) tanshiming@S620100019205:~/Documents/CaoBin/October-2018/trimmed_duk_kmer31$ ORNA -pair1 /home/tanshiming/Documents/CaoBin/October-2018/trimmed_duk_kmer31/MFC280618_S2_R1_001.fastq.gz -pair2 MFC280618_S2_R2_001.fastq.gz -kmer 31 -nb-cores 15 -verbose 1 -ksorting 1 -type fastq -output MFC280618.Normalized
Given Parameters
----------------
Base:	1.7
kmer size:	31
Mode: ORNA-K
Bin size:	1000
Number of cores:	15
----------------
[DSK: nb solid kmers found : 0           ]  100  %   elapsed:  37 min 58 sec   remaining:   0 min 0  sec   cpu: 351.8 %   mem: [1595, 1595, 1595] MB 
EXCEPTION: This dataset has no solid kmers

Could I check if something went wrong?

Thank you!

Running error: Too many open files

Hello SchulzLab,
I want to use ORNA to normalize my datasets. I tried it on a fastq file but it failed. The fastq file is an RNA-seq sample with 150 bp read length and ~24 million reads.
My command is:
ORNA -kmer 25 -input ../00.fastq/PA1_female_1.clean.fq -nb-cores 30 -type fastq -output Normalized

Then the following errors came out:

How can I resolve this problem?
Thank you!

Concatenating multiple samples fastq

Hi!
If I have a dataset of (let's say) 10 fastq files, would you recommend concatenating them into a single fastq file prior to ORNA, or would it be better to normalize them separately? This dataset is intended for de novo transcriptome assembly.

Also, I have some questions on choosing the normalization kmer length for datasets intended for multi-kmer assembly. I understand that you recommend using the shortest kmer size for normalization. But shouldn't we use the longest kmer length instead, to prevent loss of information that will be useful for assembly at the longest kmer size?

Sorry for bad English and thank you in advance!

Output as .h5

Dear SchulzLab,
I am trying to reduce a 55 GB Illumina reads file, but even when I specify fasta as output I keep getting a .h5 file. Any idea why this might be happening?

The command I am using is the following:
"ORNA -input illumina.fq.gz -kmer 115 -nb-cores 16 -type fasta "

Reads from stdin?

Hello,
Is it possible to submit reads to the tool via a pipe or something else? I do not want to combine all my reads into one file, I would just like to submit them through the standard input, for example, through <(cat reads1 reads2 ... readsN).

EXCEPTION: error opening file: tmp/s2228.fq ()

I tried to run ORNA on a large dataset (each fastq file ~45 GB, paired end), but it fails with the above error. The DSK step seems to work fine, and then ORNA runs for a while and produces various fq files in the tmp directory, s2228.fq is the one with the largest number. It is present and contains normal fastq data as far as I can see.
I tried a test with just very small fastq files, ORNA runs fine with these. I tried this on two different machines with the same error, so I don't think it's a file system issue. I guess it happens at the start of the second iteration.

> /opt/ORNA/ORNA -pair1 all.R1.fastq.gz -pair2 all.R2.fastq.gz -sorting 1 -nb-cores 12 -output normalized -type fastq -kmer 25

Given Parameters
----------------
Base:   1.7
kmer size:      25
Mode: ORNA-Q
Number of cores:        12
----------------
[DSK: nb solid kmers found : 507317221   ]  100  %   elapsed:  97 min 17 sec   remaining:   0 min 0  sec   cpu: 247.4 %   mem: [1737, 4411, 4415] MB
[Building BooPHF]  100  %   elapsed:   2 min 8  sec   remaining:   0 min 0  sec
[MPHF: populate                          ]  100  %   elapsed:   3 min 19 sec   remaining:   0 min 0  sec   cpu: 100.0 %   mem: [1285, 1285, 4415] MB
[Bloom: read solid kmers                 ]  100  %   elapsed:   0 min 44 sec   remaining:   0 min 0  sec   cpu: 577.4 %   mem: [2145, 2145, 4415] MB
[Debloom: build extension                ]  100  %   elapsed:   1 min 46 sec   remaining:   0 min 0  sec   cpu: 896.9 %   mem: [2493, 2493, 4415] MB
[Debloom: finalization                   ]  100  %   elapsed:   0 min 56 sec   remaining:   0 min 0  sec   cpu: 154.9 %   mem: [2372, 2372, 4415] MB
[Debloom: cascading                      ]  100  %   elapsed:   0 min 32 sec   remaining:   0 min 0  sec   cpu: 878.4 %   mem: [2373, 2373, 4415] MB
[Graph: nb branching found : 46454780    ]  100  %   elapsed:   1 min 46 sec   remaining:   0 min 0  sec   cpu: 2112.5 %   mem: [3904, 3904, 4415] MB
Populating node abundances
Running ORNA-Q in paired end mode
1841
EXCEPTION: error opening file: tmp/s2228.fq ()