Code Monkey home page Code Monkey logo

big-bwt's Introduction

modified by Taher Mun to support fasta files

Big-BWT

Tool to build the BWT and optionally the Suffix Array for highly repetitive files using the approach described in Prefix-Free Parsing for Building Big BWTs by Christina Boucher, Travis Gagie, Alan Kuhnle and Giovanni Manzini [1].

Copyrights 2018- by the authors.

Installation

  • Download/Clone the repository
  • make (create the C/C++ executables)
  • bigbwt -h (get usage instruction)

Note that bigbwt is a Python script so you need at least Python 3.4 installed.

Sample usage

The only requirement for the input file is that it does not contain the characters 0x00, 0x01, and 0x02 which are used internally by bigbwt. To build the BWT for file yeast.fasta just type

   bigbwt yeast.fasta

If no errors occurrs the BWT file yeast.fasta.bwt is created: it should be one character longer than the input; the extra character is the BWT eos symbol represented by the ASCII character 0x00. Two important command line options are the window size -w and the modulus -m. Their use is explained in the paper.

Using option -S it is possible to compute the Suffix Array at the same time of the BWT. The Suffix Array is written using 5 bytes per integer (5 is a compilation constant defined in utils.h). This is exactly the same format used by the pSAscan/SAscan tools that indeed should produce exactly the same Suffix Array file.

Using options -s and/or -e (in alternative to -S) it is possible to compute a sampled Suffix Array. Option -s generates a .ssa file containing the SA entries at the beginning of BWT runs. More precisely, the .ssa file contains pairs of 5 bytes integers <j,SA[j]> for all indices j such that BWT[j]!=BWT[j-1]. Option -e generates a .esa file containing the SA entries at the end of BWT runs: i.e. the pairs <j,SA[j]> for all indices j such that BWT[j]!=BWT[j+1].

The tool has limited (and experimental) support for multiple threads. Use bigbwt option -t to specify the number of helper threads: in our tests -t 4 reduced the running time by roughly a factor two.

If you don't trust the output of our tool run bigbwt with option -c. This will compute the BWT using the SACAK algorithm [1] and compare it with the one computed by bigbwt. Be warned that SACAK, although being the most space economical among linear time algorithms, needs up to 9n bytes of RAM, since it first compute the Suffix Array and then outputs the BWT (with extension .Bwt).

References

[1] Christina Boucher, Travis Gagie, Alan Kuhnle and Giovanni Manzini Prefix-Free Parsing for Building Big BWTs Proc. WABI '18.

[2] Nong, G., Practical linear-time O(1)-workspace suffix sorting for constant alphabets, ACM Trans. Inform. Syst., vol. 31, no. 3, pp. 1-15, 2013

big-bwt's People

Contributors

alshai avatar giovmanz avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.