quickmerge

What is quickmerge?

quickmerge uses a simple concept to improve the contiguity of genome assemblies based on long molecule sequences, often with dramatic outcomes. The program uses information from assemblies made with Illumina short reads and Pacific Biosciences (PacBio) or Oxford Nanopore (ONP) long reads to improve the contiguity of an assembly generated with long reads alone. This is counterintuitive because Illumina short reads are not typically expected to cover genomic regions that PacBio and ONP long reads cannot. For more details, please see the paper that describes it. Citation for ONP assembly merging coming soon!

Why use quickmerge?

  • Saves money. Illumina sequences are cheaper than PacBio or ONP long reads, so quickmerge allows you to cut your long molecule requirement by as much as half by replacing part of it with Illumina short reads. E.g. if you think you would get an N50 of 8Mb from 75X long reads (ONP or PacBio), try sequencing 40X long and 70X Illumina reads instead of 75X long reads. You may not need that extra 35X of long reads.
  • It is fast. It takes less than a minute to run on most genomes. You run nucmer once (nucmer is the most time-consuming step) and then you can run quickmerge over a large number of parameter settings in a very short time.
  • Requires only fasta files and does not depend on any special data or computational resources.

The package contains all necessary components to run quickmerge. We also provide a set of test data (currently available on request) so that you can check that the program is working correctly on your computer. Please send questions and comments to [email protected]

  1. DOWNLOAD

    To download the latest version of quickmerge and MUMmer, its primary dependency, you can clone the repository using

     git clone
    
  2. INSTALL:

    UNIX: To install on a Unix-based system, enter the following on the command line from the directory that contains this readme:

     bash make_merger.sh
    

    This will compile 'quickmerge' and MUMmer. It requires the GNU C++ compiler.

NON-UNIX:

On a non-unix system, you will have to compile these two programs manually.

First, enter the 'merger' directory and run the following command to build the merger program:
```
make
```
Then, enter the 'MUMmer3.23' directory and run the following commands, as specified in the MUMmer readme:
```
make check

make install
```
  3. RUNNING QUICKMERGE:

    WRAPPER:

    The simplest way to run 'merger' is to use the Python wrapper 'merge_wrapper_v2.py':

     merge_wrapper_v2.py hybrid_assembly.fasta self_assembly.fasta
    

    Try the command 'merge_wrapper_v2.py -h' for details on the options available with this wrapper.

    MANUAL:

    To manually run 'merger', first make a call to 'nucmer'. Nucmer aligns the two assemblies so that the merger can find the correct splice sites:

     nucmer -l 100 -prefix out self_assembly.fasta hybrid_assembly.fasta
    

    Then, use delta-filter to filter out alignments due to repeats and duplicates:

     delta-filter -i 95 -r -q out.delta > out.rq.delta
    

    Finally, use 'quickmerge' to merge the two assemblies (note: the order of the self and hybrid assembly is important):

     quickmerge -d out.rq.delta -q hybrid_assembly.fasta -r self_assembly.fasta -hco 5.0 -c 1.5 -l n -ml m
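
The three manual steps above can also be driven from a script. The following is a minimal Python sketch under stated assumptions, not part of quickmerge itself: the function names `build_commands` and `run_commands` are hypothetical, the flags are copied from the commands shown above, and the values for -l and -ml must be supplied by the caller.

```python
import subprocess


def build_commands(hybrid, self_asm, length_cutoff, min_aln,
                   hco=5.0, c=1.5, prefix="out"):
    """Return the manual-mode steps as (argv, stdout_path) pairs.

    stdout_path is None when the tool writes its own output files and a
    file name when the tool's standard output must be redirected.
    """
    delta = prefix + ".delta"
    filtered = prefix + ".rq.delta"
    return [
        # Step 1: align the two assemblies (self assembly first, as above).
        (["nucmer", "-l", "100", "-prefix", prefix, self_asm, hybrid], None),
        # Step 2: filter repeat and duplicate alignments; output goes to stdout.
        (["delta-filter", "-i", "95", "-r", "-q", delta], filtered),
        # Step 3: merge; -q is the hybrid assembly and -r the self assembly.
        (["quickmerge", "-d", filtered, "-q", hybrid, "-r", self_asm,
          "-hco", str(hco), "-c", str(c),
          "-l", str(length_cutoff), "-ml", str(min_aln)], None),
    ]


def run_commands(steps):
    """Run each step in order, stopping at the first failure."""
    for argv, stdout_path in steps:
        if stdout_path is None:
            subprocess.run(argv, check=True)
        else:
            with open(stdout_path, "w") as out:
                subprocess.run(argv, check=True, stdout=out)
```

Because nucmer is the slow step, one possible design is to run the first two steps once and then repeat only the quickmerge step with different -hco, -c, -l and -ml values.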
    

    Description of the parameters:

    -q: Hybrid assembly. (This can also be a PacBio-only or an ONP-only assembly.) See the quickmerge wiki for details.

    -r: Self assembly. (can also be a hybrid assembly).

    -hco: controls the overlap cutoff used in selection of anchor contigs. Default is 5.0.

    -c: controls the overlap cutoff for contigs used for extension of the anchor contig. Default is 1.5.

    For both "hco" and "c", the bigger the number, the more stringent the criteria for contig selection (which will lead to fewer contigs being merged). If the values are too small (<1), the chance of spurious merging increases. It is better to be conservative while merging contigs!

    -l: controls the length cutoff for anchor contigs. A good rule of thumb is to start with the N50 of the self assembly. E.g. if the N50 of your self assembly is 2Mb then use 2000000 as your cutoff. Lowering this value may lead to more merging but may increase the probability of mis-joins.

    -ml: controls the minimum alignment length to be considered for merging. This is especially helpful for repeat-rich genomes. Default is 0 but higher values (>5000) are recommended.
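
The rule of thumb for -l can be made concrete by computing the N50 of the self assembly directly from its fasta file. This is a minimal sketch rather than anything shipped with quickmerge: `contig_lengths` and `n50` are hypothetical helper names, and N50 is taken here in its usual sense, i.e. the length of the contig at which the cumulative length, counted from the longest contig down, first reaches half of the total assembly length.

```python
def contig_lengths(fasta_path):
    """Return the length of every sequence in a fasta file, in file order."""
    lengths = []
    current = None  # None until the first header line has been seen
    with open(fasta_path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if current is not None:
                    lengths.append(current)
                current = 0
            elif current is not None:
                current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def n50(lengths):
    """Return the N50 of a list of contig lengths (0 for an empty list)."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0
```

The value returned by `n50(contig_lengths("self_assembly.fasta"))` could then serve as a starting point for the -l argument.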

  4. SOME HELPFUL TIPS:

  • Although this program was written to merge a hybrid assembly (e.g. as generated by DBG2OLC) and a PacBio or ONP only assembly, it can also be used to merge two different long molecule only assemblies (e.g. one generated with PBcR or canu and another generated with FALCON).

  • For optimal merging results, identify the major misassemblies (especially translocations and inversions) in the component assemblies and break the contigs at such misassembly boundaries. Alignment of the component assemblies to the merged assembly may help to identify such assembly errors because a specific error typically occurs in only one of the assemblies.

  • The fasta sequence headers should not contain whitespace. If they do, as might happen for assemblies obtained from the FALCON assembler, the whitespace needs to be removed before launching the merging Python script or before running MUMmer. Our merging script now takes care of this issue.

  • You can run Ka-kit's finisherSC after running quickmerge to improve the contiguity even further.

  • Assembly polishing with Quiver and Pilon before and after assembly merging is strongly recommended. However, if you are running finisherSC, you may perform the Quiver polishing after the finisherSC step.

  • Check the merged assembly by aligning the hybrid and/or non-hybrid assembly to the merged assembly (you can use nucmer -mumreference and mummerplot for alignment and dot plot visualization).
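
For the whitespace tip above, headers can be cleaned before running nucmer by hand. The sketch below is one possible approach and not the wrapper's own code: `clean_header` and `clean_fasta` are hypothetical names, and replacing each run of whitespace with an underscore is an assumption made here (truncating the header at the first space would be another option).

```python
import re


def clean_header(line, replacement="_"):
    """Replace whitespace runs in a fasta header; leave other lines as-is."""
    if not line.startswith(">"):
        return line
    name = line[1:].strip()
    return ">" + re.sub(r"\s+", replacement, name)


def clean_fasta(in_path, out_path):
    """Copy a fasta file, rewriting only its header lines."""
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            dst.write(clean_header(line.rstrip("\n")) + "\n")
```

Replacing rather than truncating keeps the full header text, which makes it less likely that two distinct headers collapse into the same name.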

  5. KNOWN ISSUES:
  • All sequence names (that is, fasta headers) in each of the two input assemblies must be unique, i.e., each assembly must have all unique names, and the two assemblies must not share any names. The quickmerge Python wrapper now fixes this automatically.
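
When running the manual steps without the wrapper, the naming requirement above can be checked up front. The following is a minimal sketch with hypothetical helper names (`fasta_names`, `name_conflicts`); it assumes a sequence name is the header text up to the first whitespace, and it only reports conflicts rather than renaming anything.

```python
from collections import Counter


def fasta_names(path):
    """Return the sequence names (first header token) of a fasta file."""
    names = []
    with open(path) as handle:
        for line in handle:
            if line.startswith(">"):
                fields = line[1:].split()
                names.append(fields[0] if fields else "")
    return names


def name_conflicts(names_a, names_b):
    """Return (duplicates in a, duplicates in b, names shared by both)."""
    dup_a = sorted(n for n, k in Counter(names_a).items() if k > 1)
    dup_b = sorted(n for n, k in Counter(names_b).items() if k > 1)
    shared = sorted(set(names_a) & set(names_b))
    return dup_a, dup_b, shared
```

If all three returned lists are empty, the two assemblies satisfy the naming requirement described above.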

Contributors: mahulchak, jgbaldwinbrown, esolares
