Code Monkey home page Code Monkey logo

miniwfa's Introduction

Getting Started

# install and test
git clone https://github.com/lh3/miniwfa
cd miniwfa && make test-mwf
./test-mwf -c test/t3-?.fa

# use as a library
cp miniwfa.{c,h} kalloc.{c,h} your_src/

Introduction

Miniwfa is a reimplementation of the WaveFront Alignment algorithm (WFA) with dual gap penalty. For long diverged sequences, the original WFA is slow and memory hungry. Miniwfa introduces a low-memory mode and a chaining heuristic to address these issues.

Miniwfa was developed in parallel to the linear-space BiWFA algorithm. Now BiWFA uses less memory and is generally faster than miniwfa in the exact mode. Miniwfa is much faster in the heuristic mode but it does not guarantee to find the optimal solution.

Miniwfa can be used as a library. Here is a compilable sample program:

// compile with gcc -O3 this-prog.c miniwfa.c kalloc.c
#include <string.h>
#include <stdio.h>
#include "miniwfa.h"
int main(int argc, char *argv[]) {
  mwf_opt_t opt;
  mwf_rst_t r;
  mwf_opt_init(&opt);
  if (argc >= 3) {
    mwf_wfa_auto(0, &opt, strlen(argv[1]), argv[1], strlen(argv[2]), argv[2], &r);
    printf("%d\n", r.s);
  }
  return 0;
}

Algorithm

When reporting alignment score only, miniwfa behaves largely the same as the original WFA. Miniwfa differs mainly in traceback. In the high memory mode, miniwfa packs traceback information into 7 bits per entry:

extD2<<6 | extI2<<5 | extD1<<4 | extI1<<3 | fromState

where extD1, for example, indicates whether we are extending the D1 deletion state, and fromState keeps the max state (5 possible values). As such, miniwfa uses 1 byte for each traceback entry. The standard WFA uses 20 bytes per entry.

Miniwfa has a low-memory mode inspired by but different from wfalm. In the wfalm terminology, miniwfa stores a "stripe" of traceback information every p cycles. It finds the positions on these stripes that the optimal alignment path goes through. Miniwfa then applies the WFA algorithm for a second time. In this round, it resets the bandwidth to 1 whenever it reaches a stripe. Miniwfa approximately uses (20qs2/p+ps) bytes of memory. Here, s is the optimal alignment penalty, p is the distance between stripes and q=max(x,o1+e1,o2+e2) is the maximal penalty between adjacent entries. The time complexity is O(n(s+p)) where n is the length of the longer sequence.

In the heuristic mode, miniwfa chains low-occurrence k-mers and fills gaps in the chain with the exact algorithm.

Vectorization

Miniwfa relies on compiler vectorization to parallelize three key loops. Old compilers (e.g. gcc-4.8.5) that do not support auto vectorization will lead to lower performance. Some compilers (e.g. gcc-6.4.0) cannot vectorize one of the loops. Clang-13.1.6 and gcc-10.3.0 are known to work well with miniwfa. Also note that gcc does not vectorize loops with -O2 and vectorization is influenced by SIMD. If you use miniwfa as a library, it is recommended to enable -msse4 -O3 when compiling with gcc.

Evaluation

We only use three pairs of sequences for evaluation. The first pair consists of the two haplotypes of NA19240 around the C4A/C4B gene. They are 100-150kb in length with a penalty of 26,917 under penalty x=4, o1=4, e1=2, o2=15 and e2=1. The second pair consists of GRCh38 and CHM13 around MHC. They are about 5Mb in length with a penalty of 229,868. The third pair consists of the two MHC haplotypes in HG002 with a penalty of 267,637. These sequences can be found via Zenodo.

We checked out the development branch of WFA2-lib on 2022-04-25 and compiled the code with gcc-10.3.0 (no LTO) on a CentOS 7 server equipped with two Xeon 6230 CPUs. The table below shows the timing and peak memory for miniwfa and BiWFA in its linear mode. The table used to include other memory modes of WFA2-lib and wfalm, which were removed because BiWFA is a clear winner now.

Method Command line tMHC (s) MMHC (GB) tHG002 MHG002 tC4 MC4
miniwfa high-mem test-mwf -c 385 50.6 533 68.1 3.8 0.73
miniwfa low-mem test-mwf -cp5000 544 4.1 735 5.3 5.8 0.22
biwfa linear test-wfa -cm0 308 0.4 837 0.4 2.0 0.05

Historical notes on WFA and related algorithms

I got the following notes wrong a few times. Please let me know if you found the narrative is still inaccurate.

Given a pair of strings, let n be the length of the longer string and d is the edit distance between the two strings. Ukkonen found the O(nd) algorithm to compute edit distances in 1983 and published it in 1985. Myers independently conceived the same algorithm in 1984 and published it in 1986. He additionally introduced a few variations including a linear-space algorithm with forward and reverse search. It is essentially BiWFA under edit distance. The idea of "wave" was implicit in Myers' 1986 paper. His 2014 paper explicitly explained "wave".

Sometimes the O(nd) algorithm is also attributed to Landau and Vishkin (1989). I don't know if Landau and Vishkin were aware of Ukkonen's work at the time of submission in 1986, but their published paper in 1989 did cite Ukkonen (1985).

The search for a similar algorithm for linear or affine gap penalties took three decades. To the best of my knowledge, Xin et al (2017) first found a version of this algorithm but apparently they have never published it in a peer-reviewed journal. Marco-Sola et al (2021) were probably unaware of Xin et al. They independently published the WFA algorithm as we know today. They also gave a highly efficient implementation, beating all global alignment algorithms by a large margin.

A major concern with the original WFA is its large memory consumption. Eizenga and Paten (2022) implemented wfalm to reduce its peak memory. This repo was inspired by this work. Then Marco-Sola et al (2022) developed BiWFA that finds the alignment in linear space. This is so far the best performing algorithm.

It is worth noting that also in 1985, Ukkonen published another algorithm with expected O(nt) time complexity where t is a given threshold. This algorithm guarantees to find the edit distance d if dโ‰คt. It is somewhat similar to banded alignment but distinct from his O(nd) algorithm published in the same year.

miniwfa's People

Contributors

lh3 avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar

Forkers

gdv yuk12 zm-git-dev

miniwfa's Issues

History of the wave-front algorithm ...

Hi Heng, Richard Durbin pointed me at your history and I thought I would offer a few corrections/additions.

Esko and I developed the idea independently. I came up with it in 1984 and published in 1986 and did not know about his 1985 paper. Your history makes it sound like my paper referenced his, it did not, the Landau Vishkin paper is the one that cited his 1985 paper.

My 1986 paper also showed that the algorithm was O(n+k^2) in expectation. I also showed how to deliver an alignment in linear space with a forward and reverse wave. And using a suffix tree and O(1) lca finding that the algorithm could be made (impractically) to run in O(nlogn+k^2) worst-case time. So it went well beyond Esko's paper and fully superceded the work of Landau and Vishkin and others in that group. My paper also gave a much cleaner exposition I think than Esko's (look at the two papers) and introduced the term "wave".

But the story doesn't end there. In the mid-2000's I realized the wave could be extended heuristically in a way that accelerated
with increasing stringency of match. This finally found its way into a paper in my 2014 WABI paper `Efficient Alignment Discovery amongst Noisy Long Reads,''. This paper doesn't give my final word on the wavefront heurisitc which remains unpublished but does I think give interesting ideas for how to terminate a wave extension as well as several ideas for trimming the wave front. I gave many talks about this work from 2013 on in the context of long read sequencing and assembly. I met Santiago when he was a Ph.D. in Roderic Guigo's labe and in fact was on his thesis committee. I like him very much and I think he liked my work including the non-affine wave-front which lead him to develop WFA.

That ends my description of the history from my perspective. Please incorporate as you see fit. I've been a long time fan of
your work and hope that perhaps one day we will actually meet before I get too old :-) I still interact with Richard a lot so perhaps
in that context.

Kind regards, Gene Myers

miniwfa on syntenic regions

Hi,

I would like to use miniwfa heuristic mode to align a 3 Mbp region from the Axolotl genome to its 200 kbp syntenic region in the human genome, with a particular emphasis on aligning any common repetitive regions as best as possible.

I was wondering whether you could give me your feedback as to whether this is the most suitable WFA implementation to achieve this with reasonable time and memory requirements? I have several thousands of such sequence pairs, which makes these requirements problematic.

In addition, am I correct in assuming that the test-mwf -c output is in paf format? Any advice on how to convert this format to fasta .aln or sam formats for further processing of the alignment?

Thanks in advance!

Note on ends-free BiWFA

Hi,

I would like to highlight that the BiWFA can be applied to any ends-free alignment variation. If you think about it, you only need to change the initial conditions; that is, WF_0 of forward and reverse alignment need to allow starting at any diagonal (with offset zero).

Although, I'm not sure about how performant will this be compared to the regular WFA ends-free (i.e., for different sequence lengths, errors, and ends-free init length), from the algorithmic standpoint, I'm sure it is doable. In fact, I started to put that code into BiWFA, although I stopped for the sake of releasing a simpler code.

Cheers,

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.