Code Monkey home page Code Monkey logo

gwfa's Introduction

Note: the development of GWFA has been moved to gfatools such that the algorithm can directly operate on bidirected sequence graphs.


Getting Started

git clone https://github.com/lh3/gwfa
cd gwfa && make
./gwf-test test/C4-90.gfa.gz test/C4-NA19240.1.fa.gz
./gwf-test test/C4-NA19240.1.fa.gz test/C4-NA19240.2.fa.gz

Introduction

GWFA (Graph WaveFront Alignment) is an algorithm to align a sequence against a sequence graph. It adapts the WFA algorithm for graphs. This repo presents a proof-of-concept implementation of GWFA that computes the edit distance between a graph and a sequence without backtracing. The algorithm assumes the start of the sequence to be aligned with the start of the first segment in the graph and requires the query sequence to be fully aligned. This behavior is similar to the SHW mode of edlib. It does not support semi-global alignment and is thus not intended for read mapping. This is not an enduser tool.

GWFA is optimized for graphs consisting of long segments. It is largely reduced to my implementation of the Landau-Vishkin algorithm given two linear sequences as input. Similar to WFA, GWFA is fast when the edit distance is small. It can align a ~120kb sequence to a ~160kb graph at 0.1% divergence in 0.02 second, much faster than the ordinary Needleman-Wunsch formulation.

Evaluation

To evaluate the performance of GWFA, we constructed two small graphs with minigraph. The first graph includes ~120kb region around the C4A and C4B genes. The second includes ~5Mb of MHC, consisting of 980 segments and 1399 links. Both graphs are available via Zenodo. Neither has cycles.

For the C4-90 graph, Haowen Zhang found GWFA can report the same edit distance in comparison to his implementations of various sequence-to-graph algorithms.

For the MHC-57 graph, GWFA took 3m38s to align the HG002.2 haplotype (extracted from MHC-61.agc on Zenodo) with 41932 edits. Other exact solutions did not finish in a day. GraphAligner-1.0.14 didn't align the haplotype in one piece. For the first 2Mb subsequence of HG002.2, GraphAligner reported 6460 edits in the vg mode and 6547 edits in the dbg mode. GWFA found a smaller edit distance of 6447.

We have not tested more complex graphs with cycles. Although we think the GWFA algorithm should be correct in theory, we are not sure if the current implmentation is correct in all corner cases. Please use with caution.

Contributors

Shiqi Wu formulated the initial GWFA algorithm. Haowen Zhang inspected the code and helped the evaluation.

gwfa's People

Contributors

lh3 avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar

Forkers

dwpeng

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.