Code Monkey home page Code Monkey logo

wfa2-lib's Introduction

WFA2-lib

1. INTRODUCTION

1.1 What is WFA?

The wavefront alignment (WFA) algorithm is an exact gap-affine algorithm that takes advantage of homologous regions between the sequences to accelerate the alignment process. Unlike traditional dynamic programming algorithms that run in quadratic time complexity, the WFA runs in time O(ns+s^2), proportional to the sequence length n and the alignment score s, using O(s^2) memory (or O(s) using the ultralow/BiWFA mode). Moreover, the WFA algorithm exhibits simple computational patterns that the modern compilers can automatically vectorize for different architectures without adapting the code. To intuitively illustrate why the WFA algorithm is so interesting, look at the following figure. The left panel shows the cells computed by a classical dynamic programming based algorithm (like Smith-Waterman or Needleman Wunsch). In contrast, the right panel shows the cells computed by the WFA algorithm to obtain the same result (i.e., the optimal alignment).

1.2 What is WFA2-lib?

The WFA2 library implements the WFA algorithm for different distance metrics and alignment modes. It supports various distance functions (e.g., indel, edit, gap-linear, gap-affine, and dual-gap gap-affine). The library allows computing only the score or the complete alignment (i.e., CIGAR) (see Alignment Scope). Also, the WFA2 library supports computing end-to-end alignments (a.k.a. global-alignment) and ends-free alignments (including semi-global, glocal, and extension alignment) (see Alignment Span). In the case of long and noisy alignments, the library provides different low-memory modes that significantly reduce the memory usage of the naive WFA algorithm implementation. Beyond the exact-alignment modes, the WFA2 library implements heuristic modes that dramatically accelerate the alignment computation. Additionally, the library provides many other support functions to display and verify alignment results, control the overall memory usage, and more.

1.3 Getting started

Git clone and compile the library, tools, and examples (by default, use cmake for the library and benchmark build).

git clone https://github.com/smarco/WFA2-lib
cd WFA2-lib
mkdir build && cd build
cmake .. -DCMAKE_BUILD_TYPE=Release
cmake --build . --verbose
ctest . --verbose

There are some flags that can be used. For instance:

cmake .. -DOPENMP=TRUE
cmake .. -DCMAKE_BUILD_TYPE=Release -DEXTRA_FLAGS="-ftree-vectorizer-verbose=5"

Alternatively, the simple Makefile build system can be used.

git clone https://github.com/smarco/WFA2-lib
cd WFA2-lib
make clean all

Also, it is possible to build WFA2-lib in a GNU Guix container, for more information, see guix.scm.

1.4 Contents (where to go from here)

Section WFA2-lib features explores the most relevant options and features of the library. Then, the folder tools/ contains tools for executing and understanding the WFA2 library capabilities. Additionally, the folder examples/ contains simple examples illustrating how to integrate the WFA2 code into any tool.

1.5 Important notes and clarifications

  • The WFA algorithm is an exact algorithm. If no heuristic is applied (e.g., band or adaptive pruning), the core algorithm guarantees to always find the optimal solution (i.e., best alignment score). Since its first release, some authors have incorrectly referred to the WFA as approximated or heuristic; this is not the case.

  • Given two sequences of length n, traditional dynamic-programming (DP) based methods (like Smith-Waterman or Needleman-Wunsch) compute the optimal alignment in O(n^2) time, using O(n^2) memory. In contrast, the WFA algorithm requires O(ns+s^2) time and O(s^2) memory (with s the optimal alignment score). Therefore, the memory consumption of the WFA algorithm is not intrinsically higher than that of other methods. Most DP-based methods can use heuristics (like banded, X-drop, or Z-drop) to reduce the execution time and the memory usage at the expense of losing accuracy. Likewise, the WFA algorithm can also use heuristics to reduce the execution time and memory usage. Moreover, the memory mode ultralow uses the BiWFA algorithm to execute in O(ns+s^2) time and linear O(s) memory.

  • A note for the fierce competitors. I can understand that science and publishing have become a fierce competition these days. Many researchers want their methods to be successful and popular, seeking funding, tenure, or even fame. If you are going to benchmark the WFA using the least favourable configuration, careless programming, and a disadvantageous setup, please, go ahead. But remember, researchers like you have put a lot of effort into developing the WFA. We all sought to find better methods that could truly help other researchers. So, try to be nice, tone down the marketing, and produce fair evaluations and honest publications.

2. USING WFA2-LIB IN YOUR PROJECT

2.1 Simple C example

This simple example illustrates how to align two sequences using the WFA2 library. First, include the WFA2 alignment headers.

#include "wavefront/wavefront_align.h"

Next, create and configure the WFA alignment object. The following example uses the defaults configuration and sets custom gap_affine penalties. Note that mismatch, gap-opening, and gap-extension must be positive values.

// Configure alignment attributes
wavefront_aligner_attr_t attributes = wavefront_aligner_attr_default;
attributes.distance_metric = gap_affine;
attributes.affine_penalties.mismatch = 4;
attributes.affine_penalties.gap_opening = 6;
attributes.affine_penalties.gap_extension = 2;
// Initialize Wavefront Aligner
wavefront_aligner_t* const wf_aligner = wavefront_aligner_new(&attributes);

Finally, call the wavefront_align function.

char* pattern = "TCTTTACTCGCGCGTTGGAGAAATACAATAGT";
char* text    = "TCTATACTGCGCGTTTGGAGAAATAAAATAGT";
wavefront_align(wf_aligner,pattern,strlen(pattern),text,strlen(text)); // Align

Afterwards, we can use the library to display the alignment result (e.g., the alignment score and CIGAR).

// Display CIGAR & score
cigar_print_pretty(stderr,pattern,strlen(pattern),text,strlen(text),
                   &wf_aligner->cigar,wf_aligner->mm_allocator);
fprintf(stderr,"Alignment Score %d\n",wf_aligner->cigar.score);

At the end of the program, it is polite to release the memory used.

wavefront_aligner_delete(wf_aligner); // Free

To compile and run this example, you need to link against the WFA library (-lwfa).

$> gcc -O3 wfa_example.c -o wfa_example -lwfa
$> ./wfa_example

IMPORTANT. Once an alignment object is created, it is strongly recommended to reuse it to compute multiple alignments. Creating and destroying the alignment object for every alignment computed can have a significant overhead. Reusing the alignment object allows repurposing internal data structures, minimising the cost of memory allocations, and avoiding multiple alignment setups and precomputations.

2.2 Simple C++ example

The WFA2 library can be used from C++ code using the C++ bindings. This example is similar to the previous one but uses C++ bindings. First, include the C++ bindings and remember to use the WFA namespace.

#include "bindings/cpp/WFAligner.hpp"
using namespace wfa;

Configure and create the WFA alignment object. In this case, gap-affine distance using custom penalties and the standard memory-usage algorithm (i.e., standard WFA algorithm).

// Create a WFAligner
WFAlignerGapAffine aligner(4,6,2,WFAligner::Alignment,WFAligner::MemoryHigh);

Align two sequences (in this case, given as strings).

string pattern = "TCTTTACTCGCGCGTTGGAGAAATACAATAGT";
string text    = "TCTATACTGCGCGTTTGGAGAAATAAAATAGT";
aligner.alignEnd2End(pattern,text); // Align

Display the result of the alignment.

// Display CIGAR & score
string cigar = aligner.getAlignmentCigar();
cout << "CIGAR: " << cigar  << endl;
cout << "Alignment score " << aligner.getAlignmentScore() << endl;

IMPORTANT. Once an alignment object is created, it is strongly recommended to reuse it to compute multiple alignments. Creating and destroying the alignment object for every alignment computed can have a significant overhead. Reusing the alignment object allows repurposing internal data structures, minimising the cost of memory allocations, and avoiding multiple alignment setups and precomputations.

2.3 Rust bindings

Rust bindings can be generated automatically using bindgen, see bindings/rust/build.rs. An example of how to use them is here.

3. WFA2-LIB FEATURES

  • Exact alignment method that computes the optimal alignment score and/or alignment CIGAR.
  • Supports multiple distance metrics (i.e., indel, edit, gap-linear, gap-affine, and dual-gap gap-affine).
  • Allows performing end-to-end (a.k.a. global) and ends-free (e.g., semi-global, extension, overlap) alignment.
  • Implements low-memory modes to reduce and control memory consumption (down to O(s) using the ultralow mode).
  • Supports various heuristic strategies to use on top of the core WFA algorithm.
  • WFA2-lib operates with plain ASCII strings. Although we mainly focus on aligning DNA/RNA sequences, the WFA algorithm and the WFA2-lib implementation work with any pair of strings. Moreover, these sequences do not have to be pre-processed (e.g., packed or profiled), nor any table must be precomputed (like the query profile, used within some Smith-Waterman implementations).
  • Due to its simplicity, the WFA algorithm can be automatically vectorized for any SIMD-compliant CPU supported by the compiler. For this reason, the WFA2-lib implementation is independent of any specific ISA or processor model. Unlike other hardware-dependent libraries, we aim to offer a multiplatform pairwise-alignment library that can be executed on different processors and models (e.g., SSE, AVX2, AVX512, POWER-ISA, ARM, NEON, SVE, SVE2, RISCV-RVV, ...).

3.1 Distance Metrics

The WFA2 library implements the wavefront algorithm for the most widely used distance metrics. The practical alignment time can change depending on the distance function, although the computational complexity always remains proportional to the alignment score or distance. The WFA2 library offers the following distance metrics or functions:

  • Indel (or LCS). Produces alignments allowing matches, insertions, and deletions with unitary cost (i.e., {M,I,D} = {0,1,1}) but not mismatches. Also known as the longest common subsequence (LCS) problem. The LCS is defined as the longest subsequence common to both sequences, provided that the characters of the subsequence are not required to occupy consecutive positions within the original sequences.
    PATTERN    A-GCTA-GTGTC--AATGGCTACT-T-T-TCAGGTCCT
               |  ||| |||||    |||||||| | | |||||||||
    TEXT       AA-CTAAGTGTCGG--TGGCTACTATATATCAGGTCCT
    ALIGNMENT  1M1I1D3M1I5M2I2D8M1I1M1I1M1I9M
    // Configuration
    wavefront_aligner_attr_t attributes = wavefront_aligner_attr_default;
    attributes.distance_metric = indel;
  • Edit (a.k.a. Levenshtein). Produces alignments allowing matches, mismatches, insertions, and deletions with unitary cost (i.e., {M,X,I,D} = {0,1,1,1}). Edit or Levenshtein distance between two sequences is the minimum number of single-character edits (i.e., insertions, deletions, or mismatches) required to transform one sequence into the other.
    PATTERN    AGCTA-GTGTCAATGGCTACT-T-T-TCAGGTCCT
               | ||| |||||  |||||||| | | |||||||||
    TEXT       AACTAAGTGTCGGTGGCTACTATATATCAGGTCCT
    ALIGNMENT  1M1X3M1I5M2X8M1I1M1I1M1I9M
    // Configuration
    wavefront_aligner_attr_t attributes = wavefront_aligner_attr_default;
    attributes.distance_metric = edit;
  • Gap-linear (as in Needleman-Wunsch). Produces alignments allowing matches, mismatches, insertions, and deletions. Allows assigning a penalty (a.k.a. cost or weight) to each alignment operation. It computes the optimal alignment, minimizing the overall cost to transform one sequence into the other. Under the gap-linear model, the alignment score is computed based on {X,I}⁠, where X corresponds to the mismatch penalty and the gap penalty is expressed as the function l(N)=N·I (given the length of the gap N and the gap penalty I).
    PATTERN    A-GCTA-GTGTC--AATGGCTACT-T-T-TCAGGTCCT
               |  ||| |||||    |||||||| | | |||||||||
    TEXT       AA-CTAAGTGTCGG--TGGCTACTATATATCAGGTCCT
    ALIGNMENT  1M1I1D3M1I5M2I2D8M1I1M1I1M1I9M
    // Configuration
    wavefront_aligner_attr_t attributes = wavefront_aligner_attr_default;
    attributes.distance_metric = gap_linear;
    attributes.linear_penalties.mismatch = 6; // X > 0
    attributes.linear_penalties.indel = 2;    // I > 0
  • Gap-affine (as in Smith-Waterman-Gotoh). Linear gap cost functions can lead to alignments populated with small gaps. Long gaps are preferred in certain scenarios, like genomics or evolutionary studies (understood as a single event). Under the gap-affine model, the alignment score is computed based on {X,O,E}⁠, where X corresponds to the mismatch penalty and the gap penalty is expressed as the function g(N)=O+N·E (given the length of the gap N, the gap opening penalty O, and the gap extension penalty E).
    PATTERN    AGCTA-GTGTCAATGGCTACT---TTTCAGGTCCT
               | ||| |||||  ||||||||   | |||||||||
    TEXT       AACTAAGTGTCGGTGGCTACTATATATCAGGTCCT
    ALIGNMENT  1M1X3M1I5M2X8M3I1M1X9M
    // Configuration
    wavefront_aligner_attr_t attributes = wavefront_aligner_attr_default;
    attributes.distance_metric = gap_affine;
    attributes.affine_penalties.mismatch = 6;      // X > 0
    attributes.affine_penalties.gap_opening = 4;   // O >= 0
    attributes.affine_penalties.gap_extension = 2; // E > 0
  • Dual-cost gap-affine distances. Also known as piece-wise gap-affine cost, this distance metric addresses some issues that the regular gap-affine distance has with long gaps. In a nutshell, the regular gap-affine distance can occasionally split long gaps by sporadic mismatches (often when aligning long and noisy sequences). Instead, many users would prefer to increase the open gap cost to produce a single long gap. For that, the dual-cost gap-affine distance (p=2) defines two affine cost functions (i.e., for short and long gaps). Then, the alignment score is computed based on {X,O1,E1,O2,E2}⁠, where X corresponds to the mismatch penalty and the gap penalty is expressed as the function g(N)=min{O1+N·E1,O2+N·E2} (given the length of the gap N, the gap opening penalties O1 and O2, and the gap extension penalties E1 and E2).
    wavefront_aligner_attr_t attributes = wavefront_aligner_attr_default;
    attributes.distance_metric = gap_affine_2p;
    attributes.affine2p_penalties.mismatch = 6;       // X > 0
    attributes.affine2p_penalties.gap_opening1 = 4;   // O1 >= 0
    attributes.affine2p_penalties.gap_extension1 = 2; // E1 > 0
    attributes.affine2p_penalties.gap_opening2 = 12;  // O2 >= 0
    attributes.affine2p_penalties.gap_extension2 = 1; // E2 > 0

3.2 Alignment Scope

Depending on the use case, it is often the case that an application is only required to compute the alignment score, not the complete alignment (i.e., CIGAR). As it happens with traditional dynamic programming algorithms, the WFA algorithm requires less memory (i.e., O(s)) to compute the alignment score. In turn, this results in slighter faster alignment executions. For this reason, the WFA2 library implements two different modes depending on the alignment scope: score-only and full-CIGAR alignment.

The ** score-only alignment ** mode computes only the alignment score. This mode utilizes only the front-wavefronts of the WFA algorithm to keep track of the optimal alignment score. As a result, it requires O(s) memory and, in practice, performs slighter faster than the standard full-CIGAR mode.

    wavefront_aligner_attr_t attributes = wavefront_aligner_attr_default;
    attributes.alignment_scope = compute_score;

The ** full-CIGAR alignment ** computes the sequence of alignment operations (i.e., {'M','X','D','I'}) that transforms one sequence into the other (i.e., alignment CIGAR). The alignment score can be obtained as a by-product of the alignment process, evaluating the score of the alignment CIGAR. This mode requires O(s^2) memory (using the default memory mode, wavefront_memory_high) or less (using the low-memory modes).

    wavefront_aligner_attr_t attributes = wavefront_aligner_attr_default;
    attributes.alignment_scope = compute_alignment;

3.3 Alignment Span

The WFA2 library allows computing alignments with different spans or shapes. Although there is certain ambiguity and confusion in the terminology, we have tried to generalize the different options available to offer flexible parameters that can capture multiple alignment scenarios. During the development of the WFA we decided to adhere to the classical approximate string matching terminology where we align a pattern (a.k.a. query or sequence) against a text (a.k.a. target, database, or reference).

  • End-to-end alignment. Also known as global alignment, this alignment mode forces aligning the two sequences from the beginning to end of both.
    PATTERN    AATTAATTTAAGTCTAGGCTACTTTCGGTACTTTGTTCTT
               ||||    ||||||||||||||||||||||||||   |||
    TEXT       AATT----TAAGTCTAGGCTACTTTCGGTACTTT---CTT
    wavefront_aligner_attr_t attributes = wavefront_aligner_attr_default;
    attributes.alignment_form.span = alignment_end2end;
  • Ends-free alignment. This alignment mode allows leading and trailing insertions or deletions for "free" (i.e., no penalty/cost on the overall alignment score). Moreover, this alignment mode allows determining the maximum gap length allowed for free at the beginning and end of the sequences. Note that this mode does not implement local alignment as it does not allow free insertions and deletions at the beginning/end of the sequences at the same time. However, it allows many different configurations used across different analyses, methods, and tools.
    PATTERN    AATTAATTTAAGTCTAGGCTACTTTCGGTACTTTGTTCTT
                   |||||||||||||||||||||||||||||| ||
    TEXT       ----AATTTAAGTCTAGGCTACTTTCGGTACTTTCTT---
    wavefront_aligner_attr_t attributes = wavefront_aligner_attr_default;
    attributes.alignment_form.span = alignment_endsfree;
    attributes.alignment_form.pattern_begin_free = pattern_begin_free;
    attributes.alignment_form.pattern_end_free = pattern_end_free;
    attributes.alignment_form.text_begin_free = text_begin_free;
    attributes.alignment_form.text_end_free = text_end_free;
  • Other
Glocal alignment (a.k.a. semi-global or fitting)

  • Glocal alignment (a.k.a. semi-global or fitting). Alignment mode where the pattern is globally aligned and the text is locally aligned. Often due to the large size of one of the sequences (e.g., the text sequence being a genomic reference), this alignment mode forces one sequence (i.e., pattern) to align globally to a substring of the other (i.e., text).
    PATTERN    -------------AATTTAAGTCTAGGCTACTTTC---------------
                            ||||||||| ||||||||||||
    TEXT       ACGACTACTACGAAATTTAAGTATAGGCTACTTTCCGTACGTACGTACGT
    wavefront_aligner_attr_t attributes = wavefront_aligner_attr_default;
    attributes.alignment_form.span = alignment_endsfree;
    attributes.alignment_form.pattern_begin_free = 0;
    attributes.alignment_form.pattern_end_free = 0;
    attributes.alignment_form.text_begin_free = text_begin_free;
    attributes.alignment_form.text_end_free = text_end_free;

Extension alignment

  • Extension alignment. Alignment mode where the start of both pattern and text sequences are forced to be aligned. However, the ends of both are free. This alignment mode is typically used within seed-and-extend algorithms.
    // Right extension
    wavefront_aligner_attr_t attributes = wavefront_aligner_attr_default;
    attributes.alignment_form.span = alignment_endsfree;
    attributes.alignment_form.pattern_begin_free = 0;
    attributes.alignment_form.pattern_end_free = pattern_end_free;
    attributes.alignment_form.text_begin_free = 0;
    attributes.alignment_form.text_end_free = text_end_free;

    PATTERN    AATTTAAGTCTG-CTACTTTCACGCA-GCT----------
               ||||| |||||| ||||||||||| | | |
    TEXT       AATTTCAGTCTGGCTACTTTCACGTACGATGACAGACTCT
    // Left extension
    wavefront_aligner_attr_t attributes = wavefront_aligner_attr_default;
    attributes.alignment_form.span = alignment_endsfree;
    attributes.alignment_form.pattern_begin_free = pattern_begin_free;
    attributes.alignment_form.pattern_end_free = 0;
    attributes.alignment_form.text_begin_free = text_begin_free;
    attributes.alignment_form.text_end_free = 0;

    PATTERN    -------------AAACTTTCACGTACG-TGACAGTCTCT
                              ||||||||||||| |||||| ||||
    TEXT       AATTTCAGTCTGGCTACTTTCACGTACGATGACAGACTCT

Overlapped alignment

  • Overlapped alignment (a.k.a. dovetail).
    // Overlapped (Right-Left)
    wavefront_aligner_attr_t attributes = wavefront_aligner_attr_default;
    attributes.alignment_form.span = alignment_endsfree;
    attributes.alignment_form.pattern_begin_free = pattern_begin_free;
    attributes.alignment_form.pattern_end_free = 0;
    attributes.alignment_form.text_begin_free = 0;
    attributes.alignment_form.text_end_free = text_end_free;

    PATTERN    ACGCGTCTGACTGACTGACTAAACTTTCATGTAC-TGACA-----------------
                                   ||||||||| |||| |||||
    TEXT       --------------------AAACTTTCACGTACGTGACATATAGCGATCGATGACT
    // Overlapped (Left-Right)
    wavefront_aligner_attr_t attributes = wavefront_aligner_attr_default;
    attributes.alignment_form.span = alignment_endsfree;
    attributes.alignment_form.pattern_begin_free = 0;
    attributes.alignment_form.pattern_end_free = pattern_end_free;
    attributes.alignment_form.text_begin_free = text_begin_free;
    attributes.alignment_form.text_end_free = 0;

    PATTERN    ----------------------ACGCGTCTGACTGACTACGACTACGACTGACTAGCAT
                                     ||||||||| || ||
    TEXT       ACATGCATCGATCAGACTGACTACGCGTCTG-CTAAC----------------------

3.4 Memory modes

The WFA2 library implements various memory modes: wavefront_memory_high, wavefront_memory_med, wavefront_memory_low, and wavefront_memory_ultralow. These modes allow regulating the overall memory consumption. The standard WFA algorithm, which stores explicitly all wavefronts in memory, correspond to the mode wavefront_memory_high. Memory modes wavefront_memory_med and wavefront_memory_low progressively reduce memory usage at the expense of slightly larger alignment times. Memory mode wavefront_memory_ultralow utilizes the BiWFA algorithm using a minimal memory footprint of O(s) and the same O(ns+s^2) time complexity as the original WFA. In practice, wavefront_memory_ultralow can outperform wavefront_memory_high because the latter experiences memory slowdowns when aligning long and noisy sequences.

Memory modes can be used transparently with other alignment options and generate identical results. Note that this option does not affect the score-only alignment mode, which always uses a minimal memory footprint of O(s)).

  wavefront_aligner_attr_t attributes = wavefront_aligner_attr_default;
  attributes.memory_mode = wavefront_memory_ultralow;

3.5 Heuristic modes

The WFA algorithm can be used combined with many heuristics to reduce the alignment time and memory used. As it happens to other alignment methods, heuristics can result in suboptimal solutions and loss of accuracy. Moreover, some heuristics may drop the alignment if the sequences exceed certain divergence thresholds (i.e., x-drop/z-drop). Due to the popularity and efficiency of these methods, the WFA2 library implements many of these heuristics. Note, it is not about how little DP-matrix you compute, but about how good/accurate the resulting alignments are.

WFA2's heuristics are classified into the following categories: 'wf-adaptive', 'drops', and 'bands'. It is possible to combine a maximum of one heuristic from each category (OR-ing the strategy values or using the API). In the case of using multiple heuristics, these will be applied in cascade, starting with 'wf-adaptive', then 'drops', and finally 'bands'.

  • None (for comparison). If no heuristic is used, the WFA behaves exploring cells of the DP-matrix in increasing score order (increasing scores correspond to colours from blue to red).

Full-WFA

  wavefront_aligner_attr_t attributes = wavefront_aligner_attr_default;
  attributes.heuristic.strategy = wf_heuristic_none;
  • Heuristic wf-adaptive. This WFA heuristic removes outer diagonals that are extremely far behind compared to other ones in the same wavefront. Unlike other methods, the adaptive-wavefront reduction heuristic prunes based on the potential of the diagonal to lead to the optimal solution without previous knowledge of the error between the sequences.
  wavefront_aligner_attr_t attributes = wavefront_aligner_attr_default;
  attributes.heuristic.strategy = wf_heuristic_wfadaptive;
  attributes.heuristic.min_wavefront_length = 10;
  attributes.heuristic.max_distance_threshold = 50;
  attributes.heuristic.steps_between_cutoffs = 1;

        Graphical examples:

Adaptive-WF(10,50)

Adaptive-WF(10,50,10)

  • Heuristic drops. This heuristic compares the maximum score computed so far with the score of the last computed cells. Depending on the score difference, these heuristic strategies can reduce the size of the wavefront computed or even abandon the alignment process. In the case of zero-match alignment, $M=1$ will be assumed just for computation of the score drop. Also note that this heuristic is not compatible with distances 'edit' or 'indel'. In this category, WFA2 implements 'X-drop' and 'Z-drop'.

        X-drop implements the classical X-drop heuristic. For each diagonal $k$, the X-drop heuristic compares the current score $sw_k$ with the maximum observed score so far $sw_{max}$. If the difference drops more than the $xdrop$ parameter (i.e., $sw_{max} - sw_k &gt; xdrop$), the heuristic prunes the diagonal $k$ as it is unlikely to lead to the optimum alignment. If all the diagonals are pruned under this criteria, the alignment process is abandoned.

  wavefront_aligner_attr_t attributes = wavefront_aligner_attr_default;
  attributes.heuristic.strategy = wf_heuristic_xdrop;
  attributes.heuristic.xdrop = 100;
  attributes.heuristic.steps_between_cutoffs = 100;

        Z-drop implements the Z-drop heuristic (as described in Minimap2). This heuristic halts the alignment process if the score drops too fast in the diagonal direction. Let $sw_{max}$ be the maximum observed score so far, computed at cell $(i',j')$. Then, let $sw$ be the maximum score found in the last computed wavefront, computed at cell $(i,j)$. The Z-drop heuristic stops the alignment process if $sw_{max} - sw &gt; zdrop + gap_e·|(i-i')-(j-j')|$, being $gap_e$ the gap-extension penalty and $zdrop$ a parameter of the heuristic.

  wavefront_aligner_attr_t attributes = wavefront_aligner_attr_default;
  attributes.heuristic.strategy = wf_heuristic_zdrop;
  attributes.heuristic.zdrop = 100;
  attributes.heuristic.steps_between_cutoffs = 100;

        Graphical examples:

None

X-drop(200,1)

Z-drop(200,1)

  • Heuristic bands. These heuristics set a band in the diagonals preventing the wavefront from growing beyond those limits. It allows setting the minimum diagonal (i.e., min_k) and maximum diagonal (i.e., max_k). These heuristics are the most restrictive but the fastest and simplest to compute. In this category, WFA2 implements 'static-band' and 'adaptive-band'.

        Static-band sets a fixed band in the diagonals preventing the wavefront from growing beyond those limits. It allows setting the minimum diagonal (i.e., min_k) and maximum diagonal (i.e., max_k).

  wavefront_aligner_attr_t attributes = wavefront_aligner_attr_default;
  attributes.heuristic.strategy = wf_heuristic_banded_static;
  attributes.heuristic.min_k = -10;
  attributes.heuristic.max_k = +10;

        Adaptive-band is similar to the static-band heuristic; however, it allows the band to move towards the diagonals closer to the end of the alignment. Unlike the static-band that is performed on each step, the adaptive-band heuristics allows configuring the number of steps between heuristic band cut-offs.

  wavefront_aligner_attr_t attributes = wavefront_aligner_attr_default;
  attributes.heuristic.strategy = wf_heuristic_banded_adaptive;
  attributes.heuristic.min_k = -10;
  attributes.heuristic.max_k = +10;
  attributes.heuristic.steps_between_cutoffs = 1;

        Graphical examples:

Banded(10,10)

Banded(10,150)

Adaptive-Band(10,10,1)

Adaptive-Band(50,50,1)

3.6 Some technical notes

  • Thanks to Eizenga's formulation, WFA2-lib can operate with any match score. In practice, M=0 is often the most efficient choice.

  • Note that edit and LCS are distance metrics and, thus, the score computed is always positive. However, using weighted distances (e.g., gap-linear and gap-affine) the alignment score is computed using the selected penalties (i.e., the alignment score can be positive or negative). For instance, if WFA2-lib is executed using $M=0$, the final score is expected to be negative.

  • All WFA2-lib algorithms/variants are stable. That is, for alignments with the same score, all alignment modes always resolve ties (between M, X, I,and D) using the same criteria: M (highest prio) > X > D > I (lowest prio). Only the memory mode ultralow (BiWFA) resolves ties differently (although the results are still optimal).

  • WFA2lib follows the convention that describes how to transform the (1) Pattern/Query into the (2) Text/Database/Reference used in classic pattern matching papers. However, the SAM CIGAR specification describes the transformation from (2) Reference to (1) Query. If you want CIGAR-compliant alignments, swap the pattern and text sequences argument when calling the WFA2lib's align functions (to convert all the Ds into Is and vice-versa).

4. REPORTING BUGS AND FEATURE REQUEST

Feedback and bug reporting is highly appreciated. Please report any issue or suggestion on github or email to the main developer ([email protected]). Don't hesitate to contact us if:

  • You experience any bug or crash.
  • You want to request a feature or have any suggestion.
  • Your application using the library is running slower than it should or you expected.
  • Need help integrating the library into your tool.

5. LICENSE

WFA2-lib is distributed under MIT licence.

6. AUTHORS

Santiago Marco-Sola ([email protected]) is the main developer and the person you should address your complaints.

Andrea Guarracino and Erik Garrison have contributed to the design of new features and intensive testing of the library.

Pjotr Prins contributed the CMake build system, preventing of leaking variables in include headers and other tweaks.

Miquel Moretó has contributed with fruitful technical discussions and tireless efforts seeking funding, so we could keep working on this project.

7. ACKNOWLEDGEMENTS

  • Baoxing Song and Buckler's lab for their interest and help promoting the WFA and pushing for the inclusion of new features.

  • Juan Carlos Moure and Antonio Espinosa for their collaboration and support of this project.

8. CITATION

Santiago Marco-Sola, Juan Carlos Moure, Miquel Moreto, Antonio Espinosa. "Fast gap-affine pairwise alignment using the wavefront algorithm.". Bioinformatics, 2020.

Santiago Marco-Sola, Jordan M Eizenga, Andrea Guarracino, Benedict Paten, Erik Garrison, Miquel Moreto. "Optimal gap-affine alignment in O(s) space". Bioinformatics, 2023.

wfa2-lib's People

Contributors

8051enthusiast avatar adamnovak avatar andreaguarracino avatar brendanf avatar ctsa avatar kcleal avatar marcelm avatar mencian avatar millak avatar pettyalex avatar pjotrp avatar quim0 avatar ragnargrootkoerkamp avatar rlorigro avatar smarco avatar tbrekalo avatar tshauck avatar twrightsman avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

wfa2-lib's Issues

EXC_BAD_ACCESS

I'm running into an issue that I can't produce if I just give it one sequence pair...

wfa::WFAlignerGapAffine2Pieces aligner(4, 4, 2, 24, 1, wfa::WFAligner::Alignment, wfa::WFAligner::MemoryHigh);
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=EXC_I386_GPFLT)
    frame #0: 0x000000010026c28b libwfa.2.1.0.dylib`wavefronts_backtrace_del2_ext(wf_aligner=0x0000000128808010, score=22455, k=-34) at wavefront_backtrace.c:184:20
   181 	  if (score < 0) return WAVEFRONT_OFFSET_NULL;
   182 	  wavefront_t* const d2wavefront = wf_aligner->wf_components.d2wavefronts[score];
   183 	  if (d2wavefront != NULL &&
-> 184 	      d2wavefront->lo <= k+1 &&
   185 	      k+1 <= d2wavefront->hi) {
   186 	    return BACKTRACE_PIGGYBACK_SET(d2wavefront->offsets[k+1],backtrace_D2_ext);
   187 	  } else {

ASAN/UBSAN gives something else...

../subprojects/wfa/wavefront/wavefront_extend.c:116:20: runtime error: load of misaligned address 0x00010ce54c33 for type 'uint64_t', which requires 8 byte alignment
0x00010ce54c33: note: pointer points here
 3f  3f 3f 3f 54 47 43 43 54  47 54 43 41 47 47 47 54  43 43 54 47 54 54 47 47  41 41 47 47 47 43 54
              ^
../subprojects/wfa/wavefront/wavefront_extend.c:116:38: runtime error: load of misaligned address 0x00010ce5784a for type 'uint64_t', which requires 8 byte alignment
0x00010ce5784a: note: pointer points here
 21 21  21 21 54 54 47 43 43 54  47 54 43 41 47 47 47 54  43 43 54 47 54 47 47 41  41 47 47 47 43 41
              ^
../subprojects/wfa/wavefront/wavefront_extend.c:124:13: runtime error: load of misaligned address 0x00010d82ff8a for type 'uint64_t', which requires 8 byte alignment
0x00010d82ff8a: note: pointer points here
 47 54  43 41 47 47 47 54 43 43  54 47 54 47 47 41 41 47  47 47 43 54 47 54 41 41  54 41 47 41 47 47
              ^
../subprojects/wfa/wavefront/wavefront_extend.c:124:31: runtime error: load of misaligned address 0x00010d8305e4 for type 'uint64_t', which requires 8 byte alignment
0x00010d8305e4: note: pointer points here
  47 54 43 41 47 47 47 54  43 43 54 47 54 47 47 41  41 47 47 47 43 41 54 54  54 43 41 54 41 47 47 47

You can try to reproduce with https://github.com/armintoepfer/clr-align-challenge and then

lldb -- ./cas ../data/long.txt

Broken build with OMP enabled

WFA2-lib is not building for me with cmake version 3.25.2 when OpenMP is enabled.

It looks like #64 illegally used a mix of plain and keyword-based CMake syntax for linking libraries: https://stackoverflow.com/questions/59522267/cmake-rejects-a-second-target-link-libraries-talking-about-keyword-vs-plain

Reproduction:

Build with OpenMP:

mkdir build &&  cd build
cmake -DOPENMP=ON ..
cmake --build .

Fail:

CMake Error at CMakeLists.txt:176 (target_link_libraries):
  The keyword signature for target_link_libraries has already been used with
  the target "wfa2cpp_static".  All uses of target_link_libraries with a
  target must be either all-keyword or all-plain.

  The uses of the keyword signature are here:

   * CMakeLists.txt:171 (target_link_libraries)



CMake Error at CMakeLists.txt:177 (target_link_libraries):
  The keyword signature for target_link_libraries has already been used with
  the target "wfa2cpp".  All uses of target_link_libraries with a target must
  be either all-keyword or all-plain.

  The uses of the keyword signature are here:

   * CMakeLists.txt:170 (target_link_libraries)

Solutions: Specify the scope of how OpenMP is linked to WFA2-lib. I don't see any reason to publicly expose this, so PRIVATE seems like the default option.

Segmentation fault (core dumped)

Hi,

I used the following code to align several pairs of sequences and got segfault on one pair of sequences. The sequences are attached. Please help. Thanks!

  // Configure alignment attributes
  wavefront_aligner_attr_t attributes = wavefront_aligner_attr_default;
  attributes.distance_metric = edit;
  attributes.alignment_scope = compute_score;
  attributes.alignment_form.span = alignment_endsfree;
  attributes.alignment_form.pattern_begin_free = 0;
  attributes.alignment_form.pattern_end_free = 0;
  attributes.alignment_form.text_begin_free = 0;
  attributes.alignment_form.text_end_free = (text_length < pattern_length ? ext_length : pattern_length) / 2;
  // Initialize Wavefront Aligner
  wavefront_aligner_t* const wf_aligner = wavefront_aligner_new(&attributes);
  // Align
  wavefront_align(wf_aligner,pattern,strlen(pattern),text,strlen(text));
  fprintf(stderr,"WFA-Alignment returns score %d\n",wf_aligner->cigar.score);
  // Free
  wavefront_aligner_delete(wf_aligner);

sequences.zip

Problem getting rust bindings to build

Hi,

I'm trying to use the rust bindings in a rust project, but fail to build it. I think I may have to tag @RagnarGrootKoerkamp here?

It is the very first time that I do something like this, so I don't really know where to start debugging this. I am developing in WSL.

So I cloned the repository and ran

git clone https://github.com/smarco/WFA2-lib
cd WFA2-lib
mkdir build
cmake .. -DCMAKE_BUILD_TYPE=Release
cmake --build . --verbose
ctest . --verbose

make
make lib_wfa

I've added bindgen to the build dependencies and the build.rs file in the src directory of my rust project, adapting the paths from wfa2 to WFA2-lib to match the repository name.

Building seems to find the files, but then runs into the error below:

$ cargo build
   Compiling STRdust v0.1.0 (/home/wdecoster/wsl-repos/STRdust)
error: failed to run custom build command for `STRdust v0.1.0 (/home/wdecoster/wsl-repos/STRdust)`

Caused by:
  process didn't exit successfully: `/home/wdecoster/wsl-repos/STRdust/target/debug/build/STRdust-52f9cedab0223029/build-script-build` (exit status: 101)
  --- stdout
  cargo:rustc-link-search=../WFA2-lib/lib
  cargo:rustc-link-lib=wfa
  cargo:rustc-link-lib=omp
  cargo:rerun-if-changed=../WFA2-lib/lib/libwfa.a

  --- stderr
  ../WFA2-lib/utils/heatmap.h:86:5: error: unknown type name 'FILE'
  ../WFA2-lib/system/profiler_counter.h:40:3: error: unknown type name 'uint64_t'
  ../WFA2-lib/system/profiler_counter.h:41:3: error: unknown type name 'uint64_t'
  ../WFA2-lib/system/profiler_counter.h:42:3: error: unknown type name 'uint64_t'
  ../WFA2-lib/system/profiler_counter.h:43:3: error: unknown type name 'uint64_t'
  ../WFA2-lib/system/profiler_counter.h:54:11: error: unknown type name 'uint64_t'
  ../WFA2-lib/system/profiler_counter.h:56:1: error: unknown type name 'uint64_t'
  ../WFA2-lib/system/profiler_counter.h:57:1: error: unknown type name 'uint64_t'
  ../WFA2-lib/system/profiler_counter.h:58:1: error: unknown type name 'uint64_t'
  ../WFA2-lib/system/profiler_counter.h:59:1: error: unknown type name 'uint64_t'
  ../WFA2-lib/system/profiler_counter.h:69:5: error: unknown type name 'FILE'
  ../WFA2-lib/system/profiler_counter.h:73:11: error: unknown type name 'bool'
  ../WFA2-lib/system/profiler_counter.h:75:5: error: unknown type name 'FILE'
  ../WFA2-lib/system/profiler_counter.h:83:3: error: unknown type name 'uint64_t'
  ../WFA2-lib/system/profiler_counter.h:85:3: error: unknown type name 'uint64_t'
  ../WFA2-lib/system/profiler_counter.h:90:11: error: unknown type name 'uint64_t'
  ../WFA2-lib/system/profiler_counter.h:93:11: error: unknown type name 'uint64_t'
  ../WFA2-lib/system/profiler_counter.h:96:11: error: unknown type name 'uint64_t'
  ../WFA2-lib/system/profiler_counter.h:99:11: error: unknown type name 'uint64_t'
  fatal error: too many errors emitted, stopping now [-ferror-limit=]
  thread 'main' panicked at 'Unable to generate bindings: ClangDiagnostic("../WFA2-lib/utils/heatmap.h:86:5: error: unknown type name 'FILE'\n../WFA2-lib/system/profiler_counter.h:40:3: error: unknown type name 'uint64_t'\n../WFA2-lib/system/profiler_counter.h:41:3: error: unknown type name 'uint64_t'\n../WFA2-lib/system/profiler_counter.h:42:3: error: unknown type name 'uint64_t'\n../WFA2-lib/system/profiler_counter.h:43:3: error: unknown type name 'uint64_t'\n../WFA2-lib/system/profiler_counter.h:54:11: error: unknown type name 'uint64_t'\n../WFA2-lib/system/profiler_counter.h:56:1: error: unknown type name 'uint64_t'\n../WFA2-lib/system/profiler_counter.h:57:1: error: unknown type name 'uint64_t'\n../WFA2-lib/system/profiler_counter.h:58:1: error: unknown type name 'uint64_t'\n../WFA2-lib/system/profiler_counter.h:59:1: error: unknown type name 'uint64_t'\n../WFA2-lib/system/profiler_counter.h:69:5: error: unknown type name 'FILE'\n../WFA2-lib/system/profiler_counter.h:73:11: error: unknown type name 'bool'\n../WFA2-lib/system/profiler_counter.h:75:5: error: unknown type name 'FILE'\n../WFA2-lib/system/profiler_counter.h:83:3: error: unknown type name 'uint64_t'\n../WFA2-lib/system/profiler_counter.h:85:3: error: unknown type name 'uint64_t'\n../WFA2-lib/system/profiler_counter.h:90:11: error: unknown type name 'uint64_t'\n../WFA2-lib/system/profiler_counter.h:93:11: error: unknown type name 'uint64_t'\n../WFA2-lib/system/profiler_counter.h:96:11: error: unknown type name 'uint64_t'\n../WFA2-lib/system/profiler_counter.h:99:11: error: unknown type name 'uint64_t'\nfatal error: too many errors emitted, stopping now [-ferror-limit=]\n")', src/build.rs:54:10
  note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

What can I do to debug the issue here? Or is there an obvious thing I did wrong?

Thanks!

Diffuculty building correctly

Hi,

I am trying to build WFA2-lib and use it without any bindings directly as a C library. I am on ubuntu 20.04 (cmake version 3.16.3, gcc version 9.4.0, make version 4.2.1). I think I have managed to build it correctly but am having difficulty calling the library in a .c file. The way I built it is as follows:

git clone https://github.com/smarco/WFA2-lib
cd WFA2-lib
mkdir build
cd build
cmake .. -DCMAKE_BUILD_TYPE=Release
cmake --build . --verbose
ctest . --verbose

make
make install

This runs without any errors or problems. I then tried the first example in the main page but when I tried to compile the code
using

gcc -O3 wfa_example.c -o wfa_example -lwfa

I get the error

In file included from wavefront/wavefront_align.h:34,
                 from wfa_example.c:2:
wavefront/wavefront_aligner.h:34:10: fatal error: utils/heatmap.h: No such file or directory
   34 | #include "utils/heatmap.h"
      |          ^~~~~~~~~~~~~~~~~
compilation terminated

The same happens with the minimal example of simply having a code which includes the following:

#include <stdio.h>
#include "wavefront_align.h"

If I try to remove the second line (including the wavefront_align) and compile the code using the same command above I get

/usr/bin/ld: cannot find -lwfa

I have added WFA2-lib/bin and WFA2-lib/lib in the PATH variable but I still get the same error. I am kind of a novice when it comes to building C libraries from source and using them so I am assuming that I have not set some paths correctly, have not used the headers correctly or compiled the example code correctly. Please let me know how I could solve this issue.

Thanks

Dual-cost gap-affine distances with gap opening penalty = 0

Hi,

I think there is a little bug in the Dual-cost gap-affine distance aligner initialization in WFA-lib. When declaring wfa::WFAlignerGapAffine2Pieces aligner (5, 0, 5, 5, 1, wfa::WFAligner::Alignment, wfa::WFAligner::MemoryHigh);, WFA exits saying Penalties (X=5,O1=0,E1=5,O2=5,E2=1) must be (X>0,O1>=0,E1>0,O1>=0,E1>0), which seems incoherent.
I looked a little bit in the code, and there (wavefront/wavefront_penalties_set_affine2p.c) it indeed seems forbidden to have O1 = 0 or O2 = 0

Small typo in WFA paper

We're getting into very nitpicky things now, but I noticed this, and who knows, maybe you can still fix it for some future version 😄 :

  • Equation 3 in the WFA paper is missing some commas in the 2nd line, in the maximum for M^{lo}.
  • Edit: thinking more about it, maybe the max in that line should be min instead?
  • The pseudocode has $\widetilde M$ while the text has $\tilde M$.

Valgrind uninitialized value errors

I'm seeing a lot of valgrind 'Conditional jump or move depends on uninitialised value' errors coming from libwfa2 on a relatively simple example program, which I noticed while trying to trace down some instability in a larger program using wfa2.

Here is the small example code:

test.c

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <time.h>
#include "wavefront/wavefront_align.h"

int main() {
    char *pattern = "A";
    char *text    = "ABC";

    wavefront_aligner_attr_t attr = wavefront_aligner_attr_default;
    attr.distance_metric = gap_linear;
    attr.linear_penalties.match = -1;
    attr.linear_penalties.mismatch = 1;
    attr.linear_penalties.indel = 1;
    attr.alignment_scope = compute_alignment;

    attr.alignment_form.span = alignment_endsfree;
    attr.alignment_form.pattern_begin_free = 0;
    attr.alignment_form.pattern_end_free = 0;
    attr.alignment_form.text_begin_free = 0;
    attr.alignment_form.text_end_free = 2;

    wavefront_aligner_t* const wf_aligner = wavefront_aligner_new(&attr);
    wavefront_align(wf_aligner, pattern, strlen(pattern), text, strlen(text));
    cigar_print_pretty(stderr,
      wf_aligner->cigar,pattern,strlen(pattern),text,strlen(text));

    fprintf(stderr,"Alignment Score %d\n",wf_aligner->cigar->score);

    wavefront_aligner_delete(wf_aligner);
}

I cloned today's main branch (931181d), and compiled it in DEBUG mode, then compiled test.c and ran it with valgrind as shown below, I preview a few of the errors, but there were something like 30 of these "Conditional jump or move depends on uninitialised value" errors:

$ gcc -g -I../ test.c ../build/libwfa2.a -lm
$ valgrind ./a.out
==6548== Memcheck, a memory error detector
==6548== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==6548== Using Valgrind-3.15.0 and LibVEX; rerun with -h for copyright info
==6548== Command: ./a.out
==6548==
==6548== Conditional jump or move depends on uninitialised value(s)
==6548==    at 0x4124B5: wavefront_extend_matches_packed_kernel (wavefront_extend_kernels.c:47)
==6548==    by 0x4124B5: wavefront_extend_matches_packed_endsfree (wavefront_extend_kernels.c:119)
==6548==    by 0x40CE7F: wavefront_extend_endsfree_dispatcher_seq (wavefront_extend.c:229)
==6548==    by 0x40CF04: wavefront_extend_endsfree_dispatcher_threads (wavefront_extend.c:246)
==6548==    by 0x40CFC7: wavefront_extend_endsfree (wavefront_extend.c:282)
==6548==    by 0x411996: wavefront_unialign (wavefront_unialign.c:251)
==6548==    by 0x401643: wavefront_align_unidirectional (wavefront_align.c:132)
==6548==    by 0x401930: wavefront_align (wavefront_align.c:228)
==6548==    by 0x4012A0: main (test.c:26)
==6548==
==6548== Conditional jump or move depends on uninitialised value(s)
==6548==    at 0x4120A6: wavefront_termination_endsfree (wavefront_termination.c:128)
==6548==    by 0x412513: wavefront_extend_matches_packed_endsfree (wavefront_extend_kernels.c:122)
==6548==    by 0x40CE7F: wavefront_extend_endsfree_dispatcher_seq (wavefront_extend.c:229)
==6548==    by 0x40CF04: wavefront_extend_endsfree_dispatcher_threads (wavefront_extend.c:246)
==6548==    by 0x40CFC7: wavefront_extend_endsfree (wavefront_extend.c:282)
==6548==    by 0x411996: wavefront_unialign (wavefront_unialign.c:251)
==6548==    by 0x401643: wavefront_align_unidirectional (wavefront_align.c:132)
==6548==    by 0x401930: wavefront_align (wavefront_align.c:228)
==6548==    by 0x4012A0: main (test.c:26)

Is there a way that this uninitialized state could be addressed?

Safe to change parameters between uses of aligner object?

The README suggests that it is best to reuse the aligner object in order to avoid extra allocations and computations. Is it also safe to change parameters (specifically, maximum score and static band width) after the object has been used to create an alignment?

I am guessing the answer may be no, because for some data sets I am getting a segfault when I finally free the aligner object; specifically in wavefront_slab_delete.

Consider following semantic versioning

I've started to write Python bindings for WFA2-lib and got bitten by a few API-related things. I started out trying pyWFA but they were using an older (v2.2?), vendored version of WFA2-lib. It seemed that the API had changed quite a bit from v2.2 to v2.3.3, so I figured I'd try writing bindings to v2.3.3, which is packaged in bioconda and also is the only stable version released, if I understand correctly.

I had to write a small patch (see #83) to get the bindings to build against v2.3.3 properly, but also looked in to updating the bindings to work with the latest main changes, since it seems a lot of fixes and features have made it in since v2.3.3. However, I noticed the API changed again.

In the interest of reducing the burden on users of this really cool library, could you consider following a versioning scheme like SemVer, where the public API doesn't break without a major version bump?

[questions] benchmarks; scores VS penalties

Hi,

this looks like a great project! I have two questions:

  1. Do you have benchmarks against libraries that implement NW? Like parasail and seqan? You say that auto-vectorisation of the library works well, do you have benchmarks for this? Do you assume that this is comparable to the speed-up that libraries with explicit vectorisation (inter-sequence batch vectorisation or anti-diagonals) get?
  2. You are currently only offering penalties (negative scores) and the match-"score" is fixed to 0. Is this a limitation of the algorithm, or would non-negative scores work as well? I think that would be a requirement to get Protein scoring matrixes to work?

Thanks!

VERSION file breaks vcflib build on Mac

I'm trying to build vcflib for Mac. vcflib now includes this library as a dependency, and if it isn't installed it points the build process at a vendored copy in a submodule.

It adds the repository root of that submodule to the compiler's include search path, so you can #include <wavefront/wavefront_aligner.h> or whatever.

However, the repository root contains a file named VERSION, and on my Mac at least, line 37 of the C++ standard library header /Library/Developer/CommandLineTools/SDKs/MacOSX11.3.sdk/usr/include/c++/v1/cstddef is:

#include <version>

Because Mac is case-insensitive, the WFA2-lib VERSION file ends up getting included by the standard library, instead of the standard library's version header file.

Please rename the VERSION file to something else (maybe VERSION.txt), or move the wavefront directory into another level of subdirectory (maybe include or src), so that vcflib can include from it without having this happen.

Alternately, vcflib could change its build process to install WFA2-lib into a directory somewhere and then build with it from there.

Score is `-INF` when aligning two empty sequences

When using BiWFA/ultralow memory mode and indel costs, aligning two empty sequences seems to return -2147483648 as the cost. This only happens when calling the library directly (via rust), and not when running align_benchmark as

align_benchmark -a indel-wfa --input input  -o output --wfa-memory-mode ultralow

where input simply contains

>
<

I.e., this could very well be a bug on my side in the rust-c interface.

SIMD for extending

In the current code you (and @lh3 also) compare 8chars / one 64bit int at a time. Have you tried using SIMD for this?

In particular, the operation is very similar to what absl/flat_hash_map does: do a compare of 16chars, convert the result to a 16bit mask, and count leading zeros.

16 vs 8 may not actually be such a big difference and SIMD of course makes things more complicated so it's hard to say whether it actually does much in the end.

Latency is also an important factor, since most of the time you call this off the main matching diagonal, where you don't expect more than 8 matching characters anyway. I don't know whether simd has higher latency than the current code.

Meet in the middle

For Dijkstra, a common technique to reduce explored states is to use meet in the middle, where you search both from the start and end and stop as you expand a state from both ends.

WFA is basically just a more efficient implementation of Dijkstra, since it also expands in order of increasing g (distance from start). Thus, meet in the middle should just work here.

All you'd need to do is run is from both sides to the middle of the string (or alternate and run both sides to equal score). It shouldn't be hard to detect when the two searches overlap. Probably a check that the M from the start and M from the end on the same diagonal add to more than the sequence length is sufficient.

The benefit here is that now the runtime is 2*(s/2)^2 = s^2/2, so around twice as fast as the original,. assuming the implementation doesn't slow down too much.

(For non-global alignment this may be more tricky. Not sure yet.)

I don't know if it's worth the effort right now and how much a 2x speedup matters compared to other optimizations you may be thinking of, but anyway at some point this could be nice.

Some questions on the readme

Hi! Great new readme with super nice visualizations, sweet! We have some similar ones for our own paper, but yours are nicer ;)

Some questions that I have:

  • Could you clarify how exactly WFA2-lib differs from WFA? (Maybe the features list could include what is/is not in WFA.) The introduction lists what is supported by WFA2, but doesn't make it super clear that these were not in WFA1.
  • Re. A note for the fierce competitors: I totally understand and agree with this. Would you mind sharing the way in which you would like your code to be benchmarked, and how not to do it?

I've been using

bin/align_benchmark -i {input} -a gap-affine-wfa --affine-penalties="0,1,0,1"

and just use snakemake to time the command on 10MB input files. Do you think this is accurate? It does include the file IO overhead, but we also include that for our own code. (We only run for sequences >=10000 currently, so file IO is not the bottleneck anyway.) We only compare exact global alignment with edit-distance costs, so I don't think there is any parameter tuning needed on the WFA side right?

I see you also ported the align_benchmark tool to WFA2. Is this ready to replace the WFA1 benchmark already and should I switch over?

Other Landau-Vishkin implementations

Are you aware of any other competitive / comparable in speed implementation of the basic Landau-Vishkin algorithm that WFA extends?

I tried looking for one, but couldn't find any, and since you also don't compare to them, this may not exist?

Tag versionin

Hi! Would you mind creating a new release as of merging: #48 into master for safer CMake FetchContent integration that doesn't depend on master or commit hash?

Thanks! Cheers. :)

How to control the primary orientation of the alignment?

I wonder if there is a term in alignment tools to describe the difference as example below. WFA2 show different alignment pattern with other Smith-Waterman alignment tools, such as CSSW. There is two possible alignment output for this query read, and both of them have identical alignment score. CSSW favor opening gap on the leftmost site, and report result as --GT. But WFA2 report result as GT--.
I would like to know if there is a parameter in this library to control this, and make WFA2 consistent with other tools?

CSSW:
ref      TAGTCTGGCACGGTGAAGAGACATGAGAGGTGTAGAATAAGTGGGAGGCCCCCGGCGCCCGGCCCCGTC
         |||||||||||||||||||||||||||||  ||||||||||||||||||||||||||||||||||||||
query    TAGTCTGGCACGGTGAAGAGACATGAGAG--GTAGAATAAGTGGGAGGCCCCCGGCGCCCGGCCCCGTC

WFA2:
ref      TAGTCTGGCACGGTGAAGAGACATGAGAGGTGTAGAATAAGTGGGAGGCCCCCGGCGCCCGGCCCCGTC
         |||||||||||||||||||||||||||||||  ||||||||||||||||||||||||||||||||||||
query    TAGTCTGGCACGGTGAAGAGACATGAGAGGT--AGAATAAGTGGGAGGCCCCCGGCGCCCGGCCCCGTC

Possible bug in ends-free score calculation when match < 0

Thanks for the great library. I've observed an issue with the alignment score in some endsfree alignments when I set a match score. I've been able to boil this down to a relatively small test case as follows:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <time.h>
#include "wavefront/wavefront_align.h"

int main() {
    char *pattern = "A";
    char *text    = "ABC";

    wavefront_aligner_attr_t attr = wavefront_aligner_attr_default;
    attr.distance_metric = gap_linear;
    attr.linear_penalties.match = -1;
    attr.linear_penalties.mismatch = 1;
    attr.linear_penalties.indel = 1;
    attr.alignment_scope = compute_alignment;

    attr.alignment_form.span = alignment_endsfree;
    attr.alignment_form.pattern_begin_free = 0;
    attr.alignment_form.pattern_end_free = 0;
    attr.alignment_form.text_begin_free = 0;
    attr.alignment_form.text_end_free = 2;

    wavefront_aligner_t* const wf_aligner = wavefront_aligner_new(&attr);
    wavefront_align(wf_aligner, pattern, strlen(pattern), text, strlen(text));
    cigar_print_pretty(stderr,
      pattern,strlen(pattern),text,strlen(text),
      wf_aligner->cigar,wf_aligner->mm_allocator);

    fprintf(stderr,"Alignment Score %d\n",wf_aligner->cigar->score);

    wavefront_aligner_delete(wf_aligner);
}

On the current tip of main (94bcccd), this program produces:

$  gcc -I../WFA2-lib ../WFA2-lib/build/libwfa2.a test1.c
$ a.out
      ALIGNMENT 1M2I
      ALIGNMENT.COMPACT 2I
      PATTERN    A--
                 |  
      TEXT       ABC
Alignment Score 2

...with the incorrect score of 2 instead of 1. This seems to be a trailing edge issue, because I can't reproduce the problem with the equivalent 'text_begin_free' setting.

Retrieving the extent of an alignment

Hi, there was a question by @brentp in issue #18 about how to retrieve the extent of an alignment:

Also, is there an efficient way for this to get the start, end of the alignment in the pattern and text? I don't need the full cigar or backtace, only the extents of the alignment.
Ah. You mean, score-only alignment-extension, but returning the position where it stopped. Interesting idea. You can always access the position where the alignment stopped the extension HERE. I can provide a friendly API to retrieve this info (if useful).

The provided link seems to be out of date. Could you perhaps point out again which field to access?

Also, I am currently using the C++ WFAligner class and would be happy if there were a friendly API for getting this information. But this is not urgent as I can just to using the C interface.

Note that in my case, I would be interested in this information even when running the alignment with backtrace. I can get it by parsing the CIGAR, but it feels unnecessary to do so when – I assume – accessing an attribute would be all that is needed.

(Note regarding the out-of-date link: If you view a source code file on GitHub, press the y key to switch to a URL that includes the commit hash. When you use that as the link, it will continue to work even when the file changes.)

Which CIGAR is returned when multiple alignments have equal scores?

It is generally the case that there may be multiple alignments with the same score, and of course all of these are equally correct. However, my main purpose of alignment is to calculate the pairwise identity score as used in, e.g., BLAST, which is equivalent to $1 - \frac{\text{edit distance}}{\text{alignment length}}$. I am using end2end, i.e. fully global, alignment.

Actually maximizing this pairwise ID score would involve a variable gap penalty depending on the total number of gaps, which is not what I am interested in. However, it would be nice to, after calculating the edit distance, be able to calculate the pairwise identity using either the minimum or maximum alignment length corresponding to that edit distance. I have noticed in my testing that WFA2 usually, but not always, gives shorter alignments than edlib; this suggests that both are using traceback algorithms which simply try to find one alignment, and do not worry about any limits on the length.

Please support big-endian architectures

Hi,

Would it be possible to consider supporting BIG_ENDIAN architectures such as S390X ?

Running tests for them at the moment results in failures due to the assumptions about the byte order:

Comparing ./tests vs ./tests/wfa.utest.check
[UTest::test.affine] Error
[UTest::test.affine.p0] Error
[UTest::test.affine.p1] Error
[UTest::test.affine.p2] Error
[UTest::test.affine.p3] Error
[UTest::test.affine.p4] Error
[UTest::test.affine.p5] Error
[UTest::test.affine.wfapt0] Error
[UTest::test.affine.wfapt1] Error
[UTest::test.affine2p] Error
[UTest::test.biwfa.affine] Error
[UTest::test.biwfa.affine.p0] Error
[UTest::test.biwfa.affine.p1] Error
[UTest::test.biwfa.affine.p2] Error
[UTest::test.biwfa.affine.p3] Error
[UTest::test.biwfa.affine.p4] Error
[UTest::test.biwfa.affine.p5] Error
[UTest::test.biwfa.affine.wfapt0] Error
[UTest::test.biwfa.affine.wfapt1] Error
[UTest::test.biwfa.affine2p] Error
[UTest::test.biwfa.edit] Error
[UTest::test.biwfa.indel] Error
[UTest::test.biwfa.score.affine] Error
[UTest::test.biwfa.score.affine.p0] Error
[UTest::test.biwfa.score.affine.p1] Error
[UTest::test.biwfa.score.affine.p2] Error
[UTest::test.biwfa.score.affine.p3] Error
[UTest::test.biwfa.score.affine.p4] Error
[UTest::test.biwfa.score.affine.p5] Error
[UTest::test.biwfa.score.affine.wfapt0] Error
[UTest::test.biwfa.score.affine.wfapt1] Error
[UTest::test.biwfa.score.affine2p] Error
[UTest::test.biwfa.score.edit] Error
[UTest::test.biwfa.score.indel] Error
[UTest::test.edit] Error
[UTest::test.indel] Error
[UTest::test.pb.affine] Error
[UTest::test.pb.affine.p0] Error
[UTest::test.pb.affine.p1] Error
[UTest::test.pb.affine.p2] Error
[UTest::test.pb.affine.p3] Error
[UTest::test.pb.affine.p4] Error
[UTest::test.pb.affine.p5] Error
[UTest::test.pb.affine.wfapt0] Error
[UTest::test.pb.affine.wfapt1] Error
[UTest::test.pb.affine2p] Error
[UTest::test.pb.edit] Error
[UTest::test.pb.indel] Error
[UTest::test.score.affine] Error
[UTest::test.score.affine.p0] Error
[UTest::test.score.affine.p1] Error
[UTest::test.score.affine.p2] Error
[UTest::test.score.affine.p3] Error
[UTest::test.score.affine.p4] Error
[UTest::test.score.affine.p5] Error
[UTest::test.score.affine.wfapt0] Error
[UTest::test.score.affine.wfapt1] Error
[UTest::test.score.affine2p] Error
[UTest::test.score.edit] Error
[UTest::test.score.indel] Error

Segmentation Fault with memory_mode != high

I get a segmentation fault on the current development branch with this code:

extern "C" {
#include <wavefront/wavefront_align.h>
}

#include <string>
#include <fstream>

void do_align(std::string_view const pattern, std::string_view const ref)
{
  // Configure alignment attributes
  wavefront_aligner_attr_t attributes = wavefront_aligner_attr_default;
  attributes.alignment_scope = compute_alignment;
  attributes.distance_metric = gap_affine;
  attributes.affine_penalties.mismatch = 3;
  attributes.affine_penalties.gap_opening = 5;
  attributes.affine_penalties.gap_extension = 1;
  attributes.alignment_form.span = alignment_endsfree;
  attributes.alignment_form.pattern_begin_free = 0;
  attributes.alignment_form.pattern_end_free = 0;
  attributes.alignment_form.text_begin_free = 1;
  attributes.alignment_form.text_end_free = 1;
  attributes.heuristic.strategy = wf_heuristic_wfadaptive;
  attributes.heuristic.min_wavefront_length = 10;
  attributes.heuristic.max_distance_threshold = 50;
  attributes.heuristic.steps_between_cutoffs = 1;
  attributes.memory_mode = wavefront_memory_high; // <---- change me to low or med

  wavefront_aligner_t * const wf_aligner = wavefront_aligner_new(&attributes);

  int res = wavefront_align(wf_aligner, pattern.data(), pattern.size(), ref.data(), ref.size());

  cigar_print_pretty(stderr, pattern.data(), pattern.size(), ref.data(), ref.size(),
                     wf_aligner->cigar, wf_aligner->mm_allocator);
  fprintf(stderr,"Alignment Score %d\nResult:%d\n", wf_aligner->cigar->score, res);

  assert(res != -1);
  wavefront_aligner_delete(wf_aligner);
}

int main()
{
    std::ifstream qrystr{"qry.txt"};
    std::string qry{std::istreambuf_iterator<char>(qrystr),
                    std::istreambuf_iterator<char>()};
    std::ifstream refstr{"ref.txt"};
    std::string ref{std::istreambuf_iterator<char>(refstr),
                    std::istreambuf_iterator<char>()};

    do_align(qry, ref);
}

Files:
https://hauswedell.net/tmp/ref.txt
https://hauswedell.net/tmp/qry.txt

edit: also on the main branch

Right score for score-only alignment?

Hi Smarco,
I'm testing the examples/basic.c to test the score-only alignment(attributes.alignment_scope = compute_score;).

In the output, I got two scores.
Score1: alignment score (wf_aligner->cigar.score)
Score2:SCORE (RE)COMPUTED (cigar_score_gap_affine(&wf_aligner->cigar,&attributes.affine_penalties))

In normal full-cigar alignment mode, Score1 may equal to score2. But score2 is equal to the standard NW algorithm score.
In score-only mode, Score1 is available but not equal to standard NW algorithm score. Score2 is not available, possibly because this mode does not calculate cigar.

Is there a way to use score-only alignment mode to get the correct NW algorithm score?

Best regards,
Haojing

Memory usage for ultra long read alignment

@RagnarGrootKoerkamp and I have observed the memory usage of BiWFA increasing significantly to >10GB when aligning >500kbp ONT reads with linear and affine costs (eg., sub = 1, open = 1, extend = 1). We are currently using Rust bindings here (using cmake to build with flags recommended by the WFA2-lib readme) and we have observed this memory usage issue on both the latest main branch and a commit from around 1 year ago. We have reproduced this on both MacOS and Linux.

The Rust wrapper is quite thin so we think it's unlikely that its the cause of the issue. Any idea what's wrong?

Bug in overloaded WFAligner::alignEndsFree()

WFAligner::alignEndsFree() based on char* (line 106 in WFAligner.cpp) seems to be correct, but the other method based on std::string (line 122 in WFAligner.cpp) returns WFAligner::alignEnd2End(), which is not correct.

Instead, it is supposed to return WFAligner::alignEndsFree() using .c_str():

  return alignEndsFree(
      pattern.c_str(), pattern.length(), patternBeginFree, patternEndFree,
      text.c_str(), text.length(), textBeginFree, textEndFree);

WFA2-lib commons.h bleeds C macros

By including wfa2-lib headers into vcflib (for the great vcfwave tool) we bleed in a set of macros that are defined, such as FORMAT, EOL and TAB. A library should not export these by default! See

See vcflib/vcflib#359

Downstream I have the option of not including wfa2-lib in the vcflib headers, or explicitly undefining macros. Both not what we want. Can you create a version that does not bleed macros? I prefer an upstream solution :). And thanks again for a great library!

Adding -lrt in case "librt.so.1: error adding symbols: DSO missing from command line"

In case anyone stumbles upon the same installation error, I fixed it by adding -lrt to line 48 in the Makefile as

all: CC_FLAGS+=-O3 -march=native -lrt #-flto -ffat-lto-objects

Compilers

(base) [kris@rackham3 WFA2-lib]$ echo $CC
/proj/snic2020-15-201/anaconda3/bin/x86_64-conda_cos6-linux-gnu-cc
(base) [kris@rackham3 WFA2-lib]$ /proj/snic2020-15-201/anaconda3/bin/x86_64-conda_cos6-linux-gnu-cc --version
x86_64-conda_cos6-linux-gnu-cc (crosstool-NG 1.23.0.449-a04d0) 7.3.0
Copyright (C) 2017 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

(base) [kris@rackham3 WFA2-lib]$ echo $CXX
/proj/snic2020-15-201/anaconda3/bin/x86_64-conda_cos6-linux-gnu-g++
(base) [kris@rackham3 WFA2-lib]$ /proj/snic2020-15-201/anaconda3/bin/x86_64-conda_cos6-linux-gnu-g++ --version
x86_64-conda_cos6-linux-gnu-g++ (crosstool-NG 1.23.0.449-a04d0) 7.3.0
Copyright (C) 2017 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

Installation error message below (missing -lrt)

(base) [kris@rackham3 WFA2-lib]$ make clean all
rm -rf bin build lib 2> /dev/null
make --directory=tools/align_benchmark clean
make[1]: Entering directory `/domus/h1/kris/source/WFA2-lib/tools/align_benchmark'
rm -rf ./build
make[1]: Leaving directory `/domus/h1/kris/source/WFA2-lib/tools/align_benchmark'
make --directory=examples clean
make[1]: Entering directory `/domus/h1/kris/source/WFA2-lib/examples'
rm -rf bin
make[1]: Leaving directory `/domus/h1/kris/source/WFA2-lib/examples'
rm -rf tests/*.alg tests/*.log* 2> /dev/null
make --directory=alignment all
make[1]: Entering directory `/domus/h1/kris/source/WFA2-lib/alignment'
/proj/snic2020-15-201/anaconda3/bin/x86_64-conda_cos6-linux-gnu-cc -Wall -g -O3 -march=native  -I.. -c affine_penalties.c -o ../build/affine_penalties.o
/proj/snic2020-15-201/anaconda3/bin/x86_64-conda_cos6-linux-gnu-cc -Wall -g -O3 -march=native  -I.. -c affine2p_penalties.c -o ../build/affine2p_penalties.o
/proj/snic2020-15-201/anaconda3/bin/x86_64-conda_cos6-linux-gnu-cc -Wall -g -O3 -march=native  -I.. -c cigar.c -o ../build/cigar.o
/proj/snic2020-15-201/anaconda3/bin/x86_64-conda_cos6-linux-gnu-cc -Wall -g -O3 -march=native  -I.. -c score_matrix.c -o ../build/score_matrix.o
make[1]: Leaving directory `/domus/h1/kris/source/WFA2-lib/alignment'
make --directory=bindings/cpp all
make[1]: Entering directory `/domus/h1/kris/source/WFA2-lib/bindings/cpp'
/proj/snic2020-15-201/anaconda3/bin/x86_64-conda_cos6-linux-gnu-g++ -Wall -g -O3 -march=native  -I../.. -c WFAligner.cpp -o ../../build/cpp/WFAligner.o
make[1]: Leaving directory `/domus/h1/kris/source/WFA2-lib/bindings/cpp'
make --directory=system all
make[1]: Entering directory `/domus/h1/kris/source/WFA2-lib/system'
/proj/snic2020-15-201/anaconda3/bin/x86_64-conda_cos6-linux-gnu-cc -Wall -g -O3 -march=native  -I.. -c mm_allocator.c -o ../build/mm_allocator.o
/proj/snic2020-15-201/anaconda3/bin/x86_64-conda_cos6-linux-gnu-cc -Wall -g -O3 -march=native  -I.. -c mm_stack.c -o ../build/mm_stack.o
/proj/snic2020-15-201/anaconda3/bin/x86_64-conda_cos6-linux-gnu-cc -Wall -g -O3 -march=native  -I.. -c profiler_counter.c -o ../build/profiler_counter.o
/proj/snic2020-15-201/anaconda3/bin/x86_64-conda_cos6-linux-gnu-cc -Wall -g -O3 -march=native  -I.. -c profiler_timer.c -o ../build/profiler_timer.o
make[1]: Leaving directory `/domus/h1/kris/source/WFA2-lib/system'
make --directory=utils all
make[1]: Entering directory `/domus/h1/kris/source/WFA2-lib/utils'
/proj/snic2020-15-201/anaconda3/bin/x86_64-conda_cos6-linux-gnu-cc -Wall -g -O3 -march=native  -I.. -c bitmap.c -o ../build/bitmap.o
/proj/snic2020-15-201/anaconda3/bin/x86_64-conda_cos6-linux-gnu-cc -Wall -g -O3 -march=native  -I.. -c commons.c -o ../build/commons.o
/proj/snic2020-15-201/anaconda3/bin/x86_64-conda_cos6-linux-gnu-cc -Wall -g -O3 -march=native  -I.. -c dna_text.c -o ../build/dna_text.o
/proj/snic2020-15-201/anaconda3/bin/x86_64-conda_cos6-linux-gnu-cc -Wall -g -O3 -march=native  -I.. -c heatmap.c -o ../build/heatmap.o
/proj/snic2020-15-201/anaconda3/bin/x86_64-conda_cos6-linux-gnu-cc -Wall -g -O3 -march=native  -I.. -c sequence_buffer.c -o ../build/sequence_buffer.o
/proj/snic2020-15-201/anaconda3/bin/x86_64-conda_cos6-linux-gnu-cc -Wall -g -O3 -march=native  -I.. -c string_padded.c -o ../build/string_padded.o
/proj/snic2020-15-201/anaconda3/bin/x86_64-conda_cos6-linux-gnu-cc -Wall -g -O3 -march=native  -I.. -c vector.c -o ../build/vector.o
make[1]: Leaving directory `/domus/h1/kris/source/WFA2-lib/utils'
make --directory=wavefront all
make[1]: Entering directory `/domus/h1/kris/source/WFA2-lib/wavefront'
/proj/snic2020-15-201/anaconda3/bin/x86_64-conda_cos6-linux-gnu-cc -Wall -g -O3 -march=native  -DWFA_PARALLEL -fopenmp -I.. -c wavefront_align.c -o ../build/wavefront_align.o
/proj/snic2020-15-201/anaconda3/bin/x86_64-conda_cos6-linux-gnu-cc -Wall -g -O3 -march=native  -DWFA_PARALLEL -fopenmp -I.. -c wavefront_aligner.c -o ../build/wavefront_aligner.o
/proj/snic2020-15-201/anaconda3/bin/x86_64-conda_cos6-linux-gnu-cc -Wall -g -O3 -march=native  -DWFA_PARALLEL -fopenmp -I.. -c wavefront_attributes.c -o ../build/wavefront_attributes.o
/proj/snic2020-15-201/anaconda3/bin/x86_64-conda_cos6-linux-gnu-cc -Wall -g -O3 -march=native  -DWFA_PARALLEL -fopenmp -I.. -c wavefront_backtrace_buffer.c -o ../build/wavefront_backtrace_buffer.o
/proj/snic2020-15-201/anaconda3/bin/x86_64-conda_cos6-linux-gnu-cc -Wall -g -O3 -march=native  -DWFA_PARALLEL -fopenmp -I.. -c wavefront_backtrace_offload.c -o ../build/wavefront_backtrace_offload.o
/proj/snic2020-15-201/anaconda3/bin/x86_64-conda_cos6-linux-gnu-cc -Wall -g -O3 -march=native  -DWFA_PARALLEL -fopenmp -I.. -c wavefront_backtrace.c -o ../build/wavefront_backtrace.o
/proj/snic2020-15-201/anaconda3/bin/x86_64-conda_cos6-linux-gnu-cc -Wall -g -O3 -march=native  -DWFA_PARALLEL -fopenmp -I.. -c wavefront_bialign.c -o ../build/wavefront_bialign.o
/proj/snic2020-15-201/anaconda3/bin/x86_64-conda_cos6-linux-gnu-cc -Wall -g -O3 -march=native  -DWFA_PARALLEL -fopenmp -I.. -c wavefront_bialigner.c -o ../build/wavefront_bialigner.o
/proj/snic2020-15-201/anaconda3/bin/x86_64-conda_cos6-linux-gnu-cc -Wall -g -O3 -march=native  -DWFA_PARALLEL -fopenmp -I.. -c wavefront_components.c -o ../build/wavefront_components.o
/proj/snic2020-15-201/anaconda3/bin/x86_64-conda_cos6-linux-gnu-cc -Wall -g -O3 -march=native  -DWFA_PARALLEL -fopenmp -I.. -c wavefront_compute_affine.c -o ../build/wavefront_compute_affine.o
/proj/snic2020-15-201/anaconda3/bin/x86_64-conda_cos6-linux-gnu-cc -Wall -g -O3 -march=native  -DWFA_PARALLEL -fopenmp -I.. -c wavefront_compute_affine2p.c -o ../build/wavefront_compute_affine2p.o
/proj/snic2020-15-201/anaconda3/bin/x86_64-conda_cos6-linux-gnu-cc -Wall -g -O3 -march=native  -DWFA_PARALLEL -fopenmp -I.. -c wavefront_compute_edit.c -o ../build/wavefront_compute_edit.o
/proj/snic2020-15-201/anaconda3/bin/x86_64-conda_cos6-linux-gnu-cc -Wall -g -O3 -march=native  -DWFA_PARALLEL -fopenmp -I.. -c wavefront_compute_linear.c -o ../build/wavefront_compute_linear.o
/proj/snic2020-15-201/anaconda3/bin/x86_64-conda_cos6-linux-gnu-cc -Wall -g -O3 -march=native  -DWFA_PARALLEL -fopenmp -I.. -c wavefront_compute.c -o ../build/wavefront_compute.o
/proj/snic2020-15-201/anaconda3/bin/x86_64-conda_cos6-linux-gnu-cc -Wall -g -O3 -march=native  -DWFA_PARALLEL -fopenmp -I.. -c wavefront_debug.c -o ../build/wavefront_debug.o
/proj/snic2020-15-201/anaconda3/bin/x86_64-conda_cos6-linux-gnu-cc -Wall -g -O3 -march=native  -DWFA_PARALLEL -fopenmp -I.. -c wavefront_display.c -o ../build/wavefront_display.o
/proj/snic2020-15-201/anaconda3/bin/x86_64-conda_cos6-linux-gnu-cc -Wall -g -O3 -march=native  -DWFA_PARALLEL -fopenmp -I.. -c wavefront_extend.c -o ../build/wavefront_extend.o
/proj/snic2020-15-201/anaconda3/bin/x86_64-conda_cos6-linux-gnu-cc -Wall -g -O3 -march=native  -DWFA_PARALLEL -fopenmp -I.. -c wavefront_heuristic.c -o ../build/wavefront_heuristic.o
/proj/snic2020-15-201/anaconda3/bin/x86_64-conda_cos6-linux-gnu-cc -Wall -g -O3 -march=native  -DWFA_PARALLEL -fopenmp -I.. -c wavefront_pcigar.c -o ../build/wavefront_pcigar.o
/proj/snic2020-15-201/anaconda3/bin/x86_64-conda_cos6-linux-gnu-cc -Wall -g -O3 -march=native  -DWFA_PARALLEL -fopenmp -I.. -c wavefront_penalties.c -o ../build/wavefront_penalties.o
/proj/snic2020-15-201/anaconda3/bin/x86_64-conda_cos6-linux-gnu-cc -Wall -g -O3 -march=native  -DWFA_PARALLEL -fopenmp -I.. -c wavefront_plot.c -o ../build/wavefront_plot.o
/proj/snic2020-15-201/anaconda3/bin/x86_64-conda_cos6-linux-gnu-cc -Wall -g -O3 -march=native  -DWFA_PARALLEL -fopenmp -I.. -c wavefront_slab.c -o ../build/wavefront_slab.o
/proj/snic2020-15-201/anaconda3/bin/x86_64-conda_cos6-linux-gnu-cc -Wall -g -O3 -march=native  -DWFA_PARALLEL -fopenmp -I.. -c wavefront_unialign.c -o ../build/wavefront_unialign.o
/proj/snic2020-15-201/anaconda3/bin/x86_64-conda_cos6-linux-gnu-cc -Wall -g -O3 -march=native  -DWFA_PARALLEL -fopenmp -I.. -c wavefront.c -o ../build/wavefront.o
make[1]: Leaving directory `/domus/h1/kris/source/WFA2-lib/wavefront'
ar -rsc lib/libwfa.a build/*.o 2> /dev/null
ar -rsc lib/libwfacpp.a build/*.o build/cpp/*.o 2> /dev/null
make --directory=tools/generate_dataset all
make[1]: Entering directory `/domus/h1/kris/source/WFA2-lib/tools/generate_dataset'
/proj/snic2020-15-201/anaconda3/bin/x86_64-conda_cos6-linux-gnu-cc -Wall -g -O3 -march=native  generate_dataset.c -o ../../bin/generate_dataset  -lm
make[1]: Leaving directory `/domus/h1/kris/source/WFA2-lib/tools/generate_dataset'
make --directory=tools/align_benchmark all
make[1]: Entering directory `/domus/h1/kris/source/WFA2-lib/tools/align_benchmark'
make --directory=benchmark all
make[2]: Entering directory `/domus/h1/kris/source/WFA2-lib/tools/align_benchmark/benchmark'
/proj/snic2020-15-201/anaconda3/bin/x86_64-conda_cos6-linux-gnu-cc -Wall -g -O3 -march=native  -I.. -I../../.. -c benchmark_check.c -o ../build/benchmark_check.o
/proj/snic2020-15-201/anaconda3/bin/x86_64-conda_cos6-linux-gnu-cc -Wall -g -O3 -march=native  -I.. -I../../.. -c benchmark_edit.c -o ../build/benchmark_edit.o
/proj/snic2020-15-201/anaconda3/bin/x86_64-conda_cos6-linux-gnu-cc -Wall -g -O3 -march=native  -I.. -I../../.. -c benchmark_gap_affine.c -o ../build/benchmark_gap_affine.o
/proj/snic2020-15-201/anaconda3/bin/x86_64-conda_cos6-linux-gnu-cc -Wall -g -O3 -march=native  -I.. -I../../.. -c benchmark_gap_affine2p.c -o ../build/benchmark_gap_affine2p.o
/proj/snic2020-15-201/anaconda3/bin/x86_64-conda_cos6-linux-gnu-cc -Wall -g -O3 -march=native  -I.. -I../../.. -c benchmark_gap_linear.c -o ../build/benchmark_gap_linear.o
/proj/snic2020-15-201/anaconda3/bin/x86_64-conda_cos6-linux-gnu-cc -Wall -g -O3 -march=native  -I.. -I../../.. -c benchmark_indel.c -o ../build/benchmark_indel.o
/proj/snic2020-15-201/anaconda3/bin/x86_64-conda_cos6-linux-gnu-cc -Wall -g -O3 -march=native  -I.. -I../../.. -c benchmark_utils.c -o ../build/benchmark_utils.o
make[2]: Leaving directory `/domus/h1/kris/source/WFA2-lib/tools/align_benchmark/benchmark'
make --directory=edit all
make[2]: Entering directory `/domus/h1/kris/source/WFA2-lib/tools/align_benchmark/edit'
/proj/snic2020-15-201/anaconda3/bin/x86_64-conda_cos6-linux-gnu-cc -Wall -g -O3 -march=native  -I.. -I../../.. -c edit_bpm.c -o ../build/edit_bpm.o
/proj/snic2020-15-201/anaconda3/bin/x86_64-conda_cos6-linux-gnu-cc -Wall -g -O3 -march=native  -I.. -I../../.. -c edit_dp.c -o ../build/edit_dp.o
make[2]: Leaving directory `/domus/h1/kris/source/WFA2-lib/tools/align_benchmark/edit'
make --directory=gap_affine all
make[2]: Entering directory `/domus/h1/kris/source/WFA2-lib/tools/align_benchmark/gap_affine'
/proj/snic2020-15-201/anaconda3/bin/x86_64-conda_cos6-linux-gnu-cc -Wall -g -O3 -march=native  -I.. -I../../.. -c affine_matrix.c -o ../build/affine_matrix.o
/proj/snic2020-15-201/anaconda3/bin/x86_64-conda_cos6-linux-gnu-cc -Wall -g -O3 -march=native  -I.. -I../../.. -c swg.c -o ../build/swg.o
make[2]: Leaving directory `/domus/h1/kris/source/WFA2-lib/tools/align_benchmark/gap_affine'
make --directory=gap_affine2p all
make[2]: Entering directory `/domus/h1/kris/source/WFA2-lib/tools/align_benchmark/gap_affine2p'
/proj/snic2020-15-201/anaconda3/bin/x86_64-conda_cos6-linux-gnu-cc -Wall -g -O3 -march=native  -I.. -I../../.. -c affine2p_dp.c -o ../build/affine2p_dp.o
/proj/snic2020-15-201/anaconda3/bin/x86_64-conda_cos6-linux-gnu-cc -Wall -g -O3 -march=native  -I.. -I../../.. -c affine2p_matrix.c -o ../build/affine2p_matrix.o
make[2]: Leaving directory `/domus/h1/kris/source/WFA2-lib/tools/align_benchmark/gap_affine2p'
make --directory=gap_linear all
make[2]: Entering directory `/domus/h1/kris/source/WFA2-lib/tools/align_benchmark/gap_linear'
/proj/snic2020-15-201/anaconda3/bin/x86_64-conda_cos6-linux-gnu-cc -Wall -g -O3 -march=native  -I.. -I../../.. -c nw.c -o ../build/nw.o
make[2]: Leaving directory `/domus/h1/kris/source/WFA2-lib/tools/align_benchmark/gap_linear'
make --directory=indel all
make[2]: Entering directory `/domus/h1/kris/source/WFA2-lib/tools/align_benchmark/indel'
/proj/snic2020-15-201/anaconda3/bin/x86_64-conda_cos6-linux-gnu-cc -Wall -g -O3 -march=native  -I.. -I../../.. -c indel_dp.c -o ../build/indel_dp.o
make[2]: Leaving directory `/domus/h1/kris/source/WFA2-lib/tools/align_benchmark/indel'
/proj/snic2020-15-201/anaconda3/bin/x86_64-conda_cos6-linux-gnu-cc -Wall -g -O3 -march=native  -L../../lib -I../.. -I../../.. -I. ./build/*.o align_benchmark_params.c align_benchmark.c -o ../../bin/align_benchmark -lrt  -lwfa -lm -fopenmp
make[1]: Leaving directory `/domus/h1/kris/source/WFA2-lib/tools/align_benchmark'
make --directory=examples all
make[1]: Entering directory `/domus/h1/kris/source/WFA2-lib/examples'
/proj/snic2020-15-201/anaconda3/bin/x86_64-conda_cos6-linux-gnu-cc -Wall -g -O3 -march=native  -L../lib -I.. wfa_basic.c -o bin/wfa_basic -lwfa -fopenmp -lm
/crex/proj/snic2020-15-201/anaconda3/bin/../lib/gcc/x86_64-conda_cos6-linux-gnu/7.3.0/../../../../x86_64-conda_cos6-linux-gnu/bin/ld: ../lib/libwfa.a(profiler_timer.o): undefined reference to symbol 'clock_gettime@@GLIBC_2.2.5'
/crex/proj/snic2020-15-201/anaconda3/bin/../lib/gcc/x86_64-conda_cos6-linux-gnu/7.3.0/../../../../x86_64-conda_cos6-linux-gnu/bin/ld: /crex/proj/snic2020-15-201/anaconda3/bin/../x86_64-conda_cos6-linux-gnu/sysroot/lib/librt.so.1: error adding symbols: DSO missing from command line
collect2: error: ld returned 1 exit status
make[1]: *** [examples_c] Error 1
make[1]: Leaving directory `/domus/h1/kris/source/WFA2-lib/examples'
make: *** [examples] Error 2

Speed up WFA when two sequences differ greatly in length

When patching gaps between two anchors, we sometimes need to align two sequences of vastly different lengths. Suppose we are aligning a 10bp sequence against a 100kb sequence. The active band [lo,hi] in the original WFA will grow from 1 to 100010 in size. The total number of iterations is about 1000102/2. Based on the running time of WFA2-lib, I guess WFA2-lib has a similar behavior. This is not necessary given that a wavefront can only consist of 10 cells in this example.

Generally, a key observation is that the active band is determined by the cells in the current stripe (the wfalm terminology). Sometimes we can decrease "hi" or increase "lo" when the wavefront hits the end of the query or the target. I added a hack to miniwfa to achieve that. For 10bp vs 100kb, WFA2-lib takes 44 sec while the modified miniwfa takes 0.2 sec. ksw2 only takes 0.01s mainly because WFA doesn't have an advantage on such examples and partly because my hack is inefficient. I think there should be a faster and cleaner solution but I haven't found that.

how to use extension

Hi, I'm using the code below to evaluate using wfa-lib to extend given a seed. Based on the docs, I though it would have to extend left and then right in 2 different calls, but it seems that's not the case.

In brief, given:

 
   //              0123456789012345678901234567890
   char *p =      "AAAAAAACCGGTTGGAAAAAAAAAAAAAAAAAA";
   //         0123456788901234
   char *t = "AAAAAAAAAAAACCGGTTGGAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA";

             int pattern_begin_free = 7;
             int pattern_end_free = 0;
             int text_begin_free = 11;
             int text_end_free =0;

I get the full alignment:


      PATTERN    -----AAAAAAACCGGTTGGAAAAAAAAAAAAAAAAAA---------------
                      |||||||||||||||||||||||||||||||||               
      TEXT       AAAAAAAAAAAACCGGTTGGAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA

So, for seed and extend, is it only needed to do something like this and it will indeed extend right as well?

Also, is there an efficient way for this to get the start, end of the alignment in the pattern and text? I don't need the full cigar or backtace, only the extents of the alignment.

thanks in advance.

here is the code I use for testing.

#include "wavefront/wavefront_align.h"


int wf_align(char *pattern, char *text, distance_metric_t distance_metric,
             int match, /* 0 or less */
             int mismatch, /* 1 or more */ 
             int gap_open, /* 1 or more */
             int gap_extend, /* 1 or more */
             int pattern_begin_free,
             int pattern_end_free,
             int text_begin_free,
             int text_end_free
   ) {

    wavefront_aligner_attr_t attr = wavefront_aligner_attr_default;
    attr.distance_metric = distance_metric;
    attr.distance_metric = gap_affine;
    attr.affine_penalties.match = match;
    attr.affine_penalties.mismatch = mismatch;
    attr.affine_penalties.gap_opening = gap_open;
    attr.affine_penalties.gap_extension = gap_extend;
    attr.alignment_scope = compute_alignment;

    attr.heuristic.strategy = wf_heuristic_wfadaptive;
    attr.heuristic.min_wavefront_length = 10;
    attr.heuristic.max_distance_threshold = 50;
    attr.heuristic.steps_between_cutoffs = 1;
   
    attr.alignment_form.span = alignment_endsfree;
    // right extension
    attr.alignment_form.pattern_begin_free = pattern_begin_free;
    attr.alignment_form.pattern_end_free = pattern_end_free;
    attr.alignment_form.text_begin_free = text_begin_free;
    attr.alignment_form.text_end_free = text_end_free;

    // left extension
    /*
    attr.alignment_form.pattern_begin_free = pattern_begin_free;
    attr.alignment_form.pattern_end_free = 0;
    attr.alignment_form.text_begin_free = text_begin_free;
    attr.alignment_form.text_end_free = 0;
    */

    wavefront_aligner_t* const wf_aligner = wavefront_aligner_new(&attr);

    wavefront_align(wf_aligner, pattern, strlen(pattern), text, strlen(text));
    cigar_print_pretty(stderr,
      pattern,strlen(pattern),text,strlen(text),
      &wf_aligner->cigar,wf_aligner->mm_allocator);

    wavefront_aligner_delete(wf_aligner);
}

int main() {
   //              0123456789012345678901234567890
   char *p =      "AAAAAAACCGGTTGGAAAAAAAAAAAAAAAAAA";
   //         0123456788901234
   char *t = "AAAAAAAAAAAACCGGTTGGAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA";

   wf_align(p, t, gap_affine, 0, 2, 2, 1, 7, 0, 11, 0);

}

Feature questions

Curious if you have plans to extend this to produce local alignments ( e.g. all HSPs above some score threshold ), and scoring matrices?

Rust bindings and sign of returned score

We're writing a rust wrapper around WFA to do some testing.

It seems that there is an inconsistency in the sign of the returned cost:

  • for unit and lcs, the returned value is positive: the 'cost' of the alignment.
  • for other cases gap_linear and gap_affine and gap_affine_2p, the returned value is negative: the 'score' of the alignment.

You can see this is my code here and here respectively.

I was trying to find some docs on this difference, but the readme does not mention this difference currently. Is it intentional?

wf_aligner->cigar->operations is blank

I tried to align two sequences:

GGAGCATCAGTAGACCTAACCATTTTCTCCCTCCATCTAGCAGGAGTATCTTCAATCCTAGGAGCAATTAATTTCATCACTACAATTATTAACATAAA
CCTCAGAAGAAACGCTGCTCACGCCGGGTCTTCAGTTGATTTTGCCATTTTTTCTTTGCACTTGGCGGGAGTTTCTTCTCTATTAGGGGCCGTCAATTTTATTAGAACTTTAAGCAATTTGCGAATTTTTGGGATATTTTTGGATCAGATGCCCCTTTTTTGTTGAGCAGTATTAATTACGGCAGTTCTATTGCTTTTATCCTTACCCGTTCTCGCGGGGGCTATTACCATGTTGCTAACAGACCGCAATTTGAATTCCAGATTTTATGACGTTAGGGGAGGGGGGGATCCCGTGCTCTACCAGCATTTATTT

wavefront_aligner_attr_t attributes = wavefront_aligner_attr_default;
attributes.distance_metric = gap_affine;
attributes.affine_penalties.mismatch =4;
attributes.affine_penalties.gap_opening = 6;
attributes.affine_penalties.gap_extension = 2;
// Initialize Wavefront Aligner
wavefront_aligner_t* const wf_aligner = wavefront_aligner_new(&attributes);
wavefront_align(wf_aligner,seq1,strlen(seq1),seq2,strlen(seq2)); //Align

But the cigar->operations doesn't contain anything. Isn't this supposed to contain the cigar string?

(gdb) p wf_aligner->cigar->begin_offset
$38 = 3686
(gdb) p wf_aligner->cigar->end_offset
$39 = 3999
(gdb) p wf_aligner->cigar->operations
$40 = 0x7ffff28e5b98 ""

Generate Dataset Tool outdated README

Some of the command line options are no longer supported (--pattern-length|P and --text-length|T) and have been substituted by a new one which is undocumented.

Build issue on centos

Hi,
Thanks for the really nice library. I have no problems building on my Mac using clang using make BUILD_TOOLS=0 BUILD_EXAMPLES=0 clean all

However I run into problem when installing on centos, installing the c++ code (old gcc version?):

make clean all
rm -rf bin build lib
make --directory=tools/align_benchmark clean
make[1]: Entering directory `/scratch/c.sbi8kc2/WFA2-lib/tools/align_benchmark'
rm -rf ./build
make[1]: Leaving directory `/scratch/c.sbi8kc2/WFA2-lib/tools/align_benchmark'
make --directory=examples clean
make[1]: Entering directory `/scratch/c.sbi8kc2/WFA2-lib/examples'
rm -rf bin
make[1]: Leaving directory `/scratch/c.sbi8kc2/WFA2-lib/examples'
make --directory=alignment all
make[1]: Entering directory `/scratch/c.sbi8kc2/WFA2-lib/alignment'
gcc -Wall -g -O3 -march=native -I.. -c affine_penalties.c -o ../build/affine_penalties.o
gcc -Wall -g -O3 -march=native -I.. -c affine2p_penalties.c -o ../build/affine2p_penalties.o
gcc -Wall -g -O3 -march=native -I.. -c cigar.c -o ../build/cigar.o
gcc -Wall -g -O3 -march=native -I.. -c score_matrix.c -o ../build/score_matrix.o
make[1]: Leaving directory `/scratch/c.sbi8kc2/WFA2-lib/alignment'
make --directory=bindings/cpp all
make[1]: Entering directory `/scratch/c.sbi8kc2/WFA2-lib/bindings/cpp'
g++ -Wall -g -O3 -march=native -I../.. -c WFAligner.cpp -o ../../build/cpp/WFAligner.o
WFAligner.cpp:57:3: warning: identifier ‘nullptr’ is a keyword in C++11 [-Wc++0x-compat]
   this->wfAligner = nullptr;
   ^
WFAligner.cpp: In constructor ‘wfa::WFAligner::WFAligner(wfa::WFAligner::AlignmentScope, wfa::WFAligner::MemoryModel)’:
WFAligner.cpp:57:21: error: ‘nullptr’ was not declared in this scope
   this->wfAligner = nullptr;
                     ^
make[1]: *** [../../build/cpp/WFAligner.o] Error 1
make[1]: Leaving directory `/scratch/c.sbi8kc2/WFA2-lib/bindings/cpp'
make: *** [bindings/cpp] Error 2

I tried deleting the /bindings/cpp from the Makefile so SUBDIRS looks like:

SUBDIRS=alignment \
        system \
        utils \
        wavefront

I also removed this line from the Makefile:

$(AR) $(AR_FLAGS) $(LIB_WFA_CPP) $(FOLDER_BUILD)/*.o $(FOLDER_BUILD_CPP)/*.o 2> /dev/null
The build now works with make BUILD_TOOLS=0 BUILD_EXAMPLES=0 clean all

Would it be possible to add a c++ free rule to the makefile? Thanks again!

Alignment with match score not optimal

Using the latest v2.2, with match score, the alignment is not optimal preferring a gap and not a mismatch. The alignment result is "8D1I1M7I", expecting "7D1X1M7I"

int main(int argc,char* argv[]) {
  // Patter & Text
  char* pattern = "CCCCCCCCC";
  char* text    = "TCTTTTTTT";
  // Configure alignment attributes
  wavefront_aligner_attr_t attributes = wavefront_aligner_attr_default;
  attributes.distance_metric = gap_affine;
  attributes.affine_penalties.match = -10;
  attributes.affine_penalties.mismatch = 4;
  attributes.affine_penalties.gap_opening = 6;
  attributes.affine_penalties.gap_extension = 2;

  attributes.alignment_form.span = alignment_endsfree;
  attributes.alignment_form.pattern_begin_free = strlen(pattern)-1;
  attributes.alignment_form.pattern_end_free = 0;
  attributes.alignment_form.text_begin_free = 0;
  attributes.alignment_form.text_end_free = strlen(text)-1;
  // Initialize Wavefront Aligner
  wavefront_aligner_t* const wf_aligner = wavefront_aligner_new(&attributes);
  // Align
  wavefront_align(wf_aligner,pattern,strlen(pattern),text,strlen(text));
  fprintf(stderr,"WFA-Alignment returns score %d\n",wf_aligner->cigar.score);
  // Display alignment
  fprintf(stderr,"  PATTERN  %s\n",pattern);
  fprintf(stderr,"  TEXT     %s\n",text);
  fprintf(stderr,"  SCORE (RE)COMPUTED %d\n",
      cigar_score_gap_affine(&wf_aligner->cigar,&attributes.affine_penalties));
  cigar_print_pretty(stderr,
      pattern,strlen(pattern),text,strlen(text),
      &wf_aligner->cigar,wf_aligner->mm_allocator);
  // Free
  wavefront_aligner_delete(wf_aligner);
}

The result:

WFA-Alignment returns score 64
  PATTERN  CCCCCCCCC
  TEXT     TCTTTTTTT
  SCORE (RE)COMPUTED -40
      ALIGNMENT	8D1I1M7I
      ALIGNMENT.COMPACT	8D1I7I
      PATTERN    CCCCCCCC-C-------
                          |
      TEXT       --------TCTTTTTTT

[WFA::Compute] Maximum allocated wavefronts reached

Hi,

Thank you for developing WFA, it is great! I am using the C++ version in my program, and I am stumbling upon an error:
[WFA::Compute] Maximum allocated wavefronts reached
This error only arrives after a high number of alignments, and does not arrive at all in some instances. Here is how I am using WFA in the code:
wfa::WFAlignerGapAffine aligner(8,6,2, wfa::WFAligner::Alignment, wfa::WFAligner::MemoryHigh); aligner.alignEnd2End(cons, re);
This code snippet is included in a loop, where cons and re change. The error arrives in a deterministic fashion when going through the loop for the ~720th time.

The error does not arrive when I reuse the same WFAlignerGapAffine object for all alignments. I know this would be the recommended way to go, but I found it uses a lot of RAM (in my case >30G while <1G when resetting the object every time).

Any idea on how to solve this ?

Many tanks,
Roland

Do D and I have inverted roles in CIGAR strings?

Running wfademo.cpp, I noticed that the meaning of D and I in the CIGAR output seems to have been swapped from their usual meaning. Here’s an example taken from the README:

    PATTERN    AGCTA-GTGTCAATGGCTACT---TTTCAGGTCCT
               | ||| |||||  ||||||||   | |||||||||
    TEXT       AACTAAGTGTCGGTGGCTACTATATATCAGGTCCT
    ALIGNMENT  1M1X3M1I5M2X8M3I1M1X9M

The README states that text is equivalent to reference and pattern equivalent to query (which makes sense). If I take the above pattern to be a sequencing read and the text to be a genome reference, then the two gaps would be considered to be deletions, but they are encoded as 1I and 3I, respectively. Or should I think about this differently?

left shift of negative value

If you compile the latest https://github.com/armintoepfer/aligner-testbed with gcc10.3.0 with asan and ubsan

$ meson -Db_sanitize=address,undefined
$ ninja
$ ./at ../data/clr1.txt --rounds 1 --log-level INFO --miniwfa=false --ksw2=false
| 20220421 10:31:22.817 | INFO | Number of sequence pairs : 1
../subprojects/wfa/wavefront/wavefront_extend.c:116:20: runtime error: load of misaligned address 0x7f567e1c877e for type 'uint64_t', which requires 8 byte alignment
0x7f567e1c877e: note: pointer points here
 3f 3f 3f 3f 41 54  43 54 43 54 43 54 43 41  41 43 41 41 43 41 43 43  43 41 43 47 54 41 47 47  41 47
             ^
../subprojects/wfa/wavefront/wavefront_extend.c:116:38: runtime error: load of misaligned address 0x7f567e1d6462 for type 'uint64_t', which requires 8 byte alignment
0x7f567e1d6462: note: pointer points here
 21 21  21 21 41 41 54 43 54 43  41 43 47 43 54 43 41 41  43 43 41 43 41 41 43 41  41 43 47 47 41 47
              ^
../subprojects/wfa/wavefront/wavefront_extend.c:124:13: runtime error: load of misaligned address 0x7f567e1c87b3 for type 'uint64_t', which requires 8 byte alignment
0x7f567e1c87b3: note: pointer points here
 54  43 54 43 54 43 41 41 41  43 43 41 43 41 41 43 41  41 43 47 47 41 47 47 41  47 47 41 47 47 41 41
              ^
../subprojects/wfa/wavefront/wavefront_extend.c:124:31: runtime error: load of misaligned address 0x7f567e1d649d for type 'uint64_t', which requires 8 byte alignment
0x7f567e1d649d: note: pointer points here
 54 43 54 43 41 54 43  41 41 43 41 41 43 47 41  41 43 41 41 43 47 47 41  47 47 41 47 47 41 47 47  41
             ^
../subprojects/wfa/wavefront/wavefront_backtrace.c:217:12: runtime error: left shift of negative value -1073741823
../subprojects/wfa/wavefront/wavefront_backtrace.c:186:12: runtime error: left shift of negative value -1073741824
../subprojects/wfa/wavefront/wavefront_backtrace.c:245:12: runtime error: left shift of negative value -1073741822
| 20220421 10:31:37.533 | INFO | WFA2 C    : 0 / 14s 715ms

This is on a x86 AMD EPYC 7702 with no march.

Uninitialized offset can lead to client segfault

Hi @smarco -

Thanks again for WFA2 and your prior help with a similar issue -- I have a new problem where I'm getting non-deterministic segfaults in client code. I traced this down to one issue in wfa2 which is likely to be the cause, a Conditional jump or move depends on uninitialised value here:

if (offset == WAVEFRONT_OFFSET_NULL) continue;

I reduced the problem case to the following code:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <time.h>
#include "wavefront/wavefront_align.h"

int main() {
    char pattern[] = "TAT";
    char text[]    = "CAT";

    wavefront_aligner_attr_t attr = wavefront_aligner_attr_default;
    attr.distance_metric = gap_linear;
    attr.linear_penalties.match = -1;
    attr.linear_penalties.mismatch = 2;
    attr.linear_penalties.indel = 2;
    attr.alignment_scope = compute_alignment;

    attr.alignment_form.span = alignment_endsfree;
    attr.alignment_form.pattern_begin_free = 1;
    attr.alignment_form.pattern_end_free = 1;
    attr.alignment_form.text_begin_free = 1;
    attr.alignment_form.text_end_free = 1;

    wavefront_aligner_t* const wf_aligner = wavefront_aligner_new(&attr);
    wavefront_align(wf_aligner, pattern, strlen(pattern), text, strlen(text));
    cigar_print_pretty(stderr,
      wf_aligner->cigar,pattern,strlen(pattern),text,strlen(text));

    fprintf(stderr,"Alignment Score %d\n",wf_aligner->cigar->score);

    wavefront_aligner_delete(wf_aligner);
}

After building this in linux debug mode against the library version currently on main (9fdf46e), the valgrind output is:

==44021== Memcheck, a memory error detector
==44021== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==44021== Using Valgrind-3.15.0 and LibVEX; rerun with -h for copyright info
==44021== Command: ./a.out
==44021==
==44021== Conditional jump or move depends on uninitialised value(s)
==44021==    at 0x41242B: wavefront_extend_matches_packed_endsfree (wavefront_extend_kernels.c:143)
==44021==    by 0x40CE7D: wavefront_extend_endsfree_dispatcher_seq (wavefront_extend.c:229)
==44021==    by 0x40CF02: wavefront_extend_endsfree_dispatcher_threads (wavefront_extend.c:246)
==44021==    by 0x40CFC5: wavefront_extend_endsfree (wavefront_extend.c:282)
==44021==    by 0x411999: wavefront_unialign (wavefront_unialign.c:251)
==44021==    by 0x401641: wavefront_align_unidirectional (wavefront_align.c:132)
==44021==    by 0x40192E: wavefront_align (wavefront_align.c:228)
==44021==    by 0x40129E: main (test.c:26)
==44021==
==44021== Use of uninitialised value of size 8
==44021==    at 0x412474: wavefront_extend_matches_kernel_blockwise (wavefront_extend_kernels.c:72)
==44021==    by 0x412474: wavefront_extend_matches_packed_endsfree (wavefront_extend_kernels.c:145)
==44021==    by 0x40CE7D: wavefront_extend_endsfree_dispatcher_seq (wavefront_extend.c:229)
==44021==    by 0x40CF02: wavefront_extend_endsfree_dispatcher_threads (wavefront_extend.c:246)
==44021==    by 0x40CFC5: wavefront_extend_endsfree (wavefront_extend.c:282)
==44021==    by 0x411999: wavefront_unialign (wavefront_unialign.c:251)
==44021==    by 0x401641: wavefront_align_unidirectional (wavefront_align.c:132)
==44021==    by 0x40192E: wavefront_align (wavefront_align.c:228)
==44021==    by 0x40129E: main (test.c:26)
==44021==
==44021== Use of uninitialised value of size 8
==44021==    at 0x41247B: wavefront_extend_matches_kernel_blockwise (wavefront_extend_kernels.c:72)
==44021==    by 0x41247B: wavefront_extend_matches_packed_endsfree (wavefront_extend_kernels.c:145)
==44021==    by 0x40CE7D: wavefront_extend_endsfree_dispatcher_seq (wavefront_extend.c:229)
==44021==    by 0x40CF02: wavefront_extend_endsfree_dispatcher_threads (wavefront_extend.c:246)
==44021==    by 0x40CFC5: wavefront_extend_endsfree (wavefront_extend.c:282)
==44021==    by 0x411999: wavefront_unialign (wavefront_unialign.c:251)
==44021==    by 0x401641: wavefront_align_unidirectional (wavefront_align.c:132)
==44021==    by 0x40192E: wavefront_align (wavefront_align.c:228)
==44021==    by 0x40129E: main (test.c:26)
==44021==
==44021== Conditional jump or move depends on uninitialised value(s)
==44021==    at 0x4120A9: wavefront_termination_endsfree (wavefront_termination.c:128)
==44021==    by 0x412516: wavefront_extend_matches_packed_endsfree (wavefront_extend_kernels.c:148)
==44021==    by 0x40CE7D: wavefront_extend_endsfree_dispatcher_seq (wavefront_extend.c:229)
==44021==    by 0x40CF02: wavefront_extend_endsfree_dispatcher_threads (wavefront_extend.c:246)
==44021==    by 0x40CFC5: wavefront_extend_endsfree (wavefront_extend.c:282)
==44021==    by 0x411999: wavefront_unialign (wavefront_unialign.c:251)
==44021==    by 0x401641: wavefront_align_unidirectional (wavefront_align.c:132)
==44021==    by 0x40192E: wavefront_align (wavefront_align.c:228)
==44021==    by 0x40129E: main (test.c:26)
==44021==
==44021== Conditional jump or move depends on uninitialised value(s)
==44021==    at 0x4120FD: wavefront_termination_endsfree (wavefront_termination.c:144)
==44021==    by 0x412516: wavefront_extend_matches_packed_endsfree (wavefront_extend_kernels.c:148)
==44021==    by 0x40CE7D: wavefront_extend_endsfree_dispatcher_seq (wavefront_extend.c:229)
==44021==    by 0x40CF02: wavefront_extend_endsfree_dispatcher_threads (wavefront_extend.c:246)
==44021==    by 0x40CFC5: wavefront_extend_endsfree (wavefront_extend.c:282)
==44021==    by 0x411999: wavefront_unialign (wavefront_unialign.c:251)
==44021==    by 0x401641: wavefront_align_unidirectional (wavefront_align.c:132)
==44021==    by 0x40192E: wavefront_align (wavefront_align.c:228)
==44021==    by 0x40129E: main (test.c:26)
==44021==
      ALIGNMENT 1X2M
      ETRACE    1X
      CIGAR     1X2M
      PATTERN    TAT
                  ||
      TEXT       CAT
Alignment Score 0
==44021==
==44021== HEAP SUMMARY:
==44021==     in use at exit: 0 bytes in 0 blocks
==44021==   total heap usage: 23 allocs, 23 frees, 4,297,004 bytes allocated
==44021==
==44021== All heap blocks were freed -- no leaks are possible
==44021==
==44021== Use --track-origins=yes to see where uninitialised values come from
==44021== For lists of detected and suppressed errors, rerun with: -s
==44021== ERROR SUMMARY: 15 errors from 5 contexts (suppressed: 0 from 0)

I tried to pursue this a bit further, and could at least figure out that the initialization problem seems to be specific to wavefront 2.

If you're able to reproduce and fix (or explain a workaround for) this case it would be very helpful. Thank you!

Hamming distance

Is there a plan to implement Hamming distance calculation? It is probably trivial compared with other distance metrics, but really helps avoid using another library (so WFA2 all the way) or reinventing the wheel. Thanks.

Difference between GapAffine(1,1,1) and EditDistance

Hi Santiago,

Still working with WFA-lib, I noticed a behaviour that I do not understand on wfa::WFAlignerGapAffine. It seems to me that it did not find the optimal alignment in the example below (so I probably misunderstand the parametrization). I thought that wfa::WFAlignerGapAffine alignerAffine(1,1,1, wfa::WFAligner::Alignment, wfa::WFAligner::MemoryHigh); was equivalent to wfa::WFAlignerEdit alignerEdit(wfa::WFAligner::Alignment, wfa::WFAligner::MemoryHigh);, but it does not seem to be the case :

wfa::WFAlignerGapAffine alignerAffine(1,1,1, wfa::WFAligner::Alignment, wfa::WFAligner::MemoryHigh);
wfa::WFAlignerEdit alignerEdit(wfa::WFAligner::Alignment, wfa::WFAligner::MemoryHigh);

string seq1 = "AAAAAAAAAATAGCGAAAATGCCCAGCTATTTTTTTTTTT";
string seq2 =   "AAAAAAAAAACAGCGAAATGCCCCAGCGATTTTTTTTTTT";

alignerAffine.alignEnd2End(seq1, seq2);
alignerEdit.alignEnd2End(seq1, seq2);

string cigarEdit = alignerEdit.getAlignmentCigar();
string cigarAffine = alignerAffine.getAlignmentCigar();

cout << "Alignment with edit distance: " << cigarEdit << " " << alignerEdit.getAlignmentScore() << endl;
print_alignment(seq1, seq2, cigarEdit, 0, seq1.size());
cout << "alignment with gap-affine (1,1,1): " << cigarAffine << " " << alignerAffine.getAlignmentScore() << endl;
print_alignment(seq1, seq2, cigarAffine, 0, seq1.size());

gives :

Alignment with edit distance: MMMMMMMMMMXMMMMMMMDMMMMMIMMMXMMMMMMMMMMMM 4
AAAAAAAAAATAGCGAAA-ATGCCCAGCTATTTTTTTTTTT
AAAAAAAAAACAGCGAAATGCCCC-AGCGATTTTTTTTTTT
alignment with gap-affine (1,1,1): MMMMMMMMMMXMMMMMMMXXXMMMMMMXMMMMMMMMMMMM -5
AAAAAAAAAATAGCGAAAATGCCCAGCTATTTTTTTTTTT
AAAAAAAAAACAGCGAAATGCCCCAGCGATTTTTTTTTTT

Apparently the gapAffine aligner favored 3 mismatches over one insertion and one deletion, which I do not understand.
Do you know why it behaves like this ?

Thank you,
Roland

Unexpected alignment failures

I have integrated WFA2-lib into an app, and I keep getting difficult to track alignment failures. After some successful alignments, wavefront_align() will just return -1. The exact point where this happens, depends on whether I build in debug or release mode and whether I turn on sanitizers or not. Release mode produces many hundred correct alignments, debug fails after a couple and with ASAN, it fails immediately.

I suspect there is undefined behaviour somewhere in WFA2-lib.

Here is a minimal example:

extern "C" {
#include <wavefront/wavefront_align.h>
}

#include <string>

void do_align(std::string_view const pattern, std::string_view const ref)
{
  // Configure alignment attributes
  wavefront_aligner_attr_t attributes = wavefront_aligner_attr_default;
  attributes.alignment_scope = compute_alignment;
  attributes.distance_metric = gap_affine;
  attributes.affine_penalties.mismatch = 3;
  attributes.affine_penalties.gap_opening = 5;
  attributes.affine_penalties.gap_extension = 1;
  attributes.alignment_form.span = alignment_endsfree;
  attributes.alignment_form.pattern_begin_free = 0;
  attributes.alignment_form.pattern_end_free = 0;
  attributes.alignment_form.text_begin_free = 1;
  attributes.alignment_form.text_end_free = 1;
  attributes.heuristic.strategy = wf_heuristic_wfadaptive;
  attributes.heuristic.min_wavefront_length = 10;
  attributes.heuristic.max_distance_threshold = 10;
  attributes.heuristic.steps_between_cutoffs = 1;

  wavefront_aligner_t * const wf_aligner = wavefront_aligner_new(&attributes);

  int res = wavefront_align(wf_aligner, pattern.data(), pattern.size(), ref.data(), ref.size());

  cigar_print_pretty(stderr, pattern.data(), pattern.size(), ref.data(), ref.size(),
                     &wf_aligner->cigar, wf_aligner->mm_allocator);
  fprintf(stderr,"Alignment Score %d\nResult:%d\n", wf_aligner->cigar.score, res);

  assert(res != -1);
  wavefront_aligner_delete(wf_aligner);
}

int main()
{
    do_align("GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGTCAGGAGCCGAGCCGACGAGGTGGTGATGTTGGTCGGGCGTGATCCGGGGTGGCGTGACGAGGATGGCGGGGTGGTAGCGGGGGGGGGGGGGGGCGGGCGGGCGGGGGGGGGGGGCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG",
             "GGGGGGGGGGGGATTATACAGCAAATTTACTTAAAAATGTGTATTAGTCAGATTTTTAGTTACTCATGGGTAAATGCAATCCCTAATTAAGGGTGTGAAGTGAGTGCTGAAACTTGCTTAGGAAAAGAGGTGGAAAAATTGGATGGGAATTAAGCATAGAGGTACCACGAAGTATCTGAAATTGTTTGGTTATGTCTGTAGACAAATCAAATGCTTAAACAAAATAAACTGAAATTTTCAACACATGCACACACACAGTCCTCATACTTTTAGATTTTTAGTTTAAAAAATAAGT");
}

When building with these options: g++ wfabug.cpp -I ~/devel/WFA2-lib ~/devel/WFA2-lib/lib/libwfa.a
it prints:

      ALIGNMENT 12M1X25I1M1X1M48D26I3M20I3M1X1M1X1M2X1M1X1M1X1M1X1M33D1M1X1M1X1M4I2M2X2M3X1M1X7M2X1M2X4M2X3M2X1M1X1M2X1M3X1M1X2M4X4M1X1M8I2M1X1M3X3M2X1M16D1M1X1M1X1M32I1M37D41I1M43D20I1M1I
      ALIGNMENT.COMPACT 1X25I1X48D26I20I1X1X2X1X1X1X33D1X1X4I2X3X1X2X2X2X2X1X2X3X1X4X1X8I1X3X2X16D1X1X32I37D41I43D20I1I
      PATTERN    GGGGGGGGGGGGG-------------------------GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG--------------------------GGG--------------------GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGTCAGGA----GCCGAGCCGACGAGGTGGTGATGTTGGTCGGGCGTGATCCGGGGTGGCGTGACGAGG--------ATGGCGGGGTGGTAGCGGGGGGGGGGGGGGGCGG--------------------------------GCGGGCGGGGGGGGGGGGCGGGGGGGGGGGGGGGGGGG-----------------------------------------GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG--------------------G-
                 ||||||||||||                          | |                                                                          |||                    ||| | |  | | | |                                 | | |    ||  ||   | |||||||  |  ||||  |||  | |  |   | ||    |||| |        || |   |||  |                | | |                                |                                                                              |                                                               | 
      TEXT       GGGGGGGGGGGGATTATACAGCAAATTTACTTAAAAATGTG------------------------------------------------TATTAGTCAGATTTTTAGTTACTCATGGGTAAATGCAATCCCTAATTAAGGGTGTGAAGTGAGTG---------------------------------CTGAAACTTGCTTAGGAAAAGAGGTGGAAAAATTGGATGGGAATTAAGCATAGAGGTACCACGAAGTATCTGAAATTGTTTGGTTAT----------------GTCTGTAGACAAATCAAATGCTTAAACAAAATAAACTG-------------------------------------AAATTTTCAACACATGCACACACACAGTCCTCATACTTTTAG-------------------------------------------ATTTTTAGTTTAAAAAATAAGT
Alignment Score -553
Result:0

When building with the address sanitizer: g++ -fsanitize=address wfabug.cpp -I ~/devel/WFA2-lib ~/devel/WFA2-lib/lib/libwfa.a
it prints:

      ALIGNMENT
      ALIGNMENT.COMPACT
      PATTERN    GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGTCAGGAGCCGAGCCGACGAGGTGGTGATGTTGGTCGGGCGTGATCCGGGGTGGCGTGACGAGGATGGCGGGGTGGTAGCGGGGGGGGGGGGGGGCGGGCGGGCGGGGGGGGGGGGCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
                 ???????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????
      TEXT       GGGGGGGGGGGGATTATACAGCAAATTTACTTAAAAATGTGTATTAGTCAGATTTTTAGTTACTCATGGGTAAATGCAATCCCTAATTAAGGGTGTGAAGTGAGTGCTGAAACTTGCTTAGGAAAAGAGGTGGAAAAATTGGATGGGAATTAAGCATAGAGGTACCACGAAGTATCTGAAATTGTTTGGTTATGTCTGTAGACAAATCAAATGCTTAAACAAAATAAACTGAAATTTTCAACACATGCACACACACAGTCCTCATACTTTTAGATTTTTAGTTTAAAAAATAAGT
Alignment Score -2147483648
Result:-1
a.out: wfabug.cpp:36: void do_align(std::string_view, std::string_view): Assertion `res != -1' failed.
[1]    673339 IOT instruction (core dumped)  ./a.out

Feature requests: R bindings and early stopping

Nice work!

Would it be possible to get R bindings for this? I put together a minimal example here: https://github.com/traversc/WavefrontAlignR (feel free to do whatever with it)

I'd also like to a request an "early stopping" feature, where if the best possible alignment distance exceeds a user defined threshold, stop alignment and return a flag value (like INT_MAX). Assuming this doesn't add too much overhead, this would be useful because I'm mostly interested in finding only highly similar sequences between two sets.

Last, I ran a quick benchmark comparing an existing R package. Is this a fair comparison? Code used to run WFA2 here: https://github.com/traversc/WavefrontAlignR/blob/main/src/WFA_bindings.cpp

# Benchmark for a 10,000 x 10,000 alignment
# "seqs" is a vector of DNA sequences on average 43 bp long
library(WavefrontAlignR)
library(stringdist)
library(tictoc)

# WFA2 levenshtein
tic()
y1 <- WavefrontAlignR::edit_dist_matrix(seqs, seqs)
toc()
# 191.452 sec elapsed, 522324 alignments / sec

# stringdist levenshtein
tic()
y2 <- stringdist::stringdistmatrix(seqs, seqs, method = "lv", nthread=1)
toc()
# 677.356 sec elapsed, 147633 alignments / sec

Please tag releases

Hi,
since WFA2-lib is a new depencency of vcflib I intend to package WFA2-lib for Debian to be able to upgrade the vcflib Debian package to its latest version. To do so it helps a lot if you would tag your releases to give us sensible chances to tell apart what are random development commits and what is user oriented production code.
Thanks a lot, Andreas.

WFA2-lib for local alignment

I am currently trying to use the alignEndsFree function in the WFA2 library to perform local alignment between pairs of sequences. However, I am having trouble obtaining the required input arguments for this function.

Specifically, the alignEndsFree function requires four arguments: patternBeginFree, patternEndFree, textBeginFree, and textEndFree, which correspond to the beginning and end for the pattern and text sequences. However, I am unsure how to obtain these.

Additionally, I noticed that in the benchmark section of the WFA2, a global parameters struct is used to obtain the input arguments for the alignEndsFree function. However, I am unsure how to modify this approach to get patternBeginFree, patternEndFree, textBeginFree, and textEndFree for a given pattern and text.

ends-free match score not used

I have problem understanding the "ends-free" span mode. The match penalty (value <= 0) has no impact on the alignment and it is not counted towards the returned score.

Here is an example: changing the match score has no impact. mismatch = 4, gap open = 6, gap extension = 2
pattern_begin_free=len(pattern)-1,
pattern_end_free=0,
text_begin_free=0,
text_end_free=len(text)-1

9D1X8I      ALIGNMENT
9D1X8I      ALIGNMENT.COMPACT
      PATTERN    CCCCCCCCCC--------

      TEXT       ---------TCTTTTTTT

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.