Code Monkey home page Code Monkey logo

pyalign's Introduction

pyalign

Fast and simple alignments in Python:

Example inside Jupyter

Make sure to take a look at "other alignment libraries" below to better understand if this is the right library for you.


Alignments have been a staple algorithm in bioinformatics for decades now, but most packages implementing tend to be either easy to use and slow, or fast but very difficult to use and highly domain specific.

pyalign is a small and hopefully rather versatile Python package that aims to be fast and easy to use. At its core, it is an optimizer for finding "optimum correspondences between sequences" (Kruskal, 1983) - the main proponents of which are alignments and dynamic time warping.

General Features:

  • easy to install and easy to use
  • robust and efficient implementation of standard algorithms
  • very fast for smaller problem sizes (see below for details)
  • built-in visualization functionality for teaching purposes

In terms of alignment algorithms:

  • computes local, global and semiglobal alignments on pairs of sequences
  • supports different gap costs (commonly used ones as well as custom ones)
  • automatically selects best suitable algorithm (e.g. Gotoh)
  • no assumptions on matched items, i.e. not limited to characters
  • supports any given similarity or distance function (i.e. can maximize or minimize)
  • can return one as well as all optimal alignments and scores

The implementation should be rather fast due to highly optimized code paths for every special case. While it does not support GPUs, here are some facts:

  • optimized C++ core employing xtensor
  • supports SIMD via batching (i.e. simple SIMD parallelism as first suggested by Alpern et al. and more recently by Rudnicki et al.)
  • carefully designed to avoid dynamic memory allocation
  • extensive metaprogramming to provide different optimized code paths for different usage patterns - for example, computing "only single score" won't write tracebacks, whereas computing "all alignments" will track multiple traceback edges

Installation

via pip (recommended)

pyalign currently provides precompiled packages for Windows (Intel), Linux (Intel) and macOS (Intel).

pip install pyalign

on Google Colab

First install conda via:

!pip install -q condacolab
import condacolab
condacolab.install()

Then run:

!git clone https://github.com/poke1024/pyalign && cd pyalign && conda env create -f environment.yml && conda activate pyalign && python setup.py install

locally

Installing pyalign locally will require a modern C++ compiler. It also requires various libraries from the xtensor stack which are best installed via cona; for the full list of required packages, see environment.yml.

Local installation via conda:

git clone https://github.com/poke1024/pyalign
cd pyalign
conda env create -f environment.yml
conda activate pyalign
python setup.py install

Example

Running

import pyalign
alignment = pyalign.global_alignment("INDUSTRY", "INTEREST", gap_cost=0, eq=1, ne=-1)
alignment

will compute an optimal global alignment between "INDUSTRY" and "INTEREST".

We instruct the optimizer to use of scores 1 and -1 (for matching and non-matching letters) and no (i.e. 0) gap costs.

In Jupyter, this will give

IN----DUSTRY
||      ||  
INTERE--ST--

Of course you can also extract the actual score:

alignment.score

as

4.0

It's also possible to extract the traceback matrix and path and generate visuals (and thus a detailed rationale for the obtained score and solution).

In contrast to the first example above, which used the simplified high level API, we now use the full, more detailed API, which gives much more detailed access to different gap costs, solvers and scoring configurations. To make things a bit more interesting, we switch from 0 gap cost to 0.2 (which will not change the result in this case, but shows in the traceback matrix):

import pyalign.problems
import pyalign.solve
import pyalign.gaps

pf = pyalign.problems.general(
    pyalign.problems.Equality(eq=1, ne=-1),
    direction="maximize")
solver = pyalign.solve.GlobalSolver(
    gap_cost=pyalign.gaps.LinearGapCost(0.2),
    codomain=pyalign.solve.Solution)
problem = pf.new_problem("INDUSTRY", "INTEREST")
solver.solve(problem)

traceback and path

As a final example, here is how we would modify the solver above to get a list over all optimal solutions of a problem:

from typing import Iterator

solver = pyalign.solve.GlobalSolver(
    gap_cost=pyalign.gaps.LinearGapCost(0.2),
    codomain=List[pyalign.solve.Solution])

This will now return a list of solutions, each with its own traceback, e.g.:

[<pyalign.solve.Solution at 0x10f0f15b0>,
 <pyalign.solve.Solution at 0x10f0f1580>]

To learn more about the API, take a look at

Binder

Performance

Here are a few benchmarks. The "pure python" implementation seen in this benchmark is found at https://github.com/eseraygun/python-alignment.

+alphabet means using pyalign.problems.alphabetic instead of the simpler pyalign.problems.general to construct a problem.

+SIMD means feeding groups of equally-structured aligment problems into one solve call by using pyalign.problem.ProblemBatch - doing this will internally make use of AVX2 on Intel and Neon on ARM processors.

The following benchmarks were done on an Apple M1 Max. SIMD-128 refers to the M1's 128-bit SIMD.

The benchmark code can be found under benchmark.py.

traceback and path

traceback and path

Other Alignment Libraries

Here is a short overview of other libraries.

Nice General Purpose Implementations

For large scale / bioinformatics problems

What you will not find in pyalign:

  • SIMD acceleration for single pairs of sequences as in e.g. (Farrar 2007)
  • GPU acceleration, see e.g. (Barnes, 2020)
  • approximate or randomized algorithms
  • advanced preprocessing or indexing

If you need any of the above, you might want to take a look at:

Even more alignment libraries

References

Original Works

Altschul, S. (1998). Generalized affine gap costs for protein sequence alignment. Proteins: Structure, 32.

Gotoh, O. (1982). An improved algorithm for matching biological sequences. Journal of Molecular Biology, 162(3), 705–708. https://doi.org/10.1016/0022-2836(82)90398-9

Sankoff, D. (1972). Matching Sequences under Deletion/Insertion Constraints. Proceedings of the National Academy of Sciences, 69(1), 4–6. https://doi.org/10.1073/pnas.69.1.4

Smith, T. F., & Waterman, M. S. (1981). Identification of common molecular subsequences. Journal of Molecular Biology, 147(1), 195–197. https://doi.org/10.1016/0022-2836(81)90087-5

Miller, W., & Myers, E. W. (1988). Sequence comparison with concave weighting functions. Bulletin of Mathematical Biology, 50(2), 97–120. https://doi.org/10.1007/BF02459948

Needleman, S. B., & Wunsch, C. D. (1970). A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 48(3), 443–453. https://doi.org/10.1016/0022-2836(70)90057-4

Waterman, M. S., Smith, T. F., & Beyer, W. A. (1976). Some biological sequence metrics. Advances in Mathematics, 20(3), 367–387. https://doi.org/10.1016/0001-8708(76)90202-4

Waterman, M. S. (1984). Efficient sequence alignment algorithms. Journal of Theoretical Biology, 108(3), 333–337. https://doi.org/10.1016/S0022-5193(84)80037-5

Other Algorithms

Chakraborty, A., & Bandyopadhyay, S. (2013). FOGSAA: Fast Optimal Global Sequence Alignment Algorithm. Scientific Reports, 3(1), 1746. https://doi.org/10.1038/srep01746

Surveys

Aluru, S. (Ed.). (2005). Handbook of Computational Molecular Biology. Chapman and Hall/CRC. https://doi.org/10.1201/9781420036275

Stojmirović, A., & Yu, Y.-K. (2009). Geometric Aspects of Biological Sequence Comparison. Journal of Computational Biology, 16(4), 579–610. https://doi.org/10.1089/cmb.2008.0100

Kruskal, J. B. (1983). An Overview of Sequence Comparison: Time Warps, String Edits, and Macromolecules. SIAM Review, 25(2), 201–237. https://doi.org/10.1137/1025045

Müller, M. (2007). Information Retrieval for Music and Motion. Springer Berlin Heidelberg. https://doi.org/10.1007/978-3-540-74048-3

Implementations

Alpern, B., Carter, L., & Su Gatlin, K. (1995). Microparallelism and high-performance protein matching. Proceedings of the 1995 ACM/IEEE Conference on Supercomputing (CDROM) - Supercomputing ’95, 24-es. https://doi.org/10.1145/224170.224222

Barnes, R. (2020). A Review of the Smith-Waterman GPU Landscape. https://www2.eecs.berkeley.edu/Pubs/TechRpts/2020/EECS-2020-152.html

Farrar, M. (2007). Striped Smith-Waterman speeds database searches six times over other SIMD implementations. Bioinformatics, 23(2), 156–161. https://doi.org/10.1093/bioinformatics/btl582

Flouri, T., Kobert, K., Rognes, T., & Stamatakis, A. (2015). Are all global alignment algorithms and implementations correct? [Preprint]. Bioinformatics. https://doi.org/10.1101/031500

Rognes, T. (2011). Faster Smith-Waterman database searches with inter-sequence SIMD parallelisation. BMC Bioinformatics, 12(1), 221. https://doi.org/10.1186/1471-2105-12-221

Rudnicki, W. R., Jankowski, A., Modzelewski, A., Piotrowski, A., & Zadrożny, A. (2009). The new SIMD Implementation of the Smith-Waterman Algorithm on Cell Microprocessor. Fundamenta Informaticae, 96(1–2), 181–194. https://doi.org/10.3233/FI-2009-173

Tran, T. T., Liu, Y., & Schmidt, B. (2016). Bit-parallel approximate pattern matching: Kepler GPU versus Xeon Phi. 26th International Symposium on Computer Architecture and High Performance Computing, 54, 128–138. https://doi.org/10.1016/j.parco.2015.11.001

pyalign's People

Contributors

bertsky avatar dobatymo avatar poke1024 avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar

pyalign's Issues

view pdoc output

Although it would be possible to dive into the docstrings on the Python interpreter, I would very much like to view the generated API documentation (HTML output).

Now, the Readme just references the Jupyter notebook – which is great BTW, but often crashes, and does not provide a top-down overview of the API.

There seems to be a configuration already under pdoc/make.sh, which IIUC uses pdoc3 (not pdoc). But I cannot get it to run under Python 3.8 (despite installing pdoc3 plus everything I can find under environment.yml):

Traceback (most recent call last):
  File "/pdoc/__init__.py", line 222, in import_module
    module = importlib.import_module(module_path)
  File "/usr/lib/python3.8/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1014, in _gcd_import
  File "<frozen importlib._bootstrap>", line 991, in _find_and_load
  File "<frozen importlib._bootstrap>", line 975, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 671, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 783, in exec_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "/pyalign/__init__.py", line 13, in <module>
    from .solve import *
  File "/pyalign/solve.py", line 342, in <module>
    class MatrixForm:
  File "/pyalign/solve.py", line 343, in MatrixForm
    _solvers = solver_variants("solve")
  File "/pyalign/solve.py", line 323, in solver_variants
    (algorithm.Detail.SCORE, algorithm.Count.ONE, None, "optimal"): (
AttributeError: 'NoneType' object has no attribute 'Detail'

So I guess the issue for me is both,

  • how do I get the pdoc generator running, and
  • would you consider hosting this via CI / GH pages?

Installation errors

When installing to virtual environment via pip (i.e., "pip install pyalign"), the following error appears:

ValueError: path '/home/runner/work/pyalign/pyalign/pyalign/algorithm/module.cpp' cannot be absolute

When trying to install via github (i.e., "pip install -e git+https://github.com/poke1024/pyalign#egg=pyalign"), the following error appears:

error: Support for editable installs via PEP 660 was recently introduced
in setuptools. If you are seeing this error, please report to:

  https://github.com/pypa/setuptools/issues

Any ideas on how to address this issue?

@poke1024

License clarification

Hi!

The included LICENSE text is MIT, but the PyPI metadata in setup.py says GPLv2. Which is the correct license?

Thanks!

Installation via pip on Python 3.9 and newer fails with fatal error: xtensor/xarray.hpp: No such file or directory

Hi there,

first of, this is an excellent library. I am utilizing it for a research project and had no issues installing and running it on Kaggle. Recently, however, Kaggle apparently upgraded to Python 3.10.10 and the installation no longer works. This is the error I'm getting when I run !pip install pyalign in a fresh notebook:

Collecting pyalign
  Using cached pyalign-0.4.4.tar.gz (50 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Requirement already satisfied: numpy~=1.19 in /opt/conda/lib/python3.10/site-packages (from pyalign) (1.23.5)
Collecting pybind11~=2.6.2 (from pyalign)
  Using cached pybind11-2.6.2-py2.py3-none-any.whl (191 kB)
Requirement already satisfied: cached-property in /opt/conda/lib/python3.10/site-packages (from pyalign) (1.5.2)
Collecting py-cpuinfo==8.0.0 (from pyalign)
  Using cached py_cpuinfo-8.0.0-py3-none-any.whl
Collecting pymorton==1.0.5 (from pyalign)
  Using cached pymorton-1.0.5-py2.py3-none-any.whl (4.4 kB)
Collecting pyyaml==6.0 (from pyalign)
  Using cached PyYAML-6.0-cp310-cp310-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (682 kB)
Collecting cpufeature==0.2.0 (from pyalign)
  Using cached cpufeature-0.2.0-cp310-cp310-linux_x86_64.whl
Building wheels for collected packages: pyalign
  Building wheel for pyalign (pyproject.toml) ... error
  error: subprocess-exited-with-error
  
  × Building wheel for pyalign (pyproject.toml) did not run successfully.
  │ exit code: 1
  ╰─> [47 lines of output]
      running bdist_wheel
      running build
      running build_py
      creating build
      creating build/lib.linux-x86_64-cpython-310
      creating build/lib.linux-x86_64-cpython-310/pyalign
      copying pyalign/cpu.py -> build/lib.linux-x86_64-cpython-310/pyalign
      copying pyalign/__init__.py -> build/lib.linux-x86_64-cpython-310/pyalign
      copying pyalign/simple.py -> build/lib.linux-x86_64-cpython-310/pyalign
      copying pyalign/solve.py -> build/lib.linux-x86_64-cpython-310/pyalign
      copying pyalign/gaps.py -> build/lib.linux-x86_64-cpython-310/pyalign
      copying pyalign/utils.py -> build/lib.linux-x86_64-cpython-310/pyalign
      copying pyalign/_version.py -> build/lib.linux-x86_64-cpython-310/pyalign
      creating build/lib.linux-x86_64-cpython-310/pyalign/tests
      copying pyalign/tests/test_nw.py -> build/lib.linux-x86_64-cpython-310/pyalign/tests
      copying pyalign/tests/test_gotoh.py -> build/lib.linux-x86_64-cpython-310/pyalign/tests
      copying pyalign/tests/__init__.py -> build/lib.linux-x86_64-cpython-310/pyalign/tests
      copying pyalign/tests/test_sw.py -> build/lib.linux-x86_64-cpython-310/pyalign/tests
      copying pyalign/tests/test_batch.py -> build/lib.linux-x86_64-cpython-310/pyalign/tests
      copying pyalign/tests/test_wsb.py -> build/lib.linux-x86_64-cpython-310/pyalign/tests
      creating build/lib.linux-x86_64-cpython-310/pyalign/io
      copying pyalign/io/alignment.py -> build/lib.linux-x86_64-cpython-310/pyalign/io
      copying pyalign/io/__init__.py -> build/lib.linux-x86_64-cpython-310/pyalign/io
      copying pyalign/io/plot.py -> build/lib.linux-x86_64-cpython-310/pyalign/io
      creating build/lib.linux-x86_64-cpython-310/pyalign/problems
      copying pyalign/problems/function.py -> build/lib.linux-x86_64-cpython-310/pyalign/problems
      copying pyalign/problems/__init__.py -> build/lib.linux-x86_64-cpython-310/pyalign/problems
      copying pyalign/problems/factory.py -> build/lib.linux-x86_64-cpython-310/pyalign/problems
      copying pyalign/problems/instance.py -> build/lib.linux-x86_64-cpython-310/pyalign/problems
      creating build/lib.linux-x86_64-cpython-310/pyalign/algorithm
      copying pyalign/algorithm/__init__.py -> build/lib.linux-x86_64-cpython-310/pyalign/algorithm
      running build_ext
      building 'pyalign.algorithm.native.algorithm' extension
      creating build/temp.linux-x86_64-cpython-310
      creating build/temp.linux-x86_64-cpython-310/tmp
      creating build/temp.linux-x86_64-cpython-310/tmp/pip-install-vv6368xe
      creating build/temp.linux-x86_64-cpython-310/tmp/pip-install-vv6368xe/pyalign_70ad6a0542354aa5a94076826e635912
      creating build/temp.linux-x86_64-cpython-310/tmp/pip-install-vv6368xe/pyalign_70ad6a0542354aa5a94076826e635912/pyalign
      creating build/temp.linux-x86_64-cpython-310/tmp/pip-install-vv6368xe/pyalign_70ad6a0542354aa5a94076826e635912/pyalign/algorithm
      gcc -pthread -B /opt/conda/compiler_compat -Wno-unused-result -Wsign-compare -DNDEBUG -fwrapv -O2 -Wall -fPIC -O2 -isystem /opt/conda/include -fPIC -O2 -isystem /opt/conda/include -fPIC -I/tmp/pip-build-env-lzyua9kw/overlay/lib/python3.10/site-packages/numpy/core/include -I/tmp/pip-install-vv6368xe/pyalign_70ad6a0542354aa5a94076826e635912 -I/tmp/pip-build-env-lzyua9kw/overlay/lib/python3.10/site-packages/pybind11/include -I/opt/conda/include/python3.10 -c /tmp/pip-install-vv6368xe/pyalign_70ad6a0542354aa5a94076826e635912/pyalign/algorithm/module.cpp -o build/temp.linux-x86_64-cpython-310/tmp/pip-install-vv6368xe/pyalign_70ad6a0542354aa5a94076826e635912/pyalign/algorithm/module.o -fvisibility=hidden -g0 -std=c++17 -O3 -ftemplate-backtrace-limit=0 -march=native
      In file included from /tmp/pip-install-vv6368xe/pyalign_70ad6a0542354aa5a94076826e635912/pyalign/algorithm/pyalign.h:4,
                       from /tmp/pip-install-vv6368xe/pyalign_70ad6a0542354aa5a94076826e635912/pyalign/algorithm/module.cpp:3:
      /tmp/pip-install-vv6368xe/pyalign_70ad6a0542354aa5a94076826e635912/pyalign/algorithm/common.h:6:10: fatal error: xtensor/xarray.hpp: No such file or directory
          6 | #include <xtensor/xarray.hpp>
            |          ^~~~~~~~~~~~~~~~~~~~
      compilation terminated.
      error: command '/usr/bin/gcc' failed with exit code 1
      [end of output]
  
  note: This error originates from a subprocess, and is likely not a problem with pip.
  ERROR: Failed building wheel for pyalign
Failed to build pyalign
ERROR: Could not build wheels for pyalign, which is required to install pyproject.toml-based projects

I'm gettin the exact same error in a local environment running Python 3.9. Installation in 3.8.10 works fine though. Conda is not an option since we need to install the library in an existing project based on Poetry/PyPy.

Any help would be greatly appreciated. Thank you!

running from repo as working directory fails

If I import pyalign while still inside the source directory, I get the following:

RuntimeError: none of ['pyalign.algorithm.native', 'pyalign.algorithm.intel_avx2', 'pyalign.algorithm.apple_m1', 'pyalign.algorithm.generic'] is available

(The problem goes away when cding somewhere else.)

Using vectors instead of characters

I would like to use this outside of bioinformatics where for each character, it is a vector (np.ndarray) and distance function for computing the "distance" between vectors.
All your examples using strings, I was interested if this is possible with pyalign?

Is there a way to change penalty for leading/trailing gaps?

I am hoping to find a way to ignore leading/trailing gaps when doing a local alignment, is that possible? I didn't see anything in the docs so far. If not, could you give me an example of how to access the actual alignment path? I can see you can generate a diagram but it would be useful to know how to get at the path object so I can post-process remove the trailing/leading mismatches as gaps? alternatively can i use a non-constant mismatch penalty?

notebook crashes

I have not been able to isolate the cause, but the example notebook keeps crashing for me (with the kernel disconnecting after some steps).

It looks somewhat racy/random, but I managed to make the whole thing run through by commenting the .alignment accessor in 15, 22, 23 and the .solve() call in 29.

Isn't there supposed to be a preview available with all the results/images precomputed which can also be exported?

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.