
swarm's Introduction

Hi there 👋

  • 🔭 I’m working on improving the tools VSEARCH and Swarm for microbiome bioinformatics.
  • 🤔 I’m often thinking about new ideas for how we can use clever algorithms and data structures to improve bioinformatics tools.
  • I'm teaching the course IN4030 Introduction to bioinformatics at the University of Oslo every spring semester.
  • In the fall of 2024, I'll be organizing the course IN-BIOS5000/9000 Sequencing technologies, data analysis, and applications.
  • I'm also involved in various projects at Oslo University Hospital (OUS), including Publika, a publication database for OUS.

swarm's People

Contributors

etienneplatini, frederic-mahe, lczech, mooreryan, torognes, vmikk

swarm's Issues

Swarm radius in statistics should be multiplied by d

The radius value output is a small integer. It represents the number of generations, or iterations, it took for the swarm to reach its maximum extent. Multiplied by the parameter d, it gives the maximum number of differences between the seed and the edge of the swarm. For example, with d = 2 and a radius of 5 generations, the edge of the swarm is at most 10 differences away from the seed.

I think that the radius value should be multiplied by d by default.

Refine clustering

When running swarm_breaker I get an error I cannot figure out. I would be grateful for help:

$ python ~/kode/swarm/scripts/swarm_breaker.py -f All_reads_U.fasta -s amplicons.swarms > amplicons.swarms2
Traceback (most recent call last):
File "/home/andersl/kode/swarm/scripts/swarm_breaker.py", line 315, in
main()
File "/home/andersl/kode/swarm/scripts/swarm_breaker.py", line 304, in main
swarm_breaker(binary, all_amplicons, swarms, threshold)
File "/home/andersl/kode/swarm/scripts/swarm_breaker.py", line 243, in swarm_breaker
graph_raw_data = run_swarm(binary, all_amplicons, amplicons, threshold)
File "/home/andersl/kode/swarm/scripts/swarm_breaker.py", line 127, in run_swarm
close_fds=True)
File "/usr/lib/python2.7/subprocess.py", line 679, in init
errread, errwrite)
File "/usr/lib/python2.7/subprocess.py", line 1249, in _execute_child
raise child_exception
OSError: [Errno 2] No such file or directory
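
For what it's worth, [Errno 2] from subprocess usually means the program to be executed, here the swarm binary, could not be found, rather than a problem with the input files. A quick, hypothetical pre-flight check (the find_executable helper below is illustrative, not part of swarm_breaker.py):

import os

def find_executable(name):
    # Walk PATH and return the first executable match, or None.
    for directory in os.environ.get("PATH", "").split(os.pathsep):
        candidate = os.path.join(directory, name)
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    return None

if find_executable("swarm") is None:
    raise SystemExit("swarm binary not found on PATH")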

Unstable order of amplicons in swarms with new approach for d=1 with >1 thread

With the new approach for d=1, the order of amplicons within each swarm may be unstable when running with more than one thread; the swarms themselves and the amplicons they contain are still the same, only the order varies. This is because the list of subseeds for each seed is not in the order of the input file. To fix this we could sort the subseeds before proceeding to the next (sub)seed, as sketched below.
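
A minimal sketch of that fix, assuming a (hypothetical) input_index mapping from amplicon names to their positions in the input file:

def stable_subseeds(subseeds, input_index):
    # Restore input-file order so that multi-threaded runs emit
    # amplicons in the same deterministic order as a single thread.
    return sorted(subseeds, key=lambda amplicon: input_index[amplicon])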

Add the OTU number to the output of the -b option

Géraldine Pascal from Toulouse (France) suggested adding the OTU number to the output of the -b option. Each line would then contain a new column with a positive integer representing the OTU number in the swarm results: 1 for the first OTU, 2 for the second, etc.

That would facilitate the filtering of the -b output, allowing for an easy extraction of all amplicon pairs belonging to a given OTU. Implementation would require a small modification of swarm_breaker.py, if the script is still necessary for OTU breaking.
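
In the meantime, the OTU number can be reconstructed after the fact. A hedged sketch, assuming the .swarms file lists one OTU per line and shares amplicon names with the -b output:

def annotate_pairs(swarms_path, pairs_path):
    # Map each amplicon name to the 1-based number of its OTU.
    otu_of = {}
    with open(swarms_path) as swarms:
        for otu_number, line in enumerate(swarms, start=1):
            for amplicon in line.split():
                otu_of[amplicon] = otu_number
    # Append that number as a new column to each "@@ father son diffs" line.
    with open(pairs_path) as pairs:
        for line in pairs:
            fields = line.split()
            fields.append(str(otu_of[fields[1]]))
            print("\t".join(fields))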

Check for unique sequences

Check whether sequences are unique and handle any duplicates, for example by computing an MD5, SHA-1 or similar hash of each sequence.
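
A minimal sketch of such a check; hashing keeps the memory footprint small, whereas keying the dictionary on the sequence itself would rule out (astronomically unlikely) hash collisions:

import hashlib

def report_duplicates(entries):
    # entries: iterable of (name, sequence) pairs.
    seen = {}
    for name, sequence in entries:
        digest = hashlib.sha1(sequence.upper().encode()).hexdigest()
        if digest in seen:
            print("%s duplicates %s" % (name, seen[digest]))
        else:
            seen[digest] = name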

Allow ascii \x01 in headers

Allow the \x01 character (^A) (SOH) in headers. It is used by NCBI to separate entries in the FASTA headers of the NR and NT databases.
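
For illustration, a header containing several concatenated entries can be split back into its parts; a trivial sketch:

def split_ncbi_header(header):
    # NCBI NR/NT concatenate several entries into one FASTA header,
    # separated by the SOH control character (\x01).
    return header.split("\x01")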

Segmentation fault (or SIGABRT) with -a

With several test datasets, I get a segmentation fault at the end of the clustering process when using the -a option. A gdb session shows

gdb swarm
...
run -a -b -d 1 < AF091148.fas
...
Number of swarms:  887
Largest swarm:     441
Max generations:   186

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffff67f3700 (LWP 24981)]
0x00007ffff702d0c0 in __GI___call_tls_dtors () at cxa_thread_atexit_impl.c:83
83  cxa_thread_atexit_impl.c: No such file or directory.
(gdb) backtrace 
#0  0x00007ffff702d0c0 in __GI___call_tls_dtors () at cxa_thread_atexit_impl.c:83
#1  0x00007ffff7bc70b2 in start_thread (arg=0x7ffff67f3700) at pthread_create.c:319
#2  0x00007ffff70dac2d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

If I do the same using 2 threads, I obtain a different error:

gdb swarm
...
run -a -b -d 1 < AF091148.fas
...
[Thread 0x7ffff5ff2700 (LWP 25025) exited]
*** Error in `/home/fred/Science/Projects/Swarms/swarm/swarm': munmap_chunk(): invalid pointer: 0x00000000006403d0 ***
[Thread 0x7ffff57f1700 (LWP 25026) exited]

Program received signal SIGABRT, Aborted.
0x00007ffff702a077 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
56  ../nptl/sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb) backtrace 
#0  0x00007ffff702a077 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
#1  0x00007ffff702b458 in __GI_abort () at abort.c:89
#2  0x00007ffff7067fb4 in __libc_message (do_abort=do_abort@entry=1, fmt=fmt@entry=0x7ffff715abc0 "*** Error in `%s': %s: 0x%s ***\n") at ../sysdeps/posix/libc_fatal.c:175
#3  0x00007ffff706d78e in malloc_printerr (action=1, str=0x7ffff715abe8 "munmap_chunk(): invalid pointer", ptr=<optimized out>) at malloc.c:4996
#4  0x000000000040b76c in hash_free () at algod1.cc:142
#5  algo_d1_run () at algod1.cc:727
#6  0x000000000040169a in main (argc=<optimized out>, argv=0x7fffffffe2e8) at swarm.cc:428
(gdb) frame 4
#4  0x000000000040b76c in hash_free () at algod1.cc:142
142   free(hash_occupied);
(gdb) print hash_occupied 
$1 = (unsigned char *) 0x6403d0 "J\001"

Linearization awk code prints last line twice

The provided awk code to linearize fasta files prints the last line twice.

awk 'NR==1 {print ; next} {printf /^>/ ? "\n"$0"\n" : $1} END {print}' amplicons.fasta > amplicons_linearized.fasta

doesn't work for me, while:

awk 'NR==1 {print ; next} {printf /^>/ ? "\n"$0"\n" : $1}' amplicons.fasta > amplicons_linearized.fasta

is fine. The duplication happens because END {print} prints the current record ($0) one more time, and at end of input $0 still holds the last sequence line. Note that without the END block the output no longer ends with a newline; END {printf "\n"} would restore it.

Easy test:

>1
AAACCC
>2
GGGTTT

with original code produces:

>1
AAACCC
>2
GGGTTTGGGTTT

with fixed code produces:

>1
AAACCC
>2
GGGTTT

Tested with GNU awk 4.0.1 and 4.1.1

swarm_breaker.py sometimes hangs forever

The Python script uses subprocess to run swarm on sub-parts of the fasta dataset. The function wait() sometimes hangs forever, without doing anything. The bug is known to be triggered when using pipes to redirect stdout and/or stderr of the subprocess: when the amount of data exceeds the capacity of the OS pipe buffer (about 64 KiB), the deadlock happens.

One recommended solution is to write stdout and stderr to files. This is what we do here, so I don't understand why the bug happens.
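
For reference, the subprocess documentation recommends Popen.communicate() when output is captured through pipes, since it drains stdout and stderr while waiting and so avoids this deadlock. A minimal sketch (the swarm command line is illustrative):

import subprocess

# wait() can deadlock once the child fills the pipe buffer;
# communicate() keeps reading both streams while waiting instead.
process = subprocess.Popen(["swarm", "-d", "1", "amplicons.fasta"],
                           stdout=subprocess.PIPE,
                           stderr=subprocess.PIPE)
stdout, stderr = process.communicate()
if process.returncode != 0:
    raise RuntimeError(stderr)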

Output detailed statistics for each swarm

Add an option to output the following swarm statistics to a separate file:

  • maximum radius of cluster
  • number of unique amplicons
  • number of copies
  • name of the seed
  • seed abundance
  • number of singletons (amplicons with an abundance of 1)

Fix bug in affine alignment backtracking

More information needs to be saved during the forward alignment in order to backtrack the correct alignment when affine gap penalties are used: each cell of the dynamic-programming matrix can be reached from three states (match/mismatch, gap in the query, gap in the subject), and the traceback must know which state produced the stored value. Otherwise the identified alignment may not be optimal.

Chimera checking and swarm

When clustering with swarm, when is an appropriate time to check for chimeras?

Developers of previous OTU picking algorithms have recommended chimera checking at different stages. For example, Robert Edgar recommends using UCHIME to remove chimeras after OTU picking with UPARSE. His recommendation is based, in part, on the chimera removal heuristics built into UPARSE. When using UCLUST, he recommended reference-based chimera checking before OTU picking, so that OTUs were not influenced by chimeras.

When using swarm, when and how should we attempt to remove chimeric sequences?

Progress Indication

If it's possible given the implementation, some kind of progress indication would be very helpful. Maybe a count of sequences left unassigned to an OTU?

Can the stats file be used in any way to determine progress? We've had a job clustering 5,358,959,072 nucleotides in 12,783,921 (unique) sequences on 16 cores with 16 GB of RAM for around a week now, and knowing when it may complete would give us some peace of mind. It's chugging along quite nicely at 1000% CPU usage (according to top), and has survived where cd-hit and uclust would have filled up the RAM and forced the process into swap space.

Store sequences using 2 bits

Store sequences using 2 bits instead of 8 bits per nucleotide. Will allow larger datasets. Impact on performance uncertain.
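
As a rough illustration of the idea (the real implementation would be C++ operating on packed byte arrays; this sketch only shows the encoding):

def encode_2bit(sequence):
    # Pack a nucleotide string into a single integer, 2 bits per base:
    # A=00, C=01, G=10, T=11. Quarters the memory needed per sequence.
    # Ambiguity codes such as N would need extra handling.
    code = {"A": 0, "C": 1, "G": 2, "T": 3}
    packed = 0
    for nucleotide in sequence.upper():
        packed = (packed << 2) | code[nucleotide]
    return packed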

Downstream analysis on otutable or biom file

Hi,

What would be the best approach to continue with statistics (in R/phyloseq) after swarm clustering? Mapping raw reads to the reference sequences using vsearch and the -usearch_global option? If I were to use the mothur output, a count_table would still be needed to create a shared/biom file. Or is there another way to use the abundance information, in the stats file for example, to arrive at a table of OTU counts?

Thanks!

Bug in the -b option output (critical)

That option is important for the swarm breaking. In its present form, the output of the option is wrong, which gives wrong breaking results. The -b output is formatted as follows: @@ TAB father TAB son TAB number of differences.

As it is implemented now, the -b option always outputs the cluster's seed as the "father". Here is a simple example. This fasta file:

>s01_100
ACGTACGT
>s02_50
ACGTTCGT
>s03_25
ACGTTAGT
>s04_1
ACGTTAAT

...yields the following cluster: s01--s02--s03--s04

Number of swarms:  1
Largest swarm:     4
Max generations:   3

...but the -b option output is:

@@  s01 s02 1
@@  s01 s03 1
@@  s01 s04 1

...instead of:

@@  s01 s02 1
@@  s02 s03 1
@@  s03 s04 1

Support for usearch amplicon-abundance annotation style

Swarm assumes that in this type of fasta header:

>KR08766_2

The value 2 represents the abundance of that sequence (i.e. the number of times the sequence has been seen in the data set). Usearch uses a different annotation style:

>KR08766;size=1;
>KR08766;size=2

Swarm should support these two forms (with or without the terminal ;) to facilitate its use in uclust-based pipelines.
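
A sketch of a parser accepting both annotation styles (a hypothetical helper, not swarm's actual parsing code):

import re

def parse_abundance(header):
    # Accept ">name;size=123" and ">name;size=123;" (usearch style)
    # as well as ">name_123" (swarm's default style).
    header = header.lstrip(">").strip()
    match = re.match(r"^(.*);size=(\d+);?$", header)
    if match:
        return match.group(1), int(match.group(2))
    identifier, _, abundance = header.rpartition("_")
    return identifier, int(abundance)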

Add the number of differences in the output of the d option

In its present form, the -d option outputs pairs of amplicons that have at most d differences, and a dummy value "1". It would be much more interesting to replace that dummy value with the actual number of differences between the two amplicons.

Swarm radius values should be available via an option

It would be interesting to get the radius of each swarm. The radius is the maximum distance (or the minimum percentage of identity) between the seed of the swarm and any other amplicon assigned to the swarm.

Sort by abundance

Add an option to sort the input sequences by their abundance, when abundance annotations are present; see the sketch below.
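
A minimal sketch of such an option, assuming the >name_abundance convention:

def sort_by_abundance(entries):
    # entries: list of (header, sequence) pairs with ">name_abundance"
    # headers; sort by decreasing abundance, as swarm expects.
    def abundance(entry):
        return int(entry[0].rsplit("_", 1)[1])
    return sorted(entries, key=abundance, reverse=True)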

Exact string matching strategy (special case d = 1)

When d = 1, how many micro-variants can a sequence have?

At most 7L + 4, where L is the length of the seed.

A micro-variant is a sequence differing from the seed sequence by only one substitution, insertion or deletion. The number of possible micro-variants is surprisingly low, and it is reasonably fast to generate all possible sequences. If we store our fasta entries as a mapping of sequences to amplicon names and abundances, we can turn swarming into a problem of exact string matching.

It works as follows: parse the entire dataset and store it in a mapping structure. For a given seed sequence, produce all of its unique micro-variants. Check whether these micro-variants are in the mapping structure; if so, add them to the swarm as sub-seeds and mark them as swarmed.

The complexity of the computation now depends on how fast the mapping structure can be queried, and how fast we can create the micro-variants sets. The micro-variant sets have to be created for all amplicons (9L + 4 operations), and at most 7L + 4 look-ups are necessary for each amplicon. Depending on the complexity of look-ups, the global complexity could be in O(n log n).
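
Generating the micro-variants themselves is straightforward; a set-based sketch, in which duplicate variants collapse automatically:

def micro_variants(sequence):
    # All sequences at exactly one substitution, deletion or
    # insertion from the input.
    variants = set()
    for i in range(len(sequence)):
        for nucleotide in "ACGT":
            # substitution at position i
            variants.add(sequence[:i] + nucleotide + sequence[i + 1:])
            # insertion before position i
            variants.add(sequence[:i] + nucleotide + sequence[i:])
        # deletion of position i
        variants.add(sequence[:i] + sequence[i + 1:])
    for nucleotide in "ACGT":
        # insertions at the very end
        variants.add(sequence + nucleotide)
    variants.discard(sequence)
    return variants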

A pure Python implementation is only 10 times slower than swarm (C++) on a dataset containing 77,693 unique sequences of average length 129 bp: 25 s for swarm, 240 s for swarm.py. I will run tests on much larger datasets to see if the new implementation outperforms the C++ implementation when n increases.

Accelerate pairwise comparisons for the special case d = 1

By default, the d value is set to 1, to produce high-resolution OTUs, even for the densest regions of the amplicon space. We can safely assume that d = 1 will be by far the most frequent choice in future studies.

There are several ways to speed up the pairwise comparisons when searching only for amplicons with one difference.

The most obvious way is to compare the strings character by character: compare the first characters of each sequence; if they match, compare the second characters, and so on. On the first mismatch, record its coordinate and break. Then compare the sequences again starting from their ends, until a mismatch is found. If both scans stop at the same coordinate, there is a single mutation between the sequences.

Of course, there is no need to compare the sequences if their length difference is greater than 1.

For sequences of length n (± 1), the number of character comparisons is at most n, which should be faster than a global pairwise alignment. Here is an implementation as a Python function:

def pairwise_alignment(seed, candidates):
    """
    Fast pairwise alignment (search only for sequences with a single
    mutation)
    """
    selected = list()
    seed_length = len(seed)
    for candidate in candidates:
        candidate_length = len(candidate)
        # Eliminate sequences too long or too short
        if abs(seed_length - candidate_length) > 1:
            continue
        # Identify the longest sequence (in practice, turn all
        # insertion cases into deletions)
        if seed_length >= candidate_length:
            query, subject = seed, candidate
        else:
            query, subject = candidate, seed
        # Compare the sequences character by character
        length = len(query)
        stop_forward = length
        for i in xrange(0, length - 1, 1):
            if query[i] != subject[i]:
                stop_forward = i
                break
        stop_reverse = stop_forward
        for i in xrange(1, length - stop_forward, 1):
            if query[-i] != subject[-i]:
                stop_reverse = length - i
                break
        # Do we detect a single mutation or insertion-deletion?
        if stop_forward == stop_reverse:
            selected.append(candidate)
    return selected

The function arguments are the seed sequence and a list of candidate sequences to test. It is 40 times faster than a pure Python Needleman-Wunsch function.

Avoid SSE3 and SSSE3 dependency

The Swarm code is compiled with "-mssse3" in the Makefile. A few SSSE3 instructions are used in the alignment code if available; otherwise the code only requires SSE2, and the presence of SSE2 is tested for. However, it seems that the compiler also emits some SSE3 or SSSE3 instructions elsewhere. This results in termination of Swarm with "Illegal instruction" fatal errors on architectures that do not support all SSE3 or SSSE3 instructions. Two reports from users indicate this (both using pick_denovo_otus.py in QIIME).

The code should be recompiled with other compiler options in the Makefile to avoid this problem.

Adapt to AVX2

Adapt SWARM to AVX2 and the 256-bit registers available in the Intel Haswell CPUs released in June 2013. This should allow 32-way SIMD parallelisation.

Use swarm for dereplication (-d 0)

Rohit Kolora suggested adding the possibility to run swarm with -d 0 to dereplicate datasets.

The recommended tool for dereplication is vsearch --derep_fulllength. Implementing an efficient dereplication in swarm is not a priority for us, and would lead to code duplication. That said, it should be possible to implement some sort of (inefficient but functional) dereplication, simply by keeping the present code and allowing d values ranging from zero to 255. I guess the fastest way is to use the alternative algorithm (so the branching would be: alternative algorithm for d = 0 or 1, normal algorithm for higher d values).

It seems that users spontaneously try to run swarm -d 0. It has some logic to it, so we might as well try to make it possible (if it is not time consuming).
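
For completeness, functional dereplication needs nothing more than exact string matching; a minimal stand-in (the entries format is hypothetical):

def dereplicate(entries):
    # entries: iterable of (name, abundance, sequence). Merge strictly
    # identical sequences, summing their abundances; keep the first name.
    merged = {}
    for name, abundance, sequence in entries:
        if sequence in merged:
            merged[sequence][1] += abundance
        else:
            merged[sequence] = [name, abundance]
    return merged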
