
swarm's Introduction

Hi there 👋

  • 🔭 I’m working on improving the tools VSEARCH and Swarm for microbiome bioinformatics.
  • 🤔 I’m often thinking about new ideas for how we can use clever algorithms and data structures to improve bioinformatics tools.
  • I'm teaching the course IN4030 Introduction to bioinformatics at the University of Oslo every spring semester.
  • In the fall of 2024, I'll be organizing the course IN-BIOS5000/9000 Sequencing technologies, data analysis, and applications.
  • I'm also involved in various projects at Oslo University Hospital (OUS), including Publika, a publication database for OUS.

swarm's People

Contributors

etienneplatini, frederic-mahe, lczech, mooreryan, torognes, vmikk

swarm's Issues

Swarm radius in statistics should be multiplied by d

The radius value output is a small integer. It represents the number of generations, or iterations, it took for the swarm to reach its maximum extent. Multiplied by the parameter d, it gives the maximum number of differences between the seed and the edge of the swarm. For example, with d = 2 and a radius of 5 generations, the edge of the swarm is at most 10 differences away from the seed.

I think that the radius value should be multiplied by d by default.

Refine clustering

When running swarm_breaker I get an error I cannot figure out. I would be grateful for help:

$ python ~/kode/swarm/scripts/swarm_breaker.py -f All_reads_U.fasta -s amplicons.swarms > amplicons.swarms2
Traceback (most recent call last):
File "/home/andersl/kode/swarm/scripts/swarm_breaker.py", line 315, in
main()
File "/home/andersl/kode/swarm/scripts/swarm_breaker.py", line 304, in main
swarm_breaker(binary, all_amplicons, swarms, threshold)
File "/home/andersl/kode/swarm/scripts/swarm_breaker.py", line 243, in swarm_breaker
graph_raw_data = run_swarm(binary, all_amplicons, amplicons, threshold)
File "/home/andersl/kode/swarm/scripts/swarm_breaker.py", line 127, in run_swarm
close_fds=True)
File "/usr/lib/python2.7/subprocess.py", line 679, in init
errread, errwrite)
File "/usr/lib/python2.7/subprocess.py", line 1249, in _execute_child
raise child_exception
OSError: [Errno 2] No such file or directory
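
For what it's worth, [Errno 2] from subprocess usually means the program to be executed, here the swarm binary, could not be found, rather than a problem with the input files. A quick, hypothetical pre-flight check (the find_executable helper below is illustrative, not part of swarm_breaker.py):

import os

def find_executable(name):
    # Walk PATH and return the first executable match, or None.
    for directory in os.environ.get("PATH", "").split(os.pathsep):
        candidate = os.path.join(directory, name)
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    return None

if find_executable("swarm") is None:
    raise SystemExit("swarm binary not found on PATH")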

Unstable order of amplicons in swarms with new approach for d=1 with >1 thread

With the new approach for d=1, the order of amplicons within each swarm may be unstable when running with more than one thread; the swarms themselves and the amplicons they contain are still the same, only the order varies. This is because the list of subseeds for each seed is not in the order of the input file. To fix this we could sort the subseeds before proceeding to the next (sub)seed, as sketched below.
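
A minimal sketch of that fix, assuming a (hypothetical) input_index mapping from amplicon names to their positions in the input file:

def stable_subseeds(subseeds, input_index):
    # Restore input-file order so that multi-threaded runs emit
    # amplicons in the same deterministic order as a single thread.
    return sorted(subseeds, key=lambda amplicon: input_index[amplicon])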

Add the OTU number to the output of the -b option

Géraldine Pascal from Toulouse (France) suggested adding the OTU number to the output of the -b option. Each line would then contain a new column with a positive integer representing the OTU number in the swarm results: 1 for the first OTU, 2 for the second, etc.

That would facilitate the filtering of the -b output, allowing for an easy extraction of all amplicon pairs belonging to a given OTU. Implementation would require a small modification of swarm_breaker.py, if the script is still necessary for OTU breaking.
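
In the meantime, the OTU number can be reconstructed after the fact. A hedged sketch, assuming the .swarms file lists one OTU per line and shares amplicon names with the -b output:

def annotate_pairs(swarms_path, pairs_path):
    # Map each amplicon name to the 1-based number of its OTU.
    otu_of = {}
    with open(swarms_path) as swarms:
        for otu_number, line in enumerate(swarms, start=1):
            for amplicon in line.split():
                otu_of[amplicon] = otu_number
    # Append that number as a new column to each "@@ father son diffs" line.
    with open(pairs_path) as pairs:
        for line in pairs:
            fields = line.split()
            fields.append(str(otu_of[fields[1]]))
            print("\t".join(fields))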

Check for unique sequences

Check whether sequences are unique and handle any duplicates, for example by computing an MD5, SHA-1 or similar hash of each sequence.
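
A minimal sketch of such a check; hashing keeps the memory footprint small, whereas keying the dictionary on the sequence itself would rule out (astronomically unlikely) hash collisions:

import hashlib

def report_duplicates(entries):
    # entries: iterable of (name, sequence) pairs.
    seen = {}
    for name, sequence in entries:
        digest = hashlib.sha1(sequence.upper().encode()).hexdigest()
        if digest in seen:
            print("%s duplicates %s" % (name, seen[digest]))
        else:
            seen[digest] = name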

Allow ascii \x01 in headers

Allow the \x01 character (^A) (SOH) in headers. It is used by NCBI to separate entries in the FASTA headers of the NR and NT databases.
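
For illustration, a header containing several concatenated entries can be split back into its parts; a trivial sketch:

def split_ncbi_header(header):
    # NCBI NR/NT concatenate several entries into one FASTA header,
    # separated by the SOH control character (\x01).
    return header.split("\x01")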

Segmentation fault (or SIGABRT) with -a

With several test datasets, I get a segmentation fault at the end of the clustering process when using the -a option. A gdb session shows

gdb swarm
...
run -a -b -d 1 < AF091148.fas
...
Number of swarms:  887
Largest swarm:     441
Max generations:   186

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffff67f3700 (LWP 24981)]
0x00007ffff702d0c0 in __GI___call_tls_dtors () at cxa_thread_atexit_impl.c:83
83  cxa_thread_atexit_impl.c: No such file or directory.
(gdb) backtrace 
#0  0x00007ffff702d0c0 in __GI___call_tls_dtors () at cxa_thread_atexit_impl.c:83
#1  0x00007ffff7bc70b2 in start_thread (arg=0x7ffff67f3700) at pthread_create.c:319
#2  0x00007ffff70dac2d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

If I do the same using 2 threads, I obtain a different error:

gdb swarm
...
run -a -b -d 1 < AF091148.fas
...
[Thread 0x7ffff5ff2700 (LWP 25025) exited]
*** Error in `/home/fred/Science/Projects/Swarms/swarm/swarm': munmap_chunk(): invalid pointer: 0x00000000006403d0 ***
[Thread 0x7ffff57f1700 (LWP 25026) exited]

Program received signal SIGABRT, Aborted.
0x00007ffff702a077 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
56  ../nptl/sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb) backtrace 
#0  0x00007ffff702a077 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
#1  0x00007ffff702b458 in __GI_abort () at abort.c:89
#2  0x00007ffff7067fb4 in __libc_message (do_abort=do_abort@entry=1, fmt=fmt@entry=0x7ffff715abc0 "*** Error in `%s': %s: 0x%s ***\n") at ../sysdeps/posix/libc_fatal.c:175
#3  0x00007ffff706d78e in malloc_printerr (action=1, str=0x7ffff715abe8 "munmap_chunk(): invalid pointer", ptr=<optimized out>) at malloc.c:4996
#4  0x000000000040b76c in hash_free () at algod1.cc:142
#5  algo_d1_run () at algod1.cc:727
#6  0x000000000040169a in main (argc=<optimized out>, argv=0x7fffffffe2e8) at swarm.cc:428
(gdb) frame 4
#4  0x000000000040b76c in hash_free () at algod1.cc:142
142   free(hash_occupied);
(gdb) print hash_occupied 
$1 = (unsigned char *) 0x6403d0 "J\001"

Linearization awk code prints last line twice

The provided awk code to linearize fasta files prints the last line twice.

awk 'NR==1 {print ; next} {printf /^>/ ? "\n"$0"\n" : $1} END {print}' amplicons.fasta > amplicons_linearized.fasta

doesn't work for me, while:

awk 'NR==1 {print ; next} {printf /^>/ ? "\n"$0"\n" : $1}' amplicons.fasta > amplicons_linearized.fasta

is fine. The duplication happens because END {print} prints the current record ($0) one more time, and at end of input $0 still holds the last sequence line. Note that without the END block the output no longer ends with a newline; END {printf "\n"} would restore it.

Easy test:

>1
AAACCC
>2
GGGTTT

with original code produces:

>1
AAACCC
>2
GGGTTTGGGTTT

with fixed code produces:

>1
AAACCC
>2
GGGTTT

Tested with GNU awk 4.0.1 and 4.1.1

swarm_breaker.py sometimes hangs forever

The Python script uses subprocess to run swarm on sub-parts of the fasta dataset. The function wait() sometimes hangs forever, without doing anything. The bug is known to be triggered when using pipes to redirect stdout and/or stderr of the subprocess: when the amount of data exceeds the capacity of the OS pipe buffer (about 64 KiB), the deadlock happens.

One recommended solution is to write stdout and stderr to files. This is what we do here, so I don't understand why the bug happens.
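
For reference, the subprocess documentation recommends Popen.communicate() when output is captured through pipes, since it drains stdout and stderr while waiting and so avoids this deadlock. A minimal sketch (the swarm command line is illustrative):

import subprocess

# wait() can deadlock once the child fills the pipe buffer;
# communicate() keeps reading both streams while waiting instead.
process = subprocess.Popen(["swarm", "-d", "1", "amplicons.fasta"],
                           stdout=subprocess.PIPE,
                           stderr=subprocess.PIPE)
stdout, stderr = process.communicate()
if process.returncode != 0:
    raise RuntimeError(stderr)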

Output detailed statistics for each swarm

Add an option to output the following swarm statistics to a separate file:

  • maximum radius of cluster
  • number of unique amplicons
  • number of copies
  • name of the seed
  • seed abundance
  • number of singletons (amplicons with an abundance of 1)

Fix bug in affine alignment backtracking

More information needs to be saved during the forward alignment in order to backtrack the correct alignment when affine gap penalties are used: each cell of the dynamic-programming matrix can be reached from three states (match/mismatch, gap in the query, gap in the subject), and the traceback must know which state produced the stored value. Otherwise the identified alignment may not be optimal.

Chimera checking and swarm

When clustering with swarm, when is an appropriate time to check for chimeras?

Developers of previous OTU picking algorithms have recommended chimera checking at different stages. For example, Robert Edgar recommends using UCHIME to remove chimeras after OTU picking with UPARSE. His recommendation is based, in part, on the chimera removal heuristics built into UPARSE. When using UCLUST, he recommended reference-based chimera checking before OTU picking, so that OTUs were not influenced by chimeras.

When using swarm, when and how should we attempt to remove chimeric sequences?

Progress Indication

If it's possible given the implementation, some kind of progress indication would be very helpful. Maybe a count of sequences left unassigned to an OTU?

Can the stats file be used in any way to determine progress? We've had a job clustering 5,358,959,072 nucleotides in 12,783,921 (unique) sequences on 16 cores with 16 GB of RAM for around a week now, and knowing when it may complete would give us some peace of mind. It's chugging along quite nicely at 1000% CPU usage (according to top), and has survived where cd-hit and uclust would have filled up the RAM and forced the process into swap space.

Store sequences using 2 bits

Store sequences using 2 bits instead of 8 bits per nucleotide. Will allow larger datasets. Impact on performance uncertain.
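
As a rough illustration of the idea (the real implementation would be C++ operating on packed byte arrays; this sketch only shows the encoding):

def encode_2bit(sequence):
    # Pack a nucleotide string into a single integer, 2 bits per base:
    # A=00, C=01, G=10, T=11. Quarters the memory needed per sequence.
    # Ambiguity codes such as N would need extra handling.
    code = {"A": 0, "C": 1, "G": 2, "T": 3}
    packed = 0
    for nucleotide in sequence.upper():
        packed = (packed << 2) | code[nucleotide]
    return packed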

Downstream analysis on otutable or biom file

Hi,

What would be the best approach to continue with statistics (in R/phyloseq) after swarm clustering? Mapping raw reads to the reference sequences using vsearch and the -usearch_global option? If I were to use the mothur output, a count_table would still be needed to create a shared/biom file. Or is there another way to use the abundance information, in the stats file for example, to arrive at a table of OTU counts?

Thanks!

Bug in the -b option output (critical)

That option is important for the swarm breaking. In its present form, the output of the option is wrong, which gives wrong breaking results. The -b output is formatted as follows: @@ TAB father TAB son TAB number of differences.

As it is implemented now, the -b option always outputs the cluster's seed as the "father". Here is a simple example. This fasta file:

>s01_100
ACGTACGT
>s02_50
ACGTTCGT
>s03_25
ACGTTAGT
>s04_1
ACGTTAAT

...yields the following cluster: s01--s02--s03--s04

Number of swarms:  1
Largest swarm:     4
Max generations:   3

...but the -b option output is:

@@  s01 s02 1
@@  s01 s03 1
@@  s01 s04 1

...instead of:

@@  s01 s02 1
@@  s02 s03 1
@@  s03 s04 1

Support for usearch amplicon-abundance annotation style

Swarm assumes that in this type of fasta header:

>KR08766_2

The value 2 represents the abundance of that sequence (i.e. the number of times the sequence has been seen in the data set). Usearch uses a different annotation style:

>KR08766;size=1;
>KR08766;size=2

Swarm should support these two forms (with or without the terminal ;) to facilitate its use in uclust-based pipelines.
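
A sketch of a parser accepting both annotation styles (a hypothetical helper, not swarm's actual parsing code):

import re

def parse_abundance(header):
    # Accept ">name;size=123" and ">name;size=123;" (usearch style)
    # as well as ">name_123" (swarm's default style).
    header = header.lstrip(">").strip()
    match = re.match(r"^(.*);size=(\d+);?$", header)
    if match:
        return match.group(1), int(match.group(2))
    identifier, _, abundance = header.rpartition("_")
    return identifier, int(abundance)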

Add the number of differences in the output of the d option

In its present form, the -d option outputs pairs of amplicons that have at most d differences, and a dummy value "1". It would be much more interesting to replace that dummy value with the actual number of differences between the two amplicons.

Swarm radius values should be available via an option

It would be interesting to get the radius of each swarm. The radius is the maximum distance (or the minimum percentage of identity) between the seed of the swarm and any other amplicon assigned to the swarm.

Sort by abundance

Add an option to sort the input sequences by their abundance, when abundance annotations are present; see the sketch below.
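
A minimal sketch of such an option, assuming the >name_abundance convention:

def sort_by_abundance(entries):
    # entries: list of (header, sequence) pairs with ">name_abundance"
    # headers; sort by decreasing abundance, as swarm expects.
    def abundance(entry):
        return int(entry[0].rsplit("_", 1)[1])
    return sorted(entries, key=abundance, reverse=True)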

Exact string matching strategy (special case d = 1)

When d = 1, how many micro-variants can a sequence have?

At most 7L + 4, where L is the length of the seed.

A micro-variant is a sequence differing from the seed sequence by only one substitution, insertion or deletion. The number of possible micro-variants is surprisingly low, and it is reasonably fast to generate all possible sequences. If we store our fasta entries as a mapping of sequences to amplicon names and abundances, we can turn swarming into a problem of exact string matching.

It works as follows: parse the entire dataset and store it in a mapping structure. For a given seed sequence, produce all of its unique micro-variants. Check whether these micro-variants are in the mapping structure; if so, add them to the swarm as sub-seeds and mark them as swarmed.

The complexity of the computation now depends on how fast the mapping structure can be queried, and how fast we can create the micro-variants sets. The micro-variant sets have to be created for all amplicons (9L + 4 operations), and at most 7L + 4 look-ups are necessary for each amplicon. Depending on the complexity of look-ups, the global complexity could be in O(n log n).
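
Generating the micro-variants themselves is straightforward; a set-based sketch, in which duplicate variants collapse automatically:

def micro_variants(sequence):
    # All sequences at exactly one substitution, deletion or
    # insertion from the input.
    variants = set()
    for i in range(len(sequence)):
        for nucleotide in "ACGT":
            # substitution at position i
            variants.add(sequence[:i] + nucleotide + sequence[i + 1:])
            # insertion before position i
            variants.add(sequence[:i] + nucleotide + sequence[i:])
        # deletion of position i
        variants.add(sequence[:i] + sequence[i + 1:])
    for nucleotide in "ACGT":
        # insertions at the very end
        variants.add(sequence + nucleotide)
    variants.discard(sequence)
    return variants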

A pure Python implementation is only 10 times slower than swarm (C++) on a dataset containing 77,693 unique sequences of average length 129 bp: 25 s for swarm, 240 s for swarm.py. I will run tests on much larger datasets to see if the new implementation outperforms the C++ implementation when n increases.

Accelerate pairwise comparisons for the special case d = 1

By default, the d value is set to 1, to produce high-resolution OTUs, even for the densest regions of the amplicon space. We can safely assume that d = 1 will be by far the most frequent choice in future studies.

There are several ways to speed up the pairwise comparisons when searching only for amplicons with one difference.

The most obvious way is to compare the strings character by character: compare the first characters of each sequence; if they match, compare the second characters, and so on. On the first mismatch, record its coordinate and break. Then compare the sequences again starting from their ends, until a mismatch is found. If both scans stop at the same coordinate, there is a single mutation between the sequences.

Of course, there is no need to compare the sequences if their length difference is greater than 1.

For sequences of length n (± 1), the number of character comparisons is at most n, which should be faster than a global pairwise alignment. Here is an implementation as a Python function:

def pairwise_alignment(seed, candidates):
    """
    Fast pairwise alignment (search only for sequences with a single
    mutation)
    """
    selected = list()
    seed_length = len(seed)
    for candidate in candidates:
        candidate_length = len(candidate)
        # Eliminate sequences too long or too short
        if abs(seed_length - candidate_length) > 1:
            continue
        # Identify the longest sequence (in practice, turn all
        # insertion cases into deletions)
        if seed_length >= candidate_length:
            query, subject = seed, candidate
        else:
            query, subject = candidate, seed
        # Compare the sequences character by character
        length = len(query)
        stop_forward = length
        for i in xrange(0, length - 1, 1):
            if query[i] != subject[i]:
                stop_forward = i
                break
        stop_reverse = stop_forward
        for i in xrange(1, length - stop_forward, 1):
            if query[-i] != subject[-i]:
                stop_reverse = length - i
                break
        # Do we detect a single mutation or insertion-deletion?
        if stop_forward == stop_reverse:
            selected.append(candidate)
    return selected

The function arguments are the seed sequence and a list of candidate sequences to test. It is 40 times faster than a pure Python Needleman-Wunsch function.

Avoid SSE3 and SSSE3 dependency

The Swarm code is compiled with "-mssse3" in the Makefile. A few SSSE3 instructions are used in the alignment code if available; otherwise the code only requires SSE2, and the presence of SSE2 is tested for. However, it seems that the compiler also emits some SSE3 or SSSE3 instructions elsewhere. This results in termination of Swarm with "Illegal instruction" fatal errors on architectures that do not support all SSE3 or SSSE3 instructions. Two reports from users indicate this (both using pick_denovo_otus.py in QIIME).

The code should be recompiled with other compiler options in the Makefile to avoid this problem.

Adapt to AVX2

Adapt SWARM to AVX2 and the 256-bit registers available in the Intel Haswell CPUs released in June 2013. This should allow 32-way SIMD parallelisation.

Use swarm for dereplication (-d 0)

Rohit Kolora suggested adding the possibility to run swarm with -d 0 to dereplicate datasets.

The recommended tool for dereplication is vsearch --derep_fulllength. Implementing an efficient dereplication in swarm is not a priority for us, and would lead to code duplication. That said, it should be possible to implement some sort of (inefficient but functional) dereplication, simply by keeping the present code and allowing d values ranging from zero to 255. I guess the fastest way is to use the alternative algorithm (so the branching would be: alternative algorithm for d = 0 or 1, normal algorithm for higher d values).

It seems that users spontaneously try to run swarm -d 0. It has some logic to it, so we might as well try to make it possible (if it is not time consuming).
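
For completeness, functional dereplication needs nothing more than exact string matching; a minimal stand-in (the entries format is hypothetical):

def dereplicate(entries):
    # entries: iterable of (name, abundance, sequence). Merge strictly
    # identical sequences, summing their abundances; keep the first name.
    merged = {}
    for name, abundance, sequence in entries:
        if sequence in merged:
            merged[sequence][1] += abundance
        else:
            merged[sequence] = [name, abundance]
    return merged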
