By default, the d value is set to 1, to produce high-resolution OTUs, even for the densest regions of the amplicon-space. We can safely assume that d = 1 will be by the most frequent choice in future studies.
There are several ways to speed up the pairwise comparisons when searching only for amplicons with one difference.
The most obvious way is to compare the strings character by character. Compare the first characters of each sequence, if they match, compare the second characters of each sequence, etc. On the first mismatch, record the coordinate of the mismatch and break. Start the comparisons from the end of the sequences, until a mismatch is found. If the mismatches occur at the same coordinates, then we have a single mutation between the sequences.
Of course, there is no need to compare the sequences if their length difference is greater than 1.
For sequences of length n (+- 1), the number of character comparisons to do is at most n, which should be faster than global pairwise alignment. Here is an implementation as a python function:
def pairwise_alignment(seed, candidates):
"""
Fast pairwise alignment (search only for sequences with a single
mutation)
"""
selected = list()
seed_length = len(seed)
for candidate in candidates:
candidate_length = len(candidate)
# Eliminate sequences too long or too short
if abs(seed_length - candidate_length) > 1:
continue
# Identify the longest sequence (in practice, turn all
# insertion cases into deletions)
if seed_length >= candidate_length:
query, subject = seed, candidate
else:
query, subject = candidate, seed
# Compare the sequences character by character
length = len(query)
stop_forward = length
for i in xrange(0, length - 1, 1):
if query[i] != subject[i]:
stop_forward = i
break
stop_reverse = stop_forward
for i in xrange(1, length - stop_forward, 1):
if query[-i] != subject[-i]:
stop_reverse = length - i
break
# Do we detect a single mutation or insertion-deletion?
if stop_forward == stop_reverse:
selected.append(candidate)
return selected
The function arguments are the seed sequence and a list of candidate sequences to test. It is 40 times faster than a pure python Needleman-Wunsch function.