Pinging @benjward @CiaranOMara perhaps this discussion should be part of #102 , but I do feel like the scope is a little broader here.
The situation
We currently are working on multiple different "subsequence" like types:
- Ben is (has been?) working on NTupleKmers
- I have a working prototype of a sequence view, similar to an array view (link coming soon)
- I'm playing around with 256-bit kmers
This comes on top of the already pretty large subsequence codebase:
- LongSequence's copy-on-write (COW) mechanism that allows a LongSequence to actually be a subsequence
- The recently revamped Mer types.
So I'm concerned if not done right, we will end up with an enormous unelegant codebase that might still leave some bases uncovered. So: What do we really need? As I see it, we really just need:
- An arbitrary-length subsequence, preferrably allocation-free.
- An efficient type for kmers. Ideally, this would be flexible in size as well, but in my experience, kmers with k > 127 are rarely used.
My proposal
In this issue, I make two arguments:
Argument one: We should move completely to either NTupleKmers if they work, else N-bit kmers
While a maximum of K = 32 is good enough for many purposes, the usage of longer kmers is common enough that it's an awkward ceiling. By leveraging BitIntegers.jl, we could change the Kmer code to wrap an integer up to 1024-bit integers. Essentially its structure would be:
struct Kmer{K, A <: Alphabet, T <: Unsigned}
data::T
end
For longer kmers like 51-mers (common in metagenomics), 127-mers (sometimes used in assembly), and perhaps even higher values than that, a truly variable-length kmer type like NTupleKmers is obviously best. Furthermore, I personally find it likely that NTuples will be even further optimized by the Julia core devs in the future since they are integral to immutable arrays.
However, NTupleKmers may be very difficult to write efficient code for. If that is the case, we can simplify the future codebase a great deal by having our kmers simply wrap a BitInteger type which is already aggressively optimized by LLVM. I am working on a proof of concept for this for show. Unfortunately, it adds another dependency for BioSequences.
- Proposal 1: Remove our current Kmer types. If NTupleKmers work well, use those solely. Otherwise, add BitIntegers.jl as a dependency, and make a kmer type that parameterizes the integer type that stores its data
Argument two: We should remove the COW mechanism of LongSequence
I realize that this is a little controversial and require large changes to the codebase. But I do think it's the right direction to move. The COW mechanism allows sequences to be views, which is neat if users creates tonnes of subsequences.
However, my experience working with BioSequences.jl is that COW does make the codebase significantly more complicated. Every single indexing or iteration function that operates on the underlying data must take into account that the data may not begin at 1. In principle, this should not matter too much because it's hidden from the user. But realistically, at least in the near term, many of the heavy users of BioSequences will be devs themselves, and would like to tinker with BioSequence internals. That is, after all, one of the promises of Julia, it lets you tinker!
Furthermore, the benefit is eroded when 1.5 comes, since these subsequences will still allocate (because LongSequence
is mutable) in a way a sequence view won't.
For the large majority of users, allocating a new LongSequence when slicing it will have completely fine performance - my current work for example, typically involves operations on max a few 100s of small sequences. But these users may still suffer from the COW mechanism if the difficulty of developing causes more bugs, e.g. #104 #110 . For the few ones which want great performance, the COW mechanism might not be performant enough. For the developers, it causes headaches.
- Proposal 2: Remove the COW mechanism from LongSequence and introduce a
SeqView
-like object. Slicing a LongSequence will create a copy. The structure of a LongSequence is a mutable struct with the data vector and a single integer giving the length.