Code Monkey home page Code Monkey logo

Comments (10)

jakobnissen avatar jakobnissen commented on June 15, 2024

Update: I've created a proof-of-concept kmer type here: https://github.com/jakobnissen/experimental_kmer

Features:

  • Allows arbitrary alphabets of the kmers, e.g. amino acid kmers
  • Kmers are backed by an Unsigned, with anywhere from UInt8 to UInt1024 (using BitIntegers), allowing up to 512-mers if using DNAAlphabet{2} (or 128-mers for amino acids etc)
  • Kmer iterators supports kmers to be created from all BioSequences, including other kmers
  • Kmer iterators are as fast or faster than existing iterators, e.g. 0.9 - 1.5 ns/iteration on my laptop, although it slows down for larger kmers. For 499-mers, it takes about 10 ns/iteration.

Take a look at it. If you like the design, I can work towards feature parity with the existing kmers, merge them into BioSequences.

from biosequences.jl.

jakobnissen avatar jakobnissen commented on June 15, 2024

Update:

  • I'm happy with my Kmer design now. It's still not up to feature parity (no skipmers), but I'm confident it's a solid base to build on.
  • I've added a SeqView to my experimental kmer repo. It's pretty simple, but I really like the design, and it's very fast and convenient. I'm not 100% sure about the design yet though. It doesn't have a lot of specialized methods yet - they are a bit harder to add, so I will only add them if you think the design is solid.
  • I've made a branch of BioSequences in my private fork where I've removed the COW. That was significantly easier than expected.

If you have time, then please:

  1. Check out the kmer design. Does it fit your needs? Is there anything the NTupleKmers could do that this design couldn't?
  2. Check out the SeqView.
  3. All tests on the no-COW branch pass. Therefore there is no longer an issue of taking the time to implement removing the COW. So it is purely a design decision. Are you comfortable removing the COW and relying solely on SeqView?

All type and function names etc are not final, it's just a placeholder.

from biosequences.jl.

TransGirlCodes avatar TransGirlCodes commented on June 15, 2024

I was pleased with the NTupleKmer experiment. I only stopped because of whatever shiny thing caught the corner of my eye... or lockdown fatigue... or just some other distraction. As far as I could tell it performed as well as, or better than a primitive type, never worse. There are a few performance gotchas because of bugs in julia right now, but every time I hit them they were already reported, documented, well understood how to work around, and on the way to a compiler fix. Which I guess is to be expected. A lot of bitwise stuff for those NTupleKmers that would involve BitIndex and whatnot for LongSequences, like a reverse complement, boiled down to some simple and pure tail-end recursion that generated pretty nice ASM. I used a UInt64 word size, and found a 2 word NTupleKmer performed the same as using a UInt128 (and for stuff like RC produced same ASM). One advantage of NTupleKmers is the ability to go longer than the maximum Unsigneds allow in base julia. When I last discussed kmers on Slack with julia people and whether I could make primitives to get me bigger integers, which is basically what BitIntegers.jl does, I was very much discouraged from doing so from core julia devs, I think Jeff Bezanson and others - for reasons I fail to remember exactly (thanks slack!) but related to LLVM's support for intrinsics and custom primitives becoming progressively suckier. NTuples are also base julia and so are not an extra dep. We should do a proper speed / ASM checklist of the core stuff we expect kmers to do in a micro-optimised way and see where the gaps in any particular design are and see if there's anything in particular that slows things down for one design vs the other.

I'm super happy for SeqView to be relied on, and for the COW to be removed from LongSequence, it is conceptually much more simple for every sequence to "own" its underlying data, for slices to construct new sequences, and views to reference a subrange of a sequence. It's what everyone expects from arrays and array-like types in julia, it makes sense to keep in with those expectations.

I see the change in kmer type design as a separate decision and issue to the COW/SeqView issue, but that's perhaps because I see Kmers more as sequences that own their own data in the same sense LongSequences do, more than I see them as sub-sequences of a LongSequence. Because kmers as datapoints or things have to be able to exist independently
from the read/LongSequence from which they were generated. This is particularly important for kmer coverage/count applications and analysing genome graphs, but of course, for someone who has a read, wants to iterate over the kmers, compute a value, and forget the kmers and move onto the next sequence, they would arguably much better off using SeqView and not the Kmer type.

One answer to this Kmers as sequences themselves vs Kmers as views of a bigger sequence might be to eliminate Kmer from the type vocab entirely, and instead, we have LongSequence (or MutableSequence or whatever), and StaticSequence, instead of LongSequence, and Kmer.

from biosequences.jl.

jakobnissen avatar jakobnissen commented on June 15, 2024

I agree with all of that. I did run into some funky compilation stuff once I pushed the Kmer size to 1024 bits in my experiment - and NTupleKmers could almost certainly go much higher. If LLVM support for larger integers is shaky, that means BitIntegers may be too shaky a foundation to lay new Kmer types. I can take a look at NTupleKmers if you want and work on it.

And of course the SeqView is completely different - only related in the sense that it, like a kmer, is a subsequence (I know, techincally a kmer is a sequence in itself, but the use case is almost always to extract them from a larger seq). SO I'll work towards removing the COW completely (and test it with FASTX.jl etc) and then make a PR then 👍. 1.5 is still not out, so there is plenty of time to develop SeqView. Let's close the SeqView topic in this issue then and discuss that in #102 .

I do think "Kmer" is a good term, since it's well-known, and also because, for practical reasons, K must be a parameter of whatever static type we create anyway. But the name's not too important.

What i DO want to keep from my current "kmer experiment" is the following:

  • Kmers define a "push" operation for kmers, which is when a symbol (or encoding) is added to the end, shifting it. We may also define other modifying operations like add_symbols(m::Kmer, seq::BioSequence, p::Integer, N) which shifts m N symbols and adds N new symbols from seq starting from p.
  • Kmer iterators are made much simpler because they rely on those functions and don't actually do the bitshifts themselves but only focuses on iteration logic. That makes the burden of developing multiple kmer iterators (which we probably want) significantly lower.
  • We define a simplified "bi-kmer" type, NOT a full BioSequence type. It just represents a nucleic acid kmer and its reverse complement. This type also has push defined for it. The purpose of this is that, if a kmer iterator has BiKmer as a paramter instead of Kmer, it will automatically produce BiKmers, only using generic code. This is useful because it allows the same kmer iterators to yield simple kmers or easily reverse-complementable kmers. The issue here is that creating the currently existing MerIterResult is significantly (~+50%) slower than just creating forward kmers.

from biosequences.jl.

CiaranOMara avatar CiaranOMara commented on June 15, 2024

@jakobnissen, I like the way you're separating the middle-level logic from the lower-level logic. As you say, it certainly reduces the burden. It also improves readability and testability. I think many of your lower-level logic functions would belong at the core of BioSequences.

I think I had similar thoughts as @benjward did about the Kmer type. I was wondering whether they are simply a slice or a view of a sequence. For me, the way that a Kmer type could make sense is if they were beneficial to coverage/count applications or graph analysis and offer some optimisation given a known number of symbols and their size.

In terms of kmer as an understood term, I think context could be sufficient—for example, a Kmer is a result (whatever type that may be) provided by iterators with names such as MerIterator{K,A} or SkipMerIterator{K,N,A}.

The only thing that I can think of that is an uncovered base (if agreed upon) or that may alter your design is the pursuit of SymbolSets/Alphabets where Symbols are aware of their encoding. Unfortunately, I haven't had a chance to develop the idea further.

from biosequences.jl.

jakobnissen avatar jakobnissen commented on June 15, 2024

With regard to the "kmer as a subsequence", I did play around with making kmer iterators using SeqView (I can upload some code for you to see). It turned out that while they're efficient to create, they are very difficult to manipulate. For example, how would you reverse complement it without allocating? How about fast hashing of kmers, or creating skipmers? I couldn't figure out how to do it.

I'm still not sure what you mean about the Alphabet/Symbol thing. Perhaps this is a thing for another issue?

from biosequences.jl.

TransGirlCodes avatar TransGirlCodes commented on June 15, 2024

Kmers define a "push" operation for kmers, which is when a symbol (or encoding) is added to the end, shifting it. We may also define other modifying operations like add_symbols(m::Kmer, seq::BioSequence, p::Integer, N) which shifts m N symbols and adds N new symbols from seq starting from p.

I completely agree, in fact this pushy type thing is something I wrote in NTupleKmers, the core of it being a shift that correctly carries bits over to the next word.

Kmer iterators are made much simpler because they rely on those functions and don't actually do the bitshifts themselves but only focuses on iteration logic. That makes the burden of developing multiple kmer iterators (which we probably want) significantly lower.

This is also something I can fully endorse. It makes for a nice conceptual distinction in that the iterator is responsible for how it samples bases from a seq and the kmer type is responsible for how it incorporates them.

We define a simplified "bi-kmer" type, NOT a full BioSequence type. It just represents a nucleic acid kmer and its reverse complement. This type also has push defined for it. The purpose of this is that, if a kmer iterator has BiKmer as a paramter instead of Kmer, it will automatically produce BiKmers, only using generic code. This is useful because it allows the same kmer iterators to yield simple kmers or easily reverse-complementable kmers. The issue here is that creating the currently existing MerIterResult is significantly (~+50%) slower than just creating forward kmers.

It is faster than doing a generator like (canonical(x) for x in eachmer(myfoo)) though, which for GenomeGraphs.jl and KmerAnalysis.jl is super important. I was aware I was kinda insisting on (forcing!) people to use a thing that they might not care about - I just wanted us to have something resembling the parts required for basic genome assembly as fast as possible - you know, get the v0.1 out there, get people looking at stuff and hopefully enthusiastic. But I really like this BiKmer solution, it feels much more like a last word to the problem than my "MerIterResult plaster".

Ok, should I then move NTupleKmers content over to BioSequences.jl on a new branch, say we call it the better_kmers branch? Then we can work from there and merge other stuff and iterators and so on into it?

from biosequences.jl.

jakobnissen avatar jakobnissen commented on June 15, 2024

I agree with all the above - make the branch. My summer holidays begin from today, so I'll be able to do some work on it these weeks.

from biosequences.jl.

CiaranOMara avatar CiaranOMara commented on June 15, 2024

With regard to the "kmer as a subsequence", I did play around with making kmer iterators using SeqView (I can upload some code for you to see).

Of course, I'm curious -- it'll show me or give me a better idea as to how you're thinking about this.

It turned out that while they're efficient to create, they are very difficult to manipulate. For example, how would you reverse complement it without allocating?

No idea -- would reverse lookup tables be efficient?

I'm still not sure what you mean about the Alphabet/Symbol thing. Perhaps this is a thing for another issue?

It's the thing that derailed #102. And yes, I'll eventually open an issue for it.

from biosequences.jl.

jakobnissen avatar jakobnissen commented on June 15, 2024

With #120 and #119 merged, and #121 in the works, this issue has been resolved. Closing.

from biosequences.jl.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.