anton083 / vectorizedkmers.jl


Fast K-mer counting in Julia

Home Page: https://anton083.github.io/VectorizedKmers.jl/

License: MIT License

Julia 100.00%

Topics: dna, kmer, sequence

vectorizedkmers.jl's Issues

TagBot trigger issue

This issue is used to trigger TagBot; feel free to unsubscribe.

If you haven't already, you should update your TagBot.yml to include issue comment triggers.
Please see this post on Discourse for instructions and more details.

If you'd like for me to do this for you, comment TagBot fix on this issue.
I'll open a PR within a few hours, please be patient!

Kmers.jl

So I just found out about this neat in-development Kmers.jl package. It has a Kmer type, a very intricate type ecosystem, and fancy functions.

I ran a microbenchmark on a k-mer counting function built on Kmers.jl (shown below), and it was orders of magnitude slower than my own implementation, which looks at the data of the sequence directly.

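using BioSequences, Kmers, VectorizedKmers

# Kmers.jl-based version: iterate over every k-mer and bump its count.
function count_kmers2(seq::LongDNA{2}, k::Integer)
    # k is a runtime value used as a type parameter, so this dispatches dynamically
    kcv = KmerCountVector{4, k}()
    counts = kcv.counts
    for (i, kmer) in EveryKmer(seq, Val{k}())
        # the k-mer's 2-bit packed data is used as its zero-based index
        counts[kmer.data[1] + one(UInt)] += 1
    end
    kcv
end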

Moreover, vectorized k-mer counting can only be done easily with Kmers.jl on 2-bits/base sequences, because of how the data is stored when ambiguous nucleotides are allowed (4 bits/base). In my own implementation of vectorized k-mer counting on LongDNA{4}, I solve this by calling trailing_zeros on every 4 bits in the data vector to get the 2-bit representation of each base (this also disambiguates an ambiguous base to the first possible base in the order A, C, G, T). If this was done on 4-bits/base DNAKmers, it'd just be a sh*tshow really. A sliding window is the way to go. These Kmer thingies can probably be used for a generalized function that applies to all BioSequences, though. But I'm still not sure how to handle ambiguous sequences...
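To make the trailing_zeros trick and the sliding window concrete, here is a minimal sketch in plain Julia. It is not the package's implementation: the function names are made up for illustration, and it assumes the 4-bit codes are one-hot (A=0b0001, C=0b0010, G=0b0100, T=0b1000), that the first base sits in the lowest bits of each 64-bit word, and that there are no gaps.

```julia
# Hypothetical helper names; nothing here is part of VectorizedKmers.jl or Kmers.jl.

# Unpack the first `n` bases from one 64-bit word of 4-bit codes into 2-bit codes.
# Assumes one-hot codes A=0b0001, C=0b0010, G=0b0100, T=0b1000, first base in the
# lowest bits, and no gaps (a 0b0000 nibble is not handled).
function word_to_2bit_codes(word::UInt64, n::Integer)
    codes = Vector{UInt8}(undef, n)
    for i in 1:n
        nibble = (word >> (4 * (i - 1))) & 0x0f
        codes[i] = trailing_zeros(nibble)  # lowest set bit picks the first possible base
    end
    return codes
end

# Slide a k-wide window over the 2-bit codes, keeping the current k-mer in one integer.
function count_kmers_sliding(codes::AbstractVector{UInt8}, k::Integer)
    counts = zeros(Int, 4^k)
    mask = UInt(4^k - 1)
    kmer = UInt(0)
    for (i, code) in enumerate(codes)
        kmer = ((kmer << 2) | code) & mask  # push the new base in, drop the oldest
        if i >= k
            counts[kmer + 1] += 1
        end
    end
    return counts
end

# "AMGT" in 4-bit codes (M = A or C = 0b0011), first base in the lowest nibble.
amgt_codes = word_to_2bit_codes(0x0000000000008431, 4)
amgt_counts = count_kmers_sliding(amgt_codes, 2)
```

Here the ambiguous M collapses to A, so the counted 2-mers are AA, AG, and GT. Each step of the window costs one shift, one OR, and one mask, no matter how large k is.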

sorry about poorly structuring this issue. just wanted to get my thoughts out there. may also have f'd up the grammar, spelling, and capitalization a bit. heh.

getindex methods

The BioSequences extension, for example, could have a getindex method such that you can take a k-mer vector and access elements using a sequence/k-mer type.

Base.getindex(kv::KmerVector{4, k}, kmer::LongDNA{2}) where {k} = kv[kmer.data[1]+1]
Base.getindex(kv::KmerVector{4, k}, kmer::LongDNA{4}) where {k} = kv[kmer_to_index(kmer, k)]

It wouldn't be trivial to implement as there'd need to be functions for converting k-mers to numerical indices.
This might open up a reason to tie this package with Kmers.jl, see #25
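A rough sketch of what such a conversion and getindex method could look like, using plain strings instead of BioSequences types so it runs without any packages. ToyKmerVector and BASE_CODES are made up for illustration, and this kmer_to_index is only one possible definition (first base most significant, A, C, G, T mapped to 0 through 3), not an existing function.

```julia
# Hypothetical stand-ins; not the package's types or functions.
const BASE_CODES = Dict('A' => 0, 'C' => 1, 'G' => 2, 'T' => 3)

# Read the k-mer as a base-4 number and shift it to a one-based index.
function kmer_to_index(kmer::AbstractString)
    index = 0
    for base in kmer
        index = 4 * index + BASE_CODES[base]
    end
    return index + 1
end

# Toy k-mer vector holding 4^k values.
struct ToyKmerVector{k}
    values::Vector{Int}
end
ToyKmerVector{k}() where {k} = ToyKmerVector{k}(zeros(Int, 4^k))

Base.getindex(kv::ToyKmerVector, i::Integer) = kv.values[i]

# Index with a k-mer instead of a number.
function Base.getindex(kv::ToyKmerVector{k}, kmer::AbstractString) where {k}
    length(kmer) == k || throw(ArgumentError("expected a k-mer of length $k"))
    return kv[kmer_to_index(kmer)]
end

toy_kv = ToyKmerVector{2}()
toy_kv.values[kmer_to_index("AC")] = 5
```

With this in place, toy_kv["AC"] reads the same cell as toy_kv[2].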

"K-mer" looks gross

I've been using a capital K for the name of the package, in types, and in the documentation. The package name and types would be harder to read if the k was lowercase:
"Vectorizedkmers", "AbstractkmerCountVector"

"K-mer" on the other hand, looks super gross! I went with it just to be consistent with the package and type names. In the README.md, I'm currently writing "$k$-mer" just to make the introduction more aesthetically pleasing, but it feels weird to use lowercase k in the documentation when it's uppercase in the code and in type names.

Proposed solution:

f*ck it. The package name can have an uppercase K, and so can the types. The k variable can stay lowercase, and so can the k in the docs.
I haven't capitalized the k in variable names -- that's a line that I just don't cross, as I am a devout snake_caser.

Generalize types to not necessarily be "Counts"?

I realized that the cells in these arrays don't necessarily have to be counts.

The values could be the indices at which (unique?) k-mers occur in a sequence. Operations like sqeuclidean could make sense in this case, as we'd be looking at the position difference of k-mers. That metric would blow waay out of proportion by just a single mutation though, so perhaps some other kind of iteration over the data would be necessary that doesn't just square the differences.

The values could also be boolean flags that say whether a k-mer is in a sequence or not.

Or heck, the values could be normalized or something, or show the average k-mer count of a set of sequences! A function for the latter is already implemented in conversion.jl: it averages all the k-mer counts of a matrix.

My proposal

I propose that we at least just change the names of the abstract types.
AbstractKmerCountArray -> AbstractKmerArray
AbstractKmerCountScalar -> AbstractKmerScalar
AbstractKmerCountVector -> AbstractKmerVector
AbstractKmerCountMatrix -> AbstractKmerMatrix

There's a case to be made that we could still have concrete Count types, but that might just add a lot of complexity. The user should keep track of what these arrays actually represent. The package should just be a nice generalized interface for creating these vectors, with functions for e.g. counting k-mers efficiently or converting k-mers to indices.

If we also change the names of the concrete types, we could just remove "Count" there as well:

KmerCountVector -> KmerVector
KmerCountVectors -> KmerVectors
KmerCountRows -> KmerRows or KmerRowVectors? (it's an alias for KmerVectors{1})
KmerCountColumns -> KmerColumns or KmerColumnVectors? (it's an alias for KmerVectors{2})

Functions such as count_kmers would still be oblivious to what the arrays actually represent I guess.
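As a small illustration of that point, here is a sketch in plain Julia where the counting function only sees an array and the caller decides what the values mean. The names toy_count_kmers! and TOY_CODES are made up for this example and are not the package's API.

```julia
# Hypothetical sketch; not the package's count_kmers.
const TOY_CODES = Dict('A' => 0, 'C' => 1, 'G' => 2, 'T' => 3)

# Adds one to the cell of every k-mer in `seq`. Works for any numeric element type.
function toy_count_kmers!(values::AbstractVector, seq::AbstractString, k::Integer)
    mask = 4^k - 1
    kmer = 0
    for (i, base) in enumerate(seq)
        kmer = ((kmer << 2) | TOY_CODES[base]) & mask
        if i >= k
            values[kmer + 1] += one(eltype(values))
        end
    end
    return values
end

toy_counts = toy_count_kmers!(zeros(Int, 16), "ACGTAC", 2)

# The same array shape can hold other things than counts:
toy_presence = toy_counts .> 0                  # boolean flags: does the k-mer occur?
toy_frequencies = toy_counts ./ sum(toy_counts) # normalized values
```

Nothing in toy_count_kmers! depends on the array representing counts in the end. The flags and the normalized values are just different element types stored in the same layout.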

KmerCountScalar

Currently reworking the types, and realized that it might make sense to have a KmerCountScalar type, which is just a zero-dimensional k-mer count. Implementation may look like this:

struct KmerCountScalar{S, k, T, A} <: AbstractKmerCount{0, S, k, T, A}
    kmer::Integer
    counts::A

    function KmerCountScalar{S, k}(kmer::Integer, counts::A) where {S, k, T, A <: AbstractArray{T, 0}}
        new{S, k, T, A}(kmer, counts)
    end

    function KmerCountScalar{S, k, T, A}(kmer::Integer, count::T) where {S, k, T, A}
        new{S, k, T, A}(kmer, fill(count))
    end
end

@inline Base.size(::KmerCountScalar) = ()
@inline Base.length(::KmerCountScalar) = 1
@inline Base.getindex(kcs::KmerCountScalar) = kcs.counts[]
@inline Base.getindex(kcs::KmerCountScalar, i) = kcs.counts[i]

# without these, it's gonna get shown as `fill(<count>)`
# (repr has to return a String, so wrap the count in repr)
@inline Base.repr(kcs::KmerCountScalar) = repr(kcs[])
@inline Base.show(io::IO, kcs::KmerCountScalar) = print(io, repr(kcs))

Directly indexing vectors and matrices could return this type. The only real benefit that I can see is that we can have the kmer field, which means that we can remember which k-mer it was that was accessed.

Kinda weird though, and printing matrices was super slow...
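For reference, a stripped-down sketch of the "index returns a scalar that remembers its k-mer" idea, without the package's type parameters. ToyKmerScalar and toy_lookup are made-up names, and the value is stored directly rather than in a zero-dimensional array.

```julia
# Hypothetical sketch; simpler than the KmerCountScalar above.
struct ToyKmerScalar{T} <: AbstractArray{T, 0}
    kmer::Int   # zero-based integer value of the k-mer that was accessed
    value::T
end

# Minimal zero-dimensional array interface.
Base.size(::ToyKmerScalar) = ()
Base.getindex(s::ToyKmerScalar) = s.value

# Indexing helper that wraps the value together with the k-mer it came from.
toy_lookup(values::AbstractVector, kmer::Integer) = ToyKmerScalar(Int(kmer), values[kmer + 1])

toy_scalar = toy_lookup([3, 0, 7, 1], 2)
```

toy_scalar[] gives back the value, while toy_scalar.kmer still says which k-mer was looked up.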
