vectorizedkmers.jl's People

Contributors

anton083

vectorizedkmers.jl's Issues

TagBot trigger issue

This issue is used to trigger TagBot; feel free to unsubscribe.

If you haven't already, you should update your TagBot.yml to include issue comment triggers.
Please see this post on Discourse for instructions and more details.

If you'd like for me to do this for you, comment TagBot fix on this issue.
I'll open a PR within a few hours, please be patient!

Kmers.jl

So I just found out about this neat in-development Kmers.jl package. It has this Kmer type, a very intricate type ecosystem, and fancy functions.

I ran a microbenchmark on a k-mer counting function built on it (below), and it was orders of magnitude slower than my own implementation, which looks at the data of the sequence directly.

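function count_kmers2(seq::LongDNA{2}, k::Integer)
    kcv = KmerCountVector{4, k}()
    counts = kcv.counts
    # iterate over every k-mer of the sequence via Kmers.jl
    for (i, kmer) in EveryKmer(seq, Val{k}())
        # the k-mer's packed 2-bit data, plus one, is its 1-based index into counts
        counts[kmer.data[1] + one(UInt)] += 1
    end
    kcv
end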
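For comparison, here is a minimal sketch of the sliding-window idea in plain Julia. It is not the package's actual implementation: `count_kmers_sliding` is a hypothetical name, the input is assumed to already be a vector of 2-bit codes (A=0, C=1, G=2, T=3), and putting the first base in the most significant bits is an assumption as well.

```julia
# Hypothetical helper: count k-mers over a vector of 2-bit base codes
# (assumed encoding: A=0, C=1, G=2, T=3).
function count_kmers_sliding(codes::AbstractVector{<:Integer}, k::Integer)
    counts = zeros(Int, 4^k)
    mask = UInt(4)^k - one(UInt)   # keeps only the lowest 2k bits
    kmer = zero(UInt)
    for (i, c) in enumerate(codes)
        # shift the window left by one base and append the new base
        kmer = ((kmer << 2) | UInt(c)) & mask
        # the window only holds a full k-mer once k bases have been read
        if i >= k
            counts[kmer + 1] += 1
        end
    end
    return counts
end

counts = count_kmers_sliding([0, 1, 2, 3], 2)   # the 2-mers of "ACGT": AC, CG, GT
```

Each base is touched once, and no intermediate k-mer objects are built, which is roughly why reading the data directly can be so much cheaper.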

Moreover, vectorized k-mer counting is only easy to do directly with Kmers.jl on 2-bits/base sequences, because of how the data is stored when we allow for ambiguous nucleotides (4 bits/base). In my own implementation of vectorized k-mer counting on LongDNA{4}, I solve this by calling trailing_zeros on every 4 bits in the data vector to get the 2-bit representation of each base (this also disambiguates the base to the first possible base in the order A, C, G, T). If this were done on 4-bits/base DNAKmers, it'd just be a sh*tshow really. A sliding window is the way to go.

These Kmer thingies can probably be used for a generalized function that applies to all BioSequences though. But I'm still not sure how to handle ambiguous sequences...
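A rough sketch of the trailing_zeros trick, in plain Julia. It assumes the usual 4-bit encoding with one bit per base (A=0b0001, C=0b0010, G=0b0100, T=0b1000) and assumes bases are packed into a UInt64 starting from the lowest 4 bits. `nibbles_to_2bit` is a hypothetical name, not something from the package.

```julia
# Hypothetical helper: turn the first n 4-bit bases of a data word into 2-bit codes.
# Assumed encoding: A=0b0001, C=0b0010, G=0b0100, T=0b1000; ambiguous symbols
# set several bits (e.g. M = A or C = 0b0011).
function nibbles_to_2bit(word::UInt64, n::Integer=16)
    codes = Vector{UInt8}(undef, n)
    for i in 1:n
        nibble = (word >> (4 * (i - 1))) & 0x0f
        # the position of the lowest set bit is the 2-bit code; for ambiguous
        # symbols this picks the first possible base in the order A, C, G, T
        codes[i] = trailing_zeros(nibble) % UInt8
    end
    return codes
end

acgt = nibbles_to_2bit(UInt64(0x8421), 4)   # nibbles A, C, G, T from the low end
```

An all-zero group of 4 bits (a gap) is not handled here and would need its own case.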

Sorry about the poor structure of this issue. I just wanted to get my thoughts out there. I may also have f'd up the grammar, spelling, and capitalization a bit. Heh.

getindex methods

The BioSequences extension, for example, could have a getindex method such that you can take a k-mer vector and access elements using a sequence/k-mer type.

Base.getindex(kv::KmerVector{4, k}, kmer::LongDNA{2}) where {k} = kv[kmer.data[1]+1]
Base.getindex(kv::KmerVector{4, k}, kmer::LongDNA{4}) where {k} = kv[kmer_to_index(kmer, k)]

It wouldn't be trivial to implement as there'd need to be functions for converting k-mers to numerical indices.
This might open up a reason to tie this package with Kmers.jl, see #25
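As a sketch of what such a conversion could look like, here is a string-based version in plain Julia. It is only an illustration: this `kmer_to_index` takes a String instead of a BioSequences type, and the convention (first base most significant, 1-based result) is an assumption.

```julia
# Hypothetical helper: map a DNA k-mer string to a 1-based index in 1:4^k,
# treating the k-mer as a base-4 number with A=0, C=1, G=2, T=3.
function kmer_to_index(kmer::AbstractString)
    index = 0
    for base in kmer
        code = findfirst(==(base), "ACGT")
        code === nothing && throw(ArgumentError("unsupported base: $base"))
        index = 4 * index + (code - 1)
    end
    return index + 1
end

idx = kmer_to_index("GT")   # 2 * 4 + 3, plus one for 1-based indexing
```

A real method for sequence types would read the packed data instead of going through characters, but the index arithmetic would be the same.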

"K-mer" looks gross

I've been using a capital K for the name of the package, in types, and in the documentation. The package name and types would be harder to read if the k was lowercase:
"Vectorizedkmers", "AbstractkmerCountVector"

"K-mer" on the other hand, looks super gross! I went with it just to be consistent with the package and type names. In the README.md, I'm currently writing "$k$-mer" just to make the introduction more aesthetically pleasing, but it feels weird to use lowercase k in the documentation when it's uppercase in the code and in type names.

Proposed solution:

F*ck it. The package name can have an uppercase K, and so can the types. The k variable can be lowercase, and so can the k in the docs.
I haven't capitalized the k in variable names -- that's a line that I just don't cross, as I am a devout snake_caser.

Generalize types to not necessarily be "Counts"?

I realized that the cells in these arrays don't necessarily have to be counts.

The values could be the indices at which (unique?) k-mers occur in a sequence. Operations like sqeuclidean could make sense in this case, as we'd be looking at the position difference of k-mers. That metric would blow way out of proportion from just a single mutation though, so perhaps some other kind of iteration over the data would be necessary that doesn't just square the differences.

The values could also be boolean flags that say whether a k-mer is in a sequence or not.

Or heck, the values could be normalized or something, or show the average k-mer count of a set of sequences! A function for averaging all the k-mer counts of a matrix is already implemented in conversion.jl.
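To make these alternatives concrete, here is a small sketch on a plain matrix. The numbers are made up, the layout (one column per sequence) is an assumption, and none of this uses the package's own types or the function in conversion.jl.

```julia
# Hypothetical count matrix: rows are the four 1-mers (A, C, G, T),
# columns are two sequences.
counts = [3 0;
          1 2;
          0 0;
          2 4]

# boolean flags: does the k-mer occur in the sequence at all?
presence = counts .> 0

# normalized values: each column divided by its total, so it sums to 1
frequencies = counts ./ sum(counts; dims=1)

# average count of each k-mer across the set of sequences
mean_counts = vec(sum(counts; dims=2)) ./ size(counts, 2)
```

All three results have the same k-mer indexing as the counts they came from, which is the argument for the arrays not being tied to "counts" in their names.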

My proposal

I propose that we at least just change the names of the abstract types.
AbstractKmerCountArray -> AbstractKmerArray
AbstractKmerCountScalar -> AbstractKmerScalar
AbstractKmerCountVector -> AbstractKmerVector
AbstractKmerCountMatrix -> AbstractKmerMatrix

There's a case to be made that we could still have concrete Count types, but that might just add a lot of complexity. The user should keep track of what these arrays actually represent. The package should just be a nice generalized interface for creating these vectors, with functions for e.g. counting k-mers efficiently or converting k-mers to indices.

If we also change the names of the concrete types, we could just remove "Count" there as well:

KmerCountVector -> KmerVector
KmerCountVectors -> KmerVectors
KmerCountRows -> KmerRows or KmerRowVectors? (it's an alias for KmerVectors{1})
KmerCountColumns -> KmerColumns or KmerColumnVectors? (it's an alias for KmerVectors{2})

Functions such as count_kmers would still be oblivious to what the arrays actually represent, I guess.

KmerCountScalar

Currently reworking the types, and realized that it might make sense to have a KmerCountScalar type, which is just a zero-dimensional k-mer count. Implementation may look like this:

struct KmerCountScalar{S, k, T, A} <: AbstractKmerCount{0, S, k, T, A}
    kmer::Integer
    counts::A

    function KmerCountScalar{S, k}(kmer::Integer, counts::A) where {S, k, T, A <: AbstractArray{T, 0}}
        new{S, k, T, A}(kmer, counts)
    end

    function KmerCountScalar{S, k, T, A}(kmer::Integer, count::T) where {S, k, T, A}
        new{S, k, T, A}(kmer, fill(count))
    end
end

@inline Base.size(::KmerCountScalar) = ()
@inline Base.length(::KmerCountScalar) = 1
@inline Base.getindex(kcs::KmerCountScalar) = kcs.counts[1]
@inline Base.getindex(kcs::KmerCountScalar, i) = kcs.counts[i]

# without these, it's gonna get shown as `fill(<count>)`
@inline Base.repr(kcs::KmerCountScalar) = kcs[]
@inline Base.show(io::IO, kcs::KmerCountScalar) = print(io, repr(kcs))

Directly indexing vectors and matrices could return this type. The only real benefit that I can see is that we can have the kmer field, which means that we can remember which k-mer it was that was accessed.

Kinda weird though, and printing matrices was super slow...
