experimental_kmer's Introduction

Experimental kmers

Update: This work has been made obsolete by Ben Ward's superior NTupleKmers, which as of 2020-09-25 is planned to become the default kmer type in BioJulia.

This is a proof-of-concept package. It is intended to show a working example of a more general, very "hackable", yet still fast kmer type for use in BioSequences.

Extra features compared to existing kmer types:

  • Supports arbitrary alphabets
  • Variable-length encoding, from 8 to 1024 bits

Demonstration

The type takes three parameters, Kmer{K, A, T}:

  • K is the value of k
  • A is the alphabet type, a subtype of Alphabet
  • T is the storage type, a subtype of Unsigned
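
To illustrate what the storage parameter T controls, here is a minimal sketch that packs K two-bit symbols into an unsigned integer. It is not the package's implementation: ToyKmer, NUCS, and decode are hypothetical names, the alphabet is hard-coded to A, C, G, T instead of being a type parameter, and a fixed two bits per symbol is an assumption made for the example.

```julia
# Hypothetical toy type (not the package's Kmer): K symbols packed into a T.
struct ToyKmer{K, T<:Unsigned}
    data::T
end

# Assumed fixed alphabet; a symbol's code is its position here minus one.
const NUCS = ['A', 'C', 'G', 'T']

function ToyKmer{K, T}(s::AbstractString) where {K, T<:Unsigned}
    length(s) == K || throw(ArgumentError("expected $K symbols"))
    # Two bits per symbol, so the storage type must hold at least 2K bits.
    2K <= 8 * sizeof(T) || throw(ArgumentError("$K symbols do not fit in $T"))
    data = zero(T)
    for c in s
        code = findfirst(==(c), NUCS)
        code === nothing && throw(ArgumentError("invalid symbol $c"))
        # Shift earlier symbols left and append the new one in the low bits.
        data = (data << 2) | T(code - 1)
    end
    return ToyKmer{K, T}(data)
end

# Unpack the symbols again, most significant pair first.
function decode(m::ToyKmer{K}) where {K}
    return join(NUCS[Int((m.data >> (2 * (K - i))) & 0x03) + 1] for i in 1:K)
end
```

With these definitions, ToyKmer{4, UInt8}("TAGC") occupies a single byte and decode returns "TAGC". Asking for five symbols in a UInt8 throws an ArgumentError, which is the kind of constraint a wider storage type lifts.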

Arbitrary alphabets

julia> m = Kmer{4, CharAlphabet, UInt128}("读写汉字")
4-mer of CharAlphabet:
读写汉字

julia> reverse(m)
4-mer of CharAlphabet:
字汉写读

Supports large kmers through BitIntegers.jl

julia> Kmer(RNAAlphabet{2}, randrnaseq(500))
500-mer of RNAAlphabet{2}:
CAUAUGAUGGAUGGGUUUGGUGCGCAGACUUUAGGACUAGGCAGUAGAUAAAUAUUCAACGGAGUGUCUAUAGCUGUG

julia> kmer"GWYFPPNML"aa
9-mer of AminoAcidAlphabet:
GWYFPPNML

Simple, efficient kmer iterator of any alphabet

julia> it = SimpleKmerIterator{AAKmer{5}}(randaaseq(10^6))
Main.t.SimpleKmerIterator{Main.t.Kmer{5,AminoAcidAlphabet,UInt64},LongSequence{AminoAcidAlphabet}}(DDHMGFAHPCAYFVNMFPTCVFSAVSSTHNVKLYFTQTY…MGRIPSPPWCWHIIKDQGDIFWNHWLDSCCCRKDYTIRL)

julia> @btime sum(i.data for i in it)
  875.581 μs (2 allocations: 32 bytes)
0x0091802f3c41a9b9
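
One common way to keep a kmer iterator cheap is to reuse the previous kmer's bits and shift in one new symbol per step, instead of re-encoding every window. The sketch below shows that rolling update under stated assumptions: ToyKmerIterator and nuccode are hypothetical names, the input is an ASCII DNA string, and each kmer is a raw UInt64 with two bits per symbol, so K must be small enough to fit in 64 bits. Whether SimpleKmerIterator works this way internally is not stated in this README.

```julia
# Hypothetical toy iterator (not the package's SimpleKmerIterator).
struct ToyKmerIterator{K}
    seq::String
end

Base.length(it::ToyKmerIterator{K}) where {K} = max(0, ncodeunits(it.seq) - K + 1)
Base.eltype(::Type{<:ToyKmerIterator}) = UInt64

# Map an ASCII byte to its assumed two-bit code.
function nuccode(b::UInt8)
    b == UInt8('A') && return 0x00
    b == UInt8('C') && return 0x01
    b == UInt8('G') && return 0x02
    b == UInt8('T') && return 0x03
    throw(ArgumentError("invalid symbol"))
end

# State is (index of the next byte to read, bits of the previous kmer).
function Base.iterate(it::ToyKmerIterator{K}, state=(1, UInt64(0))) where {K}
    i, data = state
    bytes = codeunits(it.seq)
    mask = (UInt64(1) << (2K)) - 1  # keeps only the low 2K bits
    if i == 1
        # First call: preload the first K-1 symbols.
        while i < K && i <= length(bytes)
            data = (data << 2) | nuccode(bytes[i])
            i += 1
        end
    end
    i > length(bytes) && return nothing
    # Shift in one symbol and drop the oldest one via the mask.
    data = ((data << 2) | nuccode(bytes[i])) & mask
    return data, (i + 1, data)
end
```

For example, collect(ToyKmerIterator{3}("ACGT")) yields the two values 0x06 (ACG) and 0x1b (CGT), and each step after the first touches only one byte of the input.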

Uses the high-level BioSequences v2 API internally, with sane fallbacks. This allows very generic code, with the following features being examples of emergent properties:

  • Creating kmers from kmers

julia> T1 = t.Kmer{3, RNAAlphabet{4}, UInt32};

julia> mer = t.kmer"TAGTCGCGAGAA"
12-mer of DNAAlphabet{2}:
TAGTCGCGAGAA

julia> it = t.StandardKmerIterator{T1, typeof(mer)}(mer);

julia> [canonical(i) for i in it]
10-element Array{Main.t.Kmer{3,RNAAlphabet{4},UInt32},1}:
 CUA
 ACU
 GAC
 CGA
 CGC
 CGC
 CGA
 CUC
 AGA
 GAA

  • Many basic operations, such as getindex, just use the BioSequence fallback

Contributors

jakobnissen
