Code Monkey home page Code Monkey logo

biosequences.jl's Introduction

Bio

Latest release MIT license Pkg Status

This package has been depreceated. Full details are available here. You might still download and use this package, as we don't want old scripts to break. However going forward, know that this repository is archived and read only, no further updates or fixes will be committed. You should use the packages that replace Bio.jl - a list is available here.

biosequences.jl's People

Contributors

bicycle1885 avatar ciaranomara avatar cjprybol avatar cossio avatar dtjohnson avatar femtocleaner[bot] avatar github-actions[bot] avatar harryscholes avatar ivirshup avatar jakobnissen avatar kdyrhage avatar kescobo avatar lhnguyen-vn avatar mortenpi avatar phaverty avatar tanhevg avatar timholy avatar tkelman avatar tp2750 avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

biosequences.jl's Issues

Add split function

I was trying to add the split function for BioSequences.

Base.split(str::T, splitter; limit::Integer=0, keep::Bool=true) where {T<:Sequence} =
    Base._split(str, splitter, limit, keep, T[])
function Base._split(str::Sequence,splitter, limit::Integer , keep_empty::Bool ,strs::Array)

    i = start(str)
    n = endof(str)
    r = search(str,splitter,i)
    j,k = first(r), last(r) + 1

    while 0 < j <= n && length(str) != limit-1
        if i < k
            if keep_empty || i < j
                push!(strs,str[i:j-1])
            end
            i = k
        end
        (k <= j) && (k = j + 1)
        r = search(str,splitter,k)
        j, k = first(r), last(r) + 1
    end

    if keep_empty || !done(str,i)
        push!(strs, str[i:end])
    end

    return strs
end

It was based on the one in base shown below.

split(str::T, splitter; limit::Integer=0, keep::Bool=true) where {T<:AbstractString} =
    _split(str, splitter, limit, keep, SubString{T}[])
function _split(str::AbstractString, splitter, limit::Integer, keep_empty::Bool, strs::Array)
    i = start(str)
    n = endof(str)
    r = search(str,splitter,i)
    j, k = first(r), nextind(str,last(r))
    while 0 < j <= n && length(strs) != limit-1
        if i < k
            if keep_empty || i < j
                push!(strs, SubString(str,i,prevind(str,j)))
            end
            i = k
        end
        (k <= j) && (k = nextind(str,j))
        r = search(str,splitter,k)
        j, k = first(r), nextind(str,last(r))
    end
    if keep_empty || !done(str,i)
        push!(strs, SubString(str,i))
    end
    return strs
end

It seems to work fine with DNASequences and RNASequences.

split(dna"AAAAACACAAAACATATATATC",dna"CA",keep=true)

4-element Array{BioSequences.BioSequence{BioSequences.DNAAlphabet{4}},1}:
 AAAAA   
         
 AAA     
 TATATATC
split(dna"AAAAACACAAAACATATATATC",dna"CA",keep=false)

3-element Array{BioSequences.BioSequence{BioSequences.DNAAlphabet{4}},1}:
 AAAAA   
 AAA     
 TATATATC

But with AminoAcidSequence is a little buggy, in particular, the keep=true argument doesn't work as expected.

split(aa"HHHHHHTTCHHHHHHHTTCHHHHH",aa"HT",keep=true)

3-element Array{BioSequences.BioSequence{BioSequences.AminoAcidAlphabet},1}:
 HHHHH   
 TCHHHHHH
 TCHHHHH 
split(aa"HHHHHHTTCHHHHHHHTTCHHHHH",aa"HT",keep=false)

3-element Array{BioSequences.BioSequence{BioSequences.AminoAcidAlphabet},1}:
 HHHHH   
 TCHHHHHH
 TCHHHHH 

Implement fast String to LongSequence conversion for more alphabets

In #81 , I made LongSequence conversion of ASCII strings much faster by using an already-optimized method for conversion to 4-bit nucleotide alphabets. However, there is no corresponding fast method for 2-bit alphabets nor amino acid alphabets, though there easily could be.
That should be implemented some time.

Edit: Instead of current dispatch model, for instantiating a LongSequence of alphabet type DNAAlphabet, RNAAlphabet and AminoAcidAlphabet, don't even check for ASCII. Just wrap a vector around it always, since non-ascii strings will not be valid anyway. That saves the call to isascii, which is O(n).

Change Varargs to vectors or array

function count_pairwise{P<:Position,A<:NucAlphs,N}(::Type{P}, seqs::Vararg{BioSequence{A},N})

@bicycle1885 There was an issue or PR recently in which you recommended removing Varargs and just going with a general iterable.

I'm considering whether I should change the varargs in the signature to some other container type, perhaps add a parameter for AbstractArray like:

count_pairwise{P<:Position,A<:NucAlphs,C<:AbstractArray}(::Type{P}, seqs::C{BioSequence{A},N})

I already have a plan to cleanup the fast counting internals a bit and wondered if it was worth doing this at the same time.

Should we make BioSequences <: AbstractVector?

So I'm just making an actual issue to discuss - and document said discussion of a prospect that was raised on Slack.

Because as I'm going through broadcasting, I'm seeing that there are some arguments in the pro column for doing this.

FASTA.Reader doesn't recognize the correct alphabet with ambiguous sites

I have a data set of DNA sequences, where some sites are ambiguous. In the example file below, the second sequence has Y, which means either C or T.

>P1
ATT
>P2
AYC
>P3
ATT
>P4
ATT

When I read this file with FASTA.Reader, the second sequence is thought to be an alignment of amino acids, because Y also means Tyrosine if we think about amino acids instead of nucleic acids:

julia> r = BioSequences.FASTA.Reader(open("myfile.fasta"))
julia> for record in r
         @show eltype(sequence(record))
       end
eltype(sequence(record)) = DNA
eltype(sequence(record)) = AminoAcid
eltype(sequence(record)) = DNA
eltype(sequence(record)) = DNA

I would like to tell FASTA.Reader that the Y character stands for DNA_Y, not AA_Y. Is there a way to tell the function that all the 4 sequences are DNA sequences?

(v1.1) pkg> status
    Status `~/.julia/environments/v1.1/Project.toml`
  [7e6ae17a] BioSequences v1.1.0
  [3c28c6f8] BioSymbols v3.1.0
  [336ed68f] CSV v0.5.5
  ...

Erroneous reverse_complement behaviour?

Noticed something strange in some of my tests as I prepare packages for submission to BioJulia. For one test, I was using reverse_complement on a particular sequence, and the output of the function is definitely wrong. This happens on my local machines but not on the Travis testing images, which is how I noticed this, so I'm not sure that this is replicable at all- just want to make sure it's not a general phenomenon.

Expected Behavior

Ordinary reverse complement produced: from
TATATATATATCGCGCGCGCG
expect
CGCGCGCGCGATATATATATA

Current Behavior

Output of reverse_complement is
TATATATATATATATATATAT
whereas output of reverse_complement! is correct

Possible Solution / Implementation

No idea!

Steps to Reproduce (for bugs)

julia> testseq=LongSequence{DNAAlphabet{2}}("ATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATCGCGCGCGCGCATCATCATCATCATCATCATCATCATCATCATCATCATCATCATCATCATCATCATCATCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCATCATCATCATCATCATCATCATCATCATCATCATCATCATCATCATCATCATCATCATCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCATCATCATCATCATCATCATCATCATCATCATCATCATCATCATCATCATCATCATCATCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCATCATCATCATCATCATCATCATCATCATCATCATCATCATCATCATCATCATCATCATCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCATCATCATCATCATCATCATCATCATCATCATCATCATCATCATCATCATCATCATCATCGCGCGCGCGCGCGCGCGCGCGCGATATATATATATATAT")
1000nt DNA Sequence:
ATATATATATATATATATATATATATATATATATATATAโ€ฆGCGCGCGCGCGCGCGCGCGCGCGATATATATATATATAT

julia> testseq[490:510]
21nt DNA Sequence:
TATATATATATCGCGCGCGCG

julia> reverse_complement(testseq[490:510])
21nt DNA Sequence:
TATATATATATATATATATAT

julia> reverse_complement!(testseq[490:510])
21nt DNA Sequence:
CGCGCGCGCGATATATATATA

Your Environment

Status ~/.julia/environments/v1.4/Project.toml
[6e4b80f9] BenchmarkTools v0.5.0
[71d2fffc] BioBackgroundModels v0.1.0 [../../../../../srv/git/BioBackgroundModels]
[7e6ae17a] BioSequences v2.0.4
[a93c6f00] DataFrames v0.21.2
[31c24e10] Distributions v0.23.4
[c2308a5c] FASTX v1.1.2
[5fb14364] OhMyREPL v0.5.5
[92933f4c] ProgressMeter v1.3.1
[295af30f] Revise v2.7.2
[4c63d2b9] StatsFuns v0.9.5

Further generalise the bitparallel counting code

There are other operations I've seen need of - computing the number of segregating sites in a population sample of sequences and so on, that is not purely counting the number of a kind of site between two sequences, but that CAN be done, with a bit of re-working, with the bit-parallel counting methods already developed here. So previously I generalised the code, to counting different kinds of sites bit-parallel, now I'm going to generalise it further to let it do much more than just pairwise site counting. Related PR to come.

In 2.0 docs, link on left menu bar for searching goes to index, not page for search methods

This template is rather extensive. Fill out all that you can, if are a new contributor or you're unsure about any section, leave it unchanged and a reviewer will help you ๐Ÿ˜„. This template is simply a tool to help everyone remember the BioJulia guidelines, if you feel anything in this template is not relevant, simply delete it.

Expected Behavior

Current Behavior

Possible Solution / Implementation

Steps to Reproduce (for bugs)

Context

Your Environment

  • Package Version used:
  • Julia Version used:
  • Operating System and version (desktop or mobile):
  • Link to your project:

How to get the phred score in a string variable most efficiently?

In a section of code I am writing, I want to read fastq file without making use of phred score at all. I would like to just have a string variable which would contain the phred line exactly as it is currently in the input fastq files.

New to Julia, so I have failed to work out how to do it. You must do it somewhere because println(record) (please see context below) results in the representation I am looking for.

fastq_stream = FASTQ.Reader(open("pool1.fastq", "r"))
for record in fastq_stream
    #pseudo-code from here
    # to_csv(record.name, record.sequence, record.quality, record.sequence.length())
end
close(fastq_stream)

Along with the bit in pseudo-code, I am also storing which sample each sequence belongs to and the barcode match error (from a demultiplexing function we use). The idea is for users to then filter the contents of the csv with all this extra information according to their own needs in order to produce filtered fastq files (outside scope of this issue, but thought I would add it for context).

The native list of INT8 I get from FASTQ.quality(record, :sanger) does not store well in a CSV. At this stage a string would be better. When it comes to writing fastq files, I can then convert from string to this list format if necessary.

Any help or alternate recommendations would be very much appreciated.

Your Environment

  • Package Version used: Not sure how to find the version
  • Julia Version used: 0.6.4
  • Operating System and version (desktop or mobile):
    Ubuntu inside a Docker container
uname -a

Linux a5d0b3698610 4.15.0-45-generic #48-Ubuntu SMP Tue Jan 29 16:28:13 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

One-hot encoding of sequences

This is a feature request.

Machine learning models of biological sequences often work by representing sequences in one-hot encoding. I think it would be nice to add support for onehot encoding/decoding of biological sequences in this package.

Export `RE.Regex`, and make constructors easier.

I want to have a regex search search for biosequences, and it can be changed by a value.

for example,

start="AGCAT"
re=@biore_str(string(start,"[AGCT]+"),"d")

it always repored an error, but I want it can be changed by start value?

BioSequences.RE.RegexMatch behaves nothing like Base.RegexMatch

BioSequences.RE.RegexMatch lacks the "match" and "offset" fields of Base.RegexMatch, which not only makes it considerably less useful, but is somewhat difficult to debug because their print() representation is identical.

Code:

string_seq = "ARNDCEQG"
aa_seq = aa"ARNDCEQG"

string_match = match(r"DCE", string_seq)
aa_match = match(biore"DCE"aa, aa_seq)

println(string_match)
println(aa_match)

println(fieldnames(typeof(string_match)))
println(fieldnames(typeof(aa_match)))

Result:

RegexMatch("DCE")
RegexMatch("DCE")
(:match, :captures, :offset, :offsets, :regex)
(:seq, :captured)

Demultiplexer scales badly with Hamming distance

Background

I'm using Demultiplexer() to demultiplex nanopore reads.
This works well, but when allowing more errors in the barcodes, the time to generate the demultiplexer grows very fast.

Current Behavior

Allowing one more error cost more than 10 times longer in terms of time and allocations.

Desired Behavior

It would be great if it was faster.

Steps to reproduce

julia> @time Demultiplexer(LongDNASeq.(["GGAGAAGAAGAAGAA"]), n_max_errors=1, distance=:hamming)
  0.000388 seconds (1.56 k allocations: 162.047 KiB)
Demultiplexer{LongSequence{DNAAlphabet{4}}}:
  distance: hamming
  number of barcodes: 1
  number of correctable errors: 1

julia> @time Demultiplexer(LongDNASeq.(["GGAGAAGAAGAAGAA"]), n_max_errors=2, distance=:hamming)
  0.010063 seconds (50.47 k allocations: 3.590 MiB)
Demultiplexer{LongSequence{DNAAlphabet{4}}}:
  distance: hamming
  number of barcodes: 1
  number of correctable errors: 2

julia> @time Demultiplexer(LongDNASeq.(["GGAGAAGAAGAAGAA"]), n_max_errors=3, distance=:hamming)
  0.193055 seconds (1.08 M allocations: 58.884 MiB)
Demultiplexer{LongSequence{DNAAlphabet{4}}}:
  distance: hamming
  number of barcodes: 1
  number of correctable errors: 3

julia> @time Demultiplexer(LongDNASeq.(["GGAGAAGAAGAAGAA"]), n_max_errors=4, distance=:hamming)
  3.394650 seconds (15.94 M allocations: 734.229 MiB, 10.49% gc time)
Demultiplexer{LongSequence{DNAAlphabet{4}}}:
  distance: hamming
  number of barcodes: 1
  number of correctable errors: 4

julia> @time Demultiplexer(LongDNASeq.(["GGAGAAGAAGAAGAA"]), n_max_errors=5, distance=:hamming)
 39.984839 seconds (169.53 M allocations: 7.118 GiB, 9.05% gc time)
Demultiplexer{LongSequence{DNAAlphabet{4}}}:
  distance: hamming
  number of barcodes: 1
  number of correctable errors: 5

My Environment

julia> versioninfo()
Julia Version 1.4.0
Commit b8e9a9ecc6 (2020-03-21 16:36 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Core(TM) i5-2500K CPU @ 3.30GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-8.0.1 (ORCJIT, sandybridge)

julia> Pkg.status("BioSequences")
Status `~/.julia/environments/v1.4/Project.toml`
  [7e6ae17a] BioSequences v2.0.1

Implement broadcasting for biosequences

TL;DR

For v3, we should add broadcasting to BioSequence, and make my_seq[1:4] = DNA_A error.

Rationale

I was working on #133 , but can't get it to work properly. When calling getindex and setindex! with BioSequences, there always seems to be an edge case where dispatch goes the wrong way, or something is not implemented, or you get a method ambiguity error. I remember thinking the same when doing #120 .

I hate method ambiguity errors. They're a critical error in the sense that they completely break your workflow. They can't be found statically, and they're extremely hard to test for. The LongSequence test suite is compresensive, but I can guarantee you, I can find method ambiguity in something as simple as setindex! there.

Why is this so hard? Why is it so hard for BioSequences in particular?

I've come to think one of the reasons is that my_seq[inds] = x has two distinct semtantic meaning in BioSequences right now, namely these two:

julia> my_seq = dna"GGCTA";

julia> my_seq[2:4] = dna"AAC"; my_seq
5nt DNA Sequence:
GAACA

julia> my_seq[2:4] = DNA_M; my_seq
5nt DNA Sequence:
GMMMA

Having two distinct semantic meanings covered by the same function is a bad idea. Here's what happens if you do it.

  1. Implement both methods. But oh, we need to have them be different signatures to dispatch to the right one. Let's implement setindex!(::BioSequence, ::Any, ::Vector{Bool}) and setindex!(::BioSequence, ::BioSequence, ::Vector{Bool}) separately. This works ONLY because one signature is more specific than the other, NOT because the signatures are incompatible. Note that we already loses some flexibility, because we now can't do my_seq[1:2] = [DNA_A, DNA_C], since that would call the wrong method.
  2. Make new biosequence e.g. LongSequence or LongSubSeq. We need to specialize one of the setindex! methods, so we add setindex!(::LongSequence, ::Any, ::Vector{Bool}).
  3. That new method is now more specific than the fallback (of course), but that now messes up the original dispatch, that was built on one method being more specific than the other. Method ambiguity, so we add setindex!(::LongSequence, ::LongSequence, ::Vector{Bool}). We lose even more flexibility, and the original method error may still be exposed if you try to do my_seq[1:2] = mer"TA". Also, we have to duplicate efforts by having this new method, whose content is identical to the original one.
  4. Repeat with a new biological sequence. Same as above, but much worse, since any combination of types can now error

We've gone down a wrong path. I can see two solutions:

  1. We never implement two semantically different methods with a type intersection between their arguments. That means we discard setindex!(::BioSequence, ::Any, ::Vector{Bool}), and instead have a method where we restrict the middle argument to be e.g. Union{BioSymbol, Char} or something, and the other method to have Union{BioSequence, AbstractVector}. The disadvantage here is that we lose some ducktyping.
  2. We never implement two semantically different method within the same function, full stop. That means we get rid of setindex!(::BioSequence, ::Any, ::Vector{Bool}) (and a few other functions like it), and replace them with new functions. I think we can do this by overloading broadcasting. The disadvantage here is that I don't know how complex implementing broadcasting is.

Check interfaces, remove NotImplemented errors

Errors such as e.g.

"Get the length of a biological sequence."
@inline function Base.length(seq::BioSequence)
error(
string(
"Base.length has not been defined for BioSequence type: ",
typeof(seq),
". Any compatible concrete BioSequence subtype must have this method implemented."
)
)
end

It's not more useful than a MethodError, and it hampers static analysis from e.g. JET, which will probably be a reality before the end of this year. Much better to simply remove it. (also see this blogpost)

Instead, we should go through and verify we have covered and documented interfaces for:

  • A new BioSequence subtype
  • A new Alphabet subtype
  • An alphabet having the ASCIIAlphabet trait

Also, the tests should include custom biosequence and alphabet types to make sure some basic behaviour is covered by fallbacks.

  • Add custom alphabet type to tests
  • Add a separate alphabet with ASCIIAlphabet trait for print tests
  • Add custom mutable biosequence subtype to tests
  • Make sure all relevant tests include custom biosequence and/or custom alphabet

Failed push! adds empty bases

Is this intended?

julia> seq = dna"ATGC";

julia> push!(seq, "fail")
ERROR: MethodError: Cannot `convert` an object of type String to an object of type DNA
Closest candidates are:
  convert(::Type{DNA}, ::RNA) at /Users/ksb/.julia/packages/BioSymbols/O70zi/src/nucleicacid.jl:57
  convert(::Type{DNA}, ::Char) at /Users/ksb/.julia/packages/BioSymbols/O70zi/src/nucleicacid.jl:67
  convert(::Type{T}, ::UInt8) where T<:NucleicAcid at /Users/ksb/.julia/packages/BioSymbols/O70zi/src/nucleicacid.jl:50
  ...
Stacktrace:
 [1] enc64(::LongSequence{DNAAlphabet{4}}, ::String) at /Users/ksb/.julia/packages/BioSequences/k4j4J/src/longsequences/indexing.jl:125
 [2] unsafe_setindex! at /Users/ksb/.julia/packages/BioSequences/k4j4J/src/longsequences/indexing.jl:77 [inlined]
 [3] push!(::LongSequence{DNAAlphabet{4}}, ::String) at /Users/ksb/.julia/packages/BioSequences/k4j4J/src/biosequence/transformations.jl:23
 [4] top-level scope at REPL[27]:1
 [5] eval(::Module, ::Any) at ./boot.jl:331
 [6] eval_user_input(::Any, ::REPL.REPLBackend) at /Users/julia/buildbot/worker/package_macos64/build/usr/share/julia/stdlib/v1.4/REPL/src/REPL.jl:86
 [7] run_backend(::REPL.REPLBackend) at /Users/ksb/.julia/packages/Revise/tV8FE/src/Revise.jl:1165
 [8] top-level scope at none:0

julia> seq
5nt DNA Sequence:
ATGC-

If push!() throws an error, I would think that the sequence wouldn't change. This behavior seems like a footgun.

PFM for amino acid sequences?

I noticed that the PFM function does not run on protein sequences. Is there a biological rationale for this or is this simply missing functionality?

pfm_matrix
ArgumentError: sequence element must be DNA or RNA

BioSequences.PFM(::Array{BioSequences.LongSequence{BioSequences.AminoAcidAlphabet},1})@pwm.jl:52
top-level scope@Local: 1[inlined]

Spring cleanup for v3

BioSequences is a large package. Probably larger than it needs to be. It seems like a hold-over from the olden days when BioJulia was much more monolithic similar to e.g. Biopython.

In my opinion, it's easier and cleaner if BioSequences only contain the sequence types itself, and the most basic processing on them. Nothing fancy.

Are there things we can move out of BioSequences for v3? Here are my suggestions. If anyone thinks any of this belongs in this package, please leave a comment

Functionality to remove

  • src/composition.jl
  • src/demultiplexer.jl
  • src/minhash.jl
  • src/nmask.jl
  • src/refseq
  • ConditionIterator
  • VoidAlphabet
  • CharAlphabet

Functionality to migrate to other packages

  • K-mer composition
  • Barcode demultiplexing
  • K-mer minhashing

Dependencies to remove
These are either not used, or can be removed with trivial changes

  • BioGenerics
  • Printf
  • Combinatorics
  • IndexableBitVectors

A few comments to these

composition and minhash

These kmer techniques really belong in another package. There are so many cool kmer bithacks, and so much to do with kmers. We can't possibly do it all in this package.
I'd much rather have a handful of kmer iterators in this package, and leave it at that. Other packages can then build on the kmer iterators and sequences from here.

NMask and RefSeq

Do people really use these? It seems to me one might as well use LongDNASeq or its 2-bit equivalent. Theoretically, there is some space saving possible by using the NMask. Practically speaking, I doubt we have the dev manpower to maintain yet another longsequence-like sequence.

Demultiplexer

This seems very application-specific, and should probably be part of some NGS-related package.

ConditionIterator

This just seems like a re-implementation of Iterators.Filter, with much less functionality.

VoidAlphabet and CharAlphabet

I can't see the point of VoidAlphabet at all. CharAlphabet is useful for testing the generic interfaces, so I propose we move it from src to test.

Fixup tests before v3

Two things I'd like to improve on

  • Some of my most recent changes here on version 2 have had insufficient testing. Ideally, I'd like to get coverage up near 100% again. This is much more easily achieved AFTER #138 , so this might be one of the later issues to address before v3
  • There are way too many tests for BioSequences - nearly 9000 for LongSequence and 26000 for Mers. That's in package with 10k lines! On my (rather fast) laptop, they take 1.5 minutes. Not an eternity, but still a bit of wait. Worse, most of them don't really do anything except burn CI time and spam up the terminal if something goes wrong. I bet at least 75% of these tests could be eliminated if we cut down on test patterns, like
    for len in 1:64, _ in 1:10
        @test kmer_stuff(len, rand_seq(len))
    end

Do we really need 640 distinct tests here? What are the odds that function fails at k=54, and not before?

So, plan of action:

  • Complete #138 , removing everything we don't want in this package
  • Complete #121 , and any other new features to mers or longsequences
  • Aggressively cut down on needless tests for mers and longsequences
  • Add tests back to get near-100% coverage

Using Aqua on Julia exposes a few method ambiguities. In general, BioSequences have an issue with method ambiguities.

  • Fix method ambiguities (after all v3 feature issues, the possible interface rewrite, and #135 )

Streamline find functions to behave like Base's

Right now, you can do:

julia> findfirst(DNA_A, dna"TAGCGATGA")
2

Whereas it should be:

julia> findfirst(isequal(DNA_A), dna"TAGCGATGA")
2

The latter example currently errors. This happens because the predicate version of find* in BioSequences are currently restricted to subtypes of Function. But there are many function-like objects (such as e.g. Base.Fix2), which are not instances of Function.

We should remove findfirst(DNA_A, dna"TAG"), and make sure these functions behave as expected.

julia 1.5: Time for views of a sequence?

@jakobnissen @CiaranOMara

So julia 1.5 now means structs that have references to mutable are now stack-allocated.
I'm very excited about this. As you know, every LongSequence are mutable and every LongSequence owns its data.

Yes, there is a copy on write mechanism, but that's just an implementation detail to remove copying unless absolutely nessecery.

If you create many views over a sequence - taking advantage of the mechanism, you avoid copying, but you still need to allocate many LongSequence objects on the heap and trigger the gc and all that.

I've long wanted to make proper allocation free views of a sequence - a true view of fixed start and end, where if you edit a base you DO indeed edit the base in the underlying sequence (contrary to the copy on write mechanism which mean the seqs still owns its own data - even if sharing a vec temporarily for performance reasons). But there was never any point as they would just allocate to. Perhaps 1.5 is the time to try something - what do you guys think? I might be a massive boon, for example making kmers with K > 64?

We need a codon type for work with coding sequences.

We need a codon type for working with sequences that code for proteins.

Currently translation of proteins is done by iterating over 3-mers with a step size of 3.
The Kmer types almost work for this purpose - catching ambiguous characters.
However, they fall short when it comes to Gaps.
See if you try to make a Kmer with gaps you don't get what you might expect:

julia> kmer"AAA"
DNA 3-mer:
AAA

julia> kmer"AA-"
DNA 3-mer:
AGA

julia> kmer"---"
DNA 3-mer:
GGA

So for codons containing gaps e.g. "AA-", you want to throw error.
Even more complex, if you're dealing with sequences aligned by codons, then if you have a site:

"""
.....ACG......
.....- - -......
"""

Then you may want a codon to be able to represent "---", three gaps is a valid site you might want to work with in that context.

In addition, using 64 bits to store just three nucleotide characters is wasteful - a byte will do it.

I propose a new bitstype called Codon, based somewhat on Kmers, but that have a separate representation and that catch and account for gaps.

Such a type will pave the way to having higher level abstractions of sequences with reading frames for example, and gene models. But for now, such a type will allow me to add utilities for looking at signals of selection in sequences.

Fail to instantiate on Ubuntu

Current Behavior

In Ubuntu, fail to instantiate a package which contains BioSequences if the package is created in macOS.

Steps to Reproduce

Generate the follwing BSTest package in Julia REPL on macOS.

paalon at mac in ~/git
โ†ช julia
               _
   _       _ _(_)_     |  Documentation: https://docs.julialang.org
  (_)     | (_) (_)    |
   _ _   _| |_  __ _   |  Type "?" for help, "]?" for Pkg help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 1.4.0 (2020-03-21)
 _/ |\__'_|_|_|\__'_|  |  Official https://julialang.org/ release
|__/                   |

(@v1.4) pkg> generate BSTest
 Generating  project BSTest:
    BSTest/Project.toml
    BSTest/src/BSTest.jl

(@v1.4) pkg> activate BSTest
 Activating environment at `~/git/BSTest/Project.toml`

(BSTest) pkg> registry add https://github.com/BioJulia/BioJuliaRegistry.git
    Cloning registry from "https://github.com/BioJulia/BioJuliaRegistry.git"
registry `BioJuliaRegistry` already exist in `~/.julia/registries/BioJuliaRegistry`.

(BSTest) pkg> add BioSequences
   Updating registry at `~/.julia/registries/BioJuliaRegistry`
   Updating git-repo `https://github.com/BioJulia/BioJuliaRegistry.git`
   Updating registry at `~/.julia/registries/General`
   Updating git-repo `https://github.com/JuliaRegistries/General.git`
  Resolving package versions...
   Updating `~/git/BSTest/Project.toml`
  [7e6ae17a] + BioSequences v2.0.1
   Updating `~/git/BSTest/Manifest.toml`
  [67c07d97] + Automa v0.8.0
  [47718e42] + BioGenerics v0.1.0
  [7e6ae17a] + BioSequences v2.0.1
  [3c28c6f8] + BioSymbols v4.0.3
  [861a8166] + Combinatorics v1.0.0
  [864edb3b] + DataStructures v0.17.11
  [1cb3b9ac] + IndexableBitVectors v1.0.0
  [bac558e1] + OrderedCollections v1.1.0
  [f27b6e38] + Polynomials v0.5.3
  [3cdcf5f2] + RecipesBase v1.0.0
  [3bb67fe8] + TranscodingStreams v0.9.5
  [7200193e] + Twiddle v1.1.1
  [2a0f44e3] + Base64
  [8ba89e20] + Distributed
  [b77e0a4c] + InteractiveUtils
  [8f399da3] + Libdl
  [37e2e46d] + LinearAlgebra
  [56ddb016] + Logging
  [d6f4376e] + Markdown
  [de0858da] + Printf
  [9a3f8284] + Random
  [9e88b42a] + Serialization
  [6462fe0b] + Sockets
  [8dfed614] + Test
  [4ec0a83e] + Unicode

(BSTest) pkg> st
Project BSTest v0.1.0
Status `~/git/BSTest/Project.toml`
  [7e6ae17a] BioSequences v2.0.1

(BSTest) pkg> instantiate

julia> using BSTest
[ Info: Precompiling BSTest [f5260244-0a45-42a5-ad8a-78ba73d8e808]

julia> exit()
paalon at mac in ~/git
โ†ช ls
BSTest/
paalon at mac in ~/git
โ†ช 

Copy this package to Ubuntu, then instantiate.

paalon at ubuntu-bionic in ~/git
โ†ช ls
BSTest/
paalon at ubuntu-bionic in ~/git
โ†ช julia
               _
   _       _ _(_)_     |  Documentation: https://docs.julialang.org
  (_)     | (_) (_)    |
   _ _   _| |_  __ _   |  Type "?" for help, "]?" for Pkg help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 1.4.0 (2020-03-21)
 _/ |\__'_|_|_|\__'_|  |  Official https://julialang.org/ release
|__/                   |

(@v1.4) pkg> activate BSTest
 Activating environment at `~/git/BSTest/Project.toml`

(BSTest) pkg> registry add https://github.com/BioJulia/BioJuliaRegistry.git
    Cloning registry from "https://github.com/BioJulia/BioJuliaRegistry.git"
      Added registry `BioJuliaRegistry` to `~/.julia/registries/BioJuliaRegistry`

(BSTest) pkg> st
Project BSTest v0.1.0
Status `~/git/BSTest/Project.toml`
โ†’ [7e6ae17a] BioSequences v2.0.1
โ”Œ Warning: Some packages (indicated with a red arrow) are not downloaded, use `instantiate` to instantiate the current environment
โ”” @ Pkg.Display /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.4/Pkg/src/Display.jl:238

(BSTest) pkg> instantiate
   Updating registry at `~/.julia/registries/BioJuliaRegistry`
   Updating git-repo `https://github.com/BioJulia/BioJuliaRegistry.git`
ERROR: expected package `Combinatorics [861a8166]` to be registered

(BSTest) pkg>

My Environment

macOS 10.15.4 Catalina

Julia Version 1.4.0
Commit b8e9a9ecc6 (2020-03-21 16:36 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin18.6.0)
  CPU: Intel(R) Core(TM) i7-7660U CPU @ 2.50GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-8.0.1 (ORCJIT, skylake)

Ubuntu 18.04.4 LTS (Bionic Beaver)

Julia Version 1.4.0
Commit b8e9a9ecc6 (2020-03-21 16:36 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Core(TM) i7-7660U CPU @ 2.50GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-8.0.1 (ORCJIT, skylake)

Make FASTA and FASTQ readers / writers compatible with FileIO

We've already done great work for making reading formatted bioinformatics generic and simple to understand, it is a good idea to make what we have compatible with FileIO so for users then, opening and reading files like FASTA and FASTQ the same as if they were working with a CSV file.

BioSequences.jl 0.6

I'd like to release a new version of BioSequences.jl because:

  • It started to support Julia 0.6.
  • I cannot finish Julia 0.6 support of GenomicFeatures.jl since it depends on BioSequences.jl 0.5, which depends on PairwiseListMatrices.jl, which doesn't support Julia 0.6.
  • We added a new feature (#7) and a bug fix (#8).

Disallow construction of k-mers from integers

Right now, you can do

julia> DNACodon(11)
DNA 3-mer:
AGT

I don't think you should be able to do that. The fact that the codon is internally represented as UInt(11) is an implementation detail, and should not be part of the API.

I suggest removing it.

Edit: Changed the following:

  • Removed conversion to/from integers/kmers. This is dangerous since conversion can happen implicitly
  • Construction of kmers from integers is still permitted, but now checks for overflow, e.g. DNAMer{2}(0xfffff) throws a DomainError
  • To avoid the check, use the function masked, which instead masks the noncoding bits.

Add trait "iscomplete" to Alphabet

Add trait "iscomplete" to every alphabet. It should default to Val{false}(). It should return Val{true} if length(symbols(A)) == 1 << bits_per_symbol(A), i.e. if all bitpatterns are valid. In that case, the decoder can skip validation. There are also other efficiencies to be had, such as better random sequence generation.

Slicing ReferenceSequences

Slices of ReferenceSequence aren't behaving correctly.

Current Behavior

(@v1.4) pkg> st BioSequences
Status `~/.julia/environments/v1.4/Project.toml`
  [7e6ae17a] BioSequences v2.0.5 #master (https://github.com/BioJulia/BioSequences.jl.git)

julia> using BioSequences
[ Info: Precompiling BioSequences [7e6ae17a-c86d-528c-b3b9-7f778a29fe59]
[ Info: Compiling bit-parallel GC counter for LongSequence{<:NucleicAcidAlphabet}
[ Info: Compiling bit-parallel mismatch counter for LongSequence{<:NucleicAcidAlphabet}
[ Info: Compiling bit-parallel match counter for LongSequence{<:NucleicAcidAlphabet}
[ Info: Compiling bit-parallel ambiguity counter...
[ Info: 	For a single LongSequence{<:NucleicAcidAlphabet}
[ Info: 	For a pair of LongSequence{<:NucleicAcidAlphabet}s
[ Info: Compiling bit-parallel certainty counter for LongSequence{<:NucleicAcidAlphabet}
[ Info: Compiling bit-parallel gap counter for LongSequence{<:NucleicAcidAlphabet}

julia> rseq = ReferenceSequence(dna"NNAAAAAAN")
9nt Reference Sequence:
NNAAAAAAN

julia> sub_rseq = rseq[3:8]
6nt Reference Sequence:
AAAAAA

julia> sub_sub_rseq = sub_rseq[1:4]
4nt Reference Sequence:
NNAA <-- The expected answer is AAAA 

Context

Working sequences from TwoBit records.

Type for Paired-end Reads

For generating contigs we need to make use of paired-end reads. I think it would be convenient to have a a type Paired-end BioSequence or we can modify the current version of the BioSequence type to have gaps.

@benjward, I would be really happy to hear your comment about this one.

getindex not defined for arrays

I was trying to create a sequence without some index and I've found that getindex using arrays isn't supported:

using BioSequences
seq = dna"CCTAGGAGG"
seq[[1,2,3,7,8,9]] # MethodError
seq[[true, true, true, false, false, false, true, true, true]] # MethodError

Future of subsequences

Pinging @benjward @CiaranOMara perhaps this discussion should be part of #102 , but I do feel like the scope is a little broader here.

The situation

We currently are working on multiple different "subsequence" like types:

  • Ben is (has been?) working on NTupleKmers
  • I have a working prototype of a sequence view, similar to an array view (link coming soon)
  • I'm playing around with 256-bit kmers

This comes on top of the already pretty large subsequence codebase:

  • LongSequence's copy-on-write (COW) mechanism that allows a LongSequence to actually be a subsequence
  • The recently revamped Mer types.

So I'm concerned if not done right, we will end up with an enormous unelegant codebase that might still leave some bases uncovered. So: What do we really need? As I see it, we really just need:

  • An arbitrary-length subsequence, preferrably allocation-free.
  • An efficient type for kmers. Ideally, this would be flexible in size as well, but in my experience, kmers with k > 127 are rarely used.

My proposal

In this issue, I make two arguments:

Argument one: We should move completely to either NTupleKmers if they work, else N-bit kmers
While a maximum of K = 32 is good enough for many purposes, the usage of longer kmers is common enough that it's an awkward ceiling. By leveraging BitIntegers.jl, we could change the Kmer code to wrap an integer up to 1024-bit integers. Essentially its structure would be:

struct Kmer{K, A <: Alphabet, T <: Unsigned}
    data::T
end

For longer kmers like 51-mers (common in metagenomics), 127-mers (sometimes used in assembly), and perhaps even higher values than that, a truly variable-length kmer type like NTupleKmers is obviously best. Furthermore, I personally find it likely that NTuples will be even further optimized by the Julia core devs in the future since they are integral to immutable arrays.

However, NTupleKmers may be very difficult to write efficient code for. If that is the case, we can simplify the future codebase a great deal by having our kmers simply wrap a BitInteger type which is already aggressively optimized by LLVM. I am working on a proof of concept for this for show. Unfortunately, it adds another dependency for BioSequences.

  • Proposal 1: Remove our current Kmer types. If NTupleKmers work well, use those solely. Otherwise, add BitIntegers.jl as a dependency, and make a kmer type that parameterizes the integer type that stores its data

Argument two: We should remove the COW mechanism of LongSequence
I realize that this is a little controversial and require large changes to the codebase. But I do think it's the right direction to move. The COW mechanism allows sequences to be views, which is neat if users creates tonnes of subsequences.

However, my experience working with BioSequences.jl is that COW does make the codebase significantly more complicated. Every single indexing or iteration function that operates on the underlying data must take into account that the data may not begin at 1. In principle, this should not matter too much because it's hidden from the user. But realistically, at least in the near term, many of the heavy users of BioSequences will be devs themselves, and would like to tinker with BioSequence internals. That is, after all, one of the promises of Julia, it lets you tinker!

Furthermore, the benefit is eroded when 1.5 comes, since these subsequences will still allocate (because LongSequence is mutable) in a way a sequence view won't.

For the large majority of users, allocating a new LongSequence when slicing it will have completely fine performance - my current work for example, typically involves operations on max a few 100s of small sequences. But these users may still suffer from the COW mechanism if the difficulty of developing causes more bugs, e.g. #104 #110 . For the few ones which want great performance, the COW mechanism might not be performant enough. For the developers, it causes headaches.

  • Proposal 2: Remove the COW mechanism from LongSequence and introduce a SeqView-like object. Slicing a LongSequence will create a copy. The structure of a LongSequence is a mutable struct with the data vector and a single integer giving the length.

Demultiplexer fails for Levenshtein distance > 1

Expected Behavior

Demultiplexers can be generated for any Hamming distance, and for Levenshtein distance 1.
I would expect it to also work for higher distances.

These cases work:

julia> Demultiplexer(LongDNASeq.(["ATGGATGG"]), n_max_errors=1, distance=:hamming)
Demultiplexer{LongSequence{DNAAlphabet{4}}}:
  distance: hamming
  number of barcodes: 1
  number of correctable errors: 1

julia> Demultiplexer(LongDNASeq.(["ATGGATGG"]), n_max_errors=2, distance=:hamming)
Demultiplexer{LongSequence{DNAAlphabet{4}}}:
  distance: hamming
  number of barcodes: 1
  number of correctable errors: 2

julia> Demultiplexer(LongDNASeq.(["ATGGATGG"]), n_max_errors=1, distance=:levenshtein)
Demultiplexer{LongSequence{DNAAlphabet{4}}}:
  distance: levenshtein
  number of barcodes: 1
  number of correctable errors: 1

Current Behavior

This does not work:

julia> Demultiplexer(LongDNASeq.(["ATGGATGG"]), n_max_errors=2, distance=:levenshtein)
ERROR: MethodError: no method matching isless(::Nothing, ::Int64)
Closest candidates are:
  isless(::Missing, ::Any) at missing.jl:87
  isless(::AbstractFloat, ::Real) at operators.jl:167
  isless(::Real, ::Real) at operators.jl:355
  ...
Stacktrace:
 [1] <(::Nothing, ::Int64) at ./operators.jl:277
 [2] <=(::Nothing, ::Int64) at ./operators.jl:326
 [3] hamming_circle(::LongSequence{DNAAlphabet{4}}, ::Int64) at /opt/Julia/julia-1.5/packages/BioSequences/k4j4J/src/demultiplexer.jl:240
 [4] levenshtein_circle(::LongSequence{DNAAlphabet{4}}, ::Int64) at /opt/Julia/julia-1.5/packages/BioSequences/k4j4J/src/demultiplexer.jl:259
 [5] levenshtein_circle(::LongSequence{DNAAlphabet{4}}, ::Int64) at /opt/Julia/julia-1.5/packages/BioSequences/k4j4J/src/demultiplexer.jl:272
 [6] Demultiplexer(::Array{LongSequence{DNAAlphabet{4}},1}; n_max_errors::Int64, distance::Symbol) at /opt/Julia/julia-1.5/packages/BioSequences/k4j4J/src/demultiplexer.jl:162
 [7] top-level scope at REPL[81]:1

The problem appears to be in the levenshtein_circle function:

julia> BioSequences.levenshtein_circle(LongDNASeq("ATGGAGTC"), 1)
71-element Array{LongSequence{DNAAlphabet{4}},1}:
 AAGGAGTC
 AATGGAGTC
 ACGGAGTC
 ... truncated

julia> BioSequences.levenshtein_circle(LongDNASeq("ATGGAGTC"), 2)
ERROR: MethodError: no method matching isless(::Nothing, ::Int64)
Closest candidates are:
  isless(::Missing, ::Any) at missing.jl:87
  isless(::AbstractFloat, ::Real) at operators.jl:167
  isless(::Real, ::Real) at operators.jl:355
  ...
Stacktrace:
 [1] <(::Nothing, ::Int64) at ./operators.jl:277
 [2] <=(::Nothing, ::Int64) at ./operators.jl:326
 [3] hamming_circle(::LongSequence{DNAAlphabet{4}}, ::Int64) at /opt/Julia/julia-1.5/packages/BioSequences/k4j4J/src/demultiplexer.jl:240
 [4] levenshtein_circle(::LongSequence{DNAAlphabet{4}}, ::Int64) at /opt/Julia/julia-1.5/packages/BioSequences/k4j4J/src/demultiplexer.jl:259
 [5] levenshtein_circle(::LongSequence{DNAAlphabet{4}}, ::Int64) at /opt/Julia/julia-1.5/packages/BioSequences/k4j4J/src/demultiplexer.jl:272
 [6] top-level scope at REPL[100]:1

Possible Solution / Implementation

Apparently this causes the error, but I'm not able to see how this becomes empty

if findfirst(isequal(seq[p]), ACGT) โ‰ค r

Steps to Reproduce (for bugs)

Take the test-example

@testset "Levenshtein distance" begin

and replace the distance with 2:

julia> barcodes = [dna"ATGG", dna"CAGA", dna"GGAA", dna"TACG"]
4-element Array{LongSequence{DNAAlphabet{4}},1}:
 ATGG
 CAGA
 GGAA
 TACG

julia> dplxr = Demultiplexer(barcodes, n_max_errors=1, distance=:levenshtein)
Demultiplexer{LongSequence{DNAAlphabet{4}}}:
  distance: levenshtein
  number of barcodes: 4
  number of correctable errors: 1

julia> dplxr = Demultiplexer(barcodes, n_max_errors=2, distance=:levenshtein)
ERROR: MethodError: no method matching isless(::Nothing, ::Int64)
Closest candidates are:
  isless(::Missing, ::Any) at missing.jl:87
  isless(::AbstractFloat, ::Real) at operators.jl:167
  isless(::Real, ::Real) at operators.jl:355
  ...
Stacktrace:
 [1] <(::Nothing, ::Int64) at ./operators.jl:277
 [2] <=(::Nothing, ::Int64) at ./operators.jl:326
 [3] hamming_circle(::LongSequence{DNAAlphabet{4}}, ::Int64) at /opt/Julia/julia-1.5/packages/BioSequences/k4j4J/src/demultiplexer.jl:240
 [4] levenshtein_circle(::LongSequence{DNAAlphabet{4}}, ::Int64) at /opt/Julia/julia-1.5/packages/BioSequences/k4j4J/src/demultiplexer.jl:259
 [5] levenshtein_circle(::LongSequence{DNAAlphabet{4}}, ::Int64) at /opt/Julia/julia-1.5/packages/BioSequences/k4j4J/src/demultiplexer.jl:272
 [6] Demultiplexer(::Array{LongSequence{DNAAlphabet{4}},1}; n_max_errors::Int64, distance::Symbol) at /opt/Julia/julia-1.5/packages/BioSequences/k4j4J/src/demultiplexer.jl:162
 [7] top-level scope at REPL[104]:1

Your Environment

  • Package Version used:
  • Julia Version used:
  • Operating System and version (desktop or mobile):
  • Link to your project:
julia> versioninfo()
Julia Version 1.5.0
Commit 96786e22cc (2020-08-01 23:44 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Core(TM) i7-8650U CPU @ 1.90GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-9.0.1 (ORCJIT, skylake)
Environment:
  JULIA_DEPOT_PATH = /opt/Julia/julia-1.5

julia> Pkg.status("BioSequences")
  [7e6ae17a] BioSequences v2.0.5

[speculative] Remove `BioSequence(::Integer)`?

Right now you can do

julia> LongDNASeq(10)
10nt DNA Sequence:
CA--------

This strikes me as weak typing. In fact, we have not constructed a sequence from 10. Imagine a situation where you mistakenly thing x is something sequence-like, such as a bioseq or a string, but really, it's an integer. Doing LongDNASeq(x) then works - and give the wrong result.

I propose instead we change it to LongDNASeq(undef, 10), similar to arrays.

Performance review before release

So @benjward have mentioned wanting to make a new release of BioSequences that includes performance improvements. We might as well squeeze whatever we have on our performance wishlist into that release.

When #83 and #84 gets merged, I have two smaller PRs lined up which builds on top of those:

  • One that improves the performance of printing LongSequence. Currently, they are printed base-by-base, which is inefficient in 1.3 and above due to IO operations locking threads. Also, we can exploit the ASCII-related functions I created in the other PRs to make it way faster for ASCII sequences.
  • One that introduces a copy! method, such that you can in-place modify a single sequence when you are iterating over sequences, similar to what you do for FASTA records. This is just a simple wrapper around copyto!, or encode_copy! or encode_chunks, depending on source and destination types.

Benchmark

We can use the RC benchmark, created by the Seq guys. This benchmark covers almost all the included benchmarks. I'll be Julia 1.3 on my native computer instead of in a VM. This is comparing all current PRs with commit 61b8fae.

Before: 100.9 seconds
After:    9.2 seconds

TODO:

  • Improve LongSequence printing (PR #91 )
  • Add copy! method ( PR #87)
  • Add docs to copy! method (PR #95 )
  • Optimize length(::LongSequence) (PR #88 )
  • Optimize resize!(::LongSequence) (PR #92)
  • Optimize orphan!(::LongSequence) (PR #88)
  • Review bitpar counting kernels ( PR #96 )
  • Review ReferenceSequence ( PR #94 )
  • SIMD-vectorize reverse complementation ( PR #97 )
  • Consider optimizing random nucleotide generation (issue #77 )
  • Improve minhash hashing speed (PR #99)

Faster random LongSequence{NucleicAcidAlphabet{4}} generation

I was goofing around the other day and discovered a way to make generation of LongSequence{NucleicAcidAlphabet{4}} faster. It's probably fast enough already for any practical application, but in the words of Jeff Bezanson: "MOAR PERFORMANCE". And also, this new code is shorter.
I'm aware that BioSequences guarantees reproducible random sequence generation, and a new algorithm messes with that, so I make this issue here instead of a PR, so we can perhaps use it in a future version of BioSequences.

The new function looks like this:

function randseq2(rng::AbstractRNG, A::NucleicAcidAlphabet{4}, len::Integer)
    seq = LongSequence{typeof(A)}(len)
    data = seq.data
    rand!(rng, data)
    @inbounds for i in eachindex(data)
        nuc = 0x1111111111111111
        mask = data[i]
        nuc = ((nuc & mask) << 1) | (nuc & ~mask)
        mask >>>= 1
        nuc = ((nuc & mask) << 2) | (nuc & ~mask)
        data[i] = nuc
    end
    return seq
end

And it's fast because it uses rand! to generate all the randomness at once, and because it SIMD-vectorizes. It's about 8x faster for long sequences.

Problems installing BioSequences package

Hello! Apologies if this is not the right place to post things like this. I am trying to install the BioSequences package, and have done the following:

  • Entered package mode
    registry add https://github.com/BioJulia/BioJuliaRegistry.git
  • Exited package mode
    using Pkg
    Pkg.add("BioSequences")

I am getting the following error:

ERROR: MethodError: Cannot `convert` an object of type Nothing to an object of type String    
Closest candidates are:     
  convert(::Type{T<:AbstractString}, ::T<:AbstractString) where T<:AbstractString at strings/basic.jl:208    
  convert(::Type{T<:AbstractString}, ::AbstractString) where T<:AbstractString at strings/basic.jl:209     
  convert(::Type{T}, ::T) where T at essentials.jl:167     
Stacktrace:     
 [1] setindex!(::Dict{Base.UUID,String}, ::Nothing, ::Base.UUID) at ./dict.jl:380    
 [2] deps_graph(::Pkg.Types.Context, ::Dict{Base.UUID,String}, ::Dict{Base.UUID,Pkg.Types.VersionSpec}, ::Dict{Base.UUID,Pkg.Types.Fixed}) at /Users/sabae/buildbot/worker/package_macos64/build/usr/share/julia/stdlib/v1.2/Pkg/src/Operations.jl:424    
 [3] resolve_versions!(::Pkg.Types.Context, ::Array{Pkg.Types.PackageSpec,1}) at /Users/sabae/buildbot/worker/package_macos64/build/usr/share/julia/stdlib/v1.2/Pkg/src/Operations.jl:316    
 [4] #add#100(::Bool, ::typeof(Pkg.Operations.add), ::Pkg.Types.Context, ::Array{Pkg.Types.PackageSpec,1}, ::Array{Base.UUID,1}) at /Users/sabae/buildbot/worker/package_macos64/build/usr/share/julia/stdlib/v1.2/Pkg/src/Operations.jl:962    
 [5] #add at ./none:0 [inlined]    
 [6] #add#25(::Bool, ::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}, ::typeof(Pkg.API.add), ::Pkg.Types.Context, ::Array{Pkg.Types.PackageSpec,1}) at /Users/sabae/buildbot/worker/package_macos64/build/usr/share/julia/stdlib/v1.2/Pkg/src/API.jl:99    
 [7] add at /Users/sabae/buildbot/worker/package_macos64/build/usr/share/julia/stdlib/v1.2/Pkg/src/API.jl:69 [inlined]    
 [8] #add#24 at /Users/sabae/buildbot/worker/package_macos64/build/usr/share/julia/stdlib/v1.2/Pkg/src/API.jl:67 [inlined]    
 [9] add at /Users/sabae/buildbot/worker/package_macos64/build/usr/share/julia/stdlib/v1.2/Pkg/src/API.jl:67 [inlined]     
 [10] #add#21 at /Users/sabae/buildbot/worker/package_macos64/build/usr/share/julia/stdlib/v1.2/Pkg/src/API.jl:65 [inlined]    
 [11] add at /Users/sabae/buildbot/worker/package_macos64/build/usr/share/julia/stdlib/v1.2/Pkg/src/API.jl:65 [inlined]    
 [12] #add#20(::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}, ::typeof(Pkg.API.add), ::String) at /Users/sabae/buildbot/worker/package_macos64/build/usr/share/julia/stdlib/v1.2/Pkg/src/API.jl:64    
 [13] add(::String) at /Users/sabae/buildbot/worker/package_macos64/build/usr/share/julia/stdlib/v1.2/Pkg/src/API.jl:64    
 [14] top-level scope at REPL[2]:1    

I'm using Julia version 1.2.0, and the latest BioSequences. I'm fairly confident that I got the git repo correctly, because re-doing:
registry add https://github.com/BioJulia/BioJuliaRegistry.git
Results in:

   Cloning registry from "https://github.com/BioJulia/BioJuliaRegistry.git"    
[ Info: registry `BioJuliaRegistry` already exist in `~/.julia/registries/BioJuliaRegistry`. 

map(f, seq) raises an exception

Consider the following snippet:

using BioSequences
my_dict = Dict{AminoAcid, Float64}(...)
x = map(aa -> my_dict[aa], aa"PLEASANTLY")

Expected Behavior

I would expect x to contain an array of Float64, taken from my_dict.

Current Behavior

I get the following error (-3.5 is one of the entries in my_dict):

ERROR: InexactError: UInt8(UInt8, -3.5)
Stacktrace:
 [1] Type at ./float.jl:679 [inlined]
 [2] convert at ./number.jl:7 [inlined]
 [3] convert at /bmm/home/et517/.julia/packages/BioSymbols/8RyRT/src/aminoacid.jl:24 [inlined]
 [4] enc64(::BioSequence{AminoAcidAlphabet}, ::Float64) at /bmm/home/et517/.julia/packages/BioSequences/7i86L/src/bioseq/bioseq.jl:107
 [5] unsafe_setindex! at /bmm/home/et517/.julia/packages/BioSequences/7i86L/src/bioseq/indexing.jl:143 [inlined]
 [6] map!(::getfield(Ampliphyre, Symbol("##39#40")), ::BioSequence{AminoAcidAlphabet}) at /bmm/home/et517/.julia/packages/BioSequences/7i86L/src/bioseq/transformations.jl:172
 [7] map at /bmm/home/et517/.julia/packages/BioSequences/7i86L/src/bioseq/transformations.jl:178 [inlined]

Possible Solution / Implementation

It appears that the method function Base.map(f::Function, seq::BioSequence) around transformations.jl:177 is just wrong and should be removed.

Your Environment

BioSequences v1.1.0
Julia Version 1.0.3
Platform Info:
  OS: Linux (x86_64-conda_cos6-linux-gnu)
  CPU: Intel(R) Xeon(R) CPU E5-2640 0 @ 2.50GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.0 (ORCJIT, sandybridge)

Faster computation of Kmer composition

Kmer composition is relatively widely used. The reason for using kmers rather than arbitrary length sequences are almost often the superior speed of kmers, hence it's an area worth optimizing.

Currently in BioSequences, calculating the kmer composition of a sequence relies on iterate(x::EachKmerIterator). However, rather than relying on a kmer iterator, counting kmers by a direct loop over each nucleotide of the sequence is significantly faster for longer sequences. In a test implementation I made, the kmer composition can be counted ~4x faster for long sequences (10k nt) using a direct loop, compared to the current BioSequences implementation.

I suggest changing the implementation of the method
composition(iter::EachKmerIterator{T}) where {T<:Kmer}
Such that it's calculated by direct iteration.

Advantages

  • Quicker computation

Disadvantages

  • Does not rely on the existing implementation of BioSequences.EachKmerIterator. This means that if the underlying structure of sequences change, updating BioSequences.EachKmerIterator will not be sufficient to making these methods work.

I have created an implementation of my proposed change at:
https://github.com/jakobnissen/biobag/blob/master/kmercount_newimpl.jl

Note!: I have not extensively tested my implementation for all the corner cases - I can write the test code if this proposal is accepted.

Dependency issues when including the BioSequenceGraphs packages.

I have made several updates in my local to the BioSequenceGraphs package and want to use this updated version together with the BioSequences package.
I call
' using BioSequences'
to have access to all the functionalities in BioSequences package which runs smoothly.
When I want to use the BioSequenceGraphs package in my local, I go to the directory containing the .jl files (BioSequenceGraph.jl/src) and use include to call each of the .jl files inside the src folder:

include("Nodes.jl")
include("Links.jl")
include("SequenceGraph.jl")
include("IO.jl")
include("DeBruijnGraph.jl")

However after doing this I can not use the functions from BioSequences that I could use previously. The functions that are extended inside one of these .jl files are not accessible anymore. Some examples are : length, sequences.

What should be the proper way integrating the BioSequenceGraphs package in my local without causing some dependency/import related issues.
Below is the error I get after running the sequence function to get the fastq reads and then trying to include the Nodes.jl which also has a function named sequence:

error in method definition: function BioCore.sequence must be explicitly imported to be extended

Implement reverse translation (AminoAcid -> Codon)

I am thinking about where a good place for the codon and aminoacid tables would be in julia and wanted to ask if it would fit in here.

Expected Behavior

convert(Vector{RNAMer{3}}, AA_A) == RNAMer{3}.(["GCU", "GCA", "GCC", "GCG"])
convert(AminoAcid, RNAMer{3}("GCU")) == AA_A 

Maybe have an extra type alias const Codon = RNAMer{3}

Possible Solution / Implementation

Add two lookup tables.

Context

Could go alternatively in a separate package.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.