biojulia / biosequences.jl Goto Github PK

View Code? Open in Web Editor NEW

147.0 15.0 46.0 3.44 MB

Biological sequences for the julia language

Home Page: http://biojulia.dev/BioSequences.jl

License: MIT License

Julia 100.00%

biojulia bioinformatics biology biological-sequences

biosequences.jl's Introduction

Bio

This package has been depreceated. Full details are available here. You might still download and use this package, as we don't want old scripts to break. However going forward, know that this repository is archived and read only, no further updates or fixes will be committed. You should use the packages that replace Bio.jl - a list is available here.

biosequences.jl's People

Contributors

Stargazers

Watchers

Forkers

kdm9 tkelman bicycle1885 kescobo transgirlcodes acrylami dcjones dillondaudert jgreener64 dmitrip phaverty harryscholes dtjohnson kdyrhage jakobnissen juliadocsforks ivirshup tanhevg stjordanis ciaranomara jlumpe cjprybol jing-xinxing isgasho bioturbonick lhnguyen-vn tp2750 avmi shulp2211 standardgalactic mario-kart-felix playfloor kristofferc ccc1685 lixiang117423 ov-ov chenhz1223 camilogarciabotero benjaminlozanow doctorbetaq tdw-89 almir07 wook2014 timbitz cihga39871 marcoxa

biosequences.jl's Issues

Add split function

I was trying to add the split function for BioSequences.

Base.split(str::T, splitter; limit::Integer=0, keep::Bool=true) where {T<:Sequence} =
    Base._split(str, splitter, limit, keep, T[])
function Base._split(str::Sequence,splitter, limit::Integer , keep_empty::Bool ,strs::Array)

    i = start(str)
    n = endof(str)
    r = search(str,splitter,i)
    j,k = first(r), last(r) + 1

    while 0 < j <= n && length(str) != limit-1
        if i < k
            if keep_empty || i < j
                push!(strs,str[i:j-1])
            end
            i = k
        end
        (k <= j) && (k = j + 1)
        r = search(str,splitter,k)
        j, k = first(r), last(r) + 1
    end

    if keep_empty || !done(str,i)
        push!(strs, str[i:end])
    end

    return strs
end

It was based on the one in base shown below.

split(str::T, splitter; limit::Integer=0, keep::Bool=true) where {T<:AbstractString} =
    _split(str, splitter, limit, keep, SubString{T}[])
function _split(str::AbstractString, splitter, limit::Integer, keep_empty::Bool, strs::Array)
    i = start(str)
    n = endof(str)
    r = search(str,splitter,i)
    j, k = first(r), nextind(str,last(r))
    while 0 < j <= n && length(strs) != limit-1
        if i < k
            if keep_empty || i < j
                push!(strs, SubString(str,i,prevind(str,j)))
            end
            i = k
        end
        (k <= j) && (k = nextind(str,j))
        r = search(str,splitter,k)
        j, k = first(r), nextind(str,last(r))
    end
    if keep_empty || !done(str,i)
        push!(strs, SubString(str,i))
    end
    return strs
end

It seems to work fine with DNASequences and RNASequences.

split(dna"AAAAACACAAAACATATATATC",dna"CA",keep=true)

4-element Array{BioSequences.BioSequence{BioSequences.DNAAlphabet{4}},1}:
 AAAAA   
         
 AAA     
 TATATATC

split(dna"AAAAACACAAAACATATATATC",dna"CA",keep=false)

3-element Array{BioSequences.BioSequence{BioSequences.DNAAlphabet{4}},1}:
 AAAAA   
 AAA     
 TATATATC

But with AminoAcidSequence is a little buggy, in particular, the keep=true argument doesn't work as expected.

split(aa"HHHHHHTTCHHHHHHHTTCHHHHH",aa"HT",keep=true)

3-element Array{BioSequences.BioSequence{BioSequences.AminoAcidAlphabet},1}:
 HHHHH   
 TCHHHHHH
 TCHHHHH

split(aa"HHHHHHTTCHHHHHHHTTCHHHHH",aa"HT",keep=false)

3-element Array{BioSequences.BioSequence{BioSequences.AminoAcidAlphabet},1}:
 HHHHH   
 TCHHHHHH
 TCHHHHH

Implement fast String to LongSequence conversion for more alphabets

In #81 , I made LongSequence conversion of ASCII strings much faster by using an already-optimized method for conversion to 4-bit nucleotide alphabets. However, there is no corresponding fast method for 2-bit alphabets nor amino acid alphabets, though there easily could be.
That should be implemented some time.

Edit: Instead of current dispatch model, for instantiating a LongSequence of alphabet type DNAAlphabet, RNAAlphabet and AminoAcidAlphabet, don't even check for ASCII. Just wrap a vector around it always, since non-ascii strings will not be valid anyway. That saves the call to isascii, which is O(n).

Change Varargs to vectors or array

BioSequences.jl/src/bioseq/site_counting/site_counting.jl

Line 158 in 45e5533

    
           function count_pairwise{P<:Position,A<:NucAlphs,N}(::Type{P}, seqs::Vararg{BioSequence{A},N})

@bicycle1885 There was an issue or PR recently in which you recommended removing Varargs and just going with a general iterable.

I'm considering whether I should change the varargs in the signature to some other container type, perhaps add a parameter for AbstractArray like:

count_pairwise{P<:Position,A<:NucAlphs,C<:AbstractArray}(::Type{P}, seqs::C{BioSequence{A},N})

I already have a plan to cleanup the fast counting internals a bit and wondered if it was worth doing this at the same time.

Release BioSequences v2

@BioJuliaBot register

Should we make BioSequences <: AbstractVector?

So I'm just making an actual issue to discuss - and document said discussion of a prospect that was raised on Slack.

Because as I'm going through broadcasting, I'm seeing that there are some arguments in the pro column for doing this.

FASTA.Reader doesn't recognize the correct alphabet with ambiguous sites

I have a data set of DNA sequences, where some sites are ambiguous. In the example file below, the second sequence has Y, which means either C or T.

>P1
ATT
>P2
AYC
>P3
ATT
>P4
ATT

When I read this file with FASTA.Reader, the second sequence is thought to be an alignment of amino acids, because Y also means Tyrosine if we think about amino acids instead of nucleic acids:

julia> r = BioSequences.FASTA.Reader(open("myfile.fasta"))
julia> for record in r
         @show eltype(sequence(record))
       end
eltype(sequence(record)) = DNA
eltype(sequence(record)) = AminoAcid
eltype(sequence(record)) = DNA
eltype(sequence(record)) = DNA

I would like to tell FASTA.Reader that the Y character stands for DNA_Y, not AA_Y. Is there a way to tell the function that all the 4 sequences are DNA sequences?

(v1.1) pkg> status
    Status `~/.julia/environments/v1.1/Project.toml`
  [7e6ae17a] BioSequences v1.1.0
  [3c28c6f8] BioSymbols v3.1.0
  [336ed68f] CSV v0.5.5
  ...

Erroneous reverse_complement behaviour?

Noticed something strange in some of my tests as I prepare packages for submission to BioJulia. For one test, I was using reverse_complement on a particular sequence, and the output of the function is definitely wrong. This happens on my local machines but not on the Travis testing images, which is how I noticed this, so I'm not sure that this is replicable at all- just want to make sure it's not a general phenomenon.

Expected Behavior

Ordinary reverse complement produced: from
TATATATATATCGCGCGCGCG
expect
CGCGCGCGCGATATATATATA

Current Behavior

Output of reverse_complement is
TATATATATATATATATATAT
whereas output of reverse_complement! is correct

Possible Solution / Implementation

No idea!

Steps to Reproduce (for bugs)

julia> testseq=LongSequence{DNAAlphabet{2}}("ATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATCGCGCGCGCGCATCATCATCATCATCATCATCATCATCATCATCATCATCATCATCATCATCATCATCATCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCATCATCATCATCATCATCATCATCATCATCATCATCATCATCATCATCATCATCATCATCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCATCATCATCATCATCATCATCATCATCATCATCATCATCATCATCATCATCATCATCATCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCATCATCATCATCATCATCATCATCATCATCATCATCATCATCATCATCATCATCATCATCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCATCATCATCATCATCATCATCATCATCATCATCATCATCATCATCATCATCATCATCATCGCGCGCGCGCGCGCGCGCGCGCGATATATATATATATAT")
1000nt DNA Sequence:
ATATATATATATATATATATATATATATATATATATATA…GCGCGCGCGCGCGCGCGCGCGCGATATATATATATATAT

julia> testseq[490:510]
21nt DNA Sequence:
TATATATATATCGCGCGCGCG

julia> reverse_complement(testseq[490:510])
21nt DNA Sequence:
TATATATATATATATATATAT

julia> reverse_complement!(testseq[490:510])
21nt DNA Sequence:
CGCGCGCGCGATATATATATA

Your Environment

Package Version used: 2.0.4
Julia Version used: 1.4.2
Operating System and version (desktop or mobile): Linux Mint 19.3
Link to your project: https://github.com/mmattocks/BioBackgroundModels.jl

Status ~/.julia/environments/v1.4/Project.toml
[6e4b80f9] BenchmarkTools v0.5.0
[71d2fffc] BioBackgroundModels v0.1.0 [../../../../../srv/git/BioBackgroundModels]
[7e6ae17a] BioSequences v2.0.4
[a93c6f00] DataFrames v0.21.2
[31c24e10] Distributions v0.23.4
[c2308a5c] FASTX v1.1.2
[5fb14364] OhMyREPL v0.5.5
[92933f4c] ProgressMeter v1.3.1
[295af30f] Revise v2.7.2
[4c63d2b9] StatsFuns v0.9.5

Move to upcoming release of BioSymbols

As part of #145 , BioJulia/BioSymbols.jl#38 needs to be merged as well, which requires a new BioSymbols release.

We might as well make a new BioSymbols release, and cram whatever things we want to change into that release.

Further generalise the bitparallel counting code

There are other operations I've seen need of - computing the number of segregating sites in a population sample of sequences and so on, that is not purely counting the number of a kind of site between two sequences, but that CAN be done, with a bit of re-working, with the bit-parallel counting methods already developed here. So previously I generalised the code, to counting different kinds of sites bit-parallel, now I'm going to generalise it further to let it do much more than just pairwise site counting. Related PR to come.

In 2.0 docs, link on left menu bar for searching goes to index, not page for search methods

This template is rather extensive. Fill out all that you can, if are a new contributor or you're unsure about any section, leave it unchanged and a reviewer will help you 😄. This template is simply a tool to help everyone remember the BioJulia guidelines, if you feel anything in this template is not relevant, simply delete it.

Expected Behavior

Current Behavior

Possible Solution / Implementation

Steps to Reproduce (for bugs)

Context

Your Environment

Package Version used:
Julia Version used:
Operating System and version (desktop or mobile):
Link to your project:

How to get the phred score in a string variable most efficiently?

In a section of code I am writing, I want to read fastq file without making use of phred score at all. I would like to just have a string variable which would contain the phred line exactly as it is currently in the input fastq files.

New to Julia, so I have failed to work out how to do it. You must do it somewhere because println(record) (please see context below) results in the representation I am looking for.

fastq_stream = FASTQ.Reader(open("pool1.fastq", "r"))
for record in fastq_stream
    #pseudo-code from here
    # to_csv(record.name, record.sequence, record.quality, record.sequence.length())
end
close(fastq_stream)

Along with the bit in pseudo-code, I am also storing which sample each sequence belongs to and the barcode match error (from a demultiplexing function we use). The idea is for users to then filter the contents of the csv with all this extra information according to their own needs in order to produce filtered fastq files (outside scope of this issue, but thought I would add it for context).

The native list of INT8 I get from FASTQ.quality(record, :sanger) does not store well in a CSV. At this stage a string would be better. When it comes to writing fastq files, I can then convert from string to this list format if necessary.

Any help or alternate recommendations would be very much appreciated.

Your Environment

Package Version used: Not sure how to find the version
Julia Version used: 0.6.4
Operating System and version (desktop or mobile):
Ubuntu inside a Docker container

uname -a

Linux a5d0b3698610 4.15.0-45-generic #48-Ubuntu SMP Tue Jan 29 16:28:13 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

One-hot encoding of sequences

This is a feature request.

Machine learning models of biological sequences often work by representing sequences in one-hot encoding. I think it would be nice to add support for onehot encoding/decoding of biological sequences in this package.

Export `RE.Regex`, and make constructors easier.

I want to have a regex search search for biosequences, and it can be changed by a value.

for example,

start="AGCAT"
re=@biore_str(string(start,"[AGCT]+"),"d")

it always repored an error, but I want it can be changed by start value?

BioSequences.RE.RegexMatch behaves nothing like Base.RegexMatch

BioSequences.RE.RegexMatch lacks the "match" and "offset" fields of Base.RegexMatch, which not only makes it considerably less useful, but is somewhat difficult to debug because their print() representation is identical.

Code:

string_seq = "ARNDCEQG"
aa_seq = aa"ARNDCEQG"

string_match = match(r"DCE", string_seq)
aa_match = match(biore"DCE"aa, aa_seq)

println(string_match)
println(aa_match)

println(fieldnames(typeof(string_match)))
println(fieldnames(typeof(aa_match)))

Result:

RegexMatch("DCE")
RegexMatch("DCE")
(:match, :captures, :offset, :offsets, :regex)
(:seq, :captured)

Demultiplexer scales badly with Hamming distance

Background

I'm using Demultiplexer() to demultiplex nanopore reads.
This works well, but when allowing more errors in the barcodes, the time to generate the demultiplexer grows very fast.

Current Behavior

Allowing one more error cost more than 10 times longer in terms of time and allocations.

Desired Behavior

It would be great if it was faster.

Steps to reproduce

julia> @time Demultiplexer(LongDNASeq.(["GGAGAAGAAGAAGAA"]), n_max_errors=1, distance=:hamming)
  0.000388 seconds (1.56 k allocations: 162.047 KiB)
Demultiplexer{LongSequence{DNAAlphabet{4}}}:
  distance: hamming
  number of barcodes: 1
  number of correctable errors: 1

julia> @time Demultiplexer(LongDNASeq.(["GGAGAAGAAGAAGAA"]), n_max_errors=2, distance=:hamming)
  0.010063 seconds (50.47 k allocations: 3.590 MiB)
Demultiplexer{LongSequence{DNAAlphabet{4}}}:
  distance: hamming
  number of barcodes: 1
  number of correctable errors: 2

julia> @time Demultiplexer(LongDNASeq.(["GGAGAAGAAGAAGAA"]), n_max_errors=3, distance=:hamming)
  0.193055 seconds (1.08 M allocations: 58.884 MiB)
Demultiplexer{LongSequence{DNAAlphabet{4}}}:
  distance: hamming
  number of barcodes: 1
  number of correctable errors: 3

julia> @time Demultiplexer(LongDNASeq.(["GGAGAAGAAGAAGAA"]), n_max_errors=4, distance=:hamming)
  3.394650 seconds (15.94 M allocations: 734.229 MiB, 10.49% gc time)
Demultiplexer{LongSequence{DNAAlphabet{4}}}:
  distance: hamming
  number of barcodes: 1
  number of correctable errors: 4

julia> @time Demultiplexer(LongDNASeq.(["GGAGAAGAAGAAGAA"]), n_max_errors=5, distance=:hamming)
 39.984839 seconds (169.53 M allocations: 7.118 GiB, 9.05% gc time)
Demultiplexer{LongSequence{DNAAlphabet{4}}}:
  distance: hamming
  number of barcodes: 1
  number of correctable errors: 5

My Environment

julia> versioninfo()
Julia Version 1.4.0
Commit b8e9a9ecc6 (2020-03-21 16:36 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Core(TM) i5-2500K CPU @ 3.30GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-8.0.1 (ORCJIT, sandybridge)

julia> Pkg.status("BioSequences")
Status `~/.julia/environments/v1.4/Project.toml`
  [7e6ae17a] BioSequences v2.0.1

Implement broadcasting for biosequences

TL;DR

For v3, we should add broadcasting to BioSequence, and make my_seq[1:4] = DNA_A error.

Rationale

I was working on #133 , but can't get it to work properly. When calling getindex and setindex! with BioSequences, there always seems to be an edge case where dispatch goes the wrong way, or something is not implemented, or you get a method ambiguity error. I remember thinking the same when doing #120 .

I hate method ambiguity errors. They're a critical error in the sense that they completely break your workflow. They can't be found statically, and they're extremely hard to test for. The LongSequence test suite is compresensive, but I can guarantee you, I can find method ambiguity in something as simple as setindex! there.

Why is this so hard? Why is it so hard for BioSequences in particular?

I've come to think one of the reasons is that my_seq[inds] = x has two distinct semtantic meaning in BioSequences right now, namely these two:

julia> my_seq = dna"GGCTA";

julia> my_seq[2:4] = dna"AAC"; my_seq
5nt DNA Sequence:
GAACA

julia> my_seq[2:4] = DNA_M; my_seq
5nt DNA Sequence:
GMMMA

Having two distinct semantic meanings covered by the same function is a bad idea. Here's what happens if you do it.

Implement both methods. But oh, we need to have them be different signatures to dispatch to the right one. Let's implement setindex!(::BioSequence, ::Any, ::Vector{Bool}) and setindex!(::BioSequence, ::BioSequence, ::Vector{Bool}) separately. This works ONLY because one signature is more specific than the other, NOT because the signatures are incompatible. Note that we already loses some flexibility, because we now can't do my_seq[1:2] = [DNA_A, DNA_C], since that would call the wrong method.
Make new biosequence e.g. LongSequence or LongSubSeq. We need to specialize one of the setindex! methods, so we add setindex!(::LongSequence, ::Any, ::Vector{Bool}).
That new method is now more specific than the fallback (of course), but that now messes up the original dispatch, that was built on one method being more specific than the other. Method ambiguity, so we add setindex!(::LongSequence, ::LongSequence, ::Vector{Bool}). We lose even more flexibility, and the original method error may still be exposed if you try to do my_seq[1:2] = mer"TA". Also, we have to duplicate efforts by having this new method, whose content is identical to the original one.
Repeat with a new biological sequence. Same as above, but much worse, since any combination of types can now error

We've gone down a wrong path. I can see two solutions:

We never implement two semantically different methods with a type intersection between their arguments. That means we discard setindex!(::BioSequence, ::Any, ::Vector{Bool}), and instead have a method where we restrict the middle argument to be e.g. Union{BioSymbol, Char} or something, and the other method to have Union{BioSequence, AbstractVector}. The disadvantage here is that we lose some ducktyping.
We never implement two semantically different method within the same function, full stop. That means we get rid of setindex!(::BioSequence, ::Any, ::Vector{Bool}) (and a few other functions like it), and replace them with new functions. I think we can do this by overloading broadcasting. The disadvantage here is that I don't know how complex implementing broadcasting is.

Check interfaces, remove NotImplemented errors

Errors such as e.g.

BioSequences.jl/src/biosequence/biosequence.jl

Lines 21 to 30 in 0ba3d4b

    
           "Get the length of a biological sequence." 
        
           @inline function Base.length(seq::BioSequence) 
        
               error( 
        
                   string( 
        
                       "Base.length has not been defined for BioSequence type: ", 
        
                       typeof(seq), 
        
                       ". Any compatible concrete BioSequence subtype must have this method implemented." 
        
                   ) 
        
               ) 
        
           end

It's not more useful than a MethodError, and it hampers static analysis from e.g. JET, which will probably be a reality before the end of this year. Much better to simply remove it. (also see this blogpost)

Instead, we should go through and verify we have covered and documented interfaces for:

A new BioSequence subtype
A new Alphabet subtype
An alphabet having the ASCIIAlphabet trait

Also, the tests should include custom biosequence and alphabet types to make sure some basic behaviour is covered by fallbacks.

Add custom alphabet type to tests
Add a separate alphabet with ASCIIAlphabet trait for print tests
Add custom mutable biosequence subtype to tests
Make sure all relevant tests include custom biosequence and/or custom alphabet

Failed push! adds empty bases

Is this intended?

julia> seq = dna"ATGC";

julia> push!(seq, "fail")
ERROR: MethodError: Cannot `convert` an object of type String to an object of type DNA
Closest candidates are:
  convert(::Type{DNA}, ::RNA) at /Users/ksb/.julia/packages/BioSymbols/O70zi/src/nucleicacid.jl:57
  convert(::Type{DNA}, ::Char) at /Users/ksb/.julia/packages/BioSymbols/O70zi/src/nucleicacid.jl:67
  convert(::Type{T}, ::UInt8) where T<:NucleicAcid at /Users/ksb/.julia/packages/BioSymbols/O70zi/src/nucleicacid.jl:50
  ...
Stacktrace:
 [1] enc64(::LongSequence{DNAAlphabet{4}}, ::String) at /Users/ksb/.julia/packages/BioSequences/k4j4J/src/longsequences/indexing.jl:125
 [2] unsafe_setindex! at /Users/ksb/.julia/packages/BioSequences/k4j4J/src/longsequences/indexing.jl:77 [inlined]
 [3] push!(::LongSequence{DNAAlphabet{4}}, ::String) at /Users/ksb/.julia/packages/BioSequences/k4j4J/src/biosequence/transformations.jl:23
 [4] top-level scope at REPL[27]:1
 [5] eval(::Module, ::Any) at ./boot.jl:331
 [6] eval_user_input(::Any, ::REPL.REPLBackend) at /Users/julia/buildbot/worker/package_macos64/build/usr/share/julia/stdlib/v1.4/REPL/src/REPL.jl:86
 [7] run_backend(::REPL.REPLBackend) at /Users/ksb/.julia/packages/Revise/tV8FE/src/Revise.jl:1165
 [8] top-level scope at none:0

julia> seq
5nt DNA Sequence:
ATGC-

If push!() throws an error, I would think that the sequence wouldn't change. This behavior seems like a footgun.

add BioSequences cannot find corresponding UUID

I am new to julia. I installed julia 1.3 and attempted to add BioSequences package. I get the following error.

"Pattern matching and searching" in documentation links to autogenerated search page

This is probably a minor issue, but in package documentation, "Pattern matching and searching" on the left hand side links here: https://biojulia.net/BioSequences.jl/stable/search/. seach appears to be a reserved URI for the autogenerated search page in Documenter, so the original doc is hard to access.

Should be easy for me to push a fix out, just rename search.md to something else.

PFM for amino acid sequences?

I noticed that the PFM function does not run on protein sequences. Is there a biological rationale for this or is this simply missing functionality?

pfm_matrix
ArgumentError: sequence element must be DNA or RNA

BioSequences.PFM(::Array{BioSequences.LongSequence{BioSequences.AminoAcidAlphabet},1})@pwm.jl:52
top-level scope@Local: 1[inlined]

Spring cleanup for v3

BioSequences is a large package. Probably larger than it needs to be. It seems like a hold-over from the olden days when BioJulia was much more monolithic similar to e.g. Biopython.

In my opinion, it's easier and cleaner if BioSequences only contain the sequence types itself, and the most basic processing on them. Nothing fancy.

Are there things we can move out of BioSequences for v3? Here are my suggestions. If anyone thinks any of this belongs in this package, please leave a comment

Functionality to remove

Functionality to migrate to other packages

K-mer composition
Barcode demultiplexing
K-mer minhashing

Dependencies to remove
These are either not used, or can be removed with trivial changes

BioGenerics
Printf
Combinatorics
IndexableBitVectors

A few comments to these

composition and minhash

These kmer techniques really belong in another package. There are so many cool kmer bithacks, and so much to do with kmers. We can't possibly do it all in this package.
I'd much rather have a handful of kmer iterators in this package, and leave it at that. Other packages can then build on the kmer iterators and sequences from here.

NMask and RefSeq

Do people really use these? It seems to me one might as well use LongDNASeq or its 2-bit equivalent. Theoretically, there is some space saving possible by using the NMask. Practically speaking, I doubt we have the dev manpower to maintain yet another longsequence-like sequence.

Demultiplexer

This seems very application-specific, and should probably be part of some NGS-related package.

ConditionIterator

This just seems like a re-implementation of Iterators.Filter, with much less functionality.

VoidAlphabet and CharAlphabet

I can't see the point of VoidAlphabet at all. CharAlphabet is useful for testing the generic interfaces, so I propose we move it from src to test.

Fixup tests before v3

Two things I'd like to improve on

Some of my most recent changes here on version 2 have had insufficient testing. Ideally, I'd like to get coverage up near 100% again. This is much more easily achieved AFTER #138 , so this might be one of the later issues to address before v3
There are way too many tests for BioSequences - nearly 9000 for LongSequence and 26000 for Mers. That's in package with 10k lines! On my (rather fast) laptop, they take 1.5 minutes. Not an eternity, but still a bit of wait. Worse, most of them don't really do anything except burn CI time and spam up the terminal if something goes wrong. I bet at least 75% of these tests could be eliminated if we cut down on test patterns, like

    for len in 1:64, _ in 1:10
        @test kmer_stuff(len, rand_seq(len))
    end

Do we really need 640 distinct tests here? What are the odds that function fails at k=54, and not before?

So, plan of action:

Complete #138 , removing everything we don't want in this package
Complete #121 , and any other new features to mers or longsequences
Aggressively cut down on needless tests for mers and longsequences
Add tests back to get near-100% coverage

Using Aqua on Julia exposes a few method ambiguities. In general, BioSequences have an issue with method ambiguities.

Fix method ambiguities (after all v3 feature issues, the possible interface rewrite, and #135 )

ERROR: UndefVarError: FASTA not defined

julia> fst::BioSequences.FASTA.Reader
ERROR: UndefVarError: FASTA not defined

Ok... who broke this? I'm going back to Python, can't deal with this mess.

Streamline find functions to behave like Base's

Right now, you can do:

julia> findfirst(DNA_A, dna"TAGCGATGA")
2

Whereas it should be:

julia> findfirst(isequal(DNA_A), dna"TAGCGATGA")
2

The latter example currently errors. This happens because the predicate version of find* in BioSequences are currently restricted to subtypes of Function. But there are many function-like objects (such as e.g. Base.Fix2), which are not instances of Function.

We should remove findfirst(DNA_A, dna"TAG"), and make sure these functions behave as expected.

julia 1.5: Time for views of a sequence?

@jakobnissen @CiaranOMara

So julia 1.5 now means structs that have references to mutable are now stack-allocated.
I'm very excited about this. As you know, every LongSequence are mutable and every LongSequence owns its data.

Yes, there is a copy on write mechanism, but that's just an implementation detail to remove copying unless absolutely nessecery.

If you create many views over a sequence - taking advantage of the mechanism, you avoid copying, but you still need to allocate many LongSequence objects on the heap and trigger the gc and all that.

I've long wanted to make proper allocation free views of a sequence - a true view of fixed start and end, where if you edit a base you DO indeed edit the base in the underlying sequence (contrary to the copy on write mechanism which mean the seqs still owns its own data - even if sharing a vec temporarily for performance reasons). But there was never any point as they would just allocate to. Perhaps 1.5 is the time to try something - what do you guys think? I might be a massive boon, for example making kmers with K > 64?

We need a codon type for work with coding sequences.

We need a codon type for working with sequences that code for proteins.

Currently translation of proteins is done by iterating over 3-mers with a step size of 3.
The Kmer types almost work for this purpose - catching ambiguous characters.
However, they fall short when it comes to Gaps.
See if you try to make a Kmer with gaps you don't get what you might expect:

julia> kmer"AAA"
DNA 3-mer:
AAA

julia> kmer"AA-"
DNA 3-mer:
AGA

julia> kmer"---"
DNA 3-mer:
GGA

So for codons containing gaps e.g. "AA-", you want to throw error.
Even more complex, if you're dealing with sequences aligned by codons, then if you have a site:

"""
.....ACG......
.....- - -......
"""

Then you may want a codon to be able to represent "---", three gaps is a valid site you might want to work with in that context.

In addition, using 64 bits to store just three nucleotide characters is wasteful - a byte will do it.

I propose a new bitstype called Codon, based somewhat on Kmers, but that have a separate representation and that catch and account for gaps.

Such a type will pave the way to having higher level abstractions of sequences with reading frames for example, and gene models. But for now, such a type will allow me to add utilities for looking at signals of selection in sequences.

Document seqmatrix and majorityvote

This functionality was not in the docs. If it should be kept, it should at least be documented.

Fail to instantiate on Ubuntu

Current Behavior

In Ubuntu, fail to instantiate a package which contains BioSequences if the package is created in macOS.

Steps to Reproduce

Generate the follwing BSTest package in Julia REPL on macOS.

paalon at mac in ~/git
↪ julia
               _
   _       _ _(_)_     |  Documentation: https://docs.julialang.org
  (_)     | (_) (_)    |
   _ _   _| |_  __ _   |  Type "?" for help, "]?" for Pkg help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 1.4.0 (2020-03-21)
 _/ |\__'_|_|_|\__'_|  |  Official https://julialang.org/ release
|__/                   |

(@v1.4) pkg> generate BSTest
 Generating  project BSTest:
    BSTest/Project.toml
    BSTest/src/BSTest.jl

(@v1.4) pkg> activate BSTest
 Activating environment at `~/git/BSTest/Project.toml`

(BSTest) pkg> registry add https://github.com/BioJulia/BioJuliaRegistry.git
    Cloning registry from "https://github.com/BioJulia/BioJuliaRegistry.git"
registry `BioJuliaRegistry` already exist in `~/.julia/registries/BioJuliaRegistry`.

(BSTest) pkg> add BioSequences
   Updating registry at `~/.julia/registries/BioJuliaRegistry`
   Updating git-repo `https://github.com/BioJulia/BioJuliaRegistry.git`
   Updating registry at `~/.julia/registries/General`
   Updating git-repo `https://github.com/JuliaRegistries/General.git`
  Resolving package versions...
   Updating `~/git/BSTest/Project.toml`
  [7e6ae17a] + BioSequences v2.0.1
   Updating `~/git/BSTest/Manifest.toml`
  [67c07d97] + Automa v0.8.0
  [47718e42] + BioGenerics v0.1.0
  [7e6ae17a] + BioSequences v2.0.1
  [3c28c6f8] + BioSymbols v4.0.3
  [861a8166] + Combinatorics v1.0.0
  [864edb3b] + DataStructures v0.17.11
  [1cb3b9ac] + IndexableBitVectors v1.0.0
  [bac558e1] + OrderedCollections v1.1.0
  [f27b6e38] + Polynomials v0.5.3
  [3cdcf5f2] + RecipesBase v1.0.0
  [3bb67fe8] + TranscodingStreams v0.9.5
  [7200193e] + Twiddle v1.1.1
  [2a0f44e3] + Base64
  [8ba89e20] + Distributed
  [b77e0a4c] + InteractiveUtils
  [8f399da3] + Libdl
  [37e2e46d] + LinearAlgebra
  [56ddb016] + Logging
  [d6f4376e] + Markdown
  [de0858da] + Printf
  [9a3f8284] + Random
  [9e88b42a] + Serialization
  [6462fe0b] + Sockets
  [8dfed614] + Test
  [4ec0a83e] + Unicode

(BSTest) pkg> st
Project BSTest v0.1.0
Status `~/git/BSTest/Project.toml`
  [7e6ae17a] BioSequences v2.0.1

(BSTest) pkg> instantiate

julia> using BSTest
[ Info: Precompiling BSTest [f5260244-0a45-42a5-ad8a-78ba73d8e808]

julia> exit()
paalon at mac in ~/git
↪ ls
BSTest/
paalon at mac in ~/git
↪

Copy this package to Ubuntu, then instantiate.

paalon at ubuntu-bionic in ~/git
↪ ls
BSTest/
paalon at ubuntu-bionic in ~/git
↪ julia
               _
   _       _ _(_)_     |  Documentation: https://docs.julialang.org
  (_)     | (_) (_)    |
   _ _   _| |_  __ _   |  Type "?" for help, "]?" for Pkg help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 1.4.0 (2020-03-21)
 _/ |\__'_|_|_|\__'_|  |  Official https://julialang.org/ release
|__/                   |

(@v1.4) pkg> activate BSTest
 Activating environment at `~/git/BSTest/Project.toml`

(BSTest) pkg> registry add https://github.com/BioJulia/BioJuliaRegistry.git
    Cloning registry from "https://github.com/BioJulia/BioJuliaRegistry.git"
      Added registry `BioJuliaRegistry` to `~/.julia/registries/BioJuliaRegistry`

(BSTest) pkg> st
Project BSTest v0.1.0
Status `~/git/BSTest/Project.toml`
→ [7e6ae17a] BioSequences v2.0.1
┌ Warning: Some packages (indicated with a red arrow) are not downloaded, use `instantiate` to instantiate the current environment
└ @ Pkg.Display /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.4/Pkg/src/Display.jl:238

(BSTest) pkg> instantiate
   Updating registry at `~/.julia/registries/BioJuliaRegistry`
   Updating git-repo `https://github.com/BioJulia/BioJuliaRegistry.git`
ERROR: expected package `Combinatorics [861a8166]` to be registered

(BSTest) pkg>

My Environment

macOS 10.15.4 Catalina

Julia Version 1.4.0
Commit b8e9a9ecc6 (2020-03-21 16:36 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin18.6.0)
  CPU: Intel(R) Core(TM) i7-7660U CPU @ 2.50GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-8.0.1 (ORCJIT, skylake)

Ubuntu 18.04.4 LTS (Bionic Beaver)

Julia Version 1.4.0
Commit b8e9a9ecc6 (2020-03-21 16:36 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Core(TM) i7-7660U CPU @ 2.50GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-8.0.1 (ORCJIT, skylake)

Make FASTA and FASTQ readers / writers compatible with FileIO

We've already done great work for making reading formatted bioinformatics generic and simple to understand, it is a good idea to make what we have compatible with FileIO so for users then, opening and reading files like FASTA and FASTQ the same as if they were working with a CSV file.

BioSequences.jl 0.6

I'd like to release a new version of BioSequences.jl because:

It started to support Julia 0.6.
I cannot finish Julia 0.6 support of GenomicFeatures.jl since it depends on BioSequences.jl 0.5, which depends on PairwiseListMatrices.jl, which doesn't support Julia 0.6.
We added a new feature (#7) and a bug fix (#8).

Drop Julia 0.5 support

Disallow construction of k-mers from integers

Right now, you can do

julia> DNACodon(11)
DNA 3-mer:
AGT

I don't think you should be able to do that. The fact that the codon is internally represented as UInt(11) is an implementation detail, and should not be part of the API.

I suggest removing it.

Edit: Changed the following:

Removed conversion to/from integers/kmers. This is dangerous since conversion can happen implicitly
Construction of kmers from integers is still permitted, but now checks for overflow, e.g. DNAMer{2}(0xfffff) throws a DomainError
To avoid the check, use the function masked, which instead masks the noncoding bits.

Error tagging new release

The tag name "v0.5.0-prerelease" is not of the appropriate SemVer form (vX.Y.Z).
cc: @Ward9250

Add trait "iscomplete" to Alphabet

Add trait "iscomplete" to every alphabet. It should default to Val{false}(). It should return Val{true} if length(symbols(A)) == 1 << bits_per_symbol(A), i.e. if all bitpatterns are valid. In that case, the decoder can skip validation. There are also other efficiencies to be had, such as better random sequence generation.

Slicing ReferenceSequences

Slices of ReferenceSequence aren't behaving correctly.

Current Behavior

(@v1.4) pkg> st BioSequences
Status `~/.julia/environments/v1.4/Project.toml`
  [7e6ae17a] BioSequences v2.0.5 #master (https://github.com/BioJulia/BioSequences.jl.git)

julia> using BioSequences
[ Info: Precompiling BioSequences [7e6ae17a-c86d-528c-b3b9-7f778a29fe59]
[ Info: Compiling bit-parallel GC counter for LongSequence{<:NucleicAcidAlphabet}
[ Info: Compiling bit-parallel mismatch counter for LongSequence{<:NucleicAcidAlphabet}
[ Info: Compiling bit-parallel match counter for LongSequence{<:NucleicAcidAlphabet}
[ Info: Compiling bit-parallel ambiguity counter...
[ Info: 	For a single LongSequence{<:NucleicAcidAlphabet}
[ Info: 	For a pair of LongSequence{<:NucleicAcidAlphabet}s
[ Info: Compiling bit-parallel certainty counter for LongSequence{<:NucleicAcidAlphabet}
[ Info: Compiling bit-parallel gap counter for LongSequence{<:NucleicAcidAlphabet}

julia> rseq = ReferenceSequence(dna"NNAAAAAAN")
9nt Reference Sequence:
NNAAAAAAN

julia> sub_rseq = rseq[3:8]
6nt Reference Sequence:
AAAAAA

julia> sub_sub_rseq = sub_rseq[1:4]
4nt Reference Sequence:
NNAA <-- The expected answer is AAAA

Context

Working sequences from TwoBit records.

Type for Paired-end Reads

For generating contigs we need to make use of paired-end reads. I think it would be convenient to have a a type Paired-end BioSequence or we can modify the current version of the BioSequence type to have gaps.

@benjward, I would be really happy to hear your comment about this one.

getindex not defined for arrays

I was trying to create a sequence without some index and I've found that getindex using arrays isn't supported:

using BioSequences
seq = dna"CCTAGGAGG"
seq[[1,2,3,7,8,9]] # MethodError
seq[[true, true, true, false, false, false, true, true, true]] # MethodError

Future of subsequences

Pinging @benjward @CiaranOMara perhaps this discussion should be part of #102 , but I do feel like the scope is a little broader here.

The situation

We currently are working on multiple different "subsequence" like types:

Ben is (has been?) working on NTupleKmers
I have a working prototype of a sequence view, similar to an array view (link coming soon)
I'm playing around with 256-bit kmers

This comes on top of the already pretty large subsequence codebase:

LongSequence's copy-on-write (COW) mechanism that allows a LongSequence to actually be a subsequence
The recently revamped Mer types.

So I'm concerned if not done right, we will end up with an enormous unelegant codebase that might still leave some bases uncovered. So: What do we really need? As I see it, we really just need:

An arbitrary-length subsequence, preferrably allocation-free.
An efficient type for kmers. Ideally, this would be flexible in size as well, but in my experience, kmers with k > 127 are rarely used.

My proposal

In this issue, I make two arguments:

Argument one: We should move completely to either NTupleKmers if they work, else N-bit kmers
While a maximum of K = 32 is good enough for many purposes, the usage of longer kmers is common enough that it's an awkward ceiling. By leveraging BitIntegers.jl, we could change the Kmer code to wrap an integer up to 1024-bit integers. Essentially its structure would be:

struct Kmer{K, A <: Alphabet, T <: Unsigned}
    data::T
end

For longer kmers like 51-mers (common in metagenomics), 127-mers (sometimes used in assembly), and perhaps even higher values than that, a truly variable-length kmer type like NTupleKmers is obviously best. Furthermore, I personally find it likely that NTuples will be even further optimized by the Julia core devs in the future since they are integral to immutable arrays.

However, NTupleKmers may be very difficult to write efficient code for. If that is the case, we can simplify the future codebase a great deal by having our kmers simply wrap a BitInteger type which is already aggressively optimized by LLVM. I am working on a proof of concept for this for show. Unfortunately, it adds another dependency for BioSequences.

Proposal 1: Remove our current Kmer types. If NTupleKmers work well, use those solely. Otherwise, add BitIntegers.jl as a dependency, and make a kmer type that parameterizes the integer type that stores its data

Argument two: We should remove the COW mechanism of LongSequence
I realize that this is a little controversial and require large changes to the codebase. But I do think it's the right direction to move. The COW mechanism allows sequences to be views, which is neat if users creates tonnes of subsequences.

However, my experience working with BioSequences.jl is that COW does make the codebase significantly more complicated. Every single indexing or iteration function that operates on the underlying data must take into account that the data may not begin at 1. In principle, this should not matter too much because it's hidden from the user. But realistically, at least in the near term, many of the heavy users of BioSequences will be devs themselves, and would like to tinker with BioSequence internals. That is, after all, one of the promises of Julia, it lets you tinker!

Furthermore, the benefit is eroded when 1.5 comes, since these subsequences will still allocate (because LongSequence is mutable) in a way a sequence view won't.

For the large majority of users, allocating a new LongSequence when slicing it will have completely fine performance - my current work for example, typically involves operations on max a few 100s of small sequences. But these users may still suffer from the COW mechanism if the difficulty of developing causes more bugs, e.g. #104 #110 . For the few ones which want great performance, the COW mechanism might not be performant enough. For the developers, it causes headaches.

Proposal 2: Remove the COW mechanism from LongSequence and introduce a SeqView-like object. Slicing a LongSequence will create a copy. The structure of a LongSequence is a mutable struct with the data vector and a single integer giving the length.

Demultiplexer fails for Levenshtein distance > 1

Expected Behavior

Demultiplexers can be generated for any Hamming distance, and for Levenshtein distance 1.
I would expect it to also work for higher distances.

These cases work:

julia> Demultiplexer(LongDNASeq.(["ATGGATGG"]), n_max_errors=1, distance=:hamming)
Demultiplexer{LongSequence{DNAAlphabet{4}}}:
  distance: hamming
  number of barcodes: 1
  number of correctable errors: 1

julia> Demultiplexer(LongDNASeq.(["ATGGATGG"]), n_max_errors=2, distance=:hamming)
Demultiplexer{LongSequence{DNAAlphabet{4}}}:
  distance: hamming
  number of barcodes: 1
  number of correctable errors: 2

julia> Demultiplexer(LongDNASeq.(["ATGGATGG"]), n_max_errors=1, distance=:levenshtein)
Demultiplexer{LongSequence{DNAAlphabet{4}}}:
  distance: levenshtein
  number of barcodes: 1
  number of correctable errors: 1

Current Behavior

This does not work:

julia> Demultiplexer(LongDNASeq.(["ATGGATGG"]), n_max_errors=2, distance=:levenshtein)
ERROR: MethodError: no method matching isless(::Nothing, ::Int64)
Closest candidates are:
  isless(::Missing, ::Any) at missing.jl:87
  isless(::AbstractFloat, ::Real) at operators.jl:167
  isless(::Real, ::Real) at operators.jl:355
  ...
Stacktrace:
 [1] <(::Nothing, ::Int64) at ./operators.jl:277
 [2] <=(::Nothing, ::Int64) at ./operators.jl:326
 [3] hamming_circle(::LongSequence{DNAAlphabet{4}}, ::Int64) at /opt/Julia/julia-1.5/packages/BioSequences/k4j4J/src/demultiplexer.jl:240
 [4] levenshtein_circle(::LongSequence{DNAAlphabet{4}}, ::Int64) at /opt/Julia/julia-1.5/packages/BioSequences/k4j4J/src/demultiplexer.jl:259
 [5] levenshtein_circle(::LongSequence{DNAAlphabet{4}}, ::Int64) at /opt/Julia/julia-1.5/packages/BioSequences/k4j4J/src/demultiplexer.jl:272
 [6] Demultiplexer(::Array{LongSequence{DNAAlphabet{4}},1}; n_max_errors::Int64, distance::Symbol) at /opt/Julia/julia-1.5/packages/BioSequences/k4j4J/src/demultiplexer.jl:162
 [7] top-level scope at REPL[81]:1

The problem appears to be in the levenshtein_circle function:

julia> BioSequences.levenshtein_circle(LongDNASeq("ATGGAGTC"), 1)
71-element Array{LongSequence{DNAAlphabet{4}},1}:
 AAGGAGTC
 AATGGAGTC
 ACGGAGTC
 ... truncated

julia> BioSequences.levenshtein_circle(LongDNASeq("ATGGAGTC"), 2)
ERROR: MethodError: no method matching isless(::Nothing, ::Int64)
Closest candidates are:
  isless(::Missing, ::Any) at missing.jl:87
  isless(::AbstractFloat, ::Real) at operators.jl:167
  isless(::Real, ::Real) at operators.jl:355
  ...
Stacktrace:
 [1] <(::Nothing, ::Int64) at ./operators.jl:277
 [2] <=(::Nothing, ::Int64) at ./operators.jl:326
 [3] hamming_circle(::LongSequence{DNAAlphabet{4}}, ::Int64) at /opt/Julia/julia-1.5/packages/BioSequences/k4j4J/src/demultiplexer.jl:240
 [4] levenshtein_circle(::LongSequence{DNAAlphabet{4}}, ::Int64) at /opt/Julia/julia-1.5/packages/BioSequences/k4j4J/src/demultiplexer.jl:259
 [5] levenshtein_circle(::LongSequence{DNAAlphabet{4}}, ::Int64) at /opt/Julia/julia-1.5/packages/BioSequences/k4j4J/src/demultiplexer.jl:272
 [6] top-level scope at REPL[100]:1

Possible Solution / Implementation

Apparently this causes the error, but I'm not able to see how this becomes empty

BioSequences.jl/src/demultiplexer.jl

Line 240 in 6f0e437

if findfirst(isequal(seq[p]), ACGT) ≤ r

Steps to Reproduce (for bugs)

Take the test-example

BioSequences.jl/test/demultiplexer.jl

Line 68 in 6f0e437

@testset "Levenshtein distance" begin

and replace the distance with 2:

julia> barcodes = [dna"ATGG", dna"CAGA", dna"GGAA", dna"TACG"]
4-element Array{LongSequence{DNAAlphabet{4}},1}:
 ATGG
 CAGA
 GGAA
 TACG

julia> dplxr = Demultiplexer(barcodes, n_max_errors=1, distance=:levenshtein)
Demultiplexer{LongSequence{DNAAlphabet{4}}}:
  distance: levenshtein
  number of barcodes: 4
  number of correctable errors: 1

julia> dplxr = Demultiplexer(barcodes, n_max_errors=2, distance=:levenshtein)
ERROR: MethodError: no method matching isless(::Nothing, ::Int64)
Closest candidates are:
  isless(::Missing, ::Any) at missing.jl:87
  isless(::AbstractFloat, ::Real) at operators.jl:167
  isless(::Real, ::Real) at operators.jl:355
  ...
Stacktrace:
 [1] <(::Nothing, ::Int64) at ./operators.jl:277
 [2] <=(::Nothing, ::Int64) at ./operators.jl:326
 [3] hamming_circle(::LongSequence{DNAAlphabet{4}}, ::Int64) at /opt/Julia/julia-1.5/packages/BioSequences/k4j4J/src/demultiplexer.jl:240
 [4] levenshtein_circle(::LongSequence{DNAAlphabet{4}}, ::Int64) at /opt/Julia/julia-1.5/packages/BioSequences/k4j4J/src/demultiplexer.jl:259
 [5] levenshtein_circle(::LongSequence{DNAAlphabet{4}}, ::Int64) at /opt/Julia/julia-1.5/packages/BioSequences/k4j4J/src/demultiplexer.jl:272
 [6] Demultiplexer(::Array{LongSequence{DNAAlphabet{4}},1}; n_max_errors::Int64, distance::Symbol) at /opt/Julia/julia-1.5/packages/BioSequences/k4j4J/src/demultiplexer.jl:162
 [7] top-level scope at REPL[104]:1

Your Environment

Package Version used:
Julia Version used:
Operating System and version (desktop or mobile):
Link to your project:

julia> versioninfo()
Julia Version 1.5.0
Commit 96786e22cc (2020-08-01 23:44 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Core(TM) i7-8650U CPU @ 1.90GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-9.0.1 (ORCJIT, skylake)
Environment:
  JULIA_DEPOT_PATH = /opt/Julia/julia-1.5

julia> Pkg.status("BioSequences")
  [7e6ae17a] BioSequences v2.0.5

Test Issue

Woo test

Use broadcasted function to provide really conveinient and performant dot operations for sequences.

Just a JuliaCon note to myself so I don't forget to do this.

[speculative] Remove `BioSequence(::Integer)`?

Right now you can do

julia> LongDNASeq(10)
10nt DNA Sequence:
CA--------

This strikes me as weak typing. In fact, we have not constructed a sequence from 10. Imagine a situation where you mistakenly thing x is something sequence-like, such as a bioseq or a string, but really, it's an integer. Doing LongDNASeq(x) then works - and give the wrong result.

I propose instead we change it to LongDNASeq(undef, 10), similar to arrays.

Performance review before release

So @benjward have mentioned wanting to make a new release of BioSequences that includes performance improvements. We might as well squeeze whatever we have on our performance wishlist into that release.

When #83 and #84 gets merged, I have two smaller PRs lined up which builds on top of those:

One that improves the performance of printing LongSequence. Currently, they are printed base-by-base, which is inefficient in 1.3 and above due to IO operations locking threads. Also, we can exploit the ASCII-related functions I created in the other PRs to make it way faster for ASCII sequences.
One that introduces a copy! method, such that you can in-place modify a single sequence when you are iterating over sequences, similar to what you do for FASTA records. This is just a simple wrapper around copyto!, or encode_copy! or encode_chunks, depending on source and destination types.

Benchmark

We can use the RC benchmark, created by the Seq guys. This benchmark covers almost all the included benchmarks. I'll be Julia 1.3 on my native computer instead of in a VM. This is comparing all current PRs with commit 61b8fae.

Before: 100.9 seconds
After:    9.2 seconds

TODO:

~~Improve LongSequence printing (PR #91 )~~
~~Add copy! method ( PR #87)~~
~~Add docs to copy! method (PR #95 )~~
~~Optimize length(::LongSequence) (PR #88 )~~
~~Optimize resize!(::LongSequence) (PR #92)~~
~~Optimize orphan!(::LongSequence) (PR #88)~~
Review bitpar counting kernels ( PR #96 )
Review ReferenceSequence ( PR #94 )
~~SIMD-vectorize reverse complementation ( PR #97 )~~
~~Consider optimizing random nucleotide generation (issue #77 )~~
~~Improve minhash hashing speed (PR #99)~~

Faster random LongSequence{NucleicAcidAlphabet{4}} generation

I was goofing around the other day and discovered a way to make generation of LongSequence{NucleicAcidAlphabet{4}} faster. It's probably fast enough already for any practical application, but in the words of Jeff Bezanson: "MOAR PERFORMANCE". And also, this new code is shorter.
I'm aware that BioSequences guarantees reproducible random sequence generation, and a new algorithm messes with that, so I make this issue here instead of a PR, so we can perhaps use it in a future version of BioSequences.

The new function looks like this:

function randseq2(rng::AbstractRNG, A::NucleicAcidAlphabet{4}, len::Integer)
    seq = LongSequence{typeof(A)}(len)
    data = seq.data
    rand!(rng, data)
    @inbounds for i in eachindex(data)
        nuc = 0x1111111111111111
        mask = data[i]
        nuc = ((nuc & mask) << 1) | (nuc & ~mask)
        mask >>>= 1
        nuc = ((nuc & mask) << 2) | (nuc & ~mask)
        data[i] = nuc
    end
    return seq
end

And it's fast because it uses rand! to generate all the randomness at once, and because it SIMD-vectorizes. It's about 8x faster for long sequences.

Problems installing BioSequences package

Hello! Apologies if this is not the right place to post things like this. I am trying to install the BioSequences package, and have done the following:

Entered package mode
registry add https://github.com/BioJulia/BioJuliaRegistry.git
Exited package mode
using Pkg
Pkg.add("BioSequences")

I am getting the following error:

ERROR: MethodError: Cannot `convert` an object of type Nothing to an object of type String    
Closest candidates are:     
  convert(::Type{T<:AbstractString}, ::T<:AbstractString) where T<:AbstractString at strings/basic.jl:208    
  convert(::Type{T<:AbstractString}, ::AbstractString) where T<:AbstractString at strings/basic.jl:209     
  convert(::Type{T}, ::T) where T at essentials.jl:167     
Stacktrace:     
 [1] setindex!(::Dict{Base.UUID,String}, ::Nothing, ::Base.UUID) at ./dict.jl:380    
 [2] deps_graph(::Pkg.Types.Context, ::Dict{Base.UUID,String}, ::Dict{Base.UUID,Pkg.Types.VersionSpec}, ::Dict{Base.UUID,Pkg.Types.Fixed}) at /Users/sabae/buildbot/worker/package_macos64/build/usr/share/julia/stdlib/v1.2/Pkg/src/Operations.jl:424    
 [3] resolve_versions!(::Pkg.Types.Context, ::Array{Pkg.Types.PackageSpec,1}) at /Users/sabae/buildbot/worker/package_macos64/build/usr/share/julia/stdlib/v1.2/Pkg/src/Operations.jl:316    
 [4] #add#100(::Bool, ::typeof(Pkg.Operations.add), ::Pkg.Types.Context, ::Array{Pkg.Types.PackageSpec,1}, ::Array{Base.UUID,1}) at /Users/sabae/buildbot/worker/package_macos64/build/usr/share/julia/stdlib/v1.2/Pkg/src/Operations.jl:962    
 [5] #add at ./none:0 [inlined]    
 [6] #add#25(::Bool, ::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}, ::typeof(Pkg.API.add), ::Pkg.Types.Context, ::Array{Pkg.Types.PackageSpec,1}) at /Users/sabae/buildbot/worker/package_macos64/build/usr/share/julia/stdlib/v1.2/Pkg/src/API.jl:99    
 [7] add at /Users/sabae/buildbot/worker/package_macos64/build/usr/share/julia/stdlib/v1.2/Pkg/src/API.jl:69 [inlined]    
 [8] #add#24 at /Users/sabae/buildbot/worker/package_macos64/build/usr/share/julia/stdlib/v1.2/Pkg/src/API.jl:67 [inlined]    
 [9] add at /Users/sabae/buildbot/worker/package_macos64/build/usr/share/julia/stdlib/v1.2/Pkg/src/API.jl:67 [inlined]     
 [10] #add#21 at /Users/sabae/buildbot/worker/package_macos64/build/usr/share/julia/stdlib/v1.2/Pkg/src/API.jl:65 [inlined]    
 [11] add at /Users/sabae/buildbot/worker/package_macos64/build/usr/share/julia/stdlib/v1.2/Pkg/src/API.jl:65 [inlined]    
 [12] #add#20(::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}, ::typeof(Pkg.API.add), ::String) at /Users/sabae/buildbot/worker/package_macos64/build/usr/share/julia/stdlib/v1.2/Pkg/src/API.jl:64    
 [13] add(::String) at /Users/sabae/buildbot/worker/package_macos64/build/usr/share/julia/stdlib/v1.2/Pkg/src/API.jl:64    
 [14] top-level scope at REPL[2]:1

I'm using Julia version 1.2.0, and the latest BioSequences. I'm fairly confident that I got the git repo correctly, because re-doing:
registry add https://github.com/BioJulia/BioJuliaRegistry.git
Results in:

   Cloning registry from "https://github.com/BioJulia/BioJuliaRegistry.git"    
[ Info: registry `BioJuliaRegistry` already exist in `~/.julia/registries/BioJuliaRegistry`.

map(f, seq) raises an exception

Consider the following snippet:

using BioSequences
my_dict = Dict{AminoAcid, Float64}(...)
x = map(aa -> my_dict[aa], aa"PLEASANTLY")

Expected Behavior

I would expect x to contain an array of Float64, taken from my_dict.

Current Behavior

I get the following error (-3.5 is one of the entries in my_dict):

ERROR: InexactError: UInt8(UInt8, -3.5)
Stacktrace:
 [1] Type at ./float.jl:679 [inlined]
 [2] convert at ./number.jl:7 [inlined]
 [3] convert at /bmm/home/et517/.julia/packages/BioSymbols/8RyRT/src/aminoacid.jl:24 [inlined]
 [4] enc64(::BioSequence{AminoAcidAlphabet}, ::Float64) at /bmm/home/et517/.julia/packages/BioSequences/7i86L/src/bioseq/bioseq.jl:107
 [5] unsafe_setindex! at /bmm/home/et517/.julia/packages/BioSequences/7i86L/src/bioseq/indexing.jl:143 [inlined]
 [6] map!(::getfield(Ampliphyre, Symbol("##39#40")), ::BioSequence{AminoAcidAlphabet}) at /bmm/home/et517/.julia/packages/BioSequences/7i86L/src/bioseq/transformations.jl:172
 [7] map at /bmm/home/et517/.julia/packages/BioSequences/7i86L/src/bioseq/transformations.jl:178 [inlined]

Possible Solution / Implementation

It appears that the method function Base.map(f::Function, seq::BioSequence) around transformations.jl:177 is just wrong and should be removed.

Your Environment

BioSequences v1.1.0
Julia Version 1.0.3
Platform Info:
  OS: Linux (x86_64-conda_cos6-linux-gnu)
  CPU: Intel(R) Xeon(R) CPU E5-2640 0 @ 2.50GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.0 (ORCJIT, sandybridge)

Faster computation of Kmer composition

Kmer composition is relatively widely used. The reason for using kmers rather than arbitrary length sequences are almost often the superior speed of kmers, hence it's an area worth optimizing.

Currently in BioSequences, calculating the kmer composition of a sequence relies on iterate(x::EachKmerIterator). However, rather than relying on a kmer iterator, counting kmers by a direct loop over each nucleotide of the sequence is significantly faster for longer sequences. In a test implementation I made, the kmer composition can be counted ~4x faster for long sequences (10k nt) using a direct loop, compared to the current BioSequences implementation.

I suggest changing the implementation of the method
composition(iter::EachKmerIterator{T}) where {T<:Kmer}
Such that it's calculated by direct iteration.

Advantages

Quicker computation

Disadvantages

Does not rely on the existing implementation of BioSequences.EachKmerIterator. This means that if the underlying structure of sequences change, updating BioSequences.EachKmerIterator will not be sufficient to making these methods work.

I have created an implementation of my proposed change at:
https://github.com/jakobnissen/biobag/blob/master/kmercount_newimpl.jl

Note!: I have not extensively tested my implementation for all the corner cases - I can write the test code if this proposal is accepted.

Dependency issues when including the BioSequenceGraphs packages.

I have made several updates in my local to the BioSequenceGraphs package and want to use this updated version together with the BioSequences package.
I call
' using BioSequences'
to have access to all the functionalities in BioSequences package which runs smoothly.
When I want to use the BioSequenceGraphs package in my local, I go to the directory containing the .jl files (BioSequenceGraph.jl/src) and use include to call each of the .jl files inside the src folder:

include("Nodes.jl")
include("Links.jl")
include("SequenceGraph.jl")
include("IO.jl")
include("DeBruijnGraph.jl")

However after doing this I can not use the functions from BioSequences that I could use previously. The functions that are extended inside one of these .jl files are not accessible anymore. Some examples are : length, sequences.

What should be the proper way integrating the BioSequenceGraphs package in my local without causing some dependency/import related issues.
Below is the error I get after running the sequence function to get the fastq reads and then trying to include the Nodes.jl which also has a function named sequence:

error in method definition: function BioCore.sequence must be explicitly imported to be extended

Implement reverse translation (AminoAcid -> Codon)

I am thinking about where a good place for the codon and aminoacid tables would be in julia and wanted to ask if it would fit in here.

Expected Behavior

convert(Vector{RNAMer{3}}, AA_A) == RNAMer{3}.(["GCU", "GCA", "GCC", "GCG"])
convert(AminoAcid, RNAMer{3}("GCU")) == AA_A

Maybe have an extra type alias const Codon = RNAMer{3}

Possible Solution / Implementation

Add two lookup tables.

Context

Could go alternatively in a separate package.

	"Get the length of a biological sequence."
	@inline function Base.length(seq::BioSequence)
	error(
	string(
	"Base.length has not been defined for BioSequence type: ",
	typeof(seq),
	". Any compatible concrete BioSequence subtype must have this method implemented."
	)
	)
	end

biojulia / biosequences.jl Goto Github PK

biosequences.jl's Introduction

Bio

biosequences.jl's People

Contributors

Stargazers

Watchers

Forkers

biosequences.jl's Issues

Expected Behavior

Current Behavior

Possible Solution / Implementation

Steps to Reproduce (for bugs)

Your Environment

Expected Behavior

Current Behavior

Possible Solution / Implementation

Steps to Reproduce (for bugs)

Context

Your Environment

Your Environment

Background

Current Behavior

Desired Behavior

Steps to reproduce

My Environment

TL;DR

Rationale

composition and minhash

NMask and RefSeq

Demultiplexer

ConditionIterator

VoidAlphabet and CharAlphabet

Current Behavior

Steps to Reproduce

My Environment

Current Behavior

Context

The situation

My proposal

Expected Behavior

Current Behavior

Possible Solution / Implementation

Steps to Reproduce (for bugs)

Your Environment

Benchmark

TODO:

Expected Behavior

Current Behavior

Possible Solution / Implementation

Your Environment

Advantages

Disadvantages

Expected Behavior

Possible Solution / Implementation

Context

Recommend Projects

Recommend Topics

Recommend Org