bebop / poly
A Go package for engineering organisms.
Home Page: https://pkg.go.dev/github.com/bebop/poly
License: MIT License
I'd like to be able to run our tests on Windows, but given this stack trace from this pull request, Windows has a problem with ParseGff() on line 35 of io_test.go.
The problem is platform specific: the test doesn't fail on macOS or Ubuntu, and other tests have passed on Windows.
The stack trace stops on line 188 of io.go, and it appears that the root cause may be splitting on tabs without handling carriage returns, as seen on line 186 of io.go.
Has anyone seen this before, or would anyone be willing to take a crack at it?
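A likely culprit for a Windows-only parsing failure is CRLF line endings: a line read from a Windows file often ends in "\r", which then leaks into the last tab-separated field. A minimal sketch of a CRLF-safe split (the helper name is hypothetical, not poly's actual io.go code):

```go
package main

import (
	"fmt"
	"strings"
)

// splitGffLine strips a Windows carriage return before splitting on tabs,
// so "\r" can't end up glued to the final field. Illustrative only.
func splitGffLine(line string) []string {
	line = strings.TrimSuffix(line, "\r") // drop Windows CR if present
	return strings.Split(line, "\t")
}

func main() {
	// On Windows a raw line often ends in "\r"; without trimming it, the
	// final field would be "gene\r" instead of "gene".
	fields := splitGffLine("chr1\tsource\tgene\r")
	fmt.Printf("%q\n", fields[2]) // "gene"
}
```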
Lots of Go related projects adopt a dressed up Go Gopher as a mascot and we should too!
Gopher using DNA as jump rope
Gopher cutting DNA with scissors
Gopher duct taping DNA back together
Gopher inspecting peptide chains emerging from ribosomes with a little monocle
I'm not much of an artist myself, but if anyone could make a gopher fit for a mascot I'd be grateful. Any ideas on what their outfit should be?
Before a sequence is synthesized it should be optimized in at least 3 ways that I can think of.
Each optimization would have its own function but it'd be great to eventually have a function that performs a suite of pre-synthesis optimizations and checks.
@Koeng101 has already written a little about codon optimization and synthesis optimization but we've yet to start working on non-repetitive parts optimization. Are there any other pre-synthesis optimizations that we should be considering?
Is your feature request related to a problem? Please describe.
Currently we're using GitHub Pages, but it's somewhat tedious to write and publish docs there. Also, our documentation is missing from pkg.go.dev.
Describe the solution you'd like
Godocs provides a way to render inline comments as documentation. As time goes on it would be best if we relied on this for library documentation and perhaps for command line documentation. I'm hoping that our docs look like this at some point but currently our official go docs look like this.
Things to be done:
It looks like this link about adding-packages to pkg.go.dev is a good starting point.
https://go.dev/about#adding-a-package
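For anyone new to godocs: pkg.go.dev renders ordinary doc comments, where each comment is a complete sentence starting with the name of the thing it documents. A minimal sketch of the convention (GCContent is an illustrative function, not poly's current API):

```go
// Command docdemo shows the godoc comment convention that pkg.go.dev
// renders automatically once a module is indexed.
package main

import "fmt"

// GCContent returns the fraction of G and C bases in a DNA sequence.
// Because the comment starts with "GCContent", godoc attaches it to the
// function and renders it on pkg.go.dev. (Illustrative, not poly's API.)
func GCContent(seq string) float64 {
	if len(seq) == 0 {
		return 0
	}
	gc := 0
	for _, b := range seq {
		switch b {
		case 'G', 'C', 'g', 'c':
			gc++
		}
	}
	return float64(gc) / float64(len(seq))
}

func main() {
	fmt.Println(GCContent("ATGC")) // 0.5
}
```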
Primers are a common need for labs, and will be required for more complex protocols, like Gibson assemblies.
The most basic application of a primer design will be a simple amplification.
poly amplify pUC19.gb --amplicon "ATGACCATGATTACGCCAAGCTTGCATGCCTGCAGGTCGACTCTAGAGGATCCCCGGGTACCGAGCTCGAATTCACTGGCCGTCGTTTTACAACGTCGTGACTGGGAAAACCCTGGCGTTACCCAACTTAATCGCCTTGCAGCACATCCCCCTTTCGCCAGCTGGCGTAATAGCGAAGAGGCCCGCACCGATCGCCCTTCCCAACAGTTGCGCAGCCTGAATGGCGAATGGCGCCTGATGCGGTATTTTCTCCTTACGCATCTGTGCGGTATTTCACACCGCATATGGTGCACTCTCAGTACAATCTGCTCTGATGCCGCATAG" > primers.txt pUC19_amplified.gb
primers.txt would be a tab separated value file which looks like (similar to what Snapgene provides):
Primer 1 ATGACCATGATTACGCCAAG
Primer 2 CTATGCGGCATCAGAGCA
While pUC19_amplified.gb would simply be a genbank file with only the primers + amplified sequence.
There are a few different flags that would be useful for poly amplify:
--primer_for "pUC19_for" --primer_rev "pUC19_rev" - names the output primers in primers.txt
--amplicon "ATGC" - amplifies the given subsequence
--range 0,0 - amplifies a particular range of the sequence
--no_amplify pGLO.gb - prevents primers from amplifying a naughty sequence from a different file
--validate 10:20 --size 100:150 --coverage 100 --overlap 10:40 - amplifies a particular range of sequence with sizes within the limits of the --size flag, the coverage limits of the --coverage flag, and the overlap limits of the --overlap flag
These flags generally fulfill all the needs of a biologist. The first four cover the use cases of your average cloner - they simply want to clone out an amplicon and don't really care about more advanced features. The fifth, --validate,
covers the use case of people who build primers to validate things, like a colony PCR or a validation PCR for clinical samples. I think these two use cases cover ~90% of the different kinds of uses of poly amplify.
poly amplify pUC19.gb --primer_for "pUC19_for" --primer_rev "pUC19_rev" --amplicon "ATGACCATGATTACGCCAAGCTTGCATGCCTGCAGGTCGACTCTAGAGGATCCCCGGGTACCGAGCTCGAATTCACTGGCCGTCGTTTTACAACGTCGTGACTGGGAAAACCCTGGCGTTACCCAACTTAATCGCCTTGCAGCACATCCCCCTTTCGCCAGCTGGCGTAATAGCGAAGAGGCCCGCACCGATCGCCCTTCCCAACAGTTGCGCAGCCTGAATGGCGAATGGCGCCTGATGCGGTATTTTCTCCTTACGCATCTGTGCGGTATTTCACACCGCATATGGTGCACTCTCAGTACAATCTGCTCTGATGCCGCATAG" > primers.txt amplified.gb
poly amplify pUC19.gb --primer_for "pUC19_for" --primer_rev "pUC19_rev" --range 146:469 > primers.txt amplified.gb
poly amplify pUC19.gb --range 146:469 --no_amplify pGLO.gb > primers.txt amplified.gb
poly amplify SARs-CoV-2.gb --validate 30:29886 --size 375:425 --coverage 100 --overlap 30:50 > primers.txt many_amplified.gb
(Note: this is pretty much the exact use case of https://artic.network/ncov-2019, which is why this kind of thing is important. It also doesn't fit in well to other parts of the program)
Another example:
poly amplify pUC19.gb --validate 146:469 --size 0:500 --coverage 100 --overlap 0:0 > primers.txt amplified.gb
In this example, we really just want to amplify a fragment that we know this sequence is inside of so we can do some sanger sequencing or the like on it.
@jecalles thoughts on different use cases I might be forgetting?
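For the simplest case above, the forward primer is just the 5' end of the amplicon and the reverse primer is the reverse complement of its 3' end. A minimal sketch (primer lengths here are chosen to match the example table; real primer design would pick lengths by melting temperature):

```go
package main

import "fmt"

// revComp returns the reverse complement of a DNA sequence
// (uppercase ACGT only, for brevity).
func revComp(seq string) string {
	comp := map[byte]byte{'A': 'T', 'T': 'A', 'G': 'C', 'C': 'G'}
	out := make([]byte, len(seq))
	for i := 0; i < len(seq); i++ {
		out[len(seq)-1-i] = comp[seq[i]]
	}
	return string(out)
}

func main() {
	// First 20 bases of the pUC19 amplicon above: the forward primer.
	fmt.Println("ATGACCATGATTACGCCAAG")
	// Last 18 bases of the amplicon, reverse complemented: the reverse primer.
	tail := "TGCTCTGATGCCGCATAG"
	fmt.Println(revComp(tail)) // CTATGCGGCATCAGAGCA, matching Primer 2
}
```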
Is your feature request related to a problem? Please describe.
A parser for Uniprot XML data dumps (https://www.uniprot.org/downloads)
Describe the solution you'd like
A way to digest all of the Uniprot data dumps.
Describe alternatives you've considered
We built the concurrent Fasta parser to handle the Uniprot FASTA dumps, but a lot of information is lost there. It would be greatly preferred to have the XML parsed, so we can get full documentation of every protein.
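Because the Uniprot XML dumps are tens of gigabytes, the parser should decode entry-by-entry with a streaming decoder rather than unmarshalling the whole document. A sketch using encoding/xml (the entry struct is a minimal stand-in; the real Uniprot schema has far more fields, and the accession shown is made up):

```go
package main

import (
	"encoding/xml"
	"fmt"
	"strings"
)

// entry is a minimal stand-in for a Uniprot <entry> element.
type entry struct {
	Accession string `xml:"accession"`
	Sequence  string `xml:"sequence"`
}

// parseEntries walks the token stream and decodes each <entry> element as
// it is encountered, so memory use stays bounded on huge dumps.
func parseEntries(xmlData string) []entry {
	dec := xml.NewDecoder(strings.NewReader(xmlData))
	var entries []entry
	for {
		tok, err := dec.Token()
		if err != nil {
			break // io.EOF ends the stream
		}
		if se, ok := tok.(xml.StartElement); ok && se.Name.Local == "entry" {
			var e entry
			if dec.DecodeElement(&e, &se) == nil {
				entries = append(entries, e)
			}
		}
	}
	return entries
}

func main() {
	// A tiny hand-written fragment in the shape of a Uniprot dump.
	data := `<uniprot><entry><accession>P0ABC1</accession><sequence>MKV</sequence></entry></uniprot>`
	for _, e := range parseEntries(data) {
		fmt.Println(e.Accession, e.Sequence)
	}
}
```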
Is your feature request related to a problem? Please describe.
Currently poly can read genbank (.gb, .gbk) files, but it can't write them out.
Describe the solution you'd like.
What I'd like is a gbk string builder similar in interface to the GffBuilder function in io.go.
It'd be really cool to have a pull request comment template that prompts people to write about what they're integrating.
If someone could look up how to do this with the .github directory and add one that I could edit and merge it would make a great first contribution!
Describe the bug
Genbank files exported by Snapgene produce no sequence output when converted to JSON.
To Reproduce
puc19.gbk
puc19_snapgene.gb
puc19.json
puc19_snapgene.json
$ cat puc19.gbk | poly c -i gb -o json > puc19.json
$ cat puc19_snapgene.gb | poly c -i gb -o json > puc19_snapgene.json
$ cat puc19.json | jq ".Sequence.Sequence"
"gagatacctacagcgtgagctatgagaaagcgccacgcttcccgaagggagaaaggcggacaggtatccggtaagcggcagggtcggaacaggagagcgcacgagggagcttccagggggaaacgcctggtatctttatagtcctgtcgggtttcgccacctctgacttgagcgtcgatttttgtgatgctcgtcaggggggcggagcctatggaaaaacgccagcaacgcggcctttttacggttcctggccttttgctggccttttgctcacatgttctttcctgcgttatcccctgattctgtggataaccgtattaccgcctttgagtgagctgataccgctcgccgcagccgaacgaccgagcgcagcgagtcagtgagcgaggaagcggaagagcgcccaatacgcaaaccgcctctccccgcgcgttggccgattcattaatgcagctggcacgacaggtttcccgactggaaagcgggcagtgagcgcaacgcaattaatgtgagttagctcactcattaggcaccccaggctttacactttatgcttccggctcgtatgttgtgtggaattgtgagcggataacaatttcacacaggaaacagctatgaccatgattacgccaagcttgcatgcctgcaggtcgactctagaggatccccgggtaccgagctcgaattcactggccgtcgttttacaacgtcgtgactgggaaaaccctggcgttacccaacttaatcgccttgcagcacatccccctttcgccagctggcgtaatagcgaagaggcccgcaccgatcgcccttcccaacagttgcgcagcctgaatggcgaatggcgcctgatgcggtattttctccttacgcatctgtgcggtatttcacaccgcatatggtgcactctcagtacaatctgctctgatgccgcatagttaagccagccccgacacccgccaacacccgctgacgcgccctgacgggcttgtctgctcccggcatccgcttacagacaagctgtgaccgtctccgggagctgcatgtgtcagaggttttcaccgtcatcaccgaaacgcgcgagacgaaagggcctcgtgatacgcctatttttataggttaatgtcatgataataatggtttcttagacgtcaggtggcacttttcggggaaatgtgcgcggaacccctatttgtttatttttctaaatacattcaaatatgtatccgctcatgagacaataaccctgataaatgcttcaataatattgaaaaaggaagagtatgagtattcaacatttccgtgtcgcccttattcccttttttgcggcattttgccttcctgtttttgctcacccagaaacgctggtgaaagtaaaagatgctgaagatcagttgggtgcacgagtgggttacatcgaactggatctcaacagcggtaagatccttgagagttttcgccccgaagaacgttttccaatgatgagcacttttaaagttctgctatgtggcgcggtattatcccgtattgacgccgggcaagagcaactcggtcgccgcatacactattctcagaatgacttggttgagtactcaccagtcacagaaaagcatcttacggatggcatgacagtaagagaattatgcagtgctgccataaccatgagtgataacactgcggccaacttacttctgacaacgatcggaggaccgaaggagctaaccgcttttttgcacaacatgggggatcatgtaactcgccttgatcgttgggaaccggagctgaatgaagccataccaaacgacgagcgtgacaccacgatgcctgtagcaatggcaacaacgttgcgcaaactattaactggcgaactacttactctagcttcccggcaacaattaatagactggatggaggcggataaagttgcaggaccacttctgcgctcggcccttccggctggctggtttattgctgataaatctggagccggtgagcgtgg
gtctcgcggtatcattgcagcactggggccagatggtaagccctcccgtatcgtagttatctacacgacggggagtcaggcaactatggatgaacgaaatagacagatcgctgagataggtgcctcactgattaagcattggtaactgtcagaccaagtttactcatatatactttagattgatttaaaacttcatttttaatttaaaaggatctaggtgaagatcctttttgataatctcatgaccaaaatcccttaacgtgagttttcgttccactgagcgtcagaccccgtagaaaagatcaaaggatcttcttgagatcctttttttctgcgcgtaatctgctgcttgcaaacaaaaaaaccaccgctaccagcggtggtttgtttgccggatcaagagctaccaactctttttccgaaggtaactggcttcagcagagcgcagataccaaatactgttcttctagtgtagccgtagttaggccaccacttcaagaactctgtagcaccgcctacatacctcgctctgctaatcctgttaccagtggctgctgccagtggcgataagtcgtgtcttaccgggttggactcaagacgatagttaccggataaggcgcagcggtcgggctgaacggggggttcgtgcacacagcccagcttggagcgaacgacctacaccgaact"
$ cat puc19_snapgene.json | jq ".Sequence.Sequence"
""
Expected behavior
puc19_snapgene.gb sequence output should be the same as puc19.gbk.
Describe the bug
Seqhash is producing different outputs depending on whether the input sequence is lowercase or uppercase. This exact problem was mentioned here, originally handled in #6, and the fix was removed in this commit.
Seqhash should return the same value whether a base pair is uppercase or lowercase, because there is no sequence-level difference between the two.
To Reproduce
puc19_upper.gbk
puc19.gbk
$ cat puc19.gbk | poly ha -i gb
4031e1971acc8ff1bf0aa4ed623bc58beefc15e043075866a0854d592d80b28b
$ cat puc19_upper.gbk | .././poly ha -i gb
93c69f4f40c5803aab046d15264de8ab3574d8debf428cf8cfd7d29a8437319a
Expected behavior
Both puc19 files have identical sequences, and should have identical hash outputs.
Additional context
Reference implementation of seqhash is a python library, with the code being hosted here.
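The fix comes down to normalizing case before hashing. A minimal sketch (sha256 stands in here for the blake3 hasher poly actually uses):

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strings"
)

// hashSeq normalizes case before hashing, so "atgc" and "ATGC" produce
// identical digests. Illustrative of the fix, not poly's seqhash code.
func hashSeq(seq string) string {
	sum := sha256.Sum256([]byte(strings.ToUpper(seq)))
	return hex.EncodeToString(sum[:])
}

func main() {
	fmt.Println(hashSeq("atgc") == hashSeq("ATGC")) // true
}
```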
I had a great call the other month with the developer of easy_dna, and he suggested that even though it is far from his most popular library, it is his most useful, and that poly should have similar functionality. Below is a list of functions that I will accept PRs for if ported. See the above link for documentation and links to original source code. Please make sure they are well tested!
all_iupac_variants
anonymized_record
copy_and_paste_segment
cut_and_paste_segment
dna_pattern_to_regexpr
list_common_enzymes
random_dna_sequence
random_protein_sequence
replace_segment
reverse_segment
swap_segments
Thanks,
Tim
P.S. Some clarifying info that Davian asked for.
Most functions will likely fit into the scope of either sequence.go or transformations.go and their associated test files. If you feel like a function doesn't fit within the scope of either, you can make a utils.go and utils_test.go in the project's main directory to include with your pull request.
Most functions can be written as standalone string functions and then be wrapped by a method to use with poly's main sequence struct. That's as complex as integration with the main library will have to be in most cases.
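As an example of the "standalone string function" pattern, here is a sketch of what a port of easy_dna's reverse_segment might look like (the Go signature and name are guesses at what a port would use; a full port should also decide whether features need reverse-complementing):

```go
package main

import "fmt"

// ReverseSegment reverses the bases in seq[start:end) and leaves the rest
// of the sequence untouched. Sketch of a possible easy_dna port.
func ReverseSegment(seq string, start, end int) string {
	b := []byte(seq)
	for i, j := start, end-1; i < j; i, j = i+1, j-1 {
		b[i], b[j] = b[j], b[i]
	}
	return string(b)
}

func main() {
	// Reverses "GCAT" (positions 2..5) in place within the sequence.
	fmt.Println(ReverseSegment("ATGCATGC", 2, 6)) // ATTACGGC
}
```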
Is your feature request related to a problem? Please describe.
poly version should output the version of poly that is currently running.
Describe the solution you'd like
$ poly version
0.0.1
It's not necessary at the moment, but if we ever need to create a loading dialog it'd be funny to have some ASCII art to accompany it. Inspired by this and this, I give you this:
___ __ ___ ___ ___ __ __
|__ |\ | / _` |__ |\ | |__ |__ |__) | |\ | / _`
|___ | \| \__> |___ | \| |___ |___ | \ | | \| \__>
I was reviewing the JSON at https://github.com/TimothyStiles/poly/blob/prime/data/puc19static.json and I found an inconsistency between "Meta" and "Locus". In Meta there is a GenbankDivision field, and in Locus there is a GenBankDivision field. Poly should be consistent.
Heng Li (inventor of SAM, minimap, etc., basically an absolute madlad who has seriously contributed to the bio software space) had this in a blog post:
http://lh3.github.io/2020/05/17/fast-high-level-programming-languages
Looks like Go is about 2x slower than most of the other high-level languages. It would be great to have a performance test suite of poly vs other REAL bio software (not hypothetical), maybe something that can be run automatically?
Is your feature request related to a problem? Please describe.
In addition to our .GetSequence() method, there should also be insert, delete, and replace sequence methods.
Describe the solution you'd like
Insert, delete, and replace methods should work for Sequence and Feature structs. In the case where a change causes a shift in sequence coordinates, a Sequence struct's Features should be updated to reflect that.
Additional context
These methods are incredibly important. They will come into play with cloning, crispr, synthesis, and codon optimization functions plus a bunch of other stuff. They should be a priority.
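A minimal sketch of the coordinate-shifting behavior described above (struct and field names are stand-ins for poly's real types; a full implementation would also handle features that span the insertion point):

```go
package main

import "fmt"

// Feature and Sequence are minimal stand-ins for poly's structs.
type Feature struct {
	Name       string
	Start, End int
}

type Sequence struct {
	Sequence string
	Features []Feature
}

// Insert splices insert into s at position pos and shifts every feature
// that starts at or after pos so its coordinates stay valid.
func (s *Sequence) Insert(pos int, insert string) {
	s.Sequence = s.Sequence[:pos] + insert + s.Sequence[pos:]
	for i := range s.Features {
		if s.Features[i].Start >= pos {
			s.Features[i].Start += len(insert)
			s.Features[i].End += len(insert)
		}
	}
}

func main() {
	s := Sequence{
		Sequence: "ATGCATGC",
		Features: []Feature{{Name: "gene", Start: 4, End: 8}},
	}
	s.Insert(2, "GGG")
	fmt.Println(s.Sequence)          // ATGGGGCATGC
	fmt.Println(s.Features[0].Start) // 7
}
```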
Poly store is a method to store and search DNA sequences. Unlike many other methods for storing and indexing DNA sequences, Poly store is meant to be able to store everything - all DNA humanity has sequenced. The first implementation of Poly store was experimented with in #15
This is a difficult problem, but it generally breaks down into a couple of necessary features:
In order to implement Poly store, I think we should use a suffix-array data structure. This argument was made in #15, and I think it still holds. The basic idea is we would have sequence, suffix array, lcp (longest common prefix), and seqhash position arrays. sequence would be a massive DNA sequence comprising all of the appended sequences together. suffix array is the suffix array of sequence, and lcp is the lcp of that suffix array. seqhash position is a list of sequence positions that compose a seqhash, which references an object in a known database. Seqhashes are the universal identifiers we actually want in the end.
There will be a few necessary operations:
On a most basic level, searching sequence is done by doing a binary search on a suffix array. This operation is rather simple, and the data structure doesn't matter too much, so long as we can easily get the length of the suffix array and ensure that the suffix array is correctly sorted.
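Go's standard library already provides exactly this operation - a suffix array with binary-search lookup - in index/suffixarray. A small example of the basic search described above:

```go
package main

import (
	"fmt"
	"index/suffixarray"
)

// findAll builds a suffix array over seq and returns every offset where
// pattern occurs, using the stdlib's binary search on the suffix array.
func findAll(seq, pattern string) []int {
	idx := suffixarray.New([]byte(seq))
	// n = -1 means return all occurrences (unsorted).
	return idx.Lookup([]byte(pattern), -1)
}

func main() {
	offsets := findAll("ATGCGATTACAGGGC", "GATTACA")
	fmt.Println(offsets) // [4]
}
```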
For deduplication of new DNA getting inserted, it is important to be able to find the longest common substrings. So long as you can reversibly insert the new DNA sequence into the suffix array, this operation comes down to checking the local LCPs around the inserted DNA sequence, similar to the optimal deduplicator described in #15 under the initial post. As an added bonus, this step will also deduplicate the sequence against itself for repeated regions.
This can also be used to build relationship maps between species.
While the above algorithm for finding longest common substrings works well, it is insert-order dependent, where certain insert orders will be more efficient than others. This is concerning, and would require a smart insertion order. This, however, is quite solvable with some clever use of seqhash positions. There needs to be a way to calculate the most inefficient seqhash positions to occasionally rearrange the sequence. This is a little unclear, so here is an example:
Say the reference human genome has a section of sequence that looks like ... ATG GGG TAA ..., however, it turns out most human beings have the sequence ... ATG GGA TAA .... If the reference human genome was inserted first, it will always split up the subsequent human sequences into two different positions. When you factor in the increased referencing costs, it makes sense to store the ... ATG GGA TAA ... sequence and instead split the reference human genome.
To do an alignment, you need a small sequence to get started with the approximately right positions. This sequence I call the "sequence seed", and is well illustrated in the Centrifuge paper [0]. After you get the proper locations of the seed, you walk down the sequence in both directions. If any seqhash position gets split while walking, you must branch to test the different possibilities.
This sequence walking can be enhanced by searching the longest common substrings (which reduces or eliminates the need for backtracking), so long as the lcp array values are easy to search on.
With these features and operations in mind, there are a few focus points that the end product will have to be good at:
Efficient transactions are necessary for implementing the continuous updating feature from the SAIS algorithm [1], and reversible transactions are necessary for searching longest common substrings.
Efficient indexing and relationship referencing are necessary for quickly moving through the relationships between the seqhash positions, suffix array, and sequence lists. This is especially necessary for sequence walking, which needs to coordinate between the sequence and seqhash positions lists often.
SQLite caps out at ~100K inserts per second in my testing - without very aggressive deduplication, it would take far too long to insert all DNA materials. This can be a problem, but it is unknown how big of a problem it will be with deduplication.
===
@TimothyStiles Thoughts? I'm thinking the biggest change here from last time is the longest common substrings problem, which turns out to be essential for a lot of interesting capacities, such as effective deduplication, phylogenetic / relationship tree building, and in-depth alignment.
[0] https://dx.doi.org/10.1101%2Fgr.210641.116
[1] https://zork.net/~st/jottings/sais.html
Describe the bug
In io.go at lines 80-81, there are Circular and Linear fields, both booleans. Couldn't you just have one? (Since if Circular is false, Linear is true, and vice versa, by definition.)
Is this here simply to accommodate the ambiguous case?
To Reproduce
Steps to reproduce the behavior:
go to io.go
Expected behavior
A single field
Is your feature request related to a problem? Please describe.
I would like Poly to have a function to Fragment DNA samples that you wish to synthesize. Many DNAs that one would wish to synthesize are above the fragment threshold for synthesis providers (~1500bp for Twist), so this would be useful to them. In addition, you can synthesize more complex DNA fragments through clever splitting of Fragments (ie, split a homopolymer to get >8bp homopolymers in a synthesized fragment).
I would like this function to be optimal using the data set in this paper - https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0238592 . You can imagine using a constraint solver to get the optimal overhangs for a given sequence, no matter the size. This is fundamentally useful for users who want to synthesize large fragments, like full biosynthetic pathways. It is also useful for users who want to efficiently create complex DNA from restrictive synthesis companies.
Describe the solution you'd like
I want a function that I can pass many sequences which then optimally fragments them for synthesis.
Additional context
I'd like to be able to do "linking", i.e., synthesizing many different genes on a single fragment, which then get divided using different restriction enzymes (e.g., BbsI + BtgZI).
Is your feature request related to a problem? Please describe.
Currently there are two issues with how hash.go is implemented.
The blake3 package does not implement the standard hash.Hash interface. Currently we just use a special method for hashing with blake3, but this is bad and should be fixed ASAP.
GenericSequenceHash only hashes AnnotatedSequence objects; it should also hash Feature and Sequence objects.
Describe the solution you'd like
Write a wrapper around the blake3 package we use that implements the standard hash.Hash interface like all the other hash functions seen here.
Replace GenericSequenceHash with a new function called HashSequence that takes an io.Writer and a hash function and returns a hashed string. AnnotatedSequence, Feature, and Sequence objects should also have a .Hash() method. .GetSequence() will come in handy here.
Describe alternatives you've considered
We could continue building features around this, but it's best to refactor and build off of much cleaner code.
Describe the bug
Gff parser chops off last attribute in attribute list.
To Reproduce
Use the gff parser.
Expected behavior
Not this.
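A common cause of this class of bug is a split loop that only consumes attributes up to the last ";" and drops the final key=value pair when there is no trailing semicolon. A sketch of attribute parsing that keeps the last field (illustrative code, not poly's actual parser):

```go
package main

import (
	"fmt"
	"strings"
)

// parseAttributes splits a GFF attributes column into a map, keeping the
// final key=value pair whether or not a trailing ";" is present.
func parseAttributes(field string) map[string]string {
	attrs := map[string]string{}
	for _, kv := range strings.Split(strings.TrimSuffix(field, ";"), ";") {
		parts := strings.SplitN(strings.TrimSpace(kv), "=", 2)
		if len(parts) == 2 {
			attrs[parts[0]] = parts[1]
		}
	}
	return attrs
}

func main() {
	attrs := parseAttributes("ID=gene1;Name=dnaA;locus_tag=BSU_00010")
	fmt.Println(attrs["locus_tag"]) // BSU_00010 - the last attribute survives
}
```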
An extremely useful function for poly to have is efficient sequence search. This can be used for lots of different operations. Historically, BLAST has been the most popular technology, but BLAST has a few major disadvantages. BLAST is a relatively old technology, and its time complexity is at best O(n), and at worst O(n^2), where n is the length of your sequence.
When it comes to efficient algorithms for searching sequence, there are 3 primary algorithms to be interested in - suffix trees, suffix arrays, and FM indexes. All three can reduce sequence search down to log time complexity. Suffix trees take a large quantity of memory, but are extremely fast for certain domains of problems (like the longest common substring problem). Suffix arrays are much simpler and smaller, but can't do as many interesting things as suffix trees can. However, they do have the unique advantage in Go of being implemented in the standard library.
FM indexes are relatively more complex than both suffix arrays and suffix trees (if you want to learn more, watch this video on suffix tries/trees, burrows-wheeler transform, and finally the FM index). While they are more computationally intensive than suffix trees or arrays and are much more complicated than suffix trees or arrays, they take up far less space.
Given popular tools like Bowtie2 and https://ccb.jhu.edu/software/centrifuge/ use the FM index, this is probably what poly should do, right?
In 2016, Centrifuge was made for the purpose of very fast sequence search. Basically, they took similar genomes and did deduplication of common genetic elements (based on species and 53 k-mers), then created an FM index of those sequences. The deduplication seems to yield a sequence that is about ~15% the size of the original entry. In 2016, the NCBI sequence databases, when duplicates are removed, is about 109 billion base pairs.
Stored as a suffix array, this would take 147 gigabytes of memory with the Golang suffix array implementation (each base stored as an int8, so index + storage takes 9 bytes per base).
Stored as a FM index, this took 69 gigabytes of memory.
Centrifuge's naive deduplication algorithm was able to achieve about 6.6x reduction in memory usage overall, while FM indexes were only able to get 2.1x reduction from suffix arrays alone.
In essence, it seems like what is saving a real quantity of memory isn't necessarily the FM index and all the complexity it brings along, but the simple act of deduplicating data. Thus, to do effective sequence search, I think Poly should use the basic Golang suffix arrays and invest energy into figuring out deduplication of genetic data. If Poly is meant for forward engineering, we can expect that DNA parts will be reused quite often, which is an opportunity for creating a very efficient deduplication scheme. Centrifuge's deduplication scheme doesn't have great support for circular sequences, which is something Poly should have.
Deduplication is also a very nice feature for commercial operations that need to save lots of sequence data, such as data coming through sequencers. I'm not sure how to implement this yet, but we should think about it. Please post in this thread if you have any ideas.
Is your feature request related to a problem? Please describe.
The most likely use of hash outputs is going to be linking hash IDs to the LOCUS of a Genbank file, since that is typically the unique identifier. There should be a non-default option for the hash function to output the LOCUS of the Genbank file it has parsed.
Describe the solution you'd like
I would like the poly hash function to have 1 new optional option:
Describe alternatives you've considered
I've considered just getting this information from the json file, but that seemed overly complicated for systems that already have nice genbank output.
Additional context
Add any other context or screenshots about the feature request here.
Describe the bug
JSON object representations of Genbank features don't have proper Start and End values. The Location field, as far as I know, is slightly different. Start and End should either have simple integer values, or be redundant with Location.
To Reproduce
Steps to reproduce the behavior:
cat bsub.gbk | poly c -i gbk -o json > bsub.json
Expected behavior
The Start and End fields in the JSON output should match the Genbank outputs.
Is your feature request related to a problem? Please describe.
I get the feeling that the contributor's guide is getting kind of stale.
Describe the solution you'd like
If people could reply here to suggest improvement or talk about roadblocks they've faced with it onboarding/contributing I'd love to hear them!
I think I'll update requirements to include that code must also use godocs after we fix #53.
It's weird to ask Linux users to use Homebrew to install when it's not that popular on Linux systems. It'd be better if we offered support for RPM, Yum, and Apt.
We should include in the documentation instructions like rpm -i <url-to-latest-github-release.rpm> or yum install <url-to-latest-github-release.rpm>.
To do this we'd likely need to automate release updates to the documentation via a GitHub action. We should also test that these installs work before pushing updates.
Is your feature request related to a problem? Please describe.
Integration and testing of the Rhea parser package into Poly
https://github.com/Koeng101/rhea
Describe the solution you'd like
Integration and testing of the Rhea parser package into Poly
Describe alternatives you've considered
@TimothyStiles writing a Rhea parser from scratch
Additional context
The RDF XML Rhea database drop can be downloaded here - https://www.rhea-db.org/help/download
Describe the bug
When parsing out descriptions, newlines aren't being replaced with a space. Newlines should be replaced with a space, not with nothing.
Genbank:
CDS 410..1750
/gene="dnaA"
/locus_tag="BSU_00010"
/old_locus_tag="BSU00010"
/function="16.9: Replicate"
/experiment="publication(s) with functional evidences,
PMID:2167836, 2846289, 12682299, 16120674, 1779750,
28166228"
/note="Evidence 1a: Function from experimental evidences
in the studied strain; PubMedId: 2167836, 2846289,
12682299, 16120674, 1779750, 28166228; Product type f :
factor"
/codon_start=1
/transl_table=11
/product="chromosomal replication initiator informational
ATPase"
/protein_id="NP_387882.1"
JSON:
"product": "chromosomal replication initiator informationalATPase",
To Reproduce
cat bsub.gbk | poly c -i gbk -o json > bsub.json
Check bsub.gbk and bsub.json with your favorite text editor.
Expected behavior
I expect there to be a space
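The fix amounts to joining wrapped qualifier lines with a single space instead of concatenating them directly. A minimal sketch (the helper is illustrative, not poly's actual parser code):

```go
package main

import (
	"fmt"
	"strings"
)

// joinContinuation joins Genbank qualifier continuation lines with a single
// space, trimming the leading indentation each wrapped line carries.
func joinContinuation(lines []string) string {
	for i := range lines {
		lines[i] = strings.TrimSpace(lines[i])
	}
	return strings.Join(lines, " ")
}

func main() {
	wrapped := []string{
		"chromosomal replication initiator informational",
		"          ATPase",
	}
	fmt.Println(joinContinuation(wrapped))
	// chromosomal replication initiator informational ATPase
}
```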
Is your feature request related to a problem? Please describe.
It is annoying having to digest a .gbk file every time I want to do an optimization. I would like to be able to import and export codon tables using JSON format.
Describe the solution you'd like
JSON import and output of codon tables.
I recently discovered a bug, which I quick-patched in commit 346e3eb, where the sequence hash was summing incorrectly and returning comically long hashes far beyond the 40 character limit of sha1.
I believe I've patched the problem, but to prevent regression there should be at least a test to prove that general hashing works properly. It's hard to check against system-standard hashers because they do not hash the sequence itself but instead the whole file.
What could alleviate this is a least-rotation command that utilizes the same least-rotation function the circular hashing function uses, but instead of returning a hash just returns the rotated string. I'll be able to implement this at some point, but it's a great first contribution if anyone wants to jump in.
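For anyone picking this up, a naive sketch of least rotation (this O(n^2) version just shows the behavior; poly's circular hashing can use a faster algorithm such as Booth's with the same output):

```go
package main

import "fmt"

// leastRotation returns the lexicographically smallest rotation of s, the
// canonical form that all rotations of a circular sequence share.
func leastRotation(s string) string {
	doubled := s + s // every rotation is a length-len(s) window of s+s
	best := s
	for i := 1; i < len(s); i++ {
		if rot := doubled[i : i+len(s)]; rot < best {
			best = rot
		}
	}
	return best
}

func main() {
	// All rotations of a circular sequence map to the same canonical form,
	// which is why circular seqhashes rotate before hashing.
	fmt.Println(leastRotation("TTAGC")) // AGCTT
	fmt.Println(leastRotation("AGCTT")) // AGCTT
}
```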
Is your feature request related to a problem? Please describe.
People keep asking for SBOL3 IO so I thought there should be an issue for it.
Describe the solution you'd like
The SBOL3 parser should implement four functions: ReadSBOL, ParseSBOL, BuildSBOL, and WriteSBOL.
ReadSBOL pretty much just does file IO and passes a file string to ParseSBOL.
ParseSBOL picks apart the SBOL file string and appropriately fills an AnnotatedSequence instance with it.
BuildSBOL takes an AnnotatedSequence object and uses it to build a string that can be passed to WriteSBOL.
WriteSBOL writes an SBOL string to file.
Each function should be tested. The easiest way is to read and parse an SBOL into a Sequence, then build a string with it, write it out, and compare the input with the output. If they're the same, you've done it!
Is your feature request related to a problem? Please describe.
Need for a python interface to use Poly with Python 3+. This will enable us to easily integrate parsers, etc. into other software pipelines.
Describe the solution you'd like
A step by step guide on how to integrate poly into Python. This will allow inexperienced software developers to leverage poly's libraries and capabilities easily. From the structure of the codebase, it's hard to figure out what all features are available within the codebase. This should also include a set of easy to follow tutorials that can be used to instantiate the classes and use the API programmatically rather than running subprocesses.
Describe alternatives you've considered
I was looking up online how to do this, and it seems like the general path is: build a shared object -> write a CPython wrapper -> use it. From my experience, writing these wrappers is extremely annoying. I've used something like SWIG in the past to generate wrappers for C++ projects with Python, and I wonder if something similar exists for generating these wrappers. Alternatively, the wrappers need to be created as part of the project's development roadmap to provide APIs.
https://dev.to/astagi/extending-python-with-go-1deb
Additional context
I myself am new to the go ecosystem so I might be wrong about this. Would love to hear some feedback.
The contributing guide needs more detail on dev setup and Code Climate usage.
Large scale DNA synthesis requires optimization of sequences, removing sequences that complicate the synthesis reaction. There have been many iterations at different companies, but generally it follows the formula of:
DNA Chisel is a good example of a project that does this (other ideas may come from their main site)
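As a concrete example of the kind of check such an optimizer runs, here's a minimal sketch that flags two common synthesis complications, homopolymer runs and extreme GC content; the thresholds (runs of 6+, GC outside 25-65%) are illustrative, not taken from any vendor spec:

```go
package main

import (
	"fmt"
	"strings"
)

// synthesisProblems flags sequences likely to complicate the
// synthesis reaction. Thresholds are illustrative; real vendors
// publish their own limits.
func synthesisProblems(seq string) []string {
	seq = strings.ToUpper(seq)
	var problems []string

	// Homopolymer check: 6 or more identical bases in a row.
	run := 1
	for i := 1; i < len(seq); i++ {
		if seq[i] == seq[i-1] {
			run++
			if run == 6 {
				problems = append(problems, fmt.Sprintf("homopolymer run at %d", i-5))
			}
		} else {
			run = 1
		}
	}

	// Global GC content check.
	gc := strings.Count(seq, "G") + strings.Count(seq, "C")
	frac := float64(gc) / float64(len(seq))
	if frac < 0.25 || frac > 0.65 {
		problems = append(problems, fmt.Sprintf("GC content %.2f out of range", frac))
	}
	return problems
}

func main() {
	fmt.Println(synthesisProblems("ATGAAAAAAGC")) // [homopolymer run at 3]
}
```

A full optimizer would then rewrite flagged windows using synonymous codons rather than just reporting them.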
Is your feature request related to a problem? Please describe.
It would be great to have a full suite of enzymes available for `clone.go`. The easiest way to do this would be to make a parser for one of the data dumps from NEB REBASE (http://rebase.neb.com/rebase/rebase.serv.html). The easiest format will probably be format #31 (http://rebase.neb.com/rebase/rebase.f31.html).
Describe the solution you'd like
`getBaseRestrictionEnzyme` should have all restriction enzymes available with it, built into poly.
Describe alternatives you've considered
N/A
Additional context
REBASE is awesome and can provide customized data dumps. Though that probably shouldn't be necessary, it is something to think about.
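A parser sketch, assuming each data line of the dump is an enzyme name followed by whitespace and a recognition site (roughly the bionet layout); the actual format #31 should be checked against a real dump before relying on this:

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// parseRebase maps enzyme names to recognition sites, assuming
// "Name (aliases)  Site" data lines. Header, blank, and
// malformed lines are skipped.
func parseRebase(data string) map[string]string {
	enzymes := make(map[string]string)
	scanner := bufio.NewScanner(strings.NewReader(data))
	for scanner.Scan() {
		fields := strings.Fields(scanner.Text())
		if len(fields) < 2 {
			continue // header, blank, or malformed line
		}
		name, site := fields[0], fields[len(fields)-1]
		// Recognition sites use IUPAC letters plus a ^ cut marker;
		// reject anything else (e.g. trailing header words).
		if strings.IndexFunc(site, func(r rune) bool {
			return !strings.ContainsRune("ABCDGHKMNRSTVWY^", r)
		}) != -1 {
			continue
		}
		enzymes[name] = site
	}
	return enzymes
}

func main() {
	dump := "AanI (PsiI)  TTA^TAA\nBsaI (4/8)  GGTCTC\n"
	fmt.Println(parseRebase(dump))
}
```

The resulting map could then seed whatever lookup `getBaseRestrictionEnzyme` ends up using.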
Is your feature request related to a problem? Please describe.
`io.go` was the first file I wrote for poly. It's a monster at ~1400 lines. It defines the `AnnotatedSequence` struct, which holds all the information that gets parsed by the gbk, gff, json, etc. parsers that also live in that file. It's pretty good stuff but could use a little revamping in certain places.
Current issues/needs are:
- Finding redundancies in the `AnnotatedSequence` struct and child structs to update/consolidate. Update any parser that uses/used what we're going to change. Evaluate merging the `Sequence` struct into the `AnnotatedSequence` struct and renaming `AnnotatedSequence` to `Sequence`.
- Fasta parsing currently just creates an `[]AnnotatedSequence` slice where each annotated sequence holds the comment and a sequence. To be consistent with other parsers and methods we should find a way to use just a single `AnnotatedSequence` to represent a parsed fasta file.
- `GffParser` should be edited to use fasta parsing for the raw sequence, since some Gff3 files include fasta at the bottom for some reason.
- I didn't really know about `io.Writer` when I first wrote `io.go`. Most of the parsing/reading/building/writing should probably be refactored to use `io.Writer` if possible.
- There's a lot of good leftover functions and data that I used to build `GbkParser`. A lot of it is dead code now and should be archived in a new branch.
- SBOL2 IO should be written. I'll make a separate issue for that.
- One of the tests for `GffParser` fails on windows and windows only. More on that in issue #46.

Not sure if this is everything that needs to be done but I'll edit and add more if they come up.
Describe the solution you'd like
I've created a new branch called refactor-io. Please make all related pull requests there.
Happy hacking!
Tim
Describe the bug
`TestTranslationString` oftentimes (not always) fails during the GitHub CI test suite. This only happens on Mac OSX. I have no idea why.
To Reproduce
Steps to reproduce the behavior:
Expected behavior
Mac OSX failures should probably match ubuntu failures, and if it's going to fail, it should fail every time.
Additional context
I see that the function is using some stdin stuff - maybe mac doesn't handle that right? idk
Is your feature request related to a problem? Please describe.
Downloading bulk Genbank from https://ftp.ncbi.nlm.nih.gov/genbank/ gives a little bit of a unique file format. There are many sequences within each file, and they start with a header. I would like to be able to directly pipe those into Poly for manipulation. Specifically, I am looking at building codontable.com, and this kind of data parsing will be necessary.
Describe the solution you'd like
Describe alternatives you've considered
I've considered doing a real dirty bash parsing to get the job done, but it's probably worthwhile to make the Genbank flat file parser. It will tease out any bugs in the original iteration of the Genbank parser since it can be used on all official Genbank files.
Additional context
Genbank records are separated by `//`.
The headers of Genbank flat files look like:

```
GBBCT1.SEQ Genetic Sequence Data Bank
October 15 2020
NCBI-GenBank Flat File Release 240.0
Bacterial Sequences (Part 1)
101593 loci, 185853961 bases, from 101593 reported sequences
```

with LOCUS immediately following.
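The splitting logic could be sketched like this: skip the release header until the first LOCUS line, then cut records on the `//` terminator. Each record string would then be handed to the existing Genbank parser.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// splitFlatFile splits a bulk Genbank flat file into individual
// records. Everything before the first LOCUS line (the release
// header) is dropped; "//" lines terminate each record.
func splitFlatFile(data string) []string {
	var records []string
	var current []string
	inRecord := false

	scanner := bufio.NewScanner(strings.NewReader(data))
	for scanner.Scan() {
		line := scanner.Text()
		if !inRecord {
			if strings.HasPrefix(line, "LOCUS") {
				inRecord = true
			} else {
				continue // still inside the release header
			}
		}
		if line == "//" {
			records = append(records, strings.Join(current, "\n"))
			current = nil
			inRecord = false
			continue
		}
		current = append(current, line)
	}
	return records
}

func main() {
	flat := "GBBCT1.SEQ Genetic Sequence Data Bank\nheader line\nLOCUS AB000001\nORIGIN\natgc\n//\nLOCUS AB000002\n//\n"
	fmt.Println(len(splitFlatFile(flat))) // 2
}
```

For multi-gigabyte dumps the same state machine can run over a streaming reader instead of an in-memory string, emitting records on a channel.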
Is your feature request related to a problem? Please describe.
Implementing DNA synthesis optimization may be complex enough to write a spec for it before we get going.
Describe the solution you'd like
Stack overflow has a great dev blog post on how to write a spec. First step is information gathering which we can do here in this issue thread.
Questions to consider:
As we start answering these questions we can begin to fill out a draft spec based on the dev blog post I mentioned earlier. Our requirements will be a little less complex so we'll definitely have a shorter spec than the outline they give in that post!
I'm not sure what the best environment is for writing this spec together. I'll think about it, edit this post, and add a link when I've come to a conclusion, though I'm leaning towards just making a spec directory and putting a markdown file in there.
In the meantime dump any links, resources, ideas, etc here so that we can add them when the document is live!
Is your feature request related to a problem? Please describe.
Update the seqhash algorithm to give versioned output. For example, pUC19 would be `v1_DCD_4b0616d1b3fc632e42d78521deb38b44fba95cca9fde159e01cd567fa996ceb9`, with the DCD standing for "DNA-Circular-Double stranded".
Describe the solution you'd like
Output similar to the reference code at https://git.sr.ht/~koeng/python-seqhash/tree/master/seqhash/__init__.py.
Is your feature request related to a problem? Please describe.
New discord users seem to have a hard time streaming their screens. I had to fix a permissions setting on my computer in one case, and this morning someone on our discord had a similar issue.
Describe the solution you'd like
We should have a checklist that new users can run through before trying to jump into a live stream, so they can head off screen-sharing and mic problems.
Is your feature request related to a problem? Please describe.
Implementing multiple cloning functions may be complex enough to write a spec for it before we get going.
Describe the solution you'd like
Stack overflow has a great dev blog post on how to write a spec. First step is information gathering which we can do here in this issue thread.
Questions to consider:
As we start answering these questions we can begin to fill out a draft spec based on the dev blog post I mentioned earlier. Our requirements will be a little less complex so we'll definitely have a shorter spec than the outline they give in that post!
I'm not sure what the best environment is for writing this spec together. I'll think about it and edit this post with a link when I've come to a conclusion, though I'm leaning towards just making a spec directory and putting a markdown file in there.
In the meantime dump any links, resources, ideas, etc here so that we can add them when the document is live!
It's pretty frustrating that I can't get Poly command line tests to pass on Windows during GitHub Actions test jobs. I have no concept of how PowerShell works on Windows and am pretty useless when it comes to testing here. Currently command line tests are just bypassed on Windows.
If any Windows user here could go through Poly's commands_test.go file and recreate the unix tests for Windows, that would be amazing. Same for documentation.
Is your feature request related to a problem? Please describe.
I've used a lot of command line tools over time and some of them have really pretty interfaces. Ascii art, loading icons, etc. I know that pipes should work but beyond that I'm not really sure what users want in a command line interface.
Describe the solution you'd like
A web search for "how to design a great command line user experience" yields pretty good results:
Command line UX in 2020
UX for command line tools
One thing that I know poly needs to be able to do is provide feedback to the user about what it's doing. Again, ascii art, loading icons, etc. What's more important is that there's a bunch of stuff we don't know.
Things that need to be done:
Is your feature request related to a problem? Please describe.
Poly should have a nice documentation site.
Describe the solution you'd like
An out of the box solution that is pretty and has examples to complement go docs.
https://squidfunk.github.io/mkdocs-material/ comes to mind.
Describe alternatives you've considered
I'm not sure what alternatives there are at the moment but I made this thread to attract ideas and document solutions that I find.
Describe the solution you'd like
I'd like a fasta parser that parses fasta into an `AnnotatedSequence` struct or a slice of `AnnotatedSequence` structs. Most of the struct can be blank, but `AnnotatedSequence.Sequence.Sequence` should be a single string of characters with no whitespace and `AnnotatedSequence.Sequence.Description` should be the description. This is a really great first contribution and there's no rush to implement it, so I thought it'd be great to share here!
Additional context
Here's a spec from wikipedia. The resulting parser should be placed in io.go along with tests in io_test.go. More information on the AnnotatedSequence struct can also be found in io.go.
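A sketch of the requested parser, using a flattened stand-in for the `AnnotatedSequence` struct (the real one in io.go nests these fields under a child struct):

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// AnnotatedSequence is a flattened stand-in; the real struct in
// io.go holds these under AnnotatedSequence.Sequence.
type AnnotatedSequence struct {
	Description string
	Sequence    string
}

// ParseFasta turns fasta text into one AnnotatedSequence per ">"
// record, joining wrapped sequence lines into a single string
// with no whitespace.
func ParseFasta(data string) []AnnotatedSequence {
	var out []AnnotatedSequence
	var seq strings.Builder
	scanner := bufio.NewScanner(strings.NewReader(data))
	for scanner.Scan() {
		line := strings.TrimSpace(scanner.Text())
		switch {
		case line == "":
			continue
		case strings.HasPrefix(line, ">"):
			// New record: flush the previous record's sequence.
			if len(out) > 0 {
				out[len(out)-1].Sequence = seq.String()
				seq.Reset()
			}
			out = append(out, AnnotatedSequence{Description: strings.TrimPrefix(line, ">")})
		default:
			seq.WriteString(line)
		}
	}
	if len(out) > 0 {
		out[len(out)-1].Sequence = seq.String()
	}
	return out
}

func main() {
	records := ParseFasta(">pUC19 cloning vector\natg\ncgt\n>second\ngggg\n")
	fmt.Println(records[0].Description, records[0].Sequence, records[1].Sequence)
}
```

The tests in io_test.go can round-trip a small multi-record file and assert the descriptions and joined sequences come back intact.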
`primers.go` does a great job of calculating primer melting temp using SantaLucia, but in @jecalles' original pull request #34 they mentioned some extra stuff that they may need for their project.
It'd be great if we could upgrade `MeltingTemp()` to adjust predictions for (1) internal base mismatches, (2) internal loops, and (3) overhangs, as described in PR #34. I'm not sure I understand everything that needs to be done here, but @jecalles included a review that contains most of the math needed (doi: 10.1146/annurev.biophys.32.110601.141800).
Now that `MeltingTemp()` has been merged into `primers.go` we can start working on Gibson Assembly.
At its core, `GibsonAssembly()` would take a slice of sequences and return primers that would help join each piece into a fully realized sequence. It would also account for whether the sequence is linear or circular and design that last primer for a circular sequence as necessary. This function and its tests would be placed in new files called `clone.go` and `clone_test.go`.
For reference, Addgene has a great segment on Gibson Assembly cloning as well as a short tutorial on primer design.
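The junction logic could be sketched like this: each fragment after the first gets a forward primer whose 5' tail is copied from the end of the previous fragment. `gibsonForwardPrimers`, the fixed overlap, and the fixed annealing length are all illustrative; a real `GibsonAssembly()` would size the annealing region with `MeltingTemp()`, design reverse primers too, and close the circle for circular constructs.

```go
package main

import "fmt"

// gibsonForwardPrimers designs one forward primer per junction:
// the last `overlap` bases of the upstream fragment (the homology
// tail) followed by the first `anneal` bases of the downstream
// fragment. Fixed lengths are a simplification; real designs size
// the annealing region by melting temperature.
func gibsonForwardPrimers(fragments []string, overlap, anneal int) []string {
	var primers []string
	for i := 1; i < len(fragments); i++ {
		prev, cur := fragments[i-1], fragments[i]
		tail := prev[len(prev)-overlap:]
		primers = append(primers, tail+cur[:anneal])
	}
	return primers
}

func main() {
	fragments := []string{"aaaaatttt", "ccccgggg"}
	fmt.Println(gibsonForwardPrimers(fragments, 4, 4)) // [ttttcccc]
}
```

For a circular target, the same rule applied to the (last fragment, first fragment) pair yields the extra junction primer the issue mentions.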
Is your feature request related to a problem? Please describe.
Non-golang programmers will have a hard time installing the command line tool unless it's available through standard package managers.
Describe the solution you'd like
Compilation targets for various popular package managers, where after each release a new binary is created and uploaded for use with:
apt-get
yum
nix
brew
choco
Describe alternatives you've considered
We could keep having users install via `go get`, but it's not very nice to expect casual users to install a whole language toolchain when they could just install a binary instead.