bebop / poly
A Go package for engineering organisms.
Home Page: https://pkg.go.dev/github.com/bebop/poly
License: MIT License
I'd like to be able to run our tests on Windows, but given this stack trace from this pull request, Windows has a problem with ParseGff() on line 35 of io_test.go.
The problem is platform specific: the test doesn't fail on macOS or Ubuntu, and other tests have passed on Windows.
The stack trace stops on line 188 of io.go, and it appears that the root cause may be splitting on tabs without handling carriage returns, as seen on line 186 of io.go.
Has anyone seen this before, or would anyone be willing to take a crack at it?
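A likely culprit for a Windows-only parsing failure is CRLF line endings: a line read from a Windows file often ends in "\r", which then leaks into the last tab-separated field. A minimal sketch of a CRLF-safe split (the helper name is hypothetical, not poly's actual io.go code):

```go
package main

import (
	"fmt"
	"strings"
)

// splitGffLine strips a Windows carriage return before splitting on tabs,
// so "\r" can't end up glued to the final field. Illustrative only.
func splitGffLine(line string) []string {
	line = strings.TrimSuffix(line, "\r") // drop Windows CR if present
	return strings.Split(line, "\t")
}

func main() {
	// On Windows a raw line often ends in "\r"; without trimming it, the
	// final field would be "gene\r" instead of "gene".
	fields := splitGffLine("chr1\tsource\tgene\r")
	fmt.Printf("%q\n", fields[2]) // "gene"
}
```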
Lots of Go related projects adopt a dressed up Go Gopher as a mascot and we should too!
Gopher using DNA as jump rope
Gopher cutting DNA with scissors
Gopher duct taping DNA back together
Gopher inspecting peptide chains emerging from ribosomes with a little monocle
I'm not much of an artist myself, but if anyone could make a gopher fit for a mascot I'd be grateful. Any ideas on what their outfit should be?
Before a sequence is synthesized it should be optimized in at least 3 ways that I can think of.
Each optimization would have its own function but it'd be great to eventually have a function that performs a suite of pre-synthesis optimizations and checks.
@Koeng101 has already written a little about codon optimization and synthesis optimization but we've yet to start working on non-repetitive parts optimization. Are there any other pre-synthesis optimizations that we should be considering?
Is your feature request related to a problem? Please describe.
Currently we're using GitHub Pages, but it's somewhat tedious to write and publish docs there. Also, our documentation is missing from pkg.go.dev.
Describe the solution you'd like
Godocs provides a way to render inline comments as documentation. As time goes on it would be best if we relied on this for library documentation and perhaps for command line documentation. I'm hoping that our docs look like this at some point but currently our official go docs look like this.
Things to be done:
It looks like this link about adding-packages to pkg.go.dev is a good starting point.
https://go.dev/about#adding-a-package
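For anyone new to godocs: pkg.go.dev renders ordinary doc comments, where each comment is a complete sentence starting with the name of the thing it documents. A minimal sketch of the convention (GCContent is an illustrative function, not poly's current API):

```go
// Command docdemo shows the godoc comment convention that pkg.go.dev
// renders automatically once a module is indexed.
package main

import "fmt"

// GCContent returns the fraction of G and C bases in a DNA sequence.
// Because the comment starts with "GCContent", godoc attaches it to the
// function and renders it on pkg.go.dev. (Illustrative, not poly's API.)
func GCContent(seq string) float64 {
	if len(seq) == 0 {
		return 0
	}
	gc := 0
	for _, b := range seq {
		switch b {
		case 'G', 'C', 'g', 'c':
			gc++
		}
	}
	return float64(gc) / float64(len(seq))
}

func main() {
	fmt.Println(GCContent("ATGC")) // 0.5
}
```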
Primers are a common need for labs, and will be required for more complex protocols, like Gibson assemblies.
The most basic application of a primer design will be a simple amplification.
poly amplify pUC19.gb --amplicon "ATGACCATGATTACGCCAAGCTTGCATGCCTGCAGGTCGACTCTAGAGGATCCCCGGGTACCGAGCTCGAATTCACTGGCCGTCGTTTTACAACGTCGTGACTGGGAAAACCCTGGCGTTACCCAACTTAATCGCCTTGCAGCACATCCCCCTTTCGCCAGCTGGCGTAATAGCGAAGAGGCCCGCACCGATCGCCCTTCCCAACAGTTGCGCAGCCTGAATGGCGAATGGCGCCTGATGCGGTATTTTCTCCTTACGCATCTGTGCGGTATTTCACACCGCATATGGTGCACTCTCAGTACAATCTGCTCTGATGCCGCATAG" > primers.txt pUC19_amplified.gb
primers.txt would be a tab separated value file which looks like (similar to what Snapgene provides):
Primer 1 ATGACCATGATTACGCCAAG
Primer 2 CTATGCGGCATCAGAGCA
While pUC19_amplified.gb would simply be a genbank file with only the primers + amplified sequence.
There are a few different flags that would be useful for poly amplify:
--primer_for "pUC19_for" --primer_rev "pUC19_rev" - names the output primers in primers.txt
--amplicon "ATGC" - amplifies the given subsequence
--range 0,0 - amplifies a particular range of the sequence
--no_amplify pGLO.gb - prevents primers from amplifying a naughty sequence from a different file
--validate 10:20 --size 100:150 --coverage 100 --overlap 10:40 - amplifies a particular range of sequence with sizes within the limits of the --size flag, the coverage limits of the --coverage flag, and the overlap limits of the --overlap flag
These flags generally fulfill all the needs of a biologist. The first four cover the use cases of your average cloner - they simply want to clone out an amplicon and don't really care about more advanced features. The fifth, --validate,
covers the use case of people who build primers to validate things, like a colony PCR or a validation PCR for clinical samples. I think these two use cases cover ~90% of the different kinds of uses of poly amplify.
poly amplify pUC19.gb --primer_for "pUC19_for" --primer_rev "pUC19_rev" --amplicon "ATGACCATGATTACGCCAAGCTTGCATGCCTGCAGGTCGACTCTAGAGGATCCCCGGGTACCGAGCTCGAATTCACTGGCCGTCGTTTTACAACGTCGTGACTGGGAAAACCCTGGCGTTACCCAACTTAATCGCCTTGCAGCACATCCCCCTTTCGCCAGCTGGCGTAATAGCGAAGAGGCCCGCACCGATCGCCCTTCCCAACAGTTGCGCAGCCTGAATGGCGAATGGCGCCTGATGCGGTATTTTCTCCTTACGCATCTGTGCGGTATTTCACACCGCATATGGTGCACTCTCAGTACAATCTGCTCTGATGCCGCATAG" > primers.txt amplified.gb
poly amplify pUC19.gb --primer_for "pUC19_for" --primer_rev "pUC19_rev" --range 146:469 > primers.txt amplified.gb
poly amplify pUC19.gb --range 146:469 --no_amplify pGLO.gb > primers.txt amplified.gb
poly amplify SARs-CoV-2.gb --validate 30:29886 --size 375:425 --coverage 100 --overlap 30:50 > primers.txt many_amplified.gb
(Note: this is pretty much the exact use case of https://artic.network/ncov-2019, which is why this kind of thing is important. It also doesn't fit in well to other parts of the program)
Another example:
poly amplify pUC19.gb --validate 146:469 --size 0:500 --coverage 100 --overlap 0:0 > primers.txt amplified.gb
In this example, we really just want to amplify a fragment that we know this sequence is inside of so we can do some sanger sequencing or the like on it.
@jecalles thoughts on different use cases I might be forgetting?
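For the simplest case above, the forward primer is just the 5' end of the amplicon and the reverse primer is the reverse complement of its 3' end. A minimal sketch (primer lengths here are chosen to match the example table; real primer design would pick lengths by melting temperature):

```go
package main

import "fmt"

// revComp returns the reverse complement of a DNA sequence
// (uppercase ACGT only, for brevity).
func revComp(seq string) string {
	comp := map[byte]byte{'A': 'T', 'T': 'A', 'G': 'C', 'C': 'G'}
	out := make([]byte, len(seq))
	for i := 0; i < len(seq); i++ {
		out[len(seq)-1-i] = comp[seq[i]]
	}
	return string(out)
}

func main() {
	// First 20 bases of the pUC19 amplicon above: the forward primer.
	fmt.Println("ATGACCATGATTACGCCAAG")
	// Last 18 bases of the amplicon, reverse complemented: the reverse primer.
	tail := "TGCTCTGATGCCGCATAG"
	fmt.Println(revComp(tail)) // CTATGCGGCATCAGAGCA, matching Primer 2
}
```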
Is your feature request related to a problem? Please describe.
A parser for Uniprot XML data dumps (https://www.uniprot.org/downloads)
Describe the solution you'd like
A way to digest all of the Uniprot data dumps.
Describe alternatives you've considered
We built the concurrent Fasta parser to handle the Uniprot FASTA dumps, but a lot of information is lost there. It would be greatly preferred to have the XML parsed, so we can get full documentation of every protein.
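Because the Uniprot XML dumps are tens of gigabytes, the parser should decode entry-by-entry with a streaming decoder rather than unmarshalling the whole document. A sketch using encoding/xml (the entry struct is a minimal stand-in; the real Uniprot schema has far more fields, and the accession shown is made up):

```go
package main

import (
	"encoding/xml"
	"fmt"
	"strings"
)

// entry is a minimal stand-in for a Uniprot <entry> element.
type entry struct {
	Accession string `xml:"accession"`
	Sequence  string `xml:"sequence"`
}

// parseEntries walks the token stream and decodes each <entry> element as
// it is encountered, so memory use stays bounded on huge dumps.
func parseEntries(xmlData string) []entry {
	dec := xml.NewDecoder(strings.NewReader(xmlData))
	var entries []entry
	for {
		tok, err := dec.Token()
		if err != nil {
			break // io.EOF ends the stream
		}
		if se, ok := tok.(xml.StartElement); ok && se.Name.Local == "entry" {
			var e entry
			if dec.DecodeElement(&e, &se) == nil {
				entries = append(entries, e)
			}
		}
	}
	return entries
}

func main() {
	// A tiny hand-written fragment in the shape of a Uniprot dump.
	data := `<uniprot><entry><accession>P0ABC1</accession><sequence>MKV</sequence></entry></uniprot>`
	for _, e := range parseEntries(data) {
		fmt.Println(e.Accession, e.Sequence)
	}
}
```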
Is your feature request related to a problem? Please describe.
Currently poly can read genbank (.gb, .gbk) files, but it can't write them out.
Describe the solution you'd like.
What I'd like is a gbk string builder similar in interface to the GffBuilder function in io.go.
It'd be really cool to have a pull request comment template that prompts people to write about what they're integrating.
If someone could look up how to do this with the .github directory and add one that I could edit and merge it would make a great first contribution!
Describe the bug
Genbank files exported by Snapgene produce no sequence output when converted to JSON.
To Reproduce
puc19.gbk
puc19_snapgene.gb
puc19.json
puc19_snapgene.json
$ cat puc19.gbk | poly c -i gb -o json > puc19.json
$ cat puc19_snapgene.gb | poly c -i gb -o json > puc19_snapgene.json
$ cat puc19.json | jq ".Sequence.Sequence"
"gagatacctacagcgtgagctatgagaaagcgccacgcttcccgaagggagaaaggcggacaggtatccggtaagcggcagggtcggaacaggagagcgcacgagggagcttccagggggaaacgcctggtatctttatagtcctgtcgggtttcgccacctctgacttgagcgtcgatttttgtgatgctcgtcaggggggcggagcctatggaaaaacgccagcaacgcggcctttttacggttcctggccttttgctggccttttgctcacatgttctttcctgcgttatcccctgattctgtggataaccgtattaccgcctttgagtgagctgataccgctcgccgcagccgaacgaccgagcgcagcgagtcagtgagcgaggaagcggaagagcgcccaatacgcaaaccgcctctccccgcgcgttggccgattcattaatgcagctggcacgacaggtttcccgactggaaagcgggcagtgagcgcaacgcaattaatgtgagttagctcactcattaggcaccccaggctttacactttatgcttccggctcgtatgttgtgtggaattgtgagcggataacaatttcacacaggaaacagctatgaccatgattacgccaagcttgcatgcctgcaggtcgactctagaggatccccgggtaccgagctcgaattcactggccgtcgttttacaacgtcgtgactgggaaaaccctggcgttacccaacttaatcgccttgcagcacatccccctttcgccagctggcgtaatagcgaagaggcccgcaccgatcgcccttcccaacagttgcgcagcctgaatggcgaatggcgcctgatgcggtattttctccttacgcatctgtgcggtatttcacaccgcatatggtgcactctcagtacaatctgctctgatgccgcatagttaagccagccccgacacccgccaacacccgctgacgcgccctgacgggcttgtctgctcccggcatccgcttacagacaagctgtgaccgtctccgggagctgcatgtgtcagaggttttcaccgtcatcaccgaaacgcgcgagacgaaagggcctcgtgatacgcctatttttataggttaatgtcatgataataatggtttcttagacgtcaggtggcacttttcggggaaatgtgcgcggaacccctatttgtttatttttctaaatacattcaaatatgtatccgctcatgagacaataaccctgataaatgcttcaataatattgaaaaaggaagagtatgagtattcaacatttccgtgtcgcccttattcccttttttgcggcattttgccttcctgtttttgctcacccagaaacgctggtgaaagtaaaagatgctgaagatcagttgggtgcacgagtgggttacatcgaactggatctcaacagcggtaagatccttgagagttttcgccccgaagaacgttttccaatgatgagcacttttaaagttctgctatgtggcgcggtattatcccgtattgacgccgggcaagagcaactcggtcgccgcatacactattctcagaatgacttggttgagtactcaccagtcacagaaaagcatcttacggatggcatgacagtaagagaattatgcagtgctgccataaccatgagtgataacactgcggccaacttacttctgacaacgatcggaggaccgaaggagctaaccgcttttttgcacaacatgggggatcatgtaactcgccttgatcgttgggaaccggagctgaatgaagccataccaaacgacgagcgtgacaccacgatgcctgtagcaatggcaacaacgttgcgcaaactattaactggcgaactacttactctagcttcccggcaacaattaatagactggatggaggcggataaagttgcaggaccacttctgcgctcggcccttccggctggctggtttattgctgataaatctggagccggtgagcgtgg
gtctcgcggtatcattgcagcactggggccagatggtaagccctcccgtatcgtagttatctacacgacggggagtcaggcaactatggatgaacgaaatagacagatcgctgagataggtgcctcactgattaagcattggtaactgtcagaccaagtttactcatatatactttagattgatttaaaacttcatttttaatttaaaaggatctaggtgaagatcctttttgataatctcatgaccaaaatcccttaacgtgagttttcgttccactgagcgtcagaccccgtagaaaagatcaaaggatcttcttgagatcctttttttctgcgcgtaatctgctgcttgcaaacaaaaaaaccaccgctaccagcggtggtttgtttgccggatcaagagctaccaactctttttccgaaggtaactggcttcagcagagcgcagataccaaatactgttcttctagtgtagccgtagttaggccaccacttcaagaactctgtagcaccgcctacatacctcgctctgctaatcctgttaccagtggctgctgccagtggcgataagtcgtgtcttaccgggttggactcaagacgatagttaccggataaggcgcagcggtcgggctgaacggggggttcgtgcacacagcccagcttggagcgaacgacctacaccgaact"
$ cat puc19_snapgene.json | jq ".Sequence.Sequence"
""
Expected behavior
puc19_snapgene.gb sequence output should be the same as puc19.gbk.
Describe the bug
Seqhash is producing different outputs depending on whether the input sequence is lowercase or uppercase. This exact problem was mentioned here, originally handled in #6, and the fix was removed in this commit.
Seqhash should return the same value whether a base pair is uppercase or lowercase, because there is no sequence-level difference between the two.
To Reproduce
puc19_upper.gbk
puc19.gbk
$ cat puc19.gbk | poly ha -i gb
4031e1971acc8ff1bf0aa4ed623bc58beefc15e043075866a0854d592d80b28b
$ cat puc19_upper.gbk | .././poly ha -i gb
93c69f4f40c5803aab046d15264de8ab3574d8debf428cf8cfd7d29a8437319a
Expected behavior
Both puc19 files have identical sequences, and should have identical hash outputs.
Additional context
Reference implementation of seqhash is a python library, with the code being hosted here.
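The fix comes down to normalizing case before hashing. A minimal sketch (sha256 stands in here for the blake3 hasher poly actually uses):

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strings"
)

// hashSeq normalizes case before hashing, so "atgc" and "ATGC" produce
// identical digests. Illustrative of the fix, not poly's seqhash code.
func hashSeq(seq string) string {
	sum := sha256.Sum256([]byte(strings.ToUpper(seq)))
	return hex.EncodeToString(sum[:])
}

func main() {
	fmt.Println(hashSeq("atgc") == hashSeq("ATGC")) // true
}
```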
I had a great call the other month with the developer of easy_dna, and he suggested that even though it is far from his most popular library, it is his most useful, and that poly should have similar functionality. Below is a list of functions that I will accept PRs for if ported. See the above link for documentation and links to original source code. Please make sure they are well tested!
all_iupac_variants
anonymized_record
copy_and_paste_segment
cut_and_paste_segment
dna_pattern_to_regexpr
list_common_enzymes
random_dna_sequence
random_protein_sequence
replace_segment
reverse_segment
swap_segments
Thanks,
Tim
P.S. Some clarifying info that Davian asked for.
Most functions will likely fit into the scope of either sequence.go or transformations.go and their associated test files. If you feel like a function doesn't fit within the scope of either, you can make a utils.go and utils_test.go in the project's main directory to include with your pull request.
Most functions can be written as standalone string functions and then be wrapped by a method to use with poly's main sequence struct. That's as complex as integration with the main library will have to be in most cases.
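As an example of the "standalone string function" pattern, here is a sketch of what a port of easy_dna's reverse_segment might look like (the Go signature and name are guesses at what a port would use; a full port should also decide whether features need reverse-complementing):

```go
package main

import "fmt"

// ReverseSegment reverses the bases in seq[start:end) and leaves the rest
// of the sequence untouched. Sketch of a possible easy_dna port.
func ReverseSegment(seq string, start, end int) string {
	b := []byte(seq)
	for i, j := start, end-1; i < j; i, j = i+1, j-1 {
		b[i], b[j] = b[j], b[i]
	}
	return string(b)
}

func main() {
	// Reverses "GCAT" (positions 2..5) in place within the sequence.
	fmt.Println(ReverseSegment("ATGCATGC", 2, 6)) // ATTACGGC
}
```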
Is your feature request related to a problem? Please describe.
poly version should output the version of poly that is currently running.
Describe the solution you'd like
$ poly version
0.0.1
It's not necessary at the moment, but if we ever need to create a loading dialog it'd be funny to have some ASCII art to accompany it. Inspired by this and this, I give you this:
___ __ ___ ___ ___ __ __
|__ |\ | / _` |__ |\ | |__ |__ |__) | |\ | / _`
|___ | \| \__> |___ | \| |___ |___ | \ | | \| \__>
I was reviewing the JSON at https://github.com/TimothyStiles/poly/blob/prime/data/puc19static.json and I found an inconsistency between "Meta" and "Locus". In Meta there is a GenbankDivision field, and in Locus there is a GenBankDivision field. Poly should be consistent.
Heng Li (inventor of SAM, minimap, etc., basically an absolute madlad who has seriously contributed to the bio software space) had this in a blog post:
http://lh3.github.io/2020/05/17/fast-high-level-programming-languages
Looks like Go is about 2x slower than most of the other high-level languages. It would be great to have a performance test suite of poly vs other REAL bio software (not hypothetical), maybe something that can be run automatically?
Is your feature request related to a problem? Please describe.
In addition to our .GetSequence() method, there should also be insert, delete, and replace sequence methods.
Describe the solution you'd like
Insert, delete, and replace methods should work for Sequence and Feature structs. In the case where a change causes a shift in sequence coordinates, a Sequence struct's Features should be updated to reflect that.
Additional context
These methods are incredibly important. They will come into play with cloning, crispr, synthesis, and codon optimization functions plus a bunch of other stuff. They should be a priority.
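A minimal sketch of the coordinate-shifting behavior described above (struct and field names are stand-ins for poly's real types; a full implementation would also handle features that span the insertion point):

```go
package main

import "fmt"

// Feature and Sequence are minimal stand-ins for poly's structs.
type Feature struct {
	Name       string
	Start, End int
}

type Sequence struct {
	Sequence string
	Features []Feature
}

// Insert splices insert into s at position pos and shifts every feature
// that starts at or after pos so its coordinates stay valid.
func (s *Sequence) Insert(pos int, insert string) {
	s.Sequence = s.Sequence[:pos] + insert + s.Sequence[pos:]
	for i := range s.Features {
		if s.Features[i].Start >= pos {
			s.Features[i].Start += len(insert)
			s.Features[i].End += len(insert)
		}
	}
}

func main() {
	s := Sequence{
		Sequence: "ATGCATGC",
		Features: []Feature{{Name: "gene", Start: 4, End: 8}},
	}
	s.Insert(2, "GGG")
	fmt.Println(s.Sequence)          // ATGGGGCATGC
	fmt.Println(s.Features[0].Start) // 7
}
```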
Poly store is a method to store and search DNA sequences. Unlike many other methods for storing and indexing DNA sequences, Poly store is meant to be able to store everything - all DNA humanity has sequenced. The first implementation of Poly store was experimented with in #15
This is a difficult problem, but it generally breaks down into a couple of necessary features:
In order to implement Poly store, I think we should use a suffix-array data structure. This argument was made in #15, and I think it still holds. The basic idea is we would have sequence, suffix array, lcp (longest common prefix), and seqhash position arrays. sequence would be a massive DNA sequence comprising all of the appended sequences together. suffix array is the suffix array of sequence, and lcp is the lcp of that suffix array. seqhash position is a list of sequence positions that compose a seqhash, which references an object in a known database. Seqhashes are the universal identifiers we actually want in the end.
There will be a few necessary operations:
On a most basic level, searching sequence is done by doing a binary search on a suffix array. This operation is rather simple, and the data structure doesn't matter too much, so long as we can easily get the length of the suffix array and ensure that the suffix array is correctly sorted.
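Go's standard library already provides exactly this operation - a suffix array with binary-search lookup - in index/suffixarray. A small example of the basic search described above:

```go
package main

import (
	"fmt"
	"index/suffixarray"
)

// findAll builds a suffix array over seq and returns every offset where
// pattern occurs, using the stdlib's binary search on the suffix array.
func findAll(seq, pattern string) []int {
	idx := suffixarray.New([]byte(seq))
	// n = -1 means return all occurrences (unsorted).
	return idx.Lookup([]byte(pattern), -1)
}

func main() {
	offsets := findAll("ATGCGATTACAGGGC", "GATTACA")
	fmt.Println(offsets) // [4]
}
```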
For deduplication of new DNA getting inserted, it is important to be able to find the longest common substrings. So long as you can reversibly insert the new DNA sequence into the suffix array, this operation comes down to checking the local LCPs around the inserted DNA sequence, similar to the optimal deduplicator described in #15 under the initial post. As an added bonus, this step will also deduplicate the sequence against itself for repeated regions.
This can also be used to build relationship maps between species.
While the above algorithm for finding longest common substrings works well, it is insert-order dependent, where certain insert orders will be more efficient than others. This is concerning, and would require a smart insertion order. This, however, is quite solvable with some clever use of seqhash positions. There needs to be a way to calculate the most inefficient seqhash positions to occasionally rearrange the sequence. This is a little unclear, so here is an example:
Say the reference human genome has a section of sequence that looks like ... ATG GGG TAA ..., however, it turns out most human beings have the sequence ... ATG GGA TAA .... If the reference human genome was inserted first, it will always split up the subsequent human sequences into two different positions. When you factor in the increased referencing costs, it makes sense to store the ... ATG GGA TAA ... sequence and instead split the reference human genome.
To do an alignment, you need a small sequence to get started with the approximately right positions. This sequence I call the "sequence seed", and is well illustrated in the Centrifuge paper [0]. After you get the proper locations of the seed, you walk down the sequence in both directions. If any seqhash position gets split while walking, you must branch to test the different possibilities.
This sequence walking can be enhanced by searching the longest common substrings (which reduces or eliminates the need for backtracking), so long as the lcp array values are easy to search on.
With these features and operations in mind, there are a few focus points that the end product will have to be good at:
Efficient transactions are necessary for implementing the continuous updating feature from the SAIS algorithm [1], and reversible transactions are necessary for searching longest common substrings.
Efficient indexing and relationship referencing are necessary for quickly moving through the relationships between the seqhash positions, suffix array, and sequence lists. This is especially necessary for sequence walking, which needs to coordinate between the sequence and seqhash positions lists often.
SQLite caps out at ~100K inserts per second in my testing - without very aggressive deduplication, it would take far too long to insert all DNA materials. This can be a problem, but it is unknown how big of a problem it will be with deduplication.
===
@TimothyStiles Thoughts? I'm thinking the biggest change here from last time is the longest common substrings problem, which turns out to be essential for a lot of interesting capacities, such as effective deduplication, phylogenetic / relationship tree building, and in-depth alignment.
[0] https://dx.doi.org/10.1101%2Fgr.210641.116
[1] https://zork.net/~st/jottings/sais.html
Describe the bug
In io.go at lines 80-81, there are Circular and Linear fields, both booleans. Couldn't you just have one? (Since if Circular is false, Linear is true, and vice versa, by definition.)
Is this here simply to accommodate the ambiguous case?
To Reproduce
Steps to reproduce the behavior:
go to io.go
Expected behavior
A single field
Is your feature request related to a problem? Please describe.
I would like Poly to have a function to Fragment DNA samples that you wish to synthesize. Many DNAs that one would wish to synthesize are above the fragment threshold for synthesis providers (~1500bp for Twist), so this would be useful to them. In addition, you can synthesize more complex DNA fragments through clever splitting of Fragments (ie, split a homopolymer to get >8bp homopolymers in a synthesized fragment).
I would like this function to be optimal using the data set in this paper - https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0238592 . You can imagine using a constraint solver to get the optimal overhangs for a given sequence, no matter the size. This is fundamentally useful for users who want to synthesize large fragments, like full biosynthetic pathways. It is also useful for users who want to efficiently create complex DNA from restrictive synthesis companies.
Describe the solution you'd like
I want a function that I can pass many sequences which then optimally fragments them for synthesis.
Additional context
I'd like to be able to do "linking", i.e., synthesizing many different genes on a single fragment, which then get divided using different restriction enzymes (e.g., BbsI + BtgZI).
Is your feature request related to a problem? Please describe.
Currently there are two issues with how hash.go is implemented.
The blake3 package does not implement the standard hash.Hash interface. Currently we just use a special method for hashing with blake3, but this is bad and should be fixed ASAP.
GenericSequenceHash only hashes AnnotatedSequence objects; it should also hash Feature and Sequence objects.
Describe the solution you'd like
Write a wrapper around the blake3 package we use that implements the standard hash.Hash interface like all the other hash functions seen here.
Replace GenericSequenceHash with a new function called HashSequence that takes an io.Writer and a hash function and returns a hashed string. AnnotatedSequence, Feature, and Sequence objects should also have a .Hash() method. .GetSequence() will come in handy here.
Describe alternatives you've considered
We could continue building features around this, but it's best to refactor and build off of much cleaner code.
Describe the bug
Gff parser chops off last attribute in attribute list.
To Reproduce
Use the gff parser.
Expected behavior
Not this.
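A common cause of this class of bug is a split loop that only consumes attributes up to the last ";" and drops the final key=value pair when there is no trailing semicolon. A sketch of attribute parsing that keeps the last field (illustrative code, not poly's actual parser):

```go
package main

import (
	"fmt"
	"strings"
)

// parseAttributes splits a GFF attributes column into a map, keeping the
// final key=value pair whether or not a trailing ";" is present.
func parseAttributes(field string) map[string]string {
	attrs := map[string]string{}
	for _, kv := range strings.Split(strings.TrimSuffix(field, ";"), ";") {
		parts := strings.SplitN(strings.TrimSpace(kv), "=", 2)
		if len(parts) == 2 {
			attrs[parts[0]] = parts[1]
		}
	}
	return attrs
}

func main() {
	attrs := parseAttributes("ID=gene1;Name=dnaA;locus_tag=BSU_00010")
	fmt.Println(attrs["locus_tag"]) // BSU_00010 - the last attribute survives
}
```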
An extremely useful function for poly to have is efficient sequence search. This can be used for lots of different operations. Historically, BLAST has been the most popular technology, but BLAST has a few major disadvantages. BLAST is a relatively old technology, and its time complexity is at best O(n), and at worst O(n^2), where n is the length of your sequence.
When it comes to efficient algorithms for searching sequence, there are 3 primary algorithms to be interested in - suffix trees, suffix arrays, and FM indexes. All three can reduce sequence search down to log time complexity. Suffix trees take a large quantity of memory, but are extremely fast for certain domains of problems (like the longest common substring problem). Suffix arrays are much simpler and smaller, but can't do as many interesting things as suffix trees can. However, they do have the unique advantage in Go of being implemented in the standard library.
FM indexes are relatively more complex than both suffix arrays and suffix trees (if you want to learn more, watch this video on suffix tries/trees, burrows-wheeler transform, and finally the FM index). While they are more computationally intensive than suffix trees or arrays and are much more complicated than suffix trees or arrays, they take up far less space.
Given popular tools like Bowtie2 and https://ccb.jhu.edu/software/centrifuge/ use the FM index, this is probably what poly should do, right?
In 2016, Centrifuge was made for the purpose of very fast sequence search. Basically, they took similar genomes and did deduplication of common genetic elements (based on species and 53 k-mers), then created an FM index of those sequences. The deduplication seems to yield a sequence that is about ~15% the size of the original entry. In 2016, the NCBI sequence databases, when duplicates are removed, is about 109 billion base pairs.
Stored as a suffix array, this would take 147 gigabytes of memory with the Golang suffix array implementation (each base stored as an int8, so index + storage takes 9 bytes per base).
Stored as a FM index, this took 69 gigabytes of memory.
Centrifuge's naive deduplication algorithm was able to achieve about 6.6x reduction in memory usage overall, while FM indexes were only able to get 2.1x reduction from suffix arrays alone.
In essence, it seems like what is saving a real quantity of memory isn't necessarily the FM index and all the complexity it brings along, but the simple act of deduplicating data. Thus, to do effective sequence search, I think Poly should use the basic Golang suffix arrays and invest energy into figuring out deduplication of genetic data. If Poly is meant for forward engineering, we can expect that DNA parts will be reused quite often, which is an opportunity for creating a very efficient deduplication scheme. Centrifuge's deduplication scheme doesn't have great support for circular sequences, which is something Poly should have.
Deduplication is also a very nice feature for commercial operations that need to save lots of sequence data, such as data coming through sequencers. I'm not sure how to implement this yet, but we should think about it. Please post in this thread if you have any ideas.
Is your feature request related to a problem? Please describe.
The most likely use of hash outputs is going to be linking hash IDs to the LOCUS of a Genbank file, since that is typically the unique identifier. There should be a non-default option for the hash function to output the LOCUS of the Genbank file it has parsed.
Describe the solution you'd like
I would like the poly hash function to have 1 new optional option:
Describe alternatives you've considered
I've considered just getting this information from the json file, but that seemed overly complicated for systems that already have nice genbank output.
Additional context
Add any other context or screenshots about the feature request here.
Describe the bug
JSON object representations of Genbank features don't have proper Start and End values. The Location field, as far as I know, is slightly different. Start and End should either have simple integer values, or be redundant with Location.
To Reproduce
Steps to reproduce the behavior:
cat bsub.gbk | poly c -i gbk -o json > bsub.json
Expected behavior
The Start and End fields in the JSON output should match the Genbank outputs.
Is your feature request related to a problem? Please describe.
I get the feeling that the contributor's guide is getting kind of stale.
Describe the solution you'd like
If people could reply here to suggest improvement or talk about roadblocks they've faced with it onboarding/contributing I'd love to hear them!
I think I'll update requirements to include that code must also use godocs after we fix #53.
It's weird to ask Linux users to use Homebrew to install when it's not that popular on Linux systems. It'd be better if we offered support for RPM, Yum, and Apt.
We should include in the documentation instructions like rpm -i <url-to-latest-github-release.rpm> or yum install <url-to-latest-github-release.rpm>.
To do this we'd likely need to automate release updates to the documentation via a GitHub action. We should also test that these installs work before pushing updates.
Is your feature request related to a problem? Please describe.
Integration and testing of the Rhea parser package into Poly
https://github.com/Koeng101/rhea
Describe the solution you'd like
Integration and testing of the Rhea parser package into Poly
Describe alternatives you've considered
@TimothyStiles writing a Rhea parser from scratch
Additional context
The RDF XML Rhea database drop can be downloaded here - https://www.rhea-db.org/help/download
Describe the bug
When parsing out descriptions, newlines aren't being replaced with a space. Newlines should be replaced with a space, not with nothing.
Genbank:
CDS 410..1750
/gene="dnaA"
/locus_tag="BSU_00010"
/old_locus_tag="BSU00010"
/function="16.9: Replicate"
/experiment="publication(s) with functional evidences,
PMID:2167836, 2846289, 12682299, 16120674, 1779750,
28166228"
/note="Evidence 1a: Function from experimental evidences
in the studied strain; PubMedId: 2167836, 2846289,
12682299, 16120674, 1779750, 28166228; Product type f :
factor"
/codon_start=1
/transl_table=11
/product="chromosomal replication initiator informational
ATPase"
/protein_id="NP_387882.1"
JSON:
"product": "chromosomal replication initiator informationalATPase",
To Reproduce
cat bsub.gbk | poly c -i gbk -o json > bsub.json
Check bsub.gbk and bsub.json with your favorite text editor.
Expected behavior
I expect there to be a space
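The fix amounts to joining wrapped qualifier lines with a single space instead of concatenating them directly. A minimal sketch (the helper is illustrative, not poly's actual parser code):

```go
package main

import (
	"fmt"
	"strings"
)

// joinContinuation joins Genbank qualifier continuation lines with a single
// space, trimming the leading indentation each wrapped line carries.
func joinContinuation(lines []string) string {
	for i := range lines {
		lines[i] = strings.TrimSpace(lines[i])
	}
	return strings.Join(lines, " ")
}

func main() {
	wrapped := []string{
		"chromosomal replication initiator informational",
		"          ATPase",
	}
	fmt.Println(joinContinuation(wrapped))
	// chromosomal replication initiator informational ATPase
}
```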
Is your feature request related to a problem? Please describe.
It is annoying having to digest a .gbk file every time I want to do an optimization. I would like to be able to import and export codon tables using JSON format.
Describe the solution you'd like
JSON import and output of codon tables.
I recently discovered a bug, which I quick-patched in commit 346e3eb, where the sequence hash was summing incorrectly and returning comically long hashes far beyond the 40 character limit of sha1.
I believe I've patched the problem, but to prevent regression there should be at least a test to prove that general hashing works properly. It's hard to check against system-standard hashers because they do not hash the sequence itself but instead the whole file.
What could alleviate this is a least-rotation command that utilizes the same least-rotation function the circular hashing function uses, but instead of returning a hash just returns the rotated string. I'll be able to implement this at some point, but it's a great first contribution if anyone wants to jump in.
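For anyone picking this up, a naive sketch of least rotation (this O(n^2) version just shows the behavior; poly's circular hashing can use a faster algorithm such as Booth's with the same output):

```go
package main

import "fmt"

// leastRotation returns the lexicographically smallest rotation of s, the
// canonical form that all rotations of a circular sequence share.
func leastRotation(s string) string {
	doubled := s + s // every rotation is a length-len(s) window of s+s
	best := s
	for i := 1; i < len(s); i++ {
		if rot := doubled[i : i+len(s)]; rot < best {
			best = rot
		}
	}
	return best
}

func main() {
	// All rotations of a circular sequence map to the same canonical form,
	// which is why circular seqhashes rotate before hashing.
	fmt.Println(leastRotation("TTAGC")) // AGCTT
	fmt.Println(leastRotation("AGCTT")) // AGCTT
}
```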
Is your feature request related to a problem? Please describe.
People keep asking for SBOL3 IO so I thought there should be an issue for it.
Describe the solution you'd like
The SBOL3 parser should implement four functions: ReadSBOL, ParseSBOL, BuildSBOL, and WriteSBOL.
ReadSBOL pretty much just does file IO and passes a file string to ParseSBOL.
ParseSBOL picks apart the SBOL file string and appropriately fills an AnnotatedSequence instance with it.
BuildSBOL takes an AnnotatedSequence object and uses it to build a string that can be passed to WriteSBOL.
WriteSBOL writes an SBOL string to file.
Each function should be tested. The easiest way is to read and parse an SBOL into a Sequence, then build a string with it, write it out, and compare the input with the output. If they're the same, you've done it!
Is your feature request related to a problem? Please describe.
Need for a python interface to use Poly with Python 3+. This will enable us to easily integrate parsers, etc. into other software pipelines.
Describe the solution you'd like
A step by step guide on how to integrate poly into Python. This will allow inexperienced software developers to leverage poly's libraries and capabilities easily. From the structure of the codebase, it's hard to figure out what all features are available within the codebase. This should also include a set of easy to follow tutorials that can be used to instantiate the classes and use the API programmatically rather than running subprocesses.
Describe alternatives you've considered
I was looking up online how to do this, and it seems like the general path is: build a shared object -> write a CPython wrapper -> use it. From my experience, writing these wrappers is extremely annoying. I've used something like SWIG in the past to generate wrappers for C++ projects with Python, and I wonder if something similar exists for generating these wrappers. Alternatively, the wrappers need to be created as part of the project's development roadmap to provide APIs.
https://dev.to/astagi/extending-python-with-go-1deb
Additional context
I myself am new to the go ecosystem so I might be wrong about this. Would love to hear some feedback.
The contributing guide needs more detail on dev setup and Code Climate usage.
Large scale DNA synthesis requires optimization of sequences, removing sequences that complicate the synthesis reaction. There have been many iterations at different companies, but generally it follows the formula of:
DNA Chisel is a good example of a project that does this (other ideas may come from their main site)
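As a concrete example of the kind of check such an optimizer runs, here's a minimal sketch that flags two common synthesis complications, homopolymer runs and extreme GC content; the thresholds (runs of 6+, GC outside 25-65%) are illustrative, not taken from any vendor spec:

```go
package main

import (
	"fmt"
	"strings"
)

// synthesisProblems flags sequences likely to complicate the
// synthesis reaction. Thresholds are illustrative; real vendors
// publish their own limits.
func synthesisProblems(seq string) []string {
	seq = strings.ToUpper(seq)
	var problems []string

	// Homopolymer check: 6 or more identical bases in a row.
	run := 1
	for i := 1; i < len(seq); i++ {
		if seq[i] == seq[i-1] {
			run++
			if run == 6 {
				problems = append(problems, fmt.Sprintf("homopolymer run at %d", i-5))
			}
		} else {
			run = 1
		}
	}

	// Global GC content check.
	gc := strings.Count(seq, "G") + strings.Count(seq, "C")
	frac := float64(gc) / float64(len(seq))
	if frac < 0.25 || frac > 0.65 {
		problems = append(problems, fmt.Sprintf("GC content %.2f out of range", frac))
	}
	return problems
}

func main() {
	fmt.Println(synthesisProblems("ATGAAAAAAGC")) // [homopolymer run at 3]
}
```

A full optimizer would then rewrite flagged windows using synonymous codons rather than just reporting them.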
Is your feature request related to a problem? Please describe.
It would be great to have a full suite of enzymes available for `clone.go`. The easiest way to do this would be to make a parser for one of the data dumps from NEB REBASE (http://rebase.neb.com/rebase/rebase.serv.html). The easiest format will probably be format #31 (http://rebase.neb.com/rebase/rebase.f31.html).
Describe the solution you'd like
`getBaseRestrictionEnzyme` should have all restriction enzymes available with it, built into poly.
Describe alternatives you've considered
N/A
Additional context
REBASE is awesome and can provide customized data dumps. Though that probably shouldn't be necessary, it is something to think about.
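A parser sketch, assuming each data line of the dump is an enzyme name followed by whitespace and a recognition site (roughly the bionet layout); the actual format #31 should be checked against a real dump before relying on this:

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// parseRebase maps enzyme names to recognition sites, assuming
// "Name (aliases)  Site" data lines. Header, blank, and
// malformed lines are skipped.
func parseRebase(data string) map[string]string {
	enzymes := make(map[string]string)
	scanner := bufio.NewScanner(strings.NewReader(data))
	for scanner.Scan() {
		fields := strings.Fields(scanner.Text())
		if len(fields) < 2 {
			continue // header, blank, or malformed line
		}
		name, site := fields[0], fields[len(fields)-1]
		// Recognition sites use IUPAC letters plus a ^ cut marker;
		// reject anything else (e.g. trailing header words).
		if strings.IndexFunc(site, func(r rune) bool {
			return !strings.ContainsRune("ABCDGHKMNRSTVWY^", r)
		}) != -1 {
			continue
		}
		enzymes[name] = site
	}
	return enzymes
}

func main() {
	dump := "AanI (PsiI)  TTA^TAA\nBsaI (4/8)  GGTCTC\n"
	fmt.Println(parseRebase(dump))
}
```

The resulting map could then seed whatever lookup `getBaseRestrictionEnzyme` ends up using.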
Is your feature request related to a problem? Please describe.
`io.go` was the first file I wrote for poly. It's a monster at ~1400 lines. It defines the `AnnotatedSequence` struct, which holds all the information that gets parsed by the gbk, gff, json, etc. parsers that also live in that file. It's pretty good stuff but could use a little revamping in certain places.
Current issues/needs are:
- Finding redundancies in the `AnnotatedSequence` struct and child structs to update/consolidate. Update any parser that uses/used what we're going to change. Evaluate merging the `Sequence` struct into the `AnnotatedSequence` struct and renaming `AnnotatedSequence` to `Sequence`.
- Fasta parsing currently just creates an `[]AnnotatedSequence` slice where each annotated sequence holds the comment and a sequence. To be consistent with other parsers and methods we should find a way to use just a single `AnnotatedSequence` to represent a parsed fasta file.
- `GffParser` should be edited to use fasta parsing for the raw sequence, since some Gff3 files include fasta at the bottom for some reason.
- I didn't really know about `io.Writer` when I first wrote `io.go`. Most of the parsing/reading/building/writing should probably be refactored to use `io.Writer` if possible.
- There's a lot of good leftover functions and data that I used to build `GbkParser`. A lot of it is dead code now and should be archived in a new branch.
- SBOL2 IO should be written. I'll make a separate issue for that.
- One of the tests for `GffParser` fails on windows and windows only. More on that in issue #46.

Not sure if this is everything that needs to be done but I'll edit and add more if they come up.
Describe the solution you'd like
I've created a new branch called refactor-io. Please make all related pull requests there.
Happy hacking!
Tim
Describe the bug
`TestTranslationString` oftentimes (not always) fails during the GitHub CI test suite. This only happens on Mac OSX. I have no idea why.
To Reproduce
Steps to reproduce the behavior:
Expected behavior
Mac OSX failures should probably match ubuntu failures, and if it's going to fail, it should fail every time.
Additional context
I see that the function is using some stdin stuff - maybe mac doesn't handle that right? idk
Is your feature request related to a problem? Please describe.
Downloading bulk Genbank from https://ftp.ncbi.nlm.nih.gov/genbank/ gives a little bit of a unique file format. There are many sequences within each file, and they start with a header. I would like to be able to directly pipe those into Poly for manipulation. Specifically, I am looking at building codontable.com, and this kind of data parsing will be necessary.
Describe the solution you'd like
Describe alternatives you've considered
I've considered doing a real dirty bash parsing to get the job done, but it's probably worthwhile to make the Genbank flat file parser. It will tease out any bugs in the original iteration of the Genbank parser since it can be used on all official Genbank files.
Additional context
Genbank records are separated by `//`.
The headers of Genbank flat files look like:

```
GBBCT1.SEQ Genetic Sequence Data Bank
October 15 2020
NCBI-GenBank Flat File Release 240.0
Bacterial Sequences (Part 1)
101593 loci, 185853961 bases, from 101593 reported sequences
```

with LOCUS immediately following.
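The splitting logic could be sketched like this: skip the release header until the first LOCUS line, then cut records on the `//` terminator. Each record string would then be handed to the existing Genbank parser.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// splitFlatFile splits a bulk Genbank flat file into individual
// records. Everything before the first LOCUS line (the release
// header) is dropped; "//" lines terminate each record.
func splitFlatFile(data string) []string {
	var records []string
	var current []string
	inRecord := false

	scanner := bufio.NewScanner(strings.NewReader(data))
	for scanner.Scan() {
		line := scanner.Text()
		if !inRecord {
			if strings.HasPrefix(line, "LOCUS") {
				inRecord = true
			} else {
				continue // still inside the release header
			}
		}
		if line == "//" {
			records = append(records, strings.Join(current, "\n"))
			current = nil
			inRecord = false
			continue
		}
		current = append(current, line)
	}
	return records
}

func main() {
	flat := "GBBCT1.SEQ Genetic Sequence Data Bank\nheader line\nLOCUS AB000001\nORIGIN\natgc\n//\nLOCUS AB000002\n//\n"
	fmt.Println(len(splitFlatFile(flat))) // 2
}
```

For multi-gigabyte dumps the same state machine can run over a streaming reader instead of an in-memory string, emitting records on a channel.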
Is your feature request related to a problem? Please describe.
Implementing DNA synthesis optimization may be complex enough to write a spec for it before we get going.
Describe the solution you'd like
Stack overflow has a great dev blog post on how to write a spec. First step is information gathering which we can do here in this issue thread.
Questions to consider:
As we start answering these questions we can begin to fill out a draft spec based on the dev blog post I mentioned earlier. Our requirements will be a little less complex so we'll definitely have a shorter spec than the outline they give in that post!
I'm not sure what the best environment is for writing this spec together. I'll think about it, edit this post, and add a link when I've come to a conclusion, though I'm leaning towards just making a spec directory and putting a markdown file in there.
In the meantime dump any links, resources, ideas, etc here so that we can add them when the document is live!
Is your feature request related to a problem? Please describe.
Update the seqhash algorithm to give versioned output. For example, pUC19 would be `v1_DCD_4b0616d1b3fc632e42d78521deb38b44fba95cca9fde159e01cd567fa996ceb9`, with the DCD standing for "DNA-Circular-Double stranded".
Describe the solution you'd like
Output similar to the reference code at https://git.sr.ht/~koeng/python-seqhash/tree/master/seqhash/__init__.py.
Is your feature request related to a problem? Please describe.
New discord users seem to have a hard time streaming their screens. I had to fix a permissions setting on my computer in one case, and this morning someone on our discord had a similar issue.
Describe the solution you'd like
We should have a checklist that new users can run through before trying to jump into a live stream, so they can head off screen-sharing and mic problems.
Is your feature request related to a problem? Please describe.
Implementing multiple cloning functions may be complex enough to write a spec for it before we get going.
Describe the solution you'd like
Stack overflow has a great dev blog post on how to write a spec. First step is information gathering which we can do here in this issue thread.
Questions to consider:
As we start answering these questions we can begin to fill out a draft spec based on the dev blog post I mentioned earlier. Our requirements will be a little less complex so we'll definitely have a shorter spec than the outline they give in that post!
I'm not sure what the best environment is for writing this spec together. I'll think about it and edit this post with a link when I've come to a conclusion, though I'm leaning towards just making a spec directory and putting a markdown file in there.
In the meantime dump any links, resources, ideas, etc here so that we can add them when the document is live!
It's pretty frustrating that I can't get Poly command line tests to pass on Windows during GitHub Actions test jobs. I have no concept of how PowerShell works on Windows and am pretty useless when it comes to testing here. Currently command line tests are just bypassed on Windows.
If any Windows user here could go through Poly's commands_test.go file and recreate the unix tests for Windows, that would be amazing. Same for documentation.
Is your feature request related to a problem? Please describe.
I've used a lot of command line tools over time and some of them have really pretty interfaces. Ascii art, loading icons, etc. I know that pipes should work but beyond that I'm not really sure what users want in a command line interface.
Describe the solution you'd like
A web search for "how to design a great command line user experience" yields pretty good results:
Command line UX in 2020
UX for command line tools
One thing that I know poly needs to be able to do is provide feedback to the user about what it's doing. Again, ascii art, loading icons, etc. What's more important is that there's a bunch of stuff we don't know.
Things that need to be done:
Is your feature request related to a problem? Please describe.
Poly should have a nice documentation site.
Describe the solution you'd like
An out of the box solution that is pretty and has examples to complement go docs.
https://squidfunk.github.io/mkdocs-material/ comes to mind.
Describe alternatives you've considered
I'm not sure what alternatives there are at the moment but I made this thread to attract ideas and document solutions that I find.
Describe the solution you'd like
I'd like a fasta parser that parses fasta into an `AnnotatedSequence` struct or a slice of `AnnotatedSequence` structs. Most of the struct can be blank, but `AnnotatedSequence.Sequence.Sequence` should be a single string of characters with no whitespace and `AnnotatedSequence.Sequence.Description` should be the description. This is a really great first contribution and there's no rush to implement it, so I thought it'd be great to share here!
Additional context
Here's a spec from wikipedia. The resulting parser should be placed in io.go along with tests in io_test.go. More information on the AnnotatedSequence struct can also be found in io.go.
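A sketch of the requested parser, using a flattened stand-in for the `AnnotatedSequence` struct (the real one in io.go nests these fields under a child struct):

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// AnnotatedSequence is a flattened stand-in; the real struct in
// io.go holds these under AnnotatedSequence.Sequence.
type AnnotatedSequence struct {
	Description string
	Sequence    string
}

// ParseFasta turns fasta text into one AnnotatedSequence per ">"
// record, joining wrapped sequence lines into a single string
// with no whitespace.
func ParseFasta(data string) []AnnotatedSequence {
	var out []AnnotatedSequence
	var seq strings.Builder
	scanner := bufio.NewScanner(strings.NewReader(data))
	for scanner.Scan() {
		line := strings.TrimSpace(scanner.Text())
		switch {
		case line == "":
			continue
		case strings.HasPrefix(line, ">"):
			// New record: flush the previous record's sequence.
			if len(out) > 0 {
				out[len(out)-1].Sequence = seq.String()
				seq.Reset()
			}
			out = append(out, AnnotatedSequence{Description: strings.TrimPrefix(line, ">")})
		default:
			seq.WriteString(line)
		}
	}
	if len(out) > 0 {
		out[len(out)-1].Sequence = seq.String()
	}
	return out
}

func main() {
	records := ParseFasta(">pUC19 cloning vector\natg\ncgt\n>second\ngggg\n")
	fmt.Println(records[0].Description, records[0].Sequence, records[1].Sequence)
}
```

The tests in io_test.go can round-trip a small multi-record file and assert the descriptions and joined sequences come back intact.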
`primers.go` does a great job of calculating primer melting temp using SantaLucia, but in @jecalles' original pull request #34 they mentioned some extra stuff that they may need for their project.
It'd be great if we could upgrade `MeltingTemp()` to adjust predictions for (1) internal base mismatches, (2) internal loops, and (3) overhangs, as described in PR #34. I'm not sure I understand everything that needs to be done here, but @jecalles included a review that contains most of the math needed (doi: 10.1146/annurev.biophys.32.110601.141800).
Now that `MeltingTemp()` has been merged into `primers.go` we can start working on Gibson Assembly.
At its core, `GibsonAssembly()` would take a slice of sequences and return primers that would help join each piece into a fully realized sequence. It would also account for whether the sequence is linear or circular and design that last primer for a circular sequence as necessary. This function and its tests would be placed in new files called `clone.go` and `clone_test.go`.
For reference, Addgene has a great segment on Gibson Assembly cloning as well as a short tutorial on primer design.
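The junction logic could be sketched like this: each fragment after the first gets a forward primer whose 5' tail is copied from the end of the previous fragment. `gibsonForwardPrimers`, the fixed overlap, and the fixed annealing length are all illustrative; a real `GibsonAssembly()` would size the annealing region with `MeltingTemp()`, design reverse primers too, and close the circle for circular constructs.

```go
package main

import "fmt"

// gibsonForwardPrimers designs one forward primer per junction:
// the last `overlap` bases of the upstream fragment (the homology
// tail) followed by the first `anneal` bases of the downstream
// fragment. Fixed lengths are a simplification; real designs size
// the annealing region by melting temperature.
func gibsonForwardPrimers(fragments []string, overlap, anneal int) []string {
	var primers []string
	for i := 1; i < len(fragments); i++ {
		prev, cur := fragments[i-1], fragments[i]
		tail := prev[len(prev)-overlap:]
		primers = append(primers, tail+cur[:anneal])
	}
	return primers
}

func main() {
	fragments := []string{"aaaaatttt", "ccccgggg"}
	fmt.Println(gibsonForwardPrimers(fragments, 4, 4)) // [ttttcccc]
}
```

For a circular target, the same rule applied to the (last fragment, first fragment) pair yields the extra junction primer the issue mentions.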
Is your feature request related to a problem? Please describe.
Non-golang programmers will have a hard time installing the command line tool unless it's available through standard package managers.
Describe the solution you'd like
Compilation targets for various popular package managers, where after each release a new binary is created and uploaded for use with:
apt-get
yum
nix
brew
choco
Describe alternatives you've considered
We could keep having users install via `go get`, but it's not very nice to expect casual users to install a whole language toolchain when they could just install a binary instead.