hepcat72 / coord2seq
Retrieve subsequences with coordinates. For DNA, RNA, & Protein.
License: GNU General Public License v3.0
The user can supply start and stop coordinates with -c. There should also be a way to specify the parent sequence ID, in a format such as "ID:start..stop" or "ID:start-stop". Commas in the coordinates should also be allowed and removed automatically.
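A minimal sketch of how such a coordinate string could be parsed, written in Python for illustration only. The helper name `parse_region` and the choice to make the ID optional are assumptions, not part of coord2seq.

```python
import re

# Hypothetical pattern: optional "ID:" prefix, then start and stop separated
# by ".." or "-". Digits may contain thousands separators (commas).
_REGION = re.compile(
    r"^(?:(?P<id>.+):)?(?P<start>[\d,]+)(?:\.\.|-)(?P<stop>[\d,]+)$"
)


def parse_region(text):
    """Parse 'ID:start..stop' or 'ID:start-stop' into (id, start, stop).

    The ID is None when no 'ID:' prefix is given. Commas are stripped
    from the numbers before conversion.
    """
    match = _REGION.match(text.strip())
    if match is None:
        raise ValueError(f"Unrecognized coordinate string: {text!r}")
    start = int(match.group("start").replace(",", ""))
    stop = int(match.group("stop").replace(",", ""))
    return match.group("id"), start, stop
```

Because the ID part is matched greedily up to the last colon, IDs that themselves contain hyphens or colons are still separated from the coordinates correctly.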
Also - change the default from circular to linear or have it auto-detected.
WARNING1: Start/Stop coordinates appear to be unordered. Since the chromosome is linear, the start/stop for sequence [B045_1-4_chr1_r_c1288[k=4][m=2]_TGACT] will be switched.
WARNING2: Start/Stop coordinates appear to be unordered. Since the chromosome is linear, the start/stop for sequence [B045_1-4_chr1_r_c1291[k=5][m=2]_AGACT] will be switched.
WARNING3: Start/Stop coordinates appear to be unordered. Since the chromosome is linear, the start/stop for sequence [B045_1-4_chr1_r_c1295[k=4][m=2]_AGACA] will be switched.
WARNING4: Start/Stop coordinates appear to be unordered. Since the chromosome is linear, the start/stop for sequence [B045_1-4_chr1_r_c255[k=13][m=4]_GAACT] will be switched.
Come up with more succinct, less frequent warnings about coordinate order. Issue the warning only once, when first encountered, and only if the file is inconsistent. Explain how coords will be reported, for example as lesser/greater or start/stop.
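One possible shape for this, sketched in Python under stated assumptions: sequences are linear, a file counts as "inconsistent" when it mixes ordered and unordered pairs, and coordinates are reported as lesser..greater. The function name `normalize_coords` and the warning text are hypothetical.

```python
import sys


def _print_warning(message):
    print(message, file=sys.stderr)


def normalize_coords(records, warn=_print_warning):
    """Return (seq_id, lesser, greater) tuples, warning at most once.

    `records` is an iterable of (seq_id, start, stop) tuples.
    """
    records = list(records)
    n_reversed = sum(1 for _, start, stop in records if start > stop)
    # Warn only when the file mixes both orders. A file in which every
    # pair is reversed is treated as a consistent convention.
    if 0 < n_reversed < len(records):
        warn(
            f"WARNING: {n_reversed} of {len(records)} coordinate pairs have "
            "start > stop. Coordinates will be reported as lesser..greater."
        )
    return [(seq_id, min(a, b), max(a, b)) for seq_id, a, b in records]
```

A single summary line replaces the per-sequence warnings shown above, and the count gives the user a sense of how widespread the inconsistency is.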
GTF files use 1-based coordinates in which both the start and the stop are inclusive. Other popular formats, such as BED, use a 0-based start with an exclusive stop (half-open intervals). These coordinate types should be supported.
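As a rough illustration of supporting more than one coordinate system, the sketch below converts to a single internal convention (1-based, inclusive) before extracting. It assumes the usual definitions: GTF is 1-based with inclusive start and stop, and BED is 0-based with an exclusive stop. The function names and the `system` labels are hypothetical.

```python
def to_one_based_inclusive(start, stop, system="gtf"):
    """Convert (start, stop) to 1-based inclusive coordinates."""
    if system == "gtf":
        # Already 1-based and inclusive on both ends.
        return start, stop
    if system == "bed":
        # 0-based start becomes 1-based; the exclusive stop is numerically
        # equal to the 1-based inclusive stop.
        return start + 1, stop
    raise ValueError(f"Unknown coordinate system: {system!r}")


def extract(sequence, start, stop, system="gtf"):
    """Return the subsequence for coordinates given in `system`."""
    first, last = to_one_based_inclusive(start, stop, system)
    return sequence[first - 1:last]
```

With this arrangement, the GTF pair (3, 5) and the BED pair (2, 5) select the same three bases.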
Allow the columns to be determined automatically.
Use this logic to validate columns specified by the user and suggest a fix.
Error if the file deviates from this specification.
I entered coordinates "308552-308754". One parent sequence was only 85779 bases long, thus this was the output:
WARNING5: Start coordinate [308552] is greater than or equal to twice the size of the sequence [85779]. Setting to sequence size.
WARNING6: Stop coordinate [308754] is greater than or equal to twice the size of the sequence [85779]. Setting to sequence size.
>chrmitochondrion 85779..85779
A
I think if both coords are too large, no sequence should be generated.
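A sketch of the proposed behavior for linear sequences, assuming 1-based inclusive coordinates: a range that lies entirely past the end yields nothing, while a range that only partly overhangs is trimmed. The name `clamp_linear` is hypothetical.

```python
def clamp_linear(start, stop, seq_len):
    """Clamp a 1-based inclusive range to a linear sequence.

    Returns (start, stop) trimmed to the sequence, or None when the
    whole range falls outside it, so no sequence should be generated.
    """
    lo, hi = min(start, stop), max(start, stop)
    if lo > seq_len or hi < 1:
        return None
    return max(lo, 1), min(hi, seq_len)
```

For the example above, `clamp_linear(308552, 308754, 85779)` returns `None` instead of producing the single base at position 85779.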
Most cases will be based on the input coord file, but there are definitely some cases where you want the same coords extracted from multiple sequence files.
Currently, if you specify -i "*.fa" -f "a*.coord" -f "b*.coord", you get an error saying that the number of -i and -f files needs to be the same. What should actually match is the number of files supplied to each -f and the number supplied to the -i.
This issue may be able to be resolved by implementing the new CommandLineInterface module instead of the template.
I think that multiple sequence files are currently allowed for each coordinate file, so this change may not be possible if I want to keep that ability. Ideally, both 1:M and M:1 relationships between sequence and coord files would be allowed, with the desired case depending on which file type gets the outfile suffix. I would have to create 2 separate pairs of infiles.
I should also implement --gtf (and other format) flags that set all the columns at once. If any column is specified separately, it should override the one set by --gtf.
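The preset-then-override idea could look roughly like this in Python's argparse. Only --gtf comes from the text above; the per-column option names (--id-col, --start-col, --stop-col) are hypothetical stand-ins and are not coord2seq's actual options. The GTF column numbers (seqname in column 1, start in 4, end in 5) follow the GTF format.

```python
import argparse

# Column numbers are 1-based. Other formats could be added as further presets.
FORMAT_PRESETS = {"gtf": {"id_col": 1, "start_col": 4, "stop_col": 5}}
COLUMN_NAMES = ("id_col", "start_col", "stop_col")


def parse_columns(argv):
    """Return the column settings: preset values first, then explicit overrides."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--gtf", dest="preset", action="store_const", const="gtf")
    for name in COLUMN_NAMES:
        # Hypothetical per-column options; default None means "not specified".
        parser.add_argument("--" + name.replace("_", "-"), type=int, default=None)
    args = parser.parse_args(argv)

    columns = dict(FORMAT_PRESETS.get(args.preset, {}))
    for name in COLUMN_NAMES:
        value = getattr(args, name)
        if value is not None:
            columns[name] = value  # an explicit column beats the preset
    return columns
```

Keeping the per-column defaults as `None` is what makes it possible to tell "not specified" apart from "specified with the same value as the preset".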
It would be much faster if I had an index to use to retrieve sequence from coordinates.
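One common approach is a byte-offset index in the style of a FASTA .fai file, which lets a subsequence be read with a single seek instead of scanning the whole file. The sketch below is in Python and is only an illustration: it assumes a well-formed FASTA file in which every sequence line of a record (except possibly the last) has the same width, and the function names are hypothetical.

```python
def build_index(path):
    """Map each sequence ID to [length, offset, bases_per_line, bytes_per_line].

    `offset` is the byte position of the first base of the record.
    """
    index = {}
    name = None
    offset = 0
    with open(path, "rb") as handle:
        for line in handle:
            if line.startswith(b">"):
                name = line[1:].split()[0].decode()
                index[name] = [0, offset + len(line), 0, 0]
            elif name is not None:
                entry = index[name]
                bases = len(line.rstrip(b"\r\n"))
                if entry[2] == 0:
                    # The first sequence line defines the line geometry.
                    entry[2], entry[3] = bases, len(line)
                entry[0] += bases
            offset += len(line)
    return index


def fetch(path, index, name, start, stop):
    """Return bases start..stop (1-based, inclusive) using the index."""
    length, offset, bases_per_line, bytes_per_line = index[name]
    start, stop = max(start, 1), min(stop, length)
    if start > stop:
        return ""

    def byte_position(i):
        # Byte position of the base with 0-based index i.
        return offset + (i // bases_per_line) * bytes_per_line + i % bases_per_line

    first, last = byte_position(start - 1), byte_position(stop - 1)
    with open(path, "rb") as handle:
        handle.seek(first)
        chunk = handle.read(last - first + 1)
    # Drop the line terminators that fall inside the requested span.
    return chunk.replace(b"\n", b"").replace(b"\r", b"").decode()
```

The index only needs to be built once per sequence file and could be cached on disk, after which each lookup reads just the bytes that cover the requested range.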