coord2seq's People

Contributors

hepcat72

coord2seq's Issues

Allow parent sequence ID string option on the command line

The user can already supply start and stop coordinates with -c, so there should also be a way to specify the parent sequence ID on the command line, in a format such as "ID:start..stop" or "ID:start-stop". Commas in the coordinates should be allowed and removed automatically.

Also, change the default from circular to linear, or have it auto-detected.
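
A minimal sketch of how such a string could be parsed, written in Python purely for illustration. The names `Region` and `parse_region` are hypothetical and not part of coord2seq; the pattern assumes the ID is everything before the last colon that is followed by a valid coordinate pair.

```python
import re
from typing import NamedTuple


class Region(NamedTuple):
    """Hypothetical container for a parsed "ID:start..stop" string."""
    seq_id: str
    start: int
    stop: int


# ID, ':', start, then '..' or '-', then stop. Commas are allowed inside numbers.
_REGION_RE = re.compile(
    r"^(?P<id>.+):(?P<start>\d[\d,]*)(?:\.\.|-)(?P<stop>\d[\d,]*)$"
)


def parse_region(text: str) -> Region:
    """Parse 'ID:start..stop' or 'ID:start-stop', stripping commas from numbers."""
    match = _REGION_RE.match(text.strip())
    if match is None:
        raise ValueError(f"Cannot parse region string: {text!r}")
    start = int(match.group("start").replace(",", ""))
    stop = int(match.group("stop").replace(",", ""))
    return Region(match.group("id"), start, stop)
```

For example, `parse_region("chr1:308,552..308,754")` and `parse_region("chr1:308552-308754")` both yield the same region. IDs that themselves contain hyphens or colons are handled because only the trailing coordinate pair is matched as numbers.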

Less frequent and better coordinate warnings

WARNING1: Start/Stop coordinates appear to be unordered.  Since the chromosome is linear, the start/stop for sequence [B045_1-4_chr1_r_c1288[k=4][m=2]_TGACT] will be switched.
WARNING2: Start/Stop coordinates appear to be unordered.  Since the chromosome is linear, the start/stop for sequence [B045_1-4_chr1_r_c1291[k=5][m=2]_AGACT] will be switched.
WARNING3: Start/Stop coordinates appear to be unordered.  Since the chromosome is linear, the start/stop for sequence [B045_1-4_chr1_r_c1295[k=4][m=2]_AGACA] will be switched.
WARNING4: Start/Stop coordinates appear to be unordered.  Since the chromosome is linear, the start/stop for sequence [B045_1-4_chr1_r_c255[k=13][m=4]_GAACT] will be switched.

Make the coordinate-order warnings more succinct and less frequent. Issue the warning only once, when the problem is first encountered, and only if the file is inconsistent. The warning should also explain how the coordinates will be reported (e.g. as lesser/greater or as start/stop).
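
One possible shape for this behavior, sketched in Python for illustration only. The function name `normalize_coords` and the warning wording are assumptions, not coord2seq's actual code. It treats a file as inconsistent when it mixes ordered and reversed pairs, and produces a single summary warning instead of one per record.

```python
from typing import Iterable, List, Optional, Tuple


def normalize_coords(
    pairs: Iterable[Tuple[int, int]],
) -> Tuple[List[Tuple[int, int]], Optional[str]]:
    """Return pairs reordered as (lesser, greater) plus at most one warning.

    A warning is produced only when the input mixes ordered and reversed
    pairs; a consistently ordered or consistently reversed file stays silent.
    """
    normalized: List[Tuple[int, int]] = []
    ordered_count = 0
    reversed_count = 0
    for start, stop in pairs:
        if start > stop:
            reversed_count += 1
            start, stop = stop, start
        elif start < stop:
            ordered_count += 1
        normalized.append((start, stop))

    warning = None
    if ordered_count and reversed_count:
        warning = (
            f"WARNING: {reversed_count} of {len(normalized)} coordinate pairs "
            "have start > stop. Coordinates will be reported as lesser..greater."
        )
    return normalized, warning
```

Collecting the counts first means the message can state how many records were affected, which a per-record warning cannot do.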

Allow different common coordinate modes

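Coordinate conventions differ between formats. GTF files use 1-based, inclusive coordinates for both the start and the stop, while BED files use a 0-based start with an exclusive stop (so the stop value is numerically the same as a 1-based inclusive stop). Other tools use 0-based coordinates for both the start and the stop. These coordinate modes should be supported.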
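
A simple way to support several modes is to convert every input to one internal convention. The sketch below is in Python for illustration; the mode names and the function `to_one_based_closed` are hypothetical, and the choice of 1-based inclusive as the internal form is an assumption.

```python
from typing import Tuple

# Hypothetical mode names; coord2seq does not define these.
ONE_BASED_CLOSED = "one-based-closed"            # start and stop both 1-based, inclusive
ZERO_BASED_HALF_OPEN = "zero-based-half-open"    # 0-based start, exclusive stop (as in BED)
ZERO_BASED_CLOSED = "zero-based-closed"          # start and stop both 0-based, inclusive


def to_one_based_closed(start: int, stop: int, mode: str) -> Tuple[int, int]:
    """Convert (start, stop) in the given mode to 1-based inclusive coordinates."""
    if mode == ONE_BASED_CLOSED:
        return start, stop
    if mode == ZERO_BASED_HALF_OPEN:
        # Only the start shifts: an exclusive 0-based stop equals the
        # 1-based inclusive stop.
        return start + 1, stop
    if mode == ZERO_BASED_CLOSED:
        return start + 1, stop + 1
    raise ValueError(f"Unknown coordinate mode: {mode!r}")
```

With this in place, the first ten bases of a sequence are `(1, 10)` internally whether they were given as `(1, 10)`, `(0, 10)`, or `(0, 9)` in the three modes respectively.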

Auto-determine column assignments

Allow the columns to be determined automatically.

  1. Look for things like "chr" or "contig" or see if all values match the IDs in the fasta file
  2. Allow there to not be a chr column if there's only 1 seq
  3. Look at numbers to see if they are within the seq length
  4. Check uniqueness of a column for subseq IDs (default to chr.start.stop if indeterminate)
  5. Allow direction to be +/-, #..#c, plus/minus, etc.

Use this logic to validate columns specified by the user and suggest a fix.
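
Heuristics 1 and 3 above could be sketched roughly as follows. This is an illustrative Python sketch, not coord2seq's implementation: `guess_columns` is a hypothetical name, and taking the first two in-range numeric columns as start and stop is an assumption. Direction and subsequence ID detection are left out.

```python
from typing import Dict, List, Optional, Sequence


def _as_int(value: str) -> Optional[int]:
    """Return the value as an int (commas removed), or None if not numeric."""
    try:
        return int(value.replace(",", ""))
    except ValueError:
        return None


def guess_columns(
    rows: Sequence[Sequence[str]], seq_lengths: Dict[str, int]
) -> Dict[str, Optional[int]]:
    """Guess the chr, start, and stop column indexes of a coordinate table.

    seq_lengths maps each parent sequence ID from the fasta file to its length.
    """
    if not rows or not seq_lengths:
        raise ValueError("Need at least one row and one parent sequence")
    max_len = max(seq_lengths.values())
    chr_col: Optional[int] = None
    coord_cols: List[int] = []
    for col in range(min(len(row) for row in rows)):
        values = [row[col] for row in rows]
        # Heuristic 1: every value matches a sequence ID from the fasta file.
        if chr_col is None and all(v in seq_lengths for v in values):
            chr_col = col
            continue
        # Heuristic 3: every value is a number within the sequence length.
        numbers = [_as_int(v) for v in values]
        if all(n is not None and 1 <= n <= max_len for n in numbers):
            coord_cols.append(col)
    return {
        "chr": chr_col,
        "start": coord_cols[0] if len(coord_cols) > 0 else None,
        "stop": coord_cols[1] if len(coord_cols) > 1 else None,
    }
```

The same function could validate user-specified columns: if the user's choice disagrees with the guess, the guess can be offered as a suggested fix.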

When coords are larger than the parent sequence length, a single base record is generated

I entered coordinates "308552-308754". One parent sequence was only 85779 bases long, thus this was the output:

WARNING5: Start coordinate [308552] is greater than or equal to twice the size of the sequence [85779].  Setting to sequence size.
WARNING6: Stop coordinate [308754] is greater than or equal to twice the size of the sequence [85779].  Setting to sequence size.
>chrmitochondrion 85779..85779
A

If both coordinates are larger than the parent sequence length, I think no sequence should be generated.
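
The proposed behavior could look like the following Python sketch, which assumes a linear parent sequence and 1-based inclusive coordinates. The function name `clamp_or_skip` is hypothetical; circular sequences would need different handling.

```python
from typing import Optional, Tuple


def clamp_or_skip(start: int, stop: int, seq_len: int) -> Optional[Tuple[int, int]]:
    """Clamp a range to a linear sequence, or return None to skip the record.

    Returns None when both coordinates lie beyond the end of the sequence,
    so that no single-base record is emitted for a range that does not
    overlap the sequence at all.
    """
    lesser, greater = min(start, stop), max(start, stop)
    if lesser > seq_len:
        return None
    return max(lesser, 1), min(greater, seq_len)
```

With the values from the report, `clamp_or_skip(308552, 308754, 85779)` returns `None`, so the caller can skip the record (ideally with one warning) rather than output the last base.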

Allow 2D coord file names

Currently, if you specify -i "*.fa" -f "a*.coord" -f "b*.coord", you get an error saying that the number of -i and -f files must be the same. What should actually have to match is the number of files given to each individual -f and the number of files given to -i.

This issue might be resolved by switching from the template to the new CommandLineInterface module.

I think multiple sequence files are currently allowed for each coordinate file. If I want to keep that ability, this change may not be possible, because it would essentially mean allowing both 1:M and M:1 relationships between sequence and coord files, and which one is desired depends on which file type gets the outfile suffix. I would have to create 2 separate pairs of infiles.
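
For the simple case where each -f group is expected to line up one-to-one with the -i files, the check could be sketched as below. This is an illustrative Python sketch with a hypothetical function name; it takes already-expanded file lists and does not address the 1:M and M:1 cases discussed above.

```python
from typing import List, Sequence, Tuple


def pair_file_groups(
    seq_files: Sequence[str], coord_groups: Sequence[Sequence[str]]
) -> List[Tuple[str, str]]:
    """Pair each sequence file with the coord file at the same position in every group.

    seq_files is the expanded -i list; coord_groups holds one expanded list
    per -f option. Each group is checked against seq_files on its own.
    """
    pairs: List[Tuple[str, str]] = []
    for group_number, group in enumerate(coord_groups, start=1):
        if len(group) != len(seq_files):
            raise ValueError(
                f"-f group {group_number} has {len(group)} files, "
                f"but -i has {len(seq_files)}"
            )
        pairs.extend(zip(seq_files, group))
    return pairs
```

Comparing each group separately is what distinguishes this from the current check, which compares the total number of -f files with the number of -i files.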
