tseemann / any2fasta

Convert various sequence formats to FASTA

License: GNU General Public License v3.0

Perl 76.41% Clarion 6.12% TeX 17.46%

any2fasta's Issues

Add option to deal with pipe chars in IDs

Maybe we want to parse NCBI pipe specifiers?
eg >gb|xxxxx|ref|yyyyyy|
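
A minimal sketch of how pipe-delimited IDs could be split, assuming the fields come in (database tag, identifier) pairs as in the example above. `parse_pipe_id` and `simplify_id` are hypothetical names, and choosing the last identifier as the FASTA ID is only one possible policy; some NCBI tags carry more than one field and would need extra handling.

```perl
use strict;
use warnings;

# Split an NCBI-style ID such as "gb|xxxxx|ref|yyyyyy|" into
# [database tag, identifier] pairs. Returns an empty list when the ID
# contains no pipe characters.
sub parse_pipe_id {
    my ($id) = @_;
    return () unless $id =~ /\|/;
    my @fields = split /\|/, $id;    # split drops trailing empty fields
    my @pairs;
    while (@fields >= 2) {
        my ($tag, $value) = splice @fields, 0, 2;
        push @pairs, [$tag, $value] if length $value;
    }
    return @pairs;
}

# Hypothetical policy: keep the last identifier, or the ID unchanged
# when it has no pipe structure.
sub simplify_id {
    my ($id) = @_;
    my @pairs = parse_pipe_id($id);
    return @pairs ? $pairs[-1][1] : $id;
}

print simplify_id('gb|xxxxx|ref|yyyyyy|'), "\n";    # prints "yyyyyy"
```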

Support .gz compression

This could then be a drop in replacement in MLST, Abricate etc.
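
One way this could look, as a sketch rather than the tool's actual code: check for the gzip magic bytes (0x1f 0x8b) instead of trusting the file extension, and read through the core `IO::Uncompress::Gunzip` module. `open_maybe_gz` is a hypothetical helper name.

```perl
use strict;
use warnings;
use IO::Uncompress::Gunzip qw($GunzipError);

# Return a read handle for $path, decompressing transparently when the
# file starts with the gzip magic bytes 0x1f 0x8b.
sub open_maybe_gz {
    my ($path) = @_;
    open my $raw, '<:raw', $path or die "Cannot open '$path': $!\n";
    my $got = read $raw, my $magic, 2;
    close $raw;
    if (defined $got && $got == 2 && $magic eq "\x1f\x8b") {
        # MultiStream lets concatenated gzip members be read as one stream
        my $gz = IO::Uncompress::Gunzip->new($path, MultiStream => 1)
            or die "Cannot gunzip '$path': $GunzipError\n";
        return $gz;
    }
    open my $fh, '<', $path or die "Cannot open '$path': $!\n";
    return $fh;
}
```

The returned handle can be read line by line with `while (my $line = <$fh>)` in both cases, so the parsers do not need to know whether the input was compressed.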

No sequence for GBK missing // terminator

Technically a GBK is invalid if it is missing the // after the SEQ block.

But we could be nice and still output any sequence we have?

i.e. if we see EOF or another LOCUS header we can assume the record has ended.
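
A sketch of that behaviour, assuming the record ID comes from the LOCUS line and that sequence lines after ORIGIN start with a coordinate. `parse_gbk_lenient` is a hypothetical name, not the tool's parser; it closes a record on `//`, on the next LOCUS line, or at EOF.

```perl
use strict;
use warnings;

# Parse GenBank text from a filehandle and return a list of [id, dna] pairs.
# A record is closed by "//", by the next LOCUS line, or by end of file.
sub parse_gbk_lenient {
    my ($fh) = @_;
    my @records;
    my ($id, $dna, $in_seq) = (undef, '', 0);
    my $flush = sub {
        push @records, [$id, $dna] if defined $id && length $dna;
        ($id, $dna, $in_seq) = (undef, '', 0);
    };
    while (my $line = <$fh>) {
        if ($line =~ /^LOCUS\s+(\S+)/) {
            my $new_id = $1;
            $flush->();              # previous record lacked "//"
            $id = $new_id;
        }
        elsif ($line =~ m{^//}) {
            $flush->();
        }
        elsif ($line =~ /^ORIGIN/) {
            $in_seq = 1;
        }
        elsif ($in_seq && $line =~ /^\s*\d+\s+(.*)/) {
            (my $s = $1) =~ s/\s+//g;    # drop spaces between 10-base groups
            $dna .= $s;
        }
    }
    $flush->();                      # EOF without "//"
    return @records;
}
```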

Add option -w for chars per line

Default should be 60
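
A small sketch of the wrapping step, assuming the sequence is held as one unbroken string. `wrap_seq` is a hypothetical helper, and treating a width of 0 as "do not wrap" is an assumption, not something stated in this issue.

```perl
use strict;
use warnings;

# Break $seq into lines of at most $width characters (default 60).
# Every line, including the last, ends with a newline.
sub wrap_seq {
    my ($seq, $width) = @_;
    $width = 60 unless defined $width;
    return "$seq\n" if $width <= 0;    # assumption: 0 means "no wrapping"
    return join '', map { "$_\n" } unpack "(a$width)*", $seq;
}

print wrap_seq('ACGTACGTAC', 4);    # prints ACGT, ACGT, AC on three lines
```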

More robust GBK parsing

    if ($in_seq) {
      #       421 ctctcaaact aaagccgtct cactctccat gagtcgttcg acagatcgcg ttttaaattg
      my $s = substr $_, 10;  # trim the coordinate prefix
      $s =~ s/\s//g;
      $dna .= $s . "\n";
    }

Only load first row to check, then load rest

Avoid loading whole file if it's not going to be used.
Especially for FASTQ files.
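
A sketch of the idea: read only the first line, decide the format from it, and hand both back so the chosen parser can stream the rest of the handle. The signature table and the name `detect_format` are assumptions for illustration; the tool's real detection rules may differ.

```perl
use strict;
use warnings;

# Hypothetical first-line signatures, checked in order.
my @SIGNATURES = (
    [ qr/^>/,                  'FASTA'     ],
    [ qr/^\@/,                 'FASTQ'     ],
    [ qr/^LOCUS\s/,            'GENBANK'   ],
    [ qr/^ID\s/,               'EMBL'      ],
    [ qr/^##gff/,              'GFF'       ],
    [ qr/^(?:CLUSTAL|MUSCLE)/, 'CLUSTAL'   ],
    [ qr/^# STOCKHOLM/,        'STOCKHOLM' ],
);

# Read one line from $fh and return (format, first_line).
# The format is undef for empty input or an unrecognised first line.
sub detect_format {
    my ($fh) = @_;
    my $first = <$fh>;
    return (undef, undef) unless defined $first;    # empty input
    for my $sig (@SIGNATURES) {
        return ($sig->[1], $first) if $first =~ $sig->[0];
    }
    return (undef, $first);
}
```

Because only one line has been consumed, a large FASTQ file is never held in memory just to be rejected, and empty input can be reported cleanly instead of matching against an undefined line.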

NCBI gbff file to fasta

Dear,
gbff files from NCBI usually have the following format. Running any2fasta outputs the whole sequence, but how can I output just the CDS sequence? Thanks.

LOCUS       XM_017747270            2892 bp    mRNA    linear   PLN 09-AUG-2016
DEFINITION  PREDICTED: Gossypium arboreum serine/threonine protein phosphatase
            2A regulatory subunit B''alpha-like (LOC108487170), transcript
            variant X1, mRNA.
ACCESSION   XM_017747270
VERSION     XM_017747270.1
DBLINK      BioProject: PRJNA335838
KEYWORDS    RefSeq.
SOURCE      Gossypium arboreum
  ORGANISM  Gossypium arboreum
            Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta;
            Spermatophyta; Magnoliophyta; eudicotyledons; Gunneridae;
            Pentapetalae; rosids; malvids; Malvales; Malvaceae; Malvoideae;
            Gossypium.
COMMENT     MODEL REFSEQ:  This record is predicted by automated computational
            analysis. This record is derived from a genomic sequence
            (NC_030664.1) annotated using gene prediction method: Gnomon.
            Also see:
                Documentation of NCBI's Annotation Process
            
            ##Genome-Annotation-Data-START##
            Annotation Provider         :: NCBI
            Annotation Status           :: Full annotation
            Annotation Version          :: Gossypium arboreum Annotation
                                           Release 100
            Annotation Pipeline         :: NCBI eukaryotic genome annotation
                                           pipeline
            Annotation Software Version :: 7.1
            Annotation Method           :: Best-placed RefSeq; Gnomon
            Features Annotated          :: Gene; mRNA; CDS; ncRNA
            ##Genome-Annotation-Data-END##
FEATURES             Location/Qualifiers
     source          1..2892
                     /organism="Gossypium arboreum"
                     /mol_type="mRNA"
                     /cultivar="Shixiya1"
                     /db_xref="taxon:29729"
                     /chromosome="1"
                     /country="China"
                     /collection_date="May-2010"
     gene            1..2892
                     /gene="LOC108487170"
                     /note="Derived by automated computational analysis using
                     gene prediction method: Gnomon. Supporting evidence
                     includes similarity to: 16 Proteins, and 100% coverage of
                     the annotated genomic feature by RNAseq alignments,
                     including 10 samples with support for all annotated
                     introns"
                     /db_xref="GeneID:108487170"
     CDS             634..2271
                     /gene="LOC108487170"
                     /codon_start=1
                     /product="serine/threonine protein phosphatase 2A
                     regulatory subunit B''alpha-like"
                     /protein_id="XP_017602759.1"
                     /db_xref="GeneID:108487170"
                     /translation="MSLSIKMDIDAVEDVTCLDPELLQLPDVSPFALKASPQLVEDFF
                     SQWLSLPGTGHLVKSLIDDAKSGTIVNASANFSTLNAVGSHSLSSMFPSSNAPPLSPR
                     SSSGSPRTSKQKSSPSALGSPLKLVSEPMQEIIPQFYFQNGCPPTKELKEQCLSQINH
                     LFNNPLNGLQIDEFKAVTKEVCKLPSFLSSALFRKIDVEWTGIVTRDAFIKYWVDGNM
                     LTMDIATQIFEILKRPGCKYLTQVDFKPVLRELLATHPGLEFLRNTPEFQDRYAETVI
                     YRIFYHINRSGNGRLTLRELKRGSLVAAMQHADEEEDINKVLRYFSYEHFYVIYCKFW
                     ELDTDHDFFIDRENLIRYGNHALTYRIVDRIFSQAPRKFTSEVEGKMGYEDFVYFMFS
                     EEDKSSQPSLEYWFKCIDLDGNGVLTPNEMQFFYEEQLHRMECMAQEPVLFEDILCQI
                     IDMIAPEREYCITLQDLKRCKLSGNVFNILFNLNKFVAFESRDPFLIRQEREEPTLTE
                     WDHFAHREYIRLSMEEDVEDASNGSAEVWDESLEAPF"
ORIGIN      
        1 tatctttcat ccttcttcgc tgcagcttcc tattcctttt agtttcccct atgtccactc
       61 tctctgtaat aaaatcaaat gctaataata atactttgat ttctctgctc ctgttttctt
      121 cctctctccg tttcttttta atttttaaaa ccattcccta cttttaatca aattcacgtc
      181 aaatctcatt atcttcttgg catttttaag ttttttttcc gcactgaaag ttaacggaaa
      241 gtactcgaga atttatcagt ttctcttttt ggaagtaaaa caggctaaat tctttcgaga
      301 ctcttcgaag gatttggtat tccagtttat tcataacgcc ggcagctagg gttttggaga
      361 acggcgtatt ttaaacggtt acgtttctac ttccgttgaa gaaaaaaagg attttaccgt
      421 cttttttcct taactctttg gagcaagatt ttgtaattat ttccacggta tcgtcaattt
      481 accatatcat ttcggagcgt gttctttttc ccagttagag aaatctccga agtggcgttg
      541 atttcttttt gctgttgcat ttgaagaatt tgaaagagtt acaagtttta gggtgtttat
      601 ttttatttag tgctgtttga taaggtaggc gagatgtcat tatctataaa gatggatatt
      661 gatgcagtgg aggatgttac ttgtttggac cctgagcttt tgcagcttcc tgatgtttct
      721 ccatttgcac taaaagccag tcctcaactt gtagaggact ttttctctca gtggctttcg
      781 cttcctggga ccggccatct ggtgaaatct ttgattgatg atgcaaagtc agggacaata
      841 gttaacgctt ctgcaaactt ttctactcta aatgctgttg ggagccattc gttgtcttcc
      901 atgtttccaa gtagcaatgc acctccactt tctccaagaa gctcatctgg ttctcctcgc
      961 acgtcaaagc agaagtccag cccttctgct cttggctctc cattgaaatt agttagtgaa
     1021 ccaatgcaag aaatcattcc acagttttat ttccaaaatg gttgtccacc aaccaaggaa
     1081 ttgaaagaac aatgtctttc tcaaattaat caccttttta ataatcctct aaatggattg
     1141 caaatagatg agtttaaagc agtgacaaag gaagtttgca agctaccatc tttcctctct
     1201 tctgcacttt ttagaaaaat agatgtagag tggactggaa tagtgaccag agatgctttc
     1261 attaagtatt gggttgatgg aaatatgctg acgatggata tagcaactca aatatttgaa
     1321 attcttaagc gtccaggctg caagtacctc actcaggttg acttcaaacc tgttcttcga
     1381 gaacttttgg cgacccatcc aggattagaa ttcctgcgga acacgcctga atttcaagat
     1441 agatacgctg aaactgtcat atacagaata ttttatcaca tcaatagatc gggaaatggc
     1501 cgtcttaccc tcagggagct caaaagagga agtctggttg ctgccatgca acatgctgat
     1561 gaggaagagg acattaacaa agtccttagg tacttctcat atgaacattt ctatgttata
     1621 tactgtaagt tttgggagtt ggacacggac catgatttct tcatcgacag agaaaatctc
     1681 attagatatg gcaatcatgc ccttacctac aggattgttg atagaatatt ttcacaggct
     1741 ccacgaaaat ttactagtga ggtagaaggg aagatgggtt atgaggactt tgtctacttc
     1801 atgttttcgg aggaggacaa atcatctcag cctagtcttg agtattggtt taagtgcata
     1861 gatttggatg gaaatggtgt gctgacgcca aatgaaatgc aatttttcta tgaggagcag
     1921 ctgcatcgaa tggaatgcat ggcccaggaa cctgtgctct ttgaggacat attgtgtcaa
     1981 ataattgaca tgattgctcc tgagagagaa tattgcatca cgctacagga tttgaaaaga
     2041 tgcaaacttt caggaaatgt ttttaacatc cttttcaatc ttaataagtt tgtggctttc
     2101 gaaagccgtg atccattcct catacggcag gaacgtgagg aaccaacttt gacagagtgg
     2161 gatcactttg cacatagaga gtatatcagg ctttcaatgg aagaagatgt tgaagacgct
     2221 tcgaatggga gtgctgaagt atgggatgag tcgcttgaag ctccatttta atttttaagg
     2281 ttgctgaggt gagttttgta gtaccttgtc aaaagataat attcaaggtg aatgaagaaa
     2341 aattggctac ttggacattc tgcagatggt gtgcttgtct gcaaagtgat tggccacaag
     2401 cttcaaattc attcgtatag attttaccta tatagttcac ctgcaggcta tctagttgcc
     2461 atttttgcaa ctaagtggcg gcaacaaaat ttctgtcagg aaagccaatt gcttctcata
     2521 caagagaggg ttgattctcc ctgctcttaa ctaatcacca tctccctccc aggccaggta
     2581 tcaacagtct gctactatgt taaaactttt tgttctgttt ttagttggtg aaacaatcat
     2641 ttactgttat cagtctgtgc ctttggggtc gtggaggaaa gtaaaggtgg atggtggata
     2701 ctgcgattgc cttgttttgg tttagtggcc gcccctatct ttgttgccaa acagaaattt
     2761 cgttccccct tcgttactag ctcaacgact cttacctttt tttctcagtt tttggtacaa
     2821 tgtacatgtt ccttattttt ttgatccagt gggtgaaatg aacacttttt tttttttaaa
     2881 aaaggaaaag tt
//
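
Nothing in this thread says any2fasta can do this itself. As a rough sketch of the idea, once the full sequence has been extracted, a feature's bases can be cut out using its location string. `extract_feature_seq` is a hypothetical helper that handles only plain `start..end` and `complement(start..end)` locations, not `join(...)` or partial (`<`, `>`) coordinates.

```perl
use strict;
use warnings;

# Return the bases covered by a simple GenBank location, or nothing
# (undef in scalar context) if the location is unsupported or out of range.
# Coordinates are 1-based and inclusive, as in the FEATURES table.
sub extract_feature_seq {
    my ($dna, $location) = @_;
    my $complement = ($location =~ s/^complement\((.*)\)$/$1/);
    return unless $location =~ /^(\d+)\.\.(\d+)$/;
    my ($start, $end) = ($1, $2);
    return if $start < 1 || $end > length($dna) || $start > $end;
    my $sub = substr $dna, $start - 1, $end - $start + 1;
    if ($complement) {
        $sub = reverse $sub;                 # scalar context: reverse the string
        $sub =~ tr/ACGTacgt/TGCAtgca/;
    }
    return $sub;
}
```

For the record above, `extract_feature_seq($dna, '634..2271')` would return the 1638 bases that begin with the `atg` at position 634.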

Add BED output option

Add --outfmt bed ?
Perhaps if symlinked to any2bed it could run as if given --outfmt bed?

Useful for making --mask files for snippy etc

Feature request: any2fasta file.pdb

Hi,

For PDB files from protein structures, e.g. like those predicted by alphafold2, it would be great to have any2fasta work on PDB files.

Initial simple request (pseudocode using csvtk, of which you are also a fan!):

cat ranked_0.pdb | csvtk space2tab | csvtk cut -H -t -f 4,6 | csvtk uniq -H -t -f 2 | turn-3-letter-code-to-single-letter-code | stitch to single line of AAs

If more than one chain:

cat ranked_0.pdb | csvtk space2tab | csvtk cut -H -t -f 4,5,6 | csvtk uniq -H -t -f 2,3 | foreach chain; do turn-3-letter-code-to-single-letter-code | stitch to single line of AAs

I hope this is clear enough. Having this in any2fasta would add yet another conversion (here PDB) into FASTA available in a single repo.

Thanks in advance
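
The pipeline above could be sketched in Perl roughly as follows. It reads the fixed columns of ATOM records (residue name in columns 18-20, chain in column 22, residue number plus insertion code in columns 23-27) rather than splitting on whitespace, since PDB fields can run together. `pdb_to_seqs` is a hypothetical name; the sketch assumes the 20 standard residues, writes anything else as `X`, and ignores HETATM records.

```perl
use strict;
use warnings;

my %AA = (
    ALA => 'A', ARG => 'R', ASN => 'N', ASP => 'D', CYS => 'C',
    GLN => 'Q', GLU => 'E', GLY => 'G', HIS => 'H', ILE => 'I',
    LEU => 'L', LYS => 'K', MET => 'M', PHE => 'F', PRO => 'P',
    SER => 'S', THR => 'T', TRP => 'W', TYR => 'Y', VAL => 'V',
);

# Return a list of [chain, one-letter sequence] pairs in file order.
sub pdb_to_seqs {
    my ($fh) = @_;
    my (%seq, %seen, @order);
    while (my $line = <$fh>) {
        next unless $line =~ /^ATOM  / && length($line) >= 27;
        my $res   = substr $line, 17, 3;    # residue name
        my $chain = substr $line, 21, 1;    # chain identifier
        my $pos   = substr $line, 22, 5;    # residue number + insertion code
        next if $seen{$chain}{$pos}++;      # one letter per residue, not per atom
        push @order, $chain unless exists $seq{$chain};
        $seq{$chain} .= $AA{$res} // 'X';
    }
    return map { [$_, $seq{$_}] } @order;
}
```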

converting multiple files from directory

Hi there. Is there a command that converts all the files in one or more folders under a directory and writes the output to a separate folder?

Support (multi-line) FASTQ files

How did I forget this?
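
A sketch of a multi-line FASTQ reader, under the usual assumption that the quality string has exactly as many characters as the sequence. That length check is what makes it safe for quality lines to start with `@`. `parse_fastq` here is a hypothetical stand-alone function, not the tool's code.

```perl
use strict;
use warnings;

# Read FASTQ records whose sequence and quality may span several lines.
# Returns a list of [header, sequence] pairs; qualities are discarded.
sub parse_fastq {
    my ($fh) = @_;
    my @records;
    while (my $header = <$fh>) {
        chomp $header;
        next unless length $header;          # tolerate blank lines
        $header =~ s/^\@// or die "Expected '\@' header, got: $header\n";
        my $seq = '';
        while (my $line = <$fh>) {           # sequence lines up to the "+" line
            chomp $line;
            last if $line =~ /^\+/;
            $seq .= $line;
        }
        my $qual = '';
        while (length($qual) < length($seq)) {   # consume matching quality
            my $line = <$fh>;
            last unless defined $line;
            chomp $line;
            $qual .= $line;
        }
        push @records, [$header, $seq];
    }
    return @records;
}
```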

Add EMBL file support

Similar to GBK - might need to refactor?
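
For comparison with the GBK case, a sketch of an EMBL reader. The differences are the `ID` line (identifier up to the first `;`), the `SQ` line that starts the sequence block, and the base count at the end of each sequence line instead of a coordinate at the start. `parse_embl` is a hypothetical name and the sketch ignores everything except the ID and the sequence.

```perl
use strict;
use warnings;

# Parse EMBL flat-file text and return a list of [id, dna] pairs.
sub parse_embl {
    my ($fh) = @_;
    my @records;
    my ($id, $dna, $in_seq) = (undef, '', 0);
    while (my $line = <$fh>) {
        if ($line =~ /^ID\s+([^\s;]+)/) {
            ($id, $dna, $in_seq) = ($1, '', 0);
        }
        elsif ($line =~ /^SQ\s/) {
            $in_seq = 1;
        }
        elsif ($line =~ m{^//}) {
            push @records, [$id, $dna] if defined $id;
            ($id, $dna, $in_seq) = (undef, '', 0);
        }
        elsif ($in_seq) {
            (my $s = $line) =~ s/[\s\d]+//g;    # drop spacing and trailing count
            $dna .= $s;
        }
    }
    push @records, [$id, $dna] if defined $id && length $dna;    # missing "//"
    return @records;
}
```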

purify_dna called twice

I think I am calling purify_dna() twice in the Genbank parser.

Fails badly when input is /dev/null

./any2fasta /dev/null
This is any2fasta 0.1.0
Read 0 lines from input
Use of uninitialized value $line[0] in pattern match (m//) at ./any2fasta line 72.
Use of uninitialized value $line[0] in pattern match (m//) at ./any2fasta line 73.
Use of uninitialized value $line[0] in pattern match (m//) at ./any2fasta line 74.
Use of uninitialized value $line[0] in concatenation (.) or string at ./any2fasta line 75.
ERROR: Could not determine input sequence format:

check for lines==0

Add CLUSTAL format

CLUSTAL W (1.81) multiple sequence alignment

gene02          ATGCTAGAATATGCTCTGAG--ATATTCAATATATCGTGCTAGGATATGCTCTGAGATAT
gene01          ATGCRAGGATATGCTCTGAGATATATTCTATATATCGTGCTAGGATATGCTCTGAGATAT
gene03          ATGCT---ACATGCTCTGAGACATATTCTATATATCGTGCTAGG---TGCTCTGAGATAT
                ****    * **********  ****** ***************   *************

gene02          ANNCTAGATATCGGCTAGGATATGTTCTGAGATATATTCTTTTATATCG
gene01          ATTCTATATATCGGCTAGGATATGCTCTGAGATATATTC-TATATATCG
gene03          ATTCTATATATCGGCTAGGATATGCTCTGAGATATATTC-TATATATCG
                *  *** ***************** ************** * *******

The first line could also say MUSCLE, which is what MUSCLE writes in its -clw mode.
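
A sketch of how the blocks could be stitched back together, assuming each alignment row is an ID followed by whitespace and a sequence chunk, and that conservation lines start with whitespace. `parse_clustal` is a hypothetical name. Gap characters are kept, so the result is an aligned FASTA; they could be removed with `tr/-//d` if unaligned output is wanted.

```perl
use strict;
use warnings;

# Parse a CLUSTAL-style alignment and return [id, aligned sequence] pairs
# in order of first appearance.
sub parse_clustal {
    my ($fh) = @_;
    my $header = <$fh>;
    die "Not a CLUSTAL/MUSCLE alignment\n"
        unless defined $header && $header =~ /^(?:CLUSTAL|MUSCLE)/;
    my (%seq, @order);
    while (my $line = <$fh>) {
        # Blank lines and conservation lines do not start with an ID
        next unless $line =~ /^(\S+)\s+(\S+)/;
        my ($id, $chunk) = ($1, $2);
        push @order, $id unless exists $seq{$id};
        $seq{$id} .= $chunk;
    }
    return map { [$_, $seq{$_}] } @order;
}
```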

Handle multiple files of different types

any2fasta 1.gbk 2.fna 3.gff.gz 4.fa.bz2 5.gbff.zip

Loop over them?

Skip blank lines in parse_fasta

CTTATCAAGCCGTTTCAGGTGCTGGTATGGGAGCAATTCTTGAGACACAACGTGAACTTCGTGAAGTCTT
GAATGATGGTGTGAAACCATGTGATTTGCATGCGGAAATTTTGCCTTCAGGTGGTGACAAGAAACATTAT

>gi|480530348|ref|NC_021005.1| Streptococcus pneumoniae SPN994039 draft genome
AAGGTTATCCACTATGTTTTTCGATAAAAAGCTTAATAAATCAATAATTTCTTCTTTTATCCCCAACCTG
TGGATAAAGTTTGGTAACATTGTGGATTATTTTTCACAGCTTGTGGAAAATTCTTGCTATCTATGGTAAA

"ERROR: Unfamilar format with first line" with gzipped file

Hi,

I'm trying to run abricate on a list of files but get the "unfamilar format" error from any2fasta with all my gzip-compressed files.
The output looks like this:

This is any2fasta 0.4.2
Opening '2021-07-09-18_S6_L001_run_001-assembly.fasta.gz'
ERROR: Unfamilar format with first line: [raw binary gzip data omitted]

Any idea what may be the problem?

Cheers. :)

Add Stockholm format

https://en.wikipedia.org/wiki/Stockholm_format

Used by pfam, rfam etc

Option -w to set line length not working

GBK should use VERSION not ACCESSION for sequence ID

When processing a GBK file, the VERSION field should be used for sequence ID in the FASTA file rather than the ACCESSION:

ACCESSION   NZ_CP069563
VERSION     NZ_CP069563.1

So, instead of getting a FASTA with:

>NZ_CP069563

We would get:

>NZ_CP069563.1
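
A sketch of the requested lookup order, assuming the record header is available as one string. `gbk_seq_id` is a hypothetical helper; falling back to ACCESSION and then to the LOCUS name when VERSION is absent is an assumption, not something specified in this issue.

```perl
use strict;
use warnings;

# Choose the FASTA ID for a GenBank record: VERSION first, then
# ACCESSION, then the LOCUS name. Returns nothing if none is present.
sub gbk_seq_id {
    my ($header_text) = @_;
    return $1 if $header_text =~ /^VERSION\s+(\S+)/m;
    return $1 if $header_text =~ /^ACCESSION\s+(\S+)/m;
    return $1 if $header_text =~ /^LOCUS\s+(\S+)/m;
    return;
}

print '>', gbk_seq_id("ACCESSION   NZ_CP069563\nVERSION     NZ_CP069563.1\n"), "\n";
# prints ">NZ_CP069563.1"
```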

Only converts first read of nanopore fastq.gz file

I have some fastq.gz files from nanopore sequencing. They have a lot of reads.

$ wc -l nanopore_reads/*
  2337214 nanopore_reads/1326974.fastq.gz
   293071 nanopore_reads/3433012.fastq.gz
  2630285 total

But when I use any2fasta, I only get the first read:

$ any2fasta nanopore_reads/1326974.fastq.gz 
This is any2fasta 0.4.2
Opening 'nanopore_reads/1326974.fastq.gz'
Detected FASTQ format
Read 4 lines from 'nanopore_reads/1326974.fastq.gz'
>548c0b75-3444-4fb3-8a42-cedb213fe177 runid=fca3ca39efc128e221a14cb984aff5e59479f18e read=13 ch=131 start_time=2021-08-17T21:32:40Z flow_cell_id=FAN24319 protocol_group_id=UT-GXB02179-210817 sample_id=UT-GXB02179-210817 barcode=barcode06 barcode_alias=barcode06
TGTATCAAATAGCAAATCCATATGCAATTGGATTTCGAACTAATGTGTAATGCATACCAAATGGTGAGATAAGCCACAAAGGAACGTAGTTTGTATTTACCATTCAAACTCTAAATTACTTATTAAAACCTGAGAAGTGGGTTTCATTTATATAACTAAAGGCACTTTTCTAATTTAAGACCAGTCTTCCTCCAAATCTTCGATTACGAATAGTTAGTAATTTCTTCTTCATATTTTGATGTTGCCTCAGTTACTACAGCATTTCGACCACCTGTAACTCCATATCGGTCAAAGCCATTTTCAGCATCTCTAGAAGTTGTATTATTTCTGAAAGATAAACTTTTGCCTTACCATCTTTAATTAGACTAGCTTTGCCTCTTACATTTGGGCATTTCTTTTCTGGGATTTCAATGACGATTACATATGCTTGAGTTGCCATAATTTAATTACACCTTCTCAATTATGAAATCAGCCTTAGTTGTCATTCCACCACTAAAACTACTCTTGGGCAGTTAACTTAGCAGACTCTTCATCATCTATTTGACCGACTAAACTTTTTCAGGTGTATTTCTCGCAAAGAGCCAGAAAGTAGT
Wrote 1 sequences from FASTQ file.
Processed 1 files.
Done.

I was expecting each read to be exported to fasta format. Is there a flag somewhere that I'm missing?

Stripping version information?

Not clear if this is right spot for this, but is any2fasta stripping the version information from the accessions?

The issue

When using Nullarbor 2 and Snippy4 with --mask and a BED file created with phaster-query.pl, snippy-core crashed because ref.fa only had the accession number (NZ_CP016638) without the version number whereas the BED file had the accession plus version (NZ_CP016638.1). Manually adding the version numbers to ref.fa fixed the issue.