tseemann / any2fasta
Convert various sequence formats to FASTA
License: GNU General Public License v3.0
Maybe we want to parse NCBI pipe specifiers?
e.g. >gb|xxxxx|ref|yyyyyy|
This could then be a drop-in replacement in MLST, Abricate etc.
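A minimal sketch of what parsing those pipe specifiers could look like. It is written in Python rather than the tool's Perl, and the tag list and the "last accession wins" rule are assumptions for illustration, not any2fasta behaviour.

```python
# Database tags whose value is an accession (an assumed subset of NCBI's tags).
ACCESSION_TAGS = {"gb", "emb", "dbj", "ref", "sp", "tpg", "tpe", "tpd"}

def parse_pipe_id(header):
    """Return the last accession in an NCBI pipe-delimited ID, else the ID itself."""
    words = header.lstrip(">").split()
    seq_id = words[0] if words else ""
    fields = seq_id.split("|")
    best = None
    # Walk consecutive (tag, value) pairs and remember the last accession seen.
    for tag, value in zip(fields, fields[1:]):
        if tag in ACCESSION_TAGS and value:
            best = value
    return best or seq_id
```

With this rule, `>gb|xxxxx|ref|yyyyyy|` would become `yyyyyy`, and an ID without pipes is returned unchanged.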
Technically a GBK is invalid if it is missing the // after the SEQ block.
But we could be nice and still output any sequence we have?
i.e. if we see EOF or another LOCUS header we can assume the record has ended.
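The lenient behaviour described here could be sketched as follows. This is a Python illustration under simplifying assumptions (the record name is taken from the LOCUS line, and everything after ORIGIN is sequence), not the actual parser.

```python
def genbank_records(lines):
    """Yield (locus_name, sequence) pairs, tolerating a missing '//' terminator."""
    name, seq, in_seq = None, [], False
    for line in lines:
        if line.startswith("LOCUS"):
            if name is not None:
                # A new LOCUS before '//' means the previous record has ended.
                yield name, "".join(seq)
            parts = line.split()
            name = parts[1] if len(parts) > 1 else ""
            seq, in_seq = [], False
        elif line.startswith("ORIGIN"):
            in_seq = True
        elif line.startswith("//"):
            if name is not None:
                yield name, "".join(seq)
            name, seq, in_seq = None, [], False
        elif in_seq:
            # Keep letters only; this drops the coordinate prefix and the spaces.
            seq.append("".join(ch for ch in line if ch.isalpha()))
    if name is not None:
        # EOF without '//' still flushes the last record.
        yield name, "".join(seq)
```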
Default should be 60
if ($in_seq) {
  # 421 ctctcaaact aaagccgtct cactctccat gagtcgttcg acagatcgcg ttttaaattg
  my $s = substr $_, 10; # trim the coordinate prefix
  $s =~ s/\s//g;         # drop the spaces between the 10-base blocks
  $dna .= $s . "\n";     # keep one output line per 60-base GenBank line
}
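For formats that do not already arrive in 60-base lines, a width of 60 could be applied when writing. A small Python sketch; the function name and the "0 means no wrapping" convention are assumptions for illustration.

```python
def wrap_sequence(seq, width=60):
    """Break a sequence into lines of at most `width` characters."""
    if width <= 0:
        return seq  # assumed convention: non-positive width disables wrapping
    return "\n".join(seq[i:i + width] for i in range(0, len(seq), width))
```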
Avoid loading whole file if it's not going to be used.
Especially for FASTQ files.
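One way to avoid loading the whole file is to read a single line for format detection and stream the rest. A Python sketch of that idea; `peek_lines` is a hypothetical helper, not part of any2fasta.

```python
import io
import itertools

def peek_lines(handle):
    """Return (first_line, iterator); the iterator still yields every line."""
    first = handle.readline()
    # Re-attach the line we consumed so downstream code sees the full stream.
    rest = itertools.chain([first], handle) if first else iter(())
    return first.rstrip("\r\n"), rest

# Example: only one line is read before the format can be decided.
first, lines = peek_lines(io.StringIO("@read1\nACGT\n+\nIIII\n"))
```

Memory use then depends on the largest record rather than on the file size, which matters most for large FASTQ files.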
Hi,
GBFF files from NCBI usually have the format shown below. any2fasta outputs the entire sequence of the record, but how can I output just the CDS sequence? Thanks.
LOCUS XM_017747270 2892 bp mRNA linear PLN 09-AUG-2016
DEFINITION PREDICTED: Gossypium arboreum serine/threonine protein phosphatase
2A regulatory subunit B''alpha-like (LOC108487170), transcript
variant X1, mRNA.
ACCESSION XM_017747270
VERSION XM_017747270.1
DBLINK BioProject: PRJNA335838
KEYWORDS RefSeq.
SOURCE Gossypium arboreum
ORGANISM Gossypium arboreum
Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta;
Spermatophyta; Magnoliophyta; eudicotyledons; Gunneridae;
Pentapetalae; rosids; malvids; Malvales; Malvaceae; Malvoideae;
Gossypium.
COMMENT MODEL REFSEQ: This record is predicted by automated computational
analysis. This record is derived from a genomic sequence
(NC_030664.1) annotated using gene prediction method: Gnomon.
Also see:
Documentation of NCBI's Annotation Process
##Genome-Annotation-Data-START##
Annotation Provider :: NCBI
Annotation Status :: Full annotation
Annotation Version :: Gossypium arboreum Annotation
Release 100
Annotation Pipeline :: NCBI eukaryotic genome annotation
pipeline
Annotation Software Version :: 7.1
Annotation Method :: Best-placed RefSeq; Gnomon
Features Annotated :: Gene; mRNA; CDS; ncRNA
##Genome-Annotation-Data-END##
FEATURES Location/Qualifiers
source 1..2892
/organism="Gossypium arboreum"
/mol_type="mRNA"
/cultivar="Shixiya1"
/db_xref="taxon:29729"
/chromosome="1"
/country="China"
/collection_date="May-2010"
gene 1..2892
/gene="LOC108487170"
/note="Derived by automated computational analysis using
gene prediction method: Gnomon. Supporting evidence
includes similarity to: 16 Proteins, and 100% coverage of
the annotated genomic feature by RNAseq alignments,
including 10 samples with support for all annotated
introns"
/db_xref="GeneID:108487170"
CDS 634..2271
/gene="LOC108487170"
/codon_start=1
/product="serine/threonine protein phosphatase 2A
regulatory subunit B''alpha-like"
/protein_id="XP_017602759.1"
/db_xref="GeneID:108487170"
/translation="MSLSIKMDIDAVEDVTCLDPELLQLPDVSPFALKASPQLVEDFF
SQWLSLPGTGHLVKSLIDDAKSGTIVNASANFSTLNAVGSHSLSSMFPSSNAPPLSPR
SSSGSPRTSKQKSSPSALGSPLKLVSEPMQEIIPQFYFQNGCPPTKELKEQCLSQINH
LFNNPLNGLQIDEFKAVTKEVCKLPSFLSSALFRKIDVEWTGIVTRDAFIKYWVDGNM
LTMDIATQIFEILKRPGCKYLTQVDFKPVLRELLATHPGLEFLRNTPEFQDRYAETVI
YRIFYHINRSGNGRLTLRELKRGSLVAAMQHADEEEDINKVLRYFSYEHFYVIYCKFW
ELDTDHDFFIDRENLIRYGNHALTYRIVDRIFSQAPRKFTSEVEGKMGYEDFVYFMFS
EEDKSSQPSLEYWFKCIDLDGNGVLTPNEMQFFYEEQLHRMECMAQEPVLFEDILCQI
IDMIAPEREYCITLQDLKRCKLSGNVFNILFNLNKFVAFESRDPFLIRQEREEPTLTE
WDHFAHREYIRLSMEEDVEDASNGSAEVWDESLEAPF"
ORIGIN
1 tatctttcat ccttcttcgc tgcagcttcc tattcctttt agtttcccct atgtccactc
61 tctctgtaat aaaatcaaat gctaataata atactttgat ttctctgctc ctgttttctt
121 cctctctccg tttcttttta atttttaaaa ccattcccta cttttaatca aattcacgtc
181 aaatctcatt atcttcttgg catttttaag ttttttttcc gcactgaaag ttaacggaaa
241 gtactcgaga atttatcagt ttctcttttt ggaagtaaaa caggctaaat tctttcgaga
301 ctcttcgaag gatttggtat tccagtttat tcataacgcc ggcagctagg gttttggaga
361 acggcgtatt ttaaacggtt acgtttctac ttccgttgaa gaaaaaaagg attttaccgt
421 cttttttcct taactctttg gagcaagatt ttgtaattat ttccacggta tcgtcaattt
481 accatatcat ttcggagcgt gttctttttc ccagttagag aaatctccga agtggcgttg
541 atttcttttt gctgttgcat ttgaagaatt tgaaagagtt acaagtttta gggtgtttat
601 ttttatttag tgctgtttga taaggtaggc gagatgtcat tatctataaa gatggatatt
661 gatgcagtgg aggatgttac ttgtttggac cctgagcttt tgcagcttcc tgatgtttct
721 ccatttgcac taaaagccag tcctcaactt gtagaggact ttttctctca gtggctttcg
781 cttcctggga ccggccatct ggtgaaatct ttgattgatg atgcaaagtc agggacaata
841 gttaacgctt ctgcaaactt ttctactcta aatgctgttg ggagccattc gttgtcttcc
901 atgtttccaa gtagcaatgc acctccactt tctccaagaa gctcatctgg ttctcctcgc
961 acgtcaaagc agaagtccag cccttctgct cttggctctc cattgaaatt agttagtgaa
1021 ccaatgcaag aaatcattcc acagttttat ttccaaaatg gttgtccacc aaccaaggaa
1081 ttgaaagaac aatgtctttc tcaaattaat caccttttta ataatcctct aaatggattg
1141 caaatagatg agtttaaagc agtgacaaag gaagtttgca agctaccatc tttcctctct
1201 tctgcacttt ttagaaaaat agatgtagag tggactggaa tagtgaccag agatgctttc
1261 attaagtatt gggttgatgg aaatatgctg acgatggata tagcaactca aatatttgaa
1321 attcttaagc gtccaggctg caagtacctc actcaggttg acttcaaacc tgttcttcga
1381 gaacttttgg cgacccatcc aggattagaa ttcctgcgga acacgcctga atttcaagat
1441 agatacgctg aaactgtcat atacagaata ttttatcaca tcaatagatc gggaaatggc
1501 cgtcttaccc tcagggagct caaaagagga agtctggttg ctgccatgca acatgctgat
1561 gaggaagagg acattaacaa agtccttagg tacttctcat atgaacattt ctatgttata
1621 tactgtaagt tttgggagtt ggacacggac catgatttct tcatcgacag agaaaatctc
1681 attagatatg gcaatcatgc ccttacctac aggattgttg atagaatatt ttcacaggct
1741 ccacgaaaat ttactagtga ggtagaaggg aagatgggtt atgaggactt tgtctacttc
1801 atgttttcgg aggaggacaa atcatctcag cctagtcttg agtattggtt taagtgcata
1861 gatttggatg gaaatggtgt gctgacgcca aatgaaatgc aatttttcta tgaggagcag
1921 ctgcatcgaa tggaatgcat ggcccaggaa cctgtgctct ttgaggacat attgtgtcaa
1981 ataattgaca tgattgctcc tgagagagaa tattgcatca cgctacagga tttgaaaaga
2041 tgcaaacttt caggaaatgt ttttaacatc cttttcaatc ttaataagtt tgtggctttc
2101 gaaagccgtg atccattcct catacggcag gaacgtgagg aaccaacttt gacagagtgg
2161 gatcactttg cacatagaga gtatatcagg ctttcaatgg aagaagatgt tgaagacgct
2221 tcgaatggga gtgctgaagt atgggatgag tcgcttgaag ctccatttta atttttaagg
2281 ttgctgaggt gagttttgta gtaccttgtc aaaagataat attcaaggtg aatgaagaaa
2341 aattggctac ttggacattc tgcagatggt gtgcttgtct gcaaagtgat tggccacaag
2401 cttcaaattc attcgtatag attttaccta tatagttcac ctgcaggcta tctagttgcc
2461 atttttgcaa ctaagtggcg gcaacaaaat ttctgtcagg aaagccaatt gcttctcata
2521 caagagaggg ttgattctcc ctgctcttaa ctaatcacca tctccctccc aggccaggta
2581 tcaacagtct gctactatgt taaaactttt tgttctgttt ttagttggtg aaacaatcat
2641 ttactgttat cagtctgtgc ctttggggtc gtggaggaaa gtaaaggtgg atggtggata
2701 ctgcgattgc cttgttttgg tttagtggcc gcccctatct ttgttgccaa acagaaattt
2761 cgttccccct tcgttactag ctcaacgact cttacctttt tttctcagtt tttggtacaa
2821 tgtacatgtt ccttattttt ttgatccagt gggtgaaatg aacacttttt tttttttaaa
2881 aaaggaaaag tt
//
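For a simple location such as the `CDS 634..2271` feature in this record (1638 bases), the CDS could be cut out of the ORIGIN sequence after conversion. A Python sketch that assumes only plain `start..end` and `complement(start..end)` locations; `join(...)` and other compound locations are deliberately not handled.

```python
import re

SIMPLE_LOCATION = re.compile(r"^<?(\d+)\.\.>?(\d+)$")
COMPLEMENT = str.maketrans("acgtACGT", "tgcaTGCA")

def extract_feature(sequence, location):
    """Slice a 1-based, inclusive GenBank location out of a sequence."""
    reverse = False
    match = re.fullmatch(r"complement\((.+)\)", location)
    if match:
        reverse, location = True, match.group(1)
    match = SIMPLE_LOCATION.match(location)
    if not match:
        raise ValueError(f"unsupported location: {location}")
    start, end = int(match.group(1)), int(match.group(2))
    sub = sequence[start - 1:end]  # convert to 0-based, end-exclusive
    return sub.translate(COMPLEMENT)[::-1] if reverse else sub
```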
Add --outfmt bed?
Perhaps if symlinked to any2bed, behave as if run with --outfmt bed?
Useful for making --mask files for snippy etc.
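The request does not say which intervals the BED output should contain. One plausible reading, assumed here purely for illustration, is one interval per sequence spanning its full length (BED coordinates are 0-based and end-exclusive). In Python:

```python
def bed_lines(records):
    """Yield one tab-separated BED line per (seq_id, sequence) pair."""
    for seq_id, sequence in records:
        # A full-length interval starts at 0 and ends at the sequence length.
        yield f"{seq_id}\t0\t{len(sequence)}"
```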
Hi,
For PDB files from protein structures, e.g. like those predicted by alphafold2, it would be great to have any2fasta work on PDB files.
Initial simple request (pseudocode using csvtk, of which you are also a fan!):
cat ranked_0.pdb | csvtk space2tab | csvtk cut -H -t -f 4,6 | csvtk uniq -H -t -f 2 | turn-3-letter-code-to-single-letter-code | stitch to single line of AAs
If more than one chain:
cat ranked_0.pdb | csvtk space2tab | csvtk cut -H -t -f 4,5,6 | csvtk uniq -H -t -f 2,3 | foreach chain; do turn-3-letter-code-to-single-letter-code | stitch to single line of AAs
I hope this is clear enough. Having this in any2fasta would add yet another conversion (here PDB) into FASTA available in a single repo.
Thanks in advance
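The csvtk pseudocode above could be sketched in Python using the fixed-width ATOM columns of the PDB format (residue name in columns 18-20, chain ID in column 22, residue number plus insertion code in columns 23-27). The function name and the use of `X` for non-standard residues are assumptions.

```python
THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}

def pdb_chains(lines):
    """Return {chain_id: sequence} from ATOM records, one letter per residue."""
    chains, seen = {}, set()
    for line in lines:
        if not line.startswith("ATOM"):
            continue
        res_name = line[17:20].strip()
        chain_id = line[21:22].strip() or "_"   # blank chain IDs become "_"
        residue_key = (chain_id, line[22:27])   # residue number + insertion code
        if residue_key in seen:
            continue                            # later atoms of the same residue
        seen.add(residue_key)
        chains.setdefault(chain_id, []).append(THREE_TO_ONE.get(res_name, "X"))
    return {chain: "".join(residues) for chain, residues in chains.items()}
```

Each chain would then be written as its own FASTA record, for example with the chain ID appended to the file name as the sequence ID.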
Hi there. Is there a command that converts files located in one or more folders under a single directory and writes the output to a separate folder?
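A wrapper script is one way to do this. The Python sketch below assumes that any2fasta is on the PATH and writes FASTA to standard output; the `*.gbk` pattern, the function names and the flattened output naming are illustrative choices.

```python
from pathlib import Path
import subprocess

def plan_conversions(input_root, output_dir, pattern="*.gbk"):
    """Pair every matching file under input_root with an output FASTA path."""
    input_root, output_dir = Path(input_root), Path(output_dir)
    plan = []
    for src in sorted(input_root.rglob(pattern)):
        # Fold the sub-folder names into the file name to avoid collisions.
        rel = src.relative_to(input_root).with_suffix(".fasta")
        plan.append((src, output_dir / "_".join(rel.parts)))
    return plan

def run_conversions(plan, tool="any2fasta"):
    """Run the converter once per input, redirecting stdout to the output file."""
    for src, dest in plan:
        dest.parent.mkdir(parents=True, exist_ok=True)
        with open(dest, "wb") as out:
            subprocess.run([tool, str(src)], stdout=out, check=True)
```

Calling `run_conversions(plan_conversions("genomes", "fasta_out"))` would convert every `.gbk` file below `genomes/` into `fasta_out/`.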
How did I forget this?
Similar to GBK - might need to refactor?
I think I am calling purif_dna() twice in the Genbank parser.
./any2fasta /dev/null
This is any2fasta 0.1.0
Read 0 lines from input
Use of uninitialized value $line[0] in pattern match (m//) at ./any2fasta line 72.
Use of uninitialized value $line[0] in pattern match (m//) at ./any2fasta line 73.
Use of uninitialized value $line[0] in pattern match (m//) at ./any2fasta line 74.
Use of uninitialized value $line[0] in concatenation (.) or string at ./any2fasta line 75.
ERROR: Could not determine input sequence format:
check for lines==0
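The lines==0 check could sit in front of the format detection so that empty input gives one clear error instead of the warnings above. A Python sketch; the detection rules are assumed for illustration and are not the tool's actual regexes.

```python
def detect_format(first_line):
    """Guess the format from the first line, failing early on empty input."""
    if first_line is None or first_line.strip() == "":
        raise ValueError("input is empty or starts with a blank line")
    rules = [                      # assumed prefixes, checked in order
        (">", "fasta"),
        ("@", "fastq"),
        ("LOCUS", "genbank"),
        ("ID ", "embl"),
        ("CLUSTAL", "clustal"),
        ("MUSCLE", "clustal"),
        ("# STOCKHOLM", "stockholm"),
    ]
    for prefix, name in rules:
        if first_line.startswith(prefix):
            return name
    raise ValueError(f"could not determine input sequence format: {first_line[:40]!r}")
```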
CLUSTAL W (1.81) multiple sequence alignment
gene02 ATGCTAGAATATGCTCTGAG--ATATTCAATATATCGTGCTAGGATATGCTCTGAGATAT
gene01 ATGCRAGGATATGCTCTGAGATATATTCTATATATCGTGCTAGGATATGCTCTGAGATAT
gene03 ATGCT---ACATGCTCTGAGACATATTCTATATATCGTGCTAGG---TGCTCTGAGATAT
**** * ********** ****** *************** *************
gene02 ANNCTAGATATCGGCTAGGATATGTTCTGAGATATATTCTTTTATATCG
gene01 ATTCTATATATCGGCTAGGATATGCTCTGAGATATATTC-TATATATCG
gene03 ATTCTATATATCGGCTAGGATATGCTCTGAGATATATTC-TATATATCG
* *** ***************** ************** * *******
could also have MUSCLE as first line in its -clw mode.
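A Python sketch of converting an alignment like the one above, accepting either CLUSTAL or MUSCLE on the first line. Removing the gap characters by default is an assumption about the desired output.

```python
def clustal_to_records(lines, strip_gaps=True):
    """Return [(name, sequence)] from a CLUSTAL-style interleaved alignment."""
    it = iter(lines)
    header = next(it, "")
    if not header.startswith(("CLUSTAL", "MUSCLE")):
        raise ValueError("not a CLUSTAL-style alignment")
    blocks = {}
    for line in it:
        stripped = line.strip()
        # Skip blank lines and consensus rows made only of '*', ':' and '.'.
        if not stripped or set(stripped) <= set("*:. \t"):
            continue
        parts = stripped.split()
        if len(parts) < 2:
            continue
        blocks.setdefault(parts[0], []).append(parts[1])
    records = []
    for name, chunks in blocks.items():
        seq = "".join(chunks)
        records.append((name, seq.replace("-", "") if strip_gaps else seq))
    return records
```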
any2fasta 1.gbk 2.fna 3.gff.gz 4.fa.bz2 5.gbff.zip
Loop over them?
CTTATCAAGCCGTTTCAGGTGCTGGTATGGGAGCAATTCTTGAGACACAACGTGAACTTCGTGAAGTCTT
GAATGATGGTGTGAAACCATGTGATTTGCATGCGGAAATTTTGCCTTCAGGTGGTGACAAGAAACATTAT
>gi|480530348|ref|NC_021005.1| Streptococcus pneumoniae SPN994039 draft genome
AAGGTTATCCACTATGTTTTTCGATAAAAAGCTTAATAAATCAATAATTTCTTCTTTTATCCCCAACCTG
TGGATAAAGTTTGGTAACATTGTGGATTATTTTTCACAGCTTGTGGAAAATTCTTGCTATCTATGGTAAA
Hi,
I'm trying to run abricate on a list of files but get the "unfamilar format" error from any2fasta with all my gzip-compressed files.
The output looks like this:
This is any2fasta 0.4.2
Opening '2021-07-09-18_S6_L001_run_001-assembly.fasta.gz'
ERROR: Unfamilar format with first line: 3/b2021-07-09-18_S6_L001_run_001-assemb54 [followed by binary data]
Any idea what may be the problem?
Cheers. :)
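The error shows compressed bytes reaching the format check. One general way to make that less likely is to pick the decompressor from the file's leading magic bytes instead of its name. A Python sketch with hypothetical helper names; it is not a diagnosis of this particular file.

```python
import bz2
import gzip

# Leading bytes of common compressed formats.
MAGIC = [(b"\x1f\x8b", "gzip"), (b"BZh", "bzip2"), (b"PK\x03\x04", "zip")]

def sniff_compression(path):
    """Return 'gzip', 'bzip2', 'zip' or None based on the first bytes."""
    with open(path, "rb") as handle:
        head = handle.read(4)
    for magic, name in MAGIC:
        if head.startswith(magic):
            return name
    return None

def open_text(path):
    """Open a possibly compressed file for line-by-line text reading."""
    kind = sniff_compression(path)
    if kind == "gzip":
        return gzip.open(path, "rt")
    if kind == "bzip2":
        return bz2.open(path, "rt")
    if kind == "zip":
        raise ValueError("zip archives need a member to be chosen; not handled here")
    return open(path, "rt")
```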
https://en.wikipedia.org/wiki/Stockholm_format
Used by pfam, rfam etc
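Stockholm files hold `name sequence` rows between a `# STOCKHOLM 1.0` header and a `//` terminator, with markup on lines starting with `#`. A Python sketch of a converter; it reads only the first alignment in the file, and stripping `.` and `-` gaps is an assumed default.

```python
def stockholm_to_records(lines, strip_gaps=True):
    """Return [(name, sequence)] for the first alignment in a Stockholm file."""
    blocks = {}
    for line in lines:
        line = line.rstrip("\r\n")
        if line.startswith("//"):
            break                       # end of the first alignment
        if not line.strip() or line.startswith("#"):
            continue                    # header and #=GF/#=GC/#=GS/#=GR markup
        parts = line.split()
        if len(parts) < 2:
            continue
        # Interleaved files repeat the name, so chunks are concatenated.
        blocks.setdefault(parts[0], []).append(parts[1])
    records = []
    for name, chunks in blocks.items():
        seq = "".join(chunks)
        if strip_gaps:
            seq = seq.replace(".", "").replace("-", "")
        records.append((name, seq))
    return records
```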
When processing a GBK file, the VERSION field should be used for sequence ID in the FASTA file rather than the ACCESSION:
ACCESSION NZ_CP069563
VERSION NZ_CP069563.1
So, instead of getting a FASTA with:
>NZ_CP069563
We would get:
>NZ_CP069563.1
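The requested preference order could be sketched like this in Python. The fallback to ACCESSION and then to the LOCUS name when VERSION is absent is an assumption, and `genbank_seq_id` is a hypothetical helper.

```python
def genbank_seq_id(header_lines):
    """Pick a FASTA ID from GenBank header lines, preferring VERSION."""
    fields = {}
    for line in header_lines:
        for key in ("LOCUS", "ACCESSION", "VERSION"):
            # Keywords start in the first column; continuation lines are indented.
            if line.startswith(key + " "):
                parts = line.split()
                if len(parts) > 1:
                    fields.setdefault(key, parts[1])
    return fields.get("VERSION") or fields.get("ACCESSION") or fields.get("LOCUS")
```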
I have some fastq.gz files from nanopore sequencing. They have a lot of reads.
$ wc -l nanopore_reads/*
2337214 nanopore_reads/1326974.fastq.gz
293071 nanopore_reads/3433012.fastq.gz
2630285 total
But when I use any2fasta, I only get the first line:
$ any2fasta nanopore_reads/1326974.fastq.gz
This is any2fasta 0.4.2
Opening 'nanopore_reads/1326974.fastq.gz'
Detected FASTQ format
Read 4 lines from 'nanopore_reads/1326974.fastq.gz'
>548c0b75-3444-4fb3-8a42-cedb213fe177 runid=fca3ca39efc128e221a14cb984aff5e59479f18e read=13 ch=131 start_time=2021-08-17T21:32:40Z flow_cell_id=FAN24319 protocol_group_id=UT-GXB02179-210817 sample_id=UT-GXB02179-210817 barcode=barcode06 barcode_alias=barcode06
TGTATCAAATAGCAAATCCATATGCAATTGGATTTCGAACTAATGTGTAATGCATACCAAATGGTGAGATAAGCCACAAAGGAACGTAGTTTGTATTTACCATTCAAACTCTAAATTACTTATTAAAACCTGAGAAGTGGGTTTCATTTATATAACTAAAGGCACTTTTCTAATTTAAGACCAGTCTTCCTCCAAATCTTCGATTACGAATAGTTAGTAATTTCTTCTTCATATTTTGATGTTGCCTCAGTTACTACAGCATTTCGACCACCTGTAACTCCATATCGGTCAAAGCCATTTTCAGCATCTCTAGAAGTTGTATTATTTCTGAAAGATAAACTTTTGCCTTACCATCTTTAATTAGACTAGCTTTGCCTCTTACATTTGGGCATTTCTTTTCTGGGATTTCAATGACGATTACATATGCTTGAGTTGCCATAATTTAATTACACCTTCTCAATTATGAAATCAGCCTTAGTTGTCATTCCACCACTAAAACTACTCTTGGGCAGTTAACTTAGCAGACTCTTCATCATCTATTTGACCGACTAAACTTTTTCAGGTGTATTTCTCGCAAAGAGCCAGAAAGTAGT
Wrote 1 sequences from FASTQ file.
Processed 1 files.
Done.
I was expecting each read to be exported to fasta format. Is there a flag somewhere that I'm missing?
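The expected behaviour, one FASTA record per read, could be sketched in Python as a streaming converter. It assumes four-line FASTQ records and is not the tool's implementation. Python's gzip module reads through concatenated gzip members, so the sketch keeps going past the first member of such a file.

```python
import gzip

def fastq_to_fasta(lines):
    """Yield one FASTA record (as text) per four-line FASTQ record."""
    it = iter(lines)
    for header in it:
        header = header.rstrip("\r\n")
        if not header:
            continue
        if not header.startswith("@"):
            raise ValueError(f"expected '@' header, got: {header[:40]!r}")
        seq = next(it, "").rstrip("\r\n")
        next(it, None)  # '+' separator line
        next(it, None)  # quality line
        yield f">{header[1:]}\n{seq}\n"

def convert_gzipped_fastq(path, out):
    """Stream a gzipped FASTQ file to `out`; return the number of reads written."""
    count = 0
    with gzip.open(path, "rt") as handle:
        for record in fastq_to_fasta(handle):
            out.write(record)
            count += 1
    return count
```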
Not clear if this is the right spot for this, but is any2fasta stripping the version information from the accessions?
When using Nullarbor 2 and Snippy4 with --mask and a BED file created with phaster-query.pl, snippy-core crashed because ref.fa only had the accession number (NZ_CP016638) without the version number, whereas the BED file had the accession plus version (NZ_CP016638.1). Manually adding the version numbers to ref.fa fixed the issue.