pcantalupo / annotater Goto Github PK
View Code? Open in Web Editor NEWAnnotate sequences by BLAST using NCBI taxonomy information
License: MIT License
Annotate sequences by BLAST using NCBI taxonomy information
License: MIT License
Hi there,
This is a semi-automated message from a fellow bioinformatician. Through a GitHub search, I found that the following source files make use of BLAST's -max_target_seqs
parameter:
Based on the recently published report, Misunderstood parameter of NCBI BLAST impacts the correctness of bioinformatics workflows, there is a strong chance that this parameter is misused in your repository.
If the use of this parameter was intentional, please feel free to ignore and close this issue but I would highly recommend to add a comment to your source code to notify others about this use case. If this is a duplicate issue, please accept my apologies for the redundancy as this simple automation is not smart enough to identify such issues.
Thank you!
-- Arman (armish/blast-patrol)
When running annotater with Rapsearch, you need a parallel BLAST database with the same name as the Rapsearch database in the same directory. This is because of the hack to use blastdbcmd
in the Version method to get database information. There is no parallel command like blastdbcmd
in the rapsearch distribution.
i'm seeing multiple messages like this during Taxonomy step
print() on closed filehandle TMPOUT at /path/to/annotater/Reann.pm line 275, <IN> line 17
which is followed by this exception:
------------- EXCEPTION -------------
MSG: Could not read file '~/path/to/viral.1.1.genomic.fna.gis.txt.fa': No such file or directory
STACK Bio::Root::IO::_initialize_io /path/to/bioperl/bioperl-live/Bio/Root/IO.pm:268
STACK Bio::SeqIO::_initialize /path/to/bioperl/bioperl-live/Bio/SeqIO.pm:513
STACK Bio::SeqIO::fasta::_initialize /path/to/bioperl/bioperl-live/Bio/SeqIO/fasta.pm:87
STACK Bio::SeqIO::new /path/to/bioperl/bioperl-live/Bio/SeqIO.pm:389
STACK Bio::SeqIO::new /path/to/bioperl/bioperl-live/Bio/SeqIO.pm:435
STACK Reann::Taxonomy /path/to/annotater/Reann.pm:282
STACK toplevel /path/to/annotater/Reann.pl:17
-------------------------------------
Annotater hangs while trying to get lineage from NCBI for LXPA01121130.1. The output says:
GetLineage: didn't get valid taxid:<> for GI:LXPA01121130.1 so getting lineage from NCBI
Rapsearch 2.24 is outputting illegal evalues (like below) and Rapsearch.pm does not parse them properly
On a OSX 10.12.6, I ran prove -v Taxonomy.t
and got this error:
------------- EXCEPTION: Bio::Root::Exception -------------
MSG: Response Error
Can't verify SSL peers without knowing which Certificate Authorities to trust
STACK: Error::throw
STACK: Bio::Root::Root::throw /Users/pgc92/local/lib/bioperl/bioperl-live/Bio/Root/Root.pm:447
STACK: Bio::DB::GenericWebAgent::get_Response /Users/pgc92/local/lib/bioperl/bioperl-live/Bio/DB/GenericWebAgent.pm:214
STACK: Bio::DB::EUtilities::get_Response /Users/pgc92/perl5/lib/perl5/Bio/DB/EUtilities.pm:35
STACK: Taxonomy::accession2gi /Users/pgc92/local/lib/annotater/Taxonomy.pm:169
STACK: Taxonomy.t:19
-----------------------------------------------------------
I'm getting a 'No alias or index file found' error during Taxonomy step. I'm only getting this with Rapsearch databases that are not in $BLASTDB.
Taxonomy
Getting fasta seqs for protein_Mononeg_Aedes
BLAST Database error: No alias or index file found for nucleotide database [protein_Mononeg_Aedes] in search path
I'm getting blastdbcmd errors after moving BLAST database files to a new directory mount (/zfs2) on the HTC cluster. Never had these problems before on mobydisk
Taxonomy
Getting fasta seqs for /zfs2/bar/refs/blast/nr
Error: [blastdbcmd] Entry not found: YP7.1
Error: [blastdbcmd] Skipped YP7.1
Error: [blastdbcmd] Entry not found: XP_00643
Error: [blastdbcmd] Skipped XP_00643
When using BLAST+ => 2.5.0, the Description values in the Report are integer values. When I use BLAST+ <= 2.3.0 (didn't try 2.4), the Descriptions are as I expect (i.e. Tobacco mosaic virus, complete genome
)
This stems from a change in how blastdbcmd
extracts fasta sequences from the blast database. When this fullgi gi|9626125|ref|NC_001367.1|
is used with blastdbcmd
, it extracts the following
for BLAST >= 2.5.0
blastdbcmd -db viral.1.1.genomic -entry "gi|9626125|ref|NC_001367.1|" | head -n1
>NC_001367.1 Tobacco mosaic virus, complete genome
for BLAST <= 2.3.0
$ blastdbcmd -db viral.1.1.genomic -entry "gi|9626125|ref|NC_001367.1|" | head -n1
>gi|9626125|ref|NC_001367.1| Tobacco mosaic virus, complete genome
Then when Reann.pm reads the fasta file it uses the fasta identifier, which is an ACC.VER for BLAST+ >= 2.5, to update the %acc hash. With ACC.VER, it is creating a new hash key with the description. This new key (the ACC.VER) is never accessed again because later in the code, the fullgi from the Report file (the Accession field) is used to look up what is expected to be the Description. However, the hash value is the number of times the fullgi was found in the Report file (due to this line)
To test, use sewageseqs16.fa with
BLAST+ 2.3.0 (Descriptions are OK)
$ export PATH=/home/paul/local/usr/local/ncbi-blast-2.3.0+/bin/:$PATH
$ rm -rf annotator
$ Reann.pl -file ../sewageseqs16.fa -config sewageseqs.conf -num_threads 8 -evalue 1e-5
sewageseqs.conf
ann.0.fasta
tblastx -show_gis -num_threads 8 -outfmt "6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore qlen slen qcovs qcovhsp" -db viral.1.1.genomic -evalue 1e-5 -query ann.0.fasta -out ann.0.0.tblastx
Report
Taxonomy
Getting fasta seqs for viral.1.1.genomic
Skipping LocalTaxonomy
temp file is /tmp/3yjYOMnh2E
$ cut -f 1,7,8 annotator/ann.wTax.report.txt
seqID accession desc
All.viralseqs.fa.Contig1000 gi|9626125|ref|NC_001367.1| Tobacco mosaic virus, complete genome
All.viralseqs.fa.Contig1002 gi|764159063|ref|NC_026582.1| Uncultured phage WW-nAnB, complete genome
All.viralseqs.fa.Contig1006 gi|20177424|ref|NC_003630.1| Pepper mild mottle virus, complete genome
All.viralseqs.fa.Contig1007 gi|9630643|ref|NC_001918.1| Aichi virus, complete genome
All.viralseqs.fa.Contig1008 gi|20177424|ref|NC_003630.1| Pepper mild mottle virus, complete genome
All.viralseqs.fa.Contig1 gi|1043372815|ref|NC_030457.1| Circovirus-like genome DCCV-8, complete genome
All.viralseqs.fa.Contig10 gi|20260796|ref|NC_003692.1| Pariacato virus chromosome RNA2, complete genome
All.viralseqs.fa.Contig100 gi|45476493|ref|NC_005817.1| Sclerophthora macrospora virus A RNA 1, complete sequence
All.viralseqs.fa.Contig1001 gi|1127298492|ref|NC_032551.1| Beihai picorna-like virus 77 strain BHBei75652 hypothetical protein 1 and hypothetical protein 2 genes, complete cds
All.viralseqs.fa.Contig1003 gi|20177424|ref|NC_003630.1| Pepper mild mottle virus, complete genome
All.viralseqs.fa.Contig1004 gi|1127298656|ref|NC_032614.1| Beihai picorna-like virus 74 strain BHHK130969 hypothetical protein 1 and hypothetical protein 2 genes, complete cds
FTWLCJP02IU3XZ gi|593779948|ref|NC_023737.1| Mycobacterium phage Pleione, complete genome
FTWLCJP02IU4PY gi|203454602|ref|NC_011273.1| Mycobacterium phage Myrna, complete genome
FTWLCJP02IUB6X gi|197085614|ref|NC_011142.1| Iodobacteriophage phiPLPE, complete genome
FTWLCJP02IUGFL
FTWLCJP02IUNP9
BLAST+ 2.6.0 (Descriptions are not ok; integer values)
$ export PATH=/home/paul/local/usr/local/ncbi-blast-2.6.0+/bin/:$PATH
$ rm -rf annotator
$ Reann.pl -file ../sewageseqs16.fa -config sewageseqs.conf -num_threads 8 -evalue 1e-5
sewageseqs.conf
ann.0.fasta
tblastx -show_gis -num_threads 8 -outfmt "6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore qlen slen qcovs qcovhsp" -db viral.1.1.genomic -evalue 1e-5 -query ann.0.fasta -out ann.0.0.tblastx
Report
Taxonomy
Getting fasta seqs for viral.1.1.genomic
Skipping LocalTaxonomy
temp file is /tmp/JkhItzr1AK
$ cut -f 1,7,8 annotator/ann.wTax.report.txt
seqID accession desc
All.viralseqs.fa.Contig1000 gi|9626125|ref|NC_001367.1| 1
All.viralseqs.fa.Contig1002 gi|764159063|ref|NC_026582.1| 1
All.viralseqs.fa.Contig1006 gi|20177424|ref|NC_003630.1| 3
All.viralseqs.fa.Contig1007 gi|9630643|ref|NC_001918.1| 1
All.viralseqs.fa.Contig1008 gi|20177424|ref|NC_003630.1| 3
All.viralseqs.fa.Contig1 gi|1043372815|ref|NC_030457.1| 1
All.viralseqs.fa.Contig10 gi|20260796|ref|NC_003692.1| 1
All.viralseqs.fa.Contig100 gi|45476493|ref|NC_005817.1| 1
All.viralseqs.fa.Contig1001 gi|1127298492|ref|NC_032551.1| 1
All.viralseqs.fa.Contig1003 gi|20177424|ref|NC_003630.1| 3
All.viralseqs.fa.Contig1004 gi|1127298656|ref|NC_032614.1| 1
FTWLCJP02IU3XZ gi|593779948|ref|NC_023737.1| 1
FTWLCJP02IU4PY gi|203454602|ref|NC_011273.1| 1
FTWLCJP02IUB6X gi|197085614|ref|NC_011142.1| 1
FTWLCJP02IUGFL
FTWLCJP02IUNP9
When testing the Taxonomy.pm module using Taxonomy.t, recently been getting alot of errors like below (stemming from gi2taxid and taxid2lineage methods). I thought the Bioperl module has a built in delay so that it doesn't send too many requests.
------------- EXCEPTION: Bio::Root::Exception -------------
MSG: Response Error
Too Many Requests
STACK: Error::throw
STACK: Bio::Root::Root::throw /ihome/uchandran/pgc92/local/lib/bioperl/bioperl-live/lib/Bio/Root/Root.pm:449
STACK: Bio::DB::GenericWebAgent::get_Response /ihome/uchandran/pgc92/local/lib/bioperl/bioperl-live/lib/Bio/DB/GenericWebAgent.pm:215
STACK: Bio::DB::EUtilities::get_Response /ihome/uchandran/pgc92/local/lib/bioperl/Bio-EUtilities/lib/Bio/DB/EUtilities.pm:204
STACK: Bio::DB::EUtilities::get_Parser /ihome/uchandran/pgc92/local/lib/bioperl/Bio-EUtilities/lib/Bio/DB/EUtilities.pm:287
STACK: Bio::DB::EUtilities::next_DocSum /ihome/uchandran/pgc92/local/lib/bioperl/Bio-EUtilities/lib/Bio/DB/EUtilities.pm:856
STACK: Taxonomy::gi2taxid /ihome/uchandran/pgc92/local/lib/annotater/Taxonomy.pm:232
and
------------- EXCEPTION: Bio::Root::Exception -------------
MSG: Response Error
Too Many Requests
STACK: Error::throw
STACK: Bio::Root::Root::throw /ihome/uchandran/pgc92/local/lib/bioperl/bioperl-live/lib/Bio/Root/Root.pm:449
STACK: Bio::DB::GenericWebAgent::get_Response /ihome/uchandran/pgc92/local/lib/bioperl/bioperl-live/lib/Bio/DB/GenericWebAgent.pm:215
STACK: Bio::DB::EUtilities::get_Response /ihome/uchandran/pgc92/local/lib/bioperl/Bio-EUtilities/lib/Bio/DB/EUtilities.pm:204
STACK: Taxonomy::taxid2lineage /ihome/uchandran/pgc92/local/lib/annotater/Taxonomy.pm:276
pass_filters
is not implemented in building the final BE Report file. This functionality needs developed.
A declarative, efficient, and flexible JavaScript library for building user interfaces.
๐ Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. ๐๐๐
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google โค๏ธ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.