Code Monkey home page Code Monkey logo

annotater's Introduction

annotater

Got sequences that need annotated with various BLAST programs? Diamond and Rapsearch2 are supported also.

Quick Start

perl Makefile.PL
make
make test
make install

If you encounter errors check below for the necessary dependencies that are required.

Run annotater: Reann.pl -file YOURSEQUENCES.FA -config CONFIGFILE -tax -remotetax. These options tell annotater to search your sequences against the list of searches in your configuration file and to lookup taxonomy information (-tax) for subject hits by querying NCBI (-remotetax). Your results are found in annotator/ann.wTax.BE.report.txt (tab-delimited text).

Installation

Install the following and make sure they are working before proceeding:

Clone repository. Add the annotater directory path to your PATH and PERL5LIB variables. In addition, add annotater/bin directory path to your PATH variable. Set a BLASTDB environmental variable using a full path to the location of your BLAST databases (tilde ~ is not allowed in the path). All the BLAST databases need to be in the same folder unless you specify full paths in the configuration file.

Configuration

Annotater requires two input files: a fasta file and a configuration file. There are example configuration files in ./configs. Let's take a look at ./configs/annot.conf.

-exec blastn -db GCF_000001405.39_top_level -qc 50 -pid 80 -lcase_masking -max_target_seqs 2
-exec blastn -db nt -qc 50 -pid 80
-exec blastx -db nr -task blastx-fast
-exec tblastx -db ref_viruses_rep_genomes

Each line specifies a different Search, either a blast program (blastn, tblastx), diamond or rapsearch, with its associated parameters. The searches are run in serial fashion starting from the top. Each line must start with -exec to tell Annotater which search program to run. Currently only BLAST+, diamond and rapsearch are supported. The options -qc, query coverage, and -pid, percent identity, are specific to Annotater. Query coverage and percent identity specifies the minimum % query coverage and % identity for a hit to be significant. The remaining parameters are specific to each search program; check the appropriate documentation for available options.

Usage

Lets annotate the following sequence file e7.fa (the E7 gene from HPV16) with Annotater against the NCBI reference virus genomes.

>e7
atgcatggagatacacctacattgcatgaatatatgttagatttgcaaccagagacaactgatctctactgttatgagcaattaaatgacagctcagaggaggaggatgaaatagatggtccagctggacaagcagaaccggacagagcccattacaatattgtaaccttttgttgcaagtgtgactctacgcttcggttgtgcgtacaaagcacacacgtagacattcgtactttggaagacctgttaatgggcacactaggaattgtgtgccccatctgttctcagaaaccataa

Then download the reference virus genomes BLAST database and extract it. Next, create a configuration file called config.txt that contains:

-exec blastn -db /PATH/TO/ref_viruses_rep_genomes

You have to specify a full path to the database file unless you place the database files in your $BLASTDB directory. Then run Annotater with 4 threads and set an evalue cutoff of 1e-50.

Reann.pl -file e7.fa -config config.txt -num_threads 4 -evalue 1e-50

Options such as -num_threads and -evalue will be applied to each Search in the configuration file. The results are found in the file ./annotater/ann.wTax.BE.report.txt. The fields and their descriptions are:

  1. seqID - sequence identifier
  2. seq - sequence
  3. seqLength - sequence length
  4. pid - Percent identity of the alignment
  5. coverage - Percent of the query that participates in the alignment
  6. e - Evalue of the alignment
  7. accession - the accession number of the subject (a.k.a. hit) sequence in the database (see Column ‘db’).
  8. desc - the description of the subject sequence
  9. type - the taxonomic bin of the subject sequence. Possible values: virus (meaning eukaryotic viruses), phage, bacteria, human, mouse, fungi, and other.
  10. family - the virus family
  11. species - the species name of the subject
  12. genome - the genome type of the virus
  13. algorithm - the search algorithm used for this alignment
  14. db - the database used by the algorithm to search for an alignment of the contig
  15. qstart - the starting base in the query that participates in the alignment
  16. qend - the ending base in the query that participates in the alignment
  17. sstart - the starting base in the subject that participates in the alignment
  18. send - the ending base in the subject that participates in the alignment
  19. nsf - non-stop frame. If the value is 1, there is at least one frame in the contig that does not have a stop-codon.
  20. qent - nucleotide entropy of the query
  21. qhsp_ent - nucleotide entropy of the query sequence from Qstart to Qend, inclusive
  22. shsp_ent - nucleotide entropy of the subject sequence from Sstart to Send, inclusive
  23. shsp_%lc - percent of low complexity amino acids in the subject sequence from Sstart to Send, inclusive

Taxonomy module

In the e7.fa example above, notice that columns 9 to 12 are NULL in the report.txt file. This is because we didn't tell Annotater to run the Taxonomy module. If you want taxonomy information about each subject such as type (i.e. bacteria, virus, etc...), virus family, species, genome type (i.e. ssRNA+, etc...), rerun the Annotater command line but this time add the parameter -tax. However, if you don't have a local NCBI Taxonomy database properly installed (see below), add the option -remotetax as well (this query NCBI for taxonomy info). Since Annotater kept track of what searches were completed, it will skip running BLASTN again on the sequence and start immediately on determining the taxonomy information.

Reann.pl -file e7.fa -config config.txt -num_threads 4 -evalue 1e-50 -tax

NCBI Taxonomy database (optional)

Build a local copy of the NCBI Taxonomy database using the taxonomizr R package (Github). Run the following R commands (takes several hours and about 150GB of space):

library(taxonomizr)
prepareDatabase('accessionTaxa.sql', types = c('nucl_gb', 'nucl_wgs', 'prot'))

This will download the necessary NCBI Taxonomy files and create a file called accessionTaxa.sql. Then set the following environmental variables:

  1. TAXASQL - path to accessionTaxa.sql
  2. NAMESDMP and NODESDMP - full path to names.dmp and nodes.dmp, respectively

Running Annotater with Docker

You can build your own docker image by running docker build -t annotater . or you can download the annotater image from Docker Hub with docker pull virushunter/annotater

To run Annotater using the e7.fa example above, use the following command line but make sure to change the /PATH/TO/BLASTDIR as appropriate for your setup.

docker run --rm -ti -v $(pwd):$(pwd) -v /PATH/TO/BLASTDIR:/PATH/TO/BLASTDIR -w $(pwd) virushunter/annotater Reann.pl -file e7.fa -config config.txt -num_threads 4 -evalue 1e-50

If you want to run the Taxonomy module, you need to add a volume mount for your taxonomy directory that contains accessionTaxa.sql. Also, you need to set the environment variables pointing to the location of the taxonomy files. Here is the complete docker command (make sure to change the directory paths to match your setup)

docker run --rm -ti -v $(pwd):$(pwd) -v /PATH/TO/BLASTDIR:/PATH/TO/BLASTDIR \
-v /PATH/TO/TAXONOMYDIR:/PATH/TO/TAXONOMYDIR \
-w $(pwd) \
-e TAXASQL=/PATH/TO/TAXONOMYDIR/accessionTaxa.sql \
-e NAMESDMP=/PATH/TO/TAXONOMYDIR/names.dmp \
-e NODESDMP=/PATH/TO/TAXONOMYDIR/nodes.dmp \
virushunter/annotater Reann.pl -file e7.fa -config config.txt -num_threads 4 -evalue 1e-50

Using Diamond

In order to use Diamond, you need to add the -type parameter in the configuration file. For instance, if you want to run diamond blastx, add something like the following in your config file:

-exec diamond -type blastx -d /PATH/TO/nr.dmnd

Developer info

Reann->new()

  1. creates output folder
  2. copies fasta file to output folder and creates temporary sequence chunk files (named ann.X.fasta [X = 0 to N - 1 where N = num chunks] )
  3. copies configuration file to output folder
  4. writes version.txt file

Reann->run()

  1. Writes/updates restart.txt file
  2. generates a program (i.e. blast) outputfile for each step of pipeline for each sequence file chunk
  3. removes temporary sequence chunk files (ann.X.fasta)

Reann->Report()

  1. creates ann.report.txt file
  2. removes fasta file that was copied into output folder

Reann->Taxonomy()

  1. TBD

Configuration file

Line 122 - config file is removed. Remove this line so config file is kept in the output folder.

Restart file

The Restart file defines the step that finished. The first value is the sequence file chunk (0-based). The second value is the step of pipeline (0-based). Each step is defined on one line in the annotator config file.

For example, if restart = 0,1 then this means that Step 2 of the first sequence file chunk finished and that Step 3 (0,2) needs to be run next. However, there are two exceptions to this rule:

  1. At the start of Reann->run(), a restart file is written that contains 0,0 which may seem like it means that Chunk 1/Step 1 finished. However, Step 1 still needs run and after it runs, the restart file is rewritten with 0,0 to indicate that Chunk 1/Step 1 is finished.
  2. At end of Reann->run(), the restart file is rewritten with value of X,0 where X is equal to number of Fasta chunks + 1.

Let’s look at an example where there are three annotation steps. The restart file will look like the following before (line 109) and after (line 111) each step.

Step Beginning End
1 0,0 0,0
2 0,0 0,1
3 0,1 0,2

Deprecated

NCBI Taxonomy database (old)

Get the Taxonomy files

wget --quiet ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/gi_taxid_nucl.dmp.gz &
wget --quiet ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/gi_taxid_prot.dmp.gz &
wget --quiet ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz &
wget --quiet ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxcat.tar.gz &

Unzip each GZ file for file in *gz; do echo $file; gunzip $file; done

Untar the tar archives

tar xvf taxcat.tar
tar xvf taxdump.tar

Convert DMP file to BIN file for Bio::LITE::Taxonomy::NCBI::Gi2taxid module. See https://github.com/pcantalupo/mytaxonomy makebin.pl &

Then set the following environmental variables:

  1. NGT - full path to the gi_taxid_nucl .bin file that was created with the Bio::LITE::Taxonomy::NCBI::Gi2taxid module
  2. PGT - same as NGT but to the gi_taxid_prot .bin file
  3. NAMESDMP and NODESDMP - full path to names.dmp and nodes.dmp, respectively

annotater's People

Contributors

joshpk105 avatar pcantalupo avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar

annotater's Issues

Confirm that use of BLAST's `-max_target_seqs` is intentional

Hi there,

This is a semi-automated message from a fellow bioinformatician. Through a GitHub search, I found that the following source files make use of BLAST's -max_target_seqs parameter:

Based on the recently published report, Misunderstood parameter of NCBI BLAST impacts the correctness of bioinformatics workflows, there is a strong chance that this parameter is misused in your repository.

If the use of this parameter was intentional, please feel free to ignore and close this issue but I would highly recommend to add a comment to your source code to notify others about this use case. If this is a duplicate issue, please accept my apologies for the redundancy as this simple automation is not smart enough to identify such issues.

Thank you!
-- Arman (armish/blast-patrol)

EXCEPTION: Bio::Root::Exception MSG: Response Error Too Many Requests

When testing the Taxonomy.pm module using Taxonomy.t, recently been getting alot of errors like below (stemming from gi2taxid and taxid2lineage methods). I thought the Bioperl module has a built in delay so that it doesn't send too many requests.

------------- EXCEPTION: Bio::Root::Exception -------------
MSG: Response Error
Too Many Requests
STACK: Error::throw
STACK: Bio::Root::Root::throw /ihome/uchandran/pgc92/local/lib/bioperl/bioperl-live/lib/Bio/Root/Root.pm:449
STACK: Bio::DB::GenericWebAgent::get_Response /ihome/uchandran/pgc92/local/lib/bioperl/bioperl-live/lib/Bio/DB/GenericWebAgent.pm:215
STACK: Bio::DB::EUtilities::get_Response /ihome/uchandran/pgc92/local/lib/bioperl/Bio-EUtilities/lib/Bio/DB/EUtilities.pm:204
STACK: Bio::DB::EUtilities::get_Parser /ihome/uchandran/pgc92/local/lib/bioperl/Bio-EUtilities/lib/Bio/DB/EUtilities.pm:287
STACK: Bio::DB::EUtilities::next_DocSum /ihome/uchandran/pgc92/local/lib/bioperl/Bio-EUtilities/lib/Bio/DB/EUtilities.pm:856
STACK: Taxonomy::gi2taxid /ihome/uchandran/pgc92/local/lib/annotater/Taxonomy.pm:232

and

------------- EXCEPTION: Bio::Root::Exception -------------
MSG: Response Error
Too Many Requests
STACK: Error::throw
STACK: Bio::Root::Root::throw /ihome/uchandran/pgc92/local/lib/bioperl/bioperl-live/lib/Bio/Root/Root.pm:449
STACK: Bio::DB::GenericWebAgent::get_Response /ihome/uchandran/pgc92/local/lib/bioperl/bioperl-live/lib/Bio/DB/GenericWebAgent.pm:215
STACK: Bio::DB::EUtilities::get_Response /ihome/uchandran/pgc92/local/lib/bioperl/Bio-EUtilities/lib/Bio/DB/EUtilities.pm:204
STACK: Taxonomy::taxid2lineage /ihome/uchandran/pgc92/local/lib/annotater/Taxonomy.pm:276

Rapsearch outputs illegal evalues

Rapsearch 2.24 is outputting illegal evalues (like below) and Rapsearch.pm does not parse them properly

  • 0x0p+0
  • 0x1.3fa9b55ddbf92p-178
  • 0x1.e49bd1ea65088p-134

getting integer values for the Description column in Report.txt

When using BLAST+ => 2.5.0, the Description values in the Report are integer values. When I use BLAST+ <= 2.3.0 (didn't try 2.4), the Descriptions are as I expect (i.e. Tobacco mosaic virus, complete genome)

This stems from a change in how blastdbcmd extracts fasta sequences from the blast database. When this fullgi gi|9626125|ref|NC_001367.1| is used with blastdbcmd, it extracts the following

for BLAST >= 2.5.0

blastdbcmd -db viral.1.1.genomic -entry "gi|9626125|ref|NC_001367.1|" | head -n1
>NC_001367.1 Tobacco mosaic virus, complete genome

for BLAST <= 2.3.0

$ blastdbcmd -db viral.1.1.genomic -entry "gi|9626125|ref|NC_001367.1|" | head -n1
>gi|9626125|ref|NC_001367.1| Tobacco mosaic virus, complete genome

Then when Reann.pm reads the fasta file it uses the fasta identifier, which is an ACC.VER for BLAST+ >= 2.5, to update the %acc hash. With ACC.VER, it is creating a new hash key with the description. This new key (the ACC.VER) is never accessed again because later in the code, the fullgi from the Report file (the Accession field) is used to look up what is expected to be the Description. However, the hash value is the number of times the fullgi was found in the Report file (due to this line)

To test, use sewageseqs16.fa with

BLAST+ 2.3.0 (Descriptions are OK)

$ export PATH=/home/paul/local/usr/local/ncbi-blast-2.3.0+/bin/:$PATH
$ rm -rf annotator
$ Reann.pl -file ../sewageseqs16.fa -config sewageseqs.conf -num_threads 8 -evalue 1e-5
sewageseqs.conf
ann.0.fasta
tblastx -show_gis -num_threads 8 -outfmt "6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore qlen slen qcovs qcovhsp" -db viral.1.1.genomic -evalue 1e-5 -query ann.0.fasta -out ann.0.0.tblastx
Report
Taxonomy
        Getting fasta seqs for viral.1.1.genomic
        Skipping LocalTaxonomy
temp file is /tmp/3yjYOMnh2E
$ cut -f 1,7,8 annotator/ann.wTax.report.txt
seqID   accession       desc
All.viralseqs.fa.Contig1000     gi|9626125|ref|NC_001367.1|     Tobacco mosaic virus, complete genome
All.viralseqs.fa.Contig1002     gi|764159063|ref|NC_026582.1|   Uncultured phage WW-nAnB, complete genome
All.viralseqs.fa.Contig1006     gi|20177424|ref|NC_003630.1|    Pepper mild mottle virus, complete genome
All.viralseqs.fa.Contig1007     gi|9630643|ref|NC_001918.1|     Aichi virus, complete genome
All.viralseqs.fa.Contig1008     gi|20177424|ref|NC_003630.1|    Pepper mild mottle virus, complete genome
All.viralseqs.fa.Contig1        gi|1043372815|ref|NC_030457.1|  Circovirus-like genome DCCV-8, complete genome
All.viralseqs.fa.Contig10       gi|20260796|ref|NC_003692.1|    Pariacato virus chromosome RNA2, complete genome
All.viralseqs.fa.Contig100      gi|45476493|ref|NC_005817.1|    Sclerophthora macrospora virus A RNA 1, complete sequence
All.viralseqs.fa.Contig1001     gi|1127298492|ref|NC_032551.1|  Beihai picorna-like virus 77 strain BHBei75652 hypothetical protein 1 and hypothetical protein 2 genes, complete cds
All.viralseqs.fa.Contig1003     gi|20177424|ref|NC_003630.1|    Pepper mild mottle virus, complete genome
All.viralseqs.fa.Contig1004     gi|1127298656|ref|NC_032614.1|  Beihai picorna-like virus 74 strain BHHK130969 hypothetical protein 1 and hypothetical protein 2 genes, complete cds
FTWLCJP02IU3XZ  gi|593779948|ref|NC_023737.1|   Mycobacterium phage Pleione, complete genome
FTWLCJP02IU4PY  gi|203454602|ref|NC_011273.1|   Mycobacterium phage Myrna, complete genome
FTWLCJP02IUB6X  gi|197085614|ref|NC_011142.1|   Iodobacteriophage phiPLPE, complete genome
FTWLCJP02IUGFL
FTWLCJP02IUNP9

BLAST+ 2.6.0 (Descriptions are not ok; integer values)

$ export PATH=/home/paul/local/usr/local/ncbi-blast-2.6.0+/bin/:$PATH
$ rm -rf annotator
$ Reann.pl -file ../sewageseqs16.fa -config sewageseqs.conf -num_threads 8 -evalue 1e-5
sewageseqs.conf
ann.0.fasta
tblastx -show_gis -num_threads 8 -outfmt "6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore qlen slen qcovs qcovhsp" -db viral.1.1.genomic -evalue 1e-5 -query ann.0.fasta -out ann.0.0.tblastx
Report
Taxonomy
        Getting fasta seqs for viral.1.1.genomic
        Skipping LocalTaxonomy
temp file is /tmp/JkhItzr1AK
$ cut -f 1,7,8 annotator/ann.wTax.report.txt
seqID   accession       desc
All.viralseqs.fa.Contig1000     gi|9626125|ref|NC_001367.1|     1
All.viralseqs.fa.Contig1002     gi|764159063|ref|NC_026582.1|   1
All.viralseqs.fa.Contig1006     gi|20177424|ref|NC_003630.1|    3
All.viralseqs.fa.Contig1007     gi|9630643|ref|NC_001918.1|     1
All.viralseqs.fa.Contig1008     gi|20177424|ref|NC_003630.1|    3
All.viralseqs.fa.Contig1        gi|1043372815|ref|NC_030457.1|  1
All.viralseqs.fa.Contig10       gi|20260796|ref|NC_003692.1|    1
All.viralseqs.fa.Contig100      gi|45476493|ref|NC_005817.1|    1
All.viralseqs.fa.Contig1001     gi|1127298492|ref|NC_032551.1|  1
All.viralseqs.fa.Contig1003     gi|20177424|ref|NC_003630.1|    3
All.viralseqs.fa.Contig1004     gi|1127298656|ref|NC_032614.1|  1
FTWLCJP02IU3XZ  gi|593779948|ref|NC_023737.1|   1
FTWLCJP02IU4PY  gi|203454602|ref|NC_011273.1|   1
FTWLCJP02IUB6X  gi|197085614|ref|NC_011142.1|   1
FTWLCJP02IUGFL
FTWLCJP02IUNP9

getting multiple "print() on closed filehandle TMPOUT" messages

i'm seeing multiple messages like this during Taxonomy step

print() on closed filehandle TMPOUT at /path/to/annotater/Reann.pm line 275, <IN> line 17

which is followed by this exception:

------------- EXCEPTION -------------
MSG: Could not read file '~/path/to/viral.1.1.genomic.fna.gis.txt.fa': No such file or directory
STACK Bio::Root::IO::_initialize_io /path/to/bioperl/bioperl-live/Bio/Root/IO.pm:268
STACK Bio::SeqIO::_initialize /path/to/bioperl/bioperl-live/Bio/SeqIO.pm:513
STACK Bio::SeqIO::fasta::_initialize /path/to/bioperl/bioperl-live/Bio/SeqIO/fasta.pm:87
STACK Bio::SeqIO::new /path/to/bioperl/bioperl-live/Bio/SeqIO.pm:389
STACK Bio::SeqIO::new /path/to/bioperl/bioperl-live/Bio/SeqIO.pm:435
STACK Reann::Taxonomy /path/to/annotater/Reann.pm:282
STACK toplevel /path/to/annotater/Reann.pl:17
-------------------------------------

BLAST database error during Taxonomy step

I'm getting a 'No alias or index file found' error during Taxonomy step. I'm only getting this with Rapsearch databases that are not in $BLASTDB.

Taxonomy
        Getting fasta seqs for protein_Mononeg_Aedes
BLAST Database error: No alias or index file found for nucleotide database [protein_Mononeg_Aedes] in search path

blastdbcmd errors during Taxonomy step

I'm getting blastdbcmd errors after moving BLAST database files to a new directory mount (/zfs2) on the HTC cluster. Never had these problems before on mobydisk

Taxonomy
        Getting fasta seqs for /zfs2/bar/refs/blast/nr
Error: [blastdbcmd] Entry not found: YP7.1
Error: [blastdbcmd] Skipped YP7.1
Error: [blastdbcmd] Entry not found: XP_00643
Error: [blastdbcmd] Skipped XP_00643

BLAST error: No alias or index file found for database

When running annotater with Rapsearch, you need a parallel BLAST database with the same name as the Rapsearch database in the same directory. This is because of the hack to use blastdbcmd in the Version method to get database information. There is no parallel command like blastdbcmd in the rapsearch distribution.

Exception error when testing Taxonomy.t

On a OSX 10.12.6, I ran prove -v Taxonomy.t and got this error:

------------- EXCEPTION: Bio::Root::Exception -------------
MSG: Response Error
Can't verify SSL peers without knowing which Certificate Authorities to trust
STACK: Error::throw
STACK: Bio::Root::Root::throw /Users/pgc92/local/lib/bioperl/bioperl-live/Bio/Root/Root.pm:447
STACK: Bio::DB::GenericWebAgent::get_Response /Users/pgc92/local/lib/bioperl/bioperl-live/Bio/DB/GenericWebAgent.pm:214
STACK: Bio::DB::EUtilities::get_Response /Users/pgc92/perl5/lib/perl5/Bio/DB/EUtilities.pm:35
STACK: Taxonomy::accession2gi /Users/pgc92/local/lib/annotater/Taxonomy.pm:169
STACK: Taxonomy.t:19
-----------------------------------------------------------

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.