taxonomy-ranks

THE SOFTWARE IS PROVIDED "AS IS" AND THE AUTHOR DISCLAIMS ALL WARRANTIES WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY SPECIAL, DIRECT, INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.

WHEN YOU ADAPT (PART OF) THE SOFTWARE FOR YOUR USE CASES, THE AUTHOR AND THE SOFTWARE MUST BE EXPLICITLY CREDITED IN YOUR PUBLICATIONS AND SOFTWARE, AND YOU SHOULD ASK THE USERS OF YOUR SOFTWARE TO CITE THE SOFTWARE IN THEIR PUBLICATIONS. IN A WORD, PLEASE PLAY FAIR (请讲武德).

1 Introduction

Get taxonomy rank information using the ETE3 Python3 module (http://etetoolkit.org/).

This program originated in MitoZ and is still part of MitoZ (https://github.com/linzhi2013/MitoZ), so please cite the publication below if you find it helpful. Thanks!

  • Guanliang Meng, Yiyuan Li, Chentao Yang, Shanlin Liu, MitoZ: a toolkit for animal mitochondrial genome assembly, annotation and visualization, Nucleic Acids Research, https://doi.org/10.1093/nar/gkz173

2 Installation

You can install this program with pip3, conda, or mamba (https://mamba.readthedocs.io/en/latest/).

Conda

$ conda install taxonomy_ranks
# or  
$ mamba install taxonomy_ranks

Pip

Make sure your pip belongs to Python3:

$ which pip
/Users/mengguanliang/soft/miniconda3/bin/pip

then type

$ pip install taxonomy_ranks

This creates a taxaranks command in the same directory as your pip command.

If you want to learn more about Python3 and pip, please refer to https://www.python.org/ and https://docs.python.org/3/tutorial/venv.html?highlight=pip.

MitoZ

If you have MitoZ >= 3.5-beta-1 installed, you already have this program:

$ mitoz-tools taxonomy_ranks -h

See https://github.com/linzhi2013/MitoZ/wiki/The-%27mitoz-tools-taxonomy_ranks%27-command

3 Usage

3.1 commandline usage

$ taxaranks

	usage: taxaranks [-h] [-i <file>] [-o <file>] [-t] [-v]

	To get the lineage information of input taxid, species name, or higher ranks
	(e.g., Family, Order) with ETE3 package.

	The ete3 package will automatically download the NCBI Taxonomy database during
	 the first time using of this program.

	Please be informed:

	(1) A rank name may have more than one taxids, e.g., Pieris can means:
	Pieris <butterfly> and Pieris <angiosperm>. I will search the lineages for
	both of them.

	(2) When you give a species name, if I can not find the taxid for the species
	name, I will try to search the first word (Genus).

	(3) Any input without result found will be output in outfile.err ('-o' option).


	optional arguments:
	  -h, --help  show this help message and exit
	  -i <file>   A file can be a list of ncbi taxa id or species names (or higher
	              ranks, e.g. Family, Order), or a mixture of them.
	  -o <file>   outfile
	  -t          Also print out the taxid for each rank
	  -e          Also print out the records without lineage information found to the '-o <file>'
	  -v          verbose output
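
Note (2) in the help text above describes a fallback from a species name to its first word. A minimal sketch of that idea follows. It is not the program's actual code: the function name and the dictionary `name_to_taxids` are hypothetical stand-ins for the real NCBI name lookup.

```python
def resolve_taxids(name, name_to_taxids):
    """Return (searched_name, taxids) for a user-supplied name.

    name_to_taxids is a hypothetical dict standing in for the NCBI lookup.
    """
    taxids = name_to_taxids.get(name, [])
    if taxids:
        return name, taxids
    # Fall back to the first word, which is the genus in a binomial name.
    words = name.split()
    if len(words) > 1:
        genus = words[0]
        return genus, name_to_taxids.get(genus, [])
    return name, []


# Toy lookup table; 7115 is the Pieris genus taxid from the example output
# in section 4. The species name below is deliberately made up.
toy_lookup = {"Pieris": [7115]}
print(resolve_taxids("Pieris unknownspecies", toy_lookup))
```

The first element of the result shows which name was actually searched, in the same spirit as the taxa_searched column of the output.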

The -i <file> file can be a list of ncbi taxa id or species names (or higher ranks, e.g. Family, Order), or a mixture of them.
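
If you want to inspect such a mixed file before running the program, the sketch below separates taxids from names. It assumes one entry per line and treats a line made only of digits as a taxid; this is an illustration, not necessarily how taxaranks parses its input, and `parse_queries` is a hypothetical helper.

```python
def parse_queries(lines):
    """Split input lines into integer taxids and taxon names."""
    taxids, names = [], []
    for line in lines:
        entry = line.strip()
        if not entry:
            continue  # skip blank lines
        if entry.isdecimal():
            taxids.append(int(entry))  # digits only: treat as an NCBI taxid
        else:
            names.append(entry)        # anything else: treat as a name
    return taxids, names


example = ["9606", "Pieris rapae", "", "Noctuidae"]
print(parse_queries(example))
```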

The ETE3 package automatically downloads the NCBI Taxonomy database the first time you run this program.

Once the NCBI Taxonomy database has been installed, no network connection is needed unless you want to update the database later. In that case, see http://etetoolkit.org/docs/latest/tutorial/tutorial_ncbitaxonomy.html for details.

3.2 using as a module

A taxa_name may have more than one potential_taxid.

from taxonomy_ranks import TaxonomyRanks

taxa_name = 'homo sapiens'

rank_taxon = TaxonomyRanks(taxa_name)

rank_taxon.get_lineage_taxids_and_taxanames()

ranks = ('user_taxa', 'taxa_searched', 'superkingdom', 'kingdom', 'superphylum', 'phylum', 'subphylum', 'superclass', 'class', 'subclass', 'superorder', 'order', 'suborder', 'superfamily', 'family', 'subfamily', 'genus', 'subgenus', 'species')

# If you don't want the results for so many ranks, just shorten the 'ranks' tuple, e.g.
# ranks = ('user_taxa', 'taxa_searched',  'superkingdom', 'kingdom', 'phylum', 'class', 'order', 'family', 'genus', 'species')
# A rank without a value in the database gets the default value 'NA'.

for potential_taxid in rank_taxon.lineages:
     for rank in ranks:
         if rank in ('user_taxa', 'taxa_searched'):
             taxon = rank_taxon.lineages[potential_taxid][rank]
             print(potential_taxid, rank, taxon, sep='\t')
         else:
             taxon, taxid_of_taxon = rank_taxon.lineages[potential_taxid][rank]
             print(potential_taxid, rank, taxon, taxid_of_taxon, sep='\t')

# the outputs are:
9606	user_taxa	homo sapiens
9606	taxa_searched	homo sapiens
9606	superkingdom	Eukaryota	2759
9606	kingdom	Metazoa	33208
9606	superphylum	NA	NA
9606	phylum	Chordata	7711
9606	subphylum	Craniata	89593
9606	superclass	Sarcopterygii	8287
9606	class	Mammalia	40674
9606	subclass	NA	NA
9606	superorder	Euarchontoglires	314146
9606	order	Primates	9443
9606	suborder	Haplorrhini	376913
9606	superfamily	Hominoidea	314295
9606	family	Hominidae	9604
9606	subfamily	Homininae	207598
9606	genus	Homo	9605
9606	subgenus	NA	NA
9606	species	Homo sapiens	9606

# In the above, the taxid 9606 is for homo sapiens
# while each rank has its own taxid, e.g. 2759 is for Eukaryota.
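
To work with these results as records rather than printed lines, the nested structure can be flattened into one dictionary per potential taxid. The sketch below assumes the shape used in the loop above (plain strings for user_taxa and taxa_searched, `(taxon, taxid)` pairs for the other ranks). The sample data is an abridged copy of the printed output, the key and value types are assumptions, and `lineages_to_rows` is a hypothetical helper, not part of the package.

```python
def lineages_to_rows(lineages, ranks):
    """Flatten a lineages-style mapping into a list of flat dictionaries."""
    rows = []
    for potential_taxid, lineage in lineages.items():
        row = {"potential_taxid": potential_taxid}
        for rank in ranks:
            value = lineage[rank]
            if rank in ("user_taxa", "taxa_searched"):
                row[rank] = value              # stored as a plain string
            else:
                taxon, taxid = value           # stored as a (name, taxid) pair
                row[rank] = taxon
                row[rank + "_taxid"] = taxid
        rows.append(row)
    return rows


# Abridged stand-in for rank_taxon.lineages, copied from the output above.
sample_lineages = {
    9606: {
        "user_taxa": "homo sapiens",
        "taxa_searched": "homo sapiens",
        "genus": ("Homo", 9605),
        "subgenus": ("NA", "NA"),
        "species": ("Homo sapiens", 9606),
    }
}
sample_ranks = ("user_taxa", "taxa_searched", "genus", "subgenus", "species")
rows = lineages_to_rows(sample_lineages, sample_ranks)
print(rows[0]["genus"], rows[0]["genus_taxid"])
```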

4 Example

run

$ taxaranks -i test.taxa -o test.taxa.tsv

Input file test.taxa content:

Spodoptera litura
Pieris rapae
Locusta migratoria
Frankliniella occidentalis
Marsupenaeus japonicus
Penaeus monodon

Result file test.taxa.tsv content:

user_taxa	taxa_searched	superkingdom	kingdom	superphylum	phylum	subphylum	superclass	class	subclass	superorder	order	suborder	superfamily	family	subfamily	genus	subgenus	species
Spodoptera litura	Spodoptera litura	Eukaryota	Metazoa	NA	Arthropoda	Hexapoda	NA	Insecta	Pterygota	Amphiesmenoptera	Lepidoptera	Glossata	Noctuoidea	Noctuidae	Amphipyrinae	Spodoptera	NA	Spodoptera litura
Pieris rapae	Pieris rapae	Eukaryota	Metazoa	NA	Arthropoda	Hexapoda	NA	Insecta	Pterygota	Amphiesmenoptera	Lepidoptera	Glossata	Papilionoidea	Pieridae	Pierinae	Pieris	NA	Pieris rapae
Locusta migratoria	Locusta migratoria	Eukaryota	Metazoa	NA	Arthropoda	Hexapoda	NA	Insecta	Pterygota	NA	Orthoptera	Caelifera	Acridoidea	Acrididae	Oedipodinae	Locusta	NA	Locusta migratoria
Frankliniella occidentalis	Frankliniella occidentalis	Eukaryota	Metazoa	NA	Arthropoda	Hexapoda	NA	Insecta	Pterygota	NA	Thysanoptera	Terebrantia	Thripoidea	Thripidae	Thripinae	Frankliniella	NA	Frankliniella occidentalis
Marsupenaeus japonicus	Marsupenaeus japonicus	Eukaryota	Metazoa	NA	Arthropoda	Crustacea	Multicrustacea	Malacostraca	Eumalacostraca	Eucarida	Decapoda	Dendrobranchiata	Penaeoidea	Penaeidae	NA	Penaeus	NA	Penaeus japonicus
Penaeus monodon	Penaeus monodon	Eukaryota	Metazoa	NA	Arthropoda	Crustacea	Multicrustacea	Malacostraca	Eumalacostraca	Eucarida	Decapoda	Dendrobranchiata	Penaeoidea	Penaeidae	NA	Penaeus	NA	Penaeus monodon

With the '-t' option,

$ taxaranks -i test.taxa -o test.taxa.tsv -t

Result file test.taxa.tsv will be:

user_taxa	taxa_searched	superkingdom	superkingdom_taxid	kingdom	kingdom_taxid	superphylum	superphylum_taxid	phylum	phylum_taxid	subphylum	subphylum_taxid	superclass	superclass_taxid	class	class_taxid	subclass	subclass_taxid	superorder	superorder_taxid	order	order_taxid	suborder	suborder_taxid	superfamily	superfamily_taxid	family	family_taxid	subfamily	subfamily_taxid	genus	genus_taxid	subgenus	subgenus_taxid	species	species_taxid
Spodoptera litura	Spodoptera litura	Eukaryota	2759	Metazoa	33208	NA	NA	Arthropoda	6656	Hexapoda	6960	NA	NA	Insecta	50557	Pterygota	7496	Amphiesmenoptera	85604	Lepidoptera	7088	Glossata	41191	Noctuoidea	37570	Noctuidae	7100	Amphipyrinae	95182	Spodoptera	7106	NA	NA	Spodoptera litura	69820
Pieris rapae	Pieris rapae	Eukaryota	2759	Metazoa	33208	NA	NA	Arthropoda	6656	Hexapoda	6960	NA	NA	Insecta	50557	Pterygota	7496	Amphiesmenoptera	85604	Lepidoptera	7088	Glossata	41191	Papilionoidea	37572	Pieridae	7114	Pierinae	42449	Pieris	7115	NA	NA	Pieris rapae	64459
Locusta migratoria	Locusta migratoria	Eukaryota	2759	Metazoa	33208	NA	NA	Arthropoda	6656	Hexapoda	6960	NA	NA	Insecta	50557	Pterygota	7496	NA	NA	Orthoptera	6993	Caelifera	7001	Acridoidea	92621	Acrididae	7002	Oedipodinae	27549	Locusta	7003	NA	NA	Locusta migratoria	7004
Frankliniella occidentalis	Frankliniella occidentalis	Eukaryota	2759	Metazoa	33208	NA	NA	Arthropoda	6656	Hexapoda	6960	NA	NA	Insecta	50557	Pterygota	7496	NA	NA	Thysanoptera	30262	Terebrantia	38130	Thripoidea	45049	Thripidae	45053	Thripinae	153976	Frankliniella	45059	NA	NA	Frankliniella occidentalis	133901
Marsupenaeus japonicus	Marsupenaeus japonicus	Eukaryota	2759	Metazoa	33208	NA	NA	Arthropoda	6656	Crustacea	6657	Multicrustacea	2172821	Malacostraca	6681	Eumalacostraca	72041	Eucarida	6682	Decapoda	6683	Dendrobranchiata	6684	Penaeoidea	111520	Penaeidae	6685	NA	NA	Penaeus	133894	NA	NA	Penaeus japonicus	27405
Penaeus monodon	Penaeus monodon	Eukaryota	2759	Metazoa	33208	NA	NA	Arthropoda	6656	Crustacea	6657	Multicrustacea	2172821	Malacostraca	6681	Eumalacostraca	72041	Eucarida	6682	Decapoda	6683	Dendrobranchiata	6684	Penaeoidea	111520	Penaeidae	6685	NA	NA	Penaeus	133894	NA	NA	Penaeus monodon	6687
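
Because the result is a tab-separated file with a header line, it can be read with Python's standard csv module. The sketch below groups input taxa by one rank column. It uses an in-memory sample with only a few of the columns shown above; with a real file you would pass an open file handle instead. `group_by_rank` is a hypothetical helper.

```python
import csv
import io
from collections import defaultdict


def group_by_rank(tsv_handle, rank):
    """Map each value in the given rank column to the user_taxa entries having it."""
    groups = defaultdict(list)
    for row in csv.DictReader(tsv_handle, delimiter="\t"):
        groups[row[rank]].append(row["user_taxa"])
    return dict(groups)


# Abridged sample built from the '-t' output above.
sample = io.StringIO(
    "user_taxa\ttaxa_searched\torder\torder_taxid\n"
    "Spodoptera litura\tSpodoptera litura\tLepidoptera\t7088\n"
    "Pieris rapae\tPieris rapae\tLepidoptera\t7088\n"
    "Locusta migratoria\tLocusta migratoria\tOrthoptera\t6993\n"
)
groups = group_by_rank(sample, "order")
print(groups)
```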

Warning

The two columns user_taxa and taxa_searched are both provided because a user-supplied taxon may sometimes correspond to multiple NCBI taxa (probably belonging to different clades). When this happens, the lineage of each matching taxon is output, so you MUST check the results carefully!
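
One way to find the entries that need checking is to count how often each user_taxa value occurs in the result file. The sketch below assumes that multiple matches appear as separate rows sharing the same user_taxa value. The sample rows are hypothetical and abridged, and `ambiguous_inputs` is not part of the package.

```python
import csv
import io
from collections import Counter


def ambiguous_inputs(tsv_handle):
    """Return the user_taxa values that occur in more than one result row."""
    reader = csv.DictReader(tsv_handle, delimiter="\t")
    counts = Counter(row["user_taxa"] for row in reader)
    return sorted(name for name, n in counts.items() if n > 1)


# Hypothetical, abridged rows: 'Pieris' matching both an animal and a plant genus.
sample = io.StringIO(
    "user_taxa\ttaxa_searched\tkingdom\n"
    "Pieris\tPieris\tMetazoa\n"
    "Pieris\tPieris\tViridiplantae\n"
    "Locusta migratoria\tLocusta migratoria\tMetazoa\n"
)
flagged = ambiguous_inputs(sample)
print(flagged)
```

Note that an input file listing the same taxon twice would be flagged as well, so the flagged names are candidates for manual review rather than confirmed ambiguities.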

5 Speed and Parallelisation

Searching a lot of taxa or taxon ids can be a bit slow. In that case, please refer to issue #1 (thanks to @HuoJnx!).

I have copied that code snippet to the file parallelize_taxon.sh. You can download the file to your server, and then run

sh parallelize_taxon.sh <file_list_of_ncbi_taxa_id_or_species_names>

It assumes that your server has the parallel command installed (https://anaconda.org/conda-forge/parallel).
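
If the parallel command is not available, the same split, run and merge idea can be sketched in Python. This is not the shipped script: all function names are hypothetical, and the only taxaranks options used are -i, -o and -t as documented above. Threads are enough here because each job is an external process.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(items, n_chunks):
    """Split items into at most n_chunks contiguous, nearly equal parts."""
    n_chunks = max(1, min(n_chunks, len(items)))
    size, extra = divmod(len(items), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def merge_tables(tables):
    """Concatenate tables (lists of lines), keeping only the first header line."""
    merged = []
    for table in tables:
        if not table:
            continue
        if not merged:
            merged.append(table[0])   # keep the header once
        merged.extend(table[1:])      # data rows only
    return merged


def run_taxaranks(in_path, out_path):
    """Run one taxaranks job on one chunk file."""
    subprocess.run(["taxaranks", "-i", in_path, "-o", out_path, "-t"], check=True)


def run_all(jobs, workers):
    """Run (in_path, out_path) jobs concurrently."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(lambda job: run_taxaranks(*job), jobs))


print(split_into_chunks(list("abcde"), 2))
print(merge_tables([["header", "row1"], ["header", "row2"]]))
```

Writing each chunk to its own file, calling run_all, and passing the lines of each result file to merge_tables reproduces the workflow of the shell script.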

6 Problems

Your HOME directory may run out of space while the NCBI Taxonomy database is downloaded and installed the first time you run this program.

The error message can be:

sqlite3.OperationalError: disk I/O error

This happens because ete3 creates the directory ~/.etetoolkit to store the database (ca. 500M), but your HOME directory does not have enough space left.
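
To check in advance whether this problem is likely, you can compare the free space in your HOME directory with the approximate database size. The sketch below uses only the Python standard library. The 500M threshold is the rough figure quoted above, not an exact requirement, and `has_free_space` is a hypothetical helper.

```python
import shutil
from pathlib import Path


def has_free_space(path, required_bytes):
    """Return True if the filesystem holding path has at least required_bytes free."""
    return shutil.disk_usage(path).free >= required_bytes


required = 500 * 1024 * 1024  # roughly the 500M mentioned above
home = Path.home()
print(home, "has enough space:", has_free_space(home, required))
```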

Solution:
Store the database in a location with enough space and link that location to ~/.etetoolkit.

  1. create a directory somewhere else that has enough space left:

     $ mkdir /other/place/myetetoolkit
    
  2. remove the directory ~/.etetoolkit created by ete3:

     $ rm -rf ~/.etetoolkit
    
  3. link your new directory to the HOME directory:

     $ ln -s /other/place/myetetoolkit ~/.etetoolkit
    
  4. run the program again:

     $ taxaranks -i my_taxonomy_list -o outfile
    

This way, ete3 should work as expected.
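
Steps 1 to 3 can also be sketched in Python. The example below runs in a throwaway temporary directory instead of touching the real ~/.etetoolkit. The paths and the function name are hypothetical, and like `rm -rf`, the function deletes whatever is in the original directory.

```python
import os
import shutil
import tempfile


def relocate_with_symlink(original_dir, new_dir):
    """Replace original_dir with a symlink that points to new_dir."""
    os.makedirs(new_dir, exist_ok=True)      # step 1: mkdir
    if os.path.islink(original_dir):
        os.unlink(original_dir)              # remove a link left by an earlier attempt
    elif os.path.isdir(original_dir):
        shutil.rmtree(original_dir)          # step 2: rm -rf
    os.symlink(new_dir, original_dir)        # step 3: ln -s


# Demonstration with temporary stand-ins for ~/.etetoolkit and the new location.
with tempfile.TemporaryDirectory() as tmp:
    old = os.path.join(tmp, "home", ".etetoolkit")
    new = os.path.join(tmp, "other", "myetetoolkit")
    os.makedirs(old)
    relocate_with_symlink(old, new)
    demo_result = (os.path.islink(old), os.path.realpath(old) == os.path.realpath(new))

print(demo_result)
```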

Update the NCBI taxonomy database

For more details, refer to http://etetoolkit.org/docs/latest/tutorial/tutorial_ncbitaxonomy.html.

  1. open a console, and type

     $ python3
    

    This starts the Python3 interactive interpreter.

  2. execute the following commands in Python3

     >>> from ete3 import NCBITaxa
     >>> ncbi = NCBITaxa()
     >>> ncbi.update_taxonomy_database()
    

You can also download the database manually from NCBI:

     $ wget https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz

and then load the downloaded file in Python3, using the ncbi object created in step 2:

     >>> ncbi.update_taxonomy_database(taxdump_file='taxdump.tar.gz')

7 Citations

  • Guanliang Meng, Yiyuan Li, Chentao Yang, Shanlin Liu, MitoZ: a toolkit for animal mitochondrial genome assembly, annotation and visualization, Nucleic Acids Research, https://doi.org/10.1093/nar/gkz173

Besides, since taxonomy-ranks makes use of the ete3 toolkit, you should also cite it if you use taxonomy-ranks in your publications.

Please go to http://etetoolkit.org/ for more details.

8 Author

Guanliang MENG.

linzhi2012<MitoZ>gmail<MitoZ>com

taxonomy_ranks's Issues

Parallelisation of taxonomy_ranks

Hello. I think taxonomy_ranks is a very convenient tool for lineage annotation. But it's a little bit slow, so I wrote a script to parallelize it and found that it works. It ran about 4 times faster for 21579 queries on a 48-thread server. I hope the script can help anyone who needs it. ^_^

Code

taxaranks_parallel(){

    ## stop after error
    set -e
    
    ## get current directory
    current_dir=$(pwd)
    
    ##parse the input path
    input=$1
    dir=$(dirname $input|xargs realpath)
    base=$(basename $input)
    real_input="${dir}/${base}"
    echo "Input is $input."
    
    ## go to sub_dir
    sub_dir="${dir}/split_${base}"
    rm -rf $sub_dir; mkdir -p $sub_dir
    cd $sub_dir
    echo "Create temporary directory $sub_dir."
    
    ## get parameters for spliting, then split
    total_line=$(cat $real_input|wc -l )
    threads=$(nproc)
    need_length=3
    split -a $need_length -d -n "l/${threads}" $real_input
    echo "Have $threads threads, split the file to $threads parts."
    
    ## run taxaranks in parallel
    echo "Annotating..."
    ls .|parallel "taxaranks -i {} -o {}.lineage -t"
    
    ## merge
    merge_file="../${base}.lineage"
    
    #### write the header line once, taken from the first result file
    rm -rf $merge_file
    ls *.lineage|head -n1|xargs head -n1 > $merge_file
    
    #### append each file without its first line, one file at a time,
    #### so that rows from different files cannot interleave
    for f in *.lineage; do awk 'NR>1 {print}' "$f" >> $merge_file; done
    
    ## remove the sub_dir
    rm -rf $sub_dir
    echo "Clear temporary directory."
    
    ## back to the previous working directory
    cd $current_dir
    
    ## prompt
    echo "All finished."
}

Example

[Two screenshots in the original issue: a run without parallelization and a run with parallelization.]
