sherrillmix / taxonomizr

Parse NCBI taxonomy and accessions to find taxonomic assignments

License: GNU General Public License v2.0

R 98.06% Makefile 0.73% C 1.21%

taxonomizr's Introduction

Convert accession numbers to taxonomy

Introduction

taxonomizr provides some simple functions to parse NCBI taxonomy files and accession dumps and efficiently use them to assign taxonomy to accession numbers or taxonomic IDs. This is useful for example to assign taxonomy to BLAST results. This is all done locally after downloading the appropriate files from NCBI using included functions (see below).

The major functions are:

prepareDatabase: download data from NCBI and prepare SQLite database
accessionToTaxa: convert accession numbers to taxonomic IDs
getTaxonomy: convert taxonomic IDs to taxonomy

More specialized functions are:

getId: convert a biological name to taxonomic ID
getRawTaxonomy: find all taxonomic ranks for a taxonomic ID
normalizeTaxa: combine raw taxonomies with different taxonomic ranks
condenseTaxa: condense a set of taxa to their most specific common branch
makeNewick: generate a Newick formatted tree from taxonomic output
getAccessions: find accessions for a given taxonomic ID
getDescendants: find descendants for a given taxonomic ID

And a simple use case might look like (see below for more details):

library(taxonomizr)
#note this will require a lot of hard drive space, bandwidth and time to process all the data from NCBI
prepareDatabase('accessionTaxa.sql')
blastAccessions<-c("Z17430.1","Z17429.1","X62402.1") 
ids<-accessionToTaxa(blastAccessions,'accessionTaxa.sql')
getTaxonomy(ids,'accessionTaxa.sql')

Requirements

This package downloads a few databases from NCBI and stores them in an easily accessible form on the hard drive. This ends up taking a decent amount of space so you'll probably want around 75 GB of free hard drive space.

Installation

The package is on CRAN, so it should install with a simple:

install.packages("taxonomizr")

If you want the development version directly from github, use the devtools library and run:

devtools::install_github("sherrillmix/taxonomizr")

To use the library, load it in R:

library(taxonomizr)

Preparation

Since version 0.5.0, there is a simple function to run all preparations. Note that you'll need a bit of time, download bandwidth and hard drive space before running this command (we're downloading taxonomic assignments for every record in NCBI). To create a SQLite database called accessionTaxa.sql in the current working directory (you may want to store this somewhere more centrally located so it does not need to be duplicated with every project), we can run:

prepareDatabase('accessionTaxa.sql')

## Downloading names and nodes with getNamesAndNodes()

## Downloading accession2taxid with getAccession2taxid()

## This can be a big (several gigabytes) download. Please be patient and use a fast connection.

## Preprocessing names with read.names.sql()

## Preprocessing nodes with read.nodes.sql()

## Preprocessing accession2taxid with read.accession2taxid()

## Reading ./nucl_gb.accession2taxid.gz.

## Reading ./nucl_wgs.accession2taxid.gz.

## Reading in values. This may take a while.

## Adding index. This may also take a while.

## [1] "accessionTaxa.sql"

If everything works then that should have prepared a SQLite database ready for use. You can skip the "Manual preparation" steps below.

All files are cached locally and so the preparation is only required once (delete/rename the SQLite database and call the function again to regenerate the database). It is not necessary to manually check for the presence of the database since the function checks whether the SQLite database is present and, if so, skips downloading/processing. For example, running the command again produces:

prepareDatabase('accessionTaxa.sql')

## SQLite database accessionTaxa.sql already exists. Delete to regenerate

## [1] "accessionTaxa.sql"

Note that if you only want the taxonomic data and do not want to assign taxonomy to accession IDs then you can just get the much smaller names and nodes data sets and skip the large download and time-consuming databasing of accession IDs by setting getAccessions=FALSE e.g.:

prepareDatabase(getAccessions=FALSE)

## Downloading names and nodes with getNamesAndNodes()
##  [100%] Downloaded 57373562 bytes...
##  [100%] Downloaded 49 bytes...
## Preprocessing names with read.names.sql()
## Preprocessing nodes with read.nodes.sql()
## [1] "nameNode.sqlite"

And if you are assigning taxonomy to protein data, then you would want to grab the prot.accession2taxid.gz from NCBI by specifying the types='prot' argument (or types=c("nucl_gb", "nucl_wgs","prot") for proteins and nucleotides):

prepareDatabase(types='prot')

## Downloading names and nodes with getNamesAndNodes()

## Preprocessing names with read.names.sql()

## Preprocessing nodes with read.nodes.sql()

## Downloading accession2taxid with getAccession2taxid()

## This can be a big (several gigabytes) download. Please be patient and use a fast connection.

## Preprocessing accession2taxid with read.accession2taxid()

## Reading ./prot.accession2taxid.gz.

## Reading in values. This may take a while.

## Adding index. This may also take a while.

## [1] "nameNode.sqlite"

Assigning taxonomy

Producing accession numbers

NCBI accession numbers are often obtained when doing a BLAST search (usually the second column of output from blastn, blastx, blastp, ...). For example the output might look like:

read1   gi|326539903|gb|CP002582.1|     69.68   1745    448     69      3       1702    3517898 3519606 3e-169  608
read2   gi|160426828|gb|CP000885.1|     68.46   1763    452     82      3       1711    1790367 1788655 4e-140  511
...

So to identify a taxon for a given sequence you would blast it against e.g. the NCBI nt database and load the results into R. For NCBI databases, the accession number is often the 4th item in the | (pipe) separated reference field (often the second column in a tab separated result). For example, the CP002582.1 in the gi|326539903|gb|CP002582.1| above.

So just as an example, reading in blast results might look something like:

blastResults<-read.table('XXXX.blast',header=FALSE,stringsAsFactors=FALSE)
#grab the 4th |-separated field from the reference name in the second column
accessions<-sapply(strsplit(blastResults[,2],'\\|'),'[',4)
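As a self-contained illustration of the step above, the following sketch parses the two example hits shown earlier from an in-memory character vector instead of a file. The variable name blastLines is introduced here only for illustration, and the tab separator is an assumption about the tabular BLAST output being used.

```r
# Two example BLAST hits in tab-separated format (copied from the output above)
blastLines <- c(
  "read1\tgi|326539903|gb|CP002582.1|\t69.68\t1745\t448\t69\t3\t1702\t3517898\t3519606\t3e-169\t608",
  "read2\tgi|160426828|gb|CP000885.1|\t68.46\t1763\t452\t82\t3\t1711\t1790367\t1788655\t4e-140\t511"
)

# Read the hits as a table; text= lets read.table parse strings instead of a file
blastResults <- read.table(text = blastLines, header = FALSE, sep = "\t",
                           stringsAsFactors = FALSE)

# Grab the 4th |-separated field from the reference name in the second column
accessions <- sapply(strsplit(blastResults[, 2], "\\|"), "[", 4)
print(accessions)
```

The resulting character vector can then be passed to accessionToTaxa as described in the next section.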

Finding taxonomy for NCBI accession numbers

Now we are ready to convert NCBI accession numbers to taxonomic IDs. For example, to find the taxonomic IDs associated with NCBI accession numbers "LN847353.1" and "AL079352.3":

taxaId<-accessionToTaxa(c("LN847353.1","AL079352.3"),"accessionTaxa.sql")
print(taxaId)

## [1] 1313 9606

And to get the taxonomy for those IDs:

getTaxonomy(taxaId,'accessionTaxa.sql')

##      superkingdom phylum       class      order            
## 1313 "Bacteria"   "Firmicutes" "Bacilli"  "Lactobacillales"
## 9606 "Eukaryota"  "Chordata"   "Mammalia" "Primates"       
##      family             genus           species                   
## 1313 "Streptococcaceae" "Streptococcus" "Streptococcus pneumoniae"
## 9606 "Hominidae"        "Homo"          "Homo sapiens"

You can also get taxonomy for NCBI accession numbers without versions (the .X following the main number e.g. the ".1" in LN847353.1) using the version='base' argument of accessionToTaxa:

taxaId<-accessionToTaxa(c("LN847353","AL079352"),"accessionTaxa.sql")
print(taxaId)

## [1] NA NA

taxaId<-accessionToTaxa(c("LN847353","AL079352"),"accessionTaxa.sql",version='base')
print(taxaId)

## [1] 1313 9606
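To make clear what a "base" accession is, the version suffix can be stripped with a regular expression in base R. This is only an illustration of the naming convention, not a description of how accessionToTaxa handles version='base' internally; the variable names are made up for the example.

```r
# Versioned accessions as they typically appear in BLAST output
versioned <- c("LN847353.1", "AL079352.3")

# Remove a trailing ".<digits>" version suffix to get the base accession
baseAccessions <- sub("\\.[0-9]+$", "", versioned)
print(baseAccessions)
```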

Finding taxonomy for taxonomic names

If you'd like to find IDs for taxonomic names then you can do something like:

taxaId<-getId(c('Homo sapiens','Bos taurus','Homo','Alces alces'),'accessionTaxa.sql')
print(taxaId)

## [1] "9606" "9913" "9605" "9852"

And again to get the taxonomy for those IDs use getTaxonomy:

taxa<-getTaxonomy(taxaId,'accessionTaxa.sql')
print(taxa)


##      superkingdom phylum     class      order      family      genus  
## 9606 "Eukaryota"  "Chordata" "Mammalia" "Primates" "Hominidae" "Homo" 
## 9913 "Eukaryota"  "Chordata" "Mammalia" NA         "Bovidae"   "Bos"  
## 9605 "Eukaryota"  "Chordata" "Mammalia" "Primates" "Hominidae" "Homo" 
## 9852 "Eukaryota"  "Chordata" "Mammalia" NA         "Cervidae"  "Alces"
##      species       
## 9606 "Homo sapiens"
## 9913 "Bos taurus"  
## 9605 NA            
## 9852 "Alces alces"

Finding descendants for a given taxa

The function getDescendants can be used to find all the descendants at a taxonomic level for a given taxon. For example to find all species (the default) in the Homininae subfamily (taxonomic ID 207598):

getDescendants(207598,'accessionTaxa.sql')

## [1] "Gorilla gorilla"                   "Gorilla beringei"                 
## [3] "Pan paniscus"                      "Pan troglodytes"                  
## [5] "Homo sapiens"                      "Homo heidelbergensis"             
## [7] "Homo sapiens environmental sample" "Homo sp."

Or all genera:

getDescendants(207598,'accessionTaxa.sql','genus')

## [1] "Gorilla" "Pan"     "Homo"
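The idea behind finding descendants can be sketched in base R as a breadth-first walk down a parent/child table. The toy nodes table below is a simplified, hand-written stand-in keyed by name rather than taxonomic ID, and descendantsSketch is a hypothetical helper; neither reflects the actual database schema or the package's implementation.

```r
# Toy parent/child table (simplified and hand-written for illustration only)
nodes <- data.frame(
  name   = c("Homininae", "Gorilla", "Pan", "Homo",
             "Gorilla gorilla", "Pan paniscus", "Pan troglodytes", "Homo sapiens"),
  parent = c(NA, "Homininae", "Homininae", "Homininae",
             "Gorilla", "Pan", "Pan", "Homo"),
  rank   = c("subfamily", "genus", "genus", "genus",
             "species", "species", "species", "species"),
  stringsAsFactors = FALSE
)

# Walk down the tree one level at a time, collecting nodes of the desired rank
descendantsSketch <- function(nodes, root, desiredRank = "species") {
  found <- character(0)
  frontier <- root
  while (length(frontier) > 0) {
    children <- nodes[!is.na(nodes$parent) & nodes$parent %in% frontier, ]
    found <- c(found, children$name[children$rank == desiredRank])
    frontier <- children$name
  }
  found
}

speciesFound <- descendantsSketch(nodes, "Homininae")
generaFound <- descendantsSketch(nodes, "Homininae", "genus")
print(speciesFound)
print(generaFound)
```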

Note that an index for the nodes table was added in v0.10.1 to make this run faster. If your database was created prior to v0.10.1 and you need maximum speed for finding descendants then please regenerate the database.

Finding common names for taxonomic IDs

If you'd like to find all common and other types of names for a given taxa ID then you can use getCommon:

getCommon(c(9913,9606),'accessionTaxa.sql')

## [[1]]
##                         name                type
## 1                  Bos bovis             synonym
## 2     Bos primigenius taurus             synonym
## 3  Bos taurus Linnaeus, 1758           authority
## 4                 Bos taurus     scientific name
## 5      Bovidae sp. Adi Nefas            includes
## 6                     bovine         common name
## 7                     cattle genbank common name
## 8                        cow         common name
## 9                  dairy cow         common name
## 10           domestic cattle         common name
## 11              domestic cow         common name
## 12                        ox         common name
## 13                      oxen         common name
## 
## [[2]]
##                          name                type
## 1 Homo sapiens Linnaeus, 1758           authority
## 2                Homo sapiens     scientific name
## 3                       human genbank common name

Or specify only a certain type(s) of name ("common" names seem to often be split between "common name" and "genbank common name"):

getCommon(c(9913,9606,9894),'accessionTaxa.sql',c('genbank common name','common name'))

## [[1]]
##              name                type
## 1          bovine         common name
## 2          cattle genbank common name
## 3             cow         common name
## 4       dairy cow         common name
## 5 domestic cattle         common name
## 6    domestic cow         common name
## 7              ox         common name
## 8            oxen         common name
## 
## [[2]]
##    name                type
## 1 human genbank common name
## 
## [[3]]
##      name                type
## 1 giraffe genbank common name
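The effect of restricting the name types can be mimicked on a plain data frame in base R. The sketch below rebuilds the Bos taurus table from the output above by hand and keeps only the two "common" types; bosNames and wantedTypes are illustrative names, and this is not how getCommon queries the database.

```r
# Names and types for taxonomic ID 9913, copied from the getCommon output above
bosNames <- data.frame(
  name = c("Bos bovis", "Bos primigenius taurus", "Bos taurus Linnaeus, 1758",
           "Bos taurus", "Bovidae sp. Adi Nefas", "bovine", "cattle", "cow",
           "dairy cow", "domestic cattle", "domestic cow", "ox", "oxen"),
  type = c("synonym", "synonym", "authority", "scientific name", "includes",
           "common name", "genbank common name", "common name", "common name",
           "common name", "common name", "common name", "common name"),
  stringsAsFactors = FALSE
)

# Keep only rows whose type is one of the requested name types
wantedTypes <- c("genbank common name", "common name")
commonOnly <- bosNames[bosNames$type %in% wantedTypes, ]
print(commonOnly)
```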

Note that databases created with taxonomizr versions earlier than v0.9.4 do not contain the type field and so the database will have to be reloaded to use this function. For example, this could be done by calling:

taxonomizr::getNamesAndNodes()
taxonomizr::read.names.sql('names.dmp','nameNode.sqlite',overwrite=TRUE)

Condensing taxonomy a.k.a. lowest common ancestor LCA

You can use the condenseTaxa function to find the agreements among taxonomic hits. For example to condense the taxonomy of the four taxa found from taxonomic names above to the lowest taxonomic rank shared by all of them:

condenseTaxa(taxa)

##   superkingdom phylum     class      order family genus species
## 1 "Eukaryota"  "Chordata" "Mammalia" NA    NA     NA    NA

This function can also be fed a large number of grouped hits, e.g. BLAST hits for high throughput sequencing reads after filtering for the best hits for each read, and output a condensed taxonomy for each grouping:

groupings<-c('read1','read2','read1','read2')
condenseTaxa(taxa,groupings)

##       superkingdom phylum     class      order      family      genus 
## read1 "Eukaryota"  "Chordata" "Mammalia" "Primates" "Hominidae" "Homo"
## read2 "Eukaryota"  "Chordata" "Mammalia" NA         NA          NA    
##       species
## read1 NA     
## read2 NA
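The condensing logic can be approximated in a few lines of base R: walk the ranks from least to most specific and stop at the first rank where the hits disagree or are missing. The helper lcaSketch below is hypothetical and only reproduces the behaviour seen in the outputs above; condenseTaxa itself may treat NAs and edge cases differently.

```r
# Taxonomy matrix rebuilt by hand from the getTaxonomy output shown earlier
taxa <- rbind(
  "9606" = c("Eukaryota", "Chordata", "Mammalia", "Primates", "Hominidae", "Homo", "Homo sapiens"),
  "9913" = c("Eukaryota", "Chordata", "Mammalia", NA, "Bovidae", "Bos", "Bos taurus"),
  "9605" = c("Eukaryota", "Chordata", "Mammalia", "Primates", "Hominidae", "Homo", NA),
  "9852" = c("Eukaryota", "Chordata", "Mammalia", NA, "Cervidae", "Alces", "Alces alces")
)
colnames(taxa) <- c("superkingdom", "phylum", "class", "order", "family", "genus", "species")

# Keep ranks while all rows agree on a single non-NA value, then stop
lcaSketch <- function(taxa) {
  out <- rep(NA_character_, ncol(taxa))
  names(out) <- colnames(taxa)
  for (j in seq_len(ncol(taxa))) {
    vals <- unique(taxa[, j])
    if (length(vals) != 1 || is.na(vals)) break
    out[j] <- vals
  }
  out
}

# Condense all rows together, then separately within each grouping
overall <- lcaSketch(taxa)
groupings <- c("read1", "read2", "read1", "read2")
byGroup <- t(sapply(split(seq_len(nrow(taxa)), groupings),
                    function(rows) lcaSketch(taxa[rows, , drop = FALSE])))
print(overall)
print(byGroup)
```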

Find all taxonomic assignments for a given taxa

To get all taxonomic assignments for a given taxa regardless of their particular rank, you can use the getRawTaxonomy function. Note that there are often many intermediate ranks outside the more common taxonomic ranks. The function returns a list since different IDs can have differing numbers of ranks. It is used similarly to getTaxonomy:

getRawTaxonomy(c(9606,9913),'accessionTaxa.sql')

## $`9606`
##                species                  genus              subfamily 
##         "Homo sapiens"                 "Homo"            "Homininae" 
##                 family            superfamily              parvorder 
##            "Hominidae"           "Hominoidea"           "Catarrhini" 
##             infraorder               suborder                  order 
##          "Simiiformes"          "Haplorrhini"             "Primates" 
##             superorder                  clade                clade.1 
##     "Euarchontoglires"        "Boreoeutheria"             "Eutheria" 
##                clade.2                  class                clade.3 
##               "Theria"             "Mammalia"              "Amniota" 
##                clade.4                class.1             superclass 
##            "Tetrapoda" "Dipnotetrapodomorpha"        "Sarcopterygii" 
##                clade.5                clade.6                clade.7 
##         "Euteleostomi"           "Teleostomi"        "Gnathostomata" 
##                clade.8              subphylum                 phylum 
##           "Vertebrata"             "Craniata"             "Chordata" 
##                clade.9               clade.10               clade.11 
##        "Deuterostomia"            "Bilateria"            "Eumetazoa" 
##                kingdom               clade.12           superkingdom 
##              "Metazoa"         "Opisthokonta"            "Eukaryota" 
##                no rank 
##   "cellular organisms" 
## 
## $`9913`
##                species                  genus              subfamily 
##           "Bos taurus"                  "Bos"              "Bovinae" 
##                 family             infraorder               suborder 
##              "Bovidae"               "Pecora"           "Ruminantia" 
##                  order             superorder                  clade 
##         "Artiodactyla"       "Laurasiatheria"        "Boreoeutheria" 
##                clade.1                clade.2                  class 
##             "Eutheria"               "Theria"             "Mammalia" 
##                clade.3                clade.4                class.1 
##              "Amniota"            "Tetrapoda" "Dipnotetrapodomorpha" 
##             superclass                clade.5                clade.6 
##        "Sarcopterygii"         "Euteleostomi"           "Teleostomi" 
##                clade.7                clade.8              subphylum 
##        "Gnathostomata"           "Vertebrata"             "Craniata" 
##                 phylum                clade.9               clade.10 
##             "Chordata"        "Deuterostomia"            "Bilateria" 
##               clade.11                kingdom               clade.12 
##            "Eumetazoa"              "Metazoa"         "Opisthokonta" 
##           superkingdom                no rank 
##            "Eukaryota"   "cellular organisms"

These raw taxonomies with varying numbers of levels can be normalized so that all taxa share the same number of levels (aligning by taxonomic levels that are not the unspecific "clade") using the normalizeTaxa function:

raw<-getRawTaxonomy(c(9606,9913),'accessionTaxa.sql')
normalizeTaxa(raw)

##      no rank              superkingdom superkingdom.1 kingdom   kingdom.1  
## 9606 "cellular organisms" "Eukaryota"  "Opisthokonta" "Metazoa" "Eumetazoa"
## 9913 "cellular organisms" "Eukaryota"  "Opisthokonta" "Metazoa" "Eumetazoa"
##      kingdom.2   kingdom.3       phylum     subphylum  subphylum.1 
## 9606 "Bilateria" "Deuterostomia" "Chordata" "Craniata" "Vertebrata"
## 9913 "Bilateria" "Deuterostomia" "Chordata" "Craniata" "Vertebrata"
##      subphylum.2     subphylum.3  subphylum.4    superclass     
## 9606 "Gnathostomata" "Teleostomi" "Euteleostomi" "Sarcopterygii"
## 9913 "Gnathostomata" "Teleostomi" "Euteleostomi" "Sarcopterygii"
##      superclass.1           superclass.2 superclass.3 class      class.1 
## 9606 "Dipnotetrapodomorpha" "Tetrapoda"  "Amniota"    "Mammalia" "Theria"
## 9913 "Dipnotetrapodomorpha" "Tetrapoda"  "Amniota"    "Mammalia" "Theria"
##      class.2    class.3         superorder         order          suborder     
## 9606 "Eutheria" "Boreoeutheria" "Euarchontoglires" "Primates"     "Haplorrhini"
## 9913 "Eutheria" "Boreoeutheria" "Laurasiatheria"   "Artiodactyla" "Ruminantia" 
##      infraorder    parvorder    superfamily  family      subfamily   genus 
## 9606 "Simiiformes" "Catarrhini" "Hominoidea" "Hominidae" "Homininae" "Homo"
## 9913 "Pecora"      NA           NA           "Bovidae"   "Bovinae"   "Bos" 
##      species       
## 9606 "Homo sapiens"
## 9913 "Bos taurus"

normalizeTaxa does its best to figure out the order of taxonomic levels automatically but can sometimes be left with ambiguous cases. This will result in an error like:

Error in topoSort(c(nonClade, list(lineageOrder)), errorIfAmbiguous = TRUE) : 
  Ambiguous ordering found in topoSort (suborder vs infraorder)

That's saying that the algorithm cannot tell from the data whether suborder or infraorder is the more specific taxonomic level. To clarify, give the lineageOrder parameter a vector going from most to least specific like:

normalizeTaxa(raw,lineageOrder=c('infraorder','suborder'))

For especially troublesome sets, you may have to repeat this step several times getting a new error each time to find all the ambiguities. This would result in building up a vector specifying the ordering of several ambiguous levels like:

normalizeTaxa(raw,lineageOrder=c('infraorder','suborder','superorder','infraclass','subclass','class'))

Finding accessions for a given taxonomic ID

To find all the accessions for a given taxonomic ID, you can use the getAccessions function. This is a bit of an unusual use case so to preserve space, an index is not created by default in read.accession2taxid. If you are going to use this function, you will want to rebuild the SQLite database with the indexTaxa argument set to true with something like:

read.accession2taxid(list.files('.','accession2taxid.gz$'),'accessionTaxa.sql',indexTaxa=TRUE,overwrite=TRUE)

## Reading nucl_gb.accession2taxid.gz.

## Reading nucl_wgs.accession2taxid.gz.

## Reading in values. This may take a while.

## Adding index. This may also take a while.

Then you can get the accessions for taxa 3702 with a command like (note that the limit argument is used here in order to preserve space):

getAccessions(3702,'accessionTaxa.sql',limit=10)

##    taxa accession
## 1  3702  X58148.1
## 2  3702  X66414.1
## 3  3702  X60045.1
## 4  3702  X07376.1
## 5  3702  X54927.1
## 6  3702  X54926.1
## 7  3702  X54928.1
## 8  3702  X54930.1
## 9  3702  X54929.1
## 10 3702  X52320.1

Convert taxonomy to Newick tree

This is probably only useful in a few specific cases but a convenience function makeNewick to convert taxonomy into a Newick tree is included. The function takes a matrix with columns corresponding to taxonomic categories and rows corresponding to taxonomic assignments, e.g. the output from condenseTaxa or getTaxonomy or normalizeTaxa and reduces it to a Newick formatted tree. For example:

taxa

##      [,1]        [,2]       [,3]       [,4]       [,5]        [,6]   
## [1,] "Eukaryota" "Chordata" "Mammalia" "Primates" "Hominidae" "Homo" 
## [2,] "Eukaryota" "Chordata" "Mammalia" "Primates" "Hominidae" "Pan"  
## [3,] "Eukaryota" "Chordata" "Mammalia" NA         "Cervidae"  "Alces"

makeNewick(taxa)

## [1] "((((((Homo,Pan)Hominidae)Primates,((Alces)Cervidae)_)Mammalia)Chordata)Eukaryota);"

If quotes are needed, then specify the quote argument:

makeNewick(taxa,quote="'")

## [1] "(((((('Homo','Pan')'Hominidae')'Primates',(('Alces')'Cervidae')_)'Mammalia')'Chordata')'Eukaryota');"

By default, makeNewick includes trailing nodes that are all NA in the tree e.g.:

taxa[3,3:6]<-NA
print(taxa)

##      [,1]        [,2]       [,3]       [,4]       [,5]        [,6]  
## [1,] "Eukaryota" "Chordata" "Mammalia" "Primates" "Hominidae" "Homo"
## [2,] "Eukaryota" "Chordata" "Mammalia" "Primates" "Hominidae" "Pan" 
## [3,] "Eukaryota" "Chordata" NA         NA         NA          NA

makeNewick(taxa)

## [1] "((((((Homo,Pan)Hominidae)Primates)Mammalia,(((_)_)_)_)Chordata)Eukaryota);"

If these nodes are not desired then set excludeTerminalNAs to TRUE:

makeNewick(taxa,excludeTerminalNAs=TRUE)

## [1] "((((((Homo,Pan)Hominidae)Primates)Mammalia)Chordata)Eukaryota);"

Note that a taxon may be the most specific assignment for a given row in the taxonomy matrix but will not be a leaf in the resulting tree if it also appears in the taxonomy of other rows e.g. Chordata in this example.

Note: NCBI name changes in early 2023

Please note that NCBI changed the naming of several major prokaryote phyla in early 2023 e.g. Firmicutes became Bacillota. Please watch out for any problems that could arise. For example:

names of assigned taxonomy may shift after updating a database to a post-change version
comparisons of old analyses performed pre-change to new analyses performed post-change will need to be done with care

If I understand things correctly, then the actual taxonomy ID will not change so it might be wise to retain the taxonomy ID for all analyses. Then on final analysis, the taxonomic names can be assigned based on whatever naming scheme is in use at that time.
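One way to follow this advice is to store only taxonomic IDs with the results and attach names at the very end. The sketch below uses two small hand-written lookup vectors in place of real database lookups; results, phylumPreChange and phylumPostChange are illustrative names and the read labels are made up.

```r
# Analysis results keyed by stable taxonomic IDs rather than names
results <- data.frame(read = c("read1", "read2"), taxaId = c(1313, 9606),
                      stringsAsFactors = FALSE)

# Hand-written stand-ins for name lookups against an old and a new database
phylumPreChange <- c("1313" = "Firmicutes", "9606" = "Chordata")
phylumPostChange <- c("1313" = "Bacillota", "9606" = "Chordata")

# Attach names only at the final step, using whichever naming scheme is current
results$phylum <- unname(phylumPostChange[as.character(results$taxaId)])
print(results)
```

Because the IDs are retained, the same results can be relabelled later by swapping in a different lookup without redoing the analysis.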

Changelog

v0.10.5

Catch 404 errors and report as errors
Add resume argument to download functions
Don't retain temp files for downloads if less than 10kb
README touchups

v0.10.4

Minor improvement to output md5 and modification date for downloads to aid in debugging network issues

v0.10.3

Minor fix to prevent accessionToTaxa from hanging when given numeric inputs

v0.10.2

Behind the scenes switch to multi_download function from curl package to allow download resumption on interrupted downloads. This adds a dependency that curl package be >=5.0.0.
Add protocol option to choose between FTP and HTTP protocols for downloading. The two protocols should perform similarly and the relative speeds of NCBI's ftp and http servers seem to vary so probably not a whole lot of reason to choose one over the other unless a firewall is blocking FTP ports.

v0.10.1

Add getDescendants function to get all descendants for a given taxon

v0.9.4

Add getCommon function to get all names in the database for a given taxa ID

v0.9.3

Fix bug in testing script

v0.9.2

Allow factors as input to accessionToTaxa
Document sqlite pragmas for read.accession2taxid
Inherit ... argument documentation for prepareDatabase
Catch input/output error while processing large files
Update various user-facing links from ftp to https for easier access

v0.8.4

Add quote option to makeNewick
Trim trailing NAs off the tree in makeNewick if excludeTerminalNAs is TRUE
Add terminal semicolon to end of makeNewick tree unless terminator is NULL

v0.8.3

Add "no rank" to normalizeTaxa's default exclusion
Expand README

v0.8.2

Add normalizeTaxa function

v0.8.1

Fix minor typos

v0.8.0

Switch to curl::curl_download to avoid Windows issues

v0.7.1

Add md5 check for downloads

v0.7.0

Add getRawTaxonomy function
Add option to not download accessions

v0.6.0

Fix named vector bug in accessionToTaxa
Add makeNewick function
Deal with default 60 second timeout for downloads in R

v0.5.3

Remove nucl_est and nucl_gss from defaults since NCBI folded them into nucl_gb and removed
Squash R:devel bug

v0.5.0

Transitioned from data.table to SQLite
Added convenience prepareDatabase() function
Squashed Windows testing errors

Manual preparation of database (usually not necessary)

Note: Since version 0.5.0, it is usually not necessary to run the following manually, the function prepareDatabase() should do most of this automatically for you (see above).

In order to avoid constant internet access and slow APIs, the first step in using the package is to download all necessary files from NCBI. This uses a bit of disk space but makes future access reliable and fast.

Note: It is not necessary to manually check for the presence of these files since the functions automatically check to see if their output is present and if so skip downloading/processing. Delete the local files if you would like to redownload or reprocess them.

Download names and nodes

First, download the necessary names and nodes files from NCBI:

getNamesAndNodes()

## [1] "./names.dmp" "./nodes.dmp"

Download accession to taxa files

Then download accession to taxa id conversion files from NCBI. Note: this is a pretty big download (several gigabytes):

#this is a big download
getAccession2taxid()

## This can be a big (several gigabytes) download. Please be patient and use a fast connection.

## [1] "./nucl_gb.accession2taxid.gz"  "./nucl_wgs.accession2taxid.gz"

If you would also like to identify protein accession numbers, also download the prot file from NCBI (again this is a big download):

#this is a big download
getAccession2taxid(types='prot')

## This can be a big (several gigabytes) download. Please be patient and use a fast connection.

## [1] "./prot.accession2taxid.gz"

Convert names, nodes and accessions to database

Then process the downloaded names and nodes files into a more easily accessed form:

read.names.sql('names.dmp','accessionTaxa.sql')
read.nodes.sql('nodes.dmp','accessionTaxa.sql')

Next process the downloaded accession files into the same database (this one could take a while):

read.accession2taxid(list.files('.','accession2taxid.gz$'),'accessionTaxa.sql')

## Reading nucl_gb.accession2taxid.gz.

## Reading nucl_wgs.accession2taxid.gz.

## Reading prot.accession2taxid.gz.

## Reading in values. This may take a while.

## Adding index. This may also take a while.

Now everything should be ready for processing. All files are cached locally and so the preparation is only required once (or whenever you would like to update the data). It is not necessary to manually check for the presence of these files since the functions automatically check to see if their output is present and if so skip downloading/processing. Delete the local files if you would like to redownload or reprocess them.

Switch from data.table to SQLite

Version 0.5.0 marked a change for name and node lookups from using data.table to using SQLite. This was necessary to increase performance (10-100x speedup for getTaxonomy) and create a simpler interface (a single SQLite database contains all necessary data). Unfortunately, this switch requires a couple of breaking changes:

getTaxonomy changes from getTaxonomy(ids,namesDT,nodesDT) to getTaxonomy(ids,sqlFile)
getId changes from getId(taxa,namesDT) to getId(taxa,sqlFile)
read.names is deprecated, instead use read.names.sql. For example, instead of calling names<-read.names('names.dmp') in every session, simply call read.names.sql('names.dmp','accessionTaxa.sql') once (or use the convenient prepareDatabase as above).
read.nodes is deprecated, instead use read.nodes.sql. For example, instead of calling nodes<-read.nodes('nodes.dmp') in every session, simply call read.nodes.sql('nodes.dmp','accessionTaxa.sql') once (or use the convenient prepareDatabase as above).

I've tried to ease any problems with this by overloading getTaxonomy and getId to still function (with a warning) if passed a data.table names and nodes argument and providing a simpler prepareDatabase function for completing all setup steps (hopefully avoiding direct calls to read.names and read.nodes for most users).

I plan to eventually remove data.table functionality to avoid a split codebase so please switch to the new SQLite format in all new code.

taxonomizr's Issues

get taxID for a huge list of accession numbers

Hi,

I have a .csv file with a huge list of accession numbers (thousands) for which I would like to find the taxonomic IDs. Is there a way to edit the code below in order to include the entire list of accession numbers from my csv file?

taxaId<-accessionToTaxa(c("WP_227648936.1","KAF2914273.1"),"accessionTaxa.sql")

help is much appreciated, thank you.

nucl_est.accession2taxid.gz unavailable

Hi there,

I am planning to test out this package as it seems super useful. However, while running prepareDatabase, the following error is thrown:

trying URL 'ftp://ftp.ncbi.nih.gov/pub/taxonomy/accession2taxid//nucl_est.accession2taxid.gz'
Error in (function (url, destfile, method, quiet = FALSE, mode = "w",  : 
  cannot open URL 'ftp://ftp.ncbi.nih.gov/pub/taxonomy/accession2taxid//nucl_est.accession2taxid.gz'
Calls: prepareDatabase -> do.call -> <Anonymous> -> mapply -> <Anonymous>
In addition: Warning message:
In (function (url, destfile, method, quiet = FALSE, mode = "w",  :
  cannot open URL 'ftp://ftp.ncbi.nih.gov/pub/taxonomy/accession2taxid//nucl_est.accession2taxid.gz': FTP status was '550 Requested action not taken; file unavailable'
Execution halted

Any insights on how to solve this?

Many thanks!
Sander

read.accession not working; preparedatabase failed

I've been trying to get the databases to download and get the taxonomy working.

lastly I got this error.

Reading NCBI_DATABASE/nucl_gb.accession2taxid.gz.
Error in readBin(inn, what = raw(0L), size = 1L, n = BFR.SIZE) :
error reading from the connection
In addition: Warning message:
In readBin(inn, what = raw(0L), size = 1L, n = BFR.SIZE) :
invalid or incomplete compressed data

packageVersion('taxonomizr')
[1] ‘0.6.0’

FR: set a variable with default path for the databases

I now run each taxonomizr session in the same folder where I created the DB which is not ideal for multiple projects and would prefer that the package looks for the database in a defined folder (/data/.../taxonomizrDB) while the working data is in another place (project folder)

Can we do this now and if not could it be added?

Thanks

Error in accession to taxa

When running accessionToTaxa I get:

Warning messages:
1: In file.remove(tmp) :
  cannot remove file 'C:\Users\micro\AppData\Local\Temp\RtmpQT7MmR\filebbc2ec6fc4280c', reason 'Permission denied'
2: In file.remove(tmp) :

I am on a Windows PC.

Error in utils::download.file(url, tarFile, mode = "wb") : cannot open URL 'ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz'

Hi, thank you for this R package. I installed it and tried to run prepareDatabase('accessionTaxa.sql') but got this error. I ran this in an interactive R session on our server.

prepareDatabase('accessionTaxa.sql')
Downloading names and nodes with getNamesAndNodes()
trying URL 'ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz'
URL 'ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz': status was 'Couldn't resolve host name'
Error in utils::download.file(url, tarFile, mode = "wb") :
cannot open URL 'ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz'

I checked the URL 'ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz'. I can download it with wget, and the size is ~50M.

Is there any reason for its failure in downloading in R?

Thank you.

read.names.sql for common names

Hi,

Thanks for making this great package. I am wondering if there could be any SQL support for common names? It would be nice if taxonomizr::getTaxonomy(ids = uids, sqlFile = mysql) had an option to return the common names in the NCBI taxonomy dump files, in addition to or instead of the scientific names.

Thank you!

Disk is full (and moving tmp location doesn't solve it)

Hello there,

I'm running this super small snippet of code:

unixtools::set.tempdir("/tmpdata/mytmp")

# install.packages("taxonomizr")
devtools::install_github('sherrillmix/taxonomizr')
library(taxonomizr)

prepareDatabase(
  sqlFile = "accessionTaxa.sqlite",
  tmpDir = "/tmpdata/mytmp",
  vocal = TRUE)

and with both the cran installation and the github one, I hit the same error:

Downloading names and nodes with getNamesAndNodes()
 [100%] Downloaded 58237570 bytes...
 [100%] Downloaded 49 bytes...
Preprocessing names with read.names.sql()
Preprocessing nodes with read.nodes.sql()
Downloading accession2taxid with getAccession2taxid()
This can be a big (several gigabytes) download. Please be patient and use a fast connection.
 [100%] Downloaded 2193408674 bytes...
 [100%] Downloaded 61 bytes...
 [100%] Downloaded 3892554970 bytes...
 [100%] Downloaded 62 bytes...
Preprocessing accession2taxid with read.accession2taxid()
Reading /tmpdata/mytmp/nucl_gb.accession2taxid.gz.
Reading /tmpdata/mytmp/nucl_wgs.accession2taxid.gz.
Reading in values. This may take a while.
Adding index. This may also take a while.
Error: database or disk is full

As previously reported, I changed the tmp folder to one where I know I have a lot of free space, and I'm still getting the same error. The approach reported in #5 (i.e. using a system call to set TMPDIR and run the command) didn't work for me either.

Please find attached the sessionInfo() output: 20220531_sessionInfo.txt

Kingdom bacteria only

Hi,

I recently found a duplicate taxon ID for the genus Leptothrix, which is a genus name for bacterial species as well as for Leptothrix hardyi, a spider species. Spiders are interesting, but I need to take care of the bacteria at the moment.

Would it be possible to tell the function getId to search only in a particular kingdom, such as Bacteria?

Best, Michael

prepareDatabase('accessionTaxa.sql') - Error in utils::download.file(url, tarFile, mode = "wb") : cannot open URL 'ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz'

Hi!

I've downloaded the library, and I'm trying to do the prepareDatabase step. When running prepareDatabase('accessionTaxa.sql') I get the following message:

Downloading names and nodes with getNamesAndNodes()
trying URL 'ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz'
Content type 'unknown' length 54625551 bytes (52.1 MB)
================
Error in utils::download.file(url, tarFile, mode = "wb") : 
  cannot open URL 'ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz'
In addition: Warning messages:
1: In utils::download.file(url, tarFile, mode = "wb") :
  downloaded length 17669944 != reported length 54625551
2: In utils::download.file(url, tarFile, mode = "wb") :
  URL 'ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz': status was 'Transferred a partial file'

Is it a problem with NCBI networking? Or something else?
Thanks in advance!

I can't match the accession number with taxa ID

I'm trying to get the taxa IDs for these NCBI accession numbers, but accessionToTaxa is returning NAs:

blastAccessions <- c("WP_145099348.1", "WP_084254954.1","WP_214290166.1")
ids<-accessionToTaxa(blastAccessions,"accessionTaxa.sql")

What could be happening? I've seen other issue topics, but I couldn't find the solution.

getAccession2taxid(types='prot') downloads old files

Hi,

It seems that getAccession2taxid(types='prot') downloads the prot.accession2taxid.gz file, which is just 47 kb, instead of the prot.accession2taxid.FULL.gz file. This might be due to a change in the database.

Error Reading in prot.accession2taxid.gz file

Hello,

I have some protein accession numbers I'm trying to assign taxonomies to. Every time I try to prepare the database, either manually or with the all-in-one command, I keep getting this read error:

Error in readBin(inn, what = raw(0L), size = 1L, n = BFR.SIZE) :
error reading from the connection
In addition: Warning message:
In readBin(inn, what = raw(0L), size = 1L, n = BFR.SIZE) :
invalid or incomplete compressed data

I have re-downloaded the prot.accession2taxid file multiple times, using the R commands and directly from the FTP website. The file always seems to be the right size (just over 6 GB), so I'm quite stuck on what to try next...

Many Thanks,
Jack
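
An "invalid or incomplete compressed data" warning may point to a damaged .gz file even when its size looks right, so comparing checksums is one thing to try. The sketch below uses tools::md5sum from base R; the helper md5Matches is hypothetical, and the checksum file layout ("<md5>  <filename>" on the first line) is an assumption. It runs on a small temporary file rather than the real download.

```r
# Compare a file's MD5 with the first field of a checksum file.
# Assumes the checksum file holds "<md5>  <filename>" on its first line.
md5Matches <- function(dataFile, md5File) {
  firstLine <- trimws(readLines(md5File, n = 1, warn = FALSE))
  expected <- strsplit(firstLine, "\\s+")[[1]][1]
  actual <- unname(tools::md5sum(dataFile))
  identical(tolower(expected), tolower(actual))
}

# Self-contained demonstration on a small temporary file.
dataFile <- tempfile(fileext = ".gz")
md5File <- paste0(dataFile, ".md5")
writeLines("example content", dataFile)
writeLines(paste0(unname(tools::md5sum(dataFile)), "  ", basename(dataFile)), md5File)
intact <- md5Matches(dataFile, md5File)

# Simulate a corrupted download by changing the file contents.
writeLines("truncated", dataFile)
corrupted <- md5Matches(dataFile, md5File)
```

For a real download, dataFile and md5File would be the .gz file and its accompanying checksum file, if one was downloaded alongside it.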

accessionToTaxa error

Hello! I'm having trouble with accessionToTaxa.

When I run:
taxaId<-accessionToTaxa(accessions[[2]],"/scratch/amartinez/clanda/database/accessionTaxa.sql", version = "base")
taxaId shows this (I also changed the "version" argument and got the same results):

  [1] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
 [25] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
 [49] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
 [73] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
 [97] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
[121] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
[145] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
[169] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
[193] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
[217] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
[241] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
[265] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA

I already checked "accessions" and it looks like this:

1 A0A031LR15 EZQ10827.1
2 A0A031WG31 VFD41820.1
3 A0A031WGJ0 VHX47376.1
4 A0A062V5D1 KCZ72522.1
5 A0A062V852 KCZ73452.1
6 A0A062VAF3 KCZ73453.1
7 A0A069RKC8 KDR96580.1
8 A0A075W9Y5 AIG97195.1
9 A0A075WB17 AIG97196.1

sql 4 protein accession: "Error in connection_import_file"

I'm trying unsuccessfully to build the sql database to get taxids and species for protein accession codes. Any clues are appreciated.
The error message is as follows:
R » prepareDatabase(types = c('prot'))
Downloading names and nodes with getNamesAndNodes()
./names.dmp, ./nodes.dmp already exist. Delete to redownload
Preprocessing names with read.names.sql()
Preprocessing nodes with read.nodes.sql()
Downloading accession2taxid with getAccession2taxid()
This can be a big (several gigabytes) download. Please be patient and use a fast connection.
./prot.accession2taxid.gz already exist. Delete to redownload
Preprocessing accession2taxid with read.accession2taxid()
Reading ./prot.accession2taxid.gz.
Reading in values. This may take a while.
Error in connection_import_file(conn@ptr, name, value, sep, eol, skip) :
  RS_sqlite_import: /tmp/Rtmp5fKpXQ/file137afc38c4385f line 134 expected 3 columns of data but found 1
In addition: There were 50 or more warnings (use warnings() to see the first 50)

AccessionToTaxa error in file.remove(tmp)

Hi there!

Thanks for your excellent package :) I had it working nicely before I updated my R version to 4.1.2. Now I keep getting an error when I call the accessionToTaxa function using the accessionTaxa.sql database. It looks like the below:

I wonder if you have seen this before.. it may be just a local permissions thing but my IT department could not figure it out unfortunately.. Maybe you can :)

Thanks in advance,

Marcel Polling

Preparing SQL for only specific taxa

First of all, thank you so much for providing this package, it has provided an immense speed-up to my workflows.
The only issue I've found was the size of the download, which becomes tricky, especially if it has to be done every two weeks to have a fresh copy.
Two quick questions:

Is it possible to download only selected taxa (i.e. only vertebrates), and create an SQL for them to reduce download time and occupied space?
Is there any way to allow downloading from the stop point? Even just not re-downloading the already finished files would be helpful.
Thanks again! :)

error: no such column: tmp.query.accession

Hi - I am trying to execute the following script:

library(taxonomizr)
prepareDatabase('accessionTaxa.sql')
acctoget<- read.table("acctoget.txt",header=FALSE)
taxaId<-accessionToTaxa(acctoget,"accessionTaxa.sql")
taxa <- getTaxonomy(taxaId,taxaNodes,taxaNames) 
write.table(taxa,"taxa.txt",sep="\t",row.names=FALSE)

The SQL data looks correct.
I see the following:

[100%] Downloaded 62 bytes...
Preprocessing accession2taxid with read.accession2taxid()
Reading ./nucl_gb.accession2taxid.gz.
Reading ./nucl_wgs.accession2taxid.gz.
Reading in values. This may take a while.
Adding index. This may also take a while.
[1] "accessionTaxa.sql"

Attached is my file with accession numbers:

acctoget.txt

What am I doing wrong?
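
Two things in the script above may be worth checking, offered as a sketch rather than a confirmed diagnosis: read.table() returns a data.frame rather than a character vector, and getTaxonomy() is called with taxaNodes and taxaNames, which the script never defines, whereas the package introduction passes the SQLite file instead. The accessions below are placeholders taken from the introduction, not the contents of the attached acctoget.txt.

```r
# Stand-in for acctoget.txt: one accession per line, no header.
acctoget <- read.table(text = "Z17430.1\nZ17429.1\nX62402.1",
                       header = FALSE, stringsAsFactors = FALSE)

# read.table() returns a data.frame; the accessions are in its first column (V1).
accessions <- acctoget$V1

sqlFile <- "accessionTaxa.sql"  # assumed to exist already
if (requireNamespace("taxonomizr", quietly = TRUE) && file.exists(sqlFile)) {
  taxaId <- taxonomizr::accessionToTaxa(accessions, sqlFile)
  # getTaxonomy() is given the SQLite file, as in the package introduction.
  taxa <- taxonomizr::getTaxonomy(taxaId, sqlFile)
  write.table(taxa, "taxa.txt", sep = "\t", row.names = FALSE)
}
```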

accessionToTaxa not matching

First, thank you for the previous fix, which worked perfectly with the given taxonomy. However, I conducted a new NCBI search for rbcL sequences (only the first 20 are shown below) and ran the following accession numbers through accessionToTaxa. Only a few were identified, even though they have taxon IDs if you search the NCBI database. I even reverted to the CRAN version of the package and had the same result.

"MG407457" "MG407456" "MG407455" "MG407454" "MG407453" "MG407452" "MG407451" "MG407450" "MG407449" "MG407448" "MG407447" "MG407446" "MG407445" "MG407444" "MG407443" "MG407442" "MG407441" "MG407440" "MG407439" "MG407438"

Curious if you have the same problem.
Thanks
Nathan

read.accession2taxid failing

I am having trouble getting read.accession2taxid() to complete properly on a remote server (it worked fine on my local laptop). When I run the command I get this error:

> read.accession2taxid(list.files('.','accession2taxid.gz$'),'NCBI_accessionTaxa_20210211.sql')
Reading nucl_gb.accession2taxid.gz.
Reading nucl_wgs.accession2taxid.gz.
Reading in values. This may take a while.
Error in connection_import_file(conn@ptr, name, value, sep, eol, skip) : 
  RS_sqlite_import: /tmp/RtmpuEFbGo/file48e13a73756 line 119051393 expected 3 columns of data but found 1
In addition: There were 50 or more warnings (use warnings() to see the first 50)

> warnings()
Warning messages:
1: In writeBin(bfr, con = out, size = 1L) : problem writing to connection

It is not a read/write permissions issue, though maybe it could be an issue with the size of sqlite temp files? I tried to set the sqlite temp directory to a location with more space with:

> Sys.setenv(SQLITE_TMPDIR = "/mnt/efs/")

but it still seems to be writing to /tmp. In any case, I am not even sure it is a size issue, because there are no huge files in /tmp when the function fails. Any ideas for solving this issue would be appreciated.

Manual preparation of database

I have downloaded the necessary files based on the steps given in
https://cran.r-project.org/web/packages/taxonomizr/vignettes/usage.html

I am getting the following error while executing the read.accession2taxid step:

read.accession2taxid(list.files('.','accession2taxid.gz$'),'accessionTaxa.sql')
Reading nucl_wgs.accession2taxid.gz.
[2022-11-28 14:37:28] Exception: Failed to rename temporary file (final file does not exist): /var/folders/6p/11yd9mgx7sg2x79cg1ls62yc0000gn/T/RtmpnGaxwW/fileba3762dcd32.tmp -> /var/folders/6p/11yd9mgx7sg2x79cg1ls62yc0000gn/T/RtmpnGaxwW/fileba3762dcd32

at #8. popTemporaryFile.default(destnameT)
- popTemporaryFile.default() is in environment 'R.utils'

at #7. popTemporaryFile(destnameT)
- popTemporaryFile() is in environment 'R.utils'

at #6. decompressFile.default(filename = filename, ..., ext = ext, FUN = FUN)
- decompressFile.default() is in environment 'R.utils'

at #5. decompressFile(filename = filename, ..., ext = ext, FUN = FUN)
- decompressFile() is in environment 'R.utils'

at #4. gunzip.default(inFile, tmp, remove = FALSE)
- gunzip.default() is in environment 'R.utils'

at #3. R.utils::gunzip(inFile, tmp, remove = FALSE)
- R.utils::gunzip() is in environment 'R.utils'

at #2. trimTaxa(ii, tmp, 1:3)
- trimTaxa() is in environment 'taxonomizr'

at #1. read.accession2taxid(list.files(".", "accession2taxid.gz$"),
"accessionTaxa.sql")
- read.accession2taxid() is in environment 'taxonomizr'

Error: Failed to rename temporary file (final file does not exist): /var/folders/6p/11yd9mgx7sg2x79cg1ls62yc0000gn/T/RtmpnGaxwW/fileba3762dcd32.tmp -> /var/folders/6p/11yd9mgx7sg2x79cg1ls62yc0000gn/T/RtmpnGaxwW/fileba3762dcd32
In addition: There were 50 or more warnings (use warnings() to see the first 50)

Please help
regards,
Dinesh

Database or disk is full error

Hi there,

I am trying to build the sql database manually since I need to use the protein accessions and am getting the error:

Error in result_create(conn@ptr, statement) : database or disk is full
Calls: read.accession2taxid ... .local -> new -> initialize -> initialize -> result_create
Execution halted

I have changed the R temp directory to a drive with plenty of storage (~200TB) and am still running into this issue.

This is how I am building the database:

getNamesAndNodes()
getAccession2taxid()
getAccession2taxid(types='prot')
read.names.sql('names.dmp', 'accessionTaxa.sql')
read.nodes.sql('nodes.dmp', 'accessionTaxa.sql')
read.accession2taxid(list.files('.','accession2taxid.gz$'), 'accessionTaxa.sql')

I have tried this on multiple different servers and computers with no luck. Please let me know what you think.

-Peter Skidmore

Error: no such table: main.names

When I ran following command:

prepareDatabase('accessionTaxa.sql')

I got error:

Downloading names and nodes with getNamesAndNodes()
 [100%] Downloaded 57179747 bytes...
 [100%] Downloaded 49 bytes...
Preprocessing names with read.names.sql()
Error: no such table: main.names

Error: read.accession2taxid(list.files('.','accession2taxid.gz$'),'accessionTaxa.sql')

Hi Teacher,
I hope you are doing fine. I am working with this package for the first time and get an error when I run the following command:

read.accession2taxid(list.files('.','accession2taxid.gz$'),'accessionTaxa.sql')

Reading nucl_est.accession2taxid.gz.
Reading nucl_gb.accession2taxid.gz.
Reading nucl_gss.accession2taxid.gz.
Reading nucl_wgs.accession2taxid.gz.
Reading in values. This may take a while.
Error: Problem creating sql file. Deleting.
Error in rsqlite_import_file(conn@ptr, name, value, sep, eol, skip) :
RS_sqlite_import: C:\Users\ALIZOH~1\AppData\Local\Temp\RtmpQx5v97\file926868d25ec2 line 238971807 expected 2 columns of data but found 3
In addition: Warning message:
In file.remove(sqlFile) :
cannot remove file 'accessionTaxa.sql', reason 'Permission denied'

I would appreciate your help troubleshooting this problem.
Regards
Ali Zohaib

accession2taxid ftp

Hi,
I realise there have been a couple of other recent issues about downloading files such as nucl_gb.accession2taxid.gz from the FTP site. Now there is a mismatch with the MD5 checksum and prepareDatabase stops. My question: how can I build an SQL database with only the nodes and names tables so that I can run the function getTaxonomy? Is it possible to just skip accession2taxid?

Best, Michael

read.names failing to read in names.dmp table

I can read in the nodes.dmp file just fine, but names.dmp fails. The error suggests that read.names is expecting a different number of columns than are present in names.dmp.

Looking at the function it seems the problem comes in when removing columns 3 & 4 from the splitLines object. The colnames command expects there to be two columns remaining, but there are three (names.dmp has five columns).

I was able to work around this by editing the read.names function, so that it removes the fifth column as well.
splitLines <- splitLines[, -(3:5)]

That seems to have corrected the problem for me, but I thought you might want to be aware of it.

library(taxonomizr)
getNamesAndNodes()
getAccession2taxid()
read.accession2taxid(list.files('.','accession2taxid.gz$'),'accessionTaxa.sql')
taxaNodes<-read.nodes('nodes.dmp')
taxaNames<-read.names('names.dmp')
Error in `colnames<-`(`*tmp*`, value = c("id", "name")) :
  length of 'dimnames' [2] not equal to array extent
In addition: Warning message:
In (function (..., deparse.level = 1) :
number of columns of result is not a multiple of vector length (arg 1)

read.names <- function (nameFile, onlyScientific = TRUE)
{
    # Split each line of names.dmp on the "|" separators and surrounding whitespace
    splitLines <- do.call(rbind, strsplit(readLines(nameFile),
        "\\s*\\|\\s*"))
    if (onlyScientific)
        splitLines <- splitLines[splitLines[, 4] == "scientific name", ]
    # Workaround: drop column 5 as well so that only id and name remain
    splitLines <- splitLines[, -(3:5)]
    colnames(splitLines) <- c("id", "name")
    splitLines <- data.frame(id = as.numeric(splitLines[, "id"]),
        name = splitLines[, "name"], stringsAsFactors = FALSE)
    out <- data.table::data.table(splitLines, key = "id")
    data.table::setindex(out, "name")
    return(out)
}
taxaNames<-read.names('names.dmp')

I still get a warning, but the function now works

All taxonomic levels

Is it possible to retrieve all taxonomic levels for a given id without knowing them ahead of time? An issue I am running into is multiple taxonomic levels being assigned to "clade". For example, looking up the full lineage for taxid 2511165 on NCBI shows:

cellular organisms; Eukaryota; **Sar**; **Stramenopiles**; **Ochrophyta**; Eustigmatophyceae; Eustigmatales; Monodopsidaceae; Microchloropsis

and all 3 bold levels above are identified as "clade". Including "clade" one or more times in desiredTaxa= only returns the first level assigned to clade.

> getTaxonomy(2511165,"/mnt/efs/JV_taxonomy/NCBI_accessionTaxa_20210211.sql", desiredTaxa = c("superkingdom", "phylum", "clade", "clade", "clade", "class", "order", "family", "genus", "species"))
        superkingdom phylum clade clade clade class               order           family            genus            
2511165 "Eukaryota"  NA     "Sar" NA    NA    "Eustigmatophyceae" "Eustigmatales" "Monodopsidaceae" "Microchloropsis"
        species                 
2511165 "Microchloropsis salina"
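
The package introduction lists getRawTaxonomy ("find all taxonomic ranks for a taxonomic ID") and normalizeTaxa ("combine raw taxonomies with different taxonomic ranks"), which may fit this case. A minimal sketch, assuming a database already exists at the path below; the path is an assumption and the call is skipped when the package or file is missing.

```r
sqlFile <- "accessionTaxa.sql"  # assumed path to an existing taxonomizr database
rawTaxa <- NULL

if (requireNamespace("taxonomizr", quietly = TRUE) && file.exists(sqlFile)) {
  # Ask for every rank recorded for the ID instead of a fixed desiredTaxa set.
  rawTaxa <- taxonomizr::getRawTaxonomy(2511165, sqlFile)
  print(rawTaxa)
}
```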

accession number not matching with taxa ID

Hello, I just updated to the newest version of taxonomizr and ran the new code to create the SQL database. I have about 200,000 accession numbers from searching NCBI for rbcL genes. I have tried both the 'base' and 'version' forms of the accession numbers and they do not seem to match up correctly: I get some but not all of the taxa IDs (each accession number has a taxa ID if I search for it directly in NCBI). Could you check whether you have the same problem? Below are a link to the data frame of accession numbers (the taxa IDs are from entrez code, but I would like to use your much faster functions) and example code.
It is worth mentioning that getTaxonomy() works once I get the taxa ID from the entrez code.

I have gotten around the issue, so no hurry, but it would be nice to figure out the problem so I don't need to wait hours on entrez functions.
thank you
Nathan

https://www.dropbox.com/s/ivygvjqb0zfa6rl/ncbi_rbcl_lineage.csv?dl=0
x<-data.table::fread(file="ncbi_rbcl_lineage.csv",header = T,sep=",")
accessionToTaxa(x$accession[1:100],"accessionTaxa.sql",version='base')

output

[1] 2478980 2478980 2478980 88415 88415 88415 88415 1191690 1191690 1191690 1077399 1077399 1077399
[14] 1077399 1077399 1077399 NA NA NA NA NA NA NA NA NA NA
[27] NA NA NA NA NA NA NA NA NA NA NA NA NA
[40] NA NA NA NA NA NA NA NA NA NA NA NA NA
[53] NA NA NA NA NA NA NA NA NA NA NA NA 1486651
[66] 1486654 1486646 1486646 1486646 1486654 1486646 373125 1486654 1486654 1486654 1486654 1486646 373125
[79] 373125 373125 1486646 1486646 1486646 1486646 1486646 1486646 373124 373124 340433 1486647 1486650
[92] 1486650 373124 373124 1486647 1486647 1486646 1486647 1486650 1486646
Warning messages:
1: In file.remove(tmp) :
cannot remove file 'C:\Users\geraldn\AppData\Local\Temp\RtmpM31OhV\file219c75ba49a3', reason 'Permission denied'
2: In file.remove(tmp) :
cannot remove file 'C:\Users\geraldn\AppData\Local\Temp\RtmpM31OhV\file219c75ba49a3', reason 'Permission denied'

Error in readBin while reading .gz files

Hi!

I need to obtain taxid from a huge list of accessions numbers so "taxonomizr" seems to be the perfect option.

However, I get the following error when I run prepareDatabase or the read.accession2taxid commands (this is after multiple tries, so the databases were already downloaded):

Downloading names and nodes with getNamesAndNodes()
./names.dmp, ./nodes.dmp already exist. Delete to redownload
Downloading accession2taxid with getAccession2taxid()
This can be a big (several gigabytes) download. Please be patient and use a fast connection.
./nucl_gb.accession2taxid.gz, ./nucl_wgs.accession2taxid.gz already exist. Delete to redownload
Preprocessing names with read.names.sql()
Preprocessing nodes with read.nodes.sql()
Preprocessing accession2taxid with read.accession2taxid()
Reading ./nucl_gb.accession2taxid.gz.
Error in readBin(inn, what = raw(0L), size = 1L, n = BFR.SIZE) :
  error reading from the connection
In addition: Warning message:
In readBin(inn, what = raw(0L), size = 1L, n = BFR.SIZE) :
  invalid or incomplete compressed data

I tried many things:
- Deleting all files and redownloading them -> Same error
- Downloading only the nucl_gb file -> Same error
- Downloading manually the nucl_gb file and running read.accession2taxid separately -> Same error
- Rewriting the files with overwrite = TRUE in the read.names.sql & read.nodes.sql functions.
- Changing the SQL database name -> Same error, same file (both 381,184 KB) with another name
- Changing the temporary directory (method in the last answer), since I saw that taxizedb was using the same one, following your reply on Issue 3 -> Successfully changed the temporary folder but same error.

I saw that @MajaCN had exactly the same issue and managed to deal with it but the solution is not provided ("we found a work-around and have the files now!").

Last things that might be useful for resolving this problem:
- My computer has 91.1 GB of free disk space.
- I have Windows 10.
- I have the following error when running: accessionToTaxa("Z17430.1", "accession_2_Taxa.sql")

Error: no such table: accessionTaxa
Warning message:
In file.remove(tmp) :
  impossible to delete the file 'C:/Users/lelio/DOCUME~1/STAGEM~1/LOCAL_~1\RtmpYdSW74\file3d4c119e4bf5', due to 'Permission denied'

Thanks in advance for helping me!

Best regards,

Eliot RUIZ

Problem with Convert names, nodes and accessions to database

Hello!

Using a Linux system, I get the following error:
read.accession2taxid(list.files('.','prot.accession2taxid.FULL.gz$'),'accessionTaxa.sql')
Reading prot.accession2taxid.FULL.gz.
Error in trimTaxa(ii, tmp, 1:3) : Malformed line on line 1

Can you please tell me how to solve this problem?

If I use prepareDatabase('accessionTaxa.sql'), it starts downloading accession2taxid, but I have already downloaded the full version: prot.accession2taxid.FULL.gz.

prepareDatabase() and getAccession2taxid() curl error - server did not report OK, got 450

Hello,

I tried to prepare the SQLite database using both prepareDatabase() and the manual steps, specifically getAccession2taxid(), and get the same error ("Error in curl::curl_download(xx, yy, mode = "wb", quiet = FALSE) : server did not report OK, got 450"). Do you have any solutions for why this is happening? I looked at other questions on the forum, and this error seems unique.

Thanks for all your help, in advance!

> prepareDatabase('accessionTaxa.sql') 
Downloading names and nodes with getNamesAndNodes()
 [100%] Downloaded 55185775 bytes...
 [100%] Downloaded 49 bytes...
Preprocessing names with read.names.sql()
Preprocessing nodes with read.nodes.sql()
Downloading accession2taxid with getAccession2taxid()
This can be a big (several gigabytes) download. Please be patient and use a fast connection.
 [100%] Downloaded 2053095811 bytes...
Error in curl::curl_download(xx, yy, mode = "wb", quiet = FALSE) : 
  server did not report OK, got 450


> sessionInfo()
R version 4.0.3 (2020-10-10)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Big Sur 10.16

Matrix products: default
LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] taxonomizr_0.8.0

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.6        DBI_1.1.1         RSQLite_2.2.3     rlang_0.4.10      cachem_1.0.4     
 [6] curl_4.3          data.table_1.14.0 blob_1.2.1        vctrs_0.3.6       tools_4.0.3      
[11] bit64_4.0.5       tinytex_0.29      bit_4.0.4         fastmap_1.1.0     xfun_0.21        
[16] parallel_4.0.3    compiler_4.0.3    pkgconfig_2.0.3   memoise_2.0.0

Input and output vector have different lengths, and vectors can't have named values

When I use accessionToTaxa on a vector with several thousand accession numbers, the output vector is a few entries shorter than the input vector.

I have no way of figuring out which accession numbers in the input caused the problem, as accessionToTaxa's output vector doesn't tell you which taxids go with which accession numbers.

A nice solution would be for accessionToTaxa and getTaxonomy to accept dataframes as input, where they append their output to the input dataframe as additional columns. It would also be an acceptable solution for both functions to accept named input vectors while preserving those names in their output vector.

Right now inputting a named vector results in the following error:

> head(input)
    OTU_8397    OTU_16523    OTU_10721     OTU_5863    OTU_16180     OTU_4413 
"CP013652.1" "CP016612.1" "CP002109.1" "CP022479.1" "CP011307.1" "CP022754.1" 
   OTU_16212     OTU_1206    OTU_15961     OTU_2057    OTU_14006 
"CP002403.1" "AP013044.1" "CP006594.1" "CP015971.1" "CP014282.1" 
> accessionToTaxa(input, "accessionTaxa.sql")
Error: Query and SQL mismatch
> input <- unname(input)
> head(input)
[1] "CP025590.1" "CP007443.1" "CP015399.2" "0"          "CP009872.1"
[6] "CP018970.1"
> head(accessionToTaxa(input, "accessionTaxa.sql"))
[1]    1590    1680 1834196      NA   72407     562
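
Until named input is supported, one workaround is to strip the names before the lookup and keep queries and results side by side in a data frame. The sketch below is self-contained: lookupTaxa is a hypothetical stand-in for accessionToTaxa backed by a toy table, the accession/taxid pairs are read off the console output above (assuming the two head() calls line up), and the OTU labels are made up.

```r
# Toy lookup table standing in for the SQLite database.
toyDb <- c("CP025590.1" = 1590, "CP007443.1" = 1680, "CP015399.2" = 1834196,
           "CP009872.1" = 72407, "CP018970.1" = 562)

# Hypothetical stand-in for accessionToTaxa(): one result per query, NA if unknown.
lookupTaxa <- function(accessions) unname(toyDb[accessions])

# Named input; the OTU labels are invented for illustration.
input <- c(OTU_a = "CP025590.1", OTU_b = "CP007443.1", OTU_c = "CP015399.2",
           OTU_d = "0", OTU_e = "CP009872.1", OTU_f = "CP018970.1")

# Strip the names for the lookup, then keep label, query and result together.
result <- data.frame(otu = names(input),
                     accession = unname(input),
                     taxaId = lookupTaxa(unname(input)),
                     stringsAsFactors = FALSE)
print(result)
```

With the real package, lookupTaxa(unname(input)) would be replaced by accessionToTaxa(unname(input), "accessionTaxa.sql"); comparing nrow(result) with length(input) then shows whether any query was dropped.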

Unable to convert GenBank IDs to lineage

I am trying to convert a CSV file with 1000 GenBank IDs to taxonomic lineages using the following commands in R with taxonomizr. I have also attached a Dropbox link to the input file. Thanks in advance.
https://www.dropbox.com/s/e80fg5ucydqfdsr/nifHgbID.csv?dl=0

Convert taxon structure to Newick format

Do you know a simple way to convert the result of a taxonomic search into Newick format to plot it with ggtree or iTOL?
I found a function here https://alakazam.readthedocs.io/en/stable/topics/graphToPhylo/, but maybe you know a more elegant way?

`accessionToTaxa` function not converting accession numbers to taxIDs?

Hello,

I am having trouble with the accessionToTaxa function and was wondering if you had any ideas why this is happening.

I had set up my database using the commands below:

getNamesAndNodes(outDir = "F:/R/db")
getAccession2taxid(outDir = "F:/R/db")
prepareDatabase('accessionTaxa.sql')

This created these files in my directory: names.dmp (193.3MB), nodes.dmp (199.1MB), and accessionTaxa.sql (64GB).

I tried to convert the accession numbers to IDs by running this: accessionToTaxa(mydata[,2], "accessionTaxa.sql")
mydata looks like this:

> head(mydata[,2], 10)
 [1] "WP_037213762.1" "WP_058500402.1" "WP_143869084.1" "WP_135060942.1" "MBJ15343.1"     "WP_054976553.1" "WP_163853887.1"
 [8] "WP_010503396.1" "WP_036687788.1" "WP_127490283.1"

But this returned a vector only with NAs, and I also got these warning messages:

Warning messages:
1: In file.remove(tmp) : cannot remove file 'C:\Users\JP\AppData\Local\Temp\Rtmpuwvsuw\file4dec541a158b', reason 'Permission denied'
2: In file.remove(tmp) : cannot remove file 'C:\Users\JP\AppData\Local\Temp\Rtmpuwvsuw\file4dec541a158b', reason 'Permission denied'

I have looked into the archived issues and tried creating and deleting a temporary file (as you suggested in issue #39) but the file.remove(tmp) did not throw any errors. The results looked like this:

> tmp <- tempfile()
> print(tmp)
[1] "C:\\Users\\JP\\AppData\\Local\\Temp\\Rtmpuwvsuw\\file4dec5085282c"
> file.create(tmp)
[1] TRUE
> file.exists(tmp)
[1] TRUE
> file.remove(tmp)
[1] TRUE
> file.exists(tmp)
[1] FALSE

I would really appreciate it if you can think of any suggestions to fix this issue, thank you!

Can't find nodes table?

I have installed the SQLite database, but when I run prepareDatabase to check it I see this odd note:

prepareDatabase('accessionTaxa.sql')
SQLite database accessionTaxa.sql already exists. Delete to regenerate
[1] "accessionTaxa.sql"
Warning message:
In file.remove(tmp) :
  cannot remove file 'C:\Users\MINARD~1\AppData\Local\Temp\RtmpGkhOfU\file13381f9165b0', reason 'Permission denied'

And when I try to run getTaxonomy I get an error that the nodes table is missing:

getTaxonomy(taxids, 'accessionTaxa.sql', desiredTaxa = c("superkingdom"))
Error: no such table: nodes

Where can I find the SQLite database? How can I fix this error?

I did set my working directory to the directory where I have my taxID tables, on Windows 10.

Error with prepareDatabase: RS_sqlite_import: expected 3 columns but found 4

Dear Sherrillmix,
I am trying to use taxonomizr to find the taxonomic information for NCBI accession numbers, but cannot proceed past the prepareDatabase step, which throws an error every time (see below). I am running R 4.0.3 in RStudio on a Windows 10 computer and have ensured there is enough room on the hard drive and in the temp directory. I can also see the file created by this command in the directory. I have also tried the solutions suggested in other issues, but to no avail. Is there a way to fix this?

prepareDatabase("accessionTaxa.sql") # prepare the NCBI database - large, put in ext. harddrive!
Downloading names and nodes with getNamesAndNodes()
./names.dmp, ./nodes.dmp already exist. Delete to redownload
Downloading accession2taxid with getAccession2taxid()
This can be a big (several gigabytes) download. Please be patient and use a fast connection.
./nucl_gb.accession2taxid.gz, ./nucl_wgs.accession2taxid.gz already exist. Delete to redownload
Preprocessing names with read.names.sql()
Preprocessing nodes with read.nodes.sql()
Preprocessing accession2taxid with read.accession2taxid()
Reading ./nucl_gb.accession2taxid.gz.
Reading ./nucl_wgs.accession2taxid.gz.
Reading in values. This may take a while.
Error in connection_import_file(conn@ptr, name, value, sep, eol, skip) :
RS_sqlite_import: D:/RPC_SpiderGut/rtemp\RtmpmSWBo9\file341c33f67f6 line 122996989 expected 3 columns of data but found 4
In addition: There were 50 or more warnings (use warnings() to see the first 50)

Thank you in advance for your help!
-Maike

Error: with NC_ (accession numbers starting with NC)

Dear Sir

It works perfectly for all my records except those whose accession numbers start with NC_:
taxaId2<-accessionToTaxa(NC_010240.1,"accessionTaxa.sql")
Error in accessionToTaxa(NC_010240.1, "accessionTaxa.sql") :
object 'NC_010240.1' not found

How can we fix it ? Your kind help in this regard is needed. Thanks.

Regards
Ali Zohaib
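
The error message above comes from R itself, not from the database: an unquoted NC_010240.1 is read as a variable name, so R looks for an object with that name. A minimal sketch of the difference follows. The final call is guarded and only runs if taxonomizr is installed and a file named accessionTaxa.sql (assumed here) exists in the working directory.

```r
# Unquoted, the accession is parsed as a symbol and R searches for a variable
unquotedMessage <- tryCatch(
  eval(quote(NC_010240.1)),
  error = function(e) conditionMessage(e)
)
print(unquotedMessage)

# Quoted, it is an ordinary character string that can be passed to a function
accession <- "NC_010240.1"
print(is.character(accession))

# Guarded call: assumes the database file from the post is in the working directory
if (requireNamespace("taxonomizr", quietly = TRUE) && file.exists("accessionTaxa.sql")) {
  taxaId2 <- taxonomizr::accessionToTaxa(accession, "accessionTaxa.sql")
  print(taxaId2)
}
```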

Error in result_create(conn@ptr, statement) : no such table: ...

I am trying to run code that worked on a co-worker's machine.

When I try to run

 taxa <- accessionToTaxa(arthropoda[ condor_index, 1], sqlFile= "accessionTaxa.sql")

I get the following error message

Error in result_create(conn@ptr, statement) :
no such table: accessionTaxa
Timing stopped at: 0.031 0.003 0.036

A quick Google search found this question on SO, and a comment on one of the answers notes:

The problem is that RSQLite 2.0 was just released and does not work with older versions of sqldf. sqldf 0.4-11 which was just released should be used (or else use an older version of RSQLite)

I have tried upgrading taxonomizr and am using taxonomizr_0.5.3. Do you have any suggestions? Or has anyone else seen this with the package before?

Error from read.names.sql

I'm building a new accessionTaxa.sql db with protein accessions and get the following error:

library(taxonomizr)

getNamesAndNodes()
trying URL 'ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz'
Content type 'unknown' length 49900153 bytes (47.6 MB)
==================================================
[1] "./names.dmp" "./nodes.dmp"

getAccession2taxid(types='prot')
This can be a big (several gigabytes) download. Please be patient and use a fast connection.
trying URL 'ftp://ftp.ncbi.nih.gov/pub/taxonomy/accession2taxid//prot.accession2taxid.gz'
Content type 'unknown' length 5572900471 bytes (5314.7 MB)
==================================================
[1] "./prot.accession2taxid.gz"

read.names.sql('names.dmp','accessionTaxa.sql') read.nodes.sql('nodes.dmp','accessionTaxa.sql')
Error: unexpected symbol in "read.names.sql('names.dmp','accessionTaxa.sql') read.nodes.sql"

Any help appreciated...
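
The "unexpected symbol" message is an R parse error rather than a taxonomizr error: two calls were entered on one line with nothing separating them. A short sketch using base R's parse() (which only parses the text and does not run the calls) shows that the same two calls are accepted once they are placed on separate lines:

```r
# Both calls on one line, as in the post: R cannot parse this
oneLine <- "read.names.sql('names.dmp','accessionTaxa.sql') read.nodes.sql('nodes.dmp','accessionTaxa.sql')"
oneLineResult <- tryCatch(
  { parse(text = oneLine); "parsed" },
  error = function(e) "parse error"
)
print(oneLineResult)

# The same calls separated by a newline parse as two expressions
twoLines <- "read.names.sql('names.dmp','accessionTaxa.sql')\nread.nodes.sql('nodes.dmp','accessionTaxa.sql')"
nExpressions <- length(parse(text = twoLines))
print(nExpressions)
```

A semicolon between the two calls would work as well.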

error: 'Query and SQL mismatch'

Hi,

This is a follow-up to issue no. 43 (get taxID for a huge list of accession numbers #43); my apologies for the delayed reply. Thank you for the prompt response, it was very helpful. I am very new to coding and to R itself, so I am still trying to get the hang of it.

I managed to run the command taxaId<-accessionToTaxa(myCsv[,1],"accessionTaxa.sql") on my .csv file; however, I ended up with an error message that says 'Query and SQL mismatch'. I'm not sure where I'm going wrong. I've attached a file with the first 10 lines of my .csv file.
accesion_first10.csv

Thank you so much.

Error in connection_import_file(conn@ptr, name, value, sep, eol, skip) : RS_sqlite_import: disk I/O error Error: no such savepoint: dbWriteTable

Dear Sherrillmix,
There is one issue when I run prepareDatabase('accessionTaxa.sql'). Could you please tell me how to fix it? Thank you very much.

library(taxonomizr)
prepareDatabase('accessionTaxa.sql')
Downloading names and nodes with getNamesAndNodes()
./names.dmp, ./nodes.dmp already exist. Delete to redownload
Downloading accession2taxid with getAccession2taxid()
This can be a big (several gigabytes) download. Please be patient and use a fast connection.
./nucl_gb.accession2taxid.gz, ./nucl_wgs.accession2taxid.gz already exist. Delete to redownload
Preprocessing names with read.names.sql()
Preprocessing nodes with read.nodes.sql()
Preprocessing accession2taxid with read.accession2taxid()
Reading ./nucl_gb.accession2taxid.gz.
Reading ./nucl_wgs.accession2taxid.gz.
Reading in values. This may take a while.
Error in connection_import_file(conn@ptr, name, value, sep, eol, skip) :
RS_sqlite_import: disk I/O error
Error: no such savepoint: dbWriteTable

Memory Required?

What machine specs are required to run this package? I had a little over 30GB left on my hard drive and got an out of memory error at this step in the vignette :
https://github.com/sherrillmix/taxonomizr/blob/master/vignettes/usage.Rmd#convert-accessions-to-database
read.accession2taxid(list.files('.','accession2taxid.gz$'),'accessionTaxa.sql')

there was a >16 GB temp file after the error as well.

matching old accession number

Hi, great package. I am using it to fix/define taxonomic assignments from a Silva 18s database. The dmp files that I just downloaded seem to have slightly different accession numbers than the Silva database and thus I am getting NA's after running the accessionToTaxa function. See below for example. Is there a quick way to remove characters including and to the right of the first ".", perhaps within the sql? Removing everything after the first "." in both databases should result in the correct taxonomy, as far as I can see. I am new to github and using sqls, so I apologize if this is not a good place for this question.
thanks

Silva accession numbers

"AC090637.149908.151196","AC091599.220.1669","AC091632.4938.6802","AC207586.19448.20862", "JQ776649.1.1382","JQ781512.1.1275"

Accession numbers that work in the recently downloaded and processed SQL database (built using your package) and that match the current NCBI version numbers (double-checked on the web):

"AC090637.2","AC091599.1","AC091632.1","AC207586.3","JQ776649.2","JQ781512.1"

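The trimming asked about above can be done in R before the lookup, without touching the SQL. This is a minimal sketch using the Silva-style accessions from the post: sub() removes the first "." and everything after it. The commented-out last line assumes the installed taxonomizr version offers a version argument on accessionToTaxa for matching unversioned accessions; check ?accessionToTaxa before relying on it.

```r
# Silva-style accessions copied from the post (accession.start.stop)
silva <- c("AC090637.149908.151196", "AC091599.220.1669", "AC091632.4938.6802",
           "AC207586.19448.20862", "JQ776649.1.1382", "JQ781512.1.1275")

# Drop the first "." and everything to its right, leaving the base accession
baseAccessions <- sub("\\..*$", "", silva)
print(baseAccessions)

# Assumption: accessionToTaxa accepts version = "base" for unversioned accessions
# ids <- taxonomizr::accessionToTaxa(baseAccessions, "accessionTaxa.sql", version = "base")
```
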
Error in preparing accessionTaxa.sql with prepareDatabase()

Running prepareDatabase("accessionTaxa.sql",tmpDir="some_directory") yields:

Downloading names and nodes with getNamesAndNodes()
[100%] Downloaded 59698472 bytes...
[100%] Downloaded 49 bytes...
Preprocessing names with read.names.sql()
Preprocessing nodes with read.nodes.sql()
Downloading accession2taxid with getAccession2taxid()
This can be a big (several gigabytes) download. Please be patient and use a fast connection.
[100%] Downloaded 2237753217 bytes...
[100%] Downloaded 61 bytes...
[100%] Downloaded 4152831461 bytes...
[100%] Downloaded 62 bytes...
Error in (function (xx, yy) :
Downloaded file does not match ftp://ftp.ncbi.nih.gov/pub/taxonomy/accession2taxid/nucl_wgs.accession2taxid.gz File corrupted or download ended early?
Calls: prepareDatabase -> do.call -> -> mapply ->
Execution halted

Assignment of taxonomy to blast hits

Hi, I have some BLAST hits and am attaching the hit IDs, query and output below.

taxaId<-accessionToTaxa(c("SJL08569.1","PBK92452.1"),"../../accessionTaxa.sql")
print(taxaId)
[1] NA NA

May I know what could be the issue?

Error prepareDatabase

Hi, I am getting the following error:

prepareDatabase('accessionTaxa.sql')
Downloading names and nodes with getNamesAndNodes()
trying URL 'ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz'
Content type 'unknown' length 46768956 bytes (44.6 MB)
=
downloaded length 1119304 != reported length 46768956
URL 'ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz': status was 'Transferred a partial file'
Error in utils::download.file(url, tarFile, mode = "wb") :
cannot open URL 'ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz'

Do you know how to fix it?

Many thanks!

Carolina
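
A "Transferred a partial file" status means the download stopped before the whole file arrived. One possible cause (an assumption, not confirmed for this report) is R's download timeout, which defaults to 60 seconds and can cut off large files on slower connections. A sketch of raising it for the current session before retrying:

```r
# Remember the current setting so it can be restored later
oldTimeout <- getOption("timeout")

# Allow downloads to run for up to an hour (3600 seconds is an arbitrary choice)
options(timeout = max(3600, oldTimeout))
print(getOption("timeout"))

# Then delete any partial taxdump.tar.gz and rerun prepareDatabase('accessionTaxa.sql')
```

If the download still stops early, the connection or an intermediate firewall may be interrupting the transfer.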

Function distance between taxons

I was looking for a function that computes the distance between two taxa. Since I found nothing, I would like to propose the following:


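  library(igraph)

  ## Hypothetical wrapper; accessionSql is the path to the taxonomizr SQLite file
  taxaDistance <- function(fromId, toId, accessionSql) {
    ## Load the taxon node table from SQLite into a data.frame
    db <- RSQLite::dbConnect(RSQLite::SQLite(), dbname = accessionSql)
    on.exit(RSQLite::dbDisconnect(db))
    taxaDF <- RSQLite::dbGetQuery(db, 'SELECT id,parent,rank FROM nodes')

    ## Build an undirected igraph object from the id-to-parent edges
    g <- graph_from_data_frame(taxaDF[, 1:2], directed = FALSE)

    ## Vertex names are stored as character, so convert the IDs before lookup
    distances(g, v = as.character(fromId), to = as.character(toId))
  }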

Is this distance calculation in the scope of taxonomizr ?

read.accession2taxid error creating SQL file: malformed line

Hi Scott,

Thanks for making this package available. I've been looking for a convenient way to convert accession numbers to UIDs to a taxonomy lineage for a while. I tried to get taxonomizr to create the .sql file to no avail. It's returning an error about a malformed line. Have you encountered this before? Any idea if I'm doing something wrong? I've attached my R logs below.

> library("taxonomizr")
> getNamesAndNodes()
trying URL 'ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz'
Content type 'unknown' length 41706736 bytes (39.8 MB)
==================================================
[1] "./names.dmp" "./nodes.dmp"
> getAccession2taxid()
This can be a big (several gigabytes) download. Please be patient and use a fast connection.
trying URL 'ftp://ftp.ncbi.nih.gov/pub/taxonomy/accession2taxid//nucl_gb.accession2taxid.gz'
Content type 'unknown' length 988927554 bytes (943.1 MB)
==================================================
trying URL 'ftp://ftp.ncbi.nih.gov/pub/taxonomy/accession2taxid//nucl_est.accession2taxid.gz'
Content type 'unknown' length 544402419 bytes (519.2 MB)
==================================================
trying URL 'ftp://ftp.ncbi.nih.gov/pub/taxonomy/accession2taxid//nucl_gss.accession2taxid.gz'
Content type 'unknown' length 279039082 bytes (266.1 MB)
==================================================
trying URL 'ftp://ftp.ncbi.nih.gov/pub/taxonomy/accession2taxid//nucl_wgs.accession2taxid.gz'
Content type 'unknown' length 3067473581 bytes (2925.4 MB)
==================================================
[1] "./nucl_gb.accession2taxid.gz"  "./nucl_est.accession2taxid.gz"
[3] "./nucl_gss.accession2taxid.gz" "./nucl_wgs.accession2taxid.gz"
> read.accession2taxid(list.files('.','accession2taxid.gz$'),'accessionTaxa.sql')
Reading nucl_est.accession2taxid.gz.
Reading nucl_gb.accession2taxid.gz.
Reading nucl_gss.accession2taxid.gz.
Reading nucl_wgs.accession2taxid.gz.
Error: Problem creating sql file. Deleting.
Error in trimTaxa(ii, tmp) : Malformed line on line 46212441 
In addition: There were 50 or more warnings (use warnings() to see the first 50)

I checked the *accession2taxid.gz file checksums and they do match the md5 sums from NCBI.
MD5 sums:

3c0ea1b1e5b93911d205b68a916c2a19  nucl_est.accession2taxid.gz
8f6871b4b23ba591f3f0f122d0d3cb96  nucl_gb.accession2taxid.gz
19d8a69f3efbdcb482646efa4538467e  nucl_gss.accession2taxid.gz
210fa57011a0a44b7ce3fb8faed709bf  nucl_wgs.accession2taxid.gz

Cheers,
Bryan Nguyen

Error: Problem creating sql file. Deleting. Error in rsqlite_import_file(conn@ptr, name, value, sep, eol, skip) : RS_sqlite_import: /tmp/RtmpFiCHWM/file7bfd7d94f1c3 line 249184308 expected 2 columns of data but found 1 In addition: There were 50 or more warnings (use warnings() to see the first 50)

Hello,

I have downloaded prot.accession2taxid.gz and the other 4 nucl files. I checked their md5 values, and they downloaded completely.
However, when I run > read.accession2taxid(list.files('.','accession2taxid.gz$'),'accessionTaxa.sql') , I get this error:
Reading nucl_est.accession2taxid.gz.
Reading nucl_gb.accession2taxid.gz.
Reading nucl_gss.accession2taxid.gz.
Reading nucl_wgs.accession2taxid.gz.
Reading prot.accession2taxid.gz.
Reading in values. This may take a while.
Error: Problem creating sql file. Deleting.
Error in rsqlite_import_file(conn@ptr, name, value, sep, eol, skip) :
RS_sqlite_import: /tmp/RtmpFiCHWM/file7bfd7d94f1c3 line 249184308 expected 2 columns of data but found 1
In addition: There were 50 or more warnings (use warnings() to see the first 50)

I used R 3.43. Could you kindly give me any suggestion to resolve this problem? Thank you so much

Best wishes,

Chen
