ncbi-xsl's Introduction

# Versatile and Lossless Conversion of NCBI GenBank Records

I've tested a handful of Perl and Python scripts for retrieving annotations from GenBank flat files (.gbk, .gbff or .seq). However, accurate, lossless conversion by parsing GenBank flat files remains out of reach. A better option is to download GFF files from ftp://ftp.ncbi.nlm.nih.gov/genomes/, but the GFF collection is available only for a subset of RefSeq. In addition, some entries may be outdated or temporarily withdrawn during curation.

The official solution is to parse ASN.1 files instead of flat files, using annotwriter from the NCBI C++ Toolkit. However, no precompiled annotwriter is distributed, so the 131 MB binary has to be built from source. See http://sourceforge.net/p/song/mailman/song-devel/thread/[email protected]/

  1. Install the NCBI C++ Toolkit. Warning: a full installation takes 21 GB. To compile annotwriter only, see http://www.ncbi.nlm.nih.gov/mailman/pipermail/cpp/2015q4/002738.html

     curl -O ftp://ftp.ncbi.nih.gov/toolbox/ncbi_tools++/CURRENT/ncbi_cxx--12_0_0.tar.gz
     tar zxvf ncbi_cxx--12_0_0.tar.gz
     cd ncbi_cxx--12_0_0
     ./configure --prefix=/ANY/DIR/ncbi_cxx--12_0_0
     make
     make install
     export PATH=$PATH:/ANY/DIR/ncbi_cxx--12_0_0/bin
    
  2. Download Entrez Direct suite

     curl -O ftp://ftp.ncbi.nlm.nih.gov/entrez/entrezdirect/edirect.zip
     unzip edirect.zip
     export PATH=$PATH:~/ANY/DIR/edirect
    
  3. Download an ASN.1 file

     efetch -db nucleotide -id <gi> > <gi.asn>
    
  4. Convert an ASN.1 file to gff3 file

     annotwriter -i <gi.asn> -format gff3 -full-annots -o <gi.gff>
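
Steps 3 and 4 can be wrapped in a small shell function so that a missing tool or an empty download is caught before annotwriter runs. The sketch below is only an illustration: `fetch_gff3` is a hypothetical helper name, and the `efetch` and `annotwriter` command lines are copied unchanged from the steps above.

```shell
#!/bin/sh
# Sketch: fetch one record as ASN.1 and convert it to GFF3.
# fetch_gff3 is a hypothetical helper name, not part of any NCBI tool.

fetch_gff3() {
    gi="$1"
    if [ -z "$gi" ]; then
        echo "usage: fetch_gff3 <gi>" >&2
        return 2
    fi
    # Fail early if either tool is missing from PATH.
    for tool in efetch annotwriter; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "error: $tool not found on PATH" >&2
            return 127
        fi
    done
    # Step 3: download the ASN.1 record.
    efetch -db nucleotide -id "$gi" > "$gi.asn" || return 1
    # Do not hand an empty download to annotwriter.
    if [ ! -s "$gi.asn" ]; then
        echo "error: empty ASN.1 file for $gi" >&2
        return 1
    fi
    # Step 4: convert ASN.1 to GFF3.
    annotwriter -i "$gi.asn" -format gff3 -full-annots -o "$gi.gff"
}

# Example (needs network access and both tools installed):
# fetch_gff3 10313991
```

The function writes `<gi>.asn` and `<gi>.gff` into the current directory and returns a non-zero status if any stage fails, so it can be used in a loop that logs failed GIs.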
    

Another, highly versatile solution is to parse INSDSeq XML files. The steps below use viral RefSeq as an example.

  1. Retrieve all GIs from viral.1.1.genomic.fna

     curl -O ftp://ftp.ncbi.nih.gov/refseq/release/viral/viral.1.1.genomic.fna.gz
     gunzip viral.1.1.genomic.fna.gz
     grep ">" viral.1.1.genomic.fna | awk 'BEGIN {FS="|"} {print $2}' > viral.1.1.genomic.gi
    
  2. Download Entrez Direct suite

     curl -O ftp://ftp.ncbi.nlm.nih.gov/entrez/entrezdirect/edirect.zip
     unzip edirect.zip
     export PATH=$PATH:~/ANY/DIR/edirect
    
  3. Download viral refseq in INSDseq XML format using a list of GI.

    NCBI Website and Data Usage Policies and Disclaimers: Run retrieval scripts on weekends or between 9 pm and 5 am Eastern Time weekdays for any series of more than 100 requests.

     while read -r name; do
         efetch -db nucleotide -id "$name" -format gpc > "$name.xml"
         sleep 1
     done < viral.1.1.genomic.gi
    
  4. Install XMLStarlet (optional)

    on Ubuntu:

     sudo apt-get install xmlstarlet
    

    on RedHat/CentOS/Fedora:

     sudo yum install xmlstarlet
    

    on Mac OS X:

     curl -O http://iweb.dl.sourceforge.net/project/xmlstar/xmlstarlet/1.6.1/xmlstarlet-1.6.1.tar.gz
     tar zxvf xmlstarlet-1.6.1.tar.gz
     cd xmlstarlet-1.6.1
     ./configure
     make
     sudo make install
    
  5. View INSDseq XML structure (optional) - helps in coding a stylesheet. 10313991.xml is one of the fetched files.

     xmlstarlet el 10313991.xml
    
  6. Parse the XML with a custom stylesheet, which is surprisingly easy to write.

     xsltproc --novalid insdseq2annotation.xsl 10313991.xml
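
Steps 1 and 3 of the INSDSeq route can be combined into a script that is safe to re-run after an interruption. This is a rough sketch, not a tested pipeline: `extract_gi` and `fetch_all` are hypothetical names, and it assumes FASTA headers of the form `>gi|<number>|ref|<accession>|`, which is what the awk command in step 1 relies on. Redirecting efetch's stdin from /dev/null is a precaution I've added so the command cannot consume the GI list that the loop is reading.

```shell
#!/bin/sh
# Sketch: extract GIs from FASTA headers, then fetch each record politely.
# extract_gi and fetch_all are hypothetical helper names.

# Print the GI from every header of the form ">gi|<number>|ref|<accession>| ...".
extract_gi() {
    awk -F'|' '/^>gi\|/ { print $2 }' "$@"
}

# Fetch one INSDSeq XML file per GI listed in the file given as $1.
# Existing non-empty files are skipped, so an interrupted run can be resumed.
fetch_all() {
    list="$1"
    delay="${2:-1}"   # seconds to wait between requests
    while read -r gi; do
        [ -n "$gi" ] || continue          # ignore blank lines
        [ -s "$gi.xml" ] && continue      # already downloaded
        # stdin comes from /dev/null so efetch cannot swallow the GI list
        if ! efetch -db nucleotide -id "$gi" -format gpc > "$gi.xml" < /dev/null; then
            echo "warning: efetch failed for $gi" >&2
            rm -f "$gi.xml"               # drop the partial file so a re-run retries it
        fi
        sleep "$delay"
    done < "$list"
}

# Example (needs network access and Entrez Direct on PATH):
# extract_gi viral.1.1.genomic.fna > viral.1.1.genomic.gi
# fetch_all viral.1.1.genomic.gi 1
```

Because failed downloads are deleted and finished ones are skipped, running `fetch_all` a second time only requests the records that are still missing, which keeps the request count down under the NCBI usage policy quoted in step 3.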
    

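To give an idea of what a stylesheet for step 6 can look like, here is a minimal sketch that prints one tab-separated line per feature (accession, feature key, location, product). It is not the insdseq2annotation.xsl used above; `features.xsl`, `write_feature_xsl` and `features_to_tsv` are hypothetical names chosen for this example. The element names are the INSDSeq ones that `xmlstarlet el` lists in step 5.

```shell
#!/bin/sh
# Sketch: write a minimal XSLT 1.0 stylesheet and apply it with xsltproc.
# The file and function names are hypothetical; this is not the
# repository's insdseq2annotation.xsl.

# Write the stylesheet to $1 (default: features.xsl).
write_feature_xsl() {
    cat > "${1:-features.xsl}" <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text" encoding="UTF-8"/>
  <!-- One line per feature: accession, key, location, product -->
  <xsl:template match="/">
    <xsl:for-each select="//INSDSeq/INSDSeq_feature-table/INSDFeature">
      <xsl:value-of select="ancestor::INSDSeq/INSDSeq_primary-accession"/>
      <xsl:text>&#9;</xsl:text>
      <xsl:value-of select="INSDFeature_key"/>
      <xsl:text>&#9;</xsl:text>
      <xsl:value-of select="INSDFeature_location"/>
      <xsl:text>&#9;</xsl:text>
      <xsl:value-of select="INSDFeature_quals/INSDQualifier[INSDQualifier_name='product']/INSDQualifier_value"/>
      <xsl:text>&#10;</xsl:text>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>
EOF
}

# Apply stylesheet $1 to INSDSeq XML file $2 and print the table to stdout.
features_to_tsv() {
    xsltproc --novalid "$1" "$2"
}

# Example:
# write_feature_xsl features.xsl
# features_to_tsv features.xsl 10313991.xml > 10313991.tsv
```

Features without a `product` qualifier simply get an empty last column. Adding another column is a matter of adding one more `xsl:value-of` with a different `INSDQualifier_name` test, which is why this route is so easy to adapt.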
ncbi-xsl's People

Contributors

lcscs12345
