glarue / jgi-query

A simple command-line tool to download data from Joint Genome Institute databases

License: Mozilla Public License 2.0

Python 100.00%

python bioinformatics cli genomics genomes

jgi-query's Introduction

jgi-query

A command-line tool for querying and downloading from databases hosted by the Joint Genome Institute (JGI). Useful for accessing JGI data from command-line-only resources such as remote servers, or as a lightweight alternative to JGI's other GUI-based download tools.

Dependencies

A user account with JGI (free)
cURL, required by the JGI download API
Python 3.x (current development) or 2.7.x (deprecated but provided -- now significantly outdated)

Installation

Download jgi-query.py
Ensure that you're running the correct version of Python with python --version. If this reports Python 2.x, run the script using python3 instead of python
From the command line, run the script with the command python jgi-query.py to show usage information and further instructions

Usage information

usage: jgi-query.py [-h] [-x [XML]] [-c] [-s] [-f] [-u] [-n RETRY_N]
                    [-l logfile] [-r REGEX] [-a]
                    [organism_abbreviation]

This script will list and retrieve files from JGI using the curl API. It will
return a list of all files available for download for a given query organism.

positional arguments:
  organism_abbreviation
                        organism name formatted per JGI's abbreviation. For
                        example, 'Nematostella vectensis' is abbreviated by
                        JGI as 'Nemve1'. The appropriate abbreviation may be
                        found by searching for the organism on JGI; the name
                        used in the URL of the 'Info' page for that organism
                        is the correct abbreviation. The full URL may also be
                        used for this argument (default: None)

optional arguments:
  -h, --help            show this help message and exit
  -x [XML], --xml [XML]
                        specify a local xml file for the query instead of
                        retrieving a new copy from JGI (default: None)
  -c, --configure       initiate configuration dialog to overwrite existing
                        user/password configuration (default: False)
  -s, --syntax_help
  -f, --filter_files    filter organism results by config categories instead
                        of reporting all files listed by JGI for the query
                        (work in progress) (default: False)
  -u, --usage           print verbose usage information and exit (default:
                        False)
  -n RETRY_N, --retry_n RETRY_N
                        number of times to retry downloading files with errors
                        (0 to skip such files) (default: 4)
  -l logfile, --load_failed logfile
                        retry downloading from URLs listed in log file
                        (default: None)
  -r REGEX, --regex REGEX
                        Regex pattern to use to auto-select and download files
                        (no interactive prompt) (default: None)
  -a, --all             Auto-select and download all files for query (no
                        interactive prompt) (default: False)

Author's note

This is a somewhat better-commented (emphasis on "somewhat") version of a script I wrote for grabbing various datasets using a headless Linux server. For a lot of my lab's bioinformatics work, we don't store/manipulate data on our local computers, and I was not able to find a good tool that allowed for convenient queries of the JGI database without additional software.

JGI also no longer allows simple downloading of many of their datasets (via wget, for example), which is another reason behind the creation of this script.

I highly encourage anyone with more advanced Python skills (read: almost everyone) to fork and submit pull requests.

General overview

JGI uses a cURL-based API to provide information/download links to files in their database.

In brief, jgi-query begins by using cURL to grab an XML file for the query text. The XML file describes all of the available files and their parent categories. For example, the file for Aureobasidium subglaciale (JGI abbreviation "Aurpu_var_sub1") begins:

jgi-query will parse the XML file to find entries with a filename attribute and, depending on command-line arguments, a parent category from the list of categories in jgi-query.config. It then displays the available files with minimal metadata, and prompts the user to enter their selection.
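The parsing step described above can be sketched with Python's standard library. This is only an illustration, not the script's actual code: the sample XML is hypothetical, and every element and attribute name in it other than filename is an assumption.

```python
import xml.etree.ElementTree as ET

# Hypothetical index snippet. Only the "filename" attribute comes from the
# description above; the tag names and the other attributes are assumptions.
SAMPLE_XML = """
<organismDownloads name="Example1">
  <folder name="Files">
    <file filename="Example1.fasta.gz" size="81 MB" />
    <folder name="Annotation">
      <file filename="Example1.gff.gz" size="3 MB" />
    </folder>
  </folder>
  <note>no filename attribute here</note>
</organismDownloads>
"""


def list_files(xml_text):
    """Return (top-level category, filename) pairs for elements with a filename attribute."""
    root = ET.fromstring(xml_text)
    found = []
    # Treat each child of the root as a parent category, then walk
    # everything nested beneath it.
    for category in root:
        for element in category.iter():
            filename = element.get("filename")
            if filename is not None:
                found.append((category.get("name"), filename))
    return found


for category, filename in list_files(SAMPLE_XML):
    print(f"{category}: {filename}")
```

Elements without a filename attribute (such as the note element in the sample) are skipped, which mirrors the "entries with a filename attribute" filter described above.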

File selection

Main file categories in the report are numbered, as are files within each category. The selection syntax is category_number:file_selection, where file_selection is either a comma-separated list (e.g. file1, file2) or a contiguous range (e.g. file1-file4). For multiple parent categories and associated files, category/file list groupings are linked with semicolons (e.g. category1:file1,file2;category2:file5-file8).
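As a rough sketch of how a selection string in this syntax could be interpreted, the hypothetical parse_selection function below turns the text into a mapping of category numbers to file numbers. It is not taken from jgi-query itself, and it assumes that a list and a range may be mixed within one category, which the description above does not state.

```python
def parse_selection(text):
    """Parse 'category:files;category:files' into {category: [file numbers]}."""
    selection = {}
    # Category groupings are separated by semicolons.
    for group in text.split(";"):
        group = group.strip()
        if not group:
            continue
        category_part, _, files_part = group.partition(":")
        numbers = selection.setdefault(int(category_part), [])
        # Within a category, files are given as comma-separated items.
        for item in files_part.split(","):
            item = item.strip()
            if "-" in item:
                # A contiguous range such as 5-8 includes both endpoints.
                start, end = (int(n) for n in item.split("-"))
                numbers.extend(range(start, end + 1))
            else:
                numbers.append(int(item))
    return selection


print(parse_selection("2:3;3:1"))
print(parse_selection("1:1,2;2:5-8"))
```

Under these assumptions, the selection 2:3;3:1 means file 3 of category 2 plus file 1 of category 3.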

Bulk file downloading

Additionally, there is a regex-based file selection option (enter "r" at the file selection prompt) which may be useful for selecting a large number of related files (see the Python regex documentation for syntax information). For example, to retrieve all files with "AllModels" in their names, the regex to enter at the regex prompt would be .*AllModels.*.
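To try a pattern before entering it at the prompt, something like the following sketch may help. The filenames are copied from the Nemve1 sample output in this README; the select_by_regex helper is hypothetical, and the use of re.search (matching anywhere in the name) is an assumption about how the pattern is applied.

```python
import re

# Filenames taken from the Nemve1 sample output in this README.
FILENAMES = [
    "Nemve1.AllModels.gff.gz",
    "proteins.Nemve1AllModels.fasta.gz",
    "transcripts.Nemve1AllModels.fasta.gz",
    "Nemve1.FilteredModels1.txt.gz",
    "Nemve1.fasta.gz",
    "Nemve1.FilteredModels1.gff.gz",
]


def select_by_regex(pattern, names):
    """Return the names matched by the pattern (assumes search semantics)."""
    compiled = re.compile(pattern)
    return [name for name in names if compiled.search(name)]


for name in select_by_regex(r".*AllModels.*", FILENAMES):
    print(name)
```

With this list, the pattern .*AllModels.* picks out the three AllModels files and leaves the FilteredModels1 and genome FASTA files unselected.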

Use in a larger pipeline

For programmatic use, jgi-query also has command-line arguments, -a and -r, that allow retrieval of either complete or regex-filtered datasets, respectively, while bypassing interactive prompts. For example, to retrieve all gzipped GFF3 files with "FilteredModels1" in their names for Schizophyllum commune:

python3 jgi-query.py Schco3 -r 'FilteredModels1.*\.gff3\.gz$'
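From a Python pipeline, the same non-interactive call could be wrapped roughly as follows. This is a sketch under assumptions: build_command and run_query are hypothetical helpers, jgi-query.py is assumed to sit in the working directory, and JGI credentials are assumed to have been configured already (see the -c option).

```python
import subprocess
import sys


def build_command(organism, regex=None, script="jgi-query.py"):
    """Build the argument list for a non-interactive jgi-query run."""
    command = [sys.executable, script, organism]
    if regex is None:
        # No pattern given: select and download every file for the query.
        command.append("-a")
    else:
        # Pattern given: select and download only matching files.
        command.extend(["-r", regex])
    return command


def run_query(organism, regex=None):
    """Run jgi-query and return the CompletedProcess object."""
    # check=True makes subprocess raise CalledProcessError on a non-zero
    # exit status; whether jgi-query signals failures that way is not
    # confirmed here.
    return subprocess.run(build_command(organism, regex), check=True)


# Example (not executed here):
# run_query("Schco3", r"FilteredModels1.*\.gff3\.gz$")
print(build_command("Schco3", r"FilteredModels1.*\.gff3\.gz$"))
```

Passing the arguments as a list avoids shell quoting problems with the regex.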

Sample output for Nematostella vectensis ('Nemve1')

➜ python3 jgi-query.py Nemve1                                  
Retrieving information from JGI for query 'Nemve1' using command 'curl 'https://genome.jgi.doe.gov/ext-api/downloads/get-directory?organism=Nemve1' -L -b cookies > Nemve1_jgi_index.xml'

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   379  100   379    0     0   1857      0 --:--:-- --:--:-- --:--:--  1857
100  4350    0  4350    0     0   3958      0 --:--:--  0:00:01 --:--:-- 4248k


QUERY RESULTS FOR 'Nemve1'

======================= 1: All models, Filtered and Not ========================
Genes:
 1:[1] Nemve1.AllModels.gff.gz-----------------------------------[20 MB|03/2012]
Proteins:
 1:[2] proteins.Nemve1AllModels.fasta.gz-------------------------[29 MB|03/2012]
Transcripts:
 1:[3] transcripts.Nemve1AllModels.fasta.gz----------------------[55 MB|03/2012]

=================================== 2: Files ===================================
Additional Files:
 2:[1] N.vectensis_ABAV.modified.scflds.p2g.gz------------------[261 KB|03/2012]
 2:[2] Nemve1.FilteredModels1.txt.gz------------------------------[2 MB|03/2012]
 2:[3] Nemve1.fasta.gz-------------------------------------------[81 MB|10/2005]
 2:[4] Nemve_JGIest.fasta.gz-------------------------------------[30 MB|03/2012]
 2:[5] Nemve_JGIestCL.fasta.gz------------------------------------[8 MB|03/2012]
 2:[6] NvTRjug.fasta.gz-------------------------------------------[4 KB|03/2012]

========================= 3: Filtered Models ("best") ==========================
Genes:
 3:[1] Nemve1.FilteredModels1.gff.gz------------------------------[3 MB|03/2012]
 3:[2] Nvectensis_19_PAC2_0.GFF3.gz-------------------------------[2 MB|03/2012]
Proteins:
 3:[3] proteins.Nemve1FilteredModels1.fasta.gz--------------------[5 MB|03/2012]
Transcripts:
 3:[4] transcripts.Nemve1FilteredModels1.fasta.gz-----------------[8 MB|03/2012]

Enter file selection ('q' to quit, 'usage' to review syntax, 'a' for all, 'r' for regex-based filename matching):
> 2:3;3:1
Total download size for 2 files: 84.02 MB
Continue? (y/n/[p]review files): y
Downloading 'Nemve1.FilteredModels1.gff.gz' using command:
curl -m 120 'https://genome.jgi.doe.gov/portal/Nemve1/download/Nemve1.FilteredModels1.gff.gz' -b cookies > Nemve1.FilteredModels1.gff.gz
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 3078k  100 3078k    0     0  4918k      0 --:--:-- --:--:-- --:--:-- 4918k
Downloading 'Nemve1.fasta.gz' using command:
curl -m 120 'https://genome.jgi.doe.gov/portal/Nemve1/download/Nemve1.fasta.gz' -b cookies > Nemve1.fasta.gz
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 81.0M  100 81.0M    0     0  5320k      0  0:00:15  0:00:15 --:--:-- 2881k
Finished downloading 2 files.
Decompress all downloaded files? (y/n/k=decompress and keep original): y
Finished decompressing all files.
Keep temporary files ('Nemve1_jgi_index.xml' and 'cookies')? (y/n): n
Removing temp files and exiting

~ took 1m 17s 
➜


jgi-query's Issues

Downloading the fungal database

Hello,
I wanted to download the entire fungal database, but the tool is not responding. Do you have any solution to recommend?
It takes a long time, and in the end it gives me an empty XML file.
Thanks in advance.

ERROR :
#-------------------------------------------------------------------------------------------

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100    92    0    92    0     0      0      0 --:--:--  0:10:00 --:--:--    22
Retrieving information from JGI for query 'fungi' using command 'curl 'https://genome.jgi.doe.gov/portal/ext-api/downloads/get-directory?organism=fungi' -L -b cookies > fungi_jgi_index.xml'


Traceback (most recent call last):
  File "/shared/ifbstor1/projects/HE/FungiDB/JGI-db/jgi-query-main/jgi-query.py", line 1151, in <module>
    if not any(v["results"] for v in list(file_list.values())):
AttributeError: 'NoneType' object has no attribute 'values'

#--------------------------------------------------------------------------------------------

FILE XML : fungi_jgi_index.xml

<html><body><h1>504 Gateway Time-out</h1>
The server didn't respond in time.
</body></html>

10 Minute time limit

So first off thanks for this tool, it's very useful.

My Problem:
I am trying to create an XML file from a very large database on JGI, and there seems to be a 10-minute runtime limit. Is this something built in, or can it be changed?

Thanks

failed when using jgi-query to download files wanted

Hi Glarue,
I am trying to use the following command to download files from a project, but I get an error message, and the downloaded file has the right name but the wrong content.

Command:

python jgi-query.py -x get-directory.xml

The XML file was downloaded from the project download page (https://genome.jgi.doe.gov/portal/pages/dynamicOrganismDownload.jsf?organism=TheHunmicrobiome) by clicking 'Open Downloads as XML'.

Following the prompts:

User name and password: fine

File to download, for example:

2:2216 (a protein sequence file I want to download)

I got this:

Total download size of selected files: 693.23 KB
Continue? (y/n): y
Downloading '81031.assembled.faa' using command:
curl http://genome.jgi.doe.gov/EubpyrIsolGenome/download/_JAMO/56f1982d7ded5e7f7b938de5/81031.assembled.faa -b cookies -c cookies -L > 81031.assembled.faa
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   287  100   287    0     0    101      0  0:00:02  0:00:02 --:--:--   101
  0     0    0     0    0     0      0      0 --:--:--  0:00:03 --:--:--     0
  0     0    0     0    0     0      0      0 --:--:--  0:00:03 --:--:--     0
100  9122    0  9122    0     0   2240      0 --:--:--  0:00:04 --:--:--  2240
Finished downloading all files.
ERROR: '81031.assembled.faa' appears to be malformed and will be left unmodified.
Keep temporary files ('/home/mpi/pengfei/Hungate1000p/get-directory.xml' and 'cookies')? (y/n): n
Removing temp files and exiting

The file is wrong: it does not contain protein sequences (see attached).

81031.assembled.faa.txt

Would you please help me check which step I am doing wrong?
Thanks

Best,
Pengfei

Downloading error: curl: (28)

Hi @glarue
I'm running the script on a remote server to download the *.tar.gz files from bacterial groups (as suggested here: #4).

Every time I run this command (the asterisks in the tar.gz pattern are not shown):

python3 jgi-query.py tenericutes -r '.tar.gz.' --retry_n 0 -c

I get something like this, on every single genome ID contained in tenericutes:

Downloading '2582580514.tar.gz' using command:
curl -m 120 'https://genome.jgi.doe.gov/portal/ext-api/downloads/get_tape_file?blocking=true&url=/Comgenmetab10417/download/_JAMO/53e5233f0d87856ba82b2ddc/2582580514.tar.gz' -b cookies > 2582580514.tar.gz
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:--  0:01:59 --:--:--     0
curl: (28) Operation timed out after 120000 milliseconds with 0 bytes received

I also ran it with the default retry number (4), and the same thing happens. I hope you can guide me through this.

Many thanks in advance

Please change to reference current JGI signon server.

You have a reference to creating an account at signon.jgi-psf.org. Please have it use the current server address: signon.jgi.doe.gov

Downloading bacterial genomes in bulk with submission IDs

Hi,

I have to download specific bacterial genomes from JGI. Is there a way to download them in bulk through their submission IDs via jgi-query?

Thanks,
Marco

Incorporate into Biopython

Love the idea of this tool--what would you think of making it a package in Biopython so users could just do from Bio import jgi-query, or even making it a single function in Biopython? I don't know anything about how you would do this, but I'm sure the folks over there would be happy to help, and I think it would make it even easier to deploy and use.

Error downloading the fungal database.

Hello @glarue

I am launching the script to retrieve all the fungal assembly sequences in fasta format, but it is showing me this error:

python3 jgi-query.py fungi

Retrieving information from JGI for query 'fungi' using command 'curl 'https://genome.jgi.doe.gov/portal/ext-api/downloads/get-directory?organism=fungi' -L -b cookies > fungi_jgi_index.xml'

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100    92    0    92    0     0      0      0 --:--:--  0:10:00 --:--:--    28

Traceback (most recent call last):
  File "/JGI-db/jgi-query-main/jgi-query.py", line 1151, in <module>
    if not any(v["results"] for v in list(file_list.values())):
AttributeError: 'NoneType' object has no attribute 'values'

Do you have any idea how to download all the fungal genome sequences?

Download error

Hello, when I use python3 ./jgi-query/jgi-query.py --xml get-directory.xml to download files from JGI genome portal,
I get the following download error. Do you know what the issue could be?

Thanks!

Downloading '7393.1.70539.TCGAAG.fastq.gz' using command:
curl -m 120 'https://genome.jgi.doe.gov/portal/ext-api/downloads/get_tape_file?blocking=true&url=/Poptrisequencing_78/download/_JAMO/5254b441067c0136350e4f73/7393.1.70539.TCGAAG.fastq.gz' -b cookies > 7393.1.70539.TCGAAG.fastq.gz
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
Trying '7393.1.70539.TCGAAG.fastq.gz' again due to download error (1/4):

downloading multiple datasets in bulk?

I can't tell whether this is possible using jgi-query or with the JGI API in general. I would like to download all of their bacterial genomes if at all possible but can't find a way to get a list by kingdom.

Can you provide any guidance here?
A
