
virtool-cli's Introduction

virtool-cli

A command line tool for working with Virtool data.

Installation

pip install virtool-cli

Usage

Ref

Commands related to building, maintaining and pulling new data for reference databases.

Build

To build a reference.json file from a src directory

virtool ref build -src DIRECTORY_PATH -o OUTPUT_PATH

To make the output file easier to read, pass -i to indent it

virtool ref build -src DIRECTORY_PATH -i

To specify a version to include in the reference.json file

virtool ref build -src DIRECTORY_PATH -V VERSION

Repair

Fix folder-JSON name mismatches and incorrect taxid types

virtool ref repair -src DIRECTORY_PATH

taxid

Search GenBank for matching OTUs and add their taxon ids to otu.json entries in the source directory.

virtool ref taxid -src DIRECTORY_PATH

Environmental Variables

Some of the tools in the CLI make API requests to NCBI. Unauthenticated requests are limited to 3 per second. Setting NCBI credentials in environment variables raises this limit to 10 per second.

Name          Description
NCBI_EMAIL    The e-mail address used for your NCBI account.
NCBI_API_KEY  The API key associated with your NCBI account.

virtool-cli's People

Contributors

eroberts9789, igboyes, jakeale, sygao

virtool-cli's Issues

Gracefully handle KeyboardInterrupt in isolate subcommand

Pressing Ctrl+C while the command is running results in a messy unhandled exception:

^CJob processing failed
job: <Job coro=<<coroutine object fetch_otu_isolates at 0x7f109c23dc20>>>
Traceback (most recent call last):
  File "/usr/lib/python3.7/asyncio/runners.py", line 43, in run
    return loop.run_until_complete(main)
  File "/usr/lib/python3.7/asyncio/base_events.py", line 574, in run_until_complete
    self.run_forever()
  File "/usr/lib/python3.7/asyncio/base_events.py", line 541, in run_forever
    self._run_once()
  File "/usr/lib/python3.7/asyncio/base_events.py", line 1786, in _run_once
    handle._run()
  File "/usr/lib/python3.7/asyncio/events.py", line 88, in _run
    self._context.run(self._callback, *self._args)
  File "/home/igboyes/.venv/ref/lib/python3.7/site-packages/virtool_cli/isolate.py", line 109, in fetch_otu_isolates
    new_id = await store_sequence(isolate_path, accession, accession_data, sequence_ids)
  File "/home/igboyes/.venv/ref/lib/python3.7/site-packages/virtool_cli/isolate.py", line 230, in store_sequence
    async with aiofiles.open(os.path.join(path, new_id + ".json"), "w") as f:
  File "/home/igboyes/.venv/ref/lib/python3.7/site-packages/aiofiles/base.py", line 75, in __aenter__
    self._obj = await self._coro
  File "/home/igboyes/.venv/ref/lib/python3.7/site-packages/aiofiles/threadpool/__init__.py", line 78, in _open
    opener=opener,
KeyboardInterrupt

Aborted!
  • use try...except to catch KeyboardInterrupt and close the aiojobs scheduler
  • handle the cancellation exception (asyncio.CancelledError) in the coroutine that runs in the scheduler
    • should any cleanup code be executed if the coroutine is cancelled part way through execution? (Document in issue)
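
The two bullets above could be approached roughly as follows. This is a minimal sketch that uses plain asyncio tasks in place of the aiojobs scheduler. The body of fetch_otu_isolates, the cleaned_up list, and the run_all and main helpers are stand-ins invented for illustration, not the module's actual code.

```python
import asyncio

# Hypothetical record of which OTUs ran their cleanup, for illustration only.
cleaned_up = []


async def fetch_otu_isolates(name, delay=10.0):
    """Stand-in for the real coroutine; it sleeps instead of fetching."""
    try:
        await asyncio.sleep(delay)
    except asyncio.CancelledError:
        # Cleanup would go here, e.g. removing a partially written isolate.
        cleaned_up.append(name)
        raise  # re-raise so the task is still marked as cancelled


async def run_all(names, delay=10.0):
    tasks = [asyncio.ensure_future(fetch_otu_isolates(n, delay)) for n in names]
    try:
        await asyncio.gather(*tasks)
    finally:
        # Cancel whatever is still running and wait for cleanup to finish.
        for task in tasks:
            task.cancel()
        await asyncio.gather(*tasks, return_exceptions=True)


def main(names):
    try:
        asyncio.run(run_all(names))
    except KeyboardInterrupt:
        # Replace the traceback with a single readable line.
        print("Aborted by user")
        return 1
    return 0
```

Re-raising CancelledError after cleanup keeps the task in the cancelled state, so the caller can still tell an interrupted run from a completed one.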

Share REQUEST_INTERVAL

Share this code either in a new module called ncbi.py or existing utils.py, whichever makes more sense. Import into taxid and isolate modules.

import os
from Bio import Entrez

Entrez.email = os.environ.get("NCBI_EMAIL")
Entrez.api_key = os.environ.get("NCBI_API_KEY")
REQUEST_INTERVAL = 0.4 if Entrez.email and Entrez.api_key else 0.6

Store fetched taxid as integer instead of string

Current record style:

{
    "_id": "4ucg7osc",
    "name": "Cacao mild mosaic virus",
    "abbreviation": "",
    "schema": [],
    "taxid": "1940252"
}

Should be:

{
    "_id": "4ucg7osc",
    "name": "Cacao mild mosaic virus",
    "abbreviation": "",
    "schema": [],
    "taxid": 1940252
}
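
A minimal sketch of the conversion, assuming the record has already been loaded from otu.json into a dict. The function name normalize_taxid is hypothetical.

```python
def normalize_taxid(otu):
    """Return the record with a string taxid converted to int where possible."""
    taxid = otu.get("taxid")
    if isinstance(taxid, str) and taxid.strip().isdecimal():
        # Build a new dict so the caller's record is not mutated.
        return {**otu, "taxid": int(taxid)}
    # Missing, None, or already-integer taxids are returned unchanged.
    return otu
```

Records without a taxid are left alone, so the same function can be applied to every otu.json in a source directory.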

urllib.error.HTTPError: HTTP Error 429: Too Many Requests

Unhandled exception while running virtool isolate.

Likely due to sending too many requests to NCBI. Please verify by looking at the urllib documentation and researching the status code. Post information as a reply to this issue.

May need to find a more robust way to throttle requests, as too many seem to be getting sent.

Job processing failed
job: <Job coro=<<coroutine object fetch_otu_isolates at 0x7f107c357830>>>
Traceback (most recent call last):
  File "/home/igboyes/.venv/ref/lib/python3.7/site-packages/virtool_cli/isolate.py", line 84, in fetch_otu_isolates
    records, new_accessions = await asyncio.get_event_loop().run_in_executor(None, get_records, accessions, taxid)
  File "/usr/lib/python3.7/concurrent/futures/thread.py", line 57, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/home/igboyes/.venv/ref/lib/python3.7/site-packages/virtool_cli/isolate.py", line 144, in get_records
    id=taxid, idtype="acc"))
  File "/home/igboyes/.venv/ref/lib/python3.7/site-packages/Bio/Entrez/__init__.py", line 281, in elink
    return _open(cgi, variables)
  File "/home/igboyes/.venv/ref/lib/python3.7/site-packages/Bio/Entrez/__init__.py", line 606, in _open
    handle = urlopen(cgi)
  File "/usr/lib/python3.7/urllib/request.py", line 222, in urlopen
    return opener.open(url, data, timeout)
  File "/usr/lib/python3.7/urllib/request.py", line 531, in open
    response = meth(req, response)
  File "/usr/lib/python3.7/urllib/request.py", line 641, in http_response
    'http', request, response, code, msg, hdrs)
  File "/usr/lib/python3.7/urllib/request.py", line 569, in error
    return self._call_chain(*args)
  File "/usr/lib/python3.7/urllib/request.py", line 503, in _call_chain
    result = func(*args)
  File "/usr/lib/python3.7/urllib/request.py", line 649, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 429: Too Many Requests
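
HTTP status 429 means the server is rate limiting the client. One possible way to throttle more robustly is to share a single thread-safe limiter between all worker threads and to retry on 429 with a growing delay. The sketch below assumes that approach; Throttle, call_with_retry, and their parameters are hypothetical names, not part of virtool-cli or Biopython.

```python
import threading
import time
from urllib.error import HTTPError


class Throttle:
    """Space calls at least `interval` seconds apart, across threads."""

    def __init__(self, interval):
        self.interval = interval
        self._lock = threading.Lock()
        self._next_allowed = 0.0

    def wait(self):
        # Reserve the next free slot under the lock, then sleep outside it.
        with self._lock:
            now = time.monotonic()
            delay = max(0.0, self._next_allowed - now)
            self._next_allowed = max(now, self._next_allowed) + self.interval
        if delay:
            time.sleep(delay)


def call_with_retry(func, throttle, retries=3, backoff=1.0):
    """Call func(), retrying with exponential backoff on HTTP 429 only."""
    for attempt in range(retries + 1):
        throttle.wait()
        try:
            return func()
        except HTTPError as err:
            if err.code != 429 or attempt == retries:
                raise
            time.sleep(backoff * 2 ** attempt)
```

A call such as Entrez.elink could then be wrapped in a lambda and passed to call_with_retry together with one shared Throttle built from REQUEST_INTERVAL.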

Add repair subcommand

Do the following to all otu.json:

  • change parent folder to match name in otu.json; in other words fix folder-JSON name mismatch
  • ensure all taxid fields are int
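
The folder rename in the first bullet might look like the sketch below. The naming rule in expected_folder_name (lowercase, runs of non-alphanumeric characters replaced by underscores) is an assumption made for illustration and may not match the rule the reference repository actually uses. Both function names are hypothetical.

```python
import json
import re
from pathlib import Path


def expected_folder_name(name):
    # ASSUMPTION: folder names are the lowercased OTU name with
    # non-alphanumeric runs collapsed to underscores.
    return re.sub(r"[^a-z0-9]+", "_", name.lower()).strip("_")


def repair_folder_name(otu_dir):
    """Rename otu_dir so it matches the name stored in its otu.json."""
    otu_dir = Path(otu_dir)
    with open(otu_dir / "otu.json") as f:
        name = json.load(f)["name"]
    target = otu_dir.with_name(expected_folder_name(name))
    if target != otu_dir:
        if target.exists():
            # Refuse to overwrite another OTU's folder.
            raise FileExistsError(str(target))
        otu_dir.rename(target)
    return target
```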

Fetch new isolates using taxids from existing OTUs

Fetch sequences for new isolates of existing OTUs (viruses).

Need to somehow associate sequences of the same isolate together. This info is most often in the FEATURES section of a GenBank record. However, it is sometimes only in the DEFINITION.

FEATURES             Location/Qualifiers
     source          1..1090
                     /organism="Abaca bunchy top virus"
                     /mol_type="genomic DNA"
                     /isolate="Q767"
                     /host="Musa sp."
                     /db_xref="taxon:438782"
                     /segment="DNA-N"
                     /country="Malaysia"
                     /note="acronym: ABTV"
     CDS             236..700
                     /codon_start=1
                     /product="putative nuclear shuttle protein"
                     /protein_id="ABP96961.1"
                     /translation="MDWMESQFKTCTHGCDWKAIAPEAQDNIQVITCSDSGYGRKNPR
                     KVLLRSIQIGFNGSFRGSNRNVRGFIYVSVRQDDGQMRPIMVVPFGGYGYHNDYYYFE
                     GQSSTNCEIVSDYIPAGQDWSRDMEISISNSNNCNQECDIKCYVVCNLRIKE"

Strategy

Try to group sequences into isolates automatically first:

  • work through OTU data and add tests for different formats in the FEATURES section
  • derive source type (e.g. Isolate) and source name (e.g. A) automatically if possible
  • if the program can't figure it out:
    • give the following to the user and let them assign an isolate type and name to a set of sequences:
      • accession links
      • definition
      • all source key value pairs that are not in a defined ignore list (this should include country, note, host, and other unwanted fields you find)
    • show the list in alphabetical order by accession; accessions of sequences from the same isolate are usually consecutive
    • this is going to be difficult
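
The automatic step of this strategy could start from something like the sketch below. It assumes the source qualifiers arrive as a dict that maps each key to a list of strings, which is how Biopython exposes feature qualifiers. The priority order in SOURCE_TYPE_KEYS and the extra entries in IGNORED_KEYS beyond country, note, and host are guesses to be refined against real OTU data.

```python
# ASSUMPTION: qualifier keys that can name an isolate, in priority order.
SOURCE_TYPE_KEYS = ("isolate", "strain", "clone", "variant", "genotype")

# country, note and host come from the issue; the rest are illustrative.
IGNORED_KEYS = {"country", "note", "host", "organism", "mol_type", "db_xref"}


def derive_source(qualifiers):
    """Return (source_type, source_name), or None if nothing usable is found."""
    for key in SOURCE_TYPE_KEYS:
        values = qualifiers.get(key)
        if values:
            return key, values[0]
    return None


def fields_for_user(qualifiers):
    """Key-value pairs to show the user when automatic derivation fails."""
    return {k: v for k, v in sorted(qualifiers.items()) if k not in IGNORED_KEYS}
```

For the Abaca bunchy top virus record shown earlier, derive_source returns the isolate qualifier and its value Q767.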

Caching

Avoid fetching too much data from GenBank.

  • efficient way to check only if there are new records for a given taxid?
    • Last-modified or E-tag headers on search result?
  • store all necessary cache or local data for CLI in a folder called .cli
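
A minimal sketch of a local cache in .cli, assuming one JSON file of known accessions per taxid. The file layout and function names are hypothetical. It only avoids re-processing accessions that were seen before; it does not answer the open question about Last-Modified or ETag headers.

```python
import json
from pathlib import Path

CACHE_DIR = Path(".cli")


def load_cached_accessions(taxid, cache_dir=CACHE_DIR):
    """Return the set of accessions already seen for this taxid."""
    path = Path(cache_dir) / f"{taxid}.json"
    if not path.exists():
        return set()
    return set(json.loads(path.read_text()))


def save_cached_accessions(taxid, accessions, cache_dir=CACHE_DIR):
    """Write the known accessions for this taxid, creating the folder if needed."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    (cache_dir / f"{taxid}.json").write_text(json.dumps(sorted(accessions)))


def find_new_accessions(taxid, fetched, cache_dir=CACHE_DIR):
    """Return the fetched accessions that are not in the cache yet."""
    return sorted(set(fetched) - load_cached_accessions(taxid, cache_dir))
```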

Interaction

User has to confirm each isolate addition as it is fetched from GenBank

  • if we can't derive source type and name automatically, give user hyperlinks to GenBank records and ask them to either:
    • enter the type and name values
    • leave isolate unnamed
  • allow skipping isolate additions (no result; will ask user again next time they run tool)
  • allow blocking isolate additions (no result; never ask user about it again)
    • write blocked accessions to cache
  • auto skip if the sequence is significantly shorter than any sequence for existing isolates
    • allow this to be disabled via CLI option
  • require override if number of sequences for isolate
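
The auto-skip rule for short sequences could be sketched as below. The issue does not define "significantly shorter", so the min_fraction threshold of 0.5 is an arbitrary assumption, as are the function and parameter names. The enabled flag stands in for the CLI option that disables the check.

```python
def should_auto_skip(new_length, existing_lengths, min_fraction=0.5, enabled=True):
    """Return True if a sequence is much shorter than every existing one."""
    if not enabled or not existing_lengths:
        # Nothing to compare against, or the check was disabled by the user.
        return False
    # Shorter than any existing sequence means shorter than the shortest one.
    return new_length < min_fraction * min(existing_lengths)
```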

Extend README

  • add a section for each command with usage examples
  • include the usage of environmental variables

Code improvements for isolate module

arrow package missing from setup.py

Make sure all packages required for install are listed in install_requires.

    return get_distribution(dist).load_entry_point(group, name)
  File "/home/igboyes/.venv/ref/lib/python3.7/site-packages/pkg_resources/__init__.py", line 2862, in load_entry_point
    return ep.load()
  File "/home/igboyes/.venv/ref/lib/python3.7/site-packages/pkg_resources/__init__.py", line 2462, in load
    return self.resolve()
  File "/home/igboyes/.venv/ref/lib/python3.7/site-packages/pkg_resources/__init__.py", line 2468, in resolve
    module = __import__(self.module_name, fromlist=['__name__'], level=0)
  File "/home/igboyes/.venv/ref/lib/python3.7/site-packages/virtool_cli/run.py", line 2, in <module>
    import virtool_cli.build
  File "/home/igboyes/.venv/ref/lib/python3.7/site-packages/virtool_cli/build.py", line 3, in <module>
    import arrow
ModuleNotFoundError: No module named 'arrow'

More detailed and better formatted logging for isolate subcommand

Current:

Found isolate data for 2116736
Found isolate data for 12436
Found isolate data for 192022
Could not find isolate data for 1115692

For each OTU, write several lines instead of one.

  • write multiple lines for each OTU for which isolate data is found
    1. first line is the OTU name and taxid (lead with a checkmark on success)
    2. second line is "Found isolates:"
    3. lines 3 to n are the isolate names, indented
  • write more detailed error lines as well
    1. first line is the OTU name and taxid (lead with an x on error)
    2. second line is the reason: none found, or no taxid defined (#20)
  • maintain or enhance current use of color

Example:

✔ Tobacco mosaic virus (12345)
  Found 2 isolates:
    - Isolate A
    - Isolate G12

✘ Rice spotting virus (678910)
  Found 0 isolates

✘ Potato streak virus (543210)
  No taxid assigned for OTU
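
A formatter that produces this layout might look like the sketch below. The function name and signature are hypothetical, color is left out, and an OTU with no taxid is printed without the parenthesised number, which differs slightly from the third example above.

```python
def format_otu_log(name, taxid, isolate_names):
    """Build the multi-line log entry for one OTU."""
    ok = taxid is not None and len(isolate_names) > 0
    header = f"{'✔' if ok else '✘'} {name}"
    if taxid is None:
        return "\n".join([header, "  No taxid assigned for OTU"])
    header += f" ({taxid})"
    count = len(isolate_names)
    if count == 0:
        return "\n".join([header, "  Found 0 isolates"])
    noun = "isolate" if count == 1 else "isolates"
    lines = [header, f"  Found {count} {noun}:"]
    lines += [f"    - {isolate}" for isolate in isolate_names]
    return "\n".join(lines)
```

Returning a string instead of printing keeps the function easy to test and lets the taxid subcommand reuse it.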

Reduce testing data size

Include enough data to support all test cases.

Reasons:

  • reduce size and clutter in git repository
  • improve testing speed
  • make failing tests more understandable

Improve taxid subcommand logging

Bring logging more in line with that for isolate:

  • use same x character
  • use multiple lines for each OTU to make things more readable
  • nest error or success messages under OTU name
  • reuse logging code where it makes sense
  • get rid of progress bar

More informative logging when OTU has no taxid

Look at the second line here:

Found isolate data for 404404
Could not find isolate data for None
Found isolate data for 463360

I believe this is due to the OTU having no taxid. Please provide a more informative log line. For example, Could not retrieve isolate data for because no taxid is defined.
