
fauna's Introduction


This repository is archived and contains the content used to build the documentation and splash page found on nextstrain.org. That content can now be found here.

License and copyright

Copyright 2014-2018 Trevor Bedford and Richard Neher.

Source code to Nextstrain is made available under the terms of the GNU Affero General Public License (AGPL). Nextstrain is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Affero General Public License for more details.

fauna's People

Contributors

alliblk, barneypotter24, chacalle, dependabot[bot], ebaberga, evogytis, huddlej, j23414, jameshadfield, joverlee521, kairstenfay, lmoncla, rneher, sidneymbell, smur232, stevenweaver, trvrb, tsibley, victorlin


fauna's Issues

Counts in HI strain files need to be properly summed

The --all option in download_all.py is much appreciated. However, it needs to be smarter about how hi_strains.tsv files are combined. Here,

https://github.com/nextstrain/fauna/blob/master/download_all.py#L64

we need to combine the HI strain tsvs in a smarter fashion. Each individual tsv looks like:

A/Pakistan/431/2015	5
A/Mexico/4159/2016	6
A/Kazakhstan/646/2016	5
...
A/Pakistan/431/2015	6

The combined all_hi_strains.tsv needs to have

A/Pakistan/431/2015	11

The titers tsv files can be concatenated just as they are now.
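As a sketch of the fix (a hypothetical helper, not the actual `download_all.py` code), the per-strain counts could be summed with a `Counter` keyed by strain name rather than concatenated:

```python
from collections import Counter

def combine_hi_strains(tsv_texts):
    """Sum per-strain counts across multiple hi_strains.tsv contents."""
    totals = Counter()
    for text in tsv_texts:
        for line in text.strip().splitlines():
            if not line.strip():
                continue
            strain, count = line.split("\t")
            totals[strain] += int(count)
    return totals

# The duplicated A/Pakistan/431/2015 rows (5 and 6) should sum to 11.
combined = combine_hi_strains([
    "A/Pakistan/431/2015\t5\nA/Mexico/4159/2016\t6",
    "A/Kazakhstan/646/2016\t5\nA/Pakistan/431/2015\t6",
])
```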

Visualizing Sequence Information Fields in Chateau

It's really hard to see sequence information like title, author, etc. We could display the 'best' sequence's information like a normal virus field (strain, date, etc.). For downloading sequences we currently just pick the longest sequence as the 'best' sequence. So we would still have the sequences field storing a list of all sequences from the strain, but add new main fields for title, authors, accession, etc. I think this would make it much easier to see that information while still allowing multiple sequences to be stored for a single strain.

Thoughts?

Timestamp bug

As noticed in #35, the timestamp field is being updated with each upload. It should only be updated when the other fields in the document are updated.

More flexible field requirements

@chacalle ---

I'm realizing now that we should be more flexible with field requirements. It had been set up to reject documents that lack a few key attributes like date, country, etc... I think this is overly restrictive. It would be better if a document is still added to the database but has date = null, country = null, etc... The only "required" field when uploading should be strain.

I would move these "required" fields to downloading instead. By default, we could only download viruses with defined date, country, etc...

I changed the required fields here:

https://github.com/blab/nextstrain-db/blob/master/vdb/upload.py#L54

I think I fixed everything here, but could you test a bit?

Sort out why Texas/50 is being dropped

Some documents are getting dropped in the new upload. The original tdb has:

A/Stockholm/65/2015    A/Georgia/532/2015    F33/15    NIMR_Feb2016_9_06.csv    160
A/Stockholm/65/2015    A/HongKong/146/2013    F10/15    NIMR_Feb2016_9_06.csv    160
A/Stockholm/65/2015    A/HongKong/4801/2014    F12/15    NIMR_Feb2016_9_06.csv    160
A/Stockholm/65/2015    A/HongKong/4801/2014    F43/15    NIMR_Feb2016_9_06.csv    320
A/Stockholm/65/2015    A/HongKong/5738/2014    F30/14    NIMR_Feb2016_9_06.csv    160
A/Stockholm/65/2015    A/Netherlands/525/2014    F23/15    NIMR_Feb2016_9_06.csv    80
A/Stockholm/65/2015    A/Samara/73/2013    F35/15    NIMR_Feb2016_9_06.csv    80
A/Stockholm/65/2015    A/Stockholm/6/2014    F14/14    NIMR_Feb2016_9_06.csv    640
A/Stockholm/65/2015    A/Stockholm/6/2014    F20/14    NIMR_Feb2016_9_06.csv    320
A/Stockholm/65/2015    A/Switzerland/9715293/2013    F18/151    NIMR_Feb2016_9_06.csv    160
A/Stockholm/65/2015    A/Switzerland/9715293/2013    F32/14    NIMR_Feb2016_9_06.csv    160
A/Stockholm/65/2015    A/Texas/50/2012    F36/12    NIMR_Feb2016_9_06.csv    160

while test_tdb_2 has:

A/Stockholm/65/2015    A/Georgia/532/2015    F33/15    NIMR_Feb2016_9_06.csv    160    
A/Stockholm/65/2015    A/HongKong/146/2013    F10/15    NIMR_Feb2016_9_06.csv    160    
A/Stockholm/65/2015    A/HongKong/4801/2014    F12/15    NIMR_Feb2016_9_06.csv    160    
A/Stockholm/65/2015    A/HongKong/4801/2014    F43/15    NIMR_Feb2016_9_06.csv    320    
A/Stockholm/65/2015    A/HongKong/5738/2014    F30/14    NIMR_Feb2016_9_06.csv    160    
A/Stockholm/65/2015    A/Netherlands/525/2014    F23/15    NIMR_Feb2016_9_06.csv    80    
A/Stockholm/65/2015    A/Samara/73/2013    F35/15    NIMR_Feb2016_9_06.csv    80    
A/Stockholm/65/2015    A/Stockholm/6/2014    F14/14    NIMR_Feb2016_9_06.csv    640    
A/Stockholm/65/2015    A/Stockholm/6/2014    F20/14    NIMR_Feb2016_9_06.csv    320    
A/Stockholm/65/2015    A/Switzerland/9715293/2013    F18/151    NIMR_Feb2016_9_06.csv    160    
A/Stockholm/65/2015    A/Switzerland/9715293/2013    F32/14    NIMR_Feb2016_9_06.csv    160    

We're missing A/Texas/50/2012 in test_tdb_2.

Restore from backup

It would be good to have a vdb_restore.py script that restores particular tables from the S3 backups. Could default to the most recent version in S3 or have a command line option to get a specific version.

Strain name false mismatches

The accession KU497555 is called Brazil-ZKV2015 on its Genbank page, but downloads as Brazil_ZKV2015_Asian via ViPR. The current canonicalization strips the _Asian suffix, but these end up as two different strains in vdb because of _ vs -.

I think the best thing to do here is to do a more generous match when checking if a strain is in the database, i.e. when looking up Brazil_ZKV2015 in vdb, it should match to Brazil-ZKV2015.

So, we'd have some degree of canonization that corrects strain names and a further degree that applies when looking for matches. Does this sound like the proper way to do things to you?
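The second, more generous degree of matching could be sketched as a lookup key that ignores separator differences (hypothetical helper names, not existing vdb functions):

```python
import re

def relaxed_key(strain):
    """Collapse separator and case differences for database lookup."""
    return re.sub(r"[-_\s]", "", strain).lower()

def strains_match(a, b):
    """True if two strain names differ only by -, _, whitespace, or case."""
    return relaxed_key(a) == relaxed_key(b)
```

With this, looking up Brazil_ZKV2015 would match the stored Brazil-ZKV2015 while leaving the canonical stored name untouched.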

Strain names with spaces

Some strains have spaces in their Genbank names like

Dominican Republic/2016/PD1
KU853012
http://www.ncbi.nlm.nih.gov/nuccore/KU853012

I'd very much prefer strain names without whitespace for ease of downstream processing. I'd propose stripping whitespace as part of the canonicalize method. It's still a question of what exactly to do. Three options:

  1. Replace one or more spaces with -.
  2. Replace one or more spaces with _.
  3. Replace one or more spaces with '' (empty).

I think I'd lean towards (1), just because - seems to be more common in existing virus names than _, but am open to suggestions.
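Option (1) is a one-liner if it goes into the canonicalize step (sketch only; the function name is illustrative):

```python
import re

def strip_whitespace(name):
    """Option (1): replace runs of whitespace with a single '-'."""
    return re.sub(r"\s+", "-", name.strip())

# "Dominican Republic/2016/PD1" -> "Dominican-Republic/2016/PD1"
```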

Error when updating Zika citations or locations

I'm getting an error when I try to update Zika citation information or location information, either

python vdb/zika_update.py -db vdb -v zika --update_citations
python vdb/zika_update.py -db vdb -v zika --update_locations

I get:

Traceback (most recent call last):
  File "vdb/zika_update.py", line 13, in <module>
    connVDB.update(**args.__dict__)
  File "/Users/trvrb/Documents/src/fauna/vdb/update.py", line 20, in update
    self.update_locations(**kwargs)
  File "/Users/trvrb/Documents/src/fauna/vdb/update.py", line 67, in update_locations
    self.upload_to_rethinkdb(self.database, self.viruses_table, viruses, overwrite=True)
  File "/Users/trvrb/Documents/src/fauna/vdb/upload.py", line 500, in upload_to_rethinkdb
    raise Exception("Couldn't insert new documents into database", database + "." + table)
Exception: ("Couldn't insert new documents into database", 'vdb.zika_viruses')

However, it looks like it's actually updating. @chacalle, sorry about this, but is there an obvious solution? I should be getting more familiar with the codebase.

database dump

Dear all,

Is it possible to get a copy of a nextstrain database dump (e.g. in JSON)? Or is this problematic for proprietary reasons?

Best and thanks,
Adrian

When syncing, keep most recent version

When syncing local and remote copies of a table, check the timestamp of each document and keep only the most recent version. This will prevent some sync conflicts.
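The conflict resolution could be as simple as comparing parsed timestamps (the timestamp format below is an assumption; substitute whatever format the tables actually store):

```python
from datetime import datetime

def keep_most_recent(local, remote, fmt="%Y-%m-%d-%H-%M"):
    """Resolve a sync conflict by keeping the document with the later timestamp."""
    local_t = datetime.strptime(local["timestamp"], fmt)
    remote_t = datetime.strptime(remote["timestamp"], fmt)
    return local if local_t >= remote_t else remote
```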

Allow uploads of non-HI assay types.

Currently, tdb/upload filters out of uploads all measurements that are not appropriate 2-fold dilutions. This should be changed to be responsive to the assay_type field.

Martinique bug

The strain MRS_OPY_MARTINIQUE_PARI_2015 didn't end up as its own item in the Zika table. It's currently inline at the end of the sequence for Haiti/1225/2014. I'll fix this manually, but I think the upload script may need investigating.

Here's the relevant bit of the current FASTA download:

>Haiti/1225/2014|Zika|KU509998|2014-12-12|NorthAmerica|Haiti|?|?|Genbank|Genome|Lednicky et al|?|
GTTGTTACT.......GTGGTTAGAGGAGAKU647676|MRS_OPY_MARTINIQUE_PARI_2015|2015-12-XX|HUMAN|MARTINIQUE|NAAGTATCAACAGATTCCGG..........

GISAID Upload Pipeline

When incorporating new sequences from GISAID into nextflu, are only relatively new sequences downloaded or is everything in GISAID downloaded? vdb_parse currently parses the fasta before trying to upload each sequence and checking if the virus is already in vdb. If all sequences from GISAID are going to be in the fasta each time, it will take a while to determine the lineage for all sequences. In this case vdb_parse should immediately check for the virus in vdb after getting the strain name. If only relatively new GISAID sequences are in the fasta then this isn't a problem.

Improve field updates

@chacalle ---

The current ZiBRA pipeline is to add documents to the database without sequence information via a tsv file:

https://github.com/blab/nextstrain-db/blob/master/ZIBRA.md#database-commands

And then to add matched FASTAs that just contain:

>strainA
ATCGCTG...
>strainB
ATCGCTG...

I hacked together this functionality here:

https://github.com/blab/nextstrain-db/blob/master/vdb/upload.py#L305

but it's not very clean. It should be possible to have a document with any complement of fields with defined values and fields marked null and then to 'upload' a document with the same primary key (strain) that has some overlap in terms of fields and non-null values. The merged document should replace all null values with the new defined value, but only overwrite non-null if the --overwrite option is passed.

For example,

db document:

  • date: 2016-XX-XX
  • country: brazil
  • sequences:
    • accession: null
    • locus: null
    • sequence: null

upload document:

  • date: 2016-01-01
  • country: null
  • sequences:
    • accession: 160123456789
    • locus: genome
    • sequence: ATGCTGCCTGC

With default upload, the resulting db document should be:

  • date: 2016-XX-XX
  • country: brazil
  • sequences:
    • accession: 160123456789
    • locus: genome
    • sequence: ATGCTGCCTGC

With --overwrite upload, the resulting db document should be:

  • date: 2016-01-01
  • country: brazil
  • sequences:
    • accession: 160123456789
    • locus: genome
    • sequence: ATGCTGCCTGC
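The merge behavior described above could be sketched like this (a hypothetical standalone helper, not the actual upload.py code; the nested-dict shape mirrors the example rather than the real sequences list):

```python
def merge_documents(existing, new, overwrite=False):
    """Merge an uploaded document into an existing db document.

    Null (None) fields in the existing document are always filled from the
    new document; non-null fields are replaced only when overwrite=True.
    Nested dicts are merged recursively.
    """
    merged = dict(existing)
    for field, value in new.items():
        if value is None:
            continue  # never let a null overwrite a defined value
        if isinstance(value, dict) and isinstance(merged.get(field), dict):
            merged[field] = merge_documents(merged[field], value, overwrite)
        elif merged.get(field) is None or overwrite:
            merged[field] = value
    return merged

db_doc = {
    "date": "2016-XX-XX", "country": "brazil",
    "sequences": {"accession": None, "locus": None, "sequence": None},
}
upload_doc = {
    "date": "2016-01-01", "country": None,
    "sequences": {"accession": "160123456789", "locus": "genome",
                  "sequence": "ATGCTGCCTGC"},
}
merged_default = merge_documents(db_doc, upload_doc)
merged_overwrite = merge_documents(db_doc, upload_doc, overwrite=True)
```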

Inclusion date

I've realized it would be really helpful to have an inclusion_date field in addition to the collection_date field in vdb. As a simple use case, I just uploaded new Zika sequences from ViPR, and in doing so all the timestamps were set to the present and it wasn't at all easy to find what was new. It would have been super handy to be able to sort by inclusion_date. This would also allow us to roll back the visualization and see what data was available when.

My proposal would be just to use whatever date it is when a document is first added to the database. This should be the date of first appearance in our database, not when the sequence first appeared in GenBank.

An eventual goal is to be able to have a nextstrain.org/updates/ page that would list atomic updates to the app and link to new viruses added in each update. This could pretty easily be done by passing the inclusion_date to augur/auspice from vdb.

VDB Backups

Make regular backups of vdb. Could potentially run on an S3 bucket. This script could possibly be used to make daily backups on S3. Will also probably want a script to revert the database to a previous backup.

Subset download on server

I'm working on getting vdb integrated into the current nextflu build. I need to generate 4 FASTA files, one each for H3N2, H1N1pdm, Vic and Yam. I'm doing this with:

  • python vdb/flu_download.py -db vdb -v flu --select locus:HA,lineage:seasonal_h3n2 --fstem h3n2
  • python vdb/flu_download.py -db vdb -v flu --select locus:HA,lineage:seasonal_h1n1pdm --fstem h1n1pdm
  • python vdb/flu_download.py -db vdb -v flu --select locus:HA,lineage:seasonal_vic --fstem vic
  • python vdb/flu_download.py -db vdb -v flu --select locus:HA,lineage:seasonal_yam --fstem yam

However, downloading the full database is taking ~5 min per lineage. This is definitely impacting performance. Moving the subset logic to the server would improve this.

Date field bug

I just did an upload where KU820897 with date of 2015_12 in the FASTA was uploaded as 2015-XX-XX. Could you please investigate?

Dates not downloading properly

I just tried to run python vdb/download.py -db vdb -v zika --fstem zika and I got:

>Zhejiang04|zika|KX117076|?|china|china|china|china|genbank|genome|Zhang et al
>1_0015_PF|zika|KX447511|?|oceania|french_polynesia|french_polynesia|french_polynesia|vipr|genome|Nougairede et al

All the collection_dates have been replaced with ?. Not exactly sure what's going on here. All the dates in the db itself look fine. Strangely python vdb/download.py -db vdb -v zika --fstem zika --ftype json is working just as it should.

Parse subtype from sequence

A large fraction of the GISAID submissions don't include full subtype information. This is especially common for B/Vic and B/Yam. Because of this, asking for A/H3N2 in GISAID won't actually get all the H3N2 sequences. Take a look at what we (Richard) did in the nextflu build to account for this:

https://github.com/blab/nextflu/blob/master/augur/src/make_all.py

This uses BioPython plus the outgroups for H3N2, H1N1pdm, Vic and Yam to make alignments and categorize sequences with ambiguous subtypes. @chacalle do you think you could borrow this code/logic for Flu_vdb_upload.py? With this in place, I could switch to using vdb rather than direct GISAID downloads for my nextflu builds.

Update sequences script

Create another script to go through all vdb viruses, use entrez to check for updates to authors, title, url and sequences.

error with npm run chateau

I got an error from 'npm run chateau' saying that './chateau/bin/chateau' fails on my system. Any advice?

Justin

Automatic Uploading of Sequences from Genbank

Create a function to regularly search entrez for new sequences to upload. I think this would be useful if there are multiple nextstrain websites being maintained; it could automate retrieving new sequences from genbank.

Can query with entrez like "Zika virus"[porgn] AND ("2015/01/01"[MDAT] : "2016/04/14"[MDAT]) AND ("10000"[SLEN] : "100000000"[SLEN]). Possibly also only include sequences that include complete genome in their description. Using entrez seems to lag slightly behind manually searching genbank (missing new sequences KX051563, KX056898 at the moment).

Will want some sort of staging area that shows important sequence information where someone could approve sequences for uploading. Possibly email new sequence information and accession numbers to user for approval?

Improve command line interaction with upload_all

Currently, tdb/upload_all.py requires manually changing source code to upload different datasets. This is very non-ideal. All this should be done via the command line through argparse. We want the ability to specify uploads of different subtypes (h3n2, h1n1pdm, etc...) and different data sources (nimr, cdc, elife).

Canonicalize Zika strain names

I suggest including an option (on by default) to remove the _Asian suffix from the end of Zika strain names. I would imagine this would work best as a method within Zika_vdb_upload, although there could be a more general "canonicalize" method in vdb_upload that takes appropriate options when called from Zika_vdb_upload.

Match fields from tsv header

@chacalle ---

The current upload --ftype tsv command assumes that tsv fields are ordered according to fasta_fields, like so. It would be significantly better if a tsv upload required that the tsv file had an initial header line in which header elements strictly correspond to database fields. For example:

strain  location    division    date
160405000282    currais_novos   rio_grande_do_norte 2016-03-16
160405000283    currais_novos   rio_grande_do_norte 2016-03-16
160216000175    natal   rio_grande_do_norte 2016-03-04

would match columns to fields strain, location, division, date. If the db table didn't have a location field then the upload would still add location for these documents.
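The header-driven parsing is essentially what `csv.DictReader` already does (sketch only; `read_tsv_documents` is a hypothetical name, not an existing vdb function):

```python
import csv
import io

def read_tsv_documents(tsv_text):
    """Parse a tsv whose first line names the database fields."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return list(reader)

rows = read_tsv_documents(
    "strain\tlocation\tdivision\tdate\n"
    "160405000282\tcurrais_novos\trio_grande_do_norte\t2016-03-16\n"
    "160216000175\tnatal\trio_grande_do_norte\t2016-03-04\n"
)
```

Each row comes back as a dict keyed by the header fields, so column order in the file no longer matters.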

Citation not updating

The previous RuntimeError: Search Backend failed: Database is not supported: nuccore error has resolved itself on NCBI's end. I was able to add citations to most strains via python vdb/update.py -db vdb -v zika. However, some aren't taking. Trying, for example:

python vdb/update.py -db vdb -v zika --accessions KX087102

The Genbank entry is lacking a title, but does have authors. The update is not revising the authors, however.

There are a number of other accessions that behave the same way. Check out new_server.vdb.zika and look for strains that have null for authors.

Make lineage-specific tables in vdb

I think it makes the most sense to have each table be a unit of analysis. We make trees for H3N2 and for H1N1pdm separately, so we should have tables vdb/H3N2 and vdb/H1N1pdm. Having the vdb and tdb table names mirror each other also seems good, so we'd also have tdb/H3N2 and tdb/H1N1pdm.

In this case, I believe the script Flu_vdb_upload.py needs to be updated to deposit new H3N2, H1N1pdm, etc... sequences into different tables.

Allow filtering of tdb downloads

In vdb, we have a select command line argument:

https://github.com/nextstrain/fauna/tree/master/vdb#commands-1

that subsets downloads to specific fields with certain values, e.g. --select field1:value1 field2:value1,value2. We definitely want the ability to filter tdb downloads to:

  1. Subset on data source:
  • just CDC titers
  2. Subset on assay type:
  • just HI assays
  • just FRA assays

I.e., something like: python tdb/download.py -db tdb -v flu --subtype h3n2 --select assay_type:HI source:CDC. Make it generic, rather than specifically tailored to source and assay_type.
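A generic version of the --select filter could look like this (hypothetical helper names; the real vdb implementation may differ):

```python
def parse_select(select_args):
    """Turn ['assay_type:HI', 'source:CDC,elife'] into a field -> values dict."""
    selections = {}
    for arg in select_args:
        field, _, values = arg.partition(":")
        selections[field] = values.split(",")
    return selections

def document_matches(document, selections):
    """True if the document's value for every selected field is allowed."""
    return all(document.get(field) in values
               for field, values in selections.items())
```

Because the selections dict is built from whatever field:value pairs the user passes, nothing here is hard-coded to source or assay_type.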

Upload bug

I just tried uploading Zika sequences with

python src/Zika_vdb_upload.py --database test --virus Zika --fname GenomeFastaResults.fasta --source Genbank --locus Genome --path data/

and got the following error

Inserting next virus into database: 103344
This virus already exists in the table
Traceback (most recent call last):
  File "src/Zika_vdb_upload.py", line 28, in <module>
    run.upload()
  File "/Users/trvrb/Dropbox/current-projects/vdb/src/vdb_upload.py", line 123, in upload
    self.upload_documents()
  File "/Users/trvrb/Dropbox/current-projects/vdb/src/vdb_upload.py", line 287, in upload_documents
    self.update_document_meta(document, virus)
  File "/Users/trvrb/Dropbox/current-projects/vdb/src/vdb_upload.py", line 298, in update_document_meta
    document[field] = virus[field]
KeyError: 'subtype'

Not exactly sure what's going on here.

Issue with flu update groupings

I was just doing an update from GISAID and ran into an error when running:

python vdb/flu_update.py -db vdb -v flu --update_groupings

I get:

Traceback (most recent call last):
  File "vdb/flu_update.py", line 57, in <module>
    connVDB.update(**args.__dict__)
  File "/Users/trvrb/Dropbox/current-projects/nextstrain-db/vdb/update.py", line 22, in update
    self.update_groupings(self.viruses_table, self.sequences_table, **kwargs)
  File "vdb/flu_update.py", line 50, in update_groupings
    self.upload_to_rethinkdb(self.database, self.viruses_table, virus_group, overwrite=True, optimal_upload=optimal_upload)
  File "/Users/trvrb/Dropbox/current-projects/nextstrain-db/vdb/upload.py", line 500, in upload_to_rethinkdb
    raise Exception("Couldn't insert new documents into database", database + "." + table)
Exception: ("Couldn't insert new documents into database", 'vdb.flu_viruses')

This is from upload_to_rethinkdb: https://github.com/blab/nextstrain-db/blob/master/vdb/upload.py#L484

I just dug around a bit and couldn't see something obvious. @chacalle if you could take a look at your leisure I'd very much appreciate it.

Upload via accession number

Include option to upload via a list of Genbank accession numbers. This could be flagged on the command line with something like --ftype fasta vs --ftype accession. The default could be fasta. There could also be a shortcut for --accessions KU729218 KU853013 to upload these directly. This would circumvent --ftype and --fname. There might be better setups for this however...

Sort out what we want to use for geo regions

Current geo regions are specified here:

https://github.com/nextstrain/fauna/blob/master/source-data/geo_regions.tsv

There are 14 of them:

  1. North Africa
  2. Subsaharan Africa
  3. Europe
  4. Caribbean
  5. Central America
  6. North America
  7. China
  8. South Asia
  9. Japan / Korea
  10. South Pacific
  11. Oceania
  12. South America
  13. Southeast Asia
  14. West Asia

The current proposal is to collapse to 12 regions:

  • Collapse "Oceania" and "South Pacific" to "Oceania"
  • Collapse "Central America" and "Caribbean" to "Central America / Caribbean"

Any other suggestions?
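The proposed collapse is small enough to express as a lookup table (hypothetical mapping and slugs; the real geo_regions.tsv spellings may differ):

```python
# Map the two collapsed regions onto their replacements; everything else
# passes through unchanged, giving the proposed 12-region scheme.
REGION_COLLAPSE = {
    "south_pacific": "oceania",
    "caribbean": "central_america_caribbean",
    "central_america": "central_america_caribbean",
}

def collapse_region(region):
    """Map one of the current 14 regions to the proposed 12-region scheme."""
    return REGION_COLLAPSE.get(region, region)
```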

Remove fuzzy matching on vdb upload

I'm pretty sure we can remove

Adjusting strain names to match identical strains in documents to be uploaded
Using vdb.flu_viruses to adjust strain names to match strains already in vdb.flu_viruses
Adjusting accessions to match identical sequences in documents to be uploaded
Using vdb.flu_sequences to adjust accessions to match sequences already in vdb.flu_sequences

from vdb/upload. I need to test this however to make sure it doesn't break anything.

Entrez email bug

In 5e6ffd5, running python vdb/update.py -db test_vdb -v zika --accessions KX101066,KX101060 results in:

/usr/local/lib/python2.7/site-packages/Bio/Entrez/__init__.py:451: UserWarning: 
Email address is not specified.

I figured out that this is due to neither Entrez.email nor self.email being set when get_GIs is called here: https://github.com/blab/nextstrain-db/blob/master/vdb/update.py#L17. The function Entrez.esearch within get_GIs needs Entrez.email to be set and it's not.

Should separate concepts of 'table' and 'virus' in vdb

For the ZIBRA project, it will be easiest to have a separate table within vdb. I'm setting this up now as vdb/zibra. However, this should still have a virus field that says zika. We can then merge vdb/zibra into vdb/zika based on shared fields.

I'd suggest changing -v zibra / -v flu to -t zibra / -t flu. So swapping "virus" for "table". You could have a command that looks like:

python vdb/zibra_upload.py -db vdb -t zibra --fname seq.fasta --source zibra --virus zika

that uploads sequences to the vdb/zibra table, but still labels each virus field as zika.

I suspect this will be a generally useful semantic separation as well.

TDB upload fails if subtype is absent

Calling:

python tdb/cdc_upload.py -db cdc_fra_tdb -v flu --path data/ --fstem FRA_Sep2015_Sep2016_titers --ftype flat --preview

Errors out to:

Traceback (most recent call last):
  File "tdb/cdc_upload.py", line 105, in <module>
    connTDB = cdc_upload(**args.__dict__)
  File "tdb/cdc_upload.py", line 14, in __init__
    upload.__init__(self, **kwargs)
  File "/Users/trvrb/Documents/src/fauna/tdb/upload.py", line 32, in __init__
    self.subtype = subtype.lower()
AttributeError: 'NoneType' object has no attribute 'lower'

An empty subtype should not bomb out; instead, the subtype should be collected from within the flat file.

Problem with GenBank GIs

The command python vdb/zika_update.py -db vdb -v zika --update_citations is throwing the error:

Connected to the "vdb" database
Updating citation fields
Getting accession numbers for sequences obtained from Genbank
Traceback (most recent call last):
  File "vdb/zika_update.py", line 13, in <module>
    connVDB.update(**args.__dict__)
  File "/Users/trvrb/Dropbox/current-projects/nextstrain-db/vdb/update.py", line 18, in update
    self.update_citations(table=self.sequences_table, **kwargs)
  File "/Users/trvrb/Dropbox/current-projects/nextstrain-db/vdb/update.py", line 28, in update_citations
    _, sequences = self.get_genbank_sequences(**kwargs)
  File "/Users/trvrb/Dropbox/current-projects/nextstrain-db/vdb/update.py", line 47, in get_genbank_sequences
    gi = self.get_GIs(accessions)
  File "/Users/trvrb/Dropbox/current-projects/nextstrain-db/vdb/parse.py", line 194, in get_GIs
    giList = Entrez.read(handle)['IdList']
  File "/usr/local/lib/python2.7/site-packages/Bio/Entrez/__init__.py", line 372, in read
    record = handler.read(handle)
  File "/usr/local/lib/python2.7/site-packages/Bio/Entrez/Parser.py", line 203, in read
    self.parser.ParseFile(handle)
  File "/usr/local/lib/python2.7/site-packages/Bio/Entrez/Parser.py", line 511, in externalEntityRefHandler
    self.dtd_urls.append(url)
UnboundLocalError: local variable 'url' referenced before assignment

due to GenBank phasing out GI numbers:

https://www.ncbi.nlm.nih.gov/news/03-02-2016-phase-out-of-GI-numbers/

Date added field

I think it would be helpful to include a vdb_date_added field that records when we added a strain into the database. This should make it easier to sort the table in Chateau to find and annotate new viruses. I'm not sure if we also need a genbank_date_added field, etc... Thoughts on this?
