Comments (10)
In the `tsv_header` branch, `parse_tsv_file` uses the header to define the document fields. I couldn't find a good way to ensure that the file actually has a header, though.
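One possible heuristic for the header check, as a minimal sketch: treat the first row as a header only if every cell is a known field name. The function name `first_row_looks_like_header` and the `known_fields` argument are hypothetical and not part of fauna; this assumes a list of expected field names is available for the table.

```python
import csv
import io


def first_row_looks_like_header(tsv_text, known_fields):
    """Return True if every cell in the first row is a known field name.

    Hypothetical helper: assumes the caller can supply the set of field
    names that are valid for the target table.
    """
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    first_row = next(reader, [])
    # An empty file has no header; otherwise every cell must be a known field.
    return bool(first_row) and all(cell.strip() in known_fields for cell in first_row)
```

This would reject a file whose first row is data, but it would also reject a header containing a brand new field, so it only fits if new fields are registered before upload.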
from fauna.
Looks good. Thanks @chacalle.
from fauna.
@chacalle, I just noticed that if you now upload a TSV with header fields `a`, `b`, `c` against a table where documents have fields `a`, `b`, `c`, `d`, `e`, then the newly uploaded documents lack fields `d` and `e`. These documents should have fields `d` and `e`, set to `null`.
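The expected behavior could be sketched like this, assuming the full list of table fields is known at upload time. `fill_missing_fields` is a hypothetical name, and Python's `None` stands in for the database `null`.

```python
def fill_missing_fields(document, table_fields):
    """Return a copy of the document with every table field present.

    Fields missing from the document are set to None (null in the database).
    """
    filled = dict(document)
    for field in table_fields:
        filled.setdefault(field, None)
    return filled


# A document parsed from a TSV with header a, b, c, uploaded to a table
# whose documents have fields a, b, c, d, e.
uploaded = fill_missing_fields({"a": 1, "b": 2, "c": 3}, ["a", "b", "c", "d", "e"])
```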
from fauna.
Yes. One option is to define all the fields in the code, like here, and then use that definition to assign `null` to any field missing from a document.
But now that we don't filter documents on any field besides `strain`, we may not need to define all the fields for each table. Instead, we could look through all the documents already in the database, build the set of fields that are present, and then make sure the documents to be inserted have those fields, assigning `null` to any that are missing.
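A rough sketch of the second option, with plain dictionaries standing in for documents fetched from the database. The names `collect_fields` and `align_to_existing` are hypothetical, and how the existing documents are fetched is left out.

```python
def collect_fields(documents):
    """Return the set of field names present in any of the documents."""
    fields = set()
    for doc in documents:
        fields.update(doc.keys())
    return fields


def align_to_existing(new_documents, existing_documents):
    """Give each new document every field seen in the existing documents.

    Fields a new document lacks are set to None; fields it already has,
    including ones not seen before, are kept as they are.
    """
    fields = collect_fields(existing_documents)
    return [{**{field: None for field in fields}, **doc} for doc in new_documents]
```

Scanning every existing document on each upload may be slow for large tables, so this is only meant to show the idea.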
from fauna.
I think I'm changing my mind here (sorry, sorry). Let's have an easy way to define a schema for a table. For example:
https://github.com/blab/nextstrain-db/blob/master/vdb/zibra_download.py#L14
An individual TSV may include any subset of these fields in its header, in any order.
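As an illustration of that idea, here is a minimal sketch that parses a TSV against a schema list. The field names in `SCHEMA` and the function `parse_tsv_with_schema` are made up for the example and are not taken from the linked file.

```python
import csv
import io

# Hypothetical schema; the real field list would live in the table's script.
SCHEMA = ["strain", "collection_date", "minion_barcode"]


def parse_tsv_with_schema(tsv_text, schema):
    """Parse a TSV whose header is any subset of the schema, in any order.

    Every returned document has all schema fields; fields absent from the
    header are set to None. Header names outside the schema raise an error.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    unknown = [name for name in (reader.fieldnames or []) if name not in schema]
    if unknown:
        raise ValueError("fields not in schema: " + ", ".join(unknown))
    return [{field: row.get(field) for field in schema} for row in reader]
```

Raising on unknown header names is one design choice; silently dropping them would be the more permissive alternative.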
from fauna.
I'm not sure I understand this.
from fauna.
Thinking about this more, one of the strengths of RethinkDB and other NoSQL databases is that inserted documents can be flexible in their schema. It doesn't really matter if a document is missing a field; in `chateau` it's displayed as undefined. We'd have to write a simple filter function for `download.py`, but this would allow more flexibility when adding new documents with new fields, like you're doing for zibra. This also goes along with making `filter` in `upload.py` more flexible.
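One reading of the "simple filter function" is something like the sketch below, which works on plain dictionaries and tolerates missing fields. The name `filter_documents` and its `required_fields` parameter are assumptions for illustration, not the actual `download.py` code.

```python
def filter_documents(documents, required_fields=()):
    """Keep only documents where each required field is present and not None.

    Documents may carry any other fields, or lack them, without error.
    """
    return [
        doc
        for doc in documents
        if all(doc.get(field) is not None for field in required_fields)
    ]
```

With no required fields, every document passes, so tables with uneven schemas can still be downloaded in full.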
from fauna.
Hmm... what's happening is that the schema is rapidly evolving. So, for example, I just added a field for `minion_barcode`. Doing a `vdb/upload` with a set of new documents that have this field resulted in only the new documents having it. We need to either:
- Allow missing fields within documents and have vdb handle this situation gracefully. Chateau seems to work pretty okay.
- When adding a new field, update previous documents to have a value of `null` for this new field.
This is a separate issue from enforcing a schema. At the moment, I like the idea of enforcing a schema, like what's accomplished here:
https://github.com/blab/nextstrain-db/blob/master/vdb/zibra_metadata_upload.py#L16
https://github.com/blab/nextstrain-db/blob/master/vdb/zibra_download.py#L14
This makes it so that I can run `zibra_download` to get the entire table and `zibra_metadata_upload` to repair the entire table. This has been very useful.
So, I might suggest having an enforced schema for a table, but make it easy to adjust this schema on the fly (adding and removing fields as appropriate).
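To make the suggestion concrete, here is a sketch of an adjustable schema plus a repair step, using plain lists and dictionaries. The helper names (`add_field`, `remove_field`, `repair_documents`) and the `country` field are hypothetical; the actual download and upload scripts are not reproduced here.

```python
def add_field(schema, field):
    """Return a new schema list with the field appended if it is not present."""
    return schema if field in schema else schema + [field]


def remove_field(schema, field):
    """Return a new schema list without the given field."""
    return [name for name in schema if name != field]


def repair_documents(documents, schema):
    """Rewrite documents so each has exactly the schema's fields.

    Missing fields become None; fields no longer in the schema are dropped.
    """
    return [{field: doc.get(field) for field in schema} for doc in documents]
```

Under this sketch, adding `minion_barcode` to the schema and repairing a downloaded table would give older documents a `null` value for the new field.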
from fauna.
0fc0ed1 filters even with undefined fields. So documents without a certain field get filtered out.
from fauna.
Closing this as schema has moved on.
from fauna.