Match fields from tsv header (fauna issue, closed)

nextstrain commented on May 28, 2024

Match fields from tsv header

from fauna.

Comments (10)

chacalle commented on May 28, 2024

In the tsv_header branch, parse_tsv_file uses the header to define the document fields. I couldn't find a good way to ensure that the file actually has a header, though.
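A minimal sketch of header-driven parsing, for illustration only. This is not fauna's parse_tsv_file; the function name parse_tsv_text and the required parameter are assumptions. Since a TSV file cannot declare that it has a header, the sketch uses a heuristic: the first row is accepted as a header only if it contains the field names the caller requires (strain here, since the thread mentions filtering on it).

```python
import csv
import io


def parse_tsv_text(text, required=("strain",)):
    """Parse TSV text into a list of dicts keyed by the header row.

    Hypothetical helper: raises ValueError when the first row does not
    contain every name in `required`, which suggests the file has no header.
    """
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    header = reader.fieldnames or []
    # Heuristic check: a data row is unlikely to contain the literal
    # names of the required fields.
    missing = [name for name in required if name not in header]
    if missing:
        raise ValueError(
            "first row does not look like a header; missing: %s" % missing
        )
    return [dict(row) for row in reader]


docs = parse_tsv_text("strain\tcountry\nZika/1\tbrazil\n")
```

The check cannot prove a header exists; it only rejects files whose first row lacks the expected names.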


trvrb commented on May 28, 2024

Looks good. Thanks @chacalle.


trvrb commented on May 28, 2024

@chacalle ---

I just noticed that if you now upload a tsv with header fields a, b, c against a table where documents have fields a, b, c, d, e, then the newly uploaded documents lack fields d and e. They should have fields d and e, set to null.


chacalle commented on May 28, 2024

Yes. One option is to define all the fields in the code, like here, and then use that list to assign null to any field that is missing from a document.

But now that we don't filter the documents based on fields besides strain, we may not need to define all the fields for each table. Instead, we could look through all the documents already in the database, build the set of fields that are present, and then make sure the documents to be inserted have those fields, assigning null where they don't.
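The second option could be sketched roughly as follows, using plain Python dicts in place of database documents. The helper names known_fields and pad_documents are hypothetical and do not come from the fauna code.

```python
def known_fields(existing_docs):
    """Return the set of every field name seen in the existing documents."""
    fields = set()
    for doc in existing_docs:
        fields.update(doc.keys())
    return fields


def pad_documents(new_docs, fields):
    """Return copies of new_docs where every known field is present.

    Fields that a document does not define are set to None (null).
    """
    return [{**{f: None for f in fields}, **doc} for doc in new_docs]


existing = [{"a": 1, "b": 2, "c": 3, "d": 4, "e": 5}]
incoming = [{"a": 9, "b": 8, "c": 7}]
padded = pad_documents(incoming, known_fields(existing))
```

With a real table, building the field set means scanning every stored document, so the cost grows with the table size.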


trvrb commented on May 28, 2024

I think I'm changing my mind here (sorry, sorry). Let's have an easy way to define a schema for a table. For example:

https://github.com/blab/nextstrain-db/blob/master/vdb/zibra_download.py#L14

An individual tsv may have some subset of these fields in its header in whatever order.
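As a rough illustration of a table schema that accepts headers with any subset of its fields in any order: the SCHEMA list below is made up for this sketch (only strain and minion_barcode appear in this thread), and check_header and conform are hypothetical names, not functions from vdb.

```python
# Hypothetical schema; the real field list lives in the linked vdb scripts.
SCHEMA = ["strain", "country", "collection_date", "minion_barcode"]


def check_header(header, schema=SCHEMA):
    """Split a tsv header into fields unknown to the schema and schema
    fields that the header does not provide."""
    unknown = [f for f in header if f not in schema]
    absent = [f for f in schema if f not in header]
    return unknown, absent


def conform(doc, schema=SCHEMA):
    """Return the document in schema order, with None for absent fields
    and without any field that is not part of the schema."""
    return {field: doc.get(field) for field in schema}


unknown, absent = check_header(["country", "strain"])
row = conform({"country": "brazil", "strain": "Zika/1"})
```

An upload script could refuse a file when unknown is non-empty and treat absent as fields to fill with null.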


chacalle commented on May 28, 2024

I'm not sure I understand this.


chacalle commented on May 28, 2024

Thinking about this more, one of the strengths of RethinkDB and other NoSQL databases is that inserted documents can be flexible in their schema. It doesn't really matter if a document is missing a field; in chateau it's displayed as undefined. We'd have to write a simple filter function for download.py, but this would allow more flexibility when adding documents with new fields to the database, like you're doing for zibra. This also goes along with making the filter in upload.py more flexible.
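One possible shape for such a filter, sketched over plain dicts. The names matches and filter_documents and the keep_missing flag are assumptions for illustration; this is not the filter in download.py.

```python
def matches(doc, criteria, keep_missing=True):
    """Check a document against {field: allowed_values} criteria.

    A field that is absent or None counts as undefined. With
    keep_missing=True an undefined field does not exclude the document;
    with keep_missing=False it does.
    """
    for field, allowed in criteria.items():
        if doc.get(field) is None:
            if keep_missing:
                continue
            return False
        if doc[field] not in allowed:
            return False
    return True


def filter_documents(docs, criteria, keep_missing=True):
    """Return the documents that satisfy the criteria."""
    return [d for d in docs if matches(d, criteria, keep_missing)]


docs = [
    {"strain": "A", "country": "brazil"},
    {"strain": "B"},
    {"strain": "C", "country": "peru"},
]
lenient = filter_documents(docs, {"country": ["brazil"]})
strict = filter_documents(docs, {"country": ["brazil"]}, keep_missing=False)
```

Here lenient keeps strains A and B, while strict keeps only A, because B has no country field.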


trvrb commented on May 28, 2024

Hmm... what's happening is that the schema is rapidly evolving. So, for example, I just added a field for minion_barcode. Doing a vdb/upload with a set of new documents that have this field resulted in only the new documents having it. We need to either:

1. Allow missing fields within documents and have vdb handle this situation gracefully. Chateau seems to work pretty okay.
2. When adding a new field, update previous documents to have a value of null for this new field.
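The second option amounts to a backfill. A minimal in-memory sketch, with a list of dicts standing in for the table; backfill_field is a hypothetical helper, and a real version would issue an update against the database instead of mutating a list.

```python
def backfill_field(docs, field, default=None):
    """Add `field` with `default` to every document that lacks it.

    Mutates the documents in place and returns how many were changed.
    """
    updated = 0
    for doc in docs:
        if field not in doc:
            doc[field] = default
            updated += 1
    return updated


table = [
    {"strain": "A"},
    {"strain": "B", "minion_barcode": "NB01"},
]
changed = backfill_field(table, "minion_barcode")
```

Running the backfill a second time changes nothing, so it is safe to repeat after each schema change.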

This is a separate issue from enforcing a schema. At the moment, I like the idea of enforcing a schema, like what's accomplished here:

https://github.com/blab/nextstrain-db/blob/master/vdb/zibra_metadata_upload.py#L16
https://github.com/blab/nextstrain-db/blob/master/vdb/zibra_download.py#L14

This makes it so that I can run zibra_download to get the entire table and zibra_metadata_upload to repair the entire table. This has been very useful.

So, I might suggest having an enforced schema for a table, but make it easy to adjust this schema on the fly (adding and removing fields as appropriate).


chacalle commented on May 28, 2024

0fc0ed1 filters even with undefined fields, so documents without a given field get filtered out.


trvrb commented on May 28, 2024

Closing this, as the schema has moved on.

