CMI-PB Controlled Terminology
During initial loading of a database, we validate "tree foreign keys" and "under" constraints only after the other validation steps have completed and the data for a table has been loaded into the database. This is not currently being done for the case of updating or adding a single row to the database. We need to add these steps to the single row validation.
CMI-PB has about 278 OLink IDs and mappings to Uniprot protein records in the olink_prot_info table. Here's the general Swagger/OpenAPI link:
https://www.cmi-pb.org/docs/api/
and here's the specific OLink request:
curl -X 'GET' \
'https://www.cmi-pb.org:443/db/olink_prot_info' \
-H 'accept: text/csv' \
-H 'Range-Unit: items'
We need to get these proteins from Uniprot and add them to the cmi-pb.owl, with rdfs:label, synonyms, and maybe some other annotations later. They can all be children of PR 'protein'.
I found an example in the CMI-PB data where a row of olink_prot_info.tsv ({"uniprot_id":"O14625","olink_id":"OID05624"}) was put in 'olink_prot_info_conflict' because 'uniprot_id' "O14625" is a duplicate in a primary key column. This is the right behaviour, but the resulting validation error, while correct, is somewhat misleading, since "OID05624" can be found in olink_prot_info.tsv and in the view 'olink_prot_info_view.olink_id'.
My first thought was that the SQL foreign key constraint could apply to the 'olink_prot_info_view.olink_id' column of the view, but after a bit of research I think this won't work, since columns in views cannot be UNIQUE as required for a foreign key: https://www.sqlite.org/foreignkeys.html#fk_indexes
My second thought is that we should have a different validation message in this case. Instead of saying "Value OID05624 of column olink_id is not in olink_prot_info.olink_id", we should say "Value OID05624 of column olink_id is in olink_prot_info_conflict.olink_id" (this could be worded better). This would require the Python code to check in both 'olink_prot_info' and 'olink_prot_info_conflict'.
Thoughts or suggestions?
I'm happy with how this system is coming along. Although the "special tables" (table/column/datatype) are pretty easy to read, they currently need to be written by hand, which is tedious and requires expertise.
Given a set of TSV files, I would like to have code that can create a decent first-pass of the table, column, and datatype tables.
For large datasets this could be very expensive, but randomly sampling a subset of N rows seems like a reasonable approach. Maybe N=1000 for starters.
The datatypes form a tree, so we could start with leaves such as 'integer' and 'float', and if we get a mixture of the two we could guess that the datatype should be a union of the two: 'number'.
It would also be good to guess at the 'structure' column of the 'column' table. We could guess at 'unique' if all the sampled values are distinct. We could guess at 'primary' if the column is unique and it's the first column of the table. We could guess at foreign keys, but this would require checking against the sets of values in other columns, and might get expensive. We could guess at 'tree' in a similar way, only checking other columns within the same table. Guessing at 'under' seems the most expensive.
We need to allow for the input data to include invalid data, and so there should be a (configurable) threshold for the percentage of values that have to match before a datatype is suggested.
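A first pass at the guesser could look like the sketch below. Everything here is an assumption for illustration: the leaf datatype names and regexes are placeholders for what would come from a real 'datatype' table, and the 'number' union step is left out.

```python
import random
import re

# Hypothetical leaf datatypes, most specific first; the real names and
# patterns would come from the 'datatype' table.
LEAF_DATATYPES = {
    "integer": re.compile(r"-?\d+"),
    "float": re.compile(r"-?\d+\.\d+"),
}

def guess_datatype(values, sample_size=1000, threshold=0.9):
    """Guess a datatype for a column by sampling at most sample_size
    values and returning the first leaf datatype that fully matches at
    least `threshold` of them, or None if nothing clears the bar."""
    if len(values) > sample_size:
        values = random.sample(values, sample_size)
    for name, pattern in LEAF_DATATYPES.items():
        hits = sum(1 for v in values if pattern.fullmatch(v))
        if hits / len(values) >= threshold:
            return name
    return None
```

The threshold is what allows for invalid data: `guess_datatype(["1", "2", "x"], threshold=0.6)` would still suggest "integer". A mixture of integer and float hits could then be promoted to the 'number' union as described above.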
After #19.
The current script just loads the table table, but I would like to load the other TSVs: table, column, datatype, prefix, import.
This Google Sheet drives the CMI-PB terminology:
https://docs.google.com/spreadsheets/d/1xCrNM8Rv3v04ii1Fd8GMNTSwHzreo74t4DGsAeTsMbk/edit#gid=0
We're only using a few of the columns, but more of this information should appear in the OWL:
A. 'Column': it would be helpful to have this as some sort of annotation, with a custom CMI-PB predicate, but we probably don't want to show it in the terminology browser
B. 'Label': We need a new predicate for this like "CMI-PB alternative term", and we want to use it as the short_label for search tables
C. 'CURIE': Bjoern would just like this filled out.
I think we want a new CMI-PB prefix for this. I'm not sure what the base IRI should be, probably http://cmi-pb.org/terminology/CMI-PB_, but we might need to change that later.
We've seen significant performance increases from #44, which applies to intra-row validation (#43). To optimize inter-row validation, especially unique, primary, and foreign key constraints, let's try the following:
SQL doesn't give us enough information about why an insert fails, so we'll just treat this as a yes/no question. When the insert succeeds, we don't need to run step (5), which should save a lot of work. When the insert fails, we need to do just as much work as we're currently doing. So the question is whether the additional overhead of steps (2) and (3) results in an overall performance win.
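A minimal sketch of the yes/no approach, using sqlite3 and a made-up 'foo' table (the table, schema, and function name are all hypothetical; the step numbers refer to the plan above):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE foo (olink_id TEXT PRIMARY KEY, label TEXT)")

def insert_or_flag(row):
    """Fast path: let SQL enforce the keys. Only when the insert fails
    do we fall back to the expensive inter-row validation step."""
    try:
        with conn:
            conn.execute("INSERT INTO foo VALUES (?, ?)", row)
        return True  # insert succeeded: step (5) can be skipped
    except sqlite3.IntegrityError:
        # SQL only tells us yes/no, so here we would run the full
        # validation to find out which constraint failed and why.
        return False

insert_or_flag(("OID1", "a"))  # True
insert_or_flag(("OID1", "b"))  # False: duplicate primary key
```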
For this project, like previous VALVE work, we want to validate that a given value falls under another value in a simple hierarchy. We use a tree() function to define the hierarchy, then the under() function to specify which tree and which value must be an ancestor.
As in previous VALVE work, we specify the tree(column) structure on the "parent" column, and the argument is the name of the "child" column. Each child may have a parent, and each parent must appear in the child column. Then we can use a WITH RECURSIVE query to get all the ancestors (or descendants, children, parents) of a given node.
For this to work:
Eventually we want to support multiple parents, which changes our tree to a directed acyclic graph (DAG).
With the prototype, so far we've been focused on validation while loading, and then #35 validating existing rows. We also need to be able to re-validate all the tables. This could be because we changed a definition of a table, column, or datatype. It could also be because we removed some rows from a table.
Although some of our validation rules are specific to a cell or a row, some of them, such as unique, under, and tree, necessarily span many rows, and sometimes span tables.
In a perfect world we would figure out the minimum set of (re)validations that we would have to run. Spreadsheets do this, and I'm reminded of the Clojure library Javelin, and "dataflow programming" more generally. As tempting as that is, it seems like too big a task. It's so much simpler to just run all the rules again.
Still thinking about this one...
The HTML string returned from term() will be embedded in a page, so it cannot include an html element. This might be as simple as changing standalone=False.
I guess there is also supporting CSS, so we should provide a CSS file.
I would also like term() to accept no arguments and return the div for owl:Class.
Currently, if the value of a primary key field conflicts with an existing value for that field in the table, then we add that row to the conflict table rather than the main table. This is the right behaviour.
However, if there is some other problem with a primary or unique key field (e.g., it has leading space), then the value of the field is inserted into the table as NULL. SQLite happily accepts this, but we will likely run into problems with Postgres.
We need to insert all rows with invalid unique or primary key fields into their corresponding conflict table, not just rows with non-unique values.
Building on #7 and ontodev/gizmos#71, we need a (Postgres) query to make a search table out of the statements table for the OLink proteins. Columns:
We have code to load from TSV into SQL. Now we need code to write from SQL to TSV. The 'table' table contains information such as the file path to write, and we can add more information as needed. This should be fairly straightforward, but there are little things that I worry about:
- newlines (\n) to separate rows
- escaping embedded newlines with \n and tabs with \t

The goal is to have very clean roundtrips from TSV to SQL and back.
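One way to keep the roundtrip clean is to escape in-cell control characters on the way out and reverse the escaping on the way in. The convention below is an assumption for illustration, not the current code:

```python
def escape_cell(value):
    """Escape backslashes, tabs, and newlines so that a cell can be
    written on a single tab-separated line."""
    return (value.replace("\\", "\\\\")
                 .replace("\t", "\\t")
                 .replace("\n", "\\n"))

def unescape_cell(value):
    """Inverse of escape_cell, for reading the TSV back into SQL."""
    out, i = [], 0
    while i < len(value):
        if value[i] == "\\" and i + 1 < len(value):
            # Unknown escapes fall through unchanged in this sketch.
            out.append({"\\": "\\", "t": "\t", "n": "\n"}.get(value[i + 1], value[i + 1]))
            i += 2
        else:
            out.append(value[i])
            i += 1
    return "".join(out)
```

The important property is that `unescape_cell(escape_cell(x)) == x` for any cell value, so TSV → SQL → TSV is lossless.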
Connected to #35, if we change a cell that is used by a foreign key constraint, we probably want to be able to propagate that change. The main example I'm thinking of is renaming controlled terminology. So we have a 'terminology' table with a cell 'human liver' and a 'specimen' table using 'terminology' as a foreign key for a 'source organ' column, with 50 rows describing human liver biopsies. If we rename 'human liver' to 'liver (human)' in 'terminology', we want a tool to figure out where it is used as a foreign key and update those 50 'specimen' rows.
These are part of the older VALVE implementations, and serve a specific purpose, but I have some new ideas that I'm still working on. This issue is just a placeholder for now.
Eventually I would like to be able to track changes made through update_row #35 and insert_row #57. I see two options:
I prefer 2. Previous versions of VALVE took the approach in 1 to collect validation messages. Sometimes we do want just a list of all the messages or changes, which is why we now have the export_messages() function. But in the new VALVE work here we're trying the _meta column approach, which makes it much easier to keep validation information close to the values in SQL without complex JOINs. So far I think this approach is working.
Users should also be tracking their TSV tables in version control. The use case here is to track and visualize a small number of edits to the tables that the users have made since they last committed their TSVs to version control. So think of this feature as filling the role of git diff. The number of changes should be small relative to the total number of rows. This is about tracking changes to content only, not schema changes, although I admit that line could get blurred if changes are made to the meta-tables. The history of changes would start fresh when the data is loaded from TSV. So the workflow we have in mind is:
- update_row() and insert_row()
So the idea for 2 is that we start with a table like:
row_number | foo | foo_meta |
---|---|---|
1 | bar |
Then we update_row and get something like:
row_number | foo | foo_meta |
---|---|---|
1 | baz | {"history": [{"value": "bar", ...}], "update": {"author": "James", "datetime": "2022-03-03T03:03:03", ... }} |
I'm fuzzier about insert_row. Maybe we need a 'row_meta' column:
row_number | row_meta | foo | foo_meta |
---|---|---|---|
1 | | bar | |
2 | {"update": {"author": "James", "datetime": "2022-03-03T03:03:03", ... }} | baz | |
We should be able to:
- export_messages()

Note that I haven't said anything about delete_row() #39, and that this use case does influence that feature.
When using a condition that matches a string with no trailing whitespace, such as match(/\S(.*\S)*/) or match(/\S([^\n]*\S)*/), if the primary_dt_condition_func using re.fullmatch is run over a string that has trailing whitespace (and so does not match), the process hangs here: https://github.com/jamesaoverton/cmi-pb-terminology/blob/next/src/script/validate.py#L357
I believe this is due to catastrophic backtracking and not an issue with the validation code, but I am unable to load datasets with invalid matches because of this.
In the meantime, I can use exclude(/^\s+|\s+$/) to fit my use case, but it would be good to implement a workaround, as this may happen with other patterns.
We need to package this code as a (very small) library that can be used from Django. The library needs just three functions with simple signatures:
- term(id=None): take an ontology term ID and return the HTML for the tree and annotations. If no argument is given, return the HTML for owl:Class. This is effectively calling gizmos.tree with standalone=False and search=False and the predicates here
- search(text): take a text string and return JSON from gizmos.search

I think we can hard code a relative path to the SQLite database file.
We can rename/reorganize the code here, e.g. the Python library code doesn't have to be in src/.
Building on #23 and #24, we need support for foreign key constraints. I think that "from" is more clear to new users than "foreign", so I think we should support a from() function with a "table.column" as the one argument. This will require:
- create_schema() to add a foreign key constraint for the main table (not the _constraint table)
- validate_rows()

Note: In the future I can see the need to specify more than one column: from(import.label, index.label). I think this would not just translate into a foreign key constraint in SQL -- it would require an additional view that gets the union of the two columns. I don't want to worry about this yet.
Should be pretty much the same as #23.
Eventually we want to be able to handle large tables using small amounts of memory, so we should design the system from the start to handle rows in batches.
The 'table', 'column', and 'datatype' tables are all special, and should never get too large, so it's fine to load them all into memory at once -- no batching.
In our example, the 'prefix' and 'import' tables should be loaded in batches. We should be able to set the batch size. For testing and development purposes, we will use ridiculously small batch sizes, like 1 or 3.
In particular, the batch size will control the number of rows submitted to validate_rows each time it is called.
Somewhat the inverse of #61, it would also be useful to be able to generate tables of "random" data from given table/column/data/rule information.
The brute force approach is to generate random strings until they match the datatype regular expressions.
More elegant would be tools that parse the regex (at least partially) and use that information to generate valid strings. There should be existing tools for this.
Better still would be to define a set of "generator" functions and annotate the datatypes with them. For example: integer, float, date, person name, mailing address, human height, etc.
Such a tool would be valuable for several kinds of testing, but the key use case I have in mind is that this randomly generated data (and the special tables) could be public even when the real data is private. This is an important use case when developing open systems for clinical data, where the real patient data must be kept strictly private.
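The brute-force approach might look like the sketch below; the alphabet, length limit, and retry cap are arbitrary choices for illustration:

```python
import random
import re
import string

def random_match(pattern,
                 alphabet=string.ascii_letters + string.digits + ":.-",
                 max_length=10, max_tries=100000):
    """Brute force: generate random strings until one fully matches the
    datatype regular expression. Fine for testing; a real tool would
    parse the regex or use per-datatype generator functions instead."""
    compiled = re.compile(pattern)
    for _ in range(max_tries):
        length = random.randint(1, max_length)
        candidate = "".join(random.choice(alphabet) for _ in range(length))
        if compiled.fullmatch(candidate):
            return candidate
    raise ValueError(f"no match found for {pattern!r}")

value = random_match(r"\d+")  # e.g. "407"
```

The obvious weakness is visible even here: for restrictive patterns the hit rate collapses, which is exactly why the regex-parsing or generator-function approaches would be better.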
Please revise the prototype's load.py to use bindparam and sql_text from sqlalchemy.sql.expression, as done in sprocket's run.py.
I would like to output properly escaped SQL strings at this point -- I don't want to execute the SQL from Python yet.
In the 'structure' column of the 'column' table we want to be able to specify "unique" and "primary" constraints. When calling validate_row for a batch of rows, and checking uniqueness for column X, we need to check two things:
This requires major changes to the current code:
- validate_rows needs (read) access to the SQL database

My best idea for handling (3) is to create a second 'foo_invalid' table. 'foo_invalid' would have the same schema as 'foo', except for the unique and primary key constraints. If any cell in a row fails (3) we would insert the whole row into 'foo_invalid', not into 'foo'. So we may want to flag the target table of the whole row somehow. This approach requires a modification to create_schema to create the _invalid tables. We probably want to also create a view that takes the union of 'foo' and 'foo_invalid', maybe called 'foo_view'.
A disadvantage of this is that it would take an extra step to add information about the other duplicate that was inserted first.
We are not yet considering unique constraints that span multiple columns.
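A sqlite3 sketch of the 'foo_invalid' table plus union view idea (all table and column names here are hypothetical):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- main table keeps the real constraints
    CREATE TABLE foo (id TEXT PRIMARY KEY, label TEXT);
    -- same schema, but without the unique/primary key constraints
    CREATE TABLE foo_invalid (id TEXT, label TEXT);
    -- union view so queries can still see every row
    CREATE VIEW foo_view AS
        SELECT * FROM foo UNION ALL SELECT * FROM foo_invalid;
""")

def insert(row):
    """Insert into foo; on a key violation, divert the whole row to
    foo_invalid instead."""
    try:
        with conn:
            conn.execute("INSERT INTO foo VALUES (?, ?)", row)
    except sqlite3.IntegrityError:
        with conn:
            conn.execute("INSERT INTO foo_invalid VALUES (?, ?)", row)

insert(("A", "first"))
insert(("A", "duplicate"))  # diverted to foo_invalid
```

The disadvantage mentioned above shows up here too: at the point where the second "A" is diverted, nothing records which row in 'foo' it collided with; finding that takes an extra query.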
Currently, we allow the structure column of the column table to have multiple constraints, and currently the prototype code grabs all of them simply by splitting on spaces. This was never meant as a permanent solution to the problem, and so far it has been fine for the simple conditions we've had to deal with, like unique, primary, and tree(child), which we can rely on not to have any spaces. Since constraints like under(table.column, value) have multiple arguments, and since presumably we don't want to force the end user to omit a space between arguments, we need to handle things more intelligently.
The best solution is to implement a simple grammar for this. For Python, we can use the lark library.
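Until a lark grammar is in place, here is a hand-rolled sketch of the kind of parsing the grammar would need to do; the tokenization rules (quoted strings, dotted names, parentheses, commas) are my assumption about the syntax:

```python
import re

# Tokens: quoted strings, dotted names, punctuation, whitespace.
TOKENS = re.compile(r'"[^"]*"|[\w.]+|[(),]|\s+')

def parse_structure(text):
    """Split a 'structure' cell like 'primary under(foo.bar, "baz qux")'
    into a list of (name, args) pairs, respecting spaces inside
    parentheses and quotes. A stand-in for a real grammar (e.g. lark)."""
    tokens = [t for t in TOKENS.findall(text) if not t.isspace()]
    result, i = [], 0
    while i < len(tokens):
        name = tokens[i]
        args = []
        if i + 1 < len(tokens) and tokens[i + 1] == "(":
            i += 2  # skip the name and "("
            while tokens[i] != ")":
                if tokens[i] != ",":
                    args.append(tokens[i].strip('"'))
                i += 1
        result.append((name, args))
        i += 1
    return result
```

The point of the example is the failure mode of space-splitting: `under(foo.bar, "baz qux")` contains two spaces but is a single constraint with two arguments.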
On the next branch we're currently faking the datatype conditions with hardcoded rules:
https://github.com/jamesaoverton/cmi-pb-terminology/blob/next/src/script/validate.py#L128
I would like to implement these for real. We're currently using:
I'm not sure these are the best names. The match/search distinction was trying to follow Python, but I always need to look that up. Previous versions of VALVE required ^ and $ anchors, which are annoying. I intended 'exclude' to mean 'not search' but it could just as easily be read as 'not match'. Suggestions welcome.
Previously we used a grammar for parsing these, but this time we don't have any nested cases to worry about. Maybe string splitting is good enough for now?
Soon we will want to implement conditions for rules, which I think will require not, but I'm not sure about any and all.
Similar to #13 and building on #9, we need a SQL query to make search tables out of the statements table for the Ab titer and Live cell percentages. In the terminology Google Sheet column A, the Ab titer is indicated by ab_titer.antigen and the Live cell percentages by cell_type.cell_type_name.
Previous VALVE tools output "standard message tables" that locate a cell and then describe validation results. General documentation is here https://github.com/ontodev/cogs#message-tables. We want to output something like this for use in reporting and quality control. I've found these tables very useful for summarizing problems with a dataset, and filtering for just the most important problems.
The validation information we need is already generated by the validation methods. We basically want to iterate over rows and collect validation messages from the _meta columns. I've written prototypes that do this in pure SQL.
I think the main question is how to specify cell locations.
In VALVE and COGS we populate the 'cell' column using "A1" syntax (i.e. column letter, row number) to locate the cell we want to talk about. Because VALVE and COGS are focused on spreadsheets, this is natural. Now that we're using SQL this is not as natural:
- row_number()

We could map column names to letters. We want to have a deterministic sort for all tables, as discussed in #36, so we should always be able to determine a row number, although I'm a little worried about a performance hit. I think we want the option to output a 'cell' column in A1 syntax.
But I think that what we usually want will be two columns, 'column' and 'row', where 'column' is the column name, and 'row' is the primary key for that row, or the row index if we don't have a primary key. If we have multiple primary keys, then I would want to concatenate them with a separator, although it's hard to pick a separator that won't conflict with the content of the key.
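Mapping a 1-based column index to its A1-style letter is a small function; a sketch of what the optional 'cell' output could use:

```python
def column_letter(n):
    """Convert a 1-based column index to its spreadsheet letter:
    1 -> A, 26 -> Z, 27 -> AA, and so on (bijective base 26)."""
    letters = ""
    while n > 0:
        n, rem = divmod(n - 1, 26)
        letters = chr(ord("A") + rem) + letters
    return letters

def a1(column_index, row_number):
    """Build an A1-style cell reference from SQL-side indices."""
    return f"{column_letter(column_index)}{row_number}"

a1(1, 2)    # "A2"
a1(28, 10)  # "AB10"
```

This only covers the A1 option; the 'column' plus 'row' pair described above needs no conversion at all, which is one argument in its favour.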
For the prototype we want to be able to edit existing rows using HTML forms. The HTML form will make use of the _meta
information to show the user any problems. When the user changes data in the form, we need to be able to (1) validate it and (2) update the database. To be clear: these should be independent. A common use case would be: the user submits invalid data, we validate and display the problems in the HTML form (without updating the database yet), then the user fixes the problem and we actually update the database.
The validation step should be very similar to what we already have in validate_row(), but I think we need something more to handle the case where we are validating an existing row. For example, if the row has a primary key with value "X", we should not fail validation because we found a duplicate "X" in the old version of the row and the new version of the row. We probably just need to identify the (existing) row we are validating, and validate it "as if" it were not already present in the database.
The update step should replace a row in the database with the new row.
Note that we also want to use HTML forms to add new rows, but I think we have that covered already.
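For the "as if not present" validation, one option is to exclude the row's own identifier from the uniqueness query. A sqlite3 sketch with a made-up 'specimen' table (names and schema are hypothetical):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE specimen (row_number INTEGER, barcode TEXT)")
conn.executemany("INSERT INTO specimen VALUES (?, ?)",
                 [(1, "X"), (2, "Y")])
conn.commit()

def is_duplicate(table, column, value, row_number):
    """Check uniqueness 'as if' the row being edited were not in the
    table, by excluding its own row_number from the comparison."""
    sql = f"SELECT 1 FROM {table} WHERE {column} = ? AND row_number != ? LIMIT 1"
    return conn.execute(sql, (value, row_number)).fetchone() is not None

is_duplicate("specimen", "barcode", "X", 1)  # False: the row's own old value
is_duplicate("specimen", "barcode", "Y", 1)  # True: collides with row 2
```

This keeps validation and update independent, as required: the check never modifies the table, so the form can re-validate as many times as needed before the actual update.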
Currently the validate_condition() function runs Lark or compiles a regex at least once per cell. It should be more efficient to "compile" the condition just once into a function with just the value as the argument. This should be possible when the config is built.
As @beckyjackson works on forms for editing rows with HTML forms (related to #35), it would be good to add support for autocomplete text fields. We tend to use Typeahead, which expects data in a simple JSON format, e.g. Gizmos search. The format is simple enough once we know what the available values are, and that should be guided by the metatable information. For a given field that we're editing:
- if the structure is in("A", "B", "C") then the autocomplete should provide: "A", "B", "C"
- if it is from(foo.bar), the autocomplete should provide values from the foo.bar column
- if it is under(foo.bar, "baz"), the autocomplete should provide values from the "baz" branch of the foo.bar tree

So I think we need a function in the validate module that takes a table, column, and the string to complete, and returns a JSON array formatted for Typeahead.
Previous versions of VALVE include a 'rule' table that implements when-then conditions that apply to two columns of a single row. Inter-column validation is important.
But this raises an important design problem: In VALVE the 'rule' table has 'when condition' and 'then condition' columns, where the conditions could be equals() or not(in()) or under() etc. The problem is that in the CMI-PB prototype we've been pulling apart the single 'condition' column into 'nulltype', 'datatype', 'structure', and 'condition'. The separation makes for simple expressions in these columns, but I don't want the 'rule' table to contain multiple columns: 'when nulltype', 'when datatype', 'when structure', 'then nulltype', 'then structure', etc.
Rules are bound to have more complex expressions. Maybe it's fine to continue to have just 'when condition' and 'then condition', where those cells can contain combinations of datatypes, structures, and conditions.
Our validation rules fall into two groups:
We could run the intra-row validation concurrently, which should improve performance. So I'd like to pull these two kinds of validation apart.
When a cell is valid, the _meta column currently contains json({"valid": true, "messages": []}). To save space, and maybe reduce noise, should we replace these "valid" JSON objects with a simple NULL?
I had kind of forgotten about this, but we need an insert_row() function (parallel to update_row()) that can be called from the frontend code.
In VALVE and our current prototype the datatypes form a tree, so a 'CURIE' datatype might have ancestors: 'word', 'nonspace', 'trimmed_line', 'trimmed_text', 'text'. The goals of this design include:
The current VALVE and prototype implementations check, for each cell, all the ancestor datatypes from most general to most specific, stopping at the first failure.
I still like design goals (2) and (3), but (1) has not been a big problem. With the new match() function I'm finding that we can simply express exactly what should match, without relying on the ancestors. (1) forces us to check ancestor datatypes first, which can be inefficient. A new consideration is that (1) prevents us from translating our datatype checks into other validation systems, such as JSONSchema.
So I've been considering dropping design goal (1), and requiring that each datatype specifies exactly what should match. An example: for 'CURIE', instead of search(/:/), which relies on ancestors such as 'word' and 'trimmed_line', we would specify match(/\w+:\w+/) (or something), which is a complete specification. If this one regex matches, then we're done. If validation fails, then we make use of the hierarchy for useful error messages and suggestions, satisfying design goals (2) and (3). If desired, we can translate match(/\w+:\w+/) into JSONSchema "pattern": "^\w+:\w+$" -- we lose the benefits of (2) and (3) but we can actually match the pattern, where search(/:/) did not carry enough information by itself.
Thoughts?
Building on #28, we use tree() by specifying an under(table.column, value) structure. The 'table.column' points to the definition of the tree(). The 'value' specifies a node X in the tree. When validating a value Y, X must be identical to Y or X must be an ancestor of Y in the tree.
I think we should have at least these two different validation error cases/messages:
In previous VALVE implementations the whole tree was held in memory. This time we would prefer not to keep it in memory, and instead use a WITH RECURSIVE query similar to this: https://github.com/ontodev/gizmos/blob/master/gizmos/tree.py#L806
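A sqlite3 sketch of the WITH RECURSIVE check, with a made-up 'terminology' table of child/parent pairs (the tree(child) layout described in #28):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE terminology (child TEXT, parent TEXT)")
conn.executemany("INSERT INTO terminology VALUES (?, ?)", [
    ("human liver", "liver"),
    ("liver", "organ"),
    ("organ", "anatomical entity"),
])
conn.commit()

def is_under(value, ancestor):
    """Check under(terminology.parent, ancestor) for a value without
    holding the tree in memory: walk up the tree inside SQL."""
    sql = """
        WITH RECURSIVE ancestors(node) AS (
            SELECT :value
            UNION
            SELECT t.parent FROM terminology t, ancestors a
            WHERE t.child = a.node
        )
        SELECT 1 FROM ancestors WHERE node = :ancestor LIMIT 1
    """
    args = {"value": value, "ancestor": ancestor}
    return conn.execute(sql, args).fetchone() is not None

is_under("human liver", "organ")  # True: organ is an ancestor
is_under("organ", "human liver")  # False: wrong direction
```

The base case of the CTE is the value itself, so "X identical to Y" passes too, and UNION (rather than UNION ALL) guarantees termination even if the data accidentally contains a cycle. This also distinguishes the two error cases above: a value missing from the tree entirely yields an ancestors set of just itself.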
We need to be able to delete rows from tables. That's simple enough, but we also want to maintain validation, so I think this is a particular case of #38.