child of #8
Here's the idea.
The XML parser is split into stages: creating the base record, dbxrefs, linked records, and props (plus whatever else we need).
The base record handling is hard-coded. For each column of the base record we look for a hard-coded attribute, with extra logic that checks all candidate attributes and uses the "best" one.
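A rough sketch of the "best attribute" idea, assuming each column gets a priority-ordered list of attribute names and the first non-empty one wins. The column names, the candidate lists, and the `Attribute` / `attribute_name` XML layout are all assumptions for illustration, not a settled design.

```python
import xml.etree.ElementTree as ET

# Hypothetical priority lists: for each base-record column, the attribute
# names to try, best first. The real names and ordering are still undecided.
COLUMN_CANDIDATES = {
    "name": ["sample_name", "title", "accession"],
    "description": ["description", "title"],
}


def build_base_record(sample: ET.Element) -> dict:
    """Fill each base-record column from the best available attribute."""
    # Map attribute name -> stripped text for every <Attribute> in the sample.
    found = {
        attr.get("attribute_name"): (attr.text or "").strip()
        for attr in sample.iter("Attribute")
    }
    record = {}
    for column, candidates in COLUMN_CANDIDATES.items():
        # First candidate that is present and non-empty wins; None otherwise.
        record[column] = next(
            (found[name] for name in candidates if found.get(name)), None
        )
    return record


sample = ET.fromstring(
    "<BioSample><Attributes>"
    '<Attribute attribute_name="title">Leaf tissue</Attribute>'
    '<Attribute attribute_name="accession">SAMN0001</Attribute>'
    "</Attributes></BioSample>"
)
record = build_base_record(sample)
```

Here `sample_name` is missing, so `name` falls back to `title`.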
Dbxrefs are also hard-coded.
Linked records: I'm not there yet. Let's ignore them for now.
For everything else, the parser looks up the tag in an API. The API returns whether the tag should be ignored, added as a prop, or handled some other way.
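The lookup could be as small as this. It is a sketch only: `TagAction`, `lookup_tag`, and the in-memory `ASSIGNMENTS` dict are made-up names standing in for the real API and its backing tables, and the "something else" case is modelled here as a tag that has never been seen.

```python
from enum import Enum


class TagAction(Enum):
    IGNORE = "ignore"    # tag is known but has no assigned term
    PROP = "prop"        # tag is assigned to a cvterm, so create a property
    UNKNOWN = "unknown"  # tag has never been encountered (assumed third case)


# Stand-in for the API's backing store: tag name -> assigned cvterm id or None.
ASSIGNMENTS = {"tissue": 1234, "isolate": None}


def lookup_tag(tag: str):
    """Return (action, cvterm_id) for an attribute tag."""
    if tag not in ASSIGNMENTS:
        return TagAction.UNKNOWN, None
    cvterm_id = ASSIGNMENTS[tag]
    if cvterm_id is None:
        return TagAction.IGNORE, None
    return TagAction.PROP, cvterm_id
```

The parser would call `lookup_tag` once per attribute and only create a prop when it gets `TagAction.PROP` back.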
We have a schema that stores:
ALL encountered tags. It keeps the tag name, the NCBI db type for that tag, and whether the tag is assigned to a term. If it's assigned, it just stores the cvtermid for easy lookup. We also keep a list of all possible matching cvterms that aren't necessarily assigned (probably a separate, mview-type table).
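To make the two tables concrete, here is a sketch using an in-memory SQLite database. The table and column names (`xml_tag`, `xml_tag_match`, `ncbi_db`) are placeholders, and the possible-matches table is a plain table here because SQLite has no materialized views.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- One row per tag ever encountered, per NCBI db type.
    CREATE TABLE xml_tag (
        tag_id    INTEGER PRIMARY KEY,
        tag_name  TEXT NOT NULL,
        ncbi_db   TEXT NOT NULL,
        cvterm_id INTEGER,              -- NULL means unassigned
        UNIQUE (tag_name, ncbi_db)
    );
    -- Candidate cvterms for a tag, whether or not one is assigned.
    CREATE TABLE xml_tag_match (
        tag_id    INTEGER NOT NULL REFERENCES xml_tag (tag_id),
        cvterm_id INTEGER NOT NULL,
        PRIMARY KEY (tag_id, cvterm_id)
    );
    """
)

# One assigned tag and one unassigned tag (the ids are invented).
conn.execute(
    "INSERT INTO xml_tag (tag_name, ncbi_db, cvterm_id) VALUES (?, ?, ?)",
    ("tissue", "biosample", 1234),
)
conn.execute(
    "INSERT INTO xml_tag (tag_name, ncbi_db) VALUES (?, ?)",
    ("isolate", "biosample"),
)

unassigned = [
    row[0]
    for row in conn.execute("SELECT tag_name FROM xml_tag WHERE cvterm_id IS NULL")
]
```

Keeping the assignment as a nullable column on the tag row means "is this tag assigned?" is a single lookup at parse time.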
How does the schema get populated? Read on...
Schema population
We have a job that reads an XML file and compiles all the attribute tags. Each tag is stored in the schema as unassigned. The job then looks each one up in your site's chado.cvterm table. All exact and "close enough" matches go into the possible-matches table.
The admin then goes to an admin area and sees a list of all XML tags with their matches. From there they can "assign" the attribute, which means that when the XML gets parsed for real, it will create a property. If an attribute is not assigned a term, it gets ignored. If no terms match an attribute, the admin is instructed to find one, with a button to automatically create a local term instead.
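A sketch of the population job under a few assumptions: tags live in `attribute_name` on `Attribute` elements, the `CVTERMS` dict stands in for chado.cvterm, and "close enough" is approximated with `difflib.get_close_matches` at an arbitrary 0.8 cutoff. The real matching rule is still open.

```python
import difflib
import xml.etree.ElementTree as ET

# Stand-in for chado.cvterm: term name -> cvterm id (the ids are invented).
CVTERMS = {"tissue": 1234, "cultivar": 2001, "collection date": 2002}


def collect_tags(xml_text: str) -> list:
    """Return the distinct attribute tags found in an XML document, sorted."""
    root = ET.fromstring(xml_text)
    return sorted(
        {
            attr.get("attribute_name")
            for attr in root.iter("Attribute")
            if attr.get("attribute_name")
        }
    )


def possible_matches(tag: str, cutoff: float = 0.8) -> list:
    """Return cvterm ids whose names match the tag exactly or closely."""
    # Normalize so that "collection_date" can match "collection date".
    normalized = tag.replace("_", " ").lower()
    names = difflib.get_close_matches(normalized, list(CVTERMS), n=5, cutoff=cutoff)
    return [CVTERMS[name] for name in names]


xml_text = (
    "<BioSampleSet><BioSample><Attributes>"
    '<Attribute attribute_name="tissue">leaf</Attribute>'
    '<Attribute attribute_name="collection_date">2019-05-01</Attribute>'
    '<Attribute attribute_name="tissue">root</Attribute>'
    "</Attributes></BioSample></BioSampleSet>"
)
tags = collect_tags(xml_text)
matches = {tag: possible_matches(tag) for tag in tags}
```

Tags that come back with an empty list are the ones where the admin would be told to find or create a term.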
Furthermore, on install, we can hard-code some suggested attribute -> cvterm mappings. This is tricky because everyone's site is different, but maybe there are some attributes we would expect in ALL biosamples across plants, animals, fungi, etc.
When someone imports a new XML file, the import can be configured either to ignore new attribute tags (but add them to the schema as unmatched, ignored attributes) or to abort the load, so the admin can assign a term and re-attempt the load.
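The two import modes could look roughly like this. `handle_tags`, `UnknownTagError`, and the `on_new` option are invented names, and raising one error that lists every new tag (rather than stopping at the first) is my assumption about what would be most useful to the admin.

```python
class UnknownTagError(Exception):
    """Raised when the load is configured to abort on new attribute tags."""

    def __init__(self, tags):
        super().__init__("unassigned new tags: " + ", ".join(tags))
        self.tags = tags


def handle_tags(attributes: dict, known: dict, on_new: str = "ignore") -> dict:
    """Turn parsed attributes into props keyed by cvterm id.

    attributes: tag name -> value from the XML.
    known: tag name -> assigned cvterm id, or None if unassigned.
    on_new: "ignore" records new tags as unmatched; "abort" stops the load.
    """
    new_tags = sorted(tag for tag in attributes if tag not in known)
    if new_tags and on_new == "abort":
        raise UnknownTagError(new_tags)
    for tag in new_tags:
        known[tag] = None  # remember the tag as unmatched and ignored

    props = {}
    for tag, value in attributes.items():
        cvterm_id = known[tag]
        if cvterm_id is not None:  # unassigned tags produce no property
            props[cvterm_id] = value
    return props
```

With `on_new="abort"`, the admin can read the tag list from the error, assign terms, and rerun the load.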