A running list of questions that I might direct to Matt if I cannot figure them out:

Questions about EML design & implementation about eml HOT 6 CLOSED

ropensci commented on June 13, 2024

Questions about EML design & implementation

from eml.

Comments (6)

mbjones commented on June 13, 2024

Endpoints on the KNB should use the DataONE MN.get() REST endpoint, so for example, for doi:10.5063/AA/nceas.912.9:
https://knb.ecoinformatics.org/knb/d1/mn/v1/object/doi:10.5063%2FAA%2Fnceas.912.9
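As a rough illustration of how a client might build that URL, here is a minimal Python sketch. The helper name `mn_get_url` and the constant `KNB_MN_BASE` are hypothetical; only the base URL and the encoded form of the identifier come from the example above.

```python
from urllib.parse import quote

# Base URL copied from the example above; the constant name is hypothetical.
KNB_MN_BASE = "https://knb.ecoinformatics.org/knb/d1/mn/v1"


def mn_get_url(pid, base=KNB_MN_BASE):
    """Build an MN.get() style URL for an identifier (hypothetical helper)."""
    # safe=":" leaves the colon literal and percent-encodes "/",
    # which matches the form of the URL shown above.
    return f"{base}/object/{quote(pid, safe=':')}"


print(mn_get_url("doi:10.5063/AA/nceas.912.9"))
```

Passing `safe=""` instead would also encode the colon as `%3A`.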

However, note that we also recommend using the DataONE CN.resolve() service to find the list of nodes that might currently both have a copy of an object and are currently available on the network. The resolve() call returns a list of nodes that contain the object and the REST url for retrieving it. So, for example:

$ curl -s https://cn.dataone.org/cn/v1/resolve/doi%3A10.5063%2FAA%2Fnceas.912.9 | xmlstarlet fo

<?xml version="1.0" encoding="UTF-8"?>
<d1:objectLocationList xmlns:d1="http://ns.dataone.org/service/types/v1">
  <identifier>doi:10.5063/AA/nceas.912.9</identifier>
  <objectLocation>
    <nodeIdentifier>urn:node:KNB</nodeIdentifier>
    <baseURL>https://knb.ecoinformatics.org/knb/d1/mn</baseURL>
    <version>v1</version>
    <url>https://knb.ecoinformatics.org/knb/d1/mn/v1/object/doi:10.5063%2FAA%2Fnceas.912.9</url>
  </objectLocation>
  <objectLocation>
    <nodeIdentifier>urn:node:CN</nodeIdentifier>
    <baseURL>https://cn.dataone.org/cn</baseURL>
    <version>v1</version>
    <url>https://cn.dataone.org/cn/v1/object/doi:10.5063%2FAA%2Fnceas.912.9</url>
  </objectLocation>
</d1:objectLocationList>
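A client could pull the candidate download URLs out of that response along these lines. This is only a sketch using the Python standard library: the function name `object_locations` is hypothetical, and the sample string is a copy of the response above with the XML declaration left out.

```python
import xml.etree.ElementTree as ET

# Copy of the resolve() response shown above (XML declaration omitted).
RESPONSE = """<d1:objectLocationList xmlns:d1="http://ns.dataone.org/service/types/v1">
  <identifier>doi:10.5063/AA/nceas.912.9</identifier>
  <objectLocation>
    <nodeIdentifier>urn:node:KNB</nodeIdentifier>
    <baseURL>https://knb.ecoinformatics.org/knb/d1/mn</baseURL>
    <version>v1</version>
    <url>https://knb.ecoinformatics.org/knb/d1/mn/v1/object/doi:10.5063%2FAA%2Fnceas.912.9</url>
  </objectLocation>
  <objectLocation>
    <nodeIdentifier>urn:node:CN</nodeIdentifier>
    <baseURL>https://cn.dataone.org/cn</baseURL>
    <version>v1</version>
    <url>https://cn.dataone.org/cn/v1/object/doi:10.5063%2FAA%2Fnceas.912.9</url>
  </objectLocation>
</d1:objectLocationList>"""


def object_locations(xml_text):
    """Return (node identifier, retrieval URL) pairs from a resolve() response."""
    root = ET.fromstring(xml_text)
    # Only the root element carries the d1 prefix; its children are unqualified.
    return [
        (loc.findtext("nodeIdentifier"), loc.findtext("url"))
        for loc in root.findall("objectLocation")
    ]


for node, url in object_locations(RESPONSE):
    print(node, url)
```

A caller could then try each URL in turn until one of the nodes responds.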

Regarding IDs, the EML spec leaves it open other than saying they must be unique in the document. The point is to provide an unambiguous identifier to reference <attribute> definitions in EML. These can then be used in other places to refer to those attribute definitions.
EML doesn't specify how to define an attribute beyond using the natural language definition. That said, for OBOE we have come up with an annotation syntax that could be used in the additionalMetadata section to provide a linkage between the attribute definition in EML and an ontology. Some examples of its use are in SVN (https://code.ecoinformatics.org/code/semtools/trunk/dev/sms/examples). This is probably more complicated than you are looking for, as it maps several different semantic aspects of the data set, including the Characteristic being measured (what you are looking for, I think), as well as the Entity being measured, the MeasurementStandard used (redundant with other fields in EML), and the Context. This is the mapping we've been experimenting with in Semtools and is the basis of the figure that you included in issue #8. There is an XML Schema for the annotation syntax in the directory above the examples.

The annotation is in XML, but it could also be done in RDF, which would merge better with the OBOE OWL ontology. In addition, we debated whether it's better to include the annotation inline in the EML document (which nicely packages them together), or to provide a separate annotation file (which allows people other than the EML owner to provide annotations, and lets us annotate metadata files other than EML, such as FGDC). Which is best is still under discussion in our group. We have built out a prototype extension of Morpho that produces these annotations as separate files, and then a Metacat search service that knows how to use them to do semantic-driven searches and data integration tasks. It would be great to discuss how this relates to what you are trying to do in R, and what we might adapt for compatibility.

cboettig commented on June 13, 2024

Re 1. This is great, can definitely implement this kind of call.

I am curious about what we can offer, if anything, by way of search interfaces for EML data through the reml R package. Initially I was thinking about querying across large sets of EML files for matching column types, for data integration etc. Even though EML files are generally pretty small, downloading and parsing large numbers of them might not be the best way to go. Thoughts?

Anyway, something to think about down the line at least.

cboettig commented on June 13, 2024

Okay, I'm now thinking that adding RDF to the additionalMetadata section and using describes references (as discussed above and more in #9) is the best way to go about adding semantic definitions, rather than relying on the external semtools schema for this (as we considered in issue #8). When asked about using the semtools schema, Ben makes the case for this approach quite eloquently:

While we did use the sms annotation schema in the Semtools project, I can't say that I think you should also use it. I'd be more interested in seeing a "purer" semantic approach to storing those types of annotations (e.g., "this column of this data table is measured in Gram"). Basically, these are all RDF triples. I'm not sure if Shawn Bowers - the one who first drew up the sms annotation schema - is still advocating its use, but it was experimental even in the heyday of Semtools. One of the major issues with this annotation approach is that it is another independent file that describes the EML file. This gets annoying when you try to have tools work with the many files. You could potentially embed the annotation - or any XML - in EML's additionalMetadata section.

I think the really clever thing here is that the metadata tag is flexible enough for us to just add RDF directly, as Ben illustrates like this:

<eml>
…
<dataTable id="http://some.namespace#myUniqueEntityId1">
        <attribute id="http://some.namespace#myUniqueAttributeId1"/>
        <attribute id="http://some.namespace#myUniqueAttributeId2"/>
</dataTable>
…
<additionalMetadata>
        <describes>http://some.namespace#myUniqueAttributeId1</describes>
        <metadata>
                <!-- RDF stuff here that annotates http://some.namespace#myUniqueAttributeId1 -->
                <rdf:RDF
                        xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
                        xmlns:o="http:/oboe-core#">
                        <rdf:Description rdf:about="http://some.namespace#myUniqueAttributeId1">
                                <o:entity>Air</o:entity>
                                <o:characteristic>Temperature</o:characteristic>
                                <o:unit>Celsius</o:unit>
                        </rdf:Description>
                </rdf:RDF>
        </metadata>
</additionalMetadata>
<additionalMetadata>
        <describes>http://some.namespace#myUniqueAttributeId2</describes>
        <metadata>
                <!-- RDF stuff here that annotates http://some.namespace#myUniqueAttributeId2 -->
        </metadata>
</additionalMetadata>
</eml>
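To make the "these are all RDF triples" point concrete, here is a minimal Python sketch that reads such a block back as (subject, predicate, object) triples using only the standard library. It is not a full RDF/XML parser, and the function name `embedded_triples` is hypothetical. The fragment is a trimmed variant of the example above that binds every prefix it uses (the unit is written as `o:unit` here, which is an assumption), because a namespace-aware parser rejects unbound prefixes.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# Trimmed variant of the example above; the o: namespace is the same placeholder.
EML_FRAGMENT = """<eml>
  <additionalMetadata>
    <describes>http://some.namespace#myUniqueAttributeId1</describes>
    <metadata>
      <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
               xmlns:o="http:/oboe-core#">
        <rdf:Description rdf:about="http://some.namespace#myUniqueAttributeId1">
          <o:entity>Air</o:entity>
          <o:characteristic>Temperature</o:characteristic>
          <o:unit>Celsius</o:unit>
        </rdf:Description>
      </rdf:RDF>
    </metadata>
  </additionalMetadata>
</eml>"""


def embedded_triples(eml_text):
    """Collect (subject, predicate, literal) triples from rdf:Description nodes."""
    root = ET.fromstring(eml_text)
    triples = []
    for desc in root.iter(f"{{{RDF_NS}}}Description"):
        subject = desc.get(f"{{{RDF_NS}}}about")
        for prop in desc:
            # ElementTree tags look like "{namespace}local"; join the two parts
            # to get the predicate URI.
            namespace, _, local = prop.tag[1:].partition("}")
            triples.append((subject, namespace + local, (prop.text or "").strip()))
    return triples


for triple in embedded_triples(EML_FRAGMENT):
    print(triple)
```

Because each rdf:Description names its own subject, the same loop works whether the RDF sits in one additionalMetadata node or in several.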

A few questions:

Would it be preferable to use RDFa (like NeXML does) instead of RDF, which would presumably allow us to extract the RDF data into a pure RDF file using standard tools (e.g. http://www.w3.org/2012/pyRdfa/#distill_by_uri)?
Or is there a good reason to prefer embedded RDF, as above?
Ben points out that one option would be, rather than a separate additionalMetadata for each attribute, we could have one additionalMetadata referencing the root EML id in describes, since the rdf:Description node points to the attribute already anyhow. Any reason to prefer one approach over the other?
Presumably we could generate this automatically for standard units. We could also generate this automatically for species names, along with adding the appropriate EML version of coverage? Or would it be better to have a single coverage node with all the taxonomic coverage, etc.? (Basically a question of how other tools are using the coverage nodes. Since it sounds like they are just using them at the aggregate level to identify EML files containing certain coverage, rather than at the attribute level to give semantic meaning to columns, maybe there is no point in doing the latter? This issue was already touched upon in #9, though it remains undecided.)
What namespace do we put the attribute ids under? (both in the <rdf:Description rdf:about="http://some.namespace#myUniqueAttributeId1"> and in the describes nodes?)
Obviously we simply don't have ontological meaning for lots of terms. For a first pass, I imagine reml adding this annotation 'silently' on the above cases where we can probably automatically interpret (or infer from the schema) the semantic meaning. The harder challenge is thinking how a user might specify additional semantic annotations of elements without expert knowledge of the schema, the relevant ontology, and lots of hand-crafting. Maybe that's an impossible problem.

cboettig commented on June 13, 2024

@mbjones Just brainstorming about adding semantics here, since Ben wasn't enthusiastic about the semtools XSD route. Would love to hear what you think about this approach when you get back.

I've just added an example in which semantic metadata is included using RDFa. Building on Ben's suggestions, the additionalMetadata node looks like:

<additionalMetadata>
     <describes>1838</describes>
     <metadata>
      <subject about="http://some.namespace#1838" xmlns:o="http:/oboe-core#"
               xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:dcterms="http://purl.org/dc/terms/" 
               xmlns:prism="http://prismstandard.org/namespaces/1.2/basic/" 
               xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" 
               xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#" 
               xmlns:skos="http://www.w3.org/2004/02/skos/core#"
               xmlns:xsd="http://www.w3.org/2001/XMLSchema#" 
               xmlns:nex="http://www.nexml.org/2009">
          <meta property="o:entity" content="Air" datatype="xsd:string"/>
          <meta property="o:characteristic" content="Temperature" datatype="xsd:string"/>
          <meta property="o:unit" content="Celsius" datatype="xsd:string"/>
      </subject>
    </metadata>
  </additionalMetadata>

I believe this has a few advantages over the (potentially deprecated?) semtools XML annotations or RDF nodes:

A dumb parser (e.g. without any knowledge of the schema) could still extract the triples, in any desired format (RDF, turtle, etc). For instance, w3c's pyRdfa gives

@prefix o: <http:/oboe-core#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

<http://some.namespace#1838> o:characteristic "Temperature"^^xsd:string;
    o:entity "Air"^^xsd:string;
    o:unit "Celsius"^^xsd:string .

We have semantics embedded in the EML file in a language natural for the expression of semantic data.
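To illustrate the "dumb parser" point, here is a minimal Python sketch that reads the meta elements with a generic XML parser and prints Turtle-like statements. It is not a real RDFa processor such as pyRdfa: prefixes are left unexpanded, no @prefix lines are emitted, and the function names are hypothetical. The fragment is the subject node from the example above with the unused namespace declarations trimmed.

```python
import xml.etree.ElementTree as ET

# The <subject> node from the example above, with unused prefixes trimmed.
SUBJECT_XML = """<subject about="http://some.namespace#1838"
         xmlns:o="http:/oboe-core#"
         xmlns:xsd="http://www.w3.org/2001/XMLSchema#">
  <meta property="o:entity" content="Air" datatype="xsd:string"/>
  <meta property="o:characteristic" content="Temperature" datatype="xsd:string"/>
  <meta property="o:unit" content="Celsius" datatype="xsd:string"/>
</subject>"""


def meta_statements(xml_text):
    """Return (about, property, content, datatype) tuples, one per <meta>."""
    subject = ET.fromstring(xml_text)
    about = subject.get("about")
    return [
        (about, m.get("property"), m.get("content"), m.get("datatype"))
        for m in subject.findall("meta")
    ]


def as_turtle_lines(statements):
    """Format each statement as one Turtle-style line with prefixed names."""
    return [f'<{s}> {p} "{o}"^^{dt} .' for s, p, o, dt in statements]


for line in as_turtle_lines(meta_statements(SUBJECT_XML)):
    print(line)
```

No knowledge of the EML schema is needed: the code only relies on the about, property, content and datatype attributes.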

One concern is that the contents of our additionalMetadata node are not very human-readable in this form. Nonetheless, it is still reasonably easy to understand when we render the EML file as a "plain text" format by coercing it into yaml:

describes: '1838'
  metadata:
    subject:
      meta:
      - o:entity
      - Air
      - xsd:string
      meta:
      - o:characteristic
      - Temperature
      - xsd:string
      meta:
      - o:unit
      - Celsius
      - xsd:string
      .attrs: http://some.namespace#1838

Though perhaps "Air Temperature Celsius" would be the preferred human version. In any event, that can be added directly to the text. (Yeah, the example EML attribute at 1838 isn't actually about temperature, this is just a quick demo of what adding semantics might be about).

We still have the design consideration questions above to address. Semantics could be added automatically for Dublin Core terms (things like title, creators, publication date, etc.), and for cases like standard units or taxonomic names (at least when stated in coverage nodes, if not in attributes) that we can resolve from the schema logic.

In the long run, ideally additional functions will allow the user to add arbitrary annotations for EML elements through reml.

cboettig commented on June 13, 2024

@mbjones One quick related issue: for some reason, my example file does not validate against the online validator. I get the error:

> doc <- saveXML(xmlParse("rdfa_example.xml"))
> eml_validate(doc)
$`EML specific tests`
[1] "Error processing keyrefs: //additionalMetadata/describes : Error in xml document. This EML instance is invalid because referenced id 1838 does not exist in the given keys."

even though there is indeed a node with id="1838", so I'm not sure what I did wrong.

leinfelder commented on June 13, 2024

The EML parser was not actually configured to parse attribute@id values as valid references in the additionalMetadata/describes field. I've fixed this and will deploy it soon. Parsing errors aside, the sample EML+RDF looks pretty workable as it stands, but the more I think about it, you should probably just use a single additionalMetadata/describes block for all the RDF instead of little bits for each attribute. This will be easier for parsing the RDF in one go and, as you mention, the RDF explicitly references the attribute@id values anyway as the subject.
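For anyone who wants to check describes references locally before submitting a file, the keyref rule can be approximated in a few lines of Python. This is only a sketch of the idea, not how the EML parser or the online validator is implemented; the function name `dangling_describes` and the sample document are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed EML-like document: one valid and one dangling reference.
SAMPLE = """<eml>
  <dataTable id="table1">
    <attribute id="1838"/>
  </dataTable>
  <additionalMetadata>
    <describes>1838</describes>
    <metadata/>
  </additionalMetadata>
  <additionalMetadata>
    <describes>9999</describes>
    <metadata/>
  </additionalMetadata>
</eml>"""


def dangling_describes(eml_text):
    """Return describes values that match no id attribute in the document."""
    root = ET.fromstring(eml_text)
    # Collect every id in the document, including ids on attribute elements.
    known_ids = {el.get("id") for el in root.iter() if el.get("id") is not None}
    refs = [(d.text or "").strip() for d in root.iter("describes")]
    return [ref for ref in refs if ref not in known_ids]


print(dangling_describes(SAMPLE))
```

An empty list means every describes value points at some id in the file.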
