
Comments (4)

mbjones commented on June 3, 2024

This is great. For the other metadata, I would ensure that we attach all of the metadata from eml-attribute that is required to properly interpret the variables. In particular, I think that we should have:

  • name, label, and description
  • for numerical quantities, units, precision, number type, bounds
  • for categorical quantities, definitions of enumerated codes
  • for datetime values, ISO 8601 format string and precision

I think we should take DataONE's concept of a 'DataPackage' that can contain multiple data.frame (and other) objects and extend it to be the container for the high-level metadata about the package (creator, contact, title, abstract, methods, etc). That way, this information is outside of the data.frame objects, but is still readily accessible. We should talk some more about the design of this and the dataone library to make them as synergistic as possible and eliminate overlap where possible.

from eml.

cboettig commented on June 3, 2024

@mbjones Thanks!

Yes, I agree entirely about getting more of the eml-attribute level metadata in, and I struggle a bit to find an approach that is both concise/non-redundant and complete. I'd love input on proposed function calls / API structure through which the user would specify this information. Here are a few sticking points I've hit, particularly with numeric and dateTime.

  • Numerical quantities: On one hand, we want to leverage R's native types -- e.g. numeric vs. integer -- rather than providing an alternative and potentially inconsistent way to declare type. On the other hand, these don't map trivially onto EML, in which numeric measurement scales are ratio or interval and number types can be natural, whole, integer, or real. What do we do here?
  • Datetime: R has several datetime classes: base R provides Date, along with POSIXlt and POSIXct, not to mention packages like chron. None of these classes has a concept of precision, which is particularly frustrating if you want a column in, e.g., YYYY format. Making it a Date with a given format string, as.Date("2012", "%Y"), arbitrarily adds today's month and day. We could advise users to define all dates as "character" class instead (most may already do this...), but this goes against the "use native types" objective.
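To make the numeric mapping question concrete, here is a minimal sketch of one possible heuristic that guesses an EML number type from the values in a column. The function name `guess_number_type` is hypothetical and not part of reml; inferring the type from values rather than from the R class is an assumption, and the ratio vs. interval distinction cannot be inferred this way at all, so it would still have to come from the user.

```r
# Hypothetical helper (not part of reml): guess an EML numberType
# ("natural", "whole", "integer", "real") from the values of a numeric vector.
guess_number_type <- function(x) {
  stopifnot(is.numeric(x))
  vals <- x[!is.na(x)]
  if (any(vals != floor(vals))) return("real")     # any fractional value
  if (any(vals < 0))            return("integer")  # negative whole numbers
  if (any(vals == 0))           return("whole")    # non-negative, includes zero
  "natural"                                        # strictly positive counts
}

guess_number_type(c(1.5, 2))    # fractional values present
guess_number_type(c(-1L, 2L))   # negative values present
guess_number_type(c(0, 3))      # zero present, no negatives
guess_number_type(c(1, 2, 3))   # strictly positive
```

A value-based guess like this can disagree with the R class (a numeric column holding only counts would be reported as natural), which is exactly the inconsistency discussed above, so it is probably better treated as a default the user can override.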

Currently character types will be assigned nominal/nonEnumeratedDomain/textDomain. Do we provide a mechanism to allow a column to be "character" class while still being encoded as a "dateTime"? Relatedly, I recall that dateTime in EML can also have much fuzzier formats -- presumably things like geological epochs, etc. I assume these would appear as (ordered?) factors in the data.frame, "Holocene", etc., but I have no idea how to handle this case such that we write out a dateTime node and not a nominal under measurementScale.

Currently reml expects dateTime to be in one of those three classes (Date, POSIXlt, or POSIXct), and for the user to just give the format string in the unit.defs list. When writing to csv I use format(date, format_string) to write a character string of the correct structure. An optional precision could be added to the slot in unit.defs.
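A short base R sketch of the behaviour described above: a year-only string parsed as a Date acquires a month and day, and formatting with the declared format string drops them again on output. The variable names are illustrative only, and the strptime-style string "%Y" is R's notation, not the EML format string (e.g. YYYY), so a translation between the two is assumed to happen elsewhere.

```r
# A year-only string parsed as Date silently picks up the current month and day.
d <- as.Date("2012", "%Y")
print(d)                 # full date: 2012 plus today's month and day

# Only the year is meaningful, so writing with the declared format string
# recovers the intended precision.
format(d, "%Y")

# The same call works column-wise when writing out to csv.
dates <- as.Date(c("2012-03-01", "2013-07-15"))
year_strings <- format(dates, "%Y")
year_strings
```

The Date object itself still carries the spurious month and day in memory, so this only fixes what is written to file; it does not give the R class a notion of precision.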

  • Categorical variables: The current implementation handles this better: the user gives the code and definition as a named string, as already illustrated above. R has the notion of factors and ordered factors, which map cleanly onto nominal/enumeratedDomain and ordinal/enumeratedDomain (I see ordinal can have a textDomain instead in the schema? Not sure what that would mean...). The only non-ideal part of this approach is that it is slightly redundant: the user must write out the codes explicitly (though the wizard helper function will display each code and prompt for the definition if not given). I should probably at least add an automated check that the codes match the factor levels...
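The automated check mentioned above could look roughly like the following sketch. The function name `check_codes`, its return structure, and the example codes are all hypothetical; the only assumption carried over from the discussion is that definitions arrive as a named character vector whose names are the codes.

```r
# Hypothetical check (not part of reml): compare the codes the user defined
# against the levels actually present in the factor.
check_codes <- function(x, definitions) {
  stopifnot(is.factor(x), !is.null(names(definitions)))
  missing_defs <- setdiff(levels(x), names(definitions))  # levels without a definition
  unused_codes <- setdiff(names(definitions), levels(x))  # definitions without a level
  list(
    ok = length(missing_defs) == 0 && length(unused_codes) == 0,
    scale = if (is.ordered(x)) "ordinal" else "nominal",
    missing_definitions = missing_defs,
    unused_codes = unused_codes
  )
}

# Illustrative data only.
site <- factor(c("A", "B", "A"))
defs <- c(A = "definition of code A", B = "definition of code B")

full    <- check_codes(site, defs)        # every level has a definition
partial <- check_codes(site, defs["A"])   # level "B" lacks a definition
```

Returning the mismatches rather than stopping immediately would let a wizard-style helper prompt only for the definitions that are actually missing.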

The DataPackage idea is interesting -- how is it different from an EML dataset? (I think a dataset can have multiple dataTables? I may still need to add support for that in reml...) Of course, we already have R classes corresponding to the EML concepts of dataset, dataTable, etc. It would be relatively trivial to make them inherit data.frame methods. I don't know of a native R concept for a collection of data.frames though (I mean, other than a list of data.frames). We could simply add methods such that dataset[[1]] would return the first dataTable (or an equivalent creature, like a raster, etc.), which would inherit the appropriate R type (e.g. data.frame)...
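As a rough illustration of the dataset[[1]] idea, the sketch below defines a stand-in S4 class holding package-level metadata plus a list of tables, with a `[[` method that returns the i-th table. The class name `DatasetSketch` and its slots are hypothetical and are not the classes reml or dataone actually define.

```r
library(methods)

# Hypothetical container: high-level metadata plus a list of data objects.
setClass("DatasetSketch",
         representation(title = "character",
                        creator = "character",
                        tables = "list"))

# dataset[[i]] returns the i-th table in its native R type (e.g. data.frame).
setMethod("[[", "DatasetSketch", function(x, i, j, ...) x@tables[[i]])

# length() reports how many tables the container holds.
setMethod("length", "DatasetSketch", function(x) length(x@tables))

ds <- new("DatasetSketch",
          title = "Example dataset",
          creator = "Example creator",
          tables = list(data.frame(a = 1:3),
                        data.frame(b = c("x", "y"))))

first <- ds[[1]]   # an ordinary data.frame, usable with all data.frame methods
```

Keeping title, creator and similar fields in slots of the container, rather than on each data.frame, matches the suggestion that package-level metadata live outside the individual tables while remaining readily accessible.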


cboettig commented on June 3, 2024

We now have support for the unit metadata (attributes) Matt highlights:

  • name, label, and description
  • for numerical quantities, units, precision, number type, bounds
  • for categorical quantities, definitions of enumerated codes
  • for datetime values, ISO 8601 format string and precision

See Ex 3


mbjones commented on June 3, 2024

Fantastic!

