
EML


EML is a widely used metadata standard in the ecological and environmental sciences. We strongly recommend that interested users visit the EML Homepage for an introduction and thorough documentation of the standard. The scientific article The New Bioinformatics: Integrating Ecological Data from the Gene to the Biosphere (Jones et al. 2006) also provides an excellent introduction to the role EML plays in building metadata-driven data repositories for highly heterogeneous data that cannot easily be reduced to a traditional vertically integrated database. At this time, the EML R package supports serializing and parsing all low-level EML concepts, but it still assumes some familiarity with the EML standard, particularly for users seeking to create their own EML files. We hope future development will add more higher-level functions that make such familiarity less essential.

Notes on the EML v2.0 Release

EML v2.0 is a complete rewrite that aims to provide a drop-in replacement for the higher-level functions of the existing EML package while also adding new functionality. This version uses only simple and familiar list structures (S3 classes) instead of the more cumbersome S4 classes of the original EML. The higher-level functions are identical, but the switch makes it easier for most users and developers to work with EML objects and to write their own functions for creating and manipulating them. Under the hood, EML relies on the emld package, which uses a Linked Data representation for EML. It is this approach that lets us combine the simplicity of lists with the specificity required by the XML schema.

This revision also supports the recently released EML 2.2.0 specification.

Creating EML

library(EML)

A minimal valid EML document:

me <- list(individualName = list(givenName = "Carl", surName = "Boettiger"))
my_eml <- list(dataset = list(
              title = "A Minimal Valid EML Dataset",
              creator = me,
              contact = me)
            )


write_eml(my_eml, "ex.xml")
#> NULL
eml_validate("ex.xml")
#> [1] TRUE
#> attr(,"errors")
#> character(0)
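
Having written the file, we can read it back into R as well. A minimal round-trip sketch; read_eml() is the package's parser and returns the same nested list structure we started from:

```r
# Parse the EML file back into a nested list structure
eml <- read_eml("ex.xml")
eml$dataset$title
```

Because the parsed object is an ordinary list, elements can be inspected or modified with standard `$` and `[[` indexing before writing the document out again.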

A Richer Example

Here we show the creation of a relatively complete EML document using EML. This closely parallels the function calls shown in the original EML R-package vignette.

set_* methods

The original EML R package defines a set of higher-level set_* methods to facilitate the creation of complex metadata structures. EML provides these same methods, taking the same arguments, for set_coverage, set_attributes, set_physical, set_methods, and set_TextType, as illustrated here:

Coverage metadata

geographicDescription <- "Harvard Forest Greenhouse, Tom Swamp Tract (Harvard Forest)"
coverage <- 
  set_coverage(begin = '2012-06-01', end = '2013-12-31',
               sci_names = "Sarracenia purpurea",
               geographicDescription = geographicDescription,
               west = -122.44, east = -117.15, 
               north = 37.38, south = 30.00,
               altitudeMin = 160, altitudeMaximum = 330,
               altitudeUnits = "meter")

Reading in text from Word and Markdown

We read in detailed methods written in a Word document. This uses EML’s docbook-style markup to preserve the formatting of paragraphs, lists, titles, and so forth. (This is a drop-in replacement for the original EML set_methods().)

methods_file <- system.file("examples/hf205-methods.docx", package = "EML")
methods <- set_methods(methods_file)

We can also read in text that uses Markdown for markup elements:

abstract_file <-  system.file("examples/hf205-abstract.md", package = "EML")
abstract <- set_TextType(abstract_file)

Attribute Metadata from Tables

Attribute metadata can be verbose, and is often defined in separate tables (e.g. separate Excel sheets or .csv files). Here we use attribute metadata and factor definitions as given from .csv files.

attributes <- read.table(system.file("extdata/hf205_attributes.csv", package = "EML"))
factors <- read.table(system.file("extdata/hf205_factors.csv", package = "EML"))
attributeList <- 
  set_attributes(attributes, 
                 factors, 
                 col_classes = c("character", 
                                 "Date",
                                 "Date",
                                 "Date",
                                 "factor",
                                 "factor",
                                 "factor",
                                 "numeric"))

Data file format

Though the physical metadata specifying the file format is extremely flexible, the set_physical function provides defaults appropriate for .csv files. DEVELOPER NOTE: ideally the set_physical method should guess the appropriate metadata structure based on the file extension.

physical <- set_physical("hf205-01-TPexp1.csv")

Generic construction

In the original EML R package, objects for which there is no set_* method are constructed using the new() S4 constructor, which provided an easy way to see the list of available slots. In EML v2.0, all objects are just lists, so there is no need for special methods: we can create any object directly by nesting lists with names corresponding to the EML elements. Here we create a keywordSet from scratch:

keywordSet <- list(
    list(
        keywordThesaurus = "LTER controlled vocabulary",
        keyword = list("bacteria",
                    "carnivorous plants",
                    "genetics",
                    "thresholds")
        ),
    list(
        keywordThesaurus = "LTER core area",
        keyword =  list("populations", "inorganic nutrients", "disturbance")
        ),
    list(
        keywordThesaurus = "HFR default",
        keyword = list("Harvard Forest", "HFR", "LTER", "USA")
        ))

Of course, this assumes that we already know which terms are permitted in an EML keywordSet! Not so useful for novices. We can preview the elements any object can take using emld::template(), but that involves a two-part workflow. Instead, EML provides generic constructor methods for all objects.
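
That two-part workflow looks roughly like this; a sketch, assuming emld::template() accepts a class name and returns a skeleton list whose empty slots you then fill in by hand:

```r
# Step 1: generate a skeleton showing the slots a keywordSet can take
kw <- emld::template("keywordSet")
names(kw)

# Step 2: fill in the slots manually and splice the result into the document
```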

Constructor methods

For instance, the function eml$creator() has arguments corresponding to each possible slot of a creator. This means we can rely on tab completion (and/or autocomplete previews in RStudio) to see the possible options. eml$ functions exist for all complex types; if no eml$ constructor exists for an element (e.g. there is no eml$givenName), the field simply takes a string.

Creating parties (creator, contact, publisher)

aaron <- eml$creator(
  individualName = eml$individualName(
    givenName = "Aaron", 
    surName = "Ellison"),
  electronicMailAddress = "[email protected]")
HF_address <- eml$address(
                  deliveryPoint = "324 North Main Street",
                  city = "Petersham",
                  administrativeArea = "MA",
                  postalCode = "01366",
                  country = "USA")
publisher <- eml$publisher(
                 organizationName = "Harvard Forest",
                 address = HF_address)
contact <- 
  list(
    individualName = aaron$individualName,
    electronicMailAddress = aaron$electronicMailAddress,
    address = HF_address,
    organizationName = "Harvard Forest",
    phone = "000-000-0000")

Putting it all together

my_eml <- eml$eml(
           packageId = uuid::UUIDgenerate(),  
           system = "uuid",
           dataset = eml$dataset(
               title = "Thresholds and Tipping Points in a Sarracenia",
               creator = aaron,
               pubDate = "2012",
               intellectualRights = "http://www.lternet.edu/data/netpolicy.html.",
               abstract = abstract,
               keywordSet = keywordSet,
               coverage = coverage,
               contact = contact,
               methods = methods,
               dataTable = eml$dataTable(
                 entityName = "hf205-01-TPexp1.csv",
                 entityDescription = "tipping point experiment 1",
                 physical = physical,
                 attributeList = attributeList)
               ))

Serialize and validate

We can also validate first and then serialize:

eml_validate(my_eml)
#> [1] TRUE
#> attr(,"errors")
#> character(0)
write_eml(my_eml, "eml.xml")
#> NULL

Setting the version

EML will use the latest EML specification by default. To switch to a different version, use emld::eml_version()

emld::eml_version("eml-2.1.1")
#> [1] "eml-2.1.1"

Switch back to the 2.2.0 release:

emld::eml_version("eml-2.2.0")
#> [1] "eml-2.2.0"

eml's People

Contributors

amoeba, annakrystalli, atn38, brunj7, cboettig, clnsmth, cpfaff, dmullen17, duncantl, emhart, ianbrunjes, ivanhanigan, jeanetteclark, jeroen, kant, karthik, laurenwalker, liuanna, maelle, mbjones, mmfink, rekyt


eml's Issues

rename files that only differ by case

Immediately after cloning the reml repo on a Mac, git status shows the man/eml_datatable.Rd file as modified, without any editing.

This seems to be due to the existence of files that differ only by case -- particularly man/eml_datatable.Rd and man/eml_dataTable.Rd. Removing the duplicate file should fix the problem, but at the moment I am unclear as to which is the right one to remove, or if in fact both are needed.

Work with local files instead of online files

eml_read and eml_write both assume online endpoints for all files.

  • Need to see how to provide option to use/support local destinations at custom file paths (perhaps just physical filename of EML file and datafiles).
  • Add option/toggle on eml_read to attempt local access (with specified path?)
  • eml_write needs to include its own filename in the EML(?)
  • eml_write probably shouldn't write a <distribution><online> node. That should be added by eml_publish.

Workflow to ensure users write valid EML?

In our early discussions about validation, we agreed it was really just part of the developer testing suite. For a user consuming EML, having the software complain that a file isn't valid isn't very helpful; it's best just to give it our best shot anyway. For writing EML, since the output is programmatically generated we can assure it is valid ... or can we?

The S4 R objects we use mimic the schema, but they don't enforce required vs optional slots (in fact, all slots are always 'present' in the S4 objects, so an operational definition of "empty" is that a slot holds an empty S4 object (recursively) or a length-0 character/numeric/logical vector). A user can create an S4 object and pass it into their EML file (a useful and powerful option, particularly for reusing elements), but if that object is missing some required elements, this will create invalid EML.

We can avoid this in several ways:

  • We could write a validation check as part of each S4 method. Rather tedious, this also seems redundant with the schema validation check. On the other hand, this approach provides a nice warning earlier to the advanced user.
  • We could instead write constructor functions for each object. Also tedious, but allows clear indication of optional and required parameters and can be easier to use than the new constructor. This is the strategy we employ so far, but we still permit pre-built S4 nodes to be passed to some constructors to facilitate reuse (but bypassing the protection regarding required elements).
  • Run the validator by default on calls to write_eml (would require an internet connection or packaging the schema). If we check only by validating the final EML file, the user may be at some trouble to find just what they need to change. On the other hand, it is perhaps the surest way to guarantee validity.
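
The third option could be sketched as a thin wrapper; safe_write_eml is a hypothetical name for illustration, while write_eml() and eml_validate() are the package's existing functions:

```r
# Hypothetical wrapper: write the file, then validate it immediately
safe_write_eml <- function(eml, file) {
  write_eml(eml, file)
  ok <- eml_validate(file)
  if (!isTRUE(ok))
    warning("Written EML is not schema-valid: ",
            paste(attr(ok, "errors"), collapse = "; "))
  invisible(ok)
}
```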

Use reml to parse EML metadata coming from the dataone library

Currently, in the dataone package, we can download EML and parse it with an XML parser to extract some metadata for use in our script. For example:

library(dataone)
cli <- D1Client()

# Download and parse some EML metadata
obj4 <- getMember(pkg, "doi:10.5063/AA/nceas.982.3")
getFormatId(obj4)
metadata <- xmlParse(getData(obj4))

# Extract and print a list of all attribute names in the metadata
attList <- sapply(getNodeSet(metadata, "//attributeName"), xmlValue)
attList

This seems like it would be better handled by handing the EML document off to the reml package for parsing, which might provide some nicer accessor methods, plus the ability to insert new metadata or change existing fields. I've been thinking about how it's best to do this. @cboettig Should the dataone package load reml to do its parsing, or should the reml package load dataone to handle its downloading? I'm thinking of this in terms of other metadata standards as well, such as FGDC or ISO 19115, and wanting to support those through the dataone library as well. Thoughts?

Understanding inheritance in the schema

@mbjones Can you clarify whether any of the elements inherit some of their definition from existing definitions? e.g. I think that's what is going on with entityGroup and referenceGroup. Wondering if there is also a base class, inherited by most everything, that defines the id attribute? Or is that just manually added to each definition where appropriate?

Parse and Validate EML against schema

  • Read in an EML file into R.
  • Validate against the schema with appropriate warnings.
  • Extract R objects for data types (see #6 )
  • Potentially generate the appropriate class and function types from XMLSchema

function to create an eml_software node

An EML software node needs:

  • version (text)
  • licenseURL or license (text)
  • implementation

Optionally,

  • dependencies

<implementation> minimally needs a url, via

  • a <distribution> node, which we also use elsewhere (e.g. when publishing data to a url).

For R packages, we can extract all the necessary information by providing the R package name. A wrapper function can use eml_software and the package name to create this (via the packageDescription function).

Perhaps there is no need to provide the optional R software entries, (e.g. all the optional fields under implementation) since such data is already programmatically available knowing the package distribution URL...
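
A sketch of what such a wrapper might assemble; the field names follow the EML <software> tree described above, but eml_software itself does not exist yet, so the list below is illustrative only:

```r
# Illustrative only: pull version/license/URL from an installed
# package's DESCRIPTION via utils::packageDescription()
desc <- utils::packageDescription("EML")
software <- list(
  title   = desc$Package,
  version = desc$Version,
  license = desc$License,
  implementation = list(
    distribution = list(online = list(url = desc$URL))
  )
)
```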

Add a function to generate plain-text summaries of EML metadata

A couple of options on how to do this:

  • Provide in markdown formatting?
  • More crudely: convert directly to YAML instead? as.yaml(xmlToList(eml.xml))

Ideal rendering would drop some of the less essential markup (e.g. stuff intended more for machines than people -- unit definitions, numeric types, etc).

Add function to publish EML data through figshare

Involves several steps:

  • Update the EML publisher information as figshare
  • Update the EML distributed online URL location with the figshare link
  • Update EML link of csv and other data files to their download URLs from figshare
  • package EML and all data files (CSV, images, code, etc) as figshare file object. (Will mean using fs_create to establish the figshare link metadata first. Files will need to be uploaded to get their URLs, then the EML file will need to be modified with those URLs.)
  • EML keywordLists for both the figshare tags and figshare categories (with categories reflecting figshare limited thesaurus).
  • Likewise figshare metadata should all come from the EML file.

Provide prompts / wizard to help user enter metadata if not provided

For a first-time user of the reml package, it might be easier to simply call eml_write on an R data.frame object and have the function coach them through what minimal metadata they must add, prompting them for inputs along the way ("Define column 'X1':").

No doubt any regular user would find this frustrating to repeat each time, and would rather provide a list of metadata ahead-of-time, possibly generated programmatically (e.g. pulled from an existing file in the same format) or specified in a configuration file (e.g. no need to ask me my name every time).

Install fails due to bug while compiling vignette

install_github("reml", "ropensci")

Quitting from lines 62-63 (vingette.Rmd) 
Error: processing vignette 'vingette.Rmd' failed with diagnostics:
argument ".contact" is missing, with no default
Execution halted
Error: Command failed (1)

Stylistic issues

The package makes frequent use of camelCase. This is to provide consistent mapping to EML nodes, which are all defined in camelCase, e.g. <dataset> but <dataTable>. Deal with it.

Package functions use = instead of <- for assignment. This should be fixed. The XML package has lots of nice syntactic sugar, and this makes it a bit more fluid to move the definitions around.

Need to stick with consistent and transparent use of addChildren vs. parent =, etc.

RJSONIO not parsing correctly using fromJSON

I've posted this question on Stack Overflow, but I wonder if I'm better coming straight here.

Here's the discussion so far on SO.

Ok, I'm trying to convert the following JSON data into an R data frame.

For some reason fromJSON in the RJSONIO package only reads up to about character 380 and then it stops converting the JSON properly.

Here is the JSON:

"{\"metricDate\":\"2013-05-01\",\"pageCountTotal\":\"33682\",\"landCountTotal\":\"11838\",\"newLandCountTotal\":\"8023\",\"returnLandCountTotal\":\"3815\",\"spiderCountTotal\":\"84\",\"goalCountTotal\":\"177.000000\",\"callGoalCountTotal\":\"177.000000\",\"callCountTotal\":\"237.000000\",\"onlineGoalCountTotal\":\"0.000000\",\"conversionPerc\":\"1.50\",\"callConversionPerc\":\"74.68\"}\n{\"metricDate\":\"2013-05-02\",\"pageCountTotal\":\"32622\",\"landCountTotal\":\"11626\",\"newLandCountTotal\":\"7945\",\"returnLandCountTotal\":\"3681\",\"spiderCountTotal\":\"58\",\"goalCountTotal\":\"210.000000\",\"callGoalCountTotal\":\"210.000000\",\"callCountTotal\":\"297.000000\",\"onlineGoalCountTotal\":\"0.000000\",\"conversionPerc\":\"1.81\",\"callConversionPerc\":\"70.71\"}\n{\"metricDate\":\"2013-05-03\",\"pageCountTotal\":\"28467\",\"landCountTotal\":\"11102\",\"newLandCountTotal\":\"7786\",\"returnLandCountTotal\":\"3316\",\"spiderCountTotal\":\"56\",\"goalCountTotal\":\"186.000000\",\"callGoalCountTotal\":\"186.000000\",\"callCountTotal\":\"261.000000\",\"onlineGoalCountTotal\":\"0.000000\",\"conversionPerc\":\"1.68\",\"callConversionPerc\":\"71.26\"}\n{\"metricDate\":\"2013-05-04\",\"pageCountTotal\":\"20884\",\"landCountTotal\":\"9031\",\"newLandCountTotal\":\"6670\",\"returnLandCountTotal\":\"2361\",\"spiderCountTotal\":\"51\",\"goalCountTotal\":\"7.000000\",\"callGoalCountTotal\":\"7.000000\",\"callCountTotal\":\"44.000000\",\"onlineGoalCountTotal\":\"0.000000\",\"conversionPerc\":\"0.08\",\"callConversionPerc\":\"15.91\"}\n{\"metricDate\":\"2013-05-05\",\"pageCountTotal\":\"20481\",\"landCountTotal\":\"8782\",\"newLandCountTotal\":\"6390\",\"returnLandCountTotal\":\"2392\",\"spiderCountTotal\":\"58\",\"goalCountTotal\":\"1.000000\",\"callGoalCountTotal\":\"1.000000\",\"callCountTotal\":\"8.000000\",\"onlineGoalCountTotal\":\"0.000000\",\"conversionPerc\":\"0.01\",\"callConversionPerc\":\"12.50\"}\n{\"metricDate\":\"2013-05-06\",\"pageCountTotal\":\"25175\",\"landC
ountTotal\":\"10019\",\"newLandCountTotal\":\"7082\",\"returnLandCountTotal\":\"2937\",\"spiderCountTotal\":\"62\",\"goalCountTotal\":\"24.000000\",\"callGoalCountTotal\":\"24.000000\",\"callCountTotal\":\"47.000000\",\"onlineGoalCountTotal\":\"0.000000\",\"conversionPerc\":\"0.24\",\"callConversionPerc\":\"51.06\"}\n{\"metricDate\":\"2013-05-07\",\"pageCountTotal\":\"35892\",\"landCountTotal\":\"12615\",\"newLandCountTotal\":\"8391\",\"returnLandCountTotal\":\"4224\",\"spiderCountTotal\":\"62\",\"goalCountTotal\":\"239.000000\",\"callGoalCountTotal\":\"239.000000\",\"callCountTotal\":\"321.000000\",\"onlineGoalCountTotal\":\"0.000000\",\"conversionPerc\":\"1.89\",\"callConversionPerc\":\"74.45\"}\n{\"metricDate\":\"2013-05-08\",\"pageCountTotal\":\"34106\",\"landCountTotal\":\"12391\",\"newLandCountTotal\":\"8389\",\"returnLandCountTotal\":\"4002\",\"spiderCountTotal\":\"90\",\"goalCountTotal\":\"221.000000\",\"callGoalCountTotal\":\"221.000000\",\"callCountTotal\":\"295.000000\",\"onlineGoalCountTotal\":\"0.000000\",\"conversionPerc\":\"1.78\",\"callConversionPerc\":\"74.92\"}\n{\"metricDate\":\"2013-05-09\",\"pageCountTotal\":\"32721\",\"landCountTotal\":\"12447\",\"newLandCountTotal\":\"8541\",\"returnLandCountTotal\":\"3906\",\"spiderCountTotal\":\"54\",\"goalCountTotal\":\"207.000000\",\"callGoalCountTotal\":\"207.000000\",\"callCountTotal\":\"280.000000\",\"onlineGoalCountTotal\":\"0.000000\",\"conversionPerc\":\"1.66\",\"callConversionPerc\":\"73.93\"}\n{\"metricDate\":\"2013-05-10\",\"pageCountTotal\":\"29724\",\"landCountTotal\":\"11616\",\"newLandCountTotal\":\"8063\",\"returnLandCountTotal\":\"3553\",\"spiderCountTotal\":\"139\",\"goalCountTotal\":\"207.000000\",\"callGoalCountTotal\":\"207.000000\",\"callCountTotal\":\"301.000000\",\"onlineGoalCountTotal\":\"0.000000\",\"conversionPerc\":\"1.78\",\"callConversionPerc\":\"68.77\"}\n{\"metricDate\":\"2013-05-11\",\"pageCountTotal\":\"22061\",\"landCountTotal\":\"9660\",\"newLandCountTotal\":\"6971\",\"ret
urnLandCountTotal\":\"2689\",\"spiderCountTotal\":\"52\",\"goalCountTotal\":\"3.000000\",\"callGoalCountTotal\":\"3.000000\",\"callCountTotal\":\"40.000000\",\"onlineGoalCountTotal\":\"0.000000\",\"conversionPerc\":\"0.03\",\"callConversionPerc\":\"7.50\"}\n{\"metricDate\":\"2013-05-12\",\"pageCountTotal\":\"23341\",\"landCountTotal\":\"9935\",\"newLandCountTotal\":\"6960\",\"returnLandCountTotal\":\"2975\",\"spiderCountTotal\":\"45\",\"goalCountTotal\":\"0.000000\",\"callGoalCountTotal\":\"0.000000\",\"callCountTotal\":\"12.000000\",\"onlineGoalCountTotal\":\"0.000000\",\"conversionPerc\":\"0.00\",\"callConversionPerc\":\"0.00\"}\n{\"metricDate\":\"2013-05-13\",\"pageCountTotal\":\"36565\",\"landCountTotal\":\"13583\",\"newLandCountTotal\":\"9277\",\"returnLandCountTotal\":\"4306\",\"spiderCountTotal\":\"69\",\"goalCountTotal\":\"246.000000\",\"callGoalCountTotal\":\"246.000000\",\"callCountTotal\":\"324.000000\",\"onlineGoalCountTotal\":\"0.000000\",\"conversionPerc\":\"1.81\",\"callConversionPerc\":\"75.93\"}\n{\"metricDate\":\"2013-05-14\",\"pageCountTotal\":\"35260\",\"landCountTotal\":\"13797\",\"newLandCountTotal\":\"9375\",\"returnLandCountTotal\":\"4422\",\"spiderCountTotal\":\"59\",\"goalCountTotal\":\"212.000000\",\"callGoalCountTotal\":\"212.000000\",\"callCountTotal\":\"283.000000\",\"onlineGoalCountTotal\":\"0.000000\",\"conversionPerc\":\"1.54\",\"callConversionPerc\":\"74.91\"}\n{\"metricDate\":\"2013-05-15\",\"pageCountTotal\":\"35836\",\"landCountTotal\":\"13792\",\"newLandCountTotal\":\"9532\",\"returnLandCountTotal\":\"4260\",\"spiderCountTotal\":\"94\",\"goalCountTotal\":\"187.000000\",\"callGoalCountTotal\":\"187.000000\",\"callCountTotal\":\"258.000000\",\"onlineGoalCountTotal\":\"0.000000\",\"conversionPerc\":\"1.36\",\"callConversionPerc\":\"72.48\"}\n{\"metricDate\":\"2013-05-16\",\"pageCountTotal\":\"33136\",\"landCountTotal\":\"12821\",\"newLandCountTotal\":\"8755\",\"returnLandCountTotal\":\"4066\",\"spiderCountTotal\":\"65\",\"goalCount
Total\":\"192.000000\",\"callGoalCountTotal\":\"192.000000\",\"callCountTotal\":\"260.000000\",\"onlineGoalCountTotal\":\"0.000000\",\"conversionPerc\":\"1.50\",\"callConversionPerc\":\"73.85\"}\n{\"metricDate\":\"2013-05-17\",\"pageCountTotal\":\"29564\",\"landCountTotal\":\"11721\",\"newLandCountTotal\":\"8191\",\"returnLandCountTotal\":\"3530\",\"spiderCountTotal\":\"213\",\"goalCountTotal\":\"166.000000\",\"callGoalCountTotal\":\"166.000000\",\"callCountTotal\":\"222.000000\",\"onlineGoalCountTotal\":\"0.000000\",\"conversionPerc\":\"1.42\",\"callConversionPerc\":\"74.77\"}\n{\"metricDate\":\"2013-05-18\",\"pageCountTotal\":\"23686\",\"landCountTotal\":\"9916\",\"newLandCountTotal\":\"7335\",\"returnLandCountTotal\":\"2581\",\"spiderCountTotal\":\"56\",\"goalCountTotal\":\"5.000000\",\"callGoalCountTotal\":\"5.000000\",\"callCountTotal\":\"34.000000\",\"onlineGoalCountTotal\":\"0.000000\",\"conversionPerc\":\"0.05\",\"callConversionPerc\":\"14.71\"}\n{\"metricDate\":\"2013-05-19\",\"pageCountTotal\":\"23528\",\"landCountTotal\":\"9952\",\"newLandCountTotal\":\"7184\",\"returnLandCountTotal\":\"2768\",\"spiderCountTotal\":\"57\",\"goalCountTotal\":\"1.000000\",\"callGoalCountTotal\":\"1.000000\",\"callCountTotal\":\"14.000000\",\"onlineGoalCountTotal\":\"0.000000\",\"conversionPerc\":\"0.01\",\"callConversionPerc\":\"7.14\"}\n{\"metricDate\":\"2013-05-20\",\"pageCountTotal\":\"37391\",\"landCountTotal\":\"13488\",\"newLandCountTotal\":\"9024\",\"returnLandCountTotal\":\"4464\",\"spiderCountTotal\":\"69\",\"goalCountTotal\":\"227.000000\",\"callGoalCountTotal\":\"227.000000\",\"callCountTotal\":\"291.000000\",\"onlineGoalCountTotal\":\"0.000000\",\"conversionPerc\":\"1.68\",\"callConversionPerc\":\"78.01\"}\n{\"metricDate\":\"2013-05-21\",\"pageCountTotal\":\"36299\",\"landCountTotal\":\"13174\",\"newLandCountTotal\":\"8817\",\"returnLandCountTotal\":\"4357\",\"spiderCountTotal\":\"77\",\"goalCountTotal\":\"164.000000\",\"callGoalCountTotal\":\"164.000000\",\"call
CountTotal\":\"221.000000\",\"onlineGoalCountTotal\":\"0.000000\",\"conversionPerc\":\"1.24\",\"callConversionPerc\":\"74.21\"}\n{\"metricDate\":\"2013-05-22\",\"pageCountTotal\":\"34201\",\"landCountTotal\":\"12433\",\"newLandCountTotal\":\"8388\",\"returnLandCountTotal\":\"4045\",\"spiderCountTotal\":\"76\",\"goalCountTotal\":\"195.000000\",\"callGoalCountTotal\":\"195.000000\",\"callCountTotal\":\"262.000000\",\"onlineGoalCountTotal\":\"0.000000\",\"conversionPerc\":\"1.57\",\"callConversionPerc\":\"74.43\"}\n{\"metricDate\":\"2013-05-23\",\"pageCountTotal\":\"32951\",\"landCountTotal\":\"11611\",\"newLandCountTotal\":\"7757\",\"returnLandCountTotal\":\"3854\",\"spiderCountTotal\":\"68\",\"goalCountTotal\":\"167.000000\",\"callGoalCountTotal\":\"167.000000\",\"callCountTotal\":\"231.000000\",\"onlineGoalCountTotal\":\"0.000000\",\"conversionPerc\":\"1.44\",\"callConversionPerc\":\"72.29\"}\n{\"metricDate\":\"2013-05-24\",\"pageCountTotal\":\"28967\",\"landCountTotal\":\"10821\",\"newLandCountTotal\":\"7396\",\"returnLandCountTotal\":\"3425\",\"spiderCountTotal\":\"106\",\"goalCountTotal\":\"167.000000\",\"callGoalCountTotal\":\"167.000000\",\"callCountTotal\":\"203.000000\",\"onlineGoalCountTotal\":\"0.000000\",\"conversionPerc\":\"1.54\",\"callConversionPerc\":\"82.27\"}\n{\"metricDate\":\"2013-05-25\",\"pageCountTotal\":\"19741\",\"landCountTotal\":\"8393\",\"newLandCountTotal\":\"6168\",\"returnLandCountTotal\":\"2225\",\"spiderCountTotal\":\"78\",\"goalCountTotal\":\"0.000000\",\"callGoalCountTotal\":\"0.000000\",\"callCountTotal\":\"28.000000\",\"onlineGoalCountTotal\":\"0.000000\",\"conversionPerc\":\"0.00\",\"callConversionPerc\":\"0.00\"}\n{\"metricDate\":\"2013-05-26\",\"pageCountTotal\":\"19770\",\"landCountTotal\":\"8237\",\"newLandCountTotal\":\"6009\",\"returnLandCountTotal\":\"2228\",\"spiderCountTotal\":\"79\",\"goalCountTotal\":\"0.000000\",\"callGoalCountTotal\":\"0.000000\",\"callCountTotal\":\"8.000000\",\"onlineGoalCountTotal\":\"0.000000\",\"
conversionPerc\":\"0.00\",\"callConversionPerc\":\"0.00\"}\n{\"metricDate\":\"2013-05-27\",\"pageCountTotal\":\"26208\",\"landCountTotal\":\"9755\",\"newLandCountTotal\":\"6779\",\"returnLandCountTotal\":\"2976\",\"spiderCountTotal\":\"82\",\"goalCountTotal\":\"26.000000\",\"callGoalCountTotal\":\"26.000000\",\"callCountTotal\":\"40.000000\",\"onlineGoalCountTotal\":\"0.000000\",\"conversionPerc\":\"0.27\",\"callConversionPerc\":\"65.00\"}\n{\"metricDate\":\"2013-05-28\",\"pageCountTotal\":\"36980\",\"landCountTotal\":\"12463\",\"newLandCountTotal\":\"8226\",\"returnLandCountTotal\":\"4237\",\"spiderCountTotal\":\"132\",\"goalCountTotal\":\"208.000000\",\"callGoalCountTotal\":\"208.000000\",\"callCountTotal\":\"276.000000\",\"onlineGoalCountTotal\":\"0.000000\",\"conversionPerc\":\"1.67\",\"callConversionPerc\":\"75.36\"}\n{\"metricDate\":\"2013-05-29\",\"pageCountTotal\":\"34190\",\"landCountTotal\":\"12014\",\"newLandCountTotal\":\"8279\",\"returnLandCountTotal\":\"3735\",\"spiderCountTotal\":\"90\",\"goalCountTotal\":\"179.000000\",\"callGoalCountTotal\":\"179.000000\",\"callCountTotal\":\"235.000000\",\"onlineGoalCountTotal\":\"0.000000\",\"conversionPerc\":\"1.49\",\"callConversionPerc\":\"76.17\"}\n{\"metricDate\":\"2013-05-30\",\"pageCountTotal\":\"33867\",\"landCountTotal\":\"11965\",\"newLandCountTotal\":\"8231\",\"returnLandCountTotal\":\"3734\",\"spiderCountTotal\":\"63\",\"goalCountTotal\":\"160.000000\",\"callGoalCountTotal\":\"160.000000\",\"callCountTotal\":\"219.000000\",\"onlineGoalCountTotal\":\"0.000000\",\"conversionPerc\":\"1.34\",\"callConversionPerc\":\"73.06\"}\n{\"metricDate\":\"2013-05-31\",\"pageCountTotal\":\"27536\",\"landCountTotal\":\"10302\",\"newLandCountTotal\":\"7333\",\"returnLandCountTotal\":\"2969\",\"spiderCountTotal\":\"108\",\"goalCountTotal\":\"173.000000\",\"callGoalCountTotal\":\"173.000000\",\"callCountTotal\":\"226.000000\",\"onlineGoalCountTotal\":\"0.000000\",\"conversionPerc\":\"1.68\",\"callConversionPerc\":\"76.55\"
}\n\r\n"

and here is my R output

 metricDate 
"2013-05-01" 
pageCountTotal 
"33682" 
landCountTotal 
"11838" 
newLandCountTotal 
"8023" 
returnLandCountTotal 
"3815" 
spiderCountTotal 
"84" 
goalCountTotal 
"177.000000" 
callGoalCountTotal 
"177.000000" 
callCountTotal 
"237.000000" 
onlineGoalCountTotal 
"0.000000" 
conversionPerc 
"1.50" 
callConversionPerc


"74.68\"}{\"metricDate\":\"2013-05-02\",\"pageCountTotal\":\"32622\",\"landCountTotal\":\"11626\",\"newLandCountTotal\":\"7945\",\"returnLandCountTotal\":\"3681\",\"spiderCountTotal\":\"58\",\"goalCountTotal\":\"210.000000\",\"callGoalCountTotal\":\"210.000000\",\"callCountTotal\":\"297.000000\",\"onlineGoalCountTotal\":\"0.000000\",\"conversionPerc\":\"1.81\",\"callConversionPerc\":\"70.71\"}{\"metricDate\":\"2013-05-03\",\"pageCountTotal\":\"28467\",\"landCountTotal\":\"11102\",\"newLandCountTotal\":\"7786\",\"returnLandCountTotal\":\"3316\",\"spiderCountTotal\":\"56\",\"goalCountTotal\":\"186.000000\",\"callGoalCountTotal\":\"186.000000\",\"callCountTotal\":\"261.000000\",\"onlineGoalCountTotal\":\"0.000000\",\"conversionPerc\":\"1.68\",\"callConversionPerc\":\"71.26\"}{\"metricDate\":\"2013-05-04\",\"pageCountTotal\":\"20884\",\"landCountTotal\":\"9031\",\"newLandCountTotal\":\"6670\",\"returnLandCountTotal\":\"2361\",\"spiderCountTotal\":\"51\",\"goalCountTotal\":\"7.000000\",\"callGoalCountTotal\":\"7.000000\",\"callCountTotal\":\"44.000000\",\"onlineGoalCountTotal\":\"0.000000\",\"conversionPerc\":\"0.08\",\"callConversionPerc\":\"15.91\"}{\"metricDate\":\"2013-05-05\",\"pageCountTotal\":\"20481\",\"landCountTotal\":\"8782\",\"newLandCountTotal\":\"6390\",\"returnLandCountTotal\":\"2392\",\"spiderCountTotal\":\"58\",\"goalCountTotal\":\"1.000000\",\"callGoalCountTotal\":\"1.000000\",\"callCountTotal\":\"8.000000\",\"onlineGoalCountTotal\":\"0.000000\",\"conversionPerc\":\"0.01\",\"callConversionPerc\":\"12.50\"}{\"metricDate\":\"2013-05-06\",\"pageCountTotal\":\"25175\",\"landCountTotal\":\"10019\",\"newLandCountTotal\":\"7082\",\"returnLandCountTotal\":\"2937\",\"spiderCountTotal\":\"62\",\"goalCountTotal\":\"24.000000\",\"callGoalCountTotal\":\"24.000000\",\"callCountTotal\":\"47.000000\",\"onlineGoalCountTotal\":\"0.000000\",\"conversionPerc\":\"0.24\",\"callConversionPerc\":\"51.06\"}{\"metricDate\":\"2013-05-07\",\"pageCountTotal\":\"35892\",\"landCountT
otal\":\"12615\",\"newLandCountTotal\":\"8391\",\"returnLandCountTotal\":\"4224\",\"spiderCountTotal\":\"62\",\"goalCountTotal\":\"239.000000\",\"callGoalCountTotal\":\"239.000000\",\"callCountTotal\":\"321.000000\",\"onlineGoalCountTotal\":\"0.000000\",\"conversionPerc\":\"1.89\",\"callConversionPerc\":\"74.45\"}{\"metricDate\":\"2013-05-08\",\"pageCountTotal\":\"34106\",\"landCountTotal\":\"12391\",\"newLandCountTotal\":\"8389\",\"returnLandCountTotal\":\"4002\",\"spiderCountTotal\":\"90\",\"goalCountTotal\":\"221.000000\",\"callGoalCountTotal\":\"221.000000\",\"callCountTotal\":\"295.000000\",\"onlineGoalCountTotal\":\"0.000000\",\"conversionPerc\":\"1.78\",\"callConversionPerc\":\"74.92\"}{\"metricDate\":\"2013-05-09\",\"pageCountTotal\":\"32721\",\"landCountTotal\":\"12447\",\"newLandCountTotal\":\"8541\",\"returnLandCountTotal\":\"3906\",\"spiderCountTotal\":\"54\",\"goalCountTotal\":\"207.000000\",\"callGoalCountTotal\":\"207.000000\",\"callCountTotal\":\"280.000000\",\"onlineGoalCountTotal\":\"0.000000\",\"conversionPerc\":\"1.66\",\"callConversionPerc\":\"73.93\"}{\"metricDate\":\"2013-05-10\",\"pageCountTotal\":\"29724\",\"landCountTotal\":\"11616\",\"newLandCountTotal\":\"8063\",\"returnLandCountTotal\":\"3553\",\"spiderCountTotal\":\"139\",\"goalCountTotal\":\"207.000000\",\"callGoalCountTotal\":\"207.000000\",\"callCountTotal\":\"301.000000\",\"onlineGoalCountTotal\":\"0.000000\",\"conversionPerc\":\"1.78\",\"callConversionPerc\":\"68.77\"}{\"metricDate\":\"2013-05-11\",\"pageCountTotal\":\"22061\",\"landCountTotal\":\"9660\",\"newLandCountTotal\":\"6971\",\"returnLandCountTotal\":\"2689\",\"spiderCountTotal\":\"52\",\"goalCountTotal\":\"3.000000\",\"callGoalCountTotal\":\"3.000000\",\"callCountTotal\":\"40.000000\",\"onlineGoalCountTotal\":\"0.000000\",\"conversionPerc\":\"0.03\",\"callConversionPerc\":\"7.50\"}{\"metricDate\":\"2013-05-12\",\"pageCountTotal\":\"23341\",\"landCountTotal\":\"9935\",\"newLandCountTotal\":\"6960\",\"returnLandCountTotal\"
:\"2975\",\"spiderCountTotal\":\"45\",\"goalCountTotal\":\"0.000000\",\"callGoalCountTotal\":\"0.000000\",\"callCountTotal\":\"12.000000\",\"onlineGoalCountTotal\":\"0.000000\",\"conversionPerc\":\"0.00\",\"callConversionPerc\":\"0.00\"}{\"metricDate\":\"2013-05-13\",\"pageCountTotal\":\"36565\",\"landCountTotal\":\"13583\",\"newLandCountTotal\":\"9277\",\"returnLandCountTotal\":\"4306\",\"spiderCountTotal\":\"69\",\"goalCountTotal\":\"246.000000\",\"callGoalCountTotal\":\"246.000000\",\"callCountTotal\":\"324.000000\",\"onlineGoalCountTotal\":\"0.000000\",\"conversionPerc\":\"1.81\",\"callConversionPerc\":\"75.93\"}{\"metricDate\":\"2013-05-14\",\"pageCountTotal\":\"35260\",\"landCountTotal\":\"13797\",\"newLandCountTotal\":\"9375\",\"returnLandCountTotal\":\"4422\",\"spiderCountTotal\":\"59\",\"goalCountTotal\":\"212.000000\",\"callGoalCountTotal\":\"212.000000\",\"callCountTotal\":\"283.000000\",\"onlineGoalCountTotal\":\"0.000000\",\"conversionPerc\":\"1.54\",\"callConversionPerc\":\"74.91\"}{\"metricDate\":\"2013-05-15\",\"pageCountTotal\":\"35836\",\"landCountTotal\":\"13792\",\"newLandCountTotal\":\"9532\",\"returnLandCountTotal\":\"4260\",\"spiderCountTotal\":\"94\",\"goalCountTotal\":\"187.000000\",\"callGoalCountTotal\":\"187.000000\",\"callCountTotal\":\"258.000000\",\"onlineGoalCountTotal\":\"0.000000\",\"conversionPerc\":\"1.36\",\"callConversionPerc\":\"72.48\"}{\"metricDate\":\"2013-05-

(I've truncated the output a little).

The R output has been read properly up until "callConversionPerc" and after that the JSON parsing seems to break. Is there some default parameter that I've missed that could cause this behaviour? I have checked for unescaped quotation marks and anything obvious like that, but I didn't see any.

Surely it wouldn't be the newline character that occurs shortly after, would it?

EDIT: So this does appear to be a new line issue.
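If that diagnosis is right, the response is newline-delimited JSON: one complete object per line, which a single fromJSON call cannot treat as one document. A minimal base-R workaround sketch (the short `txt` string here is a toy stand-in for the real response, not the actual data):

```r
# Newline-delimited JSON: split on line breaks, then parse each piece
# separately (txt is a toy stand-in for the API response above).
txt <- "{\"modelId\":\"7\"}\n{\"modelId\":\"416\"}\n\r\n"
pieces <- strsplit(txt, "[\r\n]+")[[1]]
pieces <- pieces[nzchar(pieces)]  # drop any empty fragments
# each element of `pieces` can now be handed to fromJSON individually:
# records <- lapply(pieces, fromJSON)
```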

Here's another 'JSON' string I've pulled into R; again, the double quote marks are all escaped:

"{\"modelId\":\"7\",\"igrp\":\"1\",\"modelName\":\"Equally Weighted\",\"modelType\":\"spread\",\"status\":200,\"matchCriteria\":\"\",\"lookbackDays\":90}\n{\"modelId\":\"416\",\"igrp\":\"1\",\"modelName\":\"First and Last Click Weighted \",\"modelType\":\"spread\",\"status\":200,\"matchCriteria\":\"\",\"lookbackDays\":90,\"firstWeight\":3,\"lastWeight\":3}\n{\"modelId\":\"5\",\"igrp\":\"1\",\"modelName\":\"First Click\",\"modelType\":\"first\",\"status\":200,\"matchCriteria\":\"\",\"lookbackDays\":90}\n{\"modelId\":\"8\",\"igrp\":\"1\",\"modelName\":\"First Click Weighted\",\"modelType\":\"spread\",\"status\":200,\"matchCriteria\":\"\",\"lookbackDays\":90,\"firstWeight\":3}\n{\"modelId\":\"128\",\"igrp\":\"1\",\"modelName\":\"First Click Weighted across PPC\",\"modelType\":\"spread\",\"status\":200,\"matchCriteria\":\"\",\"lookbackDays\":90,\"firstWeight\":3,\"channelsMode\":\"include\",\"channels\":[5]}\n{\"modelId\":\"6\",\"igrp\":\"1\",\"modelName\":\"Last Click\",\"modelType\":\"last\",\"status\":200,\"matchCriteria\":\"\",\"lookbackDays\":90}\n{\"modelId\":\"417\",\"igrp\":\"1\",\"modelName\":\"Last Click Weighted \",\"modelType\":\"spread\",\"status\":200,\"matchCriteria\":\"\",\"lookbackDays\":90,\"lastWeight\":3}\n\r\n"

When I try to parse this using fromJSON I get the same problem: it gets to the last term on the first line and then stops parsing properly. Note that in this new case the output is slightly different from before, returning NULL for the last item (instead of the messy string from the previous example).

$modelId
[1] "7"

$igrp
[1] "1"

$modelName
[1] "Equally Weighted"

$modelType
[1] "spread"

$status
[1] 200

$matchCriteria
[1] ""

$lookbackDays
NULL

As you can see, the components now use the "$" convention as if they are naming components and the last item is null.

I am wondering if this is to do with the way that fromJSON is parsing the strings, and whether, when it is asked to create a variable with the same name as one that already exists, it fails and just returns a string or a NULL.

I would have thought that dealing with that sort of case would be coded into RJSONIO as it's pretty standard for JSON data to have repeating names.

I'm stumped as to how to fix this.

I'll be very grateful if you can advise as to what I'm doing wrong! Is there some parameter I need to be specifying to get it to recognise variable names properly?

Cheers,

Simon

Semantics integration

Truly automatic data integration needs some level of formal semantics. We should start thinking about how semantics would fit into the reml workflow, even though most semantic tools are still in their infancy.

An ideal system would allow authors to contribute to existing ontologies, or at least push to a 'working' or 'draft' ontology that could later be formalized / mapped to a more central effort like OBOE.

Not sure if we yet have any R-based tools for semantic reasoning, etc. (Though we do have SPARQL). Ultimately this might require a separate repository to tackle implementation and reasoning of semantic terms. (Hopefully developed by some actual domain experts in the R community).

Formats for user entry of dataTable metadata

Currently, to convert a data.frame into EML, we use a workflow that passes a data.frame, a list of column_metadata, and a list of unit_metadata to a function:

  dat = data.frame(river=factor(c("SAC", "SAC", "AM")),
                        spp = factor(c("king", "king", "ccho")),
                        stg = factor(c("smolt", "parr", "smolt")),
                        ct =  c(293L, 410L, 210L))
  col_metadata = c(river = "http://dbpedia.org/ontology/River",
                   spp = "http://dbpedia.org/ontology/Species",
                   stg = "Life history stage",
                   ct = "count of number of fish")
  unit_metadata =
     list(river = c(SAC = "The Sacramento River", AM = "The American River"),
          spp = c(king = "King Salmon", ccho = "Coho Salmon"),
          stg = c(parr = "third life stage", smolt = "fourth life stage"),
          ct = "number")

Then the EML is created by passing these objects to the high-level function eml_write:

  doc <- eml_write(dat, col_metadata, unit_metadata)

I'm not sure if this is a good way to ask the users for metadata. One of the design goals is to reuse the natural R structures as much as possible and avoid asking for redundant information.

One problem with this is that it structures the metadata by column headings rather than column by column, which might suggest something like this:

metadata <- 
  list("river" = list("River site used for collection",
                      c(SAC = "The Sacramento River", AM = "The American River")),
       "spp" = list("Species common name", 
                    c(king = "King Salmon", ccho = "Coho Salmon")),
       "stg" = list("Life Stage", 
                    c(parr = "third life stage", smolt = "fourth life stage")),
       "ct"  = list("count", 
                    "number")

Which provides a more column-by-column approach. Still, this seems unsatisfactory: we don't reuse the levels of a factor in a column (e.g. SAC and AM), instead requiring that they be rewritten; likewise, we still have to repeat the column headings in our named list.

Rather than using a named list, we might also do better to capture the attribute metadata in the object, e.g.

river_metadata <- list("river",
       "River site used for collection",
       c(SAC = "The Sacramento River", AM = "The American River"))

which maps better to the schema attribute. Still none of these make maximum re-use of the data.frame objects and all are a bit cumbersome.

A more natural solution would be to write directly into the S4 slots, but I'm not clear on how this would work. Using the above structures we could do

as("eml:attributeList", metadata)

and a more low-level option:

as("eml:attribute", river_metadata)

but not sure if that would feel more natural to users than the function calls (particularly since most R using ecologists are not familiar with S4 methods).

@schamberlain @karthikram @mbjones @duncantl
Would love any feedback on this or generally how the API should look to specify these values. Can we attach them to the data.frame/columns more directly, and is that better? (e.g. I considered labels option for factors, but that just overwrites the levels)....

Integration with related schemas?

With a robust XMLSchema package, a few other things become easy. Integration should happen at the more universal level of the schemas themselves rather than at the R level; hopefully that is something we can take advantage of from there.

EML already maps to the Biological Data Profile (BDP) from the Federal Geographic Data Committee (used by the now-defunct National Biological Information Infrastructure, NBII), but the reverse mapping is not available (Jones et al 2006).

Start a unit test suite

  • Example EML file parses (as XML)

coverage: eml_write, eml_dataset, eml_dataTable, eml_attributeList

  • Example EML file validates (as EML)

coverage: eml_write, eml_dataset, eml_dataTable, eml_attributeList

  • Check certain entries in EML, e.g. that the <attribute><definition> matched by xpath equals the definition passed to eml_write.

Continue developing unit tests as (or before) development proceeds.

basic eml_write tasks

  • eml_dataTable should use title for name of csv file if not provided, rather than the col-name trick.
  • reml method node should include dateTime of generation
  • Add a publication date
  • add a License
  • eml_person should be able to take strings such as Carl Boettiger <[email protected]> and coerce to R person object, then to eml_person object.
  • eml_publish should be able to take a return object from eml_write, (or perhaps take the eml_write arguments directly?)

utilities for coverage metadata

EML coverage nodes specify taxonomic, geographic, and temporal coverage.

They can refer to a dataset node but can also be used to define coverage of individual columns (e.g. a species column) or individual cells in a column (e.g. the species name). The latter is much richer but less commonly implemented.

taxonomic coverage

see eml taxa documentation

@schamberlain I think ideally taxonomic coverage would make use of taxize to help identify and correct species names. While higher taxonomic information can be specified, this would probably best be reserved for cases not referring to a particular species, since (a) we can already programmatically recover the rest of the classification given the genus and species, and (b) higher taxonomy may be inconsistent anyway.

temporal coverage

See eml temporal coverage documentation

We'll want to automatically decide whether the coverage is a specific range of calendar dates or an estimated (e.g. geological) timescale, how to express approximate uncertainty, and whether to include any citations to literature describing the dating method (e.g. carbon dating). Could be a whole wizard / module....

Meanwhile, just supporting manual definition of this structure would be a good start.

geographic coverage

Can be a bounding box, polygon, or geographicDescription (e.g. "Oregon"). It is tempting to process natural-language descriptions into coordinates, but that throws out true data in place of estimated data (e.g. best left to the read-eml world, not the write-eml one).
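As a concrete target, a bounding-box node is just nested name/value structure. A sketch of the sort of list reml might generate (the element names come from the EML schema; the Oregon coordinates are approximate, for illustration only):

```r
# geographicCoverage with a bounding box (coordinates approximate)
coverage <- list(geographicCoverage = list(
  geographicDescription = "Oregon, USA",
  boundingCoordinates = list(
    westBoundingCoordinate  = -124.6,
    eastBoundingCoordinate  = -116.5,
    northBoundingCoordinate =   46.3,
    southBoundingCoordinate =   41.9)))
```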

Extract all additional metadata provided

So far eml_read only extracts the three objects in the proof-of-principle test. Of course we will want generic access to all metadata objects, probably with a variety of tools for their extraction.

  • In particular, we still need utilities for <coverage>, see #9
  • In general, we may want to leverage xmlSchema and xmlToS4 to provide generic access to metadata.
  • As well as plain-text summaries, see #1

Add function to publish EML data through Github

Deploying EML file and associated data objects on the gh-pages branch of a github repository would provide a more natural URL endpoint, and facilitate forking, pull requests, and rapid versioning.

Would require many of the same steps as #3, but because we don't have a native R interface to Git, the actual commit and push could be left to an external script or the user. It would be most natural simply to specify the repository endpoint to form the appropriate URLs.

Coverage metadata: Examples writing, reading, and plotting

Taxonomic, geographic, and temporal coverage are all common and rather essential metadata whose use we should illustrate.

This should include tools to generate coverage nodes from columns of the data frame: species names, lat/longs to bounding boxes, time frame from series of times.

Also include tools to summarize coverage metadata, including extraction from columns and extraction into a separate data.frame (or appropriate R spatial object).

potential use case in converting to long forms when combining data frames

This issue is mostly a note to myself in thinking out potential illustrative use cases. No input really needed at this time; it's a kinda trivial example here.

What would the EML look like to encode replicate models? Can we easily convert metadata to an additional column when combining data of matching column descriptions but differing metadata, e.g.


model: Allen
parameters: r=1, K = 10, C = 5
nuisance parameters: sigma_g = 0.1
seed: 1234


value density
0.0 0.12
0.1 0.14
0.2 0.22
0.3 0.4

And:


model: Myers
parameters: r=1, K = 10, theta = 1
nuisance parameters: sigma_g = 0.1
seed: 1234


value density
0.0 0.14
0.1 0.11
0.2 0.33
0.3 0.33
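One answer to the question above is to demote the differing header metadata (here, the model name) to an ordinary column before combining. A base-R sketch using the two tables above:

```r
# Move the distinguishing metadata into a column, then combine the tables
d_allen <- data.frame(value   = c(0.0, 0.1, 0.2, 0.3),
                      density = c(0.12, 0.14, 0.22, 0.40),
                      model   = "Allen")
d_myers <- data.frame(value   = c(0.0, 0.1, 0.2, 0.3),
                      density = c(0.14, 0.11, 0.33, 0.33),
                      model   = "Myers")
combined <- rbind(d_allen, d_myers)  # long form: one row per (model, value)
```

The same pattern would extend to the parameter and seed metadata, one column each.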

Use the S4 approach for reading and writing XML

A somewhat more elegant approach to reading in XML is to define an S4 class for a given node and then just cast the XML into the S4 slots using xmlToS4. Undefined slots are ignored. I provide an illustration of how to do this in my advice on the RNeXML package, which compares it to alternative methods of reading XML.

This approach has the particular advantage that, at least in principle, we shouldn't need to define these classes by hand the way I show there, since their definitions can be extracted programmatically from the schema. The XMLSchema package should soon be able to do this.

Not only does this streamline our approach to reading the EML into R objects, but it provides several other benefits. We can define coercion methods that take each of these S4 objects and coerce them into the appropriate R objects: for instance, a <dataTable> node into an R data.frame along with the appropriate metadata (kept in S4, since there is no natural way to attach metadata to R objects...), or a <person> node into an R person, etc.

Somewhat more powerful, and potentially more tricky, is using the S4 approach as a write method. Rather than constructing the XML node by node as we do currently with newXMLNode and addChildren, etc., we would simply coerce our R objects (data.frames, person objects or strings, DESCRIPTION files of (R) software, etc.) into these S4 class definitions extracted from the schema (which can be done automatically by matching slot names if the matches are good enough, e.g. person$givenName to <person><givenName>?). With luck(?), XMLSchema will be able to use the schema to figure out how to write such an S4 object into XML (e.g. which slots are encoded as attributes, which as child nodes, the ordering of slots, etc.).

As XMLSchema is probably not up to this task yet (particularly on the write end?), we may do well to continue as we are "by hand", though perhaps we should still be leveraging the S4 class definitions in the process (and then manually turning them into XML with the calls to newXMLNode, etc.?).

@duncantl will hopefully clarify some of these questions and anything I've misstated about this strategy.

reml-generated EML should include metadata stating so

That way if someone doesn't like the EML, they know who to blame ;-)

  • e.g. should have plain-text description of REML, contact info & bug report info. Have to figure out the best syntax for this.
  • A richer implementation could document the R function calls used to generate the EML.
  • Include citation to REML (e.g. as a software node and/or literature node)

fix validity issues with generated EML

The currently generated EML is not valid and needs to be fixed. I have identified the following issues to be fixed:

  • missing @packageId attribute on root <eml> element
  • missing @system attribute on root <eml> element
  • missing <title> field
  • missing <creator> field
  • missing text values in <contact> field
  • <entityDescription> field is empty
  • misspelled <recorDelimiter>, should be <recordDelimiter>
  • <numericDomain> is out of order and should follow <unit>

can we replace the EMLParser class from dataone R package?

The R dataone package also has some preliminary EML parsing routines, which extract relevant metadata from EML and make it available for use in the dataone client. This is partially used for the asDataFrame() method that converts a DataONE binary file to a data frame. These classes might be replaceable with more capable reml package methods. See:

https://repository.dataone.org/software/cicore/trunk/itk/d1_client_r/dataone/R/EMLParser-class.R
https://repository.dataone.org/software/cicore/trunk/itk/d1_client_r/dataone/R/EMLParser-methods.R

if/where to date / timestamp EML files?

@mbjones would it make sense to timestamp the EML file with the date it is generated? The files have most of the information you might want to cite the data (creator, title, URL or identifier, potentially the responsible organization or repository), but I don't see a date associated with this. If so, where would the logical place for such a date be? (Presumably this would not be confused with the date the data was actually collected, e.g. temporalCoverage.)

Integrate two datatables based on EML spec and ontology

This is the holy grail of metadata infrastructure and ostensibly the primary purpose of EML, see Jones et al 2006. Despite that, integration is not actually possible without semantic definitions as well, see Michener & Jones 2012, from which we adapt this minimal example below.

This example provides minimal and sometimes missing semantics, which may make it unresolvable. A complete semantic solution is diagrammed in the figure from Michener & Jones 2012.

Dataset 1

 dat = data.frame(river=c("SAC", "SAC", "AM"), 
                   spp = c("king", "king", "ccho"), 
                   stg = c("smolt", "parr", "smolt"),
                   ct =  c(293L, 410L, 210L))

 col_metadata = c(river = "http://dbpedia.org/ontology/River",
                  spp = "http://dbpedia.org/ontology/Species", 
                  stg = "Life history stage",
                  ct = "count")

 unit_metadata = 
  list(river = c(SAC = "The Sacramento River", AM = "The American River"),
       spp = c(king = "King Salmon", ccho = "Coho Salmon"),
       stg = c(parr = "third life stage", smolt = "fourth life stage"),
       ct = "number")

Dataset 2

 dat = data.frame(site = c("SAC", "AM", "AM"), 
                   species = c("Chinook", "Chinook", "Silver"), 
                   smct = c(245L, 511L, 199L),
                   pcnt =  c(290L, 408L, 212L))

 col_metadata = c(site = "http://dbpedia.org/ontology/River",
                  species = "http://dbpedia.org/ontology/Species", 
                  smct = "Smolt count",
                  pcnt = "Parr count")

 unit_metadata = 
  list(site = c(SAC = "The Sacramento River", AM = "The American River"),
       species = c(Chinook = "King Salmon", Silver = "Coho Salmon"),
       smct = "number",
       pcnt = "number")

Figure

(figure: ontology-based data synthesis diagram, adapted from Michener & Jones 2012)

Add KNB as a publish node?

Presumably we should be able to push directly to KNB as well (many benefits, including being a DataONE node...).

May need Matt's help on getting API tools to do this...

Dealing with dateTimes: when dates are defined over multiple columns

I think that the dateTime of a single observation should always be given in a single column. For reasons unfathomable to me, some data represent year as one column, month as another column, day as another, etc.

This is really only a problem when we do not have good enough metadata to recognize that these columns all refer to the same observation. For instance, in the dataset linked above, we can tell that all the columns are "dateTime" objects, but we generally have no way to be sure that the "year" in column 2 is the year that corresponds to the "day" in column 3. These could be independent dateTime observations, such as the start time and end time of a study.

While it seems obvious that a single observation should get a single cell, apparently it isn't. I'm open to ideas on how to approach these issues.

This is a problem for read_eml for two reasons:

  1. If we are to render any of these as time (POSIXt class) objects, we need to be able to associate them. A crude as.POSIXt would instead render the date as the current year, rather than that given in the column.
  2. Presumably a researcher would like to access the individual points in time, so ideally we would provide a combined column as date-time format, even if we return the original columns unformatted.
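When the metadata does confirm that the columns describe one observation, the combination itself is straightforward in base R (toy values below; a real implementation would also need to handle missing components and non-Gregorian dates):

```r
# Combine separate year/month/day columns into a single Date column
df <- data.frame(year  = c(2001L, 2002L),
                 month = c(3L, 11L),
                 day   = c(15L, 2L))
df$date <- as.Date(sprintf("%04d-%02d-%02d", df$year, df$month, df$day))
```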

Questions

  • Perhaps in such cases we leave the objects as character strings and go on?
  • When can we safely convert a dateTime object from a character string to a time object, and what class of time object should we use? e.g. POSIXt, or objects from the chron or date packages?

Dealing with settings / generic metadata

Some metadata a user would probably rather set once in some global configuration than have to specify each time, such as their personal contact information. The package API currently uses eml$set and eml$get to handle this.

At the same time, if a user needs to adjust the contact information for a particular file, they should be able to override these values without altering their global configuration.

Need to be careful to avoid collisions in the eml$set approach. As implemented, it won't support structured metadata (e.g. contact_givenName, contact_surName, ...). Ultimately we might want to be more clever about this, or just go to the YAML approach entirely.

Need also to be careful to avoid lengthy and fragile function APIs. eml$set helps with this, as a function can get the data it needs without it being passed down through many levels, but this also makes the override issue harder.
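A minimal sketch of such a set/get store as a closure (the constructor name and behavior here are hypothetical, not the package's actual implementation); per-call arguments could then default to eml$get(...), so that an explicit value naturally overrides the global configuration:

```r
# Closure-based settings store: set() records values, get() retrieves them,
# with an optional default when a key has never been set
new_settings <- function() {
  store <- list()
  list(
    set = function(...) {
      vals <- list(...)
      store[names(vals)] <<- vals
    },
    get = function(key, default = NULL) {
      if (is.null(store[[key]])) default else store[[key]]
    }
  )
}
eml <- new_settings()
eml$set(contact_email = "user@example.org")  # illustrative address
```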

Once we have more implementation examples, we can give this some hard thought. Meanwhile:

Potential methods of providing metadata

  • specify metadata in R, e.g. eml$set(contact_email = "[email protected]")
  • Pull in generic metadata from an external file, e.g. YAML:
contact:
  email: [email protected]
  • Pull in generic metadata information from a DOI or a website -- e.g. rather than having to enter all authors manually. (FundRef as well as CrossRef data...?)

Generate an EML file given a data.frame and appropriate metadata

To begin with, consider rendering a data.frame such as this

 dat = data.frame(river=c("SAC", "SAC", "AM"), 
                   spp = c("king", "king", "ccho"), 
                   stg = c("smolt", "parr", "smolt"),
                   ct =  c(293L, 410L, 210L))

with the following accompanying metadata:

 col_metadata = c(river = "http://dbpedia.org/ontology/River",
                  spp = "http://dbpedia.org/ontology/Species", 
                  stg = "Life history stage",
                  ct = "count")
 unit_metadata = 
  list(river = c(SAC = "The Sacramento River", AM = "The American River"),
       spp = c(king = "King Salmon", ccho = "Coho Salmon"),
       stg = c(parr = "third life stage", smolt = "fourth life stage"),
       ct = "number")

into EML.

Reading and writing CSV files should have full implementation / translation of read.table API

R's read.table() function (to which read.csv is an alias) provides lots of options that should

  • be encoded in the metadata when we write a csv
  • be read in from the metadata when we read a csv
 read.table(file, header = FALSE, sep = "", quote = "\"'",
                dec = ".", row.names, col.names,
                as.is = !stringsAsFactors,
                na.strings = "NA", colClasses = NA, nrows = -1,
                skip = 0, check.names = TRUE, fill = !blank.lines.skip,
                strip.white = FALSE, blank.lines.skip = TRUE,
                comment.char = "#",
                allowEscapes = FALSE, flush = FALSE,
                stringsAsFactors = default.stringsAsFactors(),
                fileEncoding = "", encoding = "unknown", text)

See ?read.table for details. Particularly important for the read interface.
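Several of those arguments correspond directly to elements of EML's physical/textFormat description. A sketch of the mapping (the EML element names come from the schema; the pairing with specific read.table arguments is our own suggestion):

```r
# read.table argument -> EML physical/textFormat element (sketch)
csv_physical <- list(
  numHeaderLines  = 1L,       # header = TRUE
  fieldDelimiter  = ",",      # sep = ","
  quoteCharacter  = "\"",     # quote = "\"'"
  recordDelimiter = "\\r\\n"  # platform line endings
)
```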

Extend data.frame to allow more metadata

An underlying philosophy of reml has been to map native R objects to EML structure (as opposed to either the raw XML or the S4 representations, which won't be familiar to most users). While data.frame is the natural candidate, it doesn't include essential metadata such as units.

I've taken a stab at extending the data.frame class as data.set, providing the additional attributes unit.defs and col.defs. See data.set.R. This object should be usable wherever a data.frame is expected, but we can also define additional methods that operate on this metadata.

This also suggests an alternative way to define metadata of a data.frame in place of the current approach illustrated in the README. I've added a function so that this data.set object can be created analogously to a data.frame:

dat = data.set(river = c("SAC",  "SAC",   "AM"),
               spp   = c("king",  "king", "ccho"),
               stg   = c("smolt", "parr", "smolt"),
               ct    = c(293L,    410L,    210L),
               col.defs = c("River site used for collection",
                            "Species common name",
                            "Life Stage", 
                            "count of live fish in traps"),
               unit.defs = list(c(SAC = "The Sacramento River", 
                                  AM = "The American River"),
                                c(king = "King Salmon", 
                                  ccho = "Coho Salmon"),
                                c(parr = "third life stage", 
                                  smolt = "fourth life stage"),
                                "number"))

An existing data.frame can also be passed in along with col.defs and unit.defs.
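A simplified sketch of such a constructor (the actual data.set.R implementation may differ):

```r
# data.frame subclass carrying col.defs / unit.defs as attributes
data.set <- function(..., col.defs = NULL, unit.defs = NULL) {
  df <- data.frame(...)
  attr(df, "col.defs")  <- col.defs
  attr(df, "unit.defs") <- unit.defs
  class(df) <- c("data.set", class(df))
  df
}

dat <- data.set(river = c("SAC", "SAC", "AM"),
                ct    = c(293L, 410L, 210L),
                col.defs = c("River site used for collection", "count"))
```

Because "data.frame" stays in the class vector, existing data.frame methods continue to apply, while new methods can dispatch on "data.set" to reach the metadata.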

Questions:

@duncantl @mbjones Is the implementation of the extension sensible? data.frame is actually a rather confusing class: I cannot tell if it is S4 or S3 (e.g. new('data.frame') creates an S4 object, but data.frame() does not...), and it has attributes like names and row.names which may or may not be S4 slots (e.g. they can be accessed by slot() but not @...). data.set.R

What other metadata do we want? E.g. we could make a full eml object act like this, but that makes for rather big data.frames...

Can't build package due to some missing commas.

install_github("reml", "ropensci")
Installing github repo(s) reml/master from ropensci
Downloading reml.zip from https://github.com/ropensci/reml/archive/master.zip
Installing package from C:\Users\thart\AppData\Local\Temp\RtmpuCh7ZJ/reml.zip
Installing reml
"C:/PROGRA1/R/R-301.1/bin/i386/R" --vanilla CMD INSTALL
"C:\Users\thart\AppData\Local\Temp\RtmpuCh7ZJ\reml-master"
--library="C:/Users/thart/Documents/R/win-library/3.0" --with-keep.source --install-tests

  • installing source package 'reml' ...
    ** R
    Error in parse(outFile) :
    C:/Users/thart/AppData/Local/Temp/RtmpuCh7ZJ/reml-master/R/attribute.R:213:52: unexpected 'function'
    212:
    213: setMethod("extract", signature("measurementScale") function
    ^
    ERROR: unable to collate and parse R files for package 'reml'
  • removing 'C:/Users/thart/Documents/R/win-library/3.0/reml'

ext_validate broken?

@duncantl For some reason I cannot get your ext_validate to run successfully any more. e.g. running this test gives the error:

"): non-character argument
1: eml_validate(txt) at test_ext_validate.R:17
2: .reader(ans) at /home/cboettig/Documents/code/reml/R/ext_validate.R:59
3: strsplit(ans, ": ") at /home/cboettig/Documents/code/reml/R/ext_validate.R:76

not sure what's up on this one.

R Puzzles

Here's a running list of questions I have for @duncantl or whomever, largely arising as I try to understand the S4-based approach to representing the schema and various puzzles that arise in the process.

generic R issues

  • Why do I lose names when coercing between (named) character strings and lists (and vice versa)? (try as(list(a=1, b=2), "character"))
  • I set prototype for a slot, install package, remove prototype for that slot, reinstall package, prototype is still set! Huh? (Deleting the installed package location and the local namespace and then re-roxygening and installing seems to fix this).
  • What's up with "labels" on factors? They overwrite my levels and everything.

XML and schema issues

  • Best way to handle optional elements (when we'd rather not specify default values)? Ideally we would have something such that addChildren(parent, class@empty_option) would do nothing when the slot was truly empty.
  • "One or more", "Zero or more" everything needs a ListOf class. Goodness, but this is annoying.

yup, tedious but mindless.

  • Related: in the schema, sometimes a list of elements has a parent node, e.g. <attributeList> followed by <attribute>, <attribute>, ...; sometimes it doesn't. Does it make sense to write an extra object class associated with the first case (e.g. a class for attributeList)?

yup, classes for all elements, and more classes for ListOf

  • Strategies for file organization when defining S4 objects? Do I need to define the elementary types first? How do I avoid all the "class not defined" errors when installing the package?

Set the collate order for the files, describing the order in which they should be loaded (e.g. basic class definitions before richer ones, classes before methods). In Roxygen, the order is set by using @include fileA.R in the documentation of fileB.R to indicate that fileA has definitions needed by fileB.
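For example (hypothetical file names), the top of fileB.R would carry:

```r
#' @include fileA.R
NULL
```

Roxygen then emits a Collate field in DESCRIPTION listing fileA.R before fileB.R.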

  • Attributes: A bit annoying that the S4 representation we're using uses slots for XML attributes as well as for node values, without indicating which is which, but I guess it's okay. We could map them in S4 to an attr slot.

Yup. Tedious again but not problematic. Writing to/from methods takes care of this explicitly.

  • Do we write into XML as a coercion method, setAs("class", "XMLInternalNode", function(from){... or with some other kind of function?

sure, though sometimes preferable to define as a method, allowing us to make use of callNextMethod to convert against the inherited slots.

  • How do we handle coercion of S4 objects into S3 objects (and vice versa)? Can we make S4 objects work in S3 functions? What's the deal with setOldClass and S3part?
  • XML Schema (XSD, I guess I could say) has the notion of sharing a bunch of slots using entity groups. I guess we just put all members of the group in as separate slots each time? (Simple enough when XMLSchema is generating the class definitions, I guess.)

Answer: Just use contains in the setClass definition (Inheritance)

  • Strategies for mixing XMLSchema generated class definitions and manual class definitions of schema objects? (namespaces for classes)
  • Coercion to promote types, or is it better to explicitly call new?

Answer: Using "new", we must know the slot name corresponding to the type. Coercion allows us to specify the type, e.g.

setAs("eml:nominal", "eml:measurementScale", function(from) new("eml:measurementScale", nominal = from))
setAs("eml:ordinal", "eml:measurementScale", function(from) new("eml:measurementScale", ordinal = from))

can be used with as(from[[3]], from[[4]]), reading the class name from a variable instead of hardwiring the slot name. The coercion methods take care of mapping the class names to the appropriate slot names.

  • Is there an easy way to give common attributes to all nodes? For example, character / numeric types in the schema can all have id attributes. I suppose the programmatic solution is to make every node a class? Would we have setClass("somenode", representation(title = "eml:title", ... and then setClass("eml:title", representation(title = "character", id = "character"))?

How about just having all inherit from a common base?
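For example (a minimal sketch; the class and slot names here are invented for illustration, not the package's actual ones):

```r
library(methods)

# A single base class carries the attributes common to all nodes...
setClass("emlNode", representation(id = "character"))

# ...and each element class simply inherits them via contains:
setClass("emlTitle", contains = "emlNode",
         representation(value = "character"))

ttl <- new("emlTitle", value = "My dataset", id = "node-1")
ttl@id  # "node-1" -- inherited from the base class
```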

EML search queries

Thus far, issues have been divided among reading, writing, publishing, and integrating EML. Development has mostly focused on writing EML. Publishing is a relatively straightforward extension of writing EML: it just adds a few extra fields to the EML file and pushes the data to the appropriate repository with the appropriate metadata.

Reading EML is potentially more of a challenge, since (a) we assume the user doesn't know XPath, and (b) we want to provide conversion into native R objects wherever possible. So far we have only a basic proof of principle, based on the trivial write-EML example, which imports the csv file into an appropriately labeled data.frame.

I'm not sure whether searching across EML is a read-EML issue or a separate task, since in general such a query might be posed across a database of EML files rather than against a single XML file.

To have a focal example, I'll just borrow one posed by one of my PIs:

"Find all data that involves a morphometric measurement of an invertebrate species at 3 or more geographically distinct locations."

(e.g. 3+ different populations of the same species.) This kind of data would be useful for all sorts of within-species variation comparisons (when put against environmental variables, etc.), but it is remarkably difficult to find, as vertically integrated databases tend to omit morphological data (like most GBIF entries) or else aggregate at the species level, discarding the geographic data. Many papers have fewer than three populations, and it is all but impossible to find another paper that makes the same morphometric measurements on the same species at a unique location.

It seems like this is the kind of query we could construct in EML, and in particular perform the aggregation step. But that assumes a model in which we query directly against all available EML files. I'm not sure if that is sensible or if there's a cleverer way to do these queries, particularly as we would have to do some computation in the process (e.g. to isolate data with invertebrate coverage we would have to query the taxonomic coverage, then query against ITIS or something to determine whether the species listed was an invertebrate). @mbjones, is there a better way to think about complex queries (Metacat?)?

eml_citation

We can get citation information for R packages with citation("reml"), so it would be natural to get the citation information for an EML object with:

eml <- read_eml("my_eml_file.xml")
citation(eml)
  • read_eml should make use of the S3 class eml and return a pointer to the XML root node as doc
  • Create the function eml_citation with alias citation.eml that extracts the appropriate data citation.
  • Citation metadata should include DOI if published to figshare or dryad, etc.
  • Use the bibentry R class, so that citation can be returned in various formats (e.g. print(citation(doc), style='bibtex'))
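A rough sketch of what eml_citation might do, built on utils::bibentry. The function signature and the hardcoded field values here are hypothetical stand-ins for metadata that would really be extracted from the parsed EML document:

```r
library(utils)

# Hypothetical signature: in practice these arguments would be pulled
# from the EML <title>, <creator>, <pubDate>, and identifier nodes.
eml_citation <- function(title, author, year, doi) {
  bibentry(bibtype = "Misc",
           title   = title,
           author  = author,   # coerced to a person object by bibentry
           year    = year,
           doi     = doi)
}

cit <- eml_citation("Example dataset", "Jane Doe", "2013",
                    doi = "10.0000/example")
print(cit, style = "bibtex")
```

Because bibentry objects already know how to render themselves, the same cit can be printed as plain text, BibTeX, or a citation()-style entry, which covers the last bullet above.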

Extract appropriate R objects from EML dataTable

Given an EML file defining a CSV file and its metadata types, extract the R object information. This should allow the user to reconstruct the following R objects from the EML generated by #2:

 dat = data.frame(river=c("SAC", "SAC", "AM"), 
                   spp = c("king", "king", "ccho"), 
                   stg = c("smolt", "parr", "smolt"),
                   ct =  c(293L, 410L, 210L))

with the following accompanying metadata:

 col_metadata = c(river = "http://dbpedia.org/ontology/River",
                  spp = "http://dbpedia.org/ontology/Species", 
                  stg = "Life history stage",
                  ct = "count")
 unit_metadata = 
  list(river = c(SAC = "The Sacramento River", AM = "The American River"),
       spp = c(king = "King Salmon", ccho = "Coho Salmon"),
       stg = c(parr = "third life stage", smolt = "fourth life stage"),
       ct = "number")

Ensure that all objects have the correct object type: e.g. (ordered) factors should be (ordered) factors, etc.
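For the example above, "correct object type" means something like the following sketch. (Which columns become ordered factors would really be driven by the EML measurementScale, not hardcoded as here.)

```r
dat <- data.frame(river = c("SAC", "SAC", "AM"),
                  spp   = c("king", "king", "ccho"),
                  stg   = c("smolt", "parr", "smolt"),
                  ct    = c(293L, 410L, 210L),
                  stringsAsFactors = FALSE)

# "stg" is an ordinal scale (parr comes before smolt), so reconstruct
# it as an ordered factor; "ct" should remain integer.
dat$stg <- factor(dat$stg, levels = c("parr", "smolt"), ordered = TRUE)

is.ordered(dat$stg)  # TRUE
is.integer(dat$ct)   # TRUE
```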

Questions about EML design & implementation

A running list of questions that I might direct to Matt if I cannot figure them out:

  • Can we get (public) endpoints/URLs to all the public EML files in KNB?
  • How should we be generating id attributes for <attribute> elements?
  • How can we replace definition with a URI to an existing ontology?

Units already have clear semantic definitions, but assigning good definitions to columns or values of character strings (such as species names, geographic sites, etc.) is considerably less developed. We do have a somewhat roundabout way to attach things like "Coverage" definitions to columns (attributes). Replacing definitions with URIs would seem simpler...

  • Isn't having <attributeDefinition> and <textDomain><definition> redundant in the case of character string columns? (e.g. see example below)
            <attribute id="1354213311470">
               <attributeName>run.num</attributeName>
               <attributeDefinition>which run number. (integer)</attributeDefinition>
               <measurementScale>
                  <nominal>
                     <nonNumericDomain>
                        <textDomain>
                           <definition>which run number</definition>
                        </textDomain>
                     </nonNumericDomain>
                  </nominal>
               </measurementScale>
            </attribute>
