dataoneorg / object-formats

DataONE Object Formats controlled vocabulary

License: Apache License 2.0

d1-cn operations

object-formats's Issues

Add a format identifier for the new 1.1.0 Collections & Portals schemas

Format Metadata

Collections 1.1.0

formatId: https://purl.dataone.org/collections-1.1.0
formatName: Dataset collections v1.1.0
formatType: METADATA
mediaType: text/xml
extension: xml

Portals 1.1.0

formatId: https://purl.dataone.org/portals-1.1.0
formatName: Dataset portals v1.1.0
formatType: METADATA
mediaType: text/xml
extension: xml

Format description

This is an update to the current 1.0.0 version of the Collections and Portals schemas. The new 1.1.0 version is required to support more complex definition filters, different section types, and more.

Specification / Namespace documentation

https://github.com/DataONEorg/collections-portals-schemas

Checklist

For both Collections & Portals 1.1.0

The format is not a duplicate of another in the list under a different name or identifier
Format Identifier is unique
Format identifier is the commonly-used identifier for the namespace, or the best URI for the namespace, or the best Media type
Format identifier is same as the MIME media type if the mime type is specific to only this format (e.g., image/png is specific to one format, whereas text/xml is not specific to one format)
Format Name is recognizable and sensible
Format includes version info where applicable in formatName and formatId
formatType is the correct type from the values: DATA, METADATA, or RESOURCE
MediaType is the most specific MIME media type that applies to the format

Considerations

The purl.dataone.org redirects still need to be created; see DataONEorg/dataone_purl#4.

rationale behind R Markdown file (pandoc markdown compatible)

The only entry for R Markdown says:
R Markdown file (pandoc markdown compatible)

Why is pandoc mentioned? If there are flavors of R Markdown that aren't pandoc compatible, then why isn't there an entry for vanilla R Markdown? If all R Markdown files are pandoc compatible, then why not just remove the pandoc part?

GeoTIFF

Format Metadata

Provide the standard metadata for the proposed format, ensuring that the id and name are unique and appropriate to the version of the format being proposed.

formatId: image/geotiff
formatName: GeoTIFF
formatType: DATA
mediaType: image/tiff
extension: tiff

Being a raster GIS format, the GeoTIFF may have other files that go with it such as pyramids or metadata files. Therefore, a zip version is also provided.

formatId: image/geotiff+zip
formatName: GeoTIFF (zipped)
formatType: DATA
mediaType: image/tiff+zip
extension: zip

Format description

Describe why a new format is needed, including items such as where the format type has been encountered, what software produces it, and what software can read it.

GeoTIFF is a TIFF image with embedded georeferencing information such that the pixels of the image can be correctly located in a map in a geographic information system (GIS). This public domain metadata standard is a good choice for sharing raster GIS data, as most GIS software supports and can generate files using this format. There are numerous examples of GeoTIFFs archived in DataONE nodes.

This is one of the formats recommended by an EDI/LTER working group developing best practices for archiving spatial data.

Specification / Namespace documentation

Provide the location(s) of the documentation of the format specification or the namespace for the format or vocabulary.

Standard: http://docs.opengeospatial.org/is/19-008r4/19-008r4.html

Checklist

The format is not a duplicate of another in the list under a different name or identifier
Format Identifier is unique
Format identifier is the commonly-used identifier for the namespace, or the best URI for the namespace
Format identifier is same as the MIME media type if the mime type is specific to only this format (e.g., image/png is specific to one format, whereas text/xml is not specific to one format)
Format Name is recognizable and sensible
Format Name includes version info where applicable
formatType is the correct type from the values: DATA, METADATA, or RESOURCE
MediaType is the most specific MIME media type that applies to the format
All extensions in widespread use are listed

Proposed Sqlite Format ID

Format Metadata

Provide the standard metadata for the proposed format, ensuring that the id and name are unique and appropriate to the version of the format being proposed.

formatId: application/vnd.sqlite3
formatName: SQLite Database
formatType: DATA
mediaType: application/vnd.sqlite3
extension: db

Format description

There is currently no SQLite database format ID. The list does have a "GeoPackage Encoding Standard (OGC) Format Family" entry, but its format ID is application/geopackage+sqlite3, which is specific to GeoPackage. A new format ID is needed for researchers who archive plain SQLite .db files.
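Because a GeoPackage is itself a SQLite database, the two format IDs can be told apart by looking at the file header. The following is a minimal sketch, not part of the proposal: it assumes the 16-byte magic string documented in the SQLite file format and the 4-byte application ID at byte offset 68, and it only recognizes the "GPKG" application ID (older GeoPackage releases may use other values, which are not handled here). The function name is hypothetical.

```python
SQLITE_MAGIC = b"SQLite format 3\x00"   # first 16 bytes of every SQLite 3 file
GEOPACKAGE_APP_ID = b"GPKG"             # assumption: only this application ID is checked


def guess_sqlite_format_id(path):
    """Return a formatId guess for a file, or None if it is not SQLite."""
    with open(path, "rb") as handle:
        header = handle.read(100)       # the SQLite header is 100 bytes long
    if len(header) < 100 or header[:16] != SQLITE_MAGIC:
        return None
    # The application ID is stored big-endian at offset 68 of the header.
    if header[68:72] == GEOPACKAGE_APP_ID:
        return "application/geopackage+sqlite3"
    return "application/vnd.sqlite3"
```

A check like this could help avoid typing a GeoPackage as a generic SQLite database, or the reverse, when the file extension is ambiguous.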

Specification / Namespace documentation

IANA registered format: https://www.iana.org/assignments/media-types/application/vnd.sqlite3
SQLite database format docs: https://www.sqlite.org/fileformat.html

Checklist

The format is not a duplicate of another in the list under a different name or identifier
Format Identifier is unique
Format identifier is the commonly-used identifier for the namespace, or the best URI for the namespace, or the best Media type
Format identifier is same as the MIME media type if the mime type is specific to only this format (e.g., image/png is specific to one format, whereas text/xml is not specific to one format)
Format Name is recognizable and sensible
Format includes version info where applicable in formatName and formatId
formatType is the correct type from the values: DATA, METADATA, or RESOURCE
MediaType is the most specific MIME media type that applies to the format

Considerations

Unknown

System Metadata (HashStore)

Format Metadata

Provide the standard metadata for the proposed format, ensuring that the id and name are unique and appropriate to the version of the format being proposed.

formatId: http://ns.dataone.org/service/types/v2.0:SystemMetadata
formatName: System Metadata (HashStore)
formatType: METADATA
mediaType: text/xml
extension: None

Format description

With the Metacat back-end storage refactor, we will need a new metadata format to represent the metadata that corresponds to each data object in the HashStore. Development is ongoing and this issue should be updated as discussions take place and progress is made.

Specification / Namespace documentation

The format will be defined at http://ns.dataone.org/service/types/v2.0:SystemMetadata.

Checklist

The format is not a duplicate of another in the list under a different name or identifier
Format Identifier is unique
Format identifier is the commonly-used identifier for the namespace, or the best URI for the namespace, or the best Media type
Format identifier is same as the MIME media type if the mime type is specific to only this format (e.g., image/png is specific to one format, whereas text/xml is not specific to one format)
Format Name is recognizable and sensible
Format includes version info where applicable in formatName and formatId
formatType is the correct type from the values: DATA, METADATA, or RESOURCE
MediaType is the most specific MIME media type that applies to the format

Considerations

Need to confirm the mediaType. As of writing this issue, I believe that this metadata will be an .xml document.
Need to confirm the formatName - is System Metadata (HashStore) ok?
First time working with object-formats, and am likely missing a few items, but would like to get the process started to get this new format added.

Create new format-id for science-on-schema.org Dataset in JSON-LD

Format Metadata

formatId: science-on-schema.org/Dataset/1.2;ld+json
formatName: JSON-LD metadata
formatType: METADATA
mediaType: application/ld+json
extension: jsonld

Format description

Dataset resources may be described by schema.org/Dataset markup serialized as JSON-LD in dataset landing pages. In order for DataONE to collate and index such metadata it is necessary to provide a suitable formatId.

Specification / Namespace documentation

Checklist

The format is not a duplicate of another in the list under a different name or identifier
Format Identifier is unique
Format identifier is the commonly-used identifier for the namespace, or the best URI for the namespace, or the best Media type
Format identifier is same as the MIME media type if the mime type is specific to only this format (e.g., image/png is specific to one format, whereas text/xml is not specific to one format)
Format Name is recognizable and sensible
Format includes version info where applicable in formatName and formatId
formatType is the correct type from the values: DATA, METADATA, or RESOURCE
MediaType is the most specific MIME media type that applies to the format

Considerations

application/ld+json is likely the most common serialization format, though technically any of the RDF formats may be used.

The formal definition of Dataset is at https://schema.org/Dataset though science-on-schema.org provides numerous recommendations that are necessary for effective use on science data.

Versioning is necessary as recommendations change over time.

add github action for validation

We want to ensure that the format file stays valid, so add a CI test that checks this.
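As a rough idea of what such a CI check might run, here is a minimal sketch that applies a few of the checklist rules (unique formatId, a formatName present, formatType one of DATA, METADATA, or RESOURCE). It assumes the entries are un-namespaced objectFormat elements like those quoted in the WaterML issue; the function name and the shape of the returned messages are hypothetical, and a real check would probably also validate against the schema.

```python
import xml.etree.ElementTree as ET

ALLOWED_TYPES = {"DATA", "METADATA", "RESOURCE"}


def validate_format_list(xml_text):
    """Return a list of human-readable problems found in a format list."""
    problems = []
    seen_ids = set()
    root = ET.fromstring(xml_text)
    # Assumption: entries are plain <objectFormat> elements without a namespace.
    for entry in root.iter("objectFormat"):
        format_id = (entry.findtext("formatId") or "").strip()
        format_type = (entry.findtext("formatType") or "").strip()
        if not format_id:
            problems.append("entry without a formatId")
        elif format_id in seen_ids:
            problems.append(f"duplicate formatId: {format_id}")
        seen_ids.add(format_id)
        if format_type not in ALLOWED_TYPES:
            problems.append(f"{format_id}: unexpected formatType {format_type!r}")
        if not (entry.findtext("formatName") or "").strip():
            problems.append(f"{format_id}: missing formatName")
    return problems
```

A CI job could read the format file, call this function, and fail the build when the returned list is not empty.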

Esri File Geodatabase

Format Metadata

Provide the standard metadata for the proposed format, ensuring that the id and name are unique and appropriate to the version of the format being proposed.

formatId: application/vnd.gdb+zip
formatName: Esri File Geodatabase (zipped)
formatType: DATA
mediaType: application/vnd.gdb+zip
extension: zip

Format description

Describe why a new format is needed, including items such as where the format type has been encountered, what software produces it, and what software can read it.

This format is a zipped Esri file geodatabase. When unzipped, a file geodatabase is a folder whose contents comprise a file-based geodatabase. The folder name must end with the .gdb extension.

The format can store both vector and raster geospatial data. Geographic Information System (GIS) software such as ArcGIS and QGIS can read/write this format. However, reading/writing is quite limited outside of ArcGIS.

This is one of the formats that is somewhat recommended by an EDI/LTER working group developing best practices for archiving spatial data. We prefer more open formats. However, due to the utility of file geodatabases and the ubiquity of ArcGIS, we suspect many folks are storing geospatial data in this format. This format provides features not available in other geospatial formats, so export to a more open format sometimes means losing or obscuring data or functionality.

Specification / Namespace documentation

Provide the location(s) of the documentation of the format specification or the namespace for the format or vocabulary.

About: https://desktop.arcgis.com/en/arcmap/10.3/manage-data/administer-file-gdbs/file-geodatabases.htm
API for read/write: https://github.com/Esri/file-geodatabase-api

Checklist

The format is not a duplicate of another in the list under a different name or identifier
Format Identifier is unique
Format identifier is the commonly-used identifier for the namespace, or the best URI for the namespace
Format identifier is same as the MIME media type if the mime type is specific to only this format (e.g., image/png is specific to one format, whereas text/xml is not specific to one format)
Format Name is recognizable and sensible
Format Name includes version info where applicable
formatType is the correct type from the values: DATA, METADATA, or RESOURCE
MediaType is the most specific MIME media type that applies to the format
All extensions in widespread use are listed

Considerations

I used the vnd.xxx+zip pattern from the shapefile example. Someone please vet this.

create a template for new format requests

Create an issue template for new format requests. This should include all of the fields for a format, plus a checklist to be sure they have not created a duplicate, etc.

Update WaterML link in readme

The readme states:

...the formatId for WaterML is http://www.loc.gov/METS/...

I think someone accidentally copied the id from the next item down in the list.

To fix it, you could just change the link in the readme to http://www.cuahsi.org/waterML/1.0/.

But, maybe just use a different example instead? Now that I look at it, there may be some issues with the WaterML entries, which I'll post as a separate issue.

ESRI Shapefile (zipped)

Format Metadata

Provide the standard metadata for the proposed format, ensuring that the id and name are unique and appropriate to the version of the format being proposed.

formatId: application/vnd.shp+zip
formatName: Esri Shapefile (zipped)
formatType: DATA
mediaType: application/vnd.shp+zip
extension: .zip

Format description

Describe why a new format is needed, including items such as where the format type has been encountered, what software produces it, and what software can read it.

This is for a zipped shapefile directory following the specification for the ESRI Shapefile (http://en.wikipedia.org/wiki/Shapefile) format, which is a common format used for representing vector geospatial data and is defined in https://www.esri.com/library/whitepapers/pdfs/shapefile.pdf. Shapefiles are unusual because the format specification requires the use of three mandatory files (.shp, .shx, and .dbf) as well as several other optional files, all of which share the same basename and must be in the same parent directory, and which collectively constitute the "shapefile" dataset. So, the individual file that has a .shp extension is incomplete without the collection of other files in a directory that together make up a shapefile dataset. Typically, this directory is zipped up for exchange (so the zipped directory often has the .zip extension). In DataONE, many of these zipped up shapefiles are present and typed as zip files, and so are unrecognizable as the more specialized shapefile variant.

In this proposal, I suggest that we create a format for zipped shapefiles that allows this specialized variant of zip files to be recognized and registered as such. This identifier would only be used for objects that represent a zipped directory containing the files that constitute a dataset in ESRI Shapefile format, and would not be used for the individual file components of such a dataset (which each would have different types, and could be the subject of another proposal). The individual subcomponents of a Shapefile have the following assigned Media types:

application/vnd.shp: https://www.iana.org/assignments/media-types/application/vnd.shp
application/vnd.shx: https://www.iana.org/assignments/media-types/application/vnd.shx
application/vnd.dbf: https://www.iana.org/assignments/media-types/application/vnd.dbf

The Media type of a zipped shapefile is unclear from the specification. My conclusion is that it is best to give it the media type application/zip, and rely on the more specific formatId to differentiate these from other arbitrary zip files.

This format was first requested in Redmine Issue 6883 in 2015, and has been needed for a while.
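To illustrate how a zip file could be recognized as a zipped shapefile rather than an arbitrary archive, here is a minimal sketch that looks for the three mandatory files (.shp, .shx, .dbf) sharing one basename in the same directory. The function name is hypothetical, and the check says nothing about whether the files themselves are valid.

```python
import zipfile
from pathlib import PurePosixPath

REQUIRED_SUFFIXES = {".shp", ".shx", ".dbf"}


def complete_shapefile_basenames(zip_path):
    """Return the basenames in the archive that have all three mandatory files."""
    suffixes_by_stem = {}
    with zipfile.ZipFile(zip_path) as archive:
        for name in archive.namelist():
            if name.endswith("/"):
                continue  # skip directory entries
            member = PurePosixPath(name)
            # The key keeps the parent directory, so files must sit side by side.
            key = str(member.with_suffix(""))
            suffixes_by_stem.setdefault(key, set()).add(member.suffix.lower())
    return sorted(
        stem
        for stem, suffixes in suffixes_by_stem.items()
        if REQUIRED_SUFFIXES <= suffixes
    )
```

An archive for which this returns at least one basename would be a candidate for application/vnd.shp+zip; an empty result suggests a plain zip file.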

Specification / Namespace documentation

Provide the location(s) of the documentation of the format specification or the namespace for the format or vocabulary.

Specification URL: https://www.esri.com/library/whitepapers/pdfs/shapefile.pdf

Checklist

The format is not a duplicate of another in the list under a different name or identifier
Format Identifier is unique
Format identifier is the commonly-used identifier for the namespace, or the best URI for the namespace, or the best Media type
Format identifier is same as the MIME media type if the mime type is specific to only this format (e.g., image/png is specific to one format, whereas text/xml is not specific to one format)
Format Name is recognizable and sensible
Format Name includes version info where applicable
formatType is the correct type from the values: DATA, METADATA, or RESOURCE
MediaType is the most specific MIME media type that applies to the format
All extensions in widespread use are listed

Considerations

Should we also add the individual components like shp and shx?
We should list the many other geospatial data formats as well: https://gisgeography.com/gis-formats/

CF 1.5-1.8+

Format Metadata

Provide the standard metadata for the proposed format, ensuring that the id and name are unique and appropriate to the version of the format being proposed.

1.5:

formatId: CF-1.5
formatName: NetCDF Climate and Forecast Metadata Convention, version 1.5
formatType: DATA
mediaType: application/netcdf
extension: nc

1.6:

formatId: CF-1.6
formatName: NetCDF Climate and Forecast Metadata Convention, version 1.6
formatType: DATA
mediaType: application/netcdf
extension: nc

1.7:

formatId: CF-1.7
formatName: NetCDF Climate and Forecast Metadata Convention, version 1.7
formatType: DATA
mediaType: application/netcdf
extension: nc

1.8 and up:

formatId: netCDF-CF
formatName: NetCDF Climate and Forecast Metadata Convention
formatType: DATA
mediaType: application/netcdf
extension: nc

Format description

Describe why a new format is needed, including items such as where the format type has been encountered, what software produces it, and what software can read it.

CF 1.5 through 1.8 have been released. Since the list already has 1.0 through 1.4, we should add the rest. Starting with CF 1.8, the netCDF file is required to carry a Conventions attribute naming the convention used (previously it was optional). Indicating that a file is a CF file via formatName is therefore enough, and CF-aware software can determine the version by inspecting the file itself.
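The mapping proposed here can be sketched as a small function that takes the value of a file's Conventions attribute and returns one of the formatIds listed above. This is only an illustration: the function name is hypothetical, reading the attribute from the netCDF file is left out, and versions 1.0 through 1.4 are not mapped because their existing ids are not quoted in this issue.

```python
import re

# formatIds proposed in this issue for the explicitly versioned entries
VERSIONED_IDS = {"1.5": "CF-1.5", "1.6": "CF-1.6", "1.7": "CF-1.7"}


def cf_format_id(conventions):
    """Map a Conventions attribute value to a proposed formatId, or None."""
    # The attribute may name several conventions, e.g. "CF-1.8, ACDD-1.3".
    match = re.search(r"\bCF-(\d+)\.(\d+)\b", conventions or "")
    if match is None:
        return None
    major, minor = int(match.group(1)), int(match.group(2))
    version = f"{major}.{minor}"
    if version in VERSIONED_IDS:
        return VERSIONED_IDS[version]
    if (major, minor) >= (1, 8):
        return "netCDF-CF"  # generic entry for 1.8 and up
    return None  # 1.0 through 1.4 already exist in the list; not mapped here
```

Comparing the version as a pair of integers rather than as a string keeps a later release such as 1.10 in the "1.8 and up" group.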

Specification / Namespace documentation

Provide the location(s) of the documentation of the format specification or the namespace for the format or vocabulary.

http://cfconventions.org/

Checklist

The format is not a duplicate of another in the list under a different name or identifier
Format Identifier is unique
Format identifier is the commonly-used identifier for the namespace, or the best URI for the namespace, or the best Media type
Format identifier is same as the MIME media type if the mime type is specific to only this format (e.g., image/png is specific to one format, whereas text/xml is not specific to one format)
Format Name is recognizable and sensible
Format includes version info where applicable in formatName and formatId
formatType is the correct type from the values: DATA, METADATA, or RESOURCE
MediaType is the most specific MIME media type that applies to the format

Considerations

Describe or list any considerations that might impact the use of the format, or related issues that we should consider.

The 1.8 and up entry does not indicate in the format name that this is for 1.8 and up. Will this be confusing to users who are trying to choose a formatName? I was trying to be efficient and save future work with the generic 1.8 and up version, but I also have no guarantee that CF won't drop the conventions-identification requirement in the future (though I am almost certain they would not drop it).

Stepping back a bit, I question whether the CF Conventions entries are even needed. I think that when choosing a format, specifying netCDF would be enough. Software that can read netCDF files is either CF-aware or not. If it is aware, it will know what to do with the file when it sees it. If not, the end user may have to hold the software's hand a bit. Either way, I don't see how knowing ahead of time whether CF is used is helpful.

change ncml media type

For formatId http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2, do you have examples where NCML files have a .nc extension (indicating a netCDF binary file)? I thought NCML files typically had a .ncml extension. They're just XML files, so I suppose they could also have a .xml extension. Hmmm, I just thought of another question: Do you allow multiple media types or file extensions, such as .tiff and .tif?

add CodeMeta format

Format Metadata

For 1.0:

formatId: https://doi.org/10.5063/schema/codemeta-1.0
formatName: CodeMeta, version 1.0
formatType: METADATA
mediaType: application/ld+json
extension: json

For 2.0:

formatId: https://doi.org/10.5063/schema/codemeta-2.0
formatName: CodeMeta, version 2.0
formatType: METADATA
mediaType: application/ld+json
extension: json

Format description

CodeMeta includes structured metadata about software and may accompany software or scripts in a data package.

Specification / Namespace documentation

https://codemeta.github.io/

Checklist

The format is not a duplicate of another in the list under a different name or identifier
Format Identifier is unique
Format identifier is the commonly-used identifier for the namespace, or the best URI for the namespace, or the best Media type
Format identifier is same as the MIME media type if the mime type is specific to only this format (e.g., image/png is specific to one format, whereas text/xml is not specific to one format)
Format Name is recognizable and sensible
Format includes version info where applicable in formatName and formatId
formatType is the correct type from the values: DATA, METADATA, or RESOURCE
MediaType is the most specific MIME media type that applies to the format

Considerations

I'm not sure if the format identifier is appropriate, or if some URL should be used such as https://raw.githubusercontent.com/codemeta/codemeta/2.0-rc/codemeta.jsonld.

Document how to submit edits to existing formats

You may want a separate template for this, similar to the "submit a new format" template. You may also want a branch naming pattern similar to how feature_#_format is used for new formats. I think the existing branch naming pattern would also work for format edits, so maybe just think about whether the template is also good for both purposes (new vs edit).

Document how to submit new formats

Update the README to reflect the format proposal process.

bash shell scripts

Format Metadata

Provide the standard metadata for the proposed format, ensuring that the id and name are unique and appropriate to the version of the format being proposed.

formatId: application/x-sh
formatName: Bourne shell script
formatType: DATA
mediaType: application/x-sh
extension: .sh

Format description

Describe why a new format is needed, including items such as where the format type has been encountered, what software produces it, and what software can read it.

Shell scripts are commonly used by researchers to execute programs and manipulate files.

Specification / Namespace documentation

Provide the location(s) of the documentation of the format specification or the namespace for the format or vocabulary.

https://tiswww.case.edu/php/chet/bash/bashtop.html

Checklist

The format is not a duplicate of another in the list under a different name or identifier
Format Identifier is unique
Format identifier is the commonly-used identifier for the namespace, or the best URI for the namespace, or the best Media type
Format identifier is same as the MIME media type if the mime type is specific to only this format (e.g., image/png is specific to one format, whereas text/xml is not specific to one format)
Format Name is recognizable and sensible
Format includes version info where applicable in formatName and formatId
formatType is the correct type from the values: DATA, METADATA, or RESOURCE
MediaType is the most specific MIME media type that applies to the format

Considerations

Describe or list any considerations that might impact the use of the format, or related issues that we should consider.

Apache Parquet

Parquet Format

Provide the standard metadata for the proposed format, ensuring that the id and name are unique and appropriate to the version of the format being proposed.

formatId: application/vnd.apache.parquet
formatName: Apache Parquet
formatType: DATA
mediaType: application/vnd.apache.parquet
extension: parquet

Format description

Parquet is a columnar storage format that supports nested data and is becoming more commonly used in science applications. It is developed at the Apache Software Foundation (see https://parquet.apache.org/), and is used extensively in the Hadoop ecosystem. Libraries exist for Java, Python, R, and other environments.

Specification / Namespace documentation

The format is defined at https://github.com/apache/parquet-format. There is no established media type yet, so we are proposing to use the vendor-specific format for the media type.

Checklist

The format is not a duplicate of another in the list under a different name or identifier
Format Identifier is unique
Format identifier is the commonly-used identifier for the namespace, or the best URI for the namespace, or the best Media type
Format identifier is same as the MIME media type if the mime type is specific to only this format (e.g., image/png is specific to one format, whereas text/xml is not specific to one format)
Format Name is recognizable and sensible
Format includes version info where applicable in formatName and formatId
formatType is the correct type from the values: DATA, METADATA, or RESOURCE
MediaType is the most specific MIME media type that applies to the format

Considerations

It is unclear to me (due to lack of familiarity on my part) how versioning is handled in Parquet, and whether different format identifiers should be used for different Parquet versions. If anyone has an idea about that and about backwards compatibility, please provide feedback in a comment on whether we need multiple formats or not.

Object Format List for DataONE

Format Metadata

Provide the standard metadata for the proposed format, ensuring that the id and name are unique and appropriate to the version of the format being proposed.

formatId: http://ns.dataone.org/service/types/v1#ObjectFormatList
formatName: Object Format List
formatType: METADATA
mediaType: text/xml
extension: xml

Format description

When we install a new CN, the object that contains the list of object formats must be inserted before the CN can function. Currently we use the old Metacat API to insert or update it, and the object is treated as a metadata object. After we disable the old Metacat API, we will need to use DataONE cn.create to insert the document. Unless an object format identifier of type METADATA exists for this object, the CN will treat it as a data object.

Specification / Namespace documentation

The namespace is http://ns.dataone.org/service/types/v1

Provide the location(s) of the documentation of the format specification or the namespace for the format or vocabulary.

Checklist

The format is not a duplicate of another in the list under a different name or identifier
Format Identifier is unique
Format identifier is the commonly-used identifier for the namespace, or the best URI for the namespace, or the best Media type
Format identifier is same as the MIME media type if the mime type is specific to only this format (e.g., image/png is specific to one format, whereas text/xml is not specific to one format)
Format Name is recognizable and sensible
Format includes version info where applicable in formatName and formatId
formatType is the correct type from the values: DATA, METADATA, or RESOURCE
MediaType is the most specific MIME media type that applies to the format

Considerations

Describe or list any considerations that might impact the use of the format, or related issues that we should consider.

GeoJSON

Format Metadata

Provide the standard metadata for the proposed format, ensuring that the id and name are unique and appropriate to the version of the format being proposed.

2008 version:

formatId: application/geo+json-2008
formatName: GeoJSON, version GJ2008
formatType: DATA
mediaType: application/geo+json
extension: json

current version:

formatId: application/geo+json-RFC7946
formatName: GeoJSON, version RFC 7946
formatType: DATA
mediaType: application/geo+json
extension: json

Another extension is .geojson.

Format description

Describe why a new format is needed, including items such as where the format type has been encountered, what software produces it, and what software can read it.

This is an open format for storing vector geospatial data. It can be read by various geographic information system (GIS) software including ArcGIS and QGIS. There is a 2008 version, and a 2016 version (RFC 7946) that is more restrictive. Does RFC 7946 supersede the 2008 version? As far as I can tell, the 2008 version is still exported by some software (e.g., ArcGIS for Desktop).

This is one of the formats recommended by an EDI/LTER working group developing best practices for archiving spatial data.
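If both versions were registered, a submitter or tool would need some way to choose between the two formatIds. One rough heuristic, offered only as a sketch: the 2008 specification defined a top-level crs member, which RFC 7946 removed, so its presence hints at a 2008-style file. A file without crs could still have been written against either version, so the result is a guess, and the function name is hypothetical.

```python
import json


def guess_geojson_format_id(text):
    """Guess which proposed formatId fits a GeoJSON document, or None."""
    document = json.loads(text)
    if not isinstance(document, dict) or "type" not in document:
        return None  # not a GeoJSON object at all
    if "crs" in document:
        # Heuristic only: "crs" was specified in the 2008 version.
        return "application/geo+json-2008"
    # Assumption: default to the current version when nothing says otherwise.
    return "application/geo+json-RFC7946"
```

The ambiguity of the no-crs case may itself be an argument for thinking about whether two separate entries are needed.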

Specification / Namespace documentation

Provide the location(s) of the documentation of the format specification or the namespace for the format or vocabulary.

2008 Specification: https://geojson.org/geojson-spec.html
Current Specification: https://tools.ietf.org/html/rfc7946
Version differences discussed: Esri/arcgis-to-geojson-utils#21
Media type from: https://op.europa.eu/en/web/eu-vocabularies/at-concept/-/resource/authority/file-type/GEOJSON/?target=Browse

Checklist

The format is not a duplicate of another in the list under a different name or identifier
Format Identifier is unique
Format identifier is the commonly-used identifier for the namespace, or the best URI for the namespace
Format identifier is same as the MIME media type if the mime type is specific to only this format (e.g., image/png is specific to one format, whereas text/xml is not specific to one format)
Format Name is recognizable and sensible
Format Name includes version info where applicable
formatType is the correct type from the values: DATA, METADATA, or RESOURCE
MediaType is the most specific MIME media type that applies to the format
All extensions in widespread use are listed

Considerations

I got the media type from the EU vocab. Note that Library of Congress has instead application/vnd.geo+json.

Did I use the plus sign and the dash correctly in the format ID?

I don't know which extension is more commonly used for GeoJSON files, .json or .geojson.

GeoPackage

Format Metadata

Provide the standard metadata for the proposed format, ensuring that the id and name are unique and appropriate to the version of the format being proposed.

formatId: application/geopackage+sqlite3
formatName: GeoPackage Encoding Standard (OGC) Format Family
formatType: DATA
mediaType: application/geopackage+sqlite3
extension: gpkg

Format description

Describe why a new format is needed, including items such as where the format type has been encountered, what software produces it, and what software can read it.

This is one of the formats recommended by an EDI/LTER working group developing best practices for archiving spatial data. It is an OGC encoding standard for storing geospatial vector and raster data. Geographic Information System (GIS) software such as ArcGIS and QGIS can read and write it.

Specification / Namespace documentation

Provide the location(s) of the documentation of the format specification or the namespace for the format or vocabulary.

Website: https://www.geopackage.org/
OGC standard: https://www.ogc.org/standards/geopackage
Where I got the name, extension, and media type: https://www.digipres.org/formats/sources/fdd/formats/#fdd000520

Checklist

The format is not a duplicate of another in the list under a different name or identifier
Format Identifier is unique
Format identifier is the commonly-used identifier for the namespace, or the best URI for the namespace
Format identifier is same as the MIME media type if the mime type is specific to only this format (e.g., image/png is specific to one format, whereas text/xml is not specific to one format)
Format Name is recognizable and sensible
Format Name includes version info where applicable
formatType is the correct type from the values: DATA, METADATA, or RESOURCE
MediaType is the most specific MIME media type that applies to the format
All extensions in widespread use are listed

Considerations

The GeoPackage website lists application/geopackage+vnd.sqlite3 as the media type and points to IANA, but IANA has application/geopackage+sqlite3 which is what I used for this proposal.

I wasn't sure if I should include the version. It looks like OGC is at 1.2.1. I skirted around the version issue by including Format Family in the name.

I also included OGC in the name, following the Library of Congress Example.

Issues with WaterML entries

The WaterML entries are:

  <objectFormat>
    <formatId>http://www.cuahsi.org/waterML/1.0/</formatId>
    <formatName>Water Markup Language, version 1.0</formatName>
    <formatType>METADATA</formatType>
    <mediaType name="text/xml"/>
    <extension>xml</extension>
  </objectFormat>
  <objectFormat>
    <formatId>http://www.cuahsi.org/waterML/1.1/</formatId>
    <formatName>Water Markup Language, version 1.0</formatName>
    <formatType>METADATA</formatType>
    <mediaType name="text/xml"/>
    <extension>xml</extension>
  </objectFormat>

Some possible issues I noticed:

The CUAHSI URLs return 404s. I suppose purely as a format identifier, it doesn't matter? Or should they resolve to something if they appear to be a URL?
The formatIds end in a trailing slash. I was reading about the Zarr specification, in which they have a normalization rule such that trailing slashes are stripped. In Zarr's case, this helps ensure consistent behavior across storage systems, which may not be relevant for the DataONE format list. In the format list, some ids have trailing slashes while others do not. So perhaps for consistency's sake, the trailing slashes in these and all other ids should be eliminated.
The entry for id 1.1 has name 1.0.
WaterML typically includes both data and metadata. This is analogous to the CF-1.0 entry in the format list, which has a formatType of DATA. I think the WaterML entries should be changed to DATA. Unless...for reasons beyond me, whatever WaterML items you currently have in DataONE only contain metadata. It's been a while, but I think the schema would allow it; however, I don't recall seeing that done in practice.
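Two of the issues above (trailing slashes and the name/id version mismatch) can be found mechanically across the whole list. The following is a minimal sketch under the assumption that entries are un-namespaced objectFormat elements as quoted above and that format names follow the "version X.Y" wording; the function name and the finding labels are hypothetical.

```python
import re
import xml.etree.ElementTree as ET


def lint_format_entries(xml_text):
    """Return (formatId, finding) pairs for suspicious entries."""
    findings = []
    root = ET.fromstring(xml_text)
    for entry in root.iter("objectFormat"):
        format_id = entry.findtext("formatId", default="")
        format_name = entry.findtext("formatName", default="")
        if format_id.endswith("/"):
            findings.append((format_id, "trailing slash"))
        # Dotted version numbers that appear anywhere in the id, e.g. "1.1".
        id_versions = re.findall(r"\d+(?:\.\d+)+", format_id)
        name_match = re.search(r"version (\d+(?:\.\d+)+)", format_name)
        if id_versions and name_match and name_match.group(1) not in id_versions:
            findings.append((format_id, "name/id version mismatch"))
    return findings
```

Run against the two WaterML entries wrapped in a single root element, this reports a trailing slash for both ids and a version mismatch for the 1.1 entry.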

Jupyter Notebooks

Format Metadata

Provide the standard metadata for the proposed format, ensuring that the id and name are unique and appropriate to the version of the format being proposed.

formatId: application/x-ipynb+json
formatName: Jupyter Notebook
formatType: DATA
mediaType: application/x-ipynb+json
extension: ipynb

Format description

This format is commonly used by researchers to have their code and text in the same place when doing analysis. How notebooks are formatted can be found here: https://nbformat.readthedocs.io/en/latest/
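Since a notebook is a JSON document, a file can be given a quick sanity check before it is typed as application/x-ipynb+json. This is a minimal sketch, assuming the top-level keys described in the nbformat documentation (an integer nbformat plus a cells list, or a worksheets list in older notebooks); the function name is hypothetical and this is not a substitute for validating against the notebook schema.

```python
import json


def looks_like_notebook(text):
    """Return True if the text has the basic top-level shape of a notebook."""
    try:
        document = json.loads(text)
    except ValueError:
        return False  # not JSON at all
    if not isinstance(document, dict):
        return False
    # Assumption: newer notebooks use "cells", older ones use "worksheets".
    body = document.get("cells", document.get("worksheets"))
    return isinstance(document.get("nbformat"), int) and isinstance(body, list)
```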

Checklist

The format is not a duplicate of another in the list under a different name or identifier
Format Identifier is unique
Format identifier is the commonly-used identifier for the namespace, or the best URI for the namespace, or the best Media type
Format identifier is same as the MIME media type if the mime type is specific to only this format (e.g., image/png is specific to one format, whereas text/xml is not specific to one format)
Format Name is recognizable and sensible
Format includes version info where applicable in formatName and formatId
formatType is the correct type from the values: DATA, METADATA, or RESOURCE
MediaType is the most specific MIME media type that applies to the format