dataoneorg / object-formats
DataONE Object Formats controlled vocabulary
License: Apache License 2.0
https://purl.dataone.org/collections-1.1.0
Dataset collections v1.1.0
METADATA
text/xml
xml
https://purl.dataone.org/portals-1.1.0
Dataset portals v1.1.0
METADATA
text/xml
xml
This is an update to the current 1.0.0 version of the Collections and Portals schemas. The new 1.1.0 version is required to support more complex definition filters, different section types, and more.
Media type: image/png is specific to one format, whereas text/xml is not specific to one format.
Format type: DATA, METADATA, or RESOURCE.
The purl.dataone.org redirects still need to be created; see DataONEorg/dataone_purl#4.
The only entry for R Markdown says:
R Markdown file (pandoc markdown compatible)
Why is pandoc mentioned? If there are flavors of R Markdown that aren't pandoc compatible, then why isn't there an entry for vanilla R Markdown? If all R Markdown files are pandoc compatible, then why not just remove the pandoc part?
Provide the standard metadata for the proposed format, ensuring that the id and name are unique and appropriate to the version of the format being proposed.
image/geotiff
GeoTIFF
DATA
image/tiff
tiff
Being a raster GIS format, the GeoTIFF may have other files that go with it such as pyramids or metadata files. Therefore, a zip version is also provided.
image/geotiff+zip
GeoTIFF (zipped)
DATA
image/tiff+zip
zip
Describe why a new format is needed, including items such as where the format type has been encountered, what software produces it, and what software can read it.
GeoTIFF is a TIFF image with embedded georeferencing information such that the pixels of the image can be correctly located in a map in a geographic information system (GIS). This public domain metadata standard is a good choice for sharing raster GIS data, as most GIS software supports and can generate files using this format. There are numerous examples of GeoTIFFs archived in DataONE nodes.
This is one of the formats recommended by an EDI/LTER working group developing best practices for archiving spatial data.
Provide the location(s) of the documentation of the format specification or the namespace for the format or vocabulary.
Provide the standard metadata for the proposed format, ensuring that the id and name are unique and appropriate to the version of the format being proposed.
application/vnd.sqlite3
SQLite Database
DATA
application/vnd.sqlite3
db
There is currently no SQLite database format ID. There is the "GeoPackage Encoding Standard (OGC) Format Family" option, but it has a format ID of geopackage+sqlite3. This new format ID is needed for when researchers decide to archive their .db files.
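The distinction between a plain SQLite file and a GeoPackage can be sketched by reading the file header. This is a minimal illustration, not part of the proposal: the function name `guess_format_id` is hypothetical, the returned strings are the format IDs discussed in these issues, and the set of GeoPackage application_id values is an assumption based on the SQLite header layout and the GeoPackage specification.

```python
# First 16 bytes of every SQLite 3 database file.
SQLITE_MAGIC = b"SQLite format 3\x00"
# Assumed application_id values written by GeoPackage producers
# (stored big-endian at byte offset 68 of the SQLite header).
GPKG_APP_IDS = {b"GPKG", b"GP10", b"GP11"}


def guess_format_id(path):
    """Return a candidate formatId for a SQLite-based file, or None."""
    with open(path, "rb") as handle:
        header = handle.read(100)  # the SQLite header is 100 bytes long
    if len(header) < 100 or header[:16] != SQLITE_MAGIC:
        return None  # not a SQLite 3 database
    if header[68:72] in GPKG_APP_IDS:
        return "application/geopackage+sqlite3"
    return "application/vnd.sqlite3"
```

A check like this only inspects the header; it does not confirm that a file is a valid GeoPackage, only that it declares itself as one.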
Unknown
Provide the standard metadata for the proposed format, ensuring that the id and name are unique and appropriate to the version of the format being proposed.
http://ns.dataone.org/service/types/v2.0:SystemMetadata
System Metadata (HashStore)
METADATA
text/xml
None
With the Metacat back-end storage refactor, we will need a new metadata format to represent the metadata that corresponds to each data object in the HashStore. Development is ongoing and this issue should be updated as discussions take place and progress is made.
The format will be defined at http://ns.dataone.org/service/types/v2.0:SystemMetadata.
mediaType: As of writing this issue, I believe that this metadata will be an .xml document.
formatName: is System Metadata (HashStore) ok?
science-on-schema.org/Dataset/1.2;ld+json
METADATA
application/ld+json
jsonld
Dataset resources may be described by schema.org/Dataset markup serialized as JSON-LD in dataset landing pages. In order for DataONE to collate and index such metadata, it is necessary to provide a suitable formatId.
application/ld+json is likely the most common serialization format, though technically any of the RDF formats may be used.
The formal definition of Dataset is at https://schema.org/Dataset though science-on-schema.org provides numerous recommendations that are necessary for effective use on science data.
Versioning is necessary as recommendations change over time.
We want to ensure that the format file stays valid, so add a CI test that this is the case.
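One possible shape for such a check is sketched below. It is an assumption-laden illustration rather than the repository's actual test: the function name `validate_formats` is hypothetical, and it assumes the format list uses unqualified `objectFormat`, `formatId`, `formatName`, and `formatType` elements as in the WaterML entries quoted elsewhere in these issues.

```python
import xml.etree.ElementTree as ET

# Format types named in the proposal checklist.
ALLOWED_TYPES = {"DATA", "METADATA", "RESOURCE"}


def validate_formats(xml_text):
    """Return a list of problems found in an object format list document."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as exc:
        return [f"not well-formed XML: {exc}"]
    errors = []
    seen_ids = set()
    for entry in root.iter("objectFormat"):
        format_id = (entry.findtext("formatId") or "").strip()
        if not format_id:
            errors.append("entry without a formatId")
            continue
        if format_id in seen_ids:
            errors.append(f"duplicate formatId: {format_id}")
        seen_ids.add(format_id)
        if not (entry.findtext("formatName") or "").strip():
            errors.append(f"{format_id}: missing formatName")
        format_type = (entry.findtext("formatType") or "").strip()
        if format_type not in ALLOWED_TYPES:
            errors.append(f"{format_id}: unexpected formatType {format_type!r}")
    return errors
```

A CI job could read the format file, call this function, and fail the build when the returned list is non-empty. Validating against the DataONE XML schema would be a stricter alternative.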
Provide the standard metadata for the proposed format, ensuring that the id and name are unique and appropriate to the version of the format being proposed.
application/vnd.gdb+zip
Esri File Geodatabase (zipped)
DATA
application/vnd.gdb+zip
Describe why a new format is needed, including items such as where the format type has been encountered, what software produces it, and what software can read it.
This format is a zipped Esri file geodatabase. When unzipped, a file geodatabase is a folder whose contents comprise a file based geodatabase. The folder must have the .gdb extension at the end of its name.
The format can store both vector and raster geospatial data. Geographic Information System (GIS) software such as ArcGIS and QGIS can read/write this format. However, reading/writing is quite limited outside of ArcGIS.
This is one of the formats that is somewhat recommended by an EDI/LTER working group developing best practices for archiving spatial data. We prefer more open formats. However, due to the utility of file geodatabases and the ubiquity of ArcGIS, we suspect many folks are storing geospatial data in this format. This format provides features not available in other geospatial formats, so export to a more open format sometimes means losing or obscuring data or functionality.
Provide the location(s) of the documentation of the format specification or the namespace for the format or vocabulary.
About: https://desktop.arcgis.com/en/arcmap/10.3/manage-data/administer-file-gdbs/file-geodatabases.htm
API for read/write: https://github.com/Esri/file-geodatabase-api
I used the vnd.xxx+zip pattern from the shapefile example. Someone please vet this.
Create an issue template for new format requests. This should include all of the fields for a format, plus a checklist to be sure they have not created a duplicate, etc.
The readme states:
...the formatId for WaterML is http://www.loc.gov/METS/...
I think someone accidentally copied the id from the next item down in the list.
To fix it, you could just change the link in the readme to http://www.cuahsi.org/waterML/1.0/.
But, maybe just use a different example instead? Now that I look at it, there may be some issues with the WaterML entries, which I'll post as a separate issue.
Provide the standard metadata for the proposed format, ensuring that the id and name are unique and appropriate to the version of the format being proposed.
application/vnd.shp+zip
Esri Shapefile (zipped)
DATA
application/vnd.shp+zip
.zip
Describe why a new format is needed, including items such as where the format type has been encountered, what software produces it, and what software can read it.
This is for a zipped shapefile directory following the specification for the ESRI Shapefile (http://en.wikipedia.org/wiki/Shapefile) format, which is a common format used for representing vector geospatial data and is defined in https://www.esri.com/library/whitepapers/pdfs/shapefile.pdf. Shapefiles are unusual because the format specification requires the use of three mandatory files (.shp, .shx, and .dbf) as well as several other optional files, all of which share the same basename and must be in the same parent directory, and which collectively constitute the "shapefile" dataset. So, the individual file that has a .shp extension is incomplete without the collection of other files in a directory that together make up a shapefile dataset. Typically, this directory is zipped up for exchange (so the zipped directory often has the .zip extension). In DataONE, many of these zipped up shapefiles are present and typed as zip files, and so are unrecognizable as the more specialized shapefile variant.
In this proposal, I suggest that we create a format for zipped shapefiles that allows this specialized variant of zip files to be recognized and registered as such. This identifier would only be used for objects that represent a zipped directory containing the files that constitute a dataset in ESRI Shapefile format, and would not be used for the individual file components of such a dataset (which each would have different types, and could be the subject of another proposal). The individual subcomponents of a Shapefile have the following assigned Media types:
application/vnd.shp: https://www.iana.org/assignments/media-types/application/vnd.shp
application/vnd.shx: https://www.iana.org/assignments/media-types/application/vnd.shx
application/vnd.dbf: https://www.iana.org/assignments/media-types/application/vnd.dbf
The Media type of a zipped shapefile is unclear from the specification. My conclusion is that it is best to give it the media type application/zip, and rely on the more specific formatId to differentiate these from other arbitrary zip files.
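The rule that a shapefile dataset needs its three mandatory files under one basename can be checked mechanically. The following is a minimal sketch, assuming a zip archive as input; the function name `complete_shapefiles` is hypothetical and is not part of any DataONE tooling.

```python
import zipfile
from pathlib import PurePosixPath

# The three files the Shapefile specification makes mandatory.
REQUIRED = {".shp", ".shx", ".dbf"}


def complete_shapefiles(zip_file):
    """Return the basenames in a zip that have all three mandatory parts.

    `zip_file` may be a path or a binary file-like object.
    """
    parts = {}
    with zipfile.ZipFile(zip_file) as archive:
        for name in archive.namelist():
            if name.endswith("/"):
                continue  # skip directory entries
            member = PurePosixPath(name)
            # Group extensions by directory plus basename, e.g. "data/roads".
            stem = str(member.with_suffix(""))
            parts.setdefault(stem, set()).add(member.suffix.lower())
    return sorted(stem for stem, exts in parts.items() if REQUIRED <= exts)
```

An archive for which this returns an empty list would be a plain zip file rather than a candidate for the zipped shapefile formatId.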
This format was first requested in Redmine Issue 6883 in 2015, and has been needed for a while.
Provide the location(s) of the documentation of the format specification or the namespace for the format or vocabulary.
Provide the standard metadata for the proposed format, ensuring that the id and name are unique and appropriate to the version of the format being proposed.
1.5:
CF-1.5
NetCDF Climate and Forecast Metadata Convention, version 1.5
DATA
application/netcdf
nc
1.6:
CF-1.6
NetCDF Climate and Forecast Metadata Convention, version 1.6
DATA
application/netcdf
nc
1.7:
CF-1.7
NetCDF Climate and Forecast Metadata Convention, version 1.7
DATA
application/netcdf
nc
1.8 and up:
netCDF-CF
NetCDF Climate and Forecast Metadata Convention
DATA
application/netcdf
nc
Describe why a new format is needed, including items such as where the format type has been encountered, what software produces it, and what software can read it.
CF 1.5 through 1.8 are released. Since we have 1.0 through 1.4, we should add the rest. Starting with CF 1.8, a Conventions attribute indicating the convention used is required (previously it was optional) in the netCDF file, and so indicating that this is a CF file via formatName is enough, and you can leave it up to CF-aware software to interpret the version upon inspecting the file itself.
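The versioning scheme proposed here can be sketched as a mapping from the value of the Conventions attribute to a formatId. This is only an illustration of the proposal: the function name `cf_format_id` is hypothetical, reading the attribute from a netCDF file is left out, and it assumes the existing 1.0 through 1.4 entries follow the same CF-x.y pattern.

```python
import re


def cf_format_id(conventions):
    """Map a Conventions attribute value to a proposed formatId, or None."""
    # The attribute may list several conventions, e.g. "CF-1.6, ACDD-1.3".
    match = re.search(r"CF-(\d+)\.(\d+)", conventions or "")
    if match is None:
        return None  # the file does not declare CF
    version = (int(match.group(1)), int(match.group(2)))
    if version >= (1, 8):
        # Proposed generic entry for CF 1.8 and up.
        return "netCDF-CF"
    return f"CF-{version[0]}.{version[1]}"
```

Comparing the version as a tuple of integers keeps a value such as CF-1.10 on the "1.8 and up" side, which a string comparison would get wrong.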
Provide the location(s) of the documentation of the format specification or the namespace for the format or vocabulary.
Describe or list any considerations that might impact the use of the format, or related issues that we should consider.
The 1.8 and up entry does not indicate in the format name that this is for 1.8 and up. Will this be confusing to users who are trying to choose a formatName? I was trying to be efficient and save future work with the generic 1.8 and up version, but I also have no guarantee that CF won't drop the conventions-identification requirement in the future (though I am almost certain they would not drop it).
Stepping back a bit, I question whether the CF Conventions entries are even needed. I think when choosing a format, specifying netCDF would be enough. Software that can read netCDF files is either CF-aware or not. If aware, it will know what to do with the file when it sees it. If not, the end user may have to hold the software's hand a bit. But either way, I don't see how knowing ahead of time whether CF is used is helpful.
For formatId http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2, do you have examples where NCML files have a .nc extension (indicating a netCDF binary file)? I thought NCML files typically had a .ncml extension. They're just XML files, so I suppose they could also have a .xml extension. Hmmm, I just thought of another question: Do you allow multiple media types or file extensions, such as .tiff and .tif?
For 1.0:
https://doi.org/10.5063/schema/codemeta-1.0
CodeMeta, version 1.0
METADATA
application/ld+json
json
For 2.0:
https://doi.org/10.5063/schema/codemeta-2.0
CodeMeta, version 2.0
METADATA
application/ld+json
json
CodeMeta includes structured metadata about software and may accompany software or scripts in a data package.
I'm not sure if the format identifier is appropriate, or if some URL should be used such as https://raw.githubusercontent.com/codemeta/codemeta/2.0-rc/codemeta.jsonld.
You may want a separate template for this, similar to the "submit a new format" template. You may also want a branch naming pattern similar to how feature_#_format is used for new formats. I think the existing branch naming pattern would also work for format edits, so maybe just think about whether the template is also good for both purposes (new vs edit).
Update the README to reflect the format proposal process.
Provide the standard metadata for the proposed format, ensuring that the id and name are unique and appropriate to the version of the format being proposed.
application/x-sh
Bourne shell script
DATA
application/x-sh
.sh
Describe why a new format is needed, including items such as where the format type has been encountered, what software produces it, and what software can read it.
Shell scripts are commonly used by researchers to execute programs and manipulate files.
Provide the location(s) of the documentation of the format specification or the namespace for the format or vocabulary.
https://tiswww.case.edu/php/chet/bash/bashtop.html
Describe or list any considerations that might impact the use of the format, or related issues that we should consider.
Provide the standard metadata for the proposed format, ensuring that the id and name are unique and appropriate to the version of the format being proposed.
application/vnd.apache.parquet
application/vnd.apache.parquet
parquet
Parquet is a columnar storage format that supports nested data and is becoming more commonly used in science applications. It is developed at the Apache Software Foundation (see https://parquet.apache.org/), and is used extensively in the Hadoop ecosystem. Libraries exist for Java, Python, R, and other environments.
The format is defined at https://github.com/apache/parquet-format. There is no established media type yet, so we are proposing to use the vendor-specific format for the media type.
Provide the standard metadata for the proposed format, ensuring that the id and name are unique and appropriate to the version of the format being proposed.
http://ns.dataone.org/service/types/v1#ObjectFormatList
Object Format List
METADATA
text/xml
xml
When we install a new CN, we should first insert the object that contains the list of object formats, before the CN can function. Currently we use the old Metacat API to insert/update it, and the object is treated as a metadata object. However, after we disable the old Metacat API, we should use DataONE cn.create to insert the documents. If we don't have an object format identifier of a metadata type for this object, the CN will treat it as a data object.
The namespace is http://ns.dataone.org/service/types/v1
Provide the location(s) of the documentation of the format specification or the namespace for the format or vocabulary.
Describe or list any considerations that might impact the use of the format, or related issues that we should consider.
Provide the standard metadata for the proposed format, ensuring that the id and name are unique and appropriate to the version of the format being proposed.
2008 version:
application/geo+json-2008
GeoJSON, version GJ2008
DATA
application/geo+json
json
current version:
application/geo+json-RFC7946
GeoJSON, version RFC 7946
DATA
application/geo+json
json
Another extension is .geojson.
Describe why a new format is needed, including items such as where the format type has been encountered, what software produces it, and what software can read it.
This is an open format for storing vector geospatial data. It can be read by various geographic information system (GIS) software including ArcGIS and QGIS. There is a 2008 version, and then a 2016 version (RFC 7946) that is more restrictive. RFC7946 supersedes the 2008 version? The 2008 version is still exported by some software (e.g., ArcGIS for Desktop) as far as I can tell.
This is one of the formats recommended by an EDI/LTER working group developing best practices for archiving spatial data.
Provide the location(s) of the documentation of the format specification or the namespace for the format or vocabulary.
2008 Specification: https://geojson.org/geojson-spec.html
Current Specification: https://tools.ietf.org/html/rfc7946
Version differences discussed: Esri/arcgis-to-geojson-utils#21
Media type from: https://op.europa.eu/en/web/eu-vocabularies/at-concept/-/resource/authority/file-type/GEOJSON/?target=Browse
I got the media type from the EU vocab. Note that Library of Congress has instead application/vnd.geo+json.
Did I use the plus sign and the dash correctly in the format ID?
I don't know which extension is more commonly used for GeoJSON files, .json or .geojson.
Provide the standard metadata for the proposed format, ensuring that the id and name are unique and appropriate to the version of the format being proposed.
application/geopackage+sqlite3
GeoPackage Encoding Standard (OGC) Format Family
DATA
application/geopackage+sqlite3
gpkg
Describe why a new format is needed, including items such as where the format type has been encountered, what software produces it, and what software can read it.
This is one of the formats recommended by an EDI/LTER working group developing best practices for archiving spatial data. It is an OGC encoding standard for storing geospatial vector and raster data. Geographic Information System (GIS) software such as ArcGIS and QGIS can read and write it.
Provide the location(s) of the documentation of the format specification or the namespace for the format or vocabulary.
The GeoPackage website lists application/geopackage+vnd.sqlite3 as the media type and points to IANA, but IANA has application/geopackage+sqlite3, which is what I used for this proposal.
I wasn't sure if I should include the version. It looks like OGC is at 1.2.1. I skirted around the version issue by including Format Family in the name.
I also included OGC in the name, following the Library of Congress example.
The WaterML entries are:
<objectFormat>
  <formatId>http://www.cuahsi.org/waterML/1.0/</formatId>
  <formatName>Water Markup Language, version 1.0</formatName>
  <formatType>METADATA</formatType>
  <mediaType name="text/xml"/>
  <extension>xml</extension>
</objectFormat>
<objectFormat>
  <formatId>http://www.cuahsi.org/waterML/1.1/</formatId>
  <formatName>Water Markup Language, version 1.0</formatName>
  <formatType>METADATA</formatType>
  <mediaType name="text/xml"/>
  <extension>xml</extension>
</objectFormat>
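Entries like these can be cross-checked mechanically. The sketch below compares the version at the end of each formatId with the version stated in the formatName. The function name `version_mismatches` and the wrapping `objectFormatList` root element in the sample are assumptions made for illustration, and the regular expressions only cover IDs and names shaped like the two entries above.

```python
import re
import xml.etree.ElementTree as ET

# The two entries quoted above, wrapped in an assumed root element.
SAMPLE = """
<objectFormatList>
  <objectFormat>
    <formatId>http://www.cuahsi.org/waterML/1.0/</formatId>
    <formatName>Water Markup Language, version 1.0</formatName>
  </objectFormat>
  <objectFormat>
    <formatId>http://www.cuahsi.org/waterML/1.1/</formatId>
    <formatName>Water Markup Language, version 1.0</formatName>
  </objectFormat>
</objectFormatList>
"""


def version_mismatches(xml_text):
    """Return (formatId, id_version, name_version) for inconsistent entries."""
    found = []
    for entry in ET.fromstring(xml_text).iter("objectFormat"):
        format_id = entry.findtext("formatId", default="")
        name = entry.findtext("formatName", default="")
        # Version at the end of the ID, optionally followed by a slash.
        id_version = re.search(r"(\d+(?:\.\d+)+)/?$", format_id)
        # Version following the word "version" in the name.
        name_version = re.search(r"version (\d+(?:\.\d+)+)", name)
        if id_version and name_version:
            if id_version.group(1) != name_version.group(1):
                found.append((format_id, id_version.group(1), name_version.group(1)))
    return found


print(version_mismatches(SAMPLE))
```

Entries whose ID or name carries no recognizable version are skipped rather than reported.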
Some possible issues I noticed:
The 1.1 entry has the name of version 1.0.
Both entries have formatType METADATA. I think the WaterML entries should be changed to DATA. Unless...for reasons beyond me, whatever WaterML items you currently have in DataONE only contain metadata. It's been a while, but I think the schema would allow it; however, I don't recall seeing that done in practice.
Provide the standard metadata for the proposed format, ensuring that the id and name are unique and appropriate to the version of the format being proposed.
application/x-ipynb+json
Jupyter Notebook
DATA
application/x-ipynb+json
ipynb
This format is commonly used by researchers to have their code and text in the same place when doing analysis. How notebooks are formatted can be found here: https://nbformat.readthedocs.io/en/latest/