esas2obis's Introduction

ESAS2OBIS

Rationale

This repository contains the functionality to standardize the data of the European Seabirds at Sea (ESAS) to a Darwin Core Archive that can be harvested by OBIS and GBIF.

Workflow

To republish the data:

  1. Clone this repository to your computer.
  2. Download all public ESAS data from ICES.
  3. Unzip the download and move the files to the repository in a data/raw directory. The directory (and the files it contains) is ignored by git, so you will have to create it.
  4. Open the repository in RStudio by opening the esas2obis.Rproj file.
  5. Open the Darwin Core mapping script dwc_mapping.Rmd.
  6. Click Run > Run All to transform the data to Darwin Core files using SQL. This will take a while.
  7. Verify that all steps in the mapping script ran without errors.
  8. Verify in git or GitHub Desktop that the sample data are not affected (changes would indicate updates or issues in the mapping).
  9. Upload the Darwin Core files to the VLIZ "upload" IPT.
  10. Validate the Darwin Core Archive (by EurOBIS staff).
  11. Publish the dataset to OBIS and GBIF (by EurOBIS staff).
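
Step 3 can also be scripted. The following is a minimal Python sketch, not part of the repository: the function name `unpack_esas_download` and the zip file name in the usage note are hypothetical. It only assumes that the download is a zip file and that the files belong in `data/raw` under the repository root.

```python
from pathlib import Path
import zipfile


def unpack_esas_download(zip_path, repo_root):
    """Extract an ESAS download into <repo_root>/data/raw (hypothetical helper)."""
    raw_dir = Path(repo_root) / "data" / "raw"
    # data/raw is ignored by git, so it does not exist in a fresh clone.
    raw_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(raw_dir)
    # Return the extracted names so they can be checked before running the mapping.
    return sorted(p.name for p in raw_dir.iterdir())
```

A call such as `unpack_esas_download("esas_download.zip", ".")` (file name assumed) would create the directory and list what was extracted.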

Published dataset

Darwin Core transformation

ESAS data is structured in 4 hierarchical tables: campaigns, samples, positions and observations.

Event core

The Event core contains three types of events:

  • Campaigns (type=cruise) with an eventID, date range, and remarks.
  • Samples (type=sample) with an eventID, parentEventID (the campaign), single date and remarks.
  • Positions (type=subSample) with an eventID, parentEventID (the sample), datetime and location.

The eventIDs are created by concatenating the parent identifiers, e.g. <campaignID>_<sampleID>_<positionID> for a position. This makes them unique within the dataset and easy to understand.
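
The concatenation can be illustrated with a short Python sketch. This is not the repository's SQL: the function names and the example identifiers are hypothetical, and it assumes that a sample's eventID follows the same pattern (`<campaignID>_<sampleID>`) so that it can serve as the position's parentEventID.

```python
def make_event_id(*identifiers):
    """Join identifiers from parent to child with underscores."""
    return "_".join(str(identifier) for identifier in identifiers)


def position_event(campaign_id, sample_id, position_id):
    """Return eventID and parentEventID for a position (illustrative only)."""
    return {
        "eventID": make_event_id(campaign_id, sample_id, position_id),
        # Assumption: the parent sample is identified as <campaignID>_<sampleID>.
        "parentEventID": make_event_id(campaign_id, sample_id),
    }


# Example identifiers are made up for illustration.
print(position_event(775, 1001, 50001))
```

Because every identifier carries its parents, a child event can be traced back to its sample and campaign without a lookup.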

Record-level terms such as institutionCode, datasetName, license and rightsHolder are included as well.

See the SQL file for the full transformation.

Occurrence extension

The Occurrence extension contains the observations, with the following terms:

  • eventID (the position) and occurrenceID.
  • basisOfRecord (always HumanObservation) and occurrenceStatus (always present).
  • scientificName, scientificNameID (WoRMS identifier), kingdom (always Animalia) and vernacularName.
  • individualCount, sex, lifeStage, behavior, associatedTaxa (also expressed as measurements or facts).
  • occurrenceRemarks.

The occurrenceIDs are created similarly to the eventIDs, as <campaignID>_<sampleID>_<positionID>_<observationID>.

See the SQL file for the full transformation.

Extended Measurement Or Fact extension (EMOF)

The EMOF extension contains all other ESAS data, with the following terms:

  • eventID: identifier of sample or position (there are no campaign measurements).
  • occurrenceID (where applicable): identifier of the occurrence.
  • measurementType: lowercase description of the measurement.
  • measurementTypeID (where applicable): link to a definition of the measurement. Where possible, we use the BODC Parameter Usage Vocabulary (P01) or fall back to ESAS vocabularies maintained by ICES (e.g. https://vocab.ices.dk/services/rdf/collection/UseOfBinoculars).
  • measurementValue: human readable value or description, lowercased where appropriate.
  • measurementValueID (where applicable): IRI for the value. These mostly link to values in ESAS vocabularies maintained by ICES (e.g. https://vocab.ices.dk/services/rdf/collection/UseOfBinoculars/2), except for platform code (C17), sex (S10) and life stage (S11).
  • measurementUnit (where applicable): unit of the measurement.
  • measurementUnitID: link to a definition of the unit, with XXXX for not applicable and UUUU for dimensionless (e.g. individualCount).

The ESAS terms behaviour and association can contain multiple values for a single observation. These are split into a maximum of 3 measurement or fact records.
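
As a rough illustration of that split, here is a Python sketch (the actual transformation is done in SQL). It assumes the values are separated by `~`, as ESAS does for multi-value fields, and that any value beyond the third is dropped. The function name and record layout are hypothetical.

```python
MAX_VALUES = 3  # the README states a maximum of 3 records per term


def split_multi_value(occurrence_id, measurement_type, raw_value, separator="~"):
    """Turn one multi-value ESAS field into up to MAX_VALUES EMOF-like records."""
    if not raw_value:
        return []
    values = [v.strip() for v in raw_value.split(separator) if v.strip()]
    return [
        {
            "occurrenceID": occurrence_id,
            "measurementType": measurement_type,
            "measurementValue": value,
        }
        # Assumption: values after the third are ignored.
        for value in values[:MAX_VALUES]
    ]
```

An empty or missing field yields no records, so observations without behaviour or association add nothing to the EMOF.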

See Table 1 for an overview and the SQL file for the full transformation.

Table 1: ESAS terms that are expressed as measurement or fact

table        measurement or fact     type     example
sample       platform code           vocab    BELGICA
sample       platform class          vocab    ship
sample       platform side           vocab    left
sample       platform height         number
sample       transect width          integer  300
sample       sampling method         vocab    ship-based transect method with distance estimation and snapshot for flying birds
sample       primary sampling        boolean  True
sample       target taxa             vocab    all species recorded (standard)
sample       distance bins           string   0|50|100|200|300
sample       use of binoculars       vocab    Binoculars used extensively for scanning ahead and to the side, naked eye used for close observations (e.g. for cetacean monitoring)
sample       number of observers     integer  2
position     distance                number   0.7
position     area                    number   0.21
position     wind force              vocab    moderate breeze
position     visibility              vocab    C
position     glare                   vocab    weak
position     sun angle               integer
position     cloud cover             vocab
position     precipitation           vocab    none
position     ice cover               integer  0
position     observation conditions  vocab
observation  group identifier        string   12
observation  in transect             boolean  True
observation  individual count        integer  1
observation  observation distance    vocab    100-200
observation  life stage              vocab    adult
observation  moult                   vocab    active primary moult
observation  plumage                 vocab    non-breeding (winter) plumage
observation  sex                     vocab    female
observation  travel direction        vocab    45
observation  prey                    vocab    medium fish, unidentified (ca. 2-5x bill length)
observation  association x 3         vocab    associated with observation base
observation  behaviour x 3           vocab    scavenging

Repo structure

The repository structure is based on Cookiecutter Data Science and the Checklist recipe. Files and directories indicated with GENERATED should not be edited manually.

├── README.md              : Description of this repository
├── LICENSE                : Repository license
├── esas2obis.Rproj        : RStudio project file
├── .gitignore             : Files and directories to be ignored by git
│
├── src
│   └── dwc_mapping.Rmd    : Darwin Core mapping script
│
├── sql                    : Darwin Core transformations
│   ├── dwc_event.sql
│   ├── dwc_occurrence.sql
│   └── dwc_mof.sql
│
└── data
    ├── processed          : Darwin Core output of mapping script GENERATED
    └── processed_sample   : Darwin Core sample output of mapping script for git comparison GENERATED

License

MIT License for the code and documentation in this repository. The included data is released under another license.

esas2obis's People

Contributors

peterdesmet

esas2obis's Issues

Allow to parse `~` in Association

Association is used for associatedTaxa, but the SQL currently doesn't handle ~. It remains to be seen how this data will be exported from ESAS.

P01 term for ice cover

Ice cover is defined as:

Ice coverage within the transect %

It does not match the too narrow:

These three terms have the broader concept:

Ice coverage

sea_ice_area_fraction

  • http://vocab.nerc.ac.uk/collection/P07/current/CFSN0424/
  • "Area fraction" is the fraction of a grid cell's horizontal area that has some characteristic of interest. It is evaluated as the area of interest divided by the grid cell area. It may be expressed as a fraction, a percentage, or any other dimensionless representation of a fraction. Sea ice area fraction is area of the sea surface occupied by sea ice. It is also called "sea ice concentration". "Sea ice" means all ice floating in the sea which has formed from freezing sea water, rather than by other processes such as calving of land ice to form icebergs.

The latter has a very specific definition, but mentions percentage and might actually apply. The first one is more generic and maybe easier to understand. @rubenpp7 @nicolasvanermen, which one would you pick?

P01 term for Area

Area is defined as:

Area of sea surveyed during the observation bin in km². Can only remain empty if Distance is used.

There are two existing terms that come close:

Sample area (aircraft survey)

Area (representative)

The first one looks great, but @nicolasvanermen, I assume ESAS data could use area for ship based observations as well? Or is the match close enough?

How to map multiple `rightsHolder`

@pieterprovoost @rubenpp7 some campaigns in ESAS have multiple dataRightsHolders, e.g.:

https://esas.ices.dk/api/getCampaignRecords?campaignID=775

[
  {
    "tblUploadID": 231,
    "tblCampaignID": 208521,
    "dataRightsHolder": "3269~2299",
    "country": "DE",
    "campaignID": "775",
    "dataAccess": "Public",
    "startDate": "2011-07-27",
    "endDate": "2011-08-22",
    "notes": null
  }
]

3269~2299 indicates Federal Agency for Nature Conservation (BfN) | Research and Technology Centre (Buesum) (FTZ)

Can I map both in rightsHolder (separated with |) or do you prefer that I only keep the first one?
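
To make the two options concrete, here is a small Python sketch (not the repository's SQL). The lookup only holds the two codes from the example above, and it assumes the names are listed in the same order as the codes. The function name and the `keep_all` switch are hypothetical.

```python
# Lookup limited to the two codes named in this issue; a real mapping would
# need the full list of ESAS organisations.
ORGANISATIONS = {
    "3269": "Federal Agency for Nature Conservation (BfN)",
    "2299": "Research and Technology Centre (Buesum) (FTZ)",
}


def map_rights_holder(data_rights_holder, keep_all=True):
    """Map a ~-separated dataRightsHolder value to a rightsHolder string."""
    codes = data_rights_holder.split("~")
    if not keep_all:
        codes = codes[:1]  # option 2: only keep the first rights holder
    # Unknown codes are passed through unchanged rather than guessed.
    return " | ".join(ORGANISATIONS.get(code, code) for code in codes)


print(map_rights_holder("3269~2299"))
print(map_rights_holder("3269~2299", keep_all=False))
```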

Record level terms to add

  • type: use eventType values, cruise, transect, position (list to be sent by @rubenpp7)
  • modified: not sure this info is available through the ESAS API → typically not used
  • language: could be set to en; what is the OBIS recommendation? → typically not used
  • license: https://creativecommons.org/licenses/by/4.0/ → typically not used, but populated anyway
  • rightsHolder: set to DataRightsHolder → typically not used, but populated anyway
  • accessRights: don't use
  • bibliographicCitation: don't use
  • references: not applicable on a record level
  • institutionID: could be set to EDMO endpoint? → typically not used
  • collectionID: don't use
  • datasetID: use link of original ESAS dataset: https://esas.ices.dk → typically not used, but populated anyway
  • institutionCode: ICES
  • collectionCode: could be set to ESAS; does OBIS work with virtual collections? → fine to set to ESAS
  • datasetName: European Seabirds at Sea (ESAS)
  • ownerInstitutionCode: use rightsHolder instead
  • basisOfRecord: HumanObservation
  • informationWithheld: not applicable on a record level
  • dataGeneralizations: not applicable on a record level
  • dynamicProperties: not used

P01 term for Distance

Distance is defined as:

Distance travelled during the observation bin in km. Can only remain empty if Area is used.

There are two existing terms that come close:

Distance travelled

Distance (along transect)

The first one is great, because it relates to the platform movement. But ours is within the distance bin, not since the start of the cruise. @rubenpp7 @nicolasvanermen which one should I pick, if any?

How many ~ values

@nicolasvanermen, there are 4 fields in ESAS that allow ~ separated values:

  • dataRightsHolder
  • country
  • association
  • behaviour

Is there a maximum of values to expect for these fields? E.g. is it reasonable to assume only 2 or 3 values?

P01 term for visibility

Visibility is defined as:

Visibility in kilometer, uses the Visibility vocabulary. It can use numbers or categories, e.g. A Poor (< 1 km) which can be translated to km.

There are some existing terms that come close:

Horizontal visibility in the atmosphere by scattering photometer

  • http://vocab.nerc.ac.uk/collection/P01/current/VISHOR01/
  • Maximum horizontal distance through the atmosphere at which an object can be seen and identified by the unaided eye estimated from the extinction coefficient measured by a scattering photometer.
  • Broader: Horizontal visibility

Horizontal visibility

Horizontal visibility (WMO code) in the atmosphere by visual estimation and conversion to WMO code using table 4300

Ours is also a "visual estimation", but we don't use WMO codes, so that one is not applicable I think. The first one is too narrow, because it assumes the use of a photometer. The second sounds fine?

Mapping of measurementTypeID, measurementValueID, measurementUnitID

  • All TBD in this list should be resolved.
  • All untested in this list should be resolved (or at least checked) -> Moved to other issue

Rationale

measurementType

Lowercase, space separated description, e.g. platform class.

measurementTypeID

Link to http://vocab.nerc.ac.uk/collection/P01/current/ or specific recommended vocab (http://vocab.nerc.ac.uk/collection/S11/current/) if a good applicable definition can be found there.

Otherwise, link to ICES vocab that was used (e.g. https://vocab.ices.dk/services/rdf/collection/PlatformSide). Note that these links don't provide a definition, but their URL is meaningful.

Leave empty if no good definition exists out there.

measurementValue

Give the most meaningful value, so it can be understood out of context. E.g. the full description of a term:

Binoculars used extensively for scanning ahead and to the side, naked eye used for close observations (e.g. for cetacean monitoring)

Or the translation of a value to more meaningful values (NE -> 45°).

Note that very often, the original code (e.g. NE) will be part of the URL in measurementValueID (e.g. https://vocab.ices.dk/services/rdf/collection/TravelDirection/NE), so that code is not lost.
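
A Python sketch of that idea for travel direction is shown below. It is illustrative only: the function name is hypothetical, the lookup covers just the eight compass points (the ESAS vocabulary may contain other codes), and treating N as 0 degrees is an assumption.

```python
# Assumption: only the eight compass points are covered, with N as 0 degrees.
COMPASS_DEGREES = {
    "N": 0, "NE": 45, "E": 90, "SE": 135,
    "S": 180, "SW": 225, "W": 270, "NW": 315,
}
VOCAB_BASE = "https://vocab.ices.dk/services/rdf/collection/TravelDirection/"


def travel_direction_measurement(code):
    """Express a compass code as degrees, keeping the code in the value IRI."""
    return {
        "measurementType": "travel direction",
        "measurementValue": COMPASS_DEGREES[code],  # human readable value
        "measurementValueID": VOCAB_BASE + code,    # original code is retained here
        "measurementUnit": "degrees",
    }


print(travel_direction_measurement("NE"))
```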

measurementValueID

Link to recommended vocab term (e.g. http://vocab.nerc.ac.uk/collection/S11/current/S1116/) if a good match can be found.

Otherwise, link to the term in the ICES vocab that was used (e.g. https://vocab.ices.dk/services/rdf/collection/LifeStage/3)

Leave empty if no controlled vocab was used.

measurementUnit

Lowercase unit, e.g. degrees, km2, Beaufort.

Leave empty for terms where a unit does not apply.

measurementUnitID

Look for relevant unit in P06, e.g. http://vocab.nerc.ac.uk/collection/P06/current/SQKM/

If the measurement is dimensionless (e.g. a count, Beaufort), use https://vocab.nerc.ac.uk/collection/P06/current/UUUU/

Otherwise, set to not applicable: https://vocab.nerc.ac.uk/collection/P06/current/XXXX/
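
These three rules can be summarized in a short Python sketch. The function and its parameters are hypothetical, and it uses the https form of the P06 base URL throughout.

```python
# Base URL of the P06 unit vocabulary (https form assumed for all terms).
P06 = "https://vocab.nerc.ac.uk/collection/P06/current/"


def measurement_unit_id(unit_code=None, dimensionless=False):
    """Pick a measurementUnitID following the three rules above."""
    if unit_code:
        return f"{P06}{unit_code}/"  # a relevant P06 unit exists, e.g. SQKM
    if dimensionless:
        return f"{P06}UUUU/"         # counts, Beaufort
    return f"{P06}XXXX/"             # unit not applicable


print(measurement_unit_id("SQKM"))
print(measurement_unit_id(dimensionless=True))
print(measurement_unit_id())
```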


sample: PlatformCode: ok

sample: PlatformClass: ok

sample: PlatformSide: ok

sample: PlatformHeight: untested (not in sample data)

sample: TransectWidth: ok

sample: SamplingMethod: ok

sample: PrimarySampling: ok

sample: TargetTaxa: ok

sample: DistanceBins: ok

sample: UseOfBinoculars: ok

sample: NumberOfObservers: ok


pos: Distance: ok

pos: Area: see #19

pos: WindForce: ok

pos: Visibility: see #20

pos: Glare: ok

pos: SunAngle: untested (not in sample data)

pos: CloudCover: untested (not in sample data)

pos: Precipitation: ok

pos: IceCover: ok

pos: ObservationConditions: untested (not in sample data)


obs: GroupID: ok

  • measurementTypeID: none (not in P01 or ESAS; decided not to express as events, see #12)
  • measurementValueID: none
  • measurementUnitID: none

obs: Transect: ok

obs: Count: ok

obs: ObservationDistance: ok

obs: LifeStage: ok

obs: Moult: ok

obs: Plumage: ok

obs: Sex: ok

obs: TravelDirection: ok

obs: Prey: ok

obs: Association: ok

obs: Behaviour: ok

Second paragraph of description not retained on GBIF

The dataset description consists of two paragraphs: https://www.vliz.be/imis?dasid=3117&doiid=826

European Seabirds At Sea (ESAS) assembles offshore monitoring data on ... European-wide data assembly in 1991.

ESAS data are collected by ... to download or request data.

On GBIF only the first paragraph is retained: https://www.gbif.org/dataset/3470d506-e667-4e3f-b178-819669684c05

This is not a major issue, but would it be possible to retain both paragraphs?

P01 term for cloud cover

Cloud cover is defined as:

Cloud cover expressed as x/8 (octas). Uses the CloudCover vocabulary.

There are some existing terms that come close:

Cloud cover (all clouds) in the atmosphere by visual estimation and conversion to WMO code

Cloud cover height and extent

@nicolasvanermen @rubenpp7 Since the first term specifies WMO codes (and we use octas), I think we have to go for the broader term?

Test mapping of measurements that currently have no data

The following measurements are mapped, but could not be tested because currently none of the ESAS data is using these:

sample: PlatformHeight

pos: SunAngle

pos: CloudCover

pos: ObservationConditions

Pull from source data

The script currently works from manually provided data, but should work from the ICES web services or download.

Validate generated archive

campaigns

  • rightsHolder: 31 NA → no NA

sample

  • rightsHolder: 824 NA → no NA
  • eventID + parentEventID: 793 NA → no NA

positions

  • rightsHolder: 420,418 NA → no NA
  • eventID + parentEventID: 216,016 NA → no NA
  • eventDate: 216,016 NA → no NA

observations

  • eventID: no NA
  • occurrenceID: no NA
  • scientificNameID: no NA
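
A check like this can be scripted. The Python sketch below counts missing values per column in a CSV. The function name is hypothetical, the sample rows are made up for illustration, and treating both empty strings and the literal `NA` as missing is an assumption.

```python
import csv
import io


def count_missing(rows, columns, missing=("", "NA")):
    """Count rows where each of the given columns is empty, NA or absent."""
    counts = {column: 0 for column in columns}
    for row in rows:
        for column in columns:
            value = row.get(column)
            if value is None or value.strip() in missing:
                counts[column] += 1
    return counts


# Made-up sample rows; in practice the rows would come from a generated file.
sample_csv = io.StringIO(
    "eventID,parentEventID,rightsHolder\n"
    "775_1,775,Example holder\n"
    "775_2,,NA\n"
)
print(count_missing(csv.DictReader(sample_csv), ["eventID", "parentEventID", "rightsHolder"]))
```

A column passes the check when its count is 0.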

Don't filter on `CampaignID`

Data are currently filtered throughout on CampaignID so a nice sample can be given in data/processed. This filter should be removed for production.

Checklist of mapped ESAS fields

Campaigns

  • CampaignID
  • DataAccess: will be a filter
  • StartDate
  • EndDate
  • Notes: useful?

Samples

  • CampaignID
  • SampleID
  • Date
  • PlatformCode
  • PlatformClass
  • PlatformSide
  • PlatformHeight
  • TransectWidth
  • SamplingMethod
  • PrimarySampling: useful?
  • TargetTaxa
  • DistanceBins: useful?
  • UseOfBinoculars
  • NumberOfObservers
  • Notes: useful?

Positions

  • SampleID
  • PositionID
  • Time
  • Latitude
  • Longitude
  • Distance
  • Area
  • WindForce
  • Visibility
  • Glare
  • SunAngle
  • CloudCover
  • Precipitation
  • IceCover
  • ObservationConditions

Observations

  • PositionID
  • ObservationID
  • GroupID
  • Transect
  • SpeciesCodeType: not used, service should always return aphiaID
  • SpeciesCode: rework
  • Count: occ
  • Count: mof
  • ObservationDistance
  • LifeStage: occ
  • LifeStage: mof
  • Moult
  • Plumage
  • Sex: occ
  • Sex: mof
  • TravelDirection
  • Prey
  • Association: occ
  • Association: mof
  • Behaviour: occ
  • Behaviour: mof
  • Notes: useful?

Completed project deliverables

The EMODnet Biology service provider contract between INBO and VLIZ listed 3 deliverables. Here's how these were met:

Provide an overview of a complete mapping between the ESAS database held at ICES and DwC EventCore, documented in a GitHub repository

A complete mapping was made for the ESAS database hosted at ICES following OBIS best practices. It transforms all ESAS data/fields into an Event Core, Occurrence Extension and Extended Measurement or Facts (EMOF) extension. Much of the information is expressed in the EMOF, with extensive links to BODC controlled vocabularies for measurementTypeID, measurementValueID and measurementUnitID.

The transformation is expressed in 3 SQL files and can be run in R using the dwc_mapping.Rmd script. It starts from a download of public ESAS data, which can be initiated at https://esas.ices.dk/inventory.

The mapping, code to run it and documentation are managed in this self-contained repository: https://github.com/EMODnet/esas2obis

Complete a first data transfer of ESAS data from ICES to VLIZ (through INBO) in DwC EventCore format

A first version of the dataset was delivered on 2022-12-02. This was reviewed by the EurOBIS staff and resulted in an accepted version of the dataset that was delivered on 2023-02-21: https://ipt.vliz.be/upload/resource?r=esas&v=1.3. It contains all public data held in the ESAS database at that point, amounting to:

  • Events: 1,553,035 records
  • Occurrences: 2,687,910 records
  • EMOF: 18,210,658 records

This dataset is in the process of being included in EMODnet Biology, OBIS and GBIF.

Document the data flow and all instructions for future updates of ESAS data from ICES to EurOBIS/EMODnet Biology

The steps to republish the dataset are documented in the README of the repository: https://github.com/EMODnet/esas2obis#workflow. The repository also contains a sample of the data that should remain the same. It can be used to verify that the transformation does not contain any errors.

The entire Darwin Core transformation (in addition to being documented as SQL files) is documented/summarized in the README as well: https://github.com/EMODnet/esas2obis#darwin-core-transformation
