
environmental-monitoring

A framework for analysing metadata from the long-term environmental monitoring programmes of the Finnish research stations

The task (initial)

The Finnish research stations have collected metadata from several long-term environmental monitoring programmes.

The aim is that during the next 5 years all data from the Kilpisjärvi, Lammi and Tvärminne stations will be opened according to the FAIR principles.

Where would you start and what would be the most important steps / stages to achieve this goal?

The task (formalized)

Task: to develop a strategy for opening the data of the Kilpisjärvi, Lammi and Tvärminne stations over the next five years in accordance with the FAIR principles

FAIR principles

FAIR data is a term describing best practices for managing and sharing research data. The FAIR principles used in this project are presented in Figure 1:

Figure 1. Brief description of the FAIR principles (the picture was taken from [1], accessed 26.04.2024)

Thus, four FAIR principles are distinguished (below are short descriptions of each principle summarizing the key message):

  • Findability: for data to be findable there must be sufficient metadata and a unique and persistent identifier.
  • Accessibility: metadata and data should be readable both by humans and by machines, and they must reside in a trusted repository.
  • Interoperability: data must share a common structure, and metadata must use recognized, formal terminologies for description.
  • Reusability: data and collections must have clear usage licenses and clear provenance.

Proposal

This section presents the first version of the draft strategy that is proposed to be applied to the original metadata.

The aim: to develop a strategy for opening the data of the Kilpisjärvi, Lammi and Tvärminne stations over the next five years in accordance with the FAIR principles.

To achieve the stated objective, the following actions were performed while preparing this document:

  • Evaluate the current structure of the metadata and datasets (Data Exploration).
  • Propose changes to the structure based on the evaluation (criterion: "Interoperability").
  • Propose a plan to improve the "Accessibility" criterion by creating dataset distribution services (API, versioning, etc.).
  • Propose a plan to improve the "Findability" criterion by introducing a robust system of persistent identifiers.
  • Propose a plan to improve the "Reusability" criterion through license preparation and data provenance.

Each of the items is described in more detail below.

Data exploration

Based on a detailed comparison of the metadata columns of the different stations, the following table was prepared:

Table 1. Observed issues per metadata dataset

| Column | Issue per Kilpisjärvi station | Issue per Lammi station | Issue per Tvärminne station |
| --- | --- | --- | --- |
| Name of the dataset | | | |
| Station | | | NaN values present |
| General description of data | NaN values present | NaN values present | NaN values present |
| Location | NaN values present | NaN values present | No data in the column |
| X | No data in the column | NaN values present | No data in the column |
| Y | No data in the column | NaN values present | No data in the column |
| Coordinate system | No data in the column | NaN values present | NaN values present |
| Environment type | NaN values present | NaN values present | |
| Other type | No data in the column | Intersects with Environment type column, NaN values present | No data in the column |
| Measured variables | Duplicates due to long format | Duplicates due to long format | Duplicates due to long format |
| Unit | NaN values present and "???" | NaN values present | NaN values present |
| Spatial extent | NaN values present, not unified | NaN values present, not unified | NaN values present, not unified |
| Spatial resolution | Not unified | Not unified | NaN values present, not unified |
| Temporal extent | NaN values present, not unified | NaN values present, not unified | NaN values present, not unified |
| Starting year | NaN values present, not unified | NaN values present, not unified | |
| End year | Mixed types: integer and string | Mixed types: integer and string, NaN values present | No data in the column |
| Temporal resolution | NaN values present | NaN values present, not unified | NaN values present |
| Data type | NaN values present | NaN values present | NaN values present |
| Format | NaN values present | NaN values present | NaN values present |
| Is the dataset well described | Expected to be bool | Expected to be bool | NaN values present |
| In what databases the dataset has been described | NaN values present | NaN values present | NaN values present |
| Is the dataset regularly shared somewhere | NaN values present | NaN values present | NaN values present |
| Is the data used in a publication | NaN values present | NaN values present | NaN values present, not unified |
| Where | NaN values present | No data in the column | No data in the column |
| Owner | NaN values present and "?" | NaN values present | NaN values present |
| Other notes | | | |
| Contact person | NaN values present | NaN values present | NaN values present |
| Institution | NaN values present | No data in the column | No data in the column |
| KBS, omat muistiinpanot (own notes) | NaN values present | Column does not exist | Column does not exist |
| LBA notes | Column does not exist | NaN values present | Column does not exist |

Thus, a set of actions is required to improve the metadata structure.
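
The checks summarized in Table 1 can be reproduced programmatically. A minimal sketch with pandas is shown below; the workbook file name is an assumption, and the sheet names are taken to match the station names:

```python
import pandas as pd

# Assumed file name; each station is assumed to have its own sheet in the workbook
STATIONS = ["Kilpisjärvi", "Lammi", "Tvärminne"]

for station in STATIONS:
    # Reading an .xlsx sheet requires openpyxl
    df = pd.read_excel("stations_metadata.xlsx", sheet_name=station)

    # Count missing values per column ("NaN values present" issues)
    nan_counts = df.isna().sum()

    # Completely empty columns correspond to "No data in the column"
    empty_columns = [col for col in df.columns if df[col].isna().all()]

    # Columns mixing types (e.g. integers and "ongoing") are unification candidates
    mixed_type_columns = [
        col for col in df.columns
        if df[col].dropna().map(type).nunique() > 1
    ]

    print(station, nan_counts[nan_counts > 0], empty_columns, mixed_type_columns, sep="\n")
```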

Step 1. Interoperability

Based on the analysis of the table structure on the Kilpisjärvi, Lammi and Tvärminne sheets of the xlsx file, the following actions are proposed; they will make the metadata structure simpler, more understandable and more comparable:

  • Station column: station identifiers need to be entered into the table (for the Tvärminne station), and a separate document should be prepared listing all possible identifiers for each station abbreviation
  • General description of data column: need to fill in the gaps for the Kilpisjärvi, Lammi and Tvärminne stations
  • Location column: need to fill in the gaps for the Kilpisjärvi, Lammi and Tvärminne stations
  • X, Y and Coordinate system columns: it is proposed to use WGS84 coordinates everywhere instead of local metric systems (this will allow users to locate a point more easily in Google Maps and other map services, and such coordinates are easier to read than multi-digit metric coordinates). Regardless of whether WGS84 is adopted, the coordinates of all stations should be kept in a single coordinate reference system and projection (see the coordinate-conversion sketch after this list)
  • Environment type and Other type columns: it is proposed to merge these columns and add a category such as "other" to the Environment type column
  • Measured variables column: to save space, consider presenting the table in long format instead of wide format (see Figure 2 and the sketch below it)

Figure 2. Transformation of the Measured variables column from wide format to long format
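
A minimal pandas sketch of the transformation shown in Figure 2, assuming that the measured variables are currently stored as a delimited string in a single cell (the delimiter and column contents below are assumptions):

```python
import pandas as pd

# Toy example: one dataset row listing several measured variables in one cell (wide form)
wide = pd.DataFrame({
    "Name of the dataset": ["Lake monitoring"],
    "Measured variables": ["temperature; pH; oxygen"],
})

# Split the delimited string into a list, then explode into one row per variable (long form)
long = (
    wide.assign(**{"Measured variables": wide["Measured variables"].str.split("; ")})
        .explode("Measured variables")
        .reset_index(drop=True)
)
print(long)
```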

  • Unit column: need to fill in the gaps for the Kilpisjärvi, Lammi and Tvärminne stations. If a value is a dimensionless quantity, this must be specified explicitly; NaN should not appear in this column
  • Spatial extent column: need to fill in the gaps for the Kilpisjärvi, Lammi and Tvärminne stations, and unification is needed. This column may be redundant, since its role is already fulfilled by the X, Y and Location fields. If the column is kept, it makes sense to store the spatial coverage as bounding-box coordinates instead of place names
  • Spatial resolution column: need to fill in the gaps for the Tvärminne station. Unification is necessary, for example presenting all values in one of two units: metres or centimetres
  • Temporal extent column: need to fill in the gaps for the Kilpisjärvi, Lammi and Tvärminne stations. Unification is needed: some cells indicate the monitoring dates, others the duration
  • Starting year column: need to fill in the gaps for the Kilpisjärvi, Lammi and Tvärminne stations. Unification is needed: replace "noin 1970" ("circa 1970") with 1970 so that only integers appear in the column, etc.
  • End year column: need to fill in the gaps for the Lammi and Tvärminne stations. Unification is needed: mixing integer values and the string "ongoing" in one column should be avoided
  • Temporal resolution column: need to fill in the gaps for the Kilpisjärvi, Lammi and Tvärminne stations. A unified system for specifying the frequency of observations should be developed
  • Data type column: need to fill in the gaps for the Kilpisjärvi, Lammi and Tvärminne stations
  • Format column: need to fill in the gaps for the Kilpisjärvi, Lammi and Tvärminne stations
  • Is the dataset well described column: make the column take only two values: True or False
  • In what databases the dataset has been described column: need to fill in the gaps for the Kilpisjärvi, Lammi and Tvärminne stations
  • Is the dataset regularly shared somewhere column: make the column take only two values: True or False
  • Is the data used in a publication column: make the column take only two values: True or False
  • Where column: each cell can contain either None or a list of links to publications, so that more than one can be specified
  • Owner and other columns: need to fill in the gaps for the Kilpisjärvi, Lammi and Tvärminne stations where required
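
As referenced in the X, Y and Coordinate system item above, the conversion to WGS84 can be done with pyproj. The sketch below uses EPSG:3067 (ETRS-TM35FIN) only as an example source system and illustrative coordinate values; the actual source systems must be taken from the Coordinate system column:

```python
from pyproj import Transformer

# Example: a point given in ETRS-TM35FIN metric coordinates (EPSG:3067);
# the real source CRS must be read from the "Coordinate system" column
transformer = Transformer.from_crs("EPSG:3067", "EPSG:4326", always_xy=True)

x, y = 260000.0, 7670000.0          # easting, northing in metres (illustrative values)
lon, lat = transformer.transform(x, y)
print(f"WGS84: lat={lat:.5f}, lon={lon:.5f}")
```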

General suggestions for improving interoperability:

  • Allow datasets to be downloaded in multiple formats, or allow the user to interact with the data through a REST API. Possible formats: Parquet, CSV, Excel, JSON, GeoJSON, Geopackage, Shapefile. Converters of tabular data to such formats can be reused in other data catalogues (a converter sketch follows this list).
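
A minimal sketch of such a converter for tabular metadata, assuming pandas (Parquet export additionally needs pyarrow, Excel export needs openpyxl); GeoJSON, Geopackage and Shapefile exports would need geopandas and are omitted here:

```python
import pandas as pd

def export_dataset(df: pd.DataFrame, base_name: str) -> None:
    """Write one dataset to several interchange formats under a common base name."""
    df.to_csv(f"{base_name}.csv", index=False)
    df.to_json(f"{base_name}.json", orient="records", force_ascii=False)
    df.to_excel(f"{base_name}.xlsx", index=False)        # requires openpyxl
    df.to_parquet(f"{base_name}.parquet", index=False)   # requires pyarrow

# File and sheet names are assumptions, mirroring the metadata workbook structure
export_dataset(pd.read_excel("stations_metadata.xlsx", sheet_name="Lammi"), "lammi_metadata")
```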

All proposed changes are planned to be implemented in a step-by-step collaboration with representatives of the institutions that own the datasets that are described in this metadata.

Example of the Kilpisjärvi station metadata sample with the applied changes: https://docs.google.com/spreadsheets/d/1_a6_xrBO5CZ2aykT4Y5Dj_eqIYBIUGOE/edit?usp=sharing&ouid=117425651530373612251&rtpof=true&sd=true

Step 2. Accessibility

To make data accessible in the modern digital world, it is proposed to build a data hub ecosystem where data can be accessed in two modes:

  • Manual downloading of datasets in the desired format;
  • Data requests via a REST API

Implementing both modes of interaction requires implementing and deploying a backend service. An example of such a service, built with Python and Docker and deployed on the Heroku platform: API environmental monitoring service

Versioning should be provided for both researchers and engineers. For example, when providing API access to the data platform, semantic versioning should be applied both to the datasets and to the service API. Then, when the data structure changes, researchers using old references and methods will still be able to retrieve the same datasets.
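
A minimal sketch of a versioned FastAPI backend in the spirit of the deployed demo (the route names and the in-memory data are assumptions, not the actual service code):

```python
from fastapi import FastAPI, HTTPException

app = FastAPI(title="environmental-monitoring API", version="1.0.0")

# Illustrative in-memory store; the real service would read the curated metadata
STATIONS = {
    "kilpisjarvi": {"name": "Kilpisjärvi", "coordinate_system": "WGS84"},
    "lammi": {"name": "Lammi", "coordinate_system": "WGS84"},
    "tvarminne": {"name": "Tvärminne", "coordinate_system": "WGS84"},
}

# The "/v1" prefix carries the API version, so "/v2" can later change the
# response structure without breaking scripts that rely on the old one
@app.get("/v1/stations")
def list_stations():
    return list(STATIONS)

@app.get("/v1/stations/{station_id}")
def get_station(station_id: str):
    if station_id not in STATIONS:
        raise HTTPException(status_code=404, detail="Unknown station")
    return STATIONS[station_id]
```

Publishing breaking changes under a new prefix (e.g. /v2) while keeping /v1 alive is what lets old analysis scripts keep retrieving the same datasets.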

The following examples of data platforms can be used as references:

Step 3. Findability

Persistent identifiers are required for findability. It is proposed to provide such identifiers by assigning DOIs, which will also allow the data to be easily used and cited in scientific publications.

Examples of data platforms that use DOI generation:

The following systems are used to search for datasets (in addition to standard search engines like Google, Bing, etc.):

Therefore, it is planned to prepare the datasets in such a way that they are also easily discoverable through these platforms.
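
One common way to make datasets visible to dataset search engines is to embed schema.org/Dataset markup into each dataset landing page. A minimal sketch follows; all values (DOI, licence, coverage) are illustrative and would be filled from the curated metadata:

```python
import json

# Illustrative schema.org/Dataset record; all values are placeholders
dataset_record = {
    "@context": "https://schema.org",
    "@type": "Dataset",
    "name": "Kilpisjärvi lake monitoring",
    "description": "Long-term environmental monitoring data from the Kilpisjärvi station",
    "identifier": "https://doi.org/10.xxxx/example",   # persistent identifier (DOI)
    "license": "https://creativecommons.org/licenses/by/4.0/",
    "spatialCoverage": {"@type": "Place",
                        "geo": {"@type": "GeoCoordinates",
                                "latitude": 69.05, "longitude": 20.80}},
    "temporalCoverage": "1970/..",
}

# The JSON-LD string would be embedded in a <script type="application/ld+json"> tag
print(json.dumps(dataset_record, ensure_ascii=False, indent=2))
```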

Step 4. Reusability

The final processing activities should be the preparation of license documents and verification that the data have clear provenance. To ensure reusability, it is also necessary to describe the structure of the provided datasets in detail. This can be done by means of a single web page (as a simplified example); see the environmental-monitoring description page
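
A minimal sketch of how licence and provenance information could be attached to each metadata record; the field names, the CC BY 4.0 licence and the provenance text are only examples, and the actual licences must be agreed with the data owners:

```python
import pandas as pd

# File and sheet names are assumptions, mirroring the metadata workbook structure
df = pd.read_excel("stations_metadata.xlsx", sheet_name="Kilpisjärvi")

# Example reuse fields; actual values have to be agreed with the data owners
df["License"] = "CC BY 4.0"
df["License URL"] = "https://creativecommons.org/licenses/by/4.0/"
df["Provenance"] = (
    "Collected within the station's long-term monitoring programme; "
    "digitised and curated by the station staff (illustrative text)"
)

df.to_excel("kilpisjarvi_metadata_fair.xlsx", index=False)
```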

Conclusion

The main results of this work are:

  • A strategy for building an environmental monitoring scientific data distribution system for the Kilpisjärvi, Lammi and Tvärminne stations according to the FAIR principles;
  • Recommendations for improving the station metadata structure, together with a sample to which the changes have been applied;
  • A prototype web page describing how to retrieve data for the selected stations and under which licenses;
  • A prototype web service that provides meta-information about the stations via a REST API. On the basis of this backend service, systems for downloading the datasets and their metadata can be built.

References

  1. CCDC FAIR data principles
  2. Open and FAIR data sharing policy related to writing
  3. Common Data Elements: Standardizing Data Collection. Data Definition of FAIR Data

The document was prepared by Mikhail Sarafanov, 28.04.2024

Demo service

The current module is deployed on Heroku using the following instructions:

Deploy FastAPI on Heroku using Docker Container

```
heroku login
```

Launch the Docker daemon and then run:

```
heroku container:login
heroku container:push web --app environmental-monitoring
heroku container:release web --app environmental-monitoring
```

Swagger UI available via URL: https://environmental-monitoring-82bc1f9868c5.herokuapp.com/

  • login: university
  • password: university
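
A minimal sketch of querying the deployed demo with these credentials; the concrete data endpoints have to be checked in the Swagger UI, so the request below only fetches the documentation page at the root URL:

```python
import requests
from requests.auth import HTTPBasicAuth

BASE_URL = "https://environmental-monitoring-82bc1f9868c5.herokuapp.com"

# The root URL serves the Swagger UI; concrete data endpoints are listed there
response = requests.get(f"{BASE_URL}/", auth=HTTPBasicAuth("university", "university"))
print(response.status_code)
```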

For a local launch, run launch.py
