ncas-cms / canari-data
Documents and code for canari data management
How are we going to organise, document, and catalog these? Can we include names for the internal domain? What if we have many such boundaries (e.g. all the CORDEX domains)?
How do we handle CICE and NEMO area cell_methods? Neither model writes them.
We need to go through all the DIAG variables and look at whether or not there needs to be a vertical coordinate associated with them.
For example, we currently have a field that looks like this:
f.long_name, f.standard_name = ('TEMPERATURE AT 1.5M', 'air_temperature')
print(f)
Field: air_temperature (ncvar%m01s03i236_2)
-------------------------------------------
Data : air_temperature(time(10), latitude(324), longitude(432)) K
Cell methods : time(10): maximum (interval: 900 s)
Dimension coords: latitude(324) = [-89.72222137451172, ..., 89.72222137451172] degrees_north
: longitude(432) = [0.4166666567325592, ..., 359.5833435058594] degrees_east
Auxiliary coords: time(time(10)) = [1950-01-01 12:00:00, ..., 1950-01-10 12:00:00] 360_day
but if we compare it to a similar field in CMIP6, we see:
tas1.long_name, tas1.standard_name = ('Near-Surface Air Temperature','air_temperature')
print(tas1)
Field: air_temperature (ncvar%tas)
----------------------------------
Data : air_temperature(time(240), latitude(324), longitude(432)) K
Cell methods : area: time(240): mean
Dimension coords: time(240) = [1850-01-16 00:00:00, ..., 1869-12-16 00:00:00] 360_day
: latitude(324) = [-89.72222137451172, ..., 89.72223663330078] degrees_north
: longitude(432) = [0.4166666567325592, ..., 359.58331298828125] degrees_east
: height(1) = [1.5] m
Cell measures : measure:area (external variable: ncvar%areacella)
The issue at hand here is that our variable is relying on the long_name
to provide coordinate information. These sorts of variables need the appropriate height coordinate.
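A sketch of the fix in CDL terms (the variable and dimension names here are illustrative): a size-1 height coordinate attached via the coordinates attribute, as CMIP6 does:

```
double height ;
        height:standard_name = "height" ;
        height:units = "m" ;
        height:positive = "up" ;
float m01s03i236_2(time, lat, lon) ;
        m01s03i236_2:standard_name = "air_temperature" ;
        m01s03i236_2:coordinates = "height" ;
data:
 height = 1.5 ;
```

On reading, this yields the scalar height(1) = [1.5] m dimension coordinate seen in the CMIP6 tas example above.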
If we anticipate users from outside the CANARI consortium wanting access to start dumps for their regional model domains (assuming we have written boundary condition files for them), how are we going to catalog and organise them?
Here are a couple of outputs:
Field: sea_water_potential_temperature (ncvar%votemper2)
--------------------------------------------------------
Data : sea_water_potential_temperature(time(1), depth(75), ncdim%y(1207), ncdim%x(1442)) degree_C
Cell methods : time(1): mean
Dimension coords: depth(75) = [0.5057600140571594, ..., 5902.0576171875] m
Auxiliary coords: time(time(1)) = [1950-01-16 00:00:00] 360_day
: latitude(ncdim%y(1207), ncdim%x(1442)) = [[-89.5, ..., 49.99550247192383]] degrees_north
: longitude(ncdim%y(1207), ncdim%x(1442)) = [[72.75, ..., 73.0]] degrees_east
Cell measures : measure:area(ncdim%y(1207), ncdim%x(1442)) = [[1000000.0, ..., 445.6573486328125]] m2
Field: sea_water_potential_temperature (ncvar%votemper)
-------------------------------------------------------
Data : sea_water_potential_temperature(time(1), depth(75), ncdim%y(1207), ncdim%x(1442)) degree_C
Cell methods : time(1): mean
Dimension coords: depth(75) = [0.5057600140571594, ..., 5902.0576171875] m
Auxiliary coords: time(time(1)) = [1950-01-16 00:00:00] 360_day
: latitude(ncdim%y(1207), ncdim%x(1442)) = [[-89.5, ..., 49.99550247192383]] degrees_north
: longitude(ncdim%y(1207), ncdim%x(1442)) = [[72.75, ..., 73.0]] degrees_east
Cell measures : measure:area(ncdim%y(1207), ncdim%x(1442)) = [[1000000.0, ..., 445.6573486328125]] m2
On the face of it they are the same, but they actually represent two different quantities:
votemper is the monthly average of (toce * e3t) divided by the monthly average of e3t
votemper2 is the monthly average of (toce^2 * e3t) divided by the monthly average of e3t
This is the relevant XML:
<field id="e3t" long_name="Ocean Model cell Thickness" standard_name="cell_thickness" unit="m" grid_ref="grid_T_3D"/>
<field id="toce" long_name="Sea Water Potential Temperature" standard_name="sea_water_potential_temperature" unit="degree_C" grid_ref="grid_T_3D"/>
<field id="toce_e3t" long_name="temperature * e3t" unit="degree_C*m" grid_ref="grid_T_3D"> toce * e3t </field>
and
<field ts_enabled="true" field_ref="toce" name="votemper" operation="average" freq_op="1mo" > @toce_e3t / @e3t </field>
<field ts_enabled="true" field_ref="toce" name="votemper2" operation="average" freq_op="1mo" > @toce2_e3t / @e3t </field>
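To make the distinction concrete, here is a small pure-Python sketch (made-up numbers, not NEMO code), assuming from the field ids that toce2_e3t is toce^2 * e3t:

```python
# Hypothetical per-timestep samples for one grid cell over one month:
# toce = potential temperature (degC), e3t = time-varying cell thickness (m).
toce = [10.0, 10.5, 11.0]
e3t = [1.0, 1.1, 0.9]

def monthly_mean(samples):
    """Plain average over the month's samples (XIOS 'average' operation)."""
    return sum(samples) / len(samples)

# votemper = @toce_e3t / @e3t : thickness-weighted mean temperature
votemper = monthly_mean([t * h for t, h in zip(toce, e3t)]) / monthly_mean(e3t)

# votemper2 = @toce2_e3t / @e3t : thickness-weighted mean of temperature squared
votemper2 = monthly_mean([t * t * h for t, h in zip(toce, e3t)]) / monthly_mean(e3t)
```

Note that votemper2 then has units of degree_C^2, and votemper2 - votemper^2 gives a thickness-weighted within-month variance, which may be the scientific intent.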
Clearly it would help to find out what these represent scientifically first ...
Several CICE fields are wrong and have been processed through cdds to be right (?). There is some thought that the fields have been fixed in https://code.metoffice.gov.uk/svn/cice/main/branches/dev/alexwest/r400_correct_cmip6_diagnostics_take2
I'm running a test.
We want to make sure we use appropriate table names for our output files. The CMIP list is here, but we probably don't want to comply fully, not least because we have our own output frequencies. We should ensure our file names (and global attributes) include:
Draft for atmosphere (not land):
1m_pt (point data)
1m_ (monthly average data)
1d_ (daily averaged data)
1d_pt (daily point data)
Points of distinction from CMIP: everything starts with A, not just Amon ... and everything is averaged unless it includes pt.
As above, for land, but with L ...
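As a sketch of the rule (this helper and its argument names are our own invention, not project code), the draft convention could be expressed as:

```python
def table_name(realm, freq, instantaneous=False):
    """Build a draft CANARI table name.

    realm: 'atmos' -> 'A', 'land' -> 'L' (per the draft above).
    freq:  output frequency string, e.g. '1m', '1d', '6h'.
    Data is averaged unless the name includes '_pt' (point/instantaneous).
    """
    prefix = {"atmos": "A", "land": "L"}[realm]
    return prefix + freq + ("_pt" if instantaneous else "")
```

For example, table_name("atmos", "1m") gives "A1m" (a monthly average) and table_name("land", "1d", instantaneous=True) gives "L1d_pt".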
Some of the global attributes will need to be entered by the user starting the simulation, and information will need to be loaded into the further_info_url. There is also obvious scope to get initial condition and boundary files wrong in the attempt to create a new realisation.
We need some code to run at the end of the first cycle which does some basic checks, delivers a basic look at some data, and checks that it isn't the same as a previous run. This code needs to run before allowing the simulation to restart and run for the rest of its time.
This probably needs to run for both the historical and SSP runs (after the changeover).
A large number of fields which are instantaneous fields at a specific time have spurious cell methods. E.g.
Field: surface_temperature (ncvar%m01s00i507_10)
------------------------------------------------
Data : surface_temperature(time(241), latitude(324), longitude(432)) K
Cell methods : time(241): mean (interval: 900 s)
Dimension coords: latitude(324) = [-89.72222137451172, ..., 89.72222137451172] degrees_north
: longitude(432) = [0.4166666567325592, ..., 359.5833435058594] degrees_east
Auxiliary coords: time(time(241)) = [1950-01-01 01:30:00, ..., 1950-01-30 22:30:00] 360_day
This is either supposed to be hourly instantaneous data (with no time cell method), or hourly averages; it is not this!
Do we need to include bounds for the atmosphere latitude and longitude (and vertical) coordinates?
NEMO diaptr files contain zonal data that is represented as full 3-d fields: how should that be described with cell methods?
In addition they contain data for the Indian/Pacific/Atlantic basins (again zonal data represented as 3-d fields): what cell methods should those have?
I'd assumed that since the NEMO XML had standard_names, e.g.
<field field_ref="zotempac" name="zotempac" standard_name="zonal_mean_temperature_pacific" grid_ref="gznl_T_3D" >
that the standard name was correct -- not so -- many of the NEMO standard names are not valid CF standard names. I am comparing against http://cfconventions.org/Data/cf-standard-names/78/build/cf-standard-name-table.html
Not sure how to proceed -- remove all the non-standard names?
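A sketch of a checker for this (the inline name set is a tiny illustrative subset; a real check should parse the full standard-name table XML from cfconventions.org):

```python
import re

# Tiny illustrative subset of the CF standard-name table.
CF_STANDARD_NAMES = {
    "air_temperature",
    "sea_water_potential_temperature",
    "cell_thickness",
}

def non_cf_names(field_def_xml):
    """Return the standard_name values in XIOS field definitions that are
    not in the CF standard-name table."""
    names = re.findall(r'standard_name="([^"]+)"', field_def_xml)
    return sorted(set(names) - CF_STANDARD_NAMES)

example = '''
<field field_ref="zotempac" name="zotempac"
       standard_name="zonal_mean_temperature_pacific" grid_ref="gznl_T_3D" />
<field id="toce" standard_name="sea_water_potential_temperature"
       unit="degree_C" grid_ref="grid_T_3D"/>
'''
```

Run over the full field definitions, this would give us the list of names to either remove or move to long_name.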
This is the list of fields we'd need to include if we wanted to fully support CORDEX. The full list is here.
They fall into three groups:
cx-6hr (instantaneous values)
daily (these appear to be means)
monthly (averages, I think)
How many of these do we already have?
We probably want to separate the data on tapes; this needs discussing with the CEDA/JASMIN group:
Need at least:
We are, I think, running extra simulations to get extra ocean states for macro initialisation. Where are these simulations being stored? Are they being stored at all? How will we refer to them in the parent_id?
The WGCM Infrastructure Panel (WIP) is maintaining a list of "MIPs" which are tracking CMIP6 data standards and tools: https://docs.google.com/spreadsheets/d/1guq4uL68i6Y9rjTeiuTBpoyzZp2J_5tX/edit#gid=2099926125
Should we add CANARI to this list (just for the visibility)? The idea is to support some degree of consistency and exchange of information.
We do not understand why some data has auxiliary coordinates used for time.
There seems to be some unnecessary indirection arising from the way we have configured XIOS. This can't be the way it was done by IPSL in CMIP6. We need to find out if we can change some configuration to avoid this.
For example, we see:
Data : lagrangian_tendency_of_air_pressure(time(120), air_pressure(9), latitude(325), longitude(432)) Pa s-1
Cell methods : time(120): point
Dimension coords: time(120) = [1950-01-01 06:00:00, ..., 1950-02-01 00:00:00] 360_day
: air_pressure(9) = [925.0, ..., 50.0] hPa
: latitude(325) = [-90.0, ..., 90.0] degrees_north
: longitude(432) = [0.0, ..., 359.1666564941406] degrees_east
Auxiliary coords: time(time(120)) = [1950-01-01 06:00:00, ..., 1950-02-01 00:00:00] 360_day
which arises from the following netcdf layout:
dimensions:
axis_nbounds = 2 ;
lon = 432 ;
lat = 325 ;
um-atmos_PLEV9H = 9 ;
time_counter = UNLIMITED ; // (40 currently)
variables:
float lat(lat) ;
lat:axis = "Y" ;
lat:standard_name = "latitude" ;
lat:long_name = "Latitude" ;
lat:units = "degrees_north" ;
float lon(lon) ;
lon:axis = "X" ;
lon:standard_name = "longitude" ;
lon:long_name = "Longitude" ;
lon:units = "degrees_east" ;
float um-atmos_PLEV9H(um-atmos_PLEV9H) ;
um-atmos_PLEV9H:name = "um-atmos_PLEV9H" ;
um-atmos_PLEV9H:standard_name = "air_pressure" ;
um-atmos_PLEV9H:long_name = "pressure levels" ;
um-atmos_PLEV9H:units = "hPa" ;
um-atmos_PLEV9H:positive = "down" ;
double time_instant(time_counter) ;
time_instant:standard_name = "time" ;
time_instant:long_name = "Time axis" ;
time_instant:calendar = "360_day" ;
time_instant:units = "seconds since 1950-01-01 00:00:00" ;
time_instant:time_origin = "1950-01-01 00:00:00" ;
time_instant:bounds = "time_instant_bounds" ;
double time_instant_bounds(time_counter, axis_nbounds) ;
double time_counter(time_counter) ;
time_counter:axis = "T" ;
time_counter:standard_name = "time" ;
time_counter:long_name = "Time axis" ;
time_counter:calendar = "360_day" ;
time_counter:units = "seconds since 1950-01-01 00:00:00" ;
time_counter:time_origin = "1950-01-01 00:00:00" ;
time_counter:bounds = "time_counter_bounds" ;
double time_counter_bounds(time_counter, axis_nbounds) ;
float m01s30i208_2(time_counter, um-atmos_PLEV9H, lat, lon) ;
m01s30i208_2:standard_name = "lagrangian_tendency_of_air_pressure" ;
m01s30i208_2:long_name = "OMEGA ON P LEV/UV GRID" ;
m01s30i208_2:units = "Pa s-1" ;
m01s30i208_2:online_operation = "instant" ;
m01s30i208_2:interval_operation = "6 h" ;
m01s30i208_2:interval_write = "6 h" ;
m01s30i208_2:cell_methods = "time: point" ;
m01s30i208_2:_FillValue = -1.073742e+09f ;
m01s30i208_2:missing_value = -1.073742e+09f ;
m01s30i208_2:coordinates = "time_instant" ;
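If the configuration can be changed, the intended layout would presumably look more like this (a sketch of the desired CDL, not verified against the XIOS options):

```
double time_counter(time_counter) ;
        time_counter:axis = "T" ;
        time_counter:standard_name = "time" ;
        time_counter:calendar = "360_day" ;
        time_counter:units = "seconds since 1950-01-01 00:00:00" ;
        time_counter:bounds = "time_counter_bounds" ;
double time_counter_bounds(time_counter, axis_nbounds) ;
float m01s30i208_2(time_counter, um-atmos_PLEV9H, lat, lon) ;
        m01s30i208_2:standard_name = "lagrangian_tendency_of_air_pressure" ;
        m01s30i208_2:units = "Pa s-1" ;
        m01s30i208_2:cell_methods = "time: point" ;
```

i.e. no separate time_instant variable and no coordinates = "time_instant" attribute, so no duplicate auxiliary time coordinate would be created on reading.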
We are doing "historical" and "ssp370", but not quite, because our identifiers will not be from the same stable as the parent (i.e. our variant_label is from a different vocabulary). We could use those as-is, but change the mip_era or parent_id?
We also have to decide whether we want to have two sets of experiments with different attributes, e.g. historical to 2015 and ssp370 thereafter.
Whatever we do, we probably need a formal es-doc definition which describes our initialisation and duration. It might then be easier to use neither historical nor ssp370, and simply call it canari, which can then have its own document.
Ticket #3 describes a problem with some field metadata, but it also describes an aggregation problem: there are eight monthly means, each at a particular time of the day, one all-month average, and three 10-day files of averages, so there ought to be three aggregations ...
HORIZON-CL5-2023-D1-01-01: Further climate knowledge through advanced science and technologies for analysing Earth Observation and Earth System Model data
$files = index['surface_temperature:OPEN SEA SURFACE TEMP AFTER TIMESTEP']
$print(files)
['1m_0h__m01s00i507_2_195001-195001.nc', '1m_12h__m01s00i507_6_195001-195001.nc',
'1m_15h__m01s00i507_7_195001-195001.nc', '1m_18h__m01s00i507_8_195001-195001.nc',
'1m_21h__m01s00i507_9_195001-195001.nc', '1m_3h__m01s00i507_3_195001-195001.nc',
'1m_6h__m01s00i507_4_195001-195001.nc', '1m_9h__m01s00i507_5_195001-195001.nc',
'1m__m01s00i507_195001-195001.nc', '3h__m01s00i507_10_19500101-19500110.nc',
'3h__m01s00i507_10_19500111-19500120.nc', '3h__m01s00i507_10_19500121-19500130.nc']
$flds = index.get_fields('surface_temperature:OPEN SEA SURFACE TEMP AFTER TIMESTEP',
'u-cn134-1fpf/19500101T0000Z/')
It then does the aggregation to the two CF-fields that are really in play:
$print(flds)
[<CF Field: surface_temperature(time(241), latitude(324), longitude(432)) K>,
<CF Field: surface_temperature(time(8), latitude(324), longitude(432)) K>]
It appears that the one-month average has been aggregated into the 3h fields. Is this a metadata problem or a cf-python aggregation problem?
(Kudos to @jeff-cole for spotting this; I missed it, and even after he pointed it out, needed to be spoon-fed as to what the actual problem was.)
We have agreed we need the realms. For the ocean and sea ice that should be OK, but for the atmosphere we either have to split the land variables off into their own files, or do per-variable realms. The former is preferred (and can be done via the stash table).
What are we doing with cell measures, and what should we be doing?
(E.g. there is a lot of use of areacella (atmosphere) and areacello (ocean) in the CMIP6 files, a la tas:cell_measures = "area: areacella".)
We could just use a github wiki, or a github jekyll site, or stand something up at CEDA. Whatever we do needs to support errata and probably our own "ad-hoc" es-doc. Probably wants a zenodo DOI.
The following piece of code finds all the files which have the standard_name of surface_temperature and the long_name of OPEN SEA SURFACE TEMP AFTER TIMESTEP. (This is from the one-file-per-field output, but I don't think that's relevant to the problem.)
$files = index['surface_temperature:OPEN SEA SURFACE TEMP AFTER TIMESTEP']
$print(files)
['1m_0h__m01s00i507_2_195001-195001.nc', '1m_12h__m01s00i507_6_195001-195001.nc',
'1m_15h__m01s00i507_7_195001-195001.nc', '1m_18h__m01s00i507_8_195001-195001.nc',
'1m_21h__m01s00i507_9_195001-195001.nc', '1m_3h__m01s00i507_3_195001-195001.nc',
'1m_6h__m01s00i507_4_195001-195001.nc', '1m_9h__m01s00i507_5_195001-195001.nc',
'1m__m01s00i507_195001-195001.nc', '3h__m01s00i507_10_19500101-19500110.nc',
'3h__m01s00i507_10_19500111-19500120.nc', '3h__m01s00i507_10_19500121-19500130.nc']
$flds = index.get_fields('surface_temperature:OPEN SEA SURFACE TEMP AFTER TIMESTEP',
'u-cn134-1fpf/19500101T0000Z/')
It then does the aggregation to the two CF-fields that are really in play:
$print(flds)
[<CF Field: surface_temperature(time(241), latitude(324), longitude(432)) K>,
<CF Field: surface_temperature(time(8), latitude(324), longitude(432)) K>]
These two fields are:
Field: surface_temperature (ncvar%m01s00i507_2)
-----------------------------------------------
Data : surface_temperature(time(8), latitude(324), longitude(432)) K
Cell methods : time(8): point within days time(8): mean over days
Dimension coords: latitude(324) = [-89.72222137451172, ..., 89.72222137451172] degrees_north
: longitude(432) = [0.4166666567325592, ..., 359.5833435058594] degrees_east
Auxiliary coords: time(time(8)) = [1950-01-16 00:00:00, ..., 1950-01-16 21:00:00] 360_day
Field: surface_temperature (ncvar%m01s00i507_10)
------------------------------------------------
Data : surface_temperature(time(241), latitude(324), longitude(432)) K
Cell methods : time(241): mean (interval: 900 s)
Dimension coords: latitude(324) = [-89.72222137451172, ..., 89.72222137451172] degrees_north
: longitude(432) = [0.4166666567325592, ..., 359.5833435058594] degrees_east
Auxiliary coords: time(time(241)) = [1950-01-01 01:30:00, ..., 1950-01-30 22:30:00] 360_day
In both cases there should be a cell method which conforms to the relevant part of the CF conventions. All long names should be checked for such averaging and the appropriate cell methods used.
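As a sketch of such a check (the prefix-to-cell_methods mapping here is our reading of the file names above, not settled project logic):

```python
import re

def expected_time_cell_methods(filename):
    """Guess the intended time cell_methods from a CANARI file-name prefix.

    '1m_<H>h__' files are monthly means of the value at hour H each day;
    '1m__' files are plain monthly means; the intent of the '3h__' stream
    is still unclear (see the instantaneous-fields issue above), so we
    return None for it rather than guess."""
    if re.match(r"1m_\d+h__", filename):
        return "time: point within days time: mean over days"
    if filename.startswith("1m__"):
        return "time: mean"
    return None
```

Comparing this against what is actually in each file would flag the fields whose cell methods disagree with their names.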
We need to do a CF compliance exercise for the RCM data products.
I admit to being surprised that CMIP6 explicitly controls some variables to have explicit long_names AND standard_names, as does CORDEX. I somehow missed that as it happened.
Here, for example, is one of the CMIP5 (!) formal examples:
float tas(time, lat, lon) ;
tas:standard_name = "air_temperature" ;
tas:long_name = "Near-Surface Air Temperature" ;
tas:comment = "comment from CMIP5 table: near-surface
(usually, 2 meter) air temperature." ;
tas:units = "K" ;
tas:original_name = "TS" ;
tas:history = "2010-04-21T21:05:23Z altered by CMOR: Treated
scalar dimension: \'height\'. Inverted axis: lat." ;
tas:cell_methods = "time: mean" ;
tas:cell_measures = "area: areacella" ;
which is formally in the CMIP6 controlled vocabularies as:
!============
variable_entry: tas
!============
modeling_realm: atmos
!----------------------------------
! Variable attributes:
!----------------------------------
standard_name: air_temperature
units: K
cell_methods: time: mean
cell_measures: area: areacella
long_name: Near-Surface Air Temperature
!----------------------------------
We have to decide whether we care about following the CMIP6 bindings of these variables (and whether or not we want to put the CMIP6 table names and variable names anywhere in our metadata). Given we won't have files and directories with those names, there still might be benefits in putting them in per variable metadata.
We might also want to save the stash long names directly; we could do that with something like a um_context attribute.
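A sketch of what such per-variable metadata might carry (the mapping, and the attribute names cmip6_table, cmip6_variable, and um_context, are all our own invention):

```python
# Hypothetical binding from (standard_name, stash long_name) to CMIP6 names.
CMIP6_BINDING = {
    ("air_temperature", "TEMPERATURE AT 1.5M"): ("Amon", "tas"),
}

def extra_attributes(standard_name, long_name):
    """Per-variable attributes recording the stash long name and, where we
    have one, the CMIP6 table and variable-name binding."""
    attrs = {"um_context": long_name}
    binding = CMIP6_BINDING.get((standard_name, long_name))
    if binding is not None:
        attrs["cmip6_table"], attrs["cmip6_variable"] = binding
    return attrs
```

This keeps the CMIP6 names discoverable per variable even though our files and directories will not use them.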