ncas-cms / canari-data
Documents and code for canari data management
How are we going to organise, document, and catalog these? Can we include names for the internal domain? What if we have many such boundaries (e.g. all the CORDEX domains)?
How do we handle CICE and NEMO area cell_methods? Neither model writes them.
We need to go through all the DIAG variables and look at whether or not there needs to be a vertical coordinate associated with them.
For example, we currently have a field that looks like this:
f.long_name, f.standard_name = ('TEMPERATURE AT 1.5M', 'air_temperature')
print(f)
Field: air_temperature (ncvar%m01s03i236_2)
-------------------------------------------
Data : air_temperature(time(10), latitude(324), longitude(432)) K
Cell methods : time(10): maximum (interval: 900 s)
Dimension coords: latitude(324) = [-89.72222137451172, ..., 89.72222137451172] degrees_north
: longitude(432) = [0.4166666567325592, ..., 359.5833435058594] degrees_east
Auxiliary coords: time(time(10)) = [1950-01-01 12:00:00, ..., 1950-01-10 12:00:00] 360_day
but if we compare it to a similar field in CMIP6, we see:
tas1.long_name, tas1.standard_name = ('Near-Surface Air Temperature','air_temperature')
print(tas1)
Field: air_temperature (ncvar%tas)
----------------------------------
Data : air_temperature(time(240), latitude(324), longitude(432)) K
Cell methods : area: time(240): mean
Dimension coords: time(240) = [1850-01-16 00:00:00, ..., 1869-12-16 00:00:00] 360_day
: latitude(324) = [-89.72222137451172, ..., 89.72223663330078] degrees_north
: longitude(432) = [0.4166666567325592, ..., 359.58331298828125] degrees_east
: height(1) = [1.5] m
Cell measures : measure:area (external variable: ncvar%areacella)
The issue at hand here is that our variable is relying on the long_name
to provide coordinate information. These sorts of variables need the appropriate height coordinate.
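A sketch of the fix in CDL terms (the variable and dimension names here are illustrative): a size-1 height coordinate attached via the coordinates attribute, as CMIP6 does:

```
double height ;
        height:standard_name = "height" ;
        height:units = "m" ;
        height:positive = "up" ;
float m01s03i236_2(time, lat, lon) ;
        m01s03i236_2:standard_name = "air_temperature" ;
        m01s03i236_2:coordinates = "height" ;
data:
 height = 1.5 ;
```

On reading, this yields the scalar height(1) = [1.5] m dimension coordinate seen in the CMIP6 tas example above.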
If we anticipate users from outside the CANARI consortium wanting access to start dumps for their regional model domains (assuming we have written boundary condition files for them), how are we going to catalog and organise them?
Here are a couple of outputs:
Field: sea_water_potential_temperature (ncvar%votemper2)
--------------------------------------------------------
Data : sea_water_potential_temperature(time(1), depth(75), ncdim%y(1207), ncdim%x(1442)) degree_C
Cell methods : time(1): mean
Dimension coords: depth(75) = [0.5057600140571594, ..., 5902.0576171875] m
Auxiliary coords: time(time(1)) = [1950-01-16 00:00:00] 360_day
: latitude(ncdim%y(1207), ncdim%x(1442)) = [[-89.5, ..., 49.99550247192383]] degrees_north
: longitude(ncdim%y(1207), ncdim%x(1442)) = [[72.75, ..., 73.0]] degrees_east
Cell measures : measure:area(ncdim%y(1207), ncdim%x(1442)) = [[1000000.0, ..., 445.6573486328125]] m2
Field: sea_water_potential_temperature (ncvar%votemper)
-------------------------------------------------------
Data : sea_water_potential_temperature(time(1), depth(75), ncdim%y(1207), ncdim%x(1442)) degree_C
Cell methods : time(1): mean
Dimension coords: depth(75) = [0.5057600140571594, ..., 5902.0576171875] m
Auxiliary coords: time(time(1)) = [1950-01-16 00:00:00] 360_day
: latitude(ncdim%y(1207), ncdim%x(1442)) = [[-89.5, ..., 49.99550247192383]] degrees_north
: longitude(ncdim%y(1207), ncdim%x(1442)) = [[72.75, ..., 73.0]] degrees_east
Cell measures : measure:area(ncdim%y(1207), ncdim%x(1442)) = [[1000000.0, ..., 445.6573486328125]] m2
On the face of it they are the same, but they actually represent two different quantities:
votemper is the monthly average of (toce * e3t) divided by the monthly average of e3t
votemper2 is the monthly average of (toce^2 * e3t) divided by the monthly average of e3t
This is the relevant XML:
<field id="e3t" long_name="Ocean Model cell Thickness" standard_name="cell_thickness" unit="m" grid_ref="grid_T_3D"/>
<field id="toce" long_name="Sea Water Potential Temperature" standard_name="sea_water_potential_temperature" unit="degree_C" grid_ref="grid_T_3D"/>
<field id="toce_e3t" long_name="temperature * e3t" unit="degree_C*m" grid_ref="grid_T_3D"> toce * e3t </field>
and
<field ts_enabled="true" field_ref="toce" name="votemper" operation="average" freq_op="1mo" > @toce_e3t / @e3t </field>
<field ts_enabled="true" field_ref="toce" name="votemper2" operation="average" freq_op="1mo" > @toce2_e3t / @e3t </field>
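To make the distinction concrete, here is a small pure-Python sketch (made-up numbers, not NEMO code), assuming from the field ids that toce2_e3t is toce^2 * e3t:

```python
# Hypothetical per-timestep samples for one grid cell over one month:
# toce = potential temperature (degC), e3t = time-varying cell thickness (m).
toce = [10.0, 10.5, 11.0]
e3t = [1.0, 1.1, 0.9]

def monthly_mean(samples):
    """Plain average over the month's samples (XIOS 'average' operation)."""
    return sum(samples) / len(samples)

# votemper = @toce_e3t / @e3t : thickness-weighted mean temperature
votemper = monthly_mean([t * h for t, h in zip(toce, e3t)]) / monthly_mean(e3t)

# votemper2 = @toce2_e3t / @e3t : thickness-weighted mean of temperature squared
votemper2 = monthly_mean([t * t * h for t, h in zip(toce, e3t)]) / monthly_mean(e3t)
```

Note that votemper2 then has units of degree_C^2, and votemper2 - votemper^2 gives a thickness-weighted within-month variance, which may be the scientific intent.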
Clearly it would help to find out what these represent scientifically first ...
Several CICE fields are wrong and have been processed through cdds to be right (?). There is some thought that the fields have been fixed in https://code.metoffice.gov.uk/svn/cice/main/branches/dev/alexwest/r400_correct_cmip6_diagnostics_take2
I'm running a test.
We want to make sure we use appropriate table names for our output files. The CMIP list is here, but we probably don't want to comply fully, not least because we have our own output frequencies. We should ensure our file names (and global attributes) include:
Draft for atmosphere (not land):
1m_pt (point data)
1m_ (monthly average data)
1d_ (daily averaged data)
1d_pt (daily point data)
Points of distinction from CMIP: everything starts with A, not just Amon ... and everything is averaged unless it includes pt.
As above, for land, but with L ...
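As a sketch of the rule (this helper and its argument names are our own invention, not project code), the draft convention could be expressed as:

```python
def table_name(realm, freq, instantaneous=False):
    """Build a draft CANARI table name.

    realm: 'atmos' -> 'A', 'land' -> 'L' (per the draft above).
    freq:  output frequency string, e.g. '1m', '1d', '6h'.
    Data is averaged unless the name includes '_pt' (point/instantaneous).
    """
    prefix = {"atmos": "A", "land": "L"}[realm]
    return prefix + freq + ("_pt" if instantaneous else "")
```

For example, table_name("atmos", "1m") gives "A1m" (a monthly average) and table_name("land", "1d", instantaneous=True) gives "L1d_pt".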
Some of the global attributes will need to be entered by the user starting the simulation, and information will need to be loaded into the further_info_url. There is also obvious scope to get initial condition and boundary files wrong in the attempt to create a new realisation.
We need some code to run at the end of the first cycle which does some basic checks, delivers a basic look at some data, and checks that it isn't the same as a previous run. This code needs to run before allowing the simulation to restart and run for the rest of its time.
This probably needs to run for both the historical and SSP runs (after the changeover).
A large number of fields which are instantaneous fields at a specific time have spurious cell methods. E.g.
Field: surface_temperature (ncvar%m01s00i507_10)
------------------------------------------------
Data : surface_temperature(time(241), latitude(324), longitude(432)) K
Cell methods : time(241): mean (interval: 900 s)
Dimension coords: latitude(324) = [-89.72222137451172, ..., 89.72222137451172] degrees_north
: longitude(432) = [0.4166666567325592, ..., 359.5833435058594] degrees_east
Auxiliary coords: time(time(241)) = [1950-01-01 01:30:00, ..., 1950-01-30 22:30:00] 360_day
This is either supposed to be hourly instantaneous data (with no time cell method), or hourly averages; it is not this!
Do we need to include bounds for the atmosphere latitude and longitude (and vertical) coordinates?
NEMO diaptr files contain zonal data that is represented as full 3-d fields: how should that be described with cell methods?
In addition they contain data for the Indian/Pacific/Atlantic basins (again zonal data represented as 3-d fields): what cell methods should those have?
I'd assumed that since the NEMO XML had standard_names, e.g.
<field field_ref="zotempac" name="zotempac" standard_name="zonal_mean_temperature_pacific" grid_ref="gznl_T_3D" >
that the standard name was correct -- not so -- many of the NEMO standard names are not valid CF standard names. I am comparing against http://cfconventions.org/Data/cf-standard-names/78/build/cf-standard-name-table.html
Not sure how to proceed -- remove all the non-standard names?
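A sketch of a checker for this (the inline name set is a tiny illustrative subset; a real check should parse the full standard-name table XML from cfconventions.org):

```python
import re

# Tiny illustrative subset of the CF standard-name table.
CF_STANDARD_NAMES = {
    "air_temperature",
    "sea_water_potential_temperature",
    "cell_thickness",
}

def non_cf_names(field_def_xml):
    """Return the standard_name values in XIOS field definitions that are
    not in the CF standard-name table."""
    names = re.findall(r'standard_name="([^"]+)"', field_def_xml)
    return sorted(set(names) - CF_STANDARD_NAMES)

example = '''
<field field_ref="zotempac" name="zotempac"
       standard_name="zonal_mean_temperature_pacific" grid_ref="gznl_T_3D" />
<field id="toce" standard_name="sea_water_potential_temperature"
       unit="degree_C" grid_ref="grid_T_3D"/>
'''
```

Run over the full field definitions, this would give us the list of names to either remove or move to long_name.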
This is the list of fields we'd need to include if we wanted to fully support CORDEX. The full list is here.
They fall into three groups:
cx-6hr (instantaneous values)
daily (these appear to be means)
monthly (averages, I think)
How many of these do we already have?
We probably want to separate the data on tapes; this needs discussing with the CEDA/JASMIN group:
Need at least:
We are, I think, running extra simulations to get extra ocean states for macro initialisation. Where are these simulations being stored? Are they being stored at all? How will we refer to them in the parent_id?
The WGCM Infrastructure Panel (WIP) is maintaining a list of "MIPs" which are tracking CMIP6 data standards and tools: https://docs.google.com/spreadsheets/d/1guq4uL68i6Y9rjTeiuTBpoyzZp2J_5tX/edit#gid=2099926125
Should we add CANARI to this list (just for the visibility)? The idea is to support some degree of consistency and exchange of information.
We do not understand why some data has auxiliary coordinates used for time.
There seems to be some unnecessary indirection arising from the way we have configured XIOS. This can't be the way it was done by IPSL in CMIP6. We need to find out if we can change some configuration to avoid this.
For example, we see:
Data : lagrangian_tendency_of_air_pressure(time(120), air_pressure(9), latitude(325), longitude(432)) Pa s-1
Cell methods : time(120): point
Dimension coords: time(120) = [1950-01-01 06:00:00, ..., 1950-02-01 00:00:00] 360_day
: air_pressure(9) = [925.0, ..., 50.0] hPa
: latitude(325) = [-90.0, ..., 90.0] degrees_north
: longitude(432) = [0.0, ..., 359.1666564941406] degrees_east
Auxiliary coords: time(time(120)) = [1950-01-01 06:00:00, ..., 1950-02-01 00:00:00] 360_day
which arises from the following netcdf layout:
dimensions:
axis_nbounds = 2 ;
lon = 432 ;
lat = 325 ;
um-atmos_PLEV9H = 9 ;
time_counter = UNLIMITED ; // (40 currently)
variables:
float lat(lat) ;
lat:axis = "Y" ;
lat:standard_name = "latitude" ;
lat:long_name = "Latitude" ;
lat:units = "degrees_north" ;
float lon(lon) ;
lon:axis = "X" ;
lon:standard_name = "longitude" ;
lon:long_name = "Longitude" ;
lon:units = "degrees_east" ;
float um-atmos_PLEV9H(um-atmos_PLEV9H) ;
um-atmos_PLEV9H:name = "um-atmos_PLEV9H" ;
um-atmos_PLEV9H:standard_name = "air_pressure" ;
um-atmos_PLEV9H:long_name = "pressure levels" ;
um-atmos_PLEV9H:units = "hPa" ;
um-atmos_PLEV9H:positive = "down" ;
double time_instant(time_counter) ;
time_instant:standard_name = "time" ;
time_instant:long_name = "Time axis" ;
time_instant:calendar = "360_day" ;
time_instant:units = "seconds since 1950-01-01 00:00:00" ;
time_instant:time_origin = "1950-01-01 00:00:00" ;
time_instant:bounds = "time_instant_bounds" ;
double time_instant_bounds(time_counter, axis_nbounds) ;
double time_counter(time_counter) ;
time_counter:axis = "T" ;
time_counter:standard_name = "time" ;
time_counter:long_name = "Time axis" ;
time_counter:calendar = "360_day" ;
time_counter:units = "seconds since 1950-01-01 00:00:00" ;
time_counter:time_origin = "1950-01-01 00:00:00" ;
time_counter:bounds = "time_counter_bounds" ;
double time_counter_bounds(time_counter, axis_nbounds) ;
float m01s30i208_2(time_counter, um-atmos_PLEV9H, lat, lon) ;
m01s30i208_2:standard_name = "lagrangian_tendency_of_air_pressure" ;
m01s30i208_2:long_name = "OMEGA ON P LEV/UV GRID" ;
m01s30i208_2:units = "Pa s-1" ;
m01s30i208_2:online_operation = "instant" ;
m01s30i208_2:interval_operation = "6 h" ;
m01s30i208_2:interval_write = "6 h" ;
m01s30i208_2:cell_methods = "time: point" ;
m01s30i208_2:_FillValue = -1.073742e+09f ;
m01s30i208_2:missing_value = -1.073742e+09f ;
m01s30i208_2:coordinates = "time_instant" ;
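If the configuration can be changed, the intended layout would presumably look more like this (a sketch of the desired CDL, not verified against the XIOS options):

```
double time_counter(time_counter) ;
        time_counter:axis = "T" ;
        time_counter:standard_name = "time" ;
        time_counter:calendar = "360_day" ;
        time_counter:units = "seconds since 1950-01-01 00:00:00" ;
        time_counter:bounds = "time_counter_bounds" ;
double time_counter_bounds(time_counter, axis_nbounds) ;
float m01s30i208_2(time_counter, um-atmos_PLEV9H, lat, lon) ;
        m01s30i208_2:standard_name = "lagrangian_tendency_of_air_pressure" ;
        m01s30i208_2:units = "Pa s-1" ;
        m01s30i208_2:cell_methods = "time: point" ;
```

i.e. no separate time_instant variable and no coordinates = "time_instant" attribute, so no duplicate auxiliary time coordinate would be created on reading.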
We are doing "historical" and "ssp370", but not quite, because our identifiers will not be from the same stable as the parent (i.e. our variant_label is from a different vocabulary). We could use those as-is, but change the mip_era or parent_id?
We also have to decide whether we want to have two sets of experiments with different attributes, e.g. historical to 2015 and ssp370 thereafter.
Whatever we do, we probably need a formal es-doc definition which describes our initialisation and duration. It might then be easier to use neither historical nor ssp370, and simply call it canari, which can then have its own document.
Ticket #3 describes a problem with some field metadata, but it also describes an aggregation problem: there are eight monthly means, each at a particular time of the day, one all-month average, and three 10-day files of averages, so there ought to be three aggregations ...
HORIZON-CL5-2023-D1-01-01: Further climate knowledge through advanced science and technologies for analysing Earth Observation and Earth System Model data
$files = index['surface_temperature:OPEN SEA SURFACE TEMP AFTER TIMESTEP']
$print(files)
['1m_0h__m01s00i507_2_195001-195001.nc', '1m_12h__m01s00i507_6_195001-195001.nc',
'1m_15h__m01s00i507_7_195001-195001.nc', '1m_18h__m01s00i507_8_195001-195001.nc',
'1m_21h__m01s00i507_9_195001-195001.nc', '1m_3h__m01s00i507_3_195001-195001.nc',
'1m_6h__m01s00i507_4_195001-195001.nc', '1m_9h__m01s00i507_5_195001-195001.nc',
'1m__m01s00i507_195001-195001.nc', '3h__m01s00i507_10_19500101-19500110.nc',
'3h__m01s00i507_10_19500111-19500120.nc', '3h__m01s00i507_10_19500121-19500130.nc']
$flds = index.get_fields('surface_temperature:OPEN SEA SURFACE TEMP AFTER TIMESTEP',
'u-cn134-1fpf/19500101T0000Z/')
It then does the aggregation to the two CF-fields that are really in play:
$print(flds)
[<CF Field: surface_temperature(time(241), latitude(324), longitude(432)) K>,
<CF Field: surface_temperature(time(8), latitude(324), longitude(432)) K>]
It appears that the one-month average has been aggregated into the 3h fields. Is this a metadata problem or a cf-python aggregation problem?
(Kudos to @jeff-cole for spotting this; I missed it, and even after he pointed it out, needed to be spoon-fed as to what the actual problem was.)
We have agreed we need the realms. For the ocean and sea ice that should be OK, but for the atmosphere we either have to split the land variables off into their own files, or do per-variable realms. The former is preferred (and can be done via the stash table).
What are we doing with cell measures, and what should we be doing?
(E.g. there is a lot of use of areacella (atmosphere) and areacello (ocean) in the CMIP6 files, a la tas:cell_measures = "area: areacella".)
We could just use a github wiki, or a github jekyll site, or stand something up at CEDA. Whatever we do needs to support errata and probably our own "ad-hoc" es-doc. Probably wants a zenodo DOI.
The following piece of code finds all the files which have the standard_name of surface_temperature and the long_name of OPEN SEA SURFACE TEMP AFTER TIMESTEP. (This is from the one-file-per-field output, but I don't think that's relevant to the problem.)
$files = index['surface_temperature:OPEN SEA SURFACE TEMP AFTER TIMESTEP']
$print(files)
['1m_0h__m01s00i507_2_195001-195001.nc', '1m_12h__m01s00i507_6_195001-195001.nc',
'1m_15h__m01s00i507_7_195001-195001.nc', '1m_18h__m01s00i507_8_195001-195001.nc',
'1m_21h__m01s00i507_9_195001-195001.nc', '1m_3h__m01s00i507_3_195001-195001.nc',
'1m_6h__m01s00i507_4_195001-195001.nc', '1m_9h__m01s00i507_5_195001-195001.nc',
'1m__m01s00i507_195001-195001.nc', '3h__m01s00i507_10_19500101-19500110.nc',
'3h__m01s00i507_10_19500111-19500120.nc', '3h__m01s00i507_10_19500121-19500130.nc']
$flds = index.get_fields('surface_temperature:OPEN SEA SURFACE TEMP AFTER TIMESTEP',
'u-cn134-1fpf/19500101T0000Z/')
It then does the aggregation to the two CF-fields that are really in play:
$print(flds)
[<CF Field: surface_temperature(time(241), latitude(324), longitude(432)) K>,
<CF Field: surface_temperature(time(8), latitude(324), longitude(432)) K>]
These two fields are:
Field: surface_temperature (ncvar%m01s00i507_2)
-----------------------------------------------
Data : surface_temperature(time(8), latitude(324), longitude(432)) K
Cell methods : time(8): point within days time(8): mean over days
Dimension coords: latitude(324) = [-89.72222137451172, ..., 89.72222137451172] degrees_north
: longitude(432) = [0.4166666567325592, ..., 359.5833435058594] degrees_east
Auxiliary coords: time(time(8)) = [1950-01-16 00:00:00, ..., 1950-01-16 21:00:00] 360_day
Field: surface_temperature (ncvar%m01s00i507_10)
------------------------------------------------
Data : surface_temperature(time(241), latitude(324), longitude(432)) K
Cell methods : time(241): mean (interval: 900 s)
Dimension coords: latitude(324) = [-89.72222137451172, ..., 89.72222137451172] degrees_north
: longitude(432) = [0.4166666567325592, ..., 359.5833435058594] degrees_east
Auxiliary coords: time(time(241)) = [1950-01-01 01:30:00, ..., 1950-01-30 22:30:00] 360_day
In both cases there should be a cell method which conforms to the relevant part of the CF conventions. All long names should be checked for such averaging and the appropriate cell methods used.
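As a sketch of such a check (the prefix-to-cell_methods mapping here is our reading of the file names above, not settled project logic):

```python
import re

def expected_time_cell_methods(filename):
    """Guess the intended time cell_methods from a CANARI file-name prefix.

    '1m_<H>h__' files are monthly means of the value at hour H each day;
    '1m__' files are plain monthly means; the intent of the '3h__' stream
    is still unclear (see the instantaneous-fields issue above), so we
    return None for it rather than guess."""
    if re.match(r"1m_\d+h__", filename):
        return "time: point within days time: mean over days"
    if filename.startswith("1m__"):
        return "time: mean"
    return None
```

Comparing this against what is actually in each file would flag the fields whose cell methods disagree with their names.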
We need to do a CF compliance exercise for the RCM data products.
I admit to being surprised that CMIP6 explicitly controls some variables to have explicit long_names AND standard_names, as does CORDEX. I somehow missed that as it happened.
Here, for example, is one of the CMIP5 (!) formal examples:
float tas(time, lat, lon) ;
tas:standard_name = "air_temperature" ;
tas:long_name = "Near-Surface Air Temperature" ;
tas:comment = "comment from CMIP5 table: near-surface
(usually, 2 meter) air temperature." ;
tas:units = "K" ;
tas:original_name = "TS" ;
tas:history = "2010-04-21T21:05:23Z altered by CMOR: Treated
scalar dimension: \'height\'. Inverted axis: lat." ;
tas:cell_methods = "time: mean" ;
tas:cell_measures = "area: areacella" ;
which is formally in the CMIP6 controlled vocabularies as:
!============
variable_entry: tas
!============
modeling_realm: atmos
!----------------------------------
! Variable attributes:
!----------------------------------
standard_name: air_temperature
units: K
cell_methods: time: mean
cell_measures: area: areacella
long_name: Near-Surface Air Temperature
!----------------------------------
We have to decide whether we care about following the CMIP6 bindings of these variables (and whether or not we want to put the CMIP6 table names and variable names anywhere in our metadata). Given we won't have files and directories with those names, there still might be benefits in putting them in per variable metadata.
We might also want to save the stash long names directly; we could do that with something like a um_context attribute.
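A sketch of what such per-variable metadata might carry (the mapping, and the attribute names cmip6_table, cmip6_variable, and um_context, are all our own invention):

```python
# Hypothetical binding from (standard_name, stash long_name) to CMIP6 names.
CMIP6_BINDING = {
    ("air_temperature", "TEMPERATURE AT 1.5M"): ("Amon", "tas"),
}

def extra_attributes(standard_name, long_name):
    """Per-variable attributes recording the stash long name and, where we
    have one, the CMIP6 table and variable-name binding."""
    attrs = {"um_context": long_name}
    binding = CMIP6_BINDING.get((standard_name, long_name))
    if binding is not None:
        attrs["cmip6_table"], attrs["cmip6_variable"] = binding
    return attrs
```

This keeps the CMIP6 names discoverable per variable even though our files and directories will not use them.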