canari-data's Issues

Check all 2D DIAG variables to ensure that if they need a vertical coordinate, one is present.

We need to go through all the DIAG variables and check whether each one needs an associated vertical coordinate and, if so, whether one is present.

For example, we currently have a field that looks like this:

f.long_name, f.standard_name = ('TEMPERATURE AT 1.5M', 'air_temperature')
print(f)
Field: air_temperature (ncvar%m01s03i236_2)
-------------------------------------------
Data            : air_temperature(time(10), latitude(324), longitude(432)) K
Cell methods    : time(10): maximum (interval: 900 s)
Dimension coords: latitude(324) = [-89.72222137451172, ..., 89.72222137451172] degrees_north
                : longitude(432) = [0.4166666567325592, ..., 359.5833435058594] degrees_east
Auxiliary coords: time(time(10)) = [1950-01-01 12:00:00, ..., 1950-01-10 12:00:00] 360_day

but if we compare it to a similar field in CMIP6, we see:

tas1.long_name, tas1.standard_name = ('Near-Surface Air Temperature', 'air_temperature')
print(tas1)
Field: air_temperature (ncvar%tas)
----------------------------------
Data            : air_temperature(time(240), latitude(324), longitude(432)) K
Cell methods    : area: time(240): mean
Dimension coords: time(240) = [1850-01-16 00:00:00, ..., 1869-12-16 00:00:00] 360_day
                : latitude(324) = [-89.72222137451172, ..., 89.72223663330078] degrees_north
                : longitude(432) = [0.4166666567325592, ..., 359.58331298828125] degrees_east
                : height(1) = [1.5] m
Cell measures   : measure:area (external variable: ncvar%areacella)

The issue at hand here is that our variable is relying on the long_name to provide coordinate information. These sorts of variables need the appropriate height coordinate.
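A minimal sketch of the likely fix with cf-python, assuming the field f from above and that 1.5 m is the right value (both the value and the use of height need confirming against STASH):

import cf

# A size-1 domain axis to carry the scalar height coordinate
axis = f.set_construct(cf.DomainAxis(1))

# Build the 1.5 m height coordinate (the value is assumed from the long_name)
height = cf.DimensionCoordinate(
    properties={'standard_name': 'height', 'positive': 'up'},
    data=cf.Data([1.5], units='m'),
)
f.set_construct(height, axes=axis)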

Consider what to do with the start dumps

If we anticipate users from outside the CANARI consortium wanting access to start dumps for their regional model domains (assuming we have written boundary condition files for them), how are we going to catalogue and organise them?

ocean variables with dodgy units and names

Here are a couple of outputs:

Field: sea_water_potential_temperature (ncvar%votemper2)
--------------------------------------------------------
Data            : sea_water_potential_temperature(time(1), depth(75), ncdim%y(1207), ncdim%x(1442)) degree_C
Cell methods    : time(1): mean
Dimension coords: depth(75) = [0.5057600140571594, ..., 5902.0576171875] m
Auxiliary coords: time(time(1)) = [1950-01-16 00:00:00] 360_day
                : latitude(ncdim%y(1207), ncdim%x(1442)) = [[-89.5, ..., 49.99550247192383]] degrees_north
                : longitude(ncdim%y(1207), ncdim%x(1442)) = [[72.75, ..., 73.0]] degrees_east
Cell measures   : measure:area(ncdim%y(1207), ncdim%x(1442)) = [[1000000.0, ..., 445.6573486328125]] m2

Field: sea_water_potential_temperature (ncvar%votemper)
-------------------------------------------------------
Data            : sea_water_potential_temperature(time(1), depth(75), ncdim%y(1207), ncdim%x(1442)) degree_C
Cell methods    : time(1): mean
Dimension coords: depth(75) = [0.5057600140571594, ..., 5902.0576171875] m
Auxiliary coords: time(time(1)) = [1950-01-16 00:00:00] 360_day
                : latitude(ncdim%y(1207), ncdim%x(1442)) = [[-89.5, ..., 49.99550247192383]] degrees_north
                : longitude(ncdim%y(1207), ncdim%x(1442)) = [[72.75, ..., 73.0]] degrees_east
Cell measures   : measure:area(ncdim%y(1207), ncdim%x(1442)) = [[1000000.0, ..., 445.6573486328125]] m2

On the face of it they are the same, but they actually represent two different quantities:

votemper is the monthly average of toce_e3t divided by the monthly average of e3t
votemper2 is the monthly average of toce2_e3t (presumably toce² × e3t) divided by the monthly average of e3t

This is the relevant XML:

<field id="e3t"          long_name="Ocean Model cell Thickness"   standard_name="cell_thickness"   unit="m"   grid_ref="grid_T_3D"/>
  <field id="toce"         long_name="Sea Water Potential Temperature"         standard_name="sea_water_potential_temperature"   unit="degree_C"     grid_ref="grid_T_3D"/>
  <field id="toce_e3t"     long_name="temperature * e3t"                                                     unit="degree_C*m"   grid_ref="grid_T_3D" > toce * e3t </field >

and

<field ts_enabled="true" field_ref="toce"         name="votemper"        operation="average" freq_op="1mo" > @toce_e3t / @e3t </field>
<field ts_enabled="true" field_ref="toce"         name="votemper2"       operation="average" freq_op="1mo" > @toce2_e3t / @e3t </field>  

Clearly:

  1. the two variables cannot have the same units,
  2. the standard names are not very helpful, and
  3. neither are the long names.

It would help to find out what these represent scientifically first ...
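Once we do know, the correction itself is straightforward; a hedged cf-python sketch, assuming votemper2 turns out to be the e3t-weighted monthly mean of the squared temperature (the filename and the new names are hypothetical):

import cf

f, = cf.read('votemper2_file.nc')

# Assumed science: e3t-weighted monthly mean of toce**2, so the units are
# degC2 and sea_water_potential_temperature no longer applies as standard name
f.del_property('standard_name')
f.set_property('long_name', 'square of sea water potential temperature (e3t-weighted mean)')
f.override_units('degC2', inplace=True)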

CICE data problems

Several CICE fields are wrong and have been processed through cdds to be made right (?). There is some thought that the fields have been fixed in https://code.metoffice.gov.uk/svn/cice/main/branches/dev/alexwest/r400_correct_cmip6_diagnostics_take2 - I'm running a test.

cmip table names

We want to make sure we use appropriate table names for our output files. The CMIP list is here, but we probably don't want to comply fully, not least because we have our own output frequencies. We should ensure our file names (and global attributes) include:

  • Omon_ (all ocean data will be monthly)
  • Ice (to discuss)

Draft for atmosphere (not land):

  • Amon_pt_ (all our atmospheric monthly 1m_pt data)
  • Amon_ (all our 1m_ average data)
  • Aday_ (all our 1d_ averaged data)
  • Aday_pt_ (all our instantaneous data, 1d_pt)
  • A1hr_
  • A1hr_pt_
  • A3hr_
  • A3hr_pt_
  • A6hr_
  • A6hr_pt_

Points of distinction from CMIP: everything starts with A, not just Amon, and everything is averaged unless the name includes pt.

As above, for land, but with L ...
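As a sanity check on that scheme, a minimal sketch of the mapping from our stream frequency strings to the proposed table names (the helper and its conventions are hypothetical until the naming is agreed):

# Hypothetical helper: map a CANARI stream prefix (e.g. '1m_pt', '3h')
# and a realm letter ('A' atmosphere, 'L' land) to a draft table name
FREQ = {'1m': 'mon', '1d': 'day', '6h': '6hr', '3h': '3hr', '1h': '1hr'}

def table_name(stream: str, realm: str = 'A') -> str:
    point = stream.endswith('_pt')
    freq = FREQ[stream.removesuffix('_pt')]
    return f'{realm}{freq}_pt_' if point else f'{realm}{freq}_'

assert table_name('1m_pt') == 'Amon_pt_'
assert table_name('1d', 'L') == 'Lday_'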

first cycle checkpoint code

Some of the global attributes will need to be entered by the user starting the simulation, information will need to be loaded into the further_info_url, and there is obvious scope to get the initial condition and boundary files wrong when attempting to create a new realisation.

We need some code to run at the end of the first cycle which does some basic checks, delivers a basic look at some data, and confirms that it is not the same as a previous run; this code needs to run before the simulation is allowed to restart and run for the rest of its time.

This probably needs to run for both the historical and SSP runs (after the change over).
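A minimal sketch of what such a check might look like, assuming we settle on a list of required attributes and on comparing a checksum of an early field against the previous realisation (all names here are hypothetical):

import hashlib
import sys

import cf

def first_cycle_check(new_file, previous_checksum):
    """Fail loudly before the simulation is allowed to restart."""
    f, = cf.read(new_file)

    # Basic global-attribute checks (the required list is illustrative)
    props = f.properties()
    for attr in ('variant_label', 'further_info_url'):
        if attr not in props:
            sys.exit(f'missing global attribute: {attr}')

    # Guard against accidentally reproducing a previous realisation
    digest = hashlib.sha256(f.array.tobytes()).hexdigest()
    if digest == previous_checksum:
        sys.exit('first-cycle data identical to previous run')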

Make sure instantaneous fields do not have spurious time means

A large number of fields which are instantaneous values at a specific time have spurious cell methods. E.g.

Field: surface_temperature (ncvar%m01s00i507_10)
------------------------------------------------
Data            : surface_temperature(time(241), latitude(324), longitude(432)) K
Cell methods    : time(241): mean (interval: 900 s)
Dimension coords: latitude(324) = [-89.72222137451172, ..., 89.72222137451172] degrees_north
                : longitude(432) = [0.4166666567325592, ..., 359.5833435058594] degrees_east
Auxiliary coords: time(time(241)) = [1950-01-01 01:30:00, ..., 1950-01-30 22:30:00] 360_day

This is either supposed to be hourly instantaneous data, in which case there should be no cell method, or it is hourly averages; it is not this!
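A minimal sketch of how we might trawl for these, assuming (as in the CDL later in this document) that XIOS records its intent in an online_operation attribute:

import cf

# Flag fields whose XIOS online_operation says 'instant' but whose
# cell methods nevertheless claim a mean
for f in cf.read('*.nc'):
    instant = f.get_property('online_operation', None) == 'instant'
    means = [cm for cm in f.cell_methods(todict=True).values()
             if cm.get_method(None) == 'mean']
    if instant and means:
        print('spurious time mean:', f.nc_get_variable(), means)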

NEMO zonal/basin data

NEMO diaptr files contain zonal data that is represented as full 3-D fields - how should that be described with cell methods?
In addition, they contain data for the Indian/Pacific/Atlantic basins (again zonal data represented as 3-D fields) - what cell methods should those have?
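For the zonal means, the CF answer is presumably a longitude: mean cell method; for the basins, CMIP6 practice is a string-valued region coordinate naming each basin. A hedged cf-python sketch of constructing both (the basin names follow CMIP6 usage and need checking against our files):

import cf

# Record the collapse over longitude explicitly
f.set_construct(cf.CellMethod(axes=['longitude'], method='mean'))

# A 'region' auxiliary coordinate for the basin axis
basins = cf.AuxiliaryCoordinate(
    properties={'standard_name': 'region'},
    data=cf.Data(['atlantic_arctic_ocean', 'indian_pacific_ocean',
                  'global_ocean']),
)
# f.set_construct(basins, axes=basin_axis)  # basin_axis: the relevant domain axis key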

CORDEX support

This is the list of fields we'd need to include if we wanted to fully support CORDEX. The full list is here.

They fall into three groups:


cx-6hr (instantaneous values)

  • 6 variables: ua7h, ta7h, va7h, zg7h, hus7h, plus psl

daily (these appear to be means)

  • prw.Eday [1]: Water Vapor Path
  • clt.day [1]: Total Cloud Cover Percentage
  • tslsi.day [1]: Surface Temperature Where Land or Sea Ice
  • hurs.day [1]: Near-Surface Relative Humidity
  • hus.day [1]: Specific Humidity
  • huss.day [1]: Near-Surface Specific Humidity
  • pr.day [1]: Precipitation
  • psl.day [1]: Sea Level Pressure
  • siconc.SIday [1]: Sea-Ice Area Percentage (Ocean Grid)
  • ta.day [1]: Air Temperature
  • tas.day [1]: Near-Surface Air Temperature
  • tasmax.day [1]: Daily Maximum Near-Surface Air Temperature
  • tasmin.day [1]: Daily Minimum Near-Surface Air Temperature
  • tos.Oday [1]: Sea Surface Temperature
  • ua.day [1]: Eastward Wind
  • uas.day [1]: Eastward Near-Surface Wind
  • va.day [1]: Northward Wind
  • vas.day [1]: Northward Near-Surface Wind
  • zg.day [1]: Geopotential Height

Monthly (averages I think)

  • siconc.SImon [1]: Sea-Ice Area Percentage (Ocean Grid) {groups: 22, vars: 2}
  • tos.Omon [1]: Sea Surface Temperature {groups: 23, vars: 4}

How many of these do we already have?
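A minimal sketch of answering that, assuming the index object used in the other issues can list the variable names it knows about (index.variable_names() is a hypothetical method):

# Compare the CORDEX daily list against what the index already holds
cordex_daily = {'prw', 'clt', 'tslsi', 'hurs', 'hus', 'huss', 'pr', 'psl',
                'siconc', 'ta', 'tas', 'tasmax', 'tasmin', 'tos', 'ua',
                'uas', 'va', 'vas', 'zg'}

have = set(index.variable_names())
missing = cordex_daily - have
print(len(missing), 'CORDEX daily variables not yet produced:', sorted(missing))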

Need to develop a tape pool strategy.

We probably want to separate the data on tapes; this needs discussing with the CEDA/JASMIN group.

Need at least:

  • dumps and boundary conditions pool
  • high frequency pool
  • monthly data pool
  • RCM data pool

Weird time axes

We do not understand why some data has auxiliary coordinates used for time.

There seems to be some unnecessary indirection arising from the way we have configured XIOS. This can't be the way it was done by IPSL in CMIP6. We need to find out if we can change some configuration to avoid this.

For example, we see:

Data            : lagrangian_tendency_of_air_pressure(time(120), air_pressure(9), latitude(325), longitude(432)) Pa s-1
Cell methods    : time(120): point
Dimension coords: time(120) = [1950-01-01 06:00:00, ..., 1950-02-01 00:00:00] 360_day
                : air_pressure(9) = [925.0, ..., 50.0] hPa
                : latitude(325) = [-90.0, ..., 90.0] degrees_north
                : longitude(432) = [0.0, ..., 359.1666564941406] degrees_east
Auxiliary coords: time(time(120)) = [1950-01-01 06:00:00, ..., 1950-02-01 00:00:00] 360_day

which arises from the following netcdf layout:

dimensions:
        axis_nbounds = 2 ;
        lon = 432 ;
        lat = 325 ;
        um-atmos_PLEV9H = 9 ;
        time_counter = UNLIMITED ; // (40 currently)
variables:
        float lat(lat) ;
                lat:axis = "Y" ;
                lat:standard_name = "latitude" ;
                lat:long_name = "Latitude" ;
                lat:units = "degrees_north" ;
        float lon(lon) ;
                lon:axis = "X" ;
                lon:standard_name = "longitude" ;
                lon:long_name = "Longitude" ;
                lon:units = "degrees_east" ;
        float um-atmos_PLEV9H(um-atmos_PLEV9H) ;
                um-atmos_PLEV9H:name = "um-atmos_PLEV9H" ;
                um-atmos_PLEV9H:standard_name = "air_pressure" ;
                um-atmos_PLEV9H:long_name = "pressure levels" ;
                um-atmos_PLEV9H:units = "hPa" ;
                um-atmos_PLEV9H:positive = "down" ;
        double time_instant(time_counter) ;
                time_instant:standard_name = "time" ;
                time_instant:long_name = "Time axis" ;
                time_instant:calendar = "360_day" ;
                time_instant:units = "seconds since 1950-01-01 00:00:00" ;
                time_instant:time_origin = "1950-01-01 00:00:00" ;
                time_instant:bounds = "time_instant_bounds" ;
        double time_instant_bounds(time_counter, axis_nbounds) ;
        double time_counter(time_counter) ;
                time_counter:axis = "T" ;
                time_counter:standard_name = "time" ;
                time_counter:long_name = "Time axis" ;
                time_counter:calendar = "360_day" ;
                time_counter:units = "seconds since 1950-01-01 00:00:00" ;
                time_counter:time_origin = "1950-01-01 00:00:00" ;
                time_counter:bounds = "time_counter_bounds" ;
        double time_counter_bounds(time_counter, axis_nbounds) ;
        float m01s30i208_2(time_counter, um-atmos_PLEV9H, lat, lon) ;
                m01s30i208_2:standard_name = "lagrangian_tendency_of_air_pressure" ;
                m01s30i208_2:long_name = "OMEGA ON P LEV/UV GRID" ;
                m01s30i208_2:units = "Pa s-1" ;
                m01s30i208_2:online_operation = "instant" ;
                m01s30i208_2:interval_operation = "6 h" ;
                m01s30i208_2:interval_write = "6 h" ;
                m01s30i208_2:cell_methods = "time: point" ;
                m01s30i208_2:_FillValue = -1.073742e+09f ;
                m01s30i208_2:missing_value = -1.073742e+09f ;
                m01s30i208_2:coordinates = "time_instant" ;

Sort out formal experiment description

We are doing "historical" and "ssp370", but not quite, because our identifiers will not be from the same stable as the parent (i.e. our variant_label is from a different vocabulary).

We could use those as is, but change the mip-era or parent_id?

We also have to decide whether we want to have two sets of experiments with different attributes, e.g. historical to 2015 and ssp370 thereafter.

Whatever we do, we probably need a formal es-doc definition which describes our initialisation and duration. It might then be easier to use neither historical nor ssp370, and simply call it canari, which can then have its own document.

cf aggregation not working properly: cf-python issue or metadata issue?

Ticket #3 describes a problem with some field metadata, but it also describes an aggregation problem: there are eight monthly means, each valid at a particular time of day, one whole-month average, and three ten-day files of 3-hourly means, so there ought to be three aggregations ...

>>> files = index['surface_temperature:OPEN SEA SURFACE TEMP AFTER TIMESTEP']
>>> print(files)
['1m_0h__m01s00i507_2_195001-195001.nc', '1m_12h__m01s00i507_6_195001-195001.nc', 
   '1m_15h__m01s00i507_7_195001-195001.nc', '1m_18h__m01s00i507_8_195001-195001.nc', 
   '1m_21h__m01s00i507_9_195001-195001.nc', '1m_3h__m01s00i507_3_195001-195001.nc', 
   '1m_6h__m01s00i507_4_195001-195001.nc', '1m_9h__m01s00i507_5_195001-195001.nc', 
   '1m__m01s00i507_195001-195001.nc', '3h__m01s00i507_10_19500101-19500110.nc', 
   '3h__m01s00i507_10_19500111-19500120.nc', '3h__m01s00i507_10_19500121-19500130.nc']
>>> flds = index.get_fields('surface_temperature:OPEN SEA SURFACE TEMP AFTER TIMESTEP',
...     'u-cn134-1fpf/19500101T0000Z/')
It then does the aggregation to the two CF-fields that are really in play:

>>> print(flds)
[<CF Field: surface_temperature(time(241), latitude(324), longitude(432)) K>,
 <CF Field: surface_temperature(time(8), latitude(324), longitude(432)) K>]

It appears that the one month average has been aggregated into the 3h fields. Is this a metadata problem or a cf-python aggregation problem?

(Kudos to @jeff-cole for spotting this; I missed it, and even after he pointed it out, I needed to be spoon-fed as to what the actual problem was.)
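One way to decide between metadata and cf-python as the culprit is to redo the aggregation verbosely, so cf-python reports exactly which metadata allowed the month mean to combine with the 3-hourly fields; a minimal sketch:

import cf

# Read without aggregating, then aggregate with maximum verbosity
fields = cf.read(files, aggregate=False)
combined = cf.aggregate(fields, verbose=3)
print(combined)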

realms

We have agreed we need the realms. For the ocean and sea ice that should be OK, but for the atmosphere we either have to split the land fields off into their own files, or assign realms per variable. The former is preferred (and can be done via the STASH table).
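A hedged sketch of the sort of lookup we would need, parsing the STASH section from the m01sXXiYYY-style variable names seen in the other issues (the section-to-realm assignments here are placeholders, not agreed):

import re

# Placeholder mapping only: the real assignments must come from the STASH table
REALM_BY_SECTION = {0: 'atmos', 3: 'atmos', 8: 'land', 19: 'land'}

def realm_for(field):
    # e.g. 'm01s03i236_2' -> section 3
    section = int(re.match(r'm\d+s(\d+)i\d+', field.nc_get_variable()).group(1))
    return REALM_BY_SECTION.get(section, 'atmos')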

cell measures?

What are we doing with cell measures, and what should we be doing?

(E.g. there is a lot of use of areacella (atmosphere) and areacello (ocean) in the CMIP6 files, a la tas:cell_measures = "area: areacella")
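Whatever we decide, cf-python can already attach the external file holding the measure at read time; a minimal sketch, assuming CMIP6-style external areacella (filenames hypothetical):

import cf

# Read the data file together with the external file that actually
# holds the areacella cell-measure variable
fields = cf.read('tas_file.nc', external='areacella_file.nc')
print(fields[0].cell_measures())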

canari further_info_url

We could just use a GitHub wiki, or a GitHub Jekyll site, or stand something up at CEDA. Whatever we do needs to support errata and probably our own "ad-hoc" es-doc. It probably wants a Zenodo DOI.

Data which is averaged over some area should have cell methods, and not just rely on names

The following piece of code finds all the files which have the standard_name surface_temperature and the long_name OPEN SEA SURFACE TEMP AFTER TIMESTEP. (This is from the one-file-per-field output, but I don't think that's relevant to the problem.)

>>> files = index['surface_temperature:OPEN SEA SURFACE TEMP AFTER TIMESTEP']
>>> print(files)
['1m_0h__m01s00i507_2_195001-195001.nc', '1m_12h__m01s00i507_6_195001-195001.nc', 
   '1m_15h__m01s00i507_7_195001-195001.nc', '1m_18h__m01s00i507_8_195001-195001.nc', 
   '1m_21h__m01s00i507_9_195001-195001.nc', '1m_3h__m01s00i507_3_195001-195001.nc', 
   '1m_6h__m01s00i507_4_195001-195001.nc', '1m_9h__m01s00i507_5_195001-195001.nc', 
   '1m__m01s00i507_195001-195001.nc', '3h__m01s00i507_10_19500101-19500110.nc', 
   '3h__m01s00i507_10_19500111-19500120.nc', '3h__m01s00i507_10_19500121-19500130.nc']
>>> flds = index.get_fields('surface_temperature:OPEN SEA SURFACE TEMP AFTER TIMESTEP',
...     'u-cn134-1fpf/19500101T0000Z/')

It then does the aggregation to the two CF-fields that are really in play:

>>> print(flds)
[<CF Field: surface_temperature(time(241), latitude(324), longitude(432)) K>,
 <CF Field: surface_temperature(time(8), latitude(324), longitude(432)) K>]

These two fields are:

Field: surface_temperature (ncvar%m01s00i507_2)
-----------------------------------------------
Data            : surface_temperature(time(8), latitude(324), longitude(432)) K
Cell methods    : time(8): point within days time(8): mean over days
Dimension coords: latitude(324) = [-89.72222137451172, ..., 89.72222137451172] degrees_north
                : longitude(432) = [0.4166666567325592, ..., 359.5833435058594] degrees_east
Auxiliary coords: time(time(8)) = [1950-01-16 00:00:00, ..., 1950-01-16 21:00:00] 360_day

Field: surface_temperature (ncvar%m01s00i507_10)
------------------------------------------------
Data            : surface_temperature(time(241), latitude(324), longitude(432)) K
Cell methods    : time(241): mean (interval: 900 s)
Dimension coords: latitude(324) = [-89.72222137451172, ..., 89.72222137451172] degrees_north
                : longitude(432) = [0.4166666567325592, ..., 359.5833435058594] degrees_east
Auxiliary coords: time(time(241)) = [1950-01-01 01:30:00, ..., 1950-01-30 22:30:00] 360_day

In both cases there should be a cell method which conforms to the relevant part of the CF conventions. All long names should be checked for such averaging and the appropriate cell methods used.
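A hedged sketch of the repair, assuming 'where sea' is the appropriate CF area-type for the open-sea fields (the exact area type needs checking against the CF area-type table):

import cf

# Record that the value is a mean over the sea portion of each cell,
# rather than leaving that information buried in the long_name
cm = cf.CellMethod(axes=['area'], method='mean',
                   qualifiers={'where': 'sea'})
f.set_construct(cm)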

RCM data products

We need to do a CF compliance exercise for the RCM data products.
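A minimal sketch of automating that, assuming the cfchecker package (which provides the cfchecks command) is the tool of choice and that it exits non-zero on errors:

import glob
import subprocess

# Run the CF checker over every RCM product file and collect failures
failures = [f for f in glob.glob('rcm/*.nc')
            if subprocess.run(['cfchecks', f]).returncode != 0]
print(len(failures), 'files failed CF compliance checking')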

long names and standard names

I admit to being surprised that CMIP6 explicitly controls some variables to have explicit long_names AND standard_names, as does CORDEX. I somehow missed that as it happened.

Here, for example is one of the CMIP5 (!) formal examples:

float tas(time, lat, lon) ;
    tas:standard_name = "air_temperature" ;
    tas:long_name = "Near-Surface Air Temperature" ;
    tas:comment = "comment from CMIP5 table: near-surface (usually, 2 meter) air temperature." ;
    tas:units = "K" ;
    tas:original_name = "TS" ;
    tas:history = "2010-04-21T21:05:23Z altered by CMOR: Treated scalar dimension: \'height\'. Inverted axis: lat." ;
    tas:cell_methods = "time: mean" ;
    tas:cell_measures = "area: areacella" ;

which is formally in the CMIP6 controlled vocabularies as:

!============
variable_entry:    tas
!============
modeling_realm:    atmos
!----------------------------------
! Variable attributes:
!----------------------------------
standard_name:     air_temperature
units:             K
cell_methods:      time: mean
cell_measures:     area: areacella
long_name:         Near-Surface Air Temperature
!----------------------------------

We have to decide whether we care about following the CMIP6 bindings of these variables (and whether or not we want to put the CMIP6 table names and variable names anywhere in our metadata). Given we won't have files and directories with those names, there might still be benefits in putting them in per-variable metadata.

We might also want to save the STASH long names directly, which we could do with something like a um_context attribute.
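A minimal sketch of what that might look like, assuming we adopt the um_context attribute suggested above (the attribute name is our own invention, not a CF convention):

# Keep the original STASH long name alongside the CF-controlled one
f.set_property('um_context', 'TEMPERATURE AT 1.5M')
f.set_property('long_name', 'Near-Surface Air Temperature')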
