
xmip's Introduction


BLM

Science is not immune to racism. Academia is an elitist system with numerous gatekeepers that has mostly allowed a very limited spectrum of people to pursue a career. I believe we need to change that.

Open source development and reproducible science are a great way to democratize the means for scientific analysis. But you can't git clone software if you are being murdered by the police for being Black!

Free access to software and hollow diversity statements are hardly enough to crush the systemic and institutionalized racism in our society and academia.

If you are using this package, I ask you to go beyond just speaking out and donate here to Data for Black Lives and Black Lives Matter Action.

I explicitly welcome suggestions regarding the wording of this statement and for additional organizations to support. Please raise an issue for suggestions.

xmip (formerly cmip6_preprocessing)

This package facilitates the cleaning, organization and interactive analysis of Model Intercomparison Projects (MIPs) within the Pangeo software stack.

Are you interested in CMIP6 data, but find that it is not quite analysis ready? Do you just want to run a simple (or complicated) analysis on various models and end up having to write logic for each separate case, because various datasets still require fixes to names, coordinates, etc.? Then this package is for you.

Developed during the cmip6-hackathon this package provides utility functions that play nicely with intake-esm.

We currently support the following functions:

  1. Preprocessing CMIP6 data (please check out the tutorial for examples using the Pangeo cloud; a minimal usage sketch follows this list). The preprocessing includes:
     a. Fix inconsistent naming of dimensions and coordinates
     b. Fix inconsistent values, shape and dataset location of coordinates
     c. Homogenize longitude conventions
     d. Fix inconsistent units
  2. Creating large-scale ocean basin masks for arbitrary model output
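
A minimal usage sketch with intake-esm and the Pangeo cloud catalog (the query below is just an example; see the tutorial for the full workflow):

import intake
from xmip.preprocessing import combined_preprocessing

# Pangeo CMIP6 catalog (same URL used elsewhere on this page)
col = intake.open_esm_datastore("https://storage.googleapis.com/cmip6/pangeo-cmip6.json")
cat = col.search(source_id="CanESM5", variable_id="thetao", experiment_id="historical",
                 table_id="Omon", grid_label="gn", member_id="r1i1p1f1")
ddict = cat.to_dataset_dict(
    zarr_kwargs={"consolidated": True, "decode_times": False},
    preprocess=combined_preprocessing,  # applied to every dataset before aggregation
)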

The following issues are under development:

  1. Reconstruct/find grid metrics
  2. Arrange different variables on their respective staggered grid, so they can work seamlessly with xgcm

Check out this recent Earthcube notebook (cite via doi: 10.1002/essoar.10504241.1) for a high level demo of xmip and xgcm.

Installation

Install xmip via pip:

pip install xmip

or conda:

conda install -c conda-forge xmip

To install the newest main branch from GitHub you can use pip as well:

pip install git+https://github.com/jbusecke/xmip.git

xmip's People

Contributors

dependabot[bot], emaroon, emmomp, jbusecke, markusritschel, pre-commit-ci[bot], readthedocs-assistant, tomnicholas


xmip's Issues

Squeezing peak performance out of the cloud CI

I am pretty happy with the setup we have right now, but I am wondering if we can speed up the test setup even more.

Currently each of these test jobs runs for ~16 minutes (see the linked example run). This is split into ~4 minutes of setup and ~12 minutes of actual testing.

I will think about how to reduce the testing time, but we might be limited by I/O there? (What would be a good and easy way to profile such a job?)

On the other hand I was wondering if there are ways to get even faster setup of the environments? The more jobs we run in parallel, the more beneficial it would be to reduce this to a minimum.

I remember @andersy005 had a custom Docker image that was cached? If anybody here has ideas or suggestions for how to speed up the setup even further, I would very much appreciate it.
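
As a small, low-effort starting point for profiling, pytest's built-in duration report would at least show which individual tests dominate the ~12 minutes, assuming the same invocation as in the cloud CI:

pytest --durations=25 tests/test_preprocessing_cloud.py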

Better structure for different 'levels' of preprocessing.

I have been thinking about a major restructuring of the package which would go along with an actual documentation.

This has been inspired by the continued use of the package in my own projects and some great discussions in the Pangeo CMIP6 group. In particular @agstephens got me thinking about what cmip6_preprocessing 'fixes' at which 'level' of the processing.

Let me try to illustrate:

  • All CMIP6 data starts from multiple files.

  • These files then form a dataset (this is usually a single variable)

  • Several datasets of different variables and grid metrics can be combined to a member, which represents 'the stuff you would normally get when you run a model', e.g. a bunch of variables (possibly on a staggered grid) and maybe some grid metrics like the vertical cell thickness and cell area.

  • Then some models have several members, which can be combined again, calling this an experiment (e.g. a preindustrial control experiment, or a historical forcing experiment).

  • One could then combine these experiments into some sort of super dataset for each model, but that is often impractical due to the different time frames they run over, so I'll stop here.

Or, in short, the levels are: file → dataset → member → experiment.

It is important to know when a fix should/has to be applied. There are many fixes that have been applied by @naomi-henderson on the file level in order to put out the zarr stores for the Pangeo CMIP6 data.

Most of the functionality in the preprocessing.py module is meant to be applied at the dataset level. I am planning to add more of my prototyped code to the package soon, so I wanted to discuss whether everybody thinks this 'vocabulary' looks OK, or whether I am missing an important level/use case here.
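
To make the vocabulary concrete, here is a minimal, hypothetical sketch of the experiment-level step (combining members along a new member_id dimension, as intake-esm does); none of the names below are existing package API:

import xarray as xr

def combine_members(member_datasets: dict) -> xr.Dataset:
    """Hypothetical experiment-level step: stack member datasets along `member_id`."""
    members = list(member_datasets)
    return xr.concat(
        [member_datasets[m] for m in members],
        dim=xr.DataArray(members, dims="member_id", name="member_id"),
    )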

failed to join/concatenate datasets for model CESM2-FV2

I use the following code to load CESM2-FV2

import intake
from cmip6_preprocessing.preprocessing import combined_preprocessing

url = "https://raw.githubusercontent.com/NCAR/intake-esm-datastore/master/catalogs/pangeo-cmip6.json"
col = intake.open_esm_datastore(url)
model = 'CESM2-FV2'

query = dict(experiment_id=['historical'], table_id='Omon', 
             variable_id='tos', grid_label=['gn'], source_id=model)
cat = col.search(**query)
print(cat.df['source_id'].unique())
z_kwargs = {'consolidated': True, 'decode_times':False}
tos_dict = cat.to_dataset_dict(zarr_kwargs=z_kwargs, preprocess=combined_preprocessing)

and I get the following error:

AggregationError: 
        Failed to join/concatenate datasets in group with key=CMIP.NCAR.CESM2-FV2.historical.Omon.gn along a new dimension `member_id`.
        *** Arguments passed to xarray.concat() ***:
        - objs: a list of 3 datasets
        - dim: <xarray.DataArray 'member_id' (member_id: 3)>
array(['r1i1p1f1', 'r2i1p1f1', 'r3i1p1f1'], dtype='<U8')
Dimensions without coordinates: member_id
        - data_vars: ['tos']
        - and kwargs: {'coords': 'minimal', 'compat': 'override'}
        ********************************************
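
Continuing from the snippet above, one hedged way to inspect the individual member datasets while the concatenation failure is being debugged is to turn off intake-esm's aggregation (the aggregate keyword appears in the to_dataset_dict signatures shown in other tracebacks on this page):

# one dataset per store/member; members can then be compared or concatenated manually
tos_raw = cat.to_dataset_dict(
    zarr_kwargs=z_kwargs,
    preprocess=combined_preprocessing,
    aggregate=False,
)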

Make preprocessing work with depth section data

In its current form combined_preprocessing only works if the dataset has two horizontal dimensions (x and y). It fails with, e.g., zonally averaged streamfunctions.

It would be nice to relax these criteria so that processing such datasets doesn't fail (a minimal sketch of a guarded wrapper follows below).
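
A minimal sketch of what a more tolerant wrapper could look like, assuming only that the individual steps are importable as below; this is not the package's actual handling of this case:

from cmip6_preprocessing.preprocessing import rename_cmip6, broadcast_lonlat

def tolerant_preprocessing(ds):
    ds = rename_cmip6(ds)
    # only attempt lon/lat handling when both horizontal dimensions are present
    if "x" in ds.dims and "y" in ds.dims:
        ds = broadcast_lonlat(ds)
    return ds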

Proper docs

I think it would be beneficial to have a proper (even if short) documentation with RTD, to e.g. provide a Contributors Guide. This would be helpful once #26 is implemented.

  • Transfer the existing tutorials + notebooks to rtd structure
  • Example how to use on raw datastore or local data w/o intake-esm
  • Add "whats new" to track contributions, features etc.
  • Contributors Guide

Issue with extracted Area for CSIRO ocean model

I noticed a problem with the static metric parsing for the CSIRO model recently. This only arises when passing preprocess=combined_preprocessing.

# modified search to check for models with all three (piControl, historical, ssp585)
import intake
from cmip6_preprocessing.parse_static_metrics import extract_static_metric, parse_metrics
from cmip6_preprocessing.preprocessing import combined_preprocessing

col = intake.open_esm_datastore("https://raw.githubusercontent.com/NCAR/intake-esm-datastore/master/catalogs/pangeo-cmip6.json")
cat = col.search(variable_id='o2', experiment_id='historical', source_id ='ACCESS-ESM1-5', table_id='Omon')
data_dict = cat.to_dataset_dict(zarr_kwargs={'consolidated': True, 'decode_times':False},
                                preprocess=combined_preprocessing)#Tested with None and that works fine
data_dict = parse_metrics(data_dict, col, preprocess=combined_preprocessing)
data_dict['CMIP.CSIRO.ACCESS-ESM1-5.historical.Omon.gn'].areacello.plot()

[plot: parsed areacello for ACCESS-ESM1-5 showing incorrect values]

The problem can be solved by dropping the area dim values. ATM I assume this is an issue with slightly misaligned dimensions?

area = extract_static_metric(col, 'gn', 'ACCESS-ESM1-5', preprocess=combined_preprocessing)
data_dict['CMIP.CSIRO.ACCESS-ESM1-5.historical.Omon.gn'].coords['areacello'] = area.areacello.drop(['x', 'y'])

data_dict['CMIP.CSIRO.ACCESS-ESM1-5.historical.Omon.gn'].areacello.plot()

[plot: areacello after dropping the x/y values, now correct]

I think we should add a force_align option to parse_metrics, which drops the values if True.
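
A hedged sketch of what such a force_align option could look like, built from the workaround above; the real signatures of parse_metrics/extract_static_metric may differ:

def parse_metrics_force_align(data_dict, col, preprocess=None, force_align=False):
    for key, ds in data_dict.items():
        # keys look like 'CMIP.CSIRO.ACCESS-ESM1-5.historical.Omon.gn'
        _, _, source_id, *_, grid_label = key.split(".")
        area = extract_static_metric(col, grid_label, source_id, preprocess=preprocess)
        areacello = area.areacello
        if force_align:
            # drop the (possibly misaligned) x/y values and trust positional alignment
            areacello = areacello.drop(["x", "y"])
        ds.coords["areacello"] = areacello
    return data_dict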

Example to pangeo gallery

I would like to have an example of cmip6_preprocessing in the pangeo gallery to promote the package.

Specify chunking in `read_data`

I would like to adjust the chunking of CMIP6 data read in with read_data for Dask calculations. Could this be specified as an input, like in intake-esm? (A workaround sketch follows below.)
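
Until read_data grows a chunks argument, a minimal workaround sketch (not read_data's actual API) is to rechunk after loading; Dask picks up the new chunking for subsequent calculations:

import xarray as xr

def rechunk(ds: xr.Dataset, chunks=None) -> xr.Dataset:
    # hypothetical helper; the default chunking here is just an example
    return ds.chunk(chunks or {"time": 120})

# e.g. applied to every dataset in a dictionary returned by intake-esm:
# ddict = {key: rechunk(ds) for key, ds in ddict.items()}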

Hosting Cloud Tests in pangeo resources instead of GHA

Circling back to a discussion we had at the pangeo meeting with @WesleyTheGeolien about running the CI on pangeo resources (instead of the ones provided by GHA).

The main reason to do this is that we could theoretically? run more concurrent jobs and might be able to get more performant nodes.

I have attempted a naive performance test with the following steps:

  1. Get a large server on the pangeo gcp deployment. This gives us 2 cores/4 threads to run on?
(notebook) jovyan@jupyter-jbusecke:~/cmip6_preprocessing$ lscpu
Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   46 bits physical, 48 bits virtual
CPU(s):                          4
On-line CPU(s) list:             0-3
Thread(s) per core:              2
Core(s) per socket:              2
Socket(s):                       1
NUMA node(s):                    1
Vendor ID:                       GenuineIntel
CPU family:                      6
Model:                           63
Model name:                      Intel(R) Xeon(R) CPU @ 2.30GHz
Stepping:                        0
CPU MHz:                         2300.000
BogoMIPS:                        4600.00
Hypervisor vendor:               KVM
Virtualization type:             full
L1d cache:                       64 KiB
L1i cache:                       64 KiB
L2 cache:                        512 KiB
L3 cache:                        45 MiB
NUMA node0 CPU(s):               0-3
Vulnerability Itlb multihit:     Not affected
Vulnerability L1tf:              Mitigation; PTE Inversion
Vulnerability Mds:               Mitigation; Clear CPU buffers; SMT Host state unknown
Vulnerability Meltdown:          Mitigation; PTI
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:        Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:        Mitigation; Full generic retpoline, IBPB conditional, IBRS_FW, STIBP conditional, RSB filling
Vulnerability Tsx async abort:   Not affected
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1
                                 gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse
                                 4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm invpcid_single pti ssbd ibrs ibpb stibp 
                                 fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid xsaveopt arat md_clear arch_capabilities
  2. Set up the cloud testing environment with mamba env create -f ci/environment-cloud-test.yml
  3. Run the cloud-specific tests with pytest -n auto --maxfail 5 --reruns 1 --reruns-delay 10 tests/test_preprocessing_cloud.py. This runs two parallel processes and takes ~14 minutes (I also tried running it with -n 4 and it took ~12 minutes).
  4. These times are, however, much slower than what I am getting for the GHA workflows (~7-9 minutes; see the linked example run).

I am wondering if it is even worth trying to port this to pangeo? I am not sure what kind of increased performance we could expect. Even if we manage to squeeze out higher performance, the fact that we have this working in a decent state on GHA right now, makes me think we should maybe focus our efforts elsewhere? It seems that GHA allows for a decent amount of concurrent jobs (~8?) and most of the test can be run in parallel. This gives us ~30 min for the full CI, which is not great, but also not terrible.

Any thoughts?

Setting up `interrogate`

I have left out the interrogate section of the pre-commit config file in #83, since getting the docstrings up to par will take a bit of time, but I still want to do it.

This is just a reminder for myself

lon dimension error when applying combined_preprocessing for some CMIP6 models

When I use combined_preprocessing in conjunction with CMIP6 model output in the example below:

import xarray as xr
import numpy as np 
import os 

import intake
import zarr
from cmip6_preprocessing.preprocessing import combined_preprocessing

# search the catalog and get model data
col = intake.open_esm_datastore("https://storage.googleapis.com/cmip6/pangeo-cmip6.json")
cat = col.search(activity_id='ScenarioMIP', institution_id='EC-Earth-Consortium', source_id='EC-Earth3-Veg-LR', experiment_id='ssp370', table_id='day', variable_id='tasmax', 
                 member_id='r1i1p1f1')

z_kwargs = {'consolidated': True, 'decode_times':False}
dset_dict = cat.to_dataset_dict(zarr_kwargs=z_kwargs,
                                preprocess=combined_preprocessing)

I'm getting the following MissingDimensionsError:

MissingDimensionsError                    Traceback (most recent call last)
/srv/conda/envs/notebook/lib/python3.8/site-packages/intake_esm/merge_util.py in _open_asset()
    314         try:
--> 315             ds = preprocess(ds)
    316         except Exception as exc:

/srv/conda/envs/notebook/lib/python3.8/site-packages/cmip6_preprocessing/preprocessing.py in combined_preprocessing()
    904         # broadcast lon/lat
--> 905         ds = broadcast_lonlat(ds)
    906         # replace x,y with nominal lon,lat

/srv/conda/envs/notebook/lib/python3.8/site-packages/cmip6_preprocessing/preprocessing.py in broadcast_lonlat()
    814     if len(ds["lon"].dims) < 2:
--> 815         ds.coords["lon"] = ds["lon"] * xr.ones_like(ds["lat"])
    816     if len(ds["lat"].dims) < 2:

/srv/conda/envs/notebook/lib/python3.8/site-packages/xarray/core/coordinates.py in __setitem__()
     39     def __setitem__(self, key: Hashable, value: Any) -> None:
---> 40         self.update({key: value})
     41 

/srv/conda/envs/notebook/lib/python3.8/site-packages/xarray/core/coordinates.py in update()
    114         other_vars = getattr(other, "variables", other)
--> 115         coords, indexes = merge_coords(
    116             [self.variables, other_vars], priority_arg=1, indexes=self.indexes

/srv/conda/envs/notebook/lib/python3.8/site-packages/xarray/core/merge.py in merge_coords()
    454     )
--> 455     collected = collect_variables_and_indexes(aligned)
    456     prioritized = _get_priority_vars_and_indexes(aligned, priority_arg, compat=compat)

/srv/conda/envs/notebook/lib/python3.8/site-packages/xarray/core/merge.py in collect_variables_and_indexes()
    277 
--> 278             variable = as_variable(variable, name=name)
    279             if variable.dims == (name,):

/srv/conda/envs/notebook/lib/python3.8/site-packages/xarray/core/variable.py in as_variable()
    153         if obj.ndim != 1:
--> 154             raise MissingDimensionsError(
    155                 "%r has more than 1-dimension and the same name as one of its "

MissingDimensionsError: 'lon' has more than 1-dimension and the same name as one of its dimensions ('lon', 'lat'). xarray disallows such variables because they conflict with the coordinates used to label dimensions.

The above exception was the direct cause of the following exception:

RuntimeError                              Traceback (most recent call last)
<ipython-input-30-3884fdb04db1> in <module>
      2 
      3 z_kwargs = {'consolidated': True, 'decode_times':False}
----> 4 dset_dict = cat.to_dataset_dict(zarr_kwargs=z_kwargs,
      5                                 preprocess=combined_preprocessing)
      6 # dset_dict = cat.to_dataset_dict(zarr_kwargs=z_kwargs)

/srv/conda/envs/notebook/lib/python3.8/site-packages/intake_esm/core.py in to_dataset_dict(self, zarr_kwargs, cdf_kwargs, preprocess, storage_options, progressbar, aggregate)
    925             ]
    926             for i, task in enumerate(concurrent.futures.as_completed(future_tasks)):
--> 927                 key, ds = task.result()
    928                 self._datasets[key] = ds
    929                 if self.progressbar:

/srv/conda/envs/notebook/lib/python3.8/concurrent/futures/_base.py in result(self, timeout)
    430                 raise CancelledError()
    431             elif self._state == FINISHED:
--> 432                 return self.__get_result()
    433 
    434             self._condition.wait(timeout)

/srv/conda/envs/notebook/lib/python3.8/concurrent/futures/_base.py in __get_result(self)
    386     def __get_result(self):
    387         if self._exception:
--> 388             raise self._exception
    389         else:
    390             return self._result

/srv/conda/envs/notebook/lib/python3.8/concurrent/futures/thread.py in run(self)
     55 
     56         try:
---> 57             result = self.fn(*self.args, **self.kwargs)
     58         except BaseException as exc:
     59             self.future.set_exception(exc)

/srv/conda/envs/notebook/lib/python3.8/site-packages/intake_esm/core.py in _load_source(key, source)
    911 
    912         def _load_source(key, source):
--> 913             return key, source.to_dask()
    914 
    915         sources = {key: source(**source_kwargs) for key, source in self.items()}

/srv/conda/envs/notebook/lib/python3.8/site-packages/intake_esm/source.py in to_dask(self)
    244     def to_dask(self):
    245         """Return xarray object (which will have chunks)"""
--> 246         self._load_metadata()
    247         return self._ds
    248 

/srv/conda/envs/notebook/lib/python3.8/site-packages/intake/source/base.py in _load_metadata(self)
    124         """load metadata only if needed"""
    125         if self._schema is None:
--> 126             self._schema = self._get_schema()
    127             self.datashape = self._schema.datashape
    128             self.dtype = self._schema.dtype

/srv/conda/envs/notebook/lib/python3.8/site-packages/intake_esm/source.py in _get_schema(self)
    173 
    174         if self._ds is None:
--> 175             self._open_dataset()
    176 
    177             metadata = {

/srv/conda/envs/notebook/lib/python3.8/site-packages/intake_esm/source.py in _open_dataset(self)
    225             for _, row in self.df.iterrows()
    226         ]
--> 227         datasets = dask.compute(*datasets)
    228         mapper_dict = dict(datasets)
    229         nd = create_nested_dict(self.df, self.path_column, self.aggregation_columns)

/srv/conda/envs/notebook/lib/python3.8/site-packages/dask/base.py in compute(*args, **kwargs)
    450         postcomputes.append(x.__dask_postcompute__())
    451 
--> 452     results = schedule(dsk, keys, **kwargs)
    453     return repack([f(r, *a) for r, (f, a) in zip(results, postcomputes)])
    454 

/srv/conda/envs/notebook/lib/python3.8/site-packages/distributed/client.py in get(self, dsk, keys, restrictions, loose_restrictions, resources, sync, asynchronous, direct, retries, priority, fifo_timeout, actors, **kwargs)
   2723                     should_rejoin = False
   2724             try:
-> 2725                 results = self.gather(packed, asynchronous=asynchronous, direct=direct)
   2726             finally:
   2727                 for f in futures.values():

/srv/conda/envs/notebook/lib/python3.8/site-packages/distributed/client.py in gather(self, futures, errors, direct, asynchronous)
   1984             else:
   1985                 local_worker = None
-> 1986             return self.sync(
   1987                 self._gather,
   1988                 futures,

/srv/conda/envs/notebook/lib/python3.8/site-packages/distributed/client.py in sync(self, func, asynchronous, callback_timeout, *args, **kwargs)
    830             return future
    831         else:
--> 832             return sync(
    833                 self.loop, func, *args, callback_timeout=callback_timeout, **kwargs
    834             )

/srv/conda/envs/notebook/lib/python3.8/site-packages/distributed/utils.py in sync(loop, func, callback_timeout, *args, **kwargs)
    338     if error[0]:
    339         typ, exc, tb = error[0]
--> 340         raise exc.with_traceback(tb)
    341     else:
    342         return result[0]

/srv/conda/envs/notebook/lib/python3.8/site-packages/distributed/utils.py in f()
    322             if callback_timeout is not None:
    323                 future = asyncio.wait_for(future, callback_timeout)
--> 324             result[0] = yield future
    325         except Exception as exc:
    326             error[0] = sys.exc_info()

/srv/conda/envs/notebook/lib/python3.8/site-packages/tornado/gen.py in run(self)
    760 
    761                     try:
--> 762                         value = future.result()
    763                     except Exception:
    764                         exc_info = sys.exc_info()

/srv/conda/envs/notebook/lib/python3.8/site-packages/distributed/client.py in _gather(self, futures, errors, direct, local_worker)
   1849                             exc = CancelledError(key)
   1850                         else:
-> 1851                             raise exception.with_traceback(traceback)
   1852                         raise exc
   1853                     if errors == "skip":

/srv/conda/envs/notebook/lib/python3.8/site-packages/intake_esm/source.py in read_dataset()
    202             # replace path column with mapper (dependent on filesystem type)
    203             mapper = _path_to_mapper(path, storage_options, data_format)
--> 204             ds = _open_asset(
    205                 mapper,
    206                 data_format=data_format,

/srv/conda/envs/notebook/lib/python3.8/site-packages/intake_esm/merge_util.py in _open_asset()
    315             ds = preprocess(ds)
    316         except Exception as exc:
--> 317             raise RuntimeError(
    318                 f'Failed to apply pre-processing function: {preprocess.__name__}'
    319             ) from exc

RuntimeError: Failed to apply pre-processing function: combined_preprocessing

When updating the source_id to source_id='EC-Earth3', the error does not occur. It appears to be caused by combined_preprocessing broadcasting lon to 2-D while the longitude dimension has the same name as the variable, which xarray disallows, even though to_dataset_dict on its own can open the data.

Add pre-commit

Now that this project is becoming more active and hopefully more of a community effort we should add black formatting via pre-commit.

cannot import name parse_lon_lat_bounds

I am trying to use the single utility function parse_lon_lat_bounds and am running into an import error:

---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
<ipython-input-92-1e894fbe4790> in <module>
----> 1 from cmip6_preprocessing.preprocessing import parse_lon_lat_bounds

ImportError: cannot import name 'parse_lon_lat_bounds' from 'cmip6_preprocessing.preprocessing' (/srv/conda/envs/notebook/lib/python3.7/site-packages/cmip6_preprocessing/preprocessing.py)

I came across this trying to go through the tutorial and started having issues when trying to do the "Consistent CF bounds" section. I did a pip install on pangeo.

Drop the 'reordering' of longitude values.

I got quite frustrated with the attempt in #85 to fix the step that reorders datasets along the longitude, and will close that attempt.

I would like to advocate to drop this part of the preprocessing altogether.

My personal workflow has had numerous problems introduced by the attempt to 'reorder' the grid. In the end this is not necessary for many tasks, or it can be done at a later stage (e.g. when very intense processing is completed and you want to plot a map).

I also have the suspicion that recent changes introduced in #79 might cause #93, but I haven't confirmed this yet.
I am quite convinced we should drop the line ds = replace_x_y_nominal_lat_lon(ds) from combined_preprocessing.

Does anyone here crucially depend on that functionality?

My plan, for now, would be to remove it from the combined_preprocessing wrapper, but leave it in the module, so that users can apply it manually.
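
If it is removed from the wrapper, applying it manually at the end of a workflow would look roughly like this (hedged sketch; reduced is just a placeholder name for an already-processed dataset):

from cmip6_preprocessing.preprocessing import replace_x_y_nominal_lat_lon

# `reduced` stands for any dataset that has already gone through the heavy processing;
# reordering only at the very end keeps the dask graph small during the computation
reduced = replace_x_y_nominal_lat_lon(reduced)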

cc @dcherian @sckw @jetesdal @aaronspring

Performance issue; preprocessing produces ungodly amount of dask tasks

I just discovered a concerning behavior of combined_preprocessing, which seems to create a lot more dask tasks for each dataset when I enable the new automatic slicing for large arrays.

Consider this example:

import intake
import warnings
from cmip6_preprocessing.preprocessing import combined_preprocessing

import dask
dask.config.set(**{'array.slicing.split_large_chunks': True, "array.chunk-size": "256 MiB",})

url = "https://raw.githubusercontent.com/NCAR/intake-esm-datastore/master/catalogs/pangeo-cmip6.json"
col = intake.open_esm_datastore(url)

cat = col.search(
    table_id='Omon',
    grid_label='gn',
    experiment_id='piControl',
    variable_id='o2',
    source_id=['CanESM5'])

Now let's load this single model into a dictionary, with and without using preprocess:

ddict_raw = cat.to_dataset_dict(
                zarr_kwargs={"consolidated": True, "decode_times":False}, 
            )
ddict_raw['CMIP.CCCma.CanESM5.piControl.Omon.gn'].o2.data

[screenshot: dask array repr without preprocessing, ~30k tasks]

ddict = cat.to_dataset_dict(
                zarr_kwargs={"consolidated": True, "decode_times":False},
                preprocess=combined_preprocessing
            )
ddict['CMIP.CCCma.CanESM5.piControl.Omon.gn'].o2.data

[screenshot: dask array repr with combined_preprocessing, >9 million tasks]

The number of tasks increased from ~30k to more than 9 million! This seems to hit the limit of what dask can handle.

I dug a little deeper and it seems that the increase happens during the longitude reordering step, specifically the call to .sortby('x').

If I deactivate the new automatic slicing for large arrays I get this

import dask
with dask.config.set(**{'array.slicing.split_large_chunks': False}):
    ddict_test = cat.to_dataset_dict(
                zarr_kwargs={"consolidated": True, "decode_times":False},
                preprocess=combined_preprocessing
            )
ddict_test['CMIP.CCCma.CanESM5.piControl.Omon.gn'].o2.data

[screenshot: dask array repr with split_large_chunks disabled, original chunking preserved]

This prevents the x dimension from being rechunked into single-value chunks.

Unfortunately I don't understand enough about these xarray/dask internals. @dcherian is this something that you know more about? I'll try to come up with a more simplified example and cross-post over at xarray. Just wanted to document this behavior here first.

combine cmip6_preprocessing with xesmf

Has anyone used cmip6_pre with xesmf for ocean variables?

With cmip6_preprocessing.__version__ = 'v0.1.3' I get these coords, lat(x) and lon(x), for all models, which cannot be right. This then crashes xesmf.

BCC-CSM2-MR CMIP.BCC.BCC-CSM2-MR.esm-piControl.Omon.gn
Coordinates:
    lon        (x) float32 0.5 1.5 2.5 3.5 4.5 ... 355.5 356.5 357.5 358.5 359.5
    lat        (x) float32 0.1662087 0.1662087 0.1662087 ... 0.1662087 0.1662087

Switch CI to mamba

Let me try that to further decrease build times; especially with the many builds that #70 will create, this could be an advantage.

Large update of obsolete datasets

@jbusecke, I just want to give you a heads up that I am starting to go through the whole CMIP6 collection, replacing old dataset versions with new versions when available. I have only been replacing old datasets upon request so far, so the downstream effects have been minimal. There have been many updates to existing datasets since we started building our GC CMIP6 collection, so I am anticipating some confusion and glitches in the next few weeks as I focus on this large remove/replace process.

For @jbusecke, this could change the pre-processing required for some of the datasets (hopefully it will fix many problems??). The tracking_id will change, as should the version entry in the csv forms for each of the affected datasets.

For everyone, there will be small intervals when a dataset is temporarily unavailable and/or the csv catalog is temporarily out of sync. I have tried to minimize the disruption, but replacing objects in the object store is a bit of a pain.
Apologies in advance for any disruption, but it had to happen. In future, the processing of any new request coming in involves the checking and replacement of any datasets for which there is a newer version. So this is not a one-time revision - datasets will be automatically updated in the future as new versions become available.

rename routine not considering ni, nj

It seems like the renaming routine does not consider dimensions called ni and nj.
They can be found, for example, in sea-ice related files, e.g. "siconc_SImon_CESM2_historical_r1i1p1f1_gn_185001-201412.nc".
If I hand over an accordingly adjusted rename_dict, i.e. with "ni" and "nj" added to the entries for "x" and "y":

rename_dict = {
    # dim labels
    "x": ["x", "i", "nlon", "lon", "longitude", "ni"],
    "y": ["y", "j", "nlat", "lat", "latitude", "nj"],
    "lev": ["lev", "deptht", "olevel", "zlev", "olev"],
    "bnds": ["bnds", "axis_nbounds", "d2"],
    "vertex": ["vertex", "nvertex", "vertices"],
    # coordinate labels
    "lon": ["lon", "longitude", "nav_lon"],
    "lat": ["lat", "latitude", "nav_lat"],
    "lev_bounds": [
        "lev_bounds",
        "deptht_bounds",
        "lev_bnds",
        "olevel_bounds",
        "zlev_bnds",
    ],
    "lon_bounds": [
        "bounds_lon",
        "bounds_nav_lon",
        "lon_bnds",
        "x_bnds",
        "vertices_longitude",
    ],
    "lat_bounds": [
        "bounds_lat",
        "bounds_nav_lat",
        "lat_bnds",
        "y_bnds",
        "vertices_latitude",
    ],
    "time_bounds": ["time_bounds", "time_bnds"],
}

rename_cmip6(ds, rename_dict=rename_dict)

to the rename_cmip6 routine, it doesn't change anything. Or am I missing something?

Unify variable attributes and units

We currently fix units when they are non-standard (e.g. cm ==> m for depth).

There are also some inconsistencies in the long names of e.g. dimensions and coordinates. This is a cosmetic issue but I wanted to pin it here. Eventually combined_preprocessing should rename all variables in a consistent way.
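
For illustration, a minimal sketch of the depth-unit part of this; attribute handling follows CF conventions and this is not the exact implementation in the package:

def correct_depth_units(ds):
    # convert a depth coordinate given in cm to m, as mentioned above
    if "lev" in ds.coords and ds.lev.attrs.get("units") == "cm":
        ds = ds.assign_coords(lev=ds.lev / 100)
        ds.lev.attrs["units"] = "m"
    return ds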

DOCS: Note on old catalog

I need to make sure that the new docs do not mention the old NCAR catalog (e.g. #93 #88) and clearly point folks to the new catalog.

dealing with the time dimension

For seamless use of multi-model CMIP6 data, the time dimension is also annoying because the calendar definitions are all over the place. cmip6_preprocessing already deals with spatial dimensions and units; maybe we can also get time onto a common standard.

Especially when working with monthly data, there is a variety of calendars (360_day, gregorian, ...) and frequencies (MS, middle of the month, M) in use, but essentially they all mean the same thing: a monthly mean. (For daily output this seems less of a problem because daily calendars are better defined.)

How to proceed (proposal):

  • One first step would be to use intake-esm or xr with use_cftime=True.
  • monthly output: check the year and month of the first and last timestep. If this matches time.size, then a new time coordinate could be created with xr.cftime_range() and a common freq='MS' or 'M'. This function could even be applied via preprocessing, since the time coord/dim is concatenated afterwards. Alternatively, it could just be applied to each model dataset after loading.
  • daily: makes sense to align?
  • annual: makes sense to align? not too much annual CMIP6 output anyways
  • additional checks could use the experiment_id to see whether the time dimension starts as expected, e.g. historical from 1850-01, and raise a warning if not. (This wouldn't work nicely in preprocessing, but rather as a function call after intake-esm.)

I think essentially most users do something like this when they attempt to get different models into one xr.Dataset.
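
A minimal sketch of the monthly alignment idea above, assuming cftime-decoded time (use_cftime=True); this is not an existing cmip6_preprocessing function:

import xarray as xr

def align_monthly_time(ds):
    first, last = ds.time[0].item(), ds.time[-1].item()
    n_expected = (last.year - first.year) * 12 + (last.month - first.month) + 1
    if n_expected == ds.time.size:
        # replace the calendar-specific time axis with a common monthly one
        ds = ds.assign_coords(
            time=xr.cftime_range(
                start=f"{first.year:04d}-{first.month:02d}",
                periods=ds.time.size,
                freq="MS",
            )
        )
    return ds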

Thoughts?

Do we dare to test the full CMIP6 cloud archive?

#70 is coming along and will hopefully be merged today.

Currently, we are testing a matrix with a comprehensive set of models (all with ocean output for now), two grid_labels (gn and gr), looking at a small subset of variables (thetao and o2) and two experiments (historical and ssp585).

The ability to parallelize these parameters in GHA jobs opens up the possibility to really do an extensive sweep across the full cmip6 data archive. Surely this will not be part of the regular CI, but I could imagine doing a "once a month" job that really tests a lot more files?

Do people here have opinions on that?

Lat/lon output error when processing CMCC-CESM2

Hello! Thank you so much for this package, it helps a lot with processing CMIP6 datasets.

I'm encountering an issue when using the package with the model CMCC-CESM2. The package proceeds with no errors, but the output lat/lon is very different from the input. I'm attaching a notebook with test code.

Thanks!

MissingDimensionsError when using combined_preprocessing

[screenshot: MissingDimensionsError traceback]

Hi, there,

Thanks for sharing the preprocessing tools for CMIP6! I am trying to use them on monthly SST data from the model pi-control runs. However, it keeps giving me an error (see the attached screenshot). There are 44 models in total.

Thanks!
Daisy

replace_x_y_nominal_lat_lon should use assign_coords

I found a breaking change with xarray 0.15.1: http://xarray.pydata.org/en/latest/whats-new.html

from cmip6_preprocessing.preprocessing import combined_preprocessing

# col is an intake-esm datastore, opened as in the other examples on this page
variable = 'fgco2'
cat = col.search(experiment_id='esm-piControl', table_id='Omon',
             variable_id=variable)
dset_dict = cat.to_dataset_dict(cdf_kwargs={'chunks': {'time': 12*50}},
                                preprocess=combined_preprocessing)

--> The keys in the returned dictionary of datasets are constructed as follows:
	'activity_id.institution_id.source_id.experiment_id.table_id.grid_label'
                
--> There is/are 6 group(s)

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-76-7beade68e92f> in <module>
      1 dset_dict = cat.to_dataset_dict(cdf_kwargs={'chunks': {'time': 12*50}},
----> 2                                 preprocess=combined_preprocessing)

/srv/conda/envs/notebook/lib/python3.7/site-packages/intake_esm/core.py in to_dataset_dict(self, zarr_kwargs, cdf_kwargs, preprocess, aggregate, storage_options, progressbar)
    372             self.progressbar = progressbar
    373 
--> 374         return self._open_dataset()
    375 
    376     def _open_dataset(self):

/srv/conda/envs/notebook/lib/python3.7/site-packages/intake_esm/core.py in _open_dataset(self)
    484                 progress(futures)
    485 
--> 486             dsets = client.gather(futures)
    487 
    488         self._ds = {group_id: ds for (group_id, ds) in dsets}

/srv/conda/envs/notebook/lib/python3.7/site-packages/distributed/client.py in gather(self, futures, errors, direct, asynchronous)
   1891                 direct=direct,
   1892                 local_worker=local_worker,
-> 1893                 asynchronous=asynchronous,
   1894             )
   1895 

/srv/conda/envs/notebook/lib/python3.7/site-packages/distributed/client.py in sync(self, func, asynchronous, callback_timeout, *args, **kwargs)
    778         else:
    779             return sync(
--> 780                 self.loop, func, *args, callback_timeout=callback_timeout, **kwargs
    781             )
    782 

/srv/conda/envs/notebook/lib/python3.7/site-packages/distributed/utils.py in sync(loop, func, callback_timeout, *args, **kwargs)
    346     if error[0]:
    347         typ, exc, tb = error[0]
--> 348         raise exc.with_traceback(tb)
    349     else:
    350         return result[0]

/srv/conda/envs/notebook/lib/python3.7/site-packages/distributed/utils.py in f()
    330             if callback_timeout is not None:
    331                 future = asyncio.wait_for(future, callback_timeout)
--> 332             result[0] = yield future
    333         except Exception as exc:
    334             error[0] = sys.exc_info()

/srv/conda/envs/notebook/lib/python3.7/site-packages/tornado/gen.py in run(self)
    733 
    734                     try:
--> 735                         value = future.result()
    736                     except Exception:
    737                         exc_info = sys.exc_info()

/srv/conda/envs/notebook/lib/python3.7/site-packages/distributed/client.py in _gather(self, futures, errors, direct, local_worker)
   1750                             exc = CancelledError(key)
   1751                         else:
-> 1752                             raise exception.with_traceback(traceback)
   1753                         raise exc
   1754                     if errors == "skip":

/srv/conda/envs/notebook/lib/python3.7/site-packages/intake_esm/core.py in _load_group_dataset()
    565         zarr_kwargs,
    566         cdf_kwargs,
--> 567         preprocess,
    568     )
    569 

/srv/conda/envs/notebook/lib/python3.7/site-packages/intake_esm/merge_util.py in _aggregate()
    171             return ds
    172 
--> 173     return apply_aggregation(v)
    174 
    175 

/srv/conda/envs/notebook/lib/python3.7/site-packages/intake_esm/merge_util.py in apply_aggregation()
    118             dsets = [
    119                 apply_aggregation(value, agg_column, key=key, level=level + 1)
--> 120                 for key, value in v.items()
    121             ]
    122             keys = list(v.keys())

/srv/conda/envs/notebook/lib/python3.7/site-packages/intake_esm/merge_util.py in <listcomp>()
    118             dsets = [
    119                 apply_aggregation(value, agg_column, key=key, level=level + 1)
--> 120                 for key, value in v.items()
    121             ]
    122             keys = list(v.keys())

/srv/conda/envs/notebook/lib/python3.7/site-packages/intake_esm/merge_util.py in apply_aggregation()
    118             dsets = [
    119                 apply_aggregation(value, agg_column, key=key, level=level + 1)
--> 120                 for key, value in v.items()
    121             ]
    122             keys = list(v.keys())

/srv/conda/envs/notebook/lib/python3.7/site-packages/intake_esm/merge_util.py in <listcomp>()
    118             dsets = [
    119                 apply_aggregation(value, agg_column, key=key, level=level + 1)
--> 120                 for key, value in v.items()
    121             ]
    122             keys = list(v.keys())

/srv/conda/envs/notebook/lib/python3.7/site-packages/intake_esm/merge_util.py in apply_aggregation()
    100                 cdf_kwargs=cdf_kwargs,
    101                 preprocess=preprocess,
--> 102                 varname=varname,
    103             )
    104 

/srv/conda/envs/notebook/lib/python3.7/site-packages/intake_esm/merge_util.py in _open_asset()
    211     else:
    212         logger.debug(f'Applying pre-processing with {preprocess.__name__} function')
--> 213         return preprocess(ds)
    214 
    215 

/srv/conda/envs/notebook/lib/python3.7/site-packages/cmip6_preprocessing/preprocessing.py in combined_preprocessing()
    868         ds = broadcast_lonlat(ds)
    869         # replace x,y with nominal lon,lat
--> 870         ds = replace_x_y_nominal_lat_lon(ds)
    871         # shift all lons to consistent 0-360
    872         ds = correct_lon(ds)

/srv/conda/envs/notebook/lib/python3.7/site-packages/cmip6_preprocessing/preprocessing.py in replace_x_y_nominal_lat_lon()
    773         eq_ind = abs(ds.lat.mean('x')).load().argmin().data
    774         nominal_x = ds.lon.isel(y=eq_ind)
--> 775         ds.coords['x'].data = nominal_x.data
    776         ds.coords['y'].data = nominal_y.data
    777 

/srv/conda/envs/notebook/lib/python3.7/site-packages/xarray/core/common.py in __setattr__()
    260         """
    261         try:
--> 262             object.__setattr__(self, name, value)
    263         except AttributeError as e:
    264             # Don't accidentally shadow custom AttributeErrors, e.g.

/srv/conda/envs/notebook/lib/python3.7/site-packages/xarray/core/dataarray.py in data()
    551     @data.setter
    552     def data(self, value: Any) -> None:
--> 553         self.variable.data = value
    554 
    555     @property

/srv/conda/envs/notebook/lib/python3.7/site-packages/xarray/core/variable.py in data()
   2106     def data(self, data):
   2107         raise ValueError(
-> 2108             f"Cannot assign to the .data attribute of dimension coordinate a.k.a IndexVariable {self.name!r}. "
   2109             f"Please use DataArray.assign_coords, Dataset.assign_coords or Dataset.assign as appropriate."
   2110         )

ValueError: Cannot assign to the .data attribute of dimension coordinate a.k.a IndexVariable 'x'. Please use DataArray.assign_coords, Dataset.assign_coords or Dataset.assign as appropriate.
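
A hedged sketch of the fix the ValueError asks for, keeping the nominal-coordinate logic visible in the traceback but attaching the values with assign_coords instead of writing to .data (the real function may compute nominal_y differently):

def replace_x_y_nominal_lat_lon_fixed(ds):
    eq_ind = abs(ds.lat.mean("x")).load().argmin().data
    nominal_x = ds.lon.isel(y=eq_ind).data
    nominal_y = ds.lat.mean("x").data  # assumption: zonal-mean latitude as nominal y
    return ds.assign_coords(x=nominal_x, y=nominal_y)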

Try out yourself: CMIP6 pangeo cloud
https://binder.pangeo.io/v2/gh/aaronspring/LunchBytes_intake-esm_cloud/try_pangeo-notebooks

Brittle logic in renaming dict

As nicely pointed out in #51, dimension names that could also appear as coordinate names (e.g. lon/lat are used in both contexts) need to be at the end of the renaming dict's entry for e.g. x. Adding new names could lead to unintended behavior.

Just pinning this here as a reminder for the next refactor.

failed to join/concatenate datasets for model NorESM2-MM

I use the following code to load NorESM2-MM


import intake
from cmip6_preprocessing.preprocessing import combined_preprocessing

url = "https://raw.githubusercontent.com/NCAR/intake-esm-datastore/master/catalogs/pangeo-cmip6.json"
col = intake.open_esm_datastore(url)
model = 'NorESM2-MM'

query = dict(experiment_id=['historical'], table_id='Omon', 
             variable_id='tos', grid_label=['gn'], source_id=model)
cat = col.search(**query)
print(cat.df['source_id'].unique())
z_kwargs = {'consolidated': True, 'decode_times':False}
tos_dict = cat.to_dataset_dict(zarr_kwargs=z_kwargs, preprocess=combined_preprocessing)

and I get the following error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/srv/conda/envs/notebook/lib/python3.8/site-packages/intake_esm/merge_util.py in join_new(dsets, dim_name, coord_value, varname, options, group_key)
     55         concat_dim = xr.DataArray(coord_value, dims=(dim_name), name=dim_name)
---> 56         return xr.concat(dsets, dim=concat_dim, data_vars=varname, **options)
     57     except Exception as exc:
/srv/conda/envs/notebook/lib/python3.8/site-packages/xarray/core/concat.py in concat(objs, dim, data_vars, coords, compat, positions, fill_value, join, combine_attrs)
    190         )
--> 191     return f(
    192         objs, dim, data_vars, coords, compat, positions, fill_value, join, combine_attrs
/srv/conda/envs/notebook/lib/python3.8/site-packages/xarray/core/concat.py in _dataset_concat(datasets, dim, data_vars, coords, compat, positions, fill_value, join, combine_attrs)
    383     datasets = list(
--> 384         align(*datasets, join=join, copy=False, exclude=[dim], fill_value=fill_value)
    385     )
/srv/conda/envs/notebook/lib/python3.8/site-packages/xarray/core/alignment.py in align(join, copy, indexes, exclude, fill_value, *objects)
    352         else:
--> 353             new_obj = obj.reindex(
    354                 copy=copy, fill_value=fill_value, indexers=valid_indexers
/srv/conda/envs/notebook/lib/python3.8/site-packages/xarray/core/dataset.py in reindex(self, indexers, method, tolerance, copy, fill_value, **indexers_kwargs)
   2622         """
-> 2623         return self._reindex(
   2624             indexers,
/srv/conda/envs/notebook/lib/python3.8/site-packages/xarray/core/dataset.py in _reindex(self, indexers, method, tolerance, copy, fill_value, sparse, **indexers_kwargs)
   2651 
-> 2652         variables, indexes = alignment.reindex_variables(
   2653             self.variables,
/srv/conda/envs/notebook/lib/python3.8/site-packages/xarray/core/alignment.py in reindex_variables(variables, sizes, indexes, indexers, method, tolerance, copy, fill_value, sparse)
    564             if not index.is_unique:
--> 565                 raise ValueError(
    566                     "cannot reindex or align along dimension %r because the "
ValueError: cannot reindex or align along dimension 'time' because the index has duplicate values
The above exception was the direct cause of the following exception:
AggregationError                          Traceback (most recent call last)
<ipython-input-25-1b10fdf987c0> in <module>
     37 
     38 z_kwargs = {'consolidated': True, 'decode_times':True}
---> 39 tos_dict = cat.to_dataset_dict(zarr_kwargs=z_kwargs,
     40                                 preprocess=combined_preprocessing)
     41 
/srv/conda/envs/notebook/lib/python3.8/site-packages/intake_esm/core.py in to_dataset_dict(self, zarr_kwargs, cdf_kwargs, preprocess, storage_options, progressbar, aggregate)
    928             ]
    929             for i, task in enumerate(concurrent.futures.as_completed(future_tasks)):
--> 930                 key, ds = task.result()
    931                 self._datasets[key] = ds
    932                 if self.progressbar:
/srv/conda/envs/notebook/lib/python3.8/concurrent/futures/_base.py in result(self, timeout)
    430                 raise CancelledError()
    431             elif self._state == FINISHED:
--> 432                 return self.__get_result()
    433 
    434             self._condition.wait(timeout)
/srv/conda/envs/notebook/lib/python3.8/concurrent/futures/_base.py in __get_result(self)
    386     def __get_result(self):
    387         if self._exception:
--> 388             raise self._exception
    389         else:
    390             return self._result
/srv/conda/envs/notebook/lib/python3.8/concurrent/futures/thread.py in run(self)
     55 
     56         try:
---> 57             result = self.fn(*self.args, **self.kwargs)
     58         except BaseException as exc:
     59             self.future.set_exception(exc)
/srv/conda/envs/notebook/lib/python3.8/site-packages/intake_esm/core.py in _load_source(key, source)
    914 
    915         def _load_source(key, source):
--> 916             return key, source.to_dask()
    917 
    918         sources = {key: source(**source_kwargs) for key, source in self.items()}
/srv/conda/envs/notebook/lib/python3.8/site-packages/intake_esm/source.py in to_dask(self)
    244     def to_dask(self):
    245         """Return xarray object (which will have chunks)"""
--> 246         self._load_metadata()
    247         return self._ds
    248 
/srv/conda/envs/notebook/lib/python3.8/site-packages/intake/source/base.py in _load_metadata(self)
    124         """load metadata only if needed"""
    125         if self._schema is None:
--> 126             self._schema = self._get_schema()
    127             self.datashape = self._schema.datashape
    128             self.dtype = self._schema.dtype
/srv/conda/envs/notebook/lib/python3.8/site-packages/intake_esm/source.py in _get_schema(self)
    173 
    174         if self._ds is None:
--> 175             self._open_dataset()
    176 
    177             metadata = {
/srv/conda/envs/notebook/lib/python3.8/site-packages/intake_esm/source.py in _open_dataset(self)
    230         n_agg = len(self.aggregation_columns)
    231 
--> 232         ds = _aggregate(
    233             self.aggregation_dict,
    234             self.aggregation_columns,
/srv/conda/envs/notebook/lib/python3.8/site-packages/intake_esm/merge_util.py in _aggregate(aggregation_dict, agg_columns, n_agg, nd, mapper_dict, group_key)
    238         return ds
    239 
--> 240     return apply_aggregation(nd)
    241 
    242 
/srv/conda/envs/notebook/lib/python3.8/site-packages/intake_esm/merge_util.py in apply_aggregation(nd, agg_column, key, level)
    194             agg_options = {}
    195 
--> 196         dsets = [
    197             apply_aggregation(value, agg_column, key=key, level=level + 1)
    198             for key, value in nd.items()
/srv/conda/envs/notebook/lib/python3.8/site-packages/intake_esm/merge_util.py in <listcomp>(.0)
    195 
    196         dsets = [
--> 197             apply_aggregation(value, agg_column, key=key, level=level + 1)
    198             for key, value in nd.items()
    199         ]
/srv/conda/envs/notebook/lib/python3.8/site-packages/intake_esm/merge_util.py in apply_aggregation(nd, agg_column, key, level)
    216         if agg_type == 'join_new':
    217             varname = dsets[0].attrs['intake_esm_varname']
--> 218             ds = join_new(
    219                 dsets,
    220                 dim_name=agg_column,
/srv/conda/envs/notebook/lib/python3.8/site-packages/intake_esm/merge_util.py in join_new(dsets, dim_name, coord_value, varname, options, group_key)
     69         """
     70 
---> 71         raise AggregationError(message) from exc
     72 
     73 
AggregationError: 
        Failed to join/concatenate datasets in group with key=CMIP.NCC.NorESM2-MM.historical.Omon.gn along a new dimension `member_id`.
        *** Arguments passed to xarray.concat() ***:
        - objs: a list of 3 datasets
        - dim: <xarray.DataArray 'member_id' (member_id: 3)>
array(['r1i1p1f1', 'r2i1p1f1', 'r3i1p1f1'], dtype='<U8')
Dimensions without coordinates: member_id
        - data_vars: ['tos']
        - and kwargs: {'coords': 'minimal', 'compat': 'override'}

The pangeo-cmip6.json url should be updated

The current implementation points to the old intake_esm/json file here:
https://raw.githubusercontent.com/NCAR/intake-esm-datastore/master/catalogs/pangeo-cmip6.json
but really should be updated to:
https://cmip6.storage.googleapis.com/pangeo-cmip6.json

The current differences are not very important for cmip6_preprocessing, but as we change the way we deal with dataset versions, see this issue, I would like to be able to update the json, if needed.

Thanks! (Yes, I know I could do a pull request - but it will be much quicker (and more reliable) if @jbusecke does it!)

Adding limited support for unstructured grids

I suggest adding some limited support for unstructured grids, since @jbusecke expressed his interest. Among CMIP6 models this would cover AWI-CM (FESOM1.4) and MPAS-O (E3SM). All metrics that are based on multiplying scalar data by areacello and the depth of the vertical levels should be possible. The only big problem I see is that sometimes areacello should be 3D.

In the end it would be nice to see unstructured models in analyses like this: https://cmip6-preprocessing.readthedocs.io/en/latest/parsing_metrics.html

Some very basic information on FESOM1.4 mesh is available here: https://fesom.de/cmip6/work-with-awi-cm-unstructured-data/

@trackow, @helgegoessling and @hegish, you know the AWI-CM1 grids much better than I do - can you share the file with node weights for the LR and MR meshes? Also, do you see any potential problems?

@mark-petersen Maybe your group will be interested as well?

@jbusecke What do you think should be our first steps apart from providing the file with weights? :)

Zero padded values in MPI-LR

I just noticed that some of the MPI-LR outputs do not have NaNs on land but instead are padded with zeros.

Not sure if we should be fixing that?

encoding CESM1-1-CAM5-CMIP5

This might be a bit off-topic, but it fits if we regard the package as making CMIP6 work in xarray.

When I tried to save CESM1-1-CAM5-CMIP5 data to netCDF with ds.to_netcdf(), I got the following warning:
Variable 'tas' has multiple fill values [1e+20, 1e+20]. Cannot encode data.

I haven't tested this for more variables and models.
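
A hedged workaround sketch: this warning usually means both _FillValue and missing_value are present in the encoding, so making them unambiguous before writing may help (variable name and fill value taken from the warning above; the output filename is hypothetical):

# drop the duplicate fill-value entry and keep a single explicit _FillValue
ds["tas"].encoding.pop("missing_value", None)
ds["tas"].attrs.pop("missing_value", None)
ds["tas"].encoding["_FillValue"] = 1.0e20
ds.to_netcdf("tas_CESM1-1-CAM5-CMIP5.nc")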

Consistent logging

Currently the functions return a hot mess of print statements/warnings.

I think for v0.2 it would be nice to implement a logger, so that detailed messages can be turned on/off easily and consistently.
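
A minimal sketch of what that could look like, assuming nothing about the final design; a module-level logger lets users dial verbosity up or down in one place:

import logging

logger = logging.getLogger("cmip6_preprocessing")

def example_fix(ds):
    # would replace a print/warn call inside a preprocessing step (hypothetical function)
    logger.debug("renaming dims for %s", ds.attrs.get("source_id", "unknown"))
    return ds

# user side: turn detailed messages on
logging.basicConfig()
logging.getLogger("cmip6_preprocessing").setLevel(logging.DEBUG)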

Feature request: xesmf regrid ocean

What I think would be very powerful is a demo or module that enables regridding all the different ocean grids so they can be concatenated into a single xr.Dataset.

Often a regridding to 1x1 fails with the default settings.

Think about cyclic/periodic True/False and other fixes that everyone currently does manually (a rough sketch follows below).
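
A hedged sketch of the requested workflow, assuming xesmf is installed; the 2-D lon/lat produced by combined_preprocessing are what xesmf expects for curvilinear ocean grids, and the keyword choices below are examples rather than a recommendation:

import xarray as xr
import xesmf as xe

target = xe.util.grid_global(1, 1)  # 1x1 degree target grid

def regrid_to_1x1(ds):
    regridder = xe.Regridder(ds, target, "bilinear", periodic=True, ignore_degenerate=True)
    return regridder(ds)

# regridded datasets share the same grid and could then be concatenated, e.g.:
# combined = xr.concat([regrid_to_1x1(ds) for ds in ddict.values()], dim="source_id")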

Sorting longitudes produces many chunks

These lines do not work nicely with the auto-chunking (see #58).

@dcherian mentioned that it might be more efficient to separate the array into just two blocks and then rearrange those. This seems like a promising approach to try (see the sketch below).

The other way of dealing with it (which is what I am currently doing) is to omit that step altogether and execute it at the end of a workflow, once the dimensionality has been reduced (e.g. for plotting).
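
A minimal sketch of the two-block idea, assuming the nominal longitude is a cyclically shifted but otherwise sorted 1-D coordinate; slicing into two contiguous blocks keeps the original chunks instead of producing one chunk per index:

import numpy as np
import xarray as xr

def reorder_lon(ds):
    # index where the wrapped (smallest) longitude starts
    split = int(np.argmin(ds["x"].values))
    return xr.concat(
        [ds.isel(x=slice(split, None)), ds.isel(x=slice(None, split))],
        dim="x",
    )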

Incorrect grid output when combined_preprocessing is used.

Thanks for putting the package together! This is a great tool to deal with all the differences among CMIP6 models.

I noticed that in some cases the output does not look right when combined_preprocessing is used. Please see below two examples, 'HadGEM3-GC31-MM' and 'CMCC-CM2-HR4', where I am comparing the grid cell area with and without combined_preprocessing. I can look into this further, but I am not sure how much time I can spend on it right now, so I thought I would share it here already. Any idea what's going on?

[screenshots: grid cell area for HadGEM3-GC31-MM and CMCC-CM2-HR4, with and without combined_preprocessing]

`replace_x_y_nominal_lat_lon` produces non monotonic dimensions.

The logic in replace_x_y_nominal_lat_lon does produce duplicate values for certain models:

See here:

import xarray as xr
import numpy as np
import intake
from cmip6_preprocessing.preprocessing import combined_preprocessing
col = intake.open_esm_datastore("https://raw.githubusercontent.com/NCAR/intake-esm-datastore/master/catalogs/pangeo-cmip6.json")
cat = col.search(source_id='MPI-ESM1-2-HR',
                 experiment_id='historical',
                 variable_id='o2',
                 table_id='Omon',
                 member_id='r1i1p1f1')
ddict = cat.to_dataset_dict(zarr_kwargs={'consolidated': True}, preprocess=combined_preprocessing)
ddict
ds = ddict['CMIP.MPI-M.MPI-ESM1-2-HR.historical.Omon.gn']
# check all dims for duplicates
for di in ds.dims:
    print(di)
    assert len(ds[di]) == len(np.unique(ds[di]))

--> The keys in the returned dictionary of datasets are constructed as follows:
	'activity_id.institution_id.source_id.experiment_id.table_id.grid_label'
                
--> There is/are 1 group(s)
bnds
lev
member_id
time
vertex
x

---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-8-897d6aeae206> in <module>
     15 for di in ds.dims:
     16     print(di)
---> 17     assert len(ds[di]) == len(np.unique(ds[di]))

AssertionError: 

I'll fix this logic in a new release (which might lead to breaking changes in the dimension creation).
