Comments (5)
That makes a lot of sense, thanks for the clear explanation.
So yes, perhaps the simplest thing is to have a separate Zarr store for each init time. Larger consolidated datasets can be built later as needed.
from metoffice_ec2.
Definitely agree this is a good idea!
How would you like the data pipeline to look for implementing this on AWS given that the Met Office files arrive out-of-order?
Should we have two micro-services: one which just dumps Met Office data to S3 in whatever order it arrives; and a second service which concatenates those files (and deletes the originals)? We could save a few quid by storing the temporary files on the VM's local disk; but then we'd have to do some work to make that resilient to the VM failing. Or is there a more elegant solution? :)
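To make the first micro-service a bit more concrete, here's an illustrative sketch of the "dump in whatever order it arrives" idea: derive a deterministic object key from the incoming filename, so the layout on S3 is stable no matter the arrival order. (The filename pattern, the `object_key` helper, and the dict standing in for an S3 bucket are all assumptions for the sketch, not the real Met Office layout.)

```python
import re

def object_key(filename: str) -> str:
    # Hypothetical: derive a deterministic S3 key from the filename, assuming a
    # pattern like "irradiance_2020-01-01T03_step002.nc".
    m = re.match(r"(?P<field>\w+)_(?P<init>[\d\-T:]+)_step(?P<step>\d+)\.nc$", filename)
    if m is None:
        raise ValueError(f"unrecognised filename: {filename}")
    return f"raw/{m['field']}/{m['init']}/step{m['step']}.nc"

# Stand-in for S3: service 1 just dumps whatever arrives, in any order.
bucket: dict[str, bytes] = {}

def ingest(filename: str, payload: bytes) -> None:
    bucket[object_key(filename)] = payload

# Files arriving out of order still land in a predictable layout,
# which service 2 can later list, concatenate, and clean up.
for name in ["irradiance_2020-01-01T03_step002.nc",
             "irradiance_2020-01-01T03_step000.nc"]:
    ingest(name, b"...")
```

Keying by init time and step like this also means the concatenation service can tell at a glance which steps of a run have arrived so far.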
For reference, @tv3141's notebook also supports the idea of using a single dataset per NWP field because it:
- reduces data read time
- reduces size on disk by 4%
Sorry, one more quick thought: Should we have a single Zarr store per NWP field and per NWP run? e.g. we'd have a Zarr store for, say, irradiance for the NWP initialised at 2020-01-01T00; and another Zarr store for irradiance for the NWP initialised at 2020-01-01T01, etc. Does that sound right?
On the out-of-order problem, your solution sounds like a good one. Do you know what the maximum lag is? If it's long enough to start impacting the timeliness of predictions, then predictions can be run from the out-of-order individual files, and the ordered file can be created later for the purpose of training models.
Should we have a single Zarr store per NWP field and per NWP run? e.g. we'd have a Zarr store for, say, irradiance for the NWP initialised at 2020-01-01T00; and another Zarr store for irradiance for the NWP initialised at 2020-01-01T01, etc. Does that sound right?
No, I think there should be a single Zarr store, chunked along the time dimension. Then if you wanted a certain time range you would only have to load the relevant chunks. If you had one store per time, then you would have to open multiple stores to analyse a bigger time range, and the problem with this is that xarray doesn't have good support for it (afaict).
Do you know what the maximum lag is?
I'm afraid I don't know for sure! I would guess the lag is "pretty small" (tens of seconds?!?) But I'm not really sure!
Regarding multiple Zarr stores for different init times: one of the things that makes talking about NWPs confusing is that there are two time dimensions we care about: the initialisation time (the time the Met Office started computing the forecasts) and the forecast time (the time the forecast is about). In the case of MOGREPS, the Met Office runs 3 ensemble members (aka 'realisations') concurrently every hour, and each run provides a forecast for the next 5 days.
To give a concrete example:
At 2020-01-01T00, the Met Office computed 3 ensemble members, each of which provides forecasts for the range [2020-01-01T00, 2020-01-05T23] at hourly intervals. Then, an hour later, at 2020-01-01T01, the Met Office kicked off another 3 ensemble members, providing forecasts for the range [2020-01-01T01, 2020-01-06T00].
(Actually, to be pedantic, there are three time dimensions we care about: 1) init time, 2) forecast time, and 3) the time the NWPs become available to our code, which is at least 24 hours while we're using the free NWPs on AWS!)
I definitely agree that, for a given NWP field and a given init time, all forecast times should go into a single Zarr store. But should we have a separate Zarr store for each init time?
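One way to picture the two time dimensions (purely illustrative; the dimension names `init_time`, `step` and `realisation` are assumptions for the sketch) is a single dataset that keeps both explicit, with the forecast (valid) time derived from the other two:

```python
import numpy as np
import pandas as pd
import xarray as xr

# Two hourly init times, 5 days of hourly lead times, 3 ensemble members.
init_times = pd.date_range("2020-01-01T00", periods=2, freq="1h")
steps = pd.timedelta_range("0h", "119h", freq="1h")
realisations = [0, 1, 2]  # MOGREPS runs 3 ensemble members per init time

ds = xr.Dataset(
    {"irradiance": (("init_time", "step", "realisation"),
                    np.zeros((len(init_times), len(steps), len(realisations))))},
    coords={"init_time": init_times, "step": steps, "realisation": realisations},
)

# Forecast time for every (init_time, step) pair, by broadcasting:
forecast_time = ds.init_time + ds.step
```

Whether this lives in one store or one store per init time, keeping init time and lead time as separate dimensions (rather than indexing by forecast time directly) avoids the ambiguity of overlapping forecast ranges between runs.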