
Comments (5)

tomwhite commented on May 29, 2024

That makes a lot of sense, thanks for the clear explanation.

So yes, perhaps the simplest thing is to have a separate Zarr store for each init time. Larger consolidated datasets can be built later as needed.

from metoffice_ec2.

JackKelly commented on May 29, 2024

Definitely agree this is a good idea!

How would you like the data pipeline to look for implementing this on AWS given that the Met Office files arrive out-of-order?

Should we have two micro-services: one which just dumps Met Office data to S3 in whatever order it arrives; and a second service which concatenates those files (and deletes the originals)? We could save a few quid by storing the temporary files on the VM's local disk; but then we'd have to do some work to make that resilient to the VM failing. Or is there a more elegant solution? :)
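A minimal sketch of what the second (concatenating) service might do first, assuming each arrived file can be described by a small record with `field`, `init_time` and `forecast_time` keys. Those record keys and the function name `group_by_run` are hypothetical, chosen only for illustration; the actual concatenation and the deletion of the originals are left out.

```python
from collections import defaultdict
from datetime import datetime


def group_by_run(arrivals):
    """Group files by (field, init_time) and order each group by forecast_time.

    `arrivals` is a list of dicts in whatever order the files arrived.
    The dict keys used here are hypothetical, not a Met Office format.
    """
    runs = defaultdict(list)
    for item in arrivals:
        runs[(item["field"], item["init_time"])].append(item)
    # Sort within each run so the files can be concatenated in forecast order.
    return {
        key: sorted(items, key=lambda i: i["forecast_time"])
        for key, items in runs.items()
    }


init = datetime(2020, 1, 1, 0)
arrivals = [
    {"field": "irradiance", "init_time": init, "forecast_time": datetime(2020, 1, 1, 2)},
    {"field": "irradiance", "init_time": init, "forecast_time": datetime(2020, 1, 1, 0)},
    {"field": "irradiance", "init_time": init, "forecast_time": datetime(2020, 1, 1, 1)},
]
ordered = group_by_run(arrivals)[("irradiance", init)]
```

Because the grouping only depends on the metadata, it would not matter whether the temporary files live on S3 or on the VM's local disk.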

For reference, @tv3141's notebook also supports the idea of using a single dataset per NWP field because it:

  • reduces data read time
  • reduces size on disk by 4%


JackKelly commented on May 29, 2024

Sorry, one more quick thought: Should we have a single Zarr store per NWP field and per NWP run? e.g. we'd have a Zarr store for, say, irradiance for the NWP initialised at 2020-01-01T00; and another Zarr store for irradiance for the NWP initialised at 2020-01-01T01, etc. Does that sound right?
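To make the proposed layout concrete, here is a rough sketch of how one store per NWP field and per init time could be named. The function `store_path`, the bucket name and the directory scheme are all assumptions for illustration, not anything the project has settled on.

```python
from datetime import datetime


def store_path(root, field, init_time):
    """Build a hypothetical Zarr store path for one field and one init time."""
    return f"{root}/{field}/{init_time:%Y-%m-%dT%H}.zarr"


# "s3://example-bucket/nwp" is a placeholder root, not a real bucket.
first = store_path("s3://example-bucket/nwp", "irradiance", datetime(2020, 1, 1, 0))
second = store_path("s3://example-bucket/nwp", "irradiance", datetime(2020, 1, 1, 1))
```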


tomwhite commented on May 29, 2024

On the out-of-order problem, your solution sounds like a good one. Do you know what the maximum lag is? If it's long enough to start impacting the timeliness of predictions, then the predictions can be run from the out-of-order individual files, and the ordered file can be created later for the purpose of training models.

> Should we have a single Zarr store per NWP field and per NWP run? e.g. we'd have a Zarr store for, say, irradiance for the NWP initialised at 2020-01-01T00; and another Zarr store for irradiance for the NWP initialised at 2020-01-01T01, etc. Does that sound right?

No, I think there should be a single Zarr store, chunked along the time dimension. Then if you wanted a certain time range you would only have to load the relevant chunks. If you had one store per time, then you would have to open multiple stores to analyse a bigger time range, and the problem with this is that xarray doesn't have good support for it (afaict).
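As a rough illustration of why chunking along the time dimension helps, the sketch below works out which chunks a given time range overlaps on a regular chunk grid. It is plain index arithmetic, not Zarr's own code; the function name `chunks_for_range` and the chunk size of 24 time steps are assumptions.

```python
def chunks_for_range(start_index, stop_index, chunk_size):
    """Return the chunk indices overlapping the half-open range [start_index, stop_index)."""
    if stop_index <= start_index:
        return []
    first = start_index // chunk_size
    last = (stop_index - 1) // chunk_size
    return list(range(first, last + 1))


# With 24 time steps per chunk (an assumed value), steps 30..49 touch only
# chunks 1 and 2, however many chunks the store holds in total.
needed = chunks_for_range(30, 50, 24)
```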


JackKelly commented on May 29, 2024

> Do you know what the maximum lag is?

I'm afraid I don't know for sure! I would guess the lag is "pretty small" (tens of seconds?!?) But I'm not really sure!

Regarding multiple Zarr stores for different init times: one of the things that makes talking about NWPs confusing is that there are two time dimensions that we care about: the initialisation time (the time that the Met Office started computing the forecasts) and the forecast time (the time that the forecast is about). In the case of MOGREPS, the Met Office runs 3 ensemble members (aka 'realisations') concurrently every hour, and each run provides a forecast for the next 5 days.

To give a concrete example:

At 2020-01-01T00 the Met Office computed 3 ensemble members, each of which provides forecasts for the range [2020-01-01T00, 2020-01-05T23] at hourly intervals.

Then, an hour later, at 2020-01-01T01, the Met Office kicked off another 3 ensemble members, providing forecasts for the range [2020-01-01T01, 2020-01-06T00].
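The two runs above can be reproduced with a few lines of Python. This is only a sketch: the function name `forecast_times` is made up, and the default of 120 hourly steps is inferred from "the next 5 days" at hourly intervals.

```python
from datetime import datetime, timedelta


def forecast_times(init_time, horizon_hours=120):
    """Hourly forecast times for one run, starting at its init time."""
    return [init_time + timedelta(hours=h) for h in range(horizon_hours)]


run_00 = forecast_times(datetime(2020, 1, 1, 0))
run_01 = forecast_times(datetime(2020, 1, 1, 1))

# Consecutive runs overlap in all but one forecast time at each end, which is
# why init time and forecast time have to be treated as separate dimensions.
overlap = set(run_00) & set(run_01)
```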

(Actually, to be pedantic, there are three time dimensions we care about: 1) init time, 2) forecast time, and 3) the time the NWPs become available to our code, which is at least 24 hours later while we're using the free NWPs on AWS!)

I definitely agree that, for a given NWP field and a given init time, all forecast times should go into a single Zarr store. But should we have a separate Zarr store for each init time?

