Comments (5)
That makes a lot of sense, thanks for the clear explanation.
So yes, perhaps the simplest thing is to have a separate Zarr store for each init time. Larger consolidated datasets can be built later as needed.
from metoffice_ec2.
Definitely agree this is a good idea!
How would you like the data pipeline to look for implementing this on AWS given that the Met Office files arrive out-of-order?
Should we have two micro-services: one which just dumps Met Office data to S3 in whatever order it arrives; and a second service which concatenates those files (and deletes the originals)? We could save a few quid by storing the temporary files on the VM's local disk; but then we'd have to do some work to make that resilient to the VM failing. Or is there a more elegant solution? :)
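To make the first micro-service a bit more concrete, here's an illustrative sketch of the "dump in whatever order it arrives" idea: derive a deterministic object key from the incoming filename, so the layout on S3 is stable no matter the arrival order. (The filename pattern, the `object_key` helper, and the dict standing in for an S3 bucket are all assumptions for the sketch, not the real Met Office layout.)

```python
import re

def object_key(filename: str) -> str:
    # Hypothetical: derive a deterministic S3 key from the filename, assuming a
    # pattern like "irradiance_2020-01-01T03_step002.nc".
    m = re.match(r"(?P<field>\w+)_(?P<init>[\d\-T:]+)_step(?P<step>\d+)\.nc$", filename)
    if m is None:
        raise ValueError(f"unrecognised filename: {filename}")
    return f"raw/{m['field']}/{m['init']}/step{m['step']}.nc"

# Stand-in for S3: service 1 just dumps whatever arrives, in any order.
bucket: dict[str, bytes] = {}

def ingest(filename: str, payload: bytes) -> None:
    bucket[object_key(filename)] = payload

# Files arriving out of order still land in a predictable layout,
# which service 2 can later list, concatenate, and clean up.
for name in ["irradiance_2020-01-01T03_step002.nc",
             "irradiance_2020-01-01T03_step000.nc"]:
    ingest(name, b"...")
```

Keying by init time and step like this also means the concatenation service can tell at a glance which steps of a run have arrived so far.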
For reference, @tv3141's notebook also supports the idea of using a single dataset per NWP field because it:
- reduces data read time
- reduces size on disk by 4%
Sorry, one more quick thought: Should we have a single Zarr store per NWP field and per NWP run? e.g. we'd have a Zarr store for, say, irradiance for the NWP initialised at 2020-01-01T00; and another Zarr store for irradiance for the NWP initialised at 2020-01-01T01, etc. Does that sound right?
On the out-of-order problem, your solution sounds like a good one. Do you know what the maximum lag is? If it's long enough to start impacting the timeliness of predictions, then predictions can be run from the out-of-order individual files, and the ordered file can be created later for the purpose of training models.
Should we have a single Zarr store per NWP field and per NWP run? e.g. we'd have a Zarr store for, say, irradiance for the NWP initialised at 2020-01-01T00; and another Zarr store for irradiance for the NWP initialised at 2020-01-01T01, etc. Does that sound right?
No, I think there should be a single Zarr store, chunked along the time dimension. Then if you wanted a certain time range you would only have to load the relevant chunks. If you had one store per time, then you would have to open multiple stores to analyse a bigger time range, and the problem with this is that xarray doesn't have good support for it (afaict).
Do you know what the maximum lag is?
I'm afraid I don't know for sure! I would guess the lag is "pretty small" (tens of seconds?!?) But I'm not really sure!
Regarding multiple Zarr stores for different init times: one of the things that makes talking about NWPs confusing is that there are two time dimensions we care about: the initialisation time (the time the Met Office started computing the forecasts) and the forecast time (the time the forecast is about). In the case of MOGREPS, the Met Office runs 3 ensemble members (aka 'realisations') concurrently every hour, and each run provides a forecast for the next 5 days.
To give a concrete example:
At 2020-01-01T00, the Met Office computed 3 ensemble members, each of which provides forecasts for the range [2020-01-01T00, 2020-01-05T23] at hourly intervals. Then, an hour later, at 2020-01-01T01, the Met Office kicked off another 3 ensemble members, providing forecasts for the range [2020-01-01T01, 2020-01-06T00].
(Actually, to be pedantic, there are three time dimensions we care about: 1) init time, 2) forecast time, and 3) the time the NWPs become available to our code, which is at least 24 hours while we're using the free NWPs on AWS!)
I definitely agree that, for a given NWP field and a given init time, all forecast times should go into a single Zarr store. But should we have a separate Zarr store for each init time?
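One way to picture the two time dimensions (purely illustrative; the dimension names `init_time`, `step` and `realisation` are assumptions for the sketch) is a single dataset that keeps both explicit, with the forecast (valid) time derived from the other two:

```python
import numpy as np
import pandas as pd
import xarray as xr

# Two hourly init times, 5 days of hourly lead times, 3 ensemble members.
init_times = pd.date_range("2020-01-01T00", periods=2, freq="1h")
steps = pd.timedelta_range("0h", "119h", freq="1h")
realisations = [0, 1, 2]  # MOGREPS runs 3 ensemble members per init time

ds = xr.Dataset(
    {"irradiance": (("init_time", "step", "realisation"),
                    np.zeros((len(init_times), len(steps), len(realisations))))},
    coords={"init_time": init_times, "step": steps, "realisation": realisations},
)

# Forecast time for every (init_time, step) pair, by broadcasting:
forecast_time = ds.init_time + ds.step
```

Whether this lives in one store or one store per init time, keeping init time and lead time as separate dimensions (rather than indexing by forecast time directly) avoids the ambiguity of overlapping forecast ranges between runs.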