openclimatefix / metoffice_ec2
Subset Met Office MOGREPS-UK and UKV on AWS EC2
License: MIT License
Currently we are listening to all messages published on the respective SNS topics.
That leads to quite a high volume of messages in SQS, even though message retention is only 90 minutes.
We should check whether we can filter our subscription to only let through relevant messages.
See: https://docs.aws.amazon.com/sns/latest/dg/sns-message-filtering.html
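A minimal sketch of what that could look like with boto3 (the subscription ARN is a placeholder, and the message attribute name 'model' is an assumption that would need checking against what the Met Office actually attaches to its SNS messages):

import json
import boto3

sns = boto3.client("sns")

# Hypothetical filter policy: only deliver messages whose 'model'
# message attribute is MOGREPS-UK or UKV.
filter_policy = {"model": ["MOGREPS-UK", "UKV"]}

sns.set_subscription_attributes(
    SubscriptionArn="arn:aws:sns:...:our-subscription",  # placeholder ARN
    AttributeName="FilterPolicy",
    AttributeValue=json.dumps(filter_policy),
)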
There is a very simple baseline model in https://github.com/openclimatefix/predict_pv_yield_nwp that we can use to make predictions for a small set of PV systems, using irradiance data from UKV as input.
The output should be in JSON so it can be easily consumed by the frontend.
Via #28 we realised that our code, in combination with the awslogs logging driver, is extremely good at hiding all kinds of error messages. That's quite bad.
We should add Sentry for error logging so our ears start ringing whenever something goes wrong.
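A minimal sketch of wiring Sentry into the existing Python logging setup (the DSN is a placeholder; levels are a suggestion, not a decision):

import logging

import sentry_sdk
from sentry_sdk.integrations.logging import LoggingIntegration

# Send logging.ERROR and above to Sentry as events, and attach
# logging.INFO and above as breadcrumbs on those events.
sentry_logging = LoggingIntegration(level=logging.INFO, event_level=logging.ERROR)

sentry_sdk.init(
    dsn="https://<key>@sentry.io/<project>",  # placeholder DSN
    integrations=[sentry_logging],
)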
And update README.
Or think of a more general name. Maybe just metoffice_subset or metoffice_aws or something.
2020-06-18 13:39:13,637 - metoffice_ec2 - ERROR - dimensions or multi-index levels ['height'] do not exist
Traceback (most recent call last):
  File "scripts/ec2.py", line 132, in loop
    load_subset_and_save_data(mo_message, height_meters, s3)
  File "scripts/ec2.py", line 82, in load_subset_and_save_data
    dataset = subset.subset(dataset, height_meters, **DEFAULT_GEO_BOUNDARY)
  File "/usr/src/app/metoffice_ec2/subset.py", line 20, in subset
    dataset = dataset.sel(height=height_meters)
  File "/opt/conda/envs/metoffice_ec2/lib/python3.8/site-packages/xarray/core/dataset.py", line 2065, in sel
    pos_indexers, new_indexes = remap_label_indexers(
  File "/opt/conda/envs/metoffice_ec2/lib/python3.8/site-packages/xarray/core/coordinates.py", line 396, in remap_label_indexers
    pos_indexers, new_indexes = indexing.remap_label_indexers(
  File "/opt/conda/envs/metoffice_ec2/lib/python3.8/site-packages/xarray/core/indexing.py", line 254, in remap_label_indexers
    dim_indexers = get_dim_indexers(data_obj, indexers)
  File "/opt/conda/envs/metoffice_ec2/lib/python3.8/site-packages/xarray/core/indexing.py", line 220, in get_dim_indexers
    raise ValueError(f"dimensions or multi-index levels {invalid!r} do not exist")
Hey @tomwhite, I just got around to running predict from your work in f1812ff to get it deployed on AWS. One thing I noticed is that the Zarr file you use in the predict_test produces a bunch of NaN values in the resulting GeoJSON.
e.g.
{
  "type": "Feature",
  "geometry": {
    "type": "Point",
    "coordinates": [
      -0.07,
      52.33
    ]
  },
  "properties": {
    "system_id": 1883,
    "time": "2020-09-08T13:00:00",
    "pv_yield_predicted": NaN
  }
},
It's no biggie - apparently NaN is fine in JS, just not valid JSON - but I'd like to understand where these values come from and whether we can get rid of them. I did check whether the system_id might be missing from the model, but it is present.
Do you have any ideas?
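For context (my own sketch, not taken from the predict code): Python's json module happily emits the non-standard NaN token unless told otherwise, so one option would be to fail fast or to replace NaN with null at write time:

import json
import math

feature_props = {"system_id": 1883, "pv_yield_predicted": float("nan")}

# Default behaviour: emits NaN, which JS tolerates but strict JSON parsers reject.
print(json.dumps(feature_props))

# Option 1: fail fast so we notice NaNs when writing the output.
try:
    json.dumps(feature_props, allow_nan=False)
except ValueError as err:
    print("NaN detected:", err)

# Option 2: replace NaN with null before serialising.
cleaned = {
    key: (None if isinstance(value, float) and math.isnan(value) else value)
    for key, value in feature_props.items()
}
print(json.dumps(cleaned))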
We should automatically build docker images based on releases in master.
Likely can use something like this action: https://github.com/marketplace/actions/publish-docker (see "types: [published]").
MetOfficeMessage.load_netcdf() ends with the line:
return xr.open_dataset(netcdf_bytes_io, engine='h5netcdf')
This means we must include h5netcdf in environment.yml. But including h5netcdf forces conda to downgrade cartopy from v0.18 to v0.17, which breaks nwp_plot.py:
> ax = plt.axes(projection=ccrs.OSGB(approx=True))
E TypeError: __init__() got an unexpected keyword argument 'approx'
metoffice_ec2/nwp_plot.py:35: TypeError
One solution may be to see if we can use an xarray engine which isn't h5netcdf.
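One hedged option (an assumption, not tested, and whether conda can solve netCDF4 plus cartopy 0.18 together would still need checking): the netcdf4 engine can't read from an in-memory BytesIO object, but we could spill the payload to a temporary file and open it from there:

import os
import tempfile

import xarray as xr

def open_netcdf_bytes(netcdf_bytes: bytes) -> xr.Dataset:
    # The netcdf4 engine needs a real path, so write the payload to a
    # temporary file, load the dataset fully into memory, then clean up.
    with tempfile.NamedTemporaryFile(suffix=".nc", delete=False) as tmp:
        tmp.write(netcdf_bytes)
        path = tmp.name
    try:
        return xr.open_dataset(path, engine="netcdf4").load()
    finally:
        os.remove(path)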
It would be useful to have a few representative files checked into the repo for testing purposes (i.e. to write unit tests against). It would also help if we want to change the way we store things (e.g. #17), so we can test changes locally.
As per @tv3141's work in https://github.com/tv3141/metoffice_ec2/blob/storage_optimisation_notebook/storage_optimisation/optimize_storage.ipynb
Just to emphasise: We wouldn't publish the raw data (which can be commercially sensitive). Instead we'd train ML models on the data, and use the models to produce public predictions.
After tweaking the desired_count setting of our fleet a few times, we have reached the point where we should scale more dynamically rather than adjusting the desired_count by hand.
We should set up a step-based autoscaler, based on the SQS AverageMessageAge.
I think a sensible initial baseline is:
I think a sensible initial baseline is:
Current hard deadline is 90min (5400s).
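A rough sketch of the step-scaling side with boto3 (the cluster/service names, capacities, thresholds, and cooldown are placeholders; the CloudWatch alarm on the queue-age metric that would actually trigger the policy is not shown):

import boto3

autoscaling = boto3.client("application-autoscaling")

resource_id = "service/our-cluster/metoffice-ec2"  # placeholder cluster/service

# Let ECS scale the service's desired_count between 1 and 10 tasks.
autoscaling.register_scalable_target(
    ServiceNamespace="ecs",
    ResourceId=resource_id,
    ScalableDimension="ecs:service:DesiredCount",
    MinCapacity=1,
    MaxCapacity=10,
)

# Step policy: add tasks as the alarm breach grows.  The CloudWatch alarm
# on the SQS message-age metric would reference this policy's ARN.
autoscaling.put_scaling_policy(
    PolicyName="scale-on-sqs-message-age",
    ServiceNamespace="ecs",
    ResourceId=resource_id,
    ScalableDimension="ecs:service:DesiredCount",
    PolicyType="StepScaling",
    StepScalingPolicyConfiguration={
        "AdjustmentType": "ChangeInCapacity",
        "MetricAggregationType": "Average",
        "Cooldown": 300,
        "StepAdjustments": [
            {"MetricIntervalLowerBound": 0, "MetricIntervalUpperBound": 600, "ScalingAdjustment": 1},
            {"MetricIntervalLowerBound": 600, "ScalingAdjustment": 2},
        ],
    },
)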
Currently a new xarray dataset (stored as Zarr) is created for every new update. Since xarray doesn't support reading multiple Zarr files at once, it would be better to append updates to a single xarray dataset along the time dimension. This would also allow us to use larger chunk sizes (generally better).
Then we would have a single dataset for each NWP field: wind speed, wind direction, irradiance, etc.
I did a quick experiment and the code would look a bit like this:
# Each update would be a single chunk of ~4MB.
# Bigger chunks might be better still; we could do this by chunking multiple time coordinates.
import xarray as xr

chunking = {
    "projection_x_coordinate": 553,
    "projection_y_coordinate": 706,
    "realization": 3,
    "height": 4,
}

dataset1 = xr.open_zarr("data/mogreps/MOGREPS-UK__wind_from_direction__2020-03-15T15__2020-03-16T07.zarr.zip")
dataset1 = dataset1.expand_dims("time")  # make 1-D coordinate into a dimension
dataset1 = dataset1.chunk(chunking)

dataset2 = xr.open_zarr("data/mogreps/MOGREPS-UK__wind_from_direction__2020-03-15T15__2020-03-16T08.zarr.zip")
dataset2 = dataset2.expand_dims("time")  # make 1-D coordinate into a dimension
dataset2 = dataset2.chunk(chunking)

# Create a new file
dataset1.to_zarr("data/mogreps/combined.zarr", consolidated=True)

# Append new data to the existing file
dataset2.to_zarr("data/mogreps/combined.zarr", consolidated=True, append_dim="time")
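As a small follow-up (my assumption about how the store would be consumed): reading the whole history back is then a single open against the combined store:

import xarray as xr

# Consolidated metadata means one metadata read instead of many small ones.
combined = xr.open_zarr("data/mogreps/combined.zarr", consolidated=True)
print(combined.time)  # all appended forecast times in one dataset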
Thoughts?
2 advantages:
Only limited to surface_downwelling_shortwave_flux_in_air for now.
Directly integrated in the ingestion part of the pipeline.
Using predict.predict_as_geojson.
We are currently running 2 replicas on Fargate for this script. Our compute workload will likely increase. We should look into switching this to spot instances via spot.io Elastigroup.
Update: it seems we are a bit ahead of our time. Migrating to spot.io is very hard and their docs are significantly outdated. We should not use them for now, in my opinion.
There is, however, Fargate Spot. As our workloads are fault-tolerant, we should definitely look into that.
It seems fairly straightforward: we just need to create a capacity provider and make the tasks/service use it.
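A minimal sketch of what that might look like with boto3 (cluster and service names are placeholders, and FARGATE_SPOT availability in our region would need checking):

import boto3

ecs = boto3.client("ecs")

# FARGATE and FARGATE_SPOT are AWS-managed capacity providers; they only
# need to be attached to the cluster, not created.
ecs.put_cluster_capacity_providers(
    cluster="our-cluster",  # placeholder
    capacityProviders=["FARGATE", "FARGATE_SPOT"],
    defaultCapacityProviderStrategy=[
        {"capacityProvider": "FARGATE_SPOT", "weight": 1},
    ],
)

# Move the existing service onto the Spot capacity provider.
ecs.update_service(
    cluster="our-cluster",    # placeholder
    service="metoffice-ec2",  # placeholder
    capacityProviderStrategy=[{"capacityProvider": "FARGATE_SPOT", "weight": 1}],
    forceNewDeployment=True,
)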
We get interesting file structures like the following one:
ocf-uk-metoffice-nwp/MOGREPS-UK/wind_from_direction/2020/m05/d20/h08/MOGREPS-UK__wind_from_direction__2020-05-20T08__2020-05-22T06.zarr
I'd assume that d20/h08 stands for day 20, hour 8 - why are there files from May 22nd in there then?
Also, so far the only subfolders are for 2020/m05/d20/h08 - does that mean no other files are being saved anymore? Did it crash somewhere?
Bucket Filesize as of May 29th: 122.2MB
Bucket Filesize as of May 30th: 122.2MB
-> Something is wrong with the deployment - the script works locally, but when deployed it doesn't output any files.
Update: after some investigation it turns out that there is an IAM permissions issue. The script is not able to open files from the external MetOffice bucket.
Update 2: added read/write permission for the external MetOffice S3 bucket via 38552aa.
SQS messages may occasionally be duplicated.
For now, it's no big deal if we overwrite a few files. But the cost may start getting significant.
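A hedged sketch of one way to avoid re-processing duplicates (assuming the target S3 key is deterministic per message): check whether the output object already exists before doing the work.

import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")

def already_processed(bucket: str, key: str) -> bool:
    """Return True if the output object for this message already exists."""
    try:
        s3.head_object(Bucket=bucket, Key=key)
        return True
    except ClientError as err:
        if err.response["Error"]["Code"] == "404":
            return False
        raise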
Using Tom's new functions :)
Run nwp_plot.py somewhere in scripts/ec2.py or similar and output it to a new frontend bucket.
TypeError: float() argument must be a string or a number, not 'list'
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
  File "scripts/ec2.py", line 132, in loop
    load_subset_and_save_data(mo_message, height_meters, s3)
  File "scripts/ec2.py", line 82, in load_subset_and_save_data
    dataset = subset.subset(dataset, height_meters, **DEFAULT_GEO_BOUNDARY)
  File "/usr/src/app/metoffice_ec2/subset.py", line 20, in subset
    dataset = dataset.sel(height=height_meters)
  File "/opt/conda/envs/metoffice_ec2/lib/python3.8/site-packages/xarray/core/dataset.py", line 2065, in sel
    pos_indexers, new_indexes = remap_label_indexers(
  File "/opt/conda/envs/metoffice_ec2/lib/python3.8/site-packages/xarray/core/coordinates.py", line 396, in remap_label_indexers
    pos_indexers, new_indexes = indexing.remap_label_indexers(
  File "/opt/conda/envs/metoffice_ec2/lib/python3.8/site-packages/xarray/core/indexing.py", line 269, in remap_label_indexers
    label = maybe_cast_to_coords_dtype(label, coords_dtype)
  File "/opt/conda/envs/metoffice_ec2/lib/python3.8/site-packages/xarray/core/utils.py", line 82, in maybe_cast_to_coords_dtype
    label = np.asarray(label, dtype=coords_dtype)
  File "/opt/conda/envs/metoffice_ec2/lib/python3.8/site-packages/numpy/core/_asarray.py", line 85, in asarray
    return array(a, dtype, copy=False, order=order)
  File "/opt/conda/envs/metoffice_ec2/lib/python3.8/site-packages/pandas/core/series.py", line 754, in __array__
    return np.asarray(self.array, dtype)
  File "/opt/conda/envs/metoffice_ec2/lib/python3.8/site-packages/numpy/core/_asarray.py", line 85, in asarray
    return array(a, dtype, copy=False, order=order)
  File "/opt/conda/envs/metoffice_ec2/lib/python3.8/site-packages/pandas/core/arrays/numpy_.py", line 184, in __array__
    return np.asarray(self._ndarray, dtype=dtype)
  File "/opt/conda/envs/metoffice_ec2/lib/python3.8/site-packages/numpy/core/_asarray.py", line 85, in asarray
    return array(a, dtype, copy=False, order=order)
ValueError: setting an array element with a sequence.
(this mostly applies to the code I've contributed - which currently has zero unit tests! Sorry!)
At present, if the script takes longer than an hour, AWS Batch will try to launch it multiple times, which will result in Bad Stuff happening!
2020-06-11 17:18:24,361 - metoffice_ec2 - ERROR - load_subset_and_save_data() missing 1 required positional argument: 's3'
Traceback (most recent call last):
  File "scripts/ec2.py", line 130, in loop
    load_subset_and_save_data(mo_message, s3)
TypeError: load_subset_and_save_data() missing 1 required positional argument: 's3'
This is triggered from this code snippet (line 130 in 360e3c7), whereas the definition is at line 78 in 360e3c7.
I didn't see where to get the height from, otherwise I would have fixed it myself.
As per @tv3141's work in https://github.com/tv3141/metoffice_ec2/blob/storage_optimisation_notebook/storage_optimisation/optimize_storage.ipynb
zstd is considerably faster to compress.
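A minimal sketch of switching the Zarr output to zstd via Blosc (the compression level and shuffle setting are assumptions to be tuned, not values from the notebook):

import xarray as xr
from numcodecs import Blosc

# zstd inside Blosc; clevel/shuffle are starting points, not tuned values.
compressor = Blosc(cname="zstd", clevel=5, shuffle=Blosc.SHUFFLE)

def save_with_zstd(dataset: xr.Dataset, path: str) -> None:
    encoding = {var: {"compressor": compressor} for var in dataset.data_vars}
    dataset.to_zarr(path, encoding=encoding, consolidated=True)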
Compare Zarr vs TileDB vs NetCDF. Also try compressing with pbzip2 -5 (which reduces .nat files to 0.2x their original size). Consider:
https://medium.com/informatics-lab/storing-cloud-ready-geoscience-data-with-tiledb-34d454c33055
Now that we have a rapidly growing dataset on AWS (~100GB/week, likely more) we should look a bit more into how we can reduce our S3 bill. We are already at $3/day - assuming that this will only grow, it's better to look into this sooner rather than later.
We should think again about how frequently this data will need to be accessed and what kind of delay is acceptable to us. We should also think about whether we can live with losing some of the data - if so, we should consider S3 One Zone-IA.
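A hedged sketch of what a lifecycle transition could look like with boto3 (the 30-day threshold and prefix are placeholders to decide on, and I'm assuming ocf-uk-metoffice-nwp is the bucket we mean):

import boto3

s3 = boto3.client("s3")

# Move objects older than 30 days to One Zone-IA (cheaper, single-AZ,
# so only acceptable if we can tolerate losing them in an AZ failure).
s3.put_bucket_lifecycle_configuration(
    Bucket="ocf-uk-metoffice-nwp",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "transition-to-one-zone-ia",
                "Status": "Enabled",
                "Filter": {"Prefix": ""},
                "Transitions": [{"Days": 30, "StorageClass": "ONEZONE_IA"}],
            }
        ]
    },
)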
TBD.
ec2.py: maybe we need a new way to represent which fields we want; possibly in a YAML file? e.g. [{'name': 'humidity', 'height_meters': 'ALL', 'ensemble_members': ['UKV', 'CONTROL'], 'compression': 'RLE', 'mask': 'land'}]. ensemble_members defaults to all members; and set sane defaults for compression and mask. To reduce space requirements, save these for just UKV and the control ensemble member:
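A rough sketch of how such a configuration could be loaded (the fields.yml contents, keys, and default values below are all hypothetical, just to illustrate the shape, as mentioned above):

import yaml

# Hypothetical fields.yml contents:
FIELDS_YAML = """
fields:
  - name: humidity
    height_meters: ALL
    ensemble_members: [UKV, CONTROL]
    compression: RLE
    mask: land
"""

DEFAULTS = {"ensemble_members": "ALL", "compression": "default", "mask": None}

def load_field_specs(yaml_text: str) -> list:
    """Parse the field list and fill in sane defaults for missing keys."""
    config = yaml.safe_load(yaml_text)
    return [{**DEFAULTS, **field} for field in config["fields"]]

print(load_field_specs(FIELDS_YAML))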
Compressed Size: 1.6GB
Seems a bit excessive...
See: https://hub.docker.com/r/openclimatefix/metoffice_ec2/tags
Edit: this tool might be useful: https://github.com/wagoodman/dive
In https://github.com/openclimatefix/predict_pv_yield_nwp we build a baseline model for predicting PV yield from the NWP irradiance field. This uses the GRIB data, where the field is called "Downward short-wave radiation flux" or dswrf.
In #34 we use the irradiance data from the AWS feed in NetCDF/Zarr format. The field names are different (see #2 (comment)), but surface_downwelling_shortwave_flux_in_air seems like it might be the corresponding field.
Is this correct? (Tagging @tv3141 for possible insight here.)
(Of course, in the future we will use the data from the AWS feed in NetCDF/Zarr format, so this problem will go away.)