openclimatefix / metoffice_ec2
Subset Met Office MOGREPS-UK and UKV on AWS EC2
License: MIT License
Currently we are listening to all messages published on the respective SNS topics.
That leads to quite a high volume of messages in SQS, even though message retention is only 90 minutes.
We should check whether we can filter our subscription to only let through relevant messages.
See: https://docs.aws.amazon.com/sns/latest/dg/sns-message-filtering.html
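A minimal sketch of what that could look like with boto3 (the subscription ARN is a placeholder, and the message attribute name 'model' is an assumption that would need checking against what the Met Office actually attaches to its SNS messages):

import json
import boto3

sns = boto3.client("sns")

# Hypothetical filter policy: only deliver messages whose 'model'
# message attribute is MOGREPS-UK or UKV.
filter_policy = {"model": ["MOGREPS-UK", "UKV"]}

sns.set_subscription_attributes(
    SubscriptionArn="arn:aws:sns:...:our-subscription",  # placeholder ARN
    AttributeName="FilterPolicy",
    AttributeValue=json.dumps(filter_policy),
)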
There is a very simple baseline model in https://github.com/openclimatefix/predict_pv_yield_nwp that we can use to make predictions for a small set of PV systems, using irradiance data from UKV as input.
The output should be in JSON so it can be easily consumed by the frontend.
Via #28 we realised that our code, in combination with the awslogs logging driver, is extremely good at hiding all kinds of error messages. That's quite bad.
We should add Sentry for error logging so our ears start ringing whenever something goes wrong.
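A minimal sketch of wiring Sentry into the existing Python logging setup (the DSN is a placeholder; levels are a suggestion, not a decision):

import logging

import sentry_sdk
from sentry_sdk.integrations.logging import LoggingIntegration

# Send logging.ERROR and above to Sentry as events, and attach
# logging.INFO and above as breadcrumbs on those events.
sentry_logging = LoggingIntegration(level=logging.INFO, event_level=logging.ERROR)

sentry_sdk.init(
    dsn="https://<key>@sentry.io/<project>",  # placeholder DSN
    integrations=[sentry_logging],
)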
And update README.
Or think of a more general name. Maybe just metoffice_subset or metoffice_aws or something.
2020-06-18 13:39:13,637 - metoffice_ec2 - ERROR - dimensions or multi-index levels ['height'] do not exist
Traceback (most recent call last):
  File "scripts/ec2.py", line 132, in loop
    load_subset_and_save_data(mo_message, height_meters, s3)
  File "scripts/ec2.py", line 82, in load_subset_and_save_data
    dataset = subset.subset(dataset, height_meters, **DEFAULT_GEO_BOUNDARY)
  File "/usr/src/app/metoffice_ec2/subset.py", line 20, in subset
    dataset = dataset.sel(height=height_meters)
  File "/opt/conda/envs/metoffice_ec2/lib/python3.8/site-packages/xarray/core/dataset.py", line 2065, in sel
    pos_indexers, new_indexes = remap_label_indexers(
  File "/opt/conda/envs/metoffice_ec2/lib/python3.8/site-packages/xarray/core/coordinates.py", line 396, in remap_label_indexers
    pos_indexers, new_indexes = indexing.remap_label_indexers(
  File "/opt/conda/envs/metoffice_ec2/lib/python3.8/site-packages/xarray/core/indexing.py", line 254, in remap_label_indexers
    dim_indexers = get_dim_indexers(data_obj, indexers)
  File "/opt/conda/envs/metoffice_ec2/lib/python3.8/site-packages/xarray/core/indexing.py", line 220, in get_dim_indexers
    raise ValueError(f"dimensions or multi-index levels {invalid!r} do not exist")
Hey @tomwhite, I just got around to running predict from your work in f1812ff to get it deployed on AWS. One thing I noticed is that the Zarr file you use in the predict_test produces a bunch of NaN values in the resulting GeoJSON.
e.g.
{
  "type": "Feature",
  "geometry": {
    "type": "Point",
    "coordinates": [
      -0.07,
      52.33
    ]
  },
  "properties": {
    "system_id": 1883,
    "time": "2020-09-08T13:00:00",
    "pv_yield_predicted": NaN
  }
},
It's no biggie - apparently NaN is fine in JS, just not valid JSON - but I'd like to understand where these values come from and whether we can get rid of them. I did check whether the system_id might be missing from the model, but it is present.
Do you have any ideas?
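For context (my own sketch, not taken from the predict code): Python's json module happily emits the non-standard NaN token unless told otherwise, so one option would be to fail fast or to replace NaN with null at write time:

import json
import math

feature_props = {"system_id": 1883, "pv_yield_predicted": float("nan")}

# Default behaviour: emits NaN, which JS tolerates but strict JSON parsers reject.
print(json.dumps(feature_props))

# Option 1: fail fast so we notice NaNs when writing the output.
try:
    json.dumps(feature_props, allow_nan=False)
except ValueError as err:
    print("NaN detected:", err)

# Option 2: replace NaN with null before serialising.
cleaned = {
    key: (None if isinstance(value, float) and math.isnan(value) else value)
    for key, value in feature_props.items()
}
print(json.dumps(cleaned))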
We should automatically build docker images based on releases in master.
Likely can use something like this action: https://github.com/marketplace/actions/publish-docker (see "types: [published]").
MetOfficeMessage.load_netcdf() ends with the line:
return xr.open_dataset(netcdf_bytes_io, engine='h5netcdf')
This means we must include h5netcdf in environment.yml. But including h5netcdf forces conda to downgrade cartopy from v0.18 to v0.17, which breaks nwp_plot.py:
> ax = plt.axes(projection=ccrs.OSGB(approx=True))
E TypeError: __init__() got an unexpected keyword argument 'approx'
metoffice_ec2/nwp_plot.py:35: TypeError
One solution may be to see if we can use an xarray engine which isn't h5netcdf.
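One hedged option (an assumption, not tested, and whether conda can solve netCDF4 plus cartopy 0.18 together would still need checking): the netcdf4 engine can't read from an in-memory BytesIO object, but we could spill the payload to a temporary file and open it from there:

import os
import tempfile

import xarray as xr

def open_netcdf_bytes(netcdf_bytes: bytes) -> xr.Dataset:
    # The netcdf4 engine needs a real path, so write the payload to a
    # temporary file, load the dataset fully into memory, then clean up.
    with tempfile.NamedTemporaryFile(suffix=".nc", delete=False) as tmp:
        tmp.write(netcdf_bytes)
        path = tmp.name
    try:
        return xr.open_dataset(path, engine="netcdf4").load()
    finally:
        os.remove(path)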
It would be useful to have a few representative files checked into the repo for testing purposes (i.e. to write unit tests against). It would also help if we want to change the way we store things (e.g. #17), so we can test changes locally.
As per @tv3141's work in https://github.com/tv3141/metoffice_ec2/blob/storage_optimisation_notebook/storage_optimisation/optimize_storage.ipynb
Just to emphasise: We wouldn't publish the raw data (which can be commercially sensitive). Instead we'd train ML models on the data, and use the models to produce public predictions.
After tweaking the desired_count setting of our fleet a few times, we have reached the point where we should scale more dynamically rather than adjusting the desired_count by hand.
We should set up a step-based autoscaler, based on the SQS AverageMessageAge.
I think a sensible initial baseline is:
I think a sensible initial baseline is:
Current hard deadline is 90min (5400s).
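A rough sketch of the step-scaling side with boto3 (the cluster/service names, capacities, thresholds, and cooldown are placeholders; the CloudWatch alarm on the queue-age metric that would actually trigger the policy is not shown):

import boto3

autoscaling = boto3.client("application-autoscaling")

resource_id = "service/our-cluster/metoffice-ec2"  # placeholder cluster/service

# Let ECS scale the service's desired_count between 1 and 10 tasks.
autoscaling.register_scalable_target(
    ServiceNamespace="ecs",
    ResourceId=resource_id,
    ScalableDimension="ecs:service:DesiredCount",
    MinCapacity=1,
    MaxCapacity=10,
)

# Step policy: add tasks as the alarm breach grows.  The CloudWatch alarm
# on the SQS message-age metric would reference this policy's ARN.
autoscaling.put_scaling_policy(
    PolicyName="scale-on-sqs-message-age",
    ServiceNamespace="ecs",
    ResourceId=resource_id,
    ScalableDimension="ecs:service:DesiredCount",
    PolicyType="StepScaling",
    StepScalingPolicyConfiguration={
        "AdjustmentType": "ChangeInCapacity",
        "MetricAggregationType": "Average",
        "Cooldown": 300,
        "StepAdjustments": [
            {"MetricIntervalLowerBound": 0, "MetricIntervalUpperBound": 600, "ScalingAdjustment": 1},
            {"MetricIntervalLowerBound": 600, "ScalingAdjustment": 2},
        ],
    },
)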
Currently a new xarray dataset (stored as Zarr) is created for every new update. Since xarray doesn't support reading multiple Zarr files at once, it would be better to append updates to a single xarray dataset along the time dimension. This would also allow us to use larger chunk sizes (generally better).
Then we would have a single dataset for each NWP field: wind speed, wind direction, irradiance, etc.
I did a quick experiment and the code would look a bit like this:
# Each update would be a single chunk of ~4MB.
# Bigger chunks might be better still; we could do this by chunking multiple time coordinates.
import xarray as xr

chunking = {
    "projection_x_coordinate": 553,
    "projection_y_coordinate": 706,
    "realization": 3,
    "height": 4,
}

dataset1 = xr.open_zarr("data/mogreps/MOGREPS-UK__wind_from_direction__2020-03-15T15__2020-03-16T07.zarr.zip")
dataset1 = dataset1.expand_dims("time")  # make 1-D coordinate into a dimension
dataset1 = dataset1.chunk(chunking)

dataset2 = xr.open_zarr("data/mogreps/MOGREPS-UK__wind_from_direction__2020-03-15T15__2020-03-16T08.zarr.zip")
dataset2 = dataset2.expand_dims("time")  # make 1-D coordinate into a dimension
dataset2 = dataset2.chunk(chunking)

# Create a new file
dataset1.to_zarr("data/mogreps/combined.zarr", consolidated=True)

# Append new data to the existing file
dataset2.to_zarr("data/mogreps/combined.zarr", consolidated=True, append_dim="time")
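As a small follow-up (my assumption about how the store would be consumed): reading the whole history back is then a single open against the combined store:

import xarray as xr

# Consolidated metadata means one metadata read instead of many small ones.
combined = xr.open_zarr("data/mogreps/combined.zarr", consolidated=True)
print(combined.time)  # all appended forecast times in one dataset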
Thoughts?
2 advantages:
Only limited to surface_downwelling_shortwave_flux_in_air for now.
Directly integrated in the ingestion part of the pipeline.
Using predict.predict_as_geojson.
We are currently running 2 replicas on Fargate for this script. Our compute workload will likely increase. We should look into switching this to spot instances via spot.io Elastigroup.
Update: it seems we are a bit ahead of our time. Migrating to spot.io is very hard and their docs are significantly outdated. We should not use them for now, in my opinion.
There is, however, Fargate Spot. As our workloads are fault-tolerant, we should definitely look into that.
It seems fairly straightforward: we just need to create a capacity provider and make the tasks/service use it.
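A minimal sketch of what that might look like with boto3 (cluster and service names are placeholders, and FARGATE_SPOT availability in our region would need checking):

import boto3

ecs = boto3.client("ecs")

# FARGATE and FARGATE_SPOT are AWS-managed capacity providers; they only
# need to be attached to the cluster, not created.
ecs.put_cluster_capacity_providers(
    cluster="our-cluster",  # placeholder
    capacityProviders=["FARGATE", "FARGATE_SPOT"],
    defaultCapacityProviderStrategy=[
        {"capacityProvider": "FARGATE_SPOT", "weight": 1},
    ],
)

# Move the existing service onto the Spot capacity provider.
ecs.update_service(
    cluster="our-cluster",    # placeholder
    service="metoffice-ec2",  # placeholder
    capacityProviderStrategy=[{"capacityProvider": "FARGATE_SPOT", "weight": 1}],
    forceNewDeployment=True,
)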
We get interesting file structures like the following one:
ocf-uk-metoffice-nwp/MOGREPS-UK/wind_from_direction/2020/m05/d20/h08/MOGREPS-UK__wind_from_direction__2020-05-20T08__2020-05-22T06.zarr
I'd assume that d20/h08 stands for day 20, hour 8 - why are there files from May 22nd in there then?
Also, so far the only subfolders are for 2020/m05/d20/h08 - does that mean no other files are being saved anymore? Did it crash somewhere?
Bucket Filesize as of May 29th: 122.2MB
Bucket Filesize as of May 30th: 122.2MB
-> Something is wrong with the deployment - the script works locally, but when deployed it doesn't output any files.
Update: after some investigation it turns out that there is an IAM permissions issue. The script is not able to open files from the external MetOffice bucket.
Update 2: added read/write permission for the external MetOffice S3 bucket via 38552aa.
SQS messages may occasionally be duplicated.
For now, it's no big deal if we overwrite a few files. But the cost may start getting significant.
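A hedged sketch of one way to avoid re-processing duplicates (assuming the target S3 key is deterministic per message): check whether the output object already exists before doing the work.

import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")

def already_processed(bucket: str, key: str) -> bool:
    """Return True if the output object for this message already exists."""
    try:
        s3.head_object(Bucket=bucket, Key=key)
        return True
    except ClientError as err:
        if err.response["Error"]["Code"] == "404":
            return False
        raise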
Using Tom's new functions :)
Run nwp_plot.py somewhere in scripts/ec2.py or similar and output it to a new frontend bucket.
TypeError: float() argument must be a string or a number, not 'list'
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
  File "scripts/ec2.py", line 132, in loop
    load_subset_and_save_data(mo_message, height_meters, s3)
  File "scripts/ec2.py", line 82, in load_subset_and_save_data
    dataset = subset.subset(dataset, height_meters, **DEFAULT_GEO_BOUNDARY)
  File "/usr/src/app/metoffice_ec2/subset.py", line 20, in subset
    dataset = dataset.sel(height=height_meters)
  File "/opt/conda/envs/metoffice_ec2/lib/python3.8/site-packages/xarray/core/dataset.py", line 2065, in sel
    pos_indexers, new_indexes = remap_label_indexers(
  File "/opt/conda/envs/metoffice_ec2/lib/python3.8/site-packages/xarray/core/coordinates.py", line 396, in remap_label_indexers
    pos_indexers, new_indexes = indexing.remap_label_indexers(
  File "/opt/conda/envs/metoffice_ec2/lib/python3.8/site-packages/xarray/core/indexing.py", line 269, in remap_label_indexers
    label = maybe_cast_to_coords_dtype(label, coords_dtype)
  File "/opt/conda/envs/metoffice_ec2/lib/python3.8/site-packages/xarray/core/utils.py", line 82, in maybe_cast_to_coords_dtype
    label = np.asarray(label, dtype=coords_dtype)
  File "/opt/conda/envs/metoffice_ec2/lib/python3.8/site-packages/numpy/core/_asarray.py", line 85, in asarray
    return array(a, dtype, copy=False, order=order)
  File "/opt/conda/envs/metoffice_ec2/lib/python3.8/site-packages/pandas/core/series.py", line 754, in __array__
    return np.asarray(self.array, dtype)
  File "/opt/conda/envs/metoffice_ec2/lib/python3.8/site-packages/numpy/core/_asarray.py", line 85, in asarray
    return array(a, dtype, copy=False, order=order)
  File "/opt/conda/envs/metoffice_ec2/lib/python3.8/site-packages/pandas/core/arrays/numpy_.py", line 184, in __array__
    return np.asarray(self._ndarray, dtype=dtype)
  File "/opt/conda/envs/metoffice_ec2/lib/python3.8/site-packages/numpy/core/_asarray.py", line 85, in asarray
    return array(a, dtype, copy=False, order=order)
ValueError: setting an array element with a sequence.
(this mostly applies to the code I've contributed - which currently has zero unit tests! Sorry!)
At present, if the script takes longer than an hour, AWS Batch will try to launch it multiple times, which will result in Bad Stuff happening!
2020-06-11 17:18:24,361 - metoffice_ec2 - ERROR - load_subset_and_save_data() missing 1 required positional argument: 's3'
Traceback (most recent call last):
  File "scripts/ec2.py", line 130, in loop
    load_subset_and_save_data(mo_message, s3)
TypeError: load_subset_and_save_data() missing 1 required positional argument: 's3'
This is triggered from this code snippet (line 130 in 360e3c7), whereas the definition is at line 78 in 360e3c7.
I didn't see where to get the height from, otherwise I would have fixed it myself.
As per @tv3141's work in https://github.com/tv3141/metoffice_ec2/blob/storage_optimisation_notebook/storage_optimisation/optimize_storage.ipynb
zstd is considerably faster to compress.
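A minimal sketch of switching the Zarr output to zstd via Blosc (the compression level and shuffle setting are assumptions to be tuned, not values from the notebook):

import xarray as xr
from numcodecs import Blosc

# zstd inside Blosc; clevel/shuffle are starting points, not tuned values.
compressor = Blosc(cname="zstd", clevel=5, shuffle=Blosc.SHUFFLE)

def save_with_zstd(dataset: xr.Dataset, path: str) -> None:
    encoding = {var: {"compressor": compressor} for var in dataset.data_vars}
    dataset.to_zarr(path, encoding=encoding, consolidated=True)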
Compare Zarr vs TileDB vs NetCDF. Also try compressing with pbzip2 -5 (which reduces .nat files to 0.2x their original size). Consider:
https://medium.com/informatics-lab/storing-cloud-ready-geoscience-data-with-tiledb-34d454c33055
Now that we have a rapidly growing dataset on AWS (~100GB/week, likely more) we should look a bit more into how we can reduce our S3 bill. We are already at $3/day - assuming that this will only grow, it's better to look into this sooner rather than later.
We should think again about how frequently this data will need to be accessed and what kind of delay is acceptable to us. We should also think about whether we can live with losing some of the data - if so, we should consider S3 One Zone-IA.
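A hedged sketch of what a lifecycle transition could look like with boto3 (the 30-day threshold and prefix are placeholders to decide on, and I'm assuming ocf-uk-metoffice-nwp is the bucket we mean):

import boto3

s3 = boto3.client("s3")

# Move objects older than 30 days to One Zone-IA (cheaper, single-AZ,
# so only acceptable if we can tolerate losing them in an AZ failure).
s3.put_bucket_lifecycle_configuration(
    Bucket="ocf-uk-metoffice-nwp",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "transition-to-one-zone-ia",
                "Status": "Enabled",
                "Filter": {"Prefix": ""},
                "Transitions": [{"Days": 30, "StorageClass": "ONEZONE_IA"}],
            }
        ]
    },
)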
TBD.
ec2.py: maybe we need a new way to represent which fields we want; possibly in a YAML file? e.g. [{'name': 'humidity', 'height_meters': 'ALL', 'ensemble_members': ['UKV', 'CONTROL'], 'compression': 'RLE', 'mask': 'land'}]. ensemble_members defaults to all members; and set sane defaults for compression and mask. To reduce space requirements, save these for just UKV and the control ensemble member:
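A rough sketch of how such a configuration could be loaded (the fields.yml contents, keys, and default values below are all hypothetical, just to illustrate the shape, as mentioned above):

import yaml

# Hypothetical fields.yml contents:
FIELDS_YAML = """
fields:
  - name: humidity
    height_meters: ALL
    ensemble_members: [UKV, CONTROL]
    compression: RLE
    mask: land
"""

DEFAULTS = {"ensemble_members": "ALL", "compression": "default", "mask": None}

def load_field_specs(yaml_text: str) -> list:
    """Parse the field list and fill in sane defaults for missing keys."""
    config = yaml.safe_load(yaml_text)
    return [{**DEFAULTS, **field} for field in config["fields"]]

print(load_field_specs(FIELDS_YAML))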
Compressed Size: 1.6GB
Seems a bit excessive...
See: https://hub.docker.com/r/openclimatefix/metoffice_ec2/tags
Edit: this tool might be useful: https://github.com/wagoodman/dive
In https://github.com/openclimatefix/predict_pv_yield_nwp we build a baseline model for predicting PV yield from the NWP irradiance field. This uses the GRIB data, where the field is called "Downward short-wave radiation flux" or dswrf.
In #34 we use the irradiance data from the AWS feed in NetCDF/Zarr format. The field names are different (see #2 (comment)), but surface_downwelling_shortwave_flux_in_air seems like it might be the corresponding field.
Is this correct? (Tagging @tv3141 for possible insight here.)
(Of course, in the future we will use the data from the AWS feed in NetCDF/Zarr format, so this problem will go away.)