
metoffice_ec2's People

Contributors

flowirtz, jackkelly, tomwhite


Forkers

tomwhite, tv3141

metoffice_ec2's Issues

Add Sentry Error Logging

Via #28 we realised that our code, in combination with the awslogs logging driver, is extremely good at hiding all kinds of error messages. That's actually quite bad.

We should add Sentry for error logging, so our ears start ringing whenever something goes wrong.
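
A minimal sketch of what the wiring might look like (the SENTRY_DSN environment variable name is an assumption, not something that exists in the repo yet):

import os
import sentry_sdk

# Hedged sketch: initialise Sentry once at start-up so unhandled exceptions
# get reported, even when awslogs swallows the stack trace.
sentry_sdk.init(dsn=os.environ["SENTRY_DSN"])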

dimensions or multi-index levels ['height'] do not exist

2020-06-18 13:39:13,637 - metoffice_ec2 - ERROR - dimensions or multi-index levels ['height'] do not exist
Traceback (most recent call last):
  File "scripts/ec2.py", line 132, in loop
    load_subset_and_save_data(mo_message, height_meters, s3)
  File "scripts/ec2.py", line 82, in load_subset_and_save_data
    dataset = subset.subset(dataset, height_meters, **DEFAULT_GEO_BOUNDARY)
  File "/usr/src/app/metoffice_ec2/subset.py", line 20, in subset
    dataset = dataset.sel(height=height_meters)
  File "/opt/conda/envs/metoffice_ec2/lib/python3.8/site-packages/xarray/core/dataset.py", line 2065, in sel
    pos_indexers, new_indexes = remap_label_indexers(
  File "/opt/conda/envs/metoffice_ec2/lib/python3.8/site-packages/xarray/core/coordinates.py", line 396, in remap_label_indexers
    pos_indexers, new_indexes = indexing.remap_label_indexers(
  File "/opt/conda/envs/metoffice_ec2/lib/python3.8/site-packages/xarray/core/indexing.py", line 254, in remap_label_indexers
    dim_indexers = get_dim_indexers(data_obj, indexers)
  File "/opt/conda/envs/metoffice_ec2/lib/python3.8/site-packages/xarray/core/indexing.py", line 220, in get_dim_indexers
    raise ValueError(f"dimensions or multi-index levels {invalid!r} do not exist")
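
The failing call is dataset.sel(height=height_meters) on a dataset that has no height dimension. A minimal defensive sketch (an assumption, not the repo's actual fix) would be to only select on height when the dimension is present:

# Hedged sketch: only index on height if the dataset actually has that dimension.
if "height" in dataset.dims:
    dataset = dataset.sel(height=height_meters)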

Various NaN values in predictions

Hey @tomwhite, I just got around to running predict from your work in f1812ff to get it deployed on AWS. One thing I noticed is that the Zarr file you use in predict_test produces a bunch of NaN values in the resulting GeoJSON.

e.g.

{
    "type": "Feature",
    "geometry": {
        "type": "Point",
        "coordinates": [
            -0.07,
            52.33
        ]
    },
    "properties": {
        "system_id": 1883,
        "time": "2020-09-08T13:00:00",
        "pv_yield_predicted": NaN
    }
},

It's no biggie - apparently NaN is fine in JS, just not valid JSON - but I'd like to understand where the NaNs come from and whether we can get rid of them. I checked whether the system_id was missing from the model, but it is present.

Do you have any ideas?
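
One option, if the NaNs turn out to be benign, might be to drop those features before serialising. A rough sketch (the geojson variable and the filtering step are assumptions; the property name follows the example above):

import math

# Hypothetical post-processing: drop features whose prediction is NaN so the
# output stays valid JSON.
geojson["features"] = [
    feature for feature in geojson["features"]
    if not math.isnan(feature["properties"]["pv_yield_predicted"])
]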

Avoid using h5netcdf?

MetOfficeMessage.load_netcdf() ends with the line:

return xr.open_dataset(netcdf_bytes_io, engine='h5netcdf')

This means we must include h5netcdf in environment.yml. But including h5netcdf forces conda to downgrade cartopy from v0.18 to v0.17, which breaks nwp_plot.py:

>       ax = plt.axes(projection=ccrs.OSGB(approx=True))
E       TypeError: __init__() got an unexpected keyword argument 'approx'

metoffice_ec2/nwp_plot.py:35: TypeError

One solution may be to see whether we can use an xarray engine other than h5netcdf.
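
For example, a rough sketch using the netcdf4 engine (this is an assumption; that engine generally wants a file path rather than a file-like object, so the bytes are written to a temporary file first):

import tempfile
import xarray as xr

# Hedged sketch: spill the NetCDF bytes to a temporary file and open it with
# the netcdf4 engine instead of h5netcdf. .load() pulls the data into memory
# before the temporary file is deleted.
with tempfile.NamedTemporaryFile(suffix=".nc") as temp_file:
    temp_file.write(netcdf_bytes_io.read())
    temp_file.flush()
    dataset = xr.open_dataset(temp_file.name, engine="netcdf4").load()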

Add some files from the MetOffice feed for testing

It would be useful to have a few representative files checked into the repo for testing purposes (i.e. to write unit tests against). It would also let us test changes locally if we want to change the way we store things (e.g. #17).

Can we use any wind data to train ML models used in a public MVP?

  • Can we use Electralink data to train ML models? (Probably not?!?)
  • Can we get timeseries data for transmission-connected wind farms?

Just to emphasise: We wouldn't publish the raw data (which can be commercially sensitive). Instead we'd train ML models on the data, and use the models to produce public predictions.

Autoscale Compute Fleet based on SQS

After tweaking our fleet's desired_count by hand a few times, we have reached the point where we should scale more dynamically rather than adjusting desired_count manually.

We should set up a step-based autoscaler based on the SQS AverageMessageAge metric.
I think a sensible initial baseline is:

  • 1000s (~16min) -> 1
  • 2000s (~33min) -> 2
  • 3000s (~50min) -> 3
  • 4000s (~66min) -> 5

Current hard deadline is 90min (5400s).
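
As an illustration only (the thresholds mirror the list above; the function itself and the baseline of one task below the first threshold are assumptions, not the eventual scaling policy):

# Hypothetical mapping from approximate message age (seconds) to task count,
# following the step thresholds proposed above.
SCALING_STEPS = [(1000, 1), (2000, 2), (3000, 3), (4000, 5)]

def desired_task_count(message_age_seconds: float) -> int:
    count = 1  # assumed baseline below the first threshold
    for threshold_seconds, tasks in SCALING_STEPS:
        if message_age_seconds >= threshold_seconds:
            count = tasks
    return count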

One xarray dataset per NWP field

Currently a new xarray dataset (stored as Zarr) is created for every new update. Since xarray doesn't support reading multiple Zarr files at once, it would be better to append updates to a single xarray dataset along the time dimension. This would also allow us to use larger chunk sizes (generally better).

Then we would have a single dataset for each NWP field: wind speed, wind direction, irradiance, etc.

I did a quick experiment and the code would look a bit like this:

# Each update would be a single chunk of around ~4MB.
# Bigger chunks might be better still; we could achieve this by chunking across multiple time coordinates.
chunking = {
    "projection_x_coordinate": 553,
    "projection_y_coordinate": 706,
    "realization": 3,
    "height": 4,
}

dataset1 = xr.open_zarr("data/mogreps/MOGREPS-UK__wind_from_direction__2020-03-15T15__2020-03-16T07.zarr.zip")
dataset1 = dataset1.expand_dims("time")  # promote the scalar time coordinate to a length-1 dimension
dataset1 = dataset1.chunk(chunking)

dataset2 = xr.open_zarr("data/mogreps/MOGREPS-UK__wind_from_direction__2020-03-15T15__2020-03-16T08.zarr.zip")
dataset2 = dataset2.expand_dims("time")  # promote the scalar time coordinate to a length-1 dimension
dataset2 = dataset2.chunk(chunking)

# Create a new file
dataset1.to_zarr("data/mogreps/combined.zarr", consolidated=True)
# Append new data to the existing file
dataset2.to_zarr("data/mogreps/combined.zarr", consolidated=True, append_dim="time")

Thoughts?

Consider running the EC2 job continuously

2 advantages:

  • Don't need SQS (so could save a few quid there, maybe) (See #8 for reasons why)
  • Can process NWPs as soon as they're published (rather than waiting up to an hour)

Run inference on AWS

Limited to surface_downwelling_shortwave_flux_in_air for now.
Directly integrated into the ingestion part of the pipeline, using predict.predict_as_geojson.

Look into running workload on Spot instead of Fargate

We are currently running 2 replicas on Fargate for this script. Our compute workload will likely increase, so we should look into switching to spot instances, perhaps via the spot.io Elastigroup.

Update: it seems we are a bit ahead of our time. Migrating to spot.io is very hard and their docs are significantly outdated. We should not use them for now, imo.

There is, however, Fargate Spot. As our workloads are fault-tolerant, we should definitely look into that.

It seems fairly straightforward: we just need to create a capacity provider and make the tasks/service use it.
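
For illustration, a rough boto3 sketch (the cluster name and strategy weights are hypothetical, not our actual configuration):

import boto3

ecs = boto3.client("ecs")

# Hedged sketch: attach the FARGATE_SPOT capacity provider to a hypothetical
# cluster and make it the default strategy for new tasks/services.
ecs.put_cluster_capacity_providers(
    cluster="metoffice-ec2-cluster",  # hypothetical cluster name
    capacityProviders=["FARGATE", "FARGATE_SPOT"],
    defaultCapacityProviderStrategy=[
        {"capacityProvider": "FARGATE_SPOT", "weight": 1},
    ],
)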

Investigate why script doesn't output files

We get interesting file structures like the following one:

ocf-uk-metoffice-nwp/MOGREPS-UK/wind_from_direction/2020/m05/d20/h08/MOGREPS-UK__wind_from_direction__2020-05-20T08__2020-05-22T06.zarr

I'd assume that the d20/h08 stands for day 20, hour 8 - why are there files from May 22nd in there then?

Also, so far the only subfolders are for 2020/m05/d20/h08. Does that mean no other files are being saved? Did it crash somewhere?

Bucket Filesize as of May 29th: 122.2MB
Bucket Filesize as of May 30th: 122.2MB

-> Something is wrong with the deployment: the script works locally, but the deployed version doesn't output files.

Update: after some investigation, it turns out there is an IAM permissions issue. The script is not able to open files from the external MetOffice bucket.

Update 2: added read/write permission for the external MetOffice S3 bucket via 38552aa.

Handle duplicate SQS messages

SQS messages may occasionally be duplicated.

For now, it's no big deal if we overwrite a few files, but the cost may start to become significant.
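
One possible (hypothetical) mitigation is to skip a message when the output it would produce already exists, so duplicate deliveries don't redo the work:

import botocore.exceptions

def already_processed(s3_client, bucket: str, key: str) -> bool:
    # Hedged sketch: treat an existing destination object as "already done".
    try:
        s3_client.head_object(Bucket=bucket, Key=key)
        return True
    except botocore.exceptions.ClientError:
        return False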

float() argument must be a string or a number, not 'list'

TypeError: float() argument must be a string or a number, not 'list'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "scripts/ec2.py", line 132, in loop
    load_subset_and_save_data(mo_message, height_meters, s3)
  File "scripts/ec2.py", line 82, in load_subset_and_save_data
    dataset = subset.subset(dataset, height_meters, **DEFAULT_GEO_BOUNDARY)
  File "/usr/src/app/metoffice_ec2/subset.py", line 20, in subset
    dataset = dataset.sel(height=height_meters)
  File "/opt/conda/envs/metoffice_ec2/lib/python3.8/site-packages/xarray/core/dataset.py", line 2065, in sel
    pos_indexers, new_indexes = remap_label_indexers(
  File "/opt/conda/envs/metoffice_ec2/lib/python3.8/site-packages/xarray/core/coordinates.py", line 396, in remap_label_indexers
    pos_indexers, new_indexes = indexing.remap_label_indexers(
  File "/opt/conda/envs/metoffice_ec2/lib/python3.8/site-packages/xarray/core/indexing.py", line 269, in remap_label_indexers
    label = maybe_cast_to_coords_dtype(label, coords_dtype)
  File "/opt/conda/envs/metoffice_ec2/lib/python3.8/site-packages/xarray/core/utils.py", line 82, in maybe_cast_to_coords_dtype
    label = np.asarray(label, dtype=coords_dtype)
  File "/opt/conda/envs/metoffice_ec2/lib/python3.8/site-packages/numpy/core/_asarray.py", line 85, in asarray
    return array(a, dtype, copy=False, order=order)
  File "/opt/conda/envs/metoffice_ec2/lib/python3.8/site-packages/pandas/core/series.py", line 754, in __array__
    return np.asarray(self.array, dtype)
  File "/opt/conda/envs/metoffice_ec2/lib/python3.8/site-packages/numpy/core/_asarray.py", line 85, in asarray
    return array(a, dtype, copy=False, order=order)
  File "/opt/conda/envs/metoffice_ec2/lib/python3.8/site-packages/pandas/core/arrays/numpy_.py", line 184, in __array__
    return np.asarray(self._ndarray, dtype=dtype)
  File "/opt/conda/envs/metoffice_ec2/lib/python3.8/site-packages/numpy/core/_asarray.py", line 85, in asarray
    return array(a, dtype, copy=False, order=order)
ValueError: setting an array element with a sequence.
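
The failure happens inside dataset.sel(height=height_meters), so one speculative guard (the diagnosis is an assumption) would be to validate that height_meters is a plain number, or a flat sequence of numbers, before indexing:

import numbers
import numpy as np

# Hypothetical guard: .sel() needs a scalar or a flat numeric array, so reject
# nested or ragged values up front, where the error is easier to understand.
def validate_height(height_meters):
    if isinstance(height_meters, numbers.Number):
        return height_meters
    return np.asarray(height_meters, dtype=float)  # raises on nested lists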

More unit tests :)

(this mostly applies to the code I've contributed - which currently has zero unit tests! Sorry!)

Protect against overrunning

At present, if the script takes longer than an hour, AWS Batch will try to launch it multiple times, which will result in Bad Stuff happening!

load_subset_and_save_data() missing 1 required positional argument

2020-06-11 17:18:24,361 - metoffice_ec2 - ERROR - load_subset_and_save_data() missing 1 required positional argument: 's3'
Traceback (most recent call last):
  File "scripts/ec2.py", line 130, in loop
    load_subset_and_save_data(mo_message, s3)
TypeError: load_subset_and_save_data() missing 1 required positional argument: 's3'

This is triggered from this code snippet:

load_subset_and_save_data(mo_message, s3)

whereas the definition is:

def load_subset_and_save_data(mo_message, height_meters, s3):

I didn't see where to get the height from, otherwise I would have fixed it myself.

Use irradiance data for predictions

With #34 we introduced predictions. Once we are successfully capturing irradiance data (probably via v1.4.1) we should also include that data in our predictions.

Currently blocked by #42 and #44.

Ideas for reducing data volume (if it's too expensive)

  • Only store full geo resolution for, say, a month (for users to see full maps), then down-res for long-term storage (see the sketch after this list).
  • Extract wind at multiple vertical levels just for wind farm locations, and store that indefinitely for ML training.
  • Apply a land mask to solar PV variables.
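
A rough down-res sketch using xarray's coarsen (the 4x4 factor and the input path are assumptions):

import xarray as xr

dataset = xr.open_zarr("data/mogreps/combined.zarr")  # hypothetical combined store

# Hedged sketch: average 4x4 blocks of grid cells, shrinking the stored volume
# by roughly 16x for long-term storage.
downres = dataset.coarsen(
    projection_x_coordinate=4, projection_y_coordinate=4, boundary="trim"
).mean()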

Deal with increasing S3 cost

Now that we have a rapidly growing dataset on AWS (~100GB/week, likely more), we should look more closely at how we can reduce our S3 bill. We are already at $3/day, and assuming this will only grow, it's better to look into it sooner rather than later.

We should think again about how frequently this data will need to be accessed and what kind of retrieval delay is acceptable to us. We should also think about whether we can live with losing some of the data; if so, we should consider S3 One Zone-IA.

TBD.
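
If we go down the storage-class route, a hedged boto3 sketch of a lifecycle rule (the bucket name is reused from the path mentioned earlier; the 30-day threshold is an assumption):

import boto3

s3 = boto3.client("s3")

# Hypothetical lifecycle rule: move objects to One Zone-IA after 30 days.
s3.put_bucket_lifecycle_configuration(
    Bucket="ocf-uk-metoffice-nwp",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "downgrade-storage-class",
                "Status": "Enabled",
                "Filter": {"Prefix": ""},
                "Transitions": [{"Days": 30, "StorageClass": "ONEZONE_IA"}],
            }
        ]
    },
)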

Extract additional NWP fields

Prep work:

  • Search through SNS messages to find exact names used for the NWP fields we want.
  • Refactor ec2.py. We may need a new way to represent which fields we want, possibly in a YAML file? e.g. [{'name': 'humidity', 'height_meters': 'ALL', 'ensemble_members': ['UKV', 'CONTROL'], 'compression': 'RLE', 'mask': 'land'}]. ensemble_members should default to all members, and we should set sane defaults for compression and mask (sketched below).
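
As a sketch only (the keys and defaults are illustrative assumptions, not a decided schema; the same structure could equally live in a YAML file):

# Hypothetical shape for the "which fields do we want?" configuration.
WANTED_FIELDS = [
    {
        "name": "humidity",
        "height_meters": "ALL",
        "ensemble_members": ["UKV", "CONTROL"],  # default: all members
        "compression": "RLE",                    # sane default if omitted
        "mask": "land",                          # sane default if omitted
    },
]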

Additional NWP fields to add:

  • irradiance at the surface
  • apply land mask to irradiance
  • temperature at the surface (for PV)
  • wind speed, direction & gust at the surface (for PV)
  • cloud fraction & cloud type

To reduce space requirements, save these for just UKV and the control ensemble member:

  • At multiple vertical levels, up to ~ 10 km (to help nowcast clouds):
    • temperature
    • wind speed, direction & gust
    • humidity
  • At surface:
    • rainfall (probably affects PV, right?)
    • snow
    • rainfall & snow will be mostly zeros, so they will benefit from a compression scheme that squishes zeros.

Find out which irradiance field to use

In https://github.com/openclimatefix/predict_pv_yield_nwp we build a baseline model for predicting PV yield from the NWP irradiance field. This uses the GRIB data, where the field is called "Downward short-wave radiation flux" or dswrf.

In #34 we use the irradiance data from the AWS feed in NetCDF/Zarr format. The field names are different (see #2 (comment)), but surface_downwelling_shortwave_flux_in_air seems like it might be the corresponding field.

Is this correct? (Tagging @tv3141 for possible insight here.)

(Of course, in the future we will use the data from the AWS feed in NetCDF/Zarr format, so this problem will go away.)
