
predict_pv_yield's Introduction

Intro

Early experiments on predicting solar electricity generation over the next few hours, using deep learning, satellite imagery, and as many other data sources as we can think of :)

These experiments are focused on predicting solar PV yield.

Please see SatFlow for complementary experiments on predicting the next few hours of satellite imagery (i.e. trying to predict how clouds are going to move!)

And please see OCF's Nowcasting page for more context.

Installation

From within the cloned predict_pv_yield directory:

conda env create -f environment.yml
conda activate predict_pv_yield
pip install -e .

predict_pv_yield's People

Contributors

jackkelly, peterdudfield, pre-commit-ci[bot]


predict_pv_yield's Issues

Try CoordConv

https://medium.com/@Cambridge_Spark/coordconv-layer-deep-learning-e02d728c2311

Suggested by @thecapeador

The idea being that it's really important for the CNN to discriminate between different parts of the image based on their location.

Several combinations to try:

  • Location relative to the pixel boundaries of the image (i.e. the x and y locations of the pixels)
  • The geospatial position (lat, lon)
  • The clouds in the direct path between the sun and the PV system in the center of the image?!? (actually quite excited about this idea!) That is, given the Sun's azimuth and angle in the sky from the central point in the image, show the neural net where a direct line-of-sight between the Sun and the central point in the image would be, through the lower ~10 km of atmosphere (where the clouds live).
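
A minimal sketch of the first option (the x and y pixel coordinates), assuming a PyTorch model: the layer just concatenates two normalised coordinate channels onto the input before an ordinary convolution. The same pattern would work for the geospatial option by swapping in lat/lon channels.

import torch
import torch.nn as nn

class CoordConv2d(nn.Module):
    """Conv2d with two extra input channels holding normalised x/y pixel coordinates."""

    def __init__(self, in_channels: int, out_channels: int, **conv_kwargs):
        super().__init__()
        # e.g. CoordConv2d(12, 32, kernel_size=3, padding=1)
        self.conv = nn.Conv2d(in_channels + 2, out_channels, **conv_kwargs)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        batch_size, _, height, width = x.shape
        ys = torch.linspace(-1, 1, height, device=x.device)
        xs = torch.linspace(-1, 1, width, device=x.device)
        grid_y, grid_x = torch.meshgrid(ys, xs, indexing="ij")
        coords = torch.stack([grid_y, grid_x]).expand(batch_size, -1, -1, -1)
        return self.conv(torch.cat([x, coords], dim=1))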

Use satellite-estimated irradiance (perhaps just as a benchmark?)

  • Download CMSAF data
  • Can we use CMSAF at inference time? How quickly is it updated? (UPDATE: Looks like data might be available 2 days behind?!?)

Some notes about CMSAF from the PDF documentation:

  • Data is free.
  • 30-minute "instantaneous" values
  • Data volume = "up to 5 GB per month per parameter"

Modelling ideas:

  1. Inputs: SEVIRI (lots of channels) & ground measurements of PV, irradiance, aerosols & precipitation (radar) (start with what we have already: SEVIRI & PV). How to handle sparse inputs? Maybe start with a convnet (u-net?) where missing inputs are just zero or something; or there's a "mask" channel. Also try an attention mechanism? Output: 3 channel image: CMSAF, PV, irradiance. This is nice because we can train on whatever data is available for a given timestep (e.g. only timesteps at 00 and 30 minutes past the hour have CMSAF; and PV & irradiance are very sparse spatially). But including CM SAF hopefully forces the model to be at least as good as CMSAF.
  2. Model which calibrates CMSAF using ground measurements (PV & irradiance). Inputs: CMSAF, PV & irradiance (to handle the sparse PV & irradiance data, maybe use attention mechanism?). Output: PV & irradiance at the centre of the image? Or maybe outputs full image, but during training we only back-prop loss for locations we have PV / irradiance data for? We could then use this gridded 'calibrated irradiance' image as a target for the SEVIRI-to-PV model?
  3. Although why not do it in one step: Inputs: CMSAF, SEVIRI, PV & irradiance. Output: Complete PV / irradiance image.

Ultimately, I'd hope our model will learn a better mapping from SEVIRI to irradiance than CM SAF. So maybe CM SAF isn't useful as an input, but instead is most useful as a benchmark?
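
A minimal sketch of the loss masking mentioned in modelling ideas 1 and 2, assuming PyTorch tensors: only the pixels/timesteps/channels where we actually have CM SAF / PV / irradiance ground truth contribute to the loss.

import torch

def masked_mae(prediction: torch.Tensor, target: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """MAE over only the locations where ground truth exists.

    prediction, target and mask all have shape (batch, channels, height, width);
    mask is 1 where we have data for that channel/pixel/timestep, 0 elsewhere.
    """
    absolute_error = (prediction - target).abs() * mask
    return absolute_error.sum() / mask.sum().clamp(min=1)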

Build separate multi-process optical flow library

Stand-alone pip-installable Python library.

OpticalFlow class. Pass it a 4D numpy array (example, 2 timesteps, height, width) of input image pairs, and the number of timesteps into the future we want, and it will spin up a pool of processes to compute optical flow as quickly as possible. It'll optionally also give us the flow fields.
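
A rough sketch of how that interface might look, assuming OpenCV's Farneback optical flow and a plain multiprocessing pool (both assumptions; the issue doesn't prescribe an algorithm or a parallelism strategy):

import multiprocessing

import cv2
import numpy as np

def _predict_from_pair(pair: np.ndarray, num_future_timesteps: int) -> np.ndarray:
    # pair: (2, height, width). Farneback expects 8-bit single-channel images,
    # so we assume the imagery has already been scaled to 0-255.
    previous, current = pair[0].astype(np.uint8), pair[1].astype(np.uint8)
    flow = cv2.calcOpticalFlowFarneback(previous, current, None, 0.5, 3, 15, 3, 5, 1.2, 0)
    height, width = current.shape
    grid_x, grid_y = np.meshgrid(np.arange(width), np.arange(height))
    predictions = []
    for step in range(1, num_future_timesteps + 1):
        # Warp the most recent image along the flow field, scaled by the step count.
        map_x = (grid_x - flow[..., 0] * step).astype(np.float32)
        map_y = (grid_y - flow[..., 1] * step).astype(np.float32)
        predictions.append(cv2.remap(current, map_x, map_y, cv2.INTER_LINEAR))
    return np.stack(predictions)

class OpticalFlow:
    """Compute optical-flow extrapolations for a batch of image pairs in parallel."""

    def __init__(self, num_workers: int = None):
        self.num_workers = num_workers or multiprocessing.cpu_count()

    def predict(self, image_pairs: np.ndarray, num_future_timesteps: int) -> np.ndarray:
        # image_pairs: (example, 2 timesteps, height, width).
        args = [(pair, num_future_timesteps) for pair in image_pairs]
        with multiprocessing.Pool(self.num_workers) as pool:
            return np.stack(pool.starmap(_predict_from_pair, args))

Returning the flow fields as well would just mean collecting flow from each worker alongside the warped images.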

Predict just future PV (not imagery)

Maybe a transformer, where the encoder (like axial deeplab) gets recent satellite imagery and PV from many panels, and the decoder outputs just PV for the panel in the centre?

Maybe the decoder gets optical flow predictions (and, later, the predictions from SatFlow)

Maybe pre-train the satellite-imagery-processing model somehow (using the entire geographical extent of the SEVIRI imagery?)
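
A very rough shape for that idea, using a vanilla nn.Transformer rather than an axial-attention encoder; all dimensions and token layouts below are assumptions, just to make the encoder/decoder split concrete.

import torch
import torch.nn as nn

class CentrePanelPVTransformer(nn.Module):
    """Encoder sees satellite-patch and neighbouring-PV tokens;
    decoder emits future PV yield for the panel at the centre of the image."""

    def __init__(self, d_model: int = 128, num_future_timesteps: int = 12, patch_pixels: int = 256):
        super().__init__()
        self.satellite_embedding = nn.Linear(patch_pixels, d_model)  # one flattened patch per timestep
        self.pv_embedding = nn.Linear(1, d_model)                    # one PV reading per (system, timestep)
        self.future_queries = nn.Embedding(num_future_timesteps, d_model)  # one learned query per future step
        self.transformer = nn.Transformer(d_model=d_model, batch_first=True)
        self.head = nn.Linear(d_model, 1)

    def forward(self, satellite_patches: torch.Tensor, pv_history: torch.Tensor) -> torch.Tensor:
        # satellite_patches: (batch, history_steps, patch_pixels)
        # pv_history: (batch, num_systems * history_steps, 1)
        memory = torch.cat(
            [self.satellite_embedding(satellite_patches), self.pv_embedding(pv_history)], dim=1)
        batch_size = satellite_patches.shape[0]
        queries = self.future_queries.weight.unsqueeze(0).expand(batch_size, -1, -1)
        decoded = self.transformer(memory, queries)
        return self.head(decoded).squeeze(-1)  # (batch, num_future_timesteps) PV yield for the centre panel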

Feed PV metadata into ML model

Remember that for the National Grid ESO use-case (and, in general, for the TSO use-case), we won't have PV metadata for the vast majority of the ~ 1 million PV systems in the UK!

How to deal with missing data?
Could have a two-stage model: first predict a probability distribution of PV yield, independent of PV metadata. Then, if we have PV metadata, refine that probability distribution with a second ML model.

Or a simple one-stage model where we somehow encode 'missing data'.

Visualise attention

e.g. wouldn't it be nice to see the model fixating on approaching clouds :)

Related: #35

Shorter epochs?

It's taking 3 hrs per epoch, so we don't get validation results very often. We either need to speed up training (#58) or validate more often (or both), whilst still using all the training batches.
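
Assuming the training loop uses PyTorch Lightning, validating every N training batches (rather than once per 3-hour epoch) is a one-line Trainer change; the numbers below are placeholders.

from pytorch_lightning import Trainer

trainer = Trainer(
    val_check_interval=1_000,  # run validation every 1,000 training batches (placeholder value)
    limit_val_batches=100,     # optionally cap how many validation batches each check uses (placeholder)
)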

Speed up data loading

  • Maybe it's slow because we're very close to running out of mem (#22)? Try more mem on the VM?
  • More workers?

Failed to get_sample for NWP

Failed to get_sample for NWP.  
start=2018-06-19T11:40:00.000000000, 
end=2018-06-19 11:45:00, 
t0=2018-06-19 11:45:00, 

self.nwp_in_mem.init_time=<xarray.DataArray 'init_time' (init_time: 3)>
array(['2018-06-19T03:00:00.000000000', '2018-06-19T06:00:00.000000000',
       '2018-06-19T09:00:00.000000000'], dtype='datetime64[ns]')
Coordinates:
  * init_time  (init_time) datetime64[ns] 2018-06-19T03:00:00 ... 2018-06-19T...
Attributes:
    long_name:      initial time of forecast
Epoch 0: : 3285it [02:38, 20.75it/s, loss=0.139, v_num=220]
Failed get_sample.  
segment=
Segment(start=Timestamp('2018-06-19 03:15:00'), end=Timestamp('2018-06-19 11:55:00'))

Use dask to load data for many batches asynchronously

use async functionality of dask?

Three ideas to try, in increasing level of change to the codebase:

  1. Use client.compute() to asynchronously load data into memory. One thing I'm not sure about is that, in the existing code, future.result() returns a DataInMemory class. Not sure how to get dask to return a DataInMemory class? Might be possible to do future = client.compute(DataInMemory(data=selected_data))

  2. As above, but persist the data on the workers, and use something like client.submit(xr.DataArray.isel, future, init_time=0) to select data (but then we'll have to wait for each example's data to be copied to the main process; maybe that's OK, because each example is pretty tiny).

  3. Don't use multiple PyTorch DataLoader processes. Instead rely entirely on dask to distribute work across multiple processes.
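
A minimal sketch of ideas 1 and 2, assuming the selected data is a lazy, dask-backed xarray object (the variable names here are hypothetical stand-ins for the real loading code):

import dask.array as da
import xarray as xr
from dask.distributed import Client

client = Client()  # local multi-process dask cluster

# Stand-in for the lazily-opened satellite/NWP data (in the real code this
# would come from something like xr.open_zarr(...)):
selected_data = xr.DataArray(da.random.random((3, 128, 128)), dims=["init_time", "y", "x"])

# Idea 1: start loading the next example's data while the current batch trains.
future = client.compute(selected_data)  # returns immediately; work happens on the workers
# ... train on the current batch here ...
data_in_memory = future.result()        # an xarray DataArray now loaded into RAM
                                        # (not a DataInMemory instance - that's the open question above)

# Idea 2: keep the data on the workers and do the per-example selection remotely.
example_future = client.submit(xr.DataArray.isel, future, init_time=0)
example = example_future.result()       # only this small slice is copied back to the main process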

Ongoing thread: Research results & design ideas

Keeping track of some basic research results:

Inferring PV yield for t0 ("now") (not predicting the future):

Getting about 6% normalised MAE where the network input is 128x128 pixels of satellite data (all channels) plus an embedding of the PV system ID. Simple CNN.

Getting about 8% NMAE where network input is 2x2 pixels of NWP data (all 10 surface parameters in UKV) plus an embedding of the PV system ID. Just a fully connected net.

Neither net gets datetime features, or clear sky irradiance, or geo location, or anything like that. So lots of room for improvement!
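
For reference, a small helper for the normalised MAE quoted above, assuming the normalisation is by PV system capacity (an assumption; the notes don't spell out the normaliser):

import numpy as np

def normalised_mae(predicted_pv_yield, actual_pv_yield, pv_capacity):
    # Mean absolute error expressed as a fraction of the system's capacity (assumed normaliser).
    return np.mean(np.abs(predicted_pv_yield - actual_pv_yield)) / pv_capacity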

Baseline: Predict PV from just NWP

NWP inputs: irradiance, temperature & wind speed.

Try with and without historical PV data (from site of interest, and from neighbouring sites). Could go all-out and use Temporal Fusion Transformer, given the historical PV yield from all PV systems in range; and the NWP irradiance for each of those PV systems, and the NWP forecast irradiance.

Dropout PV system ID & PV metadata

During training:

With some probability (30%?), zero-out the output from the PV system ID encoding. And, separately, zero-out different elements of the PV metadata.

Then, during inference, just feed the network zeros for any missing metadata, or zero-out the ID encoding if we haven't seen this PV system during training.

With luck, this should clearly fatten-out the probability distribution of the predicted PV yield.
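
A minimal sketch of that dropout scheme in PyTorch (the tensor names and the 30% probability are placeholders):

import torch

def drop_pv_id_and_metadata(pv_id_embedding, pv_metadata, p=0.3, training=True):
    """Zero out the whole PV-system-ID embedding per example with probability p and,
    independently, zero out each PV metadata element with probability p.
    At inference time, simply pass zeros for anything that's missing."""
    if training:
        batch_size = pv_id_embedding.shape[0]
        keep_id = (torch.rand(batch_size, 1, device=pv_id_embedding.device) > p).float()
        pv_id_embedding = pv_id_embedding * keep_id
        keep_metadata = (torch.rand_like(pv_metadata) > p).float()
        pv_metadata = pv_metadata * keep_metadata
    return pv_id_embedding, pv_metadata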

Why does data loading code use so much RAM?

Uses > 50 GB RAM when using multiple workers.

Things to try:

  • Try without PVDataLoader - are we using lots of RAM because the PV data is getting copied lots of times?
  • Only load NWPs for the land area of the UK (not the oceans)

Predict GSP-level PV power

Background: "GSP" means "grid supply point". National Grid divide Britain up into ~350 non-overlapping regions, and National Grid need PV forecasts for each of these regions. Also, National Grid will probably expect our GSP-level PV forecasts to be calibrated to Sheffield Solar's PV Live "ground truth" for PV generation at each GSP.

Several options:

  1. ML predicts PV for each of the ~ 1 million PV systems in Britain. Then we'll sum spatially over each GSP region, and then calibrate to Sheffield Solar's PV Live, perhaps using a Temporal Fusion Transformer. We'll need the location & capacity for each of those million PV systems. Not entirely clear how to produce probabilistic GSP-level forecasts except Monte Carlo sampling from the PV-level probability distributions? Maybe we could deliver two forecasts: one that's calibrated to PV Live, and one that isn't?

  2. ML directly predicts PV power for each GSP. Almost certainly still want to train on individual PV systems (as well as GSPs). Maybe it's as simple as including the GSPs as separate "PV system IDs" and feeding them into the "PV system ID" embedding. Or maybe have two embeddings (to make it really clear to the net that individual PV systems and GSPs are two separate concepts; although a small GSP might not be much larger than a large, multi-field PV farm, so the concepts aren't that different). This is nice because it directly outputs exactly what we need (GSP-level PV, nicely calibrated to PV Live) and makes it trivial to produce probabilistic forecasts. But do the large GSPs fit within the ML's square of satellite imagery? (If not, maybe drop the spatial res and enlarge the spatial extent?)

  3. Predict PV yield over a regular (4 km?) grid over Britain. Then split those predictions per GSP and feed them into a simple fully connected net (one per GSP) which predicts GSP-level PV according to PV Live.

  4. OCF predicts PV power for the ~20,000 PassivSystem PV systems that Sheffield Solar use as input to PV Live. Sheffield Solar use their PV Live algorithm to upscale from those 20,000 PV systems to the total output per GSP. As an added bonus, OCF could try to predict 5-minutely data for the ~20,000 PV systems which, actually, only report half-hourly data once-per-day.

Could provide a 'mask' to the net, which shows the spatial extent of the GSP (and shows a point for small PV systems; and geometry for larger PV farms??? Although few if any PV farms are larger than a pixel of satellite imagery)

To do this stuff, we need some more data:

Try predicting *just* clouds (not background) by subtracting 'non-cloudy' image from actual image

Quentin Paletta et al.'s 2021 "ECLIPSE" paper shows that it's a good idea to predict cloud masks, because these are concise representations of the sky (with the caveat that I'm afraid I've only skim-read the paper so far, so I may have misunderstood!).

Also, ideally, we'd probably prefer satellite image sequences of just the clouds, rather than the clouds plus the land. Sometimes optical flow messes up by moving the land around (!), so optical flow would probably perform better on 'pure cloud' images (with the land removed). And, for ML approaches for forecasting future satellite images, it's unfair to expect the ML model to reconstruct images of land as the cloud moves away from land, when that information may be entirely absent from the input image sequence (because the cloud has completely covered the land in the input image sequence).

On the other hand, I'm a little nervous about using binary cloud masks because clouds are so varied, and thin wispy clouds which might not be classified as 'cloud' by a binary mask can have a significant effect on irradiance.

Also, reliable cloud mask labels are hard to come by, I think? (Sure, there are algorithms for segmenting clouds, but none are perfect, right?)

So maybe we can automatically separate 'cloud' from 'land' something like this: If we had a perfect 'cloud-free' satellite image for every time of day, then we could do a pixel-wise subtraction: just_cloud_image[t] = actual_image[t] - cloud_free_image[t]. Then we could use the just_cloud_image for our downstream models.

The question then becomes: How to generate the set of cloud_free_image[t] for every time of day, and time of year? Could it be as simple as taking the median pixel value, per pixel, at a given time of day, over the last month or so of imagery? e.g. to get the cloud_free_image for 12:00, look at all the images taken at 12:00 over the last month, and take the median pixel value?

Or maybe the median is the wrong statistic: Maybe instead we can assume the histogram of pixel values at a given time of day, over the last month, would have two peaks: one corresponding to 'cloud free', and the other corresponding to 'cloudy'. And we want the mode of the 'cloud free' peak, which I guess will always be the less-bright peak?
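
A sketch of the per-pixel median version, assuming the imagery is an xarray DataArray with dims (time, y, x) covering roughly the last month; swapping the median for the mode of the darker histogram peak would only change the reduction at the end.

import numpy as np
import pandas as pd
import xarray as xr

def estimate_cloud_free_image(imagery: xr.DataArray, hour: int, minute: int) -> xr.DataArray:
    # Per-pixel median of all images taken at this time of day.
    times = pd.DatetimeIndex(imagery.time.values)
    same_time_of_day = np.flatnonzero((times.hour == hour) & (times.minute == minute))
    return imagery.isel(time=same_time_of_day).median(dim="time")

# Then, per the formula above:
# just_cloud_image[t] = actual_image[t] - estimate_cloud_free_image(imagery, t.hour, t.minute)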

(also see the twitter discussion about this issue)

TODO:

  • Check out the EUMETSAT Cloud Mask (suggested by Quentin). Should be visible in EUMETView.
  • Try segmenting using different satellite channels
  • Think about using an ML model to segment clouds, and score each segment according to how much sunlight it is estimated to let through (its optical depth) (see comment below). Could also score the clouds based on how they are likely to evolve over time?

Fix crash with gcsfs

See fsspec/gcsfs#379

Some specific ideas:

  • Don't load NWPs in DataLoader, to see if that still works.
  • Try fewer worker processes.
  • Try with skip_instance_cache=True or gcs.clear_instance_cache()
  • Try fsspec.asyn.iothread[0] = None; fsspec.asyn.loop[0] = None and latest version of fsspec and gcsfs as suggested here.
  • Try latest version of gcsfs.
  • Delete all Zarr objects in main thread
  • Don't even call Zarr objects in the main thread. Instead pre-compute valid times? Or use a separate child process to compute valid times?
  • Each child process only has one gcs object, which is used to open both the sat data and the NWP data within the process?
  • run open_zarr_on_gcp once in the main thread, and pass the Zarr dataarray into each child process?
  • Write a minimal code example which demonstrates the crash, using PyTorch.
  • Recreate the crash described here: https://stackoverflow.com/questions/66283634/use-gcsfilesystem-with-multiprocessing
  • Use print statements to figure out where the code hangs
  • Disable loading of PV data to speed things up
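
As a concrete starting point, the skip_instance_cache and fsspec.asyn suggestions above would look something like this when run inside each DataLoader worker process (using a worker_init_fn for this is an assumption):

import fsspec.asyn
import gcsfs

def worker_init_fn(worker_id: int):
    # Reset fsspec's cached asyncio event loop and IO thread so each forked
    # worker creates its own instead of inheriting the parent's.
    fsspec.asyn.iothread[0] = None
    fsspec.asyn.loop[0] = None
    # Don't reuse a GCSFileSystem instance that was cached before the fork.
    global gcs
    gcs = gcsfs.GCSFileSystem(skip_instance_cache=True)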

Find a good metric for measuring if the *timing* of the forecast is correct

Correlation?

Accuracy / f1-score / precision / recall (after thresholding the forecast).

Maybe a multi-class classification, motivated by the idea that grid control-room users care a lot about getting the timing of ramps right: classify each timestep as ramp up / ramp down / no change (maybe with multiple levels, e.g. 5 levels: level 1 is fast ramp down, level 3 is no change, level 5 is fast ramp up). But that's a very harsh metric: if the power output is bouncing up and down every few timesteps and your forecast is behind / ahead by one timestep then the metric will say you're completely wrong (and, in some sense, you are!).
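
A sketch of the ramp-classification scoring above, with three classes; the 5% threshold and the macro-averaged F1 are placeholder choices, not from the notes.

import numpy as np
from sklearn.metrics import f1_score

def ramp_f1(actual_pv, forecast_pv, threshold=0.05):
    """Label each timestep as ramp down (-1), no change (0) or ramp up (+1)
    based on the change since the previous timestep, then score with macro F1."""
    def classify(series):
        delta = np.diff(series)
        return np.sign(delta) * (np.abs(delta) > threshold)

    return f1_score(classify(actual_pv), classify(forecast_pv), average="macro")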

Use PV metadata (tilt angle etc.)

Two approaches:

  1. One network which gets PV metadata when it's available. When it's not available, somehow mask those inputs. Set to -1? Or have a separate 'mask' input?

  2. Two networks: One which predicts distribution of PV yield, without knowing any PV metadata. A second network which takes that PV distribution, and refines it when metadata is available

Implement simple benchmark algos

  • Persistence
  • ML model inferring PV yield from clear-sky irradiance
  • NWP (including irradiance, temperature, wind speed, precipitation, cloud cover) to PV yield
  • PVLib
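
The persistence baseline is a one-liner; a minimal sketch, assuming a 1-D array of capacity-normalised PV yield:

import numpy as np

def persistence_forecast(pv_yield_history: np.ndarray, num_future_timesteps: int) -> np.ndarray:
    # Forecast every future timestep as the most recently observed PV yield.
    return np.repeat(pv_yield_history[-1], num_future_timesteps)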
