
predict_pv_yield's Introduction

Intro

Early experiments on predicting solar electricity generation over the next few hours, using deep learning, satellite imagery, and as many other data sources as we can think of :)

These experiments are focused on predicting solar PV yield.

Please see SatFlow for complementary experiments on predicting the next few hours of satellite imagery (i.e. trying to predict how clouds are going to move!)

And please see OCF's Nowcasting page for more context.

Installation

From within the cloned predict_pv_yield directory:

conda env create -f environment.yml
conda activate predict_pv_yield
pip install -e .

predict_pv_yield's People

Contributors

jackkelly, peterdudfield, pre-commit-ci[bot]


predict_pv_yield's Issues

Try CoordConv

https://medium.com/@Cambridge_Spark/coordconv-layer-deep-learning-e02d728c2311

Suggested by @thecapeador

The idea being that it's really important for the CNN to discriminate between different parts of the image based on their location.

Several combinations to try:

  • Location relative to the pixel boundaries of the image (i.e. the x and y locations of the pixels)
  • The geospatial position (lat, lon)
  • The clouds in the direct path between the sun and the PV system in the center of the image?!? (actually quite excited about this idea!) That is, given the Sun's azimuth and angle in the sky from the central point in the image, show the neural net where a direct line-of-sight between the Sun and the central point in the image would be, through the lower ~10 km of atmosphere (where the clouds live).
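
A minimal sketch of the first option (the x and y pixel coordinates), assuming a PyTorch model: the layer just concatenates two normalised coordinate channels onto the input before an ordinary convolution. The same pattern would work for the geospatial option by swapping in lat/lon channels.

import torch
import torch.nn as nn

class CoordConv2d(nn.Module):
    """Conv2d with two extra input channels holding normalised x/y pixel coordinates."""

    def __init__(self, in_channels: int, out_channels: int, **conv_kwargs):
        super().__init__()
        # e.g. CoordConv2d(12, 32, kernel_size=3, padding=1)
        self.conv = nn.Conv2d(in_channels + 2, out_channels, **conv_kwargs)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        batch_size, _, height, width = x.shape
        ys = torch.linspace(-1, 1, height, device=x.device)
        xs = torch.linspace(-1, 1, width, device=x.device)
        grid_y, grid_x = torch.meshgrid(ys, xs, indexing="ij")
        coords = torch.stack([grid_y, grid_x]).expand(batch_size, -1, -1, -1)
        return self.conv(torch.cat([x, coords], dim=1))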

Use satellite-estimated irradiance (perhaps just as a benchmark?)

  • Download CMSAF data
  • Can we use CMSAF at inference time? How quickly is it updated? (UPDATE: Looks like data might be available 2 days behind?!?)

Some notes about CMSAF from the PDF documentation:

  • Data is free.
  • 30-minute "instantaneous" values
  • Data volume = "up to 5 GB per month per parameter"

Modelling ideas:

  1. Inputs: SEVIRI (lots of channels) & ground measurements of PV, irradiance, aerosols & precipitation (radar) (start with what we have already: SEVIRI & PV). How to handle sparse inputs? Maybe start with a convnet (u-net?) where missing inputs are just zero or something; or there's a "mask" channel. Also try an attention mechanism? Output: 3 channel image: CMSAF, PV, irradiance. This is nice because we can train on whatever data is available for a given timestep (e.g. only timesteps at 00 and 30 minutes past the hour have CMSAF; and PV & irradiance are very sparse spatially). But including CM SAF hopefully forces the model to be at least as good as CMSAF.
  2. Model which calibrates CMSAF using ground measurements (PV & irradiance). Inputs: CMSAF, PV & irradiance (to handle the sparse PV & irradiance data, maybe use attention mechanism?). Output: PV & irradiance at the centre of the image? Or maybe outputs full image, but during training we only back-prop loss for locations we have PV / irradiance data for? We could then use this gridded 'calibrated irradiance' image as a target for the SEVIRI-to-PV model?
  3. Although why not do it in one step: Inputs: CMSAF, SEVIRI, PV & irradiance. Output: Complete PV / irradiance image.

Ultimately, I'd hope our model will learn a better mapping from SEVIRI to irradiance than CM SAF. So maybe CM SAF isn't useful as an input, but instead is most useful as a benchmark?
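
A minimal sketch of the loss masking mentioned in modelling ideas 1 and 2, assuming PyTorch tensors: only the pixels/timesteps/channels where we actually have CM SAF / PV / irradiance ground truth contribute to the loss.

import torch

def masked_mae(prediction: torch.Tensor, target: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """MAE over only the locations where ground truth exists.

    prediction, target and mask all have shape (batch, channels, height, width);
    mask is 1 where we have data for that channel/pixel/timestep, 0 elsewhere.
    """
    absolute_error = (prediction - target).abs() * mask
    return absolute_error.sum() / mask.sum().clamp(min=1)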

Build separate multi-process optical flow library

Stand-alone pip-installable Python library.

OpticalFlow class. Pass it a 4D numpy array (example, 2 timesteps, height, width) of input image pairs, and the number of timesteps into the future we want, and it will spin up a pool of processes to compute optical flow as quickly as possible. It'll optionally also give us the flow fields.
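
A rough sketch of how that interface might look, assuming OpenCV's Farneback optical flow and a plain multiprocessing pool (both assumptions; the issue doesn't prescribe an algorithm or a parallelism strategy):

import multiprocessing

import cv2
import numpy as np

def _predict_from_pair(pair: np.ndarray, num_future_timesteps: int) -> np.ndarray:
    # pair: (2, height, width). Farneback expects 8-bit single-channel images,
    # so we assume the imagery has already been scaled to 0-255.
    previous, current = pair[0].astype(np.uint8), pair[1].astype(np.uint8)
    flow = cv2.calcOpticalFlowFarneback(previous, current, None, 0.5, 3, 15, 3, 5, 1.2, 0)
    height, width = current.shape
    grid_x, grid_y = np.meshgrid(np.arange(width), np.arange(height))
    predictions = []
    for step in range(1, num_future_timesteps + 1):
        # Warp the most recent image along the flow field, scaled by the step count.
        map_x = (grid_x - flow[..., 0] * step).astype(np.float32)
        map_y = (grid_y - flow[..., 1] * step).astype(np.float32)
        predictions.append(cv2.remap(current, map_x, map_y, cv2.INTER_LINEAR))
    return np.stack(predictions)

class OpticalFlow:
    """Compute optical-flow extrapolations for a batch of image pairs in parallel."""

    def __init__(self, num_workers: int = None):
        self.num_workers = num_workers or multiprocessing.cpu_count()

    def predict(self, image_pairs: np.ndarray, num_future_timesteps: int) -> np.ndarray:
        # image_pairs: (example, 2 timesteps, height, width).
        args = [(pair, num_future_timesteps) for pair in image_pairs]
        with multiprocessing.Pool(self.num_workers) as pool:
            return np.stack(pool.starmap(_predict_from_pair, args))

Returning the flow fields as well would just mean collecting flow from each worker alongside the warped images.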

Predict just future PV (not imagery)

Maybe a transformer, where the encoder (like axial deeplab) gets recent satellite imagery and PV from many panels, and the decoder outputs just PV for the panel in the centre?

Maybe the decoder gets optical flow predictions (and, later, the predictions from SatFlow)

Maybe pre-train the satellite-imagery-processing model somehow (using the entire geographical extent of the SEVIRI imagery?)
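
A very rough shape for that idea, using a vanilla nn.Transformer rather than an axial-attention encoder; all dimensions and token layouts below are assumptions, just to make the encoder/decoder split concrete.

import torch
import torch.nn as nn

class CentrePanelPVTransformer(nn.Module):
    """Encoder sees satellite-patch and neighbouring-PV tokens;
    decoder emits future PV yield for the panel at the centre of the image."""

    def __init__(self, d_model: int = 128, num_future_timesteps: int = 12, patch_pixels: int = 256):
        super().__init__()
        self.satellite_embedding = nn.Linear(patch_pixels, d_model)  # one flattened patch per timestep
        self.pv_embedding = nn.Linear(1, d_model)                    # one PV reading per (system, timestep)
        self.future_queries = nn.Embedding(num_future_timesteps, d_model)  # one learned query per future step
        self.transformer = nn.Transformer(d_model=d_model, batch_first=True)
        self.head = nn.Linear(d_model, 1)

    def forward(self, satellite_patches: torch.Tensor, pv_history: torch.Tensor) -> torch.Tensor:
        # satellite_patches: (batch, history_steps, patch_pixels)
        # pv_history: (batch, num_systems * history_steps, 1)
        memory = torch.cat(
            [self.satellite_embedding(satellite_patches), self.pv_embedding(pv_history)], dim=1)
        batch_size = satellite_patches.shape[0]
        queries = self.future_queries.weight.unsqueeze(0).expand(batch_size, -1, -1)
        decoded = self.transformer(memory, queries)
        return self.head(decoded).squeeze(-1)  # (batch, num_future_timesteps) PV yield for the centre panel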

Feed PV metadata into ML model

Remember that for the National Grid ESO use-case (and, in general, for the TSO use-case), we won't have PV metadata for the vast majority of the ~ 1 million PV systems in the UK!

How to deal with missing data?
Could have a two-stage model: first predict a probability distribution of PV yield, independent of PV metadata. Then, if we have PV metadata, refine that probability distribution with a second ML model.

Or a simple one-stage model where we somehow encode 'missing data'.

Visualise attention

e.g. wouldn't it be nice to see the model fixating on approaching clouds :)

Related: #35

Shorter epochs?

It's taking 3 hrs per epoch, so we don't get validation results very often. We either need to speed up training (#58) or validate more often (or both), whilst still using all the training batches.
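
Assuming the training loop uses PyTorch Lightning, validating every N training batches (rather than once per 3-hour epoch) is a one-line Trainer change; the numbers below are placeholders.

from pytorch_lightning import Trainer

trainer = Trainer(
    val_check_interval=1_000,  # run validation every 1,000 training batches (placeholder value)
    limit_val_batches=100,     # optionally cap how many validation batches each check uses (placeholder)
)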

Speed up data loading

  • Maybe it's slow because we're very close to running out of mem (#22)? Try more mem on the VM?
  • More workers?

Failed to get_sample for NWP

Failed to get_sample for NWP.  
start=2018-06-19T11:40:00.000000000, 
end=2018-06-19 11:45:00, 
t0=2018-06-19 11:45:00, 

self.nwp_in_mem.init_time=<xarray.DataArray 'init_time' (init_time: 3)>
array(['2018-06-19T03:00:00.000000000', '2018-06-19T06:00:00.000000000',
       '2018-06-19T09:00:00.000000000'], dtype='datetime64[ns]')
Coordinates:
  * init_time  (init_time) datetime64[ns] 2018-06-19T03:00:00 ... 2018-06-19T...
Attributes:
    long_name:      initial time of forecast
Epoch 0: : 3285it [02:38, 20.75it/s, loss=0.139, v_num=220]
Failed get_sample.  
segment=
Segment(start=Timestamp('2018-06-19 03:15:00'), end=Timestamp('2018-06-19 11:55:00'))

Use dask to load data for many batches asynchronously

use async functionality of dask?

Three ideas to try, in increasing level of change to the codebase:

  1. Use client.compute() to asynchronously load data into memory. One thing I'm not sure about is that, in the existing code, future.result() returns a DataInMemory class. Not sure how to get dask to return a DataInMemory class? Might be possible to do future = client.compute(DataInMemory(data=selected_data))

  2. As above, but persist the data on the workers, and use something like client.submit(xr.DataArray.isel, future, init_time=0) to select data (but then we'll have to wait for each example's data to be copied to the main process; maybe that's OK, because each example is pretty tiny).

  3. Don't use multiple PyTorch DataLoader processes. Instead rely entirely on dask to distribute work across multiple processes.
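
A minimal sketch of ideas 1 and 2, assuming the selected data is a lazy, dask-backed xarray object (the variable names here are hypothetical stand-ins for the real loading code):

import dask.array as da
import xarray as xr
from dask.distributed import Client

client = Client()  # local multi-process dask cluster

# Stand-in for the lazily-opened satellite/NWP data (in the real code this
# would come from something like xr.open_zarr(...)):
selected_data = xr.DataArray(da.random.random((3, 128, 128)), dims=["init_time", "y", "x"])

# Idea 1: start loading the next example's data while the current batch trains.
future = client.compute(selected_data)  # returns immediately; work happens on the workers
# ... train on the current batch here ...
data_in_memory = future.result()        # an xarray DataArray now loaded into RAM
                                        # (not a DataInMemory instance - that's the open question above)

# Idea 2: keep the data on the workers and do the per-example selection remotely.
example_future = client.submit(xr.DataArray.isel, future, init_time=0)
example = example_future.result()       # only this small slice is copied back to the main process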

Ongoing thread: Research results & design ideas

Keeping track of some basic research results:

Inferring PV yield for t0 ("now") (not predicting the future):

Getting about 6% normalised MAE where the network input is 128x128 pixels of satellite data (all channels) plus an embedding of the PV system ID. Simple CNN.

Getting about 8% NMAE where network input is 2x2 pixels of NWP data (all 10 surface parameters in UKV) plus an embedding of the PV system ID. Just a fully connected net.

Neither net gets datetime features, or clear sky irradiance, or geo location, or anything like that. So lots of room for improvement!
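
For reference, a small helper for the normalised MAE quoted above, assuming the normalisation is by PV system capacity (an assumption; the notes don't spell out the normaliser):

import numpy as np

def normalised_mae(predicted_pv_yield, actual_pv_yield, pv_capacity):
    # Mean absolute error expressed as a fraction of the system's capacity (assumed normaliser).
    return np.mean(np.abs(predicted_pv_yield - actual_pv_yield)) / pv_capacity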

Baseline: Predict PV from just NWP

NWP inputs: irradiance, temperature & wind speed.

Try with and without historical PV data (from site of interest, and from neighbouring sites). Could go all-out and use Temporal Fusion Transformer, given the historical PV yield from all PV systems in range; and the NWP irradiance for each of those PV systems, and the NWP forecast irradiance.

Dropout PV system ID & PV metadata

During training:

With some probability (30%?), zero-out the output from the PV system ID encoding. And, separately, zero-out different elements of the PV metadata.

Then, during inference, just feed the network zeros for any missing metadata, or zero-out the ID encoding if we haven't seen this PV system during training.

With luck, this should clearly fatten-out the probability distribution of the predicted PV yield.
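
A minimal sketch of that dropout scheme in PyTorch (the tensor names and the 30% probability are placeholders):

import torch

def drop_pv_id_and_metadata(pv_id_embedding, pv_metadata, p=0.3, training=True):
    """Zero out the whole PV-system-ID embedding per example with probability p and,
    independently, zero out each PV metadata element with probability p.
    At inference time, simply pass zeros for anything that's missing."""
    if training:
        batch_size = pv_id_embedding.shape[0]
        keep_id = (torch.rand(batch_size, 1, device=pv_id_embedding.device) > p).float()
        pv_id_embedding = pv_id_embedding * keep_id
        keep_metadata = (torch.rand_like(pv_metadata) > p).float()
        pv_metadata = pv_metadata * keep_metadata
    return pv_id_embedding, pv_metadata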

Why does data loading code use so much RAM?

Uses > 50 GB RAM when using multiple workers.

Things to try:

  • Try without PVDataLoader - are we using lots of RAM because the PV data is getting copied lots of times?
  • Only load NWPs for the land area of the UK (not the oceans)

Predict GSP-level PV power

Background: "GSP" means "grid supply point". National Grid divide Britain up into ~350 non-overlapping regions, and National Grid need PV forecasts for each of these regions. Also, National Grid will probably expect our GSP-level PV forecasts to be calibrated to Sheffield Solar's PV Live "ground truth" for PV generation at each GSP.

Several options:

  1. ML predicts PV for each of the ~ 1 million PV systems in Britain. Then we'll sum spatially over each GSP region, and then calibrate to Sheffield Solar's PV Live, perhaps using a Temporal Fusion Transformer. We'll need the location & capacity for each of those million PV systems. Not entirely clear how to produce probabilistic GSP-level forecasts except Monte Carlo sampling from the PV-level probability distributions? Maybe we could deliver two forecasts: one that's calibrated to PV Live, and one that isn't?

  2. ML directly predicts PV power for each GSP. Almost certainly still want to train on individual PV systems (as well as GSPs). Maybe it's as simple as including the GSPs as separate "PV system IDs" and feeding them into the "PV system ID" embedding. Or maybe have two embeddings (to make it really clear to the net that individual PV systems and GSPs are two separate concepts; although a small GSP might not be much larger than a large, multi-field PV farm, so the concepts aren't that different). This is nice because it directly outputs exactly what we need (GSP-level PV, nicely calibrated to PV Live) and makes it trivial to produce probabilistic forecasts. But do the large GSPs fit within the ML's square of satellite imagery? (If not, maybe drop the spatial res and enlarge the spatial extent?)

  3. Predict PV yield over a regular (4 km?) grid over Britain. Then split those predictions per GSP and feed them into a simple fully connected net (one per GSP) which predicts GSP-level PV according to PV Live.

  4. OCF predicts PV power for the ~20,000 PassivSystem PV systems that Sheffield Solar use as input to PV Live. Sheffield Solar use their PV Live algorithm to upscale from those 20,000 PV systems to the total output per GSP. As an added bonus, OCF could try to predict 5-minutely data for the ~20,000 PV systems which, actually, only report half-hourly data once-per-day.

Could provide a 'mask' to the net, which shows the spatial extent of the GSP (and shows a point for small PV systems; and geometry for larger PV farms??? Although few if any PV farms are larger than a pixel of satellite imagery)

To do this stuff, we need some more data:

Try predicting *just* clouds (not background) by subtracting 'non-cloudy' image from actual image

Quentin Paletta et al.'s 2021 "ECLIPSE" paper shows that it's a good idea to predict cloud masks, because these are concise representations of the sky (with the caveat that I'm afraid I've only skim-read the paper so far, so I may have misunderstood!).

Also, ideally, we'd probably prefer satellite image sequences of just the clouds, rather than the clouds plus the land. Sometimes optical flow messes up by moving the land around (!), so optical flow would probably perform better on 'pure cloud' images (with the land removed). And, for ML approaches for forecasting future satellite images, it's unfair to expect the ML model to reconstruct images of land as the cloud moves away from land, when that information may be entirely absent from the input image sequence (because the cloud has completely covered the land in the input image sequence).

On the other hand, I'm a little nervous about using binary cloud masks because clouds are so varied, and thin wispy clouds which might not be classified as 'cloud' by a binary mask can have a significant effect on irradiance.

Also, reliable cloud mask labels are hard to come by, I think? (Sure, there are algorithms for segmenting clouds, but none are perfect, right?)

So maybe we can automatically separate 'cloud' from 'land' something like this: If we had a perfect 'cloud-free' satellite image for every time of day, then we could do a pixel-wise subtraction: just_cloud_image[t] = actual_image[t] - cloud_free_image[t]. Then we could use the just_cloud_image for our downstream models.

The question then becomes: How to generate the set of cloud_free_image[t] for every time of day, and time of year? Could it be as simple as taking the median pixel value, per pixel, at a given time of day, over the last month or so of imagery? e.g. to get the cloud_free_image for 12:00, look at all the images taken at 12:00 over the last month, and take the median pixel value?

Or maybe the median is the wrong statistic: Maybe instead we can assume the histogram of pixel values at a given time of day, over the last month, would have two peaks: one corresponding to 'cloud free', and the other corresponding to 'cloudy'. And we want the mode of the 'cloud free' peak, which I guess will always be the less-bright peak?
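
A sketch of the per-pixel median version, assuming the imagery is an xarray DataArray with dims (time, y, x) covering roughly the last month; swapping the median for the mode of the darker histogram peak would only change the reduction at the end.

import numpy as np
import pandas as pd
import xarray as xr

def estimate_cloud_free_image(imagery: xr.DataArray, hour: int, minute: int) -> xr.DataArray:
    # Per-pixel median of all images taken at this time of day.
    times = pd.DatetimeIndex(imagery.time.values)
    same_time_of_day = np.flatnonzero((times.hour == hour) & (times.minute == minute))
    return imagery.isel(time=same_time_of_day).median(dim="time")

# Then, per the formula above:
# just_cloud_image[t] = actual_image[t] - estimate_cloud_free_image(imagery, t.hour, t.minute)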

(also see the twitter discussion about this issue)

TODO:

  • Check out the EUMETSAT Cloud Mask (suggested by Quentin). Should be visible in EUMETView.
  • Try segmenting using different satellite channels
  • Think about using an ML model to segment clouds, and score each segment according to how much sunlight it is estimated to let through (its optical depth) (see comment below). Could also score the clouds based on how they are likely to evolve over time?

Fix crash with gcsfs

See fsspec/gcsfs#379

Some specific ideas:

  • Don't load NWPs in DataLoader, to see if that still works.
  • Try fewer worker processes.
  • Try with skip_instance_cache=True or gcs.clear_instance_cache()
  • Try fsspec.asyn.iothread[0] = None; fsspec.asyn.loop[0] = None and latest version of fsspec and gcsfs as suggested here.
  • Try latest version of gcsfs.
  • Delete all Zarr objects in main thread
  • Don't even call Zarr objects in the main thread. Instead pre-compute valid times? Or use a separate child process to compute valid times?
  • Each child process only has one gcs object, which is used to open both the sat data and the NWP data within the process?
  • run open_zarr_on_gcp once in the main thread, and pass the Zarr dataarray into each child process?
  • Write a minimal code example which demonstrates the crash, using PyTorch.
  • Recreate the crash described here: https://stackoverflow.com/questions/66283634/use-gcsfilesystem-with-multiprocessing
  • Use print statements to figure out where the code hangs
  • Disable loading of PV data to speed things up
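
As a concrete starting point, the skip_instance_cache and fsspec.asyn suggestions above would look something like this when run inside each DataLoader worker process (using a worker_init_fn for this is an assumption):

import fsspec.asyn
import gcsfs

def worker_init_fn(worker_id: int):
    # Reset fsspec's cached asyncio event loop and IO thread so each forked
    # worker creates its own instead of inheriting the parent's.
    fsspec.asyn.iothread[0] = None
    fsspec.asyn.loop[0] = None
    # Don't reuse a GCSFileSystem instance that was cached before the fork.
    global gcs
    gcs = gcsfs.GCSFileSystem(skip_instance_cache=True)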

Find a good metric for measuring if the *timing* of the forecast is correct

Correlation?

Accuracy / f1-score / precision / recall (after thresholding the forecast).

Maybe a multi-class classification, motivated by the idea that grid control-room users care a lot about getting the timing of ramps right: classify each timestep as ramp up / ramp down / no change (maybe with multiple levels, e.g. 5 levels: level 1 is fast ramp down, level 3 is no change, level 5 is fast ramp up). But that's a very harsh metric: if the power output is bouncing up and down every few timesteps and your forecast is behind / ahead by one timestep then the metric will say you're completely wrong (and, in some sense, you are!).
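
A sketch of the ramp-classification scoring above, with three classes; the 5% threshold and the macro-averaged F1 are placeholder choices, not from the notes.

import numpy as np
from sklearn.metrics import f1_score

def ramp_f1(actual_pv, forecast_pv, threshold=0.05):
    """Label each timestep as ramp down (-1), no change (0) or ramp up (+1)
    based on the change since the previous timestep, then score with macro F1."""
    def classify(series):
        delta = np.diff(series)
        return np.sign(delta) * (np.abs(delta) > threshold)

    return f1_score(classify(actual_pv), classify(forecast_pv), average="macro")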

Use PV metadata (tilt angle etc.)

Two approaches:

  1. One network which gets PV metadata when it's available. When it's not available, somehow mask those inputs. Set to -1? Or have a separate 'mask' input?

  2. Two networks: One which predicts distribution of PV yield, without knowing any PV metadata. A second network which takes that PV distribution, and refines it when metadata is available

Implement simple benchmark algos

  • Persistence
  • ML model inferring PV yield from clear-sky irradiance
  • NWP (including irradiance, temperature, wind speed, precipitation, cloud cover) to PV yield
  • PVLib
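
The persistence baseline is a one-liner; a minimal sketch, assuming a 1-D array of capacity-normalised PV yield:

import numpy as np

def persistence_forecast(pv_yield_history: np.ndarray, num_future_timesteps: int) -> np.ndarray:
    # Forecast every future timestep as the most recently observed PV yield.
    return np.repeat(pv_yield_history[-1], num_future_timesteps)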
