
nowcasting_dataset's Issues

Try simple approach: Multiple DataLoader workers, each loads samples at random

Do #15 first.

No manually-coded multi-process stuff. Just use DataLoader's worker processes. prefetch_factor should be high (especially if the dataset yields individual samples).

Each worker samples totally at random from the entire dataset. No pre-loading. No careful aligning with Zarr chunk boundaries.

To speed things up, pick the geographical location ahead of time (as per #1), and write efficient data-loading code. Load a complete batch at once using dask.compute(), so dask can parallelise loading and processing each sample in the batch. Use dask to parallelise optical flow, too.

If that's not fast enough, maybe re-create the Zarr with each chunk being a single timestep long.
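
A minimal sketch of that approach, assuming an IterableDataset whose workers each draw samples at random (the dataset class and its sample-picking method are placeholders, not the repo's actual code):

import dask
from torch.utils.data import DataLoader, IterableDataset


class RandomSampleDataset(IterableDataset):
    """Placeholder dataset: each DataLoader worker samples at random from the whole Zarr."""

    def _get_delayed_sample_at_random(self):
        # Placeholder: in reality this would lazily select satellite/NWP/PV data
        # for a random t0 and location.
        return dask.delayed(dict)(sat_data=None, pv_yield=None)

    def __iter__(self):
        while True:
            # Build the sample lazily, then materialise it with a single
            # dask.compute() so dask can parallelise loading and processing.
            delayed_sample = self._get_delayed_sample_at_random()
            yield dask.compute(delayed_sample)[0]


data_loader = DataLoader(
    RandomSampleDataset(),
    batch_size=32,
    num_workers=16,     # plain DataLoader worker processes; no hand-rolled multiprocessing
    prefetch_factor=8,  # keep plenty of samples in flight per worker
)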

Loading slows down with large dataset

The problem

When using the 3,600 timesteps of test Zarr data, loading is super-quick (40 it/s with batch_size=32, image_size_pixels=128, n_samples_per_timestep=4, num_workers=16). This test Zarr has chunk sizes: time=1, y=704, x=548, variable=1. It reads data at almost 200 MB/s.

But, using the full Zarr dataset (with exactly the same chunk size and compression), it struggles to get more than about 5 it/s; and reads data at a few tens of MB/s.

Experimenting, I don't think the bottleneck is gcsfs. Reading a single file, or searching using glob, both seem to run at about the same speed on the two Zarr datasets.

Instead, it looks like Dask takes a long time to consider what to do with all those little chunks! The full Zarr dataset has 2 million chunks. Reading is even slower when using the Zarr array with quarter spatial resolution.

Potential solutions

  • First thing I'm trying is preparing a dataset with just HRV. UPDATE: This seems to work!
  • When we need more channels, re-create the dataset and put the other channels in the same chunk, so the total number of chunks stays the same.
  • Use bigger chunks!
  • Can xarray read data without dask? Update: Yes: xr.open_zarr(filename, chunks=None) (see the example below).
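
For reference, opening a Zarr without dask looks like this (the path is a placeholder):

import xarray as xr

# chunks=None disables dask, so xarray returns plain numpy-backed arrays.
dataset = xr.open_zarr("path/to/satellite.zarr", chunks=None)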

Use cubic interpolation when upsampling NWPs to 5-minutely

Needs a 1-hour buffer on start_hourly and end_hourly in get_nwp_example. But some fiddly things also need fixing (see the sketch after this list):

  • When computing the datetimes available for training, we need to take this 1-hour buffer into consideration
  • In NWPDataLoader we need to load data with an extra buffer. Maybe this means dropping an hour from the start and end of all the contiguous segments, to leave that hour buffer for the NWPs.
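
A rough sketch of the interpolation itself, assuming the NWP data is an xarray DataArray with a target_time coordinate (the coordinate name and function are illustrative, not the repo's actual API):

import pandas as pd
import xarray as xr


def upsample_nwp_to_5_minutes(nwp: xr.DataArray, start_dt, end_dt) -> xr.DataArray:
    """Upsample hourly NWP data to 5-minutely with cubic interpolation.

    Assumes nwp was loaded with the 1-hour buffer either side of
    [start_dt, end_dt] described above, so the cubic fit has enough support.
    """
    target_times = pd.date_range(start_dt, end_dt, freq="5T")
    return nwp.interp(target_time=target_times, method="cubic")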

BUG: Sat data sometimes returns images of wrong size when training

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-13-7b6b8391c42e> in <module>
----> 1 trainer.fit(model, data_module)

~/miniconda3/envs/nowcasting_dataset/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py in fit(self, model, train_dataloader, val_dataloaders, datamodule)
    456         )
    457 
--> 458         self._run(model)
    459 
    460         assert self.state.stopped

~/miniconda3/envs/nowcasting_dataset/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py in _run(self, model)
    754 
    755         # dispatch `start_training` or `start_evaluating` or `start_predicting`
--> 756         self.dispatch()
    757 
    758         # plugin will finalized fitting (e.g. ddp_spawn will load trained model)

~/miniconda3/envs/nowcasting_dataset/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py in dispatch(self)
    795             self.accelerator.start_predicting(self)
    796         else:
--> 797             self.accelerator.start_training(self)
    798 
    799     def run_stage(self):

~/miniconda3/envs/nowcasting_dataset/lib/python3.9/site-packages/pytorch_lightning/accelerators/accelerator.py in start_training(self, trainer)
     94 
     95     def start_training(self, trainer: 'pl.Trainer') -> None:
---> 96         self.training_type_plugin.start_training(trainer)
     97 
     98     def start_evaluating(self, trainer: 'pl.Trainer') -> None:

~/miniconda3/envs/nowcasting_dataset/lib/python3.9/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py in start_training(self, trainer)
    142     def start_training(self, trainer: 'pl.Trainer') -> None:
    143         # double dispatch to initiate the training loop
--> 144         self._results = trainer.run_stage()
    145 
    146     def start_evaluating(self, trainer: 'pl.Trainer') -> None:

~/miniconda3/envs/nowcasting_dataset/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py in run_stage(self)
    805         if self.predicting:
    806             return self.run_predict()
--> 807         return self.run_train()
    808 
    809     def _pre_training_routine(self):

~/miniconda3/envs/nowcasting_dataset/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py in run_train(self)
    867                 with self.profiler.profile("run_training_epoch"):
    868                     # run train epoch
--> 869                     self.train_loop.run_training_epoch()
    870 
    871                 if self.max_steps and self.max_steps <= self.global_step:

~/miniconda3/envs/nowcasting_dataset/lib/python3.9/site-packages/pytorch_lightning/trainer/training_loop.py in run_training_epoch(self)
    489         is_last_batch = None
    490 
--> 491         for batch_idx, (batch, is_last_batch) in train_dataloader:
    492             self.trainer.batch_idx = batch_idx
    493             self.trainer.is_last_batch = is_last_batch

~/miniconda3/envs/nowcasting_dataset/lib/python3.9/site-packages/pytorch_lightning/profiler/profilers.py in profile_iterable(self, iterable, action_name)
    110             try:
    111                 self.start(action_name)
--> 112                 value = next(iterator)
    113                 self.stop(action_name)
    114                 yield value

~/miniconda3/envs/nowcasting_dataset/lib/python3.9/site-packages/pytorch_lightning/trainer/supporters.py in prefetch_iterator(iterable)
    532         return
    533 
--> 534     for val in it:
    535         # yield last and has next
    536         yield last, False

~/miniconda3/envs/nowcasting_dataset/lib/python3.9/site-packages/pytorch_lightning/trainer/supporters.py in __next__(self)
    462 
    463         """
--> 464         return self.request_next_batch(self.loader_iters)
    465 
    466     @staticmethod

~/miniconda3/envs/nowcasting_dataset/lib/python3.9/site-packages/pytorch_lightning/trainer/supporters.py in request_next_batch(loader_iters)
    476 
    477         """
--> 478         return apply_to_collection(loader_iters, Iterator, next)
    479 
    480     @staticmethod

~/miniconda3/envs/nowcasting_dataset/lib/python3.9/site-packages/pytorch_lightning/utilities/apply_func.py in apply_to_collection(data, dtype, function, wrong_dtype, *args, **kwargs)
     83     # Breaking condition
     84     if isinstance(data, dtype) and (wrong_dtype is None or not isinstance(data, wrong_dtype)):
---> 85         return function(data, *args, **kwargs)
     86 
     87     # Recursively apply to collection items

~/miniconda3/envs/nowcasting_dataset/lib/python3.9/site-packages/torch/utils/data/dataloader.py in __next__(self)
    519             if self._sampler_iter is None:
    520                 self._reset()
--> 521             data = self._next_data()
    522             self._num_yielded += 1
    523             if self._dataset_kind == _DatasetKind.Iterable and \

~/miniconda3/envs/nowcasting_dataset/lib/python3.9/site-packages/torch/utils/data/dataloader.py in _next_data(self)
   1181             if len(self._task_info[self._rcvd_idx]) == 2:
   1182                 data = self._task_info.pop(self._rcvd_idx)[1]
-> 1183                 return self._process_data(data)
   1184 
   1185             assert not self._shutdown and self._tasks_outstanding > 0

~/miniconda3/envs/nowcasting_dataset/lib/python3.9/site-packages/torch/utils/data/dataloader.py in _process_data(self, data)
   1227         self._try_put_index()
   1228         if isinstance(data, ExceptionWrapper):
-> 1229             data.reraise()
   1230         return data
   1231 

~/miniconda3/envs/nowcasting_dataset/lib/python3.9/site-packages/torch/_utils.py in reraise(self)
    423             # have message field
    424             raise self.exc_type(message=msg)
--> 425         raise self.exc_type(msg)
    426 
    427 

RuntimeError: Caught RuntimeError in DataLoader worker process 1.
Original Traceback (most recent call last):
  File "/home/jack/miniconda3/envs/nowcasting_dataset/lib/python3.9/site-packages/torch/utils/data/_utils/worker.py", line 287, in _worker_loop
    data = fetcher.fetch(index)
  File "/home/jack/miniconda3/envs/nowcasting_dataset/lib/python3.9/site-packages/torch/utils/data/_utils/fetch.py", line 34, in fetch
    data = next(self.dataset_iter)
  File "/home/jack/dev/ocf/nowcasting_dataset/nowcasting_dataset/dataset.py", line 62, in __iter__
    yield self._get_batch()
  File "/home/jack/dev/ocf/nowcasting_dataset/nowcasting_dataset/dataset.py", line 88, in _get_batch
    return dask.compute(batch_delayed)[0]
  File "/home/jack/miniconda3/envs/nowcasting_dataset/lib/python3.9/site-packages/dask/base.py", line 567, in compute
    results = schedule(dsk, keys, **kwargs)
  File "/home/jack/miniconda3/envs/nowcasting_dataset/lib/python3.9/site-packages/dask/threaded.py", line 79, in get
    results = get_async(
  File "/home/jack/miniconda3/envs/nowcasting_dataset/lib/python3.9/site-packages/dask/local.py", line 514, in get_async
    raise_exception(exc, tb)
  File "/home/jack/miniconda3/envs/nowcasting_dataset/lib/python3.9/site-packages/dask/local.py", line 325, in reraise
    raise exc
  File "/home/jack/miniconda3/envs/nowcasting_dataset/lib/python3.9/site-packages/dask/local.py", line 223, in execute_task
    result = _execute_task(task, data)
  File "/home/jack/miniconda3/envs/nowcasting_dataset/lib/python3.9/site-packages/dask/core.py", line 121, in _execute_task
    return func(*(_execute_task(a, cache) for a in args))
  File "/home/jack/miniconda3/envs/nowcasting_dataset/lib/python3.9/site-packages/torch/utils/data/_utils/collate.py", line 74, in default_collate
    return {key: default_collate([d[key] for d in batch]) for key in elem}
  File "/home/jack/miniconda3/envs/nowcasting_dataset/lib/python3.9/site-packages/torch/utils/data/_utils/collate.py", line 74, in <dictcomp>
    return {key: default_collate([d[key] for d in batch]) for key in elem}
  File "/home/jack/miniconda3/envs/nowcasting_dataset/lib/python3.9/site-packages/torch/utils/data/_utils/collate.py", line 64, in default_collate
    return default_collate([torch.as_tensor(b) for b in batch])
  File "/home/jack/miniconda3/envs/nowcasting_dataset/lib/python3.9/site-packages/torch/utils/data/_utils/collate.py", line 56, in default_collate
    return torch.stack(batch, 0, out=out)
RuntimeError: stack expects each tensor to be equal size, but got [2, 128, 128, 1] at entry 0 and [1, 128, 128, 1] at entry 9

Look into odd chunks in sat zarr

When writing test data, I ran into this problem with the int16 dataset:

Specified zarr chunks encoding['chunks']=(36, 704, 548, 1) for variable named 'stacked_eumetsat_data' would overlap multiple dask chunks ((32, 36, 4), (704,), (548,), (1,)). Writing this array in parallel with dask could lead to corrupted data. Consider either rechunking using chunk(), deleting or modifying encoding['chunks'], or specify safe_chunks=False.
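
Two hedged ways around the warning, following its own suggestions (the variable, dimension, and path names are placeholders; the "time" chunk size of 36 matches the encoding in the message above):

# Option 1: re-chunk the dask-backed dataset so the dask chunks match the
# on-disk chunks requested in encoding['chunks'].
dataset = dataset.chunk({"time": 36})
dataset.to_zarr("test_data.zarr", mode="w")

# Option 2: drop the stale chunk encoding inherited from the source Zarr and
# let the dask chunking define the on-disk chunks instead.
del dataset["stacked_eumetsat_data"].encoding["chunks"]
dataset.to_zarr("test_data.zarr", mode="w")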

Create tidy Python library for loading data :)

With automated unit tests :)

User can easily say "I want this many historical timesteps; and this many forecast timesteps; and including these satellite channels, and these NWP params, and compute optical flow predictions based on the most recent pair of satellite images".
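
A purely hypothetical sketch of what that user-facing configuration could look like; none of these names exist in the library yet:

from dataclasses import dataclass, field
from typing import List


@dataclass
class NowcastingDataConfig:
    """Hypothetical configuration object; every field name here is illustrative."""
    n_historical_timesteps: int = 6            # e.g. 30 minutes of history at 5-minute resolution
    n_forecast_timesteps: int = 12             # e.g. 1 hour of forecast
    satellite_channels: List[str] = field(default_factory=lambda: ["HRV"])
    nwp_params: List[str] = field(default_factory=lambda: ["t"])
    compute_optical_flow: bool = True          # flow from the most recent pair of satellite images


config = NowcastingDataConfig(satellite_channels=["HRV", "IR_016"], nwp_params=["t", "dswrf"])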

Optical flow: Predict future PV yield

Build an OpticalFlowDataSource class, which inherits from DataSource and adds optical_flow_predictions to the Sample dict.

Pre-compute and save in the NetCDF batches.

See #18 for some more notes
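
A hedged sketch of the optical-flow prediction itself, using OpenCV's Farnebäck method on the most recent pair of satellite images; the algorithm choice and function names are assumptions, not the repo's actual implementation:

import cv2
import numpy as np


def predict_next_image(prev_image: np.ndarray, curr_image: np.ndarray) -> np.ndarray:
    """Estimate dense optical flow between two satellite images and advect the
    most recent image forward by one timestep."""
    # Farneback needs 8-bit single-channel images, so rescale first.
    prev8 = cv2.normalize(prev_image, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)
    curr8 = cv2.normalize(curr_image, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)
    flow = cv2.calcOpticalFlowFarneback(prev8, curr8, None, 0.5, 3, 15, 3, 5, 1.2, 0)

    # Backward-warp along the flow to extrapolate the image one step into the future.
    height, width = curr_image.shape
    grid_x, grid_y = np.meshgrid(np.arange(width), np.arange(height))
    map_x = (grid_x - flow[..., 0]).astype(np.float32)
    map_y = (grid_y - flow[..., 1]).astype(np.float32)
    return cv2.remap(curr_image.astype(np.float32), map_x, map_y, cv2.INTER_LINEAR)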

Tidy up data loading & prep code

  • All code for each DataSource should live in its own class (or a python file with helper functions)
  • Standardise interface for getting list of available datetimes (for pre-computing valid datetimes)
  • Clean up the naming of zarr_chunk_sequences, segments, chunks etc. It feels confusing. The code is also inconsistent: sometimes it uses start and end, and sometimes it uses a Segment. We could try ripping out the whole concept of a 'Zarr chunk sequence' and see if it slows the code down noticeably. That is, randomly pick a contiguous section (with probability proportional to its length), and then randomly pick any start date, whether or not it aligns perfectly with Zarr chunks.
  • Consistent capitalisation of PV in class names.
  • Convert all datetimes to UTC, then make them naive, before continuing. Then rip out all the to_naive stuff (see the sketch after this list)
  • Convert satellite data timestamps to 00, 05, 10, ... (instead of 04, 09, etc.)
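
A small sketch of those two datetime clean-ups with pandas (the example timestamps are made up):

import pandas as pd

# Convert timezone-aware datetimes to UTC, then drop the timezone to get naive UTC.
datetimes = pd.DatetimeIndex(
    ["2019-01-01 12:04:00+01:00", "2019-01-01 12:09:00+01:00"])
naive_utc = datetimes.tz_convert("UTC").tz_localize(None)

# Round satellite timestamps ending in :04, :09, ... to the nearest 5 minutes.
rounded = naive_utc.round("5T")  # 11:04 -> 11:05, 11:09 -> 11:10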

Ideas for speeding up:

  1. Have a 'master' data loader class, which:
    • Takes an ordered list of DataSource objects.
    • Before training: Computes the intersection of the datetimes for each DataSource.
    • During training:
      • Pre-fetch data into memory: Each process randomly selects a time segment (perhaps continue with time segments aligned to the satellite Zarr's boundaries; or perhaps simplify the code by throwing away that idea?), and then hands off to worker threads for each data source to pre-load the data in parallel into memory.
      • Loop round data in memory. Construct a Sample by passing it in sequence through the DataSource objects
      • Then the Transforms are responsible for selecting
  2. Maybe need to load NWPs in 'Zarr-friendly' chunks, and then iterate around those in memory. Not entirely sure how to coordinate that with loading satellite-data Zarr-friendly chunks

Get more data

  • NWP (more UKV from CEDA, but we need to talk to the Met Office to get access again. Maybe MOGREPS. Maybe ECMWF).
  • EUMETSAT SEVIRI RSS (extend Future Energy Associate's Airflow pipeline code for ingesting data from EUMETSAT's API)
  • PV (sub-tasks: get more data from PVOutput.org for UK, using OCF's PVOutput Python code. Get data from PassivSystems (UK only). Get PV data from European PV systems)
  • CM-SAF irradiance?
  • Precipitation (UK rainfall radar. Jacob has already loaded this into his code, I think)
  • EUMETSAT cloud mask? (Jacob has cloud mask data, I think)

Compute NWP means & std over the complete dataset

Currently computed using:

nwp_ds.data.isel(init_time=slice(0, 10)).mean(dim=['step', 'x', 'init_time', 'y']).compute()

Using 100 init_times crashes (dask tries to use > 64 GB of RAM).

Try again with a VM with more RAM
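
A hedged sketch of computing the statistics in slices of init_time instead, so dask only ever materialises a small piece at a time (the Zarr path is a placeholder, and the simple average over slices assumes each slice holds the same number of init_times):

import numpy as np
import xarray as xr

nwp_ds = xr.open_zarr("path/to/UKV.zarr")  # placeholder path
dims = ["step", "x", "init_time", "y"]
slice_size = 10

slice_means, slice_sq_means = [], []
for start in range(0, len(nwp_ds.init_time), slice_size):
    # Cast to float so squaring integer-packed data can't overflow.
    chunk = nwp_ds.data.isel(init_time=slice(start, start + slice_size)).astype("float64")
    slice_means.append(chunk.mean(dim=dims).compute())
    slice_sq_means.append((chunk ** 2).mean(dim=dims).compute())

# Combine the per-slice statistics: std = sqrt(E[x^2] - E[x]^2).
mean = sum(slice_means) / len(slice_means)
std = np.sqrt(sum(slice_sq_means) / len(slice_sq_means) - mean ** 2)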

Implement PVDataSource

See notebooks/design.ipynb for ideas of interface.

Might want to share memory across worker processes; perhaps by constructing PV data & metadata as Dask DataFrames?

Implement PVDataSource.pick_locations()

PVDataSource must use dask.delayed

Currently, using PV data roughly halves the training speed (from 30 it/s down to 17 it/s), even though all the PV data is in memory and so should be very fast to load.

Probably need the pv_power data (and metadata?) to be dask arrays. Perhaps it's as simple as keeping it in xarray instead of converting to pandas?
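
A minimal sketch of the direction suggested above: keep pv_power as a dask-backed xarray DataArray and select each example lazily via dask.delayed, so the PV work is computed alongside the rest of the batch (the shapes, window lengths, and names here are invented):

import dask
import numpy as np
import pandas as pd
import xarray as xr

# Invented PV power DataArray: (datetime, system_id), kept dask-backed instead
# of being converted to a pandas DataFrame.
datetimes = pd.date_range("2019-06-01", periods=288, freq="5T")
pv_power = xr.DataArray(
    np.random.rand(len(datetimes), 4),
    coords={"datetime": datetimes, "system_id": [1, 2, 3, 4]},
    dims=["datetime", "system_id"],
).chunk({"datetime": 72})


@dask.delayed
def get_pv_example(t0_dt: pd.Timestamp, system_id: int) -> xr.DataArray:
    """Lazily select a short history + forecast window of PV yield for one system."""
    return pv_power.sel(
        datetime=slice(t0_dt - pd.Timedelta(minutes=30), t0_dt + pd.Timedelta(hours=1)),
        system_id=system_id,
    ).load()


# Build a batch of delayed PV examples, then compute them all in one go.
delayed_examples = [get_pv_example(datetimes[100 + i], system_id=1) for i in range(8)]
pv_batch = dask.compute(*delayed_examples)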

Get contiguous examples (for plotting several hours of predictions)

Maybe implement as a child class of NowcastingDataset which overrides _get_t0_datetimes_for_batch() and _get_locations_for_batch().

For each batch:

  • Pick a random start datetime for the first example, then use contiguous datetimes for the subsequent examples.
  • Pick a random location for the first example, then use that location throughout (a sketch follows below).
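
A small sketch of the two overrides; the method names come from the description above, but the bodies and the 5-minute frequency are assumptions:

import numpy as np
import pandas as pd


def _get_t0_datetimes_for_batch(available_t0_datetimes: pd.DatetimeIndex,
                                batch_size: int) -> pd.DatetimeIndex:
    """Pick one random start, then use contiguous 5-minutely datetimes for the batch."""
    start = np.random.choice(available_t0_datetimes[:-batch_size])
    return pd.date_range(start, periods=batch_size, freq="5T")


def _get_locations_for_batch(x_centers: np.ndarray, y_centers: np.ndarray,
                             batch_size: int):
    """Pick one random location and repeat it for every example in the batch."""
    i = np.random.randint(len(x_centers))
    return np.repeat(x_centers[i], batch_size), np.repeat(y_centers[i], batch_size)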

Try Satellite Zarr with quarter spatial extent (again)

Replace get_sample() with get_batch(): give each DataSource the full list of locations and timestamps for the batch, so the DataSource can load the necessary chunks from disk in a ThreadPoolExecutor. It needs to figure out which spatial chunks to load, given the Zarr chunk boundaries and the locations of the examples, then load those chunks into memory in parallel and return a full batch (with the examples in the correct order).
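
A hedged sketch of that get_batch() shape; the chunk-selection and example-assembly helpers here are placeholders for the real Zarr logic:

from concurrent.futures import ThreadPoolExecutor
from typing import List, Tuple


def find_required_chunks(t0_datetimes, x_centers, y_centers) -> List[Tuple]:
    """Placeholder: map each example's (t0, x, y) onto the Zarr chunks that cover it,
    given the Zarr chunk boundaries, and de-duplicate."""
    return sorted(set(zip(t0_datetimes, x_centers, y_centers)))


def load_chunk(chunk_key) -> dict:
    """Placeholder: read one Zarr chunk into memory (e.g. an xarray .sel(...).load())."""
    return {"chunk_key": chunk_key}


def get_batch(t0_datetimes, x_centers, y_centers) -> list:
    """Load every chunk the batch needs in parallel, then assemble the examples in order."""
    chunk_keys = find_required_chunks(t0_datetimes, x_centers, y_centers)

    # Zarr/GCS reads mostly release the GIL, so a thread pool parallelises them well.
    with ThreadPoolExecutor(max_workers=8) as executor:
        loaded_chunks = list(executor.map(load_chunk, chunk_keys))

    # Placeholder assembly: slice each example out of the in-memory chunks,
    # keeping the original batch order.
    return [(t0, x, y, len(loaded_chunks))
            for t0, x, y in zip(t0_datetimes, x_centers, y_centers)]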

Each DataSource should have its own history_len and forecast_len

Also, get_sample() should just take t0_dt (not start_dt or end_dt), and should take the example so far. Convert start_datetimes to t0_datetimes by adding history_duration.

So we can do this:

HISTORY_LEN = 2
FORECAST_LEN = 12

data_sources = [
    PVDataSource(
        history_len=HISTORY_LEN,
        forecast_len=FORECAST_LEN,
        pv_system_selection=DISJOINT_HISTORY_AND_FORECAST),
    SatelliteDataSource(
        history_len=HISTORY_LEN,
        forecast_len=0,
        image_size_pixels=192,
        transform=OpticalFlow(
            include_flow_in_example=False,
            output_image_size_pixels=128,
            forecast_len=FORECAST_LEN
        ),
    ),
    NWPDataSource(
        history_len=HISTORY_LEN,
        forecast_len=FORECAST_LEN,
        params=['t'],
        transform=SinglePointAtCenter
    )
]

Re-create NWP Zarr

  • Combine the 4 existing Zarrs (or maybe re-load from the GRIB files to fix the few issues)
  • Use minimal data types for each variable, e.g. uint8 for temperature? (see the sketch below)
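
A hedged sketch of what the compact-dtype write could look like with xarray's Zarr encoding, assuming nwp_dataset is the combined Dataset and 't' is surface temperature; the scale/offset values and output path are illustrative:

# Pack temperature into uint8 via a scale factor and offset; values are restored
# on read as decoded = stored * scale_factor + add_offset.
encoding = {
    "t": {
        "dtype": "uint8",
        "scale_factor": 0.5,   # 0.5 K resolution
        "add_offset": 200.0,   # covers roughly 200 K to 327 K
        "_FillValue": 255,
    }
}
nwp_dataset.to_zarr("UKV_recreated.zarr", mode="w", encoding=encoding)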

Ingest numerical weather prediction data (NWP)

Use temperature at surface, precipitation, irradiance, cloud fraction, accumulated snow cover.

  • Finish NWPDataSource
  • Resample to 5-minutely (see the sketch after this list)
  • Standardise
  • Convert to float32
  • Plot timeseries data just before data goes into ML model
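
A rough sketch of the resample/standardise/float32 steps on an NWP DataArray; the target_time coordinate name is an assumption, and the mean/std would come from the "Compute NWP means & std" issue above:

import pandas as pd
import xarray as xr


def prepare_nwp(nwp: xr.DataArray, mean: xr.DataArray, std: xr.DataArray) -> xr.DataArray:
    """Resample to 5-minutely, standardise, and convert to float32."""
    five_minutely = pd.date_range(
        nwp.target_time.values[0], nwp.target_time.values[-1], freq="5T")
    nwp = nwp.interp(target_time=five_minutely)  # see the cubic-interpolation issue above

    nwp = (nwp - mean) / std                     # standardise per channel
    return nwp.astype("float32")                 # compact dtype for the ML model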

BUG: InvalidIndexError

DEBUG:nowcasting_dataset:Opening satellite data: gs://solar-pv-nowcasting-data/satellite/EUMETSAT/SEVIRI_RSS/OSGB36/all_zarr_int16_single_timestep.zarr
[the line above is repeated 12 times]
DEBUG:nowcasting_dataset:Opening NWP data: gs://solar-pv-nowcasting-data/NWP/UK_Met_Office/UKV_zarr
[the line above is repeated 12 times]
ERROR:nowcasting_dataset:Exception! start_hourly=2019-11-07 15:00:00, t0_hourly=2019-11-07 16:00:00, end_hourly=2019-11-07 16:00:00, target_times_hourly=DatetimeIndex(['2019-11-07 15:00:00', '2019-11-07 16:00:00'], dtype='datetime64[ns]', freq='H'), Reindexing only valid with uniquely valued Index objects, is_increasing=True, is_unique=True
Traceback (most recent call last):
  File "/home/jack/dev/ocf/nowcasting_dataset/nowcasting_dataset/data_sources/data_source.py", line 64, in _get_cached_time_slice
    return self._cache[t0_dt]
KeyError: Timestamp('2019-11-07 15:55:00')

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/jack/dev/ocf/nowcasting_dataset/nowcasting_dataset/data_sources/nwp_data_source.py", line 102, in _get_time_slice
    init_times = self.data.sel(
  File "/home/jack/miniconda3/envs/nowcasting_dataset/lib/python3.9/site-packages/xarray/core/dataarray.py", line 1271, in sel
    ds = self._to_temp_dataset().sel(
  File "/home/jack/miniconda3/envs/nowcasting_dataset/lib/python3.9/site-packages/xarray/core/dataset.py", line 2365, in sel
    pos_indexers, new_indexes = remap_label_indexers(
  File "/home/jack/miniconda3/envs/nowcasting_dataset/lib/python3.9/site-packages/xarray/core/coordinates.py", line 421, in remap_label_indexers
    pos_indexers, new_indexes = indexing.remap_label_indexers(
  File "/home/jack/miniconda3/envs/nowcasting_dataset/lib/python3.9/site-packages/xarray/core/indexing.py", line 274, in remap_label_indexers
    idxr, new_idx = convert_label_indexer(index, label, dim, method, tolerance)
  File "/home/jack/miniconda3/envs/nowcasting_dataset/lib/python3.9/site-packages/xarray/core/indexing.py", line 200, in convert_label_indexer
    indexer = get_indexer_nd(index, label, method, tolerance)
  File "/home/jack/miniconda3/envs/nowcasting_dataset/lib/python3.9/site-packages/xarray/core/indexing.py", line 101, in get_indexer_nd
    flat_indexer = index.get_indexer(flat_labels, method=method, tolerance=tolerance)
  File "/home/jack/miniconda3/envs/nowcasting_dataset/lib/python3.9/site-packages/pandas/core/indexes/base.py", line 3442, in get_indexer
    raise InvalidIndexError(self._requires_unique_msg)
pandas.errors.InvalidIndexError: Reindexing only valid with uniquely valued Index objects
ERROR:nowcasting_dataset:Exception!  t0_dt=2019-11-07 15:55:00, x_meters_center=40000, y_meters_center=20000, Reindexing only valid with uniquely valued Index objects
Traceback (most recent call last):
  File "/home/jack/dev/ocf/nowcasting_dataset/nowcasting_dataset/data_sources/data_source.py", line 64, in _get_cached_time_slice
    return self._cache[t0_dt]
KeyError: Timestamp('2019-11-07 15:55:00')

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/jack/dev/ocf/nowcasting_dataset/nowcasting_dataset/dataset.py", line 122, in _get_example
    example_from_source = data_source.get_example(
  File "/home/jack/dev/ocf/nowcasting_dataset/nowcasting_dataset/data_sources/data_source.py", line 148, in get_example
    selected_data = self._get_cached_time_slice(t0_dt)
  File "/home/jack/dev/ocf/nowcasting_dataset/nowcasting_dataset/data_sources/data_source.py", line 66, in _get_cached_time_slice
    data = self._get_time_slice(t0_dt)
  File "/home/jack/dev/ocf/nowcasting_dataset/nowcasting_dataset/data_sources/nwp_data_source.py", line 102, in _get_time_slice
    init_times = self.data.sel(
  File "/home/jack/miniconda3/envs/nowcasting_dataset/lib/python3.9/site-packages/xarray/core/dataarray.py", line 1271, in sel
    ds = self._to_temp_dataset().sel(
  File "/home/jack/miniconda3/envs/nowcasting_dataset/lib/python3.9/site-packages/xarray/core/dataset.py", line 2365, in sel
    pos_indexers, new_indexes = remap_label_indexers(
  File "/home/jack/miniconda3/envs/nowcasting_dataset/lib/python3.9/site-packages/xarray/core/coordinates.py", line 421, in remap_label_indexers
    pos_indexers, new_indexes = indexing.remap_label_indexers(
  File "/home/jack/miniconda3/envs/nowcasting_dataset/lib/python3.9/site-packages/xarray/core/indexing.py", line 274, in remap_label_indexers
    idxr, new_idx = convert_label_indexer(index, label, dim, method, tolerance)
  File "/home/jack/miniconda3/envs/nowcasting_dataset/lib/python3.9/site-packages/xarray/core/indexing.py", line 200, in convert_label_indexer
    indexer = get_indexer_nd(index, label, method, tolerance)
  File "/home/jack/miniconda3/envs/nowcasting_dataset/lib/python3.9/site-packages/xarray/core/indexing.py", line 101, in get_indexer_nd
    flat_indexer = index.get_indexer(flat_labels, method=method, tolerance=tolerance)
  File "/home/jack/miniconda3/envs/nowcasting_dataset/lib/python3.9/site-packages/pandas/core/indexes/base.py", line 3442, in get_indexer
    raise InvalidIndexError(self._requires_unique_msg)
pandas.errors.InvalidIndexError: Reindexing only valid with uniquely valued Index objects
ERROR:nowcasting_dataset:Exception! start_hourly=2019-09-30 13:00:00, t0_hourly=2019-09-30 14:00:00, end_hourly=2019-09-30 14:00:00, target_times_hourly=DatetimeIndex(['2019-09-30 13:00:00', '2019-09-30 14:00:00'], dtype='datetime64[ns]', freq='H'), Reindexing only valid with uniquely valued Index objects, is_increasing=True, is_unique=True
Traceback (most recent call last):
  File "/home/jack/dev/ocf/nowcasting_dataset/nowcasting_dataset/data_sources/data_source.py", line 64, in _get_cached_time_slice
    return self._cache[t0_dt]
KeyError: Timestamp('2019-09-30 13:45:00')

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/jack/dev/ocf/nowcasting_dataset/nowcasting_dataset/data_sources/nwp_data_source.py", line 102, in _get_time_slice
    init_times = self.data.sel(
  File "/home/jack/miniconda3/envs/nowcasting_dataset/lib/python3.9/site-packages/xarray/core/dataarray.py", line 1271, in sel
    ds = self._to_temp_dataset().sel(
  File "/home/jack/miniconda3/envs/nowcasting_dataset/lib/python3.9/site-packages/xarray/core/dataset.py", line 2365, in sel
    pos_indexers, new_indexes = remap_label_indexers(
  File "/home/jack/miniconda3/envs/nowcasting_dataset/lib/python3.9/site-packages/xarray/core/coordinates.py", line 421, in remap_label_indexers
    pos_indexers, new_indexes = indexing.remap_label_indexers(
  File "/home/jack/miniconda3/envs/nowcasting_dataset/lib/python3.9/site-packages/xarray/core/indexing.py", line 274, in remap_label_indexers
    idxr, new_idx = convert_label_indexer(index, label, dim, method, tolerance)
  File "/home/jack/miniconda3/envs/nowcasting_dataset/lib/python3.9/site-packages/xarray/core/indexing.py", line 200, in convert_label_indexer
    indexer = get_indexer_nd(index, label, method, tolerance)
  File "/home/jack/miniconda3/envs/nowcasting_dataset/lib/python3.9/site-packages/xarray/core/indexing.py", line 101, in get_indexer_nd
    flat_indexer = index.get_indexer(flat_labels, method=method, tolerance=tolerance)
  File "/home/jack/miniconda3/envs/nowcasting_dataset/lib/python3.9/site-packages/pandas/core/indexes/base.py", line 3442, in get_indexer
    raise InvalidIndexError(self._requires_unique_msg)
pandas.errors.InvalidIndexError: Reindexing only valid with uniquely valued Index objects
ERROR:nowcasting_dataset:Exception!  t0_dt=2019-09-30 13:45:00, x_meters_center=40000, y_meters_center=250000, Reindexing only valid with uniquely valued Index objects
Traceback (most recent call last):
  File "/home/jack/dev/ocf/nowcasting_dataset/nowcasting_dataset/data_sources/data_source.py", line 64, in _get_cached_time_slice
    return self._cache[t0_dt]
KeyError: Timestamp('2019-09-30 13:45:00')

Train across all PV systems, then fine tune on single PV system

Modify the PV Data Source so it samples from all PV systems for a few epochs, and then fixes on one, to explore the use case where we want good forecasts for a single PV system of interest.

Compare this to not fine tuning, and embedding the PV system's identity.

Implement NWPDataSource

See notebooks/design.ipynb for ideas of interface.

We will almost certainly also have to re-create the NWP Zarr (#11) so we load smaller files, e.g. each chunk should be a single init_time and a single step.
