openclimatefix / nowcasting_dataloader
PyTorch Dataloader for working with multi-modal data for nowcasting applications
Home Page: https://nowcasting-dataloader.readthedocs.io/
License: MIT License
It would be great to check that all data sources are present in BatchML.
This was a lucky find with #10.
Add asserts for all data sources in
https://github.com/openclimatefix/nowcasting_dataloader/blob/main/tests/test_netcdf_dataset.py#L69
Normalize the x and y coordinates of the PV and GSP data.
We need to find the max and min x and y values across all batches.
Normalizing is useful so that these values can be included in ML models.
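A minimal sketch of the min/max approach described above, assuming the coordinates of each batch are available as NumPy arrays. The function names and the example values are hypothetical and are not part of nowcasting_dataloader.

```python
import numpy as np


def fit_min_max(batches):
    """Find the global min and max over all batches of coordinates."""
    lo = min(float(np.nanmin(b)) for b in batches)
    hi = max(float(np.nanmax(b)) for b in batches)
    return lo, hi


def min_max_normalize(coords, lo, hi):
    """Rescale coordinates to [0, 1] using precomputed global bounds."""
    coords = np.asarray(coords, dtype=np.float32)
    return ((coords - lo) / (hi - lo)).astype(np.float32)


# Made-up x coordinates (metres) for two batches, for illustration only.
x_batches = [np.array([100_000.0, 250_000.0]), np.array([400_000.0, 700_000.0])]
x_min, x_max = fit_min_max(x_batches)
x_norm = min_max_normalize(x_batches[0], x_min, x_max)
```

The bounds are computed once over every batch and then reused, so that the same physical location always maps to the same normalized value.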
Once we have the map of installed PV capacity in nowcasting_dataset
(see issue openclimatefix/nowcasting_dataset#184), we need to figure out how best to encode this map.
The map isn't a map of individual PV sites. Instead it's the PV capacity per LSOA. Jamie says:
LSOAs vary in size. The biggest ones are in Scotland. The largest is ~1200 km2 !! The median size is 0.418 km2 though, so in general they're pretty small.
The simplest encoding is probably as a bitmap, at the same spatial resolution as the satellite imagery. Each row (pixel) of data would encode:
t0? We can probably assume the installed PV capacity is static per ML example. So each example only needs a single map of installed PV capacity. But we probably do need to provide a datetime encoding, so the attention mechanism can see that the map is "close" to the other data modalities.
Make sure x and y coordinates are using standardised names
Could use
but when these are normalised are these names right?
Describe the bug
Topographic data is not converted from xr.Dataset properly
To Reproduce
Run tests/test_netcdf_dataset.py and see that BatchML has no topographic data in it
Additional context
These files are being moved from nowcasting_dataset, so I didn't want to make this change there.
Currently:
if TOPOGRAPHIC_DATA in xr_dataset.keys():
    return TopographicML(
        batch_size=xr_dataset[TOPOGRAPHIC_DATA].shape[0],
        topo_data=xr_dataset[TOPOGRAPHIC_DATA],
        topo_x_coords=xr_dataset[TOPOGRAPHIC_DATA].topo_x,
        topo_y_coords=xr_dataset[TOPOGRAPHIC_DATA].topo_y,
    )
else:
    return None
Potential solution
return TopographicML(
    batch_size=xr_dataset.data.shape[0],
    topo_data=xr_dataset.data,
    topo_x_coords=xr_dataset.x,
    topo_y_coords=xr_dataset.y,
)
Make sure random data is float32, not float64. This makes it easier to work with PyTorch.
To Reproduce
x: BatchML = BatchML.fake()
print(type(x.satellite.data[0, 0, 0, 0, 0]))
Expected behavior
The fake data should be float32
Additional context
float32 works better with PyTorch
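One possible way to get float32 fake data, sketched with NumPy only. The helper names fake_array and to_float32 are hypothetical; how BatchML.fake() actually builds its arrays is not shown in this issue.

```python
import numpy as np


def fake_array(shape, seed=0):
    """Generate random fake data directly as float32."""
    rng = np.random.default_rng(seed)
    # Passing dtype avoids creating a float64 array and casting afterwards.
    return rng.standard_normal(size=shape, dtype=np.float32)


def to_float32(array):
    """Downcast float64 arrays to float32; leave other dtypes untouched."""
    array = np.asarray(array)
    if array.dtype == np.float64:
        return array.astype(np.float32)
    return array


fake = fake_array((2, 3))
downcast = to_float32(np.zeros(3))  # np.zeros defaults to float64
```

torch.from_numpy keeps the NumPy dtype, so float32 arrays should become float32 tensors without an extra cast.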
Normalization of data should be done in this repo
This is linked with openclimatefix/nowcasting_dataset#231
The idea is to have no normalization in the dataset, but to do it all on the fly
Could add this to the pydantic models as a method, which is then called
Looking at the output from SatFlowDataset, it looks like nwp, sat_data and hrv_sat_data are all float64? If so, we can probably get an easy speedup by using float32 (as I'm sure you know, GPUs are much faster at float32 than float64).
Just for reference, here's the mean, std, dtype and shape of all the outputs of SatFlowDataset:
*** INPUTS ***
pv_yield
MEAN = tensor(0.3119)
STD = tensor(0.2217)
dtype = torch.float32
shape = torch.Size([32, 31, 7])
pv_system_id
MEAN = tensor(nan)
STD = tensor(nan)
dtype = torch.float32
shape = torch.Size([32, 128])
nwp
MEAN = tensor(-0.0024, dtype=torch.float64)
STD = tensor(0.9119, dtype=torch.float64)
dtype = torch.float64
shape = torch.Size([32, 10, 4, 64, 64])
topo_data
MEAN = tensor(-0.0682)
STD = tensor(0.9836)
dtype = torch.float32
shape = torch.Size([32, 1, 1, 64, 64])
gsp_id
MEAN = tensor(nan)
STD = tensor(nan)
dtype = torch.float32
shape = torch.Size([32, 32])
sat_data
MEAN = tensor(0.0705, dtype=torch.float64)
STD = tensor(0.8638, dtype=torch.float64)
dtype = torch.float64
shape = torch.Size([32, 11, 7, 24, 24])
hrv_sat_data
MEAN = tensor(0.0233, dtype=torch.float64)
STD = tensor(0.1539, dtype=torch.float64)
dtype = torch.float64
shape = torch.Size([32, 1, 7, 64, 64])
*** TARGETS ***
gsp_yield
MEAN = tensor(nan)
STD = tensor(nan)
dtype = torch.float32
shape = torch.Size([32, 4, 32])
gsp_id
MEAN = tensor(nan)
STD = tensor(nan)
dtype = torch.float32
shape = torch.Size([32, 32])
Schedule the CI test runner to run every Monday lunchtime
Useful to regularly check whether things are broken
on:
  push:
  schedule:
    - cron: "0 12 * * 1"
Add GSP data to fake BatchML
https://github.com/openclimatefix/nowcasting_dataloader/blob/main/nowcasting_dataloader/batch.py#L93
This is used for testing in some ML repos
use similar method to other data sources, gsp.fake()
https://github.com/openclimatefix/nowcasting_dataloader/blob/main/nowcasting_dataloader/data_sources/gsp/gsp_model.py#L88
Describe the bug
We got rid of the metadata files in Batch, but subselect relies on them. So it fails and can't be used for now.
To Reproduce
Steps to reproduce the behavior:
Run tests
Expected behavior
Subselect to select the data
Additional context
We could load in the t0 datetimes needed from the CSV files created, or add the metadata to the batch files, e.g. have an attribute in the Satellite file giving the t0 time.
Instead of using strings like 'NWP' to identify each modality, should we use an Enum? (Just to make sure that folks don't accidentally type the wrong modality name, which might break things in weird ways?)
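A rough sketch of what such an Enum could look like. The class name Modality, the member names and the helper parse_modality are hypothetical; the string values are taken from the keys that appear elsewhere in these issues (nwp, sat_data, hrv_sat_data, topo_data).

```python
from enum import Enum


class Modality(str, Enum):
    """Hypothetical enum of modality names; values mirror the existing string keys."""

    NWP = "nwp"
    SATELLITE = "sat_data"
    HRV_SATELLITE = "hrv_sat_data"
    TOPOGRAPHIC = "topo_data"


def parse_modality(name: str) -> Modality:
    """Convert a string to a Modality, failing loudly on typos."""
    try:
        return Modality(name)
    except ValueError:
        valid = ", ".join(m.value for m in Modality)
        raise ValueError(f"Unknown modality {name!r}; expected one of: {valid}") from None


nwp = parse_modality("nwp")
```

Subclassing str means existing code that compares against plain strings (for example Modality.NWP == "nwp") should keep working during a migration.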
Currently, both Batch and BatchML can be made with fake data.
It is probably best to have just one method and then transform the data.
Good to keep code tidy
The code here could look like:
batch = Batch.fake()
batch_ml = BatchML.from_batch(batch=batch)
Should be able to get rid of this file
Could be implemented as an iPython notebook (with lots of text to explain what's going on)?
Describe the bug
For a history of 30 min and a forecast of 120 min, the position encoding has 6 timesteps, which is correct: 1 past timestep, 1 current timestep, and 4 future ones. The position encoding for the queries, which should only cover the future timesteps, should then contain only 4 timesteps, but instead contains 6.
To Reproduce
Steps to reproduce the behavior:
Run SatFlowDataset and look at the output tensors
Expected behavior
The position encodings of the GSP query to have 4 timesteps.
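To make the expected counts concrete, here is a small sketch of the timestep arithmetic, assuming a 30-minute step (which matches the 6 timesteps reported above). The function split_timesteps is hypothetical and is not the code used in SatFlowDataset.

```python
def split_timesteps(history_minutes, forecast_minutes, step_minutes=30):
    """Return (total timesteps, index of t0, indices of the future timesteps)."""
    n_history = history_minutes // step_minutes
    n_future = forecast_minutes // step_minutes
    total = n_history + 1 + n_future  # past steps + current step + future steps
    t0_idx = n_history
    future_idx = list(range(t0_idx + 1, total))
    return total, t0_idx, future_idx


total, t0_idx, future_idx = split_timesteps(history_minutes=30, forecast_minutes=120)

# Stand-in for a position encoding with one entry per timestep.
encoding = [f"enc_t{i}" for i in range(total)]
query_encoding = [encoding[i] for i in future_idx]
```

With these inputs the full encoding has 6 entries and the query encoding keeps only the last 4.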
When changing an xr.Dataset to torch (xr_dataset.torch.to_tensor), the variable is called dims. This variable holds the names of both dims and data_vars.
Perhaps a better name is data_vars_and_dims, and one of these would be called data_var_or_dim.
This would make it clear that it holds data vars and dims rather than just dims.
Nice to make the code clear
Load gcp.yaml from nowcasting_dataset
Just to reduce duplicated code
Update NWP fake data to be in hour chunks.
Useful for testing ML models
Describe the bug
With the newest version of nowcasting-dataset, the dataloader fails because of changes to the Configuration object
To Reproduce
def test_fake_dataset():
> train = torch.utils.data.DataLoader(FakeDataset(configuration=Configuration()), batch_size=None)
tests/test_batch.py:15:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
self = <nowcasting_dataloader.fake.FakeDataset object at 0x7ff20c6af3a0>
configuration = Configuration(general=General(name='example', description='example configuration', cloud='gcp'), input_data=InputData(..._data/v7/'), process=Process(seed=1234, batch_size=32, upload_every_n_batches=16, local_temp_path='~/temp/'), git=None)
length = 10
def __init__(self, configuration: Configuration, length: int = 10):
"""
Init
Args:
configuration: configuration object
length: length of dataset
"""
> self.number_nwp_channels = len(configuration.process.nwp_channels)
E AttributeError: 'Process' object has no attribute 'nwp_channels'
nowcasting_dataloader/fake.py:19: AttributeError
________________ test_netcdf_dataset_local_using_configuration _________________
configuration = Configuration(general=General(name='gcp', description='Configuration for Google Cloud', cloud='gcp'), input_data=Input..._data/v7/'), process=Process(seed=1234, batch_size=32, upload_every_n_batches=16, local_temp_path='~/temp/'), git=None)
def test_netcdf_dataset_local_using_configuration(configuration: Configuration):
DATA_PATH = os.path.join(
os.path.dirname(nowcasting_dataloader.__file__), "../tests", "data", "batch"
)
TEMP_PATH = os.path.join(
os.path.dirname(nowcasting_dataloader.__file__), "../tests", "data", "batch", "temp"
)
> train_dataset = NetCDFDataset(
1,
DATA_PATH,
TEMP_PATH,
cloud="local",
history_minutes=10,
forecast_minutes=10,
required_keys=[NWP_DATA, NWP_TARGET_TIME, SATELLITE_DATA, SATELLITE_DATETIME_INDEX],
configuration=configuration,
)
tests/test_netcdf_dataset.py:40:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
self = <nowcasting_dataloader.datasets.NetCDFDataset object at 0x7ff20c6b8be0>
n_batches = 1
src_path = '/home/runner/work/nowcasting_dataloader/nowcasting_dataloader/nowcasting_dataloader/../tests/data/batch'
tmp_path = '/home/runner/work/nowcasting_dataloader/nowcasting_dataloader/nowcasting_dataloader/../tests/data/batch/temp'
configuration = Configuration(general=General(name='gcp', description='Configuration for Google Cloud', cloud='gcp'), input_data=Input..._data/v7/'), process=Process(seed=1234, batch_size=32, upload_every_n_batches=16, local_temp_path='~/temp/'), git=None)
cloud = 'local'
required_keys = ['nwp', 'nwp_target_time', 'sat_data', 'sat_datetime_index']
history_minutes = 10, forecast_minutes = 10, normalize = False
def __init__(
self,
n_batches: int,
src_path: str,
tmp_path: str,
configuration: Configuration,
cloud: str = "gcp",
required_keys: Union[Tuple[str], List[str]] = None,
history_minutes: Optional[int] = None,
forecast_minutes: Optional[int] = None,
normalize: bool = False,
):
"""
Netcdf Dataset
Args:
n_batches: Number of batches available on disk.
src_path: The full path (including 'gs://') to the data on
Google Cloud storage.
tmp_path: The full path to the local temporary directory
(on a local filesystem).
cloud:
required_keys: Tuple or list of keys required in the example for it to be considered usable
history_minutes: How many past minutes of data to use, if subsetting the batch
forecast_minutes: How many future minutes of data to use, if reducing the amount of forecast time
configuration: configuration object
cloud: which cloud is used, can be "gcp", "aws" or "local".
normalize: normalize the batch data
"""
self.n_batches = n_batches
self.src_path = src_path
self.tmp_path = tmp_path
self.cloud = cloud
self.history_minutes = history_minutes
self.forecast_minutes = forecast_minutes
self.configuration = configuration
self.normalize = normalize
logger.info(f"Setting up NetCDFDataset for {src_path}")
if self.forecast_minutes is None:
self.forecast_minutes = configuration.process.forecast_minutes
if self.history_minutes is None:
self.history_minutes = configuration.process.history_minutes
# see if we need to select the subset of data. If turned on -
# only history_minutes + current time + forecast_minutes data is used.
self.select_subset_data = False
> if self.forecast_minutes != configuration.process.forecast_minutes:
E AttributeError: 'Process' object has no attribute 'forecast_minutes'
nowcasting_dataloader/datasets.py:129: AttributeError
At the moment, the SatelliteML class has x and y fields, which store the x_geostationary and y_geostationary coords, respectively.
As noted in #113, field names x and y are a bit ambiguous. Are they OSGB? Are they pixel indices? Are they geostationary coords?!
So I'd propose a simple PR that adds the following fields to SatelliteML:
x_geostationary
y_geostationary
x_osgb
y_osgb
If we remove x and y then a tonne of stuff will break. So, for now, I'd propose we keep x and y (but correct the docstrings), but state that x and y are deprecated and will be removed in a future version.
Does that sound OK, @peterdudfield?
e.g. check that, after normalisation, the mean of the entire batch is roughly 0; and the std is roughly 1.
bad normalisation can be harmful!
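A sketch of such a check, assuming the batch data can be viewed as a NumPy array. The function names and the tolerance of 0.1 are arbitrary illustrative choices, not values from this repo.

```python
import numpy as np


def normalisation_stats(data):
    """Return the mean and std of the whole array, ignoring NaNs."""
    data = np.asarray(data, dtype=np.float64)
    return float(np.nanmean(data)), float(np.nanstd(data))


def looks_normalised(data, tol=0.1):
    """True if the mean is roughly 0 and the std is roughly 1."""
    mean, std = normalisation_stats(data)
    return abs(mean) <= tol and abs(std - 1.0) <= tol


rng = np.random.default_rng(0)
good = rng.standard_normal(10_000)  # already roughly zero mean, unit std
bad = 5.0 + 3.0 * good              # mean about 5, std about 3
```

In a test this could be used as `assert looks_normalised(array)` on each normalised data source.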
I'll add to this list as I find new things that need to be added:
Move data-source-specific code into each data source
This code could be moved into satellite or nwp
Keep things tidy
Position encoding creation is quite slow, taking about 4 seconds for a batch of 32. To speed up our training we need to get this down.
A recent bit of profiling shows that the position encoding and all the concatenations in it take up about 4 seconds of the creation of each batch.
Profile stats for: get_train_batch
199664 function calls (194382 primitive calls) in 11.075 seconds
Ordered by: cumulative time
ncalls tottime percall cumtime percall filename:lineno(function)
23/1 0.000 0.000 11.075 11.075 {built-in method builtins.next}
1 0.000 0.000 11.075 11.075 supporters.py:658(prefetch_iterator)
2 0.000 0.000 11.075 5.538 supporters.py:582(__next__)
3 0.000 0.000 11.075 3.692 apply_func.py:69(apply_to_collection)
2 0.000 0.000 11.075 5.537 supporters.py:591(request_next_batch)
2 0.000 0.000 11.075 5.537 supporters.py:603(next_fn)
2 0.000 0.000 11.075 5.537 dataloader.py:517(__next__)
2 0.000 0.000 11.074 5.537 dataloader.py:559(_next_data)
2 0.454 0.227 11.074 5.537 fetch.py:47(fetch)
2 0.001 0.000 10.619 5.310 datasets.py:250(__getitem__)
2 0.001 0.001 10.239 5.119 datasets.py:134(__getitem__)
2 0.061 0.031 8.311 4.156 position_encoding.py:62(generate_position_encodings_for_batch)
12 1.197 0.100 8.244 0.687 position_encoding.py:193(encode_absolute_position)
416 0.005 0.000 4.530 0.011 einops.py:327(reduce)
416 0.006 0.000 4.521 0.011 einops.py:202(apply)
32 0.000 0.000 4.501 0.141 einops.py:455(repeat)
32 0.000 0.000 4.494 0.140 _backends.py:98(add_axes)
32 0.000 0.000 4.493 0.140 _backends.py:336(tile)
32 4.493 0.140 4.493 0.140 {method 'repeat' of 'torch._C._TensorBase' objects}
826 2.113 0.003 2.113 0.003 {built-in method cat}
2 0.001 0.001 1.623 0.812 batch.py:144(load_netcdf)
10 0.000 0.000 1.437 0.144 position_encoding.py:251(combine_space_and_time_features)
126 1.390 0.011 1.390 0.011 {method 'acquire' of '_thread.lock' objects}
2 0.000 0.000 1.380 0.690 _base.py:635(__exit__)
2 0.000 0.000 1.380 0.690 thread.py:210(shutdown)
14 0.000 0.000 1.380 0.099 threading.py:1021(join)
14 0.000 0.000 1.379 0.099 threading.py:1059(_wait_for_tstate_lock)
12 0.020 0.002 1.207 0.101 position_encoding.py:279(normalize_geospatial_coordinates)
404 0.192 0.000 0.956 0.002 position_encoding.py:399(fourier_encode)
6 0.000 0.000 0.241 0.040 datasets.py:340(add_encodings)
14 0.000 0.000 0.207 0.015 xr_utils.py:84(validate)
2 0.000 0.000 0.187 0.093 batch.py:157(normalize)
404 0.187 0.000 0.187 0.000 {method 'sin' of 'torch._C._TensorBase' objects}
396 0.182 0.000 0.182 0.000 {built-in method stack}
4 0.000 0.000 0.162 0.041 satellite_model.py:18(model_validation)
It seems like reducing the number of concatenations is the key, as generating the Fourier features is quite fast. But concatenation takes 2 of the 4 seconds it takes to create the encodings.
The easiest way would be to run the spatial and feature encodings over the whole batch at once, not one example at a time as it currently does.
An issue with that is finding suitable replacements. For the spatial encoding, for example, torch.meshgrid only takes 1D tensors, and so can't be done over the whole batch at once, I think?
For the datetime features, doing it over the whole batch should also speed things up a bit; we just need to extract the hour, minute, and day of year of a set of timestamps all at once.
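As an illustration of the batched datetime idea, here is a NumPy sketch that turns a whole (batch, time) array of timestamps into sin/cos features in one pass, with no per-example loop or concatenation. The function name, the feature order and the 365-day year are assumptions made for this example; it is not the current position_encoding.py code.

```python
import numpy as np


def datetime_features(timestamps):
    """Map datetime64 timestamps of any shape to features of shape (..., 4).

    Feature order: sin/cos of time of day, then sin/cos of day of year.
    """
    ts = np.asarray(timestamps, dtype="datetime64[s]")
    day_start = ts.astype("datetime64[D]")
    year_start = ts.astype("datetime64[Y]").astype("datetime64[D]")
    seconds_into_day = (ts - day_start).astype("timedelta64[s]").astype(np.int64)
    day_of_year = (day_start - year_start).astype(np.int64)  # 0 for 1 January
    tod = 2 * np.pi * seconds_into_day / 86400.0
    doy = 2 * np.pi * day_of_year / 365.0  # approximation: ignores leap years
    features = np.stack([np.sin(tod), np.cos(tod), np.sin(doy), np.cos(doy)], axis=-1)
    return features.astype(np.float32)


# Batch of 2 examples, 3 timesteps each, 5 minutes apart.
t0 = np.array(["2021-06-01T12:00", "2021-12-21T00:00"], dtype="datetime64[s]")
offsets = np.arange(3) * np.timedelta64(5, "m")
timestamps = t0[:, None] + offsets[None, :]
features = datetime_features(timestamps)
```

The same broadcasting pattern (adding a `[:, None]` axis instead of calling meshgrid per example) might also apply to the spatial coordinates, though that is not shown here.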
Describe the bug
In satellite and nwp data sources, normalise() does the right thing: it ensures that, on average, the means will be zero and the std will be 1.
In gsp and pv data sources, normalise() rescales the values to be in the range [0, 1], which isn't exactly the same thing!
Expected behavior
For any data source that's used as an input to the model, we probably want means to be zero and std to be 1.
For the target, we may sometimes want to re-scale to [0, 1] (if, for example, we're using a sigmoid output layer). But we should probably ignore that for now.
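The difference between the two behaviours can be seen in a small NumPy sketch. The function names and the PV values are made up for illustration; in practice the mean and std would be computed once over the whole training set rather than per array.

```python
import numpy as np


def standardise(values, mean, std):
    """Shift and scale so the result has roughly zero mean and unit std."""
    return (np.asarray(values, dtype=np.float32) - mean) / std


def rescale_0_1(values, max_value):
    """Rescale to [0, 1]; the mean is generally NOT zero afterwards."""
    return np.asarray(values, dtype=np.float32) / max_value


pv = np.array([0.0, 0.25, 0.5, 1.0], dtype=np.float32)
z = standardise(pv, mean=pv.mean(), std=pv.std())
scaled = rescale_0_1(pv, max_value=pv.max())
```

Here z has mean 0 and std 1, whereas scaled stays in [0, 1] with a mean of about 0.44.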
Describe the bug
Satellite and NWP data are not in the expected shape
To Reproduce
Create BatchML.fake() and look at the shape of the satellite data
Expected behavior
Expect the shape to be B,C,T,H,W
Potential solution
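One way this could be handled is to reorder the axes explicitly, taking B, C, T, H, W to mean batch, channel, time, height and width. The sketch below assumes the fake data currently comes out as (B, T, H, W, C); that source layout is a guess, and to_bcthw is a hypothetical helper.

```python
import numpy as np


def to_bcthw(data, source_order="BTHWC"):
    """Reorder the axes of a 5-D array into (B, C, T, H, W)."""
    target_order = "BCTHW"
    if sorted(source_order) != sorted(target_order):
        raise ValueError(f"source_order must be a permutation of {target_order}")
    axes = [source_order.index(dim) for dim in target_order]
    return np.transpose(data, axes)


# Assumed layout: 2 examples, 7 timesteps, 24 x 24 pixels, 11 channels.
fake_satellite = np.zeros((2, 7, 24, 24, 11), dtype=np.float32)
reordered = to_bcthw(fake_satellite)
```

Naming the source order as a string keeps the permutation readable and makes a wrong assumption about the layout easy to correct in one place.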
Should we implement "multi-scale" position encoding? e.g., for temporal encoding, we could have a total of 8 channels:
A full cycle of a sine and cosine for:
The thinking being that, if we only have day-of-year and time-of-day, then it might be hard for the model to tell if two timesteps are consecutive (because 5 minutes would be represented as a tiny change in a sine wave which takes a full 24 hours to complete)
(Also, if we get into electricity demand forecasting, we'd want "human" temporal encoding like day-of-week)
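A sketch of what an 8-channel multi-scale temporal encoding could look like. The four periods used here (hour, day, 30-day month, 365.25-day year) are assumptions chosen for illustration, since the list of periods is not given above; phases are measured from the Unix epoch, so they are not aligned to calendar months or years.

```python
import math
from datetime import datetime, timezone

# Hypothetical choice of periods, in seconds.
PERIODS_SECONDS = {
    "hour": 3600.0,
    "day": 86400.0,
    "month": 30 * 86400.0,
    "year": 365.25 * 86400.0,
}


def multi_scale_time_encoding(dt):
    """Return [sin, cos] for each period: 8 channels in total."""
    t = dt.timestamp()
    features = []
    for period in PERIODS_SECONDS.values():
        phase = 2 * math.pi * (t % period) / period
        features.extend([math.sin(phase), math.cos(phase)])
    return features


a = multi_scale_time_encoding(datetime(2021, 6, 1, 12, 0, tzinfo=timezone.utc))
b = multi_scale_time_encoding(datetime(2021, 6, 1, 12, 5, tzinfo=timezone.utc))
hour_scale_change = abs(b[0] - a[0])  # sine of the hourly cycle
day_scale_change = abs(b[2] - a[2])   # sine of the daily cycle
```

For two timestamps 5 minutes apart, the hourly sine changes by about 0.5 while the daily sine changes by about 0.02, which is the effect the multi-scale idea is after.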
Currently a different branch of the workflow is used.
https://github.com/openclimatefix/nowcasting_dataloader/blob/main/.github/workflows/workflows.yaml#L6
Would be good to move this back to 'main'
Describe the bug
An array of strings for channel names cannot be converted to torch
return torch.tensor(self._obj.data, dtype=torch.float32)
TypeError: can't convert np.ndarray of type numpy.object_. The only supported types are: float64, float32, float16, complex64, complex128, int64, int32, int16, int8, uint8, and bool.
To Reproduce
Run the most recent dataset
'/mnt/storage_b/data/ocf/solar_pv_nowcasting/nowcasting_dataset_pipeline/prepared_ML_training_data/v10'
and it errors when trying to turn an array of strings into a torch tensor
Expected behavior
The dataset can be turned into torch tensors
Possible solutions
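One option, sketched here with NumPy only, is to skip variables whose dtype cannot become a tensor before calling torch.tensor. The helper names and the example variable names are hypothetical; this is not the existing to_tensor implementation.

```python
import numpy as np


def is_tensor_convertible(array):
    """True for bool, integer, unsigned, float and complex dtypes."""
    return np.asarray(array).dtype.kind in "biufc"


def numeric_only(variables):
    """Keep only the variables that can be converted to tensors."""
    return {
        name: np.asarray(values)
        for name, values in variables.items()
        if is_tensor_convertible(values)
    }


variables = {
    "data": np.zeros((2, 3), dtype=np.float32),
    "channels": np.array(["t", "dswrf"], dtype=object),  # made-up channel names
}
convertible = numeric_only(variables)
```

Another option would be to replace the channel names with integer indices, so the information is kept in a form torch can hold.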
When getting batch data from GCP/AWS, we want the option of adding random flips.
This will help reduce overfitting
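A minimal sketch of a random flip, assuming imagery with trailing (height, width) axes and 1-D coordinate arrays. The function random_flip is hypothetical. Flipping the coordinates together with the image is a design choice made here so that any position encoding derived from them stays consistent with the pixels.

```python
import numpy as np


def random_flip(image, x_coords, y_coords, rng, p=0.5):
    """Randomly flip an image of shape (..., H, W) and its coordinates."""
    if rng.random() < p:  # horizontal flip along the width axis
        image = np.flip(image, axis=-1)
        x_coords = np.flip(x_coords, axis=-1)
    if rng.random() < p:  # vertical flip along the height axis
        image = np.flip(image, axis=-2)
        y_coords = np.flip(y_coords, axis=-1)
    return image, x_coords, y_coords


rng = np.random.default_rng(1234)
image = np.arange(6).reshape(2, 3)
x = np.array([0, 1, 2])
y = np.array([0, 1])
flipped, fx, fy = random_flip(image, x, y, rng, p=1.0)  # p=1.0 forces both flips
```

Seeding the generator keeps the augmentation reproducible between runs.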
Currently time features are just the value of a single sin/cos. The spatial features use multiple sin/cos in their Fourier encoding, so adding that to the datetime features could make it easier for the model to figure things out, especially for timesteps that are close to each other.
We should experiment with telling our ML models which year it is.
The country's PV capacity increases year-on-year.
A few possible encodings:
Let's experiment with giving our models:
It will be taken out of nowcasting_dataset
see https://github.com/openclimatefix/nowcasting_dataset/compare/issue/231-normalization-gsp?expand=1
For GSP:
and openclimatefix/nowcasting_dataset#427
Decided not to normalise the batch data, but to do it in the data loader
Only need the batch test data, all the other folders in test data can go
Run the test generation script in the test action.
This would fix stale test data from causing problems and making tests pass that shouldn't.
Add a step in the tests that calls the test generation script, either in the reusable workflow or here
With attention, imagery doesn't have to be in a regular grid, so we can gradually reduce resolution as pixels get further from the center. Although, in practice, probably just have 'native res' in the center, then reduce res in a border around the center
Now that we are using the original projection of the satellite data, the mean and std are probably close, but still wrong for the new format. So we need to rerun the calculation for all the channels.
We need it to have the correct normalization.
Move the position encoding step to BatchML, so that it works more easily across all datasets, etc.
This would also include moving the code for zeroing out the missing PV and GSP systems; to work correctly, the encodings need to be zeroed out when the PV and GSP systems are.
It's come up a few times that moving position encoding to BatchML would be helpful.
As mentioned in openclimatefix/satflow#101 it would be helpful to have a way of having consistent position encodings for the PV systems/satellite imagery so that the model can associate the PV systems output in time and space with the satellite imagery.
This unified position encoding could also be useful for other modalities, so that the model only needs one set of position encodings but can use it for all the input modalities.
It would help with the joint model and unifying the inputs/outputs. This position encoding can also be used with the openclimatefix/perceiver-pytorch#20 to ensure the queries use the same positional encodings when getting the output
Could be just a way of encoding all of this with Fourier Features. Or could be some other way of encoding that works across the different modalities
Describe the bug
tests/test_position_encoding.py::test_batch_encoding fails at line 53 because torch.max(position_encodings[position_encoding_key]) is too large. It's expected to be less than or equal to 1. But it's actually 1.3904!
@jacobbieker if you can give me a hint, I'm happy to try to debug. Unless you're happy to do it?
To Reproduce
py.test tests/test_position_encoding.py::test_batch_encoding
Move over data loader object from dataset
copy over
Would it be worth loading 2 batches at once and taking a random (seeded) sample of examples from each one?
This would mean that the ML models don't see the same batch each epoch. This might help stop overfitting.
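A sketch of the idea using plain Python lists of examples. The function mix_two_batches is hypothetical, and a real implementation would index into the batch tensors rather than lists.

```python
import random


def mix_two_batches(batch_a, batch_b, batch_size, seed):
    """Draw a seeded random sample of examples from the union of two batches."""
    rng = random.Random(seed)
    pool = list(batch_a) + list(batch_b)
    return rng.sample(pool, batch_size)


batch_a = [f"a{i}" for i in range(4)]
batch_b = [f"b{i}" for i in range(4)]
mixed = mix_two_batches(batch_a, batch_b, batch_size=4, seed=0)
mixed_again = mix_two_batches(batch_a, batch_b, batch_size=4, seed=0)
```

Deriving the seed from the epoch number (for example a base seed plus the epoch) would give a different but reproducible mix each epoch.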
Microsoft have released a new PyTorch library: https://github.com/microsoft/torchgeo
The readme starts:
TorchGeo is a PyTorch domain library, similar to torchvision, that provides datasets, transforms, samplers, and pre-trained models specific to geospatial data.
The goal of this library is to make it simple:
- for machine learning experts to use geospatial data in their workflows, and
- for remote sensing experts to use their data in machine learning workflows.
The docs are here: https://torchgeo.readthedocs.io/en/latest/
Of particular relevance for us might be their samplers and transforms.
(Although, after spending a few minutes skim-reading the docs, I can't see anything that's obviously relevant to our work, tbh.... but probably worth having a more detailed look some time in early 2022?)
Maybe in the README (or somewhere else) we could document what 'B', 'C', 'T', 'H', and 'W' mean? (I assume it's batch, channel, time, height, width?!)
Should this be moved into this repo?
https://github.com/openclimatefix/predict_pv_yield/blob/master/predict_pv_yield/data/dataloader.py