openclimatefix / nowcasting_dataloader

PyTorch Dataloader for working with multi-modal data for nowcasting applications

Home Page: https://nowcasting-dataloader.readthedocs.io/

License: MIT License

Python 10.96% Jupyter Notebook 89.04%

nowcasting

nowcasting_dataloader's Issues

Detailed Description

It would be great to check that all data sources are in BatchML.

Context

lucky find with #10

Possible Implementation

Add asserts for all data sources in
https://github.com/openclimatefix/nowcasting_dataloader/blob/main/tests/test_netcdf_dataset.py#L69

Detailed Description

Normalize the pv and gsp, x and y coordinates.
Need to find the max and min (x and y) values of these from all batches.

Context

Useful to normalize, so that these values can be included in ML models
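
A minimal sketch of the min/max approach described above, assuming the coordinates are available as NumPy arrays. The function name `normalize_coords` and the sample values are hypothetical; the real minimum and maximum would come from scanning every prepared batch.

```python
import numpy as np


def normalize_coords(coords: np.ndarray, coord_min: float, coord_max: float) -> np.ndarray:
    """Rescale coordinates linearly so coord_min maps to 0 and coord_max maps to 1."""
    return (coords - coord_min) / (coord_max - coord_min)


# Hypothetical x coordinates (metres) from two batches.
batches_x = [np.array([100_000.0, 250_000.0]), np.array([50_000.0, 600_000.0])]

# Global min and max across all batches, so every batch shares one scale.
x_min = min(float(np.nanmin(b)) for b in batches_x)
x_max = max(float(np.nanmax(b)) for b in batches_x)

normalized = [normalize_coords(b, x_min, x_max) for b in batches_x]
```

Using `nanmin` and `nanmax` means padded (NaN) PV or GSP entries do not distort the range. The same function would apply to the y coordinates with their own min and max.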

How to encode the map of installed PV capacity?

Detailed Description

Once we have the map of installed PV capacity in nowcasting_dataset (see issue openclimatefix/nowcasting_dataset#184), we need to figure out how best to encode this map.

Context

The map isn't a map of individual PV sites. Instead it's the PV capacity per LLSOA. Jamie says:

LSOAs vary in size. The biggest ones are in Scotland. The largest is ~1200 km2 !! The median size is 0.418 km2 though, so in general they're pretty small.

Possible Implementation

The simplest encoding is probably as a bitmap, at the same spatial resolution as the satellite imagery. Each row (pixel) of data would encode:

the amount of installed PV
An embedding of the LLSOA ID? (Maybe that's not required?)
The geographical position
The datetime encoding at t0? We can probably assume the installed PV capacity is static per ML example. So each example only needs a single map of installed PV capacity. But we probably do need to provide a datetime encoding, so the attention mechanism can see that the map is "close" to the other data modalities.

Detailed Description

Make sure x and y coordinates use standardised names

Could use

x_osgb
y_osgb

but once these are normalised, are these names still right?

Topological ML data

Describe the bug
Topological data is not converted from xr.Dataset properly

To Reproduce
Run tests/test_netcdf_dataset.py and see that BatchML has no topological data in it

Additional context
These files are being moved from nowcasting_dataset, so I didn't want to make this change there.

Currently:

```
    if TOPOGRAPHIC_DATA in xr_dataset.keys():
        return TopographicML(
            batch_size=xr_dataset[TOPOGRAPHIC_DATA].shape[0],
            topo_data=xr_dataset[TOPOGRAPHIC_DATA],
            topo_x_coords=xr_dataset[TOPOGRAPHIC_DATA].topo_x,
            topo_y_coords=xr_dataset[TOPOGRAPHIC_DATA].topo_y,
        )
    else:
        return None
```

Potential solution

```
    return TopographicML(
        batch_size=xr_dataset.data.shape[0],
        topo_data=xr_dataset.data,
        topo_x_coords=xr_dataset.x,
        topo_y_coords=xr_dataset.y,
    )
```

dataloader - float32

Make sure the random data is float32, not float64. This makes it easier to work with PyTorch.

To Reproduce
x: BatchML = BatchML.fake()
print(type(x.satellite.data[0, 0, 0, 0, 0]))

Expected behavior
The data should be float32.

Additional context
float32 works better with PyTorch.

Detailed Description

Normalization of data should be done in this repo

This is linked with - openclimatefix/nowcasting_dataset#231

Context

The idea is to have no normalization in the dataset, with it all done on the fly

Possible Implementation

Could add this to the pydantic models as a method, which is then called

use `float32` for `nwp`, `sat_data`, and `hrv_sat_data`?

Detailed Description

Looking at the output from SatFlowDataset, it looks like nwp, sat_data and hrv_sat_data are all float64? If so, we can probably get an easy speedup by using float32 (as I'm sure you know, GPUs are much faster at float32 than float64)

Just for reference, here's the mean, std, dtype and shape of all the outputs of SatFlowDataset:

*** INPUTS ***
pv_yield
MEAN  = tensor(0.3119)
STD   = tensor(0.2217)
dtype = torch.float32
shape = torch.Size([32, 31, 7])

pv_system_id
MEAN  = tensor(nan)
STD   = tensor(nan)
dtype = torch.float32
shape = torch.Size([32, 128])

nwp
MEAN  = tensor(-0.0024, dtype=torch.float64)
STD   = tensor(0.9119, dtype=torch.float64)
dtype = torch.float64
shape = torch.Size([32, 10, 4, 64, 64])

topo_data
MEAN  = tensor(-0.0682)
STD   = tensor(0.9836)
dtype = torch.float32
shape = torch.Size([32, 1, 1, 64, 64])

gsp_id
MEAN  = tensor(nan)
STD   = tensor(nan)
dtype = torch.float32
shape = torch.Size([32, 32])

sat_data
MEAN  = tensor(0.0705, dtype=torch.float64)
STD   = tensor(0.8638, dtype=torch.float64)
dtype = torch.float64
shape = torch.Size([32, 11, 7, 24, 24])

hrv_sat_data
MEAN  = tensor(0.0233, dtype=torch.float64)
STD   = tensor(0.1539, dtype=torch.float64)
dtype = torch.float64
shape = torch.Size([32, 1, 7, 64, 64])

*** TARGETS ***
gsp_yield
MEAN  = tensor(nan)
STD   = tensor(nan)
dtype = torch.float32
shape = torch.Size([32, 4, 32])

gsp_id
MEAN  = tensor(nan)
STD   = tensor(nan)
dtype = torch.float32
shape = torch.Size([32, 32])
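
A minimal sketch of the cast, assuming each example is a dict of arrays. It uses NumPy so it runs without a GPU stack; for torch tensors the equivalent cast is `tensor.float()`. The helper name `cast_float64_to_float32` and the small shapes are hypothetical.

```python
import numpy as np


def cast_float64_to_float32(example: dict) -> dict:
    """Return a copy of the example with every float64 array cast to float32."""
    return {
        key: value.astype(np.float32) if value.dtype == np.float64 else value
        for key, value in example.items()
    }


# Hypothetical example with small shapes; the key names follow the printout above.
example = {
    "nwp": np.zeros((2, 10, 4, 8, 8), dtype=np.float64),
    "pv_yield": np.zeros((2, 31, 7), dtype=np.float32),
}
converted = cast_float64_to_float32(example)
```

Casting once at load time, rather than inside the model, also halves the memory needed to hold these arrays.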

Run CI every Monday

Detailed Description

Schedule the CI test runner to run every Monday lunchtime

Context

Useful to regularly check whether things are broken

Possible Implementation

on:
  push:
  schedule:
    - cron: "0 12 * * 1"

Add GSP fake data to Batch

Detailed Description

Add GSP data to fake BatchML
https://github.com/openclimatefix/nowcasting_dataloader/blob/main/nowcasting_dataloader/batch.py#L93

Context

This is used for testing in some ML repos

Possible Implementation

Use a similar method to the other data sources: gsp.fake()
https://github.com/openclimatefix/nowcasting_dataloader/blob/main/nowcasting_dataloader/data_sources/gsp/gsp_model.py#L88

Subselect doesn't work without the metadata in Batch

Describe the bug
We got rid of the metadata files in Batch, but subselect relies on them. So it fails and can't be used for now.

To Reproduce
Steps to reproduce the behavior:
Run tests

Expected behavior
Subselect to select the data

Additional context
We could load in the t0 datetimes needed from the CSV files created, or add the metadata to the batch files, i.e. in the Satellite file have an attribute giving the t0 time.

Use Enum for modality names?

Instead of using strings like 'NWP' to identify each modality, should we use an Enum? (Just to make sure that folks don't accidentally type the wrong modality name, which might break things in weird ways?)
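
A minimal sketch of what such an Enum could look like. The member list and the `load_modality` helper are hypothetical; the real set of modality names may differ.

```python
from enum import Enum


class Modality(str, Enum):
    """Hypothetical set of modality names; the real list may differ."""

    NWP = "nwp"
    SATELLITE = "satellite"
    PV = "pv"
    GSP = "gsp"
    TOPOGRAPHIC = "topographic"


def load_modality(modality: Modality) -> str:
    """Placeholder loader that only reports which modality was requested."""
    # Modality(...) raises ValueError for an unknown name, so typos fail early.
    return f"loading {Modality(modality).value}"
```

Subclassing `str` keeps the members usable wherever a plain string key is expected, which might ease a gradual migration away from raw strings.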

Remove Fake BatchML

Detailed Description

Currently both Batch and BatchML can be made with fake data.
It is probably best to have just one method and then transform the data.

Context

Good to keep code tidy

Possible Implementation

code here could load

batch = Batch.fake()
batch_ml = BatchML.from_batch(batch=batch)

Should be able to get rid of this file

Document how to train an ML model using the pre-prepared batches.

Could be implemented as an iPython notebook (with lots of text to explain what's going on)?

GSP Position Encodings are not sliced correctly

Describe the bug
For a history of 30 min and a forecast of 120 min, the position encoding has 6 timesteps, which is correct: 1 past timestep, 1 current timestep, and 4 future ones. The position encoding for the queries should cover only the future timesteps, so it should contain only 4 timesteps, but instead it contains 6.

To Reproduce
Steps to reproduce the behavior:
Run SatFlowDataset and look at the output tensors

Expected behavior
The position encodings of the GSP query to have 4 timesteps.
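
A minimal sketch of the expected slicing, assuming half-hourly GSP data and a hypothetical `[batch, features, time, gsp]` layout for the encoding. The real tensor layout in the dataloader may differ.

```python
import numpy as np

history_steps = 1   # 30 minutes of history at half-hourly resolution
forecast_steps = 4  # 120 minutes of forecast at half-hourly resolution
total_steps = history_steps + 1 + forecast_steps  # past + t0 + future = 6

# Hypothetical encoding with shape [batch, features, time, gsp].
position_encoding = np.random.default_rng(0).normal(size=(32, 8, total_steps, 32))

# Keep only the timesteps after t0 for the query.
query_encoding = position_encoding[:, :, -forecast_steps:, :]
```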


Check our models are getting this configuration of data inputs

General

The target should be future GSP power and, optionally, future satellite imagery

The model should receive these as inputs

An embedding of the GSP ID (also see openclimatefix/nowcasting_dataset#451)
Historical data from all individual PV systems within the region of interest (which almost certainly will be available at inference time and is probably important so the model can see how much sunlight each cloud is letting through!), probably including:
- An embedding of each PV system's ID (so the model can learn which PV systems to trust)?
- The spatial and temporal location of each PV reading

The model should NOT receive these as inputs 🙂

Triple-check that we're not accidentally giving the model future GSP power as an input 🙂 (mistakes like this are all too easy to make!)
Historical GSP power data (because that won't be available at inference time!)
Future satellite images 🙂
Future data from individual PV systems

Rename dims to data_vars

Detailed Description

When changing xr.Dataset to torch (xr_dataset.torch.to_tensor), the variable is called dims. This variable holds the names of both dims and data_vars

Possible Implementation

Perhaps a better name is data_vars_and_dims and one of these would be called data_var_or_dim.
This would make it clear that it was data vars and dims rather than just dims

Context

nice to make the code clear

Load .yaml from nowcasting_dataset

Detailed Description

Load gcp.yaml from nowcasting_dataset

Context

Just to reduce duplicated code

Update NWP fake data

Detailed Description

Update NWP fake data to be in hour chunks.

Context

Useful for testing ML models

Nowcasting-dataloader tests fail because of changes to Nowcasting-dataset

Describe the bug
With the newest version of nowcasting-dataset, the dataloader fails because of changes to the Configuration object

To Reproduce


    def test_fake_dataset():
>       train = torch.utils.data.DataLoader(FakeDataset(configuration=Configuration()), batch_size=None)

tests/test_batch.py:15: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <nowcasting_dataloader.fake.FakeDataset object at 0x7ff20c6af3a0>
configuration = Configuration(general=General(name='example', description='example configuration', cloud='gcp'), input_data=InputData(..._data/v7/'), process=Process(seed=1234, batch_size=32, upload_every_n_batches=16, local_temp_path='~/temp/'), git=None)
length = 10

    def __init__(self, configuration: Configuration, length: int = 10):
        """
        Init
    
        Args:
            configuration: configuration object
            length: length of dataset
        """
>       self.number_nwp_channels = len(configuration.process.nwp_channels)
E       AttributeError: 'Process' object has no attribute 'nwp_channels'

nowcasting_dataloader/fake.py:19: AttributeError
________________ test_netcdf_dataset_local_using_configuration _________________

configuration = Configuration(general=General(name='gcp', description='Configuration for Google Cloud', cloud='gcp'), input_data=Input..._data/v7/'), process=Process(seed=1234, batch_size=32, upload_every_n_batches=16, local_temp_path='~/temp/'), git=None)

    def test_netcdf_dataset_local_using_configuration(configuration: Configuration):
        DATA_PATH = os.path.join(
            os.path.dirname(nowcasting_dataloader.__file__), "../tests", "data", "batch"
        )
        TEMP_PATH = os.path.join(
            os.path.dirname(nowcasting_dataloader.__file__), "../tests", "data", "batch", "temp"
        )
    
>       train_dataset = NetCDFDataset(
            1,
            DATA_PATH,
            TEMP_PATH,
            cloud="local",
            history_minutes=10,
            forecast_minutes=10,
            required_keys=[NWP_DATA, NWP_TARGET_TIME, SATELLITE_DATA, SATELLITE_DATETIME_INDEX],
            configuration=configuration,
        )

tests/test_netcdf_dataset.py:40: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <nowcasting_dataloader.datasets.NetCDFDataset object at 0x7ff20c6b8be0>
n_batches = 1
src_path = '/home/runner/work/nowcasting_dataloader/nowcasting_dataloader/nowcasting_dataloader/../tests/data/batch'
tmp_path = '/home/runner/work/nowcasting_dataloader/nowcasting_dataloader/nowcasting_dataloader/../tests/data/batch/temp'
configuration = Configuration(general=General(name='gcp', description='Configuration for Google Cloud', cloud='gcp'), input_data=Input..._data/v7/'), process=Process(seed=1234, batch_size=32, upload_every_n_batches=16, local_temp_path='~/temp/'), git=None)
cloud = 'local'
required_keys = ['nwp', 'nwp_target_time', 'sat_data', 'sat_datetime_index']
history_minutes = 10, forecast_minutes = 10, normalize = False

    def __init__(
        self,
        n_batches: int,
        src_path: str,
        tmp_path: str,
        configuration: Configuration,
        cloud: str = "gcp",
        required_keys: Union[Tuple[str], List[str]] = None,
        history_minutes: Optional[int] = None,
        forecast_minutes: Optional[int] = None,
        normalize: bool = False,
    ):
        """
        Netcdf Dataset
    
        Args:
            n_batches: Number of batches available on disk.
            src_path: The full path (including 'gs://') to the data on
                Google Cloud storage.
            tmp_path: The full path to the local temporary directory
                (on a local filesystem).
            cloud:
            required_keys: Tuple or list of keys required in the example for it to be considered usable
            history_minutes: How many past minutes of data to use, if subsetting the batch
            forecast_minutes: How many future minutes of data to use, if reducing the amount of forecast time
            configuration: configuration object
            cloud: which cloud is used, can be "gcp", "aws" or "local".
            normalize: normalize the batch data
        """
        self.n_batches = n_batches
        self.src_path = src_path
        self.tmp_path = tmp_path
        self.cloud = cloud
        self.history_minutes = history_minutes
        self.forecast_minutes = forecast_minutes
        self.configuration = configuration
        self.normalize = normalize
    
        logger.info(f"Setting up NetCDFDataset for {src_path}")
    
        if self.forecast_minutes is None:
            self.forecast_minutes = configuration.process.forecast_minutes
        if self.history_minutes is None:
            self.history_minutes = configuration.process.history_minutes
    
        # see if we need to select the subset of data. If turned on -
        # only history_minutes + current time + forecast_minutes data is used.
        self.select_subset_data = False
>       if self.forecast_minutes != configuration.process.forecast_minutes:
E       AttributeError: 'Process' object has no attribute 'forecast_minutes'

nowcasting_dataloader/datasets.py:129: AttributeError


`SatelliteML`: Change `x` and `y` to `x_geostationary` and `y_geostationary`, and include `x_osgb` and `y_osgb`

At the moment, the SatelliteML class has x and y fields, which store the x_geostationary and y_geostationary coords, respectively.

As noted in #113, field names x and y are a bit ambiguous. Are they OSGB? Are they pixel indices? Are they geostationary coords?! 🙂

So I'd propose a simple PR that:

Adds the following fields to SatelliteML:
- x_geostationary
- y_geostationary
- x_osgb
- y_osgb
I assume that, if we remove x and y then a tonne of stuff will break. So, for now, I'd propose we keep x and y (but correct the docstrings), but state that x and y are deprecated and will be removed in a future version.

Does that sound OK, @peterdudfield?

Related issues:

Automatically check that data is normalised roughly correctly for each batch

Detailed Description

e.g. check that, after normalisation, the mean of the entire batch is roughly 0; and the std is roughly 1.

Context

bad normalisation can be harmful!
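
A minimal sketch of such a check, assuming the batch data is available as a NumPy array. The function name `is_roughly_normalised` and the tolerance values are hypothetical and would need tuning per data source.

```python
import numpy as np


def is_roughly_normalised(data: np.ndarray, mean_tol: float = 0.5, std_tol: float = 0.5) -> bool:
    """Return True if the mean is near 0 and the std is near 1, ignoring NaNs."""
    mean = float(np.nanmean(data))
    std = float(np.nanstd(data))
    return abs(mean) <= mean_tol and abs(std - 1.0) <= std_tol


rng = np.random.default_rng(0)
good = rng.normal(loc=0.0, scale=1.0, size=(32, 100))    # already normalised
bad = rng.normal(loc=50.0, scale=10.0, size=(32, 100))   # raw, unnormalised values
```

Here `is_roughly_normalised(good)` passes and `is_roughly_normalised(bad)` fails. A check like this could run as an assert or a logged warning each time a batch is loaded.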

Changes required to support Power Perceiver

I'll add to this list as I find new things that need to be added:

Tidy batch code

Detailed Description

Move data-source-specific code into each data source.
This code could be moved into satellite or nwp

Context

keep things tidy

Speed up Position Encodings

Detailed Description

Position encoding creation is quite slow, taking about 4 seconds for a batch of 32. To speed up our training we need to get this down.

Context

A recent bit of profiling shows that the position encoding and all the concatenations in it take up about 4 seconds of the creation of each batch.

Profile stats for: get_train_batch
         199664 function calls (194382 primitive calls) in 11.075 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     23/1    0.000    0.000   11.075   11.075 {built-in method builtins.next}
        1    0.000    0.000   11.075   11.075 supporters.py:658(prefetch_iterator)
        2    0.000    0.000   11.075    5.538 supporters.py:582(__next__)
        3    0.000    0.000   11.075    3.692 apply_func.py:69(apply_to_collection)
        2    0.000    0.000   11.075    5.537 supporters.py:591(request_next_batch)
        2    0.000    0.000   11.075    5.537 supporters.py:603(next_fn)
        2    0.000    0.000   11.075    5.537 dataloader.py:517(__next__)
        2    0.000    0.000   11.074    5.537 dataloader.py:559(_next_data)
        2    0.454    0.227   11.074    5.537 fetch.py:47(fetch)
        2    0.001    0.000   10.619    5.310 datasets.py:250(__getitem__)
        2    0.001    0.001   10.239    5.119 datasets.py:134(__getitem__)
        2    0.061    0.031    8.311    4.156 position_encoding.py:62(generate_position_encodings_for_batch)
       12    1.197    0.100    8.244    0.687 position_encoding.py:193(encode_absolute_position)
      416    0.005    0.000    4.530    0.011 einops.py:327(reduce)
      416    0.006    0.000    4.521    0.011 einops.py:202(apply)
       32    0.000    0.000    4.501    0.141 einops.py:455(repeat)
       32    0.000    0.000    4.494    0.140 _backends.py:98(add_axes)
       32    0.000    0.000    4.493    0.140 _backends.py:336(tile)
       32    4.493    0.140    4.493    0.140 {method 'repeat' of 'torch._C._TensorBase' objects}
      826    2.113    0.003    2.113    0.003 {built-in method cat}
        2    0.001    0.001    1.623    0.812 batch.py:144(load_netcdf)
       10    0.000    0.000    1.437    0.144 position_encoding.py:251(combine_space_and_time_features)
      126    1.390    0.011    1.390    0.011 {method 'acquire' of '_thread.lock' objects}
        2    0.000    0.000    1.380    0.690 _base.py:635(__exit__)
        2    0.000    0.000    1.380    0.690 thread.py:210(shutdown)
       14    0.000    0.000    1.380    0.099 threading.py:1021(join)
       14    0.000    0.000    1.379    0.099 threading.py:1059(_wait_for_tstate_lock)
       12    0.020    0.002    1.207    0.101 position_encoding.py:279(normalize_geospatial_coordinates)
      404    0.192    0.000    0.956    0.002 position_encoding.py:399(fourier_encode)
        6    0.000    0.000    0.241    0.040 datasets.py:340(add_encodings)
       14    0.000    0.000    0.207    0.015 xr_utils.py:84(validate)
        2    0.000    0.000    0.187    0.093 batch.py:157(normalize)
      404    0.187    0.000    0.187    0.000 {method 'sin' of 'torch._C._TensorBase' objects}
      396    0.182    0.000    0.182    0.000 {built-in method stack}
        4    0.000    0.000    0.162    0.041 satellite_model.py:18(model_validation)

Possible Implementation

It seems like reducing the number of concatenations is the key, as generating the Fourier features is quite fast, whereas concatenation takes 2 of the 4 seconds needed to create the encodings.

The easiest way would be to run the spatial and feature encodings over the whole batch at once, rather than one example at a time as is currently done.

One issue with that is finding suitable replacements for some operations. For the spatial encoding, for example, torch.meshgrid only takes 1D tensors, so it can't be applied over the whole batch at once, I think?

For the datetime features, doing it over the whole batch should also speed it up a bit, just need to extract the hour, minute, and day of year of a set of timestamps all at once.
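
One possible way around the meshgrid limitation is broadcasting, sketched below with NumPy. The shapes and variable names are hypothetical; the same `[:, None, :]` indexing works on torch tensors, where `expand` plays the role of `broadcast_to`.

```python
import numpy as np

batch_size, height, width = 4, 3, 5
rng = np.random.default_rng(0)
x = rng.uniform(size=(batch_size, width))   # per-example x coordinates
y = rng.uniform(size=(batch_size, height))  # per-example y coordinates

# Per-example loop: one meshgrid call per example, then stack.
looped = np.stack(
    [np.stack(np.meshgrid(x[i], y[i], indexing="xy")) for i in range(batch_size)]
)  # shape [batch, 2, height, width]

# Whole batch at once: broadcast each coordinate along the other spatial axis.
x_grid = np.broadcast_to(x[:, None, :], (batch_size, height, width))
y_grid = np.broadcast_to(y[:, :, None], (batch_size, height, width))
batched = np.stack([x_grid, y_grid], axis=1)  # shape [batch, 2, height, width]
```

Both routes give identical grids, but the batched one avoids the Python loop and the per-example stacking.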

`normalise()` should have consistent behaviour across DataSources: It should give mean=0 and std=1.

Describe the bug
In satellite and nwp data sources, normalise() does the right thing: it ensures that, on average, the means will be zero; and the std will be 1.

In gsp and pv data sources, normalise() rescales the values to be in the range [0, 1], which isn't exactly the same thing!

Expected behavior
For any data source that's used as an input to the model, we probably want means to be zero and std to be 1.

For the target, we may sometimes want to re-scale to [0, 1] (if, for example, we're using a sigmoid output layer). But we should probably ignore that for now 🙂

B,C,T,H,W

Describe the bug
Satellite and NWP data are not in this shape

To Reproduce
Create BatchML.fake() and look at the shape of the satellite data

Expected behavior
Expect shape to be B,C,T,H,W

Potential solution

change the fake data
reshuffle dims in nwp and satellite while the data is in xr format
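
A minimal sketch of reordering by dimension name, assuming B, C, T, H, W means batch, channel, time, height, width. It uses NumPy with an explicit tuple of names; on an xarray object the equivalent would be `DataArray.transpose(...)` with the real dimension names. The input layout and the helper `to_bcthw` are hypothetical.

```python
import numpy as np

TARGET_ORDER = ("batch", "channel", "time", "height", "width")  # B, C, T, H, W


def to_bcthw(data: np.ndarray, dims: tuple) -> np.ndarray:
    """Reorder the axes of data, whose current axis names are dims, to B, C, T, H, W."""
    axes = [dims.index(name) for name in TARGET_ORDER]
    return np.transpose(data, axes)


# Hypothetical fake satellite data stored as [batch, time, height, width, channel].
fake = np.zeros((32, 7, 24, 24, 11), dtype=np.float32)
reordered = to_bcthw(fake, ("batch", "time", "height", "width", "channel"))
```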

Multi-scale position encoding

Detailed Description

Should we implement "multi-scale" position encoding? e.g., for temporal encoding, we could have a total of 8 channels:

A full cycle of a sine and cosine for:

day-of-year
time-of-day
6-hours
2-hours

The thinking being that, if we only have day-of-year and time-of-day, then it might be hard for the model to tell if two timesteps are consecutive (because 5 minutes would be represented as a tiny change in a sine wave which takes a full 24 hours to complete)

(Also, if we get into electricity demand forecasting, we'd want "human" temporal encoding like day-of-week)
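
A minimal sketch of the 8-channel encoding described above, using only the standard library. The function name and the 365.25-day year length are assumptions made for illustration.

```python
import math
from datetime import datetime


def multi_scale_time_encoding(t: datetime) -> list:
    """Return 8 channels: a sine and a cosine for each of four cycle lengths."""
    seconds_of_day = t.hour * 3600 + t.minute * 60 + t.second
    day_of_year = t.timetuple().tm_yday - 1 + seconds_of_day / 86400.0
    # Fraction of the way through each cycle.
    phases = [
        day_of_year / 365.25,              # day-of-year (approximate year length)
        seconds_of_day / 86400.0,          # time-of-day
        (seconds_of_day % 21600) / 21600,  # 6-hour cycle
        (seconds_of_day % 7200) / 7200,    # 2-hour cycle
    ]
    channels = []
    for phase in phases:
        channels.append(math.sin(2 * math.pi * phase))
        channels.append(math.cos(2 * math.pi * phase))
    return channels


t0 = multi_scale_time_encoding(datetime(2021, 6, 1, 12, 0))
t1 = multi_scale_time_encoding(datetime(2021, 6, 1, 12, 5))
```

For these two timestamps, five minutes apart, the 2-hour sine channel changes by roughly ten times as much as the time-of-day sine channel, which is the effect the issue is after.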

Update branch pytest workflow

Detailed Description

Currently a different branch of the workflow is used.
https://github.com/openclimatefix/nowcasting_dataloader/blob/main/.github/workflows/workflows.yaml#L6

Would be good to move this back to 'main'

Satellite and nwp channel strings --> torch

Describe the bug
An array of strings for channel names cannot be converted to torch

    return torch.tensor(self._obj.data, dtype=torch.float32)
TypeError: can't convert np.ndarray of type numpy.object_. The only supported types are: float64, float32, float16, complex64, complex128, int64, int32, int16, int8, uint8, and bool.

To Reproduce
Run the most recent dataset
'/mnt/storage_b/data/ocf/solar_pv_nowcasting/nowcasting_dataset_pipeline/prepared_ML_training_data/v10'
and it errors when trying to turn the array of strings into a torch field

Expected behavior
The dataset can be turned into torch tensors

Possible solutions

Convert strings to numbers by using channels as index
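
A minimal sketch of that conversion, assuming a fixed, ordered list of all known channel names. `ALL_CHANNELS` here is a hypothetical, shortened list and `channels_to_indices` is a hypothetical helper.

```python
import numpy as np

# Hypothetical fixed, ordered list of all known channel names.
ALL_CHANNELS = ("HRV", "IR_016", "VIS006", "VIS008")

CHANNEL_TO_INDEX = {name: i for i, name in enumerate(ALL_CHANNELS)}


def channels_to_indices(channel_names) -> np.ndarray:
    """Map each channel name to its position in ALL_CHANNELS as an int64 array."""
    return np.array([CHANNEL_TO_INDEX[str(name)] for name in channel_names], dtype=np.int64)


# An object-dtype array of strings is the kind of input torch refuses to convert.
names = np.array(["VIS006", "HRV"], dtype=object)
indices = channels_to_indices(names)
```

The resulting integer array converts to a tensor without the `numpy.object_` error, and an unknown channel name raises a `KeyError` instead of passing through silently.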

Add random flips

When getting batch data from gcp/aws, we want the option of adding random flips.
This will help reduce overfitting
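
A minimal sketch of a seeded random flip, assuming image data whose last two axes are height and width. The function name `random_flip` and the probability parameter `p` are hypothetical. In practice the same flip would presumably also have to be applied to the coordinates and to every other spatial modality in the example, which this sketch does not cover.

```python
import numpy as np


def random_flip(data: np.ndarray, rng: np.random.Generator, p: float = 0.5) -> np.ndarray:
    """Flip data of shape [..., H, W] along height and/or width, each with probability p."""
    if rng.random() < p:
        data = np.flip(data, axis=-2)  # vertical flip
    if rng.random() < p:
        data = np.flip(data, axis=-1)  # horizontal flip
    return data


rng = np.random.default_rng(1234)
image = np.arange(24, dtype=np.float32).reshape(1, 2, 3, 4)
flipped = random_flip(image, rng)
```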

Remove concept of `cloud` from nowcasting_dataloader. Use fsspec instead

Make Datetime position encoding use multiple Fourier features

Detailed Description

Currently time features are just the value of a single sin/cos. The spatial features use multiple sin/cos in their Fourier encoding, so adding that to the datetime features could make it easier for the model to figure things out, especially for timesteps that are close to each other.

Context

Possible Implementation
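
A minimal sketch of one option, assuming each datetime has already been turned into a phase in [0, 1), such as the fraction of the day elapsed. The power-of-two frequency bands and the function name `fourier_features` are assumptions, not necessarily what the existing spatial encoding uses.

```python
import numpy as np


def fourier_features(phase, num_bands: int = 4) -> np.ndarray:
    """Encode phases in [0, 1) as sin/cos pairs at frequencies 1, 2, 4, ...

    Returns an array of shape [..., 2 * num_bands]: all sines, then all cosines.
    """
    phase = np.asarray(phase, dtype=np.float32)
    freqs = 2.0 ** np.arange(num_bands, dtype=np.float32)
    angles = 2 * np.pi * phase[..., None] * freqs
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)


# Time-of-day as a fraction of 24 hours for two timesteps five minutes apart.
phases = np.array([12 * 60, 12 * 60 + 5]) / (24 * 60)
features = fourier_features(phases)
```

The higher-frequency bands change much more between neighbouring timesteps than the base band does, which is what should help the model tell close timesteps apart.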

Encode the year

Detailed Description

We should experiment with telling our ML models which year it is.

Context

The country's PV capacity increases year-on-year.

Possible Implementation

A few possible encodings:

a single real-valued input which is -1 for the first year in the dataset (2016?) and +1 for the last year of the dataset (2021?)
one-hot encoding of the year (this perhaps feels a bit wrong because it doesn't capture the ordering of years)
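
Both encodings can be sketched in a few lines. The function names are hypothetical, and the first and last years are passed in rather than assumed.

```python
def encode_year_linear(year: int, first_year: int, last_year: int) -> float:
    """Map first_year to -1.0 and last_year to +1.0, linearly in between."""
    return 2.0 * (year - first_year) / (last_year - first_year) - 1.0


def encode_year_one_hot(year: int, first_year: int, last_year: int) -> list:
    """Return a list with a single 1.0 at the position of the given year."""
    return [1.0 if y == year else 0.0 for y in range(first_year, last_year + 1)]
```

The linear version extrapolates beyond +1 for years after the training period, whereas the one-hot version has no slot for an unseen year. That difference may matter at inference time.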

Related issues:

openclimatefix/nowcasting_dataset#184

Embedding of GSP ID and PV System ID

Detailed Description

Let's experiment with giving our models:

An embedding of the GSP ID (used in the query? And for each row of GSP-level PV data if we use historical GSP-level PV?)
An embedding of the PV System ID (could be used as part of the position encoding for each row of PV data)

Normalize PV and GSP data

Detailed Description

It will be taken out of nowcasting_dataset

see https://github.com/openclimatefix/nowcasting_dataset/compare/issue/231-normalization-gsp?expand=1

For GSP:
and openclimatefix/nowcasting_dataset#427

Context

Decided not to normalise the batch data, but to do it in the data loader

Only need batch test data

Only need the batch test data, all the other folders in test data can go

GSP plotting time axis

Describe the bug
GSP time plot

To Reproduce
Steps to reproduce the behavior:
Plot GSP data from real (not fake) batches. The GSP batches are turned into torch tensors and then plotted

Expected behavior
Times should be around 2020 and 2021

Generate test data on the fly

Detailed Description

Run the test generation script in the test action.

Context

This would stop stale test data from causing problems and from making tests pass that shouldn't.

Possible Implementation

Add a step in the tests that calls the test generation script, either in the reusable workflow or here

Reduce resolution the further out the pixels are from the center

With attention, imagery doesn't have to be in a regular grid, so we can gradually reduce resolution as pixels get further from the center. Although, in practice, probably just have 'native res' in the center, then reduce res in a border around the center

Recompute Mean/Std for Satellite data

Detailed Description

Now that we are using the original projection of the satellite data, the mean and std are probably close, but still wrong for the new format. So we need to rerun the calculation for all the channels.

Context

We need it to have the correct normalization.

Possible Implementation

Move Position Encoding to BatchML

Detailed Description

Move the position encoding step to BatchML, so that it works more easily across all datasets, etc.

This would also include moving the code for zeroing out the missing PV and GSP systems: to work correctly, their encodings need to be zeroed out when the PV and GSP systems are.

Context

It's come up a few times that moving position encoding to BatchML would be helpful.

Possible Implementation

Add PV/Unified Position encoding

Detailed Description

As mentioned in openclimatefix/satflow#101 it would be helpful to have a way of having consistent position encodings for the PV systems/satellite imagery so that the model can associate the PV systems output in time and space with the satellite imagery.

This unified position encoding could also be useful for other modalities, so that the model only needs one set of position encodings but can use it for all the input modalities.

Context

It would help with the joint model and unifying the inputs/outputs. This position encoding can also be used with the openclimatefix/perceiver-pytorch#20 to ensure the queries use the same positional encodings when getting the output

Possible Implementation

Could be just a way of encoding all of this with Fourier Features. Or could be some other way of encoding that works across the different modalities

Failing test: `tests/test_position_encoding.py::test_batch_encoding`

Describe the bug
tests/test_position_encoding.py::test_batch_encoding fails at line 53 because torch.max(position_encodings[position_encoding_key]) is too large. It's expected to be less than or equal to 1. But it's actually 1.3904!

@jacobbieker if you can give me a hint, I'm happy to try to debug. Unless you're happy to do it? 🙂

To Reproduce
py.test tests/test_position_encoding.py::test_batch_encoding

Move over data loader object from nowcasting_dataset

Detailed Description

Move over data loader object from dataset

Context

Good to divide up dataset a bit
might have to do it after openclimatefix/nowcasting_dataset#229

Possible Implementation

copy over

BatchML and its models
Dataloader

Load 2 batches at once

Detailed Description

Would it be worth loading 2 batches at once and taking a random (seeded) sample of examples from each one?
This would mean that the ML models don't see the same batch each epoch, which might help stop overfitting.
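
A minimal sketch of the idea, assuming each batch is an array whose first axis indexes examples. The function name `mix_two_batches` and the half-and-half split are hypothetical choices.

```python
import numpy as np


def mix_two_batches(batch_a: np.ndarray, batch_b: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Draw a random half of the examples from each batch and concatenate them."""
    n = batch_a.shape[0]
    half = n // 2
    idx_a = rng.choice(n, size=half, replace=False)
    idx_b = rng.choice(batch_b.shape[0], size=n - half, replace=False)
    return np.concatenate([batch_a[idx_a], batch_b[idx_b]], axis=0)


batch_a = np.zeros((32, 3), dtype=np.float32)
batch_b = np.ones((32, 3), dtype=np.float32)
mixed = mix_two_batches(batch_a, batch_b, np.random.default_rng(1234))
```

Seeding the generator per epoch keeps the mixing reproducible while still varying which examples end up together.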

Check out Microsoft's new TorchGeo library

Detailed Description

Microsoft have released a new PyTorch library: https://github.com/microsoft/torchgeo

The readme starts:

TorchGeo is a PyTorch domain library, similar to torchvision, that provides datasets, transforms, samplers, and pre-trained models specific to geospatial data.

The goal of this library is to make it simple:

for machine learning experts to use geospatial data in their workflows, and

for remote sensing experts to use their data in machine learning workflows.

The docs are here: https://torchgeo.readthedocs.io/en/latest/

Of particular relevance for us might be their samplers and transforms.

(Although, after spending a few minutes skim-reading the docs, I can't see anything that's obviously relevant to our work, tbh.... but probably worth having a more detailed look some time in early 2022?)

Document what [B, C, T, H, W] ordering means 🙂

Maybe in the README (or somewhere else) we could document what 'B', 'C', 'T', 'H', and 'W' mean? (I assume it's batch, channel, time, height, width?!)

Move data loader from predict_pv_yield

Should this be moved into this repo?

https://github.com/openclimatefix/predict_pv_yield/blob/master/predict_pv_yield/data/dataloader.py
