Comments (8)
Yeah, the position encoding should contain the relevant time information, so we shouldn't have to be sending the actual time to the models at all. I think they should probably just be used for plotting then, in which case they don't have to be converted to torch tensors at all, so we can do whatever is easiest for plotting instead.
from nowcasting_dataloader.
It does convert to torch tensors if they aren't already. The SatFlowDataset, though, removes all non-inputs from the dictionary that is returned, only keeping the data and the position encodings for the inputs and targets. But yeah, otherwise you can change the collation function, which I think is where everything gets converted to a Tensor.
Turns out this is due to changing datetime64[ns] to an int32, which loses information here. Normally it is changed to an int64.
Options to solve this:
1. Change to int64 and use more memory, but it is simple: xarray.Dataset --> int64 --> torch.float64. Perhaps everything could be changed to float64s.
2. xarray can't be saved as datetime64[s], so for time vectors we could go xarray.Dataset --> datetime64[ns] --> datetime64[s] --> int32 --> torch.float32.
I think I prefer 2. It means everything ends up as torch.float32, but there is a bit more moving stuff around for time vectors. But it's good to note the time vectors are small compared to the satellite and NWP data.
@JackKelly and @jacobbieker would be interested in your opinion
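As a minimal sketch of the two conversion paths above (assuming a 5-minutely pandas time index; the variable names are just illustrative):

```python
import pandas as pd
import torch

time = pd.date_range("2019-01-01", "2019-01-02", freq="5T").values  # datetime64[ns]

# Option 1: int64 nanoseconds -> torch.float64 (simple, but double the memory)
t_float64 = torch.tensor(time.astype("int64"), dtype=torch.float64)

# Option 2: datetime64[ns] -> datetime64[s] -> int32 -> torch.float32
# (float32 only has a 24-bit significand, so ~1.5e9 epoch seconds
# cannot all be represented exactly)
t_float32 = torch.tensor(
    time.astype("datetime64[s]").astype("int32"), dtype=torch.float32
)
```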
I'm struggling with 2.
import pandas as pd
import torch

time = pd.date_range("2019-01-01", "2019-01-02", freq="5T").values  # datetime64[ns]
time_datetime64_s = time.astype('datetime64[s]')
time_int32 = time_datetime64_s.astype('int32')
time_torch = torch.tensor(time_int32, dtype=torch.float32)
time_out = pd.to_datetime(time_torch.numpy(), unit='s')
This makes datetimes which are a few seconds off:
DatetimeIndex(['2019-01-01 00:00:00', '2019-01-01 00:04:16',
'2019-01-01 00:10:40', '2019-01-01 00:14:56',
'2019-01-01 00:19:12', '2019-01-01 00:25:36',
'2019-01-01 00:29:52', '2019-01-01 00:34:08',
'2019-01-01 00:40:32', '2019-01-01 00:44:48',
...
'2019-01-01 23:15:12', '2019-01-01 23:19:28',
'2019-01-01 23:25:52', '2019-01-01 23:30:08',
'2019-01-01 23:34:24', '2019-01-01 23:40:48',
'2019-01-01 23:45:04', '2019-01-01 23:49:20',
'2019-01-01 23:55:44', '2019-01-02 00:00:00'],
dtype='datetime64[ns]', length=289, freq=None)
from nowcasting_dataloader.
It seems that in order to get the times accurate we need to use 'torch.float64'. Perhaps a good compromise is to use 'torch.float32' for everything apart from the time vectors.
Another option is to normalize the time:
- this might make it harder to read immediately
- it might be good for the ML model
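For what it's worth, the float64 compromise does round-trip a 5-minutely index losslessly, since these nanosecond values fit comfortably within float64's 52-bit significand. A quick sanity check, not production code:

```python
import pandas as pd
import torch

time = pd.date_range("2019-01-01", "2019-01-02", freq="5T").values  # datetime64[ns]

# datetime64[ns] -> int64 ns -> torch.float64 and back: exact at this
# granularity, because every value here is exactly representable in float64.
t = torch.tensor(time.astype("int64"), dtype=torch.float64)
recovered = pd.to_datetime(t.numpy().astype("int64"))

assert (recovered.values == time).all()
```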
Perhaps the encoding work that @jacobbieker has done also means we don't need these variables for the models; we just might need them for plotting --> just change these to float64 ready for plotting.
Thanks loads for looking into this!
I think you're doing the right thing by converting the datetimes to `datetime64[s]` first.
From here, I think we have two options:
- Can we just leave the times as `int32` (after first converting to `datetime64[s]`)? I feel like this is probably the safest option. Or is there a specific reason why it's troublesome to leave the time vectors as `int32`?
- If we prefer to convert to `np.float32`, then one trick would be to subtract the start time of the dataset from every datetime (e.g. subtract "2020-01-01"). But this only works for datasets less than about 3 years in length:
In [13]: time = pd.date_range("2019-01-01", "2025-01-02", freq="5T").values
In [14]: time_datetime64_s = time.astype('datetime64[s]')
In [15]: time_int32 = time_datetime64_s.astype('int32')
In [16]: time_float32 = (time_int32 - time_int32[0]).astype(np.float32)
In [17]: time_out = pd.to_datetime(time_float32 + time_int32[0], unit='s')
In [18]: time_out
Out[18]:
DatetimeIndex(['2019-01-01 00:00:00', '2019-01-01 00:05:00',
'2019-01-01 00:10:00', '2019-01-01 00:15:00',
'2019-01-01 00:20:00', '2019-01-01 00:25:00',
'2019-01-01 00:30:00', '2019-01-01 00:35:00',
'2019-01-01 00:40:00', '2019-01-01 00:45:00',
...
'2025-01-01 23:14:56', '2025-01-01 23:20:00',
'2025-01-01 23:25:04', '2025-01-01 23:30:08',
'2025-01-01 23:34:56', '2025-01-01 23:40:00',
'2025-01-01 23:45:04', '2025-01-01 23:49:52',
'2025-01-01 23:54:56', '2025-01-02 00:00:00'],
dtype='datetime64[ns]', length=631585, freq=None)
In [19]: time_out[300000:]
Out[19]:
DatetimeIndex(['2021-11-07 16:00:00', '2021-11-07 16:05:04',
'2021-11-07 16:10:00', '2021-11-07 16:14:56',
'2021-11-07 16:20:00', '2021-11-07 16:25:04',
'2021-11-07 16:30:00', '2021-11-07 16:34:56',
'2021-11-07 16:40:00', '2021-11-07 16:45:04',
...
'2025-01-01 23:14:56', '2025-01-01 23:20:00',
'2025-01-01 23:25:04', '2025-01-01 23:30:08',
'2025-01-01 23:34:56', '2025-01-01 23:40:00',
'2025-01-01 23:45:04', '2025-01-01 23:49:52',
'2025-01-01 23:54:56', '2025-01-02 00:00:00'],
dtype='datetime64[ns]', length=331585, freq=None)
In [20]: time_out[400000:]
Out[20]:
DatetimeIndex(['2022-10-20 21:20:00', '2022-10-20 21:25:04',
'2022-10-20 21:30:00', '2022-10-20 21:34:56',
'2022-10-20 21:40:00', '2022-10-20 21:45:04',
'2022-10-20 21:50:00', '2022-10-20 21:54:56',
'2022-10-20 22:00:00', '2022-10-20 22:05:04',
...
'2025-01-01 23:14:56', '2025-01-01 23:20:00',
'2025-01-01 23:25:04', '2025-01-01 23:30:08',
'2025-01-01 23:34:56', '2025-01-01 23:40:00',
'2025-01-01 23:45:04', '2025-01-01 23:49:52',
'2025-01-01 23:54:56', '2025-01-02 00:00:00'],
dtype='datetime64[ns]', length=231585, freq=None)
(As I'm sure you know, the problem with converting large ints to float32 is that float32 has much less than 32 bits for the 'significand'... so, to give a cartoon example with a fictional, much smaller float, the integer 123,456 might be represented as 1.23 x 10^5... i.e. floats throw away precision to get larger dynamic range)
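To put a number on that for our case (nothing project-specific, just NumPy):

```python
import numpy as np

# Epoch seconds for 2019-01-01 are ~1.5e9, i.e. between 2^30 and 2^31.
# float32 keeps 24 significand bits, so the gap between adjacent
# representable values at this magnitude is 2^(30 - 23) = 128 seconds.
t0 = np.float32(1546300800)  # 2019-01-01 00:00:00 UTC in seconds
print(np.spacing(t0))        # 128.0
```

So int32 epoch seconds get rounded to the nearest representable float32, giving errors of up to 64 seconds, consistent with the drifted timestamps above.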
> Yeah, the position encoding should contain the relevant time information, so we shouldn't have to be sending the actual time to the models at all. I think they should probably just be used for plotting then, in which case they don't have to be converted to torch tensors at all, so we can do whatever is easiest for plotting instead.

Doesn't the torch DataLoader convert things if they are not torch already? But perhaps there is some way to keep it in a time format.
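Yes, the DataLoader's default collation converts numpy arrays to tensors. One hedged sketch of keeping times as datetimes: pass a custom `collate_fn` (the name `collate_keep_datetimes` is hypothetical, not an existing function) that handles datetime64 arrays specially and defers to torch's `default_collate` (torch >= 1.11) for everything else:

```python
import numpy as np
import torch
from torch.utils.data import default_collate

def collate_keep_datetimes(samples):
    """Collate a list of dict samples, leaving datetime64 arrays as numpy."""
    batch = {}
    for key in samples[0]:
        values = [sample[key] for sample in samples]
        if np.issubdtype(np.asarray(values[0]).dtype, np.datetime64):
            batch[key] = np.stack(values)         # stays datetime64, no precision loss
        else:
            batch[key] = default_collate(values)  # tensors, as the DataLoader does now
    return batch
```

This would be passed as `DataLoader(dataset, collate_fn=collate_keep_datetimes, ...)`.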