
Comments (8)

jacobbieker commented on May 29, 2024

Yeah, the position encoding should contain the relevant time information, so we shouldn't have to be sending the actual time to the models at all. I think they should probably just be used for plotting then, in which case they don't have to be converted to torch tensors at all, so we can do whatever is easiest for plotting instead.

from nowcasting_dataloader.

jacobbieker commented on May 29, 2024

It does convert to torch if not already. The SatFlowDataset, though, removes all non-inputs from the dictionary that is returned, keeping only the data and the position encodings for the inputs and targets. But yeah, otherwise you can change the collation function, which I think is where everything gets changed to a Tensor.


peterdudfield commented on May 29, 2024

Turns out this is due to changing datetime64[ns] to an int32, which loses information. Normally it is changed to an int64.

Options to solve this:

  1. Change to int64 and use more memory, but it is simple.
    xarray.Dataset --> int64 --> torch.float64. Perhaps everything could be changed to float64.
  2. xarray can't be saved as datetime64[s], so for time vectors we could go for xarray.Dataset --> datetime64[ns] --> datetime64[s] --> int32 --> torch.float32

I think I prefer 2. It means everything ends up as torch.float32, but there is a bit more moving things around for the time vectors. It's worth noting that the time vectors are small compared to the satellite and NWP data.
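
To make the sizes concrete, here is a minimal sketch (NumPy only; the variable names are illustrative and not taken from nowcasting_dataloader) of why nanoseconds since the epoch cannot fit in an int32, while whole seconds can until the int32 range runs out in 2038:

```python
import numpy as np

# Two example timestamps at nanosecond resolution, as xarray would hold them.
t_ns = np.array(
    ["2019-01-01T00:00:00", "2019-01-01T00:05:00"], dtype="datetime64[ns]"
)

# Integer views: nanoseconds vs. whole seconds since 1970-01-01.
ns_int64 = t_ns.astype("int64")
s_int64 = t_ns.astype("datetime64[s]").astype("int64")

int32_max = np.iinfo(np.int32).max  # 2_147_483_647

print("ns fits in int32:", bool(ns_int64.max() <= int32_max))
print("s  fits in int32:", bool(s_int64.max() <= int32_max))

# Seconds survive an int32 round trip without loss.
s_int32 = s_int64.astype(np.int32)
restored = s_int32.astype("int64").astype("datetime64[s]")
assert (restored == t_ns.astype("datetime64[s]")).all()
```

So the int32 step in option 2 is lossless on its own; any loss in that pipeline has to come from a later conversion.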

@JackKelly and @jacobbieker, I would be interested in your opinion.


peterdudfield commented on May 29, 2024

I'm struggling with option 2.

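    import pandas as pd
    import torch

    time = pd.date_range("2019-01-01", "2019-01-02", freq="5min").values  # datetime64[ns]
    time_datetime64_s = time.astype('datetime64[s]')  # drop sub-second resolution
    time_int32 = time_datetime64_s.astype('int32')  # seconds since epoch, still exact
    time_torch = torch.tensor(time_int32, dtype=torch.float32)  # precision is lost here
    time_out = pd.to_datetime(time_torch.numpy(), unit='s')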

This produces datetimes that are off by up to about a minute:

DatetimeIndex(['2019-01-01 00:00:00', '2019-01-01 00:04:16',
'2019-01-01 00:10:40', '2019-01-01 00:14:56',
'2019-01-01 00:19:12', '2019-01-01 00:25:36',
'2019-01-01 00:29:52', '2019-01-01 00:34:08',
'2019-01-01 00:40:32', '2019-01-01 00:44:48',
...
'2019-01-01 23:15:12', '2019-01-01 23:19:28',
'2019-01-01 23:25:52', '2019-01-01 23:30:08',
'2019-01-01 23:34:24', '2019-01-01 23:40:48',
'2019-01-01 23:45:04', '2019-01-01 23:49:20',
'2019-01-01 23:55:44', '2019-01-02 00:00:00'],
dtype='datetime64[ns]', length=289, freq=None)
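
The pattern in this output can be reproduced without torch. As a rough sketch (NumPy only, names are illustrative): at around 1.5 billion seconds since the epoch, adjacent float32 values are 128 seconds apart, so every timestamp snaps to the nearest multiple of 128 s.

```python
import numpy as np

# 2019-01-01T00:00:00 expressed as seconds since 1970-01-01.
epoch_s = np.int64(1546300800)

# Gap between adjacent representable float32 values at this magnitude.
spacing = np.spacing(np.float32(epoch_s))
print("float32 spacing near 2019 epoch seconds:", spacing)

# First four 5-minute timestamps, round-tripped through float32.
times_s = epoch_s + 300 * np.arange(4, dtype=np.int64)
roundtrip = times_s.astype(np.float32).astype(np.int64)
errors = roundtrip - times_s
print("round-trip errors in seconds:", errors.tolist())  # [0, -44, 40, -4]
```

These errors match the first entries of the output above: 00:05:00 becomes 00:04:16 (44 s early), 00:10:00 becomes 00:10:40 (40 s late), and 00:15:00 becomes 00:14:56 (4 s early).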


peterdudfield commented on May 29, 2024

It seems that in order to get accurate times we need to use 'torch.float64'. Perhaps a good compromise is to use 'torch.float32' for everything apart from the time vectors.

Another option is to normalize the time:

  • this might make it harder to read at a glance.
  • it might be good for the ML model.
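
The float64 observation can be checked directly. The following is a small sketch using NumPy dtypes as stand-ins for the torch ones (an assumption made for portability; torch.float64 and torch.float32 use the same IEEE 754 formats). float64 represents every integer up to 2**53 exactly, which is far beyond any epoch-seconds value.

```python
import numpy as np

# One day of 5-minute timestamps at second resolution (289 values).
start = np.datetime64("2019-01-01T00:00:00", "s")
times = start + np.arange(289) * np.timedelta64(300, "s")

seconds = times.astype("int64")

# float64 round trip: exact.
back64 = seconds.astype(np.float64).astype("int64").astype("datetime64[s]")
exact64 = bool((back64 == times).all())

# float32 round trip: not exact at this magnitude.
back32 = seconds.astype(np.float32).astype("int64").astype("datetime64[s]")
exact32 = bool((back32 == times).all())

print("float64 round trip exact:", exact64)
print("float32 round trip exact:", exact32)
```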


peterdudfield commented on May 29, 2024

Perhaps the encoding work that @jacobbieker has done also means we don't need these variables for the models; we might just need them for plotting --> just change these to float64, ready for plotting.


JackKelly commented on May 29, 2024

Thanks loads for looking into this!

I think you're doing the right thing by converting the datetimes to datetime64[s] first.

From here, I think we have two options:

  1. Can we just leave the times as int32 (after first converting to datetime64[s])? I feel like this is probably the safest option. Or is there a specific reason why it's troublesome to leave the time vectors as int32?
  2. If we prefer to convert to np.float32 then one trick would be to subtract the start time of the dataset from every datetime (e.g. subtract "2020-01-01"). But this only works for datasets less than about 3 years in length:
In [13]: time = pd.date_range("2019-01-01", "2025-01-02", freq="5T").values

In [14]: time_datetime64_s = time.astype('datetime64[s]')

In [15]: time_int32 = time_datetime64_s.astype('int32')

In [16]: time_float32 = (time_int32 - time_int32[0]).astype(np.float32)

In [17]: time_out = pd.to_datetime(time_float32 + time_int32[0], unit='s')

In [18]: time_out
Out[18]: 
DatetimeIndex(['2019-01-01 00:00:00', '2019-01-01 00:05:00',
               '2019-01-01 00:10:00', '2019-01-01 00:15:00',
               '2019-01-01 00:20:00', '2019-01-01 00:25:00',
               '2019-01-01 00:30:00', '2019-01-01 00:35:00',
               '2019-01-01 00:40:00', '2019-01-01 00:45:00',
               ...
               '2025-01-01 23:14:56', '2025-01-01 23:20:00',
               '2025-01-01 23:25:04', '2025-01-01 23:30:08',
               '2025-01-01 23:34:56', '2025-01-01 23:40:00',
               '2025-01-01 23:45:04', '2025-01-01 23:49:52',
               '2025-01-01 23:54:56', '2025-01-02 00:00:00'],
              dtype='datetime64[ns]', length=631585, freq=None)

In [19]: time_out[300000:]
Out[19]: 
DatetimeIndex(['2021-11-07 16:00:00', '2021-11-07 16:05:04',
               '2021-11-07 16:10:00', '2021-11-07 16:14:56',
               '2021-11-07 16:20:00', '2021-11-07 16:25:04',
               '2021-11-07 16:30:00', '2021-11-07 16:34:56',
               '2021-11-07 16:40:00', '2021-11-07 16:45:04',
               ...
               '2025-01-01 23:14:56', '2025-01-01 23:20:00',
               '2025-01-01 23:25:04', '2025-01-01 23:30:08',
               '2025-01-01 23:34:56', '2025-01-01 23:40:00',
               '2025-01-01 23:45:04', '2025-01-01 23:49:52',
               '2025-01-01 23:54:56', '2025-01-02 00:00:00'],
              dtype='datetime64[ns]', length=331585, freq=None)

In [20]: time_out[400000:]
Out[20]: 
DatetimeIndex(['2022-10-20 21:20:00', '2022-10-20 21:25:04',
               '2022-10-20 21:30:00', '2022-10-20 21:34:56',
               '2022-10-20 21:40:00', '2022-10-20 21:45:04',
               '2022-10-20 21:50:00', '2022-10-20 21:54:56',
               '2022-10-20 22:00:00', '2022-10-20 22:05:04',
               ...
               '2025-01-01 23:14:56', '2025-01-01 23:20:00',
               '2025-01-01 23:25:04', '2025-01-01 23:30:08',
               '2025-01-01 23:34:56', '2025-01-01 23:40:00',
               '2025-01-01 23:45:04', '2025-01-01 23:49:52',
               '2025-01-01 23:54:56', '2025-01-02 00:00:00'],
              dtype='datetime64[ns]', length=231585, freq=None)

(As I'm sure you know, the problem with converting large ints to float32 is that float32 has far fewer than 32 bits for the 'significand' (24 bits of precision)... so, to give a cartoon example with a fictional, much smaller float, the integer 123,456 might be represented as 1.23 x 10^5... i.e. floats throw away precision to get larger dynamic range)
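
As a rough way to see where the offset trick runs out, the sketch below (NumPy only; `step_s` and `n_steps` are illustrative names chosen to mirror the 5-minute, 631585-sample series above) round-trips offsets from the start time through float32 and finds the first one that comes back changed. The exact cut-off depends on the sampling interval: arbitrary whole seconds are only guaranteed exact up to 2**24 s (about 194 days), while values that are multiples of a power of two stay exact for longer.

```python
import numpy as np

step_s = 300        # 5-minute sampling, in seconds
n_steps = 631_585   # same number of samples as the series in the thread

# Offsets from the dataset start time, in seconds.
offsets = step_s * np.arange(n_steps, dtype=np.int64)

# Round trip through float32 and locate the first offset that changes.
roundtrip = offsets.astype(np.float32).astype(np.int64)
bad = np.flatnonzero(roundtrip != offsets)
first_bad = int(bad[0])
first_bad_days = first_bad * step_s / 86_400

print("first inexact sample index:", first_bad)
print("days after start:", round(first_bad_days, 1))

# General guarantee: every integer up to 2**24 is exact in float32.
exact_limit_days = 2**24 / 86_400
# The next integer is not representable and rounds back to 2**24.
assert np.float32(2**24 + 1) == np.float32(2**24)
```

For this 5-minute series the first inexact offset appears a little over two years after the start, which is the same order of magnitude as the "about 3 years" estimate.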


peterdudfield commented on May 29, 2024

Yeah, the position encoding should contain the relevant time information, so we shouldn't have to be sending the actual time to the models at all. I think they should probably just be used for plotting then, in which case they don't have to be converted to torch tensors at all, so we can do whatever is easiest for plotting instead.

Doesn't the torch DataLoader convert things if they are not torch already? But perhaps there is some way to keep it in a time format.

