Comments (8)
Yeah, the position encoding should contain the relevant time information, so we shouldn't have to be sending the actual time to the models at all. I think they should probably just be used for plotting then, in which case they don't have to be converted to torch tensors at all, so we can do whatever is easiest for plotting instead.
from nowcasting_dataloader.
It does convert to torch tensors if they aren't already. The SatFlowDataset, though, removes all non-inputs from the dictionary that is returned, only keeping the data and the position encodings for the inputs and targets. But yeah, otherwise you can change the collation function, which I think is where everything gets converted to a Tensor.
Turns out this is due to changing datetime64[ns] to an int32, which loses information here. Normally it is changed to an int64.
Options to solve this:
1. Change to int64 and use more memory, but it is simple: xarray.Dataset --> int64 --> torch.float64. Perhaps everything could be changed to float64s.
2. xarray can't be saved as datetime64[s], so for time vectors we could go xarray.Dataset --> datetime64[ns] --> datetime64[s] --> int32 --> torch.float32.
I think I prefer 2. It means everything ends up as torch.float32, but there is a bit more moving stuff around for time vectors. But it's good to note the time vectors are small compared to the satellite and NWP data.
@JackKelly and @jacobbieker would be interested in your opinion
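As a minimal sketch of the two conversion paths above (assuming a 5-minutely pandas time index; the variable names are just illustrative):

```python
import pandas as pd
import torch

time = pd.date_range("2019-01-01", "2019-01-02", freq="5T").values  # datetime64[ns]

# Option 1: int64 nanoseconds -> torch.float64 (simple, but double the memory)
t_float64 = torch.tensor(time.astype("int64"), dtype=torch.float64)

# Option 2: datetime64[ns] -> datetime64[s] -> int32 -> torch.float32
# (float32 only has a 24-bit significand, so ~1.5e9 epoch seconds
# cannot all be represented exactly)
t_float32 = torch.tensor(
    time.astype("datetime64[s]").astype("int32"), dtype=torch.float32
)
```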
I'm struggling with 2.
import pandas as pd
import torch

time = pd.date_range("2019-01-01", "2019-01-02", freq="5T").values  # datetime64[ns]
time_datetime64_s = time.astype('datetime64[s]')
time_int32 = time_datetime64_s.astype('int32')
time_torch = torch.tensor(time_int32, dtype=torch.float32)
time_out = pd.to_datetime(time_torch.numpy(), unit='s')
This makes datetimes which are a few seconds off:
DatetimeIndex(['2019-01-01 00:00:00', '2019-01-01 00:04:16',
'2019-01-01 00:10:40', '2019-01-01 00:14:56',
'2019-01-01 00:19:12', '2019-01-01 00:25:36',
'2019-01-01 00:29:52', '2019-01-01 00:34:08',
'2019-01-01 00:40:32', '2019-01-01 00:44:48',
...
'2019-01-01 23:15:12', '2019-01-01 23:19:28',
'2019-01-01 23:25:52', '2019-01-01 23:30:08',
'2019-01-01 23:34:24', '2019-01-01 23:40:48',
'2019-01-01 23:45:04', '2019-01-01 23:49:20',
'2019-01-01 23:55:44', '2019-01-02 00:00:00'],
dtype='datetime64[ns]', length=289, freq=None)
from nowcasting_dataloader.
It seems that in order to get the times accurate we need to use 'torch.float64'. Perhaps a good compromise is to use 'torch.float32' for everything apart from the time vectors.
Another option is to normalize the time:
- this might make it harder to read immediately
- it might be good for the ML model
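For what it's worth, the float64 compromise does round-trip a 5-minutely index losslessly, since these nanosecond values fit comfortably within float64's 52-bit significand. A quick sanity check, not production code:

```python
import pandas as pd
import torch

time = pd.date_range("2019-01-01", "2019-01-02", freq="5T").values  # datetime64[ns]

# datetime64[ns] -> int64 ns -> torch.float64 and back: exact at this
# granularity, because every value here is exactly representable in float64.
t = torch.tensor(time.astype("int64"), dtype=torch.float64)
recovered = pd.to_datetime(t.numpy().astype("int64"))

assert (recovered.values == time).all()
```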
Perhaps the encoding work that @jacobbieker has done also means we don't need these variables for the models; we just might need them for plotting --> just change these to float64 ready for plotting.
Thanks loads for looking into this!
I think you're doing the right thing by converting the datetimes to `datetime64[s]` first.
From here, I think we have two options:
- Can we just leave the times as `int32` (after first converting to `datetime64[s]`)? I feel like this is probably the safest option. Or is there a specific reason why it's troublesome to leave the time vectors as `int32`?
- If we prefer to convert to `np.float32`, then one trick would be to subtract the start time of the dataset from every datetime (e.g. subtract "2020-01-01"). But this only works for datasets less than about 3 years in length:
In [13]: time = pd.date_range("2019-01-01", "2025-01-02", freq="5T").values
In [14]: time_datetime64_s = time.astype('datetime64[s]')
In [15]: time_int32 = time_datetime64_s.astype('int32')
In [16]: time_float32 = (time_int32 - time_int32[0]).astype(np.float32)
In [17]: time_out = pd.to_datetime(time_float32 + time_int32[0], unit='s')
In [18]: time_out
Out[18]:
DatetimeIndex(['2019-01-01 00:00:00', '2019-01-01 00:05:00',
'2019-01-01 00:10:00', '2019-01-01 00:15:00',
'2019-01-01 00:20:00', '2019-01-01 00:25:00',
'2019-01-01 00:30:00', '2019-01-01 00:35:00',
'2019-01-01 00:40:00', '2019-01-01 00:45:00',
...
'2025-01-01 23:14:56', '2025-01-01 23:20:00',
'2025-01-01 23:25:04', '2025-01-01 23:30:08',
'2025-01-01 23:34:56', '2025-01-01 23:40:00',
'2025-01-01 23:45:04', '2025-01-01 23:49:52',
'2025-01-01 23:54:56', '2025-01-02 00:00:00'],
dtype='datetime64[ns]', length=631585, freq=None)
In [19]: time_out[300000:]
Out[19]:
DatetimeIndex(['2021-11-07 16:00:00', '2021-11-07 16:05:04',
'2021-11-07 16:10:00', '2021-11-07 16:14:56',
'2021-11-07 16:20:00', '2021-11-07 16:25:04',
'2021-11-07 16:30:00', '2021-11-07 16:34:56',
'2021-11-07 16:40:00', '2021-11-07 16:45:04',
...
'2025-01-01 23:14:56', '2025-01-01 23:20:00',
'2025-01-01 23:25:04', '2025-01-01 23:30:08',
'2025-01-01 23:34:56', '2025-01-01 23:40:00',
'2025-01-01 23:45:04', '2025-01-01 23:49:52',
'2025-01-01 23:54:56', '2025-01-02 00:00:00'],
dtype='datetime64[ns]', length=331585, freq=None)
In [20]: time_out[400000:]
Out[20]:
DatetimeIndex(['2022-10-20 21:20:00', '2022-10-20 21:25:04',
'2022-10-20 21:30:00', '2022-10-20 21:34:56',
'2022-10-20 21:40:00', '2022-10-20 21:45:04',
'2022-10-20 21:50:00', '2022-10-20 21:54:56',
'2022-10-20 22:00:00', '2022-10-20 22:05:04',
...
'2025-01-01 23:14:56', '2025-01-01 23:20:00',
'2025-01-01 23:25:04', '2025-01-01 23:30:08',
'2025-01-01 23:34:56', '2025-01-01 23:40:00',
'2025-01-01 23:45:04', '2025-01-01 23:49:52',
'2025-01-01 23:54:56', '2025-01-02 00:00:00'],
dtype='datetime64[ns]', length=231585, freq=None)
(As I'm sure you know, the problem with converting large ints to float32 is that float32 has much less than 32 bits for the 'significand'... so, to give a cartoon example with a fictional, much smaller float, the integer 123,456 might be represented as 1.23 x 10^5... i.e. floats throw away precision to get larger dynamic range)
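To put a number on that for our case (nothing project-specific, just NumPy):

```python
import numpy as np

# Epoch seconds for 2019-01-01 are ~1.5e9, i.e. between 2^30 and 2^31.
# float32 keeps 24 significand bits, so the gap between adjacent
# representable values at this magnitude is 2^(30 - 23) = 128 seconds.
t0 = np.float32(1546300800)  # 2019-01-01 00:00:00 UTC in seconds
print(np.spacing(t0))        # 128.0
```

So int32 epoch seconds get rounded to the nearest representable float32, giving errors of up to 64 seconds, consistent with the drifted timestamps above.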
> Yeah, the position encoding should contain the relevant time information, so we shouldn't have to be sending the actual time to the models at all. I think they should probably just be used for plotting then, in which case they don't have to be converted to torch tensors at all, so we can do whatever is easiest for plotting instead.

Doesn't the torch DataLoader convert things if they are not torch already? But perhaps there is some way to keep it in a time format.
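Yes, the DataLoader's default collation converts numpy arrays to tensors. One hedged sketch of keeping times as datetimes: pass a custom `collate_fn` (the name `collate_keep_datetimes` is hypothetical, not an existing function) that handles datetime64 arrays specially and defers to torch's `default_collate` (torch >= 1.11) for everything else:

```python
import numpy as np
import torch
from torch.utils.data import default_collate

def collate_keep_datetimes(samples):
    """Collate a list of dict samples, leaving datetime64 arrays as numpy."""
    batch = {}
    for key in samples[0]:
        values = [sample[key] for sample in samples]
        if np.issubdtype(np.asarray(values[0]).dtype, np.datetime64):
            batch[key] = np.stack(values)         # stays datetime64, no precision loss
        else:
            batch[key] = default_collate(values)  # tensors, as the DataLoader does now
    return batch
```

This would be passed as `DataLoader(dataset, collate_fn=collate_keep_datetimes, ...)`.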