gtsa's Issues

Taking stock of existing efforts and reflecting on directions

Below are existing efforts I found that could be useful for discussing and defining a clear objective for GTSA, and for building its core structure during the Hackweek:

The obvious dependencies that are now more stable:

Apart from GeoWombat's Time Series section, I don't see anything that does what GTSA currently does (scalable spatiotemporal prediction). GeoWombat is also the only one providing an interface to ingest raster data + chunk it + process it. The limitation is that they have to maintain all of these aspects at once in a single package, while GTSA can leave the ingestion + chunking + vector operations to Rioxarray + Geocube for the most part, and focus on making it easier to apply scalable methods on the processing side. I really like their approach of allowing any PyTorch or other algorithm to be passed; we should probably aim towards something similar.
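
To make that concrete, here is a minimal sketch (all names are hypothetical, not existing GTSA code) of what a "pass any algorithm" interface could look like on top of Xarray/Dask: the user supplies any per-pixel callable (a SciPy fit, a wrapped PyTorch model, ...) and it is mapped lazily over the chunks with xarray.apply_ufunc. It assumes an already-stacked (time, y, x) dataset ds with a "z" variable, chunked in space only.

import numpy as np
import xarray as xr


def apply_along_time(ds, func, var="z"):
    """Map any user-supplied per-pixel callable func(t, y) -> scalar over the cube, lazily."""
    # Decimal days since the first observation, as plain floats
    t = (ds["time"] - ds["time"][0]) / np.timedelta64(1, "D")
    return xr.apply_ufunc(
        func,
        t,
        ds[var],
        input_core_dims=[["time"], ["time"]],
        vectorize=True,        # loop over pixels within each spatial chunk
        dask="parallelized",   # keep the computation lazy / out-of-memory
        output_dtypes=[float],
    )


# Any algorithm can be plugged in, e.g. a plain linear trend per pixel:
trend = apply_along_time(ds, lambda t, y: np.polyfit(t, y, 1)[0])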

So, in terms of package objectives, I see two core aspects:

  1. Provide routines built on top of Rioxarray to create temporal raster stacks, possibly with multiple variables, in an out-of-memory fashion from a list of rasters with various extents/projections/dates (most of the heavy lifting is done by Rioxarray). One main issue I see is that rasters don't natively have dates in their metadata, so GTSA would need a generic interface for that (an Xarray accessor that reads dates from most auxiliary files/filenames for rasters would be super useful; we want to copy the functionalities of the SatelliteImage class in GeoUtils: https://geoutils.readthedocs.io/en/latest/satimg_class.html, but it'll take a while). See the first sketch after this list.
  2. Provide routines to perform spatiotemporal, error-aware prediction in a scalable manner, using already-implemented methods from SciPy, PyTorch, etc., wherever possible. Here again, creating routines that specifically support scalable algorithms can be a challenge. For predicting with GPs, methods exist for this, such as batch GPs: https://docs.gpytorch.ai/en/stable/examples/08_Advanced_Usage/index.html#batch-gps. For applying GPs, one only needs a chunk at the scale of the covariance; beyond that, points are independent! But this is not natively supported in most packages, so we'd have to write it. Another challenge would be to provide "error-aware" methods wherever possible, i.e. mostly methods that understand and propagate uncertainties in the prediction. That would allow us to later rely on ObsArray (or similar) to use the predicted datasets at different scales! (The GP + OLS + WLS in pyddem all have this!) See the second sketch after this list.
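
For (1), here is a minimal sketch of what the stacking routine could look like internally. It is only an assumption of mine, not existing GTSA code: it presumes all rasters already share a common grid (otherwise rio.reproject_match would be needed first) and that an ISO-like date can be parsed from each filename.

import re

import numpy as np
import rioxarray
import xarray as xr


def open_rasterstack(list_raster_files, zarr_file):
    """Hypothetical sketch: stack single-band rasters into a chunked (time, y, x) cube on disk."""
    slices = []
    for f in list_raster_files:
        # Parse a YYYY-MM-DD or YYYYMMDD date from the filename (placeholder heuristic)
        date = re.search(r"\d{4}-?\d{2}-?\d{2}", f).group(0).replace("-", "")
        da = (
            rioxarray.open_rasterio(f, chunks={"x": 1024, "y": 1024})  # lazy, dask-backed
            .squeeze("band", drop=True)
            .expand_dims(time=[np.datetime64(f"{date[:4]}-{date[4:6]}-{date[6:]}")])
        )
        slices.append(da)
    ds = xr.concat(slices, dim="time").sortby("time").to_dataset(name="z")
    ds.to_zarr(zarr_file, mode="w")  # streams chunk by chunk, never loading the full stack
    return xr.open_zarr(zarr_file)

For (2), a rough sketch of an error-aware prediction for a single pixel's time series, using scikit-learn's GP here only for brevity (GPyTorch batch GPs would be the more scalable option): per-observation measurement variances enter the GP noise via alpha, and the predicted standard deviation comes back with the mean. Mapping this over chunks could then reuse the apply_ufunc pattern sketched above. All names and kernel parameters are illustrative.

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel


def predict_gp_1d(t_obs, y_obs, sigma_obs, t_pred):
    """Error-aware GP prediction for one pixel: measurement errors propagate to the output."""
    kernel = 1.0 * RBF(length_scale=365.0) + WhiteKernel(noise_level=0.1)
    gp = GaussianProcessRegressor(kernel=kernel, alpha=sigma_obs**2)  # per-point noise variance
    gp.fit(t_obs.reshape(-1, 1), y_obs)
    y_mean, y_std = gp.predict(t_pred.reshape(-1, 1), return_std=True)
    return y_mean, y_std  # prediction and its propagated 1-sigma uncertainty


# e.g. yearly observations with 0.1 (1-sigma) errors, predicted monthly
t_obs = np.arange(0, 3650, 365, dtype=float)
y_mean, y_std = predict_gp_1d(t_obs, np.random.rand(len(t_obs)),
                              np.full(len(t_obs), 0.1), np.arange(0, 3650, 30, dtype=float))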

In terms of ideal code structure: I'm not sure what is best... Definitely not a class-based object. I feel that an Xarray accessor could maybe work quite nicely? But we'd need to grasp all the implications for out-of-memory ops.
For instance:

import gtsa

# The package itself would only be called to open the list of files and stack them out-of-memory to a certain disk location
ds = gtsa.open_rasterstack(list_raster_files=..., zarr_file=...)
# (or this could be several functions if needed: define different tiling types? areas with different projections?)

# Then the Xarray accessor would do everything else
# For example, define additional Xarray attributes to ensure the time/space units are known, or to store the covariance of the data in space and time (based on ObsArray, maybe, if it takes off)
ds.gtsa.time_unit
ds.gtsa.space_unit

# For prediction: have a fit/apply function that returns predicted values at new spatiotemporal locations
ds_pred = ds.gtsa.predict(method=..., time_pred=..., x_pred=...)
ds_pred.to_zarr(store=...)

Do you think that would work (even out-of-memory)?
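
For reference, registering such an accessor is supported by Xarray's public API (xr.register_dataset_accessor); below is a minimal sketch where the body is only a hypothetical placeholder, not actual GTSA code.

import xarray as xr


@xr.register_dataset_accessor("gtsa")
class GTSAAccessor:
    """Hypothetical accessor exposing time/space metadata and a predict() entry point."""

    def __init__(self, xarray_obj):
        self._ds = xarray_obj

    @property
    def time_unit(self):
        # Placeholder: read from attrs set at stacking time
        return self._ds["time"].attrs.get("units", "datetime64")

    def predict(self, method=None, time_pred=None, x_pred=None):
        # Dispatch to a scalable backend (apply_ufunc / map_blocks) would go here
        raise NotImplementedError


# After import, ds.gtsa.time_unit and ds.gtsa.predict(...) become available as used above.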

That's all I've got for now 😛!
