Comments (4)
This is a tricky issue. One problem in our stack is that we currently outsource nearly all actual parallelism to Dask. (The one exception is fsspec's async capabilities, which are hidden behind a separate thread housing an async event loop.)
Ideally, there would be a single runtime responsible for actually implementing concurrent data access and I/O. If all the libraries implemented async methods, then that could be made entirely the user's responsibility, i.e. you could write code like
async def my_processing_function():
    await xr.open_dataset(...)
    # which would call
    await zarr.open_group(...)
    # which would call
    await object_store.get_object(...)
The user would be responsible for starting an event loop and running the coroutine. The event loop would manage the concurrency for the whole stack and everything would be fine.
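To make the layering concrete, here is a minimal sketch using only the standard library. The functions `get_object`, `open_group`, and `open_dataset` are hypothetical stand-ins for the three layers, not real xarray, Zarr, or object-store APIs; the point is only that one user-started event loop schedules the I/O for every layer.

```python
import asyncio


# Hypothetical stand-ins for the three layers of the stack. None of these
# are real xarray, Zarr, or object-store functions.
async def get_object(key: str) -> bytes:
    await asyncio.sleep(0.01)  # simulate network latency
    return key.encode()


async def open_group(keys: list[str]) -> dict[str, bytes]:
    # Zarr-like layer: fetch every object concurrently on the caller's loop.
    blobs = await asyncio.gather(*(get_object(k) for k in keys))
    return dict(zip(keys, blobs))


async def open_dataset(keys: list[str]) -> dict[str, int]:
    # Xarray-like layer: simply awaits the layer below it.
    group = await open_group(keys)
    return {name: len(blob) for name, blob in group.items()}


def main() -> dict[str, int]:
    # The user starts the single event loop that drives the whole stack.
    return asyncio.run(open_dataset(["a", "bb", "ccc"]))


if __name__ == "__main__":
    print(main())
```

No layer here owns a thread pool or a loop of its own; concurrency is decided entirely by whoever calls `asyncio.run`.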
In Zarr we are in the process of adding the async methods. That raises the question: should Xarray add them too?
If not, then Xarray has to decide how to call async code. It could use the fsspec approach of managing an async event loop on another thread. It could manage a threadpool of its own. How would these interact with Dask / fsspec / Zarr / etc.? The futures approach proposed here is one example of how to add concurrency within Xarray.
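As a rough illustration of the "event loop on another thread" option, the sketch below uses only `asyncio` and `threading`. It is not fsspec's actual implementation; `BackgroundLoop`, `fetch`, and `fetch_all` are hypothetical names introduced for this example.

```python
import asyncio
import threading
from typing import Any, Coroutine, TypeVar

T = TypeVar("T")


class BackgroundLoop:
    """Hypothetical helper: owns an event loop running on a daemon thread."""

    def __init__(self) -> None:
        self._loop = asyncio.new_event_loop()
        self._thread = threading.Thread(target=self._loop.run_forever, daemon=True)
        self._thread.start()

    def run_sync(self, coro: Coroutine[Any, Any, T]) -> T:
        # Schedule the coroutine on the background loop and block the
        # calling (synchronous) thread until it finishes.
        future = asyncio.run_coroutine_threadsafe(coro, self._loop)
        return future.result()

    def close(self) -> None:
        # Ask the loop to stop from its own thread, then clean up.
        self._loop.call_soon_threadsafe(self._loop.stop)
        self._thread.join()
        self._loop.close()


async def fetch(key: str) -> str:
    await asyncio.sleep(0.01)  # simulate an async store read
    return key.upper()


async def fetch_all(keys: list[str]) -> list[str]:
    return list(await asyncio.gather(*(fetch(k) for k in keys)))


if __name__ == "__main__":
    bg = BackgroundLoop()
    print(bg.run_sync(fetch_all(["a", "b"])))
    bg.close()
```

The synchronous caller never sees the loop, which is convenient, but it also means each library that adopts this pattern may end up with its own hidden loop and thread.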
I feel like this conundrum really illustrates the limitations of the modularity that we value so much from our stack. I have no idea what the "right" answer is. However, my perspective has been greatly influenced by writing Tokio Rust code, which does not suffer from this delegation problem. It's a very different situation from Python.
from xarray.
Would that be compatible with async stores?
from xarray.
This idea of passing an arbitrary concurrent executor to xarray seems potentially related to #7810, which suggests allowing open_mfdataset(parallel=True) to use something other than dask.delayed to parallelize the opening of the files.
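A minimal sketch of what "pass in any executor" could look like, assuming the caller supplies a `concurrent.futures.Executor`. The helper `open_many` and the `fake_open` opener are hypothetical and are not part of xarray's API or of the proposal in #7810.

```python
from concurrent.futures import Executor, ThreadPoolExecutor
from typing import Callable, Iterable, TypeVar

T = TypeVar("T")


def open_many(
    paths: Iterable[str], opener: Callable[[str], T], executor: Executor
) -> list[T]:
    # Executor.map runs the opener concurrently and yields results in
    # the same order as the input paths.
    return list(executor.map(opener, paths))


def fake_open(path: str) -> dict:
    # Stand-in for a real file opener; it only records the path.
    return {"path": path, "size": len(path)}


if __name__ == "__main__":
    with ThreadPoolExecutor(max_workers=2) as pool:
        print(open_many(["a.nc", "bb.nc"], fake_open, pool))
```

Because the function depends only on the `Executor` interface, the same call would work with a thread pool, a process pool, or any other executor that implements `map`.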
from xarray.
FWIW this appears to do what I wanted with Zarr at least, i.e. issue concurrent loads per variable.
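import numpy as np
import xarray as xr

def concurrent_compute(ds: xr.Dataset) -> xr.Dataset:
    from concurrent.futures import ThreadPoolExecutor, as_completed

    copy = ds.copy()

    def load_variable_data(name: str, var: xr.Variable) -> tuple[str, np.ndarray]:
        # Load one variable eagerly; return its name so the result can be matched up.
        return (name, var.compute().data)

    # max_workers=None lets the executor pick its default thread count.
    with ThreadPoolExecutor(max_workers=None) as executor:
        futures = [
            executor.submit(load_variable_data, k, v) for k, v in copy.variables.items()
        ]
        # Write results back as each load finishes, in completion order.
        for future in as_completed(futures):
            name, loaded = future.result()
            copy.variables[name].data = loaded
    return copy

concurrent_compute(ds)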
from xarray.