Comments (10)
I've felt the pain on this particular store as well, it's a nice test case for the whole stack. 3PB total, >1,000,000 chunks in each of the pressure level variables!
Looks like this is a dask problem though. All the time is spent in single-threaded code creating the array chunks.
If we skip dask with `xr.open_zarr(..., chunks=None)`, it takes 1.5s.
We currently have a `drop_variables` arg. When you have a dataset with 273 variables and you only want a couple, the inverse `keep_variables` would be a lot easier. It looks like `drop_variables` gets applied before we create the dask chunks for the arrays, so by reading the store once and then on the second read passing `drop_variables=[v for v in ds.data_vars if v != "geopotential"]`, I recover a ~1.5s read time.
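Until a `keep_variables` argument exists, the inverse can be expressed as a tiny set-difference helper. A minimal sketch; `drop_for_keep` is a hypothetical name, not an xarray API:

```python
def drop_for_keep(all_vars, keep):
    """Invert a keep-list into the drop_variables list xarray expects.

    drop_for_keep is a hypothetical helper; the pattern is just a
    set-difference over the dataset's variable names.
    """
    keep = set(keep)
    return [v for v in all_vars if v not in keep]

# e.g. xr.open_zarr(store, drop_variables=drop_for_keep(ds.data_vars, ["geopotential"]))
```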
from xarray.
Interestingly things are almost a factor of 4x worse on both those PRs, but both are out of date so may be missing other recent improvements.
This is snakeviz: https://jiffyclub.github.io/snakeviz/
FYI something like:
```python
ds = xr.open_zarr(
    "gs://gcp-public-data-arco-era5/ar/full_37-1h-0p25deg-chunk-1.zarr-v3",
    chunks=dict(time=-1, level=-1, latitude="auto", longitude="auto"),
)
```
...may give a better balance between dask task graph size and chunk size.
(I do this "make the dask chunks bigger than the zarr chunks" a lot, because simple dask graphs can become huge with a 50TB dataset, since the default zarr encoding has a maximum chunk size of 2GB. Not sure it's necessarily the best way, very open to ideas...)
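The graph-size tradeoff can be made concrete with a bit of arithmetic: the number of dask keys just to represent one array is the product of the per-dimension chunk counts. A minimal sketch, with illustrative shapes rather than the exact ERA5 store dimensions:

```python
import math

def n_chunks(shape, chunks):
    """Number of dask chunks (hence graph keys) for a single array."""
    return math.prod(math.ceil(s / c) for s, c in zip(shape, chunks))

# Illustrative hourly, 37-level, 0.25-degree grid (not the exact store dims):
shape = (350_000, 37, 721, 1440)

# Zarr-native chunking of 1 along time/level -> millions of keys per variable
small = n_chunks(shape, (1, 1, 721, 1440))

# Dask chunks spanning time/level, split only in space -> a handful of keys
big = n_chunks(shape, (350_000, 37, 721, 90))
```

With hundreds of variables, the difference between these two chunkings is the difference between a graph of a few thousand objects and one approaching a billion.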
Thanks for opening your first issue here at xarray! Be sure to follow the issue template!
If you have an idea for a solution, we would really welcome a Pull Request with proposed changes.
See the Contributing Guide for more.
It may take us a while to respond here, but we really value your contribution. Contributors like you help make xarray better.
Thank you!
Nice profile!
@jrbourbeau dask/dask#10648 probably improves this by a lot. Can that be reviewed/merged please?
edit: xref dask/dask#10269
@slevang, do you mind sharing how you are generating these profiles and associated graphs? I've struggled in the past to do this effectively with a `dask` cluster. This looks great!
^yep. Probably not useful for distributed profiling but I haven't really tried. It's just a visualization layer for `cProfile`.
In this case the creation of this monster task graph would be happening serially in the main process even if your goal was to eventually use a distributed client to run processing. The graph (only describing the array chunks) is close to a billion objects in this case, so would run into issues even trying to serialize that out to workers.
Edit: nevermind I actually have no idea where this profile came from. Disregard
How do I get my work done?
- I opened the dataset with `chunks=None`.
- Then filtered that as required, e.g. selecting only a few data variables, some fixed time ranges, and some fixed `lat`/`lon` ranges.
- Then chunked only that small dataset.

```python
ds = xr.open_zarr("gs://gcp-public-data-arco-era5/ar/full_37-1h-0p25deg-chunk-1.zarr-v3", chunks=None)
data_vars = ['surface_pressure', 'temperature']
ds = ds[data_vars]
ds = ds.sel(time=slice('2013-01-01', '2013-12-31'))
# ... and a few more filters ...
ds = ds.chunk()
```

This will only generate chunks for the filtered dataset.
These steps worked for me, so I'm closing the issue.