Comments (10)
I've felt the pain on this particular store as well, it's a nice test case for the whole stack. 3PB total, >1,000,000 chunks in each of the pressure level variables!
Looks like this is a dask problem though. All the time is spent in single-threaded code creating the array chunks.
If we skip dask with `xr.open_zarr(..., chunks=None)`, it takes 1.5s.
We currently have a `drop_variables` arg. When you have a dataset with 273 variables and you only want a couple, the inverse `keep_variables` would be a lot easier. It looks like `drop_variables` gets applied before we create the dask chunks for the arrays, so by reading the store once and then on the second read passing `drop_variables=[v for v in ds.data_vars if v != "geopotential"]`, I recover a ~1.5s read time.
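Until a `keep_variables` argument exists, the inverse can be expressed as a tiny set-difference helper. A minimal sketch; `drop_for_keep` is a hypothetical name, not an xarray API:

```python
def drop_for_keep(all_vars, keep):
    """Invert a keep-list into the drop_variables list xarray expects.

    drop_for_keep is a hypothetical helper; the pattern is just a
    set-difference over the dataset's variable names.
    """
    keep = set(keep)
    return [v for v in all_vars if v not in keep]

# e.g. xr.open_zarr(store, drop_variables=drop_for_keep(ds.data_vars, ["geopotential"]))
```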
from xarray.
Interestingly things are almost a factor of 4x worse on both those PRs, but both are out of date so may be missing other recent improvements.
This is snakeviz: https://jiffyclub.github.io/snakeviz/
FYI something like:
```python
ds = xr.open_zarr(
    "gs://gcp-public-data-arco-era5/ar/full_37-1h-0p25deg-chunk-1.zarr-v3",
    chunks=dict(time=-1, level=-1, latitude="auto", longitude="auto"),
)
```
...may give a better balance between dask task graph size and chunk size.
(I do this "make the dask chunks bigger than the zarr chunks" a lot, because simple dask graphs can become huge with a 50TB dataset, since the default zarr encoding has a maximum chunk size of 2GB. Not sure it's necessarily the best way, very open to ideas...)
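The graph-size tradeoff can be made concrete with a bit of arithmetic: the number of dask keys just to represent one array is the product of the per-dimension chunk counts. A minimal sketch, with illustrative shapes rather than the exact ERA5 store dimensions:

```python
import math

def n_chunks(shape, chunks):
    """Number of dask chunks (hence graph keys) for a single array."""
    return math.prod(math.ceil(s / c) for s, c in zip(shape, chunks))

# Illustrative hourly, 37-level, 0.25-degree grid (not the exact store dims):
shape = (350_000, 37, 721, 1440)

# Zarr-native chunking of 1 along time/level -> millions of keys per variable
small = n_chunks(shape, (1, 1, 721, 1440))

# Dask chunks spanning time/level, split only in space -> a handful of keys
big = n_chunks(shape, (350_000, 37, 721, 90))
```

With hundreds of variables, the difference between these two chunkings is the difference between a graph of a few thousand objects and one approaching a billion.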
Thanks for opening your first issue here at xarray! Be sure to follow the issue template!
If you have an idea for a solution, we would really welcome a Pull Request with proposed changes.
See the Contributing Guide for more.
It may take us a while to respond here, but we really value your contribution. Contributors like you help make xarray better.
Thank you!
Nice profile!
@jrbourbeau dask/dask#10648 probably improves this by a lot. Can that be reviewed/merged please?
edit: xref dask/dask#10269
@slevang, do you mind sharing how you are generating these profiles and associated graphs? I've struggled in the past to do this effectively with a `dask` cluster. This looks great!
^yep. Probably not useful for distributed profiling but I haven't really tried. It's just a visualization layer for `cProfile`.
In this case the creation of this monster task graph would be happening serially in the main process even if your goal was to eventually use a distributed client to run processing. The graph (only describing the array chunks) is close to a billion objects in this case, so would run into issues even trying to serialize that out to workers.
Edit: nevermind I actually have no idea where this profile came from. Disregard
How do I get my work done?
- I opened the dataset with `chunks=None`.
- Then filtered that as required, e.g. selecting only a few data variables, some fixed time ranges, and some fixed `lat`/`lon` ranges.
- Then chunked only that small dataset.

```python
ds = xr.open_zarr("gs://gcp-public-data-arco-era5/ar/full_37-1h-0p25deg-chunk-1.zarr-v3", chunks=None)
data_vars = ['surface_pressure', 'temperature']
ds = ds[data_vars]
ds = ds.sel(time=slice('2013-01-01', '2013-12-31'))
# ... and a few more filters ...
ds = ds.chunk()
```

This will only generate chunks for the filtered dataset.
These steps worked for me, so I'm closing the issue.