
Comments (10)

slevang commented on September 24, 2024

I've felt the pain on this particular store as well; it's a nice test case for the whole stack. 3 PB total, with >1,000,000 chunks in each of the pressure-level variables!

Looks like this is a dask problem though. All the time is spent in single-threaded code creating the array chunks.

[Screenshot from 2024-04-02: profile of the open_zarr call, with all time spent in single-threaded dask chunk creation]

If we skip dask with xr.open_zarr(..., chunks=None), it takes 1.5s.

We currently have a drop_variables arg. When you have a dataset with 273 variables and you only want a couple, the inverse keep_variables would be a lot easier. It looks like drop_variables gets applied before we create the dask chunks for the arrays, so by reading the store once to list the variables and then re-reading with drop_variables=[v for v in ds.data_vars if v != "geopotential"], I recover a ~1.5s read time.
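
A minimal sketch of that two-pass workaround (keep_variables doesn't exist yet, so the inverse list is built from a cheap first read; geopotential is the example variable used above):

import xarray as xr

url = "gs://gcp-public-data-arco-era5/ar/full_37-1h-0p25deg-chunk-1.zarr-v3"

# First pass: skip dask entirely, so opening is cheap (~1.5s) and we can
# enumerate the variables.
meta = xr.open_zarr(url, chunks=None)

# Second pass: drop everything except the variable we want. drop_variables
# is applied before the dask chunks are created, so this read is also ~1.5s.
ds = xr.open_zarr(
    url,
    drop_variables=[v for v in meta.data_vars if v != "geopotential"],
)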


slevang commented on September 24, 2024

Interestingly, things are almost 4x worse on both of those PRs, but both are out of date, so they may be missing other recent improvements.

[Screenshot from 2024-04-02: profile showing the ~4x slower open on those PRs]


dcherian commented on September 24, 2024

This is snakeviz: https://jiffyclub.github.io/snakeviz/


max-sixty commented on September 24, 2024

FYI something like:

ds = xr.open_zarr(
    "gs://gcp-public-data-arco-era5/ar/full_37-1h-0p25deg-chunk-1.zarr-v3",
    chunks=dict(time=-1, level=-1, latitude="auto", longitude="auto"),
)

...may give a better balance between dask task graph size and chunk size.

(I do this "make the dask chunks bigger than the zarr chunks" a lot, because simple dask graphs can become huge with a 50 TB dataset, since the default zarr encoding has a maximum chunk size of 2 GB. Not sure it's necessarily the best way, very open to ideas...)
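
To see the graph-size tradeoff concretely, here's a hypothetical sketch that counts how many dask chunks a variable ends up with under the coarser chunking (geopotential is just an example variable from this store):

import math

import xarray as xr

url = "gs://gcp-public-data-arco-era5/ar/full_37-1h-0p25deg-chunk-1.zarr-v3"

# Coarser dask chunks spanning many zarr chunks: one chunk over all of
# time and level, with dask choosing lat/lon sizes automatically.
ds = xr.open_zarr(
    url, chunks=dict(time=-1, level=-1, latitude="auto", longitude="auto")
)

arr = ds["geopotential"].data   # the underlying dask array
print(arr.numblocks)            # chunk count per dimension
print(math.prod(arr.numblocks)) # total chunks in the task graph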


welcome commented on September 24, 2024

Thanks for opening your first issue here at xarray! Be sure to follow the issue template!
If you have an idea for a solution, we would really welcome a Pull Request with proposed changes.
See the Contributing Guide for more.
It may take us a while to respond here, but we really value your contribution. Contributors like you help make xarray better.
Thank you!


dcherian commented on September 24, 2024

Nice profile!

@jrbourbeau dask/dask#10648 probably improves this by a lot. Can that be reviewed/merged please?

edit: xref dask/dask#10269


riley-brady commented on September 24, 2024

@slevang, do you mind sharing how you are generating these profiles and associated graphs? I've struggled in the past to do this effectively with a dask cluster. This looks great!


slevang commented on September 24, 2024

^ Yep. Probably not useful for distributed profiling, but I haven't really tried. It's just a visualization layer for cProfile.

In this case, the creation of this monster task graph would happen serially in the main process even if your goal were eventually to run the processing on a distributed client. The graph (describing only the array chunks) is close to a billion objects here, so you'd run into issues even trying to serialize it out to workers.
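
For reference, a minimal sketch of that workflow (profile the single-process open with cProfile, then point snakeviz at the dump file):

import cProfile

import xarray as xr

url = "gs://gcp-public-data-arco-era5/ar/full_37-1h-0p25deg-chunk-1.zarr-v3"

# Record a single-process profile of the slow open.
with cProfile.Profile() as prof:
    ds = xr.open_zarr(url)
prof.dump_stats("open_zarr.prof")

# Then, from a shell, open the interactive visualization in a browser:
#   snakeviz open_zarr.prof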


slevang commented on September 24, 2024

Edit: never mind, I actually have no idea where this profile came from. Disregard.


DarshanSP19 commented on September 24, 2024

Here's how I got my work done:

  • I opened the dataset with chunks=None.
  • Then I filtered it down as needed: selecting only a few data variables, a fixed time range, and a fixed lat/lon range.
  • Then I chunked only that small dataset, as in the snippet below.
ds = xr.open_zarr("gs://gcp-public-data-arco-era5/ar/full_37-1h-0p25deg-chunk-1.zarr-v3", chunks=None)
data_vars = ['surface_pressure', 'temperature']
ds = ds[data_vars]
ds = ds.sel(time=slice('2013-01-01', '2013-12-31'))
# ... and a few more filters ...
ds = ds.chunk()

This only generates dask chunks for the filtered dataset.
These steps worked for me, so I'm closing the issue.

