Comments (6)

jrbourbeau commented on September 18, 2024

I took a quick look at this by comparing a few cluster spinup times with the existing coiled-runtime=0.0.3 release (i.e. the current default software environment):

%time cluster = coiled.Cluster()

and a modified version of coiled-runtime=0.0.3 that doesn't include libraries not strictly needed on the cluster, which I'm calling coiled-runtime-core. Specifically, using this conda environment file:

name: coiled-runtime-core
channels:
  - conda-forge
dependencies:
  - python ==3.9
  - pip
  - coiled
  # - nodejs ==17.8.0
  # - nb_conda_kernels ==2.3.1
  - numpy ==1.21.5
  - pandas ==1.3.5
  - dask ==2022.1.0
  - distributed ==2022.1.0
  - fsspec ==2022.3.0
  - s3fs ==2022.3.0
  - gcsfs ==2022.3.0
  - pyarrow ==7.0.0
  - python-snappy ==0.6.0
  # - jupyterlab ==3.3.2
  # - dask-labextension ==5.2.0
  - lz4 ==4.0.0
  # - ipywidgets ==7.7.0
  - numba ==0.55.1
  - scikit-learn ==1.0.2
  # - python-graphviz ==0.19.1
  - click ==8.0.0
  - xarray ==0.20.2
  - zarr ==2.11.3

and

%time cluster = coiled.Cluster(software="jrbourbeau/coiled-runtime-core")

The corresponding cluster spin up times were:

  • coiled-runtime: 2min 12s, 2min 16s, 1min 57s
  • coiled-runtime-core: 2min 11s, 2min 2s, 1min 51s

To me these times look indistinguishable given the run-to-run spread in spinup times for a single, fixed software environment.
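
As a rough sanity check on that reading, here is a minimal sketch that converts the timings listed above to seconds and compares the gap between the two means with the spread inside each environment. The `summarize` helper and the choice of max minus min as the spread are my own illustration, and three samples per environment is far too few for a real statistical test.

```python
from statistics import mean

# Spinup times reported above, converted to seconds.
timings = {
    "coiled-runtime": [2 * 60 + 12, 2 * 60 + 16, 1 * 60 + 57],
    "coiled-runtime-core": [2 * 60 + 11, 2 * 60 + 2, 1 * 60 + 51],
}


def summarize(samples):
    """Return (mean, spread), where spread is max minus min (illustrative helper)."""
    return mean(samples), max(samples) - min(samples)


summary = {name: summarize(samples) for name, samples in timings.items()}
for name, (avg, spread) in summary.items():
    print(f"{name}: mean {avg:.1f}s, spread {spread}s")

# Gap between the two environment means.
mean_gap = abs(summary["coiled-runtime"][0] - summary["coiled-runtime-core"][0])
print(f"difference in means: {mean_gap:.1f}s")
```

The means differ by about 7 seconds, while the spread within each environment is 19 to 20 seconds, so the difference is well inside the noise.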

Because of this I think we should stick with a single coiled-runtime metapackage, at least for now. Thoughts from others?

from benchmarks.

jrbourbeau commented on September 18, 2024

Perhaps it could be difficult to ensure exactly the same versions of all the non-optional packages

I think this issue is talking about something different. There are some packages that are already included and pinned in coiled-runtime, like jupyterlab and dask-labextension

https://github.com/coiled/coiled-runtime/blob/1fbfc124a6f497855b767511b668a7c59010a9ca/recipe/meta.yaml#L37-L38

that are commonly used alongside Dask, but don't need to be installed on the cluster because they are only needed client-side. Including them is nice because users don't need to worry about installing these packages manually, but it comes at a cost: the extra packages add to overall cluster startup time. This issue is concerned specifically with these sorts of packages.

jrbourbeau commented on September 18, 2024

Sounds good. Over in #172 I'm proposing we add a benchmark that monitors how long it takes a cluster to spin up. This will help inform future decisions around adding new packages / optional dependencies.
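
To give a feel for what such a benchmark might measure, here is a minimal sketch of a timing helper. It is not the implementation proposed in #172; `time_spinup` and `fake_cluster` are hypothetical names, and the stand-in factory exists only so the snippet runs without a Coiled account.

```python
import time


def time_spinup(make_cluster):
    """Call make_cluster() and return (cluster, seconds elapsed). Hypothetical helper."""
    start = time.perf_counter()
    cluster = make_cluster()
    return cluster, time.perf_counter() - start


def fake_cluster():
    """Stand-in factory (assumption): sleeps briefly instead of creating a cluster."""
    time.sleep(0.01)
    return "fake-cluster"


# A real benchmark would pass something like `lambda: coiled.Cluster()` here.
cluster, elapsed = time_spinup(fake_cluster)
print(f"spinup took {elapsed:.3f}s")
```

Recording `elapsed` on every run would make it possible to see whether adding a package shifts spinup time by more than the usual run-to-run variation.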

jrbourbeau commented on September 18, 2024

Thanks for raising this issue @hayesgb. Totally agree that packages like jupyterlab and matplotlib generally don't need to be installed on workers or schedulers. Have we tried comparing cluster spinup times with and without, for example, jupyterlab? I'm curious about how much this slows cluster spinup. For example, is this a 30 second impact (where removing jupyterlab would be a big win) or a 3 second impact?

SultanOrazbayev commented on September 18, 2024

Perhaps it could be difficult to ensure exactly the same versions of all the non-optional packages... for some use cases it might also be tricky if a user has a custom function that, for example, generates and saves a matplotlib image. If this is submitted to a worker that doesn't have matplotlib, the function will fail.
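
One way to guard against that failure mode is to check whether a package is importable before relying on it. The sketch below runs locally; `has_package` is a hypothetical helper, and the idea of shipping it to workers with distributed's `Client.run` is an assumption about how it might be used, not something tested here.

```python
import importlib.util


def has_package(name):
    """Return True if a top-level package can be imported in this process."""
    return importlib.util.find_spec(name) is not None


# Locally this reports on the client environment. On a cluster, the same
# function could be sent to each worker (for example via Client.run) to see
# whether a package such as matplotlib is present there.
for pkg in ["json", "matplotlib"]:
    print(pkg, "available" if has_package(pkg) else "missing")
```

A function that saves a matplotlib image could call `has_package("matplotlib")` first and raise a clearer error than a bare import failure.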

ncclementi commented on September 18, 2024

@jrbourbeau this sounds very reasonable. Let's stick with what we have for now and re-evaluate in the future if we add more packages.
