Comments (6)
I took a quick look at this by comparing a few cluster spinup times with the existing coiled-runtime=0.0.3 release (i.e. the current default software environment):

%time cluster = coiled.Cluster()

and a modified version of coiled-runtime=0.0.3 that doesn't include libraries not strictly needed on the cluster, which I'm calling coiled-runtime-core. Specifically, using this conda environment file:
name: coiled-runtime-core
channels:
- conda-forge
dependencies:
- python ==3.9
- pip
- coiled
# - nodejs ==17.8.0
# - nb_conda_kernels ==2.3.1
- numpy ==1.21.5
- pandas ==1.3.5
- dask ==2022.1.0
- distributed ==2022.1.0
- fsspec ==2022.3.0
- s3fs ==2022.3.0
- gcsfs ==2022.3.0
- pyarrow ==7.0.0
- python-snappy ==0.6.0
# - jupyterlab ==3.3.2
# - dask-labextension ==5.2.0
- lz4 ==4.0.0
# - ipywidgets ==7.7.0
- numba ==0.55.1
- scikit-learn ==1.0.2
# - python-graphviz ==0.19.1
- click ==8.0.0
- xarray ==0.20.2
- zarr ==2.11.3
and
%time cluster = coiled.Cluster(software="jrbourbeau/coiled-runtime-core")
The corresponding cluster spinup times were:

- coiled-runtime: 2min 12s, 2min 16s, 1min 57s
- coiled-runtime-core: 2min 11s, 2min 2s, 1min 51s
To me these times look identical given the spread in spinup times for a single, specified software environment. Because of this I think we should stick with a single coiled-runtime metapackage, at least for now. Thoughts from others?
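For reference, a quick stdlib-only check of those numbers (values copied from the runs above) shows the difference in means is well within the run-to-run spread:

```python
from statistics import mean

# Spinup times from the three runs above, converted to seconds.
runtime = [2 * 60 + 12, 2 * 60 + 16, 1 * 60 + 57]  # coiled-runtime
core = [2 * 60 + 11, 2 * 60 + 2, 1 * 60 + 51]      # coiled-runtime-core

diff = mean(runtime) - mean(core)
spread = max(runtime) - min(runtime)  # run-to-run spread of a single env

print(f"mean difference: {diff:.1f}s, run-to-run spread: {spread}s")
# → mean difference: 7.0s, run-to-run spread: 19s
```

A ~7 second mean difference against a ~19 second spread over only three runs supports treating the two environments as indistinguishable here.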
from benchmarks.
> Perhaps it could be difficult to ensure exactly same versions of all the non-optional packages

I think this issue is talking about something different. There are some packages already included and pinned in coiled-runtime, like jupyterlab and dask-labextension, that are commonly used alongside Dask but don't need to be installed on the cluster because they are needed purely client-side. This is nice because users don't need to worry about manually installing these packages, but it comes at a cost because these extra packages contribute to overall cluster startup times. This issue is concerned specifically with these sorts of packages.
Sounds good. I'm proposing we add a benchmark which monitors how long it takes a cluster to spin up over in #172. This will help inform future decisions around adding new packages / optional dependencies
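One minimal shape such a spinup benchmark could take (this is a sketch under assumptions, not the actual #172 implementation) is a small timing context manager wrapped around cluster creation:

```python
import time
from contextlib import contextmanager


@contextmanager
def record_duration(results, label):
    """Record the wall-clock duration of the enclosed block in `results`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start


# Hypothetical usage in a benchmark (assumes coiled is importable and
# a Coiled account is configured; software environment name is illustrative):
#
# results = {}
# with record_duration(results, "cluster_spinup"):
#     cluster = coiled.Cluster(software="coiled/coiled-runtime")
# cluster.close()
```

Tracking `results` over scheduled runs would surface any spinup regression introduced by a new package or optional dependency.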
Thanks for raising this issue @hayesgb. Totally agree that packages like jupyterlab and matplotlib generally don't need to be installed on workers or schedulers. Have we tried comparing cluster spinup times with and without, for example, jupyterlab? I'm curious about how much this slows cluster spinup. For example, is this a 30 second impact (where removing jupyterlab would be a big win) or a 3 second impact?
Perhaps it could be difficult to ensure exactly the same versions of all the non-optional packages... For some use cases it might also be tricky: if a user has a custom function that (for example) generates and saves a matplotlib image, and it is submitted to a worker that doesn't have matplotlib, then the function will fail.
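One way to guard against that failure mode is to check, before submitting work, whether the needed package is importable on the workers. A stdlib-only sketch (the check_worker_dependency helper is hypothetical; on a live cluster you could execute it on each worker via distributed's Client.run):

```python
import importlib.util


def check_worker_dependency(module_name):
    """Return True if `module_name` is importable in this interpreter.

    Imports inside a submitted function resolve wherever the function
    executes -- on a Dask worker, that means the worker's software
    environment, not the client's. Running this check on each worker
    (e.g. client.run(check_worker_dependency, "matplotlib")) catches a
    missing package before a task fails with ModuleNotFoundError.
    """
    return importlib.util.find_spec(module_name) is not None
```

This only detects presence, not version mismatches, so it complements rather than replaces pinning the same versions on client and cluster.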
@jrbourbeau this sounds very reasonable. Let's stick with what we have for now and re-evaluate in the future if we add more packages.