Comments (4)
@jrbourbeau - Thanks for opening this ticket.
A lot of the benchmarking work with dev advocacy & OSS engineering will overlap (e.g. generating the datasets, creating code to run benchmarks, making sure the Dask queries are optimized), but the actual infrastructure will differ a bit.
For the h2o benchmarks, we'll need to replicate their infrastructure exactly, which is a single r3-8xlarge node. I'm hoping to create an issue that shows, "if we structure the code like this and use your exact same infrastructure, then the Dask queries run 30 times faster" (exact numbers to be confirmed later).
For the Databricks benchmarks, which I'd also like to address, we'll need to use the environments they list in their blog post.
The GitHub Actions / Coiled infrastructure you'll set up will be great for other dev advocacy content down the road. In the near term, I hope to pair with the product engineers on replicating the h2o / Databricks environments exactly and rerunning their benchmarks with properly structured code.
Let me know if this plan sounds alright with you. I get the feeling that there is a widespread impression that "Dask is slow". I'm not seeing that in reality with the benchmarks I am running. I am hoping to dig up the truth here and write content that clears up this misinformation.
from benchmarks.
Happy to help with any questions on getting the infrastructure you need. The #engineering-clusters channel would also be a good place to ask.
Cool, joined #engineering-clusters, thanks @shughes-uk!!
Closing this as obsolete. We have a separate tracking ticket for a blog post on the h2o benchmarks, and we are about to invest work in improving the performance of the queries.
Closing in favor of https://github.com/orgs/coiled/projects/12/views/3