Apache DataFusion Comet

Apache DataFusion Comet is a high-performance accelerator for Apache Spark, built on top of the powerful Apache DataFusion query engine. Comet is designed to significantly enhance the performance of Apache Spark workloads while leveraging commodity hardware and seamlessly integrating with the Spark ecosystem without requiring any code changes.

Benefits of Using Comet

Run Spark Queries at DataFusion Speeds

Comet delivers a performance speedup for many queries, enabling faster data processing and shorter time-to-insights.

The following chart shows the time it takes to run the 22 TPC-H queries against 100 GB of data in Parquet format using a single executor with 8 cores. See the Comet Benchmarking Guide for details of the environment used for these benchmarks.

When using Comet, the overall run time is reduced from 649 seconds to 433 seconds, a 1.5x speedup, with some queries showing a 2x-3x speedup.

Running the same queries with DataFusion standalone (without Spark) using the same number of cores results in a 3.9x speedup compared to Spark.
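The speedup figures above follow directly from the reported run times. As a minimal sketch of the arithmetic (the helper name `speedup` is ours, and the DataFusion run time is only implied by the 3.9x figure, not a number reported in the benchmark):

```python
def speedup(baseline_seconds: float, accelerated_seconds: float) -> float:
    """Return how many times faster the accelerated run is than the baseline."""
    if accelerated_seconds <= 0:
        raise ValueError("accelerated_seconds must be positive")
    return baseline_seconds / accelerated_seconds


# Figures quoted in the text above (22 TPC-H queries, 100 GB, 8 cores).
SPARK_SECONDS = 649.0
COMET_SECONDS = 433.0
DATAFUSION_SPEEDUP = 3.9  # DataFusion standalone relative to Spark

comet_speedup = speedup(SPARK_SECONDS, COMET_SECONDS)
time_saved_pct = 100.0 * (1.0 - COMET_SECONDS / SPARK_SECONDS)

# Derived value, not a measured one: the run time a 3.9x speedup implies.
implied_datafusion_seconds = SPARK_SECONDS / DATAFUSION_SPEEDUP

print(f"Comet speedup: {comet_speedup:.2f}x")                              # 1.50x
print(f"Run time reduced by: {time_saved_pct:.1f}%")                       # 33.3%
print(f"Implied DataFusion run time: {implied_datafusion_seconds:.0f} s")  # 166 s
```

Note that a 1.5x speedup corresponds to roughly a one-third reduction in wall-clock time, not a 50% reduction.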

Comet is not yet achieving full DataFusion speeds in all cases, but with future work we aim to provide a 2x-4x speedup for a broader set of queries.

Here is a breakdown showing relative performance of Spark, Comet, and DataFusion for each TPC-H query.

The following chart shows how much Comet currently accelerates each query from the benchmark. Performance optimization is an ongoing task, and we welcome contributions from the community to help achieve even greater speedups in the future.

These benchmarks can be reproduced in any environment using the documentation in the Comet Benchmarking Guide. We encourage you to run your own benchmarks.

Use Commodity Hardware

Comet runs on commodity hardware, eliminating the need for costly hardware upgrades or specialized accelerators such as GPUs or FPGAs. By making fuller use of the hardware you already have, Comet keeps Spark deployments cost-effective and scalable.

Spark Compatibility

Comet aims for 100% compatibility with all supported versions of Apache Spark, allowing you to integrate Comet into your existing Spark deployments and workflows seamlessly. With no code changes required, you can immediately harness the benefits of Comet's acceleration capabilities without disrupting your Spark applications.
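"No code changes" means Comet is switched on through Spark configuration rather than through edits to the application. The sketch below builds such a configuration and the matching `spark-submit` arguments. The configuration keys and the plugin class name are written from our recollection of the Comet installation instructions and should be treated as assumptions to verify there. The jar path and application name are hypothetical placeholders.

```python
from typing import Dict, List


def comet_spark_conf(comet_jar: str) -> Dict[str, str]:
    """Build Spark settings that load and enable Comet (keys are assumptions)."""
    return {
        # Standard Spark settings that put the Comet jar on the classpath.
        "spark.jars": comet_jar,
        "spark.driver.extraClassPath": comet_jar,
        "spark.executor.extraClassPath": comet_jar,
        # Assumed Comet settings; confirm against the installation guide.
        "spark.plugins": "org.apache.spark.CometPlugin",
        "spark.comet.enabled": "true",
        "spark.comet.exec.enabled": "true",
    }


def spark_submit_args(conf: Dict[str, str], application: str) -> List[str]:
    """Turn a config mapping into a spark-submit argument list."""
    args = ["spark-submit"]
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    args.append(application)
    return args


# Hypothetical jar location and application; the application itself is unchanged.
conf = comet_spark_conf("/path/to/comet-spark.jar")
print(" ".join(spark_submit_args(conf, "my_existing_job.py")))
```

Because everything lives in configuration, removing these settings returns the job to plain Spark execution.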

Tight Integration with Apache DataFusion

Comet tightly integrates with the core Apache DataFusion project, leveraging its powerful execution engine. With seamless interoperability between Comet and DataFusion, you can achieve optimal performance and efficiency in your Spark workloads.

Active Community

Comet boasts a vibrant and active community of developers, contributors, and users dedicated to advancing the capabilities of Apache DataFusion and accelerating the performance of Apache Spark.

Getting Started

To get started with Apache DataFusion Comet, follow the installation instructions. Join the DataFusion Slack and Discord channels to connect with other users, ask questions, and share your experiences with Comet.

Contributing

We welcome contributions from the community to help improve and enhance Apache DataFusion Comet. Whether it's fixing bugs, adding new features, writing documentation, or optimizing performance, your contributions are invaluable in shaping the future of Comet. Check out our contributor guide to get started.

License

Apache DataFusion Comet is licensed under the Apache License 2.0. See the LICENSE.txt file for details.

Acknowledgments

We would like to express our gratitude to the Apache DataFusion community for their support and contributions to Comet. Together, we're building a faster, more efficient future for big data processing with Apache Spark.
