
BLAZE


The Blaze accelerator for Apache Spark leverages native vectorized execution to accelerate query processing. It combines the power of the Apache Arrow-DataFusion library and the scale of the Spark distributed computing framework.

Blaze takes a fully optimized physical plan from Spark, maps it into a DataFusion execution plan, and performs the native plan computation inside the Spark executors.
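
As a quick sanity check, you can inspect the physical plan Spark chooses once the extension is enabled. The sketch below uses only standard Spark APIs and an arbitrary query; the exact native operator names that show up in the plan are Blaze-specific and not listed here.

import org.apache.spark.sql.SparkSession

// Minimal sketch, assuming Blaze has already been enabled via
// spark.sql.extensions (see "Run Spark Job with Blaze Accelerator" below).
val spark = SparkSession.builder().appName("blaze-plan-check").getOrCreate()

val df = spark.range(0, 1000000)
  .selectExpr("id % 10 AS k", "id AS v")
  .groupBy("k")
  .count()

// explain() prints the selected physical plan; with Blaze active, supported
// operators appear as their native counterparts instead of the vanilla ones.
df.explain()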

Blaze is composed of the following high-level components:

  • Blaze Spark Extension: hooks the whole accelerator into the Spark execution lifecycle (a sketch of this hook shape follows this list).
  • Native Operators: define how each SparkPlan maps to its native execution counterpart.
  • JNI Gateways: pass data and control across the JNI boundary.
  • Plan SerDe: serializes and deserializes DataFusion plans with protobuf.
  • Columnarized Shuffle: organizes shuffle data files in the Arrow-IPC format.
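
The following is a minimal, purely illustrative sketch of how a Spark 3.x session extension generally hooks into query planning. It is not Blaze's actual implementation; the real entry point is org.apache.spark.sql.blaze.BlazeSparkSessionExtension, and the class and rules below are hypothetical placeholders that leave every plan unchanged.

import org.apache.spark.sql.{SparkSession, SparkSessionExtensions}
import org.apache.spark.sql.catalyst.rules.Rule
import org.apache.spark.sql.execution.{ColumnarRule, SparkPlan}

// Hypothetical extension class; Spark instantiates whatever class is named
// in spark.sql.extensions and calls it with a SparkSessionExtensions handle.
class ExampleNativeExtension extends (SparkSessionExtensions => Unit) {
  override def apply(extensions: SparkSessionExtensions): Unit = {
    extensions.injectColumnar { _: SparkSession =>
      new ColumnarRule {
        // A real accelerator would replace supported SparkPlan nodes with
        // native counterparts here; these identity rules change nothing.
        override def preColumnarTransitions: Rule[SparkPlan] = new Rule[SparkPlan] {
          override def apply(plan: SparkPlan): SparkPlan = plan
        }
        override def postColumnarTransitions: Rule[SparkPlan] = new Rule[SparkPlan] {
          override def apply(plan: SparkPlan): SparkPlan = plan
        }
      }
    }
  }
}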

Thanks to DataFusion's inherent, well-defined extensibility, Blaze can easily be extended to support:

  • Various object stores.
  • Operators.
  • Simple and Aggregate functions.
  • File formats.

We encourage you to extend DataFusion's capabilities directly and add the corresponding support to Blaze with simple modifications to the plan SerDe and extension translation.

Build from source

To build Blaze, please follow the steps below:

  1. Install Rust

The underlying native execution library, DataFusion, is written in Rust, so you need to install Rust first to compile the project. We recommend using rustup.

curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

  2. Check out the source code.

git clone git@github.com:blaze-init/blaze.git
cd blaze

  3. Build the project.

You can build Blaze either in debug mode for testing purposes or in release mode to unlock its full potential.

./gradlew -Pmode=[dev|release-lto] build

After the build finishes, a fat JAR containing all the dependencies is generated in the target directory.

Run Spark Job with Blaze Accelerator

This section describes how to submit and configure a Spark Job with Blaze support.

You can enable the Blaze accelerator through:

$SPARK_HOME/bin/spark-[sql|submit] \
  --jars "/path/to/blaze-engine-1.0-SNAPSHOT.jar" \
  --conf spark.sql.extensions=org.apache.spark.sql.blaze.BlazeSparkSessionExtension \
  --conf spark.executor.extraClassPath="./blaze-engine-1.0-SNAPSHOT.jar" \
  .... # your original arguments go here

In addition, a series of configurations lets you control Blaze at a finer granularity (a programmatic configuration sketch follows the table).

Parameter | Default value | Description
spark.executor.memoryOverhead | executor.memory * 0.1 | The amount of non-heap memory to be allocated per executor. Blaze uses this portion of memory.
spark.blaze.memoryFraction | 0.75 | The fraction of the off-heap memory that Blaze can use during execution.
spark.blaze.batchSize | 16384 | Batch size for vectorized execution.
spark.blaze.enable.shuffle | true | If enabled, use the native, Arrow-IPC based shuffle.
spark.blaze.enable.[scan,project,filter,sort,union,sortmergejoin] | true | If enabled, offload the corresponding operator to the native engine.
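
The same settings can also be applied programmatically when building the session. A minimal sketch follows; the application name and JAR path are placeholders, and the JAR itself still has to be shipped to the executors (for example via --jars), exactly as in the spark-submit invocation above.

import org.apache.spark.sql.SparkSession

// Minimal sketch: enabling Blaze and tuning it when constructing a SparkSession.
// The JAR path is an assumption; point it at your locally built artifact.
val spark = SparkSession.builder()
  .appName("blaze-example")
  .config("spark.sql.extensions", "org.apache.spark.sql.blaze.BlazeSparkSessionExtension")
  .config("spark.executor.extraClassPath", "./blaze-engine-1.0-SNAPSHOT.jar")
  .config("spark.blaze.memoryFraction", "0.75") // fraction of off-heap memory Blaze may use
  .config("spark.blaze.batchSize", "16384")     // vectorized batch size
  .config("spark.blaze.enable.shuffle", "true") // native, Arrow-IPC based shuffle
  .getOrCreate()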

Performance

We periodically benchmark Blaze locally with a 1 TB TPC-DS Dataset to show our latest results and prevent unnoticed performance regressions. Check Benchmark Results with the latest date for the performance comparison with vanilla Spark.

Currently, you can expect up to a 2x performance boost while cutting resource consumption to roughly one fifth, with just a few configuration changes. Stay tuned and join us for more upcoming thrilling numbers.

(Figure: 20220522-memcost benchmark chart)

We also encourage you to benchmark Blaze locally and share the results with us. 🤗

Roadmap

1. Operators

Currently, there are still several operators that we cannot execute natively.

2. Compressed Shuffle

We use segmented Arrow-IPC files to store shuffle data. If we could apply IPC compression, shuffle would benefit even more, since columnar data achieves a better compression ratio. Tracked in #4.

3. UDF support

We would like to build a high-performance JVM-UDF invocation framework that can reuse the wide variety of existing UDFs written in Spark/Hive languages. These are not supported natively in Blaze at the moment.
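
For context, a JVM UDF of the kind referred to here is typically registered through the standard Spark API, as in the sketch below. The function name and logic are hypothetical; such a UDF currently runs on the JVM as usual and is not offloaded to the native engine.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("udf-example").getOrCreate()

// Hypothetical UDF, purely for illustration: Blaze cannot execute it natively yet.
spark.udf.register("plus_one", (x: Long) => x + 1)

spark.sql("SELECT plus_one(id) AS id_plus_one FROM range(10)").show()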

Community

We're using Discussions to connect with other members of our community. We hope that you:

  • Ask questions you're wondering about.
  • Share ideas.
  • Engage with other community members.
  • Welcome others and are open-minded. Remember that this is a community we build together 💪.

License

Blaze is licensed under the Apache 2.0 License. A copy of the license can be found here.
