
bdtk's Introduction

Intel Big Data Analytic Toolkit

Intel Big Data Analytic Toolkit (abbrev. BDTK) is a set of acceleration libraries aimed at optimizing big data analytic frameworks.

By using these libraries, the query performance of frontend SQL engines such as Prestodb and Spark can be significantly improved.

Problem Statement

Users of big data analytic frameworks have an increasingly pressing need for better performance. Most existing big data analytic frameworks are built in Java and designed primarily for CPU-only computation. To push performance closer to bare-metal hardware, this toolkit employs native implementations and leverages state-of-the-art hardware.

Furthermore, assembling solutions from building blocks has become a trend among data analytics solution providers: over the last five years, more and more SQL-based solutions have been built on top of primitive building blocks. Having performant out-of-the-box building blocks (as libraries) spares developers from building everything from scratch, and such a general-purpose toolkit can significantly reduce time-to-value for analytics solution developers.

Targeted Use Cases

BDTK focuses on the following users:

  • End users of big data analytic frameworks who are looking for performance acceleration
  • Data engineers who want Intel architecture-based optimizations
  • Database developers who are looking for reusable building blocks
  • Data scientists who are looking for heterogeneous execution

Users can reuse the implemented operators and functions to build a full-featured SQL engine. Currently, the library offers a highly optimized compiler that JIT-compiles functions for execution.

The compression codec building blocks (based on IAA and QAT) can be used directly in Hadoop/Spark for compression acceleration.

Below is a view of the personas for this project.

(Figure: BDTK personas)

Introduction

The following diagram shows the design architecture. Currently, BDTK offers several building blocks: Cider, a lightweight LLVM-based SQL compiler on top of the Arrow data format; ICL, a compression codec leveraging the Intel IAA accelerator; and QATCodec, a compression codec wrapper based on the Intel QAT accelerator.

(Figure: BDTK architecture overview)

Solutions Introduction

  • Presto E2E Solution:

    BDTK provides a Presto end-to-end acceleration solution via Velox and the Velox plugin.

    • Velox Plugin:

      The Velox plugin is a bridge that enables Big Data Analytic Toolkit on Velox. It introduces a hybrid execution mode combining compilation with the vectorization that already exists in Velox. It works as a plugin and integrates with Velox seamlessly, without changing Velox code.

    • Cider:

      A modularized, general-purpose Just-In-Time (JIT) compiler for data analytic query engines. It employs Substrait as its plan protocol, allowing it to support multiple front-end engines. Currently it provides an LLVM-based implementation derived from HeavyDB.

  • Analytic Cache Solution

    Analytic Cache aims to improve data-source-side performance for multiple big data analytic frameworks such as Apache Spark and Apache Flink. Compared to row-based execution engines, Analytic Cache uses a columnar format and performs batch computation, which boosts the performance of ad-hoc queries. It also provides QAT codec acceleration and IAA predicate pushdown.

Reusable Modules

BDTK provides several functional modules that users can use directly or integrate into their products. Brief descriptions of each module follow; details can be found on the Module Page.

The Intel Codec Library module provides a compression and decompression library that lets Apache Hadoop/Spark make use of acceleration hardware. It can leverage QAT/IAA hardware to accelerate deflate-compatible compression algorithms, and it also supports Intel software-optimized solutions such as Intel ISA-L (Intel Intelligent Storage Acceleration Library) and IPP (Intel Integrated Performance Primitives) to accelerate data compression.

Hash table performance is critical to a SQL engine: operators such as hash join and hash aggregation depend on an efficient hash table implementation.

The hash table module will provide a set of hash table implementations that are easy to use, leverage state-of-the-art hardware technology such as AVX-512, and are optimized for query-specific scenarios.
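For context on how such a hash table is used by an operator, here is a minimal, purely illustrative C++ sketch of the build and probe phases of a hash join. std::unordered_multimap stands in for the optimized, AVX-512-aware implementations this module aims to provide, and none of the names reflect BDTK's actual API.

// Illustrative only: hash-join build/probe with std::unordered_multimap
// standing in for the optimized hash tables described above.
#include <cstdint>
#include <iostream>
#include <unordered_map>
#include <utility>
#include <vector>

int main() {
  // Build side: (key, payload) rows inserted into the hash table.
  std::vector<std::pair<int64_t, int64_t>> build = {{1, 10}, {2, 20}, {2, 21}};
  std::unordered_multimap<int64_t, int64_t> table(build.begin(), build.end());

  // Probe side: stream of keys; emit every matching payload.
  for (int64_t key : {2, 3, 1}) {
    auto [lo, hi] = table.equal_range(key);
    for (auto it = lo; it != hi; ++it)
      std::cout << "match: key=" << key << " payload=" << it->second << "\n";
  }
  return 0;
}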

The JIT Lib module provides unified JIT interfaces such as Value, Ptr, and control-flow primitives to isolate operator logic from IR generation.
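As a rough, standalone illustration of this idea (the JitValue and ConstValue names below are hypothetical and do not reflect BDTK's real API), operator logic can be written once against an abstract value type while the backend decides whether to evaluate values directly or generate IR for them:

// Hypothetical sketch: expression-building logic written against an
// abstract value interface so it is isolated from how values are lowered.
// A real JIT backend would emit IR instead of computing eagerly.
#include <iostream>
#include <memory>

struct JitValue {
  virtual ~JitValue() = default;
  virtual std::shared_ptr<JitValue> add(const JitValue& rhs) const = 0;
  virtual std::shared_ptr<JitValue> mul(const JitValue& rhs) const = 0;
  virtual long eval() const = 0;
};

// Trivial "interpreter" backend used only to make the sketch runnable.
struct ConstValue : JitValue {
  long v;
  explicit ConstValue(long v) : v(v) {}
  std::shared_ptr<JitValue> add(const JitValue& rhs) const override {
    return std::make_shared<ConstValue>(v + rhs.eval());
  }
  std::shared_ptr<JitValue> mul(const JitValue& rhs) const override {
    return std::make_shared<ConstValue>(v * rhs.eval());
  }
  long eval() const override { return v; }
};

// Operator logic: expressed once, independent of the backend.
std::shared_ptr<JitValue> fma(const JitValue& a, const JitValue& b,
                              const JitValue& c) {
  return a.mul(b)->add(c);
}

int main() {
  ConstValue a{3}, b{4}, c{5};
  std::cout << fma(a, b, c)->eval() << "\n";  // prints 17
  return 0;
}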

The expression evaluation module performs projection and filter computation efficiently. It provides a runtime expression evaluation API that accepts Substrait-based expression representations and Apache Arrow-based columnar data. It currently handles only projections and filters.
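Conceptually, the API evaluates an expression once per input batch over columnar data. The standalone sketch below is only a conceptual stand-in (the real module consumes a Substrait-encoded expression plus Arrow-format columns, not std::vector), showing a projection combined with a filter over one column batch:

// Conceptual sketch only: projection (price * quantity) guarded by a
// filter (quantity > 10) over a columnar batch. The real API takes a
// Substrait expression and Arrow columnar data instead of std::vector.
#include <cstddef>
#include <cstdint>
#include <iostream>
#include <vector>

struct Batch {
  std::vector<double> price;
  std::vector<int64_t> quantity;
};

std::vector<double> eval_filter_then_project(const Batch& in) {
  std::vector<double> out;
  out.reserve(in.price.size());
  for (std::size_t i = 0; i < in.price.size(); ++i) {
    if (in.quantity[i] > 10) {                      // filter
      out.push_back(in.price[i] * in.quantity[i]);  // projection
    }
  }
  return out;
}

int main() {
  Batch batch{{1.5, 2.0, 3.0}, {5, 12, 20}};
  for (double v : eval_filter_then_project(batch)) std::cout << v << "\n";
  // prints 24 and 60
  return 0;
}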

BDTK implements typical SQL operators on top of JitLib, providing a batch-at-a-time execution model. Each operator is pluggable and can easily be integrated into other existing SQL engines. Operators BDTK targets include HashAggregation, HashJoin (HashBuild and HashProbe), and others.
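To make the batch-at-a-time model concrete, here is a minimal, self-contained sketch of a grouped COUNT aggregation; the class name and the addInput/getOutput methods are hypothetical and do not reflect BDTK's actual operator interface:

// Hypothetical sketch of a batch-at-a-time operator: a grouped COUNT
// aggregation consumes input one batch at a time and emits its result
// only after the final batch has been added.
#include <cstdint>
#include <iostream>
#include <unordered_map>
#include <utility>
#include <vector>

using Batch = std::vector<int64_t>;  // a single key column, for brevity

class HashCountAggregation {
 public:
  // Called once per input batch.
  void addInput(const Batch& keys) {
    for (int64_t k : keys) ++counts_[k];
  }
  // Called after the last batch; returns (key, count) pairs.
  std::vector<std::pair<int64_t, int64_t>> getOutput() const {
    return {counts_.begin(), counts_.end()};
  }

 private:
  std::unordered_map<int64_t, int64_t> counts_;
};

int main() {
  HashCountAggregation agg;
  agg.addInput({1, 2, 2});
  agg.addInput({2, 3});
  for (const auto& [key, count] : agg.getOutput())
    std::cout << "key=" << key << " count=" << count << "\n";
  return 0;
}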

Supported Features

Currently supported features are listed on the Project Page. Features newly supported in release 0.9 are listed on the release page.

Getting Started

Get the BDTK Source

git clone --recursive https://github.com/intel/BDTK.git
cd BDTK
# if you are updating an existing checkout
git submodule sync --recursive
git submodule update --init --recursive

Setting up the BDTK development environment with Docker on Linux

We provide a Dockerfile to help developers set up the environment and install BDTK dependencies.

  1. Build an image from the Dockerfile:
$ cd ${path_to_source_of_bdtk}/ci/docker
$ docker build -t ${image_name} .
  2. Start a Docker container for development:
$ docker run -d --name ${container_name} --privileged=true -v ${path_to_source_of_bdtk}:/workspace/bdtk ${image_name} /usr/sbin/init

How To Build

Once you have set up the Docker build environment for BDTK and fetched the source, you can enter the BDTK container and build as follows:

Run make in the root directory to compile the sources. For development, use make debug to build a non-optimized debug version, or make release to build an optimized version. Use make test-debug or make test-release to run tests.

How to Enable in Presto

To use BDTK with Prestodb, the Intel version of Prestodb is required, together with the Intel version of Velox. Detailed steps are available in the installation guide.

Roadmap

For the upcoming release, the following work items have been prioritized.

  • Better test coverage for entire library
  • Better robustness, enabling more implemented features in Prestodb (the pilot SQL engine) by improving the offloading framework
  • Better extensibility at multiple levels (incl. relational algebra operators, expression functions, data formats) by adopting a state-of-the-art multi-level compiler design
  • Complete Arrow format migration
  • Next-gen codegen framework
  • Support large volume data processing
  • Advanced features development

Code Of Conduct

Big Data Analytic Toolkit's Code of Conduct can be found here.

Online Documentation

You can find all the Big Data Analytic Toolkit documentation on the project web page.

License

Big Data Analytic Toolkit is licensed under the Apache 2.0 License. A copy of the license can be found here.


bdtk's Issues

Question: How does BDTK relate to QPL?

Hi,

I am familiar with Intel's QPL library and how it interacts with the in-memory analytics accelerator (IAA). So far, I wrote some microbenchmarks and compared the filter performance to internal vectorized filter execution code.

The README and paper for BDTK mention compression acceleration with IAA (but not QPL), and as far as I can tell there is code that references QPL in CMakefiles/wrappers. How do QPL and BDTK interact? Is QPL an official part of BDTK, is BDTK supposed to be a superset of functionality or is QPL a standalone thing that can also be used via BDTK?

I think I understand what Intel's goal is with each project, but the overall strategy of how these efforts fit together is unclear to me. Is there a roadmap or document outlining all of these analytics/query-processing/accelerator efforts from Intel?

Thank you,
Jonas
