Code Monkey home page Code Monkey logo

tispark's Introduction

What is TiSpark?

TiSpark is a thin layer built for running Apache Spark on top of TiDB/TiKV to answer the complex OLAP queries. It takes advantages of both the Spark platform and the distributed TiKV cluster, at the same time, seamlessly glues to TiDB, the distributed OLTP database, to provide a Hybrid Transactional/Analytical Processing (HTAP) to serve as a one-stop solution for online transactions and analysis.

TiSpark Architecture

architecture

  • TiSpark integrates with Spark Catalyst Engine deeply. It provides precise control of the computing, which allows Spark read data from TiKV efficiently. It also supports index seek, which improves the performance of the point query execution significantly.

  • It utilizes several strategies to push down the computing to reduce the size of dataset handling by Spark SQL, which accelerates the query execution. It also uses the TiDB built-in statistical information for the query plan optimization.

  • From the data integration point of view, TiSpark + TiDB provides a solution runs both transaction and analysis directly on the same platform without building and maintaining any ETLs. It simplifies the system architecture and reduces the cost of maintenance.

  • In addition, you can deploy and utilize tools from the Spark ecosystem for further data processing and manipulation on TiDB. For example, using TiSpark for data analysis and ETL; retrieving data from TiKV as a machine learning data source; generating reports from the scheduling system and so on.

TiSpark depends on the existence of TiKV clusters and PDs. It also needs to setup and use Spark clustering platform.

A thin layer of TiSpark. Most of the logic is inside tikv-java-client library. https://github.com/pingcap/tikv-client-lib-java

Uses as below

./bin/spark-shell --jars /wherever-it-is/tispark-0.1.0-SNAPSHOT-jar-with-dependencies.jar

import org.apache.spark.sql.TiContext
val ti = new TiContext(spark) 

// Mapping all TiDB tables from database tpch as Spark SQL tables
ti.tidbMapDatabase("tpch")

spark.sql("select count(*) from lineitem").show

Configuration

Below configurations can be put together with spark-defaults.conf or passed in the same way as other Spark config properties.

Key Default Value Description
spark.tispark.pd.addresses 127.0.0.1:2379 PD Cluster Addresses, split by comma
spark.tispark.grpc.framesize 268435456 Max frame size of GRPC response
spark.tispark.grpc.timeout_in_sec 10 GRPC timeout time in seconds
spark.tispark.meta.reload_period_in_sec 60 Metastore reload period in seconds
spark.tispark.plan.allow_agg_pushdown true If allow aggregation pushdown (in case of busy TiKV nodes)
spark.tispark.plan.allow_index_double_read false If allow index double read (which might cause heavy pressure on TiKV)
spark.tispark.index.scan_batch_size 2000000 How many row key in batch for concurrent index scan
spark.tispark.index.scan_concurrency 5 Maximal threads for index scan retrieving row keys (shared among tasks inside each JVM)
spark.tispark.table.scan_concurrency 512 Maximal threads for table scan (shared among tasks inside each JVM)
spark.tispark.request.command.priority "Low" "Low", "Normal", "High" which impacts resource to get in TiKV. Low is recommended for not disturbing OLTP workload

Quick start

Read the Quick Start.

How to build

TiSpark depends on TiKV java client project which is included as a submodule. To build TiKV client:

./bin/build-client.sh

To build TiSpark itself:

mvn clean package

Follow us

Twitter

@PingCAP

Mailing list

[email protected]

Google Group

License

TiSpark is under the Apache 2.0 license. See the LICENSE file for details.

tispark's People

Contributors

birdstorm avatar cove9988 avatar ilovesoup avatar liancheng avatar novemser avatar wolfstudy avatar zhexuany avatar

Watchers

 avatar  avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.