
🚚 Bulker

Bulker is a tool for streaming and batching large amounts of semi-structured data into data warehouses. It uses Kafka internally.

How it works

Send a JSON object to Bulker's HTTP endpoint, and it will make sure the object is saved to the data warehouse:

  • JSON flattening. Your object will be flattened: {a: {b: 1}} becomes {a_b: 1}
  • Schema management for semi-structured data. For each field, Bulker makes sure that a corresponding column exists in the destination table. If not, Bulker will create it. The type is best-guessed from the value, or it can be set explicitly via a type hint, as in {"a": "test", "__sql_type_a": "varchar(4)"}
  • Reliability. Bulker puts the object into a Kafka queue immediately, so if the data warehouse is down, data won't be lost
  • Streaming or Batching. Bulker sends data to the data warehouse either as soon as it becomes available in Kafka (streaming) or after some time (batching). Most data warehouses won't tolerate a large number of individual inserts, which is why we implemented batching
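The flattening step above can be sketched in a few lines of Python. This is an illustration of the idea, not Bulker's actual implementation:

```python
def flatten(obj: dict, prefix: str = "") -> dict:
    """Flatten nested JSON objects: {"a": {"b": 1}} -> {"a_b": 1}."""
    flat = {}
    for key, value in obj.items():
        name = f"{prefix}_{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, name))  # recurse into nested objects
        else:
            flat[name] = value
    return flat

print(flatten({"a": {"b": 1}, "c": "x"}))  # -> {'a_b': 1, 'c': 'x'}
```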

Bulker is the 💜 of Jitsu, an open-source data integration platform.

See the full list of features below.

Bulker is also available as a Go library if you want to embed it into your application instead of using the HTTP server.

Features

  • ๐Ÿ›ข๏ธ Batching - Bulker sends data in batches in most efficient way for particular database. For example, for Postgres it uses COPY command, for BigQuery it uses batch-files
  • ๐Ÿšฟ Streaming - alternatively, Bulker can stream data to database. It is useful when number of records is low. Up to 10 records per second for most databases
  • ๐Ÿซ Deduplication - if configured, Bulker will deduplicate records by primary key
  • ๐Ÿ“‹ Schema management - Bulker creates tables and columns on the fly. It also flattens nested JSON-objects. Example if you send {"a": {"b": 1}} to bulker, it will make sure that there is a column a_b in the table (and will create it)
  • ๐Ÿฆพ Implicit typing - Bulker infers types of columns from JSON-data.
  • ๐Ÿ“Œ Explicit typing - Explicit types can be by type hints that are placed in JSON. Example: for event {"a": "test", "__sql_type_a": "varchar(4)"} Bulker will make sure that there is a column a, and it's type is varchar(4).
  • ๐Ÿ“ˆ Horizontal Scaling. Bulker scales horizontally. Too much data? No problem, just add Bulker instances!
  • ๐Ÿ“ฆ Dockerized - Bulker is dockerized and can be deployed to any cloud provider and k8s.
  • โ˜๏ธ Cloud Native - each Bulker instance is stateless and is configured by only few environment variables.

Supported databases

Bulker supports the following databases:

  • ✅ PostgreSQL
  • ✅ Redshift
  • ✅ Snowflake
  • ✅ ClickHouse
  • ✅ BigQuery
  • ✅ MySQL
  • ✅ S3
  • ✅ GCS

Please see Compatibility Matrix to learn what Bulker features are supported by each database.

Documentation Links

Note: we highly recommend reading Core Concepts below before diving into the details.

Core Concepts

Destinations

Bulker operates with destinations. A destination is a database or a storage service (e.g. S3, GCS). Each destination has an ID and a configuration represented by a JSON object.

Bulker exposes an HTTP API to load data into destinations; destinations are referenced by their IDs.

If the destination is a database, you'll also need to provide a destination table name.
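As a sketch, loading an event might look like the following. The endpoint path, port, and query parameter are assumptions for illustration, not Bulker's documented API; check the HTTP API docs for the real shape:

```python
import json
import urllib.request

destination_id = "my_postgres"   # hypothetical destination ID
table = "events"                 # destination table name (required for databases)

event = {"a": {"b": 1}}
req = urllib.request.Request(
    url=f"http://localhost:3042/post/{destination_id}?tableName={table}",  # assumed path/port
    data=json.dumps(event).encode(),
    headers={"Content-Type": "application/json"},
    method="POST",
)
# urllib.request.urlopen(req) would send the event to a running Bulker instance
print(req.get_method(), req.full_url)
```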

Event

The main unit of data in Bulker is an event. An event is represented as a JSON object.

Batching and Streaming (aka Destination Mode)

Bulker can send data to a database in two ways:

  • Streaming. Bulker sends events to the destination one by one. This is useful when the number of events is low (fewer than 10 events per second for most databases).
  • Batching. Bulker accumulates events and sends them periodically, once a batch is full or a timeout is reached. Batching is more efficient for large volumes of events, especially for cloud data warehouses (e.g. Postgres, ClickHouse, BigQuery).
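The batching behavior described above, flushing when a batch is full or a timeout expires, can be sketched as follows. The batch size and timeout here are illustrative values, not Bulker's defaults:

```python
import time

class Batcher:
    """Accumulate events and flush when the batch is full or a timeout expires."""

    def __init__(self, flush, max_size: int = 2, max_age_seconds: float = 60.0):
        self.flush = flush              # callback that writes a batch to the warehouse
        self.max_size = max_size
        self.max_age = max_age_seconds
        self.batch = []
        self.started = time.monotonic()

    def add(self, event: dict) -> None:
        self.batch.append(event)
        age = time.monotonic() - self.started
        if len(self.batch) >= self.max_size or age >= self.max_age:
            self.flush(self.batch)      # in a real system this would be e.g. a COPY
            self.batch = []
            self.started = time.monotonic()

flushed = []
b = Batcher(flush=flushed.append, max_size=2)
for i in range(5):
    b.add({"id": i})
print(flushed)        # two full batches of 2 events each
print(b.batch)        # one event still pending until the next flush
```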

Primary Keys and Deduplication

Optionally, Bulker can deduplicate events by primary key. This is useful when the same event can be sent to Bulker multiple times. Where available, Bulker uses primary keys; for some data warehouses, alternative strategies are used.
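Deduplication by primary key can be sketched as keeping only the latest event per key value. The key name id is an assumption for the example; Bulker's actual strategy depends on the warehouse:

```python
def deduplicate(events: list[dict], primary_key: str = "id") -> list[dict]:
    """Keep only the latest event for each primary-key value."""
    latest = {}
    for event in events:
        latest[event[primary_key]] = event  # later events overwrite earlier ones
    return list(latest.values())

events = [
    {"id": 1, "name": "first"},
    {"id": 2, "name": "second"},
    {"id": 1, "name": "updated"},  # duplicate primary key: replaces id=1
]
print(deduplicate(events))
# -> [{'id': 1, 'name': 'updated'}, {'id': 2, 'name': 'second'}]
```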

Read more about deduplication »
