replibyte's Introduction

The Simplest Way To Synchronize Your Cloud Databases

Replibyte is an application to replicate your cloud databases
from one place to the other while hiding sensitive data 🕵️‍♂️


⚠️ DEVELOPMENT IN PROGRESS - CONTRIBUTORS WANTED!! JOIN DISCORD


Install

Installation from package managers is coming soon.

Requirements for Postgres

You need to have pg_dump and psql binaries installed on your machine. Download Postgres.

git clone https://github.com/Qovery/replibyte.git

# you need to install the Rust compiler first
cargo build --release

# feel free to move the binary elsewhere
./target/release/replibyte

Usage

Example with Postgres as the source and destination database and S3 as the bridge (see the Configuration section below)

Back up your Postgres databases into S3

replibyte -c prod-conf.yaml backup run

Back up from a local Postgres dump file into S3

cat dump.sql | replibyte -c prod-conf.yaml backup run -s postgres -i

Restore your Postgres databases from S3

replibyte -c prod-conf.yaml backup list

type          name                    size    when
PostgreSQL    backup-1647706359405    154MB   Yesterday at 03:00 am
PostgreSQL    backup-1647731334517    152MB   2 days ago at 03:00 am
PostgreSQL    backup-1647734369306    149MB   3 days ago at 03:00 am

replibyte -c prod-conf.yaml restore -v latest

OR 

replibyte -c prod-conf.yaml restore -v backup-1647706359405

Configuration

Create your prod-conf.yaml configuration file to source your production database.

source:
  connection_uri: $DATABASE_URL
  transformers:
    - database: public
      table: employees
      columns:
        - name: last_name
          transformer: random
        - name: birth_date
          transformer: random-date
        - name: first_name
          transformer: first-name
bridge:
  bucket: $BUCKET_NAME
  access_key_id: $ACCESS_KEY_ID
  secret_access_key: $AWS_SECRET_ACCESS_KEY
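
The `$DATABASE_URL`-style values above read like environment variable placeholders. The following is a minimal Python sketch of how such placeholders could be expanded once the YAML has been parsed into dictionaries. It is not RepliByte's own code: the `expand` helper is hypothetical, and it assumes placeholders follow the `$NAME` syntax of Python's `string.Template`.

```python
import os
from string import Template
from typing import Any, Mapping, Optional


def expand(node: Any, env: Optional[Mapping[str, str]] = None) -> Any:
    """Recursively replace $NAME placeholders in every string of a parsed config.

    Hypothetical helper. With string.Template, a literal dollar sign
    must be written as $$ in the config value.
    """
    env = os.environ if env is None else env
    if isinstance(node, str):
        # substitute() raises KeyError when a referenced variable is unset,
        # which is safer than silently connecting with an empty URI.
        return Template(node).substitute(env)
    if isinstance(node, dict):
        return {key: expand(value, env) for key, value in node.items()}
    if isinstance(node, list):
        return [expand(item, env) for item in node]
    return node


# Example values only; in real use the default os.environ would be read.
example_env = {
    "DATABASE_URL": "postgres://user:secret@localhost:5432/db",
    "BUCKET_NAME": "my-bucket",
}
example_config = {
    "source": {"connection_uri": "$DATABASE_URL"},
    "bridge": {"bucket": "$BUCKET_NAME"},
}
expanded = expand(example_config, example_env)
```

Failing on an unset variable is a deliberate choice in this sketch: a missing secret surfaces immediately instead of at connection time.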

Run the app for the source

replibyte -c prod-conf.yaml

Destination

Create your staging-conf.yaml configuration file to sync your production database with your staging database.

bridge:
  bucket: $BUCKET_NAME
  access_key_id: $ACCESS_KEY_ID
  secret_access_key: $AWS_SECRET_ACCESS_KEY
destination:
  connection_uri: $DATABASE_URL

Run the app for the destination

replibyte -c staging-conf.yaml

How RepliByte works

RepliByte is built to replicate small and very large databases from one place (source) to the other (destination), with a bridge as the intermediary. Here is an example of what happens while replicating a Postgres database.

sequenceDiagram
    participant RepliByte
    participant Postgres (Source)
    participant AWS S3 (Bridge)
    Postgres (Source)->>RepliByte: 1. Dump data
    loop Transformer
        RepliByte->>RepliByte: 2. Obfuscate sensitive data
    end
    RepliByte->>AWS S3 (Bridge): 3. Upload obfuscated dump data
    RepliByte->>AWS S3 (Bridge): 4. Write index file

  1. RepliByte connects to the Postgres Source database and makes a full SQL dump of it.
  2. RepliByte receives the SQL dump, parses it, and generates random/fake information in real time.
  3. RepliByte streams and uploads the modified SQL dump to AWS S3 in real time.
  4. RepliByte keeps track of the uploaded SQL dump by recording it in an index file.
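
The four steps above can be sketched in Python as a chain of generators. This is an illustration under stated assumptions, not RepliByte's implementation (which is written in Rust): the bridge is a plain dictionary standing in for an S3 bucket, the dump is two hard-coded lines, and the names `index.json`, `dump.sql`, `redact_literals`, and `run_backup` are all hypothetical.

```python
import json
import re
from typing import Callable, Dict, Iterable, Iterator

INDEX_KEY = "index.json"  # hypothetical object name for the index file


def dump_lines() -> Iterator[str]:
    """Step 1 stand-in: yield SQL statements the way a dump tool would emit them."""
    yield "INSERT INTO employees VALUES ('Ada', 'Lovelace');\n"
    yield "INSERT INTO employees VALUES ('Alan', 'Turing');\n"


def redact_literals(line: str) -> str:
    """Toy transformer: replace every single-quoted SQL literal."""
    return re.sub(r"'[^']*'", "'REDACTED'", line)


def obfuscate(lines: Iterable[str], transform: Callable[[str], str]) -> Iterator[str]:
    """Step 2: rewrite each line as it streams past, without buffering the dump."""
    for line in lines:
        yield transform(line)


def upload(lines: Iterable[str], bridge: Dict[str, bytes], name: str) -> int:
    """Step 3: append each line to the bridge object and return its final size."""
    key = f"{name}/dump.sql"
    bridge[key] = b""
    for line in lines:
        # A real bridge would send each part over the network as it arrives.
        bridge[key] += line.encode("utf-8")
    return len(bridge[key])


def write_index(bridge: Dict[str, bytes], name: str, size: int, created_at: int) -> None:
    """Step 4: record the new backup in the index file."""
    index = json.loads(bridge.get(INDEX_KEY, b'{"backups": []}'))
    index["backups"].append(
        {"size": size, "directory_name": name, "created_at": created_at}
    )
    bridge[INDEX_KEY] = json.dumps(index).encode("utf-8")


def run_backup(bridge: Dict[str, bytes], created_at: int) -> str:
    """Run steps 1 to 4 and return the name of the new backup."""
    name = f"backup-{created_at}"
    size = upload(obfuscate(dump_lines(), redact_literals), bridge, name)
    write_index(bridge, name, size, created_at)
    return name
```

Calling `run_backup({}, 1647706359405)` on an empty dictionary leaves two entries in it: the obfuscated dump and the index that points to it.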

Once at least one replica of the source Postgres database is available in the S3 bucket, RepliByte can use it and inject it into the destination PostgreSQL database.

sequenceDiagram
    participant RepliByte
    participant Postgres (Destination)
    participant AWS S3 (Bridge)
    AWS S3 (Bridge)->>RepliByte: 1. Read index file
    AWS S3 (Bridge)->>RepliByte: 2. Download dump SQL file
    RepliByte->>Postgres (Destination): 3. Restore dump SQL

  1. RepliByte connects to the S3 bucket and reads the index file to find the latest SQL dump to download.
  2. RepliByte downloads the SQL dump as a stream of bytes.
  3. RepliByte restores the SQL dump into the destination Postgres database in real time.
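
The restore steps above can be sketched the same way. This Python illustration is not RepliByte's code: the bridge is a dictionary standing in for an S3 bucket, the object names `index.json` and `dump.sql` are hypothetical, and `write` is any callable that accepts bytes (in practice it could be the standard input of a `psql` process, which is an assumption on our part).

```python
import json
from typing import Callable, Dict, Iterator

INDEX_KEY = "index.json"  # hypothetical object name for the index file


def latest_backup(bridge: Dict[str, bytes]) -> str:
    """Step 1: read the index and return the directory name of the newest backup."""
    index = json.loads(bridge[INDEX_KEY])
    newest = max(index["backups"], key=lambda backup: int(backup["created_at"]))
    return newest["directory_name"]


def download(bridge: Dict[str, bytes], name: str, chunk_size: int = 32) -> Iterator[bytes]:
    """Step 2: yield the dump in fixed-size chunks instead of one large blob."""
    blob = bridge[f"{name}/dump.sql"]
    for start in range(0, len(blob), chunk_size):
        yield blob[start:start + chunk_size]


def restore(bridge: Dict[str, bytes], write: Callable[[bytes], object]) -> str:
    """Step 3: forward each chunk to the destination as soon as it arrives."""
    name = latest_backup(bridge)
    for chunk in download(bridge, name):
        write(chunk)
    return name
```

Because `restore` only ever holds one chunk at a time, its memory use does not grow with the size of the dump.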

Features

  • Complete data synchronization
  • Works across different VPCs/networks
  • Generate random/fake information
  • Back up terabytes of data (read Design)
  • On-the-fly data (de)compression (Zlib)

Here are the features we plan to support

  • Incremental data synchronization
  • Auto-detect sensitive fields and generate fake data
  • Auto-clean up bridge data

Connectors

Supported Source connectors

  • PostgreSQL
  • MySQL (Coming Soon)
  • MongoDB (Coming Soon)
  • Local dump file (Yes for PostgreSQL)

Supported Transformers

A transformer changes or hides the value of a column. RepliByte provides pre-made transformers.

Check out the list of our available Transformers
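
To make the idea concrete, here is a small Python sketch of column transformers driven by a configuration like the one shown in the Configuration section. It is only an illustration: the behavior given to `random` (same-length random letters) and `first-name` (a pick from a fixed list) is an assumption, and `TRANSFORMERS` and `transform_row` are hypothetical names, not RepliByte's API.

```python
import random
import string
from typing import Callable, Dict, List

Transformer = Callable[[str, random.Random], str]


def random_text(value: str, rng: random.Random) -> str:
    """Replace the value with random lowercase letters of the same length."""
    return "".join(rng.choice(string.ascii_lowercase) for _ in value)


def first_name(value: str, rng: random.Random) -> str:
    """Replace the value with a made-up first name from a small fixed list."""
    return rng.choice(["Alice", "Bob", "Carol", "Dave"])


# Maps the transformer names used in the config to their implementations.
TRANSFORMERS: Dict[str, Transformer] = {
    "random": random_text,
    "first-name": first_name,
}


def transform_row(
    row: Dict[str, str], columns: List[Dict[str, str]], rng: random.Random
) -> Dict[str, str]:
    """Return a copy of the row with every configured column transformed."""
    masked = dict(row)
    for column in columns:
        name = column["name"]
        if name in masked:
            masked[name] = TRANSFORMERS[column["transformer"]](masked[name], rng)
    return masked


columns = [
    {"name": "last_name", "transformer": "random"},
    {"name": "first_name", "transformer": "first-name"},
]
row = {"first_name": "Ada", "last_name": "Lovelace", "team": "R&D"}
masked = transform_row(row, columns, random.Random(42))
```

Columns that are not listed, such as `team` here, pass through unchanged, and the original `row` is left untouched.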

RepliByte Bridge

The S3 wire protocol, used by the RepliByte bridge, is supported by most cloud providers. Here is a non-exhaustive list of S3-compatible services.

Cloud Service Provider    S3 service name    S3 compatible
Amazon Web Services       S3                 Yes (Original)
Google Cloud Platform     Cloud Storage      Yes
Microsoft Azure           Blob Storage       Yes
Digital Ocean             Spaces             Yes
Scaleway                  Object Storage     Yes
Minio                     Object Storage     Yes

Feel free to drop a PR to include another S3 compatible solution.

Supported Destination connectors

  • PostgreSQL
  • MySQL (Coming Soon)
  • MongoDB (Coming Soon)
  • Local dump file (Coming soon)

Design

Low Memory and CPU footprint

Written in Rust, RepliByte can run with 512 MB of RAM and 1 CPU to replicate 1 TB of data (we are working on a benchmark). RepliByte replicates the data as a stream of bytes and does not store anything on a local disk.
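
The reason streaming keeps memory low can be shown with a short Python sketch of on-the-fly Zlib compression. It is an illustration of the general technique using Python's standard `zlib` module, not RepliByte's Rust code, and `fake_dump`, `compress_stream`, and `decompress_stream` are hypothetical names.

```python
import zlib
from typing import Iterable, Iterator


def fake_dump(rows: int) -> Iterator[bytes]:
    """Stand-in for a database dump: produce one small SQL statement at a time."""
    for i in range(rows):
        yield f"INSERT INTO t VALUES ({i});\n".encode("ascii")


def compress_stream(chunks: Iterable[bytes]) -> Iterator[bytes]:
    """Compress chunk by chunk; only the current chunk and zlib's buffers are held."""
    compressor = zlib.compressobj()
    for chunk in chunks:
        data = compressor.compress(chunk)
        if data:  # zlib may buffer small inputs and emit nothing yet
            yield data
    yield compressor.flush()


def decompress_stream(chunks: Iterable[bytes]) -> Iterator[bytes]:
    """Inverse of compress_stream, also one chunk at a time."""
    decompressor = zlib.decompressobj()
    for chunk in chunks:
        data = decompressor.decompress(chunk)
        if data:
            yield data
    tail = decompressor.flush()
    if tail:
        yield tail


# Round trip: joined here only to compare results, a real pipeline would not join.
original = b"".join(fake_dump(1000))
compressed = b"".join(compress_stream(fake_dump(1000)))
restored = b"".join(decompress_stream(compress_stream(fake_dump(1000))))
```

Since every stage is a generator, the peak memory of the pipeline depends on the chunk size and not on the total size of the dump.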

Limitations

  • Tested with Postgres 13 and 14. It should work with prior versions.

Index file structure

The index file describes the structure of your backups and lists all of them.

Here is the manifest file that you can find at the root of your target Bridge (e.g., S3).

{
  "backups": [
    {
      "size": 1024000, // in bytes
      "directory_name": "backup-{epoch timestamp}",
      "created_at": "epoch timestamp"
    }
  ]
}
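
As a rough illustration of how such an index could be consumed, the Python sketch below parses it into typed records and resolves the value passed to `restore -v`, either `latest` or an explicit backup name. It rests on assumptions: `created_at` is taken to be a numeric epoch timestamp, the sample uses concrete values without the inline comment (strict JSON does not allow comments), and `Backup`, `parse_index`, and `resolve` are hypothetical names.

```python
import json
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Backup:
    directory_name: str
    size: int        # in bytes
    created_at: int  # epoch timestamp (assumed numeric)


def parse_index(text: str) -> List[Backup]:
    """Parse the index file and return its backups, newest first."""
    raw = json.loads(text)
    backups = [
        Backup(entry["directory_name"], int(entry["size"]), int(entry["created_at"]))
        for entry in raw["backups"]
    ]
    return sorted(backups, key=lambda backup: backup.created_at, reverse=True)


def resolve(backups: List[Backup], version: str) -> Backup:
    """Map 'latest' or an explicit backup name to an entry of the index."""
    if not backups:
        raise LookupError("the index lists no backups")
    if version == "latest":
        return backups[0]
    for backup in backups:
        if backup.directory_name == version:
            return backup
    raise LookupError(f"unknown backup: {version}")


sample = """
{
  "backups": [
    {"size": 1024000, "directory_name": "backup-1647706359405", "created_at": 1647706359405},
    {"size": 2048000, "directory_name": "backup-1647731334517", "created_at": 1647731334517}
  ]
}
"""
backups = parse_index(sample)
```

With this sample, `resolve(backups, "latest")` returns the `backup-1647731334517` entry, and an unknown name raises `LookupError` rather than falling back silently.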

Motivation

At Qovery (the company behind RepliByte), developers can clone their applications and databases with just one click. However, the cloning process can be tedious and time-consuming, and we end up copying the information multiple times. With RepliByte, the Qovery team wants to provide a comprehensive way to seed cloud databases from one place to another.

The long-term motivation behind RepliByte is to provide a way to clone any database in real time. This project starts small, but has big ambitions!

Use cases

Scenario                                                                             Supported
Synchronize the whole Postgres instance                                              Yes
Synchronize the whole Postgres instance and replace sensitive data with fake data    Yes
Synchronize specific Postgres tables and replace sensitive data with fake data       WIP
Synchronize specific Postgres databases and replace sensitive data with fake data    WIP
Migrate from one database hosting platform to the other                              Yes

Do you want to support an additional use-case? Feel free to contribute by opening an issue or submitting a PR.

What is not RepliByte

RepliByte is not an ETL

RepliByte is not an ETL like AirByte, AirFlow, or Talend, and it will never be. If you need to synchronize versatile data sources, you are better off choosing a classic ETL. RepliByte is a tool that helps software engineers synchronize data between databases of the same type; you cannot use it to replicate data across different database types. As mentioned above, the primary purpose of RepliByte is to duplicate databases into different environments. You can see RepliByte as a specific use case of an ETL, where an ETL is more generic.

FAQ

⬆️ Open an issue if you have any questions - I'll pick the most common questions and put them here with the answers

Contributing

Local development

For local development, you will need to install Docker and run docker compose -f ./docker-compose-postgres-minio.yml up to start the local databases. At the moment, the Compose file includes two Postgres database instances (one source, one destination) and a Minio bridge. In the future, we will provide more options.

The Minio console is accessible at http://localhost:9001.

Once your Docker instances are running, you can run the RepliByte tests to check that everything is configured correctly:

AWS_ACCESS_KEY_ID=minioadmin AWS_SECRET_ACCESS_KEY=minioadmin cargo test

How to contribute

RepliByte is in its early stage of development and needs some time to be usable in production. We need some help, and you are welcome to contribute. To better synchronize, consider joining our #replibyte channel on our Discord. Otherwise, you can pick any open issue and contribute.

Where should I start?

Check the open issues and their priority.

How can I contact you?

3 options:

  1. Open an issue.
  2. Join our #replibyte channel on our Discord.
  3. Drop us an email to github+replibyte {at} qovery {dot} com.

Live Coding Session

Romaric, the main contributor to RepliByte, runs live coding sessions on Twitch where you can learn more about RepliByte and about developing in Rust. Feel free to join the sessions.

Thanks

Thanks to everyone sharing their ideas to make RepliByte better. We do appreciate it. I would also like to thank AirByte, a great product and a trustworthy source of inspiration for this project.

Contributors

evoxmusic, fabriceclementz, michaelgrigoryan25
