The Simplest Way To Synchronize Your Cloud Databases

Replibyte is an application to replicate your cloud databases
from one place to the other while hiding sensitive data 🕵️‍♂️

⚠️ DEVELOPMENT IN PROGRESS - CONTRIBUTORS WANTED!! JOIN DISCORD

Install

Installation through package managers is coming soon.

Requirements for Postgres

You need to have pg_dump and psql binaries installed on your machine. Download Postgres.

git clone https://github.com/Qovery/replibyte.git
cd replibyte

# you need to install the Rust compiler first
cargo build --release

# feel free to move the binary elsewhere
./target/release/replibyte

Usage

Example with Postgres as both the Source and Destination database and S3 as the Bridge (see the Configuration section)

Backup your Postgres databases into S3

replibyte -c prod-conf.yaml backup run

Backup from local Postgres dump file into S3

cat dump.sql | replibyte -c prod-conf.yaml backup run -s postgres -i

Restore your Postgres databases from S3

replibyte -c prod-conf.yaml backup list

type          name                    size    when
PostgreSQL    backup-1647706359405    154MB   Yesterday at 03:00 am
PostgreSQL    backup-1647731334517    152MB   2 days ago at 03:00 am
PostgreSQL    backup-1647734369306    149MB   3 days ago at 03:00 am

replibyte -c prod-conf.yaml restore -v latest

OR 

replibyte -c prod-conf.yaml restore -v backup-1647706359405

Configuration

Create your prod-conf.yaml configuration file to source your production database.

source:
  connection_uri: $DATABASE_URL
  transformers:
    - database: public
      table: employees
      columns:
        - name: last_name
          transformer: random
        - name: birth_date
          transformer: random-date
        - name: first_name
          transformer: first-name
bridge:
  bucket: $BUCKET_NAME
  access_key_id: $ACCESS_KEY_ID
  secret_access_key: $AWS_SECRET_ACCESS_KEY

Run the app for the source

replibyte -c prod-conf.yaml

Destination

Create your staging-conf.yaml configuration file to sync your production database with your staging database.

bridge:
  bucket: $BUCKET_NAME
  access_key_id: $ACCESS_KEY_ID
  secret_access_key: $AWS_SECRET_ACCESS_KEY
destination:
  connection_uri: $DATABASE_URL

Run the app for the destination

replibyte -c staging-conf.yaml

How RepliByte works

RepliByte is built to replicate both small and very large databases from one place (the source) to another (the destination), using a bridge as an intermediary. Here is an example of what happens while replicating a Postgres database.

sequenceDiagram
    participant RepliByte
    participant Postgres (Source)
    participant AWS S3 (Bridge)
    Postgres (Source)->>RepliByte: 1. Dump data
    loop Transformer
        RepliByte->>RepliByte: 2. Obfuscate sensitive data
    end
    RepliByte->>AWS S3 (Bridge): 3. Upload obfuscated dump data
    RepliByte->>AWS S3 (Bridge): 4. Write index file

RepliByte connects to the Postgres Source database and makes a full SQL dump of it.
RepliByte receives the SQL dump, parses it, and generates random/fake information in real-time.
RepliByte streams and uploads the modified SQL dump in real-time on AWS S3.
RepliByte keeps track of the uploaded SQL dump by writing it into an index file.
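
The four steps above can be sketched as a small streaming pipeline. This is a minimal illustration in Python, not RepliByte's actual Rust implementation: `mask_literals`, `transform`, and `replicate` are hypothetical names, the masking rule is deliberately naive, and an in-memory `io.BytesIO` plus a plain list stand in for the S3 bridge and its index file.

```python
import io
import re
from typing import BinaryIO, Callable, Iterable, Iterator

QUOTED = re.compile(r"'[^']*'")


def mask_literals(line: str) -> str:
    # Hypothetical, naive transformer: mask quoted literals on INSERT lines
    # only. It ignores SQL escaping rules and is for illustration only.
    if line.startswith("INSERT"):
        return QUOTED.sub("'***'", line)
    return line


def transform(lines: Iterable[str], transformer: Callable[[str], str]) -> Iterator[str]:
    # Step 2: obfuscate one line at a time instead of loading the whole dump.
    for line in lines:
        yield transformer(line)


def replicate(dump: Iterable[str], bridge: BinaryIO, index: list, name: str) -> None:
    # Step 3: write each obfuscated line to the bridge as soon as it is ready.
    size = 0
    for line in transform(dump, mask_literals):
        data = line.encode("utf-8")
        bridge.write(data)
        size += len(data)
    # Step 4: record the uploaded dump in the index.
    index.append({"directory_name": name, "size": size})


# Step 1 is simulated with an in-memory dump instead of a real pg_dump stream.
dump = io.StringIO(
    "CREATE TABLE employees (first_name text, last_name text);\n"
    "INSERT INTO employees VALUES ('Ada', 'Lovelace');\n"
)
bridge, index = io.BytesIO(), []
replicate(dump, bridge, index, "backup-1647706359405")
print(bridge.getvalue().decode("utf-8"))
```

Because every stage is a generator or a single loop, only one line of the dump is held in memory at a time, which mirrors the stream-based design described in this README.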

Once at least one replica of the source Postgres database is available in the S3 bucket, RepliByte can inject it into the destination PostgreSQL database.

sequenceDiagram
    participant RepliByte
    participant Postgres (Destination)
    participant AWS S3 (Bridge)
    AWS S3 (Bridge)->>RepliByte: 1. Read index file
    AWS S3 (Bridge)->>RepliByte: 2. Download dump SQL file
    RepliByte->>Postgres (Destination): 3. Restore dump SQL

RepliByte connects to the S3 bucket and reads the index file to find the latest SQL dump to download.
RepliByte downloads the SQL dump as a stream of bytes.
RepliByte restores the SQL dump in the destination Postgres database in real-time.

Features

Complete data synchronization
Work across different VPCs/networks
Generate random/fake information
Back up terabytes of data (read Design)
On-the-fly data (de)compression (Zlib)

Here are the features we plan to support

Incremental data synchronization
Auto-detect sensitive fields and generate fake data
Auto-clean up bridge data

Connectors

Supported Source connectors

PostgreSQL
MySQL (Coming Soon)
MongoDB (Coming Soon)
Local dump file (Yes for PostgreSQL)

Supported Transformers

A transformer changes or hides the value of a column. RepliByte provides pre-made transformers.

Check out the list of our available Transformers
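
To make the idea concrete, here is a minimal Python sketch of per-column transformers. The transformer names (`random`, `random-date`, `first-name`) and the column mapping come from the configuration example in this README, but the functions, the sample name pool, and the date range are assumptions made for illustration. They do not describe how RepliByte implements its transformers.

```python
import random
import string
from datetime import date, timedelta

FIRST_NAMES = ["Alice", "Bob", "Carol", "Dave"]  # hypothetical sample pool


def random_string(value: str) -> str:
    # Replace the value with random lowercase letters of the same length.
    return "".join(random.choice(string.ascii_lowercase) for _ in value)


def random_date(value: str) -> str:
    # Replace the value with a random ISO date within about 30 years
    # after 1970-01-01 (an arbitrary range chosen for this sketch).
    start = date(1970, 1, 1)
    return (start + timedelta(days=random.randrange(365 * 30))).isoformat()


def first_name(value: str) -> str:
    # Replace the value with a first name drawn from the sample pool.
    return random.choice(FIRST_NAMES)


TRANSFORMERS = {
    "random": random_string,
    "random-date": random_date,
    "first-name": first_name,
}

# Column-to-transformer mapping, mirroring the `employees` table config.
COLUMNS = {
    "last_name": "random",
    "birth_date": "random-date",
    "first_name": "first-name",
}


def transform_row(row: dict) -> dict:
    # Apply the configured transformer to each column; columns without a
    # transformer pass through unchanged.
    return {
        col: TRANSFORMERS[COLUMNS[col]](val) if col in COLUMNS else val
        for col, val in row.items()
    }


row = {"id": "42", "first_name": "Grace", "last_name": "Hopper", "birth_date": "1906-12-09"}
print(transform_row(row))
```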

RepliByte Bridge

The S3 wire protocol, used by RepliByte bridge, is supported by most cloud providers. Here is a non-exhaustive list of S3 compatible services.

Cloud Service Provider	S3 service name	S3 compatible
Amazon Web Services	S3	Yes (Original)
Google Cloud Platform	Cloud Storage	Yes
Microsoft Azure	Blob Storage	Yes
Digital Ocean	Spaces	Yes
Scaleway	Object Storage	Yes
Minio	Object Storage	Yes

Feel free to drop a PR to include another S3 compatible solution.

Supported Destination connectors

PostgreSQL
MySQL (Coming Soon)
MongoDB (Coming Soon)
Local dump file (Coming soon)

Design

Low Memory and CPU footprint

Written in Rust, RepliByte can run with 512 MB of RAM and 1 CPU to replicate 1 TB of data (we are working on a benchmark). RepliByte replicates the data as a stream of bytes and does not store anything on a local disk.
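
As a rough illustration of why streaming keeps memory usage low, the Python sketch below compresses and decompresses data chunk by chunk with zlib, the compression format listed under Features. The function names and the 64 KiB chunk size are assumptions for this example and are not taken from RepliByte's code.

```python
import io
import zlib
from typing import BinaryIO, Iterable, Iterator

CHUNK_SIZE = 64 * 1024  # assumed chunk size, chosen for illustration


def compress_stream(source: BinaryIO, chunk_size: int = CHUNK_SIZE) -> Iterator[bytes]:
    # Read and compress one chunk at a time; the compressor keeps only a
    # small internal buffer, never the whole input.
    compressor = zlib.compressobj()
    while True:
        chunk = source.read(chunk_size)
        if not chunk:
            break
        data = compressor.compress(chunk)
        if data:
            yield data
    # Emit whatever is still buffered inside the compressor.
    yield compressor.flush()


def decompress_stream(chunks: Iterable[bytes]) -> Iterator[bytes]:
    # Decompress the compressed chunks in the order they arrive.
    decompressor = zlib.decompressobj()
    for chunk in chunks:
        data = decompressor.decompress(chunk)
        if data:
            yield data
    tail = decompressor.flush()
    if tail:
        yield tail


# Demo only: the input and the joined result are held in memory here so the
# round trip can be checked; a real pipeline would forward each chunk.
dump = b"INSERT INTO employees VALUES ('***', '***');\n" * 10_000
restored = b"".join(decompress_stream(compress_stream(io.BytesIO(dump))))
print(restored == dump)
```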

Limitations

Tested with Postgres 13 and 14. It should work with prior versions.

Index file structure

An index file describes the structure of your backups and lists all of them.

Here is the manifest file that you can find at the root of your target Bridge (e.g., S3).

{
  "backups": [
    {
      "size": 1024000, // in bytes
      "directory_name": "backup-{epoch timestamp}",
      "created_at": "epoch timestamp"
    }
  ]
}
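
A client could read this manifest to resolve `latest` or a named backup, as in the restore commands shown under Usage. The Python sketch below is one possible way to do that. It assumes `created_at` is a numeric epoch timestamp in milliseconds (the manifest above only gives a placeholder), the helper name `resolve_backup` is hypothetical, and the sizes are sample values.

```python
import json

# Sample index content; directory names reuse the backup names shown in the
# Usage section, while sizes and timestamps are illustrative values.
INDEX_JSON = """
{
  "backups": [
    {"size": 154000000, "directory_name": "backup-1647706359405", "created_at": 1647706359405},
    {"size": 152000000, "directory_name": "backup-1647731334517", "created_at": 1647731334517}
  ]
}
"""


def resolve_backup(index: dict, version: str) -> dict:
    # Return the backup entry matching `version`, where `version` is either
    # the literal "latest" or a backup directory name.
    backups = index["backups"]
    if not backups:
        raise LookupError("the index file lists no backups")
    if version == "latest":
        return max(backups, key=lambda b: b["created_at"])
    for backup in backups:
        if backup["directory_name"] == version:
            return backup
    raise LookupError(f"no backup named {version!r}")


index = json.loads(INDEX_JSON)
print(resolve_backup(index, "latest")["directory_name"])
print(resolve_backup(index, "backup-1647706359405")["size"])
```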

Motivation

At Qovery (the company behind RepliByte), developers can clone their applications and databases just with one click. However, the cloning process can be tedious and time-consuming, and we end up copying the information multiple times. With RepliByte, the Qovery team wants to provide a comprehensive way to seed cloud databases from one place to another.

The long-term motivation behind RepliByte is to provide a way to clone any database in real-time. This project starts small, but has big ambitions!

Use cases

Scenario	Supported
Synchronize the whole Postgres instance	Yes
Synchronize the whole Postgres instance and replace sensitive data with fake data	Yes
Synchronize specific Postgres tables and replace sensitive data with fake data	WIP
Synchronize specific Postgres databases and replace sensitive data with fake data	WIP
Migrate from one database hosting platform to the other	Yes

Do you want to support an additional use-case? Feel free to contribute by opening an issue or submitting a PR.

What is not RepliByte

RepliByte is not an ETL

RepliByte is not an ETL like AirByte, AirFlow, or Talend, and it never will be. If you need to synchronize versatile data sources, you are better off choosing a classic ETL. RepliByte is a tool that helps software engineers synchronize data between databases of the same type; you cannot use it to replicate data across different database types. As mentioned above, the primary purpose of RepliByte is to duplicate databases into different environments. You can see RepliByte as a specific use case of an ETL, where an ETL is more generic.

FAQ

⬆️ Open an issue if you have any questions - I'll pick the most common ones and put them here with their answers

Contributing

Local development

For local development, you will need to install Docker and run docker compose -f ./docker-compose-postgres-minio.yml up to start the local databases. At the moment, the Compose file includes two Postgres database instances (one source, one destination) and a Minio bridge. In the future, we will provide more options.

The Minio console is accessible at http://localhost:9001.

Once your Docker instances are running, you can run the RepliByte tests to check that everything is configured correctly:

AWS_ACCESS_KEY_ID=minioadmin AWS_SECRET_ACCESS_KEY=minioadmin cargo test

How to contribute

RepliByte is in an early stage of development and needs some time before it is usable in production. We need some help, and you are welcome to contribute. To coordinate better, consider joining our #replibyte channel on our Discord. Otherwise, you can pick any open issue and contribute.

Where should I start?

Check the open issues and their priority.

How can I contact you?

3 options:

Open an issue.
Join our #replibyte channel on our Discord.
Drop us an email to github+replibyte {at} qovery {dot} com.

Live Coding Session

Romaric, the main contributor to RepliByte, runs live coding sessions on Twitch where you can learn more about RepliByte and about developing in Rust. Feel free to join the sessions.

Thanks

Thanks to everyone sharing their ideas to make RepliByte better. We really appreciate it. I would also like to thank AirByte, a great product and a trustworthy source of inspiration for this project.
