AirbyteServerless

Airbyte made simple



๐Ÿ”๏ธ What is AirbyteServerless?

AirbyteServerless is a simple tool to manage Airbyte connectors, run them locally or deploy them in serverless mode.


💡 Why AirbyteServerless?

Airbyte is a must-have in your data stack, with its catalog of open-source connectors to move your data from any source to your data warehouse.

To manage these connectors, Airbyte offers the Airbyte-Open-Source-Platform, which includes a server, workers, a database, a UI, an orchestrator, connectors, a secret manager, a logs manager, etc.

AirbyteServerless aims to offer a lightweight alternative to the Airbyte-Open-Source-Platform that simplifies connector management.


๐Ÿ“ Comparing Airbyte-Open-Source-Platform & AirbyteServerless

| Airbyte-Open-Source-Platform | AirbyteServerless |
|---|---|
| Has a UI | Has NO UI. Connection configurations are managed by documented yaml files. |
| Has a database | Has NO database. Configuration files are versioned in git, and the destination stores the state (the checkpoint of where the sync stopped) and the logs, which can then be visualized with your preferred BI tool. |
| Has a transform layer. Airbyte loads your data in a raw format but then enables you to perform basic transforms such as replace, upsert and schema normalization. | Has NO transform layer. Data is appended to your destination in raw format. airbyte_serverless is dedicated to doing one thing and doing it well: Extract-Load. |
| NOT serverless. Can be deployed on a VM or a Kubernetes cluster. The platform is made of tens of dependent containers that you CANNOT deploy serverless. | Serverless. An Airbyte source docker image is upgraded with a destination connector, and the upgraded docker image can then be deployed as an isolated Cloud Run Job (or Cloud Run Service). Cloud Run is natively monitored with metrics, dashboards, logs, error reporting, alerting, etc., and jobs can be scheduled or triggered by events. |
| Is scalable with conditions. Scalable if deployed on an autoscaled Kubernetes cluster, and if you are skilled enough. 👉 Check that you are skilled enough with Kubernetes by watching this video 😁. | Is scalable. Each connector is deployed independently of the others; you can have as many as you want. |

💥 Getting Started with abs CLI

abs is the CLI (command-line interface) of AirbyteServerless; it facilitates connector management.

Install abs 🛠️

pip install airbyte-serverless
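
If you prefer an isolated environment, here is a minimal install sketch; nothing in it is specific to airbyte-serverless beyond the package name:

    # Create and activate a virtual environment (optional but recommended)
    python -m venv .venv
    source .venv/bin/activate

    # Install the abs CLI from PyPI
    pip install airbyte-serverless

    # Check that the CLI is available
    abs --help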

Create your first Connection 👨‍💻

abs create my_first_connection --source="airbyte/source-faker:0.1.4" --destination="bigquery:my_project.my_dataset" --remote-runner "cloud_run_job"
  1. Docker is required. Make sure you have it installed.
  2. The source param can be any public docker Airbyte source (here is the list). We recommend that you use the faker source to get started.
  3. The destination param must be one of the following:
    • print (the default value if not set)
    • bigquery
    • contributions are welcome to offer more destinations 🤗
  4. The remote-runner param must be cloud_run_job; more integrations will come in the future. The remote runner is only used if you want to run the connection remotely and schedule it.
  5. The command creates a configuration file ./connections/my_first_connection.yaml with an initialized configuration (a hedged sketch of its shape is shown after this list).
  6. Update this configuration file to suit your needs.
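
For orientation, below is a sketch of the general shape such a configuration file can take. The field names and values (docker image, destination dataset, runner settings) are illustrative assumptions, not the authoritative template; the file generated by abs create for your source is the reference.

    # Illustrative sketch only -- the real template is generated by `abs create`
    source:
      docker_image: "airbyte/source-faker:0.1.4"  # Airbyte source connector image
      config:                                     # source-specific settings (shape depends on the connector)
        count: 100
    destination:
      connector: "bigquery"                       # one of: print, bigquery
      config:
        dataset: "my_project.my_dataset"          # hypothetical target dataset
    remote_runner:
      type: "cloud_run_job"
      config: {}                                  # runner settings (project, region, service account, ...)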

Run it! ⚡

abs run my_first_connection
  1. This launches an Extract-Load job from the source to the destination.
  2. The run command will only work if you have correctly edited the ./connections/my_first_connection.yaml configuration file.
  3. If you chose the bigquery destination, you must:
    • have gcloud installed on your machine, with default credentials initialized via the command gcloud auth application-default login (see the sketch after this list);
    • have correctly edited the destination section of the ./connections/my_first_connection.yaml configuration file. You must have the dataEditor permission on the chosen BigQuery dataset.
  4. Data is always appended at the destination (never replaced nor upserted), and it arrives in raw format.
  5. If the connector supports incremental extraction (extracting only new or recently modified data), then this mode is used.
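
A minimal sketch of the BigQuery prerequisites; my_project and you@example.com are placeholders to adapt, while the gcloud commands themselves are standard:

    # Initialize the default credentials that the local run will use
    gcloud auth application-default login

    # Grant yourself dataEditor (shown at project level for brevity;
    # a dataset-level grant also works)
    gcloud projects add-iam-policy-binding my_project \
        --member="user:you@example.com" \
        --role="roles/bigquery.dataEditor"

    # Launch the Extract-Load job
    abs run my_first_connection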

Select only some streams 🧛🏼

You may not want to copy all the data that the source can get. To see all available streams, run:

abs list-available-streams my_first_connection

If you want to configure your connection with only some of these streams, run:

abs set-streams my_first_connection "stream1,stream2"

Subsequent runs will extract the selected streams only.
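
End to end, stream selection looks like this; stream1 and stream2 are hypothetical names, the real ones depend on your source:

    # List the streams the source exposes
    abs list-available-streams my_first_connection

    # Keep only two of them (comma-separated)
    abs set-streams my_first_connection "stream1,stream2"

    # From now on, runs extract only the selected streams
    abs run my_first_connection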

Run from the Remote Runner 🚀

abs remote-run my_first_connection
  1. The remote-run command will only work if you have correctly edited the ./connections/my_first_connection.yaml configuration file, including the remote_runner part.
  2. This command launches an Extract-Load job like the abs run command. The main difference is that the job runs on a remotely deployed container (Cloud Run Job is the only supported container runner for now).
  3. If you chose the bigquery destination, the selected service account must be bigquery.dataEditor on the target dataset and must be allowed to create BigQuery jobs in the project (a hedged sketch of these grants follows).
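
A sketch of the corresponding IAM grants, assuming a hypothetical service account my-runner@my_project.iam.gserviceaccount.com; adapt the names to your remote_runner configuration. Project-level bindings are shown for brevity:

    # Let the runner's service account write data to BigQuery
    gcloud projects add-iam-policy-binding my_project \
        --member="serviceAccount:my-runner@my_project.iam.gserviceaccount.com" \
        --role="roles/bigquery.dataEditor"

    # Let it create BigQuery jobs (loads, queries) in the project
    gcloud projects add-iam-policy-binding my_project \
        --member="serviceAccount:my-runner@my_project.iam.gserviceaccount.com" \
        --role="roles/bigquery.jobUser"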

Schedule the run from the Remote Runner ⏱️

abs schedule-remote-run my_first_connection "0 * * * *"

โš ๏ธ THIS IS NOT IMPLEMENTED YET

Get help 📙

$ abs --help
Usage: abs [OPTIONS] COMMAND [ARGS]...

Options:
  --help  Show this message and exit.

Commands:
  create                  Create CONNECTION
  list                    List created connections
  list-available-streams  List available streams of CONNECTION
  remote-run              Run CONNECTION Extract-Load Job from remote runner
  run                     Run CONNECTION Extract-Load Job
  run-env-vars            Run Extract-Load Job configured by environment...
  set-streams             Set STREAMS to retrieve for CONNECTION (STREAMS...

👋 Contribute

Any contribution is more than welcome 🤗!

  • Add a ⭐ on the repo to show your support
  • Raise an issue to report a bug or suggest improvements
  • Open a PR! Below are some suggestions of work to be done:
    • implement a scheduler
    • improve secrets management (use a secret manager)
    • implement the get_logs method of BigQueryDestination
    • add a new destination connector (Cloud Storage?)
    • add more remote runners, such as compute instances
    • implement VPC access
    • implement optional post-processing (replace or upsert data at the destination instead of appending?)

๐Ÿ† Credits

  • Big kudos to Airbyte for all the hard work on connectors!
  • The generation of the sample connector configuration in yaml is heavily inspired by the code of the octavia CLI developed by Airbyte.
