
airbyte-module's Introduction

License

airbyte-module

The airbyte-module (ABM) for Fybrik is a FybrikModule which makes use of Airbyte connectors.

ABM is both an Apache Arrow Flight and an HTTP server.

What is Airbyte?

Airbyte is a data integration tool that focuses on extracting and loading data.

Airbyte has a vast catalog of connectors that support dozens of data sources and data destinations. These Airbyte connectors run in docker containers and are built in accordance with the Airbyte specification.
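Since connectors follow the Airbyte specification, they can be driven directly with docker. The sketch below builds the `read` invocation the specification mandates; it is an illustration of the protocol, not the module's actual code, and assumes docker is installed and the config/catalog files already exist.

```python
# Sketch: drive an Airbyte source connector per the Airbyte specification.
# The connector runs as a docker container, reads its configuration and
# configured catalog from mounted files, and emits JSON lines on stdout.

def build_read_command(image, config_path, catalog_path):
    """Build the `docker run ... read` command defined by the Airbyte spec."""
    return [
        "docker", "run", "--rm",
        "-v", f"{config_path}:/secrets/config.json",
        "-v", f"{catalog_path}:/secrets/catalog.json",
        image,
        "read",
        "--config", "/secrets/config.json",
        "--catalog", "/secrets/catalog.json",
    ]
```

A caller would pass the resulting list to `subprocess.Popen` and parse each stdout line as an Airbyte protocol message.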

What is the Airbyte Module?

ABM is an arrow-flight server that enables applications to consume tabular data from a wide range of data sources.

Since Airbyte connectors are implemented as docker images and run as docker containers, the Airbyte Module does not require Airbyte as a prerequisite. To run the Airbyte Module, only docker is required.
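As a rough illustration of the consuming side, an application could request a dataset from the Flight server with a ticket like the one below. The ticket layout, endpoint, and asset name here are assumptions for illustration, not the module's documented API.

```python
# Sketch: how an application might request a dataset from the ABM
# arrow-flight server. The {"asset": ..., "columns": ...} ticket shape
# is a hypothetical example, not the module's actual contract.
import json

def make_ticket(asset_name, columns=None):
    """Encode an asset request (with an optional column subset) as ticket bytes."""
    return json.dumps({"asset": asset_name, "columns": columns or []}).encode()

# The actual call requires a running server, e.g.:
#   import pyarrow.flight as fl
#   client = fl.connect("grpc://localhost:8080")       # placeholder address
#   reader = client.do_get(fl.Ticket(make_ticket("userdata")))
#   table = reader.read_all()
```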

How to run the Airbyte Module server locally

Follow the instructions in the sample folder.

How to deploy the Airbyte Module to kubernetes using helm

Follow the instructions in the helm folder.

How a Fybrik Application can access a dataset, using an Airbyte FybrikModule

If you would like to run a use case where the application has unrestricted access to a dataset, follow the instructions here.

However, if you are interested in a use case where the governance policies mandate that some of the dataset columns must be redacted, follow the instructions here. In this scenario, both the airbyte module and the arrow-flight-module are deployed. The airbyte module reads the dataset, whereas the arrow-flight-module transforms the dataset based on the governance policies.

airbyte-module's People

Contributors

cdoron, revit13, simanadler, elsalant, mohammad-nassar10, dependabot[bot], roytman, shlomitk1

Stargazers

Dinh Ngoc Hien, Jarrian Gojar, Mohamed Chorfa

Watchers

Ronen Kat and four others

airbyte-module's Issues

Arrow flight in code comments

In the code in the abm folder (connector.py and server.py), there are many comments and variable names referring to arrow flight.
I understand that the arrow-flight-module was used as a starting point, but these comments and variables should probably be fixed.

Support for secret passing

@cdoron
Some Airbyte connectors, like S3, require that secrets be passed to them. Currently, the only way to do this is to put the secret keyword and value in the Asset yaml, where they eventually end up in a mounted file (/etc/conf/conf.yaml). This is of course a security hole. Instead, the secrets (the keyword expected by the Airbyte connector and its value) should be stored either in Vault or in a Secrets asset; the module should then extract them, append them to the other values specified in the Asset yaml, and pass the result as configuration parameters to the airbyte connector.
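The proposed flow could look roughly like the sketch below: keep only non-sensitive values in the Asset yaml, fetch the secrets at runtime, and merge the two before invoking the connector. How the secrets are fetched (Vault, a Kubernetes Secret) is left open; only the merge step is shown.

```python
# Sketch of the proposed secret-passing flow. The secrets dict would come
# from a runtime lookup (e.g. Vault), never from the mounted conf.yaml.

def merge_connector_config(asset_values, secrets):
    """Append fetched secrets to the non-sensitive values from the Asset yaml."""
    config = dict(asset_values)   # values read from /etc/conf/conf.yaml
    config.update(secrets)        # secrets fetched at runtime, kept off disk
    return config
```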

Support TLS

Add TLS support:

  • run the server with TLS enabled.
  • make the connection between the server and vault using TLS
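For the first item, pyarrow's FlightServerBase accepts a tls_certificates argument of (certificate, key) pairs, so enabling TLS could look roughly like this sketch; the file paths and server class name are placeholders.

```python
# Sketch: loading a PEM certificate/key pair in the form pyarrow.flight's
# FlightServerBase expects for its tls_certificates argument.

def load_tls_certificates(cert_path, key_path):
    """Read a PEM certificate and private key as a list of (cert, key) pairs."""
    with open(cert_path, "rb") as f:
        cert = f.read()
    with open(key_path, "rb") as f:
        key = f.read()
    return [(cert, key)]

# Server side (requires pyarrow; MyFlightServer is a placeholder subclass):
#   server = MyFlightServer("grpc+tls://0.0.0.0:8080",
#                           tls_certificates=load_tls_certificates(cert_path, key_path))
```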

Problems to run airbyte-module on OpenShift

When we deployed airbyte-module on OpenShift, we faced a number of issues:

  • The security context was not accepted by OpenShift.
  • In order to run the dind container, the pod has the privileged: true property, which was not acceptable to OpenShift.
  • Without the privileged: true property, the docker daemon container cannot run.

Add a test for using mysql

A test should be added for reading from and writing to a MySQL database.
The test can be run without Fybrik.
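Such a test would need a configuration for the airbyte/source-mysql connector; a minimal sketch is below. The field names are assumptions to be checked against the connector's own spec output (`docker run airbyte/source-mysql spec`).

```python
# Sketch: build a configuration dict for the airbyte/source-mysql connector,
# usable in a Fybrik-free test. Field names are assumed from the connector
# spec and should be verified against `docker run airbyte/source-mysql spec`.

def mysql_source_config(host, port, database, username, password):
    return {
        "host": host,
        "port": port,
        "database": database,
        "username": username,
        "password": password,
    }
```

A test could write this dict to a config.json, start a throwaway MySQL container, and run the connector's `check` and `read` commands against it.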

Redefine "file" connection structure

The current connection is defined in asset.yaml as follows:

connection:
      name: file
      file:
        connector: "airbyte/source-file"
        dataset_name: userdata
        format: parquet
        url: "https://github.com/Teradata/kylo/raw/master/samples/sample-data/parquet/userdata2.parquet"
        provider:
          storage: HTTPS

A number of issues:

  1. Name "file" is misleading. Files can be local or remote, and not all connectors that support one type support the other. Thus, I suggest changing the name; "remote" or "https", suggestions are welcome.
  2. The "connector" property is an implementation detail, not a part of the connection. It should be determined in the module charts.
  3. "provider" is redundant, since "https" already appears in the url, unless I am missing something here.
  4. "dataset_name" is (1) redundant and (2) incorrect, since the file appears in the url as userdata2.
  5. format is not part of the connection; it appears under dataFormat of the asset.
    I suggest having something like this:
connection:
      name: remote
      remote:
        protocol: HTTPS
        host: "github.com"
        folder: "Teradata/kylo/raw/master/samples/sample-data/parquet"
        file: "userdata2.parquet"

or

connection:
      name: https
      https:
        url: "https://github.com/Teradata/kylo/raw/master/samples/sample-data/parquet/userdata2.parquet"

The connection should be defined in Fybrik in a custom layer pkg/storage/layers/connection.yaml
@cdoron @Mohammad-nassar10 @revit13 @simanadler
