
modern_open_data_stack

Info

This repository is a fork of Nicolas Roux's package. The original package is available at the link below:

Ref: https://gitlab.com/nicolas217/modern-open-source-data-stack

Login Information

Airflow

URL: http://localhost:8080

Username: airflow

Password: airflow

Airbyte

URL: http://localhost:8000

Username: airbyte

Password: password

Superset

URL: http://localhost:8088

Username: admin

Password: admin

Cartelis

Cartelis is an operational data & CRM consulting firm. We improve marketing ROI and streamline the customer's digital journey. We work with our clients to first design and implement martech solutions and then help them capitalize on their data to prioritize impactful marketing actions.

Integration tool based on open-source data management frameworks

Designed to be self-sufficient on a local or remote machine (limited to Ubuntu 18.04 as of 30/06/2021), this project brings together best-of-breed open-source frameworks in order to provide automated data migration, hosting, and transformation features in a single consolidated and fully customizable tool. Specifically, it currently puts to use:

  • Airflow: to orchestrate and automate the entire data pipeline
  • Airbyte: to integrate data, relying on a growing set of connectors
  • DBT: to transform data hosted in a data warehouse with models based largely on SQL syntax
  • PostgreSQL: to host and transform data

Requirements

An objective of this project is to avoid version conflicts due to the rapidly evolving nature of these open-source frameworks, so the tool is designed to rely on specific versions of each software package:

  • Airflow: 2.7.3
  • Airbyte: 0.50.38
  • DBT: 1.7.4
  • PostgreSQL: 13

Setup

  • Clone the repository
  • Enter the repository: cd data-integration
  • Run ./setup.sh
  • Once the build has terminated, list the running Docker containers: sudo docker ps
  • Find the container named airflow-worker and open a shell inside it: sudo docker exec -it <container ID> /bin/bash
  • Inside the container, check whether dbt is installed (not fully working as of 19/12/2023): dbt --version
  • If an error is returned, run: pip install dbt-core==1.7.4 dbt-postgres==1.7.4

Connections

  • Airflow webserver UI: localhost:8080
  • Airbyte webserver UI: localhost:8000
  • PostgreSQL server: localhost:5400
  • PostgreSQL server: localhost:5399

Accessing PostgreSQL from the host machine

  • docker exec -it <container ID> psql -U airflow (where <container ID> is the PostgreSQL container)
  • Useful commands:
    • \l: list databases
    • \c <database>: connect to a database
    • \dt: list relations in the current database
    • TABLE <table_name>: view table data
    • SELECT * FROM <table_name>: view table data

Manipulating the installation

  • Stop and remove containers: docker-compose down -v in the docker folder
  • Stop containers: docker-compose stop in the docker folder
  • Start stopped containers: docker-compose start in the docker folder
  • Stop and delete containers, volumes, and images: docker-compose down --volumes --rmi all

Setting up dbt in the container (manually)

  • Access the container: sudo docker exec -it <airflow worker> /bin/bash
  • Install vim: apt-get update and apt-get install vim
  • Check which dbt version is installed: dbt --version
  • Check which dbt Python packages are installed: pip list; if they are missing, install them: pip install dbt-core==1.7.4 dbt-postgres==1.7.4
  • In the docker folder, create a local dbt project in the dbt folder: dbt init <project_name>
  • Enter the dbt project: cd <project_name>
  • Edit the dbt_project.yml file: vim dbt_project.yml
  • Replace "my_new_project", "default", and the my_new_project key under models with <project_name>
  • Navigate to the .dbt folder (/root/.dbt) and edit the profiles.yml file to configure a PostgreSQL connection (host is the host machine's IP address, and the port number is 5400)

Ref: https://docs.getdbt.com/reference/warehouse-profiles/postgres-profile

DAGs

Airflow's purpose in this implementation is to orchestrate Airbyte, DBT, and PostgreSQL tasks in order to bring the data pipeline as close to full automation as possible.

Airbyte DAGs

DAGs written to run integration tasks between external sources and the local PostgreSQL server make use of the Airbyte operators provided by the apache-airflow-providers-airbyte Python package.

Airbyte-Airflow connection: set up in Airflow under Admin > Connections, with the connection type set to HTTP.

The Airbyte task needs to reference the Airbyte connection ID, which can be found in the connection's URL in the Airbyte UI.
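
Below is a minimal sketch of such a DAG. The Airflow connection ID (airbyte_conn), the connection UUID placeholder, and the schedule are illustrative assumptions, not values taken from this repository.

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.airbyte.operators.airbyte import AirbyteTriggerSyncOperator

with DAG(
    dag_id="airbyte_sync_example",
    start_date=datetime(2023, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    trigger_sync = AirbyteTriggerSyncOperator(
        task_id="trigger_airbyte_sync",
        airbyte_conn_id="airbyte_conn",             # HTTP connection created in Airflow's admin connections
        connection_id="<airbyte connection UUID>",  # placeholder: copy it from the connection URL in the Airbyte UI
        asynchronous=False,                         # wait for the sync to finish before the task succeeds
        timeout=3600,
    )
```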

Airflow-Airbyte

Connect Airflow to Airbyte with a new connection.

DBT DAGs

While it may be easier to run all DBT models with a single task in a DAG (Directed Acyclic Graph) file, this approach risks re-running every model even when no upstream data has changed, and it leaves little flexibility for debugging.

A better way is a DAG file that parses dbt's manifest.json, creates an individual task for each identified node, and runs or tests those nodes while respecting their dependencies, as sketched below.

Ref: https://www.astronomer.io/blog/airflow-dbt-1
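
As a rough sketch of this approach (the project path, DAG id, and schedule below are placeholders for this installation, and only dbt run tasks are created; the article linked above extends the idea to test tasks):

```python
import json
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# Placeholder: point this at the dbt project inside the airflow-worker container.
DBT_PROJECT_DIR = "/opt/airflow/dbt/<project_name>"

with DAG(
    dag_id="dbt_manifest_example",
    start_date=datetime(2023, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    # manifest.json is generated by dbt (e.g. dbt compile) in the project's target folder.
    with open(f"{DBT_PROJECT_DIR}/target/manifest.json") as f:
        manifest = json.load(f)

    tasks = {}
    # One `dbt run` task per model node identified in the manifest.
    for node_id, node in manifest["nodes"].items():
        if node["resource_type"] == "model":
            tasks[node_id] = BashOperator(
                task_id=node_id.replace(".", "_"),
                bash_command=(
                    f"dbt run --select {node['name']} "
                    f"--project-dir {DBT_PROJECT_DIR}"
                ),
            )

    # Recreate the dependencies dbt recorded between model nodes.
    for node_id, task in tasks.items():
        for upstream_id in manifest["nodes"][node_id]["depends_on"]["nodes"]:
            if upstream_id in tasks:
                tasks[upstream_id] >> task
```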

