
PBG: Modern Data Engineering

This repository aims to investigate, experiment with, and understand modern open-source tools and approaches to data engineering with Python.

In this repository we have created a dashboard for the Lombardia (Italy) healthcare system to understand how it is working.

This is just a test and sample experiment.

What we cover

  1. How to define the cluster
  2. How to create a simple DAG
  3. Tasks and operators
  4. Dummy operators and file sensors
  5. Scheduling
  6. XCom transfer
  7. Plugins
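
To give a taste of the topics above, here is a minimal sketch of a simple DAG with a dummy operator, a Python task, a daily schedule and an implicit XCom push. It assumes Airflow 2.x; all names are illustrative and not taken from this repository.

from datetime import datetime

from airflow import DAG
from airflow.operators.dummy import DummyOperator  # EmptyOperator in newer Airflow releases
from airflow.operators.python import PythonOperator


def extract():
    # The return value is automatically pushed to XCom,
    # so downstream tasks can pull it with xcom_pull
    return {"rows": 42}


with DAG(
    dag_id="example_simple_dag",        # illustrative name
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",         # scheduling
    catchup=False,
) as dag:
    start = DummyOperator(task_id="start")
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    start >> extract_task               # task dependency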

Status

  • Create DAGs for data import from different sources
  • Create DAGs for data manipulation
  • Ingest new data into the database
  • Create visualizations and analyses on the Streamlit dashboard

Completed

  • Streamlit template
  • Airflow cluster
  • Project setup

Tools

  • Python
  • Poetry
  • Airflow
  • Streamlit
  • Jupyter (for exploration)
  • Folium (for map visualization)
  • Pydantic
  • SQLAlchemy (pyodbc)

If you want to use SQL Server you have to install the Microsoft ODBC driver libraries for SQL Server.

On Linux you can follow the official Microsoft guide.

For us the commands on the official website did not work, so we used this procedure to install the drivers:

# on Ubuntu 20.04
sudo apt update
sudo apt upgrade -y
sudo su
curl https://packages.microsoft.com/keys/microsoft.asc | apt-key add -
curl https://packages.microsoft.com/config/ubuntu/20.04/prod.list > /etc/apt/sources.list.d/mssql-release.list
apt-get update
ACCEPT_EULA=Y apt-get install -y msodbcsql18
ACCEPT_EULA=Y apt-get install -y msodbcsql17
ACCEPT_EULA=Y apt-get install -y mssql-tools18
echo 'export PATH="$PATH:/opt/mssql-tools18/bin"' >> ~/.bashrc
echo 'export PATH="$PATH:/opt/mssql-tools17/bin"' >> ~/.bashrc
echo 'export PATH="$PATH:/opt/mssql-tools18/bin"' >> ~/.zshrc
echo 'export PATH="$PATH:/opt/mssql-tools17/bin"' >> ~/.zshrc
source ~/.bashrc
sudo apt-get install -y unixodbc-dev
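
Once the driver is installed, you can check the connection from Python with SQLAlchemy and pyodbc. A minimal sketch, assuming the msodbcsql18 driver and placeholder credentials:

from sqlalchemy import create_engine, text

# Placeholder credentials; ODBC Driver 18 encrypts connections by default
engine = create_engine(
    "mssql+pyodbc://<user>:<password>@<server>:1433/<database>"
    "?driver=ODBC+Driver+18+for+SQL+Server"
)

with engine.connect() as conn:
    print(conn.execute(text("SELECT 1")).scalar())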

Secrets and settings

If you want to pass secrets and settings to the Airflow cluster for use inside the DAGs, you have to create a .env file in the root of the repository before launching docker-compose.yml.

All the secrets will be read by Airflow and used within the cluster by the DAGs.

The settings that you can place inside the .env file are listed below (a sketch of how a DAG reads them follows the list):

  • AIRFLOW_UID = the user ID for the user inside the cluster
  • AIRFLOW_GID = the group ID for the user inside the cluster
  • DB_ADDRESS = the Azure SQL database address
  • DB_NAME = the Azure SQL database name
  • DB_NAME_EXTERNAL = the Azure SQL external database name (final staging database)
  • DB_USER = the username for the Azure database
  • DB_PASSWORD = the password for the database user
  • BLOB_CONN_STR = the connection string for the Azure Blob Storage account
  • BLOB_CONTAINER = the name of the container in the Azure Blob Storage account
  • BLOB_NAME = the name of the blob you want to use (CSV file)
  • API_URL = the URL of the API you want to crawl
  • LINK_WEB = the link to the online resources
  • API_DATA_URL = the Socrata data URL
  • API_DATA_SET = the ID and name of the Socrata dataset you want to download
  • API_DATA_TOKEN = the token used to connect to the Socrata APIs
  • CHROME_DRIVER_PATH = the path to the Selenium Chrome driver inside the cluster
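
Inside a DAG these values arrive as ordinary environment variables (assuming the docker-compose setup forwards the .env entries into the containers, as described above). A minimal sketch with illustrative DAG and task names:

import os
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def show_target_db():
    # Read the settings injected through the .env file
    db_address = os.environ["DB_ADDRESS"]
    db_name = os.environ["DB_NAME"]
    print(f"Target database: {db_name} on {db_address}")


with DAG(
    dag_id="read_env_settings_example",  # illustrative name
    start_date=datetime(2023, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    PythonOperator(task_id="show_target_db", python_callable=show_target_db)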

Repository and code

This is a monorepo, so all the code is included here.
Inside this repository you can find:

  1. The Airflow ETL code with the DAGs
  2. The Streamlit application (inside the app folder)
  3. The Visual Studio Code dev container
  4. Some notebooks and queries for exploratory analysis

Streamlit UI

Since this repository is a monorepo, inside the app folder you can find the multipage Streamlit application containing the dashboards.

To use this application you need to create a secrets.toml file inside the .streamlit folder with these keys:

  • server = the database server address
  • database = the database name
  • username = the database username
  • password = the database password

The credentials are for SQL Server, in particular for the Azure SQL instance that we used for this tutorial.
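
On the Python side these keys are read through st.secrets. A minimal sketch, assuming pyodbc and the Microsoft ODBC Driver 18 installed as described above:

import pyodbc
import streamlit as st

# The keys below come from .streamlit/secrets.toml
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 18 for SQL Server};"
    f"SERVER={st.secrets['server']};"
    f"DATABASE={st.secrets['database']};"
    f"UID={st.secrets['username']};"
    f"PWD={st.secrets['password']}"
)

cursor = conn.cursor()
cursor.execute("SELECT 1")
st.write(cursor.fetchone())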

Useful commands

# launch the Airflow infrastructure
docker-compose -f docker-compose.airflow.yml up --build -d

# launch the ETL container that runs Airflow with the DAGs

# launch the Streamlit application

# stop or remove all the Docker containers for a specific docker-compose configuration
docker-compose -f docker-compose.airflow.yml stop  # if you want to stop them
docker-compose -f docker-compose.airflow.yml rm  # if you want to remove them

# show information about running and stopped Docker containers
docker ps -a

# show the logs (with timestamps) for a Docker container
docker logs -t <container name>

# open a shell inside a Docker container (/bin/bash)
docker exec -it <container name> /bin/bash

# inspect a container and get all its details
docker inspect <container name>

# stop containers and delete volumes and images
docker-compose -f docker-compose.airflow.yml down --volumes --rmi all

# launch the whole project with the Makefile

If you want to forward a port from the server to your machine (nothing here is exposed to the internet on the PBG server), you can do:

ssh -L [LOCAL_IP:]LOCAL_PORT:DESTINATION:DESTINATION_PORT [USER@]SSH_SERVER

# example (the pbg server is defined in the ssh config file)
ssh -N -L 8042:localhost:8080 pbg

You can add your machine configuration inside the ssh config file for your user: ~/.ssh/config

Considerations, errors and problems

A collection of comments, notes and problems that we encountered during development, and how to fix them.

Airflow manual configuration

If you want to use FileSensor capabilities, you have to manually create the connection in the web interface under Admin > Connections, adding a new connection that points to the files on the machine.

If you want to know more about FileSensor, please refer to this guide

You can also specify connections programmatically; please refer to the official documentation if you want to know more.
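
For reference, a FileSensor task looks roughly like this. It assumes a filesystem connection (for example the default fs_default) created as described above, the path is illustrative, and the task must be defined inside a DAG:

from airflow.sensors.filesystem import FileSensor

wait_for_csv = FileSensor(
    task_id="wait_for_csv",
    fs_conn_id="fs_default",     # the connection created under Admin > Connections
    filepath="data/input.csv",   # illustrative path, resolved against the connection's base path
    poke_interval=60,            # seconds between checks
)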

pyodbc

To use pyodbc on Linux you have to install the following dependency: sudo apt install unixodbc-dev

On Linux and in Docker we also suggest installing these packages:

sudo apt install vim libpq-dev gcc curl openssh-client git unixodbc-dev libxml2-dev libxslt1-dev zlib1g-dev g++
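
After installing these packages you can check that pyodbc sees the Microsoft driver:

import pyodbc

# Should list the installed ODBC drivers, e.g. "ODBC Driver 18 for SQL Server"
print(pyodbc.drivers())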

psycopg library

To use psycopg2 for the Postgres database connection, it's important to install the requirements on your (Linux-based) system: sudo apt-get install libpq-dev
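
A minimal connection check with psycopg2 (host and credentials are placeholders):

import psycopg2

conn = psycopg2.connect(
    host="localhost",       # placeholder
    dbname="postgres",      # placeholder
    user="postgres",        # placeholder
    password="postgres",    # placeholder
)
with conn.cursor() as cur:
    cur.execute("SELECT version();")
    print(cur.fetchone())
conn.close()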

To launch the project from the terminal, if you are in the project root, run: PYTHONPATH="./" python ./test/<name of the script>.py

Be careful not to install virtualenv via apt on Linux; install it with pip instead.

Useful Documentation

If you want to monitor and control the ETL you have to connect to the web interface.

  • If you run this on a server, don't forget to forward the port via ssh (see the Useful commands section above)
  • Monitor DAG executions at: http://ip:8080/
  • Monitor Celery execution at: http://ip:5555

Default credentials for Airflow web interface:

  • user: airflow
  • password: airflow

