Banner

    👨‍🔧 Online Retail Data Pipeline 👷

Retail Data Pipeline with Terraform, Airflow, GCP BigQuery, dbt, Soda, and Looker

Dashboard 📊 • Request Feature

๐Ÿ“ Table of Contents

  1. Project Overview
  2. Key Features
  3. Project Architecture
  4. Usage
  5. Credits
  6. Contact

🔬 Project Overview

This is an end-to-end data engineering project in which I created a robust data pipeline to extract, analyze, and visualize insights from the data.

💾 Dataset

This is a transnational dataset containing all the transactions that occurred between 01/12/2010 and 09/12/2011 for a UK-based and registered non-store online retailer. The company mainly sells unique all-occasion gifts. Many of the company's customers are wholesalers.

The dataset includes the following columns:

| Column | Description |
| --- | --- |
| InvoiceNo | Invoice number. Nominal, a 6-digit integral number uniquely assigned to each transaction. If this code starts with the letter 'c', it indicates a cancellation. |
| StockCode | Product (item) code. Nominal, a 5-digit integral number uniquely assigned to each distinct product. |
| Description | Product (item) name. Nominal. |
| Quantity | The quantities of each product (item) per transaction. Numeric. |
| InvoiceDate | Invoice date and time. Numeric, the day and time when each transaction was generated. |
| UnitPrice | Unit price. Numeric, product price per unit in sterling. |
| CustomerID | Customer number. Nominal, a 5-digit integral number uniquely assigned to each customer. |
| Country | Country name. Nominal, the name of the country where each customer resides. |
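
If you want a quick local look at the data before it enters the pipeline, the raw CSV can be inspected with pandas. This is only a sanity-check sketch; the file name online_retail.csv is an assumed local copy of the dataset and is not part of the pipeline code.

import pandas as pd

# Minimal sketch: load the raw export and flag cancelled invoices.
# "online_retail.csv" is an assumed local file name for the dataset above.
df = pd.read_csv(
    "online_retail.csv",
    dtype={"InvoiceNo": str, "StockCode": str},
    parse_dates=["InvoiceDate"],
)

# Per the column description, invoices whose number starts with 'c'/'C' are cancellations.
cancellations = df[df["InvoiceNo"].str.upper().str.startswith("C", na=False)]
print(f"{len(df)} rows, {len(cancellations)} cancelled invoice lines")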

🎯 Project Goals

  • Create a data pipeline from scratch using Apache Airflow.
  • Set up your Airflow local environment with the Astro CLI.
  • Implement data quality checks in the pipeline using Soda.
  • Integrate dbt and run data models with Airflow and Cosmos.
  • Isolate tasks to avoid dependency conflicts.
  • Upload CSV files into Google Cloud Storage.
  • Ingest data into BigQuery using the Astro SDK.
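
To make the last two goals concrete, the ingestion leg of the pipeline could be sketched as a small Airflow DAG like the one below. The connection id gcp, the bucket online-retail-bucket, the BigQuery dataset retail, and the file paths are placeholder assumptions for illustration, not the repository's actual names.

from datetime import datetime

from airflow.decorators import dag
from airflow.providers.google.cloud.transfers.local_to_gcs import LocalFilesystemToGCSOperator
from astro import sql as aql
from astro.files import File
from astro.sql.table import Metadata, Table


@dag(start_date=datetime(2023, 1, 1), schedule=None, catchup=False, tags=["retail"])
def retail_ingestion():
    # Push the raw CSV from the Astro project's include/ folder to the data lake.
    upload_csv = LocalFilesystemToGCSOperator(
        task_id="upload_csv_to_gcs",
        src="include/dataset/online_retail.csv",
        dst="raw/online_retail.csv",
        bucket="online-retail-bucket",
        gcp_conn_id="gcp",
        mime_type="text/csv",
    )

    # Load the file from the bucket into a BigQuery table with the Astro SDK.
    load_raw = aql.load_file(
        task_id="load_csv_to_bigquery",
        input_file=File("gs://online-retail-bucket/raw/online_retail.csv", conn_id="gcp"),
        output_table=Table(
            name="raw_invoices",
            conn_id="gcp",
            metadata=Metadata(schema="retail"),
        ),
        use_native_support=False,
    )

    upload_csv >> load_raw


retail_ingestion()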

🔌 Key Features

  • End-to-End Data Pipeline: This project provides a complete data engineering solution, from data ingestion to visualization.

  • Modular Airflow DAGs: Airflow Directed Acyclic Graphs (DAGs) are modular and easy to maintain, ensuring efficient pipeline execution.

  • Data Quality Checks: Ensure data integrity and quality with automated data quality checks using Soda.

  • Integration with dbt: Leverage dbt for data transformation and modeling within the Airflow pipeline.

  • Google Cloud Integration: Utilize Google Cloud Storage and BigQuery for scalable and cost-effective data storage and processing.

๐Ÿ“ Project Architecture

The end-to-end data pipeline includes the following steps:

  • Setting up the infrastructure on GCP (Terraform)
  • Downloading, processing, and uploading the initial dataset to a Data Lake (GCP Storage Bucket)
  • Moving the data from the lake to a Data Warehouse (GCP BigQuery)
  • Transforming the data in the Data Warehouse and preparing it for the dashboard (dbt)
  • Checking the quality of the data in the Data Warehouse (Soda)
  • Creating the dashboard (Looker Studio)
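
The quality-check step above is also a good place to apply the "isolate tasks" goal: Soda Core can run from its own virtual environment so its dependencies never clash with Airflow's. Below is a minimal sketch, assuming Soda Core for BigQuery is installed in a virtualenv at /usr/local/airflow/soda_venv, that the Soda configuration and SodaCL check files live under include/soda/, and that the data source is named retail; all of these names and paths are assumptions.

from airflow.decorators import task


@task.external_python(python="/usr/local/airflow/soda_venv/bin/python")
def check_data_quality(data_source="retail", checks_subpath="tables"):
    # Imported inside the function because it runs in the isolated Soda virtualenv.
    from soda.scan import Scan

    scan = Scan()
    scan.set_verbose()
    scan.set_data_source_name(data_source)
    scan.add_configuration_yaml_file("include/soda/configuration.yml")
    scan.add_sodacl_yaml_files(f"include/soda/checks/{checks_subpath}")
    scan.execute()

    # Raise if any SodaCL check failed, which marks the Airflow task as failed.
    scan.assert_no_checks_fail()
    return scan.get_scan_results()

Inside a DAG, this task is then simply chained after the load and transform steps, for example load_raw >> check_data_quality().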

You can find more detail in the diagram below:

Architecture

🔧 Pipeline Architecture

onlineretail-arch

🌪️ Pipeline on Airflow

airflowretail

⚙️ Data Modeling

Data modeling diagram
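
To run the dbt models from Airflow, Cosmos can render each model as its own task. The sketch below assumes astronomer-cosmos 1.x, a dbt project under include/dbt/, a dedicated dbt virtualenv, and the same placeholder connection id and dataset used earlier; the exact configuration classes vary between Cosmos versions, so treat this as an illustration rather than the repository's code.

from datetime import datetime

from airflow.decorators import dag
from cosmos import DbtTaskGroup, ExecutionConfig, ProfileConfig, ProjectConfig
from cosmos.profiles import GoogleCloudServiceAccountFileProfileMapping


@dag(start_date=datetime(2023, 1, 1), schedule=None, catchup=False, tags=["retail"])
def retail_transform():
    # Renders every dbt model under include/dbt/ as an Airflow task.
    DbtTaskGroup(
        group_id="transform",
        project_config=ProjectConfig("/usr/local/airflow/include/dbt"),
        profile_config=ProfileConfig(
            profile_name="retail",
            target_name="dev",
            profile_mapping=GoogleCloudServiceAccountFileProfileMapping(
                conn_id="gcp",
                profile_args={"project": "your-gcp-project-id", "dataset": "retail"},
            ),
        ),
        execution_config=ExecutionConfig(
            dbt_executable_path="/usr/local/airflow/dbt_venv/bin/dbt",
        ),
    )


retail_transform()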

🛠️ Technologies Used

  • Infrastructure: Terraform
  • Google Cloud Platform (GCP)
    • Data Lake (DL): Cloud Storage
    • Data Warehouse (DWH): BigQuery
  • Astro SDK for Airflow
  • Workflow orchestration: Apache Airflow
  • Transforming data: dbt (Data Build Tool)
  • Data quality checks: Soda
  • Containerization: Docker
  • Data Visualization: Looker Studio

💻 Usage

First, clone this repository.

git clone https://github.com/Hamagistral/OnlineRetail-DataEng.git

1. Prerequisites

Make sure you have the following components pre-installed:

  • A Google Cloud Platform account
  • Terraform
  • Docker
  • The Astro CLI

2. Google Cloud Platform

To set up GCP, please follow the steps below:

  1. If you don't have a GCP account, create a free trial account.
  2. Set up a new project and write down your Project ID.
  3. Configure a service account with access to this project and download its auth keys (.json). Make sure the service account has all the permissions below:
    • Viewer
    • Storage Admin
    • Storage Object Admin
    • BigQuery Admin
  4. Download the Google Cloud SDK for the local setup.
  5. Set the environment variable to point to your downloaded auth keys:
export GOOGLE_APPLICATION_CREDENTIALS="<path/to/your/service-account-authkeys>.json"

# Refresh token/session, and verify authentication
gcloud auth application-default login
  6. Enable the following APIs under the APIs & Services section:
  • Identity and Access Management (IAM) API
  • IAM service account credentials API
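
To confirm that the key and environment variable are picked up correctly, a quick check with the google-cloud-storage client (pip install google-cloud-storage) can help. This is just a sanity check under the setup above, not part of the pipeline:

from google.cloud import storage

# Relies on GOOGLE_APPLICATION_CREDENTIALS or the application-default login above.
client = storage.Client()
print("Authenticated against project:", client.project)
for bucket in client.list_buckets():
    print("Found bucket:", bucket.name)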

3. Terraform

We use Terraform to build and manage the GCP infrastructure. The Terraform configuration files are located in the terraform folder. There are 3 configuration files:

  • terraform-version - contains information about the installed version of Terraform;
  • variables.tf - contains variables to make your configuration more dynamic and flexible;
  • main.tf - is a key configuration file consisting of several sections.

Now you can use the steps below to create the resources inside GCP:

  1. Move to the terraform folder using the cd command.
  2. Run the terraform init command to initialize the configuration.
  3. Use terraform plan to preview local changes against the remote state.
  4. Apply the changes to the cloud with the terraform apply command.

Note: In steps 3 and 4 Terraform may ask you to specify the Project ID. Please use the ID that you noted down earlier at the project setup stage.

If you would like to remove your stack from the Cloud, use the terraform destroy command.

4. Airflow

  1. Start Airflow on your local machine by running:
astro dev start

This command will spin up 4 Docker containers on your machine, each for a different Airflow component:

  • Postgres: Airflow's Metadata Database
  • Webserver: The Airflow component responsible for rendering the Airflow UI
  • Scheduler: The Airflow component responsible for monitoring and triggering tasks
  • Triggerer: The Airflow component responsible for triggering deferred tasks
  2. Verify that all 4 Docker containers were created by running 'docker ps'.

Note: Running 'astro dev start' will start your project with the Airflow Webserver exposed at port 8080 and Postgres exposed at port 5432. If you already have either of those ports allocated, you can either stop your existing Docker containers or change the port.

  3. Access the Airflow UI for your local Airflow project. To do so, go to http://localhost:8080/ and log in with 'admin' for both your Username and Password.

You should also be able to access your Postgres Database at 'localhost:5432/postgres'.

  4. Configure your Google Cloud Platform credentials.
  5. Create and configure the necessary connections in Airflow (see the connection sketch below).
  6. Customize the Airflow DAGs to suit your specific requirements.
  7. Run the pipeline and monitor its execution.
  8. Explore the data using Looker Studio for insights and visualization.
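
For the connections in step 5, one convenient option is to define the GCP connection as an environment variable in the Astro project's .env file. The snippet below generates a connection URI from Python; the connection id gcp, the key path, and the extra field names are assumptions and may need adjusting to your Airflow and Google provider versions.

import json

from airflow.models.connection import Connection

# Hypothetical GCP connection pointing at the service account key inside the Astro project.
conn = Connection(
    conn_id="gcp",
    conn_type="google_cloud_platform",
    extra=json.dumps(
        {
            "key_path": "/usr/local/airflow/include/gcp/service_account.json",
            "project": "your-gcp-project-id",
        }
    ),
)

# Paste the output into .env as: AIRFLOW_CONN_GCP=<generated-uri>
print(conn.get_uri())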

📋 Credits

📨 Contact Me

LinkedIn • Website • Gmail
