Banner

    👨‍🔧 Online Retail Data Pipeline 👷

Retail Data Pipeline with Terraform, Airflow, GCP BigQuery, dbt, Soda, and Looker

Dashboard 📊 • Request Feature

๐Ÿ“ Table of Contents

  1. Project Overview
  2. Key Features
  3. Project Architecture
  4. Usage
  5. Credits
  6. Contact

🔬 Project Overview

This is an end-to-end data engineering project in which I created a robust data pipeline to extract, analyze, and visualize insights from the data.

💾 Dataset

This is a transnational dataset containing all the transactions that occurred between 01/12/2010 and 09/12/2011 for a UK-based and registered non-store online retailer. The company mainly sells unique all-occasion gifts. Many of the company's customers are wholesalers.

The dataset includes the following columns:

| Column | Description |
| --- | --- |
| InvoiceNo | Invoice number. Nominal, a 6-digit integral number uniquely assigned to each transaction. If this code starts with the letter 'c', it indicates a cancellation. |
| StockCode | Product (item) code. Nominal, a 5-digit integral number uniquely assigned to each distinct product. |
| Description | Product (item) name. Nominal. |
| Quantity | The quantities of each product (item) per transaction. Numeric. |
| InvoiceDate | Invoice date and time. Numeric, the day and time when each transaction was generated. |
| UnitPrice | Unit price. Numeric, product price per unit in sterling. |
| CustomerID | Customer number. Nominal, a 5-digit integral number uniquely assigned to each customer. |
| Country | Country name. Nominal, the name of the country where each customer resides. |
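
If you want a quick local look at the data before it enters the pipeline, the raw CSV can be inspected with pandas. This is only a sanity-check sketch; the file name online_retail.csv is an assumed local copy of the dataset and is not part of the pipeline code.

import pandas as pd

# Minimal sketch: load the raw export and flag cancelled invoices.
# "online_retail.csv" is an assumed local file name for the dataset above.
df = pd.read_csv(
    "online_retail.csv",
    dtype={"InvoiceNo": str, "StockCode": str},
    parse_dates=["InvoiceDate"],
)

# Per the column description, invoices whose number starts with 'c'/'C' are cancellations.
cancellations = df[df["InvoiceNo"].str.upper().str.startswith("C", na=False)]
print(f"{len(df)} rows, {len(cancellations)} cancelled invoice lines")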

🎯 Project Goals

  • Create a data pipeline from scratch using Apache Airflow.
  • Set up your Airflow local environment with the Astro CLI.
  • Implement data quality checks in the pipeline using Soda.
  • Integrate dbt and run data models with Airflow and Cosmos.
  • Isolate tasks to avoid dependency conflicts.
  • Upload CSV files into Google Cloud Storage.
  • Ingest data into BigQuery using the Astro SDK.
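
To make the last two goals concrete, the ingestion leg of the pipeline could be sketched as a small Airflow DAG like the one below. The connection id gcp, the bucket online-retail-bucket, the BigQuery dataset retail, and the file paths are placeholder assumptions for illustration, not the repository's actual names.

from datetime import datetime

from airflow.decorators import dag
from airflow.providers.google.cloud.transfers.local_to_gcs import LocalFilesystemToGCSOperator
from astro import sql as aql
from astro.files import File
from astro.sql.table import Metadata, Table


@dag(start_date=datetime(2023, 1, 1), schedule=None, catchup=False, tags=["retail"])
def retail_ingestion():
    # Push the raw CSV from the Astro project's include/ folder to the data lake.
    upload_csv = LocalFilesystemToGCSOperator(
        task_id="upload_csv_to_gcs",
        src="include/dataset/online_retail.csv",
        dst="raw/online_retail.csv",
        bucket="online-retail-bucket",
        gcp_conn_id="gcp",
        mime_type="text/csv",
    )

    # Load the file from the bucket into a BigQuery table with the Astro SDK.
    load_raw = aql.load_file(
        task_id="load_csv_to_bigquery",
        input_file=File("gs://online-retail-bucket/raw/online_retail.csv", conn_id="gcp"),
        output_table=Table(
            name="raw_invoices",
            conn_id="gcp",
            metadata=Metadata(schema="retail"),
        ),
        use_native_support=False,
    )

    upload_csv >> load_raw


retail_ingestion()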

🔌 Key Features

  • End-to-End Data Pipeline: This project provides a complete data engineering solution, from data ingestion to visualization.

  • Modular Airflow DAGs: Airflow Directed Acyclic Graphs (DAGs) are modular and easy to maintain, ensuring efficient pipeline execution.

  • Data Quality Checks: Ensure data integrity and quality with automated data quality checks using Soda.

  • Integration with dbt: Leverage dbt for data transformation and modeling within the Airflow pipeline.

  • Google Cloud Integration: Utilize Google Cloud Storage and BigQuery for scalable and cost-effective data storage and processing.

๐Ÿ“ Project Architecture

The end-to-end data pipeline includes the following steps:

  • Setting up the infrastructure on GCP (Terraform)
  • Downloading, processing, and uploading the initial dataset to a Data Lake (GCP Storage Bucket)
  • Moving the data from the lake to a Data Warehouse (GCP BigQuery)
  • Transforming the data in the Data Warehouse and preparing it for the dashboard (dbt)
  • Checking the quality of the data in the Data Warehouse (Soda)
  • Creating the dashboard (Looker Studio)
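
The quality-check step above is also a good place to apply the "isolate tasks" goal: Soda Core can run from its own virtual environment so its dependencies never clash with Airflow's. Below is a minimal sketch, assuming Soda Core for BigQuery is installed in a virtualenv at /usr/local/airflow/soda_venv, that the Soda configuration and SodaCL check files live under include/soda/, and that the data source is named retail; all of these names and paths are assumptions.

from airflow.decorators import task


@task.external_python(python="/usr/local/airflow/soda_venv/bin/python")
def check_data_quality(data_source="retail", checks_subpath="tables"):
    # Imported inside the function because it runs in the isolated Soda virtualenv.
    from soda.scan import Scan

    scan = Scan()
    scan.set_verbose()
    scan.set_data_source_name(data_source)
    scan.add_configuration_yaml_file("include/soda/configuration.yml")
    scan.add_sodacl_yaml_files(f"include/soda/checks/{checks_subpath}")
    scan.execute()

    # Raise if any SodaCL check failed, which marks the Airflow task as failed.
    scan.assert_no_checks_fail()
    return scan.get_scan_results()

Inside a DAG, this task is then simply chained after the load and transform steps, for example load_raw >> check_data_quality().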

You can find more detail in the diagram below:

Architecture

🔧 Pipeline Architecture

onlineretail-arch

🌪️ Pipeline on Airflow

airflowretail

⚙️ Data Modeling

Data modeling diagram
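
To run the dbt models from Airflow, Cosmos can render each model as its own task. The sketch below assumes astronomer-cosmos 1.x, a dbt project under include/dbt/, a dedicated dbt virtualenv, and the same placeholder connection id and dataset used earlier; the exact configuration classes vary between Cosmos versions, so treat this as an illustration rather than the repository's code.

from datetime import datetime

from airflow.decorators import dag
from cosmos import DbtTaskGroup, ExecutionConfig, ProfileConfig, ProjectConfig
from cosmos.profiles import GoogleCloudServiceAccountFileProfileMapping


@dag(start_date=datetime(2023, 1, 1), schedule=None, catchup=False, tags=["retail"])
def retail_transform():
    # Renders every dbt model under include/dbt/ as an Airflow task.
    DbtTaskGroup(
        group_id="transform",
        project_config=ProjectConfig("/usr/local/airflow/include/dbt"),
        profile_config=ProfileConfig(
            profile_name="retail",
            target_name="dev",
            profile_mapping=GoogleCloudServiceAccountFileProfileMapping(
                conn_id="gcp",
                profile_args={"project": "your-gcp-project-id", "dataset": "retail"},
            ),
        ),
        execution_config=ExecutionConfig(
            dbt_executable_path="/usr/local/airflow/dbt_venv/bin/dbt",
        ),
    )


retail_transform()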

🛠️ Technologies Used

  • Infrastructure: Terraform
  • Google Cloud Platform (GCP)
    • Data Lake (DL): Cloud Storage
    • Data Warehouse (DWH): BigQuery
  • Astro SDK for Airflow
  • Workflow orchestration: Apache Airflow
  • Transforming data: dbt (Data Build Tool)
  • Data quality checks: Soda
  • Containerization: Docker
  • Data Visualization: Looker Studio

💻 Usage

First, clone this repository.

git clone https://github.com/Hamagistral/OnlineRetail-DataEng.git

1. Prerequisites

Make sure you have the following components pre-installed:

  • A Google Cloud Platform account
  • Terraform
  • Docker
  • The Astro CLI

2. Google Cloud Platform

To set up GCP, please follow the steps below:

  1. If you don't have a GCP account, create a free trial account.
  2. Set up a new project and write down your Project ID.
  3. Configure a service account with access to this project and download its auth keys (.json). Make sure the service account has all the permissions below:
    • Viewer
    • Storage Admin
    • Storage Object Admin
    • BigQuery Admin
  4. Download the Google Cloud SDK for the local setup.
  5. Set the environment variable to point to your downloaded auth keys:
export GOOGLE_APPLICATION_CREDENTIALS="<path/to/your/service-account-authkeys>.json"

# Refresh token/session, and verify authentication
gcloud auth application-default login
  6. Enable the following APIs under the APIs & Services section:
  • Identity and Access Management (IAM) API
  • IAM service account credentials API
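
To confirm that the key and environment variable are picked up correctly, a quick check with the google-cloud-storage client (pip install google-cloud-storage) can help. This is just a sanity check under the setup above, not part of the pipeline:

from google.cloud import storage

# Relies on GOOGLE_APPLICATION_CREDENTIALS or the application-default login above.
client = storage.Client()
print("Authenticated against project:", client.project)
for bucket in client.list_buckets():
    print("Found bucket:", bucket.name)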

3. Terraform

We use Terraform to build and manage the GCP infrastructure. The Terraform configuration files are located in the terraform folder. There are 3 configuration files:

  • terraform-version - contains information about the installed version of Terraform;
  • variables.tf - contains variables to make your configuration more dynamic and flexible;
  • main.tf - is a key configuration file consisting of several sections.

Now you can use the steps below to create the resources inside GCP:

  1. Move to the terraform folder using the cd command.
  2. Run the terraform init command to initialize the configuration.
  3. Use terraform plan to preview local changes against the remote state.
  4. Apply the changes to the cloud with the terraform apply command.

Note: In steps 3 and 4 Terraform may ask you to specify the Project ID. Please use the ID that you noted down earlier at the project setup stage.

If you would like to remove your stack from the Cloud, use the terraform destroy command.

4. Airflow

  1. Start Airflow on your local machine by running:
astro dev start

This command will spin up 4 Docker containers on your machine, each for a different Airflow component:

  • Postgres: Airflow's Metadata Database
  • Webserver: The Airflow component responsible for rendering the Airflow UI
  • Scheduler: The Airflow component responsible for monitoring and triggering tasks
  • Triggerer: The Airflow component responsible for triggering deferred tasks
  2. Verify that all 4 Docker containers were created by running 'docker ps'.

Note: Running 'astro dev start' will start your project with the Airflow Webserver exposed at port 8080 and Postgres exposed at port 5432. If you already have either of those ports allocated, you can either stop your existing Docker containers or change the port.

  3. Access the Airflow UI for your local Airflow project. To do so, go to http://localhost:8080/ and log in with 'admin' for both your Username and Password.

You should also be able to access your Postgres Database at 'localhost:5432/postgres'.

  4. Configure your Google Cloud Platform credentials.
  5. Create and configure the necessary connections in Airflow (see the connection sketch below).
  6. Customize the Airflow DAGs to suit your specific requirements.
  7. Run the pipeline and monitor its execution.
  8. Explore the data using Looker Studio for insights and visualization.
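
For the connections in step 5, one convenient option is to define the GCP connection as an environment variable in the Astro project's .env file. The snippet below generates a connection URI from Python; the connection id gcp, the key path, and the extra field names are assumptions and may need adjusting to your Airflow and Google provider versions.

import json

from airflow.models.connection import Connection

# Hypothetical GCP connection pointing at the service account key inside the Astro project.
conn = Connection(
    conn_id="gcp",
    conn_type="google_cloud_platform",
    extra=json.dumps(
        {
            "key_path": "/usr/local/airflow/include/gcp/service_account.json",
            "project": "your-gcp-project-id",
        }
    ),
)

# Paste the output into .env as: AIRFLOW_CONN_GCP=<generated-uri>
print(conn.get_uri())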

📋 Credits

📨 Contact Me

LinkedIn • Website • Gmail
