Code Monkey home page Code Monkey logo

incident_data_warehouse's Introduction

Incident Data Warehouse

The project aims to build a data warehouse for incidents that happened in San Francisco from scratch. There are 2 pipelines in this project:

  • full_load_pipeline: Load all data to the data warehouse for the first load.
  • incremental_load_pipeline: Run daily and load new data into the data warehouse as well as manage changes in data by implementing SCD Types 1 and 2.

Tech Stack

  • Staging Area: Amazon S3

  • Data warehouse: Amazon Redshift

  • Visualization: Power BI

  • Orchestration: Apache Airflow with Docker

  • Processing: Python, SQL

Architecture

The architecture of this project is presented as follows:

Architecture_1

  • Data is sourced from Socrata API and ingested into raw zone of Staging Area hosted on S3.
  • Raw data is cleansed and standardized before moving to cleansed zone.
  • Cleansed data is transformed into data model used in data warehouse and loaded into stage zone. Now the data is ready for moving to data warehouse.
  • Reports are created in Power BI from the data in data warehouse.

Data Model

  • The data warehouse schema is designed follow Star schema model.
  • To manage the relation many to many between incidents and incident categories, the bridge table is used.
  • To manage changes in data, SCD Type 1 is applied to all table, SCD Type 2 applied to dim_category and dim_intersection tables.
  • These SCD types are implemented using SQL.

ETL Pipeline

  • To manage the repetitive jobs of ingest (from source to staging area) and load to data warehouse, I build 2 Airflow custom operators are Socrata_to_S3 and S3_to_Redshift.
  • Full load pipeline architecture

full_load

  • Full load pipeline architecture (zoom in)

full_load_zoom_in

  • Incremental load pipeline architecture

incre_load

  • Incremental load pipeline architecture (zoom in)

incre_load_zoom_in

Visualization

Some dashboards create from the data from data warehouse

  • Total report for year 2022

year_report

  • Daily report for everyday

daily_report

Achievement in learning

Python

  • Use Python to process data, especially the date time datatype. Most used libraries to process data are Pandas and Numpy.
  • Implement OOP in the project.
  • Structure files in project.

Apache Airflow

  • Write custom operators for repetive jobs (code)
  • Use runtime variable and template variable
  • Group tasks that belong to each stage of the pipeline for more briefness when looking
  • Connect to cloud services through Hook
  • Secure connections and other important variables by using Variable and Connection features
  • Implement Airflow using Docker
  • Components of Airflow

AWS

  • Use S3 as the Staging Area
  • Use Redshift as the data warehouse
  • How Redshift works

Visualization

  • Connect to Redshift and creata dashboard

incident_data_warehouse's People

Contributors

minhky2185 avatar

Stargazers

 avatar  avatar

Watchers

 avatar

Forkers

mehfuz22

incident_data_warehouse's Issues

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.