data-engineering-capstone-project

COVID-19 Data Engineering Project (NOTE: the infrastructure code is not uploaded due to conflicts)

This project focuses on collecting, processing, and analyzing COVID-19 data using various data engineering tools and technologies. The project employs Terraform for infrastructure setup, dbt for analytical engineering, Mage.ai for workflow orchestration and data transformation, Google Cloud Platform (GCP) BigQuery for data warehousing, PySpark for batch processing, and Confluent Kafka for real-time data processing.

Table of Contents

  • Introduction
  • Technologies Used
  • Project Structure
  • Setup Instructions
  • Dashboard
  • Usage
  • Contributing
  • License

Introduction

The COVID-19 pandemic has generated massive amounts of data related to infection rates, testing, hospitalizations, and more. This project aims to centralize, process, and analyze this data to provide valuable insights for healthcare professionals, policymakers, and the general public.

Technologies Used

  • Terraform: Infrastructure as code tool used to provision and manage the project's infrastructure on cloud platforms.
  • dbt (Data Build Tool): Analytics engineering tool used for transforming and modeling data in the data warehouse.
  • Mage.ai: Workflow orchestration and data transformation platform used to streamline data processing tasks.
  • Google Cloud Platform (GCP) BigQuery: Fully managed, serverless data warehouse used for storing and querying large datasets.
  • PySpark: Python API for Apache Spark used for large-scale batch processing of data.
  • Confluent Kafka: Distributed streaming platform used for real-time data processing and event streaming.
  • Docker Compose: Tool for defining and running multi-container Docker applications. Used to run Mage.ai and Confluent Kafka services.
  • Looker: Business intelligence and data visualization platform used to create dashboards and reports.
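To make the batch-processing layer concrete, the sketch below aggregates case counts per country. The column names, paths, and job structure are assumptions for illustration, not the project's actual PySpark code; the pure-Python `total_by_country` helper mirrors the same aggregation so the logic can also run without a Spark cluster.

```python
from typing import Dict, Iterable, Tuple


def total_by_country(rows: Iterable[Tuple[str, int]]) -> Dict[str, int]:
    """Sum new_cases per country; the same aggregation the Spark job performs."""
    totals: Dict[str, int] = {}
    for country, new_cases in rows:
        totals[country] = totals.get(country, 0) + int(new_cases)
    return totals


def run_batch_job(input_path: str, output_path: str) -> None:
    """Run the aggregation as a PySpark batch job (requires pyspark).

    Paths and column names ("country", "new_cases") are hypothetical.
    """
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("covid19-batch").getOrCreate()
    df = spark.read.csv(input_path, header=True)
    (df.groupBy("country")
       .agg(F.sum(F.col("new_cases").cast("long")).alias("total_cases"))
       .write.mode("overwrite").parquet(output_path))
    spark.stop()
```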

Project Structure

The project is structured as follows:

covid19/
│
├── analytics/
│   ├── dbt/
│   │   ├── analyses/
│   │   ├── macros/
│   │   └── ...
│   └── ...
│
├── containerization/
│   ├── docker/
│   │   ├── docker-compose.yml
│   │   └── ...
│   └── ...
│
├── workflows/
│   ├── mage/
│   │   ├── export_data/
│   │   │   └── export_to_gcp.py
│   │   └── load_data/
│   │       └── load_data_to_gcp.py
│   └── ...
│
├── kafka/
│   ├── consumer.py
│   └── producer.py
└── README.md

Setup Instructions

  1. Infrastructure Setup: Use Terraform scripts in the infrastructure/terraform/ directory to provision the required cloud resources. Make sure to configure your cloud provider credentials and settings.

  2. Analytical Engineering: Utilize dbt models in the analytics/dbt/models/ directory to transform and model data in the data warehouse.

  3. Workflow Orchestration: Define and manage data processing workflows using Mage.ai workflows in the workflows/mage/workflows/ directory.

  4. Data Warehousing: Load data into the BigQuery data warehouse using the loading scripts in the workflows/mage/ directory.

  5. Real-time Processing: Develop real-time data processing pipelines using Confluent Kafka consumer and producer scripts in the kafka/ directory.

  6. Docker Compose Setup: Use the provided docker-compose.yml file to run Mage.ai and Confluent Kafka services. Make sure Docker is installed on your system.

  7. Looker Dashboards: Use Looker to import and customize dashboards.
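For step 6, a minimal docker-compose.yml for the two services could look like this. It is a sketch, not the project's actual file: image tags, ports, and volume paths are assumptions.

```yaml
services:
  mage:
    image: mageai/mageai:latest
    ports:
      - "6789:6789"              # Mage UI
    volumes:
      - ./workflows/mage:/home/src
  zookeeper:
    image: confluentinc/cp-zookeeper:7.5.0
    environment:
      ZOOKEEPER_CLIENT_PORT: 2181
  kafka:
    image: confluentinc/cp-kafka:7.5.0
    depends_on:
      - zookeeper
    ports:
      - "9092:9092"
    environment:
      KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://localhost:9092
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1
```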
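Step 3 (workflow orchestration) can be sketched as a Mage.ai transformer block. The decorator import is guarded so the same logic also runs outside Mage; the column name new_cases is illustrative, not taken from the actual pipeline.

```python
# Hedged sketch of a Mage.ai transformer block (workflows/mage/).
try:
    from mage_ai.data_preparation.decorators import transformer
except ImportError:
    # Running outside Mage: make the decorator a no-op so the logic still works.
    def transformer(func):
        return func


@transformer
def clean_cases(rows, *args, **kwargs):
    """Drop rows with a missing case count and coerce counts to int."""
    cleaned = []
    for row in rows:
        if row.get("new_cases") in (None, ""):
            continue
        cleaned.append({**row, "new_cases": int(float(row["new_cases"]))})
    return cleaned
```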
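For step 4 (data warehousing), a minimal loader along these lines moves a CSV from Cloud Storage into BigQuery. The bucket, dataset, and table names are hypothetical; the project's real logic lives in workflows/mage/load_data/load_data_to_gcp.py.

```python
def table_ref(project: str, dataset: str, table: str) -> str:
    """Build a fully qualified BigQuery table ID, e.g. 'proj.dataset.table'."""
    return f"{project}.{dataset}.{table}"


def load_csv_from_gcs(gcs_uri: str, table_id: str) -> None:
    """Load a CSV file from GCS into BigQuery.

    Requires google-cloud-bigquery and application-default credentials.
    """
    from google.cloud import bigquery

    client = bigquery.Client()
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,   # skip the header row
        autodetect=True,       # infer the schema from the file
    )
    # Kick off the load job and block until it finishes.
    client.load_table_from_uri(gcs_uri, table_id, job_config=job_config).result()
```

Example target: `load_csv_from_gcs("gs://my-bucket/cases.csv", table_ref("my-project", "covid19", "cases"))` (names are placeholders).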
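For step 5 (real-time processing), the producer side of the kafka/ scripts might look like the sketch below. The topic name covid_cases and the record fields are assumptions; `publish_cases` takes the producer as a parameter, so it works with any object exposing Confluent's produce/flush interface.

```python
import json


def serialize_case(record: dict) -> bytes:
    """Encode one record as UTF-8 JSON, a common wire format for Kafka topics."""
    return json.dumps(record, sort_keys=True).encode("utf-8")


def publish_cases(producer, topic: str, records) -> int:
    """Produce every record to `topic`, flush, and return the count produced."""
    count = 0
    for record in records:
        producer.produce(topic, value=serialize_case(record))
        count += 1
    producer.flush()  # block until all queued messages are delivered
    return count


def make_producer(bootstrap_servers: str = "localhost:9092"):
    """Build a real Confluent Kafka producer (requires confluent-kafka)."""
    from confluent_kafka import Producer

    return Producer({"bootstrap.servers": bootstrap_servers})
```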

Dashboard

  • Currently in progress.

Usage

  • Modify and extend the provided scripts and configurations to suit your specific data processing requirements.
  • Run Docker Compose to start Mage.ai and Confluent Kafka services.
  • Use Looker to visualize and explore data through the imported dashboards.
  • Refer to individual tool documentation for detailed usage instructions and best practices.

Contributing

Contributions to improve and expand this project are welcome! Feel free to fork the repository, make your changes, and submit a pull request.

License

This project is licensed under the MIT License.
