dataengg-covid19-aws

Data Engineering Project on the COVID-19 Data Lake by AWS

This project performs data modeling, data wrangling, and extract-load-transform in Python on the COVID-19 Data Lake, available in the Registry of Open Data on AWS, using AWS tools such as boto3, Glue, S3, Athena, and Redshift.

Tools and Usage:

Amazon S3 - stores the data
Crawler - extracts the schema and other metadata directly from the data in S3
Amazon Athena - runs ad hoc SQL queries on the data in S3
AWS Glue - transforms the data
Amazon Redshift - stores the transformed dimensional model in a data warehouse
boto3 - the AWS SDK for Python, used to create, configure, and manage AWS services
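
As an illustration of how boto3 and the crawler fit together, here is a minimal sketch that creates a Glue crawler over one S3 path and starts it. The function name, crawler name, role ARN, database name, and S3 path are all hypothetical placeholders, not values taken from this project. The client is passed in as an argument (in practice it would be boto3.client("glue")) so the sketch can be tried without AWS credentials.

```python
from typing import Any


def create_and_start_crawler(
    glue_client: Any,
    crawler_name: str,
    iam_role_arn: str,
    database_name: str,
    s3_path: str,
) -> str:
    """Create a Glue crawler that scans one S3 path, then start it.

    glue_client is assumed to be boto3.client("glue"). All names passed
    in are placeholders chosen by the caller, not project values.
    """
    # Register the crawler: it writes the tables it infers into the
    # given Glue Data Catalog database.
    glue_client.create_crawler(
        Name=crawler_name,
        Role=iam_role_arn,
        DatabaseName=database_name,
        Targets={"S3Targets": [{"Path": s3_path}]},
    )
    # Start the crawl; it runs asynchronously on the AWS side.
    glue_client.start_crawler(Name=crawler_name)
    return crawler_name
```

Once the crawler finishes, the inferred tables should be visible in the Glue Data Catalog and queryable from Athena.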

Architecture

Data Set: https://registry.opendata.aws/aws-covid19-lake/

How to get the data?

Download the data and upload it to an S3 bucket
Copy it with the AWS CLI copy command, for example: aws s3 cp s3://mybucket/test.txt s3://mybucket/test2.txt
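
The same bucket-to-bucket copy can also be done from Python. The sketch below is one possible way to do it with the boto3 S3 client's copy_object call; the helper names split_s3_uri and copy_s3_object are hypothetical, and the client is passed in (in practice boto3.client("s3")) so the code can be exercised without credentials. Note that a single copy_object call handles objects up to 5 GB.

```python
from typing import Any, Tuple


def split_s3_uri(uri: str) -> Tuple[str, str]:
    """Split 's3://bucket/key' into (bucket, key)."""
    prefix = "s3://"
    if not uri.startswith(prefix):
        raise ValueError(f"not an S3 URI: {uri!r}")
    bucket, _, key = uri[len(prefix):].partition("/")
    if not bucket or not key:
        raise ValueError(f"S3 URI needs both a bucket and a key: {uri!r}")
    return bucket, key


def copy_s3_object(s3_client: Any, source_uri: str, dest_uri: str) -> None:
    """Server-side copy of one object, like `aws s3 cp` between S3 URIs.

    s3_client is assumed to be boto3.client("s3").
    """
    src_bucket, src_key = split_s3_uri(source_uri)
    dst_bucket, dst_key = split_s3_uri(dest_uri)
    s3_client.copy_object(
        CopySource={"Bucket": src_bucket, "Key": src_key},
        Bucket=dst_bucket,
        Key=dst_key,
    )
```

With a real client, copy_s3_object(boto3.client("s3"), "s3://mybucket/test.txt", "s3://mybucket/test2.txt") would mirror the CLI example above.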

STEPS:

Running crawlers on the data uploaded to S3
Analysing the data using the Amazon Athena query editor
Building the ER data model
Running ETL jobs in Python
Saving the results to S3
Building the dimensional model
Building the dimensional schema in Redshift
Loading the dimensional model into Redshift
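
Two of the steps above can be sketched in Python: running an Athena query programmatically instead of through the query editor, and building the Redshift COPY statement that loads a CSV result from S3. This is only a sketch under stated assumptions: the function names are hypothetical, the result files are assumed to be CSV with a header row, and the Athena client is passed in (in practice boto3.client("athena")). It is not the project's actual ETL code.

```python
import time
from typing import Any


def run_athena_query(
    athena_client: Any,
    sql: str,
    database: str,
    output_location: str,
    poll_seconds: float = 1.0,
) -> str:
    """Start an Athena query and wait until it finishes.

    Returns the query execution ID on success. output_location is the
    S3 URI where Athena writes its result files.
    """
    started = athena_client.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    query_id = started["QueryExecutionId"]
    while True:
        execution = athena_client.get_query_execution(QueryExecutionId=query_id)
        state = execution["QueryExecution"]["Status"]["State"]
        if state == "SUCCEEDED":
            return query_id
        if state in ("FAILED", "CANCELLED"):
            raise RuntimeError(f"Athena query {query_id} ended in state {state}")
        time.sleep(poll_seconds)  # still QUEUED or RUNNING


def build_redshift_copy(table: str, s3_uri: str, iam_role_arn: str) -> str:
    """Build a COPY statement for a CSV file with one header row.

    Values are interpolated directly, so only use trusted inputs.
    """
    return (
        f"COPY {table} "
        f"FROM '{s3_uri}' "
        f"IAM_ROLE '{iam_role_arn}' "
        "FORMAT AS CSV IGNOREHEADER 1;"
    )
```

The COPY statement returned by build_redshift_copy would then be executed over a normal Redshift connection after the target table has been created.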

ER-Data Model

Dimension Model

Source repository: oovk / dataengg-covid19-aws