In this project, we build an analytics data pipeline for a music streaming service, implementing the ETL workflow with Apache Airflow.
- Create an IAM user with the AmazonRedshiftFullAccess and AmazonS3ReadOnlyAccess policies attached.
- Place the plugins and dags folders in the Airflow home directory.
- Add two connections in Airflow Admin > Connections:
- redshift: connection type Postgres, with the Redshift cluster endpoint, database, port, and credential information.
- aws_credentials: connection type Amazon Web Services, with the IAM user credentials used to access S3.
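Inside the custom operators these connection IDs are looked up with Airflow hooks. A minimal sketch, assuming Airflow 1.10-style import paths (the provider packages in Airflow 2 use different module paths):

```python
from airflow.contrib.hooks.aws_hook import AwsHook
from airflow.hooks.postgres_hook import PostgresHook

# Fetch the AWS key pair stored under the "aws_credentials" connection.
aws_hook = AwsHook("aws_credentials")
credentials = aws_hook.get_credentials()

# Open a session against Redshift through the "redshift" (Postgres-type) connection.
redshift = PostgresHook(postgres_conn_id="redshift")
redshift.run("SELECT 1;")  # simple connectivity check
```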
- Run the /opt/airflow/start.sh command to start the Airflow web server.
- Trigger the DAG from the Airflow UI.
- Make sure that start_date and schedule_interval in the DAG are properly configured.
Add default parameters according to these guidelines:
- The DAG does not have dependencies on past runs
- On failure, tasks are retried 3 times
- Retries happen every 5 minutes
- Catchup is turned off
- Do not email on retry
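A default_args dictionary and DAG definition matching these guidelines might look like the sketch below (the dag_id, owner, start_date, and schedule are placeholders, not the project's exact values):

```python
from datetime import datetime, timedelta
from airflow import DAG

# Default task arguments following the guidelines above.
default_args = {
    'owner': 'sparkify',                   # placeholder owner
    'start_date': datetime(2019, 1, 12),   # placeholder start date
    'depends_on_past': False,              # no dependencies on past runs
    'retries': 3,                          # on failure, retry the task 3 times
    'retry_delay': timedelta(minutes=5),   # retries happen every 5 minutes
    'email_on_retry': False,               # do not email on retry
}

dag = DAG(
    'music_etl_dag',                       # hypothetical dag_id
    default_args=default_args,
    description='Load and transform music service data in Redshift with Airflow',
    schedule_interval='@hourly',           # example schedule_interval
    catchup=False,                         # catchup is turned off
)
```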
The graph view of our DAG and its task dependencies follows the flow shown in the image below:
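In code, a graph of this shape could be wired up roughly as follows; the task IDs are illustrative placeholders (DummyOperator stand-ins), not the project's actual operators, and the snippet builds on the `dag` object from the sketch above:

```python
from airflow.operators.dummy_operator import DummyOperator

# Placeholder tasks that only illustrate the shape of the graph; in the real
# DAG these are the custom operators described below.
start = DummyOperator(task_id='Begin_execution', dag=dag)
stage_events = DummyOperator(task_id='Stage_events', dag=dag)
stage_songs = DummyOperator(task_id='Stage_songs', dag=dag)
load_fact = DummyOperator(task_id='Load_songplays_fact_table', dag=dag)
load_user_dim = DummyOperator(task_id='Load_user_dim_table', dag=dag)
load_song_dim = DummyOperator(task_id='Load_song_dim_table', dag=dag)
load_artist_dim = DummyOperator(task_id='Load_artist_dim_table', dag=dag)
load_time_dim = DummyOperator(task_id='Load_time_dim_table', dag=dag)
quality_checks = DummyOperator(task_id='Run_data_quality_checks', dag=dag)
end = DummyOperator(task_id='Stop_execution', dag=dag)

# Staging runs in parallel, then the fact table, then the dimensions,
# then the data quality checks.
start >> [stage_events, stage_songs] >> load_fact
load_fact >> [load_user_dim, load_song_dim, load_artist_dim, load_time_dim] >> quality_checks
quality_checks >> end
```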
- Stage Operator
- Load JSON files from S3 to Amazon Redshift.
- Run a SQL COPY statement built from the given parameters, which specify the source JSON files in S3 and the target table.
- Add a templated field that allows it to load timestamped files from S3 based on the execution time, so that backfills can be run.
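A condensed sketch of what such a staging operator could look like, assuming Airflow 1.10-style imports; the field names and exact COPY options are illustrative, not the project's actual implementation:

```python
from airflow.contrib.hooks.aws_hook import AwsHook
from airflow.hooks.postgres_hook import PostgresHook
from airflow.models import BaseOperator
from airflow.utils.decorators import apply_defaults


class StageToRedshiftOperator(BaseOperator):
    # s3_key is templated so execution-date values such as {{ ds }} can point
    # at timestamped files and support backfills.
    template_fields = ("s3_key",)

    copy_sql = """
        COPY {table}
        FROM '{s3_path}'
        ACCESS_KEY_ID '{access_key}'
        SECRET_ACCESS_KEY '{secret_key}'
        FORMAT AS JSON '{json_path}'
    """

    @apply_defaults
    def __init__(self, redshift_conn_id="redshift", aws_credentials_id="aws_credentials",
                 table="", s3_bucket="", s3_key="", json_path="auto", *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.redshift_conn_id = redshift_conn_id
        self.aws_credentials_id = aws_credentials_id
        self.table = table
        self.s3_bucket = s3_bucket
        self.s3_key = s3_key
        self.json_path = json_path

    def execute(self, context):
        credentials = AwsHook(self.aws_credentials_id).get_credentials()
        redshift = PostgresHook(postgres_conn_id=self.redshift_conn_id)
        # s3_key is a templated field, so any Jinja values in it have already
        # been rendered with the execution date by the time execute() runs.
        s3_path = f"s3://{self.s3_bucket}/{self.s3_key}"
        redshift.run(StageToRedshiftOperator.copy_sql.format(
            table=self.table,
            s3_path=s3_path,
            access_key=credentials.access_key,
            secret_key=credentials.secret_key,
            json_path=self.json_path,
        ))
```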
- Fact and Dimension Operators
- Transform data with the provided SQL helpers.
- Take as input a SQL SELECT statement, the target database connection, and the target table to load.
- Add a parameter that allows switching between insert modes.
- Dimension tables should be loaded with the truncate-insert pattern, where the target table is emptied before the load.
- Fact tables should allow append-only functionality, since they are usually too massive to reload (see the sketch below).
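A compact sketch of a dimension-load operator with such an insert-mode switch (the parameter names are assumptions; a fact-load operator would look the same minus the truncate step):

```python
from airflow.hooks.postgres_hook import PostgresHook
from airflow.models import BaseOperator
from airflow.utils.decorators import apply_defaults


class LoadDimensionOperator(BaseOperator):

    @apply_defaults
    def __init__(self, redshift_conn_id="redshift", table="", select_sql="",
                 truncate_insert=True, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.redshift_conn_id = redshift_conn_id
        self.table = table
        self.select_sql = select_sql          # e.g. a query from the provided SQL helpers
        self.truncate_insert = truncate_insert

    def execute(self, context):
        redshift = PostgresHook(postgres_conn_id=self.redshift_conn_id)
        if self.truncate_insert:
            # Truncate-insert pattern: empty the dimension table before loading.
            redshift.run(f"TRUNCATE TABLE {self.table}")
        redshift.run(f"INSERT INTO {self.table} {self.select_sql}")
```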
- Data Quality Operator
- Receive one or more SQL-based test cases along with the expected results.
- Raise an exception so the task fails and is retried if the actual result does not match the expected one (see the sketch below).
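A sketch of how such a data quality operator could be written; the shape of the `checks` parameter (a list of SQL/expected-result pairs) is an assumption:

```python
from airflow.hooks.postgres_hook import PostgresHook
from airflow.models import BaseOperator
from airflow.utils.decorators import apply_defaults


class DataQualityOperator(BaseOperator):

    @apply_defaults
    def __init__(self, redshift_conn_id="redshift", checks=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.redshift_conn_id = redshift_conn_id
        # Each check pairs a SQL test with its expected result, e.g.
        # {"sql": "SELECT COUNT(*) FROM users WHERE userid IS NULL", "expected": 0}
        self.checks = checks or []

    def execute(self, context):
        redshift = PostgresHook(postgres_conn_id=self.redshift_conn_id)
        for check in self.checks:
            actual = redshift.get_records(check["sql"])[0][0]
            if actual != check["expected"]:
                # Raising makes the task fail, so Airflow retries it according
                # to the DAG's retry settings.
                raise ValueError(
                    f"Data quality check failed: {check['sql']} "
                    f"returned {actual}, expected {check['expected']}"
                )
            self.log.info(f"Data quality check passed: {check['sql']}")
```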