dp-dataengineeringnd

In this project, we build an analytics data pipeline for a music service. The ETL pipeline is orchestrated with Airflow.

Setup Airflow

  • Create an IAM user with the AmazonRedshiftFullAccess and AmazonS3ReadOnlyAccess policies attached.
  • Place the plugins and dags folders in our Airflow folder.
  • Add two connections in Airflow Admin > Connections:
    • redshift: connection type Postgres, with the Redshift connection details (host, schema, port) and credentials.
    • aws_credentials: connection type Amazon Web Services, with the user credentials used to access S3.
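
As an alternative to the Admin > Connections form, Airflow can also read connections from environment variables named AIRFLOW_CONN_<CONN_ID> that hold a connection URI. The sketch below only builds such URIs; the helper name connection_uri, the host, the user names, and the secrets are hypothetical placeholders, not values from this project.

```python
from urllib.parse import quote


def connection_uri(scheme, login, password, host="", port=None, schema=""):
    """Build an Airflow-style connection URI, percent-encoding the credentials."""
    uri = f"{scheme}://{quote(login, safe='')}:{quote(password, safe='')}@{host}"
    if port:
        uri += f":{port}"
    if schema:
        uri += f"/{schema}"
    return uri


# Hypothetical values: replace with the real cluster endpoint and credentials.
redshift_uri = connection_uri(
    "postgres", "awsuser", "p@ss/word",
    host="example-cluster.abc123.us-west-2.redshift.amazonaws.com",
    port=5439, schema="dev",
)
aws_uri = connection_uri("aws", "EXAMPLE_ACCESS_KEY_ID", "example/secret+key")

# These would be exported before starting Airflow, for example:
#   export AIRFLOW_CONN_REDSHIFT="<redshift_uri>"
#   export AIRFLOW_CONN_AWS_CREDENTIALS="<aws_uri>"
print(redshift_uri)
print(aws_uri)
```

Percent-encoding matters because AWS secret keys often contain "/" or "+", which would otherwise break URI parsing.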

How to run

  • Run the /opt/airflow/start.sh command to start the Airflow web server.
  • Trigger the DAG in the Airflow UI.
  • Make sure that start_date and schedule_interval in the DAG are properly configured.

DAG outline

Add default parameters according to these guidelines:

  • The DAG does not have dependencies on past runs
  • On failure, tasks are retried 3 times
  • Retries happen every 5 minutes
  • Catchup is turned off
  • Do not email on retry
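
The guidelines above map onto Airflow's default_args dictionary and DAG arguments roughly as follows. This is a minimal sketch: the owner, start_date, and schedule_interval values are placeholders chosen for illustration, not the project's actual settings.

```python
from datetime import datetime, timedelta

# Task-level defaults, applied to every operator in the DAG.
default_args = {
    "owner": "airflow",                   # placeholder owner
    "start_date": datetime(2019, 1, 12),  # placeholder start date
    "depends_on_past": False,             # no dependencies on past runs
    "retries": 3,                         # retry failed tasks 3 times
    "retry_delay": timedelta(minutes=5),  # wait 5 minutes between retries
    "email_on_retry": False,              # do not email on retry
}

# Catchup is a DAG-level argument, so it is kept outside default_args.
dag_kwargs = {
    "catchup": False,
    "schedule_interval": "@hourly",       # placeholder schedule
}

# With Airflow installed, these would be passed to the DAG constructor:
#   dag = DAG("example_dag_id", default_args=default_args, **dag_kwargs)
```

Setting catchup inside default_args has no effect, because default_args is only forwarded to operators.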

The graph view of the DAG and its task dependencies follows the flow shown in the image below:

[Image: example-dat.png]

Operators outline

  • Stage Operator
    • Load JSON files from S3 to Amazon Redshift.
    • Build a SQL COPY statement from the given parameters, which specify the source JSON and the target table.
    • Add a templated field that lets the operator load timestamped files from S3 based on the execution time and run backfills.
  • Fact and Dimension Operators
    • Transform data with the provided SQL helpers.
    • Take a SQL statement and a target database as input, and define a target table.
    • Add a parameter that allows switching between insert modes.
      • Dimension tables should be loaded with the truncate-insert pattern, where the target table is emptied before the load.
      • Fact tables should be append-only, because they are usually too large to reload.
  • Data Quality Operator
    • Receive one or more SQL-based test cases along with the expected results.
    • Raise an exception, so that the task is retried, if a result does not match.
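
The core logic of the three operator types can be sketched as plain Python functions, independent of Airflow. All function names, table names, bucket names, and the default region here are hypothetical. In a real operator the S3 key would be rendered by Airflow through template_fields, and the SQL would be run through a database hook rather than returned.

```python
from datetime import datetime


def render_s3_path(bucket, key_template, execution_date):
    """Stand-in for Airflow's rendering of a templated S3 key field."""
    return f"s3://{bucket}/" + key_template.format(execution_date=execution_date)


def build_copy_sql(table, s3_path, access_key, secret_key,
                   json_option="auto", region="us-west-2"):
    """Stage operator: build a Redshift COPY statement for JSON files in S3."""
    return (
        f"COPY {table}\n"
        f"FROM '{s3_path}'\n"
        f"ACCESS_KEY_ID '{access_key}'\n"
        f"SECRET_ACCESS_KEY '{secret_key}'\n"
        f"REGION '{region}'\n"
        f"FORMAT AS JSON '{json_option}';"
    )


def build_load_sql(table, select_sql, truncate=False):
    """Fact/dimension operator: append by default, truncate-insert on request."""
    statements = []
    if truncate:
        # Dimension tables: empty the target before loading.
        statements.append(f"TRUNCATE TABLE {table};")
    statements.append(f"INSERT INTO {table}\n{select_sql};")
    return statements


def check_quality(run_query, test_cases):
    """Data quality operator: compare each query result with its expected value.

    run_query is any callable that takes SQL and returns a single value.
    Raising here fails the task, which lets Airflow apply its retry settings.
    """
    failures = [
        (sql, expected, actual)
        for sql, expected in test_cases
        for actual in [run_query(sql)]
        if actual != expected
    ]
    if failures:
        raise ValueError(f"Data quality check failed: {failures}")


# Usage with hypothetical names and a fake query runner.
path = render_s3_path(
    "example-bucket",
    "log_data/{execution_date.year}/{execution_date.month}/",
    datetime(2018, 11, 1),
)
copy_sql = build_copy_sql("staging_events", path, "KEY_ID", "SECRET")
dim_sql = build_load_sql("users", "SELECT DISTINCT userid FROM staging_events", truncate=True)
fact_sql = build_load_sql("songplays", "SELECT * FROM staging_events")
check_quality(lambda sql: 0, [("SELECT COUNT(*) FROM users WHERE userid IS NULL", 0)])
```

Keeping the insert mode as a boolean parameter lets one operator class serve both dimension tables (truncate=True) and fact tables (append-only default).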

Contributors

  • kzinmr
