
data-modeling-spark-s3

Introduction & Context

As Sparkify continues to gain more and more users, it becomes beneficial to migrate from a data warehouse to a data lake (please check the data modeling series in my GitHub repos to follow the progression). This work designs an ETL data pipeline that uses Spark to extract the data, transform it, and store the output files in S3 as a data lake. The purpose of this project is to acquire new insights by working with the easy-to-use data residing in Parquet files in S3.

Solution

The solution is to create a star schema of fact and dimension tables optimized for queries on songplay analysis. To do that, we extract the data residing in S3 buckets, load it into Spark DataFrames on an EMR cluster, transform it into dimensional tables, and finally store those tables back into S3 as Parquet files. In this case the raw data is in JSON format and is located in two different S3 buckets. Here are the links for each:

  • s3://udacity-dend/song_data
  • s3://udacity-dend/log_data

The first dataset is a subset of real data from the Million Song Dataset. Each file is in JSON format and contains metadata about a song and the artist of that song. The files are partitioned by the first three letters of each song's track ID. For example, here is a filepath to a file in this dataset: song_data/A/B/C/TRABCEI128F424C983.json

The second dataset consists of log files in JSON format generated by an event simulator based on the songs in the dataset above. These simulate activity logs from an imaginary music streaming app based on configuration settings. The log files in the dataset we will be using are partitioned by year and month. For example, here is a filepath to a file in this dataset: log_data/2018/11/2018-11-12-events.json
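To make the extract step concrete, here is a minimal loading sketch (assuming a SparkSession on the EMR cluster; the wildcarded paths simply mirror the partitioning described above):

```python
from pyspark.sql import SparkSession

# A minimal sketch: create (or reuse) a SparkSession on the EMR cluster.
spark = SparkSession.builder.appName("sparkify-data-lake").getOrCreate()

# Song data is partitioned by the first three letters of each track ID,
# e.g. song_data/A/B/C/TRABCEI128F424C983.json
song_df = spark.read.json("s3://udacity-dend/song_data/*/*/*/*.json")

# Log data is partitioned by year and month,
# e.g. log_data/2018/11/2018-11-12-events.json
log_df = spark.read.json("s3://udacity-dend/log_data/*/*/*.json")

song_df.printSchema()
log_df.printSchema()
```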

Database schema

Please refer to this repo for the schema, as it is the same data schema used in the previous projects of this series.
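To illustrate the transform-and-store step described in the Solution section, here is a sketch that derives one dimension table and writes it back to S3 as Parquet. The column names and the year/artist_id partitioning are my assumptions of the standard Sparkify layout; the authoritative version lives in etl.py and in the linked repo.

```python
# Continuing from the loading sketch above (SparkSession already created).
song_df = spark.read.json("s3://udacity-dend/song_data/*/*/*/*.json")

# Placeholder: replace with your own S3 output path (same as "output_file" in etl.py).
output_data = "s3://your-output-bucket/"

# Derive the songs dimension table and write it as Parquet,
# partitioned by year and artist_id.
songs_table = (song_df
               .select("song_id", "title", "artist_id", "year", "duration")
               .dropDuplicates(["song_id"]))

songs_table.write.mode("overwrite") \
    .partitionBy("year", "artist_id") \
    .parquet(output_data + "songs/")
```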

EMR configuration:

There are several ways to configure and launch an EMR cluster, either programmatically or simply using AWS Console. For the sake of simplicity I will use the AWS Console to set up an EMR cluster.

Here is my cluster's configuration: 3 EC2 instances (1 master and 2 worker nodes) of type m5.xlarge, each with 16 GiB of memory and 64 GiB of EBS storage. You can choose whatever configuration you want. The job took 12 minutes to finish, so bear in mind that your application's running time will depend on the EMR configuration you set and on many other factors (such as your internet connection speed).
Important note: I set up the EMR cluster and the S3 buckets in the same region, which saves time (and avoids cross-region data transfer).
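If you prefer the programmatic route over the AWS Console, a minimal boto3 sketch for launching a comparable cluster could look like the following. The key-pair name, region and release label are placeholders/assumptions; adapt them to your account (and the default EMR roles must already exist, e.g. via aws emr create-default-roles).

```python
import boto3

# Same region as the S3 buckets (assumption: us-west-2 for the udacity-dend data).
emr = boto3.client("emr", region_name="us-west-2")

response = emr.run_job_flow(
    Name="sparkify-data-lake",
    ReleaseLabel="emr-5.30.0",                # any recent release that ships Spark
    Applications=[{"Name": "Spark"}],
    Instances={
        "MasterInstanceType": "m5.xlarge",    # 1 master node
        "SlaveInstanceType": "m5.xlarge",     # 2 worker nodes
        "InstanceCount": 3,
        "Ec2KeyName": "my-key-pair",          # the .pem key pair created in the steps below
        "KeepJobFlowAliveWhenNoSteps": True,  # keep the cluster alive so you can SSH in
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
    VisibleToAllUsers=True,
)
print("Cluster id:", response["JobFlowId"])
```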

CODE EXECUTION STEPS:

  • Create an IAM user with programmatic access, attach existing policies and set the permission to AdministratorAccess, then choose Next: Tags, skip the tags page, choose Next: Review, and choose Create user. Finally, save the credentials in a safe place.
  • From the AWS Console, click on Services and type 'EC2' to go to the EC2 console. Choose Key Pairs under Network & Security in the left panel => choose Create key pair, type a name for the key pair, select File format: pem => choose Create key pair. After this step, a .pem file will be downloaded to your computer automatically and will be used later.
  • Launch an EMR cluster with the configuration you choose (either using the AWS Console or the AWS CLI) and assign it the key pair (.pem file) you created before.
  • Edit the inbound rules of the cluster's security group to allow SSH access from your computer's IP address.
  • Open up a terminal and SSH into your cluster (refer to the Summary page of your EMR cluster => Connect to the Master Node Using SSH, under Master public DNS).
  • Once connected to the cluster via your terminal, create a new file ( nano etl.py ) and copy-paste the code from the etl.py I provided in this work. Alternatively, you can SCP the file from your local machine to the host machine (refer to the txt file in the repo to be able to copy your file).
  • Create a new file called dl.cfg, copy and paste the content of the dl.cfg I provided, and replace the blanks with your own credentials. !! DO NOT write your credentials in plain text inside etl.py; do it my way and let configparser do the job (see the sketch after this list).
  • Submit your etl.py using the command ( spark-submit --master yarn etl.py)
  • Depending on your EMR cluster configuration, the execution could take a while ( 12 minutes in my case )
  • You can monitor your job with the Spark UI (you must set up dynamic port forwarding first -- for more info please check the AWS documentation on port forwarding).
  • For data quality checks, you can launch a Jupyter notebook on the EMR cluster and start exploring the dimensional tables created in this work (see the read-back sketch after this list).
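As referenced in the steps above, here is a minimal sketch of the configparser pattern that keeps credentials out of etl.py (the section and key names are assumptions based on a typical dl.cfg layout; keep whatever names the provided dl.cfg uses):

```python
import os
import configparser

# Read the AWS credentials from dl.cfg instead of hard-coding them in etl.py.
config = configparser.ConfigParser()
config.read("dl.cfg")

# Assumed layout: an [AWS] section holding the two keys below.
os.environ["AWS_ACCESS_KEY_ID"] = config["AWS"]["AWS_ACCESS_KEY_ID"]
os.environ["AWS_SECRET_ACCESS_KEY"] = config["AWS"]["AWS_SECRET_ACCESS_KEY"]
```

And a quick read-back sketch you can run from a Jupyter notebook on the cluster once the job finishes (spark is pre-defined in an EMR notebook; output_data is your own S3 output path):

```python
# Sanity-check one of the dimensional tables written by etl.py.
songs = spark.read.parquet(output_data + "songs/")
print("songs rows:", songs.count())
songs.show(5)
```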

Attached files

  • etl.py : the Python script you will run on your EMR cluster. !! YOU NEED TO CHANGE THE "output_file" to your own S3 output path.
  • dl.cfg : the configuration file that contains the IAM user credentials (keep the file name & structure I use and replace the blanks with your own credentials).
  • scp-method.txt : how to copy your local files to your host machine.

!!! Do not forget to clean up your AWS resources, otherwise you will be charged for services you are not actually using. To do so, terminate the EMR cluster and clean up your S3 buckets.
