Data Engineering Zoomcamp Project

This is my project for the Data Engineering Zoomcamp by DataTalks.Club

Index

  • Problem Statement
  • About the Dataset
  • Architecture
  • Technologies/Tools
  • About the Project
  • Dashboard
  • Reproducibility
  • Notes
  • Acknowledgements

Problem Statement

  • You would like to analyse GitHub activity to identify some trends.
  • You have data about GitHub events for April 2023.
  • Here are some questions that you want to answer:
    • How many events happen on GitHub daily?
    • What is the most popular type of event?
    • What are the top 10 most active repos?
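
A minimal PySpark sketch of these three aggregations (the input path is a placeholder, and the columns type, created_at and repo.name follow the GH Archive event schema; this is an illustration, not the project's exact job):

    from pyspark.sql import SparkSession
    import pyspark.sql.functions as F

    spark = SparkSession.builder.appName("gharchive-questions").getOrCreate()

    # GH Archive events for April 2023, assumed to already sit in GCS (placeholder path)
    events = spark.read.json("gs://<your-bucket>/gharchive/2023-04-*.json.gz")

    # 1. How many events happen on GitHub daily?
    daily_events = (
        events.withColumn("event_date", F.to_date("created_at"))
        .groupBy("event_date")
        .count()
        .orderBy("event_date")
    )

    # 2. What is the most popular type of event?
    event_types = events.groupBy("type").count().orderBy(F.desc("count"))

    # 3. What are the top 10 most active repos?
    top_repos = (
        events.groupBy(F.col("repo.name").alias("repo_name"))
        .count()
        .orderBy(F.desc("count"))
        .limit(10)
    )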

About the Dataset

GitHub Archive is a project to record the public GitHub timeline, archive it, and make it accessible for further analysis.
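
GH Archive publishes one gzipped, newline-delimited JSON file per hour, e.g. https://data.gharchive.org/2023-04-01-15.json.gz for 15:00 UTC on 1 April 2023. A minimal sketch of pulling and inspecting one hour of data (the local filename is an arbitrary choice):

    import gzip
    import json
    import urllib.request

    # One hour of GH Archive data: YYYY-MM-DD-H.json.gz (H is 0-23, no leading zero)
    url = "https://data.gharchive.org/2023-04-01-15.json.gz"
    local_file = "2023-04-01-15.json.gz"
    urllib.request.urlretrieve(url, local_file)

    # Each line of the decompressed file is one JSON-encoded GitHub event
    with gzip.open(local_file, "rt", encoding="utf-8") as f:
        first_event = json.loads(f.readline())

    print(first_event["type"], first_event["repo"]["name"], first_event["created_at"])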

Architecture

(architecture diagram)

back to index

Technologies/Tools

  • Google Cloud Storage (data lake)
  • Google BigQuery (data warehouse)
  • Google Dataproc (Spark cluster)
  • PySpark (data processing)
  • Apache Airflow (orchestration)
  • Docker & Docker-Compose (containerisation)
  • Terraform (infrastructure as code)
  • Looker Studio (dashboard)

About the Project

  • Starting from April 1, GitHub Archive data is ingested daily into Google Cloud Storage
  • A PySpark job is run on the data in GCS using Google Dataproc
  • The results are written to 2 pre-defined tables in Google BigQuery
  • A dashboard is created from the BigQuery tables
  • Cloud resources (Storage bucket, BigQuery tables) are created with Terraform
  • Extract & load scripts and the PySpark job are orchestrated with Airflow (a minimal DAG sketch follows this list)
  • Dataproc cluster is created with Airflow and is deleted after the job completes
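
A minimal, hedged sketch of how such a DAG can be wired together with the Google provider's Dataproc operators (task IDs, cluster sizing, file URIs and the ingest callable are illustrative placeholders, not the project's exact DAG):

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python import PythonOperator
    from airflow.providers.google.cloud.operators.dataproc import (
        DataprocCreateClusterOperator,
        DataprocDeleteClusterOperator,
        DataprocSubmitJobOperator,
    )

    PROJECT_ID = "your-project-id"      # placeholder
    REGION = "europe-west1"             # placeholder
    CLUSTER_NAME = "gharchive-cluster"  # placeholder


    def ingest_to_gcs(ds, **_):
        """Download the day's GH Archive files and upload them to GCS (details omitted)."""
        ...


    with DAG(
        dag_id="gharchive_dag",
        start_date=datetime(2023, 4, 1),
        schedule_interval="0 8 * * *",  # daily at 8:00 UTC
        catchup=True,                   # backfill from April 1 up to the present day
    ) as dag:
        ingest = PythonOperator(task_id="ingest_to_gcs", python_callable=ingest_to_gcs)

        create_cluster = DataprocCreateClusterOperator(
            task_id="create_cluster",
            project_id=PROJECT_ID,
            region=REGION,
            cluster_name=CLUSTER_NAME,
            cluster_config={},  # machine types, worker count, etc. go here
        )

        submit_job = DataprocSubmitJobOperator(
            task_id="submit_pyspark_job",
            project_id=PROJECT_ID,
            region=REGION,
            job={
                "placement": {"cluster_name": CLUSTER_NAME},
                "pyspark_job": {"main_python_file_uri": "gs://<your-bucket>/spark_job.py"},
            },
        )

        delete_cluster = DataprocDeleteClusterOperator(
            task_id="delete_cluster",
            project_id=PROJECT_ID,
            region=REGION,
            cluster_name=CLUSTER_NAME,
            trigger_rule="all_done",  # tear the cluster down even if the job fails
        )

        ingest >> create_cluster >> submit_job >> delete_cluster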

Dashboard

The dashboard is built in Looker Studio and is publicly available at this link.

In case the dashboard is not accessible, there is a screenshot below: (dashboard image)

back to index

Reproducibility

Pre-Requisites

Google Cloud Platform Account

  1. Create a GCP account if you do not have one. Note that GCP offers $300 free credits for 90 days
  2. Create a new project from the GCP dashboard. Note your project ID

Create a Service Account

  1. Go to IAM & Admin > Service Accounts
  2. Click Create Service Account. More information here
  3. Add the following roles to the service account:
    • Viewer
    • Storage Admin
    • Storage Object Admin
    • BigQuery Admin
    • Dataproc Administrator
  4. Download the private JSON keyfile. Rename it to google_credentials.json and store it in ${HOME}/.google/credentials/
  5. You will need to enable these APIs if you have not done so already

back to index

Pre-Infrastructure Setup

Terraform is used to set up most of the infrastructure, but the virtual machine was created in the cloud console. Follow the instructions below to create a VM.

You can also reproduce this project on your local machine, but it is much better to use a VM. If you still choose to use your local machine, install the necessary packages there.

Setting up a Virtual Machine on GCP
  1. On the project dashboard, go to Compute Engine > VM Instances
  2. Create a new instance
    • Use any name of your choosing
    • Choose a region that suits you most

      All your GCP resources should be in the same region

    • For machine configuration, choose the E2 series. An e2-standard-2 (2 vCPU, 8 GB memory) is sufficient for this project
    • In the Boot disk section, change the image to Ubuntu, preferably Ubuntu 20.04 LTS. A disk size of 30 GB is also enough.
    • Leave all other settings on default value and click Create

You will need to enable the Compute Engine API if you have not already.

back to index

Installing Required Packages on the VM

Before installing packages on the VM, an SSH key has to be created to connect to it.

SSH Key Connection
  1. To create the SSH key, check this guide
  2. Copy the public key from the ~/.ssh folder
  3. On the GCP dashboard, navigate to Compute Engine > Metadata > SSH KEYS
  4. Click Edit. Then click Add Item. Paste the public key and click Save
  5. Go to the VM instance you created and copy the External IP
  6. Go back to your terminal and type this command in your home directory
    ssh -i <path-to-private-key> <USER>@<External IP>
    • This should connect you to the VM
  7. When you're through with using the VM, you should always shut it down. You can do this either on the GCP dashboard or on your terminal
    sudo shutdown now
Google Cloud SDK

Google Cloud SDK is already pre-installed on a GCP VM. You can confirm by running gcloud --version.
If you are not using a VM, check this link to install it on your local machine

Docker
  1. Connect to your VM
  2. Install Docker
    sudo apt-get update
    sudo apt-get install docker.io
  3. Docker needs to be configured so that it can run without sudo
    sudo groupadd docker
    sudo gpasswd -a $USER docker
    sudo service docker restart
    • Logout of your SSH session and log back in
    • Test that docker works successfully by running docker run hello-world
Docker-Compose
  1. Check and copy the latest release for Linux from the official GitHub repository
  2. Create a folder called bin/ in the home directory. Navigate into the ~/bin directory and download the binary file there
    wget <copied-file> -O docker-compose
  3. Make the file executable
    chmod +x docker-compose
  4. Add the ~/bin directory to PATH permanently
    • Open the .bashrc file in the HOME directory
    nano .bashrc
    • Go to the end of the file and paste this there
    export PATH="${HOME}/bin:${PATH}"
    • Save the file (CTRL-O) and exit nano (CTRL-X)
    • Reload the PATH variable
    source .bashrc
  5. You should be able to run docker-compose from anywhere now. Test this with docker-compose --version
Terraform
  1. Navigate to the bin/ directory that you created and run this
    wget https://releases.hashicorp.com/terraform/1.1.7/terraform_1.1.7_linux_amd64.zip
  2. Unzip the file
    unzip terraform_1.1.7_linux_amd64.zip

    You might have to install unzip first: sudo apt-get install unzip

  3. Remove the zip file
    rm terraform_1.1.7_linux_amd64.zip
  4. Terraform is now installed. Test it with terraform -v
Google Application Credentials

The JSON credentials file you downloaded is on your local machine. We are going to transfer it to the VM using scp.

  1. On your local machine, navigate to the location of the credentials file ${HOME}/.google/credentials/

  2. Copy credentials file to vm

    scp google_credentials.json <your vm user>@<vm external IP>:/home/<your vm user>/.google/credentials/google_credentials.json
    

    You might need to specify your identity (SSH key) by adding -i /path/to/your/private/key after the scp command

  3. Connect to your VM using ssh -i /path/to/private/ssh/key <your vm user>@<vm external IP> and check that the file is there: ls ~/.google/credentials

  4. For convenience, add this line to the end of the .bashrc file

    export GOOGLE_APPLICATION_CREDENTIALS=${HOME}/.google/credentials/google_credentials.json
    • Refresh with source .bashrc
  5. Use the service account credentials file for authentication

    gcloud auth activate-service-account --key-file $GOOGLE_APPLICATION_CREDENTIALS
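
As a quick, optional sanity check that client libraries pick up the key (assuming the google-auth Python package is available; google.auth.default() resolves Application Default Credentials):

    # Verify that GOOGLE_APPLICATION_CREDENTIALS resolves to usable credentials
    import google.auth

    credentials, project_id = google.auth.default()
    print("Authenticated, default project:", project_id)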
Remote-SSH

To work with folders on a remote machine in Visual Studio Code, you need this extension. This extension also simplifies the forwarding of ports.

  1. Install the Remote-SSH extension from the Extensions Marketplace
  2. At the bottom left-hand corner, click the Open a Remote Window icon
  3. Click Connect to Host, then click the name of the host from your SSH config file
  4. In the Explorer tab, open any folder on your virtual machine. You can now run this project entirely from VS Code.

back to index

Main

Clone the repository

    git clone https://github.com/alinali87/de-zoomcamp-project.git

Create remaining infrastructure with Terraform

We use Terraform to create a GCS bucket, a BQ dataset, and 2 BQ tables

  1. Navigate to the terraform folder
  2. Initialise terraform
    terraform init
  3. Check infrastructure plan
    terraform plan
  4. Create new infrastructure
    terraform apply
  5. Confirm that the infrastructure has been created on the GCP dashboard

Initialise Airflow

Airflow is run in Docker. This section contains the steps for initialising Airflow resources.

  1. Navigate to the airflow folder
  2. Create a logs folder airflow/logs/
    mkdir logs/
  3. Build the docker image
    docker-compose build
  4. The names of some project resources are hardcoded in the docker-compose.yaml file. Change these values to suit your use case
  5. Initialise Airflow resources
    docker-compose up airflow-init
  6. Kick up all other services
    docker-compose up
  7. Open another terminal instance and check docker running services
    docker ps
    • Check if all the services are healthy
  8. Forward port 8080 from VS Code. Open localhost:8080 in your browser and sign in to Airflow

    Both the username and the password are airflow

Run the pipeline

You are already signed in to Airflow. Now it's time to run the pipeline.

  1. Click on the DAG gharchive_dag that you see there
  2. You should see a tree-like structure of the DAG you're about to run
  3. At the top right-hand corner, trigger the DAG. Make sure Auto-refresh is turned on before doing this

    The DAG will run from April 1 at 8:00 am UTC until 8:00 am UTC of the present day
    This should take a while

  4. While this is going on, check the cloud console to confirm that everything is working as expected

    If you face any problems or errors, confirm that you have followed all the above instructions religiously. If the problem still persists, raise an issue.

  5. When the pipeline is finished, and you've confirmed that everything went well, shut down docker-compose with CTRL-C and kill all containers with docker-compose down
  6. Take a well-deserved break to rest. This has been a long ride.

back to index

Notes

  • Partitioning and clustering are pre-defined on the tables in the data warehouse. You can check the definition in the main Terraform file (or inspect it from code, as sketched below this list)
  • Dataproc configuration is in the gharchive_dag.py file.
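
If you prefer to confirm the partitioning and clustering from code rather than in the console, here is a small sketch with the BigQuery client library (the table reference is a placeholder):

    from google.cloud import bigquery

    client = bigquery.Client()

    # Placeholder reference: <project>.<dataset>.<table>
    table = client.get_table("your-project-id.gharchive.events")

    print("Partitioning:", table.time_partitioning)
    print("Clustering fields:", table.clustering_fields)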

Acknowledgements

I'd like to thank the organisers of this wonderful course. It has given me valuable insights into the field of Data Engineering. Also, all fellow students who took time to answer my questions on the Slack channel, thank you very much.

back to index
