This is my project for the Data Engineering Zoomcamp by DataTalks.Club.
- You would like to research GitHub activity to find out some trends.
- You have data about GitHub events for April 2023.
- Here are some questions that you want to answer (a query sketch follows this list):
  - How many events happen on GitHub daily?
  - What is the most popular type of event?
  - What are the top 10 most active repos?
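Once the pipeline described below has loaded the events into BigQuery, questions like these reduce to simple aggregations. Here is a minimal sketch of the first one, assuming hypothetical dataset, table, and column names (`gharchive.events_all`, `created_at`) rather than the project's actual schema:

```python
# Minimal sketch: count GitHub events per day in BigQuery.
# The table `gharchive.events_all` and the `created_at` column
# are assumptions for illustration, not the project's real schema.
from google.cloud import bigquery

client = bigquery.Client()
query = """
    SELECT DATE(created_at) AS day, COUNT(*) AS n_events
    FROM `gharchive.events_all`
    GROUP BY day
    ORDER BY day
"""
for row in client.query(query).result():
    print(row.day, row.n_events)
```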
GitHub Archive is a project to record the public GitHub timeline, archive it, and make it accessible for further analysis.
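GH Archive publishes one gzipped JSON file per hour (e.g. `https://data.gharchive.org/2023-04-01-0.json.gz`). As a rough sketch of what the extract & load step does (the bucket name below is a placeholder, and this mirrors rather than reproduces the project's scripts):

```python
# Sketch: fetch one hour of GH Archive data and land it in GCS.
# Bucket and blob names are placeholders, not the project's real ones.
import requests
from google.cloud import storage

URL = "https://data.gharchive.org/2023-04-01-0.json.gz"  # hourly archive file

resp = requests.get(URL, timeout=60)
resp.raise_for_status()

client = storage.Client()
bucket = client.bucket("your-gcs-bucket")           # placeholder bucket
blob = bucket.blob("raw/2023-04-01-0.json.gz")      # placeholder path
blob.upload_from_string(resp.content, content_type="application/gzip")
```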
The following technologies are used in this project:
- Containerisation - Docker
- Infrastructure-as-Code (IaC) - Terraform
- Cloud - Google Cloud Platform
- Workflow Orchestration - Airflow
- Data Lake - Google Cloud Storage
- Data Warehouse - Google BigQuery
- Batch Processing - Google DataProc
- Visualisation - Google Data Studio (now Looker Studio)
- Starting from April 1, 2023, GitHub Archive data is ingested daily into Google Cloud Storage
- A PySpark job is run on the data in GCS using Google DataProc
- The results are written to 2 pre-defined tables in Google BigQuery
- A dashboard is created from the BigQuery tables
- Cloud resources (Storage bucket, BigQuery tables) are created with Terraform
- Extract & load scripts and PySpark Job are orchestrated with Airflow
- Dataproc cluster is created with Airflow and is deleted after the job completes
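Here is a condensed sketch of that orchestration. The project id, region, cluster name, bucket, and cluster sizes are placeholders; the real DAG lives in `gharchive_dag.py` in the repository.

```python
# Sketch of the pipeline's orchestration with Airflow's Dataproc operators.
# All names and sizes are placeholders, not the project's real settings.
from datetime import datetime
from airflow import DAG
from airflow.providers.google.cloud.operators.dataproc import (
    DataprocCreateClusterOperator,
    DataprocDeleteClusterOperator,
    DataprocSubmitJobOperator,
)

PROJECT_ID = "your-project-id"   # placeholder
REGION = "europe-west1"          # placeholder
CLUSTER = "gharchive-cluster"    # placeholder

with DAG(
    "gharchive_dag_sketch",
    start_date=datetime(2023, 4, 1),
    schedule_interval="0 8 * * *",  # daily at 8:00am UTC
    catchup=True,                   # backfill every day since April 1, 2023
) as dag:
    create_cluster = DataprocCreateClusterOperator(
        task_id="create_cluster",
        project_id=PROJECT_ID,
        region=REGION,
        cluster_name=CLUSTER,
        cluster_config={  # minimal shape; the real DAG defines its own
            "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-2"},
            "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-2"},
        },
    )

    submit_pyspark = DataprocSubmitJobOperator(
        task_id="submit_pyspark_job",
        project_id=PROJECT_ID,
        region=REGION,
        job={
            "placement": {"cluster_name": CLUSTER},
            "pyspark_job": {"main_python_file_uri": "gs://your-bucket/job.py"},
        },
    )

    delete_cluster = DataprocDeleteClusterOperator(
        task_id="delete_cluster",
        project_id=PROJECT_ID,
        region=REGION,
        cluster_name=CLUSTER,
        trigger_rule="all_done",  # tear the cluster down even if the job fails
    )

    create_cluster >> submit_pyspark >> delete_cluster
```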
The dashboard is built in Looker Studio and publicly available at this link.
In case the dashboard is not accessible, there is an image below:
- Create a GCP account if you do not have one. Note that GCP offers $300 free credits for 90 days
- Create a new project from the GCP dashboard. Note your project ID
- Go to IAM & Admin > Service Accounts
- Click Create Service Account. More information here
- Add the following roles to the service account:
- Viewer
- Storage Admin
- Storage Object Admin
- BigQuery Admin
- DataProc Administrator
- Download the private JSON keyfile. Rename it to `google_credentials.json` and store it in `${HOME}/.google/credentials/`
- You will need to enable these APIs if you have not done so already
Terraform is used to set up most of the infrastructure, but the Virtual Machine was created in the cloud console. Follow the instructions below to create a VM.
You can also use your local machine to reproduce this project, but it is much better to use a VM. If you still choose to use your local machine, install the necessary packages there.
- On the project dashboard, go to Compute Engine > VM Instances
- Create a new instance
- Use any name of your choosing
- Choose a region that suits you best
  Note: all your GCP resources should be in the same region
- For machine configuration, choose the E2 series. An e2-standard-2 (2 vCPU, 8 GB memory) is sufficient for this project
- In the Boot disk section, change the image to Ubuntu, preferably Ubuntu 20.04 LTS. A disk size of 30 GB is also enough
- Leave all other settings on default value and click Create
You will need to enable the Compute Engine API if you have not already.
Before installing packages on the VM, an SSH key has to be created to connect to the VM
- To create the SSH key, check this guide
- Copy the contents of the public key in the `~/.ssh` folder
- On the GCP dashboard, navigate to Compute Engine > Metadata > SSH KEYS
- Click Edit. Then click Add Item. Paste the public key and click Save
- Go to the VM instance you created and copy the External IP
- Go back to your terminal and, from your home directory, run `ssh -i <path-to-private-key> <USER>@<External IP>`
- This should connect you to the VM
- When you're through with using the VM, you should always shut it down. You can do this either on the GCP dashboard or from your terminal with `sudo shutdown now`
Google Cloud SDK comes pre-installed on a GCP VM. You can confirm this by running `gcloud --version`.
If you are not using a VM, check this link to install it on your local machine.
- Connect to your VM
- Install Docker:
  - `sudo apt-get update`
  - `sudo apt-get install docker.io`
- Docker needs to be configured so that it can run without `sudo`:
  - `sudo groupadd docker`
  - `sudo gpasswd -a $USER docker`
  - `sudo service docker restart`
- Log out of your SSH session and log back in
- Test that Docker works by running `docker run hello-world`
- Check and copy the latest release for Linux from the official GitHub repository
- Create a folder called `bin/` in the home directory. Navigate into the `bin/` directory and download the binary file there: `wget <copied-file> -O docker-compose`
- Make the file executable: `chmod +x docker-compose`
- Add the `bin/` directory to PATH permanently:
  - Open the `.bashrc` file in the HOME directory: `nano .bashrc`
  - Go to the end of the file and paste this there: `export PATH="${HOME}/bin:${PATH}"`
  - Save the file (CTRL-O) and exit nano (CTRL-X)
  - Reload the PATH variable: `source .bashrc`
- You should be able to run docker-compose from anywhere now. Test this with `docker-compose --version`
- Navigate to the `bin/` directory that you created and run `wget https://releases.hashicorp.com/terraform/1.1.7/terraform_1.1.7_linux_amd64.zip`
- Unzip the file: `unzip terraform_1.1.7_linux_amd64.zip` (you might have to install unzip first with `sudo apt-get install unzip`)
- Remove the zip file: `rm terraform_1.1.7_linux_amd64.zip`
- Terraform is now installed. Test it with `terraform -v`
The JSON credentials file you downloaded is on your local machine. We are going to transfer it to the VM using `scp`.
- On your local machine, navigate to the location of the credentials file, `${HOME}/.google/credentials/`
- Copy the credentials file to the VM: `scp google_credentials.json <your vm user>@<vm external IP>:/home/<your vm user>/.google/credentials/google_credentials.json`
  You might need to specify your identity (SSH key) by adding `-i /path/to/your/private/key` after the `scp` command
- Connect to your VM using `ssh -i /path/to/private/ssh/key <your vm user>@<vm external IP>` and check that the file is there: `ls ~/.google/credentials`
- For convenience, add this line to the end of the `.bashrc` file: `export GOOGLE_APPLICATION_CREDENTIALS=${HOME}/.google/credentials/google_credentials.json`
  Then refresh with `source .bashrc`
- Use the service account credentials file for authentication: `gcloud auth activate-service-account --key-file $GOOGLE_APPLICATION_CREDENTIALS`
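As an optional sanity check (not part of the original steps), a short Python snippet can confirm that the credentials are picked up, assuming the `google-cloud-storage` package is installed:

```python
# Quick check that GOOGLE_APPLICATION_CREDENTIALS is honoured:
# listing buckets fails loudly if authentication is broken.
from google.cloud import storage

client = storage.Client()
print([bucket.name for bucket in client.list_buckets()])
```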
To work with folders on a remote machine in Visual Studio Code, you need this extension. It also simplifies the forwarding of ports.
- Install the Remote-SSH extension from the Extensions Marketplace
- At the bottom left-hand corner, click the Open a Remote Window icon
- Click Connect to Host, then click the name of the host from your SSH config file
- In the Explorer tab, open any folder on your Virtual Machine

Now you can use VS Code to run this project entirely on the VM.
Clone the repository: `git clone https://github.com/alinali87/de-zoomcamp-project.git`
We use Terraform to create a GCS bucket, a BQ dataset, and 2 BQ tables.
- Navigate to the terraform folder
- Initialise terraform: `terraform init`
- Check the infrastructure plan: `terraform plan`
- Create the new infrastructure: `terraform apply`
- Confirm that the infrastructure has been created on the GCP dashboard
Airflow is run in a Docker container. This section contains the steps for initialising Airflow resources.
- Navigate to the airflow folder
- Create a logs folder, `airflow/logs/`: `mkdir logs/`
- Build the docker image: `docker-compose build`
- The names of some project resources are hardcoded in the docker-compose.yaml file. Change these values to suit your use case
- Initialise Airflow resources: `docker-compose up airflow-init`
- Kick up all the other services: `docker-compose up`
- Open another terminal instance and check the running docker services: `docker ps`
- Check if all the services are healthy
- Forward port 8080 from VS Code. Open `localhost:8080` in your browser and sign in to Airflow. Both the username and the password are `airflow`
You are already signed in to Airflow. Now it's time to run the pipeline.
- Click on the DAG `gharchive_dag` that you see there
- You should see a tree-like structure of the DAG you're about to run
- At the top right-hand corner, trigger the DAG. Make sure Auto-refresh is turned on before doing this.
  The DAG will run from April 1, 2023 at 8:00am UTC till 8:00am UTC of the present day. This should take a while.
- While this is going on, check the cloud console to confirm that everything is working accordingly.
  If you face any problem or error, confirm that you have followed all the above instructions religiously. If the problems still persist, raise an issue.
- When the pipeline is finished and you've confirmed that everything went well, shut down docker-compose with CTRL-C and kill all containers with `docker-compose down`
- Take a well-deserved break. This has been a long ride.
- Partitioning and clustering are pre-defined on the tables in the data warehouse. You can check the definitions in the main terraform file (a verification snippet is sketched below)
- The Dataproc configuration is in the gharchive_dag.py file (an illustrative config shape is sketched below)
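Purely as an illustration, a Dataproc cluster config passed to Airflow's Dataproc operators typically has the shape below. All machine types and sizes here are placeholders, not the project's actual settings from `gharchive_dag.py`:

```python
# Illustrative Dataproc cluster config (placeholder values only).
CLUSTER_CONFIG = {
    "master_config": {
        "num_instances": 1,
        "machine_type_uri": "n1-standard-2",
        "disk_config": {"boot_disk_type": "pd-standard", "boot_disk_size_gb": 100},
    },
    "worker_config": {
        "num_instances": 2,
        "machine_type_uri": "n1-standard-2",
        "disk_config": {"boot_disk_type": "pd-standard", "boot_disk_size_gb": 100},
    },
    "software_config": {"image_version": "2.0-debian10"},
}
```

Similarly, the partitioning and clustering on a warehouse table can be verified from Python; the table id below is a placeholder to adjust to your project:

```python
# Inspect partitioning/clustering of a BigQuery table (placeholder id).
from google.cloud import bigquery

client = bigquery.Client()
table = client.get_table("your-project-id.gharchive.events_all")
print(table.time_partitioning)   # e.g. DAY partitioning on a timestamp field
print(table.clustering_fields)   # e.g. a list of clustering columns
```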
I'd like to thank the organisers of this wonderful course. It has given me valuable insights into the field of Data Engineering. Also, all fellow students who took time to answer my questions on the Slack channel, thank you very much.