This is my project for the Data Engineering Zoomcamp by DataTalks.Club.
- You would like to research GitHub activity to find out some trends.
- You have data about GitHub events for April 2023.
- Here are some questions that you want to answer (a query sketch follows this list):
  - How many events happen on GitHub daily?
  - What is the most popular type of event?
  - What are the top 10 most active repos?
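Once the pipeline described below has loaded the events into BigQuery, questions like these reduce to simple aggregations. Here is a minimal sketch of the first one, assuming hypothetical dataset, table, and column names (`gharchive.events_all`, `created_at`) rather than the project's actual schema:

```python
# Minimal sketch: count GitHub events per day in BigQuery.
# The table `gharchive.events_all` and the `created_at` column
# are assumptions for illustration, not the project's real schema.
from google.cloud import bigquery

client = bigquery.Client()
query = """
    SELECT DATE(created_at) AS day, COUNT(*) AS n_events
    FROM `gharchive.events_all`
    GROUP BY day
    ORDER BY day
"""
for row in client.query(query).result():
    print(row.day, row.n_events)
```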
GitHub Archive is a project to record the public GitHub timeline, archive it, and make it accessible for further analysis.
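GH Archive publishes one gzipped JSON file per hour (e.g. `https://data.gharchive.org/2023-04-01-0.json.gz`). As a rough sketch of what the extract & load step does (the bucket name below is a placeholder, and this mirrors rather than reproduces the project's scripts):

```python
# Sketch: fetch one hour of GH Archive data and land it in GCS.
# Bucket and blob names are placeholders, not the project's real ones.
import requests
from google.cloud import storage

URL = "https://data.gharchive.org/2023-04-01-0.json.gz"  # hourly archive file

resp = requests.get(URL, timeout=60)
resp.raise_for_status()

client = storage.Client()
bucket = client.bucket("your-gcs-bucket")           # placeholder bucket
blob = bucket.blob("raw/2023-04-01-0.json.gz")      # placeholder path
blob.upload_from_string(resp.content, content_type="application/gzip")
```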
The following technologies are used in this project:
- Containerisation - Docker
- Infrastructure-as-Code (IaC) - Terraform
- Cloud - Google Cloud Platform
- Workflow Orchestration - Airflow
- Data Lake - Google Cloud Storage
- Data Warehouse - Google BigQuery
- Batch Processing - Google DataProc
- Visualisation - Google Data Studio (now Looker Studio)
- Starting from April 1, 2023, GitHub Archive data is ingested daily into Google Cloud Storage
- A PySpark job is run on the data in GCS using Google DataProc
- The results are written to 2 pre-defined tables in Google BigQuery
- A dashboard is created from the BigQuery tables
- Cloud resources (Storage bucket, BigQuery tables) are created with Terraform
- Extract & load scripts and PySpark Job are orchestrated with Airflow
- Dataproc cluster is created with Airflow and is deleted after the job completes
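Here is a condensed sketch of that orchestration. The project id, region, cluster name, bucket, and cluster sizes are placeholders; the real DAG lives in `gharchive_dag.py` in the repository.

```python
# Sketch of the pipeline's orchestration with Airflow's Dataproc operators.
# All names and sizes are placeholders, not the project's real settings.
from datetime import datetime
from airflow import DAG
from airflow.providers.google.cloud.operators.dataproc import (
    DataprocCreateClusterOperator,
    DataprocDeleteClusterOperator,
    DataprocSubmitJobOperator,
)

PROJECT_ID = "your-project-id"   # placeholder
REGION = "europe-west1"          # placeholder
CLUSTER = "gharchive-cluster"    # placeholder

with DAG(
    "gharchive_dag_sketch",
    start_date=datetime(2023, 4, 1),
    schedule_interval="0 8 * * *",  # daily at 8:00am UTC
    catchup=True,                   # backfill every day since April 1, 2023
) as dag:
    create_cluster = DataprocCreateClusterOperator(
        task_id="create_cluster",
        project_id=PROJECT_ID,
        region=REGION,
        cluster_name=CLUSTER,
        cluster_config={  # minimal shape; the real DAG defines its own
            "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-2"},
            "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-2"},
        },
    )

    submit_pyspark = DataprocSubmitJobOperator(
        task_id="submit_pyspark_job",
        project_id=PROJECT_ID,
        region=REGION,
        job={
            "placement": {"cluster_name": CLUSTER},
            "pyspark_job": {"main_python_file_uri": "gs://your-bucket/job.py"},
        },
    )

    delete_cluster = DataprocDeleteClusterOperator(
        task_id="delete_cluster",
        project_id=PROJECT_ID,
        region=REGION,
        cluster_name=CLUSTER,
        trigger_rule="all_done",  # tear the cluster down even if the job fails
    )

    create_cluster >> submit_pyspark >> delete_cluster
```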
The dashboard is built in Looker Studio and publicly available at this link.
In case the dashboard is not accessible, there is an image below:
- Create a GCP account if you do not have one. Note that GCP offers $300 free credits for 90 days
- Create a new project from the GCP dashboard. Note your project ID
- Go to IAM & Admin > Service Accounts
- Click Create Service Account. More information here
- Add the following roles to the service account:
- Viewer
- Storage Admin
- Storage Object Admin
- BigQuery Admin
- DataProc Administrator
- Download the private JSON keyfile. Rename it to `google_credentials.json` and store it in `${HOME}/.google/credentials/`
- You will need to enable these APIs if you have not done so already
Terraform is used to set up most of the infrastructure, but the Virtual Machine was created in the cloud console. Follow the instructions below to create a VM.
You can also use your local machine to reproduce this project, but it is much better to use a VM. If you still choose to use your local machine, install the necessary packages there.
- On the project dashboard, go to Compute Engine > VM Instances
- Create a new instance
- Use any name of your choosing
- Choose a region that suits you best
  Note: all your GCP resources should be in the same region
- For machine configuration, choose the E2 series. An e2-standard-2 (2 vCPU, 8 GB memory) is sufficient for this project
- In the Boot disk section, change the image to Ubuntu, preferably Ubuntu 20.04 LTS. A disk size of 30 GB is also enough
- Leave all other settings on default value and click Create
You will need to enable the Compute Engine API if you have not already.
Before installing packages on the VM, an SSH key has to be created to connect to the VM
- To create the SSH key, check this guide
- Copy the contents of the public key in the `~/.ssh` folder
- On the GCP dashboard, navigate to Compute Engine > Metadata > SSH KEYS
- Click Edit. Then click Add Item. Paste the public key and click Save
- Go to the VM instance you created and copy the External IP
- Go back to your terminal and, from your home directory, run `ssh -i <path-to-private-key> <USER>@<External IP>`
- This should connect you to the VM
- When you're through with using the VM, you should always shut it down. You can do this either on the GCP dashboard or from your terminal with `sudo shutdown now`
Google Cloud SDK comes pre-installed on a GCP VM. You can confirm this by running `gcloud --version`.
If you are not using a VM, check this link to install it on your local machine.
- Connect to your VM
- Install Docker:
  - `sudo apt-get update`
  - `sudo apt-get install docker.io`
- Docker needs to be configured so that it can run without `sudo`:
  - `sudo groupadd docker`
  - `sudo gpasswd -a $USER docker`
  - `sudo service docker restart`
- Log out of your SSH session and log back in
- Test that Docker works by running `docker run hello-world`
- Check and copy the latest release for Linux from the official GitHub repository
- Create a folder called `bin/` in the home directory. Navigate into the `bin/` directory and download the binary file there: `wget <copied-file> -O docker-compose`
- Make the file executable: `chmod +x docker-compose`
- Add the `bin/` directory to PATH permanently:
  - Open the `.bashrc` file in the HOME directory: `nano .bashrc`
  - Go to the end of the file and paste this there: `export PATH="${HOME}/bin:${PATH}"`
  - Save the file (CTRL-O) and exit nano (CTRL-X)
  - Reload the PATH variable: `source .bashrc`
- You should be able to run docker-compose from anywhere now. Test this with `docker-compose --version`
- Navigate to the `bin/` directory that you created and run `wget https://releases.hashicorp.com/terraform/1.1.7/terraform_1.1.7_linux_amd64.zip`
- Unzip the file: `unzip terraform_1.1.7_linux_amd64.zip` (you might have to install unzip first with `sudo apt-get install unzip`)
- Remove the zip file: `rm terraform_1.1.7_linux_amd64.zip`
- Terraform is now installed. Test it with `terraform -v`
The JSON credentials file you downloaded is on your local machine. We are going to transfer it to the VM using `scp`.
- On your local machine, navigate to the location of the credentials file, `${HOME}/.google/credentials/`
- Copy the credentials file to the VM: `scp google_credentials.json <your vm user>@<vm external IP>:/home/<your vm user>/.google/credentials/google_credentials.json`
  You might need to specify your identity (SSH key) by adding `-i /path/to/your/private/key` after the `scp` command
- Connect to your VM using `ssh -i /path/to/private/ssh/key <your vm user>@<vm external IP>` and check that the file is there: `ls ~/.google/credentials`
- For convenience, add this line to the end of the `.bashrc` file: `export GOOGLE_APPLICATION_CREDENTIALS=${HOME}/.google/credentials/google_credentials.json`
  Then refresh with `source .bashrc`
- Use the service account credentials file for authentication: `gcloud auth activate-service-account --key-file $GOOGLE_APPLICATION_CREDENTIALS`
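As an optional sanity check (not part of the original steps), a short Python snippet can confirm that the credentials are picked up, assuming the `google-cloud-storage` package is installed:

```python
# Quick check that GOOGLE_APPLICATION_CREDENTIALS is honoured:
# listing buckets fails loudly if authentication is broken.
from google.cloud import storage

client = storage.Client()
print([bucket.name for bucket in client.list_buckets()])
```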
To work with folders on a remote machine in Visual Studio Code, you need this extension. It also simplifies the forwarding of ports.
- Install the Remote-SSH extension from the Extensions Marketplace
- At the bottom left-hand corner, click the Open a Remote Window icon
- Click Connect to Host, then click the name of the host from your SSH config file
- In the Explorer tab, open any folder on your Virtual Machine

Now you can use VS Code to run this project entirely on the VM.
Clone the repository: `git clone https://github.com/alinali87/de-zoomcamp-project.git`
We use Terraform to create a GCS bucket, a BQ dataset, and 2 BQ tables.
- Navigate to the terraform folder
- Initialise terraform: `terraform init`
- Check the infrastructure plan: `terraform plan`
- Create the new infrastructure: `terraform apply`
- Confirm that the infrastructure has been created on the GCP dashboard
Airflow is run in a Docker container. This section contains the steps for initialising Airflow resources.
- Navigate to the airflow folder
- Create a logs folder, `airflow/logs/`: `mkdir logs/`
- Build the docker image: `docker-compose build`
- The names of some project resources are hardcoded in the docker-compose.yaml file. Change these values to suit your use case
- Initialise Airflow resources: `docker-compose up airflow-init`
- Kick up all the other services: `docker-compose up`
- Open another terminal instance and check the running docker services: `docker ps`
- Check if all the services are healthy
- Forward port 8080 from VS Code. Open `localhost:8080` in your browser and sign in to Airflow. Both the username and the password are `airflow`
You are already signed in to Airflow. Now it's time to run the pipeline.
- Click on the DAG `gharchive_dag` that you see there
- You should see a tree-like structure of the DAG you're about to run
- At the top right-hand corner, trigger the DAG. Make sure Auto-refresh is turned on before doing this.
  The DAG will run from April 1, 2023 at 8:00am UTC till 8:00am UTC of the present day. This should take a while.
- While this is going on, check the cloud console to confirm that everything is working accordingly.
  If you face any problem or error, confirm that you have followed all the above instructions religiously. If the problems still persist, raise an issue.
- When the pipeline is finished and you've confirmed that everything went well, shut down docker-compose with CTRL-C and kill all containers with `docker-compose down`
- Take a well-deserved break. This has been a long ride.
- Partitioning and clustering are pre-defined on the tables in the data warehouse. You can check the definitions in the main terraform file (a verification snippet is sketched below)
- The Dataproc configuration is in the gharchive_dag.py file (an illustrative config shape is sketched below)
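Purely as an illustration, a Dataproc cluster config passed to Airflow's Dataproc operators typically has the shape below. All machine types and sizes here are placeholders, not the project's actual settings from `gharchive_dag.py`:

```python
# Illustrative Dataproc cluster config (placeholder values only).
CLUSTER_CONFIG = {
    "master_config": {
        "num_instances": 1,
        "machine_type_uri": "n1-standard-2",
        "disk_config": {"boot_disk_type": "pd-standard", "boot_disk_size_gb": 100},
    },
    "worker_config": {
        "num_instances": 2,
        "machine_type_uri": "n1-standard-2",
        "disk_config": {"boot_disk_type": "pd-standard", "boot_disk_size_gb": 100},
    },
    "software_config": {"image_version": "2.0-debian10"},
}
```

Similarly, the partitioning and clustering on a warehouse table can be verified from Python; the table id below is a placeholder to adjust to your project:

```python
# Inspect partitioning/clustering of a BigQuery table (placeholder id).
from google.cloud import bigquery

client = bigquery.Client()
table = client.get_table("your-project-id.gharchive.events_all")
print(table.time_partitioning)   # e.g. DAY partitioning on a timestamp field
print(table.clustering_fields)   # e.g. a list of clustering columns
```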
I'd like to thank the organisers of this wonderful course. It has given me valuable insights into the field of Data Engineering. Also, all fellow students who took time to answer my questions on the Slack channel, thank you very much.