
de-zoomcamp-project

Overview

This project explores deciding, in a data-driven way, whether to cancel a meeting with a client. It introduces a probabilistic decision criterion: cancel the meeting if the probability of the flight arriving within 15 minutes of its scheduled arrival time is less than 70%.
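
For concreteness, the decision rule can be sketched as a tiny Python check (illustrative only; the function name and how the probability is obtained are not taken from the project code):

# Illustrative sketch of the probabilistic cancellation rule.
def should_cancel_meeting(prob_arrives_within_15_min: float) -> bool:
    """Cancel the client meeting if the on-time arrival probability is below 70%."""
    return prob_arrives_within_15_min < 0.70

# Example: a model estimate of a 65% on-time probability -> cancel the meeting.
print(should_cancel_meeting(0.65))  # True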

Dataset

The dataset that captures flight delays is called Airline On-Time Performance Data. It includes information such as the origin and destination airports, flight numbers, flight departure and arrival dates, and nonstop distance between the two airports.

Technologies

The following technologies have been utilized in the project:

  • Data Lake: GCP Cloud Storage.
  • Data Warehouse: GCP BigQuery.
  • Infrastructure-as-Code (IaC): Terraform.
  • Data Analysis & Exploration: SQL.
  • Data Transformation: Apache Spark (Batch processing).
  • Distributed Processing: GCP Dataproc.
  • Pipeline Orchestration: GitHub Actions.
  • Containerization: Docker (on Cloud Run).
  • Automation: GCP Cloud Run and GCP Cloud Scheduler.
  • Visualisation: Google Data Studio.

Architecture

Hub-and-Spoke Architecture

[Figure: hub-and-spoke architecture diagram]

The architecture of the monthly ingest job

[Figure: monthly ingest job architecture diagram]

Dashboard

[Figure: Data Studio dashboard]

Implementation Details

Ingesting data into the Data Lake (GCS) and Data Warehouse (BigQuery)

Create a bucket using Terraform

  • Start the VM instance de-zoomcamp.
  • Log in to the VM instance using web-based SSH.
  • Pull the latest Terraform files from the project's GitHub repo: git pull
  • Run the Terraform commands to create the GCP bucket for the project (a sketch of the bucket resource follows this list):
terraform init

# Check changes to new infra plan
terraform plan -var="project=de-zoomcamp-prj-375800"
# Create new infra
terraform apply -var="project=de-zoomcamp-prj-375800"
  • Verify in the GCP console that the bucket has been created.
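
The repository's Terraform files are not reproduced here; a minimal sketch of what the bucket resource could look like is shown below (the variable names, bucket naming scheme, and defaults are assumptions inferred from the bucket and region used later in this README):

# main.tf (sketch only; names and settings are assumptions)
variable "project" {
  description = "GCP project ID"
  type        = string
}

variable "region" {
  description = "Region for the data lake bucket"
  type        = string
  default     = "australia-southeast1"
}

resource "google_storage_bucket" "data_lake_bucket" {
  # e.g. dsongcp_data_lake_de-zoomcamp-prj-375800
  name          = "dsongcp_data_lake_${var.project}"
  location      = var.region
  storage_class = "STANDARD"
  force_destroy = true

  uniform_bucket_level_access = true
}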

Create a Deployment to automate the batch ingestion of the model training data with GitHub Actions

  • In your GitHub repository, click on the Actions tab.

  • Click on the New workflow button and select a template or create your own custom workflow.

  • In the workflow file, you can define the events that trigger the workflow, such as a push or pull request to a specific branch.

  • You can also define the jobs and steps that run when the workflow is triggered. For example, you can add a step to run your ingestion bash script.

  • To run your bash script in a step, you can use the run keyword followed by the command to execute your script. For example:

- name: Run ingestion script
  run: |
    chmod +x ./ingest.sh
    ./ingest.sh
  • Once you have defined your workflow, commit and push your changes to GitHub.

  • You can use environment variables to parameterize your deployment in GitHub Actions. For example:

jobs:
  deploy:
    runs-on: ubuntu-latest
    env:
      FIRST_YEAR: 2015
      LAST_YEAR: 2018
      BUCKET: my_gcs_bucket
  • In the step where you run your ingestion bash script, you can pass the values of these environment variables as arguments to your script. For example:
- name: Run ingestion script
  run: |
    chmod +x ./ingest.sh
    ./ingest.sh $FIRST_YEAR $LAST_YEAR $BUCKET
  • To change the values of these parameters when the script is executed, you can update the values of the environment variables in your workflow file and commit and push your changes to GitHub.

  • For GitHub Actions authentication to Google Cloud, use the recommended Workload Identity Federation (a sketch is shown below).
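
A hedged sketch of how the pieces above could fit together in one workflow, using the google-github-actions/auth action for Workload Identity Federation (the workflow name, pool/provider path, and action versions are placeholders or assumptions):

name: monthly-ingest
on:
  push:
    branches: [main]

jobs:
  deploy:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      id-token: write          # required for Workload Identity Federation
    env:
      FIRST_YEAR: 2015
      LAST_YEAR: 2018
      BUCKET: my_gcs_bucket
    steps:
      - uses: actions/checkout@v4

      # Exchange the GitHub OIDC token for Google Cloud credentials.
      # Replace the provider path and service account with your own values.
      - uses: google-github-actions/auth@v2
        with:
          workload_identity_provider: projects/PROJECT_NUMBER/locations/global/workloadIdentityPools/POOL_ID/providers/PROVIDER_ID
          service_account: dtc-de-zoomcamp-srv-acc@de-zoomcamp-prj-375800.iam.gserviceaccount.com

      # Make gcloud/gsutil available to the ingestion script.
      - uses: google-github-actions/setup-gcloud@v2

      - name: Run ingestion script
        run: |
          chmod +x ./ingest.sh
          ./ingest.sh $FIRST_YEAR $LAST_YEAR $BUCKET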

Deploy the monthly downloads for the model serving data with Cloud Run

  • Go to the ingest/monthlyupdate folder in the repo.
  • Make sure the Dockerfile and requirements.txt exist in the folder.
  • Grant the service account sufficient privileges to build and submit Docker images.
  • Deploy the Python web service (a minimal sketch of its entry point follows the commands below) as a Docker container to Cloud Run:
export NAME=ingest-flights-monthly
export SVC_ACCT=dtc-de-zoomcamp-srv-acc
export PROJECT_ID=$(gcloud config get-value project)
export REGION=australia-southeast1
export SVC_EMAIL=${SVC_ACCT}@${PROJECT_ID}.iam.gserviceaccount.com

gcloud run deploy $NAME --region $REGION --source=$(pwd) \
    --platform=managed --service-account ${SVC_EMAIL}  \
    --no-allow-unauthenticated --timeout 12m
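
The service itself lives in ingest/monthlyupdate; a minimal sketch of what its entry point might look like is shown below (the Flask framework, route, and payload fields are assumptions, kept consistent with the curl invocation in the next section):

# main.py (sketch only; see ingest/monthlyupdate for the actual service)
import logging
import os

from flask import Flask, request

app = Flask(__name__)

@app.route("/", methods=["POST"])
def ingest_next_month():
    payload = request.get_json(force=True)
    bucket = payload["bucket"]       # destination data lake bucket (required)
    year = payload.get("year")       # optional overrides; default to the next
    month = payload.get("month")     # month that is not yet in the bucket
    logging.info("Ingesting year=%s month=%s into gs://%s", year, month, bucket)
    # ... download the monthly on-time data, upload it to GCS,
    #     and load it into BigQuery ...
    return {"status": "ok", "bucket": bucket}, 200

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=int(os.environ.get("PORT", 8080)))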

Invoke the monthly downloads service on Cloud Run

  • Invoke the deployed web service on Cloud Run using the following commands:
NAME=ingest-flights-monthly
BUCKET=dsongcp_data_lake_de-zoomcamp-prj-375800
URL=$(gcloud run services describe ${NAME} --format 'value(status.url)')
echo $URL
# Ingest next month in last available year (2019)
echo {\"bucket\":\"${BUCKET}\"\} > /tmp/message
curl -k -X POST $URL \
   -H "Authorization: Bearer $(gcloud auth print-identity-token)" \
   -H "Content-Type:application/json" --data-binary @/tmp/message

Schedule the monthly downloads with Cloud Scheduler

  • Set up Cloud Scheduler to invoke the Cloud Run service once a month to ingest the new monthly data:
NAME=ingest-flights-monthly
SVC_ACCT=dtc-de-zoomcamp-srv-acc-102
PROJECT_ID=$(gcloud config get-value project)
SVC_EMAIL=${SVC_ACCT}@${PROJECT_ID}.iam.gserviceaccount.com
BUCKET=dsongcp_data_lake_de-zoomcamp-prj-375800
URL=$(gcloud run services describe ${NAME} --format 'value(status.url)')
echo "{\"bucket\":\"${BUCKET}\"}" > /tmp/message
cat /tmp/message

gcloud scheduler jobs create http monthlyupdate \
       --description "Ingest flights using Cloud Run" \
       --schedule="8 of month 10:00" \
       --time-zone "America/New_York" \
       --uri=$URL --http-method POST \
       --oidc-service-account-email $SVC_EMAIL \
       --oidc-token-audience=$URL \
       --max-backoff=7d \
       --max-retry-attempts=5 \
       --max-retry-duration=2d \
       --min-backoff=12h \
       --headers="Content-Type=application/json" \
       --message-body-from-file=/tmp/message

Data Transformation: Bayes Classifier on Cloud Dataproc

Create Dataproc cluster

In CloudShell:

  • Clone the repository if you haven't already done so:
    git clone https://github.com/anammari/de-zoomcamp-project.git
    
  • Change to the transform directory:
    cd de-zoomcamp-project/transform
    
  • Create the Dataproc cluster to run jobs on, specifying the name of your bucket and a zone in the region that the bucket is in.
     ./create_cluster.sh <BUCKET-NAME>  <COMPUTE-ZONE>
    

Notes:

  • Make sure that the compute zone is in the same region as the bucket, otherwise you will incur network egress charges.

  • The create_cluster.sh bash script performs the activities below (a hedged sketch follows this list):

    • Create the Dataproc cluster.

    • Preinstall the required Python packages on all the Dataproc cluster nodes.

    • Download the GitHub repository containing the data transformation code onto the Dataproc cluster's master node.
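
The actual script is in the transform folder; a sketch of the kind of command it might run is shown below (the cluster name, machine types, package list, and the pip-install initialization action are assumptions):

#!/bin/bash
# create_cluster.sh (sketch only)
BUCKET=$1
ZONE=$2
REGION=${ZONE%-*}   # e.g. australia-southeast1-a -> australia-southeast1

gcloud dataproc clusters create de-zoomcamp-cluster \
    --region ${REGION} --zone ${ZONE} \
    --bucket ${BUCKET} \
    --enable-component-gateway \
    --optional-components JUPYTER \
    --master-machine-type n1-standard-4 \
    --worker-machine-type n1-standard-4 --num-workers 2 \
    --metadata PIP_PACKAGES="pandas matplotlib" \
    --initialization-actions gs://goog-dataproc-initialization-actions-${REGION}/python/pip-install.sh
# (a further initialization action, not shown, would clone the GitHub repo on the master node)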

Transformations applied include the following (a PySpark sketch follows this list):

  • Apply data quantization (i.e. put each flight into one of several bins) to two numerical variables: the departure delay and the distance to be traveled.

  • Build a statistical model (a Bayes classifier) that uses the quantized versions of these two variables to predict whether or not a flight will encounter an arrival delay.

  • Evaluate the model using two validation datasets:
    • a rule-based out-of-sample dataset, generated using the boolean condition `is_train_day=FALSE`;
    • a temporal out-of-sample dataset, generated from a random sample of the Year 2019 data, which was not used in training the model.
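
A minimal PySpark sketch of the quantize-then-classify idea (the input path, column names, bin widths, and the empirical-probability estimator are assumptions; the actual logic is in transform/quantization.ipynb and bayes_on_spark.py):

# Sketch only: quantize two features and estimate P(on time) per bin.
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("bayes-quantization-sketch").getOrCreate()

# Hypothetical input location and schema.
flights = spark.read.json("gs://YOUR_BUCKET/flights/all_flights-*")

quantized = (
    flights
    .withColumn("delay_bin", (F.col("DEP_DELAY") / 10).cast("int"))   # 10-minute buckets
    .withColumn("dist_bin", (F.col("DISTANCE") / 100).cast("int"))    # 100-mile buckets
    .withColumn("ontime", (F.col("ARR_DELAY") < 15).cast("int"))      # arrived within 15 min
)

# Empirical estimate of P(on time | delay_bin, dist_bin) from training days only.
train = quantized.where(F.col("is_train_day") == True)
prob_table = train.groupBy("delay_bin", "dist_bin").agg(
    F.avg("ontime").alias("prob_ontime")
)

# Decision rule from the Overview: cancel the meeting when P(on time) < 0.70.
decisions = prob_table.withColumn("cancel_meeting", F.col("prob_ontime") < 0.70)
decisions.show()
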
  • Navigate to the Dataproc section of the GCP web console and click on "Web Interfaces".

  • Click on JupyterLab

  • In JupyterLab, open transform/quantization.ipynb. Click Run | Clear All Outputs. Then run the cells one by one.

  • [Optional] Make the changes suggested in the notebook to run on the full dataset. Note that you might have to reduce some numbers to fit within your quota.

Productionization: Serverless Spark

  • Copy the PySpark script to Cloud Storage:
gsutil cp bayes_on_spark.py gs://$BUCKET/
  • Submit the Dataproc job using gcloud:
gcloud beta dataproc batches submit pyspark \
   --project=$(gcloud config get-value project) \
   --region=$REGION \
   gs://${BUCKET}/bayes_on_spark.py \
   -- \
   --bucket ${BUCKET} --debug

Delete the cluster

  • Delete the cluster either from the GCP web console or by running ./delete_cluster.sh <YOUR REGION> in CloudShell.
