
Earthquake Data Engineering Capstone Project

This project automatically ingests, stores, and transforms the latest seismic activity data from the USGS (United States Geological Survey) for later analysis.

This GitHub repository fulfills the final capstone project for the Data Engineering Zoomcamp by DataTalks.Club.

Data Pipeline Architecture

Dashboard

Earthquake Capstone Project dashboard (Google Looker Studio) and data pipeline architecture diagram

Technical Challenge

Analysts are studying the frequency, intensity, and spatial occurrence of seismic activity. This data needs to be automatically stored and processed so that analysts can quickly analyze it and build reports and dashboards. The technical implementation needs to be:

  • Automated
  • Reliable
  • Scalable
  • Affordable

Technology Utilized

Dataset

The data comes from a public REST API provided by the USGS in collaboration with the International Federation of Digital Seismograph Networks (FDSN). The data is accessed through the query API endpoint, using three parameters:

  • format={geojson}
  • starttime={yyyy-mm-dd}
  • endtime={yyyy-mm-dd}

https://earthquake.usgs.gov/fdsnws/event/1/query?format={format}&starttime={yyyy-mm-dd}&endtime={yyyy-mm-dd}

The geojson data format was used in this project.

  • Example of json response: USGS API json Example
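
As a rough illustration, the endpoint can be queried with the Python requests library; the date range below is only a placeholder, not the range used in the project:

import requests

# Placeholder date range; the pipeline parameterizes these values per run.
params = {
    "format": "geojson",
    "starttime": "2022-03-01",
    "endtime": "2022-03-02",
}

# Query the USGS FDSN event service and parse the GeoJSON response.
response = requests.get(
    "https://earthquake.usgs.gov/fdsnws/event/1/query",
    params=params,
)
response.raise_for_status()
data = response.json()

# The GeoJSON payload holds one entry per seismic event in "features".
print(len(data["features"]), "events returned")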

Data Transformation

  • The REST API returns a nested JSON structure.

  • The seismic events of interest live in the "features" array of objects, which is iterated over and extracted into a Python list (a sketch of this step follows this list).

  • This list was turned into a Pandas DataFrame.

  • Additional columns were created that convert the UNIX/POSIX time (ms) into datetime objects.

  • Example of flattened data: USGS API json Example
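
A minimal sketch of the flattening step, assuming the GeoJSON response has already been parsed into a dict named data; the column names are illustrative rather than the exact repository schema:

import pandas as pd

# Iterate over the "features" array and pull each event into a flat dictionary.
rows = []
for feature in data["features"]:
    props = feature["properties"]
    lon, lat, depth = feature["geometry"]["coordinates"]
    rows.append({
        "id": feature["id"],
        "properties_mag": props.get("mag"),
        "properties_place": props.get("place"),
        "properties_time": props.get("time"),        # UNIX epoch, milliseconds
        "properties_updated": props.get("updated"),  # UNIX epoch, milliseconds
        "longitude": lon,
        "latitude": lat,
        "depth": depth,
    })

df = pd.DataFrame(rows)

# Additional columns converting the millisecond epochs into datetime objects.
df["properties_time_datetime"] = pd.to_datetime(df["properties_time"], unit="ms")
df["properties_updated_datetime"] = pd.to_datetime(df["properties_updated"], unit="ms")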

Data Modeling

Data Storage

  • Both the raw JSON files and the flattened Parquet files are stored in a data lake in a Google Cloud Storage (GCS) bucket.
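
A minimal sketch of such an upload using the google-cloud-storage client library; the bucket and object names are placeholders:

from google.cloud import storage

def upload_to_gcs(bucket_name: str, object_name: str, local_file: str) -> None:
    """Upload a local file (raw JSON or flattened Parquet) to the GCS data lake."""
    client = storage.Client()
    blob = client.bucket(bucket_name).blob(object_name)
    blob.upload_from_filename(local_file)

# Example (placeholder names):
# upload_to_gcs("my-earthquake-datalake", "raw/2022-03-01.json", "/tmp/2022-03-01.json")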

Data Lake

Airflow Orchestration

The DAG does the following on a '@daily' schedule (a condensed sketch of the DAG follows this list):

  • Parameterizes the API endpoint with the run dates, using Airflow's built-in template variables.
  • Sends a GET request with the Python requests library and holds the API response in memory in a variable.
  • Saves the raw json file to local storage, uploads the blob into the data lake, then deletes the local json file.
  • Parses the nested JSON into a list of dictionaries, which is transformed into a Pandas DataFrame.
  • Converts the DataFrame into a parquet file that is saved locally.
  • Uploads the parquet file into the Google Cloud Storage data lake with the parameterized date as the filename.
  • Deletes the locally stored parquet file.
  • Pushes the data lake parquet into a BigQuery native table using the 'LOAD DATA INTO ...' statement.
  • Runs the dbt models using the 'dbt run' command with the BashOperator.
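
A condensed sketch of what such a DAG could look like, assuming Airflow 2.x with the Google provider installed; the task callables (download_and_upload_raw, flatten_and_upload_parquet), bucket name, and dbt project path are hypothetical placeholders rather than the exact ones used in this repository:

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator
from airflow.providers.google.cloud.operators.bigquery import BigQueryInsertJobOperator

# Airflow template variables parameterize the USGS endpoint for each daily run.
URL = (
    "https://earthquake.usgs.gov/fdsnws/event/1/query"
    "?format=geojson&starttime={{ ds }}&endtime={{ next_ds }}"
)

def download_and_upload_raw(url: str, date: str) -> None:
    """Hypothetical callable: GET the endpoint, save the raw json locally,
    upload it to the data lake, then delete the local file."""
    ...

def flatten_and_upload_parquet(date: str) -> None:
    """Hypothetical callable: flatten the json to a DataFrame, write a parquet
    file, upload it to the data lake, then delete the local file."""
    ...

with DAG(
    dag_id="earthquake_ingestion",
    schedule_interval="@daily",
    start_date=datetime(2022, 3, 1),
    catchup=True,
) as dag:

    extract = PythonOperator(
        task_id="extract_raw_json",
        python_callable=download_and_upload_raw,
        op_kwargs={"url": URL, "date": "{{ ds }}"},
    )

    transform = PythonOperator(
        task_id="flatten_to_parquet",
        python_callable=flatten_and_upload_parquet,
        op_kwargs={"date": "{{ ds }}"},
    )

    # Append the day's parquet file from the data lake into the raw BigQuery table.
    load = BigQueryInsertJobOperator(
        task_id="load_to_bigquery",
        configuration={
            "query": {
                "query": (
                    "LOAD DATA INTO raw.raw_earthquakes "
                    "FROM FILES (format = 'PARQUET', "
                    "uris = ['gs://my-earthquake-datalake/parquet/{{ ds }}.parquet'])"
                ),
                "useLegacySql": False,
            }
        },
    )

    dbt_run = BashOperator(
        task_id="dbt_run",
        bash_command="cd /opt/dbt && dbt run",
    )

    extract >> transform >> load >> dbt_run

The LOAD DATA INTO statement here appends the day's Parquet file into the raw_earthquakes table that the dbt models read from.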

Airflow DAG


dbt Transformation

  • dbt was used to take the raw_earthquakes data from the BigQuery native table and deduplicate it using a SQL window function.
  • The goal is to have only one record per seismic event, keyed on the seismic 'id' column.
  • Requirements indicate that only the latest version of each event, based on its 'properties_updated_datetime' timestamp, is needed.
  • As the seismic data is reprocessed upstream, this design allows analysts to examine the historical changes of a particular seismic event while still surfacing the latest version of each event.
with earthquakes as
(
    select *,
        -- number repeated rows so only one copy of each (id, updated) version is kept
        row_number() over(partition by id, properties_updated_datetime) as rn
    from {{ source('raw','raw_earthquakes') }}
    where id is not null
)
select ...
from earthquakes
where rn = 1
{{ config(
    materialized='incremental',
    partition_by={
      "field": "properties_updated_datetime",
      "data_type": "timestamp",
      "granularity": "day"
    }
)}}

dbt Lineage Graph

  • dbt has a built-in tool that visually shows data lineage as it travels across the data layers: raw -> stage -> final (see the dbt Lineage Graph image).

Visualization

Google Looker Studio was used to build the dashboard:

  1. Pick a data source; Looker Studio offers 100+ data connectors, including BigQuery.
  2. Pick the columns needed, or create your own using logical functions.
  3. Create the dashboard.


Future Work That Could Be Done

  1. Testing: dbt tests and Airflow tests.
  2. More experimentation with the Airflow configuration and VM instance. I would like to attempt a lightweight version using the SequentialExecutor instead of the CeleryExecutor, and SQLite instead of the Postgres backend.
  3. More automation around the DevOps implementation of the entire pipeline. I would like to see if the Dockerfile could be made to execute more steps of the initialization.
  4. Explore making the Airflow instance less brittle by avoiding local compute and pushing more of the work to external cloud functions.
  5. Build out more reports, and push out weekly updates via a Slack webhook or email.
  6. Integrate population data to show an earthquake's potential impact on human inhabitants.
  7. Data science, specifically time series analysis.

Clean Up

A GCP trial account with 90-day access was used for this project. The final step was to transfer the final fact_earthquakes BigQuery table to my personal account for safekeeping, using the BigQuery copy command:

bq cp -f \
old_dataset.fact_earthquakes \
main_projectid:new_dataset.fact_earthquakes

The final dashboard was connected to this BigQuery table.

Terraform

Lastly, the cloud resources were archived and exported to Terraform HCL code using gcloud beta resource-config bulk-export.
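
For reference, the export can be run along these lines (the project ID and output path are placeholders, and the flags should be checked against the current gcloud documentation):

gcloud beta resource-config bulk-export \
  --project=my-project-id \
  --resource-format=terraform \
  --path=terraform-export/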

Acknowledgements

Thanks to the instructors at DataTalks.Club, and to my employer, evolv, and my teammates.
