
OzWeathers

A comprehensive weather dataset for forecasting activities in the most populated cities in Australia


Table of Contents
  1. About The Project
  2. Setups
  3. Folder Structure
  4. How it's Done
  5. Discussion, Lessons Learnt and Future Improvements
  6. Contributors & Contributions

About The Project

This project aims to create a useful weather dataset for the most populated cities in Australia. Weather can be unpredictable, especially when planning outdoor activities. Our group created a weather dataset from WeatherApi, using the most populated cities from CountriesNow as a reference. The dataset supports data analysts in presenting current weather data with its suitable activities, while the historic and forecast data can be used by data scientists to forecast future activities and weather.

The Great Australian Outdoors


Setups

This project is built with Python version 3.9.

The following packages and tools are used in this project:

  • Jupyter Notebook
  • Pandas
  • Requests
  • SQLAlchemy
  • Postgres
  • Jinja
  • Docker
  • AWS Services (S3, RDS, EC2)

The full requirements list can be found in requirements.txt in the project folder.

To run the pipeline locally:

  1. Activate your virtual environment: conda activate <your-venv>
  2. Install the dependencies: pip install -r requirements.txt
  3. Obtain an API key from WeatherApi.
  4. Set up Postgres and the AWS resources (RDS, S3), and add the appropriate credentials in config.bat.
  5. Run set_python_path.bat.
  6. Run the main pipeline: weatherapi_pipeline.py
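As a minimal sketch (assuming config.bat exports the credentials as environment variables, and using hypothetical variable names), the pipeline could pick up its settings like this:

```python
import os

# Hypothetical helper: read settings that config.bat is assumed to export as
# environment variables before the pipeline is started.
def get_connection_settings() -> dict:
    return {
        "weather_api_key": os.environ["WEATHER_API_KEY"],  # WeatherApi key
        "db_user": os.environ["DB_USERNAME"],              # RDS credentials
        "db_password": os.environ["DB_PASSWORD"],
        "db_host": os.environ["DB_HOST"],
        "db_name": os.environ.get("DB_NAME", "postgres"),
    }

if __name__ == "__main__":
    settings = get_connection_settings()
    print({k: ("***" if "password" in k else v) for k, v in settings.items()})
```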


Folder Structure

.
│
├── OzWeather
│   ├── src
│   │   ├── database
│   │   │   ├── __init__.py
│   │   │   └── postgres.py
│   │   ├── utility
│   │   |   ├── __init__.py
│   │   |   └── metadata_logging.py
│   │   └── weatherapi
│   │       ├── etl
|   |       |   ├── __init__.py
|   |       |   ├── extract.py
|   |       |   ├── load.py
|   |       |   └── transform.py
│   │       ├── models
|   |       |   └── transform
|   |       |         ├── staging_forecast.sql
|   |       |         ├── staging_historic.sql
|   |       |         ├── staging_historic_and_forecast.sql
|   |       |         ├── serving_activities.sql
|   |       |         ├── serving_rank_city_metrics_by_day.sql
|   |       |         ├── serving_rank_city_metrics_by_month.sql
|   |       |         ├── serving_weather_condition_count_by_month.sql
|   |       |         └── serving_weather_condition_count_by_years.sql
|   |       |         
│   │       ├── pipeline
|   |       |   ├── __init__.py
|   |       |   └── weatherapi_pipeline.py
│   │       ├── test
|   |       |   ├── __init__.py            
|   |       |   └── test_cities_api.py
|   |       |
|   |       ├── __init__.py
│   |       └── config.yaml
│   │
│   ├── requirements.txt
│   └── set_python_path.bat
│
│
├── Snippets
├── Dockerfile
└─── README.md

How it's Done

We built an ELTL pipeline with an Extract and a Load class, without using the @staticmethod decorator.
No static method was used
We did not build a separate extract_load pipeline; instead, that step is incorporated into our main pipeline. In other words, there are only DAG nodes for the load and transform classes.
DAG Extract
DAG Load and Transform
The pipeline is built into a Docker image, hosted on AWS ECR, and run as a task instance on EC2. Successful runs can be seen in the CloudWatch logs. Cloudwatch Logs
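As a rough sketch (assuming plain sequential function calls rather than the project's actual DAG framework, and with the load/transform helper names below being hypothetical), the main pipeline wires the steps roughly like this:

```python
# weatherapi_pipeline.py (sketch): extract feeds straight into load, then transform.
# Only extract_from_api is named in this README; the other helpers are hypothetical.
from weatherapi.etl.extract import extract_from_api    # cities + forecast + historic
from weatherapi.etl.load import load_raw_tables        # hypothetical: overwrite/upsert raw tables
from weatherapi.etl.transform import run_sql_models    # hypothetical: run Jinja SQL models

def run_pipeline() -> None:
    raw_data = extract_from_api()   # extract is bundled into the main pipeline (no separate node)
    load_raw_tables(raw_data)       # DAG node: load
    run_sql_models()                # DAG node: transform

if __name__ == "__main__":
    run_pipeline()
```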


Solution Architecture Diagram

Solution Architecture Diagram


View all tables

View All Tables


Extract

We extract data from two APIs: WeatherApi and CountriesNow.
The extract_cities function gets the top 10 most populated cities in Australia. For the WeatherAPI to work seamlessly, we exclude any cities with spaces or symbols in their names. This table then acts as a reference table for the extract_weather_forecast and extract_weather_historic functions.
The extract_weather_forecast function extracts seven days of forecast data for each city, whereas the extract_weather_historic function gets weather data going back a year for all the cities. An incremental extract is performed on the historic data using an incremental value. The incremental value is stored in a CSV file, historic_log.csv, on S3, and holds the last recorded date from the previous extraction. On every run, the function reads the log file and starts extracting from the day after the incremental value date, because data already exists up to the log date and we only want data for the following days. A while loop stops the historic extraction once it reaches the current date; anything from today onwards is covered by the forecast data. The extract_from_api function acts as the main function that bundles everything together and is used in the main pipeline.

"We will get today's historic data tomorrow when it becomes yesterday's. - FX


Load

Extracted data is loaded into an AWS RDS PostgreSQL database with the help of SQLAlchemy. The raw historic data is upserted in chunks. As the data is extracted daily, the cities and forecast tables are overwritten so they always hold the most up-to-date data, whereas raw_historic is upserted.
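A minimal sketch of the overwrite and chunked upsert, assuming pandas DataFrames, PostgreSQL's INSERT ... ON CONFLICT, a unique constraint on (city, date), and hypothetical table and column names:

```python
import pandas as pd
from sqlalchemy import create_engine, text

engine = create_engine("postgresql://user:password@rds-host:5432/ozweathers")  # hypothetical URI

def overwrite_table(df: pd.DataFrame, table: str) -> None:
    """Cities and forecast tables are simply replaced on every daily run."""
    df.to_sql(table, engine, if_exists="replace", index=False)

def upsert_historic(df: pd.DataFrame, chunk_size: int = 1000) -> None:
    """Upsert raw historic rows in chunks, keyed on (city, date)."""
    insert_sql = text("""
        INSERT INTO raw_historic (city, date, max_temp_celcius, min_temp_celcius)
        VALUES (:city, :date, :max_temp_celcius, :min_temp_celcius)
        ON CONFLICT (city, date) DO UPDATE SET
            max_temp_celcius = EXCLUDED.max_temp_celcius,
            min_temp_celcius = EXCLUDED.min_temp_celcius
    """)
    rows = df.to_dict(orient="records")
    with engine.begin() as conn:
        for start in range(0, len(rows), chunk_size):
            conn.execute(insert_sql, rows[start:start + chunk_size])
```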


Transform

Transformations are executed after the load, using SQL models with Jinja templating. For the staging_forecast and staging_historic tables, we select only the relevant columns and rename them accordingly. We then combine the two with a UNION ALL to form the staging_historic_and_forecast table, so that we have city weather data for all the dates. The staging_historic_and_forecast table is then used to build the other five serving tables (a sketch of how a model is rendered and executed follows this list):

serving_activities Uses the weather data to create an activities table. Multiple CASE statements determine whether an activity is suitable on a particular date in the corresponding city. Activities include werewolf, picnics, indoor activity, sailing, being happy and moonbathing.

serving_rank_city_metrics_by_day Uses the weather data to rank the cities by their highest and lowest temperature, wind speed and total amount of rain. We used a window function with RANK() for this.

serving_rank_city_metrics_by_month Gets the rankings by month. We first create a CTE that calculates the average of each required column (max_temp_celcius, min_temp_celcius, max_wind_kph, total_rain_mm), grouped by year_month and city. The averages are then ranked with RANK() as a window function, partitioned by year_month.

serving_weather_condition_count_by_month For this table, we wanted the frequency of each weather condition by month. From the weather data, we COUNT() each occurrence and classify it into main categories - sunny_day, rainy_day, cloudy_day, thunder_day. These are then grouped by year_month and city, and ordered by year_month.

serving_weather_condition_count_by_years A similar transformation produces the weather condition counts by year, with one significant difference: the aggregated data is grouped and ordered by year instead of year_month.
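Below is a minimal sketch of how a Jinja-templated SQL model might be rendered and executed against the database; the template variable, rendering flow and the CASE example are illustrative assumptions, not the project's actual models:

```python
from jinja2 import Environment, FileSystemLoader
from sqlalchemy import create_engine, text

engine = create_engine("postgresql://user:password@rds-host:5432/ozweathers")  # hypothetical URI
env = Environment(loader=FileSystemLoader("src/weatherapi/models/transform"))

def run_sql_model(model_name: str) -> None:
    """Render a Jinja-templated SELECT and materialise it as a table of the same name."""
    select_sql = env.get_template(f"{model_name}.sql").render(
        source_table="staging_historic_and_forecast"  # assumed template variable
    )
    with engine.begin() as conn:
        conn.execute(text(f"DROP TABLE IF EXISTS {model_name}"))
        conn.execute(text(f"CREATE TABLE {model_name} AS {select_sql}"))

# An illustrative serving_activities-style template body could look like:
#   SELECT city, date,
#          CASE WHEN total_rain_mm = 0 AND max_temp_celcius BETWEEN 18 AND 28
#               THEN TRUE ELSE FALSE END AS picnics
#   FROM {{ source_table }}

run_sql_model("serving_activities")
```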

For all the tables, we have used the following techniques to transform the data:

  • Renaming - rename column headers from the raw to the staging tables
  • Data type casting - date strings to date, column data to numeric and int
  • Unions - union of raw_historic & raw_forecast
  • Aggregation functions - avg() temperature, rank() for metrics, count() of weather condition frequencies, sum() of total rain
  • Window functions - partition by year_month (illustrated in the sketch below)
  • Calculation - duration of moonbathing time
  • Filtering - WHERE clause used when creating a CTE for days that are not sunny
  • Grouping - group by year, year_month, city
  • Sorting - ORDER BY
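As an illustrative sketch of the window-function ranking shape used for serving_rank_city_metrics_by_month (the SQL below is an assumption about the model's shape, not a copy of it):

```python
from sqlalchemy import create_engine, text

engine = create_engine("postgresql://user:password@rds-host:5432/ozweathers")  # hypothetical URI

# The CTE averages the metrics per year_month and city, then RANK() is applied
# as a window function partitioned by year_month.
RANK_BY_MONTH_SQL = text("""
    WITH monthly AS (
        SELECT to_char(date, 'YYYY-MM') AS year_month,
               city,
               avg(max_temp_celcius)    AS avg_max_temp_celcius,
               avg(total_rain_mm)       AS avg_total_rain_mm
        FROM staging_historic_and_forecast
        GROUP BY 1, 2
    )
    SELECT year_month,
           city,
           RANK() OVER (PARTITION BY year_month ORDER BY avg_max_temp_celcius DESC) AS hottest_rank,
           RANK() OVER (PARTITION BY year_month ORDER BY avg_total_rain_mm DESC)    AS wettest_rank
    FROM monthly
""")

with engine.connect() as conn:
    for row in conn.execute(RANK_BY_MONTH_SQL):
        print(row)
```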


Discussion, Lessons Learnt and Future Improvements

In the current iteration, the Docker image is built and pushed to ECR, and a task instance runs the pipeline on EC2.
What we learned in this project and what could be done better:

  • Never underestimate how long it will take to troubleshoot an error. And when it comes to testing and troubleshooting, print() helps!
  • Always incorporate good error catching
    Error Catching
  • We need to incorporate an S3 list-objects check in the get_incremental_value function, as boto3 does not return a useful response if the key is not present or has the wrong name (see the sketch after this list).
  • The pipeline currently uses the boto3 client, which requires user credentials; with the right S3 policy, it could possibly use read access instead.
  • The incremental value is currently written during the extract, so if the process fails later the value is still written to the log. Ideally, this should happen after the upsert to the database.
  • When a table uses the overwrite method, there may be a timing issue where the old table is not dropped fast enough before the new one is created.
  • Figure out why metadata logging stopped after we uploaded to EC2, and identify a way to incorporate the local timezone into the logs.
  • The weather API call currently gets cities from the other cities API call. Ideally, the extract would read from the cities table and then call the weather API.
  • We noticed an issue when reading from RDS: even when the table exists and we know there is data, querying from pgAdmin initially returns no results, and then within about a minute the table columns and the data appear.
    Phantom Table
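A minimal sketch of the key-existence check mentioned above, assuming boto3 and hypothetical bucket/key names:

```python
import boto3

def incremental_log_exists(bucket: str, key: str) -> bool:
    """Return True only if the exact key exists, instead of relying on get_object to fail."""
    s3 = boto3.client("s3")
    response = s3.list_objects_v2(Bucket=bucket, Prefix=key)
    return any(obj["Key"] == key for obj in response.get("Contents", []))

if not incremental_log_exists("ozweathers-bucket", "historic_log.csv"):  # hypothetical names
    print("No incremental log found; fall back to a full historic extract")
```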

What have we learned while using Docker:

Execute the following Docker commands to avoid ending up with multiple containers:
  • Build docker image with no cache: docker build --no-cache . -t IMAGE_NAME
  • Stop the existing docker container: docker stop CONTAINER_NAME
  • Remove the existing docker container: docker rm CONTAINER_NAME
  • Start docker container by name: docker run --restart=unless-stopped --name CONTAINER_NAME IMAGE_NAME
Execute the following Docker commands when tagging and pushing to ECR:
  • Tag image for ECR: docker tag LOCAL_IMAGE_NAME:LOCAL_IMAGE_TAG AWS_ECR_URI:LOCAL_IMAGE_TAG
  • Push image to ECR: docker push AWS_ECR_URI:LOCAL_IMAGE_TAG


Contributors & Contributions

All participants in this project are professionals enrolled in the Data Engineering Camp.

  • Luke Huntley (LuckyLukeAtGitHub): Transform (ETL), Load (ETL), main pipeline, testing, Docker image build, AWS resource creation (IAM, RDS)
  • Fang Xuan Foo (foofx88): Extract (ETL), Transform (ETL), AWS (S3, ECR, EC2), troubleshooting, pull request review, project documentation
  • Muhammad Mursalin (doctormachine): initial pull request review
  • Rohith Korupalli (korupalli): initial Weather API extract

Team members tested their own individual parts, cross-checked each other's work and supplied content for the project documentation. This was the first project for the ETL part of the Data Engineering Camp course.
No werewolves were harmed in the process of getting this dataset.

Go back up🔼
