
OzWeathers

A comprehensive weather dataset for forecasting activities in the most populated cities in Australia


Table of Contents
  1. About The Project
  2. Setups
  3. Folder Structure
  4. How it's Done
  5. Discussion, Lessons Learnt and Future Improvements
  6. Contributors & Contributions

About The Project

This project aims to create a useful weather dataset for the most populated cities in Australia. Weather can be unpredictable, especially when planning outdoor activities. Our group created a weather dataset from WeatherApi, using the most populated cities from CountriesNow as a reference. The dataset supports data analysts in presenting current weather data with its suitable activities, while the historic and forecast data can be used by data scientists to forecast future activities and weather.

The Great Australian Outdoors


Setups

This project is built with Python version 3.9.

The following packages and tools are used in this project:

  • Jupyter Notebook
  • Pandas
  • Requests
  • SQLAlchemy
  • Postgres
  • Jinja
  • Docker
  • AWS Services (S3, RDS, EC2)

The full requirements list can be found in requirements.txt in the project folder.

To run the pipeline locally:

  1. Activate your virtual environment: conda activate <your-venv>
  2. Install the dependencies: pip install -r requirements.txt
  3. Obtain an API key from WeatherApi.
  4. Set up Postgres and the AWS resources (RDS, S3), and add the appropriate credentials in config.bat.
  5. Run set_python_path.bat.
  6. Run the main pipeline: weatherapi_pipeline.py
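As a minimal sketch (assuming config.bat exports the credentials as environment variables, and using hypothetical variable names), the pipeline could pick up its settings like this:

```python
import os

# Hypothetical helper: read settings that config.bat is assumed to export as
# environment variables before the pipeline is started.
def get_connection_settings() -> dict:
    return {
        "weather_api_key": os.environ["WEATHER_API_KEY"],  # WeatherApi key
        "db_user": os.environ["DB_USERNAME"],              # RDS credentials
        "db_password": os.environ["DB_PASSWORD"],
        "db_host": os.environ["DB_HOST"],
        "db_name": os.environ.get("DB_NAME", "postgres"),
    }

if __name__ == "__main__":
    settings = get_connection_settings()
    print({k: ("***" if "password" in k else v) for k, v in settings.items()})
```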


Folder Structure

.
│
├── OzWeather
│   ├── src
│   │   ├── database
│   │   │   ├── __init__.py
│   │   │   └── postgres.py
│   │   ├── utility
│   │   |   ├── __init__.py
│   │   |   └── metadata_logging.py
│   │   └── weatherapi
│   │       ├── etl
|   |       |   ├── __init__.py
|   |       |   ├── extract.py
|   |       |   ├── load.py
|   |       |   └── transform.py
│   │       ├── models
|   |       |   └── transform
|   |       |         ├── staging_forecast.sql
|   |       |         ├── staging_historic.sql
|   |       |         ├── staging_historic_and_forecast.sql
|   |       |         ├── serving_activities.sql
|   |       |         ├── serving_rank_city_metrics_by_day.sql
|   |       |         ├── serving_rank_city_metrics_by_month.sql
|   |       |         ├── serving_weather_condition_count_by_month.sql
|   |       |         └── serving_weather_condition_count_by_years.sql
|   |       |         
│   │       ├── pipeline
|   |       |   ├── __init__.py
|   |       |   └── weatherapi_pipeline.py
│   │       ├── test
|   |       |   ├── __init__.py            
|   |       |   └── test_cities_api.py
|   |       |
|   |       ├── __init__.py
│   |       └── config.yaml
│   │
│   ├── requirements.txt
│   └── set_python_path.bat
│
│
├── Snippets
├── Dockerfile
└─── README.md

How it's Done

We built an ELTL pipeline with an Extract and a Load class, without using the @staticmethod decorator.
No static method was used
We did not build a separate extract_load pipeline; instead, that step is incorporated into our main pipeline. In other words, there are only DAG nodes for the load and transform classes.
DAG Extract
DAG Load and Transform
The pipeline is built into a Docker image, hosted on AWS ECR, and run as a task instance on EC2. Successful runs can be seen in the CloudWatch logs. Cloudwatch Logs
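As a rough sketch (assuming plain sequential function calls rather than the project's actual DAG framework, and with the load/transform helper names below being hypothetical), the main pipeline wires the steps roughly like this:

```python
# weatherapi_pipeline.py (sketch): extract feeds straight into load, then transform.
# Only extract_from_api is named in this README; the other helpers are hypothetical.
from weatherapi.etl.extract import extract_from_api    # cities + forecast + historic
from weatherapi.etl.load import load_raw_tables        # hypothetical: overwrite/upsert raw tables
from weatherapi.etl.transform import run_sql_models    # hypothetical: run Jinja SQL models

def run_pipeline() -> None:
    raw_data = extract_from_api()   # extract is bundled into the main pipeline (no separate node)
    load_raw_tables(raw_data)       # DAG node: load
    run_sql_models()                # DAG node: transform

if __name__ == "__main__":
    run_pipeline()
```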


Solution Architecture Diagram

Solution Architecture Diagram


View all tables

View All Tables


Extract

We extract data from two APIs: WeatherApi and CountriesNow.
The extract_cities function gets the top 10 most populated cities in Australia. For the WeatherAPI to work seamlessly, we exclude any cities with spaces or symbols in their names. This table then acts as a reference table for the extract_weather_forecast and extract_weather_historic functions.
The extract_weather_forecast function extracts seven days of forecast data for each city, whereas the extract_weather_historic function gets weather data going back a year for all the cities. An incremental extract is performed on the historic data using an incremental value. The incremental value is stored in a CSV file, historic_log.csv, on S3, and holds the last recorded date from the previous extraction. On every run, the function reads the log file and starts extracting from the day after the incremental value date, because data already exists up to the log date and we only want data for the following days. A while loop stops the historic extraction once it reaches the current date; anything from today onwards is covered by the forecast data. The extract_from_api function acts as the main function that bundles everything together and is used in the main pipeline.

"We will get today's historic data tomorrow when it becomes yesterday's. - FX


Load

Extracted data is loaded into an AWS RDS PostgreSQL database with the help of SQLAlchemy. The raw historic data is upserted in chunks. As the data is extracted daily, the cities and forecast tables are overwritten so they always hold the most up-to-date data, whereas raw_historic is upserted.
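A minimal sketch of the overwrite and chunked upsert, assuming pandas DataFrames, PostgreSQL's INSERT ... ON CONFLICT, a unique constraint on (city, date), and hypothetical table and column names:

```python
import pandas as pd
from sqlalchemy import create_engine, text

engine = create_engine("postgresql://user:password@rds-host:5432/ozweathers")  # hypothetical URI

def overwrite_table(df: pd.DataFrame, table: str) -> None:
    """Cities and forecast tables are simply replaced on every daily run."""
    df.to_sql(table, engine, if_exists="replace", index=False)

def upsert_historic(df: pd.DataFrame, chunk_size: int = 1000) -> None:
    """Upsert raw historic rows in chunks, keyed on (city, date)."""
    insert_sql = text("""
        INSERT INTO raw_historic (city, date, max_temp_celcius, min_temp_celcius)
        VALUES (:city, :date, :max_temp_celcius, :min_temp_celcius)
        ON CONFLICT (city, date) DO UPDATE SET
            max_temp_celcius = EXCLUDED.max_temp_celcius,
            min_temp_celcius = EXCLUDED.min_temp_celcius
    """)
    rows = df.to_dict(orient="records")
    with engine.begin() as conn:
        for start in range(0, len(rows), chunk_size):
            conn.execute(insert_sql, rows[start:start + chunk_size])
```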


Transform

Transformations are executed after the load, using SQL models with Jinja templating. For the staging_forecast and staging_historic tables, we select only the relevant columns and rename them accordingly. We then combine the two with a UNION ALL to form the staging_historic_and_forecast table, so that we have city weather data for all the dates. The staging_historic_and_forecast table is then used to build the other five serving tables (a sketch of how a model is rendered and executed follows this list):

serving_activities Uses the weather data to create an activities table. Multiple CASE statements determine whether an activity is suitable on a particular date in the corresponding city. Activities include werewolf, picnics, indoor activity, sailing, being happy and moonbathing.

serving_rank_city_metrics_by_day Uses the weather data to rank the cities by their highest and lowest temperature, wind speed and total amount of rain. We used a window function with RANK() for this.

serving_rank_city_metrics_by_month Gets the rankings by month. We first create a CTE that calculates the average of each required column (max_temp_celcius, min_temp_celcius, max_wind_kph, total_rain_mm), grouped by year_month and city. The averages are then ranked with RANK() as a window function, partitioned by year_month.

serving_weather_condition_count_by_month For this table, we wanted the frequency of each weather condition by month. From the weather data, we COUNT() each occurrence and classify it into main categories - sunny_day, rainy_day, cloudy_day, thunder_day. These are then grouped by year_month and city, and ordered by year_month.

serving_weather_condition_count_by_years A similar transformation produces the weather condition counts by year, with one significant difference: the aggregated data is grouped and ordered by year instead of year_month.
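Below is a minimal sketch of how a Jinja-templated SQL model might be rendered and executed against the database; the template variable, rendering flow and the CASE example are illustrative assumptions, not the project's actual models:

```python
from jinja2 import Environment, FileSystemLoader
from sqlalchemy import create_engine, text

engine = create_engine("postgresql://user:password@rds-host:5432/ozweathers")  # hypothetical URI
env = Environment(loader=FileSystemLoader("src/weatherapi/models/transform"))

def run_sql_model(model_name: str) -> None:
    """Render a Jinja-templated SELECT and materialise it as a table of the same name."""
    select_sql = env.get_template(f"{model_name}.sql").render(
        source_table="staging_historic_and_forecast"  # assumed template variable
    )
    with engine.begin() as conn:
        conn.execute(text(f"DROP TABLE IF EXISTS {model_name}"))
        conn.execute(text(f"CREATE TABLE {model_name} AS {select_sql}"))

# An illustrative serving_activities-style template body could look like:
#   SELECT city, date,
#          CASE WHEN total_rain_mm = 0 AND max_temp_celcius BETWEEN 18 AND 28
#               THEN TRUE ELSE FALSE END AS picnics
#   FROM {{ source_table }}

run_sql_model("serving_activities")
```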

For all the tables, we have used the following techniques to transform the data:

  • Renaming - rename column headers from the raw to the staging tables
  • Data type casting - date strings to date, column data to numeric and int
  • Unions - union of raw_historic & raw_forecast
  • Aggregation functions - avg() temperature, rank() for metrics, count() of weather condition frequencies, sum() of total rain
  • Window functions - partition by year_month (illustrated in the sketch below)
  • Calculation - duration of moonbathing time
  • Filtering - WHERE clause used when creating a CTE for days that are not sunny
  • Grouping - group by year, year_month, city
  • Sorting - ORDER BY
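As an illustrative sketch of the window-function ranking shape used for serving_rank_city_metrics_by_month (the SQL below is an assumption about the model's shape, not a copy of it):

```python
from sqlalchemy import create_engine, text

engine = create_engine("postgresql://user:password@rds-host:5432/ozweathers")  # hypothetical URI

# The CTE averages the metrics per year_month and city, then RANK() is applied
# as a window function partitioned by year_month.
RANK_BY_MONTH_SQL = text("""
    WITH monthly AS (
        SELECT to_char(date, 'YYYY-MM') AS year_month,
               city,
               avg(max_temp_celcius)    AS avg_max_temp_celcius,
               avg(total_rain_mm)       AS avg_total_rain_mm
        FROM staging_historic_and_forecast
        GROUP BY 1, 2
    )
    SELECT year_month,
           city,
           RANK() OVER (PARTITION BY year_month ORDER BY avg_max_temp_celcius DESC) AS hottest_rank,
           RANK() OVER (PARTITION BY year_month ORDER BY avg_total_rain_mm DESC)    AS wettest_rank
    FROM monthly
""")

with engine.connect() as conn:
    for row in conn.execute(RANK_BY_MONTH_SQL):
        print(row)
```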


Discussion, Lessons Learnt and Future Improvements

In the current iteration, the Docker image is built and pushed to ECR, and a task instance runs the pipeline on EC2.
What we learned in this project and what could be done better:

  • Never underestimate how long it will take to troubleshoot an error. And when it comes to testing and troubleshooting, print() helps!
  • Always incorporate good error catching
    Error Catching
  • We need to incorporate an S3 list-objects check in the get_incremental_value function, as boto3 does not return a useful response if the key is not present or has the wrong name (see the sketch after this list).
  • The pipeline currently uses the boto3 client, which requires user credentials; with the right S3 policy, it could possibly use read access instead.
  • The incremental value is currently written during the extract, so if the process fails later the value is still written to the log. Ideally, this should happen after the upsert to the database.
  • When a table uses the overwrite method, there may be a timing issue where the old table is not dropped fast enough before the new one is created.
  • Figure out why metadata logging stopped after we uploaded to EC2, and identify a way to incorporate the local timezone into the logs.
  • The weather API call currently gets cities from the other cities API call. Ideally, the extract would read from the cities table and then call the weather API.
  • We noticed an issue when reading from RDS: even when the table exists and we know there is data, querying from pgAdmin initially returns no results, and then within about a minute the table columns and the data appear.
    Phantom Table
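A minimal sketch of the key-existence check mentioned above, assuming boto3 and hypothetical bucket/key names:

```python
import boto3

def incremental_log_exists(bucket: str, key: str) -> bool:
    """Return True only if the exact key exists, instead of relying on get_object to fail."""
    s3 = boto3.client("s3")
    response = s3.list_objects_v2(Bucket=bucket, Prefix=key)
    return any(obj["Key"] == key for obj in response.get("Contents", []))

if not incremental_log_exists("ozweathers-bucket", "historic_log.csv"):  # hypothetical names
    print("No incremental log found; fall back to a full historic extract")
```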

What have we learned while using Docker:

Execute the following Docker commands to avoid ending up with multiple containers:
  • Build docker image with no cache: docker build --no-cache . -t IMAGE_NAME
  • Stop the existing docker container: docker stop CONTAINER_NAME
  • Remove the existing docker container: docker rm CONTAINER_NAME
  • Start docker container by name: docker run --restart=unless-stopped --name CONTAINER_NAME IMAGE_NAME
Execute the following Docker commands when tagging and pushing to ECR:
  • Tag image for ECR: docker tag LOCAL_IMAGE_NAME:LOCAL_IMAGE_TAG AWS_ECR_URI:LOCAL_IMAGE_TAG
  • Push image to ECR: docker push AWS_ECR_URI:LOCAL_IMAGE_TAG


Contributors & Contributions

All participants in this project are professionals enrolled in the Data Engineering Camp.

  • Luke Huntley (LuckyLukeAtGitHub): Transform (ETL), Load (ETL), main pipeline, testing, Docker image build, AWS resource creation (IAM, RDS)
  • Fang Xuan Foo (foofx88): Extract (ETL), Transform (ETL), AWS (S3, ECR, EC2), troubleshooting, pull request review, project documentation
  • Muhammad Mursalin (doctormachine): initial pull request review
  • Rohith Korupalli (korupalli): initial Weather API extract

Team members tested their own individual parts, cross-checked each other's work and supplied content for the project documentation. This was the first project for the ETL part of the Data Engineering Camp course.
No werewolves were harmed in the process of getting this dataset.

Go back up🔼
