Heatwave

Protecting populations from extreme temperatures

By Romain Fardel

A project created for the Insight Data Engineering Fellowship, Fall 2020 session (20C)

Introduction

Heatwaves can be deadly for populations. For instance, over 30,000 people died from the heatwave that swept Europe during the summer of 2003. The issue affects the whole world, and the problem will keep growing with global warming.

To protect populations, researchers need to study how mortality is affected by temperature. The problem is that there is currently no single source of data available to researchers. For the US, researchers need to retrieve vital statistics from the CDC and historical weather data from NOAA. Combining these datasets is challenging for a few reasons:

  • Geographical mismatch: mortality is reported by county, whereas temperature is reported by weather station.

  • Temporal mismatch: mortality is reported by day or by month, whereas temperature is reported at variable intervals depending on the data year and station location.

  • Schema evolution: the format of mortality data changes every few years, and data is provided in a non-delimited (fixed-width) format, so the position of each field must be known to extract the data.

  • County evolution in time: county boundaries have changed over the last 50 years.
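The schema-evolution point can be illustrated with a minimal sketch of position-based parsing. The layouts, field names, and column positions below are hypothetical placeholders, not the actual CDC layouts; the real positions have to be read from the CDC documentation for each data year.

```python
from typing import Dict, Tuple

# Hypothetical layouts: field name -> (start, end), 1-based inclusive columns.
# Real positions must come from the CDC descriptor file for each data year.
LAYOUTS: Dict[int, Dict[str, Tuple[int, int]]] = {
    1990: {"county_fips": (1, 5), "month": (6, 7), "day": (8, 9)},
    2000: {"county_fips": (3, 7), "month": (10, 11), "day": (12, 13)},
}


def parse_record(line: str, year: int) -> Dict[str, str]:
    """Slice one non-delimited record using the layout for its data year."""
    layout = LAYOUTS[year]
    # Convert 1-based inclusive positions to Python's 0-based, end-exclusive slices.
    return {
        name: line[start - 1:end].strip()
        for name, (start, end) in layout.items()
    }
```

Keeping one layout per year in a lookup table means a format change only requires a new table entry, not a new parser.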

This project addresses that need by creating a pipeline to combine these datasets and make them available in a GIS-enabled (Geographic Information System) database that the end user can query with SQL.

Execution

Tech stack

  1. Raw data is stored in Amazon S3
    • Weather data is readily available in a NOAA S3 bucket
    • Mortality data is ingested by downloading from FTP, unzipping and saving text files to S3
  2. Raw data is processed in Apache Spark. Each dataset is extracted, filtered, and aggregated separately
  3. Data is loaded to PostgreSQL with the PostGIS extension
  4. Auxiliary datasets (weather station, county definitions) are loaded to PostGIS and the data is joined
  5. Final table is queried on demand and displayed in Dash.
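The filter-and-aggregate logic of step 2 can be sketched in plain Python. This is an illustration only: the project runs this stage in Apache Spark, and the station-to-county mapping comes from the PostGIS join in step 4. The function name, the tuple layout, and the `station_to_county` dictionary are assumptions made for the example; the conversion reflects that GHCN-D stores temperatures in tenths of a degree Celsius.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

Row = Tuple[str, str, str, int]  # (station_id, date, element, value)


def mean_tmax_by_county(
    rows: Iterable[Row], station_to_county: Dict[str, str]
) -> Dict[Tuple[str, str], float]:
    """Average daily maximum temperature (deg C) per (county, date)."""
    sums = defaultdict(lambda: [0.0, 0])  # (county, date) -> [total, count]
    for station_id, date, element, value in rows:
        if element != "TMAX":
            continue  # filter: keep only daily maximum temperature
        county = station_to_county.get(station_id)
        if county is None:
            continue  # station not matched to any county
        acc = sums[(county, date)]
        acc[0] += value / 10.0  # tenths of deg C -> deg C
        acc[1] += 1
    return {key: total / count for key, (total, count) in sums.items()}
```

Averaging across stations is only one possible way to summarize a county's temperature; a maximum or an area-weighted mean would fit the same structure.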

Data sources

Weather

NOAA Global Historical Climatology Network Daily (GHCN-D), available in an Amazon S3 bucket.
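As a rough sketch of reading these records, the snippet below parses GHCN-D rows in the yearly CSV layout. The column order (station ID, date, element, value, three flags, observation time) is an assumption based on NOAA's description of the by-year files and should be checked against the bucket's readme; the field names and the sample rows are invented for illustration.

```python
import csv
import io
from typing import Dict, Iterator

# Assumed column order of the GHCN-D yearly CSV files (no header row).
FIELDS = ["station_id", "date", "element", "value",
          "m_flag", "q_flag", "s_flag", "obs_time"]


def read_observations(text: str) -> Iterator[Dict[str, object]]:
    """Yield one dict per observation from GHCN-D style CSV text."""
    for row in csv.reader(io.StringIO(text)):
        if not row:
            continue  # skip blank lines
        record: Dict[str, object] = dict(zip(FIELDS, row))
        record["value"] = int(row[3])  # integer, e.g. tenths of deg C for TMAX
        yield record


# Invented sample rows in the assumed layout.
sample = (
    "USW00023174,20030801,TMAX,283,,,W,2400\n"
    "USW00023174,20030801,TMIN,178,,,W,2400\n"
)
observations = list(read_observations(sample))
```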

Mortality

Data files

CDC - Vital Statistics Online Data Portal, under Mortality Multiple Cause Files, U.S. data (.zip files).

Descriptor files

CDC - Public Use Data File Documentation, in PDF format.

Weather stations

ghcnd-stations.txt from NOAA bucket
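The station file is itself a fixed-width text file, so it can be parsed by column position. The sketch below assumes the layout given in NOAA's GHCN-D readme (ID in columns 1-11, latitude 13-20, longitude 22-30, elevation 32-37, state 39-40, name 42-71); verify these positions against the current readme before relying on them. The function name and output keys are illustrative.

```python
from typing import Dict, Union


def parse_station(line: str) -> Dict[str, Union[str, float]]:
    """Parse one line of ghcnd-stations.txt (assumed column positions)."""
    return {
        "id": line[0:11].strip(),           # columns 1-11
        "latitude": float(line[12:20]),     # columns 13-20
        "longitude": float(line[21:30]),    # columns 22-30
        "elevation_m": float(line[31:37]),  # columns 32-37
        "state": line[38:40].strip(),       # columns 39-40
        "name": line[41:71].strip(),        # columns 42-71
    }
```

The latitude and longitude are what allow each station to be placed inside a county polygon once the data is in PostGIS.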

Counties

County Boundaries of the United States, 1990

County time concordance (a.k.a. crosswalk)

A Crosswalk for US Spatial Data 1790 - 2000, Fabian Eckert and Michael Peters. ZIP file
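One common way to use such a crosswalk is to reallocate counts from historical counties to reference-year counties with per-pair weights. The sketch below assumes the crosswalk has already been reduced to a mapping from (source county, target county) to the share of the source that falls in the target; the function name and data shapes are hypothetical, and this is not necessarily how the project applies the file.

```python
from collections import defaultdict
from typing import Dict, Tuple


def apply_crosswalk(
    counts: Dict[str, float],
    weights: Dict[Tuple[str, str], float],
) -> Dict[str, float]:
    """Reallocate counts from source counties to target counties.

    weights maps (source, target) to the share of the source county
    assigned to the target; shares for one source should sum to 1.
    """
    out: Dict[str, float] = defaultdict(float)
    for (source, target), share in weights.items():
        if source in counts:
            out[target] += counts[source] * share
    return dict(out)
```

For example, a historical county split 75/25 between two reference counties contributes three quarters of its deaths to the first and one quarter to the second.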
