This repository contains the code and data for the project "Parsing Gas: a Scalable Pipeline for Nearest Neighbour Calculations on Spatial Data". The overarching goal of the project is to answer the following question: What is the distribution of distances from gas-heated buildings to the district-heating network in Denmark?
The repository depends on data from the BBR ("Building and Housing Register"), a public database maintained by the Housing Agency. Below is a description of the data artifacts.
Filename | Description | License | Source | Generated by reproduce.sh |
---|---|---|---|---|
BBR_Aktuelt_Totaludtraek_XML_20220517180008.zip | The complete 65 GB dataset from the BBR | NA | datafordeler.dk | ❌ |
bbr_clean.csv | Processed dataset | MIT | Instructions here | ✔️ |
output/gas_fjernvarme_xy.csv | Euclidean distance to district heating network for each gas-heated building | MIT | Generated by analyse_distances.py | ✔️ |
output/{KOMMUNE-ID}_road_dist.csv | Comparison of road distance and Euclidean distance for a specific municipality (see codes here) | MIT | Generated by analyse_road_dists.py | ✔️ |
TL;DR: The entire setup and an example run of the pipeline can be executed with the bash script reproduce.sh.
Below I explain how to reproduce the analyses and plots of my report.
This project uses mamba, a blazingly fast cross-platform package manager for data science. As described in their docs, it is easiest to install through either miniconda or anaconda, so make sure one of these is installed on your system. After that, setup is as easy as running the setup.sh script in a bash terminal.
The dependencies of this project are listed in two yml files. full_environment.yml has the minimal dependencies and is the file used by setup.sh. frozen_env.yml has the complete 'frozen' environment exactly as it was used on my machine. If there are any problems with the setup script, it might be a good idea to install directly from the frozen environment with the following command:
mamba env create -f frozen_env.yml
Parts of the project were developed test-first, with pytest as the test framework. The tests can be run using the following command:
python -m pytest --cov-report term --cov ./src
This will print a coverage report to the terminal.
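As a rough illustration of what pytest collects and runs, the sketch below tests a small distance helper. The function `euclidean_distance` and the file name are hypothetical and are not taken from this repository's src directory.

```python
# test_distance_example.py: hypothetical file name, for illustration only
import math


def euclidean_distance(a, b):
    """Straight-line distance between two (x, y) points in the same projected CRS."""
    return math.hypot(a[0] - b[0], a[1] - b[1])


def test_euclidean_distance_of_3_4_triangle():
    # pytest collects functions whose names start with "test_"
    assert euclidean_distance((0.0, 0.0), (3.0, 4.0)) == 5.0


def test_euclidean_distance_is_symmetric():
    a, b = (1.0, 2.0), (4.0, 6.0)
    assert euclidean_distance(a, b) == euclidean_distance(b, a)
```

Saved under a name matching `test_*.py`, a file like this would be picked up by the pytest command above.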
The formatted data is stored in a .csv file on Google Drive. It can be downloaded manually by following this link and unzipping the file to the data/raw directory. However, the recommended way is to run the download_data.sh script, as this does it all automagically.
Below is a high-level overview of the different scripts in the repo in relation to the analysis pipeline:
Name | Component of Pipeline | Description | Part of reproduce.sh |
---|---|---|---|
extract_bbr.py | 1. Extract BBR | Parses building information from the full BBR XML | ❌ |
format_bbr.py | 2. Format to CSV | Extracts relevant columns to a .csv file | ❌ |
analyse_distances.py | 3. Find Nearest District Heating | Does Euclidean distance calculations | ✔️ |
analyse_road_dists.py | 4. Compare Road Distances | Compares road distances with Euclidean distances for Aabenraa and Gentofte | ✔️ |
plot_dists.R | 5.1 Plot Distributions | Plots distribution of distances (found here) | ✔️ |
leaflet_map.R | 5.2 Create Map | Creates an interactive map of gas-heated buildings and their distance | ✔️ |
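To make step 3 of the pipeline concrete, here is a minimal brute-force sketch of a nearest-neighbour distance calculation. It is not the implementation in analyse_distances.py; the function name, the input format, and the coordinates are assumptions made for illustration.

```python
import math


def nearest_distances(gas_buildings, district_heating):
    """For each gas-heated building, return the Euclidean distance to the
    closest district-heating building.

    Both arguments are lists of (x, y) tuples in a projected coordinate
    system measured in metres (an assumption of this sketch).
    """
    if not district_heating:
        raise ValueError("need at least one district-heating point")
    return [
        min(math.dist(g, d) for d in district_heating)
        for g in gas_buildings
    ]


# Made-up coordinates, not real BBR data
gas = [(0.0, 0.0), (10.0, 0.0)]
heating = [(3.0, 4.0), (10.0, 2.0)]
print(nearest_distances(gas, heating))  # → [5.0, 2.0]
```

This version compares every gas-heated building with every district-heating building, so its cost grows with the product of the two list sizes. For a country-wide dataset, a spatial index such as a k-d tree (for example `scipy.spatial.cKDTree`) is the usual way to keep such queries fast.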
All of the Python scripts define their command-line interface with argparse. This means that the full usage documentation for each script can be shown with the --help flag.
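For readers unfamiliar with argparse, the sketch below shows how help texts attached to arguments become the output of `--help`. The script name and the `--input` and `--output` options are hypothetical; the actual flags of each script are listed by its own `--help` output.

```python
import argparse


def build_parser():
    # Script name and option names are hypothetical, not the repository's actual flags
    parser = argparse.ArgumentParser(
        prog="example_script.py",
        description="Compute distances for a hypothetical input file.",
    )
    parser.add_argument("--input", required=True, help="path to the input CSV")
    parser.add_argument(
        "--output", default="output/result.csv", help="where to write the result"
    )
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch runs anywhere
args = build_parser().parse_args(["--input", "data/raw/example.csv"])
print(f"Reading {args.input}, writing {args.output}")
```

Calling `build_parser().print_help()` prints the same text a user would see when passing `--help` on the command line.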
To improve coherence and make the code more SOLID, I have refactored much of the functionality into a /src directory. An overview can be seen below:
Name | Description | Part of tests |
---|---|---|
extract.py | For parsing the BBR data efficiently | ✔️ |
wrangle_bbr.py | Formats the BBR data to a readable format | ✔️ |
geo_transform | Transforms the data into coordinates | ✔️ |
util.py | Simple helper functions for reading and writing files | ❌ |
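Parsing a very large XML file efficiently usually means streaming it rather than loading it all into memory. The sketch below shows one common way to do that with the standard library's `iterparse`. It is an assumption about the general approach, not a copy of extract.py, and the tag names (`Building`, `id`, `heating`) are placeholders rather than the real BBR schema.

```python
import io
import xml.etree.ElementTree as ET


def iter_buildings(source, tag="Building"):
    """Yield one dict of child-tag -> text per <tag> element.

    The tag name "Building" is a placeholder; the real BBR schema uses
    different element names.
    """
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == tag:
            yield {child.tag: child.text for child in elem}
            # Drop the element's children and text once it has been processed
            elem.clear()


# Tiny in-memory stand-in for the real file
sample = io.BytesIO(
    b"<Root>"
    b"<Building><id>1</id><heating>gas</heating></Building>"
    b"<Building><id>2</id><heating>district</heating></Building>"
    b"</Root>"
)
for row in iter_buildings(sample):
    print(row)
```

Because the function is a generator, each building can be written to disk as soon as it is parsed, and the caller never holds the whole document at once.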