This repo is an example of using duckdb and dbt to build data.
Note
Use of python version 3.12 and a virtual environment is recommended. We like to use pyenv
and pyenv-virtualenv
.
Install the required python packages:
python3 -m pip install --requirement setup/requirements.txt
Confirm packages listed in setup/requirements.in
are installed:
pip list
Tip
Use pip freeze | xargs pip uninstall -y
to uninstall all python packages.
Run a python script to confirm everything is setup:
python3 -m stage_0.sanity_check
Load and explore data from various sources
These datasets were chosen to show some of the ways source data can be imported with DuckDB.
- PLUTO from the NYC Department of City Planning (source)
- NYC Airbnb data (source)
- Trip record data from the NYC Taxi and Limousine Commission (TLC) (source)
-
Download PLUTO from NYC Open Data here by navigating to
Actions
->API
->Download file
->Export format: CSV
-
Move the downloaded csv file to
data/source_data/
-
Run a python script to load all source data into a database:
python3 -m stage_1.load
-
Use the Jupyter notebook
stage_1/explore.ipynb
to explore the data
Use a data pipeline to build "VegeTables/Farm To Market": a dataset about local agriculture
VegeTables/Farm To Market captures where local food is sold and might be grown in New York City.
This dataset combines the locations of farmers markets and potential farms (gardens, other suitable sites) to highlight availability and potential local suppliers of healthy produce in NYC.
farm_to_market
table
Each row is a market and farm pair
market_name
farm_name
suitability_score
distance_ft
line_geometry_wgs84
markets
table
Each row is a farmers market
name
accepts_ebt
distributes_health_bucks
open_year_round
address
borough
latitude
longitude
geometry_is_in_nyc
point_geometry_wgs84
farms
table
Each row is a potential farm
type
name
area_sqft
whole_lot
address
borough
bbl
latitude
longitude
point_geometry_wgs84
polygon_geometry_wgs84
Source data:
- NYC Borough Boundaries [source]
- NYC Farmers Markets [source]
- GreenThumb Garden Info [source]
- GreenThumb Block-Lot [source]
-
Download all source data from their Open Data pages by navigating to
Actions
->API
->Download file
->Export format: CSV
[!IMPORTANT] NYC Borough Boundaries must be downloaded as a geojson file.
-
Move the downloaded csv file to
data/source_data/
-
Run a python script to load all source data into a database:
python3 -m stage_2.load
-
(Optional) Use the Jupyter notebook
stage_2/explore.ipynb
to explore the source data -
Run python scripts to transform and export data:
python3 -m stage_2.transform python3 -m stage_2.export
-
Use the Jupyter notebook
stage_2/analyze.ipynb
to review and analyze the dataset
Use dbt to build a data pipeline
...