Code Monkey home page Code Monkey logo

duckdb-dbt's Introduction

duckdb-dbt

This repo is an example of using duckdb and dbt to build data.

Setup

Python

Note

Use of python version 3.12 and a virtual environment is recommended. We like to use pyenv and pyenv-virtualenv.

Install the required python packages:

python3 -m pip install --requirement setup/requirements.txt

Confirm packages listed in setup/requirements.in are installed:

pip list

Tip

Use pip freeze | xargs pip uninstall -y to uninstall all python packages.

Stage 0: Sanity Check

Run a python script to confirm everything is setup:

python3 -m stage_0.sanity_check

Stage 1: Load and explore data

Load and explore data from various sources

Source data

These datasets were chosen to show some of the ways source data can be imported with DuckDB.

  • PLUTO from the NYC Department of City Planning (source)
  • NYC Airbnb data (source)
  • Trip record data from the NYC Taxi and Limousine Commission (TLC) (source)

Steps

  1. Download PLUTO from NYC Open Data here by navigating to Actions -> API -> Download file -> Export format: CSV

  2. Move the downloaded csv file to data/source_data/

  3. Run a python script to load all source data into a database:

    python3 -m stage_1.load
  4. Use the Jupyter notebook stage_1/explore.ipynb to explore the data

Stage 2: Pipeline

Use a data pipeline to build "VegeTables/Farm To Market": a dataset about local agriculture

Dataset description

VegeTables/Farm To Market captures where local food is sold and might be grown in New York City.

This dataset combines the locations of farmers markets and potential farms (gardens, other suitable sites) to highlight availability and potential local suppliers of healthy produce in NYC.

Data Dictionary

farm_to_market table

Each row is a market and farm pair

  • market_name
  • farm_name
  • suitability_score
  • distance_ft
  • line_geometry_wgs84

markets table

Each row is a farmers market

  • name
  • accepts_ebt
  • distributes_health_bucks
  • open_year_round
  • address
  • borough
  • latitude
  • longitude
  • geometry_is_in_nyc
  • point_geometry_wgs84

farms table

Each row is a potential farm

  • type
  • name
  • area_sqft
  • whole_lot
  • address
  • borough
  • bbl
  • latitude
  • longitude
  • point_geometry_wgs84
  • polygon_geometry_wgs84

Source data

Source data:

  • NYC Borough Boundaries [source]
  • NYC Farmers Markets [source]
  • GreenThumb Garden Info [source]
  • GreenThumb Block-Lot [source]

Steps

  1. Download all source data from their Open Data pages by navigating to Actions -> API -> Download file -> Export format: CSV

    [!IMPORTANT] NYC Borough Boundaries must be downloaded as a geojson file.

  2. Move the downloaded csv file to data/source_data/

  3. Run a python script to load all source data into a database:

    python3 -m stage_2.load
  4. (Optional) Use the Jupyter notebook stage_2/explore.ipynb to explore the source data

  5. Run python scripts to transform and export data:

    python3 -m stage_2.transform
    python3 -m stage_2.export
  6. Use the Jupyter notebook stage_2/analyze.ipynb to review and analyze the dataset

Stage 3: dbt pipeline

Use dbt to build a data pipeline

...

duckdb-dbt's People

Contributors

damonmcc avatar

Watchers

 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.