Code Monkey home page Code Monkey logo

simple_dbt_project's Introduction

Repository for the following blogs

  1. dbt(data build tool) Tutorial
  2. Uplevel dbt(data build tool) workflow

Setup

Prerequisites

  1. Docker
  2. python
  3. poetry
  4. pgcli
  5. git
  6. just

In addition to the tools, you would also need to know what dbt is, you can learn about it here: dbt tutorial.

Clone the git repo and switch to the uplevel-workflow branch.

git clone https://github.com/josephmachado/simple_dbt_project.git
cd simple_dbt_project

# Set dbt env variables to tell dbt where the profiles.yml and dbt_project.yml files are
export DBT_PROFILES_DIR=$(just profile-dir)
export DBT_PROJECT_DIR=$(just project-dir)

Note: All the just commands are avaialble in the justfile

Create virtual environment and start postgres docker

In your project terminal, create a virtual environment and activate it as shown below.

just create-venv # uses poetry to install python modules in pyproject.ml
source .venv/bin/activate

Start the postgres container (to be used as warehouse for development) with the following command.

just restart # clears up any existing container with the same name, starts a new postgres container and sets up elementary (a dbt package) table.

Run dbt

From your project terminal run the following commands.

dbt deps
dbt snapshot
dbt run --select sde_dbt_tutorial
dbt test
dbt docs generate
dbt docs serve

Go to http://localhost:8080 to see the dbt documentation (press ctrl+c).

Let's do some testing, Insert some data into source customer table, to demonstrate dbt snapshots. From your terminal (after exiting dbt container) run the following command.

just warehouse

You will be logged into your warehouse, here use the below command:

COPY raw_layer.customers(customer_id, zipcode, city, state_code, datetime_created, datetime_updated) FROM '/input_data/customer_new.csv' DELIMITER ',' CSV HEADER;
\q

Run snapshot and create models again.

dbt snapshot --select sde_dbt_tutorial
dbt run --select sde_dbt_tutorial

From your terminal run the following command.

just warehouse
select * from your_name_warehouse.customer_orders limit 3;
\q

Note: The following sections are for this article Uplevel dbt(data build tool) workflow

Use selectors and tags to only run necessary models/tests

We can use selectors to only select source models, models/tests with a specific tag, etc see this link for details.

for example

# make sure your virtual env is acitaved
cd sde_dbt_tutorial
# Use the following command to list all the source table's tests
dbt ls --select "source:*,resource_type:test"
# Use the following command to list all dbt objects (tests, tables, views, etc) which has a tag elementary
dbt ls --select "tag:elementary"
cd ..

Use defer to only run the required model and use prod's upstream tables

If you wan to use another environment's tables to build your model, you can use defer. Assuming we have two environments(schema) in the same database let's see how we can use defer to build a model for developing locally.

# make sure your virtual env is activated
just restart # this will clear out existing models in dev and prod
just prod-run # this will run dbt models in the prod environment

# make sure that you are in simple_dbt_project folder
rm -f ./sde_dbt_tutorial/prod-run-artifacts/manifest.json
cp ./sde_dbt_tutorial/target/manifest.json ./sde_dbt_tutorial/prod-run-artifacts/

# Run defer using pre build prod tables and create a new model in dev
cd sde_dbt_tutorial
dbt run --select "customer_orders" --defer --state prod-run-artifacts
cd ..
# You will see a log showing one model created

# Open warehouse with 
just warehouse
select * from your_name_warehouse.dim_customers; -- table does not exist error
select * from your_name_warehouse.customer_orders limit 10; -- you will see results
set search_path to 'your_name_warehouse';
\d -- you will only see customer_orders tables in your your_name_warehouse schema
\q -- exit pgcli

Validate data before pushing to prod, with data-diff

We can use data-diff to compare (column value based) datasets. This is especially helpful when you want to make sure that your changes did not in advertently change the granularity, alter values, etc. Let's see how we can compare data using data-diff.

# make sure your virtual env is activated
just restart && just dev-run && just prod-run

# make sure that you are in simple_dbt_project folder
rm -f ./sde_dbt_tutorial/prod-run-artifacts/manifest.json
cp ./sde_dbt_tutorial/target/manifest.json ./sde_dbt_tutorial/prod-run-artifacts/

# go to and ./sde_dbt_tutorial/models/marts/marketing/customer_orders.sql 
# and add a new col 'som col' as some_new_col,

cd sde_dbt_tutorial
dbt run --select "customer_orders" && data-diff --dbt --state ./prod-run-artifacts/manifest.json --select customer_orders -k order_id,customer_id
cd ..

You will see the diff as shown below.

dbt.warehouse.customer_orders <> dbt.your_name_warehouse.customer_orders 
Column(s) added: {'some_new_col'}
No row differences

Observe data quality with elementary

We have added elementary data checks in our dbt project, run them as shown below.

just restart
just dev-run 
just dq-report # creates and opens a static HTML file which shows dq report on your browser

add: image

Note: There are more dbt tools & process improvements explained at https://www.startdataengineering.com/post/uplevel-dbt-workflow/

Autorun linting & checks locally before opening a PR to save on CI costs

echo -e '
#!/bin/sh
just ci
' > .git/hooks/pre-commit
chmod ug+x .git/hooks/*

Shutdown

deactivate # to deactivate the virtual environment
just down # to stop postgres containers

simple_dbt_project's People

Contributors

josephmachado avatar nmardess avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar

simple_dbt_project's Issues

Breaking build โ—

Tried running this project following README.MD but I run into the following error

Running with dbt=0.21.1
Encountered an error:
dlopen(/usr/local/lib/python3.9/site-packages/psycopg2/_psycopg.cpython-39-darwin.so, 0x0002): Library not loaded: /usr/local/opt/postgresql/lib/libpq.5.dylib
  Referenced from: <F642848D-A704-3B42-B2DC-3DF22D6D8D5D> /usr/local/lib/python3.9/site-packages/psycopg2/_psycopg.cpython-39-darwin.so
  Reason: tried: '/usr/local/opt/postgresql/lib/libpq.5.dylib' (no such file), '/System/Volumes/Preboot/Cryptexes/OS/usr/local/opt/postgresql/lib/libpq.5.dylib' (no such file), '/usr/local/opt/postgresql/lib/libpq.5.dylib' (no such file), '/usr/local/lib/libpq.5.dylib' (no such file), '/usr/lib/libpq.5.dylib' (no such file, not in dyld cache), '/usr/local/Cellar/postgresql@14/14.6/lib/libpq.5.dylib' (no such file), '/System/Volumes/Preboot/Cryptexes/OS/usr/local/Cellar/postgresql@14/14.6/lib/libpq.5.dylib' (no such file), '/usr/local/Cellar/postgresql@14/14.6/lib/libpq.5.dylib' (no such file), '/usr/local/lib/libpq.5.dylib' (no such file), '/usr/lib/libpq.5.dylib' (no such file, not in dyld cache)

Fix this.

Cannot create warehouse

I'm using WSL2 with Ubuntu 20.04.6 and have installed all prerequisites.

I followed the README up to the just restart command under the "Create virtual environment and start postgres docker". When I ran the just restart command I got this error (1). I took a look at the justfile in an attempt to debug and tried to run just up to see if that would remedy the issue since this would create the postgres container that wasn't found when running just restart. Although after running just up the postgres container was created but I ran into this error (2). I tried running just restart after just up to see what would happen and ran into this error (3) which has the same cause as the second error I mentioned. Postgres is running locally on port 5432 but there is no database or user named "dbt". I'm not sure if I'm supposed to create them or if a step in the instructions is meant to?

1:

rm -rf /raw_data/*
rm -rf ./sde_dbt_tutorial/target
just down
docker stop postgres && docker rm postgres
Error response from daemon: No such container: postgres
error: Recipe `down` failed on line 147 with exit code 1
error: Recipe `restart` failed on line 152 with exit code 1

2:

just start-db
docker run -d --name postgres -e POSTGRES_USER=dbt -e POSTGRES_PASSWORD=password1234 -e POSTGRES_DB=dbt -v $(pwd)/raw_data:/input_data -v $(pwd)/warehouse_setup:/docker-entrypoint-initdb.d -p 5432:5432 postgres:16
59c54ebecfa88a6e83109a36038164095e9795c6d53bbde22e206d2ba8c9917e
just deps
dbt deps
/home/josue/simple_dbt_project/.venv/lib/python3.12/site-packages/dateutil/tz/tz.py:37: DeprecationWarning: datetime.datetime.utcfromtimestamp() is deprecated and scheduled for removal in a future version. Use timezone-aware objects to represent datetimes in UTC: datetime.datetime.fromtimestamp(timestamp, datetime.UTC).
<frozen importlib._bootstrap>:488: DeprecationWarning: Type google._upb._message.MessageMapContainer uses PyType_Spec with a metaclass that has custom tp_new. This is deprecated and will no longer be allowed in Python 3.14.
<frozen importlib._bootstrap>:488: DeprecationWarning: Type google._upb._message.ScalarMapContainer uses PyType_Spec with a metaclass that has custom tp_new. This is deprecated and will no longer be allowed in Python 3.14.
00:46:52  Running with dbt=1.7.4
00:48:53  Installing elementary-data/elementary
00:48:53  Installed from version 0.13.2
00:48:53  Updated version available: 0.14.1
00:48:53  Installing dbt-labs/dbt_utils
00:48:54  Installed from version 1.1.1
00:48:54  Up to date!
00:48:54  Installing calogica/dbt_expectations
00:48:55  Installed from version 0.9.0
00:48:55  Updated version available: 0.10.3
00:48:55  Installing calogica/dbt_date
00:48:55  Installed from version 0.8.1
00:48:55  Updated version available: 0.10.0
00:48:55
00:48:55  Updates available for packages: ['elementary-data/elementary', 'calogica/dbt_expectations', 'calogica/dbt_date']
Update your versions in packages.yml, then run dbt deps
just elem-tables
dbt run --select elementary
/home/josue/simple_dbt_project/.venv/lib/python3.12/site-packages/dateutil/tz/tz.py:37: DeprecationWarning: datetime.datetime.utcfromtimestamp() is deprecated and scheduled for removal in a future version. Use timezone-aware objects to represent datetimes in UTC: datetime.datetime.fromtimestamp(timestamp, datetime.UTC).
<frozen importlib._bootstrap>:488: DeprecationWarning: Type google._upb._message.MessageMapContainer uses PyType_Spec with a metaclass that has custom tp_new. This is deprecated and will no longer be allowed in Python 3.14.
<frozen importlib._bootstrap>:488: DeprecationWarning: Type google._upb._message.ScalarMapContainer uses PyType_Spec with a metaclass that has custom tp_new. This is deprecated and will no longer be allowed in Python 3.14.
00:48:57  Running with dbt=1.7.4
00:48:58  Registered adapter: postgres=1.7.4
00:48:58  Found 37 models, 1 snapshot, 14 tests, 2 operations, 3 sources, 1 exposure, 0 metrics, 1369 macros, 0 groups, 0 semantic models
00:48:58
00:48:58
00:48:58  Finished running  in 0 hours 0 minutes and 0.04 seconds (0.04s).
00:48:58  Encountered an error:
Database Error
  connection to server at "localhost" (127.0.0.1), port 5432 failed: FATAL:  password authentication failed for user "dbt"
  connection to server at "localhost" (127.0.0.1), port 5432 failed: FATAL:  password authentication failed for user "dbt"

error: Recipe `elem-tables` failed on line 65 with exit code 2
error: Recipe `up` failed on line 137 with exit code 2

3:

just start-db
docker run -d --name postgres -e POSTGRES_USER=dbt -e POSTGRES_PASSWORD=password1234 -e POSTGRES_DB=dbt -v $(pwd)/raw_data:/input_data -v $(pwd)/warehouse_setup:/docker-entrypoint-initdb.d -p 5432:5432 postgres:16
f6a505eef6c7c9c0e89f173b5eb4f126a46d866cfcca2a18b0eaba65af5f1e1b
just deps
dbt deps
/home/josue/simple_dbt_project/.venv/lib/python3.12/site-packages/dateutil/tz/tz.py:37: DeprecationWarning: datetime.datetime.utcfromtimestamp() is deprecated and scheduled for removal in a future version. Use timezone-aware objects to represent datetimes in UTC: datetime.datetime.fromtimestamp(timestamp, datetime.UTC).
<frozen importlib._bootstrap>:488: DeprecationWarning: Type google._upb._message.MessageMapContainer uses PyType_Spec with a metaclass that has custom tp_new. This is deprecated and will no longer be allowed in Python 3.14.
<frozen importlib._bootstrap>:488: DeprecationWarning: Type google._upb._message.ScalarMapContainer uses PyType_Spec with a metaclass that has custom tp_new. This is deprecated and will no longer be allowed in Python 3.14.
01:04:16  Running with dbt=1.7.4
01:04:17  Installing elementary-data/elementary
01:04:17  Installed from version 0.13.2
01:04:17  Updated version available: 0.14.1
01:04:17  Installing dbt-labs/dbt_utils
01:04:18  Installed from version 1.1.1
01:04:18  Up to date!
01:04:18  Installing calogica/dbt_expectations
01:04:20  Installed from version 0.9.0
01:04:20  Updated version available: 0.10.3
01:04:20  Installing calogica/dbt_date
01:04:20  Installed from version 0.8.1
01:04:20  Updated version available: 0.10.0
01:04:20
01:04:20  Updates available for packages: ['elementary-data/elementary', 'calogica/dbt_expectations', 'calogica/dbt_date']
Update your versions in packages.yml, then run dbt deps
just elem-tables
dbt run --select elementary
/home/josue/simple_dbt_project/.venv/lib/python3.12/site-packages/dateutil/tz/tz.py:37: DeprecationWarning: datetime.datetime.utcfromtimestamp() is deprecated and scheduled for removal in a future version. Use timezone-aware objects to represent datetimes in UTC: datetime.datetime.fromtimestamp(timestamp, datetime.UTC).
<frozen importlib._bootstrap>:488: DeprecationWarning: Type google._upb._message.MessageMapContainer uses PyType_Spec with a metaclass that has custom tp_new. This is deprecated and will no longer be allowed in Python 3.14.
<frozen importlib._bootstrap>:488: DeprecationWarning: Type google._upb._message.ScalarMapContainer uses PyType_Spec with a metaclass that has custom tp_new. This is deprecated and will no longer be allowed in Python 3.14.
01:04:22  Running with dbt=1.7.4
01:04:22  Registered adapter: postgres=1.7.4
01:04:22  Unable to do partial parsing because saved manifest not found. Starting full parse.
01:04:26  [WARNING]: Did not find matching node for patch with name 'non_existent_table' in the 'models' section of file 'models/marts/marketing/marketing.yml'
01:04:27  [WARNING]: Test 'test.sde_dbt_tutorial.unique_non_existent_table_non_existent.1fe469feba' (models/marts/marketing/marketing.yml) depends on a node named 'non_existent_table' in package '' which was not found
01:04:27  Found 37 models, 1 snapshot, 14 tests, 2 operations, 3 sources, 1 exposure, 0 metrics, 1369 macros, 0 groups, 0 semantic models
01:04:27
01:04:27
01:04:27  Finished running  in 0 hours 0 minutes and 0.05 seconds (0.05s).
01:04:27  Encountered an error:
Database Error
  connection to server at "localhost" (127.0.0.1), port 5432 failed: FATAL:  password authentication failed for user "dbt"
  connection to server at "localhost" (127.0.0.1), port 5432 failed: FATAL:  password authentication failed for user "dbt"

error: Recipe `elem-tables` failed on line 65 with exit code 2
error: Recipe `up` failed on line 137 with exit code 2

Invalid password

the password for postgress 'password1234' shows 'invalid password'. Pls help.

Error in profiles.yml

When creating my profiles.yml it looks like dbt debug doesn't like the schema entry, dbt_tutorial. My file looks like this:

tutorial:
  outputs:
    dev:
      type: postgres
      threads: 1
      host: localhost
      port: 5432
      user: steven
      password: ***
      dbname: tutorial
      schema: `dbt`_tutorial

  target: dev

and for the schema, it looks like the first ` symbol doesn't light up orange in vs code. Is this a legal yaml first-character? Getting this error btw:

Runtime Error
    Syntax error near line 11
    ------------------------------
    8  |       user: steven
    9  |       password: ***
    10 |       dbname: tutorial
    11 |       schema: `dbt`_tutorial
    12 |
    13 |   target: dev
    14 |

    Raw Error:
    ------------------------------
    while scanning for the next token
    found character that cannot start any token
      in "<unicode string>", line 11, column 15

Error when running "dbt seed"

dbt version: 0.20.1
database: postgresql

when running the command "dbt seed", it raises error:

Unhandled error while executing seed.tutorial.customer
Tried to resolve a name to a reference that was unknown to the frame ('column')
19:17:41 | 1 of 3 ERROR loading seed file tutorial.customer..................... [ERROR in 3.61s]
19:17:41 | 3 of 3 START seed file tutorial.state................................ [RUN]
Unhandled error while executing seed.tutorial.state
Tried to resolve a name to a reference that was unknown to the frame ('column')
19:17:41 | 3 of 3 ERROR loading seed file tutorial.state........................ [ERROR in 0.23s]
Unhandled error while executing seed.tutorial.orders
Tried to resolve a name to a reference that was unknown to the frame ('column')
19:17:51 | 2 of 3 ERROR loading seed file tutorial.orders....................... [ERROR in 14.37s]
19:17:51 |
19:17:51 | Finished running 3 seeds in 14.96s.

Completed with 3 errors and 0 warnings:

Tried to resolve a name to a reference that was unknown to the frame ('column')

Tried to resolve a name to a reference that was unknown to the frame ('column')

Tried to resolve a name to a reference that was unknown to the frame ('column')

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.