The purpose of this project is for me (a Machine Learning Engineer) to gain more practical experience with DBT (and several other tools) while creating a functional demo in a domain I have some experience in. Maybe someday I will turn this into an interactive demo.
The ultimate goal of this project is to provide time-series forecasts using a publicly available dataset from the state of Iowa that documents all transactions between liquor stores and vendors. This repo currently provides monthly-level forecasts for product-categories at each store. In theory, this could be useful to individual liquor stores to understand future demand in order to make more intelligent purchasing decisions, or for vendors to be better prepared for future orders. Ultimately, this level of granularity was chosen arbitrarily, potentially trading off some practical utility for data that is easier to work with.
We are using DBT to define a number of transformations required to prepare raw transactional data for a many-models time-series forecasting implementation. The source data for this project is available as a public-facing dataset.
The raw data is transformed via BigQuery tables and views before a python model uses a package called darts to generate the time-series forecasts. The python models are run in the backend using Snowpark.
DBT's website sums up their product quite succinctly: "dbt™ is a SQL-first transformation workflow that lets teams quickly and collaboratively deploy analytics code following software engineering best practices like modularity, portability, CI/CD, and documentation". This tool is an easy-to-use option for teams looking to optimize and version-manage their data transformation pipelines, and it provides a common interface for Data Engineers, Analysts, Machine Learning Engineers, and Data Scientists.
The dbt_project.yml file is the primary configuration file for your DBT project. Learn more about it here
For information on best practices for how to structure your models directory, review the following docs:
The key component of DBT is the model. Models are generally SQL-based transformations and are placed in the /models directory.
In DBT the most commonly used models are SQL-based transformations. Many Data Engineers and Analysts might only ever need this type of model. However, many ML pipelines require some amount of python processing. This can be done using the python model: DBT allows users to define python transformations via Snowpark.
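To make the python model concrete, here is a minimal sketch of what one looks like. The file name, the ref'd model `monthly_sales`, and the processing step are all hypothetical placeholders, not part of this repo:

```python
# models/forecast_input.py -- a minimal dbt python model sketch.
# The ref'd model name "monthly_sales" is hypothetical.

def model(dbt, session):
    # dbt.ref() returns a dataframe backed by the warehouse
    # (a Snowpark DataFrame when using the Snowflake adapter).
    monthly_sales = dbt.ref("monthly_sales")

    # Convert to pandas for in-process python transformations.
    df = monthly_sales.to_pandas()

    # ... python processing would go here (feature prep, forecasting, ...)

    # The returned dataframe is materialized back into the warehouse.
    return df
```

The `model(dbt, session)` signature is dbt's convention for python models; everything else about the body depends on the adapter and the project.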
DBT offers a number of connectors (Redshift, Databricks, BigQuery, etc.) but we will be using the Snowflake adapter for this project. Review the docs to set up this adapter and learn more about it.
This project is done from the perspective of a Machine Learning Engineer, so the Data Science likely leaves a lot to be desired but this should serve as a good introduction to time-series forecasting. For this project, the forecasts are made monthly for all combinations stores and product-categories. This fits nicely with the spark runtime as it allows us to aggregate data at the store/product-category level, then fit and predict with time-series in each partition.
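The aggregation step described above can be sketched with plain pandas. The column names (`date`, `store`, `category`, `bottles_sold`) and the sample rows are illustrative assumptions, not the actual schema of the Iowa dataset:

```python
import pandas as pd

# Hypothetical transaction-level rows standing in for the raw data.
transactions = pd.DataFrame({
    "date": pd.to_datetime(
        ["2023-01-05", "2023-01-20", "2023-02-10", "2023-01-07"]),
    "store": [2633, 2633, 2633, 4829],
    "category": ["VODKA", "VODKA", "VODKA", "WHISKEY"],
    "bottles_sold": [12, 8, 15, 6],
})

# Roll transactions up to one row per store/category/month -- the
# granularity the forecasting models consume.
monthly = (
    transactions
    .assign(month=transactions["date"].dt.to_period("M").dt.to_timestamp())
    .groupby(["store", "category", "month"], as_index=False)["bottles_sold"]
    .sum()
)
print(monthly)
```

In the real pipeline this aggregation happens in SQL models in the warehouse; the pandas version is just to show the shape of the data each partition receives.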
According to their website: "Darts is a Python library for user-friendly forecasting and anomaly detection on time series. It contains a variety of models, from classics such as ARIMA to deep neural networks. The forecasting models can all be used in the same way, using fit() and predict() functions, similar to scikit-learn. The library also makes it easy to backtest models, combine the predictions of several models, and take external data into account." This package adds a convenient layer on many different forecasting packages, providing a consistent way to fit and test multiple forecasting models.
During experimentation it makes sense to install the full package. However, darts supports many different model architectures, so unless you intend to use all of them (or at least the most complex ones), it might make sense to perform testing and validation with darts but, in production, switch to u8darts (the slimmed-down version) or remove the darts dependency entirely and use only the required packages.
TBD. Preliminary thoughts are to use a python worksheet to perform initial experimentation with different model architectures and hyperparameters. This would be done on a small subset of the data, but would allow for quick iteration.
In this and similar time-series projects, it's natural for Data Scientists to perform initial experimentation on a filtered dataset with a handful of store/product-category combinations. As long as the Data Science experimentation results in a function that accepts a pandas dataframe with all data for one group and returns a pandas dataframe with the forecasts, it is incredibly easy to translate that work into a PySpark job that groups at the store/product-category level and applies the function derived by the DS team. For more details about this workflow review the following docs:
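The per-group contract described above can be sketched without a Spark cluster. The naive last-value "model", the column names, and the sample data below are placeholders for the DS team's real logic; the commented line shows where the unchanged function would plug into PySpark:

```python
import pandas as pd

def forecast_group(pdf: pd.DataFrame) -> pd.DataFrame:
    # Placeholder model: repeat the last observed value for the next
    # 3 months. In the real project the darts fit()/predict() logic
    # for one store/category series would live here.
    last = pdf.sort_values("month").iloc[-1]
    horizon = pd.date_range(last["month"], periods=4, freq="MS")[1:]
    return pd.DataFrame({
        "store": last["store"],
        "category": last["category"],
        "month": horizon,
        "forecast": last["bottles_sold"],
    })

# The same function is handed to Spark unchanged, e.g.:
#   sdf.groupBy("store", "category").applyInPandas(forecast_group, schema=...)
# Here we just apply it per group with plain pandas:
history = pd.DataFrame({
    "store": [2633, 2633, 4829],
    "category": ["VODKA", "VODKA", "WHISKEY"],
    "month": pd.to_datetime(["2023-01-01", "2023-02-01", "2023-02-01"]),
    "bottles_sold": [20, 15, 6],
})
forecasts = pd.concat(
    [forecast_group(g) for _, g in history.groupby(["store", "category"])],
    ignore_index=True,
)
print(forecasts)
```

Because the function only sees one group's pandas dataframe at a time, the DS team can develop and test it locally before it is ever run inside a Spark partition.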
As with any project, there are a number of possible improvements that have not been explored yet. We'll spend some time highlighting some of the more obvious choices.
This project provides monthly forecasts at the store/product-category granularity level, but that might not actually be useful to any possible users of this tool. For the sake of this project, we are assuming a couple of different potential fictional users, but the first step would be identifying the actual users. Once the users are known, it would be important to meet with them and better understand their needs.
This dataset is a great example of retail data, but in most real-world time-series forecasting projects there will be several other supporting datasets. For example, it would be nice to know about store closures, so the transaction data can be properly imputed. Moreover, it would be helpful to know the status of all stores, vendors, categories, and items. One known problem in the current implementation is dead product-categories: it would be nice to know which ones are active so only the appropriate time-series are used for forecasting.
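The imputation problem above comes down to making gaps explicit before choosing how to fill them. A minimal pandas sketch, with hypothetical data and a deliberately naive fill:

```python
import pandas as pd

# One store/category series with a missing month -- March is absent,
# which could mean "closed" or just "no transactions recorded".
series = pd.DataFrame({
    "month": pd.to_datetime(["2023-01-01", "2023-02-01", "2023-04-01"]),
    "bottles_sold": [20, 15, 18],
}).set_index("month")

# Reindex onto a complete monthly calendar so gaps become explicit,
# then fill. Without closure data we can only guess: 0 is right for
# "open but nothing sold" and wrong for "store was closed", which is
# exactly why a closures dataset would help.
full_idx = pd.date_range(series.index.min(), series.index.max(), freq="MS")
imputed = series.reindex(full_idx, fill_value=0)
print(imputed)
```

With a closures table joined in, the fill strategy could branch per month (zero when open, interpolation or masking when closed) instead of applying one blanket rule.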
This project has only done a cursory amount of DS experimentation. Only a handful of categories at two high-volume stores were used for experimentation. It would be good to explore more modeling techniques. This might include more model architectures at the same granularity, whether univariate or multivariate.
Determine a metric and threshold for success so Data Scientists can identify superior models that should be promoted to production.
Right now forecasts are generated for all categories at only two stores. Expanding to all stores would almost certainly surface unknown data quality issues.
- Learn more about the data
- Learn more about data preparation for time-series forecasting
- Learn more about DBT
- Learn more about Prophet
- Learn more about pyspark pandas_udf