DBT Retail Example Project

Why are we here?

The purpose of this project is for me (a Machine Learning Engineer) to gain more practical experience with DBT (and several other tools) while creating a functional demo in a domain I have some experience in. Maybe someday I will turn this into an interactive demo.

What are we doing?

The ultimate goal of this project is to provide time-series forecasts using a publicly available dataset from the state of Iowa that documents all transactions between liquor stores and vendors. This repo currently provides monthly-level forecasts for product-categories at each store. In theory, this could be useful to individual liquor stores looking to understand future demand and make more intelligent purchasing decisions, or to vendors wanting to be better prepared for future orders. Ultimately, this level of granularity was chosen arbitrarily, potentially trading off some practical utility for data that is easier to work with.

How does it work?

We are using DBT to define a number of transformations required to prepare raw transactional data for a many-models time-series forecasting implementation. The source data for this project is available as a public-facing dataset published by the state of Iowa.

The raw data is transformed via Snowflake tables and views before a python model uses a package called darts to generate the time-series forecasts. The python models are run in the backend using Snowpark.

DBT

DBT's website sums up their product quite succinctly: "dbt™ is a SQL-first transformation workflow that lets teams quickly and collaboratively deploy analytics code following software engineering best practices like modularity, portability, CI/CD, and documentation". This tool is an easy-to-use option for teams looking to optimize and version-manage their data transformation pipelines, and it provides a common interface for Data Engineers, Analysts, Machine Learning Engineers, and Data Scientists.

Project configuration

The dbt_project.yml file is the primary configuration for your DBT project. Learn more about it here

Project structure

For information on best practices for structuring your models directory, review the following docs:

Models

The key component of DBT is the model. Models are generally SQL-based transformations and are placed in the /models directory.

Python model

In DBT the most commonly used models are SQL-based transformations. Many Data Engineers and Analysts might only ever need this type of model. However, many ML pipelines require some amount of python processing. This can be done using the python model. DBT allows users to define python transformations via Snowpark.
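
As a rough sketch (the model and column names below are hypothetical, not taken from this repo), a python model is a .py file in /models that defines a model(dbt, session) function and returns a dataframe that DBT materializes in the warehouse:

```python
# models/monthly_sales.py -- hypothetical python model
def model(dbt, session):
    # Tell DBT how to materialize this model in Snowflake.
    dbt.config(materialized="table", packages=["pandas"])

    # "stg_sales" stands in for an upstream SQL model; dbt.ref() returns a
    # Snowpark DataFrame, which to_pandas() pulls into memory.
    sales = dbt.ref("stg_sales").to_pandas()

    # An example python-only step: roll daily transactions up to months.
    sales["MONTH"] = sales["DATE"].dt.to_period("M").dt.to_timestamp()
    monthly = sales.groupby(["STORE", "CATEGORY", "MONTH"], as_index=False)["SALES"].sum()

    # DBT writes the returned dataframe back to the warehouse as a table.
    return monthly
```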

dbt-snowflake adapter

DBT offers a number of connectors (Redshift, Databricks, BigQuery, etc.) but we will be using the Snowflake adapter for this project. Review the docs in order to set up this adapter and learn more about it.

Snowflake backend

Snowpark

Data Science

This project is done from the perspective of a Machine Learning Engineer, so the Data Science likely leaves a lot to be desired, but this should serve as a good introduction to time-series forecasting. For this project, the forecasts are made monthly for all combinations of stores and product-categories. This fits nicely with the Spark runtime, as it allows us to aggregate data at the store/product-category level, then fit and predict with a time-series in each partition.

darts

According to their website: "Darts is a Python library for user-friendly forecasting and anomaly detection on time series. It contains a variety of models, from classics such as ARIMA to deep neural networks. The forecasting models can all be used in the same way, using fit() and predict() functions, similar to scikit-learn. The library also makes it easy to backtest models, combine the predictions of several models, and take external data into account." This package adds a convenient layer on many different forecasting packages, providing a consistent way to fit and test multiple forecasting models.
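
A minimal sketch of that shared interface (the numbers below are fabricated purely for illustration):

```python
import pandas as pd
from darts import TimeSeries
from darts.models import ExponentialSmoothing

# Fabricated monthly sales for one store/product-category combination.
df = pd.DataFrame({
    "month": pd.date_range("2021-01-01", periods=36, freq="MS"),
    "sales": [100.0 + i for i in range(36)],
})

# Every darts model exposes the same fit()/predict() interface.
series = TimeSeries.from_dataframe(df, time_col="month", value_cols="sales")
train, val = series[:-6], series[-6:]

model = ExponentialSmoothing()
model.fit(train)
forecast = model.predict(n=6)  # forecast the 6 held-out months
```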

During experimentation it makes sense to install the full package. However, darts supports many different model architectures, so unless you intend to use all of them (or at least the most complex ones), it might make sense to perform testing and validation with darts but, in production, to use u8darts (the slimmed-down version) or to remove the darts dependency entirely and use only the required packages.

Python Worksheet experimentation

TBD. Preliminary thoughts are to use a python worksheet to perform initial experimentation with different model architectures and hyperparameters. This would be done on a small subset of the data, but would allow for quick iteration.

pandas_udf

In this and similar time-series projects, it's natural for Data Scientists to perform initial experimentation on a filtered dataset with a handful of store/product-category combinations. As long as the Data Science experimentation results in a function that accepts a pandas dataframe with all data for one group and returns a pandas dataframe with the forecasts, it is incredibly easy to translate that work into a PySpark job that aggregates at the store/product-category level and applies the function derived by the DS team. For more details about this workflow, review the PySpark documentation on grouped-map pandas UDFs.
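
A sketch of that hand-off, assuming a Spark DataFrame named sales_df with store, category, month, and sales columns (all names here are hypothetical). Note that in recent PySpark versions the grouped-map pattern is spelled groupBy().applyInPandas() rather than a pandas_udf:

```python
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Stand-in for the function produced by DS experimentation: it receives all
# rows for one store/product-category group and returns forecast rows.
def forecast_group(pdf: pd.DataFrame) -> pd.DataFrame:
    last = pdf.sort_values("month").iloc[-1]
    future = pd.date_range(last["month"], periods=7, freq="MS")[1:]
    return pd.DataFrame({
        "store": last["store"],
        "category": last["category"],
        "month": future,
        "forecast": last["sales"],  # naive placeholder: repeat the last value
    })

# sales_df is a hypothetical Spark DataFrame of monthly sales per group.
forecasts = sales_df.groupBy("store", "category").applyInPandas(
    forecast_group,
    schema="store string, category string, month timestamp, forecast double",
)
```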

Likely next steps

As with any project, there are a number of possible improvements that have not been explored yet. We'll spend some time highlighting some of the more obvious choices.

Improve project requirements

This project provides monthly forecasts at the store/product-category granularity level, but that might not actually be useful to any potential users of this tool. For the sake of this project, we are assuming a couple of different fictional users, but the first real step would be identifying the actual users. Once the users are known, it would be important to meet with them and better understand their needs.

Additional data

This dataset is a great example of retail data, but in most real-world time-series forecasting projects there will be several other supporting datasets. For example, it would be nice to know about store closures so the transaction data can be properly imputed. It would also be helpful to know the status of all stores, vendors, categories, and items. One known problem in the current implementation is dead product-categories: knowing which ones are active would ensure that only the appropriate time-series are used for forecasting.
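
A fabricated illustration of why closure data matters for imputation: months absent from the raw transactions must be reindexed into the series, and only closure information tells us whether a gap means zero demand or a closed store.

```python
import pandas as pd

# Fabricated series for one store/category; March and April have no
# recorded transactions.
sales = pd.Series(
    [120.0, 95.0, 110.0],
    index=pd.to_datetime(["2023-01-01", "2023-02-01", "2023-05-01"]),
)

# Reindex onto the full monthly range so every month appears. With closure
# data we could fill 0 for "open, no sales" months and leave NaN (or drop
# the series) for months the store was actually closed.
full_range = pd.date_range(sales.index.min(), sales.index.max(), freq="MS")
imputed = sales.reindex(full_range, fill_value=0.0)
```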

More DS experimentation

This project has only done a cursory amount of DS experimentation. Only a handful of categories at two high-volume stores were used for experimentation. It would be good to explore more modeling techniques. This might include more model architectures at the same granularity, be it univariate or multivariate.

Model promotion

Determine a metric and threshold for success so Data Scientists can identify superior models that should be promoted to production.
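
Continuing the darts sketch from above, a hypothetical promotion gate might look like this (the metric, threshold, and deployment hook are all assumptions, not part of this repo):

```python
from darts.metrics import mape

MAPE_THRESHOLD = 20.0  # assumed threshold, chosen arbitrarily for illustration

# Score the candidate on the held-out months from the earlier sketch.
score = mape(val, forecast)

if score < MAPE_THRESHOLD:
    promote_to_production(model)  # hypothetical deployment hook
```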

Scale up

Right now, forecasts are generated for all categories at only two stores. Scaling up to all stores and categories would almost certainly surface unknown data quality issues.

How to do this at home

Install dbt

Set up Google Cloud project

Set up DBT project environment

DBT environment

Build and push runtime container

Build your datasets

Marvel at your success!

Resources:
