The purpose of this project is for me (a Machine Learning Engineer) to gain more practical experience with DBT (and several other tools) while creating a functional demo in a domain I have some experience in. Maybe someday I will turn this into an interactive demo.
The ultimate goal of this project is to provide time-series forecasts using a publicly available dataset from the state of Iowa that documents all transactions between liquor stores and vendors. This repo currently provides monthly-level forecasts for product-categories at each store. In theory, this could be useful to individual liquor stores to understand future demand in order to make more intelligent purchasing decisions, or for vendors to be better prepared for future orders. Ultimately, this level of granularity was chosen arbitrarily, potentially trading off some practical utility for data that is easier to work with.
We are using DBT to define a number of transformations required to prepare raw transactional data for a many-models time-series forecasting implementation. The source data for this project is available as a public-facing dataset.
The raw data is transformed via BigQuery tables and views before a python model uses a package called darts to generate the time-series forecasts. The python models are run in the backend using Snowpark.
DBT's website sums up their product quite succinctly: "dbt™ is a SQL-first transformation workflow that lets teams quickly and collaboratively deploy analytics code following software engineering best practices like modularity, portability, CI/CD, and documentation". This tool is an easy-to-use option for teams looking to optimize and version-manage their data transformation pipelines, and it provides a common interface for Data Engineers, Analysts, Machine Learning Engineers, and Data Scientists.
The dbt_project.yml file is the primary configuration file for your DBT project. Learn more about it here
For information on best practices for how to structure your models directory, review the following docs:
The key component of DBT is the model. Models are generally SQL-based transformations and are placed in the /models directory.
In DBT the most commonly used models are SQL-based transformations. Many Data Engineers and Analysts might only ever need this type of model. However, many ML pipelines require some amount of python processing. This can be done using the python model: DBT allows users to define python transformations via Snowpark.
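To make the python model concrete, here is a minimal sketch of what one looks like. The file name, the ref'd model `monthly_sales`, and the processing step are all hypothetical placeholders, not part of this repo:

```python
# models/forecast_input.py -- a minimal dbt python model sketch.
# The ref'd model name "monthly_sales" is hypothetical.

def model(dbt, session):
    # dbt.ref() returns a dataframe backed by the warehouse
    # (a Snowpark DataFrame when using the Snowflake adapter).
    monthly_sales = dbt.ref("monthly_sales")

    # Convert to pandas for in-process python transformations.
    df = monthly_sales.to_pandas()

    # ... python processing would go here (feature prep, forecasting, ...)

    # The returned dataframe is materialized back into the warehouse.
    return df
```

The `model(dbt, session)` signature is dbt's convention for python models; everything else about the body depends on the adapter and the project.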
DBT offers a number of connectors (Redshift, Databricks, BigQuery, etc.) but we will be using the Snowflake adapter for this project. Review the docs to set up this adapter and learn more about it.
This project is done from the perspective of a Machine Learning Engineer, so the Data Science likely leaves a lot to be desired but this should serve as a good introduction to time-series forecasting. For this project, the forecasts are made monthly for all combinations stores and product-categories. This fits nicely with the spark runtime as it allows us to aggregate data at the store/product-category level, then fit and predict with time-series in each partition.
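The aggregation step described above can be sketched with plain pandas. The column names (`date`, `store`, `category`, `bottles_sold`) and the sample rows are illustrative assumptions, not the actual schema of the Iowa dataset:

```python
import pandas as pd

# Hypothetical transaction-level rows standing in for the raw data.
transactions = pd.DataFrame({
    "date": pd.to_datetime(
        ["2023-01-05", "2023-01-20", "2023-02-10", "2023-01-07"]),
    "store": [2633, 2633, 2633, 4829],
    "category": ["VODKA", "VODKA", "VODKA", "WHISKEY"],
    "bottles_sold": [12, 8, 15, 6],
})

# Roll transactions up to one row per store/category/month -- the
# granularity the forecasting models consume.
monthly = (
    transactions
    .assign(month=transactions["date"].dt.to_period("M").dt.to_timestamp())
    .groupby(["store", "category", "month"], as_index=False)["bottles_sold"]
    .sum()
)
print(monthly)
```

In the real pipeline this aggregation happens in SQL models in the warehouse; the pandas version is just to show the shape of the data each partition receives.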
According to their website: "Darts is a Python library for user-friendly forecasting and anomaly detection on time series. It contains a variety of models, from classics such as ARIMA to deep neural networks. The forecasting models can all be used in the same way, using fit() and predict() functions, similar to scikit-learn. The library also makes it easy to backtest models, combine the predictions of several models, and take external data into account." This package adds a convenient layer on many different forecasting packages, providing a consistent way to fit and test multiple forecasting models.
During experimentation it makes sense to install the full package. However, darts supports many different model architectures, so unless you intend to use all of them (or at least the most complex ones), it might make sense to perform testing and validation with darts but, in production, switch to u8darts (the slimmed-down version) or remove the darts dependency entirely and use only the required packages.
TBD. Preliminary thoughts are to use a python worksheet to perform initial experimentation with different model architectures and hyperparameters. This would be done on a small subset of the data, but would allow for quick iteration.
In this and similar time-series projects, it's natural for Data Scientists to perform initial experimentation on a filtered dataset with a handful of store/product-category combinations. As long as the Data Science experimentation results in a function that accepts a pandas dataframe with all data for one group and returns a pandas dataframe with the forecasts, it is incredibly easy to translate that work into a PySpark job that groups at the store/product-category level and applies the function derived by the DS team. For more details about this workflow review the following docs:
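The per-group contract described above can be sketched without a Spark cluster. The naive last-value "model", the column names, and the sample data below are placeholders for the DS team's real logic; the commented line shows where the unchanged function would plug into PySpark:

```python
import pandas as pd

def forecast_group(pdf: pd.DataFrame) -> pd.DataFrame:
    # Placeholder model: repeat the last observed value for the next
    # 3 months. In the real project the darts fit()/predict() logic
    # for one store/category series would live here.
    last = pdf.sort_values("month").iloc[-1]
    horizon = pd.date_range(last["month"], periods=4, freq="MS")[1:]
    return pd.DataFrame({
        "store": last["store"],
        "category": last["category"],
        "month": horizon,
        "forecast": last["bottles_sold"],
    })

# The same function is handed to Spark unchanged, e.g.:
#   sdf.groupBy("store", "category").applyInPandas(forecast_group, schema=...)
# Here we just apply it per group with plain pandas:
history = pd.DataFrame({
    "store": [2633, 2633, 4829],
    "category": ["VODKA", "VODKA", "WHISKEY"],
    "month": pd.to_datetime(["2023-01-01", "2023-02-01", "2023-02-01"]),
    "bottles_sold": [20, 15, 6],
})
forecasts = pd.concat(
    [forecast_group(g) for _, g in history.groupby(["store", "category"])],
    ignore_index=True,
)
print(forecasts)
```

Because the function only sees one group's pandas dataframe at a time, the DS team can develop and test it locally before it is ever run inside a Spark partition.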
As with any project, there are a number of possible improvements that have not been explored yet. We'll spend some time highlighting some of the more obvious choices.
This project provides monthly forecasts at the store/product-category granularity level, but that might not actually be useful to any possible users of this tool. For the sake of this project, we are assuming a couple of different potential fictional users, but the first step would be identifying the actual users. Once the users are known, it would be important to meet with them and better understand their needs.
This dataset is a great example of retail data, but in most real-world time-series forecasting projects there will be several other supporting datasets. For example, it would be nice to know about store closures, so the transaction data can be properly imputed. Moreover, it would be helpful to know the status of all stores, vendors, categories, and items. One known problem in the current implementation is dead product-categories: it would be nice to know which ones are active so only the appropriate time-series are used for forecasting.
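The imputation problem above comes down to making gaps explicit before choosing how to fill them. A minimal pandas sketch, with hypothetical data and a deliberately naive fill:

```python
import pandas as pd

# One store/category series with a missing month -- March is absent,
# which could mean "closed" or just "no transactions recorded".
series = pd.DataFrame({
    "month": pd.to_datetime(["2023-01-01", "2023-02-01", "2023-04-01"]),
    "bottles_sold": [20, 15, 18],
}).set_index("month")

# Reindex onto a complete monthly calendar so gaps become explicit,
# then fill. Without closure data we can only guess: 0 is right for
# "open but nothing sold" and wrong for "store was closed", which is
# exactly why a closures dataset would help.
full_idx = pd.date_range(series.index.min(), series.index.max(), freq="MS")
imputed = series.reindex(full_idx, fill_value=0)
print(imputed)
```

With a closures table joined in, the fill strategy could branch per month (zero when open, interpolation or masking when closed) instead of applying one blanket rule.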
This project has only done a cursory amount of DS experimentation. Only a handful of categories at two high-volume stores were used for experimentation. It would be good to explore more modeling techniques. This might include more model architectures at the same granularity, whether univariate or multivariate.
Determine a metric and threshold for success so Data Scientists can identify superior models that should be promoted to production.
Right now forecasts are generated for all categories at only two stores. Expanding to all stores would almost certainly surface unknown data quality issues.
- Learn more about the data
- Learn more about data preparation for time-series forecasting
- Learn more about DBT
- Learn more about Prophet
- Learn more about pyspark pandas_udf