Status - Work in progress -- Non-Functional
This project is an experiment in designing custom data science workbenches on AWS SageMaker.
The goals of the project are as follows:
- Demonstrate how to loosely couple data engineering and modelling.
- Illustrate how to train a combination of SageMaker and bespoke models.
- Perform model selection using a flexible independent model comparison Notebook.
- Deploy a chosen model.
We achieve this with a combination of convention, configuration, and prebuilt applications built around the following requirements:
- Data is partitioned by an independent job, and that partitioning should be respected by all models.
- Models are then built independently according to the data scientist's ideas and requirements.
- Models are deployed to an endpoint and registered in order to permit comparison
- Comparison is performed using these endpoints on independent data.
- After selection and final deployment, all artefacts are cleaned to reduce costs.
- Overall data partitioning is done once to enforce rigorous comparison of methods.
- All models load their training data through the dataset utility functions (a minimal sketch follows this list).
- All experiments should be performed inside an independent directory below experiments
- Completed models need to be deployed and registered using the models utility functions
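As a minimal sketch of what the data-loading convention might look like inside an experiment (the bucket name, prefix, and helper function below are hypothetical illustrations, not the repository's actual dataset utilities):

```python
# Hypothetical sketch of the data-loading convention: every experiment reads the same
# pre-partitioned splits instead of re-splitting the raw data itself.
# Reading s3:// paths with pandas requires the s3fs package.
import pandas as pd

BUCKET = "my-workbench-bucket"   # assumption: replace with your project bucket
PREFIX = "data/partitions"       # assumption: prefix written by the partitioning job


def load_split(split: str) -> pd.DataFrame:
    """Load one of the shared splits ('train', 'validation' or 'test') from S3."""
    return pd.read_csv(f"s3://{BUCKET}/{PREFIX}/{split}.csv")


train_df = load_split("train")
validation_df = load_split("validation")
```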
Clone this repository into an instance of SageMaker Studio.
There are then two usage pathways you can follow: the GUI/Notebook Workflow and the Script Workflow. Both rely on the same underlying scripts and configuration.
Follow the Notebook data/prepare_data.ipynb to understand how we get the data and prepare it for modelling.
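As a rough sketch of the kind of one-off partitioning the notebook is responsible for (the file names, split proportions, and S3 prefix are assumptions for illustration, not necessarily what the notebook does):

```python
# Sketch of a one-off train/validation/test partition, performed once so that every
# model is trained and compared on exactly the same splits.
import pandas as pd
import sagemaker
from sklearn.model_selection import train_test_split

raw = pd.read_csv("raw_data.csv")  # assumption: raw extract produced earlier in the notebook
train, rest = train_test_split(raw, test_size=0.3, random_state=42)
validation, test = train_test_split(rest, test_size=0.5, random_state=42)

session = sagemaker.Session()
for name, df in [("train", train), ("validation", validation), ("test", test)]:
    df.to_csv(f"{name}.csv", index=False)
    session.upload_data(f"{name}.csv", key_prefix="data/partitions")  # uploads to the default bucket
```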
Examples of modelling approaches are shown in the experiments directory.
The proposed flow is as follows:
- Build a Simple Baseline - Using a scikit-learn script.
- Build an XGBoost Model - Using a pre-built training job container (see the sketch after this list).
- Run an Autopilot Job
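As an example of the second step, here is a hedged sketch of training XGBoost with the SageMaker Python SDK and the pre-built container (the S3 locations, instance types, and hyperparameters are assumptions, not the values used by the experiment scripts):

```python
# Sketch of training the XGBoost step with SageMaker's pre-built container.
import sagemaker
from sagemaker import image_uris
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput

session = sagemaker.Session()
role = sagemaker.get_execution_role()   # works inside SageMaker Studio
bucket = session.default_bucket()

xgb = Estimator(
    image_uri=image_uris.retrieve("xgboost", session.boto_region_name, version="1.5-1"),
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path=f"s3://{bucket}/experiments/xgboost/output",  # assumption
    sagemaker_session=session,
)
xgb.set_hyperparameters(objective="binary:logistic", num_round=100)  # assumption: binary target

xgb.fit({
    "train": TrainingInput(f"s3://{bucket}/data/partitions/train.csv", content_type="text/csv"),
    "validation": TrainingInput(f"s3://{bucket}/data/partitions/validation.csv", content_type="text/csv"),
})
```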
With these models built, we can then explore their performance.
The Model Comparisons Notebook allows you to compare any model that has been built following the conventions shown in the experiments section.
This notebook makes extensive use of configuration and GUI widgets, so you can return and perform further comparisons after additional models have been run.
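To illustrate how such a comparison can work, the sketch below scores registered endpoints on the shared test split (the endpoint names, label column, and CSV payload format are assumptions):

```python
# Sketch of invoking each registered endpoint on the held-out test split for comparison.
import boto3
import pandas as pd

runtime = boto3.client("sagemaker-runtime")
test_df = pd.read_csv("test.csv")  # assumption: the shared test partition
payload = test_df.drop(columns=["target"]).to_csv(header=False, index=False)  # assumption: 'target' is the label column

for endpoint_name in ["baseline-sklearn", "xgboost-experiment"]:  # hypothetical registered endpoints
    response = runtime.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="text/csv",
        Body=payload,
    )
    predictions = response["Body"].read().decode("utf-8")
    print(endpoint_name, predictions[:100])  # compare predictions against test_df["target"]
```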
The Deployment Notebook demonstrates how to select any of the models built and create an endpoint. In some instances, additional configuration is required to add pre-processing to the endpoint.
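A minimal sketch of what that deployment might look like with the SageMaker SDK, assuming an XGBoost artefact in S3 (the image version, artefact location, and endpoint name are placeholders, and any pre-processing steps are omitted):

```python
# Sketch of deploying a chosen model artefact to a real-time endpoint.
import sagemaker
from sagemaker import image_uris
from sagemaker.model import Model

session = sagemaker.Session()
role = sagemaker.get_execution_role()

model = Model(
    image_uri=image_uris.retrieve("xgboost", session.boto_region_name, version="1.5-1"),
    model_data="s3://my-workbench-bucket/experiments/xgboost/output/model.tar.gz",  # assumption
    role=role,
    sagemaker_session=session,
)
predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.large",
    endpoint_name="selected-model",  # assumption
)
```

Endpoints are billed while they are running, which is why the workflow cleans up artefacts after the final selection.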
The same steps as above can be executed using the RUN script in the root of the repository. This script is parameterised so that you can run individual steps separately, or the entire process in sequence.
The goal of this workflow is to demonstrate how you might automate certain elements of your data science workflow and develop a code base that is easier to deploy.