
algorithmic-quartet-mlops's Introduction

Algorithmic Quartet Image Generation

Building a Pokemon Image Generator with MLOps best practices!

Flow Chart of the pipeline

Training Pipeline

The training is done with PyTorch and HuggingFace on Lightning.ai Studios. The data is stored in a GCP Cloud Storage bucket and downloaded to the GPU machine for training.
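As a rough sketch of the download step (bucket name and prefix are placeholders, not the project's actual values), using the google-cloud-storage client:

# Sketch: pull the training images from a GCS bucket to local disk before training.
# Bucket name and prefix below are placeholders.
from pathlib import Path
from google.cloud import storage

def download_dataset(bucket_name: str, prefix: str, target_dir: str) -> None:
    client = storage.Client()
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    for blob in client.list_blobs(bucket_name, prefix=prefix):
        blob.download_to_filename(str(target / Path(blob.name).name))

download_dataset("pokemon-training-data", "images/", "./data")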

Frontend

The frontend is a Streamlit UI that renders the generated images.
It is continuously deployed with GCP Cloud Build and is live on a GCP Cloud Run instance.

Backend

The image generation service uses FastAPI to serve the latest trained model and runs on a GCP Cloud Run instance.

Automated Training & Continuous Delivery

Automated training is best described in this video: https://vimeo.com/948396185

The frontend and backend are both built and deployed automatically when a Git tag of the form frontend/VERSION or backend/VERSION is added to a commit.

algorithmic-quartet-mlops's People

Contributors

gerovanmi, patrickliuu, themadbevah

Forkers

patrickliuu

algorithmic-quartet-mlops's Issues

Decide on Deployment Server

Possible Options:

  • Flask
  • FastAPI
  • MLServer

Currently leaning towards MLServer or FastAPI.

Important:
Document why this framework was chosen.

Function: Model prediction

The server should use the newest/best version of the model, selected by a metric of our choosing and assigned through Weights & Biases.
The function should use this model to make a prediction.
The resulting image is then returned to the Streamlit UI.
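A minimal sketch of such an endpoint, assuming FastAPI is chosen (the route name is an assumption, and the model call is replaced by a placeholder PIL image so the snippet stays self-contained):

# Sketch of the image generation endpoint, assuming FastAPI is chosen.
# In the real service the image would come from the latest trained model, not Image.new().
import io
from fastapi import FastAPI
from fastapi.responses import Response
from PIL import Image

app = FastAPI()

@app.get("/generate")
def generate() -> Response:
    image = Image.new("RGB", (128, 128))  # placeholder for the model's generated image
    buffer = io.BytesIO()
    image.save(buffer, format="PNG")
    return Response(content=buffer.getvalue(), media_type="image/png")

The Streamlit frontend can then request /generate and display the returned PNG bytes.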

Optimize GPU usage

We could theoretically calculate the free memory and adjust the batch size accordingly to save money.
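One possible sketch, reading the free GPU memory with torch.cuda.mem_get_info (the bytes-per-sample estimate is a made-up placeholder that would need profiling):

import torch

def suggest_batch_size(bytes_per_sample: int = 50 * 1024**2, max_batch: int = 64) -> int:
    # Pick a batch size based on currently free GPU memory.
    if not torch.cuda.is_available():
        return 1
    free_bytes, _total_bytes = torch.cuda.mem_get_info()
    usable = int(free_bytes * 0.8)  # leave ~20% headroom for activations and optimizer state
    return max(1, min(max_batch, usable // bytes_per_sample))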

Migrate Artifact Registry to europe-west1

The Artifact Registry is currently in europe-west9 (France) while Cloud Build runs in europe-west1 (Belgium).
Since our Docker images are fairly large, transferring them between data centers is probably expensive, so we should keep them in one location.

Define Dataset for training

In order to train a diffusion model we need a dataset.
The data can either be static or continuously added.

  • Select a dataset and give reasons why we use it
  • Upload dataset to GC
  • #4

Choosing Webserver

Options:

  • Gunicorn
  • Guvicorn
  • Uvicorn
  • Waitress
  • uWSGI

Note: uWSGI is only to be used when you know why you're using it. I most likely don't.
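If FastAPI is chosen, a minimal way to serve it with Uvicorn looks like the sketch below; "main:app" assumes the FastAPI instance is named app in main.py (Gunicorn with Uvicorn workers would be the usual production alternative):

import uvicorn

if __name__ == "__main__":
    # Serve the FastAPI app defined as `app` in main.py (assumption).
    uvicorn.run("main:app", host="0.0.0.0", port=8080)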

Model intake

Possibilities:

  • Fetch the model directly from Weights & Biases; the model tagged deployment is used (see the sketch after this list).
  • This happens via GitHub Actions.
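A sketch of how the backend could fetch the model tagged deployment from W&B (entity, project and artifact names are placeholders, and deployment is assumed to be an artifact alias):

import wandb

api = wandb.Api()
artifact = api.artifact("my-entity/pokemon-diffusion/model:deployment")  # placeholder path
model_dir = artifact.download()  # local directory containing the model files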

Choose a tool/the tools for monitoring metrics in training and production

Weights & Biases:

Positive:

  • Sweeps (configured through the config files, which hold all dependencies)
  • Experiments can be well structured and are easy to follow
  • Integration into existing frameworks
  • Good and professional visualizations
  • Supports team collaboration
  • Model saving and reproducibility are possible
  • Scalability: it supports scaling from local experiments to large-scale deployments smoothly, making it suitable for both research and production environments.
  • System performance can be monitored

Negative:

  • Costs
  • Dependency and Privacy
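For reference, basic experiment tracking with W&B takes only a few lines; the project name and logged values below are placeholders:

import wandb

run = wandb.init(project="pokemon-diffusion", config={"lr": 1e-4, "batch_size": 16})
for step in range(100):
    loss = 1.0 / (step + 1)          # placeholder for the real training loss
    wandb.log({"train/loss": loss})
run.finish()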

Comet.ml

  • Comprehensive Experiment Management: Tracks code, experiments, and results
  • Integration: Supports multiple frameworks and languages, facilitating integration into existing projects without major adjustments.
  • Community and Support: Provides good documentation and has a supportive community, which can be helpful for troubleshooting and best practices.

PyTorch

HYDRA

Pros

  • Simplification of Configuration: HYDRA allows you to create hierarchical configurations dynamically by composing and overriding them from the command line. This simplifies managing configurations for complex applications requiring various setups for different environments or tasks.
  • Flexibility: It supports a wide range of data sources for configuration files, including YAML, and it can seamlessly merge configurations from multiple sources. This flexibility is particularly useful in research and development environments where rapid testing of configurations is standard.
  • Plugin System: HYDRA includes a robust plugin system, allowing seamless functionality extension and integration with other tools and frameworks.
  • Command Line Integration: Configurations can be easily adjusted or overridden from the command line, making it convenient to quickly change parameters without altering code or configuration files directly.

Cons

  • Learning Curve: Although HYDRA is very powerful, it has a learning curve. Users must become familiar with its configuration composition and overriding principles to use it effectively.

  • Integration Effort: While integrating HYDRA into existing projects offers many long-term benefits, the initial setup and integration might require significant effort, especially in projects not designed to use external configuration management.

  • Dependency and Complexity: Adding HYDRA introduces another dependency into the project, which can increase the complexity and potentially impact the build and deployment processes.

  • Overhead for Smaller Projects: For smaller projects or those that do not require frequent configuration changes or extensive experimentation, the overhead of implementing and maintaining a HYDRA-based configuration might not be justified.
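A minimal Hydra sketch (config path, name and fields are placeholders); values can be overridden from the command line, e.g. python train.py training.lr=1e-4:

import hydra
from omegaconf import DictConfig

@hydra.main(config_path="conf", config_name="config", version_base=None)
def main(cfg: DictConfig) -> None:
    # Access hierarchical config values composed from conf/config.yaml.
    print(cfg.training.lr, cfg.training.batch_size)

if __name__ == "__main__":
    main()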

  • PyTorch: the framework supports techniques used in continual learning, such as experience replay, elastic weight consolidation (EWC), and other regularization methods

  • AdverTorch

Alibi Detect

Pros

  • Versatility: Alibi Detect provides a wide range of detection algorithms suitable for various data types, including tabular data, text, images, and time series.
  • Framework Compatibility: It supports PyTorch backends for drift detection, making it flexible for use in different machine learning stacks.
  • Online and Offline Detection: The library supports both online and offline detection methods, which means it can handle streaming and batch data.
  • Good documentation
  • Ease of Integration: It can be easily integrated into existing data processing pipelines and machine learning workflows.

Cons

  • Complexity: The wide range of features and algorithms might be overwhelming for beginners or practitioners new to drift, outlier, and adversarial detection.
  • Scalability: Depending on the models' complexity and the data's size, scaling the detectors for high-throughput systems might require additional engineering.
  • Resource Intensive: Some detection algorithms, especially for high-dimensional data such as images and time series, can be resource-intensive, requiring significant computational power.
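A small sketch of offline drift detection with Alibi Detect's Kolmogorov-Smirnov detector (the reference and test arrays are random placeholders):

import numpy as np
from alibi_detect.cd import KSDrift

x_ref = np.random.randn(500, 32)   # reference data, e.g. training features
x_test = np.random.randn(100, 32)  # new data to check for drift

detector = KSDrift(x_ref, p_val=0.05)
result = detector.predict(x_test)
print(result["data"]["is_drift"])  # 1 if drift was detected, 0 otherwise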

Create W&B Model Registry

The model registry will allow us to fetch previously trained models and compare them.
We can also use it to provide the model service with our currently best model.

We could also use only GCP or another provider, but since we are already using W&B for the metrics, it makes a lot of sense to also store the models there.
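A sketch of how a trained model could be logged as a versioned W&B artifact so the serving backend can later fetch it (project name, checkpoint path and aliases are placeholders):

import wandb

run = wandb.init(project="pokemon-diffusion", job_type="training")
artifact = wandb.Artifact("model", type="model")
artifact.add_dir("checkpoints/")   # directory holding the trained weights
run.log_artifact(artifact, aliases=["latest", "deployment"])
run.finish()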

Create Automated Training Pipeline

  • Run GitHub Action when the tag run_training is pushed
  • We are running out of space with the PyTorch image, so let's switch to an Nvidia / CUDA based one
  • Still running out of space, so perhaps switch back to the normal docker build command?
  • Implement Google Cloud Build pipeline
  • Create secrets for W&B and Cloud Bucket keys
  • Start Lightning AI studio with GPU and run training
  • Track metrics to W&B project #29
  • #12

Automatically update changes to the CI pipeline

Currently the CI pipeline needs to be updated manually with a docker build command.

docker build -t europe-west9-docker.pkg.dev/algorithmic-quartet/training-pipelines/lightning-executor:latest .
docker push europe-west9-docker.pkg.dev/algorithmic-quartet/training-pipelines/lightning-executor:latest

We could also automate this by building the container whenever something in the CI folder has changed.

Reduce Docker Image build time

The Docker image takes a long time to build (>15 minutes). A large portion of this comes down to the Python libraries, especially the PyTorch dependencies!

We need to find some way to reduce this to a reasonable duration.

  • Depends on / Related to #26
  • Create a base image which is only updated if the requirements or Dockerfile have changed (see this Stack Overflow post).
    Then create a training image on top of this base image, which is rebuilt every time.
  • Find a way to cache the installs between runs.
  • Look into poetry
