
algorithmic-quartet-mlops's Introduction

Algorithmic Quartet Image Generation

Building a Pokemon Image Generator with MLOps best practices!

Flow Chart of the pipeline

Training Pipeline

The training is done with PyTorch and HuggingFace on Lightning.ai Studios. The data is stored in a GCP Cloud Storage bucket and downloaded to the GPU machine for training.
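As a rough sketch of the download step (bucket name and prefix are placeholders, not the project's actual values), using the google-cloud-storage client:

# Sketch: pull the training images from a GCS bucket to local disk before training.
# Bucket name and prefix below are placeholders.
from pathlib import Path
from google.cloud import storage

def download_dataset(bucket_name: str, prefix: str, target_dir: str) -> None:
    client = storage.Client()
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    for blob in client.list_blobs(bucket_name, prefix=prefix):
        blob.download_to_filename(str(target / Path(blob.name).name))

download_dataset("pokemon-training-data", "images/", "./data")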

Frontend

The frontend is a Streamlit UI that renders the generated images.
It is continuously deployed with GCP Cloud Build and is live on a GCP Cloud Run instance.

Backend

The image generation service uses FastAPI to serve the latest trained model and runs on a GCP Cloud Run instance.

Automated Training & Continuous Delivery

Automated training is best described in this video: https://vimeo.com/948396185

The frontend and backend are both built and deployed automatically when a Git tag of the form frontend/VERSION or backend/VERSION is added to a commit.

algorithmic-quartet-mlops's People

Contributors

gerovanmi, patrickliuu, themadbevah

Forkers

patrickliuu

algorithmic-quartet-mlops's Issues

Decide on Deployment Server

Possible Options:

  • Flask
  • FastAPI
  • MLServer

Currently leaning towards MLServer or FastAPI.

Important:
Document why this framework was chosen.

Function: Model prediction

The server should use the newest/best version of the model, selected by a metric of our choosing and assigned through Weights & Biases.
The function should use this model to make a prediction.
The resulting image is then returned to the Streamlit UI.
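A minimal sketch of such an endpoint, assuming FastAPI is chosen (the route name is an assumption, and the model call is replaced by a placeholder PIL image so the snippet stays self-contained):

# Sketch of the image generation endpoint, assuming FastAPI is chosen.
# In the real service the image would come from the latest trained model, not Image.new().
import io
from fastapi import FastAPI
from fastapi.responses import Response
from PIL import Image

app = FastAPI()

@app.get("/generate")
def generate() -> Response:
    image = Image.new("RGB", (128, 128))  # placeholder for the model's generated image
    buffer = io.BytesIO()
    image.save(buffer, format="PNG")
    return Response(content=buffer.getvalue(), media_type="image/png")

The Streamlit frontend can then request /generate and display the returned PNG bytes.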

Optimize GPU usage

We could theoretically calculate the free memory and adjust the batch size accordingly to save money.
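One possible sketch, reading the free GPU memory with torch.cuda.mem_get_info (the bytes-per-sample estimate is a made-up placeholder that would need profiling):

import torch

def suggest_batch_size(bytes_per_sample: int = 50 * 1024**2, max_batch: int = 64) -> int:
    # Pick a batch size based on currently free GPU memory.
    if not torch.cuda.is_available():
        return 1
    free_bytes, _total_bytes = torch.cuda.mem_get_info()
    usable = int(free_bytes * 0.8)  # leave ~20% headroom for activations and optimizer state
    return max(1, min(max_batch, usable // bytes_per_sample))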

Migrate Artifact Registry to europe-west1

The Artifact Registry is currently in europe-west9 (France) while Cloud Build runs in europe-west1 (Belgium).
Since our Docker images are fairly large, transferring them between data centers is probably expensive, so we should keep them in one location.

Define Dataset for training

In order to train a diffusion model we need a dataset.
The data can either be static or continuously added.

  • Select a dataset and give reasons why we use it
  • Upload dataset to GC
  • #4

Choosing Webserver

Options:

  • Gunicorn
  • Guvicorn
  • Uvicorn
  • Waitress
  • uWSGI

Note: uWSGI is only to be used when you know why you're using it. I most likely don't.
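If FastAPI is chosen, a minimal way to serve it with Uvicorn looks like the sketch below; "main:app" assumes the FastAPI instance is named app in main.py (Gunicorn with Uvicorn workers would be the usual production alternative):

import uvicorn

if __name__ == "__main__":
    # Serve the FastAPI app defined as `app` in main.py (assumption).
    uvicorn.run("main:app", host="0.0.0.0", port=8080)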

Model intake

Possibilities:

  • Fetch the model directly from Weights & Biases; the model tagged deployment is used (see the sketch after this list).
  • This happens via GitHub Actions.
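A sketch of how the backend could fetch the model tagged deployment from W&B (entity, project and artifact names are placeholders, and deployment is assumed to be an artifact alias):

import wandb

api = wandb.Api()
artifact = api.artifact("my-entity/pokemon-diffusion/model:deployment")  # placeholder path
model_dir = artifact.download()  # local directory containing the model files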

Choose a tool/the tools for monitoring metrics in training and production

Weights & Biases:

Positive:

  • Sweeps (configured through the config files, which hold all dependencies)
  • Experiments can be well structured and are easy to follow
  • Integration into existing frameworks
  • Good and professional visualizations
  • Supports team collaboration
  • Model saving and reproducibility are possible
  • Scalability: it supports scaling from local experiments to large-scale deployments smoothly, making it suitable for both research and production environments.
  • System performance can be monitored

Negative:

  • Costs
  • Dependency and Privacy
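For reference, basic experiment tracking with W&B takes only a few lines; the project name and logged values below are placeholders:

import wandb

run = wandb.init(project="pokemon-diffusion", config={"lr": 1e-4, "batch_size": 16})
for step in range(100):
    loss = 1.0 / (step + 1)          # placeholder for the real training loss
    wandb.log({"train/loss": loss})
run.finish()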

Comet.ml

  • Comprehensive Experiment Management: Tracks code, experiments, and results
  • Integration: Supports multiple frameworks and languages, facilitating integration into existing projects without major adjustments.
  • Community and Support: Provides good documentation and has a supportive community, which can be helpful for troubleshooting and best practices.

PyTorch

HYDRA

Pros

  • Simplification of Configuration: HYDRA allows you to create hierarchical configurations dynamically by composing and overriding them from the command line. This simplifies managing configurations for complex applications requiring various setups for different environments or tasks.
  • Flexibility: It supports a wide range of data sources for configuration files, including YAML, and it can seamlessly merge configurations from multiple sources. This flexibility is particularly useful in research and development environments where rapid testing of configurations is standard.
  • Plugin System: HYDRA includes a robust plugin system, allowing seamless functionality extension and integration with other tools and frameworks.
  • Command Line Integration: Configurations can be easily adjusted or overridden from the command line, making it convenient to quickly change parameters without altering code or configuration files directly.

Cons

  • Learning Curve: Although HYDRA is very powerful, it has a learning curve. Users must become familiar with its configuration composition and overriding principles to use it effectively.

  • Integration Effort: While integrating HYDRA into existing projects offers many long-term benefits, the initial setup and integration might require significant effort, especially in projects not designed to use external configuration management.

  • Dependency and Complexity: Adding HYDRA introduces another dependency into the project, which can increase the complexity and potentially impact the build and deployment processes.

  • Overhead for Smaller Projects: For smaller projects or those that do not require frequent configuration changes or extensive experimentation, the overhead of implementing and maintaining a HYDRA-based configuration might not be justified.
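A minimal Hydra sketch (config path, name and fields are placeholders); values can be overridden from the command line, e.g. python train.py training.lr=1e-4:

import hydra
from omegaconf import DictConfig

@hydra.main(config_path="conf", config_name="config", version_base=None)
def main(cfg: DictConfig) -> None:
    # Access hierarchical config values composed from conf/config.yaml.
    print(cfg.training.lr, cfg.training.batch_size)

if __name__ == "__main__":
    main()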

  • PyTorch: the framework supports techniques used in continual learning, such as experience replay, elastic weight consolidation (EWC), and other regularization methods

  • AdverTorch

Alibi Detect

Pros

  • Versatility: Alibi Detect provides a wide range of detection algorithms suitable for various data types, including tabular data, text, images, and time series.
  • Framework Compatibility: It supports PyTorch backends for drift detection, making it flexible for use in different machine learning stacks.
  • Online and Offline Detection: The library supports both online and offline detection methods, which means it can handle streaming and batch data.
  • Good documentation
  • Ease of Integration: It can be easily integrated into existing data processing pipelines and machine learning workflows.

Cons

  • Complexity: The wide range of features and algorithms might be overwhelming for beginners or practitioners new to drift, outlier, and adversarial detection.
  • Scalability: Depending on the models' complexity and the data's size, scaling the detectors for high-throughput systems might require additional engineering.
  • Resource Intensive: Some detection algorithms, especially for high-dimensional data such as images and time series, can be resource-intensive, requiring significant computational power.
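A small sketch of offline drift detection with Alibi Detect's Kolmogorov-Smirnov detector (the reference and test arrays are random placeholders):

import numpy as np
from alibi_detect.cd import KSDrift

x_ref = np.random.randn(500, 32)   # reference data, e.g. training features
x_test = np.random.randn(100, 32)  # new data to check for drift

detector = KSDrift(x_ref, p_val=0.05)
result = detector.predict(x_test)
print(result["data"]["is_drift"])  # 1 if drift was detected, 0 otherwise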

Create W&B Model Registry

The model registry will allow us to fetch previously trained models and compare them.
We can also use it to provide the model service with our currently best model.

We could also use only GCP or another provider, but since we are already using W&B for the metrics, it makes a lot of sense to also store the models there.
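A sketch of how a trained model could be logged as a versioned W&B artifact so the serving backend can later fetch it (project name, checkpoint path and aliases are placeholders):

import wandb

run = wandb.init(project="pokemon-diffusion", job_type="training")
artifact = wandb.Artifact("model", type="model")
artifact.add_dir("checkpoints/")   # directory holding the trained weights
run.log_artifact(artifact, aliases=["latest", "deployment"])
run.finish()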

Create Automated Training Pipeline

  • Run GitHub Action when the tag run_training is pushed
  • We are running out of space with the PyTorch image, so let's switch to an Nvidia / CUDA based one
  • Still running out of space, so perhaps switch back to the normal docker build command?
  • Implement Google Cloud Build pipeline
  • Create secrets for W&B and Cloud Bucket keys
  • Start Lightning AI studio with GPU and run training
  • Track metrics to W&B project #29
  • #12

Automatically update changes to the CI pipeline

Currently the CI pipeline needs to be updated manually with a docker build command.

docker build -t europe-west9-docker.pkg.dev/algorithmic-quartet/training-pipelines/lightning-executor:latest .
docker push europe-west9-docker.pkg.dev/algorithmic-quartet/training-pipelines/lightning-executor:latest

We could also automate this by building the container whenever something in the CI folder has changed.

Reduce Docker Image build time

The Docker image takes a long time to build (>15 minutes). A large portion of this comes down to the Python libraries, especially the PyTorch dependencies!

We need to find some way to reduce this to a reasonable duration.

  • Depends on / Related to #26
  • Create a base image which is only updated if the requirements or Dockerfile have changed (see this Stack Overflow post).
    Then create a training image on top of this base image, which is rebuilt every time.
  • Find a way to cache the installs between runs.
  • Look into poetry
