
llmaz-operator's Introduction

llmaz

stability-alpha GoReport Widget Latest Release

llmaz (pronounced /lima:z/) aims to provide a production-ready inference platform for large language models on Kubernetes. It closely integrates with state-of-the-art inference backends to bring leading-edge research to the cloud.

🌱 llmaz is alpha now, so API may change before graduating to Beta.

Concept

(Concept architecture diagram)

Feature Overview

  • User Friendly: people can quickly deploy an LLM service with minimal configuration.
  • High Performance: llmaz supports a wide range of advanced inference backends for high performance, such as vLLM, SGLang, and llama.cpp. Find the full list of supported backends here.
  • Scaling Efficiency (WIP): llmaz works smoothly with autoscaling components like Cluster Autoscaler or Karpenter to support elastic scenarios.
  • Accelerator Fungibility (WIP): llmaz supports serving the same LLM with various accelerators to optimize cost and performance.
  • SOTA Inference (WIP): llmaz supports the latest cutting-edge research, such as Speculative Decoding and Splitwise, on Kubernetes.
  • Various Model Providers: llmaz automatically loads models from various providers, such as HuggingFace, ModelScope, and object stores (Aliyun OSS, with more on the way).
  • Multi-Host Support: llmaz supports both single-host and multi-host scenarios with LWS from day one.

Quick Start

Installation

Read the Installation for guidance.

Deploy

Here's the simplest example of deploying facebook/opt-125m; all you need to do is apply a Model and a Playground.

Please refer to examples to learn more.

Note: if your model needs a Hugging Face token to download the weights, run kubectl create secret generic modelhub-secret --from-literal=HF_TOKEN=<your token> beforehand.
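For declarative setups, the same Secret can also be written as a manifest. This is a minimal sketch equivalent to the kubectl command above; replace the placeholder with your own token.

apiVersion: v1
kind: Secret
metadata:
  name: modelhub-secret
type: Opaque
stringData:
  HF_TOKEN: <your token>  # plain-text value; Kubernetes stores it base64-encoded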

Model

apiVersion: llmaz.io/v1alpha1
kind: OpenModel
metadata:
  name: opt-125m
spec:
  familyName: opt
  source:
    modelHub:
      modelID: facebook/opt-125m
  inferenceFlavors:
  - name: t4 # GPU type
    requests:
      nvidia.com/gpu: 1

Inference Playground

apiVersion: inference.llmaz.io/v1alpha1
kind: Playground
metadata:
  name: opt-125m
spec:
  replicas: 1
  modelClaim:
    modelName: opt-125m

Test

Expose the service

kubectl port-forward pod/opt-125m-0 8080:8080

Get registered models

curl http://localhost:8080/v1/models

Request a query

curl http://localhost:8080/v1/completions \
-H "Content-Type: application/json" \
-d '{
    "model": "opt-125m",
    "prompt": "San Francisco is a",
    "max_tokens": 10,
    "temperature": 0
}'

Roadmap

  • Gateway support for traffic routing
  • Serverless support for cloud-agnostic users
  • CLI tool support
  • Model training, fine tuning in the long-term

Contributions

🚀 All kinds of contributions are welcome! Please follow Contributing. Thanks to all these contributors.

llmaz-operator's People

Contributors

kerthcet, inftyai-agent, dependabot[bot], vicoooo26

Stargazers

杨朱 · Kiki, Adeel Ahmad, dublc, Alex Wang, CYJiang, Paco Xu, 草镯子, Odysseus Zhang, Peter Pan

llmaz-operator's Issues

Support speculative decoding

What would you like to be added:

Speculative Decoding helps accelerate the prediction of large language models and is supported by vLLM by default.

Why is this needed:

Improve the inference throughput.

Completion requirements:

This enhancement requires the following artifacts:

  • Design doc
  • API change
  • Docs update

The artifacts should be linked in subsequent comments.

Benchmark toolkit support

What would you like to be added:

It would be great to support benchmarking LLM throughput and latency across different backends.

Why is this needed:

Provide performance evidence for users.

Completion requirements:

This enhancement requires the following artifacts:

  • Design doc
  • API change
  • Docs update

The artifacts should be linked in subsequent comments.

Add more e2e tests

Why is this needed:

Backend and datasource support heavily depends on e2e tests to verify that everything works as expected; however, we lack GPU machines.

Failover policy for various backends

What would you like to be added:

Different backends support a wide range of popular models, but not all models are supported, so we should have a failover policy for them; at a minimum, this policy should be supported in Playground.

Why is this needed:

Completion requirements:

This enhancement requires the following artifacts:

  • Design doc
  • API change
  • Docs update

The artifacts should be linked in subsequent comments.

Support multi-host inference

We use LWS as the underlying workload to support multi-host inference; however, we only support one Pod per model right now. The general idea is that once a model flavor requests something like nvidia.com/gpu: 32, we'll split it into 4 hosts, each requesting 8 GPUs (see the sketch below).
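A sketch of what such a flavor could look like, reusing the inferenceFlavors format from the Quick Start Model example; the flavor name is hypothetical, and the 4 x 8 split would be decided by the controller rather than declared by the user.

  inferenceFlavors:
  - name: a100             # hypothetical flavor name
    requests:
      nvidia.com/gpu: 32   # total GPUs; the controller would split this into 4 LWS workers with 8 GPUs each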

Support autoscaling

As serve.Spec describes, we have minReplicas and maxReplicas; what we hope to do is adjust the replica count based on traffic, a.k.a. serverless. We could use Ray or KEDA/Knative as alternatives, but here we hope to have a simple implementation so that we don't need to depend on other libraries.

Hope we can do that.
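A rough sketch of how those bounds might be expressed, assuming an inference Service whose spec mirrors the Playground layout; apart from minReplicas and maxReplicas, which this issue names, the kind and field placement below are assumptions for illustration only.

apiVersion: inference.llmaz.io/v1alpha1
kind: Service
metadata:
  name: opt-125m
spec:
  modelClaim:
    modelName: opt-125m
  minReplicas: 1   # floor kept running even with no traffic
  maxReplicas: 4   # ceiling the autoscaler may scale up to under load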

Model version management

What would you like to be added:

Right now, we only have one model version for a common deployment; however, if we take a higher-level view of the model lifecycle, version management is necessary.

Why is this needed:

Support the full lifecycle of models.

Completion requirements:

This enhancement requires the following artifacts:

  • Design doc
  • API change
  • Docs update

The artifacts should be linked in subsequent comments.

Concurrently download the main container image when downloading weights

This can help optimize the Pod's startup time; however, it is usually limited by bandwidth, which means it will slow down the weight downloading. If the image has already been downloaded, there's no difference since it's cached.

But if your registry is deployed on the intranet, it will still benefit your startup time.

CI support for tests

We should have a testing baseline for the project, generally three kinds of tests (a CI workflow sketch follows this list):

  • unit tests -> make test
  • integration tests -> make test-integration
  • e2e tests -> make test-e2e
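A minimal GitHub Actions sketch wiring up the first two targets, assuming the Makefile targets exist as named above; the Go version is a placeholder, and e2e tests are left out since they depend on GPU machines (see the e2e issue above).

name: ci
on: [push, pull_request]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-go@v5
        with:
          go-version: '1.22'   # placeholder; match the project's go.mod
      - name: Unit tests
        run: make test
      - name: Integration tests
        run: make test-integration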

Failed to create Pods once the name contains a dot

What happened:

Once the Playground name contains a dot, for example qwen2-0.5b, Pod creation fails. We should add validation to the name field.

What you expected to happen:

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?:

Environment:

  • Kubernetes version (use kubectl version):
  • LWS version:
  • llmaz version (use git describe --tags --dirty --always):
  • Cloud provider or hardware configuration:
  • OS (e.g: cat /etc/os-release):
  • Kernel (e.g. uname -a):
  • Install tools:
  • Others:

Support TGI as another alternative backend

What would you like to be added:

TGI is also a popular inference backend that we should support. More importantly, we should set this up as an example for people who would like to add support for another backend in the future.

Why is this needed:

Adopt more backends and make this a reference.

Completion requirements:

This enhancement requires the following artifacts:

  • Design doc
  • API change
  • Docs update

The artifacts should be linked in subsequent comments.

ollama support

What would you like to be added:

ollama provides an SDK for integrations, so we can easily integrate with it. One benefit I can think of is that ollama maintains a bunch of quantized models that we can leverage.

Why is this needed:

Ecosystem integration.

Completion requirements:

This enhancement requires the following artifacts:

  • Design doc
  • API change
  • Docs update

The artifacts should be linked in subsequent comments.

Support different GPU accelerators for fungibility

What would you like to be added:

Models can be loaded with different accelerators; for example, llama2-70b can be loaded with 2 A100 80GB or 4 A100 40GB GPUs, and we should support this (see the sketch below). High-end GPUs are frequently out of stock, so this can also help improve the SLO of services.
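A sketch of how two interchangeable flavors might be declared on a Model, following the inferenceFlavors format from the Quick Start example; the model name, model ID, and flavor names are illustrative only.

apiVersion: llmaz.io/v1alpha1
kind: OpenModel
metadata:
  name: llama2-70b
spec:
  familyName: llama2
  source:
    modelHub:
      modelID: meta-llama/Llama-2-70b-hf
  inferenceFlavors:
  - name: a100-80gb        # preferred flavor: 2 x A100 80GB
    requests:
      nvidia.com/gpu: 2
  - name: a100-40gb        # fallback flavor: 4 x A100 40GB when 80GB cards are out of stock
    requests:
      nvidia.com/gpu: 4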

Why is this needed:

Cost saving and SLO consideration.

Completion requirements:

This enhancement requires the following artifacts:

  • Design doc
  • API change
  • Docs update

The artifacts should be linked in subsequent comments.

Support OpenAPI

Right now, we support inference engines like vLLM for inference; what if people want to call OpenAI-style APIs like ChatGPT instead? It should be easy to integrate.

Integrate with Kueue for fungibility capacity

What would you like to be added:

Kueue is a great project focused on job queueing and resource management, and it can also support inference services by managing Pods. It's efficient because it has an overview of the whole cluster and knows whether particular GPU kinds are insufficient, compared to runtime failover.

What's more, if Kueue is already part of your stack, that would be really great!

Why is this needed:

Fungibility capacity.

Completion requirements:

This enhancement requires the following artifacts:

  • Design doc
  • API change
  • Docs update

The artifacts should be linked in subsequent comments.

Failed to pass through the labels to the lws Pods

Right now, when I add a label to the Model (the most important one being llmaz.io/model-family-name), I expect it to be passed through to the Pods that use this model; however, LWS can't do that right now.

Model aware scheduling

What would you like to be added:

Right now, model management is a tricky problem in the cluster. Models are big, so we need to cache them on the node just like images; however, the kubelet only manages the image lifecycle, not arbitrary files, so that's a problem that won't be tackled in the near future. Maybe we need to manage the models manually and make the scheduler aware of them so it can make Pod placement decisions.

Why is this needed:

Efficient pod scheduling with models

Completion requirements:

This enhancement requires the following artifacts:

  • Design doc
  • API change
  • Docs update

The artifacts should be linked in subsequent comments.

Model should be namespaced

Right now the Model is a cluster-scoped object; should it be namespaced? Maybe we should add a namespacedModel, or a better choice would be to make the Model object visible only to selected namespaces, though that does not follow the design conventions of Kubernetes.

Downsize the model-loader image

What would you like to be added:

Currently, the model-runner image is about 56MB; however, the model-loader image is about 466MB. We should try to reduce its size.

Why is this needed:

Fast startup.

Completion requirements:

This enhancement requires the following artifacts:

  • Design doc
  • API change
  • Docs update

The artifacts should be linked in subsequent comments.

Support Secret in Playground

We support env vars right now in Playground; however, some parameters, like the huggingface_token, should be protected and not shown directly in the Pod YAML. We should support Secrets and plumb them into the Pod spec (a sketch follows).
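A minimal sketch of how such a value could reach the serving container via the standard Kubernetes secretKeyRef mechanism, reusing the modelhub-secret from the Quick Start; how exactly this is surfaced in the Playground API is the open design question of this issue.

    env:
    - name: HF_TOKEN
      valueFrom:
        secretKeyRef:
          name: modelhub-secret   # Secret created beforehand, e.g. via kubectl create secret
          key: HF_TOKEN           # only the reference appears in the Pod YAML; the value stays in the Secret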

Support llama.cpp as alternative backend

What would you like to be added:

llama.cpp supports running inference on CPUs, which is useful for users who have no GPU accelerators; in fact, it is also helpful to llmaz, since we have no GPU servers right now.

What's more, llama.cpp also supports multi-host inference, see https://github.com/ggerganov/llama.cpp/tree/master/examples/rpc

Completion requirements:

This enhancement requires the following artifacts:

  • Design doc
  • API change
  • Docs update

The artifacts should be linked in subsequent comments.

Playground will not reconcile once model created

What happened:

Let's imagine there's no Model in the cluster and I create a Playground; because no Model exists, inference Pods will not be created. However, if I then create the corresponding Model, the Playground should be triggered to create Services and then Pods. Right now, this is not the case.

The solution is quite simple: make the Playground controller watch for Model creation events.

What you expected to happen:

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?:

Environment:

  • Kubernetes version (use kubectl version):
  • LWS version:
  • llmaz version (use git describe --tags --dirty --always):
  • Cloud provider or hardware configuration:
  • OS (e.g: cat /etc/os-release):
  • Kernel (e.g. uname -a):
  • Install tools:
  • Others:

Add an example for multi-host inference with Service

What would you like to be added:

Right now we don't support multi-host in Playground, but it is supported in Service, so we should provide an example.

Why is this needed:

Completion requirements:

This enhancement requires the following artifacts:

  • Design doc
  • API change
  • Docs update

The artifacts should be linked in subsequent comments.

Support reconciling `serve.Replicas`

The general idea is quite similar to a Deployment: we'll set the number of Pods equal to the replicas.
What we need here:

  • Reconcile logics
  • Test framework
  • Examples

We can use nginx:1.14.2 as the default image at first.

Always download the model weights when pod starts

What happened:

Because of InftyAI/omnistore#12, we always download the model weights even if they are already cached on the host machine.

What you expected to happen:

Once the model weights have been downloaded, we should not reload them again.

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?:

Environment:

  • Kubernetes version (use kubectl version):
  • LWS version:
  • llmaz version (use git describe --tags --dirty --always):
  • Cloud provider or hardware configuration:
  • OS (e.g: cat /etc/os-release):
  • Kernel (e.g. uname -a):
  • Install tools:
  • Others:

Prompt management

What would you like to be added:

For inference scenarios, prompt management is an important part of the workflow.

Why is this needed:

Easy to use for inference users.

Completion requirements:

This enhancement requires the following artifacts:

  • Design doc
  • API change
  • Docs update

The artifacts should be linked in subsequent comments.

Mount /dev/shm for shared memory files

What would you like to be added:

This is how it should look:

    volumeMounts:
    - mountPath: /dev/shm
      name: dshm

However, the required memory size is unknown (see the fuller sketch below).
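For completeness, the mount above needs a matching memory-backed volume. A minimal sketch, assuming an emptyDir with medium Memory; the sizeLimit is a placeholder precisely because the right size is the open question here.

    volumes:
    - name: dshm
      emptyDir:
        medium: Memory    # backs /dev/shm with RAM (tmpfs)
        sizeLimit: 1Gi    # placeholder; the appropriate size is still undecided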

Why is this needed:

Accelerate model loading and model inference.

Completion requirements:

This enhancement requires the following artifacts:

  • Design doc
  • API change
  • Docs update

The artifacts should be linked in subsequent comments.

Add more testcases for webhooks

We have several integration tests for webhooks; however, they're very simple ones. We need more, such as tests covering the update cases.

Support ObjectStore as another datasource

What would you like to be added:

We may have S3, GCS, or OSS as model storage; we should load models from them at runtime (a sketch follows).
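A rough sketch of what an object-store source might look like on a Model, mirroring the modelHub layout from the Quick Start; the objectStore field name and the URI below are assumptions, not an existing API.

apiVersion: llmaz.io/v1alpha1
kind: OpenModel
metadata:
  name: opt-125m
spec:
  familyName: opt
  source:
    objectStore:                           # hypothetical field; designing the real API shape is part of this issue
      uri: oss://my-bucket/models/opt-125m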

Why is this needed:

Support more model sources.

Completion requirements:

This enhancement requires the following artifacts:

  • Design doc
  • API change
  • Docs update

The artifacts should be linked in subsequent comments.

Remove `core` folder

This is how the code tree looks right now:

├── api
│   ├── core
│   │   └── v1alpha1
│   │       ├── groupversion_info.go
│   │       ├── model_types.go
│   │       ├── types.go
│   │       └── zz_generated.deepcopy.go
│   └── inference
│       └── v1alpha1
│           ├── config_types.go
│           ├── groupversion_info.go
│           ├── playground_types.go
│           ├── service_types.go
│           ├── types.go
│           └── zz_generated.deepcopy.go

However, we do not need the core folder; it exists only because of a bug in code-generator, which is fixed in kubernetes/kubernetes#125162. We should remove the folder once Kubernetes v1.31 is released.
