
llmaz-operator's Introduction

llmaz

stability-alpha GoReport Widget Latest Release

llmaz (pronounced /lima:z/) aims to provide a production-ready inference platform for large language models on Kubernetes. It closely integrates with state-of-the-art inference backends to bring leading-edge research to the cloud.

🌱 llmaz is alpha now, so API may change before graduating to Beta.

Concept

(Concept architecture diagram)

Feature Overview

  • User Friendly: people can quickly deploy an LLM service with minimal configuration.
  • High Performance: llmaz supports a wide range of advanced inference backends for high performance, such as vLLM, SGLang, and llama.cpp. Find the full list of supported backends here.
  • Scaling Efficiency (WIP): llmaz works smoothly with autoscaling components like Cluster Autoscaler or Karpenter to support elastic scenarios.
  • Accelerator Fungibility (WIP): llmaz supports serving the same LLM with various accelerators to optimize cost and performance.
  • SOTA Inference (WIP): llmaz supports the latest cutting-edge research, such as Speculative Decoding and Splitwise, on Kubernetes.
  • Various Model Providers: llmaz automatically loads models from various providers, such as HuggingFace, ModelScope, and object stores (Aliyun OSS, with more on the way).
  • Multi-Host Support: llmaz supports both single-host and multi-host scenarios with LWS from day one.

Quick Start

Installation

Read the Installation for guidance.

Deploy

Here's the simplest example of deploying facebook/opt-125m; all you need to do is apply a Model and a Playground.

Please refer to examples to learn more.

Note: if your model needs a Hugging Face token to download the weights, run kubectl create secret generic modelhub-secret --from-literal=HF_TOKEN=<your token> beforehand.
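For declarative setups, the same Secret can also be written as a manifest. This is a minimal sketch equivalent to the kubectl command above; replace the placeholder with your own token.

apiVersion: v1
kind: Secret
metadata:
  name: modelhub-secret
type: Opaque
stringData:
  HF_TOKEN: <your token>  # plain-text value; Kubernetes stores it base64-encoded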

Model

apiVersion: llmaz.io/v1alpha1
kind: OpenModel
metadata:
  name: opt-125m
spec:
  familyName: opt
  source:
    modelHub:
      modelID: facebook/opt-125m
  inferenceFlavors:
  - name: t4 # GPU type
    requests:
      nvidia.com/gpu: 1

Inference Playground

apiVersion: inference.llmaz.io/v1alpha1
kind: Playground
metadata:
  name: opt-125m
spec:
  replicas: 1
  modelClaim:
    modelName: opt-125m

Test

Expose the service

kubectl port-forward pod/opt-125m-0 8080:8080

Get registered models

curl http://localhost:8080/v1/models

Request a query

curl http://localhost:8080/v1/completions \
-H "Content-Type: application/json" \
-d '{
    "model": "opt-125m",
    "prompt": "San Francisco is a",
    "max_tokens": 10,
    "temperature": 0
}'

Roadmap

  • Gateway support for traffic routing
  • Serverless support for cloud-agnostic users
  • CLI tool support
  • Model training, fine tuning in the long-term

Contributions

🚀 All kinds of contributions are welcome! Please follow Contributing. Thanks to all these contributors.

llmaz-operator's People

Contributors

kerthcet, inftyai-agent, dependabot[bot], vicoooo26

Stargazers

杨朱 · Kiki, Adeel Ahmad, dublc, Alex Wang, CYJiang, Paco Xu, 草镯子, Odysseus Zhang, Peter Pan

llmaz-operator's Issues

Support speculative decoding

What would you like to be added:

Speculative Decoding helps accelerate the prediction of large language models and is supported by vLLM by default.

Why is this needed:

Improve the inference throughput.

Completion requirements:

This enhancement requires the following artifacts:

  • Design doc
  • API change
  • Docs update

The artifacts should be linked in subsequent comments.

Benchmark toolkit support

What would you like to be added:

It would be great to support benchmarking LLM throughput and latency across different backends.

Why is this needed:

Provide performance evidence for users.

Completion requirements:

This enhancement requires the following artifacts:

  • Design doc
  • API change
  • Docs update

The artifacts should be linked in subsequent comments.

Add more e2e tests

Why is this needed:

Backend and datasource support heavily depends on e2e tests to verify that everything works as expected; however, we lack GPU machines.

Failover policy for various backends

What would you like to be added:

Different backends support a wide range of popular models, but not all models are supported, so we should have a failover policy for them; at a minimum, this policy should be supported in Playground.

Why is this needed:

Completion requirements:

This enhancement requires the following artifacts:

  • Design doc
  • API change
  • Docs update

The artifacts should be linked in subsequent comments.

Support multi-host inference

We use LWS as the underlying workload to support multi-host inference; however, we only support one Pod per model right now. The general idea is that once a model flavor requests something like nvidia.com/gpu: 32, we'll split it into 4 hosts, each requesting 8 GPUs (see the sketch below).
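A sketch of what such a flavor could look like, reusing the inferenceFlavors format from the Quick Start Model example; the flavor name is hypothetical, and the 4 x 8 split would be decided by the controller rather than declared by the user.

  inferenceFlavors:
  - name: a100             # hypothetical flavor name
    requests:
      nvidia.com/gpu: 32   # total GPUs; the controller would split this into 4 LWS workers with 8 GPUs each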

Support autoscaling

As serve.Spec describes, we have minReplicas and maxReplicas; what we hope to do is adjust the replica count based on traffic, a.k.a. serverless. We could use Ray or KEDA/Knative as alternatives, but here we hope to have a simple implementation so that we don't need to depend on other libraries.

Hope we can do that.
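A rough sketch of how those bounds might be expressed, assuming an inference Service whose spec mirrors the Playground layout; apart from minReplicas and maxReplicas, which this issue names, the kind and field placement below are assumptions for illustration only.

apiVersion: inference.llmaz.io/v1alpha1
kind: Service
metadata:
  name: opt-125m
spec:
  modelClaim:
    modelName: opt-125m
  minReplicas: 1   # floor kept running even with no traffic
  maxReplicas: 4   # ceiling the autoscaler may scale up to under load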

Model version management

What would you like to be added:

Right now, we only have one model version for a common deployment; however, if we take a higher-level view of the model lifecycle, version management is necessary.

Why is this needed:

Support the full lifecycle of models.

Completion requirements:

This enhancement requires the following artifacts:

  • Design doc
  • API change
  • Docs update

The artifacts should be linked in subsequent comments.

Concurrently download the main container image when downloading weights

This can help optimize the Pod's startup time; however, it is usually limited by bandwidth, which means it will slow down the weight downloading. If the image has already been downloaded, there's no difference since it's cached.

But if your registry is deployed on the intranet, it will still benefit your startup time.

CI support for tests

We should have a testing baseline for the project, generally three kinds of tests (a CI workflow sketch follows this list):

  • unit tests -> make test
  • integration tests -> make test-integration
  • e2e tests -> make test-e2e
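A minimal GitHub Actions sketch wiring up the first two targets, assuming the Makefile targets exist as named above; the Go version is a placeholder, and e2e tests are left out since they depend on GPU machines (see the e2e issue above).

name: ci
on: [push, pull_request]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-go@v5
        with:
          go-version: '1.22'   # placeholder; match the project's go.mod
      - name: Unit tests
        run: make test
      - name: Integration tests
        run: make test-integration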

Failed to create Pods once the name contains a dot

What happened:

Once the Playground name contains a dot, for example qwen2-0.5b, Pod creation fails. We should add validation to the name field.

What you expected to happen:

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?:

Environment:

  • Kubernetes version (use kubectl version):
  • LWS version:
  • llmaz version (use git describe --tags --dirty --always):
  • Cloud provider or hardware configuration:
  • OS (e.g: cat /etc/os-release):
  • Kernel (e.g. uname -a):
  • Install tools:
  • Others:

Support TGI as another alternative backend

What would you like to be added:

TGI is also a popular inference backend that we should support. More importantly, we should set this up as an example for people who would like to add support for another backend in the future.

Why is this needed:

Adopt more backends and make this a reference.

Completion requirements:

This enhancement requires the following artifacts:

  • Design doc
  • API change
  • Docs update

The artifacts should be linked in subsequent comments.

ollama support

What would you like to be added:

ollama provides an SDK for integrations, so we can easily integrate with it. One benefit I can think of is that ollama maintains a bunch of quantized models that we can leverage.

Why is this needed:

Ecosystem integration.

Completion requirements:

This enhancement requires the following artifacts:

  • Design doc
  • API change
  • Docs update

The artifacts should be linked in subsequent comments.

Support different GPU accelerators for fungibility

What would you like to be added:

Models can be loaded with different accelerators; for example, llama2-70b can be loaded with 2 A100 80GB or 4 A100 40GB GPUs, and we should support this (see the sketch below). High-end GPUs are frequently out of stock, so this can also help improve the SLO of services.
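A sketch of how two interchangeable flavors might be declared on a Model, following the inferenceFlavors format from the Quick Start example; the model name, model ID, and flavor names are illustrative only.

apiVersion: llmaz.io/v1alpha1
kind: OpenModel
metadata:
  name: llama2-70b
spec:
  familyName: llama2
  source:
    modelHub:
      modelID: meta-llama/Llama-2-70b-hf
  inferenceFlavors:
  - name: a100-80gb        # preferred flavor: 2 x A100 80GB
    requests:
      nvidia.com/gpu: 2
  - name: a100-40gb        # fallback flavor: 4 x A100 40GB when 80GB cards are out of stock
    requests:
      nvidia.com/gpu: 4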

Why is this needed:

Cost saving and SLO consideration.

Completion requirements:

This enhancement requires the following artifacts:

  • Design doc
  • API change
  • Docs update

The artifacts should be linked in subsequent comments.

Support OpenAPI

Right now, we support inference engines like vLLM for inference; what if people want to call OpenAI-style APIs like ChatGPT instead? It should be easy to integrate.

Integrate with Kueue for fungibility capacity

What would you like to be added:

Kueue is a great project focused on job queueing and resource management, and it can also support inference services by managing Pods. It's efficient because it has an overview of the whole cluster and knows whether particular GPU kinds are insufficient, compared to runtime failover.

What's more, if Kueue is already part of your stack, that would be really great!

Why is this needed:

Fungibility capacity.

Completion requirements:

This enhancement requires the following artifacts:

  • Design doc
  • API change
  • Docs update

The artifacts should be linked in subsequent comments.

Failed to pass through the labels to the lws Pods

Right now, when I add a label to the Model (the most important one being llmaz.io/model-family-name), I expect it to be passed through to the Pods that use this model; however, LWS can't do that right now.

Model aware scheduling

What would you like to be added:

Right now, model management is a tricky problem in the cluster. Models are big, so we need to cache them on the node just like images; however, the kubelet only manages the image lifecycle, not arbitrary files, so that's a problem that won't be tackled in the near future. Maybe we need to manage the models manually and make the scheduler aware of them so it can make Pod placement decisions.

Why is this needed:

Efficient pod scheduling with models

Completion requirements:

This enhancement requires the following artifacts:

  • Design doc
  • API change
  • Docs update

The artifacts should be linked in subsequent comments.

Model should be namespaced

Right now the Model is a cluster-scoped object; should it be namespaced? Maybe we should add a namespacedModel, or a better choice would be to make the Model object visible only to selected namespaces, though that does not follow the design conventions of Kubernetes.

Downsize the model-loader image

What would you like to be added:

Currently, the model-runner image is about 56MB; however, the model-loader image is about 466MB. We should try to reduce its size.

Why is this needed:

Fast startup.

Completion requirements:

This enhancement requires the following artifacts:

  • Design doc
  • API change
  • Docs update

The artifacts should be linked in subsequent comments.

Support Secret in Playground

We support env vars right now in Playground; however, some parameters, like the huggingface_token, should be protected and not shown directly in the Pod YAML. We should support Secrets and plumb them into the Pod spec (a sketch follows).
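A minimal sketch of how such a value could reach the serving container via the standard Kubernetes secretKeyRef mechanism, reusing the modelhub-secret from the Quick Start; how exactly this is surfaced in the Playground API is the open design question of this issue.

    env:
    - name: HF_TOKEN
      valueFrom:
        secretKeyRef:
          name: modelhub-secret   # Secret created beforehand, e.g. via kubectl create secret
          key: HF_TOKEN           # only the reference appears in the Pod YAML; the value stays in the Secret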

Support llama.cpp as alternative backend

What would you like to be added:

llama.cpp supports running inference on CPUs, which is useful for users who have no GPU accelerators; in fact, it is also helpful to llmaz, since we have no GPU servers right now.

What's more, llama.cpp also supports multi-host inference, see https://github.com/ggerganov/llama.cpp/tree/master/examples/rpc

Completion requirements:

This enhancement requires the following artifacts:

  • Design doc
  • API change
  • Docs update

The artifacts should be linked in subsequent comments.

Playground will not reconcile once model created

What happened:

Let's imagine there's no Model in the cluster and I create a Playground; because no Model exists, inference Pods will not be created. However, if I then create the corresponding Model, the Playground should be triggered to create Services and then Pods. Right now, this is not the case.

The solution is quite simple: make the Playground controller watch for Model creation events.

What you expected to happen:

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?:

Environment:

  • Kubernetes version (use kubectl version):
  • LWS version:
  • llmaz version (use git describe --tags --dirty --always):
  • Cloud provider or hardware configuration:
  • OS (e.g: cat /etc/os-release):
  • Kernel (e.g. uname -a):
  • Install tools:
  • Others:

Add an example for multi-host inference with Service

What would you like to be added:

Right now we don't support multi-host in Playground, but it is supported in Service, so we should provide an example.

Why is this needed:

Completion requirements:

This enhancement requires the following artifacts:

  • Design doc
  • API change
  • Docs update

The artifacts should be linked in subsequent comments.

Support reconciling `serve.Replicas`

The general idea is quite similar to a Deployment: we'll set the number of Pods equal to the replicas.
What we need here:

  • Reconcile logics
  • Test framework
  • Examples

We can use nginx:1.14.2 as the default image at first.

Always download the model weights when pod starts

What happened:

Because of InftyAI/omnistore#12, we always download the model weights even if they are already cached on the host machine.

What you expected to happen:

Once the model weights have been downloaded, we should not reload them again.

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?:

Environment:

  • Kubernetes version (use kubectl version):
  • LWS version:
  • llmaz version (use git describe --tags --dirty --always):
  • Cloud provider or hardware configuration:
  • OS (e.g: cat /etc/os-release):
  • Kernel (e.g. uname -a):
  • Install tools:
  • Others:

Prompt management

What would you like to be added:

For inference scenarios, prompt management is an important part of the workflow.

Why is this needed:

Easy to use for inference users.

Completion requirements:

This enhancement requires the following artifacts:

  • Design doc
  • API change
  • Docs update

The artifacts should be linked in subsequent comments.

Mount /dev/shm for shared memory files

What would you like to be added:

This is how it should look:

    volumeMounts:
    - mountPath: /dev/shm
      name: dshm

However, the required memory size is unknown (see the fuller sketch below).
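For completeness, the mount above needs a matching memory-backed volume. A minimal sketch, assuming an emptyDir with medium Memory; the sizeLimit is a placeholder precisely because the right size is the open question here.

    volumes:
    - name: dshm
      emptyDir:
        medium: Memory    # backs /dev/shm with RAM (tmpfs)
        sizeLimit: 1Gi    # placeholder; the appropriate size is still undecided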

Why is this needed:

Accelerate model loading and model inference.

Completion requirements:

This enhancement requires the following artifacts:

  • Design doc
  • API change
  • Docs update

The artifacts should be linked in subsequent comments.

Add more testcases for webhooks

We have several integration tests for webhooks; however, they're very simple ones. We need more, such as tests covering the update cases.

Support ObjectStore as another datasource

What would you like to be added:

We may have S3, GCS, or OSS as model storage; we should load models from them at runtime (a sketch follows).
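A rough sketch of what an object-store source might look like on a Model, mirroring the modelHub layout from the Quick Start; the objectStore field name and the URI below are assumptions, not an existing API.

apiVersion: llmaz.io/v1alpha1
kind: OpenModel
metadata:
  name: opt-125m
spec:
  familyName: opt
  source:
    objectStore:                           # hypothetical field; designing the real API shape is part of this issue
      uri: oss://my-bucket/models/opt-125m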

Why is this needed:

Support more model sources.

Completion requirements:

This enhancement requires the following artifacts:

  • Design doc
  • API change
  • Docs update

The artifacts should be linked in subsequent comments.

Remove `core` folder

This is how the code tree looks right now:

├── api
│   ├── core
│   │   └── v1alpha1
│   │       ├── groupversion_info.go
│   │       ├── model_types.go
│   │       ├── types.go
│   │       └── zz_generated.deepcopy.go
│   └── inference
│       └── v1alpha1
│           ├── config_types.go
│           ├── groupversion_info.go
│           ├── playground_types.go
│           ├── service_types.go
│           ├── types.go
│           └── zz_generated.deepcopy.go

However, we do not need the core folder; it exists only because of a bug in code-generator, which is fixed in kubernetes/kubernetes#125162. We should remove the folder once Kubernetes v1.31 is released.
