triton-vllm-inference-server

Container project for NVIDIA Triton using vLLM backend

Downloading Models

Download the various models to a location on disk. On a local machine, this directory will be attached to the container using podman arguments, while on an OpenShift cluster it will be a Persistent Volume. For example, here's how to download the Llama-2-7b model:

$ mkdir ~/Downloads/model_registry && cd ~/Downloads/model_registry

$ git clone https://${HUGGING_FACE_HUB_USER}:${HUGGING_FACE_HUB_TOKEN}@huggingface.co/meta-llama/Llama-2-7b-hf 

Note: vLLM requires models in the Hugging Face Transformers format, which is why we pick Llama-2-7b-hf over Llama-2-7b; the latter is not in a format supported for serving with vLLM. The Llama models can also be downloaded from Meta by following these instructions: https://github.com/meta-llama/llama#download, however they will not be in a supported format.
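The weight files in the Hugging Face repository are stored with Git LFS, so the LFS extension needs to be installed and initialized before cloning, otherwise only pointer files are fetched. A minimal sketch (the package name and install command vary by platform):

# Install the Git LFS extension (dnf shown here; use apt/brew/etc. as appropriate)
$ sudo dnf install git-lfs

# Register the LFS filters with git so the weight files are pulled during the clone
$ git lfs install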

Next, we add some configuration files into the model_repository location on disk to identify our model(s). For reference, see the example model repository and the instructions in the vLLM backend repo, as well as the Triton Quickstart Guide.

model_repository (provide this dir as source / MODEL_REPOSITORY )
└── vllm_model
    ├── 1
    │   └── model.json
    ├── Llama-2-7b-hf
    │   └── model files...
    └── config.pbtxt
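
The layout above can be assembled from the downloaded model roughly as follows (a sketch; the paths assume the download location used earlier and should be adjusted to wherever the model was actually cloned):

# Create the Triton model directory and its version subdirectory
$ mkdir -p ~/Downloads/model_repository/vllm_model/1

# Move the cloned Hugging Face model next to the Triton configuration
$ mv ~/Downloads/model_registry/Llama-2-7b-hf ~/Downloads/model_repository/vllm_model/

# Create empty model.json and config.pbtxt, then fill them with the contents shown below
$ touch ~/Downloads/model_repository/vllm_model/1/model.json
$ touch ~/Downloads/model_repository/vllm_model/config.pbtxt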

Here are example files that work with the Llama-2-7b-hf model:

model.json

{
    "model": "/opt/app-root/mnt/model_repository/vllm_model/Llama-2-7b-hf",
    "disable_log_requests": "true",
    "gpu_memory_utilization": 1
}
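
model.json holds the vLLM engine configuration, so other vLLM engine arguments can generally be set in the same file. As an illustrative sketch (not taken from this project), a multi-GPU variant might look like the following, where tensor_parallel_size and max_model_len are standard vLLM engine arguments and the values are placeholders:

{
    "model": "/opt/app-root/mnt/model_repository/vllm_model/Llama-2-7b-hf",
    "disable_log_requests": "true",
    "gpu_memory_utilization": 0.9,
    "tensor_parallel_size": 2,
    "max_model_len": 4096
}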

config.pbtxt

backend: "vllm"

instance_group [
  {
    count: 1
    kind: KIND_MODEL
  }
]

model_transaction_policy {
  decoupled: True
}

max_batch_size: 0

# https://github.com/triton-inference-server/server/issues/6578#issuecomment-1813112797
# Note: The vLLM backend uses the following input and output names.
# Any change here needs to also be made in model.py

input [
  {
    name: "text_input"
    data_type: TYPE_STRING
    dims: [ 1 ]
  },
  {
    name: "stream"
    data_type: TYPE_BOOL
    dims: [ 1 ]
  },
  {
    name: "sampling_parameters"
    data_type: TYPE_STRING
    dims: [ 1 ]
    optional: true
  }
]

output [
  {
    name: "text_output"
    data_type: TYPE_STRING
    dims: [ -1 ]
  }
]

Note: Configuration extracted from: triton-inference-server/server#6578

Building and Running Locally using Podman

$ podman build -t "triton-vllm-inference-server" . 
$ mkdir -p $HOME/Downloads/model_repository

$ podman run --rm -p 8000:8000 --name "triton-vllm-inference-server" -v $HOME/Downloads/model_repository:/opt/app-root/model_repository --shm-size=1G --ulimit memlock=-1 --ulimit stack=67108864 --gpus all triton-vllm-inference-server
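
Once the container is up, the standard Triton health endpoints can be used to check that the server and the model are ready to serve (the host and port assume the podman invocation above):

# Server-level readiness; returns HTTP 200 when Triton is ready
$ curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8000/v2/health/ready

# Model-level readiness for the vllm_model defined above
$ curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8000/v2/models/vllm_model/ready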

Alternatively, the upstream NGC Triton vLLM image can be run directly against the model repository:

$ podman run --env HUGGING_FACE_HUB_TOKEN=$HUGGING_FACE_HUB_TOKEN -it --net=host --rm -p 8001:8001 --shm-size=1G --ulimit memlock=-1 --ulimit stack=67108864 -v ${PWD}:/work -w /work nvcr.io/nvidia/tritonserver:23.10-vllm-python-py3 tritonserver --model-store ./model_repository

Building on Quay, Running using OpenShift

Note: Under Construction

See examples at: https://github.com/codekow/s2i-patch/blob/main/s2i-triton/README.md

TODO: Add instructions for creating a Deployment, defining a PV and binding it to the pod, and creating a Service and Route, as well as a securityContext for read/write access on the PV.
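
Until those instructions are written, the following oc commands are a rough sketch of the resources involved; the image location, PVC size, and resource names are placeholders, and the GPU request and securityContext details will depend on the cluster:

# Deploy the image built on Quay (image path is a placeholder)
$ oc new-app quay.io/<your-org>/triton-vllm-inference-server --name triton-vllm-inference-server

# Request a GPU for the deployment
$ oc set resources deployment/triton-vllm-inference-server --limits=nvidia.com/gpu=1

# Create a PVC for the model repository and mount it into the pod
$ oc set volume deployment/triton-vllm-inference-server --add --name model-repository --type persistentVolumeClaim --claim-size 50Gi --mount-path /opt/app-root/model_repository

# Expose the HTTP port through an edge-terminated Route
$ oc create route edge triton-vllm-inference-server --service triton-vllm-inference-server --port 8000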

Once the pod is started, copy the local model_repository directory to the PV bound to the running pod:

$  oc rsync ~/Downloads/model_repository triton-vllm-inference-server-784d54f45f-jwr25:/opt/app-root/ --strategy=tar --progress=true
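
To confirm the files landed where Triton expects them, list the repository inside the pod (same example pod name as above):

$ oc rsh triton-vllm-inference-server-784d54f45f-jwr25 ls /opt/app-root/model_repository/vllm_model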

Testing the running instance

$ curl -X POST https://triton-vllm-inference-server-triton-vllm-inference-server.apps.cluster-4b2f6.sandbox888.opentlc.com/v2/models/vllm_model/generate -d '{"text_input": "What is Triton Inference Server?", "parameters": {"stream": false, "temperature": 0}}'
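
A streaming variant can be exercised through the companion generate_stream endpoint, which returns the output as server-sent events (same example route as above; this relies on the decoupled transaction policy configured earlier):

$ curl -X POST https://triton-vllm-inference-server-triton-vllm-inference-server.apps.cluster-4b2f6.sandbox888.opentlc.com/v2/models/vllm_model/generate_stream -d '{"text_input": "What is Triton Inference Server?", "parameters": {"stream": true, "temperature": 0}}'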

Running using Run:AI

Note: Under Construction

References

Additional Links

Known Issues

The only thing in the Containerfile that should be run as root is the apt install.

Everything else should run after switching to user 1001, especially the mkdir, so that the directory is owned by user 1001.

If the mkdir runs as user 1001, the securityContext in the Deployment can probably be skipped as well, though this has not been confirmed.
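
A sketch of what that ordering could look like in a Containerfile (illustrative only; the package list and directory path are placeholders, not this project's actual Containerfile):

# Packages must be installed as root
USER root
RUN apt-get update && apt-get install -y --no-install-recommends git && rm -rf /var/lib/apt/lists/*

# Switch to the unprivileged user before creating directories,
# so the model repository mount point is owned by UID 1001
USER 1001
RUN mkdir -p /opt/app-root/model_repository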
