triton-vllm-inference-server

Container project for NVIDIA Triton using vLLM backend

Downloading Models

Download the various models to a location on disk. On a local machine, this directory will be attached to the container using podman arguments, while on an OpenShift cluster it will be a Persistent Volume. For example, here's how to download the Llama-2-7b model:

$ mkdir ~/Downloads/model_registry && cd ~/Downloads/model_registry

$ git clone https://${HUGGING_FACE_HUB_USER}:${HUGGING_FACE_HUB_TOKEN}@huggingface.co/meta-llama/Llama-2-7b-hf 

Note: vLLM requires models in the Hugging Face Transformers format, which is why we pick Llama-2-7b-hf over Llama-2-7b; the latter is not in a format supported for serving with vLLM. The Llama models can also be downloaded from Meta by following these instructions: https://github.com/meta-llama/llama#download, however they will not be in a supported format.
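The weight files in the Hugging Face repository are stored with Git LFS, so the LFS extension needs to be installed and initialized before cloning, otherwise only pointer files are fetched. A minimal sketch (the package name and install command vary by platform):

# Install the Git LFS extension (dnf shown here; use apt/brew/etc. as appropriate)
$ sudo dnf install git-lfs

# Register the LFS filters with git so the weight files are pulled during the clone
$ git lfs install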

Next, we add some configuration files into the model_repository location on disk to identify our model(s). For reference, see the example model repository and the instructions in the vLLM backend repo, as well as the Triton Quickstart Guide.

model_repository (provide this dir as source / MODEL_REPOSITORY )
└── vllm_model
    ├── 1
    │   └── model.json
    ├── Llama-2-7b-hf
    │   └── model files...
    └── config.pbtxt
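
The layout above can be assembled from the downloaded model roughly as follows (a sketch; the paths assume the download location used earlier and should be adjusted to wherever the model was actually cloned):

# Create the Triton model directory and its version subdirectory
$ mkdir -p ~/Downloads/model_repository/vllm_model/1

# Move the cloned Hugging Face model next to the Triton configuration
$ mv ~/Downloads/model_registry/Llama-2-7b-hf ~/Downloads/model_repository/vllm_model/

# Create empty model.json and config.pbtxt, then fill them with the contents shown below
$ touch ~/Downloads/model_repository/vllm_model/1/model.json
$ touch ~/Downloads/model_repository/vllm_model/config.pbtxt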

Here are example files that work with the Llama-2-7b-hf model:

model.json

{
    "model": "/opt/app-root/mnt/model_repository/vllm_model/Llama-2-7b-hf",
    "disable_log_requests": "true",
    "gpu_memory_utilization": 1
}
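
model.json holds the vLLM engine configuration, so other vLLM engine arguments can generally be set in the same file. As an illustrative sketch (not taken from this project), a multi-GPU variant might look like the following, where tensor_parallel_size and max_model_len are standard vLLM engine arguments and the values are placeholders:

{
    "model": "/opt/app-root/mnt/model_repository/vllm_model/Llama-2-7b-hf",
    "disable_log_requests": "true",
    "gpu_memory_utilization": 0.9,
    "tensor_parallel_size": 2,
    "max_model_len": 4096
}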

config.pbtxt

backend: "vllm"

instance_group [
  {
    count: 1
    kind: KIND_MODEL
  }
]

model_transaction_policy {
  decoupled: True
}

max_batch_size: 0

# https://github.com/triton-inference-server/server/issues/6578#issuecomment-1813112797
# Note: The vLLM backend uses the following input and output names.
# Any change here needs to also be made in model.py

input [
  {
    name: "text_input"
    data_type: TYPE_STRING
    dims: [ 1 ]
  },
  {
    name: "stream"
    data_type: TYPE_BOOL
    dims: [ 1 ]
  },
  {
    name: "sampling_parameters"
    data_type: TYPE_STRING
    dims: [ 1 ]
    optional: true
  }
]

output [
  {
    name: "text_output"
    data_type: TYPE_STRING
    dims: [ -1 ]
  }
]

Note: Configuration extracted from: triton-inference-server/server#6578

Building and Running Locally using Podman

$ podman build -t "triton-vllm-inference-server" . 
$ mkdir -p $HOME/Downloads/model_repository

$ podman run --rm -p 8000:8000 --name "triton-vllm-inference-server" -v $HOME/Downloads/model_repository:/opt/app-root/model_repository --shm-size=1G --ulimit memlock=-1 --ulimit stack=67108864 --gpus all triton-vllm-inference-server
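
Once the container is up, the standard Triton health endpoints can be used to check that the server and the model are ready to serve (the host and port assume the podman invocation above):

# Server-level readiness; returns HTTP 200 when Triton is ready
$ curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8000/v2/health/ready

# Model-level readiness for the vllm_model defined above
$ curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8000/v2/models/vllm_model/ready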

Alternatively, the upstream NGC Triton vLLM image can be run directly against the model repository:

$ podman run --env HUGGING_FACE_HUB_TOKEN=$HUGGING_FACE_HUB_TOKEN -it --net=host --rm -p 8001:8001 --shm-size=1G --ulimit memlock=-1 --ulimit stack=67108864 -v ${PWD}:/work -w /work nvcr.io/nvidia/tritonserver:23.10-vllm-python-py3 tritonserver --model-store ./model_repository

Building on Quay, Running using OpenShift

Note: Under Construction

See examples at: https://github.com/codekow/s2i-patch/blob/main/s2i-triton/README.md

TODO: Add instructions for creating a Deployment, defining a PV and binding it to the pod, and creating a Service and Route, as well as a securityContext for read/write access on the PV.
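
Until those instructions are written, the following oc commands are a rough sketch of the resources involved; the image location, PVC size, and resource names are placeholders, and the GPU request and securityContext details will depend on the cluster:

# Deploy the image built on Quay (image path is a placeholder)
$ oc new-app quay.io/<your-org>/triton-vllm-inference-server --name triton-vllm-inference-server

# Request a GPU for the deployment
$ oc set resources deployment/triton-vllm-inference-server --limits=nvidia.com/gpu=1

# Create a PVC for the model repository and mount it into the pod
$ oc set volume deployment/triton-vllm-inference-server --add --name model-repository --type persistentVolumeClaim --claim-size 50Gi --mount-path /opt/app-root/model_repository

# Expose the HTTP port through an edge-terminated Route
$ oc create route edge triton-vllm-inference-server --service triton-vllm-inference-server --port 8000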

Once the pod is started, copy the local model_repository directory to the PV bound to the running pod:

$  oc rsync ~/Downloads/model_repository triton-vllm-inference-server-784d54f45f-jwr25:/opt/app-root/ --strategy=tar --progress=true
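
To confirm the files landed where Triton expects them, list the repository inside the pod (same example pod name as above):

$ oc rsh triton-vllm-inference-server-784d54f45f-jwr25 ls /opt/app-root/model_repository/vllm_model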

Testing the running instance

$ curl -X POST https://triton-vllm-inference-server-triton-vllm-inference-server.apps.cluster-4b2f6.sandbox888.opentlc.com/v2/models/vllm_model/generate -d '{"text_input": "What is Triton Inference Server?", "parameters": {"stream": false, "temperature": 0}}'
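
A streaming variant can be exercised through the companion generate_stream endpoint, which returns the output as server-sent events (same example route as above; this relies on the decoupled transaction policy configured earlier):

$ curl -X POST https://triton-vllm-inference-server-triton-vllm-inference-server.apps.cluster-4b2f6.sandbox888.opentlc.com/v2/models/vllm_model/generate_stream -d '{"text_input": "What is Triton Inference Server?", "parameters": {"stream": true, "temperature": 0}}'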

Running using Run:AI

Note: Under Construction

References

Additional Links

Known Issues

The only thing in the Containerfile that should be run as root is the apt install.

Everything else should run after switching to user 1001, especially the mkdir, so that the directory is owned by user 1001.

If the mkdir runs as user 1001, the securityContext in the Deployment can probably be skipped as well, though this has not been confirmed.
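
A sketch of what that ordering could look like in a Containerfile (illustrative only; the package list and directory path are placeholders, not this project's actual Containerfile):

# Packages must be installed as root
USER root
RUN apt-get update && apt-get install -y --no-install-recommends git && rm -rf /var/lib/apt/lists/*

# Switch to the unprivileged user before creating directories,
# so the model repository mount point is owned by UID 1001
USER 1001
RUN mkdir -p /opt/app-root/model_repository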
