
ray-llm's Introduction

RayLLM - LLMs on Ray

The hosted Aviary Explorer is no longer available. Visit Anyscale to experience models served with RayLLM.


RayLLM (formerly known as Aviary) is an LLM serving solution that makes it easy to deploy and manage a variety of open source LLMs, built on Ray Serve. It does this by:

  • Providing an extensive suite of pre-configured open source LLMs, with defaults that work out of the box.
  • Supporting Transformer models hosted on Hugging Face Hub or present on local disk.
  • Simplifying the deployment of multiple LLMs.
  • Simplifying the addition of new LLMs.
  • Offering unique autoscaling support, including scale-to-zero.
  • Fully supporting multi-GPU & multi-node model deployments.
  • Offering high-performance features like continuous batching, quantization and streaming.
  • Providing a REST API similar to OpenAI's, making it easy to migrate existing applications and cross-test the two.
  • Supporting multiple LLM backends out of the box, including vLLM and TensorRT-LLM.

In addition to LLM serving, it also includes a CLI and a web frontend (Aviary Explorer) that you can use to compare the outputs of different models directly, rank them by quality, get a cost and latency estimate, and more.

RayLLM supports continuous batching and quantization by integrating with vLLM. Continuous batching allows you to get much better throughput and latency than static batching. Quantization allows you to deploy compressed models with cheaper hardware requirements and lower inference costs. See the quantization guide for more details on running quantized models on RayLLM.

RayLLM leverages Ray Serve, which has native support for autoscaling and multi-node deployments. RayLLM can scale to zero and create new model replicas (each composed of multiple GPU workers) in response to demand.

Getting started

Deploying RayLLM

The guide below walks you through the steps required to deploy RayLLM on Ray Serve.

Locally

We highly recommend using the official anyscale/ray-llm Docker image to run RayLLM. Manually installing RayLLM is currently not a supported use case because of its specific dependencies, some of which are not available on pip.

cache_dir=${XDG_CACHE_HOME:-$HOME/.cache}

docker run -it --gpus all --shm-size 1g -p 8000:8000 -e HF_HOME=~/data -v $cache_dir:~/data anyscale/ray-llm:latest bash
# Inside docker container
serve run ~/serve_configs/amazon--LightGPT.yaml

On a Ray Cluster

RayLLM uses Ray Serve, so it can be deployed on Ray Clusters.

Currently, we only have a guide and pre-configured YAML file for AWS deployments. Make sure you have exported your AWS credentials locally.

export AWS_ACCESS_KEY_ID=...
export AWS_SECRET_ACCESS_KEY=...
export AWS_SESSION_TOKEN=...

Start by cloning this repo to your local machine.

You may need to specify your AWS private key in the deploy/ray/rayllm-cluster.yaml file. See Ray on Cloud VMs page in Ray documentation for more details.

git clone https://github.com/ray-project/ray-llm.git
cd ray-llm

# Start a Ray Cluster (This will take a few minutes to start-up)
ray up deploy/ray/rayllm-cluster.yaml

Connect to your Cluster

# Connect to the Head node of your Ray Cluster (This will take several minutes to autoscale)
ray attach deploy/ray/rayllm-cluster.yaml

# Deploy the LightGPT model. 
serve run serve_configs/amazon--LightGPT.yaml

You can deploy any model in the models directory of this repo, or define your own model YAML file and run that instead.

On Kubernetes

For Kubernetes deployments, please see our documentation for deploying on KubeRay.

Query your models

Once the models are deployed, you can install a client outside of the Docker container to query the backend.

pip install "rayllm @ git+https://github.com/ray-project/ray-llm.git"

You can query your RayLLM deployment in many ways.

In all cases, start by setting the endpoint URL:

export ENDPOINT_URL="http://localhost:8000/v1"

This assumes your deployment is running locally; you can also access remote deployments, in which case you would set ENDPOINT_URL to the remote URL instead.

Using curl

You can use curl at the command line to query your deployed LLM:

% curl $ENDPOINT_URL/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-2-7b-chat-hf",
    "messages": [{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "Hello!"}],
    "temperature": 0.7
  }'
{
  "id":"meta-llama/Llama-2-7b-chat-hf-308fc81f-746e-4682-af70-05d35b2ee17d",
  "object":"text_completion","created":1694809775,
  "model":"meta-llama/Llama-2-7b-chat-hf",
  "choices":[
    {
      "message":
        {
          "role":"assistant",
          "content":"Hello there! *adjusts glasses* It's a pleasure to meet you! Is there anything I can help you with today? Have you got a question or a task you'd like me to assist you with? Just let me know!"
        },
      "index":0,
      "finish_reason":"stop"
    }
  ],
  "usage":{"prompt_tokens":30,"completion_tokens":53,"total_tokens":83}}

Connecting directly over python

Use the requests library to connect from Python. The script below receives a streamed response, automatically parses the output chunks, and prints just the content.

import os
import json
import requests

s = requests.Session()

api_base = os.getenv("ENDPOINT_URL")
url = f"{api_base}/chat/completions"
body = {
  "model": "meta-llama/Llama-2-7b-chat-hf",
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Tell me a long story with many words."}
  ],
  "temperature": 0.7,
  "stream": True,
}

with s.post(url, json=body, stream=True) as response:
    for chunk in response.iter_lines(decode_unicode=True):
        if chunk:
            try:
                # Get data from response chunk
                chunk_data = chunk.split("data: ")[1]

                # Get message choices from data
                choices = json.loads(chunk_data)["choices"]

                # Pick content from first choice
                content = choices[0]["delta"]["content"]

                print(content, end="", flush=True)
            except (json.decoder.JSONDecodeError, IndexError):
                # Chunk was not formatted as expected
                pass
            except KeyError:
                # No message was contained in the chunk
                pass
    print("")

Using the OpenAI SDK

RayLLM uses an OpenAI-compatible API, allowing us to use the OpenAI SDK to access our deployments. To do so, set the OPENAI_API_BASE and OPENAI_API_KEY environment variables before starting Python.

export OPENAI_API_BASE=http://localhost:8000/v1
export OPENAI_API_KEY='not_a_real_key'
import openai

# List all models.
models = openai.Model.list()
print(models)

# Note: not all arguments are currently supported and will be ignored by the backend.
chat_completion = openai.ChatCompletion.create(
    model="meta-llama/Llama-2-7b-chat-hf",
    messages=[
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Say 'test'."}
    ],
    temperature=0.7
)
print(chat_completion)
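
The SDK call can also stream. The sketch below assumes the backend emits OpenAI-style streaming chunks whose choices carry a delta field, as seen in the requests example above; treat it as illustrative rather than an official snippet.

import openai

# Request a streamed chat completion and print content deltas as they arrive.
chunks = openai.ChatCompletion.create(
    model="meta-llama/Llama-2-7b-chat-hf",
    messages=[
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Tell me a long story with many words."}
    ],
    temperature=0.7,
    stream=True
)
for chunk in chunks:
    delta = chunk["choices"][0]["delta"]
    # The final chunk may not contain a "content" key.
    print(delta.get("content", ""), end="", flush=True)
print("")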

RayLLM Reference

Installing RayLLM

To install RayLLM and its dependencies, run the following command:

pip install "rayllm @ git+https://github.com/ray-project/ray-llm.git"

RayLLM consists of a set of configurations and utilities for deploying LLMs on Ray Serve, in addition to a frontend (Aviary Explorer), both of which come with additional dependencies. To install the dependencies for the frontend, run the following command:

pip install "rayllm[frontend] @ git+https://github.com/ray-project/ray-llm.git"

The backend dependencies are heavyweight and quite large. We recommend using the official anyscale/ray-llm image. Installing the backend manually is not a supported use case.

Usage stats collection

Ray collects basic, non-identifiable usage statistics to help us improve the project. For more information on what is collected and how to opt-out, see the Usage Stats Collection page in Ray documentation.

Using RayLLM through the CLI

RayLLM uses the Ray Serve CLI, which allows you to interact with deployed models.

# Start a new model in Ray Serve from provided configuration
serve run serve_configs/<model_config_path>

# Get the status of the running deployments
serve status

# Get the config of currently live Serve applications
serve config

# Shutdown all Serve applications
serve shutdown

RayLLM Model Registry

You can easily add new models by adding two configuration files. To learn more about how to customize or add new models, see the Model Registry.

Frequently Asked Questions

How do I add a new model?

The easiest way is to copy an existing model's YAML file and modify it. See models/README.md for more details.

How do I deploy multiple models at once?

Run multiple models at once by aggregating the Serve configs for different models into a single, unified config. For example, use the following config to run the LightGPT and Llama-2-7b-chat models in a single Serve application:

# File name: serve_configs/config.yaml

applications:
- name: router
  import_path: rayllm.backend:router_application
  route_prefix: /
  args:
    models:
      - ./models/continuous_batching/amazon--LightGPT.yaml
      - ./models/continuous_batching/meta-llama--Llama-2-7b-chat-hf.yaml

The config includes both models in the models argument for the router, and the Serve configs for both model applications are included. Save this unified config file to the serve_configs/ folder.

Run the config to deploy the models:

serve run serve_configs/<config.yaml>
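
Once the unified application is running, the model field in the request body selects which deployment handles the request. A minimal sketch, assuming the LightGPT config above registers the model id amazon/LightGPT (following the repo's convention of mapping "--" in file names to "/" in model ids):

import os
import requests

api_base = os.getenv("ENDPOINT_URL", "http://localhost:8000/v1")

body = {
  # Assumed id derived from amazon--LightGPT.yaml; check the YAML's model_id field.
  "model": "amazon/LightGPT",
  "messages": [{"role": "user", "content": "Hello!"}],
  "temperature": 0.7,
}
response = requests.post(f"{api_base}/chat/completions", json=body)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])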

How do I deploy a model to multiple nodes?

All our default model configurations pin a model to a single node for high performance. However, you can easily change this if you want to deploy a model across nodes for lower cost or better GPU availability. To do so, go to the model's YAML file in the model registry and change placement_strategy from STRICT_PACK to PACK.

My deployment isn't starting/working correctly, how can I debug?

There can be several reasons for the deployment not starting or not working correctly. Here are some things to check:

  1. You might have specified an invalid model id.
  2. Your model may require resources that are not available on the cluster. A common issue is that the model requires Ray custom resources (e.g. accelerator_type_a10) in order to be scheduled on the right node type, while your cluster is missing those custom resources. You can either modify the model configuration to remove those custom resources or, better yet, add them to the node configuration of your Ray cluster. You can debug this issue by looking at the Ray Autoscaler logs (monitor.log).
  3. Your model is a gated Hugging Face model (e.g. meta-llama). In that case, you need to set the HUGGING_FACE_HUB_TOKEN environment variable cluster-wide. You can do that either in the Ray cluster configuration or by setting it before running serve run.
  4. Your model may be running out of memory. You can usually spot this issue by looking for keywords related to "CUDA", "memory" and "NCCL" in the replica logs or serve run output. In that case, consider reducing max_batch_prefill_tokens and max_batch_total_tokens (if applicable). See models/README.md for more information on those parameters.

In general, Ray Dashboard is a useful debugging tool, letting you monitor your Ray Serve / LLM application and access Ray logs.

A good sanity check is deploying the test model in tests/models/. If that works, you know you can deploy a model.
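
Another quick connectivity check is to hit the OpenAI-compatible model listing route directly; if the router is up, it returns the ids of every deployed model. A minimal sketch, assuming the response follows the OpenAI list schema that openai.Model.list() consumes above:

import os
import requests

api_base = os.getenv("ENDPOINT_URL", "http://localhost:8000/v1")

# GET {api_base}/models is the route openai.Model.list() calls under the hood.
response = requests.get(f"{api_base}/models")
response.raise_for_status()
for model in response.json().get("data", []):
    print(model.get("id"))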

How do I write a program that accesses both OpenAI and your hosted model at the same time?

The OpenAI create() call lets you pass api_key and api_base per request, so you can do something like this:

# Call your self-hosted model running on the local host:
openai.ChatCompletion.create(api_base="http://localhost:8000/v1", api_key="not_a_real_key", ...)

# Call OpenAI. Pass your real OpenAI API key and leave api_base at its default:
openai.ChatCompletion.create(api_key="<your OpenAI API key>", ...)

Getting Help and Filing Bugs / Feature Requests

We are eager to help you get started with RayLLM.

For bugs or feature requests, please open an issue on the ray-llm GitHub repository.

Contributions

We are also interested in accepting contributions. These could be anything from a new evaluator to integrating a new model with a YAML file, and more. Feel free to post an issue first to get our feedback on a proposal, or just file a PR and we commit to giving you prompt feedback.

We use pre-commit hooks to ensure that all code is formatted correctly. Make sure to pip install pre-commit and then run pre-commit install. You can also run ./format to run the hooks manually.

ray-llm's People

Contributors

akshay-anyscale, alanwguo, angelinalg, architkulkarni, arturniederfahrenhorst, avnishn, csivanich, eltociear, gvspraveen, kevin85421, kylehh, maxpumperla, richardliaw, shrekris-anyscale, sihanwang41, sijieamoy, tchordia, uvikas, waleedkadous, yard1, yq-wang


ray-llm's Issues

Deploying RayLLM locally failed with exit code 0 even if deployment is ready

Hi, I'm trying to deploy meta-llama--Llama-2-7b-chat-hf.yaml using the instructions provided in the README. The deployment seems to work, but just when everything is about to be ready, it exits without any error:

(base) ray@35cf69569a48:~/models/continuous_batching$ aviary run --model ~/models/continuous_batching/meta-llama--Llama-2-7b-chat-hf.yaml
[WARNING 2023-10-16 09:04:22,790] api.py: 382  DeprecationWarning: `route_prefix` in `@serve.deployment` has been deprecated. To specify a route prefix for an application, pass it into `serve.run` instead.
[INFO 2023-10-16 09:04:24,848] accelerator.py: 171  Failed to detect number of TPUs: [Errno 2] No such file or directory: '/dev/vfio'
2023-10-16 09:04:24,987 INFO worker.py:1633 -- Started a local Ray instance. View the dashboard at 127.0.0.1:8265
[INFO 2023-10-16 09:04:26,208] api.py: 148  Nothing to shut down. There's no Serve application running on this Ray cluster.
[INFO 2023-10-16 09:04:26,269] deployment_base_client.py: 28  Initialized with base handles {'meta-llama/Llama-2-7b-chat-hf': <ray.serve.deployment.Application object at 0x7f1a8e5a94c0>}
/home/ray/anaconda3/lib/python3.9/site-packages/ray/serve/api.py:519: UserWarning: Specifying host and port in `serve.run` is deprecated and will be removed in a future version. To specify custom HTTP options, use `serve.start`.
  warnings.warn(
(HTTPProxyActor pid=22159) INFO 2023-10-16 09:04:28,523 http_proxy 172.17.0.2 http_proxy.py:1428 - Proxy actor 69fb321f9360031e80d6562c01000000 starting on node 82bb99213668fbebb2628af8b81ae804a43ca3d4e585ba5d93259e25.
[INFO 2023-10-16 09:04:28,555] api.py: 328  Started detached Serve instance in namespace "serve".
(HTTPProxyActor pid=22159) INFO 2023-10-16 09:04:28,530 http_proxy 172.17.0.2 http_proxy.py:1612 - Starting HTTP server on node: 82bb99213668fbebb2628af8b81ae804a43ca3d4e585ba5d93259e25 listening on port 8000
(HTTPProxyActor pid=22159) INFO:     Started server process [22159]
(ServeController pid=22117) INFO 2023-10-16 09:04:28,689 controller 22117 deployment_state.py:1390 - Deploying new version of deployment VLLMDeployment:meta-llama--Llama-2-7b-chat-hf in application 'router'.
(ServeController pid=22117) INFO 2023-10-16 09:04:28,690 controller 22117 deployment_state.py:1390 - Deploying new version of deployment Router in application 'router'.
(ServeController pid=22117) INFO 2023-10-16 09:04:28,793 controller 22117 deployment_state.py:1679 - Adding 1 replica to deployment VLLMDeployment:meta-llama--Llama-2-7b-chat-hf in application 'router'.
(ServeController pid=22117) INFO 2023-10-16 09:04:28,796 controller 22117 deployment_state.py:1679 - Adding 2 replicas to deployment Router in application 'router'.
(ServeReplica:router:Router pid=22202) [WARNING 2023-10-16 09:04:32,739] api.py: 382  DeprecationWarning: `route_prefix` in `@serve.deployment` has been deprecated. To specify a route prefix for an application, pass it into `serve.run` instead.
(ServeReplica:router:VLLMDeployment:meta-llama--Llama-2-7b-chat-hf pid=22201) [INFO 2023-10-16 09:04:32,808] vllm_models.py: 201  Using existing placement group <ray.util.placement_group.PlacementGroup object at 0x7f10b18d9040> PlacementGroupID(371dfe1112ca6705f22ac50c828201000000). {'placement_group_id': '371dfe1112ca6705f22ac50c828201000000', 'name': 'SERVE_REPLICA::router#VLLMDeployment:meta-llama--Llama-2-7b-chat-hf#mZlJZj', 'bundles': {0: {'CPU': 1.0}, 1: {'CPU': 4.0, 'GPU': 1.0}}, 'bundles_to_node_id': {0: '82bb99213668fbebb2628af8b81ae804a43ca3d4e585ba5d93259e25', 1: '82bb99213668fbebb2628af8b81ae804a43ca3d4e585ba5d93259e25'}, 'strategy': 'STRICT_PACK', 'state': 'CREATED', 'stats': {'end_to_end_creation_latency_ms': 1.814, 'scheduling_latency_ms': 1.728, 'scheduling_attempt': 1, 'highest_retry_delay_ms': 0.0, 'scheduling_state': 'FINISHED'}}
(ServeReplica:router:VLLMDeployment:meta-llama--Llama-2-7b-chat-hf pid=22201) [INFO 2023-10-16 09:04:32,809] vllm_models.py: 204  Using existing placement group <ray.util.placement_group.PlacementGroup object at 0x7f10b18d9040>
(ServeReplica:router:VLLMDeployment:meta-llama--Llama-2-7b-chat-hf pid=22201) [INFO 2023-10-16 09:04:32,809] vllm_node_initializer.py: 38  Starting initialize_node tasks on the workers and local node...
(ServeReplica:router:VLLMDeployment:meta-llama--Llama-2-7b-chat-hf pid=22201) [INFO 2023-10-16 09:04:37,474] vllm_node_initializer.py: 53  Finished initialize_node tasks.
(ServeReplica:router:VLLMDeployment:meta-llama--Llama-2-7b-chat-hf pid=22201) INFO 10-16 09:04:37 llm_engine.py:72] Initializing an LLM engine with config: model='meta-llama/Llama-2-7b-chat-hf', tokenizer='meta-llama/Llama-2-7b-chat-hf', tokenizer_mode=auto, revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=4096, download_dir=None, load_format=auto, tensor_parallel_size=1, quantization=None, seed=0)
(ServeReplica:router:VLLMDeployment:meta-llama--Llama-2-7b-chat-hf pid=22201) INFO 10-16 09:04:37 tokenizer.py:30] For some LLaMA V1 models, initializing the fast tokenizer may take a long time. To reduce the initialization time, consider using 'hf-internal-testing/llama-tokenizer' instead of the original tokenizer.
(ServeReplica:router:VLLMDeployment:meta-llama--Llama-2-7b-chat-hf pid=22201) INFO 10-16 09:04:51 llm_engine.py:205] # GPU blocks: 1014, # CPU blocks: 512
[INFO 2023-10-16 09:04:53,741] client.py: 581  Deployment 'VLLMDeployment:meta-llama--Llama-2-7b-chat-hf:biUfsX' is ready. component=serve deployment=VLLMDeployment:meta-llama--Llama-2-7b-chat-hf
[INFO 2023-10-16 09:04:53,741] client.py: 581  Deployment 'Router:QHkGZE' is ready at `http://0.0.0.0:8000/`. component=serve deployment=Router
(pid=22359) [WARNING 2023-10-16 09:04:37,030] api.py: 382  DeprecationWarning: `route_prefix` in `@serve.deployment` has been deprecated. To specify a route prefix for an application, pass it into `serve.run` instead. [repeated 3x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/ray-logging.html#log-deduplication for more options.)
(base) ray@35cf69569a48:~/models/continuous_batching$ echo $?
0

Note that I modified the config so it can run on my custom machine with 8 CPU cores, 32 GB of RAM, and an NVIDIA L4 GPU:

(base) ray@35cf69569a48:~/models/continuous_batching$ cat ~/models/continuous_batching/meta-llama--Llama-2-7b-chat-hf.yaml
deployment_config:
  autoscaling_config:
    min_replicas: 1
    initial_replicas: 1
    max_replicas: 1
    target_num_ongoing_requests_per_replica: 24
    metrics_interval_s: 10.0
    look_back_period_s: 30.0
    smoothing_factor: 0.5
    downscale_delay_s: 300.0
    upscale_delay_s: 15.0
  max_concurrent_queries: 64
  ray_actor_options:
    resources:
      accelerator_type_a10: 0
engine_config:
  model_id: meta-llama/Llama-2-7b-chat-hf
  hf_model_id: meta-llama/Llama-2-7b-chat-hf
  type: VLLMEngine
  engine_kwargs:
    trust_remote_code: true
    max_num_batched_tokens: 4096
    max_num_seqs: 64
    gpu_memory_utilization: 0.95
  max_total_tokens: 4096
  generation:
    prompt_format:
      system: "<<SYS>>\n{instruction}\n<</SYS>>\n\n"
      assistant: " {instruction} </s><s> "
      trailing_assistant: " "
      user: "[INST] {system}{instruction} [/INST]"
      system_in_user: true
      default_system_message: ""
    stopping_sequences: ["<unk>"]
scaling_config:
  num_workers: 1
  num_gpus_per_worker: 1
  num_cpus_per_worker: 4
  placement_strategy: "STRICT_PACK"
  resources_per_worker:
    accelerator_type_a10: 0

I also confirmed that my machine can run meta-llama/Llama-2-7b-chat-hf using pure vLLM, and RayLLM's logs seem to confirm that the model can be loaded, so why does it keep exiting? Am I doing anything wrong here?

Thank you for taking a look.

Following the doc to deploy Llama 2 70B throws an error

I followed the document https://github.com/ray-project/ray-llm/blob/master/docs/kuberay/deploy-on-eks.md.

Env: latest aviary docker image and kuberay-operator 0.6.0.

(HTTPProxyActor pid=373) ERROR 2023-09-22 16:14:03,473 http_proxy 10.0.136.122 2e2d3085-f043-473d-920b-fbebe1572747 /v1/chat/completions router http_proxy.py:1282 - Unexpected ASGI message 'http.response.start' sent, after response already completed.
(HTTPProxyActor pid=373) Traceback (most recent call last):
(HTTPProxyActor pid=373)   File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/serve/_private/http_proxy.py", line 1258, in send_request_to_replica_streaming
(HTTPProxyActor pid=373)     status_code = await self._consume_and_send_asgi_message_generator(
(HTTPProxyActor pid=373)   File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/serve/_private/http_proxy.py", line 1174, in _consume_and_send_asgi_message_generator
(HTTPProxyActor pid=373)     await send(asgi_message)
(HTTPProxyActor pid=373)   File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/serve/_private/http_proxy.py", line 1331, in send_with_request_id
(HTTPProxyActor pid=373)     await send(message)
(HTTPProxyActor pid=373)   File "/home/ray/anaconda3/lib/python3.9/site-packages/uvicorn/protocols/http/h11_impl.py", line 544, in send
(HTTPProxyActor pid=373)     raise RuntimeError(msg % message_type)
(HTTPProxyActor pid=373) RuntimeError: Unexpected ASGI message 'http.response.start' sent, after response already completed.
(HTTPProxyActor pid=373) Unhandled error (suppress with 'RAY_IGNORE_UNHANDLED_ERRORS=1'): ray::ServeReplica:router:Router.handle_request_streaming() (pid=265, ip=10.0.150.166, actor_id=783369e101eb514895ce0ed202000000, repr=<ray.serve._private.replica.ServeReplica:router:Router object at 0x7f8df213e7c0>)
(HTTPProxyActor pid=373)   File "/home/ray/anaconda3/lib/python3.9/concurrent/futures/_base.py", line 439, in result
(HTTPProxyActor pid=373)     return self.__get_result()
(HTTPProxyActor pid=373)   File "/home/ray/anaconda3/lib/python3.9/concurrent/futures/_base.py", line 391, in __get_result
(HTTPProxyActor pid=373)     raise self._exception
(HTTPProxyActor pid=373)   File "/home/ray/anaconda3/lib/python3.9/concurrent/futures/thread.py", line 58, in run
(HTTPProxyActor pid=373)     result = self.fn(*self.args, **self.kwargs)
(HTTPProxyActor pid=373)   File "stringsource", line 67, in cfunc.to_py.__Pyx_CFunc_object____object____StreamingGeneratorExecutionContext___to_py.wrap
(HTTPProxyActor pid=373)   File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/exceptions.py", line 32, in to_bytes
(HTTPProxyActor pid=373)     serialized_exception=pickle.dumps(self),
(HTTPProxyActor pid=373)   File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/cloudpickle/cloudpickle_fast.py", line 88, in dumps
(HTTPProxyActor pid=373)     cp.dump(obj)
(HTTPProxyActor pid=373)   File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/cloudpickle/cloudpickle_fast.py", line 733, in dump
(HTTPProxyActor pid=373)     return Pickler.dump(self, obj)
(HTTPProxyActor pid=373) TypeError: can't pickle multidict._multidict.CIMultiDictProxy objects
(AviaryTGIInferenceWorker:meta-llama/Llama-2-70b-chat-hf pid=1345, ip=10.0.150.166) [INFO 2023-09-22 16:01:28,721] tgi_worker.py: 663  Model finished warming up (max_batch_total_tokens=19840) and is ready to serve requests. [repeated 7x across cluster]
(AviaryTGIInferenceWorker:meta-llama/Llama-2-70b-chat-hf pid=1345, ip=10.0.150.166) [INFO 2023-09-22 16:01:26,554] tgi_worker.py: 650  Model is warming up. Num requests: 2 Prefill tokens: 6000 Max batch total tokens: 19831 [repeated 3x across cluster]
(ServeReplica:router:Router pid=406) INFO 2023-09-22 16:14:03,465 Router router#Router#GRIFiv 1cb0cd39-1775-4947-9ce0-1207279eb553 /meta-llama--Llama-2-70b-chat-hf/stream router replica.py:741 - __CALL__ OK 5.1ms
(ServeReplica:router:Router pid=265, ip=10.0.150.166) /home/ray/anaconda3/lib/python3.9/site-packages/aviary/backend/server/routers/router_app.py:285: DeprecationWarning: with timeout() is deprecated, use async with timeout() instead
(ServeReplica:router:Router pid=265, ip=10.0.150.166)   with async_timeout.timeout(TIMEOUT):
(ServeReplica:router:Router pid=265, ip=10.0.150.166) [INFO 2023-09-22 16:14:03,467] router_query_engine.py: 120  No tokens produced. Id: 6cf85a69d3d77edbc8df8bd4d5af98b6
(ServeReplica:router:Router pid=265, ip=10.0.150.166) ERROR 2023-09-22 16:14:03,470 Router router#Router#shBjkX 2e2d3085-f043-473d-920b-fbebe1572747 /v1/chat/completions router replica.py:733 - Request failed due to RayTaskError:
(ServeReplica:router:Router pid=265, ip=10.0.150.166)   File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/serve/_private/replica.py", line 730, in wrap_user_method_call
(ServeReplica:router:Router pid=265, ip=10.0.150.166)     yield
(ServeReplica:router:Router pid=265, ip=10.0.150.166)   File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/serve/_private/replica.py", line 870, in call_user_method
(ServeReplica:router:Router pid=265, ip=10.0.150.166)     raise e from None
(ServeReplica:router:Router pid=265, ip=10.0.150.166) ray.exceptions.RayTaskError: ray::ServeReplica:router:Router() (pid=265, ip=10.0.150.166)
(ServeReplica:router:Router pid=265, ip=10.0.150.166)   File "/home/ray/anaconda3/lib/python3.9/site-packages/anyio/streams/memory.py", line 98, in receive
(ServeReplica:router:Router pid=265, ip=10.0.150.166)     return self.receive_nowait()
(ServeReplica:router:Router pid=265, ip=10.0.150.166)   File "/home/ray/anaconda3/lib/python3.9/site-packages/anyio/streams/memory.py", line 93, in receive_nowait
(ServeReplica:router:Router pid=265, ip=10.0.150.166)     raise WouldBlock
(ServeReplica:router:Router pid=265, ip=10.0.150.166) anyio.WouldBlock
(ServeReplica:router:Router pid=265, ip=10.0.150.166)
(ServeReplica:router:Router pid=265, ip=10.0.150.166) During handling of the above exception, another exception occurred:
(ServeReplica:router:Router pid=265, ip=10.0.150.166)
(ServeReplica:router:Router pid=265, ip=10.0.150.166) ray::ServeReplica:router:Router() (pid=265, ip=10.0.150.166)
(ServeReplica:router:Router pid=265, ip=10.0.150.166)   File "/home/ray/anaconda3/lib/python3.9/site-packages/starlette/middleware/base.py", line 78, in call_next
(ServeReplica:router:Router pid=265, ip=10.0.150.166)     message = await recv_stream.receive()
(ServeReplica:router:Router pid=265, ip=10.0.150.166)   File "/home/ray/anaconda3/lib/python3.9/site-packages/anyio/streams/memory.py", line 118, in receive
(ServeReplica:router:Router pid=265, ip=10.0.150.166)     raise EndOfStream
(ServeReplica:router:Router pid=265, ip=10.0.150.166) anyio.EndOfStream
(ServeReplica:router:Router pid=265, ip=10.0.150.166)
(ServeReplica:router:Router pid=265, ip=10.0.150.166) During handling of the above exception, another exception occurred:
(ServeReplica:router:Router pid=265, ip=10.0.150.166)
(ServeReplica:router:Router pid=265, ip=10.0.150.166) ray::ServeReplica:router:Router() (pid=265, ip=10.0.150.166)
(ServeReplica:router:Router pid=265, ip=10.0.150.166)   File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/serve/_private/utils.py", line 225, in wrap_to_ray_error
(ServeReplica:router:Router pid=265, ip=10.0.150.166)     raise exception
(ServeReplica:router:Router pid=265, ip=10.0.150.166)   File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/serve/_private/replica.py", line 851, in call_user_method
(ServeReplica:router:Router pid=265, ip=10.0.150.166)     result = await method_to_call(*request_args, **request_kwargs)
(ServeReplica:router:Router pid=265, ip=10.0.150.166)   File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/serve/_private/http_util.py", line 437, in __call__
(ServeReplica:router:Router pid=265, ip=10.0.150.166)     await self._asgi_app(
(ServeReplica:router:Router pid=265, ip=10.0.150.166)   File "/home/ray/anaconda3/lib/python3.9/site-packages/fastapi/applications.py", line 290, in __call__
(ServeReplica:router:Router pid=265, ip=10.0.150.166)     await super().__call__(scope, receive, send)
(ServeReplica:router:Router pid=265, ip=10.0.150.166)   File "/home/ray/anaconda3/lib/python3.9/site-packages/starlette/applications.py", line 122, in __call__
(ServeReplica:router:Router pid=265, ip=10.0.150.166)     await self.middleware_stack(scope, receive, send)
(ServeReplica:router:Router pid=265, ip=10.0.150.166)   File "/home/ray/anaconda3/lib/python3.9/site-packages/starlette/middleware/errors.py", line 184, in __call__
(ServeReplica:router:Router pid=265, ip=10.0.150.166)     raise exc
(ServeReplica:router:Router pid=265, ip=10.0.150.166)   File "/home/ray/anaconda3/lib/python3.9/site-packages/starlette/middleware/errors.py", line 162, in __call__
(ServeReplica:router:Router pid=265, ip=10.0.150.166)     await self.app(scope, receive, _send)
(ServeReplica:router:Router pid=265, ip=10.0.150.166)   File "/home/ray/anaconda3/lib/python3.9/site-packages/opentelemetry/instrumentation/asgi/__init__.py", line 576, in __call__
(ServeReplica:router:Router pid=265, ip=10.0.150.166)     await self.app(scope, otel_receive, otel_send)
(ServeReplica:router:Router pid=265, ip=10.0.150.166)   File "/home/ray/anaconda3/lib/python3.9/site-packages/starlette/middleware/base.py", line 108, in __call__
(ServeReplica:router:Router pid=265, ip=10.0.150.166)     response = await self.dispatch_func(request, call_next)
(ServeReplica:router:Router pid=265, ip=10.0.150.166)   File "/home/ray/anaconda3/lib/python3.9/site-packages/aviary/backend/server/routers/middleware.py", line 12, in add_request_id
(ServeReplica:router:Router pid=265, ip=10.0.150.166)     return await call_next(request)
(ServeReplica:router:Router pid=265, ip=10.0.150.166)   File "/home/ray/anaconda3/lib/python3.9/site-packages/starlette/middleware/base.py", line 84, in call_next
(ServeReplica:router:Router pid=265, ip=10.0.150.166)     raise app_exc
(ServeReplica:router:Router pid=265, ip=10.0.150.166)   File "/home/ray/anaconda3/lib/python3.9/site-packages/starlette/middleware/base.py", line 70, in coro
(ServeReplica:router:Router pid=265, ip=10.0.150.166)     await self.app(scope, receive_or_disconnect, send_no_error)
(ServeReplica:router:Router pid=265, ip=10.0.150.166)   File "/home/ray/anaconda3/lib/python3.9/site-packages/starlette/middleware/cors.py", line 83, in __call__
(ServeReplica:router:Router pid=265, ip=10.0.150.166)     await self.app(scope, receive, send)
(ServeReplica:router:Router pid=265, ip=10.0.150.166)   File "/home/ray/anaconda3/lib/python3.9/site-packages/starlette/middleware/exceptions.py", line 79, in __call__
(ServeReplica:router:Router pid=265, ip=10.0.150.166)     raise exc
(ServeReplica:router:Router pid=265, ip=10.0.150.166)   File "/home/ray/anaconda3/lib/python3.9/site-packages/starlette/middleware/exceptions.py", line 68, in __call__
(ServeReplica:router:Router pid=265, ip=10.0.150.166)     await self.app(scope, receive, sender)
(ServeReplica:router:Router pid=265, ip=10.0.150.166)   File "/home/ray/anaconda3/lib/python3.9/site-packages/fastapi/middleware/asyncexitstack.py", line 20, in __call__
(ServeReplica:router:Router pid=265, ip=10.0.150.166)     raise e
(ServeReplica:router:Router pid=265, ip=10.0.150.166)   File "/home/ray/anaconda3/lib/python3.9/site-packages/fastapi/middleware/asyncexitstack.py", line 17, in __call__
(ServeReplica:router:Router pid=265, ip=10.0.150.166)     await self.app(scope, receive, send)
(ServeReplica:router:Router pid=265, ip=10.0.150.166)   File "/home/ray/anaconda3/lib/python3.9/site-packages/starlette/routing.py", line 718, in __call__
(ServeReplica:router:Router pid=265, ip=10.0.150.166)     await route.handle(scope, receive, send)
(ServeReplica:router:Router pid=265, ip=10.0.150.166)   File "/home/ray/anaconda3/lib/python3.9/site-packages/starlette/routing.py", line 276, in handle
(ServeReplica:router:Router pid=265, ip=10.0.150.166)     await self.app(scope, receive, send)
(ServeReplica:router:Router pid=265, ip=10.0.150.166)   File "/home/ray/anaconda3/lib/python3.9/site-packages/starlette/routing.py", line 66, in app
(ServeReplica:router:Router pid=265, ip=10.0.150.166)     response = await func(request)
(ServeReplica:router:Router pid=265, ip=10.0.150.166)   File "/home/ray/anaconda3/lib/python3.9/site-packages/fastapi/routing.py", line 241, in app
(ServeReplica:router:Router pid=265, ip=10.0.150.166)     raw_response = await run_endpoint_function(
(ServeReplica:router:Router pid=265, ip=10.0.150.166)   File "/home/ray/anaconda3/lib/python3.9/site-packages/fastapi/routing.py", line 167, in run_endpoint_function
(ServeReplica:router:Router pid=265, ip=10.0.150.166)     return await dependant.call(**values)
(ServeReplica:router:Router pid=265, ip=10.0.150.166)   File "/home/ray/anaconda3/lib/python3.9/site-packages/aviary/backend/server/routers/router_app.py", line 286, in chat
(ServeReplica:router:Router pid=265, ip=10.0.150.166)     results = await self.query_engine.query(body.model, prompt, request)
(ServeReplica:router:Router pid=265, ip=10.0.150.166)   File "/home/ray/anaconda3/lib/python3.9/site-packages/aviary/backend/server/plugins/router_query_engine.py", line 48, in query
(ServeReplica:router:Router pid=265, ip=10.0.150.166)     responses = [resp async for resp in response_stream]
(ServeReplica:router:Router pid=265, ip=10.0.150.166)   File "/home/ray/anaconda3/lib/python3.9/site-packages/aviary/backend/server/plugins/router_query_engine.py", line 48, in <listcomp>
(ServeReplica:router:Router pid=265, ip=10.0.150.166)     responses = [resp async for resp in response_stream]
(ServeReplica:router:Router pid=265, ip=10.0.150.166)   File "/home/ray/anaconda3/lib/python3.9/site-packages/aviary/backend/observability/fn_call_metrics.py", line 192, in new_gen
(ServeReplica:router:Router pid=265, ip=10.0.150.166)     async for x in async_generator_fn(*args, **kwargs):
(ServeReplica:router:Router pid=265, ip=10.0.150.166)   File "/home/ray/anaconda3/lib/python3.9/site-packages/aviary/backend/server/plugins/router_query_engine.py", line 98, in stream
(ServeReplica:router:Router pid=265, ip=10.0.150.166)     async for response in stream_model_responses(url, json=json):
(ServeReplica:router:Router pid=265, ip=10.0.150.166)   File "/home/ray/anaconda3/lib/python3.9/site-packages/aviary/backend/server/utils.py", line 192, in stream_model_responses
(ServeReplica:router:Router pid=265, ip=10.0.150.166)     async with session.post(
(ServeReplica:router:Router pid=265, ip=10.0.150.166)   File "/home/ray/anaconda3/lib/python3.9/site-packages/aiohttp/client.py", line 1141, in __aenter__
(ServeReplica:router:Router pid=265, ip=10.0.150.166)     self._resp = await self._coro
(ServeReplica:router:Router pid=265, ip=10.0.150.166)   File "/home/ray/anaconda3/lib/python3.9/site-packages/aiohttp/client.py", line 643, in _request
(ServeReplica:router:Router pid=265, ip=10.0.150.166)     resp.raise_for_status()
(ServeReplica:router:Router pid=265, ip=10.0.150.166)   File "/home/ray/anaconda3/lib/python3.9/site-packages/aiohttp/client_reqrep.py", line 1005, in raise_for_status
(ServeReplica:router:Router pid=265, ip=10.0.150.166)     raise ClientResponseError(
(ServeReplica:router:Router pid=265, ip=10.0.150.166) aiohttp.client_exceptions.ClientResponseError: 404, message='Not Found', url=URL('http://localhost:8000/meta-llama--Llama-2-70b-chat-hf/stream')
(ServeReplica:router:Router pid=265, ip=10.0.150.166) INFO 2023-09-22 16:14:03,471 Router router#Router#shBjkX 2e2d3085-f043-473d-920b-fbebe1572747 /v1/chat/completions router replica.py:741 - __CALL__ ERROR 49.6ms

Ray LLM on Nvidia RTX series?

I am trying to deploy sharded LLMs to multiple RTX 3090s. So far, I have tried TGI by HF and it works fine. However, I came across RayLLM at the last NLP summit and am curious whether Aviary supports the RTX series too. So far, the pre-configured YAML files only point to A100, A10, and V100. Any leads to docs or a sample configuration would be helpful. Thank you!

Error loading model from local filesystem

models/README.md says "For loading a model from file system, set engine_config.hf_model_id to an absolute filesystem path accessible from every node in the cluster."

I ran:

sudo docker run -it --gpus all --shm-size 1g -p 8000:8000 -e HF_HOME=/data -v "$(pwd)"/data:/data -v "$(pwd)"/models:/models anyscale/aviary:latest bash
aviary run --model /models/myconfig.yaml

with /models/myconfig.yaml having hf_model_id: /models/llama-2-13b-chat.ggmlv3.q4_1.bin

The output was:

(ServeReplica:meta-llama--Llama-2-13b-chat-hf_meta-llama--Llama-2-13b-chat-hf pid=17143)   File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/serve/_private/replica.py", line 417, in initialize_and_get_metadata [repeated 4x across cluster]
(ServeReplica:meta-llama--Llama-2-13b-chat-hf_meta-llama--Llama-2-13b-chat-hf pid=17143) Traceback (most recent call last):
(ServeReplica:meta-llama--Llama-2-13b-chat-hf_meta-llama--Llama-2-13b-chat-hf pid=17143)   File "/home/ray/anaconda3/lib/python3.9/concurrent/futures/_base.py", line 439, in result
(ServeReplica:meta-llama--Llama-2-13b-chat-hf_meta-llama--Llama-2-13b-chat-hf pid=17143)     return self.__get_result()
(ServeReplica:meta-llama--Llama-2-13b-chat-hf_meta-llama--Llama-2-13b-chat-hf pid=17143)   File "/home/ray/anaconda3/lib/python3.9/concurrent/futures/_base.py", line 391, in __get_result
(ServeReplica:meta-llama--Llama-2-13b-chat-hf_meta-llama--Llama-2-13b-chat-hf pid=17143)     raise self._exception
(ServeReplica:meta-llama--Llama-2-13b-chat-hf_meta-llama--Llama-2-13b-chat-hf pid=17143)     raise RuntimeError(traceback.format_exc()) from None
(ServeReplica:meta-llama--Llama-2-13b-chat-hf_meta-llama--Llama-2-13b-chat-hf pid=17143) RuntimeError: Traceback (most recent call last):
(ServeReplica:meta-llama--Llama-2-13b-chat-hf_meta-llama--Llama-2-13b-chat-hf pid=17143)     await self.replica.update_user_config(
(ServeReplica:meta-llama--Llama-2-13b-chat-hf_meta-llama--Llama-2-13b-chat-hf pid=17143)   File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/serve/_private/replica.py", line 688, in update_user_config
(ServeReplica:meta-llama--Llama-2-13b-chat-hf_meta-llama--Llama-2-13b-chat-hf pid=17143)     await reconfigure_method(user_config)
(ServeReplica:meta-llama--Llama-2-13b-chat-hf_meta-llama--Llama-2-13b-chat-hf pid=17143)   File "/home/ray/anaconda3/lib/python3.9/site-packages/aviary/backend/server/routers/model_app.py", line 82, in reconfigure
(ServeReplica:meta-llama--Llama-2-13b-chat-hf_meta-llama--Llama-2-13b-chat-hf pid=17143)     await self.engine.start()
(ServeReplica:meta-llama--Llama-2-13b-chat-hf_meta-llama--Llama-2-13b-chat-hf pid=17143)   File "/home/ray/anaconda3/lib/python3.9/site-packages/aviary/backend/llm/engine/tgi.py", line 165, in start
(ServeReplica:meta-llama--Llama-2-13b-chat-hf_meta-llama--Llama-2-13b-chat-hf pid=17143)     self.new_worker_group = await self._create_worker_group(
(ServeReplica:meta-llama--Llama-2-13b-chat-hf_meta-llama--Llama-2-13b-chat-hf pid=17143)   File "/home/ray/anaconda3/lib/python3.9/site-packages/aviary/backend/observability/fn_call_metrics.py", line 126, in async_wrapper
(ServeReplica:meta-llama--Llama-2-13b-chat-hf_meta-llama--Llama-2-13b-chat-hf pid=17143)     return await wrapped(*args, **kwargs)
(ServeReplica:meta-llama--Llama-2-13b-chat-hf_meta-llama--Llama-2-13b-chat-hf pid=17143)   File "/home/ray/anaconda3/lib/python3.9/site-packages/aviary/backend/llm/engine/tgi.py", line 364, in _create_worker_group
(ServeReplica:meta-llama--Llama-2-13b-chat-hf_meta-llama--Llama-2-13b-chat-hf pid=17143)     _ = AutoTokenizer.from_pretrained(llm_config.actual_hf_model_id)
(ServeReplica:meta-llama--Llama-2-13b-chat-hf_meta-llama--Llama-2-13b-chat-hf pid=17143)   File "/home/ray/anaconda3/lib/python3.9/site-packages/transformers/models/auto/tokenization_auto.py", line 677, in from_pretrained
(ServeReplica:meta-llama--Llama-2-13b-chat-hf_meta-llama--Llama-2-13b-chat-hf pid=17143)     tokenizer_config = get_tokenizer_config(pretrained_model_name_or_path, **kwargs)
(ServeReplica:meta-llama--Llama-2-13b-chat-hf_meta-llama--Llama-2-13b-chat-hf pid=17143)   File "/home/ray/anaconda3/lib/python3.9/site-packages/transformers/models/auto/tokenization_auto.py", line 510, in get_tokenizer_config
(ServeReplica:meta-llama--Llama-2-13b-chat-hf_meta-llama--Llama-2-13b-chat-hf pid=17143)     resolved_config_file = cached_file(
(ServeReplica:meta-llama--Llama-2-13b-chat-hf_meta-llama--Llama-2-13b-chat-hf pid=17143)   File "/home/ray/anaconda3/lib/python3.9/site-packages/transformers/utils/hub.py", line 428, in cached_file
(ServeReplica:meta-llama--Llama-2-13b-chat-hf_meta-llama--Llama-2-13b-chat-hf pid=17143)     resolved_file = hf_hub_download(
(ServeReplica:meta-llama--Llama-2-13b-chat-hf_meta-llama--Llama-2-13b-chat-hf pid=17143)   File "/home/ray/anaconda3/lib/python3.9/site-packages/huggingface_hub/utils/_validators.py", line 110, in _inner_fn
(ServeReplica:meta-llama--Llama-2-13b-chat-hf_meta-llama--Llama-2-13b-chat-hf pid=17143)     validate_repo_id(arg_value)
(ServeReplica:meta-llama--Llama-2-13b-chat-hf_meta-llama--Llama-2-13b-chat-hf pid=17143)   File "/home/ray/anaconda3/lib/python3.9/site-packages/huggingface_hub/utils/_validators.py", line 158, in validate_repo_id
(ServeReplica:meta-llama--Llama-2-13b-chat-hf_meta-llama--Llama-2-13b-chat-hf pid=17143)     raise HFValidationError(
(ServeReplica:meta-llama--Llama-2-13b-chat-hf_meta-llama--Llama-2-13b-chat-hf pid=17143) huggingface_hub.utils._validators.HFValidationError: Repo id must be in the form 'repo_name' or 'namespace/repo_name': '/models/llama-2-13b-chat.ggmlv3.q4_1.bin'

This looks like hf_model_id is being validated as a Hugging Face repo name and can't be an absolute path.

VLLM Ray Workers are being killed by GCS

Hi, we are running into a weird issue where the Ray Workers created by VLLM are being killed, even though the deployment itself stays alive. The effect of this is that when you make a request to a model deployment, the following error message occurs, since the workers are already dead but AviaryLLMEngine still has a reference to them via the workers object property:

return (yield from awaitable.__await__())
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.
    class_name: RayWorker
    actor_id: 1963424394259fade0c44c5501000000
    pid: 714
    namespace: _ray_internal_dashboard
    ip: 100.64.144.72
The actor is dead because all references to the actor were removed

(...)

    raise AsyncEngineDeadError(
vllm.engine.async_llm_engine.AsyncEngineDeadError: Task finished unexpectedly. This should never happen! Please open an issue on Github. See stack trace above for the actual cause.

In the logs, it appears that the Ray Workers created by VLLM are being killed by GCS with the following message:

[2023-10-27 15:56:14,963 I 2203 2281] core_worker.cc:3737: Force kill actor request has received. exiting immediately... The actor is dead because all references to the actor were removed.
[2023-10-27 15:56:14,963 W 2203 2281] core_worker.cc:857: Force exit the process.  Details: Worker exits because the actor is killed. The actor is dead because all references to the actor were removed.
[2023-10-27 15:56:14,965 I 2203 2281] core_worker.cc:759: Try killing all child processes of this worker as it exits. Child process pids:
[2023-10-27 15:56:14,965 I 2203 2281] core_worker.cc:718: Disconnecting to the raylet.
[2023-10-27 15:56:14,965 I 2203 2281] raylet_client.cc:163: RayletClient::Disconnect, exit_type=INTENDED_SYSTEM_EXIT, exit_detail=Worker exits because the actor is killed. The actor is dead because all references to the actor were removed., has creation_task_exception_pb_bytes=0

More specifically, the part of the message that reads "The actor is dead because all references to the actor were removed" appears to indicate that GCS is killing the Ray Workers because it believes that there are no references left to the Ray Workers from anywhere.

However, I don't understand how this could be the case, since the top-level Ray Serve deployment is still alive and the deployment holds on to the VLLMEngine as a reference. VLLMEngine holds on to AviaryAsyncLLMEngine as self.engine, and the AviaryAsyncLLMEngine holds on to AviaryLLMEngine which has a reference to the workers as self.workers.

If the top-level deployment hasn't died, I don't see how the reference count on the workers could have been decremented/why GCS would think that these actors are out of scope.

Deploy 2 models via `aviary run` - `aviary model list` only displays 1

If I deploy 2 models with separate aviary run commands, aviary model list only displays the last one deployed; however, in the Ray dashboard it looks like two are running.

Is it possible to deploy two models with separate aviary run commands? If it is, aviary model list should display both models.

Error running TheBloke--Llama-2-70B-chat-GPTQ

Hi Aviary Team,

Testing out the new update and facing the following error.

I am using the default YAML file with the Docker image "anyscale/aviary:latest-tgi" and TheBloke--Llama-2-70B-chat-GPTQ.

Attaching serve controller log file
serve_controller_502.log

ERROR 2023-07-28 22:38:24,144 controller 502 deployment_state.py:567 - Exception in replica 'TheBloke--Llama-2-70B-chat-GPTQ_TheBloke--Llama-2-70B-chat-GPTQ#CtvupC', the replica will be stopped. Traceback (most recent call last): File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/serve/_private/deployment_state.py", line 565, in check_ready _, self._version = ray.get(self._ready_obj_ref) File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/_private/auto_init_hook.py", line 24, in auto_init_wrapper return fn(*args, **kwargs) File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper return func(*args, **kwargs) File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/_private/worker.py", line 2520, in get raise value.as_instanceof_cause() ray.exceptions.RayTaskError(RuntimeError): �[36mray::ServeReplica:TheBloke--Llama-2-70B-chat-GPTQ_TheBloke--Llama-2-70B-chat-GPTQ.initialize_and_get_metadata()�[39m (pid=19572, ip=172.31.18.182, actor_id=d3085fba2f6070e53474e80601000000, repr=<ray.serve._private.replica.ServeReplica:TheBloke--Llama-2-70B-chat-GPTQ_TheBloke--Llama-2-70B-chat-GPTQ object at 0x7f6c63ad8be0>) File "/home/ray/anaconda3/lib/python3.10/concurrent/futures/_base.py", line 451, in result return self.__get_result() File "/home/ray/anaconda3/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result raise self._exception File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/serve/_private/replica.py", line 413, in initialize_and_get_metadata raise RuntimeError(traceback.format_exc()) from None RuntimeError: Traceback (most recent call last): File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/serve/_private/replica.py", line 403, in initialize_and_get_metadata await self.replica.update_user_config( File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/serve/_private/replica.py", line 638, in update_user_config await reconfigure_method(user_config) File "/home/ray/anaconda3/lib/python3.10/site-packages/aviary/backend/server/app.py", line 93, in reconfigure await self.predictor.rollover( File "/home/ray/anaconda3/lib/python3.10/site-packages/aviary/backend/llm/predictor/predictor.py", line 374, in rollover self.new_worker_group = await self._create_worker_group( File "/home/ray/anaconda3/lib/python3.10/site-packages/aviary/backend/llm/predictor/continuous_batching_predictor.py", line 297, in _create_worker_group worker_group = await super()._create_worker_group( File "/home/ray/anaconda3/lib/python3.10/site-packages/aviary/backend/llm/predictor/predictor.py", line 484, in _create_worker_group worker_group = await self._start_prediction_workers( File "/home/ray/anaconda3/lib/python3.10/site-packages/aviary/backend/llm/predictor/continuous_batching_predictor.py", line 273, in _start_prediction_workers worker_group = await super()._start_prediction_workers( File "/home/ray/anaconda3/lib/python3.10/site-packages/aviary/backend/llm/predictor/predictor.py", line 409, in _start_prediction_workers await asyncio.gather( File "/home/ray/anaconda3/lib/python3.10/asyncio/tasks.py", line 650, in _wrap_awaitable return (yield from awaitable.__await__()) ray.exceptions.RayTaskError(AttributeError): �[36mray::ContinuousBatchingPredictionWorker.init_model()�[39m (pid=19871, ip=172.31.18.182, actor_id=22ce6fec997530eb25d9abdc01000000, repr=ContinuousBatchingPredictionWorker:TheBloke/Llama-2-70B-chat-GPTQ) File "/home/ray/anaconda3/lib/python3.10/site-packages/aviary/backend/llm/predictor/predictor.py", line 
131, in init_model self.generator = init_model( File "/home/ray/anaconda3/lib/python3.10/site-packages/aviary/backend/llm/utils.py", line 90, in inner ret = func(*args, **kwargs) File "/home/ray/anaconda3/lib/python3.10/site-packages/aviary/backend/llm/predictor/predictor.py", line 73, in init_model pipeline = get_pipeline_cls_by_name(pipeline_name).from_initializer( File "/home/ray/anaconda3/lib/python3.10/site-packages/aviary/backend/llm/pipelines/_base.py", line 43, in from_initializer model, tokenizer = initializer.load(model_id) File "/home/ray/anaconda3/lib/python3.10/site-packages/aviary/backend/llm/initializers/hf_transformers/base.py", line 57, in load model = self.load_model(model_id) File "/home/ray/anaconda3/lib/python3.10/site-packages/aviary/backend/llm/initializers/tgi.py", line 51, in load_model return TGIInferenceWorker( File "/home/ray/anaconda3/lib/python3.10/site-packages/aviary/backend/llm/continuous/tgi/tgi_worker.py", line 452, in __init__ with patch( File "/home/ray/anaconda3/lib/python3.10/unittest/mock.py", line 1437, in __enter__ original, local = self.get_original() File "/home/ray/anaconda3/lib/python3.10/unittest/mock.py", line 1410, in get_original raise AttributeError( AttributeError: <class 'text_generation_server.utils.weights.Weights'> does not have the attribute '_set_gptq_params' INFO 2023-07-28 22:38:24,144 controller 502 deployment_state.py:887 - Stopping replica TheBloke--Llama-2-70B-chat-GPTQ_TheBloke--Llama-2-70B-chat-GPTQ#CtvupC for deployment TheBloke--Llama-2-70B-chat-GPTQ_TheBloke--Llama-2-70B-chat-GPTQ. INFO 2023-07-28 22:39:16,555 controller 502 deployment_state.py:1586 - Adding 1 replica to deployment TheBloke--Llama-2-70B-chat-GPTQ_TheBloke--Llama-2-70B-chat-GPTQ. INFO 2023-07-28 22:39:16,555 controller 502 deployment_state.py:331 - Starting replica TheBloke--Llama-2-70B-chat-GPTQ_TheBloke--Llama-2-70B-chat-GPTQ#QPfLlB for deployment TheBloke--Llama-2-70B-chat-GPTQ_TheBloke--Llama-2-70B-chat-GPTQ. ERROR 2023-07-28 22:39:27,663 controller 502 deployment_state.py:567 - Exception in replica 'TheBloke--Llama-2-70B-chat-GPTQ_TheBloke--Llama-2-70B-chat-GPTQ#QPfLlB', the replica will be stopped. 
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/serve/_private/deployment_state.py", line 565, in check_ready
    _, self._version = ray.get(self._ready_obj_ref)
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/_private/auto_init_hook.py", line 24, in auto_init_wrapper
    return fn(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/_private/worker.py", line 2520, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(RuntimeError): ray::ServeReplica:TheBloke--Llama-2-70B-chat-GPTQ_TheBloke--Llama-2-70B-chat-GPTQ.initialize_and_get_metadata() (pid=20115, ip=172.31.18.182, actor_id=80f90fd080031632a4d414b001000000, repr=<ray.serve._private.replica.ServeReplica:TheBloke--Llama-2-70B-chat-GPTQ_TheBloke--Llama-2-70B-chat-GPTQ object at 0x7eeee0588b50>)
  File "/home/ray/anaconda3/lib/python3.10/concurrent/futures/_base.py", line 451, in result
    return self.__get_result()
  File "/home/ray/anaconda3/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
    raise self._exception
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/serve/_private/replica.py", line 413, in initialize_and_get_metadata
    raise RuntimeError(traceback.format_exc()) from None
RuntimeError: Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/serve/_private/replica.py", line 403, in initialize_and_get_metadata
    await self.replica.update_user_config(
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/serve/_private/replica.py", line 638, in update_user_config
    await reconfigure_method(user_config)
  File "/home/ray/anaconda3/lib/python3.10/site-packages/aviary/backend/server/app.py", line 93, in reconfigure
    await self.predictor.rollover(
  File "/home/ray/anaconda3/lib/python3.10/site-packages/aviary/backend/llm/predictor/predictor.py", line 374, in rollover
    self.new_worker_group = await self._create_worker_group(
  File "/home/ray/anaconda3/lib/python3.10/site-packages/aviary/backend/llm/predictor/continuous_batching_predictor.py", line 297, in _create_worker_group
    worker_group = await super()._create_worker_group(
  File "/home/ray/anaconda3/lib/python3.10/site-packages/aviary/backend/llm/predictor/predictor.py", line 484, in _create_worker_group
    worker_group = await self._start_prediction_workers(
  File "/home/ray/anaconda3/lib/python3.10/site-packages/aviary/backend/llm/predictor/continuous_batching_predictor.py", line 273, in _start_prediction_workers
    worker_group = await super()._start_prediction_workers(
  File "/home/ray/anaconda3/lib/python3.10/site-packages/aviary/backend/llm/predictor/predictor.py", line 409, in _start_prediction_workers
    await asyncio.gather(
  File "/home/ray/anaconda3/lib/python3.10/asyncio/tasks.py", line 650, in _wrap_awaitable
    return (yield from awaitable.__await__())
ray.exceptions.RayTaskError(AttributeError): ray::ContinuousBatchingPredictionWorker.init_model() (pid=20477, ip=172.31.18.182, actor_id=b0081819c1527f5aee87c50b01000000, repr=ContinuousBatchingPredictionWorker:TheBloke/Llama-2-70B-chat-GPTQ)
  File "/home/ray/anaconda3/lib/python3.10/site-packages/aviary/backend/llm/predictor/predictor.py", line 131, in init_model
    self.generator = init_model(
  File "/home/ray/anaconda3/lib/python3.10/site-packages/aviary/backend/llm/utils.py", line 90, in inner
    ret = func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.10/site-packages/aviary/backend/llm/predictor/predictor.py", line 73, in init_model
    pipeline = get_pipeline_cls_by_name(pipeline_name).from_initializer(
  File "/home/ray/anaconda3/lib/python3.10/site-packages/aviary/backend/llm/pipelines/_base.py", line 43, in from_initializer
    model, tokenizer = initializer.load(model_id)
  File "/home/ray/anaconda3/lib/python3.10/site-packages/aviary/backend/llm/initializers/hf_transformers/base.py", line 57, in load
    model = self.load_model(model_id)
  File "/home/ray/anaconda3/lib/python3.10/site-packages/aviary/backend/llm/initializers/tgi.py", line 51, in load_model
    return TGIInferenceWorker(
  File "/home/ray/anaconda3/lib/python3.10/site-packages/aviary/backend/llm/continuous/tgi/tgi_worker.py", line 452, in __init__
    with patch(
  File "/home/ray/anaconda3/lib/python3.10/unittest/mock.py", line 1437, in __enter__
    original, local = self.get_original()
  File "/home/ray/anaconda3/lib/python3.10/unittest/mock.py", line 1410, in get_original
    raise AttributeError(
AttributeError: <class 'text_generation_server.utils.weights.Weights'> does not have the attribute '_set_gptq_params'
INFO 2023-07-28 22:39:27,664 controller 502 deployment_state.py:887 - Stopping replica TheBloke--Llama-2-70B-chat-GPTQ_TheBloke--Llama-2-70B-chat-GPTQ#QPfLlB for deployment TheBloke--Llama-2-70B-chat-GPTQ_TheBloke--Llama-2-70B-chat-GPTQ.
INFO 2023-07-28 22:40:20,799 controller 502 deployment_state.py:1586 - Adding 1 replica to deployment TheBloke--Llama-2-70B-chat-GPTQ_TheBloke--Llama-2-70B-chat-GPTQ.
INFO 2023-07-28 22:40:20,799 controller 502 deployment_state.py:331 - Starting replica TheBloke--Llama-2-70B-chat-GPTQ_TheBloke--Llama-2-70B-chat-GPTQ#mgcjuf for deployment TheBloke--Llama-2-70B-chat-GPTQ_TheBloke--Llama-2-70B-chat-GPTQ.
ERROR 2023-07-28 22:40:31,807 controller 502 deployment_state.py:567 - Exception in replica 'TheBloke--Llama-2-70B-chat-GPTQ_TheBloke--Llama-2-70B-chat-GPTQ#mgcjuf', the replica will be stopped.
The replacement replica (#mgcjuf) fails with the identical AttributeError traceback and is stopped as well:
INFO 2023-07-28 22:40:31,808 controller 502 deployment_state.py:887 - Stopping replica TheBloke--Llama-2-70B-chat-GPTQ_TheBloke--Llama-2-70B-chat-GPTQ#mgcjuf for deployment TheBloke--Llama-2-70B-chat-GPTQ_TheBloke--Llama-2-70B-chat-GPTQ.

No example for quantized model

Currently, the Llama 2 model requires a significant number of GPUs to serve. Would it be possible to add support for quantized models to Ray-LLM? This would allow us to reduce the hardware requirements for serving Llama 2 models, making them more accessible to a wider range of users.

I have not been able to find any examples of quantized model serving for Ray-LLM, so it is not clear if this is currently supported.
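
For reference, quantized checkpoints can already be loaded by the vLLM engine that RayLLM builds on. Below is a minimal vLLM-level sketch, not RayLLM configuration; the model id, quantization method, and GPU count are assumptions chosen for illustration:

# Minimal vLLM-level sketch; assumes an AWQ-quantized checkpoint published on the Hub.
from vllm import LLM, SamplingParams

llm = LLM(
    model="TheBloke/Llama-2-13B-chat-AWQ",  # assumed quantized checkpoint id
    quantization="awq",                     # tell vLLM the weights are AWQ-quantized
    tensor_parallel_size=1,                 # quantization lets this fit on a single GPU
)

outputs = llm.generate(
    ["How do you make fried rice?"],
    SamplingParams(temperature=0.7, max_tokens=128),
)
print(outputs[0].outputs[0].text)

A first-class example would presumably wire the equivalent quantization settings through a model YAML in the models directory, which is what this issue is asking for.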

Unable to run aviary on V100 GPU.

I was trying to run llama-2 on a machine with V100 GPU.

I ran aviary run --model ~/models/continuous_batching/meta-llama--Llama-2-7b-chat-hf.yaml inside the docker container and got a stack trace:

(HTTPProxyActor pid=2448) INFO 2023-08-22 23:38:44,774 http_proxy 172.17.0.2 http_proxy.py:904 - Proxy actor f5a0692e60801e1b0ef45a8301000000 starting on node 57297f3255438333c74bdc7b75d3fd3aa4b1c48e7bdcf6d07db72a41.
[INFO 2023-08-22 23:38:44,824] api.py: 320  Started detached Serve instance in namespace "serve".
(HTTPProxyActor pid=2448) INFO:     Started server process [2448]
[INFO 2023-08-22 23:38:44,951] api.py: 300  Connecting to existing Serve app in namespace "serve". New http options will not be applied.
(ServeController pid=2420) INFO 2023-08-22 23:38:44,942 controller 2420 deployment_state.py:1319 - Deploying new version of deployment meta-llama--Llama-2-7b-chat-hf_meta-llama--Llama-2-7b-chat-hf.
(ServeController pid=2420) INFO 2023-08-22 23:38:45,046 controller 2420 deployment_state.py:1586 - Adding 1 replica to deployment meta-llama--Llama-2-7b-chat-hf_meta-llama--Llama-2-7b-chat-hf.
(ServeController pid=2420) INFO 2023-08-22 23:38:45,083 controller 2420 deployment_state.py:1319 - Deploying new version of deployment router_Router.
(ServeController pid=2420) INFO 2023-08-22 23:38:45,187 controller 2420 deployment_state.py:1586 - Adding 2 replicas to deployment router_Router.
(ServeReplica:router_Router pid=2480) There was a problem when trying to write in your cache folder (/home/jupyter/cache/data/hub). You should set the environment variable TRANSFORMERS_CACHE to a writable directory.
(autoscaler +15s) Tip: use `ray status` to view detailed cluster status. To disable these messages, set RAY_SCHEDULER_EVENTS=0.
(autoscaler +15s) Error: No available node types can fulfill resource request {'accelerator_type_a10': 0.01, 'CPU': 1.0}. Add suitable node types to this cluster to resolve this issue.
(ServeController pid=2420) WARNING 2023-08-22 23:39:15,112 controller 2420 deployment_state.py:1889 - Deployment "meta-llama--Llama-2-7b-chat-hf_meta-llama--Llama-2-7b-chat-hf" has 1 replicas that have taken more than 30s to be scheduled. This may be caused by waiting for the cluster to auto-scale, or waiting for a runtime environment to install. Resources required for each replica: {"accelerator_type_a10": 0.01, "CPU": 1}, resources available: {"CPU": 14.0}.
(ServeReplica:router_Router pid=2479) There was a problem when trying to write in your cache folder (/home/jupyter/cache/data/hub). You should set the environment variable TRANSFORMERS_CACHE to a writable directory.
(autoscaler +50s) Error: No available node types can fulfill resource request {'accelerator_type_a10': 0.01, 'CPU': 1.0}. Add suitable node types to this cluster to resolve this issue.

Is aviary incompatible with V100 GPUs?
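
The logs above show the scheduler waiting on a custom accelerator_type_a10 resource that a V100 node never advertises, so this looks like a resource mismatch rather than a hard incompatibility. A quick diagnostic (a sketch, not from the original report) is to compare what the cluster actually exposes against what the model YAML requests:

import ray

ray.init(address="auto")  # attach to the running cluster inside the container
available = ray.cluster_resources()
print(available)  # e.g. {'CPU': 14.0, 'GPU': 1.0, ...} with no 'accelerator_type_a10' key

if "accelerator_type_a10" not in available:
    print("The model config requests accelerator_type_a10, which no node advertises; "
          "adjust the model YAML's resource requirements (or advertise a matching "
          "custom resource) before deploying on V100s.")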

Leaderboard can be manipulated

Users aren't currently limited to one 'best response' vote per submitted prompt, nor are they forced to change their existing vote in order to select a different option.

Steps to reproduce:

  • Submit prompt
  • Select Best Answer X times
  • Refresh leaderboard

Loom Video

rayllm's frontend can't work properly via rayllm:0.4.0 image

Reproduction procedure

  1. deploy rayllm locally
cache_dir=${XDG_CACHE_HOME:-$HOME/.cache}

docker run -it --gpus all --shm-size 1g -p 8000:8000 -e HF_HOME=~/data -v $cache_dir:~/data anyscale/ray-llm:0.4.0 bash
# Inside docker container
serve run ~/serve_configs/amazon--LightGPT.yaml --host 0.0.0.0 --non-blocking
export AVIARY_URL=http://localhost:8000
serve run rayllm.frontend.app:app --blocking --host 0.0.0.0
  2. And get the following error message
(ServeController pid=683)   File "/home/ray/anaconda3/lib/python3.9/concurrent/futures/_base.py", line 439, in result
(ServeController pid=683)     return self.__get_result()
(ServeController pid=683)   File "/home/ray/anaconda3/lib/python3.9/concurrent/futures/_base.py", line 391, in __get_result
(ServeController pid=683)     raise self._exception
(ServeController pid=683)   File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/serve/_private/replica.py", line 442, in initialize_and_get_metadata
(ServeController pid=683)     raise RuntimeError(traceback.format_exc()) from None
(ServeController pid=683) RuntimeError: Traceback (most recent call last):
(ServeController pid=683)   File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/serve/_private/replica.py", line 430, in initialize_and_get_metadata
(ServeController pid=683)     await self._initialize_replica()
(ServeController pid=683)   File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/serve/_private/replica.py", line 190, in initialize_replica
(ServeController pid=683)     await sync_to_async(_callable.__init__)(*init_args, **init_kwargs)
(ServeController pid=683)   File "/home/ray/anaconda3/lib/python3.9/site-packages/rayllm/frontend/app.py", line 470, in __init__
(ServeController pid=683)     blocks = builder()
(ServeController pid=683)   File "/home/ray/anaconda3/lib/python3.9/site-packages/rayllm/frontend/app.py", line 323, in gradio_app_builder
(ServeController pid=683)     JavaScriptLoader()
(ServeController pid=683)   File "/home/ray/anaconda3/lib/python3.9/site-packages/rayllm/frontend/javascript_loader.py", line 38, in __init__
(ServeController pid=683)     self.load_js()
(ServeController pid=683)   File "/home/ray/anaconda3/lib/python3.9/site-packages/rayllm/frontend/javascript_loader.py", line 42, in load_js
(ServeController pid=683)     js_scripts = ScriptLoader.get_scripts(self.path, self.script_type)
(ServeController pid=683)   File "/home/ray/anaconda3/lib/python3.9/site-packages/rayllm/frontend/javascript_loader.py", line 25, in get_scripts
(ServeController pid=683)     dir_list = [os.path.join(path, f) for f in os.listdir(path)]
(ServeController pid=683) FileNotFoundError: [Errno 2] No such file or directory: '/home/ray/anaconda3/lib/python3.9/site-packages/rayllm/frontend/javascript'
  3. Add javascript/aviary.js from repo to /home/ray/anaconda3/lib/python3.9/site-packages/rayllm/frontend/ manually, then get the following error message
(ServeController pid=2203) ERROR 2023-11-02 19:24:56,145 controller 2203 deployment_state.py:617 - Exception in replica 'default#AviaryFrontend#sVQxkR', the replica will be stopped.
(ServeController pid=2203) Traceback (most recent call last):
(ServeController pid=2203)   File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/serve/_private/deployment_state.py", line 615, in check_ready
(ServeController pid=2203)     _, self._version = ray.get(self._ready_obj_ref)
(ServeController pid=2203)   File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/_private/auto_init_hook.py", line 24, in auto_init_wrapper
(ServeController pid=2203)     return fn(*args, **kwargs)
(ServeController pid=2203)   File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
(ServeController pid=2203)     return func(*args, **kwargs)
(ServeController pid=2203)   File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/_private/worker.py", line 2547, in get
(ServeController pid=2203)     raise value.as_instanceof_cause()                                                                                                                            
(ServeController pid=2203) ray.exceptions.RayTaskError(RuntimeError): ray::ServeReplica:default:AviaryFrontend.initialize_and_get_metadata() (pid=2617, ip=172.23.0.3, actor_id=80f2435baff9090308abf9ee08000000, repr=<ray.serve._private.replica.ServeReplica:default:AviaryFrontend object at 0x7f54fdad1130>)
(ServeController pid=2203)   File "/home/ray/anaconda3/lib/python3.9/concurrent/futures/_base.py", line 446, in result
(ServeController pid=2203)     return self.__get_result()
(ServeController pid=2203)   File "/home/ray/anaconda3/lib/python3.9/concurrent/futures/_base.py", line 391, in __get_result
(ServeController pid=2203)     raise self._exception
(ServeController pid=2203)   File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/serve/_private/replica.py", line 442, in initialize_and_get_metadata
(ServeController pid=2203)     raise RuntimeError(traceback.format_exc()) from None
(ServeController pid=2203) RuntimeError: Traceback (most recent call last):
(ServeController pid=2203)   File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/serve/_private/replica.py", line 430, in initialize_and_get_metadata
(ServeController pid=2203)     await self._initialize_replica()
(ServeController pid=2203)   File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/serve/_private/replica.py", line 190, in initialize_replica
(ServeController pid=2203)     await sync_to_async(_callable.__init__)(*init_args, **init_kwargs)
(ServeController pid=2203)   File "/home/ray/anaconda3/lib/python3.9/site-packages/rayllm/frontend/app.py", line 487, in __init__
(ServeController pid=2203)     blocks._queue.set_url(f"http://localhost:{port}{route_prefix}/")
(ServeController pid=2203) AttributeError: 'Queue' object has no attribute 'set_url'
  4. Edit /home/ray/anaconda3/lib/python3.9/site-packages/rayllm/frontend/app.py, comment out lines 487-488, and restart the frontend
487         #blocks._queue.set_url(f"http://localhost:{port}{route_prefix}/")
488         #blocks._queue.set_url = noop
  5. Startup now appears to succeed in the terminal, but the page is broken when visited from a browser

Error downloading and running model on clean deploy

The following error is encountered when trying to run
aviary run --model ./models/amazon--LightGPT.yaml
as per the README setup, i.e. after doing the following steps:

# Setup AWS env vars

# Perform the aviary cluster setup
git clone https://github.com/ray-project/aviary.git
cd aviary
ray up deploy/ray/aviary-cluster.yaml
ray attach deploy/ray/aviary-cluster.yaml

# The command with error
aviary run --model ./models/amazon--LightGPT.yaml

The key error line is believed to be the following:

...
RuntimeError: Deployment default_amazon--LightGPT is UNHEALTHY: The Deployment failed to start 3 times in a row. This may be due to a problem with the deployment 
constructor or the initial health check failing. See controller logs for details. Retrying after 1 seconds. Error:
ray::ServeReplica:default_amazon--LightGPT.is_initialized() (pid=1259, ip=172.31.76.164, actor_id=74666bc8c5fe4e8e2f51f68801000000, 
repr=<ray.serve._private.replica.ServeReplica:default_amazon--LightGPT object at 0x7f5193312f50>)
  File "/home/ray/anaconda3/lib/python3.10/concurrent/futures/_base.py", line 451, in result
    return self.__get_result()
  File "/home/ray/anaconda3/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
    raise self._exception
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/serve/_private/replica.py", line 338, in is_initialized
    raise RuntimeError(traceback.format_exc()) from None
RuntimeError: Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/serve/_private/replica.py", line 330, in is_initialized
    metadata = await self.reconfigure(deployment_config)
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/serve/_private/replica.py", line 347, in reconfigure
    raise RuntimeError(traceback.format_exc()) from None
RuntimeError: Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/serve/_private/replica.py", line 344, in reconfigure
    await self.replica.reconfigure(deployment_config)
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/serve/_private/replica.py", line 631, in reconfigure
    await reconfigure_method(self.deployment_config.user_config)
  File "/home/ray/anaconda3/lib/python3.10/site-packages/aviary/backend/server/app.py", line 97, in reconfigure
    await self.rollover(
  File "/home/ray/anaconda3/lib/python3.10/site-packages/aviary/backend/llm/predictor.py", line 268, in rollover
    self.new_worker_group = await self._create_worker_group(
  File "/home/ray/anaconda3/lib/python3.10/site-packages/aviary/backend/llm/predictor.py", line 340, in _create_worker_group
    await asyncio.gather(
  File "/home/ray/anaconda3/lib/python3.10/asyncio/tasks.py", line 650, in _wrap_awaitable
    return (yield from awaitable.__await__())
ray.exceptions.RayTaskError(OSError): ray::PredictionWorker.init_model() (pid=9567, ip=172.31.37.30, actor_id=872d1d4babacb73015b0edde01000000, 
repr=PredictionWorker:amazon/LightGPT)
  File "/home/ray/anaconda3/lib/python3.10/site-packages/aviary/backend/llm/predictor.py", line 176, in init_model
    self.generator = init_model(
  File "/home/ray/anaconda3/lib/python3.10/site-packages/aviary/backend/llm/utils.py", line 83, in inner
    ret = func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.10/site-packages/aviary/backend/llm/predictor.py", line 67, in init_model
    pipeline = get_pipeline_cls_by_name(pipeline_name).from_initializer(
  File "/home/ray/anaconda3/lib/python3.10/site-packages/aviary/backend/llm/pipelines/_base.py", line 79, in from_initializer
    model, tokenizer = initializer.load(model_id)
  File "/home/ray/anaconda3/lib/python3.10/site-packages/aviary/backend/llm/initializers/hf_transformers/base.py", line 57, in load
    model = self.load_model(model_id)
  File "/home/ray/anaconda3/lib/python3.10/site-packages/aviary/backend/llm/initializers/hf_transformers/deepspeed.py", line 132, in load_model
    model = AutoModelForCausalLM.from_pretrained(
  File "/home/ray/anaconda3/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 467, in from_pretrained
    return model_class.from_pretrained(
  File "/home/ray/anaconda3/lib/python3.10/site-packages/transformers/modeling_utils.py", line 2387, in from_pretrained
    raise EnvironmentError(
OSError: Error no file named pytorch_model.bin, tf_model.h5, model.ckpt.index or flax_model.msgpack found in directory 
/home/ray/.cache/huggingface/hub/models--amazon--LightGPT/snapshots/ee9e7bc83ff435561d0bacfdf8dd2eeb6a5c6f9f.

Full error log as per attached

aviary-error.log
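
The OSError above usually indicates that the Hugging Face snapshot in the cache contains no weight files, for example after an interrupted download. One way to check, sketched below under the assumption that the head node has Hub access, is to re-pull the snapshot and list its contents:

import os
from huggingface_hub import snapshot_download

# Re-uses the existing cache if the snapshot is already complete.
path = snapshot_download("amazon/LightGPT")
print(path)
print(sorted(os.listdir(path)))  # expect pytorch_model*.bin or *.safetensors files here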

Request for Comment: Aviary <-> LangChain Integration

Purpose

Aviary is an open source LLM management toolkit built on top of Ray Serve and Ray. LangChain is an incredibly popular open source toolkit for building LLM applications. The question is how do these things fit together?

Possible integrations

This first integration is focused on LLM (non chat) to begin with. When we add streaming to Aviary, we will also integrate that for the chat application too.

There are 3 possible integration points with LangChain.

1. Aviary as an LLM provider for LangChain

Make Aviary a model backend for LangChain (the same way that OpenAI is done currently).

This would enable you to do things like:

import os

from langchain.llms import Aviary

aviary_url = os.environ["AVIARY_URL"]
# Token is optional
aviary_token = os.environ.get("AVIARY_TOKEN")

llm = Aviary(model_name="amazon/LightGPT")

# single query
llm.predict("How do you make fried rice?")

# uses Aviary's batch interface for greater efficiency
llm.generate(["How do you make fried rice?", "What are the most influential punk bands?"])

The only real decision here is whether we use our SDK or allow direct connection to our endpoints. Since our web API is currently so simple, it might be easier to code against it directly in the short term and adopt the SDK when it is justified.

2. Aviary “wraps” LLMs provided by LangChain

Allow any model supported by LangChain to be wired up through Aviary (the same way that Aviary currently “wires up” Hugging Face). This would give a way for centrally managed Aviaries to control access to models from OpenAI and to impose additional limits on length.

For every model you want to wrap, you would have to set up a config file in the https://github.com/ray-project/aviary/tree/master/models directory. We would expand that file format to also support LangChain LLMs.

3. Integrate LangChain LLM support directly into Aviary Explorer and Aviary CLI

Allow users to query any model supported by LangChain directly. This would be useful for example to do cross OSS <-> commercial comparisons e.g. with GPT-3.5-turbo.

What we would do there is allow Aviary CLI to do something like this:

aviary query --model amazon/LightGPT --model model-configs/langchain-openai-gpt-35.yaml examples/qa-prompts.txt

In the aviary command, we would read openai://gpt-3.5-turbo and use the LangChain OpenAI LLM tool, allowing for cross-evaluation.
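
To make that concrete, here is a hypothetical sketch of the dispatch; none of these helpers exist in Aviary today, and the parameter names simply follow the LangChain example above:

from langchain.llms import Aviary, OpenAI

def resolve_llm(model_id: str):
    """Route scheme-prefixed ids to LangChain providers, everything else to Aviary."""
    if model_id.startswith("openai://"):
        return OpenAI(model_name=model_id.removeprefix("openai://"))
    return Aviary(model_name=model_id)

# resolve_llm("openai://gpt-3.5-turbo").predict("How do you make fried rice?")
# resolve_llm("amazon/LightGPT").predict("How do you make fried rice?")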

We would have to add new functionality to Aviary Explorer to support adding arbitrarily configured LangChain LLMs.

In essence the difference between proposals 2 and 3 is: where do the config files for specifying LLM properties live?

Decision

We are not limited to doing one of these.

The most immediate need and highest impact is perhaps #1.

#2 and #3 are similar in many ways. Perhaps #3 is more impactful, but the Aviary Explorer changes are more complicated. It is slightly ugly in the sense that we would then have YAML files both on the Aviary backend and in the Aviary CLI and Explorer.

Adding confidence level to the output of requests

One useful feature would be the ability to extract a confidence level and top-n probabilities on a per-token basis as generation happens. This would open up a lot of applications; one interesting use case is Monte Carlo search over candidate answers to boost model output quality.
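
For illustration, the sketch below (not Aviary code; the small model and library calls are assumptions chosen to keep the example self-contained) shows what per-token confidence and top-n probabilities look like when computed from the raw generation scores:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "gpt2"  # tiny model purely for illustration
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

inputs = tok("How do you make fried rice?", return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=5,
                     output_scores=True, return_dict_in_generate=True)

prompt_len = inputs.input_ids.shape[1]
for step, scores in enumerate(out.scores):
    probs = torch.softmax(scores[0], dim=-1)   # distribution over the vocab at this step
    top = torch.topk(probs, k=3)               # top-n alternatives
    chosen = out.sequences[0, prompt_len + step]
    print(f"step {step}: chose {tok.decode(chosen)!r} (p={probs[chosen].item():.3f}), "
          f"top-3: {[tok.decode(i) for i in top.indices]}")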


Unexpected id when stream=True

Hey aviary team. The v0.2.0 release is looking great, nice work!

I had a question about the response ID for streaming.

Encountered Behavior

When calling /chat/completions with stream=True, each response chunk has a unique id.

Expected Behavior

I expect each response chunk to have the same id, which is how OpenAI formats the response.

Is it possible to have the responses have the same id? If not, is there a suggested way to group the response streams?

See below for side-by-side examples of aviary response and OpenAI response
chat_completion = openai.ChatCompletion.create(
    model=model_name,
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Say 'test me please'."},
    ],
    temperature=0.7,
    stream=stream,
)

{ "id": "meta-llama/Llama-2-7b-chat-hf-ceb3f770-4897-4aa0-bb7b-b33bf4cbb821", "object": "text_completion", "created": 1692221192, "model": "meta-llama/Llama-2-7b-chat-hf", "choices": [ { "delta": { "role": "assistant" }, "index": 0, "finish_reason": null } ], "usage": null } { "id": "meta-llama/Llama-2-7b-chat-hf-7ccf1294-0371-4fbb-a143-306029eafa39", "object": "text_completion", "created": 1692221192, "model": "meta-llama/Llama-2-7b-chat-hf", "choices": [ { "delta": { "content": "Of" }, "index": 0, "finish_reason": null } ], "usage": null } { "id": "meta-llama/Llama-2-7b-chat-hf-1ca78f25-1a41-4d5f-92eb-833997192896", "object": "text_completion", "created": 1692221192, "model": "meta-llama/Llama-2-7b-chat-hf", "choices": [ { "delta": { "content": " course" }, "index": 0, "finish_reason": null } ], "usage": null } { "id": "meta-llama/Llama-2-7b-chat-hf-cee49ab7-fb78-4dd6-97bd-2814e5d48159", "object": "text_completion", "created": 1692221192, "model": "meta-llama/Llama-2-7b-chat-hf", "choices": [ { "delta": { "content": "!" }, "index": 0, "finish_reason": null } ], "usage": null } { "id": "meta-llama/Llama-2-7b-chat-hf-25ef4544-1d92-44f1-bc2d-dd8225f24efd", "object": "text_completion", "created": 1692221192, "model": "meta-llama/Llama-2-7b-chat-hf", "choices": [ { "delta": { "content": " *" }, "index": 0, "finish_reason": null } ], "usage": null } { "id": "meta-llama/Llama-2-7b-chat-hf-4bb5200b-90c7-4541-b33a-37c422bf80cb", "object": "text_completion", "created": 1692221192, "model": "meta-llama/Llama-2-7b-chat-hf", "choices": [ { "delta": { "content": "test" }, "index": 0, "finish_reason": null } ], "usage": null } { "id": "meta-llama/Llama-2-7b-chat-hf-1fa3c651-f4a8-4101-81fb-b8113f935931", "object": "text_completion", "created": 1692221192, "model": "meta-llama/Llama-2-7b-chat-hf", "choices": [ { "delta": { "content": " me" }, "index": 0, "finish_reason": null } ], "usage": null } { "id": "meta-llama/Llama-2-7b-chat-hf-3eeadcd7-f8fc-44a1-b3cc-36f751dfc7e5", "object": "text_completion", "created": 1692221192, "model": "meta-llama/Llama-2-7b-chat-hf", "choices": [ { "delta": { "content": " please" }, "index": 0, "finish_reason": null } ], "usage": null } { "id": "meta-llama/Llama-2-7b-chat-hf-6be0060b-a46e-4553-963a-dd7f43b1d809", "object": "text_completion", "created": 1692221192, "model": "meta-llama/Llama-2-7b-chat-hf", "choices": [ { "delta": { "content": "*" }, "index": 0, "finish_reason": null } ], "usage": null } { "id": "meta-llama/Llama-2-7b-chat-hf-c73f0096-1e6a-4f7f-803f-8603d70489d9", "object": "text_completion", "created": 1692221192, "model": "meta-llama/Llama-2-7b-chat-hf", "choices": [ { "delta": {}, "index": 0, "finish_reason": "stop" } ], "usage": { "prompt_tokens": 35, "completion_tokens": 9, "total_tokens": 44 } }

{ "id": "chatcmpl-7oIKTj4FV7OH8tmG9DUnxpjlSTH2S", "object": "chat.completion.chunk", "created": 1692221213, "model": "gpt-3.5-turbo-0613", "choices": [ { "index": 0, "delta": { "role": "assistant", "content": "" }, "finish_reason": null } ] } { "id": "chatcmpl-7oIKTj4FV7OH8tmG9DUnxpjlSTH2S", "object": "chat.completion.chunk", "created": 1692221213, "model": "gpt-3.5-turbo-0613", "choices": [ { "index": 0, "delta": { "content": "Test" }, "finish_reason": null } ] } { "id": "chatcmpl-7oIKTj4FV7OH8tmG9DUnxpjlSTH2S", "object": "chat.completion.chunk", "created": 1692221213, "model": "gpt-3.5-turbo-0613", "choices": [ { "index": 0, "delta": { "content": " me" }, "finish_reason": null } ] } { "id": "chatcmpl-7oIKTj4FV7OH8tmG9DUnxpjlSTH2S", "object": "chat.completion.chunk", "created": 1692221213, "model": "gpt-3.5-turbo-0613", "choices": [ { "index": 0, "delta": { "content": " please" }, "finish_reason": null } ] } { "id": "chatcmpl-7oIKTj4FV7OH8tmG9DUnxpjlSTH2S", "object": "chat.completion.chunk", "created": 1692221213, "model": "gpt-3.5-turbo-0613", "choices": [ { "index": 0, "delta": { "content": "." }, "finish_reason": null } ] } { "id": "chatcmpl-7oIKTj4FV7OH8tmG9DUnxpjlSTH2S", "object": "chat.completion.chunk", "created": 1692221213, "model": "gpt-3.5-turbo-0613", "choices": [ { "index": 0, "delta": {}, "finish_reason": "stop" } ] }

TheBloke--Llama-2-70B-chat-GPTQ model: weight model.layers.0.self_attn.q_proj.weight does not exist

I tried to serve the TheBloke--Llama-2-70B-chat-GPTQ model with Aviary 0.2.0 and got the following error: RuntimeError: weight model.layers.0.self_attn.q_proj.weight does not exist. This seems to be a TGI issue, huggingface/text-generation-inference#500.

I noticed that the Aviary issue about running the TheBloke--Llama-2-70B-chat-GPTQ model is marked as resolved. Can you share any suggestions?

cc @Yard1

Embedding model support in ray-llm

At Ray Summit, @pcmoritz talked about embedding models (especially GTE-base) in the session Developing and Serving RAG-Based LLM Applications in Production. It would be great if we could also have a model config in models/continuous_batching for this CPU model, so developers can host all the models relevant to RAG in ray-llm.
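
Until such a config exists, a rough sketch of the idea could look like the following (the model id and sentence-transformers usage are assumptions, and this is plain Ray Serve rather than an existing RayLLM model config):

from ray import serve
from sentence_transformers import SentenceTransformer
from starlette.requests import Request

@serve.deployment(num_replicas=1, ray_actor_options={"num_cpus": 2})
class Embedder:
    def __init__(self):
        # Assumed Hugging Face id for GTE-base; runs fine on CPU.
        self.model = SentenceTransformer("thenlper/gte-base")

    async def __call__(self, request: Request) -> dict:
        payload = await request.json()
        vectors = self.model.encode(payload["input"]).tolist()
        return {"model": "thenlper/gte-base", "data": vectors}

app = Embedder.bind()
# serve.run(app, route_prefix="/embeddings")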

Possible to run on a single 8x A100 machine on-premise?

I would like to run on a single on-premise machine, but I'm not able to get the models to load because scheduling looks for actor/worker resource nodes that don't exist. Do you have an example config for a single on-premise machine?
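
One thing that may help, sketched below under the assumption that the missing "actor/worker resource nodes" are the custom resources referenced in the model YAMLs (names such as accelerator_type_a100 and worker_node are assumptions; match them to your configs), is to advertise those resources explicitly when starting Ray on the single box:

import ray

# Advertise the custom resources the model configs expect so replicas can schedule.
ray.init(
    num_gpus=8,  # the 8x A100 machine
    resources={"accelerator_type_a100": 8, "worker_node": 1},
)
print(ray.cluster_resources())

The same resources can also be supplied via ray start --head --resources='...' when the cluster is started from the command line.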

Langchain integration

I have tried to use ray-llm, but it doesn't work out of the box with what vLLM is supposed to provide. Are there any changes that ray-llm makes to the API endpoint while serving the model?

Here is a comment on an issue in the vLLM repo that was resolved a long time ago: vllm-project/vllm#323 (comment)

How were the numbers in the performance leaderboard benchmarked?

Thanks for putting the leaderboard up. I was just curious about the performance numbers there. Could you comment on how the performance numbers in the leaderboard at https://aviary.anyscale.com/ were generated?

For instance, what GPU was used? For larger models, was distributed inference used? Can we run the same benchmark using the code in this repository? Were the numbers measured with batch size 1?

Thanks a lot.
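
For a rough local comparison, the sketch below times a single request against a RayLLM OpenAI-compatible endpoint; the endpoint URL and model id are assumptions, and it does not claim to reproduce the leaderboard methodology:

import time
import requests

URL = "http://localhost:8000/v1/chat/completions"  # assumed local RayLLM deployment
payload = {
    "model": "meta-llama/Llama-2-7b-chat-hf",
    "messages": [{"role": "user", "content": "How do you make fried rice?"}],
    "max_tokens": 128,
}

start = time.perf_counter()
resp = requests.post(URL, json=payload, timeout=300)
latency = time.perf_counter() - start
tokens = resp.json().get("usage", {}).get("completion_tokens", 0)
print(f"{latency:.2f}s end-to-end, ~{tokens / latency:.1f} generated tokens/s")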

Placement group is not released when user code exits, resulting in a resource leak

Symptoms: When a user script hits an exception, the associated RayWorker actors are marked as dead; however, the node that hosted those actors can't be scaled down. Even when there are no actors left (GPU, CPU, and memory usage all drop to zero), the worker node can't be removed because some placement groups are left behind.

The theory is that Aviary uses the Serve API to create a placement group but doesn't release it when the actor dies. As a result, the placement group is leaked and blocks the termination of an idle node. Note that there is no GC for placement groups; the expectation is always that callers release the resource.
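
A diagnostic sketch (not a fix; it assumes you can attach to the running cluster) that makes the leak visible is to list the placement groups the cluster still holds after the replicas have died:

import ray
from ray.util.placement_group import placement_group_table

ray.init(address="auto")
# Leaked groups keep showing up here even when CPU/GPU/memory usage is back to zero.
for pg_id, info in placement_group_table().items():
    print(pg_id, info.get("state"), info.get("bundles"))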

Reproduction environment:
Aviary version: 0.2.1
Ray version: nightly
Ray dashboard: https://session-xtfeimv54hk6bt23g5lc9eputm.i.anyscaleuserdata-staging.com/#/cluster
Cluster url: https://console.anyscale-staging.com/o/anyscale-internal/projects/prj_zvyp4jhsu8g9in8j5t7cwf1c1d/clusters/ses_xtfeimv54hk6bt23g5lc9eputm?user=usr_b9yhdfc2syn6sx3wiqvyw1tzc2

Missing Llama2 policy

I got an error:
ValueError: Model meta-llama/Llama-2-7b-chat-hf cannot automatically infer max_batch_total_tokens. Make sure to set engine_config.scheduler.policy.max_batch_total_tokens in the model configuration yaml.

Anyscale Image

In the Aviary cluster YAML file, the image used is anyscale/aviary:test.
Should it be changed to anyscale/ray-llm:latest?

# An unique identifier for the head node and workers of this cluster.
cluster_name: aviary-deploy

# Cloud-provider specific configuration.
provider:
    type: aws
    region: us-west-2
    cache_stopped_nodes: False
docker:
    image: "anyscale/aviary:test"            #### --
    image: "anyscale/ray-llm:latest"            #### ++
    container_name: "aviary"
    run_options:
      - --entrypoint ""

aviary run --model failing to deploy

Hi Aviary team,

Thanks for the great package. I am trying to get it to work for my use case and I am running into several issues. Details are provided below. Let me know if I can provide any additional information to help identify the root cause.

  1. When deploying new models the deployment will sometimes hang for over an hour before it silently fails.
  2. Unable to kill that individual Serve application, which means I must restart the entire cluster to try deploying that model again
  3. aviary models shows models that are not available to be queried and does not display others that are available

Using the latest docker image and default deploy/ray/aviary-cluster.yaml with the following change:

gpu_worker_g5:
  node_config:
    InstanceType: g5.4xlarge
    BlockDeviceMappings: *mount
  resources:
    worker_node: 1
    instance_type_g5: 1
    accelerator_type_a10: 1
  min_workers: 0
  max_workers: 8

When I run
export AVIARY_URL="http://localhost:8000"
aviary run --model ./models/static_batching/mosaicml--mpt-7b-instruct.yaml
aviary run --model ./models/static_batching/OpenAssistant--falcon-7b-sft-top1-696.yaml

Falcon-7b deploys successfully, but mpt-7b-instruct never deploys; it just hangs for about an hour until it reports failure. If I retry, I get the same result, and the same happens with a different model. I am well below the vCPU quota on G instances. I also tried vicuna-13b, and that also failed to launch a GPU instance.


Also, aviary models shows mpt-7b-instruct as running although it is not, and for some reason falcon-7b is not shown even though it actually is running. If you ping /-/routes directly, you see both models listed. The expected behavior would be that aviary models only shows models that are running and available to be queried.
(base) ray@ip-172-31-52-1:~$ aviary models
Connecting to Aviary backend at: http://localhost:8000/
mosaicml/mpt-7b-instruct

(base) ray@ip-172-31-52-1:~$ ray list actors --detail

  • actor_id: 014519cc10a3c1393952282303000000
    class_name: ServeReplica:mosaicml--mpt-7b-instruct_mosaicml--mpt-7b-instruct
    state: DEAD
    job_id: '03000000'
    name: SERVE_REPLICA::mosaicml--mpt-7b-instruct_mosaicml--mpt-7b-instruct#HoiArw
    node_id: 700b06906e2d611398d2a1c72140294f3cf773ece7a801c95520f669
    pid: 578
    ray_namespace: serve
    serialized_runtime_env: '{}'
    required_resources:
    accelerator_type_cpu: 0.01
    CPU: 1.0
    death_cause:
    actor_died_error_context:
    error_message: The actor is dead because it was killed by ray.kill.
    owner_id: 7e36611f04a1dd7649752f6ca82dbfb2d75ebdb4370831b8fc9c446c
    owner_ip_address: 172.31.52.1
    node_ip_address: 172.31.50.34
    pid: 578
    name: SERVE_REPLICA::mosaicml--mpt-7b-instruct_mosaicml--mpt-7b-instruct#HoiArw
    ray_namespace: serve
    class_name: ServeReplica:mosaicml--mpt-7b-instruct_mosaicml--mpt-7b-instruct
    actor_id: 014519cc10a3c1393952282303000000
    never_started: false
    is_detached: true
    placement_group_id: null
    repr_name: ''
  • actor_id: 02f2b6bd4c9a00b6b81b4c2503000000
    class_name: HTTPProxyActor
    state: ALIVE
    job_id: '03000000'
    name: SERVE_CONTROLLER_ACTOR:SERVE_PROXY_ACTOR-700b06906e2d611398d2a1c72140294f3cf773ece7a801c95520f669
    node_id: 700b06906e2d611398d2a1c72140294f3cf773ece7a801c95520f669
    pid: 158
    ray_namespace: serve
    serialized_runtime_env: '{}'
    required_resources: {}
    death_cause: null
    is_detached: true
    placement_group_id: null
    repr_name: ''
  • actor_id: 10622e4649179965e5a6d0c303000000
    class_name: ServeReplica:OpenAssistant--falcon-7b-sft-top1-696_OpenAssistant--falcon-7b-sft-top1-696
    state: ALIVE
    job_id: '03000000'
    name: SERVE_REPLICA::OpenAssistant--falcon-7b-sft-top1-696_OpenAssistant--falcon-7b-sft-top1-696#pRMsbX
    node_id: 700b06906e2d611398d2a1c72140294f3cf773ece7a801c95520f669
    pid: 157
    ray_namespace: serve
    serialized_runtime_env: '{}'
    required_resources:
    accelerator_type_cpu: 0.01
    CPU: 1.0
    death_cause: null
    is_detached: true
    placement_group_id: null
    repr_name: ''
  • actor_id: 37b968e0c1adc19c2317963c03000000
    class_name: ServeController
    state: ALIVE
    job_id: '03000000'
    name: SERVE_CONTROLLER_ACTOR
    node_id: 3a928be2a77f07dd58f2ae3672f853f3c3a0e341995289bcbca66d35
    pid: 547
    ray_namespace: serve
    serialized_runtime_env: '{}'
    required_resources:
    node:internal_head: 0.001
    death_cause: null
    is_detached: true
    placement_group_id: null
    repr_name: ''
  • actor_id: 3e24b48d34bd94402134c1f403000000
    class_name: ServeReplica:mosaicml--mpt-7b-instruct_mosaicml--mpt-7b-instruct
    state: DEAD
    job_id: '03000000'
    name: SERVE_REPLICA::mosaicml--mpt-7b-instruct_mosaicml--mpt-7b-instruct#ZJCwbG
    node_id: 700b06906e2d611398d2a1c72140294f3cf773ece7a801c95520f669
    pid: 748
    ray_namespace: serve
    serialized_runtime_env: '{}'
    required_resources:
    CPU: 1.0
    accelerator_type_cpu: 0.01
    death_cause:
    actor_died_error_context:
    error_message: The actor is dead because it was killed by ray.kill.
    owner_id: 7e36611f04a1dd7649752f6ca82dbfb2d75ebdb4370831b8fc9c446c
    owner_ip_address: 172.31.52.1
    node_ip_address: 172.31.50.34
    pid: 748
    name: SERVE_REPLICA::mosaicml--mpt-7b-instruct_mosaicml--mpt-7b-instruct#ZJCwbG
    ray_namespace: serve
    class_name: ServeReplica:mosaicml--mpt-7b-instruct_mosaicml--mpt-7b-instruct
    actor_id: 3e24b48d34bd94402134c1f403000000
    never_started: false
    is_detached: true
    placement_group_id: null
    repr_name: ''
  • actor_id: 3e28e1cacda268b21faa5c7503000000
    class_name: HTTPProxyActor
    state: ALIVE
    job_id: '03000000'
    name: SERVE_CONTROLLER_ACTOR:SERVE_PROXY_ACTOR-3a928be2a77f07dd58f2ae3672f853f3c3a0e341995289bcbca66d35
    node_id: 3a928be2a77f07dd58f2ae3672f853f3c3a0e341995289bcbca66d35
    pid: 572
    ray_namespace: serve
    serialized_runtime_env: '{}'
    required_resources: {}
    death_cause: null
    is_detached: true
    placement_group_id: null
    repr_name: ''
  • actor_id: 642421606c9735b2b038281503000000
    class_name: ServeReplica:mosaicml--mpt-7b-instruct_mosaicml--mpt-7b-instruct
    state: DEAD
    job_id: '03000000'
    name: SERVE_REPLICA::mosaicml--mpt-7b-instruct_mosaicml--mpt-7b-instruct#ZHeJeU
    node_id: 700b06906e2d611398d2a1c72140294f3cf773ece7a801c95520f669
    pid: 476
    ray_namespace: serve
    serialized_runtime_env: '{}'
    required_resources:
    CPU: 1.0
    accelerator_type_cpu: 0.01
    death_cause:
    actor_died_error_context:
    error_message: The actor is dead because it was killed by ray.kill.
    owner_id: 7e36611f04a1dd7649752f6ca82dbfb2d75ebdb4370831b8fc9c446c
    owner_ip_address: 172.31.52.1
    node_ip_address: 172.31.50.34
    pid: 476
    name: SERVE_REPLICA::mosaicml--mpt-7b-instruct_mosaicml--mpt-7b-instruct#ZHeJeU
    ray_namespace: serve
    class_name: ServeReplica:mosaicml--mpt-7b-instruct_mosaicml--mpt-7b-instruct
    actor_id: 642421606c9735b2b038281503000000
    never_started: false
    is_detached: true
    placement_group_id: null
    repr_name: ''
  • actor_id: 76073afc503c62fb0fc0c2a303000000
    class_name: ServeReplica:mosaicml--mpt-7b-instruct_mosaicml--mpt-7b-instruct
    state: DEAD
    job_id: '03000000'
    name: SERVE_REPLICA::mosaicml--mpt-7b-instruct_mosaicml--mpt-7b-instruct#PiHekO
    node_id: 700b06906e2d611398d2a1c72140294f3cf773ece7a801c95520f669
    pid: 714
    ray_namespace: serve
    serialized_runtime_env: '{}'
    required_resources:
    CPU: 1.0
    accelerator_type_cpu: 0.01
    death_cause:
    actor_died_error_context:
    error_message: The actor is dead because it was killed by ray.kill.
    owner_id: 7e36611f04a1dd7649752f6ca82dbfb2d75ebdb4370831b8fc9c446c
    owner_ip_address: 172.31.52.1
    node_ip_address: 172.31.50.34
    pid: 714
    name: SERVE_REPLICA::mosaicml--mpt-7b-instruct_mosaicml--mpt-7b-instruct#PiHekO
    ray_namespace: serve
    class_name: ServeReplica:mosaicml--mpt-7b-instruct_mosaicml--mpt-7b-instruct
    actor_id: 76073afc503c62fb0fc0c2a303000000
    never_started: false
    is_detached: true
    placement_group_id: null
    repr_name: ''
  • actor_id: 81de79edf7d71d28f866ed4a03000000
    class_name: PredictionWorker
    state: ALIVE
    job_id: '03000000'
    name: ''
    node_id: cf564dfb2977f9931871554534b14ef561487e1dfbd0082bcd9ea19d
    pid: 315
    ray_namespace: serve
    serialized_runtime_env: '{"env_vars": {"PYTORCH_CUDA_ALLOC_CONF": "backend:cudaMallocAsync"}}'
    required_resources:
    accelerator_type_a10_group_a1eecb5ff05974e7cfd257634e0903000000: 0.01
    GPU_group_a1eecb5ff05974e7cfd257634e0903000000: 1.0
    CPU_group_a1eecb5ff05974e7cfd257634e0903000000: 8.0
    death_cause: null
    is_detached: false
    placement_group_id: a1eecb5ff05974e7cfd257634e0903000000
    repr_name: PredictionWorker:OpenAssistant/falcon-7b-sft-top1-696
  • actor_id: 873651bc5c6bdf4494c063bc03000000
    class_name: ServeReplica:mosaicml--mpt-7b-instruct_mosaicml--mpt-7b-instruct
    state: ALIVE
    job_id: '03000000'
    name: SERVE_REPLICA::mosaicml--mpt-7b-instruct_mosaicml--mpt-7b-instruct#zXQasq
    node_id: 700b06906e2d611398d2a1c72140294f3cf773ece7a801c95520f669
    pid: 782
    ray_namespace: serve
    serialized_runtime_env: '{}'
    required_resources:
    CPU: 1.0
    accelerator_type_cpu: 0.01
    death_cause: null
    is_detached: true
    placement_group_id: null
    repr_name: ''
  • actor_id: 99f4b1e5b33343787c6ebbda03000000
    class_name: ServeReplica:mosaicml--mpt-7b-instruct_mosaicml--mpt-7b-instruct
    state: DEAD
    job_id: '03000000'
    name: SERVE_REPLICA::mosaicml--mpt-7b-instruct_mosaicml--mpt-7b-instruct#dgCkoa
    node_id: 700b06906e2d611398d2a1c72140294f3cf773ece7a801c95520f669
    pid: 371
    ray_namespace: serve
    serialized_runtime_env: '{}'
    required_resources:
    CPU: 1.0
    accelerator_type_cpu: 0.01
    death_cause:
    actor_died_error_context:
    error_message: The actor is dead because it was killed by ray.kill.
    owner_id: 7e36611f04a1dd7649752f6ca82dbfb2d75ebdb4370831b8fc9c446c
    owner_ip_address: 172.31.52.1
    node_ip_address: 172.31.50.34
    pid: 371
    name: SERVE_REPLICA::mosaicml--mpt-7b-instruct_mosaicml--mpt-7b-instruct#dgCkoa
    ray_namespace: serve
    class_name: ServeReplica:mosaicml--mpt-7b-instruct_mosaicml--mpt-7b-instruct
    actor_id: 99f4b1e5b33343787c6ebbda03000000
    never_started: false
    is_detached: true
    placement_group_id: null
    repr_name: ''
  • actor_id: 9ed8f9167aff1416f9b40fb903000000
    class_name: ServeReplica:router_Router
    state: DEAD
    job_id: '03000000'
    name: SERVE_REPLICA::router_Router#SnpzWO
    node_id: 3a928be2a77f07dd58f2ae3672f853f3c3a0e341995289bcbca66d35
    pid: 688
    ray_namespace: serve
    serialized_runtime_env: '{}'
    required_resources:
    CPU: 1.0
    death_cause:
    actor_died_error_context:
    error_message: The actor is dead because it was killed by ray.kill.
    owner_id: 7e36611f04a1dd7649752f6ca82dbfb2d75ebdb4370831b8fc9c446c
    owner_ip_address: 172.31.52.1
    node_ip_address: 172.31.52.1
    pid: 688
    name: SERVE_REPLICA::router_Router#SnpzWO
    ray_namespace: serve
    class_name: ServeReplica:router_Router
    actor_id: 9ed8f9167aff1416f9b40fb903000000
    never_started: false
    is_detached: true
    placement_group_id: null
    repr_name: ''
  • actor_id: aa085f6ae7a2de244227f74b03000000
    class_name: ServeReplica:mosaicml--mpt-7b-instruct_mosaicml--mpt-7b-instruct
    state: DEAD
    job_id: '03000000'
    name: SERVE_REPLICA::mosaicml--mpt-7b-instruct_mosaicml--mpt-7b-instruct#NUJehq
    node_id: 700b06906e2d611398d2a1c72140294f3cf773ece7a801c95520f669
    pid: 680
    ray_namespace: serve
    serialized_runtime_env: '{}'
    required_resources:
    accelerator_type_cpu: 0.01
    CPU: 1.0
    death_cause:
    actor_died_error_context:
    error_message: The actor is dead because it was killed by ray.kill.
    owner_id: 7e36611f04a1dd7649752f6ca82dbfb2d75ebdb4370831b8fc9c446c
    owner_ip_address: 172.31.52.1
    node_ip_address: 172.31.50.34
    pid: 680
    name: SERVE_REPLICA::mosaicml--mpt-7b-instruct_mosaicml--mpt-7b-instruct#NUJehq
    ray_namespace: serve
    class_name: ServeReplica:mosaicml--mpt-7b-instruct_mosaicml--mpt-7b-instruct
    actor_id: aa085f6ae7a2de244227f74b03000000
    never_started: false
    is_detached: true
    placement_group_id: null
    repr_name: ''
  • actor_id: bdefd5bdada986dc6a86c20f03000000
    class_name: ServeReplica:router_Router
    state: DEAD
    job_id: '03000000'
    name: SERVE_REPLICA::router_Router#YVeHIq
    node_id: 3a928be2a77f07dd58f2ae3672f853f3c3a0e341995289bcbca66d35
    pid: 601
    ray_namespace: serve
    serialized_runtime_env: '{}'
    required_resources:
    CPU: 1.0
    death_cause:
    actor_died_error_context:
    error_message: The actor is dead because it was killed by ray.kill.
    owner_id: 7e36611f04a1dd7649752f6ca82dbfb2d75ebdb4370831b8fc9c446c
    owner_ip_address: 172.31.52.1
    node_ip_address: 172.31.52.1
    pid: 601
    name: SERVE_REPLICA::router_Router#YVeHIq
    ray_namespace: serve
    class_name: ServeReplica:router_Router
    actor_id: bdefd5bdada986dc6a86c20f03000000
    never_started: false
    is_detached: true
    placement_group_id: null
    repr_name: ''
  • actor_id: bfff95555c8aad324470ffd303000000
    class_name: ServeReplica:mosaicml--mpt-7b-instruct_mosaicml--mpt-7b-instruct
    state: DEAD
    job_id: '03000000'
    name: SERVE_REPLICA::mosaicml--mpt-7b-instruct_mosaicml--mpt-7b-instruct#JlnvAU
    node_id: 700b06906e2d611398d2a1c72140294f3cf773ece7a801c95520f669
    pid: 337
    ray_namespace: serve
    serialized_runtime_env: '{}'
    required_resources:
    CPU: 1.0
    accelerator_type_cpu: 0.01
    death_cause:
    actor_died_error_context:
    error_message: The actor is dead because it was killed by ray.kill.
    owner_id: 7e36611f04a1dd7649752f6ca82dbfb2d75ebdb4370831b8fc9c446c
    owner_ip_address: 172.31.52.1
    node_ip_address: 172.31.50.34
    pid: 337
    name: SERVE_REPLICA::mosaicml--mpt-7b-instruct_mosaicml--mpt-7b-instruct#JlnvAU
    ray_namespace: serve
    class_name: ServeReplica:mosaicml--mpt-7b-instruct_mosaicml--mpt-7b-instruct
    actor_id: bfff95555c8aad324470ffd303000000
    never_started: false
    is_detached: true
    placement_group_id: null
    repr_name: ''
  • actor_id: c9fa37906bdebd1f2bd16b0b03000000
    class_name: HTTPProxyActor
    state: ALIVE
    job_id: '03000000'
    name: SERVE_CONTROLLER_ACTOR:SERVE_PROXY_ACTOR-cf564dfb2977f9931871554534b14ef561487e1dfbd0082bcd9ea19d
    node_id: cf564dfb2977f9931871554534b14ef561487e1dfbd0082bcd9ea19d
    pid: 163
    ray_namespace: serve
    serialized_runtime_env: '{}'
    required_resources: {}
    death_cause: null
    is_detached: true
    placement_group_id: null
    repr_name: ''
  • actor_id: cad3c4d91c30b987dd98e33203000000
    class_name: ServeReplica:mosaicml--mpt-7b-instruct_mosaicml--mpt-7b-instruct
    state: DEAD
    job_id: '03000000'
    name: SERVE_REPLICA::mosaicml--mpt-7b-instruct_mosaicml--mpt-7b-instruct#DGeusc
    node_id: 700b06906e2d611398d2a1c72140294f3cf773ece7a801c95520f669
    pid: 405
    ray_namespace: serve
    serialized_runtime_env: '{}'
    required_resources:
    accelerator_type_cpu: 0.01
    CPU: 1.0
    death_cause:
    actor_died_error_context:
    error_message: The actor is dead because it was killed by ray.kill.
    owner_id: 7e36611f04a1dd7649752f6ca82dbfb2d75ebdb4370831b8fc9c446c
    owner_ip_address: 172.31.52.1
    node_ip_address: 172.31.50.34
    pid: 405
    name: SERVE_REPLICA::mosaicml--mpt-7b-instruct_mosaicml--mpt-7b-instruct#DGeusc
    ray_namespace: serve
    class_name: ServeReplica:mosaicml--mpt-7b-instruct_mosaicml--mpt-7b-instruct
    actor_id: cad3c4d91c30b987dd98e33203000000
    never_started: false
    is_detached: true
    placement_group_id: null
    repr_name: ''
  • actor_id: ce3265064d0a2c858b25fde703000000
    class_name: ServeReplica:mosaicml--mpt-7b-instruct_mosaicml--mpt-7b-instruct
    state: DEAD
    job_id: '03000000'
    name: SERVE_REPLICA::mosaicml--mpt-7b-instruct_mosaicml--mpt-7b-instruct#NeMVAn
    node_id: 700b06906e2d611398d2a1c72140294f3cf773ece7a801c95520f669
    pid: 510
    ray_namespace: serve
    serialized_runtime_env: '{}'
    required_resources:
    CPU: 1.0
    accelerator_type_cpu: 0.01
    death_cause:
    actor_died_error_context:
    error_message: The actor is dead because it was killed by ray.kill.
    owner_id: 7e36611f04a1dd7649752f6ca82dbfb2d75ebdb4370831b8fc9c446c
    owner_ip_address: 172.31.52.1
    node_ip_address: 172.31.50.34
    pid: 510
    name: SERVE_REPLICA::mosaicml--mpt-7b-instruct_mosaicml--mpt-7b-instruct#NeMVAn
    ray_namespace: serve
    class_name: ServeReplica:mosaicml--mpt-7b-instruct_mosaicml--mpt-7b-instruct
    actor_id: ce3265064d0a2c858b25fde703000000
    never_started: false
    is_detached: true
    placement_group_id: null
    repr_name: ''
  • actor_id: d26865451e8634b713fe64a903000000
    class_name: ServeReplica:router_Router
    state: ALIVE
    job_id: '03000000'
    name: SERVE_REPLICA::router_Router#yKFoKB
    node_id: 3a928be2a77f07dd58f2ae3672f853f3c3a0e341995289bcbca66d35
    pid: 930
    ray_namespace: serve
    serialized_runtime_env: '{}'
    required_resources:
    CPU: 1.0
    death_cause: null
    is_detached: true
    placement_group_id: null
    repr_name: ''
  • actor_id: dfe8997fd860700452da4ad103000000
    class_name: ServeReplica:router_Router
    state: DEAD
    job_id: '03000000'
    name: SERVE_REPLICA::router_Router#PzsVij
    node_id: 3a928be2a77f07dd58f2ae3672f853f3c3a0e341995289bcbca66d35
    pid: 857
    ray_namespace: serve
    serialized_runtime_env: '{}'
    required_resources:
    CPU: 1.0
    death_cause:
    actor_died_error_context:
    error_message: The actor is dead because it was killed by ray.kill.
    owner_id: 7e36611f04a1dd7649752f6ca82dbfb2d75ebdb4370831b8fc9c446c
    owner_ip_address: 172.31.52.1
    node_ip_address: 172.31.52.1
    pid: 857
    name: SERVE_REPLICA::router_Router#PzsVij
    ray_namespace: serve
    class_name: ServeReplica:router_Router
    actor_id: dfe8997fd860700452da4ad103000000
    never_started: false
    is_detached: true
    placement_group_id: null
    repr_name: ''
  • actor_id: e1ea02ba81a4f7f6704fd01603000000
    class_name: ServeReplica:mosaicml--mpt-7b-instruct_mosaicml--mpt-7b-instruct
    state: DEAD
    job_id: '03000000'
    name: SERVE_REPLICA::mosaicml--mpt-7b-instruct_mosaicml--mpt-7b-instruct#oVTtwj
    node_id: 700b06906e2d611398d2a1c72140294f3cf773ece7a801c95520f669
    pid: 439
    ray_namespace: serve
    serialized_runtime_env: '{}'
    required_resources:
    accelerator_type_cpu: 0.01
    CPU: 1.0
    death_cause:
    actor_died_error_context:
    error_message: The actor is dead because it was killed by ray.kill.
    owner_id: 7e36611f04a1dd7649752f6ca82dbfb2d75ebdb4370831b8fc9c446c
    owner_ip_address: 172.31.52.1
    node_ip_address: 172.31.50.34
    pid: 439
    name: SERVE_REPLICA::mosaicml--mpt-7b-instruct_mosaicml--mpt-7b-instruct#oVTtwj
    ray_namespace: serve
    class_name: ServeReplica:mosaicml--mpt-7b-instruct_mosaicml--mpt-7b-instruct
    actor_id: e1ea02ba81a4f7f6704fd01603000000
    never_started: false
    is_detached: true
    placement_group_id: null
    repr_name: ''
  • actor_id: e6460cb5a1bd9c9875d71ffa03000000
    class_name: ServeReplica:mosaicml--mpt-7b-instruct_mosaicml--mpt-7b-instruct
    state: DEAD
    job_id: '03000000'
    name: SERVE_REPLICA::mosaicml--mpt-7b-instruct_mosaicml--mpt-7b-instruct#ZiPkFw
    node_id: 700b06906e2d611398d2a1c72140294f3cf773ece7a801c95520f669
    pid: 544
    ray_namespace: serve
    serialized_runtime_env: '{}'
    required_resources:
    CPU: 1.0
    accelerator_type_cpu: 0.01
    death_cause:
    actor_died_error_context:
    error_message: The actor is dead because it was killed by ray.kill.
    owner_id: 7e36611f04a1dd7649752f6ca82dbfb2d75ebdb4370831b8fc9c446c
    owner_ip_address: 172.31.52.1
    node_ip_address: 172.31.50.34
    pid: 544
    name: SERVE_REPLICA::mosaicml--mpt-7b-instruct_mosaicml--mpt-7b-instruct#ZiPkFw
        ray_namespace: serve
        class_name: ServeReplica:mosaicml--mpt-7b-instruct_mosaicml--mpt-7b-instruct
        actor_id: e6460cb5a1bd9c9875d71ffa03000000
        never_started: false
    is_detached: true
    placement_group_id: null
    repr_name: ''
  • actor_id: e8db0e9b51037e8f85f03fec03000000
    class_name: ServeReplica:mosaicml--mpt-7b-instruct_mosaicml--mpt-7b-instruct
    state: DEAD
    job_id: '03000000'
    name: SERVE_REPLICA::mosaicml--mpt-7b-instruct_mosaicml--mpt-7b-instruct#qObhMp
    node_id: 700b06906e2d611398d2a1c72140294f3cf773ece7a801c95520f669
    pid: 646
    ray_namespace: serve
    serialized_runtime_env: '{}'
    required_resources:
      CPU: 1.0
      accelerator_type_cpu: 0.01
    death_cause:
      actor_died_error_context:
        error_message: The actor is dead because it was killed by ray.kill.
        owner_id: 7e36611f04a1dd7649752f6ca82dbfb2d75ebdb4370831b8fc9c446c
        owner_ip_address: 172.31.52.1
        node_ip_address: 172.31.50.34
        pid: 646
        name: SERVE_REPLICA::mosaicml--mpt-7b-instruct_mosaicml--mpt-7b-instruct#qObhMp
        ray_namespace: serve
        class_name: ServeReplica:mosaicml--mpt-7b-instruct_mosaicml--mpt-7b-instruct
        actor_id: e8db0e9b51037e8f85f03fec03000000
        never_started: false
    is_detached: true
    placement_group_id: null
    repr_name: ''
  • actor_id: eeeb659a56054345b32fc10503000000
    class_name: ServeReplica:mosaicml--mpt-7b-instruct_mosaicml--mpt-7b-instruct
    state: DEAD
    job_id: '03000000'
    name: SERVE_REPLICA::mosaicml--mpt-7b-instruct_mosaicml--mpt-7b-instruct#XrHvwh
    node_id: 700b06906e2d611398d2a1c72140294f3cf773ece7a801c95520f669
    pid: 156
    ray_namespace: serve
    serialized_runtime_env: '{}'
    required_resources:
      accelerator_type_cpu: 0.01
      CPU: 1.0
    death_cause:
      actor_died_error_context:
        error_message: The actor is dead because it was killed by ray.kill.
        owner_id: 7e36611f04a1dd7649752f6ca82dbfb2d75ebdb4370831b8fc9c446c
        owner_ip_address: 172.31.52.1
        node_ip_address: 172.31.50.34
        pid: 156
        name: SERVE_REPLICA::mosaicml--mpt-7b-instruct_mosaicml--mpt-7b-instruct#XrHvwh
        ray_namespace: serve
        class_name: ServeReplica:mosaicml--mpt-7b-instruct_mosaicml--mpt-7b-instruct
        actor_id: eeeb659a56054345b32fc10503000000
        never_started: false
    is_detached: true
    placement_group_id: null
    repr_name: ''
  • actor_id: f85cc894e7c8b87a37a3da9203000000
    class_name: ServeReplica:mosaicml--mpt-7b-instruct_mosaicml--mpt-7b-instruct
    state: DEAD
    job_id: '03000000'
    name: SERVE_REPLICA::mosaicml--mpt-7b-instruct_mosaicml--mpt-7b-instruct#NGyLdh
    node_id: 700b06906e2d611398d2a1c72140294f3cf773ece7a801c95520f669
    pid: 303
    ray_namespace: serve
    serialized_runtime_env: '{}'
    required_resources:
      CPU: 1.0
      accelerator_type_cpu: 0.01
    death_cause:
      actor_died_error_context:
        error_message: The actor is dead because it was killed by ray.kill.
        owner_id: 7e36611f04a1dd7649752f6ca82dbfb2d75ebdb4370831b8fc9c446c
        owner_ip_address: 172.31.52.1
        node_ip_address: 172.31.50.34
        pid: 303
        name: SERVE_REPLICA::mosaicml--mpt-7b-instruct_mosaicml--mpt-7b-instruct#NGyLdh
        ray_namespace: serve
        class_name: ServeReplica:mosaicml--mpt-7b-instruct_mosaicml--mpt-7b-instruct
        actor_id: f85cc894e7c8b87a37a3da9203000000
        never_started: false
    is_detached: true
    placement_group_id: null
    repr_name: ''
  • actor_id: fff57bdd81a5b9190538fcd003000000
    class_name: ServeReplica:mosaicml--mpt-7b-instruct_mosaicml--mpt-7b-instruct
    state: DEAD
    job_id: '03000000'
    name: SERVE_REPLICA::mosaicml--mpt-7b-instruct_mosaicml--mpt-7b-instruct#EYOIDW
    node_id: 700b06906e2d611398d2a1c72140294f3cf773ece7a801c95520f669
    pid: 612
    ray_namespace: serve
    serialized_runtime_env: '{}'
    required_resources:
      accelerator_type_cpu: 0.01
      CPU: 1.0
    death_cause:
      actor_died_error_context:
        error_message: The actor is dead because it was killed by ray.kill.
        owner_id: 7e36611f04a1dd7649752f6ca82dbfb2d75ebdb4370831b8fc9c446c
        owner_ip_address: 172.31.52.1
        node_ip_address: 172.31.50.34
        pid: 612
        name: SERVE_REPLICA::mosaicml--mpt-7b-instruct_mosaicml--mpt-7b-instruct#EYOIDW
        ray_namespace: serve
        class_name: ServeReplica:mosaicml--mpt-7b-instruct_mosaicml--mpt-7b-instruct
        actor_id: fff57bdd81a5b9190538fcd003000000
        never_started: false
    is_detached: true
    placement_group_id: null
    repr_name: ''
    ...

Ray dashboard doesn't connect with image: "anyscale/aviary:latest"

Thanks for the great project!

With the exact same setup and steps, the Ray dashboard connects with image anyscale/aviary:0.1.0-a98a94c5005525545b9ea0a5b0b7b22f25f322d7-tgi.

Happy to provide logs if you let me know which ones would be helpful and how to generate them.

Appreciate you guys taking a look at this.

[2023-08-28] Sunsetting non-Llama model examples.

This issue serves as a "news ticker" for the Aviary frontend.

Current news:

We're sunsetting the non-Llama models as of the 0.3.0 release, because we've seen demand for Llama models significantly outpace demand for the Falcon and MPT models, along with corresponding accuracy improvements.

Please feel free to create an issue if you would like to see new models supported.

Past updates:

  • [2023-08-28] Sunsetting non-Llama model examples on RayLLM - see chat.lmsys.org for others!
  • [2023-07-20] Just added: Llama 2 models All Sizes!!!
  • [2023-07-03] Just added: continuous batching! Refreshed model list!
  • [2023-06-22] Just added: mpt-30b-chat! Streaming support & other improvements!
  • [2023-06-21] Just added: Streaming support & other improvements!
  • [2023-06-15] Falcon models have a bug that slows them down and causes timeouts.
  • [2023-06-06] Just added: Falcon models! KubeRay guide!
  • [2023-06-02] Just added: Expanded context windows (900 words)! OpenAI CLI support! Upcoming: Falcon 40b

issue with run locally

I tried to run inside the latest image, but after the model warmup it just died with no error.
I was trying to run this:
aviary run --model ~/models/continuous_batching/mosaicml--mpt-7b-chat.yaml
The only change inside the YAML is to remove
ray_actor_options:
num_gpus: 1
since I don't have 'accelerator_type_a10'; I have an A6000.
Here is the last part of the logs:

ve taken more than 30s to initialize. This may be caused by a slow __init__ or reconfigure method.
(AviaryTGIInferenceWorker:mosaicml/mpt-7b-chat pid=31233) Downloaded /home/ray/data/hub/models--mosaicml--mpt-7b-chat/snapshots/64e5c9c9fb53a8e89690c2dee75a5add37f7113e/pytorch_model-00001-of-00002.bin in 0:02:35.
(AviaryTGIInferenceWorker:mosaicml/mpt-7b-chat pid=31233) Download: [1/2] -- ETA: 0:02:35
(AviaryTGIInferenceWorker:mosaicml/mpt-7b-chat pid=31233) Download file: pytorch_model-00002-of-00002.bin
(ServeController pid=30116) WARNING 2023-10-02 06:40:38,770 controller 30116 deployment_state.py:2006 - Deployment 'mosaicml--mpt-7b-chat' in application 'mosaicml--mpt-7b-chat' has 1 replicas that have taken more than 30s to initialize. This may be caused by a slow __init__ or reconfigure method.
(ServeController pid=30116) WARNING 2023-10-02 06:41:08,775 controller 30116 deployment_state.py:2006 - Deployment 'mosaicml--mpt-7b-chat' in application 'mosaicml--mpt-7b-chat' has 1 replicas that have taken more than 30s to initialize. This may be caused by a slow __init__ or reconfigure method.
(AviaryTGIInferenceWorker:mosaicml/mpt-7b-chat pid=31233) Downloaded /home/ray/data/hub/models--mosaicml--mpt-7b-chat/snapshots/64e5c9c9fb53a8e89690c2dee75a5add37f7113e/pytorch_model-00002-of-00002.bin in 0:00:58.
(AviaryTGIInferenceWorker:mosaicml/mpt-7b-chat pid=31233) Download: [2/2] -- ETA: 0
(AviaryTGIInferenceWorker:mosaicml/mpt-7b-chat pid=31233) No safetensors weights found for model mosaicml/mpt-7b-chat at revision None. Converting PyTorch weights to safetensors.
(ServeController pid=30116) WARNING 2023-10-02 06:41:38,862 controller 30116 deployment_state.py:2006 - Deployment 'mosaicml--mpt-7b-chat' in application 'mosaicml--mpt-7b-chat' has 1 replicas that have taken more than 30s to initialize. This may be caused by a slow __init__ or reconfigure method.
(AviaryTGIInferenceWorker:mosaicml/mpt-7b-chat pid=31233) Convert: [1/2] -- Took: 0:00:20.415345
(AviaryTGIInferenceWorker:mosaicml/mpt-7b-chat pid=31233) Convert: [2/2] -- Took: 0:00:06.243851
(ServeReplica:mosaicml--mpt-7b-chat:mosaicml--mpt-7b-chat pid=30186) [INFO 2023-10-02 06:42:05,045] tgi.py: 214  Warming up model on workers...
(AviaryTGIInferenceWorker:mosaicml/mpt-7b-chat pid=31233) [INFO 2023-10-02 06:42:05,054] tgi_worker.py: 650  Model is warming up. Num requests: 3 Prefill tokens: 6000 Max batch total tokens: None
(AviaryTGIInferenceWorker:mosaicml/mpt-7b-chat pid=31233) [INFO 2023-10-02 06:42:07,307] tgi_worker.py: 663  Model finished warming up (max_batch_total_tokens=None) and is ready to serve requests.
(ServeReplica:mosaicml--mpt-7b-chat:mosaicml--mpt-7b-chat pid=30186) [INFO 2023-10-02 06:42:07,520] tgi.py: 170  Rolling over to new worker group [Actor(AviaryTGIInferenceWorker, 725292a8070301f947130c2c01000000)]
(ServeReplica:mosaicml--mpt-7b-chat:mosaicml--mpt-7b-chat pid=30186) [INFO 2023-10-02 06:42:07,661] model_app.py: 83  Reconfigured and ready to serve.
(ServeReplica:mosaicml--mpt-7b-chat:mosaicml--mpt-7b-chat pid=30186) DeprecationWarning: `ray.state.actors` is a private attribute and access will be removed in a future Ray version.
/home/ray/anaconda3/lib/python3.9/tempfile.py:821: ResourceWarning: Implicitly cleaning up <TemporaryDirectory '/tmp/tmptyp67o3t'>
  _warnings.warn(warn_message, ResourceWarning)
/home/ray/anaconda3/lib/python3.9/subprocess.py:1052: ResourceWarning: subprocess 28960 is still running
  _warn("subprocess %s is still running" % self.pid,
ResourceWarning: Enable tracemalloc to get the object allocation traceback

(base) ray@4cd79d6dad32:~$
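
A possible workaround sketch (not part of the original report): rather than editing the model YAML, you could advertise the accelerator_type_a10 resource when starting Ray locally so the unmodified config can schedule on the A6000. The commands below use standard ray start flags; the resource name only satisfies scheduling and does not change which GPU is used.

ray stop
# Advertise the GPU plus the custom resource the default model YAML asks for
ray start --head --num-gpus=1 --resources='{"accelerator_type_a10": 1}'
aviary run --model ~/models/continuous_batching/mosaicml--mpt-7b-chat.yaml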

Repeated Failure and Restarting of Worker Deployment

Description:

I've been working with the Aviary codebase, successfully deploying the stack and connecting to it following the instructions in the readme. However, I've run into an issue during the model deployment phase.

Specifically, when deploying a model, I'm experiencing consistent failure and restarting of the worker deployment. Upon investigation, it appears that this could be related to the worker downloading the anyscale/aviary:latest Docker image, which is based on the rayproject/ray-ml:nightly-gpu image. This image is quite large at 20GB, which seems excessive, and I suspect it's causing a substantial delay during download. This may be exacerbated by AWS IP throttling.

The readme documentation doesn't seem to cover this scenario. I have some experience deploying various Ray infra stack items but don't have a ton of insight into how timeouts are handled at different parts of the stack. Is this a known issue? If so, I would appreciate any advice or guidance for resolving it.

Looking forward to your input and suggestions on how to proceed.

Related to:

  • Model deployment process from head server
  • Docker images: anyscale/aviary:latest and rayproject/ray-ml:nightly-gpu
  • AWS IP throttling (I think???)

Steps to reproduce:

  1. Clone and install aviary repo as per the readme instructions.
  2. Follow steps for deploying the stack and connecting to it.
  3. Deploy a model.

Expected behavior:

The model deployment should work without the worker deployment failing and restarting repeatedly.

Actual behavior:

The worker deployment fails and restarts repeatedly. This seems to occur when downloading the Docker image anyscale/aviary:latest.

More Info:

  • AWS: us-east-1
  • Aviary version: 0.0.1
  • Ray version: 2.4.0

Weight caching being based on model-id creates confusion

Since Aviary caches weights based on the model id, changing the S3 path for a model id that has already been run has no effect.

So in order for Aviary to respect changes in the S3 path of some model config, you have to go to the cache and delete the checkpoint.
Every time you forget this, you will get a functioning LLM but with the wrong weights.

This behaviour is silent, so it's very hard to realize what the issue is.
From the perspective of someone who does not know the internals of Aviary, this can create serious issues.
For example, you can spend half a day evaluating models with the outcome that they are all of approximately the same quality.
And even after that, you might not realize that you have made a mistake.
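
A minimal sketch of the current workaround, assuming weights are cached under the default Hugging Face hub layout that appears in these logs (models--<org>--<model>); the model id below is only an example:

# Delete the stale cached checkpoint so the new S3 path is actually used
MODEL_ID="mosaicml/mpt-7b-chat"                       # example model id
CACHE_DIR="${XDG_CACHE_HOME:-$HOME/.cache}/huggingface/hub"
rm -rf "$CACHE_DIR/models--${MODEL_ID//\//--}"
# Re-running `aviary run` should then re-download from the updated S3 path.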

S3 bucket model download fails silently if the cluster doesn't have the right permissions

Reproduction: Add this to the Llama-2-7b model YAML, or probably any model YAML:

  s3_mirror_config:
    bucket_uri: s3://anyscale-staging-data-cld-kvedzwag2qa8i5bjxuevf5i7/org_7c1Kalm9WcX2bNIjW53GUT/cld_kvedZWag2qA8i5BjxUevf5i7/artifact_storage/archit__kulkarni/ft_llms_with_deepspeed/meta-llama/Llama-2-7b-hf/demo-gsm-7b/

Any bucket for which your cluster doesn't have permissions should also reproduce the issue.

When you run aviary run for that model YAML, the following messages are printed, which are misleading:

(ServeReplica:meta-llama--Llama-2-7b-chat-hf:meta-llama--Llama-2-7b-chat-hf pid=3309, ip=172.31.60.146) [INFO 2023-09-14 15:05:23,682] utils.py: 63  Downloading meta-llama/Llama-2-7b-chat-hf from s3://anyscale-staging-data-cld-kvedzwag2qa8i5bjxuevf5i7/org_7c1Kalm9WcX2bNIjW53GUT/cld_kvedZWag2qA8i5BjxUevf5i7/artifact_storage/archit__kulkarni/ft_llms_with_deepspeed/meta-llama/Llama-2-7b-hf/demo-gsm-7b/ to /home/ray/.cache/huggingface/hub/models--meta-llama--Llama-2-7b-chat-hf/snapshots/0000000000000000000000000000000000000000
[...]
(ServeReplica:meta-llama--Llama-2-7b-chat-hf:meta-llama--Llama-2-7b-chat-hf pid=3309, ip=172.31.60.146) [INFO 2023-09-14 15:05:24,200] utils.py: 184  Done downloading the model from bucket!

In fact, the folder is empty:

(base) ray@aviary-raycluster-z48t9-worker-gpu-group-m78lw:~$ ls /home/ray/.cache/huggingface/hub/models--meta-llama--Llama-2-7b-chat-hf/snapshots/0000000000000000000000000000000000000000/
(base) ray@aviary-raycluster-z48t9-worker-gpu-group-m78lw:~$

From what I understand from @kouroshHakha and @Yard1 (feel free to correct me): The folder is not supposed to be empty. Because it's empty, when you serve the model you're actually serving some other model (the "base model" or "chat model", I don't remember exactly), instead of the model from the specified S3 bucket.
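
Until the download path fails loudly, a quick sanity check along these lines can catch the problem (a sketch; the bucket URI is a placeholder, and the cache path follows the log output above):

# 1. Can the cluster's credentials actually list the bucket?
aws s3 ls s3://your-bucket/path/to/model/ || echo "no read access to bucket"
# 2. Did anything actually land in the cache?
SNAPSHOT_DIR=~/.cache/huggingface/hub/models--meta-llama--Llama-2-7b-chat-hf/snapshots/0000000000000000000000000000000000000000
ls -lA "$SNAPSHOT_DIR"
[ -z "$(ls -A "$SNAPSHOT_DIR")" ] && echo "snapshot dir is empty -- download silently failed"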

[feature request] aviary model remove

Provide the ability to remove a model that has been deployed via aviary run - this would allow making tweaks to the underlying yaml file and then redeploying without having to kill ray processes or terminating/restarting the cluster.
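
Until such a command exists, a rough workaround sketch using the standard Ray Serve CLI (note that this tears down all running Serve applications, not just one model; the config file name is the LightGPT example shipped with the image):

serve shutdown -y                               # stop all running Serve applications
serve run serve_configs/amazon--LightGPT.yaml   # redeploy with the tweaked config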

[doc] Cannot deploy an LLM model on EKS with KubeRay

I deployed the Aviary head/worker pods on an EKS cluster using KubeRay and tried to deploy an LLM model, but I ran into an error with the following command.

serve run serve/meta-llama--Llama-2-7b-chat-hf.yaml

However, I couldn't deploy it.

I think the problem is Python package compatibility.
Is there a requirements.txt file with the appropriate package versions (e.g. for pydantic)?

The following are the commands I ran and the output.

$ kubectl exec -it aviary-head-vjlb4 -- bash

(base) ray@aviary-head-vjlb4:~$ pwd
/home/ray

(base) ray@aviary-head-vjlb4:~$ export HUGGING_FACE_HUB_TOKEN=${MY_HUGGING_FACE_HUB_TOKEN}

(base) ray@aviary-head-vjlb4:~$ serve run serve/meta-llama--Llama-2-7b-chat-hf.yaml
2023-10-27 00:34:01,394 INFO scripts.py:471 -- Running import path: 'serve/meta-llama--Llama-2-7b-chat-hf.yaml'.
Traceback (most recent call last):
  File "/home/ray/anaconda3/bin/serve", line 8, in <module>
    sys.exit(cli())
  File "/home/ray/anaconda3/lib/python3.9/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/ray/anaconda3/lib/python3.9/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/serve/scripts.py", line 473, in run
    import_attr(import_path), args_dict
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/_private/utils.py", line 1378, in import_attr
    module = importlib.import_module(module_name)
  File "/home/ray/anaconda3/lib/python3.9/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1030, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1007, in _find_and_load
  File "<frozen importlib._bootstrap>", line 984, in _find_and_load_unlocked
ModuleNotFoundError: No module named 'serve/meta-llama--Llama-2-7b-chat-hf'

(base) ray@aviary-head-vjlb4:~$ serve run models/continuous_batching/meta-llama--Llama-2-7b-chat-hf.yaml
2023-10-27 00:47:36,307 INFO scripts.py:418 -- Running config file: 'models/continuous_batching/meta-llama--Llama-2-7b-chat-hf.yaml'.
Traceback (most recent call last):
  File "/home/ray/anaconda3/bin/serve", line 8, in <module>
    sys.exit(cli())
  File "/home/ray/anaconda3/lib/python3.9/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/ray/anaconda3/lib/python3.9/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/serve/scripts.py", line 462, in run
    raise v1_err from None
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/serve/scripts.py", line 449, in run
    config = ServeApplicationSchema.parse_obj(config_dict)
  File "pydantic/main.py", line 526, in pydantic.main.BaseModel.parse_obj
  File "pydantic/main.py", line 341, in pydantic.main.BaseModel.__init__
pydantic.error_wrappers.ValidationError: 1 validation error for ServeApplicationSchema
import_path
  field required (type=value_error.missing)

Thank you for checking.
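
For what it's worth, `serve run` expects either a Python import path or a Ray Serve config file containing an import_path field, which the YAMLs under models/ are not. A sketch of the likely intended command, assuming the image ships the Serve configs under ~/serve_configs as in the local quickstart (the exact file name is an assumption):

serve run ~/serve_configs/meta-llama--Llama-2-7b-chat-hf.yaml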

[docs] Improve docs around configuration

We need more polish for the config API, especially the scaling config:
• I don't understand what it is about just by looking at the name.
• It still talks about Ray AIR.
• The config itself is not that intuitive. I had to think about it for a while before realizing that each model replica actor spawns multiple workers, each of which requests those Ray resources (see the sketch below).

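A worked example of the semantics described in the last point, with purely hypothetical numbers:

# 2 replicas, each spawning 4 workers, each worker requesting 1 GPU
replicas=2; num_workers=4; gpus_per_worker=1
echo "GPUs needed: $(( replicas * num_workers * gpus_per_worker ))"   # prints 8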

Feature request: GCS support

I was able to deploy Aviary on GCP using KubeRay, but I noticed that models are downloaded from S3. I would like to host some models in Google Cloud Storage instead.
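
A possible interim workaround sketch until GCS mirrors are supported: copy the weights from GCS to local disk on the node with gsutil and point the model config at that local path (RayLLM supports models present on local disk); the bucket and directory below are placeholders.

gsutil -m cp -r gs://my-bucket/my-model /home/ray/models/my-model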

Request for Comment: RayLLM <-> FastChat Integration

Hello team,

I would like to suggest an integration with the FastChat project (29k stars).
The idea would be to be able to instantiate FastChat-compatible workers with RayLLM and take advantage of both great projects, with features like infra auto-scaling. What do you think?
It would be amazing to have such an integration!

Multiple models: second model always requests GPU: 1

Using the instructions here: https://github.com/ray-project/ray-llm#how-do-i-deploy-multiple-models-at-once I'm trying to host two models on a single A100 80G.

Two bundles are generated for the placement group:

{0: {'accelerator_type:A100': 0.1, 'CPU': 1.0},
 1: {'accelerator_type:A100': 0.1, 'GPU': 1.0, 'CPU': 1.0}}

Bundle 0 correctly generates with my configured CPU and accelerator type.
Bundle 1 adds in an additional GPU requirement.

Now, if I swap the order of the models in the multi-model config, the first model always boots and the second model doesn't because the superfluous GPU:1 entry is added (I think).

This always leads to the following log entries for the second model - e.g.

deployment_state.py:1974 - Deployment 'VLLMDeployment:TheBloke--Mistral-7b-OpenOrca-AWQ' in application 'ray-llm' has 1 replicas that have taken more than 30s to be scheduled. This may be due to waiting for the cluster to auto-scale or for a runtime environment to be installed. Resources required for each replica: [{"accelerator_type:A100": 0.1, "CPU": 1.0}, {"accelerator_type:A100": 0.1, "CPU": 1.0, "GPU": 1.0}], total resources available: {}. Use `ray status` for more details.

Trying to work out whether this is a bug or a misunderstanding on my part.
Happy to provide further details as needed :)

So far I've tried the provided containers plus building from source.
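
A debugging sketch (not a fix) to confirm what the scheduler is actually being asked for, using the standard Ray CLI:

ray status                           # total vs. available cluster resources
ray list placement-groups --detail   # the bundles requested for each deployment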

Issues serving other models from HF

The example meta-llama/Llama-2-7b-chat-hf and amazon/LightGPT models load and serve without issue.

However, anytime I try other models such as:

  • tiiuae/falcon-7b
  • mistralai/Mistral-7B-v0.1

with the following RayCluster config:

apiVersion: ray.io/v1alpha1
kind: RayCluster
metadata:
  name: aviary
spec:
  # Ray head pod template
  headGroupSpec:
    # The `rayStartParams` are used to configure the `ray start` command.
    # See https://github.com/ray-project/kuberay/blob/master/docs/guidance/rayStartParams.md for the default settings of `rayStartParams` in KubeRay.
    # See https://docs.ray.io/en/latest/cluster/cli.html#ray-start for all available options in `rayStartParams`.
    rayStartParams:
      resources: '"{\"accelerator_type_cpu\": 2}"'
      dashboard-host: '0.0.0.0'
      block: 'true'
    #pod template
    template:
      spec:
        containers:
        - name: ray-head
          image: anyscale/aviary:0.3.1
          resources:
            limits:
              cpu: 2
              memory: 8Gi
            requests:
              cpu: 2
              memory: 8Gi
          ports:
          - containerPort: 6379
            name: gcs-server
          - containerPort: 8265 # Ray dashboard
            name: dashboard
          - containerPort: 10001
            name: client
          - containerPort: 8000
            name: serve
  workerGroupSpecs:
  # the pod replicas in this group typed worker
  - replicas: 1
    minReplicas: 0
    maxReplicas: 1
    # logical group name, for this called small-group, also can be functional
    groupName: gpu-group
    rayStartParams:
      block: 'true'
      resources: '"{\"accelerator_type_cpu\": 8, \"accelerator_type_t4\": 2}"'
    # pod template
    template:
      spec:
        containers:
        - name: llm
          image: anyscale/aviary:0.3.1
          env:
          - name: HUGGING_FACE_HUB_TOKEN
            value: ${HF_API_TOKEN}
          lifecycle:
            preStop:
              exec:
                command: ["/bin/sh","-c","ray stop"]
          resources:
            limits:
              cpu: "8"
              memory: "20G"
              nvidia.com/gpu: 2
            requests:
              cpu: "8"
              memory: "20G"
              nvidia.com/gpu: 2
        # Please ensure the following taint has been applied to the GPU node in the cluster.
        tolerations:
          - key: "ray.io/node-type"
            operator: "Equal"
            value: "worker"
            effect: "NoSchedule"
        nodeSelector:
          cloud.google.com/gke-accelerator: nvidia-tesla-t4

and then running:

aviary run --model model.yaml

I end up with the following error when trying to load these models.

(ServeController pid=1548)   File "/home/ray/anaconda3/lib/python3.9/site-packages/torch/cuda/__init__.py", line 247, in _lazy_init
(ServeController pid=1548)     torch._C._cuda_init()
(ServeController pid=1548) RuntimeError: No CUDA GPUs are available

However, a quick check in Python shows that torch can see the GPU:

Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> 
>>> print(torch.cuda.is_available())
True
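
A debugging sketch for this situation: confirm the GPUs are visible both to the container and to Ray on the worker pod (the pod name below is a placeholder):

kubectl exec -it <worker-pod-name> -- nvidia-smi   # are the GPUs visible to the container?
kubectl exec -it <worker-pod-name> -- ray status   # does Ray report GPU: 2.0 for the cluster?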
