
openllm's Introduction


🦾 OpenLLM: Self-Hosting Large Language Models Made Easy


Run any open-source LLMs, such as Llama 2 and Mistral, as OpenAI-compatible API endpoints, locally and in the cloud.

📖 Introduction

OpenLLM is an open-source platform designed to facilitate the deployment and operation of large language models (LLMs) in real-world applications. With OpenLLM, you can run inference on any open-source LLM, deploy models in the cloud or on-premises, and build powerful AI applications.

Key features include:

🚂 State-of-the-art LLMs: Integrated support for a wide range of open-source LLMs and model runtimes, including but not limited to Llama 2, StableLM, Falcon, Dolly, Flan-T5, ChatGLM, and StarCoder.

🔥 Flexible APIs: Serve LLMs over a RESTful API or gRPC with a single command. You can interact with the model using a Web UI, CLI, Python/JavaScript clients, or any HTTP client of your choice.

⛓️ Freedom to build: First-class support for LangChain, BentoML, LlamaIndex, OpenAI endpoints, and Hugging Face, allowing you to easily create your own AI applications by composing LLMs with other models and services.

🎯 Streamline deployment: Automatically generate your LLM server Docker images or deploy as serverless endpoints via ☁️ BentoCloud, which effortlessly manages GPU resources, scales according to traffic, and ensures cost-effectiveness.

🤖️ Bring your own LLM: Fine-tune any LLM to suit your needs. You can load LoRA layers to fine-tune models for higher accuracy and performance for specific tasks. A unified fine-tuning API for models (LLM.tuning()) is coming soon.

⚡ Quantization: Run inference with lower computational and memory costs via quantization techniques such as LLM.int8, SpQR (int4), AWQ, GPTQ, and SqueezeLLM.

📡 Streaming: Support token streaming through server-sent events (SSE). You can use the /v1/generate_stream endpoint for streaming responses from LLMs; a minimal client sketch follows this feature list.

🔄 Continuous batching: Support continuous batching via vLLM for increased total throughput.
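
For example, the /v1/generate_stream endpoint mentioned above can be consumed with any SSE-capable HTTP client. Below is a minimal sketch using the requests library; the JSON payload is assumed to mirror the /v1/generate example shown later in this README, and the exact fields and event format may vary between OpenLLM versions.

import requests

payload = {
  'prompt': 'What is the meaning of life?',
  'llm_config': {'max_new_tokens': 128, 'temperature': 0.75},
}
with requests.post('http://localhost:3000/v1/generate_stream', json=payload, stream=True) as resp:
  for line in resp.iter_lines():
    if line.startswith(b'data: '):  # SSE events are prefixed with "data: "
      print(line[len(b'data: '):].decode(), flush=True)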

OpenLLM is designed for AI application developers working to build production-ready applications based on LLMs. It delivers a comprehensive suite of tools and features for fine-tuning, serving, deploying, and monitoring these models, simplifying the end-to-end deployment workflow for LLMs.

Gif showing OpenLLM Intro


💾 TL/DR

To get started, we provide two ways to quickly try out OpenLLM:

Jupyter Notebooks

Try this OpenLLM tutorial in Google Colab: Serving Llama 2 with OpenLLM.

Docker

We provide a Docker image that lets you quickly run OpenLLM:

docker run --rm -it -p 3000:3000 ghcr.io/bentoml/openllm start facebook/opt-1.3b --backend pt

Note

If you have access to GPUs and have set up nvidia-docker, you can additionally pass --gpus to use them for faster inference and optimization:

docker run --rm --gpus all -p 3000:3000 -it ghcr.io/bentoml/openllm start HuggingFaceH4/zephyr-7b-beta --backend vllm

🏃 Get started

The following provides instructions for how to get started with OpenLLM locally.

Prerequisites

You have installed Python 3.8 (or later) and pip. We highly recommend using a Virtual Environment to prevent package conflicts.

Install OpenLLM

Install OpenLLM by using pip as follows:

pip install openllm

To verify the installation, run:

$ openllm -h

Usage: openllm [OPTIONS] COMMAND [ARGS]...

   ██████╗ ██████╗ ███████╗███╗   ██╗██╗     ██╗     ███╗   ███╗
  ██╔═══██╗██╔══██╗██╔════╝████╗  ██║██║     ██║     ████╗ ████║
  ██║   ██║██████╔╝█████╗  ██╔██╗ ██║██║     ██║     ██╔████╔██║
  ██║   ██║██╔═══╝ ██╔══╝  ██║╚██╗██║██║     ██║     ██║╚██╔╝██║
  ╚██████╔╝██║     ███████╗██║ ╚████║███████╗███████╗██║ ╚═╝ ██║
   ╚═════╝ ╚═╝     ╚══════╝╚═╝  ╚═══╝╚══════╝╚══════╝╚═╝     ╚═╝.

  An open platform for operating large language models in production.
  Fine-tune, serve, deploy, and monitor any LLMs with ease.

Options:
  -v, --version  Show the version and exit.
  -h, --help     Show this message and exit.

Commands:
  build       Package a given models into a BentoLLM.
  import      Setup LLM interactively.
  models      List all supported models.
  prune       Remove all saved models, (and optionally bentos) built with OpenLLM locally.
  query       Query a LLM interactively, from a terminal.
  start       Start a LLMServer for any supported LLM.
  start-grpc  Start a gRPC LLMServer for any supported LLM.

Extensions:
  build-base-container  Base image builder for BentoLLM.
  dive-bentos           Dive into a BentoLLM.
  get-containerfile     Return Containerfile of any given Bento.
  get-prompt            Get the default prompt used by OpenLLM.
  list-bentos           List available bentos built by OpenLLM.
  list-models           This is equivalent to openllm models...
  playground            OpenLLM Playground.

Start an LLM server

OpenLLM allows you to quickly spin up an LLM server using openllm start. For example, to start a phi-2 server, run the following:

TRUST_REMOTE_CODE=True openllm start microsoft/phi-2

This starts the server at http://0.0.0.0:3000/. OpenLLM downloads the model to the BentoML local Model Store if it has not been registered before. To view your local models, run bentoml models list.

To interact with the server, you can visit the web UI at http://0.0.0.0:3000/ or send a request using curl. You can also use OpenLLM’s built-in Python client to interact with the server:

import openllm

client = openllm.client.HTTPClient('http://localhost:3000')
client.query('Explain to me the difference between "further" and "farther"')

Alternatively, use the openllm query command to query the model:

export OPENLLM_ENDPOINT=http://localhost:3000
openllm query 'Explain to me the difference between "further" and "farther"'

OpenLLM seamlessly supports many models and their variants. You can specify different variants of the model to be served. For example:

openllm start <model_id> --<options>

Note

OpenLLM supports specifying fine-tuning weights and quantized weights for any of the supported models as long as they can be loaded with the model architecture. Use the openllm models command to see the complete list of supported models, their architectures, and their variants.

Important

If you are testing OpenLLM on CPU, you might want to pass in DTYPE=float32. By default, OpenLLM will set model dtype to bfloat16 for the best performance.

DTYPE=float32 openllm start microsoft/phi-2

This also applies to older GPUs. If your GPU doesn't support bfloat16, set DTYPE=float16.

🧩 Supported models

OpenLLM currently supports the following models. By default, OpenLLM doesn't include dependencies to run all models. The extra model-specific dependencies can be installed with the instructions below.

Baichuan

Quickstart

Note: Baichuan requires extra dependencies. Install them with:

pip install "openllm[baichuan]"

Run the following command to quickly spin up a Baichuan server:

TRUST_REMOTE_CODE=True openllm start baichuan-inc/baichuan-7b

In a different terminal, run the following command to interact with the server:

export OPENLLM_ENDPOINT=http://localhost:3000
openllm query 'What are large language models?'

Note: Any Baichuan variants can be deployed with OpenLLM. Visit the HuggingFace Model Hub to see more Baichuan-compatible models.

Supported models

You can specify any of the following Baichuan models via openllm start:

ChatGLM

Quickstart

Note: ChatGLM requires extra dependencies. Install them with:

pip install "openllm[chatglm]"

Run the following command to quickly spin up a ChatGLM server:

TRUST_REMOTE_CODE=True openllm start thudm/chatglm-6b

In a different terminal, run the following command to interact with the server:

export OPENLLM_ENDPOINT=http://localhost:3000
openllm query 'What are large language models?'

Note: Any ChatGLM variants can be deployed with OpenLLM. Visit the HuggingFace Model Hub to see more ChatGLM-compatible models.

Supported models

You can specify any of the following ChatGLM models via openllm start:

Dbrx

Quickstart

Note: Dbrx requires extra dependencies. Install them with:

pip install "openllm[dbrx]"

Run the following command to quickly spin up a Dbrx server:

TRUST_REMOTE_CODE=True openllm start databricks/dbrx-instruct

In a different terminal, run the following command to interact with the server:

export OPENLLM_ENDPOINT=http://localhost:3000
openllm query 'What are large language models?'

Note: Any Dbrx variants can be deployed with OpenLLM. Visit the HuggingFace Model Hub to see more Dbrx-compatible models.

Supported models

You can specify any of the following Dbrx models via openllm start:

DollyV2

Quickstart

Run the following command to quickly spin up a DollyV2 server:

TRUST_REMOTE_CODE=True openllm start databricks/dolly-v2-3b

In a different terminal, run the following command to interact with the server:

export OPENLLM_ENDPOINT=http://localhost:3000
openllm query 'What are large language models?'

Note: Any DollyV2 variants can be deployed with OpenLLM. Visit the HuggingFace Model Hub to see more DollyV2-compatible models.

Supported models

You can specify any of the following DollyV2 models via openllm start:

Falcon

Quickstart

Note: Falcon requires extra dependencies. Install them with:

pip install "openllm[falcon]"

Run the following command to quickly spin up a Falcon server:

TRUST_REMOTE_CODE=True openllm start tiiuae/falcon-7b

In a different terminal, run the following command to interact with the server:

export OPENLLM_ENDPOINT=http://localhost:3000
openllm query 'What are large language models?'

Note: Any Falcon variants can be deployed with OpenLLM. Visit the HuggingFace Model Hub to see more Falcon-compatible models.

Supported models

You can specify any of the following Falcon models via openllm start:

FlanT5

Quickstart

Run the following command to quickly spin up a FlanT5 server:

TRUST_REMOTE_CODE=True openllm start google/flan-t5-large

In a different terminal, run the following command to interact with the server:

export OPENLLM_ENDPOINT=http://localhost:3000
openllm query 'What are large language models?'

Note: Any FlanT5 variants can be deployed with OpenLLM. Visit the HuggingFace Model Hub to see more FlanT5-compatible models.

Supported models

You can specify any of the following FlanT5 models via openllm start:

Gemma

Quickstart

Note: Gemma requires extra dependencies. Install them with:

pip install "openllm[gemma]"

Run the following command to quickly spin up a Gemma server:

TRUST_REMOTE_CODE=True openllm start google/gemma-7b

In a different terminal, run the following command to interact with the server:

export OPENLLM_ENDPOINT=http://localhost:3000
openllm query 'What are large language models?'

Note: Any Gemma variants can be deployed with OpenLLM. Visit the HuggingFace Model Hub to see more Gemma-compatible models.

Supported models

You can specify any of the following Gemma models via openllm start:

GPTNeoX

Quickstart

Run the following command to quickly spin up a GPTNeoX server:

TRUST_REMOTE_CODE=True openllm start eleutherai/gpt-neox-20b

In a different terminal, run the following command to interact with the server:

export OPENLLM_ENDPOINT=http://localhost:3000
openllm query 'What are large language models?'

Note: Any GPTNeoX variants can be deployed with OpenLLM. Visit the HuggingFace Model Hub to see more GPTNeoX-compatible models.

Supported models

You can specify any of the following GPTNeoX models via openllm start:

Llama

Quickstart

Note: Llama requires extra dependencies. Install them with:

pip install "openllm[llama]"

Run the following command to quickly spin up a Llama server:

TRUST_REMOTE_CODE=True openllm start NousResearch/llama-2-7b-hf

In a different terminal, run the following command to interact with the server:

export OPENLLM_ENDPOINT=http://localhost:3000
openllm query 'What are large language models?'

Note: Any Llama variants can be deployed with OpenLLM. Visit the HuggingFace Model Hub to see more Llama-compatible models.

Supported models

You can specify any of the following Llama models via openllm start:

Mistral

Quickstart

Note: Mistral requires extra dependencies. Install them with:

pip install "openllm[mistral]"

Run the following command to quickly spin up a Mistral server:

TRUST_REMOTE_CODE=True openllm start mistralai/Mistral-7B-Instruct-v0.1

In a different terminal, run the following command to interact with the server:

export OPENLLM_ENDPOINT=http://localhost:3000
openllm query 'What are large language models?'

Note: Any Mistral variants can be deployed with OpenLLM. Visit the HuggingFace Model Hub to see more Mistral-compatible models.

Supported models

You can specify any of the following Mistral models via openllm start:

Mixtral

Quickstart

Note: Mixtral requires extra dependencies. Install them with:

pip install "openllm[mixtral]"

Run the following command to quickly spin up a Mixtral server:

TRUST_REMOTE_CODE=True openllm start mistralai/Mixtral-8x7B-Instruct-v0.1

In a different terminal, run the following command to interact with the server:

export OPENLLM_ENDPOINT=http://localhost:3000
openllm query 'What are large language models?'

Note: Any Mixtral variants can be deployed with OpenLLM. Visit the HuggingFace Model Hub to see more Mixtral-compatible models.

Supported models

You can specify any of the following Mixtral models via openllm start:

MPT

Quickstart

Note: MPT requires extra dependencies. Install them with:

pip install "openllm[mpt]"

Run the following command to quickly spin up an MPT server:

TRUST_REMOTE_CODE=True openllm start mosaicml/mpt-7b-instruct

In a different terminal, run the following command to interact with the server:

export OPENLLM_ENDPOINT=http://localhost:3000
openllm query 'What are large language models?'

Note: Any MPT variants can be deployed with OpenLLM. Visit the HuggingFace Model Hub to see more MPT-compatible models.

Supported models

You can specify any of the following MPT models via openllm start:

OPT

Quickstart

Note: OPT requires extra dependencies. Install them with:

pip install "openllm[opt]"

Run the following command to quickly spin up an OPT server:

openllm start facebook/opt-1.3b

In a different terminal, run the following command to interact with the server:

export OPENLLM_ENDPOINT=http://localhost:3000
openllm query 'What are large language models?'

Note: Any OPT variants can be deployed with OpenLLM. Visit the HuggingFace Model Hub to see more OPT-compatible models.

Supported models

You can specify any of the following OPT models via openllm start:

Phi

Quickstart

Note: Phi requires extra dependencies. Install them with:

pip install "openllm[phi]"

Run the following command to quickly spin up a Phi server:

TRUST_REMOTE_CODE=True openllm start microsoft/phi-2

In a different terminal, run the following command to interact with the server:

export OPENLLM_ENDPOINT=http://localhost:3000
openllm query 'What are large language models?'

Note: Any Phi variants can be deployed with OpenLLM. Visit the HuggingFace Model Hub to see more Phi-compatible models.

Supported models

You can specify any of the following Phi models via openllm start:

Qwen

Quickstart

Note: Qwen requires extra dependencies. Install them with:

pip install "openllm[qwen]"

Run the following command to quickly spin up a Qwen server:

TRUST_REMOTE_CODE=True openllm start qwen/Qwen-7B-Chat

In a different terminal, run the following command to interact with the server:

export OPENLLM_ENDPOINT=http://localhost:3000
openllm query 'What are large language models?'

Note: Any Qwen variants can be deployed with OpenLLM. Visit the HuggingFace Model Hub to see more Qwen-compatible models.

Supported models

You can specify any of the following Qwen models via openllm start:

StableLM

Quickstart

Note: StableLM requires extra dependencies. Install them with:

pip install "openllm[stablelm]"

Run the following command to quickly spin up a StableLM server:

TRUST_REMOTE_CODE=True openllm start stabilityai/stablelm-tuned-alpha-3b

In a different terminal, run the following command to interact with the server:

export OPENLLM_ENDPOINT=http://localhost:3000
openllm query 'What are large language models?'

Note: Any StableLM variants can be deployed with OpenLLM. Visit the HuggingFace Model Hub to see more StableLM-compatible models.

Supported models

You can specify any of the following StableLM models via openllm start:

StarCoder

Quickstart

Note: StarCoder requires extra dependencies. Install them with:

pip install "openllm[starcoder]"

Run the following command to quickly spin up a StarCoder server:

TRUST_REMOTE_CODE=True openllm start bigcode/starcoder

In a different terminal, run the following command to interact with the server:

export OPENLLM_ENDPOINT=http://localhost:3000
openllm query 'What are large language models?'

Note: Any StarCoder variants can be deployed with OpenLLM. Visit the HuggingFace Model Hub to see more StarCoder-compatible models.

Supported models

You can specify any of the following StarCoder models via openllm start:

Yi

Quickstart

Note: Yi requires extra dependencies. Install them with:

pip install "openllm[yi]"

Run the following command to quickly spin up a Yi server:

TRUST_REMOTE_CODE=True openllm start 01-ai/Yi-6B

In a different terminal, run the following command to interact with the server:

export OPENLLM_ENDPOINT=http://localhost:3000
openllm query 'What are large language models?'

Note: Any Yi variants can be deployed with OpenLLM. Visit the HuggingFace Model Hub to see more Yi-compatible models.

Supported models

You can specify any of the following Yi models via openllm start:

More models will be integrated with OpenLLM, and we welcome contributions if you want to incorporate your custom LLMs into the ecosystem. Check out the Adding a New Model Guide to learn more.

💻 Run your model on multiple GPUs

OpenLLM allows you to start your model server on multiple GPUs and specify the number of workers per resource with the --workers-per-resource option. For example, if you have 4 available GPUs, set the value to 0.25 (1 divided by the number of GPUs), since only one instance of the Runner server will be spawned.

TRUST_REMOTE_CODE=True openllm start microsoft/phi-2 --workers-per-resource 0.25

Note

The number of GPUs required depends on the model size. You can use the Model Memory Calculator from Hugging Face to estimate how much vRAM is needed for training and inference, and then plan your GPU strategy accordingly.

When the --workers-per-resource option is used with the openllm build command, the corresponding environment variable is saved into the resulting Bento.

For more information, see Resource scheduling strategy.
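
As a rough, weights-only back-of-the-envelope alternative to the calculator above (ignoring KV cache, activations, and framework overhead), multiply the parameter count by the bytes per parameter. A minimal sketch:

def estimate_weight_gib(num_params: float, bytes_per_param: float = 2.0) -> float:
  # 2 bytes per parameter for bfloat16/float16, 1 for int8, 0.5 for 4-bit formats
  return num_params * bytes_per_param / 1024**3

print(f'{estimate_weight_gib(7e9):.1f} GiB')  # ~13.0 GiB for a 7B model in half precision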

🛞 Runtime implementations

Different LLMs may support multiple runtime implementations. Models with vLLM (vllm) support use vLLM by default; otherwise, OpenLLM falls back to PyTorch (pt).

To select a runtime for your chosen model, use the --backend option. For example:

openllm start meta-llama/Llama-2-7b-chat-hf --backend vllm

Note:

  1. To use the vLLM backend, you need a GPU with Ampere architecture or newer and CUDA version 11.8.
  2. To see the backend options of each model supported by OpenLLM, see the Supported models section or run openllm models.

📐 Quantization

Quantization is a technique to reduce the storage and computation requirements for machine learning models, particularly during inference. By approximating floating-point numbers as integers (quantized values), quantization allows for faster computations, reduced memory footprint, and can make it feasible to deploy large models on resource-constrained devices.
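
As a toy illustration of the idea (not how any particular backend implements it), here is a minimal symmetric int8 quantization sketch in NumPy: real values are mapped to integers via a per-tensor scale and mapped back at compute time with some rounding error.

import numpy as np

def quantize_int8(x: np.ndarray):
  scale = np.abs(x).max() / 127.0          # map the largest magnitude to 127
  q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
  return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
  return q.astype(np.float32) * scale

weights = np.random.randn(4).astype(np.float32)
q, scale = quantize_int8(weights)
print(weights, dequantize(q, scale))       # close, but not identical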

OpenLLM supports the following quantization techniques:

PyTorch backend

With the PyTorch backend, OpenLLM supports int8, int4, and gptq.

To use int8 or int4 quantization through bitsandbytes, run:

TRUST_REMOTE_CODE=True openllm start microsoft/phi-2 --quantize int8

To run inference with gptq, simply pass --quantize gptq:

openllm start TheBloke/Llama-2-7B-Chat-GPTQ --quantize gptq

Note

To run GPTQ, make sure you run pip install "openllm[gptq]" first to install the dependencies. As recommended in the GPTQ paper, quantize the weights before serving. See AutoGPTQ for more information on GPTQ quantization.

vLLM backend

With the vLLM backend, OpenLLM supports awq and squeezellm.

To run inference with awq, simply pass --quantize awq:

openllm start TheBloke/zephyr-7B-alpha-AWQ --quantize awq

To run inference with squeezellm, simply pass --quantize squeezellm:

openllm start squeeze-ai-lab/sq-llama-2-7b-w4-s0 --quantize squeezellm --serialization legacy

Important

Since both squeezellm and awq are weight-aware quantization methods, the weights must be quantized ahead of serving rather than on the fly at inference time. Make sure to find compatible pre-quantized weights on the HuggingFace Hub for your model of choice.

🛠️ Serving fine-tuning layers

PEFT, or Parameter-Efficient Fine-Tuning, is a methodology designed to fine-tune pre-trained models more efficiently. Instead of adjusting all model parameters, PEFT focuses on tuning only a subset, reducing computational and storage costs. LoRA (Low-Rank Adaptation) is one of the techniques supported by PEFT. It streamlines fine-tuning by using low-rank decomposition to represent weight updates, thereby drastically reducing the number of trainable parameters.
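
As a small illustration of why LoRA reduces trainable parameters (the numbers below are illustrative, not tied to any specific model): instead of updating a full d x k weight matrix, LoRA learns two low-rank factors B (d x r) and A (r x k) with r much smaller than d and k, and applies W' = W + B @ A.

# Parameter-count comparison for a single 4096 x 4096 weight matrix with rank r = 8.
d, k, r = 4096, 4096, 8
full_update_params = d * k       # 16,777,216 trainable parameters for a full update
lora_params = d * r + r * k      # 65,536 trainable parameters for the LoRA factors
print(full_update_params, lora_params, full_update_params // lora_params)  # 256x fewer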

With OpenLLM, you can take advantage of the fine-tuning feature by serving models with any PEFT-compatible layers using the --adapter-id option. For example:

openllm start facebook/opt-6.7b --adapter-id aarnphm/opt-6-7b-quotes:default

OpenLLM also provides flexibility by supporting adapters from custom file paths:

openllm start facebook/opt-6.7b --adapter-id /path/to/adapters:local_adapter

To use multiple adapters, use the following format:

openllm start facebook/opt-6.7b --adapter-id aarnphm/opt-6.7b-lora:default --adapter-id aarnphm/opt-6.7b-french:french_lora

By default, all adapters will be injected into the models during startup. Adapters can be specified per request via adapter_name:

curl -X 'POST' \
  'http://localhost:3000/v1/generate' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
  "prompt": "What is the meaning of life?",
  "stop": [
    "philosopher"
  ],
  "llm_config": {
    "max_new_tokens": 256,
    "temperature": 0.75,
    "top_k": 15,
    "top_p": 1
  },
  "adapter_name": "default"
}'
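
The same request can be sent from Python with the requests library; the payload below simply mirrors the curl call above.

import requests

payload = {
  'prompt': 'What is the meaning of life?',
  'stop': ['philosopher'],
  'llm_config': {'max_new_tokens': 256, 'temperature': 0.75, 'top_k': 15, 'top_p': 1},
  'adapter_name': 'default',
}
print(requests.post('http://localhost:3000/v1/generate', json=payload).json())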

To include adapters in the Bento, specify the --adapter-id option with the openllm build command:

openllm build facebook/opt-6.7b --adapter-id ...

If you use a relative path for --adapter-id, you need to add --build-ctx.

openllm build facebook/opt-6.7b --adapter-id ./path/to/adapter_id --build-ctx .

Important

Fine-tuning support is still experimental and currently only works with PyTorch backend. vLLM support is coming soon.

🐍 Python SDK

Each LLM can be instantiated with openllm.LLM:

import openllm

llm = openllm.LLM('microsoft/phi-2')

The main inference API is the streaming generate_iterator method:

async for generation in llm.generate_iterator('What is the meaning of life?'):
  print(generation.outputs[0].text)

Note

The motivation behind making llm.generate_iterator an async generator is to support continuous batching with the vLLM backend. With async endpoints, each prompt is correctly added to the request queue for the vLLM backend to process.
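
Because llm.generate_iterator is an async generator, it must be consumed inside a coroutine. A minimal, self-contained sketch:

import asyncio

import openllm

llm = openllm.LLM('microsoft/phi-2')

async def main():
  async for generation in llm.generate_iterator('What is the meaning of life?'):
    print(generation.outputs[0].text, flush=True)

asyncio.run(main())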

There is also a one-shot generate method:

await llm.generate('What is the meaning of life?')

This method is convenient for one-shot generation, but it mainly serves as an example of how to use llm.generate_iterator, since it calls generate_iterator under the hood.

Important

If you need to call this code from a synchronous context, wrap it in an async function and run it with asyncio.run:

import asyncio

async def generate(prompt, **attrs):
  return await llm.generate(prompt, **attrs)

asyncio.run(generate("The meaning of life is", temperature=0.23))

⚙️ Integrations

OpenLLM is not just a standalone product; it's a building block designed to integrate with other powerful tools easily. We currently offer integration with BentoML, OpenAI's Compatible Endpoints, LlamaIndex, LangChain, and Transformers Agents.

OpenAI Compatible Endpoints

OpenLLM Server can be used as a drop-in replacement for OpenAI's API. Simply set base_url to your OpenLLM server's /v1 endpoint (for example, http://localhost:3000/v1) and you are good to go:

import openai

client = openai.OpenAI(
  base_url='http://localhost:3000/v1', api_key='na'
)  # Here the server is running on localhost:3000

model = client.models.list().data[0].id  # use whichever model this server is serving

completions = client.completions.create(
  prompt='Write me a tag line for an ice cream shop.', model=model, max_tokens=64, stream=False
)

The compatible endpoints support /completions, /chat/completions, and /models.
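
Since /chat/completions is also exposed, the same client can be used for chat-style requests. A minimal sketch; the served model id is looked up via /models rather than hard-coded, and the prompt is only an example.

import openai

client = openai.OpenAI(base_url='http://localhost:3000/v1', api_key='na')
model = client.models.list().data[0].id  # whichever model this OpenLLM server is serving

chat = client.chat.completions.create(
  model=model,
  messages=[{'role': 'user', 'content': 'Write me a tag line for an ice cream shop.'}],
  max_tokens=64,
)
print(chat.choices[0].message.content)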

Note

You can find OpenAI client examples under the examples folder.

BentoML

An OpenLLM LLM can be integrated as a Runner in your BentoML service. Simply call await llm.generate to generate text; llm.generate uses the runner under the hood:

import bentoml
import openllm

llm = openllm.LLM('microsoft/phi-2')

svc = bentoml.Service(name='llm-phi-service', runners=[llm.runner])


@svc.api(input=bentoml.io.Text(), output=bentoml.io.Text())
async def prompt(input_text: str) -> str:
  generation = await llm.generate(input_text)
  return generation.outputs[0].text
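
Assuming the snippet above is saved as service.py, the service can then be started locally with bentoml serve service:svc.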

LlamaIndex

To start a local LLM with llama_index, simply use llama_index.llms.openllm.OpenLLM:

import asyncio
from llama_index.llms.openllm import OpenLLM

llm = OpenLLM('HuggingFaceH4/zephyr-7b-alpha')

llm.complete('The meaning of life is')


async def main(prompt, **kwargs):
  async for it in llm.astream_chat(prompt, **kwargs):
    print(it)


asyncio.run(main('The time at San Francisco is'))

If there is a remote LLM Server running elsewhere, then you can use llama_index.llms.openllm.OpenLLMAPI:

from llama_index.llms.openllm import OpenLLMAPI
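
Continuing from the import above, a minimal usage sketch; the address keyword argument is an assumption here, so check the llama_index documentation for your installed version.

# Point the client at a running OpenLLM server; `address` is assumed to be the
# keyword for the server URL in this llama_index integration.
remote_llm = OpenLLMAPI(address='http://localhost:3000')
print(remote_llm.complete('The meaning of life is'))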

Note

All synchronous and asynchronous API from llama_index.llms.LLM are supported.

LangChain

To quickly start a local LLM with langchain, simply do the following:

from langchain.llms import OpenLLM

llm = OpenLLM(model_name='llama', model_id='meta-llama/Llama-2-7b-hf')

llm('What is the difference between a duck and a goose? And why there are so many Goose in Canada?')

Important

By default, OpenLLM uses the safetensors format for saving models. If the model doesn't support safetensors, pass serialisation="legacy" to use the legacy PyTorch bin format.

langchain.llms.OpenLLM can also interact with a remote OpenLLM server. Given an OpenLLM server deployed elsewhere, you can connect to it by specifying its URL:

from langchain.llms import OpenLLM

llm = OpenLLM(server_url='http://44.23.123.1:3000', server_type='http')
llm('What is the difference between a duck and a goose? And why there are so many Goose in Canada?')

To integrate a LangChain agent with BentoML, you can do the following:

import bentoml
from bentoml.io import Text
from langchain.agents import AgentType, initialize_agent, load_tools
from langchain.llms import OpenLLM

llm = OpenLLM(model_id='google/flan-t5-large', embedded=False, serialisation='legacy')
tools = load_tools(['serpapi', 'llm-math'], llm=llm)
agent = initialize_agent(tools, llm, agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION)
svc = bentoml.Service('langchain-openllm', runners=[llm.runner])


@svc.api(input=Text(), output=Text())
def chat(input_text: str):
  return agent.run(input_text)

Note

You can find more examples under the examples folder.

Transformers Agents

OpenLLM seamlessly integrates with Transformers Agents.

Warning

The Transformers Agents integration is still experimental. It is recommended to install OpenLLM with pip install -r nightly-requirements.txt to get the latest API updates for the HuggingFace agent.

import transformers

agent = transformers.HfAgent('http://localhost:3000/hf/agent')  # URL that runs the OpenLLM server

agent.run('Is the following `text` positive or negative?', text="I don't like how this model generates inputs")

Gif showing Agent integration


🚀 Deploying models to production

There are several ways to deploy your LLMs:

🐳 Docker container

  1. Building a Bento: With OpenLLM, you can easily build a Bento for a specific model, like mistralai/Mistral-7B-Instruct-v0.1, using the build command:

    openllm build mistralai/Mistral-7B-Instruct-v0.1

    A Bento, in BentoML, is the unit of distribution. It packages your program's source code, models, files, artefacts, and dependencies.

  2. Containerize your Bento

    bentoml containerize <name:version>

    This generates an OCI-compatible Docker image that can be deployed anywhere Docker runs. For the best scalability and reliability of your LLM service in production, we recommend deploying with BentoCloud.

☁️ BentoCloud

Deploy OpenLLM with BentoCloud, the serverless cloud for shipping and scaling AI applications.

  1. Create a BentoCloud account: sign up here for early access

  2. Log into your BentoCloud account:

    bentoml cloud login --api-token <your-api-token> --endpoint <bento-cloud-endpoint>

Note

Replace <your-api-token> and <bento-cloud-endpoint> with your specific API token and the BentoCloud endpoint respectively.

  3. Building a Bento: With OpenLLM, you can easily build a Bento for a specific model, such as mistralai/Mistral-7B-Instruct-v0.1:

    openllm build mistralai/Mistral-7B-Instruct-v0.1
  4. Pushing a Bento: Push your freshly-built Bento service to BentoCloud via the push command:

    bentoml push <name:version>
  5. Deploying a Bento: Deploy your LLMs to BentoCloud with a single bentoml deployment create command following the deployment instructions.

👥 Community

Engage with like-minded individuals passionate about LLMs, AI, and more on our Discord!

OpenLLM is actively maintained by the BentoML team. Feel free to reach out and join us in our pursuit to make LLMs more accessible and easy to use 👉 Join our Slack community!

🎁 Contributing

We welcome contributions! If you're interested in enhancing OpenLLM's capabilities or have any questions, don't hesitate to reach out in our Discord channel.

Check out our Developer Guide if you wish to contribute to OpenLLM's codebase.

🍇 Telemetry

OpenLLM collects usage data to enhance user experience and improve the product. We only report OpenLLM's internal API calls and ensure maximum privacy by excluding sensitive information. We will never collect user code, model data, or stack traces. For usage tracking, check out the code.

You can opt out of usage tracking by using the --do-not-track CLI option:

openllm [command] --do-not-track

Or by setting the environment variable OPENLLM_DO_NOT_TRACK=True:

export OPENLLM_DO_NOT_TRACK=True

📔 Citation

If you use OpenLLM in your research, please cite the project as follows:

@software{Pham_OpenLLM_Operating_LLMs_2023,
author = {Pham, Aaron and Yang, Chaoyu and Sheng, Sean and  Zhao, Shenyang and Lee, Sauyon and Jiang, Bo and Dong, Fog and Guan, Xipeng and Ming, Frost},
license = {Apache-2.0},
month = jun,
title = {{OpenLLM: Operating LLMs in production}},
url = {https://github.com/bentoml/OpenLLM},
year = {2023}
}

openllm's People

Contributors

aarnphm, abhishek03312, alanpoulain, dependabot[bot], eltociear, fuzzie360, github-actions[bot], gutzufusss, haivilo, hetaobackend, jeffwang0516, jianshen92, larme, matthoffner, mingliangdai, parano, pre-commit-ci[bot], richardscottoz, rudeigerc, sauyon, sherlock113, weibeu, xianml, xunchaoz, yansheng105


openllm's Issues

refactor: logics

Right now there is too much coupling between how start and build interact.

We need to resolve:

  • If the given model_id is a pretrained path?
  • If the given model_id is a custom file path?
    • Do we create a new model entry to the ModelStore?
      • This should be the behaviour since ModelStore changed a bit in BentoML from 1.0.23
    • Do we only include this entry into the Bento during the build?

Right now service start relies on the environment, so it works for now, but it is very fragile.

I will work on this during the weekend

PEFT LORA / QLORA

Hello,
Are you planning to add support for parameter-efficient fine-tuning methods?
Also, does it support running inference with those adapter models to optimize VRAM?
Thanks

run error

When I run "bentoml serve svc.py:svc.I -p 2000"
I get the following error:
"2023-07-04T17:56:21+0800 [INFO] [cli] Starting production HTTP BentoServer from "svc.py:svc" listening on http://0.0.0.0:2000 (Press CTRL+C to quit)
2023-07-04T17:56:23+0800 [ERROR] [runner:pt-glm6b2p:1] An exception occurred while instantiating runner 'pt-glm6b2p', see details below:
2023-07-04T17:56:23+0800 [ERROR] [runner:pt-glm6b2p:1] Traceback (most recent call last):
File "/root/miniconda3/envs/chatglm/lib/python3.11/site-packages/bentoml/_internal/utils/lazy_loader.py", line 69, in getattr
self._module = self._load()
^^^^^^^^^^^^
File "/root/miniconda3/envs/chatglm/lib/python3.11/site-packages/bentoml/_internal/utils/lazy_loader.py", line 53, in _load
raise self._exc(f"{self._exc_msg} (reason: {err})") from None
bentoml.exceptions.MissingDependencyException: None (reason: No module named 'tensorflow')

bug: TypeError: environment can only contain strings

Describe the bug

When starting OpenLLM I hit the error below; retrying gives the same error.

(stephen) C:\Users\stephen\LLM\OpenLLM>openllm start opt
Error caught while starting LLM Server:
environment can only contain strings
Traceback (most recent call last):
File "", line 198, in _run_module_as_main
File "", line 88, in run_code
File "C:\Users\stephen\anaconda3\envs\stephen\Scripts\openllm.exe_main
.py", line 7, in
File "C:\Users\stephen\anaconda3\envs\stephen\Lib\site-packages\click\core.py", line 1130, in call
return self.main(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\stephen\anaconda3\envs\stephen\Lib\site-packages\click\core.py", line 1055, in main
rv = self.invoke(ctx)
^^^^^^^^^^^^^^^^
File "C:\Users\stephen\anaconda3\envs\stephen\Lib\site-packages\click\core.py", line 1657, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\stephen\anaconda3\envs\stephen\Lib\site-packages\click\core.py", line 1657, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\stephen\anaconda3\envs\stephen\Lib\site-packages\click\core.py", line 1404, in invoke
return ctx.invoke(self.callback, **ctx.params)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\stephen\anaconda3\envs\stephen\Lib\site-packages\click\core.py", line 760, in invoke
return __callback(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\stephen\anaconda3\envs\stephen\Lib\site-packages\openllm\cli.py", line 381, in wrapper
return func(*args, **attrs)
^^^^^^^^^^^^^^^^^^^^
File "C:\Users\stephen\anaconda3\envs\stephen\Lib\site-packages\openllm\cli.py", line 354, in wrapper
return_value = func(*args, **attrs)
^^^^^^^^^^^^^^^^^^^^
File "C:\Users\stephen\anaconda3\envs\stephen\Lib\site-packages\openllm\cli.py", line 329, in wrapper
return f(*args, **attrs)
^^^^^^^^^^^^^^^^^
File "C:\Users\stephen\anaconda3\envs\stephen\Lib\site-packages\click\decorators.py", line 26, in new_func
return f(get_current_context(), *args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\stephen\anaconda3\envs\stephen\Lib\site-packages\openllm\cli.py", line 837, in model_start
server.start(env=start_env, text=True, blocking=True)
File "C:\Users\stephen\anaconda3\envs\stephen\Lib\site-packages\bentoml\server.py", line 190, in start
return _Manager()
^^^^^^^^^^
File "C:\Users\stephen\anaconda3\envs\stephen\Lib\site-packages\bentoml\server.py", line 163, in init
self.process = subprocess.Popen(
^^^^^^^^^^^^^^^^^
File "C:\Users\stephen\anaconda3\envs\stephen\Lib\subprocess.py", line 1024, in init
self._execute_child(args, executable, preexec_fn, close_fds,
File "C:\Users\stephen\anaconda3\envs\stephen\Lib\subprocess.py", line 1509, in _execute_child
hp, ht, pid, tid = _winapi.CreateProcess(executable, args,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: environment can only contain strings

To reproduce

No response

Logs

No response

Environment

Environment variable

BENTOML_DEBUG=''
BENTOML_QUIET=''
BENTOML_BUNDLE_LOCAL_BUILD=''
BENTOML_DO_NOT_TRACK=''
BENTOML_CONFIG=''
BENTOML_CONFIG_OPTIONS=''
BENTOML_PORT=''
BENTOML_HOST=''
BENTOML_API_WORKERS=''

System information

bentoml: 1.0.22
python: 3.11.3
platform: Windows-10-10.0.22621-SP0
is_window_admin: False
conda: 23.3.1
in_conda_env: True

conda_packages
name: stephen
channels:
  - defaults
dependencies:
  - aiofiles=22.1.0=py311haa95532_0
  - aiosqlite=0.18.0=py311haa95532_0
  - anyio=3.5.0=py311haa95532_0
  - argon2-cffi=21.3.0=pyhd3eb1b0_0
  - argon2-cffi-bindings=21.2.0=py311h2bbff1b_0
  - asttokens=2.0.5=pyhd3eb1b0_0
  - babel=2.11.0=py311haa95532_0
  - backcall=0.2.0=pyhd3eb1b0_0
  - beautifulsoup4=4.12.2=py311haa95532_0
  - bleach=4.1.0=pyhd3eb1b0_0
  - brotlipy=0.7.0=py311h2bbff1b_1002
  - bzip2=1.0.8=he774522_0
  - ca-certificates=2023.05.30=haa95532_0
  - certifi=2023.5.7=py311haa95532_0
  - cffi=1.15.1=py311h2bbff1b_3
  - charset-normalizer=2.0.4=pyhd3eb1b0_0
  - colorama=0.4.6=py311haa95532_0
  - comm=0.1.2=py311haa95532_0
  - cryptography=39.0.1=py311h21b164f_0
  - debugpy=1.5.1=py311hd77b12b_0
  - decorator=5.1.1=pyhd3eb1b0_0
  - defusedxml=0.7.1=pyhd3eb1b0_0
  - entrypoints=0.4=py311haa95532_0
  - executing=0.8.3=pyhd3eb1b0_0
  - giflib=5.2.1=h8cc25b3_3
  - glib=2.69.1=h5dc1a3c_2
  - gst-plugins-base=1.18.5=h9e645db_0
  - gstreamer=1.18.5=hd78058f_0
  - icu=58.2=ha925a31_3
  - idna=3.4=py311haa95532_0
  - ipykernel=6.19.2=py311h86cfffd_0
  - ipython=8.12.0=py311haa95532_0
  - ipython_genutils=0.2.0=pyhd3eb1b0_1
  - ipywidgets=8.0.4=py311haa95532_0
  - jedi=0.18.1=py311haa95532_1
  - jinja2=3.1.2=py311haa95532_0
  - jpeg=9e=h2bbff1b_1
  - json5=0.9.6=pyhd3eb1b0_0
  - jsonschema=4.17.3=py311haa95532_0
  - jupyter=1.0.0=py311haa95532_8
  - jupyter_client=8.1.0=py311haa95532_0
  - jupyter_console=6.6.3=py311haa95532_0
  - jupyter_core=5.3.0=py311haa95532_0
  - jupyter_events=0.6.3=py311haa95532_0
  - jupyter_server=2.5.0=py311haa95532_0
  - jupyter_server_fileid=0.9.0=py311haa95532_0
  - jupyter_server_terminals=0.4.4=py311haa95532_1
  - jupyter_server_ydoc=0.8.0=py311haa95532_1
  - jupyter_ydoc=0.2.4=py311haa95532_0
  - jupyterlab=3.6.3=py311haa95532_0
  - jupyterlab_pygments=0.1.2=py_0
  - jupyterlab_server=2.22.0=py311haa95532_0
  - jupyterlab_widgets=3.0.5=py311haa95532_0
  - krb5=1.19.4=h5b6d351_0
  - lerc=3.0=hd77b12b_0
  - libclang=14.0.6=default_hb5a9fac_1
  - libclang13=14.0.6=default_h8e68704_1
  - libdeflate=1.17=h2bbff1b_0
  - libffi=3.4.4=hd77b12b_0
  - libiconv=1.16=h2bbff1b_2
  - libogg=1.3.5=h2bbff1b_1
  - libpng=1.6.39=h8cc25b3_0
  - libsodium=1.0.18=h62dcd97_0
  - libtiff=4.5.0=h6c2663c_2
  - libvorbis=1.3.7=he774522_0
  - libwebp=1.2.4=hbc33d0d_1
  - libwebp-base=1.2.4=h2bbff1b_1
  - libxml2=2.10.3=h0ad7f3c_0
  - libxslt=1.1.37=h2bbff1b_0
  - lxml=4.9.2=py311h2bbff1b_0
  - lz4-c=1.9.4=h2bbff1b_0
  - markupsafe=2.1.1=py311h2bbff1b_0
  - matplotlib-inline=0.1.6=py311haa95532_0
  - mistune=0.8.4=py311h2bbff1b_1000
  - nbclassic=0.5.5=py311haa95532_0
  - nbclient=0.5.13=py311haa95532_0
  - nbconvert=6.5.4=py311haa95532_0
  - nbformat=5.7.0=py311haa95532_0
  - nest-asyncio=1.5.6=py311haa95532_0
  - notebook=6.5.4=py311haa95532_0
  - notebook-shim=0.2.2=py311haa95532_0
  - openssl=1.1.1u=h2bbff1b_0
  - packaging=23.0=py311haa95532_0
  - pandocfilters=1.5.0=pyhd3eb1b0_0
  - parso=0.8.3=pyhd3eb1b0_0
  - pcre=8.45=hd77b12b_0
  - pickleshare=0.7.5=pyhd3eb1b0_1003
  - pip=23.1.2=py311haa95532_0
  - platformdirs=2.5.2=py311haa95532_0
  - ply=3.11=py311haa95532_0
  - prometheus_client=0.14.1=py311haa95532_0
  - prompt-toolkit=3.0.36=py311haa95532_0
  - prompt_toolkit=3.0.36=hd3eb1b0_0
  - psutil=5.9.0=py311h2bbff1b_0
  - pure_eval=0.2.2=pyhd3eb1b0_0
  - pycparser=2.21=pyhd3eb1b0_0
  - pygments=2.15.1=py311haa95532_1
  - pyopenssl=23.0.0=py311haa95532_0
  - pyqt=5.15.7=py311hd77b12b_0
  - pyqt5-sip=12.11.0=py311hd77b12b_0
  - pyrsistent=0.18.0=py311h2bbff1b_0
  - pysocks=1.7.1=py311haa95532_0
  - python=3.11.3=h966fe2a_0
  - python-dateutil=2.8.2=pyhd3eb1b0_0
  - python-fastjsonschema=2.16.2=py311haa95532_0
  - python-json-logger=2.0.7=py311haa95532_0
  - pytz=2022.7=py311haa95532_0
  - pywin32=305=py311h2bbff1b_0
  - pywinpty=2.0.10=py311h5da7b33_0
  - pyyaml=6.0=py311h2bbff1b_1
  - pyzmq=25.1.0=py311hd77b12b_0
  - qt-main=5.15.2=he8e5bd7_8
  - qt-webengine=5.15.9=hb9a9bb5_5
  - qtconsole=5.4.2=py311haa95532_0
  - qtpy=2.2.0=py311haa95532_0
  - qtwebkit=5.212=h2bbfb41_5
  - requests=2.29.0=py311haa95532_0
  - rfc3339-validator=0.1.4=py311haa95532_0
  - rfc3986-validator=0.1.1=py311haa95532_0
  - send2trash=1.8.0=pyhd3eb1b0_1
  - setuptools=67.8.0=py311haa95532_0
  - sip=6.6.2=py311hd77b12b_0
  - six=1.16.0=pyhd3eb1b0_1
  - sniffio=1.2.0=py311haa95532_1
  - soupsieve=2.4=py311haa95532_0
  - sqlite=3.41.2=h2bbff1b_0
  - stack_data=0.2.0=pyhd3eb1b0_0
  - terminado=0.17.1=py311haa95532_0
  - tinycss2=1.2.1=py311haa95532_0
  - tk=8.6.12=h2bbff1b_0
  - toml=0.10.2=pyhd3eb1b0_0
  - tornado=6.2=py311h2bbff1b_0
  - traitlets=5.7.1=py311haa95532_0
  - typing-extensions=4.6.3=py311haa95532_0
  - typing_extensions=4.6.3=py311haa95532_0
  - urllib3=1.26.16=py311haa95532_0
  - vc=14.2=h21ff451_1
  - vs2015_runtime=14.27.29016=h5e58377_2
  - wcwidth=0.2.5=pyhd3eb1b0_0
  - webencodings=0.5.1=py311haa95532_1
  - websocket-client=0.58.0=py311haa95532_4
  - wheel=0.38.4=py311haa95532_0
  - widgetsnbextension=4.0.5=py311haa95532_0
  - win_inet_pton=1.1.0=py311haa95532_0
  - winpty=0.4.3=4
  - xz=5.4.2=h8cc25b3_0
  - y-py=0.5.9=py311hb6bf4ef_0
  - yaml=0.2.5=he774522_0
  - ypy-websocket=0.8.2=py311haa95532_0
  - zeromq=4.3.4=hd77b12b_0
  - zlib=1.2.13=h8cc25b3_0
  - zstd=1.5.5=hd43e919_0
  - pip:
      - accelerate==0.20.3
      - aiohttp==3.8.4
      - aiosignal==1.3.1
      - appdirs==1.4.4
      - asgiref==3.7.2
      - async-timeout==4.0.2
      - attrs==23.1.0
      - bentoml==1.0.22
      - build==0.10.0
      - cattrs==23.1.2
      - circus==0.18.0
      - click==8.1.3
      - click-option-group==0.5.6
      - cloudpickle==2.2.1
      - coloredlogs==15.0.1
      - contextlib2==21.6.0
      - datasets==2.13.1
      - deepmerge==1.1.0
      - deprecated==1.2.14
      - dill==0.3.6
      - django==4.2.2
      - filelock==3.12.2
      - filetype==1.2.0
      - frozenlist==1.3.3
      - fs==2.4.16
      - fsspec==2023.6.0
      - grpcio==1.56.0
      - grpcio-health-checking==1.48.2
      - h11==0.14.0
      - httpcore==0.17.2
      - httpx==0.24.1
      - huggingface-hub==0.15.1
      - humanfriendly==10.0
      - importlib-metadata==6.0.1
      - inflection==0.5.1
      - markdown-it-py==3.0.0
      - mdurl==0.1.2
      - mpmath==1.3.0
      - multidict==6.0.4
      - multiprocess==0.70.14
      - networkx==3.1
      - numpy==1.25.0
      - openllm==0.1.13
      - opentelemetry-api==1.17.0
      - opentelemetry-instrumentation==0.38b0
      - opentelemetry-instrumentation-aiohttp-client==0.38b0
      - opentelemetry-instrumentation-asgi==0.38b0
      - opentelemetry-instrumentation-grpc==0.38b0
      - opentelemetry-sdk==1.17.0
      - opentelemetry-semantic-conventions==0.38b0
      - opentelemetry-util-http==0.38b0
      - optimum==1.8.8
      - orjson==3.9.1
      - pandas==2.0.2
      - pathspec==0.11.1
      - pillow==9.5.0
      - pip-requirements-parser==32.0.1
      - pip-tools==6.13.0
      - protobuf==3.20.3
      - pyarrow==12.0.1
      - pydantic==1.10.9
      - pymysql==1.0.3
      - pynvml==11.5.0
      - pyparsing==3.1.0
      - pyproject-hooks==1.0.0
      - pyreadline3==3.4.1
      - python-multipart==0.0.6
      - regex==2023.6.3
      - rich==13.4.2
      - safetensors==0.3.1
      - schema==0.7.5
      - sentencepiece==0.1.99
      - simple-di==0.1.5
      - smartchart==6.6.8
      - smartdb==0.6
      - sqlparse==0.4.4
      - starlette==0.28.0
      - sympy==1.12
      - tabulate==0.9.0
      - tokenizers==0.13.3
      - torch==2.0.1
      - torchvision==0.15.2
      - tqdm==4.65.0
      - transformers==4.30.2
      - tzdata==2023.3
      - uvicorn==0.22.0
      - watchfiles==0.19.0
      - wrapt==1.15.0
      - xxhash==3.2.0
      - yarl==1.9.2
      - zipp==3.15.0
prefix: C:\Users\stephen\anaconda3\envs\stephen
pip_packages
accelerate==0.20.3
aiofiles @ file:///C:/b/abs_9ex6mi6b56/croot/aiofiles_1683773603390/work
aiohttp==3.8.4
aiosignal==1.3.1
aiosqlite @ file:///C:/b/abs_9djc_0pyi3/croot/aiosqlite_1683773915844/work
anyio @ file:///C:/ci_311/anyio_1676425491996/work/dist
appdirs==1.4.4
argon2-cffi @ file:///opt/conda/conda-bld/argon2-cffi_1645000214183/work
argon2-cffi-bindings @ file:///C:/ci_311/argon2-cffi-bindings_1676424443321/work
asgiref==3.7.2
asttokens @ file:///opt/conda/conda-bld/asttokens_1646925590279/work
async-timeout==4.0.2
attrs==23.1.0
Babel @ file:///C:/ci_311/babel_1676427169844/work
backcall @ file:///home/ktietz/src/ci/backcall_1611930011877/work
beautifulsoup4 @ file:///C:/b/abs_0agyz1wsr4/croot/beautifulsoup4-split_1681493048687/work
bentoml==1.0.22
bleach @ file:///opt/conda/conda-bld/bleach_1641577558959/work
brotlipy==0.7.0
build==0.10.0
cattrs==23.1.2
certifi @ file:///C:/b/abs_4a0polqwty/croot/certifi_1683875377622/work/certifi
cffi @ file:///C:/ci_311/cffi_1676423759166/work
charset-normalizer @ file:///tmp/build/80754af9/charset-normalizer_1630003229654/work
circus==0.18.0
click==8.1.3
click-option-group==0.5.6
cloudpickle==2.2.1
colorama @ file:///C:/ci_311/colorama_1676422310965/work
coloredlogs==15.0.1
comm @ file:///C:/ci_311/comm_1678376562840/work
contextlib2==21.6.0
cryptography @ file:///C:/ci_311/cryptography_1679419210767/work
datasets==2.13.1
debugpy @ file:///C:/ci_311/debugpy_1676426137692/work
decorator @ file:///opt/conda/conda-bld/decorator_1643638310831/work
deepmerge==1.1.0
defusedxml @ file:///tmp/build/80754af9/defusedxml_1615228127516/work
Deprecated==1.2.14
dill==0.3.6
Django==4.2.2
entrypoints @ file:///C:/ci_311/entrypoints_1676423328987/work
executing @ file:///opt/conda/conda-bld/executing_1646925071911/work
fastjsonschema @ file:///C:/ci_311/python-fastjsonschema_1679500568724/work
filelock==3.12.2
filetype==1.2.0
frozenlist==1.3.3
fs==2.4.16
fsspec==2023.6.0
grpcio==1.56.0
grpcio-health-checking==1.48.2
h11==0.14.0
httpcore==0.17.2
httpx==0.24.1
huggingface-hub==0.15.1
humanfriendly==10.0
idna @ file:///C:/ci_311/idna_1676424932545/work
importlib-metadata==6.0.1
inflection==0.5.1
ipykernel @ file:///C:/ci_311/ipykernel_1678734799670/work
ipython @ file:///C:/b/abs_d1yx5tjhli/croot/ipython_1680701887259/work
ipython-genutils @ file:///tmp/build/80754af9/ipython_genutils_1606773439826/work
ipywidgets @ file:///C:/b/abs_5awapknmz_/croot/ipywidgets_1679394824767/work
jedi @ file:///C:/ci_311/jedi_1679427407646/work
Jinja2 @ file:///C:/ci_311/jinja2_1676424968965/work
json5 @ file:///tmp/build/80754af9/json5_1624432770122/work
jsonschema @ file:///C:/b/abs_d40z05b6r1/croot/jsonschema_1678983446576/work
jupyter @ file:///C:/ci_311/jupyter_1678249952587/work
jupyter-console @ file:///C:/b/abs_82xaa6i2y4/croot/jupyter_console_1680000189372/work
jupyter-events @ file:///C:/b/abs_4cak_28ewz/croot/jupyter_events_1684268050893/work
jupyter-ydoc @ file:///C:/b/abs_e7m6nh5lao/croot/jupyter_ydoc_1683747253535/work
jupyter_client @ file:///C:/b/abs_059idvdagk/croot/jupyter_client_1680171872444/work
jupyter_core @ file:///C:/b/abs_9d0ttho3bs/croot/jupyter_core_1679906581955/work
jupyter_server @ file:///C:/b/abs_3eh8sm27tx/croot/jupyter_server_1686059851383/work
jupyter_server_fileid @ file:///C:/b/abs_f1yjnmiq_6/croot/jupyter_server_fileid_1684273602142/work
jupyter_server_terminals @ file:///C:/b/abs_ec0dq4b50j/croot/jupyter_server_terminals_1686870763512/work
jupyter_server_ydoc @ file:///C:/b/abs_8ai39bligw/croot/jupyter_server_ydoc_1686767445888/work
jupyterlab @ file:///C:/b/abs_c1msr8zz3y/croot/jupyterlab_1686179674844/work
jupyterlab-pygments @ file:///tmp/build/80754af9/jupyterlab_pygments_1601490720602/work
jupyterlab-widgets @ file:///C:/b/abs_38ad427jkz/croot/jupyterlab_widgets_1679055289211/work
jupyterlab_server @ file:///C:/b/abs_e0qqsihjvl/croot/jupyterlab_server_1680792526136/work
lxml @ file:///C:/b/abs_c2bg6ck92l/croot/lxml_1679646459966/work
markdown-it-py==3.0.0
MarkupSafe @ file:///C:/ci_311/markupsafe_1676424152318/work
matplotlib-inline @ file:///C:/ci_311/matplotlib-inline_1676425798036/work
mdurl==0.1.2
mistune @ file:///C:/ci_311/mistune_1676425149302/work
mpmath==1.3.0
multidict==6.0.4
multiprocess==0.70.14
nbclassic @ file:///C:/b/abs_c8_rs7b3zw/croot/nbclassic_1681756186106/work
nbclient @ file:///C:/ci_311/nbclient_1676425195918/work
nbconvert @ file:///C:/ci_311/nbconvert_1676425836196/work
nbformat @ file:///C:/ci_311/nbformat_1676424215945/work
nest-asyncio @ file:///C:/ci_311/nest-asyncio_1676423519896/work
networkx==3.1
notebook @ file:///C:/b/abs_49d8mc_lpe/croot/notebook_1681756182078/work
notebook_shim @ file:///C:/ci_311/notebook-shim_1678144850856/work
numpy==1.25.0
openllm==0.1.13
opentelemetry-api==1.17.0
opentelemetry-instrumentation==0.38b0
opentelemetry-instrumentation-aiohttp-client==0.38b0
opentelemetry-instrumentation-asgi==0.38b0
opentelemetry-instrumentation-grpc==0.38b0
opentelemetry-sdk==1.17.0
opentelemetry-semantic-conventions==0.38b0
opentelemetry-util-http==0.38b0
optimum==1.8.8
orjson==3.9.1
packaging @ file:///C:/b/abs_ed_kb9w6g4/croot/packaging_1678965418855/work
pandas==2.0.2
pandocfilters @ file:///opt/conda/conda-bld/pandocfilters_1643405455980/work
parso @ file:///opt/conda/conda-bld/parso_1641458642106/work
pathspec==0.11.1
pickleshare @ file:///tmp/build/80754af9/pickleshare_1606932040724/work
Pillow==9.5.0
pip-requirements-parser==32.0.1
pip-tools==6.13.0
platformdirs @ file:///C:/ci_311/platformdirs_1676422658103/work
ply==3.11
prometheus-client @ file:///C:/ci_311/prometheus_client_1679591942558/work
prompt-toolkit @ file:///C:/ci_311/prompt-toolkit_1676425940920/work
protobuf==3.20.3
psutil @ file:///C:/ci_311_rebuilds/psutil_1679005906571/work
pure-eval @ file:///opt/conda/conda-bld/pure_eval_1646925070566/work
pyarrow==12.0.1
pycparser @ file:///tmp/build/80754af9/pycparser_1636541352034/work
pydantic==1.10.9
Pygments @ file:///C:/b/abs_fay9dpq4n_/croot/pygments_1684279990574/work
PyMySQL==1.0.3
pynvml==11.5.0
pyOpenSSL @ file:///C:/b/abs_de215ipd18/croot/pyopenssl_1678965319166/work
pyparsing==3.1.0
pyproject_hooks==1.0.0
PyQt5==5.15.7
PyQt5-sip @ file:///C:/ci_311/pyqt-split_1676428895938/work/pyqt_sip
pyreadline3==3.4.1
pyrsistent @ file:///C:/ci_311/pyrsistent_1676422695500/work
PySocks @ file:///C:/ci_311/pysocks_1676425991111/work
python-dateutil @ file:///tmp/build/80754af9/python-dateutil_1626374649649/work
python-json-logger @ file:///C:/b/abs_cblnsm6puj/croot/python-json-logger_1683824130469/work
python-multipart==0.0.6
pytz @ file:///C:/ci_311/pytz_1676427070848/work
pywin32==305.1
pywinpty @ file:///C:/ci_311/pywinpty_1677707791185/work/target/wheels/pywinpty-2.0.10-cp311-none-win_amd64.whl
PyYAML @ file:///C:/ci_311/pyyaml_1676432488822/work
pyzmq @ file:///C:/b/abs_655zk4a3s8/croot/pyzmq_1686601465034/work
qtconsole @ file:///C:/b/abs_eb4u9jg07y/croot/qtconsole_1681402843494/work
QtPy @ file:///C:/ci_311/qtpy_1676432558504/work
regex==2023.6.3
requests @ file:///C:/b/abs_41owkd5ymz/croot/requests_1682607524657/work
rfc3339-validator @ file:///C:/b/abs_ddfmseb_vm/croot/rfc3339-validator_1683077054906/work
rfc3986-validator @ file:///C:/b/abs_6e9azihr8o/croot/rfc3986-validator_1683059049737/work
rich==13.4.2
safetensors==0.3.1
schema==0.7.5
Send2Trash @ file:///tmp/build/80754af9/send2trash_1632406701022/work
sentencepiece==0.1.99
simple-di==0.1.5
sip @ file:///C:/ci_311/sip_1676427825172/work
six @ file:///tmp/build/80754af9/six_1644875935023/work
smartchart==6.6.8
smartdb==0.6
sniffio @ file:///C:/ci_311/sniffio_1676425339093/work
soupsieve @ file:///C:/b/abs_a989exj3q6/croot/soupsieve_1680518492466/work
sqlparse==0.4.4
stack-data @ file:///opt/conda/conda-bld/stack_data_1646927590127/work
starlette==0.28.0
sympy==1.12
tabulate==0.9.0
terminado @ file:///C:/ci_311/terminado_1678228513830/work
tinycss2 @ file:///C:/ci_311/tinycss2_1676425376744/work
tokenizers==0.13.3
toml @ file:///tmp/build/80754af9/toml_1616166611790/work
torch==2.0.1
torchvision==0.15.2
tornado @ file:///C:/ci_311/tornado_1676423689414/work
tqdm==4.65.0
traitlets @ file:///C:/ci_311/traitlets_1676423290727/work
transformers==4.30.2
typing_extensions @ file:///C:/b/abs_5em9ekwz24/croot/typing_extensions_1686602003259/work
tzdata==2023.3
urllib3 @ file:///C:/b/abs_889_loyqv4/croot/urllib3_1686163174463/work
uvicorn==0.22.0
watchfiles==0.19.0
wcwidth @ file:///Users/ktietz/demo/mc3/conda-bld/wcwidth_1629357192024/work
webencodings==0.5.1
websocket-client @ file:///C:/ci_311/websocket-client_1676426063281/work
widgetsnbextension @ file:///C:/b/abs_882k4_4kdf/croot/widgetsnbextension_1679313880295/work
win-inet-pton @ file:///C:/ci_311/win_inet_pton_1676425458225/work
wrapt==1.15.0
xxhash==3.2.0
y-py @ file:///C:/b/abs_b7f5go6r0j/croot/y-py_1683662173571/work
yarl==1.9.2
ypy-websocket @ file:///C:/b/abs_4e65ywlnv8/croot/ypy-websocket_1684172103529/work
zipp==3.15.0

System information (Optional)

No response

bug: Missing Dependency when running

Describe the bug

Hello,
I followed the instructions on github (nothing more) and when I try to run it with the following command:
sudo docker run -it --rm -p 3000:3000 google-flan-t5-xl-service:53fd1e22aa944eee1fd336f9aee8a437e01676ce serve
I'm getting the following error.

Error: [bentoml-cli] serve failed: Failed loading Bento from directory /home/bentoml/bento: Failed to import module "generated_flan_t5_service": No module named 'orjson'

To reproduce

No response

Logs

sudo docker run -it --rm -p 3000:3000 google-flan-t5-xl-service:53fd1e22aa944eee1fd336f9aee8a437e01676ce serve

Error: [bentoml-cli] `serve` failed: Failed loading Bento from directory /home/bentoml/bento: Failed to import module "generated_flan_t5_service": No module named 'orjson'

Environment

python:3.9.17
bentoml:1.0.22
openllm:0.1.8

bug: Could not get starcoder to work on all 3 platforms - Mac OS, Windows and Linux

Describe the bug

In Mac OS, starcoder does not even load, probably because it has no Nvidia GPU.

In Windows, the main issue is the dependency on the bitsandbytes library. Since the makers of that library never made a version for Windows, we have to rely on some hacks and tricks to get it to install on Windows. Which is something I had done in the past in my global environment with great difficulty; but since OpenLLM recommends to install it on a new conda environment, I will need to start all the way back from installing CUDA, which itself is a big hassle on Windows. So I gave up.

Next, I tried on Google Colab notebook, which runs on Linux. Here, bitsandbytes is installed; however, this is the error message I get when I try to start starcoder:


Full stacktrace provided in the Logs section below.

To reproduce

  1. Start a Google Colab notebook with GPU runtime type
  2. Install libraries:
!pip install openllm
!pip install "openllm[starcoder]"
!pip install einops xformers safetensors
!pip install -q -U bitsandbytes
  3. Check if bitsandbytes is installed correctly by trying to import it without error:

import bitsandbytes

  4. Start starcoder:

!openllm start starcoder

Logs

Make sure to have the following dependencies available: ['bitsandbytes']
Running 'starcoder' requires at least 2 GPUs/CPUs available per worker. Make sure that it has available resources for inference.
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_errors.py", line 259, in hf_raise_for_status
    response.raise_for_status()
  File "/usr/local/lib/python3.10/dist-packages/requests/models.py", line 960, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 401 Client Error: Unauthorized for url: https://huggingface.co/bigcode/starcoder/resolve/main/config.json

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/transformers/utils/hub.py", line 417, in cached_file
    resolved_file = hf_hub_download(
  File "/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_validators.py", line 118, in _inner_fn
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/huggingface_hub/file_download.py", line 1195, in hf_hub_download
    metadata = get_hf_file_metadata(
  File "/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_validators.py", line 118, in _inner_fn
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/huggingface_hub/file_download.py", line 1541, in get_hf_file_metadata
    hf_raise_for_status(r)
  File "/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_errors.py", line 291, in hf_raise_for_status
    raise RepositoryNotFoundError(message, response) from e
huggingface_hub.utils._errors.RepositoryNotFoundError: 401 Client Error. (Request ID: Root=1-6494a45b-09304ff2569b37293300e2b5)

Repository Not Found for url: https://huggingface.co/bigcode/starcoder/resolve/main/config.json.
Please make sure you specified the correct `repo_id` and `repo_type`.
If you are trying to access a private or gated repo, make sure you are authenticated.
Invalid username or password.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.10/dist-packages/openllm/__main__.py", line 26, in <module>
    cli()
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/openllm/cli.py", line 342, in wrapper
    return func(*args, **attrs)
  File "/usr/local/lib/python3.10/dist-packages/openllm/cli.py", line 315, in wrapper
    return_value = func(*args, **attrs)
  File "/usr/local/lib/python3.10/dist-packages/openllm/cli.py", line 290, in wrapper
    return f(*args, **attrs)
  File "/usr/local/lib/python3.10/dist-packages/openllm/cli.py", line 1248, in download_models
    _ref = bentoml.transformers.get(model.tag)
  File "/usr/local/lib/python3.10/dist-packages/openllm/_llm.py", line 729, in tag
    self.__llm_tag__ = self.make_tag(
  File "/usr/local/lib/python3.10/dist-packages/openllm/_llm.py", line 661, in make_tag
    transformers.AutoConfig.from_pretrained(
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/auto/configuration_auto.py", line 944, in from_pretrained
    config_dict, unused_kwargs = PretrainedConfig.get_config_dict(pretrained_model_name_or_path, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/configuration_utils.py", line 574, in get_config_dict
    config_dict, kwargs = cls._get_config_dict(pretrained_model_name_or_path, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/configuration_utils.py", line 629, in _get_config_dict
    resolved_config_file = cached_file(
  File "/usr/local/lib/python3.10/dist-packages/transformers/utils/hub.py", line 433, in cached_file
    raise EnvironmentError(
OSError: bigcode/starcoder is not a local folder and is not a valid model identifier listed on 'https://huggingface.co/models'
If this is a private repository, make sure to pass a token having permission to this repo with `use_auth_token` or log in with `huggingface-cli login` and pass `use_auth_token=True`.
Traceback (most recent call last):
  File "/usr/local/bin/openllm", line 8, in <module>
    sys.exit(cli())
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/openllm/cli.py", line 342, in wrapper
    return func(*args, **attrs)
  File "/usr/local/lib/python3.10/dist-packages/openllm/cli.py", line 315, in wrapper
    return_value = func(*args, **attrs)
  File "/usr/local/lib/python3.10/dist-packages/openllm/cli.py", line 290, in wrapper
    return f(*args, **attrs)
  File "/usr/local/lib/python3.10/dist-packages/click/decorators.py", line 26, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/openllm/cli.py", line 701, in model_start
    llm = t.cast(
  File "/usr/local/lib/python3.10/dist-packages/openllm/models/auto/factory.py", line 127, in for_model
    llm.ensure_model_id_exists()
  File "/usr/local/lib/python3.10/dist-packages/openllm/_llm.py", line 688, in ensure_model_id_exists
    output = subprocess.check_output(
  File "/usr/lib/python3.10/subprocess.py", line 421, in check_output
    return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
  File "/usr/lib/python3.10/subprocess.py", line 526, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['/usr/bin/python3', '-m', 'openllm', 'download', 'starcoder', '--model-id', 'bigcode/starcoder', '--output', 'porcelain']' returned non-zero exit status 1.

Environment

!transformers-cli env

2023-06-22 19:50:20.588693: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
WARNING:tensorflow:From /usr/local/lib/python3.10/dist-packages/transformers/commands/env.py:63: is_gpu_available (from tensorflow.python.framework.test_util) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.config.list_physical_devices('GPU')` instead.
2023-06-22 19:50:26.993695: W tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:47] Overriding orig_value setting because the TF_FORCE_GPU_ALLOW_GROWTH environment variable is set. Original config value was 0.

Copy-and-paste the text below in your GitHub issue and FILL OUT the two last points.

- `transformers` version: 4.30.2
- Platform: Linux-5.15.107+-x86_64-with-glibc2.31
- Python version: 3.10.12
- Huggingface_hub version: 0.15.1
- Safetensors version: 0.3.1
- PyTorch version (GPU?): 2.0.1+cu118 (True)
- Tensorflow version (GPU?): 2.12.0 (True)
- Flax version (CPU?/GPU?/TPU?): 0.6.9 (gpu)
- Jax version: 0.4.10
- JaxLib version: 0.4.10
- Using GPU in script?: <fill in>
- Using distributed or parallel set-up in script?: <fill in>

-----------------------------------------------------------------------------------------------------------------------------------------

!bentoml env

#### Environment variable

```bash
BENTOML_DEBUG=''
BENTOML_QUIET=''
BENTOML_BUNDLE_LOCAL_BUILD=''
BENTOML_DO_NOT_TRACK=''
BENTOML_CONFIG=''
BENTOML_CONFIG_OPTIONS=''
BENTOML_PORT=''
BENTOML_HOST=''
BENTOML_API_WORKERS=''
```

#### System information

`bentoml`: 1.0.22
`python`: 3.10.12
`platform`: Linux-5.15.107+-x86_64-with-glibc2.31
`uid_gid`: 0:0
<details><summary><code>pip_packages</code></summary>

<br>

absl-py==1.4.0
accelerate==0.20.3
aiohttp==3.8.4
aiosignal==1.3.1
alabaster==0.7.13
albumentations==1.2.1
altair==4.2.2
anyio==3.6.2
appdirs==1.4.4
argon2-cffi==21.3.0
argon2-cffi-bindings==21.2.0
array-record==0.2.0
arviz==0.15.1
asgiref==3.7.2
astropy==5.2.2
astunparse==1.6.3
async-timeout==4.0.2
attrs==23.1.0
audioread==3.0.0
autograd==1.5
Babel==2.12.1
backcall==0.2.0
beautifulsoup4==4.11.2
bentoml==1.0.22
bitsandbytes==0.39.1
bleach==6.0.0
blis==0.7.9
blosc2==2.0.0
bokeh==2.4.3
branca==0.6.0
build==0.10.0
CacheControl==0.12.11
cached-property==1.5.2
cachetools==5.3.0
catalogue==2.0.8
cattrs==23.1.2
certifi==2022.12.7
cffi==1.15.1
chardet==4.0.0
charset-normalizer==2.0.12
chex==0.1.7
circus==0.18.0
click==8.1.3
click-option-group==0.5.6
cloudpickle==2.2.1
cmake==3.25.2
cmdstanpy==1.1.0
colorcet==3.0.1
coloredlogs==15.0.1
colorlover==0.3.0
community==1.0.0b1
confection==0.0.4
cons==0.4.5
contextlib2==0.6.0.post1
contourpy==1.0.7
convertdate==2.4.0
cryptography==40.0.2
cufflinks==0.17.3
cupy-cuda11x==11.0.0
cvxopt==1.3.0
cvxpy==1.3.1
cycler==0.11.0
cymem==2.0.7
Cython==0.29.34
dask==2022.12.1
datascience==0.17.6
datasets==2.13.1
db-dtypes==1.1.1
dbus-python==1.2.16
debugpy==1.6.6
decorator==4.4.2
deepmerge==1.1.0
defusedxml==0.7.1
Deprecated==1.2.14
dill==0.3.6
distributed==2022.12.1
dlib==19.24.1
dm-tree==0.1.8
docutils==0.16
dopamine-rl==4.0.6
duckdb==0.8.1
earthengine-api==0.1.350
easydict==1.10
ecos==2.0.12
editdistance==0.6.2
einops==0.6.1
en-core-web-sm @ https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.5.0/en_core_web_sm-3.5.0-py3-none-any.whl#sha256=0964370218b7e1672a30ac50d72cdc6b16f7c867496f1d60925691188f4d2510
entrypoints==0.4
ephem==4.1.4
et-xmlfile==1.1.0
etils==1.2.0
etuples==0.3.8
exceptiongroup==1.1.1
fastai==2.7.12
fastcore==1.5.29
fastdownload==0.0.7
fastjsonschema==2.16.3
fastprogress==1.0.3
fastrlock==0.8.1
filelock==3.12.0
filetype==1.2.0
firebase-admin==5.3.0
Flask==2.2.4
flatbuffers==23.3.3
flax==0.6.9
folium==0.14.0
fonttools==4.39.3
frozendict==2.3.7
frozenlist==1.3.3
fs==2.4.16
fsspec==2023.4.0
future==0.18.3
gast==0.4.0
GDAL==3.3.2
gdown==4.6.6
gensim==4.3.1
geographiclib==2.0
geopy==2.3.0
gin-config==0.5.0
glob2==0.7
google==2.0.3
google-api-core==2.11.0
google-api-python-client==2.84.0
google-auth==2.17.3
google-auth-httplib2==0.1.0
google-auth-oauthlib==1.0.0
google-cloud-bigquery==3.9.0
google-cloud-bigquery-storage==2.19.1
google-cloud-core==2.3.2
google-cloud-datastore==2.15.1
google-cloud-firestore==2.11.0
google-cloud-language==2.9.1
google-cloud-storage==2.8.0
google-cloud-translate==3.11.1
google-colab @ file:///colabtools/dist/google-colab-1.0.0.tar.gz#sha256=a06d013448fd1c1fc1ef60002405ebbf1541361bdeaab486815149c389e4f1cb
google-crc32c==1.5.0
google-pasta==0.2.0
google-resumable-media==2.5.0
googleapis-common-protos==1.59.0
googledrivedownloader==0.4
graphviz==0.20.1
greenlet==2.0.2
grpcio==1.54.0
grpcio-health-checking==1.48.2
grpcio-status==1.48.2
gspread==3.4.2
gspread-dataframe==3.0.8
gym==0.25.2
gym-notices==0.0.8
h11==0.14.0
h5netcdf==1.1.0
h5py==3.8.0
holidays==0.25
holoviews==1.15.4
html5lib==1.1
httpcore==0.17.2
httpimport==1.3.0
httplib2==0.21.0
httpx==0.24.1
huggingface-hub==0.15.1
humanfriendly==10.0
humanize==4.6.0
hyperopt==0.2.7
idna==3.4
imageio==2.25.1
imageio-ffmpeg==0.4.8
imagesize==1.4.1
imbalanced-learn==0.10.1
imgaug==0.4.0
importlib-metadata==6.0.1
importlib-resources==5.12.0
imutils==0.5.4
inflect==6.0.4
inflection==0.5.1
iniconfig==2.0.0
intel-openmp==2023.1.0
ipykernel==5.5.6
ipython==7.34.0
ipython-genutils==0.2.0
ipython-sql==0.4.1
ipywidgets==7.7.1
itsdangerous==2.1.2
jax==0.4.10
jaxlib @ https://storage.googleapis.com/jax-releases/cuda11/jaxlib-0.4.10+cuda11.cudnn86-cp310-cp310-manylinux2014_x86_64.whl#sha256=fe53205ef12727c80ed5ac2d4506d6732c0c3db69ede4565a7d4df98e609af84
jieba==0.42.1
Jinja2==3.1.2
joblib==1.2.0
jsonpickle==3.0.1
jsonschema==4.3.3
jupyter-client==6.1.12
jupyter-console==6.1.0
jupyter-server==1.24.0
jupyter_core==5.3.0
jupyterlab-pygments==0.2.2
jupyterlab-widgets==3.0.7
kaggle==1.5.13
keras==2.12.0
kiwisolver==1.4.4
korean-lunar-calendar==0.3.1
langcodes==3.3.0
lazy_loader==0.2
libclang==16.0.0
librosa==0.10.0.post2
lightgbm==3.3.5
lit==16.0.5
llvmlite==0.39.1
locket==1.0.0
logical-unification==0.4.5
LunarCalendar==0.0.9
lxml==4.9.2
Markdown==3.4.3
markdown-it-py==2.2.0
MarkupSafe==2.1.2
matplotlib==3.7.1
matplotlib-inline==0.1.6
matplotlib-venn==0.11.9
mdurl==0.1.2
miniKanren==1.0.3
missingno==0.5.2
mistune==0.8.4
mizani==0.8.1
mkl==2019.0
ml-dtypes==0.1.0
mlxtend==0.14.0
more-itertools==9.1.0
moviepy==1.0.3
mpmath==1.3.0
msgpack==1.0.5
multidict==6.0.4
multipledispatch==0.6.0
multiprocess==0.70.14
multitasking==0.0.11
murmurhash==1.0.9
music21==8.1.0
mypy-extensions==1.0.0
natsort==8.3.1
nbclient==0.7.4
nbconvert==6.5.4
nbformat==5.8.0
nest-asyncio==1.5.6
networkx==3.1
nibabel==3.0.2
nltk==3.8.1
notebook==6.4.8
numba==0.56.4
numexpr==2.8.4
numpy==1.22.4
oauth2client==4.1.3
oauthlib==3.2.2
opencv-contrib-python==4.7.0.72
opencv-python==4.7.0.72
opencv-python-headless==4.7.0.72
openllm==0.1.10
openpyxl==3.0.10
opentelemetry-api==1.17.0
opentelemetry-instrumentation==0.38b0
opentelemetry-instrumentation-aiohttp-client==0.38b0
opentelemetry-instrumentation-asgi==0.38b0
opentelemetry-instrumentation-grpc==0.38b0
opentelemetry-sdk==1.17.0
opentelemetry-semantic-conventions==0.38b0
opentelemetry-util-http==0.38b0
opt-einsum==3.3.0
optax==0.1.5
optimum==1.8.8
orbax-checkpoint==0.2.1
orjson==3.9.1
osqp==0.6.2.post8
packaging==23.1
palettable==3.3.3
pandas==1.5.3
pandas-datareader==0.10.0
pandas-gbq==0.17.9
pandocfilters==1.5.0
panel==0.14.4
param==1.13.0
parso==0.8.3
partd==1.4.0
pathlib==1.0.1
pathspec==0.11.1
pathy==0.10.1
patsy==0.5.3
pexpect==4.8.0
pickleshare==0.7.5
Pillow==8.4.0
pip-requirements-parser==32.0.1
pip-tools==6.13.0
platformdirs==3.3.0
plotly==5.13.1
plotnine==0.10.1
pluggy==1.0.0
polars==0.17.3
pooch==1.6.0
portpicker==1.3.9
prefetch-generator==1.0.3
preshed==3.0.8
prettytable==0.7.2
proglog==0.1.10
progressbar2==4.2.0
prometheus-client==0.16.0
promise==2.3
prompt-toolkit==3.0.38
prophet==1.1.3
proto-plus==1.22.2
protobuf==3.20.3
psutil==5.9.5
psycopg2==2.9.6
ptyprocess==0.7.0
py-cpuinfo==9.0.0
py4j==0.10.9.7
pyarrow==9.0.0
pyasn1==0.5.0
pyasn1-modules==0.3.0
pycocotools==2.0.6
pycparser==2.21
pyct==0.5.0
pydantic==1.10.7
pydata-google-auth==1.7.0
pydot==1.4.2
pydot-ng==2.0.0
pydotplus==2.0.2
PyDrive==1.3.1
pyerfa==2.0.0.3
pygame==2.3.0
Pygments==2.14.0
PyGObject==3.36.0
pymc==5.1.2
PyMeeus==0.5.12
pymystem3==0.2.0
pynvml==11.5.0
PyOpenGL==3.1.6
pyparsing==3.0.9
pyproject_hooks==1.0.0
pyre-extensions==0.0.29
pyrsistent==0.19.3
PySocks==1.7.1
pytensor==2.10.1
pytest==7.2.2
python-apt==0.0.0
python-dateutil==2.8.2
python-json-logger==2.0.7
python-louvain==0.16
python-multipart==0.0.6
python-slugify==8.0.1
python-utils==3.5.2
pytz==2022.7.1
pytz-deprecation-shim==0.1.0.post0
pyviz-comms==2.2.1
PyWavelets==1.4.1
PyYAML==6.0
pyzmq==23.2.1
qdldl==0.1.7
qudida==0.0.4
regex==2022.10.31
requests==2.27.1
requests-oauthlib==1.3.1
requests-unixsocket==0.2.0
requirements-parser==0.5.0
rich==13.3.4
rpy2==3.5.5
rsa==4.9
safetensors==0.3.1
schema==0.7.5
scikit-image==0.19.3
scikit-learn==1.2.2
scipy==1.10.1
scs==3.2.3
seaborn==0.12.2
Send2Trash==1.8.0
sentencepiece==0.1.99
shapely==2.0.1
simple-di==0.1.5
six==1.16.0
sklearn-pandas==2.2.0
smart-open==6.3.0
sniffio==1.3.0
snowballstemmer==2.2.0
sortedcontainers==2.4.0
soundfile==0.12.1
soupsieve==2.4.1
soxr==0.3.5
spacy==3.5.2
spacy-legacy==3.0.12
spacy-loggers==1.0.4
Sphinx==3.5.4
sphinxcontrib-applehelp==1.0.4
sphinxcontrib-devhelp==1.0.2
sphinxcontrib-htmlhelp==2.0.1
sphinxcontrib-jsmath==1.0.1
sphinxcontrib-qthelp==1.0.3
sphinxcontrib-serializinghtml==1.1.5
SQLAlchemy==2.0.10
sqlparse==0.4.4
srsly==2.4.6
starlette==0.28.0
statsmodels==0.13.5
sympy==1.11.1
tables==3.8.0
tabulate==0.9.0
tblib==1.7.0
tenacity==8.2.2
tensorboard==2.12.2
tensorboard-data-server==0.7.0
tensorboard-plugin-wit==1.8.1
tensorflow==2.12.0
tensorflow-datasets==4.9.2
tensorflow-estimator==2.12.0
tensorflow-gcs-config==2.12.0
tensorflow-hub==0.13.0
tensorflow-io-gcs-filesystem==0.32.0
tensorflow-metadata==1.13.1
tensorflow-probability==0.20.1
tensorstore==0.1.36
termcolor==2.3.0
terminado==0.17.1
text-unidecode==1.3
textblob==0.17.1
tf-slim==1.1.0
thinc==8.1.9
threadpoolctl==3.1.0
tifffile==2023.4.12
tinycss2==1.2.1
tokenizers==0.13.3
toml==0.10.2
tomli==2.0.1
toolz==0.12.0
torch @ https://download.pytorch.org/whl/cu118/torch-2.0.1%2Bcu118-cp310-cp310-linux_x86_64.whl#sha256=a7a49d459bf4862f64f7bc1a68beccf8881c2fa9f3e0569608e16ba6f85ebf7b
torchaudio @ https://download.pytorch.org/whl/cu118/torchaudio-2.0.2%2Bcu118-cp310-cp310-linux_x86_64.whl#sha256=26692645ea061a005c57ec581a2d0425210ac6ba9f923edf11cc9b0ef3a111e9
torchdata==0.6.1
torchsummary==1.5.1
torchtext==0.15.2
torchvision @ https://download.pytorch.org/whl/cu118/torchvision-0.15.2%2Bcu118-cp310-cp310-linux_x86_64.whl#sha256=19ca4ab5d6179bbe53cff79df1a855ee6533c2861ddc7389f68349d8b9f8302a
tornado==6.3.1
tqdm==4.65.0
traitlets==5.7.1
transformers==4.30.2
triton==2.0.0
tweepy==4.13.0
typer==0.7.0
types-setuptools==68.0.0.0
typing-inspect==0.9.0
typing_extensions==4.5.0
tzdata==2023.3
tzlocal==4.3
uritemplate==4.1.1
urllib3==1.26.15
uvicorn==0.22.0
vega-datasets==0.9.0
wasabi==1.1.1
watchfiles==0.19.0
wcwidth==0.2.6
webcolors==1.13
webencodings==0.5.1
websocket-client==1.5.1
Werkzeug==2.3.0
widgetsnbextension==3.6.4
wordcloud==1.8.2.2
wrapt==1.14.1
xarray==2022.12.0
xarray-einstats==0.5.1
xformers==0.0.20
xgboost==1.7.5
xlrd==2.0.1
xxhash==3.2.0
yarl==1.9.2
yellowbrick==1.5
yfinance==0.2.18
zict==3.0.0
zipp==3.15.0

</details>

System information (Optional)

No response


feat: Stream completions

Feature request

Is it possible to support stream completion similar to the OpenAI API?
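
For illustration only, here is a minimal sketch of the kind of client-side streaming being asked for: consuming a server-sent-events (SSE) response line by line instead of waiting for the full completion. The endpoint path and payload below are hypothetical placeholders, not an existing OpenLLM API:

```python
# Hypothetical client-side sketch of SSE streaming (endpoint path and payload are
# illustrative placeholders, not an existing OpenLLM API).
import requests

with requests.post(
    "http://localhost:3000/v1/generate",  # placeholder endpoint
    json={"prompt": "Hello", "max_new_tokens": 64},
    stream=True,
) as resp:
    for line in resp.iter_lines(decode_unicode=True):
        if line and line.startswith("data:"):
            chunk = line[len("data:"):].strip()
            print(chunk, flush=True)  # each chunk is printed as soon as it arrives
```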

Motivation

It can save users waiting time and improve the user experience.

Other

No response

feat: AMD GPU support

Feature request

It would be nice to have the option to use AMD GPUs that support ROCm.

PyTorch seems to support ROCm AMD GPUs on Linux - the following was tested on Ubuntu 22.04.2 LTS with an AMD Ryzen 5825U (Radeon Vega Barcelo 8-core, shared memory) and ROCm 5.5.0 and PyTorch for ROCm 5.4.2 installed:

>>> import torch
>>> torch.cuda.is_available()
True

A cursory look suggests that there are currently only CPU and NVIDIA resource implementations in BentoML:

https://github.com/bentoml/BentoML/blob/c34689050ce1a2be2ee6a8809629cd715ae50ea6/src/bentoml/_internal/resource.py#L217
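
For context, a minimal way to tell a ROCm build of PyTorch apart from a CUDA one at runtime, which is roughly the kind of check an AMD resource implementation would need; this is plain PyTorch, not BentoML's API:

```python
# Minimal sketch: distinguish a ROCm (AMD) PyTorch build from a CUDA one.
# torch.cuda.is_available() returns True on ROCm builds as well, so check torch.version.hip.
import torch

def gpu_backend() -> str:
    if not torch.cuda.is_available():
        return "cpu"
    return "rocm" if torch.version.hip is not None else "cuda"

print(gpu_backend())
```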

Motivation

This feature would enable running models with OpenLLM on AMD GPUs with ROCm support.

Other

No response

feat: configuration subclass rework

Feature request

Currently, all of the model-specific configuration is handled via __init_subclass__ in configuration.py, and those values are saved under openllm on the given configuration class.

This is completely fine for now, but as configuration gets more complicated, we might want to rethink how we handle this.

Proposal 1: Simple TypedDict under LLMConfig.__config__:

import openllm

class DollyV2Config(openllm.LLMConfig):
	__config__ = {"trust_remote_code": True, "workers_per_resource": 0.5}
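
A slightly more explicit variant of the same proposal, sketched with a TypedDict so the allowed keys are typed. The field names mirror the example above and are assumptions; this is a sketch of the proposed shape, not something guaranteed to run against the current openllm release:

```python
# Sketch of the proposed __config__ as a TypedDict (field names are illustrative).
from typing import TypedDict

import openllm

class ModelSettings(TypedDict, total=False):
    trust_remote_code: bool
    workers_per_resource: float

class DollyV2Config(openllm.LLMConfig):
    __config__: ModelSettings = {"trust_remote_code": True, "workers_per_resource": 0.5}
```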

Motivation

No response

Other

No response

bug: Failed to download models

Describe the bug

When trying to download the model using the command openllm download-models dolly-v2 or openllm download-models dolly-v2 --model-id databricks/dolly-v2-3b, an error occurred. Please refer to the Logs section for details. Other models have a similar problem.

To reproduce

No response

Logs

(cuda) ➜  ~ openllm download-models dolly-v2
pt-databricks-dolly-v2-3b:877db3ed12a3086500d144b9ef74e469b107a041 does not exists yet!. Downloading...
The model weights are not tied. Please use the `tie_weights` method before using the `infer_auto_device` function.
Using the default model signature for Transformers ({'__call__': ModelSignature(batchable=False, batch_dim=(0, 0), input_spec=None, output_spec=None)}) for model "pt-databricks-dolly-v2-3b:877db3ed12a3086500d144b9ef74e469b107a041".
╭─────────────────────────────── Traceback (most recent call last) ─────────────────────────────────
│ /opt/anaconda/envs/cuda/lib/python3.10/site-packages/openllm/_llm.py:631 in                      │
│ ensure_model_id_exists                                                                           │
│                                                                                                  │
│   628 │   │   trust_remote_code = self._llm_attrs.pop("trust_remote_code", self.config.__openl   │
│   629 │   │   tag, kwds = self.make_tag(return_unused_kwargs=True, trust_remote_code=trust_rem   │
│   630 │   │   try:                                                                               │
│ ❱ 631 │   │   │   return bentoml.transformers.get(tag)                                           │
│   632 │   │   except bentoml.exceptions.BentoMLException:                                        │
│   633 │   │   │   logger.info("'%s' with tag (%s) not found, importing from HuggingFace Hub.",   │
│   634 │   │   │   tokenizer_kwds = {k[len("_tokenizer_") :]: v for k, v in kwds.items() if k.s   │
│                                                                                                  │
│ /opt/anaconda/envs/cuda/lib/python3.10/site-packages/bentoml/_internal/frameworks/transformers.p │
│ y:292 in get                                                                                     │
│                                                                                                  │
│   289 │      # target model must be from the BentoML model store                                 │
│   290 │      model = bentoml.transformers.get("my_pipeline:latest")                              │
│   291 │   """
│ ❱ 292 │   model = bentoml.models.get(tag_like)                                                   │
│   293 │   if model.info.module not in (MODULE_NAME, __name__):                                   │
│   294 │   │   raise NotFound(                                                                    │
│   295 │   │   │   f"Model {model.tag} was saved with module {model.info.module}, not loading w   │
│                                                                                                  │
│ /opt/anaconda/envs/cuda/lib/python3.10/site-packages/simple_di/__init__.py:139 in _              │
│                                                                                                  │
│   136 │   │   bind = sig.bind_partial(*filtered_args, **filtered_kwargs)                         │
│   137 │   │   bind.apply_defaults()                                                              │
│   138 │   │                                                                                      │
│ ❱ 139 │   │   return func(*_inject_args(bind.args), **_inject_kwargs(bind.kwargs))               │
│   140 │                                                                                          │
│   141 │   setattr(_, "_is_injected", True)                                                       │
│   142 │   return cast(WrappedCallable, _)                                                        │
│                                                                                                  │
│ /opt/anaconda/envs/cuda/lib/python3.10/site-packages/bentoml/models.py:42 in get                 │
│                                                                                                  │
│    39 │   *,                                                                                     │
│    40 │   _model_store: "ModelStore" = Provide[BentoMLContainer.model_store],                    │
│    41 ) -> "Model":                                                                              │
│ ❱  42 │   return _model_store.get(tag)                                                           │
│    43                                                                                            │
│    44                                                                                            │
│    45 @inject                                                                                    │
│                                                                                                  │
│ /opt/anaconda/envs/cuda/lib/python3.10/site-packages/bentoml/_internal/store.py:146 in get       │
│                                                                                                  │
│   143 │   │   matches = self._fs.glob(f"{path}*/")                                               │
│   144 │   │   counts = matches.count().directories                                               │
│   145 │   │   if counts == 0:                                                                    │
│ ❱ 146 │   │   │   raise NotFound(                                                                │
│   147 │   │   │   │   f"{self._item_type.get_typename()} '{tag}' is not found in BentoML store   │
│   148 │   │   │   )                                                                              │
│   149 │   │   elif counts == 1:                                                                  │
╰───────────────────────────────────────────────────────────────────────────────────────────────────
NotFound: Model 'pt-databricks-dolly-v2-3b:877db3ed12a3086500d144b9ef74e469b107a041' is not found in BentoML store <osfs '/home/wzxu/bentoml/models'>

During handling of the above exception, another exception occurred:

╭─────────────────────────────── Traceback (most recent call last) ─────────────────────────────────
│ /opt/anaconda/envs/cuda/lib/python3.10/site-packages/torch/serialization.py:423 in save          │
│                                                                                                  │
│    420 │                                                                                         │
│    421 │   if _use_new_zipfile_serialization:                                                    │
│    422 │   │   with _open_zipfile_writer(f) as opened_zipfile:                                   │
│ ❱  423 │   │   │   _save(obj, opened_zipfile, pickle_module, pickle_protocol)                    │
│    424 │   │   │   return                                                                        │
│    425 │   else:                                                                                 │
│    426 │   │   with _open_file_like(f, 'wb') as opened_file:                                     │
│                                                                                                  │
│ /opt/anaconda/envs/cuda/lib/python3.10/site-packages/torch/serialization.py:650 in _save         │
│                                                                                                  │
│    647 │   │   │   storage = storage.cpu()                                                       │
│    648 │   │   # Now that it is on the CPU we can directly copy it into the zip file             │
│    649 │   │   num_bytes = storage.nbytes()                                                      │
│ ❱  650 │   │   zip_file.write_record(name, storage.data_ptr(), num_bytes)                        │
│    651                                                                                           │
│    652                                                                                           │
│    653 def load(                                                                                 │
╰───────────────────────────────────────────────────────────────────────────────────────────────────
RuntimeError: [enforce fail at inline_container.cc:445] . PytorchStreamWriter failed writing file data/0: file write failed

During handling of the above exception, another exception occurred:

╭─────────────────────────────── Traceback (most recent call last) ─────────────────────────────────
│ /opt/anaconda/envs/cuda/bin/openllm:8 in <module>                                                │
│                                                                                                  │
│   5 from openllm.cli import cli                                                                  │
│   6 if __name__ == '__main__':                                                                   │
│   7 │   sys.argv[0] = re.sub(r'(-script\.pyw|\.exe)?$', '', sys.argv[0])                         │
│ ❱ 8 │   sys.exit(cli())                                                                          │
│   9                                                                                              │
│                                                                                                  │
│ /opt/anaconda/envs/cuda/lib/python3.10/site-packages/click/core.py:1130 in __call__              │
│                                                                                                  │
│   1127 │                                                                                         │
│   1128 │   def __call__(self, *args: t.Any, **kwargs: t.Any) -> t.Any:                           │
│   1129 │   │   """Alias for :meth:`main`."""                                                     │
│ ❱ 1130 │   │   return self.main(*args, **kwargs)                                                 │
│   1131                                                                                           │
│   1132                                                                                           │
│   1133 class Command(BaseCommand):                                                               │
│                                                                                                  │
│ /opt/anaconda/envs/cuda/lib/python3.10/site-packages/click/core.py:1055 in main                  │
│                                                                                                  │
│   1052 │   │   try:                                                                              │
│   1053 │   │   │   try:                                                                          │
│   1054 │   │   │   │   with self.make_context(prog_name, args, **extra) as ctx:                  │
│ ❱ 1055 │   │   │   │   │   rv = self.invoke(ctx)                                                 │
│   1056 │   │   │   │   │   if not standalone_mode:                                               │
│   1057 │   │   │   │   │   │   return rv                                                         │
│   1058 │   │   │   │   │   # it's not safe to `ctx.exit(rv)` here!                               │
│                                                                                                  │
│ /opt/anaconda/envs/cuda/lib/python3.10/site-packages/click/core.py:1657 in invoke                │
│                                                                                                  │
│   1654 │   │   │   │   super().invoke(ctx)                                                       │
│   1655 │   │   │   │   sub_ctx = cmd.make_context(cmd_name, args, parent=ctx)                    │
│   1656 │   │   │   │   with sub_ctx:                                                             │
│ ❱ 1657 │   │   │   │   │   return _process_result(sub_ctx.command.invoke(sub_ctx))               │
│   1658 │   │                                                                                     │
│   1659 │   │   # In chain mode we create the contexts step by step, but after the                │
│   1660 │   │   # base command has been invoked.  Because at that point we do not                 │
│                                                                                                  │
│ /opt/anaconda/envs/cuda/lib/python3.10/site-packages/click/core.py:1404 in invoke                │
│                                                                                                  │
│   1401 │   │   │   echo(style(message, fg="red"), err=True)                                      │
│   1402 │   │                                                                                     │
│   1403 │   │   if self.callback is not None:                                                     │
│ ❱ 1404 │   │   │   return ctx.invoke(self.callback, **ctx.params)                                │
│   1405 │                                                                                         │
│   1406 │   def shell_complete(self, ctx: Context, incomplete: str) -> t.List["CompletionItem"]:  │
│   1407 │   │   """Return a list of completions for the incomplete value. Looks                   │
│                                                                                                  │
│ /opt/anaconda/envs/cuda/lib/python3.10/site-packages/click/core.py:760 in invoke                 │
│                                                                                                  │
│    757 │   │                                                                                     │
│    758 │   │   with augment_usage_errors(__self):                                                │
│    759 │   │   │   with ctx:                                                                     │
│ ❱  760 │   │   │   │   return __callback(*args, **kwargs)                                        │
│    761 │                                                                                         │
│    762 │   def forward(                                                                          │
│    763 │   │   __self, __cmd: "Command", *args: t.Any, **kwargs: t.Any  # noqa: B902             │
│                                                                                                  │
│ /opt/anaconda/envs/cuda/lib/python3.10/site-packages/openllm/cli.py:369 in wrapper               │
│                                                                                                  │
│    366 │   │   @functools.wraps(func)                                                            │
│    367 │   │   def wrapper(*args: P.args, **attrs: P.kwargs) -> t.Any:                           │
│    368 │   │   │   try:                                                                          │
│ ❱  369 │   │   │   │   return func(*args, **attrs)                                               │
│    370 │   │   │   except openllm.exceptions.OpenLLMException as err:                            │
│    371 │   │   │   │   raise click.ClickException(                                               │
│    372 │   │   │   │   │   click.style(f"[{group.name}] '{command_name}' failed: " + err.messag  │
│                                                                                                  │
│ /opt/anaconda/envs/cuda/lib/python3.10/site-packages/openllm/cli.py:342 in wrapper               │
│                                                                                                  │
│    339 │   │   │   │   assert group.name is not None, "group.name should not be None"
│    340 │   │   │   │   event = analytics.OpenllmCliEvent(cmd_group=group.name, cmd_name=command  │
│    341 │   │   │   │   try:                                                                      │
│ ❱  342 │   │   │   │   │   return_value = func(*args, **attrs)                                   │
│    343 │   │   │   │   │   duration_in_ms = (time.time_ns() - start_time) / 1e6                  │
│    344 │   │   │   │   │   event.duration_in_ms = duration_in_ms                                 │
│    345 │   │   │   │   │   analytics.track(event)                                                │
│                                                                                                  │
│ /opt/anaconda/envs/cuda/lib/python3.10/site-packages/openllm/cli.py:317 in wrapper               │
│                                                                                                  │
│    314 │   │   │                                                                                 │
│    315 │   │   │   configure_logging()                                                           │
│    316 │   │   │                                                                                 │
│ ❱  317 │   │   │   return f(*args, **attrs)                                                      │
│    318 │   │                                                                                     │
│    319 │   │   return t.cast("ClickFunctionWrapper[..., t.Any]", wrapper)                        │
│    320                                                                                           │
│                                                                                                  │
│ /opt/anaconda/envs/cuda/lib/python3.10/site-packages/openllm/cli.py:892 in download_models       │
│                                                                                                  │
│    889 │   │   if len(bentoml.models.list(tag)) == 0:                                            │
│    890 │   │   │   if output == "pretty":                                                        │
│    891 │   │   │   │   _echo(f"{tag} does not exists yet!. Downloading...", fg="yellow", nl=Tru  │
│ ❱  892 │   │   │   m = model.ensure_model_id_exists()                                            │
│    893 │   │   │   if output == "pretty":                                                        │
│    894 │   │   │   │   _echo(f"Saved model: {m.tag}")                                            │
│    895 │   │   │   elif output == "json":                                                        │
│                                                                                                  │
│ /opt/anaconda/envs/cuda/lib/python3.10/site-packages/openllm/_llm.py:649 in                      │
│ ensure_model_id_exists                                                                           │
│                                                                                                  │
│   646 │   │   │   │   │   **kwds,                                                                │
│   647 │   │   │   │   }                                                                          │
│   648 │   │   │                                                                                  │
│ ❱ 649 │   │   │   return self.import_model(                                                      │
│   650 │   │   │   │   self._model_id,                                                            │
│   651 │   │   │   │   tag,                                                                       │
│   652 │   │   │   │   *self._llm_args,                                                           │
│                                                                                                  │
│ /opt/anaconda/envs/cuda/lib/python3.10/site-packages/openllm/models/dolly_v2/modeling_dolly_v2.p │
│ y:65 in import_model                                                                             │
│                                                                                                  │
│    62 │   │   │   device_map=device_map,                                                         │
│    63 │   │   )                                                                                  │
│    64 │   │   try:                                                                               │
│ ❱  65 │   │   │   return bentoml.transformers.save_model(                                        │
│    66 │   │   │   │   tag,                                                                       │
│    67 │   │   │   │   pipeline,                                                                  │
│    68 │   │   │   │   custom_objects={"tokenizer": tokenizer},                                   │
│                                                                                                  │
│ /opt/anaconda/envs/cuda/lib/python3.10/site-packages/bentoml/_internal/frameworks/transformers.p │
│ y:805 in save_model                                                                              │
│                                                                                                  │
│   802 │   │   │   external_modules=external_modules,                                             │
│   803 │   │   │   metadata=metadata,                                                             │
│   804 │   │   ) as bento_model:                                                                  │
│ ❱ 805 │   │   │   pipeline_.save_pretrained(bento_model.path, **save_kwargs)                     │
│   806 │   │   │                                                                                  │
│   807 │   │   │   # NOTE: we want to pickle the class so that tensorflow, flax pipeline will a   │
│   808 │   │   │   # the weights is already save, so we only need to save the class.              │
│                                                                                                  │
│ /opt/anaconda/envs/cuda/lib/python3.10/site-packages/transformers/pipelines/base.py:860 in       │
│ save_pretrained                                                                                  │
│                                                                                                  │
│    857 │   │   │   # Save the pipeline custom code                                               │
│    858 │   │   │   custom_object_save(self, save_directory)                                      │
│    859 │   │                                                                                     │
│ ❱  860 │   │   self.model.save_pretrained(save_directory, safe_serialization=safe_serialization  │
│    861 │   │                                                                                     │
│    862 │   │   if self.tokenizer is not None:                                                    │
│    863 │   │   │   self.tokenizer.save_pretrained(save_directory)                                │
│                                                                                                  │
│ /opt/anaconda/envs/cuda/lib/python3.10/site-packages/transformers/modeling_utils.py:1849 in      │
│ save_pretrained                                                                                  │
│                                                                                                  │
│   1846 │   │   │   │   # joyfulness), but for now this enough.                                   │
│   1847 │   │   │   │   safe_save_file(shard, os.path.join(save_directory, shard_file), metadata  │
│   1848 │   │   │   else:                                                                         │
│ ❱ 1849 │   │   │   │   save_function(shard, os.path.join(save_directory, shard_file))            │
│   1850 │   │                                                                                     │
│   1851 │   │   if index is None:                                                                 │
│   1852 │   │   │   path_to_weights = os.path.join(save_directory, _add_variant(WEIGHTS_NAME, va  │
│                                                                                                  │
│ /opt/anaconda/envs/cuda/lib/python3.10/site-packages/torch/serialization.py:422 in save          │
│                                                                                                  │
│    419 │   _check_dill_version(pickle_module)                                                    │
│    420 │                                                                                         │
│    421 │   if _use_new_zipfile_serialization:                                                    │
│ ❱  422 │   │   with _open_zipfile_writer(f) as opened_zipfile:                                   │
│    423 │   │   │   _save(obj, opened_zipfile, pickle_module, pickle_protocol)                    │
│    424 │   │   │   return                                                                        │
│    425 │   else:                                                                                 │
│                                                                                                  │
│ /opt/anaconda/envs/cuda/lib/python3.10/site-packages/torch/serialization.py:290 in __exit__      │
│                                                                                                  │
│    287 │   │   super(_open_zipfile_writer_file, self).__init__(torch._C.PyTorchFileWriter(str(n  │
│    288 │                                                                                         │
│    289 │   def __exit__(self, *args) -> None:                                                    │
│ ❱  290 │   │   self.file_like.write_end_of_file()                                                │
│    291                                                                                           │
│    292                                                                                           │
│    293 class _open_zipfile_writer_buffer(_opener):                                               │
╰───────────────────────────────────────────────────────────────────────────────────────────────────
RuntimeError: [enforce fail at inline_container.cc:325] . unexpected pos 68800 vs 68683
terminate called after throwing an instance of 'c10::Error'
  what():  [enforce fail at inline_container.cc:325] . unexpected pos 68800 vs 68683
frame #0: c10::ThrowEnforceNotMet(char const*, int, char const*, std::string const&, void const*) + 0x55 (0x7fe20f8982f5 in /opt/anaconda/envs/cuda/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x3cbbe2c (0x7fe23fb89e2c in /opt/anaconda/envs/cuda/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #2: mz_zip_writer_add_mem_ex_v2 + 0x5c5 (0x7fe23fb83775 in /opt/anaconda/envs/cuda/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #3: caffe2::serialize::PyTorchStreamWriter::writeRecord(std::string const&, void const*, unsigned long, bool) + 0xb9 (0x7fe23fb8b419 in /opt/anaconda/envs/cuda/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #4: caffe2::serialize::PyTorchStreamWriter::writeEndOfFile() + 0x2c3 (0x7fe23fb8b8e3 in /opt/anaconda/envs/cuda/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #5: caffe2::serialize::PyTorchStreamWriter::~PyTorchStreamWriter() + 0x125 (0x7fe23fb8bb55 in /opt/anaconda/envs/cuda/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #6: <unknown function> + 0x806915 (0x7fe267b95915 in /opt/anaconda/envs/cuda/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #7: <unknown function> + 0x3c77f3 (0x7fe2677567f3 in /opt/anaconda/envs/cuda/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #8: <unknown function> + 0x3c86cf (0x7fe2677576cf in /opt/anaconda/envs/cuda/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #9: <unknown function> + 0x12c917 (0x5639bf300917 in /opt/anaconda/envs/cuda/bin/python3.10)
frame #10: <unknown function> + 0x1390d6 (0x5639bf30d0d6 in /opt/anaconda/envs/cuda/bin/python3.10)
frame #11: <unknown function> + 0x148f7f (0x5639bf31cf7f in /opt/anaconda/envs/cuda/bin/python3.10)
frame #12: <unknown function> + 0x148fdd (0x5639bf31cfdd in /opt/anaconda/envs/cuda/bin/python3.10)
frame #13: <unknown function> + 0x148fdd (0x5639bf31cfdd in /opt/anaconda/envs/cuda/bin/python3.10)
frame #14: <unknown function> + 0x148fdd (0x5639bf31cfdd in /opt/anaconda/envs/cuda/bin/python3.10)
frame #15: <unknown function> + 0x148fdd (0x5639bf31cfdd in /opt/anaconda/envs/cuda/bin/python3.10)
frame #16: <unknown function> + 0x148fdd (0x5639bf31cfdd in /opt/anaconda/envs/cuda/bin/python3.10)
frame #17: <unknown function> + 0x148fdd (0x5639bf31cfdd in /opt/anaconda/envs/cuda/bin/python3.10)
frame #18: <unknown function> + 0x148fdd (0x5639bf31cfdd in /opt/anaconda/envs/cuda/bin/python3.10)
frame #19: <unknown function> + 0x148fdd (0x5639bf31cfdd in /opt/anaconda/envs/cuda/bin/python3.10)
frame #20: <unknown function> + 0x148fdd (0x5639bf31cfdd in /opt/anaconda/envs/cuda/bin/python3.10)
frame #21: <unknown function> + 0x148fdd (0x5639bf31cfdd in /opt/anaconda/envs/cuda/bin/python3.10)
frame #22: <unknown function> + 0x148fdd (0x5639bf31cfdd in /opt/anaconda/envs/cuda/bin/python3.10)
frame #23: <unknown function> + 0x148fdd (0x5639bf31cfdd in /opt/anaconda/envs/cuda/bin/python3.10)
frame #24: <unknown function> + 0x148fdd (0x5639bf31cfdd in /opt/anaconda/envs/cuda/bin/python3.10)
frame #25: <unknown function> + 0x148fdd (0x5639bf31cfdd in /opt/anaconda/envs/cuda/bin/python3.10)
frame #26: <unknown function> + 0x148fdd (0x5639bf31cfdd in /opt/anaconda/envs/cuda/bin/python3.10)
frame #27: <unknown function> + 0x12191f (0x5639bf2f591f in /opt/anaconda/envs/cuda/bin/python3.10)
frame #28: PyDict_SetItemString + 0x52 (0x5639bf2f8cb2 in /opt/anaconda/envs/cuda/bin/python3.10)
frame #29: <unknown function> + 0x20bb51 (0x5639bf3dfb51 in /opt/anaconda/envs/cuda/bin/python3.10)
frame #30: Py_FinalizeEx + 0x170 (0x5639bf3dee90 in /opt/anaconda/envs/cuda/bin/python3.10)
frame #31: Py_RunMain + 0x10b (0x5639bf3d141b in /opt/anaconda/envs/cuda/bin/python3.10)
frame #32: Py_BytesMain + 0x39 (0x5639bf39f089 in /opt/anaconda/envs/cuda/bin/python3.10)
frame #33: __libc_start_main + 0xf3 (0x7fe4e9857083 in /lib/x86_64-linux-gnu/libc.so.6)
frame #34: <unknown function> + 0x1caf81 (0x5639bf39ef81 in /opt/anaconda/envs/cuda/bin/python3.10)

[1]    1092880 abort (core dumped)  openllm download-models dolly-v2

Environment

bentoml: 1.0.22
transformers: 4.30.1
python: 3.10.9
platform: Linux dell 5.4.0-150-generic #167-Ubuntu SMP Mon May 15 17:35:05 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

feat: disable GPU (or at least VRAM)

Feature request

I have CUDA libs installed and run out of VRAM:
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 1.95 GiB total capacity; 1.72 GiB already allocated;

My video card is an old (cheap) one, so this is to be expected. Is there a way (a runtime/config flag) to use system RAM instead of video RAM?
Even if it means disabling GPU support completely, I'm fine with that.

Perhaps something along the lines of: --novram or --nogpu
Cheers.
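
A possible workaround in the meantime (not an OpenLLM flag): hide the GPU from CUDA entirely so PyTorch falls back to the CPU and system RAM, for example by clearing CUDA_VISIBLE_DEVICES before anything imports torch. A minimal sketch:

```python
# Hedged workaround sketch: make CUDA report no devices so inference falls back to CPU/system RAM.
# This must run before torch (or openllm) is imported in the process.
import os

os.environ["CUDA_VISIBLE_DEVICES"] = ""

import torch

print(torch.cuda.is_available())  # expected: False, so models load on the CPU
```

The same effect can be had from the shell by exporting an empty CUDA_VISIBLE_DEVICES before running openllm start.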

Motivation

I'd like to keep my CUDA libraries installed for other projects and an option would allow this.

Other

No response

bug: CalledProcessError: Unable to download and use models using OpenLLM

Describe the bug

Cannot use OpenLLM locally at all due to a CalledProcessError.

To reproduce

from langchain.llms import OpenLLM
llm = OpenLLM(model_name='falcon', model_id='tiiuae/falcon-40b-instruct', temperature=0.0)
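
One way to see the underlying error, since the CalledProcessError in the Logs section swallows the child process's output: re-run the download step directly. A minimal sketch using the module and arguments shown in the error message (the internal --machine/--implementation flags are omitted here so the output stays human-readable; adjust as needed):

```python
# Debugging sketch: re-run the failing download step directly so its real error output
# is printed instead of being hidden inside subprocess.check_output.
import subprocess
import sys

subprocess.run(
    [sys.executable, "-m", "openllm", "download", "falcon",
     "--model-id", "tiiuae/falcon-40b-instruct"],
    check=False,
)
```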

Logs

---------------------------------------------------------------------------
CalledProcessError                        Traceback (most recent call last)
<ipython-input-2-9e706d7fe198> in <module>
      3 import os
      4 
----> 5 llm = OpenLLM(model_name='falcon', model_id='tiiuae/falcon-40b-instruct', temperature=0.0)
      6 
      7 llm("What is the difference between a duck and a goose? And why there are so many Goose in Canada?")

~/.local/lib/python3.8/site-packages/langchain/llms/openllm.py in __init__(self, model_name, model_id, server_url, server_type, embedded, **llm_kwargs)
    168             # in-process. Wrt to BentoML users, setting embedded=False is the expected
    169             # behaviour to invoke the runners remotely
--> 170             runner = openllm.Runner(
    171                 model_name=model_name,
    172                 model_id=model_id,

~/.local/lib/python3.8/site-packages/openllm/_llm.py in Runner(model_name, ensure_available, init_local, implementation, **attrs)
   1404                 behaviour
   1405     """
-> 1406     runner = t.cast(
   1407         "_BaseAutoLLMClass",
   1408         openllm[implementation if implementation is not None else EnvVarMixin(model_name)["framework_value"]],  # type: ignore (internal API)

~/.local/lib/python3.8/site-packages/openllm/models/auto/factory.py in create_runner(cls, model_name, model_id, **attrs)
    155             A LLM instance.
    156         """
--> 157         llm, runner_attrs = cls.for_model(model_name, model_id, return_runner_kwargs=True, **attrs)
    158         return llm.to_runner(**runner_attrs)
    159 

~/.local/lib/python3.8/site-packages/openllm/models/auto/factory.py in for_model(cls, model_name, model_id, return_runner_kwargs, llm_config, ensure_available, **attrs)
    133                     llm.model_id,
    134                 )
--> 135                 llm.ensure_model_id_exists()
    136             if not return_runner_kwargs:
    137                 return llm

~/.local/lib/python3.8/site-packages/openllm/_llm.py in ensure_model_id_exists(self)
    898         Auto LLM initialisation.
    899         """
--> 900         output = subprocess.check_output(
    901             [
    902                 sys.executable,

/usr/lib/python3.8/subprocess.py in check_output(timeout, *popenargs, **kwargs)
    413         kwargs['input'] = empty
    414 
--> 415     return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
    416                **kwargs).stdout
    417 

/usr/lib/python3.8/subprocess.py in run(input, capture_output, timeout, check, *popenargs, **kwargs)
    514         retcode = process.poll()
    515         if check and retcode:
--> 516             raise CalledProcessError(retcode, process.args,
    517                                      output=stdout, stderr=stderr)
    518     return CompletedProcess(process.args, retcode, stdout, stderr)

CalledProcessError: Command '['/usr/bin/python3', '-m', 'openllm', 'download', 'falcon', '--model-id', 'tiiuae/falcon-40b-instruct', '--machine', '--implementation', 'pt']' returned non-zero exit status 1.

Environment

latest

System information (Optional)

1x H100 (80 GB PCIe)
26 vCPUs, 200 GiB RAM, 1 TiB SSD

feat: M1/M2 GPU support

Feature request

It seems that OpenLLM only supports Nvidia GPUs, and chatglm/chatglm2 can't run under an Apple Silicon environment.

Motivation

No response

Other

No response

bug: Not able to start tiiuae/falcon-7b

Describe the bug

Hi there,

I followed the instructions on GitHub to start tiiuae/falcon-7b:

pip install "openllm[falcon]"
openllm start falcon --model-id tiiuae/falcon-7b

Then, when calling localhost:3000 for the first time, it times out for 30 seconds.

The second time, it returns the output below (see the Logs section) and also times out after some time.

Thanks in advance!

To reproduce

No response

Logs

openllm start falcon --model-id tiiuae/falcon-7b
Make sure to have the following dependencies available: ['einops', 'xformers', 'safetensors']
2023-06-20T16:17:11+0000 [INFO] [cli] Environ for worker 0: set CUDA_VISIBLE_DEVICES to 0
2023-06-20T16:17:11+0000 [INFO] [cli] Prometheus metrics for HTTP BentoServer from "_service.py:svc" can be accessed at http://localhost:3000/metrics.
2023-06-20T16:17:12+0000 [INFO] [cli] Starting production HTTP BentoServer from "_service.py:svc" listening on http://0.0.0.0:3000 (Press CTRL+C to quit)
2023-06-20 16:17:15.720532: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:11<00:00,  5.81s/it]
The model 'RWForCausalLM' is not supported for text-generation. Supported models are ['BartForCausalLM', 'BertLMHeadModel', 'BertGenerationDecoder', 'BigBirdForCausalLM', 'BigBirdPegasusForCausalLM', 'BioGptForCausalLM', 'BlenderbotForCausalLM', 'BlenderbotSmallForCausalLM', 'BloomForCausalLM', 'CamembertForCausalLM', 'CodeGenForCausalLM', 'CpmAntForCausalLM', 'CTRLLMHeadModel', 'Data2VecTextForCausalLM', 'ElectraForCausalLM', 'ErnieForCausalLM', 'GitForCausalLM', 'GPT2LMHeadModel', 'GPT2LMHeadModel', 'GPTBigCodeForCausalLM', 'GPTNeoForCausalLM', 'GPTNeoXForCausalLM', 'GPTNeoXJapaneseForCausalLM', 'GPTJForCausalLM', 'LlamaForCausalLM', 'MarianForCausalLM', 'MBartForCausalLM', 'MegaForCausalLM', 'MegatronBertForCausalLM', 'MvpForCausalLM', 'OpenLlamaForCausalLM', 'OpenAIGPTLMHeadModel', 'OPTForCausalLM', 'PegasusForCausalLM', 'PLBartForCausalLM', 'ProphetNetForCausalLM', 'QDQBertLMHeadModel', 'ReformerModelWithLMHead', 'RemBertForCausalLM', 'RobertaForCausalLM', 'RobertaPreLayerNormForCausalLM', 'RoCBertForCausalLM', 'RoFormerForCausalLM', 'RwkvForCausalLM', 'Speech2Text2ForCausalLM', 'TransfoXLLMHeadModel', 'TrOCRForCausalLM', 'XGLMForCausalLM', 'XLMWithLMHeadModel', 'XLMProphetNetForCausalLM', 'XLMRobertaForCausalLM', 'XLMRobertaXLForCausalLM', 'XLNetLMHeadModel', 'XmodForCausalLM'].
2023-06-20T16:19:42+0000 [INFO] [runner:llm-falcon-runner:1] _ (scheme=http,method=GET,path=/readyz,type=,length=) (status=200,type=text/plain; charset=utf-8,length=1) 0.616ms (trace=61c419d1f6ebf4618a33c76ab591ca84,span=f0534e5aa9799160,sampled=1,service.name=llm-falcon-runner)
2023-06-20T16:19:42+0000 [INFO] [api_server:llm-falcon-service:8] 127.0.0.1:32864 (scheme=http,method=GET,path=/readyz,type=,length=) (status=200,type=text/plain; charset=utf-8,length=1) 139.235ms (trace=61c419d1f6ebf4618a33c76ab591ca84,span=8acdd6ebfdfc0bc3,sampled=1,service.name=llm-falcon-service)
2023-06-20T16:19:42+0000 [INFO] [runner:llm-falcon-runner:1] _ (scheme=http,method=GET,path=/readyz,type=,length=) (status=200,type=text/plain; charset=utf-8,length=1) 0.315ms (trace=e9158a075fc27b60719a6852115ec748,span=5147903d91a8f5cc,sampled=1,service.name=llm-falcon-runner)
2023-06-20T16:19:42+0000 [INFO] [api_server:llm-falcon-service:3] 127.0.0.1:32872 (scheme=http,method=GET,path=/readyz,type=,length=) (status=200,type=text/plain; charset=utf-8,length=1) 140.223ms (trace=e9158a075fc27b60719a6852115ec748,span=3340643d8fd8fbf3,sampled=1,service.name=llm-falcon-service)
2023-06-20T16:19:42+0000 [INFO] [api_server:llm-falcon-service:8] 127.0.0.1:32874 (scheme=http,method=GET,path=/docs.json,type=,length=) (status=200,type=application/json,length=6855) 10.052ms (trace=78691d213c95604978c79a03e7af901e,span=b1ae35a9f48663c7,sampled=1,service.name=llm-falcon-service)
2023-06-20T16:19:42+0000 [INFO] [api_server:llm-falcon-service:1] 127.0.0.1:32882 (scheme=http,method=POST,path=/v1/metadata,type=text/plain; charset=utf-8,length=0) (status=200,type=application/json,length=706) 4.523ms (trace=994b1e4334df607b036138b15b5bd92d,span=8fe7b9288f48d7eb,sampled=1,service.name=llm-falcon-service)
2023-06-20T16:19:42+0000 [INFO] [api_server:llm-falcon-service:7] 127.0.0.1:32896 (scheme=http,method=POST,path=/v1/metadata,type=text/plain; charset=utf-8,length=0) (status=200,type=application/json,length=706) 3.804ms (trace=1cd9f72ec6c621f4dfc0378da339833f,span=f05a197843500866,sampled=1,service.name=llm-falcon-service)
2023-06-20T16:19:43+0000 [INFO] [api_server:llm-falcon-service:4] 127.0.0.1:32900 (scheme=http,method=POST,path=/v1/metadata,type=text/plain; charset=utf-8,length=0) (status=200,type=application/json,length=706) 3.348ms (trace=a2df8f149d8c22318f5bee1beef3b58b,span=38859e751c1b52fa,sampled=1,service.name=llm-falcon-service)
2023-06-20T16:19:43+0000 [INFO] [api_server:llm-falcon-service:4] 127.0.0.1:32906 (scheme=http,method=POST,path=/v1/metadata,type=text/plain; charset=utf-8,length=0) (status=200,type=application/json,length=706) 0.691ms (trace=b24a982fb330e7db790eee4e166c5fbe,span=63ab2145057b9fb1,sampled=1,service.name=llm-falcon-service)
Setting `pad_token_id` to `eos_token_id`:11 for open-end generation.

Environment

bentoml: 1.0.22
openllm: 0.1.8
platform: paperspace

Connection refused

Hi,

I'm Mr. Martian, a very friendly guy.

These two commands give me a connection refused error on every machine I have tested:

export OPENLLM_ENDPOINT=http://localhost:3000
openllm query 'Explain to me the difference between "further" and "farther"'
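For reference, a quick way to confirm the server is actually listening before querying: a minimal sketch, assuming the default port 3000 and BentoML's standard /readyz health endpoint; not part of the original report.

import requests

try:
    # The readiness endpoint returns 200 once the BentoServer and its runners are up.
    resp = requests.get("http://localhost:3000/readyz", timeout=5)
    print("server ready:", resp.status_code == 200)
except requests.ConnectionError:
    print("connection refused: no OpenLLM server is listening on port 3000")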

perf: `openllm.LLM` metaclass creation

Feature request

Currently, the metaclass implementation is rudimentary and inefficient. It takes quite a while to initialize the model, even though the overhead of openllm.LLMConfig is negligible (openllm.LLMConfig is already well optimized).

This has to do with namespace lookup and assignment.

Possible improvement:

  • Use a __slots__ class together with __init_subclass__: one problem with the metaclass approach is that we dynamically look up through the MRO to create the subclass, which effectively recreates the base LLM class every time openllm.AutoLLM is invoked (see the sketch below).
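A minimal sketch of that direction, not OpenLLM's actual implementation; the class and registry names below (BaseLLM, _registry, FalconLLM) are hypothetical:

# Replace metaclass-driven class creation with __slots__ plus __init_subclass__,
# so each subclass registers itself once at definition time instead of being
# rebuilt through MRO lookups on every AutoLLM invocation.
from __future__ import annotations
import typing as t

class BaseLLM:
    __slots__ = ("model_id",)          # fixed attribute layout, no per-instance __dict__
    _registry: t.Dict[str, type] = {}  # filled once per subclass definition

    def __init__(self, model_id: str) -> None:
        self.model_id = model_id

    def __init_subclass__(cls, *, name: str | None = None, **kwargs: t.Any) -> None:
        super().__init_subclass__(**kwargs)
        BaseLLM._registry[name or cls.__name__.lower()] = cls

class FalconLLM(BaseLLM, name="falcon"):
    __slots__ = ()

# Dispatch becomes a plain dict lookup instead of dynamic class creation:
llm_cls = BaseLLM._registry["falcon"]
print(llm_cls(model_id="tiiuae/falcon-7b").model_id)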

bug: Exception in ASGI application

Describe the bug

Just gave your repo a quick shot; on 0.1.6 I ran:

openllm download falcon
openllm start falcon
openllm query ...

Hardware: 2x 1080, 96 GB RAM, 12-core Intel CPU. It failed with the following error, and the query didn't complete within 5 minutes.

To reproduce

No response

Logs

2023-06-19T14:21:00+0200 [ERROR] [api_server:llm-falcon-service:19] Exception in ASGI application
Traceback (most recent call last):
  File "/home/max/.conda/envs/openllm/lib/python3.10/site-packages/uvicorn/protocols/http/h11_impl.py", line 428, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
  File "/home/max/.conda/envs/openllm/lib/python3.10/site-packages/uvicorn/middleware/proxy_headers.py", line 78, in __call__
    return await self.app(scope, receive, send)
  File "/home/max/.conda/envs/openllm/lib/python3.10/site-packages/uvicorn/middleware/message_logger.py", line 86, in __call__
    raise exc from None
  File "/home/max/.conda/envs/openllm/lib/python3.10/site-packages/uvicorn/middleware/message_logger.py", line 82, in __call__
    await self.app(scope, inner_receive, inner_send)
  File "/home/max/.conda/envs/openllm/lib/python3.10/site-packages/starlette/applications.py", line 122, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/home/max/.conda/envs/openllm/lib/python3.10/site-packages/starlette/middleware/errors.py", line 184, in __call__
    raise exc
  File "/home/max/.conda/envs/openllm/lib/python3.10/site-packages/starlette/middleware/errors.py", line 162, in __call__
    await self.app(scope, receive, _send)
  File "/home/max/.conda/envs/openllm/lib/python3.10/site-packages/bentoml/_internal/server/http/traffic.py", line 26, in __call__
    await self.app(scope, receive, send)
  File "/home/max/.conda/envs/openllm/lib/python3.10/site-packages/bentoml/_internal/server/http/instruments.py", line 176, in __call__
    await self.app(scope, receive, wrapped_send)
  File "/home/max/.conda/envs/openllm/lib/python3.10/site-packages/opentelemetry/instrumentation/asgi/__init__.py", line 579, in __call__
    await self.app(scope, otel_receive, otel_send)
  File "/home/max/.conda/envs/openllm/lib/python3.10/site-packages/bentoml/_internal/server/http/access.py", line 126, in __call__
    await self.app(scope, receive, wrapped_send)
  File "/home/max/.conda/envs/openllm/lib/python3.10/site-packages/starlette/middleware/exceptions.py", line 62, in __call__
    await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
  File "/home/max/.conda/envs/openllm/lib/python3.10/site-packages/starlette/_exception_handler.py", line 57, in wrapped_app
    raise exc
  File "/home/max/.conda/envs/openllm/lib/python3.10/site-packages/starlette/_exception_handler.py", line 46, in wrapped_app
    await app(scope, receive, sender)
  File "/home/max/.conda/envs/openllm/lib/python3.10/site-packages/starlette/routing.py", line 727, in __call__
    await route.handle(scope, receive, send)
  File "/home/max/.conda/envs/openllm/lib/python3.10/site-packages/starlette/routing.py", line 285, in handle
    await self.app(scope, receive, send)
  File "/home/max/.conda/envs/openllm/lib/python3.10/site-packages/starlette/routing.py", line 74, in app
    await wrap_app_handling_exceptions(app, request)(scope, receive, send)
  File "/home/max/.conda/envs/openllm/lib/python3.10/site-packages/starlette/_exception_handler.py", line 57, in wrapped_app
    raise exc
  File "/home/max/.conda/envs/openllm/lib/python3.10/site-packages/starlette/_exception_handler.py", line 46, in wrapped_app
    await app(scope, receive, sender)
  File "/home/max/.conda/envs/openllm/lib/python3.10/site-packages/starlette/routing.py", line 69, in app
    response = await func(request)
  File "/home/max/.conda/envs/openllm/lib/python3.10/site-packages/bentoml/_internal/server/http_app.py", line 286, in readyz
    runners_ready = all(await asyncio.gather(*runner_statuses))
  File "/home/max/.conda/envs/openllm/lib/python3.10/site-packages/bentoml/_internal/runner/runner.py", line 156, in runner_handle_is_ready
    return await self._runner_handle.is_ready(timeout)
  File "/home/max/.conda/envs/openllm/lib/python3.10/site-packages/bentoml/_internal/runner/runner_handle/remote.py", line 304, in is_ready
    async with self._client.get(
  File "/home/max/.conda/envs/openllm/lib/python3.10/site-packages/aiohttp/client.py", line 1141, in __aenter__
    self._resp = await self._coro
  File "/home/max/.conda/envs/openllm/lib/python3.10/site-packages/aiohttp/client.py", line 560, in _request
    await resp.start(conn)
  File "/home/max/.conda/envs/openllm/lib/python3.10/site-packages/aiohttp/client_reqrep.py", line 894, in start
    with self._timer:
  File "/home/max/.conda/envs/openllm/lib/python3.10/site-packages/aiohttp/helpers.py", line 721, in __exit__
    raise asyncio.TimeoutError from None
asyncio.exceptions.TimeoutError

Environment

Environment variable

BENTOML_DEBUG=''
BENTOML_QUIET=''
BENTOML_BUNDLE_LOCAL_BUILD=''
BENTOML_DO_NOT_TRACK=''
BENTOML_CONFIG=''
BENTOML_CONFIG_OPTIONS=''
BENTOML_PORT=''
BENTOML_HOST=''
BENTOML_API_WORKERS=''

System information

bentoml: 1.0.22
python: 3.10.11
platform: Linux-5.4.0-42-generic-x86_64-with-glibc2.31
uid_gid: 1001:1001
conda: 4.14.0
in_conda_env: True

conda_packages
name: openllm
channels:
  - defaults
dependencies:
  - _libgcc_mutex=0.1=main
  - _openmp_mutex=5.1=1_gnu
  - bzip2=1.0.8=h7b6447c_0
  - ca-certificates=2023.05.30=h06a4308_0
  - ld_impl_linux-64=2.38=h1181459_1
  - libffi=3.4.4=h6a678d5_0
  - libgcc-ng=11.2.0=h1234567_1
  - libgomp=11.2.0=h1234567_1
  - libstdcxx-ng=11.2.0=h1234567_1
  - libuuid=1.41.5=h5eee18b_0
  - ncurses=6.4=h6a678d5_0
  - openssl=3.0.8=h7f8727e_0
  - pip=23.1.2=py310h06a4308_0
  - python=3.10.11=h955ad1f_3
  - readline=8.2=h5eee18b_0
  - setuptools=67.8.0=py310h06a4308_0
  - sqlite=3.41.2=h5eee18b_0
  - tk=8.6.12=h1ccaba5_0
  - wheel=0.38.4=py310h06a4308_0
  - xz=5.4.2=h5eee18b_0
  - zlib=1.2.13=h5eee18b_0
  - pip:
    - accelerate==0.20.3
    - aiohttp==3.8.4
    - aiosignal==1.3.1
    - anyio==3.7.0
    - appdirs==1.4.4
    - asgiref==3.7.2
    - async-timeout==4.0.2
    - attrs==23.1.0
    - bentoml==1.0.22
    - build==0.10.0
    - cattrs==23.1.2
    - certifi==2023.5.7
    - charset-normalizer==3.1.0
    - circus==0.18.0
    - click==8.1.3
    - click-option-group==0.5.6
    - cloudpickle==2.2.1
    - cmake==3.26.4
    - coloredlogs==15.0.1
    - contextlib2==21.6.0
    - datasets==2.13.0
    - deepmerge==1.1.0
    - deprecated==1.2.14
    - dill==0.3.6
    - einops==0.6.1
    - exceptiongroup==1.1.1
    - filelock==3.12.2
    - filetype==1.2.0
    - frozenlist==1.3.3
    - fs==2.4.16
    - fsspec==2023.6.0
    - grpcio==1.54.2
    - grpcio-health-checking==1.48.2
    - h11==0.14.0
    - httpcore==0.17.2
    - httpx==0.24.1
    - huggingface-hub==0.15.1
    - humanfriendly==10.0
    - idna==3.4
    - importlib-metadata==6.0.1
    - inflection==0.5.1
    - jinja2==3.1.2
    - lit==16.0.6
    - markdown-it-py==3.0.0
    - markupsafe==2.1.3
    - mdurl==0.1.2
    - mpmath==1.3.0
    - multidict==6.0.4
    - multiprocess==0.70.14
    - mypy-extensions==1.0.0
    - networkx==3.1
    - numpy==1.25.0
    - nvidia-cublas-cu11==11.10.3.66
    - nvidia-cuda-cupti-cu11==11.7.101
    - nvidia-cuda-nvrtc-cu11==11.7.99
    - nvidia-cuda-runtime-cu11==11.7.99
    - nvidia-cudnn-cu11==8.5.0.96
    - nvidia-cufft-cu11==10.9.0.58
    - nvidia-curand-cu11==10.2.10.91
    - nvidia-cusolver-cu11==11.4.0.1
    - nvidia-cusparse-cu11==11.7.4.91
    - nvidia-nccl-cu11==2.14.3
    - nvidia-nvtx-cu11==11.7.91
    - openllm==0.1.6
    - opentelemetry-api==1.17.0
    - opentelemetry-instrumentation==0.38b0
    - opentelemetry-instrumentation-aiohttp-client==0.38b0
    - opentelemetry-instrumentation-asgi==0.38b0
    - opentelemetry-instrumentation-grpc==0.38b0
    - opentelemetry-sdk==1.17.0
    - opentelemetry-semantic-conventions==0.38b0
    - opentelemetry-util-http==0.38b0
    - optimum==1.8.8
    - orjson==3.9.1
    - packaging==23.1
    - pandas==2.0.2
    - pathspec==0.11.1
    - pillow==9.5.0
    - pip-requirements-parser==32.0.1
    - pip-tools==6.13.0
    - prometheus-client==0.17.0
    - protobuf==3.20.3
    - psutil==5.9.5
    - pyarrow==12.0.1
    - pydantic==1.10.9
    - pygments==2.15.1
    - pynvml==11.5.0
    - pyparsing==3.1.0
    - pyproject-hooks==1.0.0
    - pyre-extensions==0.0.29
    - python-dateutil==2.8.2
    - python-json-logger==2.0.7
    - python-multipart==0.0.6
    - pytz==2023.3
    - pyyaml==6.0
    - pyzmq==25.1.0
    - regex==2023.6.3
    - requests==2.31.0
    - rich==13.4.2
    - safetensors==0.3.1
    - schema==0.7.5
    - sentencepiece==0.1.99
    - simple-di==0.1.5
    - six==1.16.0
    - sniffio==1.3.0
    - starlette==0.28.0
    - sympy==1.12
    - tabulate==0.9.0
    - tokenizers==0.13.3
    - tomli==2.0.1
    - torch==2.0.1
    - torchvision==0.15.2
    - tornado==6.3.2
    - tqdm==4.65.0
    - transformers==4.30.2
    - triton==2.0.0
    - typing-extensions==4.6.3
    - typing-inspect==0.9.0
    - tzdata==2023.3
    - urllib3==2.0.3
    - uvicorn==0.22.0
    - watchfiles==0.19.0
    - wcwidth==0.2.6
    - wrapt==1.15.0
    - xformers==0.0.20
    - xxhash==3.2.0
    - yarl==1.9.2
    - zipp==3.15.0
prefix: /home/max/.conda/envs/openllm
pip_packages
accelerate==0.20.3
aiohttp==3.8.4
aiosignal==1.3.1
anyio==3.7.0
appdirs==1.4.4
asgiref==3.7.2
async-timeout==4.0.2
attrs==23.1.0
bentoml==1.0.22
build==0.10.0
cattrs==23.1.2
certifi==2023.5.7
charset-normalizer==3.1.0
circus==0.18.0
click==8.1.3
click-option-group==0.5.6
cloudpickle==2.2.1
cmake==3.26.4
coloredlogs==15.0.1
contextlib2==21.6.0
datasets==2.13.0
deepmerge==1.1.0
Deprecated==1.2.14
dill==0.3.6
einops==0.6.1
exceptiongroup==1.1.1
filelock==3.12.2
filetype==1.2.0
frozenlist==1.3.3
fs==2.4.16
fsspec==2023.6.0
grpcio==1.54.2
grpcio-health-checking==1.48.2
h11==0.14.0
httpcore==0.17.2
httpx==0.24.1
huggingface-hub==0.15.1
humanfriendly==10.0
idna==3.4
importlib-metadata==6.0.1
inflection==0.5.1
Jinja2==3.1.2
lit==16.0.6
markdown-it-py==3.0.0
MarkupSafe==2.1.3
mdurl==0.1.2
mpmath==1.3.0
multidict==6.0.4
multiprocess==0.70.14
mypy-extensions==1.0.0
networkx==3.1
numpy==1.25.0
nvidia-cublas-cu11==11.10.3.66
nvidia-cuda-cupti-cu11==11.7.101
nvidia-cuda-nvrtc-cu11==11.7.99
nvidia-cuda-runtime-cu11==11.7.99
nvidia-cudnn-cu11==8.5.0.96
nvidia-cufft-cu11==10.9.0.58
nvidia-curand-cu11==10.2.10.91
nvidia-cusolver-cu11==11.4.0.1
nvidia-cusparse-cu11==11.7.4.91
nvidia-nccl-cu11==2.14.3
nvidia-nvtx-cu11==11.7.91
openllm==0.1.6
opentelemetry-api==1.17.0
opentelemetry-instrumentation==0.38b0
opentelemetry-instrumentation-aiohttp-client==0.38b0
opentelemetry-instrumentation-asgi==0.38b0
opentelemetry-instrumentation-grpc==0.38b0
opentelemetry-sdk==1.17.0
opentelemetry-semantic-conventions==0.38b0
opentelemetry-util-http==0.38b0
optimum==1.8.8
orjson==3.9.1
packaging==23.1
pandas==2.0.2
pathspec==0.11.1
Pillow==9.5.0
pip-requirements-parser==32.0.1
pip-tools==6.13.0
prometheus-client==0.17.0
protobuf==3.20.3
psutil==5.9.5
pyarrow==12.0.1
pydantic==1.10.9
Pygments==2.15.1
pynvml==11.5.0
pyparsing==3.1.0
pyproject_hooks==1.0.0
pyre-extensions==0.0.29
python-dateutil==2.8.2
python-json-logger==2.0.7
python-multipart==0.0.6
pytz==2023.3
PyYAML==6.0
pyzmq==25.1.0
regex==2023.6.3
requests==2.31.0
rich==13.4.2
safetensors==0.3.1
schema==0.7.5
sentencepiece==0.1.99
simple-di==0.1.5
six==1.16.0
sniffio==1.3.0
starlette==0.28.0
sympy==1.12
tabulate==0.9.0
tokenizers==0.13.3
tomli==2.0.1
torch==2.0.1
torchvision==0.15.2
tornado==6.3.2
tqdm==4.65.0
transformers==4.30.2
triton==2.0.0
typing-inspect==0.9.0
typing_extensions==4.6.3
tzdata==2023.3
urllib3==2.0.3
uvicorn==0.22.0
watchfiles==0.19.0
wcwidth==0.2.6
wrapt==1.15.0
xformers==0.0.20
xxhash==3.2.0
yarl==1.9.2
zipp==3.15.0
  • transformers version: 4.30.2
  • Platform: Linux-5.4.0-42-generic-x86_64-with-glibc2.31
  • Python version: 3.10.11
  • Huggingface_hub version: 0.15.1
  • Safetensors version: 0.3.1
  • PyTorch version (GPU?): 2.0.1+cu117 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: ?
  • Using distributed or parallel set-up in script?: ?

mpt is listed under supported models but not available in openllm build command

Describe the bug

mpt is listed under the supported models but is not available in the openllm build command; this is the error message:

openllm build mpt
Usage: openllm build [OPTIONS] {flan-t5|dolly-v2|chatglm|starcoder|falcon|stablelm|opt}
Try 'openllm build -h' for help.

Error: Invalid value for '{flan-t5|dolly-v2|chatglm|starcoder|falcon|stablelm|opt}': 'mpt' is not one of 'flan-t5', 'dolly-v2', 'chatglm', 'starcoder', 'falcon', 'stablelm', 'opt'.

To reproduce

pip install "openllm[mpt]"
openllm build mpt

Logs

No response

Environment

bentoml env

Environment variable

BENTOML_DEBUG=''
BENTOML_QUIET=''
BENTOML_BUNDLE_LOCAL_BUILD=''
BENTOML_DO_NOT_TRACK=''
BENTOML_CONFIG=''
BENTOML_CONFIG_OPTIONS=''
BENTOML_PORT=''
BENTOML_HOST=''
BENTOML_API_WORKERS=''

System information

bentoml: 1.0.23
python: 3.11.0
platform: Windows-10-10.0.19045-SP0
is_window_admin: True

pip_packages
absl-py==1.4.0
accelerate==0.20.3
aiofiles==23.1.0
aiohttp==3.8.4
aiolimiter==1.1.0
aiosignal==1.3.1
altair==5.0.1
anyio==3.6.2
appdirs==1.4.4
asgiref==3.7.2
asttokens==2.2.1
astunparse==1.6.3
async-timeout==4.0.2
asyncio==3.4.3
attrs==23.1.0
azure-core==1.27.0
azure-cosmos==4.3.1
azureml==0.2.7
backcall==0.2.0
beautifulsoup4==4.12.2
bentoml==1.0.23
bitsandbytes==0.39.0
blinker==1.6.2
Brotli==1.0.9
bs4==0.0.1
build==0.10.0
cachetools==5.3.1
cattrs==23.1.2
certifi==2022.12.7
cffi==1.15.1
charset-normalizer==3.1.0
circus==0.18.0
click==8.1.3
click-option-group==0.5.6
cloudpickle==2.2.1
colorama==0.4.6
coloredlogs==15.0.1
comm==0.1.2
contextlib2==21.6.0
contourpy==1.0.7
cryptography==41.0.1
cycler==0.11.0
dataclasses-json==0.5.8
datasets==2.13.0
debugpy==1.6.6
decorator==5.1.1
deepmerge==1.1.0
Deprecated==1.2.14
dill==0.3.6
diskcache==5.6.1
duckduckgo-search==3.8.3
einops==0.6.1
executing==1.2.0
faiss-cpu==1.7.4
fastapi==0.95.1
ffmpy==0.3.0
filelock==3.12.2
filetype==1.2.0
Flask==2.3.2
Flask-SQLAlchemy==3.0.5
flatbuffers==23.5.26
flexgen==0.1.7
fonttools==4.39.4
frozenlist==1.3.3
fs==2.4.16
fsspec==2023.6.0
gast==0.4.0
google-auth==2.19.1
google-auth-oauthlib==1.0.0
google-pasta==0.2.0
gptcache==0.1.32
gradio==3.33.1
gradio_client==0.2.5
greenlet==2.0.2
grpcio==1.54.2
grpcio-health-checking==1.48.2
guidance==0.0.63
h11==0.14.0
h2==4.1.0
h5py==3.8.0
hpack==4.0.0
httpcore==0.17.2
httpx==0.24.1
huggingface-hub==0.15.1
humanfriendly==10.0
hyperframe==6.0.1
idna==3.4
importlib-metadata==6.0.1
inflection==0.5.1
ipykernel==6.21.3
ipython==8.11.0
itsdangerous==2.1.2
jaconv==0.3.4
jamo==0.4.1
jax==0.4.12
jedi==0.18.2
Jinja2==3.1.2
joblib==1.2.0
jsonschema==4.17.3
jupyter_client==8.0.3
jupyter_core==5.2.0
keras==2.12.0
kiwisolver==1.4.4
langchain==0.0.196
langchainplus-sdk==0.0.11
libclang==16.0.0
linkify-it-py==2.0.2
llama-cpp-python==0.1.62
lxml==4.9.2
Markdown==3.4.3
markdown-it-py==2.2.0
MarkupSafe==2.1.3
marshmallow==3.19.0
marshmallow-enum==1.5.1
matplotlib==3.7.1
matplotlib-inline==0.1.6
mdit-py-plugins==0.3.3
mdurl==0.1.2
ml-dtypes==0.2.0
mpmath==1.3.0
msal==1.22.0
msgpack==1.0.5
multidict==6.0.4
multiprocess==0.70.14
mypy-extensions==1.0.0
nest-asyncio==1.5.6
networkx==3.0
numexpr==2.8.4
numpy==1.23.5
oauthlib==3.2.2
openai==0.27.8
openapi-schema-pydantic==1.2.4
openllm==0.1.17
opentelemetry-api==1.17.0
opentelemetry-instrumentation==0.38b0
opentelemetry-instrumentation-aiohttp-client==0.38b0
opentelemetry-instrumentation-asgi==0.38b0
opentelemetry-instrumentation-grpc==0.38b0
opentelemetry-sdk==1.17.0
opentelemetry-semantic-conventions==0.38b0
opentelemetry-util-http==0.38b0
opt-einsum==3.3.0
optimum==1.9.0
orjson==3.9.1
packaging==23.0
pandas==2.0.1
parso==0.8.3
pathspec==0.11.1
peft @ git+https://github.com/huggingface/peft@03eb378eb914fbee709ff7c86ba5b1d033b89524
pesq==0.0.4
pickleshare==0.7.5
pika==1.3.2
Pillow==9.5.0
pip-requirements-parser==32.0.1
pip-tools==6.13.0
platformdirs==3.1.1
playwright==1.35.0
prometheus-client==0.17.0
prompt-toolkit==3.0.38
protobuf==3.20.3
psutil==5.9.4
PuLP==2.7.0
pure-eval==0.2.2
pyarrow==12.0.1
pyasn1==0.5.0
pyasn1-modules==0.3.0
pycparser==2.21
pydantic==1.10.7
pydocumentdb==2.3.5
pydub==0.25.1
pyee==9.0.4
Pygments==2.14.0
pygtrie==2.5.0
PyJWT==2.7.0
PyMySQL==1.1.0
pynvml==11.5.0
pyparsing==3.0.9
pyproject_hooks==1.0.0
pyre-extensions==0.0.29
pyreadline3==3.4.1
pyrsistent==0.19.3
python-dateutil==2.8.2
python-json-logger==2.0.7
python-multipart==0.0.6
pytz==2023.3
pywin32==305
PyYAML==6.0
pyzmq==25.0.0
regex==2023.6.3
requests==2.29.0
requests-oauthlib==1.3.1
rich==13.4.2
rsa==4.9
safetensors==0.3.1
schema==0.7.5
scikit-learn==1.2.2
scipy==1.10.1
semantic-version==2.10.0
sentencepiece==0.1.99
simple-di==0.1.5
six==1.16.0
sniffio==1.3.0
socksio==1.0.0
soupsieve==2.4.1
SQLAlchemy==2.0.16
stack-data==0.6.2
starlette==0.26.1
sympy==1.12
tabulate==0.9.0
tenacity==8.2.2
tensorboard==2.12.3
tensorboard-data-server==0.7.0
tensorflow==2.12.0
tensorflow-estimator==2.12.0
tensorflow-intel==2.12.0
tensorflow-io-gcs-filesystem==0.31.0
termcolor==2.3.0
threadpoolctl==3.1.0
tiktoken==0.4.0
tokenizers==0.13.3
toolz==0.12.0
torch==2.0.1+cu117
torchaudio==2.0.2+cu117
torchvision==0.15.2+cu117
tornado==6.2
tqdm==4.65.0
traitlets==5.9.0
transformers==4.30.2
typing-inspect==0.9.0
typing_extensions==4.5.0
tzdata==2023.3
uc-micro-py==1.0.2
urllib3==1.26.15
uvicorn==0.22.0
watchfiles==0.19.0
wcwidth==0.2.6
websockets==11.0.3
Werkzeug==2.3.6
wrapt==1.14.1
xformers==0.0.20
xxhash==3.2.0
yarl==1.9.2
zipp==3.15.0

System information (Optional)

No response

run a transformers model

1. Save the model:

import torch
import bentoml
from transformers import pipeline

pipe = pipeline("text-generation", model="/data/Data/LLM/starchat/", torch_dtype=torch.bfloat16, device_map="auto")
bentoml.transformers.save_model(name="starsvc", pipeline=pipe)

2. Make the service (starsvc.py):

import bentoml
from bentoml.io import Text, JSON

runner = bentoml.transformers.get("starsvc").to_runner()
svc = bentoml.Service("starchat-service", runners=[runner])

@svc.api(input=Text(), output=JSON())
async def generate(input_series: str) -> list:
    return await runner.async_run(input_series)

3. Run:

bentoml serve starsvc.py:svc
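For reference, once the server is up, a minimal sketch (an assumption, not part of the original question) of calling the service over HTTP; BentoML exposes each @svc.api under POST /<api_name>, here /generate with a plain-text body:

import requests

# POST the prompt as text/plain; the service returns JSON from the pipeline output.
resp = requests.post(
    "http://localhost:3000/generate",
    data="Write a haiku about GPUs",
    headers={"Content-Type": "text/plain"},
    timeout=300,
)
print(resp.json())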

Finally, I got the following error. Why?

RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions

bug: circus.exc.ConflictError, Cross Device error, and Internal Server Error

Describe the bug

After updating OpenLLM from 0.1.20 to 0.2.0, I tried to load the Baichuan-13B-Chat model as follows:

openllm start baichuan --model-id /home/user/.cache/modelscope/hub/baichuan-inc/Baichuan-13B-Chat/ --device 0

Then several problems occurred:

  1. circus.exc.ConflictError:
openllm start baichuan --model-id /home/user/.cache/modelscope/hub/baichuan-inc/Baichuan-13B-Chat/ --device 0
Make sure to have the following dependencies available: ['cpm-kernels']
Converting '/home/user/.cache/modelscope/hub/baichuan-inc/Baichuan-13B-Chat/' to lowercase: '/home/user/.cache/modelscope/hub/baichuan-inc/baichuan-13b-chat/'.
Converting '/home/user/.cache/modelscope/hub/baichuan-inc/Baichuan-13B-Chat/' to lowercase: '/home/user/.cache/modelscope/hub/baichuan-inc/baichuan-13b-chat/'.
Converting 'pt-Baichuan-13B-Chat' to lowercase: 'pt-baichuan-13b-chat'.
Converting 'pt-Baichuan-13B-Chat' to lowercase: 'pt-baichuan-13b-chat'.
__tag__:pt-baichuan-13b-chat:10e955477599362428d4e089e8ad6138256c784f
2023-07-20T14:19:28+0800 [ERROR] [cli] Exception in callback <bound method Arbiter.manage_watchers of <circus.arbiter.Arbiter object at 0x7f9de09c8e80>>
Traceback (most recent call last):
File "/home/user/Downloads/enter/envs/openllm_test/lib/python3.9/site-packages/tornado/ioloop.py", line 919, in _run
val = self.callback()
File "/home/user/Downloads/enter/envs/openllm_test/lib/python3.9/site-packages/circus/util.py", line 1038, in wrapper
raise ConflictError("arbiter is already running %s command"
circus.exc.ConflictError: arbiter is already running arbiter_start_watchers command
  2. The input and the model are detected on different devices:
2023-07-20T14:24:42+0800 [ERROR] [runner:llm-baichuan-runner:1] Exception in ASGI application
Traceback (most recent call last):
File "/home/user/Downloads/enter/envs/openllm_test/lib/python3.9/site-packages/uvicorn/protocols/http/h11_impl.py", line 408, in run_asgi
result = await app(  # type: ignore[func-returns-value]
File "/home/user/Downloads/enter/envs/openllm_test/lib/python3.9/site-packages/uvicorn/middleware/proxy_headers.py", line 84, in __call__
return await self.app(scope, receive, send)
File "/home/user/Downloads/enter/envs/openllm_test/lib/python3.9/site-packages/starlette/applications.py", line 122, in __call__
await self.middleware_stack(scope, receive, send)
File "/home/user/Downloads/enter/envs/openllm_test/lib/python3.9/site-packages/starlette/middleware/errors.py", line 184, in __call__
raise exc
File "/home/user/Downloads/enter/envs/openllm_test/lib/python3.9/site-packages/starlette/middleware/errors.py", line 162, in __call__
await self.app(scope, receive, _send)
File "/home/user/Downloads/enter/envs/openllm_test/lib/python3.9/site-packages/bentoml/_internal/server/http/traffic.py", line 26, in __call__
await self.app(scope, receive, send)
File "/home/user/Downloads/enter/envs/openllm_test/lib/python3.9/site-packages/opentelemetry/instrumentation/asgi/__init__.py", line 580, in __call__
await self.app(scope, otel_receive, otel_send)
File "/home/user/Downloads/enter/envs/openllm_test/lib/python3.9/site-packages/bentoml/_internal/server/http/instruments.py", line 252, in __call__
await self.app(scope, receive, wrapped_send)
File "/home/user/Downloads/enter/envs/openllm_test/lib/python3.9/site-packages/starlette/middleware/exceptions.py", line 62, in __call__
await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
File "/home/user/Downloads/enter/envs/openllm_test/lib/python3.9/site-packages/starlette/_exception_handler.py", line 57, in wrapped_app
raise exc
File "/home/user/Downloads/enter/envs/openllm_test/lib/python3.9/site-packages/starlette/_exception_handler.py", line 46, in wrapped_app
await app(scope, receive, sender)
File "/home/user/Downloads/enter/envs/openllm_test/lib/python3.9/site-packages/starlette/routing.py", line 727, in __call__
await route.handle(scope, receive, send)
File "/home/user/Downloads/enter/envs/openllm_test/lib/python3.9/site-packages/starlette/routing.py", line 285, in handle
await self.app(scope, receive, send)
File "/home/user/Downloads/enter/envs/openllm_test/lib/python3.9/site-packages/starlette/routing.py", line 74, in app
await wrap_app_handling_exceptions(app, request)(scope, receive, send)
File "/home/user/Downloads/enter/envs/openllm_test/lib/python3.9/site-packages/starlette/_exception_handler.py", line 57, in wrapped_app
raise exc
File "/home/user/Downloads/enter/envs/openllm_test/lib/python3.9/site-packages/starlette/_exception_handler.py", line 46, in wrapped_app
await app(scope, receive, sender)
File "/home/user/Downloads/enter/envs/openllm_test/lib/python3.9/site-packages/starlette/routing.py", line 69, in app
response = await func(request)
File "/home/user/Downloads/enter/envs/openllm_test/lib/python3.9/site-packages/bentoml/_internal/server/runner_app.py", line 273, in _request_handler
payload = await infer(params)
File "/home/user/Downloads/enter/envs/openllm_test/lib/python3.9/site-packages/bentoml/_internal/marshal/dispatcher.py", line 182, in _func
raise r
File "/home/user/Downloads/enter/envs/openllm_test/lib/python3.9/site-packages/bentoml/_internal/marshal/dispatcher.py", line 377, in outbound_call
outputs = await self.callback(
File "/home/user/Downloads/enter/envs/openllm_test/lib/python3.9/site-packages/bentoml/_internal/server/runner_app.py", line 253, in infer_single
ret = await runner_method.async_run(*params.args, **params.kwargs)
File "/home/user/Downloads/enter/envs/openllm_test/lib/python3.9/site-packages/bentoml/_internal/runner/runner.py", line 55, in async_run
return await self.runner._runner_handle.async_run_method(self, *args, **kwargs)
File "/home/user/Downloads/enter/envs/openllm_test/lib/python3.9/site-packages/bentoml/_internal/runner/runner_handle/local.py", line 59, in async_run_method
return await anyio.to_thread.run_sync(
File "/home/user/Downloads/enter/envs/openllm_test/lib/python3.9/site-packages/anyio/to_thread.py", line 33, in run_sync
return await get_asynclib().run_sync_in_worker_thread(
File "/home/user/Downloads/enter/envs/openllm_test/lib/python3.9/site-packages/anyio/_backends/_asyncio.py", line 877, in run_sync_in_worker_thread
return await future
File "/home/user/Downloads/enter/envs/openllm_test/lib/python3.9/site-packages/anyio/_backends/_asyncio.py", line 807, in run
result = context.run(func, *args)
File "/home/user/Downloads/enter/envs/openllm_test/lib/python3.9/site-packages/bentoml/_internal/runner/runnable.py", line 140, in method
return self.func(obj, *args, **kwargs)
File "/home/user/Downloads/enter/envs/openllm_test/lib/python3.9/site-packages/openllm/_llm.py", line 1429, in generate
return self.generate(prompt, **attrs)
File "/home/user/Downloads/enter/envs/openllm_test/lib/python3.9/site-packages/openllm/models/baichuan/modeling_baichuan.py", line 82, in generate
outputs = self.model.generate(
File "/home/user/Downloads/enter/envs/openllm_test/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/home/user/Downloads/enter/envs/openllm_test/lib/python3.9/site-packages/transformers/generation/utils.py", line 1538, in generate
return self.greedy_search(
File "/home/user/Downloads/enter/envs/openllm_test/lib/python3.9/site-packages/transformers/generation/utils.py", line 2362, in greedy_search
outputs = self(
File "/home/user/Downloads/enter/envs/openllm_test/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/user/.cache/huggingface/modules/transformers_modules/10e955477599362428d4e089e8ad6138256c784f/modeling_baichuan.py", line 400, in forward
outputs = self.model(
File "/home/user/Downloads/enter/envs/openllm_test/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/user/.cache/huggingface/modules/transformers_modules/10e955477599362428d4e089e8ad6138256c784f/modeling_baichuan.py", line 284, in forward
inputs_embeds = self.embed_tokens(input_ids)
File "/home/user/Downloads/enter/envs/openllm_test/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/user/Downloads/enter/envs/openllm_test/lib/python3.9/site-packages/torch/nn/modules/sparse.py", line 162, in forward
return F.embedding(
File "/home/user/Downloads/enter/envs/openllm_test/lib/python3.9/site-packages/torch/nn/functional.py", line 2210, in embedding
return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0! (when checking argument for argument index in method wrapper_CUDA__index_select)
  3. Internal Server Error:
2023-07-20T14:24:42+0800 [ERROR] [api_server:llm-baichuan-service:77] Exception on /v1/generate [POST] (trace=bfb570799cdf3681882093b96cd13353,span=fd186b183a773696,sampled=1,service.name=llm-baichuan-service)
Traceback (most recent call last):
File "/home/user/Downloads/enter/envs/openllm_test/lib/python3.9/site-packages/bentoml/_internal/server/http_app.py", line 341, in api_func
output = await api.func(*args)
File "/home/user/Downloads/enter/envs/openllm_test/lib/python3.9/site-packages/openllm/_service.py", line 88, in generate_v1
responses = await runner.generate.async_run(qa_inputs.prompt, **config)
File "/home/user/Downloads/enter/envs/openllm_test/lib/python3.9/site-packages/bentoml/_internal/runner/runner.py", line 55, in async_run
return await self.runner._runner_handle.async_run_method(self, *args, **kwargs)
File "/home/user/Downloads/enter/envs/openllm_test/lib/python3.9/site-packages/bentoml/_internal/runner/runner_handle/remote.py", line 242, in async_run_method
raise RemoteException(
bentoml.exceptions.RemoteException: An unexpected exception occurred in remote runner llm-baichuan-runner: [500] Internal Server Error
  4. Running nvidia-smi, it seems that the model has not been loaded onto the target device yet:
Thu Jul 20 14:41:50 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.84       Driver Version: 460.84       CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  A100-PCIE-40GB      Off  | 00000000:18:00.0 Off |                    0 |
| N/A   24C    P0    32W / 250W |    848MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  A100-PCIE-40GB      Off  | 00000000:3B:00.0 Off |                    0 |
| N/A   24C    P0    30W / 250W |      3MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   2  A100-PCIE-40GB      Off  | 00000000:86:00.0 Off |                    0 |
| N/A   25C    P0    33W / 250W |      3MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   3  A100-PCIE-40GB      Off  | 00000000:AF:00.0 Off |                    0 |
| N/A   25C    P0    31W / 250W |      3MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A   1003374      C   ...s/openllm_test/bin/python      845MiB |
+-----------------------------------------------------------------------------+

Here is my environment (CUDA VERSION 11.7):

bentoml env
Environment variable
BENTOML_DEBUG=''
BENTOML_QUIET=''
BENTOML_BUNDLE_LOCAL_BUILD=''
BENTOML_DO_NOT_TRACK=''
BENTOML_CONFIG=''
BENTOML_CONFIG_OPTIONS=''
BENTOML_PORT=''
BENTOML_HOST=''
BENTOML_API_WORKERS=''

System information

bentoml: 1.0.24
python: 3.9.17
platform: Linux-4.18.0-348.7.1.el8_5.x86_64-x86_64-with-glibc2.28
uid_gid: 1001:1001
/home/user/Downloads/enter/bin/conda: 4.5.11
in_conda_env: True

Request:

curl -X 'POST'   'http://localhost:3000/v1/generate'   -H 'accept: application/json'   -H 'Content-Type: application/json'   -d '{
  "prompt": "hello",
  "llm_config": {
    "max_new_tokens": 2048,
    "min_length": 0,
    "early_stopping": false,
    "num_beams": 1,
    "num_beam_groups": 1,
    "use_cache": true,
    "temperature": 0.95,
    "top_k": 50,
    "top_p": 0.7,
    "typical_p": 1,
    "epsilon_cutoff": 0,
    "eta_cutoff": 0,
    "diversity_penalty": 0,
    "repetition_penalty": 1,
    "encoder_repetition_penalty": 1,
    "length_penalty": 1,
    "no_repeat_ngram_size": 0,
    "renormalize_logits": false,
    "remove_invalid_values": false,
    "num_return_sequences": 1,
    "output_attentions": false,
    "output_hidden_states": false,
    "output_scores": false,
    "encoder_no_repeat_ngram_size": 0,
    "n": 1,
    "presence_penalty": 0,
    "frequency_penalty": 0,
    "use_beam_search": false,
    "ignore_eos": false
  }
}'

To reproduce

  1. Run
    openllm start baichuan --model-id /path/to/baichuan-inc/Baichuan-13B-Chat/ --device 0
  2. Run
curl -X 'POST'   'http://localhost:3000/v1/generate'   -H 'accept: application/json'   -H 'Content-Type: application/json'   -d '{
  "prompt": "hello",
  "llm_config": {
    "max_new_tokens": 2048,
    "min_length": 0,
    "early_stopping": false,
    "num_beams": 1,
    "num_beam_groups": 1,
    "use_cache": true,
    "temperature": 0.95,
    "top_k": 50,
    "top_p": 0.7,
    "typical_p": 1,
    "epsilon_cutoff": 0,
    "eta_cutoff": 0,
    "diversity_penalty": 0,
    "repetition_penalty": 1,
    "encoder_repetition_penalty": 1,
    "length_penalty": 1,
    "no_repeat_ngram_size": 0,
    "renormalize_logits": false,
    "remove_invalid_values": false,
    "num_return_sequences": 1,
    "output_attentions": false,
    "output_hidden_states": false,
    "output_scores": false,
    "encoder_no_repeat_ngram_size": 0,
    "n": 1,
    "presence_penalty": 0,
    "frequency_penalty": 0,
    "use_beam_search": false,
    "ignore_eos": false
  }
}'
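For convenience, the same request as a short Python sketch (assuming the server runs on localhost:3000 as in the reproduce steps; only a subset of the llm_config keys from the curl payload is shown):

import requests

payload = {
    "prompt": "hello",
    "llm_config": {"max_new_tokens": 2048, "temperature": 0.95, "top_k": 50, "top_p": 0.7},
}
# Same endpoint as the curl call above.
resp = requests.post("http://localhost:3000/v1/generate", json=payload, timeout=600)
print(resp.status_code, resp.text)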

Logs

No response

Environment

transformers-cli env

===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please run

python -m bitsandbytes

 and submit this information together with your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
================================================================================
/home/user/Downloads/enter/envs/openllm_test/lib/python3.9/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: /home/user/Downloads/enter/envs/openllm_test did not contain ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] as expected! Searching further paths...
  warn(msg)
CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so.11.0
CUDA SETUP: Highest compute capability among GPUs detected: 8.0
CUDA SETUP: Detected CUDA version 117
CUDA SETUP: Loading binary /home/user/Downloads/enter/envs/openllm_test/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cuda117.so...

Copy-and-paste the text below in your GitHub issue and FILL OUT the two last points.

- `transformers` version: 4.31.0
- Platform: Linux-4.18.0-348.7.1.el8_5.x86_64-x86_64-with-glibc2.28
- Python version: 3.9.17
- Huggingface_hub version: 0.16.4
- Safetensors version: 0.3.1
- Accelerate version: 0.21.0
- Accelerate config:    - compute_environment: LOCAL_MACHINE
        - distributed_type: DEEPSPEED
        - mixed_precision: fp16
        - use_cpu: False
        - num_processes: 4
        - machine_rank: 0
        - num_machines: 1
        - rdzv_backend: static
        - same_network: True
        - main_training_function: main
        - deepspeed_config: {'gradient_accumulation_steps': 4, 'gradient_clipping': 1.0, 'offload_optimizer_device': 'none', 'offload_param_device': 'none', 'zero3_init_flag': False, 'zero_stage': 2}
        - downcast_bf16: no
        - tpu_use_cluster: False
        - tpu_use_sudo: False
        - tpu_env: []
        - dynamo_config: {'dynamo_backend': 'INDUCTOR', 'dynamo_mode': 'default', 'dynamo_use_dynamic': True, 'dynamo_use_fullgraph': False}
- PyTorch version (GPU?): 2.0.1+cu117 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: <fill in>
- Using distributed or parallel set-up in script?: <fill in>

bentoml env

Environment variable

BENTOML_DEBUG=''
BENTOML_QUIET=''
BENTOML_BUNDLE_LOCAL_BUILD=''
BENTOML_DO_NOT_TRACK=''
BENTOML_CONFIG=''
BENTOML_CONFIG_OPTIONS=''
BENTOML_PORT=''
BENTOML_HOST=''
BENTOML_API_WORKERS=''

System information

bentoml: 1.0.24
python: 3.9.17
platform: Linux-4.18.0-348.7.1.el8_5.x86_64-x86_64-with-glibc2.28
uid_gid: 1001:1001
/home/user/Downloads/enter/bin/conda: 4.5.11
in_conda_env: True

conda_packages
name: openllm_test
channels:
  - http://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
  - http://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
  - defaults
dependencies:
  - _libgcc_mutex=0.1=conda_forge
  - _openmp_mutex=4.5=2_kmp_llvm
  - libgcc-ng=12.2.0=h65d4601_19
  - libstdcxx-ng=12.2.0=h46fd767_19
  - ca-certificates=2023.05.30=h06a4308_0
  - ld_impl_linux-64=2.38=h1181459_1
  - libffi=3.4.4=h6a678d5_0
  - llvm-openmp=14.0.6=h9e868ea_0
  - ncurses=6.4=h6a678d5_0
  - openssl=3.0.9=h7f8727e_0
  - pip=23.1.2=py39h06a4308_0
  - python=3.9.17=h955ad1f_0
  - readline=8.2=h5eee18b_0
  - setuptools=67.8.0=py39h06a4308_0
  - sqlite=3.41.2=h5eee18b_0
  - tk=8.6.12=h1ccaba5_0
  - tzdata=2023c=h04d1e81_0
  - wheel=0.38.4=py39h06a4308_0
  - xz=5.4.2=h5eee18b_0
  - zlib=1.2.13=h5eee18b_0
  - pip:
    - accelerate==0.21.0
    - aiohttp==3.8.5
    - aiosignal==1.3.1
    - anyio==3.7.1
    - appdirs==1.4.4
    - asgiref==3.7.2
    - async-timeout==4.0.2
    - attrs==23.1.0
    - bentoml==1.0.24
    - bitsandbytes==0.39.1
    - build==0.10.0
    - cattrs==23.1.2
    - certifi==2023.5.7
    - charset-normalizer==3.2.0
    - circus==0.18.0
    - click==8.1.6
    - click-option-group==0.5.6
    - cloudpickle==2.2.1
    - cmake==3.27.0
    - coloredlogs==15.0.1
    - contextlib2==21.6.0
    - cpm-kernels==1.0.11
    - cuda-python==12.2.0
    - cython==3.0.0
    - datasets==2.13.1
    - deepmerge==1.1.0
    - deprecated==1.2.14
    - dill==0.3.6
    - exceptiongroup==1.1.2
    - filelock==3.12.2
    - filetype==1.2.0
    - frozenlist==1.4.0
    - fs==2.4.16
    - fsspec==2023.6.0
    - grpcio==1.56.2
    - grpcio-health-checking==1.56.2
    - h11==0.14.0
    - httpcore==0.17.3
    - httpx==0.24.1
    - huggingface-hub==0.16.4
    - humanfriendly==10.0
    - idna==3.4
    - importlib-metadata==6.0.1
    - inflection==0.5.1
    - jinja2==3.1.2
    - lit==16.0.6
    - markdown-it-py==3.0.0
    - markupsafe==2.1.3
    - mdurl==0.1.2
    - mpmath==1.3.0
    - multidict==6.0.4
    - multiprocess==0.70.14
    - networkx==3.1
    - numpy==1.25.1
    - nvidia-cublas-cu11==11.10.3.66
    - nvidia-cuda-cupti-cu11==11.7.101
    - nvidia-cuda-nvrtc-cu11==11.7.99
    - nvidia-cuda-runtime-cu11==11.7.99
    - nvidia-cudnn-cu11==8.5.0.96
    - nvidia-cufft-cu11==10.9.0.58
    - nvidia-curand-cu11==10.2.10.91
    - nvidia-cusolver-cu11==11.4.0.1
    - nvidia-cusparse-cu11==11.7.4.91
    - nvidia-nccl-cu11==2.14.3
    - nvidia-nvtx-cu11==11.7.91
    - openllm==0.2.0
    - opentelemetry-api==1.18.0
    - opentelemetry-instrumentation==0.39b0
    - opentelemetry-instrumentation-aiohttp-client==0.39b0
    - opentelemetry-instrumentation-asgi==0.39b0
    - opentelemetry-instrumentation-grpc==0.39b0
    - opentelemetry-sdk==1.18.0
    - opentelemetry-semantic-conventions==0.39b0
    - opentelemetry-util-http==0.39b0
    - optimum==1.9.1
    - orjson==3.9.2
    - packaging==23.1
    - pandas==2.0.3
    - pathspec==0.11.1
    - pillow==10.0.0
    - pip-requirements-parser==32.0.1
    - pip-tools==7.1.0
    - prometheus-client==0.17.1
    - protobuf==4.23.4
    - psutil==5.9.5
    - pyarrow==12.0.1
    - pydantic==1.10.11
    - pygments==2.15.1
    - pynvml==11.5.0
    - pyparsing==3.1.0
    - pyproject_hooks==1.0.0
    - python-dateutil==2.8.2
    - python-json-logger==2.0.7
    - python-multipart==0.0.6
    - pytz==2023.3
    - pyyaml==6.0.1
    - pyzmq==25.1.0
    - regex==2023.6.3
    - requests==2.31.0
    - rich==13.4.2
    - safetensors==0.3.1
    - schema==0.7.5
    - scipy==1.11.1
    - sentencepiece==0.1.99
    - simple-di==0.1.5
    - six==1.16.0
    - sniffio==1.3.0
    - starlette==0.28.0
    - sympy==1.12
    - tabulate==0.9.0
    - tokenizers==0.13.3
    - tomli==2.0.1
    - torch==2.0.1
    - tornado==6.3.2
    - tqdm==4.65.0
    - transformers==4.31.0
    - transformers-stream-generator==0.0.4
    - triton==2.0.0
    - typing_extensions==4.7.1
    - urllib3==2.0.4
    - uvicorn==0.23.1
    - watchfiles==0.19.0
    - wcwidth==0.2.6
    - wrapt==1.15.0
    - xxhash==3.2.0
    - yarl==1.9.2
    - zipp==3.16.2
prefix: /home/user/Downloads/enter/envs/openllm_test


pip_packages

accelerate==0.21.0
aiohttp==3.8.5
aiosignal==1.3.1
anyio==3.7.1
appdirs==1.4.4
asgiref==3.7.2
async-timeout==4.0.2
attrs==23.1.0
bentoml==1.0.24
bitsandbytes==0.39.1
build==0.10.0
cattrs==23.1.2
certifi==2023.5.7
charset-normalizer==3.2.0
circus==0.18.0
click==8.1.6
click-option-group==0.5.6
cloudpickle==2.2.1
cmake==3.27.0
coloredlogs==15.0.1
contextlib2==21.6.0
cpm-kernels==1.0.11
cuda-python==12.2.0
Cython==3.0.0
datasets==2.13.1
deepmerge==1.1.0
Deprecated==1.2.14
dill==0.3.6
exceptiongroup==1.1.2
filelock==3.12.2
filetype==1.2.0
frozenlist==1.4.0
fs==2.4.16
fsspec==2023.6.0
grpcio==1.56.2
grpcio-health-checking==1.56.2
h11==0.14.0
httpcore==0.17.3
httpx==0.24.1
huggingface-hub==0.16.4
humanfriendly==10.0
idna==3.4
importlib-metadata==6.0.1
inflection==0.5.1
Jinja2==3.1.2
lit==16.0.6
markdown-it-py==3.0.0
MarkupSafe==2.1.3
mdurl==0.1.2
mpmath==1.3.0
multidict==6.0.4
multiprocess==0.70.14
networkx==3.1
numpy==1.25.1
nvidia-cublas-cu11==11.10.3.66
nvidia-cuda-cupti-cu11==11.7.101
nvidia-cuda-nvrtc-cu11==11.7.99
nvidia-cuda-runtime-cu11==11.7.99
nvidia-cudnn-cu11==8.5.0.96
nvidia-cufft-cu11==10.9.0.58
nvidia-curand-cu11==10.2.10.91
nvidia-cusolver-cu11==11.4.0.1
nvidia-cusparse-cu11==11.7.4.91
nvidia-nccl-cu11==2.14.3
nvidia-nvtx-cu11==11.7.91
openllm==0.2.0
opentelemetry-api==1.18.0
opentelemetry-instrumentation==0.39b0
opentelemetry-instrumentation-aiohttp-client==0.39b0
opentelemetry-instrumentation-asgi==0.39b0
opentelemetry-instrumentation-grpc==0.39b0
opentelemetry-sdk==1.18.0
opentelemetry-semantic-conventions==0.39b0
opentelemetry-util-http==0.39b0
optimum==1.9.1
orjson==3.9.2
packaging==23.1
pandas==2.0.3
pathspec==0.11.1
Pillow==10.0.0
pip-requirements-parser==32.0.1
pip-tools==7.1.0
prometheus-client==0.17.1
protobuf==4.23.4
psutil==5.9.5
pyarrow==12.0.1
pydantic==1.10.11
Pygments==2.15.1
pynvml==11.5.0
pyparsing==3.1.0
pyproject_hooks==1.0.0
python-dateutil==2.8.2
python-json-logger==2.0.7
python-multipart==0.0.6
pytz==2023.3
PyYAML==6.0.1
pyzmq==25.1.0
regex==2023.6.3
requests==2.31.0
rich==13.4.2
safetensors==0.3.1
schema==0.7.5
scipy==1.11.1
sentencepiece==0.1.99
simple-di==0.1.5
six==1.16.0
sniffio==1.3.0
starlette==0.28.0
sympy==1.12
tabulate==0.9.0
tokenizers==0.13.3
tomli==2.0.1
torch==2.0.1
tornado==6.3.2
tqdm==4.65.0
transformers==4.31.0
transformers-stream-generator==0.0.4
triton==2.0.0
typing_extensions==4.7.1
tzdata==2023.3
urllib3==2.0.4
uvicorn==0.23.1
watchfiles==0.19.0
wcwidth==0.2.6
wrapt==1.15.0
xxhash==3.2.0
yarl==1.9.2
zipp==3.16.2


System information (Optional)

No response

bug: environment not as string

Describe the bug

What happened? An error is caught while starting the LLM server:
environment can only contain strings

To reproduce

No response

Logs

Traceback (most recent call last):
  File "D:\conda\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "D:\conda\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "D:\conda\Scripts\openllm.exe\__main__.py", line 7, in <module>
    sys.exit(cli())
  File "D:\conda\lib\site-packages\click\core.py", line 1128, in __call__
    return self.main(*args, **kwargs)
    rv = self.invoke(ctx)
  File "D:\conda\lib\site-packages\click\core.py", line 1659, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "D:\conda\lib\site-packages\click\core.py", line 1659, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "D:\conda\lib\site-packages\click\core.py", line 1395, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "D:\conda\lib\site-packages\click\core.py", line 754, in invoke
    return __callback(*args, **kwargs)
  File "D:\conda\lib\site-packages\openllm\cli.py", line 381, in wrapper
    return func(*args, **attrs)
  File "D:\conda\lib\site-packages\openllm\cli.py", line 354, in wrapper
    return_value = func(*args, **attrs)
  File "D:\conda\lib\site-packages\openllm\cli.py", line 329, in wrapper
    return f(*args, **attrs)
  File "D:\conda\lib\site-packages\click\decorators.py", line 26, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "D:\conda\lib\site-packages\openllm\cli.py", line 837, in model_start
    server.start(env=start_env, text=True, blocking=True)
  File "D:\conda\lib\site-packages\bentoml\server.py", line 190, in start
    return _Manager()
  File "D:\conda\lib\site-packages\bentoml\server.py", line 163, in __init__
    self.process = subprocess.Popen(
  File "D:\conda\lib\subprocess.py", line 971, in __init__
    self._execute_child(args, executable, preexec_fn, close_fds,
  File "D:\conda\lib\subprocess.py", line 1440, in _execute_child
    hp, ht, pid, tid = _winapi.CreateProcess(executable, args,
TypeError: environment can only contain strings

Environment

1

System information (Optional)

GPU 4080

ExLlama supported?

The ExLlama transformer is a blazing-fast implementation for 4-bit GPTQ-weighted Llama models, achieving 40+ tokens/second on high-end consumer hardware. Could it be supported as a backend in OpenLLM?

bug: load a model from local

Describe the bug

When I load my local model

openllm start chatglm --model-id /chatglm-6b

I get an error:

openllm.exceptions.OpenLLMException: Model type <class 'transformers_modules.chatglm-6b.configuration_chatglm.ChatGLMConfig'> is not supported yet.

What can I do?

To reproduce

No response

Logs

No response

Environment

cli

System information (Optional)

No response

bug(starcoder): OOM

Describe the bug

Currently, the starcoder runner will OOM when serving with openllm start.

This has to do with the runner being registered for only one GPU, even when more GPUs are available.

This is a current limitation of BentoML, where each worker instance is assigned only a single GPU as its resource.

To reproduce

No response

Logs

No response

Environment

No response

bug: Unclear setup

Describe the bug

Running the project via the README results in issues finding the necessary model. The missing step is something like openllm download dolly-v2 before openllm start dolly-v2 can be run.

To reproduce

  • Install per instructions
  • Run per instructions

Logs

`NotFound(bentoml.exceptions.NotFound: Model 'pt-databricks-dollyv2-3b:877db3ed12a3086500d144b9ef74e469b107a041' is not found in BentoML store`

Environment

Ubuntu 22.04

bug: "Timeout doesn't fit into C timeval" in async context

Describe the bug

Since commit 528f76e there is an if statement, if in_async_context():, in openllm_client/runtimes/base.py.
If this is true and the httpx post is executed, I get "OverflowError: timeout doesn't fit into C timeval".

The timeout value I can see in my debugger is 36000000. When I set the timeout value (in the debugger) to 1000, the exception is not thrown and I get a correct result.

To reproduce

Start LLMChain.run in an async context:

import asyncio
from langchain import PromptTemplate, LLMChain
from langchain.llms import OpenLLM

async def foo():
    server_url = "http://hostname:port"
    temp = 0.2
    llm = OpenLLM(server_url=server_url, temperature=temp)

    print("llm type: " + str(type(llm)))

    template = "What is a good name for a company that makes {product} at this place: {place}?"
    prompt = PromptTemplate(template=template, input_variables=["product", "place"])
    llm_chain = LLMChain(prompt=prompt, llm=llm)

    generated = llm_chain.run(product="mechanical keyboard", place="Detroit")
    print(generated)


loop = asyncio.get_event_loop()
loop.run_until_complete(foo())

Logs

Traceback (most recent call last):
  File "C:\svn\ocean\AI_Assistant\env\Lib\site-packages\starlette\middleware\errors.py", line 162, in __call__
    await self.app(scope, receive, _send)
  File "C:\svn\ocean\AI_Assistant\env\Lib\site-packages\starlette\middleware\gzip.py", line 24, in __call__
    await responder(scope, receive, send)
  File "C:\svn\ocean\AI_Assistant\env\Lib\site-packages\starlette\middleware\gzip.py", line 44, in __call__
    await self.app(scope, receive, self.send_with_gzip)
  File "C:\svn\ocean\AI_Assistant\env\Lib\site-packages\starlette\middleware\exceptions.py", line 79, in __call__
    raise exc
  File "C:\svn\ocean\AI_Assistant\env\Lib\site-packages\starlette\middleware\exceptions.py", line 68, in __call__
    await self.app(scope, receive, sender)
  File "C:\svn\ocean\AI_Assistant\env\Lib\site-packages\fastapi\middleware\asyncexitstack.py", line 20, in __call__
    raise e
  File "C:\svn\ocean\AI_Assistant\env\Lib\site-packages\fastapi\middleware\asyncexitstack.py", line 17, in __call__
    await self.app(scope, receive, send)
  File "C:\svn\ocean\AI_Assistant\env\Lib\site-packages\starlette\routing.py", line 718, in __call__
    await route.handle(scope, receive, send)
  File "C:\svn\ocean\AI_Assistant\env\Lib\site-packages\starlette\routing.py", line 276, in handle
    await self.app(scope, receive, send)
  File "C:\svn\ocean\AI_Assistant\env\Lib\site-packages\starlette\routing.py", line 66, in app
    response = await func(request)
               ^^^^^^^^^^^^^^^^^^^
  File "C:\svn\ocean\AI_Assistant\env\Lib\site-packages\fastapi\routing.py", line 241, in app
    raw_response = await run_endpoint_function(
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\svn\ocean\AI_Assistant\env\Lib\site-packages\fastapi\routing.py", line 167, in run_endpoint_function
    return await dependant.call(**values)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\svn\ocean\AI_Assistant\env\Lib\site-packages\nicegui\page.py", line 87, in decorated
    result = func(*dec_args, **dec_kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\svn\ocean\AI_Assistant\src\router_test.py", line 22, in do_stuff
    generated = stuff()
                ^^^^^^^
  File "C:\svn\ocean\AI_Assistant\src\router_test.py", line 15, in stuff
    generated = llm_chain.run(product="mechanical keyboard", place="Detroit")
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\svn\ocean\AI_Assistant\env\Lib\site-packages\langchain\chains\base.py", line 445, in run
    return self(kwargs, callbacks=callbacks, tags=tags, metadata=metadata)[
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\svn\ocean\AI_Assistant\env\Lib\site-packages\langchain\chains\base.py", line 243, in __call__
    raise e
  File "C:\svn\ocean\AI_Assistant\env\Lib\site-packages\langchain\chains\base.py", line 237, in __call__
    self._call(inputs, run_manager=run_manager)
  File "C:\svn\ocean\AI_Assistant\env\Lib\site-packages\langchain\chains\llm.py", line 92, in _call
    response = self.generate([inputs], run_manager=run_manager)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\svn\ocean\AI_Assistant\env\Lib\site-packages\langchain\chains\llm.py", line 102, in generate
    return self.llm.generate_prompt(
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\svn\ocean\AI_Assistant\env\Lib\site-packages\langchain\llms\base.py", line 186, in generate_prompt
    return self.generate(prompt_strings, stop=stop, callbacks=callbacks, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\svn\ocean\AI_Assistant\env\Lib\site-packages\langchain\llms\base.py", line 279, in generate
    output = self._generate_helper(
             ^^^^^^^^^^^^^^^^^^^^^^
  File "C:\svn\ocean\AI_Assistant\env\Lib\site-packages\langchain\llms\base.py", line 223, in _generate_helper
    raise e
  File "C:\svn\ocean\AI_Assistant\env\Lib\site-packages\langchain\llms\base.py", line 210, in _generate_helper
    self._generate(
  File "C:\svn\ocean\AI_Assistant\env\Lib\site-packages\langchain\llms\base.py", line 602, in _generate
    self._call(prompt, stop=stop, run_manager=run_manager, **kwargs)
  File "C:\svn\ocean\AI_Assistant\env\Lib\site-packages\langchain\llms\openllm.py", line 270, in _call
    return self._client.query(prompt, **config.model_dump(flatten=True))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\svn\ocean\AI_Assistant\env\Lib\site-packages\openllm_client\runtimes\base.py", line 184, in query
    result = httpx.post(
             ^^^^^^^^^^^
  File "C:\svn\ocean\AI_Assistant\env\Lib\site-packages\httpx\_api.py", line 304, in post
    return request(
           ^^^^^^^^
  File "C:\svn\ocean\AI_Assistant\env\Lib\site-packages\httpx\_api.py", line 100, in request
    return client.request(
           ^^^^^^^^^^^^^^^
  File "C:\svn\ocean\AI_Assistant\env\Lib\site-packages\httpx\_client.py", line 814, in request
    return self.send(request, auth=auth, follow_redirects=follow_redirects)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\svn\ocean\AI_Assistant\env\Lib\site-packages\httpx\_client.py", line 901, in send
    response = self._send_handling_auth(
               ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\svn\ocean\AI_Assistant\env\Lib\site-packages\httpx\_client.py", line 929, in _send_handling_auth
    response = self._send_handling_redirects(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\svn\ocean\AI_Assistant\env\Lib\site-packages\httpx\_client.py", line 966, in _send_handling_redirects
    response = self._send_single_request(request)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\svn\ocean\AI_Assistant\env\Lib\site-packages\httpx\_client.py", line 1002, in _send_single_request
    response = transport.handle_request(request)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\svn\ocean\AI_Assistant\env\Lib\site-packages\httpx\_transports\default.py", line 218, in handle_request
    resp = self._pool.handle_request(req)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\svn\ocean\AI_Assistant\env\Lib\site-packages\httpcore\_sync\connection_pool.py", line 261, in handle_request
    raise exc
  File "C:\svn\ocean\AI_Assistant\env\Lib\site-packages\httpcore\_sync\connection_pool.py", line 245, in handle_request
    response = connection.handle_request(request)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\svn\ocean\AI_Assistant\env\Lib\site-packages\httpcore\_sync\connection.py", line 92, in handle_request
    raise exc
  File "C:\svn\ocean\AI_Assistant\env\Lib\site-packages\httpcore\_sync\connection.py", line 69, in handle_request
    stream = self._connect(request)
             ^^^^^^^^^^^^^^^^^^^^^^
  File "C:\svn\ocean\AI_Assistant\env\Lib\site-packages\httpcore\_sync\connection.py", line 117, in _connect
    stream = self._network_backend.connect_tcp(**kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\svn\ocean\AI_Assistant\env\Lib\site-packages\httpcore\backends\sync.py", line 100, in connect_tcp
    sock = socket.create_connection(
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Python311\Lib\socket.py", line 833, in create_connection
    sock.settimeout(timeout)
OverflowError: timeout doesn't fit into C timeval
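
Note on this traceback: the failure is in socket.settimeout, which on Windows stores the timeout in a C timeval and overflows when the client configuration produces an absurdly large timeout value. A minimal, hedged sketch of a workaround is to clamp the timeout before handing it to httpx; the clamp value, endpoint path, and payload shape below are illustrative assumptions, not OpenLLM's actual client code.

```python
import httpx

# On Windows, socket.create_connection(addr, timeout=1e12) raises
# "OverflowError: timeout doesn't fit into C timeval", which is what the
# traceback above shows. Clamping the timeout avoids the crash.
MAX_TIMEOUT = 600.0  # assumption: ten minutes is ample for one generation


def clamp_timeout(value: float) -> float:
    return min(float(value), MAX_TIMEOUT)


def generate(prompt: str, timeout: float) -> dict:
    # Endpoint and payload shape are assumptions for illustration only.
    response = httpx.post(
        "http://localhost:3000/v1/generate",
        json={"prompt": prompt, "llm_config": {}},
        timeout=clamp_timeout(timeout),
    )
    response.raise_for_status()
    return response.json()
```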
ERROR:    Exception in ASGI application
Traceback (most recent call last):
  File "C:\svn\ocean\AI_Assistant\env\Lib\site-packages\uvicorn\protocols\http\httptools_impl.py", line 419, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\svn\ocean\AI_Assistant\env\Lib\site-packages\uvicorn\middleware\proxy_headers.py", line 78, in __call__
    return await self.app(scope, receive, send)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\svn\ocean\AI_Assistant\env\Lib\site-packages\fastapi\applications.py", line 290, in __call__
    await super().__call__(scope, receive, send)
  File "C:\svn\ocean\AI_Assistant\env\Lib\site-packages\starlette\applications.py", line 122, in __call__
    await self.middleware_stack(scope, receive, send)
  File "C:\svn\ocean\AI_Assistant\env\Lib\site-packages\starlette\middleware\errors.py", line 184, in __call__
    raise exc
  File "C:\svn\ocean\AI_Assistant\env\Lib\site-packages\starlette\middleware\errors.py", line 162, in __call__
    await self.app(scope, receive, _send)
  File "C:\svn\ocean\AI_Assistant\env\Lib\site-packages\starlette\middleware\gzip.py", line 24, in __call__
    await responder(scope, receive, send)
  File "C:\svn\ocean\AI_Assistant\env\Lib\site-packages\starlette\middleware\gzip.py", line 44, in __call__
    await self.app(scope, receive, self.send_with_gzip)
  File "C:\svn\ocean\AI_Assistant\env\Lib\site-packages\starlette\middleware\exceptions.py", line 79, in __call__
    raise exc
  File "C:\svn\ocean\AI_Assistant\env\Lib\site-packages\starlette\middleware\exceptions.py", line 68, in __call__
    await self.app(scope, receive, sender)
  File "C:\svn\ocean\AI_Assistant\env\Lib\site-packages\fastapi\middleware\asyncexitstack.py", line 20, in __call__
    raise e
  File "C:\svn\ocean\AI_Assistant\env\Lib\site-packages\fastapi\middleware\asyncexitstack.py", line 17, in __call__
    await self.app(scope, receive, send)
  File "C:\svn\ocean\AI_Assistant\env\Lib\site-packages\starlette\routing.py", line 718, in __call__
    await route.handle(scope, receive, send)
  File "C:\svn\ocean\AI_Assistant\env\Lib\site-packages\starlette\routing.py", line 276, in handle
    await self.app(scope, receive, send)
  File "C:\svn\ocean\AI_Assistant\env\Lib\site-packages\starlette\routing.py", line 66, in app
    response = await func(request)
               ^^^^^^^^^^^^^^^^^^^
  File "C:\svn\ocean\AI_Assistant\env\Lib\site-packages\fastapi\routing.py", line 241, in app
    raw_response = await run_endpoint_function(
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\svn\ocean\AI_Assistant\env\Lib\site-packages\fastapi\routing.py", line 167, in run_endpoint_function
    return await dependant.call(**values)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\svn\ocean\AI_Assistant\env\Lib\site-packages\nicegui\page.py", line 87, in decorated
    result = func(*dec_args, **dec_kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\svn\ocean\AI_Assistant\src\router_test.py", line 22, in do_stuff
    generated = stuff()
                ^^^^^^^
  File "C:\svn\ocean\AI_Assistant\src\router_test.py", line 15, in stuff
    generated = llm_chain.run(product="mechanical keyboard", place="Detroit")
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\svn\ocean\AI_Assistant\env\Lib\site-packages\langchain\chains\base.py", line 445, in run
    return self(kwargs, callbacks=callbacks, tags=tags, metadata=metadata)[
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\svn\ocean\AI_Assistant\env\Lib\site-packages\langchain\chains\base.py", line 243, in __call__
    raise e
  File "C:\svn\ocean\AI_Assistant\env\Lib\site-packages\langchain\chains\base.py", line 237, in __call__
    self._call(inputs, run_manager=run_manager)
  File "C:\svn\ocean\AI_Assistant\env\Lib\site-packages\langchain\chains\llm.py", line 92, in _call
    response = self.generate([inputs], run_manager=run_manager)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\svn\ocean\AI_Assistant\env\Lib\site-packages\langchain\chains\llm.py", line 102, in generate
    return self.llm.generate_prompt(
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\svn\ocean\AI_Assistant\env\Lib\site-packages\langchain\llms\base.py", line 186, in generate_prompt
    return self.generate(prompt_strings, stop=stop, callbacks=callbacks, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\svn\ocean\AI_Assistant\env\Lib\site-packages\langchain\llms\base.py", line 279, in generate
    output = self._generate_helper(
             ^^^^^^^^^^^^^^^^^^^^^^
  File "C:\svn\ocean\AI_Assistant\env\Lib\site-packages\langchain\llms\base.py", line 223, in _generate_helper
    raise e
  File "C:\svn\ocean\AI_Assistant\env\Lib\site-packages\langchain\llms\base.py", line 210, in _generate_helper
    self._generate(
  File "C:\svn\ocean\AI_Assistant\env\Lib\site-packages\langchain\llms\base.py", line 602, in _generate
    self._call(prompt, stop=stop, run_manager=run_manager, **kwargs)
  File "C:\svn\ocean\AI_Assistant\env\Lib\site-packages\langchain\llms\openllm.py", line 270, in _call
    return self._client.query(prompt, **config.model_dump(flatten=True))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\svn\ocean\AI_Assistant\env\Lib\site-packages\openllm_client\runtimes\base.py", line 184, in query
    result = httpx.post(
             ^^^^^^^^^^^
  File "C:\svn\ocean\AI_Assistant\env\Lib\site-packages\httpx\_api.py", line 304, in post
    return request(
           ^^^^^^^^
  File "C:\svn\ocean\AI_Assistant\env\Lib\site-packages\httpx\_api.py", line 100, in request
    return client.request(
           ^^^^^^^^^^^^^^^
  File "C:\svn\ocean\AI_Assistant\env\Lib\site-packages\httpx\_client.py", line 814, in request
    return self.send(request, auth=auth, follow_redirects=follow_redirects)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\svn\ocean\AI_Assistant\env\Lib\site-packages\httpx\_client.py", line 901, in send
    response = self._send_handling_auth(
               ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\svn\ocean\AI_Assistant\env\Lib\site-packages\httpx\_client.py", line 929, in _send_handling_auth
    response = self._send_handling_redirects(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\svn\ocean\AI_Assistant\env\Lib\site-packages\httpx\_client.py", line 966, in _send_handling_redirects
    response = self._send_single_request(request)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\svn\ocean\AI_Assistant\env\Lib\site-packages\httpx\_client.py", line 1002, in _send_single_request
    response = transport.handle_request(request)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\svn\ocean\AI_Assistant\env\Lib\site-packages\httpx\_transports\default.py", line 218, in handle_request
    resp = self._pool.handle_request(req)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\svn\ocean\AI_Assistant\env\Lib\site-packages\httpcore\_sync\connection_pool.py", line 261, in handle_request
    raise exc
  File "C:\svn\ocean\AI_Assistant\env\Lib\site-packages\httpcore\_sync\connection_pool.py", line 245, in handle_request
    response = connection.handle_request(request)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\svn\ocean\AI_Assistant\env\Lib\site-packages\httpcore\_sync\connection.py", line 92, in handle_request
    raise exc
  File "C:\svn\ocean\AI_Assistant\env\Lib\site-packages\httpcore\_sync\connection.py", line 69, in handle_request
    stream = self._connect(request)
             ^^^^^^^^^^^^^^^^^^^^^^
  File "C:\svn\ocean\AI_Assistant\env\Lib\site-packages\httpcore\_sync\connection.py", line 117, in _connect
    stream = self._network_backend.connect_tcp(**kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\svn\ocean\AI_Assistant\env\Lib\site-packages\httpcore\backends\sync.py", line 100, in connect_tcp
    sock = socket.create_connection(
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Python311\Lib\socket.py", line 833, in create_connection
    sock.settimeout(timeout)
OverflowError: timeout doesn't fit into C timeval

Environment

(env) PS C:\svn\ocean\AI_Assistant> bentoml env

Environment variable

BENTOML_DEBUG=''             
BENTOML_QUIET=''             
BENTOML_BUNDLE_LOCAL_BUILD=''
BENTOML_DO_NOT_TRACK=''      
BENTOML_CONFIG=''            
BENTOML_CONFIG_OPTIONS=''    
BENTOML_PORT=''              
BENTOML_HOST=''              
BENTOML_API_WORKERS=''       

System information

bentoml: 1.0.23
python: 3.11.3
platform: Windows-10-10.0.19045-SP0
is_window_admin: True

pip_packages
accelerate==0.20.3
aiofiles==23.1.0
aiohttp==3.8.4
aiosignal==1.3.1
anyio==3.7.0
appdirs==1.4.4
argilla==0.0.1
asgiref==3.7.2
async-timeout==4.0.2
attrs==23.1.0
beautifulsoup4==4.12.2
bentoml==1.0.23
bidict==0.22.1
blinker==1.6.2
bottle==0.12.25
bs4==0.0.1
build==0.10.0
cattrs==23.1.2
certifi==2023.5.7
cffi==1.15.1
chardet==5.1.0
charset-normalizer==3.1.0
circus==0.18.0
click==8.1.3
click-option-group==0.5.6
cloudpickle==2.2.1
clr-loader==0.2.5
colorama==0.4.6
colorclass==2.2.2
coloredlogs==15.0.1
compressed-rtf==1.0.6
contextlib2==21.6.0
contourpy==1.1.0
cryptography==41.0.1
cycler==0.11.0
dataclasses-json==0.5.9
datasets==2.13.1
deepmerge==1.1.0
Deprecated==1.2.14
dill==0.3.6
diskcache==5.6.1
easygui==0.98.3
ebcdic==1.1.1
et-xmlfile==1.1.0
extract-msg==0.41.5
fastapi==0.99.1
fastapi-socketio==0.0.10
filelock==3.12.2
filetype==1.2.0
Flask==2.3.2
fonttools==4.40.0
frozenlist==1.3.3
fs==2.4.16
fsspec==2023.6.0
gpt4all==1.0.1
greenlet==2.0.2
grpcio==1.56.0
grpcio-health-checking==1.48.2
grpcio-tools==1.56.0
h11==0.14.0
h2==4.1.0
hpack==4.0.0
httpcore==0.17.2
httptools==0.5.0
httpx==0.24.1
huggingface-hub==0.15.1
humanfriendly==10.0
hyperframe==6.0.1
idna==3.4
IMAPClient==2.3.1
importlib-metadata==6.0.1
inflection==0.5.1
itsdangerous==2.1.2
Jinja2==3.1.2
joblib==1.3.1
jsonify==0.5
kiwisolver==1.4.4
langchain==0.0.229
langchainplus-sdk==0.0.20
lark-parser==0.12.0
llama-cpp-python==0.1.67
lxml==4.9.2
Markdown==3.4.3
markdown-it-py==3.0.0
markdown2==2.4.9
MarkupSafe==2.1.3
marshmallow==3.19.0
marshmallow-enum==1.5.1
matplotlib==3.7.1
mdurl==0.1.2
mpmath==1.3.0
msg-parser==1.2.0
msoffcrypto-tool==5.0.1
multidict==6.0.4
multiprocess==0.70.14
mypy-extensions==1.0.0
networkx==3.1
nicegui==1.2.24
nltk==3.8.1
numexpr==2.8.4
numpy==1.25.0
olefile==0.46
oletools==0.60.1
openapi-schema-pydantic==1.2.4
openllm==0.1.19
openpyxl==3.1.2
opentelemetry-api==1.17.0
opentelemetry-instrumentation==0.38b0
opentelemetry-instrumentation-aiohttp-client==0.38b0
opentelemetry-instrumentation-asgi==0.38b0
opentelemetry-instrumentation-grpc==0.38b0
opentelemetry-sdk==1.17.0
opentelemetry-semantic-conventions==0.38b0
opentelemetry-util-http==0.38b0
optimum==1.1.1
orjson==3.9.1
packaging==23.1
pandas==2.0.3
pathspec==0.11.1
pcodedmp==1.2.6
pdf2image==1.16.3
pdfminer.six==20221105
Pillow==10.0.0
pip-requirements-parser==32.0.1
pip-tools==6.14.0
plotly==5.15.0
portalocker==2.7.0
prometheus-client==0.17.0
prompt-toolkit==3.0.38
protobuf==4.23.3
proxy-tools==0.1.0
pscript==0.7.7
psutil==5.9.5
pyarrow==12.0.1
pycparser==2.21
pydantic==1.10.10
Pygments==2.15.1
pynvml==11.5.0
pypandoc==1.11
pyparsing==2.4.7
pyproject_hooks==1.0.0
pyreadline3==3.4.1
python-dateutil==2.8.2
python-docx==0.8.11
python-dotenv==1.0.0
python-engineio==4.4.1
python-json-logger==2.0.7
python-magic==0.4.27
python-multipart==0.0.6
python-pptx==0.6.21
python-socketio==5.8.0
pythonnet==3.0.1
pytz==2023.3
pywebview==4.2.2
pywin32==306
PyYAML==6.0
pyzmq==25.1.0
qdrant-client==1.3.1
ReallySimpleDB==1.2
red-black-tree-mod==1.20
regex==2023.6.3
requests==2.31.0
rich==13.4.2
RTFDE==0.0.2
safetensors==0.3.1
schema==0.7.5
scikit-learn==1.3.0
scipy==1.11.1
sentence-transformers==2.2.2
sentencepiece==0.1.99
simple-di==0.1.5
six==1.16.0
sniffio==1.3.0
soupsieve==2.4.1
SQLAlchemy==2.0.17
starlette==0.27.0
sympy==1.12
tabulate==0.9.0
tenacity==8.2.2
threadpoolctl==3.1.0
tiktoken==0.4.0
tokenizers==0.13.3
torch==2.0.1
torchvision==0.15.2
tornado==6.3.2
tqdm==4.65.0
transformers==4.30.2
typing-inspect==0.9.0
typing_extensions==4.5.0
tzdata==2023.3
tzlocal==5.0.1
unstructured==0.7.12
urllib3==1.26.16
uvicorn==0.20.0
vbuild==0.8.1
wcwidth==0.2.6
websockets==11.0.3
Werkzeug==2.3.6
win-unicode-console==0.5
wrapt==1.15.0
xlrd==2.0.1
XlsxWriter==3.1.2
xxhash==3.2.0
yarl==1.9.2
zipp==3.15.0

(env) PS C:\svn\ocean\AI_Assistant> transformers-cli env

Copy-and-paste the text below in your GitHub issue and FILL OUT the two last points.

  • transformers version: 4.30.2
  • Platform: Windows-10-10.0.19045-SP0
  • Python version: 3.11.3
  • Huggingface_hub version: 0.15.1
  • Safetensors version: 0.3.1
  • PyTorch version (GPU?): 2.0.1+cpu (False)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?:
  • Using distributed or parallel set-up in script?:

System information (Optional)

No response

No such option: --model-id error

Describe the bug

When running the command to load chatglm or chatglm2, as below:
openllm start chatglm --model-id thudm/chatglm2-6b
openllm start chatglm --model-id thudm/chatglm-6b

it fails with: Error: No such option: --model-id
Running openllm start chatglm -h to check shows:
Usage: openllm start chatglm [OPTIONS]

  chatglm is currently not available to run on your local machine because it requires GPU for inference.

Options:
  Miscellaneous options:
    -q, --quiet         Suppress all output.
    --debug, --verbose  Print out debug logs.
    --do-not-track      Do not send usage info
    -h, --help          Show this message and exit.

To reproduce

No response

Logs

No response

Environment

Apple M1 Max
openllm 0.1.20.dev
python 3.9.7

System information (Optional)

No response

load a model from local path

When I load my local model with --model-id, it fails. What should I do? The error is: "Invalid value for '--model-id': '/disk/model/glm6b2' is not one of 'thudm/chatglm-6b', 'thudm/chatglm-6b-int8', 'thudm/chatglm-6b-int4'."

feat(cli): parsing Union Type

Describe the bug

This is more of a feature request: configuration parsing should support union types.

TODO:

  • Convert unions to the correct Click type. This will probably involve writing a custom Click type called UnionType; see the sketch below.
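
A minimal sketch of what such a UnionType could look like, assuming it simply tries each member of the union in order and reports a combined failure. The class name matches the TODO, but the behaviour here is illustrative, not the implementation that eventually landed in OpenLLM.

```python
import typing as t

import click


class UnionType(click.ParamType):
    """Try each member of a typing.Union until one of them accepts the value."""

    name = "union"

    def __init__(self, union: t.Any) -> None:
        # For t.Union[int, float] this yields (int, float).
        self.member_types = t.get_args(union)

    def convert(self, value, param, ctx):
        errors = []
        for member in self.member_types:
            try:
                # Naive coercion: real support would map each member to a proper click type.
                return member(value)
            except (TypeError, ValueError) as err:
                errors.append(f"{member.__name__}: {err}")
        self.fail(f"{value!r} does not match any of {self.member_types} ({'; '.join(errors)})", param, ctx)


@click.command()
@click.option("--temperature", type=UnionType(t.Union[int, float]), default=1.0, show_default=True)
def cli(temperature):
    """Echo the parsed value and its resolved type."""
    click.echo(f"temperature={temperature!r} ({type(temperature).__name__})")


if __name__ == "__main__":
    cli()
```

A full implementation would also need to handle Optional[...] and nested generics, which is why the TODO calls out a dedicated Click type rather than ad-hoc casting.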

To reproduce

No response

Logs

No response

Environment

No response

bug: 503 for /readyz with model-id facebook/opt-125m

Describe the bug

First of all: thank you very much, OpenLLM looks awesome so far 💯

This issue is related to #47. We tried to start an OpenLLM server with the command:

openllm start opt --model-id facebook/opt-125m

The server started successfully and the web interface is reachable, but we cannot generate anything: the examples given on the web interface do not work, and openllm query does not work either.

To reproduce

  1. openllm start opt --model-id facebook/opt-125m
  2. try the examples from the web interface (screenshot omitted)
  3. Or: openllm query "Tell me the truth!"

Logs

`openllm query "Tell me the truth!"`   

Timed out while connecting to localhost:3000:
Timed out waiting 30 seconds for server at 'localhost:3000' to be ready.
Traceback (most recent call last):
  File "/home/openllm/openllm_environment/bin/openllm", line 8, in <module>
    sys.exit(cli())
  File "/home/openllm/openllm_environment/lib64/python3.9/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/home/openllm/openllm_environment/lib64/python3.9/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/home/openllm/openllm_environment/lib64/python3.9/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/openllm/openllm_environment/lib64/python3.9/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/openllm/openllm_environment/lib64/python3.9/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/home/openllm/openllm_environment/lib64/python3.9/site-packages/openllm/cli.py", line 380, in wrapper
    return func(*args, **attrs)
  File "/home/openllm/openllm_environment/lib64/python3.9/site-packages/openllm/cli.py", line 353, in wrapper
    return_value = func(*args, **attrs)
  File "/home/openllm/openllm_environment/lib64/python3.9/site-packages/openllm/cli.py", line 328, in wrapper
    return f(*args, **attrs)
  File "/home/openllm/openllm_environment/lib64/python3.9/site-packages/openllm/cli.py", line 1345, in query
    openllm[client.framework],  # type: ignore (internal API)
  File "/home/openllm/openllm_environment/lib64/python3.9/site-packages/openllm_client/runtimes/http.py", line 52, in framework
    return self._metadata["framework"]
  File "/home/openllm/openllm_environment/lib64/python3.9/site-packages/openllm_client/runtimes/base.py", line 102, in _metadata
    return self.call("metadata")
  File "/home/openllm/openllm_environment/lib64/python3.9/site-packages/openllm_client/runtimes/base.py", line 143, in call
    return self._cached.call(f"{name}_{self._api_version}", *args, **attrs)
  File "/home/openllm/openllm_environment/lib64/python3.9/site-packages/openllm_client/runtimes/base.py", line 151, in _cached
    self._client_class.wait_until_server_ready(self._host, int(self._port), timeout=self._timeout)
  File "/home/openllm/openllm_environment/lib64/python3.9/site-packages/bentoml/_internal/client/http.py", line 67, in wait_until_server_ready
    raise TimeoutError(
TimeoutError: Timed out waiting 30 seconds for server at 'localhost:3000' to be ready.
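
The server logs further below show what the client is running into: the runner keeps answering /readyz with 404, so the API server reports 503 and never becomes ready, and openllm query gives up after its 30-second wait. While debugging, a small polling loop against the readiness endpoint makes the state visible without that client timeout getting in the way. This is a generic sketch using httpx against the host, port, and /readyz path visible in the logs, not an OpenLLM utility.

```python
import time

import httpx

# Assumed local server address; matches the default port seen in the logs above.
BASE_URL = "http://localhost:3000"


def wait_until_ready(timeout: float = 120.0, interval: float = 2.0) -> bool:
    """Poll /readyz until the server reports ready or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            resp = httpx.get(f"{BASE_URL}/readyz", timeout=5.0)
            print(f"/readyz -> {resp.status_code}")
            if resp.status_code == 200:
                return True
        except httpx.HTTPError as err:
            print(f"/readyz not reachable yet: {err}")
        time.sleep(interval)
    return False


if __name__ == "__main__":
    if not wait_until_ready():
        print("Server never became ready; check the runner logs for the underlying error.")
```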

  
Examples and /readyz requests from the web GUI:
```
2023-06-27T08:14:33+0200 [WARNING] [cli] No known supported resource available for <class 'types.OptRunnable'>, falling back to using CPU.
2023-06-27T08:14:34+0200 [INFO] [cli] Environ for worker 0: set CPU thread count to 16
2023-06-27T08:14:34+0200 [WARNING] [cli] No known supported resource available for <class 'types.OptRunnable'>, falling back to using CPU.
2023-06-27T08:14:34+0200 [INFO] [cli] Prometheus metrics for HTTP BentoServer from "_service.py:svc" can be accessed at http://localhost:3000/metrics.
2023-06-27T08:14:34+0200 [INFO] [cli] Starting production HTTP BentoServer from "_service.py:svc" listening on http://0.0.0.0:3000 (Press CTRL+C to quit)
2023-06-27T08:14:43+0200 [INFO] [api_server:llm-opt-service:16] 123.123.123.123:60081 (scheme=http,method=GET,path=/,type=,length=) (status=200,type=text/html; charset=utf-8,length=2859) 0.470ms (trace=9a0ba0a3ad009c89aebddd40fa94efa0,span=f5801bc407cca64c,sampled=1,service.name=llm-opt-service)
2023-06-27T08:14:44+0200 [INFO] [api_server:llm-opt-service:16] 123.123.123.123:60081 (scheme=http,method=GET,path=/static_content/swagger-ui.css,type=,length=) (status=200,type=text/css; charset=utf-8,length=143980) 6.529ms (trace=a044f5c38e8f8b15777f10f6a61e5e5b,span=d7759d1a32791968,sampled=1,service.name=llm-opt-service)
2023-06-27T08:14:44+0200 [INFO] [api_server:llm-opt-service:16] 123.123.123.123:60081 (scheme=http,method=GET,path=/static_content/index.css,type=,length=) (status=200,type=text/css; charset=utf-8,length=1125) 1.547ms (trace=13289048526d6668c8c59767576296b6,span=7480ec06dd86230a,sampled=1,service.name=llm-opt-service)
2023-06-27T08:14:44+0200 [INFO] [api_server:llm-opt-service:16] 123.123.123.123:60083 (scheme=http,method=GET,path=/static_content/swagger-initializer.js,type=,length=) (status=200,type=application/javascript,length=383) 1.395ms (trace=0f6357f294a61675d8b97a2d388116a6,span=e07593e79162735e,sampled=1,service.name=llm-opt-service)
2023-06-27T08:14:44+0200 [INFO] [api_server:llm-opt-service:16] 123.123.123.123:60081 (scheme=http,method=GET,path=/static_content/swagger-ui-bundle.js,type=,length=) (status=304,type=,length=) 0.877ms (trace=d076ac9766e7be7e4b142a7305daa54e,span=133caf1d609a19aa,sampled=1,service.name=llm-opt-service)
2023-06-27T08:14:44+0200 [INFO] [api_server:llm-opt-service:16] 123.123.123.123:60081 (scheme=http,method=GET,path=/static_content/swagger-ui-standalone-preset.js,type=,length=) (status=304,type=,length=) 0.881ms (trace=1d5078f1aee1173ee967829168d0ac01,span=0b4cd579c15c02f6,sampled=1,service.name=llm-opt-service)
2023-06-27T08:14:44+0200 [INFO] [api_server:llm-opt-service:16] 123.123.123.123:60081 (scheme=http,method=GET,path=/static_content/favicon-96x96.png,type=,length=) (status=200,type=image/png,length=5128) 3.940ms (trace=f141d17fbcbba97061c02f688df3659b,span=95f42383d374830e,sampled=1,service.name=llm-opt-service)
2023-06-27T08:14:44+0200 [INFO] [api_server:llm-opt-service:16] 123.123.123.123:60083 (scheme=http,method=GET,path=/static_content/favicon-32x32.png,type=,length=) (status=200,type=image/png,length=1912) 4.628ms (trace=766eb0b7b95b86767e67bafdc784f69f,span=8cd6fe4667b2d25d,sampled=1,service.name=llm-opt-service)
2023-06-27T08:14:44+0200 [INFO] [api_server:llm-opt-service:15] 123.123.123.123:60084 (scheme=http,method=GET,path=/docs.json,type=,length=) (status=200,type=application/json,length=8166) 16.775ms (trace=36ee0fd098a50c360588732a6eff754e,span=92bfc9de114573d5,sampled=1,service.name=llm-opt-service)
2023-06-27T08:14:57+0200 [INFO] [runner:llm-opt-runner:1] _ (scheme=http,method=GET,path=http://127.0.0.1:8000/readyz,type=,length=) (status=404,type=text/plain; charset=utf-8,length=9) 1.229ms (trace=825cdd8b43c097cf0fe9e7a89664d8c7,span=bf06eb4b65e76b31,sampled=1,service.name=llm-opt-runner)
2023-06-27T08:14:57+0200 [INFO] [api_server:llm-opt-service:16] 123.123.123.123:60109 (scheme=http,method=GET,path=/readyz,type=,length=) (status=503,type=text/plain; charset=utf-8,length=22) 77.489ms (trace=825cdd8b43c097cf0fe9e7a89664d8c7,span=023a43fb11c48835,sampled=1,service.name=llm-opt-service)
2023-06-27T08:26:51+0200 [INFO] [api_server:llm-opt-service:15] 123.123.123.123:60622 (scheme=http,method=POST,path=/v1/metadata,type=text/plain,length=4) (status=200,type=application/json,length=731) 5.742ms (trace=7214af978f9398a2d7b8a9feaebc215e,span=b9bf604323a0daa2,sampled=1,service.name=llm-opt-service)
2023-06-27T08:27:13+0200 [INFO] [api_server:llm-opt-service:15] 123.123.123.123:60651 (scheme=http,method=POST,path=/v1/metadata,type=text/plain,length=20) (status=200,type=application/json,length=731) 4.030ms (trace=e7360b8ce2bbd984fbe1a93dc0bc3b18,span=ac391d391a530d3b,sampled=1,service.name=llm-opt-service)
2023-06-27T08:27:59+0200 [INFO] [runner:llm-opt-runner:1] _ (scheme=http,method=GET,path=http://127.0.0.1:8000/readyz,type=,length=) (status=404,type=text/plain; charset=utf-8,length=9) 0.848ms (trace=3d67a35c7cae0ea522a4752ee78ad8c9,span=37de79502ec15faf,sampled=1,service.name=llm-opt-runner)
2023-06-27T08:27:59+0200 [INFO] [api_server:llm-opt-service:16] 127.0.0.1:58740 (scheme=http,method=GET,path=/readyz,type=,length=) (status=503,type=text/plain; charset=utf-8,length=22) 4.934ms (trace=3d67a35c7cae0ea522a4752ee78ad8c9,span=f05679dc029016fd,sampled=1,service.name=llm-opt-service)
2023-06-27T08:28:00+0200 [INFO] [runner:llm-opt-runner:1] _ (scheme=http,method=GET,path=http://127.0.0.1:8000/readyz,type=,length=) (status=404,type=text/plain; charset=utf-8,length=9) 0.706ms (trace=6feded7ab0a1ebb5850e8e07205137ce,span=545366f670aa6670,sampled=1,service.name=llm-opt-runner)
2023-06-27T08:28:00+0200 [INFO] [api_server:llm-opt-service:16] 127.0.0.1:58750 (scheme=http,method=GET,path=/readyz,type=,length=) (status=503,type=text/plain; charset=utf-8,length=22) 3.470ms (trace=6feded7ab0a1ebb5850e8e07205137ce,span=bfcf97403b38166f,sampled=1,service.name=llm-opt-service)
2023-06-27T08:28:01+0200 [INFO] [runner:llm-opt-runner:1] _ (scheme=http,method=GET,path=http://127.0.0.1:8000/readyz,type=,length=) (status=404,type=text/plain; charset=utf-8,length=9) 0.727ms (trace=56d62f117e9c36e4601564f3b4546e8b,span=80f35138b90a4d11,sampled=1,service.name=llm-opt-runner)
2023-06-27T08:28:01+0200 [INFO] [api_server:llm-opt-service:16] 127.0.0.1:58764 (scheme=http,method=GET,path=/readyz,type=,length=) (status=503,type=text/plain; charset=utf-8,length=22) 3.528ms (trace=56d62f117e9c36e4601564f3b4546e8b,span=2b0e2872be7a99f0,sampled=1,service.name=llm-opt-service)
2023-06-27T08:28:02+0200 [INFO] [runner:llm-opt-runner:1] _ (scheme=http,method=GET,path=http://127.0.0.1:8000/readyz,type=,length=) (status=404,type=text/plain; charset=utf-8,length=9) 0.692ms (trace=29a97e0ce492e2e54d0e47792360f091,span=2a83ddb3c9fec1f0,sampled=1,service.name=llm-opt-runner)
2023-06-27T08:28:02+0200 [INFO] [api_server:llm-opt-service:16] 127.0.0.1:58770 (scheme=http,method=GET,path=/readyz,type=,length=) (status=503,type=text/plain; charset=utf-8,length=22) 3.434ms (trace=29a97e0ce492e2e54d0e47792360f091,span=6b1596a336c1d33f,sampled=1,service.name=llm-opt-service)
2023-06-27T08:28:03+0200 [INFO] [runner:llm-opt-runner:1] _ (scheme=http,method=GET,path=http://127.0.0.1:8000/readyz,type=,length=) (status=404,type=text/plain; charset=utf-8,length=9) 0.698ms (trace=c63dee9e650759297b537c42d531bf9f,span=148454642c31920d,sampled=1,service.name=llm-opt-runner)
2023-06-27T08:28:03+0200 [INFO] [api_server:llm-opt-service:16] 127.0.0.1:58784 (scheme=http,method=GET,path=/readyz,type=,length=) (status=503,type=text/plain; charset=utf-8,length=22) 3.481ms (trace=c63dee9e650759297b537c42d531bf9f,span=3685ee5030f131ba,sampled=1,service.name=llm-opt-service)
2023-06-27T08:28:04+0200 [INFO] [runner:llm-opt-runner:1] _ (scheme=http,method=GET,path=http://127.0.0.1:8000/readyz,type=,length=) (status=404,type=text/plain; charset=utf-8,length=9) 0.703ms (trace=3d4f393d8e9d87c9489b80a2602c3d72,span=76da5185f188a13f,sampled=1,service.name=llm-opt-runner)
2023-06-27T08:28:04+0200 [INFO] [api_server:llm-opt-service:16] 127.0.0.1:58786 (scheme=http,method=GET,path=/readyz,type=,length=) (status=503,type=text/plain; charset=utf-8,length=22) 3.489ms (trace=3d4f393d8e9d87c9489b80a2602c3d72,span=1f55db959f3f0e8c,sampled=1,service.name=llm-opt-service)
2023-06-27T08:28:05+0200 [INFO] [runner:llm-opt-runner:1] _ (scheme=http,method=GET,path=http://127.0.0.1:8000/readyz,type=,length=) (status=404,type=text/plain; charset=utf-8,length=9) 0.688ms (trace=a79e3934f5765460d9453d5d22c70244,span=484c8de3837265e3,sampled=1,service.name=llm-opt-runner)
2023-06-27T08:28:05+0200 [INFO] [api_server:llm-opt-service:16] 127.0.0.1:58790 (scheme=http,method=GET,path=/readyz,type=,length=) (status=503,type=text/plain; charset=utf-8,length=22) 3.438ms (trace=a79e3934f5765460d9453d5d22c70244,span=6708a4fbbd796615,sampled=1,service.name=llm-opt-service)
2023-06-27T08:28:06+0200 [INFO] [runner:llm-opt-runner:1] _ (scheme=http,method=GET,path=http://127.0.0.1:8000/readyz,type=,length=) (status=404,type=text/plain; charset=utf-8,length=9) 0.693ms (trace=6ab794b317bb28b325ce65004f5adf95,span=bf26e25355405222,sampled=1,service.name=llm-opt-runner)
2023-06-27T08:28:06+0200 [INFO] [api_server:llm-opt-service:16] 127.0.0.1:58792 (scheme=http,method=GET,path=/readyz,type=,length=) (status=503,type=text/plain; charset=utf-8,length=22) 3.643ms (trace=6ab794b317bb28b325ce65004f5adf95,span=76c919c82c4d3ded,sampled=1,service.name=llm-opt-service)
2023-06-27T08:28:07+0200 [INFO] [runner:llm-opt-runner:1] _ (scheme=http,method=GET,path=http://127.0.0.1:8000/readyz,type=,length=) (status=404,type=text/plain; charset=utf-8,length=9) 0.728ms (trace=1f0affee83d5baa143853575680021b8,span=403ddab11fe64cdc,sampled=1,service.name=llm-opt-runner)
2023-06-27T08:28:07+0200 [INFO] [api_server:llm-opt-service:16] 127.0.0.1:58808 (scheme=http,method=GET,path=/readyz,type=,length=) (status=503,type=text/plain; charset=utf-8,length=22) 4.121ms (trace=1f0affee83d5baa143853575680021b8,span=b4981d0ffb1e79d5,sampled=1,service.name=llm-opt-service)
2023-06-27T08:28:08+0200 [INFO] [runner:llm-opt-runner:1] _ (scheme=http,method=GET,path=http://127.0.0.1:8000/readyz,type=,length=) (status=404,type=text/plain; charset=utf-8,length=9) 0.684ms (trace=79bee6a0c93b14dc8a0d52298af710a8,span=a2a6853a038532b1,sampled=1,service.name=llm-opt-runner)
2023-06-27T08:28:08+0200 [INFO] [api_server:llm-opt-service:16] 127.0.0.1:58820 (scheme=http,method=GET,path=/readyz,type=,length=) (status=503,type=text/plain; charset=utf-8,length=22) 3.425ms (trace=79bee6a0c93b14dc8a0d52298af710a8,span=63935df1dd7a4b6f,sampled=1,service.name=llm-opt-service)
2023-06-27T08:28:09+0200 [INFO] [runner:llm-opt-runner:1] _ (scheme=http,method=GET,path=http://127.0.0.1:8000/readyz,type=,length=) (status=404,type=text/plain; charset=utf-8,length=9) 0.713ms (trace=3a17ccfc409db03a76f4c1e99d2f7abd,span=bf36173e51d2114f,sampled=1,service.name=llm-opt-runner)
2023-06-27T08:28:09+0200 [INFO] [api_server:llm-opt-service:16] 127.0.0.1:42736 (scheme=http,method=GET,path=/readyz,type=,length=) (status=503,type=text/plain; charset=utf-8,length=22) 3.536ms (trace=3a17ccfc409db03a76f4c1e99d2f7abd,span=54f8a583b1e7f02f,sampled=1,service.name=llm-opt-service)
2023-06-27T08:28:10+0200 [INFO] [runner:llm-opt-runner:1] _ (scheme=http,method=GET,path=http://127.0.0.1:8000/readyz,type=,length=) (status=404,type=text/plain; charset=utf-8,length=9) 0.711ms (trace=08a437207ba29e28532062be8544627c,span=f66c358bbd8aa2eb,sampled=1,service.name=llm-opt-runner)


```

Environment

Environment variable

BENTOML_DEBUG=''
BENTOML_QUIET=''
BENTOML_BUNDLE_LOCAL_BUILD=''
BENTOML_DO_NOT_TRACK=''
BENTOML_CONFIG=''
BENTOML_CONFIG_OPTIONS=''
BENTOML_PORT=''
BENTOML_HOST=''
BENTOML_API_WORKERS=''

System information

bentoml: 1.0.22
python: 3.9.2
platform: Linux-4.18.0-305.76.1.el8_4.x86_64-x86_64-with-glibc2.28
uid_gid: 8007:8008

pip_packages
accelerate==0.20.3
aiohttp==3.8.4
aiosignal==1.3.1
anyio==3.7.0
appdirs==1.4.4
asgiref==3.7.2
async-timeout==4.0.2
attrs==23.1.0
bentoml==1.0.22
build==0.10.0
cattrs==23.1.2
certifi==2023.5.7
charset-normalizer==3.1.0
circus==0.18.0
click==8.1.3
click-option-group==0.5.6
cloudpickle==2.2.1
cmake==3.26.4
coloredlogs==15.0.1
contextlib2==21.6.0
datasets==2.13.1
deepmerge==1.1.0
Deprecated==1.2.14
dill==0.3.6
exceptiongroup==1.1.1
filelock==3.12.2
filetype==1.2.0
frozenlist==1.3.3
fs==2.4.16
fsspec==2023.6.0
grpcio==1.56.0
grpcio-health-checking==1.48.2
h11==0.14.0
httpcore==0.17.2
httpx==0.24.1
huggingface-hub==0.15.1
humanfriendly==10.0
idna==3.4
importlib-metadata==6.0.1
inflection==0.5.1
Jinja2==3.1.2
lit==16.0.6
markdown-it-py==3.0.0
MarkupSafe==2.1.3
mdurl==0.1.2
mpmath==1.3.0
multidict==6.0.4
multiprocess==0.70.14
networkx==3.1
numpy==1.25.0
nvidia-cublas-cu11==11.10.3.66
nvidia-cuda-cupti-cu11==11.7.101
nvidia-cuda-nvrtc-cu11==11.7.99
nvidia-cuda-runtime-cu11==11.7.99
nvidia-cudnn-cu11==8.5.0.96
nvidia-cufft-cu11==10.9.0.58
nvidia-curand-cu11==10.2.10.91
nvidia-cusolver-cu11==11.4.0.1
nvidia-cusparse-cu11==11.7.4.91
nvidia-nccl-cu11==2.14.3
nvidia-nvtx-cu11==11.7.91
openllm==0.1.14
opentelemetry-api==1.17.0
opentelemetry-instrumentation==0.38b0
opentelemetry-instrumentation-aiohttp-client==0.38b0
opentelemetry-instrumentation-asgi==0.38b0
opentelemetry-instrumentation-grpc==0.38b0
opentelemetry-sdk==1.17.0
opentelemetry-semantic-conventions==0.38b0
opentelemetry-util-http==0.38b0
optimum==1.8.8
orjson==3.9.1
packaging==23.1
pandas==2.0.2
pathspec==0.11.1
Pillow==9.5.0
pip-requirements-parser==32.0.1
pip-tools==6.13.0
prometheus-client==0.17.0
protobuf==3.20.3
psutil==5.9.5
pyarrow==12.0.1
pydantic==1.10.9
Pygments==2.15.1
pynvml==11.5.0
pyparsing==3.1.0
pyproject_hooks==1.0.0
python-dateutil==2.8.2
python-json-logger==2.0.7
python-multipart==0.0.6
pytz==2023.3
PyYAML==6.0
pyzmq==25.1.0
regex==2023.6.3
requests==2.31.0
rich==13.4.2
safetensors==0.3.1
schema==0.7.5
sentencepiece==0.1.99
simple-di==0.1.5
six==1.16.0
sniffio==1.3.0
starlette==0.28.0
sympy==1.12
tabulate==0.9.0
tokenizers==0.13.3
tomli==2.0.1
torch==2.0.1
torchvision==0.15.2
tornado==6.3.2
tqdm==4.65.0
transformers==4.30.2
triton==2.0.0
typing_extensions==4.6.3
tzdata==2023.3
urllib3==2.0.3
uvicorn==0.22.0
watchfiles==0.19.0
wcwidth==0.2.6
wrapt==1.15.0
xxhash==3.2.0
yarl==1.9.2
zipp==3.15.0
  • transformers version: 4.30.2
  • Platform: Linux-4.18.0-305.76.1.el8_4.x86_64-x86_64-with-glibc2.28
  • Python version: 3.9.2
  • Huggingface_hub version: 0.15.1
  • Safetensors version: 0.3.1
  • PyTorch version (GPU?): 2.0.1+cu117 (False)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: No
  • Using distributed or parallel set-up in script?: No

System information (Optional)

memory: 240 GB
CPU: 16 vCPU
Platform: VMWare ESXi 7.0 U 3
OS: RHEL 8.4

bug: `[ERROR] [runner:llm-falcon-runner:2] Exception in ASGI application`

Describe the bug

[ERROR] [runner:llm-falcon-runner:2]: an exception occurs when trying to perform inference on the server.

To reproduce

  1. openllm start falcon --model-id tiiuae/falcon-7b-instruct
  2. export OPENLLM_ENDPOINT=http://localhost:3000 ; openllm query 'Explain to me the difference between "further" and "farther"'

Logs

> openllm start falcon --model-id tiiuae/falcon-7b-instruct

2023-07-07T16:43:35+0000 [INFO] [cli] Prometheus metrics for HTTP BentoServer from "_service.py:svc" can be accessed at http://localhost:3000/metrics.
2023-07-07T16:43:36+0000 [ERROR] [cli] Exception in callback <bound method Arbiter.manage_watchers of <circus.arbiter.Arbiter object at 0x7fb07859ba60>>
Traceback (most recent call last):
  File "/miniconda3/envs/openllm/lib/python3.10/site-packages/tornado/ioloop.py", line 919, in _run
    val = self.callback()
  File "/miniconda3/envs/openllm/lib/python3.10/site-packages/circus/util.py", line 1038, in wrapper
    raise ConflictError("arbiter is already running %s command"
circus.exc.ConflictError: arbiter is already running arbiter_start_watchers command
2023-07-07T16:43:37+0000 [INFO] [cli] Starting production HTTP BentoServer from "_service.py:svc" listening on http://0.0.0.0:3000 (Press CTRL+C to quit)
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:05<00:00,  2.51s/it]
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:06<00:00,  3.35s/it]

2023-07-07T16:44:05+0000 [INFO] [runner:llm-falcon-runner:1] _ (scheme=http,method=GET,path=/readyz,type=,length=) (status=200,type=text/plain; charset=utf-8,length=1) 0.580ms (trace=2605c8e30c35548bc4d90c3c957ac0d3,span=654aaabd34034c93,sampled=1,service.name=llm-falcon-runner)
2023-07-07T16:44:05+0000 [INFO] [api_server:llm-falcon-service:60] 127.0.0.1:42138 (scheme=http,method=GET,path=/readyz,type=,length=) (status=200,type=text/plain; charset=utf-8,length=1) 74.788ms (trace=2605c8e30c35548bc4d90c3c957ac0d3,span=4cfae1e62f99c4d2,sampled=1,service.name=llm-falcon-service)
2023-07-07T16:44:05+0000 [INFO] [runner:llm-falcon-runner:1] _ (scheme=http,method=GET,path=/readyz,type=,length=) (status=200,type=text/plain; charset=utf-8,length=1) 0.383ms (trace=c65bad1b745313e5efd6d7d466f0be9b,span=74f1ac14fed1f1c0,sampled=1,service.name=llm-falcon-runner)
2023-07-07T16:44:05+0000 [INFO] [api_server:llm-falcon-service:59] 127.0.0.1:42152 (scheme=http,method=GET,path=/readyz,type=,length=) (status=200,type=text/plain; charset=utf-8,length=1) 55.466ms (trace=c65bad1b745313e5efd6d7d466f0be9b,span=f8836bb604802243,sampled=1,service.name=llm-falcon-service)
2023-07-07T16:44:05+0000 [INFO] [api_server:llm-falcon-service:60] 127.0.0.1:42156 (scheme=http,method=GET,path=/docs.json,type=,length=) (status=200,type=application/json,length=8153) 9.468ms (trace=a4907d6361e4099a90d2d9b999854e9e,span=a6c9fb5e65baae5a,sampled=1,service.name=llm-falcon-service)
2023-07-07T16:44:05+0000 [INFO] [api_server:llm-falcon-service:60] 127.0.0.1:42168 (scheme=http,method=POST,path=/v1/metadata,type=text/plain; charset=utf-8,length=0) (status=200,type=application/json,length=714) 6.397ms (trace=ec1e23767fccad390d11ecfa97db8544,span=3dd320cbcef0156f,sampled=1,service.name=llm-falcon-service)
2023-07-07T16:44:05+0000 [INFO] [api_server:llm-falcon-service:62] 127.0.0.1:42170 (scheme=http,method=POST,path=/v1/metadata,type=text/plain; charset=utf-8,length=0) (status=200,type=application/json,length=714) 6.538ms (trace=b65bb86c166d27a93c6e9af46b991b44,span=f37146eaf7178e69,sampled=1,service.name=llm-falcon-service)
2023-07-07T16:44:07+0000 [ERROR] [runner:llm-falcon-runner:2] Exception in ASGI application
Traceback (most recent call last):
  File "/miniconda3/envs/openllm/lib/python3.10/site-packages/uvicorn/protocols/http/h11_impl.py", line 428, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
  File "/miniconda3/envs/openllm/lib/python3.10/site-packages/uvicorn/middleware/proxy_headers.py", line 78, in __call__
    return await self.app(scope, receive, send)
  File "/miniconda3/envs/openllm/lib/python3.10/site-packages/uvicorn/middleware/message_logger.py", line 86, in __call__
    raise exc from None
  File "/miniconda3/envs/openllm/lib/python3.10/site-packages/uvicorn/middleware/message_logger.py", line 82, in __call__
    await self.app(scope, inner_receive, inner_send)
  File "/miniconda3/envs/openllm/lib/python3.10/site-packages/starlette/applications.py", line 122, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/miniconda3/envs/openllm/lib/python3.10/site-packages/starlette/middleware/errors.py", line 184, in __call__
    raise exc
  File "/miniconda3/envs/openllm/lib/python3.10/site-packages/starlette/middleware/errors.py", line 162, in __call__
    await self.app(scope, receive, _send)
  File "/miniconda3/envs/openllm/lib/python3.10/site-packages/bentoml/_internal/server/http/traffic.py", line 26, in __call__
    await self.app(scope, receive, send)
  File "/miniconda3/envs/openllm/lib/python3.10/site-packages/opentelemetry/instrumentation/asgi/__init__.py", line 579, in __call__
    await self.app(scope, otel_receive, otel_send)
  File "/miniconda3/envs/openllm/lib/python3.10/site-packages/bentoml/_internal/server/http/instruments.py", line 252, in __call__
    await self.app(scope, receive, wrapped_send)
  File "/miniconda3/envs/openllm/lib/python3.10/site-packages/bentoml/_internal/server/http/access.py", line 126, in __call__
    await self.app(scope, receive, wrapped_send)
  File "/miniconda3/envs/openllm/lib/python3.10/site-packages/starlette/middleware/exceptions.py", line 62, in __call__
    await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
  File "/miniconda3/envs/openllm/lib/python3.10/site-packages/starlette/_exception_handler.py", line 57, in wrapped_app
    raise exc
  File "/miniconda3/envs/openllm/lib/python3.10/site-packages/starlette/_exception_handler.py", line 46, in wrapped_app
    await app(scope, receive, sender)
  File "/miniconda3/envs/openllm/lib/python3.10/site-packages/starlette/routing.py", line 727, in __call__
    await route.handle(scope, receive, send)
  File "/miniconda3/envs/openllm/lib/python3.10/site-packages/starlette/routing.py", line 285, in handle
    await self.app(scope, receive, send)
  File "/miniconda3/envs/openllm/lib/python3.10/site-packages/starlette/routing.py", line 74, in app
    await wrap_app_handling_exceptions(app, request)(scope, receive, send)
  File "/miniconda3/envs/openllm/lib/python3.10/site-packages/starlette/_exception_handler.py", line 57, in wrapped_app
    raise exc
  File "/miniconda3/envs/openllm/lib/python3.10/site-packages/starlette/_exception_handler.py", line 46, in wrapped_app
    await app(scope, receive, sender)
  File "/miniconda3/envs/openllm/lib/python3.10/site-packages/starlette/routing.py", line 69, in app
    response = await func(request)
  File "/miniconda3/envs/openllm/lib/python3.10/site-packages/bentoml/_internal/server/runner_app.py", line 272, in _request_handler
    payload = await infer(params)
  File "/miniconda3/envs/openllm/lib/python3.10/site-packages/bentoml/_internal/marshal/dispatcher.py", line 182, in _func
    raise r
  File "/miniconda3/envs/openllm/lib/python3.10/site-packages/bentoml/_internal/marshal/dispatcher.py", line 383, in outbound_call
    outputs = await self.callback(
  File "/miniconda3/envs/openllm/lib/python3.10/site-packages/bentoml/_internal/server/runner_app.py", line 251, in infer_single
    ret = await runner_method.async_run(*params.args, **params.kwargs)
  File "/miniconda3/envs/openllm/lib/python3.10/site-packages/bentoml/_internal/runner/runner.py", line 55, in async_run
    return await self.runner._runner_handle.async_run_method(self, *args, **kwargs)
  File "/miniconda3/envs/openllm/lib/python3.10/site-packages/bentoml/_internal/runner/runner_handle/local.py", line 59, in async_run_method
    return await anyio.to_thread.run_sync(
  File "/miniconda3/envs/openllm/lib/python3.10/site-packages/anyio/to_thread.py", line 33, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(
  File "/miniconda3/envs/openllm/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 877, in run_sync_in_worker_thread
    return await future
  File "/miniconda3/envs/openllm/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 807, in run
    result = context.run(func, *args)
  File "/miniconda3/envs/openllm/lib/python3.10/site-packages/bentoml/_internal/runner/runnable.py", line 140, in method
    return self.func(obj, *args, **kwargs)
  File "/miniconda3/envs/openllm/lib/python3.10/site-packages/openllm/_llm.py", line 1207, in generate
    return self.generate(prompt, **attrs)
  File "/miniconda3/envs/openllm/lib/python3.10/site-packages/openllm/models/falcon/modeling_falcon.py", line 102, in generate
    eos_token_id = attrs.pop("eos_token_id", self.tokenizer.eos_token_id)
  File "/miniconda3/envs/openllm/lib/python3.10/site-packages/openllm/_llm.py", line 941, in tokenizer
    self.__llm_tokenizer__ = t.cast(_T, openllm.serialisation.load_tokenizer(self))
  File "/miniconda3/envs/openllm/lib/python3.10/site-packages/openllm/serialisation/__init__.py", line 94, in load_tokenizer
    return openllm.transformers.load_tokenizer(llm)
  File "/miniconda3/envs/openllm/lib/python3.10/site-packages/openllm/serialisation/transformers.py", line 239, in load_tokenizer
    tokenizer = llm.load_tokenizer(llm.tag, **tokenizer_attrs)
  File "/miniconda3/envs/openllm/lib/python3.10/site-packages/openllm/models/falcon/modeling_falcon.py", line 56, in load_tokenizer
    return transformers.AutoTokenizer.from_pretrained(
  File "/miniconda3/envs/openllm/lib/python3.10/site-packages/transformers/models/auto/tokenization_auto.py", line 719, in from_pretrained
    raise ValueError(
ValueError: Unrecognized configuration class <class 'transformers_modules.tiiuae.falcon-7b-instruct.c7f670a03d987254220f343c6b026ea0c5147185.configuration_RW.RWConfig'> to build an AutoTokenizer.
Model type should be one of AlbertConfig, AlignConfig, BartConfig, BertConfig, BertGenerationConfig, BigBirdConfig, BigBirdPegasusConfig, BioGptConfig, BlenderbotConfig, BlenderbotSmallConfig, BlipConfig, Blip2Config, BloomConfig, BridgeTowerConfig, CamembertConfig, CanineConfig, ChineseCLIPConfig, ClapConfig, CLIPConfig, CLIPSegConfig, CodeGenConfig, ConvBertConfig, CpmAntConfig, CTRLConfig, Data2VecTextConfig, DebertaConfig, DebertaV2Config, DistilBertConfig, DPRConfig, ElectraConfig, ErnieConfig, ErnieMConfig, EsmConfig, FlaubertConfig, FNetConfig, FSMTConfig, FunnelConfig, GitConfig, GPT2Config, GPT2Config, GPTBigCodeConfig, GPTNeoConfig, GPTNeoXConfig, GPTNeoXJapaneseConfig, GPTJConfig, GPTSanJapaneseConfig, GroupViTConfig, HubertConfig, IBertConfig, JukeboxConfig, LayoutLMConfig, LayoutLMv2Config, LayoutLMv3Config, LEDConfig, LiltConfig, LlamaConfig, LongformerConfig, LongT5Config, LukeConfig, LxmertConfig, M2M100Config, MarianConfig, MBartConfig, MegaConfig, MegatronBertConfig, MgpstrConfig, MobileBertConfig, MPNetConfig, MT5Config, MvpConfig, NezhaConfig, NllbMoeConfig, NystromformerConfig, OneFormerConfig, OpenAIGPTConfig, OPTConfig, OwlViTConfig, PegasusConfig, PegasusXConfig, PerceiverConfig, Pix2StructConfig, PLBartConfig, ProphetNetConfig, QDQBertConfig, RagConfig, RealmConfig, ReformerConfig, RemBertConfig, RetriBertConfig, RobertaConfig, RobertaPreLayerNormConfig, RoCBertConfig, RoFormerConfig, RwkvConfig, Speech2TextConfig, Speech2Text2Config, SpeechT5Config, SplinterConfig, SqueezeBertConfig, SwitchTransformersConfig, T5Config, TapasConfig, TransfoXLConfig, ViltConfig, VisualBertConfig, Wav2Vec2Config, Wav2Vec2ConformerConfig, WhisperConfig, XCLIPConfig, XGLMConfig, XLMConfig, XLMProphetNetConfig, XLMRobertaConfig, XLMRobertaXLConfig, XLNetConfig, XmodConfig, YosoConfig.
2023-07-07T16:44:07+0000 [ERROR] [api_server:llm-falcon-service:63] Exception on /v1/generate [POST] (trace=328421a74eec6856b8575ff33d6ded1f,span=f878adce871577e0,sampled=1,service.name=llm-falcon-service)
Traceback (most recent call last):
  File "/miniconda3/envs/openllm/lib/python3.10/site-packages/bentoml/_internal/server/http_app.py", line 341, in api_func
    output = await api.func(*args)
  File "/miniconda3/envs/openllm/lib/python3.10/site-packages/openllm/_service.py", line 90, in generate_v1
    responses = await runner.generate.async_run(qa_inputs.prompt, **config)
  File "/miniconda3/envs/openllm/lib/python3.10/site-packages/bentoml/_internal/runner/runner.py", line 55, in async_run
    return await self.runner._runner_handle.async_run_method(self, *args, **kwargs)
  File "/miniconda3/envs/openllm/lib/python3.10/site-packages/bentoml/_internal/runner/runner_handle/remote.py", line 242, in async_run_method
    raise RemoteException(
bentoml.exceptions.RemoteException: An unexpected exception occurred in remote runner llm-falcon-runner: [500] Internal Server Error
2023-07-07T16:44:07+0000 [INFO] [api_server:llm-falcon-service:63] 127.0.0.1:42186 (scheme=http,method=POST,path=/v1/generate,type=application/json,length=7113) (status=500,type=application/json,length=2) 181.435ms (trace=328421a74eec6856b8575ff33d6ded1f,span=f878adce871577e0,sampled=1,service.name=llm-falcon-service)
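
The ValueError in the runner log above is the usual symptom of loading a checkpoint that ships its own configuration code (the Falcon RWConfig) through AutoTokenizer without trusting remote code. As a hedged point of reference, outside of OpenLLM the tokenizer for this checkpoint can typically be loaded directly as sketched below; whether a given OpenLLM version forwards this flag on its load path, or whether upgrading to a transformers release with built-in Falcon support is the better fix, is not something this snippet asserts.

```python
from transformers import AutoTokenizer

# tiiuae/falcon-7b-instruct (at the revision in the traceback) defines a custom
# RWConfig inside the model repository, so resolving the tokenizer from that
# config requires trusting the repo's remote code.
tokenizer = AutoTokenizer.from_pretrained(
    "tiiuae/falcon-7b-instruct",
    trust_remote_code=True,  # needed for repos that ship their own configuration classes
)

print(tokenizer("Explain the difference between 'further' and 'farther'."))
```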

Environment

bentoml env

Environment variable

BENTOML_DEBUG=''
BENTOML_QUIET=''
BENTOML_BUNDLE_LOCAL_BUILD=''
BENTOML_DO_NOT_TRACK=''
BENTOML_CONFIG=''
BENTOML_CONFIG_OPTIONS=''
BENTOML_PORT=''
BENTOML_HOST=''
BENTOML_API_WORKERS=''

System information

bentoml: 1.0.23
python: 3.10.12
platform: Linux-5.15.0-76-generic-x86_64-with-glibc2.35
uid_gid: 1004:1004
conda: 23.3.1
in_conda_env: True

conda_packages
name: openllm
channels:
  - defaults
dependencies:
  - _libgcc_mutex=0.1=main
  - _openmp_mutex=5.1=1_gnu
  - bzip2=1.0.8=h7b6447c_0
  - ca-certificates=2023.05.30=h06a4308_0
  - ld_impl_linux-64=2.38=h1181459_1
  - libffi=3.4.4=h6a678d5_0
  - libgcc-ng=11.2.0=h1234567_1
  - libgomp=11.2.0=h1234567_1
  - libstdcxx-ng=11.2.0=h1234567_1
  - libuuid=1.41.5=h5eee18b_0
  - ncurses=6.4=h6a678d5_0
  - openssl=3.0.9=h7f8727e_0
  - pip=23.1.2=py310h06a4308_0
  - python=3.10.12=h955ad1f_0
  - readline=8.2=h5eee18b_0
  - setuptools=67.8.0=py310h06a4308_0
  - sqlite=3.41.2=h5eee18b_0
  - tk=8.6.12=h1ccaba5_0
  - wheel=0.38.4=py310h06a4308_0
  - xz=5.4.2=h5eee18b_0
  - zlib=1.2.13=h5eee18b_0
  - pip:
      - accelerate==0.20.3
      - aiohttp==3.8.4
      - aiosignal==1.3.1
      - anyio==3.7.1
      - appdirs==1.4.4
      - asgiref==3.7.2
      - async-timeout==4.0.2
      - attrs==23.1.0
      - bentoml==1.0.23
      - build==0.10.0
      - cattrs==23.1.2
      - certifi==2023.5.7
      - charset-normalizer==3.1.0
      - circus==0.18.0
      - click==8.1.4
      - click-option-group==0.5.6
      - cloudpickle==2.2.1
      - cmake==3.26.4
      - coloredlogs==15.0.1
      - contextlib2==21.6.0
      - datasets==2.13.1
      - deepmerge==1.1.0
      - deprecated==1.2.14
      - dill==0.3.6
      - einops==0.6.1
      - exceptiongroup==1.1.2
      - filelock==3.12.2
      - filetype==1.2.0
      - frozenlist==1.3.3
      - fs==2.4.16
      - fsspec==2023.6.0
      - grpcio==1.56.0
      - grpcio-health-checking==1.48.2
      - h11==0.14.0
      - httpcore==0.17.3
      - httpx==0.24.1
      - huggingface-hub==0.16.4
      - humanfriendly==10.0
      - idna==3.4
      - importlib-metadata==6.0.1
      - inflection==0.5.1
      - jinja2==3.1.2
      - lit==16.0.6
      - markdown-it-py==3.0.0
      - markupsafe==2.1.3
      - mdurl==0.1.2
      - mpmath==1.3.0
      - multidict==6.0.4
      - multiprocess==0.70.14
      - mypy-extensions==1.0.0
      - networkx==3.1
      - numpy==1.25.0
      - nvidia-cublas-cu11==11.10.3.66
      - nvidia-cuda-cupti-cu11==11.7.101
      - nvidia-cuda-nvrtc-cu11==11.7.99
      - nvidia-cuda-runtime-cu11==11.7.99
      - nvidia-cudnn-cu11==8.5.0.96
      - nvidia-cufft-cu11==10.9.0.58
      - nvidia-curand-cu11==10.2.10.91
      - nvidia-cusolver-cu11==11.4.0.1
      - nvidia-cusparse-cu11==11.7.4.91
      - nvidia-nccl-cu11==2.14.3
      - nvidia-nvtx-cu11==11.7.91
      - openllm==0.1.20
      - opentelemetry-api==1.17.0
      - opentelemetry-instrumentation==0.38b0
      - opentelemetry-instrumentation-aiohttp-client==0.38b0
      - opentelemetry-instrumentation-asgi==0.38b0
      - opentelemetry-instrumentation-grpc==0.38b0
      - opentelemetry-sdk==1.17.0
      - opentelemetry-semantic-conventions==0.38b0
      - opentelemetry-util-http==0.38b0
      - optimum==1.9.0
      - orjson==3.9.2
      - packaging==23.1
      - pandas==2.0.3
      - pathspec==0.11.1
      - pillow==10.0.0
      - pip-requirements-parser==32.0.1
      - pip-tools==6.14.0
      - prometheus-client==0.17.0
      - protobuf==3.20.3
      - psutil==5.9.5
      - pyarrow==12.0.1
      - pydantic==1.10.11
      - pygments==2.15.1
      - pynvml==11.5.0
      - pyparsing==3.1.0
      - pyproject-hooks==1.0.0
      - pyre-extensions==0.0.29
      - python-json-logger==2.0.7
      - python-multipart==0.0.6
      - pytz==2023.3
      - pyyaml==6.0
      - pyzmq==25.1.0
      - regex==2023.6.3
      - requests==2.31.0
      - rich==13.4.2
      - safetensors==0.3.1
      - schema==0.7.5
      - sentencepiece==0.1.99
      - simple-di==0.1.5
      - six==1.16.0
      - sniffio==1.3.0
      - starlette==0.28.0
      - sympy==1.12
      - tabulate==0.9.0
      - tokenizers==0.13.3
      - tomli==2.0.1
      - torch==2.0.1
      - tornado==6.3.2
      - tqdm==4.65.0
      - transformers==4.30.2
      - triton==2.0.0
      - typing-extensions==4.7.1
      - typing-inspect==0.9.0
      - tzdata==2023.3
      - urllib3==2.0.3
      - uvicorn==0.22.0
      - watchfiles==0.19.0
      - wcwidth==0.2.6
      - wrapt==1.15.0
      - xformers==0.0.20
      - xxhash==3.2.0
      - yarl==1.9.2
      - zipp==3.15.0
prefix: /miniconda3/envs/openllm
pip_packages
accelerate==0.20.3
aiohttp==3.8.4
aiosignal==1.3.1
anyio==3.7.1
appdirs==1.4.4
asgiref==3.7.2
async-timeout==4.0.2
attrs==23.1.0
bentoml==1.0.23
build==0.10.0
cattrs==23.1.2
certifi==2023.5.7
charset-normalizer==3.1.0
circus==0.18.0
click==8.1.4
click-option-group==0.5.6
cloudpickle==2.2.1
cmake==3.26.4
coloredlogs==15.0.1
contextlib2==21.6.0
datasets==2.13.1
deepmerge==1.1.0
Deprecated==1.2.14
dill==0.3.6
docutils==0.16
einops==0.6.1
exceptiongroup==1.1.2
filelock==3.12.2
filetype==1.2.0
frozenlist==1.3.3
fs==2.4.16
fsspec==2023.6.0
grpcio==1.56.0
grpcio-health-checking==1.48.2
h11==0.14.0
httpcore==0.17.3
httpx==0.24.1
huggingface-hub==0.16.4
humanfriendly==10.0
idna==3.4
importlib-metadata==6.0.1
inflection==0.5.1
Jinja2==3.1.2
lit==16.0.6
markdown-it-py==3.0.0
MarkupSafe==2.1.3
mdurl==0.1.2
mpmath==1.3.0
multidict==6.0.4
multiprocess==0.70.14
mypy-extensions==1.0.0
networkx==3.1
numpy==1.25.0
nvidia-cublas-cu11==11.10.3.66
nvidia-cuda-cupti-cu11==11.7.101
nvidia-cuda-nvrtc-cu11==11.7.99
nvidia-cuda-runtime-cu11==11.7.99
nvidia-cudnn-cu11==8.5.0.96
nvidia-cufft-cu11==10.9.0.58
nvidia-curand-cu11==10.2.10.91
nvidia-cusolver-cu11==11.4.0.1
nvidia-cusparse-cu11==11.7.4.91
nvidia-nccl-cu11==2.14.3
nvidia-nvtx-cu11==11.7.91
openllm==0.1.20
opentelemetry-api==1.17.0
opentelemetry-instrumentation==0.38b0
opentelemetry-instrumentation-aiohttp-client==0.38b0
opentelemetry-instrumentation-asgi==0.38b0
opentelemetry-instrumentation-grpc==0.38b0
opentelemetry-sdk==1.17.0
opentelemetry-semantic-conventions==0.38b0
opentelemetry-util-http==0.38b0
optimum==1.9.0
orjson==3.9.2
packaging==23.1
pandas==2.0.3
pathspec==0.11.1
Pillow==10.0.0
pip-requirements-parser==32.0.1
pip-tools==6.14.0
prometheus-client==0.17.0
protobuf==3.20.3
psutil==5.9.5
pyarrow==12.0.1
pydantic==1.10.11
Pygments==2.15.1
pynvml==11.5.0
pyparsing==3.1.0
pyproject_hooks==1.0.0
pyre-extensions==0.0.29
python-dateutil==2.8.2
python-json-logger==2.0.7
python-multipart==0.0.6
pytz==2023.3
PyYAML==6.0
pyzmq==25.1.0
regex==2023.6.3
requests==2.31.0
rich==13.4.2
rsa==4.7.2
safetensors==0.3.1
schema==0.7.5
sentencepiece==0.1.99
simple-di==0.1.5
six==1.16.0
sniffio==1.3.0
starlette==0.28.0
sympy==1.12
tabulate==0.9.0
tokenizers==0.13.3
tomli==2.0.1
torch==2.0.1
tornado==6.3.2
tqdm==4.65.0
transformers==4.30.2
triton==2.0.0
typing-inspect==0.9.0
typing_extensions==4.7.1
tzdata==2023.3
urllib3==2.0.3
uvicorn==0.22.0
watchfiles==0.19.0
wcwidth==0.2.6
wrapt==1.15.0
xformers==0.0.20
xxhash==3.2.0
yarl==1.9.2
zipp==3.15.0

System information (Optional)

Ubuntu 22
2x NVIDIA A100 (Driver Version: 535.54.03, CUDA Version: 12.2)

bug: running openllm start opt for the first time fails

Describe the bug

When running openllm start opt for the first time, the process fails after downloading ..._config.json. I'm on a MacBook Pro M2. Here's the output:

(openllm) ➜  OpenLLM git:(main) openllm start opt
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /opt/homebrew/lib/python3.11/site-packages/openllm/cli.py:1248 in download_models                │
│                                                                                                  │
│   1245 │   ).for_model(model_name, model_id=model_id, llm_config=config)                         │
│   1246 │                                                                                         │
│   1247 │   try:                                                                                  │
│ ❱ 1248 │   │   _ref = bentoml.transformers.get(model.tag)                                        │
│   1249 │   │   if output == "pretty":                                                            │
│   1250 │   │   │   _echo(f"{model_name} is already setup for framework '{envvar}': {str(_ref.ta  │
│   1251 │   │   elif output == "json":                                                            │
│                                                                                                  │
│ /opt/homebrew/lib/python3.11/site-packages/bentoml/_internal/frameworks/transformers.py:292 in   │
│ get                                                                                              │
│                                                                                                  │
│   289 │      # target model must be from the BentoML model store                                 │
│   290 │      model = bentoml.transformers.get("my_pipeline:latest")                              │
│   291 │   """                                                                                    │
│ ❱ 292 │   model = bentoml.models.get(tag_like)                                                   │
│   293 │   if model.info.module not in (MODULE_NAME, __name__):                                   │
│   294 │   │   raise NotFound(                                                                    │
│   295 │   │   │   f"Model {model.tag} was saved with module {model.info.module}, not loading w   │
│                                                                                                  │
│ /opt/homebrew/lib/python3.11/site-packages/simple_di/__init__.py:139 in _                        │
│                                                                                                  │
│   136 │   │   bind = sig.bind_partial(*filtered_args, **filtered_kwargs)                         │
│   137 │   │   bind.apply_defaults()                                                              │
│   138 │   │                                                                                      │
│ ❱ 139 │   │   return func(*_inject_args(bind.args), **_inject_kwargs(bind.kwargs))               │
│   140 │                                                                                          │
│   141 │   setattr(_, "_is_injected", True)                                                       │
│   142 │   return cast(WrappedCallable, _)                                                        │
│                                                                                                  │
│ /opt/homebrew/lib/python3.11/site-packages/bentoml/models.py:42 in get                           │
│                                                                                                  │
│    39 │   *,                                                                                     │
│    40 │   _model_store: "ModelStore" = Provide[BentoMLContainer.model_store],                    │
│    41 ) -> "Model":                                                                              │
│ ❱  42 │   return _model_store.get(tag)                                                           │
│    43                                                                                            │
│    44                                                                                            │
│    45 @inject                                                                                    │
│                                                                                                  │
│ /opt/homebrew/lib/python3.11/site-packages/bentoml/_internal/store.py:146 in get                 │
│                                                                                                  │
│   143 │   │   matches = self._fs.glob(f"{path}*/")                                               │
│   144 │   │   counts = matches.count().directories                                               │
│   145 │   │   if counts == 0:                                                                    │
│ ❱ 146 │   │   │   raise NotFound(                                                                │
│   147 │   │   │   │   f"{self._item_type.get_typename()} '{tag}' is not found in BentoML store   │
│   148 │   │   │   )                                                                              │
│   149 │   │   elif counts == 1:                                                                  │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
NotFound: Model 'pt-facebook-opt-1-3b:8c7b10754972749675d22364c25c428b29face51' is not found in BentoML store <osfs
'/Users/matthewberman/bentoml/models'>

During handling of the above exception, another exception occurred:

╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /opt/homebrew/lib/python3.11/site-packages/transformers/utils/import_utils.py:1172 in            │
│ _get_module                                                                                      │
│                                                                                                  │
│   1169 │                                                                                         │
│   1170 │   def _get_module(self, module_name: str):                                              │
│   1171 │   │   try:                                                                              │
│ ❱ 1172 │   │   │   return importlib.import_module("." + module_name, self.__name__)              │
│   1173 │   │   except Exception as e:                                                            │
│   1174 │   │   │   raise RuntimeError(                                                           │
│   1175 │   │   │   │   f"Failed to import {self.__name__}.{module_name} because of the followin  │
│                                                                                                  │
│ /opt/homebrew/Cellar/[email protected]/3.11.3/Frameworks/Python.framework/Versions/3.11/lib/python3.11 │
│ /importlib/__init__.py:126 in import_module                                                      │
│                                                                                                  │
│   123 │   │   │   if character != '.':                                                           │
│   124 │   │   │   │   break                                                                      │
│   125 │   │   │   level += 1                                                                     │
│ ❱ 126 │   return _bootstrap._gcd_import(name[level:], package, level)                            │
│   127                                                                                            │
│   128                                                                                            │
│   129 _RELOADING = {}                                                                            │
│ in _gcd_import:1206                                                                              │
│ in _find_and_load:1178                                                                           │
│ in _find_and_load_unlocked:1149                                                                  │
│ in _load_unlocked:690                                                                            │
│ in exec_module:940                                                                               │
│ in _call_with_frames_removed:241                                                                 │
│                                                                                                  │
│ /opt/homebrew/lib/python3.11/site-packages/transformers/modeling_tf_utils.py:70 in <module>      │
│                                                                                                  │
│     67                                                                                           │
│     68 if parse(tf.__version__) >= parse("2.11.0"):                                              │
│     69 │   from keras import backend as K                                                        │
│ ❱   70 │   from keras.engine import data_adapter                                                 │
│     71 │   from keras.engine.keras_tensor import KerasTensor                                     │
│     72 │   from keras.saving.legacy import hdf5_format                                           │
│     73 else:                                                                                     │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
ModuleNotFoundError: No module named 'keras.engine'

The above exception was the direct cause of the following exception:

╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ in _run_module_as_main:198                                                                       │
│ in _run_code:88                                                                                  │
│                                                                                                  │
│ /opt/homebrew/lib/python3.11/site-packages/openllm/__main__.py:26 in <module>                    │
│                                                                                                  │
│   23 if __name__ == "__main__":                                                                  │
│   24 │   from openllm.cli import cli                                                             │
│   25 │                                                                                           │
│ ❱ 26 │   cli()                                                                                   │
│   27                                                                                             │
│                                                                                                  │
│ /opt/homebrew/lib/python3.11/site-packages/click/core.py:1130 in __call__                        │
│                                                                                                  │
│   1127 │                                                                                         │
│   1128 │   def __call__(self, *args: t.Any, **kwargs: t.Any) -> t.Any:                           │
│   1129 │   │   """Alias for :meth:`main`."""                                                     │
│ ❱ 1130 │   │   return self.main(*args, **kwargs)                                                 │
│   1131                                                                                           │
│   1132                                                                                           │
│   1133 class Command(BaseCommand):                                                               │
│                                                                                                  │
│ /opt/homebrew/lib/python3.11/site-packages/click/core.py:1055 in main                            │
│                                                                                                  │
│   1052 │   │   try:                                                                              │
│   1053 │   │   │   try:                                                                          │
│   1054 │   │   │   │   with self.make_context(prog_name, args, **extra) as ctx:                  │
│ ❱ 1055 │   │   │   │   │   rv = self.invoke(ctx)                                                 │
│   1056 │   │   │   │   │   if not standalone_mode:                                               │
│   1057 │   │   │   │   │   │   return rv                                                         │
│   1058 │   │   │   │   │   # it's not safe to `ctx.exit(rv)` here!                               │
│                                                                                                  │
│ /opt/homebrew/lib/python3.11/site-packages/click/core.py:1657 in invoke                          │
│                                                                                                  │
│   1654 │   │   │   │   super().invoke(ctx)                                                       │
│   1655 │   │   │   │   sub_ctx = cmd.make_context(cmd_name, args, parent=ctx)                    │
│   1656 │   │   │   │   with sub_ctx:                                                             │
│ ❱ 1657 │   │   │   │   │   return _process_result(sub_ctx.command.invoke(sub_ctx))               │
│   1658 │   │                                                                                     │
│   1659 │   │   # In chain mode we create the contexts step by step, but after the                │
│   1660 │   │   # base command has been invoked.  Because at that point we do not                 │
│                                                                                                  │
│ /opt/homebrew/lib/python3.11/site-packages/click/core.py:1404 in invoke                          │
│                                                                                                  │
│   1401 │   │   │   echo(style(message, fg="red"), err=True)                                      │
│   1402 │   │                                                                                     │
│   1403 │   │   if self.callback is not None:                                                     │
│ ❱ 1404 │   │   │   return ctx.invoke(self.callback, **ctx.params)                                │
│   1405 │                                                                                         │
│   1406 │   def shell_complete(self, ctx: Context, incomplete: str) -> t.List["CompletionItem"]:  │
│   1407 │   │   """Return a list of completions for the incomplete value. Looks                   │
│                                                                                                  │
│ /opt/homebrew/lib/python3.11/site-packages/click/core.py:760 in invoke                           │
│                                                                                                  │
│    757 │   │                                                                                     │
│    758 │   │   with augment_usage_errors(__self):                                                │
│    759 │   │   │   with ctx:                                                                     │
│ ❱  760 │   │   │   │   return __callback(*args, **kwargs)                                        │
│    761 │                                                                                         │
│    762 │   def forward(                                                                          │
│    763 │   │   __self, __cmd: "Command", *args: t.Any, **kwargs: t.Any  # noqa: B902             │
│                                                                                                  │
│ /opt/homebrew/lib/python3.11/site-packages/openllm/cli.py:342 in wrapper                         │
│                                                                                                  │
│    339 │   │   @functools.wraps(func)                                                            │
│    340 │   │   def wrapper(*args: P.args, **attrs: P.kwargs) -> t.Any:                           │
│    341 │   │   │   try:                                                                          │
│ ❱  342 │   │   │   │   return func(*args, **attrs)                                               │
│    343 │   │   │   except OpenLLMException as err:                                               │
│    344 │   │   │   │   raise click.ClickException(                                               │
│    345 │   │   │   │   │   click.style(f"[{group.name}] '{command_name}' failed: " + err.messag  │
│                                                                                                  │
│ /opt/homebrew/lib/python3.11/site-packages/openllm/cli.py:315 in wrapper                         │
│                                                                                                  │
│    312 │   │   │   │   assert group.name is not None, "group.name should not be None"            │
│    313 │   │   │   │   event = analytics.OpenllmCliEvent(cmd_group=group.name, cmd_name=command  │
│    314 │   │   │   │   try:                                                                      │
│ ❱  315 │   │   │   │   │   return_value = func(*args, **attrs)                                   │
│    316 │   │   │   │   │   duration_in_ms = (time.time_ns() - start_time) / 1e6                  │
│    317 │   │   │   │   │   event.duration_in_ms = duration_in_ms                                 │
│    318 │   │   │   │   │   analytics.track(event)                                                │
│                                                                                                  │
│ /opt/homebrew/lib/python3.11/site-packages/openllm/cli.py:290 in wrapper                         │
│                                                                                                  │
│    287 │   │   │                                                                                 │
│    288 │   │   │   configure_logging()                                                           │
│    289 │   │   │                                                                                 │
│ ❱  290 │   │   │   return f(*args, **attrs)                                                      │
│    291 │   │                                                                                     │
│    292 │   │   return t.cast("ClickFunctionWrapper[..., t.Any]", wrapper)                        │
│    293                                                                                           │
│                                                                                                  │
│ /opt/homebrew/lib/python3.11/site-packages/openllm/cli.py:1276 in download_models                │
│                                                                                                  │
│   1273 │   │   │   )                                                                             │
│   1274 │   │                                                                                     │
│   1275 │   │   (model_args, model_attrs), tokenizer_attrs = model.llm_parameters                 │
│ ❱ 1276 │   │   _ref = model.import_model(                                                        │
│   1277 │   │   │   model.model_id,                                                               │
│   1278 │   │   │   model.tag,                                                                    │
│   1279 │   │   │   *model_args,                                                                  │
│                                                                                                  │
│ /opt/homebrew/lib/python3.11/site-packages/openllm/models/opt/modeling_opt.py:74 in import_model │
│                                                                                                  │
│    71 │   │   model: transformers.OPTForCausalLM = transformers.AutoModelForCausalLM.from_pret   │
│    72 │   │   │   model_id, torch_dtype=torch_dtype, trust_remote_code=trust_remote_code, **at   │
│    73 │   │   )                                                                                  │
│ ❱  74 │   │   return bentoml.transformers.save_model(tag, model, custom_objects={"tokenizer":    │
│    75 │                                                                                          │
│    76 │   def load_model(self, tag: bentoml.Tag, *args: t.Any, **attrs: t.Any) -> transformers   │
│    77 │   │   torch_dtype = attrs.pop("torch_dtype", self.dtype)                                 │
│                                                                                                  │
│ /opt/homebrew/lib/python3.11/site-packages/bentoml/_internal/frameworks/transformers.py:829 in   │
│ save_model                                                                                       │
│                                                                                                  │
│   826 │   │   │   pretrained,                                                                    │
│   827 │   │   │   (                                                                              │
│   828 │   │   │   │   transformers.PreTrainedModel,                                              │
│ ❱ 829 │   │   │   │   transformers.TFPreTrainedModel,                                            │
│   830 │   │   │   │   transformers.FlaxPreTrainedModel,                                          │
│   831 │   │   │   ),                                                                             │
│   832 │   │   ):                                                                                 │
│                                                                                                  │
│ /opt/homebrew/lib/python3.11/site-packages/transformers/utils/import_utils.py:1162 in            │
│ __getattr__                                                                                      │
│                                                                                                  │
│   1159 │   │   if name in self._modules:                                                         │
│   1160 │   │   │   value = self._get_module(name)                                                │
│   1161 │   │   elif name in self._class_to_module.keys():                                        │
│ ❱ 1162 │   │   │   module = self._get_module(self._class_to_module[name])                        │
│   1163 │   │   │   value = getattr(module, name)                                                 │
│   1164 │   │   else:                                                                             │
│   1165 │   │   │   raise AttributeError(f"module {self.__name__} has no attribute {name}")       │
│                                                                                                  │
│ /opt/homebrew/lib/python3.11/site-packages/transformers/utils/import_utils.py:1174 in            │
│ _get_module                                                                                      │
│                                                                                                  │
│   1171 │   │   try:                                                                              │
│   1172 │   │   │   return importlib.import_module("." + module_name, self.__name__)              │
│   1173 │   │   except Exception as e:                                                            │
│ ❱ 1174 │   │   │   raise RuntimeError(                                                           │
│   1175 │   │   │   │   f"Failed to import {self.__name__}.{module_name} because of the followin  │
│   1176 │   │   │   │   f" traceback):\n{e}"                                                      │
│   1177 │   │   │   ) from e                                                                      │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
RuntimeError: Failed to import transformers.modeling_tf_utils because of the following error (look up to see its
traceback):
No module named 'keras.engine'
Traceback (most recent call last):
  File "/opt/homebrew/bin/openllm", line 8, in <module>
    sys.exit(cli())
             ^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
         ^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/openllm/cli.py", line 342, in wrapper
    return func(*args, **attrs)
           ^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/openllm/cli.py", line 315, in wrapper
    return_value = func(*args, **attrs)
                   ^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/openllm/cli.py", line 290, in wrapper
    return f(*args, **attrs)
           ^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/click/decorators.py", line 26, in new_func
    return f(get_current_context(), *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/openllm/cli.py", line 701, in model_start
    llm = t.cast(
          ^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/openllm/models/auto/factory.py", line 127, in for_model
    llm.ensure_model_id_exists()
  File "/opt/homebrew/lib/python3.11/site-packages/openllm/_llm.py", line 688, in ensure_model_id_exists
    output = subprocess.check_output(
             ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Cellar/[email protected]/3.11.3/Frameworks/Python.framework/Versions/3.11/lib/python3.11/subprocess.py", line 466, in check_output
    return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Cellar/[email protected]/3.11.3/Frameworks/Python.framework/Versions/3.11/lib/python3.11/subprocess.py", line 571, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['/opt/homebrew/opt/[email protected]/bin/python3.11', '-m', 'openllm', 'download', 'opt', '--model-id', 'facebook/opt-1.3b', '--output', 'porcelain']' returned non-zero exit status 1.

To reproduce

  1. Clone the repo
  2. Activate the conda environment
  3. Install OpenLLM (pip install openllm)
  4. Run openllm start opt

Logs

No response

Environment

MacBook Pro M2
Conda
Python 3.11.3
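One way to narrow this down is to check what transformers sees, since the failing code path (modeling_tf_utils) is only imported when TensorFlow is detected and then expects the legacy keras.engine module. A minimal diagnostic sketch; the assumption that an incompatible standalone Keras/TensorFlow install is the culprit is mine, not something confirmed in this report:

import importlib.util

import transformers

# transformers only walks the TF code path (modeling_tf_utils) when it detects TensorFlow.
print("TensorFlow visible to transformers:", transformers.utils.is_tf_available())

# modeling_tf_utils expects the legacy keras.engine package, which newer
# standalone Keras releases no longer ship.
try:
    has_keras_engine = importlib.util.find_spec("keras.engine") is not None
except ModuleNotFoundError:
    has_keras_engine = False
print("keras.engine importable:", has_keras_engine)

If TensorFlow/Keras turn out to be present but mismatched, removing them (or pinning a compatible pair) lets transformers fall back to the PyTorch-only path.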

feat: Apple M1/M2 support through MPS

Feature request

I want to use OpenLLM with the available models on Apple M1/M2 processors, with GPU acceleration through MPS.

Today:

openllm start falcon
No GPU available, therefore this command is disabled
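For context, this is roughly what MPS acceleration means at the PyTorch level; the command above is disabled because MPS is not currently counted as an available GPU. A minimal sketch in plain PyTorch, not an OpenLLM API:

import torch

# MPS is PyTorch's Metal backend for Apple Silicon GPUs.
device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")

# Tensors and nn.Modules can be placed on the MPS device like any other device.
x = torch.randn(2, 3, device=device)
print(x.device)  # prints "mps:0" on an M1/M2 machine with a recent PyTorch build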

Motivation

No response

Other

No response

Only Tensors of floating point and complex dtype can require gradients

openllm start chatglm

......
File "/opt/conda/lib/python3.10/site-packages/accelerate/big_modeling.py", line 108, in register_empty_parameter
module._parameters[name] = param_cls(module._parameters[name].to(device), **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/parameter.py", line 36, in new
return torch.Tensor._make_subclass(cls, data, requires_grad)
RuntimeError: Only Tensors of floating point and complex dtype can require gradients

2023-06-25T07:11:30+0000 [ERROR] [runner:llm-chatglm-runner:1] Application startup failed. Exiting.
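For reference, the error can be reproduced in isolation: torch.nn.Parameter defaults to requires_grad=True, which PyTorch rejects for integer tensors such as quantized weights. A minimal sketch, independent of OpenLLM and ChatGLM:

import torch

int_weight = torch.zeros(4, dtype=torch.int8)  # stand-in for a quantized weight

try:
    torch.nn.Parameter(int_weight)  # requires_grad defaults to True
except RuntimeError as err:
    print(err)  # Only Tensors of floating point and complex dtype can require gradients

# Passing requires_grad=False avoids the error.
param = torch.nn.Parameter(int_weight, requires_grad=False)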

tracking: Fine-tuning support

Feature request

This ticket tracks implementation support for each adapter type for the models under OpenLLM:

  • OPT: LoRA
  • Dolly-v2: WIP
  • Flan-T5: WIP
  • Falcon: LoRA, QLoRA
  • StarCoder: WIP
  • ChatGLM: WIP
  • StableLM: WIP

This is in progress with #52
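For reference, "LoRA" in the list above refers to PEFT-style adapter configuration along these lines. A minimal sketch using the peft library; the target module names are illustrative and depend on the model architecture:

from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("facebook/opt-1.3b")

lora = LoraConfig(
    r=8,                                  # rank of the low-rank update matrices
    lora_alpha=16,                        # scaling factor applied to the update
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt (model-dependent)
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora)
model.print_trainable_parameters()  # only the adapter weights are trainable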

feat(config): `dict(config)`

Feature request

openllm.LLMConfig should be serializable as a dict

config = openllm.AutoConfig.for_model('dolly-v2')

dict(config) # should be the same as config.model_dump(flatten=True)

This means that, as part of the LLMConfig class generation, the target class should be generated as a slotted class.

This requires some work, but it is not very high priority at the moment. Feel free to pick it up.
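A minimal sketch of the idea (a hypothetical stand-in class, not the actual openllm.LLMConfig implementation): dict() accepts any object that exposes keys() plus __getitem__, which an auto-generated slotted class can provide:

class DollyV2Config:
    """Hypothetical stand-in for a generated, slotted LLMConfig subclass."""

    __slots__ = ("temperature", "max_new_tokens")

    def __init__(self, temperature: float = 0.9, max_new_tokens: int = 256):
        self.temperature = temperature
        self.max_new_tokens = max_new_tokens

    def keys(self):
        return self.__slots__

    def __getitem__(self, key: str):
        return getattr(self, key)


config = DollyV2Config()
assert dict(config) == {"temperature": 0.9, "max_new_tokens": 256}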

Motivation

No response

Other

No response

bug: openllm start opt or openllm start dolly-v2 failed

Describe the bug

openllm start opt and openllm start dolly-v2 appear to start OK.
When I make a query, the output below comes out.

2023-06-21T16:45:40+0800 [INFO] [runner:llm-dolly-v2-runner:1] _ (scheme=http,method=GET,path=http://127.0.0.1:8000/readyz,type=,length=) (status=404,type=text/plain; charset=utf-8,length=9) 0.574ms (trace=100fa96d33433a772259d444a0006ca9,span=4caf6df55eb0c67e,sampled=1,service.name=llm-dolly-v2-runner)
2023-06-21T16:45:40+0800 [INFO] [api_server:llm-dolly-v2-service:9] 127.0.0.1:63613 (scheme=http,method=GET,path=/readyz,type=,length=) (status=503,type=text/plain; charset=utf-8,length=22) 5.220ms (trace=100fa96d33433a772259d444a0006ca9,span=7ec016176efc036d,sampled=1,service.name=llm-dolly-v2-service)

To reproduce

No response

Logs

2023-06-21T16:45:40+0800 [INFO] [runner:llm-dolly-v2-runner:1] _ (scheme=http,method=GET,path=http://127.0.0.1:8000/readyz,type=,length=) (status=404,type=text/plain; charset=utf-8,length=9) 0.574ms (trace=100fa96d33433a772259d444a0006ca9,span=4caf6df55eb0c67e,sampled=1,service.name=llm-dolly-v2-runner)
2023-06-21T16:45:40+0800 [INFO] [api_server:llm-dolly-v2-service:9] 127.0.0.1:63613 (scheme=http,method=GET,path=/readyz,type=,length=) (status=503,type=text/plain; charset=utf-8,length=22) 5.220ms (trace=100fa96d33433a772259d444a0006ca9,span=7ec016176efc036d,sampled=1,service.name=llm-dolly-v2-service)
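For reference, a 503 from /readyz typically just means the service has not finished loading the model yet; readiness can be polled until it flips to 200. A minimal sketch; port 3000 is the default OpenLLM server port and may differ in your setup:

import time

import requests

# Poll the readiness endpoint until the server reports ready.
while True:
    try:
        resp = requests.get("http://127.0.0.1:3000/readyz", timeout=5)
        if resp.status_code == 200:
            print("server is ready")
            break
        print(f"not ready yet (status {resp.status_code})")
    except requests.ConnectionError:
        print("server not reachable yet")
    time.sleep(5)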

Environment

Environment variable

BENTOML_DEBUG=''
BENTOML_QUIET=''
BENTOML_BUNDLE_LOCAL_BUILD=''
BENTOML_DO_NOT_TRACK=''
BENTOML_CONFIG=''
BENTOML_CONFIG_OPTIONS=''
BENTOML_PORT=''
BENTOML_HOST=''
BENTOML_API_WORKERS=''

System information

bentoml: 1.0.22
python: 3.8.16
platform: macOS-13.4-arm64-arm-64bit
uid_gid: 501:20
conda: 23.3.1
in_conda_env: True

conda_packages
name: openllm
channels:
  - defaults
dependencies:
  - ca-certificates=2023.05.30=hca03da5_0
  - libcxx=14.0.6=h848a8c0_0
  - libffi=3.4.4=hca03da5_0
  - ncurses=6.4=h313beb8_0
  - openssl=3.0.8=h1a28f6b_0
  - pip=23.1.2=py38hca03da5_0
  - python=3.8.16=hb885b13_4
  - readline=8.2=h1a28f6b_0
  - setuptools=67.8.0=py38hca03da5_0
  - sqlite=3.41.2=h80987f9_0
  - tk=8.6.12=hb8d0fd4_0
  - wheel=0.38.4=py38hca03da5_0
  - xz=5.4.2=h80987f9_0
  - zlib=1.2.13=h5a0b063_0
  - pip:
      - accelerate==0.20.3
      - aiohttp==3.8.4
      - aiosignal==1.3.1
      - anyio==3.7.0
      - appdirs==1.4.4
      - asgiref==3.7.2
      - async-timeout==4.0.2
      - attrs==23.1.0
      - bentoml==1.0.22
      - build==0.10.0
      - cattrs==23.1.2
      - certifi==2023.5.7
      - charset-normalizer==3.1.0
      - circus==0.18.0
      - click==8.1.3
      - click-option-group==0.5.6
      - cloudpickle==2.2.1
      - coloredlogs==15.0.1
      - contextlib2==21.6.0
      - cpm-kernels==1.0.11
      - datasets==2.13.0
      - deepmerge==1.1.0
      - deprecated==1.2.14
      - dill==0.3.6
      - exceptiongroup==1.1.1
      - filelock==3.12.2
      - filetype==1.2.0
      - frozenlist==1.3.3
      - fs==2.4.16
      - fsspec==2023.6.0
      - grpcio==1.54.2
      - grpcio-health-checking==1.48.2
      - h11==0.14.0
      - httpcore==0.17.2
      - httpx==0.24.1
      - huggingface-hub==0.15.1
      - humanfriendly==10.0
      - idna==3.4
      - importlib-metadata==6.0.1
      - inflection==0.5.1
      - jinja2==3.1.2
      - markdown-it-py==3.0.0
      - markupsafe==2.1.3
      - mdurl==0.1.2
      - mpmath==1.3.0
      - multidict==6.0.4
      - multiprocess==0.70.14
      - networkx==3.1
      - numpy==1.24.3
      - openllm==0.1.8
      - opentelemetry-api==1.17.0
      - opentelemetry-instrumentation==0.38b0
      - opentelemetry-instrumentation-aiohttp-client==0.38b0
      - opentelemetry-instrumentation-asgi==0.38b0
      - opentelemetry-instrumentation-grpc==0.38b0
      - opentelemetry-sdk==1.17.0
      - opentelemetry-semantic-conventions==0.38b0
      - opentelemetry-util-http==0.38b0
      - optimum==1.8.8
      - orjson==3.9.1
      - packaging==23.1
      - pandas==2.0.2
      - pathspec==0.11.1
      - pillow==9.5.0
      - pip-requirements-parser==32.0.1
      - pip-tools==6.13.0
      - prometheus-client==0.17.0
      - protobuf==3.20.3
      - psutil==5.9.5
      - pyarrow==12.0.1
      - pydantic==1.10.9
      - pygments==2.15.1
      - pynvml==11.5.0
      - pyparsing==3.1.0
      - pyproject-hooks==1.0.0
      - python-dateutil==2.8.2
      - python-json-logger==2.0.7
      - python-multipart==0.0.6
      - pytz==2023.3
      - pyyaml==6.0
      - pyzmq==25.1.0
      - regex==2023.6.3
      - requests==2.31.0
      - rich==13.4.2
      - safetensors==0.3.1
      - schema==0.7.5
      - sentencepiece==0.1.99
      - simple-di==0.1.5
      - six==1.16.0
      - sniffio==1.3.0
      - starlette==0.28.0
      - sympy==1.12
      - tabulate==0.9.0
      - tokenizers==0.13.3
      - tomli==2.0.1
      - torch==2.0.1
      - torchvision==0.15.2
      - tornado==6.3.2
      - tqdm==4.65.0
      - transformers==4.30.2
      - typing-extensions==4.6.3
      - tzdata==2023.3
      - urllib3==2.0.3
      - uvicorn==0.22.0
      - watchfiles==0.19.0
      - wcwidth==0.2.6
      - wrapt==1.15.0
      - xxhash==3.2.0
      - yarl==1.9.2
      - zipp==3.15.0
prefix: /Users/tim/anaconda3/envs/openllm
pip_packages
accelerate==0.20.3
aiohttp==3.8.4
aiosignal==1.3.1
anyio==3.7.0
appdirs==1.4.4
asgiref==3.7.2
async-timeout==4.0.2
attrs==23.1.0
bentoml==1.0.22
build==0.10.0
cattrs==23.1.2
certifi==2023.5.7
charset-normalizer==3.1.0
circus==0.18.0
click==8.1.3
click-option-group==0.5.6
cloudpickle==2.2.1
coloredlogs==15.0.1
contextlib2==21.6.0
cpm-kernels==1.0.11
datasets==2.13.0
deepmerge==1.1.0
Deprecated==1.2.14
dill==0.3.6
exceptiongroup==1.1.1
filelock==3.12.2
filetype==1.2.0
frozenlist==1.3.3
fs==2.4.16
fsspec==2023.6.0
grpcio==1.54.2
grpcio-health-checking==1.48.2
h11==0.14.0
httpcore==0.17.2
httpx==0.24.1
huggingface-hub==0.15.1
humanfriendly==10.0
idna==3.4
importlib-metadata==6.0.1
inflection==0.5.1
Jinja2==3.1.2
markdown-it-py==3.0.0
MarkupSafe==2.1.3
mdurl==0.1.2
mpmath==1.3.0
multidict==6.0.4
multiprocess==0.70.14
networkx==3.1
numpy==1.24.3
openllm==0.1.8
opentelemetry-api==1.17.0
opentelemetry-instrumentation==0.38b0
opentelemetry-instrumentation-aiohttp-client==0.38b0
opentelemetry-instrumentation-asgi==0.38b0
opentelemetry-instrumentation-grpc==0.38b0
opentelemetry-sdk==1.17.0
opentelemetry-semantic-conventions==0.38b0
opentelemetry-util-http==0.38b0
optimum==1.8.8
orjson==3.9.1
packaging==23.1
pandas==2.0.2
pathspec==0.11.1
Pillow==9.5.0
pip-requirements-parser==32.0.1
pip-tools==6.13.0
prometheus-client==0.17.0
protobuf==3.20.3
psutil==5.9.5
pyarrow==12.0.1
pydantic==1.10.9
Pygments==2.15.1
pynvml==11.5.0
pyparsing==3.1.0
pyproject_hooks==1.0.0
python-dateutil==2.8.2
python-json-logger==2.0.7
python-multipart==0.0.6
pytz==2023.3
PyYAML==6.0
pyzmq==25.1.0
regex==2023.6.3
requests==2.31.0
rich==13.4.2
safetensors==0.3.1
schema==0.7.5
sentencepiece==0.1.99
simple-di==0.1.5
six==1.16.0
sniffio==1.3.0
starlette==0.28.0
sympy==1.12
tabulate==0.9.0
tokenizers==0.13.3
tomli==2.0.1
torch==2.0.1
torchvision==0.15.2
tornado==6.3.2
tqdm==4.65.0
transformers==4.30.2
typing_extensions==4.6.3
tzdata==2023.3
urllib3==2.0.3
uvicorn==0.22.0
watchfiles==0.19.0
wcwidth==0.2.6
wrapt==1.15.0
xxhash==3.2.0
yarl==1.9.2
zipp==3.15.0
  • transformers version: 4.30.2
  • Platform: macOS-13.4-arm64-arm-64bit
  • Python version: 3.8.16
  • Huggingface_hub version: 0.15.1
  • Safetensors version: 0.3.1
  • PyTorch version (GPU?): 2.0.1 (False)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: no
  • Using distributed or parallel set-up in script?:

System information (Optional)

apple m1 max

Error installing and running Falcon Models

Dear community,

When trying to install the Falcon model and run it, I'm getting the following error:

┌───────────────────── Traceback (most recent call last) ─────────────────────┐
│ C:\Users\pedro\anaconda3\envs\powerai\Lib\site-packages\openllm\cli.py:1395 │
│ in download_models │
│ │
│ 1392 │ ).for_model(model_name, model_id=model_id, llm_config=config) │
│ 1393 │ │
│ 1394 │ try: │
│ > 1395 │ │ ref = bentoml.transformers.get(model.tag) │
│ 1396 │ │ if machine: │
│ 1397 │ │ │ # NOTE: When debug is enabled, │
│ 1398 │ │ │ # We will prefix the tag with tag and we can use reg │
│ │
│ C:\Users\pedro\anaconda3\envs\powerai\Lib\site-packages\bentoml_internal\f │
│ rameworks\transformers.py:292 in get │
│ │
│ 289 │ # target model must be from the BentoML model store │
│ 290 │ model = bentoml.transformers.get("my_pipeline:latest") │
│ 291 │ """ │
│ > 292 │ model = bentoml.models.get(tag_like) │
│ 293 │ if model.info.module not in (MODULE_NAME, name): │
│ 294 │ │ raise NotFound( │
│ 295 │ │ │ f"Model {model.tag} was saved with module {model.info.mod │
│ │
│ C:\Users\pedro\anaconda3\envs\powerai\Lib\site-packages\simple_di\__init__. │
│ py:139 in _                                                                 │
│ │
│ 136 │ │ bind = sig.bind_partial(*filtered_args, **filtered_kwargs)         │
│ 137 │ │ bind.apply_defaults()                                               │
│ 138 │ │                                                                     │
│ > 139 │ │ return func(*_inject_args(bind.args), **_inject_kwargs(bind.k     │
│ 140 │                                                                        │
│ 141 │ setattr(_, "_is_injected", True)                                       │
│ 142 │ return cast(WrappedCallable, _)                                        │
│ │
│ C:\Users\pedro\anaconda3\envs\powerai\Lib\site-packages\bentoml\models.py:4 │
│ 2 in get │
│ │
│ 39 │ *,                                                                      │
│ 40 │ _model_store: "ModelStore" = Provide[BentoMLContainer.model_store │
│ 41 ) -> "Model": │
│ > 42 │ return _model_store.get(tag) │
│ 43 │
│ 44 │
│ 45 @inject                                                                   │
│ │
│ C:\Users\pedro\anaconda3\envs\powerai\Lib\site-packages\bentoml_internal\s │
│ tore.py:146 in get │
│ │
│ 143 │ │ matches = self._fs.glob(f"{path}
/") │
│ 144 │ │ counts = matches.count().directories │
│ 145 │ │ if counts == 0: │
│ > 146 │ │ │ raise NotFound( │
│ 147 │ │ │ │ f"{self._item_type.get_typename()} '{tag}' is not fou │
│ 148 │ │ │ ) │
│ 149 │ │ elif counts == 1: │
└─────────────────────────────────────────────────────────────────────────────┘
NotFound: Model 'pt-tiiuae-falcon-7b:2f5c3cd4eace6be6c0f12981f377fb35e5bf6ee5'
is not found in BentoML store <osfs 'C:\Users\pedro\bentoml\models'>

During handling of the above exception, another exception occurred:

┌───────────────────── Traceback (most recent call last) ─────────────────────┐
│ in _run_module_as_main:198 │
│ in _run_code:88 │
│ │
│ C:\Users\pedro\anaconda3\envs\powerai\Lib\site-packages\openllm\__main__.py │
│ :26 in <module>                                                              │
│                                                                              │
│ 23 if __name__ == "__main__":                                                │
│ 24 │ from openllm.cli import cli │
│ 25 │ │
│ > 26 │ cli() │
│ 27 │
│ │
│ C:\Users\pedro\anaconda3\envs\powerai\Lib\site-packages\click\core.py:1130 │
│ in __call__                                                                  │
│                                                                              │
│ 1127 │                                                                       │
│ 1128 │ def __call__(self, *args: t.Any, **kwargs: t.Any) -> t.Any:           │
│ 1129 │ │ """Alias for :meth:`main`."""                                      │
│ > 1130 │ │ return self.main(*args, **kwargs) │
│ 1131 │
│ 1132 │
│ 1133 class Command(BaseCommand): │
│ │
│ C:\Users\pedro\anaconda3\envs\powerai\Lib\site-packages\click\core.py:1055 │
│ in main │
│ │
│ 1052 │ │ try: │
│ 1053 │ │ │ try: │
│ 1054 │ │ │ │ with self.make_context(prog_name, args, **extra) as │
│ > 1055 │ │ │ │ │ rv = self.invoke(ctx) │
│ 1056 │ │ │ │ │ if not standalone_mode: │
│ 1057 │ │ │ │ │ │ return rv │
│ 1058 │ │ │ │ │ # it's not safe to ctx.exit(rv) here! │
│ │
│ C:\Users\pedro\anaconda3\envs\powerai\Lib\site-packages\click\core.py:1657 │
│ in invoke │
│ │
│ 1654 │ │ │ │ super().invoke(ctx) │
│ 1655 │ │ │ │ sub_ctx = cmd.make_context(cmd_name, args, parent=ct │
│ 1656 │ │ │ │ with sub_ctx: │
│ > 1657 │ │ │ │ │ return _process_result(sub_ctx.command.invoke(su │
│ 1658 │ │ │
│ 1659 │ │ # In chain mode we create the contexts step by step, but aft │
│ 1660 │ │ # base command has been invoked. Because at that point we d │
│ │
│ C:\Users\pedro\anaconda3\envs\powerai\Lib\site-packages\click\core.py:1404 │
│ in invoke │
│ │
│ 1401 │ │ │ echo(style(message, fg="red"), err=True) │
│ 1402 │ │ │
│ 1403 │ │ if self.callback is not None: │
│ > 1404 │ │ │ return ctx.invoke(self.callback, **ctx.params) │
│ 1405 │ │
│ 1406 │ def shell_complete(self, ctx: Context, incomplete: str) -> t.Lis │
│ 1407 │ │ """Return a list of completions for the incomplete value. Lo │
│ │
│ C:\Users\pedro\anaconda3\envs\powerai\Lib\site-packages\click\core.py:760 │
│ in invoke │
│ │
│ 757 │ │ │
│ 758 │ │ with augment_usage_errors(__self): │
│ 759 │ │ │ with ctx: │
│ > 760 │ │ │ │ return __callback(*args, **kwargs) │
│ 761 │ │
│ 762 │ def forward( │
│ 763 │ │ __self, __cmd: "Command", *args: t.Any, **kwargs: t.Any # n │
│ │
│ C:\Users\pedro\anaconda3\envs\powerai\Lib\site-packages\openllm\cli.py:380 │
│ in wrapper │
│ │
│ 377 │ │ @functools.wraps(func) │
│ 378 │ │ def wrapper(*args: P.args, **attrs: P.kwargs) -> t.Any: │
│ 379 │ │ │ try: │
│ > 380 │ │ │ │ return func(*args, **attrs) │
│ 381 │ │ │ except OpenLLMException as err: │
│ 382 │ │ │ │ raise click.ClickException( │
│ 383 │ │ │ │ │ click.style(f"[{group.name}] '{command_name}' fa │
│ │
│ C:\Users\pedro\anaconda3\envs\powerai\Lib\site-packages\openllm\cli.py:353 │
│ in wrapper │
│ │
│ 350 │ │ │ │ assert group.name is not None, "group.name should no │
│ 351 │ │ │ │ event = analytics.OpenllmCliEvent(cmd_group=group.na │
│ 352 │ │ │ │ try: │
│ > 353 │ │ │ │ │ return_value = func(*args, **attrs) │
│ 354 │ │ │ │ │ duration_in_ms = (time.time_ns() - start_time) / │
│ 355 │ │ │ │ │ event.duration_in_ms = duration_in_ms │
│ 356 │ │ │ │ │ analytics.track(event) │
│ │
│ C:\Users\pedro\anaconda3\envs\powerai\Lib\site-packages\openllm\cli.py:328 │
│ in wrapper │
│ │
│ 325 │ │ │ │
│ 326 │ │ │ configure_logging() │
│ 327 │ │ │ │
│ > 328 │ │ │ return f(*args, **attrs) │
│ 329 │ │ │
│ 330 │ │ return t.cast("ClickFunctionWrapper[..., t.Any]", wrapper) │
│ 331 │
│ │
│ C:\Users\pedro\anaconda3\envs\powerai\Lib\site-packages\openllm\cli.py:1422 │
│ in download_models │
│ │
│ 1419 │ │ │ ) │
│ 1420 │ │ │
│ 1421 │ │ (model_args, model_attrs), tokenizer_attrs = model.llm_param │
│ > 1422 │ │ ref = model.import_model( │
│ 1423 │ │ │ model.model_id, │
│ 1424 │ │ │ model.tag, │
│ 1425 │ │ │ *model_args, │
│ │
│ C:\Users\pedro\anaconda3\envs\powerai\Lib\site-packages\openllm\models\falc │
│ on\modeling_falcon.py:56 in import_model │
│ │
│ 53 │ │ device_map = attrs.pop("device_map", "auto") │
│ 54 │ │ │
│ 55 │ │ tokenizer = transformers.AutoTokenizer.from_pretrained(model_id     │
│ > 56 │ │ model = transformers.AutoModelForCausalLM.from_pretrained( │
│ 57 │ │ │ model_id, │
│ 58 │ │ │ trust_remote_code=trust_remote_code, │
│ 59 │ │ │ torch_dtype=torch_dtype, │
│ │
│ C:\Users\pedro\anaconda3\envs\powerai\Lib\site-packages\transformers\models │
│ \auto\auto_factory.py:479 in from_pretrained │
│ │
│ 476 │ │ │ │ class_ref, pretrained_model_name_or_path, **hub_kwarg │
│ 477 │ │ │ ) │
│ 478 │ │ │ _ = hub_kwargs.pop("code_revision", None) │
│ > 479 │ │ │ return model_class.from_pretrained( │
│ 480 │ │ │ │ pretrained_model_name_or_path, *model_args, config=co │
│ 481 │ │ │ ) │
│ 482 │ │ elif type(config) in cls._model_mapping.keys(): │
│ │
│ C:\Users\pedro\anaconda3\envs\powerai\Lib\site-packages\transformers\modeli │
│ ng_utils.py:2881 in from_pretrained │
│ │
│ 2878 │ │ │ │ mismatched_keys, │
│ 2879 │ │ │ │ offload_index, │
│ 2880 │ │ │ │ error_msgs, │
│ > 2881 │ │ │ ) = cls._load_pretrained_model( │
│ 2882 │ │ │ │ model, │
│ 2883 │ │ │ │ state_dict, │
│ 2884 │ │ │ │ loaded_state_dict_keys, # XXX: rename? │
│ │
│ C:\Users\pedro\anaconda3\envs\powerai\Lib\site-packages\transformers\modeli │
│ ng_utils.py:2980 in _load_pretrained_model │
│ │
│ 2977 │ │ │ ) │
│ 2978 │ │ │ is_safetensors = archive_file.endswith(".safetensors") │
│ 2979 │ │ │ if offload_folder is None and not is_safetensors: │
│ > 2980 │ │ │ │ raise ValueError( │
│ 2981 │ │ │ │ │ "The current device_map had weights offloaded │
│ 2982 │ │ │ │ │ " for them. Alternatively, make sure you have s │ │ 2983 │ │ │ │ │ " offers the weights in this format." │ └─────────────────────────────────────────────────────────────────────────────┘ ValueError: The current device_maphad weights offloaded to the disk. Please provide anoffload_folderfor them. Alternatively, make sure you havesafetensors` installed if the model you are using offers the weights in this
format.
Traceback (most recent call last):
File "", line 198, in _run_module_as_main
File "", line 88, in run_code
File "C:\Users\pedro\anaconda3\envs\powerai\Scripts\openllm.exe_main
.py", line 7, in
File "C:\Users\pedro\anaconda3\envs\powerai\Lib\site-packages\click\core.py", line 1130, in call
return self.main(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\pedro\anaconda3\envs\powerai\Lib\site-packages\click\core.py", line 1055, in main
rv = self.invoke(ctx)
^^^^^^^^^^^^^^^^
File "C:\Users\pedro\anaconda3\envs\powerai\Lib\site-packages\click\core.py", line 1657, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\pedro\anaconda3\envs\powerai\Lib\site-packages\click\core.py", line 1657, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\pedro\anaconda3\envs\powerai\Lib\site-packages\click\core.py", line 1404, in invoke
return ctx.invoke(self.callback, **ctx.params)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\pedro\anaconda3\envs\powerai\Lib\site-packages\click\core.py", line 760, in invoke
return __callback(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\pedro\anaconda3\envs\powerai\Lib\site-packages\openllm\cli.py", line 380, in wrapper
return func(*args, **attrs)
^^^^^^^^^^^^^^^^^^^^
File "C:\Users\pedro\anaconda3\envs\powerai\Lib\site-packages\openllm\cli.py", line 353, in wrapper
return_value = func(*args, **attrs)
^^^^^^^^^^^^^^^^^^^^
File "C:\Users\pedro\anaconda3\envs\powerai\Lib\site-packages\openllm\cli.py", line 328, in wrapper
return f(*args, **attrs)
^^^^^^^^^^^^^^^^^
File "C:\Users\pedro\anaconda3\envs\powerai\Lib\site-packages\click\decorators.py", line 26, in new_func
return f(get_current_context(), *args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\pedro\anaconda3\envs\powerai\Lib\site-packages\openllm\cli.py", line 797, in model_start
llm = t.cast(
^^^^^^^
File "C:\Users\pedro\anaconda3\envs\powerai\Lib\site-packages\openllm\models\auto\factory.py", line 135, in for_model
llm.ensure_model_id_exists()
File "C:\Users\pedro\anaconda3\envs\powerai\Lib\site-packages\openllm_llm.py", line 900, in ensure_model_id_exists
output = subprocess.check_output(
^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\pedro\anaconda3\envs\powerai\Lib\subprocess.py", line 466, in check_output
return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\pedro\anaconda3\envs\powerai\Lib\subprocess.py", line 571, in run
raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['C:\Users\pedro\anaconda3\envs\powerai\python.exe', '-m', 'openllm', 'download', 'falcon', '--model-id', 'tiiuae/falcon-7b', '--machine', '--implementation', 'pt']' returned non-zero exit status 1.

Can someone help me with this topic?

Thank you.
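For reference, the ValueError in the traceback suggests two directions. One is providing an offload folder so accelerate can spill weights to disk; a minimal sketch in plain transformers (the folder name is illustrative, and this is not a confirmed fix for the OpenLLM CLI path):

import torch
import transformers

# With device_map="auto", weights that do not fit in GPU/CPU memory are
# offloaded to disk, which requires an explicit offload folder.
model = transformers.AutoModelForCausalLM.from_pretrained(
    "tiiuae/falcon-7b",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    offload_folder="falcon-offload",  # illustrative path
)

The other direction, per the same message, is installing safetensors so the checkpoint can be loaded in that format.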

bug: OpenLLM not loading the model

Describe the bug

Starting from a clean setup (Python 3.10), trying to start LLaMA 13B raises a ModuleNotFoundError; after fixing that (by installing SciPy), nothing much happens once the weights are loaded.

To reproduce

conda create -n py10 python=3.10 -y
conda activate py10
pip install "openllm[llama, fine-tune, vllm]"
pip install scipy
openllm start llama --model-id huggyllama/llama-13b

Logs

Make sure to have the following dependencies available: ['openllm[vllm]']
bin /opt/conda/envs/py10/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda113.so
[2023-07-20 13:18:51,311] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
Downloading (…)fetensors.index.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 33.4k/33.4k [00:00<00:00, 98.5MB/s]
Downloading (…)of-00003.safetensors: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 9.95G/9.95G [00:24<00:00, 408MB/s]
Downloading (…)of-00003.safetensors: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 9.90G/9.90G [00:24<00:00, 401MB/s]
Downloading (…)of-00003.safetensors: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 6.18G/6.18G [00:15<00:00, 402MB/s]
Downloading shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [01:04<00:00, 21.61s/it]
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:03<00:00,  1.25s/it]
Downloading (…)neration_config.json: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 137/137 [00:00<00:00, 621kB/s]
Downloading (…)okenizer_config.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 700/700 [00:00<00:00, 2.61MB/s]
Downloading tokenizer.model: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 500k/500k [00:00<00:00, 375MB/s]
Downloading (…)/main/tokenizer.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1.84M/1.84M [00:00<00:00, 4.55MB/s]
Downloading (…)cial_tokens_map.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 411/411 [00:00<00:00, 1.95MB/s]

Also, nvidia-smi reveals that nothing is loaded on the GPU (after 20+ minutes):

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03    Driver Version: 510.47.03    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-SXM...  Off  | 00000000:00:05.0 Off |                    0 |
| N/A   33C    P0    58W / 400W |      2MiB / 81920MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
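One quick sanity check for this kind of hang is to confirm that PyTorch in the active environment can see the GPU at all and which CUDA build it ships. A minimal sketch; it does not diagnose vLLM or bitsandbytes specifically:

import torch

print("CUDA available:", torch.cuda.is_available())
print("Torch CUDA build:", torch.version.cuda)  # e.g. 11.7 for the torch==2.0.1 wheels listed below
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))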

Environment

Debian 10
Python 3.10
OpenLLM 0.2.0

Environment variable

BENTOML_DEBUG=''
BENTOML_QUIET=''
BENTOML_BUNDLE_LOCAL_BUILD=''
BENTOML_DO_NOT_TRACK=''
BENTOML_CONFIG=''
BENTOML_CONFIG_OPTIONS=''
BENTOML_PORT=''
BENTOML_HOST=''
BENTOML_API_WORKERS=''

System information

bentoml: 1.0.24
python: 3.10.12
platform: Linux-4.19.0-22-cloud-amd64-x86_64-with-glibc2.28
uid_gid: 1004:1005
conda: 22.9.0
in_conda_env: True

conda_packages
name: py10
channels:
  - conda-forge
  - defaults
dependencies:
  - _libgcc_mutex=0.1=conda_forge
  - _openmp_mutex=4.5=2_gnu
  - bzip2=1.0.8=h7f98852_4
  - ca-certificates=2023.5.7=hbcca054_0
  - ld_impl_linux-64=2.40=h41732ed_0
  - libffi=3.4.2=h7f98852_5
  - libgcc-ng=13.1.0=he5830b7_0
  - libgomp=13.1.0=he5830b7_0
  - libnsl=2.0.0=h7f98852_0
  - libsqlite=3.42.0=h2797004_0
  - libuuid=2.38.1=h0b41bf4_0
  - libzlib=1.2.13=hd590300_5
  - ncurses=6.4=hcb278e6_0
  - openssl=3.1.1=hd590300_1
  - pip=23.2=pyhd8ed1ab_0
  - python=3.10.12=hd12c33a_0_cpython
  - readline=8.2=h8228510_1
  - setuptools=68.0.0=pyhd8ed1ab_0
  - tk=8.6.12=h27826a3_0
  - wheel=0.40.0=pyhd8ed1ab_1
  - xz=5.2.6=h166bdaf_0
  - pip:
    - accelerate==0.21.0
    - aiofiles==23.1.0
    - aiohttp==3.8.5
    - aiosignal==1.3.1
    - altair==5.0.1
    - anyio==3.7.1
    - appdirs==1.4.4
    - asgiref==3.7.2
    - async-timeout==4.0.2
    - attrs==23.1.0
    - bentoml==1.0.24
    - bitsandbytes==0.39.1
    - build==0.10.0
    - cattrs==23.1.2
    - certifi==2023.5.7
    - charset-normalizer==3.2.0
    - circus==0.18.0
    - click==8.1.6
    - click-option-group==0.5.6
    - cloudpickle==2.2.1
    - cmake==3.27.0
    - coloredlogs==15.0.1
    - contextlib2==21.6.0
    - contourpy==1.1.0
    - cuda-python==12.2.0
    - cycler==0.11.0
    - cython==3.0.0
    - datasets==2.13.1
    - deepmerge==1.1.0
    - deepspeed==0.10.0
    - deprecated==1.2.14
    - dill==0.3.6
    - docker-pycreds==0.4.0
    - exceptiongroup==1.1.2
    - fairscale==0.4.13
    - fastapi==0.100.0
    - ffmpy==0.3.1
    - filelock==3.12.2
    - filetype==1.2.0
    - fonttools==4.41.0
    - frozenlist==1.4.0
    - fs==2.4.16
    - fschat==0.2.3
    - fsspec==2023.6.0
    - gitdb==4.0.10
    - gitpython==3.1.32
    - gradio==3.23.0
    - grpcio==1.51.3
    - grpcio-health-checking==1.51.3
    - h11==0.14.0
    - hjson==3.1.0
    - httpcore==0.17.3
    - httpx==0.24.1
    - huggingface-hub==0.16.4
    - humanfriendly==10.0
    - idna==3.4
    - importlib-metadata==6.0.1
    - inflection==0.5.1
    - jinja2==3.1.2
    - jsonschema==4.18.4
    - jsonschema-specifications==2023.7.1
    - kiwisolver==1.4.4
    - linkify-it-py==2.0.2
    - lit==16.0.6
    - markdown-it-py==2.2.0
    - markdown2==2.4.9
    - markupsafe==2.1.3
    - matplotlib==3.7.2
    - mdit-py-plugins==0.3.3
    - mdurl==0.1.2
    - mpmath==1.3.0
    - msgpack==1.0.5
    - multidict==6.0.4
    - multiprocess==0.70.14
    - mypy-extensions==1.0.0
    - networkx==3.1
    - ninja==1.11.1
    - numpy==1.25.1
    - nvidia-cublas-cu11==11.10.3.66
    - nvidia-cuda-cupti-cu11==11.7.101
    - nvidia-cuda-nvrtc-cu11==11.7.99
    - nvidia-cuda-runtime-cu11==11.7.99
    - nvidia-cudnn-cu11==8.5.0.96
    - nvidia-cufft-cu11==10.9.0.58
    - nvidia-curand-cu11==10.2.10.91
    - nvidia-cusolver-cu11==11.4.0.1
    - nvidia-cusparse-cu11==11.7.4.91
    - nvidia-nccl-cu11==2.14.3
    - nvidia-nvtx-cu11==11.7.91
    - openllm==0.2.0
    - opentelemetry-api==1.18.0
    - opentelemetry-instrumentation==0.39b0
    - opentelemetry-instrumentation-aiohttp-client==0.39b0
    - opentelemetry-instrumentation-asgi==0.39b0
    - opentelemetry-instrumentation-grpc==0.39b0
    - opentelemetry-sdk==1.18.0
    - opentelemetry-semantic-conventions==0.39b0
    - opentelemetry-util-http==0.39b0
    - optimum==1.9.1
    - orjson==3.9.2
    - packaging==23.1
    - pandas==2.0.3
    - pathspec==0.11.1
    - pathtools==0.1.2
    - peft==0.4.0
    - pillow==10.0.0
    - pip-requirements-parser==32.0.1
    - pip-tools==7.1.0
    - prometheus-client==0.17.1
    - prompt-toolkit==3.0.39
    - protobuf==4.23.4
    - psutil==5.9.5
    - py-cpuinfo==9.0.0
    - pyarrow==12.0.1
    - pydantic==1.10.11
    - pydub==0.25.1
    - pygments==2.15.1
    - pynvml==11.5.0
    - pyparsing==3.0.9
    - pyproject-hooks==1.0.0
    - pyre-extensions==0.0.29
    - python-dateutil==2.8.2
    - python-json-logger==2.0.7
    - python-multipart==0.0.6
    - pytz==2023.3
    - pyyaml==6.0.1
    - pyzmq==25.1.0
    - ray==2.5.1
    - referencing==0.30.0
    - regex==2023.6.3
    - requests==2.31.0
    - rich==13.4.2
    - rpds-py==0.9.2
    - safetensors==0.3.1
    - schema==0.7.5
    - scipy==1.11.1
    - semantic-version==2.10.0
    - sentencepiece==0.1.99
    - sentry-sdk==1.28.1
    - setproctitle==1.3.2
    - shortuuid==1.0.11
    - simple-di==0.1.5
    - six==1.16.0
    - smmap==5.0.0
    - sniffio==1.3.0
    - starlette==0.27.0
    - svgwrite==1.4.3
    - sympy==1.12
    - tabulate==0.9.0
    - tokenizers==0.13.3
    - tomli==2.0.1
    - toolz==0.12.0
    - torch==2.0.1
    - tornado==6.3.2
    - tqdm==4.65.0
    - transformers==4.31.0
    - triton==2.0.0
    - trl==0.4.7
    - typing-extensions==4.7.1
    - typing-inspect==0.9.0
    - tzdata==2023.3
    - uc-micro-py==1.0.2
    - urllib3==2.0.4
    - uvicorn==0.23.1
    - vllm==0.1.2
    - wandb==0.15.5
    - watchfiles==0.19.0
    - wavedrom==2.0.3.post3
    - wcwidth==0.2.6
    - websockets==11.0.3
    - wrapt==1.15.0
    - xformers==0.0.20
    - xxhash==3.2.0
    - yarl==1.9.2
    - zipp==3.16.2
prefix: /opt/conda/envs/py10
pip_packages
accelerate==0.21.0
aiofiles==23.1.0
aiohttp==3.8.5
aiosignal==1.3.1
altair==5.0.1
anyio==3.7.1
appdirs==1.4.4
asgiref==3.7.2
async-timeout==4.0.2
attrs==23.1.0
bentoml==1.0.24
bitsandbytes==0.39.1
build==0.10.0
cattrs==23.1.2
certifi==2023.5.7
charset-normalizer==3.2.0
circus==0.18.0
click==8.1.6
click-option-group==0.5.6
cloudpickle==2.2.1
cmake==3.27.0
coloredlogs==15.0.1
contextlib2==21.6.0
contourpy==1.1.0
cuda-python==12.2.0
cycler==0.11.0
Cython==3.0.0
datasets==2.13.1
deepmerge==1.1.0
deepspeed==0.10.0
Deprecated==1.2.14
dill==0.3.6
docker-pycreds==0.4.0
exceptiongroup==1.1.2
fairscale==0.4.13
fastapi==0.100.0
ffmpy==0.3.1
filelock==3.12.2
filetype==1.2.0
fonttools==4.41.0
frozenlist==1.4.0
fs==2.4.16
fschat==0.2.3
fsspec==2023.6.0
gitdb==4.0.10
GitPython==3.1.32
gradio==3.23.0
grpcio==1.51.3
grpcio-health-checking==1.51.3
h11==0.14.0
hjson==3.1.0
httpcore==0.17.3
httpx==0.24.1
huggingface-hub==0.16.4
humanfriendly==10.0
idna==3.4
importlib-metadata==6.0.1
inflection==0.5.1
Jinja2==3.1.2
jsonschema==4.18.4
jsonschema-specifications==2023.7.1
kiwisolver==1.4.4
linkify-it-py==2.0.2
lit==16.0.6
markdown-it-py==2.2.0
markdown2==2.4.9
MarkupSafe==2.1.3
matplotlib==3.7.2
mdit-py-plugins==0.3.3
mdurl==0.1.2
mpmath==1.3.0
msgpack==1.0.5
multidict==6.0.4
multiprocess==0.70.14
mypy-extensions==1.0.0
networkx==3.1
ninja==1.11.1
numpy==1.25.1
nvidia-cublas-cu11==11.10.3.66
nvidia-cuda-cupti-cu11==11.7.101
nvidia-cuda-nvrtc-cu11==11.7.99
nvidia-cuda-runtime-cu11==11.7.99
nvidia-cudnn-cu11==8.5.0.96
nvidia-cufft-cu11==10.9.0.58
nvidia-curand-cu11==10.2.10.91
nvidia-cusolver-cu11==11.4.0.1
nvidia-cusparse-cu11==11.7.4.91
nvidia-nccl-cu11==2.14.3
nvidia-nvtx-cu11==11.7.91
openllm==0.2.0
opentelemetry-api==1.18.0
opentelemetry-instrumentation==0.39b0
opentelemetry-instrumentation-aiohttp-client==0.39b0
opentelemetry-instrumentation-asgi==0.39b0
opentelemetry-instrumentation-grpc==0.39b0
opentelemetry-sdk==1.18.0
opentelemetry-semantic-conventions==0.39b0
opentelemetry-util-http==0.39b0
optimum==1.9.1
orjson==3.9.2
packaging==23.1
pandas==2.0.3
pathspec==0.11.1
pathtools==0.1.2
peft==0.4.0
Pillow==10.0.0
pip-requirements-parser==32.0.1
pip-tools==7.1.0
prometheus-client==0.17.1
prompt-toolkit==3.0.39
protobuf==4.23.4
psutil==5.9.5
py-cpuinfo==9.0.0
pyarrow==12.0.1
pydantic==1.10.11
pydub==0.25.1
Pygments==2.15.1
pynvml==11.5.0
pyparsing==3.0.9
pyproject_hooks==1.0.0
pyre-extensions==0.0.29
python-dateutil==2.8.2
python-json-logger==2.0.7
python-multipart==0.0.6
pytz==2023.3
PyYAML==6.0.1
pyzmq==25.1.0
ray==2.5.1
referencing==0.30.0
regex==2023.6.3
requests==2.31.0
rich==13.4.2
rpds-py==0.9.2
safetensors==0.3.1
schema==0.7.5
scipy==1.11.1
semantic-version==2.10.0
sentencepiece==0.1.99
sentry-sdk==1.28.1
setproctitle==1.3.2
shortuuid==1.0.11
simple-di==0.1.5
six==1.16.0
smmap==5.0.0
sniffio==1.3.0
starlette==0.27.0
svgwrite==1.4.3
sympy==1.12
tabulate==0.9.0
tokenizers==0.13.3
tomli==2.0.1
toolz==0.12.0
torch==2.0.1
tornado==6.3.2
tqdm==4.65.0
transformers==4.31.0
triton==2.0.0
trl==0.4.7
typing-inspect==0.9.0
typing_extensions==4.7.1
tzdata==2023.3
uc-micro-py==1.0.2
urllib3==2.0.4
uvicorn==0.23.1
vllm==0.1.2
wandb==0.15.5
watchfiles==0.19.0
wavedrom==2.0.3.post3
wcwidth==0.2.6
websockets==11.0.3
wrapt==1.15.0
xformers==0.0.20
xxhash==3.2.0
yarl==1.9.2
zipp==3.16.2

===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please run

python -m bitsandbytes

and submit this information together with your error trace to: https://github.com/TimDettmers/bitsandbytes/issues

bin /opt/conda/envs/py10/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda113.so
/opt/conda/envs/py10/lib/python3.10/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: /opt/conda/envs/py10 did not contain ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] as expected! Searching further paths...
warn(msg)
CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 8.0
CUDA SETUP: Detected CUDA version 113
CUDA SETUP: Loading binary /opt/conda/envs/py10/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda113.so...
[2023-07-20 13:31:57,679] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)

Copy-and-paste the text below in your GitHub issue and FILL OUT the two last points.

  • transformers version: 4.31.0
  • Platform: Linux-4.19.0-22-cloud-amd64-x86_64-with-glibc2.28
  • Python version: 3.10.12
  • Huggingface_hub version: 0.16.4
  • Safetensors version: 0.3.1
  • Accelerate version: 0.21.0
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.0.1+cu117 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?:
  • Using distributed or parallel set-up in script?:

System information (Optional)

a2-highgpu-1g GCP instance (1xA100 80GB)

bug: Getting 500 when running dolly-v2

Describe the bug

Thank you for creating this great repo! I am simply running `openllm start dolly-v2` and getting a 500 Internal Server Error related to `model_kwargs`. I am not sure how to proceed; see the reproduction steps and errors below.

To reproduce

Created a Docker container from this image and ran it on a T4 instance:

FROM nvidia/cuda:11.0.3-runtime-ubuntu20.04

ENV BENTOML_HOME="/model_store/"
ENV CUDA_VISIBLE_DEVICES=0

# Update apt-get and install pip
RUN apt-get update && apt-get install -y python3-pip
RUN pip3 install openllm
EXPOSE 3000

ENTRYPOINT [ "openllm", "start" ]
CMD [ "dolly-v2", "--model-id", "databricks/dolly-v2-3b", "--device", "0", "-p", "3000", "--verbose"]

When I go to localhost:3000 and POST to the inference endpoint, I get a 500 Internal Server Error.
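For reference, a minimal client call that exercises this endpoint might look like the sketch below. This is an assumption-based illustration only: it uses the /v1/generate route and the prompt field visible in the logs further down, and the exact request schema may differ between OpenLLM versions.

```python
# Hypothetical minimal request against the service described above; the payload
# shape is an assumption and may differ between OpenLLM versions.
import requests

resp = requests.post(
    "http://localhost:3000/v1/generate",
    json={"prompt": "Explain what Dolly v2 is."},
    timeout=300,
)
print(resp.status_code)  # the reporter observes 500 here
print(resp.text)
```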

Logs

2023-06-27T04:29:13+0000 [ERROR] [runner:llm-dolly-v2-runner:1] Exception on runner 'llm-dolly-v2-runner' method 'generate' (trace=666892bd204b5873d08d4ffe96808e94,span=4d1d367c3ded45cb,sampled=1,service.name=llm-dolly-v2-runner)
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/bentoml/_internal/server/runner_app.py", line 352, in _run
    ret = await runner_method.async_run(*params.args, **params.kwargs)
  File "/usr/local/lib/python3.8/dist-packages/bentoml/_internal/runner/runner.py", line 55, in async_run
    return await self.runner._runner_handle.async_run_method(self, *args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/bentoml/_internal/runner/runner_handle/local.py", line 59, in async_run_method
    return await anyio.to_thread.run_sync(
  File "/usr/local/lib/python3.8/dist-packages/anyio/to_thread.py", line 33, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(
  File "/usr/local/lib/python3.8/dist-packages/anyio/_backends/_asyncio.py", line 877, in run_sync_in_worker_thread
    return await future
  File "/usr/local/lib/python3.8/dist-packages/anyio/_backends/_asyncio.py", line 807, in run
    result = context.run(func, *args)
  File "/usr/local/lib/python3.8/dist-packages/bentoml/_internal/runner/runnable.py", line 140, in method
    return self.func(obj, *args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/openllm/_llm.py", line 1241, in generate
    return self.generate(prompt, **attrs)
  File "/usr/local/lib/python3.8/dist-packages/openllm/models/dolly_v2/modeling_dolly_v2.py", line 273, in generate
    return self.model(
  File "/usr/local/lib/python3.8/dist-packages/transformers/pipelines/base.py", line 1120, in __call__
    return self.run_single(inputs, preprocess_params, forward_params, postprocess_params)
  File "/usr/local/lib/python3.8/dist-packages/transformers/pipelines/base.py", line 1127, in run_single
    model_outputs = self.forward(model_inputs, **forward_params)
  File "/usr/local/lib/python3.8/dist-packages/transformers/pipelines/base.py", line 1026, in forward
    model_outputs = self._forward(model_inputs, **forward_params)
  File "/usr/local/lib/python3.8/dist-packages/openllm/models/dolly_v2/modeling_dolly_v2.py", line 114, in _forward
    generated_sequence = self.model.generate(
  File "/usr/local/lib/python3.8/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/transformers/generation/utils.py", line 1271, in generate
    self._validate_model_kwargs(model_kwargs.copy())
  File "/usr/local/lib/python3.8/dist-packages/transformers/generation/utils.py", line 1144, in _validate_model_kwargs
    raise ValueError(
ValueError: The following `model_kwargs` are not used by the model: ['accelerator'] (note: typos in the generate arguments will also show up in this list)
2023-06-27T04:29:13+0000 [INFO] [runner:llm-dolly-v2-runner:1]  - "POST /generate HTTP/1.1" 500 (trace=666892bd204b5873d08d4ffe96808e94,span=13c7890ba80845b0,sampled=1,service.name=llm-dolly-v2-runner)
2023-06-27T04:29:13+0000 [INFO] [runner:llm-dolly-v2-runner:1] _ (scheme=http,method=POST,path=/generate,type=application/octet-stream,length=888) (status=500,type=text/plain,length=0) 4.769ms (trace=666892bd204b5873d08d4ffe96808e94,span=4d1d367c3ded45cb,sampled=1,service.name=llm-dolly-v2-runner)
2023-06-27T04:29:13+0000 [ERROR] [api_server:4] Exception on /v1/generate [POST] (trace=666892bd204b5873d08d4ffe96808e94,span=2aa0dd91ca45f3c4,sampled=1,service.name=llm-dolly-v2-service)
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/bentoml/_internal/server/http_app.py", line 341, in api_func
    output = await api.func(*args)
  File "/usr/local/lib/python3.8/dist-packages/openllm/_service.py", line 86, in generate_v1
    responses = await runner.generate.async_run(qa_inputs.prompt, **config)
  File "/usr/local/lib/python3.8/dist-packages/bentoml/_internal/runner/runner.py", line 55, in async_run
    return await self.runner._runner_handle.async_run_method(self, *args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/bentoml/_internal/runner/runner_handle/remote.py", line 246, in async_run_method
    raise RemoteException(
bentoml.exceptions.RemoteException: An unexpected exception occurred in remote runner llm-dolly-v2-runner: [500]
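For context, the ValueError in the traceback comes from transformers itself: generate() validates its model_kwargs and rejects any keyword the model cannot consume, and here an unexpected accelerator key is being forwarded to it. The snippet below illustrates only that validation step, using gpt2 as a lightweight stand-in rather than the actual dolly-v2 pipeline:

```python
# Illustration only: any kwarg that the model's forward()/generate() does not accept
# triggers _validate_model_kwargs and the same ValueError shown in the logs above.
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")           # lightweight stand-in for dolly-v2
model = AutoModelForCausalLM.from_pretrained("gpt2")
inputs = tok("Hello", return_tensors="pt")

model.generate(**inputs, max_new_tokens=8)            # fine
model.generate(**inputs, accelerator=0)               # ValueError: model_kwargs not used by the model
```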

Environment

openllm, version 0.1.14

System information (Optional)

Dockerfile (executing on a T4 with 4 GPUs):
FROM nvidia/cuda:11.0.3-runtime-ubuntu20.04

ENV BENTOML_HOME="/model_store/"
ENV CUDA_VISIBLE_DEVICES=0

# Update apt-get and install pip
RUN apt-get update && apt-get install -y python3-pip
RUN pip3 install openllm
EXPOSE 3000

ENTRYPOINT [ "openllm", "start" ]
CMD [ "dolly-v2", "--model-id", "databricks/dolly-v2-3b", "--device", "0", "-p", "3000", "--verbose"]

bug: Loading adapters from abspath during build

Describe the bug

Currently, this is used in conjunction with build_ctx.

To reproduce

Pass in any given adapter as a custom path for build; `openllm start` works.

Logs

No response

Environment

ec2, on main for vllm, transformers, and bentoml

System information (Optional)

No response

bug: `openllm download` dies with Signal.SIGKILL: 9

Describe the bug

I'm running through the most basic install. I have created an empty virtualenv with Python 3.11, run `pip install openllm`, and I get a crash when I run `openllm start dolly-v2`.

The error I get is:
subprocess.CalledProcessError: Command '['/home/emilstenstrom/.pyenv/versions/3.11.3/envs/openllm/bin/python3.11', '-m', 'openllm', 'download', 'dolly-v2', '--model-id', 'databricks/dolly-v2-3b', '--output', 'porcelain']' died with <Signals.SIGKILL: 9>.

To reproduce

Just run the full install from scratch.
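
As a rough way to narrow this down, the failing step can be re-run in isolation with the same command that the traceback in the Logs section shows being invoked via subprocess (minus the --output flag). A SIGKILL at this stage often means the kernel's OOM killer terminated the download/conversion process, which is an assumption worth checking in dmesg.

```python
# Re-run the download step from the traceback directly, so the failure surfaces
# outside of `openllm start`.
import subprocess
import sys

subprocess.run(
    [sys.executable, "-m", "openllm", "download", "dolly-v2",
     "--model-id", "databricks/dolly-v2-3b"],
    check=True,
)
```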

Logs

Here's a full stacktrace of the run:


(openllm) 2023-06-19 15:59:29 ~/Projects/openllm $ openllm start dolly-v2
Traceback (most recent call last):
  File "/home/username/.pyenv/versions/openllm/bin/openllm", line 8, in <module>
    sys.exit(cli())
             ^^^^^
  File "/home/username/.pyenv/versions/3.11.3/envs/openllm/lib/python3.11/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/username/.pyenv/versions/3.11.3/envs/openllm/lib/python3.11/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
         ^^^^^^^^^^^^^^^^
  File "/home/username/.pyenv/versions/3.11.3/envs/openllm/lib/python3.11/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/username/.pyenv/versions/3.11.3/envs/openllm/lib/python3.11/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/username/.pyenv/versions/3.11.3/envs/openllm/lib/python3.11/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/username/.pyenv/versions/3.11.3/envs/openllm/lib/python3.11/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/username/.pyenv/versions/3.11.3/envs/openllm/lib/python3.11/site-packages/openllm/cli.py", line 324, in wrapper
    return func(*args, **attrs)
           ^^^^^^^^^^^^^^^^^^^^
  File "/home/username/.pyenv/versions/3.11.3/envs/openllm/lib/python3.11/site-packages/openllm/cli.py", line 297, in wrapper
    return_value = func(*args, **attrs)
                   ^^^^^^^^^^^^^^^^^^^^
  File "/home/username/.pyenv/versions/3.11.3/envs/openllm/lib/python3.11/site-packages/openllm/cli.py", line 272, in wrapper
    return f(*args, **attrs)
           ^^^^^^^^^^^^^^^^^
  File "/home/username/.pyenv/versions/3.11.3/envs/openllm/lib/python3.11/site-packages/click/decorators.py", line 26, in new_func
    return f(get_current_context(), *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/username/.pyenv/versions/3.11.3/envs/openllm/lib/python3.11/site-packages/openllm/cli.py", line 671, in model_start
    llm = t.cast(
          ^^^^^^^
  File "/home/username/.pyenv/versions/3.11.3/envs/openllm/lib/python3.11/site-packages/openllm/models/auto/factory.py", line 120, in for_model
    llm.ensure_model_id_exists()
  File "/home/username/.pyenv/versions/3.11.3/envs/openllm/lib/python3.11/site-packages/openllm/_llm.py", line 666, in ensure_model_id_exists
    output = subprocess.check_output(
             ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/username/.pyenv/versions/3.11.3/lib/python3.11/subprocess.py", line 466, in check_output
    return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/username/.pyenv/versions/3.11.3/lib/python3.11/subprocess.py", line 571, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['/home/username/.pyenv/versions/3.11.3/envs/openllm/bin/python3.11', '-m', 'openllm', 'download', 'dolly-v2', '--model-id', 'databricks/dolly-v2-3b', '--output', 'porcelain']' died with <Signals.SIGKILL: 9>.


Environment

Environment variable

BENTOML_DEBUG=''
BENTOML_QUIET=''
BENTOML_BUNDLE_LOCAL_BUILD=''
BENTOML_DO_NOT_TRACK=''
BENTOML_CONFIG=''
BENTOML_CONFIG_OPTIONS=''
BENTOML_PORT=''
BENTOML_HOST=''
BENTOML_API_WORKERS=''

System information

bentoml: 1.0.22
python: 3.11.3
platform: Linux-5.15.90.1-microsoft-standard-WSL2-x86_64-with-glibc2.31
uid_gid: 1000:1000

pip_packages
accelerate==0.20.3
aiohttp==3.8.4
aiosignal==1.3.1
anyio==3.7.0
appdirs==1.4.4
asgiref==3.7.2
async-timeout==4.0.2
attrs==23.1.0
bentoml==1.0.22
build==0.10.0
cattrs==23.1.2
certifi==2023.5.7
charset-normalizer==3.1.0
circus==0.18.0
click==8.1.3
click-option-group==0.5.6
cloudpickle==2.2.1
cmake==3.26.4
coloredlogs==15.0.1
contextlib2==21.6.0
datasets==2.13.0
deepmerge==1.1.0
Deprecated==1.2.14
dill==0.3.6
filelock==3.12.2
filetype==1.2.0
frozenlist==1.3.3
fs==2.4.16
fsspec==2023.6.0
grpcio==1.54.2
grpcio-health-checking==1.48.2
h11==0.14.0
httpcore==0.17.2
httpx==0.24.1
huggingface-hub==0.15.1
humanfriendly==10.0
idna==3.4
importlib-metadata==6.0.1
inflection==0.5.1
Jinja2==3.1.2
lit==16.0.6
markdown-it-py==3.0.0
MarkupSafe==2.1.3
mdurl==0.1.2
mpmath==1.3.0
multidict==6.0.4
multiprocess==0.70.14
networkx==3.1
numpy==1.25.0
nvidia-cublas-cu11==11.10.3.66
nvidia-cuda-cupti-cu11==11.7.101
nvidia-cuda-nvrtc-cu11==11.7.99
nvidia-cuda-runtime-cu11==11.7.99
nvidia-cudnn-cu11==8.5.0.96
nvidia-cufft-cu11==10.9.0.58
nvidia-curand-cu11==10.2.10.91
nvidia-cusolver-cu11==11.4.0.1
nvidia-cusparse-cu11==11.7.4.91
nvidia-nccl-cu11==2.14.3
nvidia-nvtx-cu11==11.7.91
openllm==0.1.6
opentelemetry-api==1.17.0
opentelemetry-instrumentation==0.38b0
opentelemetry-instrumentation-aiohttp-client==0.38b0
opentelemetry-instrumentation-asgi==0.38b0
opentelemetry-instrumentation-grpc==0.38b0
opentelemetry-sdk==1.17.0
opentelemetry-semantic-conventions==0.38b0
opentelemetry-util-http==0.38b0
optimum==1.8.8
orjson==3.9.1
packaging==23.1
pandas==2.0.2
pathspec==0.11.1
Pillow==9.5.0
pip-requirements-parser==32.0.1
pip-tools==6.13.0
prometheus-client==0.17.0
protobuf==3.20.3
psutil==5.9.5
pyarrow==12.0.1
pydantic==1.10.9
Pygments==2.15.1
pynvml==11.5.0
pyparsing==3.1.0
pyproject_hooks==1.0.0
python-dateutil==2.8.2
python-json-logger==2.0.7
python-multipart==0.0.6
pytz==2023.3
PyYAML==6.0
pyzmq==25.1.0
regex==2023.6.3
requests==2.31.0
rich==13.4.2
safetensors==0.3.1
schema==0.7.5
sentencepiece==0.1.99
simple-di==0.1.5
six==1.16.0
sniffio==1.3.0
starlette==0.28.0
sympy==1.12
tabulate==0.9.0
tokenizers==0.13.3
torch==2.0.1
torchvision==0.15.2
tornado==6.3.2
tqdm==4.65.0
transformers==4.30.2
triton==2.0.0
typing_extensions==4.6.3
tzdata==2023.3
urllib3==2.0.3
uvicorn==0.22.0
watchfiles==0.19.0
wcwidth==0.2.6
wrapt==1.15.0
xxhash==3.2.0
yarl==1.9.2
zipp==3.15.0
  • transformers version: 4.30.2
  • Platform: Linux-5.15.90.1-microsoft-standard-WSL2-x86_64-with-glibc2.31
  • Python version: 3.11.3
  • Huggingface_hub version: 0.15.1
  • Safetensors version: 0.3.1
  • PyTorch version (GPU?): 2.0.1+cu117 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: <not relevant, code crashes after download?>
  • Using distributed or parallel set-up in script?: <not relevant, code crashes after download?>
