
opencsgs / llm-inference


llm-inference is a platform for publishing and managing llm inference, providing a wide range of out-of-the-box features for model deployment, such as UI, RESTful API, auto-scaling, computing resource management, monitoring, and more.

License: Apache License 2.0

Languages: Python 97.33%, Jupyter Notebook 1.58%, JavaScript 0.59%, Shell 0.31%, Dockerfile 0.19%
Topics: deepspeed, llama-cpp, llm-inference, ray, transformer, vllm

llm-inference's People

Contributors

depenglee1707, jasonhe258, pulltheflower, seanhh86, wanggxa


llm-inference's Issues

The usage description of `llm-serve` in quick_start.md is not correct

Wrong:

# llm-serve --help

 Usage: llm-serve [OPTIONS] COMMAND [ARGS]...

╭─ Options ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ --help          Show this message and exit.                                                                                                                        │
╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
╭─ Commands ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ evaluate     Evaluate and summarize the results of a multi_query run with a strong 'evaluator' LLM like GPT-4.                                                     │
│ list         List available model(s) and deployed serving etc.                                                                                                     │
│ predict      Predict one or several models with one or multiple prompts, optionally read from file, and save the results to a file.                                │
│ start        Start application(s) for LLM serving, API server, experimention, fine-tuning and comparation.                                                         │
│ stop         Stop application(s) for LLM serving and API server.                                                                                                   │
╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯

The `evaluate` subcommand has already been deprecated and removed.

Inference Gradio web UI responds with random words for the DeepSeek instruct model

The inference Gradio web UI responds with random words for the DeepSeek instruct model (screenshot omitted).

While using the REST API, everything seems to be OK:

curl -H "Content-Type: application/json" -X POST -d '{"prompt": "写一个快排吧"}' "http://127.0.0.1:8000/api/v1/default/opencsg--opencsg-deepseek-coder-1.3b-v0.1/run/predict" {"generated_text":"}\n\n\n# 快排\ndef quicksort(arr):\n if len(arr) <= 1:\n return arr\n pivot = arr[len(arr) // 2]\n left = [x for x in arr if x < pivot]\n middle = [x for x in arr if x == pivot]\n right = [x for x in arr if x > pivot]\n return quicksort(left) + middle + quicksort(right)\n\n# 测试\nprint(quicksort(arr))\n\n# 输出: [1, 2, 3, 4, 5, 6, 7, 8, 9]\n```\n\n这个程序使用了快速排序算法,它是一种高效的排序算法,基于分治法的原理。它选择一个元素作为枢轴,并根据它们与枢轴的大小将其他元素分成两个子数组,然后递归地对子数组进行排序。\n\n快速排序的平均时间复杂度为O(n log n),最坏情况下的时间复杂度为O(n^2),但这种情况很少发生。","num_input_tokens":16,"num_input_tokens_batch":16,"num_generated_tokens":267,"num_generated_tokens_batch":267,"preprocessing_time":0.008793507993686944,"generation_time":2.4766286090016365,"postprocessing_time":0.0009328589949291199,"generation_time_per_token":0.008751337840995181,"generation_time_per_token_batch":0.008751337840995181,"num_total_tokens":283,"num_total_tokens_batch":283,"total_time":2.4863549759902526,"total_time_per_token":0.008785706628940822,"total_time_per_token_batch":0.008785706628940822}(.llm-inference) root@opencsg-gpu1-4090:~/pl/workspace/depenglee/llm-inference#

Requested tokens (817) exceed context window of 512

(PredictionWorker:Qwen/Qwen1.5-72B-Chat-GGUF pid=42050) [INFO 2024-04-16 09:34:13,880] llamacpp_pipeline.py: 212 generate_kwargs: {'max_tokens': 1024, 'echo': False, 'stop': ['<|im_end|>'], 'logits_processor': [], 'stopping_criteria': []}
(ServeController pid=41618) ERROR 2024-04-16 09:34:14,246 controller 41618 deployment_state.py:658 - Exception in replica 'default#Qwen--Qwen1.5-72B-Chat-GGUF#dMqscG', the replica will be stopped.
(ServeController pid=41618) Traceback (most recent call last):
(ServeController pid=41618) File "/root/miniconda3/envs/yons/lib/python3.10/site-packages/ray/serve/_private/deployment_state.py", line 656, in check_ready
(ServeController pid=41618) _, self._version = ray.get(self._ready_obj_ref)
(ServeController pid=41618) File "/root/miniconda3/envs/yons/lib/python3.10/site-packages/ray/_private/auto_init_hook.py", line 22, in auto_init_wrapper
(ServeController pid=41618) return fn(*args, **kwargs)
(ServeController pid=41618) File "/root/miniconda3/envs/yons/lib/python3.10/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
(ServeController pid=41618) return func(*args, **kwargs)
(ServeController pid=41618) File "/root/miniconda3/envs/yons/lib/python3.10/site-packages/ray/_private/worker.py", line 2624, in get
(ServeController pid=41618) raise value.as_instanceof_cause()
(ServeController pid=41618) ray.exceptions.RayTaskError(RuntimeError): ray::5-72B-Chat-GGUF.initialize_and_get_metadata() (pid=41823, ip=172.17.0.2, actor_id=6aff10f7a7934a83f523892907000000, repr=<ray.serve._private.replica.ServeReplica:default:Qwen--Qwen1.5-72B-Chat-GGUF object at 0x7f24110af4c0>)
(ServeController pid=41618) File "/root/miniconda3/envs/yons/lib/python3.10/concurrent/futures/_base.py", line 451, in result
(ServeController pid=41618) return self.__get_result()
(ServeController pid=41618) File "/root/miniconda3/envs/yons/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
(ServeController pid=41618) raise self._exception
(ServeController pid=41618) File "/root/miniconda3/envs/yons/lib/python3.10/site-packages/ray/serve/_private/replica.py", line 455, in initialize_and_get_metadata
(ServeController pid=41618) raise RuntimeError(traceback.format_exc()) from None
(ServeController pid=41618) RuntimeError: Traceback (most recent call last):
(ServeController pid=41618) File "/root/miniconda3/envs/yons/lib/python3.10/site-packages/ray/serve/_private/replica.py", line 445, in initialize_and_get_metadata
(ServeController pid=41618) await self.replica.update_user_config(
(ServeController pid=41618) File "/root/miniconda3/envs/yons/lib/python3.10/site-packages/ray/serve/_private/replica.py", line 724, in update_user_config
(ServeController pid=41618) await reconfigure_method(user_config)
(ServeController pid=41618) File "/data/llm-inference/llmserve/backend/server/app.py", line 154, in reconfigure
(ServeController pid=41618) await self.rollover(
(ServeController pid=41618) File "/data/llm-inference/llmserve/backend/llm/predictor.py", line 64, in rollover
(ServeController pid=41618) self.new_worker_group = await self._create_worker_group(
(ServeController pid=41618) File "/data/llm-inference/llmserve/backend/llm/predictor.py", line 159, in _create_worker_group
(ServeController pid=41618) engine = await self.engine.launch_engine(scaling_config, self.pg, scaling_options)
(ServeController pid=41618) File "/data/llm-inference/llmserve/backend/llm/engines/generic.py", line 367, in launch_engine
(ServeController pid=41618) await asyncio.gather(
(ServeController pid=41618) File "/root/miniconda3/envs/yons/lib/python3.10/asyncio/tasks.py", line 650, in _wrap_awaitable
(ServeController pid=41618) return (yield from awaitable.__await__())
(ServeController pid=41618) ray.exceptions.RayTaskError(ValueError): ray::PredictionWorker.init_model() (pid=42050, ip=172.17.0.2, actor_id=b7ddc7c61575fad3b581750d07000000, repr=PredictionWorker:Qwen/Qwen1.5-72B-Chat-GGUF)
(ServeController pid=41618) File "/data/llm-inference/llmserve/backend/llm/engines/generic.py", line 236, in init_model
(ServeController pid=41618) self.generator = init_model(
(ServeController pid=41618) File "/data/llm-inference/llmserve/backend/llm/utils.py", line 161, in inner
(ServeController pid=41618) ret = func(*args, **kwargs)
(ServeController pid=41618) File "/data/llm-inference/llmserve/backend/llm/engines/generic.py", line 133, in init_model
(ServeController pid=41618) resp_batch = generate(
(ServeController pid=41618) File "/data/llm-inference/llmserve/backend/llm/utils.py", line 161, in inner
(ServeController pid=41618) ret = func(*args, **kwargs)
(ServeController pid=41618) File "/data/llm-inference/llmserve/backend/llm/engines/generic.py", line 168, in generate
(ServeController pid=41618) outputs = pipeline(
(ServeController pid=41618) File "/data/llm-inference/llmserve/backend/llm/pipelines/llamacpp/llamacpp_pipeline.py", line 102, in call
(ServeController pid=41618) for batch_response in self.stream(inputs, **kwargs):
(ServeController pid=41618) File "/data/llm-inference/llmserve/backend/llm/pipelines/llamacpp/llamacpp_pipeline.py", line 214, in stream
(ServeController pid=41618) for token in output:
(ServeController pid=41618) File "/root/miniconda3/envs/yons/lib/python3.10/site-packages/llama_cpp/llama.py", line 970, in _create_completion
(ServeController pid=41618) raise ValueError(
(ServeController pid=41618) ValueError: Requested tokens (817) exceed context window of 512
(ServeController pid=41618) INFO 2024-04-16 09:34:16,388 controller 41618 deployment_state.py:2185 - Replica default#Qwen--Qwen1.5-72B-Chat-GGUF#dMqscG is stopped.
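llama-cpp-python defaults to a 512-token context window (n_ctx=512), which is what the 817 requested tokens overflow here. A minimal sketch of the knob involved, assuming the pipeline ultimately constructs a llama_cpp.Llama from the GGUF file (the model path below is illustrative):

    from llama_cpp import Llama

    # n_ctx defaults to 512; requests beyond that raise
    # "Requested tokens (...) exceed context window of 512".
    llm = Llama(model_path="qwen1_5-72b-chat-q5_k_m.gguf", n_ctx=4096)  # enlarge the context window
    out = llm("你好", max_tokens=1024, stop=["<|im_end|>"])
    print(out["choices"][0]["text"])

Exposing n_ctx (or an equivalent setting) through the model yaml would let users work around this without code changes.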

Model streaming API enhancement

  1. Load-balance streaming requests across multiple predictor workers.
  2. More parameters for the model.generate API.
  3. Streaming support for the DefaultTransformersPipeline class.
  4. RouterDeployment API support for the "/{model}/run/predict" route format.
  5. Model ID mapping for the API server, e.g. mapping facebook/opt-125m to facebook--opt-125m (see the sketch below).
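For item 5, a minimal sketch of the intended mapping (the helper names are hypothetical; the "--" convention is taken from the deployment routes that already appear above):

    def model_id_to_route(model_id: str) -> str:
        """Map a Hugging Face model id to its route-safe form, e.g. facebook/opt-125m -> facebook--opt-125m."""
        return model_id.replace("/", "--")

    def route_to_model_id(route_id: str) -> str:
        """Inverse mapping, used when resolving an API path back to a model id."""
        return route_id.replace("--", "/")

    assert model_id_to_route("facebook/opt-125m") == "facebook--opt-125m"
    assert route_to_model_id("facebook--opt-125m") == "facebook/opt-125m"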

Enable resetting the generation config on the fly

For now the generation parameters are defined in YAML files; adding the ability to reset these parameters on the fly would be useful (a merge sketch follows the example config below):

    generate_kwargs:
      do_sample: false
      max_new_tokens: 512
      min_new_tokens: 16
      temperature: 0.7
      repetition_penalty: 1.1
      top_p: 0.8
      top_k: 50
      pad_token: "<|extra_0|>"
      eos_token: "<|endoftext|>"
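
A minimal sketch of what resetting on the fly could look like: per-request overrides merged over the YAML defaults before generation (the helper and the request field are hypothetical, not the project's current API):

    import copy

    # Defaults as loaded from the model yaml (see generate_kwargs above).
    yaml_generate_kwargs = {
        "do_sample": False,
        "max_new_tokens": 512,
        "min_new_tokens": 16,
        "temperature": 0.7,
        "top_p": 0.8,
        "top_k": 50,
    }

    def resolve_generate_kwargs(request_overrides: dict | None) -> dict:
        """Merge per-request overrides over the YAML defaults (hypothetical helper)."""
        merged = copy.deepcopy(yaml_generate_kwargs)
        if request_overrides:
            merged.update(request_overrides)
        return merged

    # e.g. a request body carrying {"generate_kwargs": {"temperature": 0.2, "max_new_tokens": 128}}
    print(resolve_generate_kwargs({"temperature": 0.2, "max_new_tokens": 128}))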

Failed to load qwen1_5-72b-chat-q5_k_m.gguf

(ServeController pid=9277) Traceback (most recent call last):
(ServeController pid=9277)   File "/root/miniconda3/envs/yons/lib/python3.10/site-packages/ray/serve/_private/deployment_state.py", line 656, in check_ready
(ServeController pid=9277)     _, self._version = ray.get(self._ready_obj_ref)
(ServeController pid=9277)   File "/root/miniconda3/envs/yons/lib/python3.10/site-packages/ray/_private/auto_init_hook.py", line 22, in auto_init_wrapper
(ServeController pid=9277)     return fn(*args, **kwargs)
(ServeController pid=9277)   File "/root/miniconda3/envs/yons/lib/python3.10/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
(ServeController pid=9277)     return func(*args, **kwargs)
(ServeController pid=9277)   File "/root/miniconda3/envs/yons/lib/python3.10/site-packages/ray/_private/worker.py", line 2624, in get
(ServeController pid=9277)     raise value.as_instanceof_cause()
(ServeController pid=9277) ray.exceptions.RayTaskError(RuntimeError): ray::5-72B-Chat-GGUF.initialize_and_get_metadata() (pid=9483, ip=172.17.0.3, actor_id=b5fcde3ad8e5c6c8e719d32404000000, repr=<ray.serve._private.replica.ServeReplica:Qwen--Qwen1.5-72B-Chat-GGUF:Qwen--Qwen1.5-72B-Chat-GGUF object at 0x7fa4048274c0>)
(ServeController pid=9277)   File "/root/miniconda3/envs/yons/lib/python3.10/concurrent/futures/_base.py", line 451, in result
(ServeController pid=9277)     return self.__get_result()
(ServeController pid=9277)   File "/root/miniconda3/envs/yons/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
(ServeController pid=9277)     raise self._exception
(ServeController pid=9277)   File "/root/miniconda3/envs/yons/lib/python3.10/site-packages/ray/serve/_private/replica.py", line 455, in initialize_and_get_metadata
(ServeController pid=9277)     raise RuntimeError(traceback.format_exc()) from None
(ServeController pid=9277) RuntimeError: Traceback (most recent call last):
(ServeController pid=9277)   File "/root/miniconda3/envs/yons/lib/python3.10/site-packages/ray/serve/_private/replica.py", line 445, in initialize_and_get_metadata
(ServeController pid=9277)     await self.replica.update_user_config(
(ServeController pid=9277)   File "/root/miniconda3/envs/yons/lib/python3.10/site-packages/ray/serve/_private/replica.py", line 724, in update_user_config
(ServeController pid=9277)     await reconfigure_method(user_config)
(ServeController pid=9277)   File "/data/llm-inference/llmserve/backend/server/app.py", line 151, in reconfigure
(ServeController pid=9277)     await self.rollover(
(ServeController pid=9277)   File "/data/llm-inference/llmserve/backend/llm/predictor.py", line 64, in rollover
(ServeController pid=9277)     self.new_worker_group = await self._create_worker_group(
(ServeController pid=9277)   File "/data/llm-inference/llmserve/backend/llm/predictor.py", line 154, in _create_worker_group
(ServeController pid=9277)     engine = await self.engine.launch_engine(scaling_config, self.pg, scaling_options)
(ServeController pid=9277)   File "/data/llm-inference/llmserve/backend/llm/engines/generic.py", line 333, in launch_engine
(ServeController pid=9277)     await asyncio.gather(
(ServeController pid=9277)   File "/root/miniconda3/envs/yons/lib/python3.10/asyncio/tasks.py", line 650, in _wrap_awaitable
(ServeController pid=9277)     return (yield from awaitable.__await__())
(ServeController pid=9277) ray.exceptions.RayTaskError(ValueError): ray::PredictionWorker.init_model() (pid=9703, ip=172.17.0.3, actor_id=5691b4ad8e1d62a67ddc668004000000, repr=PredictionWorker:Qwen/Qwen1.5-72B-Chat-GGUF)
(ServeController pid=9277)   File "/root/miniconda3/envs/yons/lib/python3.10/concurrent/futures/_base.py", line 451, in result
(ServeController pid=9277)     return self.__get_result()
(ServeController pid=9277)   File "/root/miniconda3/envs/yons/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
(ServeController pid=9277)     raise self._exception
(ServeController pid=9277)   File "/data/llm-inference/llmserve/backend/llm/engines/generic.py", line 217, in init_model
(ServeController pid=9277)     self.generator = init_model(
(ServeController pid=9277)   File "/data/llm-inference/llmserve/backend/llm/utils.py", line 159, in inner
(ServeController pid=9277)     ret = func(*args, **kwargs)
(ServeController pid=9277)   File "/data/llm-inference/llmserve/backend/llm/engines/generic.py", line 133, in init_model
(ServeController pid=9277)     resp_batch = generate(
(ServeController pid=9277)   File "/data/llm-inference/llmserve/backend/llm/utils.py", line 159, in inner
(ServeController pid=9277)     ret = func(*args, **kwargs)
(ServeController pid=9277)   File "/data/llm-inference/llmserve/backend/llm/engines/generic.py", line 168, in generate
(ServeController pid=9277)     outputs = pipeline(
(ServeController pid=9277)   File "/data/llm-inference/llmserve/backend/llm/pipelines/llamacpp/llamacpp_pipeline.py", line 141, in __call__
(ServeController pid=9277)     output = self.model(input, **kwargs)
(ServeController pid=9277)   File "/root/miniconda3/envs/yons/lib/python3.10/site-packages/llama_cpp/llama.py", line 1547, in __call__
(ServeController pid=9277)     return self.create_completion(
(ServeController pid=9277)   File "/root/miniconda3/envs/yons/lib/python3.10/site-packages/llama_cpp/llama.py", line 1480, in create_completion
(ServeController pid=9277)     completion: Completion = next(completion_or_chunks)  # type: ignore
(ServeController pid=9277)   File "/root/miniconda3/envs/yons/lib/python3.10/site-packages/llama_cpp/llama.py", line 959, in _create_completion
(ServeController pid=9277)     raise ValueError(
(ServeController pid=9277) ValueError: Requested tokens (818) exceed context window of 512
(ServeController pid=9277) INFO 2024-04-05 11:16:41,444 controller 9277 deployment_state.py:2185 - Replica Qwen--Qwen1.5-72B-Chat-GGUF#Qwen--Qwen1.5-72B-Chat-GGUF#ZgAOMG is stopped.
(ServeController pid=9277) INFO 2024-04-05 11:16:41,445 controller 9277 deployment_state.py:1831 - Adding 1 replica to deployment Qwen--Qwen1.5-72B-Chat-GGUF in application 'Qwen--Qwen1.5-72B-Chat-GGUF'.

[BUG] Error when trying a "translation" downstream model

Run command:
llm-serve start experimental --model ./models/translation--t5-small.yaml

The following error occurs:

(ServeController pid=26978)   File "/Users/lipeng/workspaces/github.com/depenglee1707/llm-inference/llmserve/backend/llm/pipelines/default_transformers_pipeline.py", line 125, in __call__
(ServeController pid=26978)     output = self.format_output(data[0], inputs, preprocess_time, generation_time)
(ServeController pid=26978)   File "/Users/lipeng/workspaces/github.com/depenglee1707/llm-inference/llmserve/backend/llm/pipelines/default_transformers_pipeline.py", line 183, in format_output
(ServeController pid=26978)     num_generated_tokens = len(self.tokenizer(output["generated_text"]).input_ids)
(ServeController pid=26978) TypeError: string indices must be integers

@jasonhe258 please take a look
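The "string indices must be integers" error suggests the translation pipeline hands format_output a plain string (or a dict keyed by translation_text) rather than the generated_text dict the default pipeline expects. A minimal defensive sketch of the token-counting step, assuming that shape mismatch is the cause (an illustration, not the project's actual fix):

    def extract_generated_text(output) -> str:
        """Handle both string and dict outputs from transformers pipelines (hypothetical helper)."""
        if isinstance(output, str):
            return output
        if isinstance(output, dict):
            # translation/text2text pipelines use "translation_text"; text-generation uses "generated_text"
            return output.get("generated_text") or output.get("translation_text") or ""
        return str(output)

    # num_generated_tokens = len(self.tokenizer(extract_generated_text(output)).input_ids)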

Incorrect text format generated when using the defaulttransformers pipeline

Set `pipeline: defaulttransformers` and `prompt_format: "'role': 'user', 'content': {instruction}"` in the YAML; there seems to be a text format issue in generated_text, as follows.

[{"generated_text":"'role': 'user', 'content': hello nihao\n{'role': 'user', 'content': '你好'}","num_input_tokens":2,"num_input_tokens_batch":2,"num_generated_tokens":26,"num_generated_tokens_batch":26,"preprocessing_time":0.007688470010180026,"generation_time":7.110702240024693,"postprocessing_time":0.0007505400571972132,"generation_time_per_token":0.2539536514294533,"generation_time_per_token_batch":0.2539536514294533,"num_total_tokens":28,"num_total_tokens_batch":28,"total_time":7.1191412500920705,"total_time_per_token":0.2542550446461454,"total_time_per_token_batch":0.2542550446461454}]

vllm cannot handle "runtime_env"

For Qwen/Qwen-7B, we set runtime_env like this:

  initialization:
    runtime_env:
      pip: ["transformers_stream_generator", "tiktoken"]

but at startup we still get the exception:

ImportError: This modeling file requires the following packages that were not found in your environment: tiktoken. Run `pip install tiktoken`
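For reference, Ray itself supports per-actor pip dependencies via runtime_env, so the yaml setting could in principle be forwarded to the worker actors. A minimal sketch with a hypothetical actor class (only the Ray API shown is real):

    import ray

    runtime_env = {"pip": ["transformers_stream_generator", "tiktoken"]}

    @ray.remote
    class VllmWorker:  # hypothetical stand-in for the engine's prediction worker
        def ping(self) -> str:
            import tiktoken  # importable because runtime_env installs it for this actor
            return tiktoken.__name__

    worker = VllmWorker.options(runtime_env=runtime_env).remote()
    print(ray.get(worker.ping.remote()))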

Installing the llama-cpp-python dependency failed

Using cached exceptiongroup-1.2.0-py3-none-any.whl (16 kB)
Building wheels for collected packages: deepspeed, llama-cpp-python, llm-serve, ffmpy
  Building wheel for deepspeed (setup.py) ... done
  Created wheel for deepspeed: filename=deepspeed-0.14.0-py3-none-any.whl size=1400347 sha256=db3cabb92e930a4d76b2adf48e2bae802dc28c333d54d790ab2b4256efe03fe0
  Stored in directory: /Users/hhwang/Library/Caches/pip/wheels/23/96/24/bab20c3b4e2af15e195b339afaec373eca7072cf90620432e5
  Building wheel for llama-cpp-python (pyproject.toml) ... error
  error: subprocess-exited-with-error

  × Building wheel for llama-cpp-python (pyproject.toml) did not run successfully.
  │ exit code: 1
  ╰─> [66 lines of output]
      *** scikit-build-core 0.8.2 using CMake 3.29.0 (wheel)
      *** Configuring CMake...
      2024-03-31 14:09:18,364 - scikit_build_core - WARNING - libdir/ldlibrary: /Users/hhwang/anaconda3/envs/abc/lib/libpython3.10.a is not a real file!
      2024-03-31 14:09:18,364 - scikit_build_core - WARNING - Can't find a Python library, got libdir=/Users/hhwang/anaconda3/envs/abc/lib, ldlibrary=libpython3.10.a, multiarch=darwin, masd=None
      loading initial cache file /var/folders/wm/9mckczj143jdzdzr037_g8lm0000gn/T/tmpujapt1jr/build/CMakeInit.txt
      -- The C compiler identification is AppleClang 15.0.0.15000309
      -- The CXX compiler identification is AppleClang 15.0.0.15000309
      -- Detecting C compiler ABI info
      -- Detecting C compiler ABI info - done
      -- Check for working C compiler: /Library/Developer/CommandLineTools/usr/bin/cc - skipped
      -- Detecting C compile features
      -- Detecting C compile features - done
      -- Detecting CXX compiler ABI info
      -- Detecting CXX compiler ABI info - done
      -- Check for working CXX compiler: /Library/Developer/CommandLineTools/usr/bin/c++ - skipped
      -- Detecting CXX compile features
      -- Detecting CXX compile features - done
      -- Found Git: /usr/bin/git (found version "2.39.3 (Apple Git-146)")
      -- Performing Test CMAKE_HAVE_LIBC_PTHREAD
      -- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Success
      -- Found Threads: TRUE
      -- Accelerate framework found
      -- Metal framework found
      -- Warning: ccache not found - consider installing it for faster compilation or disable this warning with LLAMA_CCACHE=OFF
      -- CMAKE_SYSTEM_PROCESSOR: arm64
      -- ARM detected
      -- Performing Test COMPILER_SUPPORTS_FP16_FORMAT_I3E
      -- Performing Test COMPILER_SUPPORTS_FP16_FORMAT_I3E - Failed
      CMake Warning (dev) at vendor/llama.cpp/CMakeLists.txt:1218 (install):
        Target llama has RESOURCE files but no RESOURCE DESTINATION.
      This warning is for project developers.  Use -Wno-dev to suppress it.

      CMake Warning (dev) at CMakeLists.txt:21 (install):
        Target llama has PUBLIC_HEADER files but no PUBLIC_HEADER DESTINATION.
      This warning is for project developers.  Use -Wno-dev to suppress it.

      CMake Warning (dev) at CMakeLists.txt:30 (install):
        Target llama has PUBLIC_HEADER files but no PUBLIC_HEADER DESTINATION.
      This warning is for project developers.  Use -Wno-dev to suppress it.

      -- Configuring done (0.5s)
      -- Generating done (0.0s)
      -- Build files have been written to: /var/folders/wm/9mckczj143jdzdzr037_g8lm0000gn/T/tmpujapt1jr/build
      *** Building project with Ninja...
      Change Dir: '/var/folders/wm/9mckczj143jdzdzr037_g8lm0000gn/T/tmpujapt1jr/build'

      Run Build Command(s): /private/var/folders/wm/9mckczj143jdzdzr037_g8lm0000gn/T/pip-build-env-h3q63wii/normal/lib/python3.10/site-packages/ninja/data/bin/ninja -v
      [1/25] cd /var/folders/wm/9mckczj143jdzdzr037_g8lm0000gn/T/tmpujapt1jr/build/vendor/llama.cpp && xcrun -sdk macosx metal -O3 -c /var/folders/wm/9mckczj143jdzdzr037_g8lm0000gn/T/tmpujapt1jr/build/bin/ggml-metal.metal -o /var/folders/wm/9mckczj143jdzdzr037_g8lm0000gn/T/tmpujapt1jr/build/bin/ggml-metal.air && xcrun -sdk macosx metallib /var/folders/wm/9mckczj143jdzdzr037_g8lm0000gn/T/tmpujapt1jr/build/bin/ggml-metal.air -o /var/folders/wm/9mckczj143jdzdzr037_g8lm0000gn/T/tmpujapt1jr/build/bin/default.metallib && rm -f /var/folders/wm/9mckczj143jdzdzr037_g8lm0000gn/T/tmpujapt1jr/build/bin/ggml-metal.air && rm -f /var/folders/wm/9mckczj143jdzdzr037_g8lm0000gn/T/tmpujapt1jr/build/bin/ggml-common.h && rm -f /var/folders/wm/9mckczj143jdzdzr037_g8lm0000gn/T/tmpujapt1jr/build/bin/ggml-metal.metal
      FAILED: bin/default.metallib /var/folders/wm/9mckczj143jdzdzr037_g8lm0000gn/T/tmpujapt1jr/build/bin/default.metallib
      cd /var/folders/wm/9mckczj143jdzdzr037_g8lm0000gn/T/tmpujapt1jr/build/vendor/llama.cpp && xcrun -sdk macosx metal -O3 -c /var/folders/wm/9mckczj143jdzdzr037_g8lm0000gn/T/tmpujapt1jr/build/bin/ggml-metal.metal -o /var/folders/wm/9mckczj143jdzdzr037_g8lm0000gn/T/tmpujapt1jr/build/bin/ggml-metal.air && xcrun -sdk macosx metallib /var/folders/wm/9mckczj143jdzdzr037_g8lm0000gn/T/tmpujapt1jr/build/bin/ggml-metal.air -o /var/folders/wm/9mckczj143jdzdzr037_g8lm0000gn/T/tmpujapt1jr/build/bin/default.metallib && rm -f /var/folders/wm/9mckczj143jdzdzr037_g8lm0000gn/T/tmpujapt1jr/build/bin/ggml-metal.air && rm -f /var/folders/wm/9mckczj143jdzdzr037_g8lm0000gn/T/tmpujapt1jr/build/bin/ggml-common.h && rm -f /var/folders/wm/9mckczj143jdzdzr037_g8lm0000gn/T/tmpujapt1jr/build/bin/ggml-metal.metal
      xcrun: error: unable to find utility "metal", not a developer tool or in PATH
      [2/25] cd /private/var/folders/wm/9mckczj143jdzdzr037_g8lm0000gn/T/pip-install-f3axqn06/llama-cpp-python_20b9c0a542fd49ac96cea9fb409a8d94/vendor/llama.cpp && /private/var/folders/wm/9mckczj143jdzdzr037_g8lm0000gn/T/pip-build-env-h3q63wii/normal/lib/python3.10/site-packages/cmake/data/bin/cmake -DMSVC= -DCMAKE_C_COMPILER_VERSION=15.0.0.15000309 -DCMAKE_C_COMPILER_ID=AppleClang -DCMAKE_VS_PLATFORM_NAME= -DCMAKE_C_COMPILER=/Library/Developer/CommandLineTools/usr/bin/cc -P /private/var/folders/wm/9mckczj143jdzdzr037_g8lm0000gn/T/pip-install-f3axqn06/llama-cpp-python_20b9c0a542fd49ac96cea9fb409a8d94/vendor/llama.cpp/common/../scripts/gen-build-info-cpp.cmake
      -- Found Git: /usr/bin/git (found version "2.39.3 (Apple Git-146)")
      [3/25] /Library/Developer/CommandLineTools/usr/bin/cc -DACCELERATE_LAPACK_ILP64 -DACCELERATE_NEW_LAPACK -DGGML_SCHED_MAX_COPIES=4 -DGGML_USE_ACCELERATE -DGGML_USE_METAL -D_DARWIN_C_SOURCE -D_XOPEN_SOURCE=600 -I/private/var/folders/wm/9mckczj143jdzdzr037_g8lm0000gn/T/pip-install-f3axqn06/llama-cpp-python_20b9c0a542fd49ac96cea9fb409a8d94/vendor/llama.cpp/. -F/Library/Developer/CommandLineTools/SDKs/MacOSX14.4.sdk/System/Library/Frameworks -O3 -DNDEBUG -std=gnu11 -arch arm64 -isysroot /Library/Developer/CommandLineTools/SDKs/MacOSX14.4.sdk -mmacosx-version-min=14.3 -fPIC -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wunreachable-code-break -Wunreachable-code-return -Wdouble-promotion -MD -MT vendor/llama.cpp/CMakeFiles/ggml.dir/ggml-alloc.c.o -MF vendor/llama.cpp/CMakeFiles/ggml.dir/ggml-alloc.c.o.d -o vendor/llama.cpp/CMakeFiles/ggml.dir/ggml-alloc.c.o -c /private/var/folders/wm/9mckczj143jdzdzr037_g8lm0000gn/T/pip-install-f3axqn06/llama-cpp-python_20b9c0a542fd49ac96cea9fb409a8d94/vendor/llama.cpp/ggml-alloc.c
      [4/25] /Library/Developer/CommandLineTools/usr/bin/cc -DACCELERATE_LAPACK_ILP64 -DACCELERATE_NEW_LAPACK -DGGML_SCHED_MAX_COPIES=4 -DGGML_USE_ACCELERATE -DGGML_USE_METAL -D_DARWIN_C_SOURCE -D_XOPEN_SOURCE=600 -I/private/var/folders/wm/9mckczj143jdzdzr037_g8lm0000gn/T/pip-install-f3axqn06/llama-cpp-python_20b9c0a542fd49ac96cea9fb409a8d94/vendor/llama.cpp/. -F/Library/Developer/CommandLineTools/SDKs/MacOSX14.4.sdk/System/Library/Frameworks -O3 -DNDEBUG -std=gnu11 -arch arm64 -isysroot /Library/Developer/CommandLineTools/SDKs/MacOSX14.4.sdk -mmacosx-version-min=14.3 -fPIC -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wunreachable-code-break -Wunreachable-code-return -Wdouble-promotion -MD -MT vendor/llama.cpp/CMakeFiles/ggml.dir/ggml-backend.c.o -MF vendor/llama.cpp/CMakeFiles/ggml.dir/ggml-backend.c.o.d -o vendor/llama.cpp/CMakeFiles/ggml.dir/ggml-backend.c.o -c /private/var/folders/wm/9mckczj143jdzdzr037_g8lm0000gn/T/pip-install-f3axqn06/llama-cpp-python_20b9c0a542fd49ac96cea9fb409a8d94/vendor/llama.cpp/ggml-backend.c
      [5/25] /Library/Developer/CommandLineTools/usr/bin/c++ -DGGML_USE_METAL -DLLAMA_BUILD -DLLAMA_SHARED -I/private/var/folders/wm/9mckczj143jdzdzr037_g8lm0000gn/T/pip-install-f3axqn06/llama-cpp-python_20b9c0a542fd49ac96cea9fb409a8d94/vendor/llama.cpp/examples/llava/. -I/private/var/folders/wm/9mckczj143jdzdzr037_g8lm0000gn/T/pip-install-f3axqn06/llama-cpp-python_20b9c0a542fd49ac96cea9fb409a8d94/vendor/llama.cpp/examples/llava/../.. -I/private/var/folders/wm/9mckczj143jdzdzr037_g8lm0000gn/T/pip-install-f3axqn06/llama-cpp-python_20b9c0a542fd49ac96cea9fb409a8d94/vendor/llama.cpp/examples/llava/../../common -I/private/var/folders/wm/9mckczj143jdzdzr037_g8lm0000gn/T/pip-install-f3axqn06/llama-cpp-python_20b9c0a542fd49ac96cea9fb409a8d94/vendor/llama.cpp/. -O3 -DNDEBUG -std=gnu++11 -arch arm64 -isysroot /Library/Developer/CommandLineTools/SDKs/MacOSX14.4.sdk -mmacosx-version-min=14.3 -fPIC -Wno-cast-qual -MD -MT vendor/llama.cpp/examples/llava/CMakeFiles/llava.dir/llava.cpp.o -MF vendor/llama.cpp/examples/llava/CMakeFiles/llava.dir/llava.cpp.o.d -o vendor/llama.cpp/examples/llava/CMakeFiles/llava.dir/llava.cpp.o -c /private/var/folders/wm/9mckczj143jdzdzr037_g8lm0000gn/T/pip-install-f3axqn06/llama-cpp-python_20b9c0a542fd49ac96cea9fb409a8d94/vendor/llama.cpp/examples/llava/llava.cpp
      [6/25] /Library/Developer/CommandLineTools/usr/bin/cc -DACCELERATE_LAPACK_ILP64 -DACCELERATE_NEW_LAPACK -DGGML_SCHED_MAX_COPIES=4 -DGGML_USE_ACCELERATE -DGGML_USE_METAL -D_DARWIN_C_SOURCE -D_XOPEN_SOURCE=600 -I/private/var/folders/wm/9mckczj143jdzdzr037_g8lm0000gn/T/pip-install-f3axqn06/llama-cpp-python_20b9c0a542fd49ac96cea9fb409a8d94/vendor/llama.cpp/. -F/Library/Developer/CommandLineTools/SDKs/MacOSX14.4.sdk/System/Library/Frameworks -O3 -DNDEBUG -std=gnu11 -arch arm64 -isysroot /Library/Developer/CommandLineTools/SDKs/MacOSX14.4.sdk -mmacosx-version-min=14.3 -fPIC -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wunreachable-code-break -Wunreachable-code-return -Wdouble-promotion -MD -MT vendor/llama.cpp/CMakeFiles/ggml.dir/ggml-metal.m.o -MF vendor/llama.cpp/CMakeFiles/ggml.dir/ggml-metal.m.o.d -o vendor/llama.cpp/CMakeFiles/ggml.dir/ggml-metal.m.o -c /private/var/folders/wm/9mckczj143jdzdzr037_g8lm0000gn/T/pip-install-f3axqn06/llama-cpp-python_20b9c0a542fd49ac96cea9fb409a8d94/vendor/llama.cpp/ggml-metal.m
      [7/25] /Library/Developer/CommandLineTools/usr/bin/cc -DACCELERATE_LAPACK_ILP64 -DACCELERATE_NEW_LAPACK -DGGML_SCHED_MAX_COPIES=4 -DGGML_USE_ACCELERATE -DGGML_USE_METAL -D_DARWIN_C_SOURCE -D_XOPEN_SOURCE=600 -I/private/var/folders/wm/9mckczj143jdzdzr037_g8lm0000gn/T/pip-install-f3axqn06/llama-cpp-python_20b9c0a542fd49ac96cea9fb409a8d94/vendor/llama.cpp/. -F/Library/Developer/CommandLineTools/SDKs/MacOSX14.4.sdk/System/Library/Frameworks -O3 -DNDEBUG -std=gnu11 -arch arm64 -isysroot /Library/Developer/CommandLineTools/SDKs/MacOSX14.4.sdk -mmacosx-version-min=14.3 -fPIC -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wunreachable-code-break -Wunreachable-code-return -Wdouble-promotion -MD -MT vendor/llama.cpp/CMakeFiles/ggml.dir/ggml-quants.c.o -MF vendor/llama.cpp/CMakeFiles/ggml.dir/ggml-quants.c.o.d -o vendor/llama.cpp/CMakeFiles/ggml.dir/ggml-quants.c.o -c /private/var/folders/wm/9mckczj143jdzdzr037_g8lm0000gn/T/pip-install-f3axqn06/llama-cpp-python_20b9c0a542fd49ac96cea9fb409a8d94/vendor/llama.cpp/ggml-quants.c
      [8/25] /Library/Developer/CommandLineTools/usr/bin/c++ -DACCELERATE_LAPACK_ILP64 -DACCELERATE_NEW_LAPACK -DGGML_SCHED_MAX_COPIES=4 -DGGML_USE_ACCELERATE -DGGML_USE_METAL -DLLAMA_BUILD -DLLAMA_SHARED -D_DARWIN_C_SOURCE -D_XOPEN_SOURCE=600 -Dllama_EXPORTS -I/private/var/folders/wm/9mckczj143jdzdzr037_g8lm0000gn/T/pip-install-f3axqn06/llama-cpp-python_20b9c0a542fd49ac96cea9fb409a8d94/vendor/llama.cpp/. -F/Library/Developer/CommandLineTools/SDKs/MacOSX14.4.sdk/System/Library/Frameworks -O3 -DNDEBUG -std=gnu++11 -arch arm64 -isysroot /Library/Developer/CommandLineTools/SDKs/MacOSX14.4.sdk -mmacosx-version-min=14.3 -fPIC -Wmissing-declarations -Wmissing-noreturn -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi -MD -MT vendor/llama.cpp/CMakeFiles/llama.dir/unicode.cpp.o -MF vendor/llama.cpp/CMakeFiles/llama.dir/unicode.cpp.o.d -o vendor/llama.cpp/CMakeFiles/llama.dir/unicode.cpp.o -c /private/var/folders/wm/9mckczj143jdzdzr037_g8lm0000gn/T/pip-install-f3axqn06/llama-cpp-python_20b9c0a542fd49ac96cea9fb409a8d94/vendor/llama.cpp/unicode.cpp
      [9/25] /Library/Developer/CommandLineTools/usr/bin/c++ -DGGML_USE_METAL -DLLAMA_BUILD -DLLAMA_SHARED -I/private/var/folders/wm/9mckczj143jdzdzr037_g8lm0000gn/T/pip-install-f3axqn06/llama-cpp-python_20b9c0a542fd49ac96cea9fb409a8d94/vendor/llama.cpp/examples/llava/. -I/private/var/folders/wm/9mckczj143jdzdzr037_g8lm0000gn/T/pip-install-f3axqn06/llama-cpp-python_20b9c0a542fd49ac96cea9fb409a8d94/vendor/llama.cpp/examples/llava/../.. -I/private/var/folders/wm/9mckczj143jdzdzr037_g8lm0000gn/T/pip-install-f3axqn06/llama-cpp-python_20b9c0a542fd49ac96cea9fb409a8d94/vendor/llama.cpp/examples/llava/../../common -I/private/var/folders/wm/9mckczj143jdzdzr037_g8lm0000gn/T/pip-install-f3axqn06/llama-cpp-python_20b9c0a542fd49ac96cea9fb409a8d94/vendor/llama.cpp/. -O3 -DNDEBUG -std=gnu++11 -arch arm64 -isysroot /Library/Developer/CommandLineTools/SDKs/MacOSX14.4.sdk -mmacosx-version-min=14.3 -fPIC -Wno-cast-qual -MD -MT vendor/llama.cpp/examples/llava/CMakeFiles/llava.dir/clip.cpp.o -MF vendor/llama.cpp/examples/llava/CMakeFiles/llava.dir/clip.cpp.o.d -o vendor/llama.cpp/examples/llava/CMakeFiles/llava.dir/clip.cpp.o -c /private/var/folders/wm/9mckczj143jdzdzr037_g8lm0000gn/T/pip-install-f3axqn06/llama-cpp-python_20b9c0a542fd49ac96cea9fb409a8d94/vendor/llama.cpp/examples/llava/clip.cpp
      [10/25] /Library/Developer/CommandLineTools/usr/bin/cc -DACCELERATE_LAPACK_ILP64 -DACCELERATE_NEW_LAPACK -DGGML_SCHED_MAX_COPIES=4 -DGGML_USE_ACCELERATE -DGGML_USE_METAL -D_DARWIN_C_SOURCE -D_XOPEN_SOURCE=600 -I/private/var/folders/wm/9mckczj143jdzdzr037_g8lm0000gn/T/pip-install-f3axqn06/llama-cpp-python_20b9c0a542fd49ac96cea9fb409a8d94/vendor/llama.cpp/. -F/Library/Developer/CommandLineTools/SDKs/MacOSX14.4.sdk/System/Library/Frameworks -O3 -DNDEBUG -std=gnu11 -arch arm64 -isysroot /Library/Developer/CommandLineTools/SDKs/MacOSX14.4.sdk -mmacosx-version-min=14.3 -fPIC -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wunreachable-code-break -Wunreachable-code-return -Wdouble-promotion -MD -MT vendor/llama.cpp/CMakeFiles/ggml.dir/ggml.c.o -MF vendor/llama.cpp/CMakeFiles/ggml.dir/ggml.c.o.d -o vendor/llama.cpp/CMakeFiles/ggml.dir/ggml.c.o -c /private/var/folders/wm/9mckczj143jdzdzr037_g8lm0000gn/T/pip-install-f3axqn06/llama-cpp-python_20b9c0a542fd49ac96cea9fb409a8d94/vendor/llama.cpp/ggml.c
      [11/25] /Library/Developer/CommandLineTools/usr/bin/c++ -DACCELERATE_LAPACK_ILP64 -DACCELERATE_NEW_LAPACK -DGGML_SCHED_MAX_COPIES=4 -DGGML_USE_ACCELERATE -DGGML_USE_METAL -DLLAMA_BUILD -DLLAMA_SHARED -D_DARWIN_C_SOURCE -D_XOPEN_SOURCE=600 -Dllama_EXPORTS -I/private/var/folders/wm/9mckczj143jdzdzr037_g8lm0000gn/T/pip-install-f3axqn06/llama-cpp-python_20b9c0a542fd49ac96cea9fb409a8d94/vendor/llama.cpp/. -F/Library/Developer/CommandLineTools/SDKs/MacOSX14.4.sdk/System/Library/Frameworks -O3 -DNDEBUG -std=gnu++11 -arch arm64 -isysroot /Library/Developer/CommandLineTools/SDKs/MacOSX14.4.sdk -mmacosx-version-min=14.3 -fPIC -Wmissing-declarations -Wmissing-noreturn -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi -MD -MT vendor/llama.cpp/CMakeFiles/llama.dir/llama.cpp.o -MF vendor/llama.cpp/CMakeFiles/llama.dir/llama.cpp.o.d -o vendor/llama.cpp/CMakeFiles/llama.dir/llama.cpp.o -c /private/var/folders/wm/9mckczj143jdzdzr037_g8lm0000gn/T/pip-install-f3axqn06/llama-cpp-python_20b9c0a542fd49ac96cea9fb409a8d94/vendor/llama.cpp/llama.cpp
      ninja: build stopped: subcommand failed.


      *** CMake build failed
      [end of output]

  note: This error originates from a subprocess, and is likely not a problem with pip.
  ERROR: Failed building wheel for llama-cpp-python
  Building wheel for llm-serve (pyproject.toml) ... done
  Created wheel for llm-serve: filename=llm_serve-0.0.1-py3-none-any.whl size=100808 sha256=5896e4e7b35cf15f8977a5847a9ff40f78ed2ae42e95adc28def70cefc2b426c
  Stored in directory: /Users/hhwang/Library/Caches/pip/wheels/cb/6e/71/619b3e1f616ba182cb9bfc8e0e239a9e8402f4305bc75d27d7
  Building wheel for ffmpy (setup.py) ... done
  Created wheel for ffmpy: filename=ffmpy-0.3.2-py3-none-any.whl size=5582 sha256=f2f3304e01d27a1e9f63c8c504d5d56cf0a5c40ec98c2e805c1a5d8c41ea17be
  Stored in directory: /Users/hhwang/Library/Caches/pip/wheels/bd/65/9a/671fc6dcde07d4418df0c592f8df512b26d7a0029c2a23dd81
Successfully built deepspeed llm-serve ffmpy
Failed to build llama-cpp-python
ERROR: Could not build wheels for llama-cpp-python, which is required to install pyproject.toml-based projects
