
opencsgs / llm-inference


llm-inference is a platform for publishing and managing llm inference, providing a wide range of out-of-the-box features for model deployment, such as UI, RESTful API, auto-scaling, computing resource management, monitoring, and more.

License: Apache License 2.0

Languages: Python 97.33%, Jupyter Notebook 1.58%, JavaScript 0.59%, Shell 0.31%, Dockerfile 0.19%
Topics: deepspeed, llama-cpp, llm-inference, ray, transformer, vllm

llm-inference's People

Contributors

depenglee1707, jasonhe258, pulltheflower, seanhh86, wanggxa


llm-inference's Issues

The usage description of `llm-serve` in quick_start.md is not correct

Wrong:

# llm-serve --help

 Usage: llm-serve [OPTIONS] COMMAND [ARGS]...

╭─ Options ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ --help          Show this message and exit.                                                                                                                        │
╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
╭─ Commands ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ evaluate     Evaluate and summarize the results of a multi_query run with a strong 'evaluator' LLM like GPT-4.                                                     │
│ list         List available model(s) and deployed serving etc.                                                                                                     │
│ predict      Predict one or several models with one or multiple prompts, optionally read from file, and save the results to a file.                                │
│ start        Start application(s) for LLM serving, API server, experimention, fine-tuning and comparation.                                                         │
│ stop         Stop application(s) for LLM serving and API server.                                                                                                   │
╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯

The `evaluate` subcommand has already been deprecated and removed.

Inference Gradio web UI responds with random words for the DeepSeek instruct model

The inference Gradio web UI responds with random words for the DeepSeek instruct model (screenshot omitted).

While using the REST API, everything seems to be OK:

curl -H "Content-Type: application/json" -X POST -d '{"prompt": "写一个快排吧"}' "http://127.0.0.1:8000/api/v1/default/opencsg--opencsg-deepseek-coder-1.3b-v0.1/run/predict" {"generated_text":"}\n\n\n# 快排\ndef quicksort(arr):\n if len(arr) <= 1:\n return arr\n pivot = arr[len(arr) // 2]\n left = [x for x in arr if x < pivot]\n middle = [x for x in arr if x == pivot]\n right = [x for x in arr if x > pivot]\n return quicksort(left) + middle + quicksort(right)\n\n# 测试\nprint(quicksort(arr))\n\n# 输出: [1, 2, 3, 4, 5, 6, 7, 8, 9]\n```\n\n这个程序使用了快速排序算法,它是一种高效的排序算法,基于分治法的原理。它选择一个元素作为枢轴,并根据它们与枢轴的大小将其他元素分成两个子数组,然后递归地对子数组进行排序。\n\n快速排序的平均时间复杂度为O(n log n),最坏情况下的时间复杂度为O(n^2),但这种情况很少发生。","num_input_tokens":16,"num_input_tokens_batch":16,"num_generated_tokens":267,"num_generated_tokens_batch":267,"preprocessing_time":0.008793507993686944,"generation_time":2.4766286090016365,"postprocessing_time":0.0009328589949291199,"generation_time_per_token":0.008751337840995181,"generation_time_per_token_batch":0.008751337840995181,"num_total_tokens":283,"num_total_tokens_batch":283,"total_time":2.4863549759902526,"total_time_per_token":0.008785706628940822,"total_time_per_token_batch":0.008785706628940822}(.llm-inference) root@opencsg-gpu1-4090:~/pl/workspace/depenglee/llm-inference#

Requested tokens (817) exceed context window of 512

(PredictionWorker:Qwen/Qwen1.5-72B-Chat-GGUF pid=42050) [INFO 2024-04-16 09:34:13,880] llamacpp_pipeline.py: 212 generate_kwargs: {'max_tokens': 1024, 'echo': False, 'stop': ['<|im_end|>'], 'logits_processor': [], 'stopping_criteria': []}
(ServeController pid=41618) ERROR 2024-04-16 09:34:14,246 controller 41618 deployment_state.py:658 - Exception in replica 'default#Qwen--Qwen1.5-72B-Chat-GGUF#dMqscG', the replica will be stopped.
(ServeController pid=41618) Traceback (most recent call last):
(ServeController pid=41618) File "/root/miniconda3/envs/yons/lib/python3.10/site-packages/ray/serve/_private/deployment_state.py", line 656, in check_ready
(ServeController pid=41618) _, self._version = ray.get(self._ready_obj_ref)
(ServeController pid=41618) File "/root/miniconda3/envs/yons/lib/python3.10/site-packages/ray/_private/auto_init_hook.py", line 22, in auto_init_wrapper
(ServeController pid=41618) return fn(*args, **kwargs)
(ServeController pid=41618) File "/root/miniconda3/envs/yons/lib/python3.10/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
(ServeController pid=41618) return func(*args, **kwargs)
(ServeController pid=41618) File "/root/miniconda3/envs/yons/lib/python3.10/site-packages/ray/_private/worker.py", line 2624, in get
(ServeController pid=41618) raise value.as_instanceof_cause()
(ServeController pid=41618) ray.exceptions.RayTaskError(RuntimeError): ray::5-72B-Chat-GGUF.initialize_and_get_metadata() (pid=41823, ip=172.17.0.2, actor_id=6aff10f7a7934a83f523892907000000, repr=<ray.serve._private.replica.ServeReplica:default:Qwen--Qwen1.5-72B-Chat-GGUF object at 0x7f24110af4c0>)
(ServeController pid=41618) File "/root/miniconda3/envs/yons/lib/python3.10/concurrent/futures/_base.py", line 451, in result
(ServeController pid=41618) return self.__get_result()
(ServeController pid=41618) File "/root/miniconda3/envs/yons/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
(ServeController pid=41618) raise self._exception
(ServeController pid=41618) File "/root/miniconda3/envs/yons/lib/python3.10/site-packages/ray/serve/_private/replica.py", line 455, in initialize_and_get_metadata
(ServeController pid=41618) raise RuntimeError(traceback.format_exc()) from None
(ServeController pid=41618) RuntimeError: Traceback (most recent call last):
(ServeController pid=41618) File "/root/miniconda3/envs/yons/lib/python3.10/site-packages/ray/serve/_private/replica.py", line 445, in initialize_and_get_metadata
(ServeController pid=41618) await self.replica.update_user_config(
(ServeController pid=41618) File "/root/miniconda3/envs/yons/lib/python3.10/site-packages/ray/serve/_private/replica.py", line 724, in update_user_config
(ServeController pid=41618) await reconfigure_method(user_config)
(ServeController pid=41618) File "/data/llm-inference/llmserve/backend/server/app.py", line 154, in reconfigure
(ServeController pid=41618) await self.rollover(
(ServeController pid=41618) File "/data/llm-inference/llmserve/backend/llm/predictor.py", line 64, in rollover
(ServeController pid=41618) self.new_worker_group = await self._create_worker_group(
(ServeController pid=41618) File "/data/llm-inference/llmserve/backend/llm/predictor.py", line 159, in _create_worker_group
(ServeController pid=41618) engine = await self.engine.launch_engine(scaling_config, self.pg, scaling_options)
(ServeController pid=41618) File "/data/llm-inference/llmserve/backend/llm/engines/generic.py", line 367, in launch_engine
(ServeController pid=41618) await asyncio.gather(
(ServeController pid=41618) File "/root/miniconda3/envs/yons/lib/python3.10/asyncio/tasks.py", line 650, in _wrap_awaitable
(ServeController pid=41618) return (yield from awaitable.__await__())
(ServeController pid=41618) ray.exceptions.RayTaskError(ValueError): ray::PredictionWorker.init_model() (pid=42050, ip=172.17.0.2, actor_id=b7ddc7c61575fad3b581750d07000000, repr=PredictionWorker:Qwen/Qwen1.5-72B-Chat-GGUF)
(ServeController pid=41618) File "/data/llm-inference/llmserve/backend/llm/engines/generic.py", line 236, in init_model
(ServeController pid=41618) self.generator = init_model(
(ServeController pid=41618) File "/data/llm-inference/llmserve/backend/llm/utils.py", line 161, in inner
(ServeController pid=41618) ret = func(*args, **kwargs)
(ServeController pid=41618) File "/data/llm-inference/llmserve/backend/llm/engines/generic.py", line 133, in init_model
(ServeController pid=41618) resp_batch = generate(
(ServeController pid=41618) File "/data/llm-inference/llmserve/backend/llm/utils.py", line 161, in inner
(ServeController pid=41618) ret = func(*args, **kwargs)
(ServeController pid=41618) File "/data/llm-inference/llmserve/backend/llm/engines/generic.py", line 168, in generate
(ServeController pid=41618) outputs = pipeline(
(ServeController pid=41618) File "/data/llm-inference/llmserve/backend/llm/pipelines/llamacpp/llamacpp_pipeline.py", line 102, in call
(ServeController pid=41618) for batch_response in self.stream(inputs, **kwargs):
(ServeController pid=41618) File "/data/llm-inference/llmserve/backend/llm/pipelines/llamacpp/llamacpp_pipeline.py", line 214, in stream
(ServeController pid=41618) for token in output:
(ServeController pid=41618) File "/root/miniconda3/envs/yons/lib/python3.10/site-packages/llama_cpp/llama.py", line 970, in _create_completion
(ServeController pid=41618) raise ValueError(
(ServeController pid=41618) ValueError: Requested tokens (817) exceed context window of 512
(ServeController pid=41618) INFO 2024-04-16 09:34:16,388 controller 41618 deployment_state.py:2185 - Replica default#Qwen--Qwen1.5-72B-Chat-GGUF#dMqscG is stopped.
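llama-cpp-python defaults to a 512-token context window (n_ctx=512), which is what the 817 requested tokens overflow here. A minimal sketch of the knob involved, assuming the pipeline ultimately constructs a llama_cpp.Llama from the GGUF file (the model path below is illustrative):

    from llama_cpp import Llama

    # n_ctx defaults to 512; requests beyond that raise
    # "Requested tokens (...) exceed context window of 512".
    llm = Llama(model_path="qwen1_5-72b-chat-q5_k_m.gguf", n_ctx=4096)  # enlarge the context window
    out = llm("你好", max_tokens=1024, stop=["<|im_end|>"])
    print(out["choices"][0]["text"])

Exposing n_ctx (or an equivalent setting) through the model yaml would let users work around this without code changes.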

Model streaming API enhancement

  1. Load-balance streaming requests across multiple predictor workers.
  2. More parameters for the model.generate API.
  3. Streaming support for the DefaultTransformersPipeline class.
  4. RouterDeployment API support for the "/{model}/run/predict" route format.
  5. Model ID mapping for the API server, e.g. mapping facebook/opt-125m to facebook--opt-125m (see the sketch below).
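For item 5, a minimal sketch of the intended mapping (the helper names are hypothetical; the "--" convention is taken from the deployment routes that already appear above):

    def model_id_to_route(model_id: str) -> str:
        """Map a Hugging Face model id to its route-safe form, e.g. facebook/opt-125m -> facebook--opt-125m."""
        return model_id.replace("/", "--")

    def route_to_model_id(route_id: str) -> str:
        """Inverse mapping, used when resolving an API path back to a model id."""
        return route_id.replace("--", "/")

    assert model_id_to_route("facebook/opt-125m") == "facebook--opt-125m"
    assert route_to_model_id("facebook--opt-125m") == "facebook/opt-125m"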

Enable resetting the generation config on the fly

For now the generation parameters are defined in YAML files; adding the ability to reset these parameters on the fly would be useful (a merge sketch follows the example config below):

    generate_kwargs:
      do_sample: false
      max_new_tokens: 512
      min_new_tokens: 16
      temperature: 0.7
      repetition_penalty: 1.1
      top_p: 0.8
      top_k: 50
      pad_token: "<|extra_0|>"
      eos_token: "<|endoftext|>"
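
A minimal sketch of what resetting on the fly could look like: per-request overrides merged over the YAML defaults before generation (the helper and the request field are hypothetical, not the project's current API):

    import copy

    # Defaults as loaded from the model yaml (see generate_kwargs above).
    yaml_generate_kwargs = {
        "do_sample": False,
        "max_new_tokens": 512,
        "min_new_tokens": 16,
        "temperature": 0.7,
        "top_p": 0.8,
        "top_k": 50,
    }

    def resolve_generate_kwargs(request_overrides: dict | None) -> dict:
        """Merge per-request overrides over the YAML defaults (hypothetical helper)."""
        merged = copy.deepcopy(yaml_generate_kwargs)
        if request_overrides:
            merged.update(request_overrides)
        return merged

    # e.g. a request body carrying {"generate_kwargs": {"temperature": 0.2, "max_new_tokens": 128}}
    print(resolve_generate_kwargs({"temperature": 0.2, "max_new_tokens": 128}))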

Failed to load qwen1_5-72b-chat-q5_k_m.gguf

(ServeController pid=9277) Traceback (most recent call last):
(ServeController pid=9277)   File "/root/miniconda3/envs/yons/lib/python3.10/site-packages/ray/serve/_private/deployment_state.py", line 656, in check_ready
(ServeController pid=9277)     _, self._version = ray.get(self._ready_obj_ref)
(ServeController pid=9277)   File "/root/miniconda3/envs/yons/lib/python3.10/site-packages/ray/_private/auto_init_hook.py", line 22, in auto_init_wrapper
(ServeController pid=9277)     return fn(*args, **kwargs)
(ServeController pid=9277)   File "/root/miniconda3/envs/yons/lib/python3.10/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
(ServeController pid=9277)     return func(*args, **kwargs)
(ServeController pid=9277)   File "/root/miniconda3/envs/yons/lib/python3.10/site-packages/ray/_private/worker.py", line 2624, in get
(ServeController pid=9277)     raise value.as_instanceof_cause()
(ServeController pid=9277) ray.exceptions.RayTaskError(RuntimeError): ray::5-72B-Chat-GGUF.initialize_and_get_metadata() (pid=9483, ip=172.17.0.3, actor_id=b5fcde3ad8e5c6c8e719d32404000000, repr=<ray.serve._private.replica.ServeReplica:Qwen--Qwen1.5-72B-Chat-GGUF:Qwen--Qwen1.5-72B-Chat-GGUF object at 0x7fa4048274c0>)
(ServeController pid=9277)   File "/root/miniconda3/envs/yons/lib/python3.10/concurrent/futures/_base.py", line 451, in result
(ServeController pid=9277)     return self.__get_result()
(ServeController pid=9277)   File "/root/miniconda3/envs/yons/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
(ServeController pid=9277)     raise self._exception
(ServeController pid=9277)   File "/root/miniconda3/envs/yons/lib/python3.10/site-packages/ray/serve/_private/replica.py", line 455, in initialize_and_get_metadata
(ServeController pid=9277)     raise RuntimeError(traceback.format_exc()) from None
(ServeController pid=9277) RuntimeError: Traceback (most recent call last):
(ServeController pid=9277)   File "/root/miniconda3/envs/yons/lib/python3.10/site-packages/ray/serve/_private/replica.py", line 445, in initialize_and_get_metadata
(ServeController pid=9277)     await self.replica.update_user_config(
(ServeController pid=9277)   File "/root/miniconda3/envs/yons/lib/python3.10/site-packages/ray/serve/_private/replica.py", line 724, in update_user_config
(ServeController pid=9277)     await reconfigure_method(user_config)
(ServeController pid=9277)   File "/data/llm-inference/llmserve/backend/server/app.py", line 151, in reconfigure
(ServeController pid=9277)     await self.rollover(
(ServeController pid=9277)   File "/data/llm-inference/llmserve/backend/llm/predictor.py", line 64, in rollover
(ServeController pid=9277)     self.new_worker_group = await self._create_worker_group(
(ServeController pid=9277)   File "/data/llm-inference/llmserve/backend/llm/predictor.py", line 154, in _create_worker_group
(ServeController pid=9277)     engine = await self.engine.launch_engine(scaling_config, self.pg, scaling_options)
(ServeController pid=9277)   File "/data/llm-inference/llmserve/backend/llm/engines/generic.py", line 333, in launch_engine
(ServeController pid=9277)     await asyncio.gather(
(ServeController pid=9277)   File "/root/miniconda3/envs/yons/lib/python3.10/asyncio/tasks.py", line 650, in _wrap_awaitable
(ServeController pid=9277)     return (yield from awaitable.__await__())
(ServeController pid=9277) ray.exceptions.RayTaskError(ValueError): ray::PredictionWorker.init_model() (pid=9703, ip=172.17.0.3, actor_id=5691b4ad8e1d62a67ddc668004000000, repr=PredictionWorker:Qwen/Qwen1.5-72B-Chat-GGUF)
(ServeController pid=9277)   File "/root/miniconda3/envs/yons/lib/python3.10/concurrent/futures/_base.py", line 451, in result
(ServeController pid=9277)     return self.__get_result()
(ServeController pid=9277)   File "/root/miniconda3/envs/yons/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
(ServeController pid=9277)     raise self._exception
(ServeController pid=9277)   File "/data/llm-inference/llmserve/backend/llm/engines/generic.py", line 217, in init_model
(ServeController pid=9277)     self.generator = init_model(
(ServeController pid=9277)   File "/data/llm-inference/llmserve/backend/llm/utils.py", line 159, in inner
(ServeController pid=9277)     ret = func(*args, **kwargs)
(ServeController pid=9277)   File "/data/llm-inference/llmserve/backend/llm/engines/generic.py", line 133, in init_model
(ServeController pid=9277)     resp_batch = generate(
(ServeController pid=9277)   File "/data/llm-inference/llmserve/backend/llm/utils.py", line 159, in inner
(ServeController pid=9277)     ret = func(*args, **kwargs)
(ServeController pid=9277)   File "/data/llm-inference/llmserve/backend/llm/engines/generic.py", line 168, in generate
(ServeController pid=9277)     outputs = pipeline(
(ServeController pid=9277)   File "/data/llm-inference/llmserve/backend/llm/pipelines/llamacpp/llamacpp_pipeline.py", line 141, in __call__
(ServeController pid=9277)     output = self.model(input, **kwargs)
(ServeController pid=9277)   File "/root/miniconda3/envs/yons/lib/python3.10/site-packages/llama_cpp/llama.py", line 1547, in __call__
(ServeController pid=9277)     return self.create_completion(
(ServeController pid=9277)   File "/root/miniconda3/envs/yons/lib/python3.10/site-packages/llama_cpp/llama.py", line 1480, in create_completion
(ServeController pid=9277)     completion: Completion = next(completion_or_chunks)  # type: ignore
(ServeController pid=9277)   File "/root/miniconda3/envs/yons/lib/python3.10/site-packages/llama_cpp/llama.py", line 959, in _create_completion
(ServeController pid=9277)     raise ValueError(
(ServeController pid=9277) ValueError: Requested tokens (818) exceed context window of 512
(ServeController pid=9277) INFO 2024-04-05 11:16:41,444 controller 9277 deployment_state.py:2185 - Replica Qwen--Qwen1.5-72B-Chat-GGUF#Qwen--Qwen1.5-72B-Chat-GGUF#ZgAOMG is stopped.
(ServeController pid=9277) INFO 2024-04-05 11:16:41,445 controller 9277 deployment_state.py:1831 - Adding 1 replica to deployment Qwen--Qwen1.5-72B-Chat-GGUF in application 'Qwen--Qwen1.5-72B-Chat-GGUF'.

[BUG] Error when trying a "translation" downstream model

Run command:
llm-serve start experimental --model ./models/translation--t5-small.yaml

The following error occurs:

(ServeController pid=26978)   File "/Users/lipeng/workspaces/github.com/depenglee1707/llm-inference/llmserve/backend/llm/pipelines/default_transformers_pipeline.py", line 125, in __call__
(ServeController pid=26978)     output = self.format_output(data[0], inputs, preprocess_time, generation_time)
(ServeController pid=26978)   File "/Users/lipeng/workspaces/github.com/depenglee1707/llm-inference/llmserve/backend/llm/pipelines/default_transformers_pipeline.py", line 183, in format_output
(ServeController pid=26978)     num_generated_tokens = len(self.tokenizer(output["generated_text"]).input_ids)
(ServeController pid=26978) TypeError: string indices must be integers

@jasonhe258 please take a look
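The "string indices must be integers" error suggests the translation pipeline hands format_output a plain string (or a dict keyed by translation_text) rather than the generated_text dict the default pipeline expects. A minimal defensive sketch of the token-counting step, assuming that shape mismatch is the cause (an illustration, not the project's actual fix):

    def extract_generated_text(output) -> str:
        """Handle both string and dict outputs from transformers pipelines (hypothetical helper)."""
        if isinstance(output, str):
            return output
        if isinstance(output, dict):
            # translation/text2text pipelines use "translation_text"; text-generation uses "generated_text"
            return output.get("generated_text") or output.get("translation_text") or ""
        return str(output)

    # num_generated_tokens = len(self.tokenizer(extract_generated_text(output)).input_ids)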

Incorrect text format generated when using the defaulttransformers pipeline

Set `pipeline: defaulttransformers` and `prompt_format: "'role': 'user', 'content': {instruction}"` in the YAML; there seems to be a text format issue in generated_text, as follows.

[{"generated_text":"'role': 'user', 'content': hello nihao\n{'role': 'user', 'content': '你好'}","num_input_tokens":2,"num_input_tokens_batch":2,"num_generated_tokens":26,"num_generated_tokens_batch":26,"preprocessing_time":0.007688470010180026,"generation_time":7.110702240024693,"postprocessing_time":0.0007505400571972132,"generation_time_per_token":0.2539536514294533,"generation_time_per_token_batch":0.2539536514294533,"num_total_tokens":28,"num_total_tokens_batch":28,"total_time":7.1191412500920705,"total_time_per_token":0.2542550446461454,"total_time_per_token_batch":0.2542550446461454}]

vllm cannot handle "runtime_env"

For Qwen/Qwen-7B, we set runtime_env like this:

  initialization:
    runtime_env:
      pip: ["transformers_stream_generator", "tiktoken"]

but at startup we still get the exception:

ImportError: This modeling file requires the following packages that were not found in your environment: tiktoken. Run `pip install tiktoken`
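For reference, Ray itself supports per-actor pip dependencies via runtime_env, so the yaml setting could in principle be forwarded to the worker actors. A minimal sketch with a hypothetical actor class (only the Ray API shown is real):

    import ray

    runtime_env = {"pip": ["transformers_stream_generator", "tiktoken"]}

    @ray.remote
    class VllmWorker:  # hypothetical stand-in for the engine's prediction worker
        def ping(self) -> str:
            import tiktoken  # importable because runtime_env installs it for this actor
            return tiktoken.__name__

    worker = VllmWorker.options(runtime_env=runtime_env).remote()
    print(ray.get(worker.ping.remote()))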

Installing the llama-cpp-python dependency failed

Using cached exceptiongroup-1.2.0-py3-none-any.whl (16 kB)
Building wheels for collected packages: deepspeed, llama-cpp-python, llm-serve, ffmpy
  Building wheel for deepspeed (setup.py) ... done
  Created wheel for deepspeed: filename=deepspeed-0.14.0-py3-none-any.whl size=1400347 sha256=db3cabb92e930a4d76b2adf48e2bae802dc28c333d54d790ab2b4256efe03fe0
  Stored in directory: /Users/hhwang/Library/Caches/pip/wheels/23/96/24/bab20c3b4e2af15e195b339afaec373eca7072cf90620432e5
  Building wheel for llama-cpp-python (pyproject.toml) ... error
  error: subprocess-exited-with-error

  × Building wheel for llama-cpp-python (pyproject.toml) did not run successfully.
  │ exit code: 1
  ╰─> [66 lines of output]
      *** scikit-build-core 0.8.2 using CMake 3.29.0 (wheel)
      *** Configuring CMake...
      2024-03-31 14:09:18,364 - scikit_build_core - WARNING - libdir/ldlibrary: /Users/hhwang/anaconda3/envs/abc/lib/libpython3.10.a is not a real file!
      2024-03-31 14:09:18,364 - scikit_build_core - WARNING - Can't find a Python library, got libdir=/Users/hhwang/anaconda3/envs/abc/lib, ldlibrary=libpython3.10.a, multiarch=darwin, masd=None
      loading initial cache file /var/folders/wm/9mckczj143jdzdzr037_g8lm0000gn/T/tmpujapt1jr/build/CMakeInit.txt
      -- The C compiler identification is AppleClang 15.0.0.15000309
      -- The CXX compiler identification is AppleClang 15.0.0.15000309
      -- Detecting C compiler ABI info
      -- Detecting C compiler ABI info - done
      -- Check for working C compiler: /Library/Developer/CommandLineTools/usr/bin/cc - skipped
      -- Detecting C compile features
      -- Detecting C compile features - done
      -- Detecting CXX compiler ABI info
      -- Detecting CXX compiler ABI info - done
      -- Check for working CXX compiler: /Library/Developer/CommandLineTools/usr/bin/c++ - skipped
      -- Detecting CXX compile features
      -- Detecting CXX compile features - done
      -- Found Git: /usr/bin/git (found version "2.39.3 (Apple Git-146)")
      -- Performing Test CMAKE_HAVE_LIBC_PTHREAD
      -- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Success
      -- Found Threads: TRUE
      -- Accelerate framework found
      -- Metal framework found
      -- Warning: ccache not found - consider installing it for faster compilation or disable this warning with LLAMA_CCACHE=OFF
      -- CMAKE_SYSTEM_PROCESSOR: arm64
      -- ARM detected
      -- Performing Test COMPILER_SUPPORTS_FP16_FORMAT_I3E
      -- Performing Test COMPILER_SUPPORTS_FP16_FORMAT_I3E - Failed
      CMake Warning (dev) at vendor/llama.cpp/CMakeLists.txt:1218 (install):
        Target llama has RESOURCE files but no RESOURCE DESTINATION.
      This warning is for project developers.  Use -Wno-dev to suppress it.

      CMake Warning (dev) at CMakeLists.txt:21 (install):
        Target llama has PUBLIC_HEADER files but no PUBLIC_HEADER DESTINATION.
      This warning is for project developers.  Use -Wno-dev to suppress it.

      CMake Warning (dev) at CMakeLists.txt:30 (install):
        Target llama has PUBLIC_HEADER files but no PUBLIC_HEADER DESTINATION.
      This warning is for project developers.  Use -Wno-dev to suppress it.

      -- Configuring done (0.5s)
      -- Generating done (0.0s)
      -- Build files have been written to: /var/folders/wm/9mckczj143jdzdzr037_g8lm0000gn/T/tmpujapt1jr/build
      *** Building project with Ninja...
      Change Dir: '/var/folders/wm/9mckczj143jdzdzr037_g8lm0000gn/T/tmpujapt1jr/build'

      Run Build Command(s): /private/var/folders/wm/9mckczj143jdzdzr037_g8lm0000gn/T/pip-build-env-h3q63wii/normal/lib/python3.10/site-packages/ninja/data/bin/ninja -v
      [1/25] cd /var/folders/wm/9mckczj143jdzdzr037_g8lm0000gn/T/tmpujapt1jr/build/vendor/llama.cpp && xcrun -sdk macosx metal -O3 -c /var/folders/wm/9mckczj143jdzdzr037_g8lm0000gn/T/tmpujapt1jr/build/bin/ggml-metal.metal -o /var/folders/wm/9mckczj143jdzdzr037_g8lm0000gn/T/tmpujapt1jr/build/bin/ggml-metal.air && xcrun -sdk macosx metallib /var/folders/wm/9mckczj143jdzdzr037_g8lm0000gn/T/tmpujapt1jr/build/bin/ggml-metal.air -o /var/folders/wm/9mckczj143jdzdzr037_g8lm0000gn/T/tmpujapt1jr/build/bin/default.metallib && rm -f /var/folders/wm/9mckczj143jdzdzr037_g8lm0000gn/T/tmpujapt1jr/build/bin/ggml-metal.air && rm -f /var/folders/wm/9mckczj143jdzdzr037_g8lm0000gn/T/tmpujapt1jr/build/bin/ggml-common.h && rm -f /var/folders/wm/9mckczj143jdzdzr037_g8lm0000gn/T/tmpujapt1jr/build/bin/ggml-metal.metal
      FAILED: bin/default.metallib /var/folders/wm/9mckczj143jdzdzr037_g8lm0000gn/T/tmpujapt1jr/build/bin/default.metallib
      cd /var/folders/wm/9mckczj143jdzdzr037_g8lm0000gn/T/tmpujapt1jr/build/vendor/llama.cpp && xcrun -sdk macosx metal -O3 -c /var/folders/wm/9mckczj143jdzdzr037_g8lm0000gn/T/tmpujapt1jr/build/bin/ggml-metal.metal -o /var/folders/wm/9mckczj143jdzdzr037_g8lm0000gn/T/tmpujapt1jr/build/bin/ggml-metal.air && xcrun -sdk macosx metallib /var/folders/wm/9mckczj143jdzdzr037_g8lm0000gn/T/tmpujapt1jr/build/bin/ggml-metal.air -o /var/folders/wm/9mckczj143jdzdzr037_g8lm0000gn/T/tmpujapt1jr/build/bin/default.metallib && rm -f /var/folders/wm/9mckczj143jdzdzr037_g8lm0000gn/T/tmpujapt1jr/build/bin/ggml-metal.air && rm -f /var/folders/wm/9mckczj143jdzdzr037_g8lm0000gn/T/tmpujapt1jr/build/bin/ggml-common.h && rm -f /var/folders/wm/9mckczj143jdzdzr037_g8lm0000gn/T/tmpujapt1jr/build/bin/ggml-metal.metal
      xcrun: error: unable to find utility "metal", not a developer tool or in PATH
      [2/25] cd /private/var/folders/wm/9mckczj143jdzdzr037_g8lm0000gn/T/pip-install-f3axqn06/llama-cpp-python_20b9c0a542fd49ac96cea9fb409a8d94/vendor/llama.cpp && /private/var/folders/wm/9mckczj143jdzdzr037_g8lm0000gn/T/pip-build-env-h3q63wii/normal/lib/python3.10/site-packages/cmake/data/bin/cmake -DMSVC= -DCMAKE_C_COMPILER_VERSION=15.0.0.15000309 -DCMAKE_C_COMPILER_ID=AppleClang -DCMAKE_VS_PLATFORM_NAME= -DCMAKE_C_COMPILER=/Library/Developer/CommandLineTools/usr/bin/cc -P /private/var/folders/wm/9mckczj143jdzdzr037_g8lm0000gn/T/pip-install-f3axqn06/llama-cpp-python_20b9c0a542fd49ac96cea9fb409a8d94/vendor/llama.cpp/common/../scripts/gen-build-info-cpp.cmake
      -- Found Git: /usr/bin/git (found version "2.39.3 (Apple Git-146)")
      [3/25] /Library/Developer/CommandLineTools/usr/bin/cc -DACCELERATE_LAPACK_ILP64 -DACCELERATE_NEW_LAPACK -DGGML_SCHED_MAX_COPIES=4 -DGGML_USE_ACCELERATE -DGGML_USE_METAL -D_DARWIN_C_SOURCE -D_XOPEN_SOURCE=600 -I/private/var/folders/wm/9mckczj143jdzdzr037_g8lm0000gn/T/pip-install-f3axqn06/llama-cpp-python_20b9c0a542fd49ac96cea9fb409a8d94/vendor/llama.cpp/. -F/Library/Developer/CommandLineTools/SDKs/MacOSX14.4.sdk/System/Library/Frameworks -O3 -DNDEBUG -std=gnu11 -arch arm64 -isysroot /Library/Developer/CommandLineTools/SDKs/MacOSX14.4.sdk -mmacosx-version-min=14.3 -fPIC -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wunreachable-code-break -Wunreachable-code-return -Wdouble-promotion -MD -MT vendor/llama.cpp/CMakeFiles/ggml.dir/ggml-alloc.c.o -MF vendor/llama.cpp/CMakeFiles/ggml.dir/ggml-alloc.c.o.d -o vendor/llama.cpp/CMakeFiles/ggml.dir/ggml-alloc.c.o -c /private/var/folders/wm/9mckczj143jdzdzr037_g8lm0000gn/T/pip-install-f3axqn06/llama-cpp-python_20b9c0a542fd49ac96cea9fb409a8d94/vendor/llama.cpp/ggml-alloc.c
      [4/25] /Library/Developer/CommandLineTools/usr/bin/cc -DACCELERATE_LAPACK_ILP64 -DACCELERATE_NEW_LAPACK -DGGML_SCHED_MAX_COPIES=4 -DGGML_USE_ACCELERATE -DGGML_USE_METAL -D_DARWIN_C_SOURCE -D_XOPEN_SOURCE=600 -I/private/var/folders/wm/9mckczj143jdzdzr037_g8lm0000gn/T/pip-install-f3axqn06/llama-cpp-python_20b9c0a542fd49ac96cea9fb409a8d94/vendor/llama.cpp/. -F/Library/Developer/CommandLineTools/SDKs/MacOSX14.4.sdk/System/Library/Frameworks -O3 -DNDEBUG -std=gnu11 -arch arm64 -isysroot /Library/Developer/CommandLineTools/SDKs/MacOSX14.4.sdk -mmacosx-version-min=14.3 -fPIC -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wunreachable-code-break -Wunreachable-code-return -Wdouble-promotion -MD -MT vendor/llama.cpp/CMakeFiles/ggml.dir/ggml-backend.c.o -MF vendor/llama.cpp/CMakeFiles/ggml.dir/ggml-backend.c.o.d -o vendor/llama.cpp/CMakeFiles/ggml.dir/ggml-backend.c.o -c /private/var/folders/wm/9mckczj143jdzdzr037_g8lm0000gn/T/pip-install-f3axqn06/llama-cpp-python_20b9c0a542fd49ac96cea9fb409a8d94/vendor/llama.cpp/ggml-backend.c
      [5/25] /Library/Developer/CommandLineTools/usr/bin/c++ -DGGML_USE_METAL -DLLAMA_BUILD -DLLAMA_SHARED -I/private/var/folders/wm/9mckczj143jdzdzr037_g8lm0000gn/T/pip-install-f3axqn06/llama-cpp-python_20b9c0a542fd49ac96cea9fb409a8d94/vendor/llama.cpp/examples/llava/. -I/private/var/folders/wm/9mckczj143jdzdzr037_g8lm0000gn/T/pip-install-f3axqn06/llama-cpp-python_20b9c0a542fd49ac96cea9fb409a8d94/vendor/llama.cpp/examples/llava/../.. -I/private/var/folders/wm/9mckczj143jdzdzr037_g8lm0000gn/T/pip-install-f3axqn06/llama-cpp-python_20b9c0a542fd49ac96cea9fb409a8d94/vendor/llama.cpp/examples/llava/../../common -I/private/var/folders/wm/9mckczj143jdzdzr037_g8lm0000gn/T/pip-install-f3axqn06/llama-cpp-python_20b9c0a542fd49ac96cea9fb409a8d94/vendor/llama.cpp/. -O3 -DNDEBUG -std=gnu++11 -arch arm64 -isysroot /Library/Developer/CommandLineTools/SDKs/MacOSX14.4.sdk -mmacosx-version-min=14.3 -fPIC -Wno-cast-qual -MD -MT vendor/llama.cpp/examples/llava/CMakeFiles/llava.dir/llava.cpp.o -MF vendor/llama.cpp/examples/llava/CMakeFiles/llava.dir/llava.cpp.o.d -o vendor/llama.cpp/examples/llava/CMakeFiles/llava.dir/llava.cpp.o -c /private/var/folders/wm/9mckczj143jdzdzr037_g8lm0000gn/T/pip-install-f3axqn06/llama-cpp-python_20b9c0a542fd49ac96cea9fb409a8d94/vendor/llama.cpp/examples/llava/llava.cpp
      [6/25] /Library/Developer/CommandLineTools/usr/bin/cc -DACCELERATE_LAPACK_ILP64 -DACCELERATE_NEW_LAPACK -DGGML_SCHED_MAX_COPIES=4 -DGGML_USE_ACCELERATE -DGGML_USE_METAL -D_DARWIN_C_SOURCE -D_XOPEN_SOURCE=600 -I/private/var/folders/wm/9mckczj143jdzdzr037_g8lm0000gn/T/pip-install-f3axqn06/llama-cpp-python_20b9c0a542fd49ac96cea9fb409a8d94/vendor/llama.cpp/. -F/Library/Developer/CommandLineTools/SDKs/MacOSX14.4.sdk/System/Library/Frameworks -O3 -DNDEBUG -std=gnu11 -arch arm64 -isysroot /Library/Developer/CommandLineTools/SDKs/MacOSX14.4.sdk -mmacosx-version-min=14.3 -fPIC -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wunreachable-code-break -Wunreachable-code-return -Wdouble-promotion -MD -MT vendor/llama.cpp/CMakeFiles/ggml.dir/ggml-metal.m.o -MF vendor/llama.cpp/CMakeFiles/ggml.dir/ggml-metal.m.o.d -o vendor/llama.cpp/CMakeFiles/ggml.dir/ggml-metal.m.o -c /private/var/folders/wm/9mckczj143jdzdzr037_g8lm0000gn/T/pip-install-f3axqn06/llama-cpp-python_20b9c0a542fd49ac96cea9fb409a8d94/vendor/llama.cpp/ggml-metal.m
      [7/25] /Library/Developer/CommandLineTools/usr/bin/cc -DACCELERATE_LAPACK_ILP64 -DACCELERATE_NEW_LAPACK -DGGML_SCHED_MAX_COPIES=4 -DGGML_USE_ACCELERATE -DGGML_USE_METAL -D_DARWIN_C_SOURCE -D_XOPEN_SOURCE=600 -I/private/var/folders/wm/9mckczj143jdzdzr037_g8lm0000gn/T/pip-install-f3axqn06/llama-cpp-python_20b9c0a542fd49ac96cea9fb409a8d94/vendor/llama.cpp/. -F/Library/Developer/CommandLineTools/SDKs/MacOSX14.4.sdk/System/Library/Frameworks -O3 -DNDEBUG -std=gnu11 -arch arm64 -isysroot /Library/Developer/CommandLineTools/SDKs/MacOSX14.4.sdk -mmacosx-version-min=14.3 -fPIC -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wunreachable-code-break -Wunreachable-code-return -Wdouble-promotion -MD -MT vendor/llama.cpp/CMakeFiles/ggml.dir/ggml-quants.c.o -MF vendor/llama.cpp/CMakeFiles/ggml.dir/ggml-quants.c.o.d -o vendor/llama.cpp/CMakeFiles/ggml.dir/ggml-quants.c.o -c /private/var/folders/wm/9mckczj143jdzdzr037_g8lm0000gn/T/pip-install-f3axqn06/llama-cpp-python_20b9c0a542fd49ac96cea9fb409a8d94/vendor/llama.cpp/ggml-quants.c
      [8/25] /Library/Developer/CommandLineTools/usr/bin/c++ -DACCELERATE_LAPACK_ILP64 -DACCELERATE_NEW_LAPACK -DGGML_SCHED_MAX_COPIES=4 -DGGML_USE_ACCELERATE -DGGML_USE_METAL -DLLAMA_BUILD -DLLAMA_SHARED -D_DARWIN_C_SOURCE -D_XOPEN_SOURCE=600 -Dllama_EXPORTS -I/private/var/folders/wm/9mckczj143jdzdzr037_g8lm0000gn/T/pip-install-f3axqn06/llama-cpp-python_20b9c0a542fd49ac96cea9fb409a8d94/vendor/llama.cpp/. -F/Library/Developer/CommandLineTools/SDKs/MacOSX14.4.sdk/System/Library/Frameworks -O3 -DNDEBUG -std=gnu++11 -arch arm64 -isysroot /Library/Developer/CommandLineTools/SDKs/MacOSX14.4.sdk -mmacosx-version-min=14.3 -fPIC -Wmissing-declarations -Wmissing-noreturn -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi -MD -MT vendor/llama.cpp/CMakeFiles/llama.dir/unicode.cpp.o -MF vendor/llama.cpp/CMakeFiles/llama.dir/unicode.cpp.o.d -o vendor/llama.cpp/CMakeFiles/llama.dir/unicode.cpp.o -c /private/var/folders/wm/9mckczj143jdzdzr037_g8lm0000gn/T/pip-install-f3axqn06/llama-cpp-python_20b9c0a542fd49ac96cea9fb409a8d94/vendor/llama.cpp/unicode.cpp
      [9/25] /Library/Developer/CommandLineTools/usr/bin/c++ -DGGML_USE_METAL -DLLAMA_BUILD -DLLAMA_SHARED -I/private/var/folders/wm/9mckczj143jdzdzr037_g8lm0000gn/T/pip-install-f3axqn06/llama-cpp-python_20b9c0a542fd49ac96cea9fb409a8d94/vendor/llama.cpp/examples/llava/. -I/private/var/folders/wm/9mckczj143jdzdzr037_g8lm0000gn/T/pip-install-f3axqn06/llama-cpp-python_20b9c0a542fd49ac96cea9fb409a8d94/vendor/llama.cpp/examples/llava/../.. -I/private/var/folders/wm/9mckczj143jdzdzr037_g8lm0000gn/T/pip-install-f3axqn06/llama-cpp-python_20b9c0a542fd49ac96cea9fb409a8d94/vendor/llama.cpp/examples/llava/../../common -I/private/var/folders/wm/9mckczj143jdzdzr037_g8lm0000gn/T/pip-install-f3axqn06/llama-cpp-python_20b9c0a542fd49ac96cea9fb409a8d94/vendor/llama.cpp/. -O3 -DNDEBUG -std=gnu++11 -arch arm64 -isysroot /Library/Developer/CommandLineTools/SDKs/MacOSX14.4.sdk -mmacosx-version-min=14.3 -fPIC -Wno-cast-qual -MD -MT vendor/llama.cpp/examples/llava/CMakeFiles/llava.dir/clip.cpp.o -MF vendor/llama.cpp/examples/llava/CMakeFiles/llava.dir/clip.cpp.o.d -o vendor/llama.cpp/examples/llava/CMakeFiles/llava.dir/clip.cpp.o -c /private/var/folders/wm/9mckczj143jdzdzr037_g8lm0000gn/T/pip-install-f3axqn06/llama-cpp-python_20b9c0a542fd49ac96cea9fb409a8d94/vendor/llama.cpp/examples/llava/clip.cpp
      [10/25] /Library/Developer/CommandLineTools/usr/bin/cc -DACCELERATE_LAPACK_ILP64 -DACCELERATE_NEW_LAPACK -DGGML_SCHED_MAX_COPIES=4 -DGGML_USE_ACCELERATE -DGGML_USE_METAL -D_DARWIN_C_SOURCE -D_XOPEN_SOURCE=600 -I/private/var/folders/wm/9mckczj143jdzdzr037_g8lm0000gn/T/pip-install-f3axqn06/llama-cpp-python_20b9c0a542fd49ac96cea9fb409a8d94/vendor/llama.cpp/. -F/Library/Developer/CommandLineTools/SDKs/MacOSX14.4.sdk/System/Library/Frameworks -O3 -DNDEBUG -std=gnu11 -arch arm64 -isysroot /Library/Developer/CommandLineTools/SDKs/MacOSX14.4.sdk -mmacosx-version-min=14.3 -fPIC -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wunreachable-code-break -Wunreachable-code-return -Wdouble-promotion -MD -MT vendor/llama.cpp/CMakeFiles/ggml.dir/ggml.c.o -MF vendor/llama.cpp/CMakeFiles/ggml.dir/ggml.c.o.d -o vendor/llama.cpp/CMakeFiles/ggml.dir/ggml.c.o -c /private/var/folders/wm/9mckczj143jdzdzr037_g8lm0000gn/T/pip-install-f3axqn06/llama-cpp-python_20b9c0a542fd49ac96cea9fb409a8d94/vendor/llama.cpp/ggml.c
      [11/25] /Library/Developer/CommandLineTools/usr/bin/c++ -DACCELERATE_LAPACK_ILP64 -DACCELERATE_NEW_LAPACK -DGGML_SCHED_MAX_COPIES=4 -DGGML_USE_ACCELERATE -DGGML_USE_METAL -DLLAMA_BUILD -DLLAMA_SHARED -D_DARWIN_C_SOURCE -D_XOPEN_SOURCE=600 -Dllama_EXPORTS -I/private/var/folders/wm/9mckczj143jdzdzr037_g8lm0000gn/T/pip-install-f3axqn06/llama-cpp-python_20b9c0a542fd49ac96cea9fb409a8d94/vendor/llama.cpp/. -F/Library/Developer/CommandLineTools/SDKs/MacOSX14.4.sdk/System/Library/Frameworks -O3 -DNDEBUG -std=gnu++11 -arch arm64 -isysroot /Library/Developer/CommandLineTools/SDKs/MacOSX14.4.sdk -mmacosx-version-min=14.3 -fPIC -Wmissing-declarations -Wmissing-noreturn -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi -MD -MT vendor/llama.cpp/CMakeFiles/llama.dir/llama.cpp.o -MF vendor/llama.cpp/CMakeFiles/llama.dir/llama.cpp.o.d -o vendor/llama.cpp/CMakeFiles/llama.dir/llama.cpp.o -c /private/var/folders/wm/9mckczj143jdzdzr037_g8lm0000gn/T/pip-install-f3axqn06/llama-cpp-python_20b9c0a542fd49ac96cea9fb409a8d94/vendor/llama.cpp/llama.cpp
      ninja: build stopped: subcommand failed.


      *** CMake build failed
      [end of output]

  note: This error originates from a subprocess, and is likely not a problem with pip.
  ERROR: Failed building wheel for llama-cpp-python
  Building wheel for llm-serve (pyproject.toml) ... done
  Created wheel for llm-serve: filename=llm_serve-0.0.1-py3-none-any.whl size=100808 sha256=5896e4e7b35cf15f8977a5847a9ff40f78ed2ae42e95adc28def70cefc2b426c
  Stored in directory: /Users/hhwang/Library/Caches/pip/wheels/cb/6e/71/619b3e1f616ba182cb9bfc8e0e239a9e8402f4305bc75d27d7
  Building wheel for ffmpy (setup.py) ... done
  Created wheel for ffmpy: filename=ffmpy-0.3.2-py3-none-any.whl size=5582 sha256=f2f3304e01d27a1e9f63c8c504d5d56cf0a5c40ec98c2e805c1a5d8c41ea17be
  Stored in directory: /Users/hhwang/Library/Caches/pip/wheels/bd/65/9a/671fc6dcde07d4418df0c592f8df512b26d7a0029c2a23dd81
Successfully built deepspeed llm-serve ffmpy
Failed to build llama-cpp-python
ERROR: Could not build wheels for llama-cpp-python, which is required to install pyproject.toml-based projects
