opencsgs / llm-inference

llm-inference is a platform for publishing and managing LLM inference, providing a wide range of out-of-the-box features for model deployment, such as UI, RESTful API, auto-scaling, computing resource management, monitoring, and more.

License: Apache License 2.0

Shell 0.27% Python 97.51% Dockerfile 0.17% JavaScript 0.56% Jupyter Notebook 1.49%
deepspeed llama-cpp llm-inference ray transformer vllm

llm-inference's Issues

Requested tokens (817) exceed context window of 512

(PredictionWorker:Qwen/Qwen1.5-72B-Chat-GGUF pid=42050) [INFO 2024-04-16 09:34:13,880] llamacpp_pipeline.py: 212 generate_kwargs: {'max_tokens': 1024, 'echo': False, 'stop': ['<|im_end|>'], 'logits_processor': [], 'stopping_criteria': []}
(ServeController pid=41618) ERROR 2024-04-16 09:34:14,246 controller 41618 deployment_state.py:658 - Exception in replica 'default#Qwen--Qwen1.5-72B-Chat-GGUF#dMqscG', the replica will be stopped.
(ServeController pid=41618) Traceback (most recent call last):
(ServeController pid=41618) File "/root/miniconda3/envs/yons/lib/python3.10/site-packages/ray/serve/_private/deployment_state.py", line 656, in check_ready
(ServeController pid=41618) _, self._version = ray.get(self._ready_obj_ref)
(ServeController pid=41618) File "/root/miniconda3/envs/yons/lib/python3.10/site-packages/ray/_private/auto_init_hook.py", line 22, in auto_init_wrapper
(ServeController pid=41618) return fn(*args, **kwargs)
(ServeController pid=41618) File "/root/miniconda3/envs/yons/lib/python3.10/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
(ServeController pid=41618) return func(*args, **kwargs)
(ServeController pid=41618) File "/root/miniconda3/envs/yons/lib/python3.10/site-packages/ray/_private/worker.py", line 2624, in get
(ServeController pid=41618) raise value.as_instanceof_cause()
(ServeController pid=41618) ray.exceptions.RayTaskError(RuntimeError): ray::5-72B-Chat-GGUF.initialize_and_get_metadata() (pid=41823, ip=172.17.0.2, actor_id=6aff10f7a7934a83f523892907000000, repr=<ray.serve._private.replica.ServeReplica:default:Qwen--Qwen1.5-72B-Chat-GGUF object at 0x7f24110af4c0>)
(ServeController pid=41618) File "/root/miniconda3/envs/yons/lib/python3.10/concurrent/futures/_base.py", line 451, in result
(ServeController pid=41618) return self.__get_result()
(ServeController pid=41618) File "/root/miniconda3/envs/yons/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
(ServeController pid=41618) raise self._exception
(ServeController pid=41618) File "/root/miniconda3/envs/yons/lib/python3.10/site-packages/ray/serve/_private/replica.py", line 455, in initialize_and_get_metadata
(ServeController pid=41618) raise RuntimeError(traceback.format_exc()) from None
(ServeController pid=41618) RuntimeError: Traceback (most recent call last):
(ServeController pid=41618) File "/root/miniconda3/envs/yons/lib/python3.10/site-packages/ray/serve/_private/replica.py", line 445, in initialize_and_get_metadata
(ServeController pid=41618) await self.replica.update_user_config(
(ServeController pid=41618) File "/root/miniconda3/envs/yons/lib/python3.10/site-packages/ray/serve/_private/replica.py", line 724, in update_user_config
(ServeController pid=41618) await reconfigure_method(user_config)
(ServeController pid=41618) File "/data/llm-inference/llmserve/backend/server/app.py", line 154, in reconfigure
(ServeController pid=41618) await self.rollover(
(ServeController pid=41618) File "/data/llm-inference/llmserve/backend/llm/predictor.py", line 64, in rollover
(ServeController pid=41618) self.new_worker_group = await self._create_worker_group(
(ServeController pid=41618) File "/data/llm-inference/llmserve/backend/llm/predictor.py", line 159, in _create_worker_group
(ServeController pid=41618) engine = await self.engine.launch_engine(scaling_config, self.pg, scaling_options)
(ServeController pid=41618) File "/data/llm-inference/llmserve/backend/llm/engines/generic.py", line 367, in launch_engine
(ServeController pid=41618) await asyncio.gather(
(ServeController pid=41618) File "/root/miniconda3/envs/yons/lib/python3.10/asyncio/tasks.py", line 650, in _wrap_awaitable
(ServeController pid=41618) return (yield from awaitable.__await__())
(ServeController pid=41618) ray.exceptions.RayTaskError(ValueError): ray::PredictionWorker.init_model() (pid=42050, ip=172.17.0.2, actor_id=b7ddc7c61575fad3b581750d07000000, repr=PredictionWorker:Qwen/Qwen1.5-72B-Chat-GGUF)
(ServeController pid=41618) File "/data/llm-inference/llmserve/backend/llm/engines/generic.py", line 236, in init_model
(ServeController pid=41618) self.generator = init_model(
(ServeController pid=41618) File "/data/llm-inference/llmserve/backend/llm/utils.py", line 161, in inner
(ServeController pid=41618) ret = func(*args, **kwargs)
(ServeController pid=41618) File "/data/llm-inference/llmserve/backend/llm/engines/generic.py", line 133, in init_model
(ServeController pid=41618) resp_batch = generate(
(ServeController pid=41618) File "/data/llm-inference/llmserve/backend/llm/utils.py", line 161, in inner
(ServeController pid=41618) ret = func(*args, **kwargs)
(ServeController pid=41618) File "/data/llm-inference/llmserve/backend/llm/engines/generic.py", line 168, in generate
(ServeController pid=41618) outputs = pipeline(
(ServeController pid=41618) File "/data/llm-inference/llmserve/backend/llm/pipelines/llamacpp/llamacpp_pipeline.py", line 102, in call
(ServeController pid=41618) for batch_response in self.stream(inputs, **kwargs):
(ServeController pid=41618) File "/data/llm-inference/llmserve/backend/llm/pipelines/llamacpp/llamacpp_pipeline.py", line 214, in stream
(ServeController pid=41618) for token in output:
(ServeController pid=41618) File "/root/miniconda3/envs/yons/lib/python3.10/site-packages/llama_cpp/llama.py", line 970, in _create_completion
(ServeController pid=41618) raise ValueError(
(ServeController pid=41618) ValueError: Requested tokens (817) exceed context window of 512
(ServeController pid=41618) INFO 2024-04-16 09:34:16,388 controller 41618 deployment_state.py:2185 - Replica default#Qwen--Qwen1.5-72B-Chat-GGUF#dMqscG is stopped.
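
The 512 limit in the error is consistent with llama-cpp-python's small default context size. A minimal sketch with plain llama-cpp-python (not llm-inference's code; the model path and n_ctx value are placeholders) of the knob involved:

    from llama_cpp import Llama

    # n_ctx must be large enough to hold the prompt plus max_tokens,
    # up to whatever the GGUF model actually supports; the default is small.
    llm = Llama(model_path="qwen1_5-72b-chat-q5_k_m.gguf", n_ctx=4096)

    out = llm("Hello", max_tokens=1024, stop=["<|im_end|>"])
    print(out["choices"][0]["text"])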

Incorrect text format generated when using the defaulttransformers pipeline

Set pipeline: defaulttransformers and prompt_format: "'role': 'user', 'content': {instruction}" in the YAML; there seems to be a text-format issue in generated_text, as shown below.

[{"generated_text":"'role': 'user', 'content': hello nihao\n{'role': 'user', 'content': '你好'}","num_input_tokens":2,"num_input_tokens_batch":2,"num_generated_tokens":26,"num_generated_tokens_batch":26,"preprocessing_time":0.007688470010180026,"generation_time":7.110702240024693,"postprocessing_time":0.0007505400571972132,"generation_time_per_token":0.2539536514294533,"generation_time_per_token_batch":0.2539536514294533,"num_total_tokens":28,"num_total_tokens_batch":28,"total_time":7.1191412500920705,"total_time_per_token":0.2542550446461454,"total_time_per_token_batch":0.2542550446461454}]

Inference through the Gradio web UI responds with random words for the deepseek instruct model

(screenshot of the garbled Gradio response omitted)

While using the REST API, everything seems to be OK:

curl -H "Content-Type: application/json" -X POST -d '{"prompt": "写一个快排吧"}' "http://127.0.0.1:8000/api/v1/default/opencsg--opencsg-deepseek-coder-1.3b-v0.1/run/predict" {"generated_text":"}\n\n\n# 快排\ndef quicksort(arr):\n if len(arr) <= 1:\n return arr\n pivot = arr[len(arr) // 2]\n left = [x for x in arr if x < pivot]\n middle = [x for x in arr if x == pivot]\n right = [x for x in arr if x > pivot]\n return quicksort(left) + middle + quicksort(right)\n\n# 测试\nprint(quicksort(arr))\n\n# 输出: [1, 2, 3, 4, 5, 6, 7, 8, 9]\n```\n\n这个程序使用了快速排序算法,它是一种高效的排序算法,基于分治法的原理。它选择一个元素作为枢轴,并根据它们与枢轴的大小将其他元素分成两个子数组,然后递归地对子数组进行排序。\n\n快速排序的平均时间复杂度为O(n log n),最坏情况下的时间复杂度为O(n^2),但这种情况很少发生。","num_input_tokens":16,"num_input_tokens_batch":16,"num_generated_tokens":267,"num_generated_tokens_batch":267,"preprocessing_time":0.008793507993686944,"generation_time":2.4766286090016365,"postprocessing_time":0.0009328589949291199,"generation_time_per_token":0.008751337840995181,"generation_time_per_token_batch":0.008751337840995181,"num_total_tokens":283,"num_total_tokens_batch":283,"total_time":2.4863549759902526,"total_time_per_token":0.008785706628940822,"total_time_per_token_batch":0.008785706628940822}(.llm-inference) root@opencsg-gpu1-4090:~/pl/workspace/depenglee/llm-inference#

Failed to load qwen1_5-72b-chat-q5_k_m.gguf

(ServeController pid=9277) Traceback (most recent call last):
(ServeController pid=9277)   File "/root/miniconda3/envs/yons/lib/python3.10/site-packages/ray/serve/_private/deployment_state.py", line 656, in check_ready
(ServeController pid=9277)     _, self._version = ray.get(self._ready_obj_ref)
(ServeController pid=9277)   File "/root/miniconda3/envs/yons/lib/python3.10/site-packages/ray/_private/auto_init_hook.py", line 22, in auto_init_wrapper
(ServeController pid=9277)     return fn(*args, **kwargs)
(ServeController pid=9277)   File "/root/miniconda3/envs/yons/lib/python3.10/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
(ServeController pid=9277)     return func(*args, **kwargs)
(ServeController pid=9277)   File "/root/miniconda3/envs/yons/lib/python3.10/site-packages/ray/_private/worker.py", line 2624, in get
(ServeController pid=9277)     raise value.as_instanceof_cause()
(ServeController pid=9277) ray.exceptions.RayTaskError(RuntimeError): ray::5-72B-Chat-GGUF.initialize_and_get_metadata() (pid=9483, ip=172.17.0.3, actor_id=b5fcde3ad8e5c6c8e719d32404000000, repr=<ray.serve._private.replica.ServeReplica:Qwen--Qwen1.5-72B-Chat-GGUF:Qwen--Qwen1.5-72B-Chat-GGUF object at 0x7fa4048274c0>)
(ServeController pid=9277)   File "/root/miniconda3/envs/yons/lib/python3.10/concurrent/futures/_base.py", line 451, in result
(ServeController pid=9277)     return self.__get_result()
(ServeController pid=9277)   File "/root/miniconda3/envs/yons/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
(ServeController pid=9277)     raise self._exception
(ServeController pid=9277)   File "/root/miniconda3/envs/yons/lib/python3.10/site-packages/ray/serve/_private/replica.py", line 455, in initialize_and_get_metadata
(ServeController pid=9277)     raise RuntimeError(traceback.format_exc()) from None
(ServeController pid=9277) RuntimeError: Traceback (most recent call last):
(ServeController pid=9277)   File "/root/miniconda3/envs/yons/lib/python3.10/site-packages/ray/serve/_private/replica.py", line 445, in initialize_and_get_metadata
(ServeController pid=9277)     await self.replica.update_user_config(
(ServeController pid=9277)   File "/root/miniconda3/envs/yons/lib/python3.10/site-packages/ray/serve/_private/replica.py", line 724, in update_user_config
(ServeController pid=9277)     await reconfigure_method(user_config)
(ServeController pid=9277)   File "/data/llm-inference/llmserve/backend/server/app.py", line 151, in reconfigure
(ServeController pid=9277)     await self.rollover(
(ServeController pid=9277)   File "/data/llm-inference/llmserve/backend/llm/predictor.py", line 64, in rollover
(ServeController pid=9277)     self.new_worker_group = await self._create_worker_group(
(ServeController pid=9277)   File "/data/llm-inference/llmserve/backend/llm/predictor.py", line 154, in _create_worker_group
(ServeController pid=9277)     engine = await self.engine.launch_engine(scaling_config, self.pg, scaling_options)
(ServeController pid=9277)   File "/data/llm-inference/llmserve/backend/llm/engines/generic.py", line 333, in launch_engine
(ServeController pid=9277)     await asyncio.gather(
(ServeController pid=9277)   File "/root/miniconda3/envs/yons/lib/python3.10/asyncio/tasks.py", line 650, in _wrap_awaitable
(ServeController pid=9277)     return (yield from awaitable.__await__())
(ServeController pid=9277) ray.exceptions.RayTaskError(ValueError): ray::PredictionWorker.init_model() (pid=9703, ip=172.17.0.3, actor_id=5691b4ad8e1d62a67ddc668004000000, repr=PredictionWorker:Qwen/Qwen1.5-72B-Chat-GGUF)
(ServeController pid=9277)   File "/root/miniconda3/envs/yons/lib/python3.10/concurrent/futures/_base.py", line 451, in result
(ServeController pid=9277)     return self.__get_result()
(ServeController pid=9277)   File "/root/miniconda3/envs/yons/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
(ServeController pid=9277)     raise self._exception
(ServeController pid=9277)   File "/data/llm-inference/llmserve/backend/llm/engines/generic.py", line 217, in init_model
(ServeController pid=9277)     self.generator = init_model(
(ServeController pid=9277)   File "/data/llm-inference/llmserve/backend/llm/utils.py", line 159, in inner
(ServeController pid=9277)     ret = func(*args, **kwargs)
(ServeController pid=9277)   File "/data/llm-inference/llmserve/backend/llm/engines/generic.py", line 133, in init_model
(ServeController pid=9277)     resp_batch = generate(
(ServeController pid=9277)   File "/data/llm-inference/llmserve/backend/llm/utils.py", line 159, in inner
(ServeController pid=9277)     ret = func(*args, **kwargs)
(ServeController pid=9277)   File "/data/llm-inference/llmserve/backend/llm/engines/generic.py", line 168, in generate
(ServeController pid=9277)     outputs = pipeline(
(ServeController pid=9277)   File "/data/llm-inference/llmserve/backend/llm/pipelines/llamacpp/llamacpp_pipeline.py", line 141, in __call__
(ServeController pid=9277)     output = self.model(input, **kwargs)
(ServeController pid=9277)   File "/root/miniconda3/envs/yons/lib/python3.10/site-packages/llama_cpp/llama.py", line 1547, in __call__
(ServeController pid=9277)     return self.create_completion(
(ServeController pid=9277)   File "/root/miniconda3/envs/yons/lib/python3.10/site-packages/llama_cpp/llama.py", line 1480, in create_completion
(ServeController pid=9277)     completion: Completion = next(completion_or_chunks)  # type: ignore
(ServeController pid=9277)   File "/root/miniconda3/envs/yons/lib/python3.10/site-packages/llama_cpp/llama.py", line 959, in _create_completion
(ServeController pid=9277)     raise ValueError(
(ServeController pid=9277) ValueError: Requested tokens (818) exceed context window of 512
(ServeController pid=9277) INFO 2024-04-05 11:16:41,444 controller 9277 deployment_state.py:2185 - Replica Qwen--Qwen1.5-72B-Chat-GGUF#Qwen--Qwen1.5-72B-Chat-GGUF#ZgAOMG is stopped.
(ServeController pid=9277) INFO 2024-04-05 11:16:41,445 controller 9277 deployment_state.py:1831 - Adding 1 replica to deployment Qwen--Qwen1.5-72B-Chat-GGUF in application 'Qwen--Qwen1.5-72B-Chat-GGUF'.

The usage description of `llm-serve` in quick_start.md is not correct

Wrong:

# llm-serve --help

 Usage: llm-serve [OPTIONS] COMMAND [ARGS]...

╭─ Options ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ --help          Show this message and exit.                                                                                                                        │
╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
╭─ Commands ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ evaluate     Evaluate and summarize the results of a multi_query run with a strong 'evaluator' LLM like GPT-4.                                                     │
│ list         List available model(s) and deployed serving etc.                                                                                                     │
│ predict      Predict one or several models with one or multiple prompts, optionally read from file, and save the results to a file.                                │
│ start        Start application(s) for LLM serving, API server, experimention, fine-tuning and comparation.                                                         │
│ stop         Stop application(s) for LLM serving and API server.                                                                                                   │
╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯

The `evaluate` subcommand has already been deprecated and removed.

Error when doing inference for wukong with dtype=bfloat16 using the default transformers pipeline to load the model

1:job_id:04000000
:actor_name:ServeReplica:default:opencsg--csg-wukong-1B
[INFO 2024-04-30 03:46:04,636] __init__.py: 14  Import vllm related stuff failed, please make sure 'vllm' is installed.
INFO 2024-04-30 03:46:04,723 default_opencsg--csg-wukong-1B IULCpr app.py:95 - LLM Deployment initialize
[INFO 2024-04-30 03:46:04,723] predictor.py: 27  LLM Predictor Initialize
INFO 2024-04-30 03:46:04,724 default_opencsg--csg-wukong-1B IULCpr app.py:145 - LLM Deployment Reconfiguring...
INFO 2024-04-30 03:46:04,724 default_opencsg--csg-wukong-1B IULCpr app.py:103 - LLM Deployment _should_reinit_worker_group
[INFO 2024-04-30 03:46:04,724] predictor.py: 48  Initializing new worker group ScalingConfig(trainer_resources={'CPU': 0}, num_workers=1, use_gpu=True, resources_per_worker={'CPU': 1.0, 'GPU': 1.0})
[INFO 2024-04-30 03:46:04,724] predictor.py: 59  Engine name is generic
[INFO 2024-04-30 03:46:04,724] predictor.py: 83  LLM Predictor creating a new worker group
[INFO 2024-04-30 03:46:04,818] predictor.py: 100  Build Prediction Worker with runtime_env:
[INFO 2024-04-30 03:46:04,819] predictor.py: 101  None
[INFO 2024-04-30 03:46:04,819] predictor.py: 109  Waiting for placement group to be ready...
[INFO 2024-04-30 03:46:04,887] predictor.py: 113  Starting initialize_node tasks...
[INFO 2024-04-30 03:46:06,970] predictor.py: 124  get version: [None]
[INFO 2024-04-30 03:46:06,970] generic.py: 351  Creating prediction workers...
[INFO 2024-04-30 03:46:06,975] generic.py: 358  Initializing torch_dist process group on workers...
[INFO 2024-04-30 03:46:09,210] generic.py: 368  Initializing model on workers with local_ranks: [0]
[INFO 2024-04-30 03:46:10,294] predictor.py: 68  Rolling over to new worker group [Actor(PredictionWorker, efd48e82c51a27d83f8078f604000000)]
INFO 2024-04-30 03:46:10,377 default_opencsg--csg-wukong-1B IULCpr app.py:236 - new_max_batch_size is 1
INFO 2024-04-30 03:46:10,377 default_opencsg--csg-wukong-1B IULCpr app.py:237 - new_batch_wait_timeout_s is 0
INFO 2024-04-30 03:46:10,377 default_opencsg--csg-wukong-1B IULCpr app.py:162 - LLM Deployment Reconfigured.
/home/yons/llm-inference/llmserve/backend/llm/predictor.py:212: RuntimeWarning: coroutine 'GenericEngine.check_health' was never awaited
  self.engine.check_health()
RuntimeWarning: Enable tracemalloc to get the object allocation traceback
INFO 2024-04-30 03:47:11,008 default_opencsg--csg-wukong-1B IULCpr 0dd6808d-68b3-42fe-ac35-8f4ce6fb6d21 /api/v1/default/opencsg--csg-wukong-1B/run/predict app.py:210 - batch_generate_text prompts: [Prompt(prompt='What can I do', use_prompt_format=False)] 
INFO 2024-04-30 03:47:11,008 default_opencsg--csg-wukong-1B IULCpr 0dd6808d-68b3-42fe-ac35-8f4ce6fb6d21 /api/v1/default/opencsg--csg-wukong-1B/run/predict app.py:273 - Received 1 prompts [Prompt(prompt='What can I do', use_prompt_format=False)]. start_timestamp None timeout_s 100
[INFO 2024-04-30 03:47:11,008] generic.py: 416  LLM GenericEngine do async predict
ERROR 2024-04-30 03:47:11,135 default_opencsg--csg-wukong-1B IULCpr 0dd6808d-68b3-42fe-ac35-8f4ce6fb6d21 /api/v1/default/opencsg--csg-wukong-1B/run/predict replica.py:756 - Request failed due to RayTaskError:
Traceback (most recent call last):
  File "/home/yons/.conda/envs/abc/lib/python3.10/site-packages/ray/serve/_private/replica.py", line 753, in wrap_user_method_call
    yield
  File "/home/yons/.conda/envs/abc/lib/python3.10/site-packages/ray/serve/_private/replica.py", line 914, in call_user_method
    raise e from None
ray.exceptions.RayTaskError: ray::ServeReplica:default:opencsg--csg-wukong-1B.handle_request() (pid=1492889, ip=192.168.80.2)
  File "/home/yons/.conda/envs/abc/lib/python3.10/site-packages/ray/serve/_private/utils.py", line 165, in wrap_to_ray_error
    raise exception
  File "/home/yons/.conda/envs/abc/lib/python3.10/site-packages/ray/serve/_private/replica.py", line 895, in call_user_method
    result = await method_to_call(*request_args, **request_kwargs)
  File "/home/yons/llm-inference/llmserve/backend/server/app.py", line 217, in batch_generate_text
    texts = await asyncio.gather(
  File "/home/yons/.conda/envs/abc/lib/python3.10/site-packages/ray/serve/batching.py", line 498, in batch_wrapper
    return await enqueue_request(args, kwargs)
  File "/home/yons/.conda/envs/abc/lib/python3.10/site-packages/ray/serve/batching.py", line 228, in _process_batches
    results = await func_future
  File "/home/yons/llm-inference/llmserve/backend/server/app.py", line 285, in generate_text_batch
    prediction = await self._predict_async(
  File "/home/yons/llm-inference/llmserve/backend/llm/predictor.py", line 183, in _predict_async
    prediction = await self.engine.predict(prompts, generate, timeout_s=timeout_s, start_timestamp=start_timestamp, lock=self._base_worker_group_lock)
  File "/home/yons/llm-inference/llmserve/backend/llm/engines/generic.py", line 443, in predict
    await asyncio.gather(
  File "/home/yons/.conda/envs/abc/lib/python3.10/asyncio/tasks.py", line 650, in _wrap_awaitable
    return (yield from awaitable.__await__())
ray.exceptions.RayTaskError(RuntimeError): ray::PredictionWorker.generate() (pid=1493087, ip=192.168.80.2, actor_id=efd48e82c51a27d83f8078f604000000, repr=PredictionWorker:opencsg/csg-wukong-1B)
  File "/home/yons/llm-inference/llmserve/backend/llm/engines/generic.py", line 268, in generate
    return generate(
  File "/home/yons/llm-inference/llmserve/backend/llm/utils.py", line 161, in inner
    ret = func(*args, **kwargs)
  File "/home/yons/llm-inference/llmserve/backend/llm/engines/generic.py", line 169, in generate
    outputs = pipeline(
  File "/home/yons/.conda/envs/abc/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/yons/llm-inference/llmserve/backend/llm/pipelines/default_transformers_pipeline.py", line 77, in __call__
    model_outputs = self.forward(model_inputs, **forward_params)
  File "/home/yons/llm-inference/llmserve/backend/llm/pipelines/default_transformers_pipeline.py", line 208, in forward
    generated_sequence = self.pipeline(**prompt_text, **generate_kwargs)
  File "/home/yons/.conda/envs/abc/lib/python3.10/site-packages/transformers/pipelines/text_generation.py", line 240, in __call__
    return super().__call__(text_inputs, **kwargs)
  File "/home/yons/.conda/envs/abc/lib/python3.10/site-packages/transformers/pipelines/base.py", line 1187, in __call__
    outputs = list(final_iterator)
  File "/home/yons/.conda/envs/abc/lib/python3.10/site-packages/transformers/pipelines/pt_utils.py", line 124, in __next__
    item = next(self.iterator)
  File "/home/yons/.conda/envs/abc/lib/python3.10/site-packages/transformers/pipelines/pt_utils.py", line 125, in __next__
    processed = self.infer(item, **self.params)
  File "/home/yons/.conda/envs/abc/lib/python3.10/site-packages/transformers/pipelines/base.py", line 1112, in forward
    model_outputs = self._forward(model_inputs, **forward_params)
  File "/home/yons/.conda/envs/abc/lib/python3.10/site-packages/transformers/pipelines/text_generation.py", line 327, in _forward
    generated_sequence = self.model.generate(input_ids=input_ids, attention_mask=attention_mask, **generate_kwargs)
  File "/home/yons/.conda/envs/abc/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/yons/.conda/envs/abc/lib/python3.10/site-packages/transformers/generation/utils.py", line 1575, in generate
    result = self._sample(
  File "/home/yons/.conda/envs/abc/lib/python3.10/site-packages/transformers/generation/utils.py", line 2735, in _sample
    next_tokens = torch.multinomial(probs, num_samples=1).squeeze(1)
RuntimeError: probability tensor contains either `inf`, `nan` or element < 0
INFO 2024-04-30 03:47:11,135 default_opencsg--csg-wukong-1B IULCpr 0dd6808d-68b3-42fe-ac35-8f4ce6fb6d21 /api/v1/default/opencsg--csg-wukong-1B/run/predict replica.py:772 - BATCH_GENERATE_TEXT ERROR 127.1ms
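
The torch.multinomial failure means the sampling probabilities contained NaN or inf, which often traces back to the dtype the weights were run in on that GPU. A minimal sketch with plain transformers (the dtype choice and generation settings are assumptions, not a confirmed fix) for probing the model outside the serving stack:

    import torch
    from transformers import pipeline

    pipe = pipeline(
        "text-generation",
        model="opencsg/csg-wukong-1B",
        torch_dtype=torch.float16,  # try float16 or float32 instead of bfloat16
    )

    # do_sample=False skips torch.multinomial entirely, which helps isolate whether
    # the logits themselves are NaN or only the sampling step fails.
    print(pipe("What can I do", max_new_tokens=32, do_sample=False)[0]["generated_text"])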

[BUG] Error when trying a "translation" downstream model

Run command:
llm-serve start experimental --model ./models/translation--t5-small.yaml

Get the error:

(ServeController pid=26978)   File "/Users/lipeng/workspaces/github.com/depenglee1707/llm-inference/llmserve/backend/llm/pipelines/default_transformers_pipeline.py", line 125, in __call__
(ServeController pid=26978)     output = self.format_output(data[0], inputs, preprocess_time, generation_time)
(ServeController pid=26978)   File "/Users/lipeng/workspaces/github.com/depenglee1707/llm-inference/llmserve/backend/llm/pipelines/default_transformers_pipeline.py", line 183, in format_output
(ServeController pid=26978)     num_generated_tokens = len(self.tokenizer(output["generated_text"]).input_ids)
(ServeController pid=26978) TypeError: string indices must be integers

@jasonhe258 please take a look
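
A plausible cause, shown with plain transformers as a hedged sketch (not the llmserve code path): a translation pipeline returns entries keyed translation_text rather than generated_text, so the output shape format_output expects does not match what comes back, which is consistent with the TypeError above:

    from transformers import pipeline

    pipe = pipeline("translation_en_to_de", model="t5-small")
    result = pipe("Hello world")
    print(result)  # [{'translation_text': 'Hallo Welt'}] -- no 'generated_text' key

    # A tolerant lookup would need to handle both output shapes.
    text = result[0].get("generated_text") or result[0].get("translation_text")
    print(text)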

Model streaming API enhancement

  1. Stream request load-balancing across multiple predictor workers.
  2. More parameters for the model.generate API.
  3. Streaming support for class DefaultTransformersPipeline (a sketch of what this could look like follows this list).
  4. RouterDeployment API support for the "/{model}/run/predict" format.
  5. Model ID mapping for the API server, e.g. mapping facebook/opt-125m to facebook--opt-125m.
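
A minimal sketch of item 3 using the transformers TextIteratorStreamer API (generic illustration only; the model name is a placeholder and this is not the project's DefaultTransformersPipeline code):

    from threading import Thread
    from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer

    model_id = "facebook/opt-125m"  # placeholder model
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id)

    def stream_generate(prompt, **generate_kwargs):
        inputs = tokenizer(prompt, return_tensors="pt")
        streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
        # generate() blocks, so it runs in a background thread while the streamer yields chunks.
        Thread(target=model.generate, kwargs={**inputs, "streamer": streamer, **generate_kwargs}).start()
        yield from streamer

    for chunk in stream_generate("Hello, my name is", max_new_tokens=32):
        print(chunk, end="", flush=True)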

API server is blocked when the LLM deployment scaling config exceeds the cluster resources

For example, the Ray cluster has only 12 CPUs:

curl -H "Content-Type: application/json" -H "user-name: default"  -d '[{"model_id": "facebook/opt-125m", "model_task": "text-generation", "model_revision": "main", "is_oob": true, "scaling_config": {"num_workers": 1, "num_gpus_per_worker": 1,"num_cpus_per_worker": 20}}]' -X POST "http://127.0.0.1:8000/api/start_serving"

Enable resetting the generate config on the fly

For now the generation params are defined in YAML files; adding the ability to reset these params on the fly would be useful:

    generate_kwargs:
      do_sample: false
      max_new_tokens: 512
      min_new_tokens: 16
      temperature: 0.7
      repetition_penalty: 1.1
      top_p: 0.8
      top_k: 50
      pad_token: "<|extra_0|>"
      eos_token: "<|endoftext|>"
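
A minimal sketch assuming a plain transformers backend (the model name is a placeholder): the same keys that live in generate_kwargs can be passed per request to model.generate, which is the behaviour this issue asks the server to expose:

    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "facebook/opt-125m"  # placeholder model
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id)

    # Per-request overrides instead of the static YAML values.
    overrides = {"do_sample": True, "max_new_tokens": 64, "temperature": 0.9, "top_p": 0.95}

    inputs = tokenizer("Hello", return_tensors="pt")
    output_ids = model.generate(**inputs, **overrides)
    print(tokenizer.decode(output_ids[0], skip_special_tokens=True))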

Installing the dependency llama-cpp-python failed

Using cached exceptiongroup-1.2.0-py3-none-any.whl (16 kB)
Building wheels for collected packages: deepspeed, llama-cpp-python, llm-serve, ffmpy
  Building wheel for deepspeed (setup.py) ... done
  Created wheel for deepspeed: filename=deepspeed-0.14.0-py3-none-any.whl size=1400347 sha256=db3cabb92e930a4d76b2adf48e2bae802dc28c333d54d790ab2b4256efe03fe0
  Stored in directory: /Users/hhwang/Library/Caches/pip/wheels/23/96/24/bab20c3b4e2af15e195b339afaec373eca7072cf90620432e5
  Building wheel for llama-cpp-python (pyproject.toml) ... error
  error: subprocess-exited-with-error

  × Building wheel for llama-cpp-python (pyproject.toml) did not run successfully.
  │ exit code: 1
  ╰─> [66 lines of output]
      *** scikit-build-core 0.8.2 using CMake 3.29.0 (wheel)
      *** Configuring CMake...
      2024-03-31 14:09:18,364 - scikit_build_core - WARNING - libdir/ldlibrary: /Users/hhwang/anaconda3/envs/abc/lib/libpython3.10.a is not a real file!
      2024-03-31 14:09:18,364 - scikit_build_core - WARNING - Can't find a Python library, got libdir=/Users/hhwang/anaconda3/envs/abc/lib, ldlibrary=libpython3.10.a, multiarch=darwin, masd=None
      loading initial cache file /var/folders/wm/9mckczj143jdzdzr037_g8lm0000gn/T/tmpujapt1jr/build/CMakeInit.txt
      -- The C compiler identification is AppleClang 15.0.0.15000309
      -- The CXX compiler identification is AppleClang 15.0.0.15000309
      -- Detecting C compiler ABI info
      -- Detecting C compiler ABI info - done
      -- Check for working C compiler: /Library/Developer/CommandLineTools/usr/bin/cc - skipped
      -- Detecting C compile features
      -- Detecting C compile features - done
      -- Detecting CXX compiler ABI info
      -- Detecting CXX compiler ABI info - done
      -- Check for working CXX compiler: /Library/Developer/CommandLineTools/usr/bin/c++ - skipped
      -- Detecting CXX compile features
      -- Detecting CXX compile features - done
      -- Found Git: /usr/bin/git (found version "2.39.3 (Apple Git-146)")
      -- Performing Test CMAKE_HAVE_LIBC_PTHREAD
      -- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Success
      -- Found Threads: TRUE
      -- Accelerate framework found
      -- Metal framework found
      -- Warning: ccache not found - consider installing it for faster compilation or disable this warning with LLAMA_CCACHE=OFF
      -- CMAKE_SYSTEM_PROCESSOR: arm64
      -- ARM detected
      -- Performing Test COMPILER_SUPPORTS_FP16_FORMAT_I3E
      -- Performing Test COMPILER_SUPPORTS_FP16_FORMAT_I3E - Failed
      CMake Warning (dev) at vendor/llama.cpp/CMakeLists.txt:1218 (install):
        Target llama has RESOURCE files but no RESOURCE DESTINATION.
      This warning is for project developers.  Use -Wno-dev to suppress it.

      CMake Warning (dev) at CMakeLists.txt:21 (install):
        Target llama has PUBLIC_HEADER files but no PUBLIC_HEADER DESTINATION.
      This warning is for project developers.  Use -Wno-dev to suppress it.

      CMake Warning (dev) at CMakeLists.txt:30 (install):
        Target llama has PUBLIC_HEADER files but no PUBLIC_HEADER DESTINATION.
      This warning is for project developers.  Use -Wno-dev to suppress it.

      -- Configuring done (0.5s)
      -- Generating done (0.0s)
      -- Build files have been written to: /var/folders/wm/9mckczj143jdzdzr037_g8lm0000gn/T/tmpujapt1jr/build
      *** Building project with Ninja...
      Change Dir: '/var/folders/wm/9mckczj143jdzdzr037_g8lm0000gn/T/tmpujapt1jr/build'

      Run Build Command(s): /private/var/folders/wm/9mckczj143jdzdzr037_g8lm0000gn/T/pip-build-env-h3q63wii/normal/lib/python3.10/site-packages/ninja/data/bin/ninja -v
      [1/25] cd /var/folders/wm/9mckczj143jdzdzr037_g8lm0000gn/T/tmpujapt1jr/build/vendor/llama.cpp && xcrun -sdk macosx metal -O3 -c /var/folders/wm/9mckczj143jdzdzr037_g8lm0000gn/T/tmpujapt1jr/build/bin/ggml-metal.metal -o /var/folders/wm/9mckczj143jdzdzr037_g8lm0000gn/T/tmpujapt1jr/build/bin/ggml-metal.air && xcrun -sdk macosx metallib /var/folders/wm/9mckczj143jdzdzr037_g8lm0000gn/T/tmpujapt1jr/build/bin/ggml-metal.air -o /var/folders/wm/9mckczj143jdzdzr037_g8lm0000gn/T/tmpujapt1jr/build/bin/default.metallib && rm -f /var/folders/wm/9mckczj143jdzdzr037_g8lm0000gn/T/tmpujapt1jr/build/bin/ggml-metal.air && rm -f /var/folders/wm/9mckczj143jdzdzr037_g8lm0000gn/T/tmpujapt1jr/build/bin/ggml-common.h && rm -f /var/folders/wm/9mckczj143jdzdzr037_g8lm0000gn/T/tmpujapt1jr/build/bin/ggml-metal.metal
      FAILED: bin/default.metallib /var/folders/wm/9mckczj143jdzdzr037_g8lm0000gn/T/tmpujapt1jr/build/bin/default.metallib
      cd /var/folders/wm/9mckczj143jdzdzr037_g8lm0000gn/T/tmpujapt1jr/build/vendor/llama.cpp && xcrun -sdk macosx metal -O3 -c /var/folders/wm/9mckczj143jdzdzr037_g8lm0000gn/T/tmpujapt1jr/build/bin/ggml-metal.metal -o /var/folders/wm/9mckczj143jdzdzr037_g8lm0000gn/T/tmpujapt1jr/build/bin/ggml-metal.air && xcrun -sdk macosx metallib /var/folders/wm/9mckczj143jdzdzr037_g8lm0000gn/T/tmpujapt1jr/build/bin/ggml-metal.air -o /var/folders/wm/9mckczj143jdzdzr037_g8lm0000gn/T/tmpujapt1jr/build/bin/default.metallib && rm -f /var/folders/wm/9mckczj143jdzdzr037_g8lm0000gn/T/tmpujapt1jr/build/bin/ggml-metal.air && rm -f /var/folders/wm/9mckczj143jdzdzr037_g8lm0000gn/T/tmpujapt1jr/build/bin/ggml-common.h && rm -f /var/folders/wm/9mckczj143jdzdzr037_g8lm0000gn/T/tmpujapt1jr/build/bin/ggml-metal.metal
      xcrun: error: unable to find utility "metal", not a developer tool or in PATH
      [2/25] cd /private/var/folders/wm/9mckczj143jdzdzr037_g8lm0000gn/T/pip-install-f3axqn06/llama-cpp-python_20b9c0a542fd49ac96cea9fb409a8d94/vendor/llama.cpp && /private/var/folders/wm/9mckczj143jdzdzr037_g8lm0000gn/T/pip-build-env-h3q63wii/normal/lib/python3.10/site-packages/cmake/data/bin/cmake -DMSVC= -DCMAKE_C_COMPILER_VERSION=15.0.0.15000309 -DCMAKE_C_COMPILER_ID=AppleClang -DCMAKE_VS_PLATFORM_NAME= -DCMAKE_C_COMPILER=/Library/Developer/CommandLineTools/usr/bin/cc -P /private/var/folders/wm/9mckczj143jdzdzr037_g8lm0000gn/T/pip-install-f3axqn06/llama-cpp-python_20b9c0a542fd49ac96cea9fb409a8d94/vendor/llama.cpp/common/../scripts/gen-build-info-cpp.cmake
      -- Found Git: /usr/bin/git (found version "2.39.3 (Apple Git-146)")
      [3/25] /Library/Developer/CommandLineTools/usr/bin/cc -DACCELERATE_LAPACK_ILP64 -DACCELERATE_NEW_LAPACK -DGGML_SCHED_MAX_COPIES=4 -DGGML_USE_ACCELERATE -DGGML_USE_METAL -D_DARWIN_C_SOURCE -D_XOPEN_SOURCE=600 -I/private/var/folders/wm/9mckczj143jdzdzr037_g8lm0000gn/T/pip-install-f3axqn06/llama-cpp-python_20b9c0a542fd49ac96cea9fb409a8d94/vendor/llama.cpp/. -F/Library/Developer/CommandLineTools/SDKs/MacOSX14.4.sdk/System/Library/Frameworks -O3 -DNDEBUG -std=gnu11 -arch arm64 -isysroot /Library/Developer/CommandLineTools/SDKs/MacOSX14.4.sdk -mmacosx-version-min=14.3 -fPIC -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wunreachable-code-break -Wunreachable-code-return -Wdouble-promotion -MD -MT vendor/llama.cpp/CMakeFiles/ggml.dir/ggml-alloc.c.o -MF vendor/llama.cpp/CMakeFiles/ggml.dir/ggml-alloc.c.o.d -o vendor/llama.cpp/CMakeFiles/ggml.dir/ggml-alloc.c.o -c /private/var/folders/wm/9mckczj143jdzdzr037_g8lm0000gn/T/pip-install-f3axqn06/llama-cpp-python_20b9c0a542fd49ac96cea9fb409a8d94/vendor/llama.cpp/ggml-alloc.c
      [4/25] /Library/Developer/CommandLineTools/usr/bin/cc -DACCELERATE_LAPACK_ILP64 -DACCELERATE_NEW_LAPACK -DGGML_SCHED_MAX_COPIES=4 -DGGML_USE_ACCELERATE -DGGML_USE_METAL -D_DARWIN_C_SOURCE -D_XOPEN_SOURCE=600 -I/private/var/folders/wm/9mckczj143jdzdzr037_g8lm0000gn/T/pip-install-f3axqn06/llama-cpp-python_20b9c0a542fd49ac96cea9fb409a8d94/vendor/llama.cpp/. -F/Library/Developer/CommandLineTools/SDKs/MacOSX14.4.sdk/System/Library/Frameworks -O3 -DNDEBUG -std=gnu11 -arch arm64 -isysroot /Library/Developer/CommandLineTools/SDKs/MacOSX14.4.sdk -mmacosx-version-min=14.3 -fPIC -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wunreachable-code-break -Wunreachable-code-return -Wdouble-promotion -MD -MT vendor/llama.cpp/CMakeFiles/ggml.dir/ggml-backend.c.o -MF vendor/llama.cpp/CMakeFiles/ggml.dir/ggml-backend.c.o.d -o vendor/llama.cpp/CMakeFiles/ggml.dir/ggml-backend.c.o -c /private/var/folders/wm/9mckczj143jdzdzr037_g8lm0000gn/T/pip-install-f3axqn06/llama-cpp-python_20b9c0a542fd49ac96cea9fb409a8d94/vendor/llama.cpp/ggml-backend.c
      [5/25] /Library/Developer/CommandLineTools/usr/bin/c++ -DGGML_USE_METAL -DLLAMA_BUILD -DLLAMA_SHARED -I/private/var/folders/wm/9mckczj143jdzdzr037_g8lm0000gn/T/pip-install-f3axqn06/llama-cpp-python_20b9c0a542fd49ac96cea9fb409a8d94/vendor/llama.cpp/examples/llava/. -I/private/var/folders/wm/9mckczj143jdzdzr037_g8lm0000gn/T/pip-install-f3axqn06/llama-cpp-python_20b9c0a542fd49ac96cea9fb409a8d94/vendor/llama.cpp/examples/llava/../.. -I/private/var/folders/wm/9mckczj143jdzdzr037_g8lm0000gn/T/pip-install-f3axqn06/llama-cpp-python_20b9c0a542fd49ac96cea9fb409a8d94/vendor/llama.cpp/examples/llava/../../common -I/private/var/folders/wm/9mckczj143jdzdzr037_g8lm0000gn/T/pip-install-f3axqn06/llama-cpp-python_20b9c0a542fd49ac96cea9fb409a8d94/vendor/llama.cpp/. -O3 -DNDEBUG -std=gnu++11 -arch arm64 -isysroot /Library/Developer/CommandLineTools/SDKs/MacOSX14.4.sdk -mmacosx-version-min=14.3 -fPIC -Wno-cast-qual -MD -MT vendor/llama.cpp/examples/llava/CMakeFiles/llava.dir/llava.cpp.o -MF vendor/llama.cpp/examples/llava/CMakeFiles/llava.dir/llava.cpp.o.d -o vendor/llama.cpp/examples/llava/CMakeFiles/llava.dir/llava.cpp.o -c /private/var/folders/wm/9mckczj143jdzdzr037_g8lm0000gn/T/pip-install-f3axqn06/llama-cpp-python_20b9c0a542fd49ac96cea9fb409a8d94/vendor/llama.cpp/examples/llava/llava.cpp
      [6/25] /Library/Developer/CommandLineTools/usr/bin/cc -DACCELERATE_LAPACK_ILP64 -DACCELERATE_NEW_LAPACK -DGGML_SCHED_MAX_COPIES=4 -DGGML_USE_ACCELERATE -DGGML_USE_METAL -D_DARWIN_C_SOURCE -D_XOPEN_SOURCE=600 -I/private/var/folders/wm/9mckczj143jdzdzr037_g8lm0000gn/T/pip-install-f3axqn06/llama-cpp-python_20b9c0a542fd49ac96cea9fb409a8d94/vendor/llama.cpp/. -F/Library/Developer/CommandLineTools/SDKs/MacOSX14.4.sdk/System/Library/Frameworks -O3 -DNDEBUG -std=gnu11 -arch arm64 -isysroot /Library/Developer/CommandLineTools/SDKs/MacOSX14.4.sdk -mmacosx-version-min=14.3 -fPIC -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wunreachable-code-break -Wunreachable-code-return -Wdouble-promotion -MD -MT vendor/llama.cpp/CMakeFiles/ggml.dir/ggml-metal.m.o -MF vendor/llama.cpp/CMakeFiles/ggml.dir/ggml-metal.m.o.d -o vendor/llama.cpp/CMakeFiles/ggml.dir/ggml-metal.m.o -c /private/var/folders/wm/9mckczj143jdzdzr037_g8lm0000gn/T/pip-install-f3axqn06/llama-cpp-python_20b9c0a542fd49ac96cea9fb409a8d94/vendor/llama.cpp/ggml-metal.m
      [7/25] /Library/Developer/CommandLineTools/usr/bin/cc -DACCELERATE_LAPACK_ILP64 -DACCELERATE_NEW_LAPACK -DGGML_SCHED_MAX_COPIES=4 -DGGML_USE_ACCELERATE -DGGML_USE_METAL -D_DARWIN_C_SOURCE -D_XOPEN_SOURCE=600 -I/private/var/folders/wm/9mckczj143jdzdzr037_g8lm0000gn/T/pip-install-f3axqn06/llama-cpp-python_20b9c0a542fd49ac96cea9fb409a8d94/vendor/llama.cpp/. -F/Library/Developer/CommandLineTools/SDKs/MacOSX14.4.sdk/System/Library/Frameworks -O3 -DNDEBUG -std=gnu11 -arch arm64 -isysroot /Library/Developer/CommandLineTools/SDKs/MacOSX14.4.sdk -mmacosx-version-min=14.3 -fPIC -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wunreachable-code-break -Wunreachable-code-return -Wdouble-promotion -MD -MT vendor/llama.cpp/CMakeFiles/ggml.dir/ggml-quants.c.o -MF vendor/llama.cpp/CMakeFiles/ggml.dir/ggml-quants.c.o.d -o vendor/llama.cpp/CMakeFiles/ggml.dir/ggml-quants.c.o -c /private/var/folders/wm/9mckczj143jdzdzr037_g8lm0000gn/T/pip-install-f3axqn06/llama-cpp-python_20b9c0a542fd49ac96cea9fb409a8d94/vendor/llama.cpp/ggml-quants.c
      [8/25] /Library/Developer/CommandLineTools/usr/bin/c++ -DACCELERATE_LAPACK_ILP64 -DACCELERATE_NEW_LAPACK -DGGML_SCHED_MAX_COPIES=4 -DGGML_USE_ACCELERATE -DGGML_USE_METAL -DLLAMA_BUILD -DLLAMA_SHARED -D_DARWIN_C_SOURCE -D_XOPEN_SOURCE=600 -Dllama_EXPORTS -I/private/var/folders/wm/9mckczj143jdzdzr037_g8lm0000gn/T/pip-install-f3axqn06/llama-cpp-python_20b9c0a542fd49ac96cea9fb409a8d94/vendor/llama.cpp/. -F/Library/Developer/CommandLineTools/SDKs/MacOSX14.4.sdk/System/Library/Frameworks -O3 -DNDEBUG -std=gnu++11 -arch arm64 -isysroot /Library/Developer/CommandLineTools/SDKs/MacOSX14.4.sdk -mmacosx-version-min=14.3 -fPIC -Wmissing-declarations -Wmissing-noreturn -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi -MD -MT vendor/llama.cpp/CMakeFiles/llama.dir/unicode.cpp.o -MF vendor/llama.cpp/CMakeFiles/llama.dir/unicode.cpp.o.d -o vendor/llama.cpp/CMakeFiles/llama.dir/unicode.cpp.o -c /private/var/folders/wm/9mckczj143jdzdzr037_g8lm0000gn/T/pip-install-f3axqn06/llama-cpp-python_20b9c0a542fd49ac96cea9fb409a8d94/vendor/llama.cpp/unicode.cpp
      [9/25] /Library/Developer/CommandLineTools/usr/bin/c++ -DGGML_USE_METAL -DLLAMA_BUILD -DLLAMA_SHARED -I/private/var/folders/wm/9mckczj143jdzdzr037_g8lm0000gn/T/pip-install-f3axqn06/llama-cpp-python_20b9c0a542fd49ac96cea9fb409a8d94/vendor/llama.cpp/examples/llava/. -I/private/var/folders/wm/9mckczj143jdzdzr037_g8lm0000gn/T/pip-install-f3axqn06/llama-cpp-python_20b9c0a542fd49ac96cea9fb409a8d94/vendor/llama.cpp/examples/llava/../.. -I/private/var/folders/wm/9mckczj143jdzdzr037_g8lm0000gn/T/pip-install-f3axqn06/llama-cpp-python_20b9c0a542fd49ac96cea9fb409a8d94/vendor/llama.cpp/examples/llava/../../common -I/private/var/folders/wm/9mckczj143jdzdzr037_g8lm0000gn/T/pip-install-f3axqn06/llama-cpp-python_20b9c0a542fd49ac96cea9fb409a8d94/vendor/llama.cpp/. -O3 -DNDEBUG -std=gnu++11 -arch arm64 -isysroot /Library/Developer/CommandLineTools/SDKs/MacOSX14.4.sdk -mmacosx-version-min=14.3 -fPIC -Wno-cast-qual -MD -MT vendor/llama.cpp/examples/llava/CMakeFiles/llava.dir/clip.cpp.o -MF vendor/llama.cpp/examples/llava/CMakeFiles/llava.dir/clip.cpp.o.d -o vendor/llama.cpp/examples/llava/CMakeFiles/llava.dir/clip.cpp.o -c /private/var/folders/wm/9mckczj143jdzdzr037_g8lm0000gn/T/pip-install-f3axqn06/llama-cpp-python_20b9c0a542fd49ac96cea9fb409a8d94/vendor/llama.cpp/examples/llava/clip.cpp
      [10/25] /Library/Developer/CommandLineTools/usr/bin/cc -DACCELERATE_LAPACK_ILP64 -DACCELERATE_NEW_LAPACK -DGGML_SCHED_MAX_COPIES=4 -DGGML_USE_ACCELERATE -DGGML_USE_METAL -D_DARWIN_C_SOURCE -D_XOPEN_SOURCE=600 -I/private/var/folders/wm/9mckczj143jdzdzr037_g8lm0000gn/T/pip-install-f3axqn06/llama-cpp-python_20b9c0a542fd49ac96cea9fb409a8d94/vendor/llama.cpp/. -F/Library/Developer/CommandLineTools/SDKs/MacOSX14.4.sdk/System/Library/Frameworks -O3 -DNDEBUG -std=gnu11 -arch arm64 -isysroot /Library/Developer/CommandLineTools/SDKs/MacOSX14.4.sdk -mmacosx-version-min=14.3 -fPIC -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wunreachable-code-break -Wunreachable-code-return -Wdouble-promotion -MD -MT vendor/llama.cpp/CMakeFiles/ggml.dir/ggml.c.o -MF vendor/llama.cpp/CMakeFiles/ggml.dir/ggml.c.o.d -o vendor/llama.cpp/CMakeFiles/ggml.dir/ggml.c.o -c /private/var/folders/wm/9mckczj143jdzdzr037_g8lm0000gn/T/pip-install-f3axqn06/llama-cpp-python_20b9c0a542fd49ac96cea9fb409a8d94/vendor/llama.cpp/ggml.c
      [11/25] /Library/Developer/CommandLineTools/usr/bin/c++ -DACCELERATE_LAPACK_ILP64 -DACCELERATE_NEW_LAPACK -DGGML_SCHED_MAX_COPIES=4 -DGGML_USE_ACCELERATE -DGGML_USE_METAL -DLLAMA_BUILD -DLLAMA_SHARED -D_DARWIN_C_SOURCE -D_XOPEN_SOURCE=600 -Dllama_EXPORTS -I/private/var/folders/wm/9mckczj143jdzdzr037_g8lm0000gn/T/pip-install-f3axqn06/llama-cpp-python_20b9c0a542fd49ac96cea9fb409a8d94/vendor/llama.cpp/. -F/Library/Developer/CommandLineTools/SDKs/MacOSX14.4.sdk/System/Library/Frameworks -O3 -DNDEBUG -std=gnu++11 -arch arm64 -isysroot /Library/Developer/CommandLineTools/SDKs/MacOSX14.4.sdk -mmacosx-version-min=14.3 -fPIC -Wmissing-declarations -Wmissing-noreturn -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi -MD -MT vendor/llama.cpp/CMakeFiles/llama.dir/llama.cpp.o -MF vendor/llama.cpp/CMakeFiles/llama.dir/llama.cpp.o.d -o vendor/llama.cpp/CMakeFiles/llama.dir/llama.cpp.o -c /private/var/folders/wm/9mckczj143jdzdzr037_g8lm0000gn/T/pip-install-f3axqn06/llama-cpp-python_20b9c0a542fd49ac96cea9fb409a8d94/vendor/llama.cpp/llama.cpp
      ninja: build stopped: subcommand failed.


      *** CMake build failed
      [end of output]

  note: This error originates from a subprocess, and is likely not a problem with pip.
  ERROR: Failed building wheel for llama-cpp-python
  Building wheel for llm-serve (pyproject.toml) ... done
  Created wheel for llm-serve: filename=llm_serve-0.0.1-py3-none-any.whl size=100808 sha256=5896e4e7b35cf15f8977a5847a9ff40f78ed2ae42e95adc28def70cefc2b426c
  Stored in directory: /Users/hhwang/Library/Caches/pip/wheels/cb/6e/71/619b3e1f616ba182cb9bfc8e0e239a9e8402f4305bc75d27d7
  Building wheel for ffmpy (setup.py) ... done
  Created wheel for ffmpy: filename=ffmpy-0.3.2-py3-none-any.whl size=5582 sha256=f2f3304e01d27a1e9f63c8c504d5d56cf0a5c40ec98c2e805c1a5d8c41ea17be
  Stored in directory: /Users/hhwang/Library/Caches/pip/wheels/bd/65/9a/671fc6dcde07d4418df0c592f8df512b26d7a0029c2a23dd81
Successfully built deepspeed llm-serve ffmpy
Failed to build llama-cpp-python
ERROR: Could not build wheels for llama-cpp-python, which is required to install pyproject.toml-based projects

vllm cannot handle "runtime_env"

For Qwen/Qwen-7B, we set runtime_env like this:

  initialization:
    runtime_env:
      pip: ["transformers_stream_generator", "tiktoken"]

But at startup we still get the exception:

ImportError: This modeling file requires the following packages that were not found in your environment: tiktoken. Run `pip install tiktoken`
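
A minimal sketch with plain Ray (an illustration of how runtime_env is normally applied, not a statement about where llm-inference drops it): the extra pip packages only become importable inside tasks or actors that actually carry the runtime_env:

    import ray

    ray.init()

    @ray.remote
    def load_model():
        import tiktoken  # raises ImportError if the runtime_env pip list was not applied
        return tiktoken.__name__

    runtime_env = {"pip": ["transformers_stream_generator", "tiktoken"]}
    print(ray.get(load_model.options(runtime_env=runtime_env).remote()))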
