
distserve's Introduction

DistServe

DistServe improves the performance of large language model (LLM) serving by disaggregating the prefill and decoding computation. Existing LLM serving systems colocate the two phases and batch the prefill and decoding computation across all users and requests. We find that this strategy not only leads to strong prefill-decoding interference but also couples the resource allocation and parallelism plans of the two phases. In DistServe, you simply set the parallelism configuration and scheduling strategy for each phase, and the system behaves like a single instance, handling KV-cache communication and memory management automatically.
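For instance, the prefill (context) and decoding phases can be given independent parallelism plans through the config objects used in the offline example further down this page; a minimal sketch (the concrete sizes are illustrative):

from distserve.config import DisaggParallelConfig, ParallelConfig

# Each phase gets its own parallelism plan; DistServe takes care of moving the
# KV cache from the prefill (context) instance to the decoding instance.
disagg_parallel_config = DisaggParallelConfig(
    context=ParallelConfig(          # prefill phase
        tensor_parallel_size=2,
        pipeline_parallel_size=1,
    ),
    decoding=ParallelConfig(         # decoding phase
        tensor_parallel_size=1,
        pipeline_parallel_size=1,
    ),
)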

It uses SwiftTransformer, a high-performance C++ Transformer inference library, as the execution backend. SwiftTransformer supports model/pipeline parallelism, FlashAttention, continuous batching, and PagedAttention.

It supports:

  • GPT-2 (gpt2, gpt2-xl, ...)
  • OPT (facebook/opt-1.3b, facebook/opt-6.7b, ...)
  • LLaMA2 (meta-llama/Llama-2-7b, meta-llama/Llama-2-13b, ...)

Build && Install

# clone the project
git clone https://github.com/LLMServe/DistServe.git && cd DistServe

# setup the distserve conda environment
conda env create -f environment.yml && conda activate distserve

# clone and build the SwiftTransformer library  
git clone https://github.com/LLMServe/SwiftTransformer.git && cd SwiftTransformer && git submodule update --init --recursive
cmake -B build && cmake --build build -j$(nproc)
cd ..

# install distserve
pip install -e .

Launching

Launch Ray Cluster

DistServe relies on Ray to implement distributed workers. If you do not launch a Ray runtime in advance, DistServe will automatically start a cluster consisting of all the GPUs on the current node. If you want to use multiple nodes for inference, start the Ray runtime manually in advance.
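For a multi-node setup, a minimal sketch of attaching to an existing Ray runtime before launching DistServe (assuming the cluster was started beforehand with ray start --head on the head node and ray start --address=<head-ip>:6379 on each worker):

# Attach to an already-running Ray cluster instead of letting DistServe start a
# single-node one on its own.
import ray

ray.init(address="auto")        # connect to the existing cluster
print(ray.cluster_resources())  # sanity check: GPUs from all nodes should be listed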

Run offline example

DistServe requires at least two GPUs to run. We provide an offline inference example in examples/offline.py.
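A condensed sketch of offline usage, based on the user-provided example code in the issues further down this page (the model name and configuration values are illustrative):

# Offline disaggregated inference: build an OfflineLLM with separate parallelism
# plans for the context (prefill) and decoding phases, then generate.
from distserve import OfflineLLM, SamplingParams
from distserve.config import (
    ModelConfig,
    DisaggParallelConfig,
    ParallelConfig,
    CacheConfig,
    ContextStageSchedConfig,
    DecodingStageSchedConfig,
)

llm = OfflineLLM(
    model_config=ModelConfig(model="facebook/opt-1.3b", tokenizer=None),
    disagg_parallel_config=DisaggParallelConfig(
        context=ParallelConfig(tensor_parallel_size=1, pipeline_parallel_size=1),
        decoding=ParallelConfig(tensor_parallel_size=1, pipeline_parallel_size=1),
    ),
    cache_config=CacheConfig(
        block_size=16,
        max_num_blocks_per_req=1024,
        gpu_memory_utilization=0.9,
        cpu_swap_space=1.0,
    ),
    context_sched_config=ContextStageSchedConfig(
        policy="fcfs", max_batch_size=4, max_tokens_per_batch=16384
    ),
    decoding_sched_config=DecodingStageSchedConfig(
        policy="fcfs", max_batch_size=4, max_tokens_per_batch=16384
    ),
)

prompts = ["Artificial intelligence is", "To be or not to be,"]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64, stop=["\n"])
outputs = llm.generate(prompts=prompts, sampling_params=sampling_params)
for prompt, step_outputs in zip(prompts, outputs):
    text = " ".join(step_output.new_token for step_output in step_outputs)
    print(f"Prompt: {prompt!r}, Generated text: {text}")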

Run online example

To run online inference, first launch the DistServe API server; see the comments in distserve/api_server/distserve_api_server.py for the available options.

Then launch the client example in examples/online.py.

Evaluation

To reproduce the experiments in our paper, please follow the guidance in the evaluation directory.

Citation

If you use DistServe for your research, please cite our paper:

@misc{zhong2024distserve,
      title={DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving}, 
      author={Yinmin Zhong and Shengyu Liu and Junda Chen and Jianbo Hu and Yibo Zhu and Xuanzhe Liu and Xin Jin and Hao Zhang},
      year={2024},
      eprint={2401.09670},
      archivePrefix={arXiv},
      primaryClass={cs.DC}
}

distserve's People

Contributors

interestinglsy, gindachen, pkuflyingpig, vhch, kylinc


distserve's Issues

Load weight error with no other information when running llama-2-7b-hf

python ./examples/offline.py --model /llama-2-7b-hf/

Task exception was never retrieved
future: <Task finished name='Task-6' coro=<_wrap_awaitable() done, defined at /root/miniconda3/envs/distserve/lib/python3.10/asyncio/tasks.py:643> exception=RayTaskError(RuntimeError)(RuntimeError(''))>
Traceback (most recent call last):
  File "/root/miniconda3/envs/distserve/lib/python3.10/asyncio/tasks.py", line 650, in _wrap_awaitable
    return (yield from awaitable.__await__())
ray.exceptions.RayTaskError(RuntimeError): ray::ParaWorker.init_model() (pid=187115, ip=11.163.86.171, actor_id=ebc0337a25939ba0cb7a85ed10000000, repr=<distserve.worker.ParaWorker object at 0x7f2b3581a800>)
  File "/DistServe/distserve/worker.py", line 97, in init_model
    self.model.load_weight(path)
RuntimeError
INFO 08:29:21 Initializing DECODING kvcaches
INFO 08:29:21 Profiling available blocks
INFO 08:29:21 Profiling result: num_gpu_blocks: 957, num_cpu_blocks: 128
INFO 08:29:21 Allocating kv cache
INFO 08:29:21 Initializing CONTEXT models
(ParaWorker pid=187115) Gpt::load() - /llama-2-7b-hf/decoder.embed_tokens.weight.pt not found

CMake build fails

Hello guys.

I'm struggling to install SwiftTransformer in my local environment.
I'm using an NVIDIA GeForce RTX 3090 with nvcc version 12.1.

When I try to run cmake -B build for SwiftTransformer,
it fails with the following error message.
Does anyone know how to solve this problem?

Thanks in advance for your great effort.


(distserve) hmchoi:~/DistServe/SwiftTransformer$ cmake -B build
CMake Error at /home/hmchoi/.local/lib/python3.8/site-packages/cmake/data/share/cmake-3.29/Modules/CMakeDetermineCompilerId.cmake:814 (message):
  Compiling the CUDA compiler identification source file
  "CMakeCUDACompilerId.cu" failed.

  Compiler: /usr/lib/nvidia-cuda-toolkit/bin/nvcc
  Build flags:
  Id flags: --keep;--keep-dir;tmp -v

  The output was:
  255
  #$ SPACE=
  #$ CUDART=cudart
  #$ HERE=/usr/lib/nvidia-cuda-toolkit/bin
  #$ THERE=/usr/lib/nvidia-cuda-toolkit/bin
  #$ TARGET_SIZE=
  #$ TARGET_DIR=
  #$ TARGET_SIZE=64
  #$ NVVMIR_LIBRARY_DIR=/usr/lib/nvidia-cuda-toolkit/libdevice
  #$ PATH=/usr/lib/nvidia-cuda-toolkit/bin:/home/hmchoi/.vscode-server/cli/servers/Stable-f1e16e1e6214d7c44d078b1f0607b2388f29d729/server/bin/remote-cli:/home/hmchoi/.local/bin:/home/hmchoi/miniconda3/envs/distserve/bin:/home/hmchoi/miniconda3/condabin:/home/hmchoi/.cargo/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin
  #$ LIBRARIES= -L/usr/lib/x86_64-linux-gnu/stubs
  #$ rm tmp/a_dlink.reg.c
  #$ gcc -std=c++14 -D__CUDA_ARCH__=300 -E -x c++ -DCUDA_DOUBLE_MATH_FUNCTIONS -D__CUDACC__ -D__NVCC__ -D"CUDACC_VER_BUILD=85" -D"CUDACC_VER_MINOR=1" -D"CUDACC_VER_MAJOR=9" -include "cuda_runtime.h" -m64 "CMakeCUDACompilerId.cu" > "tmp/CMakeCUDACompilerId.cpp1.ii"
  #$ cicc --c++14 --gnu_version=70500 --allow_managed -arch compute_30 -m64 -ftz=0 -prec_div=1 -prec_sqrt=1 -fmad=1 --include_file_name "CMakeCUDACompilerId.fatbin.c" -tused -nvvmir-library "/usr/lib/nvidia-cuda-toolkit/libdevice/libdevice.10.bc" --gen_module_id_file --module_id_file_name "tmp/CMakeCUDACompilerId.module_id" --orig_src_file_name "CMakeCUDACompilerId.cu" --gen_c_file_name "tmp/CMakeCUDACompilerId.cudafe1.c" --stub_file_name "tmp/CMakeCUDACompilerId.cudafe1.stub.c" --gen_device_file_name "tmp/CMakeCUDACompilerId.cudafe1.gpu" "tmp/CMakeCUDACompilerId.cpp1.ii" -o "tmp/CMakeCUDACompilerId.ptx"
  #$ ptxas -arch=sm_30 -m64 "tmp/CMakeCUDACompilerId.ptx" -o "tmp/CMakeCUDACompilerId.sm_30.cubin"
  ptxas fatal : Value 'sm_30' is not defined for option 'gpu-name'
  --error 0xff --

Error: Peer-to-peer access is unsupported on this platform

Hi, the problem is:

(ParaWorker pid=1955026) Error: Peer-to-peer access is unsupported on this platform.
(ParaWorker pid=1955026) In the current version of distserve, it is necessary to use a platform that supports GPU P2P access.
(ParaWorker pid=1955026) Exiting...

I hit this problem, but I have actually checked the P2P connection between the two GPUs. I used the following code to test the P2P connection between GPUs:

import torch

tensor_a = torch.randn(10, device="cuda:0")
try:
    # Attempt to directly copy tensor_a from GPU 0 to GPU 1
    tensor_b = tensor_a.to("cuda:1")
    print("Successfully copied tensor from GPU 0 to GPU 1 using P2P.")
except RuntimeError as e:
    print("Failed to copy tensor from GPU 0 to GPU 1 using P2P. Error:", e)

and the output is:

Successfully copied tensor from GPU 0 to GPU 1 using P2P.

and the GPU topology is:

[screenshot of the GPU topology]

can you provide any suggestions?

thank you!
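As a side note, copying a tensor with .to("cuda:1") can succeed even when P2P is unavailable, because PyTorch may stage the copy through host memory. A more direct check is torch.cuda.can_device_access_peer; a minimal sketch, not part of DistServe:

# Query P2P capability directly for every GPU pair.
import torch

n = torch.cuda.device_count()
for src in range(n):
    for dst in range(n):
        if src != dst:
            ok = torch.cuda.can_device_access_peer(src, dst)
            print(f"GPU {src} -> GPU {dst}: peer access {'supported' if ok else 'NOT supported'}")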

[Profile] How to use nsight to profile CUDA execution information

I used the nsys CLI to profile both the offline example and serving, and ran into some problems:

  1. The CUDA HW info cannot be traced when I use examples/offline.py.
  2. When I use nsys to start a serving run, the CUDA HW info of the prefill instance cannot be traced; only the decode instance is traced.
  3. Using nsys inside a Ray actor to profile a single process still cannot trace the CUDA HW info.

Could you advise how to profile correctly?

Why does it show 0 unaccepted, 0 waiting, 0 processing?

I am using the offline example.
When I use tensor_parallel_size=1 and pipeline_parallel_size=1, the result is correct. But with tensor_parallel_size=2 and pipeline_parallel_size=2, it loops infinitely, as shown below.

INFO 03:52:48 (decoding) 0 unaccepted, 0 waiting, 0 processing
INFO 03:52:49 (context) 0 waiting, 0 finished but unaccepted, 6 blocks occupied by on-the-fly requests
INFO 03:52:49 (decoding) CPU blocks: 0 / 128 (0.00%) used, (0 swapping in)
INFO 03:52:49 (decoding) GPU blocks: 0 / 2916 (0.00%) used, (0 swapping out)
INFO 03:52:49 (decoding) 0 unaccepted, 0 waiting, 0 processing
INFO 03:52:50 (context) 0 waiting, 0 finished but unaccepted, 6 blocks occupied by on-the-fly requests
INFO 03:52:50 (decoding) CPU blocks: 0 / 128 (0.00%) used, (0 swapping in)
INFO 03:52:50 (decoding) GPU blocks: 0 / 2916 (0.00%) used, (0 swapping out)
INFO 03:52:50 (decoding) 0 unaccepted, 0 waiting, 0 processing
INFO 03:52:51 (context) 0 waiting, 0 finished but unaccepted, 6 blocks occupied by on-the-fly requests
INFO 03:52:51 (decoding) CPU blocks: 0 / 128 (0.00%) used, (0 swapping in)
INFO 03:52:51 (decoding) GPU blocks: 0 / 2916 (0.00%) used, (0 swapping out)
INFO 03:52:51 (decoding) 0 unaccepted, 0 waiting, 0 processing

Here is my code. Thank you.

import argparse
import os
from distserve import OfflineLLM, SamplingParams
from distserve.config import (
    ModelConfig,
    DisaggParallelConfig,
    ParallelConfig,
    CacheConfig,
    ContextStageSchedConfig,
    DecodingStageSchedConfig
)
os.environ['CUDA_VISIBLE_DEVICES']='2,3,4,5,6,7'
parser = argparse.ArgumentParser()
parser.add_argument('--model', type=str, help='The model to use', default='/home/xiaoxu/test/Llama-2-7b-hf')
args = parser.parse_args()

prompts = [
    "Life blooms like a flower. Far away or by the road. Waiting",
    "A quick brown fox",
    "Artificial intelligence is",
    "To be or not to be,",
    "one two three four"
]
sampling_params = SamplingParams(
    temperature=0.8, top_p=0.95, max_tokens=64, stop=["\n"]
)
llm = OfflineLLM(
    model_config=ModelConfig(
        model=args.model,
        tokenizer=None
    ),
    disagg_parallel_config=DisaggParallelConfig(
        context=ParallelConfig(
            tensor_parallel_size=2,
            pipeline_parallel_size=2
        ),
        decoding=ParallelConfig(
            tensor_parallel_size=1,
            pipeline_parallel_size=1
        )
    ),
    cache_config=CacheConfig(
        block_size=16,
        max_num_blocks_per_req=1024,
        gpu_memory_utilization=0.9,
        cpu_swap_space=1.0
    ),
    context_sched_config=ContextStageSchedConfig(
        policy="fcfs",
        max_batch_size=4,
        max_tokens_per_batch=16384
    ),
    decoding_sched_config=DecodingStageSchedConfig(
        policy="fcfs",
        max_batch_size=4,
        max_tokens_per_batch=16384
    )
)
outputs = llm.generate(prompts=prompts, sampling_params=sampling_params)
for prompt, step_outputs in zip(prompts, outputs):
    # new_token_ids = [step_output.new_token_id for step_output in step_outputs]
    # output_text = llm.tokenizer.decode(new_token_ids)
    print(
        f"Prompt: {prompt!r}, Generated text: {' '.join([step_output.new_token for step_output in step_outputs])} ({len(step_outputs)} tokens generated)."
    )

No module named distserve.simulator.utils

[screenshot of the import error]

I tried to find this module, but it is not present in the main branch (it is imported by evaluation/2-benchmark-serving/1-analyse-dataset.py).

A Trick for Enhancing LLM Serving Throughput with a KV Cache Pipeline Strategy

Hi, your DistServe paper is very good and insightful; thanks a lot for the ideas and implementations! Recently I had an idea that explores further into the domain of prefill-decode disaggregation, but I am not sure whether it can work. If appropriate, I want to share it here. I would deeply appreciate your perspective on this concept, given your expertise :)

My observations are:

  • LLM serving throughput is mainly bottlenecked at GPU memory capacity (KV cache).
  • In a prefill-decode disaggregation setting, prefill machine memory is often underutilized, as the KV cache is only stored on the decode machine.
  • KV cache occupies memory until the decoding process ends and the request results are returned. However, the KV caches across different layers are accessed sequentially, with data in a layer remaining idle until accessed during the next iteration (token).

Based on these points, I propose a KV cache pipeline strategy between prefill machines and decode machines: KV caches are partitioned along the layer axis; after computing attention for a given layer, its KV cache is immediately swapped out to the local prefill machine, and it is swapped back in before it is needed in the next iteration (token).

Given the available 900 GB/s or 1.8 TB/s NVLink bandwidth, it looks workable. What do you think? If you also find this approach potentially feasible, perhaps I could contribute some code to the DistServe repository in my spare time to see its effect in practice.
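A rough back-of-the-envelope check of the idea, assuming Llama-2-7B in fp16 (hidden size 4096, so K and V are each 4096 values per token per layer); the batch shape and bandwidth figure below are illustrative assumptions, not measurements:

# Estimate the NVLink traffic needed to swap one layer's KV cache out and the
# next layer's in on every decoding iteration.
hidden = 4096                                            # Llama-2-7B hidden size
bytes_fp16 = 2
kv_bytes_per_token_per_layer = 2 * hidden * bytes_fp16   # K + V -> 16 KiB per token per layer

# Assumed decode batch: 64 sequences with ~2048 cached tokens each.
tokens_cached = 64 * 2048
bytes_per_iter = 2 * tokens_cached * kv_bytes_per_token_per_layer   # swap out + swap in
print(bytes_per_iter / 1e9, "GB moved per decoding iteration")       # ~4.3 GB

nvlink_bw = 900e9                                        # assumed 900 GB/s NVLink
print(bytes_per_iter / nvlink_bw * 1e3, "ms spent on the swap per iteration")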

How to profile

I want to collect profiling data in my own experimental environment. How can I do it?
It seems this can be done with the command: "python distserve/example/profile.py --model facebook/opt-1.3b --tokenizer facebook/opt-1.3b --beam_width 1 --file_path <file_path>"
But I didn't find profile.py or similar code.

What does pp_cross mean in the simulator output?

When I was testing with the simulator, I found a pp_cross parameter in the output best_config. What is the specific meaning of this parameter?

Best per GPU rate: 1.56
Best config: pp_cross=1, tp_prefill=2, pp_prefill=1, tp_decode=1, pp_decode=1

Help: runs keep exiting abnormally with the llama-7b-hf model on A100; the worker keeps crashing

I am using two GPUs on an A100 machine. GPU memory usage looks low, and both pp and tp are 1, but the run keeps exiting abnormally. The error message is as follows:

INFO 12:20:10 (context) 0 waiting, 0 finished but unaccepted, 0 blocks occupied by on-the-fly requests
INFO 12:20:10 (decoding) CPU blocks: 0 / 2048 (0.00%) used, (0 swapping in)
INFO 12:20:10 (decoding) GPU blocks: 70 / 7467 (0.94%) used, (0 swapping out)
INFO 12:20:10 (decoding) 0 unaccepted, 0 waiting, 1 processing
INFO 12:20:10 (context) Forwarding with lengths [1139]
(context) Warning: Cannot decode token with id 137438954496. Error: out of range integral type conversion attempted
(raylet) A worker died or was killed while executing a task by an unexpected system error. To troubleshoot the problem, check the logs for the dead worker. RayTask ID: ffffffffffffffff4643ca20eddd0f767dfb6f8f07000000 Worker ID: 6536068ce8f7629c0e7caa88c83529a6f763a124689fcb5e43d25636 Node ID: a1ac48022b082e341024d15358be110126114be06dc797078ce8c65c Worker IP address: 33.137.92.88 Worker port: 10151 Worker PID: 18503 Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.
Traceback (most recent call last):
File "/DistServe/distserve/api_server/distserve_api_server.py", line 159, in start_event_loop_wrapper
await task
File "/DistServe/distserve/llm.py", line 167, in start_event_loop
await self.engine.start_all_event_loops()
File "/DistServe/distserve/engine.py", line 251, in start_all_event_loops
await asyncio.gather(
File "/DistServe/distserve/single_stage_engine.py", line 663, in start_event_loop
await asyncio.gather(event_loop1(), event_loop2(), event_loop3())
File "/DistServe/distserve/single_stage_engine.py", line 654, in event_loop2
await self._step()
File "/DistServe/distserve/single_stage_engine.py", line 600, in _step
generated_tokens_ids = await self.batches_ret_futures[0]
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.
class_name: ParaWorker
actor_id: 4643ca20eddd0f767dfb6f8f07000000
pid: 18503
namespace: 336d29b2-5654-4240-bec9-7def73115ad1
ip: 33.137.92.88
The actor is dead because its worker process has died. Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.

The configuration is as follows:

sampling_params = SamplingParams(
    n=1,
    use_beam_search=0,
    temperature=1,
    top_p=1,
    max_tokens=512,
    stop=["\n"]
)

Garbled model inference results; how can this be fixed?

I am using llama-2-7b-chat. I tried both OfflineLLM and the online LLM, and the generated results are garbled.

For example:
Prompt: 'Life blooms like a flower. Far away or by the road. Waiting',
Generated text: for the right time to blo om . Ћ
The sun is sh ining on the earth .
The sun is sh ining on the earth .
The moon is sh ining on the sea .
The sun is sh ining on the sea .
The moon is sh ining on the sea .
The moon is sh (64 tokens generated).

Prompt: 'I have a cold and a headache. What should I do? ',
Generated text: 1 .
I have a cold and a head ache . I ' m not feeling well .
I ' m not feeling well .
I ' m not feeling well .
I ' m not feeling well .
I ' m not feeling well .
I ' m feeling sick .
I ' m feeling sick (64 tokens generated).

With the native HuggingFace code, the result is:
Prompt:
I have a cold and a headache. What should I do?
Generated text:
You should drink plenty of fluids and take paracetamol. If the headache is severe, you should consult your doctor.

Where might the problem be?

Support autoscaled prefill/decode servers

When the prefill servers are more heavily loaded than the decode servers, we need to adjust the ratio between prefill and decode servers. Is it possible to achieve this mechanism with DistServe?

Decode Wrong Token

model: Llama-2-7b-hf
Steps:
1. python3 converter.py --input "Llama-2-7b-hf/*.bin" --output /datasets/distserve/llama-7b --dtype float16 --model llama
2. python3 api_server/distserve_api_server.py --port 6902 --model /datasets/distserve/llama-7b --context-tensor-parallel-size 1 --decoding-tensor-parallel-size 1
3. python3 evaluation/2-benchmark-serving/0-prepare-dataset.py --dataset-path Sharegpt
4. python3 evaluation/2-benchmark-serving/2-benchmark-serving.py --port 6902

The error message:
SwiftTransformer/src/csrc/model/gpt/gpt.cc:278 'cudaMemcpy(ith_context_req_req_index.ptr, ith_context_req_req_index_cpu, sizeof(int64_t) * batch_size, cudaMemcpyHostToDevice)': (700) an illegal memory access was encountered

Question on original DistServe paper - communication overhead

Hi, first of all thanks for the great work.

I have been deep-diving into your paper and have the following two questions:

[figure from the paper]
I wonder how this 90 Gbps figure was calculated. Was it measured in a real PoC test, or is it a projection?

  1. Are we overlapping the KV cache transmission with the prefill computation to hide the latency?
    [plot from the paper]
    In this plot, it seems that we are not hiding the overhead.
    Also, do we transmit the KV cache layer by layer, or only after the entire prefill computation is done?

Looking forward to your reply:)
Thank you in advance.

SwiftTransformer cmake build keeps looping

-- Found Torch: /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch.so
-- USE_CXX11_ABI=False
-- The C compiler identification is GNU 11.4.0
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /usr/bin/cc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Found Python: /opt/conda/bin/python3.10 (found version "3.10.14") found components: Interpreter
CMake Warning (dev) at /opt/conda/lib/python3.10/site-packages/cmake/data/share/cmake-3.29/Modules/FetchContent.cmake:1352 (message):
The DOWNLOAD_EXTRACT_TIMESTAMP option was not given and policy CMP0135 is
not set. The policy's OLD behavior will be used. When using a URL
download, the timestamps of extracted files should preferably be that of
the time of extraction, otherwise code that depends on the extracted
contents might not be rebuilt if the URL changes. The OLD behavior
preserves the timestamps from the archive instead, but this is usually not
what you want. Update your project to the NEW behavior or specify the
DOWNLOAD_EXTRACT_TIMESTAMP option with a value of true to avoid this
robustness issue.
Call Stack (most recent call first):
CMakeLists.txt:139 (FetchContent_Declare)
This warning is for project developers. Use -Wno-dev to suppress it.

CMake Deprecation Warning at build/_deps/json-src/CMakeLists.txt:1 (cmake_minimum_required):
Compatibility with CMake < 3.5 will be removed from a future version of
CMake.

Update the VERSION argument value or use a ... suffix to tell
CMake that the project does not need compatibility with older versions.

-- Using the multi-header code from /data/lc/DistServe/SwiftTransformer/build/_deps/json-src/include/
-- Configuring done (9.9s)
You have changed variables that require your cache to be deleted.
Configure will be re-run and you may have to reset some variables.
The following variables have changed:
CMAKE_CUDA_COMPILER= /etc/alternatives/cuda/bin/nvcc

-- The CXX compiler identification is GNU 11.4.0
-- The CUDA compiler identification is NVIDIA 12.1.105
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /usr/bin/c++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Detecting CUDA compiler ABI info
-- Detecting CUDA compiler ABI info - done
-- Check for working CUDA compiler: /etc/alternatives/cuda/bin/nvcc - skipped
-- Detecting CUDA compile features
-- Detecting CUDA compile features - done
-- Found CUDAToolkit: /etc/alternatives/cuda/targets/x86_64-linux/include (found suitable version "12.1.105", minimum required is "11.4")
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Success
-- Found Threads: TRUE
No build type selected, defaulting to RELEASE mode
Use -DBUILD_MODE=DEBUG or -DBUILD_MODE=RELEASE to specify build type
Building in release mode
Building with MPI and NCCL
-- Found NCCL: /usr/include
-- Determining NCCL version from /usr/include/nccl.h...
-- Looking for NCCL_VERSION_CODE
-- Looking for NCCL_VERSION_CODE - not found
-- Found NCCL (include: /usr/include, library: /usr/lib/x86_64-linux-gnu/libnccl.so)
-- Found MPI_CXX: /user/local/openmpi/lib/libmpi.so (found version "3.1")
-- Found MPI: TRUE (found version "3.1")
-- Found CUDA: /usr/local/cuda (found version "12.1")
-- Found CUDAToolkit: /etc/alternatives/cuda/include (found version "12.1.105")
-- Caffe2: CUDA detected: 12.1
-- Caffe2: CUDA nvcc is: /usr/local/cuda/bin/nvcc
-- Caffe2: CUDA toolkit directory: /usr/local/cuda
-- Caffe2: Header version is: 12.1
-- /usr/local/cuda-12.1/targets/x86_64-linux/lib/libnvrtc.so shorthash is b51b459d
-- Found CUDNN: /usr/lib/x86_64-linux-gnu/libcudnn.so
-- USE_CUSPARSELT is set to 0. Compiling without cuSPARSELt support
-- Autodetected CUDA architecture(s): 8.0 8.0 8.0 8.0 8.0 8.0 8.0 8.0
-- Added CUDA NVCC flags for: -gencode;arch=compute_80,code=sm_80
CMake Warning at /opt/conda/lib/python3.10/site-packages/torch/share/cmake/Torch/TorchConfig.cmake:22 (message):
static library kineto_LIBRARY-NOTFOUND not found.
Call Stack (most recent call first):
/opt/conda/lib/python3.10/site-packages/torch/share/cmake/Torch/TorchConfig.cmake:127 (append_torchlib_if_found)
CMakeLists.txt:100 (find_package)

-- Found Torch: /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch.so
-- USE_CXX11_ABI=False
-- The C compiler identification is GNU 11.4.0
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /usr/bin/cc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Found Python: /opt/conda/bin/python3.10 (found version "3.10.14") found components: Interpreter
CMake Warning (dev) at /opt/conda/lib/python3.10/site-packages/cmake/data/share/cmake-3.29/Modules/FetchContent.cmake:1352 (message):
The DOWNLOAD_EXTRACT_TIMESTAMP option was not given and policy CMP0135 is
not set. The policy's OLD behavior will be used. When using a URL
download, the timestamps of extracted files should preferably be that of
the time of extraction, otherwise code that depends on the extracted
contents might not be rebuilt if the URL changes. The OLD behavior
preserves the timestamps from the archive instead, but this is usually not
what you want. Update your project to the NEW behavior or specify the
DOWNLOAD_EXTRACT_TIMESTAMP option with a value of true to avoid this
robustness issue.
Call Stack (most recent call first):
CMakeLists.txt:139 (FetchContent_Declare)
This warning is for project developers. Use -Wno-dev to suppress it.

CMake Deprecation Warning at build/_deps/json-src/CMakeLists.txt:1 (cmake_minimum_required):
Compatibility with CMake < 3.5 will be removed from a future version of
CMake.

Update the VERSION argument value or use a ... suffix to tell
CMake that the project does not need compatibility with older versions.

-- Using the multi-header code from /data/lc/DistServe/SwiftTransformer/build/_deps/json-src/include/
-- Configuring done (10.0s)
You have changed variables that require your cache to be deleted.
Configure will be re-run and you may have to reset some variables.
The following variables have changed:
CMAKE_CUDA_COMPILER= /etc/alternatives/cuda/bin/nvcc

-- The CXX compiler identification is GNU 11.4.0
-- The CUDA compiler identification is NVIDIA 12.1.105
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /usr/bin/c++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Detecting CUDA compiler ABI info
-- Detecting CUDA compiler ABI info - done
-- Check for working CUDA compiler: /etc/alternatives/cuda/bin/nvcc - skipped
-- Detecting CUDA compile features
-- Detecting CUDA compile features - done
-- Found CUDAToolkit: /etc/alternatives/cuda/targets/x86_64-linux/include (found suitable version "12.1.105", minimum required is "11.4")
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Success
-- Found Threads: TRUE
No build type selected, defaulting to RELEASE mode
Use -DBUILD_MODE=DEBUG or -DBUILD_MODE=RELEASE to specify build type
Building in release mode
Building with MPI and NCCL
-- Found NCCL: /usr/include
-- Determining NCCL version from /usr/include/nccl.h...
-- Looking for NCCL_VERSION_CODE
-- Looking for NCCL_VERSION_CODE - not found
-- Found NCCL (include: /usr/include, library: /usr/lib/x86_64-linux-gnu/libnccl.so)
-- Found MPI_CXX: /user/local/openmpi/lib/libmpi.so (found version "3.1")
-- Found MPI: TRUE (found version "3.1")

Does DistServe support heterogeneous inference?

My machine has two GPUs, an A40 and an A30; the A40 has 48 GB of memory and the A30 has 24 GB. The model is Llama-2-7b-chat-hf in fp16. By my calculation both GPUs should be able to hold it, but the A30, which acts as the decode machine, reports an OOM error.

Great work!

Congratulations, great work!
I'm wondering whether you will continue to develop the framework to reach vLLM scale.
If so, please share some docs/roadmaps about the system architecture design with the community, so that everyone can help contribute.

Model not loaded error

I'm getting this error stacktrace when I run DistServe with tensor parallelism > 1.

KeyboardInterrupt: 
Task exception was never retrieved
future: <Task finished name='Task-38' coro=<LLMEngine.start_all_event_loops() done, defined at /code/dev/DistServe/distserve/engine.py:244> exception=RayTaskError(RuntimeError)(RuntimeError('Please load the weight before inference.'))>
Traceback (most recent call last):
  File "/code/dev/DistServe/distserve/engine.py", line 251, in start_all_event_loops
    await asyncio.gather(
  File "/code/dev/DistServe/distserve/single_stage_engine.py", line 423, in start_event_loop
    await asyncio.gather(event_loop1(), event_loop2())
  File "/code/dev/DistServe/distserve/single_stage_engine.py", line 415, in event_loop1
    await self._step()
  File "/code/dev/DistServe/distserve/single_stage_engine.py", line 355, in _step
    generated_tokens_ids = await self.batches_ret_futures[0]
ray.exceptions.RayTaskError(RuntimeError): ray::ParaWorker.step() (pid=339867, ip=10.42.4.161, actor_id=0f69a4080eac4a5cc75b4f1601000000, repr=<distserve.worker.ParaWorker object at 0x7f69e4db69b0>)
  File "/code/dev/DistServe/distserve/worker.py", line 217, in step
    generated_tokens_ids = self.model.forward(
RuntimeError: Please load the weight before inference.

hotcrp

In the README, you mentioned that you provided credentials to log in to RunPod via HotCRP. Where can I find them?

Failure to run examples/offline.py: unable to download the model to reproduce

Hi,
I am trying to reproduce the results, but I am unable to download the llama2-7b-hf model, as the logs below show:

root@d7b9ced7ced8:/workspace/DistServe# python3 examples/offline.py
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_errors.py", line 304, in hf_raise_for_status
response.raise_for_status()
File "/usr/local/lib/python3.10/dist-packages/requests/models.py", line 1024, in raise_for_status
raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 403 Client Error: Forbidden for url: https://huggingface.co/meta-llama/Llama-2-7b-hf/resolve/main/config.json

The above exception was the direct cause of the following exception:
Cannot access gated repo for url https://huggingface.co/meta-llama/Llama-2-7b-hf/resolve/main/config.json.
Your request to access model meta-llama/Llama-2-7b-hf has been rejected by the repo's authors.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/workspace/DistServe/examples/offline.py", line 32, in
model_config=ModelConfig(
File "/workspace/DistServe/distserve/config.py", line 177, in init
self.hf_config = self._get_hf_config()
File "/workspace/DistServe/distserve/config.py", line 192, in _get_hf_config
raise ValueError(
ValueError: Failed to load the model config, please check the model name or path: meta-llama/Llama-2-7b-hf

Although I logged in successfully with huggingface-cli login, is there an alternative way to acquire the model? Thanks.

Token has not been saved to git credential helper. Your token has been saved to /root/.cache/huggingface/token Login successful

SwiftTransformer compilation fails with an ambiguous conversion error in the PyTorch 24.05 container

SwiftTransformer compilation fails in count_nan.cu with an error stating that more than one conversion function from "const half" to a built-in type applies:

csrc/kernel/count_nan.cu(14): error: more than one conversion function from "const half" to a built-in type applies

Reproducing the error

Start docker container with PyTorch 24.05

docker run -ti nvcr.io/nvidia/pytorch:24.05-py3 bash

Install miniconda3:

mkdir -p ~/miniconda3 && \
    wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O ~/miniconda3/miniconda.sh && \
    bash ~/miniconda3/miniconda.sh -b -u -p ~/miniconda3

Activate conda environment:

source ~/miniconda3/bin/activate

Clone, install and activate DistServe:

git clone https://github.com/LLMServe/DistServe.git && \
    cd DistServe && \
    conda env create -f environment.yml && conda activate distserve

Clone SwiftTransformer and compile:

git clone https://github.com/LLMServe/SwiftTransformer.git && \
    cd SwiftTransformer && \
    git submodule update --init --recursive && \
    cmake -B build && \
    cmake --build build -j$(nproc)

Compilation log

-- The CXX compiler identification is GNU 11.4.0
-- The CUDA compiler identification is NVIDIA 12.1.66
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /usr/bin/c++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Detecting CUDA compiler ABI info
-- Detecting CUDA compiler ABI info - done
-- Check for working CUDA compiler: /root/miniconda3/envs/distserve/bin/nvcc - skipped
-- Detecting CUDA compile features
-- Detecting CUDA compile features - done
-- Found CUDAToolkit: /root/miniconda3/envs/distserve/include (found suitable version "12.1.66", minimum required is "11.4")
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Success
-- Found Threads: TRUE
No build type selected, defaulting to RELEASE mode
Use -DBUILD_MODE=DEBUG or -DBUILD_MODE=RELEASE to specify build type
Building in release mode
Building with MPI and NCCL
-- Found NCCL: /usr/include
-- Determining NCCL version from /usr/include/nccl.h...
-- Looking for NCCL_VERSION_CODE
-- Looking for NCCL_VERSION_CODE - not found
-- Found NCCL (include: /usr/include, library: /usr/lib/x86_64-linux-gnu/libnccl.so.2.21.5)
-- Found MPI_CXX: /opt/hpcx/ompi/lib/libmpi.so (found version "3.1")
-- Found MPI: TRUE (found version "3.1")
-- Found CUDA: /root/miniconda3/envs/distserve (found version "12.1") 
-- Found CUDAToolkit: /root/miniconda3/envs/distserve/include (found version "12.1.66")
-- Caffe2: CUDA detected: 12.1
-- Caffe2: CUDA nvcc is: /root/miniconda3/envs/distserve/bin/nvcc
-- Caffe2: CUDA toolkit directory: /root/miniconda3/envs/distserve
-- Caffe2: Header version is: 12.1
-- /root/miniconda3/envs/distserve/lib/stubs/libnvrtc.so shorthash is 0ced1d3e
-- Found CUDNN: /usr/lib/x86_64-linux-gnu/libcudnn.so
-- USE_CUSPARSELT is set to 0. Compiling without cuSPARSELt support
CMake Warning at /root/miniconda3/envs/distserve/lib/python3.10/site-packages/torch/share/cmake/Caffe2/public/utils.cmake:385 (message):
  In the future we will require one to explicitly pass TORCH_CUDA_ARCH_LIST
  to cmake instead of implicitly setting it as an env variable.  This will
  become a FATAL_ERROR in future version of pytorch.
Call Stack (most recent call first):
  /root/miniconda3/envs/distserve/lib/python3.10/site-packages/torch/share/cmake/Caffe2/public/cuda.cmake:342 (torch_cuda_get_nvcc_gencode_flag)
  /root/miniconda3/envs/distserve/lib/python3.10/site-packages/torch/share/cmake/Caffe2/Caffe2Config.cmake:87 (include)
  /root/miniconda3/envs/distserve/lib/python3.10/site-packages/torch/share/cmake/Torch/TorchConfig.cmake:68 (find_package)
  CMakeLists.txt:100 (find_package)


-- Added CUDA NVCC flags for: -gencode;arch=compute_52,code=sm_52;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_72,code=sm_72;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_87,code=sm_87;-gencode;arch=compute_90,code=sm_90;-gencode;arch=compute_90,code=compute_90
CMake Warning at /root/miniconda3/envs/distserve/lib/python3.10/site-packages/torch/share/cmake/Torch/TorchConfig.cmake:22 (message):
  static library kineto_LIBRARY-NOTFOUND not found.
Call Stack (most recent call first):
  /root/miniconda3/envs/distserve/lib/python3.10/site-packages/torch/share/cmake/Torch/TorchConfig.cmake:127 (append_torchlib_if_found)
  CMakeLists.txt:100 (find_package)


-- Found Torch: /root/miniconda3/envs/distserve/lib/python3.10/site-packages/torch/lib/libtorch.so
-- USE_CXX11_ABI=False
-- The C compiler identification is GNU 11.4.0
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /usr/bin/cc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Found Python: /root/miniconda3/envs/distserve/bin/python3.10 (found version "3.10.14") found components: Interpreter
CMake Warning (dev) at /usr/local/lib/python3.10/dist-packages/cmake/data/share/cmake-3.29/Modules/FetchContent.cmake:1352 (message):
  The DOWNLOAD_EXTRACT_TIMESTAMP option was not given and policy CMP0135 is
  not set.  The policy's OLD behavior will be used.  When using a URL
  download, the timestamps of extracted files should preferably be that of
  the time of extraction, otherwise code that depends on the extracted
  contents might not be rebuilt if the URL changes.  The OLD behavior
  preserves the timestamps from the archive instead, but this is usually not
  what you want.  Update your project to the NEW behavior or specify the
  DOWNLOAD_EXTRACT_TIMESTAMP option with a value of true to avoid this
  robustness issue.
Call Stack (most recent call first):
  CMakeLists.txt:139 (FetchContent_Declare)
This warning is for project developers.  Use -Wno-dev to suppress it.

CMake Deprecation Warning at build/_deps/json-src/CMakeLists.txt:1 (cmake_minimum_required):
  Compatibility with CMake < 3.5 will be removed from a future version of
  CMake.

  Update the VERSION argument <min> value or use a ...<max> suffix to tell
  CMake that the project does not need compatibility with older versions.


-- Using the multi-header code from /workspace/DistServe/DistServe/SwiftTransformer/build/_deps/json-src/include/
-- Configuring done (10.4s)
-- Generating done (0.1s)
-- Build files have been written to: /workspace/DistServe/DistServe/SwiftTransformer/build
[  1%] Building CXX object _deps/googletest-build/googletest/CMakeFiles/gtest.dir/src/gtest-all.cc.o
[  2%] Building CXX object src/csrc/util/CMakeFiles/util.dir/cublas_wrapper.cc.o
[  2%] Building CXX object src/csrc/util/CMakeFiles/nccl_utils.dir/nccl_utils.cc.o
[  3%] Building CXX object src/csrc/util/CMakeFiles/py_nccl_utils.dir/py_nccl.cc.o
[  4%] Building CXX object src/examples/CMakeFiles/st_args.dir/lib/st_args.cc.o
[  5%] Building CUDA object src/csrc/kernel/CMakeFiles/xformers_autogen_impl.dir/xformers/xformers/csrc/attention/cuda/fmha/autogen/impl/cutlassB_bf16_aligned_k128.cu.o
[  6%] Building CUDA object src/csrc/kernel/CMakeFiles/xformers_autogen_impl.dir/xformers/xformers/csrc/attention/cuda/fmha/autogen/impl/cutlassB_bf16_aligned_k128_dropout.cu.o
[  6%] Building CUDA object src/csrc/kernel/CMakeFiles/xformers_autogen_impl.dir/xformers/xformers/csrc/attention/cuda/fmha/autogen/impl/cutlassB_bf16_aligned_k32_dropout.cu.o
[  7%] Building CUDA object src/csrc/kernel/CMakeFiles/xformers_autogen_impl.dir/xformers/xformers/csrc/attention/cuda/fmha/autogen/impl/cutlassB_bf16_aligned_k32.cu.o
[  8%] Building CUDA object src/csrc/kernel/CMakeFiles/xformers_autogen_impl.dir/xformers/xformers/csrc/attention/cuda/fmha/autogen/impl/cutlassB_f16_aligned_k128.cu.o
[  9%] Building CUDA object src/csrc/kernel/CMakeFiles/xformers_autogen_impl.dir/xformers/xformers/csrc/attention/cuda/fmha/autogen/impl/cutlassB_f16_aligned_k128_dropout.cu.o
[  9%] Building CUDA object src/csrc/kernel/CMakeFiles/xformers_autogen_impl.dir/xformers/xformers/csrc/attention/cuda/fmha/autogen/impl/cutlassB_f16_aligned_k32.cu.o
[ 10%] Building CUDA object src/csrc/kernel/CMakeFiles/xformers_autogen_impl.dir/xformers/xformers/csrc/attention/cuda/fmha/autogen/impl/cutlassB_f16_aligned_k64.cu.o
[ 11%] Building CUDA object src/csrc/kernel/CMakeFiles/xformers_autogen_impl.dir/xformers/xformers/csrc/attention/cuda/fmha/autogen/impl/cutlassB_bf16_aligned_k65536_dropout.cu.o
[ 11%] Building CUDA object src/csrc/kernel/CMakeFiles/xformers_autogen_impl.dir/xformers/xformers/csrc/attention/cuda/fmha/autogen/impl/cutlassB_bf16_aligned_k64.cu.o
[ 12%] Building CUDA object src/csrc/kernel/CMakeFiles/xformers_autogen_impl.dir/xformers/xformers/csrc/attention/cuda/fmha/autogen/impl/cutlassB_bf16_aligned_k96.cu.o
[ 13%] Building CUDA object src/csrc/kernel/CMakeFiles/xformers_autogen_impl.dir/xformers/xformers/csrc/attention/cuda/fmha/autogen/impl/cutlassB_bf16_aligned_k65536.cu.o
[ 15%] Building CUDA object src/csrc/kernel/CMakeFiles/xformers_autogen_impl.dir/xformers/xformers/csrc/attention/cuda/fmha/autogen/impl/cutlassB_f16_notaligned_k128_dropout.cu.o
[ 15%] Building CUDA object src/csrc/kernel/CMakeFiles/xformers_autogen_impl.dir/xformers/xformers/csrc/attention/cuda/fmha/autogen/impl/cutlassB_f16_aligned_k32_dropout.cu.o
[ 16%] Building CUDA object src/csrc/kernel/CMakeFiles/xformers_autogen_impl.dir/xformers/xformers/csrc/attention/cuda/fmha/autogen/impl/cutlassB_bf16_aligned_k64_dropout.cu.o
[ 17%] Building CUDA object src/csrc/kernel/CMakeFiles/xformers_autogen_impl.dir/xformers/xformers/csrc/attention/cuda/fmha/autogen/impl/cutlassB_f16_notaligned_k128.cu.o
[ 18%] Building CUDA object src/csrc/kernel/CMakeFiles/xformers_autogen_impl.dir/xformers/xformers/csrc/attention/cuda/fmha/autogen/impl/cutlassB_f16_notaligned_k32_dropout.cu.o
[ 18%] Building CUDA object src/csrc/kernel/CMakeFiles/xformers_autogen_impl.dir/xformers/xformers/csrc/attention/cuda/fmha/autogen/impl/cutlassB_f16_notaligned_k32.cu.o
[ 19%] Building CUDA object src/csrc/kernel/CMakeFiles/xformers_autogen_impl.dir/xformers/xformers/csrc/attention/cuda/fmha/autogen/impl/cutlassB_f16_notaligned_k65536_dropout.cu.o
[ 21%] Building CUDA object src/csrc/kernel/CMakeFiles/xformers_autogen_impl.dir/xformers/xformers/csrc/attention/cuda/fmha/autogen/impl/cutlassB_f16_aligned_k96.cu.o
[ 21%] Building CUDA object src/csrc/kernel/CMakeFiles/xformers_autogen_impl.dir/xformers/xformers/csrc/attention/cuda/fmha/autogen/impl/cutlassB_f16_aligned_k64_dropout.cu.o
[ 21%] Building CUDA object src/csrc/kernel/CMakeFiles/xformers_autogen_impl.dir/xformers/xformers/csrc/attention/cuda/fmha/autogen/impl/cutlassB_f16_aligned_k65536_dropout.cu.o
[ 22%] Building CUDA object src/csrc/kernel/CMakeFiles/xformers_autogen_impl.dir/xformers/xformers/csrc/attention/cuda/fmha/autogen/impl/cutlassB_f32_aligned_k64_dropout.cu.o
[ 23%] Building CUDA object src/csrc/kernel/CMakeFiles/xformers_autogen_impl.dir/xformers/xformers/csrc/attention/cuda/fmha/autogen/impl/cutlassB_f16_notaligned_k64_dropout.cu.o
[ 24%] Building CUDA object src/csrc/kernel/CMakeFiles/xformers_autogen_impl.dir/xformers/xformers/csrc/attention/cuda/fmha/autogen/impl/cutlassB_f16_notaligned_k64.cu.o
[ 26%] Building CUDA object src/csrc/kernel/CMakeFiles/xformers_autogen_impl.dir/xformers/xformers/csrc/attention/cuda/fmha/autogen/impl/cutlassB_f32_aligned_k128_dropout.cu.o
[ 26%] Building CUDA object src/csrc/kernel/CMakeFiles/xformers_autogen_impl.dir/xformers/xformers/csrc/attention/cuda/fmha/autogen/impl/cutlassB_f32_aligned_k128.cu.o
[ 27%] Building CUDA object src/csrc/kernel/CMakeFiles/xformers_autogen_impl.dir/xformers/xformers/csrc/attention/cuda/fmha/autogen/impl/cutlassB_f32_aligned_k32_dropout.cu.o
[ 28%] Building CUDA object src/csrc/kernel/CMakeFiles/xformers_autogen_impl.dir/xformers/xformers/csrc/attention/cuda/fmha/autogen/impl/cutlassB_f16_aligned_k65536.cu.o
[ 28%] Building CUDA object src/csrc/kernel/CMakeFiles/xformers_autogen_impl.dir/xformers/xformers/csrc/attention/cuda/fmha/autogen/impl/cutlassB_f16_notaligned_k65536.cu.o
[ 29%] Building CUDA object src/csrc/kernel/CMakeFiles/xformers_autogen_impl.dir/xformers/xformers/csrc/attention/cuda/fmha/autogen/impl/cutlassB_f32_notaligned_k128_dropout.cu.o
[ 29%] Building CUDA object src/csrc/kernel/CMakeFiles/xformers_autogen_impl.dir/xformers/xformers/csrc/attention/cuda/fmha/autogen/impl/cutlassB_f32_aligned_k65536_dropout.cu.o
[ 30%] Building CUDA object src/csrc/kernel/CMakeFiles/xformers_autogen_impl.dir/xformers/xformers/csrc/attention/cuda/fmha/autogen/impl/cutlassB_f32_notaligned_k32.cu.o
[ 31%] Building CUDA object src/csrc/kernel/CMakeFiles/xformers_autogen_impl.dir/xformers/xformers/csrc/attention/cuda/fmha/autogen/impl/cutlassB_f32_notaligned_k128.cu.o
[ 32%] Building CUDA object src/csrc/kernel/CMakeFiles/xformers_autogen_impl.dir/xformers/xformers/csrc/attention/cuda/fmha/autogen/impl/cutlassB_f32_aligned_k32.cu.o
[ 32%] Building CUDA object src/csrc/kernel/CMakeFiles/xformers_autogen_impl.dir/xformers/xformers/csrc/attention/cuda/fmha/autogen/impl/cutlassB_f32_aligned_k65536.cu.o
[ 34%] Building CUDA object src/csrc/kernel/CMakeFiles/xformers_autogen_impl.dir/xformers/xformers/csrc/attention/cuda/fmha/autogen/impl/cutlassB_f32_notaligned_k64.cu.o
[ 34%] Building CUDA object src/csrc/kernel/CMakeFiles/xformers_autogen_impl.dir/xformers/xformers/csrc/attention/cuda/fmha/autogen/impl/cutlassB_f32_aligned_k64.cu.o
[ 35%] Building CUDA object src/csrc/kernel/CMakeFiles/xformers_autogen_impl.dir/xformers/xformers/csrc/attention/cuda/fmha/autogen/impl/cutlassB_f32_notaligned_k65536.cu.o
[ 36%] Building CUDA object src/csrc/kernel/CMakeFiles/xformers_autogen_impl.dir/xformers/xformers/csrc/attention/cuda/fmha/autogen/impl/cutlassB_f32_notaligned_k65536_dropout.cu.o
[ 36%] Building CUDA object src/csrc/kernel/CMakeFiles/xformers_autogen_impl.dir/xformers/xformers/csrc/attention/cuda/fmha/autogen/impl/cutlassF_bf16_aligned.cu.o
[ 38%] Building CUDA object src/csrc/kernel/CMakeFiles/xformers_autogen_impl.dir/xformers/xformers/csrc/attention/cuda/fmha/autogen/impl/cutlassF_f32_notaligned.cu.o
[ 38%] Building CUDA object src/csrc/kernel/CMakeFiles/xformers_autogen_impl.dir/xformers/xformers/csrc/attention/cuda/fmha/autogen/impl/cutlassB_f32_notaligned_k64_dropout.cu.o
[ 38%] Building CUDA object src/csrc/kernel/CMakeFiles/xformers_autogen_impl.dir/xformers/xformers/csrc/attention/cuda/fmha/autogen/impl/cutlassB_f32_notaligned_k32_dropout.cu.o
[ 39%] Building CUDA object src/csrc/kernel/CMakeFiles/xformers_autogen_impl.dir/xformers/xformers/csrc/attention/cuda/fmha/autogen/impl/cutlassF_f32_aligned.cu.o
[ 39%] Building CUDA object src/csrc/kernel/CMakeFiles/xformers_autogen_impl.dir/xformers/xformers/csrc/attention/cuda/fmha/autogen/impl/cutlassF_f16_notaligned.cu.o
[ 40%] Building CUDA object src/csrc/kernel/CMakeFiles/xformers_autogen_impl.dir/xformers/xformers/csrc/attention/cuda/fmha/autogen/impl/cutlassF_f16_aligned.cu.o
[ 41%] Linking CUDA device code CMakeFiles/nccl_utils.dir/cmake_device_link.o
[ 41%] Linking CXX static library libutil.a
[ 41%] Built target util
[ 42%] Linking CXX static library libst_args.a
[ 43%] Building CXX object src/csrc/util/CMakeFiles/py_block_migration.dir/py_block_migration.cc.o
[ 43%] Building CXX object src/csrc/util/CMakeFiles/py_swapping.dir/py_swapping.cc.o
[ 43%] Building CUDA object src/csrc/kernel/CMakeFiles/kernel.dir/count_nan.cu.o
[ 44%] Building CUDA object src/csrc/kernel/CMakeFiles/kernel.dir/fused_decoding_stage_attention.cu.o
[ 45%] Building CUDA object src/csrc/kernel/CMakeFiles/kernel.dir/addbias.cu.o
[ 45%] Building CUDA object src/csrc/kernel/CMakeFiles/kernel.dir/fused_addbias_activ.cu.o
[ 45%] Building CUDA object src/csrc/kernel/CMakeFiles/kernel.dir/softmax.cu.o
[ 46%] Building CUDA object src/csrc/kernel/CMakeFiles/kernel.dir/embedding.cu.o
[ 47%] Building CUDA object src/csrc/kernel/CMakeFiles/kernel.dir/gather_last_tokens.cu.o
[ 48%] Building CUDA object src/csrc/kernel/CMakeFiles/kernel.dir/layernorm.cu.o
[ 49%] Building CUDA object src/csrc/kernel/CMakeFiles/kernel.dir/fused_decoding_stage_attention_mha.cu.o
[ 50%] Building CUDA object src/csrc/kernel/CMakeFiles/kernel.dir/rotary_posi_embedding.cu.o
[ 50%] Building CUDA object src/csrc/kernel/CMakeFiles/kernel.dir/kvcache_mgmt.cu.o
[ 52%] Building CUDA object src/csrc/kernel/CMakeFiles/kernel.dir/fused_activ_multiply.cu.o
[ 52%] Building CUDA object src/csrc/kernel/CMakeFiles/kernel.dir/unfused_attention.cu.o
[ 54%] Building CUDA object src/csrc/kernel/CMakeFiles/kernel.dir/findmax.cu.o
[ 54%] Building CUDA object src/csrc/kernel/CMakeFiles/kernel.dir/fused_context_stage_attention.cu.o
[ 55%] Building CUDA object src/csrc/kernel/CMakeFiles/kernel.dir/rmsnorm.cu.o
[ 56%] Linking CXX static library libnccl_utils.a
[ 56%] Built target st_args
[ 56%] Built target nccl_utils
/workspace/DistServe/DistServe/SwiftTransformer/src/csrc/kernel/count_nan.cu(14): error: more than one conversion function from "const half" to a built-in type applies:
            function "__half::operator float() const" (declared at line 217 of /root/miniconda3/envs/distserve/include/cuda_fp16.hpp)
            function "__half::operator short() const" (declared at line 235 of /root/miniconda3/envs/distserve/include/cuda_fp16.hpp)
            function "__half::operator unsigned short() const" (declared at line 238 of /root/miniconda3/envs/distserve/include/cuda_fp16.hpp)
            function "__half::operator int() const" (declared at line 241 of /root/miniconda3/envs/distserve/include/cuda_fp16.hpp)
            function "__half::operator unsigned int() const" (declared at line 244 of /root/miniconda3/envs/distserve/include/cuda_fp16.hpp)
            function "__half::operator long long() const" (declared at line 247 of /root/miniconda3/envs/distserve/include/cuda_fp16.hpp)
            function "__half::operator unsigned long long() const" (declared at line 250 of /root/miniconda3/envs/distserve/include/cuda_fp16.hpp)
            function "__half::operator __nv_bool() const" (declared at line 254 of /root/miniconda3/envs/distserve/include/cuda_fp16.hpp)
    if (arr[i] != arr[i]) {
        ^
          detected during:
            instantiation of "void st::kernel::countNanKernel(int *, const T *, int) [with T=half]" at line 31
            instantiation of "int st::kernel::countNan(const T *, int) [with T=half]" at line 43

/workspace/DistServe/DistServe/SwiftTransformer/src/csrc/kernel/count_nan.cu(14): error: more than one conversion function from "const half" to a built-in type applies:
            function "__half::operator float() const" (declared at line 217 of /root/miniconda3/envs/distserve/include/cuda_fp16.hpp)
            function "__half::operator short() const" (declared at line 235 of /root/miniconda3/envs/distserve/include/cuda_fp16.hpp)
            function "__half::operator unsigned short() const" (declared at line 238 of /root/miniconda3/envs/distserve/include/cuda_fp16.hpp)
            function "__half::operator int() const" (declared at line 241 of /root/miniconda3/envs/distserve/include/cuda_fp16.hpp)
            function "__half::operator unsigned int() const" (declared at line 244 of /root/miniconda3/envs/distserve/include/cuda_fp16.hpp)
            function "__half::operator long long() const" (declared at line 247 of /root/miniconda3/envs/distserve/include/cuda_fp16.hpp)
            function "__half::operator unsigned long long() const" (declared at line 250 of /root/miniconda3/envs/distserve/include/cuda_fp16.hpp)
            function "__half::operator __nv_bool() const" (declared at line 254 of /root/miniconda3/envs/distserve/include/cuda_fp16.hpp)
    if (arr[i] != arr[i]) {
                  ^
          detected during:
            instantiation of "void st::kernel::countNanKernel(int *, const T *, int) [with T=half]" at line 31
            instantiation of "int st::kernel::countNan(const T *, int) [with T=half]" at line 43

2 errors detected in the compilation of "/workspace/DistServe/DistServe/SwiftTransformer/src/csrc/kernel/count_nan.cu".
gmake[2]: *** [src/csrc/kernel/CMakeFiles/kernel.dir/build.make:92: src/csrc/kernel/CMakeFiles/kernel.dir/count_nan.cu.o] Error 1
gmake[2]: *** Waiting for unfinished jobs....
/workspace/DistServe/DistServe/SwiftTransformer/src/csrc/kernel/fused_decoding_stage_attention.cu(80): warning #177-D: variable "my_q_head_end" was declared but never referenced
   const int64_t my_q_head_end = (blockIdx.x+1)*Q_HEADS_PER_THREAD_BLOCK;
                 ^
          detected during instantiation of "void st::kernel::fusedDecodingStageAttention(T *, const T *, T *, T *, float, const int64_t *, const int64_t *, int64_t, const int64_t *, const int64_t *, int64_t, int64_t, int64_t, int64_t, int64_t, int64_t, int64_t, int64_t) [with T=float]" at line 347

Remark: The warnings can be suppressed with "-diag-suppress <warning-number>"

/workspace/DistServe/DistServe/SwiftTransformer/src/csrc/kernel/fused_addbias_activ.cu(40): error: more than one conversion function from "__half" to a built-in type applies:
            function "__half::operator float() const" (declared at line 217 of /root/miniconda3/envs/distserve/include/cuda_fp16.hpp)
            function "__half::operator short() const" (declared at line 235 of /root/miniconda3/envs/distserve/include/cuda_fp16.hpp)
            function "__half::operator unsigned short() const" (declared at line 238 of /root/miniconda3/envs/distserve/include/cuda_fp16.hpp)
            function "__half::operator int() const" (declared at line 241 of /root/miniconda3/envs/distserve/include/cuda_fp16.hpp)
            function "__half::operator unsigned int() const" (declared at line 244 of /root/miniconda3/envs/distserve/include/cuda_fp16.hpp)
            function "__half::operator long long() const" (declared at line 247 of /root/miniconda3/envs/distserve/include/cuda_fp16.hpp)
            function "__half::operator unsigned long long() const" (declared at line 250 of /root/miniconda3/envs/distserve/include/cuda_fp16.hpp)
            function "__half::operator __nv_bool() const" (declared at line 254 of /root/miniconda3/envs/distserve/include/cuda_fp16.hpp)
     applyActivation<T, ACTIVATION_TYPE>(input_elem.x + bias_elem.x),
                                         ^
          detected during:
            instantiation of "void st::kernel::fusedAddbiasBatchedActivationKernel<T,ACTIVATION_TYPE>(T *, const T *, const T *, int64_t, int64_t) [with T=half, ACTIVATION_TYPE=st::kernel::ActivationType::RELU]" at line 62
            instantiation of "void st::kernel::fusedAddbiasBatchedActivation(T *, const T *, const T *, int64_t, int64_t, st::kernel::ActivationType) [with T=half]" at line 75

/workspace/DistServe/DistServe/SwiftTransformer/src/csrc/kernel/embedding.cu(31): error: no operator "+=" matches these operands
            operand types are: half += const half
     cur_result += embed_positions_weight[my_position_id * hidden_size + hidden_size_index];
                ^
          detected during:
            instantiation of "void st::kernel::embedAndPosiEncodeBatchedKernel<T,DO_POSI_ENCODING>(T *, const int64_t *, const int64_t *, const T *, const T *, int64_t) [with T=half, DO_POSI_ENCODING=true]" at line 56
            instantiation of "void st::kernel::embedAndPosiEncodeBatched(T *, const int64_t *, const int64_t *, const T *, const T *, int64_t, int64_t) [with T=half]" at line 80

/workspace/DistServe/DistServe/SwiftTransformer/src/csrc/kernel/fused_addbias_activ.cu(40): error: more than one conversion function from "__half" to a built-in type applies:
            function "__half::operator float() const" (declared at line 217 of /root/miniconda3/envs/distserve/include/cuda_fp16.hpp)
            function "__half::operator short() const" (declared at line 235 of /root/miniconda3/envs/distserve/include/cuda_fp16.hpp)
            function "__half::operator unsigned short() const" (declared at line 238 of /root/miniconda3/envs/distserve/include/cuda_fp16.hpp)
            function "__half::operator int() const" (declared at line 241 of /root/miniconda3/envs/distserve/include/cuda_fp16.hpp)
            function "__half::operator unsigned int() const" (declared at line 244 of /root/miniconda3/envs/distserve/include/cuda_fp16.hpp)
            function "__half::operator long long() const" (declared at line 247 of /root/miniconda3/envs/distserve/include/cuda_fp16.hpp)
            function "__half::operator unsigned long long() const" (declared at line 250 of /root/miniconda3/envs/distserve/include/cuda_fp16.hpp)
            function "__half::operator __nv_bool() const" (declared at line 254 of /root/miniconda3/envs/distserve/include/cuda_fp16.hpp)
     applyActivation<T, ACTIVATION_TYPE>(input_elem.x + bias_elem.x),
                                                        ^
          detected during:
            instantiation of "void st::kernel::fusedAddbiasBatchedActivationKernel<T,ACTIVATION_TYPE>(T *, const T *, const T *, int64_t, int64_t) [with T=half, ACTIVATION_TYPE=st::kernel::ActivationType::RELU]" at line 62
            instantiation of "void st::kernel::fusedAddbiasBatchedActivation(T *, const T *, const T *, int64_t, int64_t, st::kernel::ActivationType) [with T=half]" at line 75

/workspace/DistServe/DistServe/SwiftTransformer/src/csrc/kernel/activations.cuh(13): error: more than one conversion function from "const half" to a built-in type applies:
            function "__half::operator float() const" (declared at line 217 of /root/miniconda3/envs/distserve/include/cuda_fp16.hpp)
            function "__half::operator short() const" (declared at line 235 of /root/miniconda3/envs/distserve/include/cuda_fp16.hpp)
            function "__half::operator unsigned short() const" (declared at line 238 of /root/miniconda3/envs/distserve/include/cuda_fp16.hpp)
            function "__half::operator int() const" (declared at line 241 of /root/miniconda3/envs/distserve/include/cuda_fp16.hpp)
            function "__half::operator unsigned int() const" (declared at line 244 of /root/miniconda3/envs/distserve/include/cuda_fp16.hpp)
            function "__half::operator long long() const" (declared at line 247 of /root/miniconda3/envs/distserve/include/cuda_fp16.hpp)
            function "__half::operator unsigned long long() const" (declared at line 250 of /root/miniconda3/envs/distserve/include/cuda_fp16.hpp)
            function "__half::operator __nv_bool() const" (declared at line 254 of /root/miniconda3/envs/distserve/include/cuda_fp16.hpp)
     return x > (T)0 ? x : (T)0;
            ^
          detected during:
            instantiation of "T st::kernel::applyActivation<T,activation_type>(const T &) [with T=half, activation_type=st::kernel::ActivationType::RELU]" at line 40 of /workspace/DistServe/DistServe/SwiftTransformer/src/csrc/kernel/fused_addbias_activ.cu
            instantiation of "void st::kernel::fusedAddbiasBatchedActivationKernel<T,ACTIVATION_TYPE>(T *, const T *, const T *, int64_t, int64_t) [with T=half, ACTIVATION_TYPE=st::kernel::ActivationType::RELU]" at line 62 of /workspace/DistServe/DistServe/SwiftTransformer/src/csrc/kernel/fused_addbias_activ.cu
            instantiation of "void st::kernel::fusedAddbiasBatchedActivation(T *, const T *, const T *, int64_t, int64_t, st::kernel::ActivationType) [with T=half]" at line 75 of /workspace/DistServe/DistServe/SwiftTransformer/src/csrc/kernel/fused_addbias_activ.cu

/workspace/DistServe/DistServe/SwiftTransformer/src/csrc/kernel/activations.cuh(13): error: more than one conversion function from "half" to a built-in type applies:
            function "__half::operator float() const" (declared at line 217 of /root/miniconda3/envs/distserve/include/cuda_fp16.hpp)
            function "__half::operator short() const" (declared at line 235 of /root/miniconda3/envs/distserve/include/cuda_fp16.hpp)
            function "__half::operator unsigned short() const" (declared at line 238 of /root/miniconda3/envs/distserve/include/cuda_fp16.hpp)
            function "__half::operator int() const" (declared at line 241 of /root/miniconda3/envs/distserve/include/cuda_fp16.hpp)
            function "__half::operator unsigned int() const" (declared at line 244 of /root/miniconda3/envs/distserve/include/cuda_fp16.hpp)
            function "__half::operator long long() const" (declared at line 247 of /root/miniconda3/envs/distserve/include/cuda_fp16.hpp)
            function "__half::operator unsigned long long() const" (declared at line 250 of /root/miniconda3/envs/distserve/include/cuda_fp16.hpp)
            function "__half::operator __nv_bool() const" (declared at line 254 of /root/miniconda3/envs/distserve/include/cuda_fp16.hpp)
     return x > (T)0 ? x : (T)0;
                ^
          detected during:
            instantiation of "T st::kernel::applyActivation<T,activation_type>(const T &) [with T=half, activation_type=st::kernel::ActivationType::RELU]" at line 40 of /workspace/DistServe/DistServe/SwiftTransformer/src/csrc/kernel/fused_addbias_activ.cu
            instantiation of "void st::kernel::fusedAddbiasBatchedActivationKernel<T,ACTIVATION_TYPE>(T *, const T *, const T *, int64_t, int64_t) [with T=half, ACTIVATION_TYPE=st::kernel::ActivationType::RELU]" at line 62 of /workspace/DistServe/DistServe/SwiftTransformer/src/csrc/kernel/fused_addbias_activ.cu
            instantiation of "void st::kernel::fusedAddbiasBatchedActivation(T *, const T *, const T *, int64_t, int64_t, st::kernel::ActivationType) [with T=half]" at line 75 of /workspace/DistServe/DistServe/SwiftTransformer/src/csrc/kernel/fused_addbias_activ.cu

/workspace/DistServe/DistServe/SwiftTransformer/src/csrc/kernel/fused_addbias_activ.cu(41): error: more than one conversion function from "__half" to a built-in type applies:
            function "__half::operator float() const" (declared at line 217 of /root/miniconda3/envs/distserve/include/cuda_fp16.hpp)
            function "__half::operator short() const" (declared at line 235 of /root/miniconda3/envs/distserve/include/cuda_fp16.hpp)
            function "__half::operator unsigned short() const" (declared at line 238 of /root/miniconda3/envs/distserve/include/cuda_fp16.hpp)
            function "__half::operator int() const" (declared at line 241 of /root/miniconda3/envs/distserve/include/cuda_fp16.hpp)
            function "__half::operator unsigned int() const" (declared at line 244 of /root/miniconda3/envs/distserve/include/cuda_fp16.hpp)
            function "__half::operator long long() const" (declared at line 247 of /root/miniconda3/envs/distserve/include/cuda_fp16.hpp)
            function "__half::operator unsigned long long() const" (declared at line 250 of /root/miniconda3/envs/distserve/include/cuda_fp16.hpp)
            function "__half::operator __nv_bool() const" (declared at line 254 of /root/miniconda3/envs/distserve/include/cuda_fp16.hpp)
     applyActivation<T, ACTIVATION_TYPE>(input_elem.y + bias_elem.y)
                                         ^
          detected during:
            instantiation of "void st::kernel::fusedAddbiasBatchedActivationKernel<T,ACTIVATION_TYPE>(T *, const T *, const T *, int64_t, int64_t) [with T=half, ACTIVATION_TYPE=st::kernel::ActivationType::RELU]" at line 62
            instantiation of "void st::kernel::fusedAddbiasBatchedActivation(T *, const T *, const T *, int64_t, int64_t, st::kernel::ActivationType) [with T=half]" at line 75

CMake error when running "cmake -B build" in SwiftTransformer

-- Using the multi-header code from /home/fx/cql/DistServe/SwiftTransformer/build/_deps/json-src/include/
-- Configuring done (30.8s)
CMake Error at src/csrc/kernel/CMakeLists.txt:23 (add_library):
No SOURCES given to target: xformers_autogen_impl

Generating max_num_tokens.csv for Different Hardware Environments

In the experiments for finding the optimal placement strategy, there are three files in the simdistserve/estimators/profile_data directory, one of which is max_num_tokens.csv. I am curious how this file is generated for different environments: the other two JSON files can be produced with the code in evaluation/0-test-single-forward-performance, but I have not found a way to generate this CSV file. As I understand it, the file should depend on the hardware/GPU environment, so profiles from different environments should differ. I am eagerly awaiting your response. Thank you very much!
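
For what it's worth, here is a minimal sketch of how such a per-hardware value could be estimated, assuming max_num_tokens is bounded by the GPU memory left for the KV cache after loading the model weights. All formulas and numbers below are illustrative assumptions, not the script the authors actually used:

# Hypothetical estimate of the max number of tokens whose KV cache fits on one GPU.
# Every model/hardware number below is an illustrative assumption; replace with your own.
def estimate_max_num_tokens(
    gpu_mem_bytes: int,          # total GPU memory
    weight_bytes: int,           # bytes occupied by model weights on this GPU
    num_layers: int,
    num_kv_heads: int,           # KV heads resident on this GPU (after the TP split)
    head_dim: int,
    dtype_bytes: int = 2,        # fp16 / bf16
    reserve_frac: float = 0.10,  # head-room for activations and fragmentation
) -> int:
    # The KV cache stores one key and one value vector per layer per token.
    kv_bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * dtype_bytes
    usable = gpu_mem_bytes * (1 - reserve_frac) - weight_bytes
    return max(0, int(usable // kv_bytes_per_token))

if __name__ == "__main__":
    # Example: a LLaMA-2-13B-like shape on an 80 GB GPU with tensor parallelism 2.
    tp = 2
    print(estimate_max_num_tokens(
        gpu_mem_bytes=80 * 1024**3,
        weight_bytes=int(13e9 * 2) // tp,  # fp16 weights split across TP ranks
        num_layers=40,
        num_kv_heads=40 // tp,
        head_dim=128,
    ))

In practice one would probably measure the real weight and activation footprint per parallelism configuration (e.g. via torch.cuda.memory_allocated) rather than rely on a closed-form estimate, which is presumably why the profile differs per hardware environment.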

Failed to compile SwiftTransformer

Commands executed:
git clone https://github.com/LLMServe/SwiftTransformer.git; cd SwiftTransformer; git submodule update --init --recursive; cmake -B build; cmake --build build -j$(nproc)

Error messages:
(a representative excerpt of the errors)
/workspace/DistServe/SwiftTransformer/src/unittest/util/../unittest_utils.h:93:45: error: call of overloaded ‘fabs(__half)’ is ambiguous
93 | fabs(answer[i]-reference[i]), fabs(answer[i]-reference[i])/fabs(reference[i]));
| ~~~~^~~~~~~~~~~~~~~~~~~~~~~~
/workspace/DistServe/SwiftTransformer/src/unittest/util/../unittest_utils.h:93:75: error: call of overloaded ‘fabs(__half)’ is ambiguous
93 | fabs(answer[i]-reference[i]), fabs(answer[i]-reference[i])/fabs(reference[i]));
| ~~~~^~~~~~~~~~~~~~~~~~~~~~~~
/workspace/DistServe/SwiftTransformer/src/csrc/kernel/fused_context_stage_attention.cu(145): error: name followed by "::" must be a class or namespace name
wmma::fragment<wmma::matrix_a, 16ul, 16ul, 16ul, __half, wmma::row_major> a_frag;
^
/workspace/DistServe/SwiftTransformer/src/csrc/kernel/fused_context_stage_attention.cu(146): error: type name is not allowed
wmma::fragment<wmma::matrix_b, 16ul, 16ul, 16ul, __half, wmma::col_major> b_frag;
^
/workspace/DistServe/SwiftTransformer/src/csrc/kernel/fused_context_stage_attention.cu(146): error: identifier "b_frag" is undefined
wmma::fragment<wmma::matrix_b, 16ul, 16ul, 16ul, __half, wmma::col_major> b_frag;
^
Build environment:
Docker image: nvcr.io/nvidia/pytorch:23.10-py3
CXX compiler: GNU 11.4.0
CUDA: NVIDIA 12.2.140
CUDAToolkit: 12.2.140
NCCL: libnccl.so.2.19.3
MPI: 3.1

decoder.embed_tokens.weight.pt not found

Hi, when I use distserve_api_server.py to start a server, it always raises an error during launch:

ray.exceptions.RayTaskError(RuntimeError): ray::ParaWorker.init_model() (pid=129376, ip=172.17.0.7, actor_id=27f5ee672314d5dfcee7384701000000, repr=<distserve.worker.ParaWorker object at 0x7efea016a110>)
  File "/share/share/DistServe/distserve/worker.py", line 100, in init_model
    self.model.load_weight(path)
RuntimeError

When I shut it down, it outputs the following error:

(ParaWorker pid=129375) Gpt<T>::load() - /llama-2-13b-hf/decoder.embed_tokens.weight.pt not found [repeated 5x across cluster]

The model is llama-2-13b-hf, and I can run it with vLLM.
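
From the error it looks like SwiftTransformer's loader expects one .pt file per tensor (e.g. decoder.embed_tokens.weight.pt) rather than a raw Hugging Face checkpoint, so the weights need to be converted first; the repository's converter script is authoritative for the exact naming scheme. Purely to illustrate the mechanics, here is a sketch that dumps each tensor into its own file, assuming (which may not hold) that the target file names simply mirror the tensor names:

# Hypothetical converter sketch: write each tensor of a Hugging Face checkpoint to
# <tensor_name>.pt. The names SwiftTransformer actually expects (e.g.
# "decoder.embed_tokens.weight") may differ from the HF names, so a real converter
# needs a name-mapping table; this only shows the mechanics.
import os
import torch
from transformers import AutoModelForCausalLM

src = "/llama-2-13b-hf"          # Hugging Face checkpoint directory
dst = "/llama-2-13b-converted"   # output directory of per-tensor .pt files
os.makedirs(dst, exist_ok=True)

model = AutoModelForCausalLM.from_pretrained(src, torch_dtype=torch.float16)
for name, tensor in model.state_dict().items():
    torch.save(tensor.contiguous(), os.path.join(dst, f"{name}.pt"))

If the loader still reports a missing file afterwards, comparing the generated file names against the one in the error message usually reveals which mapping is missing.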

CodeLlama-34B TTFT latency issue

Hello, in my recent tests I benchmarked Llama-13B, 7B, and other models on A100, comparing vLLM and DistServe. While meeting the SLO, DistServe outperforms vLLM. However, when testing CodeLlama-34B with an input length of 8192, I found that TTFT is roughly 3x higher than vLLM's. Is this expected? vLLM uses TP=2; DistServe uses prefill TP=2 and decode TP=2.

Some questions about TTFT

When you run prefill, is it prefill only, or can prefill and decode end up being executed together in a mixed batch?

My understanding of your vLLM baseline is as follows: it prioritizes TTFT, i.e. once a request needs prefill, the prefill is executed first; prefill and decode are never mixed in one batch, and scheduling happens once per generated token. I am running the distserve-baseline-vllm branch mentioned in your README.

Based on this understanding, I cannot work out why TTFT improves. For distserve-prefill-66B, the DistServe prefill instance uses TP=4 and vLLM also uses TP=4; if vLLM already prioritizes prefill (i.e. TTFT), why does DistServe still improve TTFT?

Thank you for your answer.
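
One way to see where a TTFT gap can come from even when the colocated engine prioritizes prefill: a newly arrived prompt still waits for the decode iteration that is already running on the GPU, and in a colocated batch the prefill itself is slowed by interference. Below is a toy queueing sketch of the first effect only, with made-up timings (not measurements):

# Toy model with illustrative numbers: TTFT when prefill shares a GPU with ongoing
# decode iterations vs. when it runs on a dedicated prefill GPU.
PREFILL_MS = 120.0   # assumed prefill latency of the new request
DECODE_MS = 30.0     # assumed latency of one decode iteration for resident requests

def ttft_colocated(arrival_offset_ms: float) -> float:
    # Even with prefill priority, the prompt must wait for the in-flight decode
    # iteration to finish before it can be scheduled.
    wait = (DECODE_MS - arrival_offset_ms % DECODE_MS) % DECODE_MS
    return wait + PREFILL_MS

def ttft_disaggregated() -> float:
    # The prefill GPU never runs decode, so the prompt starts immediately.
    return PREFILL_MS

if __name__ == "__main__":
    for offset in (0.0, 10.0, 25.0):
        print(f"arrival offset {offset:5.1f} ms: "
              f"colocated TTFT = {ttft_colocated(offset):6.1f} ms, "
              f"disaggregated TTFT = {ttft_disaggregated():6.1f} ms")

This toy ignores the second effect (prefill running slower when it shares the GPU with decode) and the fact that the two phases can use different parallelism plans, both of which also contribute to the TTFT difference.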

Offline.py LLMEngine.__init__() missing 1 required positional argument: 'simulator_config'

I completed the installation of DistServe. When I tried to run offline.py with my downloaded LLaMA-2 model, I encountered the following problem.

Traceback (most recent call last):
  File "/home/wangzhusheng/DistServe/./distserve/examples/offline.py", line 31, in <module>
    llm = OfflineLLM(
  File "/home/wangzhusheng/DistServe/distserve/llm.py", line 42, in __init__
    self.engine = LLMEngine(
TypeError: LLMEngine.__init__() missing 1 required positional argument: 'simulator_config'

So I read the source code and found that OfflineLLM only passes 5 arguments while LLMEngine's constructor expects 6; simulator_config is the one that is missing. Could you please fix this in the provided examples?
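
Until the example is fixed, one way to see exactly which constructor arguments are expected versus supplied is to compare the signatures directly. The snippet below only diagnoses the mismatch and does not guess what a valid simulator_config looks like; the LLMEngine import path is an assumption, so adjust it to wherever the class lives in your checkout:

# Compare what LLMEngine.__init__ expects with what OfflineLLM.__init__ accepts.
import inspect
from distserve.llm import OfflineLLM      # per the traceback above
from distserve.engine import LLMEngine    # import path is an assumption

engine_params = list(inspect.signature(LLMEngine.__init__).parameters)[1:]    # drop `self`
offline_params = list(inspect.signature(OfflineLLM.__init__).parameters)[1:]

print("LLMEngine.__init__ expects :", engine_params)
print("OfflineLLM.__init__ accepts:", offline_params)
print("not accepted by OfflineLLM :", [p for p in engine_params if p not in offline_params])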
