
PowerInfer's Introduction

PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU

TL;DR

PowerInfer is a CPU/GPU LLM inference engine leveraging activation locality for your device.

License: MIT

Project Kanban

Latest News 🔥

  • [2024/6/11] We are thrilled to introduce PowerInfer-2, our highly optimized inference framework designed specifically for smartphones. With TurboSparse-Mixtral-47B, it achieves an impressive speed of 11.68 tokens per second, which is up to 22 times faster than other state-of-the-art frameworks.
  • [2024/6/11] We are thrilled to present Turbo Sparse, our TurboSparse models for fast inference. With just $0.1M, we sparsified the original Mistral and Mixtral models to nearly 90% sparsity while maintaining superior performance! For a Mixtral-level model, our TurboSparse-Mixtral activates only 4B parameters!
  • [2024/5/20] Competition Recruitment: CCF-TCArch Customized Computing Challenge 2024. The CCF TCARCH CCC is a national competition organized by the Technical Committee on Computer Architecture (TCARCH) of the China Computer Federation (CCF). This year's competition aims to optimize the PowerInfer inference engine using the open-source ROCm/HIP. More information about the competition can be found here.
  • [2024/5/17] We now provide support for AMD devices with ROCm.
  • [2024/3/28] We are thrilled to present Bamboo LLM that achieves both top-level performance and unparalleled speed with PowerInfer! Experience it with Bamboo-7B Base / DPO.
  • [2024/3/14] We supported ProSparse Llama 2 (7B/13B), ReLU models with ~90% sparsity, matching original Llama 2's performance (Thanks THUNLP & ModelBest)!
  • [2024/1/11] We supported Windows with GPU inference!
  • [2023/12/24] We released an online gradio demo for Falcon(ReLU)-40B-FP16!
  • [2023/12/19] We officially released PowerInfer!

Demo 🔥

(Demo video: powerinfer-live-demo.mp4)

PowerInfer vs. llama.cpp on a single RTX 4090 (24G) running Falcon(ReLU)-40B-FP16, with an 11x speedup!

Both PowerInfer and llama.cpp were running on the same hardware and fully utilized VRAM on RTX 4090.

Note

Live Demo Online⚡️

Try out our Gradio server hosting Falcon(ReLU)-40B-FP16 on an RTX 4090!

Experimental and without warranties 🚧

Abstract

We introduce PowerInfer, a high-speed Large Language Model (LLM) inference engine on a personal computer (PC) equipped with a single consumer-grade GPU. The key insight underlying the design of PowerInfer is exploiting the high locality inherent in LLM inference, characterized by a power-law distribution in neuron activation.

This distribution indicates that a small subset of neurons, termed hot neurons, are consistently activated across inputs, while the majority, cold neurons, vary based on specific inputs. PowerInfer exploits such an insight to design a GPU-CPU hybrid inference engine: hot-activated neurons are preloaded onto the GPU for fast access, while cold-activated neurons are computed on the CPU, thus significantly reducing GPU memory demands and CPU-GPU data transfers. PowerInfer further integrates adaptive predictors and neuron-aware sparse operators, optimizing the efficiency of neuron activation and computational sparsity.
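
The sketch below is a minimal illustration of this hot/cold split; the function, its inputs, and the activation-count format are assumptions for illustration only, not PowerInfer's actual implementation.

import numpy as np

# Hypothetical illustration: given per-neuron activation counts from a profiling run
# (PowerInfer ships such statistics per layer, e.g. activation/activation_x.pt),
# take the most frequently activated neurons as "hot" and plan to preload them on the GPU.
def split_hot_cold(activation_counts: np.ndarray, gpu_neuron_budget: int):
    order = np.argsort(activation_counts)[::-1]  # most active first (the power-law head)
    hot = order[:gpu_neuron_budget]              # preload these FFN neurons onto the GPU
    cold = order[gpu_neuron_budget:]             # keep these on the CPU, computed on demand
    return hot, cold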

Evaluation shows that PowerInfer attains an average token generation rate of 13.20 tokens/s, with a peak of 29.08 tokens/s, across various LLMs (including OPT-175B) on a single NVIDIA RTX 4090 GPU, only 18% lower than that achieved by a top-tier server-grade A100 GPU. This significantly outperforms llama.cpp by up to 11.69x while retaining model accuracy.

Features

PowerInfer is a high-speed and easy-to-use inference engine for deploying LLMs locally.

PowerInfer is fast with:

  • Locality-centric design: Utilizes sparse activation and 'hot'/'cold' neuron concept for efficient LLM inference, ensuring high speed with lower resource demands.
  • Hybrid CPU/GPU Utilization: Seamlessly integrates memory/computation capabilities of CPU and GPU for a balanced workload and faster processing.

PowerInfer is flexible and easy to use with:

  • Easy Integration: Compatible with popular ReLU-sparse models.
  • Local Deployment Ease: Designed and deeply optimized for local deployment on consumer-grade hardware, enabling low-latency LLM inference and serving on a single GPU.
  • Backward Compatibility: While distinct from llama.cpp, you can use most of examples/ the same way as in llama.cpp, such as server and batched generation. PowerInfer also supports inference with llama.cpp's model weights for compatibility purposes, but there will be no performance gain.

You can use these models with PowerInfer today:

  • Falcon-40B
  • Llama2 family
  • ProSparse Llama2 family
  • Bamboo-7B

We have tested PowerInfer on the following platforms:

  • x86-64 CPUs with AVX2 instructions, with or without NVIDIA GPUs, under Linux.
  • x86-64 CPUs with AVX2 instructions, with or without NVIDIA GPUs, under Windows.
  • Apple M-series chips (CPU only) on macOS. (As we have not yet optimized for Mac, the performance improvement is not significant for now.)

And new features coming soon:

  • Metal backend for sparse inference on macOS

Please kindly refer to our Project Kanban for our current focus of development.

Getting Started

Setup and Installation

Pre-requisites

PowerInfer requires the following dependencies:

  • CMake (3.17+)
  • Python (3.8+) and pip (19.3+), for converting model weights and automatic FFN offloading

Get the Code

git clone https://github.com/SJTU-IPADS/PowerInfer
cd PowerInfer
pip install -r requirements.txt # install Python helpers' dependencies

Build

In order to build PowerInfer you have the following options, depending on your hardware. These commands should be run from the root directory of the project.

Using CMake (3.17+):

  • If you have an NVIDIA GPU:
cmake -S . -B build -DLLAMA_CUBLAS=ON
cmake --build build --config Release
  • If you have an AMD GPU:
# Replace 'gfx1100' with your GPU's architecture name; you can find it with rocminfo
CC=/opt/rocm/llvm/bin/clang CXX=/opt/rocm/llvm/bin/clang++ cmake -S . -B build -DLLAMA_HIPBLAS=ON -DAMDGPU_TARGETS=gfx1100
cmake --build build --config Release
  • If you only have a CPU:
cmake -S . -B build
cmake --build build --config Release

Model Weights

PowerInfer models are stored in a special format called PowerInfer GGUF, based on the GGUF format, which contains both the LLM weights and the predictor weights.
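
As an illustration of what such a file contains, the gguf Python package (used by the conversion scripts) can list its tensors. The snippet below is a hedged example; the path is a placeholder, and the predictor weights simply show up alongside the LLM weights (e.g. blk.N.fc1.weight / blk.N.fc2.weight in the 70B log further down this page).

import gguf  # GGUF reader shipped with the gguf Python package

# Illustrative only: dump the tensors stored in a PowerInfer GGUF file so you can
# see both the LLM weights and the predictor weights in one file.
reader = gguf.GGUFReader("./ReluLLaMA-7B/llama-7b-relu.powerinfer.gguf")
for tensor in reader.tensors:
    print(tensor.name, list(tensor.shape))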

Download PowerInfer GGUF via Hugging Face

You can obtain PowerInfer GGUF weights at *.powerinfer.gguf as well as profiled model activation statistics for 'hot'-neuron offloading from each Hugging Face repo below.

Base Model PowerInfer GGUF
LLaMA(ReLU)-2-7B PowerInfer/ReluLLaMA-7B-PowerInfer-GGUF
LLaMA(ReLU)-2-13B PowerInfer/ReluLLaMA-13B-PowerInfer-GGUF
Falcon(ReLU)-40B PowerInfer/ReluFalcon-40B-PowerInfer-GGUF
LLaMA(ReLU)-2-70B PowerInfer/ReluLLaMA-70B-PowerInfer-GGUF
ProSparse-LLaMA-2-7B PowerInfer/ProSparse-LLaMA-2-7B-GGUF
ProSparse-LLaMA-2-13B PowerInfer/ProSparse-LLaMA-2-13B-GGUF
Bamboo-base-7B 🌟 PowerInfer/Bamboo-base-v0.1-gguf
Bamboo-DPO-7B 🌟 PowerInfer/Bamboo-DPO-v0.1-gguf

We recommend using huggingface-cli to download the whole model repo. For example, the following command will download PowerInfer/ReluLLaMA-7B-PowerInfer-GGUF into the ./ReluLLaMA-7B directory.

huggingface-cli download --resume-download --local-dir ReluLLaMA-7B --local-dir-use-symlinks False PowerInfer/ReluLLaMA-7B-PowerInfer-GGUF
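
If you prefer doing this from Python, a roughly equivalent call with huggingface_hub (assumed to be installed; same repo and target directory as the command above) is:

from huggingface_hub import snapshot_download

# Mirrors the huggingface-cli command above: download the whole repo into
# ./ReluLLaMA-7B without symlinks, resuming if interrupted.
snapshot_download(
    repo_id="PowerInfer/ReluLLaMA-7B-PowerInfer-GGUF",
    local_dir="ReluLLaMA-7B",
    local_dir_use_symlinks=False,
    resume_download=True,
)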

As such, PowerInfer can automatically make use of the following directory structure for feature-complete model offloading:

.
├── *.powerinfer.gguf (Unquantized PowerInfer model)
├── *.q4.powerinfer.gguf (INT4 quantized PowerInfer model, if available)
├── activation (Profiled activation statistics for fine-grained FFN offloading)
│   ├── activation_x.pt (Profiled activation statistics for layer x)
│   └── ...
├── *.[q4].powerinfer.gguf.generated.gpuidx (Generated GPU index at runtime for corresponding model)
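
The snippet below is a hypothetical convenience check (not part of PowerInfer) that a model directory follows this layout before you point the engine at it:

from pathlib import Path

# Hypothetical helper: verify a downloaded or converted repo has what PowerInfer expects.
def check_powerinfer_repo(repo_dir: str) -> None:
    repo = Path(repo_dir)
    if not list(repo.glob("*.powerinfer.gguf")):
        raise FileNotFoundError(f"no *.powerinfer.gguf found in {repo}")
    if not list((repo / "activation").glob("activation_*.pt")):
        print("note: no activation/ statistics found; these are used for fine-grained FFN offloading")

check_powerinfer_repo("./ReluLLaMA-7B")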

Convert from Original Model Weights + Predictor Weights

Hugging Face limits a single model weight file to 50 GiB. For unquantized models of 40B parameters or more, you can convert a PowerInfer GGUF from the original model weights and the predictor weights obtained from Hugging Face.

Base Model Original Model Predictor
LLaMA(ReLU)-2-7B SparseLLM/ReluLLaMA-7B PowerInfer/ReluLLaMA-7B-Predictor
LLaMA(ReLU)-2-13B SparseLLM/ReluLLaMA-13B PowerInfer/ReluLLaMA-13B-Predictor
Falcon(ReLU)-40B SparseLLM/ReluFalcon-40B PowerInfer/ReluFalcon-40B-Predictor
LLaMA(ReLU)-2-70B SparseLLM/ReluLLaMA-70B PowerInfer/ReluLLaMA-70B-Predictor
ProSparse-LLaMA-2-7B SparseLLM/ProSparse-LLaMA-2-7B PowerInfer/ProSparse-LLaMA-2-7B-Predictor
ProSparse-LLaMA-2-13B SparseLLM/ProSparse-LLaMA-2-13B PowerInfer/ProSparse-LLaMA-2-13B-Predictor
Bamboo-base-7B 🌟 PowerInfer/Bamboo-base-v0.1 PowerInfer/Bamboo-base-v0.1-predictor
Bamboo-DPO-7B 🌟 PowerInfer/Bamboo-DPO-v0.1 PowerInfer/Bamboo-DPO-v0.1-predictor

You can use the following command to convert the original model weights and predictor weights to PowerInfer GGUF:

# make sure that you have done `pip install -r requirements.txt`
python convert.py --outfile /PATH/TO/POWERINFER/GGUF/REPO/MODELNAME.powerinfer.gguf /PATH/TO/ORIGINAL/MODEL /PATH/TO/PREDICTOR
# python convert.py --outfile ./ReluLLaMA-70B-PowerInfer-GGUF/llama-70b-relu.powerinfer.gguf ./SparseLLM/ReluLLaMA-70B ./PowerInfer/ReluLLaMA-70B-Predictor

For the same reason, we suggest keeping the same directory structure as PowerInfer GGUF repos after conversion.

Convert original models into dense GGUF models (compatible with llama.cpp)
python convert-dense.py --outfile /PATH/TO/DENSE/GGUF/REPO/MODELNAME.gguf /PATH/TO/ORIGINAL/MODEL
# python convert-dense.py --outfile ./Bamboo-DPO-v0.1-gguf/bamboo-7b-dpo-v0.1.gguf --outtype f16 ./Bamboo-DPO-v0.1

Please note that the dense GGUF models generated by convert-dense.py can be used with PowerInfer in dense inference mode, but might not work properly with llama.cpp, as we have altered the activation functions (for ReluLLaMA and ProSparse models) or the model architecture (for Bamboo models).

Inference

For CPU-only and CPU-GPU hybrid inference with all available VRAM, you can use the following instructions to run PowerInfer:

./build/bin/main -m /PATH/TO/MODEL -n $output_token_count -t $thread_num -p $prompt
# e.g.: ./build/bin/main -m ./ReluFalcon-40B-PowerInfer-GGUF/falcon-40b-relu.q4.powerinfer.gguf -n 128 -t 8 -p "Once upon a time"
# For Windows: .\build\bin\Release\main.exe -m .\ReluFalcon-40B-PowerInfer-GGUF\falcon-40b-relu.q4.powerinfer.gguf -n 128 -t 8 -p "Once upon a time"

If you want to limit the VRAM usage of the GPU:

./build/bin/main -m /PATH/TO/MODEL -n $output_token_count -t $thread_num -p $prompt --vram-budget $vram_gb
# e.g.: ./build/bin/main -m ./ReluLLaMA-7B-PowerInfer-GGUF/llama-7b-relu.powerinfer.gguf -n 128 -t 8 -p "Once upon a time" --vram-budget 8
# For Windows: .\build\bin\Release\main.exe -m .\ReluLLaMA-7B-PowerInfer-GGUF\llama-7b-relu.powerinfer.gguf -n 128 -t 8 -p "Once upon a time" --vram-budget 8

Under CPU-GPU hybrid inference, PowerInfer will automatically offload all dense activation blocks to the GPU, then split the FFN and offload as much of it to the GPU as possible.
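
As a rough mental model of that placement policy (purely illustrative; all names and attributes here are made up and this is not PowerInfer's actual code):

# Rough mental model of the hybrid placement described above; illustrative only.
# `dense_blocks` and `ffn_neuron_groups` are assumed to be objects exposing a size
# in bytes (`nbytes`) and, for FFN groups, a profiled activation frequency (`hotness`).
def plan_offload(dense_blocks, ffn_neuron_groups, vram_budget_bytes):
    plan, used = [], 0
    # 1) Dense activation blocks are offloaded to the GPU first.
    for block in dense_blocks:
        if used + block.nbytes <= vram_budget_bytes:
            plan.append(("gpu", block))
            used += block.nbytes
        else:
            plan.append(("cpu", block))
    # 2) The FFN is then split: the hottest neuron groups fill whatever VRAM remains.
    for group in sorted(ffn_neuron_groups, key=lambda g: g.hotness, reverse=True):
        if used + group.nbytes <= vram_budget_bytes:
            plan.append(("gpu", group))
            used += group.nbytes
        else:
            plan.append(("cpu", group))
    return plan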

Dense inference mode (limited support)

If you want to run PowerInfer with the dense variants of the PowerInfer model family, you can use it in much the same way as llama.cpp:

./build/bin/main -m /PATH/TO/DENSE/MODEL -n $output_token_count -t $thread_num -p $prompt -ngl $num_gpu_layers
# e.g.: ./build/bin/main -m ./Bamboo-base-v0.1-gguf/bamboo-7b-v0.1.gguf -n 128 -t 8 -p "Once upon a time" -ngl 12

The same applies to other examples/ such as server and batched_generation. Please note that the dense inference mode is not a "compatible mode" for all models: we have altered activation functions (for ReluLLaMA and ProSparse models) in this mode to match our model family.

Serving, Perplexity Evaluation, and more applications

PowerInfer supports serving and batched generation with the same instructions as llama.cpp. Generally, you can use the same commands as llama.cpp, except that the -ngl argument has been replaced by --vram-budget for PowerInfer. Please refer to the detailed instructions in each examples/ directory.
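
For example, assuming the server example is started much like llama.cpp's (something along the lines of ./build/bin/server -m /PATH/TO/MODEL --vram-budget 8, per the note above about replacing -ngl), a client request could look like the sketch below. The /completion endpoint and its JSON fields are those of llama.cpp's server example and are assumed to carry over unchanged:

import requests

# Query a locally running PowerInfer server (examples/server), assuming it exposes
# llama.cpp's /completion endpoint on the default port 8080.
resp = requests.post(
    "http://127.0.0.1:8080/completion",
    json={"prompt": "Once upon a time", "n_predict": 128},
)
print(resp.json()["content"])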

Quantization

PowerInfer has optimized quantization support for INT4 (Q4_0) models. You can use the following instructions to quantize a PowerInfer GGUF model:

./build/bin/quantize /PATH/TO/MODEL /PATH/TO/OUTPUT/QUANTIZED/MODEL Q4_0
# e.g.: ./build/bin/quantize ./ReluFalcon-40B-PowerInfer-GGUF/falcon-40b-relu.powerinfer.gguf ./ReluFalcon-40B-PowerInfer-GGUF/falcon-40b-relu.q4.powerinfer.gguf Q4_0
# For Windows: .\build\bin\Release\quantize.exe .\ReluFalcon-40B-PowerInfer-GGUF\falcon-40b-relu.powerinfer.gguf .\ReluFalcon-40B-PowerInfer-GGUF\falcon-40b-relu.q4.powerinfer.gguf Q4_0

Then you can use the quantized model for inference with PowerInfer with the same instructions as above.

More Documentation

Evaluation

We evaluated PowerInfer vs. llama.cpp on a single RTX 4090 (24G) with a series of FP16 ReLU models under inputs of length 64; the results are shown below. PowerInfer achieves up to an 11x speedup on Falcon 40B and up to a 3x speedup on Llama 2 70B.

[Figure: speedup over llama.cpp on RTX 4090] The X axis indicates the output length, and the Y axis represents the speedup compared with llama.cpp. The number above each bar indicates the end-to-end generation speed (total tokens generated / total prompting + generation time, in tokens/s).
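
As a worked example of that metric (the numbers here are illustrative, not measured):

# 128 tokens generated over 10 s of total prompting + generation time:
tokens_generated = 128
total_time_s = 10.0
end_to_end_speed = tokens_generated / total_time_s  # 12.8 tokens/s
# The bar height is then this speed divided by llama.cpp's speed in the same setting.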

We also evaluated PowerInfer on a single RTX 2080Ti (11G) with INT4 ReLU models under inputs of length 8, and the results are illustrated in the same way as above. PowerInfer achieves up to an 8x speedup on Falcon 40B and up to a 3x speedup on Llama 2 70B.

[Figure: speedup over llama.cpp on RTX 2080Ti with INT4 models]

Please refer to our paper for more evaluation details.

FAQs

  1. What if I encounter CUDA_ERROR_OUT_OF_MEMORY?

    • You can try running with the --reset-gpu-index argument to rebuild the GPU index for this model and avoid any stale cache.
    • Due to our current implementation, model offloading might not be as accurate as expected. You can try --vram-budget with a slightly lower value, or --disable-gpu-index to disable FFN offloading.
  2. Does PowerInfer support Mistral, the original LLaMA, Qwen, ...?

    • Currently we only support models with ReLU/ReGLU/Squared ReLU activation functions, so these models are not supported yet. It's worth mentioning that a paper has demonstrated that using the ReLU/ReGLU activation function has a negligible impact on convergence and performance.
  3. Why is there a noticeable downgrade in the performance metrics of our current ReLU model, particularly the 70B model?

    • In contrast to the typical requirement of around 2T tokens for LLM training, our model's fine-tuning was conducted with only 5B tokens. This insufficient retraining has resulted in the model's inability to regain its original performance. We are actively working on updating to a more capable model, so please stay tuned.
  4. What if...

    • Issues are welcomed! Please feel free to open an issue and attach your running environment and running parameters. We will try our best to help you.

TODOs

We will release the code and data in the following order; please stay tuned!

  • Release core code of PowerInfer, supporting Llama-2, Falcon-40B.
  • Support Mistral-7B (Bamboo-7B)
  • Support Windows
  • Support text-generation-webui
  • Release perplexity evaluation code
  • Support Metal for Mac
  • Release code for OPT models
  • Release predictor training code
  • Support online split for FFN network
  • Support Multi-GPU

Paper and Citation

More technical details can be found in our paper.

If you find PowerInfer useful or relevant to your project and research, please kindly cite our paper:

@misc{song2023powerinfer,
      title={PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU},
      author={Yixin Song and Zeyu Mi and Haotong Xie and Haibo Chen},
      year={2023},
      eprint={2312.12456},
      archivePrefix={arXiv},
      primaryClass={cs.LG}
}

Acknowledgement

We are thankful for the easily modifiable operator library ggml and the execution runtime provided by llama.cpp. We also extend our gratitude to THUNLP for their support of ReLU-based sparse models. We also appreciate the research of Deja Vu, which inspired PowerInfer.

PowerInfer's People

Contributors

anzz1, blackhole89, cebtenzzre, comex, crd716, dannydaemonic, ejones, galunid, ggerganov, goerch, green-sky, hodlen, howard0su, ikawrakow, ivanstepanovftw, jhen0409, johannesgaessler, kerfufflev2, li-plus, lshzh-ww, monatis, netrunnereve, prusnak, shibe2, slaren, slyecho, sw, tjohnman, xaedes, yixinsong-e


PowerInfer's Issues

llama.cpp:3107: vram_allocated_bytes < vram_capacity

Environment: WSL inside Windows.

$ free -m
               total        used        free      shared  buff/cache   available
Mem:           23919         580       22559           3         780       23015
$ nvidia-smi
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 3090        On  | 00000000:01:00.0  On |                  N/A |
|  0%   33C    P8              25W / 420W |   1762MiB / 24576MiB |      7%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

cmd:

./build/bin/main -m /mnt/{blur}/OneDrive/Models/oobabooga_windows/text-generation-webui/models/test/llama-70b-relu.q4.powerinfer.gguf -n 128 -t 8 -p "Once upon a time"

result:

llama_model_loader: - tensor  881:                blk.79.fc1.weight q4_0     [  8192,  3072,     1,     1 ]
llama_model_loader: - tensor  882:                blk.79.fc2.weight q4_0     [  3072, 28672,     1,     1 ]
llama_model_loader: - kv   0:                       general.architecture str
llama_model_loader: - kv   1:                               general.name str
llama_model_loader: - kv   2:                       llama.context_length u32
llama_model_loader: - kv   3:                     llama.embedding_length u32
llama_model_loader: - kv   4:                          llama.block_count u32
llama_model_loader: - kv   5:                  llama.feed_forward_length u32
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32
llama_model_loader: - kv   7:                 llama.attention.head_count u32
llama_model_loader: - kv   8:              llama.attention.head_count_kv u32
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32
llama_model_loader: - kv  10:                       llama.rope.freq_base f32
llama_model_loader: - kv  11:                          general.file_type u32
llama_model_loader: - kv  12:                       tokenizer.ggml.model str
llama_model_loader: - kv  13:                      tokenizer.ggml.tokens arr
llama_model_loader: - kv  14:                      tokenizer.ggml.scores arr
llama_model_loader: - kv  15:                  tokenizer.ggml.token_type arr
llama_model_loader: - kv  16:                tokenizer.ggml.bos_token_id u32
llama_model_loader: - kv  17:                tokenizer.ggml.eos_token_id u32
llama_model_loader: - kv  18:            tokenizer.ggml.unknown_token_id u32
llama_model_loader: - kv  19:            tokenizer.ggml.padding_token_id u32
llama_model_loader: - kv  20:               tokenizer.ggml.add_bos_token bool
llama_model_loader: - kv  21:               tokenizer.ggml.add_eos_token bool
llama_model_loader: - kv  22:               general.quantization_version u32
llama_model_loader: - type  f32:  161 tensors
llama_model_loader: - type q4_0:  722 tensors
llama_model_load: PowerInfer model loaded. Sparse inference will be used.
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 2048
llm_load_print_meta: n_embd           = 8192
llm_load_print_meta: n_head           = 64
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_layer          = 80
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_gqa            = 8
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff             = 28672
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 2048
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: model type       = 70B
llm_load_print_meta: model ftype      = mostly Q4_0
llm_load_print_meta: model params     = 74.98 B
llm_load_print_meta: model size       = 39.28 GiB (4.50 BPW)
llm_load_print_meta: general.name   = nvme
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: PAD token = 0 '<unk>'
llm_load_print_meta: LF token  = 13 '<0x0A>'
llm_load_print_meta: sparse_pred_threshold = 0.00
llm_load_sparse_model_tensors: ggml ctx size =    0.32 MB
GGML_ASSERT: /mnt/{blur}/Projects/PowerInfer/llama.cpp:3107: vram_allocated_bytes < vram_capacity

I experimented with --vram-budget, and with resetting or disabling the GPU index. No luck.

nvcc fails due to illegal options

When I try to build this I get the following error:
nvcc fatal : Unknown option 'Wmissing-declarations'

The cmake files seem to be set up to pass gcc options to nvcc which AFAIK are not supported.

Combined with MLX, wouldn't this take off on the Mac platform!

Prerequisites

Please answer the following questions for yourself before submitting an issue.

  • I am running the latest code. Development is very rapid so there are no tagged versions as of now.
  • I carefully followed the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new bug or useful enhancement to share.

Feature Description

On Mac M-series platforms, could this be further combined with https://github.com/ml-explore/mlx to achieve an even more dramatic speedup?

Please provide a detailed written description of what you were trying to do, and what you expected llama.cpp to do as an enhancement.

Motivation

Please provide a detailed written description of reasons why this feature is necessary and how it is useful to llama.cpp users.

Possible Implementation

If you have an idea as to how it can be implemented, please write a detailed description. Feel free to give links to external sources or share visuals that might be helpful to understand the details better.

no CUDA-capable device is detected

Tried to run inference on WSL
./build/bin/main -m ./ReluFalcon-40B-PowerInfer-GGUF/falcon-40b-relu.q4.powerinfer.gguf -n 128 -t 8 -p "Once upon a time"
and got
no CUDA-capable device is detected
current device: 231936

suggestions on how to fix this?

I have 2 NVIDIA cards in the computer
a GeForce RTX2070
and
Tesla M40 24GB

Evaluation

What does the speedup mean here? Speedup compared to what?


Bitcoin

# Prerequisites

Please answer the following questions for yourself before submitting an issue.

  • I am running the latest code. Development is very rapid so there are no tagged versions as of now.
  • I carefully followed the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new bug or useful enhancement to share.

Expected Behavior

Please provide a detailed written description of what you were trying to do, and what you expected llama.cpp to do.

Current Behavior

Please provide a detailed written description of what llama.cpp did, instead.

Environment and Context

Please provide detailed information about your computer setup. This is important in case the issue is not reproducible except for under certain specific conditions.

  • Physical (or virtual) hardware you are using, e.g. for Linux:

$ lscpu

  • Operating System, e.g. for Linux:

$ uname -a

  • SDK version, e.g. for Linux:
$ python3 --version
$ make --version
$ g++ --version

Failure Information (for bugs)

Please help provide information about the failure / bug.

Steps to Reproduce

Please provide detailed steps for reproducing the issue. We are not sitting in front of your screen, so the more detail the better.

  1. step 1
  2. step 2
  3. step 3
  4. etc.

Failure Logs

Please include any relevant log snippets or files. If it works under one configuration but not under another, please provide logs for both configurations and their corresponding outputs so it is easy to see where behavior changes.

Also, please try to avoid using screenshots if at all possible. Instead, copy/paste the console output and use Github's markdown to cleanly format your logs for easy readability.

Example environment info:

llama.cpp$ git log | head -1
commit 2af23d30434a677c6416812eea52ccc0af65119c

llama.cpp$ lscpu | egrep "AMD|Flags"
Vendor ID:                       AuthenticAMD
Model name:                      AMD Ryzen Threadripper 1950X 16-Core Processor
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid amd_dcm aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb hw_pstate ssbd ibpb vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt sha_ni xsaveopt xsavec xgetbv1 xsaves clzero irperf xsaveerptr arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif overflow_recov succor smca sme sev
Virtualization:                  AMD-V

llama.cpp$ python3 --version
Python 3.10.9

llama.cpp$ pip list | egrep "torch|numpy|sentencepiece"
numpy                         1.24.2
numpydoc                      1.5.0
sentencepiece                 0.1.97
torch                         1.13.1
torchvision                   0.14.1

llama.cpp$ make --version | head -1
GNU Make 4.3

$ md5sum ./models/65B/ggml-model-q4_0.bin
dbdd682cce80e2d6e93cefc7449df487  ./models/65B/ggml-model-q4_0.bin

Example run with the Linux command perf

llama.cpp$ perf stat ./main -m ./models/65B/ggml-model-q4_0.bin -t 16 -n 1024 -p "Please close your issue when it has been answered."
main: seed = 1679149377
llama_model_load: loading model from './models/65B/ggml-model-q4_0.bin' - please wait ...
llama_model_load: n_vocab = 32000
llama_model_load: n_ctx   = 512
llama_model_load: n_embd  = 8192
llama_model_load: n_mult  = 256
llama_model_load: n_head  = 64
llama_model_load: n_layer = 80
llama_model_load: n_rot   = 128
llama_model_load: f16     = 2
llama_model_load: n_ff    = 22016
llama_model_load: n_parts = 8
llama_model_load: ggml ctx size = 41477.73 MB
llama_model_load: memory_size =  2560.00 MB, n_mem = 40960
llama_model_load: loading model part 1/8 from './models/65B/ggml-model-q4_0.bin'
llama_model_load: .......................................................................................... done
llama_model_load: model size =  4869.09 MB / num tensors = 723
llama_model_load: loading model part 2/8 from './models/65B/ggml-model-q4_0.bin.1'
llama_model_load: .......................................................................................... done
llama_model_load: model size =  4869.09 MB / num tensors = 723
llama_model_load: loading model part 3/8 from './models/65B/ggml-model-q4_0.bin.2'
llama_model_load: .......................................................................................... done
llama_model_load: model size =  4869.09 MB / num tensors = 723
llama_model_load: loading model part 4/8 from './models/65B/ggml-model-q4_0.bin.3'
llama_model_load: .......................................................................................... done
llama_model_load: model size =  4869.09 MB / num tensors = 723
llama_model_load: loading model part 5/8 from './models/65B/ggml-model-q4_0.bin.4'
llama_model_load: .......................................................................................... done
llama_model_load: model size =  4869.09 MB / num tensors = 723
llama_model_load: loading model part 6/8 from './models/65B/ggml-model-q4_0.bin.5'
llama_model_load: .......................................................................................... done
llama_model_load: model size =  4869.09 MB / num tensors = 723
llama_model_load: loading model part 7/8 from './models/65B/ggml-model-q4_0.bin.6'
llama_model_load: .......................................................................................... done
llama_model_load: model size =  4869.09 MB / num tensors = 723
llama_model_load: loading model part 8/8 from './models/65B/ggml-model-q4_0.bin.7'
llama_model_load: .......................................................................................... done
llama_model_load: model size =  4869.09 MB / num tensors = 723

system_info: n_threads = 16 / 32 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |

main: prompt: 'Please close your issue when it has been answered.'
main: number of tokens in prompt = 11
     1 -> ''
 12148 -> 'Please'
  3802 -> ' close'
   596 -> ' your'
  2228 -> ' issue'
   746 -> ' when'
   372 -> ' it'
   756 -> ' has'
  1063 -> ' been'
  7699 -> ' answered'
 29889 -> '.'

sampling parameters: temp = 0.800000, top_k = 40, top_p = 0.950000, repeat_last_n = 64, repeat_penalty = 1.300000


Please close your issue when it has been answered.
@duncan-donut: I'm trying to figure out what kind of "support" you need for this script and why, exactly? Is there a question about how the code works that hasn't already been addressed in one or more comments below this ticket, or are we talking something else entirely like some sorta bugfixing job because your server setup is different from mine??
I can understand if your site needs to be running smoothly and you need help with a fix of sorts but there should really be nothing wrong here that the code itself could not handle. And given that I'm getting reports about how it works perfectly well on some other servers, what exactly are we talking? A detailed report will do wonders in helping us get this resolved for ya quickly so please take your time and describe the issue(s) you see as clearly & concisely as possible!!
@duncan-donut: I'm not sure if you have access to cPanel but you could try these instructions. It is worth a shot! Let me know how it goes (or what error message, exactly!) when/if ya give that code a go? [end of text]


main: mem per token = 71159620 bytes
main:     load time = 19309.95 ms
main:   sample time =   168.62 ms
main:  predict time = 223895.61 ms / 888.47 ms per token
main:    total time = 246406.42 ms

 Performance counter stats for './main -m ./models/65B/ggml-model-q4_0.bin -t 16 -n 1024 -p Please close your issue when it has been answered.':

        3636882.89 msec task-clock                #   14.677 CPUs utilized
             13509      context-switches          #    3.714 /sec
              2436      cpu-migrations            #    0.670 /sec
          10476679      page-faults               #    2.881 K/sec
    13133115082869      cycles                    #    3.611 GHz                      (16.77%)
       29314462753      stalled-cycles-frontend   #    0.22% frontend cycles idle     (16.76%)
    10294402631459      stalled-cycles-backend    #   78.39% backend cycles idle      (16.74%)
    23479217109614      instructions              #    1.79  insn per cycle
                                                  #    0.44  stalled cycles per insn  (16.76%)
     2353072268027      branches                  #  647.002 M/sec                    (16.77%)
        1998682780      branch-misses             #    0.08% of all branches          (16.76%)

     247.802177522 seconds time elapsed

    3618.573072000 seconds user
      18.491698000 seconds sys

Mixtral MoE support?

Prerequisites

Please answer the following questions for yourself before submitting an issue.

  • I am running the latest code. Development is very rapid so there are no tagged versions as of now.
  • I carefully followed the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new bug or useful enhancement to share.

Feature Description

Support for Mixtral MoE

In-depth Analysis of Memory Management for Enhanced Performance on Consumer-grade GPUs

Dear PowerInfer Contributors,

I hope this message finds you well. I am reaching out to discuss a potential enhancement to the PowerInfer inference engine, specifically regarding the memory management strategies employed during LLM inference on consumer-grade GPUs.

Upon a thorough examination of the current implementation, I have observed that while the engine adeptly handles the distribution of workload between the CPU and GPU, there may be room for optimisation in the way memory is allocated and managed, particularly during peak usage scenarios.

The crux of the matter lies in the dynamic allocation of memory for 'hot' and 'cold' neurons. While the preloading of 'hot' neurons onto the GPU is commendable for its efficiency, the allocation of memory for 'cold' neurons during runtime could potentially be streamlined. This is especially pertinent when considering the limited VRAM available on consumer-grade GPUs compared to their server-grade counterparts.

I propose a more granular control over memory allocation, which could include:

  • Implementing a more sophisticated memory pooling mechanism to reduce fragmentation and improve allocation speed.
  • Exploring the use of memory compression techniques to increase the effective capacity of VRAM.
  • Introducing a dynamic memory re-allocation system that can adapt to the changing patterns of 'hot' and 'cold' neuron activations based on real-time usage.

I believe that by addressing these aspects, PowerInfer could achieve even greater performance gains and efficiency, making it more accessible and practical for a wider range of users.

I would be most interested in hearing your thoughts on this matter and am keen to contribute to the development of such enhancements.

Thank you for your time and consideration.

Best regards,
yihong1120

Eliminate compiler warnings

Our initial implementation of PowerInfer introduced a bunch of compiler warnings, mostly unused variables, and made CI complain about them. A quick fix should resolve >95% of them.

Combined with LLM in a flash

https://arxiv.org/abs/2312.11514
Recently, LLM in a Flash was proposed, a method to use Flash memory to run models that exceed DRAM.
If I'm right, I think we can apply these technologies simultaneously.
If that were possible, I think it would make running very large models easier.

Any plan on supporting Mistral-based models?

Prerequisites

Please answer the following questions for yourself before submitting an issue.

  • I am running the latest code. Development is very rapid so there are no tagged versions as of now.
  • I carefully followed the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new bug or useful enhancement to share.

Feature Description

Please provide a detailed written description of what you were trying to do, and what you expected llama.cpp to do as an enhancement.

Motivation

Please provide a detailed written description of reasons why this feature is necessary and how it is useful to llama.cpp users.

Possible Implementation

If you have an idea as to how it can be implemented, please write a detailed description. Feel free to give links to external sources or share visuals that might be helpful to understand the details better.

Length

  1. I would like to ask whether you have compared performance with longer lengths. If PowerInfer is used on an A100 with length = 2048 or higher, can it achieve an effect similar to that described in the paper?
  2. I would also like to ask: what idea led you to use the ReLU activation function rather than following the original SwiGLU?

Chat model

Does this project support the chat model of Llama?

Does the PowerInfer team plan to support RWKV-LM, developed by Bo Peng's team?

The RWKV architecture has always used ReLU as its activation function, and it is the only Chinese RNN-structured language model developed in China, with great pioneering value. Is the PowerInfer team considering supporting it? The RWKV-LM project link is below:

https://github.com/BlinkDL/RWKV-LM

On behalf of RWKV community enthusiasts, I strongly hope your team can consider it. Thank you very much!

Seems not to support long prompts well.

We noticed that the paper mentioned limited performance improvement for relatively long prompt situations, but our situation is that, in the case of very long prompts, it seems PowerInfer ceases to work, generating no output. This situation occurred with llama-7B, llama-13B, and llama-70B-q4. We speculate that this might be because when the prompt is very long, there are many neurons that need to work simultaneously, and the performance optimization done using the locality principle no longer applies, causing PowerInfer's working mechanism to be ineffective. We would like to hear your perspective on this matter.

Accuracy comparison

Hello, is there a comparison of the model accuracy loss after deploying with PowerInfer?

Quantized INT4/Q4 model?

How are the GGUF weights quantized to INT4? Is there a script similar to llama.cpp's to convert FP16 weights to Q4_0?
Please share more details about INT4 model.

Which A100 are you guys using?

Thanks for the great work!
Just curious, in "Evaluation shows that PowerInfer attains an average token generation rate of 13.20 tokens/s, with a peak of 29.08 tokens/s, across various LLMs (including OPT-175B) on a single NVIDIA RTX 4090 GPU, only 18% lower than that achieved by a top-tier server-grade A100 GPU. This significantly outperforms llama.cpp by up to 11.69x while retaining model accuracy."
Are you guys talking about the 40GB version of A100 or the 80GB version?

CUDA error 1 in ggml-cuda.cu:8332: invalid argument, and then segmentation fault

Running in WSL, all deps satisfied, most recent code pull, on an RTX 3090.

Command line:
./build/bin/main -m models/7B/llama-7b-relu.powerinfer.gguf -n 128 -t 8 -p "Once upon a time" --vram-budget 12

Log output:

Log start
main: build = 1549 (9d72668)
main: built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: seed  = 1703277999
ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6
llama_model_loader: loaded meta data with 18 key-value pairs and 355 tensors from models/7B/llama-7b-relu.powerinfer.gguf (version GGUF V3 (latest))
llama_model_loader: - tensor    0:                token_embd.weight f16      [  4096, 32000,     1,     1 ]
llama_model_loader: - tensor    1:              blk.0.attn_q.weight f16      [  4096,  4096,     1,     1 ]
----snip----
llama_model_loader: - kv  17:            tokenizer.ggml.padding_token_id u32
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type  f16:  290 tensors
llama_model_load: PowerInfer model loaded. Sparse inference will be used.
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 2048
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 32
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff             = 11008
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 2048
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: model type       = 7B
llm_load_print_meta: model ftype      = mostly F16
llm_load_print_meta: model params     = 7.57 B
llm_load_print_meta: model size       = 14.11 GiB (16.00 BPW)
llm_load_print_meta: general.name   = syx
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: PAD token = 0 '<unk>'
llm_load_print_meta: LF token  = 13 '<0x0A>'
llm_load_print_meta: sparse_pred_threshold = 0.00
llama_model_load: sparse inference - vram budget = 12.00 GB
llm_load_sparse_model_tensors: ggml ctx size =    0.13 MB
llm_load_sparse_model_tensors: using CUDA for GPU acceleration
llm_load_sparse_model_tensors: mem required  = 8506.63 MB
llm_load_sparse_model_tensors: VRAM used: 5939.52 MB
....................................................................................................
llama_model_loader: loaded meta data with 3 key-value pairs and 64 tensors from models/7B/llama-7b-relu.powerinfer.gguf.generated.gpuidx (version GGUF V3 (latest))
llama_model_loader: - tensor    0:                    blk.0.gpu_idx i32      [ 11008,     1,     1,     1 ]
llama_model_loader: - tensor    1:                 blk.0.gpu_bucket i32      [  5376,     1,     1,     1 ]
llama_model_loader: - tensor    2:                    blk.1.gpu_idx i32      [ 11008,     1,     1,     1 ]
----snip----
llama_model_loader: - tensor   62:                   blk.31.gpu_idx i32      [ 11008,     1,     1,     1 ]
llama_model_loader: - tensor   63:                blk.31.gpu_bucket i32      [  4608,     1,     1,     1 ]
llama_model_loader: unknown type i32
llama_model_loader: - kv   0:                       general.architecture str
llama_model_loader: - kv   1:              generic.gpu_index.block_count u32
llama_model_loader: - kv   2:                        split.vram_capacity u64
llama_model_loader: - type  i32:   64 tensors
loaded gpu_idx, vram_required: 6119997440
apply_tensors_to_base_model: applying gpu_idx adapter from 'models/7B/llama-7b-relu.powerinfer.gguf.generated.gpuidx' - please wait ...
................................................................ done (9.84 ms)
offload_ffn_split: applying augmentation to model - please wait ...
................................ done (11859.33 ms)
llm_load_gpu_split: offloaded 5790.00 MiB of FFN weights to GPU
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: kv self size  =  256.00 MB
llama_build_graph: non-view tensors processed: 548/1028
llama_build_graph: ****************************************************************
llama_build_graph: not all non-view tensors have been processed with a callback
llama_build_graph: this can indicate an inefficiency in the graph implementation
llama_build_graph: build with LLAMA_OFFLOAD_DEBUG for more info
llama_build_graph: ref: https://github.com/ggerganov/llama.cpp/pull/3837
llama_build_graph: ****************************************************************
llama_new_context_with_model: compute buffer total size = 186.57 MB
llama_new_context_with_model: VRAM scratch buffer: 185.00 MB
llama_new_context_with_model: total VRAM used: 6124.52 MB (model: 5939.52 MB, context: 185.00 MB)

**CUDA error 1 at /home/user/Envs/PowerInfer/ggml-cuda.cu:8332: invalid argument**
current device: 0

CUDA error 4 at /home/user/Envs/PowerInfer/ggml-cuda.cu:485: driver shutting down
current device: 8192
**Segmentation fault**

fatal error C1189: #error: <stdatomic.h> is not yet supported when compiling as C

Running cmake --build build --config Release reports the following error:

E:\Langchain-Chatchat\PowerInfer>cmake --build build --config Release
MSBuild version 17.3.1+2badb37d1 for .NET Framework
  build_info.vcxproj -> E:\Langchain-Chatchat\PowerInfer\build\common\build_info.dir\Release\bui
  ld_info.lib
  ggml.c
C:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSVC\14.33.31629\include\stdato
mic.h(15,1): fatal  error C1189: #error:  <stdatomic.h> is not yet supported when compiling as C
, but this is planned for a future release. [E:\Langchain-Chatchat\PowerInfer\build\ggml.vcxproj
]
  ggml-alloc.c
C:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSVC\14.33.31629\include\stdato
mic.h(15,1): fatal  error C1189: #error:  <stdatomic.h> is not yet supported when compiling as C
, but this is planned for a future release. [E:\Langchain-Chatchat\PowerInfer\build\ggml.vcxproj
]
  ggml-backend.c
C:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSVC\14.33.31629\include\stdato
mic.h(15,1): fatal  error C1189: #error:  <stdatomic.h> is not yet supported when compiling as C
, but this is planned for a future release. [E:\Langchain-Chatchat\PowerInfer\build\ggml.vcxproj
]
  ggml-quants.c
C:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSVC\14.33.31629\include\stdato
mic.h(15,1): fatal  error C1189: #error:  <stdatomic.h> is not yet supported when compiling as C
, but this is planned for a future release. [E:\Langchain-Chatchat\PowerInfer\build\ggml.vcxproj
]
  Generating code...

Windows Visual Studio build failure

When building the VS project generated with CMake, compilation reports the following error:
fatal error C1083: Cannot open include file: "stdatomic.h": No such file or directory


Will a Docker image be provided?

Prerequisites

Please answer the following questions for yourself before submitting an issue.

  • I am running the latest code. Development is very rapid so there are no tagged versions as of now.
  • I carefully followed the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new bug or useful enhancement to share.

Feature Description

Please provide a detailed written description of what you were trying to do, and what you expected llama.cpp to do as an enhancement.

Motivation

Please provide a detailed written description of reasons why this feature is necessary and how it is useful to llama.cpp users.

Possible Implementation

If you have an idea as to how it can be implemented, please write a detailed description. Feel free to give links to external sources or share visuals that might be helpful to understand the details better.

Jetson Orin+ RTXA6000

My hardware platform is a Jetson Orin plus an RTX A6000. It looks like the GPU's resources are not being fully utilized; neither VRAM usage nor GPU utilization is high. How can I adjust things to make full use of the hardware?

(Screenshot attached: 2023-12-21 10-34-33)

Convert HF models with sparse threshold specified

Feature Description

A per-model sparse threshold is supported for LLaMA-family models, but not yet for other model architectures.

Possible Implementation

Following the way a KV param is added in convert.py, add similar logic in convert-hf-to-powerinfer-gguf.py. If necessary, update the MLP predictor model repos to add a config.json.
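
A rough sketch of where such a KV pair could be emitted with the gguf Python writer is shown below; the key name, the value, and the surrounding calls are hypothetical and only indicate the general shape, since the real converter adds many more fields plus all tensors:

import gguf

# Hypothetical sketch: write a per-model sparse threshold as a GGUF key-value pair
# during conversion, in the spirit of how convert.py adds KV params.
writer = gguf.GGUFWriter("out.powerinfer.gguf", "llama")
writer.add_float32("powerinfer.sparse_pred_threshold", 0.5)  # hypothetical key and value
# ... add the remaining metadata and the tensors exactly as the converter already does ...
writer.write_header_to_file()
writer.write_kv_data_to_file()
writer.write_tensors_to_file()
writer.close()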
