
rwkv-cpp-accelerated's People

Contributors

harrisonvanderbyl, howard0su, murugurugan, nenkoru, rejoicesyc


rwkv-cpp-accelerated's Issues

converter failure

% python converter/convert_model.py models/RWKV-5-World-0.1B-v1-20230803-ctx4096.pth
Quantizing att.key.weight: 100%|█████████████████████████████████████████████████████████| 12/12 [00:00<00:00, 58.41it/s]
stacking weights: 42%|████████████████████████▏ | 5/12 [00:00<00:00, 45.79it/s]
Cleaning att.key.weight: 100%|████████████████████████████████████████████████████████| 12/12 [00:00<00:00, 60640.54it/s]
Quantizing att.value.weight: 100%|███████████████████████████████████████████████████████| 12/12 [00:00<00:00, 66.89it/s]
stacking weights: 58%|████████████████████████████████▋ | 7/12 [00:00<00:00, 65.76it/s]
Cleaning att.value.weight: 100%|██████████████████████████████████████████████████████| 12/12 [00:00<00:00, 75459.74it/s]
Quantizing att.receptance.weight: 100%|██████████████████████████████████████████████████| 12/12 [00:00<00:00, 55.76it/s]
stacking weights: 50%|█████████████████████████▌ | 6/12 [00:00<00:00, 50.13it/s]
Cleaning att.receptance.weight: 100%|█████████████████████████████████████████████████| 12/12 [00:00<00:00, 83055.52it/s]
Quantizing ffn.key.weight: 100%|█████████████████████████████████████████████████████████| 12/12 [00:00<00:00, 14.30it/s]
stacking weights: 100%|█████████████████████████████████████████████████████████| 12/12 [00:00<00:00, 14.85it/s]
Cleaning ffn.key.weight: 100%|████████████████████████████████████████████████████████| 12/12 [00:00<00:00, 85019.68it/s]
Quantizing ffn.value.weight: 100%|███████████████████████████████████████████████████████| 12/12 [00:00<00:00, 14.70it/s]
stacking weights: 100%|███████████████████████████████████████████████████████| 12/12 [00:00<00:00, 14.81it/s]
Cleaning ffn.value.weight: 100%|██████████████████████████████████████████████████████| 12/12 [00:00<00:00, 92182.51it/s]
Quantizing ffn.receptance.weight: 100%|██████████████████████████████████████████████████| 12/12 [00:00<00:00, 77.98it/s]
stacking weights: 75%|██████████████████████████████████████▎ | 9/12 [00:00<00:00, 83.23it/s]
Cleaning ffn.receptance.weight: 100%|█████████████████████████████████████████████████| 12/12 [00:00<00:00, 81180.08it/s]
Quantizing att.output.weight: 100%|██████████████████████████████████████████████████████| 12/12 [00:00<00:00, 77.60it/s]
stacking weights: 75%|█████████████████████████████████████████▎ | 9/12 [00:00<00:00, 82.97it/s]
Cleaning att.output.weight: 100%|████████████████████████████████████████████████████| 12/12 [00:00<00:00, 107316.95it/s]
Quantizing: 100%|██████████████████████████████████████████████████████████████████████████| 7/7 [00:03<00:00, 2.00it/s]
saving: xbuf
saving: embed
python: /M/rwkv-cpp-accelerated/converter/cpp_save_tensor.cpp:81: void save(std::string, int64_t, int64_t, at::Tensor&, at::Tensor&, at::Tensor&, at::Tensor&, at::Tensor&, at::Tensor&, at::Tensor&, at::Tensor&, at::Tensor&, at::Tensor&, at::Tensor&, at::Tensor&, at::Tensor&, at::Tensor&, at::Tensor&, at::Tensor&, at::Tensor&, at::Tensor&, at::Tensor&, at::Tensor&, at::Tensor&, at::Tensor&, at::Tensor&, at::Tensor&, at::Tensor&, at::Tensor&, at::Tensor&, at::Tensor&, at::Tensor&, at::Tensor&, at::Tensor&, at::Tensor&, at::Tensor&, at::Tensor&, at::Tensor&, at::Tensor&, at::Tensor&, at::Tensor&, at::Tensor&, at::Tensor&, at::Tensor&, at::Tensor&, at::Tensor&, at::Tensor&, at::Tensor&, at::Tensor&): Assertion `getSize(i,n_layers,n_emb) == tensors[i]->numel()' failed.
Aborted
(lcpu) /M/rwkv-cpp-accelerated

fail to load on windows

I tried to load a model, but it failed. Below is the output.

D:\AI\ChatRWKV\rwkv-cpp-cuda> chat.exe
D:\AI\ChatRWKV\rwkv-cpp-cuda/model.bin
n_layers: 431231561062       
n_embed: 17592186094693      
loading: xbuf

Windows uses "\" as the path separator, but this project uses "/", so the resulting model path D:\AI\ChatRWKV\rwkv-cpp-cuda/model.bin is not correct.
Any help would be appreciated, thanks.
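
For what it's worth, std::filesystem composes paths with each platform's preferred separator, which avoids mixed-delimiter paths like the one above. A minimal illustrative sketch (not the repo's actual loader code):

#include <filesystem>
#include <iostream>

int main() {
    // operator/ inserts the platform's preferred separator, so the same
    // code yields "...\model.bin" on Windows and ".../model.bin" elsewhere.
    std::filesystem::path dir = "D:\\AI\\ChatRWKV\\rwkv-cpp-cuda";
    std::filesystem::path model = dir / "model.bin";
    std::cout << model.string() << "\n";  // pass this string to the loader
}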

Add basic example of a chat using Python

Currently the only usage example for this project is written in C++. There are C bindings for Python, but no example of the same chat application written in Python.
Such an example would enable further development around this project, such as integrating Python web servers like FastAPI.

Faster model loading

Currently, the 7B and 14B models take about 10 s and 15 s respectively to load, pretty much the same as vanilla RWKV.
It would be great to make these models load as fast as possible, which could enable much better inference workflows.

A reasonable first milestone could be half of those times, i.e. ~5 s and ~7 s respectively.
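
One technique that might help, offered purely as an illustration (it is not mentioned in the issue): memory-map the weight file instead of reading it through an intermediate buffer. A minimal POSIX sketch, assuming a model.bin in the working directory:

#include <cstdio>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main() {
    int fd = open("model.bin", O_RDONLY);           // illustrative file name
    if (fd < 0) { perror("open"); return 1; }
    struct stat st;
    if (fstat(fd, &st) != 0) { perror("fstat"); return 1; }
    // Map the file read-only; pages are faulted in on demand, so the
    // multi-GB weight file is not copied into a user-space buffer first.
    void* data = mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (data == MAP_FAILED) { perror("mmap"); return 1; }
    // ... upload slices of `data` straight to the GPU, e.g. with cudaMemcpy ...
    munmap(data, st.st_size);
    close(fd);
    return 0;
}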

compiling for windows

Hi, something is going wrong while compiling for Windows. Could I get a prebuilt .exe for Windows with CUDA? Use any version of CUDA you want, I will install it if I need to, just tell me which version. Thank you.

Endless <|endoftext|> bug

I am not sure if it's a problem with the tokenizer or something else, but after the model loads, it just spams <|endoftext|> endlessly. That's with a 4090 and RWKV v10 7B running on Ubuntu. It doesn't matter what I put in the prompt.

Failed to run `examples/storygen/amd.sh` on ROCm 5.6

Hi, I'm trying to run this repo on my AMD card. I have HIP and ROCm running, but I fail to run the example provided in the README. Am I doing anything wrong?

❯ ./amd.sh
In file included from ./storygen.cpp:1:
In file included from ../../include/rwkv.h:1:
../../include/rwkv/rwkv/rwkv.h:249:21: error: reference to __device__ function 'operator delete[]' in __host__ function
    int **tensors = new int *[46];
                    ^
/opt/rocm/llvm/lib/clang/16.0.0/include/cuda_wrappers/new:73:24: note: 'operator delete[]' declared here
__device__ inline void operator delete[](void* ptr) CUDA_NOEXCEPT {
                       ^
In file included from ./storygen.cpp:1:
In file included from ../../include/rwkv.h:1:
../../include/rwkv/rwkv/rwkv.h:249:21: error: reference to __device__ function 'operator delete[]' in __host__ function
    int **tensors = new int *[46];
                    ^
./storygen.cpp:11:17: note: called by 'main'
    RWKV Rwkv = RWKV();
                ^
/opt/rocm/llvm/lib/clang/16.0.0/include/cuda_wrappers/new:73:24: note: 'operator delete[]' declared here
__device__ inline void operator delete[](void* ptr) CUDA_NOEXCEPT {
                       ^
2 errors generated when compiling for host.

System info

OS: Arch Linux x64
HIPCC: 5.6.31061
C++ compiler: GCC 13

❯ hipcc --version
HIP version: 5.6.31061-
clang version 16.0.0
Target: x86_64-pc-linux-gnu
Thread model: posix
InstalledDir: /opt/rocm/llvm/bin

Thanks
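
One possible workaround, untested on ROCm 5.6: the diagnostic points at the array new-expression at rwkv.h:249, whose implicit cleanup resolves to hipcc's __device__ operator delete[] wrapper. Replacing the raw allocation with a std::vector might sidestep the array new/delete expressions entirely; a sketch of the idea, with a hypothetical helper name:

#include <vector>

// Sketch: instead of the raw array allocation from rwkv.h:249,
//     int **tensors = new int *[46];
// keep the pointers in a std::vector, which removes the array new/delete
// expressions the compiler is objecting to and also frees itself:
std::vector<int*> make_tensor_table() {   // hypothetical helper name
    std::vector<int*> tensors(46, nullptr);
    return tensors;  // use tensors.data() wherever a raw int** is expected
}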

Memory leak in rwkv.h

RWKV::loadFile allocates the state arrays with new but never deallocates them with delete[]. Unless I'm missing something, they should be deleted later to prevent leaking memory. The same goes for the tensors and out variables: they are allocated but never deallocated, so their memory will never be reclaimed until the whole program exits, even if the RWKV object itself is destroyed.

Or better yet, use a std::vector<double> instead of double* to store the state tensors and other variables, and call vector.data() to get a raw pointer when you need it.
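
A minimal sketch of that suggestion (member and function names here are illustrative, not the actual ones in rwkv.h):

#include <vector>

struct RWKV {
    // std::vector owns its buffer: no manual delete[] is needed, and the
    // memory is released when the RWKV object is destroyed.
    std::vector<double> state_xy;   // illustrative member name

    void loadFile(int n_layers, int n_embed) {
        state_xy.assign(static_cast<size_t>(n_layers) * n_embed, 0.0);
    }

    double* statePtr() { return state_xy.data(); }  // raw pointer for C/CUDA calls
};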

Questions about int8 quantization.

Dear authors,
I am new to the project, and I have been reading the code in "include/rwkv/cuda/rwkv.cu". If my understanding is correct, only the computation inside the functions cudac_mm8_one and cuda_mm8_threec is related to int8 quantization, and its results are floating-point numbers, while the calculations in sigmoid and kernel_wkvc_forward are done entirely in floating point.

My question is: why are these parts not quantized? I have heard of methods that quantize non-linear functions with a look-up table. Given the low speed of the exp() function, is there any method to replace it with a fast substitute?

Best,
zzczzc20
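
To illustrate the look-up-table idea from the question above (a generic sketch, not code from this repo): precompute exp() over a clamped input range and linearly interpolate between table entries. The range and table size below are assumptions that trade accuracy for speed:

#include <cmath>
#include <vector>

// Look-up table for exp(x) over a clamped range, with linear interpolation.
// LO/HI and N are illustrative; accuracy vs. speed depends on both.
struct ExpLUT {
    static constexpr float LO = -20.0f, HI = 20.0f;
    static constexpr int N = 4096;
    std::vector<float> table;

    ExpLUT() : table(N + 1) {
        for (int i = 0; i <= N; ++i)
            table[i] = std::exp(LO + (HI - LO) * i / N);
    }

    float operator()(float x) const {
        if (x <= LO) return table[0];      // clamp below
        if (x >= HI) return table[N];      // clamp above
        float t = (x - LO) * N / (HI - LO);
        int i = static_cast<int>(t);
        if (i >= N) return table[N];       // guard against rounding at the edge
        float frac = t - i;
        return table[i] * (1.0f - frac) + table[i + 1] * frac;  // lerp
    }
};

Whether a clamped range is acceptable depends on the kernel; the wkv recurrence in particular applies exp() over a wide dynamic range, which is presumably one reason it stays in floating point.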

Why is the output all "attout" floating-point numbers?

[root@c8e3006d5ed0 ~/rwkv-cpp-cuda/examples/terminalchat/release]$ ./chat
/root/work/rwkv-cpp-cuda/examples/terminalchat/release/model.bin
n_layers: 32
n_embed: 4096
loading: xbuf
loading: embed
loading: layernorms
loading: state_xy
loading: state_aa
loading: state_bb
loading: state_pp
loading: state_dd
loading: buffer1
loading: buffer2
loading: buffer3
loading: buffer4
loading: mix_k
loading: mix_v
loading: mix_r
loading: km
loading: vm
loading: rm
loading: kr
loading: vr
loading: rr
loading: o1
loading: o2
loading: o3
loading: att_out
loading: att_out_r
loading: att_out_o
loading: ffn_mix_k
loading: ffn_mix_v
loading: ffn_k
loading: ffn_v
loading: ffn_r
loading: ffn_kr
loading: ffn_vr
loading: ffn_rr
loading: ffn_ko
loading: ffn_vo
loading: ffn_ro
loading: ffn_k_buffer
loading: ffn_v_buffer
loading: ffn_r_buffer
loading: decay
loading: bonus
loading: head
loading: head_r
loading: head_o
Loaded model
loading context

attout
-0.0620827
-0.658374

attout
0.066268
-1.02534

attout
0.0661107
-0.864594

attout
0.584045
-0.852568

attout
0.818303
-1.11242

attout
0.858096
-0.945886

attout
0.616216
-0.832412

attout
0.894563
-0.913151

attout
1.11017
-1.40658

attout
0.576667
-1.19348

attout
0.776536
-0.428046

attout
0.632295
0.881825

attout

Multi-GPU support

As far as I can see, the current implementation has no ability to share the load (VRAM) between GPUs the way BlinkDL's ChatRWKV does.
It would be great for running on two or more consumer-grade GPUs without opting for enterprise ones.
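
For illustration only, the usual way to do this is to place contiguous blocks of layers on different devices and move the hidden state between them as it flows through the network. A rough CUDA sketch, with hypothetical names (layer_forward is not this repo's API):

#include <cuda_runtime.h>

// Hypothetical layer placement: contiguous blocks of layers per GPU.
int gpu_of(int layer, int n_layers, int n_gpus) {
    return layer * n_gpus / n_layers;
}

void forward_all_layers(float* x, size_t n_embed, int n_layers, int n_gpus) {
    for (int i = 0; i < n_layers; ++i) {
        int dev = gpu_of(i, n_layers, n_gpus);
        cudaSetDevice(dev);
        // layer_forward(i, x);  // hypothetical: run this layer's kernels
        int next = i + 1 < n_layers ? gpu_of(i + 1, n_layers, n_gpus) : dev;
        if (next != dev) {
            // Hidden state hops to the next block's device (a real
            // implementation would preallocate these buffers once).
            float* x_next = nullptr;
            cudaSetDevice(next);
            cudaMalloc(&x_next, n_embed * sizeof(float));
            cudaMemcpyPeer(x_next, next, x, dev, n_embed * sizeof(float));
            cudaSetDevice(dev);
            cudaFree(x);
            x = x_next;
        }
    }
}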

Not working on 7B & 14B models | Torch Binding

Testing the torch bindings: the code doesn't work on large models.

The models were converted using a converter version built against the current master.
The issue does not occur when using the 1b5 and 3b models.

After running the interop.forward method, self.output stays the same (NaN). On the other hand, the state is being changed, so there is some problem with how the output is set in the C++ code.

I attached a Jupyter notebook to reproduce, but with a .md extension, so make sure to rename it back to .ipynb (GitHub doesn't allow uploading .ipynb for some reason).
untitled_1.md

Training, in -cpp-cuda, on one machine?

This project seriously rocks. Thank you very much.
I don't yet understand the training mathematics for RWKV, but I want to run training, both from scratch and as incremental updates, from a legacy C++ system. How easy would it be to put together a baby RWKV training demo, even using char tokens or words from tiny-shakespeare similar to nanoGPT, written in the same technology you're using now for inference? Even headless would be fine.
I believe this would be a big help to many people.

Add Dockerfile

I think it would be great to create a Dockerfile for deployment purposes.
That way, a model could be downloaded within an image and distributed easily in a production environment.

Cannot build

C:\Users\micro\Downloads\rwkv-cpp-cuda\build>cmake ..
CMake Warning:
  Ignoring extra path from command line:

   ".."


CMake Error: The source directory "C:/Users/micro/Downloads/rwkv-cpp-cuda" does not appear to contain CMakeLists.txt.
Specify --help for usage, or press the help button on the CMake GUI.

ACO ERROR: Unsupported opcode

Full log
~/.../storygen/realese [main]  $ ./storygen-vulkan
/home/user/dev/RWKV/rwkv-cpp-accelerated/examples/storygen/realese/model.bin
n_layers: 32
n_embed: 4096
loading: xbuf
isBuffer: 1
cuda_mem: 0x7f2777c2c000
loading: embed
loading: layernorms
isBuffer: 0
cuda_mem: 0x7f276de9e000
loading: state_xy
isBuffer: 1
cuda_mem: 0x7f276dd9e000
loading: state_aa
isBuffer: 1
cuda_mem: 0x7f276dc9e000
loading: state_bb
isBuffer: 1
cuda_mem: 0x7f276db9e000
loading: state_pp
isBuffer: 1
cuda_mem: 0x7f276da9e000
loading: state_dd
isBuffer: 1
cuda_mem: 0x7f276d99e000
loading: buffer1
isBuffer: 1
cuda_mem: 0x7f276ee12000
loading: buffer2
isBuffer: 1
cuda_mem: 0x7f276d96c000
loading: buffer3
isBuffer: 1
cuda_mem: 0x7f2777c24000
loading: buffer4
isBuffer: 1
cuda_mem: 0x7f2777568000
loading: mix_k
isBuffer: 0
cuda_mem: 0x7f276d86c000
loading: mix_v
isBuffer: 0
cuda_mem: 0x7f276d76c000
loading: mix_r
isBuffer: 0
cuda_mem: 0x7f276d66c000
loading: km
isBuffer: 0
cuda_mem: 0x7f26d6e66000
loading: vm
isBuffer: 0
cuda_mem: 0x7f2676e66000
loading: rm
isBuffer: 0
cuda_mem: 0x7f2656e66000
loading: kr
isBuffer: 0
cuda_mem: 0x7f276d5ec000
loading: vr
isBuffer: 0
cuda_mem: 0x7f276d56c000
loading: rr
isBuffer: 0
cuda_mem: 0x7f276d4ec000
loading: o1
isBuffer: 0
cuda_mem: 0x7f276d46c000
loading: o2
isBuffer: 0
cuda_mem: 0x7f276d3ec000
loading: o3
isBuffer: 0
cuda_mem: 0x7f276d36c000
loading: att_out
isBuffer: 0
cuda_mem: 0x7f2636e66000
loading: att_out_r
isBuffer: 0
cuda_mem: 0x7f276d2ec000
loading: att_out_o
isBuffer: 0
cuda_mem: 0x7f276d26c000
loading: ffn_mix_k
isBuffer: 0
cuda_mem: 0x7f276d16c000
loading: ffn_mix_v
isBuffer: 0
cuda_mem: 0x7f276d06c000
loading: ffn_k
isBuffer: 0
cuda_mem: 0x7f2536e65000
loading: ffn_v
isBuffer: 0
cuda_mem: 0x7f23b6e65000
loading: ffn_r
isBuffer: 0
cuda_mem: 0x7f2616e66000
loading: ffn_kr
isBuffer: 0
cuda_mem: 0x7f276cfec000
loading: ffn_vr
isBuffer: 0
cuda_mem: 0x7f276cdec000
loading: ffn_rr
isBuffer: 0
cuda_mem: 0x7f276cd6c000
loading: ffn_ko
isBuffer: 0
cuda_mem: 0x7f276ccec000
loading: ffn_vo
isBuffer: 0
cuda_mem: 0x7f276caec000
loading: ffn_ro
isBuffer: 0
cuda_mem: 0x7f276ca6c000
loading: ffn_k_buffer
isBuffer: 1
cuda_mem: 0x7f276ee0a000
loading: ffn_v_buffer
isBuffer: 1
cuda_mem: 0x7f276ca64000
loading: ffn_r_buffer
isBuffer: 1
cuda_mem: 0x7f276ca54000
loading: decay
isBuffer: 0
cuda_mem: 0x7f276c954000
loading: bonus
isBuffer: 0
cuda_mem: 0x7f276c854000
loading: head
isBuffer: 0
cuda_mem: 0x7f26fe59c000
loading: head_r
isBuffer: 0
cuda_mem: 0x7f276f203000
loading: head_o
isBuffer: 0
cuda_mem: 0x7f276ee06000
Loaded model
loading context

ACO ERROR:
    In file ../src/amd/compiler/aco_assembler.cpp:168
    Unsupported opcode: buffer_atomic_add_f32 %18:s[8-11],  v1: undef, 0, %54:v[0] disable_wqm storage:buffer semantics:volatile,atomic,rmw
Aborted
  • Model: 11x
  • Variant: Vulkan
  • OS: Void linux
  • Mesa: 23.1.3_1
  • CPU: AMD Ryzen 6800H
  • iGPU: Radeon 680M
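
For context: the crash is Mesa's ACO backend rejecting a float buffer atomic (buffer_atomic_add_f32) that this iGPU/driver combination does not support. A common portable workaround in GPU code generally is to emulate the float atomic add with a compare-and-swap loop; a generic sketch of that pattern in CUDA-style C++ (not this repo's Vulkan shader code):

#include <cuda_runtime.h>

// Emulate atomicAdd on float using a 32-bit compare-and-swap loop.
// Works anywhere atomicCAS on unsigned int is available.
__device__ float atomicAddCAS(float* addr, float val) {
    unsigned int* bits = reinterpret_cast<unsigned int*>(addr);
    unsigned int old = *bits, assumed;
    do {
        assumed = old;
        float updated = __uint_as_float(assumed) + val;
        old = atomicCAS(bits, assumed, __float_as_uint(updated));
    } while (assumed != old);  // retry if another thread intervened
    return __uint_as_float(old);
}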

NumCpp NdArray not initialized

The model seems to load successfully, but after the first user input it displays the following error related to NdArray:
file: NdArrayIterators.hpp
function: NdArrayConstIterator
Line: 70
Error: NdArray has not been initialized.

This is with the v11x model using CUDA 12 on Windows. It works when using the latest storygen.exe artifact but not the one generated under /examples/storygen/release after running /build.bat. I might just be doing something silly / missing a dependency.
