
rwkv-peft's Introduction

🦚 RWKV-PEFT

Release

  • infctx
  • fla --fla
  • State tuning
  • Quant (QPissa, QLora) --quant int8/nf4
  • Pissa
  • Lisa
  • Lora
  • dataload (get, pad, only)

High performance on consumer hardware

Consider the memory requirements for training the following models on a 4090 (24GB) GPU with 64GB of CPU RAM (--strategy deepspeed_stage_1 --ctx_len 1024 --micro_bsz 1 --lora_r 64).

| Model | Full Finetuning | LoRA/PISSA | QLoRA/QPISSA | State tuning |
| --- | --- | --- | --- | --- |
| RWKV6-1.6B | OOM GPU | 7.4GB GPU | 5.6GB GPU | 6.4GB GPU |
| RWKV6-3B | OOM GPU | 12.1GB GPU | 8.2GB GPU | 9.4GB GPU |
| RWKV6-7B | OOM GPU | 23.7GB GPU (bsz 8 OOM) | 14.9GB GPU (bsz 8 needs 19.5GB) | 18.1GB GPU |

Quant State Tuning

  • strategy deepspeed_stage_1
  • ctx_len 1024
  • micro_bsz 1
  • 4090 24GB

| Model | bf16 | int8 | nf4/fp4/4bit |
| --- | --- | --- | --- |
| RWKV6-1.6B | 6.1GB GPU | 4.7GB GPU | 4.1GB GPU |
| RWKV6-3B | 9.1GB GPU | 6.5GB GPU | 5.2GB GPU |
| RWKV6-7B | 17.8GB GPU | 11.9GB GPU | 8.5GB GPU |
| RWKV6-14B | xxGB GPU | xxGB GPU | xxGB GPU |

Usage

sh demo/demo-xxxx.sh

Key options:

  • --train_type (infctx, state)
  • --quant (int8, 4bit, nf4, fp4)

infctx train

"--train_type infctx --chunk_ctx 512" "chunk_ctx" represents the chunk length, while "ctx_len" stands for the total length of the data. Due to the lack of gradients in the wkv6state operator, I now recommend using fla instead.

python train.py --load_model /home/rwkv/JL/model/RWKV-x060-World-1B6-v2.1-20240328-ctx4096.pth \
--proj_dir /home/rwkv/JL/out_model/state --data_file /home/rwkv/JL/data/roleplay \
--data_type binidx --vocab_size 65536 \
--ctx_len 2048 --epoch_steps 1000 --epoch_count 100 --epoch_begin 0 --epoch_save 1 --micro_bsz 4 \
--n_layer 24 --n_embd 2048 \
--pre_ffn 0 --head_qk 0 --lr_init 1 --lr_final 1e-1 --warmup_steps 0 --beta1 0.9 --beta2 0.99 --adam_eps 1e-8 \
--accelerator gpu --devices 1 --precision bf16 --strategy deepspeed_stage_1 --grad_cp 1 \
--my_testing "x060" \
--train_type infctx --chunk_ctx 512 --fla

fla

Install Triton with "pip install triton==2.2.0", then add "--fla" to your command to use it. FLA does not need to be compiled; just make sure Triton is installed before using it. See https://github.com/sustcsonglin/flash-linear-attention.git.
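
A minimal setup sketch; nothing here is model-specific, and --fla simply goes on the end of whichever train.py command you are running (as in the infctx example above):

pip install triton==2.2.0   # FLA runs on Triton; no CUDA kernel compilation is needed
# then append --fla to your train.py command, e.g.
#   python train.py ... --my_testing "x060" --train_type infctx --chunk_ctx 512 --fla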

State Tuning

add "--train_type state " to utilize quantization State Tuning.
This project's state tuning currently only supports training the state. You can refer to the state tuning in the demo for configuration. When saving weights, only the state is retained, so you need to use the state merge from the demo for merging. The advantage is that the saved weight files are very small. Any user who uses the same base model as you trained can merge and experience the same training results.

python train.py --load_model /home/rwkv/JL/model/RWKV-x060-World-1B6-v2.1-20240328-ctx4096.pth \
--proj_dir /home/rwkv/JL/out_model/state --data_file /home/rwkv/JL/data/roleplay \
--data_type binidx --vocab_size 65536 \
--ctx_len 2048 --epoch_steps 1000 --epoch_count 100 --epoch_begin 0 --epoch_save 1 --micro_bsz 4 \
--n_layer 24 --n_embd 2048 \
--pre_ffn 0 --head_qk 0 --lr_init 1 --lr_final 1e-1 --warmup_steps 0 --beta1 0.9 --beta2 0.99 --adam_eps 1e-8 \
--accelerator gpu --devices 1 --precision bf16 --strategy deepspeed_stage_1 --grad_cp 1 \
--my_testing "x060" \
--train_type state
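
The merge step itself is defined by the state-merge demo script. As a rough, hypothetical sketch of what it does (the script name and argument order below are assumptions patterned after merge_pissa.py further down, not the actual interface; check the demo/ directory for the real invocation):

# hypothetical: combine the base model with the trained state file
python merge_state.py --use-gpu \
/home/rwkv/JL/model/RWKV-x060-World-1B6-v2.1-20240328-ctx4096.pth \
/home/rwkv/JL/out_model/state/rwkv-0.pth \
/home/rwkv/JL/model/state-merged.pth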

Quant Train

Just add "--quant" with one of int8, 4bit, nf4, or fp4 to enable quantized fine-tuning. You can also use "sh demo-pissa.sh" for a quick start, then "sh demo-pissa-merge.sh" for merging. A sketch is shown below.
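
As a sketch, this is the PISSA command from the next section with NF4 quantization switched on; only the output directory (an arbitrary choice) and the trailing --quant nf4 differ:

python train.py --load_model /home/rwkv/JL/model/RWKV-x060-World-1B6-v2.1-20240328-ctx4096.pth \
--proj_dir /home/rwkv/JL/out_model/pissa-nf4 --data_file /home/rwkv/JL/data/roleplay \
--data_type binidx --vocab_size 65536 \
--ctx_len 2048 --epoch_steps 1000 --epoch_count 100 --epoch_begin 0 --epoch_save 1 --micro_bsz 4 \
--n_layer 24 --n_embd 2048 \
--pre_ffn 0 --head_qk 0 --lr_init 1e-4 --lr_final 1e-4 --warmup_steps 0 --beta1 0.9 --beta2 0.99 --adam_eps 1e-8 \
--accelerator gpu --devices 1 --precision bf16 --strategy deepspeed_stage_1 --grad_cp 1 \
--my_testing "x060" \
--lora_load rwkv-0 --lora --lora_r 64 --lora_alpha 128 --lora_dropout 0.01 --lora_parts=att,ffn,time,ln \
--PISSA --svd_niter 4 --quant nf4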

PISSA

PISSA is better than LISA
--lora_alpha 128 --lora_dropout 0.01 (these two parameters have no effect when PISSA is enabled)

python train.py --load_model /home/rwkv/JL/model/RWKV-x060-World-1B6-v2.1-20240328-ctx4096.pth \
--proj_dir /home/rwkv/JL/out_model/lisa-l2 --data_file /home/rwkv/JL/data/roleplay \
--data_type binidx --vocab_size 65536 \
--ctx_len 2048 --epoch_steps 1000 --epoch_count 100 --epoch_begin 0 --epoch_save 1 --micro_bsz 4 \
--n_layer 24 --n_embd 2048 \
--pre_ffn 0 --head_qk 0 --lr_init 1e-4 --lr_final 1e-4 --warmup_steps 0 --beta1 0.9 --beta2 0.99 --adam_eps 1e-8 \
--accelerator gpu --devices 1 --precision bf16 --strategy deepspeed_stage_1 --grad_cp 1 \
--my_testing "x060" \
--lora_load rwkv-0 --lora --lora_r 64 --lora_alpha 128 --lora_dropout 0.01 --lora_parts=att,ffn,time,ln \
--PISSA --svd_niter 4

PISSA merge (you need to merge init_lora and rwkv-0)

python merge_pissa.py --use-gpu /home/rwkv/JL/model/RWKV-x060-World-1B6-v2.1-20240328-ctx4096.pth /home/rwkv/JL/out_model/lora-1e-4/init_lora.pth /home/rwkv/JL/out_model/lora-1e-4/rwkv-0.pth  /home/rwkv/JL/model/pissa.pth

LISA

LISA is faster and more memory-efficient than LoRA.
In the context of the LISA algorithm, lisa_r determines how many layers are updated simultaneously, while lisa_k determines how often the algorithm re-selects layers for updating.

python train.py --load_model /home/rwkv/JL/model/RWKV-x060-World-1B6-v2.1-20240328-ctx4096.pth \
--proj_dir /home/rwkv/JL/out_model/lisa-l2 --data_file /home/rwkv/JL/data/roleplay \
--data_type binidx --vocab_size 65536 \
--ctx_len 2048 --epoch_steps 1000 --epoch_count 100 --epoch_begin 0 --epoch_save 1 --micro_bsz 4 \
--n_layer 24 --n_embd 2048 \
--pre_ffn 0 --head_qk 0 --lr_init 1e-4 --lr_final 1e-4 --warmup_steps 0 --beta1 0.9 --beta2 0.99 --adam_eps 1e-8 \
--accelerator gpu --devices 1 --precision bf16 --strategy deepspeed_stage_1 --grad_cp 1 \
--my_testing "x060" \
--LISA --lisa_r 2 --lisa_k 100

RWKV-v6-lora

Just add --my_testing "x060" on top of the v5 command.

python train.py --load_model /home/rwkv/JL/model/RWKV-x060-World-1B6-v2.1-20240328-ctx4096.pth \
--proj_dir /home/rwkv/JL/out_model --data_file /home/rwkv/JL/data/minipile \
--data_type binidx --vocab_size 65536 \
--ctx_len 2048 --epoch_steps 8000 --epoch_count 100 --epoch_begin 0 --epoch_save 5 --micro_bsz 4 \
--n_layer 24 --n_embd 2048 \
--pre_ffn 0 --head_qk 0 --lr_init 3e-4 --lr_final 3e-4 --warmup_steps 0 --beta1 0.9 --beta2 0.99 --adam_eps 1e-8 \
--accelerator gpu --devices 1 --precision bf16 --strategy deepspeed_stage_1 --grad_cp 1 \
--my_testing "x060" \
--wandb rwkv \
--lora_load rwkv-0 --lora --lora_r 64 --lora_alpha 128 --lora_dropout 0.01 --lora_parts=att,ffn,time,ln

Merge lora

python merge_lora.py --use-gpu 128 /home/asd/model/RWKV-5-World-1B5-v2-20231025-ctx4096.pth img595k/rwkv-0.pth /home/asd/model/RWKV-5-World-1.5B--lora.pth

RWKV-v5-lora

Training tips: the standard full fine-tuning approach is to duplicate the data several times (if you want to train for several epochs); note that each copy must use a different random permutation of the entries! Then use the --my_pile_stage 3 --my_exit_tokens xxx --magic_prime xxx technique here so that sampling is perfectly non-repeating. Suggested learning rate: --lr_init 1e-5 --lr_final 1e-5.
my_exit_tokens = datalen, the exact number of tokens in the data, printed when the data is loaded. magic_prime = the largest 3n+2 prime smaller than datalen/ctx_len - 1 (= 1498226207/512 - 1 = 2926222.06 in this case).
LoRA: --lora_r 64 --lora_alpha 128. Increase r and alpha together; larger values give better results but slower training. 64/128 is currently a good setting.

python train.py --load_model /home/asd/model/RWKV-5-World-1B5-v2-20231025-ctx4096.pth \
--proj_dir /home/asd/model --data_file ttt_text_document \
--data_type binidx --vocab_size 65536 \
--ctx_len 10 --epoch_steps 10 --epoch_count 100 --epoch_begin 0 --epoch_save 5 --micro_bsz 1 \
--n_layer 24 --n_embd 2048 \
--pre_ffn 0 --head_qk 0 --lr_init 1e-5 --lr_final 1e-5 --warmup_steps 0 --beta1 0.9 --beta2 0.99 --adam_eps 1e-8 \
--accelerator gpu --devices 1 --precision bf16 --strategy deepspeed_stage_2 --grad_cp 1 \
--lora_load rwkv-0 --lora --lora_r 64 --lora_alpha 128 --lora_dropout 0.01 --lora_parts=att,ffn,time,ln

RWKV-V4-lora

Original source: https://github.com/Blealtan/RWKV-LM-LoRA

python3 train.py \
  --load_model <pretrained base model> \
  --proj_dir <place to save checkpoints> \
  --data_file <data for finetune> \
  --data_type <data type for finetune, recommend binidx> \
  --vocab_size 50277 --ctx_len 1024 --epoch_steps 1000 --epoch_count 1000 --epoch_begin 0 --epoch_save 5 --micro_bsz 2 --accumulate_grad_batches 4 \
  --n_layer 24 --n_embd 1024 --pre_ffn 0 --head_qk 0 --lr_init 1e-4 --lr_final 1e-4 --warmup_steps 0 --beta1 0.9 --beta2 0.999 --adam_eps 1e-8 --accelerator gpu --devices 1 --precision bf16 --strategy deepspeed_stage_2 --grad_cp 0 \ # all your familiar options
  --lora --lora_r 8 --lora_alpha 16 --lora_dropout 0.01 \
  --lora_load <lora checkpoint to continue training> \ # optional
  --lora_parts=att,ffn,time,ln # configure which parts to finetune


rwkv-peft's Issues

RuntimeError: Error(s) in loading state_dict for RWKV:

When I test with minipile:

python train.py --load_model /data/models/rwkv/RWKV-5-World-7B-v2-20240128-ctx4096.pth \
--proj_dir /data/models/rwkv/lora --data_file /data/datasets/BlinkDL/minipile-tokenized/rwkv_vocab_v20230424/minipile \
--data_type binidx --vocab_size 65536 \
--ctx_len 10 --epoch_steps 10 --epoch_count 100 --epoch_begin 0 --epoch_save 5 --micro_bsz 1 \
--n_layer 24 --n_embd 2048 \
--pre_ffn 0 --head_qk 0 --lr_init 1e-5 --lr_final 1e-5 --warmup_steps 0 --beta1 0.9 --beta2 0.99 --adam_eps 1e-8 \
--accelerator gpu --devices 1 --precision bf16 --strategy deepspeed_stage_2 --grad_cp 1 \
--lora_load rwkv-0 --lora --lora_r 64 --lora_alpha 128 --lora_dropout 0.01 --lora_parts=att,ffn,time,ln

The error is as follows:
INFO:pytorch_lightning.utilities.rank_zero:########## Loading /data/models/rwkv/RWKV-5-World-7B-v2-20240128-ctx4096.pth... ##########
Traceback (most recent call last):

RuntimeError: Error(s) in loading state_dict for RWKV:
size mismatch for emb.weight: copying a param with shape torch.Size([65536, 4096]) from checkpoint, the shape in current model is torch.Size([65536, 2048]).
size mismatch for blocks.0.ln1.weight: copying a param with shape torch.Size([4096]) from checkpoint, the shape in current model is torch.Size([2048]).
size mismatch for blocks.0.ln1.bias: copying a param with shape torch.Size([4096]) from checkpoint, the shape in current model is torch.Size([2048]).
size mismatch for blocks.0.ln2.weight: copying a param with shape torch.Size([4096]) from checkpoint, the shape in current model is torch.Size([2048]).
size mismatch for blocks.0.ln2.bias: copying a param with shape torch.Size([4096]) from checkpoint, the shape in current model is torch.Size([2048]).
size mismatch for blocks.0.ln0.weight: copying a param with shape torch.Size([4096]) from checkpoint, the shape in current model is torch.Size([2048]).
size mismatch for blocks.0.ln0.bias: copying a param with shape torch.Size([4096]) from checkpoint, the shape in current model is torch.Size([2048]).
size mismatch for blocks.0.att.time_mix_k: copying a param with shape torch.Size([1, 1, 4096]) from checkpoint, the shape in current model is torch.Size([1, 1, 2048]).
size mismatch for blocks.0.att.time_mix_v: copying a param with shape torch.Size([1, 1, 4096]) from checkpoint, the shape in current model is torch.Size([1, 1, 2048]).
size mismatch for blocks.0.att.time_mix_r: copying a param with shape torch.Size([1, 1, 4096]) from checkpoint, the shape in current model is torch.Size([1, 1, 2048]).
size mismatch for blocks.0.att.time_mix_g: copying a param with shape torch.Size([1, 1, 4096]) from checkpoint, the shape in current model is torch.Size([1, 1, 2048]).
size mismatch for blocks.0.att.time_decay: copying a param with shape torch.Size([64, 64]) from checkpoint, the shape in current model is torch.Size([32, 64]).
size mismatch for blocks.0.att.time_faaaa: copying a param with shape torch.Size([64, 64]) from checkpoint, the shape in current model is torch.Size([32, 64]).
size mismatch for blocks.0.att.receptance.weight: copying a param with shape torch.Size([4096, 4096]) from checkpoint, the shape in current model is torch.Size([2048, 2048]).
size mismatch for blocks.0.att.key.weight: copying a param with shape torch.Size([4096, 4096]) from checkpoint, the shape in current model is torch.Size([2048, 2048]).
size mismatch for blocks.0.att.value.weight: copying a param with shape torch.Size([4096, 4096]) from checkpoint, the shape in current model is torch.Size([2048, 2048]).
…………

RuntimeError: Error building extension 'wkv5'

Hey,
I am getting this error:

root@DESKTOP-TTBPHVB:~/rwkv/RWKV-v5-lora# python3 train.py --load_model RWKV-5-World-0.4B-v2-20231113-ctx4096.pth --proj_dir . --data_file output --data_type binidx --vocab_size 65536 --ctx_len 1024 --epoch_steps 500 --epoch_count 3 --epoch_begin 0 --epoch_save 1 --micro_bsz 1 --n_layer 24 --n_embd 2048 --pre_ffn 0 --head_qk 0 --lr_init 1e-3 --lr_final 1e-4 --warmup_steps 0 --beta1 0.9 --beta2 0.99 --adam_eps 1e-8 --accelerator gpu --devices 1 --precision bf16 --strategy deepspeed_stage_2 --grad_cp 1 --lora_load rwkv-0 --lora --lora_r 64 --lora_alpha 128 --lora_dropout 0.01 --lora_parts=att,ffn,time,ln
INFO:pytorch_lightning.utilities.rank_zero:########## work in progress ##########
[2023-12-22 17:37:52,817] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
INFO:pytorch_lightning.utilities.rank_zero:
############################################################################
#
# RWKV-5 BF16 on 1x1 GPU, bsz 1x1x1=1, deepspeed_stage_2 with grad_cp
#
# Data = output (binidx), ProjDir = /root/autodl-tmp
#
# Epoch = 0 to 2 (will continue afterwards), save every 1 epoch
#
# Each "epoch" = 500 steps, 500 samples, 512000 tokens
#
# Model = 24 n_layer, 2048 n_embd, 1024 ctx_len
#
# Adam = lr 0.001 to 0.0001, warmup 0 steps, beta (0.9, 0.99), eps 1e-08
#
# Found torch 2.1.2+cu121, recommend 1.13.1+cu117 or newer
# Found deepspeed 0.12.6, recommend 0.7.0 (faster than newer versions)
# Found pytorch_lightning 2.1.3, recommend 1.9.5
#
############################################################################

INFO:pytorch_lightning.utilities.rank_zero:{'load_model': '../RWKV-5-World-0.4B-v2-20231113-ctx4096.pth', 'wandb': '', 'proj_dir': '/root/autodl-tmp', 'random_seed': -1, 'data_file': 'output', 'data_type': 'binidx', 'vocab_size': 65536, 'ctx_len': 1024, 'epoch_steps': 500, 'epoch_count': 3, 'epoch_begin': 0, 'epoch_save': 1, 'micro_bsz': 1, 'n_layer': 24, 'n_embd': 2048, 'dim_att': 2048, 'dim_ffn': 7168, 'pre_ffn': 0, 'head_qk': 0, 'tiny_att_dim': 0, 'tiny_att_layer': -999, 'lr_init': 0.001, 'lr_final': 0.0001, 'warmup_steps': 0, 'beta1': 0.9, 'beta2': 0.99, 'adam_eps': 1e-08, 'grad_cp': 1, 'dropout': 0, 'weight_decay': 0, 'weight_decay_final': -1, 'my_pile_version': 1, 'my_pile_stage': 0, 'my_pile_shift': -1, 'my_pile_edecay': 0, 'layerwise_lr': 1, 'ds_bucket_mb': 200, 'my_sample_len': 0, 'my_ffn_shift': 1, 'my_att_shift': 1, 'head_size_a': 64, 'head_size_divisor': 8, 'my_pos_emb': 0, 'load_partial': 0, 'magic_prime': 0, 'my_qa_mask': 0, 'my_random_steps': 0, 'my_testing': '', 'my_exit': 99999999, 'my_exit_tokens': 0, 'emb': False, 'lora': True, 'lora_load': 'rwkv-0', 'lora_r': 64, 'lora_alpha': 128.0, 'lora_dropout': 0.01, 'lora_parts': 'att,ffn,time,ln', 'accelerator': 'gpu', 'strategy': 'deepspeed_stage_2', 'devices': 1, 'num_nodes': 1, 'precision': 'bf16', 'accumulate_grad_batches': 1, 'my_timestamp': '2023-12-22-17-37-53', 'enable_checkpointing': False, 'replace_sampler_ddp': False, 'logger': False, 'gradient_clip_val': 1.0, 'num_sanity_val_steps': 0, 'check_val_every_n_epoch': 100000000000000000000, 'log_every_n_steps': 100000000000000000000, 'max_epochs': -1, 'betas': (0.9, 0.99), 'real_bsz': 1, 'run_name': '65536 ctx1024 L24 D2048'}

RWKV_MY_TESTING 
Using /root/.cache/torch_extensions/py310_cu121 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /root/.cache/torch_extensions/py310_cu121/wkv5/build.ninja...
Building extension module wkv5...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/2] /usr/bin/nvcc  -DTORCH_EXTENSION_NAME=wkv5 -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /usr/local/lib/python3.10/dist-packages/torch/include -isystem /usr/local/lib/python3.10/dist-packages/torch/include/torch/csrc/api/include -isystem /usr/local/lib/python3.10/dist-packages/torch/include/TH -isystem /usr/local/lib/python3.10/dist-packages/torch/include/THC -isystem /usr/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_86,code=sm_86 --compiler-options '-fPIC' -res-usage --use_fast_math -O3 -Xptxas -O3 --extra-device-vectorization -D_N_=64 -std=c++17 -c /root/rwkv/RWKV-v5-lora/cuda/wkv5_cuda.cu -o wkv5_cuda.cuda.o 
FAILED: wkv5_cuda.cuda.o 
/usr/bin/nvcc  -DTORCH_EXTENSION_NAME=wkv5 -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /usr/local/lib/python3.10/dist-packages/torch/include -isystem /usr/local/lib/python3.10/dist-packages/torch/include/torch/csrc/api/include -isystem /usr/local/lib/python3.10/dist-packages/torch/include/TH -isystem /usr/local/lib/python3.10/dist-packages/torch/include/THC -isystem /usr/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_86,code=sm_86 --compiler-options '-fPIC' -res-usage --use_fast_math -O3 -Xptxas -O3 --extra-device-vectorization -D_N_=64 -std=c++17 -c /root/rwkv/RWKV-v5-lora/cuda/wkv5_cuda.cu -o wkv5_cuda.cuda.o 
ptxas info    : 1 bytes gmem
ptxas info    : Compiling entry function '_Z15kernel_backwardIN3c108BFloat16EEviiiiPKT_S4_S4_PKfS6_S4_S4_PS2_S7_S7_S7_S7_' for 'sm_86'
ptxas info    : Function properties for _Z15kernel_backwardIN3c108BFloat16EEviiiiPKT_S4_S4_PKfS6_S4_S4_PS2_S7_S7_S7_S7_
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 168 registers, 1536 bytes smem, 464 bytes cmem[0]
ptxas info    : Compiling entry function '_Z14kernel_forwardIN3c108BFloat16EEviiiiPKT_S4_S4_PKfS4_PS2_' for 'sm_86'
ptxas info    : Function properties for _Z14kernel_forwardIN3c108BFloat16EEviiiiPKT_S4_S4_PKfS4_PS2_
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 100 registers, 1024 bytes smem, 416 bytes cmem[0]
/usr/include/c++/11/bits/std_function.h:435:145: error: parameter packs not expanded with ‘...’:
  435 |         function(_Functor&& __f)
      |                                                                                                                                                 ^ 
/usr/include/c++/11/bits/std_function.h:435:145: note:         ‘_ArgTypes’
/usr/include/c++/11/bits/std_function.h:530:146: error: parameter packs not expanded with ‘...’:
  530 |         operator=(_Functor&& __f)
      |                                                                                                                                                  ^ 
/usr/include/c++/11/bits/std_function.h:530:146: note:         ‘_ArgTypes’
ninja: build stopped: subcommand failed.
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/cpp_extension.py", line 2100, in _run_ninja_build
    subprocess.run(
  File "/usr/lib/python3.10/subprocess.py", line 526, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/root/rwkv/RWKV-v5-lora/train.py", line 251, in <module>
    from src.trainer import train_callback, generate_init_weight
  File "/root/rwkv/RWKV-v5-lora/src/trainer.py", line 6, in <module>
    from .model import LORA_CONFIG
  File "/root/rwkv/RWKV-v5-lora/src/model.py", line 53, in <module>
    wkv5_cuda = load(name="wkv5", sources=["cuda/wkv5_op.cpp", f"cuda/wkv5_cuda.cu"],
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/cpp_extension.py", line 1308, in load
    return _jit_compile(
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/cpp_extension.py", line 1710, in _jit_compile
    _write_ninja_file_and_build_library(
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/cpp_extension.py", line 1823, in _write_ninja_file_and_build_library
    _run_ninja_build(
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/cpp_extension.py", line 2116, in _run_ninja_build
    raise RuntimeError(message) from e
RuntimeError: Error building extension 'wkv5'

Fine-tuning memory usage

Hello, how much memory is needed to fine-tune the v5 7B model? Fine-tuning the 1.5B model used 40GB for me; even with 128GB of RAM I cannot fine-tune the 3B model.

Hi, when using --train_type state, this repo saves the full set of parameters; maybe you can fix it.

Like the repo https://github.com/BlinkDL/RWKV-LM/blob/main/RWKV-v5/src/trainer.py,
we can fix our repo https://github.com/JL-er/RWKV-PEFT/blob/main/src/trainer.py
as follows:

import os, math, time, datetime, subprocess
import torch
from torch.utils.data import DataLoader
import pytorch_lightning as pl
from pytorch_lightning.utilities import rank_zero_info, rank_zero_only
from .model import LORA_CONFIG
import re
import numpy as np

def my_save(args, trainer, dd, ff):
    if '14b-run1' in ff:
        fn = ff.split('/')[-1]
        fff = '/dev/shm/' + fn
        torch.save(dd, fff)
        subprocess.Popen(f" aws s3 mv {fff} s3://rwkv-14b-4k/{fn} --quiet", shell=True)
    elif ('world/14b' in ff) or ('world/7b' in ff):
        aa = ff.split('/')[1]
        fn = ff.split('/')[-1]
        fff = f'/dev/shm/{aa}-{fn}'
        torch.save(dd, fff)
        subprocess.Popen(f" aws s3 mv {fff} s3://rwkv-world/{aa}-{fn} --quiet", shell=True)
    else:
        if 'deepspeed_stage_3' in args.strategy:
            trainer.save_checkpoint(ff, weights_only=True)
        else:
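            # Save only the trained state tensors ('time_sta' keys) instead of the full model.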
            if args.train_type == 'state':
                ddd = {}
                for k, v in dd.items():
                    if 'time_sta' in k:
                        ddd[k] = v.clone()
                torch.save(ddd, ff)
            else:
                torch.save(dd, ff)
