
rwkv-peft's Introduction

🦚 RWKV-PEFT

Release

  • infctx
  • fla --fla
  • State tuning
  • Quant (QPissa, QLora) --quant int8/nf4
  • Pissa
  • Lisa
  • Lora
  • dataload (get, pad, only)

High performance on consumer hardware

Consider the memory requirements for training the following models on a 4090 (24GB) GPU with 64GB of CPU RAM (--strategy deepspeed_stage_1 --ctx_len 1024 --micro_bsz 1 --lora_r 64).

| Model | Full Finetuning | LoRA/PISSA | QLoRA/QPISSA | State tuning |
| --- | --- | --- | --- | --- |
| RWKV6-1.6B | OOM GPU | 7.4GB GPU | 5.6GB GPU | 6.4GB GPU |
| RWKV6-3B | OOM GPU | 12.1GB GPU | 8.2GB GPU | 9.4GB GPU |
| RWKV6-7B | OOM GPU | 23.7GB GPU (bsz 8 OOM) | 14.9GB GPU (bsz 8 needs 19.5GB) | 18.1GB GPU |

Quant State Tuning

  • strategy deepspeed_stage_1
  • ctx_len 1024
  • micro_bsz 1
  • 4090 24GB

| Model | bf16 | int8 | nf4/fp4/4bit |
| --- | --- | --- | --- |
| RWKV6-1.6B | 6.1GB GPU | 4.7GB GPU | 4.1GB GPU |
| RWKV6-3B | 9.1GB GPU | 6.5GB GPU | 5.2GB GPU |
| RWKV6-7B | 17.8GB GPU | 11.9GB GPU | 8.5GB GPU |
| RWKV6-14B | xxGB GPU | xxGB GPU | xxGB GPU |

Usage

sh demo/demo-xxxx.sh

Key options:

  • --train_type (infctx, state)
  • --quant (int8, 4bit, nf4, fp4)

infctx train

"--train_type infctx --chunk_ctx 512" "chunk_ctx" represents the chunk length, while "ctx_len" stands for the total length of the data. Due to the lack of gradients in the wkv6state operator, I now recommend using fla instead.

python train.py --load_model /home/rwkv/JL/model/RWKV-x060-World-1B6-v2.1-20240328-ctx4096.pth \
--proj_dir /home/rwkv/JL/out_model/state --data_file /home/rwkv/JL/data/roleplay \
--data_type binidx --vocab_size 65536 \
--ctx_len 2048 --epoch_steps 1000 --epoch_count 100 --epoch_begin 0 --epoch_save 1 --micro_bsz 4 \
--n_layer 24 --n_embd 2048 \
--pre_ffn 0 --head_qk 0 --lr_init 1 --lr_final 1e-1 --warmup_steps 0 --beta1 0.9 --beta2 0.99 --adam_eps 1e-8 \
--accelerator gpu --devices 1 --precision bf16 --strategy deepspeed_stage_1 --grad_cp 1 \
--my_testing "x060" \
--train_type infctx --chunk_ctx 512 --fla

fla

Install Triton with "pip install triton==2.2.0", then add "--fla" to your command to use it. FLA does not need to be compiled; just make sure Triton is installed before using it. See https://github.com/sustcsonglin/flash-linear-attention.git.
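
A minimal setup sketch; nothing here is model-specific, and --fla simply goes on the end of whichever train.py command you are running (as in the infctx example above):

pip install triton==2.2.0   # FLA runs on Triton; no CUDA kernel compilation is needed
# then append --fla to your train.py command, e.g.
#   python train.py ... --my_testing "x060" --train_type infctx --chunk_ctx 512 --fla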

State Tuning

add "--train_type state " to utilize quantization State Tuning.
This project's state tuning currently only supports training the state. You can refer to the state tuning in the demo for configuration. When saving weights, only the state is retained, so you need to use the state merge from the demo for merging. The advantage is that the saved weight files are very small. Any user who uses the same base model as you trained can merge and experience the same training results.

python train.py --load_model /home/rwkv/JL/model/RWKV-x060-World-1B6-v2.1-20240328-ctx4096.pth \
--proj_dir /home/rwkv/JL/out_model/state --data_file /home/rwkv/JL/data/roleplay \
--data_type binidx --vocab_size 65536 \
--ctx_len 2048 --epoch_steps 1000 --epoch_count 100 --epoch_begin 0 --epoch_save 1 --micro_bsz 4 \
--n_layer 24 --n_embd 2048 \
--pre_ffn 0 --head_qk 0 --lr_init 1 --lr_final 1e-1 --warmup_steps 0 --beta1 0.9 --beta2 0.99 --adam_eps 1e-8 \
--accelerator gpu --devices 1 --precision bf16 --strategy deepspeed_stage_1 --grad_cp 1 \
--my_testing "x060" \
--train_type state
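
The merge step itself is defined by the state-merge demo script. As a rough, hypothetical sketch of what it does (the script name and argument order below are assumptions patterned after merge_pissa.py further down, not the actual interface; check the demo/ directory for the real invocation):

# hypothetical: combine the base model with the trained state file
python merge_state.py --use-gpu \
/home/rwkv/JL/model/RWKV-x060-World-1B6-v2.1-20240328-ctx4096.pth \
/home/rwkv/JL/out_model/state/rwkv-0.pth \
/home/rwkv/JL/model/state-merged.pth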

Quant Train

Just add "--quant" with one of int8, 4bit, nf4, or fp4 to enable quantized fine-tuning. You can also use "sh demo-pissa.sh" for a quick start, then "sh demo-pissa-merge.sh" for merging. A sketch is shown below.
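
As a sketch, this is the PISSA command from the next section with NF4 quantization switched on; only the output directory (an arbitrary choice) and the trailing --quant nf4 differ:

python train.py --load_model /home/rwkv/JL/model/RWKV-x060-World-1B6-v2.1-20240328-ctx4096.pth \
--proj_dir /home/rwkv/JL/out_model/pissa-nf4 --data_file /home/rwkv/JL/data/roleplay \
--data_type binidx --vocab_size 65536 \
--ctx_len 2048 --epoch_steps 1000 --epoch_count 100 --epoch_begin 0 --epoch_save 1 --micro_bsz 4 \
--n_layer 24 --n_embd 2048 \
--pre_ffn 0 --head_qk 0 --lr_init 1e-4 --lr_final 1e-4 --warmup_steps 0 --beta1 0.9 --beta2 0.99 --adam_eps 1e-8 \
--accelerator gpu --devices 1 --precision bf16 --strategy deepspeed_stage_1 --grad_cp 1 \
--my_testing "x060" \
--lora_load rwkv-0 --lora --lora_r 64 --lora_alpha 128 --lora_dropout 0.01 --lora_parts=att,ffn,time,ln \
--PISSA --svd_niter 4 --quant nf4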

PISSA

PISSA is better than LISA
--lora_alpha 128 --lora_dropout 0.01 (these two parameters have no effect when PISSA is enabled)

python train.py --load_model /home/rwkv/JL/model/RWKV-x060-World-1B6-v2.1-20240328-ctx4096.pth \
--proj_dir /home/rwkv/JL/out_model/lisa-l2 --data_file /home/rwkv/JL/data/roleplay \
--data_type binidx --vocab_size 65536 \
--ctx_len 2048 --epoch_steps 1000 --epoch_count 100 --epoch_begin 0 --epoch_save 1 --micro_bsz 4 \
--n_layer 24 --n_embd 2048 \
--pre_ffn 0 --head_qk 0 --lr_init 1e-4 --lr_final 1e-4 --warmup_steps 0 --beta1 0.9 --beta2 0.99 --adam_eps 1e-8 \
--accelerator gpu --devices 1 --precision bf16 --strategy deepspeed_stage_1 --grad_cp 1 \
--my_testing "x060" \
--lora_load rwkv-0 --lora --lora_r 64 --lora_alpha 128 --lora_dropout 0.01 --lora_parts=att,ffn,time,ln \
--PISSA --svd_niter 4

PISSA merge (you need to merge init_lora and rwkv-0)

python merge_pissa.py --use-gpu /home/rwkv/JL/model/RWKV-x060-World-1B6-v2.1-20240328-ctx4096.pth /home/rwkv/JL/out_model/lora-1e-4/init_lora.pth /home/rwkv/JL/out_model/lora-1e-4/rwkv-0.pth  /home/rwkv/JL/model/pissa.pth

LISA

LISA is faster and more memory-efficient than LoRA.
In the context of the LISA algorithm, lisa_r determines how many layers are updated simultaneously, while lisa_k determines how often the algorithm re-selects layers for updating.

python train.py --load_model /home/rwkv/JL/model/RWKV-x060-World-1B6-v2.1-20240328-ctx4096.pth \
--proj_dir /home/rwkv/JL/out_model/lisa-l2 --data_file /home/rwkv/JL/data/roleplay \
--data_type binidx --vocab_size 65536 \
--ctx_len 2048 --epoch_steps 1000 --epoch_count 100 --epoch_begin 0 --epoch_save 1 --micro_bsz 4 \
--n_layer 24 --n_embd 2048 \
--pre_ffn 0 --head_qk 0 --lr_init 1e-4 --lr_final 1e-4 --warmup_steps 0 --beta1 0.9 --beta2 0.99 --adam_eps 1e-8 \
--accelerator gpu --devices 1 --precision bf16 --strategy deepspeed_stage_1 --grad_cp 1 \
--my_testing "x060" \
--LISA --lisa_r 2 --lisa_k 100

RWKV-v6-lora

Just add --my_testing "x060" on top of the v5 command.

python train.py --load_model /home/rwkv/JL/model/RWKV-x060-World-1B6-v2.1-20240328-ctx4096.pth \
--proj_dir /home/rwkv/JL/out_model --data_file /home/rwkv/JL/data/minipile \
--data_type binidx --vocab_size 65536 \
--ctx_len 2048 --epoch_steps 8000 --epoch_count 100 --epoch_begin 0 --epoch_save 5 --micro_bsz 4 \
--n_layer 24 --n_embd 2048 \
--pre_ffn 0 --head_qk 0 --lr_init 3e-4 --lr_final 3e-4 --warmup_steps 0 --beta1 0.9 --beta2 0.99 --adam_eps 1e-8 \
--accelerator gpu --devices 1 --precision bf16 --strategy deepspeed_stage_1 --grad_cp 1 \
--my_testing "x060" \
--wandb rwkv \
--lora_load rwkv-0 --lora --lora_r 64 --lora_alpha 128 --lora_dropout 0.01 --lora_parts=att,ffn,time,ln

Merge lora

python merge_lora.py --use-gpu 128 /home/asd/model/RWKV-5-World-1B5-v2-20231025-ctx4096.pth img595k/rwkv-0.pth /home/asd/model/RWKV-5-World-1.5B--lora.pth

RWKV-v5-lora

Training tips: the standard full fine-tuning approach is to duplicate the data several times (if you want to train for several epochs); note that each copy must use a different random permutation of the entries! Then use the --my_pile_stage 3 --my_exit_tokens xxx --magic_prime xxx technique here so that sampling is perfectly non-repeating. Suggested learning rate: --lr_init 1e-5 --lr_final 1e-5.
my_exit_tokens = datalen, the exact number of tokens in the data, printed when the data is loaded. magic_prime = the largest 3n+2 prime smaller than datalen/ctx_len - 1 (= 1498226207/512 - 1 = 2926222.06 in this case).
LoRA: --lora_r 64 --lora_alpha 128. Increase r and alpha together; larger values give better results but slower training. 64/128 is currently a good setting.

python train.py --load_model /home/asd/model/RWKV-5-World-1B5-v2-20231025-ctx4096.pth \
--proj_dir /home/asd/model --data_file ttt_text_document \
--data_type binidx --vocab_size 65536 \
--ctx_len 10 --epoch_steps 10 --epoch_count 100 --epoch_begin 0 --epoch_save 5 --micro_bsz 1 \
--n_layer 24 --n_embd 2048 \
--pre_ffn 0 --head_qk 0 --lr_init 1e-5 --lr_final 1e-5 --warmup_steps 0 --beta1 0.9 --beta2 0.99 --adam_eps 1e-8 \
--accelerator gpu --devices 1 --precision bf16 --strategy deepspeed_stage_2 --grad_cp 1 \
--lora_load rwkv-0 --lora --lora_r 64 --lora_alpha 128 --lora_dropout 0.01 --lora_parts=att,ffn,time,ln

RWKV-V4-lora

Original source: https://github.com/Blealtan/RWKV-LM-LoRA

python3 train.py \
  --load_model <pretrained base model> \
  --proj_dir <place to save checkpoints> \
  --data_file <data for finetune> \
  --data_type <data type for finetune, recommend binidx> \
  --vocab_size 50277 --ctx_len 1024 --epoch_steps 1000 --epoch_count 1000 --epoch_begin 0 --epoch_save 5 --micro_bsz 2 --accumulate_grad_batches 4 \
  --n_layer 24 --n_embd 1024 --pre_ffn 0 --head_qk 0 --lr_init 1e-4 --lr_final 1e-4 --warmup_steps 0 --beta1 0.9 --beta2 0.999 --adam_eps 1e-8 --accelerator gpu --devices 1 --precision bf16 --strategy deepspeed_stage_2 --grad_cp 0 \ # all your familiar options
  --lora --lora_r 8 --lora_alpha 16 --lora_dropout 0.01 \
  --lora_load <lora checkpoint to continue training> \ # optional
  --lora_parts=att,ffn,time,ln # configure which parts to finetune


rwkv-peft's Issues

RuntimeError: Error(s) in loading state_dict for RWKV:

When I test with minipile:

python train.py --load_model /data/models/rwkv/RWKV-5-World-7B-v2-20240128-ctx4096.pth \
--proj_dir /data/models/rwkv/lora --data_file /data/datasets/BlinkDL/minipile-tokenized/rwkv_vocab_v20230424/minipile \
--data_type binidx --vocab_size 65536 \
--ctx_len 10 --epoch_steps 10 --epoch_count 100 --epoch_begin 0 --epoch_save 5 --micro_bsz 1 \
--n_layer 24 --n_embd 2048 \
--pre_ffn 0 --head_qk 0 --lr_init 1e-5 --lr_final 1e-5 --warmup_steps 0 --beta1 0.9 --beta2 0.99 --adam_eps 1e-8 \
--accelerator gpu --devices 1 --precision bf16 --strategy deepspeed_stage_2 --grad_cp 1 \
--lora_load rwkv-0 --lora --lora_r 64 --lora_alpha 128 --lora_dropout 0.01 --lora_parts=att,ffn,time,ln

The error is as follows:
INFO:pytorch_lightning.utilities.rank_zero:########## Loading /data/models/rwkv/RWKV-5-World-7B-v2-20240128-ctx4096.pth... ##########
Traceback (most recent call last):

RuntimeError: Error(s) in loading state_dict for RWKV:
size mismatch for emb.weight: copying a param with shape torch.Size([65536, 4096]) from checkpoint, the shape in current model is torch.Size([65536, 2048]).
size mismatch for blocks.0.ln1.weight: copying a param with shape torch.Size([4096]) from checkpoint, the shape in current model is torch.Size([2048]).
size mismatch for blocks.0.ln1.bias: copying a param with shape torch.Size([4096]) from checkpoint, the shape in current model is torch.Size([2048]).
size mismatch for blocks.0.ln2.weight: copying a param with shape torch.Size([4096]) from checkpoint, the shape in current model is torch.Size([2048]).
size mismatch for blocks.0.ln2.bias: copying a param with shape torch.Size([4096]) from checkpoint, the shape in current model is torch.Size([2048]).
size mismatch for blocks.0.ln0.weight: copying a param with shape torch.Size([4096]) from checkpoint, the shape in current model is torch.Size([2048]).
size mismatch for blocks.0.ln0.bias: copying a param with shape torch.Size([4096]) from checkpoint, the shape in current model is torch.Size([2048]).
size mismatch for blocks.0.att.time_mix_k: copying a param with shape torch.Size([1, 1, 4096]) from checkpoint, the shape in current model is torch.Size([1, 1, 2048]).
size mismatch for blocks.0.att.time_mix_v: copying a param with shape torch.Size([1, 1, 4096]) from checkpoint, the shape in current model is torch.Size([1, 1, 2048]).
size mismatch for blocks.0.att.time_mix_r: copying a param with shape torch.Size([1, 1, 4096]) from checkpoint, the shape in current model is torch.Size([1, 1, 2048]).
size mismatch for blocks.0.att.time_mix_g: copying a param with shape torch.Size([1, 1, 4096]) from checkpoint, the shape in current model is torch.Size([1, 1, 2048]).
size mismatch for blocks.0.att.time_decay: copying a param with shape torch.Size([64, 64]) from checkpoint, the shape in current model is torch.Size([32, 64]).
size mismatch for blocks.0.att.time_faaaa: copying a param with shape torch.Size([64, 64]) from checkpoint, the shape in current model is torch.Size([32, 64]).
size mismatch for blocks.0.att.receptance.weight: copying a param with shape torch.Size([4096, 4096]) from checkpoint, the shape in current model is torch.Size([2048, 2048]).
size mismatch for blocks.0.att.key.weight: copying a param with shape torch.Size([4096, 4096]) from checkpoint, the shape in current model is torch.Size([2048, 2048]).
size mismatch for blocks.0.att.value.weight: copying a param with shape torch.Size([4096, 4096]) from checkpoint, the shape in current model is torch.Size([2048, 2048]).
…………

RuntimeError: Error building extension 'wkv5'

Hey,
I am getting this error:

root@DESKTOP-TTBPHVB:~/rwkv/RWKV-v5-lora# python3 train.py --load_model RWKV-5-World-0.4B-v2-20231113-ctx4096.pth --proj_dir . --data_file output --data_type binidx --vocab_size 65536 --ctx_len 1024 --epoch_steps 500 --epoch_count 3 --epoch_begin 0 --epoch_save 1 --micro_bsz 1 --n_layer 24 --n_embd 2048 --pre_ffn 0 --head_qk 0 --lr_init 1e-3 --lr_final 1e-4 --warmup_steps 0 --beta1 0.9 --beta2 0.99 --adam_eps 1e-8 --accelerator gpu --devices 1 --precision bf16 --strategy deepspeed_stage_2 --grad_cp 1 --lora_load rwkv-0 --lora --lora_r 64 --lora_alpha 128 --lora_dropout 0.01 --lora_parts=att,ffn,time,ln
INFO:pytorch_lightning.utilities.rank_zero:########## work in progress ##########
[2023-12-22 17:37:52,817] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
INFO:pytorch_lightning.utilities.rank_zero:
############################################################################
#
# RWKV-5 BF16 on 1x1 GPU, bsz 1x1x1=1, deepspeed_stage_2 with grad_cp
#
# Data = output (binidx), ProjDir = /root/autodl-tmp
#
# Epoch = 0 to 2 (will continue afterwards), save every 1 epoch
#
# Each "epoch" = 500 steps, 500 samples, 512000 tokens
#
# Model = 24 n_layer, 2048 n_embd, 1024 ctx_len
#
# Adam = lr 0.001 to 0.0001, warmup 0 steps, beta (0.9, 0.99), eps 1e-08
#
# Found torch 2.1.2+cu121, recommend 1.13.1+cu117 or newer
# Found deepspeed 0.12.6, recommend 0.7.0 (faster than newer versions)
# Found pytorch_lightning 2.1.3, recommend 1.9.5
#
############################################################################

INFO:pytorch_lightning.utilities.rank_zero:{'load_model': '../RWKV-5-World-0.4B-v2-20231113-ctx4096.pth', 'wandb': '', 'proj_dir': '/root/autodl-tmp', 'random_seed': -1, 'data_file': 'output', 'data_type': 'binidx', 'vocab_size': 65536, 'ctx_len': 1024, 'epoch_steps': 500, 'epoch_count': 3, 'epoch_begin': 0, 'epoch_save': 1, 'micro_bsz': 1, 'n_layer': 24, 'n_embd': 2048, 'dim_att': 2048, 'dim_ffn': 7168, 'pre_ffn': 0, 'head_qk': 0, 'tiny_att_dim': 0, 'tiny_att_layer': -999, 'lr_init': 0.001, 'lr_final': 0.0001, 'warmup_steps': 0, 'beta1': 0.9, 'beta2': 0.99, 'adam_eps': 1e-08, 'grad_cp': 1, 'dropout': 0, 'weight_decay': 0, 'weight_decay_final': -1, 'my_pile_version': 1, 'my_pile_stage': 0, 'my_pile_shift': -1, 'my_pile_edecay': 0, 'layerwise_lr': 1, 'ds_bucket_mb': 200, 'my_sample_len': 0, 'my_ffn_shift': 1, 'my_att_shift': 1, 'head_size_a': 64, 'head_size_divisor': 8, 'my_pos_emb': 0, 'load_partial': 0, 'magic_prime': 0, 'my_qa_mask': 0, 'my_random_steps': 0, 'my_testing': '', 'my_exit': 99999999, 'my_exit_tokens': 0, 'emb': False, 'lora': True, 'lora_load': 'rwkv-0', 'lora_r': 64, 'lora_alpha': 128.0, 'lora_dropout': 0.01, 'lora_parts': 'att,ffn,time,ln', 'accelerator': 'gpu', 'strategy': 'deepspeed_stage_2', 'devices': 1, 'num_nodes': 1, 'precision': 'bf16', 'accumulate_grad_batches': 1, 'my_timestamp': '2023-12-22-17-37-53', 'enable_checkpointing': False, 'replace_sampler_ddp': False, 'logger': False, 'gradient_clip_val': 1.0, 'num_sanity_val_steps': 0, 'check_val_every_n_epoch': 100000000000000000000, 'log_every_n_steps': 100000000000000000000, 'max_epochs': -1, 'betas': (0.9, 0.99), 'real_bsz': 1, 'run_name': '65536 ctx1024 L24 D2048'}

RWKV_MY_TESTING 
Using /root/.cache/torch_extensions/py310_cu121 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /root/.cache/torch_extensions/py310_cu121/wkv5/build.ninja...
Building extension module wkv5...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/2] /usr/bin/nvcc  -DTORCH_EXTENSION_NAME=wkv5 -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /usr/local/lib/python3.10/dist-packages/torch/include -isystem /usr/local/lib/python3.10/dist-packages/torch/include/torch/csrc/api/include -isystem /usr/local/lib/python3.10/dist-packages/torch/include/TH -isystem /usr/local/lib/python3.10/dist-packages/torch/include/THC -isystem /usr/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_86,code=sm_86 --compiler-options '-fPIC' -res-usage --use_fast_math -O3 -Xptxas -O3 --extra-device-vectorization -D_N_=64 -std=c++17 -c /root/rwkv/RWKV-v5-lora/cuda/wkv5_cuda.cu -o wkv5_cuda.cuda.o 
FAILED: wkv5_cuda.cuda.o 
/usr/bin/nvcc  -DTORCH_EXTENSION_NAME=wkv5 -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /usr/local/lib/python3.10/dist-packages/torch/include -isystem /usr/local/lib/python3.10/dist-packages/torch/include/torch/csrc/api/include -isystem /usr/local/lib/python3.10/dist-packages/torch/include/TH -isystem /usr/local/lib/python3.10/dist-packages/torch/include/THC -isystem /usr/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_86,code=sm_86 --compiler-options '-fPIC' -res-usage --use_fast_math -O3 -Xptxas -O3 --extra-device-vectorization -D_N_=64 -std=c++17 -c /root/rwkv/RWKV-v5-lora/cuda/wkv5_cuda.cu -o wkv5_cuda.cuda.o 
ptxas info    : 1 bytes gmem
ptxas info    : Compiling entry function '_Z15kernel_backwardIN3c108BFloat16EEviiiiPKT_S4_S4_PKfS6_S4_S4_PS2_S7_S7_S7_S7_' for 'sm_86'
ptxas info    : Function properties for _Z15kernel_backwardIN3c108BFloat16EEviiiiPKT_S4_S4_PKfS6_S4_S4_PS2_S7_S7_S7_S7_
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 168 registers, 1536 bytes smem, 464 bytes cmem[0]
ptxas info    : Compiling entry function '_Z14kernel_forwardIN3c108BFloat16EEviiiiPKT_S4_S4_PKfS4_PS2_' for 'sm_86'
ptxas info    : Function properties for _Z14kernel_forwardIN3c108BFloat16EEviiiiPKT_S4_S4_PKfS4_PS2_
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 100 registers, 1024 bytes smem, 416 bytes cmem[0]
/usr/include/c++/11/bits/std_function.h:435:145: error: parameter packs not expanded with ‘...’:
  435 |         function(_Functor&& __f)
      |                                                                                                                                                 ^ 
/usr/include/c++/11/bits/std_function.h:435:145: note:         ‘_ArgTypes’
/usr/include/c++/11/bits/std_function.h:530:146: error: parameter packs not expanded with ‘...’:
  530 |         operator=(_Functor&& __f)
      |                                                                                                                                                  ^ 
/usr/include/c++/11/bits/std_function.h:530:146: note:         ‘_ArgTypes’
ninja: build stopped: subcommand failed.
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/cpp_extension.py", line 2100, in _run_ninja_build
    subprocess.run(
  File "/usr/lib/python3.10/subprocess.py", line 526, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/root/rwkv/RWKV-v5-lora/train.py", line 251, in <module>
    from src.trainer import train_callback, generate_init_weight
  File "/root/rwkv/RWKV-v5-lora/src/trainer.py", line 6, in <module>
    from .model import LORA_CONFIG
  File "/root/rwkv/RWKV-v5-lora/src/model.py", line 53, in <module>
    wkv5_cuda = load(name="wkv5", sources=["cuda/wkv5_op.cpp", f"cuda/wkv5_cuda.cu"],
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/cpp_extension.py", line 1308, in load
    return _jit_compile(
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/cpp_extension.py", line 1710, in _jit_compile
    _write_ninja_file_and_build_library(
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/cpp_extension.py", line 1823, in _write_ninja_file_and_build_library
    _run_ninja_build(
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/cpp_extension.py", line 2116, in _run_ninja_build
    raise RuntimeError(message) from e
RuntimeError: Error building extension 'wkv5'

Fine-tuning memory usage

Hello, how much memory is needed to fine-tune the v5 7B model? Fine-tuning the 1.5B model used 40GB for me; even with 128GB of RAM I cannot fine-tune the 3B model.

Hi, when using --train_type state, this repo saves the full set of parameters; maybe you can fix it.

Like the repo https://github.com/BlinkDL/RWKV-LM/blob/main/RWKV-v5/src/trainer.py,
we can fix our repo https://github.com/JL-er/RWKV-PEFT/blob/main/src/trainer.py
as follows:

import os, math, time, datetime, subprocess
import torch
from torch.utils.data import DataLoader
import pytorch_lightning as pl
from pytorch_lightning.utilities import rank_zero_info, rank_zero_only
from .model import LORA_CONFIG
import re
import numpy as np

def my_save(args, trainer, dd, ff):
    if '14b-run1' in ff:
        fn = ff.split('/')[-1]
        fff = '/dev/shm/' + fn
        torch.save(dd, fff)
        subprocess.Popen(f" aws s3 mv {fff} s3://rwkv-14b-4k/{fn} --quiet", shell=True)
    elif ('world/14b' in ff) or ('world/7b' in ff):
        aa = ff.split('/')[1]
        fn = ff.split('/')[-1]
        fff = f'/dev/shm/{aa}-{fn}'
        torch.save(dd, fff)
        subprocess.Popen(f" aws s3 mv {fff} s3://rwkv-world/{aa}-{fn} --quiet", shell=True)
    else:
        if 'deepspeed_stage_3' in args.strategy:
            trainer.save_checkpoint(ff, weights_only=True)
        else:
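            # Save only the trained state tensors ('time_sta' keys) instead of the full model.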
            if args.train_type == 'state':
                ddd = {}
                for k, v in dd.items():
                    if 'time_sta' in k:
                        ddd[k] = v.clone()
                torch.save(ddd, ff)
            else:
                torch.save(dd, ff)
