hiyouga / llama-factory Goto Github PK

View Code? Open in Web Editor NEW

20.8K 147.0 2.5K 214.15 MB

Unify Efficient Fine-Tuning of 100+ LLMs

License: Apache License 2.0

Python 99.91% Makefile 0.04% Dockerfile 0.06%

fine-tuning language-model llama llm peft transformers rlhf qlora quantization chatglm

llama-factory's Introduction

👋 Join our WeChat.

[ English | 中文 ]

Fine-tuning a large language model can be easy as...

tutorial_en.mp4

Choose your path:

Colab: https://colab.research.google.com/drive/1eRTPn37ltBbYsISy9Aw2NuI2Aq5CQrD9?usp=sharing
Local machine: Please refer to usage

Features
Benchmark
Changelog
Supported Models
Supported Training Approaches
Provided Datasets
Requirement
Getting Started
Projects using LLaMA Factory
License
Citation
Acknowledgement

Features

Various models: LLaMA, LLaVA, Mistral, Mixtral-MoE, Qwen, Yi, Gemma, Baichuan, ChatGLM, Phi, etc.
Integrated methods: (Continuous) pre-training, (multimodal) supervised fine-tuning, reward modeling, PPO, DPO and ORPO.
Scalable resources: 32-bit full-tuning, 16-bit freeze-tuning, 16-bit LoRA and 2/4/8-bit QLoRA via AQLM/AWQ/GPTQ/LLM.int8.
Advanced algorithms: GaLore, BAdam, DoRA, LongLoRA, LLaMA Pro, Mixture-of-Depths, LoRA+, LoftQ and Agent tuning.
Practical tricks: FlashAttention-2, Unsloth, RoPE scaling, NEFTune and rsLoRA.
Experiment monitors: LlamaBoard, TensorBoard, Wandb, MLflow, etc.
Faster inference: OpenAI-style API, Gradio UI and CLI with vLLM worker.

Benchmark

Compared to ChatGLM's P-Tuning, LLaMA Factory's LoRA tuning offers up to 3.7 times faster training speed with a better Rouge score on the advertising text generation task. By leveraging 4-bit quantization technique, LLaMA Factory's QLoRA further improves the efficiency regarding the GPU memory.

Definitions

Training Speed: the number of training samples processed per second during the training. (bs=4, cutoff_len=1024)
Rouge Score: Rouge-2 score on the development set of the advertising text generation task. (bs=4, cutoff_len=1024)
GPU Memory: Peak GPU memory usage in 4-bit quantized training. (bs=1, cutoff_len=1024)
We adopt pre_seq_len=128 for ChatGLM's P-Tuning and lora_rank=32 for LLaMA Factory's LoRA tuning.

Changelog

[24/04/26] We supported fine-tuning the LLaVA-1.5 multimodal LLMs. See examples/lora_single_gpu/sft_mllm.sh for usage.

[24/04/22] We provided a Colab notebook for fine-tuning the Llama-3 model on a free T4 GPU. Two Llama-3-derived models fine-tuned using LLaMA Factory are available at Hugging Face, check Llama3-8B-Chinese-Chat and Llama3-Chinese for details.

[24/04/21] We supported Mixture-of-Depths according to AstraMindAI's implementation. See examples/extras/mod for usage.

[24/04/16] We supported BAdam. See examples/extras/badam for usage.

[24/04/16] We supported unsloth's long-sequence training (Llama-2-7B-56k within 24GB). It achieves 117% speed and 50% memory compared with FlashAttention-2, more benchmarks can be found in this page.

Full Changelog

[24/03/31] We supported ORPO. See examples/lora_single_gpu for usage.

[24/03/21] Our paper "LlamaFactory: Unified Efficient Fine-Tuning of 100+ Language Models" is available at arXiv!

[24/03/20] We supported FSDP+QLoRA that fine-tunes a 70B model on 2x24GB GPUs. See examples/extras/fsdp_qlora for usage.

[24/03/13] We supported LoRA+. See examples/extras/loraplus for usage.

[24/03/07] We supported gradient low-rank projection (GaLore) algorithm. See examples/extras/galore for usage.

[24/03/07] We integrated vLLM for faster and concurrent inference. Try --infer_backend vllm to enjoy 270% inference speed. (LoRA is not yet supported, merge it first.)

[24/02/28] We supported weight-decomposed LoRA (DoRA). Try --use_dora to activate DoRA training.

[24/02/15] We supported block expansion proposed by LLaMA Pro. See examples/extras/llama_pro for usage.

[24/02/05] Qwen1.5 (Qwen2 beta version) series models are supported in LLaMA-Factory. Check this blog post for details.

[24/01/18] We supported agent tuning for most models, equipping model with tool using abilities by fine-tuning with --dataset glaive_toolcall.

[23/12/23] We supported unsloth's implementation to boost LoRA tuning for the LLaMA, Mistral and Yi models. Try --use_unsloth argument to activate unsloth patch. It achieves 170% speed in our benchmark, check this page for details.

[23/12/12] We supported fine-tuning the latest MoE model Mixtral 8x7B in our framework. See hardware requirement here.

[23/12/01] We supported downloading pre-trained models and datasets from the ModelScope Hub for Chinese mainland users. See this tutorial for usage.

[23/10/21] We supported NEFTune trick for fine-tuning. Try --neftune_noise_alpha argument to activate NEFTune, e.g., --neftune_noise_alpha 5.

[23/09/27] We supported $S^2$-Attn proposed by LongLoRA for the LLaMA models. Try --shift_attn argument to enable shift short attention.

[23/09/23] We integrated MMLU, C-Eval and CMMLU benchmarks in this repo. See this example to evaluate your models.

[23/09/10] We supported FlashAttention-2. Try --flash_attn fa2 argument to enable FlashAttention-2 if you are using RTX4090, A100 or H100 GPUs.

[23/08/12] We supported RoPE scaling to extend the context length of the LLaMA models. Try --rope_scaling linear argument in training and --rope_scaling dynamic argument at inference to extrapolate the position embeddings.

[23/08/11] We supported DPO training for instruction-tuned models. See this example to train your models.

[23/07/31] We supported dataset streaming. Try --streaming and --max_steps 10000 arguments to load your dataset in streaming mode.

[23/07/29] We released two instruction-tuned 13B models at Hugging Face. See these Hugging Face Repos (LLaMA-2 / Baichuan) for details.

[23/07/18] We developed an all-in-one Web UI for training, evaluation and inference. Try train_web.py to fine-tune models in your Web browser. Thank @KanadeSiina and @codemayq for their efforts in the development.

[23/07/09] We released FastEdit ⚡🩹, an easy-to-use package for editing the factual knowledge of large language models efficiently. Please follow FastEdit if you are interested.

[23/06/29] We provided a reproducible example of training a chat model using instruction-following datasets, see Baichuan-7B-sft for details.

[23/06/22] We aligned the demo API with the OpenAI's format where you can insert the fine-tuned model in arbitrary ChatGPT-based applications.

[23/06/03] We supported quantized training and inference (aka QLoRA). Try --quantization_bit 4/8 argument to work with quantized models.

Supported Models

Model	Model size	Default module	Template
Baichuan2	7B/13B	W_pack	baichuan2
BLOOM	560M/1.1B/1.7B/3B/7.1B/176B	query_key_value	-
BLOOMZ	560M/1.1B/1.7B/3B/7.1B/176B	query_key_value	-
ChatGLM3	6B	query_key_value	chatglm3
Command-R	35B/104B	q_proj,v_proj	cohere
DeepSeek (MoE)	7B/16B/67B	q_proj,v_proj	deepseek
Falcon	7B/40B/180B	query_key_value	falcon
Gemma/CodeGemma	2B/7B	q_proj,v_proj	gemma
InternLM2	7B/20B	wqkv	intern2
LLaMA	7B/13B/33B/65B	q_proj,v_proj	-
LLaMA-2	7B/13B/70B	q_proj,v_proj	llama2
LLaMA-3	8B/70B	q_proj,v_proj	llama3
LLaVA-1.5	7B/13B	q_proj,v_proj	vicuna
Mistral/Mixtral	7B/8x7B/8x22B	q_proj,v_proj	mistral
OLMo	1B/7B	q_proj,v_proj	-
Phi-1.5/2	1.3B/2.7B	q_proj,v_proj	-
Phi-3	3.8B	qkv_proj	phi
Qwen	1.8B/7B/14B/72B	c_attn	qwen
Qwen1.5 (Code/MoE)	0.5B/1.8B/4B/7B/14B/32B/72B/110B	q_proj,v_proj	qwen
StarCoder2	3B/7B/15B	q_proj,v_proj	-
XVERSE	7B/13B/65B	q_proj,v_proj	xverse
Yi	6B/9B/34B	q_proj,v_proj	yi
Yuan	2B/51B/102B	q_proj,v_proj	yuan

Note

Default module is used for the --lora_target argument, you can use --lora_target all to specify all the available modules for better convergence.

For the "base" models, the --template argument can be chosen from default, alpaca, vicuna etc. But make sure to use the corresponding template for the "instruct/chat" models.

Remember to use the SAME template in training and inference.

Please refer to constants.py for a full list of models we supported.

You also can add a custom chat template to template.py.

Supported Training Approaches

Approach	Full-tuning	Freeze-tuning	LoRA	QLoRA
Pre-Training	✅	✅	✅	✅
Supervised Fine-Tuning	✅	✅	✅	✅
Reward Modeling	✅	✅	✅	✅
PPO Training	✅	✅	✅	✅
DPO Training	✅	✅	✅	✅
ORPO Training	✅	✅	✅	✅

Provided Datasets

Pre-training datasets

Supervised fine-tuning datasets

Preference datasets

Some datasets require confirmation before using them, so we recommend logging in with your Hugging Face account using these commands.

pip install --upgrade huggingface_hub
huggingface-cli login

Requirement

Mandatory	Minimum	Recommend
python	3.8	3.10
torch	1.13.1	2.2.0
transformers	4.37.2	4.39.3
datasets	2.14.3	2.18.0
accelerate	0.27.2	0.28.0
peft	0.9.0	0.10.0
trl	0.8.1	0.8.1

Optional	Minimum	Recommend
CUDA	11.6	12.2
deepspeed	0.10.0	0.14.0
bitsandbytes	0.39.0	0.43.0
flash-attn	2.3.0	2.5.6

Hardware Requirement

* estimated

Method	Bits	7B	13B	30B	70B	110B	8x7B	8x22B
Full	AMP	120GB	240GB	600GB	1200GB	2000GB	900GB	2400GB
Full	16	60GB	120GB	300GB	600GB	900GB	400GB	1200GB
Freeze	16	20GB	40GB	80GB	200GB	360GB	160GB	400GB
LoRA/GaLore/BAdam	16	16GB	32GB	64GB	160GB	240GB	120GB	320GB
QLoRA	8	10GB	20GB	40GB	80GB	140GB	60GB	160GB
QLoRA	4	6GB	12GB	24GB	48GB	72GB	30GB	96GB
QLoRA	2	4GB	8GB	16GB	24GB	48GB	18GB	48GB

Getting Started

Data Preparation

Please refer to data/README.md for checking the details about the format of dataset files. You can either use datasets on HuggingFace / ModelScope hub or load the dataset in local disk.

Note

Please update data/dataset_info.json to use your custom dataset.

Dependence Installation

git clone https://github.com/hiyouga/LLaMA-Factory.git
conda create -n llama_factory python=3.10
conda activate llama_factory
cd LLaMA-Factory
pip install -e .[metrics]

Extra dependencies available: deepspeed, metrics, galore, badam, vllm, bitsandbytes, gptq, awq, aqlm, qwen, modelscope, quality

For Windows users

If you want to enable the quantized LoRA (QLoRA) on the Windows platform, you will be required to install a pre-built version of bitsandbytes library, which supports CUDA 11.1 to 12.2, please select the appropriate release version based on your CUDA version.

pip install https://github.com/jllllll/bitsandbytes-windows-webui/releases/download/wheels/bitsandbytes-0.41.2.post2-py3-none-win_amd64.whl

To enable FlashAttention-2 on the Windows platform, you need to install the precompiled flash-attn library, which supports CUDA 12.1 to 12.2. Please download the corresponding version from flash-attention based on your requirements.

Train with LLaMA Board GUI (powered by Gradio)

Important

LLaMA Board GUI only supports training on a single GPU, please use CLI for distributed training.

Use local environment

export CUDA_VISIBLE_DEVICES=0 # `set CUDA_VISIBLE_DEVICES=0` for Windows
export GRADIO_SERVER_PORT=7860 # `set GRADIO_SERVER_PORT=7860` for Windows
python src/train_web.py # or python -m llmtuner.webui.interface

For Alibaba Cloud users

If you encountered display problems in LLaMA Board on Alibaba Cloud, try using the following command to set environment variables before starting LLaMA Board:

export GRADIO_ROOT_PATH=/${JUPYTER_NAME}/proxy/7860/

Use Docker

docker build -f ./Dockerfile -t llama-factory:latest .
docker run --gpus=all \
    -v ./hf_cache:/root/.cache/huggingface/ \
    -v ./data:/app/data \
    -v ./output:/app/output \
    -e CUDA_VISIBLE_DEVICES=0 \
    -p 7860:7860 \
    --shm-size 16G \
    --name llama_factory \
    -d llama-factory:latest

Use Docker Compose

docker compose -f ./docker-compose.yml up -d

Details about volume

hf_cache: Utilize Hugging Face cache on the host machine. Reassignable if a cache already exists in a different directory.
data: Place datasets on this dir of the host machine so that they can be selected on LLaMA Board GUI.
output: Set export dir to this location so that the merged result can be accessed directly on the host machine.

Train with Command Line Interface

See examples/README.md for usage.

Use python src/train_bash.py -h to display arguments description.

Deploy with OpenAI-style API and vLLM

CUDA_VISIBLE_DEVICES=0,1 API_PORT=8000 python src/api_demo.py \
    --model_name_or_path meta-llama/Meta-Llama-3-8B-Instruct \
    --template llama3 \
    --infer_backend vllm \
    --vllm_enforce_eager

Download from ModelScope Hub

If you have trouble with downloading models and datasets from Hugging Face, you can use ModelScope.

export USE_MODELSCOPE_HUB=1 # `set USE_MODELSCOPE_HUB=1` for Windows

Train the model by specifying a model ID of the ModelScope Hub as the --model_name_or_path. You can find a full list of model IDs at ModelScope Hub, e.g., LLM-Research/Meta-Llama-3-8B-Instruct.

Projects using LLaMA Factory

If you have a project that should be incorporated, please contact via email or create a pull request.

Click to show

Wang et al. ESRL: Efficient Sampling-based Reinforcement Learning for Sequence Generation. 2023. [arxiv]
Yu et al. Open, Closed, or Small Language Models for Text Classification? 2023. [arxiv]
Wang et al. UbiPhysio: Support Daily Functioning, Fitness, and Rehabilitation with Action Understanding and Feedback in Natural Language. 2023. [arxiv]
Luceri et al. Leveraging Large Language Models to Detect Influence Campaigns in Social Media. 2023. [arxiv]
Zhang et al. Alleviating Hallucinations of Large Language Models through Induced Hallucinations. 2023. [arxiv]
Wang et al. Know Your Needs Better: Towards Structured Understanding of Marketer Demands with Analogical Reasoning Augmented LLMs. 2024. [arxiv]
Wang et al. CANDLE: Iterative Conceptualization and Instantiation Distillation from Large Language Models for Commonsense Reasoning. 2024. [arxiv]
Choi et al. FACT-GPT: Fact-Checking Augmentation via Claim Matching with LLMs. 2024. [arxiv]
Zhang et al. AutoMathText: Autonomous Data Selection with Language Models for Mathematical Texts. 2024. [arxiv]
Lyu et al. KnowTuning: Knowledge-aware Fine-tuning for Large Language Models. 2024. [arxiv]
Yang et al. LaCo: Large Language Model Pruning via Layer Collaps. 2024. [arxiv]
Bhardwaj et al. Language Models are Homer Simpson! Safety Re-Alignment of Fine-tuned Language Models through Task Arithmetic. 2024. [arxiv]
Yang et al. Enhancing Empathetic Response Generation by Augmenting LLMs with Small-scale Empathetic Models. 2024. [arxiv]
Yi et al. Generation Meets Verification: Accelerating Large Language Model Inference with Smart Parallel Auto-Correct Decoding. 2024. [arxiv]
Cao et al. Head-wise Shareable Attention for Large Language Models. 2024. [arxiv]
Zhang et al. Enhancing Multilingual Capabilities of Large Language Models through Self-Distillation from Resource-Rich Languages. 2024. [arxiv]
Kim et al. Efficient and Effective Vocabulary Expansion Towards Multilingual Large Language Models. 2024. [arxiv]
Yu et al. KIEval: A Knowledge-grounded Interactive Evaluation Framework for Large Language Models. 2024. [arxiv]
Huang et al. Key-Point-Driven Data Synthesis with its Enhancement on Mathematical Reasoning. 2024. [arxiv]
Duan et al. Negating Negatives: Alignment without Human Positive Samples via Distributional Dispreference Optimization. 2024. [arxiv]
Xie and Schwertfeger. Empowering Robotics with Large Language Models: osmAG Map Comprehension with LLMs. 2024. [arxiv]
Zhang et al. EDT: Improving Large Language Models' Generation by Entropy-based Dynamic Temperature Sampling. 2024. [arxiv]
Weller et al. FollowIR: Evaluating and Teaching Information Retrieval Models to Follow Instructions. 2024. [arxiv]
Hongbin Na. CBT-LLM: A Chinese Large Language Model for Cognitive Behavioral Therapy-based Mental Health Question Answering. 2024. [arxiv]
Zan et al. CodeS: Natural Language to Code Repository via Multi-Layer Sketch. 2024. [arxiv]
Liu et al. Extensive Self-Contrast Enables Feedback-Free Language Model Alignment. 2024. [arxiv]
Luo et al. BAdam: A Memory Efficient Full Parameter Training Method for Large Language Models. 2024. [arxiv]
Du et al. Chinese Tiny LLM: Pretraining a Chinese-Centric Large Language Model. 2024. [arxiv]
Liu et al. Dynamic Generation of Personalities with Large Language Models. 2024. [arxiv]
StarWhisper: A large language model for Astronomy, based on ChatGLM2-6B and Qwen-14B.
DISC-LawLLM: A large language model specialized in Chinese legal domain, based on Baichuan-13B, is capable of retrieving and reasoning on legal knowledge.
Sunsimiao: A large language model specialized in Chinese medical domain, based on Baichuan-7B and ChatGLM-6B.
CareGPT: A series of large language models for Chinese medical domain, based on LLaMA2-7B and Baichuan-13B.
MachineMindset: A series of MBTI Personality large language models, capable of giving any LLM 16 different personality types based on different datasets and training methods.

License

This repository is licensed under the Apache-2.0 License.

Please follow the model licenses to use the corresponding model weights: Baichuan2 / BLOOM / ChatGLM3 / Command-R / DeepSeek / Falcon / Gemma / InternLM2 / LLaMA / LLaMA-2/LLaVA-1.5 / LLaMA-3 / Mistral / OLMo / Phi-1.5/2 / Phi-3 / Qwen / StarCoder2 / XVERSE / Yi / Yuan

Citation

If this work is helpful, please kindly cite as:

@article{zheng2024llamafactory,
  title={LlamaFactory: Unified Efficient Fine-Tuning of 100+ Language Models},
  author={Yaowei Zheng and Richong Zhang and Junhao Zhang and Yanhan Ye and Zheyan Luo and Yongqiang Ma},
  journal={arXiv preprint arXiv:2403.13372},
  year={2024},
  url={http://arxiv.org/abs/2403.13372}
}

Acknowledgement

This repo benefits from PEFT, TRL, QLoRA and FastChat. Thanks for their wonderful works.

Star History

llama-factory's People

Contributors

Stargazers

Watchers

Forkers

rogerus mmrbun ticoag zhanglv0209 sciengu tamanna18 chenzongchao ppnorain kenyony zhangnn520 vjimrunning tim-taoxq tianbuwei yuanhuachao shaonianyr fartypants techthiyanes c00renut neumyor neverstoplearn yuanmengwei goggeryang lee9871 jadeluo pfxjacky smj0 jarvisrie tjannati630 ah74ba antonioabeltyled525 danxing0947 lyfysuriq dst1213 ikem264 leisangcs reyhyman516 pecolazegup daddyyyyyyy eltociear chenyicyuan huipen41 lucasjahlbach kngujola davidzhuwei lisadavidsoc hinennei charmingggg hirokmartin054 bhuytgvcftyvcfhcdf redlegenddev pterameta goswamig maocaixia nighteen1999 zhenkaivip jianantian paixai kingmeng-stack shamy1997 liaoyinan 464657226z mr-nineteen 464657226x soon14 qingkongzhiqian weinyn kendallsteele18 hhy5277 merasaamerasaa kharunshofura shenyuanchiyua1 wenmin-wu rayjue reynaly86494160 zhaobinnf ernestpalyc jolz76 stevendab fengyunzaidushi haleighsloan beboyossipp draformhidis prenkabazabz avzalpakalm ronaldamin7 lujitutuahp yalisgarnevo vankirkcrq80 shenyong123 9cat libeineu nanqiai livingthings jingsong-yan onehumanaha play2boy wu-yy apollohuang1 glinxi chfenglv

llama-factory's Issues

训练之后存在灾难性遗忘问题

大佬好，模型微调之后，之前的功能全部丢失了，只会做微调训练数据中的任务，这个怎么解决。采用的是默认配置，模型是ziya-llama, 1W条数据，3 epoch

关于数据集

想加载自己的数据集，是需要写在dataset_info.json吗

微调百川模型出错：Target modules ['q_proj', 'v_proj'] not found in the base model

是不是还不支持baichuan-7B基座模型，还是说我PEFT的版本有问题？

Traceback (most recent call last):
File "/home/jerome/github/LLaMA-Efficient-Tuning/src/train_sft.py", line 98, in
main()
File "/home/jerome/github/LLaMA-Efficient-Tuning/src/train_sft.py", line 26, in main
model, tokenizer = load_pretrained(model_args, finetuning_args, training_args.do_train, stage="sft")
File "/home/jerome/github/LLaMA-Efficient-Tuning/src/utils/common.py", line 216, in load_pretrained
model = _init_adapter(model, model_args, finetuning_args, is_trainable, is_mergeable)
File "/home/jerome/github/LLaMA-Efficient-Tuning/src/utils/common.py", line 133, in _init_adapter
model = get_peft_model(model, lora_config)
File "/home/jerome/anaconda3/envs/left/lib/python3.10/site-packages/peft/mapping.py", line 120, in get_peft_model
return MODEL_TYPE_TO_PEFT_MODEL_MAPPING[peft_config.task_type](model, peft_config)
File "/home/jerome/anaconda3/envs/left/lib/python3.10/site-packages/peft/peft_model.py", line 662, in init
super().init(model, peft_config, adapter_name)
File "/home/jerome/anaconda3/envs/left/lib/python3.10/site-packages/peft/peft_model.py", line 99, in init
self.base_model = PEFT_TYPE_TO_MODEL_MAPPING[peft_config.peft_type](
File "/home/jerome/anaconda3/envs/left/lib/python3.10/site-packages/peft/tuners/lora.py", line 154, in init
self.add_adapter(adapter_name, self.peft_config[adapter_name])
File "/home/jerome/anaconda3/envs/left/lib/python3.10/site-packages/peft/tuners/lora.py", line 161, in add_adapter
self._find_and_replace(adapter_name)
File "/home/jerome/anaconda3/envs/left/lib/python3.10/site-packages/peft/tuners/lora.py", line 254, in _find_and_replace
raise ValueError(
ValueError: Target modules ['q_proj', 'v_proj'] not found in the base model. Please check the target modules and try again.

可以提供一个可以参考的的accelerate config_file么...accelerate一直启动不起来

command_file: null
commands: null
compute_environment: LOCAL_MACHINE
deepspeed_config:
gradient_accumulation_steps: 1
gradient_clipping: 1.0
offload_optimizer_device: none
offload_param_device: none
zero3_init_flag: true
zero3_save_16bit_model: true
zero_stage: 3
distributed_type: DEEPSPEED
downcast_bf16: 'no'
dynamo_backend: 'NO'
fsdp_config: {}
gpu_ids: null
machine_rank: 0
main_process_ip: null
main_process_port: null
main_training_function: main
megatron_lm_config: {}
mixed_precision: fp16
num_machines: 1
num_processes: 8
rdzv_backend: static
same_network: true
tpu_name: null
tpu_zone: null
use_cpu: false

src/train_sft.py
--model_name_or_path /models/Ziya-LLaMA-13B-Pretrain-v1/
--do_train
--dataset alpaca_gpt4_zh
--finetuning_type lora
--output_dir sft_save_model_checkpoint_V2
--overwrite_cache
--per_device_train_batch_size 2
--gradient_accumulation_steps 16
--lr_scheduler_type cosine
--logging_steps 10
--save_steps 1000
--learning_rate 5e-5
--num_train_epochs 1.0
--resume_lora_training False
--plot_loss
--max_source_length 1200
--max_target_length 768
--fp16

ValueError: Target module BloomMLP is not supported. Currently, only `torch.nn.Linear` and `Conv1D` are supported.

执行的命令：

#!/bin/bash

CUDA_VISIBLE_DEVICES=0 python ../src/train_rm.py \
    --model_name_or_path golaxy/gogpt-560m \
    --do_train \
    --dataset_dir ../data \
    --dataset comparison_gpt4_en,comparison_gpt4_zh,hh_rlhf_en \
    --finetuning_type lora \
    --lora_target query_key_value,dense,mlp \
    --output_dir ./results/rm \
    --per_device_train_batch_size 4 \
    --gradient_accumulation_steps 4 \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --save_steps 1000 \
    --learning_rate 1e-5 \
    --num_train_epochs 3.0 \
    --plot_loss \
    --fp16 \
    --seed 123456

报错：

╭───────────────────────────── Traceback (most recent call last) ──────────────────────────────╮
│ /home/user/Desktop/pythonProject/LLaMA-Efficient-Tuning/examples/../src/train_rm.py:74 in    │
│ <module>                                                                                     │
│                                                                                              │
│   71                                                                                         │
│   72                                                                                         │
│   73 if __name__ == "__main__":                                                              │
│ ❱ 74 │   main()                                                                              │
│   75                                                                                         │
│                                                                                              │
│ /home/user/Desktop/pythonProject/LLaMA-Efficient-Tuning/examples/../src/train_rm.py:24 in    │
│ main                                                                                         │
│                                                                                              │
│   21 │   # Prepare pretrained model and dataset                                              │
│   22 │   model_args, data_args, training_args, finetuning_args = prepare_args(stage="rm")    │
│   23 │   dataset = prepare_data(model_args, data_args)                                       │
│ ❱ 24 │   model, tokenizer = load_pretrained(model_args, finetuning_args, training_args.do_tr │
│   25 │   dataset = preprocess_data(dataset, tokenizer, data_args, training_args, stage="rm") │
│   26 │   data_collator = PairwiseDataCollatorWithPadding(tokenizer, model.pretrained_model)  │
│   27                                                                                         │
│                                                                                              │
│ /home/user/Desktop/pythonProject/LLaMA-Efficient-Tuning/src/utils/common.py:186 in           │
│ load_pretrained                                                                              │
│                                                                                              │
│   183 │   │   **config_kwargs                                                                │
│   184 │   )                                                                                  │
│   185 │   model = prepare_model_for_training(model) if is_trainable else model               │
│ ❱ 186 │   model = init_adapter(model, model_args, finetuning_args, is_trainable)             │
│   187 │                                                                                      │
│   188 │   if not is_trainable:                                                               │
│   189 │   │   model.requires_grad_(False) # fix all model params                             │
│                                                                                              │
│ /home/user/Desktop/pythonProject/LLaMA-Efficient-Tuning/src/utils/common.py:121 in           │
│ init_adapter                                                                                 │
│                                                                                              │
│   118 │   │   │   │   lora_dropout=finetuning_args.lora_dropout,                             │
│   119 │   │   │   │   target_modules=finetuning_args.lora_target                             │
│   120 │   │   │   )                                                                          │
│ ❱ 121 │   │   │   model = get_peft_model(model, lora_config)                                 │
│   122 │                                                                                      │
│   123 │   if model_args.checkpoint_dir is not None:                                          │
│   124 │   │   logger.info("Loaded fine-tuned model from checkpoint(s): {}".format(",".join(m │
│                                                                                              │
│ /home/user/anaconda3/lib/python3.9/site-packages/peft/mapping.py:120 in get_peft_model       │
│                                                                                              │
│   117 │   │   return PeftModel(model, peft_config)                                           │
│   118 │   if isinstance(peft_config, PromptLearningConfig):                                  │
│   119 │   │   peft_config = _prepare_prompt_learning_config(peft_config, model_config)       │
│ ❱ 120 │   return MODEL_TYPE_TO_PEFT_MODEL_MAPPING[peft_config.task_type](model, peft_config) │
│   121                                                                                        │
│                                                                                              │
│ /home/user/anaconda3/lib/python3.9/site-packages/peft/peft_model.py:662 in __init__          │
│                                                                                              │
│    659 │   """                                                                               │
│    660 │                                                                                     │
│    661 │   def __init__(self, model, peft_config: PeftConfig, adapter_name="default"):       │
│ ❱  662 │   │   super().__init__(model, peft_config, adapter_name)                            │
│    663 │   │   self.base_model_prepare_inputs_for_generation = self.base_model.prepare_input │
│    664 │                                                                                     │
│    665 │   def forward(                                                                      │
│                                                                                              │
│ /home/user/anaconda3/lib/python3.9/site-packages/peft/peft_model.py:99 in __init__           │
│                                                                                              │
│     96 │   │   self.base_model_torch_dtype = getattr(model, "dtype", None)                   │
│     97 │   │   if not isinstance(peft_config, PromptLearningConfig):                         │
│     98 │   │   │   self.peft_config[adapter_name] = peft_config                              │
│ ❱   99 │   │   │   self.base_model = PEFT_TYPE_TO_MODEL_MAPPING[peft_config.peft_type](      │
│    100 │   │   │   │   self.base_model, self.peft_config, adapter_name                       │
│    101 │   │   │   )                                                                         │
│    102 │   │   │   self.set_additional_trainable_modules(peft_config, adapter_name)          │
│                                                                                              │
│ /home/user/anaconda3/lib/python3.9/site-packages/peft/tuners/lora.py:154 in __init__         │
│                                                                                              │
│   151 │   │   self.model = model                                                             │
│   152 │   │   self.forward = self.model.forward                                              │
│   153 │   │   self.peft_config = config                                                      │
│ ❱ 154 │   │   self.add_adapter(adapter_name, self.peft_config[adapter_name])                 │
│   155 │                                                                                      │
│   156 │   def add_adapter(self, adapter_name, config=None):                                  │
│   157 │   │   if config is not None:                                                         │
│                                                                                              │
│ /home/user/anaconda3/lib/python3.9/site-packages/peft/tuners/lora.py:161 in add_adapter      │
│                                                                                              │
│   158 │   │   │   model_config = self.model.config.to_dict() if hasattr(self.model.config, " │
│   159 │   │   │   config = self._prepare_lora_config(config, model_config)                   │
│   160 │   │   │   self.peft_config[adapter_name] = config                                    │
│ ❱ 161 │   │   self._find_and_replace(adapter_name)                                           │
│   162 │   │   if len(self.peft_config) > 1 and self.peft_config[adapter_name].bias != "none" │
│   163 │   │   │   raise ValueError(                                                          │
│   164 │   │   │   │   "LoraModel supports only 1 adapter with bias. When using multiple adap │
│                                                                                              │
│ /home/user/anaconda3/lib/python3.9/site-packages/peft/tuners/lora.py:246 in                  │
│ _find_and_replace                                                                            │
│                                                                                              │
│   243 │   │   │   │   │   │   │   │   )                                                      │
│   244 │   │   │   │   │   │   │   │   kwargs["fan_in_fan_out"] = lora_config.fan_in_fan_out  │
│   245 │   │   │   │   │   │   else:                                                          │
│ ❱ 246 │   │   │   │   │   │   │   raise ValueError(                                          │
│   247 │   │   │   │   │   │   │   │   f"Target module {target} is not supported. "           │
│   248 │   │   │   │   │   │   │   │   f"Currently, only `torch.nn.Linear` and `Conv1D` are s │
│   249 │   │   │   │   │   │   │   )                                                          │
╰──────────────────────────────────────────────────────────────────────────────────────────────╯
ValueError: Target module BloomMLP(
  (dense_h_to_4h): Linear(in_features=1024, out_features=4096, bias=True)
  (gelu_impl): BloomGelu()
  (dense_4h_to_h): Linear(in_features=4096, out_features=1024, bias=True)

使用的模型是在bigscience/bloomz-560m上微调后的，--lora_taget参考的config.py填的，但我看Bloom的模型结构：

(23): BloomBlock(
  (input_layernorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
  (self_attention): BloomAttention(
    (query_key_value): Linear(in_features=1024, out_features=3072, bias=True)
    (dense): Linear(in_features=1024, out_features=1024, bias=True)
    (attention_dropout): Dropout(p=0.0, inplace=False)
  )
  (post_attention_layernorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
  (mlp): BloomMLP(
    (dense_h_to_4h): Linear(in_features=1024, out_features=4096, bias=True)
    (gelu_impl): BloomGelu()
    (dense_4h_to_h): Linear(in_features=4096, out_features=1024, bias=True)
  )

似乎不需要mlp参数，应该是dense_h_to_4h和dense_4h_to_h？

torch.distributed.elastic.multiprocessing.errors.ChildFailedError

训练指令：

accelerate launch src/train_sft.py \
    --model_name_or_path llama-hf/llama-13b-hf \
    --do_train \
    --dataset ChangChunTeng \
    --finetuning_type lora \
    --output_dir CCT/sft \
    --overwrite_cache \
    --per_device_train_batch_size 8 \
    --gradient_accumulation_steps 4 \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --save_steps 1000 \
    --learning_rate 5e-5 \
    --num_train_epochs 3.0 \
    --resume_lora_training False \
    --plot_loss \
    --fp16

train_sft.py中加载tokenizer耗时太长，请问是正常的吗？

06/02/2023 06:16:54 - INFO - utils.common - Loaded tokenizer in 569.9713776111603 seconds. 接近10分钟了。

Dependency version in requirements mismatch

protobuf should <=3.20.1
(or produce TypeError: Descriptors cannot not be created directly.)
dataset should >=2.12.0
( or produce ImportError: datasets>=2.12.0 is required for a normal functioning of this module, but found datasets==2.11.0.)

if qlora:

accelerate>=0.20.0.dev0 : pip install git+https://github.com/huggingface/accelerate.git
bitsandbytes>=0.39.0
transformers should >=4.30.0.dev0 : pip install git+https://github.com/huggingface/transformers.git
(or produce ImportError: transformers>=4.30.0.dev0 is required for a normal functioning of this module, but found transformers==4.29.2.)

AssertionError: The given checkpoint is not a LoRA checkpoint, please specify `--finetuning_type full/freeze` instead.

训练参数：
CUDA_VISIBLE_DEVICES=0 python src/train_sft.py --model_name_or_path ./Bloom/ --do_train --dataset alpaca_gpt4_en --finetuning_type lora --checkpoint_dir path_to_pt_checkpoint --output_dir path_to_sft_checkpoint --overwrite_cache --per_device_train_batch_size 4 --gradient_accumulation_steps 4 --lr_scheduler_type cosine --logging_steps 10 --save_steps 1000 --learning_rate 5e-5 --num_train_epochs 3.0 --resume_lora_training False --lora_target query_key_value --plot_loss --fp16

Bloom不支持lora吗？谢谢。

训练报错

ValueError: You can't train a model that has been loaded in 8-bit precision on a different device than the one you're training on.

PPO阶段与RM阶段使用accelerate训练产生同样错误

以下是PPO阶段的错误log，RM产生这个错误可以通过不使用accelerate多卡训练解决：

"transformers_version": "4.29.2"
训练报错：

[INFO|modeling_utils.py:2513] 2023-06-08 10:26:34,951 >> loading weights file llama-hf/33b-hf/llama-33b-hf/pytorch_model.bin.index.json
[INFO|modeling_utils.py:1154] 2023-06-08 10:26:34,952 >> Instantiating LlamaForCausalLM model under default dtype torch.float16.
[INFO|configuration_utils.py:577] 2023-06-08 10:26:34,953 >> Generate config GenerationConfig {
  "_from_model_config": true,
  "bos_token_id": 1,
  "eos_token_id": 2,
  "pad_token_id": 0,
  "transformers_version": "4.29.2"
}

Loading checkpoint shards:  71%|█████████████████████████████████████████▍                | 5/7 [01:46<00:43, 21.78s/it]WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 517201 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 517202 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 517204 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) local_rank: 2 (pid: 517203) of binary: /root/miniconda3/envs/xray/bin/python
Traceback (most recent call last):
  File "/root/miniconda3/envs/xray/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/root/miniconda3/envs/xray/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 45, in main
    args.func(args)
  File "/root/miniconda3/envs/xray/lib/python3.10/site-packages/accelerate/commands/launch.py", line 909, in launch_command
    multi_gpu_launcher(args)
  File "/root/miniconda3/envs/xray/lib/python3.10/site-packages/accelerate/commands/launch.py", line 604, in multi_gpu_launcher
    distrib_run.run(args)
  File "/root/miniconda3/envs/xray/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/root/miniconda3/envs/xray/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/root/miniconda3/envs/xray/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
src/train_ppo.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-06-08_10:28:37
  host      : mpudgx202302-DGX-Station-A100-920-23487-2531-000
  rank      : 2 (local_rank: 2)
  exitcode  : -9 (pid: 517203)
  error_file: <N/A>
  traceback : Signal 9 (SIGKILL) received by PID 517203
============================================================

我的内存

              total        used        free      shared  buff/cache   available
Mem:          503Gi       301Gi       198Gi        32Mi       3.2Gi       199Gi
Swap:            0B          0B          0B

训练命令：

accelerate launch src/train_ppo.py \
    --model_name_or_path llama-hf/33b-hf/llama-33b-hf \
    --do_train \
    --dataset CCT \
    --finetuning_type lora \
    --checkpoint_dir sft/ \
    --reward_model rm/ \
    --output_dir ppo \
    --per_device_train_batch_size 4 \
    --gradient_accumulation_steps 4 \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --save_steps 1000 \
    --learning_rate 1e-5 \
    --num_train_epochs 2.0 \
    --resume_lora_training False \
    --plot_loss

how to fine-tune bloom-3b model?

train.sh

CUDA_VISIBLE_DEVICES=0 python src/train_pt.py --model_name_or_path bloom-3b/ --do_train --dataset wiki_demo --finetuning_type lora --output_dir weights/ --overwrite_cache --per_device_train_batch_size 4 --gradient_accumulation_steps 4 --lr_scheduler_type cosine --logging_steps 10 --save_steps 100 --learning_rate 5e-5 --num_train_epochs 3.0 --plot_loss

error:
[INFO|modeling_utils.py:3303] 2023-06-13 08:51:34,327 >> All the weights of BloomForCausalLM were initialized from the model checkpoint at bloom-3b/.
If your task is similar to the task the model of the checkpoint was trained on, you can already use BloomForCausalLM for predictions without further training.
[INFO|modeling_utils.py:2927] 2023-06-13 08:51:34,328 >> Generation config file not found, using a generation config created from the model config.
06/13/2023 08:51:34 - INFO - utils.common - Fine-tuning method: LoRA
Traceback (most recent call last):
File "/home/server/Tutorial/LLaMA-Efficient-Tuning-main/src/train_pt.py", line 81, in
main()
File "/home/server/Tutorial/LLaMA-Efficient-Tuning-main/src/train_pt.py", line 26, in main
model, tokenizer = load_pretrained(model_args, finetuning_args, training_args.do_train, stage="pt")
File "/home/server/Tutorial/LLaMA-Efficient-Tuning-main/src/utils/common.py", line 214, in load_pretrained
model = _init_adapter(model, model_args, finetuning_args, is_trainable, is_mergeable)
File "/home/server/Tutorial/LLaMA-Efficient-Tuning-main/src/utils/common.py", line 133, in _init_adapter
model = get_peft_model(model, lora_config)
File "/home/server/anaconda3/envs/pytorch/lib/python3.10/site-packages/peft/mapping.py", line 120, in get_peft_model
return MODEL_TYPE_TO_PEFT_MODEL_MAPPING[peft_config.task_type](model, peft_config)
File "/home/server/anaconda3/envs/pytorch/lib/python3.10/site-packages/peft/peft_model.py", line 662, in init
super().init(model, peft_config, adapter_name)
File "/home/server/anaconda3/envs/pytorch/lib/python3.10/site-packages/peft/peft_model.py", line 99, in init
self.base_model = PEFT_TYPE_TO_MODEL_MAPPING[peft_config.peft_type](
File "/home/server/anaconda3/envs/pytorch/lib/python3.10/site-packages/peft/tuners/lora.py", line 154, in init
self.add_adapter(adapter_name, self.peft_config[adapter_name])
File "/home/server/anaconda3/envs/pytorch/lib/python3.10/site-packages/peft/tuners/lora.py", line 161, in add_adapter
self._find_and_replace(adapter_name)
File "/home/server/anaconda3/envs/pytorch/lib/python3.10/site-packages/peft/tuners/lora.py", line 254, in _find_and_replace
raise ValueError(
ValueError: Target modules ['q_proj', 'v_proj'] not found in the base model. Please check the target modules and try again.

你好，请问怎么样使用deepspeed多卡训练呀，加了deepspeed config后跑不起来

│ 771 │ │ if self.distributed_type == DistributedType.DEEPSPEED: │
│ ❱ 772 │ │ │ config = self.deepspeed_plugin.deepspeed_config │
│ 773 │ │ │ if config.get("fp16", {}).get("enabled", False): │
│ 774 │ │ │ │ mixed_precision = "fp16" │
│ 775 │ │ │ elif config.get("bf16", {}).get("enabled", False): │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
AttributeError: 'NoneType' object has no attribute 'deepspeed_config'

Train using qlora exist with error

train script as follow :

CUDA_VISIBLE_DEVICES=0 python src/train_sft.py \
    --model_name_or_path /xx/model/model_weights/Ziya-LLaMA-13B \
    --do_train \
    --dataset xx \
    --finetuning_type lora \
    --output_dir /xx/output \
    --overwrite_cache \
    --per_device_train_batch_size 4 \
    --gradient_accumulation_steps 4 \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --save_steps 1000 \
    --learning_rate 1e-3 \
    --num_train_epochs 10.0 \
    --resume_lora_training False \
    --plot_loss \
    --fp16 \
    --quantization_bit 4

error message as follow :

Traceback (most recent call last):
  File "/xxx/src/train_sft.py", line 97, in <module>
    main()
  File "/xxx/src/train_sft.py", line 69, in main
    train_result = trainer.train()
  File "/root/anaconda3/envs/chatglm_etuning/lib/python3.10/site-packages/transformers/trainer.py", line 1638, in train
    return inner_training_loop(
  File "/root/anaconda3/envs/chatglm_etuning/lib/python3.10/site-packages/transformers/trainer.py", line 1923, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/root/anaconda3/envs/chatglm_etuning/lib/python3.10/site-packages/transformers/trainer.py", line 2733, in training_step
    loss = self.compute_loss(model, inputs)
  File "/root/anaconda3/envs/chatglm_etuning/lib/python3.10/site-packages/transformers/trainer.py", line 2758, in compute_loss
    outputs = model(**inputs)
  File "/root/anaconda3/envs/chatglm_etuning/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/anaconda3/envs/chatglm_etuning/lib/python3.10/site-packages/accelerate/utils/operations.py", line 553, in forward
    return model_forward(*args, **kwargs)
  File "/root/anaconda3/envs/chatglm_etuning/lib/python3.10/site-packages/accelerate/utils/operations.py", line 541, in __call__
    return convert_to_fp32(self.model_forward(*args, **kwargs))
  File "/root/anaconda3/envs/chatglm_etuning/lib/python3.10/site-packages/torch/amp/autocast_mode.py", line 14, in decorate_autocast
    return func(*args, **kwargs)
  File "/root/anaconda3/envs/chatglm_etuning/lib/python3.10/site-packages/peft/peft_model.py", line 678, in forward
    return self.base_model(
  File "/root/anaconda3/envs/chatglm_etuning/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/anaconda3/envs/chatglm_etuning/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/root/anaconda3/envs/chatglm_etuning/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 688, in forward
    outputs = self.model(
  File "/root/anaconda3/envs/chatglm_etuning/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/anaconda3/envs/chatglm_etuning/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/root/anaconda3/envs/chatglm_etuning/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 570, in forward
    layer_outputs = torch.utils.checkpoint.checkpoint(
  File "/root/anaconda3/envs/chatglm_etuning/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 249, in checkpoint
    return CheckpointFunction.apply(function, preserve, *args)
  File "/root/anaconda3/envs/chatglm_etuning/lib/python3.10/site-packages/torch/autograd/function.py", line 506, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
  File "/root/anaconda3/envs/chatglm_etuning/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 107, in forward
    outputs = run_function(*args)
  File "/root/anaconda3/envs/chatglm_etuning/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 566, in custom_forward
    return module(*inputs, output_attentions, None)
  File "/root/anaconda3/envs/chatglm_etuning/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/anaconda3/envs/chatglm_etuning/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/root/anaconda3/envs/chatglm_etuning/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 292, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "/root/anaconda3/envs/chatglm_etuning/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/anaconda3/envs/chatglm_etuning/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/root/anaconda3/envs/chatglm_etuning/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 194, in forward
    query_states = self.q_proj(hidden_states).view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
  File "/root/anaconda3/envs/chatglm_etuning/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/anaconda3/envs/chatglm_etuning/lib/python3.10/site-packages/peft/tuners/lora.py", line 565, in forward
    result = F.linear(x, transpose(self.weight, self.fan_in_fan_out), bias=self.bias)
RuntimeError: mat1 and mat2 shapes cannot be multiplied (724x5120 and 1x13107200)
  0%|                                                                                                                | 0/30 [00:00<?, ?it/s]

reward模型训练loss为0

可能是什么问题？

webui 只加载Ziya 13B，推理的时候报 RuntimeError: probability tensor contains either `inf`, `nan` or element < 0

错误：

Traceback (most recent call last):
  File "/home/hysz/anaconda3/envs/qlora/lib/python3.10/site-packages/gradio/routes.py", line 401, in run_predict
    output = await app.get_blocks().process_api(
  File "/home/hysz/anaconda3/envs/qlora/lib/python3.10/site-packages/gradio/blocks.py", line 1302, in process_api
    result = await self.call_function(
  File "/home/hysz/anaconda3/envs/qlora/lib/python3.10/site-packages/gradio/blocks.py", line 1039, in call_function
    prediction = await anyio.to_thread.run_sync(
  File "/home/hysz/anaconda3/envs/qlora/lib/python3.10/site-packages/anyio/to_thread.py", line 33, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(
  File "/home/hysz/anaconda3/envs/qlora/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 877, in run_sync_in_worker_thread
    return await future
  File "/home/hysz/anaconda3/envs/qlora/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 807, in run
    result = context.run(func, *args)
  File "/home/hysz/anaconda3/envs/qlora/lib/python3.10/site-packages/gradio/utils.py", line 491, in async_iteration
    return next(iterator)
  File "/home/hysz/AI/LLaMA-Efficient-Tuning/src/web_demo.py", line 99, in predict
    generation_output = model.generate(input_ids=input_ids, **gen_kwargs)
  File "/home/hysz/anaconda3/envs/qlora/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/hysz/anaconda3/envs/qlora/lib/python3.10/site-packages/transformers/generation/utils.py", line 1568, in generate
    return self.sample(
  File "/home/hysz/anaconda3/envs/qlora/lib/python3.10/site-packages/transformers/generation/utils.py", line 2651, in sample
    next_tokens = torch.multinomial(probs, num_samples=1).squeeze(1)
RuntimeError: probability tensor contains either `inf`, `nan` or element < 0
06/03/2023 13:48:06 - INFO - httpx - HTTP Request: POST http://localhost:7860/api/predict "HTTP/1.1 500 Internal Server Error"
06/03/2023 13:48:06 - INFO - httpx - HTTP Request: POST http://localhost:7860/reset "HTTP/1.1 200 OK"
06/03/2023 13:48:07 - INFO - httpx - HTTP Request: POST http://localhost:7860/api/predict "HTTP/1.1 200 OK"
06/03/2023 13:48:07 - INFO - httpx - HTTP Request: POST http://localhost:7860/reset "HTTP/1.1 200 OK"

how to run it with bloom 560M

About the tokenizer pad side

Why it is pad left rather than right for llama model?

baichuan-7b-sft 使用的什么对话数据呢？

rt, 使用了下昨天开源的baichuan-7b-sft 模型，感觉挺不错的，想请问下训练过程使用了什么对话数据吗？方便公开吗~感谢！！

SFT full parameter finetuning - Unable to load the model

I have finetuned LLaMa 7B with full parameters using the following command

deepspeed src/train_sft.py --model_name_or_path huggyllama/llama-7b --do_train --dataset dummy_identity --finetuning_type full --output_dir output/sft-dummy-v1 --overwrite_cache --per_device_train_batch_size 4 --gradient_accumulation_steps 1 --lr_scheduler_type cosine --logging_steps 10 --save_steps 1000 --learning_rate 5e-5 --num_train_epochs 3.0 --plot_loss --fp16 --deepspeed /root/bud-conv/finetune/configs/ds_config.json

How do I run this in cli? When I try the command

python src/cli_demo.py --model_name_or_path huggyllama/llama-7b --checkpoint_dir output/sft-dummy-v1/

I'm getting this

ValueError: The given checkpoint may be not a LoRA checkpoint, please specify --finetuning_type full/freeze instead.

When I specify the finetunetype

python src/cli_demo.py --model_name_or_path huggyllama/llama-7b --checkpoint_dir output/sft-dummy-v1/ --finetuning_type full

I'm getting a shape error as below

RW训练报错

Loading checkpoint shards:  71%|█████████████████████████████████████████▍                | 5/7 [01:51<00:45, 22.58s/it]WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 215299 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 215300 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 215301 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) local_rank: 0 (pid: 215298) of binary: /root/miniconda3/envs/xray/bin/python
Traceback (most recent call last):
  File "/root/miniconda3/envs/xray/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/root/miniconda3/envs/xray/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 45, in main
    args.func(args)
  File "/root/miniconda3/envs/xray/lib/python3.10/site-packages/accelerate/commands/launch.py", line 909, in launch_command
    multi_gpu_launcher(args)
  File "/root/miniconda3/envs/xray/lib/python3.10/site-packages/accelerate/commands/launch.py", line 604, in multi_gpu_launcher
    distrib_run.run(args)
  File "/root/miniconda3/envs/xray/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/root/miniconda3/envs/xray/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/root/miniconda3/envs/xray/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
src/train_rm.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-06-07_14:46:15
  host      : mpudgx202302-DGX-Station-A100-920-23487-2531-000
  rank      : 0 (local_rank: 0)
  exitcode  : -9 (pid: 215298)
  error_file: <N/A>
  traceback : Signal 9 (SIGKILL) received by PID 215298
============================================================

训练命令为：

accelerate launch src/train_rm.py \
    --model_name_or_path llama-hf/33b-hf/llama-33b-hf \
    --do_train \
    --dataset comparison_gpt4_zh \
    --finetuning_type lora \
    --output_dir rm \
    --per_device_train_batch_size 4 \
    --gradient_accumulation_steps 4 \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --save_steps 1000 \
    --learning_rate 1e-5 \
    --num_train_epochs 1.0 \
    --plot_loss \
    --fp16

这是目前看到最全的大模型训练代码

这套代码包含了预训练、rlhf流程，还有lora、qlora技术。真的是很全面了。
但如果可以实现多轮对话构建，比如[q1，a1，q2，a2，q3，a3]，构建成训练样本为：prompt：q1*[IGNORE_INDEX]+a1++q2*[IGNORE_INDEX]+a2++q3*[IGNORE_INDEX]，response: a3
就更好了哈哈

启动cli或者web_demo时如何加载reward和rlhf的checkpoint?

用两张4090微调13b的belle出现oom，单卡则不会

我单卡微调没有出现这个情况，多卡出现了，但是我有一张卡已经被占用了15G显存，还剩8g左右，相当于我是8+24g进行多卡微调，这样微调会确实会出现问题？还是我没配置好的问题？

reward训练异常

Traceback (most recent call last):
File "src/train_rm.py", line 77, in
main()
File "src/train_rm.py", line 25, in main
model, tokenizer = load_pretrained(model_args, finetuning_args, training_args.do_train, stage="rm")
File "/baichuan-7B/train/LLaMA-Efficient-Tuning/src/utils/common.py", line 217, in load_pretrained
model = _init_adapter(model, model_args, finetuning_args, is_trainable, is_mergeable)
File "/baichuan-7B/train/LLaMA-Efficient-Tuning/src/utils/common.py", line 135, in _init_adapter
model = get_peft_model(model, lora_config)
File "/root/miniconda3/envs/baichuan/lib/python3.8/site-packages/peft/mapping.py", line 120, in get_peft_model
return MODEL_TYPE_TO_PEFT_MODEL_MAPPING[peft_config.task_type](model, peft_config)
File "/root/miniconda3/envs/baichuan/lib/python3.8/site-packages/peft/peft_model.py", line 662, in init
super().init(model, peft_config, adapter_name)
File "/root/miniconda3/envs/baichuan/lib/python3.8/site-packages/peft/peft_model.py", line 99, in init
self.base_model = PEFT_TYPE_TO_MODEL_MAPPING[peft_config.peft_type](
File "/root/miniconda3/envs/baichuan/lib/python3.8/site-packages/peft/tuners/lora.py", line 154, in init
self.add_adapter(adapter_name, self.peft_config[adapter_name])
File "/root/miniconda3/envs/baichuan/lib/python3.8/site-packages/peft/tuners/lora.py", line 161, in add_adapter
self._find_and_replace(adapter_name)
File "/root/miniconda3/envs/baichuan/lib/python3.8/site-packages/peft/tuners/lora.py", line 254, in _find_and_replace
raise ValueError(
ValueError: Target modules ['q_proj', 'v_proj'] not found in the base model. Please check the target modules and try again.

running on v100

Is it possible to do fine-tuning via 4 v100s? Thanks!

可以用这个库做chatglm的全量微调吗，需要改代码里面那些部分内容

column names don't match, An error occurred while generating the dataset

May I have some hint about how to solve this question pls：

The detail：I want to use the dataset format like this in json file：

Then I just add the dataset info in the dataset_info.json like this：

My file are set like this：
-baichuan
--baichuan-7B
---baichuan-7B
--LLaMA-Efficient-Tuning
---data
----alpaca4zh.json

The training command：
CUDA_VISIBLE_DEVICES=0 python src/train_sft.py
--model_name_or_path /root/baichuan/baichuan-7B/baichuan-7B
--do_train
--dataset alpaca4zh
--finetuning_type lora
--lora_rank 8
--lora_target W_pack
--output_dir alpaca_baichuan
--per_device_train_batch_size 4
--per_device_eval_batch_size 4
--gradient_accumulation_steps 8
--lr_scheduler_type cosine
--logging_steps 10
--save_steps 100
--eval_steps 100
--learning_rate 5e-5
--max_grad_norm 0.5
--num_train_epochs 3.0
--dev_ratio 0.01
--evaluation_strategy steps
--load_best_model_at_end
--plot_loss
--fp16
The bug：

PPO训练报错Tensors must be CUDA and denseTensors must be CUDA and dense

报错：

Assistant:<s>
Traceback (most recent call last):
Traceback (most recent call last):
  File "/tmp/cct/src/train_ppo.py", line 82, in <module>
Traceback (most recent call last):
Traceback (most recent call last):
  File "/tmp/cct/src/train_ppo.py", line 82, in <module>
  File "/tmp/cct/src/train_ppo.py", line 82, in <module>
  File "/tmp/cct/src/train_ppo.py", line 82, in <module>
    main()
    main()
    main()
    main()
  File "/tmp/cct/src/train_ppo.py", line 55, in main
  File "/tmp/cct/src/train_ppo.py", line 55, in main
  File "/tmp/cct/src/train_ppo.py", line 55, in main
  File "/tmp/cct/src/train_ppo.py", line 55, in main
    ppo_trainer = PPOPeftTrainer(
     ppo_trainer = PPOPeftTrainer(
            ppo_trainer = PPOPeftTrainer(
 ppo_trainer = PPOPeftTrainer(
                                ^   ^   ^   ^    ^     ^ ^ ^ ^^^^ ^^^^^^ ^^^ ^^^ ^^^^^^^^^^^
^^^^^^  File "/tmp/cct/src/utils/ppo.py", line 72, in __init__
^^^^^^^^
^^^^  File "/tmp/cct/src/utils/ppo.py", line 72, in __init__
^
^^  File "/tmp/cct/src/utils/ppo.py", line 72, in __init__
^^^^
  File "/tmp/cct/src/utils/ppo.py", line 72, in __init__
    PPOTrainer.__init__(self, **kwargs)    PPOTrainer.__init__(self, **kwargs)

        PPOTrainer.__init__(self, **kwargs)PPOTrainer.__init__(self, **kwargs)

  File "/root/miniconda3/envs/ppo/lib/python3.11/site-packages/trl/trainer/ppo_trainer.py", line 290, in __init__
  File "/root/miniconda3/envs/ppo/lib/python3.11/site-packages/trl/trainer/ppo_trainer.py", line 290, in __init__
  File "/root/miniconda3/envs/ppo/lib/python3.11/site-packages/trl/trainer/ppo_trainer.py", line 290, in __init__
  File "/root/miniconda3/envs/ppo/lib/python3.11/site-packages/trl/trainer/ppo_trainer.py", line 290, in __init__
    ) = self.accelerator.prepare(
) = self.accelerator.prepare() = self.accelerator.prepare(

     ) = self.accelerator.prepare(
             ^   ^   ^ ^    ^  ^ ^^^ ^^^ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
^^^  File "/root/miniconda3/envs/ppo/lib/python3.11/site-packages/accelerate/accelerator.py", line 1182, in prepare
^^^^^^^^^^^^^^^
^
^  File "/root/miniconda3/envs/ppo/lib/python3.11/site-packages/accelerate/accelerator.py", line 1182, in prepare
^  File "/root/miniconda3/envs/ppo/lib/python3.11/site-packages/accelerate/accelerator.py", line 1182, in prepare
^
  File "/root/miniconda3/envs/ppo/lib/python3.11/site-packages/accelerate/accelerator.py", line 1182, in prepare
    result = tuple(
             ^^^^^^
  File "/root/miniconda3/envs/ppo/lib/python3.11/site-packages/accelerate/accelerator.py", line 1183, in <genexpr>
    result = tuple(
result = tuple(
            result = tuple(
                      ^ ^ ^ ^ ^ ^^ ^ ^^^^^^

^^  File "/root/miniconda3/envs/ppo/lib/python3.11/site-packages/accelerate/accelerator.py", line 1183, in <genexpr>
  File "/root/miniconda3/envs/ppo/lib/python3.11/site-packages/accelerate/accelerator.py", line 1183, in <genexpr>
^^
self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
  File "/root/miniconda3/envs/ppo/lib/python3.11/site-packages/accelerate/accelerator.py", line 1183, in <genexpr>
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^    ^self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)    ^self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
^
^^^^    ^self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)

  File "/root/miniconda3/envs/ppo/lib/python3.11/site-packages/accelerate/accelerator.py", line 1022, in _prepare_one
    ^^^^^^ ^ ^ ^  ^ ^^ ^^ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^    ^^^return self.prepare_model(obj, device_placement=device_placement)^^^
^^^^^^^^^^^^ ^^^ ^^^ ^^^ ^^^ ^^^ ^^^ ^^^ ^^^ ^^^ ^^^ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
^^^^^^  File "/root/miniconda3/envs/ppo/lib/python3.11/site-packages/accelerate/accelerator.py", line 1022, in _prepare_one
^^^^^^^^^^^^^^
^^^^  File "/root/miniconda3/envs/ppo/lib/python3.11/site-packages/accelerate/accelerator.py", line 1022, in _prepare_one
^^^^^^
^^^  File "/root/miniconda3/envs/ppo/lib/python3.11/site-packages/accelerate/accelerator.py", line 1022, in _prepare_one
^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/envs/ppo/lib/python3.11/site-packages/accelerate/accelerator.py", line 1275, in prepare_model
    return self.prepare_model(obj, device_placement=device_placement)
           ^^^^^^    ^return self.prepare_model(obj, device_placement=device_placement)^
^^    ^return self.prepare_model(obj, device_placement=device_placement)^
^ ^ ^ ^ ^ ^ ^  ^  ^      ^  model = torch.nn.parallel.DistributedDataParallel(^
^^ ^^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^^^ ^^^ ^ ^^^ ^^^ ^^^ ^^^ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
^^^^^^  File "/root/miniconda3/envs/ppo/lib/python3.11/site-packages/accelerate/accelerator.py", line 1275, in prepare_model
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
^^^^  File "/root/miniconda3/envs/ppo/lib/python3.11/site-packages/torch/nn/parallel/distributed.py", line 676, in __init__
^^^^^^^    ^^model = torch.nn.parallel.DistributedDataParallel(
^
^  File "/root/miniconda3/envs/ppo/lib/python3.11/site-packages/accelerate/accelerator.py", line 1275, in prepare_model
^^ ^ ^ ^
     File "/root/miniconda3/envs/ppo/lib/python3.11/site-packages/accelerate/accelerator.py", line 1275, in prepare_model
     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/envs/ppo/lib/python3.11/site-packages/torch/nn/parallel/distributed.py", line 676, in __init__
    model = torch.nn.parallel.DistributedDataParallel(
            ^^    ^model = torch.nn.parallel.DistributedDataParallel(^
^^^^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
^^  File "/root/miniconda3/envs/ppo/lib/python3.11/site-packages/torch/nn/parallel/distributed.py", line 676, in __init__
^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/envs/ppo/lib/python3.11/site-packages/torch/nn/parallel/distributed.py", line 676, in __init__
                _sync_module_states(_sync_module_states(
_sync_module_states(_sync_module_states(


  File "/root/miniconda3/envs/ppo/lib/python3.11/site-packages/torch/distributed/utils.py", line 142, in _sync_module_states
  File "/root/miniconda3/envs/ppo/lib/python3.11/site-packages/torch/distributed/utils.py", line 142, in _sync_module_states
  File "/root/miniconda3/envs/ppo/lib/python3.11/site-packages/torch/distributed/utils.py", line 142, in _sync_module_states
  File "/root/miniconda3/envs/ppo/lib/python3.11/site-packages/torch/distributed/utils.py", line 142, in _sync_module_states
    _sync_params_and_buffers(
        _sync_params_and_buffers(_sync_params_and_buffers(_sync_params_and_buffers(  File "/root/miniconda3/envs/ppo/lib/python3.11/site-packages/torch/distributed/utils.py", line 160, in _sync_params_and_buffers



  File "/root/miniconda3/envs/ppo/lib/python3.11/site-packages/torch/distributed/utils.py", line 160, in _sync_params_and_buffers
  File "/root/miniconda3/envs/ppo/lib/python3.11/site-packages/torch/distributed/utils.py", line 160, in _sync_params_and_buffers
  File "/root/miniconda3/envs/ppo/lib/python3.11/site-packages/torch/distributed/utils.py", line 160, in _sync_params_and_buffers
    dist._broadcast_coalesced(
RuntimeError    : dist._broadcast_coalesced(Tensors must be CUDA and dense

dist._broadcast_coalesced(dist._broadcast_coalesced(RuntimeError

: Tensors must be CUDA and dense
RuntimeErrorRuntimeError: : Tensors must be CUDA and denseTensors must be CUDA and dense

WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2344665 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 1 (pid: 2344666) of binary: /root/miniconda3/envs/ppo/bin/python
Traceback (most recent call last):
  File "/root/miniconda3/envs/ppo/bin/accelerate", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/root/miniconda3/envs/ppo/lib/python3.11/site-packages/accelerate/commands/accelerate_cli.py", line 45, in main
    args.func(args)
  File "/root/miniconda3/envs/ppo/lib/python3.11/site-packages/accelerate/commands/launch.py", line 932, in launch_command
    multi_gpu_launcher(args)
  File "/root/miniconda3/envs/ppo/lib/python3.11/site-packages/accelerate/commands/launch.py", line 627, in multi_gpu_launcher
    distrib_run.run(args)
  File "/root/miniconda3/envs/ppo/lib/python3.11/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/root/miniconda3/envs/ppo/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/envs/ppo/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
src/train_ppo.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2023-06-13_10:49:40
  host      : mpudgx202302-DGX-Station-A100-920-23487-2531-000
  rank      : 2 (local_rank: 2)
  exitcode  : 1 (pid: 2344667)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
  time      : 2023-06-13_10:49:40
  host      : mpudgx202302-DGX-Station-A100-920-23487-2531-000
  rank      : 3 (local_rank: 3)
  exitcode  : 1 (pid: 2344668)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-06-13_10:49:40
  host      : mpudgx202302-DGX-Station-A100-920-23487-2531-000
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 2344666)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

命令：

accelerate launch src/train_ppo.py \
    --model_name_or_path llama-hf/ \
    --do_train \
    --dataset CCT \
    --quantization_bit 4 \
    --checkpoint_dir sft/checkpoint-3000 \
    --reward_model rm \
    --output_dir ppo \
    --per_device_train_batch_size 4 \
    --gradient_accumulation_steps 4 \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --save_steps 1000 \
    --learning_rate 1e-5 \
    --num_train_epochs 2.0 \
    --resume_lora_training False \
    --plot_loss

为什么用py脚本do_predict和web_demo中回答的结果不一样

模型是baichuan-7B
同样的问题，py脚本中do_predict的回复质量明显高于web_demo

通过webui.py 导入13B原模型，用8bit方式会报错

通过webui.py 导入13B原模型，用8bit方式会报错，执行代码如下：
python src/web_demo.py --model_name_or_path ../models/Ziya-LLaMA-13B --quantization_bit 8
出错信息如下：

Traceback (most recent call last):
  File "/home/hysz/AI/LLaMA-Efficient-Tuning/src/web_demo.py", line 18, in <module>
    model, tokenizer = load_pretrained(model_args, finetuning_args)
  File "/home/hysz/AI/LLaMA-Efficient-Tuning/src/utils/common.py", line 182, in load_pretrained
    model = model.half() # cast all params to float16 for inference
  File "/home/hysz/anaconda3/envs/qlora/lib/python3.10/site-packages/transformers/modeling_utils.py", line 1896, in half
    raise ValueError(
ValueError: `.half()` is not supported for `4-bit` or `8-bit` models. Please use the model as it is, since the model has already been casted to the correct `dtype`.

加载chinese-alpaca-plus-13b模型推理异常

python src/cli_demo.py --model_name_or_path xxx --prompt_template alpaca
xx是合并后的chinese-alpaca-plus-13b目录。模型确认没有问题，用chinese-alpaca官方的cpp文件推理没有问题。

使用四卡A100和Qlora-4进行PPO训练报错

Assistant:<unk>
Traceback (most recent call last):
Traceback (most recent call last):
  File "/tmp/CCT/src/train_ppo.py", line 82, in <module>
  File "/tmp/CCT/src/train_ppo.py", line 82, in <module>
Traceback (most recent call last):
Traceback (most recent call last):
  File "/tmp/CCT/src/train_ppo.py", line 82, in <module>
  File "/tmp/CCT/src/train_ppo.py", line 82, in <module>
    main()
main()
  File "/tmp/CCT/src/train_ppo.py", line 55, in main
  File "/tmp/CCT/src/train_ppo.py", line 55, in main
    main()
  File "/tmp/CCT/src/train_ppo.py", line 55, in main
    main()
  File "/tmp/CCT/src/train_ppo.py", line 55, in main
        ppo_trainer = PPOPeftTrainer(ppo_trainer = PPOPeftTrainer(

  File "/tmp/CCT/src/utils/ppo.py", line 72, in __init__
ppo_trainer = PPOPeftTrainer(  File "/tmp/CCT/src/utils/ppo.py", line 72, in __init__

  File "/tmp/CCT/src/utils/ppo.py", line 72, in __init__
    ppo_trainer = PPOPeftTrainer(
  File "/tmp/CCT/src/utils/ppo.py", line 72, in __init__
    PPOTrainer.__init__(self, **kwargs)
          File "/root/miniconda3/envs/llama_etuning/lib/python3.10/site-packages/trl/trainer/ppo_trainer.py", line 290, in __init__
PPOTrainer.__init__(self, **kwargs)    PPOTrainer.__init__(self, **kwargs)
PPOTrainer.__init__(self, **kwargs)
  File "/root/miniconda3/envs/llama_etuning/lib/python3.10/site-packages/trl/trainer/ppo_trainer.py", line 290, in __init__

  File "/root/miniconda3/envs/llama_etuning/lib/python3.10/site-packages/trl/trainer/ppo_trainer.py", line 290, in __init__
  File "/root/miniconda3/envs/llama_etuning/lib/python3.10/site-packages/trl/trainer/ppo_trainer.py", line 290, in __init__
    ) = self.accelerator.prepare(
  File "/root/miniconda3/envs/llama_etuning/lib/python3.10/site-packages/accelerate/accelerator.py", line 1182, in prepare
            ) = self.accelerator.prepare() = self.accelerator.prepare() = self.accelerator.prepare(


  File "/root/miniconda3/envs/llama_etuning/lib/python3.10/site-packages/accelerate/accelerator.py", line 1182, in prepare
  File "/root/miniconda3/envs/llama_etuning/lib/python3.10/site-packages/accelerate/accelerator.py", line 1182, in prepare
  File "/root/miniconda3/envs/llama_etuning/lib/python3.10/site-packages/accelerate/accelerator.py", line 1182, in prepare
    result = tuple(
  File "/root/miniconda3/envs/llama_etuning/lib/python3.10/site-packages/accelerate/accelerator.py", line 1183, in <genexpr>
            result = tuple(result = tuple(result = tuple(


  File "/root/miniconda3/envs/llama_etuning/lib/python3.10/site-packages/accelerate/accelerator.py", line 1183, in <genexpr>
  File "/root/miniconda3/envs/llama_etuning/lib/python3.10/site-packages/accelerate/accelerator.py", line 1183, in <genexpr>
  File "/root/miniconda3/envs/llama_etuning/lib/python3.10/site-packages/accelerate/accelerator.py", line 1183, in <genexpr>
    self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
  File "/root/miniconda3/envs/llama_etuning/lib/python3.10/site-packages/accelerate/accelerator.py", line 1022, in _prepare_one
            self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
  File "/root/miniconda3/envs/llama_etuning/lib/python3.10/site-packages/accelerate/accelerator.py", line 1022, in _prepare_one

  File "/root/miniconda3/envs/llama_etuning/lib/python3.10/site-packages/accelerate/accelerator.py", line 1022, in _prepare_one
  File "/root/miniconda3/envs/llama_etuning/lib/python3.10/site-packages/accelerate/accelerator.py", line 1022, in _prepare_one
    return self.prepare_model(obj, device_placement=device_placement)
  File "/root/miniconda3/envs/llama_etuning/lib/python3.10/site-packages/accelerate/accelerator.py", line 1275, in prepare_model
    return self.prepare_model(obj, device_placement=device_placement)
      File "/root/miniconda3/envs/llama_etuning/lib/python3.10/site-packages/accelerate/accelerator.py", line 1275, in prepare_model
return self.prepare_model(obj, device_placement=device_placement)
return self.prepare_model(obj, device_placement=device_placement)  File "/root/miniconda3/envs/llama_etuning/lib/python3.10/site-packages/accelerate/accelerator.py", line 1275, in prepare_model

  File "/root/miniconda3/envs/llama_etuning/lib/python3.10/site-packages/accelerate/accelerator.py", line 1275, in prepare_model
    model = torch.nn.parallel.DistributedDataParallel(
  File "/root/miniconda3/envs/llama_etuning/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 676, in __init__
    model = torch.nn.parallel.DistributedDataParallel(
    model = torch.nn.parallel.DistributedDataParallel(  File "/root/miniconda3/envs/llama_etuning/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 676, in __init__

  File "/root/miniconda3/envs/llama_etuning/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 676, in __init__
    model = torch.nn.parallel.DistributedDataParallel(
  File "/root/miniconda3/envs/llama_etuning/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 676, in __init__
        _sync_module_states(_sync_module_states(

_sync_module_states(_sync_module_states(  File "/root/miniconda3/envs/llama_etuning/lib/python3.10/site-packages/torch/distributed/utils.py", line 142, in _sync_module_states
  File "/root/miniconda3/envs/llama_etuning/lib/python3.10/site-packages/torch/distributed/utils.py", line 142, in _sync_module_states


  File "/root/miniconda3/envs/llama_etuning/lib/python3.10/site-packages/torch/distributed/utils.py", line 142, in _sync_module_states
  File "/root/miniconda3/envs/llama_etuning/lib/python3.10/site-packages/torch/distributed/utils.py", line 142, in _sync_module_states
        _sync_params_and_buffers(
    _sync_params_and_buffers(_sync_params_and_buffers(
  File "/root/miniconda3/envs/llama_etuning/lib/python3.10/site-packages/torch/distributed/utils.py", line 160, in _sync_params_and_buffers
_sync_params_and_buffers(
  File "/root/miniconda3/envs/llama_etuning/lib/python3.10/site-packages/torch/distributed/utils.py", line 160, in _sync_params_and_buffers

  File "/root/miniconda3/envs/llama_etuning/lib/python3.10/site-packages/torch/distributed/utils.py", line 160, in _sync_params_and_buffers
  File "/root/miniconda3/envs/llama_etuning/lib/python3.10/site-packages/torch/distributed/utils.py", line 160, in _sync_params_and_buffers
    dist._broadcast_coalesced(
    dist._broadcast_coalesced(
        dist._broadcast_coalesced(dist._broadcast_coalesced(

RuntimeError: Tensors must be CUDA and dense
RuntimeErrorRuntimeError: : RuntimeErrorTensors must be CUDA and denseTensors must be CUDA and dense
:
Tensors must be CUDA and dense
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 897529) of binary: /root/miniconda3/envs/llama_etuning/bin/python
Traceback (most recent call last):
  File "/root/miniconda3/envs/llama_etuning/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/root/miniconda3/envs/llama_etuning/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 45, in main
    args.func(args)
  File "/root/miniconda3/envs/llama_etuning/lib/python3.10/site-packages/accelerate/commands/launch.py", line 932, in launch_command
    multi_gpu_launcher(args)
  File "/root/miniconda3/envs/llama_etuning/lib/python3.10/site-packages/accelerate/commands/launch.py", line 627, in multi_gpu_launcher
    distrib_run.run(args)
  File "/root/miniconda3/envs/llama_etuning/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/root/miniconda3/envs/llama_etuning/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/root/miniconda3/envs/llama_etuning/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
src/train_ppo.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2023-06-09_10:00:51
  host      : mpudgx202302-DGX-Station-A100-920-23487-2531-000
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 897530)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
  time      : 2023-06-09_10:00:51
  host      : mpudgx202302-DGX-Station-A100-920-23487-2531-000
  rank      : 2 (local_rank: 2)
  exitcode  : 1 (pid: 897531)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
  time      : 2023-06-09_10:00:51
  host      : mpudgx202302-DGX-Station-A100-920-23487-2531-000
  rank      : 3 (local_rank: 3)
  exitcode  : 1 (pid: 897532)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-06-09_10:00:51
  host      : mpudgx202302-DGX-Station-A100-920-23487-2531-000
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 897529)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

accelerate launch src/train_ppo.py \
    --model_name_or_path llama-hf/33b-hf/llama-33b-hf \
    --do_train \
    --dataset ChangChunTeng \
    --finetuning_type lora \
    --checkpoint_dir sft/checkpoint-9000 \
    --reward_model rm/checkpoint-4000 \
    --output_dir ppo \
    --per_device_train_batch_size 4 \
    --gradient_accumulation_steps 4 \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --save_steps 1000 \
    --learning_rate 1e-5 \
    --num_train_epochs 2.0 \
    --resume_lora_training False \
    --plot_loss \
    --quantization_bit 4

预训练需要什么配置哇大佬

我怕我这两张4090带不动

OOM issue for SFT full parameter training

I'm trying to do SFT with full parameter training on LLaMa 7B model. I have used the same command from the readme for train_sft.py. When I use finetuning_type='lora', the training is starting as expected. But when I use finetuning_type='full', it's leading to OOM.

I'm using A100 80GB.

CUDA_VISIBLE_DEVICES=0 python src/train_sft.py \ --model_name_or_path huggyllama/llama-7b \ --do_train \ --dataset alpaca_gpt4_en \ --finetuning_type full \ --output_dir path_to_sft_checkpoint \ --overwrite_cache \ --per_device_train_batch_size 4 \ --gradient_accumulation_steps 1 \ --lr_scheduler_type cosine \ --logging_steps 10 \ --save_steps 1000 \ --learning_rate 5e-5 \ --num_train_epochs 3.0 \ --plot_loss \ --fp16

Any thought on what might be the issue here?

ValueError: Target modules ['q_proj', 'v_proj'] not found in the base model. Please check the target modules and try again.

train_sft.py训练指令：
CUDA_VISIBLE_DEVICES=0 python src/train_sft.py
--model_name_or_path /data1/projects/baichuan-7B/
--do_train
--dataset alpaca_gpt4_zh
--finetuning_type lora
--output_dir output
--overwrite_cache
--per_device_train_batch_size 4
--gradient_accumulation_steps 4
--lr_scheduler_type cosine
--logging_steps 10
--save_steps 1000
--learning_rate 5e-5
--num_train_epochs 3.0
--plot_loss
--fp16

训练报错ValueError: Target modules ['q_proj', 'v_proj'] not found in the base model. Please check the target modules and try again.

有没有大佬知道怎么解决，谢谢！

openaiapi compatible api_demo support

可以增加完全兼容openai api的api demo吗？这样的话，我们就可以使用大部分的前端，例如chatbotui，chatgpt-next 等。

[Question] 关于增量预训练的几个问题

作者您好，关于使用baichuan-7B做增量预训练有几个问题：

CUDA_VISIBLE_DEVICES=0 python src/train_pt.py \
    --model_name_or_path path_to_your_model \
    --do_train \
    --dataset wiki_demo \
    --finetuning_type lora \
    --output_dir path_to_pt_checkpoint \
    --overwrite_cache \
    --per_device_train_batch_size 4 \
    --gradient_accumulation_steps 4 \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --save_steps 1000 \
    --learning_rate 5e-5 \
    --num_train_epochs 3.0 \
    --plot_loss \
    --fp16

其中，

finetuning_type lora意思是只用LoRA权重做预训练吗？
硬件资源的要求是什么，8张V100或者8张3090够用吗？
预训练的任务是什么呢？
自定义数据集的格式参考wiki_demo吗？
其他参数需不需要改呢？我现在有大概4-5万的数据可以做自监督训练，epochs设置多少合适呢？

微信群已经失效了，可以在分享下吗

怎么finetune多轮对话？

希望提供多轮对话finetune的Demo或文档

where's the implemention code of QLoRA?

微信群满了，麻烦新开一个？

微信群无法加入

微信群超过200人，只能通过邀请加入，不能扫码加入了

咨询关于微调的报错

报错如下
06/19/2023 13:52:00 - INFO - utils.common - Fine-tuning method: LoRA
ValueError: Target modules ['q_proj', 'v_proj'] not found in the base model. Please check the target modules and try again.

运行代码参数 CUDA_VISIBLE_DEVICES=0 python src/train_sft.py --model_name_or_path ../models --do_train --dataset alpaca_gpt4_zh --finetuning_type lora --output_dir path_to_sft_checkpoint --overwrite_cache --per_device_train_batch_size 4 --gradient_accumulation_steps 4 --lr_scheduler_type cosine --logging_steps 10 --save_steps 1000 --learning_rate 5e-5 --num_train_epochs 3.0 --plot_loss --fp16

base model是baichuan

关于单机多卡训练问题

您好，请问如何实现将大模型的参数划分到多张卡上训练，而不是在每张卡上都加载整个模型参数。

load_valuehead_params 没有那个文件value_head.bin

请问大拿，如下操作报错，有遇到这种情况的吗
LLaMA train
（持续）预培训
CUDA_VISIBLE_DEVICES=0 python src/train_pt.py
--model_name_or_path path_to_llama_model
--do_train
--dataset wiki_demo
--finetuning_type lora
--output_dir path_to_pt_checkpoint
--overwrite_cache
--per_device_train_batch_size 4
--gradient_accumulation_steps 4
--lr_scheduler_type cosine
--logging_steps 10
--save_steps 1000
--learning_rate 5e-5
--num_train_epochs 3.0
--plot_loss
--fp16

目录 path_to_pt_checkpoint 没有文件value_head.bin
当训练rw模型时
CUDA_VISIBLE_DEVICES=0 python src/train_rm.py
--model_name_or_path path_to_llama_model
--do_train
--dataset comparison_gpt4_en
--finetuning_type lora
--checkpoint_dir path_to_pt_checkpoint
--output_dir path_to_rm_checkpoint
--per_device_train_batch_size 4
--gradient_accumulation_steps 4
--lr_scheduler_type cosine
--logging_steps 10
--save_steps 1000
--learning_rate 1e-5
--num_train_epochs 1.0
--plot_loss
--fp16
报错这个目录path_to_pt_checkpoint下面没有这个文件value_head.bin
从来没有见过这个文件啊

请问baichuan-7B进行PT+SFT+RLHF的全流程微调的话，需要多少显存呢

有关reward模型的问题

感谢你们的工作，这个真的很好用，但是我有一个问题，打个比方，我对LLama进行RLHF时，是否能够使用BLOOM作为reward模型，如果可以的话，需要在哪里进行改动呢

checkpoint_dir 参数的作用是？

请问首次做sft训练，不需要checkpoint_dir 参数吗？

QLoRA 训练报错

int4 报错：RuntimeError: self and mat2 must have the same dtype
训练参数：
CUDA_VISIBLE_DEVICES=0 python src/train_sft.py
--model_name_or_path /models/bloomz-7b1-mt
--do_train
--dataset alpaca_gpt4_zh
--finetuning_type lora
--quantization_bit 4
--output_dir bloomz_lora
--overwrite_cache
--per_device_train_batch_size 1
--gradient_accumulation_steps 4
--lr_scheduler_type cosine
--logging_steps 10
--save_steps 1000
--learning_rate 5e-5
--num_train_epochs 3.0
--resume_lora_training False
--plot_loss
--fp16

上述参数的 --quantization_bit 如果设置为 8  可正常训练
设备：RTX3080