zhengxiangshi / dept

[ICLR 2024] This is the repository for the paper titled "DePT: Decomposed Prompt Tuning for Parameter-Efficient Fine-tuning"

Home Page: http://arxiv.org/abs/2309.05173

License: MIT License

Python 100.00%
fine-tuning language-model large-language-models natural-language-processing parameter-efficient-tuning peft prompt-tuning parameter-efficient-fine-tuning nlp nlp-machine-learning

dept's Introduction

DePT: Decomposed Prompt Tuning for Parameter-Efficient Fine-tuning

This repository provides the code for the paper DePT: Decomposed Prompt Tuning for Parameter-Efficient Fine-tuning and is organised so that our code contributions can be integrated into other projects more easily.




Overview

You can reproduce the experiments of our paper DePT: Decomposed Prompt Tuning for Parameter-Efficient Fine-tuning.

Abstract: Prompt tuning (PT), in which a small number of trainable soft (continuous) prompt vectors is affixed to the input of language models (LMs), has shown promising results across various tasks and models for parameter-efficient fine-tuning (PEFT). PT stands out from other PEFT approaches because it maintains competitive performance with fewer trainable parameters and does not drastically scale up its parameters as the model size expands. However, PT introduces additional soft prompt tokens, leading to longer input sequences, which significantly impacts training and inference time and memory usage due to the Transformer's quadratic complexity. This is particularly concerning for Large Language Models (LLMs) that face heavy daily querying. To address this issue, we propose Decomposed Prompt Tuning (DePT), which decomposes the soft prompt into a shorter soft prompt and a pair of low-rank matrices that are then optimised with two different learning rates. This allows DePT to achieve better performance while saving substantial memory and time costs compared to vanilla PT and its variants, without changing the number of trainable parameters. Through extensive experiments on 23 natural language processing (NLP) and vision-language (VL) tasks, we demonstrate that DePT outperforms state-of-the-art PEFT approaches, including the full fine-tuning baseline, in some scenarios. Additionally, we empirically show that DePT grows more efficient as the model size increases. Our further study reveals that DePT integrates seamlessly with parameter-efficient transfer learning in the few-shot learning setting and highlights its adaptability to various model architectures and sizes.

1. Requirements and Installation

To run the prompt-based or cls-based fine-tuning, you need to install the following packages (a sample install command is sketched after the list).

  • Transformers
  • PyTorch
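
For example, a minimal environment can be set up with pip. The repository does not ship a pinned requirements file, so the package list below is a sketch rather than an exact specification (the datasets package is assumed because the data are loaded from Hugging Face Datasets, see the next section):

# Minimal sketch of an environment for this repository; versions are not
# pinned by the authors, so install whatever recent releases work for you.
pip install torch transformers datasets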

2. Prepare the datasets

We use the following NLP datasets in our experiments: GLUE, SuperGLUE, the MRQA 2019 Shared Task, WinoGrande, Yelp-2, SciTail, and PAWS-Wiki. All of these datasets are available in the Hugging Face Datasets library and can be downloaded automatically. Please refer to the file src/tasks.py for the details of the datasets.
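
If you want to verify data access before launching training, the one-liner below is a minimal sketch (not part of the repository) that pre-downloads one GLUE configuration with the Hugging Face datasets library; the training script will otherwise fetch each dataset automatically on first use:

# Optional sanity check: pre-download the GLUE MRPC config into the local
# Hugging Face cache (assumes the `datasets` package is installed).
python -c "from datasets import load_dataset; load_dataset('glue', 'mrpc')"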

3. Run Experiments

We provide the scripts to reproduce the main experiments in our paper. For example, you can run the following script to reproduce the results of DePT on the GLUE benchmark. PREFIX_LENGTH is the length of the soft prompt (m in the paper), and R is the rank of the low-rank matrices (r in the paper). lr is the learning rate for the soft prompt, and LORA_LR is the learning rate for the pair of low-rank matrices that are added to the frozen word embeddings.

MODEL=t5-base
MAX_LENGTH=256
MAX_STEPS=40000
PREFIX_LENGTH=40   # length of the soft prompt (m in the paper)
R=45               # rank of the low-rank matrices (r in the paper)
for TASK_NAME in cola mrpc mnli qnli qqp rte sst2 stsb; do
  for LORA_LR in 5e-3 3e-1 5e-4; do
    for lr in 3e-1 4e-1; do
      CUDA_VISIBLE_DEVICES=0 python train.py \
        --peft_type PROMPT_TUNING_LORA \
        --lora_embedding_lr ${LORA_LR} \
        --learning_rate ${lr} \
        --prefix_length ${PREFIX_LENGTH} \
        --r ${R} \
        --task_name ${TASK_NAME} \
        --dataset_config_name en \
        --model_name_or_path ${MODEL} \
        --do_train \
        --do_eval \
        --do_predict \
        --per_device_train_batch_size 32 \
        --per_device_eval_batch_size 32 \
        --max_seq_length ${MAX_LENGTH} \
        --save_strategy steps \
        --evaluation_strategy steps \
        --max_steps ${MAX_STEPS} \
        --eval_steps 1000 \
        --save_steps 1000 \
        --warmup_steps 500 \
        --weight_decay 1e-5 \
        --load_best_model_at_end \
        --save_total_limit 1 \
        --output_dir saved_${MODEL}/${TASK_NAME}_lr${lr}_loralr${LORA_LR}_pl${PREFIX_LENGTH}_r${R}_st${MAX_STEPS};
    done;
  done;
done

You can replace TASK_NAME with superglue-multirc, superglue-wic, superglue-wsc.fixed, superglue-cb, or superglue-boolq for the SuperGLUE benchmark; newsqa, searchqa, hotpotqa, or nq for the MRQA 2019 Shared Task; winogrande for the WinoGrande dataset; yelp_polarity for the Yelp-2 dataset; scitail for the SciTail dataset; and paws for the PAWS-Wiki dataset.

Additionally, you can add the argument --peft_model_id to initialize the soft prompt and the pair of low-rank matrices from pretrained prompt vectors, and the argument --k_shot_examples to specify the number of examples used for few-shot learning.
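
As a sketch of how these options combine with the command above (the checkpoint path and the value 16 are placeholders, not settings recommended by the authors), you would add the following lines among the existing train.py arguments:

        # Hypothetical few-shot run: initialize the soft prompt and low-rank
        # matrices from a previously saved DePT checkpoint (placeholder path)
        # and train on 16 examples per task.
        --peft_model_id <path_to_pretrained_dept_checkpoint> \
        --k_shot_examples 16 \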

Limitations

As discussed in the paper, one potential limitation of this work is the introduction of extra hyperparameters to tune, e.g., the learning rate of the low-rank matrices and the number of training steps. This may introduce additional computational overhead during the hyperparameter optimization phase of model training, and it is important to search over these hyperparameters to obtain optimal performance. For large datasets with more than 100,000 training examples, we follow prior work (Vu et al., 2022) and train our proposed method DePT for up to 300,000 steps; training for more steps helps improve performance on large datasets. Additionally, as the length of the soft prompt decreases, the hyperparameter search may take more effort. Typically, a soft prompt length (PREFIX_LENGTH) of 40 or 60 works well. Using parameter-efficient transfer learning (PETL) can help reduce the effort of the hyperparameter search. However, it is important to note that model training is a one-time event, while model inference is not; in this context, the efficiency benefits of DePT become especially valuable at inference time.

Bugs or questions?

If you have any questions regarding the code or the paper, please feel free to reach out to Zhengxiang at [email protected]. If you experience any difficulties while using the code or need to report a bug, feel free to open an issue. We kindly ask that you provide detailed information about the problem to help us provide effective support.

Citation

@inproceedings{shi2024dept,
  title={DePT: Decomposed Prompt Tuning for Parameter-Efficient Fine-tuning},
  author={Zhengxiang Shi and Aldo Lipani},
  booktitle={International Conference on Learning Representations},
  year={2024},
  url={http://arxiv.org/abs/2309.05173}
}

Acknowledgement

This repository is built upon the following repositories:


dept's Issues

LLaMA 2 fine-tuning

Thanks for the good work and the clear, neat implementation. I have some questions regarding the implementation for LLaMA 2:

  • Q.1: I can see in the train.py file that you support the LLaMA 2 model; however, the trainer at line 639 of train.py creates an object of the PEFTSeq2SeqTrainer() class, and at line 659 you call its train method. When I checked the implementation of PEFTSeq2SeqTrainer(), it does not seem to have a train method. Does this mean you do not support fine-tuning of LLaMA (a decoder-based model), or am I missing something?

  • Q.2: Also, in the same class at line 75, generation_inputs = inputs[self.model.main_input_name] is called. My question is: does LLaMA 2 have main_input_name?

  • Q.3: Is the code runnable for LLaMA 2 on all of the datasets provided? From what I can see, the paper only reports LLaMA 2 results on the sst2 dataset.

Thanks, and looking forward to your response.

Some tensors share memory, this will lead to duplicate memory

This is the part of my log file where the error happens. I guess this error occurred because of an incompatible library version. Could you give me information about the version of transformers you use? It would also be helpful if you released a requirements.txt file.

 2%|▎         | 1000/40000 [15:49<10:18:48, 1.05it/s]
{'loss': 0.5651, 'learning_rate': 0.0005, 'epoch': 4.35}
{'loss': 0.2853, 'learning_rate': 0.0004936708860759494, 'epoch': 8.7}
{'eval_loss': 0.27034154534339905, 'eval_f1': 81.3953488372093, 'eval_accuracy': 68.62745098039215, 'eval_runtime': 5.1909, 'eval_samples_per_second': 39.299, 'eval_steps_per_second': 1.349, 'epoch': 8.7}

Traceback (most recent call last):
  File "/home/trunghn/DecomposedPromptTuning/train.py", line 704, in <module>
    main()
  File "/home/trunghn/DecomposedPromptTuning/train.py", line 659, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/home/trunghn/miniconda3/envs/ie/lib/python3.10/site-packages/transformers/trainer.py", line 1537, in train
    return inner_training_loop(
  File "/home/trunghn/miniconda3/envs/ie/lib/python3.10/site-packages/transformers/trainer.py", line 1914, in _inner_training_loop
    self._maybe_log_save_evaluate(tr_loss, model, trial, epoch, ignore_keys_for_eval)
  File "/home/trunghn/miniconda3/envs/ie/lib/python3.10/site-packages/transformers/trainer.py", line 2279, in _maybe_log_save_evaluate
    self._save_checkpoint(model, trial, metrics=metrics)
  File "/home/trunghn/miniconda3/envs/ie/lib/python3.10/site-packages/transformers/trainer.py", line 2355, in _save_checkpoint
    self.save_model(staging_output_dir, _internal_call=True)
  File "/home/trunghn/miniconda3/envs/ie/lib/python3.10/site-packages/transformers/trainer.py", line 2849, in save_model
    self._save(output_dir)
  File "/home/trunghn/miniconda3/envs/ie/lib/python3.10/site-packages/transformers/trainer.py", line 2905, in _save
    safetensors.torch.save_file(state_dict, os.path.join(output_dir, SAFE_WEIGHTS_NAME))
  File "/home/trunghn/miniconda3/envs/ie/lib/python3.10/site-packages/safetensors/torch.py", line 281, in save_file
    serialize_file(_flatten(tensors), filename, metadata=metadata)
  File "/home/trunghn/miniconda3/envs/ie/lib/python3.10/site-packages/safetensors/torch.py", line 467, in _flatten
    raise RuntimeError(
RuntimeError:
Some tensors share memory, this will lead to duplicate memory on disk and potential differences when loading them again: [{'base_model.decoder.embed_tokens.weight', 'base_model.lm_head.weight', 'base_model.encoder.embed_tokens.weight', 'base_model.shared.weight', 'word_embeddings.weight'}].
A potential way to correctly save your model is to use save_model.
More information at https://huggingface.co/docs/safetensors/torch_shared_tensors

Running Time

Hello! I am a newbie in NLP. Thank you very much for the inspiration from your outstanding work! I got a warning that a lower library version might lead to slower running speed. I would like to know how long it takes to run a dataset, so that I can better choose and install the software and hardware versions. Looking forward to your response!

About the hyperparameters

Could you please tell me which hyperparameters are correct, those in the paper or those provided in the README, such as "weight decay", "batch size", and "warmup steps"?

About the max_length of SuperGLUE-MultiRC dataset

Hello, thank you for the nice work! I would like to know why the max length for the SuperGLUE-MultiRC dataset is 348 rather than 256. Are there any findings behind this setting? Besides, will the number of trainable parameters increase when using max length = 384 while the number of appended prompt tokens remains 60 and the LoRA rank r remains 30? Thank you very much!

General question about padding in the setting of soft-prompt tuning

Hello Authors,

I have a general question about padding in the soft-prompt tuning setting.

In a batch, when the sequences have different lengths, we typically left-pad the shorter sequences, like:

# first example is left-padded; batch size is two
input_tokens = [
    ["<pad>", "<pad>", "my", "name"],
    ["where", "are", "you", "from"],
]

Then, do you prepend the soft-prompt embeddings to the embeddings of the padded tokens, like below?

# se1, se2 are the soft-prompt embeddings that need to be tuned
input_embedding = [
    [se1, se2, Embedding("<pad>"), Embedding("<pad>"), Embedding("my"), Embedding("name")],
    [se1, se2, Embedding("where"), Embedding("are"), Embedding("you"), Embedding("from")],
]

Is my understanding right? Section 2.1 of the paper mentions that X is padded to the max sequence length.

My other concern is: shouldn't we prepend the soft-prompt embeddings to the unpadded input embeddings and then pad the shorter sequences? Like:

# please note the change in the first example
input_embedding = [
    [Embedding("<pad>"), Embedding("<pad>"), se1, se2, Embedding("my"), Embedding("name")],
    [se1, se2, Embedding("where"), Embedding("are"), Embedding("you"), Embedding("from")],
]

Could you please share your insights on how padding is generally handled in soft-prompt tuning? Thank you for the clarification.

About the hyperparameters of large models

Hi, I would like to know the experimental details for training large models like T5-3B and LLaMA-2-7B/13B. Will the learning rate and batch size be the same as for training T5-base? Also, what will the prompt and LoRA design be (the values of PREFIX_LENGTH and R)? Would it be convenient for you to provide the training shell scripts for reference? Thank you very much.
