
longlora's Introduction


LongLoRA and LongAlpaca for Long-context LLMs


TABLE OF CONTENTS

  1. News
  2. Highlights
  3. How to contribute
  4. Requirements
  5. Installation and quick guide
  6. LongAlpaca Data
  7. Models
  8. Training
  9. Evaluation
  10. Demo
  11. Streaming Inference
  12. Data Generation via Pdf2Text
  13. Examples
  14. Citation
  15. Acknowledgement
  16. License

News

LongLoRA: Efficient Fine-tuning of Long-Context Large Language Models [Paper]
Yukang Chen, Shengju Qian, Haotian Tang, Xin Lai, Zhijian Liu, Song Han, Jiaya Jia

Highlights

  1. In the LongLoRA approach, the proposed shifted short attention is easy to implement, compatible with Flash-Attention, and not required during inference.
  2. We released all our models, ranging from 7B to 70B with context lengths from 8k to 100k, including LLaMA2-LongLoRA-7B-100k, LLaMA2-LongLoRA-13B-64k, and LLaMA2-LongLoRA-70B-32k.
  3. We built a long-context instruction-following dataset, LongAlpaca-12k, and released the corresponding LongAlpaca-7B, LongAlpaca-13B and LongAlpaca-70B models. To the best of our knowledge, this is the first open-sourced long-context 70B model.

How to Contribute

  • Make sure to have git installed.
  • Create your own fork of the project.
  • Clone the repository to your local machine, using git clone with the URL of this project.
  • Read both the Requirements and Installation and Quick Guide sections below.
  • Commit and push your changes.
  • Make a pull request when finished modifying the project.

Usage Requirements

To download and use the pre-trained weights you will need:

  1. A Hugging Face (HF) account with a valid email. Note that the email used for HF must also be the one used for the license agreement.
  2. Acceptance of the Meta license and acceptable use policy.

Installation and Quick Guide

To install and run the application:

  1. Fork this repo on GitHub.
  2. Clone the repository to your local machine, using git clone with the URL of your fork.
  3. Run the following code:
pip install -r requirements.txt
pip install flash-attn --no-build-isolation
  4. Use either a released model or fine-tune a model to fit your preferences.
  5. Test your model by chatting with it.
  6. Deploy your own demo.

LongAlpaca Data

LongAlpaca-12k contains 9k long QA entries that we collected and 3k short QA entries sampled from the original Alpaca data, to avoid degrading the model's ability to follow short instructions. The types and amounts of data we collected are shown in the following figure.

(Figure: composition of the LongAlpaca-12k dataset by data type and amount)

Data             Short QA   Long QA   Total   Download
LongAlpaca-12k   3k         9k        12k     Link

Following the original Alpaca format, our Long QA data uses the following prompts for fine-tuning:

  • instruction: str, describes the task the model should perform, for example answering a question after reading a book section or paper. We vary the contents and questions to make the instructions diverse.
  • output: str, the answer to the instruction.

For simplicity, we did not use the input field from the Alpaca format.
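
For illustration, a single training entry in this format might look like the following (a made-up, heavily shortened example; real LongAlpaca-12k entries embed a full book section or paper inside the instruction):

[
  {
    "instruction": "Below is a paper. Read the paper and answer the question.\n\n<full paper text>\n\nQuestion: What problem does this work try to solve?",
    "output": "The paper addresses the high cost of extending the context window of pre-trained LLMs ..."
  }
]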

Models

Models with supervised fine-tuning

Model            Size   Context   Train     Link
LongAlpaca-7B    7B     32768     Full FT   Model
LongAlpaca-13B   13B    32768     Full FT   Model
LongAlpaca-70B   70B    32768     LoRA+     Model (LoRA-weight)

Models with context extension via fully fine-tuning

Model                         Size   Context   Train     Link
Llama-2-7b-longlora-8k-ft     7B     8192      Full FT   Model
Llama-2-7b-longlora-16k-ft    7B     16384     Full FT   Model
Llama-2-7b-longlora-32k-ft    7B     32768     Full FT   Model
Llama-2-7b-longlora-100k-ft   7B     100000    Full FT   Model
Llama-2-13b-longlora-8k-ft    13B    8192      Full FT   Model
Llama-2-13b-longlora-16k-ft   13B    16384     Full FT   Model
Llama-2-13b-longlora-32k-ft   13B    32768     Full FT   Model

Models with context extension via improved LoRA fine-tuning

Model                           Size   Context   Train   Link
Llama-2-7b-longlora-8k          7B     8192      LoRA+   LoRA-weight
Llama-2-7b-longlora-16k         7B     16384     LoRA+   LoRA-weight
Llama-2-7b-longlora-32k         7B     32768     LoRA+   LoRA-weight
Llama-2-13b-longlora-8k         13B    8192      LoRA+   LoRA-weight
Llama-2-13b-longlora-16k        13B    16384     LoRA+   LoRA-weight
Llama-2-13b-longlora-32k        13B    32768     LoRA+   LoRA-weight
Llama-2-13b-longlora-64k        13B    65536     LoRA+   LoRA-weight
Llama-2-70b-longlora-32k        70B    32768     LoRA+   LoRA-weight
Llama-2-70b-chat-longlora-32k   70B    32768     LoRA+   LoRA-weight

Training

Pre-trained weights

We use LLaMA2 models as the pre-trained weights and fine-tune them to long context window sizes. Download the weights based on your choice.

Pre-trained weights
Llama-2-7b-hf
Llama-2-13b-hf
Llama-2-70b-hf
Llama-2-7b-chat-hf
Llama-2-13b-chat-hf
Llama-2-70b-chat-hf

This project also supports GPTNeoX models as the base model architecture. Some candidate pre-trained weights may include GPT-NeoX-20B, Polyglot-ko-12.8B and other variants.

Fine-tuning

torchrun --nproc_per_node=8 fine-tune.py  \
        --model_name_or_path path_to/Llama-2-7b-hf \
        --bf16 True \
        --output_dir path_to_saving_checkpoints       \
        --cache_dir path_to_cache \
        --model_max_length 8192 \
        --use_flash_attn True \
        --low_rank_training False \
        --num_train_epochs 1  \
        --per_device_train_batch_size 1     \
        --per_device_eval_batch_size 2     \
        --gradient_accumulation_steps 8     \
        --evaluation_strategy "no"     \
        --save_strategy "steps"     \
        --save_steps 1000     \
        --save_total_limit 2     \
        --learning_rate 2e-5     \
        --weight_decay 0.0     \
        --warmup_steps 20     \
        --lr_scheduler_type "constant_with_warmup"     \
        --logging_steps 1     \
        --deepspeed "ds_configs/stage2.json" \
        --tf32 True \
        --max_steps 1000
  • Please remember to change path_to/Llama-2-7b-hf, path_to_saving_checkpoints, and path_to_cache to your own directories.
  • Note that you can change model_max_length to other values; the context window is extended via linear RoPE position interpolation (see the sketch below).
  • You can change ds_configs/stage2.json to ds_configs/stage3.json if you want.
  • Please set use_flash_attn to False if you use V100 machines or have not installed flash attention.
  • You can set low_rank_training to False if you want full fine-tuning. This costs more GPU memory and is slower, but performance will be a bit better.
  • When training is finished, to get the full model weight:
cd path_to_saving_checkpoints && python zero_to_fp32.py . pytorch_model.bin

Note that path_to_saving_checkpoints might be the global_step directory, depending on the DeepSpeed version.
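
The context extension itself is done with linear RoPE position interpolation: the training scripts scale the rope_scaling factor of the LLaMA config so that model_max_length fits inside the interpolated position range. A rough sketch of that logic (assuming the Hugging Face transformers config API; see fine-tune.py for the exact code):

import math
import transformers

model_max_length = 8192                      # value passed via --model_max_length
model_path = "path_to/Llama-2-7b-hf"

config = transformers.AutoConfig.from_pretrained(model_path)

# LLaMA-2 is pre-trained with a 4096-token window; extend it by scaling RoPE linearly.
orig_ctx_len = getattr(config, "max_position_embeddings", None)
if orig_ctx_len and model_max_length > orig_ctx_len:
    scaling_factor = float(math.ceil(model_max_length / orig_ctx_len))
    config.rope_scaling = {"type": "linear", "factor": scaling_factor}

model = transformers.AutoModelForCausalLM.from_pretrained(model_path, config=config)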

Supervised Fine-tuning

torchrun --nproc_per_node=8 supervised-fine-tune.py  \
        --model_name_or_path path_to_Llama2_chat_models \
        --bf16 True \
        --output_dir path_to_saving_checkpoints       \
        --model_max_length 16384 \
        --use_flash_attn True \
        --data_path LongAlpaca-16k-length.json \
        --low_rank_training True \
        --num_train_epochs 5  \
        --per_device_train_batch_size 1     \
        --per_device_eval_batch_size 2     \
        --gradient_accumulation_steps 8     \
        --evaluation_strategy "no"     \
        --save_strategy "steps"     \
        --save_steps 98     \
        --save_total_limit 2     \
        --learning_rate 2e-5     \
        --weight_decay 0.0     \
        --warmup_steps 20     \
        --lr_scheduler_type "constant_with_warmup"     \
        --logging_steps 1     \
        --deepspeed "ds_configs/stage2.json" \
        --tf32 True
  • There is no need to run supervised fine-tuning on top of the context-extended fine-tuned models. It is fine to use base models such as the Llama2-chat models directly, as the amount of long instruction-following data is sufficient for SFT.
  • Our long instruction-following data can be found in LongAlpaca-12k.json.
  • Note that supervised-fine-tune.py can be replaced by supervised-fine-tune-qlora.py if you want to try 4-bit quantized fine-tuning for further GPU memory reduction. This follows QLoRA.
  • If you run into issues saving pytorch_model.bin after the QLoRA SFT, please refer to this issue.

Get trainable weights in low-rank training

In low-rank training, we set the embedding and normalization layers as trainable. Please use the following line to extract the trainable weights trainable_params.bin from pytorch_model.bin:

python3 get_trainable_weights.py --checkpoint_path path_to_saving_checkpoints --trainable_params "embed,norm"

Merge LoRA Weight

Merge the LoRA weights of pytorch_model.bin and the trainable parameters in trainable_params.bin, and save the resulting model to your desired path in the Hugging Face format:

python3 merge_lora_weights_and_save_hf_model.py \
        --base_model path_to/Llama-2-7b-hf \
        --peft_model path_to_saving_checkpoints \
        --context_size 8192 \
        --save_path path_to_saving_merged_model

For example,

python3 merge_lora_weights_and_save_hf_model.py \
        --base_model /dataset/pretrained-models/Llama-2-7b-hf \
        --peft_model /dataset/yukangchen/hf_models/lora-models/Llama-2-7b-longlora-8k \
        --context_size 8192 \
        --save_path /dataset/yukangchen/models/Llama-2-7b-longlora-8k-merged

Evaluation

Perplexity Validation

To evaluate a model that is trained in the low-rank setting, please set both base_model and peft_model. base_model is the pre-trained weight. peft_model is the path to the saved checkpoint, which should contain trainable_params.bin, adapter_model.bin and adapter_config.json. For example,

python3 eval.py --seq_len 8192 --context_size 8192 --batch_size 1 --base_model path_to/Llama-2-7b-hf --peft_model path_to_saving_checkpoints --data_path pg19/test.bin

Or evaluate with multiple GPUs as follows.

torchrun --nproc_per_node=auto eval_distributed.py --seq_len 8192 --context_size 8192 --batch_size 1 --base_model path_to/Llama-2-7b-hf --peft_model path_to_saving_checkpoints --data_path pg19/test.bin

To evaluate a model that is fully fine-tuned, you only need to set base_model as the path to the saved checkpoint, which should contain pytorch_model.bin and config.json. peft_model should be ignored.

python3 eval.py --seq_len 8192 --context_size 8192 --batch_size 1 --base_model path_to_saving_checkpoints --data_path pg19/test.bin

Or evaluate with multiple GPUs as follows.

torchrun --nproc_per_node=auto eval_distributed.py --seq_len 8192 --context_size 8192 --batch_size 1 --base_model path_to_saving_checkpoints --data_path pg19/test.bin
  • Note that --seq_len sets the sequence length for evaluation, while --context_size sets the context length the model was fine-tuned to. --seq_len should not be larger than --context_size.

  • We have already tokenized the validation and test splits of the PG19 and proof-pile datasets into pg19/validation.bin, pg19/test.bin, and proof-pile/test_sampled_data.bin with the LLaMA tokenizer. proof-pile/test_sampled_data.bin contains 128 documents randomly sampled from the full proof-pile test split. Each document has at least 32768 tokens. We also release the sampled ids in proof-pile/test_sampled_ids.bin. You can download them from the links below.

Dataset      Split        Link
PG19         validation   pg19/validation.bin
PG19         test         pg19/test.bin
Proof-pile   test         proof-pile/test_sampled_data.bin

Passkey Retrieval

We provide a way to test passkey retrieval accuracy. For example,

python3 passkey_retrivial.py \
        --context_size 32768 \
        --base_model path_to/Llama-2-7b-longlora-32k \
        --max_tokens 32768 \
        --interval 1000
  • Note that context_size is the context length used during fine-tuning.
  • max_tokens is the maximum document length in the passkey retrieval evaluation.
  • interval is the step by which the document length is increased. It is a rough number, because the document grows sentence by sentence.
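
For background, passkey retrieval hides a short random number inside long filler text and asks the model to repeat it back; accuracy is measured as a function of document length. The exact prompt is built inside passkey_retrivial.py; the following is a simplified, hypothetical sketch of such a prompt builder (the wording and filler sentences here are illustrative, not the script's exact text):

import random

def build_passkey_prompt(n_filler: int, passkey: int) -> str:
    # Filler text that carries no information, repeated to reach the target length.
    filler = ("The grass is green. The sky is blue. The sun is yellow. "
              "Here we go. There and back again. ")
    task = ("There is an important piece of information hidden inside a lot of "
            "irrelevant text. Find it and memorize it. I will quiz you about it.\n")
    key_line = f"The pass key is {passkey}. Remember it. {passkey} is the pass key.\n"
    question = "What is the pass key?"

    # Hide the key line at a random position among the filler sentences.
    insert_at = random.randint(0, n_filler)
    body = filler * insert_at + key_line + filler * (n_filler - insert_at)
    return task + body + question

prompt = build_passkey_prompt(n_filler=200, passkey=random.randint(10000, 99999))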

Demo

Local Inference

To chat with LongAlpaca models,

python3 inference.py  \
        --base_model path_to_model \
        --question $question \
        --context_size $context_length \
        --max_gen_len $max_gen_len \
        --flash_attn True \
        --material $material_content

To ask a question related to a book:

python3 inference.py  \
        --base_model /data/models/LongAlpaca-13B \
        --question "Why doesn't Professor Snape seem to like Harry?" \
        --context_size 32768 \
        --max_gen_len 512 \
        --flash_attn True \
        --material "materials/Harry Potter and the Philosophers Stone_section2.txt"

To ask a question related to a paper:

python3 inference.py  \
        --base_model /data/models/LongAlpaca-13B \
        --question "What are the main contributions and novelties of this work?" \
        --context_size 32768 \
        --max_gen_len 512 \
        --flash_attn True \
        --material "materials/paper1.txt"
  • Note that inference.py can be replaced by inference-qlora.py if you want to try 4-bit quantized inference for further GPU memory reduction. This follows QLoRA.

Online Demo

To deploy your own demo, run:

python3 demo.py  \
	--base_model path_to_model \
	--context_size $context_size \
	--max_gen_len $max_gen_len \
	--flash_attn True

Example

python3 demo.py  \
	--base_model /data/models/LongAlpaca-13B \
	--context_size 32768 \
	--max_gen_len 512 \
	--flash_attn True
  • Note that flash_attn=True will make generation slower but saves a lot of GPU memory.

Streaming Inference

We support inference of LongAlpaca models with StreamingLLM. This increases the context length of multi-round dialogue in StreamingLLM. Here is an example:

python run_streaming_llama_longalpaca.py \
	--enable_streaming \
	--test_filepath outputs_stream.json \
	--use_flash_attn True \
	--recent_size 32768
  • Please use a smaller recent_size, for example 8192, if you run into OOM issues.
  • test_filepath is the json file that contains the prompts for inference. We provide an example file, outputs_stream.json, which is a subset of LongAlpaca-12k. You can replace it with your own questions.

Data Generation via Pdf2text

During our dataset collection, we convert papers and books from PDF to text. The conversion quality has a large influence on the final model quality, and we think this step is non-trivial. We release the tool for the pdf2txt conversion in the folder pdf2txt. It is built upon pdf2image, easyocr, ditod and detectron2. Please refer to the README.md in pdf2txt for more details.
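
As a rough illustration of that pipeline (not the exact code in pdf2txt, which additionally uses ditod and detectron2 for layout analysis), a minimal page-by-page OCR pass with pdf2image and easyocr could look like this; "paper1.pdf" is a placeholder path:

import numpy as np
from pdf2image import convert_from_path  # requires poppler to be installed on the system
import easyocr

reader = easyocr.Reader(["en"])  # downloads/loads the English OCR model once

def pdf_to_text(pdf_path: str) -> str:
    # Rasterize each PDF page to an image, then run OCR over it.
    pages = convert_from_path(pdf_path, dpi=200)
    page_texts = []
    for page in pages:
        words = reader.readtext(np.array(page), detail=0)  # detail=0 returns plain strings
        page_texts.append(" ".join(words))
    return "\n\n".join(page_texts)

print(pdf_to_text("paper1.pdf")[:500])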

Examples

Citation

If you find this project useful in your research, please consider citing:

@article{longlora,
  title={LongLoRA: Efficient Fine-tuning of Long-Context Large Language Models},
  author={Yukang Chen and Shengju Qian and Haotian Tang and Xin Lai and Zhijian Liu and Song Han and Jiaya Jia},
  journal={arXiv:2309.12307},
  year={2023}
}
@misc{long-alpaca,
  author = {Yukang Chen and Shaozuo Yu and Shengju Qian and Haotian Tang and Xin Lai and Zhijian Liu and Song Han and Jiaya Jia},
  title = {Long Alpaca: Long-context Instruction-following models},
  year = {2023},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/dvlab-research/LongLoRA}},
}

Acknowledgement

License

  • LongLoRA is licensed under the Apache License 2.0. This means that it requires the preservation of copyright and license notices.
  • Data and weights are under the CC-BY-NC 4.0 license. They are licensed for research use only, and only non-commercial use is allowed. Models trained using the dataset should not be used outside of research purposes.

longlora's People

Contributors

aoyuqc, es94129, gianlucamacri, j-frei, jayxio, lugzsport0, naubull2, netrunner-a, ruanwz, thesouthfrog, weicheng113, weicheng113113, x-lai, xinyuuzhou, xiuyu-li, yukang2017


longlora's Issues

Is it LangChain compatible?

I tried using the model from Hugging Face and got the following error:

110 # We have a serialization from tokenizers which let us directly build the backend
--> 111 fast_tokenizer = TokenizerFast.from_file(fast_tokenizer_file)
112 elif slow_tokenizer is not None:
113 # We need to convert a slow tokenizer to build the backend

Exception: No such file or directory (os error 2)

Are 32k+ datasets required to SFT for 32k+ lengths

I note the following remark in the README:

During our dataset collection, it is hard for us to collect many high-quality QA that are larger than 32768. Thus, if you use our LongQA.json, please also set model_max_length as 32768.

Are you saying that SFT for 32k+ (say 100k) context requires a meaningful amount of samples that are 32k-100k in length?

I.e., is it insufficient to take a model fine-tuned for 100k context and then SFT on lengths of only 32k?

Appreciate your clarification and reasoning around this.

About the data type

Thanks for your brilliant idea and for open-sourcing it!

A question about the data type:
I notice that bf16 and tf32 are both set to True. I'm confused about what will happen. Can they be used together?

The leakage problem of the causal mask after shifting?

After shifting, the first half-group tokens and the last half-group tokens will be in the same group. Do I understand your method correctly?

If so, the first half-group tokens will attend to the last half-group tokens, which violates the objective of language modeling.
Have you taken this problem into consideration?

Tokenizer padding & vocab size

Hi Yukang,

Was there a reason you added a new pad token and expanded the vocab instead of using the EOS or BOS token for pad, or setting pad_token_id = 0, like some of the other training scripts/repos do?

Also, was there a reason for setting padding_side = "right" (instead of "left" which seems to be more common)?

Finally, did you intend to use model.resize_token_embeddings(len(tokenizer), pad_to_multiple_of=64) instead of model.resize_token_embeddings(len(tokenizer)) in smart_tokenizer_and_embedding_resize(), or did you have a reason NOT to be divisible by 64?

Now that you have done this for this model, I think everyone else deriving from this model with other LoRAs should also follow this convention and appropriately modify their tokenizers, though it needs to be properly documented because it is different from other LoRAs and will affect anyone doing batch inference later.

Thanks!

EDIT: From huggingface/transformers#25022 original Llama also used padding_side = "right", like this repo, which seems opposite to all the popular training and inference repos out there!
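
For context, the pattern under discussion is the Stanford-Alpaca-style special-token resize that this repo's training scripts build on. Roughly, as a sketch of the standard recipe (not necessarily this repo's exact code):

def smart_tokenizer_and_embedding_resize(special_tokens_dict, tokenizer, model):
    # Add special tokens (e.g. a new [PAD] token) and resize the embedding matrices;
    # new rows are initialized to the mean of the existing embeddings.
    num_new_tokens = tokenizer.add_special_tokens(special_tokens_dict)
    model.resize_token_embeddings(len(tokenizer))  # note: no pad_to_multiple_of here
    if num_new_tokens > 0:
        input_embeddings = model.get_input_embeddings().weight.data
        output_embeddings = model.get_output_embeddings().weight.data
        input_embeddings[-num_new_tokens:] = input_embeddings[:-num_new_tokens].mean(dim=0, keepdim=True)
        output_embeddings[-num_new_tokens:] = output_embeddings[:-num_new_tokens].mean(dim=0, keepdim=True)

# Usage: smart_tokenizer_and_embedding_resize({"pad_token": "[PAD]"}, tokenizer, model)
# This yields a vocab size of 32001 for LLaMA-2, which is why it is not divisible by 64.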

Is fine-tuning on RedPajama necessary?

Have you done ablation experiments showing that fine-tuning on RedPajama is necessary?
I would like to know: without using the RedPajama dataset, if we directly run instruction fine-tuning on the LongQA dataset after PI or YaRN, can the model still acquire long-context dialogue ability?

Can't load 'Llama-2-13b-longlora-8k'

When I try to use the supervised fine-tuning script and change the model path to the LoRA version, Llama-2-13b-longlora-8k, the model cannot be loaded correctly and an error is raised:

Initializing TorchBackend in DeepSpeed with backend nccl
Traceback (most recent call last):
File "/users/PAA0201/shubaobao/anaconda3/envs/longlora_env/lib/python3.10/site-packages/transformers/modeling_utils.py", line 2785, in from_pretrained
resolved_archive_file = cached_file(pretrained_model_name_or_path, filename, **cached_file_kwargs)
File "/users/PAA0201/shubaobao/anaconda3/envs/longlora_env/lib/python3.10/site-packages/transformers/utils/hub.py", line 429, in cached_file
resolved_file = hf_hub_download(
File "/users/PAA0201/shubaobao/anaconda3/envs/longlora_env/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 110, in _inner_fn
validate_repo_id(arg_value)
File "/users/PAA0201/shubaobao/anaconda3/envs/longlora_env/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 158, in validate_repo_id
raise HFValidationError(
huggingface_hub.utils._validators.HFValidationError: Repo id must be in the form 'repo_name' or 'namespace/repo_name': '/dataset/pretrained-models/Llama-2-13b-hf'. Use repo_type argument if needed.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/users/PAA0201/shubaobao/LongLoRA/supervised-fine-tune.py", line 291, in <module>
train()
File "/users/PAA0201/shubaobao/LongLoRA/supervised-fine-tune.py", line 236, in train
model = transformers.AutoModelForCausalLM.from_pretrained(
File "/users/PAA0201/shubaobao/anaconda3/envs/longlora_env/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 563, in from_pretrained
return model_class.from_pretrained(
File "/users/PAA0201/shubaobao/anaconda3/envs/longlora_env/lib/python3.10/site-packages/transformers/modeling_utils.py", line 2857, in from_pretrained
raise EnvironmentError(
OSError: Can't load the model for '/dataset/pretrained-models/Llama-2-13b-hf'. If you were trying to load it from 'https://huggingface.co/models', make sure you don't have a local directory with the same name. Otherwise, make sure '/dataset/pretrained-models/Llama-2-13b-hf' is the correct path to a directory containing a file named pytorch_model.bin, tf_model.h5, model.ckpt or flax_model.msgpack.

(The same pair of tracebacks is repeated, interleaved, by each of the distributed ranks.)

ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 1357168) of binary: /users/PAA0201/shubaobao/anaconda3/envs/longlora_env/bin/python

A question about training and decoding.

Is my understanding correct: during decoding, S^2 attention is not needed, and you just do normal full attention? During training, S^2 attention is needed because full attention would run out of GPU memory. Then I'm confused: can S^2 attention also be used during decoding? That way training and decoding would match, which should give the best results. But if decoding uses S^2 attention, wouldn't many things have to be recomputed, since the shift changes the contents of each group?

Error when loading the model

I downloaded the model llama-2-13b-chat-longlora-32k-sft from Hugging Face. When running inference with it, I get an error (see screenshot).
What is the reason?
Also, the README says:
To chat with Llama-2-13b-chat-longlora-32k-sft or Llama-2-70b-chat-longlora-32k-sft, you need to run merge_lora_weights_and_save_hf_model.py first

But the llama-2-13b-chat-longlora-32k-sft model is already merged and is not LoRA weights, so what is the point of running that script?

Failed to convert the model into GGML format

I tried to convert Llama-2-13b-chat-longlora-32k-sft into GGML format with the ggml convert.py. The command line is

python convert.py ./Llama-2-13b-chat-longlora-32k-sft --outfile ./Llama-2-13b-chat-longlora-32k-sft/Llama-2-13b-chat-longlora-32k-sft-ggml.bin

The error thrown in the converting process is

Exception: Vocab size mismatch (model has 32001, but Llama-2-13b-chat-longlora-32k-sft/tokenizer.model has 32000).  Most likely you are missing added_tokens.json (should be in Llama-2-13b-chat-longlora-32k-sft).

The discussion of a similar issue shows that the vocab_size = 32001 cannot be divided by 256. llama.cpp runs the check.

I tried to update vocab_size in config.json to 32000. The conversion and quantization processes then work without any error reported. However, it failed when I ran a test with the quantized model: ./main -m Llama-2-13b-chat-longlora-32k-sft-ggml-q4_0.bin -n 128. The error is

error loading model: create_tensor: tensor 'token_embd.weight' has wrong shape; expected  5120, 32000, got  5120, 32001,     1,     1

The error means that updating vocab_size does not solve the issue. Could you please help with it? Thanks a lot!

Merging LoRA weights of llama2_70B_chat_hf with adapter weights

Thank you for your great work. I attempted to download the weights and integrate them with the original LLama2_70b_chat_hf 4k-context-length model, following the instructions provided in the 'merge_lora_weights_and_save_hf_model.py' script. Unfortunately, I encountered a size mismatch issue related to 'base_model.model.model.layer', which is explained in more detail in the attached screenshot.

If you have any suggestions or advice on how to resolve this issue, I would greatly appreciate it. Thank you for your assistance.

Maximize the Impact of Your Model with a Gradio demo on the Hugging Face Hub

Hi,

Very cool work! It would be nice to create a research demo using Gradio on the Hugging Face Hub!
Some of the benefits of sharing your models through the Hub would be:

  • Wider reach of your work to the ecosystem
  • Seamless integration with popular libraries and frameworks, enhancing usability
  • Real-time feedback and collaboration opportunities with a global community of researchers and developers

This is a step-by-step guide explaining the process in case you're interested. 😊 These are the docs on Community GPU Grants.

Building wheel for flash-attn (setup.py) ... error

An error occurs when I run pip install flash-attn --no-build-isolation with Python 3.9.18:

Building wheels for collected packages: flash-attn
  Building wheel for flash-attn (setup.py) ... error
  error: subprocess-exited-with-error
  
  × python setup.py bdist_wheel did not run successfully.
  │ exit code: 1
  ╰─> [9 lines of output]
      fatal: Not a git repository (or any of the parent directories): .git
      
      
      torch.__version__  = 2.0.1+cu117
      
      
      running bdist_wheel
      Guessing wheel URL:  https://github.com/Dao-AILab/flash-attention/releases/download/v2.3.1.post1/flash_attn-2.3.1.post1+cu117torch2.0cxx11abiFALSE-cp39-cp39-linux_x86_64.whl
      error: <urlopen error [Errno -2] Name or service not known>
      [end of output]
  
  note: This error originates from a subprocess, and is likely not a problem with pip.
  ERROR: Failed building wheel for flash-attn
  Running setup.py clean for flash-attn
Failed to build flash-attn
ERROR: Could not build wheels for flash-attn, which is required to install pyproject.toml-based projects

thanks for help!

Shift left or right?

qkv[:, :, :, self.num_heads//2:] = qkv[:, :, :, self.num_heads//2:].roll(-group_size//2, dims=1)

Why shift left instead of shifting right?

Is it to impact a minimal number of tokens? I.e., shifting left would only affect the first several tokens, whereas shifting right would affect many more tokens?
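
To make the grouping concrete, here is a small stand-alone illustration (not the repository code) of what that roll does with 16 tokens and a group size of 4: the unshifted heads attend within contiguous groups, while the shifted heads see groups offset by half a group, with the last group wrapping around. The wrap-around is what the attention mask has to handle so that causality is preserved.

import torch

seq_len, group = 16, 4
tokens = torch.arange(seq_len)

# Unshifted heads: contiguous groups [0-3], [4-7], [8-11], [12-15]
print(tokens.view(seq_len // group, group))

# Shifted heads: roll the sequence left by half a group before grouping
shifted = tokens.roll(-group // 2, dims=0)
print(shifted.view(seq_len // group, group))
# tensor([[ 2,  3,  4,  5],
#         [ 6,  7,  8,  9],
#         [10, 11, 12, 13],
#         [14, 15,  0,  1]])   <- the last group wraps tokens 14, 15 together with 0, 1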

How well does it work for Chinese?

Have you run comparison experiments on a Chinese base model? How well does it support Chinese?

Error about finetuning lora

Thanks for the brilliant work!

It raised an error when I tried to fine-tune the llama2 13b 8k LoRA weights. Could you let me know how to solve this?

File "/users/PAA0201/shubaobao/anaconda3/envs/longlora_env/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 848, in forward
shift_logits = shift_logits.view(-1, self.config.vocab_size)
File "/users/PAA0201/shubaobao/anaconda3/envs/longlora_env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1538, in _call_impl
result = forward_call(*args, **kwargs)
File "/users/PAA0201/shubaobao/anaconda3/envs/longlora_env/lib/python3.10/site-packages/peft/tuners/tuners_utils.py", line 94, in forward
return self.model.forward(*args, **kwargs)
File "/users/PAA0201/shubaobao/anaconda3/envs/longlora_env/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 848, in forward
shift_logits = shift_logits.view(-1, self.config.vocab_size)
File "/users/PAA0201/shubaobao/anaconda3/envs/longlora_env/lib/python3.10/site-packages/peft/peft_model.py", line 918, in forward
return self.base_model(
RuntimeError: shape '[-1, 0]' is invalid for input of size 85058658
File "/users/PAA0201/shubaobao/anaconda3/envs/longlora_env/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 848, in forward
shift_logits = shift_logits.view(-1, self.config.vocab_size)
RuntimeError: shape '[-1, 0]' is invalid for input of size 101923185
File "/users/PAA0201/shubaobao/anaconda3/envs/longlora_env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1538, in _call_impl
result = forward_call(*args, **kwargs)
RuntimeError: shape '[-1, 0]' is invalid for input of size 130148067
File "/users/PAA0201/shubaobao/anaconda3/envs/longlora_env/lib/python3.10/site-packages/peft/tuners/tuners_utils.py", line 94, in forward
return self.model.forward(*args, **kwargs)
File "/users/PAA0201/shubaobao/anaconda3/envs/longlora_env/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 848, in forward
shift_logits = shift_logits.view(-1, self.config.vocab_size)
RuntimeError: shape '[-1, 0]' is invalid for input of size 85154661

flash attention forward vs forward full?

I don't quite understand the difference between forward_flashattn and forward_flashattn_full in your script. It seems that the full version does not do any shift attention, am I right? Then why did you not use forward_flashattn for supervised fine-tuning, or in other words, not do any shift attention for supervised fine-tuning?

A question about the LongLoRA implementation code

The paper mentions: "Particularly, it can be implemented with only two lines of code in training, while being optional in inference."
Which two lines of code implement this, specifically? If I want to port it to other code, which parts should I copy?

Tensor Size Mismatched Error During Inference

Hi, thanks for this great project. While running inference.py with the Llama-2-13b-longlora-32k model, I am getting a tensor size mismatch ValueError (setting [32000, 5120] in weights that have a size of [32001, 5120]). Please see the details at the end of this message. Can you please let me know if I am doing something wrong, or if there is a way to resolve it?

/home/code/dev/LongLora/env/bin/python /home/code/dev/LongLora/LongLoRA/inference.py
Downloading (…)fetensors.index.json: 
Traceback (most recent call last):
  File "/home/code/dev/LongLora/LongLoRA/inference.py", line 138, in <module>
    main(args)
  File "/home/code/dev/LongLora/LongLoRA/inference.py", line 100, in main
    model = transformers.AutoModelForCausalLM.from_pretrained(
  File "/home/code/dev/LongLora/env/lib/python3.9/site-packages/transformers/models/auto/auto_factory.py", line 563, in from_pretrained
    return model_class.from_pretrained(
  File "/home/code/dev/LongLora/env/lib/python3.9/site-packages/transformers/modeling_utils.py", line 3187, in from_pretrained
    ) = cls._load_pretrained_model(
  File "/home/code/dev/LongLora/env/lib/python3.9/site-packages/transformers/modeling_utils.py", line 3575, in _load_pretrained_model
    new_error_msgs, offload_index, state_dict_index = _load_state_dict_into_meta_model(
  File "/home/code/dev/LongLora/env/lib/python3.9/site-packages/transformers/modeling_utils.py", line 745, in _load_state_dict_into_meta_model
    set_module_tensor_to_device(model, param_name, param_device, **set_module_kwargs)
  File "/home/code/dev/LongLora/env/lib/python3.9/site-packages/accelerate/utils/modeling.py", line 285, in set_module_tensor_to_device
    raise ValueError(
ValueError: Trying to set a tensor of shape torch.Size([32000, 5120]) in "weight" (which has shape torch.Size([32001, 5120])), this look incorrect.

Supervised Fine-tuning notes

Last night I SFT (with LoRA) trained the Llama 7B chat model for 3 epochs on a dataset of 1,000 rows containing instruction-output pairs varying in length up to 32,000 tokens, taken from here.

I had to adjust the code in a number of ways (see here):

  • set the model type to llama in the sft script
  • uncomment the prompt format to allow for instruction - output pairs
  • add lines to the SFT script to save the embed and norm layers as trainable_params.bin at the end of the training
  • Note: the trainer doesn't save checkpoints properly when using LoRA because it expects a full model.

I'm not entirely sure if the way I'm saving trainable params is correct:

        trainable_params = {n: p for n, p in model.named_parameters() if any([k in n for k in training_args.trainable_params.split(",")])}

        # Convert trainable_params to state_dict format
        trainable_params_state_dict = {n: p.data for n, p in trainable_params.items()}
    # ... (later in the script)
    # Save only from master node to avoid duplicate saves
    if torch.distributed.get_rank() == 0:
        torch.save(trainable_params_state_dict, f"{training_args.output_dir}/trainable_params.bin")

See the diff for better clarity.

wandb

The run is here.

Loss went from about 6 to about 2 after 6 hours and 3 epochs of training on 2 x A6000s.

The finished merged model is here to try: https://huggingface.co/Trelis/Llama-2-7b-chat-hf-32k

The model outputs garbage for any length above 4k context (the original Llama length). Lots of blank lines mixed with characters.

Thoughts on the training:

  • Am I facing these issues because of starting with a chat base model? I don't think so, but it could be.
  • I set the rope scaling to 8, and linear. But I didn't change rope theta from the base model. Could this cause issues?
  • Is there a mistake in how I am saving the embed and norm trainable weights?

I greatly appreciate any remarks for improvement. Thanks.

Passkey Retrieval

Do you have any results on passkey retrieval for 16k and 32k? I see topic retrieval in the paper, but I'm unclear whether/how that is different, and there is only data up to 13k. Thanks

Hardware Requirements for Finetuning and Inference?

Hello,

Great work on the paper and the models. Truly groundbreaking; I am not sure why your project is not being talked about more.

Could you please provide the GPU RAM required to:

  1. Fine tune your models
  2. Run Inference on them.

I would highly appreciate it if you can provide the information for all of your variants, especially the Llama2 70B 32k and the Llama2 13B 32k.

Error in Finetuning

Hi All
Thank you for this great project. While trying out the fine-tune.py script, I got the error below despite installing all the required modules, including flash-attn (with --no-build-isolation). How could this be rectified? Thanks.

ModuleNotFoundError Traceback (most recent call last)
in <cell line: 11>()
9 from torch.utils.data import Dataset
10 from transformers import Trainer, DataCollatorForLanguageModeling
---> 11 from llama_attn_replace import replace_llama_attn
12 from peft import LoraConfig, get_peft_model
13 from torch.distributed import barrier

ModuleNotFoundError: No module named 'llama_attn_replace'



I also got the error below while running the Supervised Fine-tuning script. Error:
WARNING:torch.distributed.run:


Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.


2023-09-29 02:23:07.921163: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
2023-09-29 02:23:07.941552: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
2023-09-29 02:23:08.336867: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
2023-09-29 02:23:08.784676: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
2023-09-29 02:23:09.024703: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
2023-09-29 02:23:09.052531: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
2023-09-29 02:23:09.285467: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
2023-09-29 02:23:09.403561: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
Traceback (most recent call last):
File "/content/supervised-fine-tune.py", line 29, in <module>
from llama_attn_replace import replace_llama_attn
ModuleNotFoundError: No module named 'llama_attn_replace'

(The same traceback is printed by each of the 8 ranks.)
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 7451) of binary: /usr/bin/python3
Traceback (most recent call last):
File "/usr/local/bin/torchrun", line 8, in
sys.exit(main())
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 346, in wrapper
return f(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 794, in main
run(args)
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 785, in run
elastic_launch(
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 134, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

supervised-fine-tune.py FAILED

Failures:
[1]:
time : 2023-09-29_02:23:33
host : ebc31241e54d
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 7452)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
time : 2023-09-29_02:23:33
host : ebc31241e54d
rank : 2 (local_rank: 2)
exitcode : 1 (pid: 7453)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
time : 2023-09-29_02:23:33
host : ebc31241e54d
rank : 3 (local_rank: 3)
exitcode : 1 (pid: 7454)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[4]:
time : 2023-09-29_02:23:33
host : ebc31241e54d
rank : 4 (local_rank: 4)
exitcode : 1 (pid: 7455)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[5]:
time : 2023-09-29_02:23:33
host : ebc31241e54d
rank : 5 (local_rank: 5)
exitcode : 1 (pid: 7456)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[6]:
time : 2023-09-29_02:23:33
host : ebc31241e54d
rank : 6 (local_rank: 6)
exitcode : 1 (pid: 7457)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[7]:
time : 2023-09-29_02:23:33
host : ebc31241e54d
rank : 7 (local_rank: 7)
exitcode : 1 (pid: 7458)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Root Cause (first observed failure):
[0]:
time : 2023-09-29_02:23:33
host : ebc31241e54d
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 7451)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

`cuda:0` in merge_lora_weights_and_save_hf_model

Hello,

I am currently trying to run inference and that are my steps:

  1. Set-up env
  2. Run:
python3 merge_lora_weights_and_save_hf_model.py \
        --base_model meta-llama/Llama-2-70b-chat-hf \
        --peft_model Yukang/Llama-2-70b-chat-longlora-32k \
        --context_size 32768 \
        --save_path ./models/Llama-2-70b-longlora-32k

In this case the models are downloaded from Hugging Face. I noticed that in merge_lora_weights_and_save_hf_model.py you set device=cuda:0. Because of that, loading the model shards is quite slow. Is this required for this script? Can I use multi-GPU? I am currently using 8xA10G with 23GB memory each.

Thank you in advance!

Inference.py taking too long

I am first trying to run inference with a LongLoRA model as-is, i.e. with no further fine-tuning or instruction fine-tuning. I am trying this in a Colab notebook with an A100 GPU, but it is taking far too long. Can I do this, or must I fine-tune first? It does not make sense to me, since these are already fine-tuned models. But it also seems odd that the context window, even for models with 32k context, shows 4096.

Error in FineTuning

I am facing an error while fine-tuning:

    raise ValueError("q_len %d should be divisible by group size %d."%(q_len, group_size))
ValueError: q_len 583 should be divisible by group size 145.

Text Generation after Supervised Fine Tuning of the model

Hi,
I use the Yukang/Llama-2-13b-longlora-16k-ft model as the base model:

  1. Supervised fine-tuning using my own dataset.
  2. Running python zero_to_fp32.py . pytorch_model.bin -> which creates the model file in the same directory.
  3. The saved folder is missing a config.json, so I copy the base model's config.json.
  4. Load the model in the huggingface text_generation pipeline.

import math 

# replace_llama_attn()
config = transformers.AutoConfig.from_pretrained(model_fl1)

context_size = 16328
orig_ctx_len = getattr(config, "max_position_embeddings", None) # this value should be 4096 for LLaMA2 models
if orig_ctx_len and context_size > orig_ctx_len:
    scaling_factor = float(math.ceil(context_size / orig_ctx_len))
    config.rope_scaling = {"type": "linear", "factor": scaling_factor}

model = transformers.AutoModelForCausalLM.from_pretrained(
    model_fl1,
    config=config,
    torch_dtype=torch.bfloat16,
    device_map="cuda:0",
)
model.resize_token_embeddings(32001)

tokenizerfl1 = AutoTokenizer.from_pretrained(model_fl1)
pipelinefl1 = transformers.pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizerfl1,
    torch_dtype=torch.bfloat16,
    max_new_tokens=1000,
    return_full_text=False,
    eos_token_id=2
)

  5. Run it on multiple prompts.

Output: all prompts produce an empty string as the generated text. Something is clearly wrong.
[{'generated_text': ''}]

I tried a model with very little SFT and still got the same issue. It seems like a problem with saving the model somehow, or maybe a format issue. Please help debug.

Stack RuntimeError

Hi, I have merged the 70b SFT chat model with the original 70b-chat-hf model. When I try to do inference, I get:

File "/opt/conda/envs/ptca/lib/python3.8/site-packages/transformers/models/llama/modeling_llama.py", line 424, in forward
hidden_states, self_attn_weights, present_key_value = self.self_attn(
File "/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/conda/envs/ptca/lib/python3.8/site-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/app/src/llm_workflow/steps/inference/src/llama_attn_replace.py", line 77, in forward_flashattn
qkv = torch.stack(
RuntimeError: stack expects each tensor to be equal size, but got [1, 64, 1, 128] at entry 0 and [1, 64, 1284, 128] at entry 1

I'm using 4xA100 (80G) to do the inference.

LICENSE

Thanks for this awesome project! Can you please add a LICENSE file to your repository so that others can build upon your work? MIT-0 and Apache-2 are both great choices, as are options at https://creativecommons.org/choose/

TypeError: forward_noflashattn() got an unexpected keyword argument 'padding_mask'

torchrun --nproc_per_node=4 /home/v100/LongLoRA/fine-tune.py  \
        --model_name_or_path meta-llama/Llama-2-7b-chat-hf \
        --bf16 False \
        --output_dir /home/v100/LongLoRA/output       \
        --cache_dir path_to_cache \
        --model_max_length 2048 \
        --use_flash_attn False \
        --low_rank_training True \
        --num_train_epochs 1  \
        --per_device_train_batch_size 1     \
        --per_device_eval_batch_size 1     \
        --gradient_accumulation_steps 8     \
        --evaluation_strategy "no"     \
        --save_strategy "steps"     \
        --save_steps 1000     \
        --save_total_limit 2     \
        --learning_rate 2e-5     \
        --weight_decay 0.0     \
        --warmup_steps 20     \
        --lr_scheduler_type "constant_with_warmup"     \
        --logging_steps 1     \
        --deepspeed "/home/v100/LongLoRA/ds_configs/stage2.json" \
        --tf32 False \
        --max_steps 1000\
        --model_type llama

I am using 4x V100 (16GB) GPUs.

Full error trace:
WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
[2023-10-04 18:14:38,648] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-10-04 18:14:38,648] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-10-04 18:14:38,687] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-10-04 18:14:38,688] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-10-04 18:14:41,319] [INFO] [comm.py:637:init_distributed] cdb=None
[2023-10-04 18:14:41,319] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[2023-10-04 18:14:41,343] [INFO] [comm.py:637:init_distributed] cdb=None
[2023-10-04 18:14:41,350] [INFO] [comm.py:637:init_distributed] cdb=None
[2023-10-04 18:14:41,352] [INFO] [comm.py:637:init_distributed] cdb=None
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:05<00:00,  2.97s/it]
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:05<00:00,  2.98s/it]
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:06<00:00,  3.03s/it]
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:05<00:00,  2.70s/it]
Using pad_token, but it is not set yet.
Using pad_token, but it is not set yet.
Using pad_token, but it is not set yet.
Using pad_token, but it is not set yet.
Token indices sequence length is longer than the specified maximum sequence length for this model (73728 > 8192). Running this sequence through the model will result in indexing errors
(The warning above is printed repeatedly while the dataset is being mapped.)
Map (num_proc=12): 100%|████████████████████████████████████████████████████████████████████████████████████| 676/676 [00:01<00:00, 613.20 examples/s]
Dataset({
    features: ['input_ids'],
    num_rows: 108
})
Map (num_proc=12):   0%|                                                                                               | 0/676 [00:00<?, ? examples/s]Token indices sequence length is longer than the specified maximum sequence length for this model (73728 > 8192). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (73728 > 8192). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (73728 > 8192). Running this sequence through the model will result in indexing errors
Map (num_proc=12):   8%|███████▏                                                                             | 57/676 [00:00<00:04, 133.28 examples/s]Token indices sequence length is longer than the specified maximum sequence length for this model (73728 > 8192). Running this sequence through the model will result in indexing errors
Map (num_proc=12):   8%|███████▏                                                                             | 57/676 [00:00<00:04, 125.82 examples/s]Token indices sequence length is longer than the specified maximum sequence length for this model (73728 > 8192). Running this sequence through the model will result in indexing errors
Map (num_proc=12):   8%|███████▏                                                                             | 57/676 [00:00<00:05, 123.57 examples/s]Token indices sequence length is longer than the specified maximum sequence length for this model (73728 > 8192). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (73728 > 8192). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (73728 > 8192). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (73728 > 8192). Running this sequence through the model will result in indexing errors
Map (num_proc=12):  17%|██████████████▏                                                                     | 114/676 [00:00<00:02, 215.70 examples/s]Token indices sequence length is longer than the specified maximum sequence length for this model (73728 > 8192). Running this sequence through the model will result in indexing errors
Map (num_proc=12):  34%|████████████████████████████▎                                                       | 228/676 [00:00<00:01, 420.50 examples/s]Token indices sequence length is longer than the specified maximum sequence length for this model (73728 > 8192). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (73728 > 8192). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (73728 > 8192). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (73728 > 8192). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (73728 > 8192). Running this sequence through the model will result in indexing errors
Map (num_proc=12):  34%|████████████████████████████▏                                                       | 227/676 [00:00<00:01, 381.37 examples/s]Token indices sequence length is longer than the specified maximum sequence length for this model (73728 > 8192). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (73728 > 8192). Running this sequence through the model will result in indexing errors
Map (num_proc=12):  34%|████████████████████████████▎                                                       | 228/676 [00:00<00:01, 337.02 examples/s]Token indices sequence length is longer than the specified maximum sequence length for this model (73728 > 8192). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (73728 > 8192). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (73728 > 8192). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (73728 > 8192). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (73728 > 8192). Running this sequence through the model will result in indexing errors
Map (num_proc=12):  67%|████████████████████████████████████████████████████████▏                           | 452/676 [00:00<00:00, 763.51 examples/s]Token indices sequence length is longer than the specified maximum sequence length for this model (73728 > 8192). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (73728 > 8192). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (73728 > 8192). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (73728 > 8192). Running this sequence through the model will result in indexing errors
Map (num_proc=12):  75%|███████████████████████████████████████████████████████████████                     | 508/676 [00:01<00:00, 645.31 examples/s]Token indices sequence length is longer than the specified maximum sequence length for this model (73728 > 8192). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (73728 > 8192). Running this sequence through the model will result in indexing errors
Map (num_proc=12):  75%|███████████████████████████████████████████████████████████████                     | 508/676 [00:01<00:00, 670.60 examples/s]Token indices sequence length is longer than the specified maximum sequence length for this model (73728 > 8192). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (73728 > 8192). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (73728 > 8192). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (73728 > 8192). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (73728 > 8192). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (73728 > 8192). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (73728 > 8192). Running this sequence through the model will result in indexing errors
Map (num_proc=12):  92%|█████████████████████████████████████████████████████████████████████████████       | 620/676 [00:01<00:00, 702.35 examples/s]Token indices sequence length is longer than the specified maximum sequence length for this model (73728 > 8192). Running this sequence through the model will result in indexing errors
Map (num_proc=12): 100%|████████████████████████████████████████████████████████████████████████████████████| 676/676 [00:01<00:00, 541.41 examples/s]
Map (num_proc=12): 100%|████████████████████████████████████████████████████████████████████████████████████| 676/676 [00:01<00:00, 533.94 examples/s]
Map (num_proc=12): 100%|████████████████████████████████████████████████████████████████████████████████████| 676/676 [00:01<00:00, 525.23 examples/s]
Dataset({
    features: ['input_ids'],
    num_rows: 108
})
Dataset({
    features: ['input_ids'],
    num_rows: 108
})
Dataset({
    features: ['input_ids'],
    num_rows: 108
})
Rank: 0 partition count [4, 4] and sizes[(34866176, False), (66560, False)] 
Rank: 3 partition count [4, 4] and sizes[(34866176, False), (66560, False)] 
Rank: 1 partition count [4, 4] and sizes[(34866176, False), (66560, False)] 
Rank: 2 partition count [4, 4] and sizes[(34866176, False), (66560, False)] 
wandb: (1) Create a W&B account
wandb: (2) Use an existing W&B account
wandb: (3) Don't visualize my results
wandb: Enter your choice: 3
wandb: You chose "Don't visualize my results"
wandb: Tracking run with wandb version 0.15.12
wandb: W&B syncing is set to `offline` in this directory.  
wandb: Run `wandb online` or set WANDB_MODE=online to enable cloud syncing.
  0%|                                                                                                                        | 0/1000 [00:00<?, ?it/s]You're using a LlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
You're using a LlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
You're using a LlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
You're using a LlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
Traceback (most recent call last):
Traceback (most recent call last):
  File "/home/v100/LongLoRA/fine-tune.py", line 207, in <module>
  File "/home/v100/LongLoRA/fine-tune.py", line 207, in <module>
Traceback (most recent call last):
  File "/home/v100/LongLoRA/fine-tune.py", line 207, in <module>
            train()train()train()


  File "/home/v100/LongLoRA/fine-tune.py", line 201, in train
  File "/home/v100/LongLoRA/fine-tune.py", line 201, in train
  File "/home/v100/LongLoRA/fine-tune.py", line 201, in train
        trainer.train()trainer.train()

  File "/home/v100/lla2/lib/python3.10/site-packages/transformers/trainer.py", line 1591, in train
  File "/home/v100/lla2/lib/python3.10/site-packages/transformers/trainer.py", line 1591, in train
    trainer.train()
  File "/home/v100/lla2/lib/python3.10/site-packages/transformers/trainer.py", line 1591, in train
        return inner_training_loop(return inner_training_loop(

      File "/home/v100/lla2/lib/python3.10/site-packages/transformers/trainer.py", line 1892, in _inner_training_loop
return inner_training_loop(  File "/home/v100/lla2/lib/python3.10/site-packages/transformers/trainer.py", line 1892, in _inner_training_loop

  File "/home/v100/lla2/lib/python3.10/site-packages/transformers/trainer.py", line 1892, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
      File "/home/v100/lla2/lib/python3.10/site-packages/transformers/trainer.py", line 2776, in training_step
    tr_loss_step = self.training_step(model, inputs)tr_loss_step = self.training_step(model, inputs)

  File "/home/v100/lla2/lib/python3.10/site-packages/transformers/trainer.py", line 2776, in training_step
  File "/home/v100/lla2/lib/python3.10/site-packages/transformers/trainer.py", line 2776, in training_step
    loss = self.compute_loss(model, inputs)
  File "/home/v100/lla2/lib/python3.10/site-packages/transformers/trainer.py", line 2801, in compute_loss
        loss = self.compute_loss(model, inputs)loss = self.compute_loss(model, inputs)

  File "/home/v100/lla2/lib/python3.10/site-packages/transformers/trainer.py", line 2801, in compute_loss
  File "/home/v100/lla2/lib/python3.10/site-packages/transformers/trainer.py", line 2801, in compute_loss
    outputs = model(**inputs)
  File "/home/v100/lla2/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    outputs = model(**inputs)
  File "/home/v100/lla2/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    outputs = model(**inputs)
  File "/home/v100/lla2/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/v100/lla2/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    return forward_call(*args, **kwargs)
    return forward_call(*args, **kwargs)      File "/home/v100/lla2/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn

ret_val = func(*args, **kwargs)
  File "/home/v100/lla2/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
  File "/home/v100/lla2/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1801, in forward
        ret_val = func(*args, **kwargs)ret_val = func(*args, **kwargs)

  File "/home/v100/lla2/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1801, in forward
  File "/home/v100/lla2/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1801, in forward
    loss = self.module(*inputs, **kwargs)
  File "/home/v100/lla2/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
        loss = self.module(*inputs, **kwargs)loss = self.module(*inputs, **kwargs)

  File "/home/v100/lla2/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
  File "/home/v100/lla2/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/v100/lla2/lib/python3.10/site-packages/peft/peft_model.py", line 918, in forward
        return forward_call(*args, **kwargs)return forward_call(*args, **kwargs)

  File "/home/v100/lla2/lib/python3.10/site-packages/peft/peft_model.py", line 918, in forward
  File "/home/v100/lla2/lib/python3.10/site-packages/peft/peft_model.py", line 918, in forward
Traceback (most recent call last):
    return self.base_model(
  File "/home/v100/lla2/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
        return self.base_model(return self.base_model(

  File "/home/v100/lla2/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
  File "/home/v100/lla2/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
  File "/home/v100/LongLoRA/fine-tune.py", line 207, in <module>
    train()
  File "/home/v100/LongLoRA/fine-tune.py", line 201, in train
    trainer.train()
      File "/home/v100/lla2/lib/python3.10/site-packages/transformers/trainer.py", line 1591, in train
    return inner_training_loop(
return forward_call(*args, **kwargs)
  File "/home/v100/lla2/lib/python3.10/site-packages/peft/tuners/tuners_utils.py", line 94, in forward
          File "/home/v100/lla2/lib/python3.10/site-packages/transformers/trainer.py", line 1892, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
return forward_call(*args, **kwargs)return forward_call(*args, **kwargs)
    
return self.model.forward(*args, **kwargs)  File "/home/v100/lla2/lib/python3.10/site-packages/peft/tuners/tuners_utils.py", line 94, in forward
  File "/home/v100/lla2/lib/python3.10/site-packages/peft/tuners/tuners_utils.py", line 94, in forward

  File "/home/v100/lla2/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 1038, in forward
  File "/home/v100/lla2/lib/python3.10/site-packages/transformers/trainer.py", line 2776, in training_step
    loss = self.compute_loss(model, inputs)
        return self.model.forward(*args, **kwargs)return self.model.forward(*args, **kwargs)

  File "/home/v100/lla2/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 1038, in forward
  File "/home/v100/lla2/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 1038, in forward
  File "/home/v100/lla2/lib/python3.10/site-packages/transformers/trainer.py", line 2801, in compute_loss
    outputs = model(**inputs)
  File "/home/v100/lla2/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
    outputs = self.model(
  File "/home/v100/lla2/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/home/v100/lla2/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
        outputs = self.model(  File "/home/v100/lla2/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1801, in forward
    loss = self.module(*inputs, **kwargs)
outputs = self.model(

  File "/home/v100/lla2/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
  File "/home/v100/lla2/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
  File "/home/v100/lla2/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/v100/lla2/lib/python3.10/site-packages/peft/peft_model.py", line 918, in forward
    return self.base_model(
  File "/home/v100/lla2/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
    return forward_call(*args, **kwargs)
  File "/home/v100/lla2/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 921, in forward
  File "/home/v100/lla2/lib/python3.10/site-packages/peft/tuners/tuners_utils.py", line 94, in forward
    return self.model.forward(*args, **kwargs)
        return forward_call(*args, **kwargs)return forward_call(*args, **kwargs)

  File "/home/v100/lla2/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 921, in forward
  File "/home/v100/lla2/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 921, in forward
  File "/home/v100/lla2/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 1038, in forward
    outputs = self.model(
      File "/home/v100/lla2/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
layer_outputs = torch.utils.checkpoint.checkpoint(
  File "/home/v100/lla2/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 249, in checkpoint
  File "/home/v100/lla2/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 921, in forward
    layer_outputs = torch.utils.checkpoint.checkpoint(
        layer_outputs = torch.utils.checkpoint.checkpoint(layer_outputs = torch.utils.checkpoint.checkpoint(    

return CheckpointFunction.apply(function, preserve, *args)  File "/home/v100/lla2/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 249, in checkpoint
  File "/home/v100/lla2/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 249, in checkpoint

  File "/home/v100/lla2/lib/python3.10/site-packages/torch/autograd/function.py", line 506, in apply
  File "/home/v100/lla2/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 249, in checkpoint
    return CheckpointFunction.apply(function, preserve, *args)
          File "/home/v100/lla2/lib/python3.10/site-packages/torch/autograd/function.py", line 506, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
return CheckpointFunction.apply(function, preserve, *args)return CheckpointFunction.apply(function, preserve, *args)

  File "/home/v100/lla2/lib/python3.10/site-packages/torch/autograd/function.py", line 506, in apply
  File "/home/v100/lla2/lib/python3.10/site-packages/torch/autograd/function.py", line 506, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
  File "/home/v100/lla2/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 107, in forward
    outputs = run_function(*args)
  File "/home/v100/lla2/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 107, in forward
      File "/home/v100/lla2/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 917, in custom_forward
    return module(*inputs, past_key_value, output_attentions, padding_mask=padding_mask)
outputs = run_function(*args)        return super().apply(*args, **kwargs)  # type: ignore[misc]
return super().apply(*args, **kwargs)  # type: ignore[misc]
  File "/home/v100/lla2/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 917, in custom_forward

  File "/home/v100/lla2/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 107, in forward
  File "/home/v100/lla2/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 107, in forward
  File "/home/v100/lla2/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
        outputs = run_function(*args)outputs = run_function(*args)

  File "/home/v100/lla2/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 917, in custom_forward
  File "/home/v100/lla2/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 917, in custom_forward
  File "/home/v100/lla2/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 635, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "/home/v100/lla2/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
    return module(*inputs, past_key_value, output_attentions, padding_mask=padding_mask)
  File "/home/v100/lla2/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
TypeError: forward_noflashattn() got an unexpected keyword argument 'padding_mask'
        return module(*inputs, past_key_value, output_attentions, padding_mask=padding_mask)return module(*inputs, past_key_value, output_attentions, padding_mask=padding_mask)

  File "/home/v100/lla2/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
  File "/home/v100/lla2/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/v100/lla2/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 635, in forward
        return forward_call(*args, **kwargs)return forward_call(*args, **kwargs)

  File "/home/v100/lla2/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 635, in forward
  File "/home/v100/lla2/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 635, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "/home/v100/lla2/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
        hidden_states, self_attn_weights, present_key_value = self.self_attn(hidden_states, self_attn_weights, present_key_value = self.self_attn(

  File "/home/v100/lla2/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
  File "/home/v100/lla2/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
TypeError: forward_noflashattn() got an unexpected keyword argument 'padding_mask'
    return forward_call(*args, **kwargs)    
return forward_call(*args, **kwargs)
TypeErrorTypeError: : forward_noflashattn() got an unexpected keyword argument 'padding_mask'forward_noflashattn() got an unexpected keyword argument 'padding_mask'

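Not part of the original report, only an illustration of the failure mode: the traceback ends because modeling_llama passes a padding_mask keyword that the replacement attention forward does not declare. A minimal, self-contained sketch (hypothetical function names, not the repo's code) showing why such a call fails and how a signature that absorbs extra keywords avoids the TypeError:

import torch

# Hypothetical stand-ins for a patched attention forward; not LongLoRA's code.
def patched_forward_strict(hidden_states, attention_mask=None, output_attentions=False):
    # Does not declare padding_mask, so passing it raises TypeError.
    return hidden_states

def patched_forward_tolerant(hidden_states, attention_mask=None, output_attentions=False, **kwargs):
    # **kwargs absorbs keywords added by newer transformers versions, e.g. padding_mask.
    return hidden_states

x = torch.zeros(1, 4, 8)

try:
    patched_forward_strict(x, attention_mask=None, padding_mask=None)
except TypeError as err:
    print(err)  # ... got an unexpected keyword argument 'padding_mask'

patched_forward_tolerant(x, attention_mask=None, padding_mask=None)  # runs without error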
Error and confusion regarding Yukang/Llama-2-13b-chat-longlora-32k-sft

Hi!
Awesome work!

However, there is some confusion in the readme: you mention that 'Yukang/Llama-2-13b-chat-longlora-32k-sft' needs to be merged first, but it appears to already be a merged model.

Therefore I tried to run inference like this:

from transformers import AutoTokenizer
import transformers
import torch


model = 'Yukang/Llama-2-13b-chat-longlora-32k-sft'

tokenizer = AutoTokenizer.from_pretrained(model)
pipeline = transformers.pipeline(
    "text-generation",
    tokenizer=tokenizer,
    model=model,
    torch_dtype=torch.float16,
    device_map="auto",
)

# Llama-2 chat system prompt (defined here but not used in the prompt built below)
system_prompt = '''<s>[INST] <<SYS>>
You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe.  Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.
If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.
<</SYS>>'''


def response(artifact: str, question: str):
    prompt = ('Please consider the following input text and answer the subsequent question, '
              'only answer the question please: <text> ' + artifact + '</text> [INST] ' + question + ' [/INST]')
    print(prompt)
    sequences = pipeline(
        prompt,
        do_sample=True,
        top_k=10,
        num_return_sequences=1,
        eos_token_id=tokenizer.eos_token_id,
        max_new_tokens=512,
        # max_length=1000,
    )
    output = sequences[0]['generated_text']
    # keep only the text generated after the closing [/INST] tag of the prompt
    output_ind = output.index('[/INST]')
    return output[output_ind + len('[/INST]'):]


response('The Mona Lisa is a 16th century oil painting created by Leonardo. It\'s held at the Louvre in Paris.', 'Where is the Mona Lisa held?')

And the output is
'\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nthe the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nthe the the the the the the the the the thethe the the the the the the the the the the the the the the the the the the the:\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n the the the the the the the the the the the the the the the the the the the the the thethe the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the theththththththth\n\n thethe theththththththethththe theththethththththethth theththththe thethththththeththththeththeththththtthtththe thththththe ththththe theththththe theth ththththththththe thethththththththth\nthethththe theththththeththe th thth\nthththththeththththththeththththththeththethththththththt\nththtthththtttht\nththe ththoththththththththe ththththththe theththththtth'

Can you help by chance? Thanks!

Timeout after 30 minutes problem

While the rank 0 process is prepping the dataset, if it takes longer than 30 minutes an exception is thrown.

Your number of cores is 128, so your hardware may be able to finish the dataset prep before the timeout kicks in.

I have 16 cores and 32 threads, and the optimal number of workers for me is 20. So 30 minutes is not enough.

I have scoured the web and cannot find a way to modify the timeout setting. None of the things I have seen work.

Dataset prep should be taken out of the fine-tune.py script and made into a separate script that creates a directly ingestible dataset; one possible approach is sketched after the snippet below.

rank = int(os.environ.get('RANK', -1))

# Non-zero ranks wait at the barrier while rank 0 prepares and caches the dataset.
if rank > 0:
    barrier()

dataset = load_dataset("togethercomputer/RedPajama-Data-1T-Sample", cache_dir=training_args.cache_dir)
dataset = dataset.map(partial(tokenize_fn, tokenizer), batched=True, num_proc=128, remove_columns=["text", "meta"])

# Rank 0 only reaches the barrier once the dataset is ready, releasing the other ranks.
if rank == 0:
    barrier()
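One possible workaround (not from the repo, just a sketch): pre-tokenize the dataset in a standalone script and save it with `save_to_disk`, so the training script only needs a fast `load_from_disk` inside the barrier and the 30-minute timeout is never hit during tokenization. The script name, output path, model name, and the placeholder `tokenize_fn` below are assumptions.

# prepare_dataset.py -- hypothetical standalone pre-tokenization script
import os
from functools import partial

from datasets import load_dataset
from transformers import AutoTokenizer

def tokenize_fn(tokenizer, examples):
    # Placeholder tokenization; the real tokenize_fn from fine-tune.py should be reused here.
    return {"input_ids": tokenizer(examples["text"])["input_ids"]}

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
dataset = load_dataset("togethercomputer/RedPajama-Data-1T-Sample", cache_dir="./cache")
dataset = dataset.map(partial(tokenize_fn, tokenizer), batched=True,
                      num_proc=os.cpu_count(), remove_columns=["text", "meta"])
dataset.save_to_disk("./redpajama_tokenized")  # hypothetical output directory

# fine-tune.py could then replace load_dataset(...).map(...) with:
#     from datasets import load_from_disk
#     dataset = load_from_disk("./redpajama_tokenized")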

How to run pdf2txt?

There is an error. Can you tell me if there is any other configuration that needs to be done?

KeyError: "No object named 'build_vit_fpn_backbone' found in 'BACKBONE' registry!"

python pdf2txt.py --pdf_path /content/pdf/ --outputs_dir /content/output/

Traceback (most recent call last):
  File "/content/LongLoRA/pdf2txt/pdf2txt.py", line 183, in <module>
    process_pdf(args.pdf_path, args.outputs_dir, args.config_file)
  File "/content/LongLoRA/pdf2txt/pdf2txt.py", line 97, in process_pdf
    predictor = DefaultPredictor(cfg)
  File "/usr/local/lib/python3.10/dist-packages/detectron2/engine/defaults.py", line 282, in __init__
    self.model = build_model(self.cfg)
  File "/usr/local/lib/python3.10/dist-packages/detectron2/modeling/meta_arch/build.py", line 22, in build_model
    model = META_ARCH_REGISTRY.get(meta_arch)(cfg)
  File "/usr/local/lib/python3.10/dist-packages/detectron2/config/config.py", line 189, in wrapped
    explicit_args = _get_args_from_config(from_config_func, *args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/detectron2/config/config.py", line 245, in _get_args_from_config
    ret = from_config_func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/detectron2/modeling/meta_arch/rcnn.py", line 73, in from_config
    backbone = build_backbone(cfg)
  File "/usr/local/lib/python3.10/dist-packages/detectron2/modeling/backbone/build.py", line 31, in build_backbone
    backbone = BACKBONE_REGISTRY.get(backbone_name)(cfg, input_shape)
  File "/usr/local/lib/python3.10/dist-packages/fvcore/common/registry.py", line 71, in get
    raise KeyError(
KeyError: "No object named 'build_vit_fpn_backbone' found in 'BACKBONE' registry!"

Dataset?

Dear authors,
Thanks so much for your nice contribution and the great paper (very nice to read).
Will the dataset be released soon?
Thanks a lot,
Pierre

Missing `num_group` and wrong shape handling in `forward_flashattn`

In the function `forward_flashattn` of `llama_attn_replace.py`:

  • The variable `num_group` is not defined. It is easy to patch by adding `num_group = q_len // group_size` at line 101 (see the sketch further below).
  • Lines 118-124:
output = rearrange(
        pad_input(
            rearrange(output_unpad, "nnz h d -> nnz (h d)"), indices, bsz * num_group if self.training else bsz, group_size if self.training else q_len
        ),
        "b s (h d) -> b s h d",
        h=nheads,
    )

The sizes are not aligned. Following the definitions above, the given `output_unpad` has shape (2 x B x G x N, H, D//2), so `bsz * num_group` should be `bsz * num_group * 2` (I suppose).

I have not dug much into the axis-switching trick in the code, so I am not confident in this correction; it would be better if you could double-check.
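For clarity, here is a sketch of the patched fragment I am proposing, in the same style as the snippet above. The names (q_len, group_size, output_unpad, indices, bsz, nheads) come from the surrounding forward_flashattn code, and the extra factor of 2 follows my reading above, so please double-check.

num_group = q_len // group_size  # the missing definition, added near line 101

output = rearrange(
    pad_input(
        rearrange(output_unpad, "nnz h d -> nnz (h d)"),
        indices,
        # doubled because the batch was split into two half-head groups
        bsz * num_group * 2 if self.training else bsz,
        group_size if self.training else q_len,
    ),
    "b s (h d) -> b s h d",
    h=nheads,
)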

By the way, one question: why do we reshape the qkv to (bsz * 2, q_len, 3, self.num_heads // 2, self.head_dim) but then rearrange it to (bsz * 2, q_len, 3, self.num_heads, self.head_dim // 2) afterwards with:

x_unpad = rearrange(
        x_unpad, "nnz (three h d) -> nnz three h d", three=3, h=nheads
    )

Is it compatible with QLoRA?

Looking at this paper, it was impressive to see the context length extended while keeping GPU usage similar to LoRA. Is it compatible with QLoRA?

How many models are supported?

I want to know how many models this method supports. Are there any docs on applying LoRA to other models, like ChatGLM or Baichuan, besides the LLaMA series? Thanks a lot.
