
multi-lora-fine-tune's Introduction

m-LoRA: Efficient LLM Model Fine-Tune via Multi-LoRA Optimization

m-LoRA (a.k.a. Multi-Lora Fine-Tune) is an open-source framework for fine-tuning Large Language Models (LLMs) with multiple LoRA/QLoRA adapters efficiently. Key features of m-LoRA include:

  • Efficient LoRA/QLoRA: Optimizes the fine-tuning process, significantly reducing GPU memory usage by sharing a frozen base model across adapters.

  • Multiple LoRA Adapters: Supports concurrent fine-tuning of multiple LoRA/QLoRA adapters.

Updates

  • Support multiple Qwen2 fine-tuning
  • Support multiple Mistral fine-tuning
  • Support multiple LLaMA2 fine-tuning
  • Support multiple ChatGLM fine-tuning
  • Support multiple LLaMA fine-tuning
  • On the way: Baichuan

Models

Model # Parameters
LLaMA 7B/13B/33B/65B
LLaMA-2 7B/13B/70B
Qwen-2 4B/7B/14B/72B
Mistral 7B
ChatGLM 6B
ChatGLM2 6B/12B
ChatGLM3 6B
Baichuan 7B/13B
Baichuan2 7B/13B

Example: use our system to fine-tune LLaMA-2 with fewer resources: https://www.kaggle.com/code/rraydata/multi-lora-example/notebook

Overview

m-LoRA is a high-throughput LLM fine-tuning framework based on LoRA and QLoRA, compatible with Hugging Face Transformers LLaMA models and ChatGLM models.

[Figure: the basic principle of LoRA and Multi-LoRA]
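
In essence, each LoRA adapter adds a small low-rank update on top of the same frozen base weight, so many adapters can share one copy of the base model. A minimal PyTorch sketch of the idea (illustrative only, not m-LoRA's actual implementation; the dimensions and scaling below are assumptions):

import torch

# Several low-rank adapters share one frozen base weight; only A_i/B_i are trained.
d, r, n_adapters = 4096, 8, 3
W0 = torch.randn(d, d)                         # shared, frozen base weight
adapters = [(torch.randn(r, d) * 0.01,         # A_i (r x d)
             torch.zeros(d, r))                # B_i (d x r), zero-initialized
            for _ in range(n_adapters)]

x = torch.randn(2, d)                          # a batch routed to adapter i
i, scaling = 0, 16 / r                         # scaling = alpha / r
A, B = adapters[i]
h = x @ W0.T + scaling * (x @ A.T) @ B.T       # frozen base output + low-rank update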

[Figure: the system overview of m-LoRA]

m-LoRA requires PyTorch and NVIDIA CUDA compatible GPUs.
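
A quick way to verify that the environment meets these requirements (standard PyTorch calls):

import torch

assert torch.cuda.is_available(), "m-LoRA needs a CUDA-capable GPU"
print(torch.__version__, torch.cuda.get_device_name(0))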

Main Contributions

  • Introduces the Multi-LoRA method, which enables sharing of pre-trained model weights across adapters during the fine-tuning of large language models;
  • Proposes a task scheduling algorithm that enhances the overall throughput of the training process and reduces total training latency;
  • Builds on the above by implementing m-LoRA, a high-throughput large language model fine-tuning framework based on LoRA and QLoRA;
  • Evaluates m-LoRA against existing systems, confirming that it uses system computing resources more effectively, improving training throughput and reducing training latency.

Experiment Results

Environment: NVIDIA RTX A6000 with Intel Xeon Silver 4314 on Ubuntu 22.04.3

Baseline: We utilized the widely adopted Alpaca-LoRA as a foundation. On a single GPU, we independently ran multiple Alpaca-LoRA processes in parallel (marked as Baseline@Alpaca-Parallel) and sequentially (marked as Baseline@Alpaca-Seq), forming the two baseline methods for the experiments. We tested this on an A100, and the rest of the results are based on the same GPU configuration.

Training Latency and Throughput

Method Latency Throughput
Baseline@Alpaca-Seq 10.51h 608.41 token/s
Baseline@Alpaca-Parallel 9.85h 649.30 token/s
m-LoRA 9.46h 674.58 token/s

We conducted four identical fine-tuning jobs with the same dataset and the same hyper-parameters, covering the two baselines and m-LoRA. During the experiments, we recorded the completion time of each task in the baseline methods and took the time of the slowest task as the Training Latency. As shown in the table, m-LoRA exhibits lower Training Latency than both baseline methods: it is 9.99% faster than Baseline@Alpaca-Seq and 3.92% faster than Baseline@Alpaca-Parallel.

Video Memory Usage

We conducted several fine-tuning jobs with the same dataset and batch_size = {2, 4, 6, 8}, covering Baseline@Alpaca-Parallel and m-LoRA.

Baseline@Alpaca-Parallel triggered an OOM error after 3 parallel tasks at batch size 8, while m-LoRA can handle twice that many.

Batching Strategies

Method Training Latency Peak Memory Usage Average GPU Utilization Training Throughput
Baseline@Alpaca-Seq 27.73h 10.68GB 79.39% 653.35 token/s
m-LoRA@M1 36.82h 23.82GB 96.52% 672.54 token/s
m-LoRA@M2 39.14h 23.86GB 96.41% 671.28 token/s
m-LoRA@M3 22.97h 23.85GB 95.22% 674.41 token/s

We conducted four fine-tuning jobs with different datasets but the same hyper-parameters, covering Baseline@Alpaca-Seq and m-LoRA.

During the experiments, we collected the following metrics:

  • Training Latency = job completion time
  • Throughput = number of tokens passed through the model forward pass / training latency
  • Memory Usage = peak video memory usage
  • GPU Utilization = average GPU utilization

All metrics are computed for each job. M1, M2, M3 represent three batch strategies of m-LoRA: Optimal-Fit, Trivial, and Fast-Fit. BASELINE denotes Baseline@Alpaca-Seq.
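
For reference, these per-job metrics could be collected roughly as in the sketch below (an assumed measurement harness, not the exact code used; batches and train_step are hypothetical placeholders):

import time
import torch

# Hypothetical stand-ins for the real data loader and training step.
batches = [torch.randint(0, 32000, (8, 512)) for _ in range(4)]
def train_step(batch):
    pass

torch.cuda.reset_peak_memory_stats()
start, total_tokens = time.time(), 0
for batch in batches:
    total_tokens += batch.numel()              # tokens passed through the forward pass
    train_step(batch)
latency = time.time() - start                  # Training Latency: job completion time
throughput = total_tokens / latency            # Throughput: tokens / training latency
peak_mem = torch.cuda.max_memory_allocated()   # Peak video memory usage (bytes)
gpu_util = torch.cuda.utilization()            # instantaneous sample; average it over the run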

The Optimal-Fit strategy performs best across all four metrics, while the other two strategies also outperform the baseline on every metric except training latency.

Use Cases:

  • Domain-Specific Fine-Tuning: adapting a single base model with several different sets of adapter parameters for one domain.
  • Cross-Domain Fine-Tuning: leveraging one base model to fine-tune multiple adapters, each intended for a different domain.

Quickstart

First, clone this repository and install the dependencies:

# Clone Repository
git clone https://github.com/TUDB-Labs/multi-lora-fine-tune
cd multi-lora-fine-tune
# Install requirements
pip install -r requirements.txt

The mlora.py script is the starting point for fine-tuning on various datasets. A basic command for fine-tuning a base model on the Alpaca Cleaned dataset:

python mlora.py \
  --base_model yahma/llama-7b-hf \
  --config ./config/alpaca.json \
  --load_8bit

You can find template fine-tuning configurations in the template folder.

For further usage details, use the --help option:

python mlora.py --help

Running m-LoRA on Multiple GPUs (Experimental Feature)

m-LoRA employs a distinctive approach to pipeline parallelism for executing parallel tasks across multiple GPUs. The parameters below facilitate the use of multiple GPUs.

  • --pipeline: enables support for multi-GPU setups.
  • --rank: specifies the worker's index in the pipeline (ranging from 0 to the number of GPUs minus 1).
  • --device: specifies the device for loading weights.
  • --balance: defines a sequence indicating the number of layers that should be loaded by the worker at a specific index.

Currently, only weights in the safetensors format are supported. If you have weights in the Hugging Face PyTorch format, you can use the following code to convert them:

python trans_to_safetensors.py --model_path /home/local_model

Suppose the model has 35 layers (32 transformer layers and 3 other layers). The --balance values below split those 35 layers across the ranks (9 + 9 + 9 + 8 = 35). Here are the basic commands for fine-tuning this model on a 4-GPU platform:

python mlora.py \
  --base_model /home/local_model \
  --config ./config/alpaca.json \
  --pipeline \
  --rank 0 \
  --device cuda:0 \
  --balance 9 9 9 8 &

python mlora.py \
  --base_model /home/local_model \
  --config ./config/alpaca.json \
  --pipeline \
  --rank 1 \
  --device cuda:1 \
  --balance 9 9 9 8 &

python mlora.py \
  --base_model /home/local_model \
  --config ./config/alpaca.json \
  --pipeline \
  --rank 2 \
  --device cuda:2 \
  --balance 9 9 9 8 &

python mlora.py \
  --base_model /home/local_model \
  --config ./config/alpaca.json \
  --pipeline \
  --rank 3 \
  --device cuda:3 \
  --balance 9 9 9 8 &

Demo on Colab

You can run fine-tuning on Colab by following this example: Google Colab Example. Make sure to switch the runtime environment to GPU before running it.

Webui for m-LoRA

You can also run fine-tuning through the web UI by following the instructions in webui/Instruction.md. Make sure to switch the runtime environment to GPU before running it.

Installation

You can also install m-LoRA into your environment:

# Optional but recommended
conda create -n mlora_env python=3.8
conda activate mlora_env
# Install requirements
pip install mlora

After installation, you can use m-LoRA directly in your code:

import mlora

Contributing

We welcome contributions to improve this repository! Please review the contribution guidelines before submitting pull requests or issues.

Fork the repository. Create a new branch for your feature or fix. Submit a pull request with a detailed explanation of your changes.

You can use the pre-commit hook to check your code:

ln -s ../../.github/workflows/pre-commit .git/hooks/pre-commit

Citation

Please cite this repository if you use its code.

@misc{m-LoRA,
  author = {Zhengmao, Ye\textsuperscript{*} and Dengchun, Li\textsuperscript{*} and Jingqi, Tian and Tingfeng, Lan and Yanbo, Liang and Yexi, Jiang and Jie, Zuo and Hui, Lu and Lei, Duan and Mingjie, Tang},
  title = {m-LoRA: Efficient LLM Model Fine-tune and Inference via Multi-Lora Optimization},
  year = {2023},
  publisher = {GitHub},
  howpublished = {\url{https://github.com/TUDB-Labs/multi-lora-fine-tune}},
  note={\textsuperscript{*}: these authors contributed equally to this work.}
}

Copyright

Copyright © 2023 All Rights Reserved.

This project is licensed under the Apache 2.0 License.

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

     http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

multi-lora-fine-tune's Issues

[WIP] Aspen test report

We randomly generated 4 datasets: the training data for datasets 1 and 2 was randomly chosen from alpaca-lora, and for datasets 3 and 4 from spider. Below are the datasets' token-length distributions and total sizes.
test GPU: A6000
data_set_1: 34000
data_set_2: 17000
data_set_3: 5556
data_set_4: 2700
We will train 8 LoRA models:
data_set_1 with lr = 3e-4 and lr = 1e-4
data_set_2 with lr = 3e-4 and lr = 1e-4
data_set_3 with lr = 3e-4 and lr = 1e-4
data_set_4 with lr = 3e-4 and lr = 1e-4

Train 2 LoRA models in parallel on 2 GPUs
= one LoRA on gpu0 and the other LoRA on gpu1
= train 2 LoRA models serially on one GPU, ignoring the model load time.
[Figure: two-GPU parallel latency vs. single-GPU ASPEN parallel latency]

Train 2 LoRA models in parallel on 1 GPU
[Figure: single-GPU parallel latency vs. single-GPU ASPEN parallel latency]

Compare the peak memory
[Figure: Alpaca-LoRA vs. single-GPU ASPEN parallel memory usage]
Compare the A6000 and the 4090
[Figure: RTX 4090 vs. A6000]

Log style not consistent

Consider replacing the current log mechanism with a unified logging framework.

[2023-12-11 21:36:41] m-LoRA: NVIDIA CUDA initialized successfully.
[2023-12-11 21:36:41] m-LoRA: Total 1 GPU(s) detected.
[2023-12-11 21:36:41] m-LoRA: Loading model with quantization, bits = 8
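
For example, the standard logging module could reproduce the format above through a single shared logger (a sketch, not the project's actual implementation):

import logging

def setup_logger(name: str = "m-LoRA") -> logging.Logger:
    # One handler, one format: "[YYYY-MM-DD HH:MM:SS] m-LoRA: message"
    handler = logging.StreamHandler()
    handler.setFormatter(logging.Formatter(
        fmt="[%(asctime)s] %(name)s: %(message)s",
        datefmt="%Y-%m-%d %H:%M:%S"))
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.addHandler(handler)
    return logger

log = setup_logger()
log.info("NVIDIA CUDA initialized successfully.")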

About Mix LoRA

I think this is great work.

Is there any problem with the implementation of the method? Why is the code no longer in the repository?

Offload Performance Test Result

Test data: batch_size = 4, seq_len = 1552, using the vicuna-7B model on one GPU.
The vicuna-7B model has 32 transformer layers, with a checkpoint used in each layer.
Case 1: 31 layers use the recompute checkpoint and 1 layer uses the offload checkpoint; time cost: 12.8416 s.
Case 2: all 32 layers use the recompute checkpoint; time cost: 11.3163 s.
It seems that using offload on one card is about 1.135x slower than recomputing (12.8416 / 11.3163 ≈ 1.135).

Question about Training

Dear Authors,

Thanks for this great project.
I got a question about training,
I can see this part only produces one word during inference; why are we not using auto-regressive generation here?
Also, I wonder how we should test the throughput, e.g., in tokens per second.

Best
Chao
[Screenshot of the referenced code]

How to test the performance

Use the patch files below to get the baseline performance (Alpaca-LoRA):
transformers/trainer.py

148a149,155
> import logging as flog
> import os
> import time
> flog.basicConfig(filename="logs.log",
>                  filemode='a',
>                  format='%(message)s',
>                  level=flog.DEBUG)
1871a1879,1880
>                     torch.cuda.reset_peak_memory_stats()
>                     opti_start_time = time.time()
1894a1904,1909
>                     opti_end_time = time.time()
>                     device_str = inputs["input_ids"].device
>                     alloc_mem = torch.cuda.max_memory_allocated(device_str)
>                     gpu_utilization = torch.cuda.utilization(int(os.environ["CUDA_VISIBLE_DEVICES"]))
>                     flog.info(f"optim: {(opti_end_time - opti_start_time):.10f} {alloc_mem} {gpu_utilization}")
>                     flog.info(f"train: {tr_loss_step}")
2658a2674,2675
>         torch.cuda.reset_peak_memory_stats()
>         back_start_time = time.time()
2665a2683,2687
>         back_end_time = time.time()
>         device_str = inputs["labels"].device
>         alloc_mem = torch.cuda.max_memory_allocated(device_str)
>         gpu_utilization = torch.cuda.utilization(int(os.environ["CUDA_VISIBLE_DEVICES"]))
>         flog.info(f"backward: {(back_end_time - back_start_time):.10f} {alloc_mem} {gpu_utilization}")

transformers/models/llama/modeling_llama.py

38a39,46
> import logging as flog
> import os
> import time
> flog.basicConfig(filename="logs.log",
>                  filemode='a',
>                  format='%(message)s',
>                  level=flog.DEBUG)
>
805a814,816
>         flog.info(f"data size: {input_ids.shape[0]} {input_ids.shape[1]}")
>         torch.cuda.reset_peak_memory_stats()
>         forward_start_time = time.time()
816a828,832
>         forward_end_time = time.time()
>         device_str = input_ids.device
>         alloc_mem = torch.cuda.max_memory_allocated(device_str)
>         gpu_uilization = torch.cuda.utilization(int(os.environ["CUDA_VISIBLE_DEVICES"]))
>         flog.info(f"forward: {(forward_end_time - forward_start_time):.10f} {alloc_mem} {gpu_uilization}")
837a854,856
>
>             torch.cuda.reset_peak_memory_stats()
>             loss_start_time = time.time()
838a858,862
>             loss_end_time = time.time()
>             device_str = input_ids.device
>             alloc_mem = torch.cuda.max_memory_allocated(device_str)
>             gpu_uilization = torch.cuda.utilization(int(os.environ["CUDA_VISIBLE_DEVICES"]))
>             flog.info(f"loss: {(loss_end_time - loss_start_time):.10f} {alloc_mem} {gpu_uilization}")

peft/tuners/lora.py

46a47,53
> import logging as flog
> import os
> import time
> flog.basicConfig(filename="logs.log",
>                  filemode='a',
>                  format='%(message)s',
>                  level=flog.DEBUG)
1148a1156,1157
>             torch.cuda.reset_peak_memory_stats()
>             base_start_time = time.time()
1149a1159,1163
>             base_end_time = time.time()
>             device_str = x.device
>             alloc_mem = torch.cuda.max_memory_allocated(device_str)
>             gpu_utilization = torch.cuda.utilization(int(os.environ["CUDA_VISIBLE_DEVICES"]))
>             flog.info(f"base: {(base_end_time-base_start_time):.10f} {alloc_mem} {gpu_utilization}")
1153a1168,1169
>                 torch.cuda.reset_peak_memory_stats()
>                 lora_start_time = time.time()
1172a1189,1193
>                 lora_end_time = time.time()
>                 device_str = x.device
>                 alloc_mem = torch.cuda.max_memory_allocated(device_str)
>                 gpu_utilization = torch.cuda.utilization(int(os.environ["CUDA_VISIBLE_DEVICES"]))
>                 flog.info(f"lora: {(lora_end_time-lora_start_time):.10f} {alloc_mem} {gpu_utilization}")

example goes out of memory

Dear Author,

Thanks for this great project.
I hit a problem when I tried to run the example code mlora.py with float16:
I use an A100 with 40GB of memory, but it still goes out of memory.

Do you have any clue about this error?

Thanks!

checkpoint offload policy

We currently use checkpointing to save GPU memory. The checkpoint caches each transformer layer's input and runs the forward pass without producing grads; the backward pass then uses the cached input to recompute the activations and produce each transformer layer's grads.

But I think that if the tensor size is big (training multiple LoRA models gives a big total batch size, so the tensors are big enough), the time taken by recomputation will be less than the transfer time (GPU -> CPU and CPU -> GPU). Maybe this method will increase latency, but it could increase throughput.

I have found some APIs that could implement it (a minimal sketch of the first one follows this list):

  • save_on_cpu saves the checkpoint's input on the CPU and, when the backward pass needs it, loads the input data from CPU back to GPU. This API only saves the memory of the checkpoint's input.
  • saved_tensors_hooks: this context manager lets us define how intermediate results of an operation are packed before saving and unpacked on retrieval, so it can offload tensors during forward and backward. torch==2.0.1 uses this API to implement checkpointing but does not let the user pass an argument to change the behavior; if we want our own policy, we can hack this function.
  • pytorch-v2.1.0-rc2: in the nightly version, the checkpoint API allows the user to supply a custom context manager, so we can implement our own context manager to implement the offload policy.
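
A minimal sketch of the first option, save_on_cpu (assumed usage, not the final offload policy):

import torch
import torch.nn as nn
from torch.autograd.graph import save_on_cpu

layer = nn.Linear(4096, 4096).cuda()
x = torch.randn(8, 4096, device="cuda", requires_grad=True)

# Tensors saved for backward are packed onto the CPU during the forward pass...
with save_on_cpu(pin_memory=True):
    y = layer(x)
# ...and copied back to the GPU on demand when backward needs them.
y.sum().backward()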

Test report compared to alpaca-lora

I tested three different datasets of different sizes in alpaca-lora and multi-lora-fine-tune.
Each dataset (the input data's sequence and size are also the same) trains two different LoRA models with two different optimizers, and each optimizer has the same training hyperparameters.
So alpaca-lora needs to be trained twice to produce the two different LoRA models, while multi-lora-fine-tune needs only one run.
The experiment measures end-to-end training latency (without model and dataset load and save latency).

  • dataset1 uses batch size 7, 457 samples from alpaca-lora, and a max seq len of 1304
  • dataset2 uses batch size 16, 452 samples from alpaca-lora, and a max seq len of 512
  • dataset3 uses batch size 16, 5000 samples from sql-create-context, and a max seq len of 256
    The experimental results are as follows:
  1. Total time (hours) to train two different LoRA models
    [Figure: total training time]
  2. Throughput (tokens/second) when training two different LoRA models
    [Figure: training throughput]

Will embeddings change?

I have been studying LoRA recently, and I noticed that during pre-training the word vectors change as training progresses. But what about when using LoRA for fine-tuning? Do the word vectors still change, or is it only the attention weights?

Known Issues

  • ⚠️ Only the last layer of the adapter is updated during training
  • Slow built-in inference

Cannot load vicuna-7b-delta-v0

  1. Using aspen.load_llama_tf_weight to load the vicuna-7b-delta-v0 model consumes more than 30GB of memory and then causes an OOM.

  2. Using utils.convert_hf_to_pth to convert vicuna-7b-delta-v0 to a .pth model, then using aspen.load_llama_7b_weight to load the .pth model, reports an error:

Not use layer model.embed_tokens.weight.
Traceback (most recent call last):
  File "/data/glx/code/multi_lora/legacy.py", line 43, in <module>
    aspen.load_llama_7b_weight(llama_model, config["base_model"], config["device"])
  File "/data/glx/code/multi_lora/aspen/modelloader.py", line 21, in load_llama_7b_weight
    layer_id = int(layer_name[:layer_name.find(".")])
ValueError: invalid literal for int() with base 10: 'ayers'

Issues about integrated inference

Traceback

Traceback (most recent call last):
  File "/home/mikecovlee/work/multi-lora-fine-tune/mlora.py", line 175, in <module>
    inference(config, model, tokenizer)
  File "/home/mikecovlee/work/multi-lora-fine-tune/mlora.py", line 106, in inference
    input_data = mlora.MultiLoraBatchData(
TypeError: MultiLoraBatchData.__init__() got an unexpected keyword argument 'prompts_'

TODO

Improve inference functions. @mikecovlee

Performance Test

Self-comparison test on the alpaca_data_en_52k dataset with vicuna-7b-v1.1 (GPU: A100), using group_by_length and no checkpointing.

Method 1: using the same configuration file and data, fine-tune two LoRA adapters simultaneously.
Method 2: using the same configuration file and data, fine-tune only one LoRA adapter.

Method 1 config file:

{
    "cutoff_len": 256,
    "group_by_length": true,
    "expand_right": true,
    "pad_token_id": -1,
    "save_step": 20000,
    "lora": [
        {
            "name": "lora_0",
            "output": "lora_0",
            "optim": "adamw",
            "lr": 1e-4,
            "batch_size": 16,
            "num_epochs": 1,
            "r": 8,
            "alpha": 16,
            "dropout": 0.05,
            "target_modules": {
                "q_proj": true,
                "k_proj": true,
                "v_proj": true,
                "o_proj": true,
                "w1_proj": false,
                "w2_proj": false,
                "w3_proj": false
            },
            "data": "/data/glx/LLaMA-Efficient-Tuning/data/alpaca_data_en_52k.json",
            "prompt": "template/template_demo.json"
        },
        {
            "name": "lora_1",
            "output": "lora_1",
            "optim": "adamw",
            "lr": 1e-4,
            "batch_size": 16,
            "num_epochs": 1,
            "r": 8,
            "alpha": 16,
            "dropout": 0.05,
            "target_modules": {
                "q_proj": true,
                "k_proj": true,
                "v_proj": true,
                "o_proj": true,
                "w1_proj": false,
                "w2_proj": false,
                "w3_proj": false
            },
            "data": "/data/glx/LLaMA-Efficient-Tuning/data/alpaca_data_en_52k.json",
            "prompt": "template/template_demo.json"
        }
    ]
}

Method 2 config file:

{
    "cutoff_len": 256,
    "group_by_length": true,
    "expand_right": true,
    "pad_token_id": -1,
    "save_step": 20000,
    "lora": [
        {
            "name": "lora_only1",
            "output": "lora_only1",
            "optim": "adamw",
            "lr": 1e-4,
            "batch_size": 16,
            "num_epochs": 1,
            "r": 8,
            "alpha": 16,
            "dropout": 0.05,
            "target_modules": {
                "q_proj": true,
                "k_proj": true,
                "v_proj": true,
                "o_proj": true,
                "w1_proj": false,
                "w2_proj": false,
                "w3_proj": false
            },
            "data": "/data/glx/LLaMA-Efficient-Tuning/data/alpaca_data_en_52k.json",
            "prompt": "template/template_demo.json"
        }
    ]
}

Method 1: time cost: 7h55min, GPU memory cost: 21.74GB
Method 2: time cost: 4h17min, GPU memory cost: 15.86GB
