LQ-LoRA: Low-rank plus Quantized Matrix Decomposition for Efficient Language Model Finetuning [Paper]

Changelog

20231215: Uploaded artifacts.

Artifacts

Model checkpoint (and training logs) for LLaMA-2 7B with LQ-LoRA (2.75-bits, 64-rank, Fisher) [link]
Model checkpoint (and training logs) for LLaMA-2 70B with LQ-LoRA (2.75-bits, 64-rank, Fisher) [link]
Pre-computed ILP data for LLaMA-2 7B [link]
Pre-computed ILP data for LLaMA-2 70B [link]
Fisher Information for LLaMA-2 7B [link]
Fisher Information for LLaMA-2 70B: the file exceeds the size limit, so please contact us for access.
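For a rough sense of what the 2.75-bit setting means for storage, the back-of-the-envelope estimate below converts an average bits-per-parameter budget into gigabytes. It is only a sketch: the helper `weight_memory_gb` is hypothetical, the parameter counts are the nominal 7B and 70B figures, and the estimate ignores the rank-64 LoRA factors, quantization metadata, and any layers kept in higher precision.

```python
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes).

    Hypothetical helper for illustration; not part of the repository.
    """
    return num_params * bits_per_param / 8 / 1e9


# Nominal parameter counts (assumed round figures, not exact model sizes).
NOMINAL_PARAMS = {"7b": 7e9, "70b": 70e9}

for size, num_params in NOMINAL_PARAMS.items():
    quantized = weight_memory_gb(num_params, 2.75)  # LQ-LoRA target budget
    half = weight_memory_gb(num_params, 16)         # 16-bit baseline
    print(f"LLaMA-2 {size}: ~{quantized:.2f} GB at 2.75 bits "
          f"vs ~{half:.2f} GB at 16 bits")
```

The ratio between the two columns is simply 16 / 2.75, so the quantized base weights take roughly one sixth of the 16-bit footprint under these assumptions.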

Installation

Clone the repo

git clone https://github.com/HanGuo97/lq-lora.git
cd lq-lora

Create Docker image (optional)

# Using BuildKit
DOCKER_BUILDKIT=1 docker build \
    -t lqlora \
    -f Dockerfile \
    .

docker run -ti --rm \
    --gpus all \
    -p 28888:8888 \
    --shm-size=2g \
    lqlora \
    bash -c "cd main/ && jupyter-lab --ip=0.0.0.0 --allow-root"

Install dependencies

bash scripts/setup.sh

Note: Some of the codebase relies on PyTorch>=2.1.
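One way to check this requirement before running anything is to compare the installed version string against the minimum. The snippet below is a minimal sketch; `meets_minimum` is a hypothetical helper, not something the repository provides, and it only compares the leading numeric components of the version.

```python
import re


def meets_minimum(version: str, minimum=(2, 1)) -> bool:
    """Return True if a version string such as '2.1.0+cu118' is >= minimum.

    Hypothetical helper: only the first len(minimum) numeric components
    are compared; local build tags after '+' are ignored.
    """
    release = version.split("+", 1)[0]
    parts = []
    for piece in release.split(".")[: len(minimum)]:
        match = re.match(r"\d+", piece)
        parts.append(int(match.group()) if match else 0)
    # Pad short versions such as "3" to the same length as `minimum`.
    parts += [0] * (len(minimum) - len(parts))
    return tuple(parts) >= tuple(minimum)


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed.")
    else:
        status = "OK" if meets_minimum(torch.__version__) else "too old"
        print(f"PyTorch {torch.__version__}: {status}")
```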

Usage

Downloading Data for Quantization

Download the data needed for quantization (see Artifacts), then update FILE_NAMES_DICT in models/allocation_utils accordingly.

Applying Quantization

from transformers import AutoTokenizer, AutoModelForCausalLM
from models import lora_utils

data = "c4"         # applying data-aware quantization
budget = "2.75"     # target bits
model_size = "70b"  # 7b or 70b

# Loads the base model (to CPU)
model = AutoModelForCausalLM.from_pretrained(
    f"meta-llama/Llama-2-{model_size}-hf")

# Adds LoRA components, etc
model = lora_utils.prepare_model_for_lora(
    model=model,
    num_ranks=64,
    lora_alpha=16,
    lora_dropout=0.0,
    use_gradient_checkpointing=True)

# Applies LQ-LoRA to the model.
lora_utils.transform_lora_layers(
    lpq=True,
    model=model,
    model_name=f"llama-2-{model_size}/lpq-64/{data},budget={budget}",
    device="cuda")
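For intuition about the "low-rank plus quantized" decomposition in the project title, the toy example below splits a single matrix W into a quantized part Q and a rank-r product L1 @ L2 by alternating between a truncated SVD and re-quantizing the residual. This is a sketch under stated assumptions, not the repository's implementation: `quantize_uniform` and `lq_decompose` are hypothetical names, the quantizer is a plain round-to-nearest uniform grid standing in for the real quantization scheme, and no data-aware (Fisher) weighting or bit allocation is applied.

```python
import numpy as np


def quantize_uniform(x, num_bits=3):
    """Round-to-nearest uniform quantizer (a stand-in, assumed for illustration)."""
    levels = 2 ** num_bits - 1
    lo, hi = x.min(), x.max()
    if hi == lo:
        return x.copy()
    scale = (hi - lo) / levels
    return np.round((x - lo) / scale) * scale + lo


def lq_decompose(W, rank, num_bits=3, num_iters=10):
    """Alternate between a rank-`rank` SVD and quantizing the residual.

    Returns (error, Q, L1, L2) for the iteration with the lowest
    Frobenius error ||W - (Q + L1 @ L2)||.
    """
    Q = np.zeros_like(W)
    best = None
    for _ in range(num_iters):
        # Low-rank step: best rank-r approximation of what Q does not explain.
        U, S, Vt = np.linalg.svd(W - Q, full_matrices=False)
        root = np.sqrt(S[:rank])
        L1 = U[:, :rank] * root
        L2 = root[:, None] * Vt[:rank]
        # Quantization step: quantize what the low-rank part does not explain.
        Q = quantize_uniform(W - L1 @ L2, num_bits)
        err = np.linalg.norm(W - (Q + L1 @ L2))
        # The alternation is a heuristic, so keep the best iterate seen so far.
        if best is None or err < best[0]:
            best = (err, Q, L1, L2)
    return best


rng = np.random.default_rng(0)
W = rng.standard_normal((32, 32))
err, Q, L1, L2 = lq_decompose(W, rank=4)
print(f"Frobenius error of Q + L1 @ L2: {err:.3f}")
```

In this picture Q plays the role of the frozen, quantized base weights, while L1 and L2 correspond to the LoRA factors that remain trainable during finetuning.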

Saving Quantized Models

Note that HuggingFace's PEFT library only saves the adapter parameters. Since LQ-LoRA additionally changes the base model parameters, we need to save the entire weights of the model.

import os
import torch

# `output_dir` is the directory where the checkpoint will be written.
state_dict = model.state_dict()
file_name = os.path.join(
    output_dir,
    "full_model.pth")
torch.save(state_dict, file_name)

Loading Quantized Models

# `model` is the base model, loaded as in "Applying Quantization".
# No need to apply `transform_lora_layers` because
# the transformed layers will be loaded from the checkpoint.
model = lora_utils.prepare_model_for_lora(
    model=model,
    num_ranks=64,
    lora_alpha=16,
    lora_dropout=0.0,
    use_gradient_checkpointing=True,
    checkpoint_dir=checkpoint_dir)  # path to the checkpoint directory
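Before loading, it can help to confirm that the checkpoint directory actually contains the saved weights. The sketch below assumes the file is named `full_model.pth`, as in the saving snippet above, and sits directly inside `checkpoint_dir`; `resolve_checkpoint` is a hypothetical helper, and how `prepare_model_for_lora` locates the file internally is not confirmed here.

```python
import os


def resolve_checkpoint(checkpoint_dir: str,
                       file_name: str = "full_model.pth") -> str:
    """Return the checkpoint file path, or raise if it is missing.

    Hypothetical helper; the default file name mirrors the saving example
    and is an assumption about the expected layout.
    """
    path = os.path.join(checkpoint_dir, file_name)
    if not os.path.isfile(path):
        raise FileNotFoundError(
            f"Expected a checkpoint at {path!r}; "
            "save the full model state dict there first.")
    return path
```

Calling `resolve_checkpoint(checkpoint_dir)` before `prepare_model_for_lora` turns a missing or misplaced file into an early, readable error.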

Todos

Upload the artifacts (done on 20231215; see Changelog).
The code currently uses a legacy version of the (de)quantization implementation. We will update it to use the latest version.

Acknowledgement

This code reuses components from several libraries including QLoRA and OmniQuant.
