
unsloth's Introduction


Finetune Llama 3, Mistral, Phi-3 & Gemma 2-5x faster with 80% less memory!

✨ Finetune for Free

All notebooks are beginner friendly! Add your dataset, click "Run All", and you'll get a 2x faster finetuned model which can be exported to GGUF, Ollama, vLLM or uploaded to Hugging Face.

| Unsloth supports | Free Notebooks | Performance | Memory use |
|---|---|---|---|
| Llama 3 (8B) | ▶️ Start for free | 2x faster | 60% less |
| Mistral v0.3 (7B) | ▶️ Start for free | 2.2x faster | 73% less |
| Gemma 2 (9B) | ▶️ Start for free | 2x faster | 63% less |
| Phi-3 (mini) | ▶️ Start for free | 2x faster | 50% less |
| Phi-3 (medium) | ▶️ Start for free | 2x faster | 50% less |
| Ollama | ▶️ Start for free | 1.9x faster | 43% less |
| ORPO | ▶️ Start for free | 1.9x faster | 43% less |
| DPO Zephyr | ▶️ Start for free | 1.9x faster | 43% less |
| TinyLlama | ▶️ Start for free | 3.9x faster | 74% less |

🦥 Unsloth.ai News

🔗 Links and Resources

| Type | Links |
|---|---|
| 📚 Documentation & Wiki | Read Our Wiki |
| Twitter (aka X) | Follow us on X |
| 💾 Installation | unsloth/README.md |
| 🥇 Benchmarking | Performance Tables |
| 🌐 Released Models | Unsloth Releases |
| ✍️ Blog | Read our Blogs |

⭐ Key Features

  • All kernels written in OpenAI's Triton language. Manual backprop engine.
  • 0% loss in accuracy - no approximation methods - all exact.
  • No change of hardware required. Supports NVIDIA GPUs from 2018 onwards. Minimum CUDA Capability 7.0 (V100, T4, Titan V, RTX 20, 30, 40x, A100, H100, L40 etc.). Check your GPU! GTX 1070 and 1080 work, but are slow.
  • Works on Linux and Windows via WSL.
  • Supports 4bit and 16bit QLoRA / LoRA finetuning via bitsandbytes.
  • Open source trains 5x faster - see Unsloth Pro for up to 30x faster training!
  • If you trained a model with 🦥Unsloth, you can use this cool sticker!  

🥇 Performance Benchmarking

| 1 A100 40GB | 🤗Hugging Face | Flash Attention | 🦥Unsloth Open Source | 🦥Unsloth Pro |
|---|---|---|---|---|
| Alpaca | 1x | 1.04x | 1.98x | 15.64x |
| LAION Chip2 | 1x | 0.92x | 1.61x | 20.73x |
| OASST | 1x | 1.19x | 2.17x | 14.83x |
| Slim Orca | 1x | 1.18x | 2.22x | 14.82x |

| Free Colab T4 | Dataset | 🤗Hugging Face | Pytorch 2.1.1 | 🦥Unsloth | 🦥 VRAM reduction |
|---|---|---|---|---|---|
| Llama-2 7b | OASST | 1x | 1.19x | 1.95x | -43.3% |
| Mistral 7b | Alpaca | 1x | 1.07x | 1.56x | -13.7% |
| Tiny Llama 1.1b | Alpaca | 1x | 2.06x | 3.87x | -73.8% |
| DPO with Zephyr | Ultra Chat | 1x | 1.09x | 1.55x | -18.6% |

💾 Installation Instructions

Conda Installation

Select either pytorch-cuda=11.8 for CUDA 11.8 or pytorch-cuda=12.1 for CUDA 12.1. If you have mamba, use mamba instead of conda for faster solving. See this Github issue for help on debugging Conda installs.

conda create --name unsloth_env \
    python=3.10 \
    pytorch-cuda=<11.8/12.1> \
    pytorch cudatoolkit xformers -c pytorch -c nvidia -c xformers \
    -y
conda activate unsloth_env

pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"

pip install --no-deps "trl<0.9.0" peft accelerate bitsandbytes

Pip Installation

Do NOT use this if you have Anaconda. You must use the Conda install method, or else stuff will BREAK.

  1. Find your CUDA version via
import torch; torch.version.cuda
  2. For Pytorch 2.1.0: You can update Pytorch via Pip (interchange cu121 / cu118). Go to https://pytorch.org/ to learn more. Select either cu118 for CUDA 11.8 or cu121 for CUDA 12.1. If you have an RTX 3060 or higher (A100, H100 etc.), use the "ampere" path. For Pytorch 2.1.1: go to step 3. For Pytorch 2.2.0: go to step 4.
pip install --upgrade --force-reinstall --no-cache-dir torch==2.1.0 triton \
  --index-url https://download.pytorch.org/whl/cu121
pip install "unsloth[cu118] @ git+https://github.com/unslothai/unsloth.git"
pip install "unsloth[cu121] @ git+https://github.com/unslothai/unsloth.git"
pip install "unsloth[cu118-ampere] @ git+https://github.com/unslothai/unsloth.git"
pip install "unsloth[cu121-ampere] @ git+https://github.com/unslothai/unsloth.git"
  3. For Pytorch 2.1.1: Use the "ampere" path for newer RTX 30xx GPUs or higher.
pip install --upgrade --force-reinstall --no-cache-dir torch==2.1.1 triton \
  --index-url https://download.pytorch.org/whl/cu121
pip install "unsloth[cu118-torch211] @ git+https://github.com/unslothai/unsloth.git"
pip install "unsloth[cu121-torch211] @ git+https://github.com/unslothai/unsloth.git"
pip install "unsloth[cu118-ampere-torch211] @ git+https://github.com/unslothai/unsloth.git"
pip install "unsloth[cu121-ampere-torch211] @ git+https://github.com/unslothai/unsloth.git"
  4. For Pytorch 2.2.0: Use the "ampere" path for newer RTX 30xx GPUs or higher.
pip install --upgrade --force-reinstall --no-cache-dir torch==2.2.0 triton \
  --index-url https://download.pytorch.org/whl/cu121
pip install "unsloth[cu118-torch220] @ git+https://github.com/unslothai/unsloth.git"
pip install "unsloth[cu121-torch220] @ git+https://github.com/unslothai/unsloth.git"
pip install "unsloth[cu118-ampere-torch220] @ git+https://github.com/unslothai/unsloth.git"
pip install "unsloth[cu121-ampere-torch220] @ git+https://github.com/unslothai/unsloth.git"
  5. If you get errors, try the below first, then go back to step 1:
pip install --upgrade pip
  6. For Pytorch 2.2.1:
# RTX 3090, 4090 Ampere GPUs:
pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
pip install --no-deps packaging ninja einops flash-attn xformers trl peft accelerate bitsandbytes

# Pre Ampere RTX 2080, T4, GTX 1080 GPUs:
pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
pip install --no-deps xformers "trl<0.9.0" peft accelerate bitsandbytes
  7. For Pytorch 2.3.0: Use the "ampere" path for newer RTX 30xx GPUs or higher.
pip install "unsloth[cu118-torch230] @ git+https://github.com/unslothai/unsloth.git"
pip install "unsloth[cu121-torch230] @ git+https://github.com/unslothai/unsloth.git"
pip install "unsloth[cu118-ampere-torch230] @ git+https://github.com/unslothai/unsloth.git"
pip install "unsloth[cu121-ampere-torch230] @ git+https://github.com/unslothai/unsloth.git"
  8. To troubleshoot installs, try the commands below (all must succeed). xformers should be available in most cases. A Python-side sanity check is sketched after these commands.
nvcc
python -m xformers.info
python -m bitsandbytes
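
As a quick follow-up, the short Python snippet below (a sketch, not an official Unsloth script) checks the pieces Unsloth relies on: a CUDA-enabled PyTorch build, a GPU with compute capability 7.0 or higher, and that the unsloth package itself imports.

import torch

print("torch:", torch.__version__, "| CUDA:", torch.version.cuda)
print("GPU available:", torch.cuda.is_available())
if torch.cuda.is_available():
    major, minor = torch.cuda.get_device_capability()
    print("Compute capability:", f"{major}.{minor}", "(7.0+ required)")
    print("bfloat16 supported:", torch.cuda.is_bf16_supported())

from unsloth import FastLanguageModel  # should import cleanly once installation succeeded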

📜 Documentation

  • Go to our Wiki page for saving to GGUF, checkpointing, evaluation and more!
  • We support Huggingface's TRL, Trainer, Seq2SeqTrainer or even Pytorch code!
  • We're in 🤗Hugging Face's official docs! Check out the SFT docs and DPO docs!
from unsloth import FastLanguageModel 
from unsloth import is_bfloat16_supported
import torch
from trl import SFTTrainer
from transformers import TrainingArguments
from datasets import load_dataset
max_seq_length = 2048 # Supports RoPE Scaling internally, so choose any!
# Get LAION dataset
url = "https://huggingface.co/datasets/laion/OIG/resolve/main/unified_chip2.jsonl"
dataset = load_dataset("json", data_files = {"train" : url}, split = "train")

# 4bit pre quantized models we support for 4x faster downloading + no OOMs.
fourbit_models = [
    "unsloth/mistral-7b-v0.3-bnb-4bit",      # New Mistral v3 2x faster!
    "unsloth/mistral-7b-instruct-v0.3-bnb-4bit",
    "unsloth/llama-3-8b-bnb-4bit",           # Llama-3 15 trillion tokens model 2x faster!
    "unsloth/llama-3-8b-Instruct-bnb-4bit",
    "unsloth/llama-3-70b-bnb-4bit",
    "unsloth/Phi-3-mini-4k-instruct",        # Phi-3 2x faster!
    "unsloth/Phi-3-medium-4k-instruct",
    "unsloth/mistral-7b-bnb-4bit",
    "unsloth/gemma-7b-bnb-4bit",             # Gemma 2.2x faster!
] # More models at https://huggingface.co/unsloth

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/llama-3-8b-bnb-4bit",
    max_seq_length = max_seq_length,
    dtype = None,
    load_in_4bit = True,
)

# Do model patching and add fast LoRA weights
model = FastLanguageModel.get_peft_model(
    model,
    r = 16,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    max_seq_length = max_seq_length,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)

trainer = SFTTrainer(
    model = model,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    tokenizer = tokenizer,
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 10,
        max_steps = 60,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        output_dir = "outputs",
        optim = "adamw_8bit",
        seed = 3407,
    ),
)
trainer.train()

# Go to https://github.com/unslothai/unsloth/wiki for advanced tips like
# (1) Saving to GGUF / merging to 16bit for vLLM
# (2) Continued training from a saved LoRA adapter
# (3) Adding an evaluation loop / OOMs
# (4) Customized chat templates
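
After training, a common next step is to save the LoRA adapters and run a quick generation as a sanity check. The snippet below is a minimal sketch using the same objects as above: save_pretrained stores only the adapter weights, and FastLanguageModel.for_inference (available in recent Unsloth releases; skip it if your version lacks it) enables Unsloth's faster inference path. See the wiki for GGUF and merged 16bit exports.

# Save the LoRA adapters (not the merged model) plus the tokenizer
model.save_pretrained("lora_model")
tokenizer.save_pretrained("lora_model")

# Quick generation sanity check
FastLanguageModel.for_inference(model)  # assumed available; enables faster inference
inputs = tokenizer(["The capital of France is"], return_tensors = "pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens = 32)
print(tokenizer.batch_decode(outputs, skip_special_tokens = True))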

DPO Support

DPO (Direct Preference Optimization), PPO and reward modelling all seem to work, as per independent third-party testing from Llama-Factory. We have a preliminary Google Colab notebook for reproducing Zephyr on a Tesla T4 here: notebook.

We're in 🤗Hugging Face's official docs! We're on the SFT docs and the DPO docs!

from unsloth import FastLanguageModel, PatchDPOTrainer
from unsloth import is_bfloat16_supported
PatchDPOTrainer()
import torch
from transformers import TrainingArguments
from trl import DPOTrainer

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/zephyr-sft-bnb-4bit",
    max_seq_length = max_seq_length,
    dtype = None,
    load_in_4bit = True,
)

# Do model patching and add fast LoRA weights
model = FastLanguageModel.get_peft_model(
    model,
    r = 64,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 64,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    max_seq_length = max_seq_length,
)

dpo_trainer = DPOTrainer(
    model = model,
    ref_model = None,
    args = TrainingArguments(
        per_device_train_batch_size = 4,
        gradient_accumulation_steps = 8,
        warmup_ratio = 0.1,
        num_train_epochs = 3,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        seed = 42,
        output_dir = "outputs",
    ),
    beta = 0.1,
    train_dataset = YOUR_DATASET_HERE,
    # eval_dataset = YOUR_DATASET_HERE,
    tokenizer = tokenizer,
    max_length = 1024,
    max_prompt_length = 512,
)
dpo_trainer.train()
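
YOUR_DATASET_HERE above is a placeholder. As a minimal sketch of what it should look like (assuming TRL's standard DPO format with "prompt", "chosen" and "rejected" text columns), a tiny preference dataset can be built like this:

from datasets import Dataset

pairs = {
    "prompt":   ["What is the capital of France?"],
    "chosen":   ["The capital of France is Paris."],
    "rejected": ["France does not have a capital."],
}
train_dataset = Dataset.from_dict(pairs)  # pass as train_dataset to DPOTrainer above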

🥇 Detailed Benchmarking Tables

  • Click "Code" for fully reproducible examples
  • "Unsloth Equal" is a preview of our PRO version, with code stripped out. All settings and the loss curve remains identical.
  • For the full list of benchmarking tables, go to our website
| 1 A100 40GB | 🤗Hugging Face | Flash Attention 2 | 🦥Unsloth Open | Unsloth Equal | Unsloth Pro | Unsloth Max |
|---|---|---|---|---|---|---|
| Alpaca | 1x | 1.04x | 1.98x | 2.48x | 5.32x | 15.64x |
| code | Code | Code | Code | Code | | |
| seconds | 1040 | 1001 | 525 | 419 | 196 | 67 |
| memory MB | 18235 | 15365 | 9631 | 8525 | | |
| % saved | | 15.74 | 47.18 | 53.25 | | |

Llama-Factory 3rd party benchmarking

  • Link to performance table. TGS: tokens per GPU per second. Model: LLaMA2-7B. GPU: NVIDIA A100 * 1. Batch size: 4. Gradient accumulation: 2. LoRA rank: 8. Max length: 1024.
| Method | Bits | TGS | GRAM | Speed |
|---|---|---|---|---|
| HF | 16 | 2392 | 18GB | 100% |
| HF+FA2 | 16 | 2954 | 17GB | 123% |
| Unsloth+FA2 | 16 | 4007 | 16GB | 168% |
| HF | 4 | 2415 | 9GB | 101% |
| Unsloth+FA2 | 4 | 3726 | 7GB | 160% |

Performance comparisons between popular models

Click for specific model benchmarking tables (Mistral 7b, CodeLlama 34b etc.)

Mistral 7b

| 1 A100 40GB | Hugging Face | Flash Attention 2 | Unsloth Open | Unsloth Equal | Unsloth Pro | Unsloth Max |
|---|---|---|---|---|---|---|
| Mistral 7B Slim Orca | 1x | 1.15x | 2.15x | 2.53x | 4.61x | 13.69x |
| code | Code | Code | Code | Code | | |
| seconds | 1813 | 1571 | 842 | 718 | 393 | 132 |
| memory MB | 32853 | 19385 | 12465 | 10271 | | |
| % saved | | 40.99 | 62.06 | 68.74 | | |

CodeLlama 34b

| 1 A100 40GB | Hugging Face | Flash Attention 2 | Unsloth Open | Unsloth Equal | Unsloth Pro | Unsloth Max |
|---|---|---|---|---|---|---|
| Code Llama 34B | OOM ❌ | 0.99x | 1.87x | 2.61x | 4.27x | 12.82x |
| code | ▶️ Code | Code | Code | Code | | |
| seconds | 1953 | 1982 | 1043 | 748 | 458 | 152 |
| memory MB | 40000 | 33217 | 27413 | 22161 | | |
| % saved | | 16.96 | 31.47 | 44.60 | | |

1 Tesla T4

| 1 T4 16GB | Hugging Face | Flash Attention | Unsloth Open | Unsloth Pro Equal | Unsloth Pro | Unsloth Max |
|---|---|---|---|---|---|---|
| Alpaca | 1x | 1.09x | 1.69x | 1.79x | 2.93x | 8.3x |
| code | ▶️ Code | Code | Code | Code | | |
| seconds | 1599 | 1468 | 942 | 894 | 545 | 193 |
| memory MB | 7199 | 7059 | 6459 | 5443 | | |
| % saved | | 1.94 | 10.28 | 24.39 | | |

2 Tesla T4s via DDP

| 2 T4 DDP | Hugging Face | Flash Attention | Unsloth Open | Unsloth Equal | Unsloth Pro | Unsloth Max |
|---|---|---|---|---|---|---|
| Alpaca | 1x | 0.99x | 4.95x | 4.44x | 7.28x | 20.61x |
| code | ▶️ Code | Code | Code | | | |
| seconds | 9882 | 9946 | 1996 | 2227 | 1357 | 480 |
| memory MB | 9176 | 9128 | 6904 | 6782 | | |
| % saved | | 0.52 | 24.76 | 26.09 | | |

Performance comparisons on 1 Tesla T4 GPU:

Click for Time taken for 1 epoch

One Tesla T4 on Google Colab: bsz = 2, ga = 4, max_grad_norm = 0.3, num_train_epochs = 1, seed = 3047, lr = 2e-4, wd = 0.01, optim = "adamw_8bit", schedule = "linear", schedule_steps = 10

| System | GPU | Alpaca (52K) | LAION OIG (210K) | Open Assistant (10K) | SlimOrca (518K) |
|---|---|---|---|---|---|
| Huggingface | 1 T4 | 23h 15m | 56h 28m | 8h 38m | 391h 41m |
| Unsloth Open | 1 T4 | 13h 7m (1.8x) | 31h 47m (1.8x) | 4h 27m (1.9x) | 240h 4m (1.6x) |
| Unsloth Pro | 1 T4 | 3h 6m (7.5x) | 5h 17m (10.7x) | 1h 7m (7.7x) | 59h 53m (6.5x) |
| Unsloth Max | 1 T4 | 2h 39m (8.8x) | 4h 31m (12.5x) | 0h 58m (8.9x) | 51h 30m (7.6x) |

Peak Memory Usage

| System | GPU | Alpaca (52K) | LAION OIG (210K) | Open Assistant (10K) | SlimOrca (518K) |
|---|---|---|---|---|---|
| Huggingface | 1 T4 | 7.3GB | 5.9GB | 14.0GB | 13.3GB |
| Unsloth Open | 1 T4 | 6.8GB | 5.7GB | 7.8GB | 7.7GB |
| Unsloth Pro | 1 T4 | 6.4GB | 6.4GB | 6.4GB | 6.4GB |
| Unsloth Max | 1 T4 | 11.4GB | 12.4GB | 11.9GB | 14.4GB |
Click for Performance Comparisons on 2 Tesla T4 GPUs via DDP: Time taken for 1 epoch

Two Tesla T4s on Kaggle: bsz = 2, ga = 4, max_grad_norm = 0.3, num_train_epochs = 1, seed = 3047, lr = 2e-4, wd = 0.01, optim = "adamw_8bit", schedule = "linear", schedule_steps = 10

| System | GPU | Alpaca (52K) | LAION OIG (210K) | Open Assistant (10K) | SlimOrca (518K) * |
|---|---|---|---|---|---|
| Huggingface | 2 T4 | 84h 47m | 163h 48m | 30h 51m | 1301h 24m * |
| Unsloth Pro | 2 T4 | 3h 20m (25.4x) | 5h 43m (28.7x) | 1h 12m (25.7x) | 71h 40m (18.1x) * |
| Unsloth Max | 2 T4 | 3h 4m (27.6x) | 5h 14m (31.3x) | 1h 6m (28.1x) | 54h 20m (23.9x) * |

Peak Memory Usage on a Multi GPU System (2 GPUs)

| System | GPU | Alpaca (52K) | LAION OIG (210K) | Open Assistant (10K) | SlimOrca (518K) * |
|---|---|---|---|---|---|
| Huggingface | 2 T4 | 8.4GB / 6GB | 7.2GB / 5.3GB | 14.3GB / 6.6GB | 10.9GB / 5.9GB * |
| Unsloth Pro | 2 T4 | 7.7GB / 4.9GB | 7.5GB / 4.9GB | 8.5GB / 4.9GB | 6.2GB / 4.7GB * |
| Unsloth Max | 2 T4 | 10.5GB / 5GB | 10.6GB / 5GB | 10.6GB / 5GB | 10.5GB / 5GB * |
  • * Slim Orca uses bsz=1 for all benchmarks since bsz=2 OOMs. We can handle bsz=2, but we benchmark with bsz=1 for consistency.


Thank You to

unsloth's People

Contributors

coffeevampir3, danielhanchen, huynguyen-hust, qubitium, shimmyshimmer, younesbelkada


unsloth's Issues

Breakdown of speedups

Hey there,

Thanks for releasing this!

Going through the list of kernels:

  1. CrossEntropyLoss
  2. RMS NORM
  3. RopeEmbedding
  4. Swiglu
  5. FastLoRA

I'm trying to understand how the various optimizations correlate with the performance improvements. Is there a chart that shows the gains from #5 alone?

Secondly, could you please explain what's being done/included in both the PRO and MAX tiers? The wording in the blog post is very imprecise.

Thanks!

Manual Modifications to Extend LoraConfig Support

Thanks for the great work on this project!

I was wondering if it's possible to expand the trainable modules in LoraConfig by adding to the modules_to_save parameter. Specifically, can we include more modules like lm_head for training? Here's an example of what I'm thinking:

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    inference_mode=False,
    r=r,
    lora_alpha=lora_alpha,
    lora_dropout=finetuning_args.lora_dropout,
    target_modules=target_modules,
    modules_to_save="lm_head"
)

Would love to hear your thoughts on this or if there's already a way to do it.

Thanks!

CUDA_VISIBLE_DEVICES

Hi, I got the following warning when using a single machine for multi-GPU training:
/usr/local/lib/python3.10/dist-packages/unsloth/__init__.py:23: UserWarning: Unsloth: 'CUDA_VISIBLE_DEVICES' is currently 0,1,2,3 but we require 'CUDA_VISIBLE_DEVICES=0' We shall set it ourselves.
How does this affect my training? The training framework is LLaMA-Factory.

RuntimeError: Cannot launch Triton kernel since n = 46336 exceeds the maximum CUDA blocksize = 65535.

RuntimeError: Cannot launch Triton kernel since n = 46336 exceeds the maximum CUDA blocksize = 65535.
File /usr/local/lib/python3.10/dist-packages/unsloth/kernels/utils.py:23, in calculate_settings(n)
     21 # CUDA only supports 65535 - 2^16-1 threads per block
     22 if BLOCK_SIZE > MAX_FUSED_SIZE:
---> 23     raise RuntimeError(f"Cannot launch Triton kernel since n = {n} exceeds "\
     24                        f"the maximum CUDA blocksize = {MAX_FUSED_SIZE}.")
     25 num_warps = 4
     26 if   BLOCK_SIZE >= 32768: num_warps = 32
H100 80GB

RoPE Scaling issues

Hey there,

Worked on this quite a lot yesterday with the help of GPT-4, but I'm having some issues with errors similar to the following:

Traceback (most recent call last):
  File "/mnt/c/Users/magda/OneDrive/Desktop/python_work/unsloth_finetuner.py", line 67, in <module>
    trainer.train()
  File "/home/harry/.local/lib/python3.10/site-packages/transformers/trainer.py", line 1537, in train
    return inner_training_loop(
  File "/home/harry/.local/lib/python3.10/site-packages/transformers/trainer.py", line 1854, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/home/harry/.local/lib/python3.10/site-packages/transformers/trainer.py", line 2744, in training_step
    self.accelerator.backward(loss)
  File "/home/harry/.local/lib/python3.10/site-packages/accelerate/accelerator.py", line 1903, in backward
    self.scaler.scale(loss).backward(**kwargs)
  File "/home/harry/.local/lib/python3.10/site-packages/torch/_tensor.py", line 492, in backward
    torch.autograd.backward(
  File "/home/harry/.local/lib/python3.10/site-packages/torch/autograd/__init__.py", line 251, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "/home/harry/.local/lib/python3.10/site-packages/torch/autograd/function.py", line 288, in apply
    return user_fn(self, *args)
  File "/home/harry/.local/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 288, in backward
    torch.autograd.backward(outputs_with_grad, args_with_grad)
  File "/home/harry/.local/lib/python3.10/site-packages/torch/autograd/__init__.py", line 251, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "/home/harry/.local/lib/python3.10/site-packages/torch/autograd/function.py", line 288, in apply
    return user_fn(self, *args)
  File "/home/harry/.local/lib/python3.10/site-packages/torch/cuda/amp/autocast_mode.py", line 140, in decorate_bwd
    return bwd(*args, **kwargs)
  File "/home/harry/.local/lib/python3.10/site-packages/unsloth/kernels/fast_lora.py", line 291, in backward
    dX = torch.matmul(dQ, QW.t(), out = X)
RuntimeError: Expected out tensor to have dtype c10::Half, but got float instead

I'm running an RTX 3090 locally, for the record, with CUDA 11.8 I believe. I downgraded Pytorch to 2.1.0 as well. Here's what the script looks like right now. This is my first time trying to finetune locally, so I'm not sure if the args are wrong, or what, but any advice would be very much appreciated.

from unsloth import FastLlamaModel, FastMistralModel
import torch
max_seq_length = 4096 # Can change to any number <= 4096
HAS_BFLOAT16 = torch.cuda.is_bf16_supported()
dtype = torch.bfloat16 if HAS_BFLOAT16 else torch.float16
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.

# Load Llama model
model, tokenizer = FastLlamaModel.from_pretrained(
    model_name = "unsloth/llama-2-7b", # Supports any llama model eg meta-llama/Llama-2-7b-hf
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    token = "hf...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)

# Do model patching and add fast LoRA weights
model = FastLlamaModel.get_peft_model(
    model,
    r = 16,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0, # Currently only supports dropout = 0
    bias = "none",    # Currently only supports bias = "none"
    use_gradient_checkpointing = True,
    random_state = 3407,
    max_seq_length = max_seq_length,
)

from transformers import Trainer, TrainingArguments
from datasets import load_dataset

dataset = load_dataset('json', data_files='processed_data.json', split='train')
# Tokenize the dataset
def tokenize_and_prepare_labels(examples):
    tokenized_inputs = tokenizer(examples['text'], padding='max_length', truncation=True, max_length=max_seq_length)
    # Use the input ids directly as labels; the model shifts them internally when computing the loss
    tokenized_inputs["labels"] = tokenized_inputs["input_ids"].copy()
    return tokenized_inputs

tokenized_dataset = dataset.map(tokenize_and_prepare_labels, batched=True)

# Define training arguments
training_args = TrainingArguments(
    output_dir='outputs',          
    num_train_epochs=3,            
    per_device_train_batch_size=8, 
    warmup_steps=500,              
    weight_decay=0.01,             
    logging_dir='./logs',          
    logging_steps=10,
    fp16=True,  # Enable mixed precision
)


# Initialize Trainer with the entire tokenized dataset
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,
    eval_dataset=None  # Set this to your evaluation dataset if you have one
)


print(f"Max sequence length: {max_seq_length}")
trainer.train()
print(f"Training done! Check outputs")

For the record, when I remove fp16 = True, it errors out a lot faster and gives a slight variation of the error:

RuntimeError: expected mat1 and mat2 to have the same dtype, but got: c10::BFloat16 != float

Unsloth breaks the inference ?!

Hello, thanks for your contribution, it is really promising, but for some reason it breaks generation and inference.
Here is an example:

from unsloth import FastLlamaModel
import torch
max_seq_length = 1024 # Can change to any number <= 4096
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.

# Load Llama model
model, tokenizer = FastLlamaModel.from_pretrained(
    model_name = "TheBloke/Llama-2-7B-fp16", # Supports any llama model
    max_seq_length = max_seq_length,
    dtype=dtype,
    load_in_4bit = False
    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)
inputs = tokenizer.encode("the concept of ", return_tensors="pt", add_special_tokens = True).to(model.device)
answer = model.generate(inputs, max_new_tokens = 20)
tokenizer.batch_decode(answer, skip_special_tokens = False)

The output:

==((====))==  Unsloth: Fast Llama patching release 23.11
   \\   /|    GPU: A100-SXM4-40GB. Max memory: 39.587 GB
O^O/ \_/ \    CUDA compute capability = 8.0
\        /    Pytorch version: 2.1.0+cu118. CUDA Toolkit = 11.8
 "-____-"     bfloat16 support = TRUE

Loading checkpoint shards: 100%
2/2 [00:14<00:00, 6.61s/it]
['<s> the concept of 1<s> Tags\n \\\n \\\n \\\n \\\n \\\n \\\n \\\n \\\n']

I have tried more than 4 different Llamas, including yours, and hit the same issue.

Question about benchmark

Hi, I'm highly interested in this project.
I am curious to know whether the Hugging Face baseline incorporates Flash Attention 2. Could you please offer a reproducible Hugging Face baseline example?

Does unsloth support loading existing lora weights when finetuning?

This is really an amazing project. It sped up my model finetuning by 2x.

I'm wondering whether I can add an existing LoRA adapter to the model, so as to continue training the LoRA on another dataset.
Currently, it seems that I have to merge my base model and LoRA weights in advance when using unsloth.
Therefore, I have to update and save a newly merged model each time before further training.

So, will this project support loading existing LoRA weights in the near future?

Thank you!

Question about readme benchmarks

Huggingface 1 T4 23h 15m 56h 28m 8h 38m 391h 41m
Huggingface 2 T4 84h 47m 163h 48m 30h 51m 1301h 24m

Sorry, are these different models? Why is 1 T4 faster?

`ERROR`: Package 'unsloth-2024.1' requires a different Python: 3.8.10 not in '>=3.9'

 > [21/23] RUN pip install "unsloth[cu121_torch211] @ git+https://github.com/unslothai/unsloth.git":                                                                                         
1.136 Collecting unsloth[cu121_torch211]@ git+https://github.com/unslothai/unsloth.git                                                                                                       
1.136   Cloning https://github.com/unslothai/unsloth.git to /tmp/pip-install-ptgzpiwm/unsloth                                                                                                
1.136   Running command git clone -q https://github.com/unslothai/unsloth.git /tmp/pip-install-ptgzpiwm/unsloth                                                                              
2.604   Installing build dependencies: started                                                                                                                                               
6.648   Installing build dependencies: finished with status 'done'
6.649   Getting requirements to build wheel: started
6.932   Getting requirements to build wheel: finished with status 'done'
6.936   Installing backend dependencies: started
9.363   Installing backend dependencies: finished with status 'done'
9.365     Preparing wheel metadata: started
9.693     Preparing wheel metadata: finished with status 'done'
9.773 ERROR: Package 'unsloth-2024.1' requires a different Python: 3.8.10 not in '>=3.9'

Python 3.8 reached EOL quite some time ago... please update it, or at least don't restrict currently supported versions.

Ampere-specific pip install line gives error message (I have a 4090 RTX)

Hello, I tried to follow the pip installation instructions for unslothai. I have a 4090 RTX, so I used the ampere command:

pip install "unsloth[cu118_ampere] @ git+https://github.com/unslothai/unsloth.git"

This produced the error messages shown below. I also found that if I used the non-ampere pip installation command, the installation succeeded.

Collecting unsloth[cu118_ampere]@ git+https://github.com/unslothai/unsloth.git
  Cloning https://github.com/unslothai/unsloth.git to /tmp/pip-install-airv06iv/unsloth_a06586211bd7445ba155bf97e1192880
  Running command git clone --filter=blob:none --quiet https://github.com/unslothai/unsloth.git /tmp/pip-install-airv06iv/unsloth_a06586211bd7445ba155bf97e1192880
  Resolved https://github.com/unslothai/unsloth.git to commit 27bbd6b2cd927b7cb0866a03dec41efb04470501
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Installing backend dependencies ... done
  Preparing metadata (pyproject.toml) ... done
WARNING: unsloth 2023.12 does not provide the extra 'cu118-ampere'
Collecting bitsandbytes (from unsloth[cu118_ampere]@ git+https://github.com/unslothai/unsloth.git)
  Using cached bitsandbytes-0.41.3.post2-py3-none-any.whl.metadata (9.8 kB)
Collecting flash-attn (from unsloth[cu118_ampere]@ git+https://github.com/unslothai/unsloth.git)
  Using cached flash_attn-2.3.6.tar.gz (2.3 MB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... error
  error: subprocess-exited-with-error
  
  × Getting requirements to build wheel did not run successfully.
  │ exit code: 1
  ╰─> [17 lines of output]
      Traceback (most recent call last):
        File "/home/zmx/m.2/Dev/llm/unsloth/venv/lib/python3.10/site-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 353, in <module>
          main()
        File "/home/zmx/m.2/Dev/llm/unsloth/venv/lib/python3.10/site-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 335, in main
          json_out['return_val'] = hook(**hook_input['kwargs'])
        File "/home/zmx/m.2/Dev/llm/unsloth/venv/lib/python3.10/site-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 118, in get_requires_for_build_wheel
          return hook(config_settings)
        File "/tmp/pip-build-env-1h3yu7xe/overlay/lib/python3.10/site-packages/setuptools/build_meta.py", line 325, in get_requires_for_build_wheel
          return self._get_build_requires(config_settings, requirements=['wheel'])
        File "/tmp/pip-build-env-1h3yu7xe/overlay/lib/python3.10/site-packages/setuptools/build_meta.py", line 295, in _get_build_requires
          self.run_setup()
        File "/tmp/pip-build-env-1h3yu7xe/overlay/lib/python3.10/site-packages/setuptools/build_meta.py", line 480, in run_setup
          super(_BuildMetaLegacyBackend, self).run_setup(setup_script=setup_script)
        File "/tmp/pip-build-env-1h3yu7xe/overlay/lib/python3.10/site-packages/setuptools/build_meta.py", line 311, in run_setup
          exec(code, locals())
        File "<string>", line 9, in <module>
      ModuleNotFoundError: No module named 'packaging'
      [end of output]
  
  note: This error originates from a subprocess, and is likely not a problem with pip.
error: subprocess-exited-with-error

× Getting requirements to build wheel did not run successfully.
│ exit code: 1
╰─> See above for output.

note: This error originates from a subprocess, and is likely not a problem with pip.

Any idea what the cause might be?

Full finetune (not QLoRA) - optimizations are not turned on

We are benchmarking training of llama2-13B-hf + hf + trl + FlashAttn 2 vs unsloth (which has FlashAttn 2 enabled) on a single A100 80GB.

With the latest changes:

  1. We see at most 10% improvement in training time.
  2. Peak VRAM is actually slightly elevated vs HF: ~1GB more on an A100.

Both using native bf16 for finetuning.

Overall, we cannot reproduce anything close to the numbers presented by unsloth.

Can anyone else share their results?

Edit: Perhaps the drastic improvements are only visible on a T4 GPU? Curious to see what real-world benchmark results look like on more modern GPUs, i.e. 3090, A100, 4090.

GGUF llama.cpp direct conversion - BETA testers wanted!

Previously, when one trained with either HF's TRL library or Unsloth, converting the model to GGUF / llama.cpp was a nightmare.

One of the following strategies had to be used:

  1. Save PEFT adapters and upload them to HF. Use llama.cpp to combine a GGUF version of Mistral / Llama with the LoRA via convert-lora-to-ggml.py, then quantize. Accuracy is medium, since bitsandbytes's quantization method is different from GGUF's. VRAM usage is great.
  2. Train with load_in_4bit = False to use 16bit, then use model.save_pretrained, run llama.cpp, then quantize. Accuracy is superb, since 16bit retains full accuracy, but VRAM usage for llama-7b goes to 14.5GB, and Mistral OOMs.
  3. Merge PEFT adapters back using model.merge_and_unload(), then convert each layer to 16bit, then follow step 2. Accuracy is low, since conversion to 4bit then upcasting to 16bit loses lots of bits.
  4. [NEW] Unsloth: Combines best of both worlds - train with QLoRA (low memory), then we upcast to 16bit, but on the fly, using mmap. Accuracy is high as well.

A table comparing all strategies:

| Strategy | VRAM Usage | Disk Usage | Accuracy |
|---|---|---|---|
| 1. PEFT | Low | Low | Medium |
| 2. 16bit | Horrible | High | High |
| 3. Merge | Horrible | High | Low |
| 4. Unsloth | Low | High | High |

To use Unsloth:

from unsloth import unsloth_save_model
# Assumption: colab_quantize_to_gguf is importable from unsloth as well; check your version.
from unsloth import colab_quantize_to_gguf
# unsloth_save_model has the same args as model.save_pretrained
unsloth_save_model(model, tokenizer, "output_model", push_to_hub = False, token = None)
colab_quantize_to_gguf("output_model", quantization_method = "q4_k_m")

I'm looking for beta testers to test out this Colab notebook! It finetunes Mistral and saves to GGUF: https://colab.research.google.com/drive/14DW0VwuqL2O3tqGlX7aUF6TOBA8S59M4?usp=sharing

It would be great if anyone tested it to check the perplexity / whether my conversions are correct! Thanks!

GGUF conversion

Hi, I modified your alpaca mistral 7b colab example https://colab.research.google.com/drive/1Dyauq4kTZoLewQ1cApceUQVNcnnNTzg_#scrollTo=upcOlWe7A1vc

to use another Mistral model. I'm confused about how to save the model in fp16 mode. The reason is that I then wanted to convert it to GGUF; here are the steps after training succeeded and the inference test came out correct:

new_model = "xxx/xxxx"
model = model.merge_and_unload()
model.push_to_hub(new_model)
tokenizer.push_to_hub(new_model)

The result is very small, only about 4.13 GB, and the HF tag says 4-bit precision (model.safetensors).
And llama.cpp's GGUF convert.py refused to convert this.

Did I do something wrong?

Note : BTW unsloth is very cool, very fast training

Xformers and other dependency errors

I am attempting to replicate the Mistral 7b model finetuning locally, as outlined under the Unsloth Open column in the README.md file. I have successfully downloaded the dataset and the model to my local machine, which is a deviation from the original jupyter notebook that used data directly from the cloud.

model_name = "./Mistral-7B-v0.1"
...
dataset = load_dataset(path = "./SlimOrca", split = "train")

The initial stages of the Jupyter notebook, which include multiple code blocks, were executed without any issues. However, I have run into an error during the execution of the trainer_stats = trainer.train() block. The issue arises with a ValueError stating that the Query/Key/Value must have either BMHK or BMK shape.

ValueError: Query/Key/Value should all have BMHK or BMK shape.
  query.shape: torch.Size([4, 2021, 8, 4, 128])
  key.shape  : torch.Size([4, 2021, 8, 4, 128])
  value.shape: torch.Size([4, 2021, 8, 4, 128])

I have even tried reverting back to using the HuggingFace dataset instead of the local one to see if the issue was related to my dataset location. Unfortunately, I encountered the same error.
Here’s a brief overview of my setup:

==((====))==  Unsloth: Fast Mistral patching release 2023.12
   \\   [/](https://file+.vscode-resource.vscode-cdn.net/)|    GPU: NVIDIA GeForce RTX 4090. Max memory: 23.647 GB
O^O/ \_/ \    CUDA compute capability = 8.9
\        [/](https://file+.vscode-resource.vscode-cdn.net/)    Pytorch version: 2.1.0.post300. CUDA Toolkit = 11.8
 "-____-"     bfloat16 support = TRUE

GPU = NVIDIA GeForce RTX 4090. Max memory = 23.647 GB.
4.66 GB of memory reserved.

Additionally, I have addressed an issue with the torch version format. The original Unsloth configuration didn't support version numbers with four segments, so I modified the version parsing from major_torch, minor_torch, _ = torch.__version__.split(".") to major_torch, minor_torch, _ = torch.__version__.split(".")[0:3] to accommodate the expected format.

Would anyone be able to point me in the right direction, or has anyone experienced a similar issue while working on model finetuning replication? Your help would be greatly appreciated as this error is currently standing in the way of my progress.

Thank you in advance for any assistance!

AMD GPU

Hi,
Does Unsloth support AMD GPUs?
Thank you!

ModuleNotFoundError: No module named 'unsloth'

Hello,

First of all, my GPU is an RTX 3090. I've installed it like this:
pip install "unsloth[cu118_ampere] @ git+https://github.com/unslothai/unsloth.git"

But the error
ModuleNotFoundError: No module named 'unsloth'

Can anyone tell me the reason?

ValueError: Token None for key pad_token should be a str or an AddedToken instance

Failed to load deepseek-ai/deepseek-llm-7b-base (which is a model of the Llama 2 architecture). Is the following code necessary? The HF tokenizer should automatically handle this according to tokenizer_config.json, shouldn't it?

tokenizer.add_special_tokens({"pad_token" : tokenizer.unk_token});
tokenizer.pad_token = tokenizer.unk_token
config = model.config.update({"pad_token_id" : tokenizer.unk_token_id});

File /workspaces/unsloth-train-playground/.venv/lib/python3.11/site-packages/unsloth/models/llama.py:599, in FastLlamaModel.from_pretrained(model_name, max_seq_length, dtype, load_in_4bit, token, device_map)
    586 model = AutoModelForCausalLM.from_pretrained(
    587     model_name,
    588     device_map = device_map,
   (...)
    591     token = token,
    592 )
    593 tokenizer = AutoTokenizer.from_pretrained(
    594     model_name,
    595     model_max_length = max_seq_length,
    596     padding_side = "right",
    597     token = token,
...
    962     if isinstance(value, (str)):
    963         # for legacy purpose we default to stripping. `False` depends on this
    964         value = AddedToken(value, rstrip=False, lstrip=False, normalized=False, special=True)

ValueError: Token None for key pad_token should be a str or an AddedToken instance

Feature request: discussions around new features within HF ecosystem with unsloth

Hi @danielhanchen

Thank you very much for this great project and pushing this forward for the community !

With the TRL / PEFT team, we've seen that your example scripts rely heavily on the PEFT / TRL libraries, and we wanted to ask whether you need any help or have any feature requests around the HF ecosystem; we would be happy to collaborate and see what we can do together.

Note also that SDPA has recently been integrated into transformers core (huggingface/transformers#26572); we were also wondering if you have done any comparisons of unsloth against transformers 4.36.0.

cc @pacman100 @lvwerra

RuntimeError: Cannot launch Triton kernel since n = 102400 exceeds the maximum CUDA blocksize = 65536.

Env:

  • base model: deepseek-llm-7b-base
  • Code: Unsloth - Alpaca.ipynb
  • GPU = NVIDIA GeForce RTX 3090. Max memory = 23.689 GB.
    5.176 GB of memory reserved.

Traceback:

RuntimeError                              Traceback (most recent call last)
/workspaces/unsloth-train-playground/train.ipynb Cell 6 line 1
----> 1 trainer_stats = trainer.train()

File /workspaces/unsloth-train-playground/.venv/lib/python3.11/site-packages/trl/trainer/sft_trainer.py:280, in SFTTrainer.train(self, *args, **kwargs)
    277 if self.neftune_noise_alpha is not None and not self._trainer_supports_neftune:
    278     self.model = self._trl_activate_neftune(self.model)
--> 280 output = super().train(*args, **kwargs)
    282 # After training we make sure to retrieve back the original forward pass method
    283 # for the embedding layer by removing the forward post hook.
    284 if self.neftune_noise_alpha is not None and not self._trainer_supports_neftune:

File /workspaces/unsloth-train-playground/.venv/lib/python3.11/site-packages/transformers/trainer.py:1555, in Trainer.train(self, resume_from_checkpoint, trial, ignore_keys_for_eval, **kwargs)
   1553         hf_hub_utils.enable_progress_bars()
   1554 else:
-> 1555     return inner_training_loop(
   1556         args=args,
   1557         resume_from_checkpoint=resume_from_checkpoint,
   1558         trial=trial,
   1559         ignore_keys_for_eval=ignore_keys_for_eval,
   1560     )

File /workspaces/unsloth-train-playground/.venv/lib/python3.11/site-packages/transformers/trainer.py:1860, in Trainer._inner_training_loop(self, batch_size, args, resume_from_checkpoint, trial, ignore_keys_for_eval)
   1857     self.control = self.callback_handler.on_step_begin(args, self.state, self.control)
...
     24                        f"the maximum CUDA blocksize = {MAX_FUSED_SIZE}.")
     25 num_warps = 4
     26 if   BLOCK_SIZE >= 32768: num_warps = 32

RuntimeError: Cannot launch Triton kernel since n = 102400 exceeds the maximum CUDA blocksize = 65536.

ImportError in 'from kernels import *': Undefined Symbol in bitsandbytes/libbitsandbytes_cuda117.so

This happens when executing the line from kernels import *. Below is the error trace for reference:

===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
================================================================================
...
AttributeError: /root/anaconda3/envs/group_chat/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda117.so: undefined symbol: cdequantize_blockwise_fp16_nf4. Did you mean: 'cdequantize_blockwise_fp32'?

The main issue seems to stem from an undefined symbol cdequantize_blockwise_fp16_nf4 in libbitsandbytes_cuda117.so. The system suggests an alternative cdequantize_blockwise_fp32, but it's unclear if this is a suitable substitute.

AttributeError: 'LlamaForCausalLM' object has no attribute 'extra_ignored_labels'

base model: llama2-13b-hf

You're using a LlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the __call__ method is faster than using a method to encode the text followed by a call to the pad method to get a padded encoding.
Traceback (most recent call last):
File "/root/fanfiction-go/python/dictionary/train/sft_trainer.py", line 353, in
train_result = trainer.train()
^^^^^^^^^^^^^^^
File "/root/miniconda3/lib/python3.11/site-packages/trl/trainer/sft_trainer.py", line 280, in train
output = super().train(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/lib/python3.11/site-packages/transformers/trainer.py", line 1555, in train
return inner_training_loop(
^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/lib/python3.11/site-packages/transformers/trainer.py", line 1860, in _inner_training_loop
tr_loss_step = self.training_step(model, inputs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/lib/python3.11/site-packages/transformers/trainer.py", line 2725, in training_step
loss = self.compute_loss(model, inputs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/lib/python3.11/site-packages/transformers/trainer.py", line 2748, in compute_loss
outputs = model(**inputs)
^^^^^^^^^^^^^^^
File "/root/miniconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/lib/python3.11/site-packages/accelerate/utils/operations.py", line 659, in forward
return model_forward(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/lib/python3.11/site-packages/accelerate/utils/operations.py", line 647, in call
return convert_to_fp32(self.model_forward(*args, **kwargs))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/lib/python3.11/site-packages/torch/amp/autocast_mode.py", line 16, in decorate_autocast
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/lib/python3.11/site-packages/unsloth/models/llama.py", line 475, in LlamaForCausalLM_fast_forward
shift_labels = torch.hstack((labels[..., 1:], self.extra_ignored_labels[:labels.shape[0]]))
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1695, in getattr
raise AttributeError(f"'{type(self).name}' object has no attribute '{name}'")
AttributeError: 'LlamaForCausalLM' object has no attribute 'extra_ignored_labels'

[Feature Request] Implement Dropout

Great work on these optimized kernels. I am happily using this on a fine-tune.

I'd like to request implementing dropout in your kernels, since without dropout we get too tight of a fine-tune.

I imagine you'd need to add random zeroing to the source matrix in your matmul.

TypeError: FastMistralModel.from_pretrained() got an unexpected keyword argument 'rope_scaling'

12/27/2023 19:13:37 - WARNING - llmtuner.model.patcher - Current model does not support RoPE scaling.
12/27/2023 19:13:37 - INFO - llmtuner.model.patcher - Using FlashAttention-2 for faster training and inference.
12/27/2023 19:13:37 - INFO - llmtuner.model.patcher - Quantizing model to 4 bit.
/root/miniconda3/envs/llama_factory/lib/python3.10/site-packages/unsloth/__init__.py:31: UserWarning: Unsloth: 'CUDA_DEVICE_ORDER' is not set but we require 'CUDA_DEVICE_ORDER=PCI_BUS_ID'
We shall set it ourselves.
warnings.warn(
Traceback (most recent call last):
File "/workspace/LLaMA-Factory/src/train_bash.py", line 14, in
main()
File "/workspace/LLaMA-Factory/src/train_bash.py", line 5, in main
run_exp()
File "/workspace/LLaMA-Factory/src/llmtuner/train/tuner.py", line 32, in run_exp
run_dpo(model_args, data_args, training_args, finetuning_args, callbacks)
File "/workspace/LLaMA-Factory/src/llmtuner/train/dpo/workflow.py", line 28, in run_dpo
model, tokenizer = load_model_and_tokenizer(model_args, finetuning_args, training_args.do_train)
File "/workspace/LLaMA-Factory/src/llmtuner/model/loader.py", line 77, in load_model_and_tokenizer
model, _ = FastMistralModel.from_pretrained(**unsloth_kwargs)
TypeError: FastMistralModel.from_pretrained() got an unexpected keyword argument 'rope_scaling'

Question about triton kernel for changed input shape

Excellent work on training optimization! I've had a question for a long time; could an expert like you help me with it?

The test code is below:

import torch
import triton.ops
import time

dtype = torch.int8
out_dtype = torch.int32

# %%
M, K, N = 1000, 1024, 2048
a = torch.randint(-128, 127, (M, K), dtype=dtype, device='cuda')
b = torch.randint(-128, 127, (N, K), dtype=dtype, device='cuda')
b = b.t()

for i in range(3):
    ts = time.time()
    out = triton.ops.matmul(a, b, out_dtype)
    print(f"test {i}: {time.time() - ts}s")

M = 1024  # changing M, like changing sequence length
print("changing M from 1000 to 1024")
a = torch.randint(-128, 127, (M, K), dtype=dtype, device='cuda')
b = torch.randint(-128, 127, (N, K), dtype=dtype, device='cuda')
b = b.t()
for i in range(3):
    ts = time.time()
    out = triton.ops.matmul(a, b, out_dtype)
    print(f"test {i}: {time.time() - ts}s")

output:

test 0: 6.1418256759643555s
test 1: 0.00017952919006347656s
test 2: 8.273124694824219e-05s
changing M from 1000 to 1024
test 0: 5.853637456893921s
test 1: 0.00012087821960449219s
test 2: 7.677078247070312e-05s

If Triton has to "warm up" every time the input shape changes, is there any way to fix or avoid this? Or am I using matmul the wrong way?

[Feature support] Qwen model support

Hi, I just found this wonderful project. May I ask how I could make the changes needed to support the Qwen-1.8B/7B/14B model architecture? The difference from LLaMA is that the Q, K, V projection matrices have biases and their weights are fused.

https://huggingface.co/Qwen/Qwen-1_8B-Chat/blob/1d0f68de57b88cfde81f3c3e537f24464d889081/modeling_qwen.py#L269

def __init__(self):
        self.c_attn = nn.Linear(config.hidden_size, 3 * self.projection_size)

def forward(self):
        mixed_x_layer = self.c_attn(hidden_states)

        query, key, value = mixed_x_layer.split(self.split_size, dim=2)

Is it enough to modify the code here from:

out = torch.matmul(X, W, out = out)

to the following?

torch.addmm(bias, X, W, out = out)
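
For what it's worth, here is a small plain-PyTorch sanity check (not Unsloth code; the shapes are made up) that torch.addmm(bias, X, W) matches the fused projection X @ W + bias, and that the fused output can then be split into Q, K, V:

import torch

hidden = 64
proj   = 3 * hidden                 # toy stand-in for a fused c_attn output size
X    = torch.randn(8, hidden)       # 8 tokens
W    = torch.randn(hidden, proj)
bias = torch.randn(proj)

out_ref = X @ W + bias              # fused QKV projection with bias
out     = torch.addmm(bias, X, W)   # same computation via addmm
print(torch.allclose(out_ref, out, atol = 1e-5))  # True

# The fused output is then split into Q, K, V along the last dimension
q, k, v = out.split(hidden, dim = -1)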

Finetune Yi model: RuntimeError: Unsloth: Extra special token `<|startoftext|>` with id=64000 exceeds the maximum vocabulary size of 64000

I saw that unsloth supports QLoRA finetuning of Yi models, so I tried the script below to finetune Yi-34B-Chat and ran into the error below.
Could you please tell me how to fix it?

Thank you!

CUDA_VISIBLE_DEVICES=0 deepspeed --num_gpus=1 /root/LLaMA-Factory/src/train_bash.py \
--model_name_or_path /root/download/01ai/Yi-34B-Chat \
--dataset_dir /root/LLaMA-Factory/data \
--output_dir /root/output/Yi-34B-Chat \
--use_unsloth \
--dataset alpaca_gpt4_en \
--stage sft \
--sft_packing True \
--do_train True \
--finetuning_type lora \
--quantization_bit 4 \
--template yi \
--cutoff_len 4096 \
--learning_rate 1e-4 \
--preprocessing_num_workers 8 \
--num_train_epochs 1.0 \
--max_samples 1000000 \
--per_device_train_batch_size 1 \
--gradient_accumulation_steps 8 \
--lr_scheduler_type cosine \
--max_grad_norm 1.0 \
--logging_steps 1 \
--save_steps 100 \
--warmup_steps 0 \
--neftune_noise_alpha 5 \
--lora_rank 128 \
--lora_alpha 256 \
--lora_dropout 0 \
--lora_target all \
--fp16 True \
--plot_loss True \
--overwrite_output_dir True \
--deepspeed ds_config_zero2.json

[INFO|tokenization_utils_base.py:2024] 2024-01-03 12:45:42,955 >> loading file tokenizer.model
[INFO|tokenization_utils_base.py:2024] 2024-01-03 12:45:42,956 >> loading file tokenizer.json
[INFO|tokenization_utils_base.py:2024] 2024-01-03 12:45:42,956 >> loading file added_tokens.json
[INFO|tokenization_utils_base.py:2024] 2024-01-03 12:45:42,956 >> loading file special_tokens_map.json
[INFO|tokenization_utils_base.py:2024] 2024-01-03 12:45:42,956 >> loading file tokenizer_config.json
Traceback (most recent call last):
File "/root/LLaMA-Factory/src/train_bash.py", line 14, in
main()
File "/root/LLaMA-Factory/src/train_bash.py", line 5, in main
run_exp()
File "/root/LLaMA-Factory/src/llmtuner/train/tuner.py", line 26, in run_exp
run_sft(model_args, data_args, training_args, finetuning_args, generating_args, callbacks)
File "/root/LLaMA-Factory/src/llmtuner/train/sft/workflow.py", line 29, in run_sft
model, tokenizer = load_model_and_tokenizer(model_args, finetuning_args, training_args.do_train)
File "/root/LLaMA-Factory/src/llmtuner/model/loader.py", line 75, in load_model_and_tokenizer
model, _ = FastLlamaModel.from_pretrained(**unsloth_kwargs)
File "/root/.local/conda/envs/cpp/lib/python3.10/site-packages/unsloth/models/llama.py", line 717, in from_pretrained
check_tokenizer(model, tokenizer)
File "/root/.local/conda/envs/cpp/lib/python3.10/site-packages/unsloth/models/_utils.py", line 131, in check_tokenizer
raise RuntimeError(
RuntimeError: Unsloth: Extra special token <|startoftext|> with id=64000 exceeds the maximum vocabulary size of 64000. You must fix the tokenizer or else out of bounds memory accesses will occur.
[2024-01-03 12:45:50,260] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 27792

Multigpu

Is there multi-GPU support? I don't know how to set it up without running a script.

importlib.metadata.PackageNotFoundError: No package metadata was found for unsloth

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/workspace/LLaMA-Factory/src/train_bash.py", line 14, in
main()
File "/workspace/LLaMA-Factory/src/train_bash.py", line 5, in main
run_exp()
File "/workspace/LLaMA-Factory/src/llmtuner/train/tuner.py", line 26, in run_exp
run_sft(model_args, data_args, training_args, finetuning_args, generating_args, callbacks)
File "/workspace/LLaMA-Factory/src/llmtuner/train/sft/workflow.py", line 29, in run_sft
model, tokenizer = load_model_and_tokenizer(model_args, finetuning_args, training_args.do_train)
File "/workspace/LLaMA-Factory/src/llmtuner/model/loader.py", line 63, in load_model_and_tokenizer
require_version("unsloth", "Follow the instructions at: https://github.com/unslothai/unsloth")
File "/root/miniconda3/envs/llama_factory/lib/python3.10/site-packages/transformers/utils/versions.py", line 104, in require_version
raise importlib.metadata.PackageNotFoundError(
importlib.metadata.PackageNotFoundError: No package metadata was found for The 'unsloth' distribution was not found and is required by this application.
Follow the instructions at: https://github.com/unslothai/unsloth
Converting format of dataset: 97%|█████████▋| 139000/143000 [00:02<00:00, 55130.54 examples/s]12/31/2023 15:47:40 - INFO - llmtuner.model.patcher - Quantizing model to 4 bit.
Traceback (most recent call last):
File "/root/miniconda3/envs/llama_factory/lib/python3.10/site-packages/transformers/utils/versions.py", line 102, in require_version
got_ver = importlib.metadata.version(pkg)
File "/root/miniconda3/envs/llama_factory/lib/python3.10/importlib/metadata/init.py", line 996, in version
return distribution(distribution_name).version
File "/root/miniconda3/envs/llama_factory/lib/python3.10/importlib/metadata/init.py", line 969, in distribution
return Distribution.from_name(distribution_name)
File "/root/miniconda3/envs/llama_factory/lib/python3.10/importlib/metadata/init.py", line 548, in from_name
raise PackageNotFoundError(name)
importlib.metadata.PackageNotFoundError: No package metadata was found for unsloth

During handling of the above exception, another exception occurred:

How to use Yi-34b?

Hello,
I am trying to use the Yi-34B model with this code:

model_name = 'chargoddard/Yi-34B-Llama'
max_seq_length = 2048
learning_rate = 2e-4
weight_decay = 0.01
warmup_steps = 10
batch_size = 1
gradient_accumulation_steps = 16
lr_scheduler_type = "linear"
optimizer = "adamw_8bit"
use_gradient_checkpointing = True
random_state = 3407

from unsloth import FastLlamaModel
import torch
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.
HAS_BFLOAT16 = torch.cuda.is_bf16_supported()

model, tokenizer = FastLlamaModel.from_pretrained(
    model_name = model_name,
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)

but I am getting the following error


---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[23], line 7
      4 load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.
      5 HAS_BFLOAT16 = torch.cuda.is_bf16_supported()
----> 7 model, tokenizer = FastLlamaModel.from_pretrained(
      8     model_name = model_name,
      9     max_seq_length = max_seq_length,
     10     dtype = dtype,
     11     load_in_4bit = load_in_4bit,
     12     # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
     13 )

File /mnt/a0b764eb-cdc5-4f46-9a2e-e2f11deba631/PYTHON_CACHE/unsloth/lib/python3.11/site-packages/unsloth/models/llama.py:658, in FastLlamaModel.from_pretrained(model_name, max_seq_length, dtype, load_in_4bit, token, device_map, rope_scaling)
    643     bnb_config = BitsAndBytesConfig(
    644         load_in_4bit              = True,
    645         bnb_4bit_use_double_quant = True,
    646         bnb_4bit_quant_type       = "nf4",
    647         bnb_4bit_compute_dtype    = dtype,
    648     )
    650 model = AutoModelForCausalLM.from_pretrained(
    651     model_name,
    652     device_map = device_map,
   (...)
    656     rope_scaling = rope_scaling,
    657 )
--> 658 tokenizer = AutoTokenizer.from_pretrained(
    659     model_name,
    660     model_max_length = max_seq_length,
    661     padding_side = "right",
    662     token = token,
    663 )
    665 model, tokenizer = patch_tokenizer(model, tokenizer)
    666 model = FastLlamaModel.post_patch(model)

File /mnt/a0b764eb-cdc5-4f46-9a2e-e2f11deba631/PYTHON_CACHE/unsloth/lib/python3.11/site-packages/transformers/models/auto/tokenization_auto.py:784, in AutoTokenizer.from_pretrained(cls, pretrained_model_name_or_path, *inputs, **kwargs)
    782         tokenizer_class = tokenizer_class_from_name(tokenizer_class_candidate)
    783     if tokenizer_class is None:
--> 784         raise ValueError(
    785             f"Tokenizer class {tokenizer_class_candidate} does not exist or is not currently imported."
    786         )
    787     return tokenizer_class.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
    789 # Otherwise we have to be creative.
    790 # if model is an encoder decoder, the encoder tokenizer class is used by default

ValueError: Tokenizer class YiTokenizer does not exist or is not currently imported.

Could you tell me what I am doing wrong? Thank you!

Paged_adam8_bit has casting issues

A RuntimeError occurs when I use unsloth to QLoRA-finetune vicuna-7b-v1.5; it points to the following lines:

dX = torch.matmul(DW_f, upW.t(), out = X)

dX = torch.matmul(dQ, QW.t(), out = X)

I modified line 163 from dX = torch.matmul(DW_f, upW.t(), out = X) to dX = torch.matmul(DW_f, upW.t()).to(X.dtype), and line 291 from dX = torch.matmul(dQ, QW.t(), out = X) to dX = torch.matmul(dQ, QW.t()).to(X.dtype).

This fixed the RuntimeError.

[Feature request] Support GPTQ quantization

So I have a GPTQ llama model I downloaded (from TheBloke), and it's already 4 bit quantized. I have to pass in False for the load_in_4bit parameter of:

model, tokenizer = FastLlamaModel.from_pretrained(

because if I don't, I get an error thrown saying:

The model is already quantized with gptq. You can't quantize it again with bitsandbytes

But, if I pass in False for load_in_4bit, this code makes bnb_config be None:

        bnb_config = None
        if load_in_4bit:
            bnb_config = BitsAndBytesConfig(
                load_in_4bit              = True,
                bnb_4bit_use_double_quant = True,
                bnb_4bit_quant_type       = "nf4",
                bnb_4bit_compute_dtype    = dtype,
            )

and that makes quantization_config be None as well:

quantization_config = bnb_config,

and that crashes here:

        if hasattr(self, "quantization_config"):
            output["quantization_config"] = (
                self.quantization_config.to_dict()

with the error message:

'NoneType' object has no attribute 'to_dict'

So I'm not sure how to LoRA train this llama model. Any thoughts?
