Finetune Llama 3, Mistral, Phi-3 & Gemma 2-5x faster with 80% less memory!
✨ Finetune for Free
All notebooks are beginner friendly! Add your dataset, click "Run All", and you'll get a 2x faster finetuned model which can be exported to GGUF, Ollama, vLLM or uploaded to Hugging Face.
All kernels written in OpenAI's Triton language. Manual backprop engine.
0% loss in accuracy - no approximation methods - all exact.
No change of hardware required. Supports NVIDIA GPUs from 2018 onwards. Minimum CUDA Capability 7.0 (V100, T4, Titan V, RTX 20, 30, 40x, A100, H100, L40, etc.). Check your GPU (a quick check is shown after this list)! GTX 1070 and 1080 work, but are slow.
Works on Linux and Windows via WSL.
Supports 4bit and 16bit QLoRA / LoRA finetuning via bitsandbytes.
Open source trains 5x faster - see Unsloth Pro for up to 30x faster training!
If you trained a model with 🦥Unsloth, you can use this cool sticker!
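To check the compute capability mentioned above, here is a minimal sketch using PyTorch (Unsloth itself is not needed for this check):

import torch

# Reports the CUDA compute capability of the default GPU.
# Unsloth requires capability >= 7.0 (e.g. V100, T4, RTX 20xx and newer).
major, minor = torch.cuda.get_device_capability()
print(f"{torch.cuda.get_device_name(0)}: compute capability {major}.{minor}")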
🥇 Performance Benchmarking
For the full list of reproducible benchmarking tables, go to our website
The benchmarking table below was produced by 🤗Hugging Face.
Free Colab T4   | Dataset    | 🤗Hugging Face | Pytorch 2.1.1 | 🦥Unsloth | 🦥 VRAM reduction
Llama-2 7b      | OASST      | 1x             | 1.19x         | 1.95x     | -43.3%
Mistral 7b      | Alpaca     | 1x             | 1.07x         | 1.56x     | -13.7%
Tiny Llama 1.1b | Alpaca     | 1x             | 2.06x         | 3.87x     | -73.8%
DPO with Zephyr | Ultra Chat | 1x             | 1.09x         | 1.55x     | -18.6%
💾 Installation Instructions
Conda Installation
Select either pytorch-cuda=11.8 for CUDA 11.8 or pytorch-cuda=12.1 for CUDA 12.1. If you have mamba, use mamba instead of conda for faster solving. See this Github issue for help on debugging Conda installs.
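A typical sequence looks like this (a sketch only; swap pytorch-cuda=12.1 for pytorch-cuda=11.8 on CUDA 11.8, and check the repository for the current package list and extras):

conda create --name unsloth_env python=3.10
conda activate unsloth_env
conda install pytorch-cuda=12.1 pytorch cudatoolkit xformers -c pytorch -c nvidia -c xformers
pip install "unsloth @ git+https://github.com/unslothai/unsloth.git"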
Pip Installation
Do NOT use this if you have Anaconda. You must use the Conda install method, or else stuff will BREAK.
Find your CUDA version via:
import torch; torch.version.cuda
For Pytorch 2.1.0: you can update Pytorch via pip (interchange cu121 / cu118). Go to https://pytorch.org/ to learn more. Select either cu118 for CUDA 11.8 or cu121 for CUDA 12.1. If you have an RTX 3060 or higher (A100, H100, etc.), use the "ampere" path. For Pytorch 2.1.1, go to step 3. For Pytorch 2.2.0, go to step 4.
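For example (illustrative only; these extras also appear in the install logs further below, but check the repository for the extra matching your exact CUDA / Pytorch / GPU combination):

pip install "unsloth[cu121_torch211] @ git+https://github.com/unslothai/unsloth.git" # CUDA 12.1, Pytorch 2.1.1
pip install "unsloth[cu118_ampere] @ git+https://github.com/unslothai/unsloth.git" # CUDA 11.8, Ampere+ GPUs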
Go to our Wiki page for saving to GGUF, checkpointing, evaluation and more!
We support Huggingface's TRL, Trainer, Seq2SeqTrainer or even Pytorch code!
We're in 🤗Hugging Face's official docs! Check out the SFT docs and DPO docs!
from unsloth import FastLanguageModel
from unsloth import is_bfloat16_supported
import torch
from trl import SFTTrainer
from transformers import TrainingArguments
from datasets import load_dataset
max_seq_length = 2048 # Supports RoPE Scaling internally, so choose any!
# Get LAION dataset
url = "https://huggingface.co/datasets/laion/OIG/resolve/main/unified_chip2.jsonl"
dataset = load_dataset("json", data_files = {"train" : url}, split = "train")

# 4bit pre quantized models we support for 4x faster downloading + no OOMs.
fourbit_models = [
"unsloth/mistral-7b-v0.3-bnb-4bit", # New Mistral v3 2x faster!
"unsloth/mistral-7b-instruct-v0.3-bnb-4bit",
"unsloth/llama-3-8b-bnb-4bit", # Llama-3 15 trillion tokens model 2x faster!
"unsloth/llama-3-8b-Instruct-bnb-4bit",
"unsloth/llama-3-70b-bnb-4bit",
"unsloth/Phi-3-mini-4k-instruct", # Phi-3 2x faster!
"unsloth/Phi-3-medium-4k-instruct",
"unsloth/mistral-7b-bnb-4bit",
"unsloth/gemma-7b-bnb-4bit", # Gemma 2.2x faster!
] # More models at https://huggingface.co/unsloth

model, tokenizer = FastLanguageModel.from_pretrained(
model_name="unsloth/llama-3-8b-bnb-4bit",
max_seq_length=max_seq_length,
dtype=None,
load_in_4bit=True,
)
# Do model patching and add fast LoRA weights
model = FastLanguageModel.get_peft_model(
model,
r=16,
target_modules= ["q_proj", "k_proj", "v_proj", "o_proj",
"gate_proj", "up_proj", "down_proj",],
lora_alpha=16,
lora_dropout = 0, # Supports any, but = 0 is optimized
bias = "none", # Supports any, but = "none" is optimized
# [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
random_state = 3407,
max_seq_length=max_seq_length,
use_rslora = False, # We support rank stabilized LoRA
loftq_config = None, # And LoftQ
)
trainer=SFTTrainer(
model=model,
train_dataset=dataset,
dataset_text_field="text",
max_seq_length=max_seq_length,
tokenizer=tokenizer,
args=TrainingArguments(
per_device_train_batch_size=2,
gradient_accumulation_steps=4,
warmup_steps=10,
max_steps=60,
fp16 = not is_bfloat16_supported(),
bf16=is_bfloat16_supported(),
logging_steps=1,
output_dir="outputs",
optim="adamw_8bit",
seed=3407,
),
)
trainer.train()
# Go to https://github.com/unslothai/unsloth/wiki for advanced tips like
# (1) Saving to GGUF / merging to 16bit for vLLM
# (2) Continued training from a saved LoRA adapter
# (3) Adding an evaluation loop / OOMs
# (4) Customized chat templates
DPO Support
DPO (Direct Preference Optimization), PPO, Reward Modelling all seem to work as per 3rd party independent testing from Llama-Factory. We have a preliminary Google Colab notebook for reproducing Zephyr on Tesla T4 here: notebook.
We're in 🤗Hugging Face's official docs! We're on the SFT docs and the DPO docs!
from unsloth import FastLanguageModel, PatchDPOTrainer
from unsloth import is_bfloat16_supported
PatchDPOTrainer()
import torch
from transformers import TrainingArguments
from trl import DPOTrainer

max_seq_length = 2048 # Choose any, as above

model, tokenizer = FastLanguageModel.from_pretrained(
model_name="unsloth/zephyr-sft-bnb-4bit",
max_seq_length=max_seq_length,
dtype=None,
load_in_4bit=True,
)
# Do model patching and add fast LoRA weights
model = FastLanguageModel.get_peft_model(
model,
r=64,
target_modules= ["q_proj", "k_proj", "v_proj", "o_proj",
"gate_proj", "up_proj", "down_proj",],
lora_alpha=64,
lora_dropout = 0, # Supports any, but = 0 is optimized
bias = "none", # Supports any, but = "none" is optimized
# [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
random_state = 3407,
max_seq_length=max_seq_length,
)
dpo_trainer=DPOTrainer(
model=model,
ref_model=None,
args=TrainingArguments(
per_device_train_batch_size=4,
gradient_accumulation_steps=8,
warmup_ratio=0.1,
num_train_epochs=3,
fp16 = not is_bfloat16_supported(),
bf16=is_bfloat16_supported(),
logging_steps=1,
optim="adamw_8bit",
seed=42,
output_dir="outputs",
),
beta=0.1,
train_dataset=YOUR_DATASET_HERE,
# eval_dataset = YOUR_DATASET_HERE,
tokenizer = tokenizer,
max_length=1024,
max_prompt_length=512,
)
dpo_trainer.train()
🥇 Detailed Benchmarking Tables
Click "Code" for fully reproducible examples
"Unsloth Equal" is a preview of our PRO version, with code stripped out. All settings and the loss curve remains identical.
I was wondering if it's possible to expand the trainable modules in LoraConfig by adding to the modules_to_save parameter. Specifically, can we include more modules like lm_head for training? Here's an example of what I'm thinking:
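A minimal sketch of what that might look like (assuming a standard PEFT LoraConfig; the exact values are illustrative, not the asker's original code):

from peft import LoraConfig

# Hypothetical config: also train lm_head in full, in addition to the LoRA adapters.
config = LoraConfig(
    r = 16,
    lora_alpha = 16,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj"],
    modules_to_save = ["lm_head"], # extra modules saved and trained alongside the adapters
)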
Hi, I got the following warning when using a single machine for multi-GPU training: /usr/local/lib/python3.10/dist-packages/unsloth/__init__.py:23: UserWarning: Unsloth: 'CUDA_VISIBLE_DEVICES' is currently 0,1,2,3 but we require 'CUDA_VISIBLE_DEVICES=0' We shall set it ourselves.
How does this affect my training? The training framework I am using is LLaMA-Factory.
RuntimeError: Cannot launch Triton kernel since n = 46336 exceeds the maximum CUDA blocksize = 65535.
File /usr/local/lib/python3.10/dist-packages/unsloth/kernels/utils.py:23, in calculate_settings(n)
21 # CUDA only supports 65535 - 2^16-1 threads per block
22 if BLOCK_SIZE > MAX_FUSED_SIZE:
---> 23 raise RuntimeError(f"Cannot launch Triton kernel since n = {n} exceeds "\
24 f"the maximum CUDA blocksize = {MAX_FUSED_SIZE}.")
25 num_warps = 4
26 if BLOCK_SIZE >= 32768: num_warps = 32
H100 80GB
Worked on this quite a lot yesterday with the help of GPT-4, but I'm having some issues with errors similar to the following:
Traceback (most recent call last):
File "/mnt/c/Users/magda/OneDrive/Desktop/python_work/unsloth_finetuner.py", line 67, in <module>
trainer.train()
File "/home/harry/.local/lib/python3.10/site-packages/transformers/trainer.py", line 1537, in train
return inner_training_loop(
File "/home/harry/.local/lib/python3.10/site-packages/transformers/trainer.py", line 1854, in _inner_training_loop
tr_loss_step = self.training_step(model, inputs)
File "/home/harry/.local/lib/python3.10/site-packages/transformers/trainer.py", line 2744, in training_step
self.accelerator.backward(loss)
File "/home/harry/.local/lib/python3.10/site-packages/accelerate/accelerator.py", line 1903, in backward
self.scaler.scale(loss).backward(**kwargs)
File "/home/harry/.local/lib/python3.10/site-packages/torch/_tensor.py", line 492, in backward
torch.autograd.backward(
File "/home/harry/.local/lib/python3.10/site-packages/torch/autograd/__init__.py", line 251, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
File "/home/harry/.local/lib/python3.10/site-packages/torch/autograd/function.py", line 288, in apply
return user_fn(self, *args)
File "/home/harry/.local/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 288, in backward
torch.autograd.backward(outputs_with_grad, args_with_grad)
File "/home/harry/.local/lib/python3.10/site-packages/torch/autograd/__init__.py", line 251, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
File "/home/harry/.local/lib/python3.10/site-packages/torch/autograd/function.py", line 288, in apply
return user_fn(self, *args)
File "/home/harry/.local/lib/python3.10/site-packages/torch/cuda/amp/autocast_mode.py", line 140, in decorate_bwd
return bwd(*args, **kwargs)
File "/home/harry/.local/lib/python3.10/site-packages/unsloth/kernels/fast_lora.py", line 291, in backward
dX = torch.matmul(dQ, QW.t(), out = X)
RuntimeError: Expected out tensor to have dtype c10::Half, but got float instead
I'm running an RTX 3090 locally, for the record; I want to say CUDA 11.8. I downgraded Pytorch to 2.1.0 as well. Here's what the script looks like right now. This is my first time trying to finetune locally, so I'm not sure if the args are wrong, or what, but any advice would be very much appreciated.
from unsloth import FastLlamaModel, FastMistralModel
import torch
max_seq_length = 4096 # Can change to any number <= 4096
HAS_BFLOAT16 = torch.cuda.is_bf16_supported()
dtype = torch.bfloat16 if HAS_BFLOAT16 else torch.float16
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.
# Load Llama model
model, tokenizer = FastLlamaModel.from_pretrained(
model_name = "unsloth/llama-2-7b", # Supports any llama model eg meta-llama/Llama-2-7b-hf
max_seq_length = max_seq_length,
dtype = dtype,
load_in_4bit = load_in_4bit,
token = "hf...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)
# Do model patching and add fast LoRA weights
model = FastLlamaModel.get_peft_model(
model,
r = 16,
target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
"gate_proj", "up_proj", "down_proj",],
lora_alpha = 16,
lora_dropout = 0, # Currently only supports dropout = 0
bias = "none", # Currently only supports bias = "none"
use_gradient_checkpointing = True,
random_state = 3407,
max_seq_length = max_seq_length,
)
from transformers import Trainer, TrainingArguments
from datasets import load_dataset
dataset = load_dataset('json', data_files='processed_data.json', split='train')
# Tokenize the dataset
def tokenize_and_prepare_labels(examples):
tokenized_inputs = tokenizer(examples['text'], padding='max_length', truncation=True, max_length=max_seq_length)
# Shift the tokenized inputs to the right for the labels
tokenized_inputs["labels"] = tokenized_inputs["input_ids"].copy()
return tokenized_inputs
tokenized_dataset = dataset.map(tokenize_and_prepare_labels, batched=True)
# Define training arguments
training_args = TrainingArguments(
output_dir='outputs',
num_train_epochs=3,
per_device_train_batch_size=8,
warmup_steps=500,
weight_decay=0.01,
logging_dir='./logs',
logging_steps=10,
fp16=True, # Enable mixed precision
)
# Initialize Trainer with the entire tokenized dataset
trainer = Trainer(
model=model,
args=training_args,
train_dataset=tokenized_dataset,
eval_dataset=None # Set this to your evaluation dataset if you have one
)
print(f"Max sequence length: {max_seq_length}")
trainer.train()
print(f"Training done! Check outputs")
For the record, when I remove fp16 = True, it errors out a lot faster and gives a slight variation of the error:
RuntimeError: expected mat1 and mat2 to have the same dtype, but got: c10::BFloat16 != float
Hello, thanks for your contribution; it is really promising, but for some reason it breaks generation and inference. Here is an example:
from unsloth import FastLlamaModel
import torch
max_seq_length = 1024 # Can change to any number <= 4096
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.
# Load Llama model
model, tokenizer = FastLlamaModel.from_pretrained(
model_name = "TheBloke/Llama-2-7B-fp16", # Supports any llama model
max_seq_length = max_seq_length,
dtype=dtype,
load_in_4bit = False
# token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)
inputs = tokenizer.encode("the concept of ", return_tensors="pt", add_special_tokens = True).to(model.device)
answer = model.generate(inputs, max_new_tokens = 20)
tokenizer.batch_decode(answer, skip_special_tokens = False)
The output:
==((====))== Unsloth: Fast Llama patching release 23.11
\\ /| GPU: A100-SXM4-40GB. Max memory: 39.587 GB
O^O/ \_/ \ CUDA compute capability = 8.0
\ / Pytorch version: 2.1.0+cu118. CUDA Toolkit = 11.8
"-____-" bfloat16 support = TRUE
Loading checkpoint shards: 100%
2/2 [00:14<00:00, 6.61s/it]
['<s> the concept of 1<s> Tags\n \\\n \\\n \\\n \\\n \\\n \\\n \\\n \\\n']
I have tried more than 4 different Llamas, including yours, and hit the same issue.
I am curious to know whether the Hugging Face baseline incorporates Flash Attention 2. Could you please offer a reproducible Hugging Face baseline example?
This is really an amazing project. It made my model finetuning 2x faster.
I'm wondering whether I can add an existing LoRA adapter to the model, so as to continue training the LoRA on another dataset.
Currently, it seems that I have to merge my base model and LoRA weights in advance when using Unsloth.
Therefore, I have to update and save my newly merged model each time before further training.
So, will this project support loading existing LoRA weights in the near future?
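For reference, a minimal sketch of the merge-then-retrain workaround described above (paths are illustrative; merge_and_unload is PEFT's adapter-merging API, not an Unsloth-specific one):

# Fold the current LoRA weights into the base model, save, then reload later for further training.
merged_model = model.merge_and_unload()
merged_model.save_pretrained("merged_model")
tokenizer.save_pretrained("merged_model")
# Later: load "merged_model" with FastLanguageModel.from_pretrained and call get_peft_model again.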
> [21/23] RUN pip install "unsloth[cu121_torch211] @ git+https://github.com/unslothai/unsloth.git":
1.136 Collecting unsloth[cu121_torch211]@ git+https://github.com/unslothai/unsloth.git
1.136 Cloning https://github.com/unslothai/unsloth.git to /tmp/pip-install-ptgzpiwm/unsloth
1.136 Running command git clone -q https://github.com/unslothai/unsloth.git /tmp/pip-install-ptgzpiwm/unsloth
2.604 Installing build dependencies: started
6.648 Installing build dependencies: finished with status 'done'
6.649 Getting requirements to build wheel: started
6.932 Getting requirements to build wheel: finished with status 'done'
6.936 Installing backend dependencies: started
9.363 Installing backend dependencies: finished with status 'done'
9.365 Preparing wheel metadata: started
9.693 Preparing wheel metadata: finished with status 'done'
9.773 ERROR: Package 'unsloth-2024.1' requires a different Python: 3.8.10 not in '>=3.9'
Python 3.8 reached EOL quite some time ago... please update it, or at least don't restrict currently supported versions.
This produced the error messages shown below. I also found that the non-Ampere pip installation command installed successfully.
Collecting unsloth[cu118_ampere]@ git+https://github.com/unslothai/unsloth.git
Cloning https://github.com/unslothai/unsloth.git to /tmp/pip-install-airv06iv/unsloth_a06586211bd7445ba155bf97e1192880
Running command git clone --filter=blob:none --quiet https://github.com/unslothai/unsloth.git /tmp/pip-install-airv06iv/unsloth_a06586211bd7445ba155bf97e1192880
Resolved https://github.com/unslothai/unsloth.git to commit 27bbd6b2cd927b7cb0866a03dec41efb04470501
Installing build dependencies ... done
Getting requirements to build wheel ... done
Installing backend dependencies ... done
Preparing metadata (pyproject.toml) ... done
WARNING: unsloth 2023.12 does not provide the extra 'cu118-ampere'
Collecting bitsandbytes (from unsloth[cu118_ampere]@ git+https://github.com/unslothai/unsloth.git)
Using cached bitsandbytes-0.41.3.post2-py3-none-any.whl.metadata (9.8 kB)
Collecting flash-attn (from unsloth[cu118_ampere]@ git+https://github.com/unslothai/unsloth.git)
Using cached flash_attn-2.3.6.tar.gz (2.3 MB)
Installing build dependencies ... done
Getting requirements to build wheel ... error
error: subprocess-exited-with-error
× Getting requirements to build wheel did not run successfully.
│ exit code: 1
╰─> [17 lines of output]
Traceback (most recent call last):
File "/home/zmx/m.2/Dev/llm/unsloth/venv/lib/python3.10/site-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 353, in <module>
main()
File "/home/zmx/m.2/Dev/llm/unsloth/venv/lib/python3.10/site-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 335, in main
json_out['return_val'] = hook(**hook_input['kwargs'])
File "/home/zmx/m.2/Dev/llm/unsloth/venv/lib/python3.10/site-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 118, in get_requires_for_build_wheel
return hook(config_settings)
File "/tmp/pip-build-env-1h3yu7xe/overlay/lib/python3.10/site-packages/setuptools/build_meta.py", line 325, in get_requires_for_build_wheel
return self._get_build_requires(config_settings, requirements=['wheel'])
File "/tmp/pip-build-env-1h3yu7xe/overlay/lib/python3.10/site-packages/setuptools/build_meta.py", line 295, in _get_build_requires
self.run_setup()
File "/tmp/pip-build-env-1h3yu7xe/overlay/lib/python3.10/site-packages/setuptools/build_meta.py", line 480, in run_setup
super(_BuildMetaLegacyBackend, self).run_setup(setup_script=setup_script)
File "/tmp/pip-build-env-1h3yu7xe/overlay/lib/python3.10/site-packages/setuptools/build_meta.py", line 311, in run_setup
exec(code, locals())
File "<string>", line 9, in <module>
ModuleNotFoundError: No module named 'packaging'
[end of output]
note: This error originates from a subprocess, and is likely not a problem with pip.
error: subprocess-exited-with-error
× Getting requirements to build wheel did not run successfully.
│ exit code: 1
╰─> See above for output.
note: This error originates from a subprocess, and is likely not a problem with pip.
We are benchmarking training of llama2-13B-hf + hf + trl + FlashAttn 2 vs unsloth (which has FlashAttn 2 enabled) on a single A100 80GB.
With the latest changes:
We see at most 10% improvement in training time.
Peak VRAM is actually slightly elevated vs HF: ~1GB more on an A100.
Both use native bf16 for finetuning.
Overall, we cannot reproduce anywhere close to the numbers presented by Unsloth.
Can anyone else share their results?
Edit: Perhaps the drastic improvements are only visible on a T4 GPU? Curious to see what the real-world benchmark results are like on more modern GPUs, i.e. 3090, A100, 4090.
Previously, when one trained with either HF's TRL library or Unsloth, converting the model to GGUF / llama.cpp was a nightmare.
One of the following strategies had to be used:
1. Save PEFT adapters and upload them to HF. Use llama.cpp to take a GGUF version of Mistral / Llama, combine the LoRA via convert-lora-to-ggml.py, then quantize. Accuracy is medium, since bitsandbytes's quantization method differs from GGUF's. VRAM usage is great.
2. Train with load_in_4bit = False to use 16bit, then use model.save_pretrained and llama.cpp, then quantize. Accuracy is superb, since 16bit retains full accuracy, but VRAM usage for llama-7b goes to 14.5GB, and Mistral OOMs.
3. Merge PEFT adapters back using model.merge_and_unload(), then convert each layer to 16bit, then follow step 2. Accuracy is low, since converting to 4bit and then upcasting to 16bit loses lots of bits.
4. [NEW] Unsloth: Combines the best of both worlds - train with QLoRA (low memory), then we upcast to 16bit on the fly, using mmap. Accuracy is high as well.
A table comparing all strategies:
Strategy   | VRAM Usage | Disk Usage | Accuracy
1. PEFT    | Low        | Low        | Medium
2. 16bit   | Horrible   | High       | High
3. Merge   | Horrible   | High       | Low
4. Unsloth | Low        | High       | High
To use Unsloth:
from unsloth import unsloth_save_model
# unsloth_save_model has the same args as model.save_pretrained
unsloth_save_model(model, tokenizer, "output_model", push_to_hub = False, token = None)
from unsloth import colab_quantize_to_gguf # Colab-focused GGUF conversion helper
colab_quantize_to_gguf("output_model", quantization_method = "q4_k_m")
For another Mistral model, I'm confused about how to save the model in fp16 mode. The reason is that I then wanted to convert it to GGUF. Here are the steps after training succeeded and the inference test was correct:
new_model = "xxx/xxxx"
model = model.merge_and_unload()
model.push_to_hub(new_model)
tokenizer.push_to_hub(new_model)
The result is very small, only about 4.13 GB, and the HF tag says 4-bit precision (model.safetensors).
And llama.cpp's GGUF convert.py refused to convert this.
Did I do something wrong?
Note: BTW, Unsloth is very cool, very fast training.
I am attempting to replicate the Mistral 7b model finetuning locally, as outlined in the Unsloth Open column of the README.md file. I have successfully downloaded the dataset and the model to my local machine, which is a deviation from the original Jupyter notebook that used data directly from the cloud.
The initial stages of the Jupyter notebook, which include multiple code blocks, were executed without any issues. However, I have run into an error during the execution of the trainer_stats = trainer.train() block. The issue arises with a ValueError stating that the Query/Key/Value must have either BMHK or BMK shape.
ValueError: Query/Key/Value should all have BMHK or BMK shape.
query.shape: torch.Size([4, 2021, 8, 4, 128])
key.shape : torch.Size([4, 2021, 8, 4, 128])
value.shape: torch.Size([4, 2021, 8, 4, 128])
I have even tried reverting back to using the HuggingFace dataset instead of the local one to see if the issue was related to my dataset location. Unfortunately, I encountered the same error.
Here’s a brief overview of my setup:
==((====))== Unsloth: Fast Mistral patching release 2023.12
\\ /| GPU: NVIDIA GeForce RTX 4090. Max memory: 23.647 GB
O^O/ \_/ \ CUDA compute capability = 8.9
\ / Pytorch version: 2.1.0.post300. CUDA Toolkit = 11.8
"-____-" bfloat16 support = TRUE
GPU = NVIDIA GeForce RTX 4090. Max memory = 23.647 GB.
4.66 GB of memory reserved.
Additionally, I have addressed an issue with the torch version format. The original Unsloth configuration didn't support version numbers with four segments, so I modified the version parsing from major_torch, minor_torch, _ = torch.__version__.split(".") to major_torch, minor_torch, _ = torch.__version__.split(".")[0:3] to accommodate the expected format.
Would anyone be able to point me in the right direction, or has anyone experienced a similar issue while working on model finetuning replication? Your help would be greatly appreciated as this error is currently standing in the way of my progress.
Failed to load deepseek-ai/deepseek-llm-7b-base (which is a Llama 2 architecture model). Is the following code necessary? The HF tokenizer should handle this automatically according to tokenizer_config.json, right?
Thank you very much for this great project and for pushing this forward for the community!
With the TRL / PEFT team, we've seen that your example scripts rely heavily on the PEFT / TRL libraries, and we wanted to see if you need any help or have any feature requests around the HF ecosystem. We would be happy to collaborate and see what we can do together.
Note also that SDPA has recently been integrated into transformers core (huggingface/transformers#26572); we were also wondering if you did some comparisons of Unsloth against transformers 4.36.0.
GPU = NVIDIA GeForce RTX 3090. Max memory = 23.689 GB.
5.176 GB of memory reserved.
Traceback:
RuntimeError Traceback (most recent call last)
/workspaces/unsloth-train-playground/train.ipynb Cell 6 line 1
----> 1 trainer_stats = trainer.train()
File /workspaces/unsloth-train-playground/.venv/lib/python3.11/site-packages/trl/trainer/sft_trainer.py:280, in SFTTrainer.train(self, *args, **kwargs)
277 if self.neftune_noise_alpha is not None and not self._trainer_supports_neftune:
278 self.model = self._trl_activate_neftune(self.model)
--> 280 output = super().train(*args, **kwargs)
282 # After training we make sure to retrieve back the original forward pass method
283 # for the embedding layer by removing the forward post hook.
284 if self.neftune_noise_alpha is not None and not self._trainer_supports_neftune:
File /workspaces/unsloth-train-playground/.venv/lib/python3.11/site-packages/transformers/trainer.py:1555, in Trainer.train(self, resume_from_checkpoint, trial, ignore_keys_for_eval, **kwargs)
1553 hf_hub_utils.enable_progress_bars()
1554 else:
-> 1555 return inner_training_loop(
1556 args=args,
1557 resume_from_checkpoint=resume_from_checkpoint,
1558 trial=trial,
1559 ignore_keys_for_eval=ignore_keys_for_eval,
1560 )
File /workspaces/unsloth-train-playground/.venv/lib/python3.11/site-packages/transformers/trainer.py:1860, in Trainer._inner_training_loop(self, batch_size, args, resume_from_checkpoint, trial, ignore_keys_for_eval)
1857 self.control = self.callback_handler.on_step_begin(args, self.state, self.control)
...
24 f"the maximum CUDA blocksize = {MAX_FUSED_SIZE}.")
25 num_warps = 4
26 if BLOCK_SIZE >= 32768: num_warps = 32
RuntimeError: Cannot launch Triton kernel since n = 102400 exceeds the maximum CUDA blocksize = 65536.
This happens when executing the line from kernels import *. Below is the error trace for reference:
===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
================================================================================
...
AttributeError: /root/anaconda3/envs/group_chat/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda117.so: undefined symbol: cdequantize_blockwise_fp16_nf4. Did you mean: 'cdequantize_blockwise_fp32'?
The main issue seems to stem from an undefined symbol cdequantize_blockwise_fp16_nf4 in libbitsandbytes_cuda117.so. The system suggests an alternative cdequantize_blockwise_fp32, but it's unclear if this is a suitable substitute.
You're using a LlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the __call__ method is faster than using a method to encode the text followed by a call to the pad method to get a padded encoding.
Traceback (most recent call last):
File "/root/fanfiction-go/python/dictionary/train/sft_trainer.py", line 353, in
train_result = trainer.train()
^^^^^^^^^^^^^^^
File "/root/miniconda3/lib/python3.11/site-packages/trl/trainer/sft_trainer.py", line 280, in train
output = super().train(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/lib/python3.11/site-packages/transformers/trainer.py", line 1555, in train
return inner_training_loop(
^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/lib/python3.11/site-packages/transformers/trainer.py", line 1860, in _inner_training_loop
tr_loss_step = self.training_step(model, inputs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/lib/python3.11/site-packages/transformers/trainer.py", line 2725, in training_step
loss = self.compute_loss(model, inputs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/lib/python3.11/site-packages/transformers/trainer.py", line 2748, in compute_loss
outputs = model(**inputs)
^^^^^^^^^^^^^^^
File "/root/miniconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/lib/python3.11/site-packages/accelerate/utils/operations.py", line 659, in forward
return model_forward(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/lib/python3.11/site-packages/accelerate/utils/operations.py", line 647, in call
return convert_to_fp32(self.model_forward(*args, **kwargs))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/lib/python3.11/site-packages/torch/amp/autocast_mode.py", line 16, in decorate_autocast
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/lib/python3.11/site-packages/unsloth/models/llama.py", line 475, in LlamaForCausalLM_fast_forward
shift_labels = torch.hstack((labels[..., 1:], self.extra_ignored_labels[:labels.shape[0]]))
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1695, in getattr
raise AttributeError(f"'{type(self).name}' object has no attribute '{name}'")
AttributeError: 'LlamaForCausalLM' object has no attribute 'extra_ignored_labels'
12/27/2023 19:13:37 - WARNING - llmtuner.model.patcher - Current model does not support RoPE scaling.
12/27/2023 19:13:37 - INFO - llmtuner.model.patcher - Using FlashAttention-2 for faster training and inference.
12/27/2023 19:13:37 - INFO - llmtuner.model.patcher - Quantizing model to 4 bit.
/root/miniconda3/envs/llama_factory/lib/python3.10/site-packages/unsloth/__init__.py:31: UserWarning: Unsloth: 'CUDA_DEVICE_ORDER' is not set but we require 'CUDA_DEVICE_ORDER=PCI_BUS_ID'
We shall set it ourselves.
warnings.warn(
Traceback (most recent call last):
File "/workspace/LLaMA-Factory/src/train_bash.py", line 14, in
main()
File "/workspace/LLaMA-Factory/src/train_bash.py", line 5, in main
run_exp()
File "/workspace/LLaMA-Factory/src/llmtuner/train/tuner.py", line 32, in run_exp
run_dpo(model_args, data_args, training_args, finetuning_args, callbacks)
File "/workspace/LLaMA-Factory/src/llmtuner/train/dpo/workflow.py", line 28, in run_dpo
model, tokenizer = load_model_and_tokenizer(model_args, finetuning_args, training_args.do_train)
File "/workspace/LLaMA-Factory/src/llmtuner/model/loader.py", line 77, in load_model_and_tokenizer
model, _ = FastMistralModel.from_pretrained(**unsloth_kwargs)
TypeError: FastMistralModel.from_pretrained() got an unexpected keyword argument 'rope_scaling'
Excellent work on training optimization! I have had a question for a long time; could an expert like you help me with it?
The test code is below:
import torch
import triton.ops
import time
dtype = torch.int8
out_dtype = torch.int32
# %%
M, K, N = 1000, 1024, 2048
a = torch.randint(-128, 127, (M, K), dtype=dtype, device='cuda')
b = torch.randint(-128, 127, (N, K), dtype=dtype, device='cuda')
b = b.t()
for i in range(3):
    ts = time.time()
    out = triton.ops.matmul(a, b, out_dtype)
    print(f"test {i}: {time.time() - ts}s")
M = 1024 # changing M, like changing sequence length
print("changing M from 1000 to 1024")
a = torch.randint(-128, 127, (M, K), dtype=dtype, device='cuda')
b = torch.randint(-128, 127, (N, K), dtype=dtype, device='cuda')
b = b.t()
for i in range(3):
    ts = time.time()
    out = triton.ops.matmul(a, b, out_dtype)
    print(f"test {i}: {time.time() - ts}s")
output:
test 0: 6.1418256759643555s
test 1: 0.00017952919006347656s
test 2: 8.273124694824219e-05s
changing M from 1000 to 1024
test 0: 5.853637456893921s
test 1: 0.00012087821960449219s
test 2: 7.677078247070312e-05s
If Triton has to "warm up" every time the input shape changes, do we have any solutions to fix or avoid this? Or am I using matmul in the wrong way?
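One possible workaround, sketched below under the assumption that the first-call slowdown is per-shape JIT compilation / autotuning (this is not an Unsloth or Triton API, just the same matmul call wrapped in a warm-up loop): run a throwaway call for each shape you expect before timing or serving.

import torch
import triton.ops

def warmup_matmul(expected_Ms, K = 1024, N = 2048, dtype = torch.int8, out_dtype = torch.int32):
    # One throwaway matmul per expected M so each shape is compiled/autotuned up front.
    for M in expected_Ms:
        a = torch.randint(-128, 127, (M, K), dtype = dtype, device = "cuda")
        b = torch.randint(-128, 127, (N, K), dtype = dtype, device = "cuda").t()
        triton.ops.matmul(a, b, out_dtype)
    torch.cuda.synchronize()

warmup_matmul([1000, 1024]) # the two M values used in the test above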
Hi, I just found this wonderful project. May I ask how I could make the necessary changes to support the Qwen-1.8B/7B/14B model architecture? The difference from Llama is that the Q, K, V projection matrices have biases, and these weights are fused.
I saw that Unsloth supports QLoRA finetuning of Yi models, so I tried using the script below to finetune Yi-34B-Chat. Then I encountered the error below.
Could you please tell me how to fix it?
[INFO|tokenization_utils_base.py:2024] 2024-01-03 12:45:42,955 >> loading file tokenizer.model
[INFO|tokenization_utils_base.py:2024] 2024-01-03 12:45:42,956 >> loading file tokenizer.json
[INFO|tokenization_utils_base.py:2024] 2024-01-03 12:45:42,956 >> loading file added_tokens.json
[INFO|tokenization_utils_base.py:2024] 2024-01-03 12:45:42,956 >> loading file special_tokens_map.json
[INFO|tokenization_utils_base.py:2024] 2024-01-03 12:45:42,956 >> loading file tokenizer_config.json
Traceback (most recent call last):
File "/root/LLaMA-Factory/src/train_bash.py", line 14, in
main()
File "/root/LLaMA-Factory/src/train_bash.py", line 5, in main
run_exp()
File "/root/LLaMA-Factory/src/llmtuner/train/tuner.py", line 26, in run_exp
run_sft(model_args, data_args, training_args, finetuning_args, generating_args, callbacks)
File "/root/LLaMA-Factory/src/llmtuner/train/sft/workflow.py", line 29, in run_sft
model, tokenizer = load_model_and_tokenizer(model_args, finetuning_args, training_args.do_train)
File "/root/LLaMA-Factory/src/llmtuner/model/loader.py", line 75, in load_model_and_tokenizer
model, _ = FastLlamaModel.from_pretrained(**unsloth_kwargs)
File "/root/.local/conda/envs/cpp/lib/python3.10/site-packages/unsloth/models/llama.py", line 717, in from_pretrained
check_tokenizer(model, tokenizer)
File "/root/.local/conda/envs/cpp/lib/python3.10/site-packages/unsloth/models/_utils.py", line 131, in check_tokenizer
raise RuntimeError(
RuntimeError: Unsloth: Extra special token <|startoftext|> with id=64000 exceeds the maximum vocabulary size of 64000. You must fix the tokenizer or else out of bounds memory accesses will occur.
[2024-01-03 12:45:50,260] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 27792
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/workspace/LLaMA-Factory/src/train_bash.py", line 14, in
main()
File "/workspace/LLaMA-Factory/src/train_bash.py", line 5, in main
run_exp()
File "/workspace/LLaMA-Factory/src/llmtuner/train/tuner.py", line 26, in run_exp
run_sft(model_args, data_args, training_args, finetuning_args, generating_args, callbacks)
File "/workspace/LLaMA-Factory/src/llmtuner/train/sft/workflow.py", line 29, in run_sft
model, tokenizer = load_model_and_tokenizer(model_args, finetuning_args, training_args.do_train)
File "/workspace/LLaMA-Factory/src/llmtuner/model/loader.py", line 63, in load_model_and_tokenizer
require_version("unsloth", "Follow the instructions at: https://github.com/unslothai/unsloth")
File "/root/miniconda3/envs/llama_factory/lib/python3.10/site-packages/transformers/utils/versions.py", line 104, in require_version
raise importlib.metadata.PackageNotFoundError(
importlib.metadata.PackageNotFoundError: No package metadata was found for The 'unsloth' distribution was not found and is required by this application.
Follow the instructions at: https://github.com/unslothai/unsloth
Converting format of dataset: 97%|█████████▋| 139000/143000 [00:02<00:00, 55130.54 examples/s]
12/31/2023 15:47:40 - INFO - llmtuner.model.patcher - Quantizing model to 4 bit.
Traceback (most recent call last):
File "/root/miniconda3/envs/llama_factory/lib/python3.10/site-packages/transformers/utils/versions.py", line 102, in require_version
got_ver = importlib.metadata.version(pkg)
File "/root/miniconda3/envs/llama_factory/lib/python3.10/importlib/metadata/init.py", line 996, in version
return distribution(distribution_name).version
File "/root/miniconda3/envs/llama_factory/lib/python3.10/importlib/metadata/init.py", line 969, in distribution
return Distribution.from_name(distribution_name)
File "/root/miniconda3/envs/llama_factory/lib/python3.10/importlib/metadata/init.py", line 548, in from_name
raise PackageNotFoundError(name)
importlib.metadata.PackageNotFoundError: No package metadata was found for unsloth
During handling of the above exception, another exception occurred:
And I modified line 163 from dX = torch.matmul(DW_f, upW.t(), out = X) to dX = torch.matmul(DW_f, upW.t()).to(X.dtype), and line 291 from dX = torch.matmul(dQ, QW.t(), out = X) to dX = torch.matmul(dQ, QW.t()).to(X.dtype).
So I have a GPTQ Llama model I downloaded (from TheBloke), and it's already 4-bit quantized. I have to pass in False for the load_in_4bit parameter of: