Finetune Llama 3, Mistral, Phi-3 & Gemma 2-5x faster with 80% less memory!
✨ Finetune for Free
All notebooks are beginner friendly! Add your dataset, click "Run All", and you'll get a 2x faster finetuned model which can be exported to GGUF, Ollama, vLLM or uploaded to Hugging Face.
All kernels written in OpenAI's Triton language. Manual backprop engine.
0% loss in accuracy - no approximation methods - all exact.
No change of hardware required. Supports NVIDIA GPUs from 2018 onwards. Minimum CUDA Capability 7.0 (V100, T4, Titan V, RTX 20, 30, 40x, A100, H100, L40, etc.). Check your GPU (a quick check is shown after this list)! GTX 1070 and 1080 work, but are slow.
Works on Linux and Windows via WSL.
Supports 4bit and 16bit QLoRA / LoRA finetuning via bitsandbytes.
Open source trains 5x faster - see Unsloth Pro for up to 30x faster training!
If you trained a model with 🦥Unsloth, you can use this cool sticker!
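To check the compute capability mentioned above, here is a minimal sketch using PyTorch (Unsloth itself is not needed for this check):

import torch

# Reports the CUDA compute capability of the default GPU.
# Unsloth requires capability >= 7.0 (e.g. V100, T4, RTX 20xx and newer).
major, minor = torch.cuda.get_device_capability()
print(f"{torch.cuda.get_device_name(0)}: compute capability {major}.{minor}")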
🥇 Performance Benchmarking
For the full list of reproducible benchmarking tables, go to our website
The benchmarking table below was produced by 🤗Hugging Face.
Free Colab T4   | Dataset    | 🤗Hugging Face | Pytorch 2.1.1 | 🦥Unsloth | 🦥 VRAM reduction
Llama-2 7b      | OASST      | 1x             | 1.19x         | 1.95x     | -43.3%
Mistral 7b      | Alpaca     | 1x             | 1.07x         | 1.56x     | -13.7%
Tiny Llama 1.1b | Alpaca     | 1x             | 2.06x         | 3.87x     | -73.8%
DPO with Zephyr | Ultra Chat | 1x             | 1.09x         | 1.55x     | -18.6%
💾 Installation Instructions
Conda Installation
Select either pytorch-cuda=11.8 for CUDA 11.8 or pytorch-cuda=12.1 for CUDA 12.1. If you have mamba, use mamba instead of conda for faster solving. See this Github issue for help on debugging Conda installs.
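A typical sequence looks like this (a sketch only; swap pytorch-cuda=12.1 for pytorch-cuda=11.8 on CUDA 11.8, and check the repository for the current package list and extras):

conda create --name unsloth_env python=3.10
conda activate unsloth_env
conda install pytorch-cuda=12.1 pytorch cudatoolkit xformers -c pytorch -c nvidia -c xformers
pip install "unsloth @ git+https://github.com/unslothai/unsloth.git"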
Pip Installation
Do NOT use this if you have Anaconda. You must use the Conda install method, or else stuff will BREAK.
Find your CUDA version via:
import torch; torch.version.cuda
For Pytorch 2.1.0: you can update Pytorch via pip (interchange cu121 / cu118). Go to https://pytorch.org/ to learn more. Select either cu118 for CUDA 11.8 or cu121 for CUDA 12.1. If you have an RTX 3060 or higher (A100, H100, etc.), use the "ampere" path. For Pytorch 2.1.1, go to step 3. For Pytorch 2.2.0, go to step 4.
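For example (illustrative only; these extras also appear in the install logs further below, but check the repository for the extra matching your exact CUDA / Pytorch / GPU combination):

pip install "unsloth[cu121_torch211] @ git+https://github.com/unslothai/unsloth.git" # CUDA 12.1, Pytorch 2.1.1
pip install "unsloth[cu118_ampere] @ git+https://github.com/unslothai/unsloth.git" # CUDA 11.8, Ampere+ GPUs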
Go to our Wiki page for saving to GGUF, checkpointing, evaluation and more!
We support Huggingface's TRL, Trainer, Seq2SeqTrainer or even Pytorch code!
We're in 🤗Hugging Face's official docs! Check out the SFT docs and DPO docs!
from unsloth import FastLanguageModel
from unsloth import is_bfloat16_supported
import torch
from trl import SFTTrainer
from transformers import TrainingArguments
from datasets import load_dataset
max_seq_length = 2048 # Supports RoPE Scaling internally, so choose any!
# Get LAION dataset
url = "https://huggingface.co/datasets/laion/OIG/resolve/main/unified_chip2.jsonl"
dataset = load_dataset("json", data_files = {"train" : url}, split = "train")

# 4bit pre quantized models we support for 4x faster downloading + no OOMs.
fourbit_models = [
"unsloth/mistral-7b-v0.3-bnb-4bit", # New Mistral v3 2x faster!
"unsloth/mistral-7b-instruct-v0.3-bnb-4bit",
"unsloth/llama-3-8b-bnb-4bit", # Llama-3 15 trillion tokens model 2x faster!
"unsloth/llama-3-8b-Instruct-bnb-4bit",
"unsloth/llama-3-70b-bnb-4bit",
"unsloth/Phi-3-mini-4k-instruct", # Phi-3 2x faster!
"unsloth/Phi-3-medium-4k-instruct",
"unsloth/mistral-7b-bnb-4bit",
"unsloth/gemma-7b-bnb-4bit", # Gemma 2.2x faster!
] # More models at https://huggingface.co/unsloth

model, tokenizer = FastLanguageModel.from_pretrained(
model_name="unsloth/llama-3-8b-bnb-4bit",
max_seq_length=max_seq_length,
dtype=None,
load_in_4bit=True,
)
# Do model patching and add fast LoRA weights
model = FastLanguageModel.get_peft_model(
model,
r=16,
target_modules= ["q_proj", "k_proj", "v_proj", "o_proj",
"gate_proj", "up_proj", "down_proj",],
lora_alpha=16,
lora_dropout = 0, # Supports any, but = 0 is optimized
bias = "none", # Supports any, but = "none" is optimized
# [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
random_state = 3407,
max_seq_length=max_seq_length,
use_rslora = False, # We support rank stabilized LoRA
loftq_config = None, # And LoftQ
)
trainer=SFTTrainer(
model=model,
train_dataset=dataset,
dataset_text_field="text",
max_seq_length=max_seq_length,
tokenizer=tokenizer,
args=TrainingArguments(
per_device_train_batch_size=2,
gradient_accumulation_steps=4,
warmup_steps=10,
max_steps=60,
fp16 = not is_bfloat16_supported(),
bf16=is_bfloat16_supported(),
logging_steps=1,
output_dir="outputs",
optim="adamw_8bit",
seed=3407,
),
)
trainer.train()
# Go to https://github.com/unslothai/unsloth/wiki for advanced tips like
# (1) Saving to GGUF / merging to 16bit for vLLM
# (2) Continued training from a saved LoRA adapter
# (3) Adding an evaluation loop / OOMs
# (4) Customized chat templates
DPO Support
DPO (Direct Preference Optimization), PPO, Reward Modelling all seem to work as per 3rd party independent testing from Llama-Factory. We have a preliminary Google Colab notebook for reproducing Zephyr on Tesla T4 here: notebook.
We're in 🤗Hugging Face's official docs! We're on the SFT docs and the DPO docs!
from unsloth import FastLanguageModel, PatchDPOTrainer
from unsloth import is_bfloat16_supported
PatchDPOTrainer()
import torch
from transformers import TrainingArguments
from trl import DPOTrainer

max_seq_length = 2048 # Choose any, as above

model, tokenizer = FastLanguageModel.from_pretrained(
model_name="unsloth/zephyr-sft-bnb-4bit",
max_seq_length=max_seq_length,
dtype=None,
load_in_4bit=True,
)
# Do model patching and add fast LoRA weights
model = FastLanguageModel.get_peft_model(
model,
r=64,
target_modules= ["q_proj", "k_proj", "v_proj", "o_proj",
"gate_proj", "up_proj", "down_proj",],
lora_alpha=64,
lora_dropout = 0, # Supports any, but = 0 is optimized
bias = "none", # Supports any, but = "none" is optimized
# [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
random_state = 3407,
max_seq_length=max_seq_length,
)
dpo_trainer=DPOTrainer(
model=model,
ref_model=None,
args=TrainingArguments(
per_device_train_batch_size=4,
gradient_accumulation_steps=8,
warmup_ratio=0.1,
num_train_epochs=3,
fp16 = not is_bfloat16_supported(),
bf16=is_bfloat16_supported(),
logging_steps=1,
optim="adamw_8bit",
seed=42,
output_dir="outputs",
),
beta=0.1,
train_dataset=YOUR_DATASET_HERE,
# eval_dataset = YOUR_DATASET_HERE,
tokenizer = tokenizer,
max_length=1024,
max_prompt_length=512,
)
dpo_trainer.train()
🥇 Detailed Benchmarking Tables
Click "Code" for fully reproducible examples
"Unsloth Equal" is a preview of our PRO version, with code stripped out. All settings and the loss curve remains identical.
I was wondering if it's possible to expand the trainable modules in LoraConfig by adding to the modules_to_save parameter. Specifically, can we include more modules like lm_head for training? Here's an example of what I'm thinking:
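A minimal sketch of what that might look like (assuming a standard PEFT LoraConfig; the exact values are illustrative, not the asker's original code):

from peft import LoraConfig

# Hypothetical config: also train lm_head in full, in addition to the LoRA adapters.
config = LoraConfig(
    r = 16,
    lora_alpha = 16,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj"],
    modules_to_save = ["lm_head"], # extra modules saved and trained alongside the adapters
)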
Hi, I got the following warning when using a single machine for multi-GPU training: /usr/local/lib/python3.10/dist-packages/unsloth/__init__.py:23: UserWarning: Unsloth: 'CUDA_VISIBLE_DEVICES' is currently 0,1,2,3 but we require 'CUDA_VISIBLE_DEVICES=0' We shall set it ourselves.
How does this affect my training? The training framework I am using is LLaMA-Factory.
RuntimeError: Cannot launch Triton kernel since n = 46336 exceeds the maximum CUDA blocksize = 65535.
File /usr/local/lib/python3.10/dist-packages/unsloth/kernels/utils.py:23, in calculate_settings(n)
21 # CUDA only supports 65535 - 2^16-1 threads per block
22 if BLOCK_SIZE > MAX_FUSED_SIZE:
---> 23 raise RuntimeError(f"Cannot launch Triton kernel since n = {n} exceeds "\
24 f"the maximum CUDA blocksize = {MAX_FUSED_SIZE}.")
25 num_warps = 4
26 if BLOCK_SIZE >= 32768: num_warps = 32
H100 80GB
Worked on this quite a lot yesterday with the help of GPT-4, but I'm having some issues with errors similar to the following:
Traceback (most recent call last):
File "/mnt/c/Users/magda/OneDrive/Desktop/python_work/unsloth_finetuner.py", line 67, in <module>
trainer.train()
File "/home/harry/.local/lib/python3.10/site-packages/transformers/trainer.py", line 1537, in train
return inner_training_loop(
File "/home/harry/.local/lib/python3.10/site-packages/transformers/trainer.py", line 1854, in _inner_training_loop
tr_loss_step = self.training_step(model, inputs)
File "/home/harry/.local/lib/python3.10/site-packages/transformers/trainer.py", line 2744, in training_step
self.accelerator.backward(loss)
File "/home/harry/.local/lib/python3.10/site-packages/accelerate/accelerator.py", line 1903, in backward
self.scaler.scale(loss).backward(**kwargs)
File "/home/harry/.local/lib/python3.10/site-packages/torch/_tensor.py", line 492, in backward
torch.autograd.backward(
File "/home/harry/.local/lib/python3.10/site-packages/torch/autograd/__init__.py", line 251, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
File "/home/harry/.local/lib/python3.10/site-packages/torch/autograd/function.py", line 288, in apply
return user_fn(self, *args)
File "/home/harry/.local/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 288, in backward
torch.autograd.backward(outputs_with_grad, args_with_grad)
File "/home/harry/.local/lib/python3.10/site-packages/torch/autograd/__init__.py", line 251, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
File "/home/harry/.local/lib/python3.10/site-packages/torch/autograd/function.py", line 288, in apply
return user_fn(self, *args)
File "/home/harry/.local/lib/python3.10/site-packages/torch/cuda/amp/autocast_mode.py", line 140, in decorate_bwd
return bwd(*args, **kwargs)
File "/home/harry/.local/lib/python3.10/site-packages/unsloth/kernels/fast_lora.py", line 291, in backward
dX = torch.matmul(dQ, QW.t(), out = X)
RuntimeError: Expected out tensor to have dtype c10::Half, but got float instead
I'm running an RTX 3090 locally, for the record; I want to say CUDA 11.8. I downgraded Pytorch to 2.1.0 as well. Here's what the script looks like right now. This is my first time trying to finetune locally, so I'm not sure if the args are wrong, or what, but any advice would be very much appreciated.
from unsloth import FastLlamaModel, FastMistralModel
import torch
max_seq_length = 4096 # Can change to any number <= 4096
HAS_BFLOAT16 = torch.cuda.is_bf16_supported()
dtype = torch.bfloat16 if HAS_BFLOAT16 else torch.float16
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.
# Load Llama model
model, tokenizer = FastLlamaModel.from_pretrained(
model_name = "unsloth/llama-2-7b", # Supports any llama model eg meta-llama/Llama-2-7b-hf
max_seq_length = max_seq_length,
dtype = dtype,
load_in_4bit = load_in_4bit,
token = "hf...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)
# Do model patching and add fast LoRA weights
model = FastLlamaModel.get_peft_model(
model,
r = 16,
target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
"gate_proj", "up_proj", "down_proj",],
lora_alpha = 16,
lora_dropout = 0, # Currently only supports dropout = 0
bias = "none", # Currently only supports bias = "none"
use_gradient_checkpointing = True,
random_state = 3407,
max_seq_length = max_seq_length,
)
from transformers import Trainer, TrainingArguments
from datasets import load_dataset
dataset = load_dataset('json', data_files='processed_data.json', split='train')
# Tokenize the dataset
def tokenize_and_prepare_labels(examples):
tokenized_inputs = tokenizer(examples['text'], padding='max_length', truncation=True, max_length=max_seq_length)
# Shift the tokenized inputs to the right for the labels
tokenized_inputs["labels"] = tokenized_inputs["input_ids"].copy()
return tokenized_inputs
tokenized_dataset = dataset.map(tokenize_and_prepare_labels, batched=True)
# Define training arguments
training_args = TrainingArguments(
output_dir='outputs',
num_train_epochs=3,
per_device_train_batch_size=8,
warmup_steps=500,
weight_decay=0.01,
logging_dir='./logs',
logging_steps=10,
fp16=True, # Enable mixed precision
)
# Initialize Trainer with the entire tokenized dataset
trainer = Trainer(
model=model,
args=training_args,
train_dataset=tokenized_dataset,
eval_dataset=None # Set this to your evaluation dataset if you have one
)
print(f"Max sequence length: {max_seq_length}")
trainer.train()
print(f"Training done! Check outputs")
For the record, when I remove fp16 = True, it errors out a lot faster and gives a slight variation of the error:
RuntimeError: expected mat1 and mat2 to have the same dtype, but got: c10::BFloat16 != float
Hello, thanks for your contribution; it is really promising, but for some reason it breaks generation and inference. Here is an example:
from unsloth import FastLlamaModel
import torch
max_seq_length = 1024 # Can change to any number <= 4096
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.
# Load Llama model
model, tokenizer = FastLlamaModel.from_pretrained(
model_name = "TheBloke/Llama-2-7B-fp16", # Supports any llama model
max_seq_length = max_seq_length,
dtype=dtype,
load_in_4bit = False
# token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)
inputs = tokenizer.encode("the concept of ", return_tensors="pt", add_special_tokens = True).to(model.device)
answer = model.generate(inputs, max_new_tokens = 20)
tokenizer.batch_decode(answer, skip_special_tokens = False)
The output:
==((====))== Unsloth: Fast Llama patching release 23.11
\\ /| GPU: A100-SXM4-40GB. Max memory: 39.587 GB
O^O/ \_/ \ CUDA compute capability = 8.0
\ / Pytorch version: 2.1.0+cu118. CUDA Toolkit = 11.8
"-____-" bfloat16 support = TRUE
Loading checkpoint shards: 100%
2/2 [00:14<00:00, 6.61s/it]
['<s> the concept of 1<s> Tags\n \\\n \\\n \\\n \\\n \\\n \\\n \\\n \\\n']
I have tried more than 4 different Llamas, including yours, and hit the same issue.
I am curious to know whether the Hugging Face baseline incorporates Flash Attention 2. Could you please offer a reproducible Hugging Face baseline example?
This is really an amazing project. It made my model finetuning 2x faster.
I'm wondering whether I can add an existing LoRA adapter to the model, so as to continue training the LoRA on another dataset.
Currently, it seems that I have to merge my base model and LoRA weights in advance when using Unsloth.
Therefore, I have to update and save my newly merged model each time before further training.
So, will this project support loading existing LoRA weights in the near future?
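For reference, a minimal sketch of the merge-then-retrain workaround described above (paths are illustrative; merge_and_unload is PEFT's adapter-merging API, not an Unsloth-specific one):

# Fold the current LoRA weights into the base model, save, then reload later for further training.
merged_model = model.merge_and_unload()
merged_model.save_pretrained("merged_model")
tokenizer.save_pretrained("merged_model")
# Later: load "merged_model" with FastLanguageModel.from_pretrained and call get_peft_model again.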
> [21/23] RUN pip install "unsloth[cu121_torch211] @ git+https://github.com/unslothai/unsloth.git":
1.136 Collecting unsloth[cu121_torch211]@ git+https://github.com/unslothai/unsloth.git
1.136 Cloning https://github.com/unslothai/unsloth.git to /tmp/pip-install-ptgzpiwm/unsloth
1.136 Running command git clone -q https://github.com/unslothai/unsloth.git /tmp/pip-install-ptgzpiwm/unsloth
2.604 Installing build dependencies: started
6.648 Installing build dependencies: finished with status 'done'
6.649 Getting requirements to build wheel: started
6.932 Getting requirements to build wheel: finished with status 'done'
6.936 Installing backend dependencies: started
9.363 Installing backend dependencies: finished with status 'done'
9.365 Preparing wheel metadata: started
9.693 Preparing wheel metadata: finished with status 'done'
9.773 ERROR: Package 'unsloth-2024.1' requires a different Python: 3.8.10 not in '>=3.9'
Python 3.8 reached EOL quite some time ago... please update it, or at least don't restrict currently supported versions.
This produced the error messages shown below. I also found that the non-Ampere pip installation command installed successfully.
Collecting unsloth[cu118_ampere]@ git+https://github.com/unslothai/unsloth.git
Cloning https://github.com/unslothai/unsloth.git to /tmp/pip-install-airv06iv/unsloth_a06586211bd7445ba155bf97e1192880
Running command git clone --filter=blob:none --quiet https://github.com/unslothai/unsloth.git /tmp/pip-install-airv06iv/unsloth_a06586211bd7445ba155bf97e1192880
Resolved https://github.com/unslothai/unsloth.git to commit 27bbd6b2cd927b7cb0866a03dec41efb04470501
Installing build dependencies ... done
Getting requirements to build wheel ... done
Installing backend dependencies ... done
Preparing metadata (pyproject.toml) ... done
WARNING: unsloth 2023.12 does not provide the extra 'cu118-ampere'
Collecting bitsandbytes (from unsloth[cu118_ampere]@ git+https://github.com/unslothai/unsloth.git)
Using cached bitsandbytes-0.41.3.post2-py3-none-any.whl.metadata (9.8 kB)
Collecting flash-attn (from unsloth[cu118_ampere]@ git+https://github.com/unslothai/unsloth.git)
Using cached flash_attn-2.3.6.tar.gz (2.3 MB)
Installing build dependencies ... done
Getting requirements to build wheel ... error
error: subprocess-exited-with-error
× Getting requirements to build wheel did not run successfully.
│ exit code: 1
╰─> [17 lines of output]
Traceback (most recent call last):
File "/home/zmx/m.2/Dev/llm/unsloth/venv/lib/python3.10/site-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 353, in <module>
main()
File "/home/zmx/m.2/Dev/llm/unsloth/venv/lib/python3.10/site-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 335, in main
json_out['return_val'] = hook(**hook_input['kwargs'])
File "/home/zmx/m.2/Dev/llm/unsloth/venv/lib/python3.10/site-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 118, in get_requires_for_build_wheel
return hook(config_settings)
File "/tmp/pip-build-env-1h3yu7xe/overlay/lib/python3.10/site-packages/setuptools/build_meta.py", line 325, in get_requires_for_build_wheel
return self._get_build_requires(config_settings, requirements=['wheel'])
File "/tmp/pip-build-env-1h3yu7xe/overlay/lib/python3.10/site-packages/setuptools/build_meta.py", line 295, in _get_build_requires
self.run_setup()
File "/tmp/pip-build-env-1h3yu7xe/overlay/lib/python3.10/site-packages/setuptools/build_meta.py", line 480, in run_setup
super(_BuildMetaLegacyBackend, self).run_setup(setup_script=setup_script)
File "/tmp/pip-build-env-1h3yu7xe/overlay/lib/python3.10/site-packages/setuptools/build_meta.py", line 311, in run_setup
exec(code, locals())
File "<string>", line 9, in <module>
ModuleNotFoundError: No module named 'packaging'
[end of output]
note: This error originates from a subprocess, and is likely not a problem with pip.
error: subprocess-exited-with-error
× Getting requirements to build wheel did not run successfully.
│ exit code: 1
╰─> See above for output.
note: This error originates from a subprocess, and is likely not a problem with pip.
We are benchmarking training of llama2-13B-hf + hf + trl + FlashAttn 2 vs unsloth (which has FlashAttn 2 enabled) on a single A100 80GB.
With the latest changes:
We see at most 10% improvement in training time.
Peak VRAM is actually slightly elevated vs HF: ~1GB more on an A100.
Both use native bf16 for finetuning.
Overall, we cannot reproduce anywhere close to the numbers presented by Unsloth.
Can anyone else share their results?
Edit: Perhaps the drastic improvements are only visible on a T4 GPU? Curious to see what the real-world benchmark results are like on more modern GPUs, i.e. 3090, A100, 4090.
Previously, when one trained with either HF's TRL library or Unsloth, converting the model to GGUF / llama.cpp was a nightmare.
One of the following strategies had to be used:
1. Save PEFT adapters and upload them to HF. Use llama.cpp to take a GGUF version of Mistral / Llama, combine the LoRA via convert-lora-to-ggml.py, then quantize. Accuracy is medium, since bitsandbytes's quantization method differs from GGUF's. VRAM usage is great.
2. Train with load_in_4bit = False to use 16bit, then use model.save_pretrained and llama.cpp, then quantize. Accuracy is superb, since 16bit retains full accuracy, but VRAM usage for llama-7b goes to 14.5GB, and Mistral OOMs.
3. Merge PEFT adapters back using model.merge_and_unload(), then convert each layer to 16bit, then follow step 2. Accuracy is low, since converting to 4bit and then upcasting to 16bit loses lots of bits.
4. [NEW] Unsloth: Combines the best of both worlds - train with QLoRA (low memory), then we upcast to 16bit on the fly, using mmap. Accuracy is high as well.
A table comparing all strategies:
Strategy   | VRAM Usage | Disk Usage | Accuracy
1. PEFT    | Low        | Low        | Medium
2. 16bit   | Horrible   | High       | High
3. Merge   | Horrible   | High       | Low
4. Unsloth | Low        | High       | High
To use Unsloth:
from unsloth import unsloth_save_model
# unsloth_save_model has the same args as model.save_pretrained
unsloth_save_model(model, tokenizer, "output_model", push_to_hub = False, token = None)
from unsloth import colab_quantize_to_gguf # Colab-focused GGUF conversion helper
colab_quantize_to_gguf("output_model", quantization_method = "q4_k_m")
For another Mistral model, I'm confused about how to save the model in fp16 mode. The reason is that I then wanted to convert it to GGUF. Here are the steps after training succeeded and the inference test was correct:
new_model = "xxx/xxxx"
model = model.merge_and_unload()
model.push_to_hub(new_model)
tokenizer.push_to_hub(new_model)
The result is very small, only about 4.13 GB, and the HF tag says 4-bit precision (model.safetensors).
And llama.cpp's GGUF convert.py refused to convert this.
Did I do something wrong?
Note: BTW, Unsloth is very cool, very fast training.
I am attempting to replicate the Mistral 7b model finetuning locally, as outlined in the Unsloth Open column of the README.md file. I have successfully downloaded the dataset and the model to my local machine, which is a deviation from the original Jupyter notebook that used data directly from the cloud.
The initial stages of the Jupyter notebook, which include multiple code blocks, were executed without any issues. However, I have run into an error during the execution of the trainer_stats = trainer.train() block. The issue arises with a ValueError stating that the Query/Key/Value must have either BMHK or BMK shape.
ValueError: Query/Key/Value should all have BMHK or BMK shape.
query.shape: torch.Size([4, 2021, 8, 4, 128])
key.shape : torch.Size([4, 2021, 8, 4, 128])
value.shape: torch.Size([4, 2021, 8, 4, 128])
I have even tried reverting back to using the HuggingFace dataset instead of the local one to see if the issue was related to my dataset location. Unfortunately, I encountered the same error.
Here’s a brief overview of my setup:
==((====))== Unsloth: Fast Mistral patching release 2023.12
\\ /| GPU: NVIDIA GeForce RTX 4090. Max memory: 23.647 GB
O^O/ \_/ \ CUDA compute capability = 8.9
\ / Pytorch version: 2.1.0.post300. CUDA Toolkit = 11.8
"-____-" bfloat16 support = TRUE
GPU = NVIDIA GeForce RTX 4090. Max memory = 23.647 GB.
4.66 GB of memory reserved.
Additionally, I have addressed an issue with the torch version format. The original Unsloth configuration didn't support version numbers with four segments, so I modified the version parsing from major_torch, minor_torch, _ = torch.__version__.split(".") to major_torch, minor_torch, _ = torch.__version__.split(".")[0:3] to accommodate the expected format.
Would anyone be able to point me in the right direction, or has anyone experienced a similar issue while working on model finetuning replication? Your help would be greatly appreciated as this error is currently standing in the way of my progress.
Failed to load deepseek-ai/deepseek-llm-7b-base (which is a Llama 2 architecture model). Is the following code necessary? The HF tokenizer should handle this automatically according to tokenizer_config.json, right?
Thank you very much for this great project and for pushing this forward for the community!
With the TRL / PEFT team, we've seen that your example scripts rely heavily on the PEFT / TRL libraries, and we wanted to see if you need any help or have any feature requests around the HF ecosystem. We would be happy to collaborate and see what we can do together.
Note also that SDPA has recently been integrated into transformers core (huggingface/transformers#26572); we were also wondering if you did some comparisons of Unsloth against transformers 4.36.0.
GPU = NVIDIA GeForce RTX 3090. Max memory = 23.689 GB.
5.176 GB of memory reserved.
Traceback:
RuntimeError Traceback (most recent call last)
/workspaces/unsloth-train-playground/train.ipynb Cell 6 line 1
----> 1 trainer_stats = trainer.train()
File /workspaces/unsloth-train-playground/.venv/lib/python3.11/site-packages/trl/trainer/sft_trainer.py:280, in SFTTrainer.train(self, *args, **kwargs)
277 if self.neftune_noise_alpha is not None and not self._trainer_supports_neftune:
278 self.model = self._trl_activate_neftune(self.model)
--> 280 output = super().train(*args, **kwargs)
282 # After training we make sure to retrieve back the original forward pass method
283 # for the embedding layer by removing the forward post hook.
284 if self.neftune_noise_alpha is not None and not self._trainer_supports_neftune:
File /workspaces/unsloth-train-playground/.venv/lib/python3.11/site-packages/transformers/trainer.py:1555, in Trainer.train(self, resume_from_checkpoint, trial, ignore_keys_for_eval, **kwargs)
1553 hf_hub_utils.enable_progress_bars()
1554 else:
-> 1555 return inner_training_loop(
1556 args=args,
1557 resume_from_checkpoint=resume_from_checkpoint,
1558 trial=trial,
1559 ignore_keys_for_eval=ignore_keys_for_eval,
1560 )
File /workspaces/unsloth-train-playground/.venv/lib/python3.11/site-packages/transformers/trainer.py:1860, in Trainer._inner_training_loop(self, batch_size, args, resume_from_checkpoint, trial, ignore_keys_for_eval)
1857 self.control = self.callback_handler.on_step_begin(args, self.state, self.control)
...
24 f"the maximum CUDA blocksize = {MAX_FUSED_SIZE}.")
25 num_warps = 4
26 if BLOCK_SIZE >= 32768: num_warps = 32
RuntimeError: Cannot launch Triton kernel since n = 102400 exceeds the maximum CUDA blocksize = 65536.
This happens when executing the line from kernels import *. Below is the error trace for reference:
===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
================================================================================
...
AttributeError: /root/anaconda3/envs/group_chat/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda117.so: undefined symbol: cdequantize_blockwise_fp16_nf4. Did you mean: 'cdequantize_blockwise_fp32'?
The main issue seems to stem from an undefined symbol cdequantize_blockwise_fp16_nf4 in libbitsandbytes_cuda117.so. The system suggests an alternative cdequantize_blockwise_fp32, but it's unclear if this is a suitable substitute.
You're using a LlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the __call__ method is faster than using a method to encode the text followed by a call to the pad method to get a padded encoding.
Traceback (most recent call last):
File "/root/fanfiction-go/python/dictionary/train/sft_trainer.py", line 353, in
train_result = trainer.train()
^^^^^^^^^^^^^^^
File "/root/miniconda3/lib/python3.11/site-packages/trl/trainer/sft_trainer.py", line 280, in train
output = super().train(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/lib/python3.11/site-packages/transformers/trainer.py", line 1555, in train
return inner_training_loop(
^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/lib/python3.11/site-packages/transformers/trainer.py", line 1860, in _inner_training_loop
tr_loss_step = self.training_step(model, inputs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/lib/python3.11/site-packages/transformers/trainer.py", line 2725, in training_step
loss = self.compute_loss(model, inputs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/lib/python3.11/site-packages/transformers/trainer.py", line 2748, in compute_loss
outputs = model(**inputs)
^^^^^^^^^^^^^^^
File "/root/miniconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/lib/python3.11/site-packages/accelerate/utils/operations.py", line 659, in forward
return model_forward(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/lib/python3.11/site-packages/accelerate/utils/operations.py", line 647, in call
return convert_to_fp32(self.model_forward(*args, **kwargs))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/lib/python3.11/site-packages/torch/amp/autocast_mode.py", line 16, in decorate_autocast
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/lib/python3.11/site-packages/unsloth/models/llama.py", line 475, in LlamaForCausalLM_fast_forward
shift_labels = torch.hstack((labels[..., 1:], self.extra_ignored_labels[:labels.shape[0]]))
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1695, in getattr
raise AttributeError(f"'{type(self).name}' object has no attribute '{name}'")
AttributeError: 'LlamaForCausalLM' object has no attribute 'extra_ignored_labels'
12/27/2023 19:13:37 - WARNING - llmtuner.model.patcher - Current model does not support RoPE scaling.
12/27/2023 19:13:37 - INFO - llmtuner.model.patcher - Using FlashAttention-2 for faster training and inference.
12/27/2023 19:13:37 - INFO - llmtuner.model.patcher - Quantizing model to 4 bit.
/root/miniconda3/envs/llama_factory/lib/python3.10/site-packages/unsloth/__init__.py:31: UserWarning: Unsloth: 'CUDA_DEVICE_ORDER' is not set but we require 'CUDA_DEVICE_ORDER=PCI_BUS_ID'
We shall set it ourselves.
warnings.warn(
Traceback (most recent call last):
File "/workspace/LLaMA-Factory/src/train_bash.py", line 14, in
main()
File "/workspace/LLaMA-Factory/src/train_bash.py", line 5, in main
run_exp()
File "/workspace/LLaMA-Factory/src/llmtuner/train/tuner.py", line 32, in run_exp
run_dpo(model_args, data_args, training_args, finetuning_args, callbacks)
File "/workspace/LLaMA-Factory/src/llmtuner/train/dpo/workflow.py", line 28, in run_dpo
model, tokenizer = load_model_and_tokenizer(model_args, finetuning_args, training_args.do_train)
File "/workspace/LLaMA-Factory/src/llmtuner/model/loader.py", line 77, in load_model_and_tokenizer
model, _ = FastMistralModel.from_pretrained(**unsloth_kwargs)
TypeError: FastMistralModel.from_pretrained() got an unexpected keyword argument 'rope_scaling'
Excellent work on training optimization! I have had a question for a long time; could an expert like you help me with it?
The test code is below:
import torch
import triton.ops
import time
dtype = torch.int8
out_dtype = torch.int32
# %%
M, K, N = 1000, 1024, 2048
a = torch.randint(-128, 127, (M, K), dtype=dtype, device='cuda')
b = torch.randint(-128, 127, (N, K), dtype=dtype, device='cuda')
b = b.t()
for i in range(3):
    ts = time.time()
    out = triton.ops.matmul(a, b, out_dtype)
    print(f"test {i}: {time.time() - ts}s")
M = 1024 # changing M, like changing sequence length
print("changing M from 1000 to 1024")
a = torch.randint(-128, 127, (M, K), dtype=dtype, device='cuda')
b = torch.randint(-128, 127, (N, K), dtype=dtype, device='cuda')
b = b.t()
for i in range(3):
    ts = time.time()
    out = triton.ops.matmul(a, b, out_dtype)
    print(f"test {i}: {time.time() - ts}s")
output:
test 0: 6.1418256759643555s
test 1: 0.00017952919006347656s
test 2: 8.273124694824219e-05s
changing M from 1000 to 1024
test 0: 5.853637456893921s
test 1: 0.00012087821960449219s
test 2: 7.677078247070312e-05s
If Triton has to "warm up" every time the input shape changes, do we have any solutions to fix or avoid this? Or am I using matmul in the wrong way?
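One possible workaround, sketched below under the assumption that the first-call slowdown is per-shape JIT compilation / autotuning (this is not an Unsloth or Triton API, just the same matmul call wrapped in a warm-up loop): run a throwaway call for each shape you expect before timing or serving.

import torch
import triton.ops

def warmup_matmul(expected_Ms, K = 1024, N = 2048, dtype = torch.int8, out_dtype = torch.int32):
    # One throwaway matmul per expected M so each shape is compiled/autotuned up front.
    for M in expected_Ms:
        a = torch.randint(-128, 127, (M, K), dtype = dtype, device = "cuda")
        b = torch.randint(-128, 127, (N, K), dtype = dtype, device = "cuda").t()
        triton.ops.matmul(a, b, out_dtype)
    torch.cuda.synchronize()

warmup_matmul([1000, 1024]) # the two M values used in the test above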
Hi, I just found this wonderful project. May I ask how I could make the necessary changes to support the Qwen-1.8B/7B/14B model architecture? The difference from Llama is that the Q, K, V projection matrices have biases, and these weights are fused.
I saw that Unsloth supports QLoRA finetuning of Yi models, so I tried using the script below to finetune Yi-34B-Chat. Then I encountered the error below.
Could you please tell me how to fix it?
[INFO|tokenization_utils_base.py:2024] 2024-01-03 12:45:42,955 >> loading file tokenizer.model
[INFO|tokenization_utils_base.py:2024] 2024-01-03 12:45:42,956 >> loading file tokenizer.json
[INFO|tokenization_utils_base.py:2024] 2024-01-03 12:45:42,956 >> loading file added_tokens.json
[INFO|tokenization_utils_base.py:2024] 2024-01-03 12:45:42,956 >> loading file special_tokens_map.json
[INFO|tokenization_utils_base.py:2024] 2024-01-03 12:45:42,956 >> loading file tokenizer_config.json
Traceback (most recent call last):
File "/root/LLaMA-Factory/src/train_bash.py", line 14, in
main()
File "/root/LLaMA-Factory/src/train_bash.py", line 5, in main
run_exp()
File "/root/LLaMA-Factory/src/llmtuner/train/tuner.py", line 26, in run_exp
run_sft(model_args, data_args, training_args, finetuning_args, generating_args, callbacks)
File "/root/LLaMA-Factory/src/llmtuner/train/sft/workflow.py", line 29, in run_sft
model, tokenizer = load_model_and_tokenizer(model_args, finetuning_args, training_args.do_train)
File "/root/LLaMA-Factory/src/llmtuner/model/loader.py", line 75, in load_model_and_tokenizer
model, _ = FastLlamaModel.from_pretrained(**unsloth_kwargs)
File "/root/.local/conda/envs/cpp/lib/python3.10/site-packages/unsloth/models/llama.py", line 717, in from_pretrained
check_tokenizer(model, tokenizer)
File "/root/.local/conda/envs/cpp/lib/python3.10/site-packages/unsloth/models/_utils.py", line 131, in check_tokenizer
raise RuntimeError(
RuntimeError: Unsloth: Extra special token <|startoftext|> with id=64000 exceeds the maximum vocabulary size of 64000. You must fix the tokenizer or else out of bounds memory accesses will occur.
[2024-01-03 12:45:50,260] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 27792
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/workspace/LLaMA-Factory/src/train_bash.py", line 14, in
main()
File "/workspace/LLaMA-Factory/src/train_bash.py", line 5, in main
run_exp()
File "/workspace/LLaMA-Factory/src/llmtuner/train/tuner.py", line 26, in run_exp
run_sft(model_args, data_args, training_args, finetuning_args, generating_args, callbacks)
File "/workspace/LLaMA-Factory/src/llmtuner/train/sft/workflow.py", line 29, in run_sft
model, tokenizer = load_model_and_tokenizer(model_args, finetuning_args, training_args.do_train)
File "/workspace/LLaMA-Factory/src/llmtuner/model/loader.py", line 63, in load_model_and_tokenizer
require_version("unsloth", "Follow the instructions at: https://github.com/unslothai/unsloth")
File "/root/miniconda3/envs/llama_factory/lib/python3.10/site-packages/transformers/utils/versions.py", line 104, in require_version
raise importlib.metadata.PackageNotFoundError(
importlib.metadata.PackageNotFoundError: No package metadata was found for The 'unsloth' distribution was not found and is required by this application.
Follow the instructions at: https://github.com/unslothai/unsloth
Converting format of dataset: 97%|█████████▋| 139000/143000 [00:02<00:00, 55130.54 examples/s]
12/31/2023 15:47:40 - INFO - llmtuner.model.patcher - Quantizing model to 4 bit.
Traceback (most recent call last):
File "/root/miniconda3/envs/llama_factory/lib/python3.10/site-packages/transformers/utils/versions.py", line 102, in require_version
got_ver = importlib.metadata.version(pkg)
File "/root/miniconda3/envs/llama_factory/lib/python3.10/importlib/metadata/init.py", line 996, in version
return distribution(distribution_name).version
File "/root/miniconda3/envs/llama_factory/lib/python3.10/importlib/metadata/init.py", line 969, in distribution
return Distribution.from_name(distribution_name)
File "/root/miniconda3/envs/llama_factory/lib/python3.10/importlib/metadata/init.py", line 548, in from_name
raise PackageNotFoundError(name)
importlib.metadata.PackageNotFoundError: No package metadata was found for unsloth
During handling of the above exception, another exception occurred:
And I modified line 163 from dX = torch.matmul(DW_f, upW.t(), out = X) to dX = torch.matmul(DW_f, upW.t()).to(X.dtype), and line 291 from dX = torch.matmul(dQ, QW.t(), out = X) to dX = torch.matmul(dQ, QW.t()).to(X.dtype).
So I have a GPTQ Llama model I downloaded (from TheBloke), and it's already 4-bit quantized. I have to pass in False for the load_in_4bit parameter of: