Comments (10)
@dvmazur @lavawolfiee Could you please kindly address this question? I'd be happy to implement this myself if it's not already possible (which I don't think it is), if you could point me to where I'd need to make the changes.
Hi!
Sorry for the late reply.
Running the model on multiple GPUs is not currently supported: all active experts are sent to cuda:0. You can send an expert to a different GPU by simply specifying a different device when initializing MixtralExpertWrapper.
Keep in mind that you would need to balance the number of active experts between your GPUs. This logic could be added to the ExpertCache class.
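For illustration, here is a minimal sketch of what round-robin expert placement could look like. It assumes MixtralExpertWrapper takes the expert module and a target device (as in the repo's src/expert_wrapper.py); the helper name and the balancing policy are hypothetical, not part of the library:

```python
import torch

from src.expert_wrapper import MixtralExpertWrapper

# Hypothetical helper: wrap each expert and spread the wrappers across
# GPUs round-robin. Assumes MixtralExpertWrapper(expert_module, device);
# real balancing would have to live in ExpertCache, as noted above.
def wrap_experts_round_robin(experts, devices=("cuda:0", "cuda:1")):
    wrapped = []
    for i, expert in enumerate(experts):
        device = torch.device(devices[i % len(devices)])
        wrapped.append(MixtralExpertWrapper(expert, device))
    return wrapped
```

For this to work end to end, ExpertCache would also need to keep its buffers and eviction bookkeeping per device, which is the balancing logic mentioned above.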
from mixtral-offloading.
By the way, one of our quantization setups compresses the model to 17 GB. This would fit into the VRAM of two T4 GPUs, which you can get for free on Kaggle.
Have you looked into running a quantized version (possibly ours) of the model using tensor_parallel?
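For reference, basic tensor_parallel usage looks roughly like this (assuming the tensor_parallel PyPI package; whether it composes cleanly with the HQQ-quantized layers used in this repo is untested):

```python
import torch
import tensor_parallel as tp
from transformers import AutoModelForCausalLM

# Sketch only: shard a Hugging Face model across two GPUs.
# Loading the full fp16 Mixtral this way would not fit on two T4s;
# in practice the quantized checkpoint would be needed.
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mixtral-8x7B-Instruct-v0.1",
    torch_dtype=torch.float16,
)
model = tp.tensor_parallel(model, ["cuda:0", "cuda:1"])
```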
Hi @dvmazur!
Thank you for your reply.
Unfortunately (or fortunately), I have 8 1080 Ti GPUs on my machine, which individually cannot seem to handle the model even with quantization and offload_per_layer = 5 or offload_per_layer = 6. What I am ultimately trying to achieve is to run a single model on 2x 1080 Ti GPUs (total VRAM ~22.5 GB), so I can run 4 separate instances of the model across my GPUs simultaneously.
Thank you for your suggestions. I'll have a look at MixtralExpertWrapper, but the tensor_parallel option with your specific quantization setup seems like a great workaround to try first. May I ask which quantization setup allowed compression down to 17 GB, or could you point me to a file that contains that setup, please? Currently, when I set offload_per_layer = 5, the model seems to occupy only ~11 GB on a single GPU without an OOM error, but then at inference there is no utilization of the GPU cores (though the VRAM is occupied) until the kernel crashes. Here's the code:
```python
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"

import sys
sys.path.append("mixtral-offloading")

import torch
from torch.nn import functional as F
from hqq.core.quantize import BaseQuantizeConfig
from huggingface_hub import snapshot_download
from IPython.display import clear_output
from tqdm.auto import trange
from transformers import AutoConfig, AutoTokenizer
from transformers.utils import logging as hf_logging

from src.build_model import OffloadConfig, QuantConfig, build_model

model_name = "mistralai/Mixtral-8x7B-Instruct-v0.1"
quantized_model_name = "lavawolfiee/Mixtral-8x7B-Instruct-v0.1-offloading-demo"
state_path = "Mixtral-8x7B-Instruct-v0.1-offloading-demo"

config = AutoConfig.from_pretrained(quantized_model_name)
device = torch.device("cuda")

##### Change this to 5 if you have only 12 GB of GPU VRAM #####
# offload_per_layer = 4
offload_per_layer = 5
###############################################################

num_experts = config.num_local_experts

offload_config = OffloadConfig(
    main_size=config.num_hidden_layers * (num_experts - offload_per_layer),
    offload_size=config.num_hidden_layers * offload_per_layer,
    buffer_size=4,
    offload_per_layer=offload_per_layer,
)

# 4-bit attention weights
attn_config = BaseQuantizeConfig(
    nbits=4,
    group_size=64,
    quant_zero=True,
    quant_scale=True,
)
attn_config["scale_quant_params"]["group_size"] = 256

# 2-bit expert (FFN) weights
ffn_config = BaseQuantizeConfig(
    nbits=2,
    group_size=16,
    quant_zero=True,
    quant_scale=True,
)
quant_config = QuantConfig(ffn_config=ffn_config, attn_config=attn_config)

model = build_model(
    device=device,
    quant_config=quant_config,
    offload_config=offload_config,
    state_path=state_path,
)

tokenizer = AutoTokenizer.from_pretrained(model_name)

conversations_texts = [
    "can you summarise the book Love in the Time of Cholera in 500 words?",
    "can you summarise the book The Picture of Dorian Gray in 500 words?",
]
batched_prompts = [f"User: {text} Assistant:" for text in conversations_texts]  # prepare prompts

tokenizer.padding_side = "left"
tokenizer.pad_token = tokenizer.eos_token  # to avoid an error

# Tokenize all prompts as a batch
batch_inputs = tokenizer(
    batched_prompts, padding=True, return_tensors="pt", add_special_tokens=True
).to("cuda")

# Generate responses for each prompt in the batch
outputs = model.generate(**batch_inputs, max_new_tokens=1000)  # kernel dies!
```
The following screenshot shows the GPU utilization right before the kernel dies.
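For what it's worth, a quick way to check how the weights are actually spread across the visible GPUs right after build_model (standard torch.cuda calls, nothing specific to this repo):

```python
# Report allocated/reserved VRAM per visible GPU; useful to confirm
# whether anything ever lands on cuda:1.
for i in range(torch.cuda.device_count()):
    allocated_gb = torch.cuda.memory_allocated(i) / 1024**3
    reserved_gb = torch.cuda.memory_reserved(i) / 1024**3
    print(f"cuda:{i}: {allocated_gb:.1f} GB allocated, {reserved_gb:.1f} GB reserved")
```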
> May I ask which quantization setup allowed compression down to 17 GB, or could you point me to a file that contains that setup, please?

It's the 4-bit attention and 2-bit expert setup from our tech report. I suppose the weights can be found here. Let's summon @lavawolfiee just in case I'm mistaken.
> the model seems to occupy only ~11 GB on a single GPU without an OOM error, but then at inference there is no utilization of the GPU cores (though the VRAM is occupied) until the kernel crashes

Could you provide a bit more detail? I'll look into it as soon as I have the time.
> It's the 4-bit attention and 2-bit expert setup from our tech report. I suppose the weights can be found here.

Yes, you're right.
> > May I ask which quantization setup allowed compression down to 17 GB, or could you point me to a file that contains that setup, please?
>
> It's the 4-bit attention and 2-bit expert setup from our tech report. I suppose the weights can be found here. Let's summon @lavawolfiee just in case I'm mistaken.

This seems to be the same setup I used in the code I provided, which occupies ~11 GB of VRAM and ~23 GB of CPU RAM and then crashes the kernel at inference.
> > the model seems to occupy only ~11 GB on a single GPU without an OOM error, but then at inference there is no utilization of the GPU cores (though the VRAM is occupied) until the kernel crashes
>
> Could you provide a bit more detail? I'll look into it as soon as I have the time.
Absolutely, what information are you looking for?
> Absolutely, what information are you looking for?

A stack trace would be helpful.
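If the kernel dies without any Python traceback, one standard way to capture one is to run the code as a script outside Jupyter with faulthandler enabled (faulthandler is in the Python standard library; the log path below is just an example):

```python
import faulthandler

# Keep the file handle open for the lifetime of the process:
# faulthandler only writes to it at crash time.
log_file = open("crash_trace.log", "w")
faulthandler.enable(file=log_file, all_threads=True)

# ... run the model build and generate() here; on a hard crash,
# a per-thread traceback is dumped to crash_trace.log ...
```

Running the script as `python -X faulthandler run.py` achieves the same (dumping to stderr) without code changes.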