Comments (10)
Try using huggingface-cli to download the model first, something like:
from IPython.display import clear_output
from transformers import AutoConfig

!huggingface-cli download --resume-download lavawolfiee/Mixtral-8x7B-Instruct-v0.1-offloading-demo --local-dir Mixtral-8x7B-Instruct-v0.1-offloading-demo
clear_output()
!date

# then load the model from local-dir instead of the Hub:
# config = AutoConfig.from_pretrained(quantized_model_name)
# state_path = snapshot_download(quantized_model_name)
state_path = "Mixtral-8x7B-Instruct-v0.1-offloading-demo"
config = AutoConfig.from_pretrained(state_path)
Maybe snapshot_download can't handle that many files. huggingface-cli download is quite fast: 17 GB in 2-3 minutes. Note that config = AutoConfig.from_pretrained(quantized_model_name) also seems to hang.
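If the CLI download also seems to stall, a quick sanity check (just a sketch, not from the notebook) is to total up the files on disk before loading; per the timing above, the snapshot should come to roughly 17 GB.

import os

state_path = "Mixtral-8x7B-Instruct-v0.1-offloading-demo"
total_bytes = 0
for root, _, files in os.walk(state_path):
    for name in files:
        total_bytes += os.path.getsize(os.path.join(root, name))
print(f"{total_bytes / 1e9:.1f} GB on disk")  # expect roughly 17 GB when complete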
Despite the generation speed being slow, it works like a charm.
So, I wanted to thank you, @dvmazur, for the amazing job you have done, really!
That said, I have some questions:
- What does it take to run the same notebook locally on one's PC? What GPU is necessary? Regarding RAM, it seems to be at least 16 GB, right?
- Is there a way to download the Mixtral model embeddings?
==> OK, I had a look at the paper, which seems to clarify my questions.
Again, sincere congratulations to all the contributors!
Hey, @oltipreka, thanks for the kind words. This was a collaborative effort, so please shout out @lavawolfiee for making it happen.
As for the generation speed, we are still working on making it faster, but we've slowed down a bit due to the holidays :)
Regarding your questions,
- You'll need about 27 GB of combined GPU and CPU memory. The proportion of GPU to CPU memory affects generation speed, since lower GPU memory means offloading more experts. You can find some example setups in our tech report.
- You can download the original embedding layer weights from Mixtral's repo on the HF Hub; see the sketch below.
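Here's a rough sketch of one way to pull just the embedding weights. The repo id and tensor name follow the standard Mixtral checkpoint layout, but treat them as assumptions to double-check (and note you may need to log in first if the repo is gated):

import json
from huggingface_hub import hf_hub_download
from safetensors import safe_open

repo = "mistralai/Mixtral-8x7B-Instruct-v0.1"  # assumed repo id
index_path = hf_hub_download(repo, "model.safetensors.index.json")
with open(index_path) as f:
    weight_map = json.load(f)["weight_map"]

# "model.embed_tokens.weight" is the usual embedding name in Llama-style checkpoints
shard_path = hf_hub_download(repo, weight_map["model.embed_tokens.weight"])
with safe_open(shard_path, framework="pt") as f:
    embeddings = f.get_tensor("model.embed_tokens.weight")
print(embeddings.shape)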
I'm closing this issue due to it being resolved.
Thanks for the clarifications, extremely useful.
Yeah, you are absolutely right, the entire team deserves credit for this, including @lavawolfiee.
Thank you, folks, and keep going!
Hi everyone. I have the same problem. Curiously, in Colab, if I don't have the Hugging Face token, the code fails on line 5, but when I add the token to Colab secrets, it fails on line 4.
Maybe there is a compatibility issue between Colab and Hugging Face, or a connection-related problem.
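If it turns out to be a token issue, logging in explicitly before the download might help. This is just a sketch; HF_TOKEN is whatever name you gave the secret in Colab.

from google.colab import userdata
from huggingface_hub import login

# "HF_TOKEN" is an assumed secret name; use the name you chose in Colab secrets
login(token=userdata.get("HF_TOKEN"))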
Same issue, the execution goes on forever and nothing gets downloaded.
Hey, @SanskarX10, could you provide more info?
Just tried running the notebook myself; it appears to be stuck downloading the model snapshot from the model hub. Could be an issue on HF's side.
Oh! That could be the case. Thanks for the quick reply.
@ffreemt, thanks for the tip! I just published the new notebook.
We'll implement a more permanent solution over the weekend.
The recent notebook works. However, generation speed is slow: answering the query "write a poem about python" took 4 minutes.
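For anyone who wants to put a number on it, a rough throughput measurement might look like this; it assumes the model and tokenizer objects built earlier in the notebook are in scope.

import time
import torch

# assumes `model` and `tokenizer` from the notebook are already built
inputs = tokenizer("write a poem about python", return_tensors="pt").to("cuda:0")
start = time.time()
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=128, do_sample=True)
elapsed = time.time() - start
new_tokens = output.shape[1] - inputs["input_ids"].shape[1]
print(f"{new_tokens / elapsed:.2f} tokens/s")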