Comments (16)
7B in float16 will be 14GB, and if quantized to uint8 it could be as low as 7GB. But on graphics cards, from what I've tried with other models, it can take 2x the VRAM.
My guess is that 32GB would be the minimum but some clever person may be able to run it with 16GB VRAM.
But the question is, how fast would it be? If it is one character per second then it would not be that useful!
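The arithmetic behind these figures can be sketched as below. This is a rough, weight-only estimate: the helper name `weight_memory_gb` is made up for illustration, 1 GB is taken as 10^9 bytes, and activations, caches, and framework overhead are not counted (that is where the 2x guess above would come in).

```python
# Rough weight-only memory estimate.
# Ignores activations, caches, and framework overhead.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "uint8": 1}


def weight_memory_gb(n_params: float, dtype: str) -> float:
    """Return the size of the raw weights in GB (1 GB = 1e9 bytes)."""
    return n_params * BYTES_PER_PARAM[dtype] / 1e9


if __name__ == "__main__":
    for dtype in ("float16", "uint8"):
        print(f"7B in {dtype}: {weight_memory_gb(7e9, dtype):.0f} GB")
```

Doubling the float16 result gives 28GB, which is roughly in line with the 32GB guess above.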
The 7B model generates quickly on a 3090 Ti (~30 seconds for ~500 tokens, ~17 tokens/s), much faster than the ChatGPT interface. It uses ~14GB VRAM during generation. This is also with batch_size=1, meaning theoretical throughput is higher than this.
See my fork for the code for rolling generation and the Gradio interface.
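For reference, the tokens/s figure quoted above is just tokens divided by wall-clock time. The sketch below reproduces it; `ideal_batched_rate` is a hypothetical helper that gives an optimistic upper bound, assuming the time per generation step does not grow with batch size (which real hardware will not fully honor).

```python
def tokens_per_second(n_tokens: int, seconds: float) -> float:
    """Single-stream generation rate."""
    return n_tokens / seconds


def ideal_batched_rate(single_stream_rate: float, batch_size: int) -> float:
    """Optimistic upper bound: assumes step latency is unchanged by batching."""
    return single_stream_rate * batch_size


if __name__ == "__main__":
    rate = tokens_per_second(500, 30)
    print(f"batch_size=1: {rate:.1f} tokens/s")
    print(f"batch_size=4 (upper bound): {ideal_batched_rate(rate, 4):.1f} tokens/s")
```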
from llama.
I was able to run 7B on two 1080 Ti (only inference). Next, I'll try 13B and 33B. It still needs refining but it works! I forked LLaMA here:
https://github.com/modular-ml/wrapyfi-examples_llama
and have a readme with the instructions on how to do it:
LLaMA with Wrapyfi
Wrapyfi enables distributing LLaMA (inference only) on multiple GPUs/machines, each with less than 16GB VRAM.
It currently distributes on two cards only, using ZeroMQ. Will support flexible distribution soon!
This approach has only been tested on the 7B model for now, using Ubuntu 20.04 with two 1080 Tis. Testing 13B/30B models soon!
UPDATE: Tested on Two 3080 Tis as well!!!
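One common way to fit a model across several smaller cards is to give each device a contiguous slice of the transformer layers and pass activations between them. The sketch below only illustrates that partitioning idea; it is not taken from Wrapyfi or this fork, and `split_layers` is a hypothetical helper. The 32-layer count is the published 7B configuration.

```python
from typing import List


def split_layers(n_layers: int, n_devices: int) -> List[range]:
    """Split layer indices into contiguous, near-equal slices, one per device."""
    base, extra = divmod(n_layers, n_devices)
    slices, start = [], 0
    for device in range(n_devices):
        # The first `extra` devices take one additional layer each.
        size = base + (1 if device < extra else 0)
        slices.append(range(start, start + size))
        start += size
    return slices


if __name__ == "__main__":
    for device, layers in enumerate(split_layers(32, 2)):
        print(f"device {device}: layers {layers.start}..{layers.stop - 1}")
```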
How to?
- Replace all instances of <YOUR_IP> and <YOUR CHECKPOINT DIRECTORY> before running the scripts
- Download LLaMA weights using the official form below and install this wrapyfi-examples_llama inside conda or virtual env:
git clone https://github.com/modular-ml/wrapyfi-examples_llama.git
cd wrapyfi-examples_llama
pip install -r requirements.txt
pip install -e .
- Install Wrapyfi with the same environment:
git clone https://github.com/fabawi/wrapyfi.git
cd wrapyfi
pip install .[pyzmq]
- Start the Wrapyfi ZeroMQ broker from within the Wrapyfi repo:
cd wrapyfi/standalone
python zeromq_proxy_broker.py --comm_type pubsubpoll
- Start the first instance of the Wrapyfi-wrapped LLaMA from within this repo and env (order is important: don't start wrapyfi_device_idx=0 before wrapyfi_device_idx=1):
CUDA_VISIBLE_DEVICES="0" OMP_NUM_THREADS=1 torchrun --nproc_per_node 1 example.py --ckpt_dir <YOUR CHECKPOINT DIRECTORY>/checkpoints/7B --tokenizer_path <YOUR CHECKPOINT DIRECTORY>/checkpoints/tokenizer.model --wrapyfi_device_idx 1
- Now start the second instance (within this repo and env):
CUDA_VISIBLE_DEVICES="1" OMP_NUM_THREADS=1 torchrun --master_port=29503 --nproc_per_node 1 example.py --ckpt_dir <YOUR CHECKPOINT DIRECTORY>/checkpoints/7B --tokenizer_path <YOUR CHECKPOINT DIRECTORY>/checkpoints/tokenizer.model --wrapyfi_device_idx 0
- You will now see the output on both terminals
- EXTRA: To run on different machines, the broker must be running on a specific IP in step 4. Start the ZeroMQ broker by setting the IP and provide the env variables for steps 5+6, e.g.:
### (replace 10.0.0.101 with <YOUR_IP>) ###
# step 4 modification
python zeromq_proxy_broker.py --socket_ip 10.0.0.101 --comm_type pubsubpoll
# step 5 modification
CUDA_VISIBLE_DEVICES="0" OMP_NUM_THREADS=1 WRAPYFI_ZEROMQ_SOCKET_IP='10.0.0.101' torchrun --nproc_per_node 1 example.py --ckpt_dir <YOUR CHECKPOINT DIRECTORY>/checkpoints/7B --tokenizer_path <YOUR CHECKPOINT DIRECTORY>/checkpoints/tokenizer.model --wrapyfi_device_idx 1
# step 6 modification
CUDA_VISIBLE_DEVICES="1" OMP_NUM_THREADS=1 WRAPYFI_ZEROMQ_SOCKET_IP='10.0.0.101' torchrun --master_port=29503 --nproc_per_node 1 example.py --ckpt_dir <YOUR CHECKPOINT DIRECTORY>/checkpoints/7B --tokenizer_path <YOUR CHECKPOINT DIRECTORY>/checkpoints/tokenizer.model --wrapyfi_device_idx 0
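The two launch commands differ only in the GPU, the `--master_port`, the `--wrapyfi_device_idx`, and the optional broker IP, so they can be generated rather than typed by hand. The following is a minimal sketch, not part of the fork: `build_instance` is a hypothetical helper, `/data/llama` stands in for <YOUR CHECKPOINT DIRECTORY>, and every flag and environment variable is copied from the commands above.

```python
import shlex
from typing import Dict, List, Optional, Tuple


def build_instance(ckpt_root: str, device_idx: int, cuda_device: str,
                   master_port: Optional[int] = None,
                   broker_ip: Optional[str] = None) -> Tuple[Dict[str, str], List[str]]:
    """Return (extra environment variables, argv) for one Wrapyfi-wrapped instance."""
    env = {"CUDA_VISIBLE_DEVICES": cuda_device, "OMP_NUM_THREADS": "1"}
    if broker_ip is not None:
        # Only needed when the broker runs on a specific IP (multi-machine setup).
        env["WRAPYFI_ZEROMQ_SOCKET_IP"] = broker_ip
    cmd = ["torchrun"]
    if master_port is not None:
        cmd.append(f"--master_port={master_port}")
    cmd += ["--nproc_per_node", "1", "example.py",
            "--ckpt_dir", f"{ckpt_root}/checkpoints/7B",
            "--tokenizer_path", f"{ckpt_root}/checkpoints/tokenizer.model",
            "--wrapyfi_device_idx", str(device_idx)]
    return env, cmd


if __name__ == "__main__":
    root = "/data/llama"  # placeholder for <YOUR CHECKPOINT DIRECTORY>
    # Order matters: device index 1 is started before device index 0.
    launches = [build_instance(root, 1, "0"),
                build_instance(root, 0, "1", master_port=29503)]
    for env, cmd in launches:
        prefix = " ".join(f"{k}={shlex.quote(v)}" for k, v in env.items())
        print(prefix, shlex.join(cmd))
```

Passing `broker_ip="10.0.0.101"` to both calls yields the multi-machine variants.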
from llama.
Trying to run the 7B model in Colab with a 15GB GPU is failing. Is there a way to configure this to use fp16, or is that already baked into the existing model?
*Update: Using batch_size=2 seems to make it work in Colab+ with GPU.
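One possible explanation, offered as an assumption rather than something confirmed in this thread: if the inference code preallocates a key/value cache sized by the maximum batch size and sequence length, lowering the batch setting frees a lot of VRAM next to the ~14GB of fp16 weights. In the sketch below, `kv_cache_gib` is a hypothetical helper; 32 layers and a hidden size of 4096 are the published 7B configuration, while the sequence length of 512 and the baseline batch size of 32 are illustrative assumptions.

```python
def kv_cache_gib(batch_size: int, seq_len: int, n_layers: int = 32,
                 hidden_dim: int = 4096, bytes_per_value: int = 2) -> float:
    """Size in GiB of a preallocated fp16 key/value cache.

    Each layer holds two tensors (keys and values) of shape
    batch_size x seq_len x hidden_dim.
    """
    n_bytes = 2 * n_layers * batch_size * seq_len * hidden_dim * bytes_per_value
    return n_bytes / 2**30


if __name__ == "__main__":
    for batch_size in (32, 2):  # 32 is an assumed baseline, 2 is the value above
        print(f"batch_size={batch_size}: {kv_cache_gib(batch_size, 512):.1f} GiB")
```

Under these assumptions the cache shrinks from 8 GiB to 0.5 GiB, which would be the difference between overflowing and fitting on a 15GB card.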
from llama.
Can I run the model on an Intel Iris Xe graphics card?
If so, I'd appreciate a recommendation on which libraries to use.
from llama.
FlexGen only supports OPT models.
from llama.
With KoboldAI I was able to run GPT-J 6B on my 8GB 3070 Ti by offloading the model to my RAM.
from llama.
See my fork for the code for rolling generation and the Gradio interface.
@bjoernpl Works great, thanks!
Have you tried changing the gradio interface to use the gradio chatbot component?
from llama.
Have you tried changing the gradio interface to use the gradio chatbot component?
I think this doesn't quite fit, since LLaMA is not fine-tuned for chatbot-like behavior. It would probably be possible to use it as a chatbot with some clever prompting, even if it doesn't work too well. Might be worth a try; thanks for the idea and the feedback.
from llama.
According to my napkin math, even the smallest model with 7B parameters will probably take close to 30GB of space. 8GB is unlikely to suffice. But I have no access to the weights yet, it's just my rough guess.
from llama.
Could be possible with https://github.com/FMInference/FlexGen
from llama.
Could be possible with https://github.com/FMInference/FlexGen
This project looks amazing 🤩. However, in its example, it seems like a 6.7B OPT model would still need at least 15GB of GPU memory. So the chances are slim 🥲. I would love to run it on my 3080 10GB.
from llama.
With KoboldAI I was able to run GPT-J 6B on my 8GB 3070 Ti by offloading the model to my RAM.
How fast was it?
from llama.
@fabawi Good work. 👍
from llama.
Thank you! Works great.
from llama.
Closing this issue - great work @fabawi !!
from llama.