Comments (12)
I tried setting max_batch_size to 1 and successfully ran inference on a 24GB GPU (without any other modifications). Don't forget to give only one prompt if you modify max_batch_size (in the example, there are two prompts in the list).
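For reference, here is a minimal sketch of those two changes against the repo's example script (function and parameter names assumed from the stock example.py; adjust to your checkout):

```python
# Sketch: run the example with a batch size of 1 and a single prompt
# (names assumed from the repo's example.py, not a verbatim diff).
generator = load(
    ckpt_dir,
    tokenizer_path,
    local_rank,
    world_size,
    max_seq_len=1024,
    max_batch_size=1,  # default is 32; the KV cache is sized from this
)
prompts = ["I believe the meaning of life is"]  # exactly one prompt
results = generator.generate(prompts, max_gen_len=256, temperature=0.8, top_p=0.95)
```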
Thanks! With this tweak I'm able to run a 7B model on Colab Pro, with a 16GB T4 GPU.
Change the line model = Transformer(model_args) to model = Transformer(model_args).cuda().half() to use FP16.
Checkpoints are indeed fp16; no conversion is needed. Memory usage is large because the cache is pre-allocated for max_batch_size = 32 and max_seq_len = 1024, as noted by @Nardien.
Feel free to change all of these for your use case :)
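To put rough numbers on that, here is a back-of-the-envelope estimate of the pre-allocated KV cache, assuming the commonly cited 7B shapes (32 layers, 32 heads, head dim 128):

```python
# Rough KV-cache size at the default settings (7B shapes assumed;
# fp16 = 2 bytes per element).
layers, heads, head_dim, bytes_per_el = 32, 32, 128, 2
max_batch_size, max_seq_len = 32, 1024
kv = 2  # one key cache and one value cache per layer
cache_bytes = kv * layers * max_batch_size * max_seq_len * heads * head_dim * bytes_per_el
print(f"{cache_bytes / 2**30:.1f} GiB")  # 16.0 GiB at defaults, 0.5 GiB at batch size 1
```

On top of roughly 13-14GB of fp16 weights, that accounts for most of the ~30GiB peak reported in this thread.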
> > Perhaps you can modify this script. The only problem is that ONNX files generally support up to 2GB without external weights. And ONNX Runtime has a memory leak for external weights.
>
> @pauldog Curious to know, is there an issue I can reference for the issue you mention in ORT?

Yes, it has been fixed now, though:
@pauldog The 65B model is 122GB and all models are 220GB in total. Weights are in .pth format.
Same here.
I've tested it, and it has a peak GPU memory usage of about 30GiB for the smallest 7B model.
Hello, I don't have access to the model yet. But I did convert Stable Diffusion to a float16 ONNX file; here is the script I used to change float32 to float16.
Perhaps you can modify this script. The only problem is that ONNX files generally support up to 2GB without external weights. And ONNX Runtime has a memory leak for external weights.
Curious, are the model weights in torch format (*.bin)? What is the size of the weights file?
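The script itself isn't reproduced in this thread; as a hedged sketch, a float32-to-float16 ONNX conversion along those lines can be done with the onnxconverter-common helper:

```python
# Sketch of an fp32 -> fp16 ONNX conversion (file names are placeholders).
import onnx
from onnxconverter_common import float16

model = onnx.load("model_fp32.onnx")
model_fp16 = float16.convert_float_to_float16(model)
# Note: models over 2GB need external data, per the caveat above.
onnx.save(model_fp16, "model_fp16.onnx")
```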
> @pauldog The 65B model is 122GB and all models are 220GB in total. Weights are in .pth format.
Thanks. If the 65B model is only 122GB, it sounds like it is already in float16 format.
7B should be 14GB, but sometimes these models take 2x that in VRAM, so I wouldn't be too surprised if it didn't work on a 24GB GPU. (Although some people got it working, so I don't know.)
What I might do is try to run it on a Shadow PC in the cloud, which has a 32GB GPU at the top tier.
Has anyone tried it with DirectML instead of CUDA? I'd prefer that. There is a DirectML plugin for torch.
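A hypothetical sketch of what that could look like with the torch-directml package (untested with this repo; the example code calls .cuda() directly, and those calls would need replacing too):

```python
# Hypothetical: move the model to a DirectML device instead of CUDA.
# Assumes model_args was built as in the repo's loading code.
import torch_directml
from llama.model import Transformer

dml = torch_directml.device()  # first DirectML-capable GPU
model = Transformer(model_args).half().to(dml)
```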
> Change the line model = Transformer(model_args) to model = Transformer(model_args).cuda().half() to use FP16.

This seems not to work. The 7B model is 13GB on disk, so it seems it is already float16. My test on a 24GB 3090 consumed 14+GB for one prompt.
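If you want to verify the figure yourself, one way is the standard torch.cuda memory stats (generator and prompts as in the example script):

```python
# Measure peak GPU memory for a generation run; figures will vary
# with max_batch_size and max_seq_len.
import torch

torch.cuda.reset_peak_memory_stats()
results = generator.generate(prompts, max_gen_len=256, temperature=0.8, top_p=0.95)
print(f"peak allocated: {torch.cuda.max_memory_allocated() / 2**30:.1f} GiB")
```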