Comments (6)
Well, that answers all my doubts. Thanks a lot, @RahulSChand. I learned some new stuff here and it seems like I need to revisit some of those nuances. But thanks again.
from gpu_poor.
@Anindyadeep Thanks for letting me know. I checked this issue & ran QLoRA training with a 1000-token context length on my 4090 (24 GB). Below is the memory screenshot: it takes ~23 GB, and the website gives you the same value.
[Screenshot: nvidia-smi showing ~23 GB of GPU memory in use]
As for the link you provided, where only 16 GB of memory is used: that run fine-tunes on the Alpaca dataset, which has a context length of around 700 (see https://github.com/gururise/AlpacaDataCleaned#finetune-considerations). For length=700, the website gives a 15 GB memory requirement, the same as the image you posted.
The memory requirement depends on the context length (activation memory), since many intermediate states of shape (length, dim) and (dim, dim, head) are generated in the forward pass and must be kept for the backward pass. These tensors are not updated (they don't have grad), but they are needed to compute the grad of the LoRA params. So your memory requirement can increase a lot with context length.
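The explanation above can be sketched with a rough back-of-the-envelope estimate. This is not the website's actual formula: the per-layer activation count and the LLaMA-7B-ish default dimensions here are assumptions, and real usage also depends on gradient checkpointing and the attention implementation.

```python
def activation_memory_gb(context_len, hidden_dim=4096, num_layers=32,
                         acts_per_layer=10, bytes_per_elem=2):
    """Rough estimate of activation memory cached for the backward pass.

    acts_per_layer is an ASSUMED count of (context_len, hidden_dim)
    intermediate tensors kept per transformer layer; the true number
    depends on the architecture and checkpointing settings.
    """
    elems = context_len * hidden_dim * acts_per_layer * num_layers
    return elems * bytes_per_elem / 1e9

# Activations grow linearly with context length (attention score
# matrices add a further quadratic term on top of this):
print(activation_memory_gb(700))
print(activation_memory_gb(1000))
```

Even with LoRA freezing the base weights, these cached activations are why a longer context can blow past the memory needed for the trainable params themselves.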
Ahh, that makes sense. Actually, I was pretty surprised by the GPU numbers, but the same blog post showed that full fine-tuning on the same dataset takes 240 GB (6 x 40 GB). So it now kinda makes sense that it will take more with a 4096 context length (roughly a quadratic increase).
However, can you please clear up one more doubt: why, right now, is the memory requirement > the memory requirement in LoRA?
@Anindyadeep sorry, I didn't get your last question. What do you mean by "memory requirement is > memory requirement in LoRA"? Do you mean that the website is giving a larger memory requirement for QLoRA than for LoRA? I checked, and this doesn't seem to be the case for your configuration (codellama-7b, 2048 context length).
Let me know if I am misunderstanding your question.
Here:
- For full fine-tuning: [screenshot]
- For LoRA: [screenshot]
- For QLoRA: [screenshot]
@Anindyadeep oh okay, got it. This is because for QLoRA, and any other bitsandbytes (https://github.com/TimDettmers/bitsandbytes) quantization method, there is an overhead during the forward pass (this overhead is usually small when the context length is small). It is also present if you use bitsandbytes llm.int8() quantization.
So even though QLoRA is theoretically smaller than LoRA, the quantization overhead introduced by bitsandbytes can offset this when the context length is large.
Below is an approximate way to calculate this overhead (an empirical formula I arrived at after lots of trial & error with 3b/7b/13b models & bitsandbytes QLoRA runs):

```
QLoRA overhead = (15*hidden_dim + 6*intermediate_dim) * numLayers * contextLen * 0.75 bytes
```
I am also not sure what happens in the high-context-length regime (for large context lengths, say >2048, this approximation may be very wrong and the overhead may not grow linearly with contextLen). This is something I need to check.
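The empirical formula above translates directly into code. A minimal sketch, where the default dimensions (hidden_dim=4096, intermediate_dim=11008, 32 layers, i.e. LLaMA-7B-like values) are assumptions about the target model:

```python
def qlora_overhead_gb(context_len, hidden_dim=4096,
                      intermediate_dim=11008, num_layers=32):
    """Empirical bitsandbytes QLoRA forward-pass overhead, in GB.

    Implements the approximation from the comment above:
    (15*hidden_dim + 6*intermediate_dim) * numLayers * contextLen * 0.75 bytes
    """
    per_token_bytes = (15 * hidden_dim + 6 * intermediate_dim) * 0.75
    return per_token_bytes * num_layers * context_len / 1e9

# The overhead scales linearly with context length, so at long
# contexts it can exceed the memory saved by 4-bit base weights:
print(qlora_overhead_gb(700))
print(qlora_overhead_gb(2048))
```

At a 2048 context this estimate is already several GB, which is consistent with QLoRA overtaking LoRA in total memory at long contexts, as discussed above.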