Comments (6)
Well, that answers all my doubts. Thanks a lot, @RahulSChand. I learned some new stuff here and it seems like I need to revisit some of those nuances. But thanks again.
from gpu_poor.
@Anindyadeep Thanks for letting me know. I checked this issue & ran QLoRA training with a 1000-token context length on my 4090 (24 GB). Below is the memory screenshot: it takes ~23 GB, and the website gives you the same value.
[Screenshot: nvidia-smi showing ~23 GB of GPU memory in use]
As for the link you provided, where only 16 GB of memory is used: that run fine-tunes on the Alpaca dataset, which has a context length of around 700 (see https://github.com/gururise/AlpacaDataCleaned#finetune-considerations). For length=700, the website gives a 15 GB memory requirement, the same as the image you posted.
The memory requirement depends on the context length (activation memory), since many intermediate states of shape (length, dim) and (dim, dim, head) are generated in the forward pass and must be kept for the backward pass. These tensors are not updated (they don't have grad), but they are needed to compute the grad of the LoRA params. So your memory requirement can increase a lot with context length.
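The explanation above can be sketched with a rough back-of-the-envelope estimate. This is not the website's actual formula: the per-layer activation count and the LLaMA-7B-ish default dimensions here are assumptions, and real usage also depends on gradient checkpointing and the attention implementation.

```python
def activation_memory_gb(context_len, hidden_dim=4096, num_layers=32,
                         acts_per_layer=10, bytes_per_elem=2):
    """Rough estimate of activation memory cached for the backward pass.

    acts_per_layer is an ASSUMED count of (context_len, hidden_dim)
    intermediate tensors kept per transformer layer; the true number
    depends on the architecture and checkpointing settings.
    """
    elems = context_len * hidden_dim * acts_per_layer * num_layers
    return elems * bytes_per_elem / 1e9

# Activations grow linearly with context length (attention score
# matrices add a further quadratic term on top of this):
print(activation_memory_gb(700))
print(activation_memory_gb(1000))
```

Even with LoRA freezing the base weights, these cached activations are why a longer context can blow past the memory needed for the trainable params themselves.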
Ahh, that makes sense. Actually, I was pretty surprised by the GPU numbers, but the same blog post showed that full fine-tuning on the same dataset takes 240 GB (6 x 40 GB). So it now kinda makes sense that it will take more with a 4096 context length (roughly a quadratic increase).
However, can you please clear up one more doubt: why, right now, is the memory requirement > the memory requirement in LoRA?
@Anindyadeep sorry, I didn't get your last question. What do you mean by "memory requirement is > memory requirement in LoRA"? Do you mean that the website is giving a larger memory requirement for QLoRA than for LoRA? I checked, and this doesn't seem to be the case for your configuration (codellama-7b, 2048 context length).
Let me know if I am misunderstanding your question.
Here:
- For full fine-tuning: [screenshot]
- For LoRA: [screenshot]
- For QLoRA: [screenshot]
@Anindyadeep oh okay, got it. This is because for QLoRA, and any other bitsandbytes (https://github.com/TimDettmers/bitsandbytes) quantization method, there is an overhead during the forward pass (this overhead is usually small when the context length is small). It is also present if you use bitsandbytes llm.int8() quantization.
So even though QLoRA is theoretically smaller than LoRA, the quantization overhead introduced by bitsandbytes can offset this when the context length is large.
Below is an approximate way to calculate this overhead (an empirical formula I arrived at after lots of trial & error with 3b/7b/13b models & bitsandbytes QLoRA runs):

```
QLoRA overhead = (15*hidden_dim + 6*intermediate_dim) * numLayers * contextLen * 0.75 bytes
```
I am also not sure what happens in the high-context-length regime (for large context lengths, say >2048, this approximation may be very wrong and the overhead may not grow linearly with contextLen). This is something I need to check.
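The empirical formula above translates directly into code. A minimal sketch, where the default dimensions (hidden_dim=4096, intermediate_dim=11008, 32 layers, i.e. LLaMA-7B-like values) are assumptions about the target model:

```python
def qlora_overhead_gb(context_len, hidden_dim=4096,
                      intermediate_dim=11008, num_layers=32):
    """Empirical bitsandbytes QLoRA forward-pass overhead, in GB.

    Implements the approximation from the comment above:
    (15*hidden_dim + 6*intermediate_dim) * numLayers * contextLen * 0.75 bytes
    """
    per_token_bytes = (15 * hidden_dim + 6 * intermediate_dim) * 0.75
    return per_token_bytes * num_layers * context_len / 1e9

# The overhead scales linearly with context length, so at long
# contexts it can exceed the memory saved by 4-bit base weights:
print(qlora_overhead_gb(700))
print(qlora_overhead_gb(2048))
```

At a 2048 context this estimate is already several GB, which is consistent with QLoRA overtaking LoRA in total memory at long contexts, as discussed above.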