
gpu_poor

Can my GPU run this LLM? & at what token/s?


Calculates how much GPU memory you need and how many tokens/s you can get for any LLM & GPU/CPU.

Also gives a breakdown of where the memory goes for training/inference, with quantization (GGML/bitsandbytes/QLoRA) & inference frameworks (vLLM/llama.cpp/HF) supported.

Link: https://rahulschand.github.io/gpu_poor/

Demo

(demo GIF)


Use cases/Features

1. Calculate vRAM requirement 💾

(screenshot)

2. Calculate ~tokens/s you can get ⏱️

(screenshot)

3. Approximate time for finetuning (ms per iteration) ⌛️

(screenshot)

For memory, the output is the total vRAM & its breakdown. It looks like this:

{
  "Total": 4000,
  "KV Cache": 1000,
  "Model Size": 2000,
  "Activation Memory": 500,
  "Grad & Optimizer memory": 0,
  "cuda + other overhead":  500
}

For token/s, the additional info looks like this:

{
  "Token per second": 50,
  "ms per token": 20,
  "Prompt process time (s)": 5,
  "memory or compute bound?": "Memory"
}
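As a sanity check, "Token per second" and "ms per token" are just reciprocals of each other. A minimal sketch:

```python
def ms_per_token(tokens_per_second: float) -> float:
    # 1000 ms in a second, spread over tokens_per_second tokens
    return 1000.0 / tokens_per_second

print(ms_per_token(50))  # 20.0, matching the sample output above
```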

For training, the output is the time for each iteration (forward + backward, in ms):

{
  "ms per iteration (forward + backward)": 100,
  "memory or compute bound?": "Memory"
}
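The per-iteration number can be turned into a rough total finetuning time. A sketch, where `dataset_size`, `batch_size` & `epochs` are illustrative values, not outputs of the tool:

```python
def finetune_hours(ms_per_iter: float, dataset_size: int,
                   batch_size: int, epochs: int) -> float:
    # iterations = (samples / batch) * epochs; each takes ms_per_iter
    iterations = (dataset_size / batch_size) * epochs
    return iterations * ms_per_iter / 1000 / 3600

# e.g. 100 ms/iter, 50k samples, batch size 4, 3 epochs -> ~1.04 hours
print(f"{finetune_hours(100, 50_000, 4, 3):.2f}")
```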

Purpose

I made this to check whether you can run a particular LLM on your GPU. It is useful for figuring out the following:

  1. How many tokens/s can I get?
  2. How much total time will finetuning take?
  3. Which quantization will fit on my GPU?
  4. What max context length & batch size can my GPU handle?
  5. Which finetuning method? Full? LoRA? QLoRA?
  6. What is consuming my GPU memory? What can I change to fit the LLM on my GPU?

Additional info + FAQ

Can't we just look at the model size & figure this out?

Finding out which LLMs your GPU can handle isn't as easy as looking at the model size, because during inference the KV cache takes a substantial amount of extra memory. For example, with sequence length 1000 on llama-2-7b it takes 1 GB of extra memory (using Hugging Face's LlamaForCausalLM; with exLlama & vLLM this is 500 MB). And during training, activations & quantization overhead take a lot of memory. For example, llama-7b with bnb int8 quant is only ~7.5 GB in size, but it isn't possible to finetune it using LoRA on data with a 1000-token context length even on an RTX 4090 with 24 GB. That means an additional 16 GB of memory goes into quantization overheads, activations & gradient memory.
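The KV-cache figures above can be reproduced from the formula given later in this README (2 × sequence length × hidden size per layer). A back-of-the-envelope sketch, assuming llama-2-7b's public dimensions (hidden_size=4096, 32 layers, fp16 = 2 bytes per element):

```python
def kv_cache_bytes(seq_len: int, hidden_size: int, n_layers: int,
                   bytes_per_elem: int = 2, hf_factor: int = 1) -> int:
    # one K and one V vector of hidden_size elements per token, per layer;
    # hf_factor=2 models the extra copy kept by HF's LlamaForCausalLM
    return 2 * seq_len * hidden_size * n_layers * bytes_per_elem * hf_factor

print(kv_cache_bytes(1000, 4096, 32) // 2**20)               # ~500 MiB (exLlama/vLLM)
print(kv_cache_bytes(1000, 4096, 32, hf_factor=2) // 2**20)  # ~1000 MiB (HF)
```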

How reliable are the numbers?

The results can vary depending on your model, input data, CUDA version & quantization, and it is impossible to predict exact values. I have tried to take these into account & make sure the results are within 500 MB. In the table below I cross-check the 3b, 7b & 13b model memories given by the website against what I get on my RTX 4090 & 2060 GPUs. All values are within 500 MB.

(screenshot of comparison table)

How are the values calculated?

Total memory = model size + kv-cache + activation memory + optimizer/grad memory + cuda etc. overhead

  1. Model size = the size of your .bin file (divide it by 2 for Q8 quant & by 4 for Q4 quant).
  2. KV-Cache = memory taken by the KV (key-value) vectors. Size = (2 x sequence length x hidden size) per layer. For Hugging Face this is (2 x 2 x sequence length x hidden size) per layer. In training the whole sequence is processed at once, so KV cache memory = 0.
  3. Activation memory = in the forward pass, every operation's output has to be stored for .backward(). For example, if you compute output = Q * input where Q = (dim, dim) and input = (batch, seq, dim), then an output of shape (batch, seq, dim) needs to be stored (in fp16). This consumes the most memory in LoRA/QLoRA. In LLMs there are many such intermediate steps (after Q, K, V; after attention; after norm; after FFN1, FFN2, FFN3; after the skip connection; ...). Around 15 intermediate representations are saved per layer.
  4. Optimizer/grad memory = memory taken by the .grad tensors & the tensors associated with the optimizer (running averages etc.).
  5. CUDA etc. overhead = around 500 MB-1 GB of memory is taken whenever CUDA is loaded. There are also additional overheads when you use any quantization (like bitsandbytes). There is no straightforward formula here (I assume a 650 MB CUDA overhead in my calculations).
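Putting the pieces together for the inference case, a minimal sketch (llama-7b-like dimensions assumed; activation memory is comparatively small at inference and omitted here, as are quantization overheads):

```python
def inference_mem_mb(n_params_billion: float, seq_len: int, hidden_size: int,
                     n_layers: int, bytes_per_param: int = 2,
                     cuda_overhead_mb: int = 650) -> float:
    # model weights: params * bytes-per-param (2 for fp16)
    model_mb = n_params_billion * 1e9 * bytes_per_param / 2**20
    # KV cache: 2 vectors (K, V) of hidden_size fp16 elements per token, per layer
    kv_mb = 2 * seq_len * hidden_size * n_layers * 2 / 2**20
    return model_mb + kv_mb + cuda_overhead_mb

# llama-7b in fp16 with a 1000-token sequence -> roughly 14.5 GB
print(f"{inference_mem_mb(7, 1000, 4096, 32):.0f} MB")
```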

Why are the results wrong?

Sometimes the answers might be very wrong, in which case please open an issue here & I will try to fix it.


TODO

  1. Add support for vLLM for token/s
  2. Add QLoRA
  3. Add a way to measure the approximate tokens/s you can get for a particular GPU
  4. Improve logic to get hyper-params from size (since hidden layer/intermediate size/number of layers can vary for a particular size) ✅
  5. Add AWQ

gpu_poor's People

Contributors

RahulSChand


gpu_poor's Issues

Test results are different

Thanks for your contribution.
When I use this project with https://github.com/hiyouga/LLaMA-Factory for fine-tuning:
I used the Baichuan-13B model for SFT with max_token=800; the actual memory used is 28 GB (A40),
BUT your project estimates 44241 MB (43 GB).
What causes such a difference?
Looking forward to your reply.

MY TRAIN SCRIPT:
CUDA_VISIBLE_DEVICES=0 python ../src/train_bash.py \
    --stage sft \
    --template baichuan \
    --model_name_or_path /container/LLM/Baichuan-13B-Chat \
    --do_train \
    --dataset test_data \
    --dataset_dir ../data \
    --val_size 0.1 \
    --finetuning_type lora \
    --lora_target W_pack \
    --output_dir output_Bc \
    --overwrite_cache \
    --preprocessing_num_workers 1 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 1 \
    --lr_scheduler_type cosine \
    --cutoff_len 800 \
    --max_new_tokens 1400 \
    --logging_steps 10 \
    --save_steps 6 \
    --eval_steps 6 \
    --max_grad_norm 0.5 \
    --learning_rate 5e-5 \
    --num_train_epochs 3.0 \
    --evaluation_strategy steps \
    --load_best_model_at_end \
    --plot_loss \
    --fp16 \
    --overwrite_output_dir \
    --seed 3407

(screenshot)

The memory usage in LoRA finetuning

As far as I know, we can set the LoRA rank and target modules to change the number of trainable parameters, which, I think, can cause different memory usage. But I didn't find any relevant setting in your project. How do you estimate memory usage without this information?

DeepSpeed support

If I'm using DeepSpeed + Hugging Face (including ZeRO-1, 2, 3), is there any difference in memory usage compared with just using 🤗? If there is a difference, will it be supported?

API to use this repo

Hi, great work! I would like to use it in a terminal environment so I am wondering if you can release the API or add a terminal interaction function. Thanks!

What's the meaning of magic numbers?

Hi, thanks for your great work on calculating tokens/s. I read your App.js code and found some magic numbers. Could you please add comments for them? Some are listed below; it would be awesome if you could add comments for all of them.

  1.     let finalPromptTime =
         theoryTimePrompt_in_ms * getFloatRatio_F16(quantType) * 1.8 +  // What's the meaning of "1.8"?
         convertByteToMB(2 * memoryTransfer) * (0.008 / 100);           // What's the meaning of "0.008 / 100"?

  2. What's the meaning of extraFactor 2.0, 1.5, 1.0 ...?

Results are inconsistent and not reliable enough

Hey @RahulSChand, awesome work on creating this calculator. But I am facing some problems and getting unreliable results. Here are some of the issues:

The configurations I will be using are as follows:

Model: CodeLlama 
Param size: 7B
batch size: 1
context length: 2048
  1. QLoRA's GPU memory is showing more than LoRA

In LoRA it is showing: 177 GB and for QLoRA it is showing: 180 GB and full fine-tuning it is showing: 216 GB

  2. When I upload the config.json file vs. just the parameter number, it shows inconsistent results.

  3. The memory requirement should not be this high. For example, with a batch size of just 1 and a context length of 2048, it shows triple digits for LoRA and QLoRA. Now consider this graph. Reference

(screenshot)

According to this graph, the memory requirement for LoRA is 16 GB, but the calculator shows 177 GB.

So, can you please address these doubts? If there is any way to fix this, it would be awesome.

Missing License

I like this, great work.
I saw on your page that you mention the code is open source, but I could not find a license (such as MIT or BSD-3, etc.). Would it be OK if you added a license file so the terms are clear?

What is [Prompt len] and [Tokens to Generate]?

Sorry, I am not quite familiar with inference: in fine-tuning/training, I simply use the concept of max_seq_length. Are [Prompt len] and [Tokens to Generate] the same as max_seq_length? How could they be different?

Activation Memory

If I'm just using it for inference, do I not need to save the intermediate activation values, for example in vLLM?

How should I understand this activation value?

compute in gpu_configs.json meaning

Hi! I want to add some GPU specs to gpu_configs.json. What is the meaning of compute in that file? Is it the TFLOPS under a certain precision?
