Comments (12)
I tried setting max_batch_size to 1 and successfully ran inference on a 24GB GPU (without any other modifications). Don't forget to give only one prompt if you modify max_batch_size (in the example, there are two prompts in the list).
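For reference, here is a minimal sketch of those two changes against the repo's example script (function and parameter names assumed from the stock example.py; adjust to your checkout):

```python
# Sketch: run the example with a batch size of 1 and a single prompt
# (names assumed from the repo's example.py, not a verbatim diff).
generator = load(
    ckpt_dir,
    tokenizer_path,
    local_rank,
    world_size,
    max_seq_len=1024,
    max_batch_size=1,  # default is 32; the KV cache is sized from this
)
prompts = ["I believe the meaning of life is"]  # exactly one prompt
results = generator.generate(prompts, max_gen_len=256, temperature=0.8, top_p=0.95)
```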
Thanks! With this tweak I'm able to run a 7B model on Colab Pro, with a 16GB T4 GPU.
Change the line model = Transformer(model_args) to model = Transformer(model_args).cuda().half() to use FP16.
Checkpoints are indeed fp16; no conversion is needed. Memory usage is large because the cache is pre-allocated for max_batch_size = 32 and max_seq_len = 1024, as noted by @Nardien.
Feel free to change all of these for your use case :)
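To put rough numbers on that, here is a back-of-the-envelope estimate of the pre-allocated KV cache, assuming the commonly cited 7B shapes (32 layers, 32 heads, head dim 128):

```python
# Rough KV-cache size at the default settings (7B shapes assumed;
# fp16 = 2 bytes per element).
layers, heads, head_dim, bytes_per_el = 32, 32, 128, 2
max_batch_size, max_seq_len = 32, 1024
kv = 2  # one key cache and one value cache per layer
cache_bytes = kv * layers * max_batch_size * max_seq_len * heads * head_dim * bytes_per_el
print(f"{cache_bytes / 2**30:.1f} GiB")  # 16.0 GiB at defaults, 0.5 GiB at batch size 1
```

On top of roughly 13-14GB of fp16 weights, that accounts for most of the ~30GiB peak reported in this thread.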
> > Perhaps you can modify this script. The only problem is that ONNX files generally support up to 2GB without external weights. And ONNX Runtime has a memory leak for external weights.
>
> @pauldog Curious to know, is there an issue I can reference for the issue you mention in ORT?

Yes, it has been fixed now, though:
@pauldog The 65B model is 122GB and all models are 220GB in total. Weights are in .pth format.
Same here.
I've tested it, and it has a peak GPU memory usage of about 30GiB for the smallest 7B model.
Hello, I don't have access to the model yet. But I did convert Stable Diffusion to a float16 ONNX file; here is the script I used to change float32 to float16.
Perhaps you can modify this script. The only problem is that ONNX files generally support up to 2GB without external weights. And ONNX Runtime has a memory leak for external weights.
Curious, are the model weights in torch format (*.bin)? What is the size of the weights file?
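The script itself isn't reproduced in this thread; as a hedged sketch, a float32-to-float16 ONNX conversion along those lines can be done with the onnxconverter-common helper:

```python
# Sketch of an fp32 -> fp16 ONNX conversion (file names are placeholders).
import onnx
from onnxconverter_common import float16

model = onnx.load("model_fp32.onnx")
model_fp16 = float16.convert_float_to_float16(model)
# Note: models over 2GB need external data, per the caveat above.
onnx.save(model_fp16, "model_fp16.onnx")
```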
> @pauldog The 65B model is 122GB and all models are 220GB in total. Weights are in .pth format.
Thanks. If the 65B model is only 122GB, it sounds like it is already in float16 format.
7B should be 14GB, but sometimes these models take 2x that in VRAM, so I wouldn't be too surprised if it didn't work on a 24GB GPU. (Although some people got it working, so I don't know.)
What I might do is try to run it on a Shadow PC in the cloud, which has a 32GB GPU at the top tier.
Has anyone tried it with DirectML instead of CUDA? I'd prefer that. There is a DirectML plugin for torch.
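A hypothetical sketch of what that could look like with the torch-directml package (untested with this repo; the example code calls .cuda() directly, and those calls would need replacing too):

```python
# Hypothetical: move the model to a DirectML device instead of CUDA.
# Assumes model_args was built as in the repo's loading code.
import torch_directml
from llama.model import Transformer

dml = torch_directml.device()  # first DirectML-capable GPU
model = Transformer(model_args).half().to(dml)
```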
> Change the line model = Transformer(model_args) to model = Transformer(model_args).cuda().half() to use FP16.

This seems not to work. The 7B model is 13GB on disk, so it seems it is already float16. My test on a 24GB 3090 consumed 14+GB for one prompt.
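If you want to verify the figure yourself, one way is the standard torch.cuda memory stats (generator and prompts as in the example script):

```python
# Measure peak GPU memory for a generation run; figures will vary
# with max_batch_size and max_seq_len.
import torch

torch.cuda.reset_peak_memory_stats()
results = generator.generate(prompts, max_gen_len=256, temperature=0.8, top_p=0.95)
print(f"peak allocated: {torch.cuda.max_memory_allocated() / 2**30:.1f} GiB")
```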