Comments (12)

Nardien commented on June 19, 2024

I tried setting max_batch_size to 1 and succeeded in running inference on a 24GB GPU (without any other modifications).

https://github.com/facebookresearch/llama/blob/1076b9c51c77ad06e9d7ba8a4c6df775741732bd/example.py#L44

Don't forget to pass only one prompt (the example has two prompts in the list) if you lower max_batch_size.
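
As a sketch, the change in example.py looks roughly like this (signatures vary between commits; max_batch_size = 1 and the single-prompt list are the modifications, not the repo defaults):

```python
# example.py (sketch): lower max_batch_size so the pre-allocated
# cache fits in GPU memory. Only max_batch_size is changed here.
def main(
    ckpt_dir: str,
    tokenizer_path: str,
    temperature: float = 0.8,
    top_p: float = 0.95,
    max_seq_len: int = 512,
    max_batch_size: int = 1,  # was 32; 1 is enough for single-prompt inference
):
    local_rank, world_size = setup_model_parallel()
    generator = load(ckpt_dir, tokenizer_path, local_rank, world_size,
                     max_seq_len, max_batch_size)

    # With max_batch_size = 1, pass exactly one prompt:
    prompts = ["I believe the meaning of life is"]
    results = generator.generate(prompts, max_gen_len=256,
                                 temperature=temperature, top_p=top_p)
```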

garyfanhku commented on June 19, 2024

> I tried setting max_batch_size to 1 and succeeded in running inference on a 24GB GPU (without any other modifications).
>
> https://github.com/facebookresearch/llama/blob/1076b9c51c77ad06e9d7ba8a4c6df775741732bd/example.py#L44
>
> Don't forget to pass only one prompt (the example has two prompts in the list) if you lower max_batch_size.

Thanks! With this tweak I'm able to run the 7B model on Colab Pro with a 16GB T4 GPU.

JOHW85 commented on June 19, 2024

Change the line `model = Transformer(model_args)` to `model = Transformer(model_args).cuda().half()` to use FP16.
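
In context, inside load() in example.py, the change would look something like this (a sketch; revisions of the repo that already build the model under torch.cuda.HalfTensor as the default tensor type make the extra cast redundant):

```python
# load() in example.py (sketch): build the model, then move it to
# the GPU and cast its weights to FP16 before loading the checkpoint.
model = Transformer(model_args).cuda().half()
model.load_state_dict(checkpoint, strict=False)
generator = LLaMA(model, tokenizer)
```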

timlacroix commented on June 19, 2024

Checkpoints are indeed fp16, so no conversion is needed.
Memory usage is large because the cache is pre-allocated for max_batch_size = 32 and max_seq_len = 1024, as noted by @Nardien.

Feel free to change all of these for your use case :)
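
As a rough sketch of why those two knobs dominate (assuming the published 7B shape of 32 layers with hidden size 4096, and fp16 K/V entries):

```python
# Back-of-the-envelope size of the pre-allocated KV cache for 7B.
n_layers, dim = 32, 4096            # published LLaMA-7B shape (assumption)
max_batch_size, max_seq_len = 32, 1024
fp16_bytes = 2

kv = 2 * n_layers * max_batch_size * max_seq_len * dim * fp16_bytes  # 2 = K and V
print(kv / 2**30)  # 16.0 GiB of cache on top of ~13GB of weights

# Dropping max_batch_size to 1 shrinks the cache to ~0.5 GiB,
# which is why the 7B model then fits on a 24GB (or even 16GB) card.
```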

pauldog commented on June 19, 2024

> Perhaps you can modify this script. The only problem is that ONNX files generally support up to 2GB without external weights, and ONNX Runtime has a memory leak with external weights.

> @pauldog Curious to know, is there an issue I can reference for the issue you mention in ORT?

Yes, it has been fixed now, though:

microsoft/onnxruntime#14466

jacklxc commented on June 19, 2024

@pauldog The 65B model is 122GB and all models are 220GB in total. Weights are in .pth format.

w32zhong commented on June 19, 2024

Same here.

I've tested it, and it has a peak GPU memory usage of about 30GiB for the smallest 7B model.
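
For anyone reproducing that number, PyTorch's allocator counters are one way to measure it (a sketch; wrap this around the generate call in example.py):

```python
import torch

torch.cuda.reset_peak_memory_stats()
results = generator.generate(prompts, max_gen_len=256,
                             temperature=0.8, top_p=0.95)
peak_gib = torch.cuda.max_memory_allocated() / 2**30
print(f"peak GPU memory: {peak_gib:.1f} GiB")
```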

pauldog commented on June 19, 2024

Hello, I don't have access to the model yet, but I did convert Stable Diffusion to a float16 ONNX file.

Here is the script I used, changing float32 to float16.

Perhaps you can modify this script. The only problem is that ONNX files generally support up to 2GB without external weights, and ONNX Runtime has a memory leak with external weights.

Curious: are the model weights in torch format (*.bin)? What is the size of the weights file?
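
For reference, a minimal sketch of that kind of conversion with the onnx and onnxconverter-common packages (file names here are placeholders, and convert_float_to_float16 may need keep_io_types=True depending on the graph):

```python
import onnx
from onnxconverter_common import float16

model = onnx.load("model_fp32.onnx")                  # placeholder path
model_fp16 = float16.convert_float_to_float16(model)  # cast weights/ops to fp16

# Graphs over the 2GB protobuf limit must spill weights to external data:
onnx.save_model(
    model_fp16,
    "model_fp16.onnx",                # placeholder path
    save_as_external_data=True,
    all_tensors_to_one_file=True,
    location="model_fp16.onnx.data",  # weights written beside the graph
)
```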

pauldog commented on June 19, 2024

> @pauldog The 65B model is 122GB and all models are 220GB in total. Weights are in .pth format.

Thanks. If the 65B model is only 122GB, it sounds like it is already in float16 format.

7B should be 14GB, but sometimes these models take 2x that in VRAM, so I wouldn't be too surprised if it didn't work on a 24GB GPU. (Although some people got it working, so IDK.)
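
The reported sizes do line up with fp16 once you mind GB vs GiB (a quick check, assuming the published parameter counts of roughly 6.7B and 65.2B):

```python
fp16_bytes = 2
print(6.7e9 * fp16_bytes / 2**30)   # ~12.5 GiB -> the ~13GB 7B checkpoint
print(65.2e9 * fp16_bytes / 2**30)  # ~121.4 GiB -> the reported 122GB 65B
```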

What I might do is try running it on a Shadow PC in the cloud, which offers a 32GB GPU at the top tier.

pauldog commented on June 19, 2024

Has anyone tried it with DirectML instead of CUDA? I'd prefer that. There is a DirectML plugin for torch.
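
A sketch of what that might look like with the torch-directml package (an untested assumption here; every .cuda() call in the pipeline would also need to become .to(dml)):

```python
import torch
import torch_directml

dml = torch_directml.device()  # DirectML device, works on non-NVIDIA GPUs

# Instead of Transformer(model_args).cuda().half():
model = Transformer(model_args).to(dml).half()
tokens = tokens.to(dml)        # inputs must live on the same device
```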

lixinghe1999 commented on June 19, 2024

> Change the line `model = Transformer(model_args)` to `model = Transformer(model_args).cuda().half()` to use FP16.

This doesn't seem to work.
The 7B model takes 13GB of storage, so it seems it is already float16.
My test on a 24GB 3090 consumed 14+GB for one prompt.

fxmarty commented on June 19, 2024

> Perhaps you can modify this script. The only problem is that ONNX files generally support up to 2GB without external weights, and ONNX Runtime has a memory leak with external weights.

@pauldog Curious to know, is there an issue I can reference for the issue you mention in ORT?
