
Comments (24)

differentprogramming commented on September 18, 2024

I deleted a reply because I hadn't noticed that dspasyuk had mentioned lots of GPUs other than his 1060.

shibe2 commented on September 18, 2024

I'm not familiar with that code, but I think that this is relevant:

// check if a backend with higher prio wants to offload the op
if (src_backend_id == sched->n_backends - 1) {
    for (int b = 0; b < src_backend_id; b++) {
        if (ggml_backend_supports_op(sched->backends[b], tensor) && ggml_backend_offload_op(sched->backends[b], tensor)) {
            SET_CAUSE(tensor, "1.off");
            return b;
        }
    }
}

Operations in non-offloaded layers are normally performed by the CPU back-end. But if ggml_backend_offload_op of another back-end returns true, the operation will be performed by that back-end instead. GPU back-ends check the batch dimension of the result tensor against a threshold (32) and return true for most operations. For example, this is the relevant code in the Vulkan back-end:
const int min_batch_size = 32;
return (op->ne[1] >= min_batch_size && op->op != GGML_OP_GET_ROWS) ||
       (op->ne[2] >= min_batch_size && op->op == GGML_OP_MUL_MAT_ID);
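
Roughly speaking (my reading of the code, not a guarantee): during prompt processing a batch of N tokens produces mat-mul results with ne[1] = N, so a 512-token prompt clears the 32-token threshold and those ops are offloaded, while single-token generation has ne[1] = 1 and stays on the CPU back-end.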

SteelPh0enix commented on September 18, 2024

isn't this just intended behavior? --gpu-layers is just 0 by default.
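
For reference, offload is opt-in via that flag; something along these lines should do it (the binary name depends on your build/version, the model path is a placeholder, and -ngl is the short form of --gpu-layers):

llama-cli -m models/llama3-8b-instruct-q4.gguf -p "Hello" -ngl 33

Passing a large value such as -ngl 99 attempts to offload all layers.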

shibe2 commented on September 18, 2024

Not all the work is done on the CPU. If you give it a long prompt, it should use the GPU.

differentprogramming commented on September 18, 2024

None of the work is being done by the GPU; it runs exactly as fast as the CPU build, and GPU use stays at 0%.

The instructions should warn you that you have to tell it to put layers onto the GPU in order to use the GPU.

differentprogramming commented on September 18, 2024

Or just maybe, if you compile it to use a GPU, the default should be to use it.

differentprogramming commented on September 18, 2024

Note that there are other issues where people complained about these speeds and never figured out why.
I think they eventually got closed because people stopped talking.

SteelPh0enix commented on September 18, 2024

I don't think enabling GPU offload by default is a good idea. Partial offload can hurt performance too (we should not assume that the user will automatically be able to offload any model to the GPU in full), and it can lead to more unexpected/unstable behavior, so leaving CPU inference as the default is a reasonable choice.

One could say that discovering that GPU offload is "off" by default is a "rite of passage" for a new llama.cpp user. While I agree that there should be a clear notice that GPU offloading is off by default (I haven't seen one yet; I'm talking about something like an h1 banner in the README), I think that reading the docs is probably a better solution than turning this on by default.

shibe2 commented on September 18, 2024

None of the work is being done by the GPU; it runs exactly as fast as the CPU build, and GPU use stays at 0%.

How long is your prompt?

differentprogramming commented on September 18, 2024

Another question: isn't 100% GPU usage for a 5x uplift vs. a 6-core Skylake CPU a bit low?
I can't really know, because there's no matrix of uplift across common GPUs, different ways of using them (BLAS, CUDA, Vulkan, etc.), and different models and formats.
While I realize that would be a combinatorial mess, there should be an entry with some example tokens per second for an example model, a few variations on it, and common GPUs, and some idea of whether CUDA is, for instance, the fastest and how BLAS, rocBLAS, or Vulkan compare.

differentprogramming commented on September 18, 2024

For instance, I have no idea whether I would get any uplift from spending a few weeks HIPifying the CUDA build!

differentprogramming commented on September 18, 2024

None of the work is being done by the GPU; it runs exactly as fast as the CPU build, and GPU use stays at 0%.

How long is your prompt?

I wasn't testing with interactive mode, so it was using built-in prompts at random.
I notice that changing the length of the result doesn't change the reported per-token speed.

dspasyuk commented on September 18, 2024

@differentprogramming It is the intended behavior; you can always fall back to the CPU when everything else fails. As for the difference between CUDA and cuBLAS, I do not see much difference on GTX 1060/P4/A100/A4500 GPUs: about 20 tokens per second on the GTX 1060 using Llama-instruct-Q4. Here is my default setup: https://github.com/dspasyuk/llama.cui

dspasyuk commented on September 18, 2024

@differentprogramming No worries. The GTX 1060 is about 10 times slower than the A4500 (~200 tokens/s), and the A100 (PCIe card) does about 180 tokens/s using Llama3-instruct-q4 and yesterday's version of llama.cpp with llama.cui.

shibe2 commented on September 18, 2024

So I benchmarked prompt processing with -ngl 0 with 3 back-ends: CPU, Vulkan and hipBLAS.

| model | size | params | backend | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| llama 8B Q6_K | 6.14 GiB | 8.03 B | CPU | 8 | pp128 | 33.53 ± 0.12 |
| llama 8B Q6_K | 6.14 GiB | 8.03 B | CPU | 8 | pp256 | 32.75 ± 0.07 |
| llama 8B Q6_K | 6.14 GiB | 8.03 B | CPU | 8 | pp512 | 31.75 ± 0.02 |
| llama 8B Q6_K | 6.14 GiB | 8.03 B | CPU | 8 | pp1024 | 31.06 ± 0.01 |
| llama 8B Q6_K | 6.14 GiB | 8.03 B | CPU | 8 | pp2048 | 29.62 ± 0.00 |

| model | size | params | backend | ngl | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| llama 8B Q6_K | 6.14 GiB | 8.03 B | Vulkan | 0 | pp128 | 113.84 ± 1.78 |
| llama 8B Q6_K | 6.14 GiB | 8.03 B | Vulkan | 0 | pp256 | 200.53 ± 0.82 |
| llama 8B Q6_K | 6.14 GiB | 8.03 B | Vulkan | 0 | pp512 | 273.84 ± 0.76 |
| llama 8B Q6_K | 6.14 GiB | 8.03 B | Vulkan | 0 | pp1024 | 272.64 ± 0.82 |
| llama 8B Q6_K | 6.14 GiB | 8.03 B | Vulkan | 0 | pp2048 | 270.03 ± 0.55 |

| model | size | params | backend | ngl | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| llama 8B Q6_K | 6.14 GiB | 8.03 B | CUDA | 0 | pp128 | 308.72 ± 0.05 |
| llama 8B Q6_K | 6.14 GiB | 8.03 B | CUDA | 0 | pp256 | 436.04 ± 0.05 |
| llama 8B Q6_K | 6.14 GiB | 8.03 B | CUDA | 0 | pp512 | 539.76 ± 0.16 |
| llama 8B Q6_K | 6.14 GiB | 8.03 B | CUDA | 0 | pp1024 | 527.57 ± 0.29 |
| llama 8B Q6_K | 6.14 GiB | 8.03 B | CUDA | 0 | pp2048 | 497.41 ± 0.93 |

build: a27aa50 (3258)

The last table says "CUDA", but it actually used an AMD GPU, and CUDA was not installed on the test system.

As we can see, GPU offloading provides a significant speedup in prompt processing. I use some models with -ngl 0, and there the GPU back-ends provide a speedup in real use cases too.

The title of this issue claims that all work is done on the CPU by default, which is not the case in my experience and tests.

What is true is that processing of small prompts, and generation, does not use the GPU with -ngl 0. This is intentional: when model parameters are not stored in VRAM, they need to be repeatedly transferred from system RAM for each token or small batch, which takes time and negates the speedup provided by the GPU.
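
As a rough back-of-the-envelope illustration (the bandwidth figure is an assumption, not a measurement): streaming the 6.14 GiB of Q6_K weights over a ~16 GB/s PCIe link takes around 0.4 s, so if that transfer were repeated for every generated token, the transfer alone would cap generation at roughly 2–3 t/s, well below the ~30 t/s the CPU manages above. The GPU only wins when a large batch of prompt tokens shares a single pass over the weights.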

The remaining question is whether ngl should be 0 by default. Ideally, the value of ngl could be selected automatically for best performance; some work in this direction is done in #6502. Unfortunately, this is difficult to figure out in most cases. For example, the hipBLAS back-end allocates additional VRAM during inference, and it can run out of it even if the parameters and initially allocated buffers seem to fit in VRAM. Also, a smaller value of ngl was sometimes faster in my tests.

Since I don't have a good solution, here are some options for the default ngl value:

  • 0 – this is what we have currently. You have to find the values that work best on your system for the models that you use (for example with a llama-bench sweep, as sketched below).
  • All layers. We may get a stream of reports about it crashing when it runs out of VRAM.
  • Some value in between – but what should it be?
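
For example, a sweep like the following can help find a good value on a given system (a sketch: the model path is a placeholder, and support for comma-separated values may vary between llama-bench versions):

llama-bench -m models/llama3-8b-instruct/ggml-model-q6_k.gguf -p 512 -n 128 -ngl 0,8,16,24,32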

I'm closing this issue, because everything seems to work as expected. If you want to discuss automatic ngl, there is issue #3719 about that.

differentprogramming commented on September 18, 2024

@shibe2 is there a simple command line to run the same test on my system?

shibe2 commented on September 18, 2024

For example:

llama-bench -m models/llama3-8b-instruct/ggml-model-q8_0.gguf -p 128,256,512,1024,2048 -n 0 -ngl 0

For testing different back-ends I used different builds of llama-bench.
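
If it helps, separate builds can be produced roughly like this (a sketch: the CMake option names are assumptions that have changed between llama.cpp versions, so check the build docs for your checkout):

cmake -B build-cpu && cmake --build build-cpu --target llama-bench
cmake -B build-vulkan -DGGML_VULKAN=ON && cmake --build build-vulkan --target llama-bench
cmake -B build-hip -DGGML_HIPBLAS=ON && cmake --build build-hip --target llama-bench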

differentprogramming commented on September 18, 2024

shibe2 commented on September 18, 2024

@differentprogramming I benchmarked specifically processing of long prompts. If you don't give it a prompt of the same length as used in the benchmark, you will get different speeds, of course.

Msyu1020 commented on September 18, 2024

So I benchmarked prompt processing with -ngl 0 with 3 back-ends: CPU, Vulkan and hipBLAS. […]

I'm sorry to bother you. I'm trying to figure out whether the GPU would be used when ngl is 0. As the benchmark results show, does it mean that even if no layers are offloaded to the GPU, some layers or ops are still executed on the GPU?

shibe2 commented on September 18, 2024

Larger batches are processed on the GPU; I think 32 tokens and up. The model's parameters involved in each operation are transferred to VRAM, then discarded after the operation is completed.

Msyu1020 commented on September 18, 2024

Larger batches are processed on the GPU; I think 32 tokens and up. The model's parameters involved in each operation are transferred to VRAM, then discarded after the operation is completed.

Thanks a lot. Could you point to the code in llama.cpp that determines whether the batch is large enough to be processed on the GPU?

Msyu1020 commented on September 18, 2024

I'm not familiar with that code, but I think that this is relevant: […]

This is exactly what I want. Sincerely thank you

shibe2 commented on September 18, 2024

@Msyu1020 This is completely off-topic, but it would be nicer if you didn't quote long comments in full.
