
Comments (24)

differentprogramming commented on September 18, 2024

I deleted a reply because I hadn't noticed that dspasyuk had mentioned lots of GPUs other than his 1060.

shibe2 commented on September 18, 2024

I'm not familiar with that code, but I think that this is relevant:

// check if a backend with higher prio wants to offload the op
if (src_backend_id == sched->n_backends - 1) {
    for (int b = 0; b < src_backend_id; b++) {
        if (ggml_backend_supports_op(sched->backends[b], tensor) && ggml_backend_offload_op(sched->backends[b], tensor)) {
            SET_CAUSE(tensor, "1.off");
            return b;
        }
    }
}

Operations in non-offloaded layers are normally performed by the CPU back-end. But if ggml_backend_offload_op of another back-end returns true, the operation will be performed by that back-end instead. GPU back-ends check the batch dimension of the result tensor against a threshold (32) and return true for most operations. For example, this is the relevant code in the Vulkan back-end:
const int min_batch_size = 32;
return (op->ne[1] >= min_batch_size && op->op != GGML_OP_GET_ROWS) ||
       (op->ne[2] >= min_batch_size && op->op == GGML_OP_MUL_MAT_ID);
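
Roughly speaking (my reading of the code, not a guarantee): during prompt processing a batch of N tokens produces mat-mul results with ne[1] = N, so a 512-token prompt clears the 32-token threshold and those ops are offloaded, while single-token generation has ne[1] = 1 and stays on the CPU back-end.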

SteelPh0enix commented on September 18, 2024

isn't this just intended behavior? --gpu-layers is just 0 by default.
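
For reference, offload is opt-in via that flag; something along these lines should do it (the binary name depends on your build/version, the model path is a placeholder, and -ngl is the short form of --gpu-layers):

llama-cli -m models/llama3-8b-instruct-q4.gguf -p "Hello" -ngl 33

Passing a large value such as -ngl 99 attempts to offload all layers.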

shibe2 commented on September 18, 2024

Not all the work is done on the CPU. If you give it a long prompt, it should use the GPU.

differentprogramming commented on September 18, 2024

None of the work is being done by the GPU; it runs exactly as fast as the CPU build, and GPU use stays at 0%.

The instructions should warn you that you have to tell it to put layers onto the GPU in order to use the GPU.

differentprogramming commented on September 18, 2024

Or just maybe, if you compile it to use a GPU, the default should be to use it.

differentprogramming commented on September 18, 2024

Note that there are other issues where people complained about these speeds and never figured out why.
I think they eventually got closed because people stopped talking.

SteelPh0enix commented on September 18, 2024

I don't think enabling GPU offload by default is a good idea. Partial offload can hurt performance too (we should not assume that the user will automatically be able to offload any model to the GPU in full), and it can lead to more unexpected/unstable behavior, so leaving CPU inference as the default is a reasonable choice.

One could say that discovering that GPU offload is "off" by default is a "rite of passage" for a new llama.cpp user. While I agree that there should be a clear notice that GPU offloading is off by default (I haven't seen one yet; I'm talking about something like an h1 banner in the README), I think that reading the docs is probably a better solution than turning this on by default.

shibe2 commented on September 18, 2024

None of the work is being done by the GPU; it runs exactly as fast as the CPU build, and GPU use stays at 0%.

How long is your prompt?

differentprogramming commented on September 18, 2024

Another question: isn't 100% GPU usage for a 5x uplift vs. a 6-core Skylake CPU a bit low?
I can't really know, because there's no matrix of uplift across common GPUs, different ways of using them (BLAS, CUDA, Vulkan, etc.), and different models and formats.
While I realize that would be a combinatorial mess, there should be an entry with some example tokens per second for an example model, a few variations on it, and common GPUs, and some idea of whether CUDA is, for instance, the fastest and how BLAS, rocBLAS, or Vulkan compare.

differentprogramming commented on September 18, 2024

For instance, I have no idea whether I would get any uplift from spending a few weeks HIPifying the CUDA build!

differentprogramming commented on September 18, 2024

None of the work is being done by the GPU; it runs exactly as fast as the CPU build, and GPU use stays at 0%.

How long is your prompt?

I wasn't testing with interactive mode, so it was using built-in prompts at random.
I notice that changing the length of the result doesn't change the reported per-token speed.

dspasyuk commented on September 18, 2024

@differentprogramming It is the intended behavior; you can always fall back to the CPU when everything else fails. As for the difference between CUDA and cuBLAS, I do not see much difference on GTX 1060/P4/A100/A4500 GPUs: about 20 tokens per second on the GTX 1060 using Llama-instruct-Q4. Here is my default setup: https://github.com/dspasyuk/llama.cui

dspasyuk commented on September 18, 2024

@differentprogramming No worries. The GTX 1060 is about 10 times slower than the A4500 (~200 tokens/s), and the A100 (PCIe card) does about 180 tokens/s using Llama3-instruct-q4 and yesterday's version of llama.cpp with llama.cui.

shibe2 commented on September 18, 2024

So I benchmarked prompt processing with -ngl 0 with 3 back-ends: CPU, Vulkan and hipBLAS.

| model | size | params | backend | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| llama 8B Q6_K | 6.14 GiB | 8.03 B | CPU | 8 | pp128 | 33.53 ± 0.12 |
| llama 8B Q6_K | 6.14 GiB | 8.03 B | CPU | 8 | pp256 | 32.75 ± 0.07 |
| llama 8B Q6_K | 6.14 GiB | 8.03 B | CPU | 8 | pp512 | 31.75 ± 0.02 |
| llama 8B Q6_K | 6.14 GiB | 8.03 B | CPU | 8 | pp1024 | 31.06 ± 0.01 |
| llama 8B Q6_K | 6.14 GiB | 8.03 B | CPU | 8 | pp2048 | 29.62 ± 0.00 |

| model | size | params | backend | ngl | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| llama 8B Q6_K | 6.14 GiB | 8.03 B | Vulkan | 0 | pp128 | 113.84 ± 1.78 |
| llama 8B Q6_K | 6.14 GiB | 8.03 B | Vulkan | 0 | pp256 | 200.53 ± 0.82 |
| llama 8B Q6_K | 6.14 GiB | 8.03 B | Vulkan | 0 | pp512 | 273.84 ± 0.76 |
| llama 8B Q6_K | 6.14 GiB | 8.03 B | Vulkan | 0 | pp1024 | 272.64 ± 0.82 |
| llama 8B Q6_K | 6.14 GiB | 8.03 B | Vulkan | 0 | pp2048 | 270.03 ± 0.55 |

| model | size | params | backend | ngl | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| llama 8B Q6_K | 6.14 GiB | 8.03 B | CUDA | 0 | pp128 | 308.72 ± 0.05 |
| llama 8B Q6_K | 6.14 GiB | 8.03 B | CUDA | 0 | pp256 | 436.04 ± 0.05 |
| llama 8B Q6_K | 6.14 GiB | 8.03 B | CUDA | 0 | pp512 | 539.76 ± 0.16 |
| llama 8B Q6_K | 6.14 GiB | 8.03 B | CUDA | 0 | pp1024 | 527.57 ± 0.29 |
| llama 8B Q6_K | 6.14 GiB | 8.03 B | CUDA | 0 | pp2048 | 497.41 ± 0.93 |

build: a27aa50 (3258)

The last table says "CUDA", but it actually used an AMD GPU, and CUDA was not installed on the test system.

As we can see, GPU offloading provides a significant speedup in prompt processing. I use some models with -ngl 0, and there the GPU back-ends provide a speedup in real use cases too.

The title of this issue claims that all work is done on the CPU by default, which is not the case in my experience and tests.

What is true is that processing of small prompts, and generation, does not use the GPU with -ngl 0. This is intentional: when model parameters are not stored in VRAM, they need to be repeatedly transferred from system RAM for each token or small batch, which takes time and negates the speedup provided by the GPU.
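
As a rough back-of-the-envelope illustration (the bandwidth figure is an assumption, not a measurement): streaming the 6.14 GiB of Q6_K weights over a ~16 GB/s PCIe link takes around 0.4 s, so if that transfer were repeated for every generated token, the transfer alone would cap generation at roughly 2–3 t/s, well below the ~30 t/s the CPU manages above. The GPU only wins when a large batch of prompt tokens shares a single pass over the weights.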

The remaining question is whether ngl should be 0 by default. Ideally, the value of ngl could be selected automatically for best performance; some work in this direction is done in #6502. Unfortunately, this is difficult to figure out in most cases. For example, the hipBLAS back-end allocates additional VRAM during inference, and it can run out of it even if the parameters and initially allocated buffers seem to fit in VRAM. Also, a smaller value of ngl was sometimes faster in my tests.

Since I don't have a good solution, here are some options for the default ngl value:

  • 0 – this is what we have currently. You have to find the values that work best on your system for the models that you use (for example with a llama-bench sweep, as sketched below).
  • All layers. We may get a stream of reports about it crashing when it runs out of VRAM.
  • Some value in between – but what should it be?
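
For example, a sweep like the following can help find a good value on a given system (a sketch: the model path is a placeholder, and support for comma-separated values may vary between llama-bench versions):

llama-bench -m models/llama3-8b-instruct/ggml-model-q6_k.gguf -p 512 -n 128 -ngl 0,8,16,24,32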

I'm closing this issue, because everything seems to work as expected. If you want to discuss automatic ngl, there is issue #3719 about that.

differentprogramming commented on September 18, 2024

@shibe2 is there a simple command line to run the same test on my system?

shibe2 commented on September 18, 2024

For example:

llama-bench -m models/llama3-8b-instruct/ggml-model-q8_0.gguf -p 128,256,512,1024,2048 -n 0 -ngl 0

For testing different back-ends I used different builds of llama-bench.
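
If it helps, separate builds can be produced roughly like this (a sketch: the CMake option names are assumptions that have changed between llama.cpp versions, so check the build docs for your checkout):

cmake -B build-cpu && cmake --build build-cpu --target llama-bench
cmake -B build-vulkan -DGGML_VULKAN=ON && cmake --build build-vulkan --target llama-bench
cmake -B build-hip -DGGML_HIPBLAS=ON && cmake --build build-hip --target llama-bench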

differentprogramming commented on September 18, 2024

shibe2 commented on September 18, 2024

@differentprogramming I benchmarked specifically processing of long prompts. If you don't give it a prompt of the same length as used in the benchmark, you will get different speeds, of course.

Msyu1020 commented on September 18, 2024

So I benchmarked prompt processing with -ngl 0 with 3 back-ends: CPU, Vulkan and hipBLAS. […]

I'm sorry to bother you. I'm trying to figure out whether the GPU would be used when ngl is 0. As the benchmark results show, does it mean that even if no layers are offloaded to the GPU, some layers or ops are still executed on the GPU?

shibe2 commented on September 18, 2024

Larger batches are processed on the GPU; I think 32 tokens and up. The model's parameters involved in each operation are transferred to VRAM, then discarded after the operation is completed.

Msyu1020 commented on September 18, 2024

Larger batches are processed on the GPU; I think 32 tokens and up. The model's parameters involved in each operation are transferred to VRAM, then discarded after the operation is completed.

Thanks a lot. Could you point to the code in llama.cpp that determines whether the batch is large enough to be processed on the GPU?

Msyu1020 commented on September 18, 2024

I'm not familiar with that code, but I think that this is relevant: […]

This is exactly what I want. Sincerely thank you

shibe2 commented on September 18, 2024

@Msyu1020 This is completely off-topic, but it would be nicer if you didn't quote long comments in full.
