I am a little confused with how to set the memory location in the …

Memory location in the prefetch instructions

site-g commented on June 29, 2024

Memory location in the prefetch instructions

from blis.

Comments (5)

devinamatthews commented on June 29, 2024

Both prefetch strategies accomplish the same thing. Consider the ideal case where the data pointer is aligned to a cache line boundary (typically 64 bytes). Then you can prefetch any address in that 64-byte region and that cache line will be loaded. You then increment by 64 bytes, prefetch again, and so on, and all is good. However, if the pointer is NOT aligned, then the first 64-byte region actually spans two cache lines. You now have the choice of prefetching the first or the second cache line: prefetching at offset 0 always gets the first one, and prefetching at any address within the last element (f32 or f64) always gets the second one (note that both strategies above accomplish this). Here's an example:

| 64 bytes | 64 bytes | 64 bytes |
|CL1|    CL2   |    CL3   | CL4  |

If we assume that three 64-byte regions require three prefetches, then prefetching at offset 0 accomplishes:

| 64 bytes | 64 bytes | 64 bytes |
|XXX|    XXX   |    XXX   | ---  |   XXX = prefetched, --- = not prefetched

Instead, prefetching at any address within the last element accomplishes:

| 64 bytes | 64 bytes | 64 bytes |
|---|    XXX   |    XXX   | XXX  |   XXX = prefetched, --- = not prefetched

So far there is no difference in the average case. However, particularly for the C microtile, a later iteration of the microkernel will access the region just beyond what was prefetched (and loaded/stored) here. The last cache line accessed "spills over" into the next 64-byte region, which is precisely the data that will NOT be prefetched in that later microkernel iteration under approach #2. So the prefetching in the later iteration looks like this:

| 64 bytes | 64 bytes | 64 bytes |
|YYY|    XXX   |    XXX   | XXX  |   XXX = prefetched, YYY = not prefetched here, but probably already warm in cache from a previous load.

If we instead prefetch at offset 0, there is no benefit from previous loads/stores: the cache line left unprefetched is then the last one, which earlier iterations have not touched.
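The diagrams above can be reproduced with a minimal C sketch that computes which cache lines each strategy touches. It assumes 64-byte cache lines and three 64-byte regions per microkernel iteration, as in the example. The function names and the 16-byte misalignment used below are hypothetical and are not taken from BLIS.

```c
#include <assert.h>
#include <stdint.h>

enum { CACHE_LINE = 64, REGIONS = 3 };

/* Index of the cache line containing byte address addr. */
static uint64_t line_of(uint64_t addr) { return addr / CACHE_LINE; }

/* Record the cache line hit by one prefetch per 64-byte region.
   offset = 0    models "prefetch at offset 0";
   offset = 15*4 models "prefetch inside the last f32 of the region". */
static void prefetched_lines(uint64_t base, uint64_t offset, uint64_t out[REGIONS])
{
    assert(offset < CACHE_LINE);
    for (int i = 0; i < REGIONS; i++)
        out[i] = line_of(base + (uint64_t)i * CACHE_LINE + offset);
}

/* Returns 1 if `line` was touched by the loads/stores of an iteration
   that covers the bytes [base, base + REGIONS * 64). */
static int touched_by_iteration(uint64_t base, uint64_t line)
{
    uint64_t first = line_of(base);
    uint64_t last  = line_of(base + REGIONS * CACHE_LINE - 1);
    return line >= first && line <= last;
}
```

With a hypothetical base 16 bytes past a line boundary, offset 0 yields lines 0, 1, 2 and offset 15*4 yields lines 1, 2, 3. For the next iteration (base 16 + 192), the line skipped by the second strategy is the first one, and `touched_by_iteration` reports it as already accessed by the previous iteration, while the line skipped by the first strategy is not.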


devinamatthews commented on June 29, 2024

Actually, the second one should maybe also be 15*4, in case the pointer is only 4-byte aligned.
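The alignment concern can be checked by brute force. The sketch below assumes 64-byte cache lines and tests whether a given prefetch offset always lands in the cache line that holds the last byte of a 64-byte region, for every start position allowed by a given pointer alignment. The function name is hypothetical, and the 7*8 offset in the usage note is only an illustrative comparison, not a value quoted from the kernel.

```c
#include <assert.h>

enum { CACHE_LINE = 64 };

/* Returns 1 if, for every start offset within a cache line that is a
   multiple of `align`, a prefetch at start + prefetch_offset hits the
   cache line containing the last byte of the 64-byte region. */
static int always_hits_last_line(unsigned align, unsigned prefetch_offset)
{
    assert(align > 0 && prefetch_offset < CACHE_LINE);
    for (unsigned start = 0; start < CACHE_LINE; start += align) {
        unsigned last_byte_line = (start + CACHE_LINE - 1) / CACHE_LINE;
        unsigned prefetch_line  = (start + prefetch_offset) / CACHE_LINE;
        if (prefetch_line != last_byte_line)
            return 0;
    }
    return 1;
}
```

Under these assumptions, `always_hits_last_line(4, 15*4)` holds, and so does `always_hits_last_line(8, 7*8)`, but `always_hits_last_line(4, 7*8)` fails: a pointer 4 bytes past a line boundary puts offset 56 back in the first cache line.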


site-g commented on June 29, 2024

I see. Prefetching the last element benefits the next iteration. This design is ingenious.

Then I think the address in the prefetch instructions for the next A micropanels may not be correct.

blis/kernels/haswell/3/sup/bli_gemmsup_rv_haswell_asm_s6x16m.c

Line 390 in 6d0ab74

prefetch(0, mem(rdx, 5*8))

blis/kernels/haswell/3/sup/bli_gemmsup_rv_haswell_asm_s6x16m.c

Line 425 in 6d0ab74

prefetch(0, mem(rdx, r9, 1, 5*8))

blis/kernels/haswell/3/sup/bli_gemmsup_rv_haswell_asm_s6x16m.c

Line 460 in 6d0ab74

prefetch(0, mem(rdx, r9, 2, 5*8))

Here rdx = a + ps_a4 is the address of A's next micropanel and r9 = cs_a. For packed A, cs_a is 24 bytes, so the prefetches are not a problem because they overlap with each other. However, if my understanding is correct, the micropanels of A are not packed for small matrices, so cs_a > 64 is possible. The following situation can then arise:

| 64 bytes | 64 bytes | 64 bytes |
|---|    XXX   |    ---   |    XXX   |    ---   |    XXX = prefetched, --- = not prefetched
|        cs_a        |        cs_a        |

What we want is the first 6*4 bytes in each column of A, which will never be prefetched when cs_a > 64.

Is my understanding correct?
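The three prefetch addresses quoted above (rdx + 5*8, rdx + cs_a + 5*8, rdx + 2*cs_a + 5*8) can be compared against the bytes the kernel actually wants with a small C sketch. It assumes 64-byte cache lines, 6*4 wanted bytes per column, and addresses expressed relative to a cache line boundary. The function name and the sample values of rdx and cs_a are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

enum { CACHE_LINE = 64, NCOLS = 3, COL_BYTES = 6 * 4, PF_DISP = 5 * 8 };

/* Count how many of the NCOLS * COL_BYTES wanted bytes lie in a cache
   line hit by one of the three prefetches at rdx + j*cs_a + PF_DISP. */
static unsigned covered_bytes(uint64_t rdx, uint64_t cs_a)
{
    assert(cs_a > 0);
    uint64_t pf_line[NCOLS];
    for (int j = 0; j < NCOLS; j++)
        pf_line[j] = (rdx + (uint64_t)j * cs_a + PF_DISP) / CACHE_LINE;

    unsigned covered = 0;
    for (int j = 0; j < NCOLS; j++) {
        for (unsigned b = 0; b < COL_BYTES; b++) {
            uint64_t line = (rdx + (uint64_t)j * cs_a + b) / CACHE_LINE;
            for (int k = 0; k < NCOLS; k++) {
                if (pf_line[k] == line) { covered++; break; }
            }
        }
    }
    return covered;
}
```

In this model the packed case `covered_bytes(0, 24)` covers all 72 bytes. With a hypothetical stride of 256 bytes the outcome depends on alignment: `covered_bytes(0, 256)` still covers all 72 bytes, while `covered_bytes(32, 256)` covers none, which is the situation drawn in the diagram.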


devinamatthews commented on June 29, 2024

If cs_a is too large, then you would always need two prefetches to make sure you get all of the next data. I didn't write this particular code, but I guess the design decision was that an offset of 5*8 gives optimal performance if the data is tightly packed and still gets ~1/2 of the data prefetched in the average large-stride case. There is a limit on how many L1 prefetches can be in flight at the same time, so doing two prefetches per row is probably too many.
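The "~1/2" estimate can be sanity-checked with a short C sketch. It assumes 64-byte cache lines, a stride large enough that neighbouring columns never share a cache line, a single prefetch at displacement 5*8 per column, and column starts that are equally likely at any 4-byte-aligned offset within a line. The function names are hypothetical.

```c
#include <assert.h>

enum { CACHE_LINE = 64, COL_BYTES = 6 * 4, PF_DISP = 5 * 8, ELEM = 4 };

/* Bytes of one column's COL_BYTES that share a cache line with the single
   prefetch at start + PF_DISP, for a column starting `start` bytes into
   a cache line. */
static unsigned covered_for_start(unsigned start)
{
    assert(start < CACHE_LINE);
    unsigned pf_line = (start + PF_DISP) / CACHE_LINE;
    unsigned covered = 0;
    for (unsigned b = 0; b < COL_BYTES; b++)
        if ((start + b) / CACHE_LINE == pf_line)
            covered++;
    return covered;
}

/* Sum the covered bytes over every 4-byte-aligned start offset in a line. */
static unsigned total_covered(void)
{
    unsigned sum = 0;
    for (unsigned start = 0; start < CACHE_LINE; start += ELEM)
        sum += covered_for_start(start);
    return sum;
}
```

Under these assumptions `total_covered()` is 204 out of 16 * 24 = 384 wanted bytes, about 53%, which is in line with the rough estimate above. Whether a second prefetch per column pays off on real hardware is a separate question that this model does not answer.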


site-g commented on June 29, 2024

I see. So this needs to be judged by profiling. Thank you!
