Comments (5)

devinamatthews avatar devinamatthews commented on June 29, 2024

Both prefetch strategies accomplish the same thing. Consider the ideal case where the data pointer is aligned to a cache line boundary (typically 64 bytes). Then you can prefetch any address in that 64-byte region and that cache line will be loaded. Then you increment by 64 bytes, prefetch again, etc., and all is good. However, if the pointer is NOT aligned, then the first 64-byte region actually spans two cache lines. You now have the choice to prefetch the first or the second cache line: prefetching at offset 0 always hits the first one, and any address within the last element (f32 or f64) always hits the second one (note that both strategies above accomplish this). Here's an example:

| 64 bytes | 64 bytes | 64 bytes |
|CL1|    CL2   |    CL3   | CL4  |

If we assume that three 64-byte regions require three prefetches, then prefetching at offset 0 accomplishes:

| 64 bytes | 64 bytes | 64 bytes |
|XXX|    XXX   |    XXX   | ---  |   XXX = prefetched, --- = not prefetched

Instead, prefetching at any address within the last element accomplishes:

| 64 bytes | 64 bytes | 64 bytes |
|---|    XXX   |    XXX   | XXX  |   XXX = prefetched, --- = not prefetched

So far there is no difference in the average case. However, particularly for the C microtile, a later iteration of the microkernel will access the region just beyond what was prefetched (and loaded/stored) here. The last cache line accessed "spills over" into the next 64-byte region, and that is precisely the data that will NOT be prefetched in that later microkernel iteration under approach #2. So, the prefetching in the later iteration looks like this:

| 64 bytes | 64 bytes | 64 bytes |
|YYY|    XXX   |    XXX   | XXX  |   XXX = prefetched, YYY = not prefetched here, but probably already warm in cache from a previous load.

If instead we prefetch at offset 0 then there is no benefit from previous loads/stores.
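
The diagrams above can be reproduced with a little address arithmetic. This is only a sketch of the reasoning, not code from BLIS: it assumes 64-byte cache lines, and `line_of` and `prefetched_line` are hypothetical helper names introduced for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define CACHE_LINE 64u /* assumed cache line size in bytes */

/* Index of the cache line that contains byte address addr. */
static uint64_t line_of(uint64_t addr)
{
    return addr / CACHE_LINE;
}

/* Cache line hit by the k-th prefetch (k = 0, 1, ...) when each prefetch
   targets byte offset `off` inside the k-th 64-byte region after `base`. */
static uint64_t prefetched_line(uint64_t base, unsigned k, unsigned off)
{
    assert(off < CACHE_LINE);
    return line_of(base + (uint64_t)k * CACHE_LINE + off);
}
```

For a pointer that sits 16 bytes past a cache line boundary, `prefetched_line(16, k, 0)` gives lines 0, 1, 2 for k = 0, 1, 2 (the first picture), while `prefetched_line(16, k, 7*8)` gives lines 1, 2, 3 (the second picture, with the last f64 element at offset 56). For an aligned pointer both offsets hit the same lines.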

from blis.

devinamatthews avatar devinamatthews commented on June 29, 2024

Actually, the second one should maybe also be 15*4, in case the pointer is only 4-byte aligned.

site-g avatar site-g commented on June 29, 2024

I see. Prefetching the last element benefits the next iteration. This design is ingenious.

But then I think the addresses in the prefetch instructions for the next A micropanel may not be correct.


prefetch(0, mem(rdx, r9, 1, 5*8))

prefetch(0, mem(rdx, r9, 2, 5*8))

Here rdx = a + ps_a4 is the address of A's next micropanel and r9 = cs_a. For packed A, cs_a is 24 bytes, so the prefetches are not a problem because they overlap with each other. However, if my understanding is correct, for small matrices the micropanels of A will not be packed, so cs_a > 64 is possible. Then this situation can occur:

| 64 bytes | 64 bytes | 64 bytes |
|---|    XXX   |    ---   |    XXX   |    ---   |    XXX = prefetched, --- = not prefetched
|        cs_a        |        cs_a        |

What we want is the first 6*4 bytes in each column of A, which will never be prefetched when cs_a > 64.

Is my understanding correct?

devinamatthews avatar devinamatthews commented on June 29, 2024

If cs_a is too large, then you would always need two prefetches to make sure you get all of the next data. I didn't write this particular code, but I guess the design decision was that an offset of 5*8 gives optimal performance if the data is tightly packed and at least gets ~1/2 of the data prefetched in the average, large-stride case. There is a limit on how many L1 prefetches can be in flight at the same time, so doing two prefetches per row is probably too many.
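
The "~1/2" estimate can be checked with a toy model. This is a sketch under assumptions that are not stated in the thread: 64-byte cache lines, 6*4 = 24 needed bytes per column, one prefetch at byte offset 5*8 = 40 from the column start, and column starts that are 4-byte aligned and equally likely to land at any such position within a cache line. The helper names `covered_bytes` and `average_coverage` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define CACHE_LINE 64u /* assumed cache line size in bytes */

/* How many of the `need` bytes starting at col_start fall in the single
   cache line that a prefetch of address col_start + off brings in. */
static unsigned covered_bytes(uint64_t col_start, unsigned need, unsigned off)
{
    uint64_t line = (col_start + off) / CACHE_LINE;
    unsigned n = 0;
    for (unsigned b = 0; b < need; ++b)
        if ((col_start + b) / CACHE_LINE == line)
            ++n;
    return n;
}

/* Fraction of needed bytes covered, averaged over every column start
   position within a cache line that is a multiple of `align`. */
static double average_coverage(unsigned need, unsigned off, unsigned align)
{
    unsigned total = 0, cases = 0;
    assert(align > 0 && need > 0);
    for (unsigned start = 0; start < CACHE_LINE; start += align) {
        total += covered_bytes(start, need, off);
        ++cases;
    }
    return (double)total / (cases * need);
}
```

Under these assumptions `average_coverage(24, 40, 4)` evaluates to 0.53125: a column that starts within the first 24 bytes of a cache line is fully covered, one that starts at bytes 24 through 40 is missed entirely, and later starts are covered only partially. Real hit rates depend on the actual alignment and stride, so this only illustrates the estimate and does not replace profiling.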

site-g avatar site-g commented on June 29, 2024

I see. So this needs to be judged by profiling. Thank you!
