Comments (5)
Both prefetch strategies accomplish the same thing. Consider the ideal case where the data pointer is aligned to a cache line boundary (typically 64 bytes). Then you can prefetch any address in that 64-byte region and that cache line will be loaded. Then you increment by 64 bytes, prefetch again, and so on, and all is good. However, if the pointer is NOT aligned, then the first 64-byte region actually spans two cache lines. You now have the choice to prefetch the first or the second cache line: prefetching at offset 0 always hits the first one, and prefetching any address within the last element (f32 or f64) always hits the second one (note that both strategies above accomplish this). Here's an example:
| 64 bytes | 64 bytes | 64 bytes |
|CL1| CL2 | CL3 | CL4 |
If we assume that three 64-byte regions require three prefetches, then prefetching at offset 0 accomplishes:
| 64 bytes | 64 bytes | 64 bytes |
|XXX| XXX | XXX | --- | XXX = prefetched, --- = not prefetched
Instead, prefetching at any address within the last element accomplishes:
| 64 bytes | 64 bytes | 64 bytes |
|---| XXX | XXX | XXX | XXX = prefetched, --- = not prefetched
So far there is no difference in the average case. However, in particular when considering the C microtile, a later iteration of the microkernel will access the region just beyond what was prefetched (and loaded/stored) here. The last cache line accessed "spills over" into the next 64-byte region, which is precisely the data that will NOT be prefetched in that later microkernel iteration under approach #2. So, the prefetching in the later iteration looks like this:
| 64 bytes | 64 bytes | 64 bytes |
|YYY| XXX | XXX | XXX | XXX = prefetched, YYY = not prefetched here, but probably already warm in cache from a previous load.
If instead we prefetch at offset 0 then there is no benefit from previous loads/stores.
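To make the diagrams above concrete, here is a minimal C sketch that computes which cache line each prefetch lands in. It assumes a 64-byte cache line; the helper names `cache_line_of` and `prefetched_line` are made up for illustration and are not part of BLIS.

```c
#include <assert.h>
#include <stdint.h>

#define CACHE_LINE 64u /* assumed cache line size in bytes */

/* Index of the cache line that contains byte address addr. */
static uint64_t cache_line_of(uint64_t addr)
{
    return addr / CACHE_LINE;
}

/* Cache line hit by a prefetch at byte offset `off` inside the
   region-th 64-byte region that starts at `base` (hypothetical helper). */
static uint64_t prefetched_line(uint64_t base, unsigned region, unsigned off)
{
    assert(off < CACHE_LINE); /* the offset must stay inside the region */
    return cache_line_of(base + (uint64_t)region * CACHE_LINE + off);
}
```

For a base address 16 bytes past a line boundary (for example 4096 + 16), offset 0 in regions 0, 1, 2 lands in lines 64, 65, 66 (CL1 to CL3 in the diagram), while offset 60 lands in lines 65, 66, 67 (CL2 to CL4). For an aligned base both offsets land in the same line.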
from blis.
Actually, the second one should maybe be 15*4 also, in case the pointer is only 4-byte aligned.
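A small C sketch of why the offset matters for a 4-byte aligned pointer. The comparison against an offset of 7*8 = 56 is an assumption made for illustration (the thread does not say which offset the second strategy uses), and `misses_second_line` is a hypothetical helper.

```c
#include <assert.h>

#define CACHE_LINE 64u /* assumed cache line size in bytes */

/* Count the misalignments m (nonzero multiples of `align`, below 64) for which
   a prefetch at byte offset `off` still lands in the FIRST cache line of the
   region instead of the second one. */
static unsigned misses_second_line(unsigned off, unsigned align)
{
    assert(align > 0 && off < CACHE_LINE);
    unsigned misses = 0;
    for (unsigned m = align; m < CACHE_LINE; m += align) {
        if (m + off < CACHE_LINE) /* prefetch address is still in the first line */
            ++misses;
    }
    return misses;
}
```

With an offset of 15*4 = 60, every 4-byte misalignment reaches the second line. With the assumed offset of 56, a pointer that is 4 bytes past a line boundary prefetches address 60 of the first line, so the second line is missed; for 8-byte aligned pointers both offsets work.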
from blis.
I see. Prefetching the last element will benefit the next iteration. This design is ingenious.
Then I think the addresses in the prefetch instructions for the next A micropanel may not be correct.
Here rdx = a + ps_a4 is the address of A's next micropanel and r9 = cs_a. For packed A, cs_a is 24 bytes, so the prefetches are not a problem because they overlap with each other. However, if my understanding is correct, for small matrices the micropanels of A are not packed, so cs_a > 64 is possible. The following situation can then arise:
| 64 bytes | 64 bytes | 64 bytes |
|---| XXX | --- | XXX | --- | XXX = prefetched, --- = not prefetched
| cs_a | cs_a |
What we want is the first 6*4 bytes in each column of A, which will never be prefetched when cs_a > 64.
Is my understanding correct?
from blis.
In the case that cs_a is too large, you would always need two prefetches to make sure to get all of the next data. I didn't write this particular code, but I guess the design decision was that picking an offset of 5*8 gives optimal performance if the data is tightly packed and at least gets ~1/2 of the data prefetched in the average, large-stride case. There is a limit on how many L1 prefetches can be in flight at the same time, so doing two prefetches per row is probably too many.
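As a rough illustration of the trade-off, the following C sketch counts how many elements of one column end up in a prefetched cache line, with a single prefetch at a given offset versus two prefetches (first and last element). It assumes a 64-byte cache line and element-aligned columns; the function names are hypothetical and this is not the BLIS kernel code.

```c
#include <assert.h>
#include <stdint.h>

#define CACHE_LINE 64u /* assumed cache line size in bytes */

/* Elements of a column covered by ONE prefetch at byte offset prefetch_off
   from the column start. col_start is the column's byte address. */
static unsigned covered_by_one_prefetch(uint64_t col_start, unsigned n_elems,
                                        unsigned elem_size, unsigned prefetch_off)
{
    assert(elem_size > 0 && col_start % elem_size == 0);
    uint64_t hit_line = (col_start + prefetch_off) / CACHE_LINE;
    unsigned covered = 0;
    for (unsigned i = 0; i < n_elems; ++i) {
        if ((col_start + (uint64_t)i * elem_size) / CACHE_LINE == hit_line)
            ++covered;
    }
    return covered;
}

/* Elements covered by TWO prefetches: one at the first element and one at
   the last element of the column. */
static unsigned covered_by_two_prefetches(uint64_t col_start, unsigned n_elems,
                                          unsigned elem_size)
{
    assert(n_elems > 0 && elem_size > 0 && col_start % elem_size == 0);
    uint64_t first_line = col_start / CACHE_LINE;
    uint64_t last_line = (col_start + (uint64_t)(n_elems - 1) * elem_size) / CACHE_LINE;
    unsigned covered = 0;
    for (unsigned i = 0; i < n_elems; ++i) {
        uint64_t line = (col_start + (uint64_t)i * elem_size) / CACHE_LINE;
        if (line == first_line || line == last_line)
            ++covered;
    }
    return covered;
}
```

For a column of six f64 values and a prefetch offset of 5*8 = 40, a column that starts on a line boundary is fully covered (6 of 6), while a column that starts 24 bytes into a line has only its last element covered (1 of 6). Two prefetches always cover all six, because a 48-byte column spans at most two cache lines, at the cost of twice as many prefetches in flight.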
from blis.
I see. So this needs to be judged by profiling. Thank you!
from blis.