Comments (8)
Hi, thanks for testing clBLAS on ARM GPU. The problem you saw is looking like a issue between clBLAS and ARM OpenCL compiler. I am pretty sure the same code runs fine with AMD, Intel and nVidia's OpenCL compiler.
I think the code generation of sgemmBlock() is done by calling blockGen() in /src/library/blas/gens/gemm.c . I am thinking one way of fixing would be eliminating the use of GPtr. A line looks like "__global float{1|2|4|8|16} = C + (coord.y * ldc + coord.x)/8; " should be sufficient. Of course we need to be careful because C could be float{1|2|4|8|16} in different cases.
from clblas.
Hi - thanks for the hint. I looked through the code and have noticed that amending the blas_kgen.c file gets my code further:
diff --git a/src/library/blas/gens/blas_kgen.c b/src/library/blas/gens/blas_kgen.c
index 595fe10..81a5fec 100644
--- a/src/library/blas/gens/blas_kgen.c
+++ b/src/library/blas/gens/blas_kgen.c
@@ -1173,8 +1173,9 @@ updateResultGen(
if ( vecLen > 1 ) {
sprintf(tmp,
- "uC.%s = %s + (%s * %s + %s)/%d;\n",
+ "uC.%s%dv = %s + (%s * %s + %s)/%d;\n",
vfield,
+ vecLen,
uvars.result,
uvars.startCol,
uvars.ld,
@@ -1197,8 +1198,9 @@ updateResultGen(
if ( vecLen > 1 ) {
sprintf(tmp,
- "uC.%s = %s + (%s * %s + %s)/%d;\n",
+ "uC.%s%dv = %s + (%s * %s + %s)/%d;\n",
vfield,
+ vecLen,
uvars.result,
uvars.startRow,
uvars.ld,
I think the original code might have triggered a compilation error in ARM OpenCL compiler. If you think the fix won't impact other implementations (I don't have AMD GPU unfortunately) then let me know and I will create a pull request.
from clblas.
One more thing. Looks like now the blas code stops on this line:
<source>:92:51: error: Cannot downconvert and widen scalar type 'unsigned long' to vector type 'uint4'
const uint4 bk = ((uint4)(0, 1, 2, 3) + (( get_group_id(0)*0 + k ) >> 1)) % vKB;
~~~~~~~~~~~~~~~~~~~ ^ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
I would appreciate the hint
AN INTERNAL KERNEL BUILD ERROR OCCURRED!
device name = Mali-T604
error = -11
memory pattern = Cached global memory based subgroup gemm, computing kernel generator
Subproblem dimensions: dims[0].itemY = 16, dims[0].itemX = 8, dims[0].y = 16, dims[0].x = 8, dims[0].bwidth = 64; ; dims[1].itemY = 4, dims[1].itemX = 4, dims[1].y = 4, dims[1].x = 4, dims[1].bwidth = 8; ;
Parallelism granularity: pgran->wgDim = 2, pgran->wgSize[0] = 8, pgran->wgSize[1] = 8, pgran->wfSize = 64
Kernel extra flags: 536887044
Source:
#define A_BLK_H 4
#define B_BLK_H 4
#define SUBG_ITEMS 8
#define SUBG_A 4
#define SUBG_B 2
#define SUBG_B 2
#define K_VLEN_A 2
#define K_VLEN_B 2
typedef union GPtr {
__global float *f;
__global float2 *f2v;
__global float4 *f4v;
__global float8 *f8v;
__global float16 *f16v;
} GPtr;
typedef union LPtr {
__local float *f;
__local float2 *f2v;
__local float4 *f4v;
__local float8 *f8v;
__local float16 *f16v;
} LPtr;
typedef union PPtr {
float *f;
float2 *f2v;
float4 *f4v;
float8 *f8v;
float16 *f16v;
} PPtr;
__attribute__((reqd_work_group_size(8, 8, 1)))
void __kernel
sgemmSubgroup(
uint M,
uint N,
uint K,
const float alpha,
const float beta,
const __global float2 *restrict A,
const __global float2 *restrict B,
__global float4 *C,
uint lda,
uint ldb,
uint ldc)
{
uint Ktail = K % 64;
uint Kbase = K - Ktail;
float8 a0, a1, a2, a3;
float8 b0, b1, b2, b3;
float4 c0, c1, c2, c3;
lda /= K_VLEN_A;
ldb /= K_VLEN_B;
int2 itemId;
itemId.x = get_local_id(0);
itemId.y = get_local_id(1);
int coordY = A_BLK_H*( get_group_id(1)*SUBG_A + get_local_id(1)/SUBG_B );
int coordX = B_BLK_H*( get_group_id(0)*SUBG_B + get_local_id(1)%SUBG_B );
uint skipTileMul = 0;
//M block tail
if( coordY >= M ) {
skipTileMul = 1;
}
//N block tail
if( coordX >= N ) {
skipTileMul = 1;
}
A += lda*coordY;
B += ldb*coordX;
c0 = 0;
c1 = 0;
c2 = 0;
c3 = 0;
const uint vKB = (Kbase >> 1);
if( !skipTileMul ) {
for(int k = 8*get_local_id(0); k < Kbase; k += 8*SUBG_ITEMS) {
/* -- Tiles multiplier -- */
float8 sum;
const uint4 bk = ((uint4)(0, 1, 2, 3) + (( get_group_id(0)*0 + k ) >> 1)) % vKB;
from clblas.
Hi I ran into some test fails in c/zgemm with your code. You can see them by running the google test with command "./test-short --gtest_filter=gemm".
Can you take a look and see if you can fix it?
I will take a look for the bug in sgemmsubgroup later.
from clblas.
I have a very similar issue on Mac OS X 10.10 when trying to perform SGEMM.
I am able to say that the crash only occurs when using the transposition feature of the function.
i.e.
This crashes :
error = clblasSgemm(clblasRowMajor, clblasNoTrans, clblasTrans, X.nb_rows, Y.nb_rows, X.nb_columns,
a, X.gpu_buffer, 0, X.nb_columns,
Y.gpu_buffer, 0, Y.nb_columns,
b, Z.gpu_buffer, 0, Z.nb_columns,
1, &this->command_queue, 0, NULL, &event);
I ran several tests and
- the crash only occurs if you try to transpose a matrix by setting the clblasTrans flag
- the crash only occurs when the transposed matrix is the second parameter (never if it is the first parameter)
- the crash only occurs if the second matrix has more than 8 columns
- the crash also occurs if you try to outsmart the function by setting the clblasColumnMajor mode and clBlasTrans on the first parameter, as below :
error = clblasSgemm(clblasColumnMajor, clblasTrans, clblasNoTrans, X.nb_columns, Y.nb_columns, X.nb_rows,
a, X.gpu_buffer, 0, X.nb_rows,
Y.gpu_buffer, 0, Y.nb_rows,
b, Z.gpu_buffer, 0, Z.nb_rows,
1, &this->command_queue, 0, NULL, &event);
For those running into the same issue as me and looking for a quick fix.
You can get around it by first transposing the matrix, then running SGEMM with clBlasNoTrans, then transposing again the matrix.
Hardly an efficient or satisfactory solution (given that BLAS has no in-place transposition function), but I guess it is good to know that this is not a truly blocking issue.
Cheers
========================================================
AN INTERNAL KERNEL BUILD ERROR OCCURRED!
device name = HD Graphics 5000
error = -11
memory pattern = Cached global memory based subgroup gemm, computing kernel generator
Subproblem dimensions: dims[0].itemY = 16, dims[0].itemX = 8, dims[0].y = 16, dims[0].x = 8, dims[0].bwidth = 64; ; dims[1].itemY = 4, dims[1].itemX = 4, dims[1].y = 4, dims[1].x = 4, dims[1].bwidth = 8; ;
Parallelism granularity: pgran->wgDim = 2, pgran->wgSize[0] = 8, pgran->wgSize[1] = 8, pgran->wfSize = 64
Kernel extra flags: 805338884
Source:
TRUNCATED FOR CLARITY
--------------------------------------------------------
Build log:
<program source>:97:51: error: Cannot downconvert and widen scalar type 'unsigned long' to vector type '__uint4'
const uint4 bk = ((uint4)(0, 1, 2, 3) + (( get_group_id(0)*0 + k ) >> 1)) % vKB;
~~~~~~~ ^ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
<program source>:270:20: warning: unused variable 'sum'
float8 sum;
^
<program source>:592:29: warning: unused variable 'i'
int i, j;
^
<program source>:592:32: warning: unused variable 'j'
int i, j;
^
<program source>:593:30: warning: unused variable 'res'
PPtr res;
^
<program source>:68:16: warning: comparison of integers of different signs: 'int' and 'uint' (aka 'unsigned int')
if( coordY >= M ) {
~~~~~~ ^ ~
<program source>:72:16: warning: comparison of integers of different signs: 'int' and 'uint' (aka 'unsigned int')
if( coordX >= N ) {
~~~~~~ ^ ~
<program source>:77:20: warning: comparison of integers of different signs: 'int' and 'uint' (aka 'unsigned int')
if (coordX + 4 > N) {
~~~~~~~~~~ ^ ~
<program source>:94:42: warning: comparison of integers of different signs: 'int' and 'uint' (aka 'unsigned int')
for(int k = 8*get_local_id(0); k < Kbase; k += 8*SUBG_ITEMS) {
~ ^ ~~~~~
<program source>:511:21: warning: comparison of integers of different signs: 'int' and 'uint' (aka 'unsigned int')
if ((coordX + 4 == N) && (N % 4)) {
~~~~~~~~~~ ^ ~
<program source>:527:23: warning: comparison of integers of different signs: 'int' and 'uint' (aka 'unsigned int')
if( (itemId.y >= mstep)&&(itemId.y < (mstep+8)) ) {
~~~~~~~~ ^ ~~~~~
<program source>:527:44: warning: comparison of integers of different signs: 'int' and 'unsigned int'
if( (itemId.y >= mstep)&&(itemId.y < (mstep+8)) ) {
~~~~~~~~ ^ ~~~~~~~
<program source>:543:23: warning: comparison of integers of different signs: 'int' and 'uint' (aka 'unsigned int')
if( (itemId.y >= mstep)&&(itemId.y < (mstep+8)) ) {
~~~~~~~~ ^ ~~~~~
<program source>:543:44: warning: comparison of integers of different signs: 'int' and 'unsigned int'
if( (itemId.y >= mstep)&&(itemId.y < (mstep+8)) ) {
~~~~~~~~ ^ ~~~~~~~
<program source>:557:29: warning: comparison of integers of different signs: 'int' and 'uint' (aka 'unsigned int')
if ((coordY < M) && (coordX < N)) {
~~~~~~ ^ ~
<program source>:557:45: warning: comparison of integers of different signs: 'int' and 'uint' (aka 'unsigned int')
if ((coordY < M) && (coordX < N)) {
~~~~~~ ^ ~
========================================================
libc++abi.dylib: terminating with uncaught exception
(lldb)
from clblas.
I believe that this issue is also closed when #62 closed. The two issues appeared to be related on different platforms
from clblas.
Hi, I don't think this is really fixed. I am compiling on ARM and got the same error. I implemented marcino239's fix to src/library/blas/gens/blas_kgen.c (11/2/14 post) and it fixes it. Can that fix be put in the repo?
from clblas.
@galassoj +1 exactly same situation here. I tried with both v2.6 and master.
from clblas.
Related Issues (20)
- test-correctness segfault and "INTERNAL BUILD FAILURE"
- Does this clBLAS support FPGA? HOT 3
- Will it run on OpenCL 1.1 ( EP) on Vivante GC2000 GPU? HOT 3
- Problems building with gtest-1.8.1 HOT 1
- Runtime error with Intel OpenCL 18.1.0.0920
- What is APPML 1.12 and where is it?
- clBLAS test fail with ROCm on Centos 7.6 HOT 1
- clBLAS aborts when backend is OpenCL 1.1
- Outdated documentation? HOT 1
- bug in clblasiCamax???
- Build clBLAS without OpenBLAS? HOT 5
- how about the performance on adreno gpu HOT 1
- test-short failure on gfx1010 (RX5700 XT)
- CMake compilation with clBLAS fails on hard-coded AMDADDPSDK path HOT 1
- Installation procedure went wrong? HOT 1
- add error checker when creating cmd queue in client: especially when OoO queue is not supported on many devices HOT 1
- undefined clblassetup HOT 7
- Test cases that can be displayed in an image interface HOT 1
- Correctness test fails to compile on m2 Mac HOT 1
- Is it a good idea to use GCN cross lane instruction for optimization? HOT 1
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from clblas.