Code Monkey home page Code Monkey logo

Comments (8)

TimmyLiu avatar TimmyLiu commented on July 30, 2024

Hi, thanks for testing clBLAS on ARM GPU. The problem you saw is looking like a issue between clBLAS and ARM OpenCL compiler. I am pretty sure the same code runs fine with AMD, Intel and nVidia's OpenCL compiler.
I think the code generation of sgemmBlock() is done by calling blockGen() in /src/library/blas/gens/gemm.c . I am thinking one way of fixing would be eliminating the use of GPtr. A line looks like "__global float{1|2|4|8|16} = C + (coord.y * ldc + coord.x)/8; " should be sufficient. Of course we need to be careful because C could be float{1|2|4|8|16} in different cases.

from clblas.

marcino239 avatar marcino239 commented on July 30, 2024

Hi - thanks for the hint. I looked through the code and have noticed that amending the blas_kgen.c file gets my code further:

diff --git a/src/library/blas/gens/blas_kgen.c b/src/library/blas/gens/blas_kgen.c
index 595fe10..81a5fec 100644
--- a/src/library/blas/gens/blas_kgen.c
+++ b/src/library/blas/gens/blas_kgen.c
@@ -1173,8 +1173,9 @@ updateResultGen(
         if ( vecLen > 1 ) {

             sprintf(tmp,
-                "uC.%s = %s + (%s * %s + %s)/%d;\n",
+                "uC.%s%dv = %s + (%s * %s + %s)/%d;\n",
                 vfield,
+                vecLen,
                 uvars.result,
                 uvars.startCol,
                 uvars.ld,
@@ -1197,8 +1198,9 @@ updateResultGen(
         if ( vecLen > 1 ) {

             sprintf(tmp,
-                "uC.%s = %s + (%s * %s + %s)/%d;\n",
+                "uC.%s%dv = %s + (%s * %s + %s)/%d;\n",
                 vfield,
+                vecLen,
                 uvars.result,
                 uvars.startRow,
                 uvars.ld,

I think the original code might have triggered a compilation error in ARM OpenCL compiler. If you think the fix won't impact other implementations (I don't have AMD GPU unfortunately) then let me know and I will create a pull request.

from clblas.

marcino239 avatar marcino239 commented on July 30, 2024

One more thing. Looks like now the blas code stops on this line:

<source>:92:51: error: Cannot downconvert and widen scalar type 'unsigned long' to vector type 'uint4'
            const uint4 bk = ((uint4)(0, 1, 2, 3) + (( get_group_id(0)*0 + k ) >> 1)) % vKB;
                              ~~~~~~~~~~~~~~~~~~~ ^ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

I would appreciate the hint

AN INTERNAL KERNEL BUILD ERROR OCCURRED!
device name = Mali-T604
error = -11
memory pattern = Cached global memory based subgroup gemm, computing kernel generator
Subproblem dimensions: dims[0].itemY = 16, dims[0].itemX = 8, dims[0].y = 16, dims[0].x = 8, dims[0].bwidth = 64; ; dims[1].itemY = 4, dims[1].itemX = 4, dims[1].y = 4, dims[1].x = 4, dims[1].bwidth = 8; ; 
Parallelism granularity: pgran->wgDim = 2, pgran->wgSize[0] = 8, pgran->wgSize[1] = 8, pgran->wfSize = 64
Kernel extra flags: 536887044
Source:

#define A_BLK_H 4
#define B_BLK_H 4
#define SUBG_ITEMS 8
#define SUBG_A 4
#define SUBG_B 2

#define SUBG_B 2
#define K_VLEN_A 2
#define K_VLEN_B 2

typedef union GPtr {
    __global float *f;
    __global float2 *f2v;
    __global float4 *f4v;
    __global float8 *f8v;
    __global float16 *f16v;
} GPtr;

typedef union LPtr {
    __local float *f;
    __local float2 *f2v;
    __local float4 *f4v;
    __local float8 *f8v;
    __local float16 *f16v;
} LPtr;

typedef union PPtr {
    float *f;
    float2 *f2v;
    float4 *f4v;
    float8 *f8v;
    float16 *f16v;
} PPtr;


__attribute__((reqd_work_group_size(8, 8, 1)))
void __kernel
sgemmSubgroup(
    uint M,
    uint N,
    uint K,
    const float alpha,
    const float beta,
    const __global float2 *restrict A,
    const __global float2 *restrict B,
    __global float4 *C,
    uint lda,
    uint ldb,
    uint ldc)
{
    uint Ktail = K % 64;
    uint Kbase = K - Ktail;
    float8 a0, a1, a2, a3;
    float8 b0, b1, b2, b3;
    float4 c0, c1, c2, c3;
    lda /= K_VLEN_A;
    ldb /= K_VLEN_B;

    int2 itemId;
    itemId.x = get_local_id(0);
    itemId.y = get_local_id(1);

    int coordY = A_BLK_H*( get_group_id(1)*SUBG_A + get_local_id(1)/SUBG_B );
    int coordX = B_BLK_H*( get_group_id(0)*SUBG_B + get_local_id(1)%SUBG_B );

    uint skipTileMul = 0;
    //M block tail
    if( coordY >= M ) {
        skipTileMul = 1;
    }
    //N block tail
    if( coordX >= N ) {
        skipTileMul = 1;
    }

    A += lda*coordY;
    B += ldb*coordX;

    c0 = 0;
    c1 = 0;
    c2 = 0;
    c3 = 0;



    const uint vKB = (Kbase >> 1);

    if( !skipTileMul ) {
        for(int k = 8*get_local_id(0); k < Kbase; k += 8*SUBG_ITEMS) {
            /* -- Tiles multiplier -- */
            float8 sum;
            const uint4 bk = ((uint4)(0, 1, 2, 3) + (( get_group_id(0)*0 + k ) >> 1)) % vKB;

from clblas.

TimmyLiu avatar TimmyLiu commented on July 30, 2024

Hi I ran into some test fails in c/zgemm with your code. You can see them by running the google test with command "./test-short --gtest_filter=gemm".
Can you take a look and see if you can fix it?
I will take a look for the bug in sgemmsubgroup later.

from clblas.

vejja avatar vejja commented on July 30, 2024

I have a very similar issue on Mac OS X 10.10 when trying to perform SGEMM.
I am able to say that the crash only occurs when using the transposition feature of the function.
i.e.
This crashes :

error = clblasSgemm(clblasRowMajor, clblasNoTrans, clblasTrans, X.nb_rows, Y.nb_rows, X.nb_columns,
                        a, X.gpu_buffer, 0, X.nb_columns,
                        Y.gpu_buffer, 0, Y.nb_columns,
                        b, Z.gpu_buffer, 0, Z.nb_columns,
                        1, &this->command_queue, 0, NULL, &event);

I ran several tests and

  • the crash only occurs if you try to transpose a matrix by setting the clblasTrans flag
  • the crash only occurs when the transposed matrix is the second parameter (never if it is the first parameter)
  • the crash only occurs if the second matrix has more than 8 columns
  • the crash also occurs if you try to outsmart the function by setting the clblasColumnMajor mode and clBlasTrans on the first parameter, as below :
error = clblasSgemm(clblasColumnMajor, clblasTrans, clblasNoTrans, X.nb_columns, Y.nb_columns, X.nb_rows,
                        a, X.gpu_buffer, 0, X.nb_rows,
                        Y.gpu_buffer, 0, Y.nb_rows,
                        b, Z.gpu_buffer, 0, Z.nb_rows,
                        1, &this->command_queue, 0, NULL, &event);

For those running into the same issue as me and looking for a quick fix.
You can get around it by first transposing the matrix, then running SGEMM with clBlasNoTrans, then transposing again the matrix.
Hardly an efficient or satisfactory solution (given that BLAS has no in-place transposition function), but I guess it is good to know that this is not a truly blocking issue.

Cheers

========================================================

AN INTERNAL KERNEL BUILD ERROR OCCURRED!
device name = HD Graphics 5000
error = -11
memory pattern = Cached global memory based subgroup gemm, computing kernel generator
Subproblem dimensions: dims[0].itemY = 16, dims[0].itemX = 8, dims[0].y = 16, dims[0].x = 8, dims[0].bwidth = 64; ; dims[1].itemY = 4, dims[1].itemX = 4, dims[1].y = 4, dims[1].x = 4, dims[1].bwidth = 8; ; 
Parallelism granularity: pgran->wgDim = 2, pgran->wgSize[0] = 8, pgran->wgSize[1] = 8, pgran->wfSize = 64
Kernel extra flags: 805338884
Source:

TRUNCATED FOR CLARITY

--------------------------------------------------------

Build log:

<program source>:97:51: error: Cannot downconvert and widen scalar type 'unsigned long' to vector type '__uint4'
            const uint4 bk = ((uint4)(0, 1, 2, 3) + (( get_group_id(0)*0 + k ) >> 1)) % vKB;
                              ~~~~~~~             ^ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
<program source>:270:20: warning: unused variable 'sum'
            float8 sum;
                   ^
<program source>:592:29: warning: unused variable 'i'
                        int i, j;
                            ^
<program source>:592:32: warning: unused variable 'j'
                        int i, j;
                               ^
<program source>:593:30: warning: unused variable 'res'
                        PPtr res;
                             ^
<program source>:68:16: warning: comparison of integers of different signs: 'int' and 'uint' (aka 'unsigned int')
    if( coordY >= M ) {
        ~~~~~~ ^  ~
<program source>:72:16: warning: comparison of integers of different signs: 'int' and 'uint' (aka 'unsigned int')
    if( coordX >= N ) {
        ~~~~~~ ^  ~
<program source>:77:20: warning: comparison of integers of different signs: 'int' and 'uint' (aka 'unsigned int')
    if (coordX + 4 > N) {
        ~~~~~~~~~~ ^ ~
<program source>:94:42: warning: comparison of integers of different signs: 'int' and 'uint' (aka 'unsigned int')
        for(int k = 8*get_local_id(0); k < Kbase; k += 8*SUBG_ITEMS) {
                                       ~ ^ ~~~~~
<program source>:511:21: warning: comparison of integers of different signs: 'int' and 'uint' (aka 'unsigned int')
    if ((coordX + 4 == N) && (N % 4)) {
         ~~~~~~~~~~ ^  ~
<program source>:527:23: warning: comparison of integers of different signs: 'int' and 'uint' (aka 'unsigned int')
        if( (itemId.y >= mstep)&&(itemId.y < (mstep+8)) ) {
             ~~~~~~~~ ^  ~~~~~
<program source>:527:44: warning: comparison of integers of different signs: 'int' and 'unsigned int'
        if( (itemId.y >= mstep)&&(itemId.y < (mstep+8)) ) {
                                  ~~~~~~~~ ^  ~~~~~~~
<program source>:543:23: warning: comparison of integers of different signs: 'int' and 'uint' (aka 'unsigned int')
        if( (itemId.y >= mstep)&&(itemId.y < (mstep+8)) ) {
             ~~~~~~~~ ^  ~~~~~
<program source>:543:44: warning: comparison of integers of different signs: 'int' and 'unsigned int'
        if( (itemId.y >= mstep)&&(itemId.y < (mstep+8)) ) {
                                  ~~~~~~~~ ^  ~~~~~~~
<program source>:557:29: warning: comparison of integers of different signs: 'int' and 'uint' (aka 'unsigned int')
                if ((coordY < M) && (coordX < N)) {
                     ~~~~~~ ^ ~
<program source>:557:45: warning: comparison of integers of different signs: 'int' and 'uint' (aka 'unsigned int')
                if ((coordY < M) && (coordX < N)) {
                                     ~~~~~~ ^ ~

========================================================

libc++abi.dylib: terminating with uncaught exception 
(lldb) 

from clblas.

kknox avatar kknox commented on July 30, 2024

I believe that this issue is also closed when #62 closed. The two issues appeared to be related on different platforms

from clblas.

galassoj avatar galassoj commented on July 30, 2024

Hi, I don't think this is really fixed. I am compiling on ARM and got the same error. I implemented marcino239's fix to src/library/blas/gens/blas_kgen.c (11/2/14 post) and it fixes it. Can that fix be put in the repo?

from clblas.

nstiurca avatar nstiurca commented on July 30, 2024

@galassoj +1 exactly same situation here. I tried with both v2.6 and master.

from clblas.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.