I would expect the performance of hipBLAS GEMM operations to be comparable to that of cuBLAS GEMM operations.
The GEMM operations on floats are extremely slow.
See the attached test case. Compile it with -D CUDA and run on an NVIDIA GPU, then recompile with -D HIP and run the HIPified build on an AMD GPU, and compare the results.
I ran it on my two desktops -- one with a Radeon W6800 and the other with a GeForce RTX 2060. See the attached GemmTest.cpp; here are the results:
#include <unistd.h>
#include <iostream>
#include <stdlib.h>
#include <assert.h>
#if defined(CUDA)
// NVIDIA backend: map the vendor-neutral gpu*/gpuBlas* names onto the CUDA
// runtime and cuBLAS so the benchmark in main() compiles unchanged on either stack.
#include <cuda_runtime.h>
#include <cublas_v2.h>
// "No transpose" operation code forwarded to gpuBlasSgemm.
#define GPUBLAS_OP_N CUBLAS_OP_N
// Vendor-neutral aliases for the runtime error, event, stream, BLAS handle
// and BLAS status types.
typedef cudaError_t gpuError_t;
typedef cudaEvent_t gpuEvent_t;
typedef cudaStream_t gpuStream_t;
typedef cublasHandle_t gpuBlasHandle_t;
typedef cublasStatus_t gpuBlasStatus_t;
// Success codes compared against by the error checks in main().
const gpuBlasStatus_t gpuBlasSuccess = CUBLAS_STATUS_SUCCESS;
const gpuError_t gpuSuccess = cudaSuccess;
#define gpuMemcpyHostToDevice cudaMemcpyHostToDevice
// The functions below are thin pass-through wrappers: each forwards its
// arguments verbatim to the corresponding CUDA / cuBLAS call and returns
// that call's status unchanged.
gpuError_t gpuMemcpy(void* dest, void* src, size_t size, cudaMemcpyKind flags) {
return cudaMemcpy(dest, src, size, flags);
}
// Allocates unified (managed) memory accessible from both host and device.
gpuError_t gpuMallocManaged(void** ret, size_t size) {
return cudaMallocManaged(ret, size);
}
gpuError_t gpuEventCreate(gpuEvent_t* pevent) {
return cudaEventCreate(pevent);
}
gpuError_t gpuEventRecord(gpuEvent_t event, gpuStream_t stream) {
return cudaEventRecord(event, stream);
}
gpuError_t gpuEventSynchronize(gpuEvent_t event) {
return cudaEventSynchronize(event);
}
gpuError_t gpuGetLastError() {
return cudaGetLastError();
}
// Elapsed time between two recorded events, in milliseconds.
gpuError_t gpuEventElapsedTime(float* t, gpuEvent_t start, gpuEvent_t stop) {
return cudaEventElapsedTime(t, start, stop);
}
gpuError_t gpuFree(void* p) {
return cudaFree(p);
}
gpuBlasStatus_t gpuBlasCreate(gpuBlasHandle_t* phandle) {
return cublasCreate(phandle);
}
// Single-precision GEMM: C = alpha*op(a)*op(b) + beta*C, column-major layout.
gpuBlasStatus_t gpuBlasSgemm(gpuBlasHandle_t handle, cublasOperation_t ta, cublasOperation_t tb, int m, int n, int k, const float* alpha, const float* a, int lda, const float* b, int ldb, const float* beta, float* c, int ldc) {
return cublasSgemm(handle, ta, tb, m, n, k, alpha, a, lda, b, ldb, beta, c, ldc);
}
#elif defined(HIP)
// AMD backend: the same vendor-neutral gpu*/gpuBlas* names mapped onto the
// HIP runtime and hipBLAS. Mirrors the CUDA branch wrapper-for-wrapper.
#include <hip/hip_runtime.h>
#include <hip/hip_runtime_api.h>
#include <hipblas/hipblas.h>
#include <hiprand/hiprand.h>
// "No transpose" operation code forwarded to gpuBlasSgemm.
#define GPUBLAS_OP_N HIPBLAS_OP_N
// Vendor-neutral aliases for the runtime error, event, stream, BLAS handle
// and BLAS status types.
typedef hipError_t gpuError_t;
typedef hipEvent_t gpuEvent_t;
typedef hipStream_t gpuStream_t;
typedef hipblasHandle_t gpuBlasHandle_t;
typedef hipblasStatus_t gpuBlasStatus_t;
// Success codes compared against by the error checks in main().
const gpuBlasStatus_t gpuBlasSuccess = HIPBLAS_STATUS_SUCCESS;
const gpuError_t gpuSuccess = hipSuccess;
#define gpuMemcpyHostToDevice hipMemcpyHostToDevice
// Thin pass-through wrappers: each forwards its arguments verbatim to the
// corresponding HIP / hipBLAS call and returns that call's status unchanged.
gpuError_t gpuMemcpy(void* dest, void* src, size_t size, hipMemcpyKind flags) {
return hipMemcpy(dest, src, size, flags);
}
// Allocates unified (managed) memory accessible from both host and device.
gpuError_t gpuMallocManaged(void** ret, size_t size) {
return hipMallocManaged(ret, size);
}
gpuError_t gpuEventCreate(gpuEvent_t* pevent) {
return hipEventCreate(pevent);
}
gpuError_t gpuEventRecord(gpuEvent_t event, hipStream_t stream) {
return hipEventRecord(event, stream);
}
gpuError_t gpuEventSynchronize(gpuEvent_t event) {
return hipEventSynchronize(event);
}
gpuError_t gpuGetLastError() {
return hipGetLastError();
}
// Elapsed time between two recorded events, in milliseconds.
gpuError_t gpuEventElapsedTime(float* t, gpuEvent_t start, gpuEvent_t stop) {
return hipEventElapsedTime(t, start, stop);
}
gpuError_t gpuFree(void* p) {
return hipFree(p);
}
gpuBlasStatus_t gpuBlasCreate(gpuBlasHandle_t* phandle) {
return hipblasCreate(phandle);
}
// Single-precision GEMM: C = alpha*op(a)*op(b) + beta*C, column-major layout.
gpuBlasStatus_t gpuBlasSgemm(gpuBlasHandle_t handle, hipblasOperation_t ta, hipblasOperation_t tb, int m, int n, int k, const float* alpha, const float* a, int lda, const float* b, int ldb, const float* beta, float* c, int ldc) {
return hipblasSgemm(handle, ta, tb, m, n, k, alpha, a, lda, b, ldb, beta, c, ldc);
}
#else
// Exactly one of -D CUDA or -D HIP must be supplied at compile time.
#error "Specify GPU type"
#endif
// Fill the rows*cols matrix at `p` with pseudo-random floats in [0, 1].
// Uses the C rand() generator, so the sequence is reproducible after srand().
void initRand(float* p, int rows, int cols) {
    const int range = 1;  // integer divisor applied to RAND_MAX (kept at 1)
    const int total = rows * cols;
    for (int idx = 0; idx < total; ++idx) {
        p[idx] = (float) rand() / (float) (RAND_MAX / range);
    }
}
// Benchmark driver: times square SGEMM (C = alpha*A*B + beta*C) for matrix
// sizes from `lower` to `upper`, doubling each step, averaged over `reps`
// repetitions. Options: -l lower, -u upper, -n num (parsed but unused),
// -r reps, -v (enable verbose output). Prints the average seconds per size.
int main(int argc, char* argv[]) {
    int status, lower, upper, num, reps, verbose;
    lower = 256;
    upper = 8192;
    num = 25000;  // accepted via -n for interface compatibility; not used below
    reps = 5;
    verbose = 0;
    while ((status = getopt(argc, argv, "l:u:n:r:v")) != -1) {
        switch (status) {
        case 'l':
            lower = strtoul(optarg, 0, 0);
            break;
        case 'u':
            upper = strtoul(optarg, 0, 0);
            break;
        case 'n':
            num = strtoul(optarg, 0, 0);
            break;
        case 'r':
            reps = strtoul(optarg, 0, 0);
            break;
        case 'v':
            // BUG FIX: 'v' is declared without ':' in the optstring, so optarg
            // is NULL here; the previous strtoul(optarg, ...) dereferenced NULL.
            verbose = 1;
            break;
        default:
            std::cerr << "invalid argument: " << status << std::endl;
            exit(1);
        }
    }
    if (verbose) {
        std::cout << "Running with" << " lower: " << lower << " upper: " << upper << " num: " << num << " reps: " << reps << std::endl;
    }
    if (verbose) {
        std::cout << "initializing inputs" << std::endl;
    }
    gpuBlasHandle_t handle;
    gpuBlasStatus_t blasStatus;
    blasStatus = gpuBlasCreate(&handle);
    if (blasStatus != gpuBlasSuccess) {
        std::cerr << "Could not create blas handle " << blasStatus << std::endl;
        exit(1);
    }
    // Use size_t arithmetic: upper * upper was previously computed in int and
    // overflows for upper > 46340.
    const size_t elems = (size_t) upper * (size_t) upper;
    const size_t bytes = elems * sizeof(float);
    float* A = (float*) calloc(elems, sizeof(float));
    float* B = (float*) calloc(elems, sizeof(float));
    float* C = (float*) calloc(elems, sizeof(float));
    if (!A || !B || !C) {
        std::cerr << "Could not allocate host memory; size: " << bytes << std::endl;
        exit(1);
    }
    initRand(A, upper, upper);
    initRand(B, upper, upper);
    initRand(C, upper, upper);
    float* dA, *dB, *dC;
    gpuError_t err;
    int lda, ldb, ldc, m, n, k;
    float alpha = 1.0f, beta = 0.0f;
    // Exit on allocation/copy failure: the old code only logged the error and
    // then went on to use the invalid device pointers.
    err = gpuMallocManaged((void**) &dA, bytes);
    if (err != gpuSuccess) {
        std::cerr << "Could not allocate GPU memory; size: " << bytes << std::endl;
        exit(1);
    }
    err = gpuMallocManaged((void**) &dB, bytes);
    if (err != gpuSuccess) {
        std::cerr << "Could not allocate GPU memory; size: " << bytes << std::endl;
        exit(1);
    }
    err = gpuMallocManaged((void**) &dC, bytes);
    if (err != gpuSuccess) {
        std::cerr << "Could not allocate GPU memory; size: " << bytes << std::endl;
        exit(1);
    }
    err = gpuMemcpy(dA, A, bytes, gpuMemcpyHostToDevice);
    if (err != gpuSuccess) {
        std::cerr << "Could not copy to GPU memory; size: " << bytes << std::endl;
        exit(1);
    }
    err = gpuMemcpy(dB, B, bytes, gpuMemcpyHostToDevice);
    if (err != gpuSuccess) {
        std::cerr << "Could not copy to GPU memory; size: " << bytes << std::endl;
        exit(1);
    }
    err = gpuMemcpy(dC, C, bytes, gpuMemcpyHostToDevice);
    if (err != gpuSuccess) {
        std::cerr << "Could not copy to GPU memory; size: " << bytes << std::endl;
        exit(1);
    }
    gpuEvent_t start, stop;
    gpuEventCreate(&start);
    gpuEventCreate(&stop);
    // Untimed warm-up GEMM so the first timed repetition does not include
    // one-time library initialization / kernel-load cost.
    blasStatus = gpuBlasSgemm(handle, GPUBLAS_OP_N, GPUBLAS_OP_N, lower, lower, lower, &alpha, dA, lower, dB, lower, &beta, dC, lower);
    if (blasStatus != gpuBlasSuccess) {
        std::cerr << "gpuBlasSgemm failed: " << blasStatus << std::endl;
        exit(1);
    }
    for (int s = lower; s <= upper; s = s * 2) {
        double sum = 0.0;  // accumulated elapsed seconds over all reps
        for (int r = 0; r < reps; r++) {
            gpuEventRecord(start, 0);
            m = n = k = s;
            lda = m; ldb = k; ldc = m;  // column-major, no transpose
            blasStatus = gpuBlasSgemm(handle, GPUBLAS_OP_N, GPUBLAS_OP_N, m, n, k, &alpha, dA, lda, dB, ldb, &beta, dC, ldc);
            gpuEventRecord(stop, 0);
            gpuEventSynchronize(stop);  // GEMM is async; wait before reading the timer
            if (blasStatus != gpuBlasSuccess) {
                std::cerr << "gpuBlasSgemm failed: " << blasStatus << std::endl;
                exit(1);
            }
            err = gpuGetLastError();
            if (err != gpuSuccess) {
                std::cerr << "gpu error: " << err << std::endl;
                exit(1);
            }
            float elapsed;
            gpuEventElapsedTime(&elapsed, start, stop);  // milliseconds
            elapsed /= 1000.0f;                          // -> seconds
            sum += elapsed;
        }
        std::cout << "size " << s << " average " << sum / reps << " s " << std::endl;
    }
    gpuFree(dA); gpuFree(dB); gpuFree(dC);
    free(A); free(B); free(C);
    // NOTE(review): the blas handle and the two events are never destroyed;
    // tolerable for a one-shot benchmark since the process exits here.
}