
blasx's People

Contributors

eddy16112, linnanwang


blasx's Issues

performance comparison update

The paper on arXiv is already three years old. Which version of cuBLAS-XT was used to create the performance-comparison charts? Is it still the case that BLASX outperforms cuBLAS-XT? Wouldn't it be very interesting to have updated plots against current versions of cuBLAS-XT?

large scale Dgemm segmentation fault

Hello, there are still errors when applying the library to large-matrix GEMM on multiple GPUs. I need a library that can replace cublasXt for large-scale multi-GPU GEMM, so I tested Dgemm with 50000x50000 matrices and in the end got a segmentation fault. I tried to figure out what the problem was, but it is hard for me to go deeper; I only found that the calls to cublasGetMatrixAsync and cublasSetMatrixAsync in blasx_dgemm.c seem to be where it fails. Here is the gdb info about this:
[screenshot of gdb backtrace]
If you can help, I really appreciate it.
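For scale, a quick back-of-the-envelope check (my own arithmetic, not something from this report): three 50000x50000 double matrices already need about 60 GB of host memory, and every tile streamed to a GPU must additionally fit in device memory, so memory exhaustion is a plausible trigger for the crash. A minimal sketch of the arithmetic:

#include <stdio.h>

int main(void) {
    long long n = 50000;  /* matrix edge from the report above */
    long long per_matrix = n * n * (long long)sizeof(double);
    printf("one matrix: %lld bytes (%.1f GB)\n", per_matrix, per_matrix / 1e9);
    printf("A + B + C : %.1f GB\n", 3.0 * per_matrix / 1e9);
    return 0;
}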

How is GEMM implemented?

I am wondering how GEMM is implemented. Is it like this: CPU RAM stores all of matrices A and B; suppose we have 2 GPUs, we send tiles A(i,k) and B(k,j) to GPU0, iterate over all possible k, and accumulate a tile C(i,j) on GPU0, and similarly on GPU1; then we concatenate the results? If it is more complicated than that, do you have any reference paper?
Thank you!!!
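The decomposition described in the question is the standard 2D tiling of GEMM. A minimal single-threaded sketch of that tiling (illustrative only, not BLASX's actual scheduler; the tile size T is an arbitrary choice and n is assumed divisible by T) looks like this, where a multi-GPU runtime would assign each (i, j) tile of C to a GPU and stream the k-loop through it:

#define T 256  /* tile edge, arbitrary for the sketch; assumes n % T == 0 */

/* Blocked C(i,j) += sum over k of A(i,k) * B(k,j), column-major, n x n. */
void blocked_gemm(int n, const double *A, const double *B, double *C) {
    for (int i = 0; i < n; i += T)
        for (int j = 0; j < n; j += T)
            for (int k = 0; k < n; k += T)          /* accumulate over k */
                for (int jj = j; jj < j + T; jj++)
                    for (int kk = k; kk < k + T; kk++)
                        for (int ii = i; ii < i + T; ii++)
                            C[ii + jj*n] += A[ii + kk*n] * B[kk + jj*n];
}

Note that in this scheme each GPU owns disjoint tiles of C, so copying them back to host memory already assembles the full product; no separate concatenation step is needed.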

segfault for large matrices in dgemm

I've modified the gemm-example to use dgemm only, with matrices of dimension 30000x30000. On a server with 4 GTX Titan cards, the program produces a segfault. It seems that there aren't any checks on available device memory (a possible guard is sketched after the nvidia-smi output below).

nvidia-smi:

| 0 30252 C ./gemm 6067MiB |
| 1 30252 C ./gemm 6067MiB |
| 2 30252 C ./gemm 6067MiB |
| 3 30252 C ./gemm 6067MiB |
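The 6067 MiB per card shown above is essentially the entire 6 GB of a GTX Titan, which is consistent with the allocator running out of device memory. A sketch of the kind of guard the report asks for (a hypothetical helper, not part of BLASX): query free memory with cudaMemGetInfo before committing an allocation, so oversubscription fails with a message instead of a segfault.

#include <cuda_runtime.h>
#include <stdio.h>

/* Returns 1 if tile_bytes fits in the free memory of device dev. */
int tile_fits_on_device(int dev, size_t tile_bytes) {
    size_t free_b = 0, total_b = 0;
    if (cudaSetDevice(dev) != cudaSuccess) return 0;
    if (cudaMemGetInfo(&free_b, &total_b) != cudaSuccess) return 0;
    fprintf(stderr, "GPU %d: %zu of %zu bytes free\n", dev, free_b, total_b);
    return tile_bytes <= free_b;  /* real code would keep extra headroom */
}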

CPU Level Parallelism

Hello!

This library looks great, but I was wondering if it also has multi-threaded CPU BLAS capabilities. Reading through the code for some of the *gemm files, it almost appears to be the case.

I'm trying to run a benchmark on AWS between g2 and c4 instances, and I was hoping to write a single code base that performs the same function on the two different instance types.
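Since BLASX exposes the standard CBLAS interface, one way to get a single code base (a sketch under that assumption; in practice you would include the backend's cblas.h rather than declare the prototype by hand as done here) is to write against cblas_dgemm and pick the library at link time: link BLASX on the g2 GPU instance and a threaded CPU BLAS such as OpenBLAS on the c4 instance, with no source changes.

#include <stdlib.h>

/* Standard CBLAS declarations, normally pulled in via <cblas.h>. */
enum CBLAS_ORDER     { CblasRowMajor = 101, CblasColMajor = 102 };
enum CBLAS_TRANSPOSE { CblasNoTrans = 111, CblasTrans = 112, CblasConjTrans = 113 };
void cblas_dgemm(enum CBLAS_ORDER order, enum CBLAS_TRANSPOSE ta,
                 enum CBLAS_TRANSPOSE tb, int m, int n, int k,
                 double alpha, const double *a, int lda,
                 const double *b, int ldb,
                 double beta, double *c, int ldc);

int main(void) {
    int n = 1024;
    double *A = calloc((size_t)n * (size_t)n, sizeof(double));
    double *B = calloc((size_t)n * (size_t)n, sizeof(double));
    double *C = calloc((size_t)n * (size_t)n, sizeof(double));
    /* C = 1.0 * A * B + 0.0 * C; the linked library decides CPU vs. GPU. */
    cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                n, n, n, 1.0, A, n, B, n, 0.0, C, n);
    free(A); free(B); free(C);
    return 0;
}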

How are GPU tasks implemented by CPU threads?

This might be a naive question. The paper mentions that a GPU task can be bound to a CPU thread. Are there any references that discuss this in more detail, or what keywords should I use to search on Google? It seems to be about multi-GPU allocation.

Thank you!!!
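A common pattern behind that phrasing (a generic sketch of the technique, not necessarily BLASX's exact runtime) is one persistent host thread per GPU: each thread calls cudaSetDevice once, after which every CUDA call it makes targets that device, so "binding a GPU task to a CPU thread" means handing the task to that device's worker thread. Useful search terms: "one host thread per GPU", "worker thread per device", "multi-GPU task scheduling".

#include <pthread.h>
#include <cuda_runtime.h>
#include <stdio.h>

/* Stand-in for the runtime's task loop: a real scheduler would pop tiles
   off a shared queue here and issue kernels/copies for this device. */
static void run_tasks_for_device(int dev) {
    printf("worker bound to GPU %d\n", dev);
}

static void *gpu_worker(void *arg) {
    int dev = (int)(long)arg;
    cudaSetDevice(dev);  /* all CUDA calls on this thread now target dev */
    run_tasks_for_device(dev);
    return NULL;
}

int main(void) {
    int ndev = 0;
    cudaGetDeviceCount(&ndev);
    if (ndev > 16) ndev = 16;
    pthread_t tid[16];
    for (int d = 0; d < ndev; d++)
        pthread_create(&tid[d], NULL, gpu_worker, (void *)(long)d);
    for (int d = 0; d < ndev; d++)
        pthread_join(tid[d], NULL);
    return 0;
}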

nvprof profile shows excessive waiting and lack of multi-GPU use

Running the testing/gemm.c with only sgemm (commenting out dgemm code) and larger matrices:

int loop = 0;
for (loop = 1; loop < 2; loop++) {
    int M = 10000;
    int N = M;
    int K = M;
    float alpha_f = (float)(((double) rand()/(double)RAND_MAX)*10)+1;
    float beta_f  = (float)(((double) rand()/(double)RAND_MAX)*10)+1;
    float *A_f, *B_f, *C_f;
    A_f = (float*)malloc(sizeof(float)*M*K);
    B_f = (float*)malloc(sizeof(float)*K*N);
    C_f = (float*)malloc(sizeof(float)*M*N);
    Fill_Float(A_f,M,K);
    Fill_Float(B_f,K,N);
    Fill_Float(C_f,M,N);
    fprintf(stderr,"START");
    cudaProfilerStart();
    cblas_sgemm(CblasColMajor,CblasNoTrans,CblasNoTrans,M,N,K,
                alpha_f,A_f,M,
                B_f,K,
                beta_f,C_f,M);
    cudaProfilerStop();
    fprintf(stderr,"END");
    free(A_f);
    free(B_f);
    free(C_f);
}

nvprof shows very little multi-GPU use across my 4 Titan X (Pascal) cards. Also, even discounting matrix filling, there is still a lot of wait time before any GEMM work starts.

I forced type=3 in blas/sgemm.c so that the BLASX path is always taken; this should be all BLASX and no CPU BLAS.

[nvprof timeline screenshot: gemm2]
