linnanwang / BLASX
A heterogeneous multi-GPU level-3 BLAS library
The paper on arXiv is already three years old. Which version of cuBLAS-XT was used to create the performance comparison charts? Is it still the case that BLASX outperforms cuBLAS-XT? Wouldn't it be very interesting to have updated plots against current versions of cuBLAS-XT?
Hello, I still hit errors when applying the library to a large-matrix GEMM on multiple GPUs. I need a library that can replace cublasXt and execute large-scale matrix GEMM on multiple GPUs, so I tested Dgemm with 50000x50000 by 50000x50000 matrices. In the end, I got a segmentation fault. I tried to figure out what the problem was, but it is hard for me to go deep; I only found that the cublasGetMatrixAsync and cublasSetMatrixAsync calls in blasx_dgemm.c seem to be wrong. Here is the gdb info about this.
If you can help, I really appreciate it.
I am wondering how the GEMM is implemented. Is it like this: CPU RAM stores all of matrices A and B; suppose we have 2 GPUs, we send A(i, k) and B(k, j) to GPU0, iterate over all possible k, and obtain C(i, j) on GPU0, and similarly on GPU1; then we concatenate the results? If it is more complicated than that, do you have any reference paper?
Thank you!!!
I've modified the gemm-example to use dgemm only, with matrices of dimension 30000x30000.
On a server with 4 GTX Titan cards the program produces a segfault. It seems that there
aren't any checks for available device memory.
nvidia-smi:
| 0 30252 C ./gemm 6067MiB |
| 1 30252 C ./gemm 6067MiB |
| 2 30252 C ./gemm 6067MiB |
| 3 30252 C ./gemm 6067MiB |
Hello!
This library looks great, but I was wondering if it also has multi-threaded CPU BLAS capabilities. Reading through the code for some of the *gemm files, it almost appears to be the case.
I'm trying to run a benchmark on AWS comparing g2 and c4 instances, and I was hoping to find a way to write a single code base that performs the same function on both instance types.
This might be a naive question... The paper mentions that a GPU task can be bound to a CPU thread. Are there any references that discuss this in more detail, or what keywords should I use to search on Google? I assume it is about multi-GPU work allocation?
Thank you!!!
Running the testing/gemm.c with only sgemm (commenting out the dgemm code) and larger matrices:

int loop = 0;
for (loop = 1; loop < 2; loop++) {
    int M = 10000;
    int N = M;
    int K = M;
    float alpha_f = (float)(((double) rand()/(double)RAND_MAX)*10)+1;
    float beta_f  = (float)(((double) rand()/(double)RAND_MAX)*10)+1;
    float *A_f, *B_f, *C_f;
    A_f = (float *)malloc(sizeof(float)*M*K);
    B_f = (float *)malloc(sizeof(float)*K*N);
    C_f = (float *)malloc(sizeof(float)*M*N);
    Fill_Float(A_f, M, K);
    Fill_Float(B_f, K, N);
    Fill_Float(C_f, M, N);
    fprintf(stderr, "START");
    cudaProfilerStart();
    cblas_sgemm(CblasColMajor, CblasNoTrans, CblasNoTrans, M, N, K,
                alpha_f, A_f, M,
                B_f, K,
                beta_f, C_f, M);
    cudaProfilerStop();
    fprintf(stderr, "END");
    free(A_f);
    free(B_f);
    free(C_f);
}
shows very little multi-GPU use in nvprof with my 4 Titan X (Pascal) cards. Also, even discounting matrix filling, there is a lot of wait time before any GEMM work starts.
I forced type=3 in blas/sgemm.c so that BLASX is always used; this should therefore be all BLASX and no CPU BLAS.
See changes in my forked version: https://github.com/pseudotensor/BLASX (commit a9b2293).
Without those changes, all the inlined functions would not be found during linking.