
blasx's People

Contributors

eddy16112, linnanwang


blasx's Issues

performance comparison update

The paper on arXiv is already three years old. Which version of cuBLAS-XT was used to create the performance-comparison charts? Is it still the case that BLASX outperforms cuBLAS-XT? Wouldn't it be very interesting to have updated plots against current versions of cuBLAS-XT?

large scale Dgemm segmentation fault

Hello, there are still errors when applying the library to large-matrix GEMM on multiple GPUs. I need a library that can replace cublasXt for large-scale multi-GPU GEMM, so I tested Dgemm with 50000x50000 matrices and in the end got a segmentation fault. I tried to figure out what the problem was, but it is hard for me to go deeper; I only found that the calls to cublasGetMatrixAsync and cublasSetMatrixAsync in blasx_dgemm.c seem to be where it fails. Here is the gdb info about this:
[screenshot of gdb backtrace]
If you can help, I really appreciate it.
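For scale, a quick back-of-the-envelope check (my own arithmetic, not something from this report): three 50000x50000 double matrices already need about 60 GB of host memory, and every tile streamed to a GPU must additionally fit in device memory, so memory exhaustion is a plausible trigger for the crash. A minimal sketch of the arithmetic:

#include <stdio.h>

int main(void) {
    long long n = 50000;  /* matrix edge from the report above */
    long long per_matrix = n * n * (long long)sizeof(double);
    printf("one matrix: %lld bytes (%.1f GB)\n", per_matrix, per_matrix / 1e9);
    printf("A + B + C : %.1f GB\n", 3.0 * per_matrix / 1e9);
    return 0;
}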

How is GEMM implemented?

I am wondering how GEMM is implemented. Is it like this: CPU RAM stores all of matrices A and B; suppose we have 2 GPUs, we send tiles A(i,k) and B(k,j) to GPU0, iterate over all possible k, and accumulate a tile C(i,j) on GPU0, and similarly on GPU1; then we concatenate the results? If it is more complicated than that, do you have any reference paper?
Thank you!!!
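The decomposition described in the question is the standard 2D tiling of GEMM. A minimal single-threaded sketch of that tiling (illustrative only, not BLASX's actual scheduler; the tile size T is an arbitrary choice and n is assumed divisible by T) looks like this, where a multi-GPU runtime would assign each (i, j) tile of C to a GPU and stream the k-loop through it:

#define T 256  /* tile edge, arbitrary for the sketch; assumes n % T == 0 */

/* Blocked C(i,j) += sum over k of A(i,k) * B(k,j), column-major, n x n. */
void blocked_gemm(int n, const double *A, const double *B, double *C) {
    for (int i = 0; i < n; i += T)
        for (int j = 0; j < n; j += T)
            for (int k = 0; k < n; k += T)          /* accumulate over k */
                for (int jj = j; jj < j + T; jj++)
                    for (int kk = k; kk < k + T; kk++)
                        for (int ii = i; ii < i + T; ii++)
                            C[ii + jj*n] += A[ii + kk*n] * B[kk + jj*n];
}

Note that in this scheme each GPU owns disjoint tiles of C, so copying them back to host memory already assembles the full product; no separate concatenation step is needed.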

segfault for large matrices in dgemm

I've modified the gemm-example to use dgemm only, with matrices of dimension 30000x30000. On a server with 4 GTX Titan cards, the program produces a segfault. It seems that there aren't any checks on available device memory (a possible guard is sketched after the nvidia-smi output below).

nvidia-smi:

| 0 30252 C ./gemm 6067MiB |
| 1 30252 C ./gemm 6067MiB |
| 2 30252 C ./gemm 6067MiB |
| 3 30252 C ./gemm 6067MiB |
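The 6067 MiB per card shown above is essentially the entire 6 GB of a GTX Titan, which is consistent with the allocator running out of device memory. A sketch of the kind of guard the report asks for (a hypothetical helper, not part of BLASX): query free memory with cudaMemGetInfo before committing an allocation, so oversubscription fails with a message instead of a segfault.

#include <cuda_runtime.h>
#include <stdio.h>

/* Returns 1 if tile_bytes fits in the free memory of device dev. */
int tile_fits_on_device(int dev, size_t tile_bytes) {
    size_t free_b = 0, total_b = 0;
    if (cudaSetDevice(dev) != cudaSuccess) return 0;
    if (cudaMemGetInfo(&free_b, &total_b) != cudaSuccess) return 0;
    fprintf(stderr, "GPU %d: %zu of %zu bytes free\n", dev, free_b, total_b);
    return tile_bytes <= free_b;  /* real code would keep extra headroom */
}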

CPU Level Parallelism

Hello!

This library looks great, but I was wondering if it also has multi-threaded CPU BLAS capabilities. Reading through the code for some of the *gemm files, it almost appears to be the case.

I'm trying to run a benchmark on AWS between g2 and c4 instances, and I was hoping to write a single code base that performs the same function on the two different instance types.
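Since BLASX exposes the standard CBLAS interface, one way to get a single code base (a sketch under that assumption; in practice you would include the backend's cblas.h rather than declare the prototype by hand as done here) is to write against cblas_dgemm and pick the library at link time: link BLASX on the g2 GPU instance and a threaded CPU BLAS such as OpenBLAS on the c4 instance, with no source changes.

#include <stdlib.h>

/* Standard CBLAS declarations, normally pulled in via <cblas.h>. */
enum CBLAS_ORDER     { CblasRowMajor = 101, CblasColMajor = 102 };
enum CBLAS_TRANSPOSE { CblasNoTrans = 111, CblasTrans = 112, CblasConjTrans = 113 };
void cblas_dgemm(enum CBLAS_ORDER order, enum CBLAS_TRANSPOSE ta,
                 enum CBLAS_TRANSPOSE tb, int m, int n, int k,
                 double alpha, const double *a, int lda,
                 const double *b, int ldb,
                 double beta, double *c, int ldc);

int main(void) {
    int n = 1024;
    double *A = calloc((size_t)n * (size_t)n, sizeof(double));
    double *B = calloc((size_t)n * (size_t)n, sizeof(double));
    double *C = calloc((size_t)n * (size_t)n, sizeof(double));
    /* C = 1.0 * A * B + 0.0 * C; the linked library decides CPU vs. GPU. */
    cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                n, n, n, 1.0, A, n, B, n, 0.0, C, n);
    free(A); free(B); free(C);
    return 0;
}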

How are GPU tasks implemented by CPU threads?

This might be a naive question. The paper mentions that a GPU task can be bound to a CPU thread. Are there any references that discuss this in more detail, or what keywords should I use to search on Google? It seems to be about multi-GPU allocation.

Thank you!!!
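A common pattern behind that phrasing (a generic sketch of the technique, not necessarily BLASX's exact runtime) is one persistent host thread per GPU: each thread calls cudaSetDevice once, after which every CUDA call it makes targets that device, so "binding a GPU task to a CPU thread" means handing the task to that device's worker thread. Useful search terms: "one host thread per GPU", "worker thread per device", "multi-GPU task scheduling".

#include <pthread.h>
#include <cuda_runtime.h>
#include <stdio.h>

/* Stand-in for the runtime's task loop: a real scheduler would pop tiles
   off a shared queue here and issue kernels/copies for this device. */
static void run_tasks_for_device(int dev) {
    printf("worker bound to GPU %d\n", dev);
}

static void *gpu_worker(void *arg) {
    int dev = (int)(long)arg;
    cudaSetDevice(dev);  /* all CUDA calls on this thread now target dev */
    run_tasks_for_device(dev);
    return NULL;
}

int main(void) {
    int ndev = 0;
    cudaGetDeviceCount(&ndev);
    if (ndev > 16) ndev = 16;
    pthread_t tid[16];
    for (int d = 0; d < ndev; d++)
        pthread_create(&tid[d], NULL, gpu_worker, (void *)(long)d);
    for (int d = 0; d < ndev; d++)
        pthread_join(tid[d], NULL);
    return 0;
}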

nvprof profile shows excessive waiting and lack of multi-GPU use

Running the testing/gemm.c with only sgemm (commenting out dgemm code) and larger matrices:

int loop = 0;
for (loop = 1; loop < 2; loop++) {
    int M = 10000;
    int N = M;
    int K = M;
    float alpha_f = (float)(((double) rand()/(double)RAND_MAX)*10)+1;
    float beta_f  = (float)(((double) rand()/(double)RAND_MAX)*10)+1;
    float *A_f, *B_f, *C_f;
    A_f = (float*)malloc(sizeof(float)*M*K);
    B_f = (float*)malloc(sizeof(float)*K*N);
    C_f = (float*)malloc(sizeof(float)*M*N);
    Fill_Float(A_f,M,K);
    Fill_Float(B_f,K,N);
    Fill_Float(C_f,M,N);
    fprintf(stderr,"START");
    cudaProfilerStart();
    cblas_sgemm(CblasColMajor,CblasNoTrans,CblasNoTrans,M,N,K,
                alpha_f,A_f,M,
                B_f,K,
                beta_f,C_f,M);
    cudaProfilerStop();
    fprintf(stderr,"END");
    free(A_f);
    free(B_f);
    free(C_f);
}

nvprof shows very little multi-GPU use across my 4 Titan X (Pascal) cards. Also, even discounting matrix filling, there is still a lot of wait time before any GEMM work starts.

I forced type=3 in blas/sgemm.c so that the BLASX path is always taken; this should be all BLASX and no CPU BLAS.

[nvprof timeline screenshot: gemm2]
