Comments (3)
Hi @fommil
I started answering this thread/issue when I responded to your comments in Issue #15. I wanted to follow up with extra comments here.
Our BLAS (and likewise FFT) API is designed the way it is to unlock the full potential of the heterogeneous platforms that OpenCL targets. To do that, our library cannot make any assumptions about where the client's data resides. The library is therefore designed to let the OpenCL client manage its own data completely: the client controls where data lives and when it is transferred between the host and the device. Our APIs assume nothing, and we are not going to give up this flexibility in the clMath APIs; it has to be this way for performance.
However, I completely see the value in interfaces that are easier to use: interfaces that let programmers prototype functionality more quickly, or that simply give more developers access to the power of heterogeneous computing that is sitting in everybody's desktops and gaming rigs. Products like AccelerEyes' ArrayFire, or the personal project you linked above, have an important role in fleshing out the software ecosystem. I hope to see solutions for everybody in the future, where developers can understand the tradeoffs between ease of use and performance and make the decisions that are right for their own compute needs.
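To make "the client controls where data lives" concrete, here is a minimal host-only sketch that counts host/device copies. `DeviceBuffer`, `upload`, `download`, `device_gemm` and `chained_gemm` are hypothetical stand-ins, not clBLAS functions; in real code the equivalent steps would be `clCreateBuffer`/`clEnqueueWriteBuffer`, `clblasDgemm` on `cl_mem` handles, and `clEnqueueReadBuffer`. Because the caller decides when to copy, two chained GEMMs cost two uploads and one download instead of a round trip per call.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for device memory (not part of clBLAS).
 * Fixed capacity of 64 doubles, so matrices up to 8 x 8; it only
 * exists to make host<->device copies visible and countable. */
typedef struct {
    double data[64];
    size_t n;
} DeviceBuffer;

static int uploads = 0, downloads = 0;

/* Host -> "device" copy (real code: clEnqueueWriteBuffer). */
static void upload(DeviceBuffer *dst, const double *src, size_t n) {
    memcpy(dst->data, src, n * sizeof *src);
    dst->n = n;
    uploads++;
}

/* "Device" -> host copy (real code: clEnqueueReadBuffer). */
static void download(double *dst, const DeviceBuffer *src) {
    memcpy(dst, src->data, src->n * sizeof *dst);
    downloads++;
}

/* C = A * B for n x n row-major matrices already resident "on the device"
 * (real code: clblasDgemm on cl_mem handles). */
static void device_gemm(size_t n, const DeviceBuffer *a, const DeviceBuffer *b,
                        DeviceBuffer *c) {
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++) {
            double s = 0.0;
            for (size_t k = 0; k < n; k++)
                s += a->data[i * n + k] * b->data[k * n + j];
            c->data[i * n + j] = s;
        }
    c->n = n * n;
}

/* D = (A * B) * B with the intermediate kept on the device:
 * two uploads and one download, however many GEMMs are chained. */
static void chained_gemm(size_t n, const double *A, const double *B, double *D) {
    DeviceBuffer a, b, tmp, d;
    upload(&a, A, n * n);
    upload(&b, B, n * n);
    device_gemm(n, &a, &b, &tmp);   /* intermediate never returns to the host */
    device_gemm(n, &tmp, &b, &d);
    download(D, &d);
}
```

A Fortran-style interface would instead copy all operands in and the result out on every call, which is exactly the overhead the explicit design lets callers avoid.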
from clblas.
@kknox it's not really just about being "easy to use". This is a compatibility issue: if you create a BLAS implementation, one certainly expects to be able to use the BLAS API to interface with it. What you've created is great, but it's not BLAS. It's BLAS-like.
To take a leaf out of LAPACKE's book (the official C API to LAPACK): it offers a version that lets the user control memory allocation and an easy-to-use one that simply creates arrays on demand. I believe you should offer something similar, i.e. an API that matches Fortran BLAS exactly, alongside the one you're currently exposing. Surely there must be a way to set up the GPU device when the shared library is loaded and shut it down cleanly on exit (or on abnormal exit / segfault!), so that the only overhead is the memory transfer. From my experiments with cuBLAS (whose API I believe is closer to the latter), the memory transfer starts to become negligible compared to the computational benefit for arrays of 100,000 elements or more.
(I am aware of, and clearly see, the performance advantage of leaving the arrays in GPU memory. That is another level of optimisation that people may wish to make, but it requires significant source code changes.)
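A rough back-of-envelope estimate shows why the transfer cost shrinks relative to compute as arrays grow. The assumptions here are mine, not measured: square n × n matrices, "elements" meaning elements per matrix, and A, B and C each copied to the device once and C copied back once per DGEMM call.

```latex
\text{flops} \approx 2n^3, \qquad
\text{doubles moved} \approx 4n^2 \quad (A,\ B,\ C \text{ in};\ C \text{ out})

\frac{\text{flops}}{\text{doubles moved}} \approx \frac{2n^3}{4n^2} = \frac{n}{2},
\qquad n = \sqrt{10^5} \approx 316
\;\Rightarrow\; \approx 158 \text{ flops per double} \approx 20 \text{ flops per byte}
```

The ratio grows linearly with n. Whether roughly 20 flops per byte is enough to hide the copy depends on the device's compute throughput relative to its host-link bandwidth, so the 100,000-element threshold is best read as an observation for one particular setup.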
NOTE: I may well have fallen victim to NULL operations with cuBLAS as well. When I'm next at my desktop machine, I'm going to run correctness tests alongside the performance runs, to make sure that the DGEMM is actually taking place.
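One way to confirm that a GEMM really ran (rather than returning without doing anything) is to prefill the output with NaN, call the library with beta = 0, and compare against a naive reference. The sketch below assumes column-major storage as in Fortran BLAS; `ref_dgemm` and `results_match` are illustrative helpers, not cuBLAS or clBLAS functions. With beta = 0 the reference overwrites C rather than scaling it, following the reference BLAS convention, so a NaN left in the output can only mean the call never wrote it.

```c
#include <math.h>
#include <stddef.h>

/* Naive reference: C = alpha*A*B + beta*C, column-major,
 * A is m x k, B is k x n, C is m x n. */
static void ref_dgemm(size_t m, size_t n, size_t k, double alpha,
                      const double *A, size_t lda, const double *B, size_t ldb,
                      double beta, double *C, size_t ldc) {
    for (size_t j = 0; j < n; j++)
        for (size_t i = 0; i < m; i++) {
            double s = 0.0;
            for (size_t p = 0; p < k; p++)
                s += A[i + p * lda] * B[p + j * ldb];
            /* beta == 0 overwrites C, so stale or NaN values are not propagated */
            double prev = (beta == 0.0) ? 0.0 : beta * C[i + j * ldc];
            C[i + j * ldc] = alpha * s + prev;
        }
}

/* Returns 1 if every element of got is within tol of want, 0 otherwise.
 * The test is written as !(d <= tol) so a NaN (an element never written)
 * counts as a mismatch. */
static int results_match(size_t m, size_t n, const double *got, size_t ldg,
                         const double *want, size_t ldw, double tol) {
    for (size_t j = 0; j < n; j++)
        for (size_t i = 0; i < m; i++) {
            double d = fabs(got[i + j * ldg] - want[i + j * ldw]);
            if (!(d <= tol))
                return 0;
        }
    return 1;
}
```

In a real run, the output buffer would be filled with `NAN` before the library's DGEMM is called with beta = 0, the result copied back to the host, and then checked with `results_match` against `ref_dgemm`'s output.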
Closing old clBLAS issues for the new year
This idea of creating an API to match the BLAS API should be handled in a wrapper to the clBLAS library, or implemented in another library altogether. The issue of managing OpenCL state is complicated and was not a design goal of this project.
It is entirely possible to create a project that matches the BLAS API and completely hides the OpenCL details from the end user, and this new wrapper/library could call into clBLAS to implement and manage the OpenCL kernels.
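A minimal sketch of such a wrapper, assuming a Fortran-style entry point `dgemm_` (trailing-underscore name mangling as used by gfortran and many other compilers, hidden string-length arguments omitted). Device setup happens lazily on the first call and teardown is registered with `atexit`, a portable approximation of "set up on library load, shut down on exit"; note that `atexit` handlers do not run after a segfault. The backend here is a host loop standing in for the real work, and `ensure_backend`/`backend_teardown` are hypothetical names; comments mark where a real wrapper would create the OpenCL context and queue, call `clblasSetup()`/`clblasTeardown()`, move buffers, and call `clblasDgemm`.

```c
#include <stddef.h>
#include <stdlib.h>

static int backend_ready = 0;
static int setup_calls = 0;   /* counts initialisations, for testing only */

static void backend_teardown(void) {
    /* Real wrapper: clblasTeardown(), then release the queue and context. */
    backend_ready = 0;
}

/* Lazy one-time initialisation; not thread-safe
 * (a real wrapper might use pthread_once or similar). */
static void ensure_backend(void) {
    if (backend_ready)
        return;
    /* Real wrapper: choose a device, create a context and command queue,
     * then call clblasSetup(). */
    backend_ready = 1;
    setup_calls++;
    if (setup_calls == 1)
        atexit(backend_teardown);
}

static int is_trans(char t) {
    return t == 'T' || t == 't' || t == 'C' || t == 'c';
}

/* Fortran BLAS-compatible DGEMM: C = alpha*op(A)*op(B) + beta*C, column-major. */
void dgemm_(const char *transa, const char *transb,
            const int *m, const int *n, const int *k,
            const double *alpha, const double *a, const int *lda,
            const double *b, const int *ldb,
            const double *beta, double *c, const int *ldc) {
    ensure_backend();
    /* Real wrapper: copy a, b (and c if beta != 0) into cl_mem buffers,
     * call clblasDgemm(clblasColumnMajor, ...), read c back.
     * The host loop below stands in for that device work. */
    int ta = is_trans(*transa), tb = is_trans(*transb);
    for (int j = 0; j < *n; j++)
        for (int i = 0; i < *m; i++) {
            double s = 0.0;
            for (int p = 0; p < *k; p++) {
                double av = ta ? a[p + (size_t)i * *lda] : a[i + (size_t)p * *lda];
                double bv = tb ? b[j + (size_t)p * *ldb] : b[p + (size_t)j * *ldb];
                s += av * bv;
            }
            double *cij = &c[i + (size_t)j * *ldc];
            *cij = (*beta == 0.0) ? *alpha * s : *alpha * s + *beta * *cij;
        }
}
```

Existing code that calls Fortran BLAS would link against this unchanged, paying only the per-call transfer cost discussed earlier in the thread.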