
clsparse's Introduction


Pre-built binaries are available on our releases page

| Build branch | master | develop |
| --- | --- | --- |
| GCC/Clang x64 | Build status | Build status |
| Visual Studio x64 | Build status | Build status |

clSPARSE

clSPARSE is an OpenCL™ library implementing sparse linear algebra routines. This project is the result of a collaboration between AMD Inc. and Vratis Ltd.

What's new in clSPARSE v0.10.1

  • bug fix release
    • Fixes for travis builds
    • Fix to the matrix market reader in the cuSPARSE benchmark to synchronize with the regular MM reader
    • Replace cl.hpp with cl2.hpp (thanks to arrayfire)
    • Fixes for the Nvidia platform; tested 352.79
      • Fixed buffer overruns in CSR-Adaptive kernels
      • Fix invalid memory access on Nvidia GPUs in CSR-Adaptive SpMV kernel

clSPARSE features

  • Sparse Matrix - dense Vector multiply (SpM-dV)
  • Sparse Matrix - dense Matrix multiply (SpM-dM)
  • Sparse Matrix - Sparse Matrix multiply (SpGEMM) - single precision
  • Iterative conjugate gradient solver (CG)
  • Iterative biconjugate gradient stabilized solver (BiCGStab)
  • Dense to CSR conversions (& converse)
  • COO to CSR conversions (& converse)
  • Functions to read matrix market files in COO or CSR format

True to the spirit of the other clMath libraries, clSPARSE exports a “C” interface to allow projects to build wrappers around clSPARSE in any language they need. A great deal of thought and effort went into designing the APIs to make them less ‘cluttered’ compared to the older clMath libraries. OpenCL state is not explicitly passed through the API, which enables the library to be forward compatible when users are ready to switch from OpenCL 1.2 to OpenCL 2.0.[3]
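The opaque-state design described above can be illustrated with a tiny hypothetical sketch (the names below are invented for illustration and are not the real clSPARSE API): clients hold only a forward-declared handle, so the OpenCL state behind it can change without breaking callers.

```cpp
// Hypothetical illustration of the opaque-handle pattern a "C" interface uses;
// these names are NOT the real clSPARSE API.

// Clients see only this forward declaration and the handle typedef.
struct Control;
typedef Control* ControlHandle;

// Inside the library: the definition can evolve (e.g. cl1.2 vs cl2.0 state)
// without affecting the ABI seen by callers.
struct Control {
    int cl_version;   // e.g. 12 or 20, hidden behind the handle
};

extern "C" ControlHandle create_control(int cl_version) {
    return new Control{cl_version};
}
extern "C" int control_cl_version(ControlHandle h) { return h->cl_version; }
extern "C" void release_control(ControlHandle h) { delete h; }
```

Because callers never see the struct layout, the library can swap its internal OpenCL buffer types between versions while the C entry points stay stable.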

Google Groups

Two mailing lists have been created for the clMath projects:

  • [email protected] - group whose focus is to answer questions on using the library or reporting issues

  • [email protected] - group whose focus is for developers interested in contributing to the library code itself

API semantic versioning

Good software is typically the result of iteration and feedback. clSPARSE follows the semantic versioning guidelines, and while the major version number remains '0', the public API should not be considered stable. We release clSPARSE as beta software (0.y.z) early to the community to elicit feedback and comment. This comes with the expectation that with feedback, we may incorporate breaking changes to the API that might require early users to recompile, or rewrite portions of their code as we iterate on the design.

clSPARSE Wiki

The project wiki contains helpful documentation.
A build primer is available, which describes how to use CMake to generate platform-specific build files.

Samples

clSPARSE contains a directory of simple OpenCL samples that demonstrate the use of the API in both C and C++. The superbuild script for clSPARSE also builds the samples as an external project, to demonstrate how an application would find and link to clSPARSE with cmake.

clSPARSE library documentation

API documentation is available at http://clmathlibraries.github.io/clSPARSE/. The samples give an excellent starting point to basic library operations.

Contributing code

Please refer to and read the Contributing document for guidelines on how to contribute code to this open source project. Code in the /master branch is considered to be stable and new library releases are made when commits are merged into /master. Active development and pull-requests should be made to the develop branch.

License

clSPARSE is licensed under the Apache License, Version 2.0

Compiling for Windows

  • Windows® 7/8
  • Visual Studio 2013 and above
  • CMake 2.8.12 (download from Kitware)
    • Solution (.sln) or
    • Nmake makefiles
  • An OpenCL SDK, such as APP SDK 3.0

Compiling for Linux

  • GCC 4.8 and above
  • CMake 2.8.12 (install with distro package manager )
    • Unix makefiles or
    • KDevelop or
    • QT Creator
  • An OpenCL SDK, such as APP SDK 3.0

Compiling for Mac OSX

  • CMake 2.8.12 (install via brew)
    • Unix makefiles or
    • XCode
  • An OpenCL SDK (installed via xcode-select --install)

Bench & Test infrastructure dependencies

  • Googletest v1.7
  • Boost v1.58

Footnotes

[1]: Changed to reflect CppCoreGuidelines: F.21

[2]: Changed to reflect CppCoreGuidelines: NL.8

[3]: OpenCL 2.0 support is not yet fully implemented; only the interfaces have been designed

clsparse's People

Contributors

9prady9, adambrouwersharries, ghisvail, jlgreathouse, jpola, jszuppe, kvaragan, shuaiche


clsparse's Issues

c-based structures to mirror data on host

Right now, our C-based structures assume that all data contained therein is located on the OpenCL device. However, there are instances when it is convenient to be able to access the data on the host; for instance, the alpha and beta scalars, to decide if we should run a more optimal kernel for special cases.

If we incorporate a 'shadow' copy of device data, when is it updated? Do we update it, or does the user? In effect, you get a synchronization problem, because you now maintain two copies of the data.

Is it reasonable to maintain 'shadow' values for scalars as an optimization, but not matrix structures?

Would we store the host data as void*'s, to maintain the typeless structs we currently have?
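One way to make the synchronization question explicit is to carry a staleness flag alongside the shadow value. The sketch below is hypothetical (device storage is simulated with a plain float standing in for the cl_mem buffer):

```cpp
// Hypothetical 'shadow' host copy for a scalar. The staleness flag makes the
// two-copies problem visible in the type; device_value stands in for cl_mem.
struct ShadowScalar {
    float device_value;   // simulated device buffer
    float host_shadow;    // cached host copy
    bool  host_valid;     // false whenever the device copy may be newer

    void write(float v) { device_value = v; host_shadow = v; host_valid = true; }
    void device_touched() { host_valid = false; }     // e.g. a kernel wrote the buffer
    float read_host() {
        if (!host_valid) {                            // lazy "map" back to host
            host_shadow = device_value;
            host_valid = true;
        }
        return host_shadow;
    }
};
```

With this shape, the library updates the flag whenever it enqueues a kernel that writes the scalar, and the host read pays for a map only when the shadow is stale.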

From clsparse<Vector><Scalar>Private to clsparse::vector<T>

Hi,
currently I'm working on a clsparse::vector class which might help us simplify the code and hide the differences between the cl1.2 and cl2.0 implementations. I would like to know your opinion on the idea of substituting the *Private structures with the clsparse::vector class. If we have clsparse::vector, I don't see the need for the *Private classes to exist. If we want to keep both structures, I will have to write proper specialisations for every algorithm, which will actually differ in very small details, like passing a buffer to a kernel. I would like to avoid this unnecessary duplication. The clsparse::vector class is very simple now, but is enough, in my opinion, to substitute the *Private structures.

pros:

  • unifies the code,
  • unifies the way to pass buffers to kernels,
  • manages memory for internal buffers,
  • clsparse::vector == clsparseVectorPrivate,
  • clsparse::vector[1] == clsparseScalarPrivate,
  • easy to use,
  • hides the cl1.2 and cl2.0 differences,
  • good baseline to extend this class for other algorithm requirements (iterators).

cons:

  • don't know yet if it will interfere with the C interface; I think it will not, for the moment.
  • not quite an issue, but all algorithms need to be refactored.
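A minimal host-backed sketch of what the proposed clsparse::vector<T> could look like; in the library the storage would be a cl_mem / cl::Buffer, and the std::vector backing here is purely illustrative:

```cpp
// Illustrative sketch of the proposed clsparse::vector<T>; std::vector stands
// in for device memory, and make_scalar mirrors the "vector[1] == scalar" idea.
#include <cstddef>
#include <vector>

namespace clsparse {

template <typename T>
class vector {
    std::vector<T> buf_;   // placeholder for a device buffer
public:
    explicit vector(std::size_t n, T init = T()) : buf_(n, init) {}
    std::size_t size() const { return buf_.size(); }
    T& operator[](std::size_t i) { return buf_[i]; }
    const T& operator[](std::size_t i) const { return buf_[i]; }
};

// A scalar is just a size-1 vector, as the pros list above suggests.
template <typename T>
vector<T> make_scalar(T v) { return vector<T>(1, v); }

} // namespace clsparse
```

Algorithms would then take clsparse::vector<T> everywhere, removing the need for per-*Private specialisations that differ only in how the buffer reaches the kernel.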

What is your opinion?

Can we remove clBLAS dependency

Just scanned through the code, and we are pulling in the clBLAS library to call clblasSscal(). I think we recently implemented our own cldenseSscale, so I wonder if our dependence on clBLAS is legacy. Can we cut the cord?

Create Readme.md file

An introductory readme file that provides a short explanation of the library and its purpose, and explains its dependencies and requirements. It would be nice to provide as short an example as possible to highlight its API.

Cloud-based automated builds

When clSPARSE is open to the public, integrate Travis and Appveyor to build pull-requests. That should also allow us to integrate build badges into our main readme file for the site.

Update clsparse benchmark app to unify all benchmarks

Currently, for every routine we check into clSPARSE, we have a framework where we need to create a new exe to benchmark the function. Most of the code to measure performance is common to all the benchmarks, and should be highly reusable. Adding a new function should be as easy as updating a few lines to select the new function and call it. Performance metrics should be customizable between functions.

Unit tests refactoring to use gtest_filters and reduce test executables

Similar to the benchmarking app, I think we are currently developing too many test executables. There is a lot of common framework code that belongs to each executable, and they might as well be the same executable where we add more unit tests with gtest_filters. Re: clFFT unit tests.

It might still make sense to group unit tests into 3 main executables, one to test L1 operations, another for L2 operations and a third for L3 operations.

Optimize csr2dense function

Fuse the clEnqueueFillBuffer() kernel function with the custom nnz fill kernel. Launching 1 kernel to make a single pass over the destination buffer should be faster than 2 separate passes.
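The single-pass idea can be sketched on the host: instead of a fill pass followed by a scatter pass, walk each row with a CSR cursor and write every destination element exactly once. This assumes column indices within each row are sorted:

```cpp
// Host-side sketch of the fused csr2dense fill: one pass over the dense
// destination writes either zero or the CSR value, rather than a FillBuffer
// pass followed by a scatter pass. Assumes sorted column indices per row.
#include <cstddef>
#include <vector>

void csr2dense_fused(int m, int n,
                     const std::vector<int>& rowPtr,
                     const std::vector<int>& colInd,
                     const std::vector<float>& vals,
                     std::vector<float>& dense) {   // m*n, row-major
    dense.resize(static_cast<std::size_t>(m) * n);
    for (int r = 0; r < m; ++r) {
        int k = rowPtr[r];                          // CSR cursor for this row
        for (int c = 0; c < n; ++c) {
            float v = 0.0f;
            if (k < rowPtr[r + 1] && colInd[k] == c)
                v = vals[k++];                      // consume the nnz entry
            dense[static_cast<std::size_t>(r) * n + c] = v; // single write per element
        }
    }
}
```

A kernel version would assign threads per dense element or per row; the point is the single write per destination element.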

validateMemObject

Create a function which takes a *Private class and validates the underlying buffer, which might be void* for CL2.0 and cl_mem for lower versions.

Implement cl2.0 methods for the internal vector/array datatypes

Our memory abstraction has come a long way, from #defines to a simple RAII wrapper, to a full fledged pseudo STL like container. Along the way, we did implement a few methods for cl2.0, but they have been dropped as we iterated on the design.

Finish the implementation for cl2.0 clsparse::vector and related types.

Remove C++-ism's from the clsparse.error.hpp

This header uses std::string and throws exceptions. Throwing exceptions is not valid from inside the library, and I think users might like to use the error header file with the handy macros defined. All the strings are const char* anyway, so there is no real need for std::string.

Rename the header file as clsparse-error.h, and make it usable by clients of clsparse to help them convert our error enums into friendly strings.

Use a well-known good reference library to compare against

Right now, our unit tests are written to compare the opencl device result with a hand-written vanilla C++ solution. This was easy to bootstrap with, but as the functions we implement get more complicated, the reference solutions will get more complicated. Eventually, there may be bugs in either implementation.

We should compare ourselves with a respected reference source, either viennacl or boost ublas. Both libraries implement sparse operations, and it should be trivial to wire either of those in as a reference source. Our tests already have a dependency on boost, so maybe it makes sense to use ublas.

Develop cusparse benchmark app to compare performance

We should be able to compare the performance of an operation in our clSPARSE library against an equivalent implementation. cuSPARSE could serve as the spiritual equivalent implementation. This should provide us a metric to know the performance quality of our work.

Implement clsparse::array<T, SIZE> for scalar variables

Looking at the type of our scalar variables, I think we lose a little bit of context: our Alpha & Beta are only ever meant to have size == 1. What I mean is that a scalar value's type, like alpha/beta, looks the same as a true array's. As a potential code enhancement, we should truly implement std::array<> with compile-time size restrictions. The type of our scalars would then look like clsparse::array<T, 1>. We would then rename the current clsparse::array to the more appropriate clsparse::vector.
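A sketch of what this could look like, using std::array as the stand-in (the real types would be device-backed):

```cpp
// Sketch of compile-time-sized scalars: std::array carries the size in the
// type, so a scalar cannot silently decay into an arbitrary-length array.
#include <array>
#include <cstddef>

namespace clsparse {

template <typename T, std::size_t N>
using array = std::array<T, N>;        // device-backed in the real library

template <typename T>
using scalar = array<T, 1>;            // alpha/beta are always size 1

} // namespace clsparse
```

An API taking clsparse::scalar<float> then documents, at the type level, that only one element is ever read.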

Exceptions in cl.hpp

Currently, we enable exceptions in the cl.hpp file. For a C-based interface, exceptions must not propagate outside of the C interface, because C99 does not define exceptions. Either we need to guarantee that we catch all exceptions before they leave our C-based API, or we disable exceptions in cl.hpp.
Preprocessor symbol: __CL_ENABLE_EXCEPTIONS
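The first option implies a catch-all guard at every extern "C" entry point. A sketch of that boundary (the status names are illustrative, not the real clSPARSE error enums):

```cpp
// Sketch of a catch-all C boundary: every entry point funnels through a guard
// that converts exceptions into a status code, so nothing escapes to C99
// callers. Status names are illustrative, not clSPARSE's real enums.
#include <new>          // std::bad_alloc

enum Status { kSuccess = 0, kOutOfMemory = 1, kInternalError = 2 };

template <typename F>
Status guarded(F&& body) {
    try { body(); return kSuccess; }
    catch (const std::bad_alloc&) { return kOutOfMemory; }
    catch (...)                   { return kInternalError; }  // nothing propagates
}

// Example entry point whose internals throw; the caller only sees a status.
extern "C" Status api_entry_that_throws() {
    return guarded([] { throw std::bad_alloc(); });
}
```

Disabling __CL_ENABLE_EXCEPTIONS avoids the guard entirely, at the cost of checking cl_int return codes everywhere internally.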

Add logic to test and bench frameworks to recurse through sparse directories

The superbuild currently downloads sparse files from the Tim Davis online sparse collection database into a sub-directory on the local hard drive. When the archives are uncompressed, they each expand into their own sub-directory. It would be handy for both the test and bench applications to take a root directory as input, and then have the apps recurse into all sub-directories looking for sparse test data.

ClSPARSE Build Issue

I am able to generate the SuperBuild.clSPARSE.sln Visual Studio 2013 solution file
[using the command "cmake.exe -G "Visual Studio 12 2013 Win64""].
When I try to build this workspace, I get errors when the boost library is being built ("Failed to build Boost.Build engine."). From the bootstrap.log file, I can see the error message "LIBCMTD.lib(unlink.obj) : fatal error LNK1112: module machine type 'x64' conflicts with target machine type 'X86'".
How do I rectify this error?
My System:
Windows 7 64 bit OS
Microsoft Visual Studio 2013 Community Edition

Refactor different precision's for clsparseCsrMatrixfromFile()

I believe that the clsparseCsrMatrixfromFile() family of functions should be agnostic to the precision of the data loaded. So we should condense
clsparseSCsrMatrixfromFile()
clsparseDCsrMatrixfromFile()
into a single function that takes an enum to select precision types.
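The condensation can be sketched as a single entry point dispatching on a precision enum; the reader itself is stubbed here, and the names are illustrative rather than the real clSPARSE signatures:

```cpp
// Sketch of the proposed condensation: one function plus a precision enum
// replaces the clsparseS/DCsrMatrixfromFile name variants. loadStub stands in
// for the real matrix market reader.
#include <cstddef>

enum Precision { kSingle, kDouble };

template <typename T>
std::size_t loadStub(const char* /*path*/) {
    return sizeof(T);   // stand-in: the real reader would parse the file as T
}

std::size_t csrMatrixFromFile(const char* path, Precision p) {
    // Single dispatch point on the enum; callers no longer pick a name.
    return p == kSingle ? loadStub<float>(path) : loadStub<double>(path);
}
```

The template body stays shared, so the precision-agnostic parsing logic is written once.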

Array library GPU device memory abstraction

The library is providing interfaces for both cl1.2 and cl2.0, and these two interfaces have very different memory buffer management API's. The library has attempted to abstract the differences between these API's with an internal class called clMemRAII, but we can do better. clMemRAII is being upgraded over time with new functionality, but it is not growing very gracefully. It is treating buffer allocation and mapping device memory as 'the same'.

We would like to treat device memory as any old array of data from the hosts perspective, and we can create a clsparse::array class that implements the same interface as std::array. The primary difference would be that clsparse::array would encapsulate device memory, and we could access that memory through the host by using iterators (with all the expected performance degradations but at least the memory is easy to use and manage). With proper use of c++11/14 constructors, move semantics, operator='s and iterators, management of device memory could be as easy as host memory.

64-bit capable containers

Currently, all of our C-based structs store their dimension sizes as cl_int. This limits the sizes of the sparse matrices our library supports to 2 billion (this was chosen without much thought at the time). Should we redefine the structs to store all sizes as cl_ulong, or possibly size_t? size_t may be problematic, in that OpenCL does not allow size_t types as kernel arguments.
From spec:
Arguments to __kernel functions in a program cannot be declared with the built-in scalar types bool, half, size_t, ptrdiff_t, intptr_t, and uintptr_t
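Because size_t is barred from kernel arguments, the widened struct would use a fixed 64-bit type. A minimal sketch (field names are illustrative):

```cpp
// Sketch of the widened dimension fields: a fixed 64-bit type (cl_ulong-like)
// rather than size_t, since size_t cannot be a kernel argument. Field names
// are illustrative, not the real clSPARSE struct layout.
#include <cstdint>

struct CsrMatrix64 {
    std::uint64_t num_rows;
    std::uint64_t num_cols;
    std::uint64_t num_nonzeros;   // no longer capped near 2^31
};
```

Fixed-width fields also keep the struct layout identical between 32-bit and 64-bit hosts, which size_t would not.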

Benchmark to print out device information

The benchmark apps need to print out information about the device under test. Some machines can have multiple devices installed, and the benchmark does not inform the user which device it picked as the default device.

Related to this, the benchmark needs to provide a facility to print out a list of all detected platforms and devices. This will enable users to select the device they wish, as described in #39

clsparseDenseMatrix params

We need to enhance clsparseDenseMatrix with both an enum to describe the row/column-major layout of the matrix data and an LDA parameter to describe the stride between rows/columns.

Using Boost in clSPARSE

Currently, we use the boost library in our benchmark and our testing applications. I've seen a few places now in our library where boost would be nice to have: for instance, the filesystem library. The filesystem libraries in compiler implementations are not standard yet; some use filesystem v2 and others use filesystem v3. Using boost would remove the dependency on the compiler's implementation.

How would users feel about requiring boost to build the library proper? I feel like Boost is a 'pseudo' C++ standard whose only negative is that it doesn't ship with the compiler. CMake removes almost all of the pain of building boost, but I've run into users that just hate external dependencies on principle.

clsparseSCsrMatrixfromFile - reading symmetric matrices

I've just found an issue. It really gave me a headache 😉.
The first part of the bug relates to MMReadHeader, which is executed while calling clsparseHeaderfromFile:

// If symmetric MM stored file, double the reported size
if( mm_is_symmetric( Typecode ) )
    nNZ <<= 1;

I think this is not true: if we take a diagonal matrix, the nNZ should stay the same.

The second part of the bug is related to MMGenerateCOOFromFile. The function is calculating the actual number of nnz:

if( exp_zeroes == 0 && val == 0 )
    continue;
else
    FillCoordData( Typecode, unsym_coords, unsym_actual_nnz, ir, ic, val );

and at the end assigns the new actual number of non-zero entries to nNZ:

 nNZ = unsym_actual_nnz;

This actually corrupts all the test programs when using a matrix whose nnz was changed during the reading procedure. The quick fix for the second part is to remove the n_vals, n_rows, n_cols assignment from the constructor and assign the proper values later, after the matrix is loaded.

I think there are more impacts; I think this is the cause of the freezes on my computer when using adaptive spmv.
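The fix for the first part can be sketched on the host: when expanding a symmetric Matrix Market file, only off-diagonal entries are mirrored, so a purely diagonal matrix keeps its reported nNZ instead of being doubled:

```cpp
// Host-side sketch of the proposed fix: count each diagonal entry once and
// each off-diagonal entry twice (original + mirrored), instead of nNZ <<= 1.
#include <cstddef>
#include <utility>
#include <vector>

std::size_t expandedNnz(const std::vector<std::pair<int, int>>& coords) {
    std::size_t nnz = 0;
    for (const auto& rc : coords)
        nnz += (rc.first == rc.second) ? 1 : 2;  // diagonal once, off-diagonal mirrored
    return nnz;
}
```

This requires scanning the coordinates rather than just the header, which matches the second part of the report: the true nnz is only known after the entries are read.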

Benchmark to add command line option to select different algorithms per function

Some functions in clSPARSE may have multiple algorithms and implementations. The benchmark may wish to compare the performance of one algorithm verses another, so there must be a way for the benchmark to specify the algorithm for the function from the library.

For instance, the csr spmv function currently supports two distinct algorithms, and the benchmark needs to implement the facilities to choose which one to test.

mm_reader - CL_MAP_WRITE_INVALIDATE_REGION

@kknox
It seems to me that the matrix market reader is not working properly with in-place writing with CL_MAP_WRITE_INVALIDATE_REGION. I wrote a benchmark for the CG solver and I was getting trash results (nans, -inf etc.), while in the tests the CG code was working correctly. The only difference I found was that in the tests I'm using my own reading function, while in the benchmarks it was your mm_reader.

For small matrix "poisson2D.mtx" [nRow: 367] [nCol: 367] [nNZ: 2417] my reader shows exactly what is in the file:

0.5
-0.0709522
-0.0814841
-0.0924987
-0.0914385
-0.0866088
-0.0770176
0.5
-0.0857435
-0.124987
-0.0702012
-0.0866502

your function unfortunately returns the following output (looks like random data):

-nan
-48.5088
-32.1338
-32.7589
-32.8839
-33.6338
-33.7588
-33.0089
-6.78082e-30
-1.09377
-1.26117e-44
2.16269e+06

When I revert to

cl_float* fCsrValues = rCsrValues.clMapMem( CL_TRUE, CL_MAP_WRITE, pCsrMatx->valOffset( ), pCsrMatx->nnz );
cl_int* iCsrColIndices = rCsrColIndices.clMapMem( CL_TRUE, CL_MAP_WRITE, pCsrMatx->colIndOffset( ), pCsrMatx->nnz );
cl_int* iCsrRowOffsets = rCsrRowOffsets.clMapMem( CL_TRUE, CL_MAP_WRITE, pCsrMatx->rowOffOffset( ), pCsrMatx->m + 1 );

the function works properly.

Could you please confirm?

Version numbers

Assign version numbers separately for /master and /develop branches. Use similar version numbering scheme as the other clMathLibraries.

Deciphering Version numbers:

  • Major => Releases with same major numbers are guaranteed to be compatible; backward compatibility NOT guaranteed. 2.x might not be compatible with 1.x
  • Minor => Incremented release adding new features, significant performance improvements or significant internal refactoring, which does not affect backwards compatibility; /master == even, /develop == odd
  • Patch => Bug fix, doc update, minor performance improvement or minor internal refactoring
  • Tweak => Incrementing build number

Change buffer() to data()

I'm realizing that in c++11 syntax, our buffer() method is in fact the same as std::vector::data(). We should probably follow the same conventions and rename our method.

CPP Interface for Tests

The current test and benchmark implementation is not clean enough. I mixed C and CPP for the OpenCL function calls. It would be a good idea to unify them to one standard. Maybe the tests should somehow be divided into parts using C style and CPP style.

BLAS1 Function interface

Currently the interface of blas1 operations (scale, axpy, axpby) updates the input parameter.

cldenseSscale (clsparseVector* y, const clsparseScalar* alpha, const clsparseControl control)

I think that a more flexible interface would be

cldenseSscale (clsparseVector* y, const clsparseScalar* alpha, clsparseVector* x, const clsparseControl control)

In this version y = alpha * x; if we call it

cldenseSscale (y, alpha, y);

we still maintain the desired functionality.

This issue arose in development of the BCGStab solver, where the result of AXPBY has to be stored in a different existing buffer. Currently only the internal axpby template is adapted to fulfil this requirement.
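The out-of-place form with permitted aliasing can be sketched on the host; each element is read before it is written, so passing the same vector for x and y recovers the in-place behaviour:

```cpp
// Host-side sketch of the proposed out-of-place scale: y = alpha * x, where
// passing the same vector for x and y gives the old in-place behaviour.
#include <cstddef>
#include <vector>

void scale(std::vector<float>& y, float alpha, const std::vector<float>& x) {
    y.resize(x.size());   // no-op when y aliases x (same size already)
    for (std::size_t i = 0; i < x.size(); ++i)
        y[i] = alpha * x[i];   // safe under aliasing: read precedes write per element
}
```

The same pattern extends to axpy/axpby, letting the solver direct results into any existing buffer.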

clMemRAII allocate buffer

Hi, there is a funny thing I have found during some development. Take a look at this constructor of the clMemRAII class:

clMemRAII( const cl_command_queue cl_queue, const cl_mem cl_buff,
               const size_t cl_size = 0, const cl_mem_flags cl_flags = CL_MEM_READ_WRITE) :
        clMem( nullptr )
    {
        clQueue = cl_queue;
        clBuff = cl_buff;

        ::clRetainCommandQueue( clQueue );
         if(cl_size > 0)
         {
             cl_context ctx = NULL;

             ::clGetCommandQueueInfo(clQueue, CL_QUEUE_CONTEXT, sizeof( cl_context ), &ctx, NULL);
             cl_int status = 0;
    //here clBuff becomes different than cl_buff therefore later use of cl_buff should give undefined behaviour.
             clBuff = ::clCreateBuffer(ctx, cl_flags, cl_size * sizeof(pType), NULL, &status);
         }
         else
         {
            ::clRetainMemObject( clBuff );
         }
    }

This is strange to me, but even though the code in the dot function makes no sense, it still returns correct values!

clMemRAII<T> rPartial (control->queue(), partial.values, partial.n);

My suggestion is to add the following constructor:

clMemRAII( const cl_command_queue cl_queue, cl_mem* cl_buff,
               const size_t cl_size = 0, const cl_mem_flags cl_flags = CL_MEM_READ_WRITE) :
        clMem( nullptr )
    {
        clQueue = cl_queue;
        clBuff = *cl_buff;

        ::clRetainCommandQueue( clQueue );
         if(cl_size > 0)
         {
             cl_context ctx = NULL;

             ::clGetCommandQueueInfo(clQueue, CL_QUEUE_CONTEXT, sizeof( cl_context ), &ctx, NULL);
             cl_int status = 0;
             clBuff = ::clCreateBuffer(ctx, cl_flags, cl_size * sizeof(pType), NULL, &status);
             *cl_buff = clBuff;
         }
         else
         {
            ::clRetainMemObject( clBuff );
         }
    }

Are my observations correct?

validateMemSize

Create functions for validating the buffers of *Private classes with respect to the size of the allocated space.

clsparseControl event management

This is just a first issue; I'll start using the issue tracker to keep track of my work.

Following the idea of simplifying the clsparse interface, I've created the clsparseControl struct. When it comes to event synchronization, clsparse has to manage two scenarios at the beginning.

  1. If no event was provided by the user, we have to spawn some temp event for a given kernel and call wait() on it.
  2. If a user provides an event, it should be registered with the particular kernel. The user is responsible for calling wait() on his own event.

To work with events and clsparseControl in a C-style manner, there are several interface functions:

  • clsparseSetupEvent(clsparseControl control, cl_event *event)
  • clsparseSetupEventWaitList(clsparseControl control, cl_uint num_events_in_wait_list, cl_event *event_wait_list)
  • clsparseSynchronize(clsparseControl control)

The first function should take the event and put it into control->event; control->event is of type cl::Event.
The second function should put the event_wait_list into a vector of cl::Event.
The third function should call control->event.wait().

The current implementation of those functions is totally unacceptable and probably very buggy; the cl::Event loses its validity.

atomic_add_float

Hi @kknox
Could you please explain to me how your implementation of atomic_add_float works? I know the general idea of atomic operations related to synchronization, but from my tests I see that when this operation is used for the first time, the initial value is always ignored.

Could you please take a look at the following kernel?
Here pX is a vector of size 256, and pSum is a clsparseScalar buffer.

__attribute__((reqd_work_group_size(WG_SIZE,1,1)))
__kernel
void xxx (const SIZE_TYPE size,
          __global const VALUE_TYPE* pX,
          __global VALUE_TYPE* pSum)
{
    SIZE_TYPE idx = get_global_id(0);
    if (idx >= size) return;
    atomic_add_float(&pSum[0], pX[idx]);
}

My tests say that whatever the initial value of pSum was, I always get the same result, which depends only on the pX values. I get the same result whether the initial value of pSum was -100, 0, or any other value.

Thanks in advance!
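For context, atomic float addition on pre-2.0 OpenCL is usually built as a compare-and-swap loop on the float's bit pattern. The same pattern in host C++ with std::atomic shows the accumulator's initial value *is* honoured by a correct CAS loop, which suggests the ignored initial value points at something else (e.g. host code zeroing pSum before the reduction):

```cpp
// Host C++ analogue of the CAS-loop atomic float add: retry until no other
// thread changed the target between our load and our exchange. A correct
// implementation adds onto whatever value is already in the accumulator.
#include <atomic>

void atomic_add_float(std::atomic<float>& target, float val) {
    float expected = target.load();
    // compare_exchange_weak updates 'expected' on failure, so the loop
    // always retries with the freshest value.
    while (!target.compare_exchange_weak(expected, expected + val)) {}
}
```

In OpenCL C the equivalent loop uses atomic_cmpxchg on a union of float and uint bit patterns, since there is no native float atomic.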

Elementwise operators for clsparse::array.

Is it possible to overload operator/ for two device memories, such that it calls a device division kernel? The operator/ would not transfer device memory to host, nor transfer host memory to device in a round-trip fashion. This also means supporting an operator= that assigns a temporary device memory to device memory.
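A host-backed sketch of the shape this overload could take; the "kernel launch" is simulated by a plain loop, and in the real library the loop body would instead enqueue an elementwise division kernel on the two device buffers:

```cpp
// Illustrative sketch: operator/ produces a temporary result array without any
// host round trip. The loop stands in for enqueueing a division kernel.
#include <cstddef>
#include <vector>

struct DeviceArray {
    std::vector<float> mem;   // placeholder for device memory
};

DeviceArray operator/(const DeviceArray& a, const DeviceArray& b) {
    DeviceArray out{std::vector<float>(a.mem.size())};
    for (std::size_t i = 0; i < a.mem.size(); ++i)
        out.mem[i] = a.mem[i] / b.mem[i];   // real version: division kernel on device
    return out;   // a move-aware operator= adopts the temporary's buffer
}
```

Move semantics handle the "assign a temporary device memory" part: the temporary's buffer handle is stolen rather than copied, so no data crosses the PCIe bus.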

Solver concept

Hi, @kknox @shuaiche
take a look at my branch:

https://github.com/jpola/clSPARSE/tree/solver_concept

In this branch I wanted to present a concept of the solver architecture. We have created quite a nice infrastructure to build more advanced algorithms in the field of sparse linear algebra, and since internally we are allowed to use C++, it is much easier than I expected. In this branch I've made a lot of changes to the old algorithms; the changes mostly concern a split of each algorithm into a general header, where I store the template of the algorithm, and its public cpp implementation related to the clSPARSE.h API. The header files can later be used internally. The AXPY algorithms now have an elementwise operator by which I can control the scalar parameter. The norms have a simple header which wraps the call of the reduce function. CSRMV is split into a header file as well.

The first solver I provide is the Conjugate Gradient algorithm, with and without preconditioning. I started with the creation of a solver controller (solvers/solver_control.hpp), which follows a similar idea to the clsparseControl object. It stores the information and constraints related to the call of the solver. Sparse solvers are iterative algorithms providing approximated solutions; therefore in the solver controller you can find fields like absolute tolerance, relative tolerance, etc. The next step was to implement the idea of how to manage the preconditioners. For more details, please check solvers/preconditioners/preconditioner.hpp. You will find there an explanation of what a preconditioner actually is from a math perspective. From a programming perspective, every preconditioner has its own handler. A preconditioner handler has to fulfil the general preconditioner handler structure. Then, during the call of the solver, I can easily pick the desired object (see conjugate_gradients.cpp).

The implementation of the CG solver itself can be found in solvers/conjugate_gradients.hpp. Currently it is ugly as hell, but it will define future steps to optimise the code and help make decisions related to the architecture.

Currently what gives me a headache is buffer management.

  1. In several places I have to check the value of a clsparseScalar object.
  2. In several places I have to divide the values of two scalars.
  3. I have to copy one scalar or vector to another.

What I'm currently looking for (and I hope you can help me here) is to implement nice classes which manage the lifetime of the buffer objects, perform mapping operations, and perform copy operations. I'm also wondering what the performance ratio is of launching a kernel which divides two numbers, compared to mapping the values to the host and updating them there. Do you have some ideas?

What do you think about this concept?

Kuba.

clsparseCsrMetaSize

@kknox,
I can't successfully execute test_readMMcoo.cpp.
I tried the following matrices:

  • consph.mtx,
  • e40r0100.mtx,
  • shipsec1.mtx

None of these matrices gave me correct results; here are some logs:
shipsec1.mtx

[ RUN      ] MM_file.load
Matrix: /media/jpola/Storage/matrices/OpenCL/3/shipsec1.mtx [nRow: 140874] [nCol: 140874] [nNZ: 7954278]
key: [csrmv_adaptive/csrmv_adaptive] -cl-kernel-arg-info -cl-std=CL1.2  -DROWBITS=32 -DWGBITS=24 -DBLOCKSIZE=1024 hash = 3744165166
kernel not found: 3744165166
/home/jpola/Projects/ClMath_ClSparse/clSPARSE_fork/clSPARSE/src/tests/test_readMMcoo.cpp:155: Failure
The difference between yHost[ i ] and deviceResult[ i ] is inf, which exceeds 5e-4, where
yHost[ i ] evaluates to 108360.09375,
deviceResult[ i ] evaluates to inf, and
5e-4 evaluates to 0.00050000000000000001.

consph.mtx

[ RUN      ] MM_file.load
Matrix: /media/jpola/Storage/matrices/OpenCL/3/consph.mtx [nRow: 83334] [nCol: 83334] [nNZ: 6093814]
key: [csrmv_adaptive/csrmv_adaptive] -cl-kernel-arg-info -cl-std=CL1.2  -DROWBITS=32 -DWGBITS=24 -DBLOCKSIZE=1024 hash = 3744165166
kernel not found: 3744165166
/home/jpola/Projects/ClMath_ClSparse/clSPARSE_fork/clSPARSE/src/tests/test_readMMcoo.cpp:155: Failure
The difference between yHost[ i ] and deviceResult[ i ] is 1973.46337890625, which exceeds 5e-4, where
yHost[ i ] evaluates to 1975.46337890625,
deviceResult[ i ] evaluates to 2, and
5e-4 evaluates to 0.00050000000000000001.

For several others I've tested, I received:
C++ exception with description "std::bad_alloc" thrown in the test body.
This means that 16 GB of RAM is not enough to compute the CsrMetaSize algorithm.

I've executed test_readMMcoo.cpp using a small matrix named orani678.mtx
%%MatrixMarket matrix coordinate real general
2529 2529 90158
This part of the code (or any other part which provides input to it) seems not to work properly:

for( int w = 1; w < numWGReq; w++ )
            {
                rowBlocks.push_back( i - 1 << ROW_BITS );
                rowBlocks[ rowBlocks.size( ) - 1 ] |= static_cast< cl_ulong >( w );
            }

Snapshot from gdb says:

  • numWGReq = 2091873
  • rowBlocks <1 items> std::vector
    [0] 140737342162904 unsigned long
  • rowDelimiters 963064043 cl_int
  • sum 2142077658 cl_ulong
  • w 392732 int

Is the adaptive version working on the Windows platform? If so, that means the problem is specific to the Linux platform.

gpuTimer throws out_of_range exception for CG benchmark

When using gpuTimer with the CG benchmark, it throws an out_of_range exception, probably due to the large number of events to collect.

To test the issue, enable the gpuTimer in clfunct_xCGM.hpp. The throw is caught at line 239.

Directory layout and file names

I think I'd like to get feedback on the way our files are laid out in clSPARSE:

Directory Questions/Proposals:

  • I like the concept of the folder called blas1, though we might need to distinguish between sparse and dense routines in the future
  • we should create blas2 and blas3 folders to hold their respective routines, and merge the spmv folder into them
  • Should we create a /kernels sub-directory below each blasX directory, to separate out the kernels?
  • I think transforms should house the conversion routines
  • solvers should contain the new solver routines
  • matrix-market might be moved/merged into internal?
  • internal is a generic, non-descriptive name; is there a better name, or should we split the files?

With regard to file names, we now have 4 distinct naming conventions. Most of these variations were created by me 😄

  • underscores (clsparse_csr2dense.cpp)
  • hyphens (clSPARSE-private.hpp)
  • periods (clsparse.error.hpp)
  • Camel Case (loadDynamicLibrary.hpp)

Do we want to agree on file naming conventions? I certainly have my own opinions, which I can mention in a follow-on comment.

Create examples that use the clSPARSE library

We need a few examples that demonstrate how to write code that uses the clSPARSE API, in addition to how to build and link with it. I'm thinking a good set of examples should be:

  • How to load a matrix from a .mtx file into a COO object, then convert it to CSR
  • How to compute a basic SpM-dV operation
  • How to compute a basic CG solve

The examples should be independent of the library, and should be written and built in a way that an end user would be expected to use the library.
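For the SpM-dV example above, a self-contained host reference like the one below is useful on its own; an end-user sample would compute the same y on the device via the clSPARSE API and compare against something like this:

```cpp
// Plain host reference for CSR sparse matrix * dense vector (SpM-dV);
// a device sample would validate its result against this.
#include <vector>

std::vector<float> csr_spmv(int m,
                            const std::vector<int>& rowPtr,
                            const std::vector<int>& colInd,
                            const std::vector<float>& vals,
                            const std::vector<float>& x) {
    std::vector<float> y(m, 0.0f);
    for (int r = 0; r < m; ++r)
        for (int k = rowPtr[r]; k < rowPtr[r + 1]; ++k)
            y[r] += vals[k] * x[colInd[k]];   // accumulate row r's nonzeros
    return y;
}
```

Keeping the reference independent of the library, as the issue asks, also lets the sample double as a correctness check for the .mtx loading step.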
