
occa's People

Contributors

aljenu, amikstcyr, awehrfritz, barracuda156, deukhyun-cha, dmcdougall, dmed256, hknibbe2, jameseggleton, jdahm, jedbrown, jeremylt, kian-ohara, kris-rowe, kvoronin-intel, lcw, malachitimothyphillips, noelchalmers, pdhahn, rakesroy, reidatcheson, sapphire-arches, sfrijters, stgeke, suniljoshi82, tcew, tejax-alaghari, thilinarmtb, v-dobrev, wjhorne


occa's Issues

Loopy removal

Regarding 96d9460:

I'm wondering about the right path forward here. Sure, I can call loopy myself using whatever syscall, and that's essentially equivalent to the code that was removed. But since loopy can take a couple seconds to run, the right thing to do would be to cache its output. I was rather hoping to piggyback off of OCCA's compiler cache to do so, which would point in the direction of tighter integration. So what I'm looking for here is a pronouncement by either @tcew or @dmed256 along one of the following lines:

  • There will be no loopy-anything in occa. Third-party toolchain components will not be supported as part of OCCA, in particular with respect to cache integration.
  • There will be no loopy-anything in occa. Third-party toolchain components are a possibility.
  • Loopy as part of the build chain is welcome back into occa, as long as conditions XYZ are met.

cc @lcw

Query device memory

As far as I can tell, there is no mechanism in occa that allows me to query a device's total memory.

Two ways that I could see doing this are

  1. an occa function that just returns the global memory size of the device (we currently only really care about this for opencl and cuda)
  2. an occa function that returns the device pointer, and let the user query the device themselves

Would either of these be feasible to implement within occa? We have been locking up our boxes by allocating too much memory on the device, so we want to put in some checks to make sure this doesn't occur in our codes.
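
For reference, the backend calls that option 1 would wrap are standard driver APIs; here is a minimal sketch (only the idea of exposing them through occa is hypothetical):

#include <cuda.h>
#include <CL/cl.h>

// CUDA: total device memory in bytes, via the standard driver call
size_t cudaGlobalMemSize(CUdevice device) {
  size_t bytes = 0;
  cuDeviceTotalMem(&bytes, device);
  return bytes;
}

// OpenCL: total global memory in bytes, via clGetDeviceInfo
cl_ulong openclGlobalMemSize(cl_device_id device) {
  cl_ulong bytes = 0;
  clGetDeviceInfo(device, CL_DEVICE_GLOBAL_MEM_SIZE,
                  sizeof(bytes), &bytes, NULL);
  return bytes;
}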

problems including a device function

Hi,

I am having problems including a file with a device function: the device function ends up multiply defined. For example, I have a file, included with occaKernelInfoAddInclude, that contains

void conn_mapping(const int tree, const dfloat a, const dfloat b,
                         const dfloat c, dfloat *x, dfloat *y, float *z)
{
  *x = a;
  *y = b;
  *z = c;
}

When this file gets included in the kernel info, compiling kernels fails like this:

Compiling [compute_X]
g++ -x c++ -fPIC -shared  /Users/lucas/._occa/kernels/a21338d5e3be2cbf/source.occa -o /Users/lucas/._occa/kernels/a21338d5e3be2cbf/binary -I/Users/lucas/research/code/ape/occa//include -L/Users/lucas/research/code/ape/occa//lib -locca

Compiling [compute_X0]
g++ -x c++ -fPIC -shared  /Users/lucas/._occa/kernels/3854aa958acba9fd/source.occa -o /Users/lucas/._occa/kernels/3854aa958acba9fd/binary -I/Users/lucas/research/code/ape/occa//include -L/Users/lucas/research/code/ape/occa//lib -locca

/Users/lucas/._occa/kernels/3854aa958acba9fd/source.occa: In function 'void conn_mapping(int, float, float, float, float*, float*, float*)':
/Users/lucas/._occa/kernels/3854aa958acba9fd/source.occa:88:19: error: redefinition of 'void conn_mapping(int, float, float, float, float*, float*, float*)'
 occaFunction void conn_mapping(occaConst int tree,
                   ^
/Users/lucas/._occa/kernels/3854aa958acba9fd/source.occa:71:19: note: 'void conn_mapping(int, float, float, float, float*, float*, float*)' previously defined here
 occaFunction void conn_mapping(const int tree, const dfloat a, const dfloat b,
                   ^

Looking at the source, it seems that a21338d5e3be2cbf/source.occa is okay and only has one definition of conn_mapping, but 3854aa958acba9fd/source.occa is not. Here is the relevant section of 3854aa958acba9fd/source.occa:

occaFunction void conn_mapping(const int tree, const dfloat a, const dfloat b,
                               const dfloat c, dfloat *occaRestrict x,
                               dfloat *occaRestrict y, float *occaRestrict z)
{
  *x = a;
  *y = b;
  *z = c;
}

occaFunction void conn_mapping(occaConst int tree,
                               occaConst float a,
                               occaConst float b,
                               occaConst float c,
                               float * occaRestrict x,
                               float * occaRestrict y,
                               float * occaRestrict z);

occaFunction void conn_mapping(occaConst int tree,
                               occaConst float a,
                               occaConst float b,
                               occaConst float c,
                               float * occaRestrict x,
                               float * occaRestrict y,
                               float * occaRestrict z) {
  *x = a;
  *y = b;
  *z = c;
}

I looked but cannot figure out why this function is getting defined twice. Any help would be appreciated.

Thanks,
Lucas

occaMax gives min

The definition of occaMax at line 209 in occa/defines/OpenCL.h uses "<" where it needs ">", so it returns the minimum.
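
Assuming the usual ternary-style macro, the fix is a one-character change (illustrative; the exact definition in occa/defines/OpenCL.h may differ):

#define occaMax(a, b) (((a) > (b)) ? (a) : (b))  // was '<', which returned the min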

how to disable CUDA support?

I am only interested in OpenCL support. CUDA support is enabled by default even though it is not actually usable [1] on my machine. The build succeeds but testing fails.

I cannot find a build system control option to disable CUDA, short of hacking makefiles by hand.

jrhammon@klondike:/tmp$ git clone https://github.com/libocca/occa.git
Cloning into 'occa'...
remote: Counting objects: 18556, done.
remote: Total 18556 (delta 0), reused 0 (delta 0), pack-reused 18556
Receiving objects: 100% (18556/18556), 12.84 MiB | 9.78 MiB/s, done.
Resolving deltas: 100% (14266/14266), done.
Checking connectivity... done.
jrhammon@klondike:/tmp$ cd occa
jrhammon@klondike:/tmp/occa$ export OCCA_CXX=g++-7
jrhammon@klondike:/tmp/occa$ export LD_LIBRARY_PATH=$PWD/lib:$LD_LIBRARY_PATH
jrhammon@klondike:/tmp/occa$ make -j16 >& make.log
jrhammon@klondike:/tmp/occa$ make test
echo '---[ Testing ]--------------------------'
---[ Testing ]--------------------------
cd /tmp/occa/; \
make -j 4 CXXFLAGS='-g' FCFLAGS='-g'
make[1]: Entering directory '/tmp/occa'
make[1]: Nothing to be done for 'all'.
make[1]: Leaving directory '/tmp/occa'
cd /tmp/occa//examples/addVectors/cpp; \
make -j 4 CXXFLAGS='-g' FCFLAGS='-g'; \
./main
make[1]: Entering directory '/tmp/occa/examples/addVectors/cpp'
g++ -g -o /tmp/occa/examples/addVectors/cpp//main  -fopenmp -DOCCA_OPENMP_ENABLED=1  -O3 -D __extern_always_inline=inline -DOCCA_DEBUG_ENABLED=0 -DNDEBUG=1 -DOCCA_SHOW_WARNINGS=0 -DOCCA_CHECK_ENABLED=1 -DOCCA_OPENCL_ENABLED=1 -DOCCA_CUDA_ENABLED=1  -D LINUX_OS=1 -D OSX_OS=2 -D WINDOWS_OS=4 -D WINUX_OS=5  -D OCCA_OS=LINUX_OS  /tmp/occa/examples/addVectors/cpp//main.cpp -I/tmp/occa/lib -I/tmp/occa/include -L/tmp/occa/lib   -I/usr/local/cuda-9.0/include -I/usr/local/cuda-9.0/include  -locca -lm -lrt -ldl -L/usr/local/cuda-9.0/lib64 -lOpenCL -L/usr/local/cuda-9.0/lib64/stubs -lcuda
In file included from /tmp/occa/include/occa.hpp:18:0,
                 from /tmp/occa/examples/addVectors/cpp//main.cpp:3:
/tmp/occa/include/occa/defines/vector.hpp:20:32: warning: unknown option after ‘#pragma GCC diagnostic’ kind [-Wpragmas]
 #pragma GCC diagnostic ignored "-Wint-in-bool-context"
                                ^
make[1]: Leaving directory '/tmp/occa/examples/addVectors/cpp'
./main: error while loading shared libraries: libcuda.so.1: cannot open shared object file: No such file or directory
makefile:116: recipe for target 'test' failed

[1] /usr/local/cuda-9.0 exists but the requisite drivers are not installed and thus CUDA does not work.

Restore CUDA/OpenCL kernel source support

Hi everybody.

While waiting for the libocca 1.0 release, I strongly suggest adding support for native CUDA/OpenCL kernel sources in the CUDA/OpenCL OCCA modes: this would be a great advantage when porting an existing CUDA-accelerated program to OCCA.

What I suggest is:

  1. add a NativeCUDA / NativeOpenCL mode for kernel builds, in order to use already existing CUDA/OpenCL kernel source code (to be used only with the CUDA and OpenCL OCCA modes, of course); see the sketch after this list
    a) please remove any dummy arguments needed to call the CUDA/OpenCL kernel (see the original implementation in the "master" branch of libocca)
    b) use a single, consistent semantic for the formal parameters of the kernel class's setWorkingDims method, for example: the CUDA way (dims, gridOfBlocks, blockDims) or the OpenCL way (dims, gridOfElements, blockDims)
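
For concreteness, here is a sketch of the proposed call pattern, reusing the old master-branch API names (the NativeCUDA mode itself does not exist yet):

occa::device device;
device.setup("mode = CUDA, deviceID = 0");

// Build an unmodified .cu kernel source directly
occa::kernel addVectors =
  device.buildKernelFromSource("addVectors.cu",
                               "addVectors",
                               occa::usingNative);

// One agreed-upon argument order, e.g. (dims, blockDims, gridOfBlocks)
addVectors.setWorkingDims(1, 256, (entries + 255) / 256);
addVectors(entries, o_a, o_b, o_ab);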

Thank you for your great work and support.

Luca Ferraro

Kernel profiler

Pass profile: true in the kernel properties to aggregate timing information.
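
A sketch of the intended usage (the profile property is the proposal itself, not an existing feature):

occa::properties props("profile: true");
occa::kernel addVectors = device.buildKernel("addVectors.okl",
                                             "addVectors",
                                             props);
addVectors(entries, o_a, o_b, o_ab);  // timings aggregated per kernel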

Draft documentation for 1.0

We're adding documentation in readthedocs: http://occa.readthedocs.io

Once the draft is done, we'll review it one more time before the 1.0 release

  • USER DOCUMENTATION
    • Getting Started
    • Command-Line Interface
    • Support
  • OCCA DOCUMENTATION
    • Introduction
    • Terminology
    • Connecting to a Device
    • Allocating and Initializing Memory
    • Building and Running Kernels
    • Synchronization
    • Streams
    • Background Devices
    • Unified Virtual Addressing
  • OKL DOCUMENTATION
    • Introduction
    • Loop-based Parallelism
    • Memory Spaces
    • Kernels and Functions
    • Reductions
    • Features and Limitations
  • API DOCUMENTATION
    • Device
    • Memory
    • Kernel
    • Stream
    • Properties
  • EXTENSIONS
    • Packaging Kernels in Libraries
    • Custom Backends
    • Custom Kernel Arguments
    • Custom OKL Attributes

ARM Support

Test out the Serial and OpenMP modes on ARM hardware to make sure the compiler and compiler flags work. Find ARM-specific macros for the vectorization types (float2, float4, ...)

[FORTRAN] unnecessary printf statements

There is an unnecessary printf statement in src/FC.cpp, line 10:

printf("type = %p\n", *type);

Also, when occa is run with thousands of GPUs, the info messages for cached kernels become excessive, so I suggest an option to turn them on/off.

//std::cout << "Found cached binary of [" << compressFilename(filename) << "] in [" << compressFilename(binaryFile) << "]\n";
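
One way to make the message switchable, sketched with a hypothetical OCCA_VERBOSE environment variable and plain std::getenv (requires <cstdlib>):

const char *verbose = std::getenv("OCCA_VERBOSE");  // hypothetical env var
if (verbose && std::string(verbose) == "1") {
  std::cout << "Found cached binary of [" << compressFilename(filename)
            << "] in [" << compressFilename(binaryFile) << "]\n";
}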

regards,
Daniel

Emulate OKL AST

Since we build the AST from parsed OKL code, have a feature to emulate the AST to search for

  • Out-of-bound accesses
  • Race conflicts
  • Access ordering (heatmap?)
  • FLOP/Read/Write counts
  • Potential cache misses

Update C bindings

  • Make sure the C API is in sync with the C++ API
  • Clean up occaType to avoid allocations

opencl float atomics

I am in need of atomicAdd for floats in OCCA. Currently it seems only CUDA's atomicAdd on floats works. atomicAdd on doubles doesn't work in CUDA, and OpenCL does not seem to define it on floats/doubles at all. I know CUDA implements 64-bit atomicAdd not directly but via atomicCAS; still, it would be great to have it.

My OpenCL is:

Version: OpenCL 1.2 AMD-APP (1445.5)
Extensions: cl_khr_fp64 cl_amd_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_gl_sharing cl_ext_device_fission cl_amd_device_attribute_query cl_amd_vec3 cl_amd_printf cl_amd_media_ops cl_amd_media_ops2 cl_amd_popcnt cl_khr_spir cl_amd_svm cl_khr_gl_event
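
In the meantime, a common device-side workaround in OpenCL (not an occa API) is to emulate the float atomic add with atomic_cmpxchg on the value's bit pattern:

// Emulated atomic float add for OpenCL 1.x global memory.
// atomic_cmpxchg operates on 32-bit integers, so we reinterpret
// the float's bits through a union and retry until no other
// work-item raced us.
inline void atomicAddFloat(volatile __global float *addr, float val) {
  union { unsigned int u; float f; } expected, next;
  do {
    expected.f = *addr;
    next.f     = expected.f + val;
  } while (atomic_cmpxchg((volatile __global unsigned int *) addr,
                          expected.u, next.u) != expected.u);
}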

Version 1.0

We're aiming for a production-grade release with 1.0. A large amount of refactoring is being done for the API and OKL parser.

Issues

  • #51: Add basic linear algebra routines to occa::array
  • #49: Add custom attributes
  • #52: Update C bindings
  • #46: Create 1.0 documentation
  • #48: Add coverage metrics

Documentation

libocca.org

We're releasing documentation with

  • Getting started
  • OCCA programming model
  • OKL programming model
  • API documentation
  • Extending OCCA

Testing and Coverage

Testing has long been a crucial missing piece of our release process. We are making it a required part of every release, and coverage metrics will also be included.

Extendable API

Developer Tools

Helper methods were added to facilitate collaboration and to make it easier to extend OCCA from other codebases.

Examples include:

  • System calls
  • Environment variables
  • Finding files with/without custom URI prefixes (e.g. occa://occa/foo.okl)
  • Hashing files and objects
  • Caching files such as compiled binaries
  • Locating cached binaries
  • String manipulation

Properties

We're aiming to unify backends through a general API. However, we would also like to expose various features unique to different backends.

For example:

  • Memory allocation types
    • CUDA and OpenCL have texture memory
    • Possible memory locations in future architectures
  • Device features
    • Expose backend-specific device features, such as the list of cores the Threads mode pins to, or the deviceID in OpenCL and CUDA
  • Kernels
    • Defines and headers
    • Possible extensions to kernel building

We added properties in 1.0 to help create a more flexible API. Properties use the standard JSON representation to maintain a 1-1 mapping from data to string.

The following JS augmentations are also supported:

  • Object keys do not need to be stringified
  • Single quotes can also be used for strings
  • Trailing commas can be used
{ mode: 'Serial', }

While we support loading JSON strings with occa::json, properties are always JSON objects. Hence, we also support ignoring braces when initializing properties from a string.

occa::properties("mode: 'Serial'")

Modes

OCCA modes are composed of 3 classes: device, kernel, and memory. Modes can now be added from outside of OCCA by registering them; the occa::mode constructor registers the classes with their respective mode.

occa::mode<openmp::modeInfo,
           openmp::device,
           openmp::kernel,
           openmp::memory> mode("OpenMP");

OKL Parser

The OKL kernel language is one of the key features in OCCA. Adding JIT compilation of kernels gives us the flexibility of reusing kernels for different OCCA modes. We are refactoring the parser from the bottom up for robustness and flexibility.

Custom Kernel Arguments

Custom C++ classes can now be passed as kernel arguments by adding the occa::kernelArg cast operator

operator occa::kernelArg ();

OCCA does not restrict the number or types of arguments a class can add to the kernel call, though the backend might; for example, CPU modes allow at most 256 kernel arguments.

To see an example of this use, check occa::array.

Note: the OKL kernel has to match the resulting number and types of arguments.
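
An illustrative sketch of the pattern (the names here are made up; the conversion in the return statement assumes occa::memory is itself convertible to occa::kernelArg, as it is when passed to kernels directly):

class floatVector {
 public:
  occa::memory buffer;

  // Forward the underlying device buffer when the object is
  // passed as a kernel argument
  operator occa::kernelArg () {
    return buffer;
  }
};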

Linear Algebra

We're adding some common vector operations and linear algebra routines to occa::array

  • + - * / operations
  • Norms (L1, L2, Lp, Linf)
  • max, min
  • dot

Languages

We will officially support the following frontends in the 1.0 release

Language | Github Repository
---------|--------------------
C        | libocca/occa
C++      | libocca/occa
Python   | libocca/occa-python

We'll be adding the following in the 1.1 release

Language | Github Repository
---------|--------------------
Python   | libocca/occa-python
Fortran  | libocca/occa-fortran

Potential future languages

Language | Github Repository
---------|--------------------
Java     | libocca/occa-java
Julia    | libocca/occa-julia

Documentation and testing will be included

Provide constructors for occa::memory objects with offsets

In many situations we need to pass subsets of memory buffers to a kernel. For example, when working with many streams (to implement a pipeline that processes a big buffer in chunks with the same kernel), the kernel must operate on different portions of a memory buffer.

for (int i = 0; i < nchunks; ++i) {
  size_t starting_index = i * chunksize;
  // stuff like device.setStream(.......)
  // async copy for buffers from base + starting_index, of size chunksize
  kernel( occa::memory(bufA, starting_index), occa::memory(bufB, starting_index) );
  // async copy back
}

What I suggest is to provide a simple way to create occa::memory objects from other existing occa::memory objects using an offset: this is a "must have" whenever you need to process a subsection of an occa::memory buffer with a kernel, for example when using multiple streams to process a buffer in chunks (see the example above).

You could go through an operator+ or a copy constructor (or both), to let the user write something like this:

// modified source from the master branch of libocca
template <>
memory_t<CUDA>::memory_t(const memory_t<CUDA> &m,
                         const uintptr_t offset) {
  *this = m;
  handle = new CUdeviceptr;
  CUdeviceptr base = *((CUdeviceptr*) m.handle);
  *((CUdeviceptr*) handle) = base + offset;
}

For OpenCL we can use subbuffers.

We must guarantee that the memory resources are not destroyed when the newly shifted memory object goes out of scope.

Free allocations made by initializer

These memory leaks create valgrind noise for users that link to libocca, even if they never call into the library.

==14201== HEAP SUMMARY:
==14201==     in use at exit: 232 bytes in 7 blocks
==14201==   total heap usage: 35 allocs, 28 frees, 111,478 bytes allocated
==14201== 
==14201== 8 bytes in 1 blocks are definitely lost in loss record 2 of 7
==14201==    at 0x4C2D52F: operator new(unsigned long) (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==14201==    by 0x58B2F2F: occa::env::registerFileOpeners() (in /home/jed/src/occa/lib/libocca.so)
==14201==    by 0x58B5B9C: occa::env::initialize() (in /home/jed/src/occa/lib/libocca.so)
==14201==    by 0x400F519: call_init.part.0 (in /usr/lib/ld-2.26.so)
==14201==    by 0x400F625: _dl_init (in /usr/lib/ld-2.26.so)
==14201==    by 0x4000F69: ??? (in /usr/lib/ld-2.26.so)
==14201== 
==14201== 8 bytes in 1 blocks are definitely lost in loss record 3 of 7
==14201==    at 0x4C2D52F: operator new(unsigned long) (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==14201==    by 0x58B2F4C: occa::env::registerFileOpeners() (in /home/jed/src/occa/lib/libocca.so)
==14201==    by 0x58B5B9C: occa::env::initialize() (in /home/jed/src/occa/lib/libocca.so)
==14201==    by 0x400F519: call_init.part.0 (in /usr/lib/ld-2.26.so)
==14201==    by 0x400F625: _dl_init (in /usr/lib/ld-2.26.so)
==14201==    by 0x4000F69: ??? (in /usr/lib/ld-2.26.so)
==14201== 
==14201== 8 bytes in 1 blocks are definitely lost in loss record 4 of 7
==14201==    at 0x4C2D52F: operator new(unsigned long) (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==14201==    by 0x58B2F69: occa::env::registerFileOpeners() (in /home/jed/src/occa/lib/libocca.so)
==14201==    by 0x58B5B9C: occa::env::initialize() (in /home/jed/src/occa/lib/libocca.so)
==14201==    by 0x400F519: call_init.part.0 (in /usr/lib/ld-2.26.so)
==14201==    by 0x400F625: _dl_init (in /usr/lib/ld-2.26.so)
==14201==    by 0x4000F69: ??? (in /usr/lib/ld-2.26.so)

device::free() failure in CUDA mode

There's a problem with resource deallocation when using OCCA in CUDA native mode.

File     : <stripped>/occa/src/modes/cuda/kernel.cpp
Function : free
Line     : 260
Message  : Kernel (addVectors) : Unloading Module
Error   : CUDA Error [ 709 ]: CUDA_ERROR_CONTEXT_IS_DESTROYED
Stack    :
  6 occa::cuda::error(cudaError_enum, std::string const&, std::string const&, int, std::string const&)
  5 occa::cuda::kernel::free()
  4 occa::kernel::removeKHandleRef()
  3 ./main()
  2 /lib64/libc.so.6(__libc_start_main+0xf5)
  1 ./main()

I attach a simple diff file to apply to examples/nativeAddVectors which should reproduce the problem.

Thank you for your precious work,

nativeAddVectors.diff.zip

Errors in examples midgTest & mandelbulb

Hi, I'm using the 1.0 branch, and I get errors when testing the examples.
In midgTest, it says it could not find the function [midg] in the file midg.okl.
In mandelbulb, it says there is no setupAide.hpp.
Thank you.

wrapMemory() function in CUDA.cpp

The wrapMemory() function in CUDA.cpp causes a runtime error. The error occurs when finish() is called, but when I modified the code (see line 1300 in CUDA.cpp), I was able to get the kernel to execute successfully. Below is my suggested patch/replacement for the wrapMemory() function.

template <>
memory_v* device_t<CUDA>::wrapMemory(void *handle_,
                                     const uintptr_t bytes) {
  memory_v *mem = new memory_t<CUDA>;

  mem->dHandle = this;
  mem->size    = bytes;
  mem->handle  = (CUdeviceptr*) handle_;

  mem->memInfo |= memFlag::isAWrapper;

  return mem;
}

Private vector variables

I am having trouble using private float4 variables. Here is a small example of code that does not work:

#include <iostream>

#include "occa.hpp"

int main(int argc, char **argv){
  occa::printAvailableDevices();

  int nvec = 4;
  int entries = 2;

  const char *kernelString =
    "occaKernel void addVectors(occaKernelInfoArg,                  \n"
    "                           occaConst int occaVariable entries, \n"
    "                           occaConst occaPointer float4 * a,   \n"
    "                           occaConst occaPointer float4 * b,   \n"
    "                           occaPointer           float4 * ab){ \n"
    "  occaParallelFor                                              \n"
    "  occaOuterFor0{                                               \n"
    "   occaPrivate(float4, c);                                     \n"
    "    occaInnerFor0{                                             \n"
    "      const int N = occaGlobalId0;                             \n"
    "      if(N < entries){                                         \n"
    "        c = b;                                                 \n"
    "        ab[N].x = a[N].x + c[N].x;                             \n"
    "        ab[N].y = a[N].y + c[N].y;                             \n"
    "        ab[N].z = a[N].z + c[N].z;                             \n"
    "        ab[N].w = a[N].w + c[N].w;                             \n"
    "      }                                                        \n"
    "    }                                                          \n"
    "  }                                                            \n"
    "}                                                              \n";


  float *a  = new float[nvec*entries];
  float *b  = new float[nvec*entries];
  float *ab = new float[nvec*entries];

  for(int i = 0; i < nvec*entries; ++i){
    a[i]  = i;
    b[i]  = 1 - i;
    ab[i] = 0;
  }

  occa::device device;
  occa::kernel addVectors;
  occa::kernelInfo kInfo;
  occa::memory o_a, o_b, o_ab;


  //---[ Device setup with string flags ]-------------------
  // device.setup("mode = Serial");
  device.setup("mode = OpenMP  , schedule = compact, chunk = 10");
  // device.setup("mode = OpenCL  , platformID = 0, deviceID = 1");
  // device.setup("mode = CUDA    , deviceID = 0");
  // device.setup("mode = Pthreads, threadCount = 4, schedule = compact, pinnedCores = [0, 0, 1, 1]");
  // device.setup("mode = COI     , deviceID = 0");
  //========================================================

  o_a  = device.malloc(nvec*entries*sizeof(float));
  o_b  = device.malloc(nvec*entries*sizeof(float));
  o_ab = device.malloc(nvec*entries*sizeof(float));

  addVectors = device.buildKernelFromString(kernelString,
                                            "addVectors", kInfo,
                                            occa::usingNative);

  int dims = 1;
  int itemsPerGroup(16);
  int groups((nvec*entries + itemsPerGroup - 1)/itemsPerGroup);

  addVectors.setWorkingDims(dims, itemsPerGroup, groups);

  o_a.copyFrom(a);
  o_b.copyFrom(b);

  addVectors(entries, o_a, o_b, o_ab);

  o_ab.copyTo(ab);

  for(int i = 0; i < nvec*entries; ++i)
    std::cout << i << ": " << ab[i] << '\n';

  for(int i = 0; i < nvec*entries; ++i){
    if(ab[i] != (a[i] + b[i]))
      throw 1;
  }

  delete [] a;
  delete [] b;
  delete [] ab;

  addVectors.free();
  o_a.free();
  o_b.free();
  o_ab.free();
  device.free();

  return 0;
}

and here are the error messages that I get

> ./main
==============o=======================o==========================================
   CPU Info   |  Processor Name       | Intel(R) Core(TM) i7-3930K CPU @ 3.20GHz
              |  Cores                | 6
              |  Memory (RAM)         | 31 GB
              |  Clock Frequency      | 1.205 GHz
              |  SIMD Instruction Set | SSE2
              |  SIMD Width           | 128 bits
              |  L1 Cache Size (d)    |    32K
              |  L2 Cache Size        |   256K
              |  L3 Cache Size        | 12288K
==============o=======================o==========================================
    OpenCL    |  Device Name          | Intel(R) Core(TM) i7-3930K CPU @ 3.20GHz
              |  Driver Vendor        | Intel
              |  Platform ID          | 0
              |  Device ID            | 0
              |  Memory               | 31 GB
              |-----------------------+------------------------------------------
              |  Device Name          | Tahiti
              |  Driver Vendor        | AMD
              |  Platform ID          | 1
              |  Device ID            | 0
              |  Memory               | 2 GB
              |-----------------------+------------------------------------------
              |  Device Name          | Tahiti
              |  Driver Vendor        | AMD
              |  Platform ID          | 1
              |  Device ID            | 1
              |  Memory               | 2 GB
              |-----------------------+------------------------------------------
              |  Device Name          | Tahiti
              |  Driver Vendor        | AMD
              |  Platform ID          | 1
              |  Device ID            | 2
              |  Memory               | 2 GB
              |-----------------------+------------------------------------------
              |  Device Name          | Tahiti
              |  Driver Vendor        | AMD
              |  Platform ID          | 1
              |  Device ID            | 3
              |  Memory               | 2 GB
              |-----------------------+------------------------------------------
              |  Device Name          | Intel(R) Core(TM) i7-3930K CPU @ 3.20GHz
              |  Driver Vendor        | Intel
              |  Platform ID          | 1
              |  Device ID            | 4
              |  Memory               | 31 GB
==============o=======================o==========================================
Compiling [addVectors]
g++ -x c++ -fPIC -shared  -fopenmp /home/lucas/._occa/kernels/b65af17897881926/source.occa -o /home/lucas/._occa/kernels/b65af17897881926/binary -I/home/lucas/research/code/occa//include -L/home/lucas/research/code/occa//lib -locca

/home/lucas/._occa/kernels/b65af17897881926/source.occa: In function ‘void addVectors(const int*, int, int, int, const int&, const float4*, const float4*, float4*)’:
/home/lucas/._occa/kernels/b65af17897881926/source.occa:14:11: error: no match for ‘operator=’ (operand types are ‘occaPrivate_t<float4, 1>’ and ‘const float4*’)
         c = b;
           ^
/home/lucas/._occa/kernels/b65af17897881926/source.occa:14:11: note: candidates are:
In file included from /home/lucas/._occa/kernels/b65af17897881926/source.occa:1:0:
/home/lucas/._occa/libraries/occa/defines/OpenMP.hpp:479:14: note: TM& occaPrivate_t<TM, SIZE>::operator=(const occaPrivate_t<TM, SIZE>&) [with TM = float4; int SIZE = 1]
   inline TM& operator = (const occaPrivate_t &r) {
              ^
/home/lucas/._occa/libraries/occa/defines/OpenMP.hpp:479:14: note:   no known conversion for argument 1 from ‘const float4*’ to ‘const occaPrivate_t<float4, 1>&’
/home/lucas/._occa/libraries/occa/defines/OpenMP.hpp:484:14: note: TM& occaPrivate_t<TM, SIZE>::operator=(const TM&) [with TM = float4; int SIZE = 1]
   inline TM& operator = (const TM &t){
              ^
/home/lucas/._occa/libraries/occa/defines/OpenMP.hpp:484:14: note:   no known conversion for argument 1 from ‘const float4*’ to ‘const float4&’

---[ Error ]--------------------------------------------
    File     : /home/lucas/research/code/occa/src/OpenMP.cpp
    Function : buildFromSource
    Line     : 305
    Error    : Compilation error
========================================================

Am I doing something wrong? If not, I would like to put in a feature request for private vector variables.

Using streams with multiple devices

I am unable to get my program to work with streams and multiple devices in mode=CUDA. So, as a test, I modified the addVectors example under examples/usingStreams/ to include an extra device. The error is CUDA_ERROR_INVALID_HANDLE. See the modified code below.

#include <iostream>

#include "occa.hpp"

int main(int argc, char **argv){
  int entries = 8;

  float *a  = new float[entries];
  float *b  = new float[entries];
  float *ab = new float[entries];

  for(int i = 0; i < entries; ++i){
    a[i]  = i;
    b[i]  = 1 - i;
    ab[i] = 0;
  }   

  occa::device device;
  occa::device device1; // ADDED
  occa::kernel addVectors;
  occa::memory o_a, o_b, o_ab;

  occa::stream streamA, streamB;

  device.setup("mode = CUDA, deviceID = 0"); // UNCOMMENTED
  device1.setup("mode = CUDA, deviceID = 1"); //ADDED
  //device.setup("mode = OpenCL, platformID = 0, deviceID = 1"); //COMMENTED

  streamA = device.getStream();
  streamB = device.createStream();

  o_a  = device.malloc(entries*sizeof(float));
  o_b  = device.malloc(entries*sizeof(float));
  o_ab = device.malloc(entries*sizeof(float));

  addVectors = device.buildKernelFromSource("addVectors.okl",
                                            "addVectors");

  o_a.copyFrom(a);
  o_b.copyFrom(b);

  device.setStream(streamA);
  addVectors(entries, o_a, o_b, o_ab);

  device.setStream(streamB);
  addVectors(entries, o_a, o_b, o_ab);

  o_ab.copyTo(ab);

  for(int i = 0; i < entries; ++i)
    std::cout << i << ": " << ab[i] << '\n';

  delete [] a;
  delete [] b;
  delete [] ab;

  addVectors.free();
  o_a.free();
  o_b.free();
  o_ab.free();
  device.free();
  device1.free();  //ADDED
}

Aligned malloc in occa?

Hi David,
I would like to use vector intrinsics when running on the CPU in OpenMP mode. For this purpose, I need to replace all occaDeviceMalloc calls with a corresponding aligned malloc, e.g. to ensure 32-byte alignment for vdouble4. Unfortunately, I could not find an occaDeviceAlignedMalloc in OCCA.

The vector.hpp file has some vector intrinsics towards the bottom, but it is not complete yet, so I wrote a new class, which is not working properly right now. I am guessing the reason may be byte-alignment issues.
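
As a stopgap in Serial/OpenMP modes, where device memory is plain host memory, something like the following sketch may work (it assumes a device-level wrapMemory API like the one referenced in the wrapMemory() issue above):

#include <stdlib.h>  // posix_memalign

// Given an occa::device 'device' already set up in OpenMP mode:
const size_t bytes = 1024 * sizeof(double);
void *ptr = NULL;
posix_memalign(&ptr, 32, bytes);                     // 32-byte alignment for vdouble4
occa::memory o_mem = device.wrapMemory(ptr, bytes);  // assumed API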

Daniel

Compile error, not compatible with CUDA 8.0

Hi, I am new to occa. I work on an Ubuntu 16 PC with CUDA 8.0 installed. When I try to compile occa, I get an error that the program is not compatible with libcuda.so. Thank you.

building for COI?

Am I doing something wrong?

g++ -O3 -fPIC -o /home/hugo/occa/obj/timer.o  -fopenmp -DOCCA_OPENMP_ENABLED=1  -O3 -D __extern_always_inline=inline -DOCCA_DEBUG_ENABLED=0 -DNDEBUG=1 -DOCCA_SHOW_WARNINGS=0 -DOCCA_CHECK_ENABLED=1 -DOCCA_OPENCL_ENABLED=1 -DOCCA_CUDA_ENABLED=1 -DOCCA_HSA_ENABLED=0 -DOCCA_COI_ENABLED=1  -D LINUX_OS=1 -D OSX_OS=2 -D WINDOWS_OS=4 -D WINUX_OS=5  -D OCCA_OS=LINUX_OS -c -I/home/hugo/occa/lib -I/home/hugo/occa/include -I./include -I/usr/local/cuda-7.0/include -I/usr/local/cuda-7.0/include /home/hugo/occa/src/timer.cpp
In file included from /home/hugo/occa/include/occa/timer.hpp:4:0,
from /home/hugo/occa/src/timer.cpp:1:
/home/hugo/occa/include/occa/base.hpp:1144:29: error: 'occa::device occa::coi::wrapDevice' redeclared as different kind of symbol
occa::device wrapDevice(COIENGINE coiDevice);
^
/home/hugo/occa/include/occa/base.hpp:1141:18: error: previous declaration of 'occa::device occa::coi::wrapDevice(void*)'
occa::device wrapDevice(void *coiDevicePtr);
^
/home/hugo/occa/include/occa/base.hpp:1144:29: error: 'COIENGINE' was not declared in this scope
occa::device wrapDevice(COIENGINE coiDevice);
^
/home/hugo/occa/include/occa/base.hpp:1245:41: error: 'COIENGINE' has not been declared
friend occa::device coi::wrapDevice(COIENGINE coiDevice);
^
/home/hugo/occa/include/occa/base.hpp:1245:60: error: 'occa::device occa::coi::wrapDevice(int)' should have been declared inside 'occa::coi'
friend occa::device coi::wrapDevice(COIENGINE coiDevice);
^
/home/hugo/occa/include/occa/base.hpp:1361:41: error: 'COIENGINE' has not been declared
friend occa::device coi::wrapDevice(COIENGINE coiDevice);
^
/home/hugo/occa/include/occa/base.hpp:1541:41: error: 'COIENGINE' has not been declared
friend occa::device coi::wrapDevice(COIENGINE coiDevice);
^
make: *** [/home/hugo/occa/obj/timer.o] Error 1

Add benchmark example

Similar to:

https://github.com/ekondis/mixbench

Measure throughput and bandwidth for a GPU.

OCCA_CXX is ignored

The README indicates that OCCA_CXX is the right way to control the C++ compiler that is used to build OCCA. This is false in practice, both on Linux and MacOS.

You can see below that I want to use g++-7 but OCCA chooses g++ instead. On MacOS, it defaults to c++ and ignores my attempts to use clang++ and g++-7.

jrhammon@klondike:/tmp$ git clone https://github.com/libocca/occa.git
Cloning into 'occa'...
remote: Counting objects: 18556, done.
remote: Total 18556 (delta 0), reused 0 (delta 0), pack-reused 18556
Receiving objects: 100% (18556/18556), 12.84 MiB | 9.35 MiB/s, done.
Resolving deltas: 100% (14266/14266), done.
Checking connectivity... done.
jrhammon@klondike:/tmp$ cd occa
jrhammon@klondike:/tmp/occa$ export OCCA_CXX=g++-7
jrhammon@klondike:/tmp/occa$ make
mkdir -p /tmp/occa//obj
mkdir -p /tmp/occa//obj/parser
mkdir -p /tmp/occa//obj/python
g++ -O3 -fPIC -DOCCA_COMPILED_DIR="/tmp/occa/" -o /tmp/occa//obj/Serial.o  -fopenmp -DOCCA_OPENMP_ENABLED=1  -O3 -D __extern_always_inline=inline -DOCCA_DEBUG_ENABLED=0 -DNDEBUG=1 -DOCCA_SHOW_WARNINGS=0 -DOCCA_CHECK_ENABLED=1 -DOCCA_OPENCL_ENABLED=1 -DOCCA_CUDA_ENABLED=1  -D LINUX_OS=1 -D OSX_OS=2 -D WINDOWS_OS=4 -D WINUX_OS=5  -D OCCA_OS=LINUX_OS -c -I/tmp/occa//lib -I/tmp/occa//include -I/usr/local/cuda-9.0/include -I/usr/local/cuda-9.0/include /tmp/occa//src/Serial.cpp
g++ -O3 -fPIC -DOCCA_COMPILED_DIR="/tmp/occa/" -o /tmp/occa//obj/OpenCL.o  -fopenmp -DOCCA_OPENMP_ENABLED=1  -O3 -D __extern_always_inline=inline -DOCCA_DEBUG_ENABLED=0 -DNDEBUG=1 -DOCCA_SHOW_WARNINGS=0 -DOCCA_CHECK_ENABLED=1 -DOCCA_OPENCL_ENABLED=1 -DOCCA_CUDA_ENABLED=1  -D LINUX_OS=1 -D OSX_OS=2 -D WINDOWS_OS=4 -D WINUX_OS=5  -D OCCA_OS=LINUX_OS -c -I/tmp/occa//lib -I/tmp/occa//include -I/usr/local/cuda-9.0/include -I/usr/local/cuda-9.0/include /tmp/occa//src/OpenCL.cpp
^Cmakefile:68: recipe for target '/tmp/occa//obj/OpenCL.o' failed
make: *** [/tmp/occa//obj/OpenCL.o] Interrupt

Add support to use __constant in GPU modes

Try out the following to see if we can get __constant to work in native CUDA kernels.
Manually declare the constant variable/array with a name in the kernel file:

__constant float myVar;
__constant float myArray[10];

Compile the kernel:

occa::kernel foo = device.buildKernel("foo.cu", "foo");

Use an OCCA helper method to initialize the constant value. The 'extra work' part is keeping track of the variable name and size outside of the kernel:

float myVar = 10;
float *myArray = new float[10];
// Initialize myArray

occa::cuda::initConstantMemory(foo, "myVar"  , &myVar , sizeof(float));
occa::cuda::initConstantMemory(foo, "myArray", myArray, 10 * sizeof(float));

Add reduction attribute

Reductions are not trivial to write in OKL; instead, we can make it a backend requirement to provide a way to do fast reductions.

@reduction void dot(const int entries,
                    const float *a,
                    const float *b,
                    @reduce float out) {

  for (int i = 0; i < entries; ++i; reduce(32)) {
    out += a[i] * b[i];
  }
}

We can have multiple reductions in one kernel

@reduction void dot_norm(const int entries,
                         const float *a,
                         const float *b,
                         @reduce float dot,
                         @reduce float a_norm2,
                         @reduce float b_norm2) {

  for (int i = 0; i < entries; ++i; reduce(32)) {
    dot += a[i] * b[i];
    a_norm2 += a[i] * a[i];
    b_norm2 += b[i] * b[i];
  }
}

Running on MIC

I am having trouble running on the Xeon Phi using the offload model. The simple program shown below prints 240 cores, but bin/occainfo does not detect the Xeon Phi (only CPU info is shown). I also tried to compile occa with OCCA_COI_ENABLED=1, but I learned this is deprecated, so I am trying to run in OpenMP mode.

#include <stdio.h>
#include <omp.h>

int main() {
  int nprocs;
  #pragma offload target(mic)
  nprocs = omp_get_num_procs();
  printf("Hello %d\n", nprocs);
  return 0;
}

With OpenMP mode I am redefining the outer loop of my kernels to run in parallel and on the MIC with:

 #define myOuterFor OCCA_PRAGMA("offload target(mic)") occaParallelFor occaOuterFor0

I got errors after I added the mic pragma:

 error: pointer variable "occaKernelArgs" in this offload region must be specified in an in/out/inout/nocopy clause

Is it possible to run OCCA in native mode? I would like to launch more than one process per MIC card with a few threads each, instead of running threads only. But I guess this would go against the design of OCCA.

Daniel

CUDA multiple defines

When I run the example occa/examples/addVectors/c/ in CUDA mode, I get a bunch of multiple-define warnings between primitives.hpp and CUDA.hpp, as well as defines.hpp.

There is also an error from a missing file: occa/defines.hpp.

Allow memory allocation inside kernels

FYI, I can allocate OCCA device memory inside the occaKernel routine body (outside of the actual kernel loops), which allows me to define temporary global device memory buffers there instead of externally, where they would then have to be passed in as parameters. This means I can keep the logic for allocating and managing temporary/work buffers used in kernels localized to the occaKernel routine code, where it belongs.

I achieve this by passing in the occaDevice pointer cast as a uint64_t, then calling a special-purpose externally-defined C++ allocation function ("CreateOccaMemory()") inside the routine body to return an occaPointer to a buffer (global device memory). This allocation function simply takes the incoming occaDevice pointer plus a number of elements argument, and calls its OCCA malloc() function. The return type is (for example) occaPointer float*. I also have a corresponding memory free function ("ReleaseOccaMemory()")

Anyway, it would be much better (and helpful for all OCCA programmers) if there were built-in support to do this without even having to explicitly pass in the occaDevice pointer as a parameter from the outside. That pointer could be a hidden parameter generated and passed in by the OCCA build. And in terms of user syntax, the user could simply define global device memory via dynamically-sized arrays (auto-freed on return). Or similar to what I do, you could just have a new built-in OCCA allocator function (e.g., "occaMalloc()") that users call to create the global device memory, which is accessible in subsequent kernel loops. Of course, you would also want to provide an "occaFree()" function.
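
A sketch of the helper described above; CreateOccaMemory/ReleaseOccaMemory are the author's own names, and the body below (in particular the raw-handle accessor) is an assumption for illustration only:

// Host-side allocator callable from an occaKernel body in CPU modes;
// the occa::device is smuggled in as a uint64_t kernel parameter.
extern "C" float* CreateOccaMemory(uint64_t devicePtr, int entries) {
  occa::device *device = (occa::device*) devicePtr;
  occa::memory mem = device->malloc(entries * sizeof(float));
  // Bookkeeping needed by ReleaseOccaMemory() is omitted here;
  // getMemoryHandle() is a hypothetical raw-pointer accessor.
  return (float*) mem.getMemoryHandle();
}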

Allow return statements

Would it be possible to allow a return statement (return 0 on success, otherwise failure)?

math functions for vectors (floatn)

Concerning vector data types.

a) The OpenCL floatn vectors support math functions such as sin, cos, etc., but occa lacks that support in non-OpenCL modes.

b) Support for all float-n types, such as float3, float6, and float9, which can be used to represent vectors, symmetric tensors, and tensors.
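
For (a), a componentwise fallback for non-OpenCL modes could look like this sketch (it assumes only that occa's float4 is default-constructible with .x/.y/.z/.w members, and requires <math.h> for sinf):

// Componentwise sin for float4 in non-OpenCL modes (sketch)
inline float4 sinf4(const float4 &v) {
  float4 r;
  r.x = sinf(v.x);
  r.y = sinf(v.y);
  r.z = sinf(v.z);
  r.w = sinf(v.w);
  return r;
}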
