
occa's People

Contributors

aljenu, amikstcyr, awehrfritz, barracuda156, deukhyun-cha, dmcdougall, dmed256, hknibbe2, jameseggleton, jdahm, jedbrown, jeremylt, kian-ohara, kris-rowe, kvoronin-intel, lcw, malachitimothyphillips, noelchalmers, pdhahn, rakesroy, reidatcheson, sapphire-arches, sfrijters, stgeke, suniljoshi82, tcew, tejax-alaghari, thilinarmtb, v-dobrev, wjhorne


occa's Issues

Loopy removal

Regarding 96d9460:

I'm wondering about the right path forward here. Sure, I can call loopy myself using whatever syscall, and that's essentially equivalent to the code that was removed. But since loopy can take a couple seconds to run, the right thing to do would be to cache its output. I was rather hoping to piggyback off of OCCA's compiler cache to do so, which would point in the direction of tighter integration. So what I'm looking for here is a pronouncement by either @tcew or @dmed256 along one of the following lines:

  • There will be no loopy-anything in occa. Third-party toolchain components will not be supported as part of OCCA, in particular with respect to cache integration.
  • There will be no loopy-anything in occa. Third-party toolchain components are a possibility.
  • Loopy as part of the build chain is welcome back into occa, as long as conditions XYZ are met.

cc @lcw

Query device memory

As far as I can tell, there is no mechanism in occa that allows me to query a device's total memory.

Two ways that I could see doing this are

  1. an occa function that just returns the global memory size of the device (we currently only really care about this for opencl and cuda)
  2. an occa function that returns the device pointer, and let the user query the device themselves

Would either of these be feasible to implement within occa? We have been locking up our boxes by allocating too much memory on the device, so we want to put in some checks to make sure this doesn't occur in our codes.
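
For reference, the backend calls that option 1 would wrap are standard driver APIs; here is a minimal sketch (only the idea of exposing them through occa is hypothetical):

#include <cuda.h>
#include <CL/cl.h>

// CUDA: total device memory in bytes, via the standard driver call
size_t cudaGlobalMemSize(CUdevice device) {
  size_t bytes = 0;
  cuDeviceTotalMem(&bytes, device);
  return bytes;
}

// OpenCL: total global memory in bytes, via clGetDeviceInfo
cl_ulong openclGlobalMemSize(cl_device_id device) {
  cl_ulong bytes = 0;
  clGetDeviceInfo(device, CL_DEVICE_GLOBAL_MEM_SIZE,
                  sizeof(bytes), &bytes, NULL);
  return bytes;
}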

problems including a device function

Hi,

I am having problems including a file with a device function: the device function ends up multiply defined. For example, I have a file, included with occaKernelInfoAddInclude, that contains

void conn_mapping(const int tree, const dfloat a, const dfloat b,
                         const dfloat c, dfloat *x, dfloat *y, float *z)
{
  *x = a;
  *y = b;
  *z = c;
}

When this file gets included in the kernel info, compiling kernels fails like this:

Compiling [compute_X]
g++ -x c++ -fPIC -shared  /Users/lucas/._occa/kernels/a21338d5e3be2cbf/source.occa -o /Users/lucas/._occa/kernels/a21338d5e3be2cbf/binary -I/Users/lucas/research/code/ape/occa//include -L/Users/lucas/research/code/ape/occa//lib -locca

Compiling [compute_X0]
g++ -x c++ -fPIC -shared  /Users/lucas/._occa/kernels/3854aa958acba9fd/source.occa -o /Users/lucas/._occa/kernels/3854aa958acba9fd/binary -I/Users/lucas/research/code/ape/occa//include -L/Users/lucas/research/code/ape/occa//lib -locca

/Users/lucas/._occa/kernels/3854aa958acba9fd/source.occa: In function 'void conn_mapping(int, float, float, float, float*, float*, float*)':
/Users/lucas/._occa/kernels/3854aa958acba9fd/source.occa:88:19: error: redefinition of 'void conn_mapping(int, float, float, float, float*, float*, float*)'
 occaFunction void conn_mapping(occaConst int tree,
                   ^
/Users/lucas/._occa/kernels/3854aa958acba9fd/source.occa:71:19: note: 'void conn_mapping(int, float, float, float, float*, float*, float*)' previously defined here
 occaFunction void conn_mapping(const int tree, const dfloat a, const dfloat b,
                   ^

Looking at the source, it seems that a21338d5e3be2cbf/source.occa is okay and only has one definition of conn_mapping, but 3854aa958acba9fd/source.occa is not. Here is the relevant section of 3854aa958acba9fd/source.occa:

occaFunction void conn_mapping(const int tree, const dfloat a, const dfloat b,
                               const dfloat c, dfloat *occaRestrict x,
                               dfloat *occaRestrict y, float *occaRestrict z)
{
  *x = a;
  *y = b;
  *z = c;
}

occaFunction void conn_mapping(occaConst int tree,
                               occaConst float a,
                               occaConst float b,
                               occaConst float c,
                               float * occaRestrict x,
                               float * occaRestrict y,
                               float * occaRestrict z);

occaFunction void conn_mapping(occaConst int tree,
                               occaConst float a,
                               occaConst float b,
                               occaConst float c,
                               float * occaRestrict x,
                               float * occaRestrict y,
                               float * occaRestrict z) {
  *x = a;
  *y = b;
  *z = c;
}

I looked but cannot figure out why this function is getting defined twice. Any help would be appreciated.

Thanks,
Lucas

occaMax gives min

The definition of occaMax at line 209 in occa/defines/OpenCL.h uses "<" where it needs ">", so it returns the minimum.
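
Assuming the usual ternary-style macro, the fix is a one-character change (illustrative; the exact definition in occa/defines/OpenCL.h may differ):

#define occaMax(a, b) (((a) > (b)) ? (a) : (b))  // was '<', which returned the min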

how to disable CUDA support?

I am only interested in OpenCL support. CUDA support is enabled by default even though it is not actually usable [1] on my machine. The build succeeds but testing fails.

I cannot find a build system control option to disable CUDA, short of hacking makefiles by hand.

jrhammon@klondike:/tmp$ git clone https://github.com/libocca/occa.git
Cloning into 'occa'...
remote: Counting objects: 18556, done.
remote: Total 18556 (delta 0), reused 0 (delta 0), pack-reused 18556
Receiving objects: 100% (18556/18556), 12.84 MiB | 9.78 MiB/s, done.
Resolving deltas: 100% (14266/14266), done.
Checking connectivity... done.
jrhammon@klondike:/tmp$ cd occa
jrhammon@klondike:/tmp/occa$ export OCCA_CXX=g++-7
jrhammon@klondike:/tmp/occa$ export LD_LIBRARY_PATH=$PWD/lib:$LD_LIBRARY_PATH
jrhammon@klondike:/tmp/occa$ make -j16 >& make.log
jrhammon@klondike:/tmp/occa$ make test
echo '---[ Testing ]--------------------------'
---[ Testing ]--------------------------
cd /tmp/occa/; \
make -j 4 CXXFLAGS='-g' FCFLAGS='-g'
make[1]: Entering directory '/tmp/occa'
make[1]: Nothing to be done for 'all'.
make[1]: Leaving directory '/tmp/occa'
cd /tmp/occa//examples/addVectors/cpp; \
make -j 4 CXXFLAGS='-g' FCFLAGS='-g'; \
./main
make[1]: Entering directory '/tmp/occa/examples/addVectors/cpp'
g++ -g -o /tmp/occa/examples/addVectors/cpp//main  -fopenmp -DOCCA_OPENMP_ENABLED=1  -O3 -D __extern_always_inline=inline -DOCCA_DEBUG_ENABLED=0 -DNDEBUG=1 -DOCCA_SHOW_WARNINGS=0 -DOCCA_CHECK_ENABLED=1 -DOCCA_OPENCL_ENABLED=1 -DOCCA_CUDA_ENABLED=1  -D LINUX_OS=1 -D OSX_OS=2 -D WINDOWS_OS=4 -D WINUX_OS=5  -D OCCA_OS=LINUX_OS  /tmp/occa/examples/addVectors/cpp//main.cpp -I/tmp/occa/lib -I/tmp/occa/include -L/tmp/occa/lib   -I/usr/local/cuda-9.0/include -I/usr/local/cuda-9.0/include  -locca -lm -lrt -ldl -L/usr/local/cuda-9.0/lib64 -lOpenCL -L/usr/local/cuda-9.0/lib64/stubs -lcuda
In file included from /tmp/occa/include/occa.hpp:18:0,
                 from /tmp/occa/examples/addVectors/cpp//main.cpp:3:
/tmp/occa/include/occa/defines/vector.hpp:20:32: warning: unknown option after ‘#pragma GCC diagnostic’ kind [-Wpragmas]
 #pragma GCC diagnostic ignored "-Wint-in-bool-context"
                                ^
make[1]: Leaving directory '/tmp/occa/examples/addVectors/cpp'
./main: error while loading shared libraries: libcuda.so.1: cannot open shared object file: No such file or directory
makefile:116: recipe for target 'test' failed

[1] /usr/local/cuda-9.0 exists but the requisite drivers are not installed and thus CUDA does not work.

Restore CUDA/OpenCL kernel source support

Hi everybody.

While waiting for the libocca 1.0 release, I strongly suggest adding support for native CUDA/OpenCL kernel sources in the CUDA/OpenCL OCCA modes: this would be a great advantage when porting an existing CUDA-accelerated program to OCCA.

What I suggest is:

  1. add a NativeCUDA / NativeOpenCL mode for kernel builds, in order to use already existing CUDA/OpenCL kernel source code (to be used only with the CUDA and OpenCL OCCA modes, of course); see the sketch after this list
    a) please remove any dummy arguments needed to call the CUDA/OpenCL kernel (see the original implementation in the "master" branch of libocca)
    b) use a single, consistent semantic for the formal parameters of the kernel class's setWorkingDims method, for example: the CUDA way (dims, gridOfBlocks, blockDims) or the OpenCL way (dims, gridOfElements, blockDims)
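
For concreteness, here is a sketch of the proposed call pattern, reusing the old master-branch API names (the NativeCUDA mode itself does not exist yet):

occa::device device;
device.setup("mode = CUDA, deviceID = 0");

// Build an unmodified .cu kernel source directly
occa::kernel addVectors =
  device.buildKernelFromSource("addVectors.cu",
                               "addVectors",
                               occa::usingNative);

// One agreed-upon argument order, e.g. (dims, blockDims, gridOfBlocks)
addVectors.setWorkingDims(1, 256, (entries + 255) / 256);
addVectors(entries, o_a, o_b, o_ab);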

Thank you for your great work and support.

Luca Ferraro

Kernel profiler

Pass profile: true in the kernel properties to aggregate timing information.
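
A sketch of the intended usage (the profile property is the proposal itself, not an existing feature):

occa::properties props("profile: true");
occa::kernel addVectors = device.buildKernel("addVectors.okl",
                                             "addVectors",
                                             props);
addVectors(entries, o_a, o_b, o_ab);  // timings aggregated per kernel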

Draft documentation for 1.0

We're adding documentation in readthedocs: http://occa.readthedocs.io

Once the draft is done, we'll review it one more time before the 1.0 release

  • USER DOCUMENTATION
    • Getting Started
    • Command-Line Interface
    • Support
  • OCCA DOCUMENTATION
    • Introduction
    • Terminology
    • Connecting to a Device
    • Allocating and Initializing Memory
    • Building and Running Kernels
    • Synchronization
    • Streams
    • Background Devices
    • Unified Virtual Addressing
  • OKL DOCUMENTATION
    • Introduction
    • Loop-based Parallelism
    • Memory Spaces
    • Kernels and Functions
    • Reductions
    • Features and Limitations
  • API DOCUMENTATION
    • Device
    • Memory
    • Kernel
    • Stream
    • Properties
  • EXTENSIONS
    • Packaging Kernels in Libraries
    • Custom Backends
    • Custom Kernel Arguments
    • Custom OKL Attributes

ARM Support

Test out the Serial and OpenMP modes on ARM hardware to make sure the compiler and compiler flags work. Find ARM-specific macros for the vectorization types (float2, float4, ...)

[FORTRAN] unnecessary printf statements

There is an unnecessary printf statement in src/FC.cpp, line 10:

printf("type = %p\n", *type);

Also, when occa is run with thousands of GPUs, the info messages for cached kernels become excessive, so I suggest an option to turn them on/off.

//std::cout << "Found cached binary of [" << compressFilename(filename) << "] in [" << compressFilename(binaryFile) << "]\n";
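
One way to make the message switchable, sketched with a hypothetical OCCA_VERBOSE environment variable and plain std::getenv (requires <cstdlib>):

const char *verbose = std::getenv("OCCA_VERBOSE");  // hypothetical env var
if (verbose && std::string(verbose) == "1") {
  std::cout << "Found cached binary of [" << compressFilename(filename)
            << "] in [" << compressFilename(binaryFile) << "]\n";
}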

regards,
Daniel

Emulate OKL AST

Since we build the AST from parsed OKL code, have a feature to emulate the AST to search for

  • Out-of-bound accesses
  • Race conflicts
  • Access ordering (heatmap?)
  • FLOP/Read/Write counts
  • Potential cache misses

Update C bindings

  • Make sure the C API is in sync with the C++ API
  • Clean up occaType to avoid allocations

opencl float atomics

I am in need of atomicAdd for floats in OCCA. Currently it seems only CUDA's atomicAdd on floats works. atomicAdd on doubles doesn't work in CUDA, and OpenCL does not seem to define it on floats/doubles at all. I know CUDA implements 64-bit atomicAdd not directly but via atomicCAS; still, it would be great to have it.

My OpenCL is:

Version: OpenCL 1.2 AMD-APP (1445.5)
Extensions: cl_khr_fp64 cl_amd_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_gl_sharing cl_ext_device_fission cl_amd_device_attribute_query cl_amd_vec3 cl_amd_printf cl_amd_media_ops cl_amd_media_ops2 cl_amd_popcnt cl_khr_spir cl_amd_svm cl_khr_gl_event
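
In the meantime, a common device-side workaround in OpenCL (not an occa API) is to emulate the float atomic add with atomic_cmpxchg on the value's bit pattern:

// Emulated atomic float add for OpenCL 1.x global memory.
// atomic_cmpxchg operates on 32-bit integers, so we reinterpret
// the float's bits through a union and retry until no other
// work-item raced us.
inline void atomicAddFloat(volatile __global float *addr, float val) {
  union { unsigned int u; float f; } expected, next;
  do {
    expected.f = *addr;
    next.f     = expected.f + val;
  } while (atomic_cmpxchg((volatile __global unsigned int *) addr,
                          expected.u, next.u) != expected.u);
}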

Version 1.0

We're aiming for a production-grade release with 1.0. A large amount of refactoring is being done for the API and OKL parser.

Issues

  • #51: Add basic linear algebra routines to occa::array
  • #49: Add custom attributes
  • #52: Update C bindings
  • #46: Create 1.0 documentation
  • #48: Add coverage metrics

Documentation

libocca.org

We're releasing documentation with

  • Getting started
  • OCCA programming model
  • OKL programming model
  • API documentation
  • Extending OCCA

Testing and Coverage

Testing has long been a crucial missing piece of our release process. We are making it a required part of every release, and coverage metrics will also be included.

Extendable API

Developer Tools

Helper methods were added to facilitate collaboration and to make it easier to extend OCCA from other codebases.

Examples include:

  • System calls
  • Environment variables
  • Finding files with/without custom URI prefixes (e.g. occa://occa/foo.okl)
  • Hashing files and objects
  • Caching files such as compiled binaries
  • Locating cached binaries
  • String manipulation

Properties

We're aiming to unify backends through a general API. However, we would also like to expose various features unique to different backends.

For example:

  • Memory allocation types
    • CUDA and OpenCL have texture memory
    • Possible memory locations in future architectures
  • Device features
    • Expose backend-specific device features, such as the list of cores the Threads mode pins to, or the deviceID in OpenCL and CUDA
  • Kernels
    • Defines and headers
    • Possible extensions to kernel building

We added properties in 1.0 to help create a more flexible API. Properties use the standard JSON representation to maintain a 1-1 mapping from data to string.

The following JS augmentations are also supported:

  • Object keys do not need to be stringified
  • Single quotes can also be used for strings
  • Trailing commas can be used
{ mode: 'Serial', }

While we support loading JSON strings with occa::json, properties are always JSON objects. Hence, we also support ignoring braces when initializing properties from a string.

occa::properties("mode: 'Serial'")

Modes

OCCA modes are composed of 3 classes: device, kernel, and memory. Modes can now be added from outside of OCCA by registering them; the occa::mode constructor registers the classes with their respective mode.

occa::mode<openmp::modeInfo,
           openmp::device,
           openmp::kernel,
           openmp::memory> mode("OpenMP");

OKL Parser

The OKL kernel language is one of the key features in OCCA. Adding JIT compilation of kernels gives us the flexibility of reusing kernels for different OCCA modes. We are refactoring the parser from the bottom up for robustness and flexibility.

Custom Kernel Arguments

Custom C++ classes can now be passed as kernel arguments by adding the occa::kernelArg cast operator

operator occa::kernelArg ();

OCCA does not restrict the number or types of arguments a class can add to the kernel call, though the backend might; for example, CPU modes allow at most 256 kernel arguments.

To see an example of this use, check occa::array.

Note: the OKL kernel has to match the resulting number and types of arguments.
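
An illustrative sketch of the pattern (the names here are made up; the conversion in the return statement assumes occa::memory is itself convertible to occa::kernelArg, as it is when passed to kernels directly):

class floatVector {
 public:
  occa::memory buffer;

  // Forward the underlying device buffer when the object is
  // passed as a kernel argument
  operator occa::kernelArg () {
    return buffer;
  }
};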

Linear Algebra

We're adding some common vector operations and linear algebra routines to occa::array

  • + - * / operations
  • Norms (L1, L2, Lp, Linf)
  • max, min
  • dot

Languages

We will officially support the following frontends in the 1.0 release

Language | Github Repository
---------|--------------------
C        | libocca/occa
C++      | libocca/occa
Python   | libocca/occa-python

We'll be adding the following in the 1.1 release

Language | Github Repository
---------|--------------------
Python   | libocca/occa-python
Fortran  | libocca/occa-fortran

Potential future languages

Language | Github Repository
---------|--------------------
Java     | libocca/occa-java
Julia    | libocca/occa-julia

Documentation and testing will be included

Provide constructors for occa::memory objects with offsets

In many situations we need to pass subsets of memory buffers to a kernel. For example, when working with many streams (to implement a pipeline that processes a big buffer in chunks with the same kernel), the kernel must operate on different portions of a memory buffer.

for (int i = 0; i < nchunks; ++i) {
  size_t starting_index = i * chunksize;
  // stuff like device.setStream(.......)
  // async copy for buffers from base + starting_index, of size chunksize
  kernel( occa::memory(bufA, starting_index), occa::memory(bufB, starting_index) );
  // async copy back
}

What I suggest is to provide a simple way to create occa::memory objects from other existing occa::memory objects using an offset: this is a "must have" whenever you need to process a subsection of an occa::memory buffer with a kernel, for example when using multiple streams to process a buffer in chunks (see the example above).

You could go through an operator+ or a copy constructor (or both), to let the user write something like this:

// modified source from the master branch of libocca
template <>
memory_t<CUDA>::memory_t(const memory_t<CUDA> &m,
                         const uintptr_t offset) {
  *this = m;
  handle = new CUdeviceptr;
  CUdeviceptr base = *((CUdeviceptr*) m.handle);
  *((CUdeviceptr*) handle) = base + offset;
}

For OpenCL we can use subbuffers.

We must guarantee that the memory resources are not destroyed when the newly shifted memory object goes out of scope.

Free allocations made by initializer

These memory leaks create valgrind noise for users that link to libocca, even if they never call into the library.

==14201== HEAP SUMMARY:
==14201==     in use at exit: 232 bytes in 7 blocks
==14201==   total heap usage: 35 allocs, 28 frees, 111,478 bytes allocated
==14201== 
==14201== 8 bytes in 1 blocks are definitely lost in loss record 2 of 7
==14201==    at 0x4C2D52F: operator new(unsigned long) (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==14201==    by 0x58B2F2F: occa::env::registerFileOpeners() (in /home/jed/src/occa/lib/libocca.so)
==14201==    by 0x58B5B9C: occa::env::initialize() (in /home/jed/src/occa/lib/libocca.so)
==14201==    by 0x400F519: call_init.part.0 (in /usr/lib/ld-2.26.so)
==14201==    by 0x400F625: _dl_init (in /usr/lib/ld-2.26.so)
==14201==    by 0x4000F69: ??? (in /usr/lib/ld-2.26.so)
==14201== 
==14201== 8 bytes in 1 blocks are definitely lost in loss record 3 of 7
==14201==    at 0x4C2D52F: operator new(unsigned long) (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==14201==    by 0x58B2F4C: occa::env::registerFileOpeners() (in /home/jed/src/occa/lib/libocca.so)
==14201==    by 0x58B5B9C: occa::env::initialize() (in /home/jed/src/occa/lib/libocca.so)
==14201==    by 0x400F519: call_init.part.0 (in /usr/lib/ld-2.26.so)
==14201==    by 0x400F625: _dl_init (in /usr/lib/ld-2.26.so)
==14201==    by 0x4000F69: ??? (in /usr/lib/ld-2.26.so)
==14201== 
==14201== 8 bytes in 1 blocks are definitely lost in loss record 4 of 7
==14201==    at 0x4C2D52F: operator new(unsigned long) (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==14201==    by 0x58B2F69: occa::env::registerFileOpeners() (in /home/jed/src/occa/lib/libocca.so)
==14201==    by 0x58B5B9C: occa::env::initialize() (in /home/jed/src/occa/lib/libocca.so)
==14201==    by 0x400F519: call_init.part.0 (in /usr/lib/ld-2.26.so)
==14201==    by 0x400F625: _dl_init (in /usr/lib/ld-2.26.so)
==14201==    by 0x4000F69: ??? (in /usr/lib/ld-2.26.so)

device::free() failure in CUDA mode

There's a problem with resource deallocation when using OCCA in CUDA native mode.

File     : <stripped>/occa/src/modes/cuda/kernel.cpp
Function : free
Line     : 260
Message  : Kernel (addVectors) : Unloading Module
Error   : CUDA Error [ 709 ]: CUDA_ERROR_CONTEXT_IS_DESTROYED
Stack    :
  6 occa::cuda::error(cudaError_enum, std::string const&, std::string const&, int, std::string const&)
  5 occa::cuda::kernel::free()
  4 occa::kernel::removeKHandleRef()
  3 ./main()
  2 /lib64/libc.so.6(__libc_start_main+0xf5)
  1 ./main()

I attach a simple diff file to apply to examples/nativeAddVectors which should reproduce the problem.

Thank you for your precious work,

nativeAddVectors.diff.zip

Errors in examples midgTest & mandelbulb

Hi, I'm using the 1.0 branch, and I get errors when testing the examples.
In midgTest, it says it could not find the function [midg] in the file midg.okl.
In mandelbulb, it says there is no setupAide.hpp.
Thank you.

wrapMemory() function in CUDA.cpp

The wrapMemory() function in CUDA.cpp causes a runtime error. The error occurs when finish() is called, but when I modified the code (see line 1300 in CUDA.cpp), I was able to get the kernel to execute successfully. Below is my suggested patch/replacement for the wrapMemory() function.

template <>
memory_v* device_t<CUDA>::wrapMemory(void *handle_,
                                     const uintptr_t bytes) {
  memory_v *mem = new memory_t<CUDA>;

  mem->dHandle = this;
  mem->size    = bytes;
  mem->handle  = (CUdeviceptr*) handle_;

  mem->memInfo |= memFlag::isAWrapper;

  return mem;
}

Private vector variables

I am having trouble using private float4 variables. Here is a small example of code that does not work:

#include <iostream>

#include "occa.hpp"

int main(int argc, char **argv){
  occa::printAvailableDevices();

  int nvec = 4;
  int entries = 2;

  const char *kernelString =
    "occaKernel void addVectors(occaKernelInfoArg,                  \n"
    "                           occaConst int occaVariable entries, \n"
    "                           occaConst occaPointer float4 * a,   \n"
    "                           occaConst occaPointer float4 * b,   \n"
    "                           occaPointer           float4 * ab){ \n"
    "  occaParallelFor                                              \n"
    "  occaOuterFor0{                                               \n"
    "   occaPrivate(float4, c);                                     \n"
    "    occaInnerFor0{                                             \n"
    "      const int N = occaGlobalId0;                             \n"
    "      if(N < entries){                                         \n"
    "        c = b;                                                 \n"
    "        ab[N].x = a[N].x + c[N].x;                             \n"
    "        ab[N].y = a[N].y + c[N].y;                             \n"
    "        ab[N].z = a[N].z + c[N].z;                             \n"
    "        ab[N].w = a[N].w + c[N].w;                             \n"
    "      }                                                        \n"
    "    }                                                          \n"
    "  }                                                            \n"
    "}                                                              \n";


  float *a  = new float[nvec*entries];
  float *b  = new float[nvec*entries];
  float *ab = new float[nvec*entries];

  for(int i = 0; i < nvec*entries; ++i){
    a[i]  = i;
    b[i]  = 1 - i;
    ab[i] = 0;
  }

  occa::device device;
  occa::kernel addVectors;
  occa::kernelInfo kInfo;
  occa::memory o_a, o_b, o_ab;


  //---[ Device setup with string flags ]-------------------
  // device.setup("mode = Serial");
  device.setup("mode = OpenMP  , schedule = compact, chunk = 10");
  // device.setup("mode = OpenCL  , platformID = 0, deviceID = 1");
  // device.setup("mode = CUDA    , deviceID = 0");
  // device.setup("mode = Pthreads, threadCount = 4, schedule = compact, pinnedCores = [0, 0, 1, 1]");
  // device.setup("mode = COI     , deviceID = 0");
  //========================================================

  o_a  = device.malloc(nvec*entries*sizeof(float));
  o_b  = device.malloc(nvec*entries*sizeof(float));
  o_ab = device.malloc(nvec*entries*sizeof(float));

  addVectors = device.buildKernelFromString(kernelString,
                                            "addVectors", kInfo,
                                            occa::usingNative);

  int dims = 1;
  int itemsPerGroup(16);
  int groups((nvec*entries + itemsPerGroup - 1)/itemsPerGroup);

  addVectors.setWorkingDims(dims, itemsPerGroup, groups);

  o_a.copyFrom(a);
  o_b.copyFrom(b);

  addVectors(entries, o_a, o_b, o_ab);

  o_ab.copyTo(ab);

  for(int i = 0; i < nvec*entries; ++i)
    std::cout << i << ": " << ab[i] << '\n';

  for(int i = 0; i < nvec*entries; ++i){
    if(ab[i] != (a[i] + b[i]))
      throw 1;
  }

  delete [] a;
  delete [] b;
  delete [] ab;

  addVectors.free();
  o_a.free();
  o_b.free();
  o_ab.free();
  device.free();

  return 0;
}

and here are the error messages that I get

> ./main
==============o=======================o==========================================
   CPU Info   |  Processor Name       | Intel(R) Core(TM) i7-3930K CPU @ 3.20GHz
              |  Cores                | 6
              |  Memory (RAM)         | 31 GB
              |  Clock Frequency      | 1.205 GHz
              |  SIMD Instruction Set | SSE2
              |  SIMD Width           | 128 bits
              |  L1 Cache Size (d)    |    32K
              |  L2 Cache Size        |   256K
              |  L3 Cache Size        | 12288K
==============o=======================o==========================================
    OpenCL    |  Device Name          | Intel(R) Core(TM) i7-3930K CPU @ 3.20GHz
              |  Driver Vendor        | Intel
              |  Platform ID          | 0
              |  Device ID            | 0
              |  Memory               | 31 GB
              |-----------------------+------------------------------------------
              |  Device Name          | Tahiti
              |  Driver Vendor        | AMD
              |  Platform ID          | 1
              |  Device ID            | 0
              |  Memory               | 2 GB
              |-----------------------+------------------------------------------
              |  Device Name          | Tahiti
              |  Driver Vendor        | AMD
              |  Platform ID          | 1
              |  Device ID            | 1
              |  Memory               | 2 GB
              |-----------------------+------------------------------------------
              |  Device Name          | Tahiti
              |  Driver Vendor        | AMD
              |  Platform ID          | 1
              |  Device ID            | 2
              |  Memory               | 2 GB
              |-----------------------+------------------------------------------
              |  Device Name          | Tahiti
              |  Driver Vendor        | AMD
              |  Platform ID          | 1
              |  Device ID            | 3
              |  Memory               | 2 GB
              |-----------------------+------------------------------------------
              |  Device Name          | Intel(R) Core(TM) i7-3930K CPU @ 3.20GHz
              |  Driver Vendor        | Intel
              |  Platform ID          | 1
              |  Device ID            | 4
              |  Memory               | 31 GB
==============o=======================o==========================================
Compiling [addVectors]
g++ -x c++ -fPIC -shared  -fopenmp /home/lucas/._occa/kernels/b65af17897881926/source.occa -o /home/lucas/._occa/kernels/b65af17897881926/binary -I/home/lucas/research/code/occa//include -L/home/lucas/research/code/occa//lib -locca

/home/lucas/._occa/kernels/b65af17897881926/source.occa: In function ‘void addVectors(const int*, int, int, int, const int&, const float4*, const float4*, float4*)’:
/home/lucas/._occa/kernels/b65af17897881926/source.occa:14:11: error: no match for ‘operator=’ (operand types are ‘occaPrivate_t<float4, 1>’ and ‘const float4*’)
         c = b;
           ^
/home/lucas/._occa/kernels/b65af17897881926/source.occa:14:11: note: candidates are:
In file included from /home/lucas/._occa/kernels/b65af17897881926/source.occa:1:0:
/home/lucas/._occa/libraries/occa/defines/OpenMP.hpp:479:14: note: TM& occaPrivate_t<TM, SIZE>::operator=(const occaPrivate_t<TM, SIZE>&) [with TM = float4; int SIZE = 1]
   inline TM& operator = (const occaPrivate_t &r) {
              ^
/home/lucas/._occa/libraries/occa/defines/OpenMP.hpp:479:14: note:   no known conversion for argument 1 from ‘const float4*’ to ‘const occaPrivate_t<float4, 1>&’
/home/lucas/._occa/libraries/occa/defines/OpenMP.hpp:484:14: note: TM& occaPrivate_t<TM, SIZE>::operator=(const TM&) [with TM = float4; int SIZE = 1]
   inline TM& operator = (const TM &t){
              ^
/home/lucas/._occa/libraries/occa/defines/OpenMP.hpp:484:14: note:   no known conversion for argument 1 from ‘const float4*’ to ‘const float4&’

---[ Error ]--------------------------------------------
    File     : /home/lucas/research/code/occa/src/OpenMP.cpp
    Function : buildFromSource
    Line     : 305
    Error    : Compilation error
========================================================

Am I doing something wrong? If not, I would like to put in a feature request for private vector variables.

Using streams with multiple devices

I am unable to get my program to work with streams and multiple devices in mode=CUDA. So, as a test, I modified the addVectors example under examples/usingStreams/ to include an extra device. The error is CUDA_ERROR_INVALID_HANDLE. See the modified code below.

#include <iostream>

#include "occa.hpp"

int main(int argc, char **argv){
  int entries = 8;

  float *a  = new float[entries];
  float *b  = new float[entries];
  float *ab = new float[entries];

  for(int i = 0; i < entries; ++i){
    a[i]  = i;
    b[i]  = 1 - i;
    ab[i] = 0;
  }   

  occa::device device;
  occa::device device1; // ADDED
  occa::kernel addVectors;
  occa::memory o_a, o_b, o_ab;

  occa::stream streamA, streamB;

  device.setup("mode = CUDA, deviceID = 0"); // UNCOMMENTED
  device1.setup("mode = CUDA, deviceID = 1"); //ADDED
  //device.setup("mode = OpenCL, platformID = 0, deviceID = 1"); //COMMENTED

  streamA = device.getStream();
  streamB = device.createStream();

  o_a  = device.malloc(entries*sizeof(float));
  o_b  = device.malloc(entries*sizeof(float));
  o_ab = device.malloc(entries*sizeof(float));

  addVectors = device.buildKernelFromSource("addVectors.okl",
                                            "addVectors");

  o_a.copyFrom(a);
  o_b.copyFrom(b);

  device.setStream(streamA);
  addVectors(entries, o_a, o_b, o_ab);

  device.setStream(streamB);
  addVectors(entries, o_a, o_b, o_ab);

  o_ab.copyTo(ab);

  for(int i = 0; i < entries; ++i)
    std::cout << i << ": " << ab[i] << '\n';

  delete [] a;
  delete [] b;
  delete [] ab;

  addVectors.free();
  o_a.free();
  o_b.free();
  o_ab.free();
  device.free();
  device1.free();  //ADDED
}

Aligned malloc in occa?

Hi David,
I would like to use vector intrinsics when running on the CPU in OpenMP mode. For this purpose, I need to replace all occaDeviceMalloc calls with a corresponding aligned malloc, e.g. to ensure 32-byte alignment for vdouble4. Unfortunately, I could not find an occaDeviceAlignedMalloc in OCCA.

The vector.hpp file has some vector intrinsics towards the bottom, but it is not complete yet, so I wrote a new class, which is not working properly right now. I am guessing the reason may be byte-alignment issues.
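
As a stopgap in Serial/OpenMP modes, where device memory is plain host memory, something like the following sketch may work (it assumes a device-level wrapMemory API like the one referenced in the wrapMemory() issue above):

#include <stdlib.h>  // posix_memalign

// Given an occa::device 'device' already set up in OpenMP mode:
const size_t bytes = 1024 * sizeof(double);
void *ptr = NULL;
posix_memalign(&ptr, 32, bytes);                     // 32-byte alignment for vdouble4
occa::memory o_mem = device.wrapMemory(ptr, bytes);  // assumed API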

Daniel

Compile error, not compatible with CUDA 8.0

Hi, I am new to occa. I work on an Ubuntu 16 PC with CUDA 8.0 installed. When I try to compile occa, I get an error that the program is not compatible with libcuda.so. Thank you.

building for COI?

Am I doing something wrong?

g++ -O3 -fPIC -o /home/hugo/occa/obj/timer.o  -fopenmp -DOCCA_OPENMP_ENABLED=1  -O3 -D __extern_always_inline=inline -DOCCA_DEBUG_ENABLED=0 -DNDEBUG=1 -DOCCA_SHOW_WARNINGS=0 -DOCCA_CHECK_ENABLED=1 -DOCCA_OPENCL_ENABLED=1 -DOCCA_CUDA_ENABLED=1 -DOCCA_HSA_ENABLED=0 -DOCCA_COI_ENABLED=1  -D LINUX_OS=1 -D OSX_OS=2 -D WINDOWS_OS=4 -D WINUX_OS=5  -D OCCA_OS=LINUX_OS -c -I/home/hugo/occa/lib -I/home/hugo/occa/include -I./include -I/usr/local/cuda-7.0/include -I/usr/local/cuda-7.0/include /home/hugo/occa/src/timer.cpp
In file included from /home/hugo/occa/include/occa/timer.hpp:4:0,
from /home/hugo/occa/src/timer.cpp:1:
/home/hugo/occa/include/occa/base.hpp:1144:29: error: 'occa::device occa::coi::wrapDevice' redeclared as different kind of symbol
occa::device wrapDevice(COIENGINE coiDevice);
^
/home/hugo/occa/include/occa/base.hpp:1141:18: error: previous declaration of 'occa::device occa::coi::wrapDevice(void*)'
occa::device wrapDevice(void *coiDevicePtr);
^
/home/hugo/occa/include/occa/base.hpp:1144:29: error: 'COIENGINE' was not declared in this scope
occa::device wrapDevice(COIENGINE coiDevice);
^
/home/hugo/occa/include/occa/base.hpp:1245:41: error: 'COIENGINE' has not been declared
friend occa::device coi::wrapDevice(COIENGINE coiDevice);
^
/home/hugo/occa/include/occa/base.hpp:1245:60: error: 'occa::device occa::coi::wrapDevice(int)' should have been declared inside 'occa::coi'
friend occa::device coi::wrapDevice(COIENGINE coiDevice);
^
/home/hugo/occa/include/occa/base.hpp:1361:41: error: 'COIENGINE' has not been declared
friend occa::device coi::wrapDevice(COIENGINE coiDevice);
^
/home/hugo/occa/include/occa/base.hpp:1541:41: error: 'COIENGINE' has not been declared
friend occa::device coi::wrapDevice(COIENGINE coiDevice);
^
make: *** [/home/hugo/occa/obj/timer.o] Error 1

Add benchmark example

Similar to:

https://github.com/ekondis/mixbench

Measure throughput and bandwidth for a GPU.

OCCA_CXX is ignored

The README indicates that OCCA_CXX is the right way to control the C++ compiler that is used to build OCCA. This is false in practice, both on Linux and MacOS.

You can see below that I want to use g++-7 but OCCA chooses g++ instead. On MacOS, it defaults to c++ and ignores my attempts to use clang++ and g++-7.

jrhammon@klondike:/tmp$ git clone https://github.com/libocca/occa.git
Cloning into 'occa'...
remote: Counting objects: 18556, done.
remote: Total 18556 (delta 0), reused 0 (delta 0), pack-reused 18556
Receiving objects: 100% (18556/18556), 12.84 MiB | 9.35 MiB/s, done.
Resolving deltas: 100% (14266/14266), done.
Checking connectivity... done.
jrhammon@klondike:/tmp$ cd occa
jrhammon@klondike:/tmp/occa$ export OCCA_CXX=g++-7
jrhammon@klondike:/tmp/occa$ make
mkdir -p /tmp/occa//obj
mkdir -p /tmp/occa//obj/parser
mkdir -p /tmp/occa//obj/python
g++ -O3 -fPIC -DOCCA_COMPILED_DIR="/tmp/occa/" -o /tmp/occa//obj/Serial.o  -fopenmp -DOCCA_OPENMP_ENABLED=1  -O3 -D __extern_always_inline=inline -DOCCA_DEBUG_ENABLED=0 -DNDEBUG=1 -DOCCA_SHOW_WARNINGS=0 -DOCCA_CHECK_ENABLED=1 -DOCCA_OPENCL_ENABLED=1 -DOCCA_CUDA_ENABLED=1  -D LINUX_OS=1 -D OSX_OS=2 -D WINDOWS_OS=4 -D WINUX_OS=5  -D OCCA_OS=LINUX_OS -c -I/tmp/occa//lib -I/tmp/occa//include -I/usr/local/cuda-9.0/include -I/usr/local/cuda-9.0/include /tmp/occa//src/Serial.cpp
g++ -O3 -fPIC -DOCCA_COMPILED_DIR="/tmp/occa/" -o /tmp/occa//obj/OpenCL.o  -fopenmp -DOCCA_OPENMP_ENABLED=1  -O3 -D __extern_always_inline=inline -DOCCA_DEBUG_ENABLED=0 -DNDEBUG=1 -DOCCA_SHOW_WARNINGS=0 -DOCCA_CHECK_ENABLED=1 -DOCCA_OPENCL_ENABLED=1 -DOCCA_CUDA_ENABLED=1  -D LINUX_OS=1 -D OSX_OS=2 -D WINDOWS_OS=4 -D WINUX_OS=5  -D OCCA_OS=LINUX_OS -c -I/tmp/occa//lib -I/tmp/occa//include -I/usr/local/cuda-9.0/include -I/usr/local/cuda-9.0/include /tmp/occa//src/OpenCL.cpp
^Cmakefile:68: recipe for target '/tmp/occa//obj/OpenCL.o' failed
make: *** [/tmp/occa//obj/OpenCL.o] Interrupt

Add support to use __constant in GPU modes

Try out the following to see if we can get __constant to work in native CUDA kernels.
Manually declare the constant variable/array with a name in the kernel file:

__constant float myVar;
__constant float myArray[10];

Compile the kernel:

occa::kernel foo = device.buildKernel("foo.cu", "foo");

Use an OCCA helper method to initialize the constant value. The 'extra work' part is keeping track of the variable name and size outside of the kernel:

float myVar = 10;
float *myArray = new float[10];
// Initialize myArray

occa::cuda::initConstantMemory(foo, "myVar"  , &myVar , sizeof(float));
occa::cuda::initConstantMemory(foo, "myArray", myArray, 10 * sizeof(float));

Add reduction attribute

Reductions are not trivial to write in OKL; instead, we can make it a backend requirement to provide a way to do fast reductions.

@reduction void dot(const int entries,
                    const float *a,
                    const float *b,
                    @reduce float out) {

  for (int i = 0; i < entries; ++i; reduce(32)) {
    out += a[i] * b[i];
  }
}

We can have multiple reductions in one kernel

@reduction void dot_norm(const int entries,
                         const float *a,
                         const float *b,
                         @reduce float dot,
                         @reduce float a_norm2,
                         @reduce float b_norm2) {

  for (int i = 0; i < entries; ++i; reduce(32)) {
    dot += a[i] * b[i];
    a_norm2 += a[i] * a[i];
    b_norm2 += b[i] * b[i];
  }
}

Running on MIC

I am having trouble running on the Xeon Phi using the offload model. The simple program shown below prints 240 cores, but bin/occainfo does not detect the Xeon Phi (only CPU info is shown). I also tried to compile occa with OCCA_COI_ENABLED=1, but I learned this is deprecated, so I am trying to run in OpenMP mode.

#include <stdio.h>
#include <omp.h>

int main() {
  int nprocs;
  #pragma offload target(mic)
  nprocs = omp_get_num_procs();
  printf("Hello %d\n", nprocs);
  return 0;
}

With OpenMP mode I am redefining the outer loop of my kernels to run in parallel and on the MIC with:

 #define myOuterFor OCCA_PRAGMA("offload target(mic)") occaParallelFor occaOuterFor0

I got errors after I added the mic pragma:

 error: pointer variable "occaKernelArgs" in this offload region must be specified in an in/out/inout/nocopy clause

Is it possible to run OCCA in native mode? I would like to launch more than one process per MIC card with a few threads each, instead of running threads only. But I guess this would go against the design of OCCA.

Daniel

CUDA multiple defines

When I run the example occa/examples/addVectors/c/ in CUDA mode, I get a bunch of multiple-define warnings between primitives.hpp and CUDA.hpp, as well as defines.hpp.

There is also an error from a missing file: occa/defines.hpp.

Allow memory allocation inside kernels

FYI, I can allocate OCCA device memory inside the occaKernel routine body (outside of the actual kernel loops), which allows me to define temporary global device memory buffers there instead of externally, where they would then have to be passed in as parameters. This means I can keep the logic for allocating and managing temporary/work buffers used in kernels localized to the occaKernel routine code, where it belongs.

I achieve this by passing in the occaDevice pointer cast as a uint64_t, then calling a special-purpose externally-defined C++ allocation function ("CreateOccaMemory()") inside the routine body to return an occaPointer to a buffer (global device memory). This allocation function simply takes the incoming occaDevice pointer plus a number of elements argument, and calls its OCCA malloc() function. The return type is (for example) occaPointer float*. I also have a corresponding memory free function ("ReleaseOccaMemory()")

Anyway, it would be much better (and helpful for all OCCA programmers) if there were built-in support to do this without even having to explicitly pass in the occaDevice pointer as a parameter from the outside. That pointer could be a hidden parameter generated and passed in by the OCCA build. And in terms of user syntax, the user could simply define global device memory via dynamically-sized arrays (auto-freed on return). Or similar to what I do, you could just have a new built-in OCCA allocator function (e.g., "occaMalloc()") that users call to create the global device memory, which is accessible in subsequent kernel loops. Of course, you would also want to provide an "occaFree()" function.
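
A sketch of the helper described above; CreateOccaMemory/ReleaseOccaMemory are the author's own names, and the body below (in particular the raw-handle accessor) is an assumption for illustration only:

// Host-side allocator callable from an occaKernel body in CPU modes;
// the occa::device is smuggled in as a uint64_t kernel parameter.
extern "C" float* CreateOccaMemory(uint64_t devicePtr, int entries) {
  occa::device *device = (occa::device*) devicePtr;
  occa::memory mem = device->malloc(entries * sizeof(float));
  // Bookkeeping needed by ReleaseOccaMemory() is omitted here;
  // getMemoryHandle() is a hypothetical raw-pointer accessor.
  return (float*) mem.getMemoryHandle();
}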

Allow return statements

Would it be possible to allow a return statement (return 0 on success, otherwise failure)?

math functions for vectors (floatn)

Concerning vector data types.

a) The OpenCL floatn vectors support math functions such as sin, cos, etc., but occa lacks that support in non-OpenCL modes.

b) Support for all float-n types, such as float3, float6, and float9, which can be used to represent vectors, symmetric tensors, and tensors.
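
For (a), a componentwise fallback for non-OpenCL modes could look like this sketch (it assumes only that occa's float4 is default-constructible with .x/.y/.z/.w members, and requires <math.h> for sinf):

// Componentwise sin for float4 in non-OpenCL modes (sketch)
inline float4 sinf4(const float4 &v) {
  float4 r;
  r.x = sinf(v.x);
  r.y = sinf(v.y);
  r.z = sinf(v.z);
  r.w = sinf(v.w);
  return r;
}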
