mitsuba-renderer / enoki

Enoki: structured vectorization and differentiation on modern processor architectures

License: Other

CMake 1.58% C++ 91.01% Shell 0.19% Python 0.90% Cuda 6.05% C 0.27%

Introduction

Enoki is a C++ template library that enables automatic transformations of numerical code, for instance to create a "wide" vectorized variant of an algorithm that runs on the CPU or GPU, or to compute gradients via transparent forward/reverse-mode automatic differentiation.

The core parts of the library are implemented as a set of header files with no dependencies other than a sufficiently C++17-capable compiler (GCC >= 8.2, Clang >= 7.0, Visual Studio >= 2017). Enoki code reduces to efficient SIMD instructions available on modern CPUs and GPUs—in particular, Enoki supports:

Intel: AVX512, AVX2, AVX, and SSE4.2,
ARM: NEON/VFPV4 on armv7-a, Advanced SIMD on 64-bit armv8-a,
NVIDIA: CUDA via a Parallel Thread Execution (PTX) just-in-time compiler.
Fallback: a scalar fallback mode ensures that programs still run even if none of the above are available.

Deploying a program on top of Enoki usually serves three goals:

1. Enoki ships with a convenient library of special functions and data structures that facilitate implementation of numerical code (vectors, matrices, complex numbers, quaternions, etc.).
2. Programs built using these can be instantiated as wide versions that process many arguments at once (either on the CPU or the GPU). Enoki is also structured in the sense that it handles complex programs with custom data structures, lambda functions, loadable modules, virtual method calls, and many other modern C++ features.
3. If derivatives are desired (e.g. for stochastic gradient descent), Enoki performs transparent forward or reverse-mode automatic differentiation of the entire program.

Finally, Enoki can do all of the above simultaneously: if desired, it can compile the same source code to multiple different implementations (e.g. scalar, AVX512, and CUDA+autodiff).

Motivation

The development of this library was prompted by the author's frustration with the current vectorization landscape:

Auto-vectorization in state-of-the-art compilers is inherently local. A computation whose call graph spans separate compilation units (e.g. multiple shared libraries) simply can't be vectorized.
Data structures must be converted into a Structure of Arrays (SoA) layout to be eligible for vectorization.

This is analogous to performing a matrix transpose of an application's entire memory layout—an intrusive change that is likely to touch almost every line of code.
Parts of the application likely have to be rewritten using intrinsic instructions.

Intrinsics-heavy code is challenging to read and modify once written, and it is inherently non-portable. CUDA provides a nice language environment for programming GPUs but does nothing to help with the other requirements (vectorization on CPUs, automatic differentiation).
Vectorized transcendental functions (exp, cos, erf, ..) are not widely available. Intel, AMD, and CUDA provide proprietary implementations, but many compilers don't include them by default.
It is desirable to retain both scalar and vector versions of an algorithm, but ensuring their consistency throughout the development cycle becomes a maintenance nightmare.
Domain-specific languages (DSLs) for vectorization such as ISPC address many of the above issues but assume that the main computation underlying an application can be condensed into a compact kernel that is implementable using the limited language subset of the DSL (e.g. plain C in the case of ISPC).

This is not the case for complex applications, where the "kernel" may be spread out over many separate modules involving high-level language features such as functional or object-oriented programming.
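
To make the Structure of Arrays point from the list above concrete, here is a minimal sketch in plain C++ that does not use Enoki at all. The `ParticleAoS` and `ParticlesSoA` types and the `to_soa` helper are hypothetical names chosen for illustration; the sketch only shows the manual layout change that the README describes as intrusive.

```cpp
#include <cstddef>
#include <vector>

// Array of Structures (AoS): the fields of one particle sit next to each other.
struct ParticleAoS {
    float x, y, mass;
};

// Structure of Arrays (SoA): each field is stored contiguously across all
// particles, so one vector load can fetch the same field of several particles.
struct ParticlesSoA {
    std::vector<float> x, y, mass;
};

// Convert AoS storage to SoA storage (the "transpose" of the memory layout).
inline ParticlesSoA to_soa(const std::vector<ParticleAoS> &in) {
    ParticlesSoA out;
    out.x.reserve(in.size());
    out.y.reserve(in.size());
    out.mass.reserve(in.size());
    for (const ParticleAoS &p : in) {
        out.x.push_back(p.x);
        out.y.push_back(p.y);
        out.mass.push_back(p.mass);
    }
    return out;
}
```

Every function that previously took a `ParticleAoS` would have to be rewritten against `ParticlesSoA`, which is the kind of change that touches most of a code base.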

What Enoki does differently

Enoki addresses these issues and provides a complete solution for vectorizing and differentiating modern C++ applications with nontrivial control flow and data structures, dynamic memory allocation, virtual method calls, and vector calls across module boundaries. It has the following design goals:

Unobtrusive. Only minor modifications are necessary to convert existing C++ code into its Enoki-vectorized equivalent, which remains readable and maintainable.
No code duplication. It is generally desirable to provide both scalar and vectorized versions of an API, e.g. for debugging, and to preserve compatibility with legacy code. Enoki code extensively relies on class and function templates to achieve this goal without any code duplication—the same code template can be leveraged to create scalar, CPU SIMD, and GPU implementations, and each variant can provide gradients via automatic differentiation if desired.
Custom data structures. Enoki can also vectorize custom data structures. All the hard work (e.g. conversion to SoA format) is handled by the C++17 type system.
Function calls. Vectorized calls to functions in other compilation units (e.g. a dynamically loaded plugin) are possible. Enoki can even vectorize method or virtual method calls (e.g. instance->my_function(arg1, arg2, ...); when instance turns out to be an array containing many different instances).
Mathematical library. Enoki includes an extensive mathematical support library with complex numbers, matrices, quaternions, and related operations (determinants, matrix inversion, etc.). A set of transcendental and special functions supports real, complex, and quaternion-valued arguments in single and double-precision using polynomial or rational polynomial approximations, generally with an average error of <1/2 ULP on their full domain. These include exponentials, logarithms, and trigonometric and hyperbolic functions, as well as their inverses. Enoki also provides real-valued versions of error function variants, Bessel functions, the Gamma function, and various elliptic integrals.

Importantly, all of this functionality is realized using the abstractions of Enoki, which means that it transparently composes with vectorization, the JIT compiler for generating CUDA kernels, automatic differentiation, etc.
Portability. When creating vectorized CPU code, Enoki supports arbitrary array sizes that don't necessarily match what is supported by the underlying hardware (e.g. 16 x single precision on a machine whose SSE vector unit only has hardware support for 4 x single precision operands). The library uses template metaprogramming techniques to efficiently map array expressions onto the available hardware resources. This greatly simplifies development because it's enough to write a single implementation of a numerical algorithm that can then be deployed on any target architecture. There are non-vectorized fallbacks for everything, so programs will run even on unsupported architectures (albeit without the performance benefits of vectorization).
Modular architecture. Enoki is split into two major components: the front-end provides various high-level array operations, while the back-end provides the basic ingredients that are needed to realize these operations using the SIMD instruction set(s) supported by the target architecture.

For example, the CPU vector back-ends make heavy use of SIMD intrinsics to ensure that compilers generate efficient machine code. The intrinsics are contained in separate back-end header files (e.g. array_avx.h for AVX intrinsics), which provide rudimentary arithmetic and bit-level operations. Fancier operations (e.g. atan2) use the back-ends as an abstract interface to the hardware, which means that it's simple to support other instruction sets such as a hypothetical future AVX1024 or even an entirely different architecture (e.g. a DSP chip) by just adding a new back-end.
License. Enoki is available under a non-viral open source license (3-clause BSD).

Cloning

Enoki depends on two other repositories (pybind11 and cub) that are required when using certain optional features, specifically differentiable GPU arrays with Python bindings.

To fetch the entire project including these dependencies, clone the project using the --recursive flag as follows:

$ git clone --recursive https://github.com/mitsuba-renderer/enoki

Documentation

An extensive set of tutorials and reference documentation are available at readthedocs.org.

About

This project was created by Wenzel Jakob. It is named after Enokitake, a type of mushroom with many long and parallel stalks reminiscent of data flow in vectorized arithmetic.

Enoki is the numerical foundation of version 2 of the Mitsuba renderer, though it is significantly more general and should be a trusty tool for a variety of simulation and optimization problems.

When using Enoki in academic projects, please cite

@misc{Enoki,
   author = {Wenzel Jakob},
   year = {2019},
   note = {https://github.com/mitsuba-renderer/enoki},
   title = {Enoki: structured vectorization and differentiation on modern processor architectures}
}

enoki's Issues

Incorrect behaviour for select with scalar inputs

In Python, ek.select(True, 0.1, 0.2) outputs [1] of type scalar.Vector1m.

Adding the following binding code in src/python/scalar.cpp solves the bug.

m.def("select", [](bool a, Float b, Float c) {
        return enoki::select(a, b, c);
});

Please verify whether this is the correct way to add a scalar binding for the select function.

error: lvalue required as left operand of assignment

Trying to do this:

FloatD arr = {1.f,3.f,2.f,5.f,6.f,2.f,6.f};
FloatD t = 0.0f;
set_requires_gradient(t);

arr[2] = arr[2] + t;  // error: lvalue required as left operand of assignment

How to pybind FloatC and FloatD

I am rendering a gradient image using Enoki CUDA arrays. Is there any suggestion on how to expose the C++ CUDA arrays FloatC and FloatD (or vectors of them) to Python so that I can call backward in Python for optimization? I didn't see a binding for that in enoki/python.h.

Unexpected `select` result

I came across a potential Mask and/or select bug in the Mitsuba 2 codebase. Here is an MVE:

#include <iostream>
#include <enoki/array.h>

using namespace enoki;

namespace {
using Float    = float;
constexpr size_t PacketSize = enoki::max_packet_size / sizeof(Float);

using Point4f = Packet<Float, 4>;
using MyMask = mask_t<Point4f>;

template <typename T>
void print(const T &val) {
    std::cout << val << std::endl;
}
}  // namespace

int main() {
    MyMask m1(true);
    MyMask m2(true | true);

    print("---  These two masks look identical:");
    print(m1);
    print(m2);
    print("---  But maybe they aren't?");
    print(m1 & m2);
    print(all(eq(m1, m2)));
    print("---  Using scalars (for reference):");
    print(select(true, 1.0f, 0.0f));
    print(select(true | true, 1.0f, 0.0f));
    print("---  Now, using packets:");
    print(select(m1, Point4f(1.0f), Point4f(0.0f)));
    print(select(m2, Point4f(1.0f), Point4f(0.0f)));  // Unexpected result here.
}

Running it outputs:

---  These two masks look identical:
[1, 1, 1, 1]
[1, 1, 1, 1]
---  But maybe they aren't?
[1, 1, 1, 1]
0
---  Using scalars (for reference):
1
1
---  Now, using packets:
[1, 1, 1, 1]
[0, 0, 0, 0]

Note how the last line's result is unexpected.

Inspecting the masks in LLDB, they indeed look different:

(lldb) p m1
((anonymous namespace)::MyMask) $0 = {
  enoki::StaticMaskImpl<float, 4, true, enoki::RoundingMode::Default, enoki::PacketMask<float, 4, true, enoki::RoundingMode::Default>, void> = {
    enoki::StaticArrayImpl<float, 4, true, enoki::RoundingMode::Default, enoki::PacketMask<float, 4, true, enoki::RoundingMode::Default> > = {
      m = (NaN, NaN, NaN, NaN)
    }
  }
}
(lldb) p m2
((anonymous namespace)::MyMask) $1 = {
  enoki::StaticMaskImpl<float, 4, true, enoki::RoundingMode::Default, enoki::PacketMask<float, 4, true, enoki::RoundingMode::Default>, void> = {
    enoki::StaticArrayImpl<float, 4, true, enoki::RoundingMode::Default, enoki::PacketMask<float, 4, true, enoki::RoundingMode::Default> > = {
      m = (0.00000000000000000000000000000000000000000000140129846, 0.00000000000000000000000000000000000000000000140129846, 0.00000000000000000000000000000000000000000000140129846, 0.00000000000000000000000000000000000000000000140129846)
    }
  }
}

I think that LLDB's printers don't print the mask entries correctly anyway, but at least we can confirm that they are different.
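
One possible reading of the debugger output, sketched in plain C++ without Enoki: `true | true` is an `int` with value 1 after integral promotion, not a `bool`. Assuming (this is not verified against Enoki's source) that constructing the mask from an integer stores that integer's bit pattern in each lane, the lanes would hold the bits `0x00000001`, which read as the float 1.4e-45 shown for m2, whereas an all-bits-set lane reads as the NaN shown for m1. The helper names below are hypothetical.

```cpp
#include <cstdint>
#include <cstring>
#include <type_traits>

// Integral promotion: `true | true` has type int (value 1), not bool.
static_assert(std::is_same<decltype(true | true), int>::value,
              "bitwise OR of two bools yields an int");

// Reinterpret a 32-bit pattern as a float without violating aliasing rules.
inline float bits_to_float(std::uint32_t bits) {
    float f;
    std::memcpy(&f, &bits, sizeof f);
    return f;
}

// All bits set: reads as a NaN, like the lanes of m1 in the LLDB dump.
inline float all_ones_lane() { return bits_to_float(0xFFFFFFFFu); }

// Only the lowest bit set: the smallest positive denormal (about 1.4e-45),
// like the lanes of m2 in the LLDB dump.
inline float low_bit_lane() { return bits_to_float(0x00000001u); }
```

If this reading is right, writing `MyMask m2(bool(true | true))` or avoiding integer expressions when building masks would sidestep the difference, but that remains an assumption until confirmed by the maintainers.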

memcpy from one blob to another using enoki

Consider this simple code:

void bar(const char* src, int src_size, char* dst, int dst_size) { 
  assert(src_size == dst_size);

  for (int i = 0; i < src_size; ++i) { 
    *dst++ = *src++;
  } 
}

This code generates the following assembly (only the loop part is shown here):

  40d7c8:       c5 fe 6f 04 07          vmovdqu ymm0,YMMWORD PTR [rdi+rax*1]
  40d7cd:       c5 fe 7f 04 02          vmovdqu YMMWORD PTR [rdx+rax*1],ymm0
  40d7d2:       48 83 c0 20             add    rax,0x20
  40d7d6:       48 39 c8                cmp    rax,rcx
  40d7d9:       75 ed                   jne    40d7c8 <bar(char const*, int, char*, int)+0x28>

gcc is smart enough to vectorize this loop and copy chunks of 32 bytes.

Now consider this code written with enoki:

void foo(const char* src, int src_size, char* dst, int dst_size) {
  using Array = enoki::Array<int, 8>;

  auto es = enoki::DynamicArray<Array>::map(src, src_size);
  auto ed = enoki::DynamicArray<Array>::map(dst, dst_size);
  for (int i = 0; i < (int)es.packets(); ++i) {
    const auto& pkt = es.packet(i);
    auto& dst_pkt = ed.packet(i);
    dst_pkt = pkt;
  }
}

This code generated this assembly:

  40d850:       c5 fd 6f 04 07          vmovdqa ymm0,YMMWORD PTR [rdi+rax*1]
  40d855:       c5 fd 7f 04 02          vmovdqa YMMWORD PTR [rdx+rax*1],ymm0
  40d85a:       48 83 c0 20             add    rax,0x20
  40d85e:       48 39 c8                cmp    rax,rcx
  40d861:       75 ed                   jne    40d850 <foo(char const*, int, char*, int)+0x20>

So almost the same code (except for aligned read).

Next, I wanted to change the code to use two ymm registers and unroll this loop further, so I changed the Array in the code above to

using Array = enoki::Array<int, 16>;

The assembly generated with the 16-element array is this:

  40d850:       c5 f9 6f 04 07          vmovdqa xmm0,XMMWORD PTR [rdi+rax*1]
  40d855:       c5 f8 29 04 02          vmovaps XMMWORD PTR [rdx+rax*1],xmm0
  40d85a:       c5 f9 6f 4c 07 10       vmovdqa xmm1,XMMWORD PTR [rdi+rax*1+0x10]
  40d860:       c5 f8 29 4c 02 10       vmovaps XMMWORD PTR [rdx+rax*1+0x10],xmm1
  40d866:       c5 f9 6f 54 07 20       vmovdqa xmm2,XMMWORD PTR [rdi+rax*1+0x20]
  40d86c:       c5 f8 29 54 02 20       vmovaps XMMWORD PTR [rdx+rax*1+0x20],xmm2
  40d872:       c5 f9 6f 5c 07 30       vmovdqa xmm3,XMMWORD PTR [rdi+rax*1+0x30]
  40d878:       c5 f8 29 5c 02 30       vmovaps XMMWORD PTR [rdx+rax*1+0x30],xmm3
  40d87e:       48 83 c0 40             add    rax,0x40
  40d882:       48 39 c8                cmp    rax,rcx
  40d885:       75 c9                   jne    40d850 <foo(char const*, int, char*, int)+0x20>

So instead of using two ymm registers, it uses four xmm registers. I find this quite odd. Do you have any idea why Enoki does that?

Do you have benchmarks available?

I really like your design! Do you have any benchmarks available for example problems like the graphs in Figure 8 of the Stan math paper?

https://arxiv.org/pdf/1509.07164.pdf

Question: comparison with Halide

Could you please briefly explain how enoki compares with Halide?

Missing example for compiling to multiple implementations

if desired, it can compile the same source code to multiple different implementations

The above line from the README suggests this, but I couldn't find examples, tutorials, or other references describing how to do so. Would it be possible to add more information about this?

I would also like to generate multiple versions (scalar, avx512, avx2, sse, cuda) to benchmark them (since sometimes avx drops the clock speed and can actually hurt performance in multi-threaded applications). Would it be possible to do this in the same binary?
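
The README attributes this capability to class and function templates. The following is a minimal sketch of that mechanism in plain C++, not of Enoki's actual API: `Lanes` is a hypothetical stand-in for a packet type and `axpy` is a hypothetical kernel. It shows one source function instantiated for a scalar and for a four-lane type within the same binary. It does not show how ISA-specific back-ends (AVX2, AVX512, CUDA) would be selected.

```cpp
#include <array>
#include <cstddef>

// Hypothetical stand-in for a SIMD packet: N floats with element-wise arithmetic.
template <std::size_t N>
struct Lanes {
    std::array<float, N> v;

    friend Lanes operator+(Lanes a, const Lanes &b) {
        for (std::size_t i = 0; i < N; ++i) a.v[i] += b.v[i];
        return a;
    }
    friend Lanes operator*(Lanes a, const Lanes &b) {
        for (std::size_t i = 0; i < N; ++i) a.v[i] *= b.v[i];
        return a;
    }
};

// Written once; usable with any Value type that supports + and *.
template <typename Value>
Value axpy(Value a, Value x, Value y) {
    return a * x + y;
}
```

Calling `axpy(2.f, 3.f, 1.f)` instantiates the scalar version, while passing three `Lanes<4>` values instantiates the four-lane version of the same code; both can coexist in one executable and be benchmarked side by side.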

[Enhancement]: Conventions

This library seems to use the following conventions:

column-vectors (*);
row-major storage order for matrices (*);
right-handed coordinate system.

(*) Ensures fast matrix-column-vector multiplications.

Any possibility of providing support for the exact opposites of these as well:

row-vectors (**);
column-major storage order for matrices (**);
left-handed coordinate system.

(**) Ensures fast row-vector-matrix multiplications.

Masked access behaves differently when compiling with -mavx2

Consider the following code snippet. The program prints [10, 20, 30, 40, 5, 6, 7, 8] four times if it's compiled "out of the box". However, as soon as I specify -mavx2, -march=native, or the like, the output is [1, 2, 3, 4, 5, 6, 7, 8] for the first three prints. The fourth one works as expected, though.

#include <iostream>
#include <enoki/array.h>
using namespace enoki;
int main() {
  auto print = [](auto x) { std::cout << x << '\n'; };
  using Arr = Array<int, 8>;
  using M = mask_t<Arr>;
  M m{1,1,1,1,0,0,0,0};

  {
    Arr a = {1, 2, 3, 4, 5, 6, 7, 8};
    masked(a, m) *= 10;
    std::cout << a << std::endl;  // <- Wrong: should print [10, 20, 30, 40, 5, 6, 7, 8]
  }
  {
    Arr a = {1, 2, 3, 4, 5, 6, 7, 8};
    a = enoki::select(m, a * 10, a);
    std::cout << a << std::endl; // <- Wrong: should print [10, 20, 30, 40, 5, 6, 7, 8]
  }
  {
    Arr a = {1, 2, 3, 4, 5, 6, 7, 8};
    a[m] *= 10;
    std::cout << a << std::endl; // <- Wrong: should print [10, 20, 30, 40, 5, 6, 7, 8]
  }
  {
    Arr a = {1, 2, 3, 4, 5, 6, 7, 8};
    a[m > 0] *= 10;
    std::cout << a << std::endl; // <- OK: prints [10, 20, 30, 40, 5, 6, 7, 8]
  }
  return 0;
}

I've tested gcc-7.4, clang-7 and clang-9 on ubuntu 18.04.

Here's the CMakeLists.txt I'm using:

cmake_minimum_required(VERSION 3.15)
project(enoki_test)

set(CMAKE_CXX_STANDARD 17)

add_executable(enoki_test main.cpp)

target_include_directories(enoki_test PRIVATE ../enoki/include)

set(CMAKE_CXX_FLAGS "-mavx2")

Any idea how to fix this?

Cast std::array to enoki::array

I tried

float arr[4] = { 10.f, 20.f, 30.f, 40.f };
float* ptr;
ptr = arr;
FloatC arr_enoki = load<float>(ptr);
std::cout << "see if this is working" << std::endl;
std::cout << arr_enoki << std::endl; // [10]

Enoki can't be used with MSVC 2019

I can't use Enoki (release 0.1) with VC2019.
Here is the smallest reproduction code:

enoki::Packet<float, 8> x(1.0f);
enoki::pow(x, x);

Thanks.

Question: Thread safety

Is it possible to seamlessly use a normal CPU threading library together with this library?
Or should an array object basically be treated as a sequential-only, thread-private object?

Dynamic Complex Arrays

What's the correct way to work with dynamic arrays of complex numbers?

AFAICS there are two possible ways: Complex<DynamicArray<FloatP>> and DynamicArray<Complex<FloatP>>. The first seems to work somehow, but unfortunately it is not possible to use the map function with it. The second version, on the other hand, allows me to map existing memory but fails with most other functions.

Thanks in advance!

`ENOKI_STRUCT_DYNAMIC` with `mask` member

Slicing a dynamically-sized mask (mask_t<FloatX>) returns a float & (see Example 1). This makes sense given that masks are stored using their underlying type's registers and slice needs to return a reference.

But then, in the slicing operator defined by ENOKI_STRUCT_DYNAMIC,

template <typename T>                                                  \
static ENOKI_INLINE auto slice(T &&value, size_t index) {              \
    constexpr static bool co_ = std::is_const<                         \
        std::remove_reference_t<T>>::value;                            \
    using Value = Struct<decltype(enoki::slice(std::declval<           \
        std::conditional_t<co_, const Args &, Args &>>(), index))...>; \
    return Value{ ENOKI_MAP_EXPR_F2(enoki::slice, value, index,        \
                                   __VA_ARGS__) };                     \
}

the following becomes problematic:

MyStruct(
    slice(value.arg1, index), ..., 
    // Trying to initialize a `mask_t<Float &>` with a `Float &`
    slice(value.some_mask, index)
)

Would there be a way to initialize the mask with a reference to the underlying storage directly?
This problem occurs in Mitsuba 2, see Example 2.

Example 1

#include <iostream>
#include <vector>
#include <enoki/array.h>

using namespace enoki;

namespace {
constexpr size_t PacketSize = enoki::max_packet_size / sizeof(float);
using Float    = float;
using FloatP   = Packet<Float, 4>;
using FloatX   = DynamicArray<FloatP>;

template <typename T>
ENOKI_NOINLINE void print(const T &val) {
    std::cout << val << std::endl;
}
}  // namespace

int main() {

    mask_t<FloatX> masks;
    set_slices(masks, 4);
    masks = false; masks[1] = true;

    auto mask = slice(masks, 1);

    print(mask);
    print(typeid(mask).name());

    print(masks.coeff(0));
    print(masks.coeff(1));
    print(typeid(masks.coeff(1)).name());

    return 0;
}

Result:

1
f
0
1
f

Example 2

Usage in Mitsuba 2 that triggers this issue:

// records.h
ENOKI_STRUCT_DYNAMIC(mitsuba::PositionSample, ...)

// Example usage that would trigger compilation error
Position3fX pos;
auto p = slice(pos, 1);

// Actual usage: python/records.cpp
bind_slicing_operators<PositionSample<Point3fX>>();

Is there any suggestion on how to auto diff a Vector3fC?

It seems I cannot call backward on a Vector3fC, only on a FloatC. I tried to autodiff each component of the Vector3fC separately, but I am hoping there is a better way.

How to merge a large cuda array?

The question might be a little confusing, but it is related to rendering an image on the GPU. Is there any suggestion on how to do that?
For example, if I want to render a 600*800 image with 32 spp, is it a wise idea to create a CUDA array of size 600*800*32 (just assume the GPU can handle that) and then use some method to average it into an array of size 600*800? Is there any function to do that?
Also, the CUDA array has a gradient.
Thanks
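
As a reference for what the reduction would compute, here is a minimal CPU sketch in plain C++. It does not use Enoki, does not run on the GPU, and does not track gradients. The function name `average_samples` and the pixel-major sample layout are assumptions made for illustration.

```cpp
#include <cstddef>
#include <vector>

// Average `spp` consecutive samples per pixel.
// Assumes a pixel-major layout: the samples of pixel p are stored at
// samples[p * spp + 0] ... samples[p * spp + spp - 1].
inline std::vector<float> average_samples(const std::vector<float> &samples,
                                          std::size_t spp) {
    std::vector<float> image(spp == 0 ? 0 : samples.size() / spp, 0.f);
    for (std::size_t p = 0; p < image.size(); ++p) {
        float sum = 0.f;
        for (std::size_t s = 0; s < spp; ++s)
            sum += samples[p * spp + s];
        image[p] = sum / static_cast<float>(spp);
    }
    return image;
}
```

For the case in the question, `samples` would hold 600*800*32 values and the result 600*800 values.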

`from enoki import *` imports only a few names

I followed the "GPU Arrays" section of the Enoki documentation (https://enoki.readthedocs.io/en/master/gpu.html#):

cd <path-to-enoki>
mkdir build
cmake -DENOKI_CUDA=ON -DENOKI_AUTODIFF=ON -DENOKI_PYTHON=ON ..
make

In python:

>>> from enoki import *

I find that only a few names are imported; FloatC and cuda_set_log_level are missing.

['CPUBuffer',
 '__builtins__',
 '__cached__',
 '__doc__',
 '__file__',
 '__loader__',
 '__name__',
 '__package__',
 '__path__',
 '__spec__',
 '__version__',
 'allclose',
 'arange',
 'core',
 'e',
 'empty',
 'full',
 'inf',
 'linspace',
 'nan',
 'pi',
 'zero']

Did I miss something? Or is the code only for demonstration?
Thanks!

Any fast way to copy a GPU array to CPU?

Is there a good method to copy a FloatC into a cpu array?
I am currently using PyTorch's .cpu(), but it seems to be very slow if the graph is too complex.
Is there any better way either in c++ or python?

Enoki PTX linker error

Hey all!

First of all thank you very much for publishing/releasing mitsuba2!
I wanted to start experimenting with inverse rendering and tried multiple platforms (Google Colab and my own hardware), but I keep facing the exact same issue everywhere:

import mitsuba                                                                              
mitsuba.set_variant('gpu_autodiff_rgb') 

# The C++ type associated with 'Float' is enoki::DiffArray<enoki::CUDAArray<float>> 
from mitsuba.core import Float 
import enoki as ek 

# Initialize a dynamic CUDA floating point array with some values 
x = Float([1, 2, 3])                                                                        
# Tell Enoki that we'll later be interested in gradients of 
# an as-of-yet unspecified objective function with respect to 'x' 
ek.set_requires_gradient(x) 

# Example objective function: sum of squares 
y = ek.hsum(x * x)

PTX linker error:
ptxas fatal : SM version specified by .target is higher than default SM version assumed
cuda_check(): driver API error = 0400 "CUDA_ERROR_INVALID_HANDLE" in ../ext/enoki/src/cuda/jit.cu:253.

I've tried different GPUs and the results are:

GPU	Driver version	CUDA version	Result	Computing Capability
Geforce 940M	440.64	10.0.130	Fails	5.0
K80	418.67	10.0.130	Fails	3.7
Tesla P4	418.67	10.0.130	WORKS	6.1
P100	418.67	10.0.130	Fails	6.0

-> The weird thing is that the issue does not occur on a Tesla P4 but it does on all the others

Does anyone have an idea what can cause this and how I can fix it?

Thanks a lot! Pieterjan

Consider making enoki (cmake) installable

Hello,

I have started using Enoki in my project. Right now, I have basically cloned the entire repo into a subfolder and added Enoki to my include paths (the old CMake way).

It would be really nice to make it cmake installable. (Internally we use conan as package manager, and making a conan recipe of a project which is cmake installable is straightforward).

Essentially it would be nice if we can do this:

mkdir build && cd build
cmake -DCMAKE_INSTALL_PREFIX=/my/path/ -G Ninja ..
ninja
ninja install

I don't know enough cmake to do this myself though :/
PS: Some info: http://mariobadr.com/creating-a-header-only-library-with-cmake.html

How about the efficiency?

I really like Enoki's template design, which can be used on multiple platforms and supports autodiff. I want to use it as the base of my fluid simulation code, so I ran a small efficiency test:

std::array<float, 3> srgb_gamma(std::array<float, 3> x) {
    std::array<float, 3> result;
    for (int i = 0; i < 3; i++) {
        if (x[i] <= 0.0031308f)
           result[i] = x[i] * 12.92f;
        else
           result[i] = std::pow(x[i] * 1.055f, 1.f / 2.4f) - 0.055f;
    }
    return result;
}

I hand-wrote the function above and compared it against the code given in the tutorial. Looping 10000 times, my version is 100x faster than the Enoki version without -msse4 and 20x faster with -msse4. I can't figure out why. Am I missing a compile flag?

Does Enoki support array indexing?

For example:
A = [1, 2, 3, 4, 5]
I = [0, 1, 1, 2]
then A[I] = [A[0], A[1], A[1], A[2]] = [1, 2, 2, 3]

    FloatD some = {2.3f, 3.4f, 4.5f, 5.6f, 6.7f, 7.8f};
    IntC index = {1,1,3,3,5};

    FloatD check = some[index];

    // The value of check should be: [3.4f, 3.4f, 5.6f, 5.6f, 7.8f]
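
The operation described here is commonly called a gather. The sketch below spells out the intended semantics in plain C++ on `std::vector`; it is a reference for the expected result, not Enoki's implementation, and the free function `gather` defined here is a hypothetical helper.

```cpp
#include <cstddef>
#include <vector>

// result[k] = source[index[k]] for every k.
inline std::vector<float> gather(const std::vector<float> &source,
                                 const std::vector<std::size_t> &index) {
    std::vector<float> result;
    result.reserve(index.size());
    for (std::size_t i : index)
        result.push_back(source.at(i));  // at() throws std::out_of_range on a bad index
    return result;
}
```

With the values from the question, gathering indices {1, 1, 3, 3, 5} from {2.3f, 3.4f, 4.5f, 5.6f, 6.7f, 7.8f} yields {3.4f, 3.4f, 5.6f, 5.6f, 7.8f}.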

Treating 3D arrays as 4D arrays

Hello,

As the documentation says, 3D arrays are treated as 4D arrays to make better use of intrinsics, but this raises an interesting problem.

Consider the following code:

  using Array = enoki::Array<float, 3>;

  Array numerator{2, 4, 8};
  Array denominator{1, 1, 1};

  auto result = numerator / denominator;
  if (std::fetestexcept(FE_INVALID)) {
    throw std::runtime_error("domain error");
  }

This throws the exception because the last element in the register is initialized to 0, and this leads to a division by zero. Note that this is not limited to division. Any operation on the last element (things like min, max) also triggers that exception.

We do a bunch of floating point computation and like to keep the floating point exception check to verify we didn't mess up.

What would you suggest for handling cases like this?
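
The effect can be reproduced without Enoki. The sketch below emulates a 3D vector padded to four lanes using plain scalar code: when both padding lanes are 0, the extra lane computes 0/0, which raises FE_INVALID. The function name is hypothetical, and padding the denominator with 1 is shown only to illustrate the cause, not as a recommendation from the Enoki authors.

```cpp
#include <cfenv>

// Divide two 4-lane "registers" element-wise; the first three lanes hold the
// values from the issue, the fourth lane holds the padding values passed in.
// Returns true if the division raised FE_INVALID.
inline bool padded_divide_raises_invalid(float pad_num, float pad_den) {
    // volatile keeps the compiler from folding the divisions at compile time
    volatile float num[4] = {2.f, 4.f, 8.f, pad_num};
    volatile float den[4] = {1.f, 1.f, 1.f, pad_den};
    volatile float out[4];

    std::feclearexcept(FE_ALL_EXCEPT);
    for (int i = 0; i < 4; ++i)
        out[i] = num[i] / den[i];
    return std::fetestexcept(FE_INVALID) != 0;
}
```

`padded_divide_raises_invalid(0.f, 0.f)` reports the exception, while `padded_divide_raises_invalid(0.f, 1.f)` does not, even though the three meaningful lanes are identical in both calls.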

Define an empty mask or array with a size?

Hi,
May I ask how to define an empty mask or array of a given size?
For example, instead of
MaskC c{true, true, true, true, true};
can I write something like:
MaskC c = MaskC(value = true, size = 5);

Compiling with -mavx2 causes compiler error (clang8) regarding feature 'fma'

clang++-8 -I ../../src/ThreadTracer -I ../../src/enoki/include -mavx2 -O2 -g -std=c++17 try.cpp ../../src/ThreadTracer/threadtracer.o -o try
In file included from try.cpp:2:
In file included from ../../src/enoki/include/enoki/array.h:47:
../../src/enoki/include/enoki/array_avx.h:384:35: error: always_inline function '_mm256_fnmadd_ps' requires target
      feature 'fma', but would be inlined into function 'rsqrt_' that is compiled without support for 'fma'
                r = _mm256_mul_ps(_mm256_fnmadd_ps(t1, r, c1), t0);

$ clang++-8 --version
clang version 8.0.0-3~ubuntu18.04.2 (tags/RELEASE_800/final)
Target: x86_64-pc-linux-gnu
Thread model: posix
InstalledDir: /usr/bin

NOTE: error goes away by:

Either not using -mavx2 flag for clang.
Or not calling enoki::rsqrt()

IntC to int*?

I asked a question before and learned that IntC::copy can copy an int* into an IntC on the GPU.
Is there any method to copy an IntC into an int*?

Consider adding assert with ENOKI_ASSUME_ALIGNED

Hello,

First, thanks for making this public.

While reading the documentation, I noticed this:

Performing an aligned load from an unaligned memory address will cause a general protection fault that immediately terminates the application.

and indeed, doing an aligned AVX512 load from an unaligned address causes the application to segfault. Would it be possible to add assert(ptr % n == 0) wherever you use ENOKI_ASSUME_ALIGNED?

For example, the first load_ definition below would become the second:

      static ENOKI_INLINE Derived load_(const void *ptr) {
          return _mm512_load_ps((const Value *) ENOKI_ASSUME_ALIGNED(ptr, 64));
      }

      static ENOKI_INLINE Derived load_(const void *ptr) {
          assert((uintptr_t) ptr % 64 == 0);
          return _mm512_load_ps((const Value *) ENOKI_ASSUME_ALIGNED(ptr, 64));
      }

This catches the problem in debug build.

Dynamic Array of a fixed sized Matrix

Could you help me to clarify this behavior:

using Mat22f = enoki::Matrix< f32, 2 >;
using Mat22fBuffer = enoki::DynamicArray< enoki::Packet< Mat22f, 2 > >; // With 2
Mat22fBuffer x, y;
y = x;

// Compile time error:
array_generic.h(464,59): error C2440: '<function-style-cast>': cannot convert from 'enoki::Array<eMV::f32,2>' to 'enoki::Matrix<eMV::f32,2>'
array_generic.h(414,1): message : No constructor could take the source type, or constructor overload resolution was ambiguous
array_generic.h(344): message : see reference to function template instantiation 'void enoki::StaticArrayImpl<Value_,2,false,enoki::Packet<Value_,2>,int>::assign_<eMV::Mat22f&,0,1>(T,std::integer_sequence<size_t,0,1>)' being compiled
1>        with
1>        [
1>            Value_=eMV::Mat22f,
1>            T=eMV::Mat22f &
1>        ]

With:

using Mat22f = enoki::Matrix< f32, 2 >;
using Mat22fBuffer = enoki::DynamicArray< enoki::Packet< Mat22f, 1 > >; // With 1
Mat22fBuffer x, y;
y = x;

This compiles without problems.

Using Mask to filter an array

For example:
FloatC arr = {0.0f, 1.0f, 2.0f, 3.0f, 4.0f};
MaskC msk = {0, 0, 1, 1, 0};
FloatC filtered_arr = do_something(arr, msk);
filtered_arr is {0.0f, 0.0f, 2.0f, 3.0f, 0.0f};
or
FloatC filtered_arr = do_something2(5.0f, arr, msk);
filtered_arr is {5.0f, 5.0f, 2.0f, 3.0f, 5.0f};
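
Both requested results are instances of an element-wise select: keep the array entry where the mask is set and use a fallback value elsewhere. The sketch below states those semantics in plain C++ on `std::vector`; `select_or` is a hypothetical name and this is not Enoki code. The `select(mask, a, b)` call that appears in other issues on this page looks like the corresponding Enoki operation, though its exact behavior for these types should be checked against the documentation.

```cpp
#include <cstddef>
#include <vector>

// result[i] = mask[i] ? if_true[i] : if_false
inline std::vector<float> select_or(const std::vector<bool> &mask,
                                    const std::vector<float> &if_true,
                                    float if_false) {
    std::vector<float> result(if_true.size(), if_false);
    for (std::size_t i = 0; i < if_true.size() && i < mask.size(); ++i)
        if (mask[i])
            result[i] = if_true[i];
    return result;
}
```

With a fallback of 0.0f this reproduces the first requested output, and with a fallback of 5.0f the second.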

Cannot compile a simple example with enoki

Hi, I'm trying to use Enoki, but I cannot compile the following example with CMake.

hello_enoki.cpp:

#include <enoki/array.h>
#include <string>
#include <iostream>

using namespace enoki;

using StrArray = Array<std::string, 2>;

int main(int argc, char **argv) {
    StrArray x("Hello ", "How are "), y("world!", "you?");
    std::cout << x + y << std::endl;

    return 0;
}

CMakeLists.txt:

cmake_minimum_required(VERSION 2.8.12)
project(mytest)

# C++17
include(CheckCXXCompilerFlag)
if (CMAKE_CXX_COMPILER_ID MATCHES "^(GNU|Clang|Emscripten|Intel)$")
  CHECK_CXX_COMPILER_FLAG("-std=c++17" HAS_CPP17_FLAG)

  if (HAS_CPP17_FLAG)
    set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -std=c++17")
  else()
    CHECK_CXX_COMPILER_FLAG("-std=c++1z" HAS_CPP1Z_FLAG)
    if (HAS_CPP1Z_FLAG)
      set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -std=c++1z")
    else()
      message(FATAL_ERROR "Unsupported compiler -- nanogui requires C++17 support!")
    endif()
  endif()
elseif(MSVC)
  set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} /std:c++17")
endif()

# Enoki
add_subdirectory(enoki)
enoki_set_compile_flags()
enoki_set_native_flags()
include_directories(enoki/include)

add_executable(mytest hello_enoki.cpp)

Any advice? Thanks!

Recommended pattern for memory mapped arrays

Hi,

I have memory-mapped a 100 GB file of floats. What would be the recommended pattern for doing things like finding the min and max value or computing a histogram? It seems like DynamicArray is the thing to use, but it assumes ownership of the array. I could loop over fixed-size chunks and load<>() them into an Array, but then I need to deal with the boundary condition at the end if the dataset size isn't a multiple of the Array size. What would you suggest for this scenario?
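One common way to handle the boundary condition is to process full chunks in the main loop and finish with a scalar tail loop. The sketch below shows this for min/max in plain C++ over a buffer it does not own; `minmax_chunked` and the chunk width of 8 are assumptions for illustration, and the inner per-lane loop stands in for what a packet load and element-wise min/max would do.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <limits>
#include <utility>

// Min/max over a buffer that is not owned (e.g. a memory-mapped file),
// processed in chunks of `Chunk` floats with a scalar tail.
// Returns {+inf, -inf} when n == 0.
template <std::size_t Chunk = 8>
std::pair<float, float> minmax_chunked(const float *data, std::size_t n) {
    std::array<float, Chunk> lo, hi;
    lo.fill(std::numeric_limits<float>::infinity());
    hi.fill(-std::numeric_limits<float>::infinity());

    // Main loop: full chunks only, one running min/max per lane.
    std::size_t i = 0;
    for (; i + Chunk <= n; i += Chunk) {
        for (std::size_t k = 0; k < Chunk; ++k) {
            lo[k] = std::min(lo[k], data[i + k]);
            hi[k] = std::max(hi[k], data[i + k]);
        }
    }

    // Horizontal reduction across lanes.
    float mn = *std::min_element(lo.begin(), lo.end());
    float mx = *std::max_element(hi.begin(), hi.end());

    // Scalar tail: the remaining n % Chunk elements.
    for (; i < n; ++i) {
        mn = std::min(mn, data[i]);
        mx = std::max(mx, data[i]);
    }
    return {mn, mx};
}
```

A histogram can follow the same split: full chunks in the main loop, then a scalar pass over the remainder.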

How to reshape/cut a CUDAArray or add two arrays with different sizes?

In C++:
FloatC data= (enoki::arange(8)) * 0.f;
FloatC data2 = (enoki::arange(8 * 2)) * 1.f;

How can I do something like:
data = data + data2;
or
data = data + data2[0:8] + data2[8:16];
Thanks!!
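To pin down what the slicing expression is meant to compute, here is a scalar reference sketch in plain C++. `add_halves` is a hypothetical helper; it does not show how to express the slices with Enoki's CUDA arrays.

```cpp
#include <cstddef>
#include <vector>

// Scalar reference for `data + data2[0:n] + data2[n:2n]` (hypothetical helper).
// Assumes b holds exactly twice as many elements as a.
std::vector<float> add_halves(const std::vector<float> &a,
                              const std::vector<float> &b) {
    const std::size_t n = a.size();
    std::vector<float> out(n);
    for (std::size_t i = 0; i < n; ++i)
        out[i] = a[i] + b[i] + b[i + n];  // first half plus second half
    return out;
}
```

Each output element reads two positions of the longer array, offset by the length of the shorter one.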

Confusion regarding mask types

I wrote a simple test to try Enoki. However, I am unable to perform simple comparison operations due to type differences. The documentation states that the return type of operator< and neq is mask_t<Array>. However, the types of the result1 and result2 variables in the following code are different.

import enoki as ek

def myfunc(arr1, arr2):
  result1 = ek.dot(arr1, arr1) < 0
  result2 = ek.neq(arr2, 0)
  print(type(result1), type(result2))
  return result1, result2

def test_scalar():
  from enoki import scalar
  arr1 = scalar.Vector1f([1])
  arr2 = scalar.Vector1f([2])
  res = myfunc(arr1, arr2)


def test_cuda():
  from enoki import cuda
  arr1 = cuda.Vector1f([1])
  arr2 = cuda.Vector1f([2])
  res = myfunc(arr1, arr2)


if __name__ == '__main__':
  test_scalar()
  test_cuda()

Scalar mode gives the output below. As I understand it, this is because the result of the dot operation is converted to py::float. Is there a way to perform the comparison without explicitly casting to bool in this case?

<class 'bool'> <class 'enoki.scalar.Vector1m'>

CUDA mode gives the output below. The difference between these types is unclear to me. Could you give more details?

<class 'enoki.cuda.Mask'> <class 'enoki.cuda.Vector1m'>

How to cast an int* to FloatC?

For example, suppose I have an int* pointing to an int array.
How can I efficiently convert that int array into a FloatC/FloatD, which is a CUDA array in Enoki? I don't think it is a good idea to scatter_add each element of the int array into the Enoki CUDA array.
Thanks

Nothing gets built from CMake

I clone the repository recursively, as suggested by the documentation:

$ git clone --recursive https://github.com/mitsuba-renderer/enoki
Cloning into 'enoki'...
...
Cloning into '/home/bram/src/enoki/ext/cub'...
...
Cloning into '/home/bram/src/enoki/ext/pybind11'...
...
Cloning into '/home/bram/src/enoki/ext/pybind11/tools/clang'...
...

I then call cmake

$ cd enoki
$ mkdir build
$ cd build
$ CXX=clang++-8 CC=clang-8 cmake ../
-- The CXX compiler identification is Clang 8.0.0
-- Check for working CXX compiler: /usr/bin/clang++-8
-- Check for working CXX compiler: /usr/bin/clang++-8 -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Setting build type to 'Release' as none was specified.
-- Enoki: using libc++.
-- Found Sphinx: /usr/bin/sphinx-build  
-- Configuring done
-- Generating done
-- Build files have been written to: /home/bram/src/enoki/build

When I then try to make, nothing gets built.

$ make
$

I expect at least the tests to get built with that.

This is on Ubuntu 18.04.4 LTS

How to dereference?

Hi,

I would like to dereference a packet of pointers, something like this:

using PtrP   = enoki::Packet<uintptr_t, 8>;
using Uint16P   = enoki::Packet<uint16_t, 8>;
PtrP p = fun();
Uint16P i = *p;

But this code does not compile.
Alternatively, I thought enoki::gather<>() would fit this situation:

Uint16P i = enoki::gather<Uint16P>(p,0);

But it doesn't return the desired result.
Perhaps gather<> assumes that the pointer argument is not a packet.

Any solution?

Thanks.
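The operation being asked for is a per-lane dereference: lane i reads the value stored at the absolute address held in lane i. Here is a scalar sketch of that behavior in plain C++; `deref_lanes` is a hypothetical helper, and how this maps onto enoki::gather is not covered here.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Per-lane dereference (hypothetical helper): lane i reads the uint16_t
// stored at the absolute address ptrs[i]. Every address must point to a
// valid, suitably aligned uint16_t.
template <std::size_t N>
std::array<std::uint16_t, N>
deref_lanes(const std::array<std::uintptr_t, N> &ptrs) {
    std::array<std::uint16_t, N> out{};
    for (std::size_t i = 0; i < N; ++i)
        out[i] = *reinterpret_cast<const std::uint16_t *>(ptrs[i]);
    return out;
}
```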

Aligned memory allocation functions missing.

The documentation references these functions https://enoki.readthedocs.io/en/master/reference.html#memory-allocation but I can't find them in the code.

Link error when trying to use DynamicArray for DiffArray

I am trying to auto-diff using DiffArray<FloatX> defined below, instead of the usual DiffArray<CudaArray<float>>:

#include <enoki/dynamic.h>
#include <enoki/autodiff.h>

using namespace enoki;

using Float  = float;

// not working:
using FloatP = Packet<Float>;
using FloatX = DynamicArray<FloatP>;
using FloatD = DiffArray<FloatX>;

// working:
// using FloatD = DiffArray<Float>;

int main()
{
    FloatD x = 1.f;
    set_requires_gradient(x);
    FloatD y = 10.f * x;
    backward(y);
    std::cout << y << std::endl;
    std::cout << gradient(x) << std::endl;
}

I compiled this with Clang++-9 on Ubuntu 18.04 and linked against libenoki-autodiff.so, libenoki-cuda.so and cuda.so (the last two may not be needed; I mention them for completeness). This gives:

/tests/enoki/CMakeFiles/test_examples.dir/examples.cpp.o: In function `main':
examples.cpp:(.text+0x30): undefined reference to `enoki::Tape<enoki::DynamicArray<enoki::Packet<float, 1ul, true, (enoki::RoundingMode)4> > >::get()'
examples.cpp:(.text+0x3f): undefined reference to `enoki::Tape<enoki::DynamicArray<enoki::Packet<float, 1ul, true, (enoki::RoundingMode)4> > >::append_leaf(unsigned long)'
examples.cpp:(.text+0x7f): undefined reference to `enoki::DiffArray<enoki::DynamicArray<enoki::Packet<float, 1ul, true, (enoki::RoundingMode)4> > >::mul_(enoki::DiffArray<enoki::DynamicArray<enoki::Packet<float, 1ul, true, (enoki::RoundingMode)4> > > const&) const'
examples.cpp:(.text+0x89): undefined reference to `enoki::DiffArray<enoki::DynamicArray<enoki::Packet<float, 1ul, true, (enoki::RoundingMode)4> > >::~DiffArray()'
examples.cpp:(.text+0x9a): undefined reference to `enoki::Tape<enoki::DynamicArray<enoki::Packet<float, 1ul, true, (enoki::RoundingMode)4> > >::backward(unsigned int, bool)'
examples.cpp:(.text+0x104): undefined reference to `enoki::Tape<enoki::DynamicArray<enoki::Packet<float, 1ul, true, (enoki::RoundingMode)4> > >::gradient(unsigned int)'
examples.cpp:(.text+0x166): undefined reference to `enoki::DiffArray<enoki::DynamicArray<enoki::Packet<float, 1ul, true, (enoki::RoundingMode)4> > >::~DiffArray()'
examples.cpp:(.text+0x16e): undefined reference to `enoki::DiffArray<enoki::DynamicArray<enoki::Packet<float, 1ul, true, (enoki::RoundingMode)4> > >::~DiffArray()'
examples.cpp:(.text+0x1cd): undefined reference to `enoki::DiffArray<enoki::DynamicArray<enoki::Packet<float, 1ul, true, (enoki::RoundingMode)4> > >::~DiffArray()'
examples.cpp:(.text+0x1d5): undefined reference to `enoki::DiffArray<enoki::DynamicArray<enoki::Packet<float, 1ul, true, (enoki::RoundingMode)4> > >::~DiffArray()'
clang: error: linker command failed with exit code 1 (use -v to see invocation)

If I instead use FloatD = DiffArray<Float>;, it works. How can I fix this link error for the DiffArray<FloatX> defined above?

Allow DynamicArray to map over const range

Hello,

I have a use case in which I iterate over a const range, apply some transformation to it, and store the result in another range. The original range is const float.

I modelled this as:

  using FloatArray = enoki::Array<float, 8, true, enoki::RoundingMode::Default>;

  enoki::DynamicArray<const FloatArray> input;
  enoki::DynamicArray<FloatArray> destination;

and I do DynamicArray::map over my float ranges like this:

void foo(const float* data, size_t s) {
  input = enoki::DynamicArray<const FloatArray>::map(data, s);
}

However, map signature is:

static Derived map(void *ptr, size_t size) {

and I get a compiler error that I am casting away my constness.

If I change this to:

    template <typename T>
    static Derived map(T *ptr, size_t size) {

everything works, since the template type T now carries the constness.

Would you be open to accepting this change as a PR?
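The idea behind the proposed change can be shown with a small non-owning view whose element type carries the constness. This is a minimal sketch in plain C++; `MappedView` and `scale` are hypothetical names and do not reflect how Enoki's DynamicArray is implemented.

```cpp
#include <cstddef>

// Hypothetical non-owning view (not Enoki's DynamicArray). Since map()
// takes T*, a `const float *` is accepted when T is `const float`, so no
// constness has to be cast away.
template <typename T>
class MappedView {
public:
    static MappedView map(T *ptr, std::size_t size) {
        return MappedView(ptr, size);
    }
    T &operator[](std::size_t i) const { return ptr_[i]; }
    std::size_t size() const { return size_; }

private:
    MappedView(T *ptr, std::size_t size) : ptr_(ptr), size_(size) {}
    T *ptr_;
    std::size_t size_;
};

// Reads through a const view and writes through a mutable one.
inline void scale(const float *in, float *out, std::size_t n, float factor) {
    auto src = MappedView<const float>::map(in, n);
    auto dst = MappedView<float>::map(out, n);
    for (std::size_t i = 0; i < n; ++i)
        dst[i] = src[i] * factor;
}
```

Writing through `src` would fail to compile, because its operator[] returns `const float &`.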

Neighbours on a grid

Could Enoki be used to apply a function to a grid when the function requires access to data points and their neighbours (typically when computing a numerical scheme)? And what would be the preferred method?

Should I extract the neighbours manually with a loop (in order to build one array per neighbour position) and then apply the function?

Or is there a better way? Maybe using a precomputed array of neighbour indices?

An example would be great since, if this can be done efficiently, it would be a great use case for Enoki.
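The precomputed-index idea can be illustrated with a scalar sketch in plain C++: a three-point stencil on a 1D grid, where each point looks up its neighbours through index arrays. `laplacian_1d` and the clamped boundary handling are assumptions for illustration; in a vectorized version the indexed reads would become gathers.

```cpp
#include <cstddef>
#include <vector>

// Three-point stencil u[i-1] - 2*u[i] + u[i+1] on a 1D grid, using
// precomputed neighbour indices. Boundary points clamp to themselves.
std::vector<float> laplacian_1d(const std::vector<float> &u) {
    const std::size_t n = u.size();

    // Precompute neighbour indices once; they can be reused every step.
    std::vector<std::size_t> left(n), right(n);
    for (std::size_t i = 0; i < n; ++i) {
        left[i] = (i == 0) ? 0 : i - 1;
        right[i] = (i + 1 == n) ? n - 1 : i + 1;
    }

    // Apply the stencil; each read through left/right is a gather.
    std::vector<float> out(n);
    for (std::size_t i = 0; i < n; ++i)
        out[i] = u[left[i]] - 2.0f * u[i] + u[right[i]];
    return out;
}
```

On a regular grid the neighbour indices are just fixed offsets, so shifted loads could replace the gathers; index arrays pay off mainly for irregular connectivity.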

Question: Other backends (OpenCL, DirectCompute, ...) [Nice to have]

It would be nice to have other backends: DirectX HLSL DirectCompute, OpenCL, OpenGL compute, ...
Could you explain what is needed to implement another GPU array type (CLArray, DxArray, ...), or a GPUArray that can target CUDA, OpenCL, ... selected at compile time?

UB issues with bool_array_t initialization.

I get an UndefinedBehaviourSanitizer hit from Google's sanitizer (https://github.com/google/sanitizers) when initializing a dynamic array of a struct containing bool values to a number of slices not a multiple of the packet size.

Taking the example from here https://enoki.readthedocs.io/en/master/dynamic.html?highlight=bool_array_t#custom-dynamic-data-structures, if you do something like

using FloatP = Packet<float, 4>;
using FloatX = DynamicArray<FloatP>;
using GPSCoord2fX = GPSCoord2<FloatX>;
GPSCoord2fX coord;
set_slices(coord, 1001);

UBSAN will fire saying:

enoki/array_fallbacks.h:495:16: runtime error: load of value 190, which is not a valid value for type 'const bool'

I dug a little into this and traced it down to the clean_trailing_() function in dynamic.h, specifically this line:

store(addr, load<Packet>(addr) & mask);

Something weird is happening with the types here that it doesn't like. I think load<Packet>(addr) causes a read of uninitialized bool values, which are then put into the & expression at array_fallbacks.h:495. A workaround is changing the Bool type in the struct to:

using Bool = enoki::replace_scalar_t<Value, uint8_t>;

This avoids the UB and functions as you'd expect.

Conversion from FloatD to FloatC?

e.g.
FloatD cuda_diff = ...;
FloatC cuda = cuda_diff.val();
or
Vector3fD vec_diff = ...;
Vector3fC vec = vec_diff.val();

Prefix sum example?

After reading through Enoki's documentation, I am still somewhat confused how one would implement a prefix sum or create a summed area table. What is the most idiomatic way to perform a prefix sum?
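As a baseline for what such an operation computes, here is a sequential sketch in plain C++ of an inclusive prefix sum and a summed-area table. `inclusive_prefix_sum` and `summed_area_table` are hypothetical helper names; this does not show an Enoki API, and the source does not say whether Enoki offers a dedicated scan primitive.

```cpp
#include <cstddef>
#include <numeric>
#include <vector>

// Inclusive prefix sum: out[i] = v[0] + ... + v[i].
std::vector<float> inclusive_prefix_sum(const std::vector<float> &v) {
    std::vector<float> out(v.size());
    std::partial_sum(v.begin(), v.end(), out.begin());
    return out;
}

// Summed-area table of a row-major width x height image:
// sat(x, y) = sum of img over all pixels (x', y') with x' <= x and y' <= y.
std::vector<float> summed_area_table(const std::vector<float> &img,
                                     std::size_t width, std::size_t height) {
    std::vector<float> sat(img.size());
    for (std::size_t y = 0; y < height; ++y) {
        float row = 0.0f;  // running sum along the current row
        for (std::size_t x = 0; x < width; ++x) {
            row += img[y * width + x];
            const float above = (y > 0) ? sat[(y - 1) * width + x] : 0.0f;
            sat[y * width + x] = row + above;
        }
    }
    return sat;
}
```

The table is a prefix sum along each row accumulated down the columns, so a fast 1D scan is the main building block.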

Simple instructions to build and run a trivial auto differentiation example?

It would be great to have some instructions (using CMake) to build and run a trivial C++ example to get started. Are there any?
For example, a simple scalar CPU example of the automatic differentiation described in https://enoki.readthedocs.io/en/master/demo.html

(without requiring a complicated test framework or other dependencies)

I tried running CMake but only got a 'mkdoc' project (aside from ALL_BUILD and ZERO_CHECK), which fails
(1>ImportError: No module named 'guzzle_sphinx_theme'), and I'm not interested in building the docs anyway.

Enabling 'ENOKI_TEST' didn't create a new build target.

Thanks!

Switching from PCG to something else

Melissa O'Neill's PCG family of pseudo-random number generators comes with some dubious claims about speed and quality. There are multiple reviews that call them into question (for example, here). It appears that xorshift-derived generators (like xoshiro) are better in every regard. Perhaps it would be worth adding more choices to random.h, or even removing PCG from it.

Report: compile error with RGB Gamma example (Visual Studio 2017)

Hello,

All tests pass, but I can't compile the following code using Visual Studio 2017 / AVX2:

template <typename Value> 
Value srgb_gamma(Value x) {
    return enoki::select(
        x <= 0.0031308f,
        x * 12.92f,
        enoki::pow(x * 1.055f, 1.f / 2.4f) - 0.055f
    );
}

using ColorP = enoki::Array<float, 16>;
ColorP input = /* ... */;
ColorP output = srgb_gamma(input);

I get this error:

1>  c:\code\vsprojects\enoki\include\enoki\array_math.h(962): error C2672: 'enoki::low': no matching overloaded function found
1>  c:\code\vsprojects\enoki_test\enoki_test\main.cpp(94): note: see reference to function template instantiation 'auto enoki::pow<false,Derived_,float>(const T1 &,const T2 &)' being compiled
1>          with
1>          [
1>              Derived_=enoki::Array<float,16,true,enoki::RoundingMode::Default>,
1>              T1=enoki::Array<float,16,true,enoki::RoundingMode::Default>,
1>              T2=float
1>          ]
1>  c:\code\vsprojects\enoki_test\enoki_test\main.cpp(165): note: see reference to function template instantiation 'Value srgb_gamma<T>(Value)' being compiled
1>          with
1>          [
1>              Value=ColorP,
1>              T=ColorP
1>          ]
1>  c:\code\vsprojects\enoki\include\enoki\array_traits.h(151): error C2783: 'auto enoki::low(const Array &)': could not deduce template argument for '__formal'
...
...

Here, if enoki::Array<float, 16> is replaced by enoki::Array<float, 8>, there is no error.

On the other hand, the version below compiles successfully (it explicitly creates Value(1.f / 2.4f)):

template <typename Value> 
Value srgb_gamma(Value x) {
    return enoki::select(
        x <= 0.0031308f,
        x * 12.92f,
        enoki::pow(x * 1.055f, Value(1.f / 2.4f)) - 0.055f
    );
}

using ColorP = enoki::Array<float, 16>;
ColorP input = /* ... */;
ColorP output = srgb_gamma(input);

This may be because the MSVC compiler cannot resolve this kind of function overload in the current code.

A typo

README.md:

$ git clone --rescursive https://github.com/mitsuba-renderer/enoki

rescursive -> recursive

Question about multi-core and multi-GPU

I am importing Enoki in Python and using dynamic arrays. I find that it only uses one CPU core.
Should I handle multithreading myself, or does Enoki support multi-core parallelism?
I have multiple GPUs; can I specify which GPU Enoki uses?

Thanks for your help!
