
mobulaop's Introduction

MobulaOP


What is it?

MobulaOP is a simple and flexible cross-framework operator toolkit.

You can write custom operators in Python/C++/C/CUDA/HIP/TVM without rebuilding the deep learning framework from source.

How to use it?

[Tutorial (Chinese)]

[Tutorial]

  • Add an addition operator [Code]
import mobula

@mobula.op.register
class MyFirstOP:
    def forward(self, x, y):
        return x + y
    def backward(self, dy):
        # The gradient of x + y with respect to each input is dy.
        return [dy, dy]
    def infer_shape(self, in_shape):
        # Both inputs must share a shape; the single output reuses it.
        assert in_shape[0] == in_shape[1]
        return in_shape, [in_shape[0]]

# MXNet
import mxnet as mx
a = mx.nd.array([1, 2, 3])
b = mx.nd.array([4, 5, 6])
c = MyFirstOP(a, b)
print(c)  # [5, 7, 9]

# PyTorch
import torch
a = torch.tensor([1, 2, 3])
b = torch.tensor([4, 5, 6])
c = MyFirstOP(a, b)
print(c)  # [5, 7, 9]

# NumPy
import numpy as np
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])
op = MyFirstOP[np.ndarray]()
c = op(a, b)
print(c)  # [5, 7, 9]

# CuPy
import cupy as cp
a = cp.array([1, 2, 3])
b = cp.array([4, 5, 6])
op = MyFirstOP[cp.ndarray]()
c = op(a, b)
print(c) # [5, 7, 9]
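
The backward method above defines the operator's gradient. A minimal sketch of checking it with MXNet autograd (this assumes the MyFirstOP registration above; the pattern mirrors the AdditionOP example later on this page):

# Gradient check for MyFirstOP (sketch)
import mxnet as mx
a = mx.nd.array([1, 2, 3])
b = mx.nd.array([4, 5, 6])
a.attach_grad()
b.attach_grad()
with mx.autograd.record():
    c = MyFirstOP(a, b)
c.backward(mx.nd.array([7, 8, 9]))
print(a.grad)  # [7, 8, 9], since backward returns [dy, dy]
print(b.grad)  # [7, 8, 9]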
  • Use custom operators without rebuilding the deep learning framework from source [Code]
# Use ROIAlign operator
import mxnet as mx
import numpy as np
import mobula

# Load ROIAlign Module
mobula.op.load('ROIAlign')

ctx = mx.cpu(0)
dtype = np.float32
N, C, H, W = 2, 3, 4, 4

data = mx.nd.array(np.arange(N*C*H*W).astype(dtype).reshape((N, C, H, W)))
rois = mx.nd.array(np.array([[0, 1, 1, 3, 3]], dtype=dtype))

data.attach_grad()
with mx.autograd.record():
    # mx.nd.NDArray and mx.sym.Symbol are both available as the inputs.
    output = mobula.op.ROIAlign(data=data, rois=rois, pooled_size=(2, 2), spatial_scale=1.0, sampling_ratio=1)
output.backward()  # populate data.grad; without this the printed gradient stays zero

print(output.asnumpy(), data.grad.asnumpy())
  • Import Custom C++ Operator Dynamically [Code]
import mobula
# Import Custom Operator Dynamically
mobula.op.load('./AdditionOP')

import mxnet as mx
a = mx.nd.array([1, 2, 3])
b = mx.nd.array([4, 5, 6])
c = mobula.op.AdditionOP(a, b)

print('a + b = c \n {} + {} = {}'.format(a.asnumpy(), b.asnumpy(), c.asnumpy()))

How to get it?

# Clone the project
git clone https://github.com/wkcn/MobulaOP

# Enter the directory
cd MobulaOP

# Install MobulaOP
pip install -v -e .
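
# Sanity-check the installation (a quick smoke test; the tutorial
# examples above exercise the full compile path)
python -c "import mobula"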

mobulaop's People

Contributors

chop2, damnull, kohillyang, merrymercy, mgno32, wkcn


mobulaop's Issues

[Question] Using types other than float32?

Is there a way to specify the data type of the outputs (other than always using float32)?

And, in general, does MobulaOP support mixed types when implementing a kernel?

Thanks!

undefined symbol: MXShallowCopyNDArray

When I test the tutorials, I get the warning:

/root/test/MobulaOP/mobula/glue/mx.py:44: UserWarning: Using asynchronous execution for MXNet failed, since /usr/local/lib/python3.6/dist-packages/mxnet/libmxnet.so: undefined symbol: MXShallowCopyNDArray
It will drop the performance.

But I have tried different versions of MXNet (1.5.0, 1.5.1, 1.6.0b20190729) and get the same warning.

compile error

Running python test_mul_func.py, an error occurs:

[10:21:12] src/engine/engine.cc:55: MXNet start using engine: NaiveEngine
/wls/tf_workspace/MobulaOP/mobula/glue/mx.py:44: UserWarning: Using asynchronous execution for MXNet failed, since /home/weishuyi/anaconda3/lib/python3.7/site-packages/mxnet/libmxnet.so: undefined symbol: MXShallowCopyNDArray
It will drop the performance.
Recommend using the latest version of MXNet
mkdir -p /wls/tf_workspace/MobulaOP/mobula/build/cpu/src
g++ /wls/tf_workspace/MobulaOP/mobula/src/defines.cpp -std=c++11 -DUSING_CUDA=0 -DUSING_HIP=0 -DUSING_OPENMP=0 -DHOST_NUM_THREADS=40 -O3 -DUSING_CBLAS=0 -I/wls/tf_workspace/MobulaOP/mobula/./ -I/wls/tf_workspace/MobulaOP/mobula/./inc -I/wls/tf_workspace/MobulaOP/mobula/../3rdparty/dlpack/include -I/wls/tf_workspace/MobulaOP/mobula/../3rdparty/tvm_packed_func -fPIC -Werror -Wall -Wextra -pedantic -Wcast-align -Wcast-qual -Wctor-dtor-privacy -Wdisabled-optimization -Wformat=2 -Winit-self -Wmissing-include-dirs -Wold-style-cast -Woverloaded-virtual -Wredundant-decls -Wshadow -Wsign-promo -Wundef -fdiagnostics-show-option -c -o /wls/tf_workspace/MobulaOP/mobula/build/cpu/src/defines.o
g++ /wls/tf_workspace/MobulaOP/mobula/src/context.cpp -std=c++11 -DUSING_CUDA=0 -DUSING_HIP=0 -DUSING_OPENMP=0 -DHOST_NUM_THREADS=40 -O3 -DUSING_CBLAS=0 -I/wls/tf_workspace/MobulaOP/mobula/./ -I/wls/tf_workspace/MobulaOP/mobula/./inc -I/wls/tf_workspace/MobulaOP/mobula/../3rdparty/dlpack/include -I/wls/tf_workspace/MobulaOP/mobula/../3rdparty/tvm_packed_func -fPIC -Werror -Wall -Wextra -pedantic -Wcast-align -Wcast-qual -Wctor-dtor-privacy -Wdisabled-optimization -Wformat=2 -Winit-self -Wmissing-include-dirs -Wold-style-cast -Woverloaded-virtual -Wredundant-decls -Wshadow -Wsign-promo -Wundef -fdiagnostics-show-option -c -o /wls/tf_workspace/MobulaOP/mobula/build/cpu/src/context.o
mkdir -p MulElemWise/build/MulElemWise/build/cpu
g++ MulElemWise/build/cpu/MulElemWise_wrapper.cpp -std=c++11 -DUSING_CUDA=0 -DUSING_HIP=0 -DUSING_OPENMP=0 -DHOST_NUM_THREADS=40 -O3 -DUSING_CBLAS=0 -I/wls/tf_workspace/MobulaOP/mobula/./ -I/wls/tf_workspace/MobulaOP/mobula/./inc -I/wls/tf_workspace/MobulaOP/mobula/../3rdparty/dlpack/include -I/wls/tf_workspace/MobulaOP/mobula/../3rdparty/tvm_packed_func -fPIC -Werror -Wall -Wextra -pedantic -Wcast-align -Wcast-qual -Wctor-dtor-privacy -Wdisabled-optimization -Wformat=2 -Winit-self -Wmissing-include-dirs -Wold-style-cast -Woverloaded-virtual -Wredundant-decls -Wshadow -Wsign-promo -Wundef -fdiagnostics-show-option -c -o MulElemWise/build/MulElemWise/build/cpu/MulElemWise_wrapper.o
In file included from /wls/tf_workspace/MobulaOP/mobula/./inc/mobula_op.h:5:0,
from MulElemWise/build/cpu/MulElemWise_wrapper.cpp:8:
/wls/tf_workspace/MobulaOP/mobula/./inc/glue_mx.h: In function ‘void RegisterMXAPI(void*, void*, void*, void*, void*)’:
/wls/tf_workspace/MobulaOP/mobula/./inc/glue_mx.h:45:76: error: ISO C++ forbids casting between pointer-to-function and pointer-to-object [-Werror=pedantic]
reinterpret_cast<decltype(MXShallowCopyNDArray)>(shallow_copy_ndarray);
^
/wls/tf_workspace/MobulaOP/mobula/./inc/glue_mx.h:46:73: error: ISO C++ forbids casting between pointer-to-function and pointer-to-object [-Werror=pedantic]
MXNDArrayFree = reinterpret_cast<decltype(MXNDArrayFree)>(ndarray_free);
^
/wls/tf_workspace/MobulaOP/mobula/./inc/glue_mx.h:48:74: error: ISO C++ forbids casting between pointer-to-function and pointer-to-object [-Werror=pedantic]
reinterpret_cast<decltype(MXNDArrayGetContext)>(ndarray_get_context);
^
/wls/tf_workspace/MobulaOP/mobula/./inc/glue_mx.h:50:70: error: ISO C++ forbids casting between pointer-to-function and pointer-to-object [-Werror=pedantic]
reinterpret_cast<decltype(MXNDArrayToDLPack)>(ndarray_to_dlpack);
^
/wls/tf_workspace/MobulaOP/mobula/./inc/glue_mx.h:52:73: error: ISO C++ forbids casting between pointer-to-function and pointer-to-object [-Werror=pedantic]
reinterpret_cast<decltype(MXEnginePushSyncND)>(engine_push_sync_nd);

Custom Operators Zoo

Hi there, this issue is to summarize some custom operators to be supported.
Please feel free to add it if you want any operator : )

Low performance in GPU mode

I wrote my first demo of a MobulaOP operator. The directory layout of my project:

mobula_test
  │  main.py
  └──TestOP
      └───TestOP.cpp

The content of files:
main.py:

import mobula
import mxnet as mx
from mxnet import nd
from tqdm import tqdm


if __name__ == '__main__':
    mobula.op.load('TestOP')
    ctx = mx.cpu()
    a = nd.ones((5000, 5000), ctx=ctx)
    b = nd.ones((5000, 5000), ctx=ctx)
    out = nd.empty(a.shape, ctx=ctx)

    print("cpu")
    for i in tqdm(range(1000)):
        mobula.func.TestOP(a.size, a, b, out)

    ctx = mx.gpu()
    a = nd.ones((5000, 5000), ctx=ctx)
    b = nd.ones((5000, 5000), ctx=ctx)
    out = nd.empty(a.shape, ctx=ctx)

    print("gpu")
    for i in tqdm(range(1000)):
        mobula.func.TestOP(a.size, a, b, out)

TestOP.cpp:

template<typename DType>
MOBULA_KERNEL TestOP_kernel(const int n, const DType* a, const DType* b, DType* out)
{
    // parfor maps the lambda over n indices; the same kernel source
    // is compiled for both the CPU and the CUDA build.
    parfor(n, [&](int i)
    {
        out[i] = a[i] + b[i];
    });
}

Time cost: CPU 14 s, GPU 226 s on an i7-7700K and a 1080 Ti. Both CPU and GPU usage are at 100%.
OS environment: Windows 10 1809, CUDA 10.0
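
For reference, a fairer timing sketch: MXNet's engine is asynchronous, so calling mx.nd.waitall() after the loop ensures the measured time covers the kernels themselves rather than just the launches (TestOP and the shapes are taken from the code above):

import time
import mxnet as mx
from mxnet import nd
import mobula

mobula.op.load('TestOP')

def bench(ctx, iterations=1000):
    a = nd.ones((5000, 5000), ctx=ctx)
    b = nd.ones((5000, 5000), ctx=ctx)
    out = nd.empty(a.shape, ctx=ctx)
    nd.waitall()  # finish allocation before timing
    start = time.time()
    for _ in range(iterations):
        mobula.func.TestOP(a.size, a, b, out)
    nd.waitall()  # wait for all pending kernels to complete
    return time.time() - start

print('cpu: %.2f s' % bench(mx.cpu()))
print('gpu: %.2f s' % bench(mx.gpu()))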

Not working with multiple processes

When calling MobulaOP in a subprocess, it gets stuck.

Environment: latest MXNet nightly build and Python 3.6.5

Example code, modified from dynamic_import_op.py, to reproduce the error:

from concurrent import futures

import sys
import mxnet as mx

def foo():
    import mobula
    # Import Custom Operator Dynamically
    mobula.op.load('./AdditionOP')
    AdditionOP = mobula.op.AdditionOP

    a = mx.nd.array([1, 2, 3])
    b = mx.nd.array([4, 5, 6])

    a.attach_grad()
    b.attach_grad()

    with mx.autograd.record():
        c = AdditionOP(a, b)

    dc = mx.nd.array([7, 8, 9])
    c.backward(dc)

    assert ((a + b).asnumpy() == c.asnumpy()).all()
    assert (a.grad.asnumpy() == dc.asnumpy()).all()
    assert (b.grad.asnumpy() == dc.asnumpy()).all()

    print('Okay :-)')
    print('a + b = c \n {} + {} = {}'.format(a.asnumpy(), b.asnumpy(), c.asnumpy()))

def main():
    ex = futures.ProcessPoolExecutor(1)
    r = ex.submit(foo)
    r.result()

if __name__ == "__main__":
    main()
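
A possible workaround, under the assumption that the hang comes from forking after the MXNet/CUDA runtime is initialized in the parent: switch the worker start method to 'spawn' so the child begins with a fresh interpreter (a sketch; whether it helps depends on where the deadlock actually is).

import multiprocessing as mp
from concurrent import futures

def main():
    ex = futures.ProcessPoolExecutor(1)
    r = ex.submit(foo)
    r.result()

if __name__ == "__main__":
    # 'spawn' gives the child a clean interpreter instead of a fork
    # of the parent's already-initialized runtime state.
    mp.set_start_method('spawn')
    main()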

CustomOp in Python and C++ for prediction

Hello, it's very useful! I have a problem: I define a custom operator in MXNet (Python) and train a model. Now I want to load the model (.json & .params) with MXNet (C++). Can you give me some advice? Thanks.

Rename the package

The current package name of this project is mobula.
However, that name is already taken by the project mobula.

I will rename the package of MobulaOP.

Leveraging framework specific math helpers

Hi @wkcn, really nice work.
I'd like to ask whether it is possible for MobulaOP to leverage the existing math helpers in deep learning frameworks, such as ATen in PyTorch and mshadow in MXNet.
Writing everything in vanilla C++ is prohibitively cumbersome, which prevents the adoption of MobulaOP in practice.

Does MobulaOP support CuPy?

Is it possible to support CuPy (a NumPy-like API accelerated with CUDA)?

For example:

a = cupy.array([1, 2, 3])
b = cupy.array([4, 5, 6])
out = cupy.empty(a.shape)
mobula.func.mul_elemwise(a.size, a, b, out)

Traceback (most recent call last):
  File "D:\Miniconda3\envs\python35\lib\site-packages\mobula\func.py", line 208, in __call__
    var, ptype, template_mapping, using_async)
  File "D:\Miniconda3\envs\python35\lib\site-packages\mobula\func.py", line 273, in _get_tensor_info
    raise TypeError()
TypeError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "test_mul_func.py", line 9, in <module>
    mobula.func.mul_elemwise(aa.size, aa, bb, outa)
  File "D:\Miniconda3\envs\python35\lib\site-packages\mobula\func.py", line 239, in __call__
    self.name, self.func.arg_types, list(map(type, args))))
TypeError: Unmatched parameters list of the function mul_elemwise:
[const int32_t, <typename const T*>, <typename const T*>, <typename T*>]
vs
[<class 'int'>, <class 'cupy.core.core.ndarray'>, <class 'cupy.core.core.ndarray'>, <class 'cupy.core.core.ndarray'>]
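
(Note: the CuPy example near the top of this page, op = MyFirstOP[cp.ndarray](), suggests that CuPy support was added after this issue was filed.)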

Where to find these keywords of MXNet

Hi, I am learning from your project and want to know where to find these keywords of MXNet, like those in check_backend(b):

func_names = ['get_pointer', 'dev_id', 'wait_to_read', 'wait_to_write', 'OpGen']

Is there any page online about the meanings of these keywords?
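
These names are not MXNet keywords; they appear to be the interface each glue module in mobula/glue/ must expose so MobulaOP can hand a framework's tensors to the C++ kernels. A hypothetical sketch for illustration (the bodies below are assumptions based on the names, not the real implementation):

# Hypothetical glue-module interface (illustration only)
import ctypes

def get_pointer(arr):
    # Raw data pointer of the tensor, passed to the C++ kernel
    # (NumPy flavor shown; other backends differ).
    return arr.ctypes.data_as(ctypes.c_void_p)

def dev_id(arr):
    # Device id of the tensor; None for CPU tensors.
    return None

def wait_to_read(arr):
    # Block until pending writes to arr finish (no-op for NumPy).
    pass

def wait_to_write(arr):
    # Block until arr is safe to overwrite (no-op for NumPy).
    pass

# OpGen generates the framework-specific operator wrapper class.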

GPU backend does not work for pytorch

The following code produces wrong output.
If I change .cuda() to .cpu(), I get the correct output.

(Fix #10 is required to run this example)

# Use ROIAlign operator
import sys
sys.path.append('../') # Add MobulaOP path
import numpy as np
import mobula
# Load ROIAlign Module
mobula.op.load('ROIAlign')

dtype = np.float32
N, C, H, W = 2, 3, 4, 4

import torch

data = torch.tensor(np.arange(N*C*H*W).astype(dtype).reshape((N, C, H, W))).cuda()
rois = torch.tensor(np.array([[0, 1, 1, 3, 3]], dtype=dtype)).cuda()

output = mobula.op.ROIAlign(data=data, rois=rois, pooled_size=(2, 2), spatial_scale=1.0, sampling_ratio=1)

print("= OUTPUT =")
print(output)
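
A quick way to confirm the divergence is to compare against the CPU path, which is correct per the report above (a debugging sketch; torch.cuda.synchronize() rules out an unfinished kernel):

torch.cuda.synchronize()  # make sure the GPU kernel has finished
output_cpu = mobula.op.ROIAlign(data=data.cpu(), rois=rois.cpu(), pooled_size=(2, 2), spatial_scale=1.0, sampling_ratio=1)
print(np.allclose(output.cpu().numpy(), output_cpu.numpy()))  # False under this bug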

LICENSE Problem

In this project, I used the header file functional-gcc4_9.h, which is under the GPL license, to address an ABI compatibility problem. I need to resolve the license problem.

I have removed the GPL-licensed files from the master branch.
In addition, there is a GPL branch, https://github.com/wkcn/MobulaOP/tree/master-GPL, which keeps the gcc compatibility.

Lack of comments

Too much of the code lacks comments. I need to add them.

Todo List:

  • Python Code
  • C++/C Code

Is Gluon supported?

Firstly, thanks for the work! It makes MXNet easier to use.

My question is as follows: can an op created with MobulaOP be called from Gluon?
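
Since MobulaOP operators accept both mx.nd.NDArray and mx.sym.Symbol (see the ROIAlign example above), wrapping one in a Gluon block should work. A minimal sketch, assuming the AdditionOP from the tutorial:

import mxnet as mx
from mxnet import gluon
import mobula

mobula.op.load('./AdditionOP')

class AdditionBlock(gluon.HybridBlock):
    def hybrid_forward(self, F, x, y):
        # MobulaOP dispatches on the input type (NDArray or Symbol),
        # so the block can also be hybridized.
        return mobula.op.AdditionOP(x, y)

block = AdditionBlock()
print(block(mx.nd.array([1, 2, 3]), mx.nd.array([4, 5, 6])))  # [5, 7, 9]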

Question about the PyTorch example

Hi, I have tried your basic example on MulElemWise, but got this kind of error.

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/alexhu/Source/MobulaOP/mobula/glue/common.py", line 158, in __call__
    return backend.op_gen(glue_mod, op=self.op, name=self.name)(*args, **new_kwargs)
  File "/home/alexhu/Source/MobulaOP/mobula/glue/th.py", line 41, in __call__
    return self.cache[self.name](*pars[0], **pars[1])(*inputs)
  File "/home/alexhu/anaconda3/envs/slr/lib/python3.6/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/alexhu/Source/MobulaOP/mobula/glue/th.py", line 105, in forward
    return torch_func.apply(self, *args, **kwargs)
  File "/home/alexhu/Source/MobulaOP/mobula/glue/th.py", line 59, in forward
    out = self._forward(*args, **kwargs)
  File "/home/alexhu/Source/MobulaOP/docs/tutorial/MulElemWise/MulElemWise.py", line 7, in forward
    mobula.func.mul_elemwise(a.size, a, b, self.y)
  File "/home/alexhu/Source/MobulaOP/mobula/func.py", line 148, in __call__
    data, var_dev_id, ctype = self._get_scalar_info(var, ptype)
  File "/home/alexhu/Source/MobulaOP/mobula/func.py", line 277, in _get_scalar_info
    var, ctypes.c_void_p) else ptype.ctype(var)
TypeError: an integer is required (got type builtin_function_or_method)

ROIAlign custom op runs slowly

Hi, I have tried the ROIAlign custom op provided by this repo in a Faster R-CNN example. I simply replaced the symbol code:

roi_pool = mx.symbol.ROIPooling(name='roi_pool', data=conv_new_1_relu, rois=rois, 
           pooled_size=(7, 7), spatial_scale=spatial_scale)

with

roi_pool = mobula.op.ROIAlign(name='roi_pool', data=conv_new_1_relu, rois=rois,
           pooled_size=(7, 7), spatial_scale=spatial_scale, sampling_ratio=0)

The running speed decreases from 0.1 s to 1~2 s, and with multiple GPUs the code cannot run in parallel and becomes much slower.
My MXNet version is 1.3.0-cu92 from pip install. What might be the problem?
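
One way to narrow this down is to time the op in isolation; mx.nd.waitall() forces MXNet's asynchronous engine to finish, so the measurement covers the kernel rather than just the call (a sketch; the shapes below are placeholders):

import time
import mxnet as mx
import mobula

mobula.op.load('ROIAlign')

ctx = mx.gpu()
data = mx.nd.random.uniform(shape=(2, 256, 14, 14), ctx=ctx)
rois = mx.nd.array([[0, 1, 1, 7, 7]], ctx=ctx)

mx.nd.waitall()  # finish setup before timing
start = time.time()
for _ in range(100):
    out = mobula.op.ROIAlign(data=data, rois=rois, pooled_size=(7, 7), spatial_scale=1.0, sampling_ratio=0)
mx.nd.waitall()  # wait for all pending kernels
print('%.4f s per call' % ((time.time() - start) / 100))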
