HeteroCL's Introduction

Note: HeteroCL is superseded by Allo, a new programming language for composable accelerator design [PLDI'24]. For the latest updates, please visit our new repository.


HeteroCL: A Multi-Paradigm Programming Infrastructure for Software-Defined Reconfigurable Computing

Website | Installation | Tutorials | Documentation

Introduction

In the pursuit of higher compute performance under strict power constraints, there is an increasing need to deploy applications to heterogeneous hardware architectures with accelerators such as GPUs and FPGAs. However, although these heterogeneous computing platforms are becoming widely available, they are very difficult to program, especially FPGAs. As a result, the use of such platforms has been limited to a small subset of programmers with specialized hardware knowledge.

To tackle this challenge, we introduce HeteroCL, a programming infrastructure composed of a Python-based domain-specific language (DSL) and a compilation flow. The HeteroCL DSL provides a clean programming abstraction that decouples algorithm specification from hardware customizations, including compute and data customizations. HeteroCL further captures the interdependence among these different customization techniques, allowing programmers to explore various performance/area/accuracy trade-offs in a systematic and productive manner.
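As a minimal sketch (using only APIs documented later on this page), the algorithm below stays intact while customizations are applied through a separate schedule:

import heterocl as hcl

hcl.init()
A = hcl.placeholder((10, 10), "A")

def kernel(A):
    # algorithm specification only; no hardware details here
    return hcl.compute(A.shape, lambda y, x: A[y, x] + 1, "B")

# hardware customizations live in the schedule, not in the algorithm
s = hcl.create_schedule([A], kernel)
s[kernel.B].pipeline(kernel.B.axis[0])  # a compute customization
f = hcl.build(s)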

Language Overview

[Figure: HeteroCL language overview]

Current Compilation Flow

[Figure: current compilation flow]

Install MLIR-based HeteroCL

To install the HeteroCL-MLIR dialect, please make sure you have installed the tools below:

  • gcc >= 5.4
  • cmake >= 3.19
  • python >= 3.7

The following script shows the complete process of building the HeteroCL-MLIR dialect and connecting it with the HeteroCL frontend. It may take about 10 minutes to install the LLVM package, depending on your internet connection and hardware resources. If you are a HeteroCL developer, please refer to the guide in the HCL-MLIR repository and build the dialect with the Python binding from source.

git clone https://github.com/cornell-zhang/heterocl.git heterocl-mlir
cd heterocl-mlir
git submodule update --init --recursive
pip install . -v
# export LLVM path
export LLVM_BUILD_DIR=$(pwd)/hcl-dialect/externals/llvm-project/build
export PATH=${LLVM_BUILD_DIR}/bin:${PATH}

To verify HeteroCL is installed correctly, you can run the following test.

python3 -m pytest tests

Related Publications

Related Work

Contributing to HeteroCL

Coding Style (Python)

We follow the official Python coding style and use the NumPy docstring style. We use Black to format Python code and PyLint to lint it.

Coding Style (C and C++)

We follow Google coding style. Please refer to the hcl-dialect repository for more details.


HeteroCL's Issues

`reuse_at` fails to create multiple buffers for different outputs

A reduced test case is shown below. HeteroCL returns a Segmentation Fault with no extra information.

    import heterocl as hcl
    import numpy as np

    hcl.init()

    A = hcl.placeholder((10, 10))
    B = hcl.placeholder((10, 8))
    C = hcl.placeholder((8, 10))

    def kernel(a, b, c):

        @hcl.def_([(10,10), (10,8), (8,10)])
        def stencil(A, B, C):
            hcl.update(B, lambda y, x: A[y, x] + 2*A[y, x+1] + 3*A[y, x+2])
            hcl.update(C, lambda y, x: A[y, x] + 3*A[y+1, x] + 5*A[y+2, x])

        stencil(a, b, c)

    s = hcl.create_schedule([A, B, C], kernel)

    k = kernel.stencil
    RB1 = s.reuse_at(k.A, s[k], k.axis[1])
    RB2 = s.reuse_at(k.A, s[k], k.axis[0])
    f = hcl.build(s)

    np_A = np.random.randint(0, 10, size=(10, 10))
    np_B = np.zeros((10, 8), dtype="int")
    np_C = np.zeros((8, 10), dtype="int")

    for y in range(0, 10):
        for x in range(0, 8):
            np_B[y][x] = np_A[y][x]*1 + np_A[y][x+1]*2 + np_A[y][x+2]*3
            np_C[x][y] = np_A[x][y]*1 + np_A[x+1][y]*3 + np_A[x+2][y]*5

    hcl_A = hcl.asarray(np_A)
    hcl_B = hcl.asarray(np.zeros((10, 8), dtype="int"))
    hcl_C = hcl.asarray(np.zeros((8, 10), dtype="int"))

    print(hcl.lower(s))

    f(hcl_A, hcl_B, hcl_C)

    ret_B = hcl_B.asnumpy()
    ret_C = hcl_C.asnumpy()
    assert np.array_equal(np_B, ret_B)
    assert np.array_equal(np_C, ret_C)

terminate called after throwing an instance of 'std::out_of_range'

The error occurs when the user tries to create an Allocate statement without specifying the storage scope (some other issues can lead to the same error message).

The complete error message from HeteroCL is too ambiguous to pinpoint the issue:

terminate called after throwing an instance of 'std::out_of_range'
  what():  _Map_base::at
Aborted

The error is caused by looking up a nonexistent entry in the map:

const Variable* buffer = op->buffer_var.as<Variable>();
std::string scope = alloc_storage_scope_.at(buffer);

Adding a check before the lookup can provide a more detailed error message:

CHECK(alloc_storage_scope_.find(buffer) != alloc_storage_scope_.end());

Backend infrastructure for code generation

Since we intend to make HeteroCL a standalone compilation framework, we're going to build our own backend infrastructure for code generation from Halide IR. I first list some necessary tasks here; they can definitely be refined later.

  • Interface
    The backend infrastructure and the HeteroCL frontend can be developed in parallel, but we need to integrate them eventually. Thus, we first have to determine the interface between the HeteroCL frontend and the backend. To be more specific, the connection between the frontend output and the backend input must be straightforward.

  • Halide IR visitor base class with a supported backend language table
    To facilitate future development and collaboration, it would be good to have a common Halide IR visitor base class. In addition to the base class, we also need to maintain a table listing all supported backend languages. As a result, everyone will be able to develop and experiment with new backend language support easily.

  • C code generation
    Since we plan to generate several C-based backend languages (e.g., HLS C, Merlin C, and even OpenCL for Intel platforms), a common C code generator would be helpful.

  • System integration
    The final step is putting it all together.

Also, the following specialized code generation tasks can be split into separate issues later on.

  • HLS C code generation
  • Merlin C code generation
  • OpenCL code generation
  • Stencil DSL code generation
  • Systolic array IR code generation

Integer division truncation differs from Halide

I just found that HeteroCL's integer division truncates toward zero (-3 / 4 = 0), while Halide truncates toward the lower integer (-3 / 4 = -1). This causes lots of problems when comparing results between Halide and HeteroCL programs, because if a negative integer division occurs anywhere in the computation, the results cannot match.
So I'm wondering if there's a way to change this behavior to truncation toward the lower integer?
Thanks!
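For reference, the two conventions disagree only when exactly one operand is negative and the division is inexact. A minimal Python sketch of recovering Halide-style floor division from truncating division:

def floor_div(a, b):
    q = int(a / b)  # truncating division (rounds toward zero), as in HeteroCL
    # a % b == 0 exactly when b divides a, regardless of rounding convention
    if a % b != 0 and (a < 0) != (b < 0):
        q -= 1  # adjust toward negative infinity, matching Halide
    return q

assert floor_div(-3, 4) == -1  # Halide behavior
assert int(-3 / 4) == 0        # HeteroCL (truncating) behavior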

Support declarative programming APIs in HeteroCL module

HeteroCL does not have complete support for declarative programming APIs in HeteroCL module function definitions. Using declarative APIs in a HeteroCL module results in errors in the Storage Flatten IR pass (i.e., it fails to find the correct buffers for function argument tensors):

heterocl.tvm._ffi.base.TVMError: [19:50:35] src/pass/storage_flatten.cc:413: Check failed: buf_map_.count(key) Cannot find buffer of placeholder(k.A, 0x562ffd06c000) value=0

Stack trace returned 10 entries:
[bt] (0) heterocl/tvm/lib/libhcl.so(dmlc::StackTrace()+0x40) [0x7f346fcedb30]
[bt] (1) heterocl/tvm/lib/libhcl.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x2b) [0x7f346fcee2bb]
[bt] (2) heterocl/tvm/lib/libhcl.so(TVM::ir::StorageFlattener::HandleBufferBindScope(Halide::Internal::AttrStmt const*)+0x3d7) [0x7f346fdf7cd7]
[bt] (3) heterocl/tvm/lib/libhcl.so(TVM::ir::StorageFlattener::Mutate_(Halide::Internal::AttrStmt const*, Halide::Internal::Stmt const&)+0x926) [0x7f346fdfc046]
[bt] (4) heterocl/tvm/lib/libhcl.so(+0x945bf0) [0x7f346fd93bf0]
[bt] (5) heterocl/tvm/lib/libhcl.so(std::_Function_handler<Halide::Internal::Stmt (TVM::NodeRef const&, Halide::Internal::Stmt const&, TVM::ir::IRMutator*), TVM::IRFunctor<Halide::Internal::Stmt (TVM::NodeRef const&, Halide::Internal::Stmt const&, TVM::ir::IRMutator*)>::set_dispatch<Halide::Internal::AttrStmt>(std::function<Halide::Internal::Stmt (Halide::Internal::AttrStmt const*, Halide::Internal::Stmt const&, TVM::ir::IRMutator*)>)::{lambda(TVM::NodeRef const&, Halide::Internal::Stmt const&, TVM::ir::IRMutator*)#1}>::_M_invoke(std::_Any_data const&, TVM::NodeRef const&, Halide::Internal::Stmt const&, TVM::ir::IRMutator*&&)+0x30) [0x7f346fd9e590]
[bt] (6) heterocl/tvm/lib/libhcl.so(TVM::IRFunctor<Halide::Internal::Stmt (TVM::NodeRef const&, Halide::Internal::Stmt const&, TVM::ir::IRMutator*)>::operator()(TVM::NodeRef const&, Halide::Internal::Stmt const&, TVM::ir::IRMutator*) const+0x123) [0x7f346fcee7f3]
[bt] (7) heterocl/tvm/lib/libhcl.so(TVM::ir::IRMutator::Mutate(Halide::Internal::Stmt)+0x49) [0x7f346fcee8e9]
[bt] (8) heterocl/tvm/lib/libhcl.so(TVM::ir::StorageFlattener::Mutate_(Halide::Internal::Realize const*, Halide::Internal::Stmt const&)+0x947) [0x7f346fdf9dd7]
[bt] (9) heterocl/tvm/lib/libhcl.so(+0x945e10) [0x7f346fd93e10]

A simple example to reproduce the error:

    import heterocl as hcl

    hcl.init()
    A = hcl.placeholder((10, 10))
    B = hcl.placeholder((10, 10))

    def kernel(A, B):
        @hcl.def_([(10, 10), (10, 10)])
        def k(A, B):
            hcl.update(B, lambda y, x: A[y, x] + 1)
        k(A, B)

    s = hcl.create_schedule([A, B], kernel)
    f = hcl.build(s)

HostDeviceSplitter doesn't work

When implementing the code generation, I found several points:

  1. As far as I know, the HeteroCL programming model doesn't have a "bind" API for users to specify thread binding.

  2. Without a specified thread binding, the user-written compute function will first be marked as container.LoweredFunc.MixedFunc when the function is lowered.

  3. For the mixed function, build_module.py:build will use HostDeviceSplitter to split the kernel part into a new function.

  4. HostDeviceSplitter (tvm/src/pass/split_host_device.cc) actually splits out the kernel part by checking whether the function has a thread binding attribute.

  5. As a result, all HeteroCL functions are identified as host functions and will not be passed to the kernel code generation.

On the other hand, it does not seem proper to force kernel code generation from the mixed function. I've tried it and got lots of errors, such as "unknown field code" and "Unresolved intrinsic tvm_call_packed with return type int32".

HeteroCL Roadmap 0.2

API:

  • Device placement interface: which part of code should go to FPGA
  • Module interface: users should be able to define a module for reuse

Language:

  • Struct data type: combine different variables into a struct
  • Fixed-point data type with negative fractional/decimal bits
  • Long bitwidth bit-accurate data types: 128 bits

Scheduling:

  • Domain-specific hardware: systolic array & stencil analysis
  • Dataflow support
  • Customized memory hierarchy support

Benchmarks:

  • Optical flow (Rosetta)
  • Sorting (HLS Book)
  • FIR/IIR (HLS Book)

Miscellaneous:

  • Build a Lint test for coding style checking
  • Add concrete tutorials
  • Unit test for each function

Add annotation for hcl.local-generated arrays

Although HeteroCL allows users to use hcl.local() to create a scalar variable, it's not really a scalar variable in the IR due to an IR limitation. As a result, the IR will contain an array of length 1 as well as a for loop with trip count 1 to initialize the array. For example:

hcl.local(0, "A")
-->
int A[1];
for (int i = 0; i < 1; ++i) {
    A[i] = 0;
}

This coding style affects static analysis and should be avoided. To generate scalar variables, it would be better to annotate the IR node (specifically, the Allocate node) when the array we're declaring is actually a scalar variable. If it is, the annotation should also record its initial value (e.g., 0 in the above example).

Tools version

Hi

I would like to know if there are any requirements on the versions of the Vivado and Intel tools used with HeteroCL?

Thanks for the answer and keep up the good work!

Data Type Error if I don't use hcl.init(dtype)

The error message is:

heterocl.tvm._ffi.base.TVMError: [13:55:02] src/codegen/llvm/llvm_module.cc:59: Check failed: ret == 0 (-1 vs. 0) Assert fail: ((((tvm_struct_get(arg0, 0, 5) == (uint8)2) && (tvm_struct_get(arg0, 0, 6) == (uint8)32)) && (tvm_struct_get(arg0, 0, 8) == (uint8)0)) && (tvm_struct_get(arg0, 0, 7) == (uint8)1)), arg0.dtype is expected to be float32

You can reproduce it by the code below:

import heterocl as hcl
import numpy as np

hcl.init()
input = hcl.placeholder((1280, 768, 3), "input", dtype=hcl.Float())  # input shape = output shape

def bug(input_tensor):
    return hcl.compute(input_tensor.shape, lambda x, y, c: input_tensor[x, y, c], dtype=hcl.Float())

f = hcl.build(hcl.create_schedule([input], bug))
f(hcl.asarray(np.zeros(input.shape, dtype=float)),
  hcl.asarray(np.zeros(input.shape, dtype=float)))

I don't know what arg0 is, because I think all the data types are set to hcl.Float(), and this is confirmed by the printed IR.
The error can be resolved by replacing hcl.init() with hcl.init(hcl.Float()), but I don't know where the wrong data type comes from.

Thanks!

Segmentation fault (core dumped): hcl.compute with placeholders of more than 2 dimensions

I'm trying to write a conv_layer using HeteroCL, but I hit a memory error that I cannot resolve.

import heterocl as hcl

input = hcl.placeholder((4,32,10,10), "input")
filter = hcl.placeholder((32,32,3,3), "filter")
bias = hcl.placeholder((32,), "bias")
print(bias.shape)

def conv_layer(input, filter, bias):
    def compare_with_zero(a):
        with hcl.if_(a > 0):
            hcl.return_(a)
        with hcl.else_():
            hcl.return_(0)
    output = hcl.compute(input.shape, lambda n,c,h,w: compare_with_zero(input[n,c,h,w]), "output")
    return output
s = hcl.create_schedule([input, filter, bias], conv_layer)
print(hcl.lower(s))
f = hcl.build(s)
print('done')

The above program fails at f = hcl.build(s).
When the input placeholder has 2 dimensions, the program works. But when the dimension becomes 3 or 4, it crashes with Segmentation Fault (core dumped).

Testing does not compile on CircleCI

The tests on CircleCI fail with the following error when installing TVM for hlib testing. It says the compiler cannot find the config file for VTA. However, I can run through the script on our servers and everything works fine.

/usr/local/bin/python: can't open file 'vta/vta-hw/config/vta_config.py': [Errno 2] No such file or directory
-- Build VTA runtime with target: 
/usr/local/bin/python: can't open file 'vta/vta-hw/config/vta_config.py': [Errno 2] No such file or directory
-- Build with contrib.sort
-- Build with contrib.hybriddump

The complete error message: https://circleci.com/gh/cornell-zhang/heterocl/899?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link

Scheduling function support

Instructions

Currently, every HeteroCL API is implemented with tvm.extern, which does not carry loop information. We need to add this information in order to support scheduling. There are several things we need to do.

  1. Collect the loops inside a body. (This can be done either in C++ or Python)
  2. If the above is done in Python, we need to extend the current tvm.extern interface to allow attaching axis information.
  3. Our goal is to support loop-related scheduling functions, such as unrolling and tiling. We also need to support compute_at.

Supported Scheduling Functions

  • compute_at
  • parallel
  • store_at
  • vectorize
  • unroll
  • pipeline
  • fuse
  • split
  • tile
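For reference, a minimal sketch of how several of these primitives compose on a trivial stage (exact argument details may differ slightly from the current API):

import heterocl as hcl

hcl.init()
A = hcl.placeholder((64, 64), "A")

def kernel(A):
    return hcl.compute(A.shape, lambda y, x: A[y, x] + 1, "B")

s = hcl.create_schedule([A], kernel)
B = kernel.B
xo, xi = s[B].split(B.axis[1], factor=8)  # split the inner loop by 8
s[B].unroll(xi)                           # unroll the split inner loop
s[B].pipeline(B.axis[0])                  # pipeline the outer loop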

Typing Rules in HeteroCL Arithmetic

Do we have documentation about the typing rules in HeteroCL arithmetic? For example, it seems that the result of multiplying two int16 numbers is int32, which is inconsistent with C++ behavior. It makes sense to extend the bitwidth and avoid overflow, but sometimes it can result in unnecessary hardware resource usage and thus may not always be the expected behavior. It might be better to allow users to turn this feature on and off, or even just document it properly and provide a way to bypass the bitwidth extension.

cc @tonyjie
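To make the described behavior concrete, a minimal sketch that exhibits the widening; printing the lowered IR exposes the inferred intermediate types:

import heterocl as hcl

hcl.init()
a = hcl.placeholder((10,), "a", dtype=hcl.Int(16))
b = hcl.placeholder((10,), "b", dtype=hcl.Int(16))

def kernel(a, b):
    # the product is inferred wider than int16 to avoid overflow
    return hcl.compute(a.shape, lambda x: a[x] * b[x], "c")

s = hcl.create_schedule([a, b], kernel)
print(hcl.lower(s))  # inspect the widened intermediate types in the IR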

Error when calling 'hcl.power(2.7, _input[y, x])'

An error happens when I try to use hcl.power(2.71828, _input[y, x]).
The error message is 'float' object has no attribute 'dtype', so it seems that I can't use a plain number as the input argument.
I can only bypass this problem by writing it like this:

__expo = hcl.local(init = 2.71828, dtype = hcl.Float())
......
hcl.power(__expo[0], _input[y, x])

But the generated HeteroCL IR causes some conflicts with what I'm doing in my project now...
So I'm wondering, is there a way to use a plain number as an input argument of the hcl.power function?

Thanks!

Bit operations for HeteroCL

Bit operations are important for bit-accurate data types. From Xilinx's user guide (https://bit.ly/2GNHc1e) we can get a brief idea of what kinds of bit operations are supported. Currently, HeteroCL supports the bit selection operation on bit-accurate integers; namely, we can get the value of a bit given a variable and an index. The following bit operations need to be supported in the near future.

  • bit selection (assign): assign either 1 or 0 to a bit
  • bit slicing (get/assign): get/assign a slice of bits from a number

To begin with, we can take a look at how the current bit selection works in HeteroCL. To implement this, please fork the current repository or create a branch. Do not directly edit the master branch.

First, bit selection corresponds to "subscription" in Python AST (e.g. a[2]). You can refer to https://bit.ly/2Ho8NHe. Here, we first get the variable and index, which are "a" and "2" in "a[2]", respectively. If it is a bit operation, a tuple will be returned. The first value in the tuple is always 0 and the second value is the index for bit selection, i.e., 2. After that, we make a corresponding Halide IR node (GetBit) for this.

Second, this Halide IR node is not defined originally, so we need to add a new one. To do so, please check the attached file (Halide IR.pdf).

Finally, we need to lower the IR node to LLVM code, so we need to add the corresponding code generation. This link shows how bit operations can be done:
https://stackoverflow.com/questions/47981/how-do-you-set-clear-and-toggle-a-single-bit
You just need to translate them into LLVM instructions. Bit slicing might be a little more challenging.
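For reference, the shift-and-mask identities from the linked answer, written out in Python; the LLVM code generation would emit analogous instructions:

def get_bit(v, i):
    # bit selection (get): the value of bit i
    return (v >> i) & 1

def set_bit(v, i, b):
    # bit selection (assign): set bit i to b (0 or 1)
    return (v & ~(1 << i)) | (b << i)

def get_slice(v, hi, lo):
    # bit slicing (get): bits hi..lo, inclusive
    return (v >> lo) & ((1 << (hi - lo + 1)) - 1)

def set_slice(v, hi, lo, bits):
    # bit slicing (assign): replace bits hi..lo with `bits`
    mask = ((1 << (hi - lo + 1)) - 1) << lo
    return (v & ~mask) | ((bits << lo) & mask)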

Problem building

I followed the instructions from http://heterocl.csl.cornell.edu/doc/installation.html.
During the build process, I got the following error (I am using Ubuntu 18):

/tmp/easy_install-Jc5guN/kiwisolver-1.2.0/.eggs/cppy-1.1.0-py2.7.egg/cppy/include/cppy/ptr.h:292:2: error: ‘Py_hash_t’ does not name a type
  Py_hash_t hash() const
  ^~~~~~~~~

In file included from py/kiwisolver.cpp:10:0:
py/types.h:33:12: error: ‘PyType_Spec’ does not name a type; did you mean ‘PyType_Type’?
     static PyType_Spec TypeObject_Spec;
            ^~~~~~~~~~~
            PyType_Type

py/types.h:47:12: error: ‘PyType_Spec’ does not name a type; did you mean ‘PyType_Type’?
     static PyType_Spec TypeObject_Spec;
            ^~~~~~~~~~~
            PyType_Type

py/types.h:66:12: error: ‘PyType_Spec’ does not name a type; did you mean ‘PyType_Type’?
     static PyType_Spec TypeObject_Spec;
            ^~~~~~~~~~~
            PyType_Type

py/types.h:85:12: error: ‘PyType_Spec’ does not name a type; did you mean ‘PyType_Type’?
     static PyType_Spec TypeObject_Spec;
            ^~~~~~~~~~~
            PyType_Type

py/types.h:104:12: error: ‘PyType_Spec’ does not name a type; did you mean ‘PyType_Type’?
     static PyType_Spec TypeObject_Spec;
            ^~~~~~~~~~~
            PyType_Type
py/types.h:122:12: error: ‘PyType_Spec’ does not name a type; did you mean ‘PyType_Type’?
     static PyType_Spec TypeObject_Spec;
            ^~~~~~~~~~~
            PyType_Type
py/kiwisolver.cpp:163:1: error: ‘PyModuleDef_Slot’ does not name a type; did you mean ‘PyModule_GetDict’?
 PyModuleDef_Slot kiwisolver_slots[] = {
 ^~~~~~~~~~~~~~~~
 PyModule_GetDict
py/kiwisolver.cpp:169:20: error: variable ‘{anonymous}::PyModuleDef {anonymous}::moduledef’ has initializer but incomplete type
 struct PyModuleDef moduledef = {
                    ^~~~~~~~~

hcl.select(10>1, 0, 1) Error: First argument to Select is not a bool: int32

I'm using the hcl.select(cond, true_val, false_val) function and just encountered a problem.
When comparing two constant numbers, e.g. 10 > 1, the result is supposed to be a bool, but the hcl.select() function doesn't think so.
It can be reproduced as follows:

import heterocl as hcl 
hcl.init()
def top(input, ):
    blur_x = hcl.compute((640, (480 + 2)), lambda x, y: 0, name = "blur_x", dtype = hcl.UInt(bits = 16))
    with hcl.Stage("blur_x"):
        with hcl.for_(0, 640, name = "x") as x:
            with hcl.for_(0, (480 + 2), name = "y") as y:
                blur_x[x, y] = ((input[(x + 2), y] + input[x, y] + input[(x + 1), y])/3) + hcl.select(10 > 1, 10, 1)
    return blur_x
input = hcl.placeholder((642, 482, ), name = "input", dtype = hcl.UInt(bits = 16))
s = hcl.create_schedule([input, ], top)
print(hcl.lower(s))
# f = hcl.build(s)

The error message is:

heterocl.tvm._ffi.base.TVMError: [01:36:00] src/ir/IR.cpp:140: Check failed: condition.type().is_bool() First argument to Select is not a bool: int32

If I change the condition in hcl.select from the constant 10 to a variable y, namely hcl.select(y > 1, 10, 1), it works. But based on Python syntax, the result of a comparison operation should be of bool type, right?

SegFault from `reuse_at` in HeteroCL module

A simple minimal test case to reproduce the error.

import heterocl as hcl

def kernel(A, B):
    @hcl.def_([(10,10), (10,10)])
    def test(A, B):
        hcl.update(B, lambda y, x: A[y,x] + A[y,x+1]+ A[y,x+2])
    test(A, B)

A = hcl.placeholder((10, 10))
B = hcl.placeholder((10, 10))
s = hcl.create_schedule([A, B], kernel)
s.reuse_at(kernel.test.A, s[kernel.test], kernel.test.axis[0])

The reuse_at primitive fails to locate the target buffer to be reused.

Printing objects in HeteroCL

HeteroCL currently cannot print intermediate values, which makes debugging hard. It would be good to have something like hcl.print. For example,

A = hcl.placeholder((10,))
B = hcl.compute(A.shape, lambda x: A[x]+1)
hcl.print(B)     # print the entire tensor
hcl.print(B[0])  # print a single element

Fixed-point support for different quantization modes and overflow modes

HeteroCL now supports basic bit-accurate integers and fixed-point numbers. The way I implemented it is to extend Halide's data type. Here we can see that a data type has four attributes: code, bits, lanes, and fracs. The first one represents the data type; namely, it can be int, uint, float, or handle. Here we only need to focus on int and uint. With bits and fracs, we can now create an (unsigned) integer with bits total bits, of which fracs bits hold the fractional part. For example, if fracs is zero, it is a normal bit-accurate integer. Otherwise, it is a fixed-point number.

To access these fields, you can look into this file. Here, Type is a wrapper for the base Halide data type. Note that in addition to the Type struct on Line 296, I also define two enums for quantization mode and overflow mode. For more details on their meaning, you can check here (page 640). Currently, the quantization mode and overflow mode are only defined but not used. This is what you are going to implement. Basically, we need these modes when we assign a fixed-point number to another one.

For the Python user interface, the wrapper Type is defined here. Namely, in the samples, we are all going to use these to represent a data type. These wrapper data types will be translated to a Python string. For a fixed-point data type, it looks like fixed{a}_{b}, where a is for bits and b is for fracs. One example is fixed10_5. What you need to do here is to also include the quantization mode and overflow mode in this representation. You can implement it in any way; it is up to you.

There is another place you need to look at. This file defines the behavior when we take in a NumPy array (Line 157) and when we translate it back (Line 223). Different quantization modes will need different behaviors here.

Finally, you can check the commits below for what I have changed; it should make things easier to understand. Basically, I don't think you need to go down to the LLVM level, but I'm not sure.

Commit 1
Commit 2
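To make the two modes concrete, here is a rough Python model of assigning a real value to a fixed{bits}_{fracs} type. The mode names are illustrative placeholders, not HeteroCL's actual identifiers:

import math

def to_fixed(x, bits=10, fracs=5, quant="trn", overflow="wrap"):
    # model of assigning x to a fixed{bits}_{fracs} value (illustrative only)
    scaled = x * (1 << fracs)
    if quant == "rnd":                 # round to nearest
        q = int(math.floor(scaled + 0.5))
    else:                              # "trn": truncate toward negative infinity
        q = int(math.floor(scaled))
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    if overflow == "sat":              # saturate at the representable range
        q = max(lo, min(hi, q))
    else:                              # "wrap": keep only the low `bits` bits
        q &= (1 << bits) - 1
        if q > hi:
            q -= 1 << bits             # reinterpret as signed
    return q / (1 << fracs)

print(to_fixed(1.37, 10, 5, "rnd", "sat"))  # -> 1.375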

Support `StreamExpr` in reuse pattern detection

For now reuse_at can only handle reuse patterns found in Load expressions. However, the streaming IR pass replaces some potentially reusable Load expressions with StreamExpr expressions, making the generate_reuse_buffer IR pass crash with a SegFault.

An example of generated Intel OpenCL code with a streaming channel is shown below. The input image is replaced with a StreamExpr in this case (i.e., read_channel_intel(c_buf_4)); applying reuse_at on this input results in a SegFault.

__kernel void grad_weight_y() {
    float g_f[7];
    g_f[0] = 7.550000e-02f;
    g_f[1] = 1.330000e-01f;
    g_f[2] = 1.869000e-01f;
    g_f[3] = 2.903000e-01f;
    g_f[4] = 1.869000e-01f;
    g_f[5] = 1.330000e-01f;
    g_f[6] = 7.550000e-02f;
    for (int32_t y = 0; y < 430; ++y) {
      for (int32_t x = 0; x < 1024; ++x) {
        float reducer2;
        reducer2 = 0.000000e+00f;
        for (int32_t rdx = 0; rdx < 7; ++rdx) {
          reducer2 = ((read_channel_intel(c_buf_4) * g_f[rdx]) + reducer2);
        }
        grad_weight_y_y_filt[((x + (y * 1024)) + 3072)] = reducer2;
      }
    }
    #pragma ii 1
    for (int32_t buf_1 = 0; buf_1 < 1024; ++buf_1) {
      for (int32_t buf_0 = 0; buf_0 < 436; ++buf_0) {
        write_channel_intel(c_buf_5, grad_weight_y_y_filt[(buf_1 + (buf_0 * 1024))]);
      }
    }
}

Cannot apply parallel primitive in HeteroCL module

The issue occurs in the digit recognition example with the .parallel() primitive. I was trying to use a kernel function (i.e., an HCL module) to update knn_mat instead of calling hcl.compute, and to perform scheduling on the itervars inside the kernel function. The program after modification looks like:

def knn(*placeholders):

    @hcl.def_([(10, 1800), (10, 3)])
    def update_knn(dist, knn_mat):
        with hcl.for_(0, 10, name="i") as i:
            with hcl.for_(0, 1800, name="j") as j:
                max_id = hcl.scalar(0, "max_id")
                with hcl.for_(0, 3, name="k") as k:
                    with hcl.if_(knn_mat[i][k] > knn_mat[i][max_id.v]):
                        max_id.v = k
                with hcl.if_(dist[i][j] < knn_mat[i][max_id.v]):
                    knn_mat[i][max_id.v] = dist[i][j]

    update_knn(dist, knn_mat)

And the scheduling is performed as in the following snippet:

    knn_update = knn.update_knn
    s[knn_update].reorder(knn_update.axis[0], knn_update.axis[1])
    # ISSUE: this primitive will lead to segmentation fault
    # s[knn_update].parallel(knn_update.axis[1])
    s[knn_update].pipeline(knn_update.axis[0])

All other scheduling primitives work well, but when I call .parallel(), the program errors out with a segmentation fault.

Fix Sample Cases

The current samples are not working with the latest refactor. I list the broken cases here to track the progress.

  • Digitrec
  • KMeans
  • Smith-Waterman
  • LeNet
  • GEMM

Please fix LeNet and GEMM first so that we can start implementing the systolic array interface.
Thanks.

Create scheme from MXNet model

Is there a way to create a scheme from an imported MXNet model without having to define it?

Something like:

sym, arg, aux = onnx_mxnet.import_model(ONNX_MODEL_PATH)

hcl.create_scheme(arg, sym)

or do I have to define the network architecture from scratch as in the LeNet inference tutorial?

Is there a struct-like data type in HeteroCL?

I'm working on generating HeteroCL code from Halide IR, and I found it's hard to compute two values in one for loop in HeteroCL. To be more specific, the following is a piece of Halide code:

RDom search(0, 64);
offset(x, y) = {cast<int8_t>(0), cast<uint16_t>(65535)};
offset(x, y) = {select(SAD(x, y, search.x) < offset(x, y)[1],
                      cast<int8_t>(search.x),
                      offset(x, y)[0]),
                     min(SAD(x, y, search.x), offset(x, y)[1])};

This code does an argmin computation and needs some iteration. It has to compute the two variables offset(x, y)[0] and offset(x, y)[1] in one for loop, and they cannot be separated.

I know I could add a dimension (of size 2) to store the values and do a similar iterative computation (I haven't tried whether it works), but doing so changes the shape of the output. Also, it's not the original meaning of the code.

So I'm wondering: is there a feature in HeteroCL to create a struct or tuple-like data type? I didn't manage to find one.

Thanks!

HeteroCL data struct results in incomplete generated code

  1. I was trying to write some imperative code with a data struct, where we multiply two values from one struct and assign the product back to another struct:

stype = hcl.Struct({"fa": hcl.Int(), "fb": hcl.Fixed(12, 8)})
t = hcl.scalar(tensor[r, c], dtype=stype)
r = hcl.scalar(0, dtype=stype)
r.v.fa = t.v.fa * t.v.fb

The generated code fails to capture the computation r.v.fa = t.v.fa * t.v.fb and outputs an incomplete converter for it, like this:

ap_uint<32> _converter2;
_converter2(      _converter3(

The workaround is to write the program in the following way.

stype = hcl.Struct({"fa": hcl.Int(), "fb": hcl.Fixed(12, 8)})
t = hcl.scalar(tensor[r, c], dtype=stype)
r = hcl.scalar(0, dtype=stype)
x = hcl.scalar(t.v.fa)
y = hcl.scalar(t.v.fb)
r.v.fa = x.v * y.v
  2. The generated code has some redundant bit-slicing operations, which confuse Vivado HLS in streaming channel analysis and lead to pre-synthesis failure.
ap_uint<32> _converter;
// Vivado HLS complains grad_x has been consumed more than once
_converter(31, 0) = grad_x[(x + (y * 1024))](31,0);

// We have to manually remove the bit-slicing operation 
// so that grad_x can be implemented as a FIFO
_converter(31, 0) = grad_x[(x + (y * 1024))];

`reuse_at` returns wrong result with `hcl.select`

Here I want to apply reuse on one input along two different dimensions:

import heterocl as hcl
import numpy as np

def test_reuse_select():
    hcl.init()
    A = hcl.placeholder((10, 10, 2))
    B = hcl.compute((10, 8, 2), lambda y, x, c:
            hcl.select(c==0, A[y, x, c]*1 + A[y, x+1, c]*1 + A[y, x+2, c]*1,
                             A[y, x, c]*3 + A[y, x+1, c]*5 + A[y, x+2, c]*6))
    s = hcl.create_schedule([A, B])
    RB = s.reuse_at(A, s[B], B.axis[1])
    f = hcl.build(s)

    np_A = np.random.randint(0, 10, size=(10, 10, 2))
    np_B = np.zeros((10, 8, 2), dtype="int")
    np_C = np.zeros((10, 8, 2), dtype="int")

    for y in range(0, 10):
        for x in range(0, 8):
            np_C[y][x][0] = np_A[y][x][0]*1 + np_A[y][x+1][0]*1 + np_A[y][x+2][0]*1
            np_C[y][x][1] = np_A[y][x][1]*3 + np_A[y][x+1][1]*5 + np_A[y][x+2][1]*6

    hcl_A = hcl.asarray(np_A)
    hcl_B = hcl.asarray(np_B)
    print(hcl.lower(s))

    f(hcl_A, hcl_B)

    np_B = hcl_B.asnumpy()

    assert np.array_equal(np_B, np_C)

The result does not match the ground truth when the reuse_at schedule is applied. The IR is as follows:

produce compute0 {
  // attr [placeholder0.reuse] storage_scope = "global"
  allocate placeholder0.reuse[int32 * 1 * 3 * 2]
  // attr [0] extern_scope = 0
  for (y, 0, 10) {
    for (x.reuse, 0, 10) {
      for (c, 0, 2) {
        produce placeholder0.reuse {
          for (placeholder0.1, 0, 2) {
            placeholder0.reuse[(c + (placeholder0.1*3))] = placeholder0.reuse[((c + (placeholder0.1*3)) + 3)]
          }
          placeholder0.reuse[(c + 6)] = placeholder0[((c + (x.reuse*10)) + (y*20))]
        }
        if ((2 <= x.reuse)) {
          compute0[(((c + (x.reuse*2)) + (y*16)) + -4)] = int32(tvm_if_then_else((c == 0), (int34((int33(placeholder0.reuse[c]) + int33(placeholder0.reuse[(c + 3)]))) + int34(placeholder0.reuse[(c + 6)])), (int34((int33((placeholder0.reuse[c]*3)) + int33((placeholder0.reuse[(c + 3)]*5)))) + int34((placeholder0.reuse[(c + 6)]*6)))))
        }
      }
    }
  }
}

I am also wondering if it is possible to modify the reuse_at primitive to make it work on HeteroCL modules. Generating a reuse buffer along with .to() takes a lot of effort (I need to figure out which dimension offers reusability, and also repeat much of what reuse_at does), and I am afraid it cannot be done very soon. Reusing the reuse_at primitive and making it compatible with HeteroCL modules might be a better idea.

[Roadmap] HeteroCL Release Note v0.2

Release Note

We are happy to announce that HeteroCL is now an open-source project. HeteroCL is currently both Python 2 and Python 3 compatible. Below we list the currently supported features and their related links (tests and/or documentation).

General API

  • hcl.init (initialize a HeteroCL environment)
  • hcl.placeholder (create an input placeholder)
  • hcl.create_schedule (create a schedule for hardware customization)
  • hcl.lower (lower the program to IR for investigation)
  • hcl.build (build the program)
  • Related links: test, tutorial, doc

Data Type

  • hcl.Int, hcl.UInt (can have arbitrary bitwidth up to 255 bits)
  • hcl.Fixed, hcl.UFixed (can have arbitrary bitwidth up to 255 bits)
  • hcl.Float (can be 32 or 64 bits)
  • Related links: test, tutorial, doc

Imperative DSL

  • hcl.and_, hcl.or_
  • hcl.if_, hcl.elif_, hcl.else_
  • hcl.for_, hcl.while_, hcl.break_
  • bit operations (get bit, set bit, get slice, set slice)
  • hcl.def_, hcl.return_ (a custom-defined HeteroCL module)
  • Related links: general test, def test, tutorial, doc

Compute API

  • hcl.compute (compute a new tensor according to the given formula)
  • hcl.update (update an existing tensor according to the given formula)
  • hcl.mutate (mutatively update an existing tensor)
  • hcl.reduce_axis (create a reduce axis for reduction operation)
  • hcl.sum (perform a summation on the given axis)
  • hcl.pack, hcl.unpack (pack/unpack a tensor to larger/lower bitwidth)
  • hcl.Stage (create a stage that contains user-defined operations)
  • Related links: general test, reduction test, pack/unpack test, stage test, tutorial, doc
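As a small sketch combining several of these APIs (a row-sum reduction over a 2D tensor):

import heterocl as hcl

hcl.init()
A = hcl.placeholder((10, 10), "A")

def kernel(A):
    r = hcl.reduce_axis(0, 10, "r")
    # row sums: reduce over the second dimension with hcl.sum
    return hcl.compute((10,), lambda x: hcl.sum(A[x, r], axis=r), "B")

s = hcl.create_schedule([A], kernel)
f = hcl.build(s)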

Compute Customization

Data Type API & Data Type Customization

  • hcl.create_scheme (create a scheme for data type customization)
  • hcl.create_schedule_from_scheme
  • hcl.downsize (downsize integers to lower bit-width)
  • hcl.quantize (quantize floating-points to fixed-points)
  • Related links: test, tutorial, doc
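A minimal sketch of the scheme-based flow, assuming a kernel with an output stage kernel.B (argument details may differ slightly from the actual API):

s = hcl.create_scheme([A], kernel)
s.quantize([kernel.B], hcl.Fixed(16, 14))   # floating-point -> fixed-point
sch = hcl.create_schedule_from_scheme(s)
f = hcl.build(sch)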

Memory Customization

  • partition, reshape
  • reuse_at (create a reuse buffer provided the input tensor and stage)
  • Related links: test, tutorial, doc
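A minimal sketch of applying these primitives to a stage B that reads a tensor A (the partition signature here is approximate):

s = hcl.create_schedule([A], kernel)
B = kernel.B
s.partition(A, dim=1)                # partition the input array on-chip
RB = s.reuse_at(A, s[B], B.axis[1])  # create a reuse (line) buffer along axis 1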

Back-end Target

  • llvm (default target for CPU)
  • vhls (generate Vivado HLS code)
  • vhls_csim (generate an executable compiled from Vivado HLS code)
  • ihls (generate Intel HLS code)
  • merlinc (generate Merlin C code)
  • soda (generate SODA DSL)
  • soda_xhls (generate Vivado HLS code from SODA DSL)
  • Related links: vhls test, ihls test, merlinc test, soda test, tutorial, SODA tutorial

Stable Samples

[Roadmap] HeteroCL Roadmap v0.3

We welcome any suggestions.

General Improvement

  • device placement
  • add a pass to ML models
  • integrate with auto-tuning tools

Data Type

  • C-like struct
  • arbitrary-precision floating points

Compute Customization

  • vectorization
  • dataflow
  • data streaming

Back-end Target

  • integrate PolySA
  • add back end for RTL codegen
  • add OpenCL backend for Xilinx and Intel HLS

Wrong CPU simulation result with HeteroCL module

I created an HCL reducer to realize a sorting algorithm and wrapped it in a HeteroCL module. The goal is to sort a 10x3 array along axis 1. I checked the IR line by line and did not find any wrong logic inside. But the simulation result is not correct for every row.

The Halide IR:

// attr [sort] storage_scope = "global"
produce sort {
  // attr [0] extern_scope = 0
  def sort(handle64(sort.A[10*3]), handle64(sort.B[10*3])) {
    allocate compute0[int32 * 3]
    for (x, 0, 3) {
      compute0[x] = 1000
    }
    for (x, 0, 10) {
      allocate reducer0[int32 * 3]
      for (args, 0, 3) {
        reducer0[args] = compute0[args]
      }
      for (rdx, 0, 3) {
        for (i, 0, 3) {
          if ((sort.A[(rdx + (x*3))] < reducer0[i])) {
            for (i, 0, ((i + -2)/-1)) {
              reducer0[(2 - i)] = reducer0[(1 - i)]
            }
          }
          reducer0[i] = sort.A[(rdx + (x*3))]
          break
        }
      }
      for (reducer0_i0, 0, 3) {
        sort.B[(reducer0_i0 + (x*3))] = reducer0[reducer0_i0]
      }
    }
  }
}
// attr [sort0] storage_scope = "global"
allocate sort0[int32 * 1]
produce sort0 {
  // attr [0] extern_scope = 0
  sort(A, B)
}

the input array:

[[14 12  6]
 [ 3  7 18]
 [16  2 11]
 [17 15  7]
 [ 4 16  8]
 [ 0 15  7]
 [ 0  0  6]
 [13  1  0]
 [10  6  6]
 [19 19 11]]

And the sorted result is wrong for most rows (for some rows the result is correct). I guess there might be something wrong with memory in the LLVM JIT compilation?

[[   6   12   14]
 [  18 1000 1000]
 [  11   16 1000]
 [   7   15   17]
 [   8   16 1000]
 [   7   15 1000]
 [   6 1000 1000]
 [   0    1   13]
 [   6   10 1000]
 [  11   19 1000]]

unroll factor missing from loop annotations

I am running test_soda.py and found that none of the loops contain any annotations. Without annotations I cannot pass the unroll factor down to the SODA backend. This used to work. I am wondering what happened. Why is it (along with others) disabled in the tests?

Unexpected tensor allocation

When testing the stencil backend, I found that in the IR generated for the Gaussian benchmark, the output tensor is explicitly allocated. I believe this is incorrect because the interface already generates an implicit tensor allocation by calling tvm_struct_get. The blur benchmark works fine.

This unexpected tensor allocation is breaking the code for SODA code generation. More specifically, it invalidates the VarExpr comparison because the newly generated Variable is used in the IR, which is not linked to the interface. This results in incorrect detection of output or local tensors in SODA. As a workaround, I had to compare by name_hint, but it may not work in other situations, as the name suggests.

The IR is printed in the test_soda.py unit test and can be reproduced by running python -m unittest test_soda in heterocl/heterocl/tests.

Add bias_add operator

I can't find a way to implement the bias_add operator.

Let's say I have a tensor with a shape of (1, 8, 28, 28) and another tensor (which contains the biases) with a shape of (8, 1, 1). How could I add the biases to the first tensor?

TVM implements such an operator with C++ but I can't find a way to implement it with the hcl.compute API. Any thoughts?
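For what it's worth, one plausible way to express this with hcl.compute is to broadcast the bias index inside the lambda; a hedged sketch, not a confirmed answer:

import heterocl as hcl

hcl.init()
data = hcl.placeholder((1, 8, 28, 28), "data")
bias = hcl.placeholder((8, 1, 1), "bias")

def bias_add(data, bias):
    # broadcast the per-channel bias over the batch and spatial dimensions
    return hcl.compute(
        data.shape,
        lambda n, c, h, w: data[n, c, h, w] + bias[c, 0, 0],
        "out")

s = hcl.create_schedule([data, bias], bias_add)
f = hcl.build(s)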

reuse_at generates inconsistent index access for streaming channel

An example: we have a receiver function that needs to read data sequentially from the streaming channel, and a sender function, as shown in the following code block, that writes data (i.e., calc_x_gradient.grad_x) into the channel.

We can apply a data reuse schedule on the sender function to exploit data locality, but this leads to index access inconsistency between the sender and receiver sides (i.e., the reader reads data[x + y*1024] while the writer writes data[x + y*1024 - 2]). To avoid the incorrectness introduced by this inconsistency, the streaming inference IR pass generates another set of nested loops to write data into the channel (i.e., c_buf_2.write(calc_x_gradient.grad_x[(buf_1 + (buf_0*1024))])). This approach leads to performance degradation.

def calc_x_gradient(handle64(calc_x_gradient.input_image[436*1024]), handle64(calc_x_gradient.grad_x[436*1024])) {
  for (y, 0, 436) {
    for (x.reuse, 0, 1024) {
      produce calc_x_gradient.input_image.reuse {
        for (calc_x_gradient.input_image.0, 0, 4) {
          calc_x_gradient.input_image.reuse[calc_x_gradient.input_image.0] = calc_x_gradient.input_image.reuse[(calc_x_gradient.input_image.0 + 1)]
        }
        calc_x_gradient.input_image.reuse[4] = calc_x_gradient.input_image[(x.reuse + (y*1024))]
      }
      if ((4 <= x.reuse)) {
        allocate reducer0[float32 * 1]
        reducer0[0] = 0.000000f
        for (rdx, 0, 5) {
          reducer0[0] = ((calc_x_gradient.input_image.reuse[rdx]*float32(g_w[rdx])) + reducer0[0])
        }
        calc_x_gradient.grad_x[((x.reuse + (y*1024)) + -2)] = reducer0[0]
      }
    }
  }
  pipelined (buf_1, 0, 1024) {
    for (buf_0, 0, 436) {
      c_buf_2.write(calc_x_gradient.grad_x[(buf_1 + (buf_0*1024))])
    }
  }
}

As I discussed with Sean, a simple solution is to add an extra if-else statement around the condition block to keep the index access order consistent between receiver and sender.

def calc_x_gradient(handle64(calc_x_gradient.input_image[436*1024]), handle64(calc_x_gradient.grad_x[436*1024])) {
  for (y, 0, 436) {
    for (x.reuse, 0, 1026) {
      produce calc_x_gradient.input_image.reuse {
        for (calc_x_gradient.input_image.0, 0, 4) {
          calc_x_gradient.input_image.reuse[calc_x_gradient.input_image.0] = calc_x_gradient.input_image.reuse[(calc_x_gradient.input_image.0 + 1)]
        }
        calc_x_gradient.input_image.reuse[4] = calc_x_gradient.input_image[(x.reuse + (y*1024))]
      }

      if (2 < x.reuse) {
        c_buf_2.write(0); 
      } elif ((4 <= x.reuse)) {
        allocate reducer0[float32 * 1]
        reducer0[0] = 0.000000f
        for (rdx, 0, 5) {
          reducer0[0] = ((calc_x_gradient.input_image.reuse[rdx]*float32(g_w[rdx])) + reducer0[0])
        }
        c_buf_2.write(reducer0[0]);
      } elif (x.reuse > 1024) {
         c_buf_2.write(0); 
      }
    }
  }
}

HeteroCL Roadmap for Modeling

Check List

  • API
    • var
    • placeholder
    • imperative/declarative compute
    • imperative/declarative update
    • imperative/declarative mutate compute
    • imperative block
    • reduction (e.g., sum, arg_max)
  • Imperative DSL support
    • statement: For, If
    • statement: While, Break
    • expression: List, Dict
  • Bit-accurate data type
    • up-to-64-bit unsigned int
    • up-to-64-bit signed int
    • up-to-64-bit fixed-point
    • up-to-128-bit int
    • up-to-128-bit fixed-point
    • bit selection
    • bit slicing, bit concatenation
    • resize interface
  • Other features
    • code placement
    • hardware-specific type system

multi-stage kernel cannot be unrolled

The samples/seidel/seidel.py sample isn't working as expected. To properly generate SODA code, all stencil loops must be unrolled with the same factor; however, only the second stage (output) of seidel can be unrolled; it will crash if one tries to unroll the first stage (tmp). This can be reproduced by uncommenting this line.

SW Simulation failed using RTL IP integration

Vivado HLS provides users with a blackbox option when adding source files to the working project. Users can add a JSON file specifying the RTL kernel (i.e., the IP we want to integrate) and its corresponding C model (i.e., a behavior-level model).

add_files -blackbox rtl_model.json

#170 currently uses the blackbox approach to connect the HLS kernel and RTL IPs. This method is not stable (the SW simulation often hangs forever). It also has a lot of limitations, making it hard to extend to other IP cores.

Instead of using the blackbox, we should treat each IP core as a kernel function (rather than a sub-function called inside the top-level function) and enqueue the kernels in the OpenCL host program.

error: "Cannot create binary operator with two operands of differing type!"

An error happens when I run the digitrec_no_schedule sample; here is the error message:
python: /home/hyw/Software/llvm5.0/lib/IR/Instructions.cpp:2087: static llvm::BinaryOperator* llvm::BinaryOperator::Create(llvm::Instruction::BinaryOps, llvm::Value*, llvm::Value*, const llvm::Twine&, llvm::Instruction*): Assertion `S1->getType() == S2->getType() && "Cannot create binary operator with two operands of differing type!"' failed. Aborted (core dumped)

The error is fixed when I change a = a >> 1 to a = a / 2 in line 16, and then I get correct results.

What caused the error? Is it because of the LLVM version?

Kmeans example: provided host program declares wrong kernel function

The kernel function declaration in the host program (the output label array is missing):

void default_function(int* X0, int* centers0);

The kernel function generated by HeteroCL:

void default_function(int* placeholder2, int* placeholder3, int* compute3)

This case is not covered by the auto-testing on CircleCI.

Add a schedule/pass that pops up loops to make them perfect

We need to add a schedule that can analyze which loops are safe to be moved up in a loop nest. This will be useful and is needed for tiling.

Example:
Original IR:

for (i, 0, 10)
  if (i > 5)
    for (j, 0, 10)
      #operations

After transform:

for (i, 0, 10)
  for (j, 0, 10)
    if (i > 5)
      #operations

No reuse dimension found in the body for tensor input

I am trying to reuse the input image in a conv2d layer in the LeNet example. The reuse_at primitive works fine with placeholder inputs (i.e., input_image in the first conv2d). However, when passing the max-pooled result to the second conv2d layer, no reuse pattern is found.

import heterocl as hcl
import hlib
import numpy as np

batch_size = 1000
qtype1 = hcl.Fixed(16, 14)
qtype2 = hcl.Fixed(16, 14)

def build_lenet(input_image, weight_conv1, weight_conv2,
                weight_fc1, weight_fc2, lenet):
    # first conv
    conv1 = hlib.nn.conv2d_nchw(input_image, weight_conv1, "conv1")
    tanh1 = hlib.nn.tanh(conv1, "tanh1")
    pool1 = hlib.nn.max_pool(tanh1, kernel=(2,2), stride=(2,2), name="pool1")
    # second conv
    conv2 = hlib.nn.conv2d_nchw(pool1, weight_conv2, name="conv2")
    tanh2 = hlib.nn.tanh(conv2, "tanh2")
    pool2 = hlib.nn.max_pool(tanh2, kernel=(2,2), stride=(2,2))
    # first fc
    flat = hlib.nn.flatten(pool2)
    fc1 = hlib.nn.dense(flat, weight_fc1)
    tanh3 = hlib.nn.tanh(fc1, "tanh3")
    # second fc
    fc2 =  hlib.nn.dense(tanh3, weight_fc2)
    # loss
    return hlib.nn.softmax(lenet, fc2)

input_image = hcl.placeholder((batch_size, 1, 28, 28), "input_image")
weight_conv1 = hcl.placeholder((20, 1, 5, 5), "weight_conv1", qtype1)
weight_conv2 = hcl.placeholder((50, 20, 5, 5), "weight_conv2", qtype1)
weight_fc1 = hcl.placeholder((500, 800), "weight_fc1", qtype1)
weight_fc2 = hcl.placeholder((10, 500), "weight_fc2", qtype1)
lenet = hcl.placeholder((batch_size, 10), "lenet")
s = hcl.create_schedule([input_image, weight_conv1, weight_conv2,
    weight_fc1, weight_fc2, lenet], build_lenet)

s[build_lenet.conv1].compute_at(s[build_lenet.tanh1], build_lenet.tanh1.axis[3])
s.reuse_at(input_image, s[build_lenet.conv1], build_lenet.conv1.axis[0])
s.reuse_at(build_lenet.pool1._op, s[build_lenet.conv2], build_lenet.conv2.axis[1])
print(hcl.lower(s))

The error message is as follows:

check_call
    raise TVMError(py_str(_LIB.TVMGetLastError()))
heterocl.tvm._ffi.base.TVMError: [14:12:42] src/pass/generate_reuse_buffer.cc:245: No reuse is found in axis nn
