HeteroCL's Introduction

Note: HeteroCL is superseded by Allo, a new programming language for composable accelerator design [PLDI'24]. For the latest updates, please visit our new repository.


HeteroCL: A Multi-Paradigm Programming Infrastructure for Software-Defined Reconfigurable Computing

Website | Installation | Tutorials | Documentation

Introduction

In the pursuit of higher compute performance under strict power constraints, there is an increasing need to deploy applications to heterogeneous hardware architectures with accelerators such as GPUs and FPGAs. However, although these heterogeneous computing platforms are becoming widely available, they are very difficult to program, especially FPGAs. As a result, the use of such platforms has been limited to a small subset of programmers with specialized hardware knowledge.

To tackle this challenge, we introduce HeteroCL, a programming infrastructure composed of a Python-based domain-specific language (DSL) and a compilation flow. The HeteroCL DSL provides a clean programming abstraction that decouples algorithm specification from hardware customizations, including compute and data customizations. HeteroCL further captures the interdependence among these different customization techniques, allowing programmers to explore various performance/area/accuracy trade-offs in a systematic and productive manner.
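As a minimal sketch (using only APIs documented later on this page), the algorithm below stays intact while customizations are applied through a separate schedule:

import heterocl as hcl

hcl.init()
A = hcl.placeholder((10, 10), "A")

def kernel(A):
    # algorithm specification only; no hardware details here
    return hcl.compute(A.shape, lambda y, x: A[y, x] + 1, "B")

# hardware customizations live in the schedule, not in the algorithm
s = hcl.create_schedule([A], kernel)
s[kernel.B].pipeline(kernel.B.axis[0])  # a compute customization
f = hcl.build(s)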

Language Overview

[Figure: HeteroCL language overview]

Current Compilation Flow

[Figure: current compilation flow]

Install MLIR-based HeteroCL

To install the HeteroCL-MLIR dialect, please make sure you have installed the tools below:

  • gcc >= 5.4
  • cmake >= 3.19
  • python >= 3.7

The following script shows the complete process of building the HeteroCL-MLIR dialect and connecting it with the HeteroCL frontend. It may take about 10 minutes to install the LLVM package, depending on your internet connection and hardware resources. If you are a HeteroCL developer, please refer to the guide in the HCL-MLIR repository and build the dialect with the Python binding from source.

git clone https://github.com/cornell-zhang/heterocl.git heterocl-mlir
cd heterocl-mlir
git submodule update --init --recursive
pip install . -v
# export LLVM path
export LLVM_BUILD_DIR=$(pwd)/hcl-dialect/externals/llvm-project/build
export PATH=${LLVM_BUILD_DIR}/bin:${PATH}

To verify HeteroCL is installed correctly, you can run the following test.

python3 -m pytest tests

Related Publications

Related Work

Contributing to HeteroCL

Coding Style (Python)

We follow the official Python coding style and use the NumPy docstring style. We use Black to format Python code and PyLint to lint it.

Coding Style (C and C++)

We follow Google coding style. Please refer to the hcl-dialect repository for more details.


HeteroCL's Issues

`reuse_at` fails to create multiple buffers for different outputs

A reduced test case is shown below. HeteroCL returns a Segmentation Fault with no extra information.

    import heterocl as hcl
    import numpy as np

    hcl.init()

    A = hcl.placeholder((10, 10))
    B = hcl.placeholder((10, 8))
    C = hcl.placeholder((8, 10))

    def kernel(a, b, c):

        @hcl.def_([(10,10), (10,8), (8,10)])
        def stencil(A, B, C):
            hcl.update(B, lambda y, x: A[y, x] + 2*A[y, x+1] + 3*A[y, x+2])
            hcl.update(C, lambda y, x: A[y, x] + 3*A[y+1, x] + 5*A[y+2, x])

        stencil(a, b, c)

    s = hcl.create_schedule([A, B, C], kernel)

    k = kernel.stencil
    RB1 = s.reuse_at(k.A, s[k], k.axis[1])
    RB2 = s.reuse_at(k.A, s[k], k.axis[0])
    f = hcl.build(s)

    np_A = np.random.randint(0, 10, size=(10, 10))
    np_B = np.zeros((10, 8), dtype="int")
    np_C = np.zeros((8, 10), dtype="int")

    for y in range(0, 10):
        for x in range(0, 8):
            np_B[y][x] = np_A[y][x]*1 + np_A[y][x+1]*2 + np_A[y][x+2]*3
            np_C[x][y] = np_A[x][y]*1 + np_A[x+1][y]*3 + np_A[x+2][y]*5

    hcl_A = hcl.asarray(np_A)
    hcl_B = hcl.asarray(np.zeros((10, 8), dtype="int"))
    hcl_C = hcl.asarray(np.zeros((8, 10), dtype="int"))

    print(hcl.lower(s))

    f(hcl_A, hcl_B, hcl_C)

    ret_B = hcl_B.asnumpy()
    ret_C = hcl_C.asnumpy()
    assert np.array_equal(np_B, ret_B)
    assert np.array_equal(np_C, ret_C)

terminate called after throwing an instance of 'std::out_of_range'

The error occurs when the user tries to create an Allocate statement without specifying the storage scope (some other issues can lead to the same error message).

The complete error message from HeteroCL is too ambiguous to pinpoint the issue:

terminate called after throwing an instance of 'std::out_of_range'
  what():  _Map_base::at
Aborted

The error is caused by looking up a nonexistent entry in the map:

const Variable* buffer = op->buffer_var.as<Variable>();
std::string scope = alloc_storage_scope_.at(buffer);

Adding a check before the lookup can provide a more detailed error message:

CHECK(alloc_storage_scope_.find(buffer) != alloc_storage_scope_.end());

Backend infrastructure for code generation

Since we intend to make HeteroCL a standalone compilation framework, we're going to build our own backend infrastructure for code generation from Halide IR. I first list some necessary tasks here; they can definitely be refined later.

  • Interface
    The backend infrastructure and the HeteroCL frontend can be developed in parallel, but we need to integrate them eventually. Thus, we first have to determine the interface between the HeteroCL frontend and the backend. To be more specific, the connection between the frontend output and the backend input must be straightforward.

  • Halide IR visitor base class with a supported backend language table
    To facilitate future development and collaboration, it would be good to have a common Halide IR visitor base class. In addition to the base class, we also need to maintain a table listing all supported backend languages. As a result, everyone will be able to develop and experiment with new backend language support easily.

  • C code generation
    Since we plan to generate several C-based backend languages (e.g., HLS C, Merlin C, and even OpenCL for Intel platforms), a common C code generator would be helpful.

  • System integration
    The final step is putting it all together.

Also, the following specialized code generation tasks can be split into separate issues later on.

  • HLS C code generation
  • Merlin C code generation
  • OpenCL code generation
  • Stencil DSL code generation
  • Systolic array IR code generation

Integer division truncation differs from Halide

I just found that HeteroCL's integer division truncates toward zero (-3 / 4 = 0), while Halide truncates toward the lower integer (-3 / 4 = -1). This causes lots of problems when comparing results between Halide and HeteroCL programs, because if a negative integer division occurs anywhere in the computation, the results cannot match.
So I'm wondering if there's a way to change this behavior to truncation toward the lower integer?
Thanks!
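For reference, the two conventions disagree only when exactly one operand is negative and the division is inexact. A minimal Python sketch of recovering Halide-style floor division from truncating division:

def floor_div(a, b):
    q = int(a / b)  # truncating division (rounds toward zero), as in HeteroCL
    # a % b == 0 exactly when b divides a, regardless of rounding convention
    if a % b != 0 and (a < 0) != (b < 0):
        q -= 1  # adjust toward negative infinity, matching Halide
    return q

assert floor_div(-3, 4) == -1  # Halide behavior
assert int(-3 / 4) == 0        # HeteroCL (truncating) behavior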

Support declarative programming APIs in HeteroCL module

HeteroCL does not have complete support for declarative programming APIs in HeteroCL module function definitions. Using declarative APIs in a HeteroCL module results in errors in the Storage Flatten IR pass (i.e., it fails to find the correct buffers for function argument tensors):

heterocl.tvm._ffi.base.TVMError: [19:50:35] src/pass/storage_flatten.cc:413: Check failed: buf_map_.count(key) Cannot find buffer of placeholder(k.A, 0x562ffd06c000) value=0

Stack trace returned 10 entries:
[bt] (0) heterocl/tvm/lib/libhcl.so(dmlc::StackTrace()+0x40) [0x7f346fcedb30]
[bt] (1) heterocl/tvm/lib/libhcl.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x2b) [0x7f346fcee2bb]
[bt] (2) heterocl/tvm/lib/libhcl.so(TVM::ir::StorageFlattener::HandleBufferBindScope(Halide::Internal::AttrStmt const*)+0x3d7) [0x7f346fdf7cd7]
[bt] (3) heterocl/tvm/lib/libhcl.so(TVM::ir::StorageFlattener::Mutate_(Halide::Internal::AttrStmt const*, Halide::Internal::Stmt const&)+0x926) [0x7f346fdfc046]
[bt] (4) heterocl/tvm/lib/libhcl.so(+0x945bf0) [0x7f346fd93bf0]
[bt] (5) heterocl/tvm/lib/libhcl.so(std::_Function_handler<Halide::Internal::Stmt (TVM::NodeRef const&, Halide::Internal::Stmt const&, TVM::ir::IRMutator*), TVM::IRFunctor<Halide::Internal::Stmt (TVM::NodeRef const&, Halide::Internal::Stmt const&, TVM::ir::IRMutator*)>::set_dispatch<Halide::Internal::AttrStmt>(std::function<Halide::Internal::Stmt (Halide::Internal::AttrStmt const*, Halide::Internal::Stmt const&, TVM::ir::IRMutator*)>)::{lambda(TVM::NodeRef const&, Halide::Internal::Stmt const&, TVM::ir::IRMutator*)#1}>::_M_invoke(std::_Any_data const&, TVM::NodeRef const&, Halide::Internal::Stmt const&, TVM::ir::IRMutator*&&)+0x30) [0x7f346fd9e590]
[bt] (6) heterocl/tvm/lib/libhcl.so(TVM::IRFunctor<Halide::Internal::Stmt (TVM::NodeRef const&, Halide::Internal::Stmt const&, TVM::ir::IRMutator*)>::operator()(TVM::NodeRef const&, Halide::Internal::Stmt const&, TVM::ir::IRMutator*) const+0x123) [0x7f346fcee7f3]
[bt] (7) heterocl/tvm/lib/libhcl.so(TVM::ir::IRMutator::Mutate(Halide::Internal::Stmt)+0x49) [0x7f346fcee8e9]
[bt] (8) heterocl/tvm/lib/libhcl.so(TVM::ir::StorageFlattener::Mutate_(Halide::Internal::Realize const*, Halide::Internal::Stmt const&)+0x947) [0x7f346fdf9dd7]
[bt] (9) heterocl/tvm/lib/libhcl.so(+0x945e10) [0x7f346fd93e10]

A simple example to reproduce the error:

    import heterocl as hcl

    hcl.init()
    A = hcl.placeholder((10, 10))
    B = hcl.placeholder((10, 10))

    def kernel(A, B):
        @hcl.def_([(10, 10), (10, 10)])
        def k(A, B):
            hcl.update(B, lambda y, x: A[y, x] + 1)
        k(A, B)

    s = hcl.create_schedule([A, B], kernel)
    f = hcl.build(s)

HostDeviceSplitter doesn't work

When implementing the code generation, I found several points:

  1. As far as I know, the HeteroCL programming model doesn't have a "bind" API for users to specify thread binding.

  2. Without a specified thread binding, the user-written compute function will first be marked as container.LoweredFunc.MixedFunc when the function is lowered.

  3. For the mixed function, build_module.py:build will use HostDeviceSplitter to split the kernel part into a new function.

  4. HostDeviceSplitter (tvm/src/pass/split_host_device.cc) actually splits out the kernel part by checking whether the function has a thread binding attribute.

  5. As a result, all HeteroCL functions are identified as host functions and will not be passed to the kernel code generation.

On the other hand, it does not seem proper to force kernel code generation from the mixed function. I've tried it and got lots of errors, such as "unknown field code" and "Unresolved intrinsic tvm_call_packed with return type int32".

HeteroCL Roadmap 0.2

API:

  • Device placement interface: which part of code should go to FPGA
  • Module interface: users should be able to define a module for reuse

Language:

  • Struct data type: combine different variables into a struct
  • Fixed-point data type with negative fractional/decimal bits
  • Long bitwidth bit-accurate data types: 128 bits

Scheduling:

  • Domain-specific hardware: systolic array & stencil analysis
  • Dataflow support
  • Customized memory hierarchy support

Benchmarks:

  • Optical flow (Rosetta)
  • Sorting (HLS Book)
  • FIR/IIR (HLS Book)

Miscellaneous:

  • Build a Lint test for coding style checking
  • Add concrete tutorials
  • Unit test for each function

Add annotation for hcl.local-generated arrays

Although HeteroCL allows users to use hcl.local() to create a scalar variable, it's not really a scalar variable in the IR due to an IR limitation. As a result, the IR will contain an array of length 1 as well as a for loop with trip count 1 to initialize the array. For example:

hcl.local(0, "A")
-->
int A[1];
for (int i = 0; i < 1; ++i) {
    A[i] = 0;
}

This coding style affects static analysis and should be avoided. To generate scalar variables, it would be better to annotate the IR node (specifically, the Allocate node) when the array we're declaring is actually a scalar variable. If it is, the annotation should also record its initial value (e.g., 0 in the above example).

Tools version

Hi

I would like to know if there are any requirements on the versions of the Vivado and Intel tools used with HeteroCL?

Thanks for the answer and keep up the good work!

Data Type Error if I don't use hcl.init(dtype)

The error message is:

heterocl.tvm._ffi.base.TVMError: [13:55:02] src/codegen/llvm/llvm_module.cc:59: Check failed: ret == 0 (-1 vs. 0) Assert fail: ((((tvm_struct_get(arg0, 0, 5) == (uint8)2) && (tvm_struct_get(arg0, 0, 6) == (uint8)32)) && (tvm_struct_get(arg0, 0, 8) == (uint8)0)) && (tvm_struct_get(arg0, 0, 7) == (uint8)1)), arg0.dtype is expected to be float32

You can reproduce it by the code below:

import heterocl as hcl
import numpy as np

hcl.init()
input = hcl.placeholder((1280, 768, 3), "input", dtype=hcl.Float())  # input shape = output shape

def bug(input_tensor):
    return hcl.compute(input_tensor.shape, lambda x, y, c: input_tensor[x, y, c], dtype=hcl.Float())

f = hcl.build(hcl.create_schedule([input], bug))
f(hcl.asarray(np.zeros(input.shape, dtype=float)),
  hcl.asarray(np.zeros(input.shape, dtype=float)))

I don't know what arg0 is, because I think all the data types are set to hcl.Float(), and this is confirmed by the printed IR.
The error can be resolved by replacing hcl.init() with hcl.init(hcl.Float()), but I don't know where the wrong data type comes from.

Thanks!

Segmentation fault (core dumped): hcl.compute with placeholders of more than 2 dimensions

I'm trying to write a conv_layer using HeteroCL, but I hit a memory error that I cannot resolve.

import heterocl as hcl

input = hcl.placeholder((4,32,10,10), "input")
filter = hcl.placeholder((32,32,3,3), "filter")
bias = hcl.placeholder((32,), "bias")
print(bias.shape)

def conv_layer(input, filter, bias):
    def compare_with_zero(a):
        with hcl.if_(a > 0):
            hcl.return_(a)
        with hcl.else_():
            hcl.return_(0)
    output = hcl.compute(input.shape, lambda n,c,h,w: compare_with_zero(input[n,c,h,w]), "output")
    return output
s = hcl.create_schedule([input, filter, bias], conv_layer)
print(hcl.lower(s))
f = hcl.build(s)
print('done')

The above program fails at f = hcl.build(s).
When the input placeholder has 2 dimensions, the program works. But when the dimension becomes 3 or 4, it crashes with Segmentation Fault (core dumped).

Testing does not compile on CircleCI

The tests on CircleCI fail with the following error when installing TVM for hlib testing. It says the compiler cannot find the config file for VTA. However, I can run through the script on our servers and everything works fine.

/usr/local/bin/python: can't open file 'vta/vta-hw/config/vta_config.py': [Errno 2] No such file or directory
-- Build VTA runtime with target: 
/usr/local/bin/python: can't open file 'vta/vta-hw/config/vta_config.py': [Errno 2] No such file or directory
-- Build with contrib.sort
-- Build with contrib.hybriddump

The complete error message: https://circleci.com/gh/cornell-zhang/heterocl/899?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link

Scheduling function support

Instructions

Currently, every HeteroCL API is implemented with tvm.extern, which does not carry loop information. We need to add this information in order to support scheduling. There are several things we need to do.

  1. Collect the loops inside a body. (This can be done either in C++ or Python)
  2. If the above is done in Python, we need to extend the current tvm.extern interface to allow attaching axis information.
  3. Our goal is to support loop-related scheduling functions, such as unrolling and tiling. We also need to support compute_at.

Supported Scheduling Functions

  • compute_at
  • parallel
  • store_at
  • vectorize
  • unroll
  • pipeline
  • fuse
  • split
  • tile
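For reference, a minimal sketch of how several of these primitives compose on a trivial stage (exact argument details may differ slightly from the current API):

import heterocl as hcl

hcl.init()
A = hcl.placeholder((64, 64), "A")

def kernel(A):
    return hcl.compute(A.shape, lambda y, x: A[y, x] + 1, "B")

s = hcl.create_schedule([A], kernel)
B = kernel.B
xo, xi = s[B].split(B.axis[1], factor=8)  # split the inner loop by 8
s[B].unroll(xi)                           # unroll the split inner loop
s[B].pipeline(B.axis[0])                  # pipeline the outer loop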

Typing Rules in HeteroCL Arithmetic

Do we have documentation about the typing rules in HeteroCL arithmetic? For example, it seems that the result of multiplying two int16 numbers is int32, which is inconsistent with C++ behavior. It makes sense to extend the bitwidth and avoid overflow, but sometimes it can result in unnecessary hardware resource usage and thus may not always be the expected behavior. It might be better to allow users to turn this feature on and off, or even just document it properly and provide a way to bypass the bitwidth extension.

cc @tonyjie
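To make the described behavior concrete, a minimal sketch that exhibits the widening; printing the lowered IR exposes the inferred intermediate types:

import heterocl as hcl

hcl.init()
a = hcl.placeholder((10,), "a", dtype=hcl.Int(16))
b = hcl.placeholder((10,), "b", dtype=hcl.Int(16))

def kernel(a, b):
    # the product is inferred wider than int16 to avoid overflow
    return hcl.compute(a.shape, lambda x: a[x] * b[x], "c")

s = hcl.create_schedule([a, b], kernel)
print(hcl.lower(s))  # inspect the widened intermediate types in the IR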

Error when calling 'hcl.power(2.7, _input[y, x])'

An error happens when I try to use hcl.power(2.71828, _input[y, x]).
The error message is 'float' object has no attribute 'dtype', so it seems that I can't use a plain number as the input argument.
I can only bypass this problem by writing it like this:

__expo = hcl.local(init = 2.71828, dtype = hcl.Float())
......
hcl.power(__expo[0], _input[y, x])

But the generated HeteroCL IR causes some conflicts with what I'm doing in my project now...
So I'm wondering, is there a way to use a plain number as an input argument of the hcl.power function?

Thanks!

Bit operations for HeteroCL

Bit operations are important for bit-accurate data types. From Xilinx's user guide (https://bit.ly/2GNHc1e) we can get a brief idea of what kinds of bit operations are supported. Currently, HeteroCL supports the bit selection operation on bit-accurate integers; namely, we can get the value of a bit given a variable and an index. The following bit operations need to be supported in the near future.

  • bit selection (assign): assign either 1 or 0 to a bit
  • bit slicing (get/assign): get/assign a slice of bits from a number

To begin with, we can take a look at how the current bit selection works in HeteroCL. To implement this, please fork the current repository or create a branch. Do not directly edit the master branch.

First, bit selection corresponds to "subscription" in Python AST (e.g. a[2]). You can refer to https://bit.ly/2Ho8NHe. Here, we first get the variable and index, which are "a" and "2" in "a[2]", respectively. If it is a bit operation, a tuple will be returned. The first value in the tuple is always 0 and the second value is the index for bit selection, i.e., 2. After that, we make a corresponding Halide IR node (GetBit) for this.

Second, this Halide IR node is not defined originally, so we need to add a new one. To do so, please check the attached file (Halide IR.pdf).

Finally, we need to lower the IR node to LLVM code, so we need to add the corresponding code generation. This link shows how bit operations can be done:
https://stackoverflow.com/questions/47981/how-do-you-set-clear-and-toggle-a-single-bit
You just need to translate them into LLVM instructions. Bit slicing might be a little more challenging.
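For reference, the shift-and-mask identities from the linked answer, written out in Python; the LLVM code generation would emit analogous instructions:

def get_bit(v, i):
    # bit selection (get): the value of bit i
    return (v >> i) & 1

def set_bit(v, i, b):
    # bit selection (assign): set bit i to b (0 or 1)
    return (v & ~(1 << i)) | (b << i)

def get_slice(v, hi, lo):
    # bit slicing (get): bits hi..lo, inclusive
    return (v >> lo) & ((1 << (hi - lo + 1)) - 1)

def set_slice(v, hi, lo, bits):
    # bit slicing (assign): replace bits hi..lo with `bits`
    mask = ((1 << (hi - lo + 1)) - 1) << lo
    return (v & ~mask) | ((bits << lo) & mask)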

Problem building

I followed the instructions from http://heterocl.csl.cornell.edu/doc/installation.html.
During the build process, I got the following error (I am using Ubuntu 18):

/tmp/easy_install-Jc5guN/kiwisolver-1.2.0/.eggs/cppy-1.1.0-py2.7.egg/cppy/include/cppy/ptr.h:292:2: error: ‘Py_hash_t’ does not name a type
  Py_hash_t hash() const
  ^~~~~~~~~

In file included from py/kiwisolver.cpp:10:0:
py/types.h:33:12: error: ‘PyType_Spec’ does not name a type; did you mean ‘PyType_Type’?
     static PyType_Spec TypeObject_Spec;
            ^~~~~~~~~~~
            PyType_Type

py/types.h:47:12: error: ‘PyType_Spec’ does not name a type; did you mean ‘PyType_Type’?
     static PyType_Spec TypeObject_Spec;
            ^~~~~~~~~~~
            PyType_Type

py/types.h:66:12: error: ‘PyType_Spec’ does not name a type; did you mean ‘PyType_Type’?
     static PyType_Spec TypeObject_Spec;
            ^~~~~~~~~~~
            PyType_Type

py/types.h:85:12: error: ‘PyType_Spec’ does not name a type; did you mean ‘PyType_Type’?
     static PyType_Spec TypeObject_Spec;
            ^~~~~~~~~~~
            PyType_Type

py/types.h:104:12: error: ‘PyType_Spec’ does not name a type; did you mean ‘PyType_Type’?
     static PyType_Spec TypeObject_Spec;
            ^~~~~~~~~~~
            PyType_Type
py/types.h:122:12: error: ‘PyType_Spec’ does not name a type; did you mean ‘PyType_Type’?
     static PyType_Spec TypeObject_Spec;
            ^~~~~~~~~~~
            PyType_Type
py/kiwisolver.cpp:163:1: error: ‘PyModuleDef_Slot’ does not name a type; did you mean ‘PyModule_GetDict’?
 PyModuleDef_Slot kiwisolver_slots[] = {
 ^~~~~~~~~~~~~~~~
 PyModule_GetDict
py/kiwisolver.cpp:169:20: error: variable ‘{anonymous}::PyModuleDef {anonymous}::moduledef’ has initializer but incomplete type
 struct PyModuleDef moduledef = {
                    ^~~~~~~~~

hcl.select(10>1, 0, 1) Error: First argument to Select is not a bool: int32

I'm using the hcl.select(cond, true_val, false_val) function and just encountered a problem.
When comparing two constant numbers, e.g. 10 > 1, the result is supposed to be a bool, but the hcl.select() function doesn't think so.
It can be reproduced as follows:

import heterocl as hcl 
hcl.init()
def top(input, ):
    blur_x = hcl.compute((640, (480 + 2)), lambda x, y: 0, name = "blur_x", dtype = hcl.UInt(bits = 16))
    with hcl.Stage("blur_x"):
        with hcl.for_(0, 640, name = "x") as x:
            with hcl.for_(0, (480 + 2), name = "y") as y:
                blur_x[x, y] = ((input[(x + 2), y] + input[x, y] + input[(x + 1), y])/3) + hcl.select(10 > 1, 10, 1)
    return blur_x
input = hcl.placeholder((642, 482, ), name = "input", dtype = hcl.UInt(bits = 16))
s = hcl.create_schedule([input, ], top)
print(hcl.lower(s))
# f = hcl.build(s)

The error message is:

heterocl.tvm._ffi.base.TVMError: [01:36:00] src/ir/IR.cpp:140: Check failed: condition.type().is_bool() First argument to Select is not a bool: int32

If I change the condition in hcl.select from the constant 10 to a variable y, namely hcl.select(y > 1, 10, 1), it works. But based on Python syntax, the result of a comparison operation should be of bool type, right?

SegFault from `reuse_at` in HeteroCL module

A simple minimal test case to reproduce the error.

import heterocl as hcl

def kernel(A, B):
    @hcl.def_([(10,10), (10,10)])
    def test(A, B):
        hcl.update(B, lambda y, x: A[y,x] + A[y,x+1]+ A[y,x+2])
    test(A, B)

A = hcl.placeholder((10, 10))
B = hcl.placeholder((10, 10))
s = hcl.create_schedule([A, B], kernel)
s.reuse_at(kernel.test.A, s[kernel.test], kernel.test.axis[0])

The reuse_at primitive fails to locate the target buffer to be reused.

Printing objects in HeteroCL

HeteroCL currently cannot print intermediate values, which makes debugging hard. It would be good to have something like hcl.print. For example,

A = hcl.placeholder((10,))
B = hcl.compute(A.shape, lambda x: A[x]+1)
hcl.print(B)     # print the entire tensor
hcl.print(B[0])  # print a single element

Fixed-point support for different quantization modes and overflow modes

HeteroCL now supports basic bit-accurate integers and fixed-point numbers. The way I implemented it is to extend Halide's data type. Here we can see that a data type has four attributes: code, bits, lanes, and fracs. The first one represents the data type; namely, it can be int, uint, float, or handle. Here we only need to focus on int and uint. With bits and fracs, we can now create an (unsigned) integer with bits total bits, of which fracs bits hold the fractional part. For example, if fracs is zero, it is a normal bit-accurate integer. Otherwise, it is a fixed-point number.

To access these fields, you can look into this file. Here, Type is a wrapper for the base Halide data type. Note that in addition to the Type struct on Line 296, I also define two enums for quantization mode and overflow mode. For more details on their meaning, you can check here (page 640). Currently, the quantization mode and overflow mode are only defined but not used. This is what you are going to implement. Basically, we need these modes when we assign a fixed-point number to another one.

For the Python user interface, the wrapper Type is defined here. Namely, in the samples, we are all going to use these to represent a data type. These wrapper data types will be translated to a Python string. For a fixed-point data type, it looks like fixed{a}_{b}, where a is for bits and b is for fracs. One example is fixed10_5. What you need to do here is to also include the quantization mode and overflow mode in this representation. You can implement it in any way; it is up to you.

There is another place you need to look at. This file defines the behavior when we take in a NumPy array (Line 157) and when we translate it back (Line 223). Different quantization modes will need different behaviors here.

Finally, you can check the commits below for what I have changed; it should make things easier to understand. Basically, I don't think you need to go down to the LLVM level, but I'm not sure.

Commit 1
Commit 2
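To make the two modes concrete, here is a rough Python model of assigning a real value to a fixed{bits}_{fracs} type. The mode names are illustrative placeholders, not HeteroCL's actual identifiers:

import math

def to_fixed(x, bits=10, fracs=5, quant="trn", overflow="wrap"):
    # model of assigning x to a fixed{bits}_{fracs} value (illustrative only)
    scaled = x * (1 << fracs)
    if quant == "rnd":                 # round to nearest
        q = int(math.floor(scaled + 0.5))
    else:                              # "trn": truncate toward negative infinity
        q = int(math.floor(scaled))
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    if overflow == "sat":              # saturate at the representable range
        q = max(lo, min(hi, q))
    else:                              # "wrap": keep only the low `bits` bits
        q &= (1 << bits) - 1
        if q > hi:
            q -= 1 << bits             # reinterpret as signed
    return q / (1 << fracs)

print(to_fixed(1.37, 10, 5, "rnd", "sat"))  # -> 1.375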

Support `StreamExpr` in reuse pattern detection

For now reuse_at can only handle reuse patterns found in Load expressions. However, the streaming IR pass replaces some potentially reusable Load expressions with StreamExpr expressions, making the generate_reuse_buffer IR pass crash with a SegFault.

An example of generated Intel OpenCL code with a streaming channel is shown below. The input image is replaced with a StreamExpr in this case (i.e., read_channel_intel(c_buf_4)); applying reuse_at on this input results in a SegFault.

__kernel void grad_weight_y() {
    float g_f[7];
    g_f[0] = 7.550000e-02f;
    g_f[1] = 1.330000e-01f;
    g_f[2] = 1.869000e-01f;
    g_f[3] = 2.903000e-01f;
    g_f[4] = 1.869000e-01f;
    g_f[5] = 1.330000e-01f;
    g_f[6] = 7.550000e-02f;
    for (int32_t y = 0; y < 430; ++y) {
      for (int32_t x = 0; x < 1024; ++x) {
        float reducer2;
        reducer2 = 0.000000e+00f;
        for (int32_t rdx = 0; rdx < 7; ++rdx) {
          reducer2 = ((read_channel_intel(c_buf_4) * g_f[rdx]) + reducer2);
        }
        grad_weight_y_y_filt[((x + (y * 1024)) + 3072)] = reducer2;
      }
    }
    #pragma ii 1
    for (int32_t buf_1 = 0; buf_1 < 1024; ++buf_1) {
      for (int32_t buf_0 = 0; buf_0 < 436; ++buf_0) {
        write_channel_intel(c_buf_5, grad_weight_y_y_filt[(buf_1 + (buf_0 * 1024))]);
      }
    }
}

Cannot apply parallel primitive in HeteroCL module

The issue occurs in the digit recognition example with the .parallel() primitive. I was trying to use a kernel function (i.e., an HCL module) to update knn_mat instead of calling hcl.compute, and to perform scheduling on the itervars inside the kernel function. The program after modification looks like:

def knn(*placeholders):

    @hcl.def_([(10, 1800), (10, 3)])
    def update_knn(dist, knn_mat):
        with hcl.for_(0, 10, name="i") as i:
            with hcl.for_(0, 1800, name="j") as j:
                max_id = hcl.scalar(0, "max_id")
                with hcl.for_(0, 3, name="k") as k:
                    with hcl.if_(knn_mat[i][k] > knn_mat[i][max_id.v]):
                        max_id.v = k
                with hcl.if_(dist[i][j] < knn_mat[i][max_id.v]):
                    knn_mat[i][max_id.v] = dist[i][j]

    update_knn(dist, knn_mat)

And the scheduling is performed as in the following snippet:

    knn_update = knn.update_knn
    s[knn_update].reorder(knn_update.axis[0], knn_update.axis[1])
    # ISSUE: this primitive will lead to segmentation fault
    # s[knn_update].parallel(knn_update.axis[1])
    s[knn_update].pipeline(knn_update.axis[0])

All other scheduling primitives work well, but when I call .parallel(), the program errors out with a segmentation fault.

Fix Sample Cases

The current samples are not working with the latest refactor. I list the broken cases here to track the progress.

  • Digitrec
  • KMeans
  • Smith-Waterman
  • LeNet
  • GEMM

Please fix LeNet and GEMM first so that we can start implementing the systolic array interface.
Thanks.

Create scheme from MXNet model

Is there a way to create a scheme from an imported MXNet model without having to define it?

Something like:

sym, arg, aux = onnx_mxnet.import_model(ONNX_MODEL_PATH)

hcl.create_scheme(arg, sym)

or do I have to define the network architecture from scratch as in the LeNet inference tutorial?

Is there a struct-like data type in HeteroCL?

I'm working on generating HeteroCL code from Halide IR, and I found it's hard to compute two values in one for loop in HeteroCL. To be more specific, the following is a piece of Halide code:

RDom search(0, 64);
offset(x, y) = {cast<int8_t>(0), cast<uint16_t>(65535)};
offset(x, y) = {select(SAD(x, y, search.x) < offset(x, y)[1],
                      cast<int8_t>(search.x),
                      offset(x, y)[0]),
                     min(SAD(x, y, search.x), offset(x, y)[1])};

This code does an argmin computation and needs some iteration. It has to compute the two variables offset(x, y)[0] and offset(x, y)[1] in one for loop, and they cannot be separated.

I know I could add a dimension (of size 2) to store the values and do a similar iterative computation (I haven't tried whether it works), but doing so changes the shape of the output. Also, it's not the original meaning of the code.

So I'm wondering: is there a feature in HeteroCL to create a struct or tuple-like data type? I didn't manage to find one.

Thanks!

HeteroCL data struct results in incomplete generated code

  1. I was trying to write some imperative code with a data struct, where we multiply two values from one struct and assign the product back to another struct:

stype = hcl.Struct({"fa": hcl.Int(), "fb": hcl.Fixed(12, 8)})
t = hcl.scalar(tensor[r, c], dtype=stype)
r = hcl.scalar(0, dtype=stype)
r.v.fa = t.v.fa * t.v.fb

The generated code fails to capture the computation r.v.fa = t.v.fa * t.v.fb and outputs an incomplete converter for it, like this:

ap_uint<32> _converter2;
_converter2(      _converter3(

The workaround is to write the program in the following way.

stype = hcl.Struct({"fa": hcl.Int(), "fb": hcl.Fixed(12, 8)})
t = hcl.scalar(tensor[r, c], dtype=stype)
r = hcl.scalar(0, dtype=stype)
x = hcl.scalar(t.v.fa)
y = hcl.scalar(t.v.fb)
r.v.fa = x.v * y.v
  2. The generated code has some redundant bit-slicing operations, which confuse Vivado HLS in streaming channel analysis and lead to pre-synthesis failure.
ap_uint<32> _converter;
// Vivado HLS complains grad_x has been consumed more than once
_converter(31, 0) = grad_x[(x + (y * 1024))](31,0);

// We have to manually remove the bit-slicing operation 
// so that grad_x can be implemented as a FIFO
_converter(31, 0) = grad_x[(x + (y * 1024))];

`reuse_at` returns wrong result with `hcl.select`

Here I want to apply reuse on one input along two different dimensions:

import heterocl as hcl
import numpy as np

def test_reuse_select():
    hcl.init()
    A = hcl.placeholder((10, 10, 2))
    B = hcl.compute((10, 8, 2), lambda y, x, c:
            hcl.select(c==0, A[y, x, c]*1 + A[y, x+1, c]*1 + A[y, x+2, c]*1,
                             A[y, x, c]*3 + A[y, x+1, c]*5 + A[y, x+2, c]*6))
    s = hcl.create_schedule([A, B])
    RB = s.reuse_at(A, s[B], B.axis[1])
    f = hcl.build(s)

    np_A = np.random.randint(0, 10, size=(10, 10, 2))
    np_B = np.zeros((10, 8, 2), dtype="int")
    np_C = np.zeros((10, 8, 2), dtype="int")

    for y in range(0, 10):
        for x in range(0, 8):
            np_C[y][x][0] = np_A[y][x][0]*1 + np_A[y][x+1][0]*1 + np_A[y][x+2][0]*1
            np_C[y][x][1] = np_A[y][x][1]*3 + np_A[y][x+1][1]*5 + np_A[y][x+2][1]*6

    hcl_A = hcl.asarray(np_A)
    hcl_B = hcl.asarray(np_B)
    print(hcl.lower(s))

    f(hcl_A, hcl_B)

    np_B = hcl_B.asnumpy()

    assert np.array_equal(np_B, np_C)

The result does not match the ground truth when the reuse_at schedule is applied. The IR is as follows:

produce compute0 {
  // attr [placeholder0.reuse] storage_scope = "global"
  allocate placeholder0.reuse[int32 * 1 * 3 * 2]
  // attr [0] extern_scope = 0
  for (y, 0, 10) {
    for (x.reuse, 0, 10) {
      for (c, 0, 2) {
        produce placeholder0.reuse {
          for (placeholder0.1, 0, 2) {
            placeholder0.reuse[(c + (placeholder0.1*3))] = placeholder0.reuse[((c + (placeholder0.1*3)) + 3)]
          }
          placeholder0.reuse[(c + 6)] = placeholder0[((c + (x.reuse*10)) + (y*20))]
        }
        if ((2 <= x.reuse)) {
          compute0[(((c + (x.reuse*2)) + (y*16)) + -4)] = int32(tvm_if_then_else((c == 0), (int34((int33(placeholder0.reuse[c]) + int33(placeholder0.reuse[(c + 3)]))) + int34(placeholder0.reuse[(c + 6)])), (int34((int33((placeholder0.reuse[c]*3)) + int33((placeholder0.reuse[(c + 3)]*5)))) + int34((placeholder0.reuse[(c + 6)]*6)))))
        }
      }
    }
  }
}

I am also wondering if it is possible to modify the reuse_at primitive to make it work on HeteroCL modules. Generating a reuse buffer along with .to() takes a lot of effort (I need to figure out which dimension offers reusability, and also repeat much of what reuse_at does), and I am afraid it cannot be done very soon. Reusing the reuse_at primitive and making it compatible with HeteroCL modules might be a better idea.

[Roadmap] HeteroCL Release Note v0.2

Release Note

We are happy to announce that HeteroCL is now an open-source project. HeteroCL is currently both Python 2 and Python 3 compatible. Below we list the currently supported features and their related links (tests and/or documentation).

General API

  • hcl.init (initialize a HeteroCL environment)
  • hcl.placeholder (create an input placeholder)
  • hcl.create_schedule (create a schedule for hardware customization)
  • hcl.lower (lower the program to IR for investigation)
  • hcl.build (build the program)
  • Related links: test, tutorial, doc

Data Type

  • hcl.Int, hcl.UInt (can have arbitrary bitwidth up to 255 bits)
  • hcl.Fixed, hcl.UFixed (can have arbitrary bitwidth up to 255 bits)
  • hcl.Float (can be 32 or 64 bits)
  • Related links: test, tutorial, doc

Imperative DSL

  • hcl.and_, hcl.or_
  • hcl.if_, hcl.elif_, hcl.else_
  • hcl.for_, hcl.while_, hcl.break_
  • bit operations (get bit, set bit, get slice, set slice)
  • hcl.def_, hcl.return_ (a custom-defined HeteroCL module)
  • Related links: general test, def test, tutorial, doc

Compute API

  • hcl.compute (compute a new tensor according to the given formula)
  • hcl.update (update an existing tensor according to the given formula)
  • hcl.mutate (mutatively update an existing tensor)
  • hcl.reduce_axis (create a reduce axis for reduction operation)
  • hcl.sum (perform a summation on the given axis)
  • hcl.pack, hcl.unpack (pack/unpack a tensor to larger/lower bitwidth)
  • hcl.Stage (create a stage that contains user-defined operations)
  • Related links: general test, reduction test, pack/unpack test, stage test, tutorial, doc
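As a small sketch combining several of these APIs (a row-sum reduction over a 2D tensor):

import heterocl as hcl

hcl.init()
A = hcl.placeholder((10, 10), "A")

def kernel(A):
    r = hcl.reduce_axis(0, 10, "r")
    # row sums: reduce over the second dimension with hcl.sum
    return hcl.compute((10,), lambda x: hcl.sum(A[x, r], axis=r), "B")

s = hcl.create_schedule([A], kernel)
f = hcl.build(s)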

Compute Customization

Data Type API & Data Type Customization

  • hcl.create_scheme (create a scheme for data type customization)
  • hcl.create_schedule_from_scheme
  • hcl.downsize (downsize integers to lower bit-width)
  • hcl.quantize (quantize floating-points to fixed-points)
  • Related links: test, tutorial, doc
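A minimal sketch of the scheme-based flow, assuming a kernel with an output stage kernel.B (argument details may differ slightly from the actual API):

s = hcl.create_scheme([A], kernel)
s.quantize([kernel.B], hcl.Fixed(16, 14))   # floating-point -> fixed-point
sch = hcl.create_schedule_from_scheme(s)
f = hcl.build(sch)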

Memory Customization

  • partition, reshape
  • reuse_at (create a reuse buffer provided the input tensor and stage)
  • Related links: test, tutorial, doc
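A minimal sketch of applying these primitives to a stage B that reads a tensor A (the partition signature here is approximate):

s = hcl.create_schedule([A], kernel)
B = kernel.B
s.partition(A, dim=1)                # partition the input array on-chip
RB = s.reuse_at(A, s[B], B.axis[1])  # create a reuse (line) buffer along axis 1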

Back-end Target

  • llvm (default target for CPU)
  • vhls (generate Vivado HLS code)
  • vhls_csim (generate an executable compiled from Vivado HLS code)
  • ihls (generate Intel HLS code)
  • merlinc (generate Merlin C code)
  • soda (generate SODA DSL)
  • soda_xhls (generate Vivado HLS code from SODA DSL)
  • Related links: vhls test, ihls test, merlinc test, soda test, tutorial, SODA tutorial

Stable Samples

[Roadmap] HeteroCL Roadmap v0.3

We welcome any suggestions.

General Improvement

  • device placement
  • add a pass to ML models
  • integrate with auto-tuning tools

Data Type

  • C-like struct
  • arbitrary-precision floating points

Compute Customization

  • vectorization
  • dataflow
  • data streaming

Back-end Target

  • integrate PolySA
  • add back end for RTL codegen
  • add OpenCL backend for Xilinx and Intel HLS

Wrong CPU simulation result with HeteroCL module

I created an HCL reducer to realize a sorting algorithm and wrapped it in a HeteroCL module. The goal is to sort a 10x3 array along axis 1. I checked the IR line by line and did not find any wrong logic inside. But the simulation result is not correct for every row.

The Halide IR:

// attr [sort] storage_scope = "global"
produce sort {
  // attr [0] extern_scope = 0
  def sort(handle64(sort.A[10*3]), handle64(sort.B[10*3])) {
    allocate compute0[int32 * 3]
    for (x, 0, 3) {
      compute0[x] = 1000
    }
    for (x, 0, 10) {
      allocate reducer0[int32 * 3]
      for (args, 0, 3) {
        reducer0[args] = compute0[args]
      }
      for (rdx, 0, 3) {
        for (i, 0, 3) {
          if ((sort.A[(rdx + (x*3))] < reducer0[i])) {
            for (i, 0, ((i + -2)/-1)) {
              reducer0[(2 - i)] = reducer0[(1 - i)]
            }
          }
          reducer0[i] = sort.A[(rdx + (x*3))]
          break
        }
      }
      for (reducer0_i0, 0, 3) {
        sort.B[(reducer0_i0 + (x*3))] = reducer0[reducer0_i0]
      }
    }
  }
}
// attr [sort0] storage_scope = "global"
allocate sort0[int32 * 1]
produce sort0 {
  // attr [0] extern_scope = 0
  sort(A, B)
}

the input array:

[[14 12  6]
 [ 3  7 18]
 [16  2 11]
 [17 15  7]
 [ 4 16  8]
 [ 0 15  7]
 [ 0  0  6]
 [13  1  0]
 [10  6  6]
 [19 19 11]]

And the sorted result is wrong for most rows (for some rows the result is correct). I guess there might be something wrong with memory in the LLVM JIT compilation?

[[   6   12   14]
 [  18 1000 1000]
 [  11   16 1000]
 [   7   15   17]
 [   8   16 1000]
 [   7   15 1000]
 [   6 1000 1000]
 [   0    1   13]
 [   6   10 1000]
 [  11   19 1000]]

unroll factor missing from loop annotations

I am running test_soda.py and found that none of the loops contain any annotations. Without annotations I cannot pass the unroll factor down to the SODA backend. This used to work. I am wondering what happened. Why is it (along with others) disabled in the tests?

Unexpected tensor allocation

When testing the stencil backend, I found that in the IR generated for the Gaussian benchmark, the output tensor is explicitly allocated. I believe this is incorrect because the interface already generates an implicit tensor allocation by calling tvm_struct_get. The blur benchmark works fine.

This unexpected tensor allocation is breaking the code for SODA code generation. More specifically, it invalidates the VarExpr comparison because the newly generated Variable is used in the IR, which is not linked to the interface. This results in incorrect detection of output or local tensors in SODA. As a workaround, I had to compare by name_hint, but it may not work in other situations, as the name suggests.

The IR is printed in the test_soda.py unit test and can be reproduced by running python -m unittest test_soda in heterocl/heterocl/tests.

Add bias_add operator

I can't find a way to implement the bias_add operator.

Let's say I have a tensor with a shape of (1, 8, 28, 28) and another tensor (which contains the biases) with a shape of (8, 1, 1). How could I add the biases to the first tensor?

TVM implements such an operator with C++ but I can't find a way to implement it with the hcl.compute API. Any thoughts?
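For what it's worth, one plausible way to express this with hcl.compute is to broadcast the bias index inside the lambda; a hedged sketch, not a confirmed answer:

import heterocl as hcl

hcl.init()
data = hcl.placeholder((1, 8, 28, 28), "data")
bias = hcl.placeholder((8, 1, 1), "bias")

def bias_add(data, bias):
    # broadcast the per-channel bias over the batch and spatial dimensions
    return hcl.compute(
        data.shape,
        lambda n, c, h, w: data[n, c, h, w] + bias[c, 0, 0],
        "out")

s = hcl.create_schedule([data, bias], bias_add)
f = hcl.build(s)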

reuse_at generates inconsistent index access for streaming channel

An example: we have a receiver function that needs to read data sequentially from the streaming channel, and a sender function, as shown in the following code block, that writes data (i.e., calc_x_gradient.grad_x) into the channel.

We can apply a data reuse schedule on the sender function to exploit data locality, but this leads to index access inconsistency between the sender and receiver sides (i.e., the reader reads data[x + y*1024] while the writer writes data[x + y*1024 - 2]). To avoid the incorrectness introduced by this inconsistency, the streaming inference IR pass generates another set of nested loops to write data into the channel (i.e., c_buf_2.write(calc_x_gradient.grad_x[(buf_1 + (buf_0*1024))])). This approach leads to performance degradation.

def calc_x_gradient(handle64(calc_x_gradient.input_image[436*1024]), handle64(calc_x_gradient.grad_x[436*1024])) {
  for (y, 0, 436) {
    for (x.reuse, 0, 1024) {
      produce calc_x_gradient.input_image.reuse {
        for (calc_x_gradient.input_image.0, 0, 4) {
          calc_x_gradient.input_image.reuse[calc_x_gradient.input_image.0] = calc_x_gradient.input_image.reuse[(calc_x_gradient.input_image.0 + 1)]
        }
        calc_x_gradient.input_image.reuse[4] = calc_x_gradient.input_image[(x.reuse + (y*1024))]
      }
      if ((4 <= x.reuse)) {
        allocate reducer0[float32 * 1]
        reducer0[0] = 0.000000f
        for (rdx, 0, 5) {
          reducer0[0] = ((calc_x_gradient.input_image.reuse[rdx]*float32(g_w[rdx])) + reducer0[0])
        }
        calc_x_gradient.grad_x[((x.reuse + (y*1024)) + -2)] = reducer0[0]
      }
    }
  }
  pipelined (buf_1, 0, 1024) {
    for (buf_0, 0, 436) {
      c_buf_2.write(calc_x_gradient.grad_x[(buf_1 + (buf_0*1024))])
    }
  }
}

As I discussed with Sean, a simple solution is to add an extra if-else statement around the condition block to keep the index access order consistent between receiver and sender.

def calc_x_gradient(handle64(calc_x_gradient.input_image[436*1024]), handle64(calc_x_gradient.grad_x[436*1024])) {
  for (y, 0, 436) {
    for (x.reuse, 0, 1026) {
      produce calc_x_gradient.input_image.reuse {
        for (calc_x_gradient.input_image.0, 0, 4) {
          calc_x_gradient.input_image.reuse[calc_x_gradient.input_image.0] = calc_x_gradient.input_image.reuse[(calc_x_gradient.input_image.0 + 1)]
        }
        calc_x_gradient.input_image.reuse[4] = calc_x_gradient.input_image[(x.reuse + (y*1024))]
      }

      if (2 < x.reuse) {
        c_buf_2.write(0); 
      } elif ((4 <= x.reuse)) {
        allocate reducer0[float32 * 1]
        reducer0[0] = 0.000000f
        for (rdx, 0, 5) {
          reducer0[0] = ((calc_x_gradient.input_image.reuse[rdx]*float32(g_w[rdx])) + reducer0[0])
        }
        c_buf_2.write(reducer0[0]);
      } elif (x.reuse > 1024) {
         c_buf_2.write(0); 
      }
    }
  }
}

HeteroCL Roadmap for Modeling

Check List

  • API
    • var
    • placeholder
    • imperative/declarative compute
    • imperative/declarative update
    • imperative/declarative mutate compute
    • imperative block
    • reduction (e.g., sum, arg_max)
  • Imperative DSL support
    • statement: For, If
    • statement: While, Break
    • expression: List, Dict
  • Bit-accurate data type
    • up-to-64-bit unsigned int
    • up-to-64-bit signed int
    • up-to-64-bit fixed-point
    • up-to-128-bit int
    • up-to-128-bit fixed-point
    • bit selection
    • bit slicing, bit concatenation
    • resize interface
  • Other features
    • code placement
    • hardware-specific type system

multi-stage kernel cannot be unrolled

The samples/seidel/seidel.py sample isn't working as expected. To properly generate SODA code, all stencil loops must be unrolled with the same factor; however, only the second stage (output) of seidel can be unrolled; it will crash if one tries to unroll the first stage (tmp). This can be reproduced by uncommenting this line.

SW Simulation failed using RTL IP integration

Vivado HLS provides users with a blackbox option when adding source files to the working project. Users can add a JSON file specifying the RTL kernel (i.e., the IP we want to integrate) and its corresponding C model (i.e., a behavior-level model).

add_files -blackbox rtl_model.json

#170 currently uses the blackbox approach to connect the HLS kernel and RTL IPs. This method is not stable (the SW simulation often hangs forever). It also has a lot of limitations, making it hard to extend to other IP cores.

Instead of using the blackbox, we should treat each IP core as a kernel function (rather than a sub-function called inside the top-level function) and enqueue the kernels in the OpenCL host program.

error: "Cannot create binary operator with two operands of differing type!"

An error happens when I run the digitrec_no_schedule sample; here is the error message:
python: /home/hyw/Software/llvm5.0/lib/IR/Instructions.cpp:2087: static llvm::BinaryOperator* llvm::BinaryOperator::Create(llvm::Instruction::BinaryOps, llvm::Value*, llvm::Value*, const llvm::Twine&, llvm::Instruction*): Assertion `S1->getType() == S2->getType() && "Cannot create binary operator with two operands of differing type!"' failed. Aborted (core dumped)

The error is fixed when I change a = a >> 1 to a = a / 2 in line 16, and then I get correct results.

What caused the error? Is it because of the LLVM version?

Kmeans example: provided host program declares wrong kernel function

The kernel function declaration in the host program (the output label array is missing):

void default_function(int* X0, int* centers0);

The kernel function generated by HeteroCL:

void default_function(int* placeholder2, int* placeholder3, int* compute3)

This case is not covered by the auto-testing on CircleCI.

Add a schedule/pass that pops up loops to make them perfect

We need to add a schedule that can analyze which loops are safe to be moved up in a loop nest. This will be useful and is needed for tiling.

Example:
Original IR:

for (i, 0, 10)
  if (i > 5)
    for (j, 0, 10)
      #operations

After transform:

for (i, 0, 10)
  for (j, 0, 10)
    if (i > 5)
      #operations

No reuse dimension found in the body for tensor input

I am trying to reuse the input image in a conv2d layer in the LeNet example. The reuse_at primitive works fine with placeholder inputs (i.e., input_image in the first conv2d). However, when passing the max-pooled result to the second conv2d layer, no reuse pattern is found.

import heterocl as hcl
import hlib
import numpy as np

batch_size = 1000
qtype1 = hcl.Fixed(16, 14)
qtype2 = hcl.Fixed(16, 14)

def build_lenet(input_image, weight_conv1, weight_conv2,
                weight_fc1, weight_fc2, lenet):
    # first conv
    conv1 = hlib.nn.conv2d_nchw(input_image, weight_conv1, "conv1")
    tanh1 = hlib.nn.tanh(conv1, "tanh1")
    pool1 = hlib.nn.max_pool(tanh1, kernel=(2,2), stride=(2,2), name="pool1")
    # second conv
    conv2 = hlib.nn.conv2d_nchw(pool1, weight_conv2, name="conv2")
    tanh2 = hlib.nn.tanh(conv2, "tanh2")
    pool2 = hlib.nn.max_pool(tanh2, kernel=(2,2), stride=(2,2))
    # first fc
    flat = hlib.nn.flatten(pool2)
    fc1 = hlib.nn.dense(flat, weight_fc1)
    tanh3 = hlib.nn.tanh(fc1, "tanh3")
    # second fc
    fc2 =  hlib.nn.dense(tanh3, weight_fc2)
    # loss
    return hlib.nn.softmax(lenet, fc2)

input_image = hcl.placeholder((batch_size, 1, 28, 28), "input_image")
weight_conv1 = hcl.placeholder((20, 1, 5, 5), "weight_conv1", qtype1)
weight_conv2 = hcl.placeholder((50, 20, 5, 5), "weight_conv2", qtype1)
weight_fc1 = hcl.placeholder((500, 800), "weight_fc1", qtype1)
weight_fc2 = hcl.placeholder((10, 500), "weight_fc2", qtype1)
lenet = hcl.placeholder((batch_size, 10), "lenet")
s = hcl.create_schedule([input_image, weight_conv1, weight_conv2,
    weight_fc1, weight_fc2, lenet], build_lenet)

s[build_lenet.conv1].compute_at(s[build_lenet.tanh1], build_lenet.tanh1.axis[3])
s.reuse_at(input_image, s[build_lenet.conv1], build_lenet.conv1.axis[0])
s.reuse_at(build_lenet.pool1._op, s[build_lenet.conv2], build_lenet.conv2.axis[1])
print(hcl.lower(s))

The error message is as follows:

check_call
    raise TVMError(py_str(_LIB.TVMGetLastError()))
heterocl.tvm._ffi.base.TVMError: [14:12:42] src/pass/generate_reuse_buffer.cc:245: No reuse is found in axis nn
