The halide-to-hardware from stanfordaha

where is the bin/align.h in the document ./app/hardware_benchmarks/apps/hdr_plus/process.cpp?

Where is the bin/align.h that is referenced in ./app/hardware_benchmarks/apps/hdr_plus/process.cpp? I can't find it. Can you tell me? Thank you!

Accumulation support

Add multiple input streams to unified buffer
Extract multiple input streams for accumulation
Connect unified buffer correctly for accumulation
Generate UNet with accumulation
Test UNet with CoreIR

Camera pipeline compute is malformed?

@jeffsetter I cannot get the coreir backend to load the current camera pipeline compute:

When I check the file at the command line I get:

./coreir/bin/coreir --load_libs commonlib --input ./coreir_compute/camera_pipeline_compute.json --output camera_pipeline_compute.v --passes rungenerators;flattentypes;verilog 
ERROR: {hcompute_curved_stencil}.curve$1.clk Is not fully connected (N)
{hcompute_curved_stencil}.curve$1 Is not fully connected (R)


ERROR: {hcompute_curved_stencil}.curve$1.clk Is not fully connected (N)
{hcompute_curved_stencil}.curve$1 Is not fully connected (R)


I AM DYING!

Any idea what's going wrong here?

Can't build examples in branch cleanup_codegen. Link failure with llvm::zlib?

@jeffsetter I pulled and built cleanup_codegen. When I try to build the code in apps/hardware_benchmarks/apps/harris

I get this:

bash-3.2$ make design-vhls
c++ -std=c++11 -I ../../../../include/ -I ../../../../tools/ -fvisibility=hidden -I ../../../../../coreir/include -L../../../../../coreir/lib -Wl,-rpath,../../../../../coreir/lib -g -fno-rtti harris_generator.cpp ../../../../lib/libHalide.a ../../../../tools/GenGen.cpp -o bin/harris.generator  -ldl -lpthread -lz -lcurses -L../../../../../coreir/lib -lcoreir-commonlib -lcoreir -lcoreirsim -lcoreir-float 
Undefined symbols for architecture x86_64:
  "llvm::zlib::uncompress(llvm::StringRef, llvm::SmallVectorImpl<char>&, unsigned long)", referenced from:
      llvm::readPGOFuncNameStrings(llvm::StringRef, llvm::InstrProfSymtab&) in libHalide.a(llvm_377_InstrProf.cpp.o)
  "llvm::zlib::isAvailable()", referenced from:
      llvm::collectPGOFuncNameStrings(llvm::ArrayRef<llvm::GlobalVariable*>, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >&, bool) in libHalide.a(llvm_377_InstrProf.cpp.o)
      llvm::readPGOFuncNameStrings(llvm::StringRef, llvm::InstrProfSymtab&) in libHalide.a(llvm_377_InstrProf.cpp.o)
  "llvm::zlib::compress(llvm::StringRef, llvm::SmallVectorImpl<char>&, llvm::zlib::CompressionLevel)", referenced from:
      llvm::collectPGOFuncNameStrings(llvm::ArrayRef<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > >, bool, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >&) in libHalide.a(llvm_377_InstrProf.cpp.o)
      (anonymous namespace)::ELFWriter::writeObject(llvm::MCAssembler&, llvm::MCAsmLayout const&) in libHalide.a(llvm_1089_ELFObjectWriter.cpp.o)
ld: symbol(s) not found for architecture x86_64
clang: error: linker command failed with exit code 1 (use -v to see invocation)
make: *** [bin/harris.generator] Error 1

Any idea what is wrong?

Negative const syntax error in harris compute

When I try to generate test Verilog for harris I get the following error:

cmd: ${COREIR_PATH}/bin/coreir --load_libs commonlib --input harris.json --output harris.v
cmd: verilator -Wall --cc harris.v --exe --build harris_verilog_tb.cpp --top-module harris -Wno-lint
%Error: harris.v:45542:8: syntax error, unexpected const, expecting IDENTIFIER
45542 | ) const-255__383 (
      |        ^
%Error: harris.v:45701:8: syntax error, unexpected const, expecting IDENTIFIER
45701 | ) const-255__283 (
      |        ^
%Error: Exiting due to 2 error(s)

This seems to come from the harris compute file: https://github.com/dillonhuff/clockwork/blob/b4dc132aa141f610421159bccf8ae21b8ae353ce/coreir_compute/harris_compute.json#L239
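
The syntax error above is consistent with `const-255__383` containing a `-`, which is not allowed in a simple Verilog identifier. One possible workaround is to sanitize instance names before Verilog is emitted. The following is a minimal sketch under that assumption; `sanitize_verilog_identifier` is a hypothetical helper and is not part of CoreIR or clockwork.

```cpp
#include <cassert>
#include <cctype>
#include <string>

// Hypothetical helper (not a CoreIR or clockwork API): rewrite an instance
// name so that it is a legal simple Verilog identifier. Simple identifiers
// start with a letter or '_' and contain only letters, digits, '_' and '$'.
std::string sanitize_verilog_identifier(const std::string& name) {
  std::string out;
  for (char c : name) {
    const unsigned char uc = static_cast<unsigned char>(c);
    if (std::isalnum(uc) || c == '_' || c == '$') {
      out += c;
    } else if (c == '-') {
      out += "neg";  // keeps "const-255" distinct from "const_255"
    } else {
      out += '_';    // any other illegal character becomes an underscore
    }
  }
  // A leading digit or '$' is not allowed, and the name must not be empty.
  if (out.empty() || std::isdigit(static_cast<unsigned char>(out[0])) ||
      out[0] == '$') {
    out.insert(0, "_");
  }
  assert(!out.empty());
  return out;
}
```

With this sketch, `const-255__383` would become `constneg255__383`. Mapping `-` to `neg` rather than `_` is a design choice that avoids collisions with a positive constant of the same magnitude.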

Building Unified Buffer Library in Vivado HLS

Extend the Unified Buffer Library with template support for up to 3D, in the style of Jing's line buffer
Add testcase multiple channel convolution using circular line buffer as a real unified buffer
Line Buffer Lib support address stream
Address Generator
Bank selector support different I/O port
Psum Buffer with update state
VGGnet Pass
Mobilenet Pass

cascade_compute.h is out of date?

@jeffsetter when I run C++ simulation for cascade (from example_progs) I get the following compile error on the generated C++ code:

cmd: g++ -fstack-protector-all -std=c++11 regression_tb_unoptimized_cascade.cpp unoptimized_cascade.cpp
unoptimized_cascade.cpp: In function ‘void op_hcompute_hw_input_global_wrapper_stencil(HWStream<hw_uint<16> >&, hw_input_global_wrapper_stencil_cache&, int, int, int)’:
unoptimized_cascade.cpp:1397:24: error: ‘hcompute_hw_input_global_wrapper_stencil’ was not declared in this scope
  auto compute_result = hcompute_hw_input_global_wrapper_stencil(hw_input_stencil_hw_input_global_wrapper_s0_y_c__hw_input_global_wrapper_s0_x_value);
                        ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
unoptimized_cascade.cpp:1397:24: note: suggested alternative: ‘op_hcompute_hw_input_global_wrapper_stencil’
  auto compute_result = hcompute_hw_input_global_wrapper_stencil(hw_input_stencil_hw_input_global_wrapper_s0_y_c__hw_input_global_wrapper_s0_x_value);
                        ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
                        op_hcompute_hw_input_global_wrapper_stencil
unoptimized_cascade.cpp: In function ‘void op_hcompute_hw_output_stencil(conv2_stencil_cache&, HWStream<hw_uint<16> >&, int, int, int)’:
unoptimized_cascade.cpp:1489:40: error: no matching function for call to ‘HWStream<hw_uint<16> >::write(hw_uint<8>&)’
  hw_output_stencil.write(compute_result);
                                        ^
In file included from cascade_compute.h:2,
                 from unoptimized_cascade.cpp:10:
hw_classes.h:392:10: note: candidate: ‘void HWStream<T>::write(const T&) [with T = hw_uint<16>]’
     void write(const T& v) {
          ^~~~~
hw_classes.h:392:10: note:   no known conversion for argument 1 from ‘hw_uint<8>’ to ‘const hw_uint<16>&’
clockwork: prog.cpp:2842: std::vector<std::__cxx11::basic_string<char> > run_regression_tb(const string&): Assertion `res == 0' failed.

It looks as though one of the functions hcompute_hw_input_global_wrapper_stencil does not exist in cascade_compute.h, could you check if the cascade compute file is up to date?
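
The second error in the log is a width mismatch: the compute result is an `hw_uint<8>` while the output stream carries `hw_uint<16>`, and no implicit conversion exists between them. The sketch below illustrates that situation and an explicit widening step. The types `hw_uint_sketch` and `HWStreamSketch` and the function `zero_extend` are simplified, hypothetical stand-ins, not the definitions from hw_classes.h.

```cpp
#include <cassert>
#include <cstdint>
#include <queue>

// Simplified stand-in for a fixed-width unsigned value (hypothetical; the
// real hw_uint in hw_classes.h is not reproduced here).
template <int Width>
struct hw_uint_sketch {
  static_assert(Width > 0 && Width <= 64, "width must fit in 64 bits");
  uint64_t bits;
};

// Simplified stand-in for a typed stream: write() accepts exactly T, so a
// value of a different width does not convert implicitly.
template <typename T>
class HWStreamSketch {
 public:
  void write(const T& v) { q_.push(v); }
  T read() {
    assert(!q_.empty());
    T v = q_.front();
    q_.pop();
    return v;
  }

 private:
  std::queue<T> q_;
};

// Explicit widening: mask the value to the source width, then store it in
// the wider type.
template <int To, int From>
hw_uint_sketch<To> zero_extend(const hw_uint_sketch<From>& v) {
  static_assert(To >= From, "zero_extend only widens");
  const uint64_t mask = (From == 64) ? ~0ULL : ((1ULL << From) - 1);
  return hw_uint_sketch<To>{v.bits & mask};
}
```

In this sketch, writing an 8-bit result to a 16-bit stream would read `stream.write(zero_extend<16>(result));`. Whether the generated code should widen at the write or the compute function should return 16 bits in the first place is a question for the code generator.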

Error when doing make run-clockwork on new max pooling example

@jeffsetter @thenextged I've gotten the CPU code compiling and running for my max-pooling example (https://github.com/StanfordAHA/Halide-to-Hardware/tree/maxpool_example/apps/hardware_benchmarks/apps/max_pool_2x2), but when I run make run-clockwork I get the following error:

dhuff@kiwi:~/h2h2clockwork/Halide-to-Hardware/apps/hardware_benchmarks/apps/max_pool_2x2$ make run-clockwork
g++-7 -std=c++17 -I../../../../../clockwork -I../../../../../clockwork/include -I/home/dhuff/h2h2clockwork/Halide-to-Hardware/clockwork/barvinok-0.41/isl -fPIC -I/home/dhuff/h2h2clockwork/clockwork/barvinok-0.41/isl/ -c bin/clockwork_codegen.cpp -o bin/clockwork_codegen.o
In file included from bin/clockwork_codegen.cpp:2:0:
bin/maxpool_memory.cpp: In function ‘prog maxpool()’:
bin/maxpool_memory.cpp:11:27: error: ‘arg_0’ was not declared in this scope
 int32_t &hw_output_s0_c = arg_0;
                           ^~~~~
../../hw_support/hardware_targets.mk:138: recipe for target 'bin/clockwork_codegen.o' failed
make: *** [bin/clockwork_codegen.o] Error 1

Any idea what is going on here? Thanks!

H2H failing daily regression

See https://buildkite.com/stanford-aha/garnetflow/builds/1476#fddb505e-d74c-4c86-82a5-aae72b9f399f/6-10334

This is using the latest release: https://github.com/StanfordAHA/Halide-to-Hardware/releases/tag/lakelib

Unified Buffer Functional Model

Renew functional model with the new stencil valid parameter
Use pybinding for functional model, single source of truth
Add reset for functional model (simulation purpose)

Typo on clockwork branch in ExtractHWAccelerators.cpp

Halide-to-Hardware/src/ExtractHWAccelerators.cpp

Line 1 in a8f1a59

#include "ExtractHWAccelerator.h"

Should be #include "ExtractHWAccelerators.h"

FPGA PnR and test on board, get performance and resource

Image processing apps

NN apps

VGG/ unet conv
MobileNet conv
HDRNet

Run models, get results, finetune, iterate, repeat

unet compute c++ file returns a uint8?

@jeffsetter One of the unet compute functions returns a uint8 here:

https://github.com/dillonhuff/clockwork/blob/25be48e94e97710c3b8184024a795c43f20c6328/unet_conv_3_3_compute.h#L144

Is this going to be supported on the CGRA? If not could you replace with a function that returns hw_uint<16>?

Try different UNet schedules

Compile different unet schedules:

unroll r.x
unroll r.x, r.y
unroll output channel
unroll input channel

Integrate into System Flow

Ensure applications run properly (conv_3_3, cascade, harris, kitchen_sink)
Update CoreIR linking using new libraries
Create global buffer metadata from Halide
Run from Halide to CGRA and test using push-button
Run from Halide to SoC and test with global buffer using push-button

Insert unified buffers into a Halide loop nest

Modify loop nest based on unified buffer address generators

Test compiler with tests and applications

3x3 convolution
harris
camera_pipe
UNet

Generate camera_pipe numbers

check the compilation of camera_pipe
verify that camera_pipe runs on the CoreIR simulator
gather numbers

Does the clockwork / CPU comparison script support 3D buffers?

@jeffsetter @thenextged I'm trying to add max-pooling, which needs a 3D input and output (app here: https://github.com/StanfordAHA/Halide-to-Hardware/tree/maxpool_example/apps/hardware_benchmarks/apps/max_pool_2x2 ). When I run make run-cpu I get:

dhuff@kiwi:~/h2h2clockwork/Halide-to-Hardware/apps/hardware_benchmarks/apps/max_pool_2x2$ make run-cpu
./bin/process run cpu input.png 
Error: Input buffer input requires a buffer of exactly 3 dimensions, but the buffer passed in has 2 dimensions
../../hw_support/hardware_targets.mk:249: recipe for target 'run-cpu' failed
make: *** [run-cpu] Aborted (core dumped)

I've tried to modify process.cpp to use a 3D buffer here:

Halide-to-Hardware/apps/hardware_benchmarks/apps/max_pool_2x2/process.cpp

Lines 71 to 72 in ad08970

    
    processor.input   = Buffer<uint8_t>(64, 64, 3);
    processor.output  = Buffer<uint8_t>(31, 31, 3);

What am I doing wrong here?

Simulate unified buffer using CoreIR simulator

Create stub for unified buffer as a new node in the CoreIR simulator
Create C++ functional model for unified buffer
Simulate address generator
Test simulator using a simple example

Codegen CoreIR for unified buffer

Remove extraneous hardware generated by unified loop nest
Convert access loopnest into (stride, range) pairs
Create CoreIR codegen from Halide for unified buffer
Create CoreIR generator for physical unified buffer
Abstract CoreIR generator for abstract unified buffer
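
As an illustration of the (stride, range) representation mentioned in the list above, the following sketch enumerates the addresses produced by a nest of affine loops. The `AccessDim` struct, the `generate_addresses` function, and the innermost-first ordering are assumptions made for this example; they are not the actual unified buffer parameters.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One loop level of an access pattern: `range` iterations, each advancing
// the address by `stride`. Names are illustrative only.
struct AccessDim {
  int stride;
  int range;
};

// Enumerate the addresses of an affine access pattern.
// dims[0] is assumed to be the innermost (fastest-varying) loop.
std::vector<int> generate_addresses(const std::vector<AccessDim>& dims,
                                    int start) {
  for (const AccessDim& d : dims) {
    assert(d.range > 0);
  }
  std::vector<int> counters(dims.size(), 0);
  std::vector<int> addrs;
  while (true) {
    // Address = start + sum over dimensions of counter * stride.
    int addr = start;
    for (std::size_t i = 0; i < dims.size(); ++i) {
      addr += counters[i] * dims[i].stride;
    }
    addrs.push_back(addr);
    // Odometer-style increment: bump the innermost counter and carry
    // outward whenever a counter reaches its range.
    std::size_t i = 0;
    for (; i < dims.size(); ++i) {
      if (++counters[i] < dims[i].range) break;
      counters[i] = 0;
    }
    if (i == dims.size()) break;  // every counter wrapped: pattern complete
  }
  return addrs;
}
```

For example, a 3x3 window over an image that is 64 pixels wide can be written as `{{1, 3}, {64, 3}}` with `start = 0`, which yields the addresses 0, 1, 2, 64, 65, 66, 128, 129, 130.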

Code generation for global buffer from Halide for one layer

Run some applications on Garnet with Global Buffer

A lot of the designs coming out of the master branch are extremely large; the resulting bitstreams have on the order of 10x more configuration than they did before whatever codegen change happened. Much of the extra hardware seems to be unnecessary control logic that doesn't need to be synthesized. Is there some set of optimization passes that can be run to shrink the resulting designs?

old.txt
new.txt

Halide compile CoreIR error

I have installed Halide and CoreIR according to this tutorial.
I want to compile the "apps/hardware_benchmarks/apps/pointwise" application to get the CoreIR of this example application, and I use "make design" command to compile, but I got the following error info:
/home/linuxbrew/.linuxbrew/bin/ld: CodeGen_PTX_Dev.cpp:(.text+0x5f35): undefined reference to `llvm::legacy::PassManager::~PassManager()'
/home/linuxbrew/.linuxbrew/bin/ld: CodeGen_PTX_Dev.cpp:(.text+0x5f41): undefined reference to `llvm::legacy::FunctionPassManager::~FunctionPassManager()'
/home/linuxbrew/.linuxbrew/bin/ld: CodeGen_PTX_Dev.cpp:(.text+0x5f48): undefined reference to `vtable for llvm::raw_pwrite_stream'
/home/linuxbrew/.linuxbrew/bin/ld: CodeGen_PTX_Dev.cpp:(.text+0x5f5f): undefined reference to `llvm::raw_ostream::~raw_ostream()'
/home/linuxbrew/.linuxbrew/bin/ld: CodeGen_PTX_Dev.cpp:(.text+0x62c6): undefined reference to `llvm::DataLayout::~DataLayout()'
/home/linuxbrew/.linuxbrew/bin/ld: ../../../../distrib/lib/libHalide.a(CodeGen_PTX_Dev.o): in function `Halide::Internal::CodeGen_PTX_Dev::dump() [clone .localalias.208]':
CodeGen_PTX_Dev.cpp:(.text+0xe9): undefined reference to `llvm::Module::print(llvm::raw_ostream&, llvm::AssemblyAnnotationWriter*, bool, bool) const'
/home/linuxbrew/.linuxbrew/bin/ld: ../../../../distrib/lib/libHalide.a(CodeGen_PTX_Dev.o): in function `_GLOBAL__sub_I_CodeGen_PTX_Dev.cpp':
CodeGen_PTX_Dev.cpp:(.text.startup+0x40): undefined reference to `LLVMLinkInMCJIT'
/home/linuxbrew/.linuxbrew/bin/ld: ../../../../distrib/lib/libHalide.a(CodeGen_PTX_Dev.o):(.data.rel+0x0): undefined reference to `llvm::DisableABIBreakingChecks'
collect2: error: ld returned 1 exit status
make: *** [bin/pointwise.generator] Error 1

Camera Pipeline Denoise Buffer Access Pattern

The output access pattern is not correct for denoise buffer in camera pipeline. It has two extra dimensions.

Generate Harris result numbers

fix ranges using loop substitutions for successive kernels
combine loop access using sliding window
use logical size to update for loops
create compiler unit tests for access pattern ranges
verify that Harris runs on the CoreIR simulator
verify that Harris runs through GarnetFlow and shale
get numbers for Harris

Discrepancy in casting between pointwise compute .h file and coreir .json

@jeffsetter I was comparing the cgra pointwise output to the output I get with the coreir and noticed that one of the pointwise compute units casts its output to uint8:

https://github.com/dillonhuff/clockwork/blob/b4dc132aa141f610421159bccf8ae21b8ae353ce/pointwise_compute.h#L26

While the corresponding coreir compute uses 16 bit arithmetic in all compute units.

Can this be fixed or am I misunderstanding something about the outputs?
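
To make the potential discrepancy concrete, the sketch below compares a result kept in 16 bits with the same result truncated to 8 bits. The kernel `x * 2` and both function names are hypothetical and used only for illustration; the real compute function is the one linked in pointwise_compute.h.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical pointwise kernel (x * 2) with the result kept in 16 bits,
// mirroring compute units that use 16-bit arithmetic throughout.
uint16_t pointwise_16bit(uint16_t x) {
  return static_cast<uint16_t>(x * 2);
}

// Same arithmetic, but the result is truncated to 8 bits before being
// widened back to 16 bits, mirroring a compute function that casts its
// output to uint8.
uint16_t pointwise_cast_to_u8(uint16_t x) {
  return static_cast<uint8_t>(x * 2);
}
```

Under these assumptions the two versions agree whenever the result fits in 8 bits (an input of 100 gives 200 in both) and diverge once it does not (an input of 200 gives 400 in the 16-bit version and 144 after the uint8 cast).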

Layer-by-layer code generation for running full Halide application

Refactor: usage of async between hardware kernels

investigate if async can work
use async between loops
modify codegen to produce special memories without synchronization primitives

Adding Rewrite Rules

Backend optimization, banking, chaining...
Port optimization for basic (2D, 3D) linebuffer
Port optimization for strided linebuffer
Layout transfer buffer for downsample rate matching
Storage folding for circular buffer (input bank size ≠ output bank size)

Improve testing of unified buffer

Multiple producer synchronization

When multiple producers connect to a single consumer, there needs to be some synchronization between the valid signals. There are issues when they are out-of-sync, as when they have a different valid signature.
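
One common way to handle producers whose valid signals are out of sync is to give each producer a small FIFO and let the consumer fire only when every FIFO holds data. The following is a minimal behavioral sketch of that idea, not the mechanism used in Halide-to-Hardware; the `ValidJoin` class is hypothetical and its FIFOs are unbounded for simplicity.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <vector>

// Illustrative join of several producers feeding one consumer: each producer
// has its own FIFO, and the consumer fires only when every FIFO has data.
class ValidJoin {
 public:
  explicit ValidJoin(std::size_t num_producers) : fifos_(num_producers) {
    assert(num_producers > 0);
  }

  // Producer `p` asserts valid with `value` in the current cycle.
  void push(std::size_t p, int value) { fifos_.at(p).push_back(value); }

  // Joint valid: true only when all producers have an unconsumed value.
  bool valid() const {
    for (const auto& f : fifos_) {
      if (f.empty()) return false;
    }
    return true;
  }

  // Pop one value from each producer, in producer order.
  // Call only when valid() is true.
  std::vector<int> pop() {
    assert(valid());
    std::vector<int> out;
    for (auto& f : fifos_) {
      out.push_back(f.front());
      f.pop_front();
    }
    return out;
  }

 private:
  std::vector<std::deque<int>> fifos_;
};
```

In hardware the FIFO depth would have to be bounded by the maximum skew between the producers' valid signatures, which this sketch does not model.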

CoreIR Simulator for Unified Buffer

Framework

merge the simulator plugin into H2H repo and link coreIR in compilation

Application test

Properly use compute and store level

Extract compute level of a func
Determine streaming loops as loops between compute and store level
Unified buffer logical size is based on store level (especially for tiled accelerator)
Test that MobileNet can use a different compute level

Unsharp cpp compute is out of date?

@jeffsetter I'm trying to build a vanilla CPU unsharp and I'm getting the following error:

cmd: g++ -fstack-protector-all -std=c++11 regression_tb_unoptimized_unsharp.cpp unoptimized_unsharp.cpp
unoptimized_unsharp.cpp: In function ‘void op_hcompute_hw_input_stencil(HWStream<hw_uint<16> >&, hw_input_stencil_cache&, int, int, int, int)’:
unoptimized_unsharp.cpp:2142:24: error: ‘hcompute_hw_input_stencil’ was not declared in this scope
  auto compute_result = hcompute_hw_input_stencil(input_copy_stencil_hw_input_s0_x_c__hw_input_s0_y_c__hw_input_s0_c_value);
                        ^~~~~~~~~~~~~~~~~~~~~~~~~
unoptimized_unsharp.cpp:2142:24: note: suggested alternative: ‘op_hcompute_hw_input_stencil’
  auto compute_result = hcompute_hw_input_stencil(input_copy_stencil_hw_input_s0_x_c__hw_input_s0_y_c__hw_input_s0_c_value);
                        ^~~~~~~~~~~~~~~~~~~~~~~~~
                        op_hcompute_hw_input_stencil
clockwork: prog.cpp:2848: std::vector<std::__cxx11::basic_string<char> > run_regression_tb(const string&): Assertion `res == 0' failed.

Also I don't see unsharp_compute.json in coreir_compute.

HLS Backend Configuration Generation

Application list

3x3 conv
camera pipeline
VGG with basic double buffer, psum buffer
MobileNet with line buffer and layout transfer buffer (downsample buffer)
HDRNet with slicing layer

Extract unified buffer parameters from Halide loop nest

Handle case where number of dimensions increases (demosaic in camera_pipe)
Handle multiple stencils read from a single buffer
Calculate stencils based on output of a previous buffer

Could you make a single stage, grayscale application that uses a rom?

@jeffsetter I'd like to have a rom test that is simpler than camera pipeline. Maybe just some pointwise math that looks up something in a ROM?

Get Pytorch->Onnx->Halide working

To do:

Check weights of .onnx file
Get human readable format of .onnx file
Test with small 1-2 layer pytorch network
Follow issue opened on pytorch github
Figure out how to schedule functions in Halide representation
Integrate more operations into the Onnx->Halide code (e.g. ConvTranspose, others)
Meet with Jeff, get code working end to end on CGRAFlow, fix compat issues with CoreIR

Run auto-scheduler out-of-loop to create schedules for layers of test applications

Remove extraneous muxes

Most of these muxes are due to incorrectly indexed buffers. The most problematic of these are muxes with more than one input. The following are the applications with this issue.

remove muxes in strided conv
remove muxes in downsample
remove muxes in unet conv

CoreIR Rewrite Rule + Application Checklist

General Issue

Strided conv with an odd row size has an incorrect line buffer size
Using PyCoreIR instead of pure json generation

Image Processing Applications

3x3 conv
harris
strided conv
camera pipeline

NN Processing Applications

UNet layer
UNet psum buffer ??
MobileNet with line buffer and double buffer ??

Multiple consumer streams of a unified buffer

Change parameters of unified buffer as a vector of consumers
Extract each output stream from HalideIR
Merge consumer buffers if the access pattern is similar
test case for multiple consumers
CoreIR simulator support for multiple consumers
unsharp application working

Master breaks at CoreHLS codegen when use_ubuffer = true

Keyi (@Kuree) is running a CGRA PnR test and needs the unified buffer to be generated. When I set use_ubuffer=true, it breaks your CoreHLS codegen. Do you have a quick fix for this? @dillonhuff

Issues with CoreIR code generation, specifically regarding unified buffer

Extra muxes (for example, between data output of unified buffer and input of compute units). Present in multichannel conv, strided conv, avg pool.
Harris -- ranges are incorrect for padded16, lxx, lxy, lyy unified buffers
Tiling image and defining scope of image that should be brought onto the accelerator/CGRA not reflected in generated CoreIR

