larq / compute-engine

Highly optimized inference engine for Binarized Neural Networks

Home Page: https://docs.larq.dev/compute-engine

License: Apache License 2.0

Python 10.36% Shell 1.09% C++ 72.02% Starlark 5.66% MLIR 10.17% Dockerfile 0.08% C 0.06% CMake 0.55%

larq binarized-neural-networks tflite tensorflow mlir bnn raspberry-pi keras armv8 aarch64

compute-engine's Introduction

Larq Compute Engine

Larq Compute Engine (LCE) is a highly optimized inference engine for deploying extremely quantized neural networks, such as Binarized Neural Networks (BNNs). It currently supports various mobile platforms and has been benchmarked on a Pixel 1 phone and a Raspberry Pi. LCE provides a collection of hand-optimized TensorFlow Lite custom operators for supported instruction sets, developed in inline assembly or in C++ using compiler intrinsics. LCE leverages optimization techniques such as tiling to maximize the number of cache hits, vectorization to maximize the computational throughput, and multi-threading parallelization to take advantage of multi-core modern desktop and mobile CPUs.

Larq Compute Engine is part of a family of libraries for BNN development; you can also check out Larq for building and training BNNs and Larq Zoo for pre-trained models.

Key Features

Effortless end-to-end integration from training to deployment:
- Tight integration of LCE with Larq and TensorFlow provides a smooth end-to-end training and deployment experience.
- A collection of Larq pre-trained BNN models for common machine learning tasks is available in Larq Zoo and can be used out-of-the-box with LCE.
- LCE provides a custom MLIR-based model converter which is fully compatible with TensorFlow Lite and performs additional network level optimizations for Larq models.
Lightning fast deployment on a variety of mobile platforms:
- LCE enables high performance, on-device machine learning inference by providing hand-optimized kernels and network level optimizations for BNN models.
- LCE currently supports 64-bit ARM-based mobile platforms such as Android phones and Raspberry Pi boards.
- Thread parallelism support in LCE is essential for modern mobile devices with multi-core CPUs.

Performance

The table below presents single-threaded performance of Larq Compute Engine on different versions of a novel BNN model called QuickNet (trained on ImageNet dataset, released on Larq Zoo) on a Raspberry Pi 4 Model B at 1.5GHz (BCM2711) board, a Pixel 1 Android phone (2016), and a Mac Mini with M1 ARM CPU:

Model	Top-1 Accuracy	RPi 4B 1.5GHz, 1 thread (ms)	Pixel 1, 1 thread (ms)	Mac Mini M1, 1 thread (ms)
QuickNetSmall	59.4%	27.7	16.8	4.0
QuickNet	63.3%	45.0	25.5	5.8
QuickNetLarge	66.9%	77.0	44.2	9.9

For reference, dabnn (the other main BNN library) reports an inference time of 61.3 ms for Bi-RealNet (56.4% accuracy) on the Pixel 1 phone, while LCE achieves an inference time of 41.6 ms for Bi-RealNet on the same device. The dabnn authors furthermore present a modified version, BiRealNet-Stem, which achieves the same accuracy of 56.4% in 43.2 ms.

The following table presents multi-threaded performance of Larq Compute Engine on a Raspberry Pi 4 Model B at 1.5GHz (BCM2711) board, a Pixel 1 phone, and a Mac Mini with M1 ARM CPU:

Model	Top-1 Accuracy	RPi 4B 1.5GHz, 4 threads (ms)	Pixel 1, 4 threads (ms)	Mac Mini M1, 4 threads (ms)
QuickNetSmall	59.4%	12.1	8.9	1.8
QuickNet	63.3%	20.8	12.6	2.5
QuickNetLarge	66.9%	31.7	22.8	3.9

Benchmarked on 2021-06-11 (Pixel 1), 2021-06-13 (Mac Mini M1), and 2022-04-20 (RPi 4B) with LCE custom TFLite Model Benchmark Tool (see here) with XNNPack enabled and BNN models with randomized inputs.

Getting started

Follow these steps to deploy a BNN with LCE:

Pick a Larq model

You can use Larq to build and train your own model or pick a pre-trained model from Larq Zoo.
Convert the Larq model

LCE is built on top of TensorFlow Lite and uses the TensorFlow Lite FlatBuffer format to convert and serialize Larq models for inference. We provide an LCE Converter with additional optimization passes to increase the speed of execution of Larq models on supported target platforms.
Build LCE

The LCE documentation provides the build instructions for Android and 64-bit ARM-based boards such as Raspberry Pi. Please follow the provided instructions to create a native LCE build or cross-compile for one of the supported targets.
Run inference

LCE uses the TensorFlow Lite Interpreter to perform an inference. In addition to the already available built-in TensorFlow Lite operators, optimized LCE operators are registered to the interpreter to execute the Larq specific subgraphs of the model. An example to create and build an LCE compatible TensorFlow Lite interpreter for your own applications is provided here.

Next steps

Explore Larq pre-trained models.
Learn how to build and train BNNs for your own application with Larq.
If you're a mobile developer, visit Android quickstart.
See our build instructions for Raspberry Pi and 64-bit ARM-based boards here.
Try our example programs.

About

Larq Compute Engine is being developed by a team of deep learning researchers and engineers at Plumerai to help accelerate both our own research and the general adoption of Binarized Neural Networks.

compute-engine's Issues

optimizing RUY packing algorithm for int32/64 inputs

Currently the generic algorithm is used for 32/64-bit bitpacked elements because RUY has no template specialization for int64_t elements.

add ".gitignore" and ".clang-format" files

Add padding support to bitpacking

Add padding support so that tensor dimensions do not have to be a multiple of 64.
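
A minimal sketch of what padded bitpacking could look like, written in plain Python for readability rather than speed. The function name `bitpack`, the 64-bit word width, and the convention "negative value maps to bit 1" are assumptions made for this illustration; they are not taken from the LCE code.

```python
WORD_BITS = 64  # bitpacking width assumed for this sketch


def bitpack(values, word_bits=WORD_BITS):
    """Pack the signs of `values` into unsigned integers of `word_bits` bits.

    Assumed convention: a negative value becomes bit 1, anything else bit 0.
    Unused positions in the last word are padded with 0 bits.
    """
    words = []
    for start in range(0, len(values), word_bits):
        word = 0
        for offset, value in enumerate(values[start:start + word_bits]):
            if value < 0:
                word |= 1 << offset
        words.append(word)
    return words


# 70 values do not fit in one 64-bit word, so the second word is padded.
packed = bitpack([-1.0] * 70)
```

If both operands of a later XOR are padded with the same bit value, the padded positions XOR to 0, so a dot product computed as `n - 2 * popcount` with the true length `n` is unaffected by the padding.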

Switch to a single build system (bazel)

Maintaining two build systems (Makefile and bazel) is too much work.

Dependency handling like downloading google-test and so on is easier in bazel. The azure system should also use the bazel system.

We can leave a Makefile that only includes a simple build command for the library (which assumes all the dependencies are already present), for easier development when you are locally recompiling the cpp code. The Makefile should not include things like building the python package; that should all be in the bazel system.

adding support of dilation to padding_functor

Check for tensorflow release tag

See the comments in PR #55.
Once a new tag, such as v2.0.0-rc3, includes this commit, we can use that tag for our submodule instead of a random commit.
This is also better with regard to Issue #51.

Bit-packing of input tensor along channel dimension

The current implementation in compute-engine is as follows:

im2col
bitpack the im2col matrix and filter matrix
BGEMM

However, it makes sense to bitpack the input matrix first along the channel dimension (you might need extra bit padding) and do the im2col afterward (or use a fused bitpacking-im2col algorithm).
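
The proposed order (bitpack along the channel dimension, then im2col on the packed words) can be sketched as below. This is an illustration only: the helper names `pack_channels` and `packed_im2col`, the H x W x C nested-list layout, stride 1, no spatial padding, and the "negative maps to bit 1" convention are all assumptions of this sketch.

```python
def pack_channels(pixel, word_bits=64):
    """Pack the sign bits of one pixel's channels (negative -> bit 1)."""
    words = []
    for start in range(0, len(pixel), word_bits):
        word = 0
        for offset, value in enumerate(pixel[start:start + word_bits]):
            if value < 0:
                word |= 1 << offset
        words.append(word)
    return words


def packed_im2col(image, kernel_h, kernel_w):
    """Bitpack an H x W x C image along C, then gather patches.

    Uses stride 1 and no spatial padding. Each output row holds the
    packed words of one kernel_h x kernel_w patch.
    """
    packed = [[pack_channels(pixel) for pixel in row] for row in image]
    height, width = len(packed), len(packed[0])
    rows = []
    for y in range(height - kernel_h + 1):
        for x in range(width - kernel_w + 1):
            patch = []
            for dy in range(kernel_h):
                for dx in range(kernel_w):
                    patch.extend(packed[y + dy][x + dx])
            rows.append(patch)
    return rows
```

Because the packing happens once per pixel instead of once per im2col entry, each pixel that appears in several overlapping patches is packed only once, and im2col copies whole words instead of individual values.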

Adhere to TensorFlow custom op naming scheme

External custom ops should be prefixed with a namespace according to https://github.com/tensorflow/community/blob/master/rfcs/20190726-custom-ops.md#out-of-tree-ops-must-be-namespaced

I think we should adhere to that.

add c++ testsuite

Investigate the support of dilations

Investigating the issues related to QuickNet TF lite conversion

T vs TOut template name
OpDef (see here): What is it? Is it only for builtin ops? If not, why is it not available for our bconv2d op?
What are the differences between TF 2.0 and the TF nightly build that make it possible to convert QuickNet?

TF lite Android deployment

We currently support Raspberry Pi and iOS but not Android.
This is because the build system for Android uses Bazel whereas the other targets work with a Makefile.
We have to figure out if we can have a Bazel file that has tflite android as a dependency but with our library added to it.

create bazel target to build tflite pip package and library

the original bash script to build a pip package for TF is available here: https://github.com/plumerai/compute-engine/blob/master/build_pip_pkg.sh
this script is used here in bazel to create the pip package for TF through Bazel commands:
https://github.com/plumerai/compute-engine/blob/master/BUILD

We can use the exact same approach to build the TF lite library and pip package by running one bazel command which executes our build script based on our custom makefile for TF lite. By doing so we don't need to manually run a couple of commands to build the TF lite library every time.

Speedup CI builds using caching

GitHub Actions now supports caching using actions/cache. I think this could significantly speed up our CI build, in particular the TFLite part.

Bazel supports different ways of caching build artifacts: either remote caching (the GCS backend is probably the easiest to set up) or simple disk caching using the --disk_cache=/path/to/build/cache flag, which might also be useful for local development.

C++ testsuite for TF lite code

We need a C++ testsuite for TF lite (similar to the one for TF) so we can test individual components (currently there are only python tests at the op level and not at the component level).

ReorderAxesOperator in ModelConverter

When converting a model to tflite, the converter will generally leave the tensors (inputs, weights) as they are, so that one can use the same code in tflite as in tf.
However, for Conv2D in particular, the most efficient memory layout for the weights is [Out channels, H, W, In channels]. In TF, however, they are stored as [H, W, In channels, Out channels]. One would therefore want to store the weights in the more efficient format. This could be done during the graph setup phase of tflite, but also already during the conversion process which is what they do. The way they accomplish this is as follows: when the converter sees a Conv2D op, it will insert a ReorderAxesOperator between the weights tensor and the Conv2D op. This is done in the file tflite/toco/import_tensorflow.cc (search for "conv2d" to find it). Then later in the converter, in particular in tflite/toco/graph_transformations/{convert,resolve}_reorder_axes.cc, if it finds a ReorderAxesOperator of which the input is constant (like weights instead of inputs) then it will apply that operator during the conversion process.

Therefore, all we have to do in our ModelConverter is add this ReorderAxesOperator ourselves, and the converter will do the conversion for us.
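
For illustration, the reordering that is applied to the constant weights amounts to a transpose from [H, W, In channels, Out channels] to [Out channels, H, W, In channels]. The sketch below shows that permutation on nested Python lists; the function name `hwio_to_ohwi` is made up for this example and is not part of the converter.

```python
def hwio_to_ohwi(weights):
    """Reorder nested-list weights from [H, W, In, Out] to [Out, H, W, In]."""
    height = len(weights)
    width = len(weights[0])
    in_channels = len(weights[0][0])
    out_channels = len(weights[0][0][0])
    return [
        [
            [
                [weights[h][w][i][o] for i in range(in_channels)]
                for w in range(width)
            ]
            for h in range(height)
        ]
        for o in range(out_channels)
    ]
```

After the reordering, all weights belonging to one output channel are contiguous, which is the property the text above describes as the more efficient layout for Conv2D.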

Add naive Implementation of the binary conv op

bitpacking
Im2col
BGEMM
putting everything together

add larq compute engine to PyPI

we need to reserve the name for larq-compute-engine. I guess we push this initial version which does not have any functionality before going completely public! what do you think @lgeiger ?

LICENSE

Before we make the compute engine public, we should make sure we include proper licenses for all the things we are using, which is for now tensorflow and all the things it depends on.
For example, daBNN includes all kinds of indirect dependencies in its LICENSE file such as Eigen and Flatbuffers.

update the readme file

moving the back transformation to the {0} space numbers in BN

Future optimizations

Things we know that might be faster, but that we left for now because it simplifies the code

Move the bitpack_array C++ loop inside the assembly part. Gives speedup of about 5% for the bitpack_array function on Raspberry Pi.
Remove the bitpack_matrix loop when there's no padding (i.e. bitpack_matrix becomes a single bitpack_array call).
Decouple bitpacking bitwidth and bgemm bitwidth: we can use the 64-bit bitpacking code for 32-bit bgemm code.
Cache RUY prepacking for weights

implement optimized BGEMM for ARM architecture

The current reference GEMM implementation does not support multi-threading, cache optimization techniques, SIMD, or many of the other optimization techniques used in efficient implementations of GEMM.

There are multiple projects that we can look at for inspiration and whose implementations we can extend for binary GEMM:

Eigen
OpenBLAS -> probably only for x86
TF lite RUY impl.
ARM ComputeLibrary
Google XNNPACK

TODO:

understanding the RUY codebase
extending the 8-bit assembly kernels (32/64-bit NEONs) for binary gemm with 8bit bitpacking
writing 32-bit assembly kernels (32/64-bit NEONs) for binary gemm with 32bit bitpacking
writing 64-bit assembly kernels (32/64-bit NEONs) for binary gemm with 64bit bitpacking

add c++ benchmark testsuite

Take a look at what TF uses internally. If the TF mechanism is too cumbersome to use, Google Benchmark looks like the right tool to add to the compute-engine infrastructure: https://github.com/google/benchmark

add binary fully connected operator

The binary fully connected operator is in essence a binary matrix-matrix multiplication (BGEMM). Assume that the input is M × N and the weight is N × K (M is the batch size, N is the number of neurons of the previous layer, K is the number of neurons of the current layer).
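
A minimal sketch of that M × N by N × K product for +1/-1 entries, using XOR and popcount on bitpacked rows and columns. The names `pack_signs` and `binary_dense` and the "-1 maps to bit 1" convention are assumptions of this example, not the LCE API.

```python
def pack_signs(row):
    """Pack a row of +1/-1 values into one integer (-1 -> bit 1)."""
    word = 0
    for position, value in enumerate(row):
        if value < 0:
            word |= 1 << position
    return word


def binary_dense(inputs, weights):
    """Compute inputs (M x N) times weights (N x K) for +1/-1 entries.

    Each dot product is N - 2 * popcount(a XOR b), because every
    mismatching bit contributes -1 and every matching bit +1.
    """
    n = len(weights)
    k = len(weights[0])
    packed_inputs = [pack_signs(row) for row in inputs]
    # Pack each weight column so that it lines up with an input row.
    packed_columns = [
        pack_signs([weights[j][c] for j in range(n)]) for c in range(k)
    ]
    return [
        [n - 2 * bin(a ^ b).count("1") for b in packed_columns]
        for a in packed_inputs
    ]


inputs = [[1, -1, 1], [-1, -1, -1]]      # M = 2, N = 3
weights = [[1, 1], [1, -1], [1, 1]]      # N = 3, K = 2
result = binary_dense(inputs, weights)   # [[1, 3], [-3, -1]]
```

The weights are constant at inference time, so in a real kernel the column packing would be done once ahead of time rather than on every call.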

investigating performance of cached pre-packed matrices in RUY

see tensorflow/tensorflow@68c33b1

Add 'packed' datatype.

It might make sense to have a packed datatype, or maybe packed8, packed32 etc which is really just uint8 / uint32 under the hood.
This way, we can have the bitpacking op that sends float to packed32, and then the bgemm op should only accept packedXX as inputs and produces intYY as outputs. Then it will properly give an error if you try to run bgemm on non-packed data. This way, if we have some operation that is supposed to support both normal uint8 and bitpacked things, it can distinguish those.

We can rename the current bitpacking operation to bsign because it is kind of the sign operator but which has a different result type, namely packedXX.

EDIT: The way to implement these in tensorflow (not lite) is by using their DT_VARIANT datatype which supports arbitrary C++ structs. So we can use something like

struct Packed32 {
    uint32_t x;
};

It is somewhat explained in this blogpost

add tests for bconv op with padding, strides and dilation

TF lite deployment options

The normal TF lite system consists of a C++ library which can be used directly (on android, ios, raspberry pi etc), and wrappers around it for python (laptops and raspberry pi), Java (Android apps) and iOS apps. All wrappers use the same C++ library as core.

There are several ways we can add our custom ops to this:

1. (current method) First compile the tflite C++ library. Then compile our own functions and append them to the library file. Now any wrapper (python, Android, iOS) can be built using the normal methods and will have our additions built in.
Pros: The build method is very easy, with no altering of the wrappers. Usage is also easy: just replace the normal tflite package with our tflite package.
Cons: Every time tflite updates, we have to release an updated package as well to pick up those changes.
2. Do not alter the C++ library. Instead compile our additions to a separate library, say lqce_lite.so, and modify the wrappers to make them add our custom ops.
Pros: None.
Cons: This is much more work than the first option and users still have to replace the tflite package with ours.
3. Python wrapper only. This can be done as an additional deployment option next to the existing ones. This option is marked as "experimental" in the tflite source code. For this option, we do not alter the C++ library. Instead we compile our additions to a separate lqce_lite.so. Then we write a single python file lqce_lite.py that only loads this lqce_lite.so file and nothing else. A user then downloads the original tflite pip package as well as our lqce_lite package and uses it as shown in the code snippet below.
Pros: The user can use the original tflite package. When the original tflite updates, we don't have to change anything. The user can use our custom ops together with other custom ops.
Cons: It seems that only the python wrapper has this InterpreterWithCustomOps class, so this doesn't work for Android and iOS.

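from tflite_runtime.interpreter import InterpreterWithCustomOps
import lqce_lite  # importing the module loads lqce_lite.so

# A model_path or model_content argument is also needed to load a real model.
interpreter = InterpreterWithCustomOps(custom_op_registerers=["lqce_lite_op_loader"])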

We could add option 3 to the larq compute engine.

add python benchmark testsuite

bash script to build TF lite does not work with multiple arguments

root@80187a34a9dc:/working_dir/compute-engine/larq_compute_engine/tflite/build# ./build_lqce.sh --cleanbuild --benchmark                                     
./build_lqce.sh: line 5: [: too many arguments

adding the extra ops required for linear trans. from binary to float to model converter.

fix the bazel build files

Add prebuilt TF lite static library to repo to speedup the build

add support of zero-padding (padding type 'SAME' in TF) in binary convolution op

doing zero-padding in {-1,1} space with im2col algorithm is not trivial because injected zeros in im2col buffer, depending on the implementation, will be interpreted as -1 or 1 which leads to wrong results.

Currently I propose the following solution: store the negative value of the corresponding kernel cell in each padding cell, so that after im2col the elements cancel out in the XNOR operation of bgemm. However, this solution results in a slower im2col algorithm, since the padding elements cannot simply be set to zero.
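
The underlying problem can be reproduced in a few lines. The sketch below only demonstrates the mismatch; it does not implement the proposed fix. The function names and the convention that a non-negative value (including the padding zero) is stored as bit 0, and therefore read back as +1, are assumptions of this example.

```python
def float_dot(patch, kernel):
    """Reference dot product where a padded 0 really contributes 0."""
    return sum(x * y for x, y in zip(patch, kernel))


def binarized_dot(patch, kernel):
    """Dot product as a bitpacked kernel sees it: only sign bits survive.

    Assumed convention: negative -> bit 1, everything else -> bit 0,
    so a padded 0 is indistinguishable from +1.
    """
    patch_bits = [1 if value < 0 else 0 for value in patch]
    kernel_bits = [1 if value < 0 else 0 for value in kernel]
    mismatches = sum(a != b for a, b in zip(patch_bits, kernel_bits))
    return len(patch) - 2 * mismatches


kernel = [1, -1, 1]
patch = [0, 1, -1]  # the first element is zero padding

expected = float_dot(patch, kernel)    # -2
actual = binarized_dot(patch, kernel)  # -1
```

In this example the two results differ by exactly the kernel weight that sits under the padded cell (+1), which is why the padded positions need special treatment.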

Benchmark a version of LCE_BMLA with fewer temporaries.

The current LCE_BMLA macro in PR #128 uses 4 temporary registers v26, v27, v28, v29.
We could do it with only 2 temporaries v26, v27, by reordering the instructions (same total number of instructions):

eor v26.16b,  Vr.16b,  Vl1.16b
cnt v26.16b, v26.16b
addv b26, v26.16b

eor v27.16b,  Vr.16b,  Vl2.16b
cnt v27.16b, v27.16b
addv b27, v27.16b
ins v26.s[1], v27.s[0]

eor v27.16b,  Vr.16b,  Vl3.16b
cnt v27.16b, v27.16b
addv b27, v27.16b
ins v26.s[2], v27.s[0]

eor v27.16b,  Vr.16b,  Vl4.16b
cnt v27.16b, v27.16b
addv b27, v27.16b
ins v26.s[3], v27.s[0]

If the eor, cnt, addv instructions can run in parallel (out-of-order cores) then this version might be slower. But maybe they can only run in parallel with memory read/write instructions, in which case this version is equally fast with fewer temporaries (we should benchmark that). Right now we don't need the extra registers, but we can keep this idea around for future versions.

TF lite converter weight bitpacking

Ideally we would like the TF lite converter to store the binary weights in a packed way so that the tflite model file stays small.

Another option might be that we write a tool ourselves that takes a converted .tflite model and does another pass on it, transforming the weights appropriately.

TFLite conversion bug in TF1

These are some notes and thoughts on tflite conversion.

When converting models, we could put either tf.sign or lqce.bsign in the graph. It doesn't matter which you choose because both of these become custom ops ("Sign" or "Bsign") and in tflite we can handle them appropriately. We can even register both strings in tflite and refer them to the same op.

There is still a weird issue about data types: in BinaryAlexNet near the end there is
BatchNorm -> Flatten -> Sign -> Dense

When using the tflite converter from tensorflow 2, this is all fine and the Flatten becomes a Reshape op.

When converting using tensorflow1 conversion, the flatten layer becomes a very weird thing with ( "Shape" , "StridedSlice" , "Pack" ) and a shortcut. Even though this should not be there, it revealed an issue:
When putting tf.sign after this flatten layer, the "Shape" op would go from float32 to int32.
When putting lqce.bsign after this flatten layer, the "Shape" op would go from float32 to float32, yielding an error because "StridedSlice" expected an int32.

So this shows that our bsign op is somehow different from tf.sign regarding type info.
Maybe we should try to investigate this difference.
As suggested by Arash, maybe we can try putting the bsign dtypes in a different order.

the pip package does not include .so library

>>> import larq_compute_engine as lce
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/larq_compute_engine/__init__.py", line 3, in <module>
    from larq_compute_engine.python.ops.compute_engine_ops import bgemm
  File "/usr/local/lib/python2.7/dist-packages/larq_compute_engine/python/ops/compute_engine_ops.py", line 25, in <module>
    resource_loader.get_path_to_datafile('_larq_compute_engine_ops.so'))
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/load_library.py", line 61, in load_op_library
    lib_handle = py_tf.TF_LoadLibrary(library_filename)
tensorflow.python.framework.errors_impl.NotFoundError: /usr/local/lib/python2.7/dist-packages/larq_compute_engine/python/ops/_larq_compute_engine_ops.so: cannot open shared object file: No such file or directory

simplify and document how to run the tests for TF lite ops

There are multiple steps needed to build the python tests for the TF lite ops and to figure out how to run them. This needs to be simplified and documented (the only documentation I could find was in the Github Actions instructions).

investigating TF lite SELECTIVE_REGISTRATION feature

The SELECTIVE_REGISTRATION feature in TF Lite allows creating a smaller TF Lite library by including only the ops used in the model being converted (see here). I couldn't find proper documentation in the official TF Lite guide, though.

flaky TF Lite inference tests

The tflite python inference tests are failing in PR #61 even though nothing related to those test cases is touched in that PR. So I guess the tests are flaky!

fusing bitpacking with im2col operation for bitpacking along channel dimension

Once #52 is finished, we can fuse the bitpacking with im2col. Technically, we bitpack the channels while we are copying them in im2col.

Use TFlite conv2d setup

We should use the multithreading setup of tflite, as well as their im2col algorithm and all that.
Basically we can copy almost everything and only change the bgemm part.

adding testsuite for ARM specific codebase

Code written in ARM assembly can only be tested on an ARM machine or an ARM emulator. Currently we have no testsuite for the ARM architecture that we can run either on a Raspberry Pi or in our Github Actions with an ARM emulator.

setup the CI system

Implement optimized bitpacking

This stackoverflow post might be useful for bitpacking using NEON.

add binary max pooling operator

bitwise OR can be used to get the max of a sequence of ones and zeros
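
A minimal sketch of this idea on bitpacked data, assuming each pixel is one integer whose bit c holds channel c and that a set bit represents the larger of the two values. The function name `binary_max_pool` is made up for this example.

```python
from functools import reduce
from operator import or_


def binary_max_pool(packed_window):
    """Max over a pooling window of bitpacked pixels.

    `packed_window` is a list of integers, one per pixel in the window.
    Bit c of the result is set if channel c is set in any pixel, which
    is the per-channel maximum under the assumed encoding.
    """
    return reduce(or_, packed_window, 0)


# A 2x2 pooling window with 4 channels per pixel.
window = [0b0101, 0b0011, 0b0000, 0b0001]
pooled = binary_max_pool(window)  # 0b0111
```

If the encoding is reversed, so that a set bit represents the smaller value (for example -1), the same maximum would be obtained with a bitwise AND instead.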

Use op hints to simplify model TFLite conversion

TF 2.1 introduced @tf.function(experimental_implements="larq.bsign") to annotate custom implementations that can be represented by standard TF ops, but would benefit from custom implementations during deployment.
It would be great to integrate this into larq to make our TFLite conversion simpler.

Using experimental_implements might be a bit early; at least I can't find how we can easily retrieve this information from TFLite (just from a grep in the code). It might be worth using tf.compat.v1.lite.OpHint for now, which (judging from the code) seems to be supported by both the TOCO and the new MLIR converter, and it should be easy to integrate as well.

The benefit of doing something like this is that we don't need to build and maintain high performance TF ops for things like BSign, which likely don't impact training performance much.

add a procedure to test each new op with TF lite

Converting ops from TF to TF Lite with the TF Lite converter does not work for custom ops, since each new op needs its own kernel and op implementation registered with the TF Lite OpResolver. We therefore need a procedure to add and test new custom ops. This requires building TF Lite with the new ops and using the binary in a C++ program which executes the op exactly the same way our users would use our ops in their mobile applications.