
compute-engine's Introduction

Larq Compute Engine

Larq Compute Engine (LCE) is a highly optimized inference engine for deploying extremely quantized neural networks, such as Binarized Neural Networks (BNNs). It currently supports various mobile platforms and has been benchmarked on a Pixel 1 phone and a Raspberry Pi. LCE provides a collection of hand-optimized TensorFlow Lite custom operators for supported instruction sets, developed in inline assembly or in C++ using compiler intrinsics. LCE leverages optimization techniques such as tiling to maximize the number of cache hits, vectorization to maximize the computational throughput, and multi-threading parallelization to take advantage of multi-core modern desktop and mobile CPUs.

Larq Compute Engine is part of a family of libraries for BNN development; you can also check out Larq for building and training BNNs and Larq Zoo for pre-trained models.

Key Features

  • Effortless end-to-end integration from training to deployment:

    • Tight integration of LCE with Larq and TensorFlow provides a smooth end-to-end training and deployment experience.

    • A collection of Larq pre-trained BNN models for common machine learning tasks is available in Larq Zoo and can be used out-of-the-box with LCE.

    • LCE provides a custom MLIR-based model converter which is fully compatible with TensorFlow Lite and performs additional network level optimizations for Larq models.

  • Lightning fast deployment on a variety of mobile platforms:

    • LCE enables high performance, on-device machine learning inference by providing hand-optimized kernels and network level optimizations for BNN models.

    • LCE currently supports 64-bit ARM-based mobile platforms such as Android phones and Raspberry Pi boards.

    • Thread parallelism support in LCE is essential for modern mobile devices with multi-core CPUs.

Performance

The table below presents single-threaded performance of Larq Compute Engine on different versions of a novel BNN model called QuickNet (trained on ImageNet dataset, released on Larq Zoo) on a Raspberry Pi 4 Model B at 1.5GHz (BCM2711) board, a Pixel 1 Android phone (2016), and a Mac Mini with M1 ARM CPU:

| Model | Top-1 Accuracy | RPi 4B 1.5GHz, 1 thread (ms) | Pixel 1, 1 thread (ms) | Mac Mini M1, 1 thread (ms) |
| --- | --- | --- | --- | --- |
| QuickNetSmall | 59.4% | 27.7 | 16.8 | 4.0 |
| QuickNet | 63.3% | 45.0 | 25.5 | 5.8 |
| QuickNetLarge | 66.9% | 77.0 | 44.2 | 9.9 |

For reference, dabnn (the other main BNN library) reports an inference time of 61.3 ms for Bi-RealNet (56.4% accuracy) on the Pixel 1 phone, while LCE achieves an inference time of 41.6 ms for Bi-RealNet on the same device. The dabnn authors furthermore present a modified version, BiRealNet-Stem, which achieves the same accuracy of 56.4% in 43.2 ms.

The following table presents multi-threaded performance of Larq Compute Engine on a Raspberry Pi 4 Model B at 1.5GHz (BCM2711) board, a Pixel 1 phone, and a Mac Mini with M1 ARM CPU:

| Model | Top-1 Accuracy | RPi 4B 1.5GHz, 4 threads (ms) | Pixel 1, 4 threads (ms) | Mac Mini M1, 4 threads (ms) |
| --- | --- | --- | --- | --- |
| QuickNetSmall | 59.4% | 12.1 | 8.9 | 1.8 |
| QuickNet | 63.3% | 20.8 | 12.6 | 2.5 |
| QuickNetLarge | 66.9% | 31.7 | 22.8 | 3.9 |

Benchmarked on 2021-06-11 (Pixel 1), 2021-06-13 (Mac Mini M1), and 2022-04-20 (RPi 4B) with LCE custom TFLite Model Benchmark Tool (see here) with XNNPack enabled and BNN models with randomized inputs.

Getting started

Follow these steps to deploy a BNN with LCE:

  1. Pick a Larq model

    You can use Larq to build and train your own model or pick a pre-trained model from Larq Zoo.

  2. Convert the Larq model

    LCE is built on top of TensorFlow Lite and uses the TensorFlow Lite FlatBuffer format to convert and serialize Larq models for inference. We provide an LCE Converter with additional optimization passes to increase the speed of execution of Larq models on supported target platforms.

  3. Build LCE

    The LCE documentation provides the build instructions for Android and 64-bit ARM-based boards such as Raspberry Pi. Please follow the provided instructions to create a native LCE build or cross-compile for one of the supported targets.

  4. Run inference

    LCE uses the TensorFlow Lite Interpreter to perform an inference. In addition to the already available built-in TensorFlow Lite operators, optimized LCE operators are registered to the interpreter to execute the Larq specific subgraphs of the model. An example to create and build an LCE compatible TensorFlow Lite interpreter for your own applications is provided here.

Next steps

About

Larq Compute Engine is being developed by a team of deep learning researchers and engineers at Plumerai to help accelerate both our own research and the general adoption of Binarized Neural Networks.

compute-engine's People

Contributors

adamhillier, andrewstanfordjason, arashb, cnugteren, dependabot-preview[bot], dependabot[bot], honglh, jamescook106, jneeven, leonoverweel, lgeiger, luciengaitskell, panickal-xmos, sib1, simonmaurer, timdebruin, tombana


compute-engine's Issues

Switch to a single build system (bazel)

Maintaining two build systems (Makefile and bazel) is too much work.

Dependency handling like downloading google-test and so on is easier in bazel. The azure system should also use the bazel system.

We can leave a Makefile that only includes a simple build command for the library (which assumes all the dependencies are already present), for easier development when you are locally recompiling the cpp code. The Makefile should not include things like building the python package; that should all be in the bazel system.

Bit-packing of input tensor along channel dimension

The current implementation in compute-engine works as follows:

  • im2col
  • bitpack the im2col matrix and filter matrix
  • BGEMM

However, it makes sense to bitpack the input matrix first along the channel dimension (you might need extra bitpadding) and do the im2col afterward (or use a fused bitpacking-im2col algorithm).
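To make the two orderings concrete, here is a minimal pure-Python sketch. It assumes a tiny 3 x 3 image with 4 channels, so that one pixel fits into a single packed word, and valid padding; `bitpack_channels` and `im2col` are illustrative helpers, not functions from compute-engine.

```python
def bitpack_channels(pixel):
    """Pack the +1/-1 channel values of one pixel into an integer (bit set = negative)."""
    word = 0
    for c, value in enumerate(pixel):
        if value < 0:
            word |= 1 << c
    return word


def im2col(image, kh, kw):
    """Valid-padding im2col over an H x W grid; each row is one flattened patch."""
    h, w = len(image), len(image[0])
    return [
        [image[y + dy][x + dx] for dy in range(kh) for dx in range(kw)]
        for y in range(h - kh + 1)
        for x in range(w - kw + 1)
    ]


# A 3 x 3 image with 4 channels and values in {-1, +1} (arbitrary test pattern).
image = [
    [[1 if (y + x + c) % 3 else -1 for c in range(4)] for x in range(3)]
    for y in range(3)
]

# Current order: im2col on the unpacked values, then bitpack every pixel of every patch.
pack_after = [[bitpack_channels(p) for p in patch] for patch in im2col(image, 2, 2)]

# Proposed order: bitpack every pixel once, then run im2col on the packed words.
packed_image = [[bitpack_channels(p) for p in row] for row in image]
pack_before = im2col(packed_image, 2, 2)
```

Both orders give the same packed patches in this setting, but packing first touches each of the 9 pixels once, while packing after im2col packs 16 patch entries because neighbouring patches overlap.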

TF lite Android deployment

We currently support Raspberry Pi and iOS but not Android.
This is because the build system for Android uses Bazel, whereas the other targets work with a Makefile.
We have to figure out if we can have a Bazel file that has tflite android as a dependency but with our library added to it.

create bazel target to build tflite pip package and library

the original bash script to build a pip package for TF is available here: https://github.com/plumerai/compute-engine/blob/master/build_pip_pkg.sh
this script is used here in bazel to create the pip package for TF through Bazel commands:
https://github.com/plumerai/compute-engine/blob/master/BUILD

we can use the exact same approach to build the TF lite library and pip package by running a single bazel command which executes our build script based on our custom makefile for TF lite. That way we don't need to manually run a couple of commands to build the TF lite library every time.

C++ testsuite for TF lite code

We need a C++ testsuite for TF lite (similar to the one for TF) so we can test individual components (currently there are only Python tests at the op level, not at the component level).

ReorderAxesOperator in ModelConverter

When converting a model to tflite, the converter will generally leave the tensors (inputs, weights) as they are, so that one can use the same code in tflite as in tf.
However, for Conv2D in particular, the most efficient memory layout for the weights is [Out channels, H, W, In channels], whereas TF stores them as [H, W, In channels, Out channels]. One would therefore want to store the weights in the more efficient format. This could be done during the graph setup phase of tflite, but the TF lite converter already does it during the conversion process. It works as follows: when the converter sees a Conv2D op, it inserts a ReorderAxesOperator between the weights tensor and the Conv2D op. This is done in the file tflite/toco/import_tensorflow.cc (search for "conv2d" to find it). Later in the converter, in particular in tflite/toco/graph_transformations/{convert,resolve}_reorder_axes.cc, if it finds a ReorderAxesOperator whose input is constant (like weights, as opposed to inputs), it applies that operator during the conversion process.

Therefore, all we have to do in our ModelConverter, is add this ReorderAxesOperator ourselves, and the converter will do the conversion for us.
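For illustration, the axis reordering itself amounts to the following permutation. This is a pure-Python sketch on nested lists; `hwio_to_ohwi` is a hypothetical helper name, not the converter's actual code.

```python
def hwio_to_ohwi(weights):
    """Reorder nested-list weights from [H, W, In, Out] to [Out, H, W, In]."""
    h = len(weights)
    w = len(weights[0])
    n_in = len(weights[0][0])
    n_out = len(weights[0][0][0])
    return [
        [
            [[weights[y][x][c][k] for c in range(n_in)] for x in range(w)]
            for y in range(h)
        ]
        for k in range(n_out)
    ]


# Each entry stores its own (y, x, in, out) index so the permutation is easy to check.
hwio = [
    [[[(y, x, c, k) for k in range(5)] for c in range(4)] for x in range(3)]
    for y in range(2)
]
ohwi = hwio_to_ohwi(hwio)
```

After the reorder, all input-channel values for one output channel and one kernel position sit next to each other, which is the layout the text above describes as most efficient.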

add larq compute engine to PyPI

we need to reserve the name for larq-compute-engine. I guess we push this initial version which does not have any functionality before going completely public! what do you think @lgeiger ?

LICENSE

Before we make the compute engine public, we should make sure we include proper licenses for all the things we are using, which is for now tensorflow and all the things it depends on.
For example, daBNN includes all kinds of indirect dependencies in its LICENSE file such as Eigen and Flatbuffers.

Future optimizations

Things we know that might be faster, but that we left for now because it simplifies the code

  • Move the bitpack_array C++ loop inside the assembly part. Gives speedup of about 5% for the bitpack_array function on Raspberry Pi.
  • Remove the bitpack_matrix loop when there's no padding (i.e. bitpack_matrix becomes a single bitpack_array call).
  • Decouple bitpacking bitwidth and bgemm bitwidth: we can use the 64-bit bitpacking code for 32-bit bgemm code.
  • Cache RUY prepacking for weights

implement optimized BGEMM for ARM architecture

current reference GEMM implementation does not support multi-threading, cache optimization techniques, SIMD and many other optimization techniques used in efficient impl. of GEMM.

There are multiple projects that we can look at for inspiration and whose impl. we can extend for binary GEMM:

TODO:

  • understanding the RUY codebase
  • extending the 8-bit assembly kernels (32/64-bit NEONs) for binary gemm with 8bit bitpacking
  • writing 32-bit assembly kernels (32/64-bit NEONs) for binary gemm with 32bit bitpacking
  • writing 64-bit assembly kernels (32/64-bit NEONs) for binary gemm with 64bit bitpacking

add binary fully connected operator

A binary fully connected operator is in essence a binary matrix-matrix multiplication (BGEMM). Assume that the input is M × N and the weight is N × K (M is the batch size, N is the number of neurons of the previous layer, K is the number of neurons of the current layer).
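As a rough sketch of what this means in terms of bit operations: if both operands hold values in {-1, +1} and are packed with one bit per value, each output entry is N minus twice the popcount of the XOR of the two packed rows. The helper names and the bit convention (bit set = negative) below are assumptions for illustration only.

```python
def pack(values):
    """Pack +1/-1 values into one integer (bit set = negative value)."""
    word = 0
    for i, value in enumerate(values):
        if value < 0:
            word |= 1 << i
    return word


def binary_dense(inputs, weights):
    """inputs: M x N, weights: N x K, all entries +1/-1. Returns the M x K integer product."""
    n = len(weights)
    k = len(weights[0])
    packed_inputs = [pack(row) for row in inputs]
    # Pack each weight column so it lines up with a packed input row.
    packed_weights = [pack([weights[r][col] for r in range(n)]) for col in range(k)]
    # Equal bits contribute +1 and differing bits -1, so dot = n - 2 * popcount(a XOR b).
    return [
        [n - 2 * bin(a ^ b).count("1") for b in packed_weights]
        for a in packed_inputs
    ]


inputs = [[1, -1, 1, 1, -1], [-1, -1, 1, -1, 1]]            # M = 2, N = 5
weights = [[1, -1, 1], [1, 1, -1], [-1, 1, 1], [1, -1, -1], [-1, -1, 1]]  # N = 5, K = 3
outputs = binary_dense(inputs, weights)
```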

Add 'packed' datatype.

It might make sense to have a packed datatype, or maybe packed8, packed32 etc which is really just uint8 / uint32 under the hood.
This way, we can have the bitpacking op that sends float to packed32, and then the bgemm op should only accept packedXX as inputs and produces intYY as outputs. Then it will properly give an error if you try to run bgemm on non-packed data. This way, if we have some operation that is supposed to support both normal uint8 and bitpacked things, it can distinguish those.

We can rename the current bitpacking operation to bsign because it is kind of the sign operator but which has a different result type, namely packedXX.

EDIT: The way to implement these in tensorflow (not lite) is by using its DT_VARIANT datatype, which supports arbitrary C++ structs. So we can use something like

#include <cstdint>

// One 32-bit word of bitpacked values
struct Packed32 {
    uint32_t x;
};

It is somewhat explained in this blogpost

TF lite deployment options

The normal TF lite system consists of a C++ library which can be used directly (on android, ios, raspberry pi etc), and wrappers around it for python (laptops and raspberry pi), Java (Android apps) and iOS apps. All wrappers use the same C++ library as core.

There are several ways we can add our custom ops to this:

  1. (current method) First compile the tflite C++ library. Then compile our own functions and append them to the library file. Now any wrapper (python, Android, iOS) can be built using the normal methods and will have our additions builtin.
    Pros: build method is very easy, no altering of the wrappers. Usage is also easy, just replace the normal tflite package by our tflite package.
    Cons: every time tflite updates, we have to release an updated package as well to have those changes.

  2. Do not alter the C++ library. Instead compile our additions to a separate library, say lqce_lite.so. Instead modify the wrappers and make them add our custom ops.
    Pros: None
    Cons: This is much more work than the first option and users still have to replace the tflite package by ours.

  3. Python wrapper only, can be done as an additional deployment option next to the existing ones. This option is marked as "experimental" in the tflite source code. For this option, we do not alter the C++ library. Instead we compile our additions to a separate lqce_lite.so . Then write a single python file lqce_lite.py that only loads this lqce_lite.so file and nothing else. A user then downloads the original tflite pip package as well as our lqce_lite package and uses it as shown in the code snippet below.
    Pros: User can use the original tflite package. When the original tflite updates, we don't have to change anything. User can use our custom ops together with other custom ops.
    Cons: It seems that only the python wrapper has this InterpreterWithCustomOps thing, so this doesn't work for Android and iOS.

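from tflite_runtime.interpreter import InterpreterWithCustomOps
import lqce_lite  # the proposed package that loads lqce_lite.so

# "model.tflite" is a placeholder path for a converted model
interpreter = InterpreterWithCustomOps(
    custom_op_registerers=["lqce_lite_op_loader"], model_path="model.tflite"
)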

We could add option 3 to the larq compute engine.

add support of zero-padding (padding type 'SAME' in TF) in binary convolution op

Doing zero-padding in {-1,1} space with the im2col algorithm is not trivial, because zeros injected into the im2col buffer will, depending on the implementation, be interpreted as -1 or 1, which leads to wrong results.

Currently I propose the following solution: store the negative value of the corresponding kernel cell in each padding cell, so that after im2col the elements cancel out in the XNOR operation of bgemm. However, this solution results in a slower im2col algorithm, since the padding elements cannot simply be set to zero!
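The sketch below illustrates the problem on a 1D convolution and shows one possible fix. Note that it is a different technique from the proposal above: it pads with +1 and subtracts a correction term computed from the kernel afterwards. All names are hypothetical and the example is pure Python for clarity, not an im2col/bgemm implementation.

```python
def conv1d_valid(x, w):
    """Plain 1D correlation without padding."""
    k = len(w)
    return [sum(x[i + j] * w[j] for j in range(k)) for i in range(len(x) - k + 1)]


def pad(x, n, value):
    """Pad a list with n copies of value on both sides."""
    return [value] * n + list(x) + [value] * n


x = [1, -1, -1, 1, 1, -1]   # binary activations
w = [1, -1, 1]              # binary kernel
p = len(w) // 2             # 'SAME' padding for an odd kernel size

# What 'SAME' padding should compute: real zeros at the border.
reference = conv1d_valid(pad(x, p, 0), w)

# What a {-1, 1} representation computes if the padding bits decode to +1.
one_padded = conv1d_valid(pad(x, p, 1), w)

# Correction: for each output, the sum of the kernel cells that overlap the padding.
padding_mask = pad([0] * len(x), p, 1)
correction = conv1d_valid(padding_mask, w)
corrected = [a - b for a, b in zip(one_padded, correction)]
```

The correction only depends on the kernel and the padding geometry, not on the input values, so under this scheme it could be computed once per layer.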

Benchmark a version of LCE_BMLA with fewer temporaries.

The current LCE_BMLA macro in PR #128 uses 4 temporary registers v26, v27, v28, v29.
We could do it with only 2 temporaries v26, v27, by reordering the instructions (same total number of instructions):

eor v26.16b,  Vr.16b,  Vl1.16b
cnt v26.16b, v26.16b
addv b26, v26.16b

eor v27.16b,  Vr.16b,  Vl2.16b
cnt v27.16b, v27.16b
addv b27, v27.16b
ins v26.s[1], v27.s[0]

eor v27.16b,  Vr.16b,  Vl3.16b
cnt v27.16b, v27.16b
addv b27, v27.16b
ins v26.s[2], v27.s[0]

eor v27.16b,  Vr.16b,  Vl4.16b
cnt v27.16b, v27.16b
addv b27, v27.16b
ins v26.s[3], v27.s[0]

If the eor, cnt and addv instructions can run in parallel (out-of-order cores), then this version might be slower. But maybe they can only run in parallel with memory read/write instructions, in which case this version is equally fast with fewer temporaries (we should benchmark that). Right now we don't need the extra registers, but we can keep this idea around for future versions.
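For readers less familiar with NEON, here is a small Python model of what the four eor / cnt / addv groups compute, assuming 128-bit registers represented as 16-byte values. It only models the arithmetic result and says nothing about scheduling or performance; the function names are made up for this sketch.

```python
def xor_popcount(vr: bytes, vl: bytes) -> int:
    """Model of one eor / cnt / addv group on two 128-bit registers."""
    assert len(vr) == 16 and len(vl) == 16
    xored = bytes(r ^ l for r, l in zip(vr, vl))    # eor: bitwise XOR
    per_byte = [bin(b).count("1") for b in xored]   # cnt: popcount of each byte
    return sum(per_byte)                            # addv: add across the 16 bytes


def bmla_lanes(vr: bytes, vls: list) -> list:
    """Model of all four groups; the ins instructions collect the results into four lanes."""
    return [xor_popcount(vr, vl) for vl in vls]


vr = bytes(range(16))
lanes = bmla_lanes(vr, [vr, bytes(16), bytes([0xFF] * 16), bytes([0x01] * 16)])
```

Each lane value is at most 128, so it fits in the single byte that addv writes.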

TF lite converter weight bitpacking

Ideally we would like the TF lite converter to store the binary weights in a packed way so that the tflite model file stays small.

Another option might be that we write a tool ourselves that takes a converted .tflite model and does another pass on it, transforming the weights appropriately.
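To give a feeling for the size difference, here is a minimal sketch of packing binary weights into one bit each and comparing that with float32 storage. It only illustrates the packing arithmetic; it does not read or write the .tflite FlatBuffer format, and `pack_signs` / `unpack_signs` are hypothetical helper names.

```python
import struct


def pack_signs(weights):
    """Store one bit per weight (bit set = negative), 8 weights per byte."""
    out = bytearray((len(weights) + 7) // 8)
    for i, value in enumerate(weights):
        if value < 0:
            out[i // 8] |= 1 << (i % 8)
    return bytes(out)


def unpack_signs(packed, count):
    """Recover the +1.0 / -1.0 weights from the packed bytes."""
    return [-1.0 if (packed[i // 8] >> (i % 8)) & 1 else 1.0 for i in range(count)]


weights = [1.0, -1.0, -1.0, 1.0] * 16                        # 64 binary weights
as_float32 = struct.pack(f"<{len(weights)}f", *weights)      # 4 bytes per weight
as_bits = pack_signs(weights)                                # 1 bit per weight
```

For these 64 weights the float32 buffer is 32 times larger than the packed one.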

TFLite conversion bug in TF1

These are some notes and thoughts on tflite conversion.

When converting models, we could put either tf.sign or lqce.bsign in the graph. It doesn't matter which you choose because both of these become custom ops ("Sign" or "Bsign") and in tflite we can handle them appropriately. We can even register both strings in tflite and refer them to the same op.

There is still a weird issue about data types: in BinaryAlexNet near the end there is
BatchNorm -> Flatten -> Sign -> Dense

When using the tflite converter from tensorflow 2, this is all fine and the Flatten becomes a Reshape op.

When converting using tensorflow1 conversion, the flatten layer becomes a very weird thing with ( "Shape" , "StridedSlice" , "Pack" ) and a shortcut. Even though this should not be there, it revealed an issue:
When putting tf.sign after this flatten layer, the "Shape" op would go from float32 to int32.
When putting lqce.bsign after this flatten layer, the "Shape" op would go from float32 to float32, yielding an error because "StridedSlice" expected an int32.

So this shows that our bsign op is somehow different from tf.sign regarding type info.
Maybe we should try to investigate this difference.
As suggested by Arash, maybe we can try putting the bsign dtypes in a different order.

the pip package does not include .so library

>>> import larq_compute_engine as lce
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/larq_compute_engine/__init__.py", line 3, in <module>
    from larq_compute_engine.python.ops.compute_engine_ops import bgemm
  File "/usr/local/lib/python2.7/dist-packages/larq_compute_engine/python/ops/compute_engine_ops.py", line 25, in <module>
    resource_loader.get_path_to_datafile('_larq_compute_engine_ops.so'))
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/load_library.py", line 61, in load_op_library
    lib_handle = py_tf.TF_LoadLibrary(library_filename)
tensorflow.python.framework.errors_impl.NotFoundError: /usr/local/lib/python2.7/dist-packages/larq_compute_engine/python/ops/_larq_compute_engine_ops.so: cannot open shared object file: No such file or directory

flaky TF Lite inference tests

The tflite python inference tests are failing in PR #61 even though nothing related to those test cases is touched in this PR. So I guess the tests are flaky!

Use TFlite conv2d setup

We should use the multithreading setup of tflite, as well as their im2col algorithm and all that.
Basically we can copy almost everything and only change the bgemm part.

adding testsuite for ARM specific codebase

Code written in ARM assembly can only be tested on an ARM machine or an ARM emulator. Currently we have no testsuite for the ARM architecture that we can run either on a Raspberry Pi or in our GitHub Actions with an ARM emulator.

Use op hints to simplify model TFLite conversion

TF 2.1 introduced @tf.function(experimental_implements="larq.bsign") to annotate custom implementations that can be represented by standard TF ops, but would benefit from custom implementations during deployment.
It would be great to integrate this into larq to make our TFLite conversion simpler.

Using experimental_implements might be a bit early; at least I can't find how we can easily retrieve this information from TFLite (just from a grep in the code). It might be worth using tf.compat.v1.lite.OpHint for now, which (judging from the code) seems to be supported by both the TOCO and the new MLIR converter, and should be easy to integrate as well.

The benefit of doing something like this is that we don't need to build and maintain high performance TF ops for things like BSign, which likely don't impact training performance much.

add a procedure to test each new op with TF lite

Converting the ops from TF to TF Lite with the TF lite converter does not work for custom ops, since each new op needs its own kernel and op implementation registered with the TF lite OpResolver. We therefore need a procedure to add and test new custom ops. This requires building TF lite with the new ops and using the binary in a C++ program which executes the op exactly the same way our users would use our ops in their mobile applications.
