
qnnpack's Introduction

QNNPACK

QNNPACK (Quantized Neural Networks PACKage) is a mobile-optimized library for low-precision, high-performance neural network inference. QNNPACK provides implementations of common neural network operators on quantized 8-bit tensors.

QNNPACK is not intended to be used directly by machine learning researchers; instead, it provides low-level performance primitives for high-level deep learning frameworks. As of today, QNNPACK is integrated into PyTorch 1.0 via the Caffe2 graph representation.

Operator Coverage

The operators that are currently implemented, or planned for implementation, are listed below:

  • 2D Convolution
  • 2D Deconvolution
  • Channel Shuffle
  • Fully Connected
  • Locally Connected
  • 2D Max Pooling
  • 2D Average Pooling
  • Global Average Pooling
  • Sigmoid
  • Leaky ReLU
  • Clamp (can be used for ReLU, ReLU6 if it is not fused in another operator)
  • SoftArgMax (aka SoftMax)
  • Group Normalization

Building

QNNPACK provides standard CMake-based build scripts.

Native compilation

It is recommended to use the scripts/build-local.sh script to build QNNPACK for the host machine.

Cross-compilation for Android

To cross-compile for Android, set the $ANDROID_NDK environment variable to the path of the Android NDK directory (e.g. /opt/android-ndk-r15c) and use one of the scripts from the table below:

ABI          Build script                     Restrictions
armeabi-v7a  scripts/build-android-armv7.sh   Requires CPU with ARM NEON
arm64-v8a    scripts/build-android-arm64.sh
x86          scripts/build-android-x86.sh

Notes:

  • On armeabi-v7a qnnp_initialize will fail with qnnp_status_unsupported_hardware if the mobile CPU does not support ARM NEON. Don't set -DANDROID_ARM_NEON=1 for QNNPACK compilation as it can make qnnp_initialize crash on CPUs without ARM NEON.
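
  For reference, here is a minimal sketch of how a caller might handle that case at runtime; it only assumes the qnnp_initialize / qnnp_status API from qnnpack.h and that QNNPACK is on the include and link paths.

  #include <stdio.h>
  #include <qnnpack.h>

  int main(void) {
    /* qnnp_initialize detects CPU features at runtime; on armeabi-v7a it
       reports qnnp_status_unsupported_hardware when ARM NEON is missing. */
    enum qnnp_status status = qnnp_initialize();
    if (status == qnnp_status_unsupported_hardware) {
      fprintf(stderr, "This CPU lacks ARM NEON; fall back to another backend.\n");
      return 1;
    }
    if (status != qnnp_status_success) {
      fprintf(stderr, "qnnp_initialize failed with status %d\n", (int) status);
      return 1;
    }
    /* ... create, set up, and run QNNPACK operators ... */
    return 0;
  }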

Cross-compilation for iOS

To cross-compile for iOS, clone ios-cmake, set the $IOS_CMAKE_TOOLCHAIN_FILE environment variable to the path of the ios.toolchain.cmake file in ios-cmake, and use one of the scripts from the table below:

Architecture  Build script                  Notes
armv7         scripts/build-ios-armv7.sh    iPhone 3GS/4/4S
armv7s        scripts/build-ios-armv7s.sh   iPhone 5 and newer
arm64         scripts/build-ios-arm64.sh    iPhone 5S and newer
arm64e        scripts/build-ios-arm64e.sh   iPhone XS/XR
i386          scripts/build-ios-i386.sh     iPhone Simulator (32-bit)
x86_64        scripts/build-ios-x86_64.sh   iPhone Simulator (64-bit)

End-to-End Benchmarking

The Caffe2 backend of PyTorch 1.0 natively integrates QNNPACK and provides a pre-trained quantized MobileNet v2 model. Below are instructions for benchmarking this model end-to-end with QNNPACK.

Raspberry Pi 2 or 3

# Clone PyTorch 1.0 repo
git clone --recursive https://github.com/pytorch/pytorch.git
cd pytorch

# Optional: update QNNPACK submodule to latest revision
git submodule update --remote third_party/QNNPACK

# Build Caffe2 (including binaries) for the host system
# Use only 1 thread for build to avoid out-of-memory failures
MAX_JOBS=1 scripts/build_local.sh -DBUILD_BINARY=ON -DBUILD_PYTHON=OFF \
	-DUSE_OBSERVERS=OFF -DUSE_DISTRIBUTED=OFF

# Download model weights
wget https://s3.amazonaws.com/download.caffe2.ai/models/mobilenet_v2_1.0_224_quant/init_net.pb

# Download model graph
wget https://s3.amazonaws.com/download.caffe2.ai/models/mobilenet_v2_1.0_224_quant/predict_net.pb

# Run speed benchmark with 50 warm-up iterations and 10 measurement iterations
build/bin/speed_benchmark --net predict_net.pb --init_net init_net.pb \
	--input data --input_dims 1,3,224,224 --input_type float \
	--warmup 50 --iter 10

ARMv7 (32-bit) Android

# Clone PyTorch 1.0 repo
git clone --recursive https://github.com/pytorch/pytorch.git
cd pytorch

# Optional: update QNNPACK submodule to latest revision
git submodule update --remote third_party/QNNPACK

# Build Caffe2 (including binaries) for Android, and push to device
scripts/build_android.sh -DANDROID_TOOLCHAIN=clang -DBUILD_BINARY=ON
adb push build_android/bin/speed_benchmark /data/local/tmp/speed_benchmark

# Download model weights and copy them to Android device
wget https://s3.amazonaws.com/download.caffe2.ai/models/mobilenet_v2_1.0_224_quant/init_net.pb
adb push init_net.pb /data/local/tmp/init_net.pb

# Download model graph and copy it to Android device
wget https://s3.amazonaws.com/download.caffe2.ai/models/mobilenet_v2_1.0_224_quant/predict_net.pb
adb push predict_net.pb /data/local/tmp/predict_net.pb

# Run speed benchmark with 50 warm-up iterations and 10 measurement iterations
adb shell /data/local/tmp/speed_benchmark \
	--net /data/local/tmp/predict_net.pb \
	--init_net /data/local/tmp/init_net.pb \
	--input data --input_dims 1,3,224,224 --input_type float \
	--warmup 50 --iter 10

ARM64 (64-bit) Android

# Clone PyTorch 1.0 repo
git clone --recursive https://github.com/pytorch/pytorch.git
cd pytorch

# Optional: update QNNPACK submodule to latest revision
git submodule update --remote third_party/QNNPACK

# Build Caffe2 (including binaries) for Android, and push to device
scripts/build_android.sh -DANDROID_ABI=arm64-v8a -DANDROID_TOOLCHAIN=clang -DBUILD_BINARY=ON
adb push build_android/bin/speed_benchmark /data/local/tmp/speed_benchmark

# Download model weights and copy them to Android device
wget https://s3.amazonaws.com/download.caffe2.ai/models/mobilenet_v2_1.0_224_quant/init_net.pb
adb push init_net.pb /data/local/tmp/init_net.pb

# Download model graph and copy it to Android device
wget https://s3.amazonaws.com/download.caffe2.ai/models/mobilenet_v2_1.0_224_quant/predict_net.pb
adb push predict_net.pb /data/local/tmp/predict_net.pb

# Run speed benchmark with 50 warm-up iterations and 10 measurement iterations
adb shell /data/local/tmp/speed_benchmark \
	--net /data/local/tmp/predict_net.pb \
	--init_net /data/local/tmp/init_net.pb \
	--input data --input_dims 1,3,224,224 --input_type float \
	--warmup 50 --iter 10

PEP (Performance Evaluation Platform) Method

Facebook AI Performance Evaluation Platform (PEP) is a framework- and backend-agnostic benchmarking platform for comparing machine learning inference runtime metrics across a set of models and a variety of backends.

We use PEP to produce the results reported in our blog post.

With an ARMv7 device connected:

# Clone PyTorch 1.0 repo
mkdir ~/Code && cd ~/Code
git clone --recursive https://github.com/pytorch/pytorch.git
cd pytorch

# Optional: update QNNPACK submodule to latest revision
git submodule update --remote third_party/QNNPACK

# Clone PEP repo
cd ~/Code
git clone --recursive https://github.com/facebook/FAI-PEP.git aibench
cd aibench

# Run the PEP benchmark with the given model specification.
# Try changing the command to use other specifications!
# The first compile could take 20+ minutes
./benchmarking/run_bench.py \
  --platform android \
  -b ~/Code/aibench/specifications/models/caffe2/mobilenet_v2/mobilenet_v2_quant.json \
  --repo_dir ~/Code/pytorch \
  --frameworks_dir ~/Code/aibench/specifications/frameworks --framework caffe2

Acknowledgements

QNNPACK is developed by Marat Dukhan, Yiming Wu, Hao Lu, and Bert Maher. We thank Andrew Tulloch and Yangqing Jia for advice during the development of QNNPACK.

License

QNNPACK is BSD licensed, as found in the LICENSE file.

qnnpack's People

Contributors

alexbdv, boguscoder, brettkoonce, dreiss, harouwu, hlu1, maratyszcza, supriyar, zhenhuaw-me

qnnpack's Issues

Do you have some detailed resources about quantization?

r[i] = scale * (q[i] - zero_point)

How is the scale calculated, and does it need calibration?
Are there separate scales for the input, the weights, and the output?
Do different inputs have different scales, and does the output have its own scale too?
How is this handled?

So, do you have some detailed resources about quantization?
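
For context, and not as QNNPACK-specific guidance: in the usual asymmetric 8-bit scheme, each tensor (input, weights, output) gets its own scale and zero_point derived from the float range observed during calibration (or learned during quantization-aware training). A small illustrative sketch with made-up ranges:

#include <math.h>
#include <stdint.h>
#include <stdio.h>

/* Illustrative only: derive (scale, zero_point) for an 8-bit tensor from
   its observed float range [min, max] (assumed max > min), keeping 0.0
   exactly representable. */
static void choose_quantization_params(float min, float max,
                                       float* scale, uint8_t* zero_point) {
  if (min > 0.0f) min = 0.0f;  /* the range must include zero */
  if (max < 0.0f) max = 0.0f;
  *scale = (max - min) / 255.0f;
  const float zp = -min / *scale;  /* real-valued zero point */
  *zero_point = (uint8_t) lrintf(fminf(fmaxf(zp, 0.0f), 255.0f));
}

int main(void) {
  float scale;
  uint8_t zero_point;
  choose_quantization_params(-1.5f, 2.5f, &scale, &zero_point);
  /* r = scale * (q - zero_point) then approximately recovers the floats. */
  printf("scale = %f, zero_point = %u\n", (double) scale, zero_point);
  return 0;
}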

No documentation? or how to execute a layer in QNNPACK

There doesn't seem to be documentation of any form. It doesn't need to be extensive; just something that gives a brief introduction to how QNNPACK is meant to be used.

For example, I want to try out using QNNPACK for executing a fully connected layer. To try and figure out how to do that, I've been looking at caffe2's int8_fc_op.h.
Have I correctly understood that the first time I want to execute the layer, I need to call qnnp_create_fully_connected_nc_q8, then qnnp_setup_fully_connected_nc_q8, and then qnnp_run_operator ? And if I want to execute it again, I do the same but without calling qnnp_setup_fully_connected_nc_q8 (if the batch size has not changed)?

Also, why is inputPtr offset by 8 (like here)?
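
For what it's worth, the create / setup / run flow described above would look roughly like the sketch below. This is a hedged sketch, not official documentation: the argument order of the create and setup calls is paraphrased and should be verified against qnnpack.h, and all sizes and quantization parameters are made up for illustration.

#include <stdint.h>
#include <stdlib.h>
#include <qnnpack.h>

int main(void) {
  if (qnnp_initialize() != qnnp_status_success) return 1;

  const size_t batch = 1, input_channels = 128, output_channels = 64;
  uint8_t* kernel = calloc(output_channels * input_channels, 1);
  int32_t* bias   = calloc(output_channels, sizeof(int32_t));
  uint8_t* input  = calloc(batch * input_channels, 1);
  uint8_t* output = calloc(batch * output_channels, 1);

  /* 1. Create the operator once; the weights are packed inside it. */
  qnnp_operator_t fc = NULL;
  if (qnnp_create_fully_connected_nc_q8(
          input_channels, output_channels,
          /* input_zero_point  */ 128, /* input_scale  */ 0.5f,
          /* kernel_zero_point */ 128, /* kernel_scale */ 0.5f,
          kernel, bias,
          /* output_zero_point */ 128, /* output_scale */ 1.0f,
          /* output_min */ 0, /* output_max */ 255,
          /* flags */ 0, &fc) != qnnp_status_success) return 1;

  /* 2. Set up with batch size, input/output pointers and channel strides.
        Repeat this step only when those change (e.g. a new batch size). */
  if (qnnp_setup_fully_connected_nc_q8(
          fc, batch, input, input_channels, output, output_channels)
      != qnnp_status_success) return 1;

  /* 3. Run; pass NULL to execute on the calling thread, or a pthreadpool
        handle for multi-threaded execution. Re-run as often as needed. */
  if (qnnp_run_operator(fc, NULL) != qnnp_status_success) return 1;

  qnnp_delete_operator(fc);
  free(kernel); free(bias); free(input); free(output);
  return 0;
}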

How to build dependencies separately

I'm trying to add a package for QNNPACK to the Spack package manager. I see that QNNPACK downloads its own dependencies, and that this can be avoided by setting *_SOURCE_DIR via cmake. Is there a way to point to an existing external installation instead of a source directory so that Spack doesn't need to rebuild all of these dependencies? Spack is designed to work on air-gapped supercomputers that don't have internet access, so I can't have it download anything at build time.

Support for nn.ConvTranspose2d

Hi - is there a plan to support nn.ConvTranspose2d for both QNNPACK and FBGEMM? I am using it in my model. After applying QNNPACK quantization techniques (both post-training static quantization and QAT), I am seeing a huge drop in accuracy and not much improvement in speed. I suspect it could be due to ConvTranspose2d.

I am using a Unet model for semantic segmentation. On the upsampling side there is a series of layers, each with a ConvTranspose2d operation. In the quantized model, it has to keep dequantizing weights/activations to float for the ConvTranspose2d layer and then back to quantized form for the subsequent convolution/BN/ReLU layers.

Compile error

from /home//Desktop/QNNPACK-master/bench/hgemm.cc:24:
/home//Desktop/QNNPACK-master/src/qnnpack/scalar-utils.h:70:19: error: exponent has no digits
assert(scale >= 0x1.0p-32f);
^
/home//Desktop/QNNPACK-master/src/qnnpack/requantization.h:30:19: error: exponent has no digits
assert(scale >= 0x1.0p-32f);
^
/home//Desktop/QNNPACK-master/src/qnnpack/requantization.h:208:19: error: exponent has no digits
assert(scale >= 0x1.0p-32f);
^
/home//Desktop/QNNPACK-master/src/qnnpack/requantization.h:276:19: error: exponent has no digits
assert(scale >= 0x1.0p-32f);
^
/home//Desktop/QNNPACK-master/src/qnnpack/requantization.h:336:28: error: exponent has no digits
assert(a_output_scale >= 0x1.0p-14f);
^
/home//Desktop/QNNPACK-master/src/qnnpack/requantization.h:337:28: error: exponent has no digits
assert(b_output_scale >= 0x1.0p-14f);
^
/home//Desktop/QNNPACK-master/src/qnnpack/requantization.h:338:27: error: exponent has no digits
assert(a_output_scale < 0x1.0p+8f);
^
/home//Desktop/QNNPACK-master/src/qnnpack/requantization.h:339:27: error: exponent has no digits
assert(b_output_scale < 0x1.0p+8f);
^
/home//Desktop/QNNPACK-master/src/qnnpack/requantization.h:343:30: error: exponent has no digits
assert(max_output_scale >= 0x1.0p-14f);
^
/home//Desktop/QNNPACK-master/src/qnnpack/requantization.h:344:29: error: exponent has no digits
assert(max_output_scale < 0x1.0p+8f);
^
/home//Desktop/QNNPACK-master/src/qnnpack/requantization.h:425:28: error: exponent has no digits
assert(a_output_scale >= 0x1.0p-10f);
^
/home//Desktop/QNNPACK-master/src/qnnpack/requantization.h:426:28: error: exponent has no digits
assert(b_output_scale >= 0x1.0p-10f);
^
/home//Desktop/QNNPACK-master/src/qnnpack/requantization.h:427:27: error: exponent has no digits
assert(a_output_scale < 0x1.0p+8f);
^
/home//Desktop/QNNPACK-master/src/qnnpack/requantization.h:428:27: error: exponent has no digits
assert(b_output_scale < 0x1.0p+8f);
^
/home//Desktop/QNNPACK-master/src/qnnpack/requantization.h:432:30: error: exponent has no digits
assert(max_output_scale >= 0x1.0p-10f);
^
/home//Desktop/QNNPACK-master/src/qnnpack/requantization.h:433:29: error: exponent has no digits
assert(max_output_scale < 0x1.0p+8f);
^

Out-of-memory during compiling on Raspbian

The system (RPI 3 B V1.2):

uname -a
Linux raspberrypi 4.9.35-v7+ #1014 SMP Fri Jun 30 14:47:43 BST 2017 armv7l GNU/Linux

Executed:

git clone --recursive https://github.com/pytorch/pytorch.git
cd pytorch

git submodule update --remote third_party/QNNPACK

MAX_JOBS=1 scripts/build_local.sh -DBUILD_BINARY=ON -DBUILD_PYTHON=OFF \
	-DUSE_OBSERVERS=OFF -DUSE_DISTRIBUTED=OFF

During compilation, at the step:

...
[ 57%] Building CXX object caffe2/CMakeFiles/caffe2.dir/utils/string_utils.cc.o
[ 58%] Building CXX object caffe2/CMakeFiles/caffe2.dir/utils/threadpool/ThreadPool.cc.o
[ 58%] Building CXX object caffe2/CMakeFiles/caffe2.dir/utils/cpuid.cc.o
[ 58%] Building CXX object caffe2/CMakeFiles/caffe2.dir/utils/bench_utils.cc.o
[ 58%] Building CXX object caffe2/CMakeFiles/caffe2.dir/utils/math_cpu.cc.o

the RPi gradually starts swapping, and after a long time I get:

...
c++: internal compiler error: Killed (program cc1plus)
Please submit a full bug report,
with preprocessed source if appropriate.
See <file:///usr/share/doc/gcc-4.9/README.Bugs> for instructions.
caffe2/CMakeFiles/caffe2.dir/build.make:4974: recipe for target 'caffe2/CMakeFiles/caffe2.dir/utils/math_cpu.cc.o' failed
make[2]: *** [caffe2/CMakeFiles/caffe2.dir/utils/math_cpu.cc.o] Error 4
CMakeFiles/Makefile2:1328: recipe for target 'caffe2/CMakeFiles/caffe2.dir/all' failed
make[1]: *** [caffe2/CMakeFiles/caffe2.dir/all] Error 2
Makefile:138: recipe for target 'all' failed
make: *** [all] Error 2

Handle output_scale when creating the convolution2d_nhwc_q8 operator

I think it is necessary to add handling of the output scale in qnnp_create_convolution2d_nhwc_q8 (convolution.c).
Before the line const float convolution_scale = input_scale * kernel_scale / output_scale;
we should check the value of output_scale and avoid dividing by zero.
Thanks.
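
A minimal sketch of such a check (hypothetical code, not the actual QNNPACK implementation; it assumes the qnnp_status values from qnnpack.h):

#include <math.h>
#include <qnnpack.h>

/* Hypothetical guard to run before computing
   convolution_scale = input_scale * kernel_scale / output_scale. */
static enum qnnp_status validate_output_scale(float output_scale) {
  if (output_scale <= 0.0f || !isnormal(output_scale)) {
    return qnnp_status_invalid_parameter;
  }
  return qnnp_status_success;
}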

Is the code multithreaded for arm64?

I'm using QNNPACK for inference with my models. I was testing it on an armeabi-v7a phone, where it worked well. Now I have a newer phone (S9), and while it is sometimes a lot faster (I think there is some CPU throttling happening at times, which can make it very slow), I noticed in the profiler that the CPU is at 80%, compared to ~35% on the older phone. I cannot explain how that could happen unless the code is doing multithreading.

How to use QNNPACK on iOS?

I compiled the QNNPACK iOS library, and now I want to run a Caffe2 model on iOS, but I don't know how to use it. Do you have a sample or tutorial for iOS?

caffe2 abort when running benchmark with input_type uint8_t

Greetings,

I was running the benchmark following the guide on a 64-bit Android device. It runs with input_type float. Since QNNPACK claims to accelerate fixed-point inference, I wanted to test uint8, so I changed the input_type to uint8_t according to the help output of speed_benchmark (speed_benchmark --help); that is the only change to the commands. I then got the assert message below.

# ./bin/bench.mobilenetv2.sh
terminate called after throwing an instance of 'c10::Error'
  what():  IsType<T>() ASSERT FAILED at ../aten/src/ATen/core/blob.h:77, please report a bug to PyTorch. wrong type for the Blob instance. Blob contains N6caffe24int813Int8TensorCPUE while caller expects N6caffe26TensorE.
Offending Blob name: data.
Error from operator:
input: "data" output: "1" name: "" type: "NCHW2NHWC" (Get at ../aten/src/ATen/core/blob.h:77)
(no backtrace available)
Aborted (core dumped)

My question is:

  1. When float is passed as input_type, is the network running on the Android device using fixed-point or floating-point inference? If fixed-point, is the float input quantized to fixed point according to the quantization parameters of the network input? (I assume it is, as the net provided in the example is quantized.)
  2. Is a uint8_t input accepted as a valid input type? Or should the input be float so that it can be quantized according to the quantization parameters of the network input?

Btw, are there any arguments to speed_benchmark that control the number of threads used?

Thanks and Regards

Build failed on rasp pi 4 (Error: selected processor does not support `yield' in ARM mode)

I followed the instructions on README.md to build QNNPACK

MAX_JOBS=1 scripts/build-local.sh -DBUILD_BINARY=ON -DBUILD_PYTHON=OFF \
-DUSE_OBSERVERS=OFF -DUSE_DISTRIBUTED=OFF

with the following error

FAILED: deps/pthreadpool/CMakeFiles/pthreadpool.dir/src/pthreads.c.o
/usr/bin/gcc -DFXDIV_USE_INLINE_ASSEMBLY=0 -DPTHREADPOOL_ENABLE_FASTPATH=0 -D_GNU_SOURCE=1 -I../../deps/pthreadpool/src -I../../deps/pthreadpool/include -I../../deps/fxdiv/include -O3 -DNDEBUG -fPIC   -pthread -std=c11 -MD -MT deps/pthreadpool/CMakeFiles/pthreadpool.dir/src/pthreads.c.o -MF deps/pthreadpool/CMakeFiles/pthreadpool.dir/src/pthreads.c.o.d -o deps/pthreadpool/CMakeFiles/pthreadpool.dir/src/pthreads.c.o   -c ../../deps/pthreadpool/src/pthreads.c
In file included from ../../deps/pthreadpool/src/pthreads.c:50:
../../deps/pthreadpool/include/pthreadpool.h:727:2: warning: 'pthreadpool_function_1d_t' is deprecated [-Wdeprecated-declarations]
  pthreadpool_function_1d_t function,
  ^~~~~~~~~~~~~~~~~~~~~~~~~
../../deps/pthreadpool/include/pthreadpool.h:733:2: warning: 'pthreadpool_function_1d_tiled_t' is deprecated [-Wdeprecated-declarations]
  pthreadpool_function_1d_tiled_t function,
  ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
../../deps/pthreadpool/include/pthreadpool.h:740:2: warning: 'pthreadpool_function_2d_t' is deprecated [-Wdeprecated-declarations]
  pthreadpool_function_2d_t function,
  ^~~~~~~~~~~~~~~~~~~~~~~~~
../../deps/pthreadpool/include/pthreadpool.h:747:2: warning: 'pthreadpool_function_2d_tiled_t' is deprecated [-Wdeprecated-declarations]
  pthreadpool_function_2d_tiled_t function,
  ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
../../deps/pthreadpool/include/pthreadpool.h:756:2: warning: 'pthreadpool_function_3d_tiled_t' is deprecated [-Wdeprecated-declarations]
  pthreadpool_function_3d_tiled_t function,
  ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
../../deps/pthreadpool/include/pthreadpool.h:767:2: warning: 'pthreadpool_function_4d_tiled_t' is deprecated [-Wdeprecated-declarations]
  pthreadpool_function_4d_tiled_t function,
  ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/tmp/ccL8UfEl.s: Assembler messages:
/tmp/ccL8UfEl.s:56: Error: selected processor does not support `yield' in ARM mode
/tmp/ccL8UfEl.s:234: Error: selected processor does not support `yield' in ARM mode
/tmp/ccL8UfEl.s:400: Error: selected processor does not support `yield' in ARM mode
ninja: build stopped: subcommand failed.

HOST machine details:

Raspberry Pi 4
gcc version 8.3.0 (Raspbian 8.3.0-6+rpi1)
Target: arm-linux-gnueabihf

Any hints on how to solve this? Thanks a lot

Performance benchmark on GPUs, NPUs

Great project! I hope QNNPACK makes mobile deployment much easier.

  1. Can models from QNNPACK run on mobile GPUs or NPUs (e.g., the Snapdragon 845 DSP or the Kirin 970/980)?
    If so, any benchmark results compared to TensorFlow Lite would be helpful.
    I want to know whether QNNPACK is optimized specifically for mobile CPUs or is independent of the underlying hardware.

  2. (Similar question to 1) Does QNNPACK use the NNAPI that was recently enabled in Android 9.0?

No performance difference between ARMv7 (32-bit) Android and ARM64 (64-bit) Android

I followed the tutorial to run the speed benchmark (MobileNet v2) on ARMv7 (32-bit) and ARM64 (64-bit) Android. There is no performance difference between them.

Data from a few Android devices

device            per iter (ms)  iters per second
oneplus_5_32      17.27          57.92
oneplus_5_64      15.3           65.36
htc_10_32         35.56          28.12
htc_10_64         37.63          26.57
huawei_caml21_32  52.34          19.14
huawei_caml21_64  49.89          20.05

Weird performance gap observed for Jetson TX2

I'm currently experimenting with QNNPACK and Caffe2 on a Jetson TX2, using the official mobilenet_v2_quant model.
I observed a significant performance gap between the Nvidia Denver2 CPU and the ARMv8 A57 cores, and I cannot figure out why.

CPU:
HMP Dual Denver 2/2 MB L2 + Quad ARM A57/2 MB L2

What I observe

When I run with all 6 cores at 2 GHz, I get roughly 15 fps, give or take.

When running with the 2 Nvidia Denver2 cores only, it drops to around 7 fps.

However, when running with the 4 A57 cores only, it drops to a jaw-dropping 1 fps, which is shocking.

Interestingly, I also tried on a Raspberry Pi with 4 ARMv7 cores at 1.2 GHz (Model 2B, I think), where I can achieve around 10 fps.

Environment

  • Platform: Jetson TX2 with Jetpack 3.3.
  • Caffe2: build from source using pytorch/scripts/build_tegra_x1.sh.
  • Code snippet:
p = workspace.Predictor(init_net, predict_net)
p.run([data])

Can someone give me some pointers to figure out why the quad A57 cores perform so badly?

Thank you.

Edit: Profile summaries of different cpu modes are provided:
https://gist.github.com/TerryTsao/fa30ba54e660c51d5c91aafb55715d89

Building on OSX 10.14 for SIMULATOR64 (i386) fails

The make process tries to compile ARM code (src/arm/mach/init.c) when building on OSX 10.14 for SIMULATOR64 (i386).

Command:

$ cmake .. -DCMAKE_TOOLCHAIN_FILE=../../ios-cmake/ios.toolchain.cmake -DIOS_PLATFORM=SIMULATOR64
$ make

Error:

Scanning dependencies of target cpuinfo
[ 29%] Building C object deps/cpuinfo/CMakeFiles/cpuinfo.dir/src/init.c.o
[ 30%] Building C object deps/cpuinfo/CMakeFiles/cpuinfo.dir/src/api.c.o
[ 32%] Building C object deps/cpuinfo/CMakeFiles/cpuinfo.dir/src/arm/uarch.c.o
[ 33%] Building C object deps/cpuinfo/CMakeFiles/cpuinfo.dir/src/arm/cache.c.o
[ 35%] Building C object deps/cpuinfo/CMakeFiles/cpuinfo.dir/src/arm/mach/init.c.o
/Users/kohenchia/github/QNNPACK/deps/cpuinfo/src/arm/mach/init.c:19:24: error: redefinition of 'cpuinfo_isa' with a different type:
      'struct cpuinfo_arm_isa' vs 'struct cpuinfo_x86_isa'
struct cpuinfo_arm_isa cpuinfo_isa = {
                       ^
/Users/kohenchia/github/QNNPACK/deps/cpuinfo/include/cpuinfo.h:683:32: note: previous declaration is here
        extern struct cpuinfo_x86_isa cpuinfo_isa;
                                      ^
/Users/kohenchia/github/QNNPACK/deps/cpuinfo/src/arm/mach/init.c:287:16: error: no member named 'sha1' in 'struct cpuinfo_x86_isa'; did
      you mean 'sha'?
                        cpuinfo_isa.sha1 = true;
                                    ^~~~
                                    sha
/Users/kohenchia/github/QNNPACK/deps/cpuinfo/include/cpuinfo.h:674:8: note: 'sha' declared here
                bool sha;
                     ^
/Users/kohenchia/github/QNNPACK/deps/cpuinfo/src/arm/mach/init.c:288:16: error: no member named 'sha2' in 'struct cpuinfo_x86_isa'; did
      you mean 'sha'?
                        cpuinfo_isa.sha2 = true;
                                    ^~~~
                                    sha
/Users/kohenchia/github/QNNPACK/deps/cpuinfo/include/cpuinfo.h:674:8: note: 'sha' declared here
                bool sha;
                     ^
/Users/kohenchia/github/QNNPACK/deps/cpuinfo/src/arm/mach/init.c:289:16: error: no member named 'pmull' in 'struct cpuinfo_x86_isa'
                        cpuinfo_isa.pmull = true;
                        ~~~~~~~~~~~ ^
/Users/kohenchia/github/QNNPACK/deps/cpuinfo/src/arm/mach/init.c:290:16: error: no member named 'crc32' in 'struct cpuinfo_x86_isa'
                        cpuinfo_isa.crc32 = true;
                        ~~~~~~~~~~~ ^
5 errors generated.
make[2]: *** [deps/cpuinfo/CMakeFiles/cpuinfo.dir/src/arm/mach/init.c.o] Error 1
make[1]: *** [deps/cpuinfo/CMakeFiles/cpuinfo.dir/all] Error 2
make: *** [all] Error 2

This is because CMAKE_SYSTEM_PROCESSOR is empty on my machine. It is sort of a known bug in CMake: https://public.kitware.com/Bug/view.php?id=9065. The if clause in deps/cpuinfo/CMakeLists.txt fails, and the build proceeds to include src/arm/mach/init.c.

In order to fix it, I had to manually set CMAKE_SYSTEM_PROCESSOR to i386 x86_64. (Edit: Setting to i386 didn't work, which is a separate problem.)

Request to add Conv1d related operation

Hi, kindly consider adding Conv1d, ConvTranspose1d, and LSTM/RNN to the QNNPACK library. It would greatly benefit the development of mobile models that work with time-series sensor data for a variety of biomedical applications (ECG and wearables), and would even benefit human activity recognition using a phone's motion sensors. The reduced parameter counts of the aforementioned modules would be ideal on a mobile platform.

Is QNNPACK still being developed?

Hey there,

Is QNNPACK still being actively developed? I noticed that the last activity was in August of last year, but it does seem like an important part of PyTorch.

Thanks,
Andrew

Qnnpack accuracy very poor on unet model

I am using a Unet model for semantic segmentation. I pass a batch of images to the model. The model is expected to output 0 or 1 for each pixel of the image (depending on whether the pixel is part of the person object or not): 0 is for background, and 1 is for foreground.

I am trying to quantize the Unet model with the PyTorch quantization APIs for the ARM architecture. I chose QNNPACK as the quantization configuration. However, the model accuracy is very poor for both post-training static quantization and QAT. The output is always a completely black image, i.e. it contains only background and no foreground for the person object. The model outputs b x 2 x 224 x 224, i.e. batch_size x 2 channels (one for foreground and one for background) x height x width.

Below are the output values for the center pixel of the images, with the original model and with the quantized model:
[image: center-pixel output values, original vs. quantized model]

As seen, the original model has varying output values. Hence, when we apply Softmax on dim=1 (i.e., the channel dimension), we get some pixels as 0 and some as 1, which is as expected. However, the quantized model always outputs a high positive value for the background channel and a high negative value for the foreground channel. After applying softmax, all pixels are background pixels, and the output is a black image.

I need some help to find why this is happening. Is it a bug in qnnpack quantization routine?

The model source code is available at here. I used pretrained version of Unet with MobileNetV2 as backbone (check benchmark section in the readme from the source code link) - see here.

I first tried the FBGEMM configuration, and it worked fine in terms of accuracy (no major loss). However, with QNNPACK I face the issues above. The following is my code for QAT.

# Assumed imports (not in the original snippet): Q and nn_intrinsic_qat are
# taken here to mean torch.quantization and torch.nn.intrinsic.qat.
import numpy as np
import torch
import torch.nn.functional as F
import torch.quantization as Q
import torch.nn.intrinsic.qat as nn_intrinsic_qat
from torch.optim import lr_scheduler

use_sigmoid = False

def poly_lr_scheduler(optimizer, init_lr, curr_iter, max_iter, power=0.9):
    for g in optimizer.param_groups:
        g['lr'] = init_lr * (1 - curr_iter/max_iter)**power


def dice_loss(logits, targets, smooth=1.0):
    """
    logits: (torch.float32)  shape (N, C, H, W)
    targets: (torch.float32) shape (N, H, W), value {0,1,...,C-1}
    """

    if not use_sigmoid:
        outputs = F.softmax(logits, dim=1)
        targets = torch.unsqueeze(targets, dim=1)
        # targets = torch.zeros_like(logits).scatter_(dim=1, index=targets.type(torch.int64), src=torch.tensor(1.0))
        targets = torch.zeros_like(logits).scatter_(dim=1, index=targets.type(torch.int64),
                                                    src=torch.ones(targets.type(torch.int64).shape))

        inter = outputs * targets
        dice = 1 - ((2 * inter.sum(dim=(2, 3)) + smooth) / (outputs.sum(dim=(2, 3)) + targets.sum(dim=(2, 3)) + smooth))
        return dice.mean()
    else:
        outputs = logits[:,1,:,:]
        outputs = torch.sigmoid(outputs)
        inter = outputs * targets
        dice = 1 - ((2*inter.sum(dim=(1,2)) + smooth) / (outputs.sum(dim=(1,2))+targets.sum(dim=(1,2)) + smooth))
        return dice.mean()

def train_model_for_qat(model, data_loader, num_epochs, batch_size):
    device = torch.device('cpu')
    model_params = [p for p in model.parameters() if p.requires_grad]
    SGD_params = {
        "lr": 1e-2,
        "momentum": 0.9,
        "weight_decay": 1e-8
    }

    optimizer = torch.optim.SGD(model_params, lr=SGD_params["lr"], momentum=SGD_params["momentum"],
                                nesterov=True, weight_decay=SGD_params["weight_decay"])
    init_lr = optimizer.param_groups[0]['lr']
    scheduler = lr_scheduler.StepLR(optimizer, step_size=100, gamma=1)

    max_iter = num_epochs * np.ceil(len(data_loader.dataset) / batch_size) + 5
    curr_batch_num = 0

    for epoch in range(num_epochs):
        print(f"Epoch {epoch} in progress")
        train_data_length = 0
        total_train_loss = 0

        model.train()
        for batch in data_loader:
            # get the next batch
            data, target = batch
            data, target = data.to(device), target.to(device)
            train_data_length += len(batch[0])

            optimizer.zero_grad()
            outputs = model(data)
            loss = dice_loss(outputs, target)
            total_train_loss += loss.item()
            loss.backward()
            optimizer.step()
            curr_batch_num += 1
            poly_lr_scheduler(optimizer, init_lr, curr_batch_num, max_iter, power=0.9)

        total_train_loss = total_train_loss / train_data_length
        scheduler.step()

        if epoch > 10:
            # Freeze quantizer parameters
            model.apply(Q.disable_observer)
        if epoch > 20:
            # Freeze batch norm mean and variance estimates
            model.apply(nn_intrinsic_qat.freeze_bn_stats)

        quantized_model = Q.convert(model.eval(), inplace=False)
        accuracy = eval_model_for_quantization(quantized_model, device)
        print(f"...Accuacy at the end of epoch {epoch} : {accuracy}")
        if (accuracy > 99) and (epoch >= 10):
            print("...GUESS we are done with training now...")
            break

    return total_train_loss, model


Am I missing anything?

One issue that we did encounter is that the upsampling layers of Unet use nn.ConvTranspose2d, which is not supported for quantization. Hence, before this layer we need to dequantize tensors, apply nn.ConvTranspose2d, and then requantize for the subsequent layers (see the decoder block below). Can this be the reason for the lower accuracy?

#------------------------------------------------------------------------------
#   Decoder block
#------------------------------------------------------------------------------
class DecoderBlock(nn.Module):
    def __init__(self, in_channels, out_channels, block_unit):
        super(DecoderBlock, self).__init__()
        self.deconv = nn.ConvTranspose2d(in_channels, out_channels, kernel_size=4, padding=1, stride=2)
        self.block_unit = block_unit
        self.quant = QuantStub()
        self.dequant = DeQuantStub()

    def forward(self, input, shortcut):
        # self.deconv = nn.ConvTranspose2d not supported for FBGEMM and QNNPACK quantization
        input = self.dequant(input)
        x = self.deconv(input)
        x = self.quant(x)
        x = torch.cat([x, shortcut], dim=1)
        x = self.block_unit(x)
        return x

The following is the model after QAT training has completed for 30 epochs:

UNet(
  (backbone): MobileNetV2(
    (features): Sequential(
      (0): Sequential(
        (0): QuantizedConvReLU2d(3, 32, kernel_size=(3, 3), stride=(2, 2), scale=0.012562132440507412, zero_point=0, padding=(1, 1))
        (1): Identity()
        (2): Identity()
      )
      (1): InvertedResidual(
        (skip_add): QFunctional(
          scale=1.0, zero_point=0
          (activation_post_process): Identity()
        )
        (conv): Sequential(
          (0): QuantizedConvReLU2d(32, 32, kernel_size=(3, 3), stride=(1, 1), scale=0.046556487679481506, zero_point=0, padding=(1, 1), groups=32)
          (1): Identity()
          (2): Identity()
          (3): QuantizedConv2d(32, 16, kernel_size=(1, 1), stride=(1, 1), scale=0.043083205819129944, zero_point=96)
          (4): Identity()
        )
      )
      (2): InvertedResidual(
        (skip_add): QFunctional(
          scale=1.0, zero_point=0
          (activation_post_process): Identity()
        )
        (conv): Sequential(
          (0): QuantizedConvReLU2d(16, 96, kernel_size=(1, 1), stride=(1, 1), scale=0.05470738932490349, zero_point=0)
          (1): Identity()
          (2): Identity()
          (3): QuantizedConvReLU2d(96, 96, kernel_size=(3, 3), stride=(2, 2), scale=0.05578919127583504, zero_point=0, padding=(1, 1), groups=96)
          (4): Identity()
          (5): Identity()
          (6): QuantizedConv2d(96, 24, kernel_size=(1, 1), stride=(1, 1), scale=0.08143805712461472, zero_point=131)
          (7): Identity()
        )
      )
      (3): InvertedResidual(
        (skip_add): QFunctional(
          scale=0.10905726253986359, zero_point=133
          (activation_post_process): Identity()
        )
        (conv): Sequential(
          (0): QuantizedConvReLU2d(24, 144, kernel_size=(1, 1), stride=(1, 1), scale=0.021390624344348907, zero_point=0)
          (1): Identity()
          (2): Identity()
          (3): QuantizedConvReLU2d(144, 144, kernel_size=(3, 3), stride=(1, 1), scale=0.03496978059411049, zero_point=0, padding=(1, 1), groups=144)
          (4): Identity()
          (5): Identity()
          (6): QuantizedConv2d(144, 24, kernel_size=(1, 1), stride=(1, 1), scale=0.07988038659095764, zero_point=166)
          (7): Identity()
        )
      )
      (4): InvertedResidual(
        (skip_add): QFunctional(
          scale=1.0, zero_point=0
          (activation_post_process): Identity()
        )
        (conv): Sequential(
          (0): QuantizedConvReLU2d(24, 144, kernel_size=(1, 1), stride=(1, 1), scale=0.016173962503671646, zero_point=0)
          (1): Identity()
          (2): Identity()
          (3): QuantizedConvReLU2d(144, 144, kernel_size=(3, 3), stride=(2, 2), scale=0.05084317922592163, zero_point=0, padding=(1, 1), groups=144)
          (4): Identity()
          (5): Identity()
          (6): QuantizedConv2d(144, 32, kernel_size=(1, 1), stride=(1, 1), scale=0.08057469874620438, zero_point=133)
          (7): Identity()
        )
      )
      (5): InvertedResidual(
        (skip_add): QFunctional(
          scale=0.07931466400623322, zero_point=141
          (activation_post_process): Identity()
        )
        (conv): Sequential(
          (0): QuantizedConvReLU2d(32, 192, kernel_size=(1, 1), stride=(1, 1), scale=0.015451926738023758, zero_point=0)
          (1): Identity()
          (2): Identity()
          (3): QuantizedConvReLU2d(192, 192, kernel_size=(3, 3), stride=(1, 1), scale=0.01901066116988659, zero_point=0, padding=(1, 1), groups=192)
          (4): Identity()
          (5): Identity()
          (6): QuantizedConv2d(192, 32, kernel_size=(1, 1), stride=(1, 1), scale=0.03396213427186012, zero_point=137)
          (7): Identity()
        )
      )
      (6): InvertedResidual(
        (skip_add): QFunctional(
          scale=0.10119215399026871, zero_point=149
          (activation_post_process): Identity()
        )
        (conv): Sequential(
          (0): QuantizedConvReLU2d(32, 192, kernel_size=(1, 1), stride=(1, 1), scale=0.009366143494844437, zero_point=0)
          (1): Identity()
          (2): Identity()
          (3): QuantizedConvReLU2d(192, 192, kernel_size=(3, 3), stride=(1, 1), scale=0.03307875618338585, zero_point=0, padding=(1, 1), groups=192)
          (4): Identity()
          (5): Identity()
          (6): QuantizedConv2d(192, 32, kernel_size=(1, 1), stride=(1, 1), scale=0.045690517872571945, zero_point=152)
          (7): Identity()
        )
      )
      (7): InvertedResidual(
        (skip_add): QFunctional(
          scale=1.0, zero_point=0
          (activation_post_process): Identity()
        )
        (conv): Sequential(
          (0): QuantizedConvReLU2d(32, 192, kernel_size=(1, 1), stride=(1, 1), scale=0.013529903255403042, zero_point=0)
          (1): Identity()
          (2): Identity()
          (3): QuantizedConvReLU2d(192, 192, kernel_size=(3, 3), stride=(2, 2), scale=0.030076880007982254, zero_point=0, padding=(1, 1), groups=192)
          (4): Identity()
          (5): Identity()
          (6): QuantizedConv2d(192, 64, kernel_size=(1, 1), stride=(1, 1), scale=0.05553155764937401, zero_point=128)
          (7): Identity()
        )
      )
      (8): InvertedResidual(
        (skip_add): QFunctional(
          scale=0.057563915848731995, zero_point=132
          (activation_post_process): Identity()
        )
        (conv): Sequential(
          (0): QuantizedConvReLU2d(64, 384, kernel_size=(1, 1), stride=(1, 1), scale=0.008955957368016243, zero_point=0)
          (1): Identity()
          (2): Identity()
          (3): QuantizedConvReLU2d(384, 384, kernel_size=(3, 3), stride=(1, 1), scale=0.01566135324537754, zero_point=0, padding=(1, 1), groups=384)
          (4): Identity()
          (5): Identity()
          (6): QuantizedConv2d(384, 64, kernel_size=(1, 1), stride=(1, 1), scale=0.02868938073515892, zero_point=162)
          (7): Identity()
        )
      )
      (9): InvertedResidual(
        (skip_add): QFunctional(
          scale=0.05936211720108986, zero_point=140
          (activation_post_process): Identity()
        )
        (conv): Sequential(
          (0): QuantizedConvReLU2d(64, 384, kernel_size=(1, 1), stride=(1, 1), scale=0.011350379325449467, zero_point=0)
          (1): Identity()
          (2): Identity()
          (3): QuantizedConvReLU2d(384, 384, kernel_size=(3, 3), stride=(1, 1), scale=0.013551343232393265, zero_point=0, padding=(1, 1), groups=384)
          (4): Identity()
          (5): Identity()
          (6): QuantizedConv2d(384, 64, kernel_size=(1, 1), stride=(1, 1), scale=0.02829933725297451, zero_point=124)
          (7): Identity()
        )
      )
      (10): InvertedResidual(
        (skip_add): QFunctional(
          scale=0.056326691061258316, zero_point=121
          (activation_post_process): Identity()
        )
        (conv): Sequential(
          (0): QuantizedConvReLU2d(64, 384, kernel_size=(1, 1), stride=(1, 1), scale=0.009888351894915104, zero_point=0)
          (1): Identity()
          (2): Identity()
          (3): QuantizedConvReLU2d(384, 384, kernel_size=(3, 3), stride=(1, 1), scale=0.00840410403907299, zero_point=0, padding=(1, 1), groups=384)
          (4): Identity()
          (5): Identity()
          (6): QuantizedConv2d(384, 64, kernel_size=(1, 1), stride=(1, 1), scale=0.02762036770582199, zero_point=130)
          (7): Identity()
        )
      )
      (11): InvertedResidual(
        (skip_add): QFunctional(
          scale=1.0, zero_point=0
          (activation_post_process): Identity()
        )
        (conv): Sequential(
          (0): QuantizedConvReLU2d(64, 384, kernel_size=(1, 1), stride=(1, 1), scale=0.010262548923492432, zero_point=0)
          (1): Identity()
          (2): Identity()
          (3): QuantizedConvReLU2d(384, 384, kernel_size=(3, 3), stride=(1, 1), scale=0.020638082176446915, zero_point=0, padding=(1, 1), groups=384)
          (4): Identity()
          (5): Identity()
          (6): QuantizedConv2d(384, 96, kernel_size=(1, 1), stride=(1, 1), scale=0.03133825585246086, zero_point=114)
          (7): Identity()
        )
      )
      (12): InvertedResidual(
        (skip_add): QFunctional(
          scale=0.049823448061943054, zero_point=106
          (activation_post_process): Identity()
        )
        (conv): Sequential(
          (0): QuantizedConvReLU2d(96, 576, kernel_size=(1, 1), stride=(1, 1), scale=0.007199177984148264, zero_point=0)
          (1): Identity()
          (2): Identity()
          (3): QuantizedConvReLU2d(576, 576, kernel_size=(3, 3), stride=(1, 1), scale=0.017748937010765076, zero_point=0, padding=(1, 1), groups=576)
          (4): Identity()
          (5): Identity()
          (6): QuantizedConv2d(576, 96, kernel_size=(1, 1), stride=(1, 1), scale=0.045204587280750275, zero_point=94)
          (7): Identity()
        )
      )
      (13): InvertedResidual(
        (skip_add): QFunctional(
          scale=0.06418105959892273, zero_point=125
          (activation_post_process): Identity()
        )
        (conv): Sequential(
          (0): QuantizedConvReLU2d(96, 576, kernel_size=(1, 1), stride=(1, 1), scale=0.008789398707449436, zero_point=0)
          (1): Identity()
          (2): Identity()
          (3): QuantizedConvReLU2d(576, 576, kernel_size=(3, 3), stride=(1, 1), scale=0.019841214641928673, zero_point=0, padding=(1, 1), groups=576)
          (4): Identity()
          (5): Identity()
          (6): QuantizedConv2d(576, 96, kernel_size=(1, 1), stride=(1, 1), scale=0.06256742030382156, zero_point=129)
          (7): Identity()
        )
      )
      (14): InvertedResidual(
        (skip_add): QFunctional(
          scale=1.0, zero_point=0
          (activation_post_process): Identity()
        )
        (conv): Sequential(
          (0): QuantizedConvReLU2d(96, 576, kernel_size=(1, 1), stride=(1, 1), scale=0.011278725229203701, zero_point=0)
          (1): Identity()
          (2): Identity()
          (3): QuantizedConvReLU2d(576, 576, kernel_size=(3, 3), stride=(2, 2), scale=0.028320688754320145, zero_point=0, padding=(1, 1), groups=576)
          (4): Identity()
          (5): Identity()
          (6): QuantizedConv2d(576, 160, kernel_size=(1, 1), stride=(1, 1), scale=0.06365438550710678, zero_point=132)
          (7): Identity()
        )
      )
      (15): InvertedResidual(
        (skip_add): QFunctional(
          scale=0.08667448908090591, zero_point=127
          (activation_post_process): Identity()
        )
        (conv): Sequential(
          (0): QuantizedConvReLU2d(160, 960, kernel_size=(1, 1), stride=(1, 1), scale=0.011708680540323257, zero_point=0)
          (1): Identity()
          (2): Identity()
          (3): QuantizedConvReLU2d(960, 960, kernel_size=(3, 3), stride=(1, 1), scale=0.026726122945547104, zero_point=0, padding=(1, 1), groups=960)
          (4): Identity()
          (5): Identity()
          (6): QuantizedConv2d(960, 160, kernel_size=(1, 1), stride=(1, 1), scale=0.04459201171994209, zero_point=116)
          (7): Identity()
        )
      )
      (16): InvertedResidual(
        (skip_add): QFunctional(
          scale=0.1879616528749466, zero_point=126
          (activation_post_process): Identity()
        )
        (conv): Sequential(
          (0): QuantizedConvReLU2d(160, 960, kernel_size=(1, 1), stride=(1, 1), scale=0.015011523850262165, zero_point=0)
          (1): Identity()
          (2): Identity()
          (3): QuantizedConvReLU2d(960, 960, kernel_size=(3, 3), stride=(1, 1), scale=0.025075148791074753, zero_point=0, padding=(1, 1), groups=960)
          (4): Identity()
          (5): Identity()
          (6): QuantizedConv2d(960, 160, kernel_size=(1, 1), stride=(1, 1), scale=0.1145012229681015, zero_point=119)
          (7): Identity()
        )
      )
      (17): InvertedResidual(
        (skip_add): QFunctional(
          scale=1.0, zero_point=0
          (activation_post_process): Identity()
        )
        (conv): Sequential(
          (0): QuantizedConvReLU2d(160, 960, kernel_size=(1, 1), stride=(1, 1), scale=0.006071502808481455, zero_point=0)
          (1): Identity()
          (2): Identity()
          (3): QuantizedConvReLU2d(960, 960, kernel_size=(3, 3), stride=(1, 1), scale=0.01608050987124443, zero_point=0, padding=(1, 1), groups=960)
          (4): Identity()
          (5): Identity()
          (6): QuantizedConv2d(960, 320, kernel_size=(1, 1), stride=(1, 1), scale=0.02348274178802967, zero_point=127)
          (7): Identity()
        )
      )
      (18): Sequential(
        (0): QuantizedConvReLU2d(320, 1280, kernel_size=(1, 1), stride=(1, 1), scale=0.08627913892269135, zero_point=0)
        (1): Identity()
        (2): Identity()
      )
    )
  )
  (decoder1): DecoderBlock(
    (deconv): ConvTranspose2d(1280, 96, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1))
    (block_unit): InvertedResidual(
      (skip_add): QFunctional(
        scale=1.0, zero_point=0
        (activation_post_process): Identity()
      )
      (conv): Sequential(
        (0): QuantizedConvReLU2d(192, 1152, kernel_size=(1, 1), stride=(1, 1), scale=0.038529299199581146, zero_point=0)
        (1): Identity()
        (2): Identity()
        (3): QuantizedConvReLU2d(1152, 1152, kernel_size=(3, 3), stride=(1, 1), scale=0.06666766107082367, zero_point=0, padding=(1, 1), groups=1152)
        (4): Identity()
        (5): Identity()
        (6): QuantizedConv2d(1152, 96, kernel_size=(1, 1), stride=(1, 1), scale=0.0864308550953865, zero_point=117)
        (7): Identity()
      )
    )
    (quant): Quantize(scale=tensor([0.0979]), zero_point=tensor([128]), dtype=torch.quint8)
    (dequant): DeQuantize()
  )
  (decoder2): DecoderBlock(
    (deconv): ConvTranspose2d(96, 32, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1))
    (block_unit): InvertedResidual(
      (skip_add): QFunctional(
        scale=1.0, zero_point=0
        (activation_post_process): Identity()
      )
      (conv): Sequential(
        (0): QuantizedConvReLU2d(64, 384, kernel_size=(1, 1), stride=(1, 1), scale=0.06379921734333038, zero_point=0)
        (1): Identity()
        (2): Identity()
        (3): QuantizedConvReLU2d(384, 384, kernel_size=(3, 3), stride=(1, 1), scale=0.28728926181793213, zero_point=0, padding=(1, 1), groups=384)
        (4): Identity()
        (5): Identity()
        (6): QuantizedConv2d(384, 32, kernel_size=(1, 1), stride=(1, 1), scale=0.2210002988576889, zero_point=126)
        (7): Identity()
      )
    )
    (quant): Quantize(scale=tensor([0.0561]), zero_point=tensor([120]), dtype=torch.quint8)
    (dequant): DeQuantize()
  )
  (decoder3): DecoderBlock(
    (deconv): ConvTranspose2d(32, 24, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1))
    (block_unit): InvertedResidual(
      (skip_add): QFunctional(
        scale=1.0, zero_point=0
        (activation_post_process): Identity()
      )
      (conv): Sequential(
        (0): QuantizedConvReLU2d(48, 288, kernel_size=(1, 1), stride=(1, 1), scale=0.11421681195497513, zero_point=0)
        (1): Identity()
        (2): Identity()
        (3): QuantizedConvReLU2d(288, 288, kernel_size=(3, 3), stride=(1, 1), scale=0.20177718997001648, zero_point=0, padding=(1, 1), groups=288)
        (4): Identity()
        (5): Identity()
        (6): QuantizedConv2d(288, 24, kernel_size=(1, 1), stride=(1, 1), scale=0.21056368947029114, zero_point=114)
        (7): Identity()
      )
    )
    (quant): Quantize(scale=tensor([0.1462]), zero_point=tensor([123]), dtype=torch.quint8)
    (dequant): DeQuantize()
  )
  (decoder4): DecoderBlock(
    (deconv): ConvTranspose2d(24, 16, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1))
    (block_unit): InvertedResidual(
      (skip_add): QFunctional(
        scale=1.0, zero_point=0
        (activation_post_process): Identity()
      )
      (conv): Sequential(
        (0): QuantizedConvReLU2d(32, 192, kernel_size=(1, 1), stride=(1, 1), scale=0.1248798817396164, zero_point=0)
        (1): Identity()
        (2): Identity()
        (3): QuantizedConvReLU2d(192, 192, kernel_size=(3, 3), stride=(1, 1), scale=0.1945924311876297, zero_point=0, padding=(1, 1), groups=192)
        (4): Identity()
        (5): Identity()
        (6): QuantizedConv2d(192, 16, kernel_size=(1, 1), stride=(1, 1), scale=0.3521292805671692, zero_point=135)
        (7): Identity()
      )
    )
    (quant): Quantize(scale=tensor([0.1450]), zero_point=tensor([140]), dtype=torch.quint8)
    (dequant): DeQuantize()
  )
  (conv_last): Sequential(
    (0): QuantizedConv2d(16, 3, kernel_size=(3, 3), stride=(1, 1), scale=1.0138226747512817, zero_point=131, padding=(1, 1))
    (1): QuantizedConv2d(3, 2, kernel_size=(3, 3), stride=(1, 1), scale=2.995656728744507, zero_point=128, padding=(1, 1))
  )
  (quant): Quantize(scale=tensor([0.0186]), zero_point=tensor([114]), dtype=torch.quint8)
  (dequant): DeQuantize()
)

Any help is highly appreciated - thanks.

How to create a quantized model in order to try out QNNPACK?

Hi,
I am excited that Facebook is revealing its own mobile inference framework. After reading the article about QNNPACK, I really want to try it out on my own caffemodel. (I know that you have posted a quantized MobileNetV2, and it beats the TFLite one by 2x.) But how can I convert my prototxt and caffemodel to the model format that QNNPACK expects?

Error compiling on PC

Compiling on PC using the "local" build I get:

CMakeFiles/q8dw-test.dir/test/q8dw.cc.o: In function `DepthwiseMicrokernelTester::test(void (*)(unsigned long, unsigned long, unsigned char const**, void const*, unsigned char*, unsigned long, unsigned long, qnnp_conv_quantization_params const*)) const':
q8dw.cc:(.text._ZNK26DepthwiseMicrokernelTester4testEPFvmmPPKhPKvPhmmPK29qnnp_conv_quantization_paramsE[_ZNK26DepthwiseMicrokernelTester4testEPFvmmPPKhPKvPhmmPK29qnnp_conv_quantization_paramsE]+0x14ba): undefined reference to `q8dw_ukernel_25c8__neon'
collect2: error: ld returned 1 exit status
CMakeFiles/q8dw-test.dir/build.make:90: recipe for target 'q8dw-test' failed

The problem is that 'depthwise-microkernel-tester.h' references an ARM-only function, 'q8dw_ukernel_25c8__neon'.
I tried bypassing it with "#if CPUINFO_ARCH_ARM || CPUINFO_ARCH_ARM64"; however, it still raises a segmentation fault in the benchmark when running 'DWConv5x5' (it passes 'DWConv3x3').

What's the relationship between the ASM and C version micro kernel?

Hi there,

I assume the C kernels are for prototyping while the ASM kernels are the ones finally selected. However, while playing around I found that the assembly built from the C code looks unexpected.

Take this example:

const uint8x8_t vb01234567 = vld1_u8(w); w = (void*) ((uintptr_t) w + sizeof(uint8x8_t));
const int16x8_t vxb01234567 = vreinterpretq_s16_u16(vsubl_u8(vb01234567, vb_zero_point));
which generates the asm sequence below (just an example, not the exact instructions paired with the C code above) for an Android ARM64 target.

dup v24.8h, v9.h[0]
ext v24.16b, v24.16b, v24.16b, #8
uxtl v24.8h, v24.8b
sub v27.8h, v24.8h, v29.8h

This is unexpected to me, as the C code uses NEON intrinsics. The asm generated from the C kernel is about 1.5x the size of the hand-written asm kernel. Is this expected, or do I need to fix my build environment?

Thanks and Regards

Deconvolutional Layer is slow for large kernel size

The deconvolutional layer is very slow for large kernel sizes. The kernel size is 16x16x4x2, the input size is 32x32x4, the output size is 256x256x2, the padding is 4 (top, bottom, left, right), the stride is 8 (height, width), and the dilation is 1 (height, width). The run time can be as slow as a few hundred ms.
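
As a sanity check of the geometry only (using the standard transposed-convolution output-size formula, nothing QNNPACK-specific), the reported sizes are consistent:

\[
H_{out} = (H_{in} - 1)\,s - 2p + d\,(k - 1) + 1
        = (32 - 1)\cdot 8 - 2\cdot 4 + 1\cdot(16 - 1) + 1 = 256.
\]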

Building on OSX 10.14 for OS64 (ARM 64) fails

I'm getting an undefined symbol error when trying to build on OSX 10.14 for ARM 64:

Command:

$ cmake .. -DCMAKE_TOOLCHAIN_FILE=../../ios-cmake/ios.toolchain.cmake -DIOS_PLATFORM=OS64
$ make

Error:

Scanning dependencies of target sgemm-bench
[ 49%] Building CXX object CMakeFiles/sgemm-bench.dir/bench/sgemm.cc.o
[ 50%] Linking CXX executable sgemm-bench
ld: warning: -headerpad_max_install_names is ignored when used with -bitcode_bundle (Xcode setting ENABLE_BITCODE=YES)
Undefined symbols for architecture arm64:
  "_sgemm_ukernel_5x8__neon", referenced from:
      SGEMM_L1_5x8__neon_Benchmark::BenchmarkCase(benchmark::State&) in sgemm.cc.o
      SGEMM_Op_5x8__neon_Benchmark::BenchmarkCase(benchmark::State&) in sgemm.cc.o
  "_sgemm_ukernel_6x8__neon", referenced from:
      SGEMM_L1_6x8__neon_Benchmark::BenchmarkCase(benchmark::State&) in sgemm.cc.o
      SGEMM_Op_6x8__neon_Benchmark::BenchmarkCase(benchmark::State&) in sgemm.cc.o
ld: symbol(s) not found for architecture arm64
clang: error: linker command failed with exit code 1 (use -v to see invocation)
make[2]: *** [sgemm-bench] Error 1
make[1]: *** [CMakeFiles/sgemm-bench.dir/all] Error 2
make: *** [all] Error 2

I've been poking around benchmark.h/cpp a bit but it's not clear to me why this would be happening.

build error

<command-line>:0:1: error: macro names must be identifiers
<command-line>:0:1: error: macro names must be identifiers
<command-line>:0:1: error: macro names must be identifiers
<command-line>:0:1: error: macro names must be identifiers
<command-line>:0:1: error: macro names must be identifiers
<command-line>:0:1: error: macro names must be identifiers
<command-line>:0:1: error: macro names must be identifiers
<command-line>:0:1: error: macro names must be identifiers
deps/clog/CMakeFiles/clog.dir/build.make:62: recipe for target 'deps/clog/CMakeFiles/clog.dir/src/clog.c.o' failed
make[2]: *** [deps/clog/CMakeFiles/clog.dir/src/clog.c.o] Error 1
CMakeFiles/Makefile2:1672: recipe for target 'deps/clog/CMakeFiles/clog.dir/all' failed
make[1]: *** [deps/clog/CMakeFiles/clog.dir/all] Error 2
make[1]: *** Waiting for unfinished jobs....
deps/pthreadpool/CMakeFiles/pthreadpool.dir/build.make:62: recipe for target 'deps/pthreadpool/CMakeFiles/pthreadpool.dir/src/threadpool-pthreads.c.o' failed
make[2]: *** [deps/pthreadpool/CMakeFiles/pthreadpool.dir/src/threadpool-pthreads.c.o] Error 1
CMakeFiles/Makefile2:1823: recipe for target 'deps/pthreadpool/CMakeFiles/pthreadpool.dir/all' failed
make[1]: *** [deps/pthreadpool/CMakeFiles/pthreadpool.dir/all] Error 2
[  4%] Building CXX object deps/googletest/googlemock/CMakeFiles/gmock_main.dir/src/gmock-all.cc.o
[  5%] Building CXX object deps/googletest/googlemock/CMakeFiles/gmock_main.dir/src/gmock_main.cc.o
<command-line>:0:1: error: macro names must be identifiers
<command-line>:0:1: error: macro names must be identifiers
deps/googlebenchmark/src/CMakeFiles/benchmark.dir/build.make:88: recipe for target 'deps/googlebenchmark/src/CMakeFiles/benchmark.dir/colorprint.cc.o' failed
make[2]: *** [deps/googlebenchmark/src/CMakeFiles/benchmark.dir/colorprint.cc.o] Error 1
make[2]: *** Waiting for unfinished jobs....
[  6%] Building CXX object deps/googletest/googlemock/CMakeFiles/gmock.dir/src/gmock-all.cc.o
<command-line>:0:1: error: macro names must be identifiers
deps/googletest/googlemock/CMakeFiles/gmock_main.dir/build.make:88: recipe for target 'deps/googletest/googlemock/CMakeFiles/gmock_main.dir/src/gmock_main.cc.o' failed
make[2]: *** [deps/googletest/googlemock/CMakeFiles/gmock_main.dir/src/gmock_main.cc.o] Error 1
make[2]: *** Waiting for unfinished jobs....
deps/googlebenchmark/src/CMakeFiles/benchmark.dir/build.make:62: recipe for target 'deps/googlebenchmark/src/CMakeFiles/benchmark.dir/benchmark.cc.o' failed
make[2]: *** [deps/googlebenchmark/src/CMakeFiles/benchmark.dir/benchmark.cc.o] Error 1
deps/googletest/googlemock/CMakeFiles/gmock_main.dir/build.make:75: recipe for target 'deps/googletest/googlemock/CMakeFiles/gmock_main.dir/src/gmock-all.cc.o' failed
make[2]: *** [deps/googletest/googlemock/CMakeFiles/gmock_main.dir/src/gmock-all.cc.o] Error 1
deps/googletest/googlemock/CMakeFiles/gmock_main.dir/build.make:62: recipe for target 'deps/googletest/googlemock/CMakeFiles/gmock_main.dir/__/googletest/src/gtest-all.cc.o' failed
make[2]: *** [deps/googletest/googlemock/CMakeFiles/gmock_main.dir/__/googletest/src/gtest-all.cc.o] Error 1
CMakeFiles/Makefile2:1954: recipe for target 'deps/googletest/googlemock/CMakeFiles/gmock_main.dir/all' failed
make[1]: *** [deps/googletest/googlemock/CMakeFiles/gmock_main.dir/all] Error 2
deps/googletest/googlemock/gtest/CMakeFiles/gtest.dir/build.make:62: recipe for target 'deps/googletest/googlemock/gtest/CMakeFiles/gtest.dir/src/gtest-all.cc.o' failed
make[2]: *** [deps/googletest/googlemock/gtest/CMakeFiles/gtest.dir/src/gtest-all.cc.o] Error 1
CMakeFiles/Makefile2:2085: recipe for target 'deps/googletest/googlemock/gtest/CMakeFiles/gtest.dir/all' failed
make[1]: *** [deps/googletest/googlemock/gtest/CMakeFiles/gtest.dir/all] Error 2
deps/googletest/googlemock/CMakeFiles/gmock.dir/build.make:62: recipe for target 'deps/googletest/googlemock/CMakeFiles/gmock.dir/__/googletest/src/gtest-all.cc.o' failed
make[2]: *** [deps/googletest/googlemock/CMakeFiles/gmock.dir/__/googletest/src/gtest-all.cc.o] Error 1
make[2]: *** Waiting for unfinished jobs....
deps/googlebenchmark/src/CMakeFiles/benchmark.dir/build.make:75: recipe for target 'deps/googlebenchmark/src/CMakeFiles/benchmark.dir/benchmark_register.cc.o' failed
make[2]: *** [deps/googlebenchmark/src/CMakeFiles/benchmark.dir/benchmark_register.cc.o] Error 1
CMakeFiles/Makefile2:2197: recipe for target 'deps/googlebenchmark/src/CMakeFiles/benchmark.dir/all' failed
make[1]: *** [deps/googlebenchmark/src/CMakeFiles/benchmark.dir/all] Error 2
deps/googletest/googlemock/CMakeFiles/gmock.dir/build.make:75: recipe for target 'deps/googletest/googlemock/CMakeFiles/gmock.dir/src/gmock-all.cc.o' failed
make[2]: *** [deps/googletest/googlemock/CMakeFiles/gmock.dir/src/gmock-all.cc.o] Error 1
CMakeFiles/Makefile2:1991: recipe for target 'deps/googletest/googlemock/CMakeFiles/gmock.dir/all' failed
make[1]: *** [deps/googletest/googlemock/CMakeFiles/gmock.dir/all] Error 2
Makefile:140: recipe for target 'all' failed

Implementation detail things...

Hi @Maratyszcza, I am reading the implementation of q8gemm_ukernel_4x8__neon and your QNNPACK TR, and I am confused by its arguments.

In my understanding, GEMM processes a matrix stored in H x W format, which should be mr x k for matrix a in q8gemm_ukernel_4x8__neon. Thus, when obtaining rows (lanes) from matrix a, ai = a + k * i should be enough. So what exactly are a_stride and c_stride here?

size_t a_stride,

Would you please point out where I started to misunderstand the code?
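
As a general note on GEMM-style interfaces (not a statement about QNNPACK's internals): a row stride lets a kernel address an mr x k sub-block of a larger matrix whose rows are not packed back-to-back. A tiny sketch with hypothetical names:

#include <stddef.h>
#include <stdint.h>

/* Hypothetical illustration: row i of a sub-block starts at a + i * a_stride.
   Only when the block is stored contiguously does this reduce to a + i * k
   (for uint8_t elements, with the stride expressed in bytes). */
static const uint8_t* block_row(const uint8_t* a, size_t i, size_t a_stride) {
  return a + i * a_stride;
}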

Per channel quantization for the weights

Hiya,

Is it possible to provide quantization parameters at a finer granularity than per layer? In (1), for example, they talk about doing per-channel quantization for the weights (in the context of convolutions), but I remember reading a paper that mentioned doing this for a normal linear transform as well (I can't find the paper again at the moment, sorry). Are there any plans to include such capabilities?

I'm trying to do post-training quantization but am getting significant degradation when replacing even a single layer. It's a smaller network, and from what I've read these are harder to quantize. If you don't mind me asking, I'm curious what your experience with post-training quantization is and how well it works.

(1) https://arxiv.org/pdf/1806.08342.pdf
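
For reference, per-channel quantization of the weights as described in (1) simply gives each output channel c its own parameters, instead of one pair for the whole tensor:

\[
r_{i,c} = s_c \,(q_{i,c} - z_c) \quad\text{(per-channel)} \qquad\text{vs.}\qquad r_i = s\,(q_i - z) \quad\text{(per-tensor).}
\]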

Typo in the announcement blog?

Hi there,

I am wondering if the figure below, which describes the matrix-matrix multiplication in the announcement blog, has a typo.

[figure: GEMM diagram from the announcement blog]

In matrix B (K x N), the highlighted dimension is the reduction dimension, which should be K rather than N, no?
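
For reference, with A of size M x K and B of size K x N, the reduction in C = AB runs over the shared K dimension:

\[
C_{m,n} = \sum_{k=0}^{K-1} A_{m,k}\, B_{k,n}.
\]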

@hlu1

Why does QNNPACK have better accuracy on MobileNetV2?

Hi authors

I notice that in the MobileNetV2 section there is a figure comparing the accuracy of QNNPACK 8-bit and TFLite 8-bit, and your library significantly outperforms TF-Lite. I wonder, did you use some special quantization algorithm, or do quantization-aware training?

build-local.sh error!

When running build-local.sh, there's an error:
--- LOG END ---
error: downloading 'https://github.com/google/benchmark/archive/v1.4.1.zip' failed
status_code: 1
status_string: "Unsupported protocol"
log:
--- LOG BEGIN ---
Protocol "https" not supported or disabled in libcurl

Closing connection -1

     --- LOG END -

But when I download it offline and put it into:
"QNNPACK/build/local/deps/googlebenchmark-download/googlebenchmark-prefix/src"
there's another error:
does not match expected value
expected: '61ae07eb5d4a0b02753419eb17a82b7d322786bb36ab62bd3df331a4d47c00a7'
actual: 'e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855'
-- File already exists but hash mismatch. Removing...

please help~
