Kenning
Copyright (c) 2020-2021 Antmicro
This is a project for implementing and testing pipelines for deploying deep learning models on edge devices.
Deployment flow
Deploying deep learning models on edge devices usually involves the following steps:
- Preparation and analysis of the dataset, preparation of data preprocessing and output postprocessing routines,
- Model training (usually transfer learning), if necessary,
- Evaluation and improvement of the model until its quality is satisfactory,
- Model optimization, usually hardware-specific optimizations (e.g. operator fusion, quantization, neuron-wise or connection-wise pruning),
- Model compilation to a given target,
- Model execution on a given target.
There are different frameworks for most of the above steps (training, optimization, compilation and runtime). The cooperation between those frameworks differs and may provide different results.
This framework introduces interfaces for those above-mentioned steps that can be implemented using specific deep learning frameworks.
Based on the implemented interfaces, the framework can measure the inference duration and quality on a given target. It also verifies the compatibility between various training, compilation and optimization frameworks.
Kenning installation
Module installation with pip
To install Kenning with its basic dependencies with pip, run:
pip install -U git+https://github.com/antmicro/kenning.git
Since Kenning can support various frameworks, and not all of them are required for user's use cases, the optional requirements are available as extra requirements. The groups of extra requirements are following:
tensorflow
- modules for working with TensorFlow models (ONNX conversions, addons, and TensorFlow framework),torch
- modules for working with PyTorch models,mxnet
- modules for working with MXNet models,nvidia_perf
- modules for performance measurements for NVIDIA GPUs,object_detection
- modules for working with YOLOv3 object detection and Open Images Dataset V6 computer vision dataset.
To install the extra requirements, i.e. tensorflow
, run:
sudo pip install git+https://github.com/antmicro/kenning.git#egg=kenning[tensorflow]
Working directly with the repository
For development purposes, and for usage of additional resources (as sample scripts or trained models), clone repository with:
git clone https://github.com/antmicro/kenning.git
To download model weights, install Git Large File Storage (if not installed) and run:
git lfs pull
Kenning structure
The kenning
module consists of the following submodules:
core
- provides interface APIs for datasets, models, compilers, runtimes and runtime protocols,datasets
- provides implementations for datasets,modelwrappers
- provides implementations for models for various problems implemented in various frameworks,compilers
- provides implementations for compilers of deep learning models,runtimes
- provides implementations of runtime on target devices,runtimeprotocols
- provides implementations for communication protocols between host and tested target,onnxconverters
- provides ONNX conversions for a given framework along with a list of models to test the conversion on,resources
- contains project's resources, like RST templates, or trained models,scenarios
- contains executable scripts for running training, inference, benchmarks and other tests on target devices.
Model preparation
The kenning.core.dataset.Dataset
classes
Classes that implement the methods from kenning.core.dataset.Dataset
are responsible for:
- preparing the dataset, including the download routines (use
--download-dataset
flag to download the dataset data), - preprocessing the inputs into the format expected by most of the models for a given task,
- postprocessing the outputs for the evaluation process,
- evaluating a given model based on its predictions,
- subdividing the samples into training and validation datasets.
Based on the above methods, the Dataset
class provides data to the model wrappers, compilers and runtimes to train and test the models.
The datasets are included in the kenning.datasets
submodule.
Check out the Pet Dataset wrapper for an example of Dataset
class implementation.
The kenning.core.model.ModelWrapper
classes
The ModelWrapper
class requires implementing methods for:
- model preparation,
- model saving and loading,
- model saving to the ONNX format,
- model-specific preprocessing of inputs and postprocessing of outputs, if neccessary,
- model inference,
- providing metadata (framework name and version),
- model training,
- input format specification,
- conversion of model inputs and outputs to bytes for the
kenning.core.runtimeprotocol.RuntimeProtocol
objects.
The ModelWrapper
provides methods for running the inference in a loop from data from dataset and measures both the quality and inferenceperformance of the model.
The kenning.modelwrappers.frameworks
submodule contains framework-wise specifications of ModelWrapper
class - they implement all methods that are common for all the models implemented in this framework.
For the Pet Dataset wrapper object there is example classifier implemented in TensorFlow 2.x called TensorFlowPetDatasetMobileNetV2.
The kenning.core.compiler.ModelCompiler
classes
Objects of this class implement compilation and optional hardware-specific optimization.
For the latter, the ModelCompiler
may require a dataset, for example to perform quantization or pruning.
The implementations for compiler wrappers are in kenning.compilers
.
For example, TFLiteCompiler class wraps the TensorFlow Lite routines for compiling the model to a specified target.
Model deployment and benchmarking on target devices
Benchmarks of compiled models are performed in a client-server manner, where the target device acts as a server that accepts the compiled model and waits for the input data to infer, and the host device sends the input data and waits for the outputs to evaluate the quality of models.
The general communication protocol
The communication protocol is message-based. There are:
OK
messages - indicate success, and may come with additional information,ERROR
messages - indicate failure,DATA
messages - provide input data for inference,MODEL
messages - provide model to load for inference,PROCESS
messages - request processing inputs delivered inDATA
message,OUTPUT
messages - request results of processing,STATS
messages - request statistics from the target device.
The message types and enclosed data are encoded in format implemented in the kenning.core.runtimeprotocol.RuntimeProtocol
-based class.
The communication during inference benchmark session is as follows:
- The client (host) connects to the server (target),
- The client sends the
MODEL
request along with the compiled model, - The server loads the model from request, prepares everything for running the model and sends the
OK
response, - After receiving the
OK
response from the server, the client starts reading input samples from the dataset, preprocesses the inputs, and sendsDATA
request with the preprocessed input, - Upon receiving the
DATA
request, the server stores the input for inference, and sends theOK
message, - Upon receiving confirmation, the client sends the
PROCESS
request, - Just after receiving the
PROCESS
request, the server should send theOK
message to confirm that it starts the inference, and just after finishing the inference the server should send anotherOK
message to confirm that the inference is finished, - After receiving the first
OK
message, the client starts measuring inference time until the secondOK
response is received, - The client sends the
OUTPUT
request in order to receive the outputs from the server, - Server sends the
OK
message along with the output data, - The client parses the output and evaluates model performance,
- The client sends
STATS
request to obtain additional statistics (inference time, CPU/GPU/Memory utilization) from the server, - If server provides any statistics, it sends the
OK
message with the data, - The same process applies to the rest of input samples.
The way of determining the message type and sending data between the server and the client depends on the implementation of the kenning.core.runtimeprotocol.RuntimeProtocol
class.
The implementation of running inference on the given target is implemented in the kenning.core.runtime.Runtime
class.
The kenning.core.runtimeprotocol.RuntimeProtocol
classes
The RuntimeProtocol
class conducts the communication between the client (host) and the server (target).
The RuntimeProtocol
class requires implementing methods for:
- initializing the server and the client (communication-wise),
- waiting for the incoming data,
- sending the data,
- receiving the data,
- uploading the model inputs to the server,
- uploading the model to the server,
- requesting the inference on target,
- downloading the outputs from the server,
- (optionally) downloading the statistics from the server (i.e. performance speed, CPU/GPU utilization, power consumption),
- notifying of success or failure by the server,
- parsing messages.
Based on the above-mentioned methods, the kenning.core.runtime.Runtime
connects the host with the target.
Look at the TCP runtime protocol for an example.
The kenning.core.runtime.Runtime
classes
The Runtime
objects provide an API for the host and (optionally) the target device.
If the target device does not support Python, the runtime needs to be implemented in a different language, and the host API needs to support it.
The client (host) side of the Runtime
class utilizes the methods from Dataset
, ModelWrapper
and RuntimeProtocol
classes to run inference on the target device.
The server (target) side of the Runtime
class requires implementing methods for:
- loading model delivered by the client,
- preparing inputs delivered by the client,
- running inference,
- preparing outputs to be delivered to the client,
- (optionally) sending inference statistics.
Look at the TVM runtime for an example.
ONNX conversion
Most of the frameworks for training, compiling and optimizing deep learning algorithms support ONNX format. It allows conversion of models from one representation to another.
The ONNX API and format is constantly evolving, and there are more and more operators in new state-of-the-art models that need to be supported.
The kenning.core.onnxconversion.ONNXConversion
class provides an API for writing compatibility tests between ONNX and deep learning frameworks.
It requires implementing:
- method for importing ONNX model for a given framework,
- method for exporting ONNX model from a given framework,
- list of models implemented in a given framework, where each model will be exported to ONNX, and then imported back to the framework.
The ONNXConversion
class implements a method for converting the models.
It catches exceptions and any issues in the import/export methods, and provides the report on conversion status per model.
Look at the TensorFlowONNXConversion class for an example of API usage.
Running the benchmarks
All executable Python scripts are available in the kenning.scenarios
submodule.
Running model training on host
The kenning.scenarios.model_training
script is run as follows:
python -m kenning.scenarios.model_training \ kenning.modelwrappers.classification.tensorflow_pet_dataset.TensorFlowPetDatasetMobileNetV2 \ kenning.datasets.pet_dataset.PetDataset \ --logdir build/logs \ --dataset-root build/pet-dataset \ --model-path build/trained-model.h5 \ --batch-size 32 \ --learning-rate 0.0001 \ --num-epochs 50
By default, kenning.scenarios.model_training
script requires two classes:
ModelWrapper
-based class that describes model architecture and provides training routines,Dataset
-based class that provides training data for the model.
The remaining arguments are provided by the form_argparse
class methods in each class, and may be different based on selected dataset and model.
In order to get full help for the training scenario for the above case, run:
python -m kenning.scenarios.model_training \ kenning.modelwrappers.classification.tensorflow_pet_dataset.TensorFlowPetDatasetMobileNetV2 \ kenning.datasets.pet_dataset.PetDataset \ -h
This will load all the available arguments for a given model and dataset.
The arguments in the above command are:
--logdir
- path to the directory where logs will be stored (this directory may be an argument for the TensorBoard software),--dataset-root
- path to the dataset directory, required by theDataset
-based class,--model-path
- path where the trained model will be saved,--batch-size
- training batch size,--learning-rate
- training learning rate,--num-epochs
- number of epochs.
If the dataset files are not present, use --download-dataset
flag in order to let the Dataset API download the data.
Benchmarking trained model on host
The kenning.scenarios.inference_performance
script runs the model using the deep learning framework used for training on a host device.
It runs the inference on a given dataset, computes model quality metrics and performance metrics.
The results from the script can be used as a reference point for benchmarking of the compiled models on target devices.
The example usage of the script is as follows:
python -m kenning.scenarios.inference_performance \ kenning.modelwrappers.classification.tensorflow_pet_dataset.TensorFlowPetDatasetMobileNetV2 \ kenning.datasets.pet_dataset.PetDataset \ build/result.json \ --model-path kenning/resources/models/classification/tensorflow_pet_dataset_mobilenetv2.h5 \ --dataset-root build/pet-dataset
The obligatory arguments for the script are:
ModelWrapper
-based class that implements the model loading, I/O processing and inference method,Dataset
-based class that implements fetching of data samples and evaluation of the model,build/result.json
, which is the path to the output JSON file with benchmark results.
The remaining parameters are specific to the ModelWrapper
-based class and Dataset
-based class.
Testing ONNX conversions
The kenning.scenarios.onnx_conversion
runs as follows:
python -m kenning.scenarios.onnx_conversion \ build/models-directory \ build/onnx-support.rst \ --converters-list \ kenning.onnxconverters.pytorch.PyTorchONNXConversion \ kenning.onnxconverters.tensorflow.TensorFlowONNXConversion \ kenning.onnxconverters.mxnet.MXNetONNXConversion
The first argument is the directory, where the generated ONNX models will be stored.
The second argument is the RST file with import/export support table for each model for each framework.
The third argument is the list of ONNXConversion
classes implementing list of models, import method and export method.
Running compilation and deployment of models on target hardware
There are two scripts - kenning.scenarios.inference_tester
and kenning.scenarios.inference_server
.
The example call for the first script is following:
python -m kenning.scenarios.inference_tester \ kenning.modelwrappers.classification.tensorflow_pet_dataset.TensorFlowPetDatasetMobileNetV2 \ kenning.compilers.tflite.TFLiteCompiler \ kenning.runtimes.tflite.TFLiteRuntime \ kenning.datasets.pet_dataset.PetDataset \ ./build/google-coral-devboard-tflite-tensorflow.json \ --protocol-cls kenning.runtimeprotocols.network.NetworkProtocol \ --model-path ./kenning/resources/models/classification/tensorflow_pet_dataset_mobilenetv2.h5 \ --model-framework keras \ --target "edgetpu" \ --compiled-model-path build/compiled-model.tflite \ --inference-input-type int8 \ --inference-output-type int8 \ --host 192.168.188.35 \ --port 12345 \ --packet-size 32768 \ --save-model-path /home/mendel/compiled-model.tflite \ --dataset-root build/pet-dataset \ --inference-batch-size 1 \ --verbosity INFO
The script requires:
ModelWrapper
-based class that implements model loading, I/O processing and optionally model conversion to ONNX format,ModelCompiler
-based class for compiling the model for a given target,Runtime
-based class that implements data processing and the inference method for the compiled model on the target hardware,Dataset
-based class that implements fetching of data samples and evaluation of the model,./build/google-coral-devboard-tflite-tensorflow.json
, which is the path to the output JSON file with performance and quality metrics.
In case of running inference on remote edge device, the --protocol-cls RuntimeProtocol
also needs to be provided in order to provide communication protocol between the host and the target.
If --protocol-cls
is not provided, the inference_tester
will run inference on the host machine (which is useful for testing and comparison).
The remaining arguments come from the above-mentioned classes. Their meaning is following:
--model-path
(TensorFlowPetDatasetMobileNetV2
argument) is the path to the trained model that will be compiled and executed on the target hardware,--model-framework
(TFLiteCompiler
argument) tells the compiler what is the format of the file with the saved model (it tells which backend to use for parsing the model by the compiler),--target
(TFLiteCompiler
argument) is the name of the target hardware for which the compiler generates optimized binaries,--compiled-model-path
(TFLiteCompiler
argument) is the path where the compiled model will be stored on host,--inference-input-type
(TFLiteCompiler
argument) tells TFLite compiler what will be the type of the input tensors,--inference-output-type
(TFLiteCompiler
argument) tells TFLite compiler what will be the type of the output tensors,--host
tells theNetworkProtocol
what is the IP address of the target device,--port
tells theNetworkProtocol
on what port the server application is listening,--packet-size
tells theNetworkProtocol
what the packet size during communication should be,--save-model-path
(TFLiteRuntime
argument) is the path where the compiled model will be stored on the target device,--dataset-root
(PetDataset
argument) is the path to the dataset files,--inference-batch-size
is the batch size for the inference on the target hardware,--verbosity
is the verbosity of logs.
The example call for the second script is as follows:
python -m kenning.scenarios.inference_server \ kenning.runtimeprotocols.network.NetworkProtocol \ kenning.runtimes.tflite.TFLiteRuntime \ --host 0.0.0.0 \ --port 12345 \ --packet-size 32768 \ --save-model-path /home/mendel/compiled-model.tflite \ --delegates-list libedgetpu.so.1 \ --verbosity INFO
This script only requires Runtime
-based class and RuntimeProtocol
-based class.
It waits for a client using a given protocol, and later runs inference based on the implementation from the Runtime
class.
The additional arguments are as follows:
--host
(NetworkProtocol
argument) is the address where the server will listen,--port
(NetworkProtocol
argument) is the port on which the server will listen,--packet-size
(NetworkProtocol
argument) is the size of the packet,--save-model-path
is the path where the received model will be saved,--delegates-list
(TFLiteRuntime
argument) is a TFLite-specific list of libraries for delegating the inference to deep learning accelerators (libedgetpu.so.1
is the delegate for Google Coral TPUs).
First, the client compiles the model and sends it to the server using the runtime protocol. Then, it sends next batches of data to process to the server. In the end, it collects the benchmark metrics and saves them to JSON file. In addition, it generates plots with performance changes over time.
Render report from benchmarks
The kenning.scenarios.inference_performance
and kenning.scenarios.inference_tester
create JSON files that contain:
command string that was used to generate the JSON file,
frameworks along with their versions used to train the model and compile the model,
performance metrics, including:
- CPU usage over time,
- RAM usage over time,
- GPU usage over time,
- GPU memory usage over time,
predictions and ground truth to compute quality metrics, i.e. in form of confusion matrix and top-5 accuracy for classification task.
The kenning.scenarios.render_report
renders the report RST file along with plots for metrics for a given JSON file based on selected templates.
For example, for the file ./build/google-coral-devboard-tflite-tensorflow.json
created in :ref:`compilation-and-deployment` the report can be rendered as follows:
python -m kenning.scenarios.render_report \ build/google-coral-devboard-tflite-tensorflow.json \ "Pet Dataset classification using TFLite-compiled TensorFlow model" \ docs/source/generated/google-coral-devboard-tpu-tflite-tensorflow-classification.rst \ --img-dir docs/source/generated/img/ \ --root-dir docs/source/ \ --report-types \ performance \ classification
Where:
build/google-coral-devboard-tflite-tensorflow.json
is the input JSON file with benchmark results"Pet Dataset classification using TFLite-compiled TensorFlow model"
is the report name that will be used as title in generated plots,docs/source/generated/google-coral-devboard-tpu-tflite-tensorflow-classification.rst
is the path to the output RST file,--img-dir docs/source/generated/img/
is the path to the directory where generated plots will be stored,--root-dir docs/source
is the root directory for documentation sources (it will be used to compute relative paths in the RST file),--report-types performance classification
is the list of report types that will form the final RST file.
The performance
type provides report sections for performance metrics, i.e.:
- Inference time changes over time,
- Mean CPU usage over time,
- RAM usage over time,
- GPU usage over time,
- GPU memory usage over time.
It also computes mean, standard deviation and median values for the above time series.
The classification
type provides report section regarding quality metrics for classification task:
- Confusion matrics,
- Per-class precision,
- Per-class sensitivity,
- Accuracy,
- Top-5 accuracy,
- Mean precision,
- Mean sensitivity,
- G-Mean.
The above metrics can be used to determine any quality losses resulting from optimizations (i.e. pruning or quantization).
Adding new implementations
Dataset
, ModelWrapper
, ModelCompiler
, RuntimeProtocol
, Runtime
and other classes from kenning.core
module have dedicated directories for their implementations.
Each method in base classes that requires implementation raises NotImplementedError
exception.
Implemented methods can be also overriden, if neccessary.
Most of the base classes implement form_argparse
and from_argparse
methods.
The first one creates an argument parser and a group of arguments specific to the base class.
The second one creates an object of the class based on the arguments from argument parser.
Inheriting classes can modify form_argparse
and from_argparse
methods to provide better control over their processing, but they should always be based on the results of their base implementations.