Streaming Machine Learning from IoT Devices with HiveMQ, Apache Kafka and TensorFLow

WORK IN PROGRESS... NOT FINISHED YET !!!!!!

This project implements a scenario where you can train new analytic models from streaming data - without the need for an additional data store like S3, HDFS or Spark.

We use HiveMQ as open source MQTT broker to ingest data from IoT devices, ingest the data in real time into an Apache Kafka cluster for preprocessing (using Kafka Streams / KSQL), and model training (using TensorFlow 2.0 and its Kafka IO plugin).

Use Case and Architecture

Streaming Machine Learning with MQTT, Apache Kafka and TensorFlow I/O:

Data Integration
Data Preprocessing
Model Training
Model Deployment
Real Time Scoring
Real Time Monitoring

All steps happen in real time. No additional data store like S3 or HDFS is required:

Streaming Ingestion and Model Training with Kafka and TensorFlow-IO

Typically, analytic models are trained in batch mode where you first ingest all historical data in a data store like HDFS, AWS S3 or GCS. Then you train the model using a framework like Spark MLlib, TensorFlow or Google ML.

TensorFlow I/O is a component of the TensorFlow framework which allows native integration with various technologies.

One of these integrations is tensorflow_io.kafka which allows streaming ingestion into TensorFlow from Kafka WITHOUT the need for an additional data store! This significantly simplifies the architecture and reduces development, testing and operations costs. Yong Tang, member of the SIG TensorFlow I/O team, did a great presentation about this at Kafka Summit 2019 in New York (video and slide deck available for free).

You can pick and choose the right components from the Apache Kafka and TensorFlow ecosystems to build your own machine learning infrastructure for data integration, data processing, model training and model deployment:

This demo will do the following steps:

Consume streaming data from MQTT Broker HiveMQ via a Kafka Consumer
Preprocess the data with KSQL (filter, transform)
Ingest the data into TensorFlow (tf.data and tensorflow-io)
Build, train and save the model (TensorFlow 2.0 API)
Deploy the model within a Kafka Streams application for embedded real time scoring

Optional steps (nice to have)

Show IoT-specific features with HiveMQ tool stack
Deploy the model via TensorFlow Serving
Some kind of A/B testing
Re-train the model and updating the Kafka Streams application (via sending the new model to a Kafka topic)
Monitoring of model training (via TensorBoard) and model deployment / inference (via some kind of Kafka integration + dashboard technology)
Confluent Cloud for Kafka as a Service (-> Focus on business problems, not running infrastructure)
Enhance demo with C / librdkafka clients and TensorFlow Lite for edge computing
todo - other ideas?

Why is this so awesome

Again, you don't need another data store anymore! Just ingest the data directly from the distributed commit log of Kafka:

Be aware: This is still NOT Online Training

Streaming ingestion for model training is fantastic. You don't need a data store anymore. This simplifies the architecture and reduces operations and developemt costs.

However, one common misunderstanding has to be clarified - as this question comes up every time you talk about TensorFlow I/O and Apache Kafka: As long as machine learning / deep learning frameworks and algorythms expect data in batches, you cannot achieve real online training (i.e. re-training / optimizing the model with each new input event).

Only a few algoryhtms and implementations are available today, like Online Clustering.

Thus, even with TensorFlow I/O and streaming ingestion via Apache Kafka, you still do batch training. Though, you can configure and optimize these batches to fit your use case. Additionally, only Kafka allows ingestion at large scale for use cases like connected cars or assembly lines in factories. You cannot build a scalable, reliable, mission-critical ML infrastructure just with Python.

The combination of TensorFlow I/O and Apache Kafka is a great step closer to real time training of analytic models at scale!

I posted many articles about videos about this discussion. Get started with How to Build and Deploy Scalable Machine Learning in Production with Apache Kafka and check out my other resources if you want to learn more.

Requirements and Installation

TODO

Live Demo

Until the full demo is ready, you can already checkout two working examples which use Kafka Python clients to produce data to Kafka topics and then consume the streaming data directly with TensorFlow I/O for streaming ML without an additional data store:

Streaming ingestion of MNIST data into TensorFlow via Kafka for image regonition.
Autoencoder for anomaly detection of sensor data into TensorFlow via Kafka. Producer (Python Client) and Consumer (TensorFlow I/O Kafka Plugin) + Model Training.

TODO IMPLEMENT DEMO

Start HiveMQ
Start Confluent Platform
Start TensorFlow Python script consuming data to train the model
Create continuous data stream (MQTT messages)
Finish model training
Use model for inference: 1) via TensorFlow-IO Python API, 2) exported to a Kafka Streams / KSQL microservice and 3) via TensorFlow Serving

More Information

TensorFlow 2.0 is still beta and requires Linux

At the time of writing (July 2019), TensorFlow 2.0 is still beta
TensorFlow 2.0 I/O does NOT work on Mac or Windows (e.g. 'pip install tensorflow-io-2.0-preview does not work'). You need to install it on a Linux system (VM or Cloud Instance) to avoid many headaches and spend your time on the problem, not the infrastructure

TensorFlow 1.x to 2.0 Migration and Embedded Keras API

The upgrade tool allows smooth migration, especially if your TensorFlow 1.x code also uses Keras. I just executed: "tf_upgrade_v2 --infile Python-Tensorflow-1.x-Keras-Fraud-Detection-Autoencoder.ipynb --outfile Python-Tensorflow-2.0-Keras-Fraud-Detection-Autoencoder.ipynb"
Python-Tensorflow-1.x-Keras-Fraud-Detection-Autoencoder.ipynb is the initial Jupyter Notebook to create the autoencoder
Python-Tensorflow-2.0-Keras-Fraud-Detection-Autoencoder.ipynb is the TensorFlow 2.0 version
You usually still need to do some custom fixes. My migration report had no errors, but I still has to replace "tensorflow.keras" with "keras" imports (and uninstall the independent Keras package via pip to be on the safe path)
The Jupyter Notebook now runs well with TensorFlow 2.0 and its embedded Keras API

ksjpswaroop / hivemq-mqtt-tensorflow-kafka-realtime-iot-machine-learning-training-inference Goto Github PK

hivemq-mqtt-tensorflow-kafka-realtime-iot-machine-learning-training-inference's Introduction

Streaming Machine Learning from IoT Devices with HiveMQ, Apache Kafka and TensorFLow

Use Case and Architecture

Streaming Ingestion and Model Training with Kafka and TensorFlow-IO

Why is this so awesome

Be aware: This is still NOT Online Training

Requirements and Installation

Live Demo

More Information

TensorFlow 2.0 is still beta and requires Linux

TensorFlow 1.x to 2.0 Migration and Embedded Keras API

hivemq-mqtt-tensorflow-kafka-realtime-iot-machine-learning-training-inference's People

Contributors

Watchers

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent