Outlier detection on Sensor Data

The aim of this work is to perform outlier detection on a stream of sensor data.

An outlier is a data point that differs markedly from most of the remaining data. Hawkins formally defined the notion of an outlier as follows:

An outlier is an observation which deviates so much from the other observations as to arouse suspicions that it was generated by a different mechanism.

Three techniques will be used to perform outlier detection:

Tukey Fence
Local Outlier Factor
ARIMA

In the end, more focus is given to the last technique, ARIMA, which is also applied to a larger amount of data.

The work follows several steps, each developed in an IPython notebook.

Preliminary Plots - Transmission Frequencies

This first analysis is performed on sensor data transmitted over a Wi-Fi network.

Notebook

This notebook checks the transmission times of the sensors.

The transmission time of an observation is defined as the difference between its timestamp and the timestamp of the immediately preceding observation.
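As a minimal sketch of this definition (not the notebook's code), the transmission times can be computed as differences between consecutive timestamps and then counted. The function name `transmission_times` and the sample readings are hypothetical and chosen only for illustration.

```python
from collections import Counter
from datetime import datetime


def transmission_times(timestamps):
    """Return the gaps, in seconds, between consecutive observations."""
    ordered = sorted(timestamps)
    return [(b - a).total_seconds() for a, b in zip(ordered, ordered[1:])]


# Hypothetical readings from one sensor (made-up values).
readings = [
    datetime(2017, 1, 1, 12, 0, 0),
    datetime(2017, 1, 1, 12, 0, 10),
    datetime(2017, 1, 1, 12, 0, 20),
    datetime(2017, 1, 1, 12, 0, 45),  # one delayed transmission
    datetime(2017, 1, 1, 12, 0, 55),
]

gaps = transmission_times(readings)
frequencies = Counter(gaps)  # how often each transmission time occurs

print(gaps)
print(frequencies.most_common(1))
```

The `Counter` gives the kind of frequency table that can then be plotted per sensor type.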

The analysis first looks at the number of measurements taken daily. For the days with the most measurements, it checks the number of measurements per hour; for the most significant hours, it plots the frequencies of the transmission times and compares them across the different types of sensors.

As can be seen, the most common transmission time is 10 s, with only a few delays, which in most cases are small. Only one of the sensors in the dataset shows many delays, but even there the majority of transmission times are 10 s.

The next step is to count the outliers.

Tukey Fence

Notebook

The Tukey fence is one of the most common methods for finding outliers.

If Q1 and Q3 are the lower and upper quartiles, an outlier is any observation outside the range:

[Q1 - 1.5(Q3 - Q1), Q3 + 1.5(Q3 - Q1)]

The data points are grouped into tiling (non-overlapping) windows of 1 hour. For each window, the first and third quartiles are computed, and the points outside the fence are marked as outliers.
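The windowed procedure can be sketched as follows, using only the Python standard library. This is an illustration under stated assumptions, not the notebook's implementation: timestamps are taken to be seconds since an arbitrary start, quartiles use the "inclusive" method of `statistics.quantiles` (other quartile conventions give slightly different fences), and windows with fewer than 4 points are skipped. The function names and sample values are hypothetical.

```python
from collections import defaultdict
from statistics import quantiles


def tukey_outliers(values, k=1.5):
    """Return the values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    if len(values) < 4:
        # Assumption: too few points to estimate quartiles meaningfully.
        return []
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


def outliers_per_window(samples, window_seconds=3600):
    """Group (timestamp, value) pairs into tiling windows and apply the fence."""
    windows = defaultdict(list)
    for ts, value in samples:
        windows[int(ts // window_seconds)].append(value)
    return {w: tukey_outliers(vals) for w, vals in sorted(windows.items())}


# Hypothetical samples: (seconds since start, measured value).
samples = [
    (0, 20.0), (600, 20.5), (1200, 21.0), (1800, 20.2),
    (2400, 35.0), (3000, 20.8), (3500, 20.4),  # first hour
    (3700, 20.1), (4000, 20.3),                # second hour, sparse
]

result = outliers_per_window(samples)
print(result)
```

In this toy data only the value 35.0 in the first window falls outside the fence; the sparse second window yields no outliers at all.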

With this technique, the number of outliers is very low relative to the number of measurements.

In particular, the number of outliers is connected to the number of missing measurements within a time window: if a window contains too few measurements, the method is not able to find any outliers.

Local Outlier Factor

The local outlier factor (LOF) is based on the concept of local density, where locality is given by the k nearest neighbors, whose distances are used to estimate the density.

By comparing the local density of an object to the local densities of its neighbors, one can identify regions of similar density, and points that have a substantially lower density than their neighbors.

These lower density points are considered to be outliers.

In practice:

LOF < 1 means higher density than neighbors (inlier)
LOF > 1 means lower density than neighbors (outlier)

Notebook

To compute the LOF, the data points are treated as points in a 2-dimensional space, whose dimensions are the timestamp and the value of the measurement.
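A minimal pure-Python sketch of LOF on (timestamp, value) points is given below. It is a simplified illustration, not the notebook's code: each point uses exactly its k nearest neighbors (ties at the k-th distance are not expanded, as the full definition would require), and the data are assumed to contain no groups of more than k identical points. The function name, the value of k, and the sample points are hypothetical.

```python
from math import dist


def local_outlier_factor(points, k=3):
    """Return one LOF score per point, using Euclidean distance."""
    n = len(points)
    # For each point: its k nearest neighbors as (distance, index) pairs.
    neighbors = []
    for i in range(n):
        ranked = sorted(
            (dist(points[i], points[j]), j) for j in range(n) if j != i
        )
        neighbors.append(ranked[:k])
    # k-distance: distance to the k-th nearest neighbor.
    k_distance = [nb[-1][0] for nb in neighbors]

    def local_reachability_density(i):
        # reach-dist(i, j) = max(k-distance(j), d(i, j))
        reach = [max(k_distance[j], d) for d, j in neighbors[i]]
        return len(reach) / sum(reach)

    lrd = [local_reachability_density(i) for i in range(n)]
    # LOF: average density of the neighbors divided by the point's own density.
    return [
        sum(lrd[j] for _, j in neighbors[i]) / (len(neighbors[i]) * lrd[i])
        for i in range(n)
    ]


# Hypothetical points: (timestamp in seconds, value); index 5 is anomalous.
points = [
    (0, 20.0), (10, 20.1), (20, 19.9), (30, 20.2), (40, 20.0),
    (50, 60.0), (60, 20.1), (70, 19.8), (80, 20.0), (90, 20.1),
]

scores = local_outlier_factor(points, k=3)
most_anomalous = scores.index(max(scores))
print(most_anomalous, round(scores[most_anomalous], 2))
```

Because both dimensions enter the same Euclidean distance, their relative scales determine which one dominates the neighbor search.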

Two experiments are performed: one computing the LOF using a tiling window of 1 hour, and one using the dataset as a whole.

The results of the two experiments are similar; with a tiling window, however, we can lose some information about the points at the edges of the time window, which affects the distance computation.

Few points are marked as outliers, and in particular they do not look like points that deviate from the distribution. This could happen because the time dimension has a larger influence on the computation of the distances between points.

ARIMA

ARIMA (AutoRegressive Integrated Moving Average) is a generalization of the ARMA model, which on its own cannot be applied to non-stationary data.

An ARMA model is expressed as the composition of an autoregressive model, AR(p), and a moving average model, MA(q).

As said, the ARMA model is not able to address non-stationary data. In many cases, non-stationary data can be handled by combining differencing with the ARMA model, resulting in an ARIMA model.
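For reference, the models mentioned above can be written in their standard textbook form. The notation is an assumption of this sketch: X_t is the series, ε_t is white noise, c and μ are constants, φ_i and θ_j are the model coefficients, and B is the backshift operator.

```latex
\begin{align}
\text{AR}(p):\quad & X_t = c + \sum_{i=1}^{p} \varphi_i X_{t-i} + \varepsilon_t \\
\text{MA}(q):\quad & X_t = \mu + \varepsilon_t + \sum_{j=1}^{q} \theta_j \varepsilon_{t-j} \\
\text{ARMA}(p, q):\quad & X_t = c + \varepsilon_t + \sum_{i=1}^{p} \varphi_i X_{t-i} + \sum_{j=1}^{q} \theta_j \varepsilon_{t-j} \\
\text{ARIMA}(p, d, q):\quad & Y_t = (1 - B)^d X_t \ \text{follows an ARMA}(p, q) \text{ model}, \qquad B X_t = X_{t-1}
\end{align}
```

With d = 1, the differenced series is simply Y_t = X_t - X_{t-1}, which removes a linear trend before the ARMA part is fitted.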

Notebook

In the notebook, a 1-hour tiling window is used to train an ARIMA model, which is then used to make predictions for the subsequent window.

The prediction is then used to determine whether a new observation is an outlier: the error is computed as the difference between prediction and measurement, and if this error is above a certain threshold (which depends on the sensor type), the new observation is marked as an outlier.
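The train-then-predict scheme can be illustrated with a deliberately simplified stand-in for a full ARIMA fit: an ARIMA(1,1,0) model without a constant term, whose single coefficient is estimated by least squares on the first differences. This is not the notebook's model or any library's API; the function names, the sample windows, and the threshold of 1.0 are all hypothetical.

```python
def fit_ar1_on_differences(train):
    """Least-squares AR(1) coefficient on the first differences of `train`."""
    diffs = [b - a for a, b in zip(train, train[1:])]
    numerator = sum(x * y for x, y in zip(diffs, diffs[1:]))
    denominator = sum(x * x for x in diffs[:-1])
    return numerator / denominator if denominator else 0.0


def forecast(train, steps):
    """Multi-step forecast of a simplified ARIMA(1,1,0) without constant."""
    phi = fit_ar1_on_differences(train)
    level = train[-1]
    diff = train[-1] - train[-2]
    predictions = []
    for _ in range(steps):
        diff = phi * diff   # AR(1) step on the differenced series
        level += diff       # integrate back to the original scale
        predictions.append(level)
    return predictions


def detect_outliers(train, test, threshold):
    """Return indices in `test` whose absolute prediction error exceeds the threshold."""
    predictions = forecast(train, len(test))
    return [
        i for i, (observed, predicted) in enumerate(zip(test, predictions))
        if abs(observed - predicted) > threshold
    ]


# Hypothetical windows: train on one window, check the next one.
train = [20.0, 20.2, 20.3, 20.5, 20.6, 20.8]
test = [20.9, 21.1, 26.0, 21.2]

flagged = detect_outliers(train, test, threshold=1.0)
print(flagged)
```

Since the forecast is made once for the whole subsequent window, a real series that rises or falls faster than the model extrapolates drifts away from the predictions, and more points cross the threshold.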

The predictions follow the trend of the sensors quite well. The only problem arises when the values decay or grow too fast, which makes the predictions move too far from the real values and increases the number of outliers.

Thus the method is sensitive to noisy behavior, which makes the outlier count grow considerably. This behavior could be useful, since excessive noise in the data could mean that a sensor is not working well.

Transmission Frequencies on XBee

This second analysis is performed on sensor streams transmitted using the IEEE 802.15.4 standard at 2.4 GHz. In theory, since the network is not shared with users (unlike the Wi-Fi network), the transmission rate should be more stable.

Notebook

The procedure is similar to the first section: look at the number of measurements taken daily, then at some specific hours. In particular, this is done to determine a transmission time for the sensors, which is needed to produce the plots for ARIMA.

With this technology, the amount of data transmitted per day is much higher, but in most cases there is no constant transmission time, nor one transmission time that is more frequent than the others.

This could happen because the sensors are not working properly, or because the transmission medium is affected by other devices operating at 2.4 GHz.

ARIMA on XBee

Notebook

The amount of data used is much larger than in the previous case. In particular, here there are different estimated transmission rates.

The first thing one can notice is that in the streams with a lower amount of data per hour, the number of outliers is lower.

Here again it is possible to see that the technique is sensitive to noise in the data. As the plots show, the sensor with the higher transmission rate produces a higher number of measurements, but the noise in the data is also higher.

In particular, when the noise in the data is higher, the number of outliers increases.

Transmission Frequencies on LoRa

This last analysis is performed on sensors transmitting using LoRa (Long Range), which allows transmission over long distances.

Again, what we expect is more stable transmission times.

Notebook

As for XBee, we cannot tell what the most common transmission time of the sensors is.

The number of measurements is quite high, but the most frequent transmission time changes a lot, and often there is more than one.

ARIMA on LoRa

Notebook

In this case, the measurements also include an indicator of the signal strength (received signal strength indicator, RSSI). In most cases the signal is weak, and this affects the number of measurements.

The sensors show similar behaviors in the outlier count. As most of the plots suggest, there is noise in the observations; however, when the noise is high enough to make the sensor report extreme values, those points are also correctly detected as outliers.
