
PyNomaly

PyNomaly is a Python 3 implementation of LoOP (Local Outlier Probabilities). LoOP is a local density-based outlier detection method by Kriegel, Kröger, Schubert, and Zimek which provides outlier scores in the range [0,1] that are directly interpretable as the probability of a sample being an outlier.


The outlier score of each sample is called the Local Outlier Probability. It measures the local deviation of the density of a given sample with respect to its neighbors, as Local Outlier Factor (LOF) does, but provides normalized outlier scores in the range [0,1]. These outlier scores are directly interpretable as the probability of an object being an outlier. Since Local Outlier Probabilities are scores in the range [0,1], practitioners are free to interpret the results according to the application.

Like LOF, it is local in that the anomaly score depends on how isolated the sample is with respect to the surrounding neighborhood. Locality is given by k-nearest neighbors, whose distance is used to estimate the local density. By comparing the local density of a sample to the local densities of its neighbors, one can identify samples that lie in regions of lower density compared to their neighbors and thus identify samples that may be outliers according to their Local Outlier Probability.

The authors' 2009 paper detailing LoOP's theory, formulation, and application is provided by the Ludwig-Maximilians University Munich Institute for Informatics: LoOP: Local Outlier Probabilities.

PyNomaly Seeks Maintainers! ✨

Love using PyNomaly? Want to develop your open source software (OSS) experience and credentials?

PyNomaly is looking for maintainers! PyNomaly doesn't need much on a day-to-day basis, but it does need some attention.

On the flip side, the sky is the limit... Have you seen Mojo and what it can do with matrix multiplication? It would definitely speed things up.

Interested? Send an email to [email protected].

Implementation

This Python 3 implementation uses Numpy and the formulas outlined in LoOP: Local Outlier Probabilities to calculate the Local Outlier Probability of each sample.

Dependencies

  • Python 3.5 - 3.8
  • numpy >= 1.16.3
  • python-utils >= 2.3.0
  • (optional) numba >= 0.45.1

Numba just-in-time (JIT) compiles the function which calculates the Euclidean distance between observations, providing a reduction in computation time (a significant one when a large number of observations are scored). Numba is not a requirement, and PyNomaly may still be used solely with numpy if desired (details below).
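
As a minimal sketch of the technique (illustrative only, not PyNomaly's internal code), a Euclidean distance function can be JIT-compiled with Numba as follows:

import numpy as np
from numba import jit

@jit(nopython=True)
def euclidean(v1, v2):
    # compiled to machine code on the first call; later calls skip the Python interpreter
    return np.sqrt(np.sum((v1 - v2) ** 2))

print(euclidean(np.array([0.0, 0.0]), np.array([3.0, 4.0])))  # 5.0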

Quick Start

First install the package from the Python Package Index:

pip install PyNomaly # or pip3 install ... if you're using both Python 3 and 2.

Then you can do something like this:

from PyNomaly import loop
m = loop.LocalOutlierProbability(data).fit()
scores = m.local_outlier_probabilities
print(scores)

where data is an NxM (N rows, M columns; 2-dimensional) set of data as either a Pandas DataFrame or Numpy array.

LocalOutlierProbability sets the extent (an integer value of 1, 2, or 3) and n_neighbors (must be greater than 0) parameters with default values of 3 and 10, respectively. You're free to set these parameters on your own, as below:

from PyNomaly import loop
m = loop.LocalOutlierProbability(data, extent=2, n_neighbors=20).fit()
scores = m.local_outlier_probabilities
print(scores)

This implementation of LoOP also includes an optional cluster_labels parameter. This is useful in cases where regions of varying density occur within the same set of data. When using cluster_labels, the Local Outlier Probability of a sample is calculated with respect to its cluster assignment.

from PyNomaly import loop
from sklearn.cluster import DBSCAN
db = DBSCAN(eps=0.6, min_samples=50).fit(data)
m = loop.LocalOutlierProbability(data, extent=2, n_neighbors=20, cluster_labels=list(db.labels_)).fit()
scores = m.local_outlier_probabilities
print(scores)

NOTE: Unless your data is all on the same scale, it may be a good idea to normalize your data with z-scores or another normalization scheme prior to using LoOP, especially when working with multiple dimensions of varying scale. Users must also appropriately handle missing values prior to using LoOP, as LoOP does not support Pandas DataFrames or Numpy arrays with missing values.
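
For example, a simple z-score normalization with numpy might look like the following (a sketch; it assumes data is the numeric NxM input from the examples above, with no zero-variance columns):

import numpy as np
from PyNomaly import loop

# column-wise z-scores: center each feature and scale by its standard deviation
normalized = (data - data.mean(axis=0)) / data.std(axis=0)
m = loop.LocalOutlierProbability(normalized).fit()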

Utilizing Numba and Progress Bars

It may be helpful to use just-in-time (JIT) compilation in the cases where a lot of observations are scored. Numba, a JIT compiler for Python, may be used with PyNomaly by setting use_numba=True:

from PyNomaly import loop
m = loop.LocalOutlierProbability(data, extent=2, n_neighbors=20, use_numba=True, progress_bar=True).fit()
scores = m.local_outlier_probabilities
print(scores)

Numba must be installed to use JIT compilation and improve the speed of multiple calls to LocalOutlierProbability(); PyNomaly has been tested with Numba version 0.45.1. An example of the speed difference that can be realized with Numba is available in examples/numba_speed_diff.py.

You may also choose to print progress bars with or without the use of Numba by passing progress_bar=True to LocalOutlierProbability() as above.

Choosing Parameters

The extent parameter controls the sensitivity of the scoring in practice. The parameter corresponds to the statistical notion of an outlier defined as an object deviating more than a given lambda (extent) times the standard deviation from the mean. A value of 2 implies outliers deviating more than 2 standard deviations from the mean, which corresponds to 95.0% in the empirical "68-95-99.7" (three-sigma) rule. The appropriate value should be selected according to the level of sensitivity needed for the input data and application: ask whether it is more reasonable to assume outliers in your data are 1, 2, or 3 standard deviations from the mean, and select the value most appropriate to your data and application.
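
For illustration, a quick way to see the effect of extent is to fit the same data with each value and compare the resulting scores (a sketch, using data as in the Quick Start):

from PyNomaly import loop

for extent in [1, 2, 3]:
    m = loop.LocalOutlierProbability(data, extent=extent).fit()
    # a larger extent implies a stricter notion of "outlier" and generally lower scores
    print(extent, m.local_outlier_probabilities.astype(float).max())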

The n_neighbors parameter defines the number of neighbors to consider about each sample (the neighborhood size) when determining its Local Outlier Probability with respect to the density of the sample's defined neighborhood. The ideal number of neighbors to consider is dependent on the input data. However, the notion of an outlier implies it would be considered as such regardless of the number of neighbors considered. One potential approach is to use a number of different neighborhood sizes and average the results for each observation; observations which rank highly across varying neighborhood sizes are more than likely outliers (see the sketch below). Another approach is to select a value proportional to the number of observations, such as an odd-valued integer close to the square root of the number of observations in your data (sqrt(n_observations)).
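
A sketch of the averaging approach described above (the neighborhood sizes chosen here are arbitrary illustrative values):

import numpy as np
from PyNomaly import loop

neighborhood_sizes = [10, 15, 20, 25]  # arbitrary illustrative values
scores_by_size = []
for k in neighborhood_sizes:
    m = loop.LocalOutlierProbability(data, n_neighbors=k).fit()
    scores_by_size.append(m.local_outlier_probabilities.astype(float))

# observations with a high mean score across neighborhood sizes are likely outliers
mean_scores = np.mean(np.array(scores_by_size), axis=0)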

Iris Data Example

We'll be using the well-known Iris dataset to show LoOP's capabilities. There's a few things you'll need for this example beyond the standard prerequisites listed above:

  • matplotlib 2.0.0 or greater
  • PyDataset 0.2.0 or greater
  • scikit-learn 0.18.1 or greater

First, let's import the packages and libraries we will need for this example.

from PyNomaly import loop
import pandas as pd
from pydataset import data
import numpy as np
from sklearn.cluster import DBSCAN
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

Now let's create two sets of Iris data for scoring; one with clustering and the other without.

# import the data and remove any non-numeric columns
iris = pd.DataFrame(data('iris'))
iris = pd.DataFrame(iris.drop('Species', axis=1))

Next, let's cluster the data using DBSCAN and generate two sets of scores. In both cases, we will use the default values for both extent (3) and n_neighbors (10).

db = DBSCAN(eps=0.9, min_samples=10).fit(iris)
m = loop.LocalOutlierProbability(iris).fit()
scores_noclust = m.local_outlier_probabilities
m_clust = loop.LocalOutlierProbability(iris, cluster_labels=list(db.labels_)).fit()
scores_clust = m_clust.local_outlier_probabilities

Organize the data into two separate Pandas DataFrames.

iris_clust = pd.DataFrame(iris.copy())
iris_clust['scores'] = scores_clust
iris_clust['labels'] = db.labels_
iris['scores'] = scores_noclust

And finally, let's visualize the scores provided by LoOP in both cases (with and without clustering).

fig = plt.figure(figsize=(7, 7))
ax = fig.add_subplot(111, projection='3d')
ax.scatter(iris['Sepal.Width'], iris['Petal.Width'], iris['Sepal.Length'],
           c=iris['scores'], cmap='seismic', s=50)
ax.set_xlabel('Sepal.Width')
ax.set_ylabel('Petal.Width')
ax.set_zlabel('Sepal.Length')
plt.show()
plt.clf()
plt.cla()
plt.close()

fig = plt.figure(figsize=(7, 7))
ax = fig.add_subplot(111, projection='3d')
ax.scatter(iris_clust['Sepal.Width'], iris_clust['Petal.Width'], iris_clust['Sepal.Length'],
           c=iris_clust['scores'], cmap='seismic', s=50)
ax.set_xlabel('Sepal.Width')
ax.set_ylabel('Petal.Width')
ax.set_zlabel('Sepal.Length')
plt.show()
plt.clf()
plt.cla()
plt.close()

fig = plt.figure(figsize=(7, 7))
ax = fig.add_subplot(111, projection='3d')
ax.scatter(iris_clust['Sepal.Width'], iris_clust['Petal.Width'], iris_clust['Sepal.Length'],
           c=iris_clust['labels'], cmap='Set1', s=50)
ax.set_xlabel('Sepal.Width')
ax.set_ylabel('Petal.Width')
ax.set_zlabel('Sepal.Length')
plt.show()
plt.clf()
plt.cla()
plt.close()

Your results should look like the following:

LoOP Scores without Clustering

LoOP Scores with Clustering

DBSCAN Cluster Assignments

Note the differences between using LocalOutlierProbability with and without clustering. In the example without clustering, samples are scored according to the distribution of the entire data set. In the example with clustering, each sample is scored according to the distribution of each cluster. Which approach is suitable depends on the use case.

NOTE: Data was not normalized in this example, but it's probably a good idea to do so in practice.

Using Numpy

When using numpy, make sure to use 2-dimensional arrays in tabular format:

data = np.array([
    [43.3, 30.2, 90.2],
    [62.9, 58.3, 49.3],
    [55.2, 56.2, 134.2],
    [48.6, 80.3, 50.3],
    [67.1, 60.0, 55.9],
    [421.5, 90.3, 50.0]
])

scores = loop.LocalOutlierProbability(data, n_neighbors=3).fit().local_outlier_probabilities
print(scores)

The shape of the input array corresponds to the rows (observations) and columns (features) in the data:

print(data.shape)
# (6, 3), which matches the number of observations and features in the above example

Similar to the above:

data = np.random.rand(100, 5)
scores = loop.LocalOutlierProbability(data).fit().local_outlier_probabilities
print(scores)

Specifying a Distance Matrix

PyNomaly provides the ability to specify a distance matrix so that any distance metric can be used (a neighbor index matrix must also be provided). This can be useful when wanting to use a distance other than the Euclidean distance.

from sklearn.neighbors import NearestNeighbors

data = np.array([
    [43.3, 30.2, 90.2],
    [62.9, 58.3, 49.3],
    [55.2, 56.2, 134.2],
    [48.6, 80.3, 50.3],
    [67.1, 60.0, 55.9],
    [421.5, 90.3, 50.0]
])

neigh = NearestNeighbors(n_neighbors=3, metric='hamming')
neigh.fit(data)
d, idx = neigh.kneighbors(data, return_distance=True)

m = loop.LocalOutlierProbability(distance_matrix=d, neighbor_matrix=idx, n_neighbors=3).fit()
scores = m.local_outlier_probabilities

The below visualization shows the results by a few known distance metrics:

LoOP Scores by Distance Metric

Streaming Data

PyNomaly also contains an implementation of Hamlet et al.'s modifications to the original LoOP approach [4], which may be used for applications involving streaming data or where rapid calculations are necessary. First, the standard LoOP algorithm is fit to "training" data, and certain attributes of the fit are stored. Then, as new points arrive, these stored attributes are used when calculating the score of the incoming streaming data, relying on averages from the initial fit such as a global value for the expected value of the probabilistic distance. Despite the potential for increased error compared to the standard approach, this may be effective in streaming applications where refitting the standard approach over all points would be computationally expensive.

While the iris dataset is not streaming data, we'll use it in this example by taking the first 120 observations as training data and treating the remaining 30 observations as a stream, scoring each observation individually.

Split the data.

iris = iris.sample(frac=1) # shuffle data
iris_train = iris.iloc[:, 0:4].head(120)
iris_test = iris.iloc[:, 0:4].tail(30)

Fit to each set.

m = loop.LocalOutlierProbability(iris.iloc[:, 0:4]).fit()  # fit on the four feature columns only
scores_noclust = m.local_outlier_probabilities
iris['scores'] = scores_noclust

m_train = loop.LocalOutlierProbability(iris_train, n_neighbors=10)
m_train.fit()
iris_train_scores = m_train.local_outlier_probabilities
iris_test_scores = []
for index, row in iris_test.iterrows():
    array = np.array([row['Sepal.Length'], row['Sepal.Width'], row['Petal.Length'], row['Petal.Width']])
    iris_test_scores.append(m_train.stream(array))
iris_test_scores = np.array(iris_test_scores)

Concatenate the scores and assess.

iris['stream_scores'] = np.hstack((iris_train_scores, iris_test_scores))
# iris['scores'] from earlier example
rmse = np.sqrt(((iris['scores'] - iris['stream_scores']) ** 2).mean(axis=None))
print(rmse)

The root mean squared error (RMSE) between the two approaches is approximately 0.199 (your scores will vary depending on the data and specification). The plot below shows the scores from the stream approach.

fig = plt.figure(figsize=(7, 7))
ax = fig.add_subplot(111, projection='3d')
ax.scatter(iris['Sepal.Width'], iris['Petal.Width'], iris['Sepal.Length'],
           c=iris['stream_scores'], cmap='seismic', s=50)
ax.set_xlabel('Sepal.Width')
ax.set_ylabel('Petal.Width')
ax.set_zlabel('Sepal.Length')
plt.show()
plt.clf()
plt.cla()
plt.close()

LoOP Scores using Stream Approach with n=10

Notes

When calculating the LoOP score of incoming data, the original fitted scores are not updated. In some applications, it may be beneficial to refit the data periodically. The stream functionality also assumes that either data or a distance matrix (or value) will be used in both fitting and streaming, with no changes in specification between steps.
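
As a sketch of such periodic refitting (refit_every, train_data, and stream_points are illustrative assumptions, not part of the PyNomaly API):

import numpy as np
from PyNomaly import loop

refit_every = 100  # illustrative refit interval
seen = list(train_data)  # train_data: a 2-dimensional numpy array
m = loop.LocalOutlierProbability(np.array(seen), n_neighbors=10)
m.fit()

for point in stream_points:  # stream_points: an iterable of 1-dimensional arrays
    score = m.stream(point)
    seen.append(point)
    if len(seen) % refit_every == 0:
        # refit so the stored attributes reflect the most recent data
        m = loop.LocalOutlierProbability(np.array(seen), n_neighbors=10)
        m.fit()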

Contributing

Please use the issue tracker to report any erroneous behavior or desired feature requests.

If you would like to contribute to development, please fork the repository and make any changes on a branch which corresponds to an open issue. Hot fixes and bug fixes can be represented by branches with the prefix fix/, versus feature/ for new capabilities or code improvements. Pull requests are then made from these branches into the repository's dev branch prior to being pulled into main. Pull requests which are works in progress or ready for merging should be indicated by their respective prefixes ([WIP] and [MRG]). Pull requests with the [MRG] prefix will be reviewed prior to being pulled into the main branch.

Tests

When contributing, please ensure to run unit tests and add additional tests as necessary if adding new functionality. To run the unit tests, use pytest:

python3 -m pytest --cov=PyNomaly -s -v

To run the tests with Numba enabled, simply set the flag NUMBA in test_loop.py to True. Note that a drop in coverage is expected due to portions of the code being compiled upon code execution.

Versioning

Semantic versioning is used for this project. If contributing, please conform to semantic versioning guidelines when submitting a pull request.

License

This project is licensed under the Apache 2.0 license.

Research

If citing PyNomaly, use the following:

@article{Constantinou2018,
  doi = {10.21105/joss.00845},
  url = {https://doi.org/10.21105/joss.00845},
  year  = {2018},
  month = {oct},
  publisher = {The Open Journal},
  volume = {3},
  number = {30},
  pages = {845},
  author = {Valentino Constantinou},
  title = {{PyNomaly}: Anomaly detection using Local Outlier Probabilities ({LoOP}).},
  journal = {Journal of Open Source Software}
}

References

  1. Breunig M., Kriegel H.-P., Ng R., Sander J. LOF: Identifying Density-based Local Outliers. ACM SIGMOD International Conference on Management of Data (2000).
  2. Kriegel H.-P., Kröger P., Schubert E., Zimek A. LoOP: Local Outlier Probabilities. 18th ACM Conference on Information and Knowledge Management, CIKM (2009).
  3. Goldstein M., Uchida S. A Comparative Evaluation of Unsupervised Anomaly Detection Algorithms for Multivariate Data. PLoS ONE 11(4): e0152173 (2016).
  4. Hamlet C., Straub J., Russell M., Kerlin S. An incremental and approximate local outlier probability algorithm for intrusion detection and its evaluation. Journal of Cyber Security Technology (2016).

pynomaly's Issues

Adjust indexing in data_store

cardinality, data_store[:, 3])]).T))

PyNomaly used to index the nearest neighbor distance column, data_store[:, 2]. This indexing has been removed, but the indexing of the remaining calculations has not been changed to reflect this, resulting in a column of None values in the second column of the data_store object.

question again

When using LoOP to detect outliers, it often fails at the statement return (probabilistic_distance / ev_prob_dist) - 1., and I have to set n_neighbors to a lower value. But sometimes even when I give 4 or 5, it fails with ZeroDivisionError: float division by zero. Thanks for answering!

Add docstring to tests

While the names of the tests in tests/test_loop.py provide an indication as to the purpose of the unit tests, it would be beneficial to include more verbose docstrings to give a better idea of the purpose of each test.

Distance Matrix support

I'm currently using LOF for a Distance Matrix. Is it possible to also use a Distance Matrix for LoOP? Or are the points needed for the computation of the probabilities?

Clarify orientation of 2d array

It might help to mention what the semantic orientation of the input data array is. Originally I had it transposed, which gave me a somewhat unhelpful index error. If you have a square array, maybe for testing, you might not realize that the data is flipped.

Alter Fit Convention

Provide parameters in LocalOutlierProbability() and provide data in fit(), as opposed to providing data in LocalOutlierProbability() along with the parameters. This is so that PyNomaly is more in line with scikit-learn and other popular libraries.

Use Trees for data structures

It looks like all distances are currently being calculated, which is expensive. Borrowing from sklearn, BallTree and KDTree could be used to speed up nearest neighbor calculations.
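
As a sketch of this suggestion (not current PyNomaly internals), a BallTree could supply the distance and neighbor matrices that LocalOutlierProbability already accepts:

import numpy as np
from sklearn.neighbors import BallTree
from PyNomaly import loop

data = np.random.rand(500, 3)
tree = BallTree(data)
# query k+1 neighbors and drop the first column, which is each point itself
d, idx = tree.query(data, k=11)
m = loop.LocalOutlierProbability(distance_matrix=d[:, 1:],
                                 neighbor_matrix=idx[:, 1:],
                                 n_neighbors=10).fit()
scores = m.local_outlier_probabilities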

data stream

Does this package support streaming data in anomaly detection?

parallelize

It would be great if there's an option for embarrassingly parallel computations, especially if all N^2 distances are calculated.

Algorithm Discrepancies

Recently I've had to do outlier analysis for work and came across your package which neatly implements LoOP. When I was reading through the source code, I noticed two spots where the calculations don't seem to line up with what was described in the paper. There may be good reasons why it differs from the original algorithm (or I'm misreading it entirely).

  1. When calculating the PLOF for an object O, the expected value used in this package is the expected value of every pdist value, whereas the paper says to use just the expected value of the pdist of objects that are in O's neighborhood.

  2. In the calculation of the nPLOF, _prob_local_outlier_factors_ev first calculates the expected value of PLOF squared just fine, but afterwards in _norm_prob_outlier_factor the value is again squared and square rooted. The paper specifies just a square root, not another square.

I was just wondering if there was a reason for the deviation from the paper in these regards.
Thanks!

Division by zero when including cluster labels

When using Kriegel et al.'s original 2d-synthetic dataset, and when including the cluster labels, the result is a divide-by-zero error.

Without the cluster labels, the algorithm runs to completion, but produces the result we talked about last week (slightly too confident probability values). The two behaviors may be related, but as I am not sure, I thought it better to mention both issues.

Add regression tests for refactor validation

Add regression tests to establish a baseline with the current working version of PyNomaly by comparing the LoOP results of a predefined array of values to the expected results. These regression tests can be used to compare the results of future refactored versions. If we detect differences in the results, we will know that something in our refactored calculations is wrong.

numpy arrays as input don't work

Using data = pd.DataFrame(np.random.rand(100, 5)) as the input to LocalOutlierProbability works as expected; however, passing the raw ndarray np.random.rand(100, 5) throws an error:

self.points_vector = self.data.reshape(self.data.shape[1:])
ValueError: cannot reshape array of size 500 into shape (5,)

Library documentation

As the current capabilities of PyNomaly are solidified and new capabilities added, it would be beneficial to have dedicated documentation that is hosted and available to users outside of the readme.

Integrate Numba's JIT with key functions

Numba's JIT can greatly accelerate the processing speed of the _distances function and others. One of the main drawbacks of nearest-neighbor approaches is their computational resource needs - reducing this drawback by just-in-time compiling key functions has the potential to greatly accelerate the speeds in which observations are processed.

Inconsistency in case of dataframe and distance matrix input

This is not a project issue, but a suggestion to put some kind of warning in the distance matrix example in the README.

There is an example of using a distance matrix as input for LoOP in the README. It shows how sklearn.neighbors.NearestNeighbors can be used to obtain a distance matrix together with an index matrix. It seems that this way the matrices also contain the distance from each point to itself, resulting in a zero distance for the first nearest neighbor of every point.
On the other hand, the internal method _compute_distance_and_neighbor_matrix, used when the data argument is specified, excludes the distance from a point to itself, and so gives different scores on the same data.
I took a look at the test case, which allows a difference of 0.15 in the scores vector, and thus the difference between 0.45 and 0.6 is considered negligible.
I think the output matrices of sklearn.neighbors.NearestNeighbors should be transformed first to be consistent with the internal algorithm.
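
A minimal sketch of the suggested transformation: query one extra neighbor, then drop the self-distance column before handing the matrices to PyNomaly (data here is an illustrative array):

import numpy as np
from sklearn.neighbors import NearestNeighbors
from PyNomaly import loop

data = np.random.rand(100, 3)
# request n_neighbors + 1 so the self-match can be discarded
neigh = NearestNeighbors(n_neighbors=4)
neigh.fit(data)
d, idx = neigh.kneighbors(data, return_distance=True)

# drop the first column: each point's zero distance to itself
m = loop.LocalOutlierProbability(distance_matrix=d[:, 1:],
                                 neighbor_matrix=idx[:, 1:],
                                 n_neighbors=3).fit()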

ZeroDivisionError when using DataFrame

I had been using version 0.1.5 without issues in the following script, but after deciding to upgrade I now see the following issue:

data = [43.3,62.9,55.2,48.6,67.1,421.5] # example data
new_array=pd.DataFrame(data)
scores = loop.LocalOutlierProbability(new_array).fit()
scores = scores.local_outlier_probabilities

Traceback:

---------------------------------------------------------------------------
ZeroDivisionError                         Traceback (most recent call last)
<ipython-input-6-1cb16c12004f> in <module>()
      6 
      7 new_array=pd.DataFrame(l)
----> 8 scores = loop.LocalOutlierProbability(new_array).fit()
      9 scores = scores.local_outlier_probabilities
     10 np.where(scores > DETECTION_FACTOR)[0]

/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/PyNomaly/loop.py in fit(self)
    226         store = self._norm_prob_local_outlier_factors(store)
    227         self.norm_prob_local_outlier_factor = np.max(store[:, 9])
--> 228         store = self._local_outlier_probabilities(store)
    229         self.local_outlier_probabilities = store[:, 10]
    230 

/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/PyNomaly/loop.py in _local_outlier_probabilities(self, data_store)
    211         return np.hstack(
    212             (data_store,
--> 213              np.array([np.apply_along_axis(self._local_outlier_probability, 0, data_store[:, 7], data_store[:, 9])]).T))
    214 
    215     def fit(self):

/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/numpy/lib/shape_base.py in apply_along_axis(func1d, axis, arr, *args, **kwargs)
    114     except StopIteration:
    115         raise ValueError('Cannot apply_along_axis when any iteration dimensions are 0')
--> 116     res = asanyarray(func1d(inarr_view[ind0], *args, **kwargs))
    117 
    118     # build a buffer for storing evaluations of func1d.

/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/PyNomaly/loop.py in _local_outlier_probability(plof_val, nplof_val)
    111     def _local_outlier_probability(plof_val, nplof_val):
    112         erf_vec = np.vectorize(erf)
--> 113         return np.maximum(0, erf_vec(plof_val / (nplof_val * np.sqrt(2.))))
    114 
    115     def _n_observations(self):

ZeroDivisionError: float division by zero

Changes to distance measure implementation to improve speed

Hello authors,
I have worked with a range of anomaly detection algorithms. I have been using LoOP in my testbed and the time consumed is very high. More particularly, my training data includes about 46,000 points with two features each, and the number of clusters is 1 (only normal traffic with label 0). It took me about 11,000 seconds for the training phase and 0.5 s for the testing phase (with k=20).
When I reviewed the code in the loop.py file, I saw that you use two loops in the function:
def _distances(self, progress_bar: bool = False) -> None:
I have rewritten this function, ignored the cluster handling (because I do not need it), and used numpy functions to reduce it to only one for loop. It now takes me only about 120 s instead.
The function looks like:
def distance_dn(self, point_vector):
    """
    Calculate distances from each point to the remaining points.
    """
    data = point_vector
    k = self.n_neighbors
    distance = np.zeros((len(data), k))
    index = np.zeros((len(data), k))
    for i in range(len(data)):
        # tile the current point and compute Euclidean distances to all points
        data_i = np.array([[data[i][0], data[i][1]]])
        point_arr = np.repeat(data_i, len(data), axis=0)
        diff = (point_arr - data) ** 2
        dis = (np.sum(diff, axis=1)) ** 0.5
        # keep the k smallest distances and their indices
        index[i] = (np.argpartition(dis, k))[0:k]
        distance[i] = dis[(np.argpartition(dis, k))[0:k]]
    return distance, index

I hope this can be my contribution to LoOP.

Add coverage reporting for Numba JIT-compiled functions

Any functions with the @jit decorator are not read by pytest-cov as having been executed, despite them running successfully within the unit tests, as they are compiled by Numba and thus are not Python functions able to be evaluated by pytest-cov.

This occurs even when setting the environment variable to disable numba execution, e.g. NUMBA_DISABLE_JIT = "1".

Abstract User Warnings

Abstract user warnings so you can issue warnings on object instantiation (i.e. f = loop() vs only loop.fit()).

Support Python 3.7

PyNomaly currently supports Python 3.4-3.6. Extend and test support with Python 3.7.

Progress bar fails with ZeroDivisionError using a small number of observations

The below line contained within PyNomaly/loop.py fails with a ZeroDivisionError when the total number of observations is less than the terminal or cell width where the code was executed.

if index % block_size == 0:

The cause is a block_size of 0, which is not allowable with the current implementation.

This can be resolved by changing the code to the following:

    if total < w:
        block_size = int(w / total)
    else:
        block_size = int(total / w)

if index % block_size == 0:

Which will ensure the block_size remains greater than 0 even when the number of observations is less than the width of the terminal / cell window.

Passing cluster_labels broken

I think I have found a bug that occurs when passing some cluster_labels.

When I completely reverse the order of all input (data and cluster_labels) and reverse the result (local_outlier_probabilities), I would expect the same numbers. This does happen as long as all cluster_labels values are equal. Once I have two (really separate) clusters, the results change when flipped!
An extra indication that things go wrong (IMHO): the second cluster's neighbor indices are in the first cluster!

A small reproduction example:

import numpy as np
import matplotlib.pyplot as plt
from PyNomaly import loop

np.random.seed(1)
n = 9
data = np.append(np.random.normal(2, 1, [n, 2]), np.random.normal(8, 1, [n, 2]), axis=0)
clus = np.append(np.ones(n), 2 * np.ones(n)).tolist()  # 2 cluster numbers!
model = loop.LocalOutlierProbability(data, n_neighbors=5, cluster_labels=clus)
fit = model.fit()
res = fit.local_outlier_probabilities
print(res)
print(fit.neighbor_matrix)

data_flipped = np.flipud(data)
clus_flipped = np.flipud(clus).tolist()
model2 = loop.LocalOutlierProbability(data_flipped, n_neighbors=5, cluster_labels=clus_flipped)
fit2 = model2.fit()
res2 = np.flipud(fit2.local_outlier_probabilities)
print(res2)
print(np.flipud(fit2.neighbor_matrix))

s  = 1 + 100 * res.astype(float)
s2 = 1 + 100 * res2.astype(float)
plt.scatter(data[:, 0], data[:, 1], c=clus, s=s,  marker='+')
plt.scatter(data[:, 0], data[:, 1], c=clus, s=s2, marker='x')
plt.show()

LoOP for novelty detection

LOF can be used to predict on unseen data using the predict method.
I can't find a predict method in LoOP. Does LoOP support this scenario?

Predict the probability of testing data

Hi dear author, I wonder whether this package contains an API to do independent testing after fitting? For instance, something like:

m = loop.LocalOutlierProbability(data).fit()

scores_of_test_data = m.local_outlier_probabilities(test_data)

where the "data" is used for training (fitting) and "test_data" is another np.array that is for testing only, by which we want to know whether the "test_data" is the outlier for training "data", while we don't put them together for fitting (because fitting again every time takes a long time).

Does this package have such an API?
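
The stream functionality documented in the README appears to approximate this use case: fit once on the training data, then score unseen observations individually without refitting. A sketch (train and test are illustrative arrays):

import numpy as np
from PyNomaly import loop

train = np.random.rand(200, 3)
test = np.random.rand(5, 3)

m = loop.LocalOutlierProbability(train, n_neighbors=10)
m.fit()

# score each unseen observation against the attributes stored during fitting
test_scores = np.array([m.stream(row) for row in test])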

sir, help

In loop.py, data_store[vec][1] = distances[vec]: data_store is a 2-dimensional numpy array and distances is as well. Why is it legal to assign a sequence to data_store[vec][1]?

Alter Naming Convention

It could just be loop().fit() (so you can have lof().fit(), etc.), e.g. from PyNomaly import loop, then loop().fit().

Implementation Speed

I am running the LoOP algorithm on 37k unclustered two-dimensional points. It's taking forever to run. Is it because of the implementation, or is the algorithm inherently slow?

data stream

Is this package suitable for streaming data?
By streaming data I mean an array of data that is constantly being produced, where I want to find an anomaly as soon as it appears.
