yzhao062 / pyod

A Comprehensive and Scalable Python Library for Outlier Detection (Anomaly Detection)

Home Page: http://pyod.readthedocs.io

License: BSD 2-Clause "Simplified" License

Languages: Python 86.63%, Jupyter Notebook 13.37%
Topics: anomaly, anomaly-detection, autoencoder, data-analysis, data-mining, data-science, deep-learning, fraud-detection, machine-learning, neural-networks, novelty-detection, out-of-distribution-detection, outlier-detection, outlier-ensembles, outliers, python, python3, unsupervised-learning

pyod's Introduction

😄 I am an Assistant Professor at USC Computer Science; see more information at my homepage.

Prospective Students. I welcome prospective Ph.D. students (apply by Dec 15th for Fall '24 admission; full financial support) and research interns. You are expected to have a published paper at a top venue on my research topics (current focus: anomaly/outlier/OOD detection, AutoML, and multimodal learning) and strong programming skills (e.g., (ML) systems papers and/or open-source experience). See more at my homepage.

🌱 My research: I build fast, automated, and open machine learning (ML) and data mining (DM) systems, with a focus on (but not limited to) anomaly detection, graph neural networks, and AI for healthcare.

  1. Accelerate large-scale learning tasks by leveraging ML systems techniques.
  2. Automate unsupervised ML by model selection and hyperparameter optimization.
  3. Develop open-source ML tools to support applications in healthcare, finance, and security.

Ph.D. years. At CMU, I worked with Prof. Leman Akoglu on automated ML, Prof. Zhihao Jia on machine learning systems, and Prof. George H. Chen on general ML. I was a member of CMU's automated learning systems group (Catalyst) and the Data Analytics Techniques Algorithms (DATA) Lab. I have also collaborated with Prof. Jure Leskovec at Stanford and Prof. Philip S. Yu at UIC.

Open-source Contribution: I have led, or contributed to as a core member, more than 10 ML open-source initiatives, which have received over 15,000 GitHub stars (top 0.002%: ranked ~800 out of 40M GitHub users) and more than 20,000,000 total downloads.



pyod's People

Contributors

agoodge, akarazeev, bflammers, dependabot[bot], drewnow, durgeshsamariya, edgarakopyan, forestsking, frizzodavide, gian21391, ingonader, john-almardeny, kulikdm, lambertsbennett, lorenzo-perini, lorgoc, lucew, mbongaerts, ossobooker, quentin62, rlshuhart, roelbouman, shangwen777, tam17aki, ucabvas, winstonll, xhan97, xuhongzuo, yzhao062, zainnasrullah


pyod's Issues

Generate Synthetic Data in Clusters

This is a new feature request: add a utility function that generates artificial data in clusters.
The generated data can exhibit the low-density pattern problem and global outliers, both of which are considered difficult cases for outlier detection algorithms.
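For concreteness, here is a minimal sketch of what such a utility might look like (hypothetical; the function name and signature are illustrative, not PyOD's API): dense Gaussian clusters for the inliers plus sparse uniform points for the global outliers.

import numpy as np

def generate_data_clusters(n_samples=300, n_clusters=3, n_features=2,
                           contamination=0.1, random_state=None):
    """Generate dense inlier clusters plus sparse global outliers."""
    rng = np.random.RandomState(random_state)
    n_outliers = int(n_samples * contamination)
    n_inliers = n_samples - n_outliers

    # Dense clusters: small variance around well-separated centers.
    centers = rng.uniform(-10, 10, size=(n_clusters, n_features))
    sizes = np.full(n_clusters, n_inliers // n_clusters)
    sizes[:n_inliers % n_clusters] += 1
    X_in = np.vstack([rng.randn(s, n_features) * 0.5 + c
                      for s, c in zip(sizes, centers)])

    # Global outliers: uniform over a much larger, low-density region.
    X_out = rng.uniform(-20, 20, size=(n_outliers, n_features))

    X = np.vstack([X_in, X_out])
    y = np.hstack([np.zeros(n_inliers), np.ones(n_outliers)])
    return X, y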

Slow installation due to the underlying dependencies

It is noted that PyOD depends on a few libraries, including:

  • keras
  • matplotlib (optional, required for running examples)
  • nose
  • numpy>=1.13
  • numba>=0.35
  • scipy>=0.19.1
  • scikit_learn>=0.19.1
  • tensorflow (optional, required if calling AutoEncoder, other backend also works)

This is becoming more serious now that we have started introducing deep learning models into PyOD, which are implemented in Keras (and therefore require a backend library, e.g., TensorFlow).

In addition, to improve efficiency, we have started using JIT compilation in PyOD, specifically Numba, to accelerate execution; it uses the LLVM compiler to overcome Python's interpreter overhead.

In the long run, I am also interested in bringing GPU support to PyOD, which could be done through CUDA programming. However, that would clearly complicate installation and maintenance.

Therefore, I would like to gather some ideas on the trade-off between comprehensiveness, efficiency, and complexity in PyOD's development. What is your opinion? Is the current installation too cumbersome for you?
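One common way to balance comprehensiveness against installation weight is to keep the heavy dependencies optional and import them lazily. A sketch of the pattern (illustrative, not PyOD's actual code):

class AutoEncoder(object):
    def __init__(self, hidden_neurons=None):
        # Defer the heavy import until the model is actually used, so
        # `pip install pyod` works without Keras/TensorFlow present.
        try:
            from keras.models import Sequential  # noqa: F401
        except ImportError as err:
            raise ImportError(
                "AutoEncoder requires keras and a backend such as "
                "tensorflow; install them via `pip install keras tensorflow`."
            ) from err
        self.hidden_neurons = hidden_neurons or [64, 32, 32, 64]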

LOCI fails on MacOS with Python 2.7 (caused by np.count_nonzero)

It has been noted that running the LOCI model on macOS with Python 2.7 may fail. One potential cause is the following code, since np.count_nonzero may return an int instead of an array.
I am currently investigating how to fix it. Please stay tuned.

def _get_alpha_n(self, dist_matrix, indices, r):
    """Computes the alpha neighbourhood points.

    Parameters
    ----------
    dist_matrix : array-like, shape (n_samples, n_samples)
        The distance matrix w.r.t. the training samples.

    indices : int or array-like
        Subsetting index.

    r : int
        Neighbourhood radius.

    Returns
    -------
    alpha_n : array, shape (n_alpha,)
        Returns the alpha neighbourhood points.
    """
    if type(indices) is int:
        # A single index: count neighbours within the scaled radius.
        alpha_n = np.count_nonzero(
            dist_matrix[indices, :] < (r * self._alpha))
        return alpha_n
    else:
        # An array of indices: one count per row (note that the axis
        # argument requires NumPy >= 1.12).
        alpha_n = np.count_nonzero(
            dist_matrix[indices, :] < (r * self._alpha), axis=1)
        return alpha_n

The error message looks like the following:

(test27) bash-3.2$ python loci_example.py
/anaconda2/envs/test27/lib/python2.7/site-packages/pyod/models/loci.py:199: RuntimeWarning: divide by zero encountered in double_scalars
outlier_scores[p_ix] = mdef/sigma_mdef
/Users/zhaoy9/.local/lib/python2.7/site-packages/numpy/core/_methods.py:101: RuntimeWarning: invalid value encountered in subtract
x = asanyarray(arr - arrmean)
On Training Data:
Traceback (most recent call last):
File "loci_example.py", line 133, in
evaluate_print(clf_name, y_train, y_train_scores)
File "/anaconda2/envs/test27/lib/python2.7/site-packages/pyod/utils/data.py", line 159, in evaluate_print
roc=np.round(roc_auc_score(y, y_pred), decimals=4),
File "/anaconda2/envs/test27/lib/python2.7/site-packages/sklearn/metrics/ranking.py", line 356, in roc_auc_score
sample_weight=sample_weight)
File "/anaconda2/envs/test27/lib/python2.7/site-packages/sklearn/metrics/base.py", line 77, in _average_binary_score
return binary_metric(y_true, y_score, sample_weight=sample_weight)
File "/anaconda2/envs/test27/lib/python2.7/site-packages/sklearn/metrics/ranking.py", line 328, in _binary_roc_auc_score
sample_weight=sample_weight)
File "/anaconda2/envs/test27/lib/python2.7/site-packages/sklearn/metrics/ranking.py", line 618, in roc_curve
y_true, y_score, pos_label=pos_label, sample_weight=sample_weight)
File "/anaconda2/envs/test27/lib/python2.7/site-packages/sklearn/metrics/ranking.py", line 403, in _binary_clf_curve
assert_all_finite(y_score)
File "/anaconda2/envs/test27/lib/python2.7/site-packages/sklearn/utils/validation.py", line 68, in assert_all_finite
_assert_all_finite(X.data if sp.issparse(X) else X, allow_nan)
File "/anaconda2/envs/test27/lib/python2.7/site-packages/sklearn/utils/validation.py", line 56, in _assert_all_finite
raise ValueError(msg_err.format(type_err, X.dtype))
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
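A possible workaround while the fix is investigated (an untested sketch): avoid np.count_nonzero's axis keyword, which older NumPy builds lack, by summing the boolean mask instead.

import numpy as np

def count_within_radius(dist_matrix, indices, r, alpha):
    # Boolean mask of neighbours within the alpha-scaled radius.
    mask = dist_matrix[indices, :] < (r * alpha)
    if mask.ndim == 1:
        return int(mask.sum())   # single index: scalar count
    return mask.sum(axis=1)      # array of indices: one count per row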

AUC score & precision score are different; why are they not the same?

from pyod.utils.data import evaluate_print

# evaluate and print the results
print("\nOn Training Data:")
evaluate_print(clf_name, y_true, y_scores)

On Training Data:
KNN ROC:0.9352, precision @ rank n:0.568

from sklearn import metrics

print("Accuracy Score", round(metrics.accuracy_score(y_true, y_pred), 2))
print("Precision Score", round(metrics.precision_score(y_true, y_pred), 2))
print("Recall Score", round(metrics.recall_score(y_true, y_pred), 2))
print("F1 Score", round(metrics.f1_score(y_true, y_pred), 2))
print("Roc Auc score", round(metrics.roc_auc_score(y_true, y_pred), 2))

Accuracy Score 0.92
Precision Score 0.55
Recall Score 0.59
F1 Score 0.57
Roc Auc score 0.77
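An editorial note on the likely cause: evaluate_print computes ROC from the continuous outlier scores (y_scores), whereas the sklearn call above passes the binary predictions (y_pred), which collapses the ranking and changes the AUC. Passing the raw scores should reproduce the first number:

from sklearn import metrics

# AUC is a ranking metric; feed it the continuous scores, not the 0/1 labels.
print("Roc Auc score", round(metrics.roc_auc_score(y_true, y_scores), 2))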

Problem in CBLOF when the number of clusters is large and the training set has many repeated values

Hi

If the training set has many repeated values and a large number of clusters is used, some clusters will end up with the same value for their center. Then, when self.cluster_sizes_ = np.bincount(clf.cluster_labels_) is computed, the result is an array smaller than the number of clusters, which generates an error and makes it impossible to set large and small clusters. This could be avoided by changing self.cluster_sizes_ = np.bincount(clf.cluster_labels_) to self.cluster_sizes_ = np.bincount(clf.cluster_labels_, minlength=n_clusters). This issue is hurting my code's flexibility, and I want to know if it is worth getting fixed.
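For reference, a quick illustration of the proposed fix (an editorial aside): minlength pads the tail of the count array with zeros, so empty clusters are still counted.

import numpy as np

labels = np.array([0, 0, 1, 2])
print(np.bincount(labels))                # [2 1 1]
print(np.bincount(labels, minlength=5))   # [2 1 1 0 0]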

Example code:

import numpy as np
from pyod.models.cblof import CBLOF
from pyod.utils.data import generate_data

x = [[ 0.30244003], [ 0.01218177], [-0.50835109], [-0.36951435], [ 0.97274482], [-0.68325119],
     [0.0], [0.0], [0.08], [0.0], [0.0], [0.0], [0.0], [0.0], [0.09], [0.0], [0.0], [0.0],
     [0.0], [0.0], [-20.29518778], [0.0], [0.0], [0.0], [0.0], [0.0],
     [0.0], [ 8.38548823], [0.0], [0.0]]
test = generate_data(train_only=True)
clf_name = 'CBLOF'
clf = CBLOF(alpha=0.1, n_clusters=15, beta=10, check_estimator=False)
try:
    clf.fit(x)
except Exception as ex:
    print(str(ex))
    print("\n Cluster centers: " + str(clf.cluster_centers_))
    print("\n Cluster sizes: " + str(clf.cluster_sizes_))
    print('\n Supposed to be the cluster size: ' + str(np.bincount(clf.cluster_labels_, minlength=15)))
    print("\n Large clusters: " + str(clf.large_cluster_labels_))
    print("\n Small clusters: " + str(clf.small_cluster_labels_))

Output:

index 11 is out of bounds for axis 0 with size 11

 Cluster centers: [[ 0.00000000e+00]
 [-2.02951878e+01]
 [ 8.38548823e+00]
 [ 9.72744820e-01]
 [-5.08351090e-01]
 [ 3.02440030e-01]
 [-6.83251190e-01]
 [-3.69514350e-01]
 [ 8.00000000e-02]
 [ 1.21817700e-02]
 [ 9.00000000e-02]
 [ 0.00000000e+00]
 [ 0.00000000e+00]
 [ 8.00000000e-02]
 [ 0.00000000e+00]]

 Cluster sizes: [20  1  1  1  1  1  1  1  1  1  1]

 Supposed to be the cluster size: [20  1  1  1  1  1  1  1  1  1  1  0  0  0  0]
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-39-1b14a2099b96> in <module>()
     18 try:
---> 19     clf.fit(x)
     20 except Exception as ex:

/usr/local/lib/python3.5/dist-packages/pyod/models/cblof.py in fit(self, X, y)
    168         self._set_cluster_centers(X, n_features)
--> 169         self._set_small_large_clusters(n_samples)
    170 

/usr/local/lib/python3.5/dist-packages/pyod/models/cblof.py in _set_small_large_clusters(self, n_samples)
    251 
--> 252             if size_clusters[sorted_cluster_indices[i]] / size_clusters[
    253                 sorted_cluster_indices[i - 1]] >= self.beta:

IndexError: index 11 is out of bounds for axis 0 with size 11

During handling of the above exception, another exception occurred:

AttributeError                            Traceback (most recent call last)
<ipython-input-39-1b14a2099b96> in <module>()
     23     print("\n Cluster sizes: " + str(clf.cluster_sizes_))
     24     print('\n Supposed to be the cluster size: ' + str(np.bincount(clf.cluster_labels_, minlength=15)))
---> 25     print("\n Large clusters: " + str(clf.large_cluster_labels_))
     26     print("\n Small clusters: " + str(clf.small_cluster_labels_))
     27 

AttributeError: 'CBLOF' object has no attribute 'large_cluster_labels_'

Thanks for your help,
Giovanna

Correct handling of LOF proba predictions

Hi,

Thanks for the great library.
While evaluating whether it is usable for my work, I stumbled across a potential issue.
My workflow looks as follows:

  1. Train the LOF detector on a training dataset.
  2. Provide raw scores and outlier probabilities for this set
  3. Deploy the model to generate outlier probabilities on new data

I'm not quite sure how to correctly perform step 2. Calling lof.predict_proba(train) executes lof.decision_function(train), which delegates to the sklearn implementation. sklearn explicitly states that this function is only supposed to handle new data (https://github.com/scikit-learn/scikit-learn/blob/f0ab589f/sklearn/neighbors/lof.py#L233), which is violated here.
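For context, a minimal sketch of the distinction sklearn draws (assuming a scikit-learn version with novelty=True support): training-set scores come from a fitted attribute, while score_samples/decision_function are reserved for unseen data.

import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.RandomState(42)
X_train, X_test = rng.randn(200, 2), rng.randn(50, 2)

lof = LocalOutlierFactor(novelty=True)
lof.fit(X_train)
train_scores = lof.negative_outlier_factor_   # scores for the training set
test_scores = lof.score_samples(X_test)       # scores for new data only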

Thanks for your help
Alex

I am trying to run RandomizedSearchCV on ABOD, but surprisingly it does not work

Here is my code:

import numpy as np
from sklearn.model_selection import StratifiedKFold, RandomizedSearchCV
from sklearn.metrics import make_scorer, precision_score, recall_score
from pyod.models.abod import ABOD

param_grid = {'neighbours': list(range(1, 5, 1)),
              'contamination': np.linspace(0.01, 0.05, 5)}

skf = StratifiedKFold(n_splits=10)
folds = list(skf.split(X.toarray(), y_true))
clf = ABOD()
scoring = make_scorer(precision_score)
search = RandomizedSearchCV(estimator=clf, param_distributions=param_grid, scoring=scoring, cv=folds)
search.fit(X.toarray(), y_true)
y_pred = search.predict(X.toarray())
print('Best parameters: %0.10f' % search.best_params_["contamination"],
      'Precision score: %0.3f' % precision_score(y_true, y_pred),
      'Recall score: %0.3f' % recall_score(y_true, y_pred))

Best parameters:0.0100000000 Precision score: 0.000 Recall score: 0.000
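One likely culprit (an editorial observation, not from the original thread): ABOD's keyword is n_neighbors, not neighbours, so the grid above never actually varies the neighbourhood size. Something like the following should exercise both parameters:

param_grid = {'n_neighbors': list(range(1, 5)),
              'contamination': np.linspace(0.01, 0.05, 5)}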

kNN visualization (interpretation)

The visualization produced by knn_example.py for the "Test Set Prediction" shows two false positives, i.e., 12 outlier findings instead of the 10 in the "Test Set Ground Truth" chart. Isn't this somewhat inconsistent with the ROC_AUC = 1 printed to the console?

If so, I think the inconsistency arises because the predicted labels for the chart are based on y_test_pred = clf.predict(X_test). That means the test labels are predicted by comparison with the distance threshold clf.threshold_ obtained when fitting clf to the training data. In contrast, the ROC_AUC value is based on a fixed contamination rate (10%).

It would only make sense to use clf.threshold_ for this purpose if the kNN distance for any point x_i in the test set were computed over the distances from x_i to each of the 200 training points, not the distances to the other 99 test points. But then the ROC_AUC curve ought to be based on those same labels, and it isn't, is it? I think it is currently generated from a set of labels that re-applies the 10% contamination assumption, ignoring clf.threshold_.

(I can't quite follow whether the kNN distances for the points in the test set are being computed vs. the training set or vs. the other points in the test set. Can you clarify this for me? I have to guess that it's the former; if it's the latter, then it would seem really weird to be applying clf.threshold_ from a training set of a different size.)

Is it even appropriate to apply the kNN model from the training data directly to the test set? I would have thought this use of kNN was intended for an entire data set by itself, although one could perhaps study a training set to make a reasonable judgment about appropriate values of k and the contamination rate.
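To make the two labelings being contrasted concrete, here is a small sketch (illustrative; it mirrors the example's setup): predict() thresholds the test scores with clf.threshold_ learned during fitting, whereas a fixed 10% contamination re-derives a cutoff from the test scores themselves.

import numpy as np
from pyod.models.knn import KNN
from pyod.utils.data import generate_data

X_train, y_train, X_test, y_test = generate_data(
    n_train=200, n_test=100, contamination=0.1)

clf = KNN(contamination=0.1)
clf.fit(X_train)

scores = clf.decision_function(X_test)                    # distances vs. the training set
labels_threshold = (scores > clf.threshold_).astype(int)  # what predict() does
cutoff = np.percentile(scores, 90)                        # re-applying 10% contamination
labels_contam = (scores > cutoff).astype(int)             # what the chart's "ground truth" assumes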

Thanks!

intended clf.predict_proba usage

I'm trying to make sense of the predict_proba function.

What I want to achieve: get class probabilities for generating metrics like ROC curves, calibration curves, precision, accuracy, etc. with scikit-learn tools. As I am working on a binary classification task, I thought I could use predict_proba for this.

The documentation describes it as "predict the probability of a sample being outlier", returning:
For each observation, tells whether or not it should be considered as an outlier according to the fitted model. Return the outlier probability, ranging in [0,1].
which is what I am currently looking for. What I don't understand is that an ndarray of shape (n_observations, 2) is returned.
If I compare the outputs of clf.predict() and clf.predict_proba() side by side, I always see a high value in the first column of the predict_proba array:

0 -> [0.86014439 0.13985561]
0 -> [0.96943563 0.03056437]
0 -> [0.88716599 0.11283401]
0 -> [0.87912382 0.12087618]
0 -> [0.9686196   0.0313804]
0 -> [0.87921815 0.12078185]
1 -> [0.83279906 0.16720094]
0 -> [0.87921815 0.12078185]
0 -> [0.86137304 0.13862696]
0 -> [0.98987502 0.01012498]

Might the first column be read as "how confident the classifier is that the predicted class is correct"? It would be great if you could help me out on this one.
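An editorial reading consistent with the output above: the two columns follow scikit-learn's predict_proba convention, one column per class in label order, so column 0 is the probability of the inlier class (label 0) and column 1 the probability of the outlier class (label 1). For ROC or calibration curves, take the second column:

from pyod.models.knn import KNN
from pyod.utils.data import generate_data
from sklearn.metrics import roc_auc_score

X_train, y_train, X_test, y_test = generate_data(n_train=200, n_test=100)
clf = KNN()
clf.fit(X_train)

proba = clf.predict_proba(X_test)   # shape (n_samples, 2)
outlier_proba = proba[:, 1]         # probability of being an outlier
print(roc_auc_score(y_test, outlier_proba))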

By the way: Thanks for building such a great Python module!

pyod fails to install using pip

When attempting to install without nose, I receive the following error:

(PyVi) Michael:PyVi michael$ pip install pyod
Collecting pyod==0.5.0 (from -r requirements.txt (line 18))
  Using cached https://files.pythonhosted.org/packages/c9/8c/6774fa2e7ae6fe9c2c648114d15ba584f950002377480e14183a0999af30/pyod-0.5.0.tar.gz
    Complete output from command python setup.py egg_info:
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/private/var/folders/j4/_68f6f3j4d51_2smq2mh5hyh0000gn/T/pip-install-gjdzzane/pyod/setup.py", line 2, in <module>
        from pyod import __version__
      File "/private/var/folders/j4/_68f6f3j4d51_2smq2mh5hyh0000gn/T/pip-install-gjdzzane/pyod/pyod/__init__.py", line 4, in <module>
        from . import models
      File "/private/var/folders/j4/_68f6f3j4d51_2smq2mh5hyh0000gn/T/pip-install-gjdzzane/pyod/pyod/models/__init__.py", line 2, in <module>
        from .abod import ABOD
      File "/private/var/folders/j4/_68f6f3j4d51_2smq2mh5hyh0000gn/T/pip-install-gjdzzane/pyod/pyod/models/abod.py", line 17, in <module>
        from .base import BaseDetector
      File "/private/var/folders/j4/_68f6f3j4d51_2smq2mh5hyh0000gn/T/pip-install-gjdzzane/pyod/pyod/models/base.py", line 27, in <module>
        from ..utils.utility import precision_n_scores
      File "/private/var/folders/j4/_68f6f3j4d51_2smq2mh5hyh0000gn/T/pip-install-gjdzzane/pyod/pyod/utils/__init__.py", line 2, in <module>
        from .utility import check_parameter
      File "/private/var/folders/j4/_68f6f3j4d51_2smq2mh5hyh0000gn/T/pip-install-gjdzzane/pyod/pyod/utils/utility.py", line 18, in <module>
        from sklearn.utils.testing import assert_equal
      File "/Users/michael/anaconda3/envs/PyVi/lib/python3.6/site-packages/sklearn/utils/testing.py", line 49, in <module>
        from nose.tools import raises
    ModuleNotFoundError: No module named 'nose'
    
    ----------------------------------------
Command "python setup.py egg_info" failed with error code 1 in /private/var/folders/j4/_68f6f3j4d51_2smq2mh5hyh0000gn/T/pip-install-gjdzzane/pyod/

CBLOF predict error

Hi,
When I try to use CBLOF to predict one, two, or any small number of samples, it sometimes fails, as in the example below:

from pyod.models.cblof import CBLOF

# `a` is my data: a 2-D array of single-feature samples.
clf_name = 'CBLOF'
clf = CBLOF(alpha=0.7, beta=2, check_estimator=False, n_clusters=6)
clf.fit(a[0:336])
print([a[338]])
clf.predict([a[338]])

Output:

[array([0.21751617])]
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-173-5342692feffe> in <module>()
      3 clf.fit(a[0:336])
      4 
----> 5 clf.predict([a[338]])

/usr/local/lib/python3.5/dist-packages/pyod/models/base.py in predict(self, X)
    125         check_is_fitted(self, ['decision_scores_', 'threshold_', 'labels_'])
    126 
--> 127         pred_score = self.decision_function(X)
    128         return (pred_score > self.threshold_).astype('int').ravel()
    129 

/usr/local/lib/python3.5/dist-packages/pyod/models/cblof.py in decision_function(self, X)
    179         X = check_array(X)
    180         labels = self.clustering_estimator_.predict(X)
--> 181         return self._decision_function(X, labels)
    182 
    183     def _validate_estimator(self, default=None):

/usr/local/lib/python3.5/dist-packages/pyod/models/cblof.py in _decision_function(self, X, labels)
    281 
    282         scores[large_indices] = pairwise_distances_no_broadcast(
--> 283             X[large_indices, :], large_centers)
    284 
    285         if self.use_weights:

/usr/local/lib/python3.5/dist-packages/pyod/utils/stat_models.py in pairwise_distances_no_broadcast(X, Y)
     36     :rtype: array of shape (n_samples,)
     37     """
---> 38     X = check_array(X)
     39     Y = check_array(Y)
     40     assert_allclose(X.shape, Y.shape)

/usr/local/lib/python3.5/dist-packages/sklearn/utils/validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
    580                              " minimum of %d is required%s."
    581                              % (n_samples, shape_repr, ensure_min_samples,
--> 582                                 context))
    583 
    584     if ensure_min_features > 0 and array.ndim == 2:

ValueError: Found array with 0 sample(s) (shape=(0, 1)) while a minimum of 1 is required.

But when I predict while ensuring that one of the samples is not an anomaly, it works in all cases:

pred = clf.predict([clf.cluster_centers_[clf.large_cluster_labels_[0]],a[338]])
print (pred)

Output:

[0 1]
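A plausible reading of the traceback (an editorial note): when none of the queried samples is assigned to a large cluster, X[large_indices, :] is empty, and sklearn's check_array rejects zero-sample arrays. A minimal reproduction of that underlying failure:

import numpy as np
from sklearn.utils import check_array

empty = np.empty((0, 1))
check_array(empty)  # ValueError: Found array with 0 sample(s) ...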

Thanks for your help

Breadth-First Approach in FeatureBagging

May I ask a question about the approach implemented for "combination" in feature_bagging.py?

IMHO, the use of "maximization" is not a precise reflection of the original paper (lazarevic2005feature). The authors describe a breadth-first search procedure there; arguably the numeric differences might be small.
However, please consider this generic toy example as a counter-example:

|      | Alg1 | Alg2 |
|------|------|------|
| Obs1 | 10.0 | 2.0  |
| Obs2 | 9.0  | 3.0  |
| Obs3 | 8.0  | 4.0  |

Maximization would return the order Obs1 (score 10), Obs2 (score 9), Obs3 (score 8); breadth-first search would return Obs1 (rank 1 in Alg1), then Obs3 (rank 1 in Alg2), and then Obs2.
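A sketch of the breadth-first combination the paper describes, for contrast with maximization (illustrative, not feature_bagging.py's code): interleave the detectors' rankings, taking each detector's next-highest unseen observation in turn.

import numpy as np

def breadth_first_order(scores):
    """scores: (n_samples, n_detectors) -> sample indices, most anomalous first."""
    n_samples, n_detectors = scores.shape
    rankings = np.argsort(-scores, axis=0)  # each column sorted high-to-low
    order, seen = [], set()
    for rank in range(n_samples):
        for det in range(n_detectors):
            idx = rankings[rank, det]
            if idx not in seen:
                seen.add(idx)
                order.append(idx)
    return np.array(order)

scores = np.array([[10.0, 2.0], [9.0, 3.0], [8.0, 4.0]])
print(breadth_first_order(scores))  # [0 2 1] -> Obs1, Obs3, Obs2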

Many thanks.

Documentation / Implementation difference in Autoencoder

While exploring the AutoEncoder in pyod, I've noticed a discrepancy between the generated docs and the implementation.
While the docs (https://pyod.readthedocs.io/en/latest/pyod.models.html#module-pyod.models.auto_encoder) state that hidden_neurons defaults to a list ([64, 32, 32, 64]), the implementation assigns None: https://github.com/yzhao062/pyod/blob/development/pyod/models/auto_encoder.py#L126
While this isn't a problem in itself, instantiating an AutoEncoder that way resulted in a TypeError on my side:

         # Verify the network design is valid
>       if not self.hidden_neurons == self.hidden_neurons[::-1]:
E       TypeError: 'NoneType' object is not subscriptable

It might be worth changing the default for hidden_neurons to the list mentioned in the docs.
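A sketch of the suggested change (illustrative, not the actual patch): substitute the documented default when hidden_neurons is None, before the symmetry check runs.

class AutoEncoder(object):
    def __init__(self, hidden_neurons=None):
        if hidden_neurons is None:
            hidden_neurons = [64, 32, 32, 64]  # default from the docs
        self.hidden_neurons = hidden_neurons
        # Verify the network design is valid (symmetric encoder/decoder).
        if not self.hidden_neurons == self.hidden_neurons[::-1]:
            raise ValueError("hidden_neurons must be symmetric")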
And by the way: Thanks for this framework, it is really a breeze to work with!

IForest: FutureWarning: behaviour="old" is deprecated

Hi,

Thanks for a great library!

When declaring a new IForest object, Sklearn throws the following warning:

FutureWarning: behaviour="old" is deprecated and will be removed in version 0.22. Please use behaviour="new", which makes the decision_function change to match other anomaly detection algorithm API.
FutureWarning)

This new behavior in sklearn's IsolationForest concerns where the threshold between anomalies and normal observations is set. See the documentation on the behaviour argument and offset_:

behaviour : str, default='old'
Behaviour of the decision_function which can be either 'old' or
'new'. Passing behaviour='new' makes the decision_function
change to match other anomaly detection algorithm API which will be
the default behaviour in the future. As explained in details in the
offset_ attribute documentation, the decision_function becomes
dependent on the contamination parameter, in such a way that 0 becomes
its natural threshold to detect outliers.

offset_ : float
Offset used to define the decision function from the raw scores.
We have the relation: decision_function = score_samples - offset_.
Assuming behaviour == 'new', offset_ is defined as follows.
When the contamination parameter is set to "auto", the offset is equal
to -0.5 as the scores of inliers are close to 0 and the scores of
outliers are close to -1. When a contamination parameter different
than "auto" is provided, the offset is defined in such a way we obtain
the expected number of outliers (samples with decision function < 0)
in training.
Assuming the behaviour parameter is set to 'old', we always have
offset_ = -0.5, making the decision function independent from the
contamination parameter.

I think a simple fix would be to add the argument behaviour="new" to the call to sklearn.ensemble.IsolationForest.
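The suggested one-line change would look like this inside PyOD's wrapper (a sketch; note that the behaviour argument only exists in the scikit-learn versions of that era and was later removed):

from sklearn.ensemble import IsolationForest

detector = IsolationForest(behaviour='new', contamination=0.1)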

n_jobs ignored

Hi, I'm using XGBOD with n_jobs = -1 and it's no different from using it with n_jobs = 1...

func:`pyod.utils.data.visualize` does not exist

Is the function pyod.utils.data.visualize deprecated? I cannot import it.

import sys
import pyod
In[]: sys.version
Out[]: '3.6.5 |Anaconda, Inc.| (default, Mar 29 2018, 13:32:41) [MSC v.1900 64 bit (AMD64)]'
In[]: pyod.utils.data.visualize(clf_name, X_train, X_test, y_train_pred, y_test_pred, show_figure=True, save_figure=False)

Traceback (most recent call last):

  File "<ipython-input-9-1628666df63a>", line 2, in <module>
    pyod.utils.data.visualize(clf_name,

AttributeError: module 'pyod.utils.data' has no attribute 'visualize'
 (py36) E:\MyNutshell>pip show pyod                                   
Name: pyod                                                           
Version: 0.5.6                                                       
Summary: A Python Outlier Detection (Anomaly Detection) Toolbox      
Home-page: https://github.com/yzhao062/Pyod                          
Author: Yue Zhao                                                     
Author-email: [email protected]                                 
License: UNKNOWN             

Merge with kenchi

Hi,

I am currently developing an anomaly detection package called kenchi and would like to merge this code into your package.
https://github.com/HazureChi/kenchi

There are three points that I can contribute to pyod.

The first is the implementation of One-time sampling.
https://github.com/HazureChi/kenchi/blob/master/kenchi/outlier_detection/distance_based.py

Sugiyama, M., and Borgwardt, K., "Rapid distance-based outlier detection via sampling," Advances in NIPS, pp. 467-475, 2013.

The second is the implementation of evaluation metrics for outlier detection.
https://github.com/HazureChi/kenchi/blob/master/kenchi/metrics.py

Lee, W. S, and Liu, B., "Learning with positive and unlabeled examples using weighted Logistic Regression," In Proceedings of ICML, pp. 448-455, 2003.

Goix, N., "How to evaluate the quality of unsupervised anomaly detection algorithms?" In ICML Anomaly Detection Workshop, 2016.

The last is the implementation of a function that loads and returns various datasets.
https://github.com/HazureChi/kenchi/blob/master/kenchi/datasets/base.py

If you agree, I would actively like to contribute to pyod in the future.

Thanks.

Outlier scores highly correlated with overall distance to the origin

I calculated the distance of each data point to the origin using np.linalg.norm(x), where x is a single multivariate sample, and then normalized all these values to [0, 1]; I call this the "global score". When I compare the global score with the scores from different methods, it turns out to be highly correlated (0.99) with PCA, autoencoder, CBLOF, and KNN. So it seems all these methods are essentially measuring each sample's overall distance from the origin, rather than detecting anomalies relative to multiple clusters.
I am very troubled by this fact and hope you can confirm whether it is true and, if it is, explain the reason for it.
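For reproducibility, the "global score" described above amounts to the following (a sketch of the computation described, not code from the original report):

import numpy as np

def global_score(X):
    # Per-sample L2 distance to the origin, min-max scaled to [0, 1].
    d = np.linalg.norm(X, axis=1)
    return (d - d.min()) / (d.max() - d.min())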

Thanks

Request to add an article to the resources: Outlier Detection using PyOD

Hi,
I have written an article on Outlier Detection using PyOD on Analytics Vidhya Blog -
https://www.analyticsvidhya.com/blog/2019/02/outlier-detection-python-pyod/

In the article, I have tried to explain the need for outlier detection and how PyOD can be used for it, and I have also applied PyOD to a real-world data set.
Please consider including it in the resources section on GitHub. I believe it would be really helpful for people who want to get started with pyod.

Thanks

SOS: overflow encountered in multiply beta[i] = beta[i] * 2.0

I am running the following code:

from pyod.models.sos import SOS

clf_name = 'SOS'
clf = SOS()
clf.fit(X_train)

and got the following warning:
RuntimeWarning: overflow encountered in multiply
beta[i] = beta[i] * 2.0
/opt/anaconda3/lib/python3.6/site-packages/scipy/stats/stats.py:1713: FutureWarning: Using a non-tuple sequence for multidimensional indexing is deprecated; use arr[tuple(seq)] instead of arr[seq]. In the future this will be interpreted as an array index, arr[np.array(seq)], which will result either in an error or a different result.
return np.add.reduce(sorted[indexer] * weights, axis=axis) / sumval
~/proj/myPylib/lib/python3.6/site-packages/pyod/models/base.py:336: RuntimeWarning: invalid value encountered in greater
self.labels_ = (self.decision_scores_ > self.threshold_).astype(

data.zip
I have uploaded the data for X_train here.

My samples contain duplicates, and when I remove the duplicates the error does not occur. However, I need to retain the duplicates.
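A hedged workaround consistent with that observation (illustrative; it assumes the overflow comes from SOS's affinity search diverging on zero distances between exact duplicates): perturb duplicate rows slightly so pairwise distances are nonzero while every sample is retained.

import numpy as np

def jitter_duplicates(X, eps=1e-8, random_state=0):
    rng = np.random.RandomState(random_state)
    X = np.asarray(X, dtype=float).copy()
    _, first_idx = np.unique(X, axis=0, return_index=True)
    dup_mask = np.ones(len(X), dtype=bool)
    dup_mask[first_idx] = False   # keep one exact copy of each row
    X[dup_mask] += rng.normal(scale=eps, size=(dup_mask.sum(), X.shape[1]))
    return X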

tensorflow pip installation fails with travis-ci python 3.7

matrix:
  include:
    - python: "3.7"
      dist: xenial
      sudo: true

Error message:
Collecting tensorflow (from -r requirements_travis.txt (line 8))
Could not find a version that satisfies the requirement tensorflow (from -r requirements_travis.txt (line 8))(from versions: )
No matching distribution found for tensorflow (from -r requirements_travis.txt (line 8))
The command "pip install -r requirements_travis.txt" failed and exited with 1 during .

Wait until pip fixes tensorflow installation under python 3.7.

KNN Mahalanobis distance error

Hi,

When I use the Mahalanobis metric for KNN, I always get the error "Must provide either V or VI for Mahalanobis distance", even when I provide V via metric_params. The same request works with sklearn.neighbors.


from pyod.models.knn import KNN  
from pyod.utils.data import generate_data
from sklearn.neighbors import NearestNeighbors
import numpy as np

contamination = 0.1  
n_train = 200  
n_test = 100 

X_train, y_train, X_test, y_test = generate_data(n_train=n_train, n_test=n_test, contamination=contamination)

#Doesn't work (Must provide either V or VI for Mahalanobis distance)
clf = KNN(algorithm='brute', metric='mahalanobis', metric_params={'V': np.cov(X_train)})
clf.fit(X_train)

#Works
nn = NearestNeighbors(algorithm='brute', metric='mahalanobis', metric_params={'V': np.cov(X_train)})
nn.fit(X_train)

XGBOD and LSCP missing from install

I installed PyOD using:

pip install pyod
pip install --upgrade pyod

However, LSCP and XGBOD are not installed. All of the other models in the repo can be successfully imported into a jupyter notebook. Attempting to import LSCP and XGBOD both yield a "ModuleNotFoundError: No module named" error.

If n_samples is large, certain outlier models' error rates are 200% higher

Hi YZhao,

I am writing about one possible issue: in the "compare all models" example notebook, if n_samples is changed to a large number, for example 10**5 or more, certain models' OD results are totally wrong.
Note: I noticed a similar issue (#53, "Problem in CBLOF when the number of clusters is big and the train set has too many repeated values"), but my finding seems different, so I am posting it as well.

Here is the issue:

  1. The defaults are n_samples = 200 and outlier_fraction = 0.25, which means there are 50 ground-truth outlier points. After changing n_samples to 10**5, the ground-truth outlier count should be 25000.
    However, the following models reported counts much higher than the ground truth:
    Feature Bagging: 35259
    Local Outlier Factor (LOF): 36144
    Locally Selective Combination (LSCP): 37276
    (screen capture omitted)

I guess it might be related to the dataset type. Does it mean the simulated data is similar to the "glass" or "optdigits" sample data? Why did the other estimators not show such a high error rate?

Looking forward to your kind response!

Last but not least, I learned about PyOD from Zhihu. It is an excellent tool; in particular, I have found more OD resource links through your GitHub. Your work is awesome!

Recently, I have begun trying some of its models in one of my general OD automation tools (which uses Docker and Airflow as the platform). Dataset type and dataset quantity are two points that need to be considered.

WangYong
[email protected]

Instructions on setting up Keras and Tensorflow for AutoEncoder in PyOD

It is nice that PyOD includes some neural-network-based models, such as AutoEncoder. However, you may find that after pip install pyod, the AutoEncoder models do not run. This is expected, since I do not want PyOD to rely on too many packages, and not everyone needs to run AutoEncoder.

If you have tensorflow-gpu installed, Keras will automatically run on the GPU.
If you want to run AutoEncoder, please first install Keras plus a backend library, e.g., TensorFlow. Either of the following should do the installation for you:

  • pip install keras tensorflow or pip install keras tensorflow-gpu
  • conda install keras tensorflow or conda install keras tensorflow-gpu

You need tensorflow-gpu if your device has a GPU and you want to leverage it.

After Keras and TensorFlow are installed, you are ready to run auto_encoder_example.py.

Here are some potential error messages you may encounter:

1. ModuleNotFoundError: No module named 'theano'

In this case, you should set the Keras backend to the one you want to use, e.g., TensorFlow:
go to $HOME/.keras/keras.json and change "backend" to "tensorflow".
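For reference, a typical $HOME/.keras/keras.json after that change looks like the following (fields other than "backend" left at their defaults):

{
    "floatx": "float32",
    "epsilon": 1e-07,
    "backend": "tensorflow",
    "image_data_format": "channels_last"
}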

2. ModuleNotFoundError: No module named 'error'

In this case, you need to install Keras and TensorFlow with conda, which can be done either in the GUI or simply with "conda install keras" and "conda install tensorflow".

specifying categorical features in Python Outlier Detection (PyOD)

How do I specify categorical features in PyOD when using Histogram-based Outlier Detection (HBOS) for anomaly detection?
I've read that HBOS can be used for anomaly detection when categorical features are involved. I found its Python implementation here:
https://pyod.readthedocs.io/en/latest/pyod.models.html#module-pyod.models.hbos
But I can't figure out how to pass the positions or names of my dataset's categorical features while training the model.
The code I've tried:

clf = HBOS(n_bins=10, alpha=0.1, tol=0.5, contamination=0.1)
clf.fit(train_df)
train_pred = clf.labels_

There is no parameter for specifying categorical features during training.

The KNN example is incorrect

There is no get_outliers_inliers in pyod:

from pyod.utils.data import generate_data
from pyod.utils.data import get_outliers_inliers

Is it possible to make CBLOF ignore contamination parameter?

CBLOF's parameters seem useless. Basically, the only thing that matters with this method is the contamination parameter: if I set it to 0.3, it will flag 30% of the points as anomalies, no matter how normal they are or whether they belong to a big cluster. From what I understood about the method, it should be able to define what is and is not an anomaly based only on the parameters alpha and beta, so why is this happening?
Is there a way to ignore contamination?
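An editorial note that may help: in PyOD, contamination only determines how the raw scores are thresholded into labels (threshold_ and labels_); it does not change the scores themselves. To sidestep it, work with the raw CBLOF scores and pick your own cutoff, e.g.:

import numpy as np
from pyod.models.cblof import CBLOF
from pyod.utils.data import generate_data

X_train, y_train = generate_data(n_train=200, train_only=True)
clf = CBLOF(n_clusters=8, alpha=0.9, beta=5)
clf.fit(X_train)

scores = clf.decision_scores_                               # unaffected by contamination
labels = (scores > np.percentile(scores, 95)).astype(int)   # your own cutoff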

LSCP with multiple LOF testing error: range parameter must be finite

I am running the following code (X_train is my training data):

from pyod.models.lof import LOF
from pyod.models.lscp import LSCP

clf_name = 'LSCP_LOF'

# other parameters:
lof_list = [LOF(n_neighbors=5), LOF(n_neighbors=10), LOF(n_neighbors=20),
            LOF(n_neighbors=30), LOF(n_neighbors=40), LOF(n_neighbors=50),
            LOF(n_neighbors=75)]

clf = LSCP(lof_list)
# clf = LOF(n_neighbors=5, contamination=outliers_fraction)
clf.fit(X_train)

and got the following error; however, when fitting directly with the LOF method, it runs fine:

ValueError Traceback (most recent call last)
in ()
12 clf = LSCP(lof_list)
13 #clf = LOF(n_neighbors=5, contamination=outliers_fraction)
---> 14 clf.fit(X_train)
15
16 # get the prediction label and outlier scores of the training data

~/proj/myPylib/lib/python3.6/site-packages/pyod/models/lscp.py in fit(self, X, y)
171
172 # set decision scores and threshold
--> 173 self.decision_scores_ = self._get_decision_scores(X)
174 self._process_decision_scores()
175

~/proj/myPylib/lib/python3.6/site-packages/pyod/models/lscp.py in _get_decision_scores(self, X)
273 pred_scores_ens[i,] = np.mean(
274 test_scores_norm[
--> 275 i, self._get_competent_detectors(pearson_corr_scores)])
276
277 return pred_scores_ens

~/proj/myPylib/lib/python3.6/site-packages/pyod/models/lscp.py in _get_competent_detectors(self, scores)
355 "classifiers, reducing n_bins to n_clf.")
356 self.n_bins = self.n_clf
--> 357 hist, bin_edges = np.histogram(scores, bins=self.n_bins)
358
359 # find n_selected largest bins

/opt/anaconda3/lib/python3.6/site-packages/numpy/lib/function_base.py in histogram(a, bins, range, normed, weights, density)
668 if not np.all(np.isfinite([first_edge, last_edge])):
669 raise ValueError(
--> 670 'range parameter must be finite.')
671 if first_edge == last_edge:
672 first_edge -= 0.5

ValueError: range parameter must be finite.

Thanks

Installing Pyod broke my TensorFlow installation

Ubuntu 16.04

Traceback (most recent call last):
  File "features_2_3_rot_unet_1.py", line 3, in <module>
    import tensorflow as tf
  File "/home/user/.virtualenvs/tensorflow1.5/lib/python3.5/site-packages/tensorflow/__init__.py", line 22, in <module>
    from tensorflow.python import pywrap_tensorflow  # pylint: disable=unused-import
  File "/home/user/.virtualenvs/tensorflow1.5/lib/python3.5/site-packages/tensorflow/python/__init__.py", line 81, in <module>
    from tensorflow.python import keras
  File "/home/user/.virtualenvs/tensorflow1.5/lib/python3.5/site-packages/tensorflow/python/keras/__init__.py", line 24, in <module>
    from tensorflow.python.keras import activations
  File "/home/user/.virtualenvs/tensorflow1.5/lib/python3.5/site-packages/tensorflow/python/keras/activations/__init__.py", line 22, in <module>
    from tensorflow.python.keras._impl.keras.activations import elu
  File "/home/user/.virtualenvs/tensorflow1.5/lib/python3.5/site-packages/tensorflow/python/keras/_impl/keras/__init__.py", line 21, in <module>
    from tensorflow.python.keras._impl.keras import activations
  File "/home/user/.virtualenvs/tensorflow1.5/lib/python3.5/site-packages/tensorflow/python/keras/_impl/keras/activations.py", line 23, in <module>
    from tensorflow.python.keras._impl.keras import backend as K
  File "/home/user/.virtualenvs/tensorflow1.5/lib/python3.5/site-packages/tensorflow/python/keras/_impl/keras/backend.py", line 36, in <module>
    from tensorflow.python.layers import base as tf_base_layers
  File "/home/user/.virtualenvs/tensorflow1.5/lib/python3.5/site-packages/tensorflow/python/layers/base.py", line 25, in <module>
    from tensorflow.python.keras.engine import base_layer
  File "/home/user/.virtualenvs/tensorflow1.5/lib/python3.5/site-packages/tensorflow/python/keras/engine/__init__.py", line 23, in <module>
    from tensorflow.python.keras.engine.base_layer import InputSpec
  File "/home/user/.virtualenvs/tensorflow1.5/lib/python3.5/site-packages/tensorflow/python/keras/engine/base_layer.py", line 33, in <module>
    from tensorflow.python.keras import backend
  File "/home/user/.virtualenvs/tensorflow1.5/lib/python3.5/site-packages/tensorflow/python/keras/backend/__init__.py", line 22, in <module>
    from tensorflow.python.keras._impl.keras.backend import abs
ImportError: cannot import name 'abs'

ValueError: continuous format is not supported

Hey there,

Following the example for KNN, I'm getting this error:

ValueError                                Traceback (most recent call last)
<ipython-input-252-21e0f0751702> in <module>()
      2 # evaluate and print the results
      3 print("\nOn Training Data:")
----> 4 evaluate_print(clf_name, y_train, y_train_scores)
      5 print("\nOn Test Data:")
      6 evaluate_print(clf_name, y_test, y_test_scores)
------
    157     print('{clf_name} ROC:{roc}, precision @ rank n:{prn}'.format(
    158         clf_name=clf_name,
--> 159         roc=np.round(roc_auc_score(y, y_pred), decimals=4),
    160         prn=np.round(precision_n_scores(y, y_pred), decimals=4)))

Any suggestions?
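An editorial note on the likely cause: sklearn raises "continuous format is not supported" when the ground-truth argument of roc_auc_score is not a binary label vector, so y_train here probably contains continuous values rather than 0/1 labels. A minimal illustration:

from sklearn.metrics import roc_auc_score

roc_auc_score([0, 1, 0], [0.2, 0.9, 0.1])        # works: binary ground truth
roc_auc_score([0.1, 0.7, 0.3], [0.2, 0.9, 0.1])  # ValueError: continuous format is not supported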
