
guansongpang / deviation-network


Source code of the KDD19 paper "Deep anomaly detection with deviation networks", weakly/partially supervised anomaly detection, few-shot anomaly detection, semi-supervised anomaly detection

License: GNU General Public License v3.0

Python 100.00%
anomaly-detection deep-learning weakly-supervised-learning semi-supervised-learning outlier-detection few-shot-learning

deviation-network's People

Contributors: guansongpang

deviation-network's Issues

Interpretability of anomaly scores

In terms of interpretability of the anomaly scores, can we also obtain the contribution (importance) of each feature/column of the dataset from the Z-score-based deviation loss?
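The repository does not appear to ship per-feature attributions, but a model-agnostic occlusion sketch can approximate them: re-score the instance with each feature replaced by a baseline value and record the drop in the anomaly score. `score_fn` below is a hypothetical stand-in for the trained network's scoring function, not part of this codebase:

```python
import numpy as np

def occlusion_importance(score_fn, x, baseline):
    """Per-feature contribution to an anomaly score via occlusion.

    score_fn : maps a 1-D feature vector to a scalar anomaly score
               (e.g. the trained network's output; hypothetical here).
    x        : the instance to explain.
    baseline : reference values per feature, e.g. training-set means.
    """
    base_score = score_fn(x)
    contributions = np.empty_like(x, dtype=float)
    for j in range(len(x)):
        x_masked = x.copy()
        x_masked[j] = baseline[j]          # "remove" feature j
        contributions[j] = base_score - score_fn(x_masked)
    return contributions
```

For a linear scorer this recovers each feature's exact additive contribution; for the network it is only a local approximation.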

Reproduce for a different dataset

Hello,
I have a 1-D time series dataset in the format (x, y, z), where x = number of samples, y = number of timesteps and z = 1 (dimensions).
How can I run your code on my dataset? Where do I need to make changes?
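For what it's worth, the tabular pipeline in this repository expects a 2-D matrix of shape (n_samples, n_features). Assuming z = 1, one minimal option is to squeeze the singleton axis and treat each timestep as a feature column (a sketch, not the repo's own preprocessing):

```python
import numpy as np

# Hypothetical example: 1-D time series shaped (x, y, z) with z == 1.
X = np.random.randn(100, 64, 1)    # 100 samples, 64 timesteps, 1 dimension

# The tabular DevNet pipeline consumes a 2-D matrix (n_samples, n_features),
# so drop the singleton last axis and treat each timestep as a feature.
X_2d = X.squeeze(axis=-1)          # -> shape (100, 64)
```

This ignores temporal ordering; a sequence encoder would be needed to exploit it.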

Infinite loop when generating the batches

I noticed that there is an infinite loop in the function batch_generator_sup(), and I don't fully understand the logic for generating the batches and injecting the noise. Why not just use a normal dataloader with negative sampling?
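For context, Keras-style training loops pull `steps_per_epoch` batches from the generator each epoch, so the generator is expected to yield forever; the `while True` is deliberate rather than a bug. A minimal sketch of such a class-balanced generator (not the repository's actual code) might look like:

```python
import numpy as np

def balanced_batch_generator(x, outlier_idx, inlier_idx, batch_size, rng):
    """Yield class-balanced (batch, labels) pairs forever.

    Keras-style training pulls a fixed number of steps per epoch from the
    generator, so the generator itself must never terminate - hence the
    deliberate infinite loop.
    """
    half = batch_size // 2
    while True:  # intentional: the training loop decides when to stop
        # Half the batch from labeled anomalies, half from the (mostly
        # normal) unlabeled pool, sampled with replacement since the
        # labeled anomaly set is tiny.
        out = rng.choice(outlier_idx, half, replace=True)
        inl = rng.choice(inlier_idx, half, replace=True)
        idx = np.concatenate([out, inl])
        labels = np.concatenate([np.ones(half), np.zeros(half)])
        yield x[idx], labels
```

A standard dataloader with negative sampling would also work; the generator form mainly keeps the anomaly/normal ratio fixed per batch despite extreme class imbalance.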

Calculating metrics

Hey,

I have a question concerning the way you calculate metrics:

Predictions of the trained model can vary from close to 0 to over 60 in some experiments I performed using the Kaggle Credit Card Fraud dataset.

My question is: why do you use the predictions as if they were probabilities (even though they are not within the [0, 1] range) when calculating AUC-ROC and AUC-PR?

I know that both sklearn functions support this type of mixed input (binary and continuous vectors), but doesn't it give a false result?
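For reference, a quick check shows that sklearn's ranking metrics depend only on the ordering of the scores, so unbounded anomaly scores are valid input and no probability calibration is required:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

y_true = np.array([0, 0, 1, 0, 1])
scores = np.array([0.2, 3.0, 60.0, 1.5, 12.0])   # unbounded anomaly scores

auc_roc = roc_auc_score(y_true, scores)
auc_pr = average_precision_score(y_true, scores)

# Any strictly increasing transform of the scores leaves both metrics
# unchanged, because only the ranking of the scores matters.
assert auc_roc == roc_auc_score(y_true, np.log1p(scores))
assert auc_pr == average_precision_score(y_true, scores * 10 + 5)
```

So AUC-ROC/AUC-PR are well-defined here; the separate question of a calibrated decision threshold is another matter.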

[Screenshot from 2020-04-23 13-16-29]

This is a bit similar to what is happening in your implementation. Also, I think that the confidence threshold you mentioned in the paper should depend on the value of the margin used in the deviation loss, not only on the probit and the normal distribution parameters.

Thanks for the reproducible paper :)
Kuba
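For readers following along, the Z-score deviation loss discussed here can be sketched in NumPy as below (an illustrative reconstruction under the paper's N(0, 1) reference-score prior, not the repository's exact code); the `margin` argument is the term the comment above argues should feed into the confidence threshold:

```python
import numpy as np
from scipy.stats import norm

def deviation_loss(scores, labels, margin=5.0, rng=None):
    """Z-score-based deviation loss (NumPy sketch, not the repo's code).

    scores: anomaly scores from the network; labels: 1 = anomaly, 0 = normal.
    A fresh reference sample is drawn from the N(0, 1) prior each call.
    """
    rng = np.random.default_rng() if rng is None else rng
    ref = rng.standard_normal(5000)
    dev = (scores - ref.mean()) / ref.std()        # Z-score deviation
    inlier_loss = np.abs(dev)                      # pull normals toward 0
    outlier_loss = np.maximum(margin - dev, 0.0)   # push anomalies past margin
    return np.mean((1 - labels) * inlier_loss + labels * outlier_loss)

# Under the N(0, 1) prior a one-sided confidence threshold on the deviation
# is e.g. norm.ppf(0.95) ~= 1.645; trained anomalies are pushed out toward
# the margin (5 here), far beyond it, which is the interaction the comment
# above points at.
threshold = norm.ppf(0.95)
```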

Anomaly Contamination level

Hi @GuansongPang, I have another question regarding the anomaly contamination level setting on the unlabeled training dataset.

Since the model has this setting during the training process (assuming that there is anomalous data in the unlabeled training set), does that mean the model can also see which unlabeled data instances tend to be "anomalous" (have a high anomaly score) in the test set?

Thank you.

TF2 implementation

Hi Mr Pang.
I have read your paper and I found it very interesting to replicate. Congratulations!

I am currently trying to explore your ideas and implement them in a fraud prevention context, but I ran into some trouble adapting your code to TF2 and our dataset. I am experiencing problems with the code, particularly inside the custom cost function.

The first one occurs with the ref variable. The framework throws the following error: tf.function-decorated function tried to create variables on non-first call. One solution would be to declare the variable outside the function, but then the random sample from which the mean and variance are obtained would always be the same.
On the other hand, for debugging purposes only, we trained the model in eager mode, and the error we encountered is: assertion failed: [predictions must be >= 0] [Condition x >= y did not hold element-wise:] ….

Have you or your team ever tried to migrate your code to TF2? Your experience could be very helpful to us.

Thank you.
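One hedged workaround for the first error, assuming the `ref` sample only feeds the mean and standard deviation of the N(0, 1) prior: draw it with `tf.random.normal` inside the loss. That op creates no `tf.Variable`, so `tf.function` retracing stays legal, and the sample still changes on every call. A sketch, not an official TF2 port:

```python
import tensorflow as tf

@tf.function
def deviation_loss(y_true, y_pred, margin=5.0):
    # Reference scores drawn from the N(0, 1) prior. tf.random.normal is a
    # plain op rather than a tf.Variable, so nothing is "created" on
    # non-first calls of the tf.function.
    ref = tf.random.normal([5000])
    dev = (y_pred - tf.reduce_mean(ref)) / tf.math.reduce_std(ref)
    inlier_loss = tf.abs(dev)                       # label 0: pull toward 0
    outlier_loss = tf.maximum(margin - dev, 0.0)    # label 1: push past margin
    return tf.reduce_mean((1.0 - y_true) * inlier_loss + y_true * outlier_loss)
```

The eager-mode "predictions must be >= 0" assertion looks unrelated; it usually comes from a metric (e.g. AUC) that expects probabilities, so it may need raw scores routed around it rather than a loss change.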

Code for modified DSVDD

Hello,

I would like to first thank you for providing the code for deviation network.
In the paper, it was mentioned that you used a modified version of DSVDD. Would it be possible to provide the code for that?

Model performance

Hi,
I ran the code but don't get the performance reported in the paper on the credit card default dataset:
AUC-ROC: 0.6005, AUC-PR: 0.1028 (last run)
average AUC-ROC: 0.6429, average AUC-PR: 0.0941
Did you do feature engineering or change the default arguments to get the best performance on this dataset? (I trained the model with the default arguments.)

Using deviation networks on custom datasets

Hey again!

Do you have any advice on how to approach using deviation networks on real-world datasets?

Main concerns when using real world data:

  • due to feature engineering, many columns are dependent (so the central limit theorem doesn't quite apply; the variables are not independent)
  • there are many missing values (imputation with mean & mode helps, but it may skew the distribution even more)

Having that in mind, do you have any advice about how to approach new datasets?

Can you think of any method that would let me check whether my dataset's anomaly scores fit a normal distribution and, if not, guide me towards another distribution type?
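On the normality question, one quick screen (a sketch, not a definitive protocol) is a Kolmogorov-Smirnov test against a normal fitted to the scores. Strictly speaking, estimating the parameters from the same data calls for a Lilliefors-style correction, so treat the p-value as indicative only:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
scores = rng.standard_normal(500)   # stand-in for your model's anomaly scores

# KS test against a normal with mean/std estimated from the scores.
# A small p-value suggests the normal assumption does not hold and that
# another reference distribution may be worth trying.
mu, sigma = scores.mean(), scores.std(ddof=1)
stat, p_value = stats.kstest(scores, "norm", args=(mu, sigma))
```

A Q-Q plot against the candidate distribution is a useful visual companion to the test.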

Have you considered using other distributions or other distribution parameters during your research?

I've seen that you co-authored another paper that solves a similar problem. Do you plan to release its implementation?

Thanks again and cheers,
Kuba

Model performance on backdoor datasets

Hi Mr Pang,
I trained the model with the default arguments on the backdoor dataset and got the following performance:
AUC-ROC: 0.7793, AUC-PR: 0.2663
average AUC-ROC: 0.7834, average AUC-PR: 0.2730
I saw a similar question in a previous issue, and your answer was to perform some standard data preprocessing steps. Could you provide the specific steps and the data-processing code on GitHub?
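In case it helps while waiting for an answer: "standard preprocessing" for tabular intrusion-detection data often means one-hot encoding the categorical columns and scaling the numeric ones. The column indices below are placeholders, not the backdoor dataset's real schema:

```python
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler

# Hypothetical column layout: which columns are categorical vs. numeric
# depends on the actual backdoor dataset; these index lists are placeholders.
categorical_cols = [1, 2, 3]
numeric_cols = [0, 4, 5]

preprocess = ColumnTransformer([
    # Unseen categories at test time are encoded as all-zeros.
    ("onehot", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    # Rescale numeric features to [0, 1].
    ("scale", MinMaxScaler(), numeric_cols),
])
```

Fit it on the training split only (`preprocess.fit_transform(X_train)`), then apply `preprocess.transform` to the test split to avoid leakage.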
