
DGFraud




A Deep Graph-based Toolbox for Fraud Detection

Introduction

May 2021 Update: DGFraud has been upgraded to TensorFlow 2.0! Please check out DGFraud-TF2.

DGFraud is a Graph Neural Network (GNN) based toolbox for fraud detection. It integrates the implementations of state-of-the-art GNN-based fraud detection models and a comparison among them. An introduction to the implemented models can be found here.

We welcome contributions that add new fraud detectors and extend the features of the toolbox. Some of the planned features are listed in the TODO list.

If you use the toolbox in your project, please cite one of the two papers below as well as the algorithms you used:

CIKM'20 (PDF)

@inproceedings{dou2020enhancing,
  title={Enhancing Graph Neural Network-based Fraud Detectors against Camouflaged Fraudsters},
  author={Dou, Yingtong and Liu, Zhiwei and Sun, Li and Deng, Yutong and Peng, Hao and Yu, Philip S},
  booktitle={Proceedings of the 29th ACM International Conference on Information and Knowledge Management (CIKM'20)},
  year={2020}
}

SIGIR'20 (PDF)

@inproceedings{liu2020alleviating,
  title={Alleviating the Inconsistency Problem of Applying Graph Neural Network to Fraud Detection},
  author={Liu, Zhiwei and Dou, Yingtong and Yu, Philip S. and Deng, Yutong and Peng, Hao},
  booktitle={Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval},
  year={2020}
}


Installation

git clone https://github.com/safe-graph/DGFraud.git
cd DGFraud
python setup.py install

Requirements

* python 3.6, 3.7
* tensorflow>=1.14.0,<2.0
* numpy>=1.16.4
* scipy>=1.2.0
* networkx<=1.11

Datasets

DBLP

We use the pre-processed DBLP dataset from Jhy1993/HAN. You can run FdGars, Player2Vec, GeniePath and GEM on the DBLP dataset. Unzip the archive before using the dataset:

cd dataset
unzip DBLP4057_GAT_with_idx_tra200_val_800.zip

Example dataset

We implement example graphs for SemiGNN, GAS and GEM in data_loader.py, because those models require unique graph structures or node types that cannot be found in open-source datasets.

Yelp dataset

For GraphConsis, we preprocessed the Yelp Spam Review Dataset with reviews as nodes and three relations as edges.

The dataset in .mat format is located at /dataset/YelpChi.zip. The .mat file includes:

  • net_rur, net_rtr, net_rsr: three sparse matrices representing three homo-graphs defined in GraphConsis paper;
  • features: a sparse matrix of 32-dimensional handcrafted features;
  • label: a numpy array with the ground truth of nodes. 1 represents spam and 0 represents benign.

The YelpChi data preprocessing details can be found in our CIKM'20 paper. To get the complete metadata of the Yelp dataset, please email [email protected].
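As a quick sanity check, the archive can be inspected with scipy once unzipped. The sketch below assumes the extracted file is named YelpChi.mat and uses the keys described above:

# Minimal sketch: inspect the YelpChi .mat file (the file name inside the zip is an assumption).
from scipy.io import loadmat

data = loadmat('dataset/YelpChi.mat')
net_rur, net_rtr, net_rsr = data['net_rur'], data['net_rtr'], data['net_rsr']  # sparse homo-graphs
features = data['features']   # sparse matrix of 32-dimensional handcrafted features
labels = data['label']        # 1 = spam, 0 = benign

print('nodes:', features.shape[0], 'feature dim:', features.shape[1])
print('spam ratio:', labels.flatten().mean())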

User Guide

Running the example code

You can find the implemented models in the algorithms directory. For example, you can run Player2Vec using:

python Player2Vec_main.py 

You can specify parameters for models when running the code.

Running on your datasets

Have a look at the load_data_dblp() function in utils/utils.py for an example.

In order to use your own data, you have to provide:

  • adjacency matrices or adjlists (for GAS);
  • a feature matrix;
  • a label matrix, then split the feature matrix and the label matrix into training and testing data (see the loader sketch after this list).
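As a rough guide (not the toolbox's actual API), a custom loader could mirror load_data_dblp() and return those pieces plus a split; all file and key names below are hypothetical:

# Hypothetical loader sketch modeled on load_data_dblp(); adapt the names to your data.
import numpy as np

def load_my_dataset(path='dataset/my_data.npz', train_ratio=0.8, seed=42):
    data = np.load(path, allow_pickle=True)
    adjs = [data['adj_rel1'], data['adj_rel2']]   # one adjacency matrix per relation
    features = data['features']                   # node feature matrix
    labels = data['labels']                       # node labels

    rng = np.random.RandomState(seed)
    idx = rng.permutation(labels.shape[0])
    split = int(train_ratio * len(idx))
    idx_train, idx_test = idx[:split], idx[split:]
    return adjs, features, labels, idx_train, idx_test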

You can specify a dataset as follows:

python xx_main.py --dataset your_dataset 

or by editing xx_main.py directly.

The structure of the code

The repository is organized as follows:

  • algorithms/ contains the implemented models and the corresponding example code;
  • base_models/ contains the basic models (GCN);
  • dataset/ contains the necessary dataset files;
  • utils/ contains:
    • data loading and splitting (data_loader.py);
    • various utility functions (utils.py).

Implemented Models

Model | Paper | Venue | Reference
SemiGNN | A Semi-supervised Graph Attentive Network for Financial Fraud Detection | ICDM 2019 | BibTex
Player2Vec | Key Player Identification in Underground Forums over Attributed Heterogeneous Information Network Embedding Framework | CIKM 2019 | BibTex
GAS | Spam Review Detection with Graph Convolutional Networks | CIKM 2019 | BibTex
FdGars | FdGars: Fraudster Detection via Graph Convolutional Networks in Online App Review System | WWW 2019 | BibTex
GeniePath | GeniePath: Graph Neural Networks with Adaptive Receptive Paths | AAAI 2019 | BibTex
GEM | Heterogeneous Graph Neural Networks for Malicious Account Detection | CIKM 2018 | BibTex
GraphSAGE | Inductive Representation Learning on Large Graphs | NIPS 2017 | BibTex
GraphConsis | Alleviating the Inconsistency Problem of Applying Graph Neural Network to Fraud Detection | SIGIR 2020 | BibTex
HACUD | Cash-Out User Detection Based on Attributed Heterogeneous Information Network with a Hierarchical Attention Mechanism | AAAI 2019 | BibTex

Model Comparison

Model | Application | Graph Type | Base Model
SemiGNN | Financial Fraud | Heterogeneous | GAT, LINE, DeepWalk
Player2Vec | Cyber Criminal | Heterogeneous | GAT, GCN
GAS | Opinion Fraud | Heterogeneous | GCN, GAT
FdGars | Opinion Fraud | Homogeneous | GCN
GeniePath | Financial Fraud | Homogeneous | GAT
GEM | Financial Fraud | Heterogeneous | GCN
GraphSAGE | Opinion Fraud | Homogeneous | GraphSAGE
GraphConsis | Opinion Fraud | Heterogeneous | GraphSAGE
HACUD | Financial Fraud | Heterogeneous | GAT

TODO List

  • Implementing mini-batch training
  • The log loss for GEM model
  • Time-based sampling for GEM
  • Add sampling methods
  • Benchmarking SOTA models
  • Scalable implementation
  • Pytorch implementation

How to Contribute

You are welcome to contribute to this open-source toolbox. Detailed instructions will be released soon. For now, you can create issues or email [email protected] with inquiries.

DGFraud's People

Contributors

hengruizhang98, jimliu96, yingtongdou, yutongd


DGFraud's Issues

dataset problem

Can you provide the Amazon dataset? I can't find a dataset corresponding to the description in the paper.

Problems with my datasets

When I use the GEM algorithm from DGFraud on my own datasets, I have some problems:

1. My adjacency matrix is 720,000 x 720,000. How do I deal with it? Sampling? If I sample, could nodes and relationships be broken?

2. My features are continuous numerical values; for example, the age feature ranges from 18 to 50. Should I use them directly, or apply binning / one-hot encoding?

Looking forward to your reply.

How do I use a sparse matrix as adj_list in GEM?

When I use the GEM algorithm from DGFraud on my datasets, I try to use a sparse matrix as the input adj_list, but it does not work.
This is my code; I don't know how to feed a sparse tensor to the model. Can you help me?

for d in range(self.devices_num):
    indices = np.mat([self.placeholders['a'][d].row, self.placeholders['a'][d].col]).transpose()
    m = tf.SparseTensor(indices, self.placeholders['a'][d].data, self.placeholders['a'][d].shape)
    with tf.Session() as session:
        ah = session.run(tf.nn.embedding_lookup_sparse(input, m, None, combiner='sum'))

    ahv = tf.matmul(ah, self.vars['V'])
    h2.append(ahv)
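For reference, a common TensorFlow 1.x pattern for this (a sketch under assumptions, not the toolbox's actual interface) is to declare a tf.sparse_placeholder, build the multiplication into the graph with tf.sparse_tensor_dense_matmul, and feed a SparseTensorValue at run time, rather than opening a session during graph construction:

# Sketch (TF 1.x): feed a scipy.sparse adjacency through a sparse placeholder;
# all tensors below are stand-ins, not variables from the GEM implementation.
import numpy as np
import scipy.sparse as sp
import tensorflow as tf

adj = sp.random(1000, 1000, density=0.001, format='coo')     # stand-in adjacency
feats = np.random.rand(1000, 16).astype(np.float32)          # stand-in node features

adj_ph = tf.sparse_placeholder(tf.float32)
x_ph = tf.placeholder(tf.float32, shape=[None, 16])
h = tf.sparse_tensor_dense_matmul(adj_ph, x_ph)              # A @ X inside the graph

with tf.Session() as sess:
    indices = np.vstack([adj.row, adj.col]).T
    sparse_value = tf.SparseTensorValue(indices, adj.data.astype(np.float32), adj.shape)
    out = sess.run(h, feed_dict={adj_ph: sparse_value, x_ph: feats})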

About GraphConsis

I found that many parts of the GraphConsis model currently fail to run. Could you please upload updated code?

data_loader DBLP

Can you explain the function load_data_dblp?

Why does it compute data['net_APA'] - np.eye(N)?
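One possible reading (an assumption, not an official answer): subtracting the identity just zeroes the diagonal of a 0/1 adjacency matrix, i.e. it removes self-loops, as this tiny numpy example shows:

# A - np.eye(N) zeroes the diagonal of a 0/1 adjacency matrix (removes self-loops).
import numpy as np

A = np.array([[1, 1, 0],
              [1, 1, 1],
              [0, 1, 1]])                 # adjacency with self-loops on the diagonal
print(A - np.eye(3, dtype=int))           # diagonal becomes 0; other edges unchanged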

About semignn

Hello, when I test my model with SemiGNN, the following error occurs:

ResourceExhaustedError: OOM when allocating tensor with shape[26656,213248] and type float on /job:localhost/replica:0/task:0/device:CPU:0 by allocator cpu

u_i_embedding = tf.nn.embedding_lookup(a_u, tf.cast(self.placeholders['u_i'], dtype=tf.int32))
u_j_embedding = tf.nn.embedding_lookup(a_u, tf.cast(self.placeholders['u_j'], dtype=tf.int32))
inner_product = tf.reduce_sum(u_i_embedding * u_j_embedding, axis=1)

I debugged this and found that the shape of u_i and u_j, (213248,), may be the reason. When u_i and u_j are used to compute loss2, inner_product has shape (26656, 213248), which is so big that it causes the memory overflow. I only have 6664 nodes and 8 views, but the inner_product used to compute loss2 is so large that the program cannot continue.

Can you give me some help?

Asking about YelpChi dataset

Hi, I'm currently reading the GraphConsis paper and I saw Table 2:

[Table 2 from the GraphConsis paper]

I did notice that the features used in the paper are extracted using word2vec instead of BoW, but shouldn't the results stay roughly the same for both feature types? At the moment I can't reproduce these results even with the simplest logistic regression.

I get an AUC of only 0.5 with the default sklearn LogisticRegression, using 80% of the data for training.
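For context, the baseline described above roughly corresponds to the following sketch, assuming the YelpChi fields listed in the Datasets section (the extracted .mat file name is an assumption):

# Sketch of the logistic-regression baseline mentioned in this issue.
from scipy.io import loadmat
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

data = loadmat('dataset/YelpChi.mat')          # file name is an assumption
X = data['features'].toarray()                 # 32-dimensional handcrafted features
y = data['label'].flatten()                    # 1 = spam, 0 = benign

X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.8, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print('AUC:', roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]))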

Node-Attention in SemiGNN

Hello! After carefully reading the SemiGNN implementation again, I have a question.

  • Regarding the implementation of the Node-Attention mechanism (see the code link): H_v is the attention parameter in the equation below, and in the code the shape of H_v is [hidden_size, 1], where hidden_size is the length of the node embedding vector. According to my understanding of the paper, the shape of H_v should be determined by the number of neighbors of the current node, but then the shape of H_v would not be fixed and the network could not be trained by back-propagation.
    [attention equation from the paper]

Thank you for taking your valuable time!

HACUD get_data.py

Hello, in HACUD get_data.py, I think there is a bug:
u_index.append(z[0])
v_index.append(z[1])
To get the neighbor node indices, it should be:
u_index.append(np.where(z)[0])
v_index.append(np.where(z)[1])
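A small numpy check of the difference the report points at (illustrative only): z[0] is simply the first row of the matrix, while np.where(z) returns the row and column indices of its nonzero entries.

# z[0] is a row of values; np.where(z) gives (row_indices, col_indices) of nonzero entries.
import numpy as np

z = np.array([[0, 0, 1],
              [1, 0, 1]])
print(z[0])              # -> [0 0 1]   (values of the first row)
print(np.where(z)[0])    # -> [0 1 1]   (row indices of nonzero entries)
print(np.where(z)[1])    # -> [2 0 2]   (column indices of nonzero entries)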
