Code Monkey home page Code Monkey logo

bert4eth's Introduction

BERT4ETH

This is the repo for the code (TensorFlow version) and datasets used in the paper BERT4ETH: A Pre-trained Transformer for Ethereum Fraud Detection, accepted by the ACM Web conference (WWW) 2023. Here you can find our slides.

If you find this repository useful, please give us a star : ) Thank you!

Update: I've recently added a section (Section 5.5) discussing the multi-hop modeling capability of BERT4ETH to the paper on arXiv. (10/30)

BERT4ETH-PyTorch: Here you can find the PyTorch implementation: https://github.com/Bayi-Hu/BERT4ETH_PyTorch

Some notes:

Note 1: The master branch hosts the basic BERT4ETH. If you wish to run the basic model, there is no need to download the ERC-20 log dataset. Advanced features such as In/out separation and ERC20 log can be found in the old branch but are not recommended due to the inefficiency of computation and memory.

Note 2: Despite BERT4ETH is a sequential model, it is able to capture three-hop relationship from a graph perspective. (For more details please refer to our slides.) multi_hop_modeling.png

Note 3: The results reported in our paper are the best results among five times experiments (pre-training). The outcomes might slightly vary between different runs of pre-training, steps of checkpoints, and runs of cascaded MLP classifier training. Below are our recent results on the phishing detection task with fixed training:

Getting Start

Requirements

  • Python >= 3.6
  • TensorFlow >= 2

I use python 3.9, tensorflow 2.9.2 with CUDA 11.2, numpy 1.19.5.

Preprocess dataset

Step 1: Download dataset from Google Drive.

Step 2: Unzip dataset under the directory of "BERT4ETH/Data/"

cd BERT4ETH/Data; # Labels are already included
unzip ...;

Step 3: Transaction Sequence Generation

cd Model;
python gen_seq.py --bizdate=bert4eth_exp

Pre-training

Step 1: Pre-training Data Generation from Sequence

python gen_pretrain_data.py --bizdate=bert4eth_exp  \ 
                            --max_seq_length=100  \
                            --dupe_factor=10 \
                            --masked_lm_prob=0.8 

Step 2: Pre-train BERT4ETH

python run_pretrain.py --bizdate=bert4eth_exp \
                       --max_seq_length=100 \
                       --epoch=5 \
                       --batch_size=256 \
                       --learning_rate=1e-4 \
                       --num_train_steps=1000000 \
                       --save_checkpoints_steps=8000 \
                       --neg_strategy=zip \
                       --neg_sample_num=5000 \ 
                       --neg_share=True \ 
                       --checkpointDir=bert4eth_exp 
Parameter Description
bizdate The signature for this experiment run.
max_seq_length The maximum length of BERT4ETH.
masked_lm_prob The probability of masking an address.
epochs Number of training epochs, default = 5.
batch_size Batch size, default = 256.
learning_rate Learning rate for the optimizer (Adam), default = 1e-4.
num_train_steps The maximum number of training steps, default = 1000000,
save_checkpoints_steps The parameter controlling the step of saving checkpoints, default = 8000.
neg_strategy Strategy for negative sampling, default zip, options (uniform, zip, freq).
neg_share Whether enable in-batch sharing strategy, default = True.
neg_sample_num The negative sampling number for one batch, default = 5000.
checkpointDir Specify the directory to save the checkpoints.

Step 3: Output Representation

python output_embed.py --bizdate=bert4eth_exp \
                       --init_checkpoint=bert4eth_exp/model_104000 \
                       --max_seq_length=100 \
                       --neg_sample_num=5000 \
                       --neg_strategy=zip \
                       --neg_share=True

I have generated a version of embedding file, you can unzip it under the directory of "Model/inter_data/" and test the results.

Testing on output account representation

Phishing Account Detection

python run_phishing_detection.py --init_checkpoint=bert4eth_exp/model_104000 # Random Forest (RF)

python run_phishing_detection_dnn.py --init_checkpoint=bert4eth_exp/model_104000 # DNN, better than RF

De-anonymization (ENS dataset)

python run_dean_ENS.py --metric=euclidean \
                       --init_checkpoint=bert4eth_exp/model_104000

De-anonymization (Tornado Cash)

python run_dean_Tornado.py --metric=euclidean \
                           --init_checkpoint=bert4eth_exp/model_104000

Fine-tuning for phishing account detection

python gen_finetune_phisher_data.py --bizdate=bert4eth_exp \ 
                                    --max_seq_length=100 
python run_finetune_phisher.py --init_checkpoint=bert4eth_exp/model_104000 \
                               --bizdate=bert4eth_exp \ 
                               --max_seq_length=100 \ 
                               --checkpointDir=tmp

Citation

@inproceedings{hu2023bert4eth,
  title={BERT4ETH: A Pre-trained Transformer for Ethereum Fraud Detection},
  author={Hu, Sihao and Zhang, Zhen and Luo, Bingqiao and Lu, Shengliang and He, Bingsheng and Liu, Ling},
  booktitle={Proceedings of the ACM Web Conference 2023},
  pages={2189--2197},
  year={2023}
}

Q&A

If you have any questions, you can either open an issue or contact me ([email protected]), and I will reply as soon as I see the issue or email.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.