blockchain-gnn-research's People

Contributors

christam96

Forkers

passcet46

blockchain-gnn-research's Issues

Create test suite to validate graph expansion

At each iteration of graph expansion, compute how many nodes and edges the resulting graph G' should contain. The expected node and edge counts can be determined using the following equations:

|G'.nodes| = |G_prev.nodes| + |G_next.nodes| - |G_prev.nodes ∩ G_next.nodes|
|G'.edges| = |G_prev.edges| + |G_next.nodes|

where G_prev is the current state of the graph, G_next is the neighbour graph being added and G' is the resulting graph from composition.

The pseudocode is as follows:

for g_next in neighbour_graphs:                  # each iteration of graph expansion
    expected = expected_counts(g_prev, g_next)   # expected (nodes, edges) from the equations above
    try:
        g_prev = construct_graph(g_prev, g_next)  # build G'
        assert (g_prev.number_of_nodes(), g_prev.number_of_edges()) == expected
    except AssertionError:
        print("Graph mismatch")
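
A minimal concrete version of this check, assuming NetworkX (used elsewhere in this repo) and that expansion composes graphs via nx.compose; expected_counts implements the two equations above:

import networkx as nx

def expected_counts(g_prev, g_next):
    # Node equation: inclusion-exclusion over the two node sets
    shared = len(set(g_prev.nodes) & set(g_next.nodes))
    exp_nodes = g_prev.number_of_nodes() + g_next.number_of_nodes() - shared
    # Edge equation: one new edge per node in the neighbour graph
    exp_edges = g_prev.number_of_edges() + g_next.number_of_nodes()
    return exp_nodes, exp_edges

def checked_compose(g_prev, g_next):
    expected = expected_counts(g_prev, g_next)
    g_new = nx.compose(g_prev, g_next)
    actual = (g_new.number_of_nodes(), g_new.number_of_edges())
    if actual != expected:
        print("Graph mismatch:", actual, "expected", expected)
    return g_new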

question about IIBGNN

I am currently working on reproducing the code from this paper, and I am wondering if you know the explicit way the subgraphs are calculated.

[screenshot of the relevant passage from the paper]

The paper mentions that the subgraph is calculated for EACH account in nA by adding its direct neighbours. But the figure it shows later suggests that it also involves the neighbours' neighbours of the accounts in nA.
It also mentions a number "nu", which refers to the upper limit on the number of nodes. So when we train the network, is this number fixed or not?

Evaluating similarity of transaction subgraphs using k-means clustering

Questions:

  • What do embeddings capture?
  • Do similar embeddings represent similar transaction patterns?
    • Similar transaction patterns, i.e., same wallets/accounts
  • Do embeddings capture topological and qualitative (e.g., features) network structures?
    • E.g. # neighbours, same neighbours, value exchanged (edge weight)

Framework:

  1. Create 1-hop transaction subgraphs (#43)
    • Obtain 1-hop transaction subgraphs for user
    • Filter criteria (e.g., 30 < number of user txs < 300)
  2. Embed transaction subgraph (#45)
  3. Cluster embeddings using k-means clustering (see the sketch after this list)
    • Compute k-means clustering of subgraph embeddings
    • Recover addresses from node labels
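
A minimal sketch of steps 2-3, assuming the karateclub implementation of graph2vec (one publicly available implementation; the repo may use another), scikit-learn's KMeans, and a hypothetical load_subgraphs() helper that returns (address, graph) pairs for the filtered 1-hop subgraphs:

import networkx as nx
from karateclub import Graph2Vec
from sklearn.cluster import KMeans

addresses, subgraphs = zip(*load_subgraphs())  # hypothetical helper

# karateclub expects each graph's nodes to be labelled 0..n-1
subgraphs = [nx.convert_node_labels_to_integers(g) for g in subgraphs]

model = Graph2Vec(dimensions=128)  # 128 is an assumed embedding size
model.fit(list(subgraphs))
embeddings = model.get_embedding()  # one row per subgraph

kmeans = KMeans(n_clusters=8, random_state=0).fit(embeddings)  # k=8 is an assumed choice
for addr, label in zip(addresses, kmeans.labels_):
    print(addr, label)  # recover which addresses fall in which cluster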

Constructing transaction graph: Filter through addresses

Excerpt from Yuan et al. (2020):

At the same time, we need to note that there is often a lot of redundant data in the Ethereum transaction records, i.e. a considerable number of addresses have little value for our research target because of their extreme situation. For example, addresses act as wallet interact with other addresses frequently, for which their numbers of neighbor addresses are too large for analysis of transaction network. This kind of data increases the burden of data processing, and even cause unexpected deviations to the results of model training. Thus it is necessary to set standards to clean the obtained dataset. According to previous work, we consider the addresses with the number of transactions less than 10 as inactive addresses and the addresses with the number more than 300 as overactive addresses, filtering them after transaction records collecting [16].
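
A minimal pandas sketch of this filtering rule, assuming a transactions dataframe with the From/To/Value columns used elsewhere in this repo; the thresholds follow the excerpt above:

import pandas as pd

df = pd.read_csv("transactions.csv")  # assumed file name; columns: From, To, Value

# Count how many transactions each address participates in, on either side
tx_counts = pd.concat([df["From"], df["To"]]).value_counts()

# Drop inactive (< 10 txs) and overactive (> 300 txs) addresses
active = tx_counts[(tx_counts >= 10) & (tx_counts <= 300)].index
filtered = df[df["From"].isin(active) & df["To"].isin(active)]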

GNN: graph2vec

Find transaction subgraph embeddings using the graph2vec model.

Todo

  • Read graph2vec paper (#28)
  • Find publicly available implementation of graph2vec (here)
  • Build data pipeline using Ethereum 2nd-order transaction data (#33)
  • Obtain subgraph embeddings from Ethereum 2nd-order transaction data (#35)
  • Classify embeddings using SVM (#36) (see the sketch after this list)
    • Set up downstream SVM classifier
    • Create train/test datasets
    • Generate predictions
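
A minimal sketch of the downstream classifier, assuming scikit-learn and that embeddings and labels arrays already exist; the RBF kernel and 80/20 split are assumed choices:

from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# embeddings: (n_subgraphs, dim) array from graph2vec; labels: per-subgraph classes
X_train, X_test, y_train, y_test = train_test_split(
    embeddings, labels, test_size=0.2, random_state=42
)

clf = SVC(kernel="rbf").fit(X_train, y_train)
print(accuracy_score(y_test, clf.predict(X_test)))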

Remove data limit for development stage

There are currently two ways we are limiting data intake for the purpose of development:

  1. Blocks - Block intake is currently capped at blocks 0 to 999,999 (the first million). The XBlock-ETH dataset contains 13.5 million blocks.
  2. Transactions - Internal/external transaction data is limited to nrows=100000 (i.e., the first 100,000 rows of transaction data in each block-section file). Since each block-section file covers 1 million blocks, this omits the vast majority of transaction data.

This means that creating the subgraph lists in create_subgraphs.py for the entire XBlock-ETH dataset could take a very long time. We want to verify that the output subgraphs are usable and correct, so that we run the full-dataset subgraph creation as few times as possible. A sketch of a toggle for the two limits follows.
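
One way to make the limits easy to lift later is to gate both behind a single development flag; this is a sketch, and the file name, flag, and all_block_section_files list are all assumptions:

import pandas as pd

DEV_MODE = True  # hypothetical flag; set to False to process the full dataset

# Limit 1: only the first block-section file (blocks 0 to 999,999)
block_files = ["0to999999.csv"] if DEV_MODE else all_block_section_files

# Limit 2: only the first 100,000 rows of each block-section file
nrows = 100_000 if DEV_MODE else None  # None reads every row

frames = [pd.read_csv(f, nrows=nrows) for f in block_files]
df = pd.concat(frames, ignore_index=True)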

NetworkX - Encoding graphs: Duplicate edge weights

Current approach: read the CSV in as a dataframe and create a DiGraph

import networkx as nx
import pandas as pd

df = pd.read_csv("path/to/data.csv")  # edge list with From, To, Value columns
G = nx.from_pandas_edgelist(df, source='From', target='To', edge_attr='Value',
                            create_using=nx.DiGraph())

print(G.edges.data())

NetworkX appears to aggregate edge weights for duplicate source/target pairs, but manual inspection reveals inconsistencies. metadata_totals.py sums values for distinct source values and check_sums.py compares the manual summation to G's edge weights, showing 75 inconsistencies in the test file. A likely explanation: nx.from_pandas_edgelist with a DiGraph keeps at most one edge per (source, target) pair, and duplicate rows overwrite the 'Value' attribute rather than summing it.

For now we will proceed with NetworkX, but we may want to look into this down the road.
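
If the overwrite behaviour is the culprit, one workaround (a sketch, not necessarily the fix this repo will adopt) is to pre-aggregate duplicate pairs before building the graph:

import networkx as nx
import pandas as pd

df = pd.read_csv("path/to/data.csv")  # assumed columns: From, To, Value

# Sum Value over duplicate (From, To) pairs so each directed edge
# carries the total value exchanged from one address to the other
agg = df.groupby(["From", "To"], as_index=False)["Value"].sum()

G = nx.from_pandas_edgelist(agg, source="From", target="To",
                            edge_attr="Value", create_using=nx.DiGraph())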

Constructing transaction graph: Encode bidirectional transactions

2nd-order transaction subgraph construction currently only considers transactions sent to the root. This mistake resulted from only looking at the 1st-order csv for address 0x0000000000000000000000000000000000000000, where it was assumed the authors only considered incoming transactions to the root and its neighbours. If you look at other 1st-order nodes, you will see bidirectional transactions. Thus, bidirectional transactions also need to be considered in the 2nd-order transaction subgraphs.
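
A sketch of collecting both directions when expanding around an address, assuming the From/To dataframe layout used above; the helper names are hypothetical:

import pandas as pd

def first_order_neighbours(df, root):
    # Transactions where root is the sender OR the receiver, not just the receiver
    txs = df[(df["From"] == root) | (df["To"] == root)]
    return (set(txs["From"]) | set(txs["To"])) - {root}

def second_order_subgraph_edges(df, root):
    hop1 = first_order_neighbours(df, root)
    nodes = {root} | hop1
    for n in hop1:
        nodes |= first_order_neighbours(df, n)
    # Keep every transaction between any two nodes in the 2nd-order set
    return df[df["From"].isin(nodes) & df["To"].isin(nodes)]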

Graph2Vec: Output embedding

Need to understand the output of Graph2Vec. From the paper:

output: Matrix of vector representations of graphs Φ ∈ R^(|G|×δ).

That is, one row per input graph: row i is the δ-dimensional embedding of graph i, where |G| is the number of graphs in the input set.
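
With the karateclub implementation assumed in the clustering sketch above (model and subgraphs come from there), this corresponds to a simple shape check, with dimensions=128 standing in for δ:

embeddings = model.get_embedding()
# |G| rows (one per input subgraph), delta columns (embedding dimension)
assert embeddings.shape == (len(subgraphs), 128)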

2nd-Order Transaction Embeddings: Bug audit

SVM predictions produce the same accuracy score regardless of the train/test split and random seed used. This does not align with my intuition that the model should produce slightly different results depending on the data used during training (which varying the train/test split and random seed should affect).

5-fold CV accuracy:
[0.96279762 0.97172619 0.96875 0.9702381 0.96279762]
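
One quick audit (a sketch; the real bug may be elsewhere) is to confirm the seed actually reaches the split. If the printed scores are identical across seeds, a hard-coded random_state or a deterministic upstream step is the likely culprit; embeddings and labels are assumed to exist as before:

from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

for seed in (0, 1, 2, 42):
    X_tr, X_te, y_tr, y_te = train_test_split(
        embeddings, labels, test_size=0.2, random_state=seed
    )
    clf = SVC().fit(X_tr, y_tr)
    print(seed, accuracy_score(y_te, clf.predict(X_te)))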
