blockchain-gnn-research's People

Contributors

christam96

Forkers

passcet46

blockchain-gnn-research's Issues

Create test suite to validate graph expansion

At each iteration of graph expansion, compute how many nodes and edges the resulting graph G' should contain. The expected node and edge counts can be determined using the following equations:

|G'.nodes| = |G_prev.nodes| + |G_next.nodes| - |G_prev.nodes ∩ G_next.nodes|
|G'.edges| = |G_prev.edges| + |G_next.nodes|

where G_prev is the current state of the graph, G_next is the neighbour graph being added and G' is the resulting graph from composition.

The pseudocode is as follows:

for g_next in neighbour_graphs:                  # each iteration of graph expansion
    expected = expected_counts(g_prev, g_next)   # expected (nodes, edges) from the equations above
    try:
        g_prev = construct_graph(g_prev, g_next)  # build G'
        assert (g_prev.number_of_nodes(), g_prev.number_of_edges()) == expected
    except AssertionError:
        print("Graph mismatch")
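
A minimal concrete version of this check, assuming NetworkX (used elsewhere in this repo) and that expansion composes graphs via nx.compose; expected_counts implements the two equations above:

import networkx as nx

def expected_counts(g_prev, g_next):
    # Node equation: inclusion-exclusion over the two node sets
    shared = len(set(g_prev.nodes) & set(g_next.nodes))
    exp_nodes = g_prev.number_of_nodes() + g_next.number_of_nodes() - shared
    # Edge equation: one new edge per node in the neighbour graph
    exp_edges = g_prev.number_of_edges() + g_next.number_of_nodes()
    return exp_nodes, exp_edges

def checked_compose(g_prev, g_next):
    expected = expected_counts(g_prev, g_next)
    g_new = nx.compose(g_prev, g_next)
    actual = (g_new.number_of_nodes(), g_new.number_of_edges())
    if actual != expected:
        print("Graph mismatch:", actual, "expected", expected)
    return g_new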

question about IIBGNN

I am currently working on reproducing the code from this paper, and I am wondering if you know the explicit way the subgraphs are calculated.

[screenshot of the relevant passage from the paper]

The paper mentions that the subgraph is calculated for EACH account in nA by adding its direct neighbours. But the figure it shows later suggests that it also involves the neighbours' neighbours of the accounts in nA.
It also mentions a number "nu", which refers to the upper limit on the number of nodes. So when we train the network, is this number fixed or not?

Evaluating similarity of transaction subgraphs using k-means clustering

Questions:

  • What do embeddings capture?
  • Do similar embeddings represent similar transaction patterns?
    • Similar transaction patterns, i.e., same wallets/accounts
  • Do embeddings capture topological and qualitative (e.g., features) network structures?
    • E.g. # neighbours, same neighbours, value exchanged (edge weight)

Framework:

  1. Create 1-hop transaction subgraphs (#43)
    • Obtain 1-hop transaction subgraphs for user
    • Filter criteria (e.g., 30 < number of user txs < 300)
  2. Embed transaction subgraph (#45)
  3. Cluster embeddings using k-means clustering (see the sketch after this list)
    • Compute k-means clustering of subgraph embeddings
    • Recover addresses from node labels
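
A minimal sketch of steps 2-3, assuming the karateclub implementation of graph2vec (one publicly available implementation; the repo may use another), scikit-learn's KMeans, and a hypothetical load_subgraphs() helper that returns (address, graph) pairs for the filtered 1-hop subgraphs:

import networkx as nx
from karateclub import Graph2Vec
from sklearn.cluster import KMeans

addresses, subgraphs = zip(*load_subgraphs())  # hypothetical helper

# karateclub expects each graph's nodes to be labelled 0..n-1
subgraphs = [nx.convert_node_labels_to_integers(g) for g in subgraphs]

model = Graph2Vec(dimensions=128)  # 128 is an assumed embedding size
model.fit(list(subgraphs))
embeddings = model.get_embedding()  # one row per subgraph

kmeans = KMeans(n_clusters=8, random_state=0).fit(embeddings)  # k=8 is an assumed choice
for addr, label in zip(addresses, kmeans.labels_):
    print(addr, label)  # recover which addresses fall in which cluster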

Constructing transaction graph: Filter through addresses

Excerpt from Yuan et al. (2020):

At the same time, we need to note that there is often a lot of redundant data in the Ethereum transaction records, i.e. a considerable number of addresses have little value for our research target because of their extreme situation. For example, addresses act as wallet interact with other addresses frequently, for which their numbers of neighbor addresses are too large for analysis of transaction network. This kind of data increases the burden of data processing, and even cause unexpected deviations to the results of model training. Thus it is necessary to set standards to clean the obtained dataset. According to previous work, we consider the addresses with the number of transactions less than 10 as inactive addresses and the addresses with the number more than 300 as overactive addresses, filtering them after transaction records collecting [16].
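
A minimal pandas sketch of this filtering rule, assuming a transactions dataframe with the From/To/Value columns used elsewhere in this repo; the thresholds follow the excerpt above:

import pandas as pd

df = pd.read_csv("transactions.csv")  # assumed file name; columns: From, To, Value

# Count how many transactions each address participates in, on either side
tx_counts = pd.concat([df["From"], df["To"]]).value_counts()

# Drop inactive (< 10 txs) and overactive (> 300 txs) addresses
active = tx_counts[(tx_counts >= 10) & (tx_counts <= 300)].index
filtered = df[df["From"].isin(active) & df["To"].isin(active)]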

GNN: graph2vec

Find transaction subgraph embeddings using the graph2vec model.

Todo

  • Read graph2vec paper (#28)
  • Find publicly available implementation of graph2vec (here)
  • Build data pipeline using Ethereum 2nd-order transaction data (#33)
  • Obtain subgraph embeddings from Ethereum 2nd-order transaction data (#35)
  • Classify embeddings using SVM (#36) (see the sketch after this list)
    • Set up downstream SVM classifier
    • Create train/test datasets
    • Generate predictions
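
A minimal sketch of the downstream classifier, assuming scikit-learn and that embeddings and labels arrays already exist; the RBF kernel and 80/20 split are assumed choices:

from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# embeddings: (n_subgraphs, dim) array from graph2vec; labels: per-subgraph classes
X_train, X_test, y_train, y_test = train_test_split(
    embeddings, labels, test_size=0.2, random_state=42
)

clf = SVC(kernel="rbf").fit(X_train, y_train)
print(accuracy_score(y_test, clf.predict(X_test)))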

Remove data limit for development stage

There are currently two ways we are limiting data intake for the purpose of development:

  1. Blocks - Block intake is currently capped at blocks 0 to 999,999 (the first million). The XBlock-ETH dataset contains 13.5 million blocks.
  2. Transactions - Internal/external transaction data is limited to nrows=100000 (i.e., the first 100,000 rows of transaction data in each block-section file). Since each block-section file covers 1 million blocks, this omits the vast majority of transaction data.

This means that creating the subgraph lists in create_subgraphs.py for the entire XBlock-ETH dataset could take a very long time. We want to verify that the output subgraphs are usable and correct, so that we run the full-dataset subgraph creation as few times as possible. A sketch of a toggle for the two limits follows.
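
One way to make the limits easy to lift later is to gate both behind a single development flag; this is a sketch, and the file name, flag, and all_block_section_files list are all assumptions:

import pandas as pd

DEV_MODE = True  # hypothetical flag; set to False to process the full dataset

# Limit 1: only the first block-section file (blocks 0 to 999,999)
block_files = ["0to999999.csv"] if DEV_MODE else all_block_section_files

# Limit 2: only the first 100,000 rows of each block-section file
nrows = 100_000 if DEV_MODE else None  # None reads every row

frames = [pd.read_csv(f, nrows=nrows) for f in block_files]
df = pd.concat(frames, ignore_index=True)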

NetworkX - Encoding graphs: Duplicate edge weights

Current approach: read the CSV in as a dataframe and create a DiGraph

import networkx as nx
import pandas as pd

df = pd.read_csv("path/to/data.csv")  # edge list with From, To, Value columns
G = nx.from_pandas_edgelist(df, source='From', target='To', edge_attr='Value',
                            create_using=nx.DiGraph())

print(G.edges.data())

NetworkX appears to aggregate edge weights for duplicate source/target pairs, but manual inspection reveals inconsistencies. metadata_totals.py sums values for distinct source values and check_sums.py compares the manual summation to G's edge weights, showing 75 inconsistencies in the test file. A likely explanation: nx.from_pandas_edgelist with a DiGraph keeps at most one edge per (source, target) pair, and duplicate rows overwrite the 'Value' attribute rather than summing it.

For now we will proceed with NetworkX, but we may want to look into this down the road.
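
If the overwrite behaviour is the culprit, one workaround (a sketch, not necessarily the fix this repo will adopt) is to pre-aggregate duplicate pairs before building the graph:

import networkx as nx
import pandas as pd

df = pd.read_csv("path/to/data.csv")  # assumed columns: From, To, Value

# Sum Value over duplicate (From, To) pairs so each directed edge
# carries the total value exchanged from one address to the other
agg = df.groupby(["From", "To"], as_index=False)["Value"].sum()

G = nx.from_pandas_edgelist(agg, source="From", target="To",
                            edge_attr="Value", create_using=nx.DiGraph())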

Constructing transaction graph: Encode bidirectional transactions

2nd-order transaction subgraph construction currently only considers transactions sent to the root. This mistake resulted from only looking at the 1st-order csv for address 0x0000000000000000000000000000000000000000, where it was assumed the authors only considered incoming transactions to the root and its neighbours. If you look at other 1st-order nodes, you will see bidirectional transactions. Thus, bidirectional transactions also need to be considered in the 2nd-order transaction subgraphs.
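
A sketch of collecting both directions when expanding around an address, assuming the From/To dataframe layout used above; the helper names are hypothetical:

import pandas as pd

def first_order_neighbours(df, root):
    # Transactions where root is the sender OR the receiver, not just the receiver
    txs = df[(df["From"] == root) | (df["To"] == root)]
    return (set(txs["From"]) | set(txs["To"])) - {root}

def second_order_subgraph_edges(df, root):
    hop1 = first_order_neighbours(df, root)
    nodes = {root} | hop1
    for n in hop1:
        nodes |= first_order_neighbours(df, n)
    # Keep every transaction between any two nodes in the 2nd-order set
    return df[df["From"].isin(nodes) & df["To"].isin(nodes)]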

Graph2Vec: Output embedding

Need to understand the output of Graph2Vec. From the paper:

output: Matrix of vector representations of graphs Φ ∈ R^(|G|×δ).

That is, one row per input graph: row i is the δ-dimensional embedding of graph i, where |G| is the number of graphs in the input set.
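
With the karateclub implementation assumed in the clustering sketch above (model and subgraphs come from there), this corresponds to a simple shape check, with dimensions=128 standing in for δ:

embeddings = model.get_embedding()
# |G| rows (one per input subgraph), delta columns (embedding dimension)
assert embeddings.shape == (len(subgraphs), 128)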

2nd-Order Transaction Embeddings: Bug audit

SVM predictions produce the same accuracy score regardless of the train/test split and random seed used. This does not align with my intuition that the model should produce slightly different results depending on the data used during training (which varying the train/test split and random seed should affect).

5-fold CV accuracy:
[0.96279762 0.97172619 0.96875 0.9702381 0.96279762]
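
One quick audit (a sketch; the real bug may be elsewhere) is to confirm the seed actually reaches the split. If the printed scores are identical across seeds, a hard-coded random_state or a deterministic upstream step is the likely culprit; embeddings and labels are assumed to exist as before:

from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

for seed in (0, 1, 2, 42):
    X_tr, X_te, y_tr, y_te = train_test_split(
        embeddings, labels, test_size=0.2, random_state=seed
    )
    clf = SVC().fit(X_tr, y_tr)
    print(seed, accuracy_score(y_te, clf.predict(X_te)))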
