christam96 / blockchain-gnn-research
Research papers, experiments, and ideas for the application of GNNs to blockchain data
At each iteration of graph expansion, run calculations for how many nodes and edges should be in the resulting graph G'. The number of nodes and edges can be determined using the following equations:
G'.nodes = G_prev.nodes + G_next.nodes - intersection(G_prev, G_next).nodes
G'.edges = G_prev.edges + len(G_next.nodes)
where G_prev is the current state of the graph, G_next is the neighbour graph being added, and G' is the resulting graph from composition.
The pseudocode is as follows:
for each iteration of graph expansion:
    evaluate expected graph data (nodes and edges)
    try:
        construct graph G'
        assert G'.data == expectation
    except AssertionError:
        print("Graph mismatch")
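One expansion step can be sketched with NetworkX's compose. This is a minimal illustration with made-up toy graphs, checking only the node-count formula (the edge-count formula above depends on the specific structure of G_next, so it is not asserted here):

```python
import networkx as nx

# Toy stand-ins: G_prev is the current graph, G_next the neighbour graph
G_prev = nx.DiGraph([("a", "b"), ("b", "c")])
G_next = nx.DiGraph([("b", "d"), ("d", "e")])

# Expected node count from the union formula above
shared = set(G_prev.nodes) & set(G_next.nodes)
expected_nodes = G_prev.number_of_nodes() + G_next.number_of_nodes() - len(shared)

G_new = nx.compose(G_prev, G_next)
try:
    assert G_new.number_of_nodes() == expected_nodes
except AssertionError:
    print("Graph mismatch")
```

Here G_prev has nodes {a, b, c}, G_next has {b, d, e}, and they share {b}, so the composed graph should have 5 nodes.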
I am currently working on reproducing the code from this paper. Do you know the explicit way the subgraph is calculated in this paper?
The paper mentions that the subgraph is calculated for EACH account in nA by adding its direct neighbours. But the figure it shows later suggests it also involved the neighbours' neighbours of the account in nA.
It also mentions a number "nu", which refers to the upper limit of nodes. When we train the network, is this number fixed or not?
Questions:
Framework:
First-order nodes:
Excerpt from Yuan et al. (2020):
At the same time, we need to note that there is often a lot of redundant data in the Ethereum transaction records, i.e. a considerable number of addresses have little value for our research target because of their extreme situation. For example, addresses act as wallet interact with other addresses frequently, for which their numbers of neighbor addresses are too large for analysis of transaction network. This kind of data increases the burden of data processing, and even cause unexpected deviations to the results of model training. Thus it is necessary to set standards to clean the obtained dataset. According to previous work, we consider the addresses with the number of transactions less than 10 as inactive addresses and the addresses with the number more than 300 as overactive addresses, filtering them after transaction records collecting [16].
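The cleaning rule in that excerpt (drop addresses with fewer than 10 or more than 300 transactions) could be sketched as below. The column names 'From' and 'To' match the edge-list format used elsewhere in this repo; the helper name and thresholds-as-parameters are illustrative:

```python
import pandas as pd

def filter_addresses(df, lo=10, hi=300):
    """Keep only rows whose sender and receiver are 'active' addresses."""
    # Count each address's appearances as sender or receiver
    counts = pd.concat([df["From"], df["To"]]).value_counts()
    keep = set(counts[(counts >= lo) & (counts <= hi)].index)
    return df[df["From"].isin(keep) & df["To"].isin(keep)]
```

For example, on a frame where address 'a' sends 10 transactions to 'b' and 'c' appears only once, the single 'c' row is dropped as inactive.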
Purpose: Test graph construction and easier debugging
Method: Cut down existing dataset to ~1% while maintaining file structure
Find transaction subgraph embedding using graph2vec model.
Ethereum 2nd-order transaction data (#33)
Ethereum 2nd-order transaction data (#35)
Reading of graph2vec paper.
graph2vec model made available by Benedek Rozemberczki.
There are currently two ways we are limiting data intake for the purpose of development:
0to999999 (the first million blocks). We have 13.5 million blocks of data available in the XBlock-ETH dataset.
nrows=100000 (i.e. the first 100,000 rows of transaction data in each block section file). Since each block section file contains 1 million blocks, we're omitting a lot of transaction data.
This means that when we create subgraph lists in create_subgraphs.py for the entire XBlock-ETH dataset, the whole process could take a very long time. We want to make sure the output subgraphs are usable/correct, to limit as much as possible how many times we need to create subgraphs for the entire dataset.
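The nrows cap is just the standard pandas option. A minimal sketch (the CSV content here is an in-memory stand-in, not the repo's actual block-section file):

```python
import io
import pandas as pd

# nrows caps how many transaction rows are read from each block-section file
csv_data = io.StringIO("From,To,Value\na,b,1\nb,c,2\nc,a,3\n")
df = pd.read_csv(csv_data, nrows=2)  # keep only the first 2 rows
```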
Current approach: Read csv in as dataframe and create DiGraph
import networkx as nx
import pandas as pd
df = pd.read_csv('path/to/data.csv')  # placeholder path
Graphtype = nx.DiGraph()
G = nx.from_pandas_edgelist(df, source='From', target='To', edge_attr='Value', create_using=Graphtype)
print(G.edges.data())
NetworkX appears to be aggregating edge weights for duplicate source/target pairs, but manual inspection reveals inconsistencies. metadata_totals.py sums values for distinct source values and check_sums.py compares the manual summation to G's edge weights, showing 75 inconsistencies in the test file.
For now will proceed with NetworkX but may want to look into this down the road.
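One possible cause worth checking: for duplicate (source, target) pairs, from_pandas_edgelist on a DiGraph overwrites the edge attribute with the last row's value rather than summing. Pre-aggregating with groupby removes the ambiguity. A sketch with stand-in data (column names match the snippet above):

```python
import networkx as nx
import pandas as pd

# Two a->b rows: without aggregation, only the last 'Value' would survive
df = pd.DataFrame({"From": ["a", "a", "b"], "To": ["b", "b", "c"], "Value": [1, 2, 5]})

# Sum 'Value' over duplicate (From, To) pairs before building the graph
agg = df.groupby(["From", "To"], as_index=False)["Value"].sum()
G = nx.from_pandas_edgelist(agg, source="From", target="To",
                            edge_attr="Value", create_using=nx.DiGraph())
```

After aggregation the a->b edge carries Value 3 (1 + 2) and the graph has one edge per distinct pair.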
Find dataset used in #11
2nd-order transaction subgraph construction only considers transactions sent to root. This mistake resulted from only looking at the 1st-order csv for address 0x0000000000000000000000000000000000000000, and it was assumed the authors only considered incoming transactions to root and neighbours. If you take a look at other 1st-order nodes, you will see bidirectional transactions. Thus bidirectional transactions also need to be considered in the 2nd-order transaction subgraphs.
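One way to get a 2nd-order subgraph that follows edges in both directions is NetworkX's ego_graph with undirected=True, which traverses in- and out-edges while the returned subgraph keeps the original edge directions. A toy sketch (the graph and the 'root' node are illustrative, not the repo's data):

```python
import networkx as nx

# Toy directed graph: z->x->root->y->w
G = nx.DiGraph([("x", "root"), ("root", "y"), ("z", "x"), ("y", "w")])

# radius=2 collects neighbours and neighbours-of-neighbours in both directions
sub = nx.ego_graph(G, "root", radius=2, undirected=True)
```

A purely directed traversal from root would miss z entirely (it only sends to x); the undirected traversal includes it.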
Need to understand the output of Graph2Vec. From the paper:
output: Matrix of vector representations of graphs Φ ∈ R^(|G|×δ).
In #36 labels are hardcoded based on the number of phishing/non-phishing embeddings and are stored independently of subgraph embeddings. This risks incorrectly associating an embedding to a label. Instead, store subgraph labels in the same data structure as the embeddings so they are kept together.
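A minimal sketch of keeping each label in the same row as its embedding, so the pairing can never drift (the ids, labels, and embedding values are stand-ins):

```python
import pandas as pd

# One record per subgraph: id, label, and embedding travel together
records = [
    {"graph_id": "g0", "label": 1, "embedding": [0.1, 0.2]},  # phishing
    {"graph_id": "g1", "label": 0, "embedding": [0.3, 0.4]},  # non-phishing
]
df = pd.DataFrame(records)
```

Any later shuffle, split, or filter on the frame then moves label and embedding as a unit.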
Should be able to call create_graph.py to build subgraphs on the fly. For example, dynamically generate the appropriate transaction subgraph for each step of GNN training.
Summarize for key points, topics and findings.
Came across XBlock, a blockchain data platform for academic research. Looks like a useful resource; let's keep a reference to this and similar resources in the README.
Reserve section for reading list items.
SVM predictions producing same accuracy scores regardless of train/test split and random seed used. This doesn't align with my intuition that the model should produce slightly different results based on the data used during training (which varying train/test split and random seed should affect).
5-fold CV accuracy:
[0.96279762 0.97172619 0.96875 0.9702381 0.96279762]
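One thing worth checking: with the default cv setting, scikit-learn's cross_val_score uses StratifiedKFold without shuffling, so the folds are identical on every run no matter what random seed is set elsewhere, which would explain identical scores. Passing an explicitly shuffled splitter makes the seed matter. A sketch on a synthetic dataset (not the repo's embeddings):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in for the subgraph embeddings
X, y = make_classification(n_samples=200, random_state=0)

# shuffle=True makes the fold assignment depend on random_state
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(SVC(), X, y, cv=cv)
```

Re-running with a different random_state on the splitter should then produce slightly different fold scores.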