
graphtransformer's People

Contributors

jmbr, vijaydwivedi75

graphtransformer's Issues

Why did you divide this term?

Hi there,

I was reading your graphtransformer code, and I'm curious about the operation shown below. Why do you divide the wV term by z (the summed 'score' term)? I don't see this term in equation 4 or equation 9 of the paper. Could you clarify?

h_out = g.ndata['wV'] / (g.ndata['z'] + torch.full_like(g.ndata['z'], 1e-6)) # adding eps to all values here

Thanks
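
For context, a minimal sketch (not the authors' explanation) of why dividing wV by z reproduces a softmax-weighted average of the value vectors; scores and V below are toy stand-ins for the per-edge quantities the layer accumulates:

import torch

# Toy example: one destination node with 3 incoming edges, feature dim 4.
scores = torch.randn(3)          # unnormalized attention scores (pre-exp)
V = torch.randn(3, 4)            # value vectors of the 3 source nodes

# What the layer accumulates per destination node:
exp_scores = torch.exp(scores.clamp(-5, 5))      # 'score' after exp/clamp
wV = (exp_scores.unsqueeze(-1) * V).sum(dim=0)   # numerator: sum_j exp(s_j) * V_j
z = exp_scores.sum()                             # denominator: sum_j exp(s_j)

# Dividing wV by z is exactly the softmax-weighted sum of the values:
softmax_out = (torch.softmax(scores.clamp(-5, 5), dim=0).unsqueeze(-1) * V).sum(dim=0)
assert torch.allclose(wV / (z + 1e-6), softmax_out, atol=1e-4)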

about attention

Hello, I have a question about how the attention is calculated. In the formula, attention is computed between node i and its adjacent nodes, but in the final implementation it seems the attention is computed over all nodes, without distinguishing whether the nodes are connected. Is there a problem with my understanding?
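
One way to check this empirically (a sketch, not the authors' answer): in the DGL implementation the attention scores are computed with apply_edges, so scores only exist for node pairs connected by an edge in the graph that is fed to the layer; the full-graph configurations rebuild the graph as a complete graph first. A toy illustration on a plain DGL graph:

import dgl
import torch

# A 4-node graph with only 3 directed edges.
g = dgl.graph(([0, 1, 2], [1, 2, 3]))
g.ndata['K_h'] = torch.randn(4, 8)
g.ndata['Q_h'] = torch.randn(4, 8)

# apply_edges evaluates the score function per existing edge only,
# so the result has one row per edge, not a dense 4x4 attention map.
g.apply_edges(lambda edges: {'score': (edges.src['K_h'] * edges.dst['Q_h']).sum(-1, keepdim=True)})
print(g.edata['score'].shape)  # torch.Size([3, 1]) -- one score per edge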

Error in Using this Graph Transformer Layer on random graph

Hi!
I am trying to use this Graph Transformer layer on a random graph (see below), but the error KeyError: 'wV' occurs.

import torch
import dgl
import networkx as nx
from model import GraphTransformerLayer  # this imports layers/graph_transformer_layer.py

torch.manual_seed(42)

def create_random_graph(num_nodes, node_feature_dim):
    g = dgl.DGLGraph()
    g.add_nodes(num_nodes)
    node_features = torch.randn(num_nodes, node_feature_dim)
    g.ndata['feat'] = node_features
    return g

num_nodes = 10
node_feature_dim = 16
num_heads = 4

random_graph = create_random_graph(num_nodes, node_feature_dim)

model_layer = GraphTransformerLayer(in_dim=node_feature_dim, out_dim=node_feature_dim, num_heads=num_heads)

output_features = model_layer(random_graph, random_graph.ndata['feat'])
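
Not an official answer, but a likely cause: the random graph above has no edges, so the layer's message passing (send_and_recv) never writes 'wV' into g.ndata. A minimal sketch of a workaround under that assumption, adding random edges plus self-loops before calling the layer:

import dgl
import torch

def create_random_graph(num_nodes, node_feature_dim, num_edges=30):
    # Random directed edges plus self-loops, so every node receives messages.
    src = torch.randint(0, num_nodes, (num_edges,))
    dst = torch.randint(0, num_nodes, (num_edges,))
    g = dgl.graph((src, dst), num_nodes=num_nodes)
    g = dgl.add_self_loop(g)
    g.ndata['feat'] = torch.randn(num_nodes, node_feature_dim)
    return g

random_graph = create_random_graph(10, 16)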

Details of the laplacian encoding

Hi,

Neat work! I was looking at the implementation of the laplacian encoding, and some things weren't clear.

import numpy as np
import scipy.sparse as sp
import torch
import dgl

def laplacian_positional_encoding(g, pos_enc_dim):

    # Laplacian
    A = g.adjacency_matrix_scipy(return_edge_ids=False).astype(float)
    N = sp.diags(dgl.backend.asnumpy(g.in_degrees()).clip(1) ** -0.5, dtype=float)
    L = sp.eye(g.number_of_nodes()) - N * A * N

    # Eigenvectors with numpy
    EigVal, EigVec = np.linalg.eig(L.toarray())
    idx = EigVal.argsort() # increasing order
    EigVal, EigVec = EigVal[idx], np.real(EigVec[:,idx])
    g.ndata['lap_pos_enc'] = torch.from_numpy(EigVec[:,1:pos_enc_dim+1]).float() 
    
    return g

Why do you drop the first eigenvector in the last line (i.e. why do you use indices 1:pos_enc_dim+1)? Does this come from the assumption that the first eigenvalue will be very close to 0?

    g.ndata['lap_pos_enc'] = torch.from_numpy(EigVec[:,1:pos_enc_dim+1]).float() 

Another quick question: could you explain why you need to use np.real in the line below? Are there any cases where we would have complex numbers here?

    EigVal, EigVec = EigVal[idx], np.real(EigVec[:,idx])

Thanks in advance!
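
For what it's worth, a small sketch (my reading, not the authors') of the properties behind both questions: the symmetric normalized Laplacian always has a smallest eigenvalue of 0 whose eigenvector is proportional to D^{1/2}·1 and so carries no positional information, and np.linalg.eig can return eigenvectors with tiny imaginary parts when floating-point round-off breaks exact symmetry:

import numpy as np
import scipy.sparse as sp

# Toy undirected graph: a 4-cycle.
A = sp.csr_matrix(np.array([[0, 1, 0, 1],
                            [1, 0, 1, 0],
                            [0, 1, 0, 1],
                            [1, 0, 1, 0]], dtype=float))
deg = np.asarray(A.sum(axis=1)).flatten()
N = sp.diags(deg.clip(1) ** -0.5)
L = sp.eye(4) - N @ A @ N

EigVal, EigVec = np.linalg.eig(L.toarray())
idx = EigVal.argsort()
print(np.round(np.real(EigVal[idx]), 6))            # smallest eigenvalue is 0 (the trivial one)
print(np.round(np.real(EigVec[:, idx][:, 0]), 3))   # ~constant direction: no positional info
print(EigVec.dtype)  # may be complex128 when round-off breaks exact symmetry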

Superpixel dataset

Thanks for sharing the code for your interesting paper. I'm interested in applying this method to the MNIST superpixel dataset (from your benchmarking work), where the task is graph classification, graphs have both node and edge features, and the number of nodes/edges differs between graphs. What modifications should I make to the current code?

Is the installation instruction still valid?

I'm trying to set up the environment on a Mac. I followed your instructions carefully, but I get the error below when building the environment with conda:
ResolvePackageNotFound:

  • h5py=2.9.0
  • tensorboard=1.14.0
  • requests==2.22.0
  • ipython=7.7.0
  • ipykernel=5.1.2
  • notebook=6.0.0
  • pip=19.2.3
  • scikit-image=0.15.0
  • scipy=1.3.0
  • torchvision==0.7.0
  • pytorch=1.6.0
  • matplotlib=3.1.0
  • pillow==6.1
  • mkl=2019.4
  • dgl=0.6.1
  • scikit-learn=0.21.2
  • python=3.7.4

Sparse graph and full graph

Thanks for the innovative work! Could you please tell me how we can get a full graph? Does 'full graph' mean the full attention map? Does 'sparse graph' mean that we only retain the values of each node's immediate neighbors from the full graph?
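
My reading of the terms (a sketch, not a definitive answer): the sparse setting runs attention only over the original edges, while the full-graph setting rebuilds each graph as a complete graph over the same nodes, which makes the attention equivalent to a full NLP-style attention map. A hedged sketch of constructing such a full graph in DGL (the helper name to_full_graph is mine):

import dgl
import torch

def to_full_graph(g):
    # Connect every ordered pair of nodes (including self-loops), keeping node features.
    n = g.num_nodes()
    src = torch.arange(n).repeat_interleave(n)
    dst = torch.arange(n).repeat(n)
    full_g = dgl.graph((src, dst), num_nodes=n)
    for key, value in g.ndata.items():
        full_g.ndata[key] = value
    return full_g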

Technical question

Hi, thanks for the great paper :)

I was just curious as to what the 'z' variable is in line 59 of the graph_transformer_layer.py code? I cannot seem to find the equivalent in the paper. It seems you are normalizing the output heads by the sum of the attention weights?

Would appreciate a little pointer :)

Thanks,
Devin

pos_enc_dim value

For graphs with large differences in the number of vertices, how should the value of pos_enc_dim be determined? For example, what if one graph has 7 vertices and another has 3? Did you choose pos_enc_dim = 8 because the number of vertices in your experimental datasets is always greater than 8?
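
Not the authors' answer, but one common workaround when a graph has fewer than pos_enc_dim + 1 nodes is to zero-pad the eigenvector matrix so that every graph ends up with the same encoding width. A minimal sketch (the helper pad_lap_pos_enc is hypothetical):

import numpy as np
import torch

def pad_lap_pos_enc(EigVec, pos_enc_dim):
    # EigVec: [n, n] eigenvectors sorted by eigenvalue; returns [n, pos_enc_dim],
    # zero-padded on the right when the graph has fewer than pos_enc_dim + 1 nodes.
    pe = torch.from_numpy(np.real(EigVec[:, 1:pos_enc_dim + 1])).float()
    if pe.shape[1] < pos_enc_dim:
        pe = torch.nn.functional.pad(pe, (0, pos_enc_dim - pe.shape[1]))
    return pe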

Scaling of Laplacian pre-computation

First, I would like to say that I think there are some very good ideas in the paper. Nice work! I have some questions though:

Could you tell me what the largest graph is that you've used this approach on? Do you have any recommendations for Laplacian eigenvector encodings on large graphs? The way it's implemented now, using np.linalg.eig and the .toarray() call, loses the sparsity and could cause problems.
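
One sparse-friendly alternative (a sketch under the assumption that only the pos_enc_dim smallest non-trivial eigenvectors are needed) is scipy.sparse.linalg.eigsh, which works directly on the sparse Laplacian instead of densifying it:

import numpy as np
from scipy.sparse.linalg import eigsh

def sparse_lap_pos_enc(L, pos_enc_dim):
    # L: scipy sparse symmetric normalized Laplacian.
    # k smallest eigenpairs (which='SM'); shift-invert mode is another option for large graphs.
    EigVal, EigVec = eigsh(L.asfptype(), k=pos_enc_dim + 1, which='SM')
    idx = EigVal.argsort()
    return np.real(EigVec[:, idx][:, 1:pos_enc_dim + 1])  # drop the trivial eigenvector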

Memory consumption

Could you provide some additional information about the memory consumption using your Graph Transformer?

You state that sparse attention favors both computation time and memory consumption, but you do not provide actual measurements of the latter in your evaluation, nor do you state clearly if and how your implementation is able to take advantage of it.
Some peak-memory measurements of your experiments, as an addendum to your evaluation of the computation times (e.g. Table 1), could be beneficial to others too. In my case, the quadratic growth of memory consumption with respect to sequence length prevents an efficient use of Transformers for some tasks where connectivity information is given and can simply be modeled by masking out (-Inf) the attention scores in the attention matrix.

Some exemplary or artificial data could also be interesting, e.g. a (mean) number of nodes n = {128, 1024, 2048, 4096} and a (mean) number of edges per node e = {4, 8, 16, 64, 128}, to get an impression of the resource consumption of your Graph Transformer with a sparse graph vs. an NLP-style Transformer (full graph with masking).

(I could probably run the experiments myself, but I suppose your evaluation pipeline is already set up, and data provided by the original authors would be more precise and more trustworthy to other researchers, too.)
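
In case it helps anyone who wants to measure this themselves, a minimal sketch of recording peak GPU memory around a forward pass (plain PyTorch, not tied to this repo's pipeline; the helper peak_memory_mb is mine):

import torch

def peak_memory_mb(model, *inputs):
    # Run one forward pass and report the peak allocated GPU memory in MB.
    torch.cuda.reset_peak_memory_stats()
    with torch.no_grad():
        model(*inputs)
    return torch.cuda.max_memory_allocated() / (1024 ** 2)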

Graph Classification

Hello there,
First of all, thank you for providing such amazing work.
I'd like to know how I can leverage graphtransformer for a graph classification task with textual data. For instance, I first extract node and edge information from the text; given the node features and edge information (only one type of edge in my case), the model should generate binary targets based on those features.

Kind Regards
Michael

laplacian positional encoding

How should one handle the Laplacian positional encoding of a directed graph? The adjacency matrix of a directed graph is not a symmetric matrix.
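
One common workaround (a sketch, not something the paper prescribes) is to symmetrize the adjacency matrix, i.e. treat every directed edge as undirected, before building the normalized Laplacian:

import numpy as np
import scipy.sparse as sp

def symmetric_laplacian(A):
    # A: scipy sparse (possibly asymmetric) adjacency matrix of a directed graph.
    A_sym = ((A + A.T) > 0).astype(float)          # treat every directed edge as undirected
    deg = np.asarray(A_sym.sum(axis=1)).flatten()
    N = sp.diags(deg.clip(1) ** -0.5)
    return sp.eye(A_sym.shape[0]) - N @ A_sym @ N  # symmetric normalized Laplacian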

Dataset request

Hello, could you please tell me how the datasets in this article are processed? I want to run other datasets with your code. What should I do? Or, if you have other prepared datasets, could you please send them to me? Thank you very much!

[email protected]

node update

g.send_and_recv(eids, fn.src_mul_edge('V_h', 'score', 'V_h'), fn.sum('V_h', 'wV')) only updates the destination (target) nodes;
head_out = g.ndata['wV'] / g.ndata['z'], so this also only reflects the destination nodes. Are the source nodes not updated?
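
A hedged observation (my understanding, not the authors'): in DGL message passing only the destination nodes of the given edges are updated, so on a directed graph a node that never appears as a destination would indeed not receive a 'wV'. If both directions should be updated, the usual fix is to add reverse edges, e.g.:

import dgl
import torch

g = dgl.graph(([0, 1], [1, 2]))            # directed: node 0 is never a destination
g.ndata['feat'] = torch.randn(3, 4)

# Adding reverse edges makes every node a destination, so message passing
# (send_and_recv / update_all) writes 'wV' for all nodes, not just the targets.
g = dgl.add_reverse_edges(g, copy_ndata=True)
print(g.edges())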

About Equations 11~12

Hi,

Great work!

I want to confirm whether my understanding of equations 11~12 is correct.

I understand equation 12 in this way: (Q h_i * K h_j / sqrt(d_k)) is a scalar, and (E e_ij) is a d_k-dimensional vector, so a scalar multiplying a vector gives a d_k-dimensional vector. In equation 11, this d_k-dimensional vector is then reduced to a scalar by computing w_1 + w_2 + ... + w_{d_k}. Is that correct?

AttributeError: Can't get attribute 'DGLHeteroGraph' on <module 'dgl.heterograph' >

Hi, I try to run the example code "main_SBMs_node_classification.py" by the following command:

python main_SBMs_node_classification.py --gpu_id 0 --config 'configs/SBMs_GraphTransformer_CLUSTER_500k_full_graph_BN.json'

But it comes with the following error:
AttributeError: Can't get attribute 'DGLHeteroGraph' on <module 'dgl.heterograph' from '/home/rody/.local/share/virtualenvs/rody-V7qEFACp/lib/python3.6/site-packages/dgl/heterograph.py'>

How can I fix this bug? Thanks!

Detail on softmax

Great work!

I have a question concerning the implementation of softmax in the graph_transformer_edge_layer.py

When you define the softmax, you use the following function:

def exp(field):
    def func(edges):
        # clamp for softmax numerical stability
        return {field: torch.exp((edges.data[field].sum(-1, keepdim=True)).clamp(-5, 5))}
    return func

Shouldn't the attention weights/scores be scalars? From what I see, each head has an 8-dimensional score vector on which you then call .sum(). The graph_transformer_layer.py layer does not have this .sum() call.

def scaled_exp(field, scale_constant):
    def func(edges):
        # clamp for softmax numerical stability
        return {field: torch.exp((edges.data[field] / scale_constant).clamp(-5, 5))}

    return func

Would appreciate any clarification on this :)

Best,
Devin
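
A small shape sketch of what the .sum(-1, keepdim=True) does (my reading, not an official explanation): in the edge-feature layer the per-edge, per-head score is a d_k-dimensional vector (the elementwise K·Q·E product), and the sum collapses it to one scalar per head before the exponential; without the sum, the exponential is applied elementwise and the score stays a d_k-dimensional quantity:

import torch

num_edges, num_heads, d_k = 5, 8, 8
score = torch.randn(num_edges, num_heads, d_k)   # elementwise K * Q * E product per edge

# Edge-feature layer: reduce to one scalar score per edge and head, then exponentiate.
edge_layer_score = torch.exp(score.sum(-1, keepdim=True).clamp(-5, 5))
print(edge_layer_score.shape)   # torch.Size([5, 8, 1])

# Plain layer (no .sum): the exponential is applied elementwise, so the 'score'
# stays a d_k-dimensional quantity per head.
plain_layer_score = torch.exp((score / d_k ** 0.5).clamp(-5, 5))
print(plain_layer_score.shape)  # torch.Size([5, 8, 8])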

Eval sign flipping

Hi Vijay,

Thanks for your repo!

Question: I see you're doing sign flipping of the eigenvector pos_enc during training, but it seems that you are not doing so at eval time. I understand that we want deterministic predictions, so we don't want random flipping when evaluating. Do you have further comments or justification for this?

Best
Kezhi

Attention Matrix

Hi! Congratulations on your paper, and thank you for making the implementation publicly available as well.

Quick question on this function:

    def src_dot_dst(src_field, dst_field, out_field):
        def func(edges):
            return {out_field: (edges.src[src_field] * edges.dst[dst_field])}
        return func

Why do you do an elementwise multiplication of K and Q and not a dot product? The dimensions of the scores are [num_edges, num_heads, hidden_dim/num_heads], but I expected a [num_edges, num_edges] matrix.

You can also reach me here: [email protected]
Hope to hear from you soon, Pietro Bonazzi
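
For what it's worth (an illustration, not the authors' reply): in the edge-feature layer the elementwise K * Q product is later reduced over the feature dimension (the .sum(-1) inside exp), which recovers exactly the per-edge dot product; either way, the scores are stored per edge rather than as a dense [num_nodes, num_nodes] matrix, since only connected node pairs attend to each other. A toy check:

import torch

K = torch.randn(7, 4, 16)   # per-edge source keys:         [num_edges, num_heads, d_k]
Q = torch.randn(7, 4, 16)   # per-edge destination queries:  [num_edges, num_heads, d_k]

elementwise = (K * Q).sum(-1, keepdim=True)            # reduce feature dim per edge and head
dot = torch.einsum('ehd,ehd->eh', K, Q).unsqueeze(-1)  # the same thing as an explicit dot product
assert torch.allclose(elementwise, dot, atol=1e-5)
print(elementwise.shape)    # torch.Size([7, 4, 1]) -- per-edge scores, not an N x N matrix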
