A pair-wise graph similarity learning pipeline utilizing Deep Learning (DL) and Locality Sensitive Hashing (LSH).
The DL model used is based on a PyTorch Geometric implementation of "SimGNN: A Neural Network Approach to Fast Graph Similarity Computation" (WSDM 2019) [Paper]. While the initial implementation is [benedekrozemberczki/SimGNN], the basis for this repository is the [gospodima/Extended-SimGNN] extension, which added the Graph Isomorphism Operator from the "How Powerful are Graph Neural Networks?" paper and the Differentiable Pooling Operator from the "Hierarchical Graph Representation Learning with Differentiable Pooling" paper.
This implementation was written and used to conduct experiments for my bachelor thesis "Pair-wise graph similarity learning with Graph Convolutional Networks and Locality Sensitive Hashing" at the Informatics Department of Athens University of Economics and Business (AUEB), under the mentorship of Prof. Ioannis Kotidis.
The paper's original reference implementation, written in TensorFlow, is accessible here.
The codebase is implemented in Python 3.6, and the package versions used for development are listed below.

```
lshashpy3==0.0.8
texttable==1.6.2
torch==1.5.0+cu101
torch-cluster==1.5.7
torch-geometric==1.4.3
torch-scatter==2.0.5
torch-sparse==0.6.7
torch-spline-conv==1.2.0
torchvision==0.6.0+cu101
tqdm==4.48.2
```

Other packages, such as numpy and matplotlib, are installed as dependencies.
Note: The steps below are for systems with a CUDA-compatible GPU. While I didn't try it, replacing `+cu101` with `+cpu` in the commands below may also work.
To get up and running:
- Ensure that CUDA 10.1 and Python are properly installed on your system.
- Check that you have the versions mentioned above.
- Create and activate a virtualenv with the correct Python version to continue with the installation. (Help)
- Install torch with:
  ```
  pip install torch==1.5.0+cu101 torchvision==0.6.0+cu101 -f https://download.pytorch.org/whl/torch_stable.html
  ```
  This command is based on this site.
- Install the torch-geometric packages, as written here.
- Note: For the torch-geometric package, explicitly install version 1.4.3.
- The commands:
  ```
  pip install torch-scatter==2.0.5+cu101 -f https://pytorch-geometric.com/whl/torch-1.5.0.html
  pip install torch-sparse==0.6.7+cu101 -f https://pytorch-geometric.com/whl/torch-1.5.0.html
  pip install torch-cluster==1.5.7+cu101 -f https://pytorch-geometric.com/whl/torch-1.5.0.html
  pip install torch-spline-conv==1.2.0+cu101 -f https://pytorch-geometric.com/whl/torch-1.5.0.html
  pip install torch-geometric==1.4.3
  ```
- Install additional packages by utilizing the extra_packages.txt file with `pip install -r extra_packages.txt`.
- To get the information of the result files in the `example_results` folder, the wanted `temp_runfiles (Dataset)` folder needs to be moved to the same level as the `src` folder; then run the `mainForResults.py` file.
- Caution: Running the `main.py` script with Dataset A always cleans the `temp_runfiles (A)` folder.
- The pipeline is currently not compatible with PyTorch Geometric versions beyond 1.4.3.
- Caution: In order to run the pipeline (with PyTorch Geometric v1.4.3), a manual edit must be made in the PyTorch Geometric code.
- The code for generating and using synthetic data and for the 'measure time' functionality has been removed because it was untested, but it might prove useful in the future.
- The code from some early (incomplete) efforts to incorporate ResGatedGCN is left in comments.
The datasets are loaded with the help of GEDDataset, using the datasets with precomputed GED values specified in the original repository. Currently the AIDS700nef, LINUX and IMDBMulti datasets are supported.
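Following the SimGNN paper, a raw GED value is typically normalized by the mean number of nodes of the two graphs and mapped through exp(-x) to a similarity score in (0, 1], which is what the model regresses against. A minimal standalone sketch of that conversion (illustrative only, not the repository's code):

```python
import math

def ged_to_similarity(ged, num_nodes_1, num_nodes_2):
    """Map a raw graph edit distance to a similarity score in (0, 1].

    Normalizes the GED by the average graph size, then applies exp(-x),
    as described in the SimGNN paper.
    """
    normalized_ged = ged / ((num_nodes_1 + num_nodes_2) / 2)
    return math.exp(-normalized_ged)

# Identical graphs (GED 0) map to similarity 1.0.
print(ged_to_similarity(0, 10, 10))  # -> 1.0
# Larger edit distances map to smaller similarities.
print(ged_to_similarity(5, 10, 10))
```

Because the target is bounded in (0, 1], a sigmoid output layer and a mean-squared-error loss fit naturally on top of it.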
Training a SimGNN model is handled by the `src/main.py` script, which provides the following command line arguments.
```
--dataset              STR    Name of the dataset to be used.       Default is AIDS700nef.
--plot-loss            BOOL   Plot MSE values during training.      Default is False.
--diffpool             BOOL   Use differentiable pooling.           Default is False.
--gnn-operator         STR    Type of GNN operator.                 Default is gin.
--use-lsh              BOOL   Whether to use the LSH model.         Default is False.
--filters-1            INT    Number of filters in 1st GNN layer.   Default is 64.
--filters-2            INT    Number of filters in 2nd GNN layer.   Default is 32.
--filters-3            INT    Number of filters in 3rd GNN layer.   Default is 32.
--tensor-neurons       INT    Neurons in tensor network layer.      Default is 16.
--bottle-neck-neurons  INT    Bottleneck layer neurons.             Default is 16.
--bins                 INT    Number of histogram bins.             Default is 16.
--batch-size           INT    Number of pairs processed per batch.  Default is 128.
--epochs               INT    Number of SimGNN training epochs.     Default is 350.
--dropout              FLOAT  Dropout rate.                         Default is 0.
--learning-rate        FLOAT  Learning rate.                        Default is 0.001.
--weight-decay         FLOAT  Weight decay.                         Default is 5*10^-4.
--histogram            BOOL   Include histogram features.           Default is False.
```
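For reference, flags of this shape can be wired up with argparse roughly as follows; this is a hypothetical, condensed sketch covering only a few of the flags above, not the repository's actual parameter parser:

```python
import argparse

def make_parser():
    # Hypothetical, condensed SimGNN-style argument parser (illustration only).
    parser = argparse.ArgumentParser(description="Run SimGNN.")
    parser.add_argument("--dataset", type=str, default="AIDS700nef",
                        help="Name of the dataset to be used.")
    parser.add_argument("--diffpool", action="store_true",
                        help="Use differentiable pooling.")
    parser.add_argument("--use-lsh", dest="use_lsh", action="store_true",
                        help="Whether to use the LSH model.")
    parser.add_argument("--histogram", action="store_true",
                        help="Include histogram features.")
    parser.add_argument("--bins", type=int, default=16,
                        help="Number of histogram bins.")
    parser.add_argument("--epochs", type=int, default=350,
                        help="Number of SimGNN training epochs.")
    return parser

# Parsing a sample command line; boolean flags default to False when omitted.
args = make_parser().parse_args(["--dataset", "LINUX", "--diffpool", "--use-lsh"])
print(args.dataset, args.diffpool, args.use_lsh, args.histogram)
```

Note that the boolean options are plain on/off flags (`store_true`), so they take no value on the command line.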
To run the model with LINUX, using DiffPool, LSH and histogram features, use:

```
python main.py --dataset LINUX --diffpool --use-lsh --histogram
```
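For intuition on the LSH stage: the pipeline itself relies on the lshashpy3 package, but the underlying idea of random-hyperplane hashing can be sketched in plain Python. Nearby embedding vectors tend to fall on the same side of random hyperplanes, so they land in the same bucket and only bucket-mates need an exact similarity computation. This is a self-contained illustration, not the repository's LSH code:

```python
import random
from collections import defaultdict

def make_hyperplanes(hash_size, dim, seed=42):
    # One random Gaussian hyperplane per hash bit.
    rng = random.Random(seed)
    return [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(hash_size)]

def signature(planes, vec):
    # Each bit records which side of a hyperplane the vector falls on;
    # similar embeddings tend to share many (often all) bits.
    return tuple(int(sum(p * v for p, v in zip(plane, vec)) >= 0)
                 for plane in planes)

planes = make_hyperplanes(hash_size=8, dim=4)
buckets = defaultdict(list)

# Toy graph embeddings (hypothetical values, standing in for SimGNN outputs).
embeddings = {
    "g1": [0.9, 0.1, 0.0, 0.2],
    "g2": [0.8, 0.2, 0.1, 0.2],    # close to g1
    "g3": [-0.9, 0.1, 0.9, -0.5],  # far from g1
}
for name, vec in embeddings.items():
    buckets[signature(planes, vec)].append(name)

# Only graphs hashed to the same bucket need an exact similarity check.
print(dict(buckets))
```

The signature is invariant to positive scaling of a vector, since scaling does not change which side of a hyperplane it lies on; increasing `hash_size` makes buckets finer at the cost of more candidate misses.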