Supervised Learning on Relational Databases with Graph Neural Networks

This is code to reproduce the results in the paper Supervised Learning on Relational Databases with Graph Neural Networks.

Install dependencies

The file docker/whole_project/environment.yml lists all dependencies you need to install to run this code.

You can follow conda's documentation to automatically create a conda environment from this file.
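For example, assuming you have conda installed, something like the following should work. The environment name RDB is an assumption, inferred from the conda paths in the issue tracebacks below; use whatever name environment.yml actually declares.

    # create the environment from the repo's spec, then activate it
    conda env create -f docker/whole_project/environment.yml
    conda activate RDB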

You can also build a docker image containing all dependencies. You'll need docker (or nvidia-docker, if you want to use a GPU) installed to do this. The file docker/whole_project/Dockerfile builds a container that can run all experiments.
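For example, from the root of this repo (the image tag rdb-experiments is an arbitrary name chosen for this sketch, not something the repo mandates):

    # build the image, then open a shell inside it
    docker build -t rdb-experiments -f docker/whole_project/Dockerfile .
    docker run -it rdb-experiments /bin/bash   # use nvidia-docker / --gpus all for GPU access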

Get datasets

I would love to have a link here where you could just download the prepared datasets. But unfortunately that would violate the Kaggle terms of service.

So you either need to follow the instructions below and build them yourself, or reach out to me by email and I may be able to provide them to you.

Preparing the datasets yourself

  1. Set the data_root variable in the top-level /__init__.py of this repo to the location where you'd like to install the datasets. The default is <HOME>/RDB_data.

  2. Download raw dataset files from Kaggle. You need a Kaggle account to do this. You only need to download the datasets you're interested in.

    a) Put the Acquire Valued Shoppers Challenge data in data_root/raw_data/acquirevaluedshopperschallenge. Extract any compressed files.

    b) Put the Home Credit Default Risk data in data_root/raw_data/homecreditdefaultrisk. Extract any compressed files.

    c) Put the KDD Cup 2014 data in data_root/raw_data/kddcup2014. Extract any compressed files.

  3. Build the docker image specified in docker/neo4j/Dockerfile. This creates an image with the neo4j graph database installed, which is used to build the datasets.

  4. Start the database server(s) for the datasets you want to build:

    docker run -d -e "NEO4J_dbms_active__database=<db_name>.graph.db" \
        --publish=<port_for_browser>:7474 --publish=<port_for_db>:7687 \
        --mount type=bind,source=<path_to_code>/data/datasets/<db_name>,target=/data \
        rdb-neo4j

    where <path_to_code> is the location of this repo on your system, <port_for_browser> is the host port for the built-in neo4j data viewer (7474 is fine if you don't care), and (<db_name>, <port_for_db>) is (acquirevaluedshopperschallenge, 9687), (homecreditdefaultrisk, 10687), or (kddcup2014, 7687), respectively. Note that docker's --publish takes host_port:container_port, and neo4j listens on ports 7474 and 7687 inside the container. (A complete worked example for one dataset appears after this list.)

  5. Run python -m data.<db_name>.build_database_from_kaggle_files from the root directory of this repo.

  6. (optional) To view the dataset in the built-in neo4j data viewer, navigate to <your_machine's_ip_address>:<port_for_browser> in a web browser, run :server disconnect to log out of whatever your web browser thinks is the default neo4j server, and log into the right one by specifying <port_for_db> as the bolt port in the web interface.

  7. Run python -m data.<db_name>.build_dataset_from_database from the root directory of this repo.

  8. (optional) Run python -m data.<db_name>.build_db_info from the root directory of this repo.

  9. (optional) To create the tabular and DFS datasets used in the experiments, run python -m data.<db_name>.build_DFS_features from the root directory of this repo, then run python -m data.<db_name>.build_tabular_datasets, also from the root directory.
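As a concrete worked example, here is the whole pipeline for the homecreditdefaultrisk dataset, run from the root of this repo. This is a sketch under the assumptions above: the image is tagged rdb-neo4j as step 4 expects, <path_to_code> is a placeholder for this repo's location, and the host ports follow the (7474, 10687) convention from step 4.

    # step 3: build the neo4j image
    docker build -t rdb-neo4j docker/neo4j/

    # step 4: start the database server
    docker run -d -e "NEO4J_dbms_active__database=homecreditdefaultrisk.graph.db" \
        --publish=7474:7474 --publish=10687:7687 \
        --mount type=bind,source=<path_to_code>/data/datasets/homecreditdefaultrisk,target=/data \
        rdb-neo4j

    # steps 5, 7, 8: load the raw files into neo4j, then build the dataset and its metadata
    python -m data.homecreditdefaultrisk.build_database_from_kaggle_files
    python -m data.homecreditdefaultrisk.build_dataset_from_database
    python -m data.homecreditdefaultrisk.build_db_info

    # step 9 (optional): DFS features and tabular datasets
    python -m data.homecreditdefaultrisk.build_DFS_features
    python -m data.homecreditdefaultrisk.build_tabular_datasets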

Add your own datasets

If you have your own relational dataset you'd like to use this system with, you can copy and modify the code in one of the data/acquirevaluedshopperschallenge, data/homecreditdefaultrisk, or data/kddcup2014 directories to suit your purposes.

The main thing you have to do is create the .cypher script to get your data into a neo4j database. Once you've done that, nearly all the dataset building code is reusable. You'll also have to add your dataset's name in a few places in the codebase, e.g. in the __init__ method of the DatabaseDataset class.
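For instance, once your loader script exists, you can run it against a running rdb-neo4j container the same way the provided datasets are loaded. A minimal sketch, assuming your container's name or ID is <container> and your script is data/datasets/mydataset/mydataset_neo4j_loader.cypher (both names are illustrative):

    # pipe the cypher script into the container's cypher-shell
    # (add -u/-p flags if your neo4j instance requires authentication)
    docker exec -i <container> cypher-shell < data/datasets/mydataset/mydataset_neo4j_loader.cypher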

Run experiments

All experiments are started with the scripts in the experiments directory.

For example, to recreate the PoolMLP row in Tables 3 and 4 of the paper, you would run python -m experiments.GNN.PoolMLP from the root directory of this repo to start training, then run python -m experiments.evaluate_experiments once training is finished, and finally run python -m experiments.GNN.print_and_plot_results.
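That is, from the root of this repo:

    python -m experiments.GNN.PoolMLP                   # start training
    python -m experiments.evaluate_experiments          # after training finishes
    python -m experiments.GNN.print_and_plot_results    # print and plot the results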

By default, experiments run in tmux windows on your local machine. But you can also change the argument in the run_script_with_kwargs command at the bottom of each experiment script to run them in a local docker container. Or you can export the docker image built with docker/whole_project/Dockerfile to AWS ECR and modify the arguments in experiments/utils/run_script_with_kwargs to run all experiments on AWS Batch.
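As a sketch of the AWS route (the region, account ID, and repository name below are placeholders, not values this repo prescribes; it assumes the image was built and tagged rdb-experiments as above):

    # authenticate docker to ECR, then tag and push the project image
    aws ecr get-login-password --region <region> | \
        docker login --username AWS --password-stdin <account_id>.dkr.ecr.<region>.amazonaws.com
    docker tag rdb-experiments <account_id>.dkr.ecr.<region>.amazonaws.com/rdb-experiments:latest
    docker push <account_id>.dkr.ecr.<region>.amazonaws.com/rdb-experiments:latest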

License

The content of the notes linked above is licensed under the Creative Commons Attribution 3.0 license, and the code in this repo is licensed under the MIT license.

Issues

ValueError: Found array with 0 sample(s) in Transaction "productsize"

Hi, thanks for your great work. I'm trying to reproduce your results. I can successfully run your code up through step 7 (python -m data.<db_name>.build_dataset_from_database, run from the root directory of this repo). I did find some configuration problems along the way, but I was able to solve them.

However, at step 8 (python -m data.<db_name>.build_db_info, run from the root directory of this repo), I encountered this problem:

    running: MATCH (n:Transaction) RETURN n.productsize ;
    Query Result: []
    ValueError: Found array with 0 sample(s) (shape=(0, 1)) while a minimum of 1 is required by RobustScaler.

Could you please help me figure out this problem?

A bug encountered when reading projects.csv

Hi, mwcvitkovic! When I use your provided Dockerfile and process the kddcup2014 dataset, I encounter a bug:

    docker exec -i 625f51dcfb1a6772c34b20e3f77e297033e1d613dcac84c74be1cabe73550139 cypher-shell < /data/moyichuan/Relational_data/kddcup2014/kddcup2014_neo4j_loader.cypher

fails with:

    Couldn't load the external resource at: file:/data/projects.csv

Do you have any idea about the solution to this error?

ModuleNotFoundError: No module named 'neo4j.time'

Hello, mwcvitkovic! Thank you for your previous kind replies! However, when I run the code with the command python -m experiments.GNN.PoolMLP, I encounter the following bug:

    0%|                                          | 0/182 [00:01<?, ?it/s]
    0%|                                          | 0/300 [00:01<?, ?it/s]
    Traceback (most recent call last):
      File "/data/moyichuan/miniconda3/envs/RDB/lib/python3.6/runpy.py", line 193, in _run_module_as_main
        "__main__", mod_spec)
      File "/data/moyichuan/miniconda3/envs/RDB/lib/python3.6/runpy.py", line 85, in _run_code
        exec(code, run_globals)
      File "/data/moyichuan/supervised_learning_on_graph/start_training.py", line 506, in <module>
        main(kwargs)
      File "/data/moyichuan/supervised_learning_on_graph/start_training.py", line 441, in main
        train_model(writer, **kwargs)
      File "/data/moyichuan/supervised_learning_on_graph/start_training.py", line 323, in train_model
        raise e
      File "/data/moyichuan/supervised_learning_on_graph/start_training.py", line 276, in train_model
        val_auroc, val_acc, val_loss = validate_model(writer, val_loader, model, epoch)
      File "/data/moyichuan/supervised_learning_on_graph/start_training.py", line 79, in validate_model
        for batch_idx, (input, label) in enumerate(tqdm(val_loader)):
      File "/data/moyichuan/miniconda3/envs/RDB/lib/python3.6/site-packages/tqdm/std.py", line 1195, in __iter__
        for obj in iterable:
      File "/data/moyichuan/miniconda3/envs/RDB/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 521, in __next__
        data = self._next_data()
      File "/data/moyichuan/miniconda3/envs/RDB/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 1203, in _next_data
        return self._process_data(data)
      File "/data/moyichuan/miniconda3/envs/RDB/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 1229, in _process_data
        data.reraise()
      File "/data/moyichuan/miniconda3/envs/RDB/lib/python3.6/site-packages/torch/_utils.py", line 425, in reraise
        raise self.exc_type(msg)
    ModuleNotFoundError: Caught ModuleNotFoundError in DataLoader worker process 0.
    Original Traceback (most recent call last):
      File "/data/moyichuan/miniconda3/envs/RDB/lib/python3.6/site-packages/torch/utils/data/_utils/worker.py", line 287, in _worker_loop
        data = fetcher.fetch(index)
      File "/data/moyichuan/miniconda3/envs/RDB/lib/python3.6/site-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
        data = [self.dataset[idx] for idx in possibly_batched_index]
      File "/data/moyichuan/miniconda3/envs/RDB/lib/python3.6/site-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
        data = [self.dataset[idx] for idx in possibly_batched_index]
      File "/data/moyichuan/supervised_learning_on_graph/data/DatabaseDataset.py", line 88, in __getitem__
        dp = pickle.load(f)
    ModuleNotFoundError: No module named 'neo4j.time'

Do you have any ideas about this bug? I look forward to your further replies.

docker.errors.APIError

After I ran step 5, I got this error:

    docker.errors.APIError: 500 Server Error for http+docker://localhost/v1.43/containers/01b37b09b76ba9e886bafbe9bb0cd67cfc6f4373632d244e2e9faf134a2a9ece/start: Internal Server Error ("driver failed programming external connectivity on endpoint clever_ritchie (8eef8c20c1089281600f5ef736204eb896bb08d1665f757f6734af2e3dc61bc3): Bind for 0.0.0.0:7687 failed: port is already allocated")

Can you guide me on how to fix this error, please?
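This error means another process, often a previously started neo4j container, is already bound to host port 7687 (the <port_for_db> used for kddcup2014 in step 4). A sketch of how to diagnose it, assuming the conflicting process is itself a docker container:

    # list running containers and their port bindings, then stop the one holding 7687
    docker ps
    docker stop <container_id>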
