Supervised Learning on Relational Databases with Graph Neural Networks

This is code to reproduce the results in the paper Supervised Learning on Relational Databases with Graph Neural Networks.

Install dependencies

The file docker/whole_project/environment.yml lists all dependencies you need to install to run this code.

You can follow conda's documentation to automatically create a conda environment from this file.
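For example, assuming you have conda installed, something like the following should work. The environment name RDB is an assumption, inferred from the conda paths in the issue tracebacks below; use whatever name environment.yml actually declares.

    # create the environment from the repo's spec, then activate it
    conda env create -f docker/whole_project/environment.yml
    conda activate RDB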

You can also build a docker image containing all dependencies. You'll need docker (or nvidia-docker, if you want to use a GPU) installed to do this. The file docker/whole_project/Dockerfile builds a container that can run all experiments.
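For example, from the root of this repo (the image tag rdb-experiments is an arbitrary name chosen for this sketch, not something the repo mandates):

    # build the image, then open a shell inside it
    docker build -t rdb-experiments -f docker/whole_project/Dockerfile .
    docker run -it rdb-experiments /bin/bash   # use nvidia-docker / --gpus all for GPU access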

Get datasets

I would love to have a link here where you could just download the prepared datasets. But unfortunately that would violate the Kaggle terms of service.

So you either need to follow the instructions below and build them yourself, or reach out to me by email and I may be able to provide them to you.

Preparing the datasets yourself

  1. Set the data_root variable in the top-level /__init__.py of this repo to the location where you'd like to install the datasets. The default is <HOME>/RDB_data.

  2. Download raw dataset files from Kaggle. You need a Kaggle account to do this. You only need to download the datasets you're interested in.

    a) Put the Acquire Valued Shoppers Challenge data in data_root/raw_data/acquirevaluedshopperschallenge. Extract any compressed files.

    b) Put the Home Credit Default Risk data in data_root/raw_data/homecreditdefaultrisk. Extract any compressed files.

    c) Put the KDD Cup 2014 data in data_root/raw_data/kddcup2014. Extract any compressed files.

  3. Build the docker image specified in docker/neo4j/Dockerfile. This creates an image with the neo4j graph database installed, which is used to build the datasets.

  4. Start the database server(s) for the datasets you want to build:

    docker run -d -e "NEO4J_dbms_active__database=<db_name>.graph.db" \
        --publish=<port_for_browser>:7474 --publish=<port_for_db>:7687 \
        --mount type=bind,source=<path_to_code>/data/datasets/<db_name>,target=/data \
        rdb-neo4j

    where <path_to_code> is the location of this repo on your system, <port_for_browser> is the host port for the built-in neo4j data viewer (7474 is fine if you don't care), and (<db_name>, <port_for_db>) is (acquirevaluedshopperschallenge, 9687), (homecreditdefaultrisk, 10687), or (kddcup2014, 7687), respectively. Note that docker's --publish takes host_port:container_port, and neo4j listens on ports 7474 and 7687 inside the container. (A complete worked example for one dataset appears after this list.)

  5. Run python -m data.<db_name>.build_database_from_kaggle_files from the root directory of this repo.

  6. (optional) To view the dataset in the built-in neo4j data viewer, navigate to <your_machine's_ip_address>:<port_for_browser> in a web browser, run :server disconnect to log out of whatever your web browser thinks is the default neo4j server, and log into the right one by specifying <port_for_db> as the bolt port in the web interface.

  7. Run python -m data.<db_name>.build_dataset_from_database from the root directory of this repo.

  8. (optional) Run python -m data.<db_name>.build_db_info from the root directory of this repo.

  9. (optional) To create the tabular and DFS datasets used in the experiments, run python -m data.<db_name>.build_DFS_features from the root directory of this repo, then run python -m data.<db_name>.build_tabular_datasets, also from the root directory.
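As a concrete worked example, here is the whole pipeline for the homecreditdefaultrisk dataset, run from the root of this repo. This is a sketch under the assumptions above: the image is tagged rdb-neo4j as step 4 expects, <path_to_code> is a placeholder for this repo's location, and the host ports follow the (7474, 10687) convention from step 4.

    # step 3: build the neo4j image
    docker build -t rdb-neo4j docker/neo4j/

    # step 4: start the database server
    docker run -d -e "NEO4J_dbms_active__database=homecreditdefaultrisk.graph.db" \
        --publish=7474:7474 --publish=10687:7687 \
        --mount type=bind,source=<path_to_code>/data/datasets/homecreditdefaultrisk,target=/data \
        rdb-neo4j

    # steps 5, 7, 8: load the raw files into neo4j, then build the dataset and its metadata
    python -m data.homecreditdefaultrisk.build_database_from_kaggle_files
    python -m data.homecreditdefaultrisk.build_dataset_from_database
    python -m data.homecreditdefaultrisk.build_db_info

    # step 9 (optional): DFS features and tabular datasets
    python -m data.homecreditdefaultrisk.build_DFS_features
    python -m data.homecreditdefaultrisk.build_tabular_datasets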

Add your own datasets

If you have your own relational dataset you'd like to use this system with, you can copy and modify the code in one of the data/acquirevaluedshopperschallenge, data/homecreditdefaultrisk, or data/kddcup2014 directories to suit your purposes.

The main thing you have to do is create the .cypher script to get your data into a neo4j database. Once you've done that, nearly all the dataset building code is reusable. You'll also have to add your dataset's name in a few places in the codebase, e.g. in the __init__ method of the DatabaseDataset class.
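For instance, once your loader script exists, you can run it against a running rdb-neo4j container the same way the provided datasets are loaded. A minimal sketch, assuming your container's name or ID is <container> and your script is data/datasets/mydataset/mydataset_neo4j_loader.cypher (both names are illustrative):

    # pipe the cypher script into the container's cypher-shell
    # (add -u/-p flags if your neo4j instance requires authentication)
    docker exec -i <container> cypher-shell < data/datasets/mydataset/mydataset_neo4j_loader.cypher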

Run experiments

All experiments are started with the scripts in the experiments directory.

For example, to recreate the PoolMLP row in Tables 3 and 4 of the paper, you would run python -m experiments.GNN.PoolMLP from the root directory of this repo to start training, then run python -m experiments.evaluate_experiments once training is finished, and finally run python -m experiments.GNN.print_and_plot_results.
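That is, from the root of this repo:

    python -m experiments.GNN.PoolMLP                   # start training
    python -m experiments.evaluate_experiments          # after training finishes
    python -m experiments.GNN.print_and_plot_results    # print and plot the results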

By default, experiments run in tmux windows on your local machine. But you can also change the argument in the run_script_with_kwargs command at the bottom of each experiment script to run them in a local docker container. Or you can export the docker image built with docker/whole_project/Dockerfile to AWS ECR and modify the arguments in experiments/utils/run_script_with_kwargs to run all experiments on AWS Batch.
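As a sketch of the AWS route (the region, account ID, and repository name below are placeholders, not values this repo prescribes; it assumes the image was built and tagged rdb-experiments as above):

    # authenticate docker to ECR, then tag and push the project image
    aws ecr get-login-password --region <region> | \
        docker login --username AWS --password-stdin <account_id>.dkr.ecr.<region>.amazonaws.com
    docker tag rdb-experiments <account_id>.dkr.ecr.<region>.amazonaws.com/rdb-experiments:latest
    docker push <account_id>.dkr.ecr.<region>.amazonaws.com/rdb-experiments:latest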

License

The content of the notes linked above is licensed under the Creative Commons Attribution 3.0 license, and the code in this repo is licensed under the MIT license.

Issues

ValueError: Found array with 0 sample(s) in Transaction "productsize"

Hi, thanks for your great work. I'm trying to reproduce your results. I can successfully run your code up through step 7 (python -m data.<db_name>.build_dataset_from_database, run from the root directory of this repo). I did find some configuration problems along the way, but I was able to solve them.

However, at step 8 (python -m data.<db_name>.build_db_info, run from the root directory of this repo), I encountered this problem:

    running: MATCH (n:Transaction) RETURN n.productsize ;
    Query Result: []
    ValueError: Found array with 0 sample(s) (shape=(0, 1)) while a minimum of 1 is required by RobustScaler.

Could you please help me figure out this problem?

A bug encountered when reading projects.csv

Hi, mwcvitkovic! When I use your provided Dockerfile and process the kddcup2014 dataset, I encounter a bug:

    docker exec -i 625f51dcfb1a6772c34b20e3f77e297033e1d613dcac84c74be1cabe73550139 cypher-shell < /data/moyichuan/Relational_data/kddcup2014/kddcup2014_neo4j_loader.cypher

fails with:

    Couldn't load the external resource at: file:/data/projects.csv

Do you have any idea about the solution to this error?

ModuleNotFoundError: No module named 'neo4j.time'

Hello, mwcvitkovic! Thank you for your previous kind replies! However, when I run the code with the command python -m experiments.GNN.PoolMLP, I encounter the following bug:

    0%|                                          | 0/182 [00:01<?, ?it/s]
    0%|                                          | 0/300 [00:01<?, ?it/s]
    Traceback (most recent call last):
      File "/data/moyichuan/miniconda3/envs/RDB/lib/python3.6/runpy.py", line 193, in _run_module_as_main
        "__main__", mod_spec)
      File "/data/moyichuan/miniconda3/envs/RDB/lib/python3.6/runpy.py", line 85, in _run_code
        exec(code, run_globals)
      File "/data/moyichuan/supervised_learning_on_graph/start_training.py", line 506, in <module>
        main(kwargs)
      File "/data/moyichuan/supervised_learning_on_graph/start_training.py", line 441, in main
        train_model(writer, **kwargs)
      File "/data/moyichuan/supervised_learning_on_graph/start_training.py", line 323, in train_model
        raise e
      File "/data/moyichuan/supervised_learning_on_graph/start_training.py", line 276, in train_model
        val_auroc, val_acc, val_loss = validate_model(writer, val_loader, model, epoch)
      File "/data/moyichuan/supervised_learning_on_graph/start_training.py", line 79, in validate_model
        for batch_idx, (input, label) in enumerate(tqdm(val_loader)):
      File "/data/moyichuan/miniconda3/envs/RDB/lib/python3.6/site-packages/tqdm/std.py", line 1195, in __iter__
        for obj in iterable:
      File "/data/moyichuan/miniconda3/envs/RDB/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 521, in __next__
        data = self._next_data()
      File "/data/moyichuan/miniconda3/envs/RDB/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 1203, in _next_data
        return self._process_data(data)
      File "/data/moyichuan/miniconda3/envs/RDB/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 1229, in _process_data
        data.reraise()
      File "/data/moyichuan/miniconda3/envs/RDB/lib/python3.6/site-packages/torch/_utils.py", line 425, in reraise
        raise self.exc_type(msg)
    ModuleNotFoundError: Caught ModuleNotFoundError in DataLoader worker process 0.
    Original Traceback (most recent call last):
      File "/data/moyichuan/miniconda3/envs/RDB/lib/python3.6/site-packages/torch/utils/data/_utils/worker.py", line 287, in _worker_loop
        data = fetcher.fetch(index)
      File "/data/moyichuan/miniconda3/envs/RDB/lib/python3.6/site-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
        data = [self.dataset[idx] for idx in possibly_batched_index]
      File "/data/moyichuan/miniconda3/envs/RDB/lib/python3.6/site-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
        data = [self.dataset[idx] for idx in possibly_batched_index]
      File "/data/moyichuan/supervised_learning_on_graph/data/DatabaseDataset.py", line 88, in __getitem__
        dp = pickle.load(f)
    ModuleNotFoundError: No module named 'neo4j.time'

Do you have any ideas about this bug? I look forward to your further replies.

docker.errors.APIError

After I ran step 5, I got this error:

    docker.errors.APIError: 500 Server Error for http+docker://localhost/v1.43/containers/01b37b09b76ba9e886bafbe9bb0cd67cfc6f4373632d244e2e9faf134a2a9ece/start: Internal Server Error ("driver failed programming external connectivity on endpoint clever_ritchie (8eef8c20c1089281600f5ef736204eb896bb08d1665f757f6734af2e3dc61bc3): Bind for 0.0.0.0:7687 failed: port is already allocated")

Can you guide me on how to fix this error, please?
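This error means another process, often a previously started neo4j container, is already bound to host port 7687 (the <port_for_db> used for kddcup2014 in step 4). A sketch of how to diagnose it, assuming the conflicting process is itself a docker container:

    # list running containers and their port bindings, then stop the one holding 7687
    docker ps
    docker stop <container_id>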
