ramprasad-group / polygnn

polyGNN is a Python library to automate ML model training for polymer informatics.

License: Other

Python 100.00%

data-driven-design data-science machine-learning materials-informatics materials-science polymer polymer-informatics rdkit artificial-intelligence graph-neural-network

polygnn

This repository contains the training code and model weights presented in a companion paper, polyGNN: Multitask graph neural networks for polymer informatics.

Installation

This repository is currently set up to run on 1) Mac OSX and 2) Linux/Windows machines with CUDA 10.2. Please raise a GitHub issue if you want to use this repo with a different configuration. Otherwise, follow these steps for installation:

1. Install poetry on your machine.
2. If Python 3.7 is installed on your machine, skip to step 3. If not, you will need to install it. There are many ways to do this; one option is detailed below:
   - Install Homebrew on your machine.
   - Run brew install python@3.7. Take note of the path to the python executable.
3. Clone this repo on your machine.
4. Open a terminal at the root directory of this repository.
5. Run poetry env use /path/to/python3.7/executable. If you installed Python 3.7 with Homebrew, the path may be something like /usr/local/Cellar/python\@3.7/3.7.13_1/bin/python3.7.
6. Run poetry install.
7. If your machine is a Mac, run poetry run poe torch-osx. If not, run poetry run poe torch-linux_win-cuda102.
8. If your machine is a Mac, run poetry run poe pyg-osx. If not, run poetry run poe pyg-linux_win-cuda102.
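
The platform-dependent steps at the end of the list can be summarized in a small helper. This is only a sketch: the function name poe_commands is hypothetical and not part of this repository, and it assumes that any non-Mac machine is a Linux/Windows machine with CUDA 10.2. The poe task names are the ones listed above.

```python
import sys


def poe_commands(platform=sys.platform):
    """Return the two platform-specific install commands as strings."""
    # sys.platform is "darwin" on macOS. Everything else is treated as a
    # Linux/Windows machine with CUDA 10.2 (an assumption of this sketch).
    suffix = "osx" if platform == "darwin" else "linux_win-cuda102"
    return [
        f"poetry run poe torch-{suffix}",
        f"poetry run poe pyg-{suffix}",
    ]


if __name__ == "__main__":
    # Print the commands to run, in order, for the current machine.
    for command in poe_commands():
        print(command)
```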

Dependencies

As can be seen in pyproject.toml, polygnn depends on several other packages, including polygnn_trainer, polygnn_kit, and nndebugger. The functional relationships between these libraries are described briefly below and in example.py.

polygnn contains the polyGNN architecture developed in the companion paper. The architecture relies on polygnn_kit, which is a library for performing operations on polymer SMILES strings. Meanwhile, polygnn_trainer is a library for training neural network architectures, and was used in the companion paper to train the polyGNN architectures. Part of the training process utilized nndebugger, a library for debugging neural networks.

Usage

`example.py`

The file example.py contains example code that illustrates how this package was used to train the models in the companion paper. The code uses training data located in the directory sample_data to train an ensemble model (composed of several submodels). The submodels, by default, are saved in a directory named example_models. The data in sample_data is a small subset of the DFT data used to train the models in the companion paper. A complete set of the DFT data can be found at Khazana.

To train polygnn models, run poetry run python example.py --polygnn. To train polygnn2 models, run poetry run python example.py --polygnn2. Running either command on a machine with at least 8GB of free GPU memory should not take longer than 3 minutes. To manually specify the device you want to use for training, set the device flag. For example, poetry run python example.py --polygnn --device cpu. Otherwise, the device is chosen automatically.
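
To make the flag behavior concrete, here is a minimal sketch of how such flags could be parsed. It is not the code in example.py: build_parser and resolve_device are hypothetical names, the flags are assumed to be simple switches, and the automatic choice is assumed to prefer a CUDA GPU when one is available.

```python
import argparse


def build_parser():
    """Hypothetical stand-in for the command-line flags described above."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--polygnn", action="store_true")
    parser.add_argument("--polygnn2", action="store_true")
    # No default: leaving the flag out means "choose automatically".
    parser.add_argument("--device", default=None)
    return parser


def resolve_device(requested, cuda_available):
    """Honor an explicit request, otherwise fall back to an automatic choice."""
    if requested is not None:
        return requested
    # Assumption of this sketch: prefer the GPU whenever CUDA is available.
    return "cuda" if cuda_available else "cpu"


if __name__ == "__main__":
    args = build_parser().parse_args(["--polygnn", "--device", "cpu"])
    print(resolve_device(args.device, cuda_available=True))
```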

Looking at sample_data/sample.csv, you will notice that this dataset contains multiple different properties (e.g., band gap, electron affinity, etc.). In example.py, we use this data to train a multitask model, capable of predicting each property. To train your own multitask model, you can replace sample_data/sample.csv with your own dataset containing multiple properties. Single task models are also supported.
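
Before swapping in your own dataset, it can help to count how many data points each property has. The sketch below assumes a long-format file with the columns smiles_string, prop, and value (one row per polymer/property pair); check sample_data/sample.csv for the actual layout. The rows and property names are made-up placeholders, not real data.

```python
import csv
import io
from collections import Counter

# Made-up placeholder rows in an ASSUMED long format. The column names
# are an assumption of this sketch, not confirmed by this README.
SAMPLE = """smiles_string,prop,value
[*]CC[*],prop_a,1.0
[*]CC[*],prop_b,2.0
[*]CC([*])C,prop_a,3.0
"""


def count_points_per_property(csv_text):
    """Count how many rows (data points) each property has."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter(row["prop"] for row in reader)


counts = count_points_per_property(SAMPLE)
for prop, n_points in counts.items():
    print(prop, n_points)
```

A dataset with a single distinct property would correspond to the single task case.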

`example2.py`

example.py is an example of how to train a multitask model with only SMILES strings as features. example2.py is an example of how to train a multitask model containing both SMILES and non-SMILES features. example.py and example2.py share the same flags. Read the comments in example2.py for more details.

`more_examples`

A directory containing more example files.

more_examples/example_predict_trained_models.py is an example of how to just do prediction using one of the models trained in the companion paper. To run the file, do cd more_examples && poetry run python example_predict_trained_models.py. The file shows how to get predictions for 36 different properties of polyethylene. Of course, you can change the polymer that you want to get predictions for, but the properties that can be predicted are fixed. If you want to make a prediction for a different property, then you'll need to train your own model (see example.py and example2.py for more details on training your own model).
more_examples/example_predict.py is an example of how to just do prediction using a previously trained model not included in the companion paper (https://pubs.acs.org/doi/full/10.1021/acs.chemmater.2c02991). This file requires that an unmodified example.py be run first. This file shares the same flags as example.py.

Questions

I (@rishigurnani) am more than happy to answer any questions about this codebase. If you encounter any troubles, please open a new Issue in the "Issues" tab and I will promptly respond. In addition, if you discover any bugs or have any suggestions to improve the codebase (documentation, features, etc.) please also open a new Issue. This is the power of open source!

Citation

If you use this repository in your work please consider citing us.

@article{Gurnani2023,
   annote = {doi: 10.1021/acs.chemmater.2c02991},
   author = {Gurnani, Rishi and Kuenneth, Christopher and Toland, Aubrey and Ramprasad, Rampi},
   doi = {10.1021/acs.chemmater.2c02991},
   issn = {0897-4756},
   journal = {Chemistry of Materials},
   month = {feb},
   number = {4},
   pages = {1560--1567},
   publisher = {American Chemical Society},
   title = {{Polymer Informatics at Scale with Multitask Graph Neural Networks}},
   url = {https://doi.org/10.1021/acs.chemmater.2c02991},
   volume = {35},
   year = {2023}
}

Companion paper

The results shown in the companion paper were generated using v0.2.0 of this package.

License

This repository is protected under a General Public Use License Agreement, the details of which can be found in GT Open Source General Use License.pdf.

polygnn's Issues

Create a `example_predict.py` file

The purpose of this file is to provide an example of how to just do prediction using already-trained models. A prediction example is already contained in example.py but it is far down in the file (something like line 340), so having a new shorter file will be helpful.

As part of this issue, let's also create another folder called more_examples and place example_predict.py in it. Additional example files can be added to this directory without cluttering the root directory.

Model seemingly not converging

Hi, I was wondering how I can get the model to converge. The high RMSE and low R2 seem to be almost unchanging. Any leads?

ValueError: Found input variables with inconsistent numbers of samples: [293, 874]; trying to tune HP on subset of data

Hi,
I'm trying to tune my model hyperparameters (HP) using a subset of my data and then train on the whole dataset. I am running into this 'inconsistent numbers of samples' error.
In the code, I changed the part that begins with '# split train and val data': I changed the split to use the subset (master_data) instead of the group data.
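
An error with this wording is typically raised by scikit-learn when arrays that are passed together have different lengths, for example when one input comes from the subset and another still comes from the full data. The sketch below reproduces that situation in plain Python. check_same_length is a hypothetical helper, and which input has 293 samples and which has 874 is an assumption taken only from the numbers in the title.

```python
def check_same_length(**arrays):
    """Raise a ValueError if the named inputs differ in length."""
    lengths = {name: len(values) for name, values in arrays.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"Inconsistent numbers of samples: {lengths}")


full = list(range(874))  # stands in for the full dataset
subset = full[:293]      # stands in for the tuning subset

# Deriving every input from the same subset keeps the lengths equal.
check_same_length(features=subset, labels=subset)

# Mixing the subset with the full data reproduces the mismatch.
try:
    check_same_length(features=subset, labels=full)
except ValueError as err:
    print(err)
```

If this is the cause, it may help to check that everything passed to the split (features, labels, and any stratification or grouping column) is taken from master_data.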

whole data access

How can I get the whole dataset used in this paper?

Running with CUDA 12.3

Hi, our systems were recently updated to CUDA 12.3 and I'm having issues with dependencies. How should I go about using polyGNN with CUDA 12.3?

Smiles Key Error

Do you have any ideas? I have been reading through the code a little to try and figure out why this may be happening.

Traceback (most recent call last):
File "GNN_CV_training.py", line 434, in
random_seed=RANDOM_SEED,
File "/home/oliver/.cache/pypoetry/virtualenvs/polygnn-5wmT02iB-py3.7/lib/python3.7/site-packages/polygnn_trainer/train.py", line 373, in train_kfold_ensemble
model, train_pts, val_pts, scaler_dict, train_config, break_bad_grads=False
File "/home/oliver/.cache/pypoetry/virtualenvs/polygnn-5wmT02iB-py3.7/lib/python3.7/site-packages/polygnn_trainer/train.py", line 182, in train_submodel
train_pts = tc.get_train_dataloader()
File "/home/oliver/.cache/pypoetry/virtualenvs/polygnn-5wmT02iB-py3.7/lib/python3.7/site-packages/polygnn_trainer/train.py", line 368, in
training_df
File "/home/oliver/.cache/pypoetry/virtualenvs/polygnn-5wmT02iB-py3.7/lib/python3.7/site-packages/polygnn_trainer/train.py", line 345, in cv_get_train_dataloader
return training_df.apply(get_data_augmented, axis=1).values.tolist()
File "/home/oliver/.cache/pypoetry/virtualenvs/polygnn-5wmT02iB-py3.7/lib/python3.7/site-packages/pandas/core/frame.py", line 7552, in apply
return op.get_result()
File "/home/oliver/.cache/pypoetry/virtualenvs/polygnn-5wmT02iB-py3.7/lib/python3.7/site-packages/pandas/core/apply.py", line 185, in get_result
return self.apply_standard()
File "/home/oliver/.cache/pypoetry/virtualenvs/polygnn-5wmT02iB-py3.7/lib/python3.7/site-packages/pandas/core/apply.py", line 276, in apply_standard
results, res_index = self.apply_series_generator()
File "/home/oliver/.cache/pypoetry/virtualenvs/polygnn-5wmT02iB-py3.7/lib/python3.7/site-packages/pandas/core/apply.py", line 305, in apply_series_generator
results[i] = self.f(v)
File "/home/oliver/.cache/pypoetry/virtualenvs/polygnn-5wmT02iB-py3.7/lib/python3.7/site-packages/polygnn_trainer/train.py", line 331, in get_data_augmented
data = augmented_featurizer(x.smiles_string)
File "GNN_CV_training.py", line 270, in
augmented_featurizer = lambda x: random.sample(eq_graph_tensors[x], k=1)[0]
KeyError: '[]C(C)(C(=O)OCCCS(=O)(=O)[N-]S(=O)(=O)C(F)(F)F)COCCOCCOCCOCCOCCOCCOCCOCCOCCOCC[].[Li+]'
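
The last frame of the traceback shows that the featurizer looks up each SMILES string in the dictionary eq_graph_tensors, so the KeyError means that this exact string is not one of its keys. One way to narrow this down is to list every training SMILES that is missing from the dictionary before training starts. The sketch below assumes that eq_graph_tensors is a plain dict keyed by SMILES strings; find_missing_smiles is a hypothetical helper and the data are toy placeholders.

```python
def find_missing_smiles(smiles_strings, graph_lookup):
    """Return the unique SMILES strings that are not keys of graph_lookup."""
    unique = dict.fromkeys(smiles_strings)  # de-duplicate, keep order
    return [smiles for smiles in unique if smiles not in graph_lookup]


# Toy stand-ins for the dictionary and the smiles_string column.
eq_graph_tensors = {"[*]CC[*]": ["graph_0", "graph_1"]}
training_smiles = ["[*]CC[*]", "[*]CC([*])C", "[*]CC([*])C"]

missing = find_missing_smiles(training_smiles, eq_graph_tensors)
print(missing)
```

If strings are missing, possible causes include building the dictionary from a different subset of the data or from a differently written form of the same SMILES.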

After successful installation, running polygnn shows error with torch

Running "poetry run python example.py --polygnn" shows the following error:

While installing, I got this message.

Please look into this issue. Thanks!

ModuleNotFoundError: No module named 'distutils.cmd'

Upon setup and running the following command:

`poetry run poe torch-linux_win-cuda102`

on tyrion2, the following error message is received:

ModuleNotFoundError: No module named 'distutils.cmd'

And distutils cannot be installed on Python 3.7 without sudo permission.

Direct access to fingerprints

Modify the polyGNN class in https://github.com/Ramprasad-Group/polygnn/blob/main/polygnn/models.py so that the learned polymer fingerprints can be directly accessed.

matrix multiplication errors while training

Running into an issue of matrix multiplication while training the model, any ideas on how to solve this?

About Dataset

Hi, I can only find part of the dataset from your paper in Khazana. Can you provide the full dataset used in your article? Thanks!

Update wheel version

Users have reported that they need to change the wheel version from 0.37.1 to 0.40.0 in pyproject.toml manually.

Debugging Matrix Multiplication Error

I am getting a matrix multiplication error that is a few layers down in the code. Currently it is trying to multiply a 50x128 matrix by a 129x64 matrix. I am assuming the correct multiplication dimension length is 128, but I am not sure what to do to make this happen.

I was wondering what the best way was to figure out what is causing this error and how to fix it.

Error pasted here:
Traceback (most recent call last):
File "GNN_training.py", line 281, in
optimal_capacity = session.choose_model_size_by_overfit()
File "/home/oliver/.cache/pypoetry/virtualenvs/polygnn-5wmT02iB-py3.7/lib/python3.7/site-packages/nndebugger/dl_debug.py", line 387, in choose_model_size_by_overfit
start=start,
File "/home/oliver/.cache/pypoetry/virtualenvs/polygnn-5wmT02iB-py3.7/lib/python3.7/site-packages/nndebugger/torch_utils.py", line 72, in default_per_epoch_trainer
output = model(data)
File "/home/oliver/.cache/pypoetry/virtualenvs/polygnn-5wmT02iB-py3.7/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/data/oliver/polygnn/polygnn/models.py", line 63, in forward
x = self.final_mlp(x) # hidden layers
File "/home/oliver/.cache/pypoetry/virtualenvs/polygnn-5wmT02iB-py3.7/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/home/oliver/.cache/pypoetry/virtualenvs/polygnn-5wmT02iB-py3.7/lib/python3.7/site-packages/polygnn_trainer/layers.py", line 106, in forward
x = layer(x)
File "/home/oliver/.cache/pypoetry/virtualenvs/polygnn-5wmT02iB-py3.7/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/home/oliver/.cache/pypoetry/virtualenvs/polygnn-5wmT02iB-py3.7/lib/python3.7/site-packages/polygnn_trainer/layers.py", line 41, in forward
return self.dropout(self.activation(self.linear(x)))
File "/home/oliver/.cache/pypoetry/virtualenvs/polygnn-5wmT02iB-py3.7/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/home/oliver/.cache/pypoetry/virtualenvs/polygnn-5wmT02iB-py3.7/lib/python3.7/site-packages/torch/nn/modules/linear.py", line 114, in forward
return F.linear(input, self.weight, self.bias)
RuntimeError: mat1 and mat2 shapes cannot be multiplied (50x128 and 129x64)
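
The shapes in the error can be read as follows: the input batch has 50 rows with 128 features each, while the linear layer multiplies by a 129x64 matrix, which means the layer was built to expect 129 input features. The sketch below only restates that shape rule in plain Python. matmul_shape is a hypothetical helper, not a PyTorch function.

```python
def matmul_shape(left, right):
    """Return the result shape of left @ right for 2-D shapes, or raise."""
    rows, inner_left = left
    inner_right, cols = right
    if inner_left != inner_right:
        raise ValueError(
            f"shapes cannot be multiplied ({rows}x{inner_left} and "
            f"{inner_right}x{cols})"
        )
    return (rows, cols)


# The layer in the traceback expects 129 input features per row.
print(matmul_shape((50, 129), (129, 64)))

# The shapes from the error message fail the inner-dimension check.
try:
    matmul_shape((50, 128), (129, 64))
except ValueError as err:
    print(err)
```

One possible reading is that the model was constructed with one more input feature than the data provide, so comparing the input dimension used to build the model with the width of the features that reach final_mlp may be a useful first step.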

smiles_featurizer;'ValueError: Invalid repeat unit. Periodic bond types are mismatching.'

Hi,
I'm trying to make predictions on 12k+ polymers and ran into a problem with double and triple bonds on the terminator. The smiles_featurizer fails to featurize these bonds and the program ends. The exact error is: "Invalid repeat unit. Periodic bond types are mismatching." I've also included a picture of the whole output.

Here are some SMILES that make the program fail:
'[*]CCCC(=[*])Cl'
'[*]CC(COS(=O)(=O)c1ccc(C)cc1)=C(C#[*])COS(=O)(=O)c1ccc(C)cc1'
'[*]CC(OS(=O)(=O)c1ccc(C)cc1)=C(C#[*])OS(=O)(=O)c1ccc(C)cc1'
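
In each of these strings, one end of the repeat unit attaches through a single bond and the other through a double or triple bond, which appears to be what the error message refers to. A rough pre-filter for a large prediction set could flag such strings before featurization. The sketch below is a text-based heuristic, not the check used by smiles_featurizer: it assumes the two repeat-unit ends are written as [*], it only looks at the bond symbol directly next to each [*], and the function names are hypothetical.

```python
import re

# An optional "=" or "#" directly before or after each [*] wildcard.
_STAR = re.compile(r"([=#]?)\[\*\]([=#]?)")


def periodic_bond_symbols(smiles):
    """Return the bond symbol next to each [*] ("-" if none is written)."""
    return [before or after or "-" for before, after in _STAR.findall(smiles)]


def has_matching_periodic_bonds(smiles):
    """Heuristic: exactly two [*] ends that use the same bond symbol."""
    symbols = periodic_bond_symbols(smiles)
    return len(symbols) == 2 and symbols[0] == symbols[1]


candidates = ["[*]CC[*]", "[*]CCCC(=[*])Cl"]
flagged = [s for s in candidates if not has_matching_periodic_bonds(s)]
print(flagged)
```

Flagged strings can then be inspected or skipped so that one invalid repeat unit does not stop the whole run.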

Poor Prediction Results

Multi-Task model with 9,100 points for one property and 6,200 points for the other property
I have previously trained this dataset on polygnn_trainer and got much better results

I have attached:

three parity plots (plots.png),
the output log file (mT_Log.txt)
the altered python file that I used to run polygnn (GNN_CV_training.txt)

Do you know why this behavior might be seen?
I got very similar results on my first test and decided to increase some of the constants at the beginning. Those changes did not improve prediction accuracy.

mT_Log.txt
GNN_CV_training.txt

Memory Allocation Error

I got this error. Any thoughts of how to fix?

RuntimeError: CUDA out of memory. Tried to allocate 30.00 MiB (GPU 0; 15.89 GiB total capacity; 2.14 GiB already allocated; 3.31 MiB free; 2.31 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
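
The message itself points to the PYTORCH_CUDA_ALLOC_CONF environment variable and its max_split_size_mb option. A minimal sketch of setting it from Python follows. The value 128 is an arbitrary placeholder rather than a recommendation, and configure_allocator is a hypothetical helper. The variable needs to be set before PyTorch makes its first CUDA allocation, most simply before torch is imported.

```python
import os


def configure_allocator(max_split_size_mb=128):
    """Set the PyTorch CUDA allocator option named in the error message."""
    # 128 is a placeholder value chosen for this sketch only.
    value = f"max_split_size_mb:{max_split_size_mb}"
    os.environ["PYTORCH_CUDA_ALLOC_CONF"] = value
    return value


# Call this at the very top of the training script, before importing torch.
print(configure_allocator())
```

Note that the message reports only 2.31 GiB reserved by PyTorch out of 15.89 GiB, with 3.31 MiB free. That leaves roughly 13.6 GiB that this process does not account for, so it may be worth checking whether other processes are using the same GPU before tuning the allocator.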

some questions about the dataset

In sample.csv, the values of Egc and Egb seem to be reversed:
sample.csv:

paper:

https://www.polymergenome.org :

By the way, how can I get the full dataset? The website https://khazana.gatech.edu/ seems to have an SSL problem and returns an nginx 400 error. Or do I need some keywords to find the dataset?

Issue with installation

Hi, while installing polyGNN in a virtual environment, we ran into the following issue:

Could you please help us out?
Thanks in advance!

Update citation details

The current citation details in the README correspond to the ArXiv preprint. They should be updated to the Chem. Mat. version.