
crescendo's Introduction

Crescendo


Crescendo provides a unified command line + API for training and evaluating Lightning models


โš ๏ธ Crescendo is a work in progress and highly subject to change

๐Ÿ™ Some of our boilerplate is based on the wonderful template by ashleve! See here.

Summary

โญ๏ธ Crescendo leverages the power of Hydra, Lightning and the humble command line to make executing the training of neural networks as easy as possible.

โญ๏ธ Hydra supports an incredible suite of tools such as powerful approaches for hyperparameter tuning. These are built in and accessible.

โญ๏ธ Loading your models will be handled with the crescendo.analysis API, so you can train your models via the command line on a supercomputer, then load the results in your local Jupyter notebook.

Install

You can easily install Crescendo via pip:

pip install crescendo

Note that this installs not only the crescendo module but also the cr command-line executable. A simple example to test that everything is working properly:

cr model=mlp data=california_housing
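
Hydra-style overrides and multirun sweeps work from the same command. A couple of hedged examples follow; the specific config keys (trainer.max_epochs, model.hidden_dims) are assumptions for illustration and may not match the actual Crescendo configs:

    # Override a nested option (key name is an assumption)
    cr model=mlp data=california_housing trainer.max_epochs=50

    # Hydra multirun: sweep over two hypothetical hidden-layer sizes
    cr -m model=mlp data=california_housing model.hidden_dims=32,64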

Acknowledgement

This research is based upon work supported by the U.S. Department of Energy, Office of Science, Office of Basic Energy Sciences, under Award Number FWP PS-030. This research used resources of the Center for Functional Nanomaterials (CFN), which is a U.S. Department of Energy Office of Science User Facility, at Brookhaven National Laboratory under Contract No. DE-SC0012704. This software is also based upon work supported by the U.S. Department of Energy, Office of Science, Office of Advanced Scientific Computing Research, Department of Energy Computational Science Graduate Fellowship under Award Number DE-FG02-97ER25308.

The Software resulted from work developed under U.S. Government Contract No. DE-SC0012704 and is subject to the following terms: the U.S. Government is granted for itself and others acting on its behalf a paid-up, nonexclusive, irrevocable worldwide license in this computer software and data to reproduce, prepare derivative works, and perform publicly and display publicly.

THE SOFTWARE IS SUPPLIED "AS IS" WITHOUT WARRANTY OF ANY KIND. THE UNITED STATES, THE UNITED STATES DEPARTMENT OF ENERGY, AND THEIR EMPLOYEES: (1) DISCLAIM ANY WARRANTIES, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY IMPLIED WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, TITLE OR NON-INFRINGEMENT, (2) DO NOT ASSUME ANY LEGAL LIABILITY OR RESPONSIBILITY FOR THE ACCURACY, COMPLETENESS, OR USEFULNESS OF THE SOFTWARE, (3) DO NOT REPRESENT THAT USE OF THE SOFTWARE WOULD NOT INFRINGE PRIVATELY OWNED RIGHTS, (4) DO NOT WARRANT THAT THE SOFTWARE WILL FUNCTION UNINTERRUPTED, THAT IT IS ERROR-FREE OR THAT ANY ERRORS WILL BE CORRECTED.

IN NO EVENT SHALL THE UNITED STATES, THE UNITED STATES DEPARTMENT OF ENERGY, OR THEIR EMPLOYEES BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, CONSEQUENTIAL, SPECIAL OR PUNITIVE DAMAGES OF ANY KIND OR NATURE RESULTING FROM EXERCISE OF THIS LICENSE AGREEMENT OR THE USE OF THE SOFTWARE.

crescendo's People

Contributors

matthewcarbone, sekourowe, stevetorr, shubharajkharel

Stargazers

Alper Karaca, Yihui "Ray" Ren

Watchers

Cole Miles

Forkers

shubharajkharel

crescendo's Issues

Test the QM9 dataset integrity

We need to write tests to ensure that we are loading and analyzing the data properly. This will involve the following:

  • Writing ~3 tests that spot check individual molecules to ensure that the xyz file, features and geometries are loaded properly
  • Writing ~3 tests that spot check molecular structure information as predicted by analyze: basically, taking a few molecules, visualizing their structure on e.g. molview and comparing to the analysis method (#9) results.

@SekouRowe I think this will be a good issue for you as well.
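
As a rough illustration, one such spot check could look like the following pytest sketch; the loader helper and import path are hypothetical, but the first QM9 entry (dsgdb9nsd_000001.xyz) is methane:

    # Hypothetical import path and helper name; the real loader may differ
    from crescendo.datasets.qm9 import parse_qm9_xyz

    def test_qm9_datum_000001_is_methane():
        datum = parse_qm9_xyz("data/qm9/dsgdb9nsd_000001.xyz")
        assert datum.n_atoms == 5  # CH4: one carbon, four hydrogens
        assert sorted(datum.elements) == ["C", "H", "H", "H", "H"]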

Test large molecules

Different from #32 as that will be completed in the next merge to master. We still want to ensure the to_graph method is working for the larger molecules, perhaps by simply counting the number of atoms/bonds that should be in the molecule.
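
A rough sketch of the atom-counting idea, using dgllife's smiles_to_bigraph as a stand-in for to_graph (the real method may differ):

    from rdkit import Chem
    from dgllife.utils import smiles_to_bigraph

    def test_graph_atom_count_matches_rdkit():
        smiles = "CC(=O)Oc1ccccc1C(=O)O"  # aspirin, standing in for a larger molecule
        mol = Chem.MolFromSmiles(smiles)
        g = smiles_to_bigraph(smiles)
        # The number of graph nodes should match the heavy-atom count from rdkit
        assert g.num_nodes() == mol.GetNumAtoms()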

Implement parallel GPU training with dgl

Firstly, this appears to be possible, but quite challenging to implement. The PyTorch docs recommend using DistributedDataParallel for parallelizing models in the first place, so it would be nice to do this rather than continue using DataParallel as we currently do.

Note that DataParallel also appears to be completely incompatible with parallel GPU training in dgl, so we'll need to use DistributedDataParallel if we are to have any shot.

Finally, it appears that the crux of the problem lies in how batching is performed. A dgl.batch(...) object is a graph in and of itself, but it batches smaller graphs together by creating a block-diagonal adjacency matrix. The problem, I think, is that when torch splits a batch onto multiple GPUs, it wants to split this matrix into equal pieces, even though the block-diagonal elements do not occur at equal intervals.
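
To make the unequal-block issue concrete, here is a minimal sketch (assuming a recent dgl version) showing that a batch is itself a single graph whose constituent sub-graphs have unequal sizes, so a naive even split across devices would cut through graph boundaries:

    import dgl
    import torch

    # Two graphs with different numbers of nodes
    g1 = dgl.graph((torch.tensor([0, 1]), torch.tensor([1, 2])))  # 3 nodes
    g2 = dgl.graph((torch.tensor([0]), torch.tensor([1])))        # 2 nodes

    bg = dgl.batch([g1, g2])
    print(bg.num_nodes())        # 5 -- the batch is one big graph
    print(bg.batch_num_nodes())  # tensor([3, 2]) -- the blocks are not equally sized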

It is very much unclear to me how to fix these issues, but some hints might lie in e.g. this example here.

Luckily, for QM9, it doesn't really matter, as one GPU is extremely fast to train on. Some early benchmarking on ~100k training data takes roughly 50s/epoch on one GPU using batch sizes of ~3k. So even 1000 epochs will take only roughly 14 hours to complete. QM8 is an order of magnitude faster.

Create FEFF and VASP XANES models

FEFF/VASP transition metal models

Using the databases I will create, we want to train the following models (note specifically, when we say "X-O", we mean a REST query to the Materials Project v2 API in which we make queries like ["X-O", "X-O-*", "X-O-*-*"]).

FEFF

  • Ti-O
  • V-O
  • Cr-O
  • Mn-O
  • Fe-O
  • Co-O
  • Ni-O
  • Cu-O

VASP

  • Ti-O
  • Cu-O

Transfer-learned FEFF-to-VASP models

  • Ti-O
  • Cu-O

Tasks

For each model, there will be two different train/validation/testing splits:

  • "random split": each dataset contains the site-wise features and XAS pairs. The random split simply splits the dataset randomly, as we are all used to.
  • "materials split": the random split can place sites from the same material in both the training and testing sets. This is obviously undesirable because the testing set should represent the "real world deployment scenario" of the model, in which we test on previously unseen materials, not just unseen sites. The materials split partitions the data roughly evenly across materials, then saves the corresponding site data (a minimal sketch of such a split follows this list).
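
A minimal sketch of one way to implement the materials split, using scikit-learn's GroupShuffleSplit with the parent material ID as the group label (variable names are assumptions):

    import numpy as np
    from sklearn.model_selection import GroupShuffleSplit

    # site_features: one row per absorbing site; material_ids: parent material of each site
    site_features = np.random.rand(100, 8)
    material_ids = np.repeat(np.arange(25), 4)  # 25 materials, 4 sites each

    gss = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
    train_idx, test_idx = next(gss.split(site_features, groups=material_ids))

    # No material contributes sites to both the training and testing sets
    assert set(material_ids[train_idx]).isdisjoint(material_ids[test_idx])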

For each of these, and each model, we will do the following:

  1. Train and hyperparameter-tune via Optuna (Crescendo has this functionality) to find the best model architecture and hyperparameters (see the sweep sketch after this list).
  2. Evaluate on the testing set and analyze.
  3. Train a "production model" on all of the data (Crescendo also has this functionality).
  4. Put this production model into the model zoo (crescendo/extern/m3gnet/zoo).
  5. [Optional, not necessary for a paper] Train an ensemble model so we can accurately quantify uncertainty.
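
For step 1, a hedged sketch of how an Optuna sweep might be launched through Hydra's Optuna sweeper plugin; the parameter names and search ranges below are assumptions, not the actual Crescendo configs:

    # Hypothetical sweep; config keys and ranges are assumptions
    cr -m hydra/sweeper=optuna hydra.sweeper.n_trials=50 \
        model=mlp data=california_housing \
        'model.learning_rate=interval(1e-5, 1e-2)'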

We also want to demonstrate the superiority of the transfer-learned FEFF-to-VASP models over the pure VASP models.

Possible mis-alignment between qm8 and qm9 testing data

@SekouRowe, are these the correct QM8 electronic properties?

    5000      0.16172701     0.16230368     0.00145320     0.15055007     0.15075559     0.15709222     0.00083043     0.13377021     0.15223530     0.15301923     0.00106530     0.13645817     0.15815072     0.15885263     0.15320000     0.00130000
    5001      0.14464118     0.16665604     0.00119145     0.17476919     0.13655482     0.16646855     0.00077656     0.17492321     0.13728793     0.16192712     0.00095586     0.16962216     0.14497241     0.16475090     0.00120000     0.17930000
    5002      0.16469761     0.18990316     0.22962985     0.00025683     0.16554669     0.19177703     0.20400373     0.00012896     0.15984340     0.19029825     0.19592690     0.00018085     0.16196898     0.19960029     0.21250000     0.00780000
    5004      0.16322593     0.19238947     0.24865233     0.00008322     0.16067264     0.18921842     0.21430228     0.00005398     0.15519544     0.18813330     0.20460769     0.00004730     0.15744881     0.19630387     0.21710000     0.00010000
    5005      0.17342091     0.19384287     0.23245662     0.00041210     0.17249723     0.19420906     0.20596117     0.00017324     0.16751581     0.19307258     0.19303158     0.00064372     0.17018980     0.20119888     0.20870000     0.11190000

In my test data for QM9, I have (other than 1-10, which are correct)

  • dsgdb9nsd_100001.xyz
  • dsgdb9nsd_100002.xyz
  • dsgdb9nsd_100003.xyz
  • dsgdb9nsd_100004.xyz

It doesn't look like these match up, as you have 5001, 5002, 5003, 5004. Did I miss something here? Thanks!

Btw I'm more than happy to just remove the QM9 tests I have for e.g. 100001 and substitute in 5001, just wanted to be sure that this column is definitely the QM9 ID.

Pipeline redesign

We want a quick way to name datasets, and to reload them from disk so that the process of initially constructing a dataset, which can take about 10 minutes, does not need to be repeated.

More likely than not, this can be accomplished via pickle and simple load_state/save_state methods which appropriately load and save self.__class__.__dict__ to disk.
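
A minimal sketch of that pattern (the class and method names are just for illustration):

    import pickle

    class QM9Dataset:
        """Hypothetical container illustrating the proposed save/load-state pattern."""

        def save_state(self, path):
            # Dump every instance attribute to disk
            with open(path, "wb") as f:
                pickle.dump(self.__dict__, f)

        def load_state(self, path):
            # Restore the instance attributes in place
            with open(path, "rb") as f:
                self.__dict__.update(pickle.load(f))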

It might be prudent then to separate the dataset construction itself and the ML featurization aspects. In other words, three steps:

  1. Load data from individual files and save as a single .csv or pickle file on disk.
  2. Load step 1 saved data and turn the SMILES into graphs, save that to disk.
  3. Load step 2 saved data and train. Each trial should have some random hash and should have checkpointing functionality.

Test the as_dict and from_dict methods

@stevetorr I'm not sure this is going to work as intended. Not all of the attributes of the QM9SmilesDatum are initialized from parameters. 😄

as_dict needs to be tested in tandem with from_dict, which should probably be moved to a utilities directory. I guess we'd want the capability of pickling and un-pickling these objects.
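
A round-trip test along these lines is probably what we want; the QM9SmilesDatum constructor signature and import path below are assumptions:

    from crescendo.datasets.qm9 import QM9SmilesDatum  # hypothetical import path

    def test_as_dict_from_dict_round_trip():
        original = QM9SmilesDatum("CCO")  # ethanol
        restored = QM9SmilesDatum.from_dict(original.as_dict())
        assert restored.as_dict() == original.as_dict()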

Add CNN support

Currently on master, I have both feed-forward neural networks and message-passing neural networks (graph-to-vector) implemented. Obviously, something we're sorely lacking right now is support for convolutional neural networks. We should probably have this. Minimum things we need (a minimal network sketch follows this list):

  • General CNN network Python code
  • Accompanying default Hydra config
  • Smoke tests (perhaps with MNIST)
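
A minimal sketch of what the general CNN code could look like (plain PyTorch, not the eventual Lightning-wrapped Crescendo module):

    from torch import nn

    class SimpleCNN(nn.Module):
        """Small image classifier used only to illustrate the intended scope."""

        def __init__(self, in_channels=1, n_classes=10):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv2d(in_channels, 16, kernel_size=3, padding=1),
                nn.ReLU(),
                nn.MaxPool2d(2),
                nn.Conv2d(16, 32, kernel_size=3, padding=1),
                nn.ReLU(),
                nn.AdaptiveAvgPool2d(1),
            )
            self.classifier = nn.Linear(32, n_classes)

        def forward(self, x):
            return self.classifier(self.features(x).flatten(1))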

Move qm9 datum object into the dataset class

While there is utility in separating these objects into different files/modules, they are not generalized to other datasets right now and for that reason I think they should be in the same file. No need to overcomplicate!

Code up the qm9 loader

When I have the time I will try to do this, but if you'd like to, @SekouRowe take a crack at it! This is already basically implemented in the old repo, but if you'd like to improve on it in any way go for it.

Methods to transfer QM9 data to common formats

Pymatgen's Molecule class and ASE's Atoms class are helpful and familiar to other users. I imagine rdkit also has classes designed to represent molecules. For ease of accessibility, it would be helpful to have methods which export a QM9 datum to one or several such classes from other packages.
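
A rough sketch of what such export methods would do, assuming each datum exposes element symbols and Cartesian coordinates (the attribute names here are hypothetical):

    from ase import Atoms
    from pymatgen.core import Molecule

    def to_ase(datum):
        # datum.elements and datum.coordinates are assumed attribute names
        return Atoms(symbols=datum.elements, positions=datum.coordinates)

    def to_pymatgen(datum):
        return Molecule(datum.elements, datum.coordinates)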

Test analysis methods

Specifically:

  • has_n_membered_ring
  • is_aromatic
  • has_double_bond
  • has_triple_bond
  • has_hetero_bond

@SekouRowe if you'd like to give this a shot, go for it, else let me know if you don't think you have the time and I will handle it!
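
A few example tests, assuming the datum can be constructed from a SMILES string (the constructor and import path are assumptions):

    from crescendo.datasets.qm9 import QM9SmilesDatum  # hypothetical import path

    def test_is_aromatic_benzene():
        assert QM9SmilesDatum("c1ccccc1").is_aromatic()

    def test_has_double_bond_ethylene():
        assert QM9SmilesDatum("C=C").has_double_bond()

    def test_has_triple_bond_acetylene():
        assert QM9SmilesDatum("C#C").has_triple_bond()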

Figure out what's causing a bug in the CI pipeline

_______________ ERROR collecting tests/test_models/test_mpnn.py ________________
ImportError while importing test module '/home/runner/work/crescendo/crescendo/tests/test_models/test_mpnn.py'.
Hint: make sure your test modules/packages have valid Python names.
Traceback:
/usr/share/miniconda/lib/python3.7/importlib/__init__.py:127: in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
tests/test_models/test_mpnn.py:3: in <module>
    from crescendo.models.mpnn import MPNN
crescendo/models/mpnn.py:5: in <module>
    from dgllife.model import MPNNPredictor
/usr/share/miniconda/lib/python3.7/site-packages/dgllife/__init__.py:9: in <module>
    from . import model
/usr/share/miniconda/lib/python3.7/site-packages/dgllife/model/__init__.py:8: in <module>
    from .model_zoo import *
/usr/share/miniconda/lib/python3.7/site-packages/dgllife/model/model_zoo/__init__.py:33: in <module>
    from .acnn import *
/usr/share/miniconda/lib/python3.7/site-packages/dgllife/model/model_zoo/acnn.py:14: in <module>
    from dgl import BatchedDGLHeteroGraph
E   ImportError: cannot import name 'BatchedDGLHeteroGraph' from 'dgl' (/usr/share/miniconda/lib/python3.7/site-packages/dgl/__init__.py)
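
The traceback suggests a version mismatch between dgl and dgllife: BatchedDGLHeteroGraph appears to have been removed in newer dgl releases, so pinning a mutually compatible pair of versions in the CI environment is probably the fix (an educated guess, not yet confirmed).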

Re-think path order for models

See discussion in #69. Saving this for a later PR.

@deyulu, quick question. You noted in the PR that this was your preferred ordering

{XANES, EXAFS}/EDGE/ABSORBER/LEVEL_OF_THEORY/models

Would you be ok with

LEVEL_OF_THEORY/{XANES, EXAFS}/EDGE/ABSORBER/models

?

I feel like keeping ABSORBER in last place makes the most sense, since for a single material we'll be loading models for multiple elements. Also, the level of theory is essentially a property of the code, so it seems to make sense to have that first. Otherwise, I agree with the {XANES, EXAFS}/EDGE/ABSORBER/models order.

Note that the current ordering is

{XANES, EXAFS}/LEVEL_OF_THEORY/EDGE/ABSORBER/models

Test generate_qm9_pickle

The generate_qm9_pickle function in qm9.py needs to be tested.

Also it would be great to have a way to load this into a usable form.

Implement meta-data analysis on qm9 loader

We want to extract useful information such as the number of bond types, atom types, aromaticity, etc. of the molecules in QM9. Defining a simple analyze() method or something like this would be useful.
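
A rough sketch of what such an analyze() could compute, using rdkit directly (this is illustrative, not the actual Crescendo API):

    from collections import Counter
    from rdkit import Chem

    def analyze(smiles_list):
        """Summarize atom types and aromaticity over a list of SMILES strings."""
        atom_counts = Counter()
        n_aromatic = 0
        for smiles in smiles_list:
            mol = Chem.MolFromSmiles(smiles)
            atom_counts.update(atom.GetSymbol() for atom in mol.GetAtoms())
            n_aromatic += int(any(atom.GetIsAromatic() for atom in mol.GetAtoms()))
        return {"atom_types": dict(atom_counts), "n_aromatic_molecules": n_aromatic}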
