
crescendo's Introduction

Crescendo


Crescendo provides a unified command line + API for training and evaluating Lightning models


โš ๏ธ Crescendo is a work in progress and highly subject to change

๐Ÿ™ Some of our boilerplate is based on the wonderful template by ashleve! See here.

Summary

โญ๏ธ Crescendo leverages the power of Hydra, Lightning and the humble command line to make executing the training of neural networks as easy as possible.

โญ๏ธ Hydra supports an incredible suite of tools such as powerful approaches for hyperparameter tuning. These are built in and accessible.

โญ๏ธ Loading your models will be handled with the crescendo.analysis API, so you can train your models via the command line on a supercomputer, then load the results in your local Jupyter notebook.

Install

You can easily install Crescendo via pip:

pip install crescendo

Note that this installs not only the crescendo module but also the cr command-line executable. A simple example to test that everything is working properly:

cr model=mlp data=california_housing
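
Hydra-style overrides and multirun sweeps work from the same command. A couple of hedged examples follow; the specific config keys (trainer.max_epochs, model.hidden_dims) are assumptions for illustration and may not match the actual Crescendo configs:

    # Override a nested option (key name is an assumption)
    cr model=mlp data=california_housing trainer.max_epochs=50

    # Hydra multirun: sweep over two hypothetical hidden-layer sizes
    cr -m model=mlp data=california_housing model.hidden_dims=32,64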

Acknowledgement

This research is based upon work supported by the U.S. Department of Energy, Office of Science, Office of Basic Energy Sciences, under Award Number FWP PS-030. This research used resources of the Center for Functional Nanomaterials (CFN), which is a U.S. Department of Energy Office of Science User Facility, at Brookhaven National Laboratory under Contract No. DE-SC0012704. This software is also based upon work supported by the U.S. Department of Energy, Office of Science, Office of Advanced Scientific Computing Research, Department of Energy Computational Science Graduate Fellowship under Award Number DE-FG02-97ER25308.

The Software resulted from work developed under U.S. Government Contract No. DE-SC0012704 and is subject to the following terms: the U.S. Government is granted for itself and others acting on its behalf a paid-up, nonexclusive, irrevocable worldwide license in this computer software and data to reproduce, prepare derivative works, and perform publicly and display publicly.

THE SOFTWARE IS SUPPLIED "AS IS" WITHOUT WARRANTY OF ANY KIND. THE UNITED STATES, THE UNITED STATES DEPARTMENT OF ENERGY, AND THEIR EMPLOYEES: (1) DISCLAIM ANY WARRANTIES, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY IMPLIED WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, TITLE OR NON-INFRINGEMENT, (2) DO NOT ASSUME ANY LEGAL LIABILITY OR RESPONSIBILITY FOR THE ACCURACY, COMPLETENESS, OR USEFULNESS OF THE SOFTWARE, (3) DO NOT REPRESENT THAT USE OF THE SOFTWARE WOULD NOT INFRINGE PRIVATELY OWNED RIGHTS, (4) DO NOT WARRANT THAT THE SOFTWARE WILL FUNCTION UNINTERRUPTED, THAT IT IS ERROR-FREE OR THAT ANY ERRORS WILL BE CORRECTED.

IN NO EVENT SHALL THE UNITED STATES, THE UNITED STATES DEPARTMENT OF ENERGY, OR THEIR EMPLOYEES BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, CONSEQUENTIAL, SPECIAL OR PUNITIVE DAMAGES OF ANY KIND OR NATURE RESULTING FROM EXERCISE OF THIS LICENSE AGREEMENT OR THE USE OF THE SOFTWARE.

crescendo's People

Contributors

matthewcarbone, sekourowe, stevetorr, shubharajkharel

Stargazers

Alper Karaca, Yihui "Ray" Ren

Watchers

Cole Miles

Forkers

shubharajkharel

crescendo's Issues

Test the QM9 dataset integrity

We need to write tests to ensure that we are loading and analyzing the data properly. This will involve the following:

  • Writing ~3 tests that spot check individual molecules to ensure that the xyz file, features and geometries are loaded properly
  • Writing ~3 tests that spot check molecular structure information as predicted by analyze: basically, taking a few molecules, visualizing their structure on e.g. molview and comparing to the analysis method (#9) results.

@SekouRowe I think this will be a good issue for you as well.
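
As a rough illustration, one such spot check could look like the following pytest sketch; the loader helper and import path are hypothetical, but the first QM9 entry (dsgdb9nsd_000001.xyz) is methane:

    # Hypothetical import path and helper name; the real loader may differ
    from crescendo.datasets.qm9 import parse_qm9_xyz

    def test_qm9_datum_000001_is_methane():
        datum = parse_qm9_xyz("data/qm9/dsgdb9nsd_000001.xyz")
        assert datum.n_atoms == 5  # CH4: one carbon, four hydrogens
        assert sorted(datum.elements) == ["C", "H", "H", "H", "H"]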

Test large molecules

Different from #32 as that will be completed in the next merge to master. We still want to ensure the to_graph method is working for the larger molecules, perhaps by simply counting the number of atoms/bonds that should be in the molecule.
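
A rough sketch of the atom-counting idea, using dgllife's smiles_to_bigraph as a stand-in for to_graph (the real method may differ):

    from rdkit import Chem
    from dgllife.utils import smiles_to_bigraph

    def test_graph_atom_count_matches_rdkit():
        smiles = "CC(=O)Oc1ccccc1C(=O)O"  # aspirin, standing in for a larger molecule
        mol = Chem.MolFromSmiles(smiles)
        g = smiles_to_bigraph(smiles)
        # The number of graph nodes should match the heavy-atom count from rdkit
        assert g.num_nodes() == mol.GetNumAtoms()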

Implement parallel GPU training with dgl

Firstly, this appears to be possible, but quite challenging to implement. The PyTorch docs recommend using DistributedDataParallel for parallelizing models in the first place, so it would be nice to do this rather than continue using DataParallel as we currently do.

Note that DataParallel also appears to be completely incompatible with parallel GPU training in dgl, so we'll need to use DistributedDataParallel if we are to have any shot.

Finally, it appears that the crux of the problem lies in how batching is performed. A dgl.batch(...) object is a graph in and of itself, but it batches smaller graphs together by creating a block-diagonal adjacency matrix. The problem, I think, is that when torch splits a batch onto multiple GPUs, it wants to split this matrix into equal pieces, even though the block-diagonal elements do not occur at equal intervals.
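
To make the unequal-block issue concrete, here is a minimal sketch (assuming a recent dgl version) showing that a batch is itself a single graph whose constituent sub-graphs have unequal sizes, so a naive even split across devices would cut through graph boundaries:

    import dgl
    import torch

    # Two graphs with different numbers of nodes
    g1 = dgl.graph((torch.tensor([0, 1]), torch.tensor([1, 2])))  # 3 nodes
    g2 = dgl.graph((torch.tensor([0]), torch.tensor([1])))        # 2 nodes

    bg = dgl.batch([g1, g2])
    print(bg.num_nodes())        # 5 -- the batch is one big graph
    print(bg.batch_num_nodes())  # tensor([3, 2]) -- the blocks are not equally sized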

It is very much unclear to me how to fix these issues, but some hints might lie in e.g. this example here.

Luckily, for QM9, it doesn't really matter, as one GPU is extremely fast to train on. Some early benchmarking on ~100k training data takes roughly 50s/epoch on one GPU using batch sizes of ~3k. So even 1000 epochs will take only roughly 14 hours to complete. QM8 is an order of magnitude faster.

Create FEFF and VASP XANES models

FEFF/VASP transition metal models

Using the databases I will create, we want to train the following models (note specifically, when we say "X-O", we mean a REST query to the Materials Project v2 API in which we make queries like ["X-O", "X-O-*", "X-O-*-*"]).

FEFF

  • Ti-O
  • V-O
  • Cr-O
  • Mn-O
  • Fe-O
  • Co-O
  • Ni-O
  • Cu-O

VASP

  • Ti-O
  • Cu-O

Transfer-learned FEFF-to-VASP models

  • Ti-O
  • Cu-O

Tasks

For each model, there will be two different train/validation/testing splits:

  • "random split": each dataset contains the site-wise features and XAS pairs. The random split simply splits the dataset randomly, as we are all used to.
  • "materials split": the random split can place sites from the same material in both the training and testing sets. This is obviously undesirable because the testing set should represent the "real world deployment scenario" of the model, in which we test on previously unseen materials, not just unseen sites. The materials split partitions the data roughly evenly across materials, then saves the corresponding site data (a minimal sketch of such a split follows this list).
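
A minimal sketch of one way to implement the materials split, using scikit-learn's GroupShuffleSplit with the parent material ID as the group label (variable names are assumptions):

    import numpy as np
    from sklearn.model_selection import GroupShuffleSplit

    # site_features: one row per absorbing site; material_ids: parent material of each site
    site_features = np.random.rand(100, 8)
    material_ids = np.repeat(np.arange(25), 4)  # 25 materials, 4 sites each

    gss = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
    train_idx, test_idx = next(gss.split(site_features, groups=material_ids))

    # No material contributes sites to both the training and testing sets
    assert set(material_ids[train_idx]).isdisjoint(material_ids[test_idx])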

For each of these, and each model, we will do the following:

  1. Train and hyperparameter-tune via Optuna (Crescendo has this functionality) to find the best model architecture and hyperparameters (see the sweep sketch after this list).
  2. Evaluate on the testing set and analyze.
  3. Train a "production model" on all of the data (Crescendo also has this functionality).
  4. Put this production model into the model zoo (crescendo/extern/m3gnet/zoo).
  5. [Optional, not necessary for a paper] Train an ensemble model so we can accurately quantify uncertainty.
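
For step 1, a hedged sketch of how an Optuna sweep might be launched through Hydra's Optuna sweeper plugin; the parameter names and search ranges below are assumptions, not the actual Crescendo configs:

    # Hypothetical sweep; config keys and ranges are assumptions
    cr -m hydra/sweeper=optuna hydra.sweeper.n_trials=50 \
        model=mlp data=california_housing \
        'model.learning_rate=interval(1e-5, 1e-2)'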

We also want to demonstrate the superiority of the transfer-learned FEFF-to-VASP models over the pure VASP models.

Possible mis-alignment between qm8 and qm9 testing data

@SekouRowe, are these the correct QM8 electronic properties?

    5000      0.16172701     0.16230368     0.00145320     0.15055007     0.15075559     0.15709222     0.00083043     0.13377021     0.15223530     0.15301923     0.00106530     0.13645817     0.15815072     0.15885263     0.15320000     0.00130000
    5001      0.14464118     0.16665604     0.00119145     0.17476919     0.13655482     0.16646855     0.00077656     0.17492321     0.13728793     0.16192712     0.00095586     0.16962216     0.14497241     0.16475090     0.00120000     0.17930000
    5002      0.16469761     0.18990316     0.22962985     0.00025683     0.16554669     0.19177703     0.20400373     0.00012896     0.15984340     0.19029825     0.19592690     0.00018085     0.16196898     0.19960029     0.21250000     0.00780000
    5004      0.16322593     0.19238947     0.24865233     0.00008322     0.16067264     0.18921842     0.21430228     0.00005398     0.15519544     0.18813330     0.20460769     0.00004730     0.15744881     0.19630387     0.21710000     0.00010000
    5005      0.17342091     0.19384287     0.23245662     0.00041210     0.17249723     0.19420906     0.20596117     0.00017324     0.16751581     0.19307258     0.19303158     0.00064372     0.17018980     0.20119888     0.20870000     0.11190000

In my test data for QM9, I have (other than 1-10, which are correct)

  • dsgdb9nsd_100001.xyz
  • dsgdb9nsd_100002.xyz
  • dsgdb9nsd_100003.xyz
  • dsgdb9nsd_100004.xyz

It doesn't look like these match up, as you have 5001, 5002, 5003, 5004. Did I miss something here? Thanks!

Btw I'm more than happy to just remove the QM9 tests I have for e.g. 100001 and substitute in 5001, just wanted to be sure that this column is definitely the QM9 ID.

Pipeline redesign

We want a quick way to name datasets, and to reload them from disk so that the process of initially constructing a dataset, which can take about 10 minutes, does not need to be repeated.

More likely than not, this can be accomplished via pickle and simple load_state/save_state methods which appropriately load and save self.__class__.__dict__ to disk.
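
A minimal sketch of that pattern (the class and method names are just for illustration):

    import pickle

    class QM9Dataset:
        """Hypothetical container illustrating the proposed save/load-state pattern."""

        def save_state(self, path):
            # Dump every instance attribute to disk
            with open(path, "wb") as f:
                pickle.dump(self.__dict__, f)

        def load_state(self, path):
            # Restore the instance attributes in place
            with open(path, "rb") as f:
                self.__dict__.update(pickle.load(f))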

It might be prudent then to separate the dataset construction itself and the ML featurization aspects. In other words, three steps:

  1. Load data from individual files and save as a single .csv or pickle file on disk.
  2. Load step 1 saved data and turn the SMILES into graphs, save that to disk.
  3. Load step 2 saved data and train. Each trial should have some random hash and should have checkpointing functionality.

Test the as_dict and from_dict methods

@stevetorr I'm not sure this is going to work as intended. Not all of the attributes of the QM9SmilesDatum are initialized from parameters. 😄

as_dict needs to be tested in tandem with from_dict, which should probably be moved to a utilities directory. I guess we'd want the capability of pickling and un-pickling these objects.
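
A round-trip test along these lines is probably what we want; the QM9SmilesDatum constructor signature and import path below are assumptions:

    from crescendo.datasets.qm9 import QM9SmilesDatum  # hypothetical import path

    def test_as_dict_from_dict_round_trip():
        original = QM9SmilesDatum("CCO")  # ethanol
        restored = QM9SmilesDatum.from_dict(original.as_dict())
        assert restored.as_dict() == original.as_dict()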

Add CNN support

Currently on master, I have both feed-forward neural networks and message-passing neural networks (graph-to-vector) implemented. Obviously, something we're sorely lacking right now is support for convolutional neural networks. We should probably have this. Minimum things we need (a minimal network sketch follows this list):

  • General CNN network Python code
  • Accompanying default Hydra config
  • Smoke tests (perhaps with MNIST)
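
A minimal sketch of what the general CNN code could look like (plain PyTorch, not the eventual Lightning-wrapped Crescendo module):

    from torch import nn

    class SimpleCNN(nn.Module):
        """Small image classifier used only to illustrate the intended scope."""

        def __init__(self, in_channels=1, n_classes=10):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv2d(in_channels, 16, kernel_size=3, padding=1),
                nn.ReLU(),
                nn.MaxPool2d(2),
                nn.Conv2d(16, 32, kernel_size=3, padding=1),
                nn.ReLU(),
                nn.AdaptiveAvgPool2d(1),
            )
            self.classifier = nn.Linear(32, n_classes)

        def forward(self, x):
            return self.classifier(self.features(x).flatten(1))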

Move qm9 datum object into the dataset class

While there is utility in separating these objects into different files/modules, they are not generalized to other datasets right now and for that reason I think they should be in the same file. No need to overcomplicate!

Code up the qm9 loader

When I have the time I will try to do this, but if you'd like to, @SekouRowe take a crack at it! This is already basically implemented in the old repo, but if you'd like to improve on it in any way go for it.

Methods to transfer QM9 data to common formats

Pymatgen's Molecule class and ASE's Atoms class are helpful and familiar to other users. I imagine rdkit also has classes designed to represent molecules. For ease of accessibility, it would be helpful to have methods which export a QM9 datum to one or several such classes from other packages.
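
A rough sketch of what such export methods would do, assuming each datum exposes element symbols and Cartesian coordinates (the attribute names here are hypothetical):

    from ase import Atoms
    from pymatgen.core import Molecule

    def to_ase(datum):
        # datum.elements and datum.coordinates are assumed attribute names
        return Atoms(symbols=datum.elements, positions=datum.coordinates)

    def to_pymatgen(datum):
        return Molecule(datum.elements, datum.coordinates)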

Test analysis methods

Specifically:

  • has_n_membered_ring
  • is_aromatic
  • has_double_bond
  • has_triple_bond
  • has_hetero_bond

@SekouRowe if you'd like to give this a shot, go for it, else let me know if you don't think you have the time and I will handle it!
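
A few example tests, assuming the datum can be constructed from a SMILES string (the constructor and import path are assumptions):

    from crescendo.datasets.qm9 import QM9SmilesDatum  # hypothetical import path

    def test_is_aromatic_benzene():
        assert QM9SmilesDatum("c1ccccc1").is_aromatic()

    def test_has_double_bond_ethylene():
        assert QM9SmilesDatum("C=C").has_double_bond()

    def test_has_triple_bond_acetylene():
        assert QM9SmilesDatum("C#C").has_triple_bond()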

Figure out what's causing a bug in the CI pipeline

_______________ ERROR collecting tests/test_models/test_mpnn.py ________________
ImportError while importing test module '/home/runner/work/crescendo/crescendo/tests/test_models/test_mpnn.py'.
Hint: make sure your test modules/packages have valid Python names.
Traceback:
/usr/share/miniconda/lib/python3.7/importlib/__init__.py:127: in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
tests/test_models/test_mpnn.py:3: in <module>
    from crescendo.models.mpnn import MPNN
crescendo/models/mpnn.py:5: in <module>
    from dgllife.model import MPNNPredictor
/usr/share/miniconda/lib/python3.7/site-packages/dgllife/__init__.py:9: in <module>
    from . import model
/usr/share/miniconda/lib/python3.7/site-packages/dgllife/model/__init__.py:8: in <module>
    from .model_zoo import *
/usr/share/miniconda/lib/python3.7/site-packages/dgllife/model/model_zoo/__init__.py:33: in <module>
    from .acnn import *
/usr/share/miniconda/lib/python3.7/site-packages/dgllife/model/model_zoo/acnn.py:14: in <module>
    from dgl import BatchedDGLHeteroGraph
E   ImportError: cannot import name 'BatchedDGLHeteroGraph' from 'dgl' (/usr/share/miniconda/lib/python3.7/site-packages/dgl/__init__.py)
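
The traceback suggests a version mismatch between dgl and dgllife: BatchedDGLHeteroGraph appears to have been removed in newer dgl releases, so pinning a mutually compatible pair of versions in the CI environment is probably the fix (an educated guess, not yet confirmed).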

Re-think path order for models

See discussion in #69. Saving this for a later PR.

@deyulu, quick question. You noted in the PR that this was your preferred ordering

{XANES, EXAFS}/EDGE/ABSORBER/LEVEL_OF_THEORY/models

Would you be ok with

LEVEL_OF_THEORY/{XANES, EXAFS}/EDGE/ABSORBER/models

?

I feel like keeping ABSORBER in last place makes the most sense, since for a single material we'll be loading models for multiple elements. Also, the level of theory is essentially a property of the code, so it seems to make sense to have that first. Otherwise, I agree with the {XANES, EXAFS}/EDGE/ABSORBER/models order.

Note that the current ordering is

{XANES, EXAFS}/LEVEL_OF_THEORY/EDGE/ABSORBER/models

Test generate_qm9_pickle

The generate_qm9_pickle function in qm9.py needs to be tested.

Also it would be great to have a way to load this into a usable form.

Implement meta-data analysis on qm9 loader

We want to extract useful information such as the number of bond types, atom types, aromaticity, etc. of the molecules in QM9. Defining a simple analyze() method or something like this would be useful.
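
A rough sketch of what such an analyze() could compute, using rdkit directly (this is illustrative, not the actual Crescendo API):

    from collections import Counter
    from rdkit import Chem

    def analyze(smiles_list):
        """Summarize atom types and aromaticity over a list of SMILES strings."""
        atom_counts = Counter()
        n_aromatic = 0
        for smiles in smiles_list:
            mol = Chem.MolFromSmiles(smiles)
            atom_counts.update(atom.GetSymbol() for atom in mol.GetAtoms())
            n_aromatic += int(any(atom.GetIsAromatic() for atom in mol.GetAtoms()))
        return {"atom_types": dict(atom_counts), "n_aromatic_molecules": n_aromatic}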
