
espaloma's Introduction

espaloma: Extensible Surrogate Potential Optimized by Message-passing Algorithms 🍹


Source code for Wang Y, Fass J, and Chodera JD "End-to-End Differentiable Construction of Molecular Mechanics Force Fields."


Documentation: https://docs.espaloma.org

Paper Abstract

Molecular mechanics (MM) potentials have long been a workhorse of computational chemistry. Leveraging accuracy and speed, these functional forms find use in a wide variety of applications in biomolecular modeling and drug discovery, from rapid virtual screening to detailed free energy calculations. Traditionally, MM potentials have relied on human-curated, inflexible, and poorly extensible discrete chemical perception rules (atom types) for applying parameters to small molecules or biopolymers, making it difficult to optimize both types and parameters to fit quantum chemical or physical property data. Here, we propose an alternative approach that uses graph neural networks to perceive chemical environments, producing continuous atom embeddings from which valence and nonbonded parameters can be predicted using invariance-preserving layers. Since all stages are built from smooth neural functions, the entire process---spanning chemical perception to parameter assignment---is modular and end-to-end differentiable with respect to model parameters, allowing new force fields to be easily constructed, extended, and applied to arbitrary molecules. We show that this approach is not only sufficiently expressive to reproduce legacy atom types, but that it can learn and extend existing molecular mechanics force fields, construct entirely new force fields applicable to both biopolymers and small molecules from quantum chemical calculations, and even learn to accurately predict free energies from experimental observables.

Installation

$ conda install -c conda-forge "espaloma=0.3.2"

Example: Deploy espaloma 0.3.2 pretrained force field to arbitrary MM system

# imports
import os
import torch
import espaloma as esp

# define or load a molecule of interest via the Open Force Field toolkit
from openff.toolkit.topology import Molecule
molecule = Molecule.from_smiles("CN1C=NC2=C1C(=O)N(C(=O)N2C)C")

# create an Espaloma Graph object to represent the molecule of interest
molecule_graph = esp.Graph(molecule)

# load pretrained model
espaloma_model = esp.get_model("latest")

# apply a trained espaloma model to assign parameters
espaloma_model(molecule_graph.heterograph)

# create an OpenMM System for the specified molecule
openmm_system = esp.graphs.deploy.openmm_system_from_graph(molecule_graph)

If using espaloma from a local .pt file (for example, espaloma-0.3.2.pt), you need to run the model's eval method to get correct inference/predictions, as follows:

import torch
...
# load local pretrained model
espaloma_model = torch.load("espaloma-0.3.2.pt")
espaloma_model.eval()
...

The rest of the code should be the same as in the previous code block example.
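Putting the two blocks together, a minimal end-to-end sketch using a local checkpoint might look as follows (this assumes espaloma-0.3.2.pt has already been downloaded to the working directory):

# imports
import torch
import espaloma as esp
from openff.toolkit.topology import Molecule

# define the molecule and its espaloma graph
molecule = Molecule.from_smiles("CN1C=NC2=C1C(=O)N(C(=O)N2C)C")
molecule_graph = esp.Graph(molecule)

# load the local checkpoint and switch it to inference mode
espaloma_model = torch.load("espaloma-0.3.2.pt")
espaloma_model.eval()

# assign parameters and build an OpenMM System
espaloma_model(molecule_graph.heterograph)
openmm_system = esp.graphs.deploy.openmm_system_from_graph(molecule_graph)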

Compatible models

Below is a compatibility matrix for different versions of espaloma code and espaloma models (the .pt file).

Model 🧪            DOI 📝    Supported Espaloma version 💻    Release Date 🗓️    Espaloma architecture change 📐?
espaloma-0.3.2.pt             0.3.1, 0.3.2                     Sep 22, 2023        ✅ No
espaloma-0.3.1.pt             0.3.1, 0.3.2                     Jul 17, 2023        ⚠️ Yes
espaloma-0.3.0.pt             0.3.0                            Apr 26, 2023        ⚠️ Yes

Note

espaloma-0.3.1.pt and espaloma-0.3.2.pt are the same model.

Using espaloma to parameterize small molecules in relative free energy calculations

An example of using espaloma to parameterize small molecules in relative alchemical free energy calculations is provided in the scripts/perses-benchmark/ directory.

Manifest

  • espaloma/ core code for graph-parametrized potential energy functions.
    • graphs/ data objects that contain the various levels of information we need.
      • graph.py base modules for graphs.
      • molecule_graph.py provides APIs to various molecular modelling toolkits.
      • homogeneous_graph.py simplest graph representation of a molecule.
      • heterogeneous_graph.py graph representation of a molecule that contains information regarding membership of lower-level nodes to higher-level nodes.
      • parametrized_graph.py graph representation of a molecule with all parameters needed for energy evaluation.
    • nn/ neural network models that facilitate translation between graphs.
      • dgl_legacy.py API to dgl models for atom-level message passing.
    • mm/ molecular mechanics functionalities for energy evaluation.
      • i/ energy terms used in Class-I force field.
        • bond.py bond energy
        • angle.py angle energy
        • torsion.py torsion energy
        • nonbonded.py nonbonded energy
      • ii/ energy terms used in Class-II force field.
        • coupling.py coupling terms
        • polynomial.py higher order polynomials.
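As a rough illustration of what a Class-I term in mm/i/ computes, here is a generic harmonic bond energy sketch (illustrative only; espaloma's actual implementation and tensor shape conventions may differ):

import torch

def harmonic_bond_energy(x, k, eq):
    """Class-I harmonic bond term: u = 0.5 * k * (x - eq)^2.

    x  : bond lengths,        shape (n_bonds, n_snapshots)
    k  : force constants,     shape (n_bonds, 1)
    eq : equilibrium lengths, shape (n_bonds, 1)
    """
    return 0.5 * k * (x - eq) ** 2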

License

This software is licensed under the MIT license.

Copyright

Copyright (c) 2020, Chodera Lab at Memorial Sloan Kettering Cancer Center and Authors.

espaloma's People

Contributors

ijpulidos, jchodera, jthorton, kaminow, kntkb, madilynpaul, mattwthompson, maxentile, mikemhenry, yuanqing-wang


espaloma's Issues

Node featurization does not properly assign hybridization using OpenEye toolkit

Problem

  1. The current implementation does not properly assign atom hybridization using the OpenEye toolkit. I think you need to first run oechem.OEAssignHybridization() to assign the atom hybridization. However, I could not find it in read_homogeneous_graph.py, which I assume is responsible for creating the node features.

  2. Regarding the discussions in espaloma_charge issue #18, should we exclude atom.GetValence(), atom.GetExplicitValence(), and perhaps atom.GetFormalCharge() from the input features? Also, do we need atom.GetIsotope()?

  3. The RDKit and OpenEye toolkits give different node features. Also, the RDKit HybridizationType is not fully supported, which will raise an error if a hydrogen atom is passed.

Reproduce (atom hybridization):

from rdkit import Chem
from openeye import oechem

HYBRIDIZATION_RDKIT = {
    Chem.rdchem.HybridizationType.SP: [1, 0, 0, 0, 0],
    Chem.rdchem.HybridizationType.SP2: [0, 1, 0, 0, 0],
    Chem.rdchem.HybridizationType.SP3: [0, 0, 1, 0, 0],
    Chem.rdchem.HybridizationType.SP3D: [0, 0, 0, 1, 0],
    Chem.rdchem.HybridizationType.SP3D2: [0, 0, 0, 0, 1],
    Chem.rdchem.HybridizationType.S: [0, 0, 0, 0, 0],
    Chem.rdchem.HybridizationType.UNSPECIFIED: [0, 0, 0, 0, 0],
    Chem.rdchem.HybridizationType.OTHER: [0, 0, 0, 0, 0],
}

HYBRIDIZATION_OE = {
    oechem.OEHybridization_sp: [1, 0, 0, 0, 0],
    oechem.OEHybridization_sp2: [0, 1, 0, 0, 0], 
    oechem.OEHybridization_sp3: [0, 0, 1, 0, 0],
    oechem.OEHybridization_sp3d: [0, 0, 0, 1, 0],
    oechem.OEHybridization_sp3d2: [0, 0, 0, 0, 1],
    oechem.OEHybridization_Unknown: [0, 0, 0, 0, 0],
}

# define smiles and add hydrogen
smi = '[O-]C(c1ccccc1)=O'
mol = Chem.MolFromSmiles(smi)
mol = Chem.AddHs(mol)
smi = Chem.MolToSmiles(mol)
print('# Smiles with hydrogen')
print(smi)
print('---')

# rdkit
print('# RDKIT')
rdmol = mol
for atom in rdmol.GetAtoms():
    print(atom.GetAtomicNum(), atom.GetHybridization(), HYBRIDIZATION_RDKIT[atom.GetHybridization()])
print('---')

# openeye: w/o hybridization assignment
oemol = oechem.OEGraphMol()
oechem.OESmilesToMol(oemol, smi)
print('# OpenEye (w/o hybridization assignment)')
for atom in oemol.GetAtoms():
    print(atom.GetAtomicNum(), atom.GetHyb(), HYBRIDIZATION_OE[atom.GetHyb()])
print('---')

# openeye: w/ assign hybridization
oechem.OEAssignHybridization(oemol)
print('# OpenEye: w hybridization assignment')
for atom in oemol.GetAtoms():
    print(atom.GetAtomicNum(), atom.GetHyb(), HYBRIDIZATION_OE[atom.GetHyb()])

output:

# Smiles with hydrogen
[H]c1c([H])c([H])c(C(=O)[O-])c([H])c1[H]
---
# RDKIT
8 SP2 [0, 1, 0, 0, 0]
6 SP2 [0, 1, 0, 0, 0]
6 SP2 [0, 1, 0, 0, 0]
6 SP2 [0, 1, 0, 0, 0]
6 SP2 [0, 1, 0, 0, 0]
6 SP2 [0, 1, 0, 0, 0]
6 SP2 [0, 1, 0, 0, 0]
6 SP2 [0, 1, 0, 0, 0]
8 SP2 [0, 1, 0, 0, 0]
1 UNSPECIFIED [0, 0, 0, 0, 0]
1 UNSPECIFIED [0, 0, 0, 0, 0]
1 UNSPECIFIED [0, 0, 0, 0, 0]
1 UNSPECIFIED [0, 0, 0, 0, 0]
1 UNSPECIFIED [0, 0, 0, 0, 0]
---
# OpenEye (w/o hybridization assignment)
1 0 [0, 0, 0, 0, 0]
6 0 [0, 0, 0, 0, 0]
6 0 [0, 0, 0, 0, 0]
1 0 [0, 0, 0, 0, 0]
6 0 [0, 0, 0, 0, 0]
1 0 [0, 0, 0, 0, 0]
6 0 [0, 0, 0, 0, 0]
6 0 [0, 0, 0, 0, 0]
8 0 [0, 0, 0, 0, 0]
8 0 [0, 0, 0, 0, 0]
6 0 [0, 0, 0, 0, 0]
1 0 [0, 0, 0, 0, 0]
6 0 [0, 0, 0, 0, 0]
1 0 [0, 0, 0, 0, 0]
---
# OpenEye: w hybridization assignment
1 0 [0, 0, 0, 0, 0]
6 2 [0, 1, 0, 0, 0]
6 2 [0, 1, 0, 0, 0]
1 0 [0, 0, 0, 0, 0]
6 2 [0, 1, 0, 0, 0]
1 0 [0, 0, 0, 0, 0]
6 2 [0, 1, 0, 0, 0]
6 2 [0, 1, 0, 0, 0]
8 2 [0, 1, 0, 0, 0]
8 0 [0, 0, 0, 0, 0]
6 2 [0, 1, 0, 0, 0]
1 0 [0, 0, 0, 0, 0]
6 2 [0, 1, 0, 0, 0]
1 0 [0, 0, 0, 0, 0]

rdkit version: '2022.03.4'
openeye version: '2022.1.1'
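A minimal sketch of the suggested fix, reusing the one-hot table HYBRIDIZATION_OE from the snippet above; featurize_hybridization is a hypothetical helper, not an existing espaloma function:

from openeye import oechem

def featurize_hybridization(oemol, table=HYBRIDIZATION_OE):
    # assign hybridization once per molecule before reading atom.GetHyb()
    oechem.OEAssignHybridization(oemol)
    # fall back to the all-zero encoding for anything not in the table
    return [table.get(atom.GetHyb(), [0, 0, 0, 0, 0]) for atom in oemol.GetAtoms()]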

How to run the example script provided in paper?

Hi guys, I cloned the repo and tried to run the following code provided in the paper, but it complained:

  File "/root/group_ceph/espaloma/example.py", line 3, in <module>
    dataset = esp.data.dataset.GraphDataset.load('gen2').view(batch_size=128)
  File "/root/group_ceph/espaloma/espaloma/data/dataset.py", line 316, in load
    paths = os.listdir(path)
FileNotFoundError: [Errno 2] No such file or directory: 'gen2'

Do you have any suggestions for what I should do next?


I found that there are some h5 files in the data folder; another error was raised when I tried to load one with pandas:

  File "H5F.c", line 620, in H5Fopen
    unable to open file
  File "H5VLcallback.c", line 3501, in H5VL_file_open
    failed to iterate over available VOL connector plugins
  File "H5PLpath.c", line 578, in H5PL__path_table_iterate
    can't iterate over plugins in plugin path '(null)'
  File "H5PLpath.c", line 620, in H5PL__path_table_iterate_process_path
    can't open directory: /data/miniconda3/envs/esp/lib/hdf5/plugin
  File "H5VLcallback.c", line 3351, in H5VL__file_open
    open failed
  File "H5VLnative_file.c", line 97, in H5VL__native_file_open
    unable to open file
  File "H5Fint.c", line 1990, in H5F_open
    unable to read superblock
  File "H5Fsuper.c", line 405, in H5F__super_read
    file signature not found

End of HDF5 error back trace

Unable to open/create file './data/qca/Coverage.h5'

Signature for legacy toolkit atom typing and NN atom typing.

Related to the discussions we have in #9.

I suggested that the function signature for legacy atom typing frameworks, as well as for NN atom typing / parametrization, should be the same, i.e. graph -> graph.

Nonetheless, this would mean that we'd have to carry the molecule (RDKit, OpenEye, or OpenForceField) object with the homogeneous graph. This wouldn't be very efficient.

If we have the function for legacy typing to have the signature molecule object -> graph, we need to think more about the bookkeeping efforts to make sure that the indices won't be messed up after batching, training, and debatching.
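To make the trade-off concrete, here is a rough sketch of the two candidate signatures (the names are illustrative, not the actual espaloma API):

# Option A: legacy typing and NN typing share one signature, graph -> graph,
# so the molecule object has to be carried along inside the graph.
def assign_legacy_types(graph: "esp.Graph") -> "esp.Graph": ...
def assign_nn_types(graph: "esp.Graph") -> "esp.Graph": ...

# Option B: legacy typing starts from the molecule object instead,
# which requires extra bookkeeping so indices survive batching and debatching.
def assign_legacy_types_from_molecule(molecule: "Molecule") -> "esp.Graph": ...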

restructure

I think it's time, for the sake of scaling this up to more systematic experiments, to restructure this project.

The primary challenge I realized when designing experiments and structuring the current project is that there are too many parts in this pipeline that we want to model. Specifically:

  • molecule object (rdkit.Molecule, openeye.GraphMol, or openforcefield.Molecule)

  • graph object (dgl.Graph or dgl.HeteroGraph)
    the object that contains the information regarding the identities / attributes of the atoms and how they are connected.

  • parametrized graph object (still dgl.Graph or dgl.HeteroGraph)
    the object that contains all the parameters necessary for MM-like energy evaluation, namely ks and eqs and so forth

  • energy (scalar tensor or a list of scalar tensors)
    the end object in both fitting and inference

There are problems with each of these objects, and we can argue that there are still downstream objects that we can have, namely those needed for openmm simulations.

But first, here's how I picture structuring the repo:

  • espaloma/ core code for graph nets-powered force field fitting.
    • graph.py graph abstraction of molecules.
    • parametrized_graph.py molecular graph with all the parameters needed for energy evaluation
    • net.py interface for neural networks that operate on espaloma.graph
    • mm/ helper functions to calculate MM energy
  • scripts/ experiment scripts
    • typing/ experiment scripts to reproduce atom-typing
    • mm_fitting/ experiment scripts to fit espaloma potentials to Molecular Mechanics
    • qm_fitting/ experiment scripts to fit espaloma potentials to Quantum Mechanics
      • class_i/
      • class_ii/ with the inclusion of coupling and higher-order terms
    • deployment/ experiments to provide an interface to openmm.System objects and run actual MD.
  • devtools/ development tools

The following functions are needed to implement the aforementioned pipeline in a cleaner manner:

  • espaloma.graph.from_oemol and friends: read molecules into graphs
  • espaloma.parametrized_graph.from_forcefield(forcefield='gaff2') parametrize a molecular graph using a legacy force field. We will likely need to port these implementations from other repos, or import them.
  • espaloma.parametrized_graph.from_nn() parametrize a molecular graph using neural network models that can be trained.
  • espaloma.mm.energy(g, x) evaluate the energy of the parametrized molecule (however it's parametrized) at a given geometry. (Lots of helper functions are needed, of course, and test coverage would ideally ensure consistency with OpenMM, although that might be especially tricky for nonbonded terms.) A sketch of how these might compose follows this list.
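A pseudocode-style sketch of how these proposed functions might compose (none of them exist yet; names and arguments simply follow the bullets above):

# (pseudocode: these functions are proposed, not implemented)

# molecule -> graph
g = espaloma.graph.from_oemol(oemol)

# graph -> parametrized graph, via a legacy force field or a trainable NN
g_ref = espaloma.parametrized_graph.from_forcefield(g, forcefield="gaff2")
g_hat = espaloma.parametrized_graph.from_nn(g, model=net)

# parametrized graph + geometry (x) -> energy
u_ref = espaloma.mm.energy(g_ref, x)
u_hat = espaloma.mm.energy(g_hat, x)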

The following are the trickiest choices that I would like to have a discussion here:

  • ways to structure graph
    Currently this is done by having graph, heterograph, and hierarchical_graph objects as input. I find this a bit ugly. I suggest allowing only one kind of graph as the input to either NNs or legacy force fields, and putting whatever is needed to express the relationships between graph entities into the model. Note, however, that we would need tricks to make sure that this part of the modeling is executed exactly once during training.

  • ways to output energies
    Without any sum functions, one molecule would have separate bond, angle, and nonbonded energies, all with different dimensions. This becomes even more complicated when a batch contains multiple molecules. I think it's critical that we find a simple way to output energies.

  • ways to have a general enough espaloma.net object to enable future expansion
    On the representation level we already have a bunch of ideas: atom-level message passing only, hierarchical message passing, or a somewhat fancier version I proposed that uses an RNN to encode walks of different lengths (corresponding to different levels in the hierarchy). If we were to limit the input graph to a universal form, how are we going to develop the net module so that expressing these ideas isn't too much of a headache?

Does espaloma correctly work on CPUs without modification?

I am trying to craft a simple script to create and simulate a small molecule with espaloma via openmmforcefields.

There's probably a much simpler way to do this with just espaloma, and I'd love to know what that is, but trying to run this on a CPU exposed a potential bug when running espaloma on a machine without a GPU:

$ python espaloma-aniline.py
Source espaloma-0.2.2.offxml could not be read. If this is a file, ensure that the path is correct.
If the file is present, ensure it is in a known SMIRNOFF encoding.
Valid formats are: ['XML']
Parsing failed with the following error:
syntax error: line 1, column 0

Warning: importing 'simtk.openmm' is deprecated.  Import 'openmm' instead.
/lila/home/chodera/miniconda/envs/espaloma-perses/lib/python3.9/site-packages/numpy/core/getlimits.py:499: UserWarning: The value of the smallest subnormal for <class 'numpy.float32'> type is zero.
  setattr(self, word, getattr(machar, word).flat[0])
/lila/home/chodera/miniconda/envs/espaloma-perses/lib/python3.9/site-packages/numpy/core/getlimits.py:89: UserWarning: The value of the smallest subnormal for <class 'numpy.float32'> type is zero.
  return self._float_to_str(self.smallest_subnormal)
/lila/home/chodera/miniconda/envs/espaloma-perses/lib/python3.9/site-packages/numpy/core/getlimits.py:499: UserWarning: The value of the smallest subnormal for <class 'numpy.float64'> type is zero.
  setattr(self, word, getattr(machar, word).flat[0])
/lila/home/chodera/miniconda/envs/espaloma-perses/lib/python3.9/site-packages/numpy/core/getlimits.py:89: UserWarning: The value of the smallest subnormal for <class 'numpy.float64'> type is zero.
  return self._float_to_str(self.smallest_subnormal)
/lila/home/chodera/miniconda/envs/espaloma-perses/lib/python3.9/site-packages/dgl/heterograph.py:72: DGLWarning: Recommend creating graphs by `dgl.graph(data)` instead of `dgl.DGLGraph(data)`.
  dgl_warning('Recommend creating graphs by `dgl.graph(data)`'
/lila/home/chodera/miniconda/envs/espaloma-perses/lib/python3.9/site-packages/dgl/convert.py:997: DGLWarning: dgl.to_homo is deprecated. Please use dgl.to_homogeneous
  dgl_warning("dgl.to_homo is deprecated. Please use dgl.to_homogeneous")
Traceback (most recent call last):
  File "/lila/home/chodera/espaloma-aniline/espaloma-aniline.py", line 14, in <module>
    system = system_generator.create_system(openmm_topology, molecules=[molecule])
  File "/lila/home/chodera/miniconda/envs/espaloma-perses/lib/python3.9/site-packages/openmmforcefields/generators/system_generators.py", line 324, in create_system
    system = self.forcefield.createSystem(topology, **forcefield_kwargs)
  File "/lila/home/chodera/miniconda/envs/espaloma-perses/lib/python3.9/site-packages/openmm/app/forcefield.py", line 1212, in createSystem
    templateForResidue = self._matchAllResiduesToTemplates(data, topology, residueTemplates, ignoreExternalBonds)
  File "/lila/home/chodera/miniconda/envs/espaloma-perses/lib/python3.9/site-packages/openmm/app/forcefield.py", line 1417, in _matchAllResiduesToTemplates
    if generator(self, res):
  File "/lila/home/chodera/miniconda/envs/espaloma-perses/lib/python3.9/site-packages/openmmforcefields/generators/template_generators.py", line 304, in generator
    ffxml_contents = self.generate_residue_template(molecule)
  File "/lila/home/chodera/miniconda/envs/espaloma-perses/lib/python3.9/site-packages/openmmforcefields/generators/template_generators.py", line 1630, in generate_residue_template
    self.espaloma_model(molecule_graph.heterograph)
  File "/lila/home/chodera/miniconda/envs/espaloma-perses/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/lila/home/chodera/miniconda/envs/espaloma-perses/lib/python3.9/site-packages/torch/nn/modules/container.py", line 141, in forward
    input = module(input)
  File "/lila/home/chodera/miniconda/envs/espaloma-perses/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/lila/home/chodera/miniconda/envs/espaloma-perses/lib/python3.9/site-packages/espaloma/nn/sequential.py", line 144, in forward
    x = self._sequential(g_, x)
  File "/lila/home/chodera/miniconda/envs/espaloma-perses/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/lila/home/chodera/miniconda/envs/espaloma-perses/lib/python3.9/site-packages/espaloma/nn/sequential.py", line 62, in forward
    x = getattr(self, exe)(g, x)
  File "/lila/home/chodera/miniconda/envs/espaloma-perses/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/lila/home/chodera/miniconda/envs/espaloma-perses/lib/python3.9/site-packages/espaloma/nn/layers/dgl_legacy.py", line 46, in forward
    return self.gn(g, x)
  File "/lila/home/chodera/miniconda/envs/espaloma-perses/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/lila/home/chodera/miniconda/envs/espaloma-perses/lib/python3.9/site-packages/dgl/nn/pytorch/conv/sageconv.py", line 235, in forward
    graph.update_all(msg_fn, fn.mean('m', 'neigh'))
  File "/lila/home/chodera/miniconda/envs/espaloma-perses/lib/python3.9/site-packages/dgl/heterograph.py", line 4895, in update_all
    ndata = core.message_passing(g, message_func, reduce_func, apply_node_func)
  File "/lila/home/chodera/miniconda/envs/espaloma-perses/lib/python3.9/site-packages/dgl/core.py", line 357, in message_passing
    ndata = invoke_gspmm(g, mfunc, rfunc)
  File "/lila/home/chodera/miniconda/envs/espaloma-perses/lib/python3.9/site-packages/dgl/core.py", line 332, in invoke_gspmm
    z = op(graph, x)
  File "/lila/home/chodera/miniconda/envs/espaloma-perses/lib/python3.9/site-packages/dgl/ops/spmm.py", line 189, in func
    return gspmm(g, 'copy_lhs', reduce_op, x, None)
  File "/lila/home/chodera/miniconda/envs/espaloma-perses/lib/python3.9/site-packages/dgl/ops/spmm.py", line 75, in gspmm
    ret = gspmm_internal(g._graph, op,
  File "/lila/home/chodera/miniconda/envs/espaloma-perses/lib/python3.9/site-packages/dgl/backend/pytorch/sparse.py", line 757, in gspmm
    return GSpMM.apply(gidx, op, reduce_op, lhs_data, rhs_data)
  File "/lila/home/chodera/miniconda/envs/espaloma-perses/lib/python3.9/site-packages/torch/cuda/amp/autocast_mode.py", line 94, in decorate_fwd
    return fwd(*args, **kwargs)
  File "/lila/home/chodera/miniconda/envs/espaloma-perses/lib/python3.9/site-packages/dgl/backend/pytorch/sparse.py", line 126, in forward
    out, (argX, argY) = _gspmm(gidx, op, reduce_op, X, Y)
  File "/lila/home/chodera/miniconda/envs/espaloma-perses/lib/python3.9/site-packages/dgl/sparse.py", line 228, in _gspmm
    _CAPI_DGLKernelSpMM(gidx, op, reduce_op,
  File "dgl/_ffi/_cython/./function.pxi", line 287, in dgl._ffi._cy3.core.FunctionBase.__call__
  File "dgl/_ffi/_cython/./function.pxi", line 232, in dgl._ffi._cy3.core.FuncCall
  File "dgl/_ffi/_cython/./base.pxi", line 155, in dgl._ffi._cy3.core.CALL
dgl._ffi.base.DGLError: [11:21:00] /opt/dgl/src/array/cpu/./spmm_blocking_libxsmm.h:267: Failed to generate libxsmm kernel for the SpMM operation!
Stack trace:
  [bt] (0) /lila/home/chodera/miniconda/envs/espaloma-perses/lib/python3.9/site-packages/dgl/libdgl.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x4f) [0x2ad0bac7000f]
  [bt] (1) /lila/home/chodera/miniconda/envs/espaloma-perses/lib/python3.9/site-packages/dgl/libdgl.so(void dgl::aten::cpu::SpMMRedopCsrOpt<long, float, dgl::aten::cpu::op::CopyLhs<float>, dgl::aten::cpu::op::Add<float> >(dgl::BcastOff const&, dgl::aten::CSRMatrix const&, dgl::runtime::NDArray, dgl::runtime::NDArray, dgl::runtime::NDArray, dgl::runtime::NDArray, dgl::runtime::NDArray)+0x3d4) [0x2ad0baeb07f4]
  [bt] (2) /lila/home/chodera/miniconda/envs/espaloma-perses/lib/python3.9/site-packages/dgl/libdgl.so(void dgl::aten::cpu::SpMMSumCsrLibxsmm<long, float, dgl::aten::cpu::op::CopyLhs<float> >(dgl::BcastOff const&, dgl::aten::CSRMatrix const&, dgl::runtime::NDArray, dgl::runtime::NDArray, dgl::runtime::NDArray)+0x73) [0x2ad0baeb08a3]
  [bt] (3) /lila/home/chodera/miniconda/envs/espaloma-perses/lib/python3.9/site-packages/dgl/libdgl.so(void dgl::aten::cpu::SpMMSumCsr<long, float, dgl::aten::cpu::op::CopyLhs<float> >(dgl::BcastOff const&, dgl::aten::CSRMatrix const&, dgl::runtime::NDArray, dgl::runtime::NDArray, dgl::runtime::NDArray)+0x12f) [0x2ad0baecbeaf]
  [bt] (4) /lila/home/chodera/miniconda/envs/espaloma-perses/lib/python3.9/site-packages/dgl/libdgl.so(void dgl::aten::SpMMCsr<1, long, 32>(std::string const&, std::string const&, dgl::BcastOff const&, dgl::aten::CSRMatrix const&, dgl::runtime::NDArray, dgl::runtime::NDArray, dgl::runtime::NDArray, std::vector<dgl::runtime::NDArray, std::allocator<dgl::runtime::NDArray> >)+0xcd3) [0x2ad0baee2203]
  [bt] (5) /lila/home/chodera/miniconda/envs/espaloma-perses/lib/python3.9/site-packages/dgl/libdgl.so(dgl::aten::SpMM(std::string const&, std::string const&, std::shared_ptr<dgl::BaseHeteroGraph>, dgl::runtime::NDArray, dgl::runtime::NDArray, dgl::runtime::NDArray, std::vector<dgl::runtime::NDArray, std::allocator<dgl::runtime::NDArray> >)+0x13d5) [0x2ad0baf14455]
  [bt] (6) /lila/home/chodera/miniconda/envs/espaloma-perses/lib/python3.9/site-packages/dgl/libdgl.so(+0x46a8d8) [0x2ad0baf288d8]
  [bt] (7) /lila/home/chodera/miniconda/envs/espaloma-perses/lib/python3.9/site-packages/dgl/libdgl.so(+0x46ae71) [0x2ad0baf28e71]
  [bt] (8) /lila/home/chodera/miniconda/envs/espaloma-perses/lib/python3.9/site-packages/dgl/libdgl.so(DGLFuncCall+0x48) [0x2ad0baf7c058]

Can we make sure we can handle CPU-only as well?

Here's the script:

# Create OpenFF Molecule and conformers                                                                                                                                                                                            
from openff.toolkit.topology import Molecule
smiles = 'Nc1ccccc1' # aniline                                                                                                                                                                                                     
molecule = Molecule.from_smiles(smiles)
molecule.generate_conformers()

# Parameterize it                                                                                                                                                                                                                  
from openmmforcefields.generators import SystemGenerator
from openmm import app
from openmm import unit
openmm_topology = molecule.to_topology().to_openmm()
forcefield_kwargs = { 'constraints' : app.HBonds, 'rigidWater' : True, 'removeCMMotion' : False, 'hydrogenMass' : 4*unit.amu }
system_generator = SystemGenerator(small_molecule_forcefield='espaloma-0.2.2', forcefield_kwargs=forcefield_kwargs)
system = system_generator.create_system(openmm_topology, molecules=[molecule])

# Create OpenMM Context                                                                                                                                                                                                            
from simtk import unit
import openmm
temperature = 300 * unit.kelvin
collision_rate = 1.0 / unit.picoseconds
timestep = 4.0 * unit.femtoseconds
integrator = openmm.LangevinMiddleIntegrator(temperature, collision_rate, timestep)
context = openmm.Context(system, integrator)
context.setPositions(molecule.conformers[0])

# Minimize                                                                                                                                                                                                                         
openmm.LocalEnergyMinimizer.minimize(context)

# Write structure                                                                                                                                                                                                                  
from openmm.app import PDBFile
positions = context.getState(getPositions=True).getPositions(asNumpy=True)
with open('aniline-minimized.pdb', 'wt') as outfile:
    PDBFile.writeFile(openmm_topology, positions, outfile)

# Simulate                                                                                                                                                                                                                         
print('Simulating...')
integrator.step(25000)

# Write structure                                                                                                                                                                                                                  
from openmm.app import PDBFile
positions = context.getState(getPositions=True).getPositions(asNumpy=True)
with open('aniline-100ps.pdb', 'wt') as outfile:
    PDBFile.writeFile(openmm_topology, positions, outfile)
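For reference, a simpler espaloma-only route (mirroring the README example) would load the checkpoint directly onto the CPU; this is only a sketch and does not by itself address the libxsmm kernel failure, which occurs inside DGL's CPU SpMM path, but it takes openmmforcefields out of the picture:

import torch
import espaloma as esp
from openff.toolkit.topology import Molecule

molecule = Molecule.from_smiles("Nc1ccccc1")  # aniline
molecule_graph = esp.Graph(molecule)

# force the checkpoint onto the CPU (assumes espaloma-0.2.2.pt is available locally)
espaloma_model = torch.load("espaloma-0.2.2.pt", map_location=torch.device("cpu"))
espaloma_model.eval()

espaloma_model(molecule_graph.heterograph)
openmm_system = esp.graphs.deploy.openmm_system_from_graph(molecule_graph)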

Information about the input graph

Hi,

I am a bit confused by the graph that is given as input; could you clarify it for me? We provide a Graph with the attributes n1, n2, n3, n4, which are atoms, bonds, angles, and torsion angles, correct? What are the properties/features? For n1 I see it has 'xyz', 'idxs', and 'h0'. 'xyz' must be the 3D coordinates; what about 'idxs' and 'h0'? I am similarly confused about the other fields. Could you please point me to the readme where this information is provided, or just break down the typical input for me here in terms of graph attributes/parameter names?
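In the meantime, one quick way to see what each node type carries is to print the DGL node data dictionaries. A small sketch follows; my reading is that 'idxs' holds the atom indices that make up each term and 'h0' holds the initial atom features, but that is an assumption worth confirming against the source:

import espaloma as esp
from openff.toolkit.topology import Molecule

g = esp.Graph(Molecule.from_smiles("CCO"))
for ntype in ["n1", "n2", "n3", "n4"]:
    # each node type has its own data dict of tensors
    data = g.heterograph.nodes[ntype].data
    print(ntype, {key: tuple(value.shape) for key, value in data.items()})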

Improve test robustness to http errors

I've seen this error pop up in the tests which ping qcarchive:

E Could not connect to server https://api.qcarchive.molssi.org:443/, please check the address and try again.

We should add a retry decorator to the test so pytest can run it a few times before we throw a failure. We can also mark the test as "if this fails because of http, don't mark the CI as a failure".
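A minimal sketch of such a retry decorator (plain Python, not tied to any particular plugin; pytest-rerunfailures' @pytest.mark.flaky(reruns=3) would be an off-the-shelf alternative):

import functools
import time

def retry_on_connection_error(n_attempts=3, wait_seconds=5):
    """Retry a flaky, network-bound test a few times before letting the failure propagate."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(n_attempts):
                try:
                    return func(*args, **kwargs)
                except ConnectionError:
                    if attempt == n_attempts - 1:
                        raise
                    time.sleep(wait_seconds)
        return wrapper
    return decorator

@retry_on_connection_error(n_attempts=3)
def test_qcarchive_fetch():
    ...  # body of a test that pings qcarchive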

Key errors for n4_impropers with non-empty suffix when attempting to compute energy using reference forcefield

A near-term workaround might be to double-check we're not using suffixes -- a slightly more involved fix will be to track down the source of these specific errors, or to refactor so that this kind of runtime key error is less likely.

from espaloma.graphs.graph import Graph
import espaloma as esp
import torch

ff = esp.graphs.legacy_force_field.LegacyForceField("openff-1.2.0")


## Behavior on molecule NOT containing impropers 
smi_not_containing_impropers = '[H]C1(C(C(O1)([H])[H])(C([H])([H])O[H])O[H])[H]'
graph = Graph(smi_not_containing_impropers)
parameterized_graph = ff.parametrize(graph)
heterograph = parameterized_graph.heterograph

atoms = heterograph.nodes['n1']
n_atoms = len(atoms.data['idxs'])

atoms.data['xyz'] = torch.randn(n_atoms, 100, 3)

_ = esp.mm.geometry.geometry_in_graph(heterograph)

# fine
_ = esp.mm.energy.energy_in_graph(heterograph, suffix='_ref', terms=["n2", "n3", "n4"])

# fine
for i in [2, 3, 4]:
    _ = heterograph.ndata[f'u_n{i}_ref']['g']

# missing key u_n4_improper_ref
try:
    _ = esp.mm.energy.energy_in_graph(heterograph, suffix='_ref', terms=["n4_improper"])
except KeyError as e:
    print('key error on molecule NOT containing impropers', e)

    
## Behavior on molecule CONTAINING impropers 
smi_containing_impropers = '[H]c1c(c2c(c(c1[H])C([H])([H])[H])C(=O)N(C(C(C2([H])[H])([H])[H])([H])[H])[H])[H]'
graph = Graph(smi_containing_impropers)
parameterized_graph = ff.parametrize(graph)
heterograph = parameterized_graph.heterograph
    
atoms = heterograph.nodes['n1']
n_atoms = len(atoms.data['idxs'])

atoms.data['xyz'] = torch.randn(n_atoms, 100, 3)

_ = esp.mm.geometry.geometry_in_graph(heterograph)

# fine
_ = esp.mm.energy.energy_in_graph(heterograph, suffix='_ref', terms=["n2", "n3", "n4"])

# fine
for i in [2, 3, 4]:
    _ = heterograph.ndata[f'u_n{i}_ref']['g']

# missing key k_ref
try:
    _ = esp.mm.energy.energy_in_graph(heterograph, suffix='_ref', terms=["n4_improper"])
except KeyError as e:
    print('key error on molecule CONTAINING impropers', e)

Outputs:

key error on molecule NOT containing impropers 'u_n4_improper_ref'
key error on molecule CONTAINING impropers 'k_ref'

[Question] Datasets for reproduction of the QM results in the paper

*Edit: This has been resolved, it was my bad, see comment below.

Hello,

I am trying to reproduce the QM-fitting results from the table in figure 4a) from the paper (https://arxiv.org/abs/2010.01196). There is also information on the dataset in the table, e.g. that the PepConf set that was used has 736 molecules and 22154 snapshots in total. It is stated that the datasets can be obtained from QCArchive by filtering out snapshots with energies more than 0.1 Hartree higher than the minima.

However, the OptimizationDataset "OpenFF PEPCONF OptimizationDataset v1.0" that can be downloaded from QCArchive has 937 molecules with 50559 snapshots in total (50555 after filtering). And the PepConf dataset that is provided in the QM fitting tutorial (https://espaloma.wangyq.net/experiments/qm_fitting.html) has 631 molecules and 89073 snapshots in total. Thus I guess that another dataset was used to obtain the results from figure 4a).

How can the exact datasets that were used for QM-fitting in the paper be obtained?

Thanks!


Espaloma trained on QCArchive-derived bond/angle parameters

We have derived molecule-specific bond/angle force field parameters for ~20K molecules hosted on QCArchive.

We would like to use these data to train espaloma to assign bond/angle parameters to any organic molecule outside the training set.

Please can you let us know if this sounds possible, and if so how we can get started with the code?

@djcole56
@jthorton

Discrepancies in proper torsion energies between espaloma.energy_in_graph and OpenMM when initialized from smirnoff forcefield

I just noticed a mismatch between OpenMM and espaloma when computing proper torsion potentials, at least when the torsion parameters come from LegacyForceField (currently unsure whether this mismatch also occurs when the parameters are not obtained through LegacyForceField). (See the gist for details.)

If we instantiate a parameterized espaloma graph using

from espaloma.graphs.graph import Graph
import espaloma as esp
from openforcefield.topology import Molecule

forcefield_name = "openff-1.2.0"

ff = esp.graphs.legacy_force_field.LegacyForceField(forcefield_name)

smi_not_containing_impropers = '[H]C1(C(C(O1)([H])[H])(C([H])([H])O[H])O[H])[H]'
offmol = Molecule.from_smiles(smi_not_containing_impropers)
graph = Graph(smi_not_containing_impropers)
parameterized_graph = ff.parametrize(graph)
heterograph = parameterized_graph.heterograph

then assign xyz coordinates, and compute potential energies using

_ = esp.mm.geometry.geometry_in_graph(heterograph)
_ = esp.mm.energy.energy_in_graph(heterograph, suffix='_ref', terms=["n2", "n3", "n4"])
u_esp = heterograph.ndata['u_ref']['g'].detach()

then there is a discrepancy in the overall reported energy compared with OpenMM's getPotentialEnergy(). Bond and angle energies agree, but the periodic torsion force shows discrepancies (comparison plots omitted here).

Possible explanations:

Either the problem is in "functional form" (how the periodic torsion potential is computed from (x1, x2, x3, x4, ks)), or in "parameter cloning" (how parameters are duplicated from a reference forcefield), or perhaps in some other intermediate step I haven't identified yet.

  • "Functional form" -- seems very unlikely to be the culprit, since unit tests stringently assert consistency with OpenMM PeriodicTorsionForce here, given identical input parameters and random geometries. However, if there is a bug here, it is of high severity.

    • Geometry:

      def dihedral(
          x0: torch.Tensor, x1: torch.Tensor, x2: torch.Tensor, x3: torch.Tensor
      ) -> torch.Tensor:
          """ Dihedral between four points.

          Reference
          ---------
          Closely follows implementation in Yutong Zhao's timemachine:
          https://github.com/proteneer/timemachine/blob/1a0ab45e605dc1e28c44ea90f38cb0dedce5c4db/timemachine/potentials/bonded.py#L152-L199
          """
          # check input shapes
          assert x0.shape == x1.shape == x2.shape == x3.shape

          # compute displacements 0->1, 2->1, 2->3
          r01 = x1 - x0
          r21 = x1 - x2
          r23 = x3 - x2

          # compute normal planes
          n1 = torch.cross(r01, r21)
          n2 = torch.cross(r21, r23)

          rkj_normed = r21 / torch.norm(r21, dim=-1, keepdim=True)

          y = torch.sum(torch.mul(torch.cross(n1, n2), rkj_normed), dim=-1)
          x = torch.sum(torch.mul(n1, n2), dim=-1)

          # choose quadrant correctly
          theta = torch.atan2(y, x)

          return theta
    • Functional form:

      def periodic(
          x, k, periodicity=list(range(1, 7)), phases=[0.0 for _ in range(6)]
      ):
          """ Periodic term.

          Parameters
          ----------
          x : `torch.Tensor`, `shape=(batch_size, 1)`
          k : `torch.Tensor`, `shape=(batch_size, number_of_phases)`
          periodicity : either list of length number_of_phases, or
              `torch.Tensor`, `shape=(batch_size, number_of_phases)`
          phases : either list of length number_of_phases, or
              `torch.Tensor`, `shape=(batch_size, number_of_phases)`
          """
          if isinstance(phases, list):
              phases = torch.tensor(phases, device=x.device)

          if isinstance(periodicity, list):
              periodicity = torch.tensor(
                  periodicity, device=x.device, dtype=torch.get_default_dtype(),
              )

          if periodicity.ndim == 1:
              periodicity = periodicity[None, None, :].repeat(
                  x.shape[0], x.shape[1], 1
              )
          elif periodicity.ndim == 2:
              periodicity = periodicity[:, None, :].repeat(1, x.shape[1], 1)

          if phases.ndim == 1:
              phases = phases[None, None, :].repeat(x.shape[0], x.shape[1], 1,)
          elif phases.ndim == 2:
              phases = phases[:, None, :].repeat(1, x.shape[1], 1,)

          n_theta = periodicity * x[:, :, None]
          n_theta_minus_phases = n_theta - phases
          cos_n_theta_minus_phases = n_theta_minus_phases.cos()

          k = k[:, None, :].repeat(1, x.shape[1], 1)

          energy = (k * (1.0 + cos_n_theta_minus_phases)).sum(dim=-1)

          return energy
  • "Parameter cloning" -- The step that clones parameters from a labeled molecule into an espaloma graph is possibly error prone. If an error here is the sole source of discrepancies, it would have minimal impact on results since we only report only a comparison to reference bond and angle terms (not torsion terms), and don't use this parameter cloning logic for anything else.

    • Torsion parameters from smirnoff forcefield:

      def apply_torsion(node, n_max_phases=6):
          phases = torch.zeros(
              g.heterograph.number_of_nodes("n4"), n_max_phases,
          )
          periodicity = torch.zeros(
              g.heterograph.number_of_nodes("n4"), n_max_phases,
          )
          k = torch.zeros(g.heterograph.number_of_nodes("n4"), n_max_phases,)

          force = forces["ProperTorsions"]

          for idx in range(g.heterograph.number_of_nodes("n4")):
              idxs = tuple(node.data["idxs"][idx].numpy())
              if idxs in force:
                  _force = force[idxs]
                  for sub_idx in range(len(_force.periodicity)):
                      if hasattr(_force, "k%s" % sub_idx):
                          k[idx, sub_idx] = getattr(
                              _force, "k%s" % sub_idx
                          ).value_in_unit(esp.units.ENERGY_UNIT)

                          phases[idx, sub_idx] = getattr(
                              _force, "phase%s" % sub_idx
                          ).value_in_unit(esp.units.ANGLE_UNIT)

                          periodicity[idx, sub_idx] = getattr(
                              _force, "periodicity%s" % sub_idx
                          )

          return {
              "k_ref": k,
              "periodicity_ref": periodicity,
              "phases_ref": phases,
          }

      g.heterograph.apply_nodes(apply_torsion, ntype="n4")
  • Some other unspecified error -- Possibly there's another error-prone step between having correct parameters written on each n4 node and computing potential energies.

atom typing dataset

@maxentile actually, do you mind opening the PR back up again for the data pipeline for atom types?

I just figured that it would be faster since you're much more familiar with those. I'll continue my model shopping for now.

Ensure potential energy terms are defined consistently

  • Coulomb missing from energy.py (present in energy_ii.py)
  • Spring constant k parameter unused in angle and bond terms in energy_ii.py
  • Periodic torsion terms in energy.py should be updated (following #1)
  • Add reference for higher-order coupling terms, and compare to that reference
  • Exceptions and exclusions are always tricky and should be double-checked
  • Add test for energy and force consistency with OpenMM

Depending on refactoring cost:

  • Switch from k, eq naming for interactions that don't have spring constants or equilibrium lengths

[Question] Evaluating Amber ff14SB and espaloma on Pepconf

Hello,

I am trying to reproduce the QM-fitting results from the table in figure 4a) from the paper (https://arxiv.org/abs/2010.01196). There, you compare espaloma with Amber ff14SB as baseline on the dataset that you created by sampling from optimization trajectories of pepconf (https://github.com/aoterodelaroza/pepconf).


I am wondering how you applied Amber ff14SB to the dataset since, to my knowledge, this force field requires information on the protein residues of the individual atoms. These can neither be found in the "pepconf"-dataset that is provided in the QM fitting tutorial (https://espaloma.wangyq.net/experiments/qm_fitting.html) nor in the one on qcarchive.

Also, the pepconf dataset on qcarchive is currently incomplete since a majority of the trajectories there cannot be loaded due to convergence issues (openforcefield/qca-dataset-submission#322) and the one from the tutorial mentioned above has a different number of molecules/snapshots than given in figure 4a) of the paper.

Any help on the dataset and obtaining the Amber ff14SB energies for reproducing the pepconf results of the paper would be appreciated!

[Question] Retraining espaloma model to include nonbonded potentials

Hello, I'm trying to retrain the espaloma model on my own data and I was wondering how I can extract nonbonded potentials. As far as I can see in the GitHub repo, in espaloma/mm/energy.py the application of nonbonded energies is commented out. So my questions are:

  1. How can I retrain the model to also include the nonbonded energy calculation?
  2. How can I then extract them from the graph after applying the model to a new molecule?

Thanks!

Incorrect Lennard-Jones functional form

There’s a bug in the Lennard-Jones functional form here

    return epsilon * (
        coefficients[0] * sigma_over_x ** order[0]
        - coefficients[1] * sigma_over_x ** order[1]
    )

The prefactor should be 4*epsilon, not just epsilon.
This wouldn’t impact error statistics for fitting energies, but if you’re looking at MM parameter RMSE for LJ, it would matter.
Likely does not impact any experiments you’re running now.
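For reference, the standard 12-6 form with the 4*epsilon prefactor looks like this (a generic sketch, not a patch against the espaloma source):

def lj_12_6(x, epsilon, sigma):
    # standard Lennard-Jones: u(r) = 4 * epsilon * ((sigma/r)**12 - (sigma/r)**6)
    sigma_over_x = sigma / x
    return 4.0 * epsilon * (sigma_over_x ** 12 - sigma_over_x ** 6)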

Missing Proper Torsion Terms

Espaloma throws an error when trying to parameterize a molecule with a C-C-P-C torsion. I am wondering whether this is merely an error from a lack of training data on molecules containing phosphorus, or whether there are certain elements that espaloma cannot handle, such as elements with d-orbitals.

Environment yaml file with explicit versions

Hi everyone,

Thanks for the great project and repo! I was wondering if you could provide an environment yaml file that has explicit versions of each package. The reason is that I'm having version trouble with packages like DGL (I get cuda-related errors after building the environment from this file; installing dgl-cuda then gives me DGL version 2.X, but the syntax of espaloma seems to only work with DGL version 1.X). I think this should be as easy as taking one of your working espaloma conda environments, and exporting it with conda env export > environment.yml.

Thanks!

Simon

Choice of regression target?

Currently the regression target is the total energy of each (molecule, configuration) pair, including the prediction of a geometry-independent per-molecule offset and prediction of geometry-dependent "strain" energy. However, for the QCArchive subset @yuanqing-wang is looking at, the variation of the per-molecule offsets initially appears much larger in magnitude than the conformation-dependent variation within a molecule's collection of snapshots.

Should we do something to decompose the variance into these two components, i.e. (1) predict the constant offset for each molecule, and (2) assuming away the constant offset, predict geometry-dependent strain energies for a given molecule? (To target (1), we can assume away any geometry-dependence and try to predict just the energy of a molecule's global minimum snapshot from its topology. To target (2), we can assume away any constant offset and try to minimize standard deviation of the residuals.)

Also, the energy prediction currently does not include electrostatic contribution. Should the regression target be something other than total energy? (Initially, it seems reasonable to target the valence contributions, for example by targeting QM total energy minus a MM-predicted nonbonded contribution, where the MM prediction uses Parsley's partial charges, sigmas, epsilons, combining rules, and exceptions.)
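A minimal sketch of the decomposition described above, treating each molecule separately and using its minimum-energy snapshot as the constant offset (one possible convention among several):

import torch

def split_offset_and_strain(u):
    """Split per-snapshot energies of one molecule into a constant offset and strain.

    u : torch.Tensor, shape (n_snapshots,), energies for a single molecule
    """
    offset = u.min()          # geometry-independent per-molecule offset
    strain = u - offset       # geometry-dependent "strain" energies
    return offset, strain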

Update install instructions

Now that espaloma is on conda-forge, we can modify the install instructions to install espaloma from conda-forge (but dgl is not, so another channel will still be required to pull it in).
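One possible form the updated instructions could take, assuming dgl is pulled from its own dglteam channel (channel names and version pinning would need to be verified):

$ conda install -c conda-forge -c dglteam espaloma dgl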

Can I install espaloma by conda?

Hi!

I'm trying to use espaloma with openmmforcefield.

I'm currently working on the google colab, and I want to install espaloma via conda.

However, the installation failed.

The code I used and the error message are the following:

!pip install -q condacolab
import condacolab
condacolab.install()
import sys
print(sys.version)
!conda config --set always_yes yes
!conda config --add channels conda-forge
!mamba install git espaloma
Looking for: ['git', 'espaloma']
Encountered problems while solving.
Problem: nothing provides requested espaloma

Is there any possible way to install espaloma on colab?

Saving prmtop files from espaloma parameterization

Hello,

I am wondering how we can save the espaloma FF parameters (bonded and nonbonded) to files for use in standalone Amber simulations. For example, what do I need to do after this step:

openmm_system = esp.graphs.deploy.openmm_system_from_graph(molecule_graph,charge_method='am1-bcc')

in order to save ligand.prmtop, ligand.inpcrd, ligand.off, or ligand.frcmod files?

This is not clear from the documentation.

Thanks,
Marawan
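One route (not espaloma-specific) is to convert the OpenMM System with ParmEd. A hedged sketch, assuming molecule is the OpenFF Molecule with a generated conformer and openmm_system is the System returned by openmm_system_from_graph:

import parmed

# build a ParmEd Structure from the OpenMM topology + System
openmm_topology = molecule.to_topology().to_openmm()
structure = parmed.openmm.load_topology(
    openmm_topology,
    system=openmm_system,
    xyz=molecule.conformers[0],  # may need converting to an OpenMM Quantity in some toolkit versions
)

# write Amber files
structure.save("ligand.prmtop", overwrite=True)
structure.save("ligand.inpcrd", overwrite=True)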

Add SMIRNOFF improper handling

Currently, espaloma implements a six-fold improper torsion scheme that includes pairs of improper torsions that mostly, but not completely, cancel out (illustration omitted).
This differs from the SMIRNOFF improper torsion definition, which includes three impropers in a consistent ordering to ensure they gracefully handle non-planar restraints.

We should implement an additional mode with the SMIRNOFF improper definition, and see how things change when we refit the espaloma model.

This will also allow us to export the model to OpenMM via openmmforcefields:

cc: openmm/openmmforcefields#182 (comment)
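For reference, a rough sketch of the SMIRNOFF-style enumeration, assuming the central atom sits in the second position of each improper and the barrier height is split evenly over the three permutations (worth checking against the SMIRNOFF spec before relying on it):

def smirnoff_improper_permutations(central, neighbors):
    """Enumerate the three SMIRNOFF 'trefoil' impropers around one central atom.

    central   : index of the central atom (placed second in each torsion)
    neighbors : the three atom indices bonded to it
    """
    a, b, c = neighbors
    # three cyclic orderings of the outer atoms; each would carry 1/3 of the barrier height
    return [(a, central, b, c), (b, central, c, a), (c, central, a, b)]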

Total charge incorrect

Hi, I have been testing out espaloma using the master branch and the newest set of model parameters, and have found that the total molecular charge predicted by the network always seems to be neutral, even for charged molecules; attached is a simple example using ammonium. I am also unsure whether there is an easier way to inspect charges than via an OpenMM system.

from openff.toolkit.topology import Molecule
import torch
import espaloma as esp
import numpy as np

mol = Molecule.from_smiles("[NH4+]")
esp_model = torch.load("espaloma-0.2.2.pt")
mol_graph = esp.Graph(mol)
esp_model(mol_graph.heterograph)

system = esp.graphs.deploy.openmm_system_from_graph(mol_graph, charge_method="nn")
# this seems to be the nonbonded force
force = system.getForce(1)
charges = []
for i in range(force.getNumParticles()):
    charge, _, _ = force.getParticleParameters(i)
    charges.append(charge) 
np.sum(charges)

Quantity(value=0.0, unit=elementary charge)
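A possibly easier way to inspect the charges is to read them straight off the graph after the model has been applied; the node data key "q" here is an assumption on my part and may differ between espaloma versions:

# continuing from the snippet above, after esp_model(mol_graph.heterograph)
charges = mol_graph.heterograph.nodes["n1"].data["q"]
print(charges, charges.sum())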

[Question] Getting Free Energy results and plots

Hello - I've been trying to obtain results in the format shown in the Espaloma paper in Figure 6. That is, the free energy values and the plots using the pretrained Espaloma model (which I imagine is the one located at http://data.wangyq.net/espaloma_model.pt). In the last couple of days, I haven't been able to piece together the code examples, but I've been looking through Espaloma files, Perses examples, as well as Openmmtools examples.

My particular use case has a PDB file and 1 or more ligand SDFs.

Any help or orientation with this would be appreciated.

how to implement data object

let's discuss here how we should design the dataset object to host training data that contains:

  • molecular graph
  • coordinates

Ideally, we want to be able to load large datasets while implementing ways to shuffle, batch, and sample. Note that there are many tricks we can do to sample the dataset since it's naturally partitioned by molecular graph.

(A subclass of torch.utils.data.DataLoader would be nice, although we would need to hack it to make it work with dgl.Graph.)

Also, I think we need to think about the possibility of distributing it across machines. (We can ignore this for now, but I think making it compatible with torch.nn.DataParallel would make large-scale training faster.)

Regarding the energy terms in the dataset, for QM datasets, we would need to include QM energies.

@maxentile argued that we should also include terms in the factor graph, namely bond, angle, and non-bonded energy.

I'd suggest that, for MM datasets, for the sake of simplicity, we omit these terms, since we would have MM energy functions anyway to do:

u_g = u(g, x)

and thus we can have, in the training stage, depending on what target we're fitting, either

loss = loss_fn(g_ref, g_hat)

or

loss = loss_fn(u(g_ref, x), u(g_hat, x))

which I think is easier to debug.

how should we do batching across molecules and snapshots?

In order for the experiments to finish in a reasonable time, we need to batch datasets of molecules whose trajectories have various lengths into larger graphs. A few options:

  • Set a small batch size, and break the longer trajectories into chunks of that size, treat them as independent graphs, and then batch together among the species. (See #51 )
  • Set a large batch size, filter out all the trajectories shorter than that, and batch the rest among the species. (Like what we did before.)
  • Have multiple bins and various batch sizes.
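A small sketch of the first option, chunking one trajectory into fixed-size blocks of snapshot indices so chunks from different molecules can be batched together (the ragged remainder is dropped here for simplicity):

def chunk_trajectory(n_snapshots, chunk_size):
    """Split one molecule's trajectory into fixed-size chunks of snapshot indices."""
    return [
        list(range(start, start + chunk_size))
        for start in range(0, n_snapshots - chunk_size + 1, chunk_size)
    ]

# e.g. a 10-snapshot trajectory with chunk_size=4 -> [[0, 1, 2, 3], [4, 5, 6, 7]]
print(chunk_trajectory(10, 4))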

[Question] How does espaloma compare to the equivariant forcefield networks?

I was wondering whether you have any experience with, or comparisons of, how espaloma does relative to equivariant networks like, for instance, NequIP (https://github.com/mir-group/nequip).

I'm by no means an expert, but from my understanding these two neural networks basically do the same thing. They both take atomic positions and types as input and calculate the potential energy/force field as output. The difference is that espaloma is designed to imitate the conventional force terms using bonds, angles, and torsions, while NequIP accounts for having the right SE(3) symmetries through equivariance, but doesn't otherwise try to imitate any of the conventional force terms.

Have you guys done any comparisons or what are your thoughts on their approach, and do you believe that a combination of the two might be even more beneficial (an equivariant espaloma)?

Trying to train on gen2 dataset

Hi,

I am trying to overfit espaloma to a small batch from the gen2 dataset. I noticed that the reference energy u_ref is large and negative:

ds = esp.data.dataset.GraphDataset.load("gen2")
ds = ds[:10]
ds.shuffle(seed=2666)
ds_tr, ds_vl, ds_te = ds.split([5, 3, 2])

ds_tr_loader = ds_tr.view(batch_size=1, shuffle=True)
ds_vl_loader = ds_vl.view(batch_size=1, shuffle=True)

g_tr = next(iter(ds_tr.view(batch_size=1)))

g_tr = next(iter(ds_tr.view(batch_size=1)))
torch.mean(g_tr.nodes["g"].data['u_ref'], dim=1)
tensor([-1988.4373])

(side note, when I try to increase the batch size I get the following error)

Expect all graphs to have the same schema on nodes["g"].data, but graph 1 got
	{'u_openff-1.2.0': Scheme(shape=(29,), dtype=torch.float32), 'u_gaff-2.11': Scheme(shape=(29,), dtype=torch.float32), 'u_qm': Scheme(shape=(29,), dtype=torch.float32), 'u_ref': Scheme(shape=(29,), dtype=torch.float32), 'u_gaff-1.81': Scheme(shape=(29,), dtype=torch.float32)}
which is different from
	{'u_openff-1.2.0': Scheme(shape=(77,), dtype=torch.float32), 'u_gaff-2.11': Scheme(shape=(77,), dtype=torch.float32), 'u_qm': Scheme(shape=(77,), dtype=torch.float32), 'u_ref': Scheme(shape=(77,), dtype=torch.float32), 'u_gaff-1.81': Scheme(shape=(77,), dtype=torch.float32)}.

The model that I am using is initialized the following way:

representation = esp.nn.Sequential(
    layer=esp.nn.layers.dgl_legacy.gn("SAGEConv"), # use SAGEConv implementation in DGL
    config=[128, "relu", 128, "relu", 128, "relu"], # 3 layers, 128 units, ReLU activation
)

readout = esp.nn.readout.janossy.JanossyPooling(
    in_features=128, config=[128, "relu", 128, "relu", 128, "relu"],
    out_features={              # define modular MM parameters Espaloma will assign
        1: {"e": 1, "s": 1}, # atom hardness and electronegativity
        2: {"log_coefficients": 2}, # bond linear combination, enforce positive
        3: {"log_coefficients": 2}, # angle linear combination, enforce positive
        4: {"k": 6}, # torsion barrier heights (can be positive or negative)
    },
)

espaloma_model = torch.nn.Sequential(
                 representation, readout, 
                 esp.nn.readout.janossy.ExpCoefficients(),
                 esp.mm.geometry.GeometryInGraph(),
                 esp.mm.energy.EnergyInGraph(),
                 #esp.mm.energy.EnergyInGraph(suffix="_ref"),
                 esp.nn.readout.charge_equilibrium.ChargeEquilibrium(),
)

Now I am trying to overfit, training with the following loop:

normalize = esp.data.normalize.ESOL100LogNormalNormalize()
for idx_epoch in tqdm(range(2000)):
    intr_loss = 0
    k = 0
    for g in ds_tr_loader:
        optimizer.zero_grad()
        if torch.cuda.is_available():
            g = g.to("cuda:0")
        g = espaloma_model(g)
        g = normalize.unnorm(g)
            
        loss = loss_fn(g)
        loss.backward()
        optimizer.step()
        intr_loss += loss.item()

I am using the following loss function:

loss_fn = esp.metrics.GraphMetric(
        base_metric=torch.nn.MSELoss(),
        between=['u', "u_ref"],
        level="g",)

After training, the training loss curve (epochs on the x-axis; plot omitted) behaves as follows.

The loss gets stuck at ~1.4M when you would expect it to be close to 0 (since I am training only on 5 examples). The energy for individual examples converges at some small positive value:

g_tr = espaloma_model(g_tr)
g_tr.nodes["g"].data["u"]
1.2619

If I do the same on the pepconf dataset (peptides), I get similar results. The output of espaloma is on a different scale.

My question is, what am I doing wrong? Is it the model architecture? The normalizer? Or something else? I would appreciate any help.

Thanks!

object-oriented handling of heterograph without subclassing

https://discuss.dgl.ai/t/subclass-dgl-heterograph/997/3

Subclassing dgl.DGLHeteroGraph is not recommended. For now, I will simply put heterographs as attributes of espaloma.HeterogeneousGraph, thereby enabling message passing with syntax like:

g = espaloma.HeterogeneousGraph(...)
GN = espaloma.nn.SomeModel
GN(g._graph)

since modifications of DGLHeteroGraph can be in-place.

The reasoning behind this is to make parametrization and energy evaluation semantics simpler.
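A minimal sketch of the composition-over-inheritance idea (illustrative only; the real espaloma class carries more state and helper methods):

class HeterogeneousGraph:
    """Own a DGL heterograph as an attribute instead of subclassing it."""

    def __init__(self, heterograph):
        self._graph = heterograph

    @property
    def heterograph(self):
        return self._graph

# usage: models keep operating on the underlying DGL object
# g = HeterogeneousGraph(some_dgl_heterograph)
# y = some_model(g._graph)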

Add support for espaloma parameterization of small molecules

We are planning to add support for using espaloma to parameterize small molecules by creating a new EspalomaTemplateGenerator based on the SMIRNOFFTemplateGenerator.

Both espaloma and SMIRNOFF generate OpenMM System objects, so we can refactor this code to share the same infrastructure for going from System to ffxml_contents strings. I plan to refactor the common capabilities into a SystemTemplateGenerator parent, but this may have to be a mix-in rather than a base class to avoid issues with the automatic discovery of small molecule force field handling capabilities.

We would let the user specify small_molecule_forcefield='espaloma-0.2.0' and check if we have a cached version of espaloma_0.2.0.pt, retrieving it from the release artifact and caching it if not.

Charge Method Question

I just had a couple questions that I ran into when running Espaloma and reading through the docs.

First, it seems like the primary purpose of openmm_system_from_graph is to create an OpenMM System that stores the information (force field parameters) obtained from espaloma. If this is the case, why is there an option to calculate the partial charges separately rather than using the ones given by espaloma? Is there something about the partial charges given by espaloma that might give someone reason to re-calculate them?

Second, according to the doc strings, the default behavior should be to use the partial charges from espaloma, but the actual default behavior is to assign partial charges using the am1-bcc method. Is this a typo and/or mistake? Or is there a reason am1-bcc is chosen as the default instead of nn?

Thank you!

Summaries to include in report.md generators

The report generators in supervised_train.py and supervised_param_train.py are great! They make it much easier to browse results of the numerical experiments @yuanqing-wang has been doing.

A wishlist for things that would be good to include in the future iterations of the report generator:

  • A few other quick summaries that may be useful to add are variance(target), stddev(target), and mean_absolute_error. For example, to compare with MoleculeNet benchmark results on QM9 energy regression task, it would be useful to have MAE. To put the RMSE in context, it would be good to know what is the standard deviation of the target value.
  • In the model summary section, we have a lot of important detail about layer sizes, etc. Could we also add a description of how node, edge, etc. features are initialized? (Currently it says only the input dimension.) It would also be good to describe the loss function in more detail here. The description mentions that loss_fn=mse_loss, but @yuanqing-wang mentions on Slack that this loss is measured on a normalized regression target.
  • For R^2, could you include the definition used, perhaps in a footnote? The reported values are often negative, and I think it is using the definition 1 - (residual sum of squares) / (total sum of squares), as in sklearn.metrics.r2_score, but a reader might reasonably expect one of the other definitions that leads to a non-negative value.
  • For R^2, often the value reported is rounded to 1.00. We might need to use more digits of precision here.
  • Another plot that may be informative to look at is a scatter plot of predictions and targets. (So we can see what is the variance of the target quantity, if there are just a few outliers that are dominating the RMSE summary, etc.).
  • The plots should have axes labeled. In some cases the x-axis is number of optimizer steps, and in some cases the number of epochs. In some cases I think the y-axis is in units of kcal/mol, and in some cases it is measuring error on regression target normalized to have mean 0 and variance 1.
  • In some reports, the final iterate is much worse than the best iterate. For example, in this report, an RMSE of ~5-10 (kcal/mol?) and R^2 of ~1 are attained after 60 epochs, but then the optimizer decided to go way uphill and never come back, and the report includes a table that says the model obtained an RMSE of 150 (kcal/mol?) and R^2 of 0.25. Since we're using an optimizer that doesn't always step in descent directions, could we also add to the summary a description of the best iterate encountered, in addition to the currently summarized last iterate?
