openmm / nnpops


High-performance operations for neural network potentials

License: Other

C++ 44.40% Cuda 26.97% Python 27.45% CMake 1.19%

cuda gpu machine-learning molecular-dynamics molecular-modeling

nnpops's Introduction

OpenMM: A High Performance Molecular Dynamics Library

Introduction

OpenMM is a toolkit for molecular simulation. It can be used either as a stand-alone application for running simulations, or as a library you call from your own code. It provides a combination of extreme flexibility (through custom forces and integrators), openness, and high performance (especially on recent GPUs) that makes it truly unique among simulation codes.

Getting Help

Need Help? Check out the documentation and discussion forums.


nnpops's Issues

Whether to match TorchANI?

The calculations implemented in TorchANI are different in a couple of ways from what's described in the paper. Which one should this code match?

The first difference is that it divides all the radial symmetry functions by 4. A comment in the code notes this is different from the paper, and says they're doing it to be consistent with NeuroChem. I have no idea why NeuroChem does it.

This one isn't that big a deal, since the user can easily multiply or divide the functions by 4 if they need to. The other one is a bigger problem. When computing the angular symmetry functions, it calculates the angle by taking the dot product of two vectors and using acos(). There are two problems you have to be careful of when doing this. Because acos() is singular at 1 and -1, the result becomes inaccurate when it gets too close to those extremes. And if roundoff error should ever cause it to go outside that range, the result is NaN.

This problem is easy to deal with. You detect when the dot product is too close to 1 or -1, and switch over to using asin() to calculate the angle from the cross product. This produces accurate results over the entire range. Unfortunately, that isn't what TorchANI does.

Instead, it uses the hack of just multiplying all dot products by 0.95. This produces very inaccurate results near the extremes. For example, if the true angle is exactly 0, this will instead produce an angle of about 18 degrees. It also distorts the results by a smaller amount even for angles that are far from the extremes.
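To make the difference concrete, here is a minimal Python sketch comparing the scaled acos() shortcut with the acos()/asin() switch described above. It is not TorchANI's or NNPOps's code: the function names and the 0.99 switching threshold are assumptions chosen for illustration.

```python
import math

def _dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def _cross(u, v):
    return (u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0])

def _norm(u):
    return math.sqrt(_dot(u, u))

def angle_scaled_acos(u, v, scale=0.95):
    """Angle from acos() after scaling the cosine (the shortcut described above)."""
    cos_theta = _dot(u, v) / (_norm(u) * _norm(v))
    return math.acos(scale * cos_theta)

def angle_robust(u, v, threshold=0.99):
    """Angle from acos() away from the extremes, asin() of the cross product near them.

    The threshold value is an assumption, not taken from any existing code.
    """
    nu, nv = _norm(u), _norm(v)
    cos_theta = _dot(u, v) / (nu * nv)
    if abs(cos_theta) < threshold:
        return math.acos(cos_theta)
    # Near +/-1 the sine is small and well conditioned; clamp against roundoff.
    sin_theta = min(1.0, _norm(_cross(u, v)) / (nu * nv))
    angle = math.asin(sin_theta)
    return angle if cos_theta > 0 else math.pi - angle

same = ((1.0, 0.0, 0.0), (1.0, 0.0, 0.0))
print(math.degrees(angle_scaled_acos(*same)))  # roughly 18 degrees instead of 0
print(math.degrees(angle_robust(*same)))       # exactly 0 for identical vectors
```

For two identical vectors the scaled version returns acos(0.95), which is where the "about 18 degrees" figure comes from, while the switched version returns 0.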

What behavior should we implement here? Should we match what's in the paper or what's implemented in TorchANI?

Add continuous integration for NNPOps

Once we restructure to add a setup.py or create the conda recipe, we could easily use this for CI

Unifying code formatting

We could enforce some agreement on formatting throughout the code using an automatic tool, such as clang-format for C++ and black for Python.
A .clang-format file at the root of a project is recognized automatically by most IDEs and editors (VS Code, vim, emacs...). It can also be integrated as a git pre-commit hook, although this can be a nuisance for contributors, so we could just make it a guideline.
Finally, the clang-format binary can simply be called from the command line.
From working through the OpenMM and OpenMM-torch codebases I have crafted a .clang-format file that leaves most of the current code untouched:

IndentAccessModifiers: 'false'
AccessModifierOffset: -4
IndentWidth: 4
PointerAlignment: Left
AllowShortFunctionsOnASingleLine: 'None'
ColumnLimit: 200
BreakBeforeBraces: Custom
BraceWrapping:
    BeforeCatch: true
SortIncludes: false
SortUsingDeclarations: false
IndentPPDirectives: 'BeforeHash'
Standard: 'Cpp11'

To give you some examples, these lines:

template <typename scalar_t> __device__ __forceinline__ scalar_t sqrt_(scalar_t x) {};
template<> __device__ __forceinline__ float sqrt_(float x) { return ::sqrtf(x); };
template<> __device__ __forceinline__ double sqrt_(double x) { return ::sqrt(x); };

Turn to this:

template <typename scalar_t> __device__ __forceinline__ scalar_t sqrt_(scalar_t x){};
template <> __device__ __forceinline__ float sqrt_(float x) {
    return ::sqrtf(x);
};
template <> __device__ __forceinline__ double sqrt_(double x) {
    return ::sqrt(x);
};

While these lines:

    if (box_vectors.size(0) != 0) {
        TORCH_CHECK(box_vectors.dim() == 2, "Expected \"box_vectors\" to have two dimensions");
        TORCH_CHECK(box_vectors.size(0) == 3 && box_vectors.size(1) == 3, "Expected \"box_vectors\" to have shape (3, 3)");
        double v[3][3];
        for (int i = 0; i < 3; i++)
            for (int j = 0; j < 3; j++)
                v[i][j] = box_vectors[i][j].item<double>();

are left untouched.
What do you think?

Efficient error reporting in CUDA

We need to design and integrate an efficient way to detect an error state in a CUDA kernel and capture it on the CPU. In particular, we need to detect when some particle has more neighbours than the maximum allowed, replacing this:

NNPOps/src/pytorch/neighbors/getNeighborPairsCUDA.cu

Line 67 in 8b2d427

assert(i_pair < neighbors.size(1));

which currently leaves the CUDA context in an invalid state, requiring a full context reset.

Additionally, the strategy should be compatible with CUDA graphs.
This is related to this PR #70

The main difficulty here is that there is no way to communicate information between a kernel and the CPU that does not involve a synchronization barrier and a memory copy.

I think we should approach this the way native CUDA error reporting does, by building an error-checking function into the interface that is allowed to synchronize and memcpy.

The class building the list, here

NNPOps/src/pytorch/neighbors/getNeighborPairsCUDA.cu

Line 100 in 8b2d427

class Autograd : public Function<Autograd> {

could own a device array/value storing error states (maybe an enum, or a simple integer). The function building the neighbour list would then atomically set this error state instead of triggering the assertion above.

Then, checking this error state on the CPU should be delayed as much as possible. For instance, before constructing a CUDA graph, a series of error-checking calls to

NNPOps/src/pytorch/neighbors/getNeighborPairsCUDA.cu

Lines 102 to 106 in 8b2d427

    
    static tensor_list forward(AutogradContext* ctx,
                               const Tensor& positions,
                               const Scalar& cutoff,
                               const Scalar& max_num_neighbors,
                               const Tensor& box_vectors) {

with increasing max_num_neighbors could be made to determine an upper bound for it. Then a graph is constructed in such a way that this error state is no longer checked automatically.

This has of course the downside that errors would go silent during a simulation, with the code crashing in an uncontrolled way.
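The proposal above can be sketched in plain Python to show the control flow only. This is a host-side stand-in, not CUDA code: the class name, the error codes, and find_max_num_neighbors are hypothetical, and the integer attribute plays the role of the device-side error value that a real kernel would set atomically.

```python
import math
from itertools import combinations

ERROR_NONE = 0
ERROR_TOO_MANY_NEIGHBORS = 1

class HypotheticalNeighborList:
    """Stand-in for the class that owns the error state (names are assumptions)."""

    def __init__(self, max_num_neighbors):
        self.max_num_neighbors = max_num_neighbors
        self.error_state = ERROR_NONE  # would live in device memory in CUDA
        self.pairs = []

    def build(self, positions, cutoff):
        """Build the pair list; on overflow, record the error instead of asserting."""
        capacity = len(positions) * self.max_num_neighbors
        self.pairs = []
        for i, j in combinations(range(len(positions)), 2):
            if math.dist(positions[i], positions[j]) < cutoff:
                if len(self.pairs) >= capacity:
                    # Replaces the assert: flag the problem and keep going.
                    self.error_state = ERROR_TOO_MANY_NEIGHBORS
                    continue
                self.pairs.append((i, j))

    def check_errors(self):
        """The only place that would need a synchronization and a memory copy."""
        state, self.error_state = self.error_state, ERROR_NONE
        if state == ERROR_TOO_MANY_NEIGHBORS:
            raise RuntimeError("max_num_neighbors is too small for this system")

def find_max_num_neighbors(positions, cutoff, start=1, limit=1024):
    """Try increasing values until a build completes without setting the flag."""
    value = start
    while value <= limit:
        neighbor_list = HypotheticalNeighborList(value)
        neighbor_list.build(positions, cutoff)
        try:
            neighbor_list.check_errors()
            return value
        except RuntimeError:
            value *= 2
    raise RuntimeError("no suitable max_num_neighbors found below the limit")
```

In this sketch build() never raises, so it could be captured in a graph, while check_errors() is called explicitly only where a synchronization is acceptable, for example while searching for the upper bound before graph construction.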

Custom PyTorch operations

Requirements and progress of the custom operations for PyTorch.

Requirements:

ANI symmetry functions with adjustable parameters
Derivatives with respect to atomic coordinates
Compatible with TorchScript
CPU and GPU implementations
Conda package

Relevant documentation:

Batched molecule computation

Hi,

Thanks for the great work!
Just want to check whether batched molecule computation is going to be supported in the future.
If so, the code here could also be used for training.

Performance of the atomic neural networks in TorchANI

End-to-end performance benchmarks of ANI-2x

Molecule: 46 atoms (pytorch/molecules/2iuz_ligand.mol2)
GPU: GTX 1080 Ti

Forward & backward passes with complete ANI-2x:

TorchANI with original featurizer: 90 ms
TorchANI with our featurizer: 81 ms

Just forward pass with complete ANI-2x:

TorchANI with original featurizer: 25 ms
TorchANI with our featurizer: 23 ms

Forward & backward passes with ANI-2x using just one set of the atomic NNs, not 8:

TorchANI with original featurizer: 11 ms
TorchANI with our featurizer: 6.8 ms

Just forward pass with ANI-2x using just one set of the atomic NNs, not 8:

TorchANI with original featurizer: 6.3 ms
TorchANI with our featurizer: 3.7 ms

Originally posted by @raimis in #5 (comment)
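The relative gains are easier to compare as ratios. The short script below only re-expresses the timings listed above as speedup factors; the dictionary labels are shorthand introduced here.

```python
# Timings in milliseconds, copied from the benchmark above:
# (TorchANI with original featurizer, TorchANI with our featurizer)
timings_ms = {
    "forward + backward, 8 sets of atomic NNs": (90.0, 81.0),
    "forward only, 8 sets of atomic NNs": (25.0, 23.0),
    "forward + backward, 1 set of atomic NNs": (11.0, 6.8),
    "forward only, 1 set of atomic NNs": (6.3, 3.7),
}

def speedup(original_ms, ours_ms):
    """Ratio of the original time to the new time (larger is better)."""
    return original_ms / ours_ms

for label, (original_ms, ours_ms) in timings_ms.items():
    print(f"{label}: {speedup(original_ms, ours_ms):.2f}x")
```

With all 8 sets of atomic networks the speedup is about 1.1x, while with a single set it is about 1.6x to 1.7x, which suggests that the atomic networks rather than the featurizer dominate the cost of the complete model.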

Removing the dependency on TorchANI

NNPOps depends on TorchANI but, as discussed in #39, it shouldn't.

@peastman's proposal (#39 (comment)):

The first thing I would do is open one or more issues on the TorchANI repo suggesting these changes and offering to provide code. This class doesn't implement any kind of calculation. It's just a workaround for a particular flaw in TorchANI (that it repeats the calculation on every step). The more of these changes we can push upstream to them, the better.

For anything they don't want to accept, the next thing I would do is move the imports inside the classes. You should be able to import NNPOps even if TorchANI isn't installed. In practice that means we should never import torchani at the top level. It should only be imported inside the specific classes or functions that use it.
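A minimal sketch of the deferred-import pattern described above, assuming a hypothetical wrapper class. The helper _require and the class name are illustrative and are not part of NNPOps; the module name is a parameter only so that the sketch can be exercised without TorchANI installed.

```python
import importlib

def _require(module_name, feature):
    """Import an optional dependency on first use, with a readable error."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"{feature} requires the optional package '{module_name}'"
        ) from exc

class HypotheticalANIWrapper:
    """Stand-in for a class that needs TorchANI only when it is actually used."""

    def __init__(self, module_name="torchani"):
        # Nothing is imported at module import time; the dependency is
        # resolved here, when an instance is created.
        self._backend = _require(module_name, type(self).__name__)
```

Importing a module that defines such a class succeeds whether or not TorchANI is installed. Only constructing the class without the dependency raises ImportError.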

Finally we should consider whether NNPOps is the best place for all of these, or whether, for example, OpenMM-ML would be a better place.

Benchmarking SchNet

I've been trying to figure out how to write a benchmarking script for SchNet. Here's what I have so far with SchNetPack. It loads a PDB file and computes the energy 1000 times with one of the pre-trained QM9 models. I haven't figured out yet how to get it to compute forces, so any advice on that would be appreciated. There probably are other ways this could be improved too.

import torch
import schnetpack as spk
import schnetpack.md.calculators
import sys
import ase.io
import time

device = torch.device('cuda')
model = torch.load("trained_schnet_models/qm9_energy_U0/best_model", map_location=device)

atoms = ase.io.read(sys.argv[1])
system = spk.md.System(1, device=device)
system.load_molecules([atoms])

calculator = spk.md.calculators.SchnetPackCalculator(
    model,
    required_properties=['energy_U0'],
    force_handle=spk.Properties.forces,
    position_conversion='A',
    force_conversion='kcal/mol/A'
)
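inputs = calculator._generate_input(system)
model(inputs)  # warm-up call so one-time setup cost is not timed
torch.cuda.synchronize()  # wait for the warm-up kernels to finish
t1 = time.perf_counter()
for i in range(1000):
    results = model(inputs)
torch.cuda.synchronize()  # CUDA launches are asynchronous; wait before stopping the timer
elapsed = time.perf_counter() - t1
print(results)
print(elapsed)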

Testing a 60 atom system on a Titan V, it takes about 3.6 ms per energy evaluation. Testing a 2269 atom system it runs out of memory on the GPU and crashes.

While the test is running, nvidia-smi shows that the GPU is only 28% busy. nvvp shows a lot of short kernels with larger gaps between them. The two most significant kernels are volta_sgemm_32x128_tn (19.8% of GPU time) and volta_sgemm_32x32_sliced1x4_tn (16% of GPU time). It then gets into a whole lot of kernels with uninformative names like _ZN2at6native6legacy18elementwise_kernelILi128ELi4EZNS0_15gpu_kernel_implIZZZNS0_15add_kernel_cudaERNS_14TensorIteratorEN3c106ScalarEENKUlvE_clEvENKUlvE2_clEvEUlffE_EEvS5_RKT_EUliE2_EEviT1_.

libNNPOpsPyTorch.so: undefined symbol: _ZN3c1012OptionalTypeC1ESt10shared_ptrINS_4TypeEE

Hi,

I installed the NNPOps package from the conda-forge channel with the command:
conda install -c conda-forge nnpops

The installation finished without any issues but when I try to import the required modules, I get the following error message

import torch
import torchani
from NNPOps.SpeciesConverter import TorchANISpeciesConverter
Traceback (most recent call last):
  File "", line 1, in <module>
  File "/home/venkat/miniconda3/envs/py39/lib/python3.9/site-packages/NNPOps/__init__.py", line 1, in <module>
    from NNPOps.OptimizedTorchANI import OptimizedTorchANI
  File "/home/venkat/miniconda3/envs/py39/lib/python3.9/site-packages/NNPOps/OptimizedTorchANI.py", line 28, in <module>
    from NNPOps.BatchedNN import TorchANIBatchedNN
  File "/home/venkat/miniconda3/envs/py39/lib/python3.9/site-packages/NNPOps/BatchedNN.py", line 31, in <module>
    torch.ops.load_library(os.path.join(os.path.dirname(__file__), 'libNNPOpsPyTorch.so'))
  File "/home/venkat/miniconda3/envs/py39/lib/python3.9/site-packages/torch/_ops.py", line 104, in load_library
    ctypes.CDLL(path)
  File "/home/venkat/miniconda3/envs/py39/lib/python3.9/ctypes/__init__.py", line 382, in __init__
    self._handle = _dlopen(self._name, mode)
OSError: /home/venkat/miniconda3/envs/py39/lib/python3.9/site-packages/NNPOps/libNNPOpsPyTorch.so: undefined symbol: _ZN3c1012OptionalTypeC1ESt10shared_ptrINS_4TypeEE

I am not sure how to fix this error and any help would be highly appreciated. Happy to provide any additional information if needed.

I am working with Python 3.9.12 on a Debian OS machine.

Fix the warnings of CFConv

It also reports what look like some genuine errors in the code:

/Users/runner/work/NNPOps/NNPOps/src/pytorch/CFConv.cpp:101:29: warning: equality comparison result unused [-Wunused-comparison]
                activation_ == ::CFConv::ShiftedSoftplus;
                ~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~
/Users/runner/work/NNPOps/NNPOps/src/pytorch/CFConv.cpp:101:29: note: use '=' to turn this equality comparison into an assignment
                activation_ == ::CFConv::ShiftedSoftplus;
                            ^~
                            =
/Users/runner/work/NNPOps/NNPOps/src/pytorch/CFConv.cpp:103:29: warning: equality comparison result unused [-Wunused-comparison]
                activation_ == ::CFConv::Tanh;
                ~~~~~~~~~~~~^~~~~~~~~~~~~~~~~
/Users/runner/work/NNPOps/NNPOps/src/pytorch/CFConv.cpp:103:29: note: use '=' to turn this equality comparison into an assignment
                activation_ == ::CFConv::Tanh;
                            ^~
                            =

Originally posted by @peastman in #63 (comment)

NNPOps 0.5

CGSchNet support

@peastman: Would love to see if we could support the CGSchNet model described in this excellent paper from @brookehus, since this could allow us to support much larger coarse-grained models as part of our ML integration.

GPU memory consumption of NNPOps

I have seen increased GPU memory consumption using NNPOps compared to the original torchani implementation. Attached is a plot that shows the memory consumption of ANI-2x on the CUDA platform with nnpops and the torchani implementation for a watercluster test system. I am wondering if this is expected behavior or if I am doing something wrong? I am using the openmm-8.0-beta conda environment and openmmml to perform the benchmark watercluster calculations.

(The plot shows on the x-axis the number of atoms in the system and on the left y-axis the time it takes to execute a single energy evaluation and on the right y-axis the GPU memory footprint of the model. The dotted lines show the GPU memory. The labels are always model/platform/implementation)

Neighbors for >128K particles

The index here overflows with 128K particles (a thread is launched per possible pair, N*(N-1)/2)

NNPOps/src/pytorch/neighbors/getNeighborPairsCUDA.cu

Lines 43 to 46 in 054d487

    
    const int32_t index = blockIdx.x * blockDim.x + threadIdx.x;
    if (index >= num_all_pairs) return;
    int32_t row = floor((sqrtf(8 * index + 1) + 1) / 2);

I know it sounds like an absurd regime for such a brute-force approach, but this kernel remains a competitive option in some situations, depending for instance on the average number of neighbors per particle.

I believe there should be at least a TORCH_CHECK around.

Switching to int64_t virtually eliminates this problem, but there is a big performance penalty and the maximum number of blocks a CUDA kernel can take becomes a problem soon after.

OTOH, the way the row index is computed results in incorrect results due to floating point error at ~100k particles, even when switching to int64 and double.
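As a host-side reference for the index arithmetic discussed above, the following Python sketch checks whether the pair count fits in a 32-bit integer and maps a linear pair index to (row, column) with an exact integer square root. It is not the kernel's code: the function names are assumptions, and math.isqrt is used here as one way to avoid floating-point rounding in a reference implementation.

```python
import math

INT32_MAX = 2**31 - 1

def num_pairs(num_atoms):
    """Number of unordered pairs, N*(N-1)/2, as an exact Python integer."""
    return num_atoms * (num_atoms - 1) // 2

def fits_in_int32(num_atoms):
    """True if every linear pair index can be stored in a signed 32-bit integer."""
    return num_pairs(num_atoms) - 1 <= INT32_MAX

def pair_from_index(index):
    """Map a linear index to (row, column) with 0 <= column < row.

    Uses the same triangular-number formula as the kernel,
    row = floor((sqrt(8*index + 1) + 1) / 2), but with an exact integer sqrt.
    """
    row = (math.isqrt(8 * index + 1) + 1) // 2
    column = index - row * (row - 1) // 2
    return row, column

print(fits_in_int32(1000))        # a small system fits comfortably
print(fits_in_int32(128 * 1024))  # the 128K case from this issue does not
print(pair_from_index(num_pairs(128 * 1024) - 1))  # last pair of the 128K case
```

A reference like this could be used in a test to compare against the row and column computed on the GPU for indices near the top of the range.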

I suspect this kernel will be beaten at some point by this other approach:
https://developer.nvidia.com/gpugems/gpugems3/part-v-physics-simulation/chapter-31-fast-n-body-simulation-cuda

That approach does not take long to implement and does not suffer from the issues above, since it launches only O(N) threads. Maybe it would be worth having both options and choosing the algorithm at runtime based on some heuristic?

Additionally, this line:

NNPOps/src/pytorch/neighbors/getNeighborPairsCPU.cpp

Line 58 in 054d487

const Tensor indices = arange(0, num_pairs, positions.options().dtype(kInt32));

tries to allocate 8GB of memory when asked for 128k particles.
This can be fixed by using something like fancy iterators. However, I wonder if there is a reason the CPU implementation is written exclusively with torch operations.

Readme example appears broken

Running the readme example code, replacing molecule.mol2 with a water pdb
(example_input.zip), I get this exception

Traceback (most recent call last):
  File "/home/kevin/benchmarks/nnpops/run.py", line 22, in <module>
    nnp.aev_computer = TorchANISymmetryFunctions(nnp.species_converter, nnp.aev_computer, species).to(device)
  File "/home/kevin/programs/conda/envs/openmm-8-beta-linux/lib/python3.10/site-packages/NNPOps/SymmetryFunctions.py", line 90, in __init__
    species = converter((atomicNumbers, torch.empty(0))).species[0].tolist()
AttributeError: 'tuple' object has no attribute 'species'

It looks like SpeciesConverter returns a tuple (code), but TorchANISymmetryFunctions expects some other structure (code)
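The mismatch can be illustrated with plain Python stand-ins. Nothing below is TorchANI or NNPOps code: both converter functions and the SpeciesCoordinates type are hypothetical, and the sketch only shows why attribute access fails on a plain tuple while positional access works for both return styles.

```python
from typing import NamedTuple

class SpeciesCoordinates(NamedTuple):
    """Hypothetical named return type exposing a .species attribute."""
    species: list
    coordinates: list

def converter_named(inputs):
    """Stand-in for a converter that returns a named tuple."""
    return SpeciesCoordinates(inputs[0], inputs[1])

def converter_plain(inputs):
    """Stand-in for a converter that returns a plain tuple."""
    return (inputs[0], inputs[1])

def extract_species(result):
    """Positional access works for both plain and named tuples."""
    return result[0]

inputs = ([8, 1, 1], [])
print(converter_named(inputs).species)           # attribute access works here
print(extract_species(converter_plain(inputs)))  # positional access works for both
try:
    converter_plain(inputs).species
except AttributeError as error:
    print(error)  # the same kind of failure as in the traceback above
```

Whether the right fix is positional access in TorchANISymmetryFunctions or pinning a version whose converter returns a named structure is not settled by this sketch.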

TestNeighbors.py failures with pytorch 1.13

If I install NNPOps with pytorch 1.13

conda install -c conda-forge nnpops pytorch=1.13

Or build from source with pytorch 1.13 then the tests in TestNeighbors.py fail with a pytorch runtime error, e.g.:

FAILED TestNeighbors.py::test_neighbor_grads[distances-1-dtype0] - RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.FloatTensor [0]], which is output 0 of NormBackward1, is at version 1;...

Output of

pytest TestNeighbors.py

=================================================================================== short test summary info ====================================================================================
FAILED TestNeighbors.py::test_neighbor_grads[distances-1-dtype0] - RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.FloatTensor [0]], which is output 0 of NormBackward1, is at version 1;...
FAILED TestNeighbors.py::test_neighbor_grads[distances-1-dtype1] - RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.DoubleTensor [0]], which is output 0 of NormBackward1, is at version 1...
FAILED TestNeighbors.py::test_neighbor_grads[distances-2-dtype0] - RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.FloatTensor [1]], which is output 0 of NormBackward1, is at version 1;...
FAILED TestNeighbors.py::test_neighbor_grads[distances-2-dtype1] - RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.DoubleTensor [1]], which is output 0 of NormBackward1, is at version 1...
FAILED TestNeighbors.py::test_neighbor_grads[distances-3-dtype0] - RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.FloatTensor [3]], which is output 0 of NormBackward1, is at version 1;...
FAILED TestNeighbors.py::test_neighbor_grads[distances-3-dtype1] - RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.DoubleTensor [3]], which is output 0 of NormBackward1, is at version 1...
FAILED TestNeighbors.py::test_neighbor_grads[distances-4-dtype0] - RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.FloatTensor [6]], which is output 0 of NormBackward1, is at version 1;...
FAILED TestNeighbors.py::test_neighbor_grads[distances-4-dtype1] - RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.DoubleTensor [6]], which is output 0 of NormBackward1, is at version 1...
FAILED TestNeighbors.py::test_neighbor_grads[distances-5-dtype0] - RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.FloatTensor [10]], which is output 0 of NormBackward1, is at version 1...
FAILED TestNeighbors.py::test_neighbor_grads[distances-5-dtype1] - RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.DoubleTensor [10]], which is output 0 of NormBackward1, is at version ...
FAILED TestNeighbors.py::test_neighbor_grads[distances-10-dtype0] - RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.FloatTensor [45]], which is output 0 of NormBackward1, is at version 1...
FAILED TestNeighbors.py::test_neighbor_grads[distances-10-dtype1] - RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.DoubleTensor [45]], which is output 0 of NormBackward1, is at version ...
FAILED TestNeighbors.py::test_neighbor_grads[distances-100-dtype0] - RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.FloatTensor [4950]], which is output 0 of NormBackward1, is at version...
FAILED TestNeighbors.py::test_neighbor_grads[distances-100-dtype1] - RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.DoubleTensor [4950]], which is output 0 of NormBackward1, is at versio...
FAILED TestNeighbors.py::test_neighbor_grads[distances-1000-dtype0] - RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.FloatTensor [499500]], which is output 0 of NormBackward1, is at versi...
FAILED TestNeighbors.py::test_neighbor_grads[distances-1000-dtype1] - RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.DoubleTensor [499500]], which is output 0 of NormBackward1, is at vers...
FAILED TestNeighbors.py::test_neighbor_grads[combined-1-dtype0] - RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.FloatTensor [0]], which is output 0 of NormBackward1, is at version 1;...
FAILED TestNeighbors.py::test_neighbor_grads[combined-1-dtype1] - RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.DoubleTensor [0]], which is output 0 of NormBackward1, is at version 1...
FAILED TestNeighbors.py::test_neighbor_grads[combined-2-dtype0] - RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.FloatTensor [1]], which is output 0 of NormBackward1, is at version 1;...
FAILED TestNeighbors.py::test_neighbor_grads[combined-2-dtype1] - RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.DoubleTensor [1]], which is output 0 of NormBackward1, is at version 1...
FAILED TestNeighbors.py::test_neighbor_grads[combined-3-dtype0] - RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.FloatTensor [3]], which is output 0 of NormBackward1, is at version 1;...
FAILED TestNeighbors.py::test_neighbor_grads[combined-3-dtype1] - RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.DoubleTensor [3]], which is output 0 of NormBackward1, is at version 1...
FAILED TestNeighbors.py::test_neighbor_grads[combined-4-dtype0] - RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.FloatTensor [6]], which is output 0 of NormBackward1, is at version 1;...
FAILED TestNeighbors.py::test_neighbor_grads[combined-4-dtype1] - RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.DoubleTensor [6]], which is output 0 of NormBackward1, is at version 1...
FAILED TestNeighbors.py::test_neighbor_grads[combined-5-dtype0] - RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.FloatTensor [10]], which is output 0 of NormBackward1, is at version 1...
FAILED TestNeighbors.py::test_neighbor_grads[combined-5-dtype1] - RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.DoubleTensor [10]], which is output 0 of NormBackward1, is at version ...
FAILED TestNeighbors.py::test_neighbor_grads[combined-10-dtype0] - RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.FloatTensor [45]], which is output 0 of NormBackward1, is at version 1...
FAILED TestNeighbors.py::test_neighbor_grads[combined-10-dtype1] - RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.DoubleTensor [45]], which is output 0 of NormBackward1, is at version ...
FAILED TestNeighbors.py::test_neighbor_grads[combined-100-dtype0] - RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.FloatTensor [4950]], which is output 0 of NormBackward1, is at version...
FAILED TestNeighbors.py::test_neighbor_grads[combined-100-dtype1] - RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.DoubleTensor [4950]], which is output 0 of NormBackward1, is at versio...
FAILED TestNeighbors.py::test_neighbor_grads[combined-1000-dtype0] - RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.FloatTensor [499500]], which is output 0 of NormBackward1, is at versi...
FAILED TestNeighbors.py::test_neighbor_grads[combined-1000-dtype1] - RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.DoubleTensor [499500]], which is output 0 of NormBackward1, is at vers...
========================================================================== 32 failed, 214 passed, 2 warnings in 8.07s ==========================================================================

The tests pass with pytorch=1.12.

Is anyone else able to reproduce this?
I am running on Linux with CUDA 11.7.

Consider pytorch_geometric as an example of how we can structure this repo

Does the ANI implementation within NNPOps also consume 20x more memory with PBC?

Hi NNPOps developers,

I was wondering how the ANI AEV implementation within NNPOps handles PBC in terms of memory consumption. Previously I tried to run simulations with TorchANI in ASE, but enabling PBC in ASE increased the memory requirement more than 20×, which I assume is because TorchANI has to evaluate the atoms in the 26 neighbouring PBC cells. Does NNPOps face the same constraint when running within OpenMM? And are there any rough estimates of the maximum system size (number of atoms) that can be run on a GPU with 24GB of VRAM?

Thank you for your time to address my queries.

Best regards,
Joshua Soon
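For reference, the count of 26 neighbouring cells mentioned in the question follows from enumerating the periodic image shifts in three dimensions. The sketch below only illustrates the poster's assumption that images are materialised explicitly. It is not a description of how TorchANI or NNPOps actually handles PBC, and the function names are hypothetical.

```python
from itertools import product

def neighbor_cell_shifts():
    """All integer shifts to adjacent periodic images, excluding the home cell."""
    return [shift for shift in product((-1, 0, 1), repeat=3) if shift != (0, 0, 0)]

def replicated_atom_count(num_atoms):
    """Atoms to handle if every adjacent image were stored explicitly."""
    return num_atoms * (1 + len(neighbor_cell_shifts()))

print(len(neighbor_cell_shifts()))   # number of adjacent image cells
print(replicated_atom_count(1000))   # home cell plus all adjacent images
```

Under that assumption the atom count grows by a factor of 27, which is in the same range as the reported increase of more than 20×.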

Support SchNet-like models as well

SchNet-like models have significant advantages in that they use atom environment embeddings instead of separate pairs of networks to compute energy contributions. Would it be possible to support these as well?

Set Default CUDA Arch

I think we need to add something like this:

cmake_minimum_required(VERSION)

if(NOT DEFINED CMAKE_CUDA_ARCHITECTURES)
  set(CMAKE_CUDA_ARCHITECTURES 75)
endif()

project(example LANGUAGES CUDA)

to our CMakeLists.txt.
That way if we are on a system that has nvcc but doesn't have a GPU, there is a fallback on which CUDA arch to build. I ran into this issue when building on conda-forge since they have nvcc in their pipeline but no GPU, so then the arch doesn't get set and the build fails.

I'll have to test this in a VM later, but let me know if you think this is unnecessary or handled in a better way.

Surprising differences between nnpops and torchani

Hi,

I have been getting surprising results running waterbox simulations with the torchani vs the nnpops implementation of ani2x. I used the openmmtools waterbox testsystem with an edge length of 20 A, and a 1 fs timestep, and simulated for 1 ns using a Langevin integrator with 1/ps collision rate. The system was set up with potential.createSystem().

When I run simulations in NpT with the torchani implementation at 300 K everything looks relatively normal (density is a bit too high, and the rdf has some surprising signal though):

When I perform the same simulation with the nnpops implementation I see this:

and the simulation box has shrunk (the initial box size is the yellow outlined square). Also, note the difference in the y-axis for the potential energy.

In NVT I observe vacuum bubbles with nnpops
https://user-images.githubusercontent.com/31651017/218335894-0254ed80-e51f-4189-9bfc-ae94637cfd85.mp4

compared to the same simulation with torchani
https://user-images.githubusercontent.com/31651017/218335817-3e911757-d19d-4f71-b922-8b9de913237e.mp4

I attached a minimal example to reproduce the simulations.

min_example.py.zip

NNPOps installation/build missing dependencies (absent from environment.yml)

Hi, I wasn't able to install and build NNPOps as described here, but I was able to install it and see passing pytests with the attached env.txt
nnpops_openmm_torch_env.txt
thanks to @ijpulidos. I seem to need cudnn, cudatoolkit-dev, and gxx_linux-64=10.3.0 for cmake and install to succeed (I think these are missing from the environment.yml).

To run example with TorchForce, do I need the conda-installable version of torchani or this version?

Conda install not working

I cannot install the package from conda:

conda install -c conda-forge nnpops
Collecting package metadata: done
Solving environment: failed

PackagesNotFoundError: The following packages are not available from current channels:

  - nnpops -> __glibc[version='>=2.17,>=2.17,<3.0.a0']

Current channels:

  - https://conda.anaconda.org/conda-forge/linux-64
  - https://conda.anaconda.org/conda-forge/noarch
  - https://repo.anaconda.com/pkgs/main/linux-64
  - https://repo.anaconda.com/pkgs/main/noarch
  - https://repo.anaconda.com/pkgs/free/linux-64
  - https://repo.anaconda.com/pkgs/free/noarch
  - https://repo.anaconda.com/pkgs/r/linux-64
  - https://repo.anaconda.com/pkgs/r/noarch

To search for alternate channels that may provide the conda package you're
looking for, navigate to

    https://anaconda.org

and use the search bar at the top of the page.

Does anyone know what might be the cause of this?

Incorrect library name on Mac

I'm trying to build NNPOps on a Mac. It builds correctly, but when I try to import NNPOps it fails with this error:

Traceback (most recent call last):
  File "/Users/peastman/miniconda3/envs/openmm/lib/python3.9/runpy.py", line 188, in _run_module_as_main
    mod_name, mod_spec, code = _get_module_details(mod_name, _Error)
  File "/Users/peastman/miniconda3/envs/openmm/lib/python3.9/runpy.py", line 111, in _get_module_details
    __import__(pkg_name)
  File "/Users/peastman/miniconda3/envs/openmm/lib/python3.9/site-packages/NNPOps/__init__.py", line 7, in <module>
    torch.ops.load_library(os.path.join(os.path.dirname(__file__), 'libNNPOpsPyTorch.so'))
  File "/Users/peastman/miniconda3/envs/openmm/lib/python3.9/site-packages/torch/_ops.py", line 255, in load_library
    ctypes.CDLL(path)
  File "/Users/peastman/miniconda3/envs/openmm/lib/python3.9/ctypes/__init__.py", line 374, in __init__
    self._handle = _dlopen(self._name, mode)
OSError: dlopen(/Users/peastman/miniconda3/envs/openmm/lib/python3.9/site-packages/NNPOps/libNNPOpsPyTorch.so, 0x0006): tried: '/Users/peastman/miniconda3/envs/openmm/lib/python3.9/site-packages/NNPOps/libNNPOpsPyTorch.so' (no such file), '/System/Volumes/Preboot/Cryptexes/OS/Users/peastman/miniconda3/envs/openmm/lib/python3.9/site-packages/NNPOps/libNNPOpsPyTorch.so' (no such file), '/Users/peastman/miniconda3/envs/openmm/lib/python3.9/site-packages/NNPOps/libNNPOpsPyTorch.so' (no such file)

The problem comes from this line:

NNPOps/src/pytorch/__init__.py

Line 7 in 054d487

    
           torch.ops.load_library(os.path.join(os.path.dirname(__file__), 'libNNPOpsPyTorch.so'))

On Macs, shared libraries have the extension .dylib rather than .so. It also won't work on Windows where they have the extension .dll.
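
A minimal sketch of one way to pick the file name per platform, assuming the build produces a file called `libNNPOpsPyTorch` plus the platform's usual extension. The helper names `shared_library_name` and `shared_library_path` are hypothetical and not part of NNPOps, and the file name CMake actually emits on Windows may differ (for example, it may lack the `lib` prefix).

```python
import os
import platform

# Usual shared-library extensions, keyed by the value of platform.system().
_EXTENSIONS = {'Linux': '.so', 'Darwin': '.dylib', 'Windows': '.dll'}


def shared_library_name(system=None, stem='libNNPOpsPyTorch'):
    """Return the library file name for a platform (hypothetical helper)."""
    system = platform.system() if system is None else system
    # Assumption: any other Unix-like system uses '.so'.
    return stem + _EXTENSIONS.get(system, '.so')


def shared_library_path(package_dir, system=None):
    """Join the package directory with the platform-specific file name."""
    return os.path.join(package_dir, shared_library_name(system))


if __name__ == '__main__':
    # Show the name that would be used on the current machine.
    print(shared_library_name())
```

With something like this, the hard-coded line in `__init__.py` could pass `shared_library_path(os.path.dirname(__file__))` to `torch.ops.load_library` instead of a fixed `.so` name.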

NNPOps 0.4

It is time for the next release with:

Especially the PBC fix (#83). Is anybody working on something they want to include?

Ping: @peastman @RaulPPelaez @sef43

CI with GPU

NNPOps is primarily developed for GPU, but the current CI doesn't test on GPU.

Maybe we could use the same infrastructure as OpenMM for the GPU testing: https://jenkins.jasonswails.com/blue/organizations/jenkins/openmm-github%2Fopenmm/detail/master/176/pipeline/8

NNPOps 0.2

It is time for the release of NNPOps 0.2:

We have made several significant changes
@mikemhenry is almost done with a conda-forge package (#26, conda-forge/staged-recipes#16257)
The tutorial needs an official package (openmm/openmm-torch#62)

@peastman do you agree?

NNPOps 0.3

It is time for the next release.

We have implemented the nearest neighbor operation (#58, #61)
We have fixed a few bugs (#67, #71)

I think we are missing just an example for getNeighborPairs, so users can easily discover that operation.

Torch problems when using NNPOps with Openmm-ML

Thanks for the great ecosystem for ML potentials in MD!

I tried running this simple `openmm-ml` example that uses `createSystem`:

#!/usr/bin/env python3

from openmm.app import *
from openmm import *
from openmm.unit import *
from openmmml import MLPotential

from sys import argv,stdout

# must be either "nnpops" or "torchani"
implementation = argv[1]
input_file = argv[2]

pdb = PDBFile(input_file)

print("Creating ANI potential")
potential = MLPotential('ani2x')

print("Creating system")
system = potential.createSystem(pdb.topology, implementation=implementation)

print("Creating simulation")
integrator = LangevinMiddleIntegrator(300*kelvin, 1/picosecond, 0.004*picoseconds)
simulation = Simulation(pdb.topology, system, integrator)
simulation.context.setPositions(pdb.positions)

print("Minimizing energy")
simulation.minimizeEnergy()

print("Simulating")
simulation.reporters.append(StateDataReporter(stdout, 1000, step=True,
            potentialEnergy=True, temperature=True))
simulation.step(10000)
print("done")

I'm using a simple methane PDB file:

HETATM    1  C1  UNK     0      -0.238   0.373   0.000  1.00  0.00           C
HETATM    2  H1  UNK     0      -0.238   1.486   0.000  1.00  0.00           H
HETATM    3  H2  UNK     0      -1.286   0.002  -0.057  1.00  0.00           H
HETATM    4  H3  UNK     0       0.335   0.002  -0.879  1.00  0.00           H
HETATM    5  H4  UNK     0       0.236   0.002   0.936  1.00  0.00           H
END

When I specify to use the torchani implementation, everything goes through OK.

However, when I try to use nnpops, I get the following stacktrace (when running the energy minimization):

Traceback (most recent call last):
  File "/scratch/openmm-nnp/./run_md.py", line 28, in <module>
    simulation.minimizeEnergy()
  File "/scratch/.conda/envs/openmm_nnp/lib/python3.10/site-packages/openmm/app/simulation.py", line 137, in minimizeEnergy
    mm.LocalEnergyMinimizer.minimize(self.context, tolerance, maxIterations)
  File "/scratch/.conda/envs/openmm_nnp/lib/python3.10/site-packages/openmm/openmm.py", line 8544, in minimize
    return _openmm.LocalEnergyMinimizer_minimize(context, tolerance, maxIterations)
openmm.OpenMMException: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript (most recent call last):
  File "<string>", line 57, in <backward op>
            self_scalar_type = self.dtype
            def backward(grad_output):
                grad_self = AD_sum_backward(grad_output, self_size, dim, keepdim).to(self_scalar_type) / AD_safe_size(self_size, dim)
                            ~~~~~~~~~~~~~~~ <--- HERE
                return grad_self, None, None, None
  File "<string>", line 24, in AD_sum_backward
            if not keepdim and len(sizes) > 0:
                if len(dims) == 1:
                    return grad.unsqueeze(dims[0]).expand(sizes)
                           ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
                else:
                    res = AD_unsqueeze_multiple(grad, dims, len(sizes))
RuntimeError: expand(CUDADoubleType{[1, 1]}, size=[1]): the number of sizes provided (1) must be greater or equal to the number of dimensions in the tensor (2)

I'm using the mmh/openmm-8-beta-linux environment (via the command mamba env create mmh/openmm-8-beta-linux) on a Debian Bullseye system with an NVIDIA T4.

My full environment dump (`conda env export`):

channels:
  - conda-forge/label/openmm-torch_rc
  - conda-forge/label/openmm_rc
  - conda-forge
  - defaults
dependencies:
  - _libgcc_mutex=0.1=conda_forge
  - _openmp_mutex=4.5=2_kmp_llvm
  - attrs=22.2.0=pyh71513ae_0
  - brotlipy=0.7.0=py310h5764c6d_1005
  - bzip2=1.0.8=h7f98852_4
  - c-ares=1.18.1=h7f98852_0
  - ca-certificates=2022.12.7=ha878542_0
  - cached-property=1.5.2=hd8ed1ab_1
  - cached_property=1.5.2=pyha770c72_1
  - certifi=2022.12.7=pyhd8ed1ab_0
  - cffi=1.15.1=py310h255011f_3
  - charset-normalizer=2.1.1=pyhd8ed1ab_0
  - colorama=0.4.6=pyhd8ed1ab_0
  - cryptography=39.0.0=py310h34c0648_0
  - cudatoolkit=11.8.0=h37601d7_11
  - cudnn=8.4.1.50=hed8a83a_0
  - exceptiongroup=1.1.0=pyhd8ed1ab_0
  - h5py=3.7.0=nompi_py310h416281c_102
  - hdf5=1.12.2=nompi_h4df4325_101
  - icu=70.1=h27087fc_0
  - idna=3.4=pyhd8ed1ab_0
  - importlib-metadata=6.0.0=pyha770c72_0
  - importlib_metadata=6.0.0=hd8ed1ab_0
  - iniconfig=2.0.0=pyhd8ed1ab_0
  - keyutils=1.6.1=h166bdaf_0
  - krb5=1.20.1=h81ceb04_0
  - lark-parser=0.12.0=pyhd8ed1ab_0
  - ld_impl_linux-64=2.39=hcc3a1bd_1
  - libaec=1.0.6=h9c3ff4c_0
  - libblas=3.9.0=16_linux64_openblas
  - libcblas=3.9.0=16_linux64_openblas
  - libcurl=7.87.0=hdc1c0ab_0
  - libedit=3.1.20191231=he28a2e2_2
  - libev=4.33=h516909a_1
  - libffi=3.4.2=h7f98852_5
  - libgcc-ng=12.2.0=h65d4601_19
  - libgfortran-ng=12.2.0=h69a702a_19
  - libgfortran5=12.2.0=h337968e_19
  - libhwloc=2.8.0=h32351e8_1
  - libiconv=1.17=h166bdaf_0
  - liblapack=3.9.0=16_linux64_openblas
  - libnghttp2=1.51.0=hff17c54_0
  - libnsl=2.0.0=h7f98852_0
  - libopenblas=0.3.21=pthreads_h78a6416_3
  - libprotobuf=3.21.12=h3eb15da_0
  - libsqlite=3.40.0=h753d276_0
  - libssh2=1.10.0=hf14f497_3
  - libstdcxx-ng=12.2.0=h46fd767_19
  - libuuid=2.32.1=h7f98852_1000
  - libxml2=2.10.3=hca2bb57_1
  - libzlib=1.2.13=h166bdaf_4
  - llvm-openmp=15.0.6=he0ac6c6_0
  - magma=2.5.4=hc72dce7_4
  - mkl=2022.2.1=h84fe81f_16997
  - nccl=2.14.3.1=h0800d71_0
  - ncurses=6.3=h27087fc_1
  - ninja=1.11.0=h924138e_0
  - nnpops=0.2=cuda112py310h8b99da5_5
  - numpy=1.24.1=py310h08bbf29_0
  - ocl-icd=2.3.1=h7f98852_0
  - ocl-icd-system=1.0.0=1
  - openmm=8.0.0beta=py310h2996cf7_2
  - openmm-ml=1.0beta=pyh79ba5db_2
  - openmm-torch=1.0beta=cuda112py310h02d4f52_2
  - openssl=3.0.7=h0b41bf4_1
  - packaging=22.0=pyhd8ed1ab_0
  - pip=22.3.1=pyhd8ed1ab_0
  - pluggy=1.0.0=pyhd8ed1ab_5
  - pycparser=2.21=pyhd8ed1ab_0
  - pyopenssl=23.0.0=pyhd8ed1ab_0
  - pysocks=1.7.1=pyha2e5f31_6
  - pytest=7.2.0=pyhd8ed1ab_2
  - python=3.10.8=h4a9ceb5_0_cpython
  - python_abi=3.10=3_cp310
  - pytorch=1.12.1=cuda112py310he33e0d6_201
  - readline=8.1.2=h0f457ee_0
  - requests=2.28.1=pyhd8ed1ab_1
  - setuptools=59.5.0=py310hff52083_0
  - setuptools-scm=6.3.2=pyhd8ed1ab_0
  - setuptools_scm=6.3.2=hd8ed1ab_0
  - sleef=3.5.1=h9b69904_2
  - tbb=2021.7.0=h924138e_1
  - tk=8.6.12=h27826a3_0
  - tomli=2.0.1=pyhd8ed1ab_0
  - torchani=2.2.2=cuda112py310h98dee98_6
  - typing_extensions=4.4.0=pyha770c72_0
  - tzdata=2022g=h191b570_0
  - urllib3=1.26.13=pyhd8ed1ab_0
  - wheel=0.38.4=pyhd8ed1ab_0
  - xz=5.2.6=h166bdaf_0
  - zipp=3.11.0=pyhd8ed1ab_0

I've seen some mention of similar problems, but haven't been able to find the solution.

Any help is greatly appreciated. Apologies if this isn't the correct repo to open this issue in.

Thanks!

integrator timing data for tip3p:ligand and protein:ligand simulations suggests a non-pytorch bottleneck?

while profiling some protein:ligand and solvated ligand simulation times with NNPOPS and openmm-torch (treating the ligand with an ANI2x TorchForce) I noticed that there is a similar slowdown factor in both of the phases. does this mean there is a non-pytorch bottleneck in the integrator/force calculation process?

To be more specific, I am doing the following experiments:

1. solvent MM simulation runs MD with a LangevinIntegrator on a Tip3p solvated ligand given by the first entry of this sdf file
2. solvent MM/ML simulation runs the aforementioned integrator on (1), except the ligand atoms are treated with the ANI-2x model with a TorchForce
3. complex MM runs the specified integrator on a Tip3p solvated ligand (from the sdf above) concatenated with this protein
4. complex MM/ML runs MD as in (3), except the ligand atoms are treated with ANI-2x (as in 2)

the plot above shows the wall clock time (y-axis) per 2fs timestep; the x axis is the md iteration step number. the wall clock profiles are shown/labelled for the 4 cases described above.

getNeighborPairs error

I am trying to use the new neighbourlist with pytorch 1.12.1. I have managed to install NNPOps via conda.

If I check the PyTorch version it is 1.12, so that is fine.

If I try to import NNPOps that is also fine:

>>> from NNPOps.neighbors import getNeighborPairs

And it is even telling me it needs two positional arguments:

>>> getNeighborPairs()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: getNeighborPairs() missing 2 required positional arguments: 'positions' and 'cutoff'

But when I actually specify those positional arguments I am getting the following error:

>>> getNeighborPairs(positions=positions, cutoff=cutoff)
Traceback (most recent call last):
  File "/home/dpk25/.conda/envs/MACE/lib/python3.10/site-packages/torch/_ops.py", line 198, in __getattr__
    op, overload_names = torch._C._jit_get_operation(qualified_op_name)
RuntimeError: No such operator neighbors::getNeighborPairs

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/dpk25/.conda/envs/MACE/lib/python3.10/site-packages/NNPOps/neighbors/getNeighborPairs.py", line 82, in getNeighborPairs
    return ops.neighbors.getNeighborPairs(positions, cutoff, max_num_neighbors)
  File "/home/dpk25/.conda/envs/MACE/lib/python3.10/site-packages/torch/_ops.py", line 202, in __getattr__
    raise AttributeError(f"'_OpNamespace' object has no attribute '{op_name}'") from e
AttributeError: '_OpNamespace' object has no attribute 'getNeighborPairs'

Does anyone perhaps know what might be going wrong?

Some relevant info about the environment:

python                    3.10.4               h12debd9_0  
torch                     1.12.1+cu113             pypi_0    pypi

Interestingly, NNPOps does not appear when I do conda list.
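
Since the package imports but is not listed by conda, it may help to check which file Python is actually loading. The following is only a diagnostic sketch using the standard library; the helper name `locate_package` is made up for illustration, and it does not by itself explain the missing operator.

```python
import importlib.util


def locate_package(name):
    """Return the file Python would import for `name`, or None if not found."""
    spec = importlib.util.find_spec(name)
    return None if spec is None else spec.origin


if __name__ == '__main__':
    # Print where each package would be imported from in this environment.
    for name in ('NNPOps', 'torch'):
        print(name, '->', locate_package(name))
```

If the printed path for NNPOps points outside the active conda environment, or to a build made against a different PyTorch, that would be worth mentioning in the report.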

Add OSX to CI testing matrix

See convo here: openmm/openmm-ml#26 (comment)

Can't run steps of dynamics with NNPOps `TorchForce`

In attempting to run MD on a TorchForce-equipped System (the TorchForce has the NNPOps symmetry functions equipped as described here), I am observing strange behavior. Namely, I am able to create a Context with the System and return the State object with a potential energy, but when I run a step of dynamics, I observe:

Traceback (most recent call last):
  File "/lila/home/rufad/github/qmlify/qmlify/openmm_torch/notebooks/yield_dynamics.py", line 119, in <module>
    ml_int.step(1)
  File "/home/rufad/anaconda3/envs/nnpops/lib/python3.9/site-packages/simtk/openmm/openmm.py", line 7036, in step
    return _openmm.CustomIntegrator_step(self, steps)
simtk.openmm.OpenMMException: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript, serialized code (most recent call last):
  File "code/__torch__/torchani/nn.py", line 95, in forward
    if torch.gt((torch.size(midx))[0], 0):
      input_ = torch.index_select(aev0, 0, midx)
      _29 = torch.flatten((_22).forward(input_, ), 0, -1)
                           ~~~~~~~~~~~~ <--- HERE
      _30 = torch.masked_scatter_(output, mask, _29)
    else:
  File "code/__torch__/torch/nn/modules/container.py", line 22, in forward
    _5 = getattr(self, "5")
    _6 = getattr(self, "6")
    input0 = (_0).forward(input, )
              ~~~~~~~~~~~ <--- HERE
    input1 = (_1).forward(input0, )
    input2 = (_2).forward(input1, )
  File "code/__torch__/torch/nn/modules/linear.py", line 13, in forward
    input: Tensor) -> Tensor:
    _0 = __torch__.torch.nn.functional.linear
    return _0(input, self.weight, self.bias, )
           ~~ <--- HERE
  File "code/__torch__/torch/nn/functional.py", line 4, in linear
    weight: Tensor,
    bias: Optional[Tensor]=None) -> Tensor:
  return torch.linear(input, weight, bias)
         ~~~~~~~~~~~~ <--- HERE
def celu(input: Tensor,
    alpha: float=1.,

Traceback of TorchScript, original code (most recent call last):
  File "/home/rufad/anaconda3/envs/nnpops/lib/python3.9/site-packages/torchani/nn.py", line 68, in forward
            if midx.shape[0] > 0:
                input_ = aev.index_select(0, midx)
                output.masked_scatter_(mask, m(input_).flatten())
                                             ~ <--- HERE
        output = output.view_as(species)
        return SpeciesEnergies(species, torch.sum(output, dim=1))
  File "/home/rufad/anaconda3/envs/nnpops/lib/python3.9/site-packages/torch/nn/modules/container.py", line 119, in forward
    def forward(self, input):
        for module in self:
            input = module(input)
                    ~~~~~~ <--- HERE
        return input
  File "/home/rufad/anaconda3/envs/nnpops/lib/python3.9/site-packages/torch/nn/modules/linear.py", line 94, in forward
    def forward(self, input: Tensor) -> Tensor:
        return F.linear(input, self.weight, self.bias)
               ~~~~~~~~ <--- HERE
  File "/home/rufad/anaconda3/envs/nnpops/lib/python3.9/site-packages/torch/nn/functional.py", line 1753, in linear
    if has_torch_function_variadic(input, weight):
        return handle_torch_function(linear, (input, weight), input, weight, bias=bias)
    return torch._C._nn.linear(input, weight, bias)
           ~~~~~~~~~~~~~~~~~~~ <--- HERE
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)`

On the other hand, if I do not equip the NNPOps ANI symmetry functions, this error is not encountered. I didn't notice any examples/pytests in this repo for equipping a TorchForce with ANISymmetryFunctions, so I'm not sure if this interoperability has been tested yet. If so, would it be possible to add a pytest/example? I'm not sure if this should go into the openmm-torch repo instead (since the functionality I want to practice uses NNPOps). I'd be happy to troubleshoot if needed.

Transfer to openmm organization

I've transferred this repository to the openmm org. Update any references you have to it. New PRs should go to openmm/NNPOps rather than peastman/NNPOps. Github will automatically redirect them, but eventually I want to create a new fork at peastman/NNPOps.

Error building with CUDA

If I try to enable building CUDA, CMake fails with this error:

 CMake Error at CMakeLists.txt:8 (enable_language):
   Language 'CUDA' is currently being enabled.  Recursive call not allowed.

I tried removing the call to enable_language(CUDA), but then it fails with the errors,

 CMake Error: Cannot determine link language for target "TestCudaANISymmetryFunctions".
 CMake Error: CMake can not determine linker language for target: TestCudaANISymmetryFunctions
 CMake Error: Cannot determine link language for target "TestCudaCFConv".
 CMake Error: CMake can not determine linker language for target: TestCudaCFConv

Many tests are being skipped by the CI

The CI runs this to test:

ctest --verbose --exclude-regex TestCuda

Running this on my machine results in 11 tests. But the CI only runs 3.
I cannot find out why.

Profiling ANISymmetryFunctions

I've written a benchmark that loads in a PDB file, then times how long it takes to compute the symmetry functions and their derivatives many times. I'm opening this issue as a place to record my observations and discuss performance and optimizations.

For a small molecule with 60 atoms, it takes 356 us for each iteration. For a much larger system with 2269 atoms, it takes 1389 us for each iteration. So about 38 times more atoms takes only about 3.9 times as long! Clearly the smaller system isn't coming anywhere close to saturating the GPU (a Titan V).

Indeed, nvvp shows that with the smaller system most of the SMs are just sitting idle. This is expected given the way the code splits the work between threads: one warp for every atom. With 60 atoms, there's only enough work for 60*32 = 1920 threads, which is many fewer than the GPU wants. On the larger system it does much better, although most SMs are still only shown as about 60% utilized. There may be room to improve that.
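
The scaling figures quoted above can be reproduced with a few lines of arithmetic. This is just a worked check of the numbers in this issue; the variable names are illustrative, and the only outside fact used is the warp size of 32 threads.

```python
WARP_SIZE = 32  # threads per warp on NVIDIA GPUs

# Timings reported in this issue (microseconds per iteration).
small_atoms, small_time_us = 60, 356
large_atoms, large_time_us = 2269, 1389

atom_ratio = large_atoms / small_atoms       # how many times more atoms
time_ratio = large_time_us / small_time_us   # how many times longer it takes

# One warp per atom, so the thread count grows linearly with system size.
threads_small = small_atoms * WARP_SIZE
threads_large = large_atoms * WARP_SIZE

print(f"atoms: {atom_ratio:.1f}x, time: {time_ratio:.1f}x")
print(f"threads: {threads_small} (small) vs {threads_large} (large)")
print(f"time per atom: {small_time_us / small_atoms:.2f} us (small) "
      f"vs {large_time_us / large_atoms:.2f} us (large)")
```

The per-atom cost drops by roughly an order of magnitude on the larger system, which is consistent with the small system leaving most of the GPU idle.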

For the larger system, here is how the time is divided between kernels:

backpropAngularFunctions: 43.7%
computeAngularFunctions: 27.4%
computeRadialFunctions: 12.2%
backpropRadialFunctions: 9.4%

For the smaller system it's even more extreme, with about 94% of the time spent on the angular functions. But even at 2269 atoms, the angular functions take much more time than the radial functions. That indicates there's no need to worry about a neighbor list for the radial functions until we get up to much larger systems than we currently anticipate.

I was worried that all the atomic writes might create a memory bottleneck, but that doesn't seem to be the case. The profile shows only a few percent of samples as "memory dependency" or "memory throttle".

Test on GPU

Run all the tests on GPU.

Cut a release of NNPOps (after merging critical PRs)

error when trying to jit.script getNeighbors

When I torch.jit.script a module using getNeighbors it fails (pytorch=1.13.1, nnpops=0.4):
Example:

import torch
from NNPOps.neighbors import getNeighborPairs

class ForceModule(torch.nn.Module):
    
    def forward(self, positions):

        neighbors, deltas, distances = getNeighborPairs(positions, cutoff=1.0)
        mask = torch.isnan(distances)
        distances = distances[~mask]

        return torch.sum(distances**2)

model = ForceModule()
module = torch.jit.script(model)
module.save('model.pt')

output:

Traceback (most recent call last):
  File "/home/sfarr/Documents/MLP_train/run_md_nequip/test_nn_nl.py", line 15, in <module>
    module = torch.jit.script(model)
  File "/home/sfarr/miniconda3/envs/mlp_train/lib/python3.9/site-packages/torch/jit/_script.py", line 1286, in script
    return torch.jit._recursive.create_script_module(
  File "/home/sfarr/miniconda3/envs/mlp_train/lib/python3.9/site-packages/torch/jit/_recursive.py", line 476, in create_script_module
    return create_script_module_impl(nn_module, concrete_type, stubs_fn)
  File "/home/sfarr/miniconda3/envs/mlp_train/lib/python3.9/site-packages/torch/jit/_recursive.py", line 542, in create_script_module_impl
    create_methods_and_properties_from_stubs(concrete_type, method_stubs, property_stubs)
  File "/home/sfarr/miniconda3/envs/mlp_train/lib/python3.9/site-packages/torch/jit/_recursive.py", line 393, in create_methods_and_properties_from_stubs
    concrete_type._create_methods_and_properties(property_defs, property_rcbs, method_defs, method_rcbs, method_defaults)
  File "/home/sfarr/miniconda3/envs/mlp_train/lib/python3.9/site-packages/torch/jit/_recursive.py", line 863, in try_compile_fn
    return torch.jit.script(fn, _rcb=rcb)
  File "/home/sfarr/miniconda3/envs/mlp_train/lib/python3.9/site-packages/torch/jit/_script.py", line 1343, in script
    fn = torch._C._jit_script_compile(
RuntimeError: 
Return value was annotated as having type Tuple[Tensor, Tensor] but is actually of type Tuple[Tensor, Tensor, Tensor]:
  File "/home/sfarr/miniconda3/envs/mlp_train/lib/python3.9/site-packages/NNPOps/neighbors/getNeighborPairs.py", line 126
    if box_vectors is None:
        box_vectors = empty((0, 0), device=positions.device, dtype=positions.dtype)
    return ops.neighbors.getNeighborPairs(positions, cutoff, max_num_neighbors, box_vectors)
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
'getNeighborPairs' is being compiled since it was called from 'ForceModule.forward'
  File "/home/sfarr/Documents/MLP_train/run_md_nequip/test_nn_nl.py", line 8
    def forward(self, positions):
    
        neighbors, deltas, distances = getNeighborPairs(positions, cutoff=1.0)
        ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
        mask = torch.isnan(distances)
        distances = distances[~mask]

Looks like it is just the type annotations.

We should add jit.scripting to the test cases

Have nvcc compile for multiple architectures for conda-packages

Right now, nvcc by default compiles for only one architecture. We should be sure to build for multiple architectures.

Conda install the CUDA version of nnpops on a linux CPU platform

Hi developers, I have a question about pulling in the CUDA version of nnpops on a Linux CPU platform.
I need to build a Docker image to deploy my pipeline, and I usually do this on a Linux CPU platform as part of the CI.
nnpops is pulled in through conda as part of the build process. If I just blindly pull in nnpops on this CPU platform, I get the CPU version, which makes sense.
However, since I will be deploying the Docker image on a GPU platform, it would be helpful if I could pull in the CUDA version during the Docker build.
I tried to tell conda that I want the CUDA build variant via
mamba create -c conda-forge -n test "nnpops>=0.4=cuda*"
However, I will get

Could not solve for environment specs
Encountered problems while solving:
  - nothing provides __cuda needed by pytorch-1.13.1-cuda112py310he33e0d6_200

The environment can't be solved, aborting the operation

It seems that only 0.4 and 0.3 have this problem, as I can pull in nnpops=0.2 just fine.
Any guidance is appreciated. Thank you.

Other models to support?

What other models do we want to support? ANI is done and SchNet is nearing completion. What are the next top priorities?

torchani faster than nnpops for larger systems?

Hi,

When I simulate a 15 Angstrom waterbox with the torchani and nnpops implementations, the torchani implementation is slightly faster. Does nnpops only outperform torchani for small system sizes? I have attached a minimal example to reproduce the output shown below.

# NNPOPS
Implementation: nnpops

MD run: 1000 steps
#"Step" "Time (ps)"     "Potential Energy (kJ/mole)"    "Speed (ns/day)"
100     0.10000000000000007     -20461968.233400125     0
200     0.20000000000000015     -20462109.582584146     3.02
300     0.3000000000000002      -20462215.08696869      3.02
400     0.4000000000000003      -20462184.75506845      3.02
500     0.5000000000000003      -20462176.182438154     3.02
600     0.6000000000000004      -20462290.934872355     3.02
700     0.7000000000000005      -20462276.06124924      3.02
800     0.8000000000000006      -20462268.749944247     3.01
900     0.9000000000000007      -20462303.856101606     3.01
1000    1.0000000000000007      -20462353.939166784     3.01
# TorchANI
Implementation: torchani

MD run: 1000 steps
#"Step" "Time (ps)"     "Potential Energy (kJ/mole)"    "Speed (ns/day)"
100     0.10000000000000007     -20456827.93509699      0
200     0.20000000000000015     -20453552.138266437     3.36
300     0.3000000000000002      -20446930.31249438      3.39
400     0.4000000000000003      -20442156.674454395     3.39
500     0.5000000000000003      -20434295.0773298       2.97
600     0.6000000000000004      -20432329.317804128     3.03
700     0.7000000000000005      -20427635.139502555     3
800     0.8000000000000006      -20422604.906581655     3.04
900     0.9000000000000007      -20420074.77440338      3.07
1000    1.0000000000000007      -20414884.105911426     3.09

min.py.zip

Neighborlists?

This is a thread for discussing how we should (eventually) deal with neighborlists in ANI.

- Should the CPU code have a neighborlist?
- Which approach should we take? Voxel/spatial hashing? Or something more similar to what's done in OpenMM?
- Should we build a single neighborlist that's the max(radial_cutoff, angular_cutoff), or do something more complicated?
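
To make the last option concrete, here is a rough sketch of the "single list at the larger cutoff" idea: build the pairs once with max(radial_cutoff, angular_cutoff), then filter by distance for each set of functions. It is a brute-force O(N^2) illustration in pure Python, not the approach used in this repository; the function names and the cutoff values in the usage lines are made up.

```python
import math
from itertools import combinations


def build_pairs(positions, cutoff):
    """Brute-force list of (i, j, distance) for all pairs closer than cutoff."""
    pairs = []
    for (i, p), (j, q) in combinations(enumerate(positions), 2):
        d = math.dist(p, q)
        if d < cutoff:
            pairs.append((i, j, d))
    return pairs


def neighbor_lists(positions, radial_cutoff, angular_cutoff):
    """Build one shared list at the larger cutoff, then filter it twice."""
    shared = build_pairs(positions, max(radial_cutoff, angular_cutoff))
    radial = [(i, j) for i, j, d in shared if d < radial_cutoff]
    angular = [(i, j) for i, j, d in shared if d < angular_cutoff]
    return radial, angular


if __name__ == '__main__':
    # Three atoms on a line; the cutoffs are arbitrary illustrative values.
    atoms = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (3.0, 0.0, 0.0)]
    print(neighbor_lists(atoms, radial_cutoff=2.5, angular_cutoff=1.5))
```

The trade-off is that the shared list carries pairs the shorter-cutoff functions never use, so the filtering cost has to be weighed against building two separate lists.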

Full listing of available operations?

Hi all,

Thanks very much for making this development effort public and modular!

Is there a listing somewhere here of the primitive operations that are provided by NNPOps (not the high level entire ANI implementation), with some indication of their intended uses / when one might expect a speedup?

Conda-forge package of NNPOps

@peastman, I've had some difficulty running cmake and installing this repo (ultimately with success). Would it be possible to make this conda-installable, so we can have a consistent working version with openmm-torch and maybe OpenMM 7.5.1?

NNPOps 0.6

Let's start the new release.

New features:

#103

Testing improvements:

#100
#104

Schnet in OpenMM

As in openmm/openmm-torch#33, I am trying to implement a SchNet neural network in OpenMM for MD simulations. I wrote the ForceModule as in https://github.com/openmm/openmm-torch/blob/master/README.md so that I could use torch.jit.script. However, I started getting errors due to the code in SchNetPack: for example, the function atom_distances in https://github.com/atomistic-machine-learning/schnetpack/blob/master/src/schnetpack/nn/neighbors.py sometimes returns 1 variable and in other cases 2, which causes errors for torch.jit.script.
Adapting the SchNetPack code so that I can compile it to TorchScript seems futile, as this method is not that efficient in the first place (the neighbor list is rebuilt in every iteration).
For an efficient version it seems that NNPOps should do the trick. However, as far as I understand, there is not yet a Python wrapper for SchNet in NNPOps. I have little experience with C and Python wrappers, so I could definitely be mistaken.
Therefore, my question is whether it is possible to rebuild the SchNet network in Python using NNPOps, or whether there is a simpler way to use a SchNet neural network in OpenMM MD simulations.

	static tensor_list forward(AutogradContext* ctx,
	const Tensor& positions,
	const Scalar& cutoff,
	const Scalar& max_num_neighbors,
	const Tensor& box_vectors) {

	const int32_t index = blockIdx.x * blockDim.x + threadIdx.x;
	if (index >= num_all_pairs) return;

	int32_t row = floor((sqrtf(8 * index + 1) + 1) / 2);
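
The last line of the fragment above converts a linear thread index into the row of a strictly lower-triangular pair matrix. The sketch below shows that mapping in Python, assuming pairs (row, column) with row > column are numbered row by row, so that index = row * (row - 1) / 2 + column. The column formula and the use of exact integer arithmetic (`math.isqrt`) instead of `sqrtf` are assumptions made for illustration; the fragment itself only shows the row computation.

```python
import math


def pair_from_index(index):
    """Map a linear pair index to (row, column) with row > column >= 0."""
    # Same formula as the fragment, floor((sqrt(8*index + 1) + 1) / 2),
    # evaluated with an exact integer square root.
    row = (math.isqrt(8 * index + 1) + 1) // 2
    # Pairs in earlier rows occupy indices 0 .. row*(row-1)/2 - 1.
    column = index - row * (row - 1) // 2
    return row, column


if __name__ == '__main__':
    # Enumerate all pairs of a 4-atom system: 4 * 3 / 2 = 6 pairs.
    for index in range(6):
        print(index, pair_from_index(index))
```

Using an integer square root sidesteps the rounding concerns that come with single-precision `sqrtf` for large indices, at the cost of not mirroring the GPU code exactly.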
