
forcebalance's Introduction

#==================#
#|  ForceBalance  |#
#==================#

If you're reading this, you are probably looking for help. :D

The ForceBalance project website is located at
https://simtk.org/home/forcebalance

There is documentation available online in html at:
http://leeping.github.io/forcebalance/doc/html/index.html

You can also download the documentation in pdf format here:
http://leeping.github.io/forcebalance/doc/ForceBalance-Manual.pdf

#================================#
#|   Conda / pip installation   |#
#================================#

As of version 1.7.4, ForceBalance is available as a package on the conda-forge channel.
To install the package, make sure you are using an Anaconda/Miniconda Python distribution
for Python versions 2.7, 3.5, 3.6, or 3.7, then run:

`conda install --strict-channel-priority -c conda-forge forcebalance`

This will install ForceBalance and all of the required dependencies.  It will not install
optional dependencies such as OpenMM, Gromacs, AMBER, Tinker, CCTools/Work Queue,
or the Open Force Field toolkit.

(Note: If you were installing ForceBalance from the omnia repository previously,
you may need to clear your index cache using `conda clean -i`.)

Similarly, to install from PyPI (Python Package Index), run the command:

`pip install forcebalance`

#=========================================#
#|   Building / installing from source   |#
#=========================================#

To build the package:

python setup.py build

To install the package:

python setup.py install

To install the package locally (i.e. if you don't have root permissions), try the following:

python setup.py install --user
python setup.py install --prefix=$HOME/.local
python setup.py install --prefix=/your/favorite/directory
export PYTHONPATH=/your/favorite/directory/lib/pythonX.Y/site-packages

(Replace X.Y with your Python version; with --prefix, setup.py places the modules under lib/pythonX.Y/site-packages inside that prefix, and PYTHONPATH must point there.)

Additionally, the Anaconda/Miniconda distributions are very helpful for setting up a local Python environment.

To turn the package into a distributable .tar.bz2 file:

python setup.py sdist

#========================#
#| Published Parameters |#
#========================#

If you are here to obtain data for a published force field (e.g. AMBER-FB15), you can
find it in the Products folder.  Please email me at leeping (at) ucdavis (dot) edu
if you notice anything that should be added.

#=======================#
#| Citing ForceBalance |#
#=======================#

Published work that utilizes ForceBalance should include the following references:

Wang L-P, Chen J and Van Voorhis T. "Systematic Parametrization of Polarizable Force Fields from Quantum Chemistry Data." 
J. Chem. Theory Comput. 2013, 9, 452-460. DOI: 10.1021/ct300826t

Wang L-P, Martinez TJ and Pande VS. "Building Force Fields: An Automatic, Systematic, and Reproducible Approach." 
J. Phys. Chem. Lett. 2014, 5, 1885-1891. DOI: 10.1021/jz500737m

#===========================#
#| Funding Acknowledgement |#
#===========================#

The development of this code has been supported in part by the following grants and awards:

NIH Grant R01 AI130684-02
ACS-PRF 58158-DNI6
NIH Grant U54 GM072970
Open Force Field Consortium

forcebalance's People

Contributors

ahvigil, bieniekmateusz, chayast, dependabot[bot], ebran, hyejang, j-wags, jisraeli, jmaat, jstoppelman, jthorton, kmckiern, leeping, mattwthompson, mesoniat, mlaury, mohebifar, mschauperl, pavankum, simonboothroyd, trellixvulnteam, yudongqiu


forcebalance's Issues

Replicating Results

Hello,
I want to use particular targets in an optimization. To check that the liquid.pdb, gas.pdb and file.in are correct, I tried to use them to replicate the results this paper obtained for TIP4P-FB: http://dx.doi.org/10.1021/jz500737m. However, while the density and heat of vaporisation are close, the static dielectric constant comes out around 5 instead of 77. Is there something else I should add to my targets to replicate this result?
This is my liquid.pdb:

CRYST1   20.000   20.000   20.000  90.00  90.00  90.00 P 1           1
ATOM      1  O   HOH    1       9.457  18.861  18.736  1.00  0.00      QQQ   
ATOM      2  H1  HOH    1       8.845  19.288  19.335  1.00  0.00      QQQ   
ATOM      3  H2  HOH    1      10.058  19.557  18.471  1.00  0.00      QQQ   
ATOM      4  M   HOH    1       9.456  19.004  18.779  1.00  0.00      QQQ   
.........

The full file contains 233 molecules; I could email the rest.
This is the gas.pdb:

CRYST1   20.000   20.000   20.000  90.00  90.00  90.00 P 1           1
ATOM      1  O   HOH    1       1.457   0.861   0.736  1.00  0.00      QQQ   
ATOM      2  H1  HOH    1       0.845   1.288   1.335  1.00  0.00      QQQ   
ATOM      3  H2  HOH    1       2.058   1.557   0.471  1.00  0.00      QQQ   
ATOM      4  M   HOH    1       1.456   1.004   0.779  1.00  0.00      QQQ 

the data.csv:
T,P,MBAR,Rho,Rho_wt,Hvap,Hvap_wt,Eps0,Eps0_wt
298.15,1.0 atm,TRUE,997.045,1.0,43.989,1.0,78.409,1.0
the file.in:

$options
forcefield tip4pfb.xml
jobtype newton
$end
$target
type Liquid_OpenMM
name WATER
# Parameters for self-polarization correction of nonpolarizable water
self_pol_mu0 1.855
self_pol_alpha 1.470
$end

where tip4pfb.xml is the force field provided in simtk/openmm/app/data in openMM. Thank you in advance.
Alex

Release 1.3

This is a way for me to list the features that I've added since the last ForceBalance release. It's really time to release version 1.3 by now, and after I implement Thermo I will release version 2.0.

"--continue" option should tolerate existing indicate.log in condensed phase folder.

Running ForceBalance with the --continue option allows the optimization to pick up where it left off in the middle of an iteration, so that targets that already have been evaluated will have their data read back in. This is done using the read() function which reads the objective function information as well as the indicator lines from indicate.log. It will then proceed to evaluate the rest of the targets normally, and this includes calling get() on condensed phase property targets that have more than 75% of their jobs completed.

If we attempt to write indicate.log in a folder where we're calling get(), ForceBalance assumes it's an error and throws an exception because get() should not be called if we have already evaluated the objective function (rather read() should have been called). However, the completed condensed phase property targets are an exception to this rule, so the exception should not be thrown.

Should we use Twisted for system calls and asynchronous reads?

Currently ForceBalance has a method called _exec() that it uses for all system calls. It is a wrapper around functions in subprocess that actually carry out the system call, but it uses select to poll for whether the stdout and stderr file descriptors have data to read. This way, we are able to read from both streams while they are being generated without them being garbled.

I have noticed that when using this method in other contexts, sometimes I get the error file descriptor out of range or something similar. This is due to a limitation in select where there is a maximum number of file descriptors. One possibility is that I simply need to close the file descriptor after I'm done reading from it, but I haven't tested this (the error doesn't show up often since it needs to hit the limit). That would be a simple fix.
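For reference, the select-based polling pattern that _exec() uses looks roughly like this (a simplified sketch, not the actual implementation; it closes each descriptor at EOF, which is the simple fix mentioned above):

```python
import select
import subprocess

def run_and_capture(cmd):
    """Run cmd, following stdout and stderr as they are generated."""
    p = subprocess.Popen(cmd, shell=True,
                         stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    out, err = [], []
    streams = {p.stdout: out, p.stderr: err}
    while streams:
        # Block until at least one stream has data (or reaches EOF).
        ready, _, _ = select.select(list(streams), [], [])
        for fd in ready:
            line = fd.readline()
            if line:
                streams[fd].append(line.decode())
            else:
                fd.close()          # closing avoids leaking descriptors
                del streams[fd]
    p.wait()
    return "".join(out), "".join(err), p.returncode
```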

Another option would be to bundle Twisted, an event-driven programming library (http://twistedmatrix.com), with ForceBalance. It abstracts away low-level things like select and should provide the functionality we need, and I think it properly accounts for the file descriptor issue. It is licensed with the MIT license (highly permissive) and its only dependency is zope.interface. Having better asynchronous abilities in ForceBalance might enable other things as well, including maybe a way to fork local targets so they may be evaluated in parallel. However, introducing new dependencies should be thought out carefully.

Let me know what you think. This is medium to low priority.

Implement test based on liquid properties using stored data

Per discussion at https://openforcefieldgroup.slack.com/archives/C8P8JEE5T/p1545071217055600

We want to implement tests of ForceBalance's optimization process using liquid parameters. However, liquid simulations are not deterministic (even given same starting parameters, hardware differences may cause different results), and are prohibitively time-consuming.

When the property calculator API is finalized, we should store a small set of test trajectories, mirroring the simulations that would be run by ForceBalance in a specific optimization run. These simulation trajectories would be returned by a mock property-calculator interface. This setup would allow robust integration testing of ForceBalance's optimization using trajectory data.

How can we improve path handling?

ForceBalance carries out many calculations by using system calls to MD software such as Gromacs and TINKER. It will search for the executables in the directory specified by the tinkerpath or gmxpath option in the input file - but if this option is not present, it uses the which command to figure out the location from the $PATH environment variable.

The problem is that a single calculation may be carried out on many different machines. In the case of thermodynamic property fitting and "remote" targets, ForceBalance distributes the calculation using Work Queue and the MD software might not be installed in the same place on the remote computer. The possible solutions are:

  1. Pass the tinkerpath, gmxpath, etc. options to all remote calculations and expect that all machines have the software installed in the same place.

  2. Use the tinkerpath, gmxpath, etc. options only for local calculations and rely on $PATH for remote calculations.

  3. Eliminate all tinkerpath and gmxpath options and expect the user to always have these programs in $PATH.

  4. Some other clever solution that I haven't thought of.

I like option 3 because it reduces confusion for the user, but it only works if ForceBalance is compatible with default (unhacked) versions of the MD software. This is currently the case with GROMACS, OpenMM and TINKER, though I believe a small hack is still required for AMBER. (On the other hand, I don't know anyone who currently uses the AMBER interface.)

I am bringing this up now because until recently, ForceBalance required a modified version of TINKER, but now it works with the official TINKER. We should be moving toward full compatibility with unmodified versions of MD software.

Let me know what you think.

All targets should have the ability to read/write from disk

Currently, "remote" targets and "liquid" targets have the ability to read data from the disk. This was relatively easy to do because Work Queue required packaging up all of the data as files anyway. However, all targets should have this ability, and it's not hard (in fact, remote is a wrapper around any target). I'll make sure to implement this at some point.

What would it take to get forcebalance to use mdtraj?

It's not a really high priority, but I'd like to merge all (or most) of the functionality from Molecule.py into mdtraj. Do you know, off the top of your head, what features are missing? One that comes to my mind is support for the xyz file format, but I'm sure there are others.

Implement all thermodynamic property features into Thermo target.

This is a summary of my discussions with Erik (@ebran) today. Hopefully I'll get to this implementation within the next week or two. The overall goal is to merge the existing functionality of the Liquid and Lipid targets into the more general Thermo target, which will greatly facilitate optimizations that use general combinations (or new types) of experimental thermodynamic data.

After this is done, the Liquid and Lipid targets will remain as long as there are users who are currently using them / want to use them, though it's my hope that I can someday remove them since all of their functionality will be in Thermo.

Overall workflow of Thermo target.

The target is specified in the input file as follows:

$target
name MyThermoProperties
type Thermo_GMX
properties H_ads Rho
simulations
global
steps 20000
--
name Liquid
steps 10000
--
name Adsorbed
gmx_gro conf.gro
gmx_top topol.top
gmx_mdp grompp.mdp
gmx_ndx index.ndx
/simulations
$end

Note that the properties field is a list of property names, which correspond to columns in expset.txt (below) and implemented methods for calculating that property and its derivatives from the MD simulation. In general, properties are calculated using timeseries of instantaneous observables that come out of simulations. In other words, specifying a given property will cause particular simulations to run, and particular timeseries to be extracted from these simulations. We won't extract all possible timeseries because it's time-consuming and space intensive, instead opting to extract only those needed. Multiple properties can make use of the same timeseries in the same simulation.
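The property-to-timeseries bookkeeping described above might be sketched with a lookup table (the table and function below are hypothetical illustrations; the property names follow the example input file):

```python
# Hypothetical mapping from property name to the (simulation, observable)
# timeseries it requires. Multiple properties can share a timeseries.
PROPERTY_REQUIREMENTS = {
    "Rho":   {("Liquid",   "volume")},
    "H_ads": {("Liquid",   "potential"),
              ("Adsorbed", "potential")},
}

def required_timeseries(properties):
    """Union of (simulation, observable) pairs needed for the requested
    properties, so that only those timeseries are extracted."""
    needed = set()
    for prop in properties:
        needed |= PROPERTY_REQUIREMENTS[prop]
    return needed
```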

The simulation settings are specified within the simulations block which is a "section input type" - i.e. in contrast to most options, there will be a custom parser for the text contained between simulations and /simulations. Each simulation has a name which corresponds to the required simulations for the calculated properties - that is to say, if a given property requires a timeseries from the Interface simulation, then the target must contain simulation settings for Interface. Simple simulation settings - such as time step - can be entered here, and depending on which MD software is used, the user may provide files like grompp.mdp and tinker.key. The global block contains default settings that are copied over to the individual simulations.

In the target folder, there will be a plain text file expset.txt corresponding to the experimental data set. A separate chain of simulations will be run for each row of data in this file. This file will contain columns corresponding to the experimental data and the relevant variables for the simulation (e.g. temperature and pressure). I might want to change the default name to data.txt because my other collaborators are used to data.csv.

# Experimental data for liquid bromine.
  Label     Temp_K Density_kg/m^3 w   Enthalpy_kJ/mol w    Temperature_K Pressure_bar
  298.15K   298.15 3102.8         1.0 29.96           1.0  298.15        1.01325

# Variables: Denominators and weights for quantities
Denoms  = 30 0.3
Weights = 1.0 1.0

Here's how the calculation will work: the optimizer first provides the force field parameters, and the target's stage() function will execute the simulations of thermodynamic properties, with one or more chains of simulations per row of experimental data. These can run either locally or through the Work Queue. The executed script, md-chain.py, takes as input the simulation settings (including initial conditions) that are required for calculating the properties. The simulations are executed sequentially, and the time series of instantaneous observables (e.g. potentials, volumes) are written to disk. Each time series will be keyed by the simulation name and observable name. As a sanity check / indicator, md-chain.py will perform a standalone calculation of the properties corresponding to the available time series.

After md-chain.py returns the time series, they will be used to calculate the thermodynamic properties themselves in the target's get() method. The current thought is to create an object for each row of experimental data, which is initialized using the time series and provides methods to calculate each property and their gradients. The main advantage of Thermo is that it allows for the implementation of any thermodynamic property as a method that operates on time series of observables from simulation chains.

Here the time series of observables may be preprocessed using MBAR which increases the amount of data available for each row of experimental data, as long as the only variables are temperature and pressure. Also here would be the entry point for evaluating the standard error of the objective function via bootstrapping.

After the property values and their gradients are calculated, the objective function, gradient and Hessian contributions are easily computed and returned to Objective.

Required coding.

  • Parser for input options. This should be easy.

  • Need to create a class corresponding to "row of experimental data", which contains the time series of observables / experimental data and implements the methods for calculating properties and their derivatives. This code should be straightforward.

    I don't really know what to name this class, because DataSet sounds more like the entire experimental data set, and DataPoint makes it sound like just one property. How about ThermoProperties?

  • Need to modify md-chain.py so that it automatically knows how to set up and execute the simulation chain and extract the relevant time series. This should not be too hard except for these potential challenges:

    • Handling of initial conditions. Under normal circumstances, each simulation chain should store the initial conditions corresponding to the final coordinates of the last good optimization iteration. If we want to run multiple simulation chains for each row of properties, then each one needs to have different initial conditions for the simulations to matter. Because of this, I think I need some method to specify the initial conditions explicitly rather than having a single set of default coordinates that is used everywhere.
    • Flexible handling of user input. ForceBalance will expect the user to have run parameters and simulation settings (.mdp, .top, .key etc.) inside the target folder, under a subfolder corresponding to the row label. Settings in the input file (e.g. number of MD steps) and variables in expset.txt will override the options in the run parameter files. If the run parameter files do not contain required options, ForceBalance will provide some defaults (e.g. reasonable Ewald summation settings if none provided). This is how I'm currently doing it, but I imagine md-chain.py needs to be especially careful with handling multiple layers of possible input. (Extra note: The input can be simplified greatly if the GROMACS, TINKER etc. input files are required to have the same base name as the simulation name.)
    • Calculation of relevant properties. Depending on which simulations are run, md-chain.py can calculate some properties on its own to provide a sanity check for the user (I've found this to be helpful in npt.py). Since md-chain.py already knows the simulations to run and which time series to extract, it should be able to look up which properties it is able to calculate on its own.
  • Need to implement Work Queue functionality, MBAR and error estimation via bootstrap into thermo.py. Also it needs to make extensive use of the ThermoProperties objects. We should incorporate the backtracking feature in Liquid that reuses experimental data when an optimization step is rejected. We should also have the ability to execute multiple copies of a simulation chain for each ThermoProperties object in order to take advantage of parallelism. I think these are in order of increasing difficulty, and since it ties everything together we should implement it last.

  • thermo.py needs to parse expset.txt (or data.txt depending on what we name it) intelligently. In the beginning this should be easy, because many properties are simply floating point numbers. However, in the case of deuterium order parameters, the property is a medium-sized array and we need an intuitive syntax for parsing array input. I think Keri (@kmckiern) already has code for this.

  • Engine objects need to have a more general way to run molecular dynamics, and implement methods for extracting time series. This should be easy, though I would like to rewrite the molecular_dynamics() method while preserving the compatibility with the Liquid and Lipid targets.

Considerations for future properties

  • I'm not an expert at free energy calculations, but it's best to know whether this framework will accommodate free energies (as a simulation in the chain and a method in ThermoProperties), or if it is too complex and requires a different framework.
  • Protein NMR measurements. Are there any special considerations here?

How can we improve internal option handling?

Thinking out loud here...

Currently, a ForceBalance calculation reads from the input file a set of general options for the calculation and one set of target options for each target. Options that are not set in the input file are set to their default values as declared in the parser. This is done by first copying the default general options / target options and then inserting user-provided values. Simple, right?

Wrong; it turns out we often want to define default values at the target level. That is to say, different targets might want to have different default values for a given option. The default value for XYZ coordinates is one example.

I have tried (unsuccessfully) to implement this. The set_option method defined for all ForceBalance classes allows you to enter a default value. This value is used if the options dictionaries (passed into the target initialization) do not contain the key in question. However, since the options dictionaries always contain the default values from parser.py, the target-level default values are never used, so we don't have the fine-grained control over target options that we desire.

I think there ought to be a hierarchy of ways to set target options:

  • User-provided input (in the input file) is always used.
  • Defaults provided at the target level are next highest in priority.
  • Global defaults provided in parser.py are lowest priority.
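
A minimal sketch of this three-level lookup, assuming hypothetical option dictionaries:

```python
# Stand-in for the defaults declared in parser.py.
GLOBAL_DEFAULTS = {"writelevel": 0, "continue": False}

def resolve_option(key, user_options, target_defaults):
    """User input wins, then target-level defaults, then global defaults."""
    if key in user_options:
        return user_options[key]
    if key in target_defaults:
        return target_defaults[key]
    return GLOBAL_DEFAULTS[key]
```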

One way to implement this would be to simply not copy the global defaults when initially creating the global option / target option dictionaries, so they are empty except for the user-provided values when passed into the class initialization. This could break a few existing things, but thank goodness for unit tests.

Also: Certain True/False options have previously defaulted to False, but the proposed change would make it so that the option doesn't exist at all. This means that in several places in the code I have to do something clumsy:

-        if not options['continue']: 
+        if ('continue' not in options) or (not options['continue']): 

It's not too clumsy but somehow it seems wrong.
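
For what it's worth, Python's dict.get collapses the double check into a single expression:

```python
# dict.get returns the supplied default when the key is absent, so the
# "key missing or False" check becomes one call:
options = {}                                   # 'continue' was never set
resume = options.get('continue', False)        # False instead of a KeyError
```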

Let me know what you think.

Dependencies missing in setup.py

The current ForceBalance code on master seems to require lxml, networkx, pymbar, six, and future (for Python 2 users), which could be added to setup.py. Is there a reason they are not specified in setup.py?

RuntimeError in trying BasinHopping method

Hi @leeping, I hope all is well! I want to try out the BasinHopping optimization routine in FB with the lammpsio.py interface I have added. I receive the following error when I try to run it:

"""
indicate.log should not exist yet in this directory: /home/jnapoli/test_reax_fb/regularization_tests/L2_1.0_anneal/opt_reax.tmp/cluster-01/iter_0000
"""

This is triggered in target.py (https://github.com/leeping/forcebalance/blob/master/src/target.py#L508). I think this means that the program should not be in the iter_0000 directory, but I am not very sure. I have also tried BasinHopping with an AbInitio_OPENMM target and receive the same error.

-Joe

Performance

I'm wondering what the main difference is (extra calculations, work, etc.) between running a single-point calculation in ForceBalance and running an OpenMM simulation.

I've tried to run a 10000-step (5 ps) single-point calculation in ForceBalance using the q-TIP4P/F water model with RPMD (2 beads), and the jobs take more than half an hour to finish (the simulation itself should take 20-30 s).

"Engine" concept - a potential major improvement

I'd like to bring up an idea that's been rolling around in my head, which if successfully implemented should greatly improve the reusability of the code. However, I'm still puzzling over some design choices, so this is more of a future improvement.

Currently, "Target" classes are implemented in the ForceBalance code using the following inheritance structure:

Target (base class)
    Liquid
        Liquid_OpenMM
        Liquid_GMX
    BindingEnergy
        BindingEnergy_OpenMM
        BindingEnergy_TINKER
    AbInitio
        AbInitio_OpenMM
        AbInitio_GMX
        AbInitio_TINKER

The derived classes Liquid and BindingEnergy correspond to the physical quantities that we would like to fit using our force field. They contain code for reading the reference data files and calculating the objective function from the difference of the reference quantity and the simulated quantity. Since they don't evaluate the simulated quantities, these classes are independent of any MD software that is installed.

The next layer of derived classes Liquid_GMX are like an implementation layer. They contain code to execute the external MD code (or make API calls in the case of OpenMM). For instance, BindingEnergy_TINKER contains calls to the TINKER programs optimize and analyze, in order to optimize the geometries of complexes and calculate their energies with respect to the isolated monomers.

The problem that I want to address is the duplication of code in the implementation layer. For instance, both the classes Liquid_OpenMM and AbInitio_OpenMM contain calls to OpenMM to evaluate the potential energy on a set of structures. It might be possible to restructure the code such that performing a given operation (e.g. evaluating the potential energy on a set of structures) with a given MD code (e.g. OpenMM) only needs to be implemented once.

I have heard bad things about multiple inheritance, so I think a "composition" framework might be a good design. This would introduce a new class that represents the MD code, which I will call the Engine.

The Engine object should contain methods that carry out useful operations in the MD code, which need to be callable without knowledge of which code we're actually using. An early example is already implemented in https://github.com/leeping/forcebalance/blob/master/src/data/npt.py lines 181 through 803.

Once we have these Engine classes, then I imagine that we will no longer need two layers of derived classes for Target classes. Instead, the Target objects will contain Engine objects and call them as needed.
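The composition idea can be sketched as follows (all class and method names here are hypothetical, not a proposed API):

```python
class Engine:
    """Abstracts one MD code; Targets call it without knowing which."""
    def energy(self, structure):
        raise NotImplementedError

class OpenMMEngine(Engine):
    def energy(self, structure):
        return 0.0   # a real implementation would call the OpenMM API

class AbInitioTarget:
    def __init__(self, engine):
        self.engine = engine      # composition instead of inheritance
    def objective(self, structures, reference):
        # Sum of squared deviations from the reference energies.
        return sum((self.engine.energy(s) - r) ** 2
                   for s, r in zip(structures, reference))
```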

Two outstanding questions / related issues is whether the Engine class should contain any data, and how to handle file dependencies. For instance, in order to minimize the energy of a structure in any MD code (e.g. this would look like TinkerEngine.minimizeEnergy() in the proposed code), one needs to provide the structure. We can either store the structure as data in the Engine object (e.g. as a Molecule object), or the method needs to take the structure as an input argument.

The benefit of storing the structures is that we can count on the Molecule class to write the correct file format for any MD code to use. But where does it make sense to store the structures - in the Target object or the Engine object? There is also a cost associated with storing the structures because they need to be in memory over the course of the calculation, whereas many MD codes simply iterate over the files. For certain targets, it might not be possible to store the structures in memory (e.g. if they correspond to long MD runs.) In these instances, does it make more sense to remember just the file names?

Let me know what you think. I think that once the data storage question is cleared up, I can begin implementing this.

get_multipoles

An attempt to edit the get_multipoles function in openmmio.py by looping over all copies of the system in an RPMD simulation.

I notice the getSystemMultipoleMoments function is used for AmoebaMultipoleForce, but not for the Coulomb force. I'm wondering if the multipole moments can be calculated by the ordinary method from the coordinates, because the context argument of getSystemMultipoleMoments cannot possibly get the positions for all copies of the system.

Should _exec() method read carriage returns from Popen.stdout and Popen.stderr?

This is a minor issue in the refactored _exec() method. The new _exec() can simultaneously follow the stdout / stderr streams, print them to the terminal, store them in a list, and write them to a file.

However, many programs like Gromacs will print out progress reports with carriage returns, something like this:

"25% done\r50% done\r75% done\r100% done\n"

If I read from this stream using the readline() function, it will only print out the entire string when the final newline character appears. On the other hand, if I use the read() function and read in some amount of bytes, then the stdout and stderr streams can get garbled.

The ideal solution would be for readline() to have the ability to return strings that end in \r and \n, but I don't know how to do this. The current workaround is to have an option to set "universal_newlines=True", but that can print out many new lines, which isn't really desirable. The main goal should be to reproduce the output of the running program.

How are the gradients determined?

Okay, this is more of a question than a bug report. And I'm sorry for asking it here, I couldn't find the answer by reading the code/docs.

I am using the study #7 (007_tip4p_STP) with a few modifications.

  1. I have deleted the AbInitio target.
  2. Using OpenMM instead of Gromacs

Using that, I've tried reparametrizing the charge for a simple water model and everything seems to be working perfectly and optimization is done after a couple of iterations.

I noticed that an mvals file is written at every iteration, which contains the gradients, and that is used as a direction for the next iteration.

Now, I have modified OpenMM's nonbonded code, introducing some new parameters (c6, c8, c10, c12). ForceBalance still does a perfect job reparameterizing the charges with the modified OpenMM code, and the gradients are not zero. But when I try to reparameterize the new parameters, it stops after the first iteration, and I realized the reason is that the gradient is 0.0 in the mvals file. So my question is: how are the gradients determined? Why do I get zero gradients?
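As general background (not an authoritative answer for this setup): ForceBalance evaluates many parametric gradients by finite differences of the computed quantities with respect to the mathematical parameters, so a zero gradient usually means that perturbing the parameter changed nothing downstream. A minimal central-difference sketch, with a stand-in objective function:

```python
def gradient(objective, params, h=1e-4):
    """Central finite-difference gradient of objective(params)."""
    grad = []
    for i in range(len(params)):
        plus = list(params)
        plus[i] += h
        minus = list(params)
        minus[i] -= h
        grad.append((objective(plus) - objective(minus)) / (2 * h))
    return grad
```

If a new parameter never reaches the energy expression, objective(plus) equals objective(minus) and the difference is exactly zero.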

Here's my forcefield file

Stochastic test failure

Lately I've noticed a stochastic failure of one of my unit tests. I think this is due to the TINKER version update and their geometry optimization code being changed, but I can't prove it.

	 >	test_engine.TestAmber99SB.test_optimized_geometries
	 >	
	 >	Traceback (most recent call last):
	 >	  File "/home/travis/miniconda/envs/forcebalance_env/lib/python2.7/unittest/case.py", line 329, in run
	 >	    testMethod()
	 >	  File "test/test_engine.py", line 139, in test_optimized_geometries
	 >	  File "/home/travis/miniconda/envs/forcebalance_env/lib/python2.7/unittest/case.py", line 561, in assertAlmostEqual
	 >	    raise self.failureException(msg)
	 >	AssertionError: -61.8518871 != -61.820866873804967 within 0.001 delta : TINKER optimized energies do not match the reference

Attempt to add Custom Bond Force Feature in openmmio.py

Try to add an extra function:

def CopyCustomBondParameters(src, dest):
    dest.setEnergyFunction(src.getEnergyFunction())
    for i in range(src.getNumGlobalParameters()):
        dest.setGlobalParameterName(i, src.getGlobalParameterName(i))
    for i in range(src.getNumGlobalParameters()):
        dest.setGlobalParameterDefaultValue(i, src.getGlobalParameterDefaultValue(i))
    for i in range(src.getNumPerBondParameters()):
        dest.setPerBondParameterName(i, src.getPerBondParameterName(i))
    for i in range(src.getNumBonds()):
        dest.setBondParameters(i, *src.getBondParameters(i))

and modified CopySystemParameter by adding 'CustomBondForce' to Copiers.

Making a test run on a qtip4pf water optimization (where the OH bond energy is a 4th-degree polynomial), I get the following error message:

return _openmm.CustomBondForce_setPerBondParameterName(self, *args)
TypeError: CustomBondForce_setPerBondParameterName() takes exactly 3 arguments (4 given)

When I substitute
dest.setPerBondParameterName(i, src.getPerBondParameterName(i)) with
dest.setPerBondParameterName(src.getPerBondParameterName(i)), I get a message saying I only passed 2 arguments, but i is just an integer.

Nifty module needs to work with other Python packages.

Since we have made the transition to logging, the "nifty.py" file is no longer completely standalone (i.e. it now depends on some aspects of the ForceBalance package through __init__). Since "nifty.py" is used in both the forcebalance and nanoreactor packages, the two copies of the file are beginning to diverge.

Because of this, I need to make the "nifty.py" file standalone again.

I will create a default logger if the package name is not "forcebalance", and I might revert back to using global variables for the WORK_QUEUE and WQIDS objects. It would be best not to go back to global variables, so if anyone has ideas please let me know.
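One way to make "nifty.py" standalone again is a guarded import: use the package logger when running inside forcebalance, and fall back to a plain stdlib logger otherwise. A sketch of the idea (the fallback branch is my suggestion, not existing code):

```python
import logging
import sys

try:
    # Inside the ForceBalance package, use its customized logger.
    from forcebalance.output import getLogger
    logger = getLogger(__name__)
except ImportError:
    # Standalone use (e.g. in nanoreactor): plain stdlib logger to stdout.
    logger = logging.getLogger(__name__)
    if not logger.handlers:
        handler = logging.StreamHandler(sys.stdout)
        handler.setFormatter(logging.Formatter("%(message)s"))
        logger.addHandler(handler)
        logger.setLevel(logging.INFO)
```

This keeps a single copy of the file usable from both packages without reintroducing global variables.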

Installing on Mac OS X - OpenMP libraries not found

When installing ForceBalance on Mac OS, the following error occurs when attempting to build the contact extension:

clang -bundle -undefined dynamic_lookup -Wl,-F. -arch i386 -arch x86_64 build/temp.macosx-10.8-intel-2.7/ext/contact/contact.o build/temp.macosx-10.8-intel-2.7/ext/contact/contact_wrap.o -o build/lib.macosx-10.8-intel-2.7/forcebalance/_contact_wrap.so -lgomp
ld: library not found for -lgomp
clang: error: linker command failed with exit code 1 (use -v to see invocation)
error: command 'clang' failed with exit status 1

In order to get it to compile, I changed the compiler and linker arguments based on the operating system, so it no longer looks for "-lgomp" when compiling on a Mac. It shouldn't matter much whether contact is compiled with the OpenMP libraries or not, since this part is only used in building the molecular connectivities. This is a temporary workaround, however, and it would be best if the linker could actually find where the libraries are.
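The OS-dependent switch can be as simple as checking sys.platform in setup.py. This is a sketch of the general pattern, not the exact change that was committed:

```python
import sys

def openmp_args():
    """Return (extra_compile_args, extra_link_args) for the platform.

    On Mac OS the stock clang lacks -fopenmp/-lgomp support, so the
    contact extension is simply built without OpenMP there (sketch only).
    """
    if sys.platform == "darwin":
        return [], []
    return ["-fopenmp"], ["-lgomp"]

compile_args, link_args = openmp_args()
# These would be passed to Extension(..., extra_compile_args=compile_args,
#                                    extra_link_args=link_args)
```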

Calculation of heats of vaporization & other properties

Currently, FB includes kinetic energy terms in the gas and liquid phases when computing the heat of vaporization. I am pretty sure these terms cancel analytically in classical mechanics, so they should be removed from the calculations.

This issue may also extend to the calculation of thermal expansion coefficients and heat capacities, where the kinetic energy term shows up in the expressions but we may have analytic expressions for them. We should think about whether the proposed fix for ΔHvap should also extend to these other properties.
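For reference, once the kinetic terms are cancelled analytically, the classical expression needs only average potential energies plus the ideal-gas PV term RT (the liquid's PV contribution being negligible). A sketch with illustrative numbers, not real simulation output:

```python
R = 8.3144621e-3  # gas constant in kJ/mol/K

def hvap(e_gas_mean, e_liq_mean, n_mol, temperature):
    """Heat of vaporization (kJ/mol) from potential energies only.

    Kinetic terms cancel analytically in classical mechanics; RT is the
    ideal-gas PV contribution, and the liquid PV term is neglected.
    """
    return e_gas_mean - e_liq_mean / n_mol + R * temperature

# Illustrative numbers only: a 216-molecule box with -41.5 kJ/mol of
# potential energy per molecule, gas-phase average of zero, at 298.15 K.
dh = hvap(e_gas_mean=0.0, e_liq_mean=-216 * 41.5, n_mol=216, temperature=298.15)
```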

[Thermo] Make Quantity class more independent

Email snippet:

Here is an attempt at force field optimization against general
experimental target data (quantities) using ForceBalance and
Gromacs. The new target is called thermo.py and quantities are
defined in quantity.py. I have taken bits and pieces from
liquid.py and npt.py to put it all together but stripped some
functionality, most notably the MBAR stuff and the workqueue.

I think the code looks great.

  1. In terms of organization, "quantity.py" looks engine-independent;
    do you think it's worthwhile to move the Gromacs-specific parts of
    quantity.py into gmxio.py as functions or engine methods? You have
    a bit of Gromacs-specific code for heat of vaporization - but for
    other properties such as the thermal expansion coefficient,
    isothermal compressibility etc. it could come up again. More
    generally, the extraction of time series from the simulation
    (densities, potential energies) is engine-dependent, but their
    processing in quantity.py is engine-independent.

It's definitely worthwhile and is probably the best solution in the
end. My problem was that I am not very familiar with the engines
other than Gromacs (OpenMM and TINKER).

The point is essentially what you stress above:

  1. Run a series of MD simulations and extract the results as
    time-series. (engine-dependent)
  2. Process time-series and combine them into some quantity for
    comparison with reference data (engine-independent)

It is not always known what time-series are needed to calculate an
arbitrary quantity, but for most cases common properties will be enough
(energies, volumes, pressures,...).

You're right, I haven't completely figured this out. There aren't that many scalar time series - g_energy covers most of them, and in most cases it's ensemble or energy-related (instantaneous temperature, pressure, volume, potential, energy components.) Some other time series (for example, order parameters) depend on the geometry, and the user should implement code to calculate it.

I can see a conceptual division between the "time-series" or "instantaneous observable" described above, and your "quantity" class which is a statistical property. The user might have to make this distinction when implementing their properties.

Call graph generation is not working properly

The documentation generator makes some cool call graphs, but the last time I checked, the graphs themselves didn't make sense. For example, invert_svd does not call warn_press_key or the other way around:

Am I understanding the graph properly?

(attached screenshot: snap11)

TypeError : an integer is required

Hi,
I have implemented a custom nonbonded force in OpenMM and am now trying to optimize it with ForceBalance. However, when run with my custom force field and targets, I get the error:

File "/usr/lib64/python2.7/site-packages/numpy/core/numeric.py", line 183, in ones
    a = empty(shape, dtype, order)
TypeError: an integer is required

This is the XML I'm using:

 <ForceField>
  <AtomTypes>
   <Type mass="15.99943" element="O" class="OW" name="tip3p-O"/>
   <Type mass="1.007947" element="H" class="HW" name="tip3p-H"/>
  </AtomTypes>
  <Residues>
   <Residue name="HOH">
    <Atom name="O" type="tip3p-O"/>
    <Atom name="H1" type="tip3p-H"/>
    <Atom name="H2" type="tip3p-H"/>
    <Bond to="1" from="0"/>
    <Bond to="2" from="0"/>
   </Residue>
  </Residues>
  <HarmonicBondForce>
   <Bond k="462750.4" length="0.09572" class2="HW" class1="OW"/>
  </HarmonicBondForce>
  <HarmonicAngleForce>
   <Angle k="836.8" class2="OW" class1="HW" angle="1.82421813418" class3="HW"/>
  </HarmonicAngleForce>
  <NonbondedForce lj14scale="0.5" coulomb14scale="0.833333">
   <Atom type="tip3p-O" epsilon="0" sigma="0" charge="-0.834"/>
   <Atom type="tip3p-H" epsilon="0" sigma="0" charge="0.417"/>
  </NonbondedForce>
  <BuckinghamForce bondCutoff="3" energy="B1*B2*(eps/(1-(6/gamma)))*((6/gamma)*exp(gamma*(1-r/sig))-((sig/r)^6)); sig=sqrt(sig1*sig2); gamma=sqrt(gamma1*gamma2); eps=sqrt(eps1*eps2)">
   <PerParticleParameter name="B"/>
   <PerParticleParameter name="eps"/>
   <PerParticleParameter name="sig"/>
   <PerParticleParameter name="gamma"/>
   <Atom type="tip3p-O" parameterize="eps,sig,gamma" gamma="12.75" sig="3.55e-1" eps="0.65" B="1"/>
   <Atom type="tip3p-H" gamma="1" sig="1" eps="1" B="0"/>
  </BuckinghamForce>
 </ForceField>

My only targets are the gas and liquid PDB files from forcebalance/studies/005_iamoeba/targets and a data.csv with:

T       P        MBAR  Rho      Rho_w  Hvap    Hvap_wt
298.15  1.0 atm  TRUE  997.045  1      43.989  1

I'm not sure what is causing this error; other examples in the studies directory seem to work fine. Any help would be greatly appreciated.
Many Thanks,
Alex

[Thermo] Allow md_chain.py to return time series instead of quantities

  1. In your code, the quantities are calculated in "md_chain.py", and
    "thermo.py" reads them in after the calculation is complete. This
    is different from how "liquid.py" and "npt.py" do the calculation;
    "npt.py" saves the time series of densities, energies etc. and the
    quantities are calculated in "liquid.py". I see the merit of this
    approach - it is compact and straightforward to implement. However,
    it doesn't allow us to use the time series from multiple state
    points to calculate a quantity (this is useful for MBAR, I'm not sure
    what else in the future.)

Because of this, I would prefer for "md_result.p" to contain time
series of simulation variables (e.g. densities, energies, and energy
derivatives for each simulation) rather than the quantities
themselves, and then "thermo.py" would calculate the quantities.
Instead of specifying quantities on the command line, "thermo.py"
would tell "md-chain.py" which simulations to run (e.g. liquid, gas,
interface, or something.) What do you think?

Sounds good! The idea with md-chain.py is that a number of different
simulations (md chain) are required to calculate a single quantity,
as in e.g., enthalpy differences (solvation enthalpy, adsorption
enthalpy, vaporization enthalpy).

If you can have a simple way to specify the chain, then md-chain.py
could return a matrix of time-series where each column corresponds to
one simulation in the chain.

I agree here - the simulations in the chain are separate from the quantities. A single quantity might use data from multiple simulations, and a simulation may contribute data to multiple quantities. How about we specify the simulation names in the arguments to md-chain.py ("liquid", "gas")? The quantities need to know the names of simulations that need to be run, and all of the simulations will be sent to md-chain.py .
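The "matrix of time-series" idea could look like the following sketch; the function names and the fake data are illustrative, not existing ForceBalance code:

```python
def run_chain(simulations):
    """Run each named simulation; return {name: {series_name: [values]}}."""
    def run_md(name):
        # Placeholder for the engine-dependent MD run.
        return {"potential": [0.0, -1.0, -2.0], "volume": [30.0, 30.1, 29.9]}
    return {name: run_md(name) for name in simulations}

def mean(xs):
    return sum(xs) / len(xs)

def vaporization_energy(results):
    """Engine-independent quantity combining two simulations in the chain."""
    return mean(results["gas"]["potential"]) - mean(results["liquid"]["potential"])

# thermo.py tells md-chain.py which simulations to run...
results = run_chain(["liquid", "gas"])
# ...and each quantity picks out the simulations (columns) it needs.
de = vaporization_energy(results)
```

A dict keyed by simulation name is equivalent to the matrix of columns, and it lets one quantity draw on several simulations while one simulation feeds several quantities.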

Parallelization

Hi,
I am running a ForceBalance optimization with 369 tip4p water molecules with a custom non-bonded force and:
liquid_eq_steps 50000
liquid_md_steps 500000
gas_eq_steps 50000
gas_md_steps 500000
liquid_timestep 2.0
liquid_interval 0.2
However, it takes around 24 hours to finish a single step. Does ForceBalance support parallelization to help reduce this time? If not, is there anything else I could do to speed this up?
Many Thanks,
Alex

Remote target evaluation.

For future projects - especially the protein force field - it will become essential to have the ability to evaluate targets remotely.

For example, we might have 441 dihedral energy scan targets, where each target is evaluating the potential energy over a dihedral energy scan of a tripeptide - e.g. ACE-X-Y-NME. The reason for the "441" number is that there are 20 amino acids to consider, plus I'm looking at the HIE and HID protonation states. There might be even more if we consider the protonation states of the other side chains.

The reasons for wanting to run these calculations remotely are:

  1. The calculations are computationally intensive. For a tripeptide, we need to do at least three two-dimensional scans; if we choose a grid spacing of 15 degrees, it would amount to 1875 calculations for a single-point evaluation, and ~200 times as many (equal to the number of parameters times two) for an analytic derivative of the objective function. If we have 441 tripeptides then this number goes into the hundreds of millions.

  2. We cannot store a large number of CUDA contexts on a single host. Each GPU has an upper limit of around 20 CUDA contexts, and each target (i.e. each tripeptide) contains one CUDA context.

  3. The calculations are trivially parallel. The targets do not depend on one another, so we should be able to evaluate them independently in order to maximize our utilization of the available resources.

Here are some of my early thoughts for the design.

  • If we are running a ForceBalance calculation with the "remote target" capability, the remote machines (i.e. compute nodes) need to have the ForceBalance code installed. However, we won't be manually running any calculations on the remote machines; rather, they will automatically evaluate components of the main calculation as needed.
  • The Work Queue software is a very useful "personal scheduler" that allows any program to automate the scheduling and distribution of calculations across many machines. It works using a file based interface. We'll be using Work Queue to handle the remote evaluation of targets.

We first run the main ForceBalance calculation on the local machine. The ForceBalance calculation will contain a "Work Queue object", which opens a port on the local machine and listens for connecting "workers". Next we submit a number of "worker" processes on the remote machines, and tell them the hostname and port where they need to connect. Each worker process will connect to the main ForceBalance calculation running on the local machine, receive the command and data needed to evaluate a component of the calculation, and then send the result back.

A few decisions need to be made regarding the interface:

  • Should the remote calculation resemble a complete ForceBalance calculation in the sense that there's a "forcefield" directory, a "target" directory and an input file? In this case, the worker would need the relevant parts of the directory tree (i.e. "forcefield" and "target/target_name"), as well as a custom input file that tells it to evaluate only one target.
  • On the other hand, should the remote calculation be much smaller and self-contained? We could attempt to write the Target object to disk and send it to the worker, so we would no longer need to send an entire directory tree. The remote script would simply open the Target object and perform the calculation. However, this raises questions about whether Target objects can be written to disk; "pickle" can serialize more data types than "json", but cannot guarantee cross platform compatibility. We might need to place restrictions on which attributes are allowed to be in a Target object, which might be undesirable. However, the idea that we are sending an internal component of the calculation, rather than constructing an independent calculation, is appealing.
  • There may be other ways that do not require a file based interface at all, or Work Queue for that matter - for example it may be possible to use MPI. However, the learning curve might be steeper so it would need to be worth the payoff.
  • We also need to decide what output data is sent back. Each Target returns a few key variables; "X", "G", and "H" are the contributions to the objective function, the gradient, and the Hessian. There are also some other variables that are used to indicate the quality of fit - for example, the RMS force error when performing force matching. Should we attempt to standardize the output data that is returned by a Target? This would simplify the transmission of output data by a lot, but it would take a moderate amount of work.
  • What is the relationship between this proposed new feature and the existing remote execution capabilities? Currently, the "liquid" target is capable of running calculations remotely, but it does so in a very different way. The target itself is still evaluated on the local machine that is running the main ForceBalance calculation, but it requires a number of expensive condensed phase simulations. These simulations are evaluated by the remote workers, but the Work Queue object exists as a part of the "liquid" target. Thus, it is conceptually different from sending the entire target to the remote worker.
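On the serialization question in the second bullet, plain data attributes pickle fine, but attributes like lambdas or open file handles do not, which is why restrictions on what a Target may hold could become necessary. A toy demonstration (ToyTarget is hypothetical, not the real Target class):

```python
import pickle

class ToyTarget:
    """Stand-in for a Target: ordinary data attributes pickle fine."""
    def __init__(self, name, weight):
        self.name = name
        self.weight = weight

t = ToyTarget("dihedral_scan_example", 1.0)
restored = pickle.loads(pickle.dumps(t))

# Attach an unpicklable attribute, as a real Target easily might have:
t.postprocess = lambda x: x * 2
try:
    pickle.dumps(t)
    picklable = True
except (pickle.PicklingError, AttributeError, TypeError):
    picklable = False
```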

Anyway, these are my thoughts and I'd like to hear your opinions.

  • Lee-Ping

AMBER frcmod and GROMACS itp parameter names

Right now, parameter names like VDWS and VDWT (these are usually sigma and epsilon) are shared in AMBER .frcmod and GMX .itp formats. These parameters also have hard-coded rescaling factors in forcefield.py. The code is:

    elif match('BONDSB|UREY_BRADLEYB|MORSEB|VDWS|VPAIRS|VSITE|VDW_BHAMA|VPAIR_BHAMA',termtype):
        # nm to atomic unit
        rsfactors[termtype] = 0.05291772
....
    elif match('VDWT|VDW_BHAMB|VPAIR_BHAMB',termtype):
        # VdW well depth; using kT.  This was a tough one because the energy scale is so darn small.
        rsfactors[termtype] = kb*Temperature

The problem is that these are GROMACS units, so the default rescaling factors are wrong when using AMBER targets. This can be worked around by adding a priors section in the FB input file where AMBER targets are used, so that the units are converted to kcal/mol:

priors
   VDWS                                 : 5.29177e-01
   VDWT                                 : 5.92481e-01
/priors

At the same time, forcefield is largely independent of the target / engine, so I don't think forcefield should figure out which engine is being used in the targets. For example, we could also mix force field formats and engines in a single job, which was done for iAMOEBA.

This is part of a longer-term issue regarding the need for a standard system of parameter names and units. This would break backward compatibility and should be part of release 2.0.

Update RPMD Energy Estimator

5ffd3fb

For now I have edited the energy estimator so that it includes 3 different quantities:

cent_kinetic: kinetic energy calculated by the centroid virial estimator.

pri_kinetic: kinetic energy calculated by the primitive estimator.

kinetic: the "kinetic energy" in the classical sense; that is, if we divide this by the number of degrees of freedom and some constants, it should average to the thermostat temperature.

Question:
Is there any way to record these quantities during the MD simulation, and output the average, standard deviation, and perhaps their derivatives with respect to the force field parameters? I know most of them are calculated in liquid.py, but that is simulation-engine independent.

Consider merging ForceBalance shell script into ForceBalance.py.

Right now ForceBalance is a shell script that calls ForceBalance.py, and also does the following:

  1. Decides the names of the output and error files, and strips carriage returns from the printout so that the output file is tidier.
  2. Backs up existing output and error files by default so they don't get overwritten.
  3. If ForceBalance.py has a nonzero exit status or if the error file contains any messages, it shows the last few lines of the error file to indicate a potential error message.

I'm thinking of merging the shell script entirely into the Python script, so that ForceBalance and ForceBalance.py are the same file. This would simplify things from the user's perspective, but it might also require a fair bit of stream handling to replicate all the functionality of the shell script.
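The backup behavior in item 2 is easy to replicate in pure Python; here is a sketch (the numbering scheme for backups is my assumption, not necessarily what the shell script does):

```python
import os
import shutil

def backup_file(path):
    """If `path` exists, move it to path_1, path_2, ... and return the new name."""
    if not os.path.exists(path):
        return None
    i = 1
    while os.path.exists("%s_%d" % (path, i)):
        i += 1
    dest = "%s_%d" % (path, i)
    shutil.move(path, dest)
    return dest

# Usage: back up an existing output file before the run overwrites it.
with open("example_run.out", "w") as f:
    f.write("old output\n")
moved = backup_file("example_run.out")  # moves the old file aside
```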

[Thermo] Add user-defined labels to experimental data points

The different points in the reference data set do not have to correspond to different values of the ensemble parameters; e.g., for enthalpy of adsorption you can have \Delta h_vap(n_H2O), where each point (each value of n_H2O) is at the same temperature, T, and pressure, p. In this sense T and p are the only two special variables (along with \mu in a GC ensemble) that need to be specified for each point in the reference data. T, p, and possibly \mu are fixed control parameters that define the ensemble.

That's a good point. Going back to your original application, the points in the reference data set might have the same ensemble parameters but different topologies.

Thus far it's been convenient for me to identify a point in the reference data set using T and p, but that is tied to my previous applications (physical properties of water at different temperatures and pressures).

I think it's possible to have it both ways. How about having an optional "name" column in expset.txt, so that the user may name directories as they please - but if the column is missing, then the folders are numbered numerically by default?

Parallelize over initial conditions for MD simulations.

The proposed feature would provide an option to run multiple copies of simulations corresponding to a single experimental data point in Liquid, Lipid and Thermo targets.

The idea is really simple - by running multiple simulations that only differ by their initial conditions, we should be able to trivially parallelize the accurate estimation of properties and their derivatives as long as resources are available.

In terms of organization:

  1. The code needs to know about simulations that are grouped into the same point, so that the data sets can be combined when simulations are completed.
  2. We should allow MBAR to handle situations where some simulations have different lengths than others. For example, we might run 2 copies of a simulation for lower temperatures but only 1 copy at higher temperatures.
  3. We should have a good way of making sure different simulations start out with different initial conditions, and if the number of initial conditions is insufficient it should throw an error to avoid wasting resources.
  4. The user needs to enter this as an option in data.csv or expset.txt.

As always, please let me know what you think.

[Thermo] Merge md() and molecular_dynamics() methods

Currently the molecular_dynamics() method (implemented by me) returns a few specific pieces of information, such as the time series of energies, densities and so on. This wasn't general enough for the Thermo target so md() was added by Erik. At some point I would like to merge the two, and create a more general method for running molecular dynamics simulations + returning properties.

molecule.py integration with MDTraj

This thread discusses the possibility of using MDTraj to load and process trajectories in ForceBalance.

Currently I have a standalone file molecule.py, containing the Molecule object that performs all of my trajectory processing and file conversion. It is easily editable for random tasks that crop up in my research, and it works in the way I intuitively expect it to - thus it's an essential piece of code for my research. On the other hand, MDTraj offers excellent things such as:

  • High performance
  • Kyle's implementation of NMR observables
  • Minimum image convention distances
  • Ability to compute hydrogen bonded networks
  • Much more

I think one possible solution is to have MDTraj as an optional dependency of molecule.py. It will work something like this:

  1. When I create a Molecule object, it reads in the trajectory with MDTraj if possible, and it will then contain a Trajectory object.

  2. Certain operations will check for the existence of the Trajectory object using if statements.

Ideally the number of times I do (2) will be on the low side, because having too many if statements could become unmanageable. This is partially because I'm trying to have it both ways - getting the improved performance and features of MDTraj, while keeping my ability to rapidly prototype and hack molecule.py. I think it's a reasonable compromise.
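The optional-dependency pattern in step (1) can be a guarded import plus a module-level flag; a sketch (the class and method names are illustrative only):

```python
try:
    import mdtraj
    HAVE_MDTRAJ = True
except ImportError:
    HAVE_MDTRAJ = False

class Molecule:
    """Toy sketch: keep an MDTraj Trajectory only when the library exists."""
    def __init__(self, filename):
        self.filename = filename
        # Step (1): read with MDTraj if possible.
        self.traj = mdtraj.load(filename) if HAVE_MDTRAJ else None

    def has_fast_path(self):
        # Step (2): certain operations check for the Trajectory's existence.
        return self.traj is not None
```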

Additionally this will provide a route towards tighter integration down the road if desirable. molecule.py obviously has some overlap with MDTraj, and as time goes on I can start to implement the nonoverlapping functionality from molecule.py into MDTraj.

OpenMM loading PDB error matching water residue

Description:
When parameterizing water model using Liquid targets, OpenMM engine gives error when loading liquid.pdb:
ValueError: Residue 4 (HOH) does not match any template defined by the ForceField.

To reproduce error:
Decompress the zip file, go into the folder openmm_residue_match_issue, and run "python npt.py openmm 273.15 1.00". I will call this folder the "debug folder" below.

Initial investigation on source of error:
3 versions of liquid.pdb are compared.

  1. The first is provided in the targets/Liquid folder, the same as /forcebalance/studies/005_iamoeba/targets/Liquid/liquid.pdb, renamed to orig_liquid.pdb in the debug folder. Using this file as liquid.pdb will trigger the residue matching error starting from Residue 1.
  2. The second is generated by the new ForceBalance molecule.py, renamed to backup.pdb in the debug folder. This version of liquid.pdb uses HETATM entries and deletes the CONECT records for water, which I think is not the cause of the error. Using this file as liquid.pdb will trigger the residue matching error starting from Residue 1.
  3. The third is a mixed version for finding the cause of the error, named liquid.pdb in the debug folder. Here the residue match error occurs at the 4th residue. Compared to the 3rd residue, the only difference is that the atom names are changed from O H1 H2 to O1 H2 H3, and despite the provided CONECT record, the residue matching error is still triggered.

A workaround for this error is to avoid the atom names O1 H2 H3 for the water residue in the PDB file. Further investigation is needed to fully understand this error. This error did not occur when I used a previous version of OpenMM (7.0), but happens when using the current version, OpenMM 7.2.

openmm_residue_match_issue.zip

Differences in logger behavior when using Anaconda?

I was doing some development on my laptop, which has a different environment from my workstation: the laptop (Mac OS 10.9) uses the Anaconda distribution, whereas the workstation (Ubuntu 12.04 LTS) uses stock Python. On my laptop the extra logging messages are being printed to the stderr stream, which is confusing:

Does Anaconda modify the behavior of the logger in some way? (In general I don't know enough about how loggers work to troubleshoot them effectively.)

INFO:forcebalance.thermo:comm "# Experimental data for liquid, bromine."

INFO:forcebalance.thermo:empt

INFO:forcebalance.thermo:head Temp (K)  Density (kg/m^3)    w   Hvap (kJ/mol)   w   Pressure (bar)

INFO:forcebalance.thermo:data 298.15    3102.8  1   29.96   1   1.01325

Remove some global variables (ITERATION, ITERINIT, GOODSTEP).

Suggestion from Han:

Currently ForceBalance uses a global variable which represents which iteration it's on. It would be good if this global were removed, but this is nontrivial. The ITERATION variable is often used to determine which folder files are written to / read from, and it would greatly increase the flexibility of functions if they took the directory name as an argument instead.
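A before/after sketch of the suggested refactor (function names are hypothetical; the iter_%04d naming is an assumption about the folder convention):

```python
# Before: the function silently consults a module-level global.
ITERATION = 3

def result_dir_from_global():
    return "iter_%04d" % ITERATION

# After: the caller passes the iteration explicitly, so the function has
# no hidden state and can be pointed at any past iteration's folder.
def result_dir(iteration):
    return "iter_%04d" % iteration
```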

ForceBalance modules only produce output when stream handlers are explicitly added.

Hi Arthur,

Tonight I was using _exec() in an external script and I noticed it wasn't generating output anymore. I could add a RawStreamHandler to the logger in my external script, but this is a bit cumbersome. Perhaps the default behavior of a logger should be to have a RawStreamHandler to stdout. Maybe something like this at the top of each module file?

import sys
from forcebalance.output import getLogger, RawStreamHandler, INFO
logger = getLogger(__name__)
if len(logger.handlers) == 0:
    logger.addHandler(RawStreamHandler(sys.stdout))
    logger.setLevel(INFO)

BSD 3-clause License

@kmckiern @ahvigil @jisraeli @yudongqiu

ForceBalance is currently licensed under GPL v2, and I'd like to change it to the 3-clause BSD license. My understanding is that the new license would make it easier for folks in industry to collaborate and use our code.

Below is a copy of the filled-in proposed license, where I've pasted the BSD 3-clause license template and filled in the authors and contributors. Please let me know if you have any suggestions or concerns. My plan is to finalize this change in 1-2 weeks.

Copyright 2011-2015 Stanford University and the Authors
Copyright 2015-2018 Regents of the University of California and the Authors

Developed by: Lee-Ping Wang
              University of California, Davis
              http://www.ucdavis.edu

Contributors: Keri A. McKiernan, Arthur Vigil, Yudong Qiu, Erik G. Brandt, Johnny Israeli

Redistribution and use in source and binary forms, with or without modification,
are permitted provided that the following conditions are met:

1. Redistributions of source code must retain the above copyright notice, 
this list of conditions and the following disclaimer.

2. Redistributions in binary form must reproduce the above copyright notice, 
this list of conditions and the following disclaimer in the documentation 
and/or other materials provided with the distribution.

3. Neither the name of the copyright holder nor the names of its contributors 
may be used to endorse or promote products derived from this software 
without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY 
EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED 
WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, 
INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT 
NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR 
PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, 
WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) 
ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE 
POSSIBILITY OF SUCH DAMAGE.

Improved error treatment in optimization.

Please provide input if you are interested.

When ForceBalance encounters a bad optimization step due to a noisy objective function and gradient (which contain noisy statistical properties), it rejects the optimization step, recalculates the gradient and Hessian, and takes a reduced-size step.

However, the gradient and Hessian from the repeated step are just as noisy as the initial step. Also, the noise term remains constant as the objective function converges, so towards the end of the optimization most of the steps are bad. Many of my optimizations are terminated by hand when they take too many bad steps, or I restart the optimization with more MD steps - neither solution is optimal.

We should be able to leverage the information from multiple calculations to improve the optimization step, since they were carried out at the same values of the parameters. The simplest thing to do is average the objective function, gradient, and Hessian values at the "revisited" parameter values but I doubt this is the strictly correct thing to do.

For example, a simple calculation shows that when the property has statistical noise (X + Δ), the objective function (X + Δ - E)^2 is on average larger than the noiseless function by an amount <Δ^2>. This is obvious when the true property average agrees with experiment exactly - the objective function should be zero but it will never be zero.
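This bias is easy to check numerically: with X - E = 0.5 and Gaussian noise of standard deviation 0.2, the average of (X + Δ - E)^2 comes out near 0.25 + 0.04 = 0.29 rather than 0.25. A quick stdlib-only check with toy numbers:

```python
import random

random.seed(0)
true_err = 0.5   # X - E, the noiseless property minus experiment
sigma = 0.2      # standard deviation of the statistical noise Delta

n = 100000
samples = [(true_err + random.gauss(0.0, sigma)) ** 2 for _ in range(n)]
mean_obj = sum(samples) / n

# The noiseless objective is 0.25; the noisy average is biased upward
# by approximately <Delta^2> = sigma^2 = 0.04.
bias = mean_obj - true_err ** 2
```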

When the property does not agree with experiment exactly, the objective function might be higher or lower than the true value, but on average it is higher. The variance of the objective function is something like this:

<Δ^4 + 6Δ^2(X - E)^2>

The standard error of the property Δ should probably not be used to correct our objective function because it cannot be precisely evaluated; estimates of standard errors tend to be less precise than estimates of means, so it would worsen our estimate rather than improve it.

The main point is that by averaging multiple objective function values, we don't get the same objective function that would have resulted from a single calculation with proportionally longer trajectories; we would get a lower value from the latter. But we can't compare this lower value with the next optimization step because it will have the full noise again. In fact, the "averaged objective function" from multiple steps is probably the better comparison value for future optimization steps.

What about the gradient and Hessian? Simple calculations assuming <Δ>=0 indicate that averaging the gradient over multiple calculations indeed converges to the true gradient.

More onerous are the Hessian diagonal elements. They are on average increased by an amount <Δ(∇i)^2> (i.e. the squared standard error in the ith component of the property gradient). This means that on average we are taking smaller optimization steps than we should, and this error would persist through averaging multiple Hessians. Taking too small an optimization step is a bad thing, because smaller steps are more likely to fall within the statistical noise.
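This inflation of the diagonal can also be checked numerically. A sketch using the Gauss-Newton approximation for the term (X - E)^2, whose Hessian diagonal is roughly 2*(dX/dp_i)^2, with made-up gradient and noise values:

```python
import numpy as np

rng = np.random.default_rng(1)
g = np.array([0.5, -1.0])   # made-up true gradient of the property w.r.t. two parameters
sig = np.array([0.1, 0.2])  # made-up standard error of each gradient component

# Averaging many noisy gradient evaluations does NOT remove the bias in the
# squared terms that enter the Gauss-Newton Hessian diagonal:
gn = g + rng.normal(0.0, sig, size=(200_000, 2))
noisy_diag = (2.0 * gn ** 2).mean(axis=0)

print(noisy_diag)      # ~ 2*g**2 + 2*sig**2  (inflated by the noise)
print(2.0 * g ** 2)    # noiseless diagonal
```

Because the noise is squared before averaging, the bias survives no matter how many noisy Hessians are averaged, which is why the averaging has to happen before the squaring (i.e. at the target level, as proposed below).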

So what should we do here? Possible solutions include:

  1. Improving estimates of the Hessian through averaging should be done at the property calculation (i.e. "target") level, where the quantity being squared and summed is accessible - so that the Hessian diagonal elements are not inadvertently increased by the statistical noise.

  2. Improving estimates of the objective function through averaging should be done at the optimization level, because they would incorporate the positive bias which subsequent optimization steps would share.

  3. Each time the optimizer takes a step back and recalculates the objective function, subsequent evaluations of the objective function should increase the number of MD steps by a proportional factor. This would remove the requirement for (2), so all of the "averaging" calculations may be done at the target level. The downside is the optimization time would blow up. We might recover some efficiency by reducing bias in the Hessian diagonal elements but still...

One question: when I repeat the liquid properties calculation for the same parameters after rejecting an optimization step, do I need to rerun MBAR on the "doubled" collection of energies, or can I just take the arithmetic mean of the results? I don't think MBAR scales favorably with very large numbers of snapshots (I haven't run timings; this is just an impression from experience).

Let me know what you think. Thanks!

  • Lee-Ping

Experimental Data File Specification

Starting a new topic because this focuses on how the experimental data files are stored on disk (these files are created / edited by the user), rather than how they are parsed or stored in memory. Note that we can't easily go the other way (i.e. convert from the internal representation to data tables on disk), but that's presently not an intended use case.

If you are thinking of a nontrivial data type, please mention it here and we can discuss how it can fit into this framework.

Specification for experimental data file format in ForceBalance (updated 4/1/2014)

Context and Purpose: ForceBalance presently has a fairly rigid format for storing liquid and lipid bilayer properties (the Liquid and Lipid targets), and a simple format for specifying general thermodynamic properties (the Thermo target), contributed by @ebran.

The following spec attempts to encompass more types of physical / chemical experimental data for force field parameterization, so that we can incorporate a larger number of properties into the Thermo target. The format is intended to be highly flexible for different data types without compromising on simplicity.

File Formats: Three formats will be available.
  • Comma-separated values or .csv, as written by Microsoft Excel. Every line in the file - including empty lines - must contain the same number of comma-separated fields.
  • Tab-delimited values as written by Microsoft Excel. Every line in the data table (see below) must contain the same number of tab-separated fields. Metadata lines are still tab-delimited, but trailing tabs don't need to fill out the number of fields (as opposed to the case for commas in .csv). Comment lines are tab-delimited but only the first field matters anyway.
  • Space-delimited, fixed width format. Every line in the data table must have the same width. Note that this is the most human-readable format, but Excel may parse it in unexpected ways due to the three types of lines (see below). Comment lines or metadata lines are space-delimited and don't have to be fixed width (though preserving the fixed width improves the chances that Excel will read it correctly).
Content of the files: There are three types of lines in the experimental data file, in addition to empty lines (which are ignored).
  • Comment lines: The first field starts with an octothorpe (#) character. Subsequent fields are ignored.
  • Metadata lines: The first field is the (case-insensitive) variable name. The second field must be an equals sign (=), otherwise the line will not be recognized as metadata. Subsequent fields are the (case-sensitive) variable values. If any of the value fields starts with # (comment), then it and subsequent fields are all ignored. Examples of metadata are discussed below.
  • Data table (the most important): The first line is a header line with column headings. Subsequent lines are the rows in the table containing the same number of fields as the header line. Intervening comment lines are allowed, for noting experimental references and similar things. Intervening metadata lines are not allowed. Each row corresponds to: (a) one set of experimental observables taken at the same conditions, (b) one set of simulations for simulating these observables, and (c) one subfolder in targets containing initial conditions and simulation settings.
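The three line types above can be told apart from the parsed fields alone. A minimal classifier sketch for the .csv variant (illustrative only, not the actual ForceBalance parser; the sample values are made up):

```python
import csv
import io

def classify_line(fields):
    """Classify a parsed row as comment / metadata / data / empty,
    following the three line types in the spec (sketch only)."""
    if not any(f.strip() for f in fields):
        return "empty"
    if fields[0].strip().startswith("#"):
        return "comment"
    if len(fields) >= 2 and fields[1].strip() == "=":
        return "metadata"
    return "data"

text = """# density of water at ambient conditions,,,
Weights,=,1.0,
Index,Temperature,Pressure,Density (kg m^-3)
water298,298.15,1.0,997.045
"""
for row in csv.reader(io.StringIO(text)):
    print(classify_line(row))
# prints: comment, metadata, data, data (one per line)
```

Note that the header line and the rows of the data table are both classified as "data"; distinguishing them is positional (the first data line is the header).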
Content of the data table: The columns in the data table can be one of the following. Column headers are case-insensitive.
  • Index: (case-sensitive string, leftmost column) The row label; this is required. The label is also the subfolder name in targets.
  • Temperature of the experiment/simulation (float; default unit Kelvin). If provided, will be used in simulations and calculation of observables.
  • Pressure of the experiment/simulation (float; default unit atm). If provided, will be used in simulations and calculation of observables.
  • Observable value (float with optional unit). The column header is the observable name, and must correspond to an internally implemented method to calculate the observable from simulations. The column header and entries may optionally contain a physical unit (syntax is observable (unit) with optional space). The unit in the column header is the default for the whole column; if no units are provided, a hard-coded conventional unit will be used. An example of a unit would be 10^-4 K^-1. (Note: We should use the SimTK unit conversion system to do unit conversions, but it would be nice if we could map the abbreviated units to the full unit name. I also have issues with bundling unit with ForceBalance because these units are incompatible with the existing simtk.unit.) Also note: observable (unit) counts as a single field so the fixed-width parser must account for this.
  • Array observable (multiple values or file reference). This is a flexible format that accommodates complicated observables, such as a deuterium order parameter, a radial distribution function or IR spectrum. For deuterium order parameters, the list of n carbon atom names and the list of n-2 order parameters are needed. For a radial distribution function, it may be the list of atom separations, the two atom name selections used to calculate the RDF, or the list of RDF values. The data may be entered across multiple rows, in which case the other columns in the extra rows may be left empty. Alternatively, a file reference may be provided in the form of other_file:column where the values are read from the column of the other file (numbered from one). Note that the number of entries must be consistent (i.e. the number of atom separations must equal the number of RDF values, and the number of carbons must equal the number of deuterium order parameters + 2.)
  • Observable weight (optional, float). The relative weight of the observable in this row with respect to the same observable in other rows; it will be normalized at the end. The column header is "w", "wt" or "wts" and applies to the observable column to the left. Defaults to one. If the experimental data is missing in this column, the simulation result will still be printed, but it won't contribute to the objective function.
  • Observable uncertainty (optional, float). The total uncertainty of the observable in this row (experimental + simulation uncertainty). The column header is "s", "sig" or "sigma" and applies to the observable column to the left. This overrides the global uncertainty in the metadata. Note: Mathematically speaking, the inverse square of the uncertainty is a multiplicative weight - however, it is conceptually different from the above weight which corresponds to our own expectations of which data points are more important.
  • In any case where the column applies to the observable / temperature / pressure column to the left, it may be prepended with observable_ to override this behavior and apply to the chosen observable instead.
  • Number of independent initial conditions (integer). The column header is "n_ic". This is the number of independent simulations launched in parallel to compute this observable; it must correspond to existing initial condition files on disk.
  • More specialized behavior:
    • MBAR may be used to combine statistics from simulations at different thermodynamic conditions (rows) to improve the statistical precision of each. This only makes sense for compatible simulations (same number of molecules). This is different from the MBAR that may be used to calculate solvation free energies. The column heading is MBAR, and the entries are integers corresponding to groupings of simulations that MBAR is applied to (0 or None means MBAR is not used.)
    • Quantum corrections for heat of vaporization, heat capacity.
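As a sketch of the observable (unit) column-heading syntax described above, a small parser for single-token observable names (hypothetical, not the actual implementation):

```python
import re

# Name, then an optional parenthesized unit, with optional space between.
HEADER_RE = re.compile(r"^\s*(\S+?)\s*(?:\(([^)]*)\))?\s*$")

def parse_header(field):
    """Split a column heading into (name, unit); unit is None when absent.
    Sketch of the 'observable (unit)' syntax for single-token names."""
    m = HEADER_RE.match(field)
    if m is None:
        raise ValueError("bad column heading: %r" % field)
    name, unit = m.groups()
    return name.lower(), unit   # headings are case-insensitive per the spec

print(parse_header("Density (kg m^-3)"))   # ('density', 'kg m^-3')
print(parse_header("Temperature"))         # ('temperature', None)
```

This also illustrates the parsing concern noted above: "observable (unit)" is a single field, so a fixed-width parser must not split on the internal space.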
Metadata. Possible entries are:
  • Sigmas (aka Denoms). The total uncertainty of the observable (experimental + simulation uncertainty), expressed in the unit system of the column heading or the default unit. The number of entries must equal the number of observables being calculated - or provide multiple rows e.g. observable_sigma with one uncertainty per row. Note: This may be modified to specify the experimental and simulation uncertainty separately. My experience is that calculating the simulation uncertainty during the optimization leads to stability issues, so pre-computing the values is recommended.
  • Weights. The relative weight of the observable with respect to other observables. Note that weights are hierarchical; the individual weights in the table are within a single observable, these weights are across observables, and the ForceBalance input file has weights across targets.
  • True/False switches for activating or deactivating QM corrections.
Support for multiple files:
  • Since the columns contain more than just experimental data (they actually contain some information that pertains to the optimization), we should support splitting these columns into a separate file.
  • At the same time, we don't want to sacrifice the convenience of having everything in a single file, especially when the whole optimization concerns just one or two data points.
  • Because of this, we will support multiple files with formal separation between data types (experimental data / parameters for optimization) but still allow single files that combine multiple data types.
  • The ForceBalance input file needs to specify which files to read, but does not have to specify what each file contains. All of the files are parsed in the same way and stored in the Target class.
Potential pitfalls and concerns:
  • What experimental observables haven't I thought about, that may not fall into this framework?
  • Writing tab-delimited files by hand may confuse the tab-delimited parser (e.g. using tabs to create a fixed-width effect but putting different numbers of tabs between the columns in each row.)
  • Sometimes pressure is not a control variable but an observable (for example in NVT simulations). Same goes for number of particles / chemical potential - because if we choose to simulate in one ensemble, we obtain statistics on the conjugate thermodynamic variables. How do we account for this dual role?
  • When referencing files in the data table, should columns be numbered from one or zero? I am inclined to say one.

custom simulation specifications for openmm simulations in forcebalance

thought it might be a good idea to talk about this on here.

currently the simulations run using the openmm engine in forcebalance are a bit inflexible. i think we could add flexibility in a stylistically consistent way if we allow the user to include elements of an openmm python script in the target directory, similar to how gromacs liquid simulation defaults are overwritten by a liquid.mdp file. so there are defaults in npt.py and then anything in the targets/targetx/sim.py file takes precedence.

maybe to prevent the need for super general parsing, we could fix a few variables, and the sim.py script would be required to use these variables? so we could write a template script and the user could add attributes.
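a sketch of what the fixed-variable approach could look like (all setting names below are made up for illustration, not actual forcebalance variables):

```python
import os

# Defaults that would live in the engine (as npt.py does now).
SIM_DEFAULTS = {
    "timestep": 1.0,        # fs (hypothetical setting name)
    "n_steps": 100000,      # hypothetical setting name
    "thermostat": "Langevin",
}

def load_sim_settings(target_dir):
    """Return the defaults, overridden by the fixed set of names that an
    optional targets/<target>/sim.py may assign (sketch of the proposal)."""
    settings = dict(SIM_DEFAULTS)
    script = os.path.join(target_dir, "sim.py")
    if os.path.exists(script):
        scope = {}
        with open(script) as f:
            exec(f.read(), scope)       # user script assigns the fixed names
        for key in SIM_DEFAULTS:        # only the known names take precedence
            if key in scope:
                settings[key] = scope[key]
    return settings
```

restricting the override to a fixed set of names is what avoids the super-general parsing problem: the user's sim.py is plain python, but only the agreed-upon variables have any effect.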

Releasing version 1.4?

The release schedule for ForceBalance is rather arbitrary, but soon it will be a good time to update the version to 1.4 because of several new features. Also I need to refresh my memory on how to generate the documentation.

Major features:

  • AMBER engine: Interface to AMBER for matching gradients, interaction energies, vibrational frequencies and electrostatic potentials
  • Force field class can parse script blocks in XML (Johnny's contribution)
  • Derived parameters using EVAL can now be functions of one another
  • Hydration free energy target (handles implicit solvent for the time being)

Minor features:

  • Improved file handling: Pickle files are automatically zipped
  • Enable optimization of a subset of charges on a molecule (previously had to optimize all of them)
  • Use setuptools for automatic dependency handling
  • Liquid target no longer normalizes the weights to one (note change from previous behavior)

Here's what needs to be done first:

  • Merge Keri's contributions (multiple initial conditions and lipid unit test)
  • Remove ITERATION global variable
