wehs7661 / ensemble_md Goto Github PK

A python package for performing GROMACS simulation ensembles

License: MIT License

Shell 0.25% Python 99.75%

advanced-sampling-methods free-energy-calculations molecular-dynamics statistical-mechanics expanded-ensembles parallel-computing python replica-exchange drug-design protein-simulations

ensemble_md's Introduction

Ensemble Molecular Dynamics

ensemble_md is a Python package that provides methods for setting up, running, and analyzing molecular dynamics ensembles in GROMACS. The current implementation is mainly for synchronous replica exchange (REX) of expanded ensemble (EE), abbreviated as REXEE. In the future, we will develop methods like asynchronous REXEE or multi-topology REXEE. For installation instructions, theory overview, tutorials, and API references, please visit the documentation.

Copyright

Acknowledgements

Project based on the Computational Molecular Science Python Cookiecutter version 1.6.

ensemble_md's People

Contributors

Stargazers

Watchers

Forkers

eirrgang riazi-r shirtsgroup ajfriedman22

ensemble_md's Issues

A list of short term action items

Functionalities for testing purposes or of less usage should be deprecated

Here we list functionalities in the package that will be removed/deprecated soon:

The scheme of multiple swaps
The RMSE cutoff for weight combination
The option final for the YAML parameterw_combine

The documentation will also be updated accordingly.

Fixed-weight EEXE simulations should have the choice to be initiated by different sets of weights for different replicas

In the current implementation, the weights of EXE replicas in a fixed-weight EEXE simulation are initialized from the same set of weights defined in the template MDP file. For example, say that the weights corresponding to the full alchemical path that covers 4 states include 0, 0.5, 1.5, and 3.5. Then for the first replica that spans 3 states, the weights would be 0, 0.5, and 1.5, and those for the second replica would be 0, 1.0, 3.0 (after shifting the weight of the first state to 0).

However, the user should have the choice to have different sets of weights for states in different replicas. For example, one could have 0., 0.5, and 1.5 for the first replica, but have 0, 0.5, 3.0 for the second replica (instead of 0, 1.0, 3.0). This could be useful for, but not limited to EEXE simulations for multiple serial mutations and can possibly be done by inputting multiple MDP files or specifying the list of weights in the YAML file.

This issue is a part of the work in the project EEXE for serial mutations.

Information about the simulation is not printed on-the-fly to `run_REXEE_log.txt`

The information about the simulation is not printed on-the-fly to run_REXEE_log.txt. It is only printed to the file after the simulation is completed. If the simulation is interrupted, the content of run_REXEE_log.txt will be empty. This should be corrected, presumably by modifying Logger in utils.py.

Enable extension/continuation of weight-updating REXEE simulations

The current implementation does not support extension/continuation of REXEE simulations, which should be added shortly.

Refine the documentation of the package

Improve the descriptions of different MC schemes for swapping replicas (update with some experimental results)
Describe rules of thumb for ALL EEXE parameters
A section for analysis methods
Add simple examples in the docstrings
Proofread the documentation pages
Proofread all the docstrings
Set up tutorials

Modify `config.yaml` of CircleCI to allow rerunning failed tests

For unknown reasons, the unit tests for functions that use MPI occasionally fail with uninformative STDOUT/STDERR. Instead of figuring out the underlying reason for these failures, an easier way to ensure that at least the build on the master branch passed the CI may be simply rerunning the CI. However, this would require some modifications in config.yaml. Some instructions can be found here.

Remove the YAML parameter `acceptance`

In the updated theory, there is only one way to calculate the acceptance ratio assuming symmetric proposal probabilities. Previously we removed the option metropolis_eq/metropolis-eq from the YAML parameter acceptance and we decided to remove the state-state swapping scheme, which leaves only one option (metropolis) for the parameter acceptance and makes the parameter acceptance not necessary anymore. Therefore, the YAML parameter acceptance should be removed and the documentation should be updated accordingly.

A list of todos for the next release of `ensemble_md`

Here we list the todos that should be done before the next release of the ensemble_md package. The release number has not been finalized and can be either 0.10.0 or 1.0.0 depending on the status of the paper review. Overall, the todos are to improve the code coverage and the documentation of the package. This issue includes updated components from the plans mentioned in issues #7, #26, and #33, as parts of them have become outdated.

Improving the code coverage

Note that we do not include adding unit tests for analysis/msm_analysis.py since MSM analysis for REXEE is a WIP.

Add unit tests for analysis/analyze_free_energy.py
Add unit tests for analysis/clustering.py
Add unit tests for analysis/synthesize_data.py
Complete test_mpi_func.py, which tests functions that use MPI
Add unit tests for functionalities relevant to MT-REXEE, which are in replica_exchange_EE.py
Re-examine the missed lines to push the code coverage as much as possible.
Check if the warnings are actually harmless.

Potentially, we might want to consider adding regression tests for the CLIs.

Improving the documentation

Update the theory section of the documentation
Describe rules of thumb for REXEE parameters
Proofread the entire documentation
Proofread the docstrings
Adding simple examples in some of the docstrings

Potentially, we might want to add some simple tutorials before the next release.

A list of non-urgent but important enhancements for future consideration

Here we list todos that are not currently urgent but could be important for future work:

Re-examine previously written tests to see if there are better ways to test the functions. Presumably a lot of tests could be made more rigorous/meaningful using unittest.mock. test_analyze_traj.py should be free from this since most of its tests have just been written recently.
Examine the code in msm_analysis.py and write unit tests for its functionalities. Expand its functionalities if necessary.

Reducing the computational cost for the new implementation of EEXE with MPI-enabled GROMACS

In the commit c661fb8, we have enabled the use of MPI-enabled GROMACS and disabled the use of thread-MPI GROMACS. However, as discussed in issue #10 , in the original implementation, the execution of multiple grompp commands was parallelized by mpi4py using the conditional statement if rank < self.n_sim:, while in the new implementation that allows MPI-enabled GROMACS, the GROMACS tpr files are generated serially using the following lines in the function run_grompp:

# Run the GROMACS grompp commands in parallel
for i in range(self.n_sim):
    print(f'Generating the TPR file for replica {i} ...')
    returncode, stdout, stderr = self.run_gmx_cmd(args_list[i])
    if returncode != 0:
        print(f'Error:\n{stderr}')
        sys.exit(returncode)

I tested the code with a fixed-weight EEXE simulation for the anthracene system with nst_sim=1250 and n_sim on Bridges-2 with 64 cores requested and runtime_args={'-ntomp': '1'} and n_proc=64. As a result, 50 iterations took 327 seconds. This is much longer than the original implementation that used mpi4py to parallelize the GROMACS grompp commands. Specifically, using the new implementation, 20000 iterations would take approximately 327 * 400 seconds, which is around 36 hours, much longer than 13 hours required to finish the same simulation using the original implementation.

In light of this, we should figure out a way to parallelize the GROMACS grompp commands without introducing any noticeable overhead and without using mpi4py (to prevent nested MPI calls). Also, we should try to identify if there is any other source that contributed to higher computational costs in the new implementation of EEXE.

A checklist for docstring proofreading

As a part of issue #41, here is a checklist for keeping track of the progress of proofreading docstrings of all modules in ensemble_md.

Adapt the data to be pickled for cases with `subsampling_avg = True`

In free energy calculations for a REXEE simulation using the current implementation, the preprocessed data pickled do not support recalculations with different subsampling fractions and can therefore fail when subsampling_avg = True. This is because when subsampling_avg=True is set, variables like t_idx_list and g_list are required to calculate the input t and g values for preprocess_data, but these variables are not pickled. While this can be resolved by simply removing the pickled data and rerunning the entire calculation, codes should be modified so that the pickled data can also work with cases with subsampling_avg = True.

Add the CLI `run_EEXE_mpi`

As discussed in issues #10, and #13, the flag -multidir in mdrun can be used to enable MPI-enabled GROMACS, but the overhead could be high due to longer GROMACS start times. However, in cases where the exchange frequency is not too high so the introduced overhead is affordable (e.g. in EEXE simulations for multiple serial mutations), enabling MPI-enabled GROMACS might still be useful.

To make our EEXE implementation work with both thread-MPI GROMACS and MPI-enabled GROMACS, we decided to have two CLIs for running EEXE simulations, including the original CLI run_EEXE that works for thread-MPI GROMACS, and the CLI run_EEXE_mpi that we aim to develop here to work with MPI-enabled GROMACS. These two CLIs will be mostly the same, except that run_EEXE_mpi will not use mpi4py to avoid nested MPI calls. Some functions in ensemble_EXE.py may need to be modified to work with both thread-MPI GROMACS and MPI-enabled GROMACS, or, when this is not possible, functions specifically for MPI-enabled GROMACS will be added.

This issue is a part of the work in the project EEXE for serial mutations.

Improve existing unit tests and overall code coverage

Before the next release (presumably 0.5.0), one of the action items is to improve the quality of existing unit tests and covered the untested functions. Here is a checklist for the modules to test. An item is checked only if the existing tests have been re-examined and new tests have been added (when necessary) to increase the code coverage.

ensemble_EXE.py (finished at commit 09adb37)
utils/utils.py (commits d83e5a)
utils/gmx_parser.py
analysis/analyze_matrix.py
analysis/analyze_traj.py
analysis/analyze_free_energy.py
analysis/msm_analysis.py

Notably, there should not be a need to cover the CLIs and utils/exceptions.py. We do need to re-examine the CLIs to make sure they are all sufficiently compartmentalized, so I list them below as well.

Compartmentalize run_EEXE.py
Compartmentalize analyze_EEXe.py
Compartmentalize explore_EEXE.py

Enable different simulation input files for different replicas

We should enable different simulation input files, including the GRO and TOP files for different replicas. This is a feature especially required in EEXE simulations for multiple serial mutations because different replicas corresponding to different mutations require different TOP files and have different GRO files at λ = 0. Note that, as mentioned in issue #15, different replicas are bound to different changes in chemistry, in which case we shouldn't need to swap the TOP files, but just the GRO files (done by coordinate manipulation).

Note that this issue is a part of the work in the project EEXE for serial mutations.

A checklist for the documentation pages

Get passed the documentation building
Describe how $\Delta$ in the acceptance ratio is calculated
Describe how swaps should be proposed
Describe how concurrent multiple swaps should be carried out and the math behind it
Describe whether the weights are combined across the replicas to be swapped or across all possible replicas
Modify the description of specifying input files to reflect the fact that multiple GROMACS input files can be used.
Add the description that fixing the distance reference does not work for multiple input GRO files (and it doesn't make sense for it to work for multiple GRO files)
Add a section about what MDP parameters are recommended (e.g. nstlog should be less or equal to nst_sim in YAML)

Enable coordinate modification in the EEXE framework

To expand the usage of EEXE, we want to enable coordinate manipulation at exchanges between replicas, which is most likely to be useful for estimating the free energy of multiple serial mutations using expanded ensemble simulations, such as mutating methane into ethane and then propane.

For example, we can have an EEXE simulation composed of two replicas mutating methane into ethane and ethane into propane, respectively, and only exchange the coordinates between replicas when they are at the end states, i.e., replica 1 being at λ=1 and replica 2 being at λ=0. In this example, we will have the following end states:

Replica 1: Mutating methane to ethane
- State a: At λ = 0, we have methane with a dummy methyl group.
- State b: At λ = 1, we have ethane with a dummy H atom.
Replica 2: Mutating ethane to propane
- State c: At λ = 0, we have ethane with a dummy methyl group.
- State d: At λ = 1, we have propane with a dummy H atom.

At exchanges, we will have two output gro files respectively from replicas 1 and 2, namely rep1.gro (state b, ethane with a dummy H atom at the first carbon) and rep2.gro (state c, ethane with a dummy ethyl group at the second carbon).

Note that in EEXE, each replica is bound to the transformation for its assigned alchemical range. In our case, this means that replica 1 will only be responsible for the mutation of a methane to an ethane, and replica 2 will only be responsible for mutating an ethane to a propane. Normally, we would just swap the gro files as is, so in the next iteration, replica 1 will be initialized with rep2.gro and sample the intermediate states along the mutation path between methane and ethane. However, rep2.gro is an ethane with a dummy methyl group, not an ethane with a dummy H atom that we need for such sampling. The same thing would happen when trying to initialize the next iteration of replica 2 using rep1.gro.

To address this issue, we can modify rep2.gro as follows and use it to proceed to the next iteration of replica 1:

Remove the dummy methyl group at the second carbon atom from rep2.gro.
Attach a dummy H atom to the first carbon atom in rep2.gro. Specifically, the coordinate of the dummy H atom can just be the coordinates of the second carbon atom. There won't be clashes since the dummy H atoms have no interactions with the rest of the system.

Similarly, we can modify rep1.gro as follows for the next iteration of replica 2:

Remove the dummy H atom at the first carbon from rep1.gro.
Attach a dummy methyl group at the second carbon atom in rep1.gro. Specifically, we can take the internal coordinates of the methyl group in rep2.gro, treat the group as rigid, rotate, and attach the group to the second carbon atom in rep1.gro.

Importantly, we can make the two modified gro files have the same potential energy, so the proposed exchange will always be adopted.

Here, we are not going to implement functions for coordinate manipulation in EEXE but modify the CLI run_EEXE (and the function run_grompp in ensemble_EXE.py, if necessary) to allow the flexibility of calling a user-defined function for coordinate manipulation from an input python module (where the user-defined function is defined).

Update the theory section in the documentation

The current theory section in the documentation is outdated and a part of it is wrong. Should update it ASAP.