openpathsampling / openpathsampling Goto Github PK

View Code? Open in Web Editor NEW

98.0 14.0 47.0 274.57 MB

An open source Python framework for transition interface and path sampling calculations.

Home Page: http://openpathsampling.org

License: MIT License

Shell 0.15% HTML 0.01% Python 99.70% CSS 0.10% Makefile 0.01% C 0.04%

openpathsampling's Introduction

OpenPathSampling

An open source Python framework for transition interface and path sampling calculations.

See documentation here:

http://openpathsampling.org

openpathsampling's People

Contributors

Stargazers

Watchers

Forkers

jhprinz dwhswenson msultan asagodi marscher jchodera maxentile jonycastagna raffaelacabriolu arjunwadhawan sarahmccartan stefanopoggio rotarywave cleboo sroet singraber yangxi1209 hejung mackernan bolhuis nileshjchoudhary nicole-de-groot adamwlodarczyk king-michael md-club ajsilveira snow910716 mfsjmenger jasperdenduijf apdealbao tsenapathi share1992 madlenreiner zoebreed bdice oliverdutton layeqa pk-organics hjjonas shantanu-collab vera-weili asclepiusinformatica aximthered

openpathsampling's Issues

Which Topology to store ? mdtraj.Topology, openmm.Topology or our own...

We wanted to use mdtraj to store the topology, but it does not contain periodic boundaries for creating an openmm system that has periodic boundaries. Openmm.Topology has, but in converting from openmm to mdtraj the boundaries get lost.
We could add these boundaries by using

om_topology = md_topology.to_openmm()
om_topology.setUnitCellDimensions(dimensions)

We need some place to get the dimensions from, maybe the box_vectors from a snapshot. But we need to have these available for the simulation when it creates the system object. openmm solves this by sticking non-topology information into the topology object, which I would like to avoid.

Also since we are using mdtraj mostly for analysis we want to keep the md.topology as the main topology object.

My question would be where to store this extra information. I would think that we might store one configuration, either under a fixed ID or pointing to a configuration. When the storage is created the
topology is also stored and we could just add a snapshot (while a pdb already contains all we need!)
and point to it.

Any suggestions?

I have finished a simple JSON Serializer for mdtraj.Topology that also includes missing elements and restores from a saved JSON. So this can be ready soon

Add Storage for Toy Dynamics

will add storage capabilities for Toy Dynamics and create a Topology object for this as well

Add nose tests

Would be great to get at least some basic nose tests going here to test the various modules!

Change PDF-only documentation into README.md files

It would be great if files like this one were actually README.md markdown text files:
https://github.com/choderalab/msm-tis/blob/master/code/python/PySRMSTIS/src/tutorial_sheet.pdf

Request for showtree.py: State Labels at ends of paths

Path trees of the sort generated by showtree.py have mainly been used for TPS paths, not TIS paths. To be really useful for TIS paths, which can end in either state A or state B (or state C, D, etc., in these days of multiple state TIS), we need to know which states the paths connect.

Visually, my suggestion is to use a circle (or maybe rounded-rect); either of a color that corresponds to a given state, or putting a string (which would have to be maybe 3 or 4 chars max) corresponding to a given state in the circle. I think the string will look better. This would also mean that the user would have to provide a dictionary of Volume-to-color/name for the states. But that might not be so hard if the state booleans are saved as labelled order parameters in the storage... then the user just needs to specify which names are states, perhaps with a short version for the strings?

@jhprinz , showtree is your baby; would this be terribly hard to implement?

Logging

I’m starting to set up logging. This my first time playing with Python’s logging tools, which are quite nice, and I’m doing it now for a few reasons:

I suspect that as we develop PathMovers and Calculations, we’ll be doing a lot of debug-by-print. This is, of course, what logging is really intended for, and including logging statements makes it easier if we have to check the same part of code more than once.
@bolhuis suggested (a while ago) that when default parameters are used, we should announce that somewhere. This gives a way for users to discover new options (if they won’t RTFM). We can create a logger to do that.
We can replace our current informative print statements (about how the shooting is going, etc) with logging statements, allowing a user to decide whether to see them.

Anyway, the first of those is the main reason I’d like to do this now. The others are useful bonuses we get by properly setting up logging.

Here’s my suggestion for the logger layout:

Each module has its own logger (this seems to be a standard practice)
We have a global logger called opentis.initialization which logs all initialization parameters (handling the second bullet point above) and logs them at level "info"
Configuration of the logger setup is in a logging.conf (or ops_logging.conf) file. Default configuration is initialization logs info messages to a file initialization.log, and the module-level initializers log info messages to stdout and ignore debug-level messages. (At least, Calculation and PathMover info messages should go to stdout, maybe not everything else: all of this is easily configured in the logging.conf file)
The user will have to explicitly load the logging.conf file -- although it puts more work on the user (2 lines of code), everything I’ve read suggests that this is the best practice when writing a framework: the default behavior without the configuration file will be that all log messages are dropped.

Thoughts or suggestions? I've got proof of concept pretty much working on this, it's just a matter of adding messages and finding the right default behavior now.

Add PDFs for .lyx files (or just use .tex)

Will make it easier for me to read!

Getting a first runnable version...

Working to get a first running prototype to start from. There are still a few things that I need to finish:

use XMLSerializer in netCDF storage to better restart
add snapshots with reversed momenta by either
1. using new snapshots
2. mark a frame in a trajectory as inversed by using
  a. negative snapshot indices or
  b. a boolen variable
for all inversed frames store additional data like change in energies...

a negative index might be problematic since negative indices in python do not trigger an error but might read wrong frames instead. An additional bool might be overlooked in an implementation but reads at least the correct frame.

from an implementational point of view this translates into the question if an inversed snapshot is really a new object or just a different view. Also we want to be able to read frames fast

are there other changes to snapshots besides momenta inversion? Maybe consider a trajectory for another ensemble like with changed pressure or temperature. In this case it might be good to decouple saving a snapshot and its momenta and positions. This means
- a trajectory is a list of snapshot IDs
- a snapshot has its ID and links to a position ID and/or a momenta ID. This would allow to store snapshots without momenta or generate the same configuration with differerent momenta.
- an indicator if momenta should be reversed (optional - this should be part of a trajectory, but if we wanted to really create a new snapshot then we could do that)
- a snapshot also has additional information like box vectors, energies, etc...
- since in all shooting simulations we have forward and backward trajectories we should store the time direction of each frame in a boolean. Need to think about this.
setup a well documented example for Alanine Dipeptide
rewrite the usage of the KMediod Clustering
include MSM generation
improve documentation

Enable travis building of pull requests

Don't forget to enable the building of pull requests. This is very useful.

compatible to Python 3

Coule we change the code so that it runs also under Python 3?
I had the idea to add a Siegtank converter which would allow to import a complete Siegetank project into our netCDF Storage and continuously sync it, as it progresses.
The idea was to just get the actual status of a siegetank project. Get all the trajectories one by one, strip water and store the essential part (I assume that I don't need water and or momenta) in our storage.
Unfortunately Siegetank requires Python 3 and to avoid having to seperate scripts it would be great to keep out code also runnable (if not optimal) under Python 3.

This seems to involve mainly

change print to print()
take care of unicode strings
rewrite some iterables (e.g. iterates() does not work in Python 3)

which is not too much work. Most of it can be done automatically.

Would that be a good goal? I mean importing from Siegetank is a nice gimmick but potentially not relevent unless we want to use Siegetank at some point to run our trajectories (which would be compatible withe the next simulation generator scheme)

Change .rtf files to .md markdown

Would be helpful for readability.

Project Name

Just to collect ideas separately for a while...

About the name. So this package effective allows to sample ensembles of paths. So something with these three words?

PathFinder
PES (PathEnsembleSampling)
Groups of paths might form a "Flow"?
FlowSampling

Lots of paths form a street (Autobahn) a map

PathMapper
EnsembleMapper
MapSampler

Cartography?

Reorganize examples?

The examples/ directory could use some reorganizing. Right now, the contents are

alanine/
data/
ipython/
openmm/

Perhaps we could add examples/README.md as a master index of examples, and then have more meaningful subdirectory names that represent the nature of the system and dynamics under study?

Should we use openmmtools.integrators instead of duplicating it here?

We now have a copy of integrators in openmmtools.integrators. Should we use that instead of the integrators.py here? We have to make some changes to the variable names anyway to avoid potential bugs with CustomForce systems, so it might be easiest to only maintain one set of these files.

On the other hand, if there's the expectation that a lot of new custom integrators that would not be of general interest might be added, it may be useful to keep a local integrators.py. But for now, this should probably be avoided.

Change repo name to openpathsampling?

Might be a good time to do this, since we're getting everything set up with http://openpathsampling.org

Overshooting vs. Not overshooting: Consistency in `DynamicsEngine.generate`

This is one of those issues which won’t cause any problems in practice, but should be handled if we’re going to claim that our code can handle any general path-sampling-like approach. I’m adding a tag "noturgent" to mark this and other questions like it. (Until we solve these, our code isn’t as general-purpose as we claim, but these bugs won’t break any likely approach.)

Depending on the nature of the continue conditions (which, in turn, typically depends on the nature of the ensemble), we can get one of two types of paths from DynamicsEngine.generate. To describe the problem, think in the context of having a continue condition based on Ensemble.can_append for some ensemble:

If the path ensemble has LengthEnsemble requirements at the end (which is how we’re setting up our TIS ensemble, and all the ensembles I can think of from path sampling approaches can at least be written that way), then a path which could be accepted is exactly the path which is generated by DynamicsEngine.generate()
If the path ensemble does not have a LengthEnsemble requirement but does have a way to stop (e.g., if it ends with an InXEnsemble), then we have to overshoot before can_append will return false. In this case, generate returns a trajectory which has one frame too many; we must trim that trajectory in order to accept it.

I see three possible solutions to this:

Every path mover which uses generate must, if the trajectory returned by generate isn't acceptable, also check the trajectory with one fewer frame, to see if we overshot. This is easy to implement, but it is an ugly approach because it requires the PathMover to modify the trajectory, which really should be the job of the DynamicsEngine. It also requires more work on the part of contributors who might add PathMovers.
Always overshoot. This is also easy to implement, but I don’t like it because it requires us to generate one extra frame for each trajectory.
Split can_append into two functions: one that really is can_append, and one that is not_overshot. The dynamics_engine would have to know whether it had to stop because it can’t append another trajectory, or whether the stop was because of an overshoot. If it overshot, then it would remove the extra frame and return the pre-overshoot trajectory. I think this is by far the best solution; the only problem is that it isn’t trivial to implement. If we agree that this is the way to go, I have a rough idea of how it would look, but still need to work out details.

Personally, I think this (like anything with the "noturgent" tag) is in the "not urgent but important" quadrant of the Eisenhower matrix, meaning we should come back to it before release but not before getting working TIS. The current implementation will work fine for TIS, TPS, and AMS (and probably for FFS, steered TPS, and anything else we’re putting into 1.0).

Add tests for OpenMM functionality

We (I) should add some tests to check the open simulator object. Right now I just run the alanine_example which fails if the simulation takes too long.

Fix MSMBuilder build error

Conda now defaults to MSMB3, which does not have the metrics package, breaking Travis builds because orderparameter.py calls for msmbuilder.metrics.

This was given a temporary fix in #93 by fixing msmbuilder version to an old version (the same one I got through pip a while ago). This is not the best solution, so something better should be done.

Assigning to @jhprinz, since the MSM side of things is his expertise, and he can more easily correct the existing code.

First version Q&A

The first thing I'm trying to do is to convert my XYZ files into your NetCDF format and vice versa. This will be an quick and dirty hack, since it isn't intended to be used very much. But I need to know how to do the following:

Cool, idea. Before I get into this, we could also write at some point a different simulator that will just generate trajectories in your example potentials. This does not have to
use OpenMM at all, but I guess that would be nice.

What you have to do is to generate a new netCDF file with a topology file that you have to create. This will be stored in a serialized form in the netCDF so that it is always available
if you continue your runs. When you create the netCDF it takes the atom number from the topology and creates a dimension called 'atoms' and sets it to the number. After that
the tables that holds the atom coordinates is fixed in this size and cannot be changed anymore (I think). The number of frames is set to 0 which means that it can grow in this
direction.

At the topology level:

get and set number of atoms

get and set atom name labels

Hmm, this is actually the most tricky part, since I am using an mdtraj.Topology object for this that contains all these information. For you you would have to construct
this for your system, which should be easy, if you have done this before. If you send be a rough outline of the structure I can assist you with this.

At the snapshot level:

get and set snapshot velocities

get and set snapshot positions

I recently changed the structure on this so we should make sure that you have the newest version. The main point was, that I did not have a way to store reversed frames.
I wanted to keep this information in a boolean which is effectively attached to the trajectory and not to a snapshot. So what I came up with so far is:

A 'Trajectory' is a list of 'Snapshot'

A 'Snapshots' consists of

a 'Configuration' Object or None
a 'Momentum' Object or None
a boolean indicating if the velocities are reversed

A 'Configuration' contains

a set of coordinates
an ID where the coordinates are stored
a potential energy
a set of box_vectors

A 'Momentum' contains

a set of velocities
an ID where the velocities are stored
a kinetic energy

And in the database there are several tables now that are used to reconstruct these objects

configuration_coordinates
configuration_potential_energy
configuration_box_vectors

momentum_velocities
momentum_kinetic_energy

trajectory_configuration_idx
trajectory_momentum_idx
trajectory_momentum_reversed
trajectory_path_hamiltonian # not used so far
trajectory_length

Unfortunately so far I am using a maximal trajectory length, but this can be easily changed to allow for infinite length trajectories, although storage is not effective then.

Alright, this makes it a little more complicated, but should only be under the hood and not make a problem for the user. He can just work with trajectories and snapshots.
But now we can also store easily snapshots without their momentum without problems. Or store the same coordinates with several velocities, which could be useful.
Also, reversibility is treated naturally and we just store the same momentum index and set the frame to be reversed in the snapshot.

Having said that how do we get the positions into the database. First make an empty trajectory

traj = Trajectory([])

You need to construct snapshots using

new_frame = Snapshot(coordinates = your_numpy_array_of_nx3_coordinates, box_vectors = your_box_vectors, kinetic_energy = whatever)

then save it using

new_frame.save()

or, if you have a problem because I did not correctly implement the saving

new_frame.configuration.save()

which stores only the coordinates, since the momentum is empty (I need to check if this really works).

Finally append this to your trajectory using

traj.append(new_frame)

and loop. Last save the trajectory so that we know the order of frames ;)

traj.save()

At the trajectory level:

get and add snapshot in the trajectory

you retrieve a trajectory from the database using

traj = Trajectory.get(index)

get snapshot just like using a list. You can also use + to combine lists and use slicing as well as use a list itself to get only specific frames, etc…

my_snapshot = traj[index]

e.g.

new_trajectory = traj[0:10]
new_sliced_trajectory = traj[ [0,10,20,30,40] ]
snapshot = traj[2]

replacing and appending also works as usual

traj.append(my_snapshot)

snapshots can be reversed by

snapshot.reverse()

which just flags the snapshot that its momenta should be reversed if retrieved.

You get back the coordinates and/or velocities using

my_snapshot.coordinates # this is short for my_snapshot.configuration.coordinates
my_snapshot.velocities # NOT  short for my_snapshot.momentum.velocities because    my_snapshot.momentum.velocities returns the momentum in the attached snapshot while my_snapshot.velocities get back reversed ones depending on the reversed flag in snapshot !

provided of course that the snapshot has momenta or coordinates.Internally a missing configuration or momentum is marked by index zero (0). Interestingly even, if the momentum is missing it can still be marked as reversed. Might come in handy for later
analysis

At the trajectory-set (storage?) level (i.e., set of saved trajectories):

get and add a trajectory in the trajectory-set

as I showed before. Make a trajectory by whatever means and then call save on it and it gets added, like

traj = Trajectory([ snapshot_1, snapshot_2 ])
traj.save()

new_traj = traj[0:100:2]
new_traj.save()

If you put a number in the save command (works also for Configuration and Momentum) you can replace an existing one.
So far I have not tested what happens if the index does not exist. I believe it just gets created by the netCDF library.

traj.save(10)

At the file level:

read in and write out your data objects to a file.

I assume you want to write these as trajectory files? For this I am using mdtraj although I have not updated to version 1.0.0 yet.

http://mdtraj.org/1.0.0/api/trajectory_files.html

which allows to write most file formats. Unfortunately not .xyz format. You probably know best on how to do that. I have some functions to retrieve just coordinates as a numpy array which might help.
use

storage.trajectory_coordinates_as_array(trajectory_index, list_of_atom_coordinates (or None to get all))

here storage is a link to the global storage object. In the sandbox.py example this is given in

simulator.storage

to use mdtraj to store a trajectory, e.g. as a multi-frame pdb which can be read (very slowly) in VMD use the md() command on a trajectory to get an mdtraj trajectory object that can be stored. E.g.

my_traj.md().save_pdb('mdtraj.pdb', True)

The true at the end allows to replace existing files

One nice thing that a trajectory object can also do is to use subsets for specific atom_coordinates. So if you want to store only specific atoms you could use

small_traj = my_traj.subset([0,1,2,3,4,5,6])

which will generate a new trajectory object that remembers that certain output should only give atoms 0 through 6. This is still a little experimental after the most recent update.
I haven’t really tested it. Since we might mostly want only the solute, I have a shortcut for the subset with solute coordinates (if they are specified). You can just do

solute_traj = my_traj.solute

Internally these subset trajectories are only for post-processing. They never get saved and internally there is always the full coordinates saved. So finally saving the solute trajectory is done by

my_traj.solute.md().save_pdb('solute_traj.pdb', True)

We could add a XYZ saver, that could save some time.

SlicedTrajectoryEnsemble

As discussed in #37 (comment) and #69 (comment), we should implement a SlicedTrajectoryEnsemble to return most of the functionality we previously had from the frames attribute of VolumeEnsembles. The idea is to use the fact that trajectories are lists, and this subclass of AlteredTrajectoryEnsemble modifies the trajectory to take a slice of frames, defined by a Python slice object. The old approach didn’t play well with SequentialEnsembles; this one should.

One question: what's the best way to handle single frames? I see 3 reasonable options.

take an int for the frames argument to __init__, but internally convert that to a slice
use a different object for single frame slices (SingleFrameEnsemble)
have additional logic in the SlicedTrajectoryEnsemble methods for single frames and slices

That's roughly my order of preference, but I'd appreciate second opinions. My idea for the first approach is that the __init__ function (with argument frames) would have something like:

if type(frames) == int:
    self.frames = slice(frames, frames+1)

That gives the extra flexibility with the smallest amount of additional code to maintain.

Add adaptive MSM generation

I had this idea to extend the framework a little bit to also allow "us" to make regular (adaptive) MSM simulations. These usually work, by starting somewhere to run a lot of short trajectories. Then do analysis and then pick some samples to start more simulations from these points.

This is very similar to what we do, we only

to not try to stop simulations, but run for a fixed time
do MSM construction (okay, we need that for MSMTIS!)
do not pick uniform but specific samples
make a special path move, that is not a shooting, but only restarting at a point

We could even do this with several replicas easily. The storage is not super effective, but with all the order parameter logic etc might provide an easy way to analyze a system.

I think I will start implementing the necessary steps and see what happens.

Parliamentary Inquiry: Where to put scientific discussions/results?

(Parliamentary inquiry in the sense of a question about the rules of order -- not an investigation of wrongdoing.)

Where do we want to put scientific discussion?

A lot of what we'll be working on soon (e.g., how to define interfaces/order parameters based on cluster centers) are things where I think we'll try a few reasonable ideas and see what works. Mainly, I'm thinking about the kinds of results/discussions that I will normally share with Peter at the Monday meeting of our path sampling subgroup. What’s the best way to keep those of you at MSKCC in the loop as well?

As I see it, we have at least three choices (or more if we use tools other than this GitHub repo):

GitHub issues. Advantages: Everyone gets an email and keeps up to date on the discussion. Disadvantages: Not a great platform for equation-heavy discussions; not easy to summarize conclusions into something for a manuscript.
GitHub Wiki. Voordelen: Seems like this kind of idea development might work on a wiki. Nadelen: Not easy to follow casually as a discussion. Also not great for equations.
TeX file in the repo. Upsides: Best for mathematical discussions and eventual conversion into a manuscript. Downsides: Not easy to follow casually as a discussion.

Maybe some combination of these (writing a TeX file and calling people's attention to it with a new issue?) would work best. Preferences? Suggestions?

On "MultiMovers" (this is a long one….)

(Note: I wrote all of this a couple weeks ago, and kept forgetting to post it. JHP, please take a look, and let me know what you think of the move_path attribute, whether it'll cause problems for storage: I think there's a problem there we need to solve, but I'm open to other solutions.)

Highlights

I describe an overall approach to combinations of PathMovers, included several classes which I’ve already mostly implemented
I suggest changes to the MoveDetails object, in particular replacing the mover attribute with a move_path attribute, which is a list of 2-tuples (mover, accepted)
I describe how the Sample.details.accepted can be derived from Sample.details.move_path and how this allows much more analysis of the behavior of our PathMovers (a key tool for practical troubleshooting)
I give examples of how real PathMovers can be implemented using these combination tools

The main thing I need is advice from @jhprinz on whether this Sample.details.move_path will be a pain to add to storage.

If you see any holes in this implementation, let me know, but I think it will allow us to quickly add new move schemes in the future. It’s currently being implemented in the branch for PR #106.

As I’ve been putting together the framework for PathMovers in general, I’ve been thinking about the fact that many of our more interesting moves in path space can be described as some combination of more elementary moves. This is much the same as how our Ensembles are made of particular combinations of other the elementary ensembles.

So the questions I’ve been working on answering are:

What are the elementary moves?
What are the relevant combinations?
How do we manage these combinations correctly; i.e., how do we make sure we have the right acceptance, etc?

For the first two, we’re fine if the answers are only partially complete. In fact, I’m not planning to implement some parts of this. But we do want to correctly anticipate best way to handle the last one.

The Move Tree

Before diving into the specifics, let me introduce the general idea of the move tree. The basic idea is that we typically have a decision tree to select which PathMover will generate Samples for this step in the Monte Carlo. For example, perhaps you first choose a replica exchange out of the list of shooting, replica exchange, minus move, or path reversal. Then you have to choose a specific replica exchange pair to use.

In order to make this sort of process as flexible as possible, we use the idea of the move tree. In general (at least with PathSampling approaches) we’ll have a root_mover which leads to the whole tree.

The point is that we need to, in some way, record the move tree path that led to generating a specific Sample. In fact, I’d argue that the Sample should be labelled as generated by the move tree path, not just by a given mover (in fact, JHP already added something to this effect by having MixedMover modify the sample.details -- what I propose here is to generalize and complete that).

Some moves also consist of several sequential submoves. That’s also handled in this discussion.

Elementary Moves

I don’t claim that this list is exhaustive, but here are a few of the moves I have identified as elementary. We may add others as we go along:

ForwardShooting and BackwardShooting
TwoWayShooting (not to be implemented in version 1.0)
ReplicaExchange
EnsembleHop

I think that a number of other moves (e.g., StateSwap) are just special cases of ReplicaExchange or EnsembleHop. A few sections below, I’ll show that OneWayShooting and MinusMove can be described using these elementary moves and the combination tools I describe in the next section.

It’s possible that 2-way shooting can be implemented as a combination move. I’d prefer that, although I haven’t thought out all the details of what would be needed to make that happen.

`MultiMover`s: Combining other moves

We need various ways to combine these moves to give the overall behavior we want. I’ve identified 5 such combinations; there may be others to be made. I describe each as a move consisting of submoves.

MixedMover : Selects a submove at random (optionally with weighted selection probabilities). Does only that move.
SequentialMover : Does its submoves in order, attempting all submoves and applying them to the global state sequentially. For example, this could be a move scheme that involved doing all repex in increasing order of replica, followed by shooting on all replicas, followed by all repex in decreasing order of replica.
PartialAcceptanceSequentialMover : Does its submoves in order until/unless one is rejected. All moves up to that point are applied to the global state. For example, a bootstrapping move first does a Shooting; if that is accepted, then it attempts an EnsembleHop. But even if the EnsembleHop is rejected, we want to apply the success of the Shooting.
ConditionalSequentialMover : Does its submoves in order until/unless one is rejected, but if one is rejected, previous steps are not applied to the global state. For example, a MinusMove involves first attempting a replica exchange, and then doing an extension (specialized shooting) move. If the replica exchange fails, then we never start the expensive extension process. If the extension fails, then the whole move is rejected (including the successful replica exchange).
SimultaneousMover (not implemented in version 1.0): This is similar to the SequentialMover, but instead of updating the global state after each move, all samples are generated starting from the same global state. I think this is only a good idea if there’s no overlap in the replicas affected by the submoves (for example, when doing a shooting move on all replicas). In principle, this could be parallelized. For now, such cases give equivalent results if run as a SequentialMover.

Common Moves as Combinations of Elementary Moves

As discussed in #104, we could either make moves like these as pseudoclasses (i.e., offer convenience functions that construct them, but don’t actually have a class named, for example, OneWayShooting) or we can make actual classes which are mostly subclasses with different __init__ functions. I prefer the latter, because I think it will make analysis easier, and it may make fancier moves possible as well. Advanced users can always use either approach to create their own moves.

One-Way Shooting

Subclass of MixedMover, consisting of ForwardShootingMover and BackwardShootingMover, both using the same ShootingPointSelector. This is, indeed, how we’ve already implemented this.

Minus Move

Subclass of MixedMover, consisting of ForwardMinusMover and BackwardMinusMover. Each of those are, in turn, subclasses of ConditionalSequentialMover. The submoves of those ConditionalSequentialMovers are a replica exchange move followed by a shooting move (which selects the endpoint as the shooting point).

Managing `MoveDetails`

One part that gets tricky is that we now have to manage all the details of this information in the MoveDetails object. My suggestion is to replace the MoveDetails.mover object with a list, MoveDetails.mover_path, which consists of 2-tuples of (mover_ID, accepted) for each of the moves in the move tree path that led to this sample. MoveDetails.accepted is true if and only if all moves in the tree path were accepted.

This allows us to easily analyze acceptance ratios (which give us the first sign of problems) from many different angles: by starting with a move that includes all shooting moves, we can get the overall shooting accepted. Then we can drill down to see the acceptance for each individual replica. Then we can drill even further down to see how forward/backward shooting acceptance ratios differ for each replica.

To make this work, we need to think about the acceptance given in the mover_path 2-tuple for each multimover. For all movers except ConditionalSequentialMover, this is always True: acceptance of the sample depends only on acceptance of the submove. The previous paragraph is done by taking all Samples containing a given mover in their sample.details.mover_paths, and looking at sample.details.accepted.

But consider a MinusMove where the replica exchange part is accepted, but the extension is not. This will generate two samples: one from the replica exchange, and one from the extension. The mover_path entry from the replica exchange sample will say that the replica exchange was accepted, whereas the entry from the extension part will say it was accepted. The key is that in BOTH cases, the mover_path entry associated with the ConditionalSequentialEnsemble will be rejected. In this case, we can analyze the acceptance ratio of the replica exchange part by looking at its contributions to the mover_path list, which will give different information from the overall acceptance given by sample.details.accepted, which would be false for both of these.

Storage changes

The only downside of this approach is that it does require some modification to storage. Basically, we need to be able to store these move_path lists. @jhprinz, would that be hard to implement?

Refactor Snapshot(), Configuration() and Momentum() to be independent of OpenMM

Like in the toy_dyanmics example of David. He uses a function on the simulator to get a snapshot object.
So the simulator is responsible to create the snapshot not the snapshot using a simulation itself!

Or we add function to the simulator that get called by the Snapshot.init to get the necessary stuff from the simulator. Although I like the first way better. Snapshots itself can only be build using coordinates, energies, etc but not using a simulator/context object.

Set up travis-ci

You can use the mdtraj template for this:
https://github.com/SimTk/mdtraj
https://github.com/SimTk/mdtraj/blob/master/.travis.yml

GPU Cluster Access and OpenCL

I think it would be good to get somehow David and Peter access to the GPU cluster if there is a way to do so. Mainly for faster testing right now than for production. It would also be a good test for a potential conda package.

Also, there is a way to use AMD graphics cards, but not on a Mac. Linux should work though!

Conda Package on Binstar

in the Travis-CI setup we actually already create a conda package and it would be great to make this publicly available (after a successful build). I think we are almost at a stage where we could do that. That would also come in handy when we want to test the framework on a cluster.

I looked at the way mdtraj does this and (as John mentioned) we need encrypted key to do so. My question @jchodera would be what account to use on binstar for this. I have one but have not access to user omnia so I cannot upload to binstar for this account and I am not sure if my account can host it or if we want my account to do so.

Ultimately it should be part of omnia. What is the best way to do so and is there a way to keep this private until we are confident to really make this publicly available?

Search for Samples by MoverPath specifics

Find samples that have been generated by specific path of PathMovers.

Need a class LambdaVolumePeriodic

LambdaVolume as it stands doesn't work for volumes which wrap around periodic boundaries (e.g., defining the \beta state of alanine dipeptide in terms of the \psi angles). This can be hacked as the union of two LambdaVolumes, but I think a separate subclass class will be cleaner for the user.

I'll do this (already assigned to self); just making the issue to remind myself to come back to it.

Combined PathMovers: proper class or pseudoclass?

As I’m working on some of the PathMover objects, I realize that there are a lot of cases where moves can be considered combinations of other moves. The question I have is whether we want to create them as proper classes in their own right, or whether we want them to be generated using convenience methods.

For example, a OneWayShootingMover is really a MixedMover that randomly picks to either do a BackwardShootingMover or a ForwardShootingMover. Do we want the returned type of OneWayShootingMover to be MixedMover or OneWayShootingMover? With ensembles, we’ve chosen to not treat them as classes (for example, a TISEnsemble is just a shortcut for the user to create a special kind of SequentialEnsemble; the code only sees it as a SequentialEnsemble.

If moves are not separate classes:

Easier for users to define their own; they don’t need to subclass anything, just to create an instance of the combination movers we have
It can still be given a particular name by overriding the .name attribute, which normally defaults to class name but could be something else to make analysis easier.
For a non-programmer user, it might be hard to re-use newly defined movers. Basically, creating a one-off mover might be easier, but once we’re talking about creating a convenience function to create those movers, the user already needs to be working with the code at the "contributor" level (so it isn't really any easier than creating new classes).

If moves are separate classes:

Analysis is probably easier by default
It still isn’t hard to implement new moves: in most cases, the contributor just has to write a different __init__() function (assuming the .move() is just a combination type we've defined of the moves we’ve defined). If that isn’t enough, it’s easy for the contributor to see how to make much more complicated things by overriding .move()

I think the best thing to do is probably to treat these separate classes. Note that both approaches will be available to users: they can always define one-off moves without creating subclasse, or (of course) they can always create subclasses. The question is which approach we use, and therefore recommend by example.

Rename github repo

Do you want to rename this GitHub repo from opentis to openpathsampling?

Splitting of netCDF4 files

Peter @bolhuis suggested that it might be useful to have a splitting feature for the file storage. I thought this should be easy and kind of failed to do quickly implement it. So now I have a draft version that allows us to call a "split" function that then creates a new file and continues storing in the second file, which is itself a valid netCDF4 Storage files. Internally it looks exactly as a normal Storage so that no code outside of the Storage is touched.

To do so, I needed to subclass the netcdf.Variable and especially netcdf.Dimension to controle the splitting of a dimension. This means if a file is split the class knows how many items are in each file and if you request no 500 then it find the correct file adjusts the position and reads the object.

This all works so far, although I need more testing. I think this is especially useful to create milestones in simulations and have files, that are completely separated. And can already be moved and/or analyzed.
Also all of this is independend of the previous implementations. There is just a new subclass of storage that is different. So, we can completely ignore the splitting of we want to.

There are a few questions I have though:

Lets assume that each file can store 100 configurations and if I request number 150 then it will load no50 from file no. 2. Simple. The problem is that all objects can reference each other, which means, that in file 2 might be a trajectory that points to configuration 10, which is in file 1 and even if a trajectory in file two points to object 150, this object is found in the same file at position 50 not 150. This means, that although the file looks intact like a storage, without the other files it is partially meaningless. And I would like to avoid this.

I think, it would be better to have parts that are completely intact (if this makes sense). This will require two additional features:

Store some objects multiple times. This is easy since we can easily check, if a file is already in a specific storage file present. If not, add it. This might give some overhead but makes sure that each separate file is complete. In the above example of a trajectory in file no. 2 that points to frame 50 in file 1 we would just store this frame again in file 2
Store numbering. Right now each object has a number that is just given by the row it is in and we would have to switch to a numbering that allows each object to store its ID. So we need an additional integer table for each storable object. That could be done automatically, but it would require to change the current file system although to use the arbitrary numbering. We actually have a similar mechanism already, but it uses strings to name objects of certain types and we might add just using numbers to it.
One drawback is, that it will make access a little slower. Maybe not much, but we will always have to search the list of IDs to find the stored position of an object. This has also the nice effect that all frames independent of the stored file have a unique ID. Right now if we store, say, every second frame also in a seperate file they will have different number and we cannot relate them later. We might even think about adding a UUID for the project so that we can later reference all generated object uniquely!

One second problem. Right now I store order parameters per configuration. This means that an order parameter has the same table length as there are numbers of configurations in the file. This means for splitting that if we create a new order parameter (or save it!) after we have started splitting, we need to add the orderparameter to all files. Also, if we add cached data to the order parameter and save again. It will potentially need to access all storage files, which kind of breaks the nice property of having only one active file and the rest is immutable.

I would prefer this immutable idea so that we can really take first N-1 split files and store or analyze them already. The main problem stems from the fact that an orderparamter object is not immutable and its cache will change over time. There are several ways to fix this:

Just allow write access to the last open file and if we compute order parameters for already saved configurations these will not be saved and are lost. So we have to encourage the user to compute and save all orderparamters before a split
Save order parameters in a different way and store the whole data in one blob in the active file. If we save again, we will write the whole cached order parameter again and for loading only read the last stored one. This would require the user not to store the orderparamter too often. This can potentially increase the overhead in stored data a lot and we would also have to think about the consistency, since the storage of an order parameter in priciple needs all related configurations in the file present to not break the completeness of the file and would force us to store ALL configurations that are present in the orderparamter.
We might store all frames that have an unsaved order parameter again and then save the orderparameter with them. This again might produce a big overhead.
Always compute all order parameters before saving without checking if it is needed. This might be a problem since an orderparameter might have a very complex procedure to compute values.

I would prefer no. 1 and kind of make the order parameter immutable in the file. It could still happen that we save a configuration again for completeness and in the meantime have more order parameters. Then these of course would be also stored. This way we keep each file consistent and can maybe only use the last file for a restart.

Summary

We would like to be able to split the storage files fow various reasons. I would like this property to be independent of our old concept and allow to use this, if necessary and have each file to be

complete: It contains all objects that are referenced within the file itself
consistent: All the references use a unique number
immutable except the current/last open file: For purposes of archiving, moving and analysis only the last file can be saved to. All previous files are immutable and obey 1. and 2.

All of the ideas are of course consistent with our current single file approach and require almost no changes.

Downsides

The additional files and numbering will increase the read overhead (writing should be fine). Especially if we have to find first, in which files an object is present. The completeness overhead should make this not too bad since related data can always be found in a single file.

Filenaming convention

Stupid question, but
filename.[###].nc
or
filename.nc.[###]
or do we want our own fileending?
filename.[###].ops

using hdf5 or netCDF4 multiple support

I saw that the netCDF4 supports multifile read, so you can just concatenate file and internally treat it as one, but that is not sufficient for what we want.
I think hdf5 could allow to split data to multiple files but I want to have the completeness and this needs to be handled internally by the storage.

storage creation not done in engines and make engines storable

To allow to create several engine objects I would like to move the storage creation out of the engines and let this be handled by the user. So the storage has all the information about the system, n_atoms, topology, an initial configuration which is what an engine might need

storage = Storage('file.nc', mode='w/a', topology=..., n_atoms=..., configuration=...)
engine = OpenMM(storage=storage, opts) # creates new engine

# save the options in an engine used to generate it
storage.engines.save(engine)

# create an engine using the stored options and information in the store
old_engine = storage.engines.load(0, OpenMM)

This would save exactly what we save right now, just in a more separate way.

Next, we would still have to pass the engine to the PathMover class. Either by setting the global using

PathMover.simulator = engine

or per mover

mover = ForwardMover(..., engine=engine)

Suggestions?

Toy Dynamics should have separate System object

Per discussion in #65 (comment) ff., Toy Dynamics should have a separate object (perhaps called ToySystem or ToySystemState), such that for simulation = ToySimulation(), instead of

vel = simulation.velocities
pos = simulation.positions

we have

vel = simulation.current_system.velocities
pos = simulation.current_system.positions

This makes Toy Dynamics more pedagogically useful

Refactor : Renaming of functions and variables

A new issue to discuss name changes:

Classes and member variables

class Snapshot()
[the object holding a configuration, a momentum and a reversed flag]
=> keep
=> Frame()

Snapshot.configuration
[the pointer to the configuration]
=> keep
=> Snapshot.coordinates
=> Snapshot.configuration
=> Snapshot.potential (better not, although we could have potential.energy, kinetic.energy)

Snapshot.momentum
[the pointer to the velocities]
=> Snapshot.velocity
=> Snapshot.velocities
=> Snapshot.kinetic

class Configuration()
[the object holding coordinates, box_vectors, potential_energies per frame]
=> keep
=> class Coordinates()

Configuration.coordinates
[table of atom positions]
=> keep
=> Configuration.xyz

class Momentum()
[the object holding velocities and kinetic energies per frame]
=> class Velocity()
=> class Velocities()
=> class Kinetics()

Momentum.velocities
[the actual velocity table]
=> Velocity.coordinates
=> Velocity.xyz
=> Velocity.velocity_xyz

Examples

coordinates_xyz = snapshot1.coordinates.xyz
velocities_xyz = snapshot1.velocities.xyz
snapshot1.coordinates.potential_energy
snapshot1.velocities.kinetic_energy
snapshot1.coordinates.box_vectors

Functions

Ensemble.forward(trajectory)
[the function that checks if a trajectory can still be in an ensemble if extended by one frame at the end ] => Ensemble.continue_foward(trajectory)

Ensemble.backward(trajectory)
[the function that checks if a trajectory can still be in an ensemble if extended by one frame at the beginning ]
=> Ensemble.continue_backward(trajectory)

License?

What license do you guys want to use?

How about the same LGPL that OpenMM is released under? That's reasonably permissive and consistent with other projects in the lab.

Units in storage

@dwhswenson and me discussed that it would be nice to use different unit system in the storage which is especially useful when it comes to using ToyModels.

We would like to set units to be used for storage and then, if we store a Quantity() in the wrong units they get converted automatically.

I found that simtk has such a feature already which I would use, but am not sure how to treat non-physical unit systems where we do not use units. Maybe we just should not do that. Since I have almost implemented that feature I propose that:

a storage has a simtk.Unitystem() if not set this uses md_unit_system, which uses (Dalton, nanometers, picoseconds, kelvin, kilojoule_per_mole and radians)
when a variable/table is created in the netCDF file we save an attribute with it that specifies the unit. This option should be given as a smitk.unit parameter and it allows it to be converted to units that fit into the unit system. Example: We call

storage.init_variable('configuration_coordinates', 'float', units = simtk.unit.meters)

and since in the standard md_unit_system length is in nanometers this would effectively call

storage.init_variable('configuration_coordinates', 'float', units = 'nanometers')

storage.init_variable('configuration_coordinates', 'float', units = 'nm')

When we now store data of dimension length we can automatically convert to the unit in the storage and save. E.g. when we use a storage with ångström the values from openmm (in nanometers) get multiplied by 10.

Internally the object_storage instance determines at initialization what the stored unit should be unless it is overridden
We might add the option to disable units completely

Address confusion due to similarity of `Simulation` and `Simulator` names

We have two objects, Simulator and Simulation, which have very similar names and very different purposes. I think that will be terribly confusing for users. In addition, we have a Calculation class whose purpose might also be confused with the other two.

Here is my description of what each object does (@jhprinz, let me know if I'm missing any aspects):

Simulation: interacts directly with the underlying MD engine. It just transfers state information (positions, velocities, etc.) back and forth between MD engine and our code.
Simulator: this one is a bit harder to describe. It mainly involves two functions: (1) setting up the system (including setting up our storage and setting up topologies, force fields, etc. for the MD engine); (2) the generate function, which runs the MD and stops when we reach a stopping condition.
Calculation: think of this as the main() for whatever kind of sampling we're doing: it's the part that distinguishes the overall sampling algorithm, e.g., making TIS different from SATIS or AMS.

Originally the Simulator/Simulation divide was because we needed our own class (Simulator) to interact with OpenMM's Simulation. I think we can clean that up now; whether we also need to change the Calculation name is up to you.

Here's my suggestion for refactoring:

Let specific Simulations (e.g., OpenMMSimulation) inherit from Simulator (which might be renamed to Simulation). For cases like OpenMMSimulation, which currently inherits from openmm.app.Simulation, we could use multiple inheritance, or we could make the openmm.app.Simulation into an instance variable for the OpenMMSimulation. So there would be a superclass with the code to set up the storage and for the generic generate function, and the subclasses would support system set-up and dynamics state information for OpenMM, or for my ToyDynamics, or (perhaps one day) for Gromacs.

From a user perspective, this would mean that instead of having to build both a Simulator and a Simulation, you just need a single object which combines information from both. In fact, the final object is kind of an OpenTIS wrapper around the MD engine's objects, providing a few extra functions and the ability to translate to and from our data structures.

If this seems like a good idea, I can implement it. Any thoughts?

Ensemble.forward() and logic in NegatedEnsemble(LeaveXEnsemble(vol))

Two parts:

Part 1. Meaning/usefulness of Ensemble.forward() (and .backward())

I’m a little confused as to what Ensemble.forward(traj) is supposed to do. The docs say that it is supposed to let you know whether you might still be in the ensemble if you add another frame. In other words, it is the logical not of a stopping condition. If that’s the goal, shouldn’t it basically boil down to:

InX/OutX : return whether the frame being checked is still in the ensemble (i.e., self.volume(frame)).
HitX/LeaveX : always return True. (These ensembles require "at least one frame" to satisfy the criterion; they have no stopping condition.)

I'm not quite clear on what the code in HitX and LeaveX is doing, but it is significantly more complicated than returning True.

Part 2. If I’ve understood that correctly (or even close to correctly), there’s a logic problem in NegatedEnsemble. Using notation of x for a trajectory, x[t] for a frame, and vol for a given volume, I’ll explain:

Let’s talk about NegatedEnsemble(LeaveXEnsemble(vol)). Since LeaveXEnsemble(vol) is
{ x | exists x[t] where x[t] not in vol },
NegatedEnsemble(LeaveXEnsemble(vol)) means
not { x | exists x[t] where x[t] not in vol } == { x | not exists x[t] where x[t] not in vol }
and that, in turn, is just the contrapositive of
{ x | for all x[t]; x[t] in vol }
which is InXEnsemble(vol). So NegatedEnsemble(LeaveXEnsemble(vol)) == InXEnsemble(vol). This makes sense to me: If you never have a frame outside the volume, then all your frames are inside the volume.

Now consider a trajectory traj which consists of one frame in a volume vol. What is the expected behavior of .forward(traj) for various ensembles? At this point, LeaveXEnsemble(vol).forward(traj) == True, right? If a trajectory goes forward, it is possible for it to eventually leave the volume vol. Also, InXEnsemble(vol).forward(traj) == True. It's now in the volume, and it might stay.

However, in its current implementation, NegatedEnsemble(LeaveXEnsemble(vol)).forward(traj) == False, because NegatedEnsemble(ens).forward returns not ens.forward. Flipping the forward() in NegatedEnsemble isn’t logically sound.

Set up automated travis pushing of docs Amazon S3 bucket

I've created an S3 bucket called openpathsampling that we will have openpathsampling.org point to, much like http://mdtraj.org does now.

@pgrinaway can help configure this.

Make use of test systems in openmmtools.testsystems?

You should also feel free to make use of the test systems in openmmtools.testsystems to set up some tests and examples for users.

We can also add examples you find useful to that file to make it easier to import and share them.

Including MSM-TIS in Omnia?

What would you guys think about eventually targeting msm-tis (or whatever you'd like to call this in the end) as part of the conda-installable Omnia collective?

http://omnia.md

The goal here is to collect a number of OpenMM-based packages into conda-packaged tools that share some common structures and file formats, with the main goal being ease of installation/use and interoperability.

We're trying to standardize some aspects of this, such as how documentation is made available, how travis tests are run, packaging standards, etc.

@kyleabeauchamp can provie more information about how this works with other projects like mdtraj.

Renaming MoveDetails attributes

I think the current names of the MoveDetails attributes are a bit confusing. Before we get too far into making more PathMovers, I'd like to change some of those names. In particular, I think final vs result is a little unclear, and success vs accepted vs acceptance is unclear.

Here are my suggestions, in the format oldname => newname : description

Things to change:

final => trial : a Trajectory representing the trial trajectory (what will go to the output sample if the move is accepted)
acceptance => acceptance_probability : the probability of accepting this step (i.e., Metropolis acceptance probability). Note that for many moves, this is 1 or 0.
accepted => trial_is_in_ensemble : this seems to only be the test of whether the trial path in a shooting move is in the correct ensemble. Unless I'm misunderstanding, it should be moved out of general move details, and only exist for shooting movers.
success => accepted : whether the trial is accepted by the Monte Carlo method

Things to keep the same:

inputs => inputs : list of the Samples selected to be used in the move
result => result : the Sample that gets returned by the move
mover => mover : the PathMover which generated these details

Note that right now, the things I say are Samples are actually Trajectory objects. For now, I'm leaving them as Trajectory, but I think we might want to change that to Sample eventually. (Should result be a Sample which is recursively referential, or should it be the Trajectory of that sample? Gets weird when thinking about EnsembleHopMovers, where the Trajectory doesn't change.)

Any other thoughts on these names? Suggestions for better names to use?

GlobalStateMover (and related)

I’d like to start fixing up the GlobalStateMover and related classes. I picture something that (based on our current naming, which I consider open to changes) looks like this:

PathMover : generates individual Sample objects; these are the "atoms" of an overall move: shooting move (on a single ensemble), replica exchange move (on a single pair), etc.
GlobalStateMover : this allows us to combine several of the "atom" moves at once. For example, we may want to do shooting moves on all ensembles at once. This groups the PathMover moves together, and statistical sampling is done after each of these.
GeneralPathMover : this may actually be a factory function, but the point is that it should take nested lists of path movers and create an appropriate GlobalStateMover object to handle all moves in the simulation. It's basically a convenience function to create the kind of GlobalStateMover we often use in TIS.

A more detailed explanation (based on how these will be used) follows; please let me know what you think of this plan soon, since I’ll start implementing it as soon as I can! Also, as far as I’m concerned, all names are open for changing.

I should also clarify that, at this point, we’ve implemented a few things using PathMovers, but we haven’t set up GlobalStateMovers (beyond a skeleton) or GeneralPathMovers.

Usage examples

`PathMover`

shooter = ShootMover(UniformSelector(), ensemble)
repexer = ReplicaExchangeMover(ensemble1, ensemble2)
mixedmover = MixedMover([shooter, repexer])

shooter does is the shooting move for a given ensemble. repexer is the replica exchange mover swapping the two given ensembles. The MixedMover has a 50% change of doing replica exchange, otherwise it does the shooting. The probability can be changed by including another parameter, probability when setting up the MixedMover.

None of these increment the step on the Monte Carlo sampling: that is, they contain no information about when you should take samples from your total distribution for statistics purposes. This way, you can combine PathMovers in a way what the subsets of the ordering doesn’t satisfy detailed balance, but the whole sequence (defined in GlobalStateMover) still does.

There will also be factory convenience methods for common groups of replica exchange movers:

PathMoverFactory.NearestNeighborReplicaExchange(ensemble_list)
PathMoverFactory.AllReplicaExchange(ensemble_list)

NearestNeighbor is frequently used (sometimes with a few extra exchanges included); I've only ever used AllReplica when testing out some new ideas.

`GlobalStateMover`

all_shooting = GlobalStateMover([shooter1, shooter2, shooter3])

all_shooting.move() will simultaneously do shooting moves of all 3 types, and will increment the Monte Carlo step, saying that you should sample after each of these moves.

`GeneralPathMover`

The GeneralPathMover is a way of lumping all of this together to simplify setup. You general one huge list of movers: if all the objects in the list are of the PathMover type, then it makes a GlobalStateMover. Otherwise, it makes all of the objects in the list into GlobalStateMovers (you can make a GlobalStateMover with a list of one PathMover).

all_moves = GeneralPathMover([[shooter1, shooter2], [repex1], [repex2])

Each list of PathMovers is a GlobalMover. Each list of GlobalMovers is a GlobalMixedMover. GlobalMixedMovers can be nested. Here we again have to deal with probabilities. My suggestion is that the default be that each PathMover have equal probability, and to take the option probability='equal' to make all GlobalStateMovers within each GlobalMixedMover have an equal probability.

Overall usage

Putting all of this together, the idea is that you'd be able to eventually set up a TIS calculation for a given ensemble_set with something like this:

# assume we've already set up ensemble_set, a list of ensembles
from opentis.pathmovers import PathMoverFactory as mf
shooters = mf.OneWayShootingSet(UniformSelector(), ensemble_set)
repexers = mf.NearestNeighborReplicaExchange(ensemble_set, as_global=True)
reversal = mf.PathReversalSet(ensemble_set)
minus = MinusMover(ensemble_set[0], stateA)
general = mf.GeneralPathMover(movers=[shooters, repexers, reversal, minus])

tis = PathSampling(storage=storage, engine=engine, movers=movers)

To get the best efficiency, you'd probably have to also include a probability argument for the GeneralPathMover. A few of the exact calls here might change, but the idea is that it shouldn't really get any more complicated than that.

Adding conda package

After all this the conda package is still missing. I think this is simply part of the
after_success.sh script so upload the created conda package to binstar. Would that be good for
Peter for example?
I think we don't really need this right now, do we?

SequentialEnsemble : Concepts

This is a heck of a long issue (more of a manifesto, really), but I've been working a lot on ensemble.py and I'm suggesting a few changes which involve imposing a little more structure on the ensembles we've defined, but which will allow us to do just as much (or more) with a lot less code to
manage.

The big picture of what I'm proposing is to move our more complicated ensembles from a single logic-combined ensemble into an ensemble which consists of a time-ordered sequence of ensembles (SequentialEnsemble).

We'll probably need the SequentialEnsemble for the minus interface anyway, so I've already implemented it (currently testing it and fixing cases as I identify them).

(Also, in this issue I use the notation can_append and can_prepend: I'm still open to changing those names, but I need to call them something to discuss them!)

Logic Combination vs. Sequential

Let's start with a standard TIS path ensemble. Assuming we've defined volumes for stateA, stateB, and interfaceA0, our current notation describes this interface as

(  InXEnsemble(stateA, frames=0) &
   OutXEnsemble(stateA | stateB, frames=slice(1,-1)) & LeaveXEnsemble(interfaceA0) &
   InXEnsemble(stateA | stateB, frames=-1)
)

I think this is pretty clear, and corresponds directly to our mathematical notation $h_A(x_0)\prod_{J\in{A,B}} \bar{h}J(x_1,..., x{L-1}) \hat{h}{A,0}(x) \sum{J\in{A,B}} h_J(x_L)$.¹

On the other hand, we can also base this whole thing on the SequentialEnsemble. In that case, in addition to stateA, stateB, and interfaceA0, we'll need an ensemble length1 for path segments with 1 frame (That's just JHP's LengthEnsemble(1)). Then our TIS path ensemble would be defined by:

SequentialEnsemble( [
    InXEnsemble(stateA) & length1,
    OutXEnsemble(stateA | stateB) & LeaveXEnsemble(interfaceA0),
    InXEnsemble(stateA | stateB) & length1
] )

The path starts in stateA (for one frame), then is out of stateA | stateB, eventually crosses interfaceA0, and ends with exactly one frame in stateA | stateB. I think that's pretty easily read from the code, and tends to be the way we actually talk about interfaces.

When trying to define the minus interface, this gets a bit more confusing. For the sake of discussion, I'll define the minus interface as including the segment that is being extended, so a path in the minus interface consists of two segments that satisfy the innermost interface ensemble, with a segment in between which does not cross the interface. If we do not require that interfaceA0 be equivalent to stateA, I haven't come up with a way to do this with just logical ensemble combinations², and it gets harder if you allow more excursions for better decorrelation, as Peter and I did in our work on the "geisslerene" fluid. However, with a SequentialEnsemble, it isn't too bad:

SequentialEnsemble( [
    InXEnsemble(stateA) & length1,
    OutXEnsemble(stateA) & LeaveXEnsemble(interfaceA0),
    InXEnsemble(stateA) & length1,
    InXEnsemble(interfaceA0) | length0,
    OutXEnsemble(interfaceA0),
    InXEnsemble(stateA) & length1
] )

In that, I've also defined a length0 ensemble for trajectories of length zero; using ensemble | length0 is a way to make frames matching ensemble be optional; in the case of the minus interface it's there just in case interfaceA0 is very close to (or equal to) stateA and we have a path that enters stateA for one frame before doing the second exit for the minus interface.

Thanks to python's delicious syntactic sugar for lists, doing the multiple excursion version almost as simple:

SequentialEnsemble(
    [ InXEnsemble(stateA) & length1,
      OutXEnsemble(stateA) & LeaveEnsemble(interfaceA0)
      InXEnsemble(stateA) & length1
    ] + [
      InXEnsemble(interfaceA0) | length0,
      OutXEnsemble(interfaceA0),
      InXEnsemble(stateA) & length1
    ] * N_l
)

where N_l is as we defined it in our paper.

Orthogonal roles for ensemble class categories

We can think of ensembles as falling into categories depending on what trajectory information they interact with. We already have several categories of ensembles. The CombinationEnsemble is the main ensemble for doing logical combinations. The LengthEnsemble considers the total trajectory length. The various subclasses of VolumeEnsemble consider the path of the trajectory through defined regions of some collective variable subspace.

However, in the current implementation, all of these can apply to specific frames as well. This means that we're mixing time and time-ordering roles with the other roles. My proposal is to put all the time-ordering roles into the SequentialEnsemble, and therefore only leave the Volume tests to the VolumeEnsembles. By making logical combinations of VolumeEnsembles with SequentialEnsembles and LengthEnsembles, I think we can do everything that the frames attribute of VolumeEnsembles allowed previously (at least, I've managed to come up with ways to do crazy examples I tried to come up with: please try it some yourselves!)

The advantage of removing the frames attribute is that it significantly simplifies can_append and can_prepend. These functions become one or two lines for any ensemble (except SequentialEnsemble: that one was a bit tricky). By having clear and orthogonal roles for each category of ensemble, I think it is also easier to correctly create more complicated ensembles.

Furthermore, this removes the need for the AlteredTrajectoryEnsembles. At least, as those exist now, they're basically a hack to let you generate part of a trajectory, particularly in the case of one-way shooting.

Reasons to prefer Sequential Ensembles over Logical Combinations

Makes treatment of stop conditions equivalent for one- and two-way shooting.
Reduces code.
Gives specific and distinct roles to each class.
I assume most people typically think of trajectories as time-ordered sequences of conditions, in which case this is a more natural way to describe an ensemble. It also matches how we verbally describe ensembles.
It makes it easy to implement more complicated ensembles like the multiple-exit multiple interface set minus interface that Peter and I recently used.

Reasons to prefer Logical Combinations over Sequential Ensembles

Defining all ensembles purely as logical combinations is closer to the mathematical formalism used in papers on TIS. (On the other hand, the sequential approach is how we describe ensembles in text or verbally.)
The code logic in SequentialEnsemble is more complicated than in the other ensembles. However, I think we need it anyway for the minus interface.

How this saves us code

First, the can_append and can_prepend methods for the primitive volume ensembles get massively shrunk. Like, for HitXEnsemble and LeaveXEnsemble they're reduced to the line return True. InX and OutX just call themselves on the first/last frame of the trajectory. There's considerably more fanciness in SequentialEnsemble's implementation of these two functions, but we were going to need that anyway.

Second, the classes ForwardAppendedTrajectoryEnsemble and BackwardAppendedTrajectoryEnsemble were only used to help one-way shooting. They aren't needed for that. (There might be other tricks they are needed for, but I think most things can be replaced by SequentialEnsemble. And we don't have to implement anything else for two-way shooting, either.

Third, other time-order-dependent VolumeEnsemble subclasses, like EntersXEnsemble and ExitsXEnsemble, will also be obviated.

So framing things in this way removes the need for several classes that we've already implemented and a couple more that we were likely to need to implement for two-way shooting. It also brings a couple of dozen-line functions down to one- or two-liners. The cost is one class (with some admittedly difficult to understand methods) that we probably would have had to implement anyway.

Caveats

There are a couple of warnings to be concerned about:

Logical not might not play well with this (see #32, and JHP's partial fix #34)
Matching is currently "hungry," i.e., instead of (.*)c matching everything ending in c, it matches everything: the first pattern keeps matching until it fails. (This is probably fixable, and only requires a change in one function.)

Still to-do

I'm running tests of SequentialEnsemble: all functions work on all the test cases I've given in so far (out of a library of 50-ish toy trajectories; not all used on every test), but I've only tested ensembles based on InXEnsemble and OutXEnsemble. I still need to add support for the zero-length trajectory and I need to test with LeaveXEnsemble and HitXEnsemble.

Actually, there's also a bit in the code to say that the path must contain at least 3 frames: without that a path which went from A to B in one step would be accepted, leaving you no shooting frames. But if that happens, you're doing something very, very wrong). ↩
If stateA == interfaceA0, then you can define a class that counts the entrances and exits of the ensemble; both must equal 2 (or N_l+1 in general). ↩

OptionalEnsemble and AlteredEnsemble

Two parts to this issue: a new feature, and a name change suggestion inspired by the new feature.

OptionalEnsemble

While @bolhuis was looking at the code the other day, he suggested that we create an OptionalEnsemble class to handle optional parts of SequentialEnsembles. This can be basically a wrapper for what we already do, which is to or an ensemble with LengthEnsemble(0). Conceptually, this is a bit similar to a regular expression (condition)*: zero or more frames must match this condition. The point is that what I’ve been doing so far, explicitly or’ing the ensemble being made optional with a LengthEnsemble(0), is ugly and unreadable.

AlteredTrajectoryEnsemble => AlteredEnsemble

I’d like to rename AlteredTrajectoryEnsemble to an AlteredEnsemble. The way these ensembles work is to wrap an ensemble given in initialization (saved as self.ensemble) and change the ensemble functions to work not on trajectory but some function of trajectory, self._alter(trajectory) by calling self.ensemble.function(self._alter(trajectory)) (where function is __call__, can_append, can_prepend). The default version (the superclass AlteredTrajectoryEnsemble) makes no change to the trajectory, and so should give the same results as the original ensemble.

I realized that the same machinery could be re-used by not only having subclasses that change the trajectories, but other subclasses that change the underlying ensemble. In these, instead of taking the given ensemble and putting it directly into self.ensemble, we save a modified ensemble as self.ensemble. Using the default (unchanged) self._alter means that the AlteredEnsemble would call a modified ensemble on the unmodified trajectory.

In general, I have qualms about mixing two types of effects (alter how ensemble sees trajectory, alter ensemble itself) in one class, and I’d definitely recommend against mixing the two in any one subclass (unless there wasn’t another way). But it also seems that the overlap in functionality is pretty large between these two. And AlteredEnsemble as a superclass reduces implementation of OptionalEnsemble to its .__init__() and .__str__() methods.

The other option here would be to have two classes, AlteredEnsemble and AlteredTrajectoryEnsemble, which implement nearly the same behavior as superclasses, and then have different trees of subclasses.

Any arguments against the name change?

Add host-guest system as an example?

It would be great if we could add one or two of the CB7 host-guest systems from SAMPL that @bas-rustenburg and @pgrinaway are working on, since I think this would be a nice, simple system to study bimolecular association/dissociation with. @bas-rustenburg and @pgrinaway are setting this up for OpenMM in implicit and explicit solvent, so you can probably just use their input/output files for this and the OpenMM Amber inpcrd/prmtop reader.

Catch errors and provide informative error messages in Storage

If you have outdated or corrupted netCDF files, or if the files have been misused in some way, we generate an IndexError. It would be really nice if we caught those errors and added some information to make them a bit more informative. Example partial backtrace:

  File "/Users/dwhs/Dropbox/msm-tis/opentis/openmm_engine.py", line 74, in auto
    return OpenMMEngine._restore_from_storage(filename)
  File "/Users/dwhs/Dropbox/msm-tis/opentis/openmm_engine.py", line 116, in _restore_from_storage
    engine = OpenMMEngine(storage.template, options)
  File "/Users/dwhs/Dropbox/msm-tis/opentis/storage/netcdf_storage.py", line 173, in template
    return self.snapshot.load(int(self.variables['template_idx'][0]))
  File "/Users/dwhs/Dropbox/msm-tis/opentis/storage/wrapper.py", line 68, in inner
    obj = func(self, n_idx, *args, **kwargs)
  File "/Users/dwhs/Dropbox/msm-tis/opentis/storage/snapshot_store.py", line 34, in load
    configuration_idx = self.configuration_idx(idx)
  File "/Users/dwhs/Dropbox/msm-tis/opentis/storage/snapshot_store.py", line 104, in configuration_idx
    return int(self.load_variable('snapshot_configuration_idx', idx))
  File "/Users/dwhs/Dropbox/msm-tis/opentis/storage/object_storage.py", line 434, in load_variable
    return self.storage.variables[name][idx]
  File "netCDF4.pyx", line 3015, in netCDF4.Variable.__getitem__ (netCDF4.c:41549)
  File "netCDF4.pyx", line 3653, in netCDF4.Variable._get (netCDF4.c:49443)
IndexError

At some point on that backtrace, a clearer error message would be nice. (Most of this is that I'm an idiot: the problem is solved by deleting the bad file, but I forget that and spend hours trying to figure out which part of my recent full reinstall is broken and causing the error. Please, save me from continuing to waste hours being an idiot!)

Domain name

I've registered openpathsampling.org on godaddy. We'll configure this to point to the AWS S3 bucket openpathsampling so you can push docs there once they are built by travis.