mdtraj / mdtraj Goto Github PK

View Code? Open in Web Editor NEW

584.0 47.0 277.0 35.16 MB

An open library for the analysis of molecular dynamics trajectories

Home Page: https://mdtraj.readthedocs.io/en/latest/

License: GNU Lesser General Public License v2.1

Python 45.47% C 13.72% C++ 7.41% Arc 26.63% Cython 6.77%

molecular-dynamics python mdtraj rmsd dihedral-angles pdb pdb-files molecular-dynamics-trajectories

mdtraj's Issues

trajectory.restrict_atoms doesn't set topology._numAtoms correctly.

After running restrict_atoms(), i have

In [27]: t.topology._numAtoms
Out[27]: 576

In [28]: len(list(t.topology.atoms()))
Out[28]: 172

little RMSD bug

On rmsd.pyx:51, xyz1 should not have to be the same n_frames and xyz2.

HDF5Reporter report interval is not being used correctly

The report interval always seems to be 1. Fix coming.

binpos not in main lib?

@rmcgibbo maybe something didn't get checked in (accidentally)? Or is it just not done yet?

TJ@dn51u4eb ~/Programs/mdtraj
$ sudo python setup.py install
dyld: DYLD_ environment variables being ignored because main executable (/usr/bin/sudo) is setuid or setgid
running install
running bdist_egg
running egg_info
writing MDTraj.egg-info/PKG-INFO
writing top-level names to MDTraj.egg-info/top_level.txt
writing dependency_links to MDTraj.egg-info/dependency_links.txt
reading manifest file 'MDTraj.egg-info/SOURCES.txt'
writing manifest file 'MDTraj.egg-info/SOURCES.txt'
installing library code to build/bdist.macosx-10.5-x86_64/egg
running install_lib
running build_py
running build_ext
skipping 'MDTraj/xtc/xtc.c' Cython extension (up-to-date)
skipping 'MDTraj/dcd/dcd.c' Cython extension (up-to-date)
cythoning MDTraj/binpos/binpos.pyx to MDTraj/binpos/binpos.c
error: /Users/TJ/Programs/mdtraj/MDTraj/binpos/binpos.pyx: No such file or directory

can we bundle mdtraj as an egg

and have it be pip/easy_install'd?

Would be great for odin/msmbuilder dependencies. @rmcgibbo if you agree, I can go ahead and take a shot at this.

Does RG really need to be written in C / Cython?

I strongly doubt that we gain much performance by writing our own for loop. RG is really just vector math, so it makes sense to BLAS.

pdb reader seems to read only the first model in a multiframe pdb

openmm_boxes at individual frames

I wonder if the traj.openmm_boxes() function should take a frame index as an argument?

Is there ever a case where OpenMM interacts with an entire trajectory at once? I don't think there is.

My thinking is that converting the entire trajectory to an OpenMM list could be a memory-killer for some applications.

Implement AMBER formatted text format

Recommended by @jchodera.

Switch RMSD code to external dependency on @ihaque's newest IRMSD code

The latest version of @ihaque's RMSD code is on simtk as a separate installable project. Instead of copying the code in msmbuilder2, we should just use that code. I haven't looked too closely at the C, but the python side is much cleaner and more userfriendly that what we're using now. Also, it has support now for axis or atom major order, which means that for very large datasets, we can possibly avoid a copy operation at the cost of some speed (might be worth it).

https://github.com/simtk/irmsd

Add n_residues?

I wonder this would be useful.

Round out the RMSD API

A few design decisions are left before the RMSD and alignment code should be considered stable.

API of the Transform operator, inverse() method.
Exactly how the RMSDTrajectory works
Test coverage

Write a simple test for the OpenMM Serializer State XML reader

Improve the speed of Travis by installing more things from aptitude instead of pip

The msmbuilder .travis installs numpy from apt, and it works. Right now, our travis builds are taking like 8 minutes, and the tests only take like 15 seconds.

Top level docstring

When we import mdtraj, it would be nice to be able to see the available modules. The goal is to avoid having to memorize the class structure before opening a python session.

This is low-priority and mostly a reminder for later.

Formalized HDF5 Trajectory ABI

Idea

I want to formalize the format for storing HDF5 Trajectories on disk, and provide some kind of a "contract" that our code can say that it implements. It makes sense for the writers of trajectories to try to stick to this contract as closely as possible, and for the readers of trajectories to be as flexible as possible in reading trajectories.

Why HDF5?

Design Goals

In storing MD trajectory data for the for purposes including building MSMs, there are a few design goals. (1) The trajectories should be small and space efficient on disk. (2) The trajectories should be fast to write and fast to read. (3) The data format should support flexible read options. For instance, random access to different frames in the trajectory should be possible. It should be possible to easily query the dimensions of the trajectory (n_frames, n_atoms, etc) without loading the file into memory. It should be possible to load only every n-th frame, or to directly only a subset of the atoms with limited memory overhead. (5) The trajectory format should be easily extensible in a backward compatible manner. For instance, it should be possible to add new arrays/fields like the potential energy or the topology without breaking backwards compatibility.

Other Formats

Currently, MDTraj is able to read and write trajectories in dcd, xtc, trr, binpos, and amber netcdf formats, in addition to hdf5. This presents an opportunity to compare these formats and see how they fit our design goals. The most space efficient is xtc, because it uses 16bit fixed precision encoding. For some reason, the xtc read times are quite slow though. dcd is fast to read and write, but relatively inflexible. netcdf is fast and flexible. binpos is garbage -- it's neither fast, small, nor flexible.

What's lacking?

Of the formats we currently have, AMBER NetCDF is the best, in that it it satisfies all of the design goals except for the first. But the trajectories are twice as big on disk as XTC, which is really quite unfortunate. For dealing with FAH data, size matters. So let's define a HDF5 standard that has the benefits of AMBER NetCDF and the benefits of XTC mixed together. We'll use an extensible data format (HDF5), we'll provide options for lossy and lossless compression, and we'll store the topology inside the trajectory.

MDTraj HDF5 Format

This specification is heavily influenced by the AMBER NetCDF standard. Significant portions of the text are copied verbatim.

Encoding

Files will be encoded in HDF5, a data model, library, and file format for storing and managing data produced at NCSA.
Arrays may use be encoded with zlib compression.
Libraries implementing this standard may, at their desecration, round the data to an appropriate number of significant digits, which can significantly enhance zlib compression ratios.

Recommended file extensions

The .h5 extension will be preferred, although not required.

Global Attributes

Conventions (required) Contents of this attribute are a comma or space delimited list of tokens representing all of the conventions to which the file conforms. Creators shall include the string Pande as one of the tokens in this list. In the usual case, where the file conforms only to this convention, the value of the attribute will simply be ``Pande''. Readers may fail if this attribute is not present or none of the tokens in the list are Pande. Optionally, if the reader does not expect HDF5 files other than those conforming to the Pande convention, it may emit a warning and attempt to read the file even when the Conventions attribute is missing.
ConventionVersion (required) Contents are a string representation of the version number of this convention. Future revisions of this convention having the same version number may include definitions of additional variables, dimensions or attributes, but are guaranteed to have no incompatible changes to variables, dimensions or attributes specified in previous revisions. Creators shall set this attribute to "1.0". If this attribute is present and has a value other than "1.0", readers may fail or may emit a warning and continue. It is expected that the version of this convention will change rarely, if ever.
application (optional) If the creator is part of a suite of programs or modules, this attribute shall be set to the name of the suite.
program (required) Creators shall set this attribute to the name of the creating program or module.
programVersion (required) Creators shall set this attribute to the preferred textual formatting of the current version number of the creating program or module.
title (optional) Creators may set use this attribute to represent a user-defined title for the data represented in the file. Absence of a title may be indicated by omitting the attribute or by including it with an empty string value.
topology (optional, but highly recommended) For protein systems, creators shall describe the topology of the system in ASCII encoded JSON. The format for the topology definition is described in the topology subsection of this document.
randomState (optional), ASCII-encoded string Creators may optionally describe the
state of their internal random number generators at the start of their simulation.
The semantics of this string are specific to the MD code and are not specified by
this standard.
forcefield (optional), ASCII-encoded string For data from a molecular dynamics simulation,
creators may optionally describe the Hamiltonian used. This should be a short, human readable
string, like "AMBER99sbildn".
reference (optional), ASCII-encoded string Creators may optionally specify published a reference
that documents the program or parameters used to generate the data. The reference should be listed
in a simple, human readable format. Multiple references may be listed simply by separating the references
with a human readable delimiter within the string, like a newline.

Arrays

coordinates (required) shape=(n_frames, n_atoms, 3), type=Float32, units="nanometers". This variable shall contain the Cartesian coordinates of the specified particle for the specified.
time (optional), shape=(n_frames), dtype=Float32, units="picoseconds" When coordinates on the frame dimension have a temporal sequence (e.g. they form a molecular dynamics trajectory), creators shall define this dimension and write a float for each frame coordinate representing the simulated time value in picoseconds associated with the frame. Time zero is arbitrary, but typically will correspond to the start of the simulation. When the file stores a collection of conformations having no temporal sequence, creators shall omit this variable.
cell_lengths (optional), shape=(n_frames, 3, 3), dtype=Float32, units="nanometers" When the data in the coordinates variable come from a simulation with periodic boundaries, creators shall include this variable. his variable shall represent the lengths (a,b,c) of the unit cell for each frame. The edge with length a lies along the x axis; the edge with length b lies in the x-y plane. The origin (point of invariance under scaling) of the unit cell is defined as (0,0,0). If the simulation has one or two dimensional periodicity, then the length(s) corresponding to spatial dimensions in which there is no periodicity shall be set to zero.
cell_angles shape=(n_frames, 3, 3), dtype=Float32, units="degrees" Creators shall include this variable if and only if they include the cell_lengths variable. This variable shall represent the angles (\alpha, \beta, \gamma) defining the unit cell for each frame. \alpha defines the angle between the b and c vectors, \beta defines the angle between the a and c vectors and \gamma defines the angle between the a and b vectors. Angles that are undefined due to less than three dimensional periodicity shall be set to zero.
velocities (optional), shape=(n_frames, n_atoms, 3), type=Float32, units="nanometers/picosecond"
When the velocities variable is present, it shall represent the cartesian components of
the velocity for the specified particle and frame. It is recognized that due to the
nature of commonly used integrators in molecular dynamics, it may not be possible
for the creator to write a set of velocities corresponding to exactly the same point
in time as defined by the time variable and represented in the coordinates variable.
In such cases, the creator shall write a set of velocities from the nearest point in
time to that represented by the specified frame.
kineticEnergy (optional), shape=(n_frames), type=Float32, units="kJ/mol"
Creators may optionally specify the kinetic energy of the system at each frame.
potentialEnergy (optional), shape=(n_frames), type=Float32, units="kJ/mol"
Creators may optionally specify the potential energy of the system at each frame.
temperature (optional), shape=(n_frames), type=Float32, units="Kelvin"
Creators may optionally specify the temperature of the system at each frame.
lambda (optional), shape=(n_frames), type=Floa32 units="" For describing an alchemical free energy
simulation, a creator may optionally notate each frame in the simulation with a value of lambda.
constraints (optional), shape=(n_constraints, 3), type=CompoundType(int, int, float) units=[None, None, "nanometers"]
Creators may optionally describe any constraints applied to the bond lengths. constraints shall be a compound-type table (referred to a table as opposed to an array in the pytables documentation), such that the first two entries are the indices of the two atoms involved in the constant, and the final entry is the distance those atoms are constrained to.

Array Metadata

For arrays that contain naturally unitted numbers (which is all of them), creators shall explicitly declare their units. The unit system of length=nanometers, time=picoseconds, mass=daltons, temperature=Kelvin, energy=kJ/mol, force=kJ/mol/nm shall be used everywhere. For angles, degrees shall be used. The units shall be set as an "attribute", on the array, under the key "units", within the parlance of HDF5. It shall be a string.
For arrays that contain numbers which have been rounded to a certain number of significant digits, creators shall declare the number of significant digits by setting the "least_significant_digit" attribue, which should be a positive integer.

Topology

Rational

It is our experience that not having the topology stored in the same file as the the trajectory's coordinate data is a pain. It's just really inconvenient. And generally, the trajectories are long enough that it doesn't take up much incremental storage space to store the topology in there too. The topology is not that complicated.

Format

The topology will be stored in JSON. The JSON will then be serialized as a string and stored in the HDF5 file with an ASCII encoding.

The topology stores a hierarchical description of the chains, residues, and atoms in the system. Each chain is associated with an index and a list of residues. Each residue is associated with a name, an index, and a list of atoms. Each atom is associated with a name, an element, and an index. All of the indicies should be zero-based.

The name of a residue is not strictly proscribed, but should generally be one of the three letter codes for amino acids, or the commonly used extensions thereof, e.g. "HOH" for water. The element of an atom shall be one of the one or two letter element abbreviations from the periodic table. The name of an atom shall indicate some information about the type of the atom beyond just its element, such as 'CA' for the alpha carbom, 'HG' for a gamma hydrogen, etc. This format does not specify exactly what atom names are allowed -- creators should follow the conventions from the forcefield they are using.

In addition to the chains, the topology shall also contain a list of the bonds. The bonds shall be a list of length-2 lists of integers, where the integers refer to the index of the two atoms that are bonded.

Example

The following shows the topology of alanine dipeptide in this format. Since it's JSON, the whitespace is optional and just for readability.

{'bonds': [[4, 1],
           [4, 5],
           [1, 0],
           [1, 2],
           [1, 3],
           [4, 6],
           [14, 8],
           [14, 15],
           [8, 10],
           [8, 9],
           [8, 6],
           [10, 11],
           [10, 12],
           [10, 13],
           [7, 6],
           [14, 16],
           [18, 19],
           [18, 20],
           [18, 21],
           [18, 16],
           [17, 16]],
 'chains': [{'index': 0,
             'residues': [{'atoms': [{'element': 'H',
                                      'index': 0,
                                      'name': 'H1'},
                                     {'element': 'C',
                                      'index': 1,
                                      'name': 'CH3'},
                                     {'element': 'H',
                                      'index': 2,
                                      'name': 'H2'},
                                     {'element': 'H',
                                      'index': 3,
                                      'name': 'H3'},
                                     {'element': 'C',
                                      'index': 4,
                                      'name': 'C'},
                                     {'element': 'O',
                                      'index': 5,
                                      'name': 'O'}],
                           'index': 0,
                           'name': 'ACE'},
                          {'atoms': [{'element': 'N',
                                      'index': 6,
                                      'name': 'N'},
                                     {'element': 'H',
                                      'index': 7,
                                      'name': 'H'},
                                     {'element': 'C',
                                      'index': 8,
                                      'name': 'CA'},
                                     {'element': 'H',
                                      'index': 9,
                                      'name': 'HA'},
                                     {'element': 'C',
                                      'index': 10,
                                      'name': 'CB'},
                                     {'element': 'H',
                                      'index': 11,
                                      'name': 'HB1'},
                                     {'element': 'H',
                                      'index': 12,
                                      'name': 'HB2'},
                                     {'element': 'H',
                                      'index': 13,
                                      'name': 'HB3'},
                                     {'element': 'C',
                                      'index': 14,
                                      'name': 'C'},
                                     {'element': 'O',
                                      'index': 15,
                                      'name': 'O'}],
                           'index': 1,
                           'name': 'ALA'},
                          {'atoms': [{'element': 'N',
                                      'index': 16,
                                      'name': 'N'},
                                     {'element': 'H',
                                      'index': 17,
                                      'name': 'H'},
                                     {'element': 'C',
                                      'index': 18,
                                      'name': 'C'},
                                     {'element': 'H',
                                      'index': 19,
                                      'name': 'H1'},
                                     {'element': 'H',
                                      'index': 20,
                                      'name': 'H2'},
                                     {'element': 'H',
                                      'index': 21,
                                      'name': 'H3'}],
                           'index': 2,
                           'name': 'NME'}]}]}

Fixed Width vs. Float16

Hi,

I wanted to have a tiny discussion about the pros / cons of fixed width versus Float16. I originally used fixed width because it was used in Gromacs and because early PyTables didn't have Float16Atom(). However, now I'd say we are free to choose between the two.

As I see it, the pros of Float16 are that we offload all complexity to a simple choice of precision.

Can you just remind me why Float16 might be a bad idea? I'm having a tough time thinking about "what it means" in terms of complications regarding MD trajectories...

cannot load h5 using io.loadh

I would like to use the io.loadh functionality but I am having some troubles. Any reason why this is happening ?

In [17]: f = tables.open_file('test1.ring')

In [18]: f
Out[18]: 
File(filename=test1.ring, title='', mode='r', root_uep='/', filters=Filters(complevel=0, shuffle=False, fletcher32=False))
/ (RootGroup) ''
/xyza (Array(7952,)) ''
  atom := Float32Atom(shape=(), dflt=0.0)
  maindim := 0
  flavor := 'numpy'
  byteorder := 'little'
  chunkshape := None
/rings (Group) ''
/rings/qres (Array(1,)) ''
  atom := Float32Atom(shape=(), dflt=0.0)
  maindim := 0
  flavor := 'numpy'
  byteorder := 'little'
  chunkshape := None
/rings/ringI_267 (Array(500, 360)) ''
  atom := Float32Atom(shape=(), dflt=0.0)
  maindim := 0
  flavor := 'numpy'
  byteorder := 'little'
  chunkshape := None
/rings/ringR_267 (Array(500, 360)) ''
  atom := Float32Atom(shape=(), dflt=0.0)
  maindim := 0
  flavor := 'numpy'
  byteorder := 'little'
  chunkshape := None

but then I try

In [19]: io.loadh( 'test1.ring' )
Exception AttributeError: "'DeferredTable' object has no attribute '_own_fid'" in  ignored
---------------------------------------------------------------------------
NoSuchNodeError                           Traceback (most recent call last)
<ipython-input-19-80c137a1e905> in <module>()
----> 1 io.loadh( 'test1.ring' )

/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/mdtraj-0.0.0-py2.7-macosx-10.6-intel.egg/mdtraj/io.pyc in loadh(file, name, deferred)
    200         return result
    201 
--> 202     return DeferredTable(handle, own_fid)
    203 
    204 

/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/mdtraj-0.0.0-py2.7-macosx-10.6-intel.egg/mdtraj/io.pyc in __init__(self, handle, own_fid)
    206     def __init__(self, handle, own_fid):
    207         self._handle = handle
--> 208         self._node_names = [e.name for e in handle.iterNodes(where='/')]
    209         self._loaded = {}
    210         self._own_fid = own_fid

/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/tables/group.pyc in __getattr__(self, name)
    768             self._g_add_children_names()
    769             return myDict[name]
--> 770         return self._f_get_child(name)
    771 
    772     def __setattr__(self, name, value):

/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/tables/group.pyc in _f_get_child(self, childname)
    644         self._g_check_open()
    645 
--> 646         self._g_check_has_child(childname)
    647 
    648         childPath = join_path(self._v_pathname, childname)

/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/tables/group.pyc in _g_check_has_child(self, name)
    377             raise NoSuchNodeError(
    378                 "group ``%s`` does not have a child named ``%s``"
--> 379                 % (self._v_pathname, name))
    380         return node_type
    381 

NoSuchNodeError: group ``/rings`` does not have a child named ``name``

In [20]:

Wish List

trajectory indexing (i.e. traj[5])
PBC vectors
Minimal logic on the PBC vectors.
- Make sure they're converted between the format required for each library, written to PDB files in a way that OpenMM or something would understand.
DCD & XTC units same -- nm to be consistent with old msmbuilder?

OpenMM PyTables/HDF5 Reporter

This is a natural place for such a reporter to live.

How to remove waters / atoms

In the old MSMB trajectory, we had a function restrict_atom_indices. Do we plan on implementing similar features in MDTraj for performing operations on sets of atoms?

Numpy array access to topology items

I find it annoying that I have to use a generator to parse things the OpenMM topology.

When we start using mdtraj for real calculations, I worry that this will be even more of a pain.

In particular, I think being able to use fancy indexing, np.where(), and other features are probably useful. What do you think? For example, I'm thinking about how to calculate dihedral angles etc.

Serializing the topology as JSON inside the HDF5 or NetCDF Trajectories

Currently, the topology is being pickled and then the bytes are being saved in the HDF5 trajectory format.

I think this is a bad solution, because it's not very transparent. Pickle is a python-specific protocol, and being able to unpickle the object depends on the mdtraj.topology.Topology object existing and being in some way unchanged from the one in the file, I think.

Instead, I think we should serialize the topology in a way that's transparent, such that if we changed the way we were storing the topology inside our apps, we could write a new reader. The structure that makes the most sense I think is JSON, like:

{
  chain : {
    index : 0
    residue : {
      index : 0
      name : ALA,
      atom : {
        index : 0,
        name : N,
        element : N,
      },
      atom : {
        index : 1
        name : CA,
        element : C
     }
  },
  bonds : [[0,1]],
}

geometry.contact should be renamed to geometry.distance or something

For consistency with geometry.angle and geometry.dihedral, the module for calculating distances/bond lengths should be named similarly.

Test failure in test_rmsd.test_irmsd_against_numpy_qcp2

On the latest travis.

Port RMSD from MSMB?

Use Cython for wrapping.
Implement reference implementation in Numpy
Support general structural alignment / rotation / selection

Is this beyond the scope of MDTraj? My thinking was that RMSD calculation belongs in MDTraj, but clustering algorithms will remain in MSMB.

Performance bug with multi-file load

See this notebook: http://nbviewer.ipython.org/5530251

With out current code, loading many relatively short XTC files is 10x slower than it could be. I think this is because our functools.reduce() code is basically parsing the pdb file once for every filename, which is not good.

Add a function in utils for managing delayed imports and printing a nice error message if they're not found

This functionality would be used for importing scipy, tables, netCDF, networkx, which are only required for specific modules and might not be installed on everyone's machine. We want to only import them when they're required (i.e. the class/function that actually uses them is called), and catch ImportErrors, giving a nice message about where the module can be downloaded from and how to install it.

what should default behavior for overwriting a file be?

Right now, when using saveh in io, the method will die if trying to overwrite an (almost) identical file, but it fails at the very end (after a potentially expensive calculation). Should it either fail at the beginning, or overwrite the already-saved file?

BUG: Slicing a trajectory with box vectors with a single integer index errors

In [13]: t = trajectory.load('fbab7225-1925-4f48-abb1-5afa9d20eb67.lh5')

In [14]: t[0]
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-14-d0a24a012877> in <module>()
----> 1 t[0]

/Library/Frameworks/EPD64.framework/Versions/7.3/lib/python2.7/site-packages/mdtraj-0.1-py2.7-macosx-10.5-x86_64.egg/mdtraj/trajectory.pyc in __getitem__(self, key)
    645     def __getitem__(self, key):
    646         "Get a slice of this trajectory"
--> 647         return self.slice(key)
    648
    649     def slice(self, key, copy=True):

/Library/Frameworks/EPD64.framework/Versions/7.3/lib/python2.7/site-packages/mdtraj-0.1-py2.7-macosx-10.5-x86_64.egg/mdtraj/trajectory.pyc in slice(self, key, copy)
    681             box = None
    682
--> 683         newtraj = self.__class__(xyz, topology, time, box)
    684         return newtraj
    685

/Library/Frameworks/EPD64.framework/Versions/7.3/lib/python2.7/site-packages/mdtraj-0.1-py2.7-macosx-10.5-x86_64.egg/mdtraj/trajectory.pyc in __init__(self, xyz, topology, time, box)
    692         self._box = None
    693         if box is not None:
--> 694             self.box = box
    695
    696         # time will take the default 1..N

/Library/Frameworks/EPD64.framework/Versions/7.3/lib/python2.7/site-packages/mdtraj-0.1-py2.7-macosx-10.5-x86_64.egg/mdtraj/trajectory.pyc in box(self, value)
    560             if not value.shape == (self.n_frames, 3, 3):
    561                 raise ValueError("Wrong shape. Got %s, should be %s" % (value.shape,
--> 562                     (self.n_frames, 3, 3)))
    563             box = value
    564

ValueError: Wrong shape. Got (3, 3), should be (1, 3, 3)

Match fields in our HDF5 format with the AMBER NetCDF format

@kyleabeauchamp: The AMBER NetCDF format is really intelligent. It has all the info you want in there, and it's very explicit about the format. It has places for a version tag, on the format, and for labeling what units things are in. These are all things that our HDF5 format lacks.

We should make a new version of our HDF5 format that formalizes it, and makes it match the AMBER NetCDF format as much as possible.

We should also do some performance benchmarks to check if we really need our own proprietary format.

The new cython code segfaults on linux

it seems to work on mac. i find this very surprising.

Load / Save Box Size

Some of the trajectory formats store the box size with box vectors, and others store the box lengths and the three angles. Currently, we've pretty much punted and aren't doing the conversion.

We should have both represenations available via @property on the trajectory object, and then do the load/save in whatever representation is required.

parse the topology from msmbuilder hdf5 files automatically

for greater backwards compatibility, it would be best to be able to parse the pdb-like entries in the MSMBuilder style hdf5 files and get a topology object.

this is one thing holding back incorporation of this module into msmbuilder, IMO.

Fix chi dihedral parsing

The current code does not take into account the several possible atom names for the chi torsion angle. We should fix this somehow.

Create feature vectorizer class / API.

The idea is that we want a robust framework for featurizing trajectories. Key abilities:

Mix and match functions from geometry
Possible combination of multiple vectorizers
Possibly memory aware--split trajectories into chunks to avoid trying to featurize everything at once...

Error Saving Topology in hdf5.py

I'm still trying to debug this, but I think there might be a size limit on the amount of data that can be stored in a node attribute. This might require a change to the HDF5 protocol, to put the topology into a table instead.

This same code works fine in the test suite, but this system is much larger than the one in the test suite.

Traceback (most recent call last):
  File "/home/rmcgibbo/opt/python/2.7.4/bin/accelerator", line 5, in <module>
    pkg_resources.run_script('msmaccelerator==0.3', 'accelerator')
  File "build/bdist.linux-x86_64/egg/pkg_resources.py", line 489, in run_script
  File "build/bdist.linux-x86_64/egg/pkg_resources.py", line 1207, in run_script
  File "/home/rmcgibbo/opt/python/2.7.4/lib/python2.7/site-packages/msmaccelerator-0.3-py2.7.egg/EGG-INFO/scripts/accelerator", line 22, in <module>
    main()
  File "/home/rmcgibbo/opt/python/2.7.4/lib/python2.7/site-packages/msmaccelerator-0.3-py2.7.egg/EGG-INFO/scripts/accelerator", line 19, in main
    app.start()
  File "/home/rmcgibbo/opt/python/2.7.4/lib/python2.7/site-packages/msmaccelerator-0.3-py2.7.egg/msmaccelerator/core/app.py", line 132, in start
    return self.subapp.start()
  File "/home/rmcgibbo/opt/python/2.7.4/lib/python2.7/site-packages/msmaccelerator-0.3-py2.7.egg/msmaccelerator/simulate/simulation.py", line 79, in start
    super(Simulator, self).start()
  File "/home/rmcgibbo/opt/python/2.7.4/lib/python2.7/site-packages/msmaccelerator-0.3-py2.7.egg/msmaccelerator/core/device.py", line 79, in start
    self.on_startup_message(msg)
  File "/home/rmcgibbo/opt/python/2.7.4/lib/python2.7/site-packages/msmaccelerator-0.3-py2.7.egg/msmaccelerator/simulate/simulation.py", line 87, in on_startup_message
    return getattr(self, msg.header.msg_type)(msg.header, msg.content)
  File "/home/rmcgibbo/opt/python/2.7.4/lib/python2.7/site-packages/msmaccelerator-0.3-py2.7.egg/msmaccelerator/simulate/simulation.py", line 138, in simulate
    simulation.step(self.number_of_steps)
  File "/home/rmcgibbo/opt/python/2.7.4/lib/python2.7/site-packages/simtk/openmm/app/simulation.py", line 127, in step
    reporter.report(self, state)
  File "/home/rmcgibbo/opt/python/2.7.4/lib/python2.7/site-packages/mdtraj-0.2-py2.7-linux-x86_64.egg/mdtraj/reporters/hdf5reporter.py", line 174, in report
    self._initialize(simulation)
  File "/home/rmcgibbo/opt/python/2.7.4/lib/python2.7/site-packages/mdtraj-0.2-py2.7-linux-x86_64.egg/mdtraj/reporters/hdf5reporter.py", line 141, in _initialize
    self._traj_file.topology = simulation.topology
  File "/home/rmcgibbo/opt/python/2.7.4/lib/python2.7/site-packages/mdtraj-0.2-py2.7-linux-x86_64.egg/mdtraj/hdf5.py", line 74, in wrapper
    return f(*args, **kwargs)
  File "/home/rmcgibbo/opt/python/2.7.4/lib/python2.7/site-packages/mdtraj-0.2-py2.7-linux-x86_64.egg/mdtraj/hdf5.py", line 253, in topology
    self._handle.root._v_attrs.topology = json.dumps(topology_dict)
  File "/home/rmcgibbo/opt/python/2.7.4/lib/python2.7/site-packages/tables/attributeset.py", line 444, in __setattr__
    self._g__setattr(name, value)
  File "/home/rmcgibbo/opt/python/2.7.4/lib/python2.7/site-packages/tables/attributeset.py", line 386, in _g__setattr
    self._g_setAttr(self._v_node, name, stvalue)
  File "hdf5Extension.pyx", line 429, in tables.hdf5Extension.AttributeSet._g_setAttr (tables/hdf5Extension.c:4064)
tables.exceptions.HDF5ExtError: HDF5 error back trace

  File "H5A.c", line 927, in H5Awrite
    not an attribute

End of HDF5 error back trace

Can't set attribute 'topology' in node:
 / (RootGroup) ''.
Closing remaining open files: /home/rmcgibbo/projects/msmaccelerator2.villin/project/trajs/376cd500-7be1-4fda-9a05-6e8ae1f2a70b.h5... done

tests may fail because `import mdtraj` in the root dir imports the local copy

And when this happens, it can't find the compiled .so files (which are in the egg).

Everything works as long as you're not in the mdtraj base dir, though.

Not a huge deal, but FYI.

How to set _time after changing _xyz

Maybe we need a way to set the value of _xyz so that the _time gets set automatically? Not sure. The idea is that now it's easier for _xyz and _time to get out of sync.

set a VERSION key in the hdf5 files that can be incremented when the format needs to change

this will make it easier to deal with any future backwards compatibility issues.

Is eq() working?

This seems wrong to me. I get non-agreement even when I set decimal=1. This may have to do with testing a scalar value.

  File "/home/kyleb/src/kyleabeauchamp/mdtraj/MDTraj/test/test_geometry.py", line 80, in test_dihedral
    eq(float(phi[0,0]), phi0, decimal=1)
  File "/home/kyleb/opt/lib/python2.7/site-packages/mdtraj-0.2-py2.7-linux-x86_64.egg/mdtraj/testing/testing.py", line 71, in eq
    eq_(o1, o2)
AssertionError: -0.602303798297833 != -0.6023054454145343

----------------------------------------------------------------------
Ran 195 tests in 1.223s

FAILED (failures=1)

The topologies are not the same

import mdtraj.trajectory
r1 = mdtraj.trajectory.load("./traj.lh5")
r2 = mdtraj.trajectory.load("./traj.lh5")
r1 + r2

ValueError Traceback (most recent call last)
in ()
----> 1 r1 + r2

/home/kyleb/opt/lib/python2.7/site-packages/mdtraj-0.1-py2.7-linux-x86_64.egg/mdtraj/trajectory.pyc in add(self, other)
578 def add(self, other):
579 "Concatenate two trajectories"
--> 580 return self.join(other)
581
582 def join(self, other, check_topology=True):

/home/kyleb/opt/lib/python2.7/site-packages/mdtraj-0.1-py2.7-linux-x86_64.egg/mdtraj/trajectory.pyc in join(self, other, check_topology)
605 if not np.all(mdtraj.topology.to_bytearray(self.topology) ==
606 mdtraj.topology.to_bytearray(other.topology)):
--> 607 raise ValueError('The topologies are not the same')
608
609 xyz = np.concatenate((self.xyz, other.xyz))

ValueError: The topologies are not the same

Failure with tables version 3.0.0 (just released)

======================================================================
ERROR: MDTraj.test.test_hdf5.test_topology
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/rmcgibbo/envs/latest/lib/python2.7/site-packages/nose/case.py", line 197, in runTest
    self.test(*self.arg)
  File "/home/rmcgibbo/local/mdtraj/MDTraj/test/test_hdf5.py", line 106, in test_topology
    f.topology = top
  File "/home/rmcgibbo/envs/latest/lib/python2.7/site-packages/mdtraj-0.2-py2.7-linux-x86_64.egg/mdtraj/hdf5.py", line 73, in wrapper
    return f(*args, **kwargs)
  File "/home/rmcgibbo/envs/latest/lib/python2.7/site-packages/mdtraj-0.2-py2.7-linux-x86_64.egg/mdtraj/hdf5.py", line 258, in topology
    self._handle.createArray(where='/', name='topology', object=[str(json.dumps(topology_dict))])
  File "/home/rmcgibbo/envs/latest/lib/python2.7/site-packages/tables/_past.py", line 35, in oldfunc
    return obj(*args, **kwargs)
TypeError: create_array() got an unexpected keyword argument 'object'

======================================================================
ERROR: Failure: TypeError (create_array() got an unexpected keyword argument 'object')
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/rmcgibbo/envs/latest/lib/python2.7/site-packages/nose/loader.py", line 253, in generate
    for test in g():
  File "/home/rmcgibbo/local/mdtraj/MDTraj/test/test_io.py", line 125, in test_groups
    f.createArray(where='/mygroup', name='myarray', object=x)
  File "/home/rmcgibbo/envs/latest/lib/python2.7/site-packages/tables/_past.py", line 35, in oldfunc
    return obj(*args, **kwargs)
TypeError: create_array() got an unexpected keyword argument 'object'

======================================================================
ERROR: Failure: TypeError (create_array() got an unexpected keyword argument 'object')
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/rmcgibbo/envs/latest/lib/python2.7/site-packages/nose/loader.py", line 253, in generate
    for test in g():
  File "/home/rmcgibbo/envs/latest/lib/python2.7/site-packages/numpy/testing/decorators.py", line 153, in skipper_gen
    for x in f(*args, **kwargs):
  File "/home/rmcgibbo/local/mdtraj/MDTraj/test/test_reporter.py", line 67, in test_reporter
    simulation.step(100)
  File "/home/rmcgibbo/envs/latest/lib/python2.7/site-packages/simtk/openmm/app/simulation.py", line 127, in step
    reporter.report(self, state)
  File "/home/rmcgibbo/envs/latest/lib/python2.7/site-packages/mdtraj-0.2-py2.7-linux-x86_64.egg/mdtraj/reporters/hdf5reporter.py", line 172, in report
    self._initialize(simulation)
  File "/home/rmcgibbo/envs/latest/lib/python2.7/site-packages/mdtraj-0.2-py2.7-linux-x86_64.egg/mdtraj/reporters/hdf5reporter.py", line 126, in _initialize
    self._traj_file.topology = simulation.topology
  File "/home/rmcgibbo/envs/latest/lib/python2.7/site-packages/mdtraj-0.2-py2.7-linux-x86_64.egg/mdtraj/hdf5.py", line 73, in wrapper
    return f(*args, **kwargs)
  File "/home/rmcgibbo/envs/latest/lib/python2.7/site-packages/mdtraj-0.2-py2.7-linux-x86_64.egg/mdtraj/hdf5.py", line 258, in topology
    self._handle.createArray(where='/', name='topology', object=[str(json.dumps(topology_dict))])
  File "/home/rmcgibbo/envs/latest/lib/python2.7/site-packages/tables/_past.py", line 35, in oldfunc
    return obj(*args, **kwargs)
TypeError: create_array() got an unexpected keyword argument 'object'

======================================================================
ERROR: Failure: TypeError (create_array() got an unexpected keyword argument 'object')
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/rmcgibbo/envs/latest/lib/python2.7/site-packages/nose/loader.py", line 253, in generate
    for test in g():
  File "/home/rmcgibbo/local/mdtraj/MDTraj/test/test_trajectory.py", line 78, in test_box_load_save
    t.save(temp_fn)
  File "/home/rmcgibbo/envs/latest/lib/python2.7/site-packages/mdtraj-0.2-py2.7-linux-x86_64.egg/mdtraj/trajectory.py", line 1042, in save
    return saver(filename)
  File "/home/rmcgibbo/envs/latest/lib/python2.7/site-packages/mdtraj-0.2-py2.7-linux-x86_64.egg/mdtraj/trajectory.py", line 1057, in save_hdf
    f.topology = self.topology
  File "/home/rmcgibbo/envs/latest/lib/python2.7/site-packages/mdtraj-0.2-py2.7-linux-x86_64.egg/mdtraj/hdf5.py", line 73, in wrapper
    return f(*args, **kwargs)
  File "/home/rmcgibbo/envs/latest/lib/python2.7/site-packages/mdtraj-0.2-py2.7-linux-x86_64.egg/mdtraj/hdf5.py", line 258, in topology
    self._handle.createArray(where='/', name='topology', object=[str(json.dumps(topology_dict))])
  File "/home/rmcgibbo/envs/latest/lib/python2.7/site-packages/tables/_past.py", line 35, in oldfunc
    return obj(*args, **kwargs)
TypeError: create_array() got an unexpected keyword argument 'object'

Add XTC and NetCDF OpenMM reporters

They should also take atom subsets, as was proposed by @kyleabeauchamp on the OpenMM issue tracker.

Some load() functions take single files, others take lists of files

I wonder if we should have a more consistent interface. One idea would be to have load_X() always take a single filename, but provide a cute way to merge multiple trajectories. Just a thought.

Catch Exception in DCD Reader

I need to add more error checking to the DCD reader. This may have been introduced when I added the n_frame option in read().

In [10]: x, lengths, angles = reader.read(40000)
---------------------------------------------------------------------------
IOError                                   Traceback (most recent call last)
<ipython-input-10-c52b259ebf12> in <module>()
----> 1 x, lengths, angles = reader.read(40000)

/home/kyleb/opt/lib/python2.7/site-packages/mdtraj-0.1-py2.7-linux-x86_64.egg/mdtraj/dcd.so in mdtraj.dcd.DCDReader.read (MDTraj/dcd/dcd.c:2569)()

IOError: [Errno Error: %s] -1

get the following error when installing under ubuntu

from mdtraj import dcd
main:1: RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility
main:1: RuntimeWarning: numpy.flatiter size changed, may indicate binary incompatibility

Any thoughts?

Move HDF Trajectory IO outside of trajectory.py

Feel free to shoot this one down, but I think it would be nice if trajectory.py reader more like a header file and less like an implementation file.

I think one way to do this would be to move the guts out of the HDF5 IO into a separate hdf5.py file.

The advantage of this is that trajectory.py is really really short and just points readers to the filetype-specific readers.

Expose a streaming read/write interface

All of the dcd/xtc code reads in all of the data from memory (or writes a given amount).

It would be nice if, additionally, it exposed a streaming API.

Should other MSMBuilder Features should move to MDTraj?

What other calculations that work with Trajectories should belong in MDTraj, rather than MSMBuilder?

Dihedral calculation?
Rg?

How will we deal with MSMBuilder Projects?