datreant / mdsynthesis
a logistics and persistence engine for the analysis of molecular dynamics trajectories
Home Page: http://mdsynthesis.readthedocs.org
License: GNU General Public License v2.0
Sim.selections.keys() # works
Sim.categories.keys() # works
# but
Sim.data.keys() # doesn't work
Sim.universes.keys() # doesn't work
If I'm using a dict-like object, it makes sense to have a keys method for it. I had a quick look through the code, and it looks like the data aggregator is different, but ideally it should behave like the others? Maybe link keys to data._list?
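A minimal sketch of the suggested fix, using a toy aggregator (not the real MDSynthesis class) whose keys() delegates to an internal _list() method:

```python
# Hypothetical sketch: a dict-like aggregator that gains a keys() method
# by delegating to an internal _list(), as suggested above.
class DataAggregator:
    def __init__(self, entries):
        self._entries = dict(entries)

    def _list(self):
        # stand-in for the real backend listing of stored handles
        return list(self._entries)

    def keys(self):
        # delegate to _list so keys() behaves like a dict's keys()
        return self._list()

    def __getitem__(self, key):
        return self._entries[key]

agg = DataAggregator({'rmsd': 1, 'rgyr': 2})
print(sorted(agg.keys()))  # ['rgyr', 'rmsd']
```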
In [1]: import mdsynthesis as mds
In [2]: S = mds.Sim('cg')
In [5]: S.universes.add('main', ['topol.tpr','cg.xtc'])
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-5-f7469218f877> in <module>()
----> 1 S.universes.add('main', ['topol.tpr','cg.xtc'])
/home/richard/.local/lib/python2.7/site-packages/mdsynthesis-0.5.0-py2.7.egg/mdsynthesis/core/aggregators.py in add(self, handle, topology, *trajectory)
333 outtraj.append(traj)
334
--> 335 self._backend.add_universe(handle, topology, *outtraj)
336
337 if not self.default():
/home/richard/.local/lib/python2.7/site-packages/mdsynthesis-0.5.0-py2.7.egg/mdsynthesis/core/persistence.py in inner(self, *args, **kwargs)
224 self._exlock(self.handle)
225 try:
--> 226 out = func(self, *args, **kwargs)
227 finally:
228 self.handle.close()
/home/richard/.local/lib/python2.7/site-packages/mdsynthesis-0.5.0-py2.7.egg/mdsynthesis/core/persistence.py in add_universe(self, universe, topology, *trajectory)
880
881 # add topology paths to table
--> 882 table.row['abspath'] = os.path.abspath(topology)
883 table.row['relCont'] = os.path.relpath(topology, self.get_location())
884 table.row.append()
/usr/lib/python2.7/posixpath.pyc in abspath(path)
365 def abspath(path):
366 """Return an absolute path."""
--> 367 if not isabs(path):
368 if isinstance(path, _unicode):
369 cwd = os.getcwdu()
/usr/lib/python2.7/posixpath.pyc in isabs(s)
59 def isabs(s):
60 """Test whether a path is absolute"""
---> 61 return s.startswith('/')
62
63
AttributeError: 'list' object has no attribute 'startswith'
In [6]: S.universes.add('main', 'topol.tpr','cg.xtc')
---------------------------------------------------------------------------
NoSuchNodeError Traceback (most recent call last)
<ipython-input-6-f176da1cef98> in <module>()
----> 1 S.universes.add('main', 'topol.tpr','cg.xtc')
/home/richard/.local/lib/python2.7/site-packages/mdsynthesis-0.5.0-py2.7.egg/mdsynthesis/core/aggregators.py in add(self, handle, topology, *trajectory)
333 outtraj.append(traj)
334
--> 335 self._backend.add_universe(handle, topology, *outtraj)
336
337 if not self.default():
/home/richard/.local/lib/python2.7/site-packages/mdsynthesis-0.5.0-py2.7.egg/mdsynthesis/core/persistence.py in inner(self, *args, **kwargs)
224 self._exlock(self.handle)
225 try:
--> 226 out = func(self, *args, **kwargs)
227 finally:
228 self.handle.close()
/home/richard/.local/lib/python2.7/site-packages/mdsynthesis-0.5.0-py2.7.egg/mdsynthesis/core/persistence.py in add_universe(self, universe, topology, *trajectory)
872 '/universes/{}'.format(universe), 'topology')
873 self.handle.remove_node(
--> 874 '/universes/{}'.format(universe), 'trajectory')
875
876 # construct topology table
/home/richard/.local/lib/python2.7/site-packages/tables/file.pyc in remove_node(self, where, name, recursive)
1779 """
1780
-> 1781 obj = self.get_node(where, name=name)
1782 obj._f_remove(recursive)
1783
/home/richard/.local/lib/python2.7/site-packages/tables/file.pyc in get_node(self, where, name, classname)
1614 # Now we have the definitive node path, let us try to get the node.
1615 if node is None:
-> 1616 node = self._get_node(nodepath)
1617
1618 # Finally, check whether the desired node is an instance
/home/richard/.local/lib/python2.7/site-packages/tables/file.pyc in _get_node(self, nodepath)
1553 return self.root
1554
-> 1555 node = self._node_manager.get_node(nodepath)
1556 assert node is not None, "unable to instantiate node ``%s``" % nodepath
1557
/home/richard/.local/lib/python2.7/site-packages/tables/file.pyc in get_node(self, key)
434
435 if self.node_factory:
--> 436 node = self.node_factory(key)
437 self.cache_node(node, key)
438
/home/richard/.local/lib/python2.7/site-packages/tables/group.pyc in _g_load_child(self, childname)
1184 childname = join_path(self._v_file.root_uep, childname)
1185 # Is the node a group or a leaf?
-> 1186 node_type = self._g_check_has_child(childname)
1187
1188 # Nodes that HDF5 report as H5G_UNKNOWN
/home/richard/.local/lib/python2.7/site-packages/tables/group.pyc in _g_check_has_child(self, name)
400 raise NoSuchNodeError(
401 "group ``%s`` does not have a child named ``%s``"
--> 402 % (self._v_pathname, name))
403 return node_type
404
NoSuchNodeError: group ``/`` does not have a child named ``/universes/main/trajectory``
Group.members
and Bundle
are intended to make it easy to manipulate many Containers at once, but currently they only give access to the objects themselves. It would be useful to include methods that yield aggregate information from these collections. Both objects would have these methods in common.
For example, we could have Bundle.data, which gives access to concatenations of stored pandas data sets. It grabs any datasets it can that match the given handle and tries to concatenate them; this would be useful for quickly aggregating and manipulating ensemble data. We could also have Bundle.tags, which gives all tags present in the collection, perhaps with keywords for 'any' and 'all' criteria for what to return.
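As a sketch of the Bundle.data concatenation idea (the member layout and the added 'member' column are assumptions for illustration, not the real API):

```python
# Sketch: gather every member's dataset stored under a given handle and
# concatenate them with pandas, tagging rows by member name.
import pandas as pd

def bundle_data(members, handle):
    """Collect dataset `handle` from each member that has it and concatenate."""
    frames = []
    for name, datasets in members.items():
        if handle in datasets:
            df = datasets[handle].copy()
            df['member'] = name  # keep track of which member each row came from
            frames.append(df)
    return pd.concat(frames, ignore_index=True)

members = {
    'sim1': {'rmsd': pd.DataFrame({'t': [0, 1], 'rmsd': [0.0, 1.2]})},
    'sim2': {'rmsd': pd.DataFrame({'t': [0, 1], 'rmsd': [0.0, 0.9]})},
}
print(bundle_data(members, 'rmsd').shape)  # (4, 3)
```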
The problem is pretty straightforward:
import mdsynthesis as mds
s=mds.Sim("marlar")
s.data['poop']=25
s.data['poop']
25
s.data['poop']=50
s.data['poop']
25
s.data.add('poop', 50)
s.data['poop']
25
s.data['poop2']=[1,2]
s.data['poop2']
[1, 2]
s.data['poop2']=[1,2,3]
s.data['poop2']
[1, 2]
I dug a little into this one, and the cause of the issue seems to be the file mode 'ab+': it simply doesn't work for pickle dumping (see the code at mdsynthesis/core/persistence.py:1962, where that mode is used). I can submit a one-line PR that changes the mode to 'wb+' if there's no reason not to.
Here's the MDSynthesis-free root of the problem:
import pickle
pickle.dump(20, open("marlar.pkl", "ab+"))
pickle.load(open("marlar.pkl","rb"))
20
pickle.dump(30, open("marlar.pkl", "ab+"))
pickle.load(open("marlar.pkl","rb"))
20
pickle.dump(30, open("marlar.pkl", "wb"))
pickle.load(open("marlar.pkl","rb"))
30
When multiple trajectories are needed for a universe definition, it may be useful to allow globbing syntax for the paths stored so that new trajectory files that match the pattern are picked up by the universe. Then again, this feature could be troublesome. How feasible is it to include, and what are some pitfalls?
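As a sketch of how stored globs could be expanded at load time (the function name is hypothetical):

```python
# Sketch: store glob patterns alongside plain paths and expand them when
# the universe is loaded, so newly written trajectory files matching the
# pattern are picked up automatically.
import glob

def resolve_trajectories(patterns):
    """Expand each stored path; globs yield all current matches, sorted."""
    files = []
    for pat in patterns:
        matches = sorted(glob.glob(pat))
        # a pattern with no matches is kept as a literal path, so plain
        # filenames behave exactly as before
        files.extend(matches if matches else [pat])
    return files

print(resolve_trajectories(['topol.tpr']))  # ['topol.tpr']
```

One pitfall this makes visible: a stale or mistyped pattern silently becomes a literal path, and the set of files a universe sees can change between sessions, which complicates reproducibility.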
The new MDAnalysis release 0.11.0 breaks parts of the API and it is possible that MDSynthesis needs to be migrated and pinned to MDAnalysis >= 0.11.0.
Despite the advantages of structuring Containers with their stored data at the filesystem level instead of shoving them into a single database, the major disadvantage is that we have to be careful that file locks are working to ensure against concurrency corruption. We need stress tests of the file locking mechanisms for:
We also need to test that changes in filesystem elements that occur through Container properties (e.g. #19) are also handled gracefully, though this will almost certainly require the underlying ContainerFile object to use the Foxhound to fetch its new path.
The online docs do not contain the doc strings of the individual classes and functions. There should be a module/class/function reference section.
Pytest is our testing framework of choice, and some basic tests have been written for Container elements (tags and categories), but these need to be expanded to cover all components of Sims and Groups. This includes:
Additional, less atomic tests can be added later. These may include, e.g., organizational patterns:
In [39]: S = mds.Sim('adk')
In [40]: u = mda.Universe('adk.psf',['adk_dims.dcd', 'adk_dims.dcd'])
In [41]: S.universe = u
In [42]: S.udef.trajectory
Out[42]: u'/home/richard/test/mdsynthesis/adk_dims.dcd'
I'm not exactly sure what's going on here, but I can't store a 1D DataFrame. Retrieving a (10,1) DataFrame gives the error "TypeError: Index(...) must be called with a collection of some kind, None was passed":
import mdsynthesis as mds
import numpy as np
import pandas as pd
s = mds.Sim('marklar')
s.data.add('test1',pd.DataFrame(np.zeros((1,10))))
print s.data["test1"] #Good
s.data.add('test2',pd.Series(np.zeros((10,))))
print s.data["test2"] #Good
s.data.add('test3',pd.DataFrame(np.zeros((10,10))))
print s.data["test3"] #Good
s.data.add('test4',pd.DataFrame(np.zeros((10,1))))
print s.data["test4"] #Not Good
I'm pretty sure this is actually a problem with writing the file to disk because the h5 file seems weird:
z=pd.HDFStore("test4/pdData.h5", 'r')
>>> z
<class 'pandas.io.pytables.HDFStore'>
File path: pdData.h5
/main [invalid_HDFStore node: sequence item 0: expected string, numpy.int64 found]
Any idea how to fix this?
This would be useful when loading a trajectory is rather slow, especially if the universe comprises many XTC trajectories, since these are indexed by MDAnalysis immediately on load. Also, being able to load a subset of the trajectories may be useful for the same reason until we get persistent indexes.
Trying to add a category to a Sim with
s.categories.add('temperature', 303)
will raise no exception, but also won't add the intended category.
Currently, Group.members[:] yields a list of all members, and slicing by index is allowed to get lists of a subset, since members do have an order. However, it would be most convenient to be able to select members by name, such as with::

    Group.members['lark']

or::

    Group.members[['lark', 'hark']]

as is used by pandas DataFrames to select multiple columns. This should then yield a Bundle object, instead of a list, containing all members that matched the names given. Since names are not required to be unique, this could be many more than the number of names supplied. This would allow calls such as::

    Group.members[['lark', 'hark']].data['ionbinding']

which would retrieve a concatenation of all concatenatable datasets matching the name 'ionbinding' present in the Bundle, once issue #8 is addressed.
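A toy sketch of the proposed name-based selection semantics (Members and Bundle here are stand-ins for the real classes):

```python
# Sketch: index/slice access keeps list semantics; name-based access
# returns a Bundle of every member matching the given name(s).
class Bundle:
    def __init__(self, members):
        self.members = list(members)

class Members:
    def __init__(self, members):
        # members are (name, object) pairs, kept in order
        self._members = list(members)

    def __getitem__(self, key):
        if isinstance(key, (int, slice)):
            # index-based access keeps the current list behavior
            return self._members[key]
        names = key if isinstance(key, list) else [key]
        # names need not be unique, so this may return more members
        # than names supplied
        return Bundle([m for m in self._members if m[0] in names])

m = Members([('lark', 1), ('hark', 2), ('lark', 3)])
print(len(m[['lark', 'hark']].members))  # 3
```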
At the moment each Sim and Group gets a logging instance corresponding to Sim.<name> or Group.<name>, respectively, and these are used to output information to the user. So far there are no written rules as to when the logger should be used and when an exception should be raised instead.
It would be convenient (and more object oriented) if I could say
u = MDAnalysis.Universe(TPR, XTC)
s = Sim(name)
s.universes.add('anyname', u)
instead of s.universes.add('anyname', universe=[TPR, XTC, ...]).
Although I appreciate that this will make it more difficult to recreate the universe. Perhaps this should be tackled together with MDAnalysis/mdanalysis#173, which could supply the necessary state information to reconstitute the universe.
I haven't found a way to rename a universe --- am I missing something? I was looking for
Sim.universes.rename(old, new)
or at least
Sim.universes[new] = Sim.universes[old]
del Sim.universes[old]
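Assuming no such method exists yet, a minimal sketch of the rename semantics, with a plain dict standing in for the real universe store:

```python
# Sketch: re-key a stored universe definition under a new name,
# refusing to clobber an existing one.
def rename_universe(universes, old, new):
    if new in universes:
        raise ValueError("'{}' already exists".format(new))
    # pop + reinsert mirrors the proposed "assign then delete" idiom
    universes[new] = universes.pop(old)

u = {'main': ('topol.tpr', ['cg.xtc'])}
rename_universe(u, 'main', 'cg')
print(sorted(u))  # ['cg']
```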
Since the basic structure of MDSynthesis isn't entirely specific to molecular dynamics simulation, I think it might make sense to break the core out into a separate package. I'd been considering this for a while, but today I was contacted by someone with a use-case outside of MD, and I think it would be of benefit to MDSynthesis' core development to make it more general.
I've started a repository for this work here: https://github.com/dotsdl/datreant
Is anyone opposed to this? One thing to note is that datreant is BSD 3-clause licensed, making its use more permissive than MDSynthesis, which is GPLv2 (same as MDAnalysis). I technically need permission from everyone who has contributed code so far to make this change.
P.S. Opinions on the name are welcome. I wanted something that included 'dat' to indicate 'data', and some kind of word to indicate 'trees' (as in directory trees). An old D&D woodland creature ('treant') came to mind. :D
It may be useful for others in a lab group on a shared volume to be able to use Sims and Groups of others with read permissions but without write permissions. If Sims and Groups can generally be made to work with only read permissions for accessing their stored attributes/data, this would make this possible.
The main changes would come in the __init__ methods of each container.
I like using selections (and everything else) in a pythonic fashion (i.e. if it looks like a dict it should mostly behave like one even if under the hood it's all HDF5). One strength of MDS is hiding all the bookkeeping.
Therefore, it is annoying that accessing a non-existent selection, say, raises NoSuchNodeError (no idea what kind of exception this is) when I tried
try:
    sel = self.sim.selections[name]
except KeyError:
    # do something about it because selection 'name' is not stored in the sim
because from the syntax I expected to get a KeyError.
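A sketch of translating the backend exception into a KeyError at the dict-like boundary (toy classes; the NoSuchNodeError defined here stands in for the PyTables exception):

```python
# Sketch: the aggregator catches the backend's lookup failure and
# re-raises it as the KeyError a dict-like interface should produce.
class NoSuchNodeError(Exception):
    """Stand-in for tables.NoSuchNodeError."""

class Selections:
    def __init__(self, stored):
        self._stored = dict(stored)

    def _backend_get(self, name):
        # stand-in for the HDF5 node lookup
        if name not in self._stored:
            raise NoSuchNodeError(name)
        return self._stored[name]

    def __getitem__(self, name):
        try:
            return self._backend_get(name)
        except NoSuchNodeError:
            # translate into the exception callers of a dict-like expect
            raise KeyError(name)

sels = Selections({'protein': 'name CA'})
try:
    sels['missing']
except KeyError:
    print('KeyError raised as expected')
```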
We need convenient mechanisms for modifying universe definitions. One way to do this would be to have Universes.topology and Universes.trajectory properties that allow getting and setting of the corresponding elements. If it isn't too confusing, we could also add __setitem__ and __getitem__ to the underlying objects to allow setting and getting for any universe definition.
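As a sketch of the property idea (class and attribute names here are hypothetical):

```python
# Sketch: getter/setter properties over the elements of a universe
# definition; a real backend would persist the change to the state file.
class UniverseDefinition:
    def __init__(self, topology, trajectory):
        self._topology = topology
        self._trajectory = list(trajectory)

    @property
    def topology(self):
        return self._topology

    @topology.setter
    def topology(self, path):
        # setting replaces the stored topology path for this definition
        self._topology = path

    @property
    def trajectory(self):
        return list(self._trajectory)

udef = UniverseDefinition('topol.tpr', ['cg.xtc'])
udef.topology = 'new.tpr'
print(udef.topology)  # new.tpr
```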
The docs still state that this is alpha software but I think in truth it has been in stable use for more than a year. Isn't it time to slap a 1.0 on it?
(Or are we waiting for MDAnalysis 0.16.0?)
What is holding up a release?
At the moment, Universes and Members use only stored absolute paths for finding the files they need to generate universes and members, respectively. Relative paths from the Container's basedir are also stored, but they are not yet used. This breaks functionality for Sims and Groups when moving them around in a filesystem, even when the relative paths between these Containers and the files they require haven't changed.
The relative paths should be tried in the event the absolute path fails. If the file is found, the absolute path should be updated.
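A minimal sketch of the proposed fallback logic, with hypothetical names:

```python
# Sketch: prefer the stored absolute path; if it is gone, try the stored
# path relative to the Container's base directory instead.
import os

def locate(abspath, relpath, basedir):
    """Return a usable path to the file, preferring the absolute one."""
    if os.path.exists(abspath):
        return abspath
    candidate = os.path.join(basedir, relpath)
    if os.path.exists(candidate):
        # found via the relative path; the caller should now update the
        # stored absolute path to this value
        return os.path.abspath(candidate)
    raise IOError("file not found: {}".format(abspath))
```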
I was playing with the selections tool on Sims, and I bumped into a couple annoyances
ag = u.atoms[:10]
S.selections.add(ag)
# TypeError: object name is not a string: <AtomGroup with 10 atoms>
I guess this ties in with #25 in accepting premade MDA objects. This should be possible as AtomGroups are unambiguous based on a hash of their Universe (to uniquely identify this among S.universes) and then just ag.indices()
S.selections += 'something'
# TypeError: unsupported operand type(s) for +=: 'Selections' and 'str'
Using .add() feels weird to me; intuitively I want to += stuff in. Thoughts?
Here's a fun one for the test suite, MDSynthesis crashes when you open up scalar numpy arrays. This can easily be avoided by not storing scalars in the first place, but it's worth mentioning!
import mdsynthesis as mds
import numpy as np
s=mds.Sim("marlar")
s.data['harhar']=np.array(20)
s.data['harhar']
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "build/bdist.macosx-10.5-x86_64/egg/mdsynthesis/core/aggregators.py", line 960, in __getitem__
File "build/bdist.macosx-10.5-x86_64/egg/mdsynthesis/core/aggregators.py", line 899, in inner
File "build/bdist.macosx-10.5-x86_64/egg/mdsynthesis/core/aggregators.py", line 1154, in retrieve
File "build/bdist.macosx-10.5-x86_64/egg/mdsynthesis/core/persistence.py", line 1504, in get_data
File "build/bdist.macosx-10.5-x86_64/egg/mdsynthesis/core/persistence.py", line 1819, in inner
File "build/bdist.macosx-10.5-x86_64/egg/mdsynthesis/core/persistence.py", line 1878, in get_data
File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper (/Users/cing/Projects/h5py/h5py/_objects.c:2458)
File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper (/Users/cing/Projects/h5py/h5py/_objects.c:2415)
File "/Users/cing/anaconda/lib/python2.7/site-packages/h5py-2.5.0-py2.7-macosx-10.5-x86_64.egg/h5py/_hl/dataset.py", line 418, in __getitem__
selection = sel2.select_read(fspace, args)
File "/Users/cing/anaconda/lib/python2.7/site-packages/h5py-2.5.0-py2.7-macosx-10.5-x86_64.egg/h5py/_hl/selections2.py", line 92, in select_read
return ScalarReadSelection(fspace, args)
File "/Users/cing/anaconda/lib/python2.7/site-packages/h5py-2.5.0-py2.7-macosx-10.5-x86_64.egg/h5py/_hl/selections2.py", line 80, in __init__
raise ValueError("Illegal slicing argument for scalar dataspace")
ValueError: Illegal slicing argument for scalar dataspace
MDSynthesis stores the scalars just fine and I don't see anything wrong with reading them using h5py as per usual:
import h5py
f = h5py.File('marlar/harhar/npData.h5','r')
f['main'].value
20
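The failure looks like the generic zero-dimensional indexing problem rather than a storage problem; a numpy-only sketch of the same behavior, independent of MDSynthesis and h5py:

```python
# Sketch: a 0-d ("scalar") array cannot be sliced, mirroring the
# "Illegal slicing argument for scalar dataspace" error above, but
# empty-tuple indexing reads the value fine.
import numpy as np

a = np.array(20)   # zero-dimensional array, like the stored dataset
print(a.shape)     # ()
print(a[()])       # 20: empty-tuple indexing works on scalars
try:
    a[:]           # slicing a 0-d array fails
except IndexError as err:
    print('slicing failed:', err)
```

This suggests the fix is in how the retrieval code indexes the dataset, not in how the scalar is written.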
Since it's possible for the name and location properties to change the path to a Container's statefile, these should obtain an exclusive lock on the statefile before being applied. A possible problem with this is that it requires an open file descriptor. What is a good solution?
Just a little better error handling is needed here, or well, you could support Sims with the same path name as files but that could get really confusing!
In shell:
touch marlar
then in Python,
import mdsynthesis as mds
s=mds.Sim("marlar")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "build/bdist.macosx-10.5-x86_64/egg/mdsynthesis/containers.py", line 512, in __init__
File "build/bdist.macosx-10.5-x86_64/egg/mdsynthesis/containers.py", line 224, in _regenerate
File "build/bdist.macosx-10.5-x86_64/egg/mdsynthesis/core/persistence.py", line 65, in containerfile
UnboundLocalError: local variable 'statefileclass' referenced before assignment
I sometimes get several simulations of colleagues that are all in one folder
sim
+-- foo.01.xtc
+-- foo.02.xtc
+-- foo.03.xtc
+-- foo.pdb
My current approach is to move these into separate folders, but I would like to be able to tell a Sim that there is more than one simulation in that folder. All simulations have exactly the same setup, so I treat them as an ensemble in my analysis instead of a single simulation.
I know I can just copy them to different directories and then create Sims in them; I'm just curious whether such a workflow would be possible with mdsynthesis.
Can I convert a Treant into a Sim and keep all its tags and categories?
Having all modules start with uppercase looks a bit weird and is not the recommendation of PEP8:
(I know that e.g. MDAnalysis is also not fully compliant but this is a new project and you still have a chance to do it right without p***ing off too many users...)
At the moment the Data aggregator doesn't allow iteration through datasets. This should be implemented. It should also allow accessing datasets with a list of data names.
Although Sims and Groups can be created and manipulated just fine with their built-in methods, it would be useful to start making some top-level functions that can do this as well, but potentially taking as input multiple Containers at once.
Some ideas:
mds.copy() could copy stored elements of one container into another new or existing container. It could include keyword arguments to indicate what to include in or leave out of the copy. It could also be made a method of Containers themselves, e.g. Sim.copy(sim, all=False, universes=None, ...).
mds.load(*containers) could yield a mds.Bundle object, which behaves like an ordered set of Containers. If a single container path is given, it will just spit out the loaded Container itself.
There is probably room for methods to add tags, categories, members, universes, and selections to any number of the appropriate Containers as convenience functions. These would also lend themselves to easily building a shell-level script for manipulating Containers and their data.
Since datasets are stored in a directory structure, and since their names reflect this, it would be fairly easy to make deletions using globbing. This would be a great convenience when some datasets matching a pattern should be removed without removing others.
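A sketch of glob-based deletion over stored dataset keys, with a plain dict standing in for the real data store:

```python
# Sketch: remove every stored dataset whose key matches a glob pattern,
# using fnmatch over the existing keys.
import fnmatch

def remove_matching(datasets, pattern):
    """Delete every dataset whose key matches `pattern`; return removed keys."""
    doomed = fnmatch.filter(list(datasets), pattern)
    for key in doomed:
        del datasets[key]
    return doomed

data = {'rmsd/run1': 1, 'rmsd/run2': 2, 'rgyr/run1': 3}
print(remove_matching(data, 'rmsd/*'))  # ['rmsd/run1', 'rmsd/run2']
print(sorted(data))                     # ['rgyr/run1']
```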
Need tests for the following:
Upon loading an existing Container, the corresponding ContainerFile subclass should do a version check between the version given by the state file and the current version of MDS. It should then run code that updates the schema to that used by the current version of MDS.
This will require, among other things, schema-migration support in each ContainerFile subclass, which can change each release.
Bundle is intended to be a useful grouping tool for Containers without the persistence of a Group. It would be awesome if it could run any strings it receives as input through glob.glob to grab whole sets of Containers from the filesystem easily.
HDF5 doesn't (yet) reclaim space from deleted nodes. Therefore, in principle SimFiles will slowly grow if enough universe definitions/selections are added and replaced. We need to assess how big a problem this is, and what the best solution is for "cleaning" state files that have this potential for bloat.
So in trying to write tests for the Tags system, I've had some trouble with the design. Currently it's something like:
class ContainerFile:
    def get_tags(self):
        # do all the work on getting

    def set_tags(self):
        # do some other work on setting of tags

class Tags:
    def __iter__(self):
        return iter(self._containerfile.get_tags())

    def add(self, things):
        # do some work
        self._containerfile.add_tags(processed_things)
So when getting/setting tags, the work is split between two classes... Ideally all the work should belong to the Tags class, so something more like:
class ContainerFile:
    # doesn't have to know it has tags

class Tags:
    def get_tags(self):
        table = self._containerfile.handle.get_node()
So any aggregators plug into the container file and use its API, i.e. I could write a new aggregator without having to modify ContainerFile.
I'm not 100% sure how this would work with all the file decorators; maybe generic Containerfile.read_table and Containerfile.write_table methods which aggregators could use... Maybe a .get_read_handle() which returns handle with the _read_state lock.
Thoughts?
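A toy sketch of the generic-accessor idea (all names hypothetical; a threading.Lock stands in for the real shared/exclusive file locks):

```python
# Sketch: the file class only knows how to hand out locked table reads;
# each aggregator owns all logic for its own data.
import threading

class ContainerFile:
    def __init__(self, tables):
        self._tables = dict(tables)
        self._lock = threading.Lock()

    def read_table(self, name):
        # acquire the lock (shared, in the real design) for the read
        with self._lock:
            return list(self._tables[name])

class Tags:
    def __init__(self, containerfile):
        self._containerfile = containerfile

    def __iter__(self):
        # the aggregator uses only the generic accessor, so new
        # aggregators need no changes to ContainerFile
        return iter(self._containerfile.read_table('tags'))

cf = ContainerFile({'tags': ['solvated', 'equilibrated']})
print(list(Tags(cf)))  # ['solvated', 'equilibrated']
```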
To allow quick random access to an XTC trajectory, MDAnalysis generates an index of its frames when prompted. This index can take several minutes to build, and if the Universe in question is defined with multiple XTC files, it can take far longer. It would be incredibly useful if these indexes could be saved and recalled later.
The XTCReader in MDAnalysis can write and read these indices to and from disk, so the major question is how to implement such that a stale index is not applied to a trajectory that has changed. This will be where most of the work will come.
Addendum: File locking will need to apply here. The existing machinery for doing this (Core.Files) should be used somehow.
It's very common for me to loop through all the members in a Group and apply some function to each one. It would be very useful to have a map method that takes a function as input and a parameter for how many processes to use to pool out the application of the function in parallel.
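A sketch of such a map method (names hypothetical), with a parameter for the number of worker processes:

```python
# Sketch: apply a function to every member, optionally fanning out over
# a multiprocessing pool.
from multiprocessing import Pool

def members_map(func, members, processes=1):
    """Apply func to each member; use a process pool when processes > 1.

    Note: with processes > 1, func must be picklable (a top-level
    function), per the usual multiprocessing constraints.
    """
    if processes > 1:
        with Pool(processes) as pool:
            return pool.map(func, members)
    return [func(m) for m in members]

print(members_map(len, ['ab', 'c']))  # [2, 1]
```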
Bundles function somewhat as throwaway Groups, giving built-in methods for dealing with whole collections of Containers. It would be particularly pythonic if addition between Containers creates a Bundle, and addition between a Bundle and a Container adds that Container to the Bundle.
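A sketch of the proposed addition semantics, with toy stand-ins for the real classes:

```python
# Sketch: Container + Container -> Bundle; Bundle + Container appends;
# Bundle + Bundle merges.
class Container:
    def __init__(self, name):
        self.name = name

    def __add__(self, other):
        if isinstance(other, Container):
            return Bundle([self, other])
        return NotImplemented

class Bundle:
    def __init__(self, members):
        self.members = list(members)

    def __add__(self, other):
        if isinstance(other, Container):
            return Bundle(self.members + [other])
        if isinstance(other, Bundle):
            return Bundle(self.members + other.members)
        return NotImplemented

b = Container('a') + Container('b') + Container('c')
print([m.name for m in b.members])  # ['a', 'b', 'c']
```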
With MDS 0.5.1 the following fails:
import numpy as np
import pandas as pd
import mdsynthesis as mds
df = pd.DataFrame({('R1', 'NZ1'): np.arange(3), ('R1', 'NZ2'): np.arange(3,0,-1),
('T2', 'OG1'): np.arange(3)*0.5,
('Q3', 'OE1'): np.arange(3)*2, ('Q3', 'OE1'): np.arange(3)*(-2),
})
sim = mds.Sim('boba')
sim.data.add('multi', df)
with the error
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-105-5af881236448> in <module>()
----> 1 sim.data.add('multi', df)
/tmp/src/datreant/datreant/aggregators.py in inner(self, handle, *args, **kwargs)
609
610 try:
--> 611 out = func(self, handle, *args, **kwargs)
612 finally:
613 del self._datafile
/tmp/src/datreant/datreant/aggregators.py in add(self, handle, data)
688
689 """
--> 690 self._datafile.add_data('main', data)
691
692 def remove(self, handle, **kwargs):
/tmp/src/datreant/datreant/persistence.py in add_data(self, key, data)
1380 os.path.join(self.datadir, pydatafile), logger=self.logger)
1381
-> 1382 self.datafile.add_data(key, data)
1383
1384 # dereference
/tmp/src/datreant/datreant/persistence.py in inner(self, *args, **kwargs)
292 self.handle = self._open_file_w()
293 try:
--> 294 out = func(self, *args, **kwargs)
295 finally:
296 self.handle.close()
/tmp/src/datreant/datreant/persistence.py in add_data(self, key, data)
1567 self.handle.put(
1568 key, data, format='table', data_columns=True, complevel=5,
-> 1569 complib='blosc')
1570 except AttributeError:
1571 self.handle.put(
/usr/local/lib/python2.7/dist-packages/pandas/io/pytables.pyc in put(self, key, value, format, append, **kwargs)
812 format = get_option("io.hdf.default_format") or 'fixed'
813 kwargs = self._validate_format(format, kwargs)
--> 814 self._write_to_group(key, value, append=append, **kwargs)
815
816 def remove(self, key, where=None, start=None, stop=None):
/usr/local/lib/python2.7/dist-packages/pandas/io/pytables.pyc in _write_to_group(self, key, value, format, index, append, complib, encoding, **kwargs)
1250
1251 # write the object
-> 1252 s.write(obj=value, append=append, complib=complib, **kwargs)
1253
1254 if s.is_table and index:
/usr/local/lib/python2.7/dist-packages/pandas/io/pytables.pyc in write(self, obj, axes, append, complib, complevel, fletcher32, min_itemsize, chunksize, expectedrows, dropna, **kwargs)
3755 self.create_axes(axes=axes, obj=obj, validate=append,
3756 min_itemsize=min_itemsize,
-> 3757 **kwargs)
3758
3759 for a in self.axes:
/usr/local/lib/python2.7/dist-packages/pandas/io/pytables.pyc in create_axes(self, axes, obj, validate, nan_rep, data_columns, min_itemsize, **kwargs)
3357 axis, axis_labels = self.non_index_axes[0]
3358 data_columns = self.validate_data_columns(
-> 3359 data_columns, min_itemsize)
3360 if len(data_columns):
3361 mgr = block_obj.reindex_axis(
/usr/local/lib/python2.7/dist-packages/pandas/io/pytables.pyc in validate_data_columns(self, data_columns, min_itemsize)
3220 if info.get('type') == 'MultiIndex' and data_columns:
3221 raise ValueError("cannot use a multi-index on axis [{0}] with "
-> 3222 "data_columns {1}".format(axis, data_columns))
3223
3224 # evaluate the passed data_columns, True == use all columns
ValueError: cannot use a multi-index on axis [1] with data_columns True
It is quite likely that this is a problem that I (or MDS) have with pandas --- any insights welcome.
Universes can now iterate through auxiliary data in addition to a trajectory. We want to persist defined auxiliaries in the Universe definition for a Sim in the same way that we persist other information, such as kwargs.
Although the keys for all data elements currently display by default using, e.g., Sim.data, it would be useful to be able to get a listing of data keys that match a query. This could be as simple as making some kind of Data.isin method that takes a string as input and outputs all keys that have that string present.
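A sketch of the suggested Data.isin (the method name comes from this issue; the implementation is hypothetical):

```python
# Sketch: a substring query over stored data keys.
class Data:
    def __init__(self, keys):
        self._keys = list(keys)

    def isin(self, query):
        """Return all keys containing `query` as a substring."""
        return [k for k in self._keys if query in k]

d = Data(['rmsd/backbone', 'rmsd/all', 'rgyr'])
print(d.isin('rmsd'))  # ['rmsd/backbone', 'rmsd/all']
```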
Trying to import mdsynthesis fails:
----> 1 import mdsynthesis as mds
build/bdist.linux-x86_64/egg/mdsynthesis/__init__.py in <module>()
build/bdist.linux-x86_64/egg/mdsynthesis/treants.py in <module>()
build/bdist.linux-x86_64/egg/mdsynthesis/limbs.py in <module>()
ImportError: No module named AtomGroup
Iterating through a Universes aggregator should yield keys. Also, we need a keys() method for the aggregator.
In the spirit of datreant/datreant#100, we want to get rid of the concept of treanttypes and make all statefiles have names like Treant.<uuid>.json. This has the benefit that tools such as datreant.cli can easily work with all Treants, not just those generated with datreant.core. It also greatly simplifies the relationship between datreant and libraries such as mdsynthesis, allowing us to fix some annoying behavior.
To accomplish this consistently, mdsynthesis must at least:
- have a discover method that only selects Treants that already feature the mdsynthesis namespace in their state file, returning these as Sim objects. This will come with a performance penalty, since now the files must be parsed to check for this condition, whereas before the filename encoded it.
- have a Sim object that, upon use on an existing Treant file, creates the mdsynthesis namespace, marking it as a Sim for later discovery.
- expose datreant components on import, such as discover or Bundle, as it currently does.
In order for this scheme to work consistently, datreant.core.Bundle must be modified to not allow paths as input, but instead only take Treant objects or their subclasses directly (otherwise it's not clear what class to use on the path). We must check that serialization and deserialization still work under this scheme.
Something that's become a clear problem from changes in upstream datreant.core is the state Sim instances carry with them, this being their "active" universe. The original idea was that for a given simulation one might have several different post-processed trajectories, and therefore perhaps different MDAnalysis.Universe definitions along with their own selections. It's not a bad idea, but it means that you will get something different from different instances of the same Sim when doing Sim.universe, depending on what you've done previously.
This is an issue especially when working with Bundles of Sims, since doing set-style operations between different Bundles with overlapping sets of Sims (but not the same instances) means that one has no guarantee of getting a Sim with the state they expect.
I propose to eliminate the "multiple Universes" functionality from Sim objects. This will reduce the state of a Sim to that stored in its state file (good!) and that stored in its loaded Universe instance (not great, but with new features coming in upstream MDAnalysis, this could be mitigated). It will also simplify the Sim API greatly.
For those that use the "multiple Universes" functionality (myself included), I think nesting Sim objects might be a good general solution. Something like:
main_Sim
|-> Sim.<uuid>.json
|
|-> no_water
| |
| |->Sim.<uuid>.json
|
|-> fitted
|
|-> Sim.<uuid>.json
will work, and will be easy to use given work being done in upstream datreant.core.
Thoughts?