
msmbuilder-legacy's Introduction

MSMBuilder

MSMBuilder is a Python package that implements a series of statistical models for high-dimensional time series. It is particularly focused on the analysis of atomistic simulations of biomolecular dynamics. For example, MSMBuilder has been used to model protein folding and conformational change from molecular dynamics (MD) simulations. MSMBuilder is available under the LGPL (v2.1 or later).

Capabilities include:

  • Feature extraction into dihedrals, contact maps, and more
  • Geometric clustering with a variety of algorithms
  • Dimensionality reduction using time-structure independent component analysis (tICA) and principal component analysis (PCA)
  • Markov state model (MSM) construction
  • Rate-matrix MSM construction
  • Hidden Markov model (HMM) construction
  • Timescale and transition path analysis

Check out the documentation at msmbuilder.org and join the mailing list. For a broader overview of MSMBuilder, take a look at our slide deck.

Installation

The preferred installation mechanism for msmbuilder is with conda:

$ conda install -c omnia msmbuilder

If you don't have conda, or are new to scientific python, we recommend that you download the Anaconda scientific python distribution.

Workflow

An example workflow might be as follows:

  1. Set up a system for molecular dynamics, and run one or more simulations for as long as you can on as many CPUs or GPUs as you have access to. There are a lot of great software packages for running MD, e.g. OpenMM, Gromacs, Amber, CHARMM, and many others. MSMBuilder is not one of them.

  2. Transform your MD coordinates into an appropriate set of features.

  3. Perform some sort of dimensionality reduction with tICA or PCA. Reduce your data into discrete states by using clustering.

  4. Fit an MSM, rate-matrix MSM, or HMM. Perform model selection using cross-validation with the generalized matrix Rayleigh quotient.
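
To make step 4 a little more concrete, here is a minimal sketch, written in plain Python rather than with MSMBuilder's own classes, of the simplest way a Markov state model can be estimated from already-discretized trajectories: count transitions at a chosen lag time and row-normalize. The function name `estimate_transition_matrix` and the toy state sequences are assumptions made for illustration; practical estimators usually add things like ergodic trimming and reversibility constraints.

```python
def estimate_transition_matrix(trajectories, n_states, lag_time=1):
    """Row-normalized transition counts at the given lag time (illustrative only)."""
    if lag_time < 1:
        raise ValueError("lag_time must be a positive integer")
    counts = [[0] * n_states for _ in range(n_states)]
    for traj in trajectories:
        # Pair each frame with the frame lag_time steps later.
        for i, j in zip(traj[:-lag_time], traj[lag_time:]):
            counts[i][j] += 1
    tprob = []
    for row in counts:
        total = sum(row)
        # A state that is never left gets an all-zero row.
        tprob.append([c / total if total else 0.0 for c in row])
    return tprob


# Two toy trajectories over the discrete states 0 and 1 (made-up data).
trajs = [[0, 0, 1, 1, 0], [1, 1, 1, 0, 0]]
tprob = estimate_transition_matrix(trajs, n_states=2)
```

Each row of `tprob` is the estimated probability of moving from one state to every other state after `lag_time` steps.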

msmbuilder-legacy's People

Contributors

gbowman, kyleabeauchamp, leeping, mpharrigan, rmcgibbo, schwancr, tjlane, vvoelz

msmbuilder-legacy's Issues

Remove CreateMergedTrajectoriesFromFAH.py

This code is superseded by TJ's FahProject.py.

The only code that uses this module is scripts/UpdateProjectToHDF5.py (which may be getting deprecated or removed -- see the issue about that).

DoTPT.py doesn't work

There are differences in the way DoTPT is using the library functions. One is:

    TypeError: GetBCommittors() got an unexpected keyword argument 'maxiter'

The script is passing a kwarg that no longer exists in the library function!
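
A mismatch like this can be caught before the call is made. The sketch below compares the keyword arguments a script intends to pass against the signature of the function it calls. `get_committors` is a hypothetical stand-in, not the real library function, and `unexpected_kwargs` is an illustrative helper name.

```python
import inspect


def get_committors(tprob, sources, sinks):
    """Hypothetical stand-in for a library function that no longer takes 'maxiter'."""
    return (tprob, sources, sinks)


def unexpected_kwargs(func, kwargs):
    """Return the keyword names in kwargs that func's signature does not accept."""
    params = inspect.signature(func).parameters
    # A **kwargs parameter accepts any keyword, so nothing is unexpected.
    if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()):
        return []
    return sorted(name for name in kwargs if name not in params)


# The keyword arguments an out-of-date script would try to pass.
script_kwargs = {"maxiter": 100000}
stale = unexpected_kwargs(get_committors, script_kwargs)
```

A check of this kind could live in the wrapper tests, so that a script falling out of sync with the library shows up as a test failure rather than as a runtime TypeError.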

pep8 fixes to library

Step 1) Install the pep8 checker
$ pip install pep8

Step 2) Tabulate the pep8 violations by file

rmcgibbo@vspm10 ~/local/msmbuilder/src/python
$ cat errors.ipy 
from collections import defaultdict
from pprint import pprint
!pep8 *.py --ignore=W293,E501 > violations
# Each line of pep8 output starts with "<module>.py:...", so count per module.
n_violations = defaultdict(int)
with open('violations') as f:
    for line in f:
        n_violations[line.split('.py')[0]] += 1

# Modules with the most violations come first.
pprint(sorted(n_violations.items(), key=lambda x: x[1], reverse=True))

rmcgibbo@vspm10 ~/local/msmbuilder/src/python
$ ipython errors.ipy 
Enthought Python Distribution -- www.enthought.com

[('PDB', 528),
 ('Trajectory', 193),
 ('transition_path_theory', 169),
 ('Project', 154),
 ('lumping', 131),
 ('FahProject', 125),
 ('CreateMergedTrajectoriesFromFAH', 88),
 ('Serializer', 72),
 ('xtc', 65),
 ('MSMLib', 50),
 ('SCRE', 39),
 ('clustering', 38),
 ('dcd', 36),
 ('msm_analysis', 34),
 ('Conformation', 32),
 ('CopernicusProject', 31),
 ('plot_graph', 27),
 ('arglib', 24),
 ('assigning', 17),
 ('utils', 11),
 ('park_kmedoids', 10),
 ('__init__', 9),
 ('drift', 9),
 ('Citation', 3),
 ('License', 2)]

Step 3) Fix code.

The pep8 checker is probably too strict, so these numbers don't need to come down to zero, but they're a good indication of which files need work.

Maybe people can call out which files they're going to fix, and we'll try to do it evenly so that nobody has too much work?

When you change method names, you should put in aliases to the old names with a deprecation warning. Look in MSMLib.py and the "deprecated" decorator in utils.py for a template.
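
For reference, an alias with a deprecation warning can look roughly like the sketch below. This is not the decorator from utils.py, only a minimal version of the same idea built on the standard `warnings` module; `count_frames` and `CountFrames` are hypothetical names.

```python
import functools
import warnings


def deprecated(replacement_name):
    """Mark a function as deprecated and point callers at its replacement."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            warnings.warn(
                "%s is deprecated; use %s instead" % (func.__name__, replacement_name),
                DeprecationWarning,
                stacklevel=2,  # attribute the warning to the caller, not this wrapper
            )
            return func(*args, **kwargs)
        return wrapper
    return decorator


def count_frames(assignments):
    """New PEP 8 style name (hypothetical)."""
    return len(assignments)


@deprecated("count_frames")
def CountFrames(assignments):
    """Old name, kept as a thin alias so existing scripts keep working."""
    return count_frames(assignments)
```

Old call sites keep working and emit a `DeprecationWarning`, which gives people time to migrate before the alias is removed.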

Trajectory Should Contain Timestep Metadata.

While I'm on the list of confusing things I always forget to write down, I think this would be sweet. Would require the user to input it, since this data is not associated with XTCs or DCDs.

Could be optional to begin with, at least.

Edit: Changed title to reflect an affirmative answer to the question, "Should Trajectory Contain Timestep Metadata?" -Robert

making git-push less annoying

I was having some annoying trouble with git push, where I was trying to push my branch, but it's complaining that the other branches aren't caught up.

To solve it,

rmcgibbo@ubuntu ~/msmbuilder
$ git config push.default current

changes the behavior of git-push to only push the current branch unless you say otherwise. Apparently, the normal behavior is to push every branch that has a matching branch on the remote. This seems like a better default when people are doing concurrent development on different branches.

this is the symptom:

rmcgibbo@ubuntu ~/msmbuilder
$ git push
Counting objects: 120, done.
Delta compression using up to 2 threads.
Compressing objects: 100% (26/26), done.
Writing objects: 100% (70/70), 14.52 KiB, done.
Total 70 (delta 62), reused 47 (delta 44)
To [email protected]:SimTk/msmbuilder.git
   d4786b7..b3ed9c6  new_setup.py -> new_setup.py
 ! [rejected]        logging -> logging (non-fast-forward)
 ! [rejected]        master -> master (non-fast-forward)
error: failed to push some refs to '[email protected]:SimTk/msmbuilder.git'
To prevent you from losing history, non-fast-forward updates were rejected
Merge the remote changes (e.g. 'git pull') before pushing again.  See the
'Note about fast-forwards' section of 'git push --help' for details.

Implement Hub Scores

TJL is working on it! I have code, but it is currently not giving the correct answer.

recording a history of command line actions?

@tjlane and I were brainstorming, and thinking it would be nice if the scripts would record a history of what you've done, so that you can have a record of what steps you took to build a certain MSM.

Some data that would need to be recorded: what command line arguments the script was called with, what output it saved to disk, and maybe what date/time it was run?

Not really sure where this info should be stored.

In a sense, the thing that is saved should almost be a DAG where the nodes are output files and the edges are mutual use?
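
One possible shape for this, sketched with the standard library only: each script appends one JSON line per run to a history file, listing its arguments, inputs, outputs, and a timestamp. Matching one run's outputs against another run's inputs would then give the edges of the DAG. The function name `record_history`, the file name `history.jsonl`, and the example arguments are all assumptions for illustration, not an existing MSMBuilder feature.

```python
import json
import os
import sys
import tempfile
import time


def record_history(input_files, output_files, history_path="history.jsonl", argv=None):
    """Append one JSON record describing a script run to the history file."""
    entry = {
        "argv": list(sys.argv if argv is None else argv),
        "inputs": [os.path.abspath(p) for p in input_files],
        "outputs": [os.path.abspath(p) for p in output_files],
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%S"),
    }
    with open(history_path, "a") as f:
        f.write(json.dumps(entry) + "\n")
    return entry


# Demonstration in a scratch directory; the script name and flags are made up.
with tempfile.TemporaryDirectory() as tmp:
    log = os.path.join(tmp, "history.jsonl")
    record_history(
        ["Assignments.h5"],
        ["tProb.mtx"],
        history_path=log,
        argv=["SomeScript.py", "--lagtime", "10"],
    )
    with open(log) as f:
        entries = [json.loads(line) for line in f]
```

An append-only text file keeps the record easy to read and to diff. Where it should live (next to the output, or in a project-level file) is still the open question above.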

Sandboxes vs. Branching (vs. committing straight to the master)

I'm just figuring out how to use git branches, and I think they might be preferable in many cases to the sandbox system.

For example, here's a workflow that would work for Christian's work on modifying Trajectory.py to load more flexibly.

# create a new branch and check it out ;)
git checkout -b new-trajectory 

[... make changes to files ... ]

git commit -a -m "awesome new features"
# push the new branch to github
git push origin new-trajectory 

Instead of needing to create files like Trajectory_crs.py with new names, you just edit the files in place and git does the work.

Switching between branches is really easy and takes no time at all. Say you want to compare how your new code is working against the old code, you do:

git checkout new-trajectory
python setup.py install
[... test ...]

git checkout master
python setup.py install
[... test ...]

And then if you want to merge your changes into the "master" branch, you can click the pull-request thing on github and then you get this really nice GUI of where the code has changed, and people can comment, etc.

fix the last test failure on master

TestWrappers.TestWrappers.test_n_FindPaths (from nosetests)

Arrays are not almost equal to 6 decimals

(shapes (10, 2), (10,) mismatch)
 x: array([[ 0, 70],
       [ 0, 57],
       [ 0, 50],...
 y: array([  1.00761346e-05,   9.11540389e-06,   8.18049790e-06,
         6.44561131e-06,   4.16600680e-06,   2.60505753e-06,
         2.40638719e-06,   2.29047655e-06,   2.17286763e-06,
         2.15678648e-06])
-------------------- >> begin captured logging << --------------------
tpt: INFO: Searched 74 nodes
tpt: INFO: Path Num | Path | Bottleneck | Flux
tpt: DEBUG: In Backtrack: Flux 1.00761346032e-05, bestflux 0
tpt: INFO: 1 | [0, 70] | (0, 70) | 1.00761346032e-05 
tpt: DEBUG: In Backtrack: Flux 9.11540388794e-06, bestflux 0
tpt: INFO: 2 | [0, 57, 70] | (0, 57) | 9.11540388794e-06 
tpt: DEBUG: In Backtrack: Flux 8.18049789543e-06, bestflux 0
tpt: INFO: 3 | [0, 50, 70] | (0, 50) | 8.18049789543e-06 
tpt: DEBUG: In Backtrack: Flux 6.44561130982e-06, bestflux 0
tpt: INFO: 4 | [0, 56, 70] | (0, 56) | 6.44561130982e-06 
tpt: DEBUG: In Backtrack: Flux 4.16600680449e-06, bestflux 0
tpt: INFO: 5 | [0, 18, 70] | (0, 18) | 4.16600680449e-06 
tpt: DEBUG: In Backtrack: Flux 2.6050575318e-06, bestflux 0
tpt: INFO: 6 | [0, 51, 49, 70] | (51, 49) | 2.6050575318e-06 
tpt: DEBUG: In Backtrack: Flux 2.40638718843e-06, bestflux 0
tpt: INFO: 7 | [0, 9, 50, 70] | (9, 50) | 2.40638718843e-06 
tpt: DEBUG: In Backtrack: Flux 2.29047654724e-06, bestflux 0
tpt: INFO: 8 | [0, 36, 70] | (0, 36) | 2.29047654724e-06 
tpt: DEBUG: In Backtrack: Flux 2.17286762871e-06, bestflux 0
tpt: INFO: 9 | [0, 45, 70] | (0, 45) | 2.17286762871e-06 
tpt: DEBUG: In Backtrack: Flux 2.15678647887e-06, bestflux 0
tpt: INFO: 10 | [0, 9, 45, 70] | (9, 45) | 2.15678647887e-06 
--------------------- >> end captured logging << ---------------------

http://171.65.102.206/job/msmbuilder/TOXENV=py27/117/testReport/TestWrappers/TestWrappers/test_n_FindPaths/

Remove Extras

I will wield the rm -rf hammer if it is not gone organically!

pdb reader/conformation base when only 1 atom in pdb

something is breaking here -- not really sure what but adding another atom seems to fix it.

the error is that self[key], instead of being an array of strings, is a string. like something is being flattened?

KeysToForceCopy = ["ChainID", "AtomNames", "ResidueNames", "AtomID", "ResidueID"]
for key in KeysToForceCopy:  # Force copy to avoid owning same numpy memory.
    self[key] = self[key].copy()

Fix bad syntax, unused functions in transition_path_theory.py

Kyle,

There is some code in the TPT library where I have no idea what is being called. Maybe you could do a quick pass and see where I commented? Most of the comments are up to date, but there are some mysteries in there.

There are also two functions at the very end I bet we can get rid of. See what you think.

TJ

Bugs in FAHProject.py

Has anyone ever actually tried the workserver inject code in FAHProject.py?

restart_server has syntax errors in it.

I don't think send_error_email works -- most computers don't have an SMTP server running for one.

Minor bugs in Cfep, hub scores

I just ran

$ pylint -E src/python

to search for syntax errors

************* Module python.cfep
E:189,18:CutCoordinate._build_msm_from_counts: Undefined variable 'symmetrization_error'
E:224,27:CutCoordinate.reaction_mfpt: Undefined variable 'zh'
E:224,34:CutCoordinate.reaction_mfpt: Undefined variable 'zc'
E:224,56:CutCoordinate.reaction_mfpt: Undefined variable 'zh'

************* Module python.tpt
E:738,10:calculate_all_to_all_mfpt: Instance of 'matrix' has no 'transpose' member
E:978,12:calculate_fraction_visits: Passing unexpected keyword argument 'dense' in function call
E:1154,12:calculate_all_hub_scores: Undefined variable 'all_to_all_mfpt'
E:1162,67:calculate_all_hub_scores: Undefined variable 'waypoints'

pylint isn't always perfect, but I trust these.

Writing tests for methods that use randomness

I just noticed some bugs in GetRandomConfs.py that slipped through the arglib switch, which we didn't notice because it's basically untested. The reason it was untested is because whoever wrote TestWrappers.py wasn't sure how to test methods that use randomness. (There's a comment that says # This one is tricky since it is stochastic...)

After googling around a bit, I saw this.
http://stackoverflow.com/questions/5836335/consistenly-create-same-random-numpy-array

The key answer to look at is actually the one by Robert Kern, who is on the numpy/scipy core team.

This isn't urgent, but I think we should add an optional keyword argument like random_source or something to methods that use randomness, which when supplied they use instead of numpy.random.

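def kmedoids_clustering(..., random_source=None):
    # Fall back to the global numpy generator when no source is supplied.
    if random_source is None:
        random_source = np.random

    ...

    # randint needs an upper bound, e.g. the number of points
    some_random_number_used_for_algorithm = random_source.randint(n_points)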

So this way, during testing (or maybe this would be useful in debugging too?) you can supply f(... , random_source=np.random.RandomState(someseed)) and you will get reproducible results.
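
Here is a small runnable version of the same pattern. To keep the sketch free of dependencies it uses the standard library's `random.Random` in place of `np.random.RandomState`; the idea is the same, since both are seedable generator objects that offer the same sampling methods as their module-level counterparts. `pick_initial_medoids` is a hypothetical function, not part of the clustering code.

```python
import random


def pick_initial_medoids(n_points, k, random_source=None):
    """Choose k distinct point indices, using random_source if one is given."""
    if random_source is None:
        # Default to the module-level generator, which is not reproducible.
        random_source = random
    return random_source.sample(range(n_points), k)


# Two generators seeded identically produce identical choices.
first = pick_initial_medoids(100, 5, random_source=random.Random(42))
second = pick_initial_medoids(100, 5, random_source=random.Random(42))
```

A test can then compare `first` and `second`, or compare against stored reference output, without touching the global random state that other tests may rely on.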

lprmsd not working from Cluster.py

rmcgibbo@certainty-a ~/msmbuilder/Tutorial
$ Cluster.py lprmsd kcenters -k 100

breaks with

File "/home/rmcgibbo/local/epd-7.3-1-rh5-x86_64/lib/python2.7/site-packages/msmbuilder-2.6.dev-py2.7-linux-x86_64.egg/msmbuilder/metric_LPRMSD.py", line 68, in __init__
self.TD = RMSD.TheoData(S['XYZList'][:,np.array(aidx)])
IndexError: arrays used as indices must be of integer (or boolean) type

Maybe the new construct_metric code is not setting the dtype right?

Make docstrings render nicely

If you look at the docs, http://msmbuilder.readthedocs.org/en/latest/, they look pretty nice. A few things could be better though.

  1. The references are not showing up that nicely. I think we might be messing up the numpy docstring reference syntax.
  2. Indentation matters in the docstrings, so some of them aren't getting parsed that well, such as
    http://msmbuilder.readthedocs.org/en/latest/generated/msmbuilder.tpt.calculate_hub_score.html#msmbuilder.tpt.calculate_hub_score
  3. Putting examples in the docstrings looks really nice. I didn't really know this was a thing, but I noticed it here: http://msmbuilder.readthedocs.org/en/latest/utils.html. That code I basically copied from activestate.com, and you can see that it shows up fantastically.

I also know that when I read other docs, I usually go straight for the examples.
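
As an illustration of points 2 and 3, here is a sketch of a numpy-style docstring with consistent indentation and an Examples section. The function `row_normalize` is made up for the purpose and is not taken from the library.

```python
def row_normalize(counts):
    """Normalize each row of a count matrix so that it sums to one.

    Parameters
    ----------
    counts : list of list of float
        Matrix of transition counts.

    Returns
    -------
    tprob : list of list of float
        Row-normalized matrix. Rows that sum to zero are left as zeros.

    Examples
    --------
    >>> row_normalize([[1, 1], [0, 4]])
    [[0.5, 0.5], [0.0, 1.0]]
    """
    result = []
    for row in counts:
        total = sum(row)
        result.append([c / total if total else 0.0 for c in row])
    return result
```

Because the example is written as an interpreter session, it can also be checked automatically, for instance with `python -m doctest` on the file that contains it, so the examples in the docs do not drift away from the code.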

Feature Request: Hooks to modify states

This is a feature request to add hooks in MSMBuilder so that the state space can be manipulated without modifying the existing code.

When version 2.0 was released on the SimTK website, I hacked in a number of project-specific features including using environmental variables to select the distance metric and some methods to modify the state space to remove states beyond what was done with the ergodic trimming. This involved monkey patching MSMLib and rewriting a number of the scripts to call the custom version of the code. I see that in version 2.5 it is much easier to use custom distance metrics. I think it would be quite useful to extend the plugin mechanism and add hooks throughout the code to allow users to modify the count matrix and state space in a similar way.

tests / reference data in one place.

  1. We have the new unit tests and the old integration tests (wrapper tests) in different places, which is not optimal. The reference data for the tests is similarly mixed around.
  2. In sweet projects like numpy and scipy, you can run the tests from inside the project. This can only work because the setup.py script installs the tests and data files somewhere. I guess the tests are installed in some kind of subpackage, and the files needed for the tests are in some data directories that go with the egg, in a format that setuptools knows about.

>>> import numpy as np
>>> np.test()
Ran 3542 tests in 39.265s
OK (KNOWNFAIL=3, SKIP=1)

MLE getting stuck for hours

Anyone seeing this type of thing when running the MLE?

Log-Likelihood after 12145 function evaluations: -2623598.26493
Log-Likelihood after 12178 function evaluations: -2623598.26493
Log-Likelihood after 12211 function evaluations: -2623598.26493
Log-Likelihood after 12244 function evaluations: -2623598.26493
Log-Likelihood after 12277 function evaluations: -2623598.26493
Log-Likelihood after 12310 function evaluations: -2623598.26493
Log-Likelihood after 12343 function evaluations: -2623598.26493
Log-Likelihood after 12376 function evaluations: -2623598.26493
Log-Likelihood after 12409 function evaluations: -2623598.26493
Log-Likelihood after 12442 function evaluations: -2623598.26493
Log-Likelihood after 12475 function evaluations: -2623598.26493

I'm getting this type of printout a lot. Maybe the convergence criterion is set too high?

Print -> logging

  • Replace print statements with calls to the logging module
  • Add -q (quiet) and -v (verbose) flags to arglib (by default?) and do setup of logging module in arglib
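
A minimal sketch of how the two bullets could fit together, using `argparse` and `logging` from the standard library. The helper names `add_verbosity_flags` and `level_from_args` are assumptions; arglib's real interface may look different.

```python
import argparse
import logging


def add_verbosity_flags(parser):
    """Add mutually exclusive -q/--quiet and -v/--verbose flags to a parser."""
    group = parser.add_mutually_exclusive_group()
    group.add_argument("-q", "--quiet", action="store_true",
                       help="only show warnings and errors")
    group.add_argument("-v", "--verbose", action="store_true",
                       help="also show debug output")
    return parser


def level_from_args(args):
    """Translate the parsed flags into a logging level (INFO by default)."""
    if args.quiet:
        return logging.WARNING
    if args.verbose:
        return logging.DEBUG
    return logging.INFO


parser = add_verbosity_flags(argparse.ArgumentParser(description="example script"))
args = parser.parse_args(["-v"])  # a real script would call parse_args() with no list

logging.basicConfig(level=level_from_args(args),
                    format="%(name)s: %(levelname)s: %(message)s")
logger = logging.getLogger("tpt")
logger.info("Searched %d nodes", 74)  # takes the place of a print statement
```

Library modules would only call `logging.getLogger(name)` and log through it; the level and format get configured once, in the script entry point.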

Proposed changes to Trajectory

  1. Delete Conformation, ConformationBaseClass
  2. Use Biopython as PDB reader
  3. Cleanup?
  4. Make Serializer more lightweight? or even delete?

optimal use of pytables

I'm finding that it is faster to load a trajectory in chunks and then concatenate than to load the entire trajectory.

It seems that in some cases it can be 5-10 times faster to read it in chunks.

Does anyone know:

  1. Why this is?
  2. Is there an optimal "chunk size" to use with pytables?
  • There is a 'chunk_size' attribute to the file when you open it with pytables, so there has been some work here

Jenkins cannot find TPT reference data

Sad days. @rmcgibbo, is there an easy fix?

Traceback (most recent call last):
  File "/usr/lib/python2.7/unittest/case.py", line 318, in run
    self.setUp()
  File "/var/lib/jenkins/jobs/msmbuilder/workspace/TOXENV/py27/.tox/py27/local/lib/python2.7/site-packages/nose/case.py", line 381, in setUp
    try_run(self.inst, ('setup', 'setUp'))
  File "/var/lib/jenkins/jobs/msmbuilder/workspace/TOXENV/py27/.tox/py27/local/lib/python2.7/site-packages/nose/util.py", line 478, in try_run
    return func()
  File "/var/lib/jenkins/jobs/msmbuilder/workspace/TOXENV/py27/new_tests/test_tpt.py", line 19, in setUp
    self.tprob = io.mmread( os.path.join(self.tpt_ref_dir, "tProb.mtx") ) #.toarray()
  File "/usr/local/lib/python2.7/dist-packages/scipy/io/mmio.py", line 68, in mmread
    return MMFile().read(source)
  File "/usr/local/lib/python2.7/dist-packages/scipy/io/mmio.py", line 298, in read
    stream, close_it = self._open(source)
  File "/usr/local/lib/python2.7/dist-packages/scipy/io/mmio.py", line 247, in _open
    stream = open(filespec, mode)
IOError: [Errno 2] No such file or directory: 'reference/transition_path_theory_reference/tProb.mtx'
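
The path in the error is relative, so it only resolves when the tests are started from one particular working directory. One common fix, sketched here under the assumption that the reference directory sits next to the test module, is to build the path from the test file's own location. `reference_path` is a hypothetical helper, and the `base_file` value below is a made-up location.

```python
import os


def reference_path(*parts, base_file=None):
    """Build a path to reference data relative to a test module, not the cwd."""
    # In a real test module, base_file would simply be left as None.
    source = __file__ if base_file is None else base_file
    base_dir = os.path.dirname(os.path.abspath(source))
    return os.path.join(base_dir, "reference", *parts)


# base_file is passed explicitly here so that the sketch runs anywhere.
tprob_file = reference_path(
    "transition_path_theory_reference",
    "tProb.mtx",
    base_file=os.path.join(os.sep, "tmp", "new_tests", "test_tpt.py"),
)
```

In `setUp`, the result would replace the bare relative directory, so the test no longer depends on where Jenkins or tox happens to start it.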

ReadTheDocs.org

We have some sphinx docs, they should get posted/updated automatically.

Jenkins doesn't have yaml!

I'm not sure when the yaml dependency ended up in the nosetests. But I just added a two line fix to master and jenkins fails because:

    No module named yaml

MSMLib unit tests, BuildMSM wrapper tests are failing/broken

I just made some changes to MSMLib in master, and when trying to run the tests noticed that a lot of stuff is broken.

There are two problems:

(1) test_msmlib.py in new_tests:

The other problem is that these tests are, for the most part, incomprehensible. Not sure where the reference data came from, so not sure what to trust. Also there are tons of tests for build_msm, but it's not clear which are doing what and which should work. Some comments would be nice.

(2) TestWrappers.py

I think I fixed anything I broke in the changes, but it's hard to be sure -- completely unrelated parts of the code are giving me errors (e.g. assignments).

@kyleabeauchamp, did you write these? Can you help me clean it up? Also check out the changes I made to build_msm and be sure you approve. I think it's a lot cleaner now.

@rmcgibbo it could be good to merge the old & new tests sooner rather than later...

PDB reader doesn't load the multi frame PDBs that it writes

Symptom:

In [4]: t = Trajectory.LoadTrajectoryFile('trj0.lh5')[0:10]

In [5]: len(t)
Out[5]: 10

In [6]: t.SaveToPDB('trj0.pdb')

In [7]: len(t.LoadFromPDB('trj0.pdb'))
Out[7]: 1

The saved trj0.pdb has 10 frames in it (they just look concatenated together, with 10 different ENDMDL records).
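
For illustration, splitting a file like that back into frames could look like the sketch below, which starts a new frame at every ENDMDL record. This is not the library's PDB reader, only a minimal stand-alone function (`split_pdb_frames` is a made-up name) applied to a one-atom toy frame.

```python
def split_pdb_frames(pdb_text):
    """Group ATOM/HETATM lines into frames, closing a frame at each ENDMDL."""
    frames, current = [], []
    for line in pdb_text.splitlines():
        record = line[:6].strip()  # the record name occupies columns 1-6
        if record in ("ATOM", "HETATM"):
            current.append(line)
        elif record == "ENDMDL":
            frames.append(current)
            current = []
    if current:
        # A file without a trailing ENDMDL still yields its last frame.
        frames.append(current)
    return frames


# One toy frame with a single atom, repeated ten times as in the symptom above.
one_frame = "ATOM      1  CA  ALA A   1       0.000   0.000   0.000\nENDMDL\n"
frames = split_pdb_frames(one_frame * 10)
```

A reader that keeps only the first of these groups would report a length of 1, which matches the symptom.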

Merge SavePDBs.py and GetRandomConfs.py

We don't want two separate scripts for grabbing random conformations.

However, I still prefer having the ability (SavePDBs) to dump out separate PDBs for each state and snapshot. This feature is important for creating publication quality figures in PyMol.

PDB Reader should preserve ResidueID

When we load a PDB into a trajectory object, the current code renumbers the ResidueIDs to start with 1. I think the desired behavior is to preserve exactly the contents of the PDB file.
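
As a sketch of the desired behavior: the residue sequence number sits in columns 23-26 of each ATOM record, so it can be read as written instead of being regenerated. The helper names and the toy records below are assumptions for illustration, not the existing reader.

```python
# Fixed-width ATOM layout: record name (cols 1-6), serial (7-11), atom name (13-16),
# residue name (18-20), chain ID (22), residue number (23-26), x/y/z (31-54).
ATOM_FORMAT = ("ATOM  {serial:5d} {name:<4s} {resname:>3s} {chain:1s}{resseq:4d}"
               "    {x:8.3f}{y:8.3f}{z:8.3f}")


def read_residue_ids(lines):
    """Return the residue sequence numbers exactly as stored in the file."""
    return [int(line[22:26]) for line in lines
            if line.startswith(("ATOM", "HETATM"))]


# Toy records whose residue numbering starts at 17 rather than 1.
pdb_lines = [
    ATOM_FORMAT.format(serial=1, name="CA", resname="GLY", chain="A",
                       resseq=17, x=0.0, y=0.0, z=0.0),
    ATOM_FORMAT.format(serial=2, name="CA", resname="ALA", chain="A",
                       resseq=18, x=1.0, y=0.0, z=0.0),
]
residue_ids = read_residue_ids(pdb_lines)
```

Keeping the numbers from the file means that selections written against the original PDB (for example, in PyMol) still refer to the same residues after a round trip.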

Refactor metrics.py

metrics.py is about 2000 lines, so it should really be a sub-package. The __init__.py can import the classes inside the sub-package:

__init__.py
abstract_metric.py
rmsd.py
vectorized.py
etc..
