enspara's People

Contributors

alexholehouse, cknoverek, gbowman, huimingx, justin-j-miller, justinrporter, lgsmith, longbowman, matthew-cruz, mickdub, mizimmer90, nlaird, sukritsingh


enspara's Issues

Number of frames distributed to centroids

Dear all,

enspara was recommended to me by (mdtraj/mdtraj#1544 (comment)) and so far, I absolutely enjoy how fast the clustering is! 84 seconds for 175000 frames is insane!

I clustered my trajectories (20 clusters) and got the fs-khybrid-clusters0020-centers.pickle file, which I joined with mdtraj afterwards. Then I converted this into a PDB.

My question now is: how do I know which cluster is used the most? That is, how many frames were assigned to each cluster?

The code I used:

/enspara/enspara/apps$ python cluster.py \
>   --trajectories production*.h5 \
>   --topology topology.pdb \
>   --algorithm khybrid \
>   --cluster-number 20 \
>   --atoms '(name N or name C or name CA)' \
>   --distances /home/user/enspara/enspara/apps/fs-khybrid-clusters0020-distances.h5 \
>   --center-features /home/user/enspara/enspara/apps/fs-khybrid-clusters0020-centers.pickle \
>   --assignments /home/user/enspara/enspara/apps/fs-khybrid-clusters0020-assignments.h5

>>> import pickle
>>> import mdtraj as md
>>>
>>> with open('fs-khybrid-clusters0020-centers.pickle', 'rb') as f:
...     ctr_structs = md.join(pickle.load(f))

>>> ctr_structs.save_pdb('test.pdb')

Any help is appreciated.
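For what it's worth, one way to answer this is to load the assignments file (the --assignments output above) and count labels per cluster. A minimal sketch, with a small synthetic ragged list standing in for the real enspara.ra.load('fs-khybrid-clusters0020-assignments.h5') result:

```python
import numpy as np

# In practice, `assignments` would come from
# enspara.ra.load('fs-khybrid-clusters0020-assignments.h5');
# here a small synthetic ragged example stands in for it.
assignments = [np.array([0, 1, 1, 2]), np.array([2, 2, 0])]

# One cluster label per frame: flatten all trajectories, then count labels.
flat = np.concatenate(assignments)
counts = np.bincount(flat, minlength=3)

for cluster_id, n_frames in enumerate(counts):
    print(f"cluster {cluster_id}: {n_frames} frames")
```

The cluster with the largest count is the most-used one (np.argmax(counts)).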

RaggedArray fails with len(shape) >= 3 and slices

a = ra.RaggedArray([[[0, 1, 0, 1],
                     [0, 1, 0, 1]],
                    [[1, 1, 1, 1, 0],
                     [0, 0, 0, 0, 1]]])

a[:, :, 0]

Fails with

ValueError                                Traceback (most recent call last)
<ipython-input-5-caa57490a0a5> in <module>()
----> 1 a[:, :, 0]

/Users/jrporter/projects/enspara/enspara/util/array.py in __getitem__(self, iis)
    458         # tuples get index conversion from 2d to 1d
    459         elif type(iis) is tuple:
--> 460             first_dimension, second_dimension = iis
    461             # if the first dimension is a slice, converts both sets of indices
    462             if type(first_dimension) is slice:

ValueError: too many values to unpack (expected 2)

So does a[:, 0, :] and a[0, :, :].

Also, the dtype of a is object, which isn't what I expected (since they're ints), and the shape is [2, 2, None] where I expected [2, None, 2]...

nose -> pytest?

There was some discussion in issue #214 about switching from nose to another testing suite. At the time the consensus was to switch to pytest. Is this still the preference?

Happy to poke at this in the coming days if so!

Parallelization

We need to figure out a standardized parallelization protocol.

Distance measurement as feature input

Dear enspara Team,

I am trying to make enspara one of my standard analysis tools. I read in your publication from 2019 that you used the solvent accessible surface area as an input feature. I would like to do something similar based on a distance between 2 atoms tracked over the trajectory. However, I couldn't find on your website what the feature input file should look like or how it is built.

Any hint on how to generate an input file like this is appreciated.
Thanks a lot for your effort!
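Not an official answer, but a sketch of the idea: a clustering feature is just one row of floats per frame, so a two-atom distance can be computed per frame and stacked into an (n_frames, 1) array. The coordinates and atom indices below are hypothetical; with mdtraj one would typically use md.compute_distances on the trajectory instead of the manual norm.

```python
import numpy as np

# xyz mimics mdtraj's Trajectory.xyz layout: (n_frames, n_atoms, 3).
xyz = np.array([[[0.0, 0.0, 0.0], [3.0, 4.0, 0.0]],
                [[0.0, 0.0, 0.0], [0.0, 0.0, 5.0]]])
i, j = 0, 1  # hypothetical indices of the two atoms to track

# Euclidean distance between atoms i and j in every frame,
# shaped (n_frames, 1) so each frame has a feature vector.
features = np.linalg.norm(xyz[:, i] - xyz[:, j], axis=1)[:, None]
print(features.ravel())  # one distance per frame
```

One such per-trajectory feature array per input trajectory is the shape of data the clustering tools expect.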

implied_timescales sometimes produces large, negative eigenvalues

An interesting bug/instability:

assigns = io.loadh('data/myh7/myh7-0.25A-hybrid-backbone_cb-assignments-778d259e.h5')['arr_0']
timescales = msm.implied_timescales(assigns, lag_times=range(10, 500, 10),
                                    n_imp_times=5, symmetrization='transpose')
print(timescales[:, 1].min())

produces -22517998136852548.0 on my machine. Sad!
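A plausible mechanism (an illustration, not a confirmed diagnosis): implied timescales come from -lag_time / log(eigenvalue), so an eigenvalue that drifts marginally above 1 through symmetrization or floating-point noise yields a huge negative timescale.

```python
import numpy as np

lag_time = 10
# Eigenvalues of a transition matrix should satisfy e_val <= 1, but
# numerical noise can push one marginally above 1. Then log(e_val) is a
# tiny positive number, and the implied timescale is enormous and negative.
e_val = 1.0 + 1e-15
imp_time = -lag_time / np.log(e_val)
print(imp_time)  # a very large negative number
```

Clamping eigenvalues to (0, 1) or flagging values above 1 before taking the log would be one way to surface this instability.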

OpenMP Parallelization

In setup.py on MacOS, we detect an OpenMP-compatible C compiler with:

if 'darwin' in platform.system().lower():
    os.environ["CC"] = "gcc-7"
    os.environ["CXX"] = "gcc-7"

Obviously, this is not terribly portable, and should be fixed.
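One more portable approach (a sketch, not the project's actual fix) is to probe whether the active compiler can build a trivial OpenMP program, rather than hard-coding gcc-7:

```python
import os
import subprocess
import tempfile
import textwrap


def has_openmp(cc):
    """Return True if compiler `cc` can build a trivial OpenMP program."""
    src = textwrap.dedent("""\
        #include <omp.h>
        int main(void) { return omp_get_max_threads() > 0 ? 0 : 1; }
    """)
    with tempfile.TemporaryDirectory() as tmp:
        c_path = os.path.join(tmp, "probe.c")
        with open(c_path, "w") as f:
            f.write(src)
        try:
            result = subprocess.run(
                [cc, "-fopenmp", c_path, "-o", os.path.join(tmp, "probe")],
                capture_output=True)
        except FileNotFoundError:
            return False  # compiler not installed at all
        return result.returncode == 0


# A missing compiler is simply reported as not OpenMP-capable.
print(has_openmp("no-such-compiler"))
```

setup.py could then scan a candidate list (e.g. $CC, gcc, clang) and fall back to building without OpenMP when no candidate passes the probe.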

RaggedArray in three dimensions fails to identify even array

ra.RaggedArray(
    [[[1, 0, 0],
      [1, 0, 0],
      [1, 0, 0]],
     [[1, 0, 0],
      [0, 1, 0],
      [0, 1, 0]],
     [[1, 0, 0],
      [0, 1, 0],
      [0, 1, 0]]])

produces

RaggedArray([
      [[1, 0, 0],
       [1, 0, 0],
       [1, 0, 0]], dtype=object,
      [[1, 0, 0],
       [0, 1, 0],
       [0, 1, 0]], dtype=object,
      [[1, 0, 0],
       [0, 1, 0],
       [0, 1, 0]], dtype=object])

where I expected a return type of

array([[[1, 0, 0],
        [1, 0, 0],
        [1, 0, 0]],

       [[1, 0, 0],
        [0, 1, 0],
        [0, 1, 0]],

       [[1, 0, 0],
        [0, 1, 0],
        [0, 1, 0]]])

@mizimmer90 we should talk about this behavior.

Add tolerance exit for Kmedoids

Add an optional exit to Kmedoids clustering (similar to radius) where, if the acceptance rate drops sufficiently low, any future Kmedoids sweeps are stopped.
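The proposed exit might look like the following loop skeleton (a sketch; the acceptance_rates argument stands in for the rate the real kmedoids update would report after each sweep):

```python
def sweep_until_converged(acceptance_rates, min_acceptance=0.05):
    """Run kmedoids sweeps until the acceptance rate drops below a
    tolerance; returns the number of sweeps actually performed."""
    n_sweeps = 0
    for rate in acceptance_rates:
        n_sweeps += 1  # one full kmedoids sweep would happen here
        if rate < min_acceptance:
            break  # proposals are almost never improving; stop early
    return n_sweeps


# Acceptance decays as medoids settle; here we stop after the 4th sweep.
print(sweep_until_converged([0.40, 0.20, 0.08, 0.03, 0.01]))
```

The tolerance plays the same role the radius cutoff does for kcenters: a cheap convergence proxy that avoids wasted sweeps.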

pockets.py returning None for no-pockets breaks FAST-pockets

@ameller1994 @mizimmer90
Wanted to raise this as an issue to solicit thoughts on best recourse. This commit:
c019706

switched enspara/geometry/pockets.py to returning None if no pockets are found in a structure. This breaks fast/analysis/pockets.py, as FAST expects an MDTraj object (error trace at the end of this issue). @ameller1994 I assume this is a useful commit that we'd prefer not to revert?

@lgsmith and I briefly discussed this and think the best course is to have FAST run a check to see whether it's getting an MDTraj object or None.

Currently, in the case of 0 pockets, FAST returns the following:
pocket_sizes.dat containing the int 0
stateXXXXXXX_pockets.pdb containing:

MODEL        0
ENDMDL
END

Any/all thoughts welcome!

Error trace (from FAST)-

Traceback (most recent call last):
  File "/home/justinjm/miniconda3/envs/enspara/lib/python3.6/multiprocessing/pool.py", line 119, in worker
    result = (True, func(*args, **kwds))
  File "/home/justinjm/miniconda3/envs/enspara/lib/python3.6/multiprocessing/pool.py", line 44, in mapstar
    return list(map(*args))
  File "/export/seas-home/justinjm/software/fast/analysis/pockets.py", line 47, in _save_pocket_element
    pocket_element.save_pdb(pok_output_name)
AttributeError: 'NoneType' object has no attribute 'save_pdb'
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "analysis.py", line 4, in <module>
    c.run()
  File "/export/seas-home/justinjm/software/fast/analysis/pockets.py", line 283, in run
    self.output_folder, self.n_cpus)
  File "/export/seas-home/justinjm/software/fast/analysis/pockets.py", line 72, in save_pocket_elements
    pool.map(_save_pocket_element, save_info)
  File "/home/justinjm/miniconda3/envs/enspara/lib/python3.6/multiprocessing/pool.py", line 266, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "/home/justinjm/miniconda3/envs/enspara/lib/python3.6/multiprocessing/pool.py", line 644, in get
    raise self._value
AttributeError: 'NoneType' object has no attribute 'save_pdb'

MSM analysis additions

We should add some tools to analyze MSMs and estimate errors, e.g. calculating CDFs and bootstrapping MSMs to get error bars.
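For the bootstrapping piece, a minimal sketch (state populations rather than a full MSM statistic, and a synthetic assignments list): resample whole trajectories with replacement, recompute the statistic on each resample, and read the error bar off the spread across resamples.

```python
import numpy as np

rng = np.random.default_rng(0)


def bootstrap_populations(assignments_list, n_boot=200):
    """Resample trajectories with replacement and recompute state
    populations for each resample."""
    n_trajs = len(assignments_list)
    n_states = max(a.max() for a in assignments_list) + 1
    pops = []
    for _ in range(n_boot):
        picks = rng.integers(0, n_trajs, size=n_trajs)
        flat = np.concatenate([assignments_list[i] for i in picks])
        pops.append(np.bincount(flat, minlength=n_states) / len(flat))
    return np.array(pops)


trajs = [np.array([0, 0, 1]), np.array([1, 1, 1]), np.array([0, 1, 0])]
pops = bootstrap_populations(trajs)
print(pops.mean(axis=0))  # estimate; pops.std(axis=0) gives error bars
```

The same resample-and-refit loop generalizes to transition probabilities or implied timescales by refitting an MSM inside the loop.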

clustering naming

The KCenters object has different names for its inputs than the kcenters function. We should probably be consistent with the naming; I don't have a strong preference for the final naming conventions.

RaggedArray third-dimension slice

It appears not to be possible to take a strided slice in the third dimension of a RaggedArray:

a = ra.RaggedArray(np.zeros((100, 2)), lengths=[90, 10])
a[:, :, ::2]

gives

ValueError                                Traceback (most recent call last)
<ipython-input-37-64bea6dd4686> in <module>()
      1 a = ra.RaggedArray(np.zeros((100, 2)), lengths=[90, 10])
----> 2 a[:, :, ::2]

~/modules/enspara/ra/ra.py in __getitem__(self, iis)
    531         # tuples get index conversion from 2d to 1d
    532         elif isinstance(iis, tuple):
--> 533             first_dimension, second_dimension = iis
    534             # if the first dimension is a slice, converts both sets of indices
    535             if isinstance(first_dimension, slice):

ValueError: too many values to unpack (expected 2)

@mizimmer90 is this a bug or expected behavior? It seems very reasonable to want to do this.

enspara not executing correctly

This traceback error pops when executing enspara
Kindly assist

INFO:numexpr.utils:NumExpr defaulting to 8 threads.
Traceback (most recent call last):
  File "/Users/monsurat/opt/anaconda3/bin/enspara", line 33, in <module>
    sys.exit(load_entry_point('enspara==0.1.1', 'console_scripts', 'enspara')())
  File "/Users/monsurat/opt/anaconda3/lib/python3.8/site-packages/enspara-0.1.1-py3.8-macosx-10.9-x86_64.egg/enspara/apps/main.py", line 44, in main
    args = identify_app(argv)
  File "/Users/monsurat/opt/anaconda3/lib/python3.8/site-packages/enspara-0.1.1-py3.8-macosx-10.9-x86_64.egg/enspara/apps/main.py", line 23, in identify_app
    while h in argv and argv.index(h) != 1:
TypeError: argument of type 'NoneType' is not iterable

Saving cluster centers with 'trj_filename'

Is there a way to save the cluster centers with 'trj_filename' so that one can identify which cluster center originates from which trajectory?
I guess the following part of the script is doing the saving:

if type(topology) == str:
    centers_location = np.array(
        centers_location, dtype=[
            ('state', 'int'), ('conf', 'int'), ('traj_num', 'int'),
            ('frame', 'int'), ('trj_filename', np.str_, 500),
            ('output', np.str_, 500),
            ('topology', np.str_, 500)])
else:
    centers_location = np.array(
        centers_location, dtype=[
            ('state', 'int'), ('conf', 'int'), ('traj_num', 'int'),
            ('frame', 'int'), ('trj_filename', np.str_, 500),
            ('output', np.str_, 500),
            ('topology', type(topology))])
unique_trajs = np.unique(centers_location['traj_num'])

Is it as simple as changing unique_trajs = np.unique(centers_location['traj_num']) to unique_trajs = np.unique(centers_location['trj_filename']), or are there other tricks?

articles.json not copied by install

Had to manually copy articles.json after installation, otherwise:

Traceback (most recent call last):
  File "/home/rafal.wiewiora/repos/enspara/enspara/apps/collect_cards.py", line 33, in <module>
    from enspara.cards import cards
  File "/home/rafal.wiewiora/anaconda3/lib/python3.6/site-packages/enspara-0.1.0-py3.6-linux-x86_64.egg/enspara/cards/__init__.py", line 4, in <module>
    from .cards import *
  File "/home/rafal.wiewiora/anaconda3/lib/python3.6/site-packages/enspara-0.1.0-py3.6-linux-x86_64.egg/enspara/cards/cards.py", line 6, in <module>
    from ..info_theory import mutual_info
  File "/home/rafal.wiewiora/anaconda3/lib/python3.6/site-packages/enspara-0.1.0-py3.6-linux-x86_64.egg/enspara/info_theory/__init__.py", line 6, in <module>
    from . import exposons
  File "/home/rafal.wiewiora/anaconda3/lib/python3.6/site-packages/enspara-0.1.0-py3.6-linux-x86_64.egg/enspara/info_theory/exposons.py", line 8, in <module>
    from enspara.citation import cite
  File "/home/rafal.wiewiora/anaconda3/lib/python3.6/site-packages/enspara-0.1.0-py3.6-linux-x86_64.egg/enspara/citation/__init__.py", line 3, in <module>
    from . import citation
  File "/home/rafal.wiewiora/anaconda3/lib/python3.6/site-packages/enspara-0.1.0-py3.6-linux-x86_64.egg/enspara/citation/citation.py", line 50, in <module>
    CITATION_DB = load_citation_db()
  File "/home/rafal.wiewiora/anaconda3/lib/python3.6/site-packages/enspara-0.1.0-py3.6-linux-x86_64.egg/enspara/citation/citation.py", line 10, in load_citation_db
    with open(os.path.join(os.path.dirname(__file__), 'articles.json')) as f:
FileNotFoundError: [Errno 2] No such file or directory: '/home/rafal.wiewiora/anaconda3/lib/python3.6/site-packages/enspara-0.1.0-py3.6-linux-x86_64.egg/enspara/citation/articles.json'
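The usual fix for this class of problem is to declare the data file as package data so installers copy it alongside the module. A sketch of the mapping setup.py would need (the actual enspara setup.py may organize its packages differently):

```python
# In setup.py, pass this mapping to
# setuptools.setup(..., package_data=package_data, include_package_data=True)
# so articles.json lands next to enspara/citation/citation.py on install.
package_data = {"enspara.citation": ["articles.json"]}
print(package_data["enspara.citation"])
```

With that in place, the load_citation_db open() on os.path.dirname(__file__) should find the file in the installed egg.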

ragged array disables error checking erroneously

When using the array/lengths format for building a RaggedArray, error checking is disabled on arrays that have total size 20k or greater, rather than first dimension 20k or greater.

Additionally, the warning here is put into the root logger, it should use a logger associated with its file. To do this, add the following to the beginning of the file:

import logging

logger = logging.getLogger(__name__)

Allow for subsampling of featured trajectories

Clustering on subsampled data is useful both for memory efficiency and for reducing the time to compute kmedoids updates. It would be nice to add the option to subsample featurized datasets at the point of clustering.
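A sketch of the behavior (a hypothetical helper, not existing enspara API): keep every stride-th frame of each featurized trajectory before handing the data to the clusterer.

```python
import numpy as np


def subsample_features(features_list, stride):
    """Keep every `stride`-th frame of each featurized trajectory."""
    return [feats[::stride] for feats in features_list]


# Two featurized trajectories with 10 and 6 frames, 3 features each.
feats = [np.zeros((10, 3)), np.zeros((6, 3))]
sub = subsample_features(feats, stride=2)
print([f.shape[0] for f in sub])  # frames remaining per trajectory
```

Cluster centers found on the subsampled data could then be used to assign the full dataset, as the cluster app already does for trajectories via --subsample.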

No topology if different trajectories loaded

Dear enspara Team,

I really enjoy enspara and hope that you can help me with my issue.

I am trying to cluster 2 trajectories from 2 different peptides (both with 14 aa). I only write the C, CA and N atoms to the trajectory to be able to cluster them together, and of course I also cluster based only on the RMSD of C, CA and N. The topology pdb likewise includes only the backbone atoms C, CA, N. After the clustering I want to count how many times each cluster was used by peptide 1 or 2 and hopefully see a different distribution.

python cluster.py \
  --trajectories peptide1.h5 peptide2.h5\ 
  --topology peptide_backbone.pdb \
  --algorithm khybrid \
  --cluster-number 5 \
  --atoms '(name N or name C or name CA)' \
  --distances /home/psc/enspara/enspara/apps/fs-khybrid-clusters0020-distances.h5 \
  --center-features /home/psc/enspara/enspara/apps/fs-khybrid-clusters0020-centers.pickle \
  --assignments /home/psc/enspara/enspara/apps/fs-khybrid-clusters0020-assignments.h5

The clustering works and I get a good distribution across the clusters:

cl_assign=enspara.ra.load('fs-khybrid-clusters0020-assignments.h5')
cl_assign_list=np.ndarray.tolist(cl_assign)
cl_assign_list_flat= [val for sublist in cl_assign_list for val in sublist]

centroid1=cl_assign_list_flat.count(0)
centroid2=cl_assign_list_flat.count(1)
centroid3=cl_assign_list_flat.count(2)
centroid4=cl_assign_list_flat.count(3)
centroid5=cl_assign_list_flat.count(4)
print('centroid1:',centroid1)
print('centroid2:',centroid2)
print('centroid3:',centroid3)
print('centroid4:',centroid4)
print('centroid5:',centroid5)

output:

centroid1: 2021
centroid2: 2152
centroid3: 3663
centroid4: 3622
centroid5: 2542

But when I try to generate a pdb of the cluster centers, I get the error:

ValueError: The topologies of the Trajectories are not the same

The code I used:

with open('fs-khybrid-clusters0020-centers.pickle', 'rb') as f:
    ctr_structs = md.join(pickle.load(f))
traj=ctr_structs.superpose(ctr_structs,frame=0,parallel=True)
traj.save_pdb('cluster_center.pdb')

If I take the route via:

python cluster.py \
  --trajectories peptide1.h5\
  --topology peptide1.pdb \
  --trajectories peptide2.h5\
  --topology peptide2.pdb \
  --atoms '(name CA or name C or name N)' \
  --algorithm khybrid \
  --cluster-number 5 \
  --distances /home/psc/enspara/enspara/apps/fs-khybrid-clusters0020-distances.h5 \
  --center-features /home/psc/enspara/enspara/apps/fs-khybrid-clusters0020-centers.pickle \
  --assignments /home/psc/enspara/enspara/apps/fs-khybrid-clusters0020-assignments.h5

and then go:

with open('fs-khybrid-clusters0020-centers.pickle', 'rb') as f:
    ctr_structs = md.join(pickle.load(f))

I get:

ValueError: Number of atoms in self (239) is not equal to number of atoms in other

I think I have a logic error somewhere that explains why this is not working. Any help is appreciated.
Thanks for your effort.

Getting enspara installable on ARM machines

Starting this discussion topic here to track any relevant info in 1 thread:

Problem: enspara in its current form cannot be installed on ARM machines, including MacBooks with M1 chips. This is because NumPy does not work on ARM machines below Python 3.8, while enspara only works with Python < 3.8.

What updates are needed to make enspara Python 3.8+ compatible? I remember that there are some hard-coded restrictions in setup.py - would removing those simply be sufficient? Prior to that update to setup.py, the master branch worked fine for me up to Python 3.9.

If it's just a matter of updating setup.py, then @jlotthammer or I can submit PRs to fix this. I want to make sure there are no other necessary fixes.

Testing warnings

There are some warnings during testing that should be investigated, determined to be benign, and suppressed (preferably in the source they come from).

mdtraj/core/trajectory.py:417: UserWarning: top= kwarg ignored since file contains topology information
  warnings.warn('top= kwarg ignored since file contains topology information')
enspara/enspara/msm/timescales.py:41: RuntimeWarning: invalid value encountered in log
  imp_times = -lag_time / np.log(e_vals[1:])
/Users/jrporter/projects/.venvs/stag3/lib/python3.5/site-packages/matplotlib/__init__.py:1357: UserWarning:  This call to matplotlib.use() has no effect
because the backend has already been chosen;
matplotlib.use() must be called *before* pylab, matplotlib.pyplot,
or matplotlib.backends is imported for the first time.

  warnings.warn(_use_error_msg)

Issue with ra.py module

Hi,

I am running into an issue with the RaggedArray module. I have attached the hdf5 file from one of my FAST swarms; the issue is described by the numpy error below. The deprecation warning was always there, but I think this is the first time reading a file for which ra.load actually returned a RaggedArray.

import numpy as np
from enspara.util import array as ra

assigns = ra.load('assignments.h5')
unique_states = np.unique(assigns)

/opt/docs/enspara/lib/python3.7/site-packages/numpy/lib/arraysetops.py:270: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray.
  ar = np.asanyarray(ar)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<__array_function__ internals>", line 6, in unique
  File "/opt/docs/enspara/lib/python3.7/site-packages/numpy/lib/arraysetops.py", line 272, in unique
    ret = _unique1d(ar, return_index, return_inverse, return_counts)
  File "/opt/docs/enspara/lib/python3.7/site-packages/numpy/lib/arraysetops.py", line 333, in _unique1d
    ar.sort()
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

Link to file: https://drive.google.com/file/d/1u6Jc1Dq8sIWF0PRV4pyyzLSVUNGv_k0W/view?usp=sharing

msm does not fit when parallelized

The following code raises an AttributeError. Fitting to assignments works when done normally, but through multiprocessing it is broken for some reason.

import itertools
import numpy as np
from enspara.msm import MSM, builders
from multiprocessing import Pool


def _fit_ass(msm_data):
    ass, m = msm_data
    m.fit(ass)
    return


def entry_point():

    ass = np.load("./assignments.npy")
    m = MSM(lag_time=1, method=builders.normalize, max_n_states=4)
    msm_data = list(zip(itertools.repeat(ass,5), itertools.repeat(m,5)))
    p = Pool(processes=1)
    ms = p.map(_fit_ass, msm_data)
    p.terminate()
    return


if __name__=='__main__':
    entry_point()

AttributeError: 'numpy.ndarray' object has no attribute 'toarray'

I was following the enspara tutorial and during plotting of transition probabilities using the following scripts I have encountered the aforementioned error

plt.imshow(np.log(m.tprobs_.toarray()))
plt.xticks(range(0, m.n_states_+1))
plt.yticks(range(0, m.n_states_+1))
plt.colorbar(label=r'ln $P(i \rightarrow j)$')

It seems like m.tprobs_ doesn't have a toarray() attribute. I was wondering how to solve this; I tried plt.imshow(np.log(m.tprobs_.transpose())). Is that a correct way to visualize transitions between states? Also, the plot shown here (https://enspara.readthedocs.io/en/latest/tutorial/fitting.html) only looks informative if we have 20 states. Any trick to make a nicer plot with a larger number of states?
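One note: .transpose() is not a substitute for .toarray() — it only swaps axes and does not densify anything. Whether tprobs_ comes back sparse or dense appears to vary, so a duck-typed guard (a sketch, not enspara API) handles both before plotting:

```python
import numpy as np


def as_dense(tprobs):
    # scipy sparse matrices expose .toarray(); plain ndarrays do not.
    return tprobs.toarray() if hasattr(tprobs, "toarray") else np.asarray(tprobs)


# Works unchanged on a dense transition matrix; on a scipy.sparse
# matrix it would call .toarray() instead.
dense = as_dense(np.eye(3))
print(dense.shape)
```

One could then plot with plt.imshow(np.log(as_dense(m.tprobs_))) regardless of the storage format.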

Setup.py and documentation question

  1. In the installation documentation, shown below, the numpy install has '=='; is that a typo?
conda install -c conda-forge mdtraj=1.8.0
conda install numpy==1.14
conda install cython
conda install mpi4py -c conda-forge
  2. When running python setup.py install I keep getting this error:

--------------------------------------------------------------------------------
Error: building enspara requires numpy and cython>=0.19
Try running the command ``pip install numpy cython`` or ``conda install numpy cython``,
or see http://docs.scipy.org/doc/numpy/user/install.html and http://cython.org/ for more information.
Traceback (most recent call last):
  File "setup.py", line 105, in <module>
    include_dirs=[np.get_include()],
NameError: name 'np' is not defined

despite the fact that if I open ipython and import numpy as np there's no problem. I also tried the conda installs, and all the versions and modules are installed. Am I missing something?

RaggedArray.__setitem__ on all-False index fails

If RaggedArray.__setitem__ is called on an index that is all False, it chokes. Compare this to the np.ndarray behavior, which simply doesn't set anything.

PR #41 contributes a failing test for this issue.
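For reference, the ndarray behavior being compared against:

```python
import numpy as np

# Assigning through an all-False boolean mask is simply a no-op in numpy.
a = np.arange(5)
a[np.zeros(5, dtype=bool)] = 99
print(a)  # [0 1 2 3 4], unchanged
```

RaggedArray.__setitem__ should presumably do the same: select nothing, set nothing, raise nothing.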

UnpickleableException

Hi.
I'm sorry about the intrusion, but I'm facing difficulties installing enspara.
After running the tutorial on my Ubuntu installation in a VirtualBox, I get the following error:

"  File "Cython/Compiler/Visitor.py", line 148, in Cython.Compiler.Visitor.TreeVisitor._raise_compiler_error
setuptools.sandbox.UnpickleableException: CompilerCrash((<FileSourceDescriptor:/home/s/anaconda3/lib/python3.6/site-packages/Cython/Includes/numpy/__init__.pxd>, 842, 5), 'AnalyseExpressionsTransform', 'Compiler crash in AnalyseExpressionsTransform

ModuleNode.body = StatListNode(__init__.pxd:17:0)
StatListNode.stats[41] = StatListNode(__init__.pxd:842:5)
StatListNode.stats[0] = CFuncDefNode(__init__.pxd:842:5,
    args = [...]/4,
    inline_in_pxd = True,
    modifiers = [...]/1,
    visibility = \'private\')
    
    Compiler crash traceback from this point on:
  File "Cython/Compiler/Visitor.py", line 180, in Cython.Compiler.Visitor.TreeVisitor._visit
  File "/home/s/anaconda3/lib/python3.6/site-packages/Cython/Compiler/ParseTreeTransforms.py", line 2215, in visit_FuncDefNode
    node.body = node.body.analyse_expressions(node.local_scope)
  File "/home/s/anaconda3/lib/python3.6/site-packages/Cython/Compiler/Nodes.py", line 436, in analyse_expressions
    for stat in self.stats]
  File "/home/s/anaconda3/lib/python3.6/site-packages/Cython/Compiler/Nodes.py", line 436, in <listcomp>
    for stat in self.stats]
  File "/home/s/anaconda3/lib/python3.6/site-packages/Cython/Compiler/Nodes.py", line 6669, in analyse_expressions
    self.item = self.item.coerce_to(self.target.type, env)
  File "/home/s/anaconda3/lib/python3.6/site-packages/Cython/Compiler/ExprNodes.py", line 950, in coerce_to
    src = PyTypeTestNode(src, dst_type, env)
  File "/home/s/anaconda3/lib/python3.6/site-packages/Cython/Compiler/ExprNodes.py", line 12981, in __init__
    assert dst_type.is_extension_type or dst_type.is_builtin_type, "PyTypeTest on non extension type"
    AssertionError: PyTypeTest on non extension type', AssertionError('PyTypeTest on non extension type',), <traceback object at 0x7f7880e37bc8>)"

How can I fix it?
Thank you.

RaggedArray does not implement shape

If we are to follow the numpy semantics closely, we should implement ra.shape, at least to the level that ra.shape[0] has the same meaning as ndarray.shape[0] does.

I propose that the result is actually (n_rows, lengths), where lengths is a list of the rows' lengths.
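A sketch of the proposed semantics (hypothetical, not current enspara behavior):

```python
# For a ragged array with rows of lengths 3, 4, and 2, the proposal
# gives shape == (n_rows, lengths), so shape[0] agrees with what
# ndarray.shape[0] would mean for the same data.
lengths = [3, 4, 2]
shape = (len(lengths), lengths)
print(shape[0])  # number of rows, as with ndarray.shape[0]
```

Code that only inspects shape[0] would then work unchanged on either array type.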

Mpi4py in documentation

Thanks for the very clear documentation! I just wanted to point out that the version on GitHub doesn't mention mpi4py as an optional package to install. I installed without mpi4py, tried running the FRET calculations, and got an error message, but it runs fine once mpi4py is installed.

cluster.py output flags fail if no directory is specified

If a directory is not specified, then cluster.py fails at the output stage. This is v. bad.

This fails:

python ~/modules/enspara/apps/cluster.py \
  --trajectories trajectory-*.xtc \
  --topology fs-peptide.pdb \
  --algorithm khybrid \
  --cluster-number 20 \
  --subsample 10 \
  --atoms '(name N or name C or name CA)' \
  --distances fs-khybrid-clusters0020-distances.h5 \
  --center-features fs-khybrid-clusters0020-centers.pickle \
  --assignments fs-khybrid-clusters0020-assignments.h5

but this does not:

python ~/modules/enspara/apps/cluster.py \
  --trajectories trajectory-*.xtc \
  --topology fs-peptide.pdb \
  --algorithm khybrid \
  --cluster-number 20 \
  --subsample 10 \
  --atoms '(name N or name C or name CA)' \
  --distances ./fs-khybrid-clusters0020-distances.h5 \
  --center-features ./fs-khybrid-clusters0020-centers.pickle \
  --assignments ./fs-khybrid-clusters0020-assignments.h5

`partition_list` and `_partition_list`

Why are there two functions in util.array named some variant of "partition list"? Naively, I expect them to do the same thing... and if they don't, perhaps we should rename them so it's obvious that they don't.

MLE Builder

Current MLE builder depends on msmbuilder; replace with PyEMMA's, check tests.

  • Replace msm.builders.mle's dependency on msmbuilder with PyEMMA's
  • Un-skip unit tests test_mle_not_in_place and test_mle_types.
  • Verify that tests pass/fix numerical differences in tests.
  • Document builder appropriately

SASA-Exposon calculation problem

Hi,

I am trying to follow the tutorial to calculate exposons. I have tried so far this:

trj = md.load('cent.xtc', top='structure.psf')
sasas = md.shrake_rupley(trj, probe_radius=0.28, mode='residue')

My problem begins here, when I run this 2 commands

sasas = condense_sidechain_sasas(sasas, trj.top)
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-5-26ca3578329b> in <module>
----> 1 sasas = condense_sidechain_sasas(sasas, trj.top)

~/.conda/envs/enspara/lib/python3.6/site-packages/enspara/citation/citation.py in wrapper(*args, **kwargs)
     43         def wrapper(*args, **kwargs):
     44             add_citation(citekey)
---> 45             return func(*args, **kwargs)
     46         return wrapper
     47     return cite_decorator

~/.conda/envs/enspara/lib/python3.6/site-packages/enspara/info_theory/exposons.py in condense_sidechain_sasas(sasas, top)
    144 
    145     assert top.n_residues > 1
--> 146     assert top.n_atoms == sasas.shape[1]
    147 
    148     SELECTION = ('not (name N or name C or name CA or name O or '

AssertionError: 

Did I forget any step? My MD simulation was done with OpenMM and the CHARMM force field.

Thanks in advance!

Warning when creating ragged arrays

Creating ragged arrays throws a VisibleDeprecationWarning; a few examples are below. I think this should be an easy fix by providing dtype=object at creation time, but I wanted to open a discussion on whether folks think this would break downstream code. Happy to drop these in and run some tests.

  • /export/seas-home/justinjm/software/enspara/enspara/ra/ra.py:502: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify dtype=object when creating the ndarray
    array = np.array(list(array))

  • /export/seas-home/justinjm/software/enspara/enspara/ra/ra.py:453: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify dtype=object when creating the ndarray
    [np.arange(start, stops[num], step) for num in first_dimension_iis])

  • /export/seas-home/justinjm/software/enspara/enspara/ra/ra.py:462: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify dtype=object when creating the ndarray
    for i in range(len(iis_2d_lengths))])), dtype=int)

EDIT: current software versions:
python version 3.6.13
numpy version 1.19.5

cluster.py fails if paths have no directory spec

$ python ~/projects/enspara/enspara/apps/cluster.py \
  --trajectories trajectory-*.xtc \
  --topology fs-peptide.pdb \
  --algorithm khybrid \
  --cluster-number 20 \
  --subsample 10 \
  --atoms '(name N or name C or name CA)' \
  --distances fs-khybrid-clusters0020-distances.h5 \
  --center-features fs-khybrid-clusters0020-centers.pickle \
  --assignments fs-khybrid-clusters0020-assignments.h5
[clustering blah blah]
05-01-2020 11:00:57 __main__ INFO    Clustered 28000 frames into 20 clusters in 2.8169675880053546 seconds.
05-01-2020 11:00:57 __main__ INFO    --center-indices not provided, not writing center indices to file.
05-01-2020 11:00:57 __main__ INFO    Wrote center indices in 0.00 sec.
05-01-2020 11:00:57 __main__ INFO    Saving cluster centers at
Traceback (most recent call last):
  File "/Users/jrporter/projects/enspara/enspara/apps/cluster.py", line 514, in <module>
    sys.exit(main(sys.argv))
  File "/Users/jrporter/projects/enspara/enspara/apps/cluster.py", line 502, in main
    write_centers(result, args)
  File "/Users/jrporter/projects/enspara/enspara/apps/cluster.py", line 422, in write_centers
    os.makedirs(outdir)
  File "/Users/jrporter/.envs/enspara/bin/../lib/python3.6/os.py", line 220, in makedirs
    mkdir(name, mode)
FileNotFoundError: [Errno 2] No such file or directory: ''

Problem goes away if you specify e.g. --distances ./fs-khybrid-clusters0020-distances.h5 (with the ./).
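The crash points at write_centers calling os.makedirs on the dirname of the output path, which is the empty string when no directory is given. A sketch of a guard (hypothetical helper, not the actual cluster.py code):

```python
import os


def ensure_outdir(path):
    """makedirs on the directory part of `path`, treating a bare
    filename (dirname == '') as the current directory."""
    outdir = os.path.dirname(path) or '.'
    os.makedirs(outdir, exist_ok=True)
    return outdir


# A bare filename no longer trips FileNotFoundError on makedirs('').
print(ensure_outdir('fs-khybrid-clusters0020-centers.pickle'))
```

exist_ok=True also removes the need to check whether the directory already exists before creating it.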

Pooled clustering followed by reassignment

Dear all,

I am trying to cluster 50 trajectories each from 2 different peptides (both with 14 aa). I only write the C, CA and N atoms to the trajectory to be able to cluster them together, and of course I also cluster based only on the RMSD of C, CA and N. After the clustering I want to count how many times each cluster was used by peptide 1 or 2 and hopefully see a different distribution.

What I do:

python cluster.py \
  --trajectories peptide1.h5\
  --topology peptide1.pdb \
  --trajectories peptide2.h5\
  --topology peptide2.pdb \
  --atoms '(name CA or name C or name N)' \
  --algorithm khybrid \
  --cluster-number 10 \
  --distances /home/psc/enspara/enspara/apps/fs-khybrid-clusters0020-distances.h5 \
  --center-features /home/psc/enspara/enspara/apps/fs-khybrid-clusters0020-centers.pickle \
  --assignments /home/psc/enspara/enspara/apps/fs-khybrid-clusters0020-assignments.h5

If I then track back which frame from which peptide was used for each cluster, I have the feeling that the first clusters are built from the first 50 trajectories and the last 5 clusters from the 50 trajectories of peptide2. This is also reflected in the centroid structures: the first 5 are peptide1, the last 5 peptide2.
Am I making a horrible mistake with this?

Any hint is appreciated, thanks for your effort.

The code I used:

import pandas as pd

df = pd.DataFrame(cl_ass)
peptide1 = df.iloc[0:50]
peptide2 = df.iloc[50:100]
count_peptide1 = pd.value_counts(peptide1.values.ravel())
print('peptide1:', count_peptide1)
count_peptide2 = pd.value_counts(peptide2.values.ravel())
print('peptide2:', count_peptide2)

import pickle

with open('fs-khybrid-clusters0020-centers.pickle', 'rb') as infile:
    centers = pickle.load(infile)

for i, traj in enumerate(centers):
    traj.save_pdb('{}.pdb'.format(i))
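For the counting itself, np.bincount on slices of the assignments array is a compact alternative to pandas value_counts. A sketch under the assumption that the assignments form a dense (n_trajectories, n_frames) integer array with peptide 1 in the first 50 rows (the function name is illustrative):

```python
import numpy as np

def per_peptide_cluster_counts(assignments, n_clusters, split=50):
    """Count how often each cluster label appears in the first
    `split` trajectories (peptide 1) vs. the remainder (peptide 2)."""
    assignments = np.asarray(assignments)
    counts1 = np.bincount(assignments[:split].ravel(), minlength=n_clusters)
    counts2 = np.bincount(assignments[split:].ravel(), minlength=n_clusters)
    return counts1, counts2

# Toy example: 4 "trajectories" of 3 frames each, 3 clusters,
# with the first 2 rows playing the role of peptide 1.
toy = np.array([[0, 0, 1], [1, 1, 2], [2, 2, 2], [0, 1, 2]])
c1, c2 = per_peptide_cluster_counts(toy, n_clusters=3, split=2)
```

If the two peptides really do use disjoint sets of clusters, the two count vectors will have non-overlapping support, which is easy to see at a glance.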

RaggedArray.__getitem__ fails on 2d lists

In numpy, it is possible to do the following:

>>> a = np.array(np.arange(9)).reshape(3, 3)
>>> a[list(zip( (1, 2), (2, 0), (0, 0) ))]

array([5, 6, 0])

Specifically, ndarray.__getitem__ handles the case where the requested indices are a 2-element list, similar to the output of np.where, but a list rather than a tuple.

Similar code on a RaggedArray:

a = ra.RaggedArray([np.arange(3), np.arange(4), np.arange(2)])
a[list(zip( (1, 2), (2, 0), (0, 0) ))]

fails with

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-108-d8376781920d> in <module>()
      1 a = ra.RaggedArray([np.arange(3), np.arange(4), np.arange(2)])
----> 2 a[list(zip( (1, 2), (2, 0), (0, 0) ))]

/home/jrporter/modules/enspara/util/array.py in __getitem__(self, iis)
    455         elif (type(iis) is slice) or (type(iis) is list) \
    456                 or (type(iis) is np.ndarray):
--> 457             return RaggedArray(self._array[iis])
    458         # tuples get index conversion from 2d to 1d
    459         elif type(iis) is tuple:

IndexError: too many indices for array

The very-similar following succeeds:

>>> a = ra.RaggedArray([np.arange(3), np.arange(4), np.arange(2)])
>>> a[tuple(zip( (1, 2), (2, 0), (0, 0) ))]

array([2, 0, 0])

IIRC this tuple/list issue is one of the things that gave me a lot of trouble when I wrote the __getitem__ we wound up not using. (I don't remember if I eventually figured it out or not.)

Anyway, it looks like the __getitem__ we currently have implemented uses the type of the indexer ("iis") as a proxy for the meaning of the contained data. I would suggest using the length of the data instead. As a probably-flawed example of what I'm suggesting: if the list only has two elements of matching dimensions, then it's an np.where-result-like indexer. If it instead is a list of many elements of one dimension, then it can be interpreted as a list of indices in the appropriate dimension.
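One possible shape for that length-based dispatch, purely as a sketch of the suggestion above (not the actual enspara implementation, and inheriting the same acknowledged flaws): a 2-element list of equal-length index sequences is treated like an np.where-style result, anything else as plain first-axis selection.

```python
def interpret_indexer(iis):
    """Classify a list indexer by its shape rather than its type:
    a 2-element list of equal-length index sequences looks like an
    np.where()-style result (row indices, column indices); anything
    else is treated as a plain list of first-axis indices."""
    if (len(iis) == 2
            and all(hasattr(x, '__len__') for x in iis)
            and len(iis[0]) == len(iis[1])):
        return 'where-like'
    return 'first-axis'

# The failing example from above, [(1, 2, 0), (2, 0, 0)]:
interpret_indexer(list(zip((1, 2), (2, 0), (0, 0))))  # classified 'where-like'
# A plain list of row indices:
interpret_indexer([0, 2, 1])  # classified 'first-axis'
```

A real implementation would still need to disambiguate a genuine 2-element list of row indices (e.g. `a[[0, 1]]`) from a where-like pair, which is exactly why the original comment hedges this as probably flawed.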
