Draft for Q&A (will be copied to ipython-notebook) pytraj devel

Should I use Trajectory class or <code class="notran

so why does frame['<a class="user-mention notranslate" data-hovercard-type="user" dat

Q&A,about amber-md/pytraj

Comments (20)

hainm commented on June 18, 2024

should I use indexing or frame_iter to get frame from Trajectory

Always try to use frame_iter for optimized speed.
Instead of doing

for i in range(1000):
    traj[i]

SHOULD do

for frame in traj:
      frame

or enumerate

for idx, frame in enumerate(traj):
    frame

for frame in traj(10, 100000, 10): # equal to range(10, 100001, 10)
    frame

Speed test

        from pytraj.testing import make_fake_traj

        # make a fake Trajectory with 1000 frames, 100K atoms
        traj = make_fake_traj(1000, 100000)
        print (traj)

        @Timer()
        def test_frame_iter_speed():
            for frame in traj:
                frame

        n_frames = traj.n_frames
        @Timer()
        def test_frame_indexing_speed():
            for i in range(n_frames):
                traj[i]

        print ("test_frame_iter_speed")
        test_frame_iter_speed() # only 0.00012 s in my computer
        print ("test_frame_indexing_speed")
        test_frame_indexing_speed() # 0.145 s (> 1000 times slower)

from pytraj.

hainm commented on June 18, 2024

Should I use `Trajectory` class or `TrajectoryIterator`?

if memory is not a big problem, you should use Trajectory class, which hold a chunks of frames in memory. Things are much faster (like frame iterator). However, if the trajectory file is too big, it's highly recommended to use TrajectoryIterator to save memory. You still can use fancy indexing for this iterator (either returning a Frame or a mutable Trajectory object).

There are many way to construct Trajectory and TrajectoryIterator. But the esiest way to remember is to use pytraj.io module with load (return Trajectory) and iterload (return TrajectoryIterator) methods.

(See also the dicussion about those classes here) (#262)

What is the different between `for frame in traj[1:11:2]` and `for frame in traj(1, 10, 2)`, where "traj" is a `TrajectoryIterator` object?

Per memory efficiency, traj(1, 10, 2) is better. This is shortcut of traj.frame_iter(start=1, stop=10, stride=2) (which is also similiar to cpptraj's command: trajin my_trajectory.nc 2 11 2. When calling traj(1, 10, 2), only one single Frame is loaded into memory.

print (traj(1, 10, 2))
<generator object frame_iter at 0x2aaac8f7daf8>

In constrast, calling traj[1:11:2], pytraj does the slicing first to create in-memory Trajectory object, and then iterate frame.

Example

print (traj[1:11:2])
<pytraj.Trajectory with 5 frames: <Topology with 1692 mols, 1704 residues, 
5293 atoms, 5300 bonds, PBC with box type = ortho>>

pytraj has operator `traj += 1.` but why does it not have `traj0 = traj1 + 1.`?

Answer: Because of memory saving. Just think about handling 2G Trajectory in memory. It's disaster if you call traj = 2 * traj + 3 * traj

traj += 1. is in place operator, no extra copy for traj.
traj0 = traj1 + 1. is equal to traj0 = traj1.copy(); traj0 += 1.

If you want to have more complicated math + memory saving, you should use apply method.

traj.apply(lambda x : 2 * x + 1.)

This is equal to

traj *= 2.
traj += 1.

from pytraj.

hainm commented on June 18, 2024

What's advantage of using "traj.apply" rather than using "numpy"?

traj.apply(lambda x : x * 2. + 1.) # where `x` is `xyz` coords of each frame

# vs
xyz = xyz * 2 + 1 # xyz is 3D-numpy array (shape=(n_frames, n_atoms, 3)

Answer

memory efficient
traj.apply perform calculation for each frame. In worst case, only a copy of xyz coords for each Frame is copied. With numpy, whole traj's xyz is copied.
speed: faster for very large system (at least 3 times faster for my tested system (500 frames, 169376 atoms)).

from pytraj.testing import make_fake_traj, Timer, duplicate_traj, aa_eq
import pytraj.io as io
from pytraj.misc import merge_trajs

# making fake large traj
traj = io.load_sample_data('tz2')[:]

for i in range(5):
    traj = merge_trajs(traj.copy(), traj.copy())

traj = duplicate_traj(traj, 50)
xyz = traj.xyz[:].copy()

print (traj)
# end of making fake traj

@Timer()
def test_apply(traj):
    traj.apply(lambda x : 2 * x + 1)

@Timer()
def test_apply_similiar_numpy(xyz):
    xyz = xyz * 2. + 1.

test_apply(traj)
test_apply_similiar_numpy(xyz)

Note: to avoid heavy copy for numpy array, use in-place operator xyz *= 2.; xyz += 1.

** Note **: speed might be different depending on available memory.

from pytraj.

hainm commented on June 18, 2024

I still want to use mutable `xyz` coords of traj, should I move on to another package?

Answer
If you still want to use numpy extensively with mutable xyz coords, you can still use pytraj.api.Trajectory class. The advantage of using this class is just like the one from mdtraj, which you can perform all numpy methods with xyz without making extra data copy. However, in order to use cpptraj's actions, pytraj will internally convert to frame (extra copying). Besides copying issue, you can fully use all cpptraj's actions + strength of numpy. Note that fancy indexing like 'traj['@ca'] is not implemented for this class.

It's trivial to convert pytraj's Trajectory and TrajectoryIterator class to this api.Trajectory class.

In [33]: from pytraj import io

In [34]: traj = io.load_sample_data('tz2')

In [35]: traj
Out[35]:
<pytraj.TrajectoryIterator with 10 frames: <Topology with 1692 mols,
1704 residues, 5293 atoms, 5300 bonds, PBC with box type = ortho>>

In [36]: from pytraj import api

In [37]: t = api.Trajectory(traj)

In [38]: t
Out[38]: Trajectory with 10 frames, 5293 atoms

# you can mutate `xyz` coords by
In [39]: t.xyz[0, 0]
Out[39]: array([ 15.55458927,  28.54844856,  17.18908691])

In [40]: t.xyz[0, 0] += 1.

In [41]: t.xyz[0, 0]
Out[41]: array([ 16.55458927,  29.54844856,  18.18908691])

# you can even use this class with `mdtraj` package (for some analyses such as rmsd)
In [43]: md.rmsd(t, t, 0)
Out[43]:
array([  0.        ,   8.28397369,   9.8498888 ,  11.27068233,
        11.86576557,  12.52786446,  13.33117199,  13.53176975,
        13.80132675,  14.00614452], dtype=float32)

# let's compare to pytraj
In [50]: t.rmsd(0)
Out[50]: array('d', [0.0, 8.283967389538908, 9.849893186468968, 11.2706769011418, 
11.865759799173082, 12.527866218740536, 13.331165639093559, 13.53176951142984, 
13.801334666412323, 14.00614642923093])

from pytraj.

hainm commented on June 18, 2024

it's too slow to convert Frame.xyz to numpy array, about 3 us for only 10 atom-Frame.

Answer
It's constant time to cast to numpy memoryview. It takes similar time to cast from 1 million-atom Frame too. This is not problem since you only need cast once without making extra data copying.

cat timing.py

from pytraj.testing import make_random_frame
import numpy as np
from timeit import timeit

def test_time(n_atoms):
    f = make_random_frame(n_atoms)
    def test():
        np.asarray(f)
    print (timeit(test, number=1000000))

for n_atoms in [10, 100, 1000, 100000, 1000000]:
    test_time(n_atoms)

Casting to numpy array-view for 1 million frames only take 6 s.
python timing.py

6.04225823841989
6.05758654139936
6.077507931739092
6.064353432506323
6.1148266065865755

from pytraj.

hainm commented on June 18, 2024

why fancy indexing in `pytraj` for Trajectory does not follow `numpy` indexing?

In [9]: traj['@CA', 5:7, 0]
Out[9]: <Frame with 20 atoms>

In [10]: traj[:, :, 0]
Out[10]: <Frame with 304 atoms>

In [11]: xyz  = traj.xyz

In [12]: xyz[:, :, 0]
Out[12]:
array([[-16.492, -15.656, -17.148, ...,  20.432,  21.025,  20.795],
       [  8.094,   7.667,   8.414, ...,   0.818,   0.859,   0.977],
       [  1.759,   1.225,   1.248, ...,   1.33 ,   2.35 ,   1.074],
       ...,
       [  8.674,   9.412,   7.767, ...,   0.167,   1.399,  -0.574],
       [ -6.327,  -6.377,  -7.054, ...,   2.357,   1.394,   2.156],
       [ -6.128,  -7.091,  -5.744, ...,  -8.726,  -8.979,  -7.769]])

Answer

because Trajectory is not a numpy array. You should use xyz attribute to get raw array (Note: expensive to do this with TrajectoryIerator)
pytraj implements very simple fancy indexing scheme. For example, traj[1:3, :, 0] is equal to traj[1:3][:][0] # from left to right.

In [13]: traj[1:3, :, 0]
Out[13]: <Frame with 304 atoms>

In [14]: traj[1:3][:][ 0]
Out[14]: <Frame with 304 atoms>

from pytraj.

hainm commented on June 18, 2024

so why does frame['@ca'] NOT return a new Frame object?

Answer
I actually don't know which one is more appropriate, whether frame['@ca'] should return a new Frame or should return xyz coords.

Current implementation of pytraj return coords for frame[mask] or frame[AtomMask_object]

In [1]: from pytraj import io

In [2]: f = io.load_sample_data ()[0]

In [3]: traj  = io.load_sample_data ()

In [4]: f = traj[0]

In [5]: f
Out[5]: <Frame with 34 atoms>

In [6]: f.set_top(traj.top)

In [7]: f['@CA']
Out[7]:
array([[  3.97004800e+00,   2.84579500e+00,  -1.00000000e-07],
       [  7.64000760e+00,   3.83858370e+00,  -7.40000000e-06],
       [  1.01610562e+01,   6.68437820e+00,   1.20000000e-06]])

# fancy indexing as a numpy array
In [10]: f['@CA', :, 2] # z-coord for all atoms
Out[10]: array([ -1.00000000e-07,  -7.40000000e-06,   1.20000000e-06])

from pytraj.

hainm commented on June 18, 2024

get wrong file size when importing pytraj' module

Answer
It's likely that you made changes to pxd file extension. The best way to do this is

remove all cythonzied *.cpp files by python clean_cythonize_files.py (in the root folder)
python setup.py install again (require having Cython)

Note: To accelerate the compiling, you can use multiprocessing, just

python ./setup.py build -faster_build
then python setup.py install
(if you don't see the speeding up with -faster_builid, just cancel (Ctrl-C) and run again).

from pytraj.

hainm commented on June 18, 2024

turn on cpptraj' log

from pytraj import set_world_silent
set_world_silent(False)

from pytraj.

hainm commented on June 18, 2024

got "undefined symbol" error

Answer you might have several "libcpptraj.so" files. Need to export/set "LD_LIBRARY_PATH" correctly.
(Note: not done this Q&A yet)

from pytraj.

hainm commented on June 18, 2024

why not directly use `pandas` or `toolz`?

Because pytraj only need several methods. So we either try to reinvent the wheel or shamelessly copy a small part of their code (all have BSD-like license).

pandas is great package but its API is very complicated with lots of methods and it aims broad users. pytraj only focus on analyzing data from MD simulation.

from pytraj.

hainm commented on June 18, 2024

list comprehensions give identical frames when iterating TrajectoryIterator

updated: fixed. no identical frames anymore.

old answer

Answer

To save memory, only a single Frame will be loaded at a time when iterating TrajectoryIterator. So
[frame for frame in traj] is just a list of frame pointers pointing to the same memory block. That's why you got identical frames.

To get different frames, use copy: [frame.copy() for frame in traj]. This should be carefully done since you might end up loading tens of Gb to memory.

from pytraj.

hainm commented on June 18, 2024

undefined symbol: _ZN14MaskTokenArray13SelectedChar_E

Answer
try again:
conda install libcpptraj --force -c pytraj
conda install pytraj-dev --force -c pytraj

and test: python -c 'import pytraj as pt; pt.run_tests()'

from pytraj.

hainm commented on June 18, 2024

Error: Could not determine trajectory format ".nc"

Answer Need to install libcpptraj with libnetcdf.
If you are using conda, you can try conda install libcpptraj --force -c pytraj

from pytraj.

hainm commented on June 18, 2024

why `pytraj.load` supports loading from url but `pytraj.iterload` does not?

Answer: pytraj.load load data to memory, so it's easy to load from url. pytraj.iterload is lazy loading, a small chunk of data will be loaded to memory if ONLY needed. So pytraj.iterload need absolute dir of trajectory file; loading from url create a file in /tmp folder.

In [4]: pt.load(url)
Out[4]:
<pytraj.Trajectory with 1 frames:
 <Topology with 30 atoms, 1 residues, 1 mols, 32 bonds, non-PBC>>

In [5]: pt.iterload(url)
Error: File 'http://ambermd.org/tutorials/basic/tutorial4/files/sustiva.mol2' 
does not exist.

from pytraj.

hainm commented on June 18, 2024

ImportError: libhdf5_hl.so.9: cannot open shared object file: No such file or directory

Answer: try conda install libnetcdf

updated: man, this is conda issue

from pytraj.

hainm commented on June 18, 2024

ValueError: Big-endian buffer not supported on little-endian compiler

Answer wrong dtype

from pytraj.

hainm commented on June 18, 2024

close. we have website now.

from pytraj.

Comments (20)

should I use indexing or frame_iter to get frame from Trajectory

Should I use Trajectory class or TrajectoryIterator?

What is the different between for frame in traj[1:11:2] and for frame in traj(1, 10, 2), where "traj" is a TrajectoryIterator object?

pytraj has operator traj += 1. but why does it not have traj0 = traj1 + 1.?

What's advantage of using "traj.apply" rather than using "numpy"?

I still want to use mutable xyz coords of traj, should I move on to another package?

it's too slow to convert Frame.xyz to numpy array, about 3 us for only 10 atom-Frame.

why fancy indexing in pytraj for Trajectory does not follow numpy indexing?

so why does frame['@ca'] NOT return a new Frame object?

get wrong file size when importing pytraj' module

turn on cpptraj' log

got "undefined symbol" error

why not directly use pandas or toolz?

list comprehensions give identical frames when iterating TrajectoryIterator

old answer

undefined symbol: _ZN14MaskTokenArray13SelectedChar_E

Error: Could not determine trajectory format ".nc"

why pytraj.load supports loading from url but pytraj.iterload does not?

ImportError: libhdf5_hl.so.9: cannot open shared object file: No such file or directory

ValueError: Big-endian buffer not supported on little-endian compiler

Related Issues (20)

Recommend Projects

Recommend Topics

Recommend Org

Should I use `Trajectory` class or `TrajectoryIterator`?

What is the different between `for frame in traj[1:11:2]` and `for frame in traj(1, 10, 2)`, where "traj" is a `TrajectoryIterator` object?

pytraj has operator `traj += 1.` but why does it not have `traj0 = traj1 + 1.`?

I still want to use mutable `xyz` coords of traj, should I move on to another package?

why fancy indexing in `pytraj` for Trajectory does not follow `numpy` indexing?

why not directly use `pandas` or `toolz`?

why `pytraj.load` supports loading from url but `pytraj.iterload` does not?