
datreant.data's Introduction

datreant: persistent, pythonic trees for heterogeneous data


In many fields of science, especially those analyzing experimental or simulation data, there is often an existing ecosystem of specialized tools and file formats which new tools must work around, for better or worse. Furthermore, centralized database solutions may be suboptimal for data storage for a number of reasons, including insufficient hardware infrastructure, the variety and heterogeneity of raw data, the need for data portability, etc. This is particularly true for fields centered around simulation: simulation systems can vary widely in size, composition, rules, parameters, and starting conditions. And with increases in computational power, it is often necessary to store intermediate results obtained from large amounts of simulation data so they can be accessed and explored interactively.

These problems make data management difficult, and serve as a barrier to answering scientific questions. To make things easier, datreant is a Python package that addresses the tedious and time-consuming logistics of intermediate data storage and retrieval. It solves a boring problem, so we can focus on interesting ones.

For more information on what datreant is and what it does, check out the official documentation.

Getting datreant

See the installation instructions for installation details. The package itself is pure Python.

If you want to work on the code, either for yourself or to contribute back to the project, clone the repository to your local machine with:

git clone https://github.com/datreant/datreant.git

Contributing

This project is still under heavy development, and there are certainly rough edges and bugs. Issues and pull requests welcome!

Check out our contributor's guide to learn how to get started with contributing back.


datreant.data's Issues

Allow globbing syntax for Data.remove().

Since datasets are stored in a directory structure, and since their names reflect this, it would be fairly easy to make deletions using globbing. This would be a great convenience when some datasets matching a pattern should be removed without removing others.

Moved from datreant/datreant#8
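Since the keys mirror the directory layout, glob matching over them is enough. A minimal sketch using the standard library's fnmatch; the Data class and its in-memory storage here are hypothetical stand-ins for datreant's data limb:

```python
import fnmatch


class Data:
    """Hypothetical stand-in for datreant's data limb."""

    def __init__(self, datasets):
        self._datasets = dict(datasets)

    def keys(self):
        return list(self._datasets)

    def remove(self, pattern):
        # Treat the argument as a glob pattern over dataset keys,
        # mirroring how keys map onto the directory structure.
        for key in fnmatch.filter(list(self._datasets), pattern):
            del self._datasets[key]


data = Data({'run-1/energy': 1, 'run-2/energy': 2, 'calibration': 3})
data.remove('run-*/energy')
print(sorted(data.keys()))  # ['calibration']
```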

pytables warns about possible performance issues with lots of columns for dataframes

When I store pandas objects with a lot of columns I get the following warning. I haven't noticed any problems with memory, though. I assume we don't need to care about this warning, since we always load the complete HDF5 file into memory and don't do any mmap tricks on it.

/home/max/conda/lib/python2.7/site-packages/tables/table.py:1039: PerformanceWarning: table ``/main/table`` is exceeding the recommended maximum number of columns (512); be ready to see PyTables asking for *lots* of memory and possibly slow I/O
  PerformanceWarning)

Should we just silence this warning as well?
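Silencing just this category is straightforward with the warnings module. The sketch below defines a local PerformanceWarning class standing in for the real tables.PerformanceWarning (with PyTables installed you would import that instead), so the example runs without PyTables:

```python
import warnings


class PerformanceWarning(Warning):
    """Stand-in for tables.PerformanceWarning."""


def store_wide_table():
    # Simulates PyTables warning about a table with more than 512 columns.
    warnings.warn("table ``/main/table`` is exceeding the recommended "
                  "maximum number of columns (512)", PerformanceWarning)


with warnings.catch_warnings(record=True) as before:
    warnings.simplefilter("always")
    store_wide_table()

with warnings.catch_warnings(record=True) as after:
    warnings.simplefilter("always")
    # Targeted ignore: only this category is silenced; others still surface.
    warnings.filterwarnings("ignore", category=PerformanceWarning)
    store_wide_table()

print(len(before), len(after))  # 1 0
```

Filtering by category rather than by message keeps the suppression robust against wording changes in PyTables.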

Add simple query for data elements in a Treant.

Although the keys for all data elements are currently displayed by default using, e.g., Treant.data, it would be useful to get a listing of data keys that match a query. This could be as simple as a Data.isin method that takes a string as input and outputs all keys in which that string is present.

Originally placed in datreant/datreant#1
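The proposed substring query could look like this; Data.isin is the hypothetical method suggested above, not an existing datreant API:

```python
class Data:
    """Hypothetical sketch of a substring query over data keys."""

    def __init__(self, keys):
        self._keys = list(keys)

    def isin(self, fragment):
        # Return every key containing the given substring.
        return [k for k in self._keys if fragment in k]


data = Data(['run-1/energy', 'run-1/forces', 'analysis/rdf'])
print(data.isin('energy'))  # ['run-1/energy']
```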

pydata

The current pickle protocol used to store pydata, like dictionaries of arrays, is not very efficient. Even YAML files are only half the size. I propose we switch to cPickle and use the highest available protocol.
See gist for test results.
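(In Python 3, cPickle's C implementation is folded into the standard pickle module.) The size difference between the oldest text-based protocol and the highest binary protocol is easy to demonstrate; the payload below is a made-up stand-in for the stored pydata:

```python
import pickle

# Dictionary of array-like data, standing in for the stored 'pydata'.
payload = {'x': list(range(1000)), 'y': [float(i) for i in range(1000)]}

old = pickle.dumps(payload, protocol=0)  # oldest, ASCII-based protocol
new = pickle.dumps(payload, protocol=pickle.HIGHEST_PROTOCOL)

print(len(new) < len(old))  # True: binary protocols are far more compact
```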

hide pytables naming NaturalNameWarning

I have a dataframe containing column names like out-99. This means I can't use them as Python attribute names in pandas, nor in pytables. Saving the treant with pytables gives an annoying NaturalNameWarning: object name is not a valid Python identifier: 'out-99_dtype' .... Since this is only handled internally by datreant, it would be good to silence those warnings.
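Because datreant never uses the natural-name access path, the warning can be suppressed locally around the store call with a catch_warnings context manager. The NaturalNameWarning class and the store helpers below are stand-ins for tables.NaturalNameWarning and datreant's internal save path, so the sketch runs without PyTables:

```python
import warnings


class NaturalNameWarning(Warning):
    """Stand-in for tables.NaturalNameWarning."""


def _pytables_store(key):
    # PyTables complains when a key such as 'out-99' is not a valid identifier.
    if not key.isidentifier():
        warnings.warn("object name is not a valid Python identifier: %r" % key,
                      NaturalNameWarning)


def store(key):
    # Suppress the warning only around the internal save, not globally.
    with warnings.catch_warnings():
        warnings.simplefilter("ignore", NaturalNameWarning)
        _pytables_store(key)


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    store("out-99")

print(len(caught))  # 0
```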

access data member via `__getattr__`

I want to access a data member with treant.data.x instead of treant.data['x']. That would add some convenience and is possible with the __getattr__ magic method.
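A minimal sketch of the idea; the Data class here is a hypothetical stand-in for datreant's data limb, layering attribute access over item access:

```python
class Data:
    """Hypothetical sketch: attribute-style access over item access."""

    def __init__(self, store):
        self._store = dict(store)

    def __getitem__(self, key):
        return self._store[key]

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails, so real
        # attributes like _store are never shadowed.
        store = self.__dict__.get('_store', {})
        if name in store:
            return store[name]
        raise AttributeError(name)


data = Data({'x': [1, 2, 3]})
print(data.x == data['x'])  # True
```

Note that keys which are not valid Python identifiers (e.g. 'out-99') would still need the item-access form.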

Problem when saving a dataframe that contains pyobjects

I'm trying to save a dataframe that contains a "series of lists" (they correspond to ionic clusters), however there is a problem with the serialization:

import datreant.core as dtr
import pandas as pd

t = dtr.Treant('/tmp/hello')
t.data['hello'] = pd.DataFrame({'lists': [[0, 1, 2], [0, 1], [10, 22]]})

TypeError: Cannot serialize the column [lists] because
its data contents are [mixed] object dtype

I found that for dataframes, the msgpack format is pretty robust and efficient, maybe we could serialize dataframes using that?

It would, however, hurt backward compatibility.

In order for package to be importable, need to make datreant namespace package

It turns out that it might not be possible to make imports of datreant and datreant.data work cleanly without explicitly making datreant behave as a namespace package. This could be done in at least two ways:

  1. Change the core package from datreant to datreant.core (or similar), to clear the namespace of datreant on its own for the namespace package magic.
  2. Make a datreant.ex (or similar) namespace package that datreant.data and any other "extension" packages like it are imported via, which changes datreant.data to datreant.ex.data.

Neither is ideal, but this will take some playing around to figure out what actually works here.

Storing numpy array with same key as existing pandas object puts both in same directory

Storing e.g. a pandas DataFrame with:

import pandas as pd

import datreant.core as dtr
import datreant.data.attach

t = dtr.Treant('spore')

t.data['a/dataframe'] = pd.DataFrame(pd.np.random.randn(100, 3))

and then storing e.g. a numpy array with the same key

t.data['a/dataframe'] = pd.np.random.randn(100, 3)  

results in two datasets getting stored with the same name, in the same place:

> t.draw()
spore/
 +-- a/
 |   +-- dataframe/
 |       +-- pdData.h5
 |       +-- npData.h5
 +-- Treant.e3bde18c-4eca-4539-8015-9520d5768c12.json

This should not be possible using the datreant.data Limbs.

AttributeError: module 'pandas' has no attribute 'Panel4D'

>>> segs_s2if['restrained_to_repulsion'][0].data['dHdl'] = get_dHdl_XVG(segs_s2if['restrained_to_repulsion'][0], lower=5000,step=200)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/hrlee/.conda/envs/alchemlyb/lib/python3.7/site-packages/datreant/data/limbs.py", line 186, in __setitem__
    self.add(handle, data)
  File "/home/hrlee/.conda/envs/alchemlyb/lib/python3.7/site-packages/datreant/data/limbs.py", line 139, in inner
    out = func(self, handle, *args, **kwargs)
  File "/home/hrlee/.conda/envs/alchemlyb/lib/python3.7/site-packages/datreant/data/limbs.py", line 218, in add
    self._datafile.add_data('main', data)
  File "/home/hrlee/.conda/envs/alchemlyb/lib/python3.7/site-packages/datreant/data/core.py", line 59, in add_data
    elif isinstance(data, (pd.Series, pd.DataFrame, pd.Panel, pd.Panel4D)):
AttributeError: module 'pandas' has no attribute 'Panel4D'

My pandas version is 0.24.1, and it seems Panel4D has been removed according to this: http://pandas.pydata.org/pandas-docs/stable/whatsnew/v0.23.0.html
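A version-tolerant fix is to build the isinstance() tuple only from the classes the installed pandas actually provides, instead of referencing pd.Panel4D unconditionally. The sketch below demonstrates the guarded-getattr pattern on a SimpleNamespace standing in for the pandas module (so it runs without pandas and without the removed classes):

```python
from types import SimpleNamespace

# Stand-in for the pandas module: newer releases dropped Panel4D (and later
# Panel), so those attributes may or may not exist.
pd = SimpleNamespace(Series=type('Series', (), {}),
                     DataFrame=type('DataFrame', (), {}))

# Collect whatever this pandas version provides; missing names are skipped.
pandas_types = tuple(
    t for t in (getattr(pd, name, None)
                for name in ('Series', 'DataFrame', 'Panel', 'Panel4D'))
    if t is not None)

print(isinstance(pd.DataFrame(), pandas_types))  # True
```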

do a groupby/filter on data keys

I have a treant with a lot of data entries called run-* and some others with different names. To get only the ones matching run-*, I currently have to collect the keys I want and then iterate over them.

run_data = (k for k in t.data.keys() if k.startswith('run-'))
for k in run_data:
    data = t.data[k]
    # ... work with data ...

It would be more convenient if I could do

for data in t.data['run-*']:
    pass

I'm not too sure about the syntax here, though. It would also mean we allow general glob patterns or regexes in the __getitem__ method. But keep in mind that in over a year of using the library this is the first time I have wanted to do something like this.
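As an interim workaround, the standard library's fnmatch already gives glob matching over a plain list of keys, without any changes to datreant:

```python
import fnmatch

# Example key listing, as might come from t.data.keys().
keys = ['run-1', 'run-2', 'equilibration', 'run-10']

run_keys = fnmatch.filter(keys, 'run-*')
print(run_keys)  # ['run-1', 'run-2', 'run-10']
```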

Release 0.6.0 checklist

Things to do before release:

  • update version in both setup.py and __init__.py (which currently doesn't have one...)
  • get docs working on readthedocs with conda
  • make docs current (#4)
  • remove dependency_links in setup.py. datreant.core should be on pypi before release
  • tie release to specific version 0.6.0 of datreant.core in setup.py; probably not necessary once datreant.core is API stable.

Keep track of last modification time

Knowing when a dataset was last modified would allow me to know whether it is up to date. For instance, it would allow me to rerun an analysis on all Treants for which the data of interest is older than the analysis script, or older than the trajectory.

Since datasets are stored in directories, there could be a JSON file in each directory holding such metadata. Alternatively, datreant.data could read the file system metadata for the HDF5 file.
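The file-system route needs no extra metadata at all: comparing modification times is enough to decide staleness. A sketch using os.path.getmtime, with temporary files standing in for a stored dataset and a trajectory:

```python
import os
import tempfile


def is_stale(data_path, source_path):
    # A dataset is stale when its file is older than the input it derives from.
    return os.path.getmtime(data_path) < os.path.getmtime(source_path)


with tempfile.TemporaryDirectory() as tmp:
    data = os.path.join(tmp, 'npData.h5')
    traj = os.path.join(tmp, 'traj.xtc')
    for path in (data, traj):
        open(path, 'w').close()
    os.utime(data, (1000, 1000))  # pretend the dataset is old
    os.utime(traj, (2000, 2000))  # the trajectory changed afterwards
    stale = is_stale(data, traj)

print(stale)  # True
```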

Direct access to data file path

Data are stored in HDF5 or pickle files. As suggested in #11, it would be convenient to have a method that returns the path to the file where a dataset is stored, given the key of that dataset.

The method would be used in that way:

t = Treant('baobab')
t.data['mydata'] = np.arange(5)
path = t.filepath('mydata')

Having simple access to the path also gives access to file system properties such as the date of last modification (see #11).

install instructions

I noticed that I forgot to add install instructions for the conda packages. I've also noticed that there are none for installing the datreant.data package with pip either. I couldn't find anything in the datreant docs.

Shouldn't they be added somewhere with docs for the data package?

Update docs

Since this is now a standalone package, it needs its own docs for what it provides. These can be somewhat cannibalized from datreant, where they originally resided.
