
datreant.data's Introduction

datreant: persistent, pythonic trees for heterogeneous data


In many fields of science, especially those analyzing experimental or simulation data, there is often an existing ecosystem of specialized tools and file formats which new tools must work around, for better or worse. Furthermore, centralized database solutions may be suboptimal for data storage for a number of reasons, including insufficient hardware infrastructure, the variety and heterogeneity of raw data, the need for data portability, etc. This is particularly true for fields centered around simulation: simulation systems can vary widely in size, composition, rules, parameters, and starting conditions. And with increases in computational power, it is often necessary to store intermediate results obtained from large amounts of simulation data so they can be accessed and explored interactively.

These problems make data management difficult, and serve as a barrier to answering scientific questions. To make things easier, datreant is a Python package that addresses the tedious and time-consuming logistics of intermediate data storage and retrieval. It solves a boring problem, so we can focus on interesting ones.

For more information on what datreant is and what it does, check out the official documentation.

Getting datreant

See the installation instructions for installation details. The package itself is pure Python.

If you want to work on the code, either for yourself or to contribute back to the project, clone the repository to your local machine with:

git clone https://github.com/datreant/datreant.git

Contributing

This project is still under heavy development, and there are certainly rough edges and bugs. Issues and pull requests welcome!

Check out our contributor's guide to learn how to get started with contributing back.


datreant.data's Issues

Allow globbing syntax for Data.remove().

Since datasets are stored in a directory structure, and since their names reflect this, it would be fairly easy to make deletions using globbing. This would be a great convenience when some datasets matching a pattern should be removed without removing others.

Moved from datreant/datreant#8
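Since the keys mirror the directory layout, glob matching over them is enough. A minimal sketch using the standard library's fnmatch; the Data class and its in-memory storage here are hypothetical stand-ins for datreant's data limb:

```python
import fnmatch


class Data:
    """Hypothetical stand-in for datreant's data limb."""

    def __init__(self, datasets):
        self._datasets = dict(datasets)

    def keys(self):
        return list(self._datasets)

    def remove(self, pattern):
        # Treat the argument as a glob pattern over dataset keys,
        # mirroring how keys map onto the directory structure.
        for key in fnmatch.filter(list(self._datasets), pattern):
            del self._datasets[key]


data = Data({'run-1/energy': 1, 'run-2/energy': 2, 'calibration': 3})
data.remove('run-*/energy')
print(sorted(data.keys()))  # ['calibration']
```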

pytables warns about possible performance issues with lots of columns for dataframes

When I store pandas objects with a lot of columns I get the following warning. I haven't noticed any problems with memory, though. I assume we don't need to care about this warning, since we always load the complete HDF5 file into memory and don't do any mmap tricks on it.

/home/max/conda/lib/python2.7/site-packages/tables/table.py:1039: PerformanceWarning: table ``/main/table`` is exceeding the recommended maximum number of columns (512); be ready to see PyTables asking for *lots* of memory and possibly slow I/O
  PerformanceWarning)

Should we just silence this warning as well?
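Silencing just this category is straightforward with the warnings module. The sketch below defines a local PerformanceWarning class standing in for the real tables.PerformanceWarning (with PyTables installed you would import that instead), so the example runs without PyTables:

```python
import warnings


class PerformanceWarning(Warning):
    """Stand-in for tables.PerformanceWarning."""


def store_wide_table():
    # Simulates PyTables warning about a table with more than 512 columns.
    warnings.warn("table ``/main/table`` is exceeding the recommended "
                  "maximum number of columns (512)", PerformanceWarning)


with warnings.catch_warnings(record=True) as before:
    warnings.simplefilter("always")
    store_wide_table()

with warnings.catch_warnings(record=True) as after:
    warnings.simplefilter("always")
    # Targeted ignore: only this category is silenced; others still surface.
    warnings.filterwarnings("ignore", category=PerformanceWarning)
    store_wide_table()

print(len(before), len(after))  # 1 0
```

Filtering by category rather than by message keeps the suppression robust against wording changes in PyTables.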

Add simple query for data elements in a Treant.

Although the keys for all data elements are currently displayed by default using, e.g., Treant.data, it would be useful to get a listing of data keys that match a query. This could be as simple as a Data.isin method that takes a string as input and outputs all keys in which that string is present.

Originally placed in datreant/datreant#1
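The proposed substring query could look like this; Data.isin is the hypothetical method suggested above, not an existing datreant API:

```python
class Data:
    """Hypothetical sketch of a substring query over data keys."""

    def __init__(self, keys):
        self._keys = list(keys)

    def isin(self, fragment):
        # Return every key containing the given substring.
        return [k for k in self._keys if fragment in k]


data = Data(['run-1/energy', 'run-1/forces', 'analysis/rdf'])
print(data.isin('energy'))  # ['run-1/energy']
```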

pydata

The current pickle protocol used to store pydata, like dictionaries of arrays, is not very efficient. Even YAML files are only half the size. I propose we switch to cPickle and use the highest available protocol.
See gist for test results.
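(In Python 3, cPickle's C implementation is folded into the standard pickle module.) The size difference between the oldest text-based protocol and the highest binary protocol is easy to demonstrate; the payload below is a made-up stand-in for the stored pydata:

```python
import pickle

# Dictionary of array-like data, standing in for the stored 'pydata'.
payload = {'x': list(range(1000)), 'y': [float(i) for i in range(1000)]}

old = pickle.dumps(payload, protocol=0)  # oldest, ASCII-based protocol
new = pickle.dumps(payload, protocol=pickle.HIGHEST_PROTOCOL)

print(len(new) < len(old))  # True: binary protocols are far more compact
```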

hide pytables naming NaturalNameWarning

I have a dataframe containing column names like out-99. This means I can't use them as Python attribute names in pandas, nor in pytables. Saving the treant with pytables gives an annoying NaturalNameWarning: object name is not a valid Python identifier: 'out-99_dtype' .... Since this is only handled internally by datreant, it would be good to silence those warnings.
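Because datreant never uses the natural-name access path, the warning can be suppressed locally around the store call with a catch_warnings context manager. The NaturalNameWarning class and the store helpers below are stand-ins for tables.NaturalNameWarning and datreant's internal save path, so the sketch runs without PyTables:

```python
import warnings


class NaturalNameWarning(Warning):
    """Stand-in for tables.NaturalNameWarning."""


def _pytables_store(key):
    # PyTables complains when a key such as 'out-99' is not a valid identifier.
    if not key.isidentifier():
        warnings.warn("object name is not a valid Python identifier: %r" % key,
                      NaturalNameWarning)


def store(key):
    # Suppress the warning only around the internal save, not globally.
    with warnings.catch_warnings():
        warnings.simplefilter("ignore", NaturalNameWarning)
        _pytables_store(key)


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    store("out-99")

print(len(caught))  # 0
```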

access data member via `__getattr__`

I want to access a data member with treant.data.x instead of treant.data['x']. That would add some convenience and is possible with the __getattr__ magic method.
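A minimal sketch of the idea; the Data class here is a hypothetical stand-in for datreant's data limb, layering attribute access over item access:

```python
class Data:
    """Hypothetical sketch: attribute-style access over item access."""

    def __init__(self, store):
        self._store = dict(store)

    def __getitem__(self, key):
        return self._store[key]

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails, so real
        # attributes like _store are never shadowed.
        store = self.__dict__.get('_store', {})
        if name in store:
            return store[name]
        raise AttributeError(name)


data = Data({'x': [1, 2, 3]})
print(data.x == data['x'])  # True
```

Note that keys which are not valid Python identifiers (e.g. 'out-99') would still need the item-access form.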

Problem when saving a dataframe that contains pyobjects

I'm trying to save a dataframe that contains a "series of lists" (they correspond to ionic clusters), however there is a problem with the serialization:

import datreant.core as dtr
import pandas as pd

t = dtr.Treant('/tmp/hello')
t.data['hello'] = pd.DataFrame({'lists': [[0, 1, 2], [0, 1], [10, 22]]})

TypeError: Cannot serialize the column [lists] because
its data contents are [mixed] object dtype

I found that for dataframes, the msgpack format is pretty robust and efficient, maybe we could serialize dataframes using that?

It would, however, hurt backward compatibility.

In order for package to be importable, need to make datreant namespace package

It turns out that it might not be possible to make imports of datreant and datreant.data work cleanly without explicitly making datreant behave as a namespace package. This could be done in at least two ways:

  1. Change the core package from datreant to datreant.core (or similar), to clear the namespace of datreant on its own for the namespace package magic.
  2. Make a datreant.ex (or similar) namespace package that datreant.data and any other "extension" packages like it are imported via, which changes datreant.data to datreant.ex.data.

Neither is ideal, but this will take some playing around to figure out what actually works here.

Storing numpy array with same key as existing pandas object puts both in same directory

Storing e.g. a pandas DataFrame with:

import pandas as pd

import datreant.core as dtr
import datreant.data.attach

t = dtr.Treant('spore')

t.data['a/dataframe'] = pd.DataFrame(pd.np.random.randn(100, 3))

and then storing e.g. a numpy array with the same key

t.data['a/dataframe'] = pd.np.random.randn(100, 3)  

results in two datasets getting stored with the same name, in the same place:

> t.draw()
spore/
 +-- a/
 |   +-- dataframe/
 |       +-- pdData.h5
 |       +-- npData.h5
 +-- Treant.e3bde18c-4eca-4539-8015-9520d5768c12.json

This should not be possible using the datreant.data Limbs.

AttributeError: module 'pandas' has no attribute 'Panel4D'

>>> segs_s2if['restrained_to_repulsion'][0].data['dHdl'] = get_dHdl_XVG(segs_s2if['restrained_to_repulsion'][0], lower=5000,step=200)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/hrlee/.conda/envs/alchemlyb/lib/python3.7/site-packages/datreant/data/limbs.py", line 186, in __setitem__
    self.add(handle, data)
  File "/home/hrlee/.conda/envs/alchemlyb/lib/python3.7/site-packages/datreant/data/limbs.py", line 139, in inner
    out = func(self, handle, *args, **kwargs)
  File "/home/hrlee/.conda/envs/alchemlyb/lib/python3.7/site-packages/datreant/data/limbs.py", line 218, in add
    self._datafile.add_data('main', data)
  File "/home/hrlee/.conda/envs/alchemlyb/lib/python3.7/site-packages/datreant/data/core.py", line 59, in add_data
    elif isinstance(data, (pd.Series, pd.DataFrame, pd.Panel, pd.Panel4D)):
AttributeError: module 'pandas' has no attribute 'Panel4D'

My pandas version is 0.24.1, and it seems Panel4D has been removed according to this: http://pandas.pydata.org/pandas-docs/stable/whatsnew/v0.23.0.html
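A version-tolerant fix is to build the isinstance() tuple only from the classes the installed pandas actually provides, instead of referencing pd.Panel4D unconditionally. The sketch below demonstrates the guarded-getattr pattern on a SimpleNamespace standing in for the pandas module (so it runs without pandas and without the removed classes):

```python
from types import SimpleNamespace

# Stand-in for the pandas module: newer releases dropped Panel4D (and later
# Panel), so those attributes may or may not exist.
pd = SimpleNamespace(Series=type('Series', (), {}),
                     DataFrame=type('DataFrame', (), {}))

# Collect whatever this pandas version provides; missing names are skipped.
pandas_types = tuple(
    t for t in (getattr(pd, name, None)
                for name in ('Series', 'DataFrame', 'Panel', 'Panel4D'))
    if t is not None)

print(isinstance(pd.DataFrame(), pandas_types))  # True
```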

do a groupby/filter on data keys

I have a treant with a lot of data entries called run-* and some others with different names. To get only the ones matching run-*, I currently have to collect the keys I want and then iterate over them.

run_data = (k for k in t.data.keys() if k.startswith('run-'))
for k in run_data:
    data = t.data[k]
    # ... work with data ...

It would be more convenient if I could do

for data in t.data['run-*']:
    pass

I'm not too sure about the syntax here, though. It would also mean we allow general glob patterns or regexes in the __getitem__ method. But keep in mind that in over a year of using the library this is the first time I have wanted to do something like this.
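As an interim workaround, the standard library's fnmatch already gives glob matching over a plain list of keys, without any changes to datreant:

```python
import fnmatch

# Example key listing, as might come from t.data.keys().
keys = ['run-1', 'run-2', 'equilibration', 'run-10']

run_keys = fnmatch.filter(keys, 'run-*')
print(run_keys)  # ['run-1', 'run-2', 'run-10']
```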

Release 0.6.0 checklist

Things to do before release:

  • update version in both setup.py and __init__.py (which currently doesn't have one...)
  • get docs working on readthedocs with conda
  • make docs current (#4)
  • remove dependency_links in setup.py. datreant.core should be on pypi before release
  • tie release to specific version 0.6.0 of datreant.core in setup.py; probably not necessary once datreant.core is API stable.

Keep track of last modification time

Knowing when a dataset was last modified would allow me to know whether it is up to date. For instance, it would allow me to rerun an analysis on all Treants for which the data of interest is older than the analysis script, or older than the trajectory.

Since datasets are stored in directories, there could be a JSON file in each directory holding such metadata. Alternatively, datreant.data could read the file system metadata for the HDF5 file.
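The file-system route needs no extra metadata at all: comparing modification times is enough to decide staleness. A sketch using os.path.getmtime, with temporary files standing in for a stored dataset and a trajectory:

```python
import os
import tempfile


def is_stale(data_path, source_path):
    # A dataset is stale when its file is older than the input it derives from.
    return os.path.getmtime(data_path) < os.path.getmtime(source_path)


with tempfile.TemporaryDirectory() as tmp:
    data = os.path.join(tmp, 'npData.h5')
    traj = os.path.join(tmp, 'traj.xtc')
    for path in (data, traj):
        open(path, 'w').close()
    os.utime(data, (1000, 1000))  # pretend the dataset is old
    os.utime(traj, (2000, 2000))  # the trajectory changed afterwards
    stale = is_stale(data, traj)

print(stale)  # True
```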

Direct access to data file path

Data are stored in HDF5 or pickle files. As suggested in #11, it would be convenient to have a method that returns the path to the file where a dataset is stored, given the key of that dataset.

The method would be used in that way:

t = Treant('baobab')
t.data['mydata'] = np.arange(5)
path = t.filepath('mydata')

Having simple access to the path also gives access to file system properties such as the date of last modification (see #11).

install instructions

I noticed that I forgot to add install instructions for the conda packages. I've also noticed that there are none for installing the datreant.data package with pip either. I couldn't find anything in the datreant docs.

Shouldn't they be added somewhere with docs for the data package?

Update docs

Since this is now a standalone package, it needs its own docs for what it provides. These can be somewhat cannibalized from datreant, where they originally resided.
