
deepdish's Introduction


deepdish

Flexible HDF5 saving/loading and other data science tools from the University of Chicago. This repository also hosts a Deep Learning blog.

Installation

pip install deepdish

Alternatively (if you have conda with the conda-forge channel):

conda install -c conda-forge deepdish

Main feature

The primary feature of deepdish is its ability to save and load all kinds of data as HDF5. It can save any Python data structure, offering the same ease of use as pickling or numpy.save. However, it improves on them by also offering:

  • Interoperability between languages (HDF5 is a popular standard)
  • Easy to inspect the content from the command line (using h5ls or our specialized tool ddls)
  • Highly compressed storage (thanks to a PyTables backend)
  • Native support for scipy sparse matrices and pandas DataFrame, Series and Panel
  • Ability to partially read files, even slices of arrays

An example:

import numpy as np
import deepdish as dd

d = {
    'foo': np.ones((10, 20)),
    'sub': {
        'bar': 'a string',
        'baz': 1.23,
    },
}
dd.io.save('test.h5', d)

This can be reconstructed using dd.io.load('test.h5'), or inspected through the command line using either a standard tool:

$ h5ls test.h5
foo                      Dataset {10, 20}
sub                      Group

Or, better yet, our custom tool ddls (or python -m deepdish.io.ls):

$ ddls test.h5
/foo                       array (10, 20) [float64]
/sub                       dict
/sub/bar                   'a string' (8) [unicode]
/sub/baz                   1.23 [float64]
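Because the data lives in a regular HDF5 hierarchy, files can also be read back partially. A minimal sketch, assuming the group-path and sel/dd.aslice arguments described in the io documentation:

import deepdish as dd

# Load only one group; the rest of the file is never read
sub = dd.io.load('test.h5', '/sub')              # {'bar': 'a string', 'baz': 1.23}

# Load a slice of a large array without reading the whole dataset
rows = dd.io.load('test.h5', '/foo', sel=dd.aslice[:5])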

Read more at Saving and loading data.

Documentation

deepdish's People

Contributors

craffel, gustavla, lukedeo, markstoehr, nbgit10, ranr01, raphaelquast, slwatkins, twmacro


deepdish's Issues

Elegant way to combine multiple h5 files into a single h5 file

Hi! Thank you for making this tool available for researchers. It is extremely helpful for data scientists and deep learning researchers. So, my problem is that I have multiple h5 files with the data as a dictionary with integer keys (0,1,...). I just want to combine these into a single file with integer keys.
For example, d1 = {0: ['abc'], 1: ['def']} and d2 = {0: ['qwe'], 1: ['rty']};
I want d3 = {0: ['abc'], 1: ['def'], 2: ['qwe'], 3: ['rty']}.
Is there any way to do this without loading the dictionaries into RAM and writing a script to change the keys of the dictionaries? Thank you!
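One possible approach (a sketch, not an official deepdish feature): load each source file one at a time with dd.io.load, renumber the keys, and write a single combined file. The combined dictionary still has to fit in memory, since dd.io.save has no append mode; a fully out-of-core merge would need the PyTables API directly. File names here are hypothetical.

import deepdish as dd

sources = ['d1.h5', 'd2.h5']      # hypothetical input files
combined, next_key = {}, 0

for path in sources:
    d = dd.io.load(path)          # load one source file at a time
    for key in sorted(d):         # integer keys 0, 1, ... within each file
        combined[next_key] = d[key]
        next_key += 1

dd.io.save('d3.h5', combined)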

Incompatibility with pandas v1.2.0

Hi guys,

Thanks for the incredibly useful piece of software!

The recent release of v1.2.0 of pandas seems to have broken compatibility with deepdish. Things work perfectly with pandas v1.1.5 and deepdish v0.3.6, however upgrading to pandas v1.2.0 produces the following error:

File "/Users/adam/anaconda3/lib/python3.8/site-packages/deepdish/io/hdf5io.py", line 583, in save
_save_level(h5file, group, value, name=key,
File "/Users/adam/anaconda3/lib/python3.8/site-packages/deepdish/io/hdf5io.py", line 211, in _save_level
_save_level(handler, new_group, v, name=k, filters=filters,
File "/Users/adam/anaconda3/lib/python3.8/site-packages/deepdish/io/hdf5io.py", line 211, in _save_level
_save_level(handler, new_group, v, name=k, filters=filters,
File "/Users/adam/anaconda3/lib/python3.8/site-packages/deepdish/io/hdf5io.py", line 251, in _save_level
elif _pandas and isinstance(level, (pd.DataFrame, pd.Series, pd.Panel)):
File "/Users/adam/anaconda3/lib/python3.8/site-packages/pandas/init.py", line 244, in getattr
raise AttributeError(f"module 'pandas' has no attribute '{name}'")
AttributeError: module 'pandas' has no attribute 'Panel'

It looks like pandas.Panel has been removed from the latest release. Presumably all that needs to be done is removing references to pd.Panel from hdf5io.py? I'd be happy to submit this as a pull request if you agree this is the right course of action.
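For reference, one way the check could be made version-tolerant (a sketch, not necessarily the fix the maintainers would choose): build the tuple of pandas types once at import time and include Panel only when the installed pandas still exposes it.

import pandas as pd

# pd.Panel is gone in recent pandas releases; include it only if it exists.
_pandas_types = tuple(
    t for t in (pd.DataFrame, pd.Series, getattr(pd, 'Panel', None))
    if t is not None
)

# ...and in _save_level-style code:
# elif _pandas and isinstance(level, _pandas_types):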

Cheers,
Adam

Brief printing

I am thinking about writing a function that replaces print or __repr__ for inspecting large and possibly unknown variables.

The problem

You have a variable, possibly several levels of nested containers (lists of dictionaries of arrays, or what have you), and you print it to standard output. You get a deluge of output and gain little impression of the data. Numpy will abbreviate large arrays, which is good; Python, however, will not do the same for its lists and dictionaries.

The solution

I want to add a printing function to deepdish that will try to intelligently give you a summary of a variable. For instance, a list of arrays should yield only the shapes of the arrays. Lists and dictionaries should be abridged if too long. The user should have an option for the maximum length of the output (it should guarantee no surprises). The default should be set so that the full output is visible in a typical terminal.

I think this fits nicely with deepdish, since we're all about data processing and it could be really useful for ddls -i (inspect HDF5 group directly from the command line).
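A rough sketch of the kind of summarizer I have in mind (names, thresholds, and output format are placeholders, not a final API):

import numpy as np

def summarize(obj, depth=0, max_items=5, max_depth=3, indent='  '):
    """Print a short, depth-limited summary of a nested container."""
    pad = indent * depth
    if isinstance(obj, np.ndarray):
        print('{}array {} [{}]'.format(pad, obj.shape, obj.dtype))
    elif isinstance(obj, dict):
        print('{}dict ({} keys)'.format(pad, len(obj)))
        if depth < max_depth:
            for k in list(obj)[:max_items]:
                print('{}{}:'.format(pad + indent, k))
                summarize(obj[k], depth + 2, max_items, max_depth, indent)
            if len(obj) > max_items:
                print('{}...'.format(pad + indent))
    elif isinstance(obj, (list, tuple)):
        print('{}{} of length {}'.format(pad, type(obj).__name__, len(obj)))
    else:
        print('{}{!r} [{}]'.format(pad, obj, type(obj).__name__))

summarize({'foo': np.ones((10, 20)), 'sub': {'bar': 'a string', 'baz': 1.23}})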

Thoughts:

  • Color support.
  • Ability to limit depth.
  • Possibly a Jupyter notebook HTML version.
  • Good support for built-ins, numpy, scipy, pandas.
  • A version of this could be available for replacing dir(x) or x.__dict__, which usually lists a lot of unnecessary crud. Also, it would be nice to know something about each member variable's type.

Any input will be welcome!

UnicodeDecodeError when using Python 2.7

Hello,

I'm using deepdish to save a dictionary that contains unicode strings as keys and numpy arrays (corresponding to computed embeddings) as values. Here is a small example to reproduce the exception:

import deepdish as dd
import numpy as np
d = {'foo': np.ones((10, 20)), 'sub': {'bar': 'a string', 'é': 1.23}}
dd.io.save('test.h5', d)

And the raised exception is:

/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/tables/path.py:112: NaturalNameWarning: object name is not a valid Python identifier: '\xc3\xa9'; it does not match the pattern ``^[a-zA-Z_][a-zA-Z0-9_]*$``; you will not be able to use natural naming to access this object; using ``getattr()`` will still work, though
  NaturalNameWarning)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/deepdish/io/hdf5io.py", line 584, in save
    filters=filters, idtable=idtable)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/deepdish/io/hdf5io.py", line 212, in _save_level
    idtable=idtable)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/deepdish/io/hdf5io.py", line 297, in _save_level
    setattr(group._v_attrs, name, level)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/tables/attributeset.py", line 481, in __setattr__
    self._g__setattr(name, value)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/tables/attributeset.py", line 423, in _g__setattr
    self._g_setattr(self._v_node, name, stvalue)
  File "tables/hdf5extension.pyx", line 658, in tables.hdf5extension.AttributeSet._g_setattr (tables/hdf5extension.c:7458)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)

Any idea how to overcome this problem?

Thanks!

AssertionError

The spots dictionary is of mixed types, e.g. dictionaries, pandas DataFrames, numbers, etc.

[screenshot of the traceback]

Speed issue

Hi guys!

I just came across deepdish and I really love it, thank you very much for this great work!

However, I've noticed that there is a speed problem compared to h5py. Here is a very simple piece of code that shows it:

import numpy as np
import deepdish as dd
import h5py
import time

# Create some random data
data = np.random.rand(100000, 128, 8)

# Deepdish
start_dd = time.time()
dd.io.save('test.h5', {'data': data})
finish_dd = time.time()

# H5py
start_h5py = time.time()
hf = h5py.File('test2.h5', 'w')
hf.create_dataset('data', data=data)
hf.close()
finish_h5py = time.time()

print('Time deepdish = %.2f' % (finish_dd - start_dd))
print('Time h5py = %.2f' % (finish_h5py - start_h5py))

On my computer:
- Time deepdish = 18.53 s
- Time h5py = 4.72 s

I know that the strong point of deepdish is its capability to save complex data (dicts and so on), and perhaps performance is not the key goal here. But still, I think it would be great to achieve similar speed, especially when no complex data is involved.
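One thing worth checking (a sketch using the documented compression argument of dd.io.save): how much of the gap comes from the default blosc compression, since the h5py call above writes an uncompressed dataset.

import time
import numpy as np
import deepdish as dd

data = np.random.rand(100000, 128, 8)

# Compare like with like: disable compression so both writers do raw I/O
start = time.time()
dd.io.save('test_nocomp.h5', {'data': data}, compression=None)
print('Time deepdish (no compression) = %.2f' % (time.time() - start))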

What do you think? Could it be done with some optimization?

Cheers,
Eduardo

Saving Class instances not working when registry is not called

Hi, thanks for this cool library.
The following example I took from your readthedocs (https://deepdish.readthedocs.io/en/latest/io.html#class-instances):

import deepdish as dd

class Foo(dd.util.SaveableRegistry):
    def __init__(self, x):
        self.x = x

    @classmethod
    def load_from_dict(self, d):
        obj = Foo(d['x'])
        return obj

    def save_to_dict(self):
        return {'x': self.x}


if __name__ == '__main__':
    f = Foo(10)
    f.save('foo.h5')
    f = Foo.load('foo.h5')

This is a more minimal example because there is no class 'Bar' that inherits from Foo, and therefore the @Foo.register('bar') decorator is never called. This leads to the following traceback:

/Users/jorenretel/bin/miniconda3/envs/abstract_classifier/bin/python /Users/jorenretel/Library/Preferences/PyCharm2019.3/scratches/scratch_3.py
/Users/jorenretel/bin/miniconda3/envs/abstract_classifier/lib/python3.8/site-packages/deepdish/io/hdf5io.py:246: FutureWarning: The Panel class is removed from pandas. Accessing it from the top-level namespace will also be removed in the next version
  elif _pandas and isinstance(level, (pd.DataFrame, pd.Series, pd.Panel)):
Traceback (most recent call last):
  File "/Users/jorenretel/Library/Preferences/PyCharm2019.3/scratches/scratch_3.py", line 20, in <module>
    f = Foo.load('foo.h5')
  File "/Users/jorenretel/bin/miniconda3/envs/abstract_classifier/lib/python3.8/site-packages/deepdish/util/saveable.py", line 162, in load
    return cls.getclass(class_name).load_from_dict(d)
  File "/Users/jorenretel/bin/miniconda3/envs/abstract_classifier/lib/python3.8/site-packages/deepdish/util/saveable.py", line 121, in getclass
    return cls.REGISTRY[name]
KeyError: 'noname'

Process finished with exit code 1

The problem is that this function in deepdish/util/saveable.py never gets overloaded when register is not called:

    @property
    def name(self):
        """Returns the name of the registry entry."""
        # Automatically overloaded by 'register'
        return "noname"

Possible solution: return None instead of 'noname'? I am not sure whether this has some side effect that I am not aware of.

Overflow Error When Attempting to Save Large Amounts of Data

I have been using deepdish to save dictionaries with large amounts of data. I ran into the following issue when attempting to save a particularly large file. I have tried saving the data with and without compression, if that helps. Can you help me out with it please?

File "C:/Users/xxxxxxxx/Documents/Python_Scripts/Data_Scripts/Finalized_Data_Review_Presentations/data_save_cc_test.py", line 513, in
dd.io.save('%s/Data/%s_%s_cc_data.h5'%(directory,m_list[m],list_type),cc_data,('blosc', 9))

File "C:\Users\xxxxxxxx\AppData\Local\Continuum\anaconda2\lib\site-packages\deepdish\io\hdf5io.py", line 596, in save
filters=filters, idtable=idtable)

File "C:\Users\xxxxxxxx\AppData\Local\Continuum\anaconda2\lib\site-packages\deepdish\io\hdf5io.py", line 304, in _save_level
_save_pickled(handler, group, level, name=name)

File "C:\Users\xxxxxxxx\AppData\Local\Continuum\anaconda2\lib\site-packages\deepdish\io\hdf5io.py", line 172, in _save_pickled
node.append(level)

File "C:\Users\xxxxxxxx\AppData\Local\Continuum\anaconda2\lib\site-packages\tables\vlarray.py", line 547, in append
self._append(nparr, nobjects)

File "tables/hdf5extension.pyx", line 2032, in tables.hdf5extension.VLArray._append

OverflowError: Python int too large to convert to C long

Error while installing deepdish

I'm trying to create an LMDB database file to be used with Caffe according to this tutorial on an Ubuntu 14.04 machine using Anaconda Python 2.7.9. However, when I do pip install deepdish, I'm getting the following error:

Collecting deepdish
  Using cached deepdish-0.1.4.tar.gz
    Complete output from command python setup.py egg_info:
    Traceback (most recent call last):
      File "<string>", line 20, in <module>
      File "/tmp/pip-build-qKwOBx/deepdish/setup.py", line 12, in <module>
        with open('requirements.txt') as f:
    IOError: [Errno 2] No such file or directory: 'requirements.txt'

    ----------------------------------------
Command "python setup.py egg_info" failed with error code 1 in /tmp/pip-build-qKwOBx/deepdish

Any ideas why this error might be occurring and how to go about correcting it? Any help is much appreciated. Thank you.

ValueError when trying to save numpy scalar arrays (ndim = 0)

import numpy as np
import deepdish
deepdish.io.save('test.h5', np.array(0.))

results in

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)

deepdish/io/hdf5io.pyc in save(path, data, compression)
    458
    459         else:
--> 460             _save_level(h5file, group, data, name='data', filters=filters)
    461             # Mark this to automatically unpack when loaded
    462             group._v_attrs[DEEPDISH_IO_UNPACK] = True

deepdish/io/hdf5io.pyc in _save_level(handler, group, level, name, filters)
    184
    185     elif isinstance(level, np.ndarray):
--> 186         _save_ndarray(handler, group, name, level, filters=filters)
    187
    188     elif _pandas and isinstance(level, (pd.DataFrame, pd.Series, pd.Panel)):

deepdish/io/hdf5io.pyc in _save_ndarray(handler, group, name, x, filters)
    112         strtype = None
    113         itemsize = None
--> 114     assert np.min(x.shape) > 0, ("deepdish.io.save does not support saving "
    115                                  "numpy arrays with a zero-length axis")
    116     # For small arrays, compression actually leads to larger files, so we are

numpy/core/fromnumeric.pyc in amin(a, axis, out, keepdims)
   2217         except AttributeError:
   2218             return _methods._amin(a, axis=axis,
-> 2219                                 out=out, keepdims=keepdims)
   2220         # NOTE: Dropping the keepdims parameter
   2221         return amin(axis=axis, out=out)

numpy/core/_methods.pyc in _amin(a, axis, out, keepdims)
     27
     28 def _amin(a, axis=None, out=None, keepdims=False):
---> 29     return umr_minimum(a, axis, None, out, keepdims)
     30
     31 def _sum(a, axis=None, dtype=None, out=None, keepdims=False):

ValueError: zero-size array to reduction operation minimum which has no identity

Is saving numpy scalar arrays (ndim = 0) supported? If not, I think you could change the test to
assert x.ndim > 0 and np.min(x.shape) > 0. The first condition would fail for numpy scalar arrays, so (by short-circuiting) the second wouldn't be evaluated. If numpy scalar arrays aren't supported, I'm curious why, and whether that's functionality that could be added. Finally, and separately, you should arguably be doing

if not (x.ndim > 0 and np.min(x.shape) > 0):
    raise ValueError(...)

instead of an assert, see e.g. here, but that's a separate discussion! Thank you again for the excellent library!
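Until that's decided, a possible workaround (a sketch): promote the 0-d array to a length-1 array before saving and index the scalar back out after loading.

import numpy as np
import deepdish as dd

x = np.array(0.)                          # 0-d array that currently fails
dd.io.save('test.h5', np.atleast_1d(x))   # saved with shape (1,)

x_back = dd.io.load('test.h5')[0]         # recover the scalar value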

Crashes when dealing with large datasets

I am trying to use deepdish to store/restore large datasets in the HDF5 format, but deepdish.io.save crashes every time the dataset is larger than about 2GB.

For example, suppose we have a very large array:
t=bytearray(8*1000*1000*400)
when I try:
dd.io.save('testeDeepdishLimit',t)
I get the error:

---------------------------------------------------------------------------
OverflowError                             Traceback (most recent call last)
<ipython-input-3-26ecd71b151a> in <module>()
----> 1 dd.io.save('testeDeepdishLimit',t)

~/anaconda3/lib/python3.6/site-packages/deepdish/io/hdf5io.py in save(path, data, compression)
    594         else:
    595             _save_level(h5file, group, data, name='data',
--> 596                         filters=filters, idtable=idtable)
    597             # Mark this to automatically unpack when loaded
    598             group._v_attrs[DEEPDISH_IO_UNPACK] = True

~/anaconda3/lib/python3.6/site-packages/deepdish/io/hdf5io.py in _save_level(handler, group, level, name, filters, idtable)
    302 
    303     else:
--> 304         _save_pickled(handler, group, level, name=name)
    305 
    306 

~/anaconda3/lib/python3.6/site-packages/deepdish/io/hdf5io.py in _save_pickled(handler, group, level, name)
    170                   DeprecationWarning)
    171     node = handler.create_vlarray(group, name, tables.ObjectAtom())
--> 172     node.append(level)
    173 
    174 

~/anaconda3/lib/python3.6/site-packages/tables/vlarray.py in append(self, sequence)
    535             nparr = None
    536 
--> 537         self._append(nparr, nobjects)
    538         self.nrows += 1
    539 

tables/hdf5extension.pyx in tables.hdf5extension.VLArray._append()

OverflowError: value too large to convert to int

Is there any workaround for this issue?

Issues with saving files with German Umlaut

When saving files with deepdish, I cannot save in folders (files are ok) containing German umlaut:

import deepdish as dd
# works
dd.io.save(r"D:\Tmp\aeoeue\Test_öäü.h5", {'a': list(range(10))})
# does not work
dd.io.save(r"D:\Tmp\äöü\Test_öäü.h5", {'a': list(range(10))})

Resulting in this error:

---------------------------------------------------------------------------
HDF5ExtError                              Traceback (most recent call last)
<ipython-input-10-c262766a8bba> in <module>()
      2 dd.io.save(r"D:\Tmp\aeoeue\Test_öäü.h5", {'a': list(range(10))})
      3 # does not work
----> 4 dd.io.save(r"D:\Tmp\äöü\Test_öäü.h5", {'a': list(range(10))})

~\AppData\Local\Continuum\anaconda3\lib\site-packages\deepdish\io\hdf5io.py in save(path, data, compression)
    571     filters = _get_compression_filters(compression)
    572 
--> 573     with tables.open_file(path, mode='w') as h5file:
    574         # If the data is a dictionary, put it flatly in the root
    575         group = h5file.root

~\AppData\Local\Continuum\anaconda3\lib\site-packages\tables\file.py in open_file(filename, mode, title, root_uep, filters, **kwargs)
    318 
    319     # Finally, create the File instance, and return it
--> 320     return File(filename, mode, title, root_uep, filters, **kwargs)
    321 
    322 

~\AppData\Local\Continuum\anaconda3\lib\site-packages\tables\file.py in __init__(self, filename, mode, title, root_uep, filters, **kwargs)
    782 
    783         # Now, it is time to initialize the File extension
--> 784         self._g_new(filename, mode, **params)
    785 
    786         # Check filters and set PyTables format version for new files.

tables\hdf5extension.pyx in tables.hdf5extension.File._g_new()

HDF5ExtError: HDF5 error back trace

  File "C:\ci\hdf5_1525883595717\work\src\H5F.c", line 445, in H5Fcreate
    unable to create file
  File "C:\ci\hdf5_1525883595717\work\src\H5Fint.c", line 1461, in H5F_open
    unable to open file: time = Thu May  2 10:57:31 2019
, name = 'D:\Tmp\äöü\Test_öäü.h5', tent_flags = 13
  File "C:\ci\hdf5_1525883595717\work\src\H5FD.c", line 733, in H5FD_open
    open failed
  File "C:\ci\hdf5_1525883595717\work\src\H5FDsec2.c", line 346, in H5FD_sec2_open
    unable to open file: name = 'D:\Tmp\äöü\Test_öäü.h5', errno = 2, error message = 'No such file or directory', flags = 13, o_flags = 302

End of HDF5 error back trace

Unable to open/create file 'D:\Tmp\äöü\Test_öäü.h5'

As far as I can understand it is an issue with PyTables / h5py?
But I am just using deepdish.

Cannot load .h5 file created with deepdish.io.save

I saved a huge dataset (6 gb) with deepdish.io.save. It can be read with ddls. But when I'm trying to load it with d = deepdish.io.load(path_to_dataset), I'm getting an empty dictionary.

python 3.5, deepdish 0.3.6

Blosc compression not widely supported

The blosc compression (provided by the PyTables backend) is a pretty clear winner in terms of speed.

However, it is not widely supported, so deepdish-saved files won't be loadable through for instance Matlab or h5py. This severely damages the otherwise great portability of HDF5 and can especially be confusing for new users.
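In the meantime, users who care about portability can opt in to zlib on a per-call basis. A minimal sketch, assuming the compression argument accepts a complib name or a (complib, level) tuple as in the current docs:

import numpy as np
import deepdish as dd

d = {'foo': np.ones((1000, 1000))}

# zlib-compressed HDF5 can be read by h5py, MATLAB, and most other tools
dd.io.save('portable.h5', d, compression='zlib')
# or with an explicit compression level:
dd.io.save('portable9.h5', d, compression=('zlib', 9))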

I am considering the following solution:

  • Make zlib the default compression method. This is much more widely supported (both Matlab and h5py support it, as an example). The compression rate is even slightly better than blosc, but unfortunately it is much slower.
  • Add a deepdish config file (maybe .deepdish.conf or .ddrc) where the default compression can be changed. That way, users that care more about speed than portability can opt-in to blosc without having to specify it every time.

Hanging file refs if exception occurs during hdf5io.load/hdf5io.save

Hey,

Thanks for the hdf5io module. It's really great and easy to use.
One issue I encountered while using it is that if an error occurs during save/load the file is left opened and cannot be reopened.

I solved this by simply putting the open-file operation within a context manager, i.e.:

with tables.open_file(path, mode='w') as h5file:
    # do stuff

This way the file is always closed when the function returns.

Thanks again for the module. Really useful.

Cheers,
Ran

Preserving structure when re-saving Keras HDF5 files

In some cases, opening an HDF5 file, editing it, and then re-saving it, can cause detrimental changes to the format. Something that was an attribute might have turned into a group, or vice versa. It would also be nice if it preserved the compression choices of the original file.

I have noticed problems with this for instance with a Keras model saved to HDF5.

This is not a trivial problem to solve and requires storing metadata that may give clues to how it should be saved. It can also be addressed by allowing edits without a full load and re-save, which relates to #22.

dd.io.save crashes while saving np.array of objects

dd.io.save crashes when you try to save np.array with dtype=object

Ubuntu 14.04 x64
Python 2.7.11 (default, Dec 15 2015, 16:46:19)
[GCC 4.8.4]

In [1]: np.__version__
Out[2]: '1.11.2'
In [2]: dd.__version__
Out[2]: '0.3.4'
In [3]: tables.__version__
Out[3]: '3.2.2'


In [10]: x = np.array(['123', 567, 'hjjjk'], dtype='O')

In [11]: x
Out[11]: array(['123', 567, 'hjjjk'], dtype=object)

In [12]: dd.io.save('t.h5', x)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-12-dc094ba588e4> in <module>()
----> 1 dd.io.save('t.h5', x)

/export/home/asanakoy/.local/lib/python2.7/site-packages/deepdish/io/hdf5io.pyc in save(path, data, compression)
    579         else:
    580             _save_level(h5file, group, data, name='data',
--> 581                         filters=filters, idtable=idtable)
    582             # Mark this to automatically unpack when loaded
    583             group._v_attrs[DEEPDISH_IO_UNPACK] = True

/export/home/asanakoy/.local/lib/python2.7/site-packages/deepdish/io/hdf5io.pyc in _save_level(handler, group, level, name, filters, idtable)
    242 
    243     elif isinstance(level, np.ndarray):
--> 244         _save_ndarray(handler, group, name, level, filters=filters)
    245 
    246     elif _pandas and isinstance(level, (pd.DataFrame, pd.Series, pd.Panel)):

/export/home/asanakoy/.local/lib/python2.7/site-packages/deepdish/io/hdf5io.pyc in _save_ndarray(handler, group, name, x, filters)
    123         atom = tables.StringAtom(itemsize)
    124     else:
--> 125         atom = tables.Atom.from_dtype(x.dtype)
    126         strtype = None
    127         itemsize = None

/export/home/asanakoy/.local/lib/python2.7/site-packages/tables/atom.pyc in from_dtype(class_, dtype, dflt)
    377             return class_.from_kind('string', itemsize, dtype.shape, dflt)
    378         # Most NumPy types have direct correspondence with PyTables types.
--> 379         return class_.from_type(basedtype.name, dtype.shape, dflt)
    380 
    381     @classmethod

/export/home/asanakoy/.local/lib/python2.7/site-packages/tables/atom.pyc in from_type(class_, type, shape, dflt)
    402 
    403         if type not in all_types:
--> 404             raise ValueError("unknown type: %r" % (type,))
    405         kind, itemsize = split_type(type)
    406         return class_.from_kind(kind, itemsize, shape, dflt)

ValueError: unknown type: 'object'

Appending values to a numpy array in an HDF5 file created be deepdish.io

Hi, I like the deepdish.io interface to HDF5, and would like to use it to store numpy arrays on the go, while a simulation is running. Is it possible to append values to a certain numpy array that I've saved?

That is, I would like to create a dictionary of this sort - {u1: numpy_array1, u2: numpy_array2}, in which each numpy_array is, for example, a (128, 128) array. I would like to append another (128, 128) array along a third axis at each iteration of my code.
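dd.io.save itself has no append mode, but the PyTables layer underneath does support extendable arrays; a possible sketch outside of deepdish (file and node names are hypothetical):

import numpy as np
import tables

with tables.open_file('simulation.h5', mode='w') as f:
    # EArray whose first axis can grow; each append adds one 128x128 frame
    u1 = f.create_earray(f.root, 'u1', atom=tables.Float64Atom(),
                         shape=(0, 128, 128))
    for step in range(10):
        frame = np.random.rand(128, 128)   # stand-in for one simulation step
        u1.append(frame[np.newaxis, ...])

The resulting file is plain HDF5 and can still be inspected with ddls or h5py, though it is not in the exact layout dd.io.save would produce.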

loading scalar values into a dictionary from hdf5 groups

First, I would like to thank you for writing such a nice package for interfacing with hdf5 files.

I have been using HDF5 files to store data, and I realized that I could use your package to load data from HDF5 files that weren't created by it. The only issue was loading scalar values that are stored directly inside HDF5 groups into a dictionary. It appears the deepdish io package expects scalars to be stored as numpy arrays of length 1. However, I added a couple of lines to the '_load_nonlink_level' method so that if there is a scalar instead of an array, the value of the scalar is returned.

I don't think your code needs to be modified. But I was thinking that other people might find the tweak useful if they have HDF5 files created in a different way but would like to use your package to load them. I've attached the 'hdf5io.py' that I modified. The lines I added are at line 439.

hdf5io.py.txt

how to cite weighted cross entropy

Hi DD

Thanks a lot for your weighted cross-entropy solution. By the way, do you have a BibTeX entry or citation note I could use to mention it in my report? Thanks again.

Sagar

Need an API to be able to control the PyTables file object, in case you don't want to save to disk

I have a use-case where I don't want to save the resulting HDF5 data to local disk. Rather, I want to handle the produced data with a stream pattern.

In my specific case, this is so I can save the HDF5 data directly to an AWS S3 bucket in a resource-efficient way, but I imagine there are other situations where people may want to have control of the file handle. It would be good to have an extended API which supports this.
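Until such an API exists, one possible stopgap (a sketch relying on PyTables' in-memory core driver rather than anything deepdish exposes today): build the HDF5 image in memory and stream the resulting bytes wherever you like, e.g. to S3.

import numpy as np
import tables

# Build the file entirely in memory (no local backing store is written)
with tables.open_file('in_memory.h5', mode='w',
                      driver='H5FD_CORE',
                      driver_core_backing_store=0) as f:
    f.create_array(f.root, 'data', np.random.rand(100, 100))
    image = f.get_file_image()    # bytes of a complete HDF5 file

# `image` can now be uploaded, e.g. with boto3's put_object (hypothetical use)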

Save MATLAB dataset (.mat file) in HDF5 format

Hello,

I have loaded a MATLAB dataset into Python using scipy.io.loadmat(), and have the result saved in a dictionary named mat.

I would like to save this dictionary in HDF5 format, so I can load it into Python when I run the program again. On trying dd.io.save('test.h5', mat), I get the following error:

ValueError: compound data types are not supported: dtype([('ddm', 'O'), ('nfold', 'O'), ('cv_par1', 'O'), ('cv_par2', 'O'), ('cl', 'O')])

Is there any way to fix this, so I can save the MATLAB dataset in HDF5 format?
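One approach that may help (a hedged sketch, not a deepdish feature): unpack the MATLAB struct's record array into a plain dict of field arrays before saving, since the compound dtype is what deepdish rejects. Nested object arrays may still need further flattening; the .mat file name below is hypothetical.

import numpy as np
import scipy.io
import deepdish as dd

mat = scipy.io.loadmat('data.mat')     # hypothetical .mat file

def unpack_records(value):
    """Turn MATLAB struct record arrays into plain dicts, recursively."""
    if isinstance(value, np.ndarray) and value.dtype.names:
        return {name: unpack_records(value[name]) for name in value.dtype.names}
    return value

clean = {k: unpack_records(v) for k, v in mat.items() if not k.startswith('__')}
dd.io.save('test.h5', clean)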

I'd greatly appreciate any help.

Thank you,

`save` function with mode 'a'

Currently dd.io.save overwrites the target file if it exists. Would it be a good idea to add a mode argument so that the target file can be incrementally updated as new data arrives?

AttributeError when saving a Pandas object with Pandas 0.24.0 or greater

When saving a pd.Series, pd.DataFrame, or pd.Panel to HDF5 using deepdish, an AttributeError is raised, and I cannot save the file. I've tracked down the issue, and it's due to a change in Pandas version 0.24.0.

Here is how I've been able to reproduce the error, where I have installed Pandas 0.24.2, Numpy 1.15.4, deepdish 0.3.6, and PyTables 3.5.1.

import pandas as pd
import numpy as np
import deepdish as dd

dd.io.save("test.h5", {"test" : pd.Series(data=np.random.rand(1))}, )

The error returned is:

---------------------------------------------------------------------------
NoSuchNodeError                           Traceback (most recent call last)
/galbascratch/samwatkins/anaconda3/lib/python3.7/site-packages/pandas/io/pytables.py in get_node(self, key)
   1159                 key = '/' + key
-> 1160             return self._handle.get_node(self.root, key)
   1161         except _table_mod.exceptions.NoSuchNodeError:

/galbascratch/samwatkins/anaconda3/lib/python3.7/site-packages/tables/file.py in get_node(self, where, name, classname)
   1643             nodepath = join_path(basepath, name or '') or '/'
-> 1644             node = where._v_file._get_node(nodepath)
   1645         elif isinstance(where, (six.string_types, numpy.str_)):

/galbascratch/samwatkins/anaconda3/lib/python3.7/site-packages/tables/file.py in _get_node(self, nodepath)
   1598 
-> 1599         node = self._node_manager.get_node(nodepath)
   1600         assert node is not None, "unable to instantiate node ``%s``" % nodepath

/galbascratch/samwatkins/anaconda3/lib/python3.7/site-packages/tables/file.py in get_node(self, key)
    436         if self.node_factory:
--> 437             node = self.node_factory(key)
    438             self.cache_node(node, key)

/galbascratch/samwatkins/anaconda3/lib/python3.7/site-packages/tables/group.py in _g_load_child(self, childname)
   1180         # Is the node a group or a leaf?
-> 1181         node_type = self._g_check_has_child(childname)
   1182 

/galbascratch/samwatkins/anaconda3/lib/python3.7/site-packages/tables/group.py in _g_check_has_child(self, name)
    397                 "group ``%s`` does not have a child named ``%s``"
--> 398                 % (self._v_pathname, name))
    399         return node_type

NoSuchNodeError: group ``/`` does not have a child named ``//test``

During handling of the above exception, another exception occurred:

AttributeError                            Traceback (most recent call last)
<ipython-input-2-60c97adec230> in <module>
----> 1 dd.io.save("test4.h5", {"test" : pd.Series(data=np.random.rand(1))}, )

~/.local/lib/python3.7/site-packages/deepdish-0.3.4-py3.7.egg/deepdish/io/hdf5io.py in save(path, data, compression)
    587             for key, value in data.items():
    588                 _save_level(h5file, group, value, name=key,
--> 589                             filters=filters, idtable=idtable)
    590 
    591         elif (_sns and isinstance(data, SimpleNamespace) and

~/.local/lib/python3.7/site-packages/deepdish-0.3.4-py3.7.egg/deepdish/io/hdf5io.py in _save_level(handler, group, level, name, filters, idtable)
    256         store = _HDFStoreWithHandle(handler)
    257 #         print(store.get_node(group._v_pathname))
--> 258         store.append(group._v_pathname + '/' + name, level)
    259 
    260     elif isinstance(level, (sparse.dok_matrix,

/galbascratch/samwatkins/anaconda3/lib/python3.7/site-packages/pandas/io/pytables.py in append(self, key, value, format, append, columns, dropna, **kwargs)
    984         kwargs = self._validate_format(format, kwargs)
    985         self._write_to_group(key, value, append=append, dropna=dropna,
--> 986                              **kwargs)
    987 
    988     def append_to_multiple(self, d, value, selector, data_columns=None,

/galbascratch/samwatkins/anaconda3/lib/python3.7/site-packages/pandas/io/pytables.py in _write_to_group(self, key, value, format, index, append, complib, encoding, **kwargs)
   1365     def _write_to_group(self, key, value, format, index=True, append=False,
   1366                         complib=None, encoding=None, **kwargs):
-> 1367         group = self.get_node(key)
   1368 
   1369         # remove the node if we are not appending

/galbascratch/samwatkins/anaconda3/lib/python3.7/site-packages/pandas/io/pytables.py in get_node(self, key)
   1159                 key = '/' + key
   1160             return self._handle.get_node(self.root, key)
-> 1161         except _table_mod.exceptions.NoSuchNodeError:
   1162             return None
   1163 

AttributeError: 'NoneType' object has no attribute 'exceptions'

From the above, we see that the _table_mod variable is None, which is throwing the error. The reason that this is now an error is related to pandas-dev/pandas#22919, where the exception in HDFStore.get_node was changed from a bare exception to a specific exception.

Before: https://github.com/pandas-dev/pandas/blob/2d0c96119391c85bd4f7ffbb847759ee3777162a/pandas/io/pytables.py#L1157-L1165

After: https://github.com/pandas-dev/pandas/blob/master/pandas/io/pytables.py#L1141-L1149

So, now the _table_mod variable is used to only return None in the case that the exception is a NoSuchNodeError, rather than any error. However, _table_mod should be set by running of the function pandas.io.pytables._tables, which imports PyTables into the namespace as _table_mod. If this function is not run, then _table_mod is left as None, and the above AttributeError occurs.

The problem is that in deepdish's use of pandas.io.pytables.HDFStore, where there's a wrapper of the function called _HDFStoreWithHandle, none of the methods that call the _tables function are called, and _table_mod is left as None, which gives us the AttributeError.

My proposed solution is to add one line to the beginning of the hdf5io.py file in deepdish, where we call pandas.io.pytables._tables().

Before:

from __future__ import division, print_function, absolute_import
import numpy as np
import tables
import warnings
from scipy import sparse
from deepdish import conf
try:
    import pandas as pd
    _pandas = True
except ImportError:
    _pandas = False

After:

from __future__ import division, print_function, absolute_import

import numpy as np
import tables
import warnings
from scipy import sparse
from deepdish import conf
try:
    import pandas as pd
    pd.io.pytables._tables()
    _pandas = True
except ImportError:
    _pandas = False

After making this change, I no longer get the AttributeError and the saving of Pandas data types works seamlessly.

Got error when I tried to load the data from matlab

In matlab:

h5disp('/home/test.h5')
HDF5 test.h5
Group '/'
    Attributes:
        'TITLE':  ''
        'CLASS':  'GROUP'
        'VERSION':  '1.0'
        'PYTABLES_FORMAT_VERSION':  '2.1'
        'DEEPDISH_IO_VERSION':  8
    Dataset 'data'
        Size:  32x32x3x100
        MaxSize:  32x32x3x100
        Datatype:  H5T_IEEE_F64LE (double)
        ChunkSize:  32x32x3x2
        Filters:  unrecognized filter (blosc)
        Attributes:
            'CLASS':  'CARRAY'
            'VERSION':  '1.1'
            'TITLE':  ''
    Dataset 'label'
        Size:  100
        MaxSize:  100
        Datatype:  H5T_IEEE_F64LE (double)
        ChunkSize:  []
        Filters:  none
        FillValue:  0.000000
        Attributes:
            'CLASS':  'ARRAY'
            'VERSION':  '2.4'
            'TITLE':  ''
            'FLAVOR':  'numpy'

thisdata = h5read('/home/test.h5', '/data');
Error using h5readc
The HDF5 library encountered an error and produced the following stack trace information:

H5PL__find         can't open directory
H5PL_load          search in paths failed
H5Z_pipeline       required filter 'blosc' is not registered
H5D__chunk_lock    data pipeline read failed
H5D__chunk_read    unable to read raw data chunk
H5D__read          can't read data
H5Dread            can't read data

Error in h5read (line 58)
[data,var_class] = h5readc(Filename,Dataset,start,count,stride);

Update conda-forge version to match pypi

Thanks for the great package! Not sure if this is the best place to make this request, but is there any chance you can update the version of deepdish on conda-forge to match the latest version on PyPI? It looks like the latest version there is 0.3.4, which sometimes breaks when saving objects; this is resolved in 0.3.6.

There's a user-submitted build on Anaconda Cloud in case that's helpful for the update: https://anaconda.org/turbach/deepdish, but it would be nice for it to be updated on conda-forge for ease of installing packages that depend on deepdish.
