hdf5storage's Introduction

Overview

This Python package provides high-level utilities to read/write a variety of Python types to/from HDF5 (Hierarchical Data Format) files. This package also provides support for MATLAB MAT v7.3 files, which are just HDF5 files with a different extension and some extra metadata.

All of this is done without pickling data. Pickling is bad for security because unpickling can execute arbitrary code in the interpreter. One wants to be able to read HDF5 and MAT files from possibly untrusted sources, so pickling is avoided in this package.
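The security concern can be illustrated with the standard library alone. The sketch below is not part of this package; the `Payload` class is a hypothetical name, and the expression passed to `eval` stands in for whatever code a malicious file might carry.

```python
import pickle


class Payload:
    """Hypothetical object whose pickle asks the loader to call eval()."""

    def __reduce__(self):
        # Instead of object state, pickle records "call eval('6 * 7')".
        # An attacker could put any expression here.
        return (eval, ("6 * 7",))


blob = pickle.dumps(Payload())

# Merely loading the bytes runs the embedded call; the reader never
# asked for eval() to be invoked.
result = pickle.loads(blob)
print(result)
```

Because the data itself chooses which callable runs, no amount of care on the reading side makes unpickling untrusted input safe.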

The package's documentation is found at http://pythonhosted.org/hdf5storage/

The package's source code is found at https://github.com/frejanordsiek/hdf5storage

The package is licensed under a 2-clause BSD license (https://github.com/frejanordsiek/hdf5storage/blob/master/COPYING.txt).

Installation

This package only supports Python >= 2.6.

This package requires the numpy and h5py (>= 2.1) packages. An optional dependency is the scipy package.
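Before installing, it can help to see which of these dependencies are already importable. The following is a minimal sketch using only the standard library; the helper name `find_missing` is made up for this example, and it does not check the h5py version requirement.

```python
import importlib.util

REQUIRED = ("numpy", "h5py")  # required, per the text above
OPTIONAL = ("scipy",)         # optional, per the text above


def find_missing(names):
    """Return the names from `names` that cannot be imported here."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing_required = find_missing(REQUIRED)
missing_optional = find_missing(OPTIONAL)
print("missing required:", missing_required)
print("missing optional:", missing_optional)
```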

To install hdf5storage, download the package and run the following command on Python 3:

python3 setup.py install

or the following command on Python 2:

python setup.py install

Python 2

This package was designed and written for Python 3, with Python 2.7 and 2.6 support added later. This does mean that a few things are a little clunky in Python 2. Examples include requiring unicode keys for dictionaries, the int and long types both being mapped to the Python 3 int type, etc. The storage format's metadata looks more familiar from a Python 3 standpoint as well.

The documentation is written in terms of Python 3 syntax and types primarily. Important Python 2 information beyond direct translations of syntax and types will be pointed out.

Hierarchical Data Format 5 (HDF5)

HDF5 (see http://www.hdfgroup.org/HDF5/) is a commonly used file format for exchanging numerical data. It has built-in support for a large variety of data types (signed and unsigned integers, floating point numbers, strings, etc.) as scalars and arrays, as well as enums and compound types. It also handles differences in data representation on different hardware platforms (endianness, different floating point formats, etc.). As can be imagined from the name, data is represented in an HDF5 file in a hierarchical form modelled on a Unix filesystem (Datasets are equivalent to files, Groups are equivalent to directories, and links are supported).
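Since object names in an HDF5 file are slash-separated like Unix paths, the standard library's `posixpath` module can take them apart. This is only an illustration of the filesystem analogy; the path used here is invented for the example.

```python
import posixpath

# A hypothetical location of a Dataset inside an HDF5 file.
full_name = "/experiment/run1/voltage"

# Split into the containing Group and the object's own name,
# just as one would split a Unix file path.
group, name = posixpath.split(full_name)

# Build the path of a sibling object inside the same Group.
sibling = posixpath.join(group, "current")

print(group, name, sibling)
```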

This package interfaces with HDF5 files using the h5py package (http://www.h5py.org/) as opposed to the PyTables package (http://www.pytables.org/).

MATLAB MAT v7.3 file support

MATLAB (http://www.mathworks.com/) MAT files version 7.3 and later are HDF5 files with a different file extension (.mat) and a very specific set of meta-data and storage conventions. This package provides read and write support for a limited set of Python and MATLAB types.

SciPy (http://scipy.org/) has functions to read and write the older MAT file formats. This package has functions modeled after scipy.io.savemat and scipy.io.loadmat, with the same names and similar arguments. They dispatch to the SciPy versions if the MAT file format is not an HDF5-based one.
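One way to tell an HDF5-based MAT file from an older one is to look for the HDF5 signature, which the HDF5 format places at byte offset 0, 512, 1024, 2048, and so on. The sketch below is not how this package makes its decision; `looks_like_hdf5` is a hypothetical helper, and the two files it writes are fake placeholders, not real MAT files.

```python
import os
import tempfile

HDF5_SIGNATURE = b"\x89HDF\r\n\x1a\n"


def looks_like_hdf5(path, max_offset=1 << 20):
    """Return True if an HDF5 signature is found at offset 0, 512, 1024, ..."""
    with open(path, "rb") as f:
        offset = 0
        while offset <= max_offset:
            f.seek(offset)
            chunk = f.read(len(HDF5_SIGNATURE))
            if len(chunk) < len(HDF5_SIGNATURE):
                return False  # ran past the end of the file
            if chunk == HDF5_SIGNATURE:
                return True
            offset = 512 if offset == 0 else offset * 2
    return False


with tempfile.TemporaryDirectory() as tmp:
    # Placeholder with a 512-byte header followed by the HDF5 signature.
    v73_like = os.path.join(tmp, "v73_like.mat")
    with open(v73_like, "wb") as f:
        f.write(b"placeholder header".ljust(512, b" "))
        f.write(HDF5_SIGNATURE)

    # Placeholder with no HDF5 signature anywhere.
    old_style = os.path.join(tmp, "old_style.mat")
    with open(old_style, "wb") as f:
        f.write(b"placeholder header".ljust(1024, b" "))

    is_v73 = looks_like_hdf5(v73_like)
    is_old = looks_like_hdf5(old_style)

print(is_v73, is_old)
```

A check like this reads only a few bytes per probed offset, so it is cheap to run before choosing which reader to try.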

Supported Types

The supported Python and MATLAB types are given in the tables below. The tables assume that one has imported collections and numpy as:

import collections as cl
import numpy as np

The table gives which Python types can be read and written, the first version of this package to support each one, the numpy type it gets converted to for storage (if type information is not written, that is what it will be read back as), the MATLAB class it becomes if targeting a MAT file, and the first version of this package to support writing it so that MATLAB can read it.

+---------------+---------+-------------------------+-------------+---------+------------------+
| Python                                            | MATLAB                | Notes            |
+---------------+---------+-------------------------+-------------+---------+------------------+
| Type          | Version | Converted to            | Class       | Version |                  |
+===============+=========+=========================+=============+=========+==================+
| bool          | 0.1     | np.bool_ or np.uint8    | logical     | 0.1     | 1                |
+---------------+---------+-------------------------+-------------+---------+------------------+
| None          | 0.1     | np.float64([])          | []          | 0.1     |                  |
+---------------+---------+-------------------------+-------------+---------+------------------+
| int           | 0.1     | np.int64 or np.bytes_   | int64       | 0.1     | 2, 3             |
+---------------+---------+-------------------------+-------------+---------+------------------+
| long          | 0.1     | np.int64 or np.bytes_   | int64       | 0.1     | 4, 5             |
+---------------+---------+-------------------------+-------------+---------+------------------+
| float         | 0.1     | np.float64              | double      | 0.1     |                  |
+---------------+---------+-------------------------+-------------+---------+------------------+
| complex       | 0.1     | np.complex128           | double      | 0.1     |                  |
+---------------+---------+-------------------------+-------------+---------+------------------+
| str           | 0.1     | np.uint32/16            | char        | 0.1     | 6                |
+---------------+---------+-------------------------+-------------+---------+------------------+
| bytes         | 0.1     | np.bytes_ or np.uint16  | char        | 0.1     | 7                |
+---------------+---------+-------------------------+-------------+---------+------------------+
| bytearray     | 0.1     | np.bytes_ or np.uint16  | char        | 0.1     | 8                |
+---------------+---------+-------------------------+-------------+---------+------------------+
| list          | 0.1     | np.object_              | cell        | 0.1     |                  |
+---------------+---------+-------------------------+-------------+---------+------------------+
| tuple         | 0.1     | np.object_              | cell        | 0.1     |                  |
+---------------+---------+-------------------------+-------------+---------+------------------+
| set           | 0.1     | np.object_              | cell        | 0.1     |                  |
+---------------+---------+-------------------------+-------------+---------+------------------+
| frozenset     | 0.1     | np.object_              | cell        | 0.1     |                  |
+---------------+---------+-------------------------+-------------+---------+------------------+
| cl.deque      | 0.1     | np.object_              | cell        | 0.1     |                  |
+---------------+---------+-------------------------+-------------+---------+------------------+
| dict          | 0.1     |                         | struct      | 0.1     | 9                |
+---------------+---------+-------------------------+-------------+---------+------------------+
| np.bool_      | 0.1     |                         | logical     | 0.1     |                  |
+---------------+---------+-------------------------+-------------+---------+------------------+
| np.void       | 0.1     |                         |             |         |                  |
+---------------+---------+-------------------------+-------------+---------+------------------+
| np.uint8      | 0.1     |                         | uint8       | 0.1     |                  |
+---------------+---------+-------------------------+-------------+---------+------------------+
| np.uint16     | 0.1     |                         | uint16      | 0.1     |                  |
+---------------+---------+-------------------------+-------------+---------+------------------+
| np.uint32     | 0.1     |                         | uint32      | 0.1     |                  |
+---------------+---------+-------------------------+-------------+---------+------------------+
| np.uint64     | 0.1     |                         | uint64      | 0.1     |                  |
+---------------+---------+-------------------------+-------------+---------+------------------+
| np.int8       | 0.1     |                         | int8        | 0.1     |                  |
+---------------+---------+-------------------------+-------------+---------+------------------+
| np.int16      | 0.1     |                         | int16       | 0.1     |                  |
+---------------+---------+-------------------------+-------------+---------+------------------+
| np.int32      | 0.1     |                         | int32       | 0.1     |                  |
+---------------+---------+-------------------------+-------------+---------+------------------+
| np.int64      | 0.1     |                         | int64       | 0.1     |                  |
+---------------+---------+-------------------------+-------------+---------+------------------+
| np.float16    | 0.1     |                         |             |         | 10               |
+---------------+---------+-------------------------+-------------+---------+------------------+
| np.float32    | 0.1     |                         | single      | 0.1     |                  |
+---------------+---------+-------------------------+-------------+---------+------------------+
| np.float64    | 0.1     |                         | double      | 0.1     |                  |
+---------------+---------+-------------------------+-------------+---------+------------------+
| np.complex64  | 0.1     |                         | single      | 0.1     |                  |
+---------------+---------+-------------------------+-------------+---------+------------------+
| np.complex128 | 0.1     |                         | double      | 0.1     |                  |
+---------------+---------+-------------------------+-------------+---------+------------------+
| np.str_       | 0.1     | np.uint32/16            | char/uint32 | 0.1     | 11               |
+---------------+---------+-------------------------+-------------+---------+------------------+
| np.bytes_     | 0.1     | np.bytes_ or np.uint16  | char        | 0.1     | 12               |
+---------------+---------+-------------------------+-------------+---------+------------------+
| np.object_    | 0.1     |                         | cell        | 0.1     |                  |
+---------------+---------+-------------------------+-------------+---------+------------------+
| np.ndarray    | 0.1     | see notes               | see notes   | 0.1     | 13, 14, 15       |
+---------------+---------+-------------------------+-------------+---------+------------------+
| np.matrix     | 0.1     | see notes               | see notes   | 0.1     | 16               |
+---------------+---------+-------------------------+-------------+---------+------------------+
| np.chararray  | 0.1     | see notes               | see notes   | 0.1     | 17               |
+---------------+---------+-------------------------+-------------+---------+------------------+
| np.recarray   | 0.1     | structured np.ndarray   | see notes   | 0.1     | 18, 19           |
+---------------+---------+-------------------------+-------------+---------+------------------+

This table gives the MATLAB classes that can be read from a MAT file, the first version of this package that can read them, and the Python type they are read as.

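+-----------------+---------+-----------------------------+-------+
| MATLAB Class    | Version | Python Type                 | Notes |
+=================+=========+=============================+=======+
| logical         | 0.1     | np.bool_                    |       |
+-----------------+---------+-----------------------------+-------+
| single          | 0.1     | np.float32 or np.complex64  | 20    |
+-----------------+---------+-----------------------------+-------+
| double          | 0.1     | np.float64 or np.complex128 | 21    |
+-----------------+---------+-----------------------------+-------+
| uint8           | 0.1     | np.uint8                    |       |
+-----------------+---------+-----------------------------+-------+
| uint16          | 0.1     | np.uint16                   |       |
+-----------------+---------+-----------------------------+-------+
| uint32          | 0.1     | np.uint32                   |       |
+-----------------+---------+-----------------------------+-------+
| uint64          | 0.1     | np.uint64                   |       |
+-----------------+---------+-----------------------------+-------+
| int8            | 0.1     | np.int8                     |       |
+-----------------+---------+-----------------------------+-------+
| int16           | 0.1     | np.int16                    |       |
+-----------------+---------+-----------------------------+-------+
| int32           | 0.1     | np.int32                    |       |
+-----------------+---------+-----------------------------+-------+
| int64           | 0.1     | np.int64                    |       |
+-----------------+---------+-----------------------------+-------+
| char            | 0.1     | np.str_                     |       |
+-----------------+---------+-----------------------------+-------+
| struct          | 0.1     | structured np.ndarray       |       |
+-----------------+---------+-----------------------------+-------+
| cell            | 0.1     | np.object_                  |       |
+-----------------+---------+-----------------------------+-------+
| canonical empty | 0.1     | np.float64([])              |       |
+-----------------+---------+-----------------------------+-------+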

Versions

0.2. Feature release adding the following.
  • Ability to write Python 3.x int and Python 2.x long that are too large to fit into np.int64. Doing so no longer raises an exception.
  • Ability to write np.bytes_ with non-ASCII characters in them. Doing so no longer raises an exception.
0.1.9. Bugfix and minor feature release doing the following.
  • Issue #23. Fixed bug where a structured np.ndarray with a field name of 'O' could never be written as an HDF5 COMPOUND Dataset (falsely thought a field's dtype was object).
  • Issue #6. Added optional data compression and the storage of data checksums. Controlled by several new options.
0.1.8. Bugfix release fixing the following two bugs.
  • Issue #21. Fixed bug where the 'MATLAB_class' Attribute is not set when writing dict types when writing MATLAB metadata.
  • Issue #22. Fixed bug where null characters ('\x00') and forward slashes ('/') were allowed in dict keys and the field names of structured np.ndarray (except that forward slashes are allowed when the structured_numpy_ndarray_as_struct is not set as is the case when the matlab_compatible option is set). These cause problems for the h5py package and the HDF5 library. NotImplementedError is now thrown in these cases.
0.1.7. Bugfix release with an added compatibility option and some added test code. Did the following.
  • Fixed an issue reading variables larger than 2 GB in MATLAB MAT v7.3 files when no explicit variable names to read are given to hdf5storage.loadmat. Fix also reduces memory consumption and processing time a little bit by removing an unneeded memory copy.
  • Options now accepts and ignores any additional keyword arguments it doesn't support, to be API compatible with future package versions with added options.
  • Added tests for reading data that has been compressed or had other HDF5 filters applied.

0.1.6. Bugfix release fixing a bug with determining the maximum size of a Python 2.x int on a 32-bit system.

0.1.5. Bugfix release fixing the following bugs.
  • Fixed bug where an int could be stored that is too big to fit into an int when read back in Python 2.x. When it is too big, it is converted to a long.
  • Fixed a bug where an int or long that is too big to fit into an np.int64 raised the wrong exception.
  • Fixed bug where field names for structured np.ndarray with non-ASCII characters (assumed to be UTF-8 encoded in Python 2.x) can't be read or written properly.
  • Fixed bug where np.bytes_ with non-ASCII characters were converted incorrectly to UTF-16 when that option is set (set implicitly when doing MATLAB compatibility). Now, it throws a NotImplementedError.
0.1.4. Bugfix release fixing the following bugs. Thanks go to mrdomino for writing the bug fixes.
  • Fixed bug where dtype is used as a keyword parameter of np.ndarray.astype when it is a positional argument.
  • Fixed error caused by h5py.__version__ being absent on Ubuntu 12.04.
0.1.3. Bugfix release fixing the following bug.
  • Fixed broken ability to correctly read and write empty structured np.ndarray (has fields).
0.1.2. Bugfix release fixing the following bugs.
  • Removed mistaken support for np.float16 for h5py versions before 2.2 since that was when support for it was introduced.
  • Structured np.ndarray where one or more fields is of the 'object' dtype can now be written without an error when the structured_numpy_ndarray_as_struct option is not set. They are written as an HDF5 Group, as if the option was set.
  • Support for the 'MATLAB_fields' Attribute for data types that are structures in MATLAB has been added for when the version of the h5py package being used is 2.3 or greater. Support is still missing for earlier versions (this package requires a minimum version of 2.1).
  • The check for non-unicode string keys (str in Python 3 and unicode in Python 2) in the type dict is done right before any changes are made to the HDF5 file instead of in the middle so that no changes are applied if an invalid key is present.
  • The HDF5 userblock is now set with the proper metadata for MATLAB support right at the beginning of writing data to an HDF5 file instead of at the end, meaning the writing can crash and the file will still be a valid MATLAB file.
0.1.1. Bugfix release fixing the following bugs.
  • str is now written like numpy.str_ instead of numpy.bytes_.
  • Complex numbers where the real or imaginary part is nan but the other part is not are now read correctly, as opposed to setting both parts to nan.
  • Fixed bugs in string conversions on Python 2 resulting from str.decode() and unicode.encode() not taking the same keyword arguments as in Python 3.
  • MATLAB structure arrays can now be read without producing an error on Python 2.
  • numpy.str_ now written as numpy.uint16 on Python 2 if the convert_numpy_str_to_utf16 option is set and the conversion can be done without using UTF-16 doublets, instead of always writing them as numpy.uint32.

0.1. Initial version.
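The 0.2 entry above mentions integers too large for np.int64, which the notes describe as being stored as a decimal string in an np.bytes_. The idea can be sketched in plain Python as follows. This is an illustration of the rule only, not the package's code; `encode_int` and `decode_int` are hypothetical names, and the tuple tags stand in for the numpy types.

```python
INT64_MIN = -(2 ** 63)
INT64_MAX = 2 ** 63 - 1


def encode_int(value):
    """Return ('int64', value) if it fits, else ('bytes', decimal ASCII string)."""
    if INT64_MIN <= value <= INT64_MAX:
        return ("int64", value)
    return ("bytes", str(value).encode("ascii"))


def decode_int(kind, payload):
    """Invert encode_int."""
    if kind == "int64":
        return payload
    return int(payload.decode("ascii"))


small = encode_int(5)
large = encode_int(2 ** 63)       # one past the largest int64
restored = decode_int(*large)
print(small, large, restored)
```

A decimal string has no size limit, so the round trip is exact for any Python integer.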


  1. Depends on the selected options. Always np.uint8 when doing MATLAB compatibility, or if the option is explicitly set.

  2. In Python 2.x, it may be read back as a long if it can't fit in the size of an int.

  3. Stored as a np.int64 if it is small enough to fit. Otherwise its decimal string representation is stored as an np.bytes_ for hdf5storage >= 0.2 (error in earlier versions).

  4. Stored as a np.int64 if it is small enough to fit. Otherwise its decimal string representation is stored as an np.bytes_ for hdf5storage >= 0.2 (error in earlier versions).

  5. Type found only in Python 2.x. Python 2.x's long and int are unified into a single int type in Python 3.x. Read as an int in Python 3.x.

  6. Depends on the selected options and whether it can be converted to UTF-16 without using doublets. If the option is explicitly set (or implicitly when doing MATLAB compatibility) and it can be converted to UTF-16 without losing any characters that can't be represented in UTF-16 or using UTF-16 doublets (MATLAB doesn't support them), then it is written as np.uint16 in UTF-16 encoding. Otherwise, it is stored as np.uint32 in UTF-32 encoding.

  7. Depends on the selected options. If the option is explicitly set (or implicitly when doing MATLAB compatibility), it will be stored as np.uint16 in UTF-16 encoding unless it has non-ASCII characters (in which case a NotImplementedError is thrown). Otherwise, it is just written as np.bytes_.

  8. Depends on the selected options. If the option is explicitly set (or implicitly when doing MATLAB compatibility), it will be stored as np.uint16 in UTF-16 encoding unless it has non-ASCII characters (in which case a NotImplementedError is thrown). Otherwise, it is just written as np.bytes_.

  9. All keys must be str in Python 3 or unicode in Python 2. They cannot have null characters ('\x00') or forward slashes ('/') in them.

  10. np.float16 are not supported for h5py versions before 2.2.

  11. Depends on the selected options and whether it can be converted to UTF-16 without using doublets. If the option is explicitly set (or implicitly when doing MATLAB compatibility) and it can be converted to UTF-16 without losing any characters that can't be represented in UTF-16 or using UTF-16 doublets (MATLAB doesn't support them), then it is written as np.uint16 in UTF-16 encoding. Otherwise, it is stored as np.uint32 in UTF-32 encoding.

  12. Depends on the selected options. If the option is explicitly set (or implicitly when doing MATLAB compatibility), it will be stored as np.uint16 in UTF-16 encoding unless it has non-ASCII characters (in which case a NotImplementedError is thrown). Otherwise, it is just written as np.bytes_.

  13. Container types are only supported if their underlying dtype is supported. Data conversions are done based on its dtype.

  14. Structured np.ndarrays (those with fields in their dtypes) can be written as an HDF5 COMPOUND type or as an HDF5 Group with Datasets holding its fields (either the values directly, or as an HDF5 Reference array to the values for the different elements of the data). Can only be written as an HDF5 COMPOUND type if none of their fields are of dtype 'object'. Field names cannot contain null characters ('\x00') or, when writing as an HDF5 Group, forward slashes ('/').

  15. Structured np.ndarrays with no elements, when written like a structure, will not be read back with the right dtypes for their fields (will all become 'object').

  16. Container types are only supported if their underlying dtype is supported. Data conversions are done based on its dtype.

  17. Container types are only supported if their underlying dtype is supported. Data conversions are done based on its dtype.

  18. Container types are only supported if their underlying dtype is supported. Data conversions are done based on its dtype.

  19. Structured np.ndarrays (those with fields in their dtypes) can be written as an HDF5 COMPOUND type or as an HDF5 Group with Datasets holding its fields (either the values directly, or as an HDF5 Reference array to the values for the different elements of the data). Can only be written as an HDF5 COMPOUND type if none of their fields are of dtype 'object'. Field names cannot contain null characters ('\x00') or, when writing as an HDF5 Group, forward slashes ('/').

  20. Depends on whether there is a complex part or not.

  21. Depends on whether there is a complex part or not.
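The string notes above (6 and 11) describe a choice between UTF-16 and UTF-32 that hinges on whether any character would need a UTF-16 doublet (surrogate pair). A minimal sketch of that decision in plain Python follows. It is not the package's implementation: `encode_text` and `prefer_utf16` are hypothetical names, the returned tags stand in for the numpy types, and little-endian byte order is an assumption made for the example.

```python
def encode_text(text, prefer_utf16=True):
    """Return (dtype_name, raw_bytes) for `text`."""
    # Code points above U+FFFF need a surrogate pair ("doublet") in UTF-16.
    needs_doublets = any(ord(ch) > 0xFFFF for ch in text)
    if prefer_utf16 and not needs_doublets:
        return ("uint16", text.encode("utf-16-le"))
    # Fall back to UTF-32, where every code point takes exactly 4 bytes.
    return ("uint32", text.encode("utf-32-le"))


plain = encode_text("abc")
emoji = encode_text("a\U0001F600")   # U+1F600 lies outside the 16-bit range
forced = encode_text("abc", prefer_utf16=False)
print(plain, emoji[0], forced[0])
```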
