
numcodecs's People

Contributors

alimanfoo, anandtrex, carreau, czaki, dependabot[bot], dimitripapadopoulos, dstansby, funkey, halehawk, jakirkham, jeromekelleher, joshmoore, jrbourbeau, madsbk, manzt, martindurant, mkitti, msankeys963, mzjp2, newt0311, pbranson, psobolewskiphd, pyup-bot, qulogic, rabernat, s-t-e-v-e-n-k, saransh-cpp, tacaswell, tomwhite, vyasr


numcodecs's Issues

Parquet UTF8

Consider implementing an optimised text codec, which is intended for object arrays containing only variable-length text strings, using parquet's byte array encoding approach.
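A rough pure-Python sketch of the idea, using 4-byte little-endian length prefixes in the spirit of Parquet's plain BYTE_ARRAY encoding (the leading item-count header is my own framing, not part of Parquet, and a real codec would do this work in Cython):

import struct
import numpy as np

def encode_text(strings):
    # Item count, then each value as a 4-byte length prefix plus UTF-8 bytes.
    parts = [struct.pack('<I', len(strings))]
    for s in strings:
        b = s.encode('utf-8')
        parts.append(struct.pack('<I', len(b)))
        parts.append(b)
    return b''.join(parts)

def decode_text(buf):
    n, = struct.unpack_from('<I', buf, 0)
    offset = 4
    out = np.empty(n, dtype=object)
    for i in range(n):
        length, = struct.unpack_from('<I', buf, offset)
        offset += 4
        out[i] = bytes(buf[offset:offset + length]).decode('utf-8')
        offset += length
    return out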

Bytes

Consider implementing an optimised bytes codec, which is intended for object arrays containing only variable-length byte strings.

Quantize

Port over the Quantize codec from Zarr.
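For reference, the usual least-significant-digit trick (as used in the netCDF world) looks roughly like the following; the Zarr implementation may differ in details:

import math
import numpy as np

def quantize(arr, digits):
    # Round to a power-of-two scale chosen so that roughly `digits`
    # decimal digits are preserved; trailing zero bits then compress well.
    exp = math.floor(math.log10(10.0 ** -digits))
    bits = math.ceil(math.log2(10.0 ** -exp))
    scale = 2.0 ** bits
    return np.around(scale * np.asarray(arr)) / scale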

Encode array of string arrays

The VLenArray codec currently supports encoding of an array of arrays of scalars. It would be useful if an array of arrays of (variable length) strings could also be supported. E.g., support VLenArray('str') and use VLenUTF8 internally to encode each sub-array.

Other compressor codecs

It would be possible to implement Zstd, LZ4 and Snappy codecs that make use of these compressors directly, not via Blosc. Implementing a Zlib codec directly in C rather than via the Python stdlib would probably also be faster. Source code for these is already present within the c-blosc submodule. Personally I would always go via Blosc, so I'm not strongly motivated to do this myself, but keeping this as a placeholder.

JSON

Consider adding a JSON codec which is intended to provide a safe and portable encoding for object arrays.

XOR delta

Implement an XOR delta filter (XOR with previous value)
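A minimal NumPy sketch of the idea, operating on the raw bit pattern so it also applies to floats (fixed-size integer and float dtypes assumed):

import numpy as np

def xor_encode(arr):
    a = np.ascontiguousarray(arr)
    bits = a.view('u%d' % a.dtype.itemsize)
    enc = bits.copy()
    enc[1:] ^= bits[:-1]   # XOR each element with its predecessor
    return enc

def xor_decode(enc, dtype):
    # A prefix-XOR scan inverts the filter.
    return np.bitwise_xor.accumulate(enc).view(np.dtype(dtype))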

VLen

Consider implementing a codec for use with object arrays where each item is a variable-length sequence and all sequence members are of the same primitive (fixed size) type.

get_codec modifies its argument

The get_codec function modifies its argument, popping the "id" key. This can lead to surprising behaviour such as:

config = {"id": "json"}
codec = get_codec(config) # Works
codec = get_codec(config) # fails.

I suggest adding config = dict(config) at the start of the function; I think the performance implications should be minimal.
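A minimal illustration of the suggested fix (the registry lookup below is elided/hypothetical, not the actual numcodecs implementation):

def get_codec(config):
    config = dict(config)         # copy so the caller's dict is not mutated
    codec_id = config.pop('id')
    cls = registry[codec_id]      # hypothetical registry lookup
    return cls.from_config(config)

Until then, callers can work around it by passing a copy themselves, e.g. get_codec(dict(config)).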

JSON codec reshapes string arrays

This is carrying on from zarr-developers/zarr-python#258

I've tried to come up with a minimal example, but it's tricky to illustrate without showing the context. Here is an interaction with Zarr, with some instrumentation added to the encode/decode methods of the JSON codec.

z = zarr.empty(2, dtype=object, object_codec=numcodecs.JSON(), chunks=(1,))
z[0] = ["11"]
z[1] = ["1", "1"]

print(z[:]) # Borks

output:

INPUT: (1,)
INPUT: (1,)
OUTPUT: (1, 1)
OUTPUT: (1, 2)
Traceback (most recent call last):
  File "dev.py", line 34, in <module>
    print(z[:]) # Borks
  File "/home/jk/.local/lib/python3.5/site-packages/zarr/core.py", line 559, in __getitem__
    return self.get_basic_selection(selection, fields=fields)
  File "/home/jk/.local/lib/python3.5/site-packages/zarr/core.py", line 685, in get_basic_selection
    fields=fields)
  File "/home/jk/.local/lib/python3.5/site-packages/zarr/core.py", line 727, in _get_basic_selection_nd
    return self._get_selection(indexer=indexer, out=out, fields=fields)
  File "/home/jk/.local/lib/python3.5/site-packages/zarr/core.py", line 1015, in _get_selection
    drop_axes=indexer.drop_axes, fields=fields)
  File "/home/jk/.local/lib/python3.5/site-packages/zarr/core.py", line 1608, in _chunk_getitem
    chunk = self._decode_chunk(cdata)
  File "/home/jk/.local/lib/python3.5/site-packages/zarr/core.py", line 1751, in _decode_chunk
    chunk = chunk.reshape(self._chunks, order=self._order)
ValueError: cannot reshape array of size 2 into shape (1,)

The INPUT lines are the shapes of the input arrays to encode and the OUTPUT lines are the corresponding output shapes of the arrays from decode.

Problem description

When calling numpy.array([["s1", "s2"], ["s3", "s4"]], dtype=object), numpy is quite aggressive about reshaping the array to store things more efficiently.

I've played around with this a fair bit, and I think the only options are to

  1. Drop the numpy dependency in the encoding and decoding steps for JSON (i.e., don't include the dtype in the JSON encoding), provide the supplied argument directly to the JSON encoder, and conversely return the value of json.loads() directly from decode.

  2. Also encode the input array shape in the JSON encoding.

Both of these options are ugly because they break backward compatibility. I'll make a PR demonstrating option 2 in a minute for discussion.
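A rough sketch of what option 2 could look like (not the actual numcodecs JSON codec; dtype handling and encoder options are omitted):

import json
import numpy as np

def encode(buf):
    arr = np.asarray(buf, dtype=object)
    items = arr.ravel(order='C').tolist()
    # Record the original shape alongside the flattened items.
    return json.dumps([items, list(arr.shape)]).encode('utf-8')

def decode(buf):
    items, shape = json.loads(bytes(buf).decode('utf-8'))
    arr = np.empty(len(items), dtype=object)
    for i, item in enumerate(items):
        arr[i] = item
    return arr.reshape(shape)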

v0.5.3 release

Proposing a new micro release, incorporating the migration to pytest, the c-blosc upgrade, and the new contributing docs.

TODO:

  • Merge #62
  • Merge #64
  • Update release notes
  • Git tag
  • PyPI release
  • Enable release version on rtfd
  • conda-forge release

Detect AVX2 support at runtime

Currently users have to decide at compile time whether to build a binary that supports AVX2 intrinsics or not. If they build with AVX2 intrinsics and end up deploying somewhere that lacks AVX2 support, they will suffer a segfault due to an illegal instruction. While a build without AVX2 intrinsics will work regardless of whether the target infrastructure has AVX2 support, the compression algorithms here may run slower than if they were built with AVX2. Admittedly, avoiding a segfault is much more important than degraded performance.

However, in the ideal case, we could build numcodecs with and without AVX2 support and then merely detect at runtime whether AVX2 instructions were permitted and thus choose the appropriate code path without crashing in either case. This will take a bit of work to understand where AVX2 instructions are being introduced and how to avoid them. Though some of that was already done in the first referenced issue below.

xref: zarr-developers/zarr-python#136
xref: #24
xref: #26
xref: #27
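One possible runtime check, assuming the third-party py-cpuinfo package (not currently a numcodecs dependency); the module names in the comment are hypothetical:

import cpuinfo

flags = cpuinfo.get_cpu_info().get('flags', [])
have_avx2 = 'avx2' in flags

# A build could ship both compiled variants and pick one here, e.g.:
# blosc = importlib.import_module(
#     'numcodecs._blosc_avx2' if have_avx2 else 'numcodecs._blosc_generic')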

Building wheels

As numcodecs includes source that needs to be compiled, users can run into challenges or technical hurdles they may not be aware of. While we do solve this in some sense by supplying conda-forge packages with prebuilt binaries, pip remains the de facto way for Python users to get packages. However, with pip the user gets the sdist, which needs compilation. While this does work, there is definitely some appeal to providing a pip solution that does not require compilation (i.e. prebuilt wheels). This would help users avoid compatibility problems like this one ( #69 ).

One solution would be to piggyback on whatever conda-forge ends up doing to also supply wheels ( conda-forge/conda-smithy#608 ). This would work well for Windows. However, on macOS conda-forge uses the 10.9 SDK (instead of the 10.6 SDK that Python tries to support), and on Linux conda-forge uses CentOS 6 with glibc 2.12 (instead of CentOS 5 with glibc 2.5 used by manylinux1). In practice these two mismatched requirements will be hard for Python to keep enforcing much longer, and some packages already don't comply (happy to go into details if this is of interest). This approach would also benefit from the architecture and community conda-forge has in place to solve these issues. It also seems that proponents of the wheel format are interested in collaborating, which definitely should help.

If that doesn't work, the alternative would be to build wheels here. For Linux, there is the manylinux1 Docker image, which could be used fairly easily to build these. For macOS, we could try to build them here or reach out to MacPython for help. For Windows, it would probably be best to reuse the conda-forge solution as much as possible, as that already fits the requirements well. That would probably require an issue like conda/conda-build#2490 to be solved, though I don't think anyone has had time to do that. Alternatively, one can build a wheel with conda-build.

Using JPEG2000 for chunk compression

I've been using chunk compressed Zarr arrays for some neuroscience image processing tasks, and it's been great so far. However, JPEG2000 might perform better than lz4 or Zstd for my images. I'd like to use Zarr to handle the image chunking with a JPEG2000 compressor, but I'm not sure if this is possible. I realize that this feature isn't as general as numcodecs would want, but I'm mostly asking what the steps would be to see if I should even try.
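As far as the steps go, a custom codec is essentially a Codec subclass registered with numcodecs; a skeleton might look like the following (the JPEG2000 calls themselves would need to come from a third-party library such as OpenJPEG bindings, which is why the bodies are left as placeholders):

import numpy as np
from numcodecs.abc import Codec
from numcodecs.registry import register_codec

class JPEG2000(Codec):

    codec_id = 'jpeg2000'   # hypothetical id

    def __init__(self, rate=None):
        self.rate = rate

    def encode(self, buf):
        arr = np.ascontiguousarray(buf)
        raise NotImplementedError('delegate to a JPEG2000 library here')

    def decode(self, buf, out=None):
        raise NotImplementedError('delegate to a JPEG2000 library here')

register_codec(JPEG2000)

Once registered, an instance could then be passed to Zarr as the compressor for an array.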

Blosc hangs when used with multiprocessing

Originally reported in https://github.com/alimanfoo/zarr/issues/199, Blosc causes a hang if used from multiprocessing and use_threads is not set to False. This is a nasty gotcha because the default configuration is to use threads if running from the main thread, which handles the multi-threading case but does not handle the multi-processing case. Minimal example here.

There is possibly a way to prevent this from happening by detecting if the compressor has been pickled and moved across processes, then forcing use_threads to false.
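A sketch of the current workaround, assuming the compression happens inside worker processes (numcodecs.blosc.use_threads is the module-level switch referred to above):

from multiprocessing import Pool
import numpy as np
import numcodecs
import numcodecs.blosc

def compress_chunk(chunk):
    # Disable Blosc's internal thread pool in each worker to avoid the hang.
    numcodecs.blosc.use_threads = False
    return numcodecs.Blosc(cname='zstd', clevel=5).encode(chunk)

if __name__ == '__main__':
    chunks = [np.arange(i, i + 100000, dtype='i4') for i in range(4)]
    with Pool(4) as pool:
        compressed = pool.map(compress_chunk, chunks)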

Blosc.set_block_size()

Add support for manually specifying the block size. N.B., do this in a backwards-compatible way.

Blosc silently fails with 2GB buffer

>>> import numcodecs
>>> import numpy as np                                                                                   
>>> a = np.ones(1024**3, dtype=np.int8)
>>> a.nbytes
1073741824
>>> codec = numcodecs.Blosc()
>>> x = codec.encode(a)
>>> len(x)
4505616
>>> a = np.ones(2 * 1024**3, dtype=np.int8)
>>> x = codec.encode(a)                                                                                  
Input buffer size cannot exceed 2147483631 bytes
>>> codec.decode(x)
# Freezes interpreter

It looks like Blosc is raising this error and it's not being caught here because csize is declared as size_t, and so cannot be negative.

Any thoughts @alimanfoo? Should be an easy fix, but it's not entirely clear how to test this. Do you know of any other ways we can provoke errors in blosc which don't involve mallocing 2GB?

Data fixture

Create a fixture of encoded data using current codec implementations. Add tests to check that data can be decoded to catch any unintended changes that would break backwards compatibility.
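A minimal sketch of the idea (the file names and layout below are hypothetical):

import numpy as np
import numcodecs

arr = np.arange(1000, dtype='i4')
codec = numcodecs.Zlib(level=1)

# Generation step: run once and commit the results to the repository.
with open('fixture/zlib.level1.enc', 'wb') as f:
    f.write(codec.encode(arr))
np.save('fixture/zlib.level1.orig.npy', arr)

# Test step: run on every build; fails if decoding behaviour changes.
with open('fixture/zlib.level1.enc', 'rb') as f:
    enc = f.read()
dec = np.frombuffer(codec.decode(enc), dtype='i4')
assert np.array_equal(dec, np.load('fixture/zlib.level1.orig.npy'))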

datetime64 arrays don't support buffer protocol

Attempting to encode a datetime64 array with any codec that relies on buffer access fails, e.g.:

  File "numcodecs/blosc.pyx", line 350, in numcodecs.blosc.Blosc.encode (numcodecs/blosc.c:4126)
  File "numcodecs/blosc.pyx", line 159, in numcodecs.blosc.compress (numcodecs/blosc.c:2274)
  File "numcodecs/compat_ext.pyx", line 29, in numcodecs.compat_ext.Buffer.__cinit__ (numcodecs/compat_ext.c:1358)
Exception: cannot include dtype 'M' in a buffer

This could be worked around in compat_ext.pyx, e.g., an explicit check for datetime64 arrays, viewing them as int64 before getting the buffer.

xref joblib/joblib#183 numpy/numpy#4983
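In the meantime, a user-side workaround along the same lines is to view the data as int64 before encoding and view it back after decoding:

import numpy as np
import numcodecs

a = np.arange('2018-01-01', '2018-01-11', dtype='M8[D]')
codec = numcodecs.Blosc()

enc = codec.encode(a.view('i8'))                             # encode the raw integers
dec = np.frombuffer(codec.decode(enc), 'i8').view(a.dtype)   # restore datetime64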

Problems importing blosc on HPC cluster

I am using zarr on an HPC cluster with a rather complex configuration (multiple conda envs, dask distributed, etc.)

Minimal, reproducible code sample, a copy-pastable example if possible

# What should happen (and works on cluster head node)
>>> import zarr
>>> zarr.Blosc
numcodecs.blosc.Blosc

# What happens on a compute node
>>> import zarr
>>> zarr.Blosc
AttributeError: module 'zarr' has no attribute 'Blosc'
>>> import numcodecs.blosc
ImportError: /lib64/libc.so.6: version `GLIBC_2.14' not found (required by /nobackup/rpaberna/conda/envs/pangeo/lib/python3.6/site-packages/numcodecs/blosc.cpython-36m-x86_64-linux-gnu.so)

Problem description

Both examples use the same underlying conda environment and have identical zarr / numcodecs versions. However, some deep library error makes Blosc unusable on the compute nodes.

Version and installation information

Please provide the following:

  • Value of zarr.__version__: 2.2.0rc4.dev1
  • Value of numcodecs.__version__: 0.5.3
  • Version of Python interpreter: 3.6.4
  • Operating system (Linux/Windows/Mac) Linux
  • How Zarr was installed (e.g., "using pip into virtual environment", or "using conda"): pip

Review AsType tests

A recent PR (#12) migrated the AsType filter from Zarr. Initial tests look good, but it would be good to review them prior to the next release.

ZFP Compression

I just learned about a new compression library called ZFP: https://github.com/LLNL/zfp

zfp is an open source C/C++ library for compressed numerical arrays that support high throughput read and write random access. zfp also supports streaming compression of integer and floating-point data, e.g., for applications that read and write large data sets to and from disk.

zfp was developed at Lawrence Livermore National Laboratory and is loosely based on the algorithm described in the following paper:

Peter Lindstrom
"Fixed-Rate Compressed Floating-Point Arrays"
IEEE Transactions on Visualization and Computer Graphics
20(12):2674-2683, December 2014
doi:10.1109/TVCG.2014.2346458

zfp was originally designed for floating-point arrays only, but has been extended to also support integer data, and could for instance be used to compress images and quantized volumetric data. To achieve high compression ratios, zfp uses lossy but optionally error-bounded compression. Although bit-for-bit lossless compression of floating-point data is not always possible, zfp is usually accurate to within machine epsilon in near-lossless mode.

zfp works best for 2D and 3D arrays that exhibit spatial correlation, such as continuous fields from physics simulations, images, regularly sampled terrain surfaces, etc. Although zfp also provides a 1D array class that can be used for 1D signals such as audio, or even unstructured floating-point streams, the compression scheme has not been well optimized for this use case, and rate and quality may not be competitive with floating-point compressors designed specifically for 1D streams.

zfp is freely available as open source under a BSD license, as outlined in the file 'LICENSE'. For more information on zfp and comparisons with other compressors, please see the zfp website. For questions, comments, requests, and bug reports, please contact Peter Lindstrom.

It would be excellent to add ZFP compression to Zarr! What would be the best path towards this? Could it be added to numcodecs?

Blosc auto shuffle

Consider supporting a value of 'auto' for the 'shuffle' argument to the Blosc compressor, which would use byte shuffle if the itemsize of the buffer is greater than one, otherwise bit shuffle.
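A sketch of the proposed behaviour in user code, using the existing Blosc shuffle constants:

import numpy as np
from numcodecs import Blosc

def auto_shuffle(dtype):
    # Bit shuffle for single-byte items, byte shuffle otherwise.
    return Blosc.BITSHUFFLE if np.dtype(dtype).itemsize == 1 else Blosc.SHUFFLE

codec = Blosc(cname='lz4', clevel=5, shuffle=auto_shuffle('u1'))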

Refresh Requirements

The requirements (particularly in requirements_dev.txt) are lagging a bit. Would be good to update them as was done with Zarr not too long ago.

xref: #90 (comment)

Support timeline for Windows 32-bit

Raising this to see how long we want to support Windows 32-bit and whether there are particular needs motivating its support. Asking because this effectively doubles our CI time, so I want to make sure that is an allocation we are comfortable with.

FWIW in conda-forge we did a poll earlier this year to see how many Windows 32-bit users we have and whether it made sense to support the platform further. We learned a very small percentage of users are on Windows 32-bit with most on 64-bit. It goes without saying that we opted to drop Windows 32-bit after learning this.

ZigZag encoding

Consider implementing zigzag encoding, possibly in combination with naive delta and/or xor delta.
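A minimal NumPy sketch, assuming 64-bit integers (it could be chained after a delta filter):

import numpy as np

def zigzag_encode(arr):
    # Map signed values to unsigned so small magnitudes, positive or
    # negative, become small unsigned values: 0, -1, 1, -2 -> 0, 1, 2, 3.
    a = np.asarray(arr, dtype=np.int64)
    return ((a << 1) ^ (a >> 63)).view(np.uint64)

def zigzag_decode(z):
    z = np.asarray(z, dtype=np.uint64)
    half = (z >> np.uint64(1)).astype(np.int64)
    sign = (z & np.uint64(1)).astype(np.int64)
    return half ^ -sign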

BSON

Consider adding a BSON codec which is intended to provide a safe, portable and efficient encoding for object arrays.
