
numcodecs's People

Contributors

alimanfoo, anandtrex, carreau, czaki, dependabot[bot], dimitripapadopoulos, dstansby, funkey, halehawk, jakirkham, jeromekelleher, joshmoore, jrbourbeau, madsbk, manzt, martindurant, mkitti, msankeys963, mzjp2, newt0311, pbranson, psobolewskiphd, pyup-bot, qulogic, rabernat, s-t-e-v-e-n-k, saransh-cpp, tacaswell, tomwhite, vyasr


numcodecs's Issues

Parquet UTF8

Consider implementing an optimised text codec, which is intended for object arrays containing only variable-length text strings, using parquet's byte array encoding approach.
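A rough pure-Python sketch of the idea, using 4-byte little-endian length prefixes in the spirit of Parquet's plain BYTE_ARRAY encoding (the leading item-count header is my own framing, not part of Parquet, and a real codec would do this work in Cython):

import struct
import numpy as np

def encode_text(strings):
    # Item count, then each value as a 4-byte length prefix plus UTF-8 bytes.
    parts = [struct.pack('<I', len(strings))]
    for s in strings:
        b = s.encode('utf-8')
        parts.append(struct.pack('<I', len(b)))
        parts.append(b)
    return b''.join(parts)

def decode_text(buf):
    n, = struct.unpack_from('<I', buf, 0)
    offset = 4
    out = np.empty(n, dtype=object)
    for i in range(n):
        length, = struct.unpack_from('<I', buf, offset)
        offset += 4
        out[i] = bytes(buf[offset:offset + length]).decode('utf-8')
        offset += length
    return out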

Bytes

Consider implementing an optimised bytes codec, which is intended for object arrays containing only variable-length byte strings.

Quantize

Port over the Quantize codec from Zarr.
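For reference, the usual least-significant-digit trick (as used in the netCDF world) looks roughly like the following; the Zarr implementation may differ in details:

import math
import numpy as np

def quantize(arr, digits):
    # Round to a power-of-two scale chosen so that roughly `digits`
    # decimal digits are preserved; trailing zero bits then compress well.
    exp = math.floor(math.log10(10.0 ** -digits))
    bits = math.ceil(math.log2(10.0 ** -exp))
    scale = 2.0 ** bits
    return np.around(scale * np.asarray(arr)) / scale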

Encode array of string arrays

The VLenArray codec currently supports encoding of an array of arrays of scalars. It would be useful if an array of arrays of (variable length) strings could also be supported. E.g., support VLenArray('str') and use VLenUTF8 internally to encode each sub-array.

Other compressor codecs

It would be possible to implement Zstd, LZ4 and Snappy codecs that make use of these compressors directly, not via Blosc. Implementing a Zlib codec directly in C rather than via the Python stdlib would probably also be faster. Source code for these is already present within the c-blosc submodule. Personally I would always go via Blosc, so I'm not strongly motivated to do this myself, but keeping this as a placeholder.

JSON

Consider adding a JSON codec which is intended to provide a safe and portable encoding for object arrays.

XOR delta

Implement an XOR delta filter (XOR with previous value)
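A minimal NumPy sketch of the idea, operating on the raw bit pattern so it also applies to floats (fixed-size integer and float dtypes assumed):

import numpy as np

def xor_encode(arr):
    a = np.ascontiguousarray(arr)
    bits = a.view('u%d' % a.dtype.itemsize)
    enc = bits.copy()
    enc[1:] ^= bits[:-1]   # XOR each element with its predecessor
    return enc

def xor_decode(enc, dtype):
    # A prefix-XOR scan inverts the filter.
    return np.bitwise_xor.accumulate(enc).view(np.dtype(dtype))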

VLen

Consider implementing a codec for use with object arrays where each item is a variable-length sequence and all sequence members are of the same primitive (fixed size) type.

get_codec modifies its argument

The get_codec function modifies its argument, popping the "id" key. This can lead to surprising behaviour such as:

config = {"id": "json"}
codec = get_codec(config) # Works
codec = get_codec(config) # fails.

I suggest adding config = dict(config) at the start of the function; I think the performance implications should be minimal.
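A minimal illustration of the suggested fix (the registry lookup below is elided/hypothetical, not the actual numcodecs implementation):

def get_codec(config):
    config = dict(config)         # copy so the caller's dict is not mutated
    codec_id = config.pop('id')
    cls = registry[codec_id]      # hypothetical registry lookup
    return cls.from_config(config)

Until then, callers can work around it by passing a copy themselves, e.g. get_codec(dict(config)).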

JSON codec reshapes string arrays

This is carrying on from zarr-developers/zarr-python#258

I've tried to come up with a minimal example, but it's tricky to illustrate without showing the context. Here is an interaction with Zarr, with some instrumentation added to the encode/decode methods of the JSON codec.

z = zarr.empty(2, dtype=object, object_codec=numcodecs.JSON(), chunks=(1,))
z[0] = ["11"]
z[1] = ["1", "1"]

print(z[:]) # Borks

output:

INPUT: (1,)
INPUT: (1,)
OUTPUT: (1, 1)
OUTPUT: (1, 2)
Traceback (most recent call last):
  File "dev.py", line 34, in <module>
    print(z[:]) # Borks
  File "/home/jk/.local/lib/python3.5/site-packages/zarr/core.py", line 559, in __getitem__
    return self.get_basic_selection(selection, fields=fields)
  File "/home/jk/.local/lib/python3.5/site-packages/zarr/core.py", line 685, in get_basic_selection
    fields=fields)
  File "/home/jk/.local/lib/python3.5/site-packages/zarr/core.py", line 727, in _get_basic_selection_nd
    return self._get_selection(indexer=indexer, out=out, fields=fields)
  File "/home/jk/.local/lib/python3.5/site-packages/zarr/core.py", line 1015, in _get_selection
    drop_axes=indexer.drop_axes, fields=fields)
  File "/home/jk/.local/lib/python3.5/site-packages/zarr/core.py", line 1608, in _chunk_getitem
    chunk = self._decode_chunk(cdata)
  File "/home/jk/.local/lib/python3.5/site-packages/zarr/core.py", line 1751, in _decode_chunk
    chunk = chunk.reshape(self._chunks, order=self._order)
ValueError: cannot reshape array of size 2 into shape (1,)

The INPUT lines are the shapes of the input arrays to encode and the OUTPUT lines are the corresponding output shapes of the arrays from decode.

Problem description

When calling numpy.array([["s1", "s2"], ["s3", "s4"]], dtype=object), numpy is quite aggressive about reshaping the array to store things more efficiently.

I've played around with this a fair bit, and I think the only options are to

  1. Drop the numpy dependency in the encoding and decoding steps for JSON (i.e., don't include the dtype in the JSON encoding), provide the supplied argument directly to the JSON encoder, and conversely return the value of json.loads() directly from decode.

  2. Also encode the input array shape in the JSON encoding.

Both of these options are ugly because they break backward compatibility. I'll make a PR demonstrating option 2 in a minute for discussion.
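A rough sketch of what option 2 could look like (not the actual numcodecs JSON codec; dtype handling and encoder options are omitted):

import json
import numpy as np

def encode(buf):
    arr = np.asarray(buf, dtype=object)
    items = arr.ravel(order='C').tolist()
    # Record the original shape alongside the flattened items.
    return json.dumps([items, list(arr.shape)]).encode('utf-8')

def decode(buf):
    items, shape = json.loads(bytes(buf).decode('utf-8'))
    arr = np.empty(len(items), dtype=object)
    for i, item in enumerate(items):
        arr[i] = item
    return arr.reshape(shape)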

v0.5.3 release

Proposing a new micro release, incorporating the migration to pytest, the c-blosc upgrade, and the new contributing docs.

TODO:

  • Merge #62
  • Merge #64
  • Update release notes
  • Git tag
  • PyPI release
  • Enable release version on rtfd
  • conda-forge release

Detect AVX2 support at runtime

Currently users have to decide at compile time whether to build a binary that supports AVX2 intrinsics or not. If they build with AVX2 intrinsics and end up deploying somewhere that lacks AVX2 support, they will suffer a segfault due to an illegal instruction. While a build without AVX2 intrinsics will work regardless of whether the target infrastructure has AVX2 support, the compression algorithms here may run slower than if they were built with AVX2. Admittedly, avoiding a segfault is much more important than degraded performance.

However, in the ideal case, we could build numcodecs with and without AVX2 support and then merely detect at runtime whether AVX2 instructions were permitted and thus choose the appropriate code path without crashing in either case. This will take a bit of work to understand where AVX2 instructions are being introduced and how to avoid them. Though some of that was already done in the first referenced issue below.

xref: zarr-developers/zarr-python#136
xref: #24
xref: #26
xref: #27
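One possible runtime check, assuming the third-party py-cpuinfo package (not currently a numcodecs dependency); the module names in the comment are hypothetical:

import cpuinfo

flags = cpuinfo.get_cpu_info().get('flags', [])
have_avx2 = 'avx2' in flags

# A build could ship both compiled variants and pick one here, e.g.:
# blosc = importlib.import_module(
#     'numcodecs._blosc_avx2' if have_avx2 else 'numcodecs._blosc_generic')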

Building wheels

As numcodecs includes source that needs to be compiled, users can run into challenges or technical hurdles they may not be aware of. While we do solve this in some sense by supplying conda-forge packages with prebuilt binaries, pip remains the de facto way for Python users to get packages. However, with pip the user gets the sdist, which needs compilation. While this does work, there is definitely some appeal to providing a pip solution that does not require compilation (i.e. prebuilt wheels). This would help users avoid compatibility problems like this one ( #69 ).

One solution would be to piggyback on whatever conda-forge ends up doing to also supply wheels ( conda-forge/conda-smithy#608 ). This would work well for Windows. However, on macOS conda-forge uses the 10.9 SDK (instead of the 10.6 SDK that Python tries to support), and on Linux conda-forge uses CentOS 6 with glibc 2.12 (instead of CentOS 5 with glibc 2.5 used by manylinux1). In practice these two mismatched requirements will be hard for Python to keep enforcing much longer, and some packages already don't comply (happy to go into details if this is of interest). This approach would also benefit from the architecture and community conda-forge has in place to solve these issues. It also seems that proponents of the wheel format are interested in collaborating, which definitely should help.

If that doesn't work, the alternative would be to build wheels here. For Linux, there is the manylinux1 Docker image, which could be used fairly easily to build these. For macOS, we could try to build them here or reach out to MacPython for help. For Windows, it would probably be best to reuse the conda-forge solution as much as possible, as that already fits the requirements well. That would probably require an issue like conda/conda-build#2490 to be solved, though I don't think anyone has had time to do that. Alternatively, one can build a wheel with conda-build.

Using JPEG2000 for chunk compression

I've been using chunk compressed Zarr arrays for some neuroscience image processing tasks, and it's been great so far. However, JPEG2000 might perform better than lz4 or Zstd for my images. I'd like to use Zarr to handle the image chunking with a JPEG2000 compressor, but I'm not sure if this is possible. I realize that this feature isn't as general as numcodecs would want, but I'm mostly asking what the steps would be to see if I should even try.
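As far as the steps go, a custom codec is essentially a Codec subclass registered with numcodecs; a skeleton might look like the following (the JPEG2000 calls themselves would need to come from a third-party library such as OpenJPEG bindings, which is why the bodies are left as placeholders):

import numpy as np
from numcodecs.abc import Codec
from numcodecs.registry import register_codec

class JPEG2000(Codec):

    codec_id = 'jpeg2000'   # hypothetical id

    def __init__(self, rate=None):
        self.rate = rate

    def encode(self, buf):
        arr = np.ascontiguousarray(buf)
        raise NotImplementedError('delegate to a JPEG2000 library here')

    def decode(self, buf, out=None):
        raise NotImplementedError('delegate to a JPEG2000 library here')

register_codec(JPEG2000)

Once registered, an instance could then be passed to Zarr as the compressor for an array.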

Blosc hangs when used with multiprocessing

Originally reported in https://github.com/alimanfoo/zarr/issues/199, Blosc causes a hang if used from multiprocessing and use_threads is not set to False. This is a nasty gotcha because the default configuration is to use threads if running from the main thread, which handles the multi-threading case but does not handle the multi-processing case. Minimal example here.

There is possibly a way to prevent this from happening by detecting if the compressor has been pickled and moved across processes, then forcing use_threads to false.
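A sketch of the current workaround, assuming the compression happens inside worker processes (numcodecs.blosc.use_threads is the module-level switch referred to above):

from multiprocessing import Pool
import numpy as np
import numcodecs
import numcodecs.blosc

def compress_chunk(chunk):
    # Disable Blosc's internal thread pool in each worker to avoid the hang.
    numcodecs.blosc.use_threads = False
    return numcodecs.Blosc(cname='zstd', clevel=5).encode(chunk)

if __name__ == '__main__':
    chunks = [np.arange(i, i + 100000, dtype='i4') for i in range(4)]
    with Pool(4) as pool:
        compressed = pool.map(compress_chunk, chunks)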

Blosc.set_block_size()

Add support for manually specifying the block size. N.B., do this in a backwards-compatible way.

Blosc silently fails with 2GB buffer

>>> import numcodecs
>>> import numpy as np                                                                                   
>>> a = np.ones(1024**3, dtype=np.int8)
>>> a.nbytes
1073741824
>>> codec = numcodecs.Blosc()
>>> x = codec.encode(a)
>>> len(x)
4505616
>>> a = np.ones(2 * 1024**3, dtype=np.int8)
>>> x = codec.encode(a)                                                                                  
Input buffer size cannot exceed 2147483631 bytes
>>> codec.decode(x)
# Freezes interpreter

It looks like Blosc is raising this error and it's not being caught here because csize is declared as size_t, and so cannot be negative.

Any thoughts @alimanfoo? Should be an easy fix, but it's not entirely clear how to test this. Do you know of any other ways we can provoke errors in blosc which don't involve mallocing 2GB?

Data fixture

Create a fixture of encoded data using current codec implementations. Add tests to check that data can be decoded to catch any unintended changes that would break backwards compatibility.
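A minimal sketch of the idea (the file names and layout below are hypothetical):

import numpy as np
import numcodecs

arr = np.arange(1000, dtype='i4')
codec = numcodecs.Zlib(level=1)

# Generation step: run once and commit the results to the repository.
with open('fixture/zlib.level1.enc', 'wb') as f:
    f.write(codec.encode(arr))
np.save('fixture/zlib.level1.orig.npy', arr)

# Test step: run on every build; fails if decoding behaviour changes.
with open('fixture/zlib.level1.enc', 'rb') as f:
    enc = f.read()
dec = np.frombuffer(codec.decode(enc), dtype='i4')
assert np.array_equal(dec, np.load('fixture/zlib.level1.orig.npy'))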

datetime64 arrays don't support buffer protocol

Attempting to encode a datetime64 array with any codec that relies on buffer access fails, e.g.:

  File "numcodecs/blosc.pyx", line 350, in numcodecs.blosc.Blosc.encode (numcodecs/blosc.c:4126)
  File "numcodecs/blosc.pyx", line 159, in numcodecs.blosc.compress (numcodecs/blosc.c:2274)
  File "numcodecs/compat_ext.pyx", line 29, in numcodecs.compat_ext.Buffer.__cinit__ (numcodecs/compat_ext.c:1358)
Exception: cannot include dtype 'M' in a buffer

This could be worked around in compat_ext.pyx, e.g., an explicit check for datetime64 arrays, viewing them as int64 before getting the buffer.

xref joblib/joblib#183 numpy/numpy#4983
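In the meantime, a user-side workaround along the same lines is to view the data as int64 before encoding and view it back after decoding:

import numpy as np
import numcodecs

a = np.arange('2018-01-01', '2018-01-11', dtype='M8[D]')
codec = numcodecs.Blosc()

enc = codec.encode(a.view('i8'))                             # encode the raw integers
dec = np.frombuffer(codec.decode(enc), 'i8').view(a.dtype)   # restore datetime64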

Problems importing blosc on HPC cluster

I am using zarr on an HPC cluster with a rather complex configuration (multiple conda envs, dask distributed, etc.)

Minimal, reproducible code sample, a copy-pastable example if possible

# What should happen (and works on cluster head node)
>>> import zarr
>>> zarr.Blosc
numcodecs.blosc.Blosc

# What happens on a compute node
>>> import zarr
>>> zarr.Blosc
AttributeError: module 'zarr' has no attribute 'Blosc'
>>> import numcodecs.blosc
ImportError: /lib64/libc.so.6: version `GLIBC_2.14' not found (required by /nobackup/rpaberna/conda/envs/pangeo/lib/python3.6/site-packages/numcodecs/blosc.cpython-36m-x86_64-linux-gnu.so)

Problem description

Both examples use the same underlying conda environment and have identical zarr / numcodecs versions. However, some deep library error makes Blosc unusable on the compute nodes.

Version and installation information

Please provide the following:

  • Value of zarr.__version__: 2.2.0rc4.dev1
  • Value of numcodecs.__version__: 0.5.3
  • Version of Python interpreter: 3.6.4
  • Operating system (Linux/Windows/Mac) Linux
  • How Zarr was installed (e.g., "using pip into virtual environment", or "using conda"): pip

Review AsType tests

A recent PR (#12) migrated the AsType filter from Zarr. Initial tests look good, but it would be good to review them prior to the next release.

ZFP Compression

I just learned about a new compression library called ZFP: https://github.com/LLNL/zfp

zfp is an open source C/C++ library for compressed numerical arrays that support high throughput read and write random access. zfp also supports streaming compression of integer and floating-point data, e.g., for applications that read and write large data sets to and from disk.

zfp was developed at Lawrence Livermore National Laboratory and is loosely based on the algorithm described in the following paper:

Peter Lindstrom
"Fixed-Rate Compressed Floating-Point Arrays"
IEEE Transactions on Visualization and Computer Graphics
20(12):2674-2683, December 2014
doi:10.1109/TVCG.2014.2346458

zfp was originally designed for floating-point arrays only, but has been extended to also support integer data, and could for instance be used to compress images and quantized volumetric data. To achieve high compression ratios, zfp uses lossy but optionally error-bounded compression. Although bit-for-bit lossless compression of floating-point data is not always possible, zfp is usually accurate to within machine epsilon in near-lossless mode.

zfp works best for 2D and 3D arrays that exhibit spatial correlation, such as continuous fields from physics simulations, images, regularly sampled terrain surfaces, etc. Although zfp also provides a 1D array class that can be used for 1D signals such as audio, or even unstructured floating-point streams, the compression scheme has not been well optimized for this use case, and rate and quality may not be competitive with floating-point compressors designed specifically for 1D streams.

zfp is freely available as open source under a BSD license, as outlined in the file 'LICENSE'. For more information on zfp and comparisons with other compressors, please see the zfp website. For questions, comments, requests, and bug reports, please contact Peter Lindstrom.

It would be excellent to add ZFP compression to Zarr! What would be the best path towards this? Could it be added to numcodecs?

Blosc auto shuffle

Consider supporting a value of 'auto' for the 'shuffle' argument to the Blosc compressor, which would use byte shuffle if the itemsize of the buffer is greater than one, otherwise bit shuffle.
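A sketch of the proposed behaviour in user code, using the existing Blosc shuffle constants:

import numpy as np
from numcodecs import Blosc

def auto_shuffle(dtype):
    # Bit shuffle for single-byte items, byte shuffle otherwise.
    return Blosc.BITSHUFFLE if np.dtype(dtype).itemsize == 1 else Blosc.SHUFFLE

codec = Blosc(cname='lz4', clevel=5, shuffle=auto_shuffle('u1'))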

Refresh Requirements

The requirements (particularly in requirements_dev.txt) are lagging a bit. Would be good to update them as was done with Zarr not too long ago.

xref: #90 (comment)

Support timeline for Windows 32-bit

Raising this to see how long we want to support Windows 32-bit and whether there are particular needs motivating its support. Asking because this effectively doubles our CI time, so I want to make sure that is an allocation we are comfortable with.

FWIW in conda-forge we did a poll earlier this year to see how many Windows 32-bit users we have and whether it made sense to support the platform further. We learned a very small percentage of users are on Windows 32-bit with most on 64-bit. It goes without saying that we opted to drop Windows 32-bit after learning this.

ZigZag encoding

Consider implementing zigzag encoding, possibly in combination with naive delta and/or xor delta.
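A minimal NumPy sketch, assuming 64-bit integers (it could be chained after a delta filter):

import numpy as np

def zigzag_encode(arr):
    # Map signed values to unsigned so small magnitudes, positive or
    # negative, become small unsigned values: 0, -1, 1, -2 -> 0, 1, 2, 3.
    a = np.asarray(arr, dtype=np.int64)
    return ((a << 1) ^ (a >> 63)).view(np.uint64)

def zigzag_decode(z):
    z = np.asarray(z, dtype=np.uint64)
    half = (z >> np.uint64(1)).astype(np.int64)
    sign = (z & np.uint64(1)).astype(np.int64)
    return half ^ -sign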

BSON

Consider adding a BSON codec which is intended to provide a safe, portable and efficient encoding for object arrays.
