pystreamvbyte

Python bindings to streamvbyte.

Installing

$ pip install --user pystreamvbyte

Example

>>> import numpy as np
>>> from streamvbyte import encode, decode
>>>
>>> size = int(40e6)
>>> dtype = np.uint32  # int16, uint16, int32, uint32 supported
>>> data = np.random.randint(0, 512, size=size, dtype=dtype)
>>> data.nbytes
160000000
>>> compressed = encode(data)
>>> compressed.nbytes
70001679
>>> recovered = decode(compressed, size, dtype=dtype)
>>> compressed.nbytes / data.nbytes * 100
43.751049375
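
The ratio of roughly 43.75% above is consistent with a back-of-the-envelope estimate. The sketch below assumes the standard Stream VByte layout (2 control bits per integer plus 1 to 4 data bytes depending on magnitude) and no delta coding. The helper name estimate_streamvbyte_bytes is hypothetical and not part of pystreamvbyte. For values drawn uniformly from [0, 512), about half need 1 byte and half need 2, giving about 1.5 data bytes plus 0.25 control bytes per 4-byte integer.

```python
import numpy as np

def estimate_streamvbyte_bytes(values):
    """Estimate the encoded size in bytes (hypothetical helper, assumed layout)."""
    v = np.asarray(values, dtype=np.uint32)
    # 1 data byte below 2**8, 2 below 2**16, 3 below 2**24, otherwise 4.
    data_bytes = 1 + (v > 0xFF).astype(np.int64) + (v > 0xFFFF) + (v > 0xFFFFFF)
    # One control byte describes four integers (2 bits each).
    control_bytes = (len(v) + 3) // 4
    return int(control_bytes + data_bytes.sum())

# Smaller sample than the 40e6 values used above, same distribution.
rng = np.random.default_rng(0)
sample = rng.integers(0, 512, size=1_000_000, dtype=np.uint32)
ratio = estimate_streamvbyte_bytes(sample) / sample.nbytes * 100
print(f"estimated ratio: {ratio:.2f}%")
```

The estimate ignores any padding or framing the library may add, so it should only be read as an approximation of compressed.nbytes.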

Development Quick Start

$ git clone --recurse-submodules https://github.com/iiSeymour/pystreamvbyte.git
$ python3 -m venv .venv
$ source .venv/bin/activate
$ make test

pystreamvbyte's Issues

Ctypes overhead

The profiler results (encoding 8000 ~150MB uint32 arrays) show that data marshalling takes the majority of the runtime: the two data_as pointer conversions alone account for about 56% of the total time.

Total time: 1.17961 s
Function: encode at line 81

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    81                                               @wraps(c_func)
    82                                               def encode(data, prev=0):
    83                                           
    84      8000      75564.0      9.4      6.4          if np.issubdtype(data.dtype, np.signedinteger):
    85                                                       diffs = np.ediff1d(data, to_begin=data[0])
    86                                                       shift = np.int8(data.dtype.itemsize * 8 - 1)
    87                                                       data = to_zig_zag(diffs, shift)
    88                                           
    89      8000      36216.0      4.5      3.1          if np.issubdtype(data.dtype, np.uint16):
    90                                                       data = data.astype(np.uint32)
    91                                           
    92      8000     199533.0     24.9     16.9          output = np.zeros(max_compressed_bytes(len(data)), dtype=np.uint8)
    93      8000       5426.0      0.7      0.5          encoded_size = c_func(
    94      8000     355436.0     44.4     30.1              data.ctypes.data_as(ctypes.POINTER(ctypes.c_uint32)),
    95      8000       7960.0      1.0      0.7              len(data),
    96      8000     310685.0     38.8     26.3              output.ctypes.data_as(ctypes.POINTER(ctypes.c_uint8)),
    97      8000     175095.0     21.9     14.8              prev
    98                                                   )
    99      8000      13693.0      1.7      1.2          return output[:encoded_size]

Investigate the performance of other C lib binding methods (e.g. cffi).
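
One cheap experiment before switching binding libraries is to compare typed pointer conversion against passing raw buffer addresses. The sketch below is only an illustration: it uses ctypes.memmove as a stand-in for the library's C function, and whether the difference carries over to pystreamvbyte is an untested assumption. For a real library function, argtypes would likely need to be declared as c_void_p for integer addresses to be passed safely.

```python
import ctypes
import timeit

import numpy as np

src = np.arange(1024, dtype=np.uint32)
dst = np.zeros_like(src)

def copy_typed():
    # Builds two typed pointer objects per call, as the profiled wrapper does.
    ctypes.memmove(
        dst.ctypes.data_as(ctypes.POINTER(ctypes.c_uint32)),
        src.ctypes.data_as(ctypes.POINTER(ctypes.c_uint32)),
        src.nbytes,
    )

def copy_raw():
    # Passes plain integer addresses; ctypes converts them to void pointers.
    ctypes.memmove(dst.ctypes.data, src.ctypes.data, src.nbytes)

# Both variants must produce the same result before timing them.
copy_raw()
assert np.array_equal(dst, src)

typed_s = timeit.timeit(copy_typed, number=10_000)
raw_s = timeit.timeit(copy_raw, number=10_000)
print(f"typed pointers: {typed_s:.4f} s, raw addresses: {raw_s:.4f} s")
```

The arrays must stay alive and contiguous for the duration of the call, since a raw address carries no reference to the underlying buffer.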

Overhead of signed types

Encoding/decoding of int16/int32 arrays has a large overhead (~75%) because the delta and zigzag steps are done in Python. There is no low-hanging optimization left on the Python side, as the implementation already uses numpy ufuncs.

Total time: 0.052461 s
Function: encode at line 80

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    80                                               @wraps(c_func)
    81                                               def encode(data, prev=0):
    82                                           
    83         1         40.0     40.0      0.1          if np.issubdtype(data.dtype, np.signedinteger):
    84         1      10796.0  10796.0     20.6              diffs = np.ediff1d(data, to_begin=data[0])
    85         1          9.0      9.0      0.0              shift = data.dtype.itemsize * 8 - 1
    86         1      26187.0  26187.0     49.9              data = to_zig_zag(diffs, np.int32(shift))
    87                                           
    88         1         39.0     39.0      0.1          if np.issubdtype(data.dtype, np.uint16):
    89                                                       data = data.astype(np.uint32)
    90                                           
    91         1         42.0     42.0      0.1          output = np.zeros(max_compressed_bytes(len(data)), dtype=np.uint8)
    92         1          1.0      1.0      0.0          encoded_size = c_func(
    93         1        230.0    230.0      0.4              data.ctypes.data_as(ctypes.POINTER(ctypes.c_uint32)),
    94         1          1.0      1.0      0.0              len(data),
    95         1         41.0     41.0      0.1              output.ctypes.data_as(ctypes.POINTER(ctypes.c_uint8)),
    96         1      15066.0  15066.0     28.7              prev
    97                                                   )
    98         1          9.0      9.0      0.0          return output[:encoded_size]

Total time: 0.083777 s
Function: decode at line 111

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   111                                               @wraps(c_func)
   112                                               def decode(data, n, prev=0, dtype=None):
   113                                           
   114         1         31.0     31.0      0.0          output = np.zeros(n, dtype=np.uint32)
   115         1          1.0      1.0      0.0          c_func(
   116         1        105.0    105.0      0.1              data.ctypes.data_as(ctypes.POINTER(ctypes.c_uint8)),
   117         1         31.0     31.0      0.0              output.ctypes.data_as(ctypes.POINTER(ctypes.c_uint32)),
   118         1          0.0      0.0      0.0              n,
   119         1      20428.0  20428.0     24.4              prev,
   120                                                   )
   121                                           
   122         1         51.0     51.0      0.1          if dtype and np.issubdtype(dtype, np.signedinteger):
   123         1      20575.0  20575.0     24.6              zigzag = from_zig_zag(output)
   124         1      42553.0  42553.0     50.8              output = np.cumsum(zigzag, dtype=dtype)
   125                                                   elif dtype and output.dtype != dtype:
   126                                                       return output.astype(dtype)
   127         1          2.0      2.0      0.0          return output

Maybe @lemire already has an efficient int16, int32 -> uint32 zigzag implementation and/or is interested in supporting signed types in streamvbyte natively?
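
For reference, the delta plus zigzag transform that the profiles attribute to Python can be sketched in plain numpy as follows. This is a minimal illustration for non-empty int32 input only. The function names are hypothetical, and the library's own to_zig_zag and from_zig_zag may be implemented differently.

```python
import numpy as np

def zigzag_encode(values):
    """Map signed int32 to uint32 so small magnitudes stay small: 0, -1, 1, -2 -> 0, 1, 2, 3."""
    v = np.asarray(values, dtype=np.int32)
    return ((v << 1) ^ (v >> 31)).view(np.uint32)

def zigzag_decode(encoded):
    """Inverse of zigzag_encode."""
    u = np.asarray(encoded, dtype=np.uint32)
    return (u >> 1).view(np.int32) ^ -(u & 1).view(np.int32)

def delta_zigzag_encode(data):
    """Difference consecutive values, then zigzag the signed differences."""
    data = np.asarray(data, dtype=np.int32)
    diffs = np.ediff1d(data, to_begin=data[0])
    return zigzag_encode(diffs)

def delta_zigzag_decode(encoded):
    """Undo the zigzag mapping, then rebuild the values with a running sum."""
    return np.cumsum(zigzag_decode(encoded), dtype=np.int32)

data = np.array([5, 3, 3, 10, -7], dtype=np.int32)
encoded = delta_zigzag_encode(data)
recovered = delta_zigzag_decode(encoded)
print(encoded, recovered)
```

Each step allocates a full temporary array, which is one plausible reason the same transform would be cheaper if fused into the C encoder.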

Publish package for Python3.8 and Python3.9

Hey folks, I was just toying around with streamvbyte and this Python wrapper is so nice for quick experiments! 🤗

I was wondering if we can publish this package for Python 3.8 and Python 3.9, too.

On https://pypi.org/project/pystreamvbyte/#files I can only see published packages for these Python versions:

  • 2.7 - end of life
  • 3.4 - end of life
  • 3.5 - end of life
  • 3.6 - active, end of life on 2021-12-23
  • 3.7 - active, end of life on 2023-06-27

and was wondering what we have to do to get more up to date packages there.

Packaging and Publishing

Package the C lib properly and push to PyPI.

  • x86 Linux wheels
  • arm64 Linux wheels
  • macOS wheels
