
compress_pickle's Introduction

compress_pickle

Standard python pickle, thinly wrapped with standard compression libraries


The standard pickle package provides an excellent default tool for serializing arbitrary python objects and storing them to disk. Standard python also includes a broad set of data compression packages. compress_pickle provides an interface to the standard pickle.dump, pickle.load, pickle.dumps and pickle.loads functions, but wraps them in order to direct the serialized data through one of the standard compression packages. This way you can seamlessly serialize data to disk or to any file-like object in a compressed way.

compress_pickle supports python >= 3.6. If you must support python 3.5, install compress_pickle==v1.1.1.

Supported compression protocols:

  • gzip
  • bz2
  • lzma
  • zipfile

Furthermore, compress_pickle supports the lz4 compression protocol, which isn't part of the standard python compression packages. This is provided as an optional extra requirement that can be installed as:

pip install compress_pickle[lz4]

Please refer to the package's documentation for more information
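A minimal usage sketch (the file name is illustrative; the compression protocol is inferred from the ".gz" extension by default):

import compress_pickle

# Serialize an object to a gzip-compressed pickle file and read it back.
compress_pickle.dump([1, 2, 3], "example.gz")
assert compress_pickle.load("example.gz") == [1, 2, 3]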

compress_pickle's People

Contributors

eode, lucianopaz


compress_pickle's Issues

Failure loading object

In my Anaconda install I'm using compress_pickle 2.1.0 with python 3.9.0, and I get this error when I try to load a pickled object:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
/tmp/ipykernel_6167/1779668135.py in <module>
     10     print("loading from pickle: "+feat_model_path)
     11     with open(feat_model_path, 'rb') as f:
---> 12         feat_automl = pickle.load(f)
     13 
     14 print(feat_automl.sprint_statistics())

~/anaconda3/lib/python3.9/site-packages/compress_pickle/compress_pickle.py in load(path, compression, pickler_method, pickler_kwargs, mode, set_default_extension, **kwargs)
    270         pickler_kwargs = {}
    271     try:
--> 272         output = uncompress_and_unpickle(
    273             compresser,
    274             pickler=pickler,

~/anaconda3/lib/python3.9/functools.py in wrapper(*args, **kw)
    886                             '1 positional argument')
    887 
--> 888         return dispatch(args[0].__class__)(*args, **kw)
    889 
    890     funcname = getattr(func, '__name__', 'singledispatch function')

~/anaconda3/lib/python3.9/site-packages/compress_pickle/io/base.py in default_uncompress_and_unpickle(compresser, pickler, **kwargs)
     97     compresser: BaseCompresser, pickler: BasePicklerIO, **kwargs
     98 ) -> Any:
---> 99     return pickler.load(stream=compresser.get_stream(), **kwargs)

~/anaconda3/lib/python3.9/site-packages/compress_pickle/picklers/pickle.py in load(self, stream, **kwargs)
     43             The python object that was loaded.
     44         """
---> 45         return pickle.load(stream, **kwargs)
     46 
     47 

TypeError: __randomstate_ctor() takes from 0 to 1 positional arguments but 2 were given

I wasn't seeing this error with Python 3.10 on Ubuntu.

Fails deserializing multiple objects

Pickle is capable of storing multiple objects in the same file as each dump is self-contained.

Hence the following:

from pickle import dump, load
with open('test.gz', 'wb') as f:
    dump(1, f)
    dump(2, f)
with open('test.gz', 'rb') as f:
    print(load(f))
    print(load(f))

Returns

1
2

But with this library:

from compress_pickle import dump, load
with open('test.gz', 'wb') as f:
    dump(1, f, compression='gzip')
    dump(2, f, compression='gzip')
with open('test.gz', 'rb') as f:
    print(load(f, compression='gzip'))
    print(load(f, compression='gzip'))

It returns

1
\compress_pickle\compress_pickle.py in load(path, compression, mode, fix_imports, encoding, errors, buffers, arcname, set_default_extension, unhandled_extensions, **kwargs)
    334     else:
    335         try:
--> 336             output = pickle.load(  # type: ignore
    337                 io_stream,
    338                 encoding=encoding,

EOFError: Ran out of input
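A sketch of a workaround using only the standard library: open the gzip stream once and let plain pickle append the self-contained dumps to it, rather than going through compress_pickle for each object.

import gzip
import pickle

# Write two self-contained pickles into a single gzip stream.
with gzip.open('test.gz', 'wb') as f:
    pickle.dump(1, f)
    pickle.dump(2, f)

# Read them back in order from the same stream.
with gzip.open('test.gz', 'rb') as f:
    print(pickle.load(f))  # 1
    print(pickle.load(f))  # 2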

Change deployment pipeline workflow

The CI/CD pipeline is a bit of a mess, mostly the part related to deployment, which shouldn't push to master, but to gh-pages instead. It would also be nice to migrate from azure into Github actions.

Support `with open` context manager

Currently, the following snippet does not infer the compression scheme properly and errors:

import compress_pickle as cpickle

with open('test.lzma', 'wb') as f:    # Same issue for other extensions
    cpickle.dump([1, 2, 3], f)

Which results in the following error:

env/lib/python3.8/site-packages/compress_pickle/utils.py in instantiate_compresser(compression, path, mode, set_default_extension, **kwargs)
    110         _path = _stringyfy_path(path)
    111     if compression == "infer":
--> 112         compression = _infer_compression_from_path(_path)
    113     compresser_class = get_compresser(compression)
    114     if set_default_extension and isinstance(path, PATH_TYPES):

UnboundLocalError: local variable '_path' referenced before assignment

While the docstring says "a file-like object (io.BaseIO instances) [...] will be passed to the BaseCompresser class", this does not happen. I'd suggest grabbing the filename like so:

    if isinstance(path, PATH_TYPES):
        _path = _stringyfy_path(path)
    elif isinstance(path, io.IOBase):
        _path = path.name                          # this would set _path to 'test.lzma' in the above example
    else:
        raise RuntimeError("Unrecognized path")

    if compression == "infer":
        compression = _infer_compression_from_path(_path)

The RuntimeError is only there to ensure that the variable _path is always defined; something more descriptive could be added.

Any thoughts on this? I don't have time to submit a PR this week, but I'd be happy to at a later time. Thanks for this excellent library!
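In the meantime, a workaround sketch that sidesteps the inference step (the compression argument is the same one used in the other examples on this page):

import compress_pickle as cpickle

# Passing the compression explicitly means the _infer_compression_from_path
# branch shown in the traceback above is never reached for file objects.
with open('test.lzma', 'wb') as f:
    cpickle.dump([1, 2, 3], f, compression='lzma')

with open('test.lzma', 'rb') as f:
    print(cpickle.load(f, compression='lzma'))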

Add support for "XZ" extension

Extension .xz seems to be standard for LZMA compression. Can you add support for this?

If I dump a variable to a pickle file, then compress it with the command-line 'xz' utility, I can read back the compressed file with the lzma library. I create an uncompressed pickle file as follows:

pickle.dump(test, open("test.pkl", "wb"))

Then from the command-line, I run:

$ xz -9v test.pkl

This converts 'test.pkl' to 'test.pkl.xz'.

I can load the compressed file with the lzma library:
test = pickle.load(lzma.open("test.pkl.xz", "rb"))

But if I try to do this with compress_pickle, it breaks:
test = compress_pickle.load(r"C:\Users\domla\tmp\test_pkl.pkl.xz")
triggers the following exception:

ValueError: Cannot infer compression protocol from filename test_pkl.pkl.xz with extension .xz

Or if I explicitly set the compression type:
test = compress_pickle.load(r"C:\Users\domla\tmp\test_pkl.pkl.xz", compression='lzma')

FileNotFoundError: [Errno 2] No such file or directory: 'test_pkl.pkl.xz.lzma'

If I rename the file so that it has extension '.lzma', compress_pickle loads fine.

$ mv test.pkl.xz test.pkl.lzma

test = compress_pickle.load(r"C:\Users\domla\tmp\test.pkl.lzma", compression='lzma')

In regard to the FileNotFoundError, the behavior of compress_pickle is slightly unexpected. If I supply a filename and a compression protocol, I would expect compress_pickle to try to load the filename as-is first, before trying any funny business like sticking an extension on it. This seems obvious enough. It only gets murky if both 'test.pkl' and 'test.pkl.lzma' exist, although even then my personal expectation is that compress_pickle should always default to the filename exactly as supplied.
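A workaround sketch, based on the set_default_extension parameter visible in the load signature in the traceback above: disabling the default-extension logic should make compress_pickle use the path exactly as supplied.

import compress_pickle

# Keep the ".xz" path untouched and state the compression explicitly.
test = compress_pickle.load(
    r"C:\Users\domla\tmp\test_pkl.pkl.xz",
    compression="lzma",
    set_default_extension=False,
)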

"object of type 'pickle.PickleBuffer' has no len()" error with large numpy arrays

I get an "object of type 'pickle.PickleBuffer' has no len()" error for any compression other than gzip if the data contains a large numpy array.

It works for small numpy arrays.

I'm pretty sure it's the same issue as pandas-dev/pandas#39376

ipython
Python 3.9.0 (default, Oct 13 2020, 14:30:47)
Type 'copyright', 'credits' or 'license' for more information
IPython 7.19.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import compress_pickle
   ...: import numpy as np
   ...: import pickle
   ...:
   ...: dnp = {"np_array": np.zeros((100, 37000, 3))}
   ...:
   ...: pickled = pickle.dumps(dnp)
   ...:

In [2]: len(pickled)
Out[2]: 88800178

In [3]:
   ...: pickled = compress_pickle.dumps(dnp, compression='gzip')
   ...: len(pickled)
Out[3]: 86506

In [4]: pickled = compress_pickle.dumps(dnp, compression='zipfile')
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-4-0201a3824617> in <module>
----> 1 pickled = compress_pickle.dumps(dnp, compression='zipfile')

~/git/Genscape/dispatch-modeler/.venv/lib/python3.9/site-packages/compress_pickle/compress_pickle.py in dumps(obj, compression, protocol, fix_imports, buffer_callback, optimize, **kwargs)
    206     validate_compression(compression, infer_is_valid=False)
    207     with io.BytesIO() as stream:
--> 208         dump(
    209             obj,
    210             path=stream,

~/git/Genscape/dispatch-modeler/.venv/lib/python3.9/site-packages/compress_pickle/compress_pickle.py in dump(obj, path, compression, mode, protocol, fix_imports, buffer_callback, unhandled_extensions, set_default_extension, optimize, **kwargs)
    125                 io_stream.write(buff)
    126             else:
--> 127                 pickle.dump(  # type: ignore
    128                     obj,
    129                     io_stream,

~/.pyenv/versions/3.9.0/lib/python3.9/zipfile.py in write(self, data)
   1121         if self.closed:
   1122             raise ValueError('I/O operation on closed file.')
-> 1123         nbytes = len(data)
   1124         self._file_size += nbytes
   1125         self._crc = crc32(data, self._crc)

TypeError: object of type 'pickle.PickleBuffer' has no len()

In [5]: pickled = compress_pickle.dumps(dnp, compression='lz4')
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-5-90f985e33a25> in <module>
----> 1 pickled = compress_pickle.dumps(dnp, compression='lz4')

~/git/Genscape/dispatch-modeler/.venv/lib/python3.9/site-packages/compress_pickle/compress_pickle.py in dumps(obj, compression, protocol, fix_imports, buffer_callback, optimize, **kwargs)
    206     validate_compression(compression, infer_is_valid=False)
    207     with io.BytesIO() as stream:
--> 208         dump(
    209             obj,
    210             path=stream,

~/git/Genscape/dispatch-modeler/.venv/lib/python3.9/site-packages/compress_pickle/compress_pickle.py in dump(obj, path, compression, mode, protocol, fix_imports, buffer_callback, unhandled_extensions, set_default_extension, optimize, **kwargs)
    149                 io_stream.write(buff)
    150             else:
--> 151                 pickle.dump(obj, io_stream, protocol=protocol, fix_imports=fix_imports)
    152         finally:
    153             io_stream.flush()

~/git/Genscape/dispatch-modeler/.venv/lib/python3.9/site-packages/lz4/frame/__init__.py in write(self, data)
    694         compressed = self._compressor.compress(data)
    695         self._fp.write(compressed)
--> 696         self._pos += len(data)
    697         return len(data)
    698

TypeError: object of type 'pickle.PickleBuffer' has no len()

small numpy array

In [1]: import compress_pickle
   ...: import numpy as np

In [2]: dnp = {"np_array": np.zeros((10, 37, 3))}

In [3]: compress_pickle.utils.get_known_compressions()
Out[3]: [None, 'pickle', 'gzip', 'bz2', 'lzma', 'zipfile', 'lz4']

In [4]: pickled = compress_pickle.dumps(dnp, compression='zipfile')

In [5]: len(pickled)
Out[5]: 9136

In [6]: unpickled = compress_pickle.loads(pickled, compression='zipfile')

In [8]: unpickled['np_array'].shape
Out[8]: (10, 37, 3)
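A hedged workaround, assuming the failure comes from the out-of-band buffer writes that pickle protocol 5 performs for large contiguous arrays: forcing protocol 4 (the protocol argument appears in the dumps signature in the traceback above) keeps everything in-band, so the zipfile and lz4 streams only ever receive plain bytes, at the cost of an extra memory copy.

import compress_pickle
import numpy as np

dnp = {"np_array": np.zeros((100, 37000, 3))}

# Sketch: protocol 4 never emits pickle.PickleBuffer objects.
pickled = compress_pickle.dumps(dnp, compression="zipfile", protocol=4)
unpickled = compress_pickle.loads(pickled, compression="zipfile")
assert unpickled["np_array"].shape == (100, 37000, 3)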

Restructure package to improve maintainability and extensibility

compress_pickle started out small, as a single-function script that only provided the dump and load functions. I've expanded the functionality a bit and split the package up into different modules, but the underlying paradigm has always been functional. I feel that the package is now reaching a point where extending it is hard without introducing multiple if-statement branches and spaghetti code. In particular, #16 will be a bit ugly to get through with the current structure.

For the moment, I don't have the time to restructure the code into an object-oriented implementation, which would be easier to maintain and to extend, so it's likely that I'll first spin out a small patch to close #16. In the long run, though, it would be really nice to aim for an OO structure for compress_pickle.

dumping fails with the .lzma extension

I have a piece of code that looks like:

import os

import compress_pickle as cpkl

def dump_object(filename, to_dump):
    """A minor helper to dump the object into filename."""
    os.makedirs(os.path.dirname(filename), exist_ok=True)
    with open(filename, "bw") as fh:
        cpkl.dump(to_dump, fh)

When running it, I get the error:

  File "script_logging.py", line 59, in dump_object
    cpkl.dump(to_dump, fh)
  File "/home/jrmet/.local/lib/python3.8/site-packages/compress_pickle/compress_pickle.py", line 96, in dump
    compresser = instantiate_compresser(
  File "/home/jrmet/.local/lib/python3.8/site-packages/compress_pickle/utils.py", line 112, in instantiate_compresser
    compression = _infer_compression_from_path(_path)
UnboundLocalError: local variable '_path' referenced before assignment

Is this a user or package problem? :)
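It looks like the same inference failure hit in the "Support `with open` context manager" issue above; until that is fixed, a sketch of a stop-gap is to state the compression explicitly whenever a file object is passed:

import os

import compress_pickle as cpkl

def dump_object(filename, to_dump):
    """Same helper, but naming the compression explicitly for file objects."""
    os.makedirs(os.path.dirname(filename), exist_ok=True)
    with open(filename, "bw") as fh:
        cpkl.dump(to_dump, fh, compression="lzma")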

Implementation for dumps and loads

Hello,
The package is very useful; however, if you want to use file storage other than the local filesystem, like GridFS or any remote FS, you need dumps and loads functionality for in-memory compression.
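For reference, the dumps/loads pair shown elsewhere on this page covers exactly this; a minimal in-memory round trip might look like:

import compress_pickle

# Sketch: compress to bytes in memory, e.g. before handing them to GridFS
# or any other remote storage backend, then restore them.
blob = compress_pickle.dumps({"a": 1, "b": [1, 2, 3]}, compression="gzip")
restored = compress_pickle.loads(blob, compression="gzip")
assert restored == {"a": 1, "b": [1, 2, 3]}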

Separate extensions for Picklers

I'd like to add support for pickler extensions as a separate concept, so that, for example, "foo.pkl.bz2" is gracefully handled, and "baz.json" is gracefully handled. I would be changing the default pickler_method argument from "pickle" to "infer" -- but the behavior in the majority of circumstances would be the same.

But overall, this issue is a little more touchy than just adding basic JSON support, because it could cause API breakage. I suppose the core of what I'm asking is:

What level of Public-facing API change are you willing to accept?

  • I imagine changing the default pickler_method to 'infer' is fine.
  • I imagine adding new functions is fine
  • I imagine the API of dump, dumps, load, and loads should remain substantially the same.

However, what about repurposing or renaming other functions? Example:

  • get_registered_extensions should be repurposed or renamed, because its function is no longer fully clear if there are also pickler extensions
    • for example, get_registered_extensions() could:
      • be left as-is, but that would make the name somewhat deceptive
      • return all registered extensions, for both picklers and compressers, but that would require a different data format (because picklers will have more overlap in extension handling, so something like {"pkl": ["pickle", "dill", "cloudpickle"], "bz": ["bz2"]})
      • be renamed to get_registered_compresser_extensions
      • etc.

What I would like to do with it is:

  • If a function's output changes, rename it to explicitly break compatibility (rather than implicitly, by having different output).
  • If a function's name is no longer reflective of its purpose, or implies more than it does, rename it.
  • All functions that do effectively identical things for compressers and picklers would be named or renamed according to a common theme, like get_default_compresser_name_mapping and get_default_pickler_name_mapping.
  • It might be easier to let the public API for registries be handled through module references, like compressers.get_default_extension_map() and picklers.get_default_extension_map().

Anyway, those are my thoughts currently. In the meantime, I'm trying to have as little impact on the API as possible, to err on the side of caution. The sketch below makes the combined-registry idea concrete.
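A purely hypothetical return value for a combined registry (not an existing compress_pickle API; names and keys are illustrative only):

# Hypothetical: map each extension to every pickler or compresser name that can
# handle it, since pickler extensions overlap more than compresser ones do.
registered_extensions = {
    "pkl": ["pickle", "dill", "cloudpickle"],
    "json": ["json"],
    "gz": ["gzip"],
    "bz": ["bz2"],
}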

File ending with `.pkl.gz` is stored as `.pklgz`

Hello @lucianopaz,

Thank you for making this library!

I wanted to ask why, if the provided path is .pkl.gz, the file is stored as .pklgz.
This feels like an undesired default behaviour that should, if necessary, be opt-in via a flag, as it breaks libraries that depend on the files being stored at exactly the given path.

Ciao e Grazie,
Luca
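A guess at a stop-gap, assuming the renaming comes from the same default-extension logic that appears in the dump signature elsewhere on this page (not verified against this exact case):

import compress_pickle

# Sketch: ask compress_pickle to leave the supplied path untouched.
compress_pickle.dump([1, 2, 3], "data.pkl.gz", set_default_extension=False)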

JSON Support

Would you be interested in JSON (and, for multiples, NDJSON/JSONLines) support for this module? I know its name is "compress_pickle", but the option would be welcome for those who want compressed, serialized data but don't want the concomitant security risks of pickle.

Specifically, if this were implemented, would you accept a PR?

Configure intersphinx

The documentation's conf.py relies on nitpick ignores to silence unknown-reference warnings. It would be much better to use intersphinx instead.
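A minimal sketch of what that could look like in conf.py (the mapping entries are illustrative):

# docs conf.py: resolve standard-library references through intersphinx instead
# of listing every unknown target as a nitpick ignore.
extensions = [
    "sphinx.ext.intersphinx",
    # ... the extensions already configured ...
]

intersphinx_mapping = {
    "python": ("https://docs.python.org/3", None),
}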
