
compress_pickle's Introduction

compress_pickle

Standard python pickle, thinly wrapped with standard compression libraries


The standard pickle package provides an excellent default tool for serializing arbitrary python objects and storing them to disk. Standard python also includes a broad set of data compression packages. compress_pickle provides an interface to the standard pickle.dump, pickle.load, pickle.dumps and pickle.loads functions, but wraps them in order to direct the serialized data through one of the standard compression packages. This way you can seamlessly serialize data to disk or to any file-like object in a compressed way.

compress_pickle supports python >= 3.6. If you must support python 3.5, install compress_pickle==v1.1.1.

Supported compression protocols:

  • gzip
  • bz2
  • lzma
  • zipfile

Furthermore, compress_pickle supports the lz4 compression protocol, which isn't part of the standard python compression packages. This is provided as an optional extra requirement that can be installed as:

pip install compress_pickle[lz4]

Please refer to the package's documentation for more information
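A minimal usage sketch (the file name is illustrative; the compression protocol is inferred from the ".gz" extension by default):

import compress_pickle

# Serialize an object to a gzip-compressed pickle file and read it back.
compress_pickle.dump([1, 2, 3], "example.gz")
assert compress_pickle.load("example.gz") == [1, 2, 3]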

compress_pickle's People

Contributors

eode, lucianopaz


compress_pickle's Issues

Failure loading object

In my Anaconda install I'm using compress_pickle 2.1.0 with python 3.9.0, and I get this error when I try to load a pickled object:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
/tmp/ipykernel_6167/1779668135.py in <module>
     10     print("loading from pickle: "+feat_model_path)
     11     with open(feat_model_path, 'rb') as f:
---> 12         feat_automl = pickle.load(f)
     13 
     14 print(feat_automl.sprint_statistics())

~/anaconda3/lib/python3.9/site-packages/compress_pickle/compress_pickle.py in load(path, compression, pickler_method, pickler_kwargs, mode, set_default_extension, **kwargs)
    270         pickler_kwargs = {}
    271     try:
--> 272         output = uncompress_and_unpickle(
    273             compresser,
    274             pickler=pickler,

~/anaconda3/lib/python3.9/functools.py in wrapper(*args, **kw)
    886                             '1 positional argument')
    887 
--> 888         return dispatch(args[0].__class__)(*args, **kw)
    889 
    890     funcname = getattr(func, '__name__', 'singledispatch function')

~/anaconda3/lib/python3.9/site-packages/compress_pickle/io/base.py in default_uncompress_and_unpickle(compresser, pickler, **kwargs)
     97     compresser: BaseCompresser, pickler: BasePicklerIO, **kwargs
     98 ) -> Any:
---> 99     return pickler.load(stream=compresser.get_stream(), **kwargs)

~/anaconda3/lib/python3.9/site-packages/compress_pickle/picklers/pickle.py in load(self, stream, **kwargs)
     43             The python object that was loaded.
     44         """
---> 45         return pickle.load(stream, **kwargs)
     46 
     47 

TypeError: __randomstate_ctor() takes from 0 to 1 positional arguments but 2 were given

I wasn't seeing this error with Python 3.10 on Ubuntu.

Fails deserializing multiple objects

Pickle is capable of storing multiple objects in the same file as each dump is self-contained.

Hence the following:

from pickle import dump, load
with open('test.gz', 'wb') as f:
    dump(1, f)
    dump(2, f)
with open('test.gz', 'rb') as f:
    print(load(f))
    print(load(f))

Returns

1
2

But with this library:

from compress_pickle import dump, load
with open('test.gz', 'wb') as f:
    dump(1, f, compression='gzip')
    dump(2, f, compression='gzip')
with open('test.gz', 'rb') as f:
    print(load(f, compression='gzip'))
    print(load(f, compression='gzip'))

It returns

1
\compress_pickle\compress_pickle.py in load(path, compression, mode, fix_imports, encoding, errors, buffers, arcname, set_default_extension, unhandled_extensions, **kwargs)
    334     else:
    335         try:
--> 336             output = pickle.load(  # type: ignore
    337                 io_stream,
    338                 encoding=encoding,

EOFError: Ran out of input
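A sketch of a workaround using only the standard library: open the gzip stream once and let plain pickle append the self-contained dumps to it, rather than going through compress_pickle for each object.

import gzip
import pickle

# Write two self-contained pickles into a single gzip stream.
with gzip.open('test.gz', 'wb') as f:
    pickle.dump(1, f)
    pickle.dump(2, f)

# Read them back in order from the same stream.
with gzip.open('test.gz', 'rb') as f:
    print(pickle.load(f))  # 1
    print(pickle.load(f))  # 2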

Change deployment pipeline workflow

The CI/CD pipeline is a bit of a mess, mostly the part related to deployment, which shouldn't push to master, but to gh-pages instead. It would also be nice to migrate from azure into Github actions.

Support `with open` context manager

Currently, the following snippet does not infer the compression scheme properly and errors:

import compress_pickle as cpickle

with open('test.lzma', 'wb') as f:    # Same issue for other extensions
    cpickle.dump([1, 2, 3], f)

Which results in the following error:

env/lib/python3.8/site-packages/compress_pickle/utils.py in instantiate_compresser(compression, path, mode, set_default_extension, **kwargs)
    110         _path = _stringyfy_path(path)
    111     if compression == "infer":
--> 112         compression = _infer_compression_from_path(_path)
    113     compresser_class = get_compresser(compression)
    114     if set_default_extension and isinstance(path, PATH_TYPES):

UnboundLocalError: local variable '_path' referenced before assignment

While the docstring says "a file-like object (io.BaseIO instances) [...] will be passed to the BaseCompresser class", this does not happen. I'd suggest grabbing the filename like so:

    if isinstance(path, PATH_TYPES):
        _path = _stringyfy_path(path)
    elif isinstance(path, io.IOBase):
        _path = path.name                          # this would set _path to 'test.lzma' in the above example
    else:
        raise RuntimeError("Unrecognized path")

    if compression == "infer":
        compression = _infer_compression_from_path(_path)

The RuntimeError is only there to ensure that the variable _path is always defined; something more descriptive could be added.

Any thoughts on this? I don't have time to submit a PR this week, but I'd be happy to at a later time. Thanks for this excellent library!
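In the meantime, a workaround sketch that sidesteps the inference step (the compression argument is the same one used in the other examples on this page):

import compress_pickle as cpickle

# Passing the compression explicitly means the _infer_compression_from_path
# branch shown in the traceback above is never reached for file objects.
with open('test.lzma', 'wb') as f:
    cpickle.dump([1, 2, 3], f, compression='lzma')

with open('test.lzma', 'rb') as f:
    print(cpickle.load(f, compression='lzma'))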

Add support for "XZ" extension

Extension .xz seems to be standard for LZMA compression. Can you add support for this?

If I dump a variable to a pickle file, then compress it with the command-line 'xz' utility, I can read back the compressed file with the lzma library. I create an uncompressed pickle file as follows:

pickle.dump(test, open("test.pkl", "wb"))

Then from the command-line, I run:

$ xz -9v test.pkl

This converts 'test.pkl' to 'test.pkl.xz'.

I can load the compressed file with the lzma library:
test = pickle.load(lzma.open("test.pkl.xz", "rb"))

But if I try to do this with compress_pickle, it breaks:
test = compress_pickle.load(r"C:\Users\domla\tmp\test_pkl.pkl.xz")
triggers the following exception:

ValueError: Cannot infer compression protocol from filename test_pkl.pkl.xz with extension .xz

Or if I explicitly set the compression type:
test = compress_pickle.load(r"C:\Users\domla\tmp\test_pkl.pkl.xz", compression='lzma')

FileNotFoundError: [Errno 2] No such file or directory: 'test_pkl.pkl.xz.lzma'

If I rename the file so that it has extension '.lzma', compress_pickle loads fine.

$ mv test.pkl.xz test.pkl.lzma

test = compress_pickle.load(r"C:\Users\domla\tmp\test.pkl.lzma", compression='lzma')

In regard to the FileNotFoundError, the behavior of compress_pickle is slightly unexpected. If I supply a filename and a compression protocol, I would expect compress_pickle to try to load the filename as-is first, before trying any funny business like sticking an extension on it. This seems obvious enough. It only gets murky if both 'test.pkl' and 'test.pkl.lzma' exist, although even then my personal expectation is that compress_pickle should always default to the filename exactly as supplied.
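A workaround sketch, based on the set_default_extension parameter visible in the load signature in the traceback above: disabling the default-extension logic should make compress_pickle use the path exactly as supplied.

import compress_pickle

# Keep the ".xz" path untouched and state the compression explicitly.
test = compress_pickle.load(
    r"C:\Users\domla\tmp\test_pkl.pkl.xz",
    compression="lzma",
    set_default_extension=False,
)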

"object of type 'pickle.PickleBuffer' has no len()" error with large numpy arrays

I get an "object of type 'pickle.PickleBuffer' has no len()" error for any compression other than gzip if the data contains a large numpy array.

It works for small numpy arrays.

I'm pretty sure it's the same issue as pandas-dev/pandas#39376

ipython
Python 3.9.0 (default, Oct 13 2020, 14:30:47)
Type 'copyright', 'credits' or 'license' for more information
IPython 7.19.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import compress_pickle
   ...: import numpy as np
   ...: import pickle
   ...:
   ...: dnp = {"np_array": np.zeros((100, 37000, 3))}
   ...:
   ...: pickled = pickle.dumps(dnp)
   ...:

In [2]: len(pickled)
Out[2]: 88800178

In [3]:
   ...: pickled = compress_pickle.dumps(dnp, compression='gzip')
   ...: len(pickled)
Out[3]: 86506

In [4]: pickled = compress_pickle.dumps(dnp, compression='zipfile')
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-4-0201a3824617> in <module>
----> 1 pickled = compress_pickle.dumps(dnp, compression='zipfile')

~/git/Genscape/dispatch-modeler/.venv/lib/python3.9/site-packages/compress_pickle/compress_pickle.py in dumps(obj, compression, protocol, fix_imports, buffer_callback, optimize, **kwargs)
    206     validate_compression(compression, infer_is_valid=False)
    207     with io.BytesIO() as stream:
--> 208         dump(
    209             obj,
    210             path=stream,

~/git/Genscape/dispatch-modeler/.venv/lib/python3.9/site-packages/compress_pickle/compress_pickle.py in dump(obj, path, compression, mode, protocol, fix_imports, buffer_callback, unhandled_extensions, set_default_extension, optimize, **kwargs)
    125                 io_stream.write(buff)
    126             else:
--> 127                 pickle.dump(  # type: ignore
    128                     obj,
    129                     io_stream,

~/.pyenv/versions/3.9.0/lib/python3.9/zipfile.py in write(self, data)
   1121         if self.closed:
   1122             raise ValueError('I/O operation on closed file.')
-> 1123         nbytes = len(data)
   1124         self._file_size += nbytes
   1125         self._crc = crc32(data, self._crc)

TypeError: object of type 'pickle.PickleBuffer' has no len()

In [5]: pickled = compress_pickle.dumps(dnp, compression='lz4')
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-5-90f985e33a25> in <module>
----> 1 pickled = compress_pickle.dumps(dnp, compression='lz4')

~/git/Genscape/dispatch-modeler/.venv/lib/python3.9/site-packages/compress_pickle/compress_pickle.py in dumps(obj, compression, protocol, fix_imports, buffer_callback, optimize, **kwargs)
    206     validate_compression(compression, infer_is_valid=False)
    207     with io.BytesIO() as stream:
--> 208         dump(
    209             obj,
    210             path=stream,

~/git/Genscape/dispatch-modeler/.venv/lib/python3.9/site-packages/compress_pickle/compress_pickle.py in dump(obj, path, compression, mode, protocol, fix_imports, buffer_callback, unhandled_extensions, set_default_extension, optimize, **kwargs)
    149                 io_stream.write(buff)
    150             else:
--> 151                 pickle.dump(obj, io_stream, protocol=protocol, fix_imports=fix_imports)
    152         finally:
    153             io_stream.flush()

~/git/Genscape/dispatch-modeler/.venv/lib/python3.9/site-packages/lz4/frame/__init__.py in write(self, data)
    694         compressed = self._compressor.compress(data)
    695         self._fp.write(compressed)
--> 696         self._pos += len(data)
    697         return len(data)
    698

TypeError: object of type 'pickle.PickleBuffer' has no len()

small numpy array

In [1]: import compress_pickle
   ...: import numpy as np

In [2]: dnp = {"np_array": np.zeros((10, 37, 3))}

In [3]: compress_pickle.utils.get_known_compressions()
Out[3]: [None, 'pickle', 'gzip', 'bz2', 'lzma', 'zipfile', 'lz4']

In [4]: pickled = compress_pickle.dumps(dnp, compression='zipfile')

In [5]: len(pickled)
Out[5]: 9136

In [6]: unpickled = compress_pickle.loads(pickled, compression='zipfile')

In [8]: unpickled['np_array'].shape
Out[8]: (10, 37, 3)
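A hedged workaround, assuming the failure comes from the out-of-band buffer writes that pickle protocol 5 performs for large contiguous arrays: forcing protocol 4 (the protocol argument appears in the dumps signature in the traceback above) keeps everything in-band, so the zipfile and lz4 streams only ever receive plain bytes, at the cost of an extra memory copy.

import compress_pickle
import numpy as np

dnp = {"np_array": np.zeros((100, 37000, 3))}

# Sketch: protocol 4 never emits pickle.PickleBuffer objects.
pickled = compress_pickle.dumps(dnp, compression="zipfile", protocol=4)
unpickled = compress_pickle.loads(pickled, compression="zipfile")
assert unpickled["np_array"].shape == (100, 37000, 3)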

Restructure package to improve maintainability and extensibility

compress_pickle started out small, as a single-function script that only provided the dump and load functions. I've expanded the functionality a bit and split the package up into different modules, but the underlying paradigm has always been functional. I feel that the package is now reaching a point where extending it is hard without introducing multiple if-statement branches and spaghetti code. In particular, #16 will be a bit ugly to get through with the current structure.

For the moment, I don't have the time to restructure the code into an object-oriented implementation, which would be easier to maintain and to extend, so it's likely that I'll first spin out a small patch to close #16. In the long run, though, it would be really nice to aim for an OO structure for compress_pickle.

dumping fails with the .lzma extension

I have a piece of code that looks like:

import os

import compress_pickle as cpkl

def dump_object(filename, to_dump):
    """A minor helper to dump the object into filename."""
    os.makedirs(os.path.dirname(filename), exist_ok=True)
    with open(filename, "bw") as fh:
        cpkl.dump(to_dump, fh)

When running it, I get the error:

  File "script_logging.py", line 59, in dump_object
    cpkl.dump(to_dump, fh)
  File "/home/jrmet/.local/lib/python3.8/site-packages/compress_pickle/compress_pickle.py", line 96, in dump
    compresser = instantiate_compresser(
  File "/home/jrmet/.local/lib/python3.8/site-packages/compress_pickle/utils.py", line 112, in instantiate_compresser
    compression = _infer_compression_from_path(_path)
UnboundLocalError: local variable '_path' referenced before assignment

Is this a user or package problem? :)
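It looks like the same inference failure hit in the "Support `with open` context manager" issue above; until that is fixed, a sketch of a stop-gap is to state the compression explicitly whenever a file object is passed:

import os

import compress_pickle as cpkl

def dump_object(filename, to_dump):
    """Same helper, but naming the compression explicitly for file objects."""
    os.makedirs(os.path.dirname(filename), exist_ok=True)
    with open(filename, "bw") as fh:
        cpkl.dump(to_dump, fh, compression="lzma")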

Implementation for dumps and loads

Hello,
The package is very useful; however, if you want to use file storage other than the local filesystem, like GridFS or any remote FS, you need dumps and loads functionality for in-memory compression.
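For reference, the dumps/loads pair shown elsewhere on this page covers exactly this; a minimal in-memory round trip might look like:

import compress_pickle

# Sketch: compress to bytes in memory, e.g. before handing them to GridFS
# or any other remote storage backend, then restore them.
blob = compress_pickle.dumps({"a": 1, "b": [1, 2, 3]}, compression="gzip")
restored = compress_pickle.loads(blob, compression="gzip")
assert restored == {"a": 1, "b": [1, 2, 3]}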

Separate extensions for Picklers

I'd like to add support for pickler extensions as a separate concept, so that, for example, "foo.pkl.bz2" is gracefully handled, and "baz.json" is gracefully handled. I would be changing the default pickler_method argument from "pickle" to "infer" -- but the behavior in the majority of circumstances would be the same.

But overall, this issue is a little more touchy than just adding basic JSON support, because it could cause API breakage. I suppose the core of what I'm asking is:

What level of Public-facing API change are you willing to accept?

  • I imagine changing the default pickler_method to 'infer' is fine.
  • I imagine adding new functions is fine
  • I imagine the API of dump, dumps, load, and loads should remain substantially the same.

However, what about repurposing or renaming other functions? Example:

  • get_registered_extensions should be repurposed or renamed, because its function is no longer fully clear if there are also pickler extensions
    • for example, get_registered_extensions() could:
      • be left as-is, but that would make the name somewhat deceptive
      • return all registered extensions, for both picklers and compressers, but that would require a different data format (because picklers will have more overlap in extension handling, so something like {"pkl": ["pickle", "dill", "cloudpickle"], "bz": ["bz2"]})
      • be renamed to get_registered_compresser_extensions
      • etc.

What I would like to do with it is:

  • If a function's output changes, rename it to explicitly break compatibility (rather than implicitly, by having different output).
  • If a function's name is no longer reflective of its purpose, or implies more than it does, rename it.
  • All functions that do effectively identical things for compressers and picklers would be named or renamed according to a common theme, like get_default_compresser_name_mapping and get_default_pickler_name_mapping.
  • It might be easier to let the public API for registries be handled through module references, like compressers.get_default_extension_map() and picklers.get_default_extension_map().

Anyway, those are my thoughts currently. In the meantime, I'm trying to have as little impact on the API as possible, to err on the side of caution. The sketch below makes the combined-registry idea concrete.
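A purely hypothetical return value for a combined registry (not an existing compress_pickle API; names and keys are illustrative only):

# Hypothetical: map each extension to every pickler or compresser name that can
# handle it, since pickler extensions overlap more than compresser ones do.
registered_extensions = {
    "pkl": ["pickle", "dill", "cloudpickle"],
    "json": ["json"],
    "gz": ["gzip"],
    "bz": ["bz2"],
}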

File ending with `.pkl.gz` is stored as `.pklgz`

Hello @lucianopaz,

Thank you for making this library!

I wanted to ask why, if the provided path is .pkl.gz, the file is stored as .pklgz.
This feels like an undesired default behaviour that should, if necessary, be opt-in via a flag, as it breaks libraries that depend on the files being stored at exactly the given path.

Ciao e Grazie,
Luca
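A guess at a stop-gap, assuming the renaming comes from the same default-extension logic that appears in the dump signature elsewhere on this page (not verified against this exact case):

import compress_pickle

# Sketch: ask compress_pickle to leave the supplied path untouched.
compress_pickle.dump([1, 2, 3], "data.pkl.gz", set_default_extension=False)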

JSON Support

Would you be interested in JSON (and, for multiples, NDJSON/JSONLines) support for this module? I know its name is "compress_pickle", but the option would be welcome for those who want compressed, serialized data but don't want the concomitant security risks of pickle.

Specifically, if this were implemented, would you accept a PR?

Configure intersphinx

The documentation's conf.py relies on nitpick ignores to silence unknown-reference warnings. It would be much better to use intersphinx instead.
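A minimal sketch of what that could look like in conf.py (the mapping entries are illustrative):

# docs conf.py: resolve standard-library references through intersphinx instead
# of listing every unknown target as a nitpick ignore.
extensions = [
    "sphinx.ext.intersphinx",
    # ... the extensions already configured ...
]

intersphinx_mapping = {
    "python": ("https://docs.python.org/3", None),
}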
