
chest's Introduction


Blaze translates a subset of modified NumPy and Pandas-like syntax to databases and other computing systems. It gives Python users a familiar interface for querying data that lives in other data storage systems.

Example

We point Blaze at a simple dataset in a foreign database (PostgreSQL) and immediately see results as we would in a Pandas DataFrame.

>>> import blaze as bz
>>> iris = bz.Data('postgresql://localhost::iris')
>>> iris
    sepal_length  sepal_width  petal_length  petal_width      species
0            5.1          3.5           1.4          0.2  Iris-setosa
1            4.9          3.0           1.4          0.2  Iris-setosa
2            4.7          3.2           1.3          0.2  Iris-setosa
3            4.6          3.1           1.5          0.2  Iris-setosa

These results appear immediately. Blaze does not pull the data out of Postgres; instead, it translates your Python commands into SQL (or into the query language of whichever backend holds the data).

>>> iris.species.distinct()
           species
0      Iris-setosa
1  Iris-versicolor
2   Iris-virginica

>>> bz.by(iris.species, smallest=iris.petal_length.min(),
...                      largest=iris.petal_length.max())
           species  largest  smallest
0      Iris-setosa      1.9       1.0
1  Iris-versicolor      5.1       3.0
2   Iris-virginica      6.9       4.5

This same example would have worked with a wide range of databases, on-disk text or binary files, or remote data.
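The `by` expression above is ordinary split-apply-combine. A pure-Python sketch of the same aggregation, using a few made-up `(species, petal_length)` rows for illustration:

```python
from collections import defaultdict

# Hypothetical rows standing in for the iris table.
rows = [('Iris-setosa', 1.4), ('Iris-setosa', 1.0),
        ('Iris-virginica', 6.9), ('Iris-virginica', 4.5)]

# Split: group petal lengths by species.
groups = defaultdict(list)
for species, petal_length in rows:
    groups[species].append(petal_length)

# Apply + combine: compute the min and max per group.
result = {species: {'smallest': min(vals), 'largest': max(vals)}
          for species, vals in groups.items()}
```

Blaze expresses this same pattern once and pushes it down to whichever backend holds the data.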

What Blaze is not

Blaze does not perform computation. It relies on other systems like SQL, Spark, or Pandas to do the actual number crunching. It is not a replacement for any of these systems.

Blaze does not implement the entire NumPy/Pandas API, nor does it interact with libraries intended to work with NumPy/Pandas. This is the cost of using more and larger data systems.

Blaze is a good way to inspect data living in a large database, perform a small but powerful set of operations to query that data, and then transform your results into a format suitable for your favorite Python tools.

In the Abstract

Blaze separates the computations that we want to perform:

>>> from blaze import Symbol, compute
>>> accounts = Symbol('accounts', 'var * {id: int, name: string, amount: int}')

>>> deadbeats = accounts[accounts.amount < 0].name

from the representation of that data:

>>> L = [[1, 'Alice',   100],
...      [2, 'Bob',    -200],
...      [3, 'Charlie', 300],
...      [4, 'Denis',   400],
...      [5, 'Edith',  -500]]

Blaze enables users to solve data-oriented problems:

>>> list(compute(deadbeats, L))
['Bob', 'Edith']

But the separation of expression from data allows us to switch between different backends.

Here we solve the same problem using Pandas instead of pure Python.

>>> from pandas import DataFrame
>>> df = DataFrame(L, columns=['id', 'name', 'amount'])

>>> compute(deadbeats, df)
1      Bob
4    Edith
Name: name, dtype: object

Blaze doesn't compute these results itself; it intelligently drives other projects to compute them. These projects range from simple pure-Python iterators to powerful distributed Spark clusters. Blaze is built to be extended to new systems as they evolve.
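The extension mechanism can be pictured with dispatch on the data's type. Blaze itself uses multiple dispatch; the sketch below uses the stdlib's single dispatch, and all names are illustrative only:

```python
from functools import singledispatch

@singledispatch
def compute_deadbeats(data):
    """Find names with negative amounts; dispatch on the data's type."""
    raise NotImplementedError(type(data))

@compute_deadbeats.register(list)
def _(rows):
    # Pure-Python backend: rows are [id, name, amount] lists.
    return [name for _id, name, amount in rows if amount < 0]
```

Supporting a new backend is then one more `register` call for that backend's container type.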

Getting Started

Blaze is available via conda or PyPI:

conda install blaze
pip install blaze

Development builds are also available:

conda install blaze -c blaze
pip install http://github.com/blaze/blaze --upgrade

You may want to view the docs, the tutorial, some blog posts, or the mailing list archives.

Development setup

The quickest way to install all Blaze dependencies with conda is as follows:

conda install blaze spark -c blaze -c anaconda-cluster -y
conda remove odo blaze blaze-core datashape -y

After running these commands, clone odo, blaze, and datashape from GitHub directly. These three projects release together. Run python setup.py develop to make development installations of each.

License

Released under the BSD license. See LICENSE.txt for details.

Blaze development is sponsored by Continuum Analytics.

chest's People

Contributors

maxhutch, mrocklin, thyrgle


chest's Issues

Odd behavior when given initial data

I just installed Anaconda 4.3 and noticed this module, so I started playing with it and noticed this:

>>> from chest import Chest
>>> d={42:23}
>>> c=Chest(d, "my-chest")
>>> list(c)
[]
>>> 42 in c
False 
>>> c[42]
23 
>>> d
{42: 23}
>>> c.flush()
>>> d
{}
>>> list(c)
[]
>>> 42 in c
False
>>> c[42]
Traceback (most recent call last):
  File "<pyshell#11>", line 1, in <module>
    c[42]
  File "C:\Anaconda3\lib\site-packages\chest\core.py", line 182, in __getitem__
    raise KeyError("Key not found: %s" % str(key))
KeyError: 'Key not found: 42'
>>> c.get_from_disk(42)
>>> c[42]
23
>>> 42 in c
False
>>> list(c)
[]
>>> 

Why does it say that 42 is not in c?
Why does it empty my initializer dict?
Why do I get an empty list when I do list(c)?

But if I close Python, delete the folder, start again, and don't give it initial data, it works fine:

>>> from chest import Chest
>>> c =Chest(path="my-chest")
>>> c[42]=23
>>> 42 in c
True
>>> list(c)
[42]
>>> c.flush()
>>> list(c)
[42]
>>> 42 in c
True
>>> c[42]
23
>>> 

Include pickling with the highest protocol

I am trying to change the protocol used for dumping, but it always complains that it does not understand the protocol argument in dump=partial(pickle.dump, protocol=pickle.HIGHEST_PROTOCOL). Could you make this the default?
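A possible workaround in the meantime, assuming Chest calls dump(value, file) on its dump callable (a sketch, not the library's documented behavior): wrap pickle.dump in a plain function that binds the protocol, and pass that as dump=.

```python
import io
import pickle

def dump_highest(obj, file):
    # Intended for use as Chest(dump=dump_highest, load=pickle.load, ...).
    pickle.dump(obj, file, protocol=pickle.HIGHEST_PROTOCOL)

# Round-trip check with an in-memory buffer.
buf = io.BytesIO()
dump_highest({'answer': 42}, buf)
buf.seek(0)
restored = pickle.load(buf)
```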

Tag for 0.2.3

It would be nice to have a tag for version 0.2.3 so we know where the PyPI release was made.

`data` argument does not work

Hello there,

When using the data argument to initialize the chest with a dict, the in-memory dict does not seem to get correctly initialized. Here is an example:

out = Chest(data={'id': set(), 'label': set()})
out2 = Chest()
out2['id'] = set()
out2['label'] = set()
print(out.keys())
print(out2.keys())

Output:
[]
['id', 'label']

When data is used, the keys are not available and hasattr(out, 'id') returns False, but out['id'] works and returns set().

BTW, it is also not possible to update a chest with a simple dict:

out.update({'id': set([1, 2])})

Output:
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-76-7a010f35ab01> in <module>()
----> 1 out.update({'id': set([1, 2])})

C:\git\centertbi-mri\libs\chest\core.py in update(self, other, overwrite)
    290         #  if already flushed, then this does nothing
    291         self.flush()
--> 292         other.flush()
    293         for key in other._keys:
    294             if key in self._keys and overwrite:

AttributeError: 'dict' object has no attribute 'flush'
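Until this is fixed, one workaround sketch is to insert the items one at a time instead of relying on data= or update(), so each key goes through __setitem__. This works with any MutableMapping; a plain dict stands in for a Chest below.

```python
def fill_mapping(mapping, items):
    # Insert items individually so each one goes through __setitem__.
    for key, value in items.items():
        mapping[key] = value
    return mapping

# A plain dict standing in for Chest() here.
store = fill_mapping({}, {'id': set(), 'label': set()})
```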

run out of inodes by using chest

Hi,

I tried chest on a large file on an Ubuntu 16.04 system. Chest created millions of files on my hard disk and I ran out of inodes. I have plenty of disk space available but cannot use it because inode usage is at its peak. What can I do about it? Can I reduce the millions of generated files to hundreds? Is there an option for this?

Thanks in advance
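Chest stores roughly one file per key, so the inode count grows with the number of keys. A generic mitigation (not built into chest; a sketch only) is to shard keys into a fixed number of bucket files, so the on-disk file count is bounded by the bucket count rather than the key count:

```python
import hashlib

def bucket_for(key, n_buckets=256):
    # Map a key deterministically to one of n_buckets shard files.
    digest = hashlib.md5(repr(key).encode()).hexdigest()
    return int(digest, 16) % n_buckets
```

Many values would then be appended to the same shard file, at the cost of reading a whole shard to fetch one key.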

Python 3.9 support

 /opt/hostedtoolcache/Python/3.8.7/x64/lib/python3.8/site-packages/chest/core.py:1: DeprecationWarning: Using or importing the ABCs from 'collections' instead of from 'collections.abc' is deprecated since Python 3.3, and in 3.9 it will stop working
    from collections import MutableMapping

Would you accept a PR fix, and make a new release?
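The usual backward-compatible import would fix this (a sketch of what such a PR might do):

```python
try:
    # Python 3.3+: the ABCs live in collections.abc (required on 3.10+).
    from collections.abc import MutableMapping
except ImportError:
    # Very old Pythons kept them in collections itself.
    from collections import MutableMapping
```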

read-only mode

I use chests to distribute data. In most cases, the data consumer either doesn't want to change the contents of the chest or doesn't even have the file permissions to. Chests write the .keys file to disk at close, even if nothing has changed. __delitem__ and __setitem__ both delete files even before a flush, so it might be nice to guard them against programmer errors.

Does this seem useful enough to belong in chest itself, or should I implement this on my end in a subclass or the likes?
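Pending a built-in mode, a read-only view can live in a thin wrapper over any Mapping (a sketch, not chest's API; a plain dict stands in for a Chest below):

```python
from collections.abc import Mapping

class ReadOnlyView(Mapping):
    """Expose lookups but block all mutation."""
    def __init__(self, backing):
        self._backing = backing
    def __getitem__(self, key):
        return self._backing[key]
    def __iter__(self):
        return iter(self._backing)
    def __len__(self):
        return len(self._backing)

view = ReadOnlyView({42: 23})
```

Because Mapping defines no __setitem__ or __delitem__, attempts to assign or delete through the view raise TypeError.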

dump with another wrapper than the one used for caching

I am trying to save the resulting chest into a different file format than the one used for caching.

A practical example: I use pickle when I initialize my chest because I am using nested dicts and sets, which I guess would be mangled if stored as JSON.

But in the end, I would like to dump my chest into json for export.

Is there a way to do that?

For example, this does not work past the first level:

import ujson as json
with open('db.json', 'w') as f:
    f.write(json.dumps(out, ensure_ascii=False, indent=4, sort_keys=True))

Prefetch opens too many files

For a reasonably large chest c, c.prefetch(list(c.keys())) will cause OSError: [Errno 24] Too many open files because open_many opens everything, consumes everything, and then closes everything.

Simple solution: chunk up prefetches into open_many's of some maximum size.
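The chunking fix can be sketched in a few lines; `chunked` below is a hypothetical helper, and `c.prefetch` is the existing chest method:

```python
def chunked(seq, size):
    # Yield consecutive slices of at most `size` items.
    for i in range(0, len(seq), size):
        yield seq[i:i + size]

# Usage sketch against a chest `c` (not run here), bounding
# the number of simultaneously open files at 512:
# for batch in chunked(list(c.keys()), 512):
#     c.prefetch(batch)
```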
