
python-diskcache's Introduction

DiskCache: Disk Backed Cache

DiskCache is an Apache2 licensed disk and file backed cache library, written in pure-Python, and compatible with Django.

The cloud-based computing of 2023 puts a premium on memory. Gigabytes of empty space are left on disks as processes vie for memory. Among these processes is Memcached (and sometimes Redis), which is used as a cache. Wouldn't it be nice to leverage empty disk space for caching?

Django is Python's most popular web framework and ships with several caching backends. Unfortunately the file-based cache in Django is essentially broken. The culling method is random and large caches repeatedly scan a cache directory which slows linearly with growth. Can you really allow it to take sixty milliseconds to store a key in a cache with a thousand items?

In Python, we can do better. And we can do it in pure-Python!

In [1]: import pylibmc
In [2]: client = pylibmc.Client(['127.0.0.1'], binary=True)
In [3]: client[b'key'] = b'value'
In [4]: %timeit client[b'key']

10000 loops, best of 3: 25.4 µs per loop

In [5]: import diskcache as dc
In [6]: cache = dc.Cache('tmp')
In [7]: cache[b'key'] = b'value'
In [8]: %timeit cache[b'key']

100000 loops, best of 3: 11.8 µs per loop

Note: Micro-benchmarks have their place but are not a substitute for real measurements. DiskCache offers cache benchmarks to defend its performance claims. Micro-optimizations are avoided but your mileage may vary.

DiskCache efficiently makes gigabytes of storage space available for caching. By leveraging rock-solid database libraries and memory-mapped files, cache performance can match and exceed industry-standard solutions. There's no need for a C compiler or running another process. Performance is a feature and testing has 100% coverage with unit tests and hours of stress.

Testimonials

Daren Hasenkamp, Founder --

"It's a useful, simple API, just like I love about Redis. It has reduced the amount of queries hitting my Elasticsearch cluster by over 25% for a website that gets over a million users/day (100+ hits/second)."

Mathias Petermann, Senior Linux System Engineer --

"I implemented it into a wrapper for our Ansible lookup modules and we were able to speed up some Ansible runs by almost 3 times. DiskCache is saving us a ton of time."

Does your company or website use DiskCache? Send us a message and let us know.

Features

  • Pure-Python
  • Fully Documented
  • Benchmark comparisons (alternatives, Django cache backends)
  • 100% test coverage
  • Hours of stress testing
  • Performance matters
  • Django compatible API
  • Thread-safe and process-safe
  • Supports multiple eviction policies (LRU and LFU included)
  • Keys support "tag" metadata and eviction
  • Developed on Python 3.10
  • Tested on CPython 3.6, 3.7, 3.8, 3.9, 3.10
  • Tested on Linux, Mac OS X, and Windows
  • Tested using GitHub Actions


Quickstart

Installing DiskCache is simple with pip:

$ pip install diskcache

You can access documentation in the interpreter with Python's built-in help function:

>>> import diskcache
>>> help(diskcache)                             # doctest: +SKIP

The core of DiskCache is three data types intended for caching. Cache objects manage a SQLite database and filesystem directory to store key and value pairs. FanoutCache provides a sharding layer to utilize multiple caches and DjangoCache integrates that with Django:

>>> from diskcache import Cache, FanoutCache, DjangoCache
>>> help(Cache)                                 # doctest: +SKIP
>>> help(FanoutCache)                           # doctest: +SKIP
>>> help(DjangoCache)                           # doctest: +SKIP
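
For example, a minimal round trip with Cache looks like this (the directory path below is only an illustration):

>>> cache = Cache('/tmp/example-cache')
>>> cache.set('greeting', 'hello', expire=60)   # seconds until expiration
True
>>> cache.get('greeting')
'hello'
>>> cache.close()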

Built atop the caching data types are Deque and Index, which work as cross-process, persistent replacements for Python's collections.deque and dict. These implement the sequence and mapping container base classes:

>>> from diskcache import Deque, Index
>>> help(Deque)                                 # doctest: +SKIP
>>> help(Index)                                 # doctest: +SKIP

Finally, a number of recipes for cross-process synchronization are provided using an underlying cache. Features like memoization with cache stampede prevention, cross-process locking, and cross-process throttling are available:

>>> from diskcache import memoize_stampede, Lock, throttle
>>> help(memoize_stampede)                      # doctest: +SKIP
>>> help(Lock)                                  # doctest: +SKIP
>>> help(throttle)                              # doctest: +SKIP
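
For instance, a hedged sketch of the memoization and lock recipes (treat the argument names as approximate):

>>> cache = Cache('/tmp/example-recipes')
>>> @memoize_stampede(cache, expire=60)
... def heavy_computation(num):
...     return num * num
>>> with Lock(cache, 'report-lock'):
...     pass                        # only one process at a time runs this block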

Python's docstrings are a quick way to get started but not intended as a replacement for the DiskCache Tutorial and DiskCache API Reference.

User Guide

For those wanting more details, this part of the documentation covers the tutorial, benchmarks, API reference, and development.

Comparisons

Comparisons to popular projects related to DiskCache.

Key-Value Stores

DiskCache is mostly a simple key-value store. Feature comparisons with four other projects are shown in the tables below.

  • dbm is part of Python's standard library and implements a generic interface to variants of the DBM database — dbm.gnu or dbm.ndbm. If none of these modules is installed, the slow-but-simple dbm.dumb is used.
  • shelve is part of Python's standard library and implements a “shelf” as a persistent, dictionary-like object. The difference with “dbm” databases is that the values can be anything that the pickle module can handle.
  • sqlitedict is a lightweight wrapper around Python's sqlite3 database with a simple, Pythonic dict-like interface and support for multi-thread access. Keys are arbitrary strings, values arbitrary pickle-able objects.
  • pickleDB is a lightweight and simple key-value store. It is built upon Python's simplejson module and was inspired by Redis. It is licensed with the BSD three-clause license.

Features

Feature           diskcache      dbm      shelve   sqlitedict    pickleDB
Atomic?           Always         Maybe    Maybe    Maybe         No
Persistent?       Yes            Yes      Yes      Yes           Yes
Thread-safe?      Yes            No       No       Yes           No
Process-safe?     Yes            No       No       Maybe         No
Backend?          SQLite         DBM      DBM      SQLite        File
Serialization?    Customizable   None     Pickle   Customizable  JSON
Data Types?       Mapping/Deque  Mapping  Mapping  Mapping       Mapping
Ordering?         Insert/Sorted  None     None     None          None
Eviction?         LRU/LFU/more   None     None     None          None
Vacuum?           Automatic      Maybe    Maybe    Manual        Automatic
Transactions?     Yes            No       No       Maybe         No
Multiprocessing?  Yes            No       No       No            No
Forkable?         Yes            No       No       No            No
Metadata?         Yes            No       No       No            No

Quality

Project        diskcache      dbm      shelve   sqlitedict  pickleDB
Tests?         Yes            Yes      Yes      Yes         Yes
Coverage?      Yes            Yes      Yes      Yes         No
Stress?        Yes            No       No       No          No
CI Tests?      Linux/Windows  Yes      Yes      Linux       No
Python?        2/3/PyPy       All      All      2/3         2/3
License?       Apache2        Python   Python   Apache2     3-Clause BSD
Docs?          Extensive      Summary  Summary  Readme      Summary
Benchmarks?    Yes            No       No       No          No
Sources?       GitHub         GitHub   GitHub   GitHub      GitHub
Pure-Python?   Yes            Yes      Yes      Yes         Yes
Server?        No             No       No       No          No
Integrations?  Django         None     None     None        None

Timings

These are rough measurements. See DiskCache Cache Benchmarks for more rigorous data.

Project   diskcache  dbm     shelve  sqlitedict  pickleDB
get       25 µs      36 µs   41 µs   513 µs      92 µs
set       198 µs     900 µs  928 µs  697 µs      1,020 µs
delete    248 µs     740 µs  702 µs  1,717 µs    1,020 µs

Caching Libraries

  • joblib.Memory provides caching functions and works by explicitly saving the inputs and outputs to files. It is designed to work with non-hashable and potentially large input and output data types such as numpy arrays.
  • klepto extends Python’s lru_cache to utilize different keymaps and alternate caching algorithms, such as lfu_cache and mru_cache. Klepto uses a simple dictionary-style interface for all caches and archives.

Data Structures

  • dict is a mapping object that maps hashable keys to arbitrary values. Mappings are mutable objects. There is currently only one standard Python mapping type, the dictionary.
  • pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive.
  • Sorted Containers is an Apache2 licensed sorted collections library, written in pure-Python, and fast as C-extensions. Sorted Containers implements sorted list, sorted dictionary, and sorted set data types.

Pure-Python Databases

  • ZODB supports an isomorphic interface for database operations, which means there's little impact on your code to make objects persistent and there's no database mapper that partially hides the database.
  • CodernityDB is an open source, pure-Python, multi-platform, schema-less, NoSQL database and includes an HTTP server version, and a Python client library that aims to be 100% compatible with the embedded version.
  • TinyDB is a tiny, document oriented database optimized for your happiness. If you need a simple database with a clean API that just works without lots of configuration, TinyDB might be the right choice for you.

Object Relational Mappings (ORM)

  • Django ORM provides models that are the single, definitive source of information about data and contains the essential fields and behaviors of the stored data. Generally, each model maps to a single SQL database table.
  • SQLAlchemy is the Python SQL toolkit and Object Relational Mapper that gives application developers the full power and flexibility of SQL. It provides a full suite of well known enterprise-level persistence patterns.
  • Peewee is a simple and small ORM. It has few (but expressive) concepts, making it easy to learn and intuitive to use. Peewee supports Sqlite, MySQL, and PostgreSQL with tons of extensions.
  • SQLObject is a popular Object Relational Manager for providing an object interface to your database, with tables as classes, rows as instances, and columns as attributes.
  • Pony ORM is a Python ORM with beautiful query syntax. Use Python syntax for interacting with the database. Pony translates such queries into SQL and executes them in the database in the most efficient way.

SQL Databases

  • SQLite is part of Python's standard library and provides a lightweight disk-based database that doesn’t require a separate server process and allows accessing the database using a nonstandard variant of the SQL query language.
  • MySQL is one of the world’s most popular open source databases and has become a leading database choice for web-based applications. MySQL includes a standardized database driver for Python platforms and development.
  • PostgreSQL is a powerful, open source object-relational database system with over 30 years of active development. Psycopg is the most popular PostgreSQL adapter for the Python programming language.
  • Oracle DB is a relational database management system (RDBMS) from the Oracle Corporation. Originally developed in 1977, Oracle DB is one of the most trusted and widely used enterprise relational database engines.
  • Microsoft SQL Server is a relational database management system developed by Microsoft. As a database server, it stores and retrieves data as requested by other software applications.

Other Databases

  • Memcached is free and open source, high-performance, distributed memory object caching system, generic in nature, but intended for use in speeding up dynamic web applications by alleviating database load.
  • Redis is an open source, in-memory data structure store, used as a database, cache and message broker. It supports data structures such as strings, hashes, lists, sets, sorted sets with range queries, and more.
  • MongoDB is a cross-platform document-oriented database program. Classified as a NoSQL database program, MongoDB uses JSON-like documents with schema. PyMongo is the recommended way to work with MongoDB from Python.
  • LMDB is a lightning-fast, memory-mapped database. With memory-mapped files, it has the read performance of a pure in-memory database while retaining the persistence of standard disk-based databases.
  • BerkeleyDB is a software library intended to provide a high-performance embedded database for key/value data. Berkeley DB is a programmatic toolkit that provides built-in database support for desktop and server applications.
  • LevelDB is a fast key-value storage library written at Google that provides an ordered mapping from string keys to string values. Data is stored sorted by key and users can provide a custom comparison function.

Reference

License

Copyright 2016-2023 Grant Jenks

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

python-diskcache's People

Contributors

abhinavomprakash, artiom, bungoume, cologler, ddorian, elistevens, grantjenks, i404788, joakimnordling, jugmac00, matkoniecz, maxking, mayli, mgorny, michaelkuty, nicholasbishop, pombredanne, raratiru, rkubik, tamirok, thedrow, ypid, zsimic


python-diskcache's Issues

Promote Pickle Protocol to Cache Setting with "disk" Prefix

This is likely a pickle/cPickle bug in Python 2.7 when using HIGHEST_PROTOCOL, which is the diskcache default. Found while working on nexB/scancode-toolkit#267
You can see these two tests in action:

The simple workaround is to avoid using HIGHEST_PROTOCOL. I guess this can be done by passing a custom Disk subclass forced to use no specific protocol.

OSError on cache.delete [Errno 13] Permission denied

File "/home/grantj/repos/xxx/env27/lib/python2.7/site-packages/diskcache/djangocache.py", line 77, in delete
    self._cache.delete(key)
  File "/home/grantj/repos/xxx/env27/lib/python2.7/site-packages/diskcache/fanout.py", line 123, in delete
    return self.__delitem__(key)
  File "/home/grantj/repos/xxx/env27/lib/python2.7/site-packages/diskcache/fanout.py", line 111, in __delitem__
    return self._shards[index].__delitem__(key)
  File "/home/grantj/repos/xxx/env27/lib/python2.7/site-packages/diskcache/core.py", line 737, in __delitem__
    return self._delete(rowid, version, filename)
  File "/home/grantj/repos/xxx/env27/lib/python2.7/site-packages/diskcache/core.py", line 763, in _delete
    self._remove(filename)
  File "/home/grantj/repos/xxx/env27/lib/python2.7/site-packages/diskcache/core.py", line 790, in _remove
    os.remove(full_path)

OSError: [Errno 13] Permission denied: '/tmp/xxx-cache/003/e6/be/87b78c58d79c1a502b05ce645db8.val'

This is an odd error. On one particular server I get transient [Errno 13] Permission denied errors.

UserWarning: file not found

Occurred today in Travis at 1.6.3 on both CPython 3.4 and PyPy.

======================================================================
ERROR: Stress test multiple threads and processes.
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/travis/virtualenv/pypy-2.5.0/site-packages/nose/case.py", line 197, in runTest
    self.test(*self.arg)
  File "/home/travis/build/grantjenks/python-diskcache/tests/stress_test_fanout.py", line 280, in stress_test_mp
    stress_test()
  File "/home/travis/build/grantjenks/python-diskcache/tests/stress_test_fanout.py", line 244, in stress_test
    cache.check()
  File "/home/travis/build/grantjenks/python-diskcache/diskcache/fanout.py", line 135, in check
    return sum((shard.check(fix=fix) for shard in self._shards), [])
  File "/home/travis/build/grantjenks/python-diskcache/diskcache/fanout.py", line 135, in <genexpr>
    return sum((shard.check(fix=fix) for shard in self._shards), [])
  File "/home/travis/build/grantjenks/python-diskcache/diskcache/core.py", line 873, in check
    warnings.warn('file not found: %s' % full_path)
UserWarning: file not found: tmp/000/90/06/d93b748a5ebea1fee3fc3fe441bf.val

This is worrisome. I don't yet understand how a file was lost.

Bump to v3?

Upcoming features:

  • Issue #51 fixed
  • Issue #52 add cull() method
  • Issue #54 need to optimize pickle
  • Issue #58 need tests for dir-copy backup
  • Pull request #61
  • Issue #59 need support for new-style .format strings when culling

'evict' Should Not Create an Index

'evict' method should not create an index. Instead:

  • Create the index during initialization.
  • Provide create_tag_index method.
  • Provide drop_tag_index method.

Why is culling slow for Django core's filebased cache?

First of all, thanks for releasing this! Super cool library. I was just reading about your benchmarks and wondering why culling (glob) is needed, and why it's slow. The reason you "cull" is to check whether you need to remove some old keys to make space for a new set(), is that right?

Would it be possible to keep track of old file state somehow instead of having to keep globbing? Or to make many smaller folders so the sets are cheaper? I'm not sure I understand how all this works, and I'm sure you've thought of this; I'm just curious why it's not the way to go. Thanks for any explanation. I hope to use this on some fun projects soon :)

Some random crash with a fanout cache

I get this error, but it's kinda random. It only occurs on Python 2.7.3 on Linux/Ubuntu 12.04.

You can see the code calling it here:
https://github.com/nexB/scancode-toolkit/blob/0030a717f8f202b24aa6943f4d62d66a627fd047/src/licensedcode/cache.py#L148

[...]
 File "/home/pombreda/scancode-toolkit/src/licensedcode/cache.py", line 171, in get
   cached = self.cache.get(cache_key)
 File "/home/pombreda/scancode-toolkit/local/lib/python2.7/site-packages/diskcache/fanout.py", line 84, in get
   default=default, read=read, expire_time=expire_time, tag=tag,
 File "/home/pombreda/scancode-toolkit/local/lib/python2.7/site-packages/diskcache/core.py", line 664, in get
   value = self._disk.fetch(self._dir, mode, filename, db_value, read)
 File "/home/pombreda/scancode-toolkit/local/lib/python2.7/site-packages/diskcache/core.py", line 245, in fetch
   return pickle.load(io.BytesIO(value))
TypeError: 'buffer' does not have the buffer interface

I am using 1.6.6
https://github.com/nexB/scancode-toolkit/blob/0030a717f8f202b24aa6943f4d62d66a627fd047/thirdparty/prod/diskcache.ABOUT

How do I check that adding a bool False twice fails?

In [119]: cache.add(1,True)
Out[119]: True

In [120]: cache.add(1,True)
Out[120]: False

In [121]: cache.add(1,False)
Out[121]: False

In [122]: cache.add(1,False)
Out[122]: False

This API choice reminds me of C. Why not return -1 for the sake of the good old times? An exception would have been a much better choice.

Atomic Add Method

Include atomic "add" method, like "set" but abort if already exists.

List keys?

Is there a builtin way to list keys in the cache?

Right now I'm using

cache._sql('SELECT key FROM Cache').fetchall()
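
Later releases made Cache objects iterable over their keys, so a private SQL query should no longer be necessary; a hedged sketch:

for key in cache:    # __iter__ yields keys without loading values
    print(key)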

Add max_age parameter to .get()

Hey,

Would you accept a pull request that adds a max_age parameter to .get(), so that it returns a default value if the cached value is too old?

I guess something like this is not possible yet, or is it?

Improving Behavior When Using `read=True` ?

A couple things I've noticed in programming:

  1. If I store bytes in the cache, it may choose to store the result in the database (if it's small enough). When I read those bytes back, I may request read=True and depending on the size of the value, I may get back a bytes object or a file handle. It seems preferable to always receive a file-like object. This is not possible for integer and floating point types but I think that's ok.
  2. If I try to retrieve a value with read=True, then often I would prefer to raise KeyError rather than return the default. So instead of using this pattern:
try:
    with cache.get('key', read=True) as reader:
        filename = reader.name
except AttributeError: # `None` does not implement the context manager protocol.
    filename = None

I would prefer this pattern:

try:
    with cache.get('key', read=True) as reader:
        filename = reader.name
except KeyError: # Clearly the key was not found in the cache.
    filename = None

But this changes the semantics of get when read=True. Should read=True always return a file-like object? If it used StringIO or BytesIO to do so, then the code above would break on small values because there is no filename (unless it was explicitly stored using read=True).

I'm starting to regret the read=True extension to get and set methods. Maybe the names read and write would have been better. Something like:

cache.write('key', file_like_object)

And:

cache.read('key') # Return file handle

I'm interested in feedback from others. If you use this functionality, please contribute your opinion.

Add "throttle" Method for Rate Limiting

Pseudocode:

import time

def rate_limit(key, count, seconds, memo={}):
    """Rate limit the resource identified by key to count per seconds.

    >>> while True:
    ...     wait = next(rate_limit('some key', 5, 8))
    ...     if not wait:
    ...         break
    ...     else:
    ...         time.sleep(wait)
    >>> do_rate_limited_work()

    """
    if (key, count, seconds) in memo:
        return memo[key, count, seconds]

    assert isinstance(count, int) and count > 0

    def meter():
        rate = float(count) / seconds
        last = time.time()
        tally = count

        while True:
            now = time.time()
            extent = now - last
            last = now
            tally += extent * rate

            if tally > count:
                tally = count
            elif tally < 1:
                yield (1 - tally) / rate
            else:
                tally -= 1
                yield 0

    limiter = meter()

    memo[key, count, seconds] = limiter
    return limiter

Need to track last and tally in Cache. Might be better to build as a separate object. The generator pattern is not that useful. But the algorithm is tested and correct.
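
For reference, the recipe that eventually shipped is exposed as a decorator; the sketch below is hedged and the exact signature should be checked against the recipes documentation:

import diskcache as dc

cache = dc.Cache('/tmp/throttle-demo')   # illustrative path

@dc.throttle(cache, 5, 8)                # at most 5 calls per 8-second window
def do_rate_limited_work():
    ...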

Tests needed for incremental backup using rsync

It would be nice if incremental backup behavior were guaranteed by tests to always work.

Edit: I would like to have there be tests that check that using rsync to back up a cache dir will keep the backup up-to-date with all new content in the cache, and that only new content is synched (so that if 99% of the content is unchanged, it will only be scanned locally by rsync, and not redundantly copied).

If the backed-up copy of the cache dir is then used, it should behave identically to the original.

For our use cases we can pause all readers and writers during the course of the backup, so we're not concerned about race conditions.

Add Windows CI

... for instance with AppVeyor. Windows multi-processing and locking are a tad different... and that is an understatement.

Set default expire time

Hi

Is there a way to set the default expire time to be used in .set methods for diskcache.Cache objects?

Thanks
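
As far as I know there is no built-in default, but one hedged workaround is a thin subclass; the class name and constant below are illustrative, not part of DiskCache:

from diskcache import Cache

class DefaultExpireCache(Cache):
    """Cache whose set() applies a default expire when none is given."""

    DEFAULT_EXPIRE = 3600  # seconds; illustrative value

    def set(self, key, value, expire=None, **kwargs):
        if expire is None:
            expire = self.DEFAULT_EXPIRE
        return super().set(key, value, expire=expire, **kwargs)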

UserWarning: row partially committed

This morning in Travis:

======================================================================
ERROR: Stress test multiple threads and processes.
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/travis/virtualenv/python3.4.2/lib/python3.4/site-packages/nose/case.py", line 198, in runTest
    self.test(*self.arg)
  File "/home/travis/build/grantjenks/python-diskcache/tests/stress_test_fanout.py", line 280, in stress_test_mp
    stress_test()
  File "/home/travis/build/grantjenks/python-diskcache/tests/stress_test_fanout.py", line 244, in stress_test
    cache.check()
  File "/home/travis/build/grantjenks/python-diskcache/diskcache/fanout.py", line 135, in check
    return sum((shard.check(fix=fix) for shard in self._shards), [])
  File "/home/travis/build/grantjenks/python-diskcache/diskcache/fanout.py", line 135, in <genexpr>
    return sum((shard.check(fix=fix) for shard in self._shards), [])
  File "/home/travis/build/grantjenks/python-diskcache/diskcache/core.py", line 841, in check
    rowid, self._disk.get(key, raw)
UserWarning: row 10 partially commited with key (0.8579395219751248, 0.8579395219751248, 0.8579395219751248, 0.8579395219751248, 0.8579395219751248, 0.8579395219751248, 0.8579395219751248, 0.8579395219751248, 0.8579395219751248, 0.8579395219751248, 0.8579395219751248)
----------------------------------------------------------------------

I'm not sure how that happened. What would cause two threads/processes to abort the same set operation in DiskCache? Maybe SQLite would do so?

Document Fanout-Cache Partial Success Quirkiness

FanoutCache leverages the connection timeout, and the "set" and "add" methods perform eviction after their write. So the write may succeed and then the eviction may time out. If the eviction times out, FanoutCache reports False for an unsuccessful operation even though the write occurred successfully.

This odd behavior is worth documenting.

Add Push/Pull for Message Queue

API pseudocode:

def push(channel, value, expire=None, read=False, tag=None):
    # return stored key
    ...
def pull(channel, default=None, expire_time=False, tag=False):
    # does not support "read" because (key, value) is deleted
    # return value
    ...
  • Queue support is problematic because it's slow. DiskCache does best when the ratio of reads to writes is 10:1 or greater. Queues are closer to 1:1 without backpressure.
  • TODO: Support channel length?
  • TODO: Support channel iteration?
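
For reference, a sketch of the push/pull API along these lines as it later landed on Cache; treat the exact signatures and return values as approximate:

from diskcache import Cache

cache = Cache('/tmp/queue-demo')       # illustrative path
key = cache.push('first task')         # append to the back; returns the generated key
cache.push('second task')
key, value = cache.pull()              # pop from the front; returns a (key, value) pair
assert value == 'first task'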

Readme file is not pure ascii

Which leads to errors like this:

➜  sv-tools  nix-shell                                                                         ~/dev/sv-tools
these derivations will be built:
  /nix/store/ilvmx3q5bfjqmhc1sarzrlwvbvfc4rly-python-diskcache-v1.6.6-src.drv
  /nix/store/8lmqbi0j8xc2xhvgvl3xqldixgibsg5x-python3.5-diskcache-1.6.6.drv
building path(s) ‘/nix/store/999v4wbl7ycvg3355p66sfnjjj01k2rs-python-diskcache-v1.6.6-src’

trying https://github.com/grantjenks/python-diskcache/archive/v1.6.6.tar.gz
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   135    0   135    0     0    258      0 --:--:-- --:--:-- --:--:--   258
100  406k  100  406k    0     0   262k      0  0:00:01  0:00:01 --:--:-- 1278k
unpacking source archive /tmp/nix-build-python-diskcache-v1.6.6-src.drv-0/v1.6.6.tar.gz
building path(s) ‘/nix/store/cxsbysaqka3312a5wlmpwwdxn8l1zsc4-python3.5-diskcache-1.6.6’
unpacking sources
unpacking source archive /nix/store/999v4wbl7ycvg3355p66sfnjjj01k2rs-python-diskcache-v1.6.6-src
source root is python-diskcache-v1.6.6-src
setting SOURCE_DATE_EPOCH to timestamp 315619200 of file python-diskcache-v1.6.6-src/tox.ini
patching sources
configuring
building
Traceback (most recent call last):
  File "nix_run_setup.py", line 6, in <module>
    exec(compile(getattr(tokenize, 'open', open)(__file__).read().replace('\\r\\n', '\\n'), __file__, 'exec'))
  File "setup.py", line 19, in <module>
    readme = reader.read()
  File "/nix/store/w9gwivwr7ih57b4y1dfhv64xy316p5qn-python3-3.5.1/lib/python3.5/encodings/ascii.py", line 26, in decode
    return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 1068: ordinal not in range(128)
builder for ‘/nix/store/8lmqbi0j8xc2xhvgvl3xqldixgibsg5x-python3.5-diskcache-1.6.6.drv’ failed with exit code 1
error: build of ‘/nix/store/8lmqbi0j8xc2xhvgvl3xqldixgibsg5x-python3.5-diskcache-1.6.6.drv’ failed
/run/current-system/sw/bin/nix-shell: failed to build all dependencies

A quick fix would be to set the encoding to utf-8 on open.

Support Pickling

Pickling is tricky because a file handle is maintained to the database. But ultimately that's just a filename and a sqlite3 connect call. Seems possible that dunder getstate and setstate could be used to make Cache objects pickle-able.
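
A hedged sketch of how the dunder methods could work; the subclass name is made up and this is illustrative rather than the eventual implementation:

from diskcache import Cache

class PicklableCache(Cache):
    def __getstate__(self):
        return self.directory          # only the path needs to travel

    def __setstate__(self, directory):
        self.__init__(directory)       # reopen the SQLite connection locally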

Increase Large Value Threshold

Increase large value threshold based on https://www.sqlite.org/intern-v-extern-blob.html

Also worth setting the page_size explicitly. Locally the default was 4K but in production the default was 1K. I think 4K sounds good enough.

I would also guess that our use case will prefer larger blobs because reading a file requires first reading the filename from the database. It's probably worth creating a benchmark for testing.

Optimize Pickle Serialization Format

Using diskcache 2.9.0, via python 2.7 on Mac:

I am having problems using a tuple of strings as a cache key. Once the key has been placed into the cache and is subsequently retrieved from the cache via the key iterator, it apparently no longer matches the pickled form of the key when it was first inserted, leading to the value not being found when looked up in sqlite. I have tried changing the pickle format version to no avail. Using a list instead of a tuple works properly. Here is code to demonstrate the problem:

from diskcache import Cache, Index
import shutil
import tempfile

key_part_0 = u"part0"
key_part_1 = u"part1"
to_test = [
    (key_part_0, key_part_1),
    [key_part_0, key_part_1]
]
for key in to_test:
    tmpdir = tempfile.mkdtemp(prefix="discache_test.")
    try:
        dc = Index.fromcache(Cache(tmpdir, eviction_policy='none', size_limit=2**32)) #4GB
        dc[key] = {
            "example0": ["value0"]
        }
        diskcache_key = dc.keys()[0]
        print "Keys equal: %s" % (diskcache_key==key)
        print "Value using original key: %s" % dc[key]
        try:
            print "Value using retrieved key: %s" % dc[diskcache_key]
        except KeyError as e:
            print "Could not find value using retrieved key"
    finally:
        shutil.rmtree(tmpdir, True)

And the output:

Keys equal: True
Value using original key: {'example0': ['value0']}
Could not find value using retrieved key
Keys equal: True
Value using original key: {'example0': ['value0']}
Value using retrieved key: {'example0': ['value0']}

Thanks for any assistance you can provide.

Support get_many, set_many, and delete_many Cache operations

Helps a lot with lots of small values. My use case:

  • batch queries to API
  • caching calls is inefficient since sets of queries vary
  • caching by each query separately makes sense

So it looks like:

def _mygene_fetch(queries, specie):
    # Read cache
    cache = diskcache.Cache('/tmp/mygene-cache/' + specie, timeout=CACHE_TIMEOUT)
    res = compact({q: cache.get(q) for q in queries})

    queries = set(queries) - set(res)
    if queries:
        data = api.querymany(queries, species=specie)
        new = {item['query']: (item['entrezgene'], item['symbol'])
               for item in data}
        res.update(new)
        # Cache results
        for k, v in new.iteritems():
            cache[k] = v

    return res

And it's slow.
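
Until batch methods exist, one hedged workaround is to group the small writes in a single transaction so SQLite commits once rather than per key; this assumes a DiskCache version with the transact() context manager:

with cache.transact():
    for k, v in new.items():   # .iteritems() on Python 2
        cache[k] = v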

OperationalError: database schema has changed

I lost the stack trace but once on Travis-CI, I saw an OperationalError: database schema has changed bug. It was triggered during stress at "CREATE TABLE IF NOT EXISTS Settings ..."

Some research indicated that perhaps a separate cache process had changed the underlying schema and interacted badly with a statement cache. I'm not sure how that works.

This Issue is open as a reminder. If anyone sees something similar, would you add notes here?

Cache.expire() should cull items to respect size_limit

Grant,
I couldn't find a way to contact you regarding a question about DiskCache, except to file an issue.
I do not know how to cull items from the cache if the cache size is larger than the size_limit.

If I fill the cache to something larger than the size_limit, the cache does not report any items as expired, so calling the expire method does nothing. How is the size_limit parameter supposed to work?
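
The cull() method tracked in the v3 roadmap above is the intended answer here; a hedged sketch of its use once available:

from diskcache import Cache

cache = Cache('/tmp/cull-demo', size_limit=2 ** 30)   # 1 GiB; illustrative path
# ... writes that push the cache volume past size_limit ...
count = cache.cull()   # remove expired items, then evict until volume is under size_limit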

Document Hash protocol is not used

Though DiskCache has a dictionary-like interface, the hash protocol is not used. Neither the __hash__ nor __eq__ methods are used for lookups. Instead it depends on the serialization method. For strings, bytes, integers, and floats it works the same. But large integers and other Python objects will be pickled. The bytes representation of pickling will be used for equality. Hashing is not used at all.

Pickling error, args have wrong class

We created a simple transaction class in memory which worked great but it consumes too much heap memory so we switched to diskcache 2.0.2 with Python 3.5. In most cases, diskcache works as expected. However, on occasion we get a pickling exception. Is this a bug? Is there a workaround? Here is the trace, notice the first relationship successfully pushes to the cache, the second one fails:

2016-09-24 12:03:42,371 INFO chemGraph: relationship(108494) Article-id-chem-article-10651284-[chemical]-Chemical-id-chem-chemical-thromboplastin
2016-09-24 12:03:42,461 INFO chemGraph: relationship(108495) Article-id-chem-article-10651284-[mesh]-MESH-id-umls-d008958
ERROR:root:transaction cache failed: (f0e044f)-[:mesh {id:"chem-article-10651284-umls-d008958",importDate:"2016-09-24T16:03:42.461744Z",nameHash:"chem-article-10651284-umls-d008958",source:"http://www.chemicals-r-us.info/chem",type:"article-mesh",uuid:"6de965a7-1f40-5d65-ae5a-ee7ab9723bbf"}]->(a011ffb)
Traceback (most recent call last):
  File "/chem/Fusion/ChemGraph.py", line 240, in checkTransactionLimits
    self.transaction_cache[name]=node
  File "/usr/lib/python3.5/site-packages/diskcache/core.py", line 510, in set
    size, mode, filename, db_value = self._disk.store(value, read)
  File "/usr/lib/python3.5/site-packages/diskcache/core.py", line 196, in store
    result = pickle.dumps(value, protocol=self._protocol)
_pickle.PicklingError: args[0] from __newobj__ args has the wrong class

Here is our diskcache initialization:

  from diskcache import Cache
  self.transaction_cache=Cache(cachePath+'/transaction/'+index, size_limit=MaxCacheSize)

How to correctly iterate caches?

Hi, I'm building a small tool with DiskCache to calculate/store/compare image hashes. My current implementation uses a FanoutCache to memoize different hashes under different parameters, and the intention is to use the hashes for distance calculation.
The result would be sets of image paths with (closely) similar hashes (image deduplication).

I'm now wondering what the correct approach would be to get results back from the cache to run my similarity algorithm. As far as I understand the docs, I could retrieve an index from the current fanout cache and iterate all values, but working with the SQLite database directly might be more interesting because I would be able to express my algorithms in SQL.

Do you have any thoughts or possible preferred approach you could enlighten me with?

Thank you in advance!

DjangoCache Out of Disk Space Scenario

I had to do an emergency delete of the disk cache /var/tmp/django_disk_cache as my server had run out of disk space.

Ever since, I have been receiving Django errors:

Exception Type: ValueError at x Exception Value: Key ':1xxx' not found

Disabling DjangoCache is the only current fix. I've checked the Django DB and there's no corresponding table for DjangoCache. There's obviously some reference to these keys somewhere, but I can't find them. The docs mention a SQLite DB, but I've searched my installation and can't find it.

The help for DjangoCache also mentions a clear command, which I assume I'm meant to run in a Python shell, but I can't figure out how to run it and there are no examples.
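
For the clear command, a hedged example from a Django shell (python manage.py shell), using the standard Django cache API that DjangoCache plugs into:

from django.core.cache import cache   # the configured DjangoCache backend

cache.clear()                         # drops all keys in the disk cache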

Add Cache iteritems Method

The current Cache __iter__ only iterates over keys, and therefore iterating the whole cache requires getting each item one by one. I want a dict-like iteritems() method that efficiently iterates and yields (key, value) pairs instead.

copy cache folder of diskcache

For some applications, I would like to generate the cache offline (e.g. during the weekend, for weekday use) and copy the cache folder to another server/PC where the cache can be used.

Is there any limitation, or is it platform dependent (e.g. Windows → Linux)?

What are the size limits for diskcache? Number of keys, size of each individual value (is caching a 1 GB pandas DataFrame OK?)

memoize not implemented on Cache

import diskcache
cache = diskcache.Cache("tmp")
@cache.memoize
def fn(x):
    return x

AttributeError: 'Cache' object has no attribute 'memoize'

Is there a reason it is only implemented for FanoutCache?
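
Two notes, hedged: memoize is a decorator factory, so it must be called with parentheses, and later releases added it to Cache as well. The working pattern at the time looked like this:

import diskcache

cache = diskcache.FanoutCache("tmp")

@cache.memoize()           # note the parentheses: memoize() returns the decorator
def fn(x):
    return x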
