Comments (6)
Another thought: could separating out `read` and `write` have the added benefit of supporting unicode and bytes files?
from python-diskcache.
I decided to add the `read` method but not the `write` method. A quick survey showed that the different semantics of the `read` method are quite useful.
Committed at 204762d
I'm interested in feedback from others. If you use this functionality, please contribute your opinion.
My case is the following. I have ~100k blobs of the same type in the cache, and the entries have very different sizes because they are binary diffs between blobs (see below). 85% of the entries are smaller than 32 KiB, but a corner case is an entry as large as ~30 MiB, so I decided to use the `read=True` feature to process the blobs in a streaming way. I've found that `read=True` actually forces small objects to be stored as files and does not respect the nice `disk_min_file_size` setting:
$ find /srv/proj/diskcache/?? -type f -ls | awk '{print $7}' | stats
min 204
25.0% 215
50.0% 566
75.0% 17520
90.0% 43320
95.0% 55581
max 24473760
Then I saw this ticket and noticed that you use `reader.name` in some code. So my questions are the following:
Would you consider a patch that respects `disk_min_file_size` while using `read=True`, pre-reading `min_size` bytes from the stream and deciding whether the object should be stored internally or as an external file?
Should the patch add `name` and `mode` attributes to `BytesIO` in this case, to preserve some kind of compatibility? These two attributes are the only two that are present on `gzip.GzipFile`, on the objects returned by `open`, and on `tempfile.TemporaryFile` while being absent from `BytesIO`.
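For illustration, such compatibility could be provided with a thin `BytesIO` subclass; `NamedBytesIO` below is a hypothetical name, not an API of diskcache or the standard library.

```python
import io


class NamedBytesIO(io.BytesIO):
    """In-memory buffer carrying the `name` and `mode` attributes that
    real file objects (open, gzip.GzipFile, tempfile) expose.
    Hypothetical helper for illustration only."""

    def __init__(self, data=b'', name='<memory>', mode='rb'):
        super().__init__(data)
        self.name = name
        self.mode = mode


buf = NamedBytesIO(b'payload', name='blob-0001.bin')
```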
I'm afraid patching BytesIO objects wouldn't work for my use case. The motivation is described in the tutorial at the end of the DjangoCache section: http://www.grantjenks.com/docs/diskcache/tutorial.html#djangocache
When stored in files, the blobs can be served directly from the file system by a web server, which much improves the performance of serving megabyte-sized blobs. And I like the consistency. Could you maybe inherit from `Cache` to provide the semantics you prefer?
I don’t see what problem you’re trying to solve. What does it matter to your code whether the blobs are stored in the database or file system?
One more thought, have you looked at overriding the Disk object? That might be the right place to provide your optimization. You could read some chunk of the file handle and decide if it should be stored differently.
Ah, thanks! I missed the DjangoCache section. Now `read=True` forcing file-based storage makes more sense to me. I'm still unsure whether mixing `read=True` and `read=False` for the same key is possible and provides the expected result, but that's not a big deal anyway.
The problem I am solving is the ability to handle all my cached blobs in the same consistent way. I lean towards streaming handling because some of the blobs are quite big (~120 MiB decompressed).
However, most of the cached blobs are really small (75% under 20 KiB, 25% under 0.2 KiB). So I was surprised to see the `disk_min_file_size` optimization not being respected for small files stored with `read=True`, and I decided to find out whether it's a bug or a feature :-)
TBH, I don't think it matters much to me whether small files are stored as filesystem objects or SQLite BLOBs. I agree that overriding the Disk object looks like the right way to implement that under the intended `read=True` use case.
Thanks for the explanation!